Re: [PATCH v8 5/8] tracepoint: Make rcuidle tracepoint callers use SRCU

2018-06-01 Thread Mathieu Desnoyers
- On May 31, 2018, at 1:51 PM, Joel Fernandes, Google j...@joelfernandes.org wrote:

>> I find it odd to have a "return" in a macro that consists of a
>> do { } while (0). I'm tempted to replace "return" by "break" here,
>> to break the macro do/while (0) loop.
> 
> "return;" is also used from "if (!(cond))" above so I prefer be consistent
> and just use return than break as done above, but please let me know if you
> still object.

It's fine by me,

Thanks!

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
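
For context, a minimal standalone sketch of the return-vs-break point above
(illustrative only; the macro and function names below are made up and are not
kernel code): a "break" merely leaves the macro's do { } while (0) loop,
whereas a "return" leaves the function that expanded the macro, which is the
behavior the early exits in __DO_TRACE rely on.

#include <stdio.h>

#define EARLY_EXIT_RETURN(cond)				\
	do {						\
		if (!(cond))				\
			return;				\
		puts("return variant ran its body");	\
	} while (0)

#define EARLY_EXIT_BREAK(cond)				\
	do {						\
		if (!(cond))				\
			break;				\
		puts("break variant ran its body");	\
	} while (0)

static void caller(int cond)
{
	EARLY_EXIT_RETURN(cond);	/* when !cond, this returns from caller() */
	EARLY_EXIT_BREAK(cond);		/* when !cond, this only skips its own body */
	puts("after the macros");
}

int main(void)
{
	caller(0);	/* prints nothing: the return variant exits caller() early */
	caller(1);	/* prints all three lines */
	return 0;
}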


Re: [PATCH v8 5/8] tracepoint: Make rcuidle tracepoint callers use SRCU

2018-05-31 Thread Joel Fernandes
On Thu, May 31, 2018 at 02:50:41AM -0400, Mathieu Desnoyers wrote:
> - On May 30, 2018, at 2:04 AM, Joel Fernandes joe...@google.com wrote:
> 
> > From: "Joel Fernandes (Google)" 
> > 
> > In recent tests with IRQ on/off tracepoints, a large performance
> > overhead of ~10% was noticed when running hackbench. This was root-caused
> > to the calls to rcu_irq_enter_irqson and rcu_irq_exit_irqson from the
> > tracepoint code. Following a long discussion on the list [1] about this,
> > we concluded that SRCU is a better alternative for use during RCU idle.
> > Although it does involve extra barriers, it's lighter than the sched-RCU
> > version, which has to make additional RCU calls to notify RCU of entry
> > into RCU read-side sections from the idle path.
> > 
> > In this patch, we change the underlying implementation of the
> > trace_*_rcuidle API to use SRCU. This has been shown to improve
> > performance a lot for the high-frequency irq enable/disable tracepoints.
> > 
> > Test: Tested idle and preempt/irq tracepoints.
> > 
> > Here are some performance numbers:
> > 
> > With a run of the following 30 times on a single core x86 Qemu instance
> > with 1GB memory:
> > hackbench -g 4 -f 2 -l 3000
> > 
> > Completion times in seconds. CONFIG_PROVE_LOCKING=y.
> > 
> > No patches (without this series)
> > Mean: 3.048
> > Median: 3.025
> > Std Dev: 0.064
> > 
> > With Lockdep using irq tracepoints with RCU implementation:
> > Mean: 3.451   (-11.66 %)
> > Median: 3.447 (-12.22%)
> > Std Dev: 0.049
> > 
> > With Lockdep using irq tracepoints with SRCU implementation (this series):
> > Mean: 3.020   (I would consider the improvement against the "without
> >this series" case as just noise).
> > Median: 3.013
> > Std Dev: 0.033
> > 
> > [1] https://patchwork.kernel.org/patch/10344297/
> > 
> > Signed-off-by: Joel Fernandes (Google) 
> > ---
> > include/linux/tracepoint.h | 48 +++---
> > kernel/tracepoint.c| 15 +++-
> > 2 files changed, 54 insertions(+), 9 deletions(-)
> > 
> > diff --git a/include/linux/tracepoint.h b/include/linux/tracepoint.h
> > index c94f466d57ef..880794207921 100644
> > --- a/include/linux/tracepoint.h
> > +++ b/include/linux/tracepoint.h
> > @@ -15,6 +15,7 @@
> >  */
> > 
> > #include <linux/smp.h>
> > +#include <linux/srcu.h>
> > #include <linux/errno.h>
> > #include <linux/types.h>
> > #include <linux/cpumask.h>
> > @@ -33,6 +34,8 @@ struct trace_eval_map {
> > 
> > #define TRACEPOINT_DEFAULT_PRIO 10
> > 
> > +extern struct srcu_struct tracepoint_srcu;
> > +
> > extern int
> > tracepoint_probe_register(struct tracepoint *tp, void *probe, void *data);
> > extern int
> > @@ -75,10 +78,15 @@ int unregister_tracepoint_module_notifier(struct
> > notifier_block *nb)
> >  * probe unregistration and the end of module exit to make sure there is no
> >  * caller executing a probe when it is freed.
> >  */
> > +#ifdef CONFIG_TRACEPOINTS
> > static inline void tracepoint_synchronize_unregister(void)
> > {
> > +   synchronize_srcu(&tracepoint_srcu);
> > synchronize_sched();
> > }
> > +#else
> > +static inline void tracepoint_synchronize_unregister(void) { }
> 
> In order to add some consistency to the style in this header file,
> this empty function should look like:
> 
> static inline void tracepoint_synchronize_unregister(void)
> { }
> 
> (with newline before "{")

Ok, I changed it to the style you're proposing.

> > +#endif
> > 
> > #ifdef CONFIG_HAVE_SYSCALL_TRACEPOINTS
> > extern int syscall_regfunc(void);
> > @@ -129,18 +137,38 @@ extern void syscall_unregfunc(void);
> >  * as "(void *, void)". The DECLARE_TRACE_NOARGS() will pass in just
> >  * "void *data", where as the DECLARE_TRACE() will pass in "void *data, 
> > proto".
> >  */
> > -#define __DO_TRACE(tp, proto, args, cond, rcucheck)
> > \
> > +#define __DO_TRACE(tp, proto, args, cond, rcuidle) \
> > do {\
> > struct tracepoint_func *it_func_ptr;\
> > void *it_func;  \
> > void *__data;   \
> > +   int __maybe_unused idx = 0; \
> > \
> > if (!(cond))\
> > return; \
> > -   if (rcucheck)   \
> > -   rcu_irq_enter_irqson(); \
> > -   rcu_read_lock_sched_notrace();  \
> > -   it_func_ptr = rcu_dereference_sched((tp)->funcs);   \
> > +   \
> > +   /*  \
> > +* For rcuidle callers, use srcu since sched-rcu\
> > +* doesn't work from the idle path.   

Re: [PATCH v8 5/8] tracepoint: Make rcuidle tracepoint callers use SRCU

2018-05-30 Thread Mathieu Desnoyers
- On May 30, 2018, at 2:04 AM, Joel Fernandes joe...@google.com wrote:

> From: "Joel Fernandes (Google)" 
> 
> In recent tests with IRQ on/off tracepoints, a large performance
> overhead of ~10% was noticed when running hackbench. This was root-caused
> to the calls to rcu_irq_enter_irqson and rcu_irq_exit_irqson from the
> tracepoint code. Following a long discussion on the list [1] about this,
> we concluded that SRCU is a better alternative for use during RCU idle.
> Although it does involve extra barriers, it's lighter than the sched-RCU
> version, which has to make additional RCU calls to notify RCU of entry
> into RCU read-side sections from the idle path.
> 
> In this patch, we change the underlying implementation of the
> trace_*_rcuidle API to use SRCU. This has been shown to improve
> performance a lot for the high-frequency irq enable/disable tracepoints.
> 
> Test: Tested idle and preempt/irq tracepoints.
> 
> Here are some performance numbers:
> 
> With a run of the following 30 times on a single core x86 Qemu instance
> with 1GB memory:
> hackbench -g 4 -f 2 -l 3000
> 
> Completion times in seconds. CONFIG_PROVE_LOCKING=y.
> 
> No patches (without this series)
> Mean: 3.048
> Median: 3.025
> Std Dev: 0.064
> 
> With Lockdep using irq tracepoints with RCU implementation:
> Mean: 3.451   (-11.66 %)
> Median: 3.447 (-12.22%)
> Std Dev: 0.049
> 
> With Lockdep using irq tracepoints with SRCU implementation (this series):
> Mean: 3.020   (I would consider the improvement against the "without
>  this series" case as just noise).
> Median: 3.013
> Std Dev: 0.033
> 
> [1] https://patchwork.kernel.org/patch/10344297/
> 
> Signed-off-by: Joel Fernandes (Google) 
> ---
> include/linux/tracepoint.h | 48 +++---
> kernel/tracepoint.c| 15 +++-
> 2 files changed, 54 insertions(+), 9 deletions(-)
> 
> diff --git a/include/linux/tracepoint.h b/include/linux/tracepoint.h
> index c94f466d57ef..880794207921 100644
> --- a/include/linux/tracepoint.h
> +++ b/include/linux/tracepoint.h
> @@ -15,6 +15,7 @@
>  */
> 
> #include <linux/smp.h>
> +#include <linux/srcu.h>
> #include <linux/errno.h>
> #include <linux/types.h>
> #include <linux/cpumask.h>
> @@ -33,6 +34,8 @@ struct trace_eval_map {
> 
> #define TRACEPOINT_DEFAULT_PRIO   10
> 
> +extern struct srcu_struct tracepoint_srcu;
> +
> extern int
> tracepoint_probe_register(struct tracepoint *tp, void *probe, void *data);
> extern int
> @@ -75,10 +78,15 @@ int unregister_tracepoint_module_notifier(struct
> notifier_block *nb)
>  * probe unregistration and the end of module exit to make sure there is no
>  * caller executing a probe when it is freed.
>  */
> +#ifdef CONFIG_TRACEPOINTS
> static inline void tracepoint_synchronize_unregister(void)
> {
> + synchronize_srcu(&tracepoint_srcu);
>   synchronize_sched();
> }
> +#else
> +static inline void tracepoint_synchronize_unregister(void) { }

In order to add some consistency to the style in this header file,
this empty function should look like:

static inline void tracepoint_synchronize_unregister(void)
{ }

(with newline before "{")

> +#endif
> 
> #ifdef CONFIG_HAVE_SYSCALL_TRACEPOINTS
> extern int syscall_regfunc(void);
> @@ -129,18 +137,38 @@ extern void syscall_unregfunc(void);
>  * as "(void *, void)". The DECLARE_TRACE_NOARGS() will pass in just
>  * "void *data", where as the DECLARE_TRACE() will pass in "void *data, 
> proto".
>  */
> -#define __DO_TRACE(tp, proto, args, cond, rcucheck)  \
> +#define __DO_TRACE(tp, proto, args, cond, rcuidle)   \
>   do {\
>   struct tracepoint_func *it_func_ptr;\
>   void *it_func;  \
>   void *__data;   \
> + int __maybe_unused idx = 0; \
>   \
>   if (!(cond))\
>   return; \
> - if (rcucheck)   \
> - rcu_irq_enter_irqson(); \
> - rcu_read_lock_sched_notrace();  \
> - it_func_ptr = rcu_dereference_sched((tp)->funcs);   \
> + \
> + /*  \
> +  * For rcuidle callers, use srcu since sched-rcu\
> +  * doesn't work from the idle path. \
> +  */ \
> + if (rcuidle) {  \
> + if (in_nmi()) { \
> + WARN_ON_ONCE(1);\
> +  

[PATCH v8 5/8] tracepoint: Make rcuidle tracepoint callers use SRCU

2018-05-29 Thread Joel Fernandes
From: "Joel Fernandes (Google)" 

In recent tests with IRQ on/off tracepoints, a large performance
overhead of ~10% was noticed when running hackbench. This was root-caused
to the calls to rcu_irq_enter_irqson and rcu_irq_exit_irqson from the
tracepoint code. Following a long discussion on the list [1] about this,
we concluded that SRCU is a better alternative for use during RCU idle.
Although it does involve extra barriers, it's lighter than the sched-RCU
version, which has to make additional RCU calls to notify RCU of entry
into RCU read-side sections from the idle path.

In this patch, we change the underlying implementation of the
trace_*_rcuidle API to use SRCU. This has been shown to improve
performance a lot for the high-frequency irq enable/disable tracepoints.
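
As background, a minimal sketch of the SRCU read-side pattern this moves the
rcuidle tracepoints onto (illustrative only; the srcu domain and function
below are made-up names, not part of this patch):

#include <linux/srcu.h>

DEFINE_SRCU(example_srcu);

static void example_srcu_reader(void)
{
	int idx;

	/* The reader takes an index from srcu_read_lock() ... */
	idx = srcu_read_lock(&example_srcu);

	/* ... dereferences SRCU-protected pointers here ... */

	/* ... and hands the same index back when it is done. */
	srcu_read_unlock(&example_srcu, idx);
}

Unlike sched-RCU, this read-side section is usable from the idle path without
the rcu_irq_enter_irqson()/rcu_irq_exit_irqson() calls that caused the
overhead described above.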

Test: Tested idle and preempt/irq tracepoints.

Here are some performance numbers:

With a run of the following 30 times on a single core x86 Qemu instance
with 1GB memory:
hackbench -g 4 -f 2 -l 3000

Completion times in seconds. CONFIG_PROVE_LOCKING=y.

No patches (without this series)
Mean: 3.048
Median: 3.025
Std Dev: 0.064

With Lockdep using irq tracepoints with RCU implementation:
Mean: 3.451   (-11.66 %)
Median: 3.447 (-12.22%)
Std Dev: 0.049

With Lockdep using irq tracepoints with SRCU implementation (this series):
Mean: 3.020   (I would consider the improvement against the "without
   this series" case as just noise).
Median: 3.013
Std Dev: 0.033

[1] https://patchwork.kernel.org/patch/10344297/

Signed-off-by: Joel Fernandes (Google) 
---
 include/linux/tracepoint.h | 48 +++---
 kernel/tracepoint.c| 15 +++-
 2 files changed, 54 insertions(+), 9 deletions(-)

diff --git a/include/linux/tracepoint.h b/include/linux/tracepoint.h
index c94f466d57ef..880794207921 100644
--- a/include/linux/tracepoint.h
+++ b/include/linux/tracepoint.h
@@ -15,6 +15,7 @@
  */
 
 #include <linux/smp.h>
+#include <linux/srcu.h>
 #include <linux/errno.h>
 #include <linux/types.h>
 #include <linux/cpumask.h>
@@ -33,6 +34,8 @@ struct trace_eval_map {
 
 #define TRACEPOINT_DEFAULT_PRIO10
 
+extern struct srcu_struct tracepoint_srcu;
+
 extern int
 tracepoint_probe_register(struct tracepoint *tp, void *probe, void *data);
 extern int
@@ -75,10 +78,15 @@ int unregister_tracepoint_module_notifier(struct 
notifier_block *nb)
  * probe unregistration and the end of module exit to make sure there is no
  * caller executing a probe when it is freed.
  */
+#ifdef CONFIG_TRACEPOINTS
 static inline void tracepoint_synchronize_unregister(void)
 {
+   synchronize_srcu(&tracepoint_srcu);
synchronize_sched();
 }
+#else
+static inline void tracepoint_synchronize_unregister(void) { }
+#endif
 
 #ifdef CONFIG_HAVE_SYSCALL_TRACEPOINTS
 extern int syscall_regfunc(void);
@@ -129,18 +137,38 @@ extern void syscall_unregfunc(void);
  * as "(void *, void)". The DECLARE_TRACE_NOARGS() will pass in just
  * "void *data", where as the DECLARE_TRACE() will pass in "void *data, proto".
  */
-#define __DO_TRACE(tp, proto, args, cond, rcucheck)\
+#define __DO_TRACE(tp, proto, args, cond, rcuidle) \
do {\
struct tracepoint_func *it_func_ptr;\
void *it_func;  \
void *__data;   \
+   int __maybe_unused idx = 0; \
\
if (!(cond))\
return; \
-   if (rcucheck)   \
-   rcu_irq_enter_irqson(); \
-   rcu_read_lock_sched_notrace();  \
-   it_func_ptr = rcu_dereference_sched((tp)->funcs);   \
+   \
+   /*  \
+* For rcuidle callers, use srcu since sched-rcu\
+* doesn't work from the idle path. \
+*/ \
+   if (rcuidle) {  \
+   if (in_nmi()) { \
+   WARN_ON_ONCE(1);\
+   return; /* no srcu from nmi */  \
+   }   \
+   \
+   idx = srcu_read_lock_notrace(&tracepoint_srcu); \
+   it_func_ptr =   \
+   srcu_dereference_notrace((tp)->funcs,
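
(The hunk is cut off above by the archive.) As a rough sketch only, an
assumption about the shape of the remainder of this branch and not the
verbatim patch text (macro line continuations dropped; the second argument to
srcu_dereference_notrace() is assumed to be &tracepoint_srcu), the pairing
typically looks like:

	if (rcuidle) {
		/* SRCU section: usable from idle, but not from NMI */
		idx = srcu_read_lock_notrace(&tracepoint_srcu);
		it_func_ptr = srcu_dereference_notrace((tp)->funcs,
						       &tracepoint_srcu);
	} else {
		/* Normal case: keep the existing sched-RCU read section */
		rcu_read_lock_sched_notrace();
		it_func_ptr = rcu_dereference_sched((tp)->funcs);
	}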