Re: [PATCH] [RFC] perf: Fix a race between ring_buffer_detach() and ring_buffer_wakeup()

2014-05-08 Thread Peter Zijlstra
On Thu, May 08, 2014 at 08:37:27AM -0700, Paul E. McKenney wrote:
> On Wed, May 07, 2014 at 08:06:21PM +0200, Peter Zijlstra wrote:
> > On Wed, May 07, 2014 at 02:35:26PM +0200, Peter Zijlstra wrote:
> > >  static void ring_buffer_attach(struct perf_event *event,
> > >  struct ring_buffer *rb)
> > >  {
> > > + struct ring_buffer *old_rb = NULL;
> > >   unsigned long flags;
> > >  
> > > + if (event->rb) {
> > > + /*
> > > +  * Should be impossible, we set this when removing
> > > +  * event->rb_entry and wait/clear when adding event->rb_entry.
> > > +  */
> > > + WARN_ON_ONCE(event->rcu_pending);
> > >  
> > > + old_rb = event->rb;
> > > + event->rcu_batches = get_state_synchronize_rcu();
> > > + event->rcu_pending = 1;
> > >  
> > > + spin_lock_irqsave(&rb->event_lock, flags);
> > > + list_del_rcu(&event->rb_entry);
> > > + spin_unlock_irqrestore(&rb->event_lock, flags);
> > 
> > This all works a whole lot better if you make that old_rb->event_lock.
> > 
> > > + }
> > >  
> > > + if (event->rcu_pending && rb) {
> > > + cond_synchronize_rcu(event->rcu_batches);
> 
> There is not a whole lot of code between the get_state_synchronize_rcu()
> and the cond_synchronize_rcu(), so I would expect this to do a
> synchronize_rcu() almost all the time.  Or am I missing something here?

From the Changelog:

 2) an event that has a buffer attached, the buffer is destroyed
(munmap) and then the event is attached to a new/different buffer
using PERF_EVENT_IOC_SET_OUTPUT.

This case is more complex because the buffer destruction does:
  ring_buffer_attach(.rb = NULL)
followed by the ioctl() doing:
  ring_buffer_attach(.rb = foo);

and we still need to observe the grace period between these two
calls due to us reusing the event->rb_entry list_head.
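
Concretely, the sequence in this second case looks roughly like so (an
illustrative sketch of the two calls and the bookkeeping the patch adds;
new_rb simply names the buffer handed in by the ioctl):

        /* munmap path: detach by attaching the NULL buffer */
        ring_buffer_attach(event, NULL);
                /* -> list_del_rcu(&event->rb_entry);
                 *    event->rcu_batches = get_state_synchronize_rcu();
                 *    event->rcu_pending = 1;
                 */

        /* arbitrarily later, PERF_EVENT_IOC_SET_OUTPUT: attach the new buffer */
        ring_buffer_attach(event, new_rb);
                /* -> if (event->rcu_pending && rb)
                 *            cond_synchronize_rcu(event->rcu_batches);
                 *    list_add_rcu(&event->rb_entry, &rb->event_list);
                 */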




Re: [PATCH] [RFC] perf: Fix a race between ring_buffer_detach() and ring_buffer_wakeup()

2014-05-08 Thread Paul E. McKenney
On Wed, May 07, 2014 at 08:06:21PM +0200, Peter Zijlstra wrote:
> On Wed, May 07, 2014 at 02:35:26PM +0200, Peter Zijlstra wrote:
> >  static void ring_buffer_attach(struct perf_event *event,
> >struct ring_buffer *rb)
> >  {
> > +   struct ring_buffer *old_rb = NULL;
> > unsigned long flags;
> >  
> > +   if (event->rb) {
> > +   /*
> > +* Should be impossible, we set this when removing
> > +* event->rb_entry and wait/clear when adding event->rb_entry.
> > +*/
> > +   WARN_ON_ONCE(event->rcu_pending);
> >  
> > +   old_rb = event->rb;
> > +   event->rcu_batches = get_state_synchronize_rcu();
> > +   event->rcu_pending = 1;
> >  
> > +   spin_lock_irqsave(&rb->event_lock, flags);
> > +   list_del_rcu(&event->rb_entry);
> > +   spin_unlock_irqrestore(&rb->event_lock, flags);
> 
> This all works a whole lot better if you make that old_rb->event_lock.
> 
> > +   }
> >  
> > +   if (event->rcu_pending && rb) {
> > +   cond_synchronize_rcu(event->rcu_batches);

There is not a whole lot of code between the get_state_synchronize_rcu()
and the cond_synchronize_rcu(), so I would expect this to do a
synchronize_rcu() almost all the time.  Or am I missing something here?

Thanx, Paul

> > +   event->rcu_pending = 0;
> > +   }
> >  
> > +   if (rb) {
> > +   spin_lock_irqsave(&rb->event_lock, flags);
> > +   list_add_rcu(&event->rb_entry, &rb->event_list);
> > +   spin_unlock_irqrestore(&rb->event_lock, flags);
> > +   }
> > +
> > +   rcu_assign_pointer(event->rb, rb);
> > +
> > +   if (old_rb) {
> > +   ring_buffer_put(old_rb);
> > +   /*
> > +* Since we detached before setting the new rb, so that we
> > +* could attach the new rb, we could have missed a wakeup.
> > +* Provide it now.
> > +*/
> > +   wake_up_all(&event->waitq);
> > +   }
> >  }


Re: [PATCH] [RFC] perf: Fix a race between ring_buffer_detach() and ring_buffer_wakeup()

2014-05-07 Thread Peter Zijlstra
On Wed, May 07, 2014 at 02:35:26PM +0200, Peter Zijlstra wrote:
>  static void ring_buffer_attach(struct perf_event *event,
>  struct ring_buffer *rb)
>  {
> + struct ring_buffer *old_rb = NULL;
>   unsigned long flags;
>  
> + if (event->rb) {
> + /*
> +  * Should be impossible, we set this when removing
> +  * event->rb_entry and wait/clear when adding event->rb_entry.
> +  */
> + WARN_ON_ONCE(event->rcu_pending);
>  
> + old_rb = event->rb;
> + event->rcu_batches = get_state_synchronize_rcu();
> + event->rcu_pending = 1;
>  
> + spin_lock_irqsave(&rb->event_lock, flags);
> + list_del_rcu(&event->rb_entry);
> + spin_unlock_irqrestore(&rb->event_lock, flags);

This all works a whole lot better if you make that old_rb->event_lock.

> + }
>  
> + if (event->rcu_pending && rb) {
> + cond_synchronize_rcu(event->rcu_batches);
> + event->rcu_pending = 0;
> + }
>  
> + if (rb) {
> + spin_lock_irqsave(&rb->event_lock, flags);
> + list_add_rcu(&event->rb_entry, &rb->event_list);
> + spin_unlock_irqrestore(&rb->event_lock, flags);
> + }
> +
> + rcu_assign_pointer(event->rb, rb);
> +
> + if (old_rb) {
> + ring_buffer_put(old_rb);
> + /*
> +  * Since we detached before setting the new rb, so that we
> +  * could attach the new rb, we could have missed a wakeup.
> +  * Provide it now.
> +  */
> + wake_up_all(&event->waitq);
> + }
>  }
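
With the old_rb->event_lock change suggested above, the removal side would
read roughly like this (a sketch only, not the final patch):

        if (event->rb) {
                /*
                 * Should be impossible, we set this when removing
                 * event->rb_entry and wait/clear when adding event->rb_entry.
                 */
                WARN_ON_ONCE(event->rcu_pending);

                old_rb = event->rb;
                event->rcu_batches = get_state_synchronize_rcu();
                event->rcu_pending = 1;

                /* lock the buffer the entry is actually on */
                spin_lock_irqsave(&old_rb->event_lock, flags);
                list_del_rcu(&event->rb_entry);
                spin_unlock_irqrestore(&old_rb->event_lock, flags);
        }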




Re: [PATCH] [RFC] perf: Fix a race between ring_buffer_detach() and ring_buffer_wakeup()

2014-05-07 Thread Peter Zijlstra
On Fri, Mar 14, 2014 at 10:50:33AM +0100, Peter Zijlstra wrote:
> How about something like so that only does the sync_rcu() if really
> needed.

Paul reminded me we still have this issue open; I had another look and
the patch I proposed was completely broken, so I had to write me a new
one :-)

---
Subject: perf: Fix a race between ring_buffer_detach() and ring_buffer_attach()
From: Peter Zijlstra 
Date: Fri, 14 Mar 2014 10:50:33 +0100

Alexander noticed that we use RCU iteration on rb->event_list but do
not use list_{add,del}_rcu() to add,remove entries to that list, nor
do we observe proper grace periods when re-using the entries.

Merge ring_buffer_detach() into ring_buffer_attach() such that
attaching to the NULL buffer is detaching.

Furthermore, ensure that between any 'detach' and 'attach' of the same
event we observe the required grace period, but only when strictly
required. In effect this means that only ioctl(.request =
PERF_EVENT_IOC_SET_OUTPUT) will wait for a grace period, while the
normal initial attach and final detach will not be delayed.

This patch should, I think, do the right thing under all
circumstances, the 'normal' cases all should never see the extra grace
period, but the two cases:

 1) PERF_EVENT_IOC_SET_OUTPUT on an event which already has a
ring_buffer set, will now observe the required grace period between
removing itself from the old and attaching itself to the new buffer.

This case is 'simple' in that both buffers are present in
perf_event_set_output(); one could think an unconditional
synchronize_rcu() would be sufficient; however...

 2) an event that has a buffer attached, the buffer is destroyed
(munmap) and then the event is attached to a new/different buffer
using PERF_EVENT_IOC_SET_OUTPUT.

This case is more complex because the buffer destruction does:
  ring_buffer_attach(.rb = NULL)
followed by the ioctl() doing:
  ring_buffer_attach(.rb = foo);

and we still need to observe the grace period between these two
calls due to us reusing the event->rb_entry list_head.

In order to make 2 happen we use Paul's latest cond_synchronize_rcu()
call.

Cc: Frederic Weisbecker 
Cc: Mike Galbraith 
Cc: Paul Mackerras 
Cc: Stephane Eranian 
Cc: Andi Kleen 
Cc: "Paul E. McKenney" 
Cc: Ingo Molnar 
Reported-by: Alexander Shishkin 
Signed-off-by: Peter Zijlstra 
---
 include/linux/perf_event.h |2 
 kernel/events/core.c   |  109 -
 2 files changed, 51 insertions(+), 60 deletions(-)

--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -403,6 +403,8 @@ struct perf_event {
 
struct ring_buffer  *rb;
struct list_head        rb_entry;
+   unsigned long   rcu_batches;
+   int rcu_pending;
 
/* poll related */
wait_queue_head_t   waitq;
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -3179,7 +3179,8 @@ static void free_event_rcu(struct rcu_he
 }
 
 static void ring_buffer_put(struct ring_buffer *rb);
-static void ring_buffer_detach(struct perf_event *event, struct ring_buffer *rb);
+static void ring_buffer_attach(struct perf_event *event,
+  struct ring_buffer *rb);
 
 static void unaccount_event_cpu(struct perf_event *event, int cpu)
 {
@@ -3242,8 +3243,6 @@ static void free_event(struct perf_event
unaccount_event(event);
 
if (event->rb) {
-   struct ring_buffer *rb;
-
/*
 * Can happen when we close an event with re-directed output.
 *
@@ -3251,12 +3250,7 @@ static void free_event(struct perf_event
 * over us; possibly making our ring_buffer_put() the last.
 */
mutex_lock(&event->mmap_mutex);
-   rb = event->rb;
-   if (rb) {
-   rcu_assign_pointer(event->rb, NULL);
-   ring_buffer_detach(event, rb);
-   ring_buffer_put(rb); /* could be last */
-   }
+   ring_buffer_attach(event, NULL);
mutex_unlock(&event->mmap_mutex);
}
 
@@ -3843,28 +3837,47 @@ static int perf_mmap_fault(struct vm_are
 static void ring_buffer_attach(struct perf_event *event,
   struct ring_buffer *rb)
 {
+   struct ring_buffer *old_rb = NULL;
unsigned long flags;
 
-   if (!list_empty(&event->rb_entry))
-   return;
+   if (event->rb) {
+   /*
+* Should be impossible, we set this when removing
+* event->rb_entry and wait/clear when adding event->rb_entry.
+*/
+   WARN_ON_ONCE(event->rcu_pending);
 
-   spin_lock_irqsave(&rb->event_lock, flags);
-   if (list_empty(&event->rb_entry))
-   list_add(&event->rb_entry, &rb->event_list);
-   spin_unlock_irqrestore(&rb->event_lock, flags);
-}
+


Re: [PATCH] [RFC] perf: Fix a race between ring_buffer_detach() and ring_buffer_wakeup()

2014-03-18 Thread Peter Zijlstra
On Mon, Mar 17, 2014 at 07:45:22PM -0700, Paul E. McKenney wrote:
> > > Yep!  I will risk an ASCII diagram:
> > > 
> > > 
> > > 3:  +gpnum+-- ...
> > > | |
> > > 2:  +gpnum+---+--completed--+
> > >   | |
> > > 1:  +gpnum+---+--completed--+
> > > | |
> > > 0:+-+--completed--+
> > > 
> > > 
> > > A full RCU grace period happens between a pair of "|"s on the same line.
> > > By inspection, if your snapshot of ->gpnum is greater than the current
> > > value of ->completed, a grace period has passed.
> > 
> > OK, so I get the > part, but I'm not sure I get the = part of the above.
> > The way I read the diagram, when completed matches gpnum the grace
> > period is done and we don't have to wait anymore.
> 
> Absolutely not!  Let's try laying out the scenario:
> 
> 1.  Someone calls get_state_synchronize_rcu() when ->gpnum==->completed==0.
>   It returns zero.
> 
> 2.  That someone immediately calls cond_synchronize_rcu().  Nothing
>   has changed, so oldstate==newstate==0.
> 
>   We had better call synchronize_rcu() in this case!!!

> Make sense?

Yes, should have seen that! Thanks for bearing with me on this.

> 
> 
> rcu: Provide grace-period piggybacking API
> 
> The following pattern is currently not well supported by RCU:
> 
> 1.  Make data element inaccessible to RCU readers.
> 
> 2.  Do work that probably lasts for more than one grace period.
> 
> 3.  Do something to make sure RCU readers in flight before #1 above
>   have completed.
> 
> Here are some things that could currently be done:
> 
> a.  Do a synchronize_rcu() unconditionally at either #1 or #3 above.
>   This works, but imposes needless work and latency.
> 
> b.  Post an RCU callback at #1 above that does a wakeup, then
>   wait for the wakeup at #3.  This works well, but likely results
>   in an extra unneeded grace period.  Open-coding this is also
>   a bit more semi-tricky code than would be good.
> 
> This commit therefore adds get_state_synchronize_rcu() and
> cond_synchronize_rcu() APIs.  Call get_state_synchronize_rcu() at #1
> above and pass its return value to cond_synchronize_rcu() at #3 above.
> This results in a call to synchronize_rcu() if no grace period has
> elapsed between #1 and #3, but requires only a load, comparison, and
> memory barrier if a full grace period did elapse.
> 
> Requested-by: Peter Zijlstra 
> Signed-off-by: Paul E. McKenney 

Thanks!

Acked-by: Peter Zijlstra 
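
For reference, the pattern the quoted commit log describes boils down to
something like the following minimal sketch (obj, lock and
do_long_running_work() are illustrative names, not from the patch):

        unsigned long rcu_cookie;

        /* 1. Make the element inaccessible to new RCU readers. */
        spin_lock(&lock);
        list_del_rcu(&obj->entry);
        spin_unlock(&lock);
        rcu_cookie = get_state_synchronize_rcu();

        /* 2. Work that probably spans at least one grace period. */
        do_long_running_work(obj);

        /* 3. Wait only if no full grace period elapsed in the meantime. */
        cond_synchronize_rcu(rcu_cookie);
        kfree(obj);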


Re: [PATCH] [RFC] perf: Fix a race between ring_buffer_detach() and ring_buffer_wakeup()

2014-03-17 Thread Paul E. McKenney
On Mon, Mar 17, 2014 at 06:30:38PM +0100, Peter Zijlstra wrote:
> On Mon, Mar 17, 2014 at 09:48:15AM -0700, Paul E. McKenney wrote:
> > Good point, this barrier is a bit obscure.  Here you go:
> > 
> > /*
> >  * Make sure this load happens before the purportedly
> >  * time-consuming work between get_state_synchronize_rcu()
> >  * and cond_synchronize_rcu().
> >  */
> > 
> > So all you lose by leaving this barrier out is a slightly higher
> > probability of invoking synchronize_rcu() at cond_synchronize_rcu() time.
> > And without the barrier there is some chance of the CPU doing the load in
> > cond_synchronize_rcu() before this load in get_state_synchronize_rcu().
> > Which should be OK, but simpler to exclude this than have to worry about
> > counters out of sync.  Besides, you are taking a grace-period delay in
> > here somewhere, so the memory barrier is way down in the noise.
> 
> Oh sure; expense wasn't my concern. But I always question barriers
> without comments, and I couldn't come up with a reason for this one.
> 
> > > The way I imagined using this is taking this snapshot after the RCU
> > > operation, such that we err towards seeing a later grace period and
> > > synchronizing too much instead of seeing an earlier and sync'ing too
> > > little.
> > > 
> > > Such use would suggest the barrier the other way around.
> > 
> > Ah, but that ordering happens at the updater's end.  And there
> > are in fact memory barriers before and after the increment of
> > ->gpnum in rcu_gp_init():
> > 
> > /* Advance to a new grace period and initialize state. */
> > record_gp_stall_check_time(rsp);
> > /* Record GP times before starting GP, hence smp_store_release(). */
> > smp_store_release(&rsp->gpnum, rsp->gpnum + 1);
> > trace_rcu_grace_period(rsp->name, rsp->gpnum, TPS("start"));
> > raw_spin_unlock_irq(&rnp->lock);
> > 
> > /* Exclude any concurrent CPU-hotplug operations. */
> > mutex_lock(&rsp->onoff_mutex);
> > smp_mb__after_unlock_lock(); /* ->gpnum increment before GP! */
> > 
> > The smp_store_release() provides the memory barrier before, and the
> > smp_mb__after_unlock_lock() provides it after.
> 
> That wasn't exactly what I was talking about.
> 
> So suppose:
> 
>   spin_lock(&lock)
>   list_del_rcu(&entry->foo);
>   spin_unlock(&lock);
> 
>   rcu_stamp = get_state_sync_rcu();
> 
> Now we very much want to ensure to get the gpnum _after_ completing that
> list_del_rcu(), because if we were to observe the gpnum before it could
> complete before and we'd fail to wait long enough.
> 
> Now in the above, and with your proposed implementation, there is in
> fact no guarantee the load doesn't slip up past the list_del_rcu().
> 
> The RELEASE per the spin_unlock() only guarantees the list_del_rcu()
> happens before it, but the gpnum load can slip in, and in fact pass over
> the list_del_rcu().

Right you are!  Fixed patch below.
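
The fixed patch is truncated by this archive; roughly, the fix is to put a
full memory barrier ahead of the snapshot so the ->gpnum load cannot be
reordered before the caller's preceding update.  A sketch, not the verbatim
patch:

        unsigned long get_state_synchronize_rcu(void)
        {
                /*
                 * Any prior manipulation of RCU-protected data must happen
                 * before the load from ->gpnum, otherwise the snapshot is
                 * taken too early and we could fail to wait long enough.
                 */
                smp_mb();
                return smp_load_acquire(&rcu_state->gpnum);
        }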

> > > > +void cond_synchronize_rcu(unsigned long oldstate)
> > > > +{
> > > > +   unsigned long newstate = smp_load_acquire(&rcu_state->completed);
> > > 
> > > Again, uncommented barriers; the load_acquire seems to suggest you want
> > > to make sure the sync_rcu() bits happen after this read. But seeing how
> > > sync_rcu() will invariably read the same data again and you get an
> > > address dep this seems somewhat superfluous.
> > 
> > Not quite!  The get_state_synchronize_rcu() function reads ->gpnum,
> > but cond_synchronize_rcu() reads ->completed.  The
> 
> I was talking about the synchronize_rcu() call later in
> cond_synchronize_rcu().
> 
> But looking at its implementation; they don't in fact load either gpnum
> or completed.

They don't directly, but they register callbacks, which causes the
RCU core code to access them.  But if we invoke synchronize_rcu(),
it really doesn't matter what either get_state_synchronize_rcu() or
cond_synchronize_rcu() saw.  The memory barrier from cond_synchronize_rcu()'s
smp_load_acquire() is instead needed for the case where we -don't-
invoke synchronize_rcu().  In that case, whatever was supposed to follow
the grace period had better really follow the grace period.

> > > > +   if (ULONG_CMP_GE(oldstate, newstate))
> > > 
> > > So if the observed active gp (oldstate) is ahead or equal to the last
> > > completed gp, then we wait?
> > 
> > Yep!  I will risk an ASCII diagram:
> > 
> > 
> > 3:+gpnum+-- ...
> >   | |
> > 2:  +gpnum+---+--completed--+
> > | |
> > 1:+gpnum+---+--completed--+
> >   | |
> > 0:  +-+--completed--+
> > 
> > 
> > A full RCU grace period happens between a pair of "|"s on the same line.
> > By inspection, if your snapshot of ->gpnum is greater than the current
> > value of ->completed, a grace period has passed.
> 
> OK, so I get the > 

Re: [PATCH] [RFC] perf: Fix a race between ring_buffer_detach() and ring_buffer_wakeup()

2014-03-17 Thread Peter Zijlstra
On Mon, Mar 17, 2014 at 09:48:15AM -0700, Paul E. McKenney wrote:
> Good point, this barrier is a bit obscure.  Here you go:
> 
>   /*
>* Make sure this load happens before the purportedly
>* time-consuming work between get_state_synchronize_rcu()
>* and cond_synchronize_rcu().
>*/
> 
> So all you lose by leaving this barrier out is a slightly higher
> probability of invoking synchronize_rcu() at cond_synchronize_rcu() time.
> And without the barrier there is some chance of the CPU doing the load in
> cond_synchronize_rcu() before this load in get_state_synchronize_rcu().
> Which should be OK, but simpler to exclude this than have to worry about
> counters out of sync.  Besides, you are taking a grace-period delay in
> here somewhere, so the memory barrier is way down in the noise.

Oh sure; expense wasn't my concern. But I always question barriers
without comments, and I couldn't come up with a reason for this one.

> > The way I imagined using this is taking this snapshot after the RCU
> > operation, such that we err towards seeing a later grace period and
> > synchronizing too much instead of seeing an earlier and sync'ing too
> > little.
> > 
> > Such use would suggest the barrier the other way around.
> 
> Ah, but that ordering happens at the updater's end.  And there
> are in fact memory barriers before and after the increment of
> ->gpnum in rcu_gp_init():
> 
>   /* Advance to a new grace period and initialize state. */
>   record_gp_stall_check_time(rsp);
>   /* Record GP times before starting GP, hence smp_store_release(). */
>   smp_store_release(&rsp->gpnum, rsp->gpnum + 1);
>   trace_rcu_grace_period(rsp->name, rsp->gpnum, TPS("start"));
>   raw_spin_unlock_irq(&rnp->lock);
> 
>   /* Exclude any concurrent CPU-hotplug operations. */
>   mutex_lock(&rsp->onoff_mutex);
>   smp_mb__after_unlock_lock(); /* ->gpnum increment before GP! */
> 
> The smp_store_release() provides the memory barrier before, and the
> smp_mb__after_unlock_lock() provides it after.

That wasn't exactly what I was talking about.

So suppose:

  spin_lock(&lock)
  list_del_rcu(&entry->foo);
  spin_unlock(&lock);

  rcu_stamp = get_state_sync_rcu();

Now we very much want to ensure to get the gpnum _after_ completing that
list_del_rcu(), because if we were to observe the gpnum before it could
complete before and we'd fail to wait long enough.

Now in the above, and with your proposed implementation, there is in
fact no guarantee the load doesn't slip up past the list_del_rcu().

The RELEASE per the spin_unlock() only guarantees the list_del_rcu()
happens before it, but the gpnum load can slip in, and in fact pass over
the list_del_rcu().

> > > +void cond_synchronize_rcu(unsigned long oldstate)
> > > +{
> > > + unsigned long newstate = smp_load_acquire(&rcu_state->completed);
> > 
> > Again, uncommented barriers; the load_acquire seems to suggest you want
> > to make sure the sync_rcu() bits happen after this read. But seeing how
> > sync_rcu() will invariably read the same data again and you get an
> > address dep this seems somewhat superfluous.
> 
> Not quite!  The get_state_synchronize_rcu() function reads ->gpnum,
> but cond_synchronize_rcu() reads ->completed.  The

I was talking about the synchronize_rcu() call later in
cond_synchronize_rcu().

But looking at its implementation; they don't in fact load either gpnum
or completed.

> > > + if (ULONG_CMP_GE(oldstate, newstate))
> > 
> > So if the observed active gp (oldstate) is ahead or equal to the last
> > completed gp, then we wait?
> 
> Yep!  I will risk an ASCII diagram:
> 
> 
> 3:  +gpnum+-- ...
> | |
> 2:+gpnum+---+--completed--+
>   | |
> 1:  +gpnum+---+--completed--+
> | |
> 0:+-+--completed--+
> 
> 
> A full RCU grace period happens between a pair of "|"s on the same line.
> By inspection, if your snapshot of ->gpnum is greater than the current
> value of ->completed, a grace period has passed.

OK, so I get the > part, but I'm not sure I get the = part of the above.
The way I read the diagram, when completed matches gpnum the grace
period is done and we don't have to wait anymore.


Re: [PATCH] [RFC] perf: Fix a race between ring_buffer_detach() and ring_buffer_wakeup()

2014-03-17 Thread Paul E. McKenney
On Mon, Mar 17, 2014 at 12:18:07PM +0100, Peter Zijlstra wrote:
> > rcu: Provide grace-period piggybacking API
> > 
> > The following pattern is currently not well supported by RCU:
> > 
> > 1.  Make data element inaccessible to RCU readers.
> > 
> > 2.  Do work that probably lasts for more than one grace period.
> > 
> > 3.  Do something to make sure RCU readers in flight before #1 above
> > have completed.
> > 
> > Here are some things that could currently be done:
> > 
> > a.  Do a synchronize_rcu() unconditionally at either #1 or #3 above.
> > This works, but imposes needless work and latency.
> > 
> > b.  Post an RCU callback at #1 above that does a wakeup, then
> > wait for the wakeup at #3.  This works well, but likely results
> > in an extra unneeded grace period.  Open-coding this is also
> > a bit more semi-tricky code than would be good.
> > 
> > This commit therefore adds get_state_synchronize_rcu() and
> > cond_synchronize_rcu() APIs.  Call get_state_synchronize_rcu() at #1
> > above and pass its return value to cond_synchronize_rcu() at #3 above.
> > This results in a call to synchronize_rcu() if no grace period has
> > elapsed between #1 and #3, but requires only a load, comparison, and
> > memory barrier if a full grace period did elapse.
> > 
> > Reported-by: Peter Zijlstra 
> 
> More a requested-by, I'd say.

Fair enough...

> > Signed-off-by: Paul E. McKenney 
> > +/**
> > + * get_state_synchronize_rcu - Snapshot current RCU state
> > + *
> > + * Returns a cookie that is used by a later call to cond_synchronize_rcu()
> > + * to determine whether or not a full grace period has elapsed in the
> > + * meantime.
> > + */
> > +unsigned long get_state_synchronize_rcu(void)
> > +{
> 
> /*
>  * Make sure this load happens before anything following it; such that
>  * ... ?
>  */

Good point, this barrier is a bit obscure.  Here you go:

/*
 * Make sure this load happens before the purportedly
 * time-consuming work between get_state_synchronize_rcu()
 * and cond_synchronize_rcu().
 */

So all you lose by leaving this barrier out is a slightly higher
probability of invoking synchronize_rcu() at cond_synchronize_rcu() time.
And without the barrier there is some chance of the CPU doing the load in
cond_synchronize_rcu() before this load in get_state_synchronize_rcu().
Which should be OK, but simpler to exclude this than have to worry about
counters out of sync.  Besides, you are taking a grace-period delay in
here somewhere, so the memory barrier is way down in the noise.

> The way I imagined using this is taking this snapshot after the RCU
> operation, such that we err towards seeing a later grace period and
> synchronizing too much instead of seeing an earlier and sync'ing too
> little.
> 
> Such use would suggest the barrier the other way around.

Ah, but that ordering happens at the updater's end.  And there
are in fact memory barriers before and after the increment of
->gpnum in rcu_gp_init():

/* Advance to a new grace period and initialize state. */
record_gp_stall_check_time(rsp);
/* Record GP times before starting GP, hence smp_store_release(). */
smp_store_release(&rsp->gpnum, rsp->gpnum + 1);
trace_rcu_grace_period(rsp->name, rsp->gpnum, TPS("start"));
raw_spin_unlock_irq(&rnp->lock);

/* Exclude any concurrent CPU-hotplug operations. */
mutex_lock(&rsp->onoff_mutex);
smp_mb__after_unlock_lock(); /* ->gpnum increment before GP! */

The smp_store_release() provides the memory barrier before, and the
smp_mb__after_unlock_lock() provides it after.

> > +   return smp_load_acquire(&rcu_state->gpnum);
> > +}
> > +EXPORT_SYMBOL_GPL(get_state_synchronize_rcu);
> 
> I can't say I'm excited about that function name; but I'm also
> completely lacking in alternatives.

;-)

> > +
> > +/**
> > + * cond_synchronize_rcu - Conditionally wait for an RCU grace period
> > + *
> > + * @oldstate: return value from earlier call to get_state_synchronize_rcu()
> > + *
> > + * If a full RCU grace period has elapsed since the earlier call to
> > + * get_state_synchronize_rcu(), just return.  Otherwise, invoke
> > + * synchronize_rcu() to wait for a full grace period.
> > + */
> > +void cond_synchronize_rcu(unsigned long oldstate)
> > +{
> > +   unsigned long newstate = smp_load_acquire(&rcu_state->completed);
> 
> Again, uncommented barriers; the load_acquire seems to suggest you want
> to make sure the sync_rcu() bits happen after this read. But seeing how
> sync_rcu() will invariably read the same data again and you get an
> address dep this seems somewhat superfluous.

Not quite!  The get_state_synchronize_rcu() function reads ->gpnum,
but cond_synchronize_rcu() reads ->completed.  The

> Then again, can't hurt I suppose :-)
> 
> > +   if (ULONG_CMP_GE(oldstate, newstate))
> 
> So if the observed active gp (oldstate) is ahead or equal 

Re: [PATCH] [RFC] perf: Fix a race between ring_buffer_detach() and ring_buffer_wakeup()

2014-03-17 Thread Peter Zijlstra
> rcu: Provide grace-period piggybacking API
> 
> The following pattern is currently not well supported by RCU:
> 
> 1.  Make data element inaccessible to RCU readers.
> 
> 2.  Do work that probably lasts for more than one grace period.
> 
> 3.  Do something to make sure RCU readers in flight before #1 above
>   have completed.
> 
> Here are some things that could currently be done:
> 
> a.  Do a synchronize_rcu() unconditionally at either #1 or #3 above.
>   This works, but imposes needless work and latency.
> 
> b.  Post an RCU callback at #1 above that does a wakeup, then
>   wait for the wakeup at #3.  This works well, but likely results
>   in an extra unneeded grace period.  Open-coding this is also
>   a bit more semi-tricky code than would be good.
> 
> This commit therefore adds get_state_synchronize_rcu() and
> cond_synchronize_rcu() APIs.  Call get_state_synchronize_rcu() at #1
> above and pass its return value to cond_synchronize_rcu() at #3 above.
> This results in a call to synchronize_rcu() if no grace period has
> elapsed between #1 and #3, but requires only a load, comparison, and
> memory barrier if a full grace period did elapse.
> 
> Reported-by: Peter Zijlstra 

More a requested-by, I'd say.

> Signed-off-by: Paul E. McKenney 
> +/**
> + * get_state_synchronize_rcu - Snapshot current RCU state
> + *
> + * Returns a cookie that is used by a later call to cond_synchronize_rcu()
> + * to determine whether or not a full grace period has elapsed in the
> + * meantime.
> + */
> +unsigned long get_state_synchronize_rcu(void)
> +{

/*
 * Make sure this load happens before anything following it; such that
 * ... ?
 */

The way I imagined using this is taking this snapshot after the RCU
operation, such that we err towards seeing a later grace period and
synchronizing too much instead of seeing an earlier and sync'ing too
little.

Such use would suggest the barrier the other way around.

> + return smp_load_acquire(&rcu_state->gpnum);
> +}
> +EXPORT_SYMBOL_GPL(get_state_synchronize_rcu);

I can't say I'm excited about that function name; but I'm also
completely lacking in alternatives.

> +
> +/**
> + * cond_synchronize_rcu - Conditionally wait for an RCU grace period
> + *
> + * @oldstate: return value from earlier call to get_state_synchronize_rcu()
> + *
> + * If a full RCU grace period has elapsed since the earlier call to
> + * get_state_synchronize_rcu(), just return.  Otherwise, invoke
> + * synchronize_rcu() to wait for a full grace period.
> + */
> +void cond_synchronize_rcu(unsigned long oldstate)
> +{
> + unsigned long newstate = smp_load_acquire(&rcu_state->completed);

Again, uncommented barriers; the load_acquire seems to suggest you want
to make sure the sync_rcu() bits happen after this read. But seeing how
sync_rcu() will invariably read the same data again and you get an
address dep this seems somewhat superfluous.

Then again, can't hurt I suppose :-)

> + if (ULONG_CMP_GE(oldstate, newstate))

So if the observed active gp (oldstate) is ahead or equal to the last
completed gp, then we wait?

I thought the double grace period thing was for preemptible rcu; is it
done here because you don't want to special case the preemptible rcu and
make do with a single implementation?

And I suppose that on wrap around; we do extra sync_rcu() calls, which
can never be wrong.

> + synchronize_rcu();
> +}
> +EXPORT_SYMBOL_GPL(cond_synchronize_rcu);
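
On the wrap-around remark above: ULONG_CMP_GE() is the kernel's wrap-tolerant
comparison, roughly (as defined in include/linux/rcupdate.h of that era):

        #define ULONG_CMP_GE(a, b)      (ULONG_MAX / 2 >= (a) - (b))

The unsigned subtraction stays well defined across a counter wrap, and when
the counters are far apart because of a wrap the test errs toward being true,
i.e. toward an extra synchronize_rcu(), which is never wrong.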

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] [RFC] perf: Fix a race between ring_buffer_detach() and ring_buffer_wakeup()

2014-03-17 Thread Peter Zijlstra
 rcu: Provide grace-period piggybacking API
 
 The following pattern is currently not well supported by RCU:
 
 1.Make data element inaccessible to RCU readers.
 
 2.Do work that probably lasts for more than one grace period.
 
 3.Do something to make sure RCU readers in flight before #1 above
   have completed.
 
 Here are some things that could currently be done:
 
 a.Do a synchronize_rcu() unconditionally at either #1 or #3 above.
   This works, but imposes needless work and latency.
 
 b.Post an RCU callback at #1 above that does a wakeup, then
   wait for the wakeup at #3.  This works well, but likely results
   in an extra unneeded grace period.  Open-coding this is also
   a bit more semi-tricky code than would be good.
 
 This commit therefore adds get_state_synchronize_rcu() and
 cond_synchronize_rcu() APIs.  Call get_state_synchronize_rcu() at #1
 above and pass its return value to cond_synchronize_rcu() at #3 above.
 This results in a call to synchronize_rcu() if no grace period has
 elapsed between #1 and #3, but requires only a load, comparison, and
 memory barrier if a full grace period did elapse.
 
 Reported-by: Peter Zijlstra pet...@infradead.org

More a requested-by, I'd say.

 Signed-off-by: Paul E. McKenney paul...@linux.vnet.ibm.com
 +/**
 + * get_state_synchronize_rcu - Snapshot current RCU state
 + *
 + * Returns a cookie that is used by a later call to cond_synchronize_rcu()
 + * to determine whether or not a full grace period has elapsed in the
 + * meantime.
 + */
 +unsigned long get_state_synchronize_rcu(void)
 +{

/*
 * Make sure this load happens before anything following it; such that
 * ... ?
 */

The way I imaged using this is taking this snapshot after the RCU
operation, such that we err towards seeing a later grace period and
synchronizing too much in stead of seeing an earlier and sync'ing too
little.

Such use would suggest the barrier the other way around.

 + return smp_load_acquire(rcu_state-gpnum);
 +}
 +EXPORT_SYMBOL_GPL(get_state_synchronize_rcu);

I can't say I'm excited about that function name; but I'm also
completely lacking in alternatives.

 +
 +/**
 + * cond_synchronize_rcu - Conditionally wait for an RCU grace period
 + *
 + * @oldstate: return value from earlier call to get_state_synchronize_rcu()
 + *
 + * If a full RCU grace period has elapsed since the earlier call to
 + * get_state_synchronize_rcu(), just return.  Otherwise, invoke
 + * synchronize_rcu() to wait for a full grace period.
 + */
 +void cond_synchronize_rcu(unsigned long oldstate)
 +{
 + unsigned long newstate = smp_load_acquire(rcu_state-completed);

Again, uncommented barriers; the load_acquire seems to suggest you want
to make sure the sync_rcu() bits happen after this read. But seeing how
sync_rcu() will invariably read the same data again and you get an
address dep this seems somewhat superfluous.

Then again, can't hurt I suppose :-)

 + if (ULONG_CMP_GE(oldstate, newstate))

So if the observed active gp (oldstate) is ahead or equal to the last
completed gp, then we wait?

I thought the double grace period thing was for preemptible rcu; is it
done here because you don't want to special case the preemptible rcu and
make do with a single implementation?

And I suppose that on wrap around; we do extra sync_rcu() calls, which
can never be wrong.

 + synchronize_rcu();
 +}
 +EXPORT_SYMBOL_GPL(cond_synchronize_rcu);

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] [RFC] perf: Fix a race between ring_buffer_detach() and ring_buffer_wakeup()

2014-03-17 Thread Paul E. McKenney
On Mon, Mar 17, 2014 at 12:18:07PM +0100, Peter Zijlstra wrote:
  rcu: Provide grace-period piggybacking API
  
  The following pattern is currently not well supported by RCU:
  
  1.  Make data element inaccessible to RCU readers.
  
  2.  Do work that probably lasts for more than one grace period.
  
  3.  Do something to make sure RCU readers in flight before #1 above
  have completed.
  
  Here are some things that could currently be done:
  
  a.  Do a synchronize_rcu() unconditionally at either #1 or #3 above.
  This works, but imposes needless work and latency.
  
  b.  Post an RCU callback at #1 above that does a wakeup, then
  wait for the wakeup at #3.  This works well, but likely results
  in an extra unneeded grace period.  Open-coding this is also
  a bit more semi-tricky code than would be good.
  
  This commit therefore adds get_state_synchronize_rcu() and
  cond_synchronize_rcu() APIs.  Call get_state_synchronize_rcu() at #1
  above and pass its return value to cond_synchronize_rcu() at #3 above.
  This results in a call to synchronize_rcu() if no grace period has
  elapsed between #1 and #3, but requires only a load, comparison, and
  memory barrier if a full grace period did elapse.
  
  Reported-by: Peter Zijlstra pet...@infradead.org
 
 More a requested-by, I'd say.

Fair enough...

  Signed-off-by: Paul E. McKenney paul...@linux.vnet.ibm.com
  +/**
  + * get_state_synchronize_rcu - Snapshot current RCU state
  + *
  + * Returns a cookie that is used by a later call to cond_synchronize_rcu()
  + * to determine whether or not a full grace period has elapsed in the
  + * meantime.
  + */
  +unsigned long get_state_synchronize_rcu(void)
  +{
 
 /*
  * Make sure this load happens before anything following it; such that
  * ... ?
  */

Good point, this barrier is a bit obsure.  Here you go:

/*
 * Make sure this load happens before the purportedly
 * time-consuming work between get_state_synchronize_rcu()
 * and cond_synchronize_rcu().
 */

So all you lose by leaving this barrier out is a slightly higher
probability of invoking synchronize_rcu() at cond_synchronize_rcu() time.
And without the barrier there is some chance of the CPU doing the load in
cond_synchronize_rcu() before this load in get_state_synchronize_rcu().
Which should be OK, but simpler to exclude this than have to worry about
counters out of sync.  Besides, you are taking a grace-period delay in
here somewhere, so the memory barrier is way down in the noise.

 The way I imaged using this is taking this snapshot after the RCU
 operation, such that we err towards seeing a later grace period and
 synchronizing too much in stead of seeing an earlier and sync'ing too
 little.
 
 Such use would suggest the barrier the other way around.

Ah, but that ordering happens at the updater's end.  And there
are in fact memory barriers before and after the increment of
-gpnum in rcu_gp_init():

/* Advance to a new grace period and initialize state. */
record_gp_stall_check_time(rsp);
/* Record GP times before starting GP, hence smp_store_release(). */
smp_store_release(rsp-gpnum, rsp-gpnum + 1);
trace_rcu_grace_period(rsp-name, rsp-gpnum, TPS(start));
raw_spin_unlock_irq(rnp-lock);

/* Exclude any concurrent CPU-hotplug operations. */
mutex_lock(rsp-onoff_mutex);
smp_mb__after_unlock_lock(); /* -gpnum increment before GP! */

The smp_store_release() provides the memory barrier before, and the
smp_mb__after_unlock_lock() provides it after.

  +   return smp_load_acquire(rcu_state-gpnum);
  +}
  +EXPORT_SYMBOL_GPL(get_state_synchronize_rcu);
 
 I can't say I'm excited about that function name; but I'm also
 completely lacking in alternatives.

;-)

  +
  +/**
  + * cond_synchronize_rcu - Conditionally wait for an RCU grace period
  + *
  + * @oldstate: return value from earlier call to get_state_synchronize_rcu()
  + *
  + * If a full RCU grace period has elapsed since the earlier call to
  + * get_state_synchronize_rcu(), just return.  Otherwise, invoke
  + * synchronize_rcu() to wait for a full grace period.
  + */
  +void cond_synchronize_rcu(unsigned long oldstate)
  +{
  +   unsigned long newstate = smp_load_acquire(rcu_state-completed);
 
 Again, uncommented barriers; the load_acquire seems to suggest you want
 to make sure the sync_rcu() bits happen after this read. But seeing how
 sync_rcu() will invariably read the same data again and you get an
 address dep this seems somewhat superfluous.

Not quite!  The get_state_synchronize_rcu() function reads -gpnum,
but cond_synchronize_rcu() reads -completed.  The

 Then again, can't hurt I suppose :-)
 
  +   if (ULONG_CMP_GE(oldstate, newstate))
 
 So if the observed active gp (oldstate) is ahead or equal to the last
 completed gp, then we wait?

Yep!  I will risk an ASCII diagram:


3:   

Re: [PATCH] [RFC] perf: Fix a race between ring_buffer_detach() and ring_buffer_wakeup()

2014-03-17 Thread Peter Zijlstra
On Mon, Mar 17, 2014 at 09:48:15AM -0700, Paul E. McKenney wrote:
 Good point, this barrier is a bit obsure.  Here you go:
 
   /*
* Make sure this load happens before the purportedly
* time-consuming work between get_state_synchronize_rcu()
* and cond_synchronize_rcu().
*/
 
 So all you lose by leaving this barrier out is a slightly higher
 probability of invoking synchronize_rcu() at cond_synchronize_rcu() time.
 And without the barrier there is some chance of the CPU doing the load in
 cond_synchronize_rcu() before this load in get_state_synchronize_rcu().
 Which should be OK, but simpler to exclude this than have to worry about
 counters out of sync.  Besides, you are taking a grace-period delay in
 here somewhere, so the memory barrier is way down in the noise.

Oh sure; expense wasn't my concern. But I always question barriers
without comments, and I couldn't come up with a reason for this one.

  The way I imaged using this is taking this snapshot after the RCU
  operation, such that we err towards seeing a later grace period and
  synchronizing too much in stead of seeing an earlier and sync'ing too
  little.
  
  Such use would suggest the barrier the other way around.
 
 Ah, but that ordering happens at the updater's end.  And there
 are in fact memory barriers before and after the increment of
 -gpnum in rcu_gp_init():
 
   /* Advance to a new grace period and initialize state. */
   record_gp_stall_check_time(rsp);
   /* Record GP times before starting GP, hence smp_store_release(). */
   smp_store_release(rsp-gpnum, rsp-gpnum + 1);
   trace_rcu_grace_period(rsp-name, rsp-gpnum, TPS(start));
   raw_spin_unlock_irq(rnp-lock);
 
   /* Exclude any concurrent CPU-hotplug operations. */
   mutex_lock(rsp-onoff_mutex);
   smp_mb__after_unlock_lock(); /* -gpnum increment before GP! */
 
 The smp_store_release() provides the memory barrier before, and the
 smp_mb__after_unlock_lock() provides it after.

That wasn't exactly what I was talking about.

So suppose:

  spin_lock(lock)
  list_del_rcu(entry-foo);
  spin_unlock(lock);

  rcu_stamp = get_state_sync_rcu();

Now we very much want to ensure to get the gpnum _after_ completing that
list_del_rcu(), because if we were to observe the gpnum before it could
complete before and we'd fail to wait long enough.

Now in the above, and with your proposed implementation, there is in
fact no guarantee the load doesn't slip up past the list_del_rcu().

The RELEASE per the spin_unlock() only guarantees the list_del_rcu()
happens before it, but the gpnum load can slip in, and in fact pass over
the list_del_rcu().

   +void cond_synchronize_rcu(unsigned long oldstate)
   +{
   + unsigned long newstate = smp_load_acquire(rcu_state-completed);
  
  Again, uncommented barriers; the load_acquire seems to suggest you want
  to make sure the sync_rcu() bits happen after this read. But seeing how
  sync_rcu() will invariably read the same data again and you get an
  address dep this seems somewhat superfluous.
 
 Not quite!  The get_state_synchronize_rcu() function reads -gpnum,
 but cond_synchronize_rcu() reads -completed.  The

I was talking about the synchronize_rcu() call later in
cond_synchronize_rcu().

But looking at its implementation; they don't in fact load either gpnum
or completed.

   + if (ULONG_CMP_GE(oldstate, newstate))
  
  So if the observed active gp (oldstate) is ahead or equal to the last
  completed gp, then we wait?
 
 Yep!  I will risk an ASCII diagram:
 
 
 3:  +gpnum+-- ...
 | |
 2:+gpnum+---+--completed--+
   | |
 1:  +gpnum+---+--completed--+
 | |
 0:+-+--completed--+
 
 
 A full RCU grace period happens between a pair of |s on the same line.
 By inspection, if your snapshot of -gpnum is greater than the current
 value of -completed, a grace period has passed.

OK, so I get the  part, but I'm not sure I get the = part of the above.
The way I read the diagram, when completed matches gpnum the grace
period is done and we don't have to wait anymore.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] [RFC] perf: Fix a race between ring_buffer_detach() and ring_buffer_wakeup()

2014-03-17 Thread Paul E. McKenney
On Mon, Mar 17, 2014 at 06:30:38PM +0100, Peter Zijlstra wrote:
 On Mon, Mar 17, 2014 at 09:48:15AM -0700, Paul E. McKenney wrote:
  Good point, this barrier is a bit obsure.  Here you go:
  
  /*
   * Make sure this load happens before the purportedly
   * time-consuming work between get_state_synchronize_rcu()
   * and cond_synchronize_rcu().
   */
  
  So all you lose by leaving this barrier out is a slightly higher
  probability of invoking synchronize_rcu() at cond_synchronize_rcu() time.
  And without the barrier there is some chance of the CPU doing the load in
  cond_synchronize_rcu() before this load in get_state_synchronize_rcu().
  Which should be OK, but simpler to exclude this than have to worry about
  counters out of sync.  Besides, you are taking a grace-period delay in
  here somewhere, so the memory barrier is way down in the noise.
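
(Sketch of the two loads in question with the work in between; the names
follow the patch quoted later in this message, and the exact form of the
->gpnum load is an assumption:)

	unsigned long oldstate, newstate;

	oldstate = smp_load_acquire(&rcu_state->gpnum);	/* get_state_synchronize_rcu() */

	/* ... purportedly time-consuming work ... */

	newstate = smp_load_acquire(&rcu_state->completed);	/* cond_synchronize_rcu() */
	if (ULONG_CMP_GE(oldstate, newstate))
		synchronize_rcu();

Dropping the first barrier would let the ->gpnum load drift down into the
work, or even past the ->completed load; since ->gpnum only moves forward
this can only raise oldstate and cause an extra synchronize_rcu(), which
appears to be the "slightly higher probability" referred to above.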
 
 Oh sure; expense wasn't my concern. But I always question barriers
 without comments, and I couldn't come up with a reason for this one.
 
   The way I imaged using this is taking this snapshot after the RCU
   operation, such that we err towards seeing a later grace period and
   synchronizing too much instead of seeing an earlier and sync'ing too
   little.
   
   Such use would suggest the barrier the other way around.
  
  Ah, but that ordering happens at the updater's end.  And there
  are in fact memory barriers before and after the increment of
  ->gpnum in rcu_gp_init():
  
  /* Advance to a new grace period and initialize state. */
  record_gp_stall_check_time(rsp);
  /* Record GP times before starting GP, hence smp_store_release(). */
  smp_store_release(&rsp->gpnum, rsp->gpnum + 1);
  trace_rcu_grace_period(rsp->name, rsp->gpnum, TPS("start"));
  raw_spin_unlock_irq(&rnp->lock);
  
  /* Exclude any concurrent CPU-hotplug operations. */
  mutex_lock(&rsp->onoff_mutex);
  smp_mb__after_unlock_lock(); /* ->gpnum increment before GP! */
  
  The smp_store_release() provides the memory barrier before, and the
  smp_mb__after_unlock_lock() provides it after.
 
 That wasn't exactly what I was talking about.
 
 So suppose:
 
   spin_lock(lock)
   list_del_rcu(&entry->foo);
   spin_unlock(lock);
 
   rcu_stamp = get_state_sync_rcu();
 
 Now we very much want to ensure to get the gpnum _after_ completing that
 list_del_rcu(), because if we were to observe the gpnum before it could
 complete before and we'd fail to wait long enough.
 
 Now in the above, and with your proposed implementation, there is in
 fact no guarantee the load doesn't slip up past the list_del_rcu().
 
 The RELEASE per the spin_unlock() only guarantees the list_del_rcu()
 happens before it, but the gpnum load can slip in, and in fact pass over
 the list_del_rcu().

Right you are!  Fixed patch below.

+void cond_synchronize_rcu(unsigned long oldstate)
+{
+   unsigned long newstate = smp_load_acquire(&rcu_state->completed);
   
   Again, uncommented barriers; the load_acquire seems to suggest you want
   to make sure the sync_rcu() bits happen after this read. But seeing how
   sync_rcu() will invariably read the same data again and you get an
   address dep this seems somewhat superfluous.
  
  Not quite!  The get_state_synchronize_rcu() function reads ->gpnum,
  but cond_synchronize_rcu() reads ->completed.  The
 
 I was talking about the synchronize_rcu() call later in
 cond_synchronize_rcu().
 
 But looking at its implementation; they don't in fact load either gpnum
 or completed.

They don't directly, but they register callbacks, which causes the
RCU core code to access them.  But if we invoke synchronize_rcu(),
it really doesn't matter what either get_state_synchronize_rcu() or
cond_synchronize_rcu() read.  The memory barrier from cond_synchronize_rcu()'s
smp_load_acquire() is instead needed for the case where we -don't-
invoke synchronize_rcu().  In that case, whatever was supposed to follow
the grace period had better really follow the grace period.
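
(A caller's-eye sketch of why that acquire matters on the skip path; the
helpers around the two API calls are made-up names, only the RCU calls
are real:)

	unsigned long stamp;

	remove_from_rcu_protected_list(obj);	/* readers can no longer find obj */
	stamp = get_state_synchronize_rcu();

	do_long_running_work();

	cond_synchronize_rcu(stamp);	/* may find a grace period already elapsed */
	kfree(obj);			/* must not be reordered before the
					 * ->completed load that justified
					 * skipping synchronize_rcu() */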

+   if (ULONG_CMP_GE(oldstate, newstate))
   
   So if the observed active gp (oldstate) is ahead or equal to the last
   completed gp, then we wait?
  
  Yep!  I will risk an ASCII diagram:
  
  
  3:+gpnum+-- ...
| |
  2:  +gpnum+---+--completed--+
  | |
  1:+gpnum+---+--completed--+
| |
  0:  +-+--completed--+
  
  
  A full RCU grace period happens between a pair of |s on the same line.
  By inspection, if your snapshot of ->gpnum is greater than the current
  value of ->completed, a grace period has passed.
 
 OK, so I get the > part, but I'm not sure I get the = part of the above.
 The way I read the diagram, when completed matches gpnum the grace
 period is done and we don't have to wait anymore.

Absolutely not!  
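
(For reference, the wraparound-safe comparison used in the snippet above,
as I understand the kernel's definition; with it, synchronize_rcu() is
skipped only once ->completed has moved past the ->gpnum snapshot.  The
"equal" case still waits, because a grace period that was already underway
when the snapshot was taken gives no guarantee about readers that began
just before it:)

	#define ULONG_CMP_GE(a, b)	(ULONG_MAX / 2 >= (a) - (b))	/* "a >= b", modulo wrap */

	if (ULONG_CMP_GE(oldstate, newstate))	/* ->completed not yet past the snapshot */
		synchronize_rcu();		/* so wait for a full grace period */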

Re: [PATCH] [RFC] perf: Fix a race between ring_buffer_detach() and ring_buffer_wakeup()

2014-03-14 Thread Paul E. McKenney
On Fri, Mar 14, 2014 at 04:02:31PM -0700, Paul E. McKenney wrote:
> On Fri, Mar 14, 2014 at 11:43:17PM +0100, Peter Zijlstra wrote:
> > On Fri, Mar 14, 2014 at 01:47:37PM -0700, Paul E. McKenney wrote:
> > > This general idea can be made to work, but it will need some
> > > internal-to-RCU help.  One vulnerability of the patch below is the
> > > following sequence of steps:
> > > 
> > > 1.  RCU has just finished a grace period, and is doing the
> > >     end-of-grace-period accounting.
> > > 
> > > 2.  The code below invokes rcu_batches_completed().  Let's assume
> > >     the result returned is 42.
> > > 
> > > 3.  RCU completes the end-of-grace-period accounting, and increments
> > >     rcu_sched_state.completed.
> > > 
> > > 4.  The code below checks ->rcu_batches against the result from
> > >     another invocation of rcu_batches_completed() and sees that
> > >     the 43 is not equal to 42, so skips the synchronize_rcu().
> > > 
> > > Except that a grace period has not actually completed.  Boom!!!
> > > 
> > > The problem is that rcu_batches_completed() is only intended to give
> > > progress information on RCU.
> > 
> > Ah, I thought I was missing something when I was looking through the rcu
> > code in a hurry :-)
> 
> Well, given that I sometimes miss things when looking through RCU code
> carefully, I guess I cannot give you too much trouble about it.
> 
> > I knew there'd be some subtlety between completed and gpnum and such :-)
> 
> Some of which I have learned about one RCU bug at a time.  ;-)
> 
> > > What I can do is give you a pair of functions, one to take a snapshot of
> > > the current grace-period state (returning an unsigned long) and another
> > > to evaluate a previous snapshot, invoking synchronize_rcu() if there has
> > > not been a full grace period in the meantime.
> > > 
> > > The most straightforward approach would invoke acquiring the global
> > > rcu_state ->lock on each call, which I am guessing just might be
> > > considered to be excessive overhead.  ;-)  I should be able to decrease
> > > the overhead to a memory barrier on each call, and perhaps even down
> > > to an smp_load_acquire().  Accessing the RCU state probably costs you
> > > a cache miss both times.
> > > 
> > > Thoughts?
> > 
> > Sounds fine, the attach isn't a hotpath, so even the locked version
> > should be fine, but I won't keep you from making it all fancy and such
> > :-)
> 
> Fair enough, let me see what I can come up with.

And here is an untested patch.  Thoughts?

(And yes, I need to update documentation and torture tests accordingly.)

Thanx, Paul



rcu: Provide grace-period piggybacking API

The following pattern is currently not well supported by RCU:

1.  Make data element inaccessible to RCU readers.

2.  Do work that probably lasts for more than one grace period.

3.  Do something to make sure RCU readers in flight before #1 above
have completed.

Here are some things that could currently be done:

a.  Do a synchronize_rcu() unconditionally at either #1 or #3 above.
This works, but imposes needless work and latency.

b.  Post an RCU callback at #1 above that does a wakeup, then
wait for the wakeup at #3.  This works well, but likely results
in an extra unneeded grace period.  Open-coding this is also
a bit more semi-tricky code than would be good.

This commit therefore adds get_state_synchronize_rcu() and
cond_synchronize_rcu() APIs.  Call get_state_synchronize_rcu() at #1
above and pass its return value to cond_synchronize_rcu() at #3 above.
This results in a call to synchronize_rcu() if no grace period has
elapsed between #1 and #3, but requires only a load, comparison, and
memory barrier if a full grace period did elapse.

Reported-by: Peter Zijlstra 
Signed-off-by: Paul E. McKenney 

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index c9be2235712c..dbf0f225bca0 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -1503,13 +1503,14 @@ static int rcu_gp_init(struct rcu_state *rsp)
 
/* Advance to a new grace period and initialize state. */
record_gp_stall_check_time(rsp);
-   smp_wmb(); /* Record GP times before starting GP. */
-   rsp->gpnum++;
+   /* Record GP times before starting GP, hence smp_store_release(). */
+   smp_store_release(&rsp->gpnum, rsp->gpnum + 1);
trace_rcu_grace_period(rsp->name, rsp->gpnum, TPS("start"));
raw_spin_unlock_irq(&rnp->lock);
 
/* Exclude any concurrent CPU-hotplug operations. */
mutex_lock(&rsp->onoff_mutex);
+   smp_mb__after_unlock_lock(); /* ->gpnum increment before GP! */
 
/*
 * Set the quiescent-state-needed bits in all the rcu_node
@@ -1638,10 +1639,11 @@ static void rcu_gp_cleanup(struct rcu_state *rsp)
}
rnp = 
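
(For comparison, option (b) from the commit log above, open-coded with a
completion; the names are illustrative and this is not part of the patch:)

	struct gp_wait {
		struct rcu_head rh;
		struct completion done;
	};

	static void gp_wait_cb(struct rcu_head *rh)
	{
		struct gp_wait *w = container_of(rh, struct gp_wait, rh);

		complete(&w->done);	/* a full grace period has elapsed */
	}

	/* ...and at the call site: */
	struct gp_wait w;

	/* #1: after making the data element inaccessible to readers */
	init_completion(&w.done);
	call_rcu(&w.rh, gp_wait_cb);

	/* #2: work that probably lasts for more than one grace period */

	/* #3: */
	wait_for_completion(&w.done);	/* works, but likely adds an extra
					 * unneeded grace period */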

Re: [PATCH] [RFC] perf: Fix a race between ring_buffer_detach() and ring_buffer_wakeup()

2014-03-14 Thread Paul E. McKenney
On Fri, Mar 14, 2014 at 11:43:17PM +0100, Peter Zijlstra wrote:
> On Fri, Mar 14, 2014 at 01:47:37PM -0700, Paul E. McKenney wrote:
> > This general idea can be made to work, but it will need some
> > internal-to-RCU help.  One vulnerability of the patch below is the
> > following sequence of steps:
> > 
> > 1.  RCU has just finished a grace period, and is doing the
> > end-of-grace-period accounting.
> > 
> > 2.  The code below invokes rcu_batches_completed().  Let's assume
> > the result returned is 42.
> > 
> > 3.  RCU completes the end-of-grace-period accounting, and increments
> > rcu_sched_state.completed.
> > 
> > 4.  The code below checks ->rcu_batches against the result from
> > another invocation of rcu_batches_completed() and sees that
> > the 43 is not equal to 42, so skips the synchronize_rcu().
> > 
> > Except that a grace period has not actually completed.  Boom!!!
> > 
> > The problem is that rcu_batches_completed() is only intended to give
> > progress information on RCU.
> 
> Ah, I thought I was missing something when I was looking through the rcu
> code in a hurry :-)

Well, given that I sometimes miss things when looking through RCU code
carefully, I guess I cannot give you too much trouble about it.

> I knew there'd be some subtlety between completed and gpnum and such :-)

Some of which I have learned about one RCU bug at a time.  ;-)

> > What I can do is give you a pair of functions, one to take a snapshot of
> > the current grace-period state (returning an unsigned long) and another
> > to evaluate a previous snapshot, invoking synchronize_rcu() if there has
> > not been a full grace period in the meantime.
> > 
> > The most straightforward approach would invoke acquiring the global
> > rcu_state ->lock on each call, which I am guessing just might be
> > considered to be excessive overhead.  ;-)  I should be able to decrease
> > the overhead to a memory barrier on each call, and perhaps even down
> > to an smp_load_acquire().  Accessing the RCU state probably costs you
> > a cache miss both times.
> > 
> > Thoughts?
> 
> Sounds fine, the attach isn't a hotpath, so even the locked version
> should be fine, but I won't keep you from making it all fancy and such
> :-)

Fair enough, let me see what I can come up with.

Thanx, Paul



Re: [PATCH] [RFC] perf: Fix a race between ring_buffer_detach() and ring_buffer_wakeup()

2014-03-14 Thread Peter Zijlstra
On Fri, Mar 14, 2014 at 01:47:37PM -0700, Paul E. McKenney wrote:
> This general idea can be made to work, but it will need some
> internal-to-RCU help.  One vulnerability of the patch below is the
> following sequence of steps:
> 
> 1.  RCU has just finished a grace period, and is doing the
>     end-of-grace-period accounting.
> 
> 2.  The code below invokes rcu_batches_completed().  Let's assume
>     the result returned is 42.
> 
> 3.  RCU completes the end-of-grace-period accounting, and increments
>     rcu_sched_state.completed.
> 
> 4.  The code below checks ->rcu_batches against the result from
>     another invocation of rcu_batches_completed() and sees that
>     the 43 is not equal to 42, so skips the synchronize_rcu().
> 
> Except that a grace period has not actually completed.  Boom!!!
> 
> The problem is that rcu_batches_completed() is only intended to give
> progress information on RCU.

Ah, I thought I was missing something when I was looking through the rcu
code in a hurry :-)

I knew there'd be some subtlety between completed and gpnum and such :-)

> What I can do is give you a pair of functions, one to take a snapshot of
> the current grace-period state (returning an unsigned long) and another
> to evaluate a previous snapshot, invoking synchronize_rcu() if there has
> not been a full grace period in the meantime.
> 
> The most straightforward approach would invoke acquiring the global
> rcu_state ->lock on each call, which I am guessing just might be
> considered to be excessive overhead.  ;-)  I should be able to decrease
> the overhead to a memory barrier on each call, and perhaps even down
> to an smp_load_acquire().  Accessing the RCU state probably costs you
> a cache miss both times.
> 
> Thoughts?

Sounds fine, the attach isn't a hotpath, so even the locked version
should be fine, but I won't keep you from making it all fancy and such
:-)


Re: [PATCH] [RFC] perf: Fix a race between ring_buffer_detach() and ring_buffer_wakeup()

2014-03-14 Thread Paul E. McKenney
On Fri, Mar 14, 2014 at 10:50:33AM +0100, Peter Zijlstra wrote:
> On Thu, Mar 13, 2014 at 12:58:16PM -0700, Paul E. McKenney wrote:
> > On Fri, Mar 07, 2014 at 03:38:46PM +0200, Alexander Shishkin wrote:
> > > This is more of a problem description than an actual bugfix, but currently
> > > ring_buffer_detach() can kick in while ring_buffer_wakeup() is traversing
> > > the ring buffer's event list, leading to cpu stalls.
> > > 
> > > What this patch does is crude, but fixes the problem, which is: one rcu
> > > grace period has to elapse between ring_buffer_detach() and subsequent
> > > ring_buffer_attach(), otherwise either the attach will fail or the wakeup
> > > will misbehave. Also, making it a call_rcu() callback will make it race
> > > with attach().
> > > 
> > > Another solution that I see is to check for list_empty(&event->rb_entry)
> > > before wake_up_all() in ring_buffer_wakeup() and restart the list
> > > traversal if it is indeed empty, but that is ugly too as there will be
> > > extra wakeups on some events.
> > > 
> > > Anything that I'm missing here? Any better ideas?
> > 
> > Not sure it qualifies as "better", but the call to ring_buffer_detach()
> > is going to free the event anyway, so the synchronize_rcu() and the
> > INIT_LIST_HEAD() should not be needed in that case.  I am guessing that
> > the same is true for perf_mmap_close().
> > 
> > So that leaves the call in perf_event_set_output(), which detaches from an
> > old rb before attaching that same event to a new one.  So maybe have the
> > synchronize_rcu() and INIT_LIST_HEAD() instead be in the "if (old_rb)",
> > which might be a reasonably uncommon case?
> 
> How about something like so that only does the sync_rcu() if really
> needed.

This general idea can be made to work, but it will need some
internal-to-RCU help.  One vulnerability of the patch below is the
following sequence of steps:

1.  RCU has just finished a grace period, and is doing the
end-of-grace-period accounting.

2.  The code below invokes rcu_batches_completed().  Let's assume
the result returned is 42.

3.  RCU completes the end-of-grace-period accounting, and increments
rcu_sched_state.completed.

4.  The code below checks ->rcu_batches against the result from
another invocation of rcu_batches_completed() and sees that
the 43 is not equal to 42, so skips the synchronize_rcu().

Except that a grace period has not actually completed.  Boom!!!

The problem is that rcu_batches_completed() is only intended to give
progress information on RCU.
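
(Sketch of that sequence against the patch quoted below, with the numbers
from the steps above; the inequality only proves the counter moved, not
that a full grace period began and ended after the snapshot:)

	/* detach: a grace period has just finished, accounting in progress */
	rb->rcu_batches = rcu_batches_completed();	/* returns 42 */

	/* RCU completes the accounting; ->completed becomes 43 */

	/* attach: */
	if (rb->rcu_batches == rcu_batches_completed())	/* 42 != 43, so... */
		synchronize_rcu();			/* ...this is skipped */
	/* ...yet no full grace period elapsed after the snapshot was taken */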

What I can do is give you a pair of functions, one to take a snapshot of
the current grace-period state (returning an unsigned long) and another
to evaluate a previous snapshot, invoking synchronize_rcu() if there has
not been a full grace period in the meantime.

The most straightforward approach would invoke acquiring the global
rcu_state ->lock on each call, which I am guessing just might be
considered to be excessive overhead.  ;-)  I should be able to decrease
the overhead to a memory barrier on each call, and perhaps even down
to an smp_load_acquire().  Accessing the RCU state probably costs you
a cache miss both times.

Thoughts?

Thanx, Paul

> ---
>  kernel/events/core.c | 11 +--
>  kernel/events/internal.h |  1 +
>  2 files changed, 10 insertions(+), 2 deletions(-)
> 
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index 661951ab8ae7..88c8c810e081 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -3856,12 +3856,17 @@ static void ring_buffer_attach(struct perf_event *event,
>  {
>   unsigned long flags;
> 
> + if (rb->rcu_batches == rcu_batches_completed()) {
> + synchronize_rcu();
> + INIT_LIST_HEAD(&event->rb_entry);
> + }
> +
>   if (!list_empty(&event->rb_entry))
>   return;
> 
>   spin_lock_irqsave(&rb->event_lock, flags);
>   if (list_empty(&event->rb_entry))
> - list_add(&event->rb_entry, &rb->event_list);
> + list_add_rcu(&event->rb_entry, &rb->event_list);
>   spin_unlock_irqrestore(&rb->event_lock, flags);
>  }
> 
> @@ -3873,9 +3878,11 @@ static void ring_buffer_detach(struct perf_event *event, struct ring_buffer *rb)
>   return;
> 
>   spin_lock_irqsave(&rb->event_lock, flags);
> - list_del_init(&event->rb_entry);
> + list_del_rcu(&event->rb_entry);
>   wake_up_all(&event->waitq);
>   spin_unlock_irqrestore(&rb->event_lock, flags);
> +
> + rb->rcu_batches = rcu_batches_completed();
>  }
> 
>  static void ring_buffer_wakeup(struct perf_event *event)
> diff --git a/kernel/events/internal.h b/kernel/events/internal.h
> index 569b218782ad..698b5881b2a4 100644
> --- a/kernel/events/internal.h
> +++ b/kernel/events/internal.h
> @@ -30,6 +30,7 @@ struct ring_buffer {
>   /* poll crap */
>   spinlock_t  event_lock;
>   struct list_headevent_list;
> + unsigned long

Re: [PATCH] [RFC] perf: Fix a race between ring_buffer_detach() and ring_buffer_wakeup()

2014-03-14 Thread Peter Zijlstra
On Thu, Mar 13, 2014 at 12:58:16PM -0700, Paul E. McKenney wrote:
> On Fri, Mar 07, 2014 at 03:38:46PM +0200, Alexander Shishkin wrote:
> > This is more of a problem description than an actual bugfix, but currently
> > ring_buffer_detach() can kick in while ring_buffer_wakeup() is traversing
> > the ring buffer's event list, leading to cpu stalls.
> > 
> > What this patch does is crude, but fixes the problem, which is: one rcu
> > grace period has to elapse between ring_buffer_detach() and subsequent
> > ring_buffer_attach(), otherwise either the attach will fail or the wakeup
> > will misbehave. Also, making it a call_rcu() callback will make it race
> > with attach().
> > 
> > Another solution that I see is to check for list_empty(&event->rb_entry)
> > before wake_up_all() in ring_buffer_wakeup() and restart the list
> > traversal if it is indeed empty, but that is ugly too as there will be
> > extra wakeups on some events.
> > 
> > Anything that I'm missing here? Any better ideas?
> 
> Not sure it qualifies as "better", but the call to ring_buffer_detach()
> is going to free the event anyway, so the synchronize_rcu() and the
> INIT_LIST_HEAD() should not be needed in that case.  I am guessing that
> the same is true for perf_mmap_close().
> 
> So that leaves the call in perf_event_set_output(), which detaches from an
> old rb before attaching that same event to a new one.  So maybe have the
> synchronize_rcu() and INIT_LIST_HEAD() instead be in the "if (old_rb)",
> which might be a reasonably uncommon case?

How about something like so that only does the sync_rcu() if really
needed.


---
 kernel/events/core.c | 11 +--
 kernel/events/internal.h |  1 +
 2 files changed, 10 insertions(+), 2 deletions(-)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 661951ab8ae7..88c8c810e081 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -3856,12 +3856,17 @@ static void ring_buffer_attach(struct perf_event *event,
 {
unsigned long flags;
 
+   if (rb->rcu_batches == rcu_batches_completed()) {
+   synchronize_rcu();
+   INIT_LIST_HEAD(&event->rb_entry);
+   }
+
if (!list_empty(&event->rb_entry))
return;
 
spin_lock_irqsave(&rb->event_lock, flags);
if (list_empty(&event->rb_entry))
-   list_add(&event->rb_entry, &rb->event_list);
+   list_add_rcu(&event->rb_entry, &rb->event_list);
spin_unlock_irqrestore(&rb->event_lock, flags);
 }
 
@@ -3873,9 +3878,11 @@ static void ring_buffer_detach(struct perf_event *event, struct ring_buffer *rb)
return;
 
spin_lock_irqsave(&rb->event_lock, flags);
-   list_del_init(&event->rb_entry);
+   list_del_rcu(&event->rb_entry);
wake_up_all(&event->waitq);
spin_unlock_irqrestore(&rb->event_lock, flags);
+
+   rb->rcu_batches = rcu_batches_completed();
 }
 
 static void ring_buffer_wakeup(struct perf_event *event)
diff --git a/kernel/events/internal.h b/kernel/events/internal.h
index 569b218782ad..698b5881b2a4 100644
--- a/kernel/events/internal.h
+++ b/kernel/events/internal.h
@@ -30,6 +30,7 @@ struct ring_buffer {
/* poll crap */
spinlock_t  event_lock;
struct list_headevent_list;
+   unsigned long   rcu_batches;
 
atomic_tmmap_count;
unsigned long   mmap_locked;


Re: [PATCH] [RFC] perf: Fix a race between ring_buffer_detach() and ring_buffer_wakeup()

2014-03-13 Thread Paul E. McKenney
On Fri, Mar 07, 2014 at 03:38:46PM +0200, Alexander Shishkin wrote:
> This is more of a problem description than an actual bugfix, but currently
> ring_buffer_detach() can kick in while ring_buffer_wakeup() is traversing
> the ring buffer's event list, leading to cpu stalls.
> 
> What this patch does is crude, but fixes the problem, which is: one rcu
> grace period has to elapse between ring_buffer_detach() and subsequent
> ring_buffer_attach(), otherwise either the attach will fail or the wakeup
> will misbehave. Also, making it a call_rcu() callback will make it race
> with attach().
> 
> Another solution that I see is to check for list_empty(&event->rb_entry)
> before wake_up_all() in ring_buffer_wakeup() and restart the list
> traversal if it is indeed empty, but that is ugly too as there will be
> extra wakeups on some events.
> 
> Anything that I'm missing here? Any better ideas?

Not sure it qualifies as "better", but the call to ring_buffer_detach()
is going to free the event anyway, so the synchronize_rcu() and the
INIT_LIST_HEAD() should not be needed in that case.  I am guessing that
the same is true for perf_mmap_close().

So that leaves the call in perf_event_set_output(), which detaches from an
old rb before attaching that same event to a new one.  So maybe have the
synchronize_rcu() and INIT_LIST_HEAD() instead be in the "if (old_rb)",
which might be a reasonably uncommon case?

Thanx, Paul

> Signed-off-by: Alexander Shishkin 
> Cc: Paul McKenney 
> ---
>  kernel/events/core.c | 6 --
>  1 file changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index 661951a..bce41e0 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -3861,7 +3861,7 @@ static void ring_buffer_attach(struct perf_event *event,
> 
>   spin_lock_irqsave(&rb->event_lock, flags);
>   if (list_empty(&event->rb_entry))
> - list_add(&event->rb_entry, &rb->event_list);
> + list_add_rcu(&event->rb_entry, &rb->event_list);
>   spin_unlock_irqrestore(&rb->event_lock, flags);
>  }
> 
> @@ -3873,9 +3873,11 @@ static void ring_buffer_detach(struct perf_event *event, struct ring_buffer *rb)
>   return;
> 
>   spin_lock_irqsave(&rb->event_lock, flags);
> - list_del_init(&event->rb_entry);
> + list_del_rcu(&event->rb_entry);
>   wake_up_all(&event->waitq);
>   spin_unlock_irqrestore(&rb->event_lock, flags);
> + synchronize_rcu();
> + INIT_LIST_HEAD(&event->rb_entry);
>  }
> 
>  static void ring_buffer_wakeup(struct perf_event *event)
> -- 
> 1.9.0
> 



[PATCH] [RFC] perf: Fix a race between ring_buffer_detach() and ring_buffer_wakeup()

2014-03-07 Thread Alexander Shishkin
This is more of a problem description than an actual bugfix, but currently
ring_buffer_detach() can kick in while ring_buffer_wakeup() is traversing
the ring buffer's event list, leading to cpu stalls.

What this patch does is crude, but fixes the problem, which is: one rcu
grace period has to elapse between ring_buffer_detach() and subsequent
ring_buffer_attach(), otherwise either the attach will fail or the wakeup
will misbehave. Also, making it a call_rcu() callback will make it race
with attach().

Another solution that I see is to check for list_empty(&event->rb_entry)
before wake_up_all() in ring_buffer_wakeup() and restart the list
traversal if it is indeed empty, but that is ugly too as there will be
extra wakeups on some events.

Anything that I'm missing here? Any better ideas?
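
(For context, roughly the traversal this races with, quoted from memory,
so treat it as a sketch rather than the exact perf code:)

	static void ring_buffer_wakeup(struct perf_event *event)
	{
		struct ring_buffer *rb;

		rcu_read_lock();
		rb = rcu_dereference(event->rb);
		if (rb) {
			list_for_each_entry_rcu(event, &rb->event_list, rb_entry)
				wake_up_all(&event->waitq);
		}
		rcu_read_unlock();
	}

The walker runs under rcu_read_lock(), so once an entry has been removed
with list_del_rcu() it must not be re-initialized or re-added until a
grace period has elapsed; otherwise a concurrent walker can be led onto
the wrong list or loop on the re-initialized entry forever, which is the
cpu stall described above.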

Signed-off-by: Alexander Shishkin 
Cc: Paul McKenney 
---
 kernel/events/core.c | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 661951a..bce41e0 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -3861,7 +3861,7 @@ static void ring_buffer_attach(struct perf_event *event,
 
spin_lock_irqsave(&rb->event_lock, flags);
if (list_empty(&event->rb_entry))
-   list_add(&event->rb_entry, &rb->event_list);
+   list_add_rcu(&event->rb_entry, &rb->event_list);
spin_unlock_irqrestore(&rb->event_lock, flags);
 }
 
@@ -3873,9 +3873,11 @@ static void ring_buffer_detach(struct perf_event *event, struct ring_buffer *rb)
return;
 
spin_lock_irqsave(&rb->event_lock, flags);
-   list_del_init(&event->rb_entry);
+   list_del_rcu(&event->rb_entry);
wake_up_all(&event->waitq);
spin_unlock_irqrestore(&rb->event_lock, flags);
+   synchronize_rcu();
+   INIT_LIST_HEAD(&event->rb_entry);
 }
 
 static void ring_buffer_wakeup(struct perf_event *event)
-- 
1.9.0


