Re: [PATCH] [RFC] perf: Fix a race between ring_buffer_detach() and ring_buffer_wakeup()
On Thu, May 08, 2014 at 08:37:27AM -0700, Paul E. McKenney wrote:
> On Wed, May 07, 2014 at 08:06:21PM +0200, Peter Zijlstra wrote:
> > On Wed, May 07, 2014 at 02:35:26PM +0200, Peter Zijlstra wrote:
> > >  static void ring_buffer_attach(struct perf_event *event,
> > >  			       struct ring_buffer *rb)
> > >  {
> > > +	struct ring_buffer *old_rb = NULL;
> > >  	unsigned long flags;
> > > 
> > > +	if (event->rb) {
> > > +		/*
> > > +		 * Should be impossible, we set this when removing
> > > +		 * event->rb_entry and wait/clear when adding event->rb_entry.
> > > +		 */
> > > +		WARN_ON_ONCE(event->rcu_pending);
> > > 
> > > +		old_rb = event->rb;
> > > +		event->rcu_batches = get_state_synchronize_rcu();
> > > +		event->rcu_pending = 1;
> > > 
> > > +		spin_lock_irqsave(&rb->event_lock, flags);
> > > +		list_del_rcu(&event->rb_entry);
> > > +		spin_unlock_irqrestore(&rb->event_lock, flags);
> > 
> > This all works a whole lot better if you make that old_rb->event_lock.
> > 
> > > +	}
> > > 
> > > +	if (event->rcu_pending && rb) {
> > > +		cond_synchronize_rcu(event->rcu_batches);
> 
> There is not a whole lot of code between the get_state_synchronize_rcu()
> and the cond_synchronize_rcu(), so I would expect this to do a
> synchronize_rcu() almost all the time.  Or am I missing something here?

From the Changelog:

 2) an event that has a buffer attached, the buffer is destroyed
    (munmap) and then the event is attached to a new/different buffer
    using PERF_EVENT_IOC_SET_OUTPUT.

    This case is more complex because the buffer destruction does:
      ring_buffer_attach(.rb = NULL)
    followed by the ioctl() doing:
      ring_buffer_attach(.rb = foo);

    and we still need to observe the grace period between these two
    calls due to us reusing the event->rb_entry list_head.
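To make the timing in case 2 concrete, here is a minimal sketch of the call
sequence; the annotations paraphrase the changelog and are not lifted from
the patch:

	/* munmap()/buffer destruction: detach, but do not wait. */
	ring_buffer_attach(event, NULL);
		/* records event->rcu_batches = get_state_synchronize_rcu(),
		 * sets event->rcu_pending = 1, and does
		 * list_del_rcu(&event->rb_entry) */

	/* ...arbitrarily later, ioctl(PERF_EVENT_IOC_SET_OUTPUT) does: */
	ring_buffer_attach(event, foo);
		/* cond_synchronize_rcu(event->rcu_batches) waits only if
		 * the grace period begun at detach time has not yet
		 * elapsed, and only then reuses event->rb_entry via
		 * list_add_rcu() */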
Re: [PATCH] [RFC] perf: Fix a race between ring_buffer_detach() and ring_buffer_wakeup()
On Wed, May 07, 2014 at 08:06:21PM +0200, Peter Zijlstra wrote:
> On Wed, May 07, 2014 at 02:35:26PM +0200, Peter Zijlstra wrote:
> >  static void ring_buffer_attach(struct perf_event *event,
> >  			       struct ring_buffer *rb)
> >  {
> > +	struct ring_buffer *old_rb = NULL;
> >  	unsigned long flags;
> > 
> > +	if (event->rb) {
> > +		/*
> > +		 * Should be impossible, we set this when removing
> > +		 * event->rb_entry and wait/clear when adding event->rb_entry.
> > +		 */
> > +		WARN_ON_ONCE(event->rcu_pending);
> > 
> > +		old_rb = event->rb;
> > +		event->rcu_batches = get_state_synchronize_rcu();
> > +		event->rcu_pending = 1;
> > 
> > +		spin_lock_irqsave(&rb->event_lock, flags);
> > +		list_del_rcu(&event->rb_entry);
> > +		spin_unlock_irqrestore(&rb->event_lock, flags);
> 
> This all works a whole lot better if you make that old_rb->event_lock.
> 
> > +	}
> > 
> > +	if (event->rcu_pending && rb) {
> > +		cond_synchronize_rcu(event->rcu_batches);

There is not a whole lot of code between the get_state_synchronize_rcu()
and the cond_synchronize_rcu(), so I would expect this to do a
synchronize_rcu() almost all the time.  Or am I missing something here?

							Thanx, Paul

> > +		event->rcu_pending = 0;
> > +	}
> > 
> > +	if (rb) {
> > +		spin_lock_irqsave(&rb->event_lock, flags);
> > +		list_add_rcu(&event->rb_entry, &rb->event_list);
> > +		spin_unlock_irqrestore(&rb->event_lock, flags);
> > +	}
> > +
> > +	rcu_assign_pointer(event->rb, rb);
> > +
> > +	if (old_rb) {
> > +		ring_buffer_put(old_rb);
> > +		/*
> > +		 * Since we detached before setting the new rb, so that we
> > +		 * could attach the new rb, we could have missed a wakeup.
> > +		 * Provide it now.
> > +		 */
> > +		wake_up_all(&event->waitq);
> > +	}
> >  }
Re: [PATCH] [RFC] perf: Fix a race between ring_buffer_detach() and ring_buffer_wakeup()
On Wed, May 07, 2014 at 02:35:26PM +0200, Peter Zijlstra wrote:
>  static void ring_buffer_attach(struct perf_event *event,
>  			       struct ring_buffer *rb)
>  {
> +	struct ring_buffer *old_rb = NULL;
>  	unsigned long flags;
> 
> +	if (event->rb) {
> +		/*
> +		 * Should be impossible, we set this when removing
> +		 * event->rb_entry and wait/clear when adding event->rb_entry.
> +		 */
> +		WARN_ON_ONCE(event->rcu_pending);
> 
> +		old_rb = event->rb;
> +		event->rcu_batches = get_state_synchronize_rcu();
> +		event->rcu_pending = 1;
> 
> +		spin_lock_irqsave(&rb->event_lock, flags);
> +		list_del_rcu(&event->rb_entry);
> +		spin_unlock_irqrestore(&rb->event_lock, flags);

This all works a whole lot better if you make that old_rb->event_lock.

> +	}
> 
> +	if (event->rcu_pending && rb) {
> +		cond_synchronize_rcu(event->rcu_batches);
> +		event->rcu_pending = 0;
> +	}
> 
> +	if (rb) {
> +		spin_lock_irqsave(&rb->event_lock, flags);
> +		list_add_rcu(&event->rb_entry, &rb->event_list);
> +		spin_unlock_irqrestore(&rb->event_lock, flags);
> +	}
> +
> +	rcu_assign_pointer(event->rb, rb);
> +
> +	if (old_rb) {
> +		ring_buffer_put(old_rb);
> +		/*
> +		 * Since we detached before setting the new rb, so that we
> +		 * could attach the new rb, we could have missed a wakeup.
> +		 * Provide it now.
> +		 */
> +		wake_up_all(&event->waitq);
> +	}
>  }
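For clarity, a sketch of how the detach hunk would read with that fix
applied; old_rb is the buffer the event is leaving, so its lock, rather
than the incoming rb's (which is NULL on a plain detach), protects the
list removal:

	if (event->rb) {
		WARN_ON_ONCE(event->rcu_pending);

		old_rb = event->rb;
		event->rcu_batches = get_state_synchronize_rcu();
		event->rcu_pending = 1;

		/* Lock the buffer we are detaching from, not the new one. */
		spin_lock_irqsave(&old_rb->event_lock, flags);
		list_del_rcu(&event->rb_entry);
		spin_unlock_irqrestore(&old_rb->event_lock, flags);
	}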
Re: [PATCH] [RFC] perf: Fix a race between ring_buffer_detach() and ring_buffer_wakeup()
On Fri, Mar 14, 2014 at 10:50:33AM +0100, Peter Zijlstra wrote:
> How about something like so that only does the sync_rcu() if really
> needed.

Paul reminded me we still have this issue open; I had another look and
the patch I proposed was completely broken, so I had to write me a new
one :-)

---
Subject: perf: Fix a race between ring_buffer_detach() and ring_buffer_attach()
From: Peter Zijlstra <pet...@infradead.org>
Date: Fri, 14 Mar 2014 10:50:33 +0100

Alexander noticed that we use RCU iteration on rb->event_list but do
not use list_{add,del}_rcu() to add,remove entries to that list, nor
do we observe proper grace periods when re-using the entries.

Merge ring_buffer_detach() into ring_buffer_attach() such that
attaching to the NULL buffer is detaching.

Furthermore, ensure that between any 'detach' and 'attach' of the same
event we observe the required grace period, but only when strictly
required. In effect this means that only ioctl(.request =
PERF_EVENT_IOC_SET_OUTPUT) will wait for a grace period, while the
normal initial attach and final detach will not be delayed.

This patch should, I think, do the right thing under all
circumstances, the 'normal' cases all should never see the extra grace
period, but the two cases:

 1) PERF_EVENT_IOC_SET_OUTPUT on an event which already has a
    ring_buffer set, will now observe the required grace period between
    removing itself from the old and attaching itself to the new buffer.

    This case is 'simple' in that both buffers are present in
    perf_event_set_output() one could think an unconditional
    synchronize_rcu() would be sufficient; however...

 2) an event that has a buffer attached, the buffer is destroyed
    (munmap) and then the event is attached to a new/different buffer
    using PERF_EVENT_IOC_SET_OUTPUT.

    This case is more complex because the buffer destruction does:
      ring_buffer_attach(.rb = NULL)
    followed by the ioctl() doing:
      ring_buffer_attach(.rb = foo);

    and we still need to observe the grace period between these two
    calls due to us reusing the event->rb_entry list_head.

In order to make 2 happen we use Paul's latest cond_synchronize_rcu()
call.

Cc: Frederic Weisbecker <fweis...@gmail.com>
Cc: Mike Galbraith <efa...@gmx.de>
Cc: Paul Mackerras <pau...@samba.org>
Cc: Stephane Eranian <eran...@google.com>
Cc: Andi Kleen <a...@linux.intel.com>
Cc: "Paul E. McKenney" <paul...@linux.vnet.ibm.com>
Cc: Ingo Molnar <mi...@redhat.com>
Reported-by: Alexander Shishkin <alexander.shish...@linux.intel.com>
Signed-off-by: Peter Zijlstra <pet...@infradead.org>
---
 include/linux/perf_event.h |    2 ++
 kernel/events/core.c       |  109 +++++++++++++++++++++----------------------
 2 files changed, 51 insertions(+), 60 deletions(-)

--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -403,6 +403,8 @@ struct perf_event {
 
 	struct ring_buffer		*rb;
 	struct list_head		rb_entry;
+	unsigned long			rcu_batches;
+	int				rcu_pending;
 
 	/* poll related */
 	wait_queue_head_t		waitq;
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -3179,7 +3179,8 @@ static void free_event_rcu(struct rcu_he
 }
 
 static void ring_buffer_put(struct ring_buffer *rb);
-static void ring_buffer_detach(struct perf_event *event, struct ring_buffer *rb);
+static void ring_buffer_attach(struct perf_event *event,
+			       struct ring_buffer *rb);
 
 static void unaccount_event_cpu(struct perf_event *event, int cpu)
 {
@@ -3242,8 +3243,6 @@ static void free_event(struct perf_event
 	unaccount_event(event);
 
 	if (event->rb) {
-		struct ring_buffer *rb;
-
 		/*
 		 * Can happen when we close an event with re-directed output.
 		 *
@@ -3251,12 +3250,7 @@ static void free_event(struct perf_event
 		 * over us; possibly making our ring_buffer_put() the last.
 		 */
 		mutex_lock(&event->mmap_mutex);
-		rb = event->rb;
-		if (rb) {
-			rcu_assign_pointer(event->rb, NULL);
-			ring_buffer_detach(event, rb);
-			ring_buffer_put(rb); /* could be last */
-		}
+		ring_buffer_attach(event, NULL);
 		mutex_unlock(&event->mmap_mutex);
 	}
 
@@ -3843,28 +3837,47 @@ static int perf_mmap_fault(struct vm_are
 static void ring_buffer_attach(struct perf_event *event,
 			       struct ring_buffer *rb)
 {
+	struct ring_buffer *old_rb = NULL;
 	unsigned long flags;
 
-	if (!list_empty(&event->rb_entry))
-		return;
+	if (event->rb) {
+		/*
+		 * Should be impossible, we set this when removing
+		 * event->rb_entry and wait/clear when adding event->rb_entry.
+		 */
+		WARN_ON_ONCE(event->rcu_pending);
 
-	spin_lock_irqsave(&rb->event_lock, flags);
-	if (list_empty(&event->rb_entry))
-		list_add(&event->rb_entry, &rb->event_list);
-	spin_unlock_irqrestore(&rb->event_lock, flags);
-}
+
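The quoted diff breaks off at this point in the archive. For readability,
here is the complete replacement ring_buffer_attach() reassembled from the
hunks quoted elsewhere in this thread; it still takes rb->event_lock on the
detach side, which the review earlier in the thread suggests changing to
old_rb->event_lock, so treat this as a reading aid rather than the final
committed code:

	static void ring_buffer_attach(struct perf_event *event,
				       struct ring_buffer *rb)
	{
		struct ring_buffer *old_rb = NULL;
		unsigned long flags;

		if (event->rb) {
			/*
			 * Should be impossible, we set this when removing
			 * event->rb_entry and wait/clear when adding
			 * event->rb_entry.
			 */
			WARN_ON_ONCE(event->rcu_pending);

			/* Snapshot RCU state; do not wait here (detach path). */
			old_rb = event->rb;
			event->rcu_batches = get_state_synchronize_rcu();
			event->rcu_pending = 1;

			spin_lock_irqsave(&rb->event_lock, flags);
			list_del_rcu(&event->rb_entry);
			spin_unlock_irqrestore(&rb->event_lock, flags);
		}

		/* Only wait out the grace period when about to re-attach. */
		if (event->rcu_pending && rb) {
			cond_synchronize_rcu(event->rcu_batches);
			event->rcu_pending = 0;
		}

		if (rb) {
			spin_lock_irqsave(&rb->event_lock, flags);
			list_add_rcu(&event->rb_entry, &rb->event_list);
			spin_unlock_irqrestore(&rb->event_lock, flags);
		}

		rcu_assign_pointer(event->rb, rb);

		if (old_rb) {
			ring_buffer_put(old_rb);
			/*
			 * Since we detached before setting the new rb, so
			 * that we could attach the new rb, we could have
			 * missed a wakeup.  Provide it now.
			 */
			wake_up_all(&event->waitq);
		}
	}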
Re: [PATCH] [RFC] perf: Fix a race between ring_buffer_detach() and ring_buffer_wakeup()
On Mon, Mar 17, 2014 at 07:45:22PM -0700, Paul E. McKenney wrote:
> > > Yep!  I will risk an ASCII diagram:
> > > 
> > > 3:                         +gpnum+-- ...
> > >                            |     |
> > > 2:               +gpnum+---+--completed--+
> > >                  |     |
> > > 1:   +gpnum+---+--completed--+
> > >      |     |
> > > 0: +-+--completed--+
> > > 
> > > A full RCU grace period happens between a pair of "|"s on the same line.
> > > By inspection, if your snapshot of ->gpnum is greater than the current
> > > value of ->completed, a grace period has passed.
> > 
> > OK, so I get the > part, but I'm not sure I get the = part of the above.
> > The way I read the diagram, when completed matches gpnum the grace
> > period is done and we don't have to wait anymore.
> 
> Absolutely not!  Let's try laying out the scenario:
> 
> 1.	Someone calls get_state_synchronize_rcu() when ->gpnum==->completed==0.
> 	It returns zero.
> 
> 2.	That someone immediately calls cond_synchronize_rcu().  Nothing
> 	has changed, so oldstate==newstate==0.
> 
> 	We had better call synchronize_rcu() in this case!!!
> 
> Make sense?

Yes, should have seen that! Thanks for bearing with me on this.

> rcu: Provide grace-period piggybacking API
> 
> The following pattern is currently not well supported by RCU:
> 
> 1.	Make data element inaccessible to RCU readers.
> 
> 2.	Do work that probably lasts for more than one grace period.
> 
> 3.	Do something to make sure RCU readers in flight before #1 above
> 	have completed.
> 
> Here are some things that could currently be done:
> 
> a.	Do a synchronize_rcu() unconditionally at either #1 or #3 above.
> 	This works, but imposes needless work and latency.
> 
> b.	Post an RCU callback at #1 above that does a wakeup, then
> 	wait for the wakeup at #3.  This works well, but likely results
> 	in an extra unneeded grace period.  Open-coding this is also
> 	a bit more semi-tricky code than would be good.
> 
> This commit therefore adds get_state_synchronize_rcu() and
> cond_synchronize_rcu() APIs.  Call get_state_synchronize_rcu() at #1
> above and pass its return value to cond_synchronize_rcu() at #3 above.
> This results in a call to synchronize_rcu() if no grace period has
> elapsed between #1 and #3, but requires only a load, comparison, and
> memory barrier if a full grace period did elapse.
> 
> Requested-by: Peter Zijlstra <pet...@infradead.org>
> Signed-off-by: Paul E. McKenney <paul...@linux.vnet.ibm.com>

Thanks!

Acked-by: Peter Zijlstra <pet...@infradead.org>
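Purely as a reading aid, here is the usage pattern the log above describes,
in caller-side form; the lock, list, and teardown names are invented for
illustration:

	unsigned long cookie;

	/* #1: make the element inaccessible to new RCU readers. */
	spin_lock(&mylock);
	list_del_rcu(&p->node);
	spin_unlock(&mylock);
	cookie = get_state_synchronize_rcu();

	/* #2: work that probably spans more than one grace period. */
	expensive_teardown(p);

	/* #3: waits only if a full grace period has not already elapsed. */
	cond_synchronize_rcu(cookie);
	kfree(p);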
Re: [PATCH] [RFC] perf: Fix a race between ring_buffer_detach() and ring_buffer_wakeup()
On Mon, Mar 17, 2014 at 06:30:38PM +0100, Peter Zijlstra wrote:
> On Mon, Mar 17, 2014 at 09:48:15AM -0700, Paul E. McKenney wrote:
> > Good point, this barrier is a bit obscure.  Here you go:
> > 
> > 	/*
> > 	 * Make sure this load happens before the purportedly
> > 	 * time-consuming work between get_state_synchronize_rcu()
> > 	 * and cond_synchronize_rcu().
> > 	 */
> > 
> > So all you lose by leaving this barrier out is a slightly higher
> > probability of invoking synchronize_rcu() at cond_synchronize_rcu() time.
> > And without the barrier there is some chance of the CPU doing the load in
> > cond_synchronize_rcu() before this load in get_state_synchronize_rcu().
> > Which should be OK, but simpler to exclude this than have to worry about
> > counters out of sync.  Besides, you are taking a grace-period delay in
> > here somewhere, so the memory barrier is way down in the noise.
> 
> Oh sure; expense wasn't my concern. But I always question barriers
> without comments, and I couldn't come up with a reason for this one.
> 
> > > The way I imagined using this is taking this snapshot after the RCU
> > > operation, such that we err towards seeing a later grace period and
> > > synchronizing too much instead of seeing an earlier and sync'ing too
> > > little.
> > > 
> > > Such use would suggest the barrier the other way around.
> > 
> > Ah, but that ordering happens at the updater's end.  And there
> > are in fact memory barriers before and after the increment of
> > ->gpnum in rcu_gp_init():
> > 
> > 	/* Advance to a new grace period and initialize state. */
> > 	record_gp_stall_check_time(rsp);
> > 	/* Record GP times before starting GP, hence smp_store_release(). */
> > 	smp_store_release(&rsp->gpnum, rsp->gpnum + 1);
> > 	trace_rcu_grace_period(rsp->name, rsp->gpnum, TPS("start"));
> > 	raw_spin_unlock_irq(&rnp->lock);
> > 
> > 	/* Exclude any concurrent CPU-hotplug operations. */
> > 	mutex_lock(&rsp->onoff_mutex);
> > 	smp_mb__after_unlock_lock(); /* ->gpnum increment before GP! */
> > 
> > The smp_store_release() provides the memory barrier before, and the
> > smp_mb__after_unlock_lock() provides it after.
> 
> That wasn't exactly what I was talking about.
> 
> So suppose:
> 
> 	spin_lock(&lock)
> 	list_del_rcu(&entry->foo);
> 	spin_unlock(&lock);
> 
> 	rcu_stamp = get_state_sync_rcu();
> 
> Now we very much want to ensure to get the gpnum _after_ completing that
> list_del_rcu(), because if we were to observe the gpnum before it could
> complete before and we'd fail to wait long enough.
> 
> Now in the above, and with your proposed implementation, there is in
> fact no guarantee the load doesn't slip up past the list_del_rcu().
> 
> The RELEASE per the spin_unlock() only guarantees the list_del_rcu()
> happens before it, but the gpnum load can slip in, and in fact pass over
> the list_del_rcu().

Right you are!  Fixed patch below.

> > > > +void cond_synchronize_rcu(unsigned long oldstate)
> > > > +{
> > > > +	unsigned long newstate = smp_load_acquire(&rcu_state->completed);
> > > 
> > > Again, uncommented barriers; the load_acquire seems to suggest you want
> > > to make sure the sync_rcu() bits happen after this read. But seeing how
> > > sync_rcu() will invariably read the same data again and you get an
> > > address dep this seems somewhat superfluous.
> > 
> > Not quite!  The get_state_synchronize_rcu() function reads ->gpnum,
> > but cond_synchronize_rcu() reads ->completed.  The
> 
> I was talking about the synchronize_rcu() call later in
> cond_synchronize_rcu().
> 
> But looking at its implementation; they don't in fact load either gpnum
> or completed.

They don't directly, but they register callbacks, which causes the RCU
core code to access them.  But if we invoke synchronize_rcu(), it really
doesn't matter what either get_state_synchronize_rcu() or
cond_synchronize_rcu() read.

The memory barrier from cond_synchronize_rcu()'s smp_load_acquire() is
instead needed for the case where we -don't- invoke synchronize_rcu().
In that case, whatever was supposed to follow the grace period had
better really follow the grace period.

> > > > +	if (ULONG_CMP_GE(oldstate, newstate))
> > > 
> > > So if the observed active gp (oldstate) is ahead or equal to the last
> > > completed gp, then we wait?
> > 
> > Yep!  I will risk an ASCII diagram:
> > 
> > 3:                         +gpnum+-- ...
> >                            |     |
> > 2:               +gpnum+---+--completed--+
> >                  |     |
> > 1:   +gpnum+---+--completed--+
> >      |     |
> > 0: +-+--completed--+
> > 
> > A full RCU grace period happens between a pair of "|"s on the same line.
> > By inspection, if your snapshot of ->gpnum is greater than the current
> > value of ->completed, a grace period has passed.
> 
> OK, so I get the > part, but I'm not sure I get the = part of the above.
> The way I read the diagram, when completed matches gpnum the grace
> period is done and we don't have to wait anymore.

Absolutely not!  Let's try laying out the scenario:

1.	Someone calls get_state_synchronize_rcu() when ->gpnum==->completed==0.
	It returns zero.

2.	That someone immediately calls cond_synchronize_rcu().  Nothing
	has changed, so oldstate==newstate==0.

	We had better call synchronize_rcu() in this case!!!

Make sense?
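The fixed patch itself did not survive in this archive copy. Given the
ordering problem Peter describes above, a plausible sketch of the fix is a
full memory barrier ahead of the ->gpnum load, so the snapshot cannot be
reordered before the caller's list_del_rcu(); this is an assumption about
the repost, not quoted code:

	unsigned long get_state_synchronize_rcu(void)
	{
		/*
		 * Any prior manipulation of RCU-protected data must happen
		 * before the load from ->gpnum.
		 */
		smp_mb();
		/*
		 * Make sure this load happens before the purportedly
		 * time-consuming work between get_state_synchronize_rcu()
		 * and cond_synchronize_rcu().
		 */
		return smp_load_acquire(&rcu_state->gpnum);
	}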
Re: [PATCH] [RFC] perf: Fix a race between ring_buffer_detach() and ring_buffer_wakeup()
On Mon, Mar 17, 2014 at 09:48:15AM -0700, Paul E. McKenney wrote:
> Good point, this barrier is a bit obscure.  Here you go:
> 
> 	/*
> 	 * Make sure this load happens before the purportedly
> 	 * time-consuming work between get_state_synchronize_rcu()
> 	 * and cond_synchronize_rcu().
> 	 */
> 
> So all you lose by leaving this barrier out is a slightly higher
> probability of invoking synchronize_rcu() at cond_synchronize_rcu() time.
> And without the barrier there is some chance of the CPU doing the load in
> cond_synchronize_rcu() before this load in get_state_synchronize_rcu().
> Which should be OK, but simpler to exclude this than have to worry about
> counters out of sync.  Besides, you are taking a grace-period delay in
> here somewhere, so the memory barrier is way down in the noise.

Oh sure; expense wasn't my concern. But I always question barriers
without comments, and I couldn't come up with a reason for this one.

> > The way I imagined using this is taking this snapshot after the RCU
> > operation, such that we err towards seeing a later grace period and
> > synchronizing too much instead of seeing an earlier and sync'ing too
> > little.
> > 
> > Such use would suggest the barrier the other way around.
> 
> Ah, but that ordering happens at the updater's end.  And there
> are in fact memory barriers before and after the increment of
> ->gpnum in rcu_gp_init():
> 
> 	/* Advance to a new grace period and initialize state. */
> 	record_gp_stall_check_time(rsp);
> 	/* Record GP times before starting GP, hence smp_store_release(). */
> 	smp_store_release(&rsp->gpnum, rsp->gpnum + 1);
> 	trace_rcu_grace_period(rsp->name, rsp->gpnum, TPS("start"));
> 	raw_spin_unlock_irq(&rnp->lock);
> 
> 	/* Exclude any concurrent CPU-hotplug operations. */
> 	mutex_lock(&rsp->onoff_mutex);
> 	smp_mb__after_unlock_lock(); /* ->gpnum increment before GP! */
> 
> The smp_store_release() provides the memory barrier before, and the
> smp_mb__after_unlock_lock() provides it after.

That wasn't exactly what I was talking about.

So suppose:

	spin_lock(&lock)
	list_del_rcu(&entry->foo);
	spin_unlock(&lock);

	rcu_stamp = get_state_sync_rcu();

Now we very much want to ensure to get the gpnum _after_ completing that
list_del_rcu(), because if we were to observe the gpnum before it could
complete before and we'd fail to wait long enough.

Now in the above, and with your proposed implementation, there is in
fact no guarantee the load doesn't slip up past the list_del_rcu().

The RELEASE per the spin_unlock() only guarantees the list_del_rcu()
happens before it, but the gpnum load can slip in, and in fact pass over
the list_del_rcu().

> > > +void cond_synchronize_rcu(unsigned long oldstate)
> > > +{
> > > +	unsigned long newstate = smp_load_acquire(&rcu_state->completed);
> > 
> > Again, uncommented barriers; the load_acquire seems to suggest you want
> > to make sure the sync_rcu() bits happen after this read. But seeing how
> > sync_rcu() will invariably read the same data again and you get an
> > address dep this seems somewhat superfluous.
> 
> Not quite!  The get_state_synchronize_rcu() function reads ->gpnum,
> but cond_synchronize_rcu() reads ->completed.  The

I was talking about the synchronize_rcu() call later in
cond_synchronize_rcu().

But looking at its implementation; they don't in fact load either gpnum
or completed.

> > > +	if (ULONG_CMP_GE(oldstate, newstate))
> > 
> > So if the observed active gp (oldstate) is ahead or equal to the last
> > completed gp, then we wait?
> 
> Yep!  I will risk an ASCII diagram:
> 
> 3:                         +gpnum+-- ...
>                            |     |
> 2:               +gpnum+---+--completed--+
>                  |     |
> 1:   +gpnum+---+--completed--+
>      |     |
> 0: +-+--completed--+
> 
> A full RCU grace period happens between a pair of "|"s on the same line.
> By inspection, if your snapshot of ->gpnum is greater than the current
> value of ->completed, a grace period has passed.

OK, so I get the > part, but I'm not sure I get the = part of the above.
The way I read the diagram, when completed matches gpnum the grace
period is done and we don't have to wait anymore.
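A small worked trace of the comparison being discussed, with small numbers
and no wrap-around (illustrative, not taken from the patch):

	oldstate = get_state_synchronize_rcu();	/* snapshot: ->gpnum == 2;
						 * GP 2 may already be running */

	/* later, cond_synchronize_rcu(oldstate) reads newstate = ->completed: */

	newstate == 1 or 2:	ULONG_CMP_GE(2, newstate) is true -> wait.
				Even completed == 2 is not enough: GP 2 could
				have been in flight when the snapshot was
				taken, so its completion does not cover a
				full grace period after the snapshot.  This
				is the '=' case.

	newstate == 3:		ULONG_CMP_GE(2, 3) is false -> skip.
				GP 3 began and ended entirely after the
				snapshot, so a full grace period has elapsed.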
Re: [PATCH] [RFC] perf: Fix a race between ring_buffer_detach() and ring_buffer_wakeup()
On Mon, Mar 17, 2014 at 12:18:07PM +0100, Peter Zijlstra wrote:
> > rcu: Provide grace-period piggybacking API
> > 
> > The following pattern is currently not well supported by RCU:
> > 
> > 1.	Make data element inaccessible to RCU readers.
> > 
> > 2.	Do work that probably lasts for more than one grace period.
> > 
> > 3.	Do something to make sure RCU readers in flight before #1 above
> > 	have completed.
> > 
> > Here are some things that could currently be done:
> > 
> > a.	Do a synchronize_rcu() unconditionally at either #1 or #3 above.
> > 	This works, but imposes needless work and latency.
> > 
> > b.	Post an RCU callback at #1 above that does a wakeup, then
> > 	wait for the wakeup at #3.  This works well, but likely results
> > 	in an extra unneeded grace period.  Open-coding this is also
> > 	a bit more semi-tricky code than would be good.
> > 
> > This commit therefore adds get_state_synchronize_rcu() and
> > cond_synchronize_rcu() APIs.  Call get_state_synchronize_rcu() at #1
> > above and pass its return value to cond_synchronize_rcu() at #3 above.
> > This results in a call to synchronize_rcu() if no grace period has
> > elapsed between #1 and #3, but requires only a load, comparison, and
> > memory barrier if a full grace period did elapse.
> > 
> > Reported-by: Peter Zijlstra <pet...@infradead.org>
> 
> More a requested-by, I'd say.

Fair enough...

> > Signed-off-by: Paul E. McKenney <paul...@linux.vnet.ibm.com>

> > +/**
> > + * get_state_synchronize_rcu - Snapshot current RCU state
> > + *
> > + * Returns a cookie that is used by a later call to cond_synchronize_rcu()
> > + * to determine whether or not a full grace period has elapsed in the
> > + * meantime.
> > + */
> > +unsigned long get_state_synchronize_rcu(void)
> > +{
> 
> 	/*
> 	 * Make sure this load happens before anything following it; such that
> 	 * ... ?
> 	 */

Good point, this barrier is a bit obscure.  Here you go:

	/*
	 * Make sure this load happens before the purportedly
	 * time-consuming work between get_state_synchronize_rcu()
	 * and cond_synchronize_rcu().
	 */

So all you lose by leaving this barrier out is a slightly higher
probability of invoking synchronize_rcu() at cond_synchronize_rcu() time.
And without the barrier there is some chance of the CPU doing the load in
cond_synchronize_rcu() before this load in get_state_synchronize_rcu().
Which should be OK, but simpler to exclude this than have to worry about
counters out of sync.  Besides, you are taking a grace-period delay in
here somewhere, so the memory barrier is way down in the noise.

> The way I imagined using this is taking this snapshot after the RCU
> operation, such that we err towards seeing a later grace period and
> synchronizing too much instead of seeing an earlier and sync'ing too
> little.
> 
> Such use would suggest the barrier the other way around.

Ah, but that ordering happens at the updater's end.  And there
are in fact memory barriers before and after the increment of
->gpnum in rcu_gp_init():

	/* Advance to a new grace period and initialize state. */
	record_gp_stall_check_time(rsp);
	/* Record GP times before starting GP, hence smp_store_release(). */
	smp_store_release(&rsp->gpnum, rsp->gpnum + 1);
	trace_rcu_grace_period(rsp->name, rsp->gpnum, TPS("start"));
	raw_spin_unlock_irq(&rnp->lock);

	/* Exclude any concurrent CPU-hotplug operations. */
	mutex_lock(&rsp->onoff_mutex);
	smp_mb__after_unlock_lock(); /* ->gpnum increment before GP! */

The smp_store_release() provides the memory barrier before, and the
smp_mb__after_unlock_lock() provides it after.

> > +	return smp_load_acquire(&rcu_state->gpnum);
> > +}
> > +EXPORT_SYMBOL_GPL(get_state_synchronize_rcu);
> 
> I can't say I'm excited about that function name; but I'm also
> completely lacking in alternatives.

;-)

> > +
> > +/**
> > + * cond_synchronize_rcu - Conditionally wait for an RCU grace period
> > + *
> > + * @oldstate: return value from earlier call to get_state_synchronize_rcu()
> > + *
> > + * If a full RCU grace period has elapsed since the earlier call to
> > + * get_state_synchronize_rcu(), just return.  Otherwise, invoke
> > + * synchronize_rcu() to wait for a full grace period.
> > + */
> > +void cond_synchronize_rcu(unsigned long oldstate)
> > +{
> > +	unsigned long newstate = smp_load_acquire(&rcu_state->completed);
> 
> Again, uncommented barriers; the load_acquire seems to suggest you want
> to make sure the sync_rcu() bits happen after this read. But seeing how
> sync_rcu() will invariably read the same data again and you get an
> address dep this seems somewhat superfluous.

Not quite!  The get_state_synchronize_rcu() function reads ->gpnum,
but cond_synchronize_rcu() reads ->completed.  The

> Then again, can't hurt I suppose :-)

> > +	if (ULONG_CMP_GE(oldstate, newstate))
> 
> So if the observed active gp (oldstate) is ahead or equal to the last
> completed gp, then we wait?

Yep!  I will risk an ASCII diagram:

3:                         +gpnum+-- ...
                           |     |
2:               +gpnum+---+--completed--+
                 |     |
1:   +gpnum+---+--completed--+
     |     |
0: +-+--completed--+

A full RCU grace period happens between a pair of "|"s on the same line.
By inspection, if your snapshot of ->gpnum is greater than the current
value of ->completed, a grace period has passed.
Re: [PATCH] [RFC] perf: Fix a race between ring_buffer_detach() and ring_buffer_wakeup()
> rcu: Provide grace-period piggybacking API
> 
> The following pattern is currently not well supported by RCU:
> 
> 1.	Make data element inaccessible to RCU readers.
> 
> 2.	Do work that probably lasts for more than one grace period.
> 
> 3.	Do something to make sure RCU readers in flight before #1 above
> 	have completed.
> 
> Here are some things that could currently be done:
> 
> a.	Do a synchronize_rcu() unconditionally at either #1 or #3 above.
> 	This works, but imposes needless work and latency.
> 
> b.	Post an RCU callback at #1 above that does a wakeup, then
> 	wait for the wakeup at #3.  This works well, but likely results
> 	in an extra unneeded grace period.  Open-coding this is also
> 	a bit more semi-tricky code than would be good.
> 
> This commit therefore adds get_state_synchronize_rcu() and
> cond_synchronize_rcu() APIs.  Call get_state_synchronize_rcu() at #1
> above and pass its return value to cond_synchronize_rcu() at #3 above.
> This results in a call to synchronize_rcu() if no grace period has
> elapsed between #1 and #3, but requires only a load, comparison, and
> memory barrier if a full grace period did elapse.
> 
> Reported-by: Peter Zijlstra <pet...@infradead.org>

More a requested-by, I'd say.

> Signed-off-by: Paul E. McKenney <paul...@linux.vnet.ibm.com>

> +/**
> + * get_state_synchronize_rcu - Snapshot current RCU state
> + *
> + * Returns a cookie that is used by a later call to cond_synchronize_rcu()
> + * to determine whether or not a full grace period has elapsed in the
> + * meantime.
> + */
> +unsigned long get_state_synchronize_rcu(void)
> +{

	/*
	 * Make sure this load happens before anything following it; such that
	 * ... ?
	 */

The way I imagined using this is taking this snapshot after the RCU
operation, such that we err towards seeing a later grace period and
synchronizing too much instead of seeing an earlier and sync'ing too
little.

Such use would suggest the barrier the other way around.

> +	return smp_load_acquire(&rcu_state->gpnum);
> +}
> +EXPORT_SYMBOL_GPL(get_state_synchronize_rcu);

I can't say I'm excited about that function name; but I'm also
completely lacking in alternatives.

> +
> +/**
> + * cond_synchronize_rcu - Conditionally wait for an RCU grace period
> + *
> + * @oldstate: return value from earlier call to get_state_synchronize_rcu()
> + *
> + * If a full RCU grace period has elapsed since the earlier call to
> + * get_state_synchronize_rcu(), just return.  Otherwise, invoke
> + * synchronize_rcu() to wait for a full grace period.
> + */
> +void cond_synchronize_rcu(unsigned long oldstate)
> +{
> +	unsigned long newstate = smp_load_acquire(&rcu_state->completed);

Again, uncommented barriers; the load_acquire seems to suggest you want
to make sure the sync_rcu() bits happen after this read. But seeing how
sync_rcu() will invariably read the same data again and you get an
address dep this seems somewhat superfluous.

Then again, can't hurt I suppose :-)

> +	if (ULONG_CMP_GE(oldstate, newstate))

So if the observed active gp (oldstate) is ahead or equal to the last
completed gp, then we wait?

I thought the double grace period thing was for preemptible rcu; is it
done here because you don't want to special case the preemptible rcu
and make do with a single implementation?

And I suppose that on wrap around; we do extra sync_rcu() calls, which
can never be wrong.

> +	synchronize_rcu();
> +}
> +EXPORT_SYMBOL_GPL(cond_synchronize_rcu);
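On the wrap-around remark: ULONG_CMP_GE() is RCU's wrap-tolerant unsigned
comparison (defined in include/linux/rcupdate.h in this era, if memory
serves), so a wrapped counter at worst errs towards the safe side, an
extra synchronize_rcu():

	#define ULONG_CMP_GE(a, b)	(ULONG_MAX / 2 >= (a) - (b))

	/*
	 * Near a wrap, assuming oldstate == 0 was snapshotted just after
	 * ->gpnum wrapped while ->completed still reads ULONG_MAX:
	 * 0 - ULONG_MAX == 1 <= ULONG_MAX / 2, so we (correctly) wait.
	 * Only if the counters advance more than ULONG_MAX / 2 grace
	 * periods between snapshot and check does the comparison go stale,
	 * and then it yields a spurious -- but harmless -- synchronize_rcu().
	 */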
Re: [PATCH] [RFC] perf: Fix a race between ring_buffer_detach() and ring_buffer_wakeup()
rcu: Provide grace-period piggybacking API The following pattern is currently not well supported by RCU: 1.Make data element inaccessible to RCU readers. 2.Do work that probably lasts for more than one grace period. 3.Do something to make sure RCU readers in flight before #1 above have completed. Here are some things that could currently be done: a.Do a synchronize_rcu() unconditionally at either #1 or #3 above. This works, but imposes needless work and latency. b.Post an RCU callback at #1 above that does a wakeup, then wait for the wakeup at #3. This works well, but likely results in an extra unneeded grace period. Open-coding this is also a bit more semi-tricky code than would be good. This commit therefore adds get_state_synchronize_rcu() and cond_synchronize_rcu() APIs. Call get_state_synchronize_rcu() at #1 above and pass its return value to cond_synchronize_rcu() at #3 above. This results in a call to synchronize_rcu() if no grace period has elapsed between #1 and #3, but requires only a load, comparison, and memory barrier if a full grace period did elapse. Reported-by: Peter Zijlstra pet...@infradead.org More a requested-by, I'd say. Signed-off-by: Paul E. McKenney paul...@linux.vnet.ibm.com +/** + * get_state_synchronize_rcu - Snapshot current RCU state + * + * Returns a cookie that is used by a later call to cond_synchronize_rcu() + * to determine whether or not a full grace period has elapsed in the + * meantime. + */ +unsigned long get_state_synchronize_rcu(void) +{ /* * Make sure this load happens before anything following it; such that * ... ? */ The way I imaged using this is taking this snapshot after the RCU operation, such that we err towards seeing a later grace period and synchronizing too much in stead of seeing an earlier and sync'ing too little. Such use would suggest the barrier the other way around. + return smp_load_acquire(rcu_state-gpnum); +} +EXPORT_SYMBOL_GPL(get_state_synchronize_rcu); I can't say I'm excited about that function name; but I'm also completely lacking in alternatives. + +/** + * cond_synchronize_rcu - Conditionally wait for an RCU grace period + * + * @oldstate: return value from earlier call to get_state_synchronize_rcu() + * + * If a full RCU grace period has elapsed since the earlier call to + * get_state_synchronize_rcu(), just return. Otherwise, invoke + * synchronize_rcu() to wait for a full grace period. + */ +void cond_synchronize_rcu(unsigned long oldstate) +{ + unsigned long newstate = smp_load_acquire(rcu_state-completed); Again, uncommented barriers; the load_acquire seems to suggest you want to make sure the sync_rcu() bits happen after this read. But seeing how sync_rcu() will invariably read the same data again and you get an address dep this seems somewhat superfluous. Then again, can't hurt I suppose :-) + if (ULONG_CMP_GE(oldstate, newstate)) So if the observed active gp (oldstate) is ahead or equal to the last completed gp, then we wait? I thought the double grace period thing was for preemptible rcu; is it done here because you don't want to special case the preemptible rcu and make do with a single implementation? And I suppose that on wrap around; we do extra sync_rcu() calls, which can never be wrong. + synchronize_rcu(); +} +EXPORT_SYMBOL_GPL(cond_synchronize_rcu); -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] [RFC] perf: Fix a race between ring_buffer_detach() and ring_buffer_wakeup()
On Mon, Mar 17, 2014 at 12:18:07PM +0100, Peter Zijlstra wrote: rcu: Provide grace-period piggybacking API The following pattern is currently not well supported by RCU: 1. Make data element inaccessible to RCU readers. 2. Do work that probably lasts for more than one grace period. 3. Do something to make sure RCU readers in flight before #1 above have completed. Here are some things that could currently be done: a. Do a synchronize_rcu() unconditionally at either #1 or #3 above. This works, but imposes needless work and latency. b. Post an RCU callback at #1 above that does a wakeup, then wait for the wakeup at #3. This works well, but likely results in an extra unneeded grace period. Open-coding this is also a bit more semi-tricky code than would be good. This commit therefore adds get_state_synchronize_rcu() and cond_synchronize_rcu() APIs. Call get_state_synchronize_rcu() at #1 above and pass its return value to cond_synchronize_rcu() at #3 above. This results in a call to synchronize_rcu() if no grace period has elapsed between #1 and #3, but requires only a load, comparison, and memory barrier if a full grace period did elapse. Reported-by: Peter Zijlstra pet...@infradead.org More a requested-by, I'd say. Fair enough... Signed-off-by: Paul E. McKenney paul...@linux.vnet.ibm.com +/** + * get_state_synchronize_rcu - Snapshot current RCU state + * + * Returns a cookie that is used by a later call to cond_synchronize_rcu() + * to determine whether or not a full grace period has elapsed in the + * meantime. + */ +unsigned long get_state_synchronize_rcu(void) +{ /* * Make sure this load happens before anything following it; such that * ... ? */ Good point, this barrier is a bit obsure. Here you go: /* * Make sure this load happens before the purportedly * time-consuming work between get_state_synchronize_rcu() * and cond_synchronize_rcu(). */ So all you lose by leaving this barrier out is a slightly higher probability of invoking synchronize_rcu() at cond_synchronize_rcu() time. And without the barrier there is some chance of the CPU doing the load in cond_synchronize_rcu() before this load in get_state_synchronize_rcu(). Which should be OK, but simpler to exclude this than have to worry about counters out of sync. Besides, you are taking a grace-period delay in here somewhere, so the memory barrier is way down in the noise. The way I imaged using this is taking this snapshot after the RCU operation, such that we err towards seeing a later grace period and synchronizing too much in stead of seeing an earlier and sync'ing too little. Such use would suggest the barrier the other way around. Ah, but that ordering happens at the updater's end. And there are in fact memory barriers before and after the increment of -gpnum in rcu_gp_init(): /* Advance to a new grace period and initialize state. */ record_gp_stall_check_time(rsp); /* Record GP times before starting GP, hence smp_store_release(). */ smp_store_release(rsp-gpnum, rsp-gpnum + 1); trace_rcu_grace_period(rsp-name, rsp-gpnum, TPS(start)); raw_spin_unlock_irq(rnp-lock); /* Exclude any concurrent CPU-hotplug operations. */ mutex_lock(rsp-onoff_mutex); smp_mb__after_unlock_lock(); /* -gpnum increment before GP! */ The smp_store_release() provides the memory barrier before, and the smp_mb__after_unlock_lock() provides it after. + return smp_load_acquire(rcu_state-gpnum); +} +EXPORT_SYMBOL_GPL(get_state_synchronize_rcu); I can't say I'm excited about that function name; but I'm also completely lacking in alternatives. 
;-) + +/** + * cond_synchronize_rcu - Conditionally wait for an RCU grace period + * + * @oldstate: return value from earlier call to get_state_synchronize_rcu() + * + * If a full RCU grace period has elapsed since the earlier call to + * get_state_synchronize_rcu(), just return. Otherwise, invoke + * synchronize_rcu() to wait for a full grace period. + */ +void cond_synchronize_rcu(unsigned long oldstate) +{ + unsigned long newstate = smp_load_acquire(rcu_state-completed); Again, uncommented barriers; the load_acquire seems to suggest you want to make sure the sync_rcu() bits happen after this read. But seeing how sync_rcu() will invariably read the same data again and you get an address dep this seems somewhat superfluous. Not quite! The get_state_synchronize_rcu() function reads -gpnum, but cond_synchronize_rcu() reads -completed. The Then again, can't hurt I suppose :-) + if (ULONG_CMP_GE(oldstate, newstate)) So if the observed active gp (oldstate) is ahead or equal to the last completed gp, then we wait? Yep! I will risk an ASCII diagram: 3:
Re: [PATCH] [RFC] perf: Fix a race between ring_buffer_detach() and ring_buffer_wakeup()
On Mon, Mar 17, 2014 at 09:48:15AM -0700, Paul E. McKenney wrote: Good point, this barrier is a bit obsure. Here you go: /* * Make sure this load happens before the purportedly * time-consuming work between get_state_synchronize_rcu() * and cond_synchronize_rcu(). */ So all you lose by leaving this barrier out is a slightly higher probability of invoking synchronize_rcu() at cond_synchronize_rcu() time. And without the barrier there is some chance of the CPU doing the load in cond_synchronize_rcu() before this load in get_state_synchronize_rcu(). Which should be OK, but simpler to exclude this than have to worry about counters out of sync. Besides, you are taking a grace-period delay in here somewhere, so the memory barrier is way down in the noise. Oh sure; expense wasn't my concern. But I always question barriers without comments, and I couldn't come up with a reason for this one. The way I imaged using this is taking this snapshot after the RCU operation, such that we err towards seeing a later grace period and synchronizing too much in stead of seeing an earlier and sync'ing too little. Such use would suggest the barrier the other way around. Ah, but that ordering happens at the updater's end. And there are in fact memory barriers before and after the increment of -gpnum in rcu_gp_init(): /* Advance to a new grace period and initialize state. */ record_gp_stall_check_time(rsp); /* Record GP times before starting GP, hence smp_store_release(). */ smp_store_release(rsp-gpnum, rsp-gpnum + 1); trace_rcu_grace_period(rsp-name, rsp-gpnum, TPS(start)); raw_spin_unlock_irq(rnp-lock); /* Exclude any concurrent CPU-hotplug operations. */ mutex_lock(rsp-onoff_mutex); smp_mb__after_unlock_lock(); /* -gpnum increment before GP! */ The smp_store_release() provides the memory barrier before, and the smp_mb__after_unlock_lock() provides it after. That wasn't exactly what I was talking about. So suppose: spin_lock(lock) list_del_rcu(entry-foo); spin_unlock(lock); rcu_stamp = get_state_sync_rcu(); Now we very much want to ensure to get the gpnum _after_ completing that list_del_rcu(), because if we were to observe the gpnum before it could complete before and we'd fail to wait long enough. Now in the above, and with your proposed implementation, there is in fact no guarantee the load doesn't slip up past the list_del_rcu(). The RELEASE per the spin_unlock() only guarantees the list_del_rcu() happens before it, but the gpnum load can slip in, and in fact pass over the list_del_rcu(). +void cond_synchronize_rcu(unsigned long oldstate) +{ + unsigned long newstate = smp_load_acquire(rcu_state-completed); Again, uncommented barriers; the load_acquire seems to suggest you want to make sure the sync_rcu() bits happen after this read. But seeing how sync_rcu() will invariably read the same data again and you get an address dep this seems somewhat superfluous. Not quite! The get_state_synchronize_rcu() function reads -gpnum, but cond_synchronize_rcu() reads -completed. The I was talking about the synchronize_rcu() call later in cond_synchronize_rcu(). But looking at its implementation; they don't in fact load either gpnum or completed. + if (ULONG_CMP_GE(oldstate, newstate)) So if the observed active gp (oldstate) is ahead or equal to the last completed gp, then we wait? Yep! I will risk an ASCII diagram: 3: +gpnum+-- ... | | 2:+gpnum+---+--completed--+ | | 1: +gpnum+---+--completed--+ | | 0:+-+--completed--+ A full RCU grace period happens between a pair of |s on the same line. 
By inspection, if your snapshot of -gpnum is greater than the current value of -completed, a grace period has passed. OK, so I get the part, but I'm not sure I get the = part of the above. The way I read the diagram, when completed matches gpnum the grace period is done and we don't have to wait anymore. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] [RFC] perf: Fix a race between ring_buffer_detach() and ring_buffer_wakeup()
On Mon, Mar 17, 2014 at 06:30:38PM +0100, Peter Zijlstra wrote:
> On Mon, Mar 17, 2014 at 09:48:15AM -0700, Paul E. McKenney wrote:
> > Good point, this barrier is a bit obscure. Here you go:
> >
> > 	/*
> > 	 * Make sure this load happens before the purportedly
> > 	 * time-consuming work between get_state_synchronize_rcu()
> > 	 * and cond_synchronize_rcu().
> > 	 */
> >
> > So all you lose by leaving this barrier out is a slightly higher
> > probability of invoking synchronize_rcu() at cond_synchronize_rcu()
> > time. And without the barrier there is some chance of the CPU doing
> > the load in cond_synchronize_rcu() before this load in
> > get_state_synchronize_rcu(). Which should be OK, but simpler to
> > exclude this than have to worry about counters out of sync. Besides,
> > you are taking a grace-period delay in here somewhere, so the memory
> > barrier is way down in the noise.
>
> Oh sure; expense wasn't my concern. But I always question barriers
> without comments, and I couldn't come up with a reason for this one.
>
> The way I imagined using this is taking this snapshot after the RCU
> operation, such that we err towards seeing a later grace period and
> synchronizing too much instead of seeing an earlier one and sync'ing
> too little.
>
> Such use would suggest the barrier the other way around.

Ah, but that ordering happens at the updater's end. And there are in
fact memory barriers before and after the increment of ->gpnum in
rcu_gp_init():

	/* Advance to a new grace period and initialize state. */
	record_gp_stall_check_time(rsp);
	/* Record GP times before starting GP, hence smp_store_release(). */
	smp_store_release(&rsp->gpnum, rsp->gpnum + 1);
	trace_rcu_grace_period(rsp->name, rsp->gpnum, TPS("start"));
	raw_spin_unlock_irq(&rnp->lock);

	/* Exclude any concurrent CPU-hotplug operations. */
	mutex_lock(&rsp->onoff_mutex);
	smp_mb__after_unlock_lock(); /* ->gpnum increment before GP! */

The smp_store_release() provides the memory barrier before, and the
smp_mb__after_unlock_lock() provides it after.

That wasn't exactly what I was talking about. So suppose:

	spin_lock(&lock);
	list_del_rcu(&entry->foo);
	spin_unlock(&lock);

	rcu_stamp = get_state_sync_rcu();

Now we very much want to ensure to get the gpnum _after_ completing that
list_del_rcu(), because if we were to observe the gpnum before, it could
complete before and we'd fail to wait long enough.

Now in the above, and with your proposed implementation, there is in
fact no guarantee the load doesn't slip up past the list_del_rcu(). The
RELEASE per the spin_unlock() only guarantees the list_del_rcu() happens
before it, but the gpnum load can slip in, and in fact pass over the
list_del_rcu().

Right you are! Fixed patch below.

	+void cond_synchronize_rcu(unsigned long oldstate)
	+{
	+	unsigned long newstate = smp_load_acquire(&rcu_state->completed);

Again, uncommented barriers; the load_acquire seems to suggest you want
to make sure the sync_rcu() bits happen after this read. But seeing how
sync_rcu() will invariably read the same data again and you get an
address dep this seems somewhat superfluous.

Not quite! The get_state_synchronize_rcu() function reads ->gpnum, but
cond_synchronize_rcu() reads ->completed.

I was talking about the synchronize_rcu() call later in
cond_synchronize_rcu(). But looking at its implementation, they don't in
fact load either gpnum or completed.

They don't directly, but they register callbacks, which causes the RCU
core code to access them. But if we invoke synchronize_rcu(), it really
doesn't matter what either get_state_synchronize_rcu() or
cond_synchronize_rcu() saw.

The memory barrier from cond_synchronize_rcu()'s smp_load_acquire() is
instead needed for the case where we -don't- invoke synchronize_rcu().
In that case, whatever was supposed to follow the grace period had
better really follow the grace period.

	+	if (ULONG_CMP_GE(oldstate, newstate))

So if the observed active gp (oldstate) is ahead of or equal to the last
completed gp, then we wait?

Yep! I will risk an ASCII diagram:

	3:                                  +----gpnum----+-- ...
	                                    |             |
	2:                    +----gpnum----+--completed--+
	                      |             |
	1:      +----gpnum----+--completed--+
	        |             |
	0:      +--completed--+

A full RCU grace period happens between a pair of |s on the same line.
By inspection, if your snapshot of ->gpnum is greater than the current
value of ->completed, a grace period has passed.

OK, so I get the > part, but I'm not sure I get the = part of the above.
The way I read the diagram, when completed matches gpnum the grace
period is done and we don't have to wait anymore.

Absolutely not!
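To make the "=" case above concrete, here is an illustrative walk-through
(not from the thread; the counter values are made up):

	/*
	 * ->gpnum counts grace periods that have started; ->completed
	 * counts grace periods that have finished.  Suppose GP #42 is
	 * already in progress when the snapshot is taken:
	 *
	 *	oldstate = ->gpnum;	 /* 42: started, not yet finished */
	 *
	 * and that cond_synchronize_rcu() later observes:
	 *
	 *	newstate = ->completed;	 /* 42: GP #42 has now finished */
	 *
	 * GP #42 waits only for readers that began before it started,
	 * and it started before the snapshot.  A reader that began just
	 * before the snapshot, while GP #42 was already running, is not
	 * covered by it.  Only ->completed reaching 43, i.e. a grace
	 * period that began after the snapshot, proves a full grace
	 * period has elapsed.  Hence ULONG_CMP_GE(oldstate, newstate)
	 * must wait.
	 */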
Re: [PATCH] [RFC] perf: Fix a race between ring_buffer_detach() and ring_buffer_wakeup()
On Fri, Mar 14, 2014 at 04:02:31PM -0700, Paul E. McKenney wrote:
> On Fri, Mar 14, 2014 at 11:43:17PM +0100, Peter Zijlstra wrote:
> > On Fri, Mar 14, 2014 at 01:47:37PM -0700, Paul E. McKenney wrote:
> > > This general idea can be made to work, but it will need some
> > > internal-to-RCU help.  One vulnerability of the patch below is the
> > > following sequence of steps:
> > >
> > > 1.	RCU has just finished a grace period, and is doing the
> > >	end-of-grace-period accounting.
> > >
> > > 2.	The code below invokes rcu_batches_completed().  Let's assume
> > >	the result returned is 42.
> > >
> > > 3.	RCU completes the end-of-grace-period accounting, and increments
> > >	rcu_sched_state.completed.
> > >
> > > 4.	The code below checks ->rcu_batches against the result from
> > >	another invocation of rcu_batches_completed() and sees that
> > >	the 43 is not equal to 42, so skips the synchronize_rcu().
> > >
> > > Except that a grace period has not actually completed.  Boom!!!
> > >
> > > The problem is that rcu_batches_completed() is only intended to give
> > > progress information on RCU.
> >
> > Ah, I thought I was missing something when I was looking through the
> > rcu code in a hurry :-)
>
> Well, given that I sometimes miss things when looking through RCU code
> carefully, I guess I cannot give you too much trouble about it.
>
> > I knew there'd be some subtlety between completed and gpnum and such :-)
>
> Some of which I have learned about one RCU bug at a time.  ;-)
>
> > > What I can do is give you a pair of functions, one to take a snapshot
> > > of the current grace-period state (returning an unsigned long) and
> > > another to evaluate a previous snapshot, invoking synchronize_rcu()
> > > if there has not been a full grace period in the meantime.
> > >
> > > The most straightforward approach would involve acquiring the global
> > > rcu_state ->lock on each call, which I am guessing just might be
> > > considered to be excessive overhead.  ;-)  I should be able to
> > > decrease the overhead to a memory barrier on each call, and perhaps
> > > even down to an smp_load_acquire().  Accessing the RCU state probably
> > > costs you a cache miss both times.
> > >
> > > Thoughts?
> >
> > Sounds fine, the attach isn't a hotpath, so even the locked version
> > should be fine, but I won't keep you from making it all fancy and such
> > :-)
>
> Fair enough, let me see what I can come up with.

And here is an untested patch.  Thoughts?  (And yes, I need to update
documentation and torture tests accordingly.)

							Thanx, Paul

rcu: Provide grace-period piggybacking API

The following pattern is currently not well supported by RCU:

1.	Make data element inaccessible to RCU readers.

2.	Do work that probably lasts for more than one grace period.

3.	Do something to make sure RCU readers in flight before #1 above
	have completed.

Here are some things that could currently be done:

a.	Do a synchronize_rcu() unconditionally at either #1 or #3 above.
	This works, but imposes needless work and latency.

b.	Post an RCU callback at #1 above that does a wakeup, then wait
	for the wakeup at #3.  This works well, but likely results in an
	extra unneeded grace period.  Open-coding this is also a bit more
	semi-tricky code than would be good.

This commit therefore adds get_state_synchronize_rcu() and
cond_synchronize_rcu() APIs.  Call get_state_synchronize_rcu() at #1
above and pass its return value to cond_synchronize_rcu() at #3 above.
This results in a call to synchronize_rcu() if no grace period has
elapsed between #1 and #3, but requires only a load, comparison, and
memory barrier if a full grace period did elapse.

Reported-by: Peter Zijlstra
Signed-off-by: Paul E. McKenney

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index c9be2235712c..dbf0f225bca0 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -1503,13 +1503,14 @@ static int rcu_gp_init(struct rcu_state *rsp)

 	/* Advance to a new grace period and initialize state. */
 	record_gp_stall_check_time(rsp);
-	smp_wmb(); /* Record GP times before starting GP. */
-	rsp->gpnum++;
+	/* Record GP times before starting GP, hence smp_store_release(). */
+	smp_store_release(&rsp->gpnum, rsp->gpnum + 1);
 	trace_rcu_grace_period(rsp->name, rsp->gpnum, TPS("start"));
 	raw_spin_unlock_irq(&rnp->lock);

 	/* Exclude any concurrent CPU-hotplug operations. */
 	mutex_lock(&rsp->onoff_mutex);
+	smp_mb__after_unlock_lock(); /* ->gpnum increment before GP! */

 	/*
 	 * Set the quiescent-state-needed bits in all the rcu_node
@@ -1638,10 +1639,11 @@ static void rcu_gp_cleanup(struct rcu_state *rsp)
 	}
 	rnp = rcu_get_root(rsp);
 	raw_spin_lock_irq(&rnp->lock);
-	smp_mb__after_unlock_lock();
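As a usage illustration of the two APIs this patch adds, consider the
following sketch (names such as my_lock, my_list, struct my_element and
rework_element() are made up for illustration; this is not code from the
thread):

	#include <linux/list.h>
	#include <linux/rculist.h>
	#include <linux/rcupdate.h>
	#include <linux/spinlock.h>

	struct my_element {
		struct list_head node;
		/* ... payload ... */
	};

	static LIST_HEAD(my_list);
	static DEFINE_SPINLOCK(my_lock);

	static void rework_element(struct my_element *el); /* slow work, made up */

	static void recycle_element(struct my_element *el)
	{
		unsigned long snap;

		spin_lock(&my_lock);
		list_del_rcu(&el->node);	/* #1: hide from new readers */
		spin_unlock(&my_lock);

		snap = get_state_synchronize_rcu();

		rework_element(el);		/* #2: likely spans a grace period */

		/* #3: calls synchronize_rcu() only if no full GP has elapsed. */
		cond_synchronize_rcu(snap);

		spin_lock(&my_lock);
		list_add_rcu(&el->node, &my_list);	/* node now safely reusable */
		spin_unlock(&my_lock);
	}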
Re: [PATCH] [RFC] perf: Fix a race between ring_buffer_detach() and ring_buffer_wakeup()
On Fri, Mar 14, 2014 at 11:43:17PM +0100, Peter Zijlstra wrote:
> On Fri, Mar 14, 2014 at 01:47:37PM -0700, Paul E. McKenney wrote:
> > This general idea can be made to work, but it will need some
> > internal-to-RCU help.  One vulnerability of the patch below is the
> > following sequence of steps:
> >
> > 1.	RCU has just finished a grace period, and is doing the
> >	end-of-grace-period accounting.
> >
> > 2.	The code below invokes rcu_batches_completed().  Let's assume
> >	the result returned is 42.
> >
> > 3.	RCU completes the end-of-grace-period accounting, and increments
> >	rcu_sched_state.completed.
> >
> > 4.	The code below checks ->rcu_batches against the result from
> >	another invocation of rcu_batches_completed() and sees that
> >	the 43 is not equal to 42, so skips the synchronize_rcu().
> >
> > Except that a grace period has not actually completed.  Boom!!!
> >
> > The problem is that rcu_batches_completed() is only intended to give
> > progress information on RCU.
>
> Ah, I thought I was missing something when I was looking through the
> rcu code in a hurry :-)

Well, given that I sometimes miss things when looking through RCU code
carefully, I guess I cannot give you too much trouble about it.

> I knew there'd be some subtlety between completed and gpnum and such :-)

Some of which I have learned about one RCU bug at a time.  ;-)

> > What I can do is give you a pair of functions, one to take a snapshot
> > of the current grace-period state (returning an unsigned long) and
> > another to evaluate a previous snapshot, invoking synchronize_rcu()
> > if there has not been a full grace period in the meantime.
> >
> > The most straightforward approach would involve acquiring the global
> > rcu_state ->lock on each call, which I am guessing just might be
> > considered to be excessive overhead.  ;-)  I should be able to
> > decrease the overhead to a memory barrier on each call, and perhaps
> > even down to an smp_load_acquire().  Accessing the RCU state probably
> > costs you a cache miss both times.
> >
> > Thoughts?
>
> Sounds fine, the attach isn't a hotpath, so even the locked version
> should be fine, but I won't keep you from making it all fancy and such
> :-)

Fair enough, let me see what I can come up with.

							Thanx, Paul

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] [RFC] perf: Fix a race between ring_buffer_detach() and ring_buffer_wakeup()
On Fri, Mar 14, 2014 at 01:47:37PM -0700, Paul E. McKenney wrote:
> This general idea can be made to work, but it will need some
> internal-to-RCU help.  One vulnerability of the patch below is the
> following sequence of steps:
>
> 1.	RCU has just finished a grace period, and is doing the
>	end-of-grace-period accounting.
>
> 2.	The code below invokes rcu_batches_completed().  Let's assume
>	the result returned is 42.
>
> 3.	RCU completes the end-of-grace-period accounting, and increments
>	rcu_sched_state.completed.
>
> 4.	The code below checks ->rcu_batches against the result from
>	another invocation of rcu_batches_completed() and sees that
>	the 43 is not equal to 42, so skips the synchronize_rcu().
>
> Except that a grace period has not actually completed.  Boom!!!
>
> The problem is that rcu_batches_completed() is only intended to give
> progress information on RCU.

Ah, I thought I was missing something when I was looking through the rcu
code in a hurry :-)

I knew there'd be some subtlety between completed and gpnum and such :-)

> What I can do is give you a pair of functions, one to take a snapshot of
> the current grace-period state (returning an unsigned long) and another
> to evaluate a previous snapshot, invoking synchronize_rcu() if there has
> not been a full grace period in the meantime.
>
> The most straightforward approach would involve acquiring the global
> rcu_state ->lock on each call, which I am guessing just might be
> considered to be excessive overhead.  ;-)  I should be able to decrease
> the overhead to a memory barrier on each call, and perhaps even down
> to an smp_load_acquire().  Accessing the RCU state probably costs you
> a cache miss both times.
>
> Thoughts?

Sounds fine, the attach isn't a hotpath, so even the locked version
should be fine, but I won't keep you from making it all fancy and such
:-)
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
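For reference, the two implementation strategies being weighed here would
look roughly like this (an illustrative sketch only; the lock-based
variant was never posted, and the name get_state_synchronize_rcu_locked()
is invented for contrast):

	/* Strategy 1: exact but heavyweight, snapshot under the global lock. */
	unsigned long get_state_synchronize_rcu_locked(void)
	{
		unsigned long flags, snap;

		raw_spin_lock_irqsave(&rcu_state->lock, flags);
		snap = rcu_state->gpnum;
		raw_spin_unlock_irqrestore(&rcu_state->lock, flags);
		return snap;
	}

	/* Strategy 2: an ordered plain load, as discussed earlier in the thread. */
	unsigned long get_state_synchronize_rcu(void)
	{
		/* Order prior updates before the snapshot... */
		smp_mb();
		/* ...and the snapshot before the caller's long-running work. */
		return smp_load_acquire(&rcu_state->gpnum);
	}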
Re: [PATCH] [RFC] perf: Fix a race between ring_buffer_detach() and ring_buffer_wakeup()
On Fri, Mar 14, 2014 at 10:50:33AM +0100, Peter Zijlstra wrote:
> On Thu, Mar 13, 2014 at 12:58:16PM -0700, Paul E. McKenney wrote:
> > On Fri, Mar 07, 2014 at 03:38:46PM +0200, Alexander Shishkin wrote:
> > > This is more of a problem description than an actual bugfix, but
> > > currently ring_buffer_detach() can kick in while ring_buffer_wakeup()
> > > is traversing the ring buffer's event list, leading to cpu stalls.
> > >
> > > What this patch does is crude, but fixes the problem, which is: one
> > > rcu grace period has to elapse between ring_buffer_detach() and
> > > subsequent ring_buffer_attach(), otherwise either the attach will
> > > fail or the wakeup will misbehave. Also, making it a call_rcu()
> > > callback will make it race with attach().
> > >
> > > Another solution that I see is to check for
> > > list_empty(&event->rb_entry) before wake_up_all() in
> > > ring_buffer_wakeup() and restart the list traversal if it is indeed
> > > empty, but that is ugly too as there will be extra wakeups on some
> > > events.
> > >
> > > Anything that I'm missing here? Any better ideas?
> >
> > Not sure it qualifies as "better", but the call to ring_buffer_detach()
> > is going to free the event anyway, so the synchronize_rcu() and the
> > INIT_LIST_HEAD() should not be needed in that case.  I am guessing
> > that the same is true for perf_mmap_close().
> >
> > So that leaves the call in perf_event_set_output(), which detaches
> > from an old rb before attaching that same event to a new one.  So
> > maybe have the synchronize_rcu() and INIT_LIST_HEAD() instead be in
> > the "if (old_rb)", which might be a reasonably uncommon case?
>
> How about something like so that only does the sync_rcu() if really
> needed.

This general idea can be made to work, but it will need some
internal-to-RCU help.  One vulnerability of the patch below is the
following sequence of steps:

1.	RCU has just finished a grace period, and is doing the
	end-of-grace-period accounting.

2.	The code below invokes rcu_batches_completed().  Let's assume
	the result returned is 42.

3.	RCU completes the end-of-grace-period accounting, and increments
	rcu_sched_state.completed.

4.	The code below checks ->rcu_batches against the result from
	another invocation of rcu_batches_completed() and sees that
	the 43 is not equal to 42, so skips the synchronize_rcu().

Except that a grace period has not actually completed.  Boom!!!

The problem is that rcu_batches_completed() is only intended to give
progress information on RCU.

What I can do is give you a pair of functions, one to take a snapshot of
the current grace-period state (returning an unsigned long) and another
to evaluate a previous snapshot, invoking synchronize_rcu() if there has
not been a full grace period in the meantime.

The most straightforward approach would involve acquiring the global
rcu_state ->lock on each call, which I am guessing just might be
considered to be excessive overhead.  ;-)  I should be able to decrease
the overhead to a memory barrier on each call, and perhaps even down
to an smp_load_acquire().  Accessing the RCU state probably costs you
a cache miss both times.

Thoughts?
							Thanx, Paul

> ---
>  kernel/events/core.c     | 11 +++++++++--
>  kernel/events/internal.h |  1 +
>  2 files changed, 10 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index 661951ab8ae7..88c8c810e081 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -3856,12 +3856,17 @@ static void ring_buffer_attach(struct perf_event *event,
>  {
>  	unsigned long flags;
>
> +	if (rb->rcu_batches == rcu_batches_completed()) {
> +		synchronize_rcu();
> +		INIT_LIST_HEAD(&event->rb_entry);
> +	}
> +
>  	if (!list_empty(&event->rb_entry))
>  		return;
>
>  	spin_lock_irqsave(&rb->event_lock, flags);
>  	if (list_empty(&event->rb_entry))
> -		list_add(&event->rb_entry, &rb->event_list);
> +		list_add_rcu(&event->rb_entry, &rb->event_list);
>  	spin_unlock_irqrestore(&rb->event_lock, flags);
>  }
>
> @@ -3873,9 +3878,11 @@ static void ring_buffer_detach(struct perf_event *event, struct ring_buffer *rb)
>  		return;
>
>  	spin_lock_irqsave(&rb->event_lock, flags);
> -	list_del_init(&event->rb_entry);
> +	list_del_rcu(&event->rb_entry);
>  	wake_up_all(&event->waitq);
>  	spin_unlock_irqrestore(&rb->event_lock, flags);
> +
> +	rb->rcu_batches = rcu_batches_completed();
>  }
>
>  static void ring_buffer_wakeup(struct perf_event *event)
>
> diff --git a/kernel/events/internal.h b/kernel/events/internal.h
> index 569b218782ad..698b5881b2a4 100644
> --- a/kernel/events/internal.h
> +++ b/kernel/events/internal.h
> @@ -30,6 +30,7 @@ struct ring_buffer {
>  	/* poll crap */
>  	spinlock_t			event_lock;
>  	struct list_head		event_list;
> +	unsigned long			rcu_batches;
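The interleaving in steps 1-4 can be drawn as a timeline (an illustrative
sketch; the 42/43 values follow the example above):

	/*
	 *   perf side (patch above)        RCU grace-period machinery
	 *   -----------------------        --------------------------
	 *                                  a grace period finishes; the
	 *                                  end-of-GP accounting begins,
	 *                                  but ->completed still reads 42
	 *   ring_buffer_detach():
	 *     list_del_rcu(&event->rb_entry);
	 *     rb->rcu_batches = 42;
	 *                                  accounting completes:
	 *                                  ->completed = 43
	 *   ring_buffer_attach():
	 *     rcu_batches_completed() == 43;
	 *     43 != rb->rcu_batches (42),
	 *     so synchronize_rcu() is skipped
	 *
	 * The grace period that bumped the counter to 43 was already
	 * ending when the list_del_rcu() ran, so it does not cover
	 * readers still traversing the list at detach time: no full
	 * grace period has elapsed since the detach.  Boom.
	 */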
Re: [PATCH] [RFC] perf: Fix a race between ring_buffer_detach() and ring_buffer_wakeup()
On Thu, Mar 13, 2014 at 12:58:16PM -0700, Paul E. McKenney wrote:
> On Fri, Mar 07, 2014 at 03:38:46PM +0200, Alexander Shishkin wrote:
> > This is more of a problem description than an actual bugfix, but
> > currently ring_buffer_detach() can kick in while ring_buffer_wakeup()
> > is traversing the ring buffer's event list, leading to cpu stalls.
> >
> > What this patch does is crude, but fixes the problem, which is: one rcu
> > grace period has to elapse between ring_buffer_detach() and subsequent
> > ring_buffer_attach(), otherwise either the attach will fail or the
> > wakeup will misbehave. Also, making it a call_rcu() callback will make
> > it race with attach().
> >
> > Another solution that I see is to check for list_empty(&event->rb_entry)
> > before wake_up_all() in ring_buffer_wakeup() and restart the list
> > traversal if it is indeed empty, but that is ugly too as there will be
> > extra wakeups on some events.
> >
> > Anything that I'm missing here? Any better ideas?
>
> Not sure it qualifies as "better", but the call to ring_buffer_detach()
> is going to free the event anyway, so the synchronize_rcu() and the
> INIT_LIST_HEAD() should not be needed in that case.  I am guessing that
> the same is true for perf_mmap_close().
>
> So that leaves the call in perf_event_set_output(), which detaches from
> an old rb before attaching that same event to a new one.  So maybe have
> the synchronize_rcu() and INIT_LIST_HEAD() instead be in the
> "if (old_rb)", which might be a reasonably uncommon case?

How about something like so that only does the sync_rcu() if really
needed.

---
 kernel/events/core.c     | 11 +++++++++--
 kernel/events/internal.h |  1 +
 2 files changed, 10 insertions(+), 2 deletions(-)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 661951ab8ae7..88c8c810e081 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -3856,12 +3856,17 @@ static void ring_buffer_attach(struct perf_event *event,
 {
 	unsigned long flags;

+	if (rb->rcu_batches == rcu_batches_completed()) {
+		synchronize_rcu();
+		INIT_LIST_HEAD(&event->rb_entry);
+	}
+
 	if (!list_empty(&event->rb_entry))
 		return;

 	spin_lock_irqsave(&rb->event_lock, flags);
 	if (list_empty(&event->rb_entry))
-		list_add(&event->rb_entry, &rb->event_list);
+		list_add_rcu(&event->rb_entry, &rb->event_list);
 	spin_unlock_irqrestore(&rb->event_lock, flags);
 }

@@ -3873,9 +3878,11 @@ static void ring_buffer_detach(struct perf_event *event, struct ring_buffer *rb)
 		return;

 	spin_lock_irqsave(&rb->event_lock, flags);
-	list_del_init(&event->rb_entry);
+	list_del_rcu(&event->rb_entry);
 	wake_up_all(&event->waitq);
 	spin_unlock_irqrestore(&rb->event_lock, flags);
+
+	rb->rcu_batches = rcu_batches_completed();
 }

 static void ring_buffer_wakeup(struct perf_event *event)

diff --git a/kernel/events/internal.h b/kernel/events/internal.h
index 569b218782ad..698b5881b2a4 100644
--- a/kernel/events/internal.h
+++ b/kernel/events/internal.h
@@ -30,6 +30,7 @@ struct ring_buffer {
 	/* poll crap */
 	spinlock_t			event_lock;
 	struct list_head		event_list;
+	unsigned long			rcu_batches;

 	atomic_t			mmap_count;
 	unsigned long			mmap_locked;
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] [RFC] perf: Fix a race between ring_buffer_detach() and ring_buffer_wakeup()
On Fri, Mar 07, 2014 at 03:38:46PM +0200, Alexander Shishkin wrote:
> This is more of a problem description than an actual bugfix, but
> currently ring_buffer_detach() can kick in while ring_buffer_wakeup()
> is traversing the ring buffer's event list, leading to cpu stalls.
>
> What this patch does is crude, but fixes the problem, which is: one rcu
> grace period has to elapse between ring_buffer_detach() and subsequent
> ring_buffer_attach(), otherwise either the attach will fail or the
> wakeup will misbehave. Also, making it a call_rcu() callback will make
> it race with attach().
>
> Another solution that I see is to check for list_empty(&event->rb_entry)
> before wake_up_all() in ring_buffer_wakeup() and restart the list
> traversal if it is indeed empty, but that is ugly too as there will be
> extra wakeups on some events.
>
> Anything that I'm missing here? Any better ideas?

Not sure it qualifies as "better", but the call to ring_buffer_detach()
is going to free the event anyway, so the synchronize_rcu() and the
INIT_LIST_HEAD() should not be needed in that case.  I am guessing that
the same is true for perf_mmap_close().

So that leaves the call in perf_event_set_output(), which detaches from an
old rb before attaching that same event to a new one.  So maybe have the
synchronize_rcu() and INIT_LIST_HEAD() instead be in the "if (old_rb)",
which might be a reasonably uncommon case?

							Thanx, Paul

> Signed-off-by: Alexander Shishkin
> Cc: Paul McKenney
> ---
>  kernel/events/core.c | 6 ++++--
>  1 file changed, 4 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index 661951a..bce41e0 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -3861,7 +3861,7 @@ static void ring_buffer_attach(struct perf_event *event,
>
>  	spin_lock_irqsave(&rb->event_lock, flags);
>  	if (list_empty(&event->rb_entry))
> -		list_add(&event->rb_entry, &rb->event_list);
> +		list_add_rcu(&event->rb_entry, &rb->event_list);
>  	spin_unlock_irqrestore(&rb->event_lock, flags);
>  }
>
> @@ -3873,9 +3873,11 @@ static void ring_buffer_detach(struct perf_event *event, struct ring_buffer *rb)
>  		return;
>
>  	spin_lock_irqsave(&rb->event_lock, flags);
> -	list_del_init(&event->rb_entry);
> +	list_del_rcu(&event->rb_entry);
>  	wake_up_all(&event->waitq);
>  	spin_unlock_irqrestore(&rb->event_lock, flags);
> +	synchronize_rcu();
> +	INIT_LIST_HEAD(&event->rb_entry);
>  }
>
>  static void ring_buffer_wakeup(struct perf_event *event)
> --
> 1.9.0
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
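Concretely, this suggestion amounts to something like the following (an
illustrative sketch, not a posted patch):

	/* In perf_event_set_output(), roughly: */
	if (old_rb) {
		ring_buffer_detach(event, old_rb);
		/*
		 * Only this path reattaches the event, reusing
		 * event->rb_entry, so only it must wait out
		 * concurrent list walkers.
		 */
		synchronize_rcu();
		INIT_LIST_HEAD(&event->rb_entry);
	}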
[PATCH] [RFC] perf: Fix a race between ring_buffer_detach() and ring_buffer_wakeup()
This is more of a problem description than an actual bugfix, but currently
ring_buffer_detach() can kick in while ring_buffer_wakeup() is traversing
the ring buffer's event list, leading to cpu stalls.

What this patch does is crude, but fixes the problem, which is: one rcu
grace period has to elapse between ring_buffer_detach() and subsequent
ring_buffer_attach(), otherwise either the attach will fail or the wakeup
will misbehave. Also, making it a call_rcu() callback will make it race
with attach().

Another solution that I see is to check for list_empty(&event->rb_entry)
before wake_up_all() in ring_buffer_wakeup() and restart the list
traversal if it is indeed empty, but that is ugly too as there will be
extra wakeups on some events.

Anything that I'm missing here? Any better ideas?

Signed-off-by: Alexander Shishkin
Cc: Paul McKenney
---
 kernel/events/core.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 661951a..bce41e0 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -3861,7 +3861,7 @@ static void ring_buffer_attach(struct perf_event *event,

 	spin_lock_irqsave(&rb->event_lock, flags);
 	if (list_empty(&event->rb_entry))
-		list_add(&event->rb_entry, &rb->event_list);
+		list_add_rcu(&event->rb_entry, &rb->event_list);
 	spin_unlock_irqrestore(&rb->event_lock, flags);
 }

@@ -3873,9 +3873,11 @@ static void ring_buffer_detach(struct perf_event *event, struct ring_buffer *rb)
 		return;

 	spin_lock_irqsave(&rb->event_lock, flags);
-	list_del_init(&event->rb_entry);
+	list_del_rcu(&event->rb_entry);
 	wake_up_all(&event->waitq);
 	spin_unlock_irqrestore(&rb->event_lock, flags);
+	synchronize_rcu();
+	INIT_LIST_HEAD(&event->rb_entry);
 }

 static void ring_buffer_wakeup(struct perf_event *event)
--
1.9.0
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
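For context, the traversal being raced against looks approximately like
this (paraphrased from kernel/events/core.c of that era; not part of the
patch):

	static void ring_buffer_wakeup(struct perf_event *event)
	{
		struct ring_buffer *rb;

		rcu_read_lock();
		rb = rcu_dereference(event->rb);
		if (rb) {
			/*
			 * A concurrent detach can unlink entries from this
			 * list, which is why a grace period must separate
			 * a detach from the following reattach.
			 */
			list_for_each_entry_rcu(event, &rb->event_list, rb_entry)
				wake_up_all(&event->waitq);
		}
		rcu_read_unlock();
	}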