Re: [PATCH tip/core/rcu 1/9] rcu: Provide GP ordering in face of migrations and delays

2017-10-09 Thread Paul E. McKenney
On Mon, Oct 09, 2017 at 10:16:37AM +0200, Peter Zijlstra wrote:
> On Sat, Oct 07, 2017 at 11:28:57AM -0700, Paul E. McKenney wrote:
> > But if you are saying that it would be good to have wait_for_completion()
> > and complete() directly modeled at some point, no argument.  In addition,
> > I hope that the memory model is applied to other tools that analyze kernel
> > code.
> 
> > > I'm not sure I got the point across; so I'll try once more. Without
> > > providing this ordering the completion would be fundamentally broken. It
> > > _must_ provide this ordering.
> > 
> > OK, I now understand what you are getting at, and I do very much like
> > that guarantee.
> 
> Right, so maybe we should update the completion comments a bit to call
> out this property, because I'm not sure it's there.

That would be very good!

> Also, with this, I think the smp_store_release() / smp_load_acquire() is
> a perfectly fine abstraction of it, I don't think the model needs to be
> taught about the completion interface.

Agreed, if we pull too much stuff into the model it might well become
difficult to maintain in and of itself.  So we do need to choose
carefully based on model performance and ease of use.

For a performance example, note that locking can be easily modeled with
xchg_acquire() for lock acquisition and smp_store_release() for lock
release.  But here is the resulting performance compared to directly
modeling locking within the cat code:

                        xchg_acquire() and
   # Processes          smp_store_release()     Direct modeling

   2                              0.061 s               0.007 s
   3                              3.542 s               0.039 s
   4                            569.189 s               0.397 s
   5                      > 113,853.000 s               5.187 s

Combinatorial explosion is a real pain...  The problem is that the
xchg_acquire() / smp_store_release() approach requires the herd7 tool
to consider and then reject all the scenarios where the lock critical
sections overlap.  In contrast, when direct modeling, those overlapping
critical-section scenarios are implicitly rejected by the cat-code that
models locking.  Don't get me wrong, there is still combinatorial
explosion, as you can see from the rightmost column of numbers.
After all, the herd7 tool still needs to consider all possible
lock-acquisition orders.  But the explosion is much less explosive.  ;-)
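
For concreteness, here is a rough sketch of the emulation pattern in
question -- my own illustration, not one of the actual test cases used
for the numbers above.  Each critical section is bracketed by
xchg_acquire() standing in for spin_lock() and smp_store_release()
standing in for spin_unlock(), and the "exists" clause has to carry
extra terms restricting attention to executions in which each
xchg_acquire() actually found the lock free:

C lock-emulation-sketch

{
}

P0(int *lck, int *x)
{
	r0 = xchg_acquire(lck, 1); /* Stands in for spin_lock(). */
	WRITE_ONCE(*x, 1);
	smp_store_release(lck, 0); /* Stands in for spin_unlock(). */
}

P1(int *lck, int *x)
{
	r0 = xchg_acquire(lck, 1); /* Stands in for spin_lock(). */
	r1 = READ_ONCE(*x);
	smp_store_release(lck, 0); /* Stands in for spin_unlock(). */
}

exists (0:r0=0 /\ 1:r0=0 /\ 1:r1=0)

The 0:r0=0 /\ 1:r0=0 terms are the price of the emulation: herd7 must
generate and then discard every execution in which an xchg_acquire()
found the lock already held, which is exactly the overlapping
critical-section case that the cat-code lock model rejects for free.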

Given that we expect locking to be very heavily used, it makes a lot
of sense to directly model it, which we (mostly Alan, Andrea, and Luc)
have done.  Direct modeling of RCU gets similar speedups over emulation,
plus the fact that accurate emulation of RCU requires some surprisingly
unnatural acts.  ("Ghost" accesses unaffected by normal memory-ordering
primitives, and "ghost-buster memory barriers" that affect both "ghost"
and normal accesses.  Plus the transformation is complex, so much so
that I broke down and wrote a script, which was used to validate the
early versions of the RCU model.)

In contrast, I have no reason to believe that direct modeling could make
wait_for_completion() / complete() go any faster than the equivalent
setup using smp_store_release() / smp_load_acquire(); that is, I don't
see anything that could be optimized away by cat-code.  So the only
reason to directly model wait_for_completion() / complete() would be
to get rid of the need to add the extra term to the "exists" clause.
And agreed, we should wait until people are actually being significantly
inconvenienced by this, which I do not expect any time soon.
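
The "extra term" here is, for the record, the 2:r4=1 in the "exists"
clause of the litmus test quoted later in this thread:

	exists (1:r3=1 /\ 2:r4=1 /\ 2:r1=0 /\ 3:r2=0)

That r4 is the value returned by the smp_load_acquire() standing in for
wait_for_completion(), so 2:r4=1 restricts the check to executions in
which the emulated wait_for_completion() actually observed the emulated
complete(), something a direct model of completions would enforce by
construction.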

> > > Why not? In what sort of cases does it go wobbly?
> > 
> > For one, when it conflicts with maintainability.  For example, it would
> > probably be OK for some of RCU's rcu_node ->lock acquisitions to skip the
> > smp_mb__after_unlock_lock() invocations.  But those are slowpaths, and the
> > small speedup on only one architecture is just not worth the added pain.
> > Especially given the nice wrapper functions that you provided.
> > 
> > But of course if this were instead (say) rcu_read_lock() or common-case
> > rcu_read_unlock(), I would be willing to undergo much more pain.  On the
> > other hand, for that exact reason, that common-case code path doesn't
> > acquire locks in the first place.  ;-)
> 
> Ah, so for models I would go absolutely minimal; it helps understand
> what the strict requirements are and where we over-provide etc..

Agreed, we should be careful how much we promise lest we get locked
into needlessly heavyweight implementations.

> For actual code you're entirely right, there's no point in trying to be
> cute with the rcu-node locks. Simplicity rules etc..

Hear, hear!  ;-)

Thanx, Paul



Re: [PATCH tip/core/rcu 1/9] rcu: Provide GP ordering in face of migrations and delays

2017-10-09 Thread Andrea Parri
On Mon, Oct 09, 2017 at 10:16:37AM +0200, Peter Zijlstra wrote:
> On Sat, Oct 07, 2017 at 11:28:57AM -0700, Paul E. McKenney wrote:
> > But if you are saying that it would be good to have wait_for_completion()
> > and complete() directly modeled at some point, no argument.  In addition,
> > I hope that the memory model is applied to other tools that analyze kernel
> > code.
> 
> > > I'm not sure I got the point across; so I'll try once more. Without
> > > providing this ordering the completion would be fundamentally broken. It
> > > _must_ provide this ordering.
> > 
> > OK, I now understand what you are getting at, and I do very much like
> > that guarantee.
> 
> Right, so maybe we should update the completion comments a bit to call
> out this property, because I'm not sure it's there.
> 
> Also, with this, I think the smp_store_release() / smp_load_acquire() is
> a perfectly fine abstraction of it, I don't think the model needs to be
> taught about the completion interface.
> 
> > > Why not? In what sort of cases does it go wobbly?
> > 
> > For one, when it conflicts with maintainability.  For example, it would
> > probably be OK for some of RCU's rcu_node ->lock acquisitions to skip the
> > smp_mb__after_unlock_lock() invocations.  But those are slowpaths, and the
> > small speedup on only one architecture is just not worth the added pain.
> > Especially given the nice wrapper functions that you provided.
> > 
> > But of course if this were instead (say) rcu_read_lock() or common-case
> > rcu_read_unlock(), I would be willing to undergo much more pain.  On the
> > other hand, for that exact reason, that common-case code path doesn't
> > acquire locks in the first place.  ;-)
> 
> Ah, so for models I would go absolutely minimal; it helps understand
> what the strict requirements are and where we over-provide etc..

Except, maybe, that simplicity and maintainability do apply to "models"
(design) as well... ;-)

As Ingo once put it [1] (referring to the "Linux-kernel memory model"):

  "it's not true that Linux has to offer a barrier and locking model
   that panders to the weakest (and craziest!) memory ordering model
   amongst all the possible Linux platforms - theoretical or real metal.

   Instead what we want to do is to consciously, intelligently _pick_
   a sane, maintainable memory model and offer primitives for that -
   at least as far as generic code is concerned. Each architecture can
   map those primitives to the best of its abilities."

  Andrea

[1] https://marc.info/?l=linux-mm&m=138513336717990&w=2


> 
> For actual code you're entirely right, there's no point in trying to be
> cute with the rcu-node locks. Simplicity rules etc..


Re: [PATCH tip/core/rcu 1/9] rcu: Provide GP ordering in face of migrations and delays

2017-10-09 Thread Peter Zijlstra
On Sat, Oct 07, 2017 at 11:28:57AM -0700, Paul E. McKenney wrote:
> But if you are saying that it would be good to have wait_for_completion()
> and complete() directly modeled at some point, no argument.  In addition,
> I hope that the memory model is applied to other tools that analyze kernel
> code.

> > I'm not sure I got the point across; so I'll try once more. Without
> > providing this ordering the completion would be fundamentally broken. It
> > _must_ provide this ordering.
> 
> OK, I now understand what you are getting at, and I do very much like
> that guarantee.

Right, so maybe we should update the completion comments a bit to call
out this property, because I'm not sure it's there.
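
Something like the following, say -- just a sketch of the sort of
wording I have in mind, not text that exists anywhere today:

	/*
	 * complete() provides RELEASE ordering and wait_for_completion()
	 * provides ACQUIRE ordering: if a waiter observes the completion,
	 * it also observes every memory access the completing task made
	 * before calling complete().
	 */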

Also, with this, I think the smp_store_release() / smp_load_acquire() is
a perfectly fine abstraction of it, I don't think the model needs to be
taught about the completion interface.

> > Why not? In what sort of cases does it go wobbly?
> 
> For one, when it conflicts with maintainability.  For example, it would
> probably be OK for some of RCU's rcu_node ->lock acquisitions to skip the
> smp_mb__after_unlock_lock() invocations.  But those are slowpaths, and the
> small speedup on only one architecture is just not worth the added pain.
> Especially given the nice wrapper functions that you provided.
> 
> But of course if this were instead (say) rcu_read_lock() or common-case
> rcu_read_unlock(), I would be willing to undergo much more pain.  On the
> other hand, for that exact reason, that common-case code path doesn't
> acquire locks in the first place.  ;-)

Ah, so for models I would go absolutely minimal; it helps understand
what the strict requirements are and where we over-provide etc..

For actual code you're entirely right, there's no point in trying to be
cute with the rcu-node locks. Simplicity rules etc..


Re: [PATCH tip/core/rcu 1/9] rcu: Provide GP ordering in face of migrations and delays

2017-10-07 Thread Paul E. McKenney
On Sat, Oct 07, 2017 at 11:29:19AM +0200, Peter Zijlstra wrote:
> On Fri, Oct 06, 2017 at 08:31:05PM -0700, Paul E. McKenney wrote:
> 
> > > > OK, I will bite...  What do the smp_store_release() and the
> > > > smp_load_acquire() correspond to?  I see just plain locking in
> > > > wait_for_completion() and complete().
> > > 
> > > They reflect the concept of complete() / wait_for_completion().
> > > Fundamentally all it needs to do is pass the message of 'completion'.
> > > 
> > > That is, if we were to go optimize our completion implementation, it
> > > would be impossible to be weaker than this and still correct.
> > 
> > OK, though the model does not provide spinlocks, and there can be

Sigh.  s/not//.  The current model -does- provide spinlocks, though
they are a bit new.  I don't know of any breakage, but I am paranoid
enough so that where feasible I double-check against xchg_acquire()
and smp_store_release().

> > differences in behavior between spinlocks and release-acquire.
> > But yes, in this case, it works.
> 
> Sure; but the fundamental property here is that if we observe the
> complete() we must also observe everything that went before. The exact
> means of implementing that is irrelevant.

Agreed, and that also tends to speed up the running of the model on
the litmus test, so this sort of abstraction is a very good thing for
multiple reasons.

So why did I use spinlocks?  Because the model was small and fast enough,
and using the spinlocks meant that I didn't need to take time to worry
about the code's intent.

But if you are saying that it would be good to have wait_for_completion()
and complete() directly modeled at some point, no argument.  In addition,
I hope that the memory model is applied to other tools that analyze kernel
code.

> > > > So I dropped that patch yesterday.  The main thing I was missing was
> > > > that there is no ordering-free fastpath in wait_for_completion() and
> > > > complete(): Each unconditionally acquires the lock.  So the smp_mb()
> > > > that I was trying to add doesn't need to be there.
> > > 
> > > Going by the above, it never needs to be there, even if there was a
> > > lock-free fast-path.
> > 
> > Given that wait_for_completion()/complete() both acquire the same lock,
> > yes, and agreed, if it were lockless but provided the release and
> > acquire ordering, then yes.
> 
> I'm not sure I got the point across; so I'll try once more. Without
> providing this ordering the completion would be fundamentally broken. It
> _must_ provide this ordering.

OK, I now understand what you are getting at, and I do very much like
that guarantee.

> > But if it was instead structured like
> > wait_event()/wake_up(), there would be ordering only if the caller
> > supplied it.
> 
> Right, wait_event()/wake_up() are different in that the 'condition'
> variable is external to the abstraction and thus it cannot help.
> 
> All wait_event()/wake_up() can guarantee is that IFF it does a wakeup,
> the woken thread will observe the prior state of the waker. But given
> the actual condition is external and we might not hit the actual sleep
> case, there are no guarantees.

Agreed.

> > All that aside, paring the ordering down to the bare minimum is not
> > always the right approach.
> 
> Why not? In what sort of cases does it go wobbly?

For one, when it conflicts with maintainability.  For example, it would
probably be OK for some of RCU's rcu_node ->lock acquisitions to skip the
smp_mb__after_unlock_lock() invocations.  But those are slowpaths, and the
small speedup on only one architecture is just not worth the added pain.
Especially given the nice wrapper functions that you provided.

But of course if this were instead (say) rcu_read_lock() or common-case
rcu_read_unlock(), I would be willing to undergo much more pain.  On the
other hand, for that exact reason, that common-case code path doesn't
acquire locks in the first place.  ;-)

Thanx, Paul



Re: [PATCH tip/core/rcu 1/9] rcu: Provide GP ordering in face of migrations and delays

2017-10-07 Thread Peter Zijlstra
On Fri, Oct 06, 2017 at 08:31:05PM -0700, Paul E. McKenney wrote:

> > > OK, I will bite...  What do the smp_store_release() and the
> > > smp_load_acquire() correspond to?  I see just plain locking in
> > > wait_for_completion() and complete().
> > 
> > They reflect the concept of complete() / wait_for_completion().
> > Fundamentally all it needs to do is pass the message of 'completion'.
> > 
> > That is, if we were to go optimize our completion implementation, it
> > would be impossible to be weaker than this and still correct.
> 
> OK, though the model does not provide spinlocks, and there can be
> differences in behavior between spinlocks and release-acquire.
> But yes, in this case, it works.

Sure; but the fundamental property here is that if we observe the
complete() we must also observe everything that went before. The exact
means of implementing that is irrelevant.

> > > So I dropped that patch yesterday.  The main thing I was missing was
> > > that there is no ordering-free fastpath in wait_for_completion() and
> > > complete(): Each unconditionally acquires the lock.  So the smp_mb()
> > > that I was trying to add doesn't need to be there.
> > 
> > Going by the above, it never needs to be there, even if there was a
> > lock-free fast-path.
> 
> Given that wait_for_completion()/complete() both acquire the same lock,
> yes, and agreed, if it were lockless but provided the release and
> acquire ordering, then yes.

I'm not sure I got the point across; so I'll try once more. Without
providing this ordering the completion would be fundamentally broken. It
_must_ provide this ordering.

> But if it was instead structured like
> wait_event()/wake_up(), there would be ordering only if the caller
> supplied it.

Right, wait_event()/wake_up() are different in that the 'condition'
variable is external to the abstraction and thus it cannot help.

All wait_event()/wake_up() can guarantee is that IFF it does a wakeup,
the woken thread will observe the prior state of the waker. But given
the actual condition is external and we might not hit the actual sleep
case, there are no guarantees.
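
To illustrate -- this is only a sketch with made-up variables, not any
particular kernel code:

	/* waker */
	WRITE_ONCE(data, 1);
	WRITE_ONCE(cond, 1);	/* ordering must come from the caller */
	wake_up(&wq);

	/* waiter */
	wait_event(wq, READ_ONCE(cond) == 1);
	r = READ_ONCE(data);	/* unordered if we never actually slept */

If the waiter sees cond == 1 without ever sleeping, wake_up() never
performs a wakeup and the waiter gets no ordering for free; the caller
has to supply it, for instance by using smp_store_release() /
smp_load_acquire() on cond.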

> All that aside, paring the ordering down to the bare minimum is not
> always the right approach.

Why not? In what sort of cases does it go wobbly?


Re: [PATCH tip/core/rcu 1/9] rcu: Provide GP ordering in face of migrations and delays

2017-10-06 Thread Paul E. McKenney
On Fri, Oct 06, 2017 at 10:15:37PM +0200, Peter Zijlstra wrote:
> On Fri, Oct 06, 2017 at 12:18:22PM -0700, Paul E. McKenney wrote:
> > > /me goes and install this herd thing again.. I'm sure I had it running
> > > _somewhere_.. A well.
> > > 
> > >   C C-PaulEMcKenney-W+RWC4+2017-10-05
> > > 
> > >   {
> > >   }
> > > 
> > >   P0(int *a, int *x)
> > >   {
> > >   WRITE_ONCE(*a, 1);
> > >   smp_mb(); /* Lock acquisition for rcu_node ->lock. */
> > >   WRITE_ONCE(*x, 1);
> > >   }
> > > 
> > >   P1(int *x, int *y)
> > >   {
> > >   r3 = READ_ONCE(*x);
> > >   smp_mb(); /* Lock acquisition for rcu_node ->lock. */
> > >   smp_store_release(y, 1);
> > >   }
> > > 
> > >   P2(int *y, int *b)
> > >   {
> > >   r4 = smp_load_acquire(y);
> > >   r1 = READ_ONCE(*b);
> > >   }
> > > 
> > >   P3(int *b, int *a)
> > >   {
> > >   WRITE_ONCE(*b, 1);
> > >   smp_mb();
> > >   r2 = READ_ONCE(*a);
> > >   }
> > > 
> > >   exists (1:r3=1 /\ 2:r4=1 /\ 2:r1=0 /\ 3:r2=0)
> > > 
> > > 
> > > Is what I was thinking of, I think that is the minimal ordering
> > > complete()/wait_for_completion() need to provide.
> > 
> > OK, I will bite...  What do the smp_store_release() and the
> > smp_load_acquire() correspond to?  I see just plain locking in
> > wait_for_completion() and complete().
> 
> They reflect the concept of complete() / wait_for_completion().
> Fundamentally all it needs to do is pass the message of 'completion'.
> 
> That is, if we were to go optimize our completion implementation, it
> would be impossible to be weaker than this and still correct.

OK, though the model does not provide spinlocks, and there can be
differences in behavior between spinlocks and release-acquire.
But yes, in this case, it works.

> > So I dropped that patch yesterday.  The main thing I was missing was
> > that there is no ordering-free fastpath in wait_for_completion() and
> > complete(): Each unconditionally acquires the lock.  So the smp_mb()
> > that I was trying to add doesn't need to be there.
> 
> Going by the above, it never needs to be there, even if there was a
> lock-free fast-path.

Given that wait_for_completion()/complete() both acquire the same lock,
yes, and agreed, if it were lockless but provided the release and
acquire ordering, then yes.  But if it was instead structured like
wait_event()/wake_up(), there would be ordering only if the caller
supplied it.

All that aside, paring the ordering down to the bare minimum is not
always the right approach.  Nevertheless, in this particular case,
there is plenty of ordering, so yet again, I have dropped this commit.
Like yesterday.  ;-)

Thanx, Paul



Re: [PATCH tip/core/rcu 1/9] rcu: Provide GP ordering in face of migrations and delays

2017-10-06 Thread Peter Zijlstra
On Fri, Oct 06, 2017 at 12:18:22PM -0700, Paul E. McKenney wrote:
> > /me goes and install this herd thing again.. I'm sure I had it running
> > _somewhere_.. A well.
> > 
> > C C-PaulEMcKenney-W+RWC4+2017-10-05
> > 
> > {
> > }
> > 
> > P0(int *a, int *x)
> > {
> > WRITE_ONCE(*a, 1);
> > smp_mb(); /* Lock acquisition for rcu_node ->lock. */
> > WRITE_ONCE(*x, 1);
> > }
> > 
> > P1(int *x, int *y)
> > {
> > r3 = READ_ONCE(*x);
> > smp_mb(); /* Lock acquisition for rcu_node ->lock. */
> > smp_store_release(y, 1);
> > }
> > 
> > P2(int *y, int *b)
> > {
> > r4 = smp_load_acquire(y);
> > r1 = READ_ONCE(*b);
> > }
> > 
> > P3(int *b, int *a)
> > {
> > WRITE_ONCE(*b, 1);
> > smp_mb();
> > r2 = READ_ONCE(*a);
> > }
> > 
> > exists (1:r3=1 /\ 2:r4=1 /\ 2:r1=0 /\ 3:r2=0)
> > 
> > 
> > Is what I was thinking of, I think that is the minimal ordering
> > complete()/wait_for_completion() need to provide.
> 
> OK, I will bite...  What do the smp_store_release() and the
> smp_load_acquire() correspond to?  I see just plain locking in
> wait_for_completion() and complete().

They reflect the concept of complete() / wait_for_completion().
Fundamentally all it needs to do is pass the message of 'completion'.

That is, if we were to go optimize our completion implementation, it
would be impossible to be weaker than this and still correct.

> So I dropped that patch yesterday.  The main thing I was missing was
> that there is no ordering-free fastpath in wait_for_completion() and
> complete(): Each unconditionally acquires the lock.  So the smp_mb()
> that I was trying to add doesn't need to be there.

Going by the above, it never needs to be there, even if there was a
lock-free fast-path.


Re: [PATCH tip/core/rcu 1/9] rcu: Provide GP ordering in face of migrations and delays

2017-10-06 Thread Paul E. McKenney
On Fri, Oct 06, 2017 at 11:07:23AM +0200, Peter Zijlstra wrote:
> On Thu, Oct 05, 2017 at 11:22:04AM -0700, Paul E. McKenney wrote:
> > Hmmm...  Here is what I was worried about:
> > 
> > C C-PaulEMcKenney-W+RWC4+2017-10-05
> > 
> > {
> > }
> > 
> > P0(int *a, int *x)
> > {
> > WRITE_ONCE(*a, 1);
> > smp_mb(); /* Lock acquisition for rcu_node ->lock. */
> > WRITE_ONCE(*x, 1);
> > }
> > 
> > P1(int *x, int *y, spinlock_t *l)
> > {
> > r3 = READ_ONCE(*x);
> > smp_mb(); /* Lock acquisition for rcu_node ->lock. */
> > spin_lock(l); /* Locking in complete(). */
> > WRITE_ONCE(*y, 1);
> > spin_unlock(l);
> > }
> > 
> > P2(int *y, int *b, spinlock_t *l)
> > {
> > spin_lock(l); /* Locking in wait_for_completion. */
> > r4 = READ_ONCE(*y);
> > spin_unlock(l);
> > r1 = READ_ONCE(*b);
> > }
> > 
> > P3(int *b, int *a)
> > {
> > WRITE_ONCE(*b, 1);
> > smp_mb();
> > r2 = READ_ONCE(*a);
> > }
> > 
> > exists (1:r3=1 /\ 2:r4=1 /\ 2:r1=0 /\ 3:r2=0)
> 
> /me goes and install this herd thing again.. I'm sure I had it running
> _somewhere_.. A well.
> 
>   C C-PaulEMcKenney-W+RWC4+2017-10-05
> 
>   {
>   }
> 
>   P0(int *a, int *x)
>   {
>   WRITE_ONCE(*a, 1);
>   smp_mb(); /* Lock acquisition for rcu_node ->lock. */
>   WRITE_ONCE(*x, 1);
>   }
> 
>   P1(int *x, int *y)
>   {
>   r3 = READ_ONCE(*x);
>   smp_mb(); /* Lock acquisition for rcu_node ->lock. */
>   smp_store_release(y, 1);
>   }
> 
>   P2(int *y, int *b)
>   {
>   r4 = smp_load_acquire(y);
>   r1 = READ_ONCE(*b);
>   }
> 
>   P3(int *b, int *a)
>   {
>   WRITE_ONCE(*b, 1);
>   smp_mb();
>   r2 = READ_ONCE(*a);
>   }
> 
>   exists (1:r3=1 /\ 2:r4=1 /\ 2:r1=0 /\ 3:r2=0)
> 
> 
> Is what I was thinking of, I think that is the minimal ordering
> complete()/wait_for_completion() need to provide.

OK, I will bite...  What do the smp_store_release() and the
smp_load_acquire() correspond to?  I see just plain locking in
wait_for_completion() and complete().

> (also, that r# numbering confuses the hell out of me, it's not related to
> P nor to the variables)

Yeah, it is random, sorry!!!

> Test C-PaulEMcKenney-W+RWC4+2017-10-05 Allowed
> States 15
> 1:r3=0; 2:r1=0; 2:r4=0; 3:r2=0;
> 1:r3=0; 2:r1=0; 2:r4=0; 3:r2=1;
> 1:r3=0; 2:r1=0; 2:r4=1; 3:r2=0;
> 1:r3=0; 2:r1=0; 2:r4=1; 3:r2=1;
> 1:r3=0; 2:r1=1; 2:r4=0; 3:r2=0;
> 1:r3=0; 2:r1=1; 2:r4=0; 3:r2=1;
> 1:r3=0; 2:r1=1; 2:r4=1; 3:r2=0;
> 1:r3=0; 2:r1=1; 2:r4=1; 3:r2=1;
> 1:r3=1; 2:r1=0; 2:r4=0; 3:r2=0;
> 1:r3=1; 2:r1=0; 2:r4=0; 3:r2=1;
> 1:r3=1; 2:r1=0; 2:r4=1; 3:r2=1;
> 1:r3=1; 2:r1=1; 2:r4=0; 3:r2=0;
> 1:r3=1; 2:r1=1; 2:r4=0; 3:r2=1;
> 1:r3=1; 2:r1=1; 2:r4=1; 3:r2=0;
> 1:r3=1; 2:r1=1; 2:r4=1; 3:r2=1;
> No
> Witnesses
> Positive: 0 Negative: 15
> Condition exists (1:r3=1 /\ 2:r4=1 /\ 2:r1=0 /\ 3:r2=0)
> Observation C-PaulEMcKenney-W+RWC4+2017-10-05 Never 0 15
> Time C-PaulEMcKenney-W+RWC4+2017-10-05 0.04
> Hash=f7f8ad6eab33e90718a394bcb021557d

But yes, looking closer, this corresponds to the rule of thumb about
non-rf relations and full memory barriers.  We have two non-rf relations,
P2->P3 (P2()'s read of b failing to see P3()'s write of b) and P3->P0
(P3()'s read of a failing to see P0()'s write of a), so we need two full
barriers, one each between the non-rf relations.

So I dropped that patch yesterday.  The main thing I was missing was
that there is no ordering-free fastpath in wait_for_completion() and
complete(): Each unconditionally acquires the lock.  So the smp_mb()
that I was trying to add doesn't need to be there.

Thanx, Paul



Re: [PATCH tip/core/rcu 1/9] rcu: Provide GP ordering in face of migrations and delays

2017-10-06 Thread Peter Zijlstra
On Thu, Oct 05, 2017 at 11:22:04AM -0700, Paul E. McKenney wrote:
> Hmmm...  Here is what I was worried about:
> 
>   C C-PaulEMcKenney-W+RWC4+2017-10-05
> 
>   {
>   }
> 
>   P0(int *a, int *x)
>   {
>   WRITE_ONCE(*a, 1);
>   smp_mb(); /* Lock acquisition for rcu_node ->lock. */
>   WRITE_ONCE(*x, 1);
>   }
> 
>   P1(int *x, int *y, spinlock_t *l)
>   {
>   r3 = READ_ONCE(*x);
>   smp_mb(); /* Lock acquisition for rcu_node ->lock. */
>   spin_lock(l); /* Locking in complete(). */
>   WRITE_ONCE(*y, 1);
>   spin_unlock(l);
>   }
> 
>   P2(int *y, int *b, spinlock_t *l)
>   {
>   spin_lock(l); /* Locking in wait_for_completion. */
>   r4 = READ_ONCE(*y);
>   spin_unlock(l);
>   r1 = READ_ONCE(*b);
>   }
> 
>   P3(int *b, int *a)
>   {
>   WRITE_ONCE(*b, 1);
>   smp_mb();
>   r2 = READ_ONCE(*a);
>   }
> 
>   exists (1:r3=1 /\ 2:r4=1 /\ 2:r1=0 /\ 3:r2=0)


/me goes and install this herd thing again.. I'm sure I had it running
_somewhere_.. A well.


C C-PaulEMcKenney-W+RWC4+2017-10-05

{
}

P0(int *a, int *x)
{
WRITE_ONCE(*a, 1);
smp_mb(); /* Lock acquisition for rcu_node ->lock. */
WRITE_ONCE(*x, 1);
}

P1(int *x, int *y)
{
r3 = READ_ONCE(*x);
smp_mb(); /* Lock acquisition for rcu_node ->lock. */
smp_store_release(y, 1);
}

P2(int *y, int *b)
{
r4 = smp_load_acquire(y);
r1 = READ_ONCE(*b);
}

P3(int *b, int *a)
{
WRITE_ONCE(*b, 1);
smp_mb();
r2 = READ_ONCE(*a);
}

exists (1:r3=1 /\ 2:r4=1 /\ 2:r1=0 /\ 3:r2=0)


Is what I was thinking of, I think that is the minimal ordering
complete()/wait_for_completion() need to provide.

(also, that r# numbering confuses the hell out of me, it's not related to
P nor to the variables)

Test C-PaulEMcKenney-W+RWC4+2017-10-05 Allowed
States 15
1:r3=0; 2:r1=0; 2:r4=0; 3:r2=0;
1:r3=0; 2:r1=0; 2:r4=0; 3:r2=1;
1:r3=0; 2:r1=0; 2:r4=1; 3:r2=0;
1:r3=0; 2:r1=0; 2:r4=1; 3:r2=1;
1:r3=0; 2:r1=1; 2:r4=0; 3:r2=0;
1:r3=0; 2:r1=1; 2:r4=0; 3:r2=1;
1:r3=0; 2:r1=1; 2:r4=1; 3:r2=0;
1:r3=0; 2:r1=1; 2:r4=1; 3:r2=1;
1:r3=1; 2:r1=0; 2:r4=0; 3:r2=0;
1:r3=1; 2:r1=0; 2:r4=0; 3:r2=1;
1:r3=1; 2:r1=0; 2:r4=1; 3:r2=1;
1:r3=1; 2:r1=1; 2:r4=0; 3:r2=0;
1:r3=1; 2:r1=1; 2:r4=0; 3:r2=1;
1:r3=1; 2:r1=1; 2:r4=1; 3:r2=0;
1:r3=1; 2:r1=1; 2:r4=1; 3:r2=1;
No
Witnesses
Positive: 0 Negative: 15
Condition exists (1:r3=1 /\ 2:r4=1 /\ 2:r1=0 /\ 3:r2=0)
Observation C-PaulEMcKenney-W+RWC4+2017-10-05 Never 0 15
Time C-PaulEMcKenney-W+RWC4+2017-10-05 0.04
Hash=f7f8ad6eab33e90718a394bcb021557d
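
(For anyone reproducing this: assuming the linux-kernel.* memory-model
files are installed next to the .litmus file, output like the above
comes from something like:

	herd7 -conf linux-kernel.cfg C-PaulEMcKenney-W+RWC4+2017-10-05.litmus

with the file name being whatever the test was saved as.)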


Re: [PATCH tip/core/rcu 1/9] rcu: Provide GP ordering in face of migrations and delays

2017-10-05 Thread Paul E. McKenney
On Thu, Oct 05, 2017 at 06:25:14PM +0200, Peter Zijlstra wrote:
> On Thu, Oct 05, 2017 at 09:19:09AM -0700, Paul E. McKenney wrote:
> > 
> > Yes, the ordering does need to be visible to uninvolved CPUs, so
> > release-acquire is not necessarily strong enough.
> 
> Why? Because I'm not at all seeing why it needs more. Your Changelog
> doesn't provide enough hints.

Hmmm...  Here is what I was worried about:

C C-PaulEMcKenney-W+RWC4+2017-10-05

{
}

P0(int *a, int *x)
{
WRITE_ONCE(*a, 1);
smp_mb(); /* Lock acquisition for rcu_node ->lock. */
WRITE_ONCE(*x, 1);
}

P1(int *x, int *y, spinlock_t *l)
{
r3 = READ_ONCE(*x);
smp_mb(); /* Lock acquisition for rcu_node ->lock. */
spin_lock(l); /* Locking in complete(). */
WRITE_ONCE(*y, 1);
spin_unlock(l);
}

P2(int *y, int *b, spinlock_t *l)
{
spin_lock(l); /* Locking in wait_for_completion. */
r4 = READ_ONCE(*y);
spin_unlock(l);
r1 = READ_ONCE(*b);
}

P3(int *b, int *a)
{
WRITE_ONCE(*b, 1);
smp_mb();
r2 = READ_ONCE(*a);
}

exists (1:r3=1 /\ 2:r4=1 /\ 2:r1=0 /\ 3:r2=0)

My concern was that P2()'s read from b could be pulled into the lock's
critical section and reordered with the read from y.  But even if
you physically rearrange the code, the cycle is forbidden.  Removing
P2()'s lock acquisition and release does allow the cycle.  I get the
same result when replacing spin_lock() with xchg_acquire() and
spin_unlock() with smp_store_release().  I can drop P1()'s smp_mb()
(but otherwise like the original above) and the cycle is forbidden.
Removing P1()'s lock acquisition and release, but leaving the smp_mb(),
does allow the cycle.
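
For concreteness, the substituted P2() looks something like the
following sketch (P1() gets the same treatment, and the lock variable
becomes a plain int; r5 is simply a register name I picked for the
xchg_acquire() return value):

P2(int *y, int *b, int *l)
{
	r5 = xchg_acquire(l, 1);   /* In place of spin_lock(l). */
	r4 = READ_ONCE(*y);
	smp_store_release(l, 0);   /* In place of spin_unlock(l). */
	r1 = READ_ONCE(*b);
}

Strictly speaking, one might also add 2:r5=0 to the "exists" clause to
discard executions in which the emulated lock acquisition failed.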

So it looks like I can drop this patch entirely.  Though somewhat
nervously, despite the evidence that I have ample redundancy in
ordering.  ;-)

So, thank you for persisting on this one!

Thanx, Paul



Re: [PATCH tip/core/rcu 1/9] rcu: Provide GP ordering in face of migrations and delays

2017-10-05 Thread Peter Zijlstra
On Thu, Oct 05, 2017 at 09:19:09AM -0700, Paul E. McKenney wrote:
> 
> Yes, the ordering does need to be visible to uninvolved CPUs, so
> release-acquire is not necessarily strong enough.

Why? Because I'm not at all seeing why it needs more. Your Changelog
doesn't provide enough hints.


Re: [PATCH tip/core/rcu 1/9] rcu: Provide GP ordering in face of migrations and delays

2017-10-05 Thread Paul E. McKenney
On Thu, Oct 05, 2017 at 05:39:13PM +0200, Peter Zijlstra wrote:
> On Thu, Oct 05, 2017 at 07:55:13AM -0700, Paul E. McKenney wrote:
> > On Thu, Oct 05, 2017 at 11:41:14AM +0200, Peter Zijlstra wrote:
> > > On Wed, Oct 04, 2017 at 02:29:27PM -0700, Paul E. McKenney wrote:
> > > > Consider the following admittedly improbable sequence of events:
> > > > 
> > > > o   RCU is initially idle.
> > > > 
> > > > o   Task A on CPU 0 executes rcu_read_lock().
> > > > 
> > > > o   Task B on CPU 1 executes synchronize_rcu(), which must
> > > > wait on Task A:
> > > > 
> > > > o   Task B registers the callback, which starts a new
> > > > grace period, awakening the grace-period kthread
> > > > on CPU 3, which immediately starts a new grace period.
> > > > 
> > > > o   Task B migrates to CPU 2, which provides a quiescent
> > > > state for both CPUs 1 and 2.
> > > > 
> > > > o   Both CPUs 1 and 2 take scheduling-clock interrupts,
> > > > and both invoke RCU_SOFTIRQ, both thus learning of the
> > > > new grace period.
> > > > 
> > > > o   Task B is delayed, perhaps by vCPU preemption on CPU 2.
> > > > 
> > > > o   CPUs 2 and 3 pass through quiescent states, which are reported
> > > > to core RCU.
> > > > 
> > > > o   Task B is resumed just long enough to be migrated to CPU 3,
> > > > and then is once again delayed.
> > > > 
> > > > o   Task A executes rcu_read_unlock(), exiting its RCU read-side
> > > > critical section.
> > > > 
> > > > o   CPU 0 passes through a quiescent state, which is reported to
> > > > core RCU.  Only CPU 1 continues to block the grace period.
> > > > 
> > > > o   CPU 1 passes through a quiescent state, which is reported to
> > > > core RCU.  This ends the grace period, and CPU 1 therefore
> > > > invokes its callbacks, one of which awakens Task B via
> > > > complete().
> > > > 
> > > > o   Task B resumes (still on CPU 3) and starts executing
> > > > wait_for_completion(), which sees that the completion has
> > > > already completed, and thus does not block.  It returns from
> > > > the synchronize_rcu() without any ordering against the
> > > > end of Task A's RCU read-side critical section.
> > > > 
> > > > It can therefore mess up Task A's RCU read-side critical section,
> > > > in theory, anyway.
> > > 
> > > I'm not sure I follow, at the very least the wait_for_completion() does
> > > an ACQUIRE such that it observes the state prior to the RELEASE as done
> > > by complete(), no?
> > 
> > Your point being that both wait_for_completion() and complete() acquire
> > and release the same lock?  (Yes, I suspect that I was confusing this
> > with wait_event() and wake_up(), just so you know.)
> 
> Well, fundamentally complete()/wait_for_completion() is a message-pass
> and they include a RELEASE/ACQUIRE pair for causal reasons.
> 
> Per the implementation they use a spinlock, but any implementation needs
> to provide at least that RELEASE/ACQUIRE pair.
> 
> > > And is not CPU0's QS reporting ordered against that complete()?
> > 
> > Mumble mumble mumble powerpc mumble mumble mumble...
> > 
> > OK, I will make this new memory barrier only execute for powerpc.
> > 
> > Or am I missing something else here?
> 
> So I'm not entirely clear on the required semantics here; why do we need
> a full mb? I'm thinking CPU0's QS propagating through the tree and
> arriving at the root node is a multi-copy-atomic / transitive thing and
> all CPUs will agree the system QS has ended, right?
> 
> Whichever CPU establishes the system QS does complete() and the
> wait_for_completion() then has the weak-transitive causal relation to
> that, ensuring that -- in the above example -- CPU3 must be _after_
> CPU0's rcu_read_unlock().

Yes, the ordering does need to be visible to uninvolved CPUs, so
release-acquire is not necessarily strong enough.

My current thought is like this:

if (IS_ENABLED(CONFIG_ARCH_WEAK_RELEASE_ACQUIRE))
	smp_mb();
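
Or, packaged as a helper (the name and placement are purely
illustrative, not an existing kernel API):

/*
 * Sketch only: supply a full barrier on architectures whose
 * release-acquire chains are not strong enough to make the
 * ordering visible to uninvolved CPUs.
 */
static inline void rcu_order_gp_for_weak_release_acquire(void)
{
	if (IS_ENABLED(CONFIG_ARCH_WEAK_RELEASE_ACQUIRE))
		smp_mb(); /* Order grace-period end against later accesses. */
}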

Thoughts?

Thanx, Paul



Re: [PATCH tip/core/rcu 1/9] rcu: Provide GP ordering in face of migrations and delays

2017-10-05 Thread Peter Zijlstra
On Thu, Oct 05, 2017 at 07:55:13AM -0700, Paul E. McKenney wrote:
> On Thu, Oct 05, 2017 at 11:41:14AM +0200, Peter Zijlstra wrote:
> > On Wed, Oct 04, 2017 at 02:29:27PM -0700, Paul E. McKenney wrote:
> > > Consider the following admittedly improbable sequence of events:
> > > 
> > > o RCU is initially idle.
> > > 
> > > o Task A on CPU 0 executes rcu_read_lock().
> > > 
> > > o Task B on CPU 1 executes synchronize_rcu(), which must
> > >   wait on Task A:
> > > 
> > >   o   Task B registers the callback, which starts a new
> > >   grace period, awakening the grace-period kthread
> > >   on CPU 3, which immediately starts a new grace period.
> > > 
> > >   o   Task B migrates to CPU 2, which provides a quiescent
> > >   state for both CPUs 1 and 2.
> > > 
> > >   o   Both CPUs 1 and 2 take scheduling-clock interrupts,
> > >   and both invoke RCU_SOFTIRQ, both thus learning of the
> > >   new grace period.
> > > 
> > >   o   Task B is delayed, perhaps by vCPU preemption on CPU 2.
> > > 
> > > o CPUs 2 and 3 pass through quiescent states, which are reported
> > >   to core RCU.
> > > 
> > > o Task B is resumed just long enough to be migrated to CPU 3,
> > >   and then is once again delayed.
> > > 
> > > o Task A executes rcu_read_unlock(), exiting its RCU read-side
> > >   critical section.
> > > 
> > > o CPU 0 passes through a quiescent state, which is reported to
> > >   core RCU.  Only CPU 1 continues to block the grace period.
> > > 
> > > o CPU 1 passes through a quiescent state, which is reported to
> > >   core RCU.  This ends the grace period, and CPU 1 therefore
> > >   invokes its callbacks, one of which awakens Task B via
> > >   complete().
> > > 
> > > o Task B resumes (still on CPU 3) and starts executing
> > >   wait_for_completion(), which sees that the completion has
> > >   already completed, and thus does not block.  It returns from
> > >   the synchronize_rcu() without any ordering against the
> > >   end of Task A's RCU read-side critical section.
> > > 
> > >   It can therefore mess up Task A's RCU read-side critical section,
> > >   in theory, anyway.
> > 
> > I'm not sure I follow, at the very least the wait_for_completion() does
> > an ACQUIRE such that it observes the state prior to the RELEASE as done
> > by complete(), no?
> 
> Your point being that both wait_for_completion() and complete() acquire
> and release the same lock?  (Yes, I suspect that I was confusing this
> with wait_event() and wake_up(), just so you know.)

Well, fundamentally complete()/wait_for_completion() is a message-pass
and they include a RELEASE/ACQUIRE pair for causal reasons.

Per the implementation they use a spinlock, but any implementation needs
to provide at least that RELEASE/ACQUIRE pair.
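
In other words, stripped of the waitqueue machinery, the minimal
ordering contract is just this (an illustration, not the actual
completion implementation):

static int done; /* Stands in for the completion's internal state. */

static void sketch_complete(void)
{
	/* RELEASE: all stores before this are visible to the acquirer. */
	smp_store_release(&done, 1);
}

static void sketch_wait_for_completion(void)
{
	/* ACQUIRE: once done is observed set, so are all prior stores. */
	while (!smp_load_acquire(&done))
		cpu_relax();
}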

> > And is not CPU0's QS reporting ordered against that complete()?
> 
> Mumble mumble mumble powerpc mumble mumble mumble...
> 
> OK, I will make this new memory barrier only execute for powerpc.
> 
> Or am I missing something else here?

So I'm not entirely clear on the required semantics here; why do we need
a full mb? I'm thinking CPU0's QS propagating through the tree and
arriving at the root node is a multi-copy-atomic / transitive thing and
all CPUs will agree the system QS has ended, right?

Whichever CPU establishes the system QS does complete() and the
wait_for_completion() then has the weak-transitive causal relation to
that, ensuring that -- in the above example -- CPU3 must be _after_
CPU0's rcu_read_unlock().




Re: [PATCH tip/core/rcu 1/9] rcu: Provide GP ordering in face of migrations and delays

2017-10-05 Thread Paul E. McKenney
On Thu, Oct 05, 2017 at 11:41:14AM +0200, Peter Zijlstra wrote:
> On Wed, Oct 04, 2017 at 02:29:27PM -0700, Paul E. McKenney wrote:
> > Consider the following admittedly improbable sequence of events:
> > 
> > o   RCU is initially idle.
> > 
> > o   Task A on CPU 0 executes rcu_read_lock().
> > 
> > o   Task B on CPU 1 executes synchronize_rcu(), which must
> > wait on Task A:
> > 
> > o   Task B registers the callback, which starts a new
> > grace period, awakening the grace-period kthread
> > on CPU 3, which immediately starts a new grace period.
> > 
> > o   Task B migrates to CPU 2, which provides a quiescent
> > state for both CPUs 1 and 2.
> > 
> > o   Both CPUs 1 and 2 take scheduling-clock interrupts,
> > and both invoke RCU_SOFTIRQ, both thus learning of the
> > new grace period.
> > 
> > o   Task B is delayed, perhaps by vCPU preemption on CPU 2.
> > 
> > o   CPUs 2 and 3 pass through quiescent states, which are reported
> > to core RCU.
> > 
> > o   Task B is resumed just long enough to be migrated to CPU 3,
> > and then is once again delayed.
> > 
> > o   Task A executes rcu_read_unlock(), exiting its RCU read-side
> > critical section.
> > 
> > o   CPU 0 passes through a quiescent state, which is reported to
> > core RCU.  Only CPU 1 continues to block the grace period.
> > 
> > o   CPU 1 passes through a quiescent state, which is reported to
> > core RCU.  This ends the grace period, and CPU 1 therefore
> > invokes its callbacks, one of which awakens Task B via
> > complete().
> > 
> > o   Task B resumes (still on CPU 3) and starts executing
> > wait_for_completion(), which sees that the completion has
> > already completed, and thus does not block.  It returns from
> > the synchronize_rcu() without any ordering against the
> > end of Task A's RCU read-side critical section.
> > 
> > It can therefore mess up Task A's RCU read-side critical section,
> > in theory, anyway.
> 
> I'm not sure I follow, at the very least the wait_for_completion() does
> an ACQUIRE such that it observes the state prior to the RELEASE as done
> by complete(), no?

Your point being that both wait_for_completion() and complete() acquire
and release the same lock?  (Yes, I suspect that I was confusing this
with wait_event() and wake_up(), just so you know.)

> And is not CPU0's QS reporting ordered against that complete()?

Mumble mumble mumble powerpc mumble mumble mumble...

OK, I will make this new memory barrier only execute for powerpc.

Or am I missing something else here?

Thanx, Paul



Re: [PATCH tip/core/rcu 1/9] rcu: Provide GP ordering in face of migrations and delays

2017-10-05 Thread Steven Rostedt
On Thu, 5 Oct 2017 15:40:42 +0200
Peter Zijlstra  wrote:

> On Thu, Oct 05, 2017 at 09:17:03AM -0400, Steven Rostedt wrote:
> > On Wed,  4 Oct 2017 14:29:27 -0700
> > "Paul E. McKenney"  wrote:
> >   
> > > Consider the following admittedly improbable sequence of events:
> > > 
> > > o RCU is initially idle.
> > > 
> > > o Task A on CPU 0 executes rcu_read_lock().  
> > 
> > A starts rcu_read_lock() critical section.
> >   
> > > 
> > > o Task B on CPU 1 executes synchronize_rcu(), which must
> > >   wait on Task A:  
> > 
> > B waits for A.
> >   
> > > 
> > >   o   Task B registers the callback, which starts a new
> > >   grace period, awakening the grace-period kthread
> > >   on CPU 3, which immediately starts a new grace period.  
> > 
> >   [ isn't B blocked (off rq)? How does it migrate? ]  
> 
> No, it's running synchronize_rcu() but hasn't blocked yet. It would block
> on wait_for_completion(), but per the very last point, we'll observe the
> complete() before we block.

Hmm, maybe this should be stated as well, as it looked confusing.

> 
> > >   o   Task B migrates to CPU 2, which provides a quiescent
> > >   state for both CPUs 1 and 2.
> > > 
> > >   o   Both CPUs 1 and 2 take scheduling-clock interrupts,
> > >   and both invoke RCU_SOFTIRQ, both thus learning of the
> > >   new grace period.
> > > 
> > >   o   Task B is delayed, perhaps by vCPU preemption on CPU 2.
> > > 
> > > o CPUs 2 and 3 pass through quiescent states, which are reported
> > >   to core RCU.
> > > 
> > > o Task B is resumed just long enough to be migrated to CPU 3,
> > >   and then is once again delayed.
> > > 
> > > o Task A executes rcu_read_unlock(), exiting its RCU read-side
> > >   critical section.  
> > 
> > A calls rcu_read_unlock() ending the critical section  
> 
> The point is that rcu_read_unlock() doesn't have memory ordering.

Ah, that makes a difference. Can we state that in the change log too?

> 
> > > 
> > > o CPU 0 passes through a quiescent state, which is reported to
> > >   core RCU.  Only CPU 1 continues to block the grace period.
> > > 
> > > o CPU 1 passes through a quiescent state, which is reported to
> > >   core RCU.  This ends the grace period, and CPU 1 therefore
> > >   invokes its callbacks, one of which awakens Task B via
> > >   complete().
> > > 
> > > o Task B resumes (still on CPU 3) and starts executing
> > >   wait_for_completion(), which sees that the completion has
> > >   already completed, and thus does not block.  It returns from
> > >   the synchronize_rcu() without any ordering against the
> > >   end of Task A's RCU read-side critical section.  
> > 
> > B runs
> > 
> >   
> > > 
> > >   It can therefore mess up Task A's RCU read-side critical section,
> > >   in theory, anyway.  
> > 
> > I don't see how B ran during A's critical section.  
> 
> It didn't but that doesn't mean the memory ordering agrees. What we
> require is B observes (per the memory ordering) everything up to and
> including the rcu_read_unlock(). This is not 'time' related.

OK, that makes sense. It was just missing from the change log, which
mere mortals like myself need in order to keep up. (This is why I'm
Batman: I'm only human, but I can still compete with superheroes ;-)

> 
> 
> That said, I don't think it can actually happen, because CPU0's QS state
> is ordered against the complete and the wait_for_completion is ordered
> against the complete.

OK, but this doesn't hurt to have. It is just a paranoid check, and when
it comes to RCU, leaning toward paranoia is good.

-- Steve


Re: [PATCH tip/core/rcu 1/9] rcu: Provide GP ordering in face of migrations and delays

2017-10-05 Thread Peter Zijlstra
On Thu, Oct 05, 2017 at 09:17:03AM -0400, Steven Rostedt wrote:
> On Wed,  4 Oct 2017 14:29:27 -0700
> "Paul E. McKenney"  wrote:
> 
> > Consider the following admittedly improbable sequence of events:
> > 
> > o   RCU is initially idle.
> > 
> > o   Task A on CPU 0 executes rcu_read_lock().
> 
> A starts rcu_read_lock() critical section.
> 
> > 
> > o   Task B on CPU 1 executes synchronize_rcu(), which must
> > wait on Task A:
> 
> B waits for A.
> 
> > 
> > o   Task B registers the callback, which starts a new
> > grace period, awakening the grace-period kthread
> > on CPU 3, which immediately starts a new grace period.
> 
>   [ isn't B blocked (off rq)? How does it migrate? ]

No, it's running synchronize_rcu() but hasn't blocked yet. It would block
on wait_for_completion(), but per the very last point, we'll observe the
complete() before we block.

> > o   Task B migrates to CPU 2, which provides a quiescent
> > state for both CPUs 1 and 2.
> > 
> > o   Both CPUs 1 and 2 take scheduling-clock interrupts,
> > and both invoke RCU_SOFTIRQ, both thus learning of the
> > new grace period.
> > 
> > o   Task B is delayed, perhaps by vCPU preemption on CPU 2.
> > 
> > o   CPUs 2 and 3 pass through quiescent states, which are reported
> > to core RCU.
> > 
> > o   Task B is resumed just long enough to be migrated to CPU 3,
> > and then is once again delayed.
> > 
> > o   Task A executes rcu_read_unlock(), exiting its RCU read-side
> > critical section.
> 
> A calls rcu_read_unlock() ending the critical section

The point is that rcu_read_unlock() doesn't have memory ordering.

> > 
> > o   CPU 0 passes through a quiescent state, which is reported to
> > core RCU.  Only CPU 1 continues to block the grace period.
> > 
> > o   CPU 1 passes through a quiescent state, which is reported to
> > core RCU.  This ends the grace period, and CPU 1 therefore
> > invokes its callbacks, one of which awakens Task B via
> > complete().
> > 
> > o   Task B resumes (still on CPU 3) and starts executing
> > wait_for_completion(), which sees that the completion has
> > already completed, and thus does not block.  It returns from
> > the synchronize_rcu() without any ordering against the
> > end of Task A's RCU read-side critical section.
> 
> B runs
> 
> 
> > 
> > It can therefore mess up Task A's RCU read-side critical section,
> > in theory, anyway.
> 
> I don't see how B ran during A's critical section.

It didn't but that doesn't mean the memory ordering agrees. What we
require is B observes (per the memory ordering) everything up to and
including the rcu_read_unlock(). This is not 'time' related.


That said, I don't think it can actually happen, because CPU0's QS state
is ordered against the complete and the wait_for_completion is ordered
against the complete.
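
(For reference, a much-simplified sketch of why: the ->done check in
wait_for_completion() happens under the same lock that complete() holds
while setting it, so even the no-block fast path gets at least an
ACQUIRE/RELEASE pairing.  Field names follow struct completion, but the
code below is a sketch, not the real implementation.)

/* Fast path only; the enqueue-and-schedule() slow path is elided. */
static bool sketch_try_wait_for_completion(struct completion *x)
{
	bool done;

	spin_lock_irq(&x->wait.lock);	/* ACQUIRE on the same lock... */
	done = x->done > 0;
	if (done)
		x->done--;		/* Consume the completion. */
	spin_unlock_irq(&x->wait.lock);
	return done;			/* True: no need to block. */
}

static void sketch_complete(struct completion *x)
{
	spin_lock_irq(&x->wait.lock);	/* ...that complete() takes here. */
	x->done++;
	/* The real code also wakes a waiter at this point. */
	spin_unlock_irq(&x->wait.lock);	/* RELEASE observed by the waiter. */
}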


Re: [PATCH tip/core/rcu 1/9] rcu: Provide GP ordering in face of migrations and delays

2017-10-05 Thread Steven Rostedt
On Wed,  4 Oct 2017 14:29:27 -0700
"Paul E. McKenney"  wrote:

> Consider the following admittedly improbable sequence of events:
> 
> o RCU is initially idle.
> 
> o Task A on CPU 0 executes rcu_read_lock().

A starts rcu_read_lock() critical section.

> 
> o Task B on CPU 1 executes synchronize_rcu(), which must
>   wait on Task A:

B waits for A.

> 
>   o   Task B registers the callback, which starts a new
>   grace period, awakening the grace-period kthread
>   on CPU 3, which immediately starts a new grace period.

  [ isn't B blocked (off rq)? How does it migrate? ]

> 
>   o   Task B migrates to CPU 2, which provides a quiescent
>   state for both CPUs 1 and 2.
> 
>   o   Both CPUs 1 and 2 take scheduling-clock interrupts,
>   and both invoke RCU_SOFTIRQ, both thus learning of the
>   new grace period.
> 
>   o   Task B is delayed, perhaps by vCPU preemption on CPU 2.
> 
> o CPUs 2 and 3 pass through quiescent states, which are reported
>   to core RCU.
> 
> o Task B is resumed just long enough to be migrated to CPU 3,
>   and then is once again delayed.
> 
> o Task A executes rcu_read_unlock(), exiting its RCU read-side
>   critical section.

A calls rcu_read_unlock() ending the critical section

> 
> o CPU 0 passes through a quiescent state, which is reported to
>   core RCU.  Only CPU 1 continues to block the grace period.
> 
> o CPU 1 passes through a quiescent state, which is reported to
>   core RCU.  This ends the grace period, and CPU 1 therefore
>   invokes its callbacks, one of which awakens Task B via
>   complete().
> 
> o Task B resumes (still on CPU 3) and starts executing
>   wait_for_completion(), which sees that the completion has
>   already completed, and thus does not block.  It returns from
>   the synchronize_rcu() without any ordering against the
>   end of Task A's RCU read-side critical section.

B runs


> 
>   It can therefore mess up Task A's RCU read-side critical section,
>   in theory, anyway.

I don't see how B ran during A's critical section.

-- Steve

> 
> However, if CPU hotplug ever gets rid of stop_machine(), there will be
> more straightforward ways for this sort of thing to happen, so this
> commit adds a memory barrier in order to enforce the needed ordering.
> 


Re: [PATCH tip/core/rcu 1/9] rcu: Provide GP ordering in face of migrations and delays

2017-10-05 Thread Peter Zijlstra
On Wed, Oct 04, 2017 at 02:29:27PM -0700, Paul E. McKenney wrote:
> Consider the following admittedly improbable sequence of events:
> 
> o RCU is initially idle.
> 
> o Task A on CPU 0 executes rcu_read_lock().
> 
> o Task B on CPU 1 executes synchronize_rcu(), which must
>   wait on Task A:
> 
>   o   Task B registers the callback, which starts a new
>   grace period, awakening the grace-period kthread
>   on CPU 3, which immediately starts a new grace period.
> 
>   o   Task B migrates to CPU 2, which provides a quiescent
>   state for both CPUs 1 and 2.
> 
>   o   Both CPUs 1 and 2 take scheduling-clock interrupts,
>   and both invoke RCU_SOFTIRQ, both thus learning of the
>   new grace period.
> 
>   o   Task B is delayed, perhaps by vCPU preemption on CPU 2.
> 
> o CPUs 2 and 3 pass through quiescent states, which are reported
>   to core RCU.
> 
> o Task B is resumed just long enough to be migrated to CPU 3,
>   and then is once again delayed.
> 
> o Task A executes rcu_read_unlock(), exiting its RCU read-side
>   critical section.
> 
> o   CPU 0 passes through a quiescent state, which is reported to
>   core RCU.  Only CPU 1 continues to block the grace period.
> 
> o CPU 1 passes through a quiescent state, which is reported to
>   core RCU.  This ends the grace period, and CPU 1 therefore
>   invokes its callbacks, one of which awakens Task B via
>   complete().
> 
> o Task B resumes (still on CPU 3) and starts executing
>   wait_for_completion(), which sees that the completion has
>   already completed, and thus does not block.  It returns from
>   the synchronize_rcu() without any ordering against the
>   end of Task A's RCU read-side critical section.
> 
>   It can therefore mess up Task A's RCU read-side critical section,
>   in theory, anyway.

I'm not sure I follow, at the very least the wait_for_completion() does
an ACQUIRE such that it observes the state prior to the RELEASE as done
by complete(), no?

And is not CPU0's QS reporting ordered against that complete()?


