Re: [PATCH tip/core/rcu 1/9] rcu: Provide GP ordering in face of migrations and delays
On Mon, Oct 09, 2017 at 10:16:37AM +0200, Peter Zijlstra wrote:
> On Sat, Oct 07, 2017 at 11:28:57AM -0700, Paul E. McKenney wrote:
> > But if you are saying that it would be good to have wait_for_completion()
> > and complete() directly modeled at some point, no argument.  In addition,
> > I hope that the memory model is applied to other tools that analyze kernel
> > code.
> >
> > > I'm not sure I got the point across; so I'll try once more. Without
> > > providing this ordering the completion would be fundamentally broken. It
> > > _must_ provide this ordering.
> >
> > OK, I now understand what you are getting at, and I do very much like
> > that guarantee.
>
> Right, so maybe we should update the completion comments a bit to call
> out this property, because I'm not sure it's there.

That would be very good!

> Also, with this, I think the smp_store_release() / smp_load_acquire() is
> a perfectly fine abstraction of it, I don't think the model needs to be
> taught about the completion interface.

Agreed, if we pull too much stuff into the model it might well become
difficult to maintain in and of itself.  So we do need to choose carefully
based on model performance and ease of use.

For a performance example, note that locking can be easily modeled with
xchg_acquire() for lock acquisition and smp_store_release() for lock
release.  But here is the resulting performance compared to directly
modeling locking within the cat code:

	             xchg_acquire() and
	# Processes  smp_store_release()   Direct modeling
	     2                   0.061 s           0.007 s
	     3                   3.542 s           0.039 s
	     4                 569.189 s           0.397 s
	     5           > 113,853.000 s           5.187 s

Combinatorial explosion is a real pain...  The problem is that the
xchg_acquire() / smp_store_release() approach requires the herd7 tool
to consider and then reject all the scenarios where the lock critical
sections overlap.  In contrast, when direct modeling, those overlapping
critical-section scenarios are implicitly rejected by the cat-code that
models locking.
Don't get me wrong, there is still combinatorial explosion, as you can
see from the rightmost column of numbers.  After all, the herd7 tool
still needs to consider all possible lock-acquisition orders.  But the
explosion is much less explosive.  ;-)

Given that we expect locking to be very heavily used, it makes a lot of
sense to directly model it, which we (mostly Alan, Andrea, and Luc) have
done.  Direct modeling of RCU gets similar speedups over emulation, plus
the fact that accurate emulation of RCU requires some surprisingly
unnatural acts.  ("Ghost" accesses unaffected by normal memory-ordering
primitives, and "ghost-buster memory barriers" that affect both "ghost"
and normal accesses.  Plus the transformation is complex, so much so
that I broke down and wrote a script, which was used to validate the
early versions of the RCU model.)

In contrast, I have no reason to believe that direct modeling could make
wait_for_completion() / complete() go any faster than the equivalent
setup using smp_store_release() / smp_load_acquire(), that is, I don't
see anything that could be optimized away by cat-code.  So the only
reason to directly model wait_for_completion() / complete() would be to
get rid of the need to add the extra term to the "exists" clause.  And
agreed, we should wait until people are actually being significantly
inconvenienced by this, which I do not expect any time soon.

> > > Why not? In what sort of cases does it go wobbly?
> >
> > For one, when it conflicts with maintainability.  For example, it would
> > probably be OK for some of RCU's rcu_node ->lock acquisitions to skip the
> > smp_mb__after_unlock_lock() invocations.  But those are slowpaths, and the
> > small speedup on only one architecture is just not worth the added pain.
> > Especially given the nice wrapper functions that you provided.
> >
> > But of course if this were instead (say) rcu_read_lock() or common-case
> > rcu_read_unlock(), I would be willing to undergo much more pain.
> > On the other hand, for that exact reason, that common-case code path
> > doesn't acquire locks in the first place.  ;-)
>
> Ah, so for models I would go absolutely minimal; it helps understand
> what the strict requirements are and where we over-provide etc..

Agreed, we should be careful how much we promise lest we get locked into
needlessly heavyweight implementations.

> For actual code you're entirely right, there's no point in trying to be
> cute with the rcu-node locks. Simplicity rules etc..

Hear, hear!  ;-)

							Thanx, Paul
Re: [PATCH tip/core/rcu 1/9] rcu: Provide GP ordering in face of migrations and delays
On Mon, Oct 09, 2017 at 10:16:37AM +0200, Peter Zijlstra wrote:
> On Sat, Oct 07, 2017 at 11:28:57AM -0700, Paul E. McKenney wrote:
> > But if you are saying that it would be good to have wait_for_completion()
> > and complete() directly modeled at some point, no argument.  In addition,
> > I hope that the memory model is applied to other tools that analyze kernel
> > code.
> >
> > > I'm not sure I got the point across; so I'll try once more. Without
> > > providing this ordering the completion would be fundamentally broken. It
> > > _must_ provide this ordering.
> >
> > OK, I now understand what you are getting at, and I do very much like
> > that guarantee.
>
> Right, so maybe we should update the completion comments a bit to call
> out this property, because I'm not sure it's there.
>
> Also, with this, I think the smp_store_release() / smp_load_acquire() is
> a perfectly fine abstraction of it, I don't think the model needs to be
> taught about the completion interface.
>
> > > Why not? In what sort of cases does it go wobbly?
> >
> > For one, when it conflicts with maintainability.  For example, it would
> > probably be OK for some of RCU's rcu_node ->lock acquisitions to skip the
> > smp_mb__after_unlock_lock() invocations.  But those are slowpaths, and the
> > small speedup on only one architecture is just not worth the added pain.
> > Especially given the nice wrapper functions that you provided.
> >
> > But of course if this were instead (say) rcu_read_lock() or common-case
> > rcu_read_unlock(), I would be willing to undergo much more pain.  On the
> > other hand, for that exact reason, that common-case code path doesn't
> > acquire locks in the first place.  ;-)
>
> Ah, so for models I would go absolutely minimal; it helps understand
> what the strict requirements are and where we over-provide etc..

Except, maybe, that simplicity and maintainability do apply to "models"
(design) as well...  ;-)

As Ingo once put it [1] (referring to the "Linux-kernel memory model"):

  "it's not true that Linux has to offer a barrier and locking model
   that panders to the weakest (and craziest!) memory ordering model
   amongst all the possible Linux platforms - theoretical or real metal.
   Instead what we want to do is to consciously, intelligently _pick_ a
   sane, maintainable memory model and offer primitives for that - at
   least as far as generic code is concerned. Each architecture can map
   those primitives to the best of its abilities."

  Andrea

[1] https://marc.info/?l=linux-mm&m=138513336717990&w=2

> For actual code you're entirely right, there's no point in trying to be
> cute with the rcu-node locks. Simplicity rules etc..
Re: [PATCH tip/core/rcu 1/9] rcu: Provide GP ordering in face of migrations and delays
On Sat, Oct 07, 2017 at 11:28:57AM -0700, Paul E. McKenney wrote:
> But if you are saying that it would be good to have wait_for_completion()
> and complete() directly modeled at some point, no argument.  In addition,
> I hope that the memory model is applied to other tools that analyze kernel
> code.
>
> > I'm not sure I got the point across; so I'll try once more. Without
> > providing this ordering the completion would be fundamentally broken. It
> > _must_ provide this ordering.
>
> OK, I now understand what you are getting at, and I do very much like
> that guarantee.

Right, so maybe we should update the completion comments a bit to call
out this property, because I'm not sure it's there.

Also, with this, I think the smp_store_release() / smp_load_acquire() is
a perfectly fine abstraction of it, I don't think the model needs to be
taught about the completion interface.

> > Why not? In what sort of cases does it go wobbly?
>
> For one, when it conflicts with maintainability.  For example, it would
> probably be OK for some of RCU's rcu_node ->lock acquisitions to skip the
> smp_mb__after_unlock_lock() invocations.  But those are slowpaths, and the
> small speedup on only one architecture is just not worth the added pain.
> Especially given the nice wrapper functions that you provided.
>
> But of course if this were instead (say) rcu_read_lock() or common-case
> rcu_read_unlock(), I would be willing to undergo much more pain.  On the
> other hand, for that exact reason, that common-case code path doesn't
> acquire locks in the first place.  ;-)

Ah, so for models I would go absolutely minimal; it helps understand
what the strict requirements are and where we over-provide etc..

For actual code you're entirely right, there's no point in trying to be
cute with the rcu-node locks. Simplicity rules etc..
Re: [PATCH tip/core/rcu 1/9] rcu: Provide GP ordering in face of migrations and delays
On Sat, Oct 07, 2017 at 11:29:19AM +0200, Peter Zijlstra wrote:
> On Fri, Oct 06, 2017 at 08:31:05PM -0700, Paul E. McKenney wrote:
> > > > OK, I will bite...  What do the smp_store_release() and the
> > > > smp_load_acquire() correspond to?  I see just plain locking in
> > > > wait_for_completion() and complete().
> > >
> > > They reflect the concept of complete() / wait_for_completion().
> > > Fundamentally all it needs to do is pass the message of 'completion'.
> > >
> > > That is, if we were to go optimize our completion implementation, it
> > > would be impossible to be weaker than this and still correct.
> >
> > OK, though the model does not provide spinlocks, and there can be

Sigh.  s/not//.  The current model -does- provide spinlocks, though they
are a bit new.  I don't know of any breakage, but I am paranoid enough
so that where feasible I double-check against xchg_acquire() and
store_release().

> > differences in behavior between spinlocks and release-acquire.
> > But yes, in this case, it works.
>
> Sure; but the fundamental property here is that if we observe the
> complete() we must also observe everything that went before. The exact
> means of implementing that is irrelevant.

Agreed, and that also tends to speed up the running of the model on the
litmus test, so this sort of abstraction is a very good thing for
multiple reasons.

So why did I use spinlocks?  Because the model was small and fast
enough, and using the spinlocks meant that I didn't need to take time to
worry about the code's intent.

But if you are saying that it would be good to have wait_for_completion()
and complete() directly modeled at some point, no argument.  In addition,
I hope that the memory model is applied to other tools that analyze
kernel code.

> > > > So I dropped that patch yesterday.  The main thing I was missing was
> > > > that there is no ordering-free fastpath in wait_for_completion() and
> > > > complete(): Each unconditionally acquires the lock.  So the smp_mb()
> > > > that I was trying to add doesn't need to be there.
> > >
> > > Going by the above, it never needs to be there, even if there was a
> > > lock-free fast-path.
> >
> > Given that wait_for_completion()/complete() both acquire the same lock,
> > yes, and agreed, if it were lockless but provided the release and
> > acquire ordering, then yes.
>
> I'm not sure I got the point across; so I'll try once more. Without
> providing this ordering the completion would be fundamentally broken. It
> _must_ provide this ordering.

OK, I now understand what you are getting at, and I do very much like
that guarantee.

> > But if it was instead structured like
> > wait_event()/wake_up(), there would be ordering only if the caller
> > supplied it.
>
> Right, wait_event()/wake_up() are different in that the 'condition'
> variable is external to the abstraction and thus it cannot help.
>
> All wait_event()/wake_up() can guarantee is that IFF it does a wakeup,
> the woken thread will observe the prior state of the waker. But given
> the actual condition is external and we might not hit the actual sleep
> case, there are no guarantees.

Agreed.

> > All that aside, paring the ordering down to the bare minimum is not
> > always the right approach.
>
> Why not? In what sort of cases does it go wobbly?

For one, when it conflicts with maintainability.  For example, it would
probably be OK for some of RCU's rcu_node ->lock acquisitions to skip the
smp_mb__after_unlock_lock() invocations.  But those are slowpaths, and the
small speedup on only one architecture is just not worth the added pain.
Especially given the nice wrapper functions that you provided.

But of course if this were instead (say) rcu_read_lock() or common-case
rcu_read_unlock(), I would be willing to undergo much more pain.  On the
other hand, for that exact reason, that common-case code path doesn't
acquire locks in the first place.  ;-)

							Thanx, Paul
Re: [PATCH tip/core/rcu 1/9] rcu: Provide GP ordering in face of migrations and delays
On Fri, Oct 06, 2017 at 08:31:05PM -0700, Paul E. McKenney wrote:
> > > OK, I will bite...  What do the smp_store_release() and the
> > > smp_load_acquire() correspond to?  I see just plain locking in
> > > wait_for_completion() and complete().
> >
> > They reflect the concept of complete() / wait_for_completion().
> > Fundamentally all it needs to do is pass the message of 'completion'.
> >
> > That is, if we were to go optimize our completion implementation, it
> > would be impossible to be weaker than this and still correct.
>
> OK, though the model does not provide spinlocks, and there can be
> differences in behavior between spinlocks and release-acquire.
> But yes, in this case, it works.

Sure; but the fundamental property here is that if we observe the
complete() we must also observe everything that went before.  The exact
means of implementing that is irrelevant.

> > > So I dropped that patch yesterday.  The main thing I was missing was
> > > that there is no ordering-free fastpath in wait_for_completion() and
> > > complete(): Each unconditionally acquires the lock.  So the smp_mb()
> > > that I was trying to add doesn't need to be there.
> >
> > Going by the above, it never needs to be there, even if there was a
> > lock-free fast-path.
>
> Given that wait_for_completion()/complete() both acquire the same lock,
> yes, and agreed, if it were lockless but provided the release and
> acquire ordering, then yes.

I'm not sure I got the point across; so I'll try once more.  Without
providing this ordering the completion would be fundamentally broken.
It _must_ provide this ordering.

> But if it was instead structured like
> wait_event()/wake_up(), there would be ordering only if the caller
> supplied it.

Right, wait_event()/wake_up() are different in that the 'condition'
variable is external to the abstraction and thus it cannot help.

All wait_event()/wake_up() can guarantee is that IFF it does a wakeup,
the woken thread will observe the prior state of the waker.  But given
the actual condition is external and we might not hit the actual sleep
case, there are no guarantees.

> All that aside, paring the ordering down to the bare minimum is not
> always the right approach.

Why not? In what sort of cases does it go wobbly?
Re: [PATCH tip/core/rcu 1/9] rcu: Provide GP ordering in face of migrations and delays
On Fri, Oct 06, 2017 at 10:15:37PM +0200, Peter Zijlstra wrote:
> On Fri, Oct 06, 2017 at 12:18:22PM -0700, Paul E. McKenney wrote:
> > > /me goes and install this herd thing again.. I'm sure I had it running
> > > _somewhere_.. A well.
> > >
> > > C C-PaulEMcKenney-W+RWC4+2017-10-05
> > >
> > > {
> > > }
> > >
> > > P0(int *a, int *x)
> > > {
> > > 	WRITE_ONCE(*a, 1);
> > > 	smp_mb(); /* Lock acquisition for rcu_node ->lock. */
> > > 	WRITE_ONCE(*x, 1);
> > > }
> > >
> > > P1(int *x, int *y)
> > > {
> > > 	r3 = READ_ONCE(*x);
> > > 	smp_mb(); /* Lock acquisition for rcu_node ->lock. */
> > > 	smp_store_release(y, 1);
> > > }
> > >
> > > P2(int *y, int *b)
> > > {
> > > 	r4 = smp_load_acquire(y);
> > > 	r1 = READ_ONCE(*b);
> > > }
> > >
> > > P3(int *b, int *a)
> > > {
> > > 	WRITE_ONCE(*b, 1);
> > > 	smp_mb();
> > > 	r2 = READ_ONCE(*a);
> > > }
> > >
> > > exists (1:r3=1 /\ 2:r4=1 /\ 2:r1=0 /\ 3:r2=0)
> > >
> > > Is what I was thinking of, I think that is the minimal ordering
> > > complete()/wait_for_completion() need to provide.
> >
> > OK, I will bite...  What do the smp_store_release() and the
> > smp_load_acquire() correspond to?  I see just plain locking in
> > wait_for_completion() and complete().
>
> They reflect the concept of complete() / wait_for_completion().
> Fundamentally all it needs to do is pass the message of 'completion'.
>
> That is, if we were to go optimize our completion implementation, it
> would be impossible to be weaker than this and still correct.

OK, though the model does not provide spinlocks, and there can be
differences in behavior between spinlocks and release-acquire.
But yes, in this case, it works.

> > So I dropped that patch yesterday.  The main thing I was missing was
> > that there is no ordering-free fastpath in wait_for_completion() and
> > complete(): Each unconditionally acquires the lock.  So the smp_mb()
> > that I was trying to add doesn't need to be there.
>
> Going by the above, it never needs to be there, even if there was a
> lock-free fast-path.

Given that wait_for_completion()/complete() both acquire the same lock,
yes, and agreed, if it were lockless but provided the release and
acquire ordering, then yes.  But if it was instead structured like
wait_event()/wake_up(), there would be ordering only if the caller
supplied it.

All that aside, paring the ordering down to the bare minimum is not
always the right approach.  Nevertheless, in this particular case, there
is plenty of ordering, so yet again, I have dropped this commit.  Like
yesterday.  ;-)

							Thanx, Paul
Re: [PATCH tip/core/rcu 1/9] rcu: Provide GP ordering in face of migrations and delays
On Fri, Oct 06, 2017 at 12:18:22PM -0700, Paul E. McKenney wrote: > > /me goes and install this herd thing again.. I'm sure I had it running > > _somewhere_.. A well. > > > > C C-PaulEMcKenney-W+RWC4+2017-10-05 > > > > { > > } > > > > P0(int *a, int *x) > > { > > WRITE_ONCE(*a, 1); > > smp_mb(); /* Lock acquisition for rcu_node ->lock. */ > > WRITE_ONCE(*x, 1); > > } > > > > P1(int *x, int *y) > > { > > r3 = READ_ONCE(*x); > > smp_mb(); /* Lock acquisition for rcu_node ->lock. */ > > smp_store_release(y, 1); > > } > > > > P2(int *y, int *b) > > { > > r4 = smp_load_acquire(y); > > r1 = READ_ONCE(*b); > > } > > > > P3(int *b, int *a) > > { > > WRITE_ONCE(*b, 1); > > smp_mb(); > > r2 = READ_ONCE(*a); > > } > > > > exists (1:r3=1 /\ 2:r4=1 /\ 2:r1=0 /\ 3:r2=0) > > > > > > Is what I was thinking of, I think that is the minimal ordering > > complete()/wait_for_completion() need to provide. > > OK, I will bite... What do the smp_store_release() and the > smp_load_acquire() correspond to? I see just plain locking in > wait_for_completion() and complete(). They reflect the concept of complete() / wait_for_completion(). Fundamentally all it needs to do is pass the message of 'completion'. That is, if we were to go optimize our completion implementation, it would be impossible to be weaker than this and still correct. > So I dropped that patch yesterday. The main thing I was missing was > that there is no ordering-free fastpath in wait_for_completion() and > complete(): Each unconditionally acquires the lock. So the smp_mb() > that I was trying to add doesn't need to be there. Going by the above, it never needs to be there, even if there was a lock-free fast-path.
Re: [PATCH tip/core/rcu 1/9] rcu: Provide GP ordering in face of migrations and delays
On Fri, Oct 06, 2017 at 11:07:23AM +0200, Peter Zijlstra wrote: > On Thu, Oct 05, 2017 at 11:22:04AM -0700, Paul E. McKenney wrote: > > Hmmm... Here is what I was worried about: > > > > C C-PaulEMcKenney-W+RWC4+2017-10-05 > > > > { > > } > > > > P0(int *a, int *x) > > { > > WRITE_ONCE(*a, 1); > > smp_mb(); /* Lock acquisition for rcu_node ->lock. */ > > WRITE_ONCE(*x, 1); > > } > > > > P1(int *x, int *y, spinlock_t *l) > > { > > r3 = READ_ONCE(*x); > > smp_mb(); /* Lock acquisition for rcu_node ->lock. */ > > spin_lock(l); /* Locking in complete(). */ > > WRITE_ONCE(*y, 1); > > spin_unlock(l); > > } > > > > P2(int *y, int *b, spinlock_t *l) > > { > > spin_lock(l); /* Locking in wait_for_completion. */ > > r4 = READ_ONCE(*y); > > spin_unlock(l); > > r1 = READ_ONCE(*b); > > } > > > > P3(int *b, int *a) > > { > > WRITE_ONCE(*b, 1); > > smp_mb(); > > r2 = READ_ONCE(*a); > > } > > > > exists (1:r3=1 /\ 2:r4=1 /\ 2:r1=0 /\ 3:r2=0) > > /me goes and install this herd thing again.. I'm sure I had it running > _somewhere_.. A well. > > C C-PaulEMcKenney-W+RWC4+2017-10-05 > > { > } > > P0(int *a, int *x) > { > WRITE_ONCE(*a, 1); > smp_mb(); /* Lock acquisition for rcu_node ->lock. */ > WRITE_ONCE(*x, 1); > } > > P1(int *x, int *y) > { > r3 = READ_ONCE(*x); > smp_mb(); /* Lock acquisition for rcu_node ->lock. */ > smp_store_release(y, 1); > } > > P2(int *y, int *b) > { > r4 = smp_load_acquire(y); > r1 = READ_ONCE(*b); > } > > P3(int *b, int *a) > { > WRITE_ONCE(*b, 1); > smp_mb(); > r2 = READ_ONCE(*a); > } > > exists (1:r3=1 /\ 2:r4=1 /\ 2:r1=0 /\ 3:r2=0) > > > Is what I was thinking of, I think that is the minimal ordering > complete()/wait_for_completion() need to provide. OK, I will bite... What do the smp_store_release() and the smp_load_acquire() correspond to? I see just plain locking in wait_for_completion() and complete(). > (also, that r# numbering confuses the hell out of me, its not related to > P nor to the variables) Yeah, it is random, sorry!!! 
> Test C-PaulEMcKenney-W+RWC4+2017-10-05 Allowed > States 15 > 1:r3=0; 2:r1=0; 2:r4=0; 3:r2=0; > 1:r3=0; 2:r1=0; 2:r4=0; 3:r2=1; > 1:r3=0; 2:r1=0; 2:r4=1; 3:r2=0; > 1:r3=0; 2:r1=0; 2:r4=1; 3:r2=1; > 1:r3=0; 2:r1=1; 2:r4=0; 3:r2=0; > 1:r3=0; 2:r1=1; 2:r4=0; 3:r2=1; > 1:r3=0; 2:r1=1; 2:r4=1; 3:r2=0; > 1:r3=0; 2:r1=1; 2:r4=1; 3:r2=1; > 1:r3=1; 2:r1=0; 2:r4=0; 3:r2=0; > 1:r3=1; 2:r1=0; 2:r4=0; 3:r2=1; > 1:r3=1; 2:r1=0; 2:r4=1; 3:r2=1; > 1:r3=1; 2:r1=1; 2:r4=0; 3:r2=0; > 1:r3=1; 2:r1=1; 2:r4=0; 3:r2=1; > 1:r3=1; 2:r1=1; 2:r4=1; 3:r2=0; > 1:r3=1; 2:r1=1; 2:r4=1; 3:r2=1; > No > Witnesses > Positive: 0 Negative: 15 > Condition exists (1:r3=1 /\ 2:r4=1 /\ 2:r1=0 /\ 3:r2=0) > Observation C-PaulEMcKenney-W+RWC4+2017-10-05 Never 0 15 > Time C-PaulEMcKenney-W+RWC4+2017-10-05 0.04 > Hash=f7f8ad6eab33e90718a394bcb021557d But yes, looking closer, this corresponds to the rule of thumb about non-rf relations and full memory barriers. We have two non-rf relations (P2->P3 and P3->P0), so we need two full barriers, one each between the non-rf relations. So I dropped that patch yesterday. The main thing I was missing was that there is no ordering-free fastpath in wait_for_completion() and complete(): Each unconditionally acquires the lock. So the smp_mb() that I was trying to add doesn't need to be there. Thanx, Paul
Re: [PATCH tip/core/rcu 1/9] rcu: Provide GP ordering in face of migrations and delays
On Thu, Oct 05, 2017 at 11:22:04AM -0700, Paul E. McKenney wrote: > Hmmm... Here is what I was worried about: > > C C-PaulEMcKenney-W+RWC4+2017-10-05 > > { > } > > P0(int *a, int *x) > { > WRITE_ONCE(*a, 1); > smp_mb(); /* Lock acquisition for rcu_node ->lock. */ > WRITE_ONCE(*x, 1); > } > > P1(int *x, int *y, spinlock_t *l) > { > r3 = READ_ONCE(*x); > smp_mb(); /* Lock acquisition for rcu_node ->lock. */ > spin_lock(l); /* Locking in complete(). */ > WRITE_ONCE(*y, 1); > spin_unlock(l); > } > > P2(int *y, int *b, spinlock_t *l) > { > spin_lock(l); /* Locking in wait_for_completion. */ > r4 = READ_ONCE(*y); > spin_unlock(l); > r1 = READ_ONCE(*b); > } > > P3(int *b, int *a) > { > WRITE_ONCE(*b, 1); > smp_mb(); > r2 = READ_ONCE(*a); > } > > exists (1:r3=1 /\ 2:r4=1 /\ 2:r1=0 /\ 3:r2=0) /me goes and install this herd thing again.. I'm sure I had it running _somewhere_.. A well. C C-PaulEMcKenney-W+RWC4+2017-10-05 { } P0(int *a, int *x) { WRITE_ONCE(*a, 1); smp_mb(); /* Lock acquisition for rcu_node ->lock. */ WRITE_ONCE(*x, 1); } P1(int *x, int *y) { r3 = READ_ONCE(*x); smp_mb(); /* Lock acquisition for rcu_node ->lock. */ smp_store_release(y, 1); } P2(int *y, int *b) { r4 = smp_load_acquire(y); r1 = READ_ONCE(*b); } P3(int *b, int *a) { WRITE_ONCE(*b, 1); smp_mb(); r2 = READ_ONCE(*a); } exists (1:r3=1 /\ 2:r4=1 /\ 2:r1=0 /\ 3:r2=0) Is what I was thinking of, I think that is the minimal ordering complete()/wait_for_completion() need to provide. 
(also, that r# numbering confuses the hell out of me, its not related to P nor to the variables) Test C-PaulEMcKenney-W+RWC4+2017-10-05 Allowed States 15 1:r3=0; 2:r1=0; 2:r4=0; 3:r2=0; 1:r3=0; 2:r1=0; 2:r4=0; 3:r2=1; 1:r3=0; 2:r1=0; 2:r4=1; 3:r2=0; 1:r3=0; 2:r1=0; 2:r4=1; 3:r2=1; 1:r3=0; 2:r1=1; 2:r4=0; 3:r2=0; 1:r3=0; 2:r1=1; 2:r4=0; 3:r2=1; 1:r3=0; 2:r1=1; 2:r4=1; 3:r2=0; 1:r3=0; 2:r1=1; 2:r4=1; 3:r2=1; 1:r3=1; 2:r1=0; 2:r4=0; 3:r2=0; 1:r3=1; 2:r1=0; 2:r4=0; 3:r2=1; 1:r3=1; 2:r1=0; 2:r4=1; 3:r2=1; 1:r3=1; 2:r1=1; 2:r4=0; 3:r2=0; 1:r3=1; 2:r1=1; 2:r4=0; 3:r2=1; 1:r3=1; 2:r1=1; 2:r4=1; 3:r2=0; 1:r3=1; 2:r1=1; 2:r4=1; 3:r2=1; No Witnesses Positive: 0 Negative: 15 Condition exists (1:r3=1 /\ 2:r4=1 /\ 2:r1=0 /\ 3:r2=0) Observation C-PaulEMcKenney-W+RWC4+2017-10-05 Never 0 15 Time C-PaulEMcKenney-W+RWC4+2017-10-05 0.04 Hash=f7f8ad6eab33e90718a394bcb021557d
Re: [PATCH tip/core/rcu 1/9] rcu: Provide GP ordering in face of migrations and delays
On Thu, Oct 05, 2017 at 06:25:14PM +0200, Peter Zijlstra wrote: > On Thu, Oct 05, 2017 at 09:19:09AM -0700, Paul E. McKenney wrote: > > > > Yes, the ordering does need to be visible to uninvolved CPUs, so > > release-acquire is not necessarily strong enough. > > Why? Because I'm not at all seeing why it needs more. Your Changelog > doesn't provide enough hints. Hmmm... Here is what I was worried about: C C-PaulEMcKenney-W+RWC4+2017-10-05 { } P0(int *a, int *x) { WRITE_ONCE(*a, 1); smp_mb(); /* Lock acquisition for rcu_node ->lock. */ WRITE_ONCE(*x, 1); } P1(int *x, int *y, spinlock_t *l) { r3 = READ_ONCE(*x); smp_mb(); /* Lock acquisition for rcu_node ->lock. */ spin_lock(l); /* Locking in complete(). */ WRITE_ONCE(*y, 1); spin_unlock(l); } P2(int *y, int *b, spinlock_t *l) { spin_lock(l); /* Locking in wait_for_completion. */ r4 = READ_ONCE(*y); spin_unlock(l); r1 = READ_ONCE(*b); } P3(int *b, int *a) { WRITE_ONCE(*b, 1); smp_mb(); r2 = READ_ONCE(*a); } exists (1:r3=1 /\ 2:r4=1 /\ 2:r1=0 /\ 3:r2=0) My concern was that P2()'s read from b could be pulled into the lock's critical section and reordered with the read from y. But even if you physically rearrange the code, the cycle is forbidden. Removing P2()'s lock acquisition and release does allow the cycle. I get the same result when replacing spin_lock() with xchg_acquire() and spin_unlock() with smp_store_release(). I can drop P1()'s smp_mb() (but otherwise like the original above) and the cycle is forbidden. Removing P1()'s lock acquisition and release, but leaving the smp_mb(), does allow the cycle. So it looks like I can drop this patch entirely. Though somewhat nervously, despite the evidence that I have ample redundancy in ordering. ;-) So, thank you for persisting on this one! Thanx, Paul
Re: [PATCH tip/core/rcu 1/9] rcu: Provide GP ordering in face of migrations and delays
On Thu, Oct 05, 2017 at 09:19:09AM -0700, Paul E. McKenney wrote: > > Yes, the ordering does need to be visible to uninvolved CPUs, so > release-acquire is not necessarily strong enough. Why? Because I'm not at all seeing why it needs more. Your Changelog doesn't provide enough hints.
Re: [PATCH tip/core/rcu 1/9] rcu: Provide GP ordering in face of migrations and delays
On Thu, Oct 05, 2017 at 05:39:13PM +0200, Peter Zijlstra wrote: > On Thu, Oct 05, 2017 at 07:55:13AM -0700, Paul E. McKenney wrote: > > On Thu, Oct 05, 2017 at 11:41:14AM +0200, Peter Zijlstra wrote: > > > On Wed, Oct 04, 2017 at 02:29:27PM -0700, Paul E. McKenney wrote: > > > > Consider the following admittedly improbable sequence of events: > > > > > > > > o RCU is initially idle. > > > > > > > > o Task A on CPU 0 executes rcu_read_lock(). > > > > > > > > o Task B on CPU 1 executes synchronize_rcu(), which must > > > > wait on Task A: > > > > > > > > o Task B registers the callback, which starts a new > > > > grace period, awakening the grace-period kthread > > > > on CPU 3, which immediately starts a new grace period. > > > > > > > > o Task B migrates to CPU 2, which provides a quiescent > > > > state for both CPUs 1 and 2. > > > > > > > > o Both CPUs 1 and 2 take scheduling-clock interrupts, > > > > and both invoke RCU_SOFTIRQ, both thus learning of the > > > > new grace period. > > > > > > > > o Task B is delayed, perhaps by vCPU preemption on CPU 2. > > > > > > > > o CPUs 2 and 3 pass through quiescent states, which are reported > > > > to core RCU. > > > > > > > > o Task B is resumed just long enough to be migrated to CPU 3, > > > > and then is once again delayed. > > > > > > > > o Task A executes rcu_read_unlock(), exiting its RCU read-side > > > > critical section. > > > > > > > > o CPU 0 passes through a quiescent sate, which is reported to > > > > core RCU. Only CPU 1 continues to block the grace period. > > > > > > > > o CPU 1 passes through a quiescent state, which is reported to > > > > core RCU. This ends the grace period, and CPU 1 therefore > > > > invokes its callbacks, one of which awakens Task B via > > > > complete(). > > > > > > > > o Task B resumes (still on CPU 3) and starts executing > > > > wait_for_completion(), which sees that the completion has > > > > already completed, and thus does not block. 
It returns from > > > > the synchronize_rcu() without any ordering against the > > > > end of Task A's RCU read-side critical section. > > > > > > > > It can therefore mess up Task A's RCU read-side critical > > > > section, > > > > in theory, anyway. > > > > > > I'm not sure I follow, at the very least the wait_for_completion() does > > > an ACQUIRE such that it observes the state prior to the RELEASE as done > > > by complete(), no? > > > > Your point being that both wait_for_completion() and complete() acquire > > and release the same lock? (Yes, I suspect that I was confusing this > > with wait_event() and wake_up(), just so you know.) > > Well, fundamentally complete()/wait_for_completion() is a message-pass > and they include a RELEASE/ACQUIRE pair for causal reasons. > > Per the implementation they use a spinlock, but any implementation needs > to provide at least that RELEASE/ACQUIRE pair. > > > > And is not CPU0's QS reporting ordered against that complete()? > > > > Mumble mumble mumble powerpc mumble mumble mumble... > > > > OK, I will make this new memory barrier only execute for powerpc. > > > > Or am I missing something else here? > > So I'm not entirely clear on the required semantics here; why do we need > a full mb? I'm thinking CPU0's QS propagating through the tree and > arriving at the root node is a multi-copy-atomic / transitive thing and > all CPUs will agree the system QS has ended, right? > > Whichever CPU establishes the system QS does complete() and the > wait_for_completion() then has the weak-transitive causal relation to > that, ensuring that -- in the above example -- CPU3 must be _after_ > CPU0's rcu_read_unlock(). Yes, the ordering does need to be visible to uninvolved CPUs, so release-acquire is not necessarily strong enough. My current thought is like this: if (IS_ENABLED(CONFIG_ARCH_WEAK_RELEASE_ACQUIRE)) smp_mb(); Thoughts? Thanx, Paul
Re: [PATCH tip/core/rcu 1/9] rcu: Provide GP ordering in face of migrations and delays
On Thu, Oct 05, 2017 at 07:55:13AM -0700, Paul E. McKenney wrote:
> On Thu, Oct 05, 2017 at 11:41:14AM +0200, Peter Zijlstra wrote:
> > On Wed, Oct 04, 2017 at 02:29:27PM -0700, Paul E. McKenney wrote:
> > > Consider the following admittedly improbable sequence of events:
> > >
> > > o  RCU is initially idle.
> > >
> > > o  Task A on CPU 0 executes rcu_read_lock().
> > >
> > > o  Task B on CPU 1 executes synchronize_rcu(), which must
> > >    wait on Task A:
> > >
> > > o  Task B registers the callback, which starts a new
> > >    grace period, awakening the grace-period kthread
> > >    on CPU 3, which immediately starts a new grace period.
> > >
> > > o  Task B migrates to CPU 2, which provides a quiescent
> > >    state for both CPUs 1 and 2.
> > >
> > > o  Both CPUs 1 and 2 take scheduling-clock interrupts,
> > >    and both invoke RCU_SOFTIRQ, both thus learning of the
> > >    new grace period.
> > >
> > > o  Task B is delayed, perhaps by vCPU preemption on CPU 2.
> > >
> > > o  CPUs 2 and 3 pass through quiescent states, which are reported
> > >    to core RCU.
> > >
> > > o  Task B is resumed just long enough to be migrated to CPU 3,
> > >    and then is once again delayed.
> > >
> > > o  Task A executes rcu_read_unlock(), exiting its RCU read-side
> > >    critical section.
> > >
> > > o  CPU 0 passes through a quiescent state, which is reported to
> > >    core RCU.  Only CPU 1 continues to block the grace period.
> > >
> > > o  CPU 1 passes through a quiescent state, which is reported to
> > >    core RCU.  This ends the grace period, and CPU 1 therefore
> > >    invokes its callbacks, one of which awakens Task B via
> > >    complete().
> > >
> > > o  Task B resumes (still on CPU 3) and starts executing
> > >    wait_for_completion(), which sees that the completion has
> > >    already completed, and thus does not block.  It returns from
> > >    the synchronize_rcu() without any ordering against the
> > >    end of Task A's RCU read-side critical section.
> > >
> > > It can therefore mess up Task A's RCU read-side critical section,
> > > in theory, anyway.
> >
> > I'm not sure I follow, at the very least the wait_for_completion() does
> > an ACQUIRE such that it observes the state prior to the RELEASE as done
> > by complete(), no?
>
> Your point being that both wait_for_completion() and complete() acquire
> and release the same lock?  (Yes, I suspect that I was confusing this
> with wait_event() and wake_up(), just so you know.)

Well, fundamentally complete()/wait_for_completion() is a message-pass
and they include a RELEASE/ACQUIRE pair for causal reasons.  Per the
implementation they use a spinlock, but any implementation needs to
provide at least that RELEASE/ACQUIRE pair.

> > And is not CPU0's QS reporting ordered against that complete()?
>
> Mumble mumble mumble powerpc mumble mumble mumble...
>
> OK, I will make this new memory barrier only execute for powerpc.
>
> Or am I missing something else here?

So I'm not entirely clear on the required semantics here; why do we need
a full mb?

I'm thinking CPU0's QS propagating through the tree and arriving at the
root node is a multi-copy-atomic / transitive thing and all CPUs will
agree the system QS has ended, right?

Whichever CPU establishes the system QS does complete() and the
wait_for_completion() then has the weak-transitive causal relation to
that, ensuring that -- in the above example -- CPU3 must be _after_
CPU0's rcu_read_unlock().
Re: [PATCH tip/core/rcu 1/9] rcu: Provide GP ordering in face of migrations and delays
On Thu, Oct 05, 2017 at 11:41:14AM +0200, Peter Zijlstra wrote:
> On Wed, Oct 04, 2017 at 02:29:27PM -0700, Paul E. McKenney wrote:
> > Consider the following admittedly improbable sequence of events:
> >
> > o  RCU is initially idle.
> >
> > o  Task A on CPU 0 executes rcu_read_lock().
> >
> > o  Task B on CPU 1 executes synchronize_rcu(), which must
> >    wait on Task A:
> >
> > o  Task B registers the callback, which starts a new
> >    grace period, awakening the grace-period kthread
> >    on CPU 3, which immediately starts a new grace period.
> >
> > o  Task B migrates to CPU 2, which provides a quiescent
> >    state for both CPUs 1 and 2.
> >
> > o  Both CPUs 1 and 2 take scheduling-clock interrupts,
> >    and both invoke RCU_SOFTIRQ, both thus learning of the
> >    new grace period.
> >
> > o  Task B is delayed, perhaps by vCPU preemption on CPU 2.
> >
> > o  CPUs 2 and 3 pass through quiescent states, which are reported
> >    to core RCU.
> >
> > o  Task B is resumed just long enough to be migrated to CPU 3,
> >    and then is once again delayed.
> >
> > o  Task A executes rcu_read_unlock(), exiting its RCU read-side
> >    critical section.
> >
> > o  CPU 0 passes through a quiescent state, which is reported to
> >    core RCU.  Only CPU 1 continues to block the grace period.
> >
> > o  CPU 1 passes through a quiescent state, which is reported to
> >    core RCU.  This ends the grace period, and CPU 1 therefore
> >    invokes its callbacks, one of which awakens Task B via
> >    complete().
> >
> > o  Task B resumes (still on CPU 3) and starts executing
> >    wait_for_completion(), which sees that the completion has
> >    already completed, and thus does not block.  It returns from
> >    the synchronize_rcu() without any ordering against the
> >    end of Task A's RCU read-side critical section.
> >
> > It can therefore mess up Task A's RCU read-side critical section,
> > in theory, anyway.
>
> I'm not sure I follow, at the very least the wait_for_completion() does
> an ACQUIRE such that it observes the state prior to the RELEASE as done
> by complete(), no?

Your point being that both wait_for_completion() and complete() acquire
and release the same lock?  (Yes, I suspect that I was confusing this
with wait_event() and wake_up(), just so you know.)

> And is not CPU0's QS reporting ordered against that complete()?

Mumble mumble mumble powerpc mumble mumble mumble...

OK, I will make this new memory barrier only execute for powerpc.

Or am I missing something else here?

							Thanx, Paul
Re: [PATCH tip/core/rcu 1/9] rcu: Provide GP ordering in face of migrations and delays
On Thu, 5 Oct 2017 15:40:42 +0200
Peter Zijlstra wrote:

> On Thu, Oct 05, 2017 at 09:17:03AM -0400, Steven Rostedt wrote:
> > On Wed, 4 Oct 2017 14:29:27 -0700
> > "Paul E. McKenney" wrote:
> >
> > > Consider the following admittedly improbable sequence of events:
> > >
> > > o  RCU is initially idle.
> > >
> > > o  Task A on CPU 0 executes rcu_read_lock().
> >
> > A starts rcu_read_lock() critical section.
> >
> > > o  Task B on CPU 1 executes synchronize_rcu(), which must
> > >    wait on Task A:
> >
> > B waits for A.
> >
> > > o  Task B registers the callback, which starts a new
> > >    grace period, awakening the grace-period kthread
> > >    on CPU 3, which immediately starts a new grace period.
> >
> > [ isn't B blocked (off rq)? How does it migrate? ]
>
> No, it's running synchronize_rcu() but hasn't blocked yet.  It would
> block on wait_for_completion(), but per the very last point, we'll
> observe the complete() before we block.

Hmm, maybe this should be stated as well, as it looked confusing.

> > > o  Task B migrates to CPU 2, which provides a quiescent
> > >    state for both CPUs 1 and 2.
> > >
> > > o  Both CPUs 1 and 2 take scheduling-clock interrupts,
> > >    and both invoke RCU_SOFTIRQ, both thus learning of the
> > >    new grace period.
> > >
> > > o  Task B is delayed, perhaps by vCPU preemption on CPU 2.
> > >
> > > o  CPUs 2 and 3 pass through quiescent states, which are reported
> > >    to core RCU.
> > >
> > > o  Task B is resumed just long enough to be migrated to CPU 3,
> > >    and then is once again delayed.
> > >
> > > o  Task A executes rcu_read_unlock(), exiting its RCU read-side
> > >    critical section.
> >
> > A calls rcu_read_unlock() ending the critical section
>
> The point is that rcu_read_unlock() doesn't have memory ordering.

Ah, that makes a difference.  Can we state that in the change log too?

> > > o  CPU 0 passes through a quiescent state, which is reported to
> > >    core RCU.  Only CPU 1 continues to block the grace period.
> > >
> > > o  CPU 1 passes through a quiescent state, which is reported to
> > >    core RCU.  This ends the grace period, and CPU 1 therefore
> > >    invokes its callbacks, one of which awakens Task B via
> > >    complete().
> > >
> > > o  Task B resumes (still on CPU 3) and starts executing
> > >    wait_for_completion(), which sees that the completion has
> > >    already completed, and thus does not block.  It returns from
> > >    the synchronize_rcu() without any ordering against the
> > >    end of Task A's RCU read-side critical section.
> >
> > B runs
> >
> > > It can therefore mess up Task A's RCU read-side critical section,
> > > in theory, anyway.
> >
> > I don't see how B ran during A's critical section.
>
> It didn't but that doesn't mean the memory ordering agrees.  What we
> require is B observes (per the memory ordering) everything up to and
> including the rcu_read_unlock().  This is not 'time' related.

OK, that makes sense.  It was just missing from the change log, so mere
mortals like myself can keep up.  (This is why I'm Batman.  I'm only
human but I can still compete with superheroes ;-)

> That said, I don't think it can actually happen, because CPU0's QS
> state is ordered against the complete and the wait_for_completion is
> ordered against the complete.

OK, but this doesn't hurt to have.  Just a paranoid check, which when it
comes to RCU, leaning toward paranoid is good.

-- Steve
Re: [PATCH tip/core/rcu 1/9] rcu: Provide GP ordering in face of migrations and delays
On Thu, Oct 05, 2017 at 09:17:03AM -0400, Steven Rostedt wrote:
> On Wed, 4 Oct 2017 14:29:27 -0700
> "Paul E. McKenney" wrote:
>
> > Consider the following admittedly improbable sequence of events:
> >
> > o  RCU is initially idle.
> >
> > o  Task A on CPU 0 executes rcu_read_lock().
>
> A starts rcu_read_lock() critical section.
>
> > o  Task B on CPU 1 executes synchronize_rcu(), which must
> >    wait on Task A:
>
> B waits for A.
>
> > o  Task B registers the callback, which starts a new
> >    grace period, awakening the grace-period kthread
> >    on CPU 3, which immediately starts a new grace period.
>
> [ isn't B blocked (off rq)? How does it migrate? ]

No, it's running synchronize_rcu() but hasn't blocked yet.  It would
block on wait_for_completion(), but per the very last point, we'll
observe the complete() before we block.

> > o  Task B migrates to CPU 2, which provides a quiescent
> >    state for both CPUs 1 and 2.
> >
> > o  Both CPUs 1 and 2 take scheduling-clock interrupts,
> >    and both invoke RCU_SOFTIRQ, both thus learning of the
> >    new grace period.
> >
> > o  Task B is delayed, perhaps by vCPU preemption on CPU 2.
> >
> > o  CPUs 2 and 3 pass through quiescent states, which are reported
> >    to core RCU.
> >
> > o  Task B is resumed just long enough to be migrated to CPU 3,
> >    and then is once again delayed.
> >
> > o  Task A executes rcu_read_unlock(), exiting its RCU read-side
> >    critical section.
>
> A calls rcu_read_unlock() ending the critical section

The point is that rcu_read_unlock() doesn't have memory ordering.

> > o  CPU 0 passes through a quiescent state, which is reported to
> >    core RCU.  Only CPU 1 continues to block the grace period.
> >
> > o  CPU 1 passes through a quiescent state, which is reported to
> >    core RCU.  This ends the grace period, and CPU 1 therefore
> >    invokes its callbacks, one of which awakens Task B via
> >    complete().
> >
> > o  Task B resumes (still on CPU 3) and starts executing
> >    wait_for_completion(), which sees that the completion has
> >    already completed, and thus does not block.  It returns from
> >    the synchronize_rcu() without any ordering against the
> >    end of Task A's RCU read-side critical section.
>
> B runs
>
> > It can therefore mess up Task A's RCU read-side critical section,
> > in theory, anyway.
>
> I don't see how B ran during A's critical section.

It didn't but that doesn't mean the memory ordering agrees.  What we
require is B observes (per the memory ordering) everything up to and
including the rcu_read_unlock().  This is not 'time' related.

That said, I don't think it can actually happen, because CPU0's QS state
is ordered against the complete and the wait_for_completion is ordered
against the complete.
Re: [PATCH tip/core/rcu 1/9] rcu: Provide GP ordering in face of migrations and delays
On Wed, 4 Oct 2017 14:29:27 -0700
"Paul E. McKenney" wrote:

> Consider the following admittedly improbable sequence of events:
>
> o  RCU is initially idle.
>
> o  Task A on CPU 0 executes rcu_read_lock().

A starts rcu_read_lock() critical section.

> o  Task B on CPU 1 executes synchronize_rcu(), which must
>    wait on Task A:

B waits for A.

> o  Task B registers the callback, which starts a new
>    grace period, awakening the grace-period kthread
>    on CPU 3, which immediately starts a new grace period.

[ isn't B blocked (off rq)? How does it migrate? ]

> o  Task B migrates to CPU 2, which provides a quiescent
>    state for both CPUs 1 and 2.
>
> o  Both CPUs 1 and 2 take scheduling-clock interrupts,
>    and both invoke RCU_SOFTIRQ, both thus learning of the
>    new grace period.
>
> o  Task B is delayed, perhaps by vCPU preemption on CPU 2.
>
> o  CPUs 2 and 3 pass through quiescent states, which are reported
>    to core RCU.
>
> o  Task B is resumed just long enough to be migrated to CPU 3,
>    and then is once again delayed.
>
> o  Task A executes rcu_read_unlock(), exiting its RCU read-side
>    critical section.

A calls rcu_read_unlock() ending the critical section

> o  CPU 0 passes through a quiescent state, which is reported to
>    core RCU.  Only CPU 1 continues to block the grace period.
>
> o  CPU 1 passes through a quiescent state, which is reported to
>    core RCU.  This ends the grace period, and CPU 1 therefore
>    invokes its callbacks, one of which awakens Task B via
>    complete().
>
> o  Task B resumes (still on CPU 3) and starts executing
>    wait_for_completion(), which sees that the completion has
>    already completed, and thus does not block.  It returns from
>    the synchronize_rcu() without any ordering against the
>    end of Task A's RCU read-side critical section.

B runs

> It can therefore mess up Task A's RCU read-side critical section,
> in theory, anyway.

I don't see how B ran during A's critical section.

-- Steve

> However, if CPU hotplug ever gets rid of stop_machine(), there will be
> more straightforward ways for this sort of thing to happen, so this
> commit adds a memory barrier in order to enforce the needed ordering.
Re: [PATCH tip/core/rcu 1/9] rcu: Provide GP ordering in face of migrations and delays
On Wed, Oct 04, 2017 at 02:29:27PM -0700, Paul E. McKenney wrote:
> Consider the following admittedly improbable sequence of events:
>
> o  RCU is initially idle.
>
> o  Task A on CPU 0 executes rcu_read_lock().
>
> o  Task B on CPU 1 executes synchronize_rcu(), which must
>    wait on Task A:
>
> o  Task B registers the callback, which starts a new
>    grace period, awakening the grace-period kthread
>    on CPU 3, which immediately starts a new grace period.
>
> o  Task B migrates to CPU 2, which provides a quiescent
>    state for both CPUs 1 and 2.
>
> o  Both CPUs 1 and 2 take scheduling-clock interrupts,
>    and both invoke RCU_SOFTIRQ, both thus learning of the
>    new grace period.
>
> o  Task B is delayed, perhaps by vCPU preemption on CPU 2.
>
> o  CPUs 2 and 3 pass through quiescent states, which are reported
>    to core RCU.
>
> o  Task B is resumed just long enough to be migrated to CPU 3,
>    and then is once again delayed.
>
> o  Task A executes rcu_read_unlock(), exiting its RCU read-side
>    critical section.
>
> o  CPU 0 passes through a quiescent state, which is reported to
>    core RCU.  Only CPU 1 continues to block the grace period.
>
> o  CPU 1 passes through a quiescent state, which is reported to
>    core RCU.  This ends the grace period, and CPU 1 therefore
>    invokes its callbacks, one of which awakens Task B via
>    complete().
>
> o  Task B resumes (still on CPU 3) and starts executing
>    wait_for_completion(), which sees that the completion has
>    already completed, and thus does not block.  It returns from
>    the synchronize_rcu() without any ordering against the
>    end of Task A's RCU read-side critical section.
>
> It can therefore mess up Task A's RCU read-side critical section,
> in theory, anyway.

I'm not sure I follow, at the very least the wait_for_completion() does
an ACQUIRE such that it observes the state prior to the RELEASE as done
by complete(), no?

And is not CPU0's QS reporting ordered against that complete()?