Re: [PATCH v9 7/7] rcu/segcblist: Add additional comments to explain smp_mb()

2020-11-09 Thread Paul E. McKenney
On Fri, Nov 06, 2020 at 05:41:41PM -0500, Joel Fernandes wrote:
> On Thu, Nov 05, 2020 at 10:55:51AM -0800, Paul E. McKenney wrote:
> > On Tue, Nov 03, 2020 at 09:26:03AM -0500, Joel Fernandes (Google) wrote:
> > > Memory barriers are needed when updating the full length of the
> > > segcblist, however it is not fully clearly why one is needed before and
> > > after. This patch therefore adds additional comments to the function
> > > header to explain it.
> > > 
> > > Signed-off-by: Joel Fernandes (Google) 
> > 
> > Looks good, thank you!  As always, I could not resist the urge to
> > do a bit of wordsmithing, so that the queued commit is as shown
> > below.  Please let me know if I messed anything up.
> 
> > Thanx, Paul
> > 
> > 
> > 
> > commit 7dac7adefcae7558b3a85a16f51186d621623733
> > Author: Joel Fernandes (Google) 
> > Date:   Tue Nov 3 09:26:03 2020 -0500
> > 
> > rcu/segcblist: Add additional comments to explain smp_mb()
> > 
> > One counter-intuitive property of RCU is the fact that full memory
> > barriers are needed both before and after updates to the full
> > (non-segmented) length.  This patch therefore helps to assist the
> > reader's intuition by adding appropriate comments.
> > 
> > [ paulmck:  Wordsmithing. ]
> > Signed-off-by: Joel Fernandes (Google) 
> > Signed-off-by: Paul E. McKenney 
> > 
> > diff --git a/kernel/rcu/rcu_segcblist.c b/kernel/rcu/rcu_segcblist.c
> > index bb246d8..b6dda7c 100644
> > --- a/kernel/rcu/rcu_segcblist.c
> > +++ b/kernel/rcu/rcu_segcblist.c
> > @@ -94,17 +94,77 @@ static void rcu_segcblist_set_len(struct rcu_segcblist 
> > *rsclp, long v)
> >   * field to disagree with the actual number of callbacks on the structure.
> >   * This increase is fully ordered with respect to the callers accesses
> >   * both before and after.
> > + *
> > + * So why on earth is a memory barrier required both before and after
> > + * the update to the ->len field???
> > + *
> > + * The reason is that rcu_barrier() locklessly samples each CPU's ->len
> > + * field, and if a given CPU's field is zero, avoids IPIing that CPU.
> > + * This can of course race with both queuing and invoking of callbacks.
> > + * Failng to correctly handle either of these races could result in
> > + * rcu_barrier() failing to IPI a CPU that actually had callbacks queued
> > + * which rcu_barrier() was obligated to wait on.  And if rcu_barrier()
> > + * failed to wait on such a callback, unloading certain kernel modules
> > + * would result in calls to functions whose code was no longer present in
> > + * the kernel, for but one example.
> > + *
> > + * Therefore, ->len transitions from 1->0 and 0->1 have to be carefully
> > + * ordered with respect with both list modifications and the rcu_barrier().
> > + *
> > + * The queuing case is CASE 1 and the invoking case is CASE 2.
> > + *
> > + * CASE 1: Suppose that CPU 0 has no callbacks queued, but invokes
> > + * call_rcu() just as CPU 1 invokes rcu_barrier().  CPU 0's ->len field
> > + * will transition from 0->1, which is one of the transitions that must be
> > + * handled carefully.  Without the full memory barriers before the ->len
> > + * update and at the beginning of rcu_barrier(), the following could 
> > happen:
> > + *
> > + * CPU 0   CPU 1
> > + *
> > + * call_rcu().
> > + * rcu_barrier() sees ->len as 0.
> > + * set ->len = 1.
> > + * rcu_barrier() does nothing.
> > + * module is unloaded.
> > + * callback invokes unloaded function!
> > + *
> > + * With the full barriers, any case where rcu_barrier() sees ->len as 0 
> > will
> > + * have unambiguously preceded the return from the racing call_rcu(), which
> > + * means that this call_rcu() invocation is OK to not wait on.  After all,
> > + * you are supposed to make sure that any problematic call_rcu() 
> > invocations
> > + * happen before the rcu_barrier().
> 
> Unfortunately, I did not understand your explanation. To me the barrier
> *before* the setting of length is needed on CPU0 only for 1->0 transition
> (Dequeue). Where as in
> your example above, it is for enqueue.
> 
> This was case 1 in my patch:
> 
> + * To illustrate the problematic scenario to avoid:
> + * P0 (what P1 sees) P1
> + * set len = 0
> + *  rcu_barrier sees len as 0
> + * dequeue from list
> + *  rcu_barrier does nothing.
> + *
> 
> 
> Here, P1 should see the transition of 1->0 *after* the CB is dequeued. Which
> means you needed a memory barrier *before* the setting of len from 1->0 and
> *after* the dequeue. IOW, rcu_barrier should 'see' the memory ordering as:
> 
> 1. dequeue
> 2. set len from 1 -> 0.
> 
> For the enqueue case, it is the reverse, rcu_barrier should see:
> 1. set len from 0 -> 1
> 

Re: [PATCH v9 7/7] rcu/segcblist: Add additional comments to explain smp_mb()

2020-11-06 Thread Joel Fernandes
On Thu, Nov 05, 2020 at 10:55:51AM -0800, Paul E. McKenney wrote:
> On Tue, Nov 03, 2020 at 09:26:03AM -0500, Joel Fernandes (Google) wrote:
> > Memory barriers are needed when updating the full length of the
> > segcblist, however it is not fully clearly why one is needed before and
> > after. This patch therefore adds additional comments to the function
> > header to explain it.
> > 
> > Signed-off-by: Joel Fernandes (Google) 
> 
> Looks good, thank you!  As always, I could not resist the urge to
> do a bit of wordsmithing, so that the queued commit is as shown
> below.  Please let me know if I messed anything up.

>   Thanx, Paul
> 
> 
> 
> commit 7dac7adefcae7558b3a85a16f51186d621623733
> Author: Joel Fernandes (Google) 
> Date:   Tue Nov 3 09:26:03 2020 -0500
> 
> rcu/segcblist: Add additional comments to explain smp_mb()
> 
> One counter-intuitive property of RCU is the fact that full memory
> barriers are needed both before and after updates to the full
> (non-segmented) length.  This patch therefore helps to assist the
> reader's intuition by adding appropriate comments.
> 
> [ paulmck:  Wordsmithing. ]
> Signed-off-by: Joel Fernandes (Google) 
> Signed-off-by: Paul E. McKenney 
> 
> diff --git a/kernel/rcu/rcu_segcblist.c b/kernel/rcu/rcu_segcblist.c
> index bb246d8..b6dda7c 100644
> --- a/kernel/rcu/rcu_segcblist.c
> +++ b/kernel/rcu/rcu_segcblist.c
> @@ -94,17 +94,77 @@ static void rcu_segcblist_set_len(struct rcu_segcblist 
> *rsclp, long v)
>   * field to disagree with the actual number of callbacks on the structure.
>   * This increase is fully ordered with respect to the callers accesses
>   * both before and after.
> + *
> + * So why on earth is a memory barrier required both before and after
> + * the update to the ->len field???
> + *
> + * The reason is that rcu_barrier() locklessly samples each CPU's ->len
> + * field, and if a given CPU's field is zero, avoids IPIing that CPU.
> + * This can of course race with both queuing and invoking of callbacks.
> + * Failng to correctly handle either of these races could result in
> + * rcu_barrier() failing to IPI a CPU that actually had callbacks queued
> + * which rcu_barrier() was obligated to wait on.  And if rcu_barrier()
> + * failed to wait on such a callback, unloading certain kernel modules
> + * would result in calls to functions whose code was no longer present in
> + * the kernel, for but one example.
> + *
> + * Therefore, ->len transitions from 1->0 and 0->1 have to be carefully
> + * ordered with respect with both list modifications and the rcu_barrier().
> + *
> + * The queuing case is CASE 1 and the invoking case is CASE 2.
> + *
> + * CASE 1: Suppose that CPU 0 has no callbacks queued, but invokes
> + * call_rcu() just as CPU 1 invokes rcu_barrier().  CPU 0's ->len field
> + * will transition from 0->1, which is one of the transitions that must be
> + * handled carefully.  Without the full memory barriers before the ->len
> + * update and at the beginning of rcu_barrier(), the following could happen:
> + *
> + * CPU 0 CPU 1
> + *
> + * call_rcu().
> + *   rcu_barrier() sees ->len as 0.
> + * set ->len = 1.
> + *   rcu_barrier() does nothing.
> + *   module is unloaded.
> + * callback invokes unloaded function!
> + *
> + * With the full barriers, any case where rcu_barrier() sees ->len as 0 will
> + * have unambiguously preceded the return from the racing call_rcu(), which
> + * means that this call_rcu() invocation is OK to not wait on.  After all,
> + * you are supposed to make sure that any problematic call_rcu() invocations
> + * happen before the rcu_barrier().

Unfortunately, I did not understand your explanation. To me the barrier
*before* the setting of length is needed on CPU0 only for 1->0 transition
(Dequeue). Where as in
your example above, it is for enqueue.

This was case 1 in my patch:

+ * To illustrate the problematic scenario to avoid:
+ * P0 (what P1 sees)   P1
+ * set len = 0
+ *  rcu_barrier sees len as 0
+ * dequeue from list
+ *  rcu_barrier does nothing.
+ *


Here, P1 should see the transition of 1->0 *after* the CB is dequeued. Which
means you needed a memory barrier *before* the setting of len from 1->0 and
*after* the dequeue. IOW, rcu_barrier should 'see' the memory ordering as:

1. dequeue
2. set len from 1 -> 0.

For the enqueue case, it is the reverse, rcu_barrier should see:
1. set len from 0 -> 1
2. enqueue

Either way, the point I think I was trying to make is that the length should
always be seen as non-zero if the list is non-empty. Basically, the
rcu_barrier() should always not do the fast-path if the list is non-empty.
Worst-case it might do the slow-path when it is 

Re: [PATCH v9 7/7] rcu/segcblist: Add additional comments to explain smp_mb()

2020-11-05 Thread Paul E. McKenney
On Tue, Nov 03, 2020 at 09:26:03AM -0500, Joel Fernandes (Google) wrote:
> Memory barriers are needed when updating the full length of the
> segcblist, however it is not fully clearly why one is needed before and
> after. This patch therefore adds additional comments to the function
> header to explain it.
> 
> Signed-off-by: Joel Fernandes (Google) 

Looks good, thank you!  As always, I could not resist the urge to
do a bit of wordsmithing, so that the queued commit is as shown
below.  Please let me know if I messed anything up.

Thanx, Paul



commit 7dac7adefcae7558b3a85a16f51186d621623733
Author: Joel Fernandes (Google) 
Date:   Tue Nov 3 09:26:03 2020 -0500

rcu/segcblist: Add additional comments to explain smp_mb()

One counter-intuitive property of RCU is the fact that full memory
barriers are needed both before and after updates to the full
(non-segmented) length.  This patch therefore helps to assist the
reader's intuition by adding appropriate comments.

[ paulmck:  Wordsmithing. ]
Signed-off-by: Joel Fernandes (Google) 
Signed-off-by: Paul E. McKenney 

diff --git a/kernel/rcu/rcu_segcblist.c b/kernel/rcu/rcu_segcblist.c
index bb246d8..b6dda7c 100644
--- a/kernel/rcu/rcu_segcblist.c
+++ b/kernel/rcu/rcu_segcblist.c
@@ -94,17 +94,77 @@ static void rcu_segcblist_set_len(struct rcu_segcblist 
*rsclp, long v)
  * field to disagree with the actual number of callbacks on the structure.
  * This increase is fully ordered with respect to the callers accesses
  * both before and after.
+ *
+ * So why on earth is a memory barrier required both before and after
+ * the update to the ->len field???
+ *
+ * The reason is that rcu_barrier() locklessly samples each CPU's ->len
+ * field, and if a given CPU's field is zero, avoids IPIing that CPU.
+ * This can of course race with both queuing and invoking of callbacks.
+ * Failng to correctly handle either of these races could result in
+ * rcu_barrier() failing to IPI a CPU that actually had callbacks queued
+ * which rcu_barrier() was obligated to wait on.  And if rcu_barrier()
+ * failed to wait on such a callback, unloading certain kernel modules
+ * would result in calls to functions whose code was no longer present in
+ * the kernel, for but one example.
+ *
+ * Therefore, ->len transitions from 1->0 and 0->1 have to be carefully
+ * ordered with respect with both list modifications and the rcu_barrier().
+ *
+ * The queuing case is CASE 1 and the invoking case is CASE 2.
+ *
+ * CASE 1: Suppose that CPU 0 has no callbacks queued, but invokes
+ * call_rcu() just as CPU 1 invokes rcu_barrier().  CPU 0's ->len field
+ * will transition from 0->1, which is one of the transitions that must be
+ * handled carefully.  Without the full memory barriers before the ->len
+ * update and at the beginning of rcu_barrier(), the following could happen:
+ *
+ * CPU 0   CPU 1
+ *
+ * call_rcu().
+ * rcu_barrier() sees ->len as 0.
+ * set ->len = 1.
+ * rcu_barrier() does nothing.
+ * module is unloaded.
+ * callback invokes unloaded function!
+ *
+ * With the full barriers, any case where rcu_barrier() sees ->len as 0 will
+ * have unambiguously preceded the return from the racing call_rcu(), which
+ * means that this call_rcu() invocation is OK to not wait on.  After all,
+ * you are supposed to make sure that any problematic call_rcu() invocations
+ * happen before the rcu_barrier().
+ *
+ *
+ * CASE 2: Suppose that CPU 0 is invoking its last callback just as CPU 1 
invokes
+ * rcu_barrier().  CPU 0's ->len field will transition from 1->0, which is one
+ * of the transitions that must be handled carefully.  Without the full memory
+ * barriers after the ->len update and at the end of rcu_barrier(), the 
following
+ * could happen:
+ * 
+ * CPU 0   CPU 1
+ *
+ * start invoking last callback
+ * set ->len = 0 (reordered)
+ * rcu_barrier() sees ->len as 0
+ * rcu_barrier() does nothing.
+ * module is unloaded
+ * callback executing after unloaded!
+ *
+ * With the full barriers, any case where rcu_barrier() sees ->len as 0
+ * will be fully ordered after the completion of the callback function,
+ * so that the module unloading operation is completely safe.
+ * 
  */
 void rcu_segcblist_add_len(struct rcu_segcblist *rsclp, long v)
 {
 #ifdef CONFIG_RCU_NOCB_CPU
-   smp_mb__before_atomic(); /* Up to the caller! */
+   smp_mb__before_atomic(); // Read header comment above.
atomic_long_add(v, >len);
-   smp_mb__after_atomic(); /* Up to the caller! */
+   smp_mb__after_atomic();  // Read header comment above.
 #else
- 

[PATCH v9 7/7] rcu/segcblist: Add additional comments to explain smp_mb()

2020-11-03 Thread Joel Fernandes (Google)
Memory barriers are needed when updating the full length of the
segcblist, however it is not fully clearly why one is needed before and
after. This patch therefore adds additional comments to the function
header to explain it.

Signed-off-by: Joel Fernandes (Google) 
---
 kernel/rcu/rcu_segcblist.c | 40 ++
 1 file changed, 36 insertions(+), 4 deletions(-)

diff --git a/kernel/rcu/rcu_segcblist.c b/kernel/rcu/rcu_segcblist.c
index d96272e8d604..9b43d686b1f3 100644
--- a/kernel/rcu/rcu_segcblist.c
+++ b/kernel/rcu/rcu_segcblist.c
@@ -135,17 +135,49 @@ static void rcu_segcblist_inc_seglen(struct rcu_segcblist 
*rsclp, int seg)
  * field to disagree with the actual number of callbacks on the structure.
  * This increase is fully ordered with respect to the callers accesses
  * both before and after.
+ *
+ * About memory barriers:
+ * There is a situation where rcu_barrier() locklessly samples the full
+ * length of the segmented cblist before deciding what to do. That can
+ * race with another path that calls this function such as enqueue or dequeue.
+ * rcu_barrier() should not wrongly assume there are no callbacks, so any
+ * transitions from 1->0 and 0->1 have to be carefully ordered with respect to
+ * list modifications and with whatever follows the rcu_barrier().
+ *
+ * There are at least 2 cases:
+ * CASE 1: Memory barrier is needed before adding to length, for the case where
+ * v is negative (happens during dequeue). When length transitions from 1 -> 0,
+ * the write to 0 has to be ordered to appear to be *after* the memory accesses
+ * of the CBs that were dequeued and the segcblist modifications:
+ * To illustrate the problematic scenario to avoid:
+ * P0 (what P1 sees)   P1
+ * set len = 0
+ *  rcu_barrier sees len as 0
+ * dequeue from list
+ *  rcu_barrier does nothing.
+ *
+ * CASE 2: Memory barrier is needed after adding to length for the case
+ * where length transitions from 0 -> 1. This is because rcu_barrier()
+ * should never miss an update to the length. So the update to length
+ * has to be seen *before* any modifications to the segmented list. Otherwise a
+ * race can happen.
+ * To illustrate the problematic scenario to avoid:
+ * P0 (what P1 sees)   P1
+ * queue to list
+ *  rcu_barrier sees len as 0
+ * set len = 1.
+ *  rcu_barrier does nothing.
  */
 void rcu_segcblist_add_len(struct rcu_segcblist *rsclp, long v)
 {
 #ifdef CONFIG_RCU_NOCB_CPU
-   smp_mb__before_atomic(); /* Up to the caller! */
+   smp_mb__before_atomic(); /* Read function's comments */
atomic_long_add(v, >len);
-   smp_mb__after_atomic(); /* Up to the caller! */
+   smp_mb__after_atomic();  /* Read function's comments */
 #else
-   smp_mb(); /* Up to the caller! */
+   smp_mb(); /* Read function's comments */
WRITE_ONCE(rsclp->len, rsclp->len + v);
-   smp_mb(); /* Up to the caller! */
+   smp_mb(); /* Read function's comments */
 #endif
 }
 
-- 
2.29.1.341.ge80a0c044ae-goog