Re: [PATCH] rcutorture: Fix rcu_torture_pipe_update_one()/rcu_torture_writer() data race and concurrency bug

2024-03-07 Thread Paul E. McKenney
On Thu, Mar 07, 2024 at 02:53:43PM -0500, Mathieu Desnoyers wrote:
> On 2024-03-07 14:47, Paul E. McKenney wrote:
> > On Thu, Mar 07, 2024 at 08:53:05AM -0500, Mathieu Desnoyers wrote:
> > > On 2024-03-06 22:37, Paul E. McKenney wrote:
> > > > On Wed, Mar 06, 2024 at 10:06:21PM -0500, Mathieu Desnoyers wrote:
> > > [...]
> > > > 
> > > > > As far as the WRITE_ONCE(x, READ_ONCE(x) + 1) pattern
> > > > > is concerned, the only valid use-case I can think of is
> > > > > split counters or RCU implementations where there is a
> > > > > single updater doing the increment, and one or more
> > > > > concurrent reader threads that need to snapshot a
> > > > > consistent value with READ_ONCE().
> > > > 
> > > [...]
> > > > 
> > > > So what would you use that pattern for?
> > > > 
> > > > One possibility is a per-CPU statistical counter in userspace on a
> > > > fastpath, in cases where losing the occasional count is OK.  Then learning
> > > > your CPU (and possibly being immediately migrated to some other CPU),
> > > > READ_ONCE() of the count, increment, and WRITE_ONCE() might (or might not)
> > > > make sense.
> > > > 
> > > > I suppose the same in the kernel if there was a fastpath so extreme you
> > > > could not afford to disable preemption.
> > > > 
> > > > At best, very niche.
> > > > 
> > > > Or am I suffering a failure of imagination yet again?  ;-)
> > > 
> > > The (niche) use-cases I have in mind are split-counters and RCU
> > > grace period tracking, where precise counter totals are needed
> > > (no lost count).
> > > 
> > > In the kernel, this could be:
> > 
> > Thank you for looking into this!
> > 
> > > - A per-cpu counter, each counter incremented from thread context with
> > >preemption disabled (single updater per counter), read concurrently by
> > >other threads. WRITE_ONCE/READ_ONCE is useful to make sure there
> > >is no store/load tearing there. Atomics on the update would be stronger
> > >than necessary on the increment fast-path.
> > 
> > But if preemption is disabled, the updater can read the value without
> > READ_ONCE() without risk of concurrent update.  Or are you concerned about
> > interrupt handlers?  This would have to be a read from the interrupt
> > handler, given that an update from the interrupt handler could result
> > in a lost count.
> 
> You are correct that the updater doesn't need READ_ONCE there. It would
> however require a WRITE_ONCE() to match READ_ONCE() from concurrent
> reader threads.
> 
> > 
> > > - A per-thread counter (e.g. within task_struct), only incremented by the
> > >single thread, read by various threads concurrently.
> > 
> > Ditto.
> 
> Right, only WRITE_ONCE() on the single updater, READ_ONCE() on readers.
> 
> > 
> > > - A counter whose increment happens to be already protected by a lock, read
> > >by various threads without taking the lock. (technically doable, but
> > >I'm not sure I see a relevant use-case for it)
> > 
> > In that case, the lock would exclude concurrent updates, so the lock
> > holder would not need READ_ONCE(), correct?
> 
> Correct.
> 
> > 
> > > In user-space:
> > > 
> > > - The "per-cpu" counter would have to use rseq for increments to prevent
> > >inopportune migrations, which needs to be implemented in assembler 
> > > anyway.
> > >The counter reads would have to use READ_ONCE().
> > 
> > Fair enough!
> > 
> > > - The per-thread counter (Thread-Local Storage) incremented by a single
> > >thread, read by various threads concurrently, is a good target
> > >for WRITE_ONCE()/READ_ONCE() pairing. This is actually what we do in
> > >various liburcu implementations which track read-side critical sections
> > >per-thread.
> > 
> > Agreed, but do any of these use WRITE_ONCE(x, READ_ONCE(x) + 1) or
> > similar?
> 
> Not quite, I recall it's more like WRITE_ONCE(x, READ_ONCE(y)) or such,
> so we can grab the value of the current gp counter and store it into a
> TLS variable.

Good point, you could use that pattern to grab a shared snapshot.

Still sounds niche, but you never know!  ;-)
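
For concreteness, a minimal sketch of that snapshot pattern (hypothetical
names, assuming READ_ONCE()/WRITE_ONCE()-style macros are available in
userspace; this is not code taken from liburcu or the kernel):

/* Sketch only: one writer advances gp_counter; each reader copies it
 * into a per-thread variable that other threads may in turn scan, so
 * both sides need only protection against load/store tearing. */
static unsigned long gp_counter;              /* advanced by a single writer */
static __thread unsigned long reader_gp_snap; /* read by other threads */

static void writer_advance_gp(void)
{
	/* Sole updater: a plain read is fine; WRITE_ONCE() pairs with
	 * the readers' READ_ONCE(). */
	WRITE_ONCE(gp_counter, gp_counter + 1);
}

static void reader_snapshot_gp(void)
{
	/* The WRITE_ONCE(x, READ_ONCE(y)) shape discussed above. */
	WRITE_ONCE(reader_gp_snap, READ_ONCE(gp_counter));
}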

Thanx, Paul

> > > - Same as for the kernel, a counter increment protected by a lock which
> > >needs to be read from various threads concurrently without taking
> > >the lock could be a valid use-case, though I fail to see how it is
> > >useful due to lack of imagination on my part. ;-)
> > 
> > In RCU, we have "WRITE_ONCE(*sp, *sp + 1)" for this use case, though
> > here we have the WRITE_ONCE() but not the READ_ONCE() because we hold
> > the lock excluding any other updates.
> 
> You are right, the READ_ONCE() is not needed in this case for the
> updater, only for the concurrent readers.
> 
> Thanks,
> 
> Mathieu
> 
> -- 
> Mathieu Desnoyers
> EfficiOS Inc.
> https://www.efficios.com
> 



Re: [PATCH] rcutorture: Fix rcu_torture_pipe_update_one()/rcu_torture_writer() data race and concurrency bug

2024-03-07 Thread Paul E. McKenney
On Thu, Mar 07, 2024 at 02:09:00PM -0800, Linus Torvalds wrote:
> On Thu, 7 Mar 2024 at 13:40, Julia Lawall  wrote:
> >
> > I tried the following:
> >
> > @@
> > expression x;
> > @@
> >
> > *WRITE_ONCE(x,<+...READ_ONCE(x)...+>)
> >
> > This gave a number of results, shown below.  Let me know if some of them
> > are undesirable.
> 
> Well, all the ones you list do look like garbage.
> 
> That said, quite often the garbage does seem to be "we don't actually
> care about the result". Several of them look like statistics.

[ . . . ]

> > diff -u -p /home/julia/linux/kernel/rcu/tree.c 
> > /tmp/nothing/kernel/rcu/tree.c
> > --- /home/julia/linux/kernel/rcu/tree.c
> > +++ /tmp/nothing/kernel/rcu/tree.c
> > @@ -1620,8 +1620,6 @@ static void rcu_gp_fqs(bool first_time)
> > /* Clear flag to prevent immediate re-entry. */
> > if (READ_ONCE(rcu_state.gp_flags) & RCU_GP_FLAG_FQS) {
> > raw_spin_lock_irq_rcu_node(rnp);
> > -   WRITE_ONCE(rcu_state.gp_flags,
> > -  READ_ONCE(rcu_state.gp_flags) & 
> > ~RCU_GP_FLAG_FQS);
> > raw_spin_unlock_irq_rcu_node(rnp);
> 
> This smells bad to me. The code is holding a lock, but apparently not
> one that protects gp_flags.
> 
> And that READ_ONCE->WRITE_ONCE sequence can corrupt all the other flags.
> 
> Maybe it's fine for some reason (that reason being either that the
> ONCE operations aren't actually needed at all, or because nobody
> *really* cares about the flags), but it smells.
> 
> > @@ -1882,8 +1880,6 @@ static void rcu_report_qs_rsp(unsigned l
> >  {
> > raw_lockdep_assert_held_rcu_node(rcu_get_root());
> > WARN_ON_ONCE(!rcu_gp_in_progress());
> > -   WRITE_ONCE(rcu_state.gp_flags,
> > -  READ_ONCE(rcu_state.gp_flags) | RCU_GP_FLAG_FQS);
> > raw_spin_unlock_irqrestore_rcu_node(rcu_get_root(), flags);
> 
> Same field, same lock held, same odd smelly pattern.
> 
> > -   WRITE_ONCE(rcu_state.gp_flags,
> > -  READ_ONCE(rcu_state.gp_flags) | RCU_GP_FLAG_FQS);
> > raw_spin_unlock_irqrestore_rcu_node(rnp_old, flags);
> 
> .. and again.

In all three cases, the updates are protected by the lock, so the
READ_ONCE() is unnecessary.  I have queued a commit to remove the
READ_ONCE()s.

Thanks to both of you (Julia and Linus) for spotting these!

Thanx, Paul



Re: [PATCH] [RFC] rcu/tree: Reduce wake up for synchronize_rcu() common case

2024-03-07 Thread Paul E. McKenney
On Thu, Mar 07, 2024 at 06:52:14PM -0500, Joel Fernandes wrote:
> On Thu, Mar 7, 2024 at 6:48 PM Joel Fernandes (Google)
>  wrote:
> >
> > In the synchronize_rcu() common case, we will have less than
> > SR_MAX_USERS_WAKE_FROM_GP number of users per GP. Waking up the kworker
> > is pointless just to free the last injected wait head since at that point,
> > all the users have already been awakened.
> >
> > Introduce a new counter to track this and prevent the wakeup in the
> > common case.
> >
> > Signed-off-by: Joel Fernandes (Google) 
> > ---
> 
> Forgot to mention, this is based on the latest RCU -dev branch and
> passes light rcutorture testing on all configs. Heavier rcutorture
> testing (60 minutes) was performed on TREE03.

Very good, thank you!

Uladzislau, could you please pull this into the next series you send?
I can then replace your commits in -rcu with the updated series.

Thanx, Paul



Re: [PATCH] [RFC] rcu/tree: Reduce wake up for synchronize_rcu() common case

2024-03-07 Thread Joel Fernandes
On Thu, Mar 7, 2024 at 6:48 PM Joel Fernandes (Google)
 wrote:
>
> In the synchronize_rcu() common case, we will have less than
> SR_MAX_USERS_WAKE_FROM_GP number of users per GP. Waking up the kworker
> is pointless just to free the last injected wait head since at that point,
> all the users have already been awakened.
>
> Introduce a new counter to track this and prevent the wakeup in the
> common case.
>
> Signed-off-by: Joel Fernandes (Google) 
> ---

Forgot to mention, this is based on the latest RCU -dev branch and
passes light rcutorture testing on all configs. Heavier rcutorture
testing (60 minutes) was performed on TREE03.

Thanks.



[PATCH] [RFC] rcu/tree: Reduce wake up for synchronize_rcu() common case

2024-03-07 Thread Joel Fernandes (Google)
In the synchronize_rcu() common case, we will have less than
SR_MAX_USERS_WAKE_FROM_GP number of users per GP. Waking up the kworker
is pointless just to free the last injected wait head since at that point,
all the users have already been awakened.

Introduce a new counter to track this and prevent the wakeup in the
common case.

Signed-off-by: Joel Fernandes (Google) 
---
 kernel/rcu/tree.c | 36 +++-
 kernel/rcu/tree.h |  1 +
 2 files changed, 32 insertions(+), 5 deletions(-)

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 12978049cb99..cba3a82e9ed9 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -96,6 +96,7 @@ static struct rcu_state rcu_state = {
.ofl_lock = __ARCH_SPIN_LOCK_UNLOCKED,
.srs_cleanup_work = __WORK_INITIALIZER(rcu_state.srs_cleanup_work,
rcu_sr_normal_gp_cleanup_work),
+   .srs_cleanups_pending = ATOMIC_INIT(0),
 };
 
 /* Dump rcu_node combining tree at boot to verify correct setup. */
@@ -1641,8 +1642,11 @@ static void rcu_sr_normal_gp_cleanup_work(struct work_struct *work)
 * the done tail list manipulations are protected here.
 */
done = smp_load_acquire(&rcu_state.srs_done_tail);
-   if (!done)
+   if (!done) {
+   /* See comments below. */
+   atomic_dec_return_release(&rcu_state.srs_cleanups_pending);
return;
+   }
 
WARN_ON_ONCE(!rcu_sr_is_wait_head(done));
head = done->next;
@@ -1665,6 +1669,9 @@ static void rcu_sr_normal_gp_cleanup_work(struct work_struct *work)
 
rcu_sr_put_wait_head(rcu);
}
+
+   /* Order list manipulations with atomic access. */
+   atomic_dec_return_release(&rcu_state.srs_cleanups_pending);
 }
 
 /*
@@ -1672,7 +1679,7 @@ static void rcu_sr_normal_gp_cleanup_work(struct work_struct *work)
  */
 static void rcu_sr_normal_gp_cleanup(void)
 {
-   struct llist_node *wait_tail, *next, *rcu;
+   struct llist_node *wait_tail, *next = NULL, *rcu = NULL;
int done = 0;
 
wait_tail = rcu_state.srs_wait_tail;
@@ -1698,16 +1705,35 @@ static void rcu_sr_normal_gp_cleanup(void)
break;
}
 
-   // concurrent sr_normal_gp_cleanup work might observe this update.
-   smp_store_release(&rcu_state.srs_done_tail, wait_tail);
+   /*
+* Fast path, no more users to process. Remove the last wait head
+* if no inflight-workers. If there are in-flight workers, let them
+* remove the last wait head.
+*/
+   WARN_ON_ONCE(!rcu);
ASSERT_EXCLUSIVE_WRITER(rcu_state.srs_done_tail);
 
+   if (rcu && rcu_sr_is_wait_head(rcu) && rcu->next == NULL &&
+   /* Order atomic access with list manipulation. */
+   !atomic_read_acquire(&rcu_state.srs_cleanups_pending)) {
+   wait_tail->next = NULL;
+   rcu_sr_put_wait_head(rcu);
+   smp_store_release(&rcu_state.srs_done_tail, wait_tail);
+   return;
+   }
+
+   /* Concurrent sr_normal_gp_cleanup work might observe this update. */
+   smp_store_release(&rcu_state.srs_done_tail, wait_tail);
+
/*
 * We schedule a work in order to perform a final processing
 * of outstanding users(if still left) and releasing wait-heads
 * added by rcu_sr_normal_gp_init() call.
 */
-   queue_work(system_highpri_wq, &rcu_state.srs_cleanup_work);
+   atomic_inc(&rcu_state.srs_cleanups_pending);
+   if (!queue_work(system_highpri_wq, &rcu_state.srs_cleanup_work)) {
+   atomic_dec(&rcu_state.srs_cleanups_pending);
+   }
 }
 
 /*
diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h
index 2832787cee1d..f162b947c5b6 100644
--- a/kernel/rcu/tree.h
+++ b/kernel/rcu/tree.h
@@ -420,6 +420,7 @@ struct rcu_state {
struct llist_node *srs_done_tail; /* ready for GP users. */
struct sr_wait_node srs_wait_nodes[SR_NORMAL_GP_WAIT_HEAD_MAX];
struct work_struct srs_cleanup_work;
+   atomic_t srs_cleanups_pending; /* srs inflight worker cleanups. */
 };
 
 /* Values for rcu_state structure's gp_flags field. */
-- 
2.34.1




Re: [PATCH] rcutorture: Fix rcu_torture_pipe_update_one()/rcu_torture_writer() data race and concurrency bug

2024-03-07 Thread Linus Torvalds
On Thu, 7 Mar 2024 at 13:40, Julia Lawall  wrote:
>
> I tried the following:
>
> @@
> expression x;
> @@
>
> *WRITE_ONCE(x,<+...READ_ONCE(x)...+>)
>
> This gave a number of results, shown below.  Let me know if some of them
> are undesirable.

Well, all the ones you list do look like garbage.

That said, quite often the garbage does seem to be "we don't actually
care about the result". Several of them look like statistics.

Some of them look outright nasty, though:

> --- /home/julia/linux/net/netfilter/nf_tables_api.c
> +++ /tmp/nothing/net/netfilter/nf_tables_api.c
> @@ -10026,8 +10026,6 @@ static unsigned int nft_gc_seq_begin(str
> unsigned int gc_seq;
>
> /* Bump gc counter, it becomes odd, this is the busy mark. */
> -   gc_seq = READ_ONCE(nft_net->gc_seq);
> -   WRITE_ONCE(nft_net->gc_seq, ++gc_seq);

The above is garbage code, and the comment implies that it is garbage
code that _should_ be reliable.

> diff -u -p /home/julia/linux/fs/xfs/xfs_icache.c 
> /tmp/nothing/fs/xfs/xfs_icache.c
> --- /home/julia/linux/fs/xfs/xfs_icache.c
> +++ /tmp/nothing/fs/xfs/xfs_icache.c
> @@ -2076,8 +2076,6 @@ xfs_inodegc_queue(
> cpu_nr = get_cpu();
> gc = this_cpu_ptr(mp->m_inodegc);
> llist_add(&ip->i_gclist, &gc->list);
> -   items = READ_ONCE(gc->items);
> -   WRITE_ONCE(gc->items, items + 1);

In contrast, this is also garbage code, but the only user of it seems
to be a heuristic, so if 'items' is off by one (or by a hundred), it
probably doesn't matter.

The xfs code is basically using that 'items' count to decide if it
really wants to do GC or not.

This is actually a case where having a "UNSAFE_INCREMENTISH()" macro
might make sense.

That said, this is also a case where using a "local_t" and using
"local_add_return()" might be a better option. It falls back on true
atomics, but at least on x86 you probably get *better* code generation
for the "incrementish" operation than you get with READ_ONCE ->
WRITE_ONCE.
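
As a rough illustration of that alternative (a sketch with made-up names,
not the actual xfs code), local_add_return() gives an increment that is
atomic with respect to the local CPU, interrupts included, without a
LOCK-prefixed instruction on x86:

#include <linux/types.h>
#include <asm/local.h>

/* Hypothetical per-CPU stats block; only the owning CPU updates it. */
struct gc_pcpu_stats {
	local_t items;
};

static bool gc_count_item(struct gc_pcpu_stats *gc)
{
	/* "Incrementish" in cost, but never lossy on the local CPU. */
	long items = local_add_return(1, &gc->items);

	/* Made-up threshold for deciding whether to kick background work. */
	return items > 32;
}

On architectures without a cheap local atomic, local_t falls back to the
full atomic_long_t operations, so it is never weaker than the open-coded
READ_ONCE/WRITE_ONCE version.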


> diff -u -p /home/julia/linux/kernel/rcu/tree.c /tmp/nothing/kernel/rcu/tree.c
> --- /home/julia/linux/kernel/rcu/tree.c
> +++ /tmp/nothing/kernel/rcu/tree.c
> @@ -1620,8 +1620,6 @@ static void rcu_gp_fqs(bool first_time)
> /* Clear flag to prevent immediate re-entry. */
> if (READ_ONCE(rcu_state.gp_flags) & RCU_GP_FLAG_FQS) {
> raw_spin_lock_irq_rcu_node(rnp);
> -   WRITE_ONCE(rcu_state.gp_flags,
> -  READ_ONCE(rcu_state.gp_flags) & ~RCU_GP_FLAG_FQS);
> raw_spin_unlock_irq_rcu_node(rnp);

This smells bad to me. The code is holding a lock, but apparently not
one that protects gp_flags.

And that READ_ONCE->WRITE_ONCE sequence can corrupt all the other flags.

Maybe it's fine for some reason (that reason being either that the
ONCE operations aren't actually needed at all, or because nobody
*really* cares about the flags), but it smells.

> @@ -1882,8 +1880,6 @@ static void rcu_report_qs_rsp(unsigned l
>  {
> raw_lockdep_assert_held_rcu_node(rcu_get_root());
> WARN_ON_ONCE(!rcu_gp_in_progress());
> -   WRITE_ONCE(rcu_state.gp_flags,
> -  READ_ONCE(rcu_state.gp_flags) | RCU_GP_FLAG_FQS);
> raw_spin_unlock_irqrestore_rcu_node(rcu_get_root(), flags);

Same field, same lock held, same odd smelly pattern.

> -   WRITE_ONCE(rcu_state.gp_flags,
> -  READ_ONCE(rcu_state.gp_flags) | RCU_GP_FLAG_FQS);
> raw_spin_unlock_irqrestore_rcu_node(rnp_old, flags);

.. and again.

> --- /home/julia/linux/drivers/net/ethernet/cavium/liquidio/cn23xx_vf_device.c
> +++ /tmp/nothing/drivers/net/ethernet/cavium/liquidio/cn23xx_vf_device.c
> @@ -80,8 +80,6 @@ static int cn23xx_vf_reset_io_queues(str
> q_no);
> return -1;
> }
> -   WRITE_ONCE(reg_val, READ_ONCE(reg_val) &
> -  ~CN23XX_PKT_INPUT_CTL_RST);
> octeon_write_csr64(oct, CN23XX_VF_SLI_IQ_PKT_CONTROL64(q_no),
>READ_ONCE(reg_val));

I suspect this is garbage that has been triggered by the usual
mindless "fix the symptoms, not the bug" as a result of a "undefined
behavior report".

>> --- /home/julia/linux/kernel/kcsan/kcsan_test.c
> +++ /tmp/nothing/kernel/kcsan/kcsan_test.c
> @@ -381,7 +381,6 @@ static noinline void test_kernel_change_
> test_var ^= TEST_CHANGE_BITS;
> kcsan_nestable_atomic_end();
> } else
> -   WRITE_ONCE(test_var, READ_ONCE(test_var) ^ TEST_CHANGE_BITS);

Presumably this is intentionally testing whether KCSAN notices these
things at all.

> diff -u -p /home/julia/linux/arch/s390/kernel/idle.c 
> /tmp/nothing/arch/s390/kernel/idle.c
> /* Account time spent with enabled wait psw loaded as idle time. */
> -   WRITE_ONCE(idle->idle_time, READ_ONCE(idle->idle_time) + idle_time);
> -   WRITE_ONCE(idle->idle_count, READ_ONCE(idl

Re: [PATCH] rcutorture: Fix rcu_torture_pipe_update_one()/rcu_torture_writer() data race and concurrency bug

2024-03-07 Thread Julia Lawall



On Thu, 7 Mar 2024, Paul E. McKenney wrote:

> On Thu, Mar 07, 2024 at 12:00:44PM -0800, Linus Torvalds wrote:
> > On Thu, 7 Mar 2024 at 11:47, Paul E. McKenney  wrote:
> > >
> > > > - The per-thread counter (Thread-Local Storage) incremented by a single
> > > >   thread, read by various threads concurrently, is a good target
> > > >   for WRITE_ONCE()/READ_ONCE() pairing. This is actually what we do in
> > > >   various liburcu implementations which track read-side critical 
> > > > sections
> > > >   per-thread.
> > >
> > > Agreed, but do any of these use WRITE_ONCE(x, READ_ONCE(x) + 1) or
> > > similar?
> >
> > Absolutely not.
> >
> > The READ_ONCE->WRITE_ONCE pattern is almost certainly a bug.
> >
> > The valid reason to have a WRITE_ONCE() is that there's a _later_
> > READ_ONCE() on another CPU.
> >
> > So WRITE_ONCE->READ_ONCE (across threads) is very valid. But
> > READ_ONCE->WRITE_ONCE (inside a thread) simply is not a valid
> > operation.
> >
> > We do have things like "local_t", which allows for non-smp-safe local
> > thread atomic accesses, but they explicitly are *NOT* about some kind
> > of READ_ONCE -> WRITE_ONCE sequence that by definition cannot be
> > atomic unless you disable interrupts and disable preemption (at which
> > point they become pointless and only generate worse code).
> >
> > But the point of "local_t" is that you can do things that are safe if
> > there are no SMP issues. They are kind of an extension of the
> > percpu_add() kind of operations.
> >
> > In fact, I think it might be interesting to catch those
> > READ_ONCE->WRITE_ONCE chains (perhaps with coccinelle?) because they
> > are a sign of bugs.
>
> Good point, adding Julia Lawall on CC.

I tried the following:

@@
expression x;
@@

*WRITE_ONCE(x,<+...READ_ONCE(x)...+>)

This gave a number of results, shown below.  Let me know if some of them
are undesirable.

I also tried:

@@
expression x,v;
@@

*v = READ_ONCE(x)
... when != v
*WRITE_ONCE(x,<+...v...+>)

I found the results somewhat less compelling.  Maybe the following are
interesting (note that - means this line is interesting, not a suggestion
to remove it):

479 files match
diff -u -p /home/julia/linux/net/netfilter/nf_tables_api.c 
/tmp/nothing/net/netfilter/nf_tables_api.c
--- /home/julia/linux/net/netfilter/nf_tables_api.c
+++ /tmp/nothing/net/netfilter/nf_tables_api.c
@@ -10026,8 +10026,6 @@ static unsigned int nft_gc_seq_begin(str
unsigned int gc_seq;

/* Bump gc counter, it becomes odd, this is the busy mark. */
-   gc_seq = READ_ONCE(nft_net->gc_seq);
-   WRITE_ONCE(nft_net->gc_seq, ++gc_seq);

return gc_seq;
 }
diff -u -p /home/julia/linux/fs/xfs/xfs_icache.c 
/tmp/nothing/fs/xfs/xfs_icache.c
--- /home/julia/linux/fs/xfs/xfs_icache.c
+++ /tmp/nothing/fs/xfs/xfs_icache.c
@@ -2076,8 +2076,6 @@ xfs_inodegc_queue(
cpu_nr = get_cpu();
gc = this_cpu_ptr(mp->m_inodegc);
llist_add(&ip->i_gclist, &gc->list);
-   items = READ_ONCE(gc->items);
-   WRITE_ONCE(gc->items, items + 1);
shrinker_hits = READ_ONCE(gc->shrinker_hits);

/*
@@ -2199,9 +2197,7 @@ xfs_inodegc_shrinker_scan(
for_each_cpu(cpu, &mp->m_inodegc_cpumask) {
gc = per_cpu_ptr(mp->m_inodegc, cpu);
if (!llist_empty(&gc->list)) {
-   unsigned int h = READ_ONCE(gc->shrinker_hits);

-   WRITE_ONCE(gc->shrinker_hits, h + 1);
mod_delayed_work_on(cpu, mp->m_inodegc_wq, &gc->work, 0);
no_items = false;
}

In other cases, more work is done between the read and the write.  I can send
them if they are of interest.

julia

diff -u -p /home/julia/linux/kernel/sched/fair.c 
/tmp/nothing/kernel/sched/fair.c
--- /home/julia/linux/kernel/sched/fair.c
+++ /tmp/nothing/kernel/sched/fair.c
@@ -3153,7 +3153,6 @@ static void reset_ptenuma_scan(struct ta
 * statistical sampling. Use READ_ONCE/WRITE_ONCE, which are not
 * expensive, to avoid any form of compiler optimizations:
 */
-   WRITE_ONCE(p->mm->numa_scan_seq, READ_ONCE(p->mm->numa_scan_seq) + 1);
p->mm->numa_scan_offset = 0;
 }

diff -u -p /home/julia/linux/net/ipv4/inet_hashtables.c 
/tmp/nothing/net/ipv4/inet_hashtables.c
--- /home/julia/linux/net/ipv4/inet_hashtables.c
+++ /tmp/nothing/net/ipv4/inet_hashtables.c
@@ -1109,7 +1109,6 @@ ok:
 * it may be inexistent.
 */
i = max_t(int, i, get_random_u32_below(8) * step);
-   WRITE_ONCE(table_perturb[index], READ_ONCE(table_perturb[index]) + i + 
step);

/* Head lock still held and bh's disabled */
inet_bind_hash(sk, tb, tb2, port);
diff -u -p /home/julia/linux/kernel/rcu/tree.c /tmp/nothing/kernel/rcu/tree.c
--- /home/julia/linux/kernel/rcu/tree.c
+++ /tmp/nothing/kernel/rcu/tree.c
@@ -1620,8 +1620,6 @@ static void rcu_gp_fqs(bool first_time)
/* Clear flag to prevent immediate re-entry. */
if (READ_ONCE(rcu_state.

Re: [PATCH] rcutorture: Fix rcu_torture_pipe_update_one()/rcu_torture_writer() data race and concurrency bug

2024-03-07 Thread Paul E. McKenney
On Thu, Mar 07, 2024 at 12:00:44PM -0800, Linus Torvalds wrote:
> On Thu, 7 Mar 2024 at 11:47, Paul E. McKenney  wrote:
> >
> > > - The per-thread counter (Thread-Local Storage) incremented by a single
> > >   thread, read by various threads concurrently, is a good target
> > >   for WRITE_ONCE()/READ_ONCE() pairing. This is actually what we do in
> > >   various liburcu implementations which track read-side critical sections
> > >   per-thread.
> >
> > Agreed, but do any of these use WRITE_ONCE(x, READ_ONCE(x) + 1) or
> > similar?
> 
> Absolutely not.
> 
> The READ_ONCE->WRITE_ONCE pattern is almost certainly a bug.
> 
> The valid reason to have a WRITE_ONCE() is that there's a _later_
> READ_ONCE() on another CPU.
> 
> So WRITE_ONCE->READ_ONCE (across threads) is very valid. But
> READ_ONCE->WRITE_ONCE (inside a thread) simply is not a valid
> operation.
> 
> We do have things like "local_t", which allows for non-smp-safe local
> thread atomic accesses, but they explicitly are *NOT* about some kind
> of READ_ONCE -> WRITE_ONCE sequence that by definition cannot be
> atomic unless you disable interrupts and disable preemption (at which
> point they become pointless and only generate worse code).
> 
> But the point of "local_t" is that you can do things that are safe if
> there are no SMP issues. They are kind of an extension of the
> percpu_add() kind of operations.
> 
> In fact, I think it might be interesting to catch those
> READ_ONCE->WRITE_ONCE chains (perhaps with coccinelle?) because they
> are a sign of bugs.

Good point, adding Julia Lawall on CC.

A really stupid version of this pattern (WRITE_ONCE.*READ_ONCE) found
four more in RCU, so I will take a look at those and either fix or add
comments as appropriate.

> Now, there's certainly some possibility of "I really don't care about
> some stats, I'm willing to do non-smp-safe and non-thread safe
> operations if they are faster". So I'm not saying a
> READ_ONCE->WRITE_ONCE data dependency is _always_ a bug, but I do
> think it's a pattern that is very very close to being one.

I agree that valid use cases should be quite rare.

Thanx, Paul



Re: [PATCH] rcutorture: Fix rcu_torture_pipe_update_one()/rcu_torture_writer() data race and concurrency bug

2024-03-07 Thread Linus Torvalds
On Thu, 7 Mar 2024 at 11:47, Paul E. McKenney  wrote:
>
> > - The per-thread counter (Thread-Local Storage) incremented by a single
> >   thread, read by various threads concurrently, is a good target
> >   for WRITE_ONCE()/READ_ONCE() pairing. This is actually what we do in
> >   various liburcu implementations which track read-side critical sections
> >   per-thread.
>
> Agreed, but do any of these use WRITE_ONCE(x, READ_ONCE(x) + 1) or
> similar?

Absolutely not.

The READ_ONCE->WRITE_ONCE pattern is almost certainly a bug.

The valid reason to have a WRITE_ONCE() is that there's a _later_
READ_ONCE() on another CPU.

So WRITE_ONCE->READ_ONCE (across threads) is very valid. But
READ_ONCE->WRITE_ONCE (inside a thread) simply is not a valid
operation.

We do have things like "local_t", which allows for non-smp-safe local
thread atomic accesses, but they explicitly are *NOT* about some kind
of READ_ONCE -> WRITE_ONCE sequence that by definition cannot be
atomic unless you disable interrupts and disable preemption (at which
point they become pointless and only generate worse code).

But the point of "local_t" is that you can do things that are safe if
there are no SMP issues. They are kind of an extension of the
percpu_add() kind of operations.

In fact, I think it might be interesting to catch those
READ_ONCE->WRITE_ONCE chains (perhaps with coccinelle?) because they
are a sign of bugs.

Now, there's certainly some possibility of "I really don't care about
some stats, I'm willing to do non-smp-safe and non-thread safe
operations if they are faster". So I'm not saying a
READ_ONCE->WRITE_ONCE data dependency is _always_ a bug, but I do
think it's a pattern that is very very close to being one.
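
To make the distinction concrete, here is a rough sketch (hypothetical
counters, not taken from any of the code above) of the problematic shape
next to the lossless alternative once more than one updater exists:

#include <linux/atomic.h>

static unsigned long racy_count;

/* Called from several CPUs: the read-modify-write is not atomic, so
 * concurrent increments can be lost.  The _ONCE() accessors only
 * prevent load/store tearing, nothing more. */
static void racy_inc(void)
{
	WRITE_ONCE(racy_count, READ_ONCE(racy_count) + 1);
}

/* The lossless alternative for genuinely concurrent updaters. */
static atomic_long_t safe_count;	/* statically zero-initialized */

static void safe_inc(void)
{
	atomic_long_inc(&safe_count);
}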

 Linus



Re: [PATCH] rcutorture: Fix rcu_torture_pipe_update_one()/rcu_torture_writer() data race and concurrency bug

2024-03-07 Thread Mathieu Desnoyers

On 2024-03-07 14:47, Paul E. McKenney wrote:

On Thu, Mar 07, 2024 at 08:53:05AM -0500, Mathieu Desnoyers wrote:

On 2024-03-06 22:37, Paul E. McKenney wrote:

On Wed, Mar 06, 2024 at 10:06:21PM -0500, Mathieu Desnoyers wrote:

[...]



As far as the WRITE_ONCE(x, READ_ONCE(x) + 1) pattern
is concerned, the only valid use-case I can think of is
split counters or RCU implementations where there is a
single updater doing the increment, and one or more
concurrent reader threads that need to snapshot a
consistent value with READ_ONCE().



[...]


So what would you use that pattern for?

One possibility is a per-CPU statistical counter in userspace on a
fastpath, in cases where losing the occasional count is OK.  Then learning
your CPU (and possibly being immediately migrated to some other CPU),
READ_ONCE() of the count, increment, and WRITE_ONCE() might (or might not)
make sense.

I suppose the same in the kernel if there was a fastpath so extreme you
could not afford to disable preemption.

At best, very niche.

Or am I suffering a failure of imagination yet again?  ;-)


The (niche) use-cases I have in mind are split-counters and RCU
grace period tracking, where precise counter totals are needed
(no lost count).

In the kernel, this could be:


Thank you for looking into this!


- A per-cpu counter, each counter incremented from thread context with
   preemption disabled (single updater per counter), read concurrently by
   other threads. WRITE_ONCE/READ_ONCE is useful to make sure there
   is no store/load tearing there. Atomics on the update would be stronger
   than necessary on the increment fast-path.


But if preemption is disabled, the updater can read the value without
READ_ONCE() without risk of concurrent update.  Or are you concerned about
interrupt handlers?  This would have to be a read from the interrupt
handler, given that an update from the interrupt handler could result
in a lost count.


You are correct that the updater doesn't need READ_ONCE there. It would
however require a WRITE_ONCE() to match READ_ONCE() from concurrent
reader threads.




- A per-thread counter (e.g. within task_struct), only incremented by the
   single thread, read by various threads concurrently.


Ditto.


Right, only WRITE_ONCE() on the single updater, READ_ONCE() on readers.




- A counter whose increment happens to be already protected by a lock, read
   by various threads without taking the lock. (technically doable, but
   I'm not sure I see a relevant use-case for it)


In that case, the lock would exclude concurrent updates, so the lock
holder would not need READ_ONCE(), correct?


Correct.




In user-space:

- The "per-cpu" counter would have to use rseq for increments to prevent
   inopportune migrations, which needs to be implemented in assembler anyway.
   The counter reads would have to use READ_ONCE().


Fair enough!


- The per-thread counter (Thread-Local Storage) incremented by a single
   thread, read by various threads concurrently, is a good target
   for WRITE_ONCE()/READ_ONCE() pairing. This is actually what we do in
   various liburcu implementations which track read-side critical sections
   per-thread.


Agreed, but do any of these use WRITE_ONCE(x, READ_ONCE(x) + 1) or
similar?


Not quite, I recall it's more like WRITE_ONCE(x, READ_ONCE(y)) or such,
so we can grab the value of the current gp counter and store it into a
TLS variable.





- Same as for the kernel, a counter increment protected by a lock which
   needs to be read from various threads concurrently without taking
   the lock could be a valid use-case, though I fail to see how it is
   useful due to lack of imagination on my part. ;-)


In RCU, we have "WRITE_ONCE(*sp, *sp + 1)" for this use case, though
here we have the WRITE_ONCE() but not the READ_ONCE() because we hold
the lock excluding any other updates.


You are right, the READ_ONCE() is not needed in this case for the
updater, only for the concurrent readers.

Thanks,

Mathieu

--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com




Re: [PATCH] rcutorture: Fix rcu_torture_pipe_update_one()/rcu_torture_writer() data race and concurrency bug

2024-03-07 Thread Paul E. McKenney
On Thu, Mar 07, 2024 at 08:53:05AM -0500, Mathieu Desnoyers wrote:
> On 2024-03-06 22:37, Paul E. McKenney wrote:
> > On Wed, Mar 06, 2024 at 10:06:21PM -0500, Mathieu Desnoyers wrote:
> [...]
> > 
> > > As far as the WRITE_ONCE(x, READ_ONCE(x) + 1) pattern
> > > is concerned, the only valid use-case I can think of is
> > > split counters or RCU implementations where there is a
> > > single updater doing the increment, and one or more
> > > concurrent reader threads that need to snapshot a
> > > consistent value with READ_ONCE().
> > 
> [...]
> > 
> > So what would you use that pattern for?
> > 
> > One possibility is a per-CPU statistical counter in userspace on a
> > fastpath, in cases where losing the occasional count is OK.  Then learning
> > your CPU (and possibly being immediately migrated to some other CPU),
> > READ_ONCE() of the count, increment, and WRITE_ONCE() might (or might not)
> > make sense.
> > 
> > I suppose the same in the kernel if there was a fastpath so extreme you
> > could not afford to disable preemption.
> > 
> > At best, very niche.
> > 
> > Or am I suffering a failure of imagination yet again?  ;-)
> 
> The (niche) use-cases I have in mind are split-counters and RCU
> grace period tracking, where precise counter totals are needed
> (no lost count).
> 
> In the kernel, this could be:

Thank you for looking into this!

> - A per-cpu counter, each counter incremented from thread context with
>   preemption disabled (single updater per counter), read concurrently by
>   other threads. WRITE_ONCE/READ_ONCE is useful to make sure there
>   is no store/load tearing there. Atomics on the update would be stronger
>   than necessary on the increment fast-path.

But if preemption is disabled, the updater can read the value without
READ_ONCE() without risk of concurrent update.  Or are you concerned about
interrupt handlers?  This would have to be a read from the interrupt
handler, given that an update from the interrupt handler could result
in a lost count.

> - A per-thread counter (e.g. within task_struct), only incremented by the
>   single thread, read by various threads concurrently.

Ditto.

> - A counter whose increment happens to be already protected by a lock, read
>   by various threads without taking the lock. (technically doable, but
>   I'm not sure I see a relevant use-case for it)

In that case, the lock would exclude concurrent updates, so the lock
holder would not need READ_ONCE(), correct?

> In user-space:
> 
> - The "per-cpu" counter would have to use rseq for increments to prevent
>   inopportune migrations, which needs to be implemented in assembler anyway.
>   The counter reads would have to use READ_ONCE().

Fair enough!

> - The per-thread counter (Thread-Local Storage) incremented by a single
>   thread, read by various threads concurrently, is a good target
>   for WRITE_ONCE()/READ_ONCE() pairing. This is actually what we do in
>   various liburcu implementations which track read-side critical sections
>   per-thread.

Agreed, but do any of these use WRITE_ONCE(x, READ_ONCE(x) + 1) or
similar?

> - Same as for the kernel, a counter increment protected by a lock which
>   needs to be read from various threads concurrently without taking
>   the lock could be a valid use-case, though I fail to see how it is
>   useful due to lack of imagination on my part. ;-)

In RCU, we have "WRITE_ONCE(*sp, *sp + 1)" for this use case, though
here we have the WRITE_ONCE() but not the READ_ONCE() because we hold
the lock excluding any other updates.
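
As a minimal sketch of that shape (made-up names, not the actual RCU code):
the lock serializes updaters, so only WRITE_ONCE() is needed on the update
side, while lockless readers use READ_ONCE():

#include <linux/spinlock.h>

static DEFINE_SPINLOCK(snap_lock);
static unsigned long snap_seq;

static void update_snap_seq(void)
{
	spin_lock(&snap_lock);
	/* The lock excludes other updaters, so a plain read suffices;
	 * WRITE_ONCE() pairs with the lockless READ_ONCE() readers. */
	WRITE_ONCE(snap_seq, snap_seq + 1);
	spin_unlock(&snap_lock);
}

static unsigned long read_snap_seq(void)
{
	return READ_ONCE(snap_seq);	/* no lock taken here */
}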

> I'm possibly missing other use-cases, but those come to mind as not
> involving racy counter increments.
> 
> I agree that use-cases are so niche that we probably do _not_ want to
> introduce ADD_SHARED() for that purpose in a common header file,
> because I suspect that it would then become misused in plenty of
> scenarios where the updates are actually racy and would be better
> served by atomics or local-atomics.

I agree that unless or until we have a reasonable number of use cases, we
should open-code it.  With a big fat comment explaining how it works.  ;-)

Thanx, Paul



Re: [PATCH] arm: smp: Avoid false positive CPU hotplug Lockdep-RCU splat

2024-03-07 Thread Paul E. McKenney
On Thu, Mar 07, 2024 at 06:54:38PM +, Russell King (Oracle) wrote:
> On Thu, Mar 07, 2024 at 09:45:36AM -0800, Paul E. McKenney wrote:
> > > diff --git a/arch/arm/kernel/smp.c b/arch/arm/kernel/smp.c
> > > index 3431c0553f45..6875e2c5dd50 100644
> > > --- a/arch/arm/kernel/smp.c
> > > +++ b/arch/arm/kernel/smp.c
> > > @@ -319,7 +319,14 @@ void __noreturn arch_cpu_idle_dead(void)
> > >  {
> > >   unsigned int cpu = smp_processor_id();
> > >  
> > > + /*
> > > +  * Briefly report CPU as online again to avoid false positive
> > > +  * Lockdep-RCU splat when check_and_switch_context() acquires ASID
> > > +  * spinlock.
> > > +  */
> > > + rcutree_report_cpu_starting(cpu);
> > >   idle_task_exit();
> > > + rcutree_report_cpu_dead();
> > >  
> > >   local_irq_disable();
> > 
> > Both rcutree_report_cpu_starting() and rcutree_report_cpu_dead() complain
> > bitterly via lockdep if interrupts are enabled.  And the call sites have
> > interrupts disabled.  So I don't understand what this local_irq_disable()
> > is needed for.
> 
> I think that's a question for this commit:
> 
> commit e78a7614f3876ac649b3df608789cb6ef74d0480
> Author: Peter Zijlstra 
> Date:   Wed Jun 5 07:46:43 2019 -0700
> 
> Before this commit, arch_cpu_idle_dead() was called with IRQs enabled.
> This commit moved the local_irq_disable() before calling
> arch_cpu_idle_dead() but it seems no one looked at the various arch
> implementations to clean those up. Quite how arch people are supposed
> to spot this and clean up after such a commit, I'm not sure.

Telepathy?  ;-)

> The local_irq_disable() that you're asking about has been there ever
> since the inception of SMP on 32-bit ARM in this commit:
> 
> commit a054a811597a17ffbe92bc4db04a4dc2f1b1ea55
> Author: Russell King 
> Date:   Wed Nov 2 22:24:33 2005 +
> 
> Where cpu_die() was later renamed to arch_cpu_idle_dead(). So it's
> purely a case of a change being made to core code and arch code not
> receiving any fixups for it.

Thank you for the info!

Thanx, Paul



Re: [PATCH] rcutorture: Fix rcu_torture_pipe_update_one()/rcu_torture_writer() data race and concurrency bug

2024-03-07 Thread Paul E. McKenney
On Thu, Mar 07, 2024 at 12:44:39AM -0500, Joel Fernandes wrote:
> 
> 
> On 3/6/2024 10:37 PM, Paul E. McKenney wrote:
> > On Wed, Mar 06, 2024 at 10:06:21PM -0500, Mathieu Desnoyers wrote:
> >> On 2024-03-06 21:43, Linus Torvalds wrote:
> >> [...]
> >>>
> >>> Honestly, this all makes me think that we'd be *much* better off
> >>> showing the real "handoff" with smp_store_release() and
> >>> smp_load_acquire().
> >>
> >> We've done something similar in liburcu (userspace code) to allow
> >> Thread Sanitizer to understand the happens-before relationships
> >> within the RCU implementations and lock-free data structures.
> >>
> >> Moving to load-acquire/store-release (C11 model in our case)
> >> allowed us to provide enough happens-before relationship for
> >> Thread Sanitizer to understand what is happening under the
> >> hood in liburcu and perform relevant race detection of user
> >> code.
> > 
> > Good point!
> > 
> > In the kernel, though, KCSAN understands the Linux-kernel memory model,
> > and so we don't get that sort of false positive.
> > 
> >> As far as the WRITE_ONCE(x, READ_ONCE(x) + 1) pattern
> >> is concerned, the only valid use-case I can think of is
> >> split counters or RCU implementations where there is a
> >> single updater doing the increment, and one or more
> >> concurrent reader threads that need to snapshot a
> >> consistent value with READ_ONCE().
> > 
> > It is wrong here.  OK, not wrong from a functional viewpoint, but it
> > is at best very confusing.  I am applying patches to fix this.  I will
> > push out a new "dev" branch on -rcu that will make this function appear
> > as shown below.
> > 
> > So what would you use that pattern for?
> > 
> > One possibility is a per-CPU statistical counter in userspace on a
> > fastpath, in cases where losing the occasional count is OK.  Then learning
> > your CPU (and possibly being immediately migrated to some other CPU),
> > READ_ONCE() of the count, increment, and WRITE_ONCE() might (or might not)
> > make sense.
> > 
> > I suppose the same in the kernel if there was a fastpath so extreme you
> > could not afford to disable preemption.
> > 
> > At best, very niche.
> > 
> > Or am I suffering a failure of imagination yet again?  ;-)
> > 
> > Thanx, Paul
> > 
> > 
> > 
> > static bool
> > rcu_torture_pipe_update_one(struct rcu_torture *rp)
> > {
> > int i;
> > struct rcu_torture_reader_check *rtrcp = READ_ONCE(rp->rtort_chkp);
> > 
> > if (rtrcp) {
> > WRITE_ONCE(rp->rtort_chkp, NULL);
> > smp_store_release(&rtrcp->rtc_ready, 1); // Pair with 
> > smp_load_acquire().
> > }
> > i = rp->rtort_pipe_count;
> > if (i > RCU_TORTURE_PIPE_LEN)
> > i = RCU_TORTURE_PIPE_LEN;
> > atomic_inc(&rcu_torture_wcount[i]);
> > WRITE_ONCE(rp->rtort_pipe_count, i + 1);
> > ASSERT_EXCLUSIVE_WRITER(rp->rtort_pipe_count);
> 
> I was going to say to add a comment here for the future reader, that 
> update-side
> ->rtort_pipe_count READ/WRITE are already mutually exclusive, but this ASSERT
> already documents it ;-)

Plus KCSAN is way better at repeatedly inspecting code for this sort of
issue than I am.  ;-)

> Also FWIW I confirmed after staring at the code that indeed only one update-side
> access is possible at a time! Thanks,
> Reviewed-by: Joel Fernandes (Google) 

Thank you very much!  I will apply your Reviewed-by to these commits
on my next rebase:

28455c73b620 ("rcutorture: ASSERT_EXCLUSIVE_WRITER() for ->rtort_pipe_count updates")
b0b99e7db12e ("rcutorture: Remove extraneous rcu_torture_pipe_update_one() READ_ONCE()")

Thanx, Paul



Re: [PATCH] net: raise RCU qs after each threaded NAPI poll

2024-03-07 Thread Paul E. McKenney
On Thu, Mar 07, 2024 at 01:52:15PM -0500, Steven Rostedt wrote:
> On Thu, 7 Mar 2024 16:57:33 +
> Mark Rutland  wrote:
> 
> > * Use rcu_tasks_trace to synchronize updates?
> 
> Yes. I think I wanted both. The above to make sure it covers all cases
> where something could be preempted, and a case for those covered when RCU
> isn't watching (which always has preemption disabled).
> 
> My mistake was I thought synchronize_rcu_tasks_rude() did both. But I just
> found out recently that it is not a superset of synchronize_rcu_tasks().
> 
> But it really needs it in every location that synchronize_rcu_tasks_rude()
> is called.

Should any RCU Tasks Rude grace-period request also wait for an RCU Tasks
grace period?

I would feel better about proposing this, especially for
call_rcu_tasks_rude(), if RCU Tasks Rude was not supposed to be going
away after all the noinstr tags are in place.

Thanx, Paul



Re: [PATCH] arm: smp: Avoid false positive CPU hotplug Lockdep-RCU splat

2024-03-07 Thread Russell King (Oracle)
On Thu, Mar 07, 2024 at 09:45:36AM -0800, Paul E. McKenney wrote:
> > diff --git a/arch/arm/kernel/smp.c b/arch/arm/kernel/smp.c
> > index 3431c0553f45..6875e2c5dd50 100644
> > --- a/arch/arm/kernel/smp.c
> > +++ b/arch/arm/kernel/smp.c
> > @@ -319,7 +319,14 @@ void __noreturn arch_cpu_idle_dead(void)
> >  {
> > unsigned int cpu = smp_processor_id();
> >  
> > +   /*
> > +* Briefly report CPU as online again to avoid false positive
> > +* Lockdep-RCU splat when check_and_switch_context() acquires ASID
> > +* spinlock.
> > +*/
> > +   rcutree_report_cpu_starting(cpu);
> > idle_task_exit();
> > +   rcutree_report_cpu_dead();
> >  
> > local_irq_disable();
> 
> Both rcutree_report_cpu_starting() and rcutree_report_cpu_dead() complain
> bitterly via lockdep if interrupts are enabled.  And the call sites have
> interrupts disabled.  So I don't understand what this local_irq_disable()
> is needed for.

I think that's a question for this commit:

commit e78a7614f3876ac649b3df608789cb6ef74d0480
Author: Peter Zijlstra 
Date:   Wed Jun 5 07:46:43 2019 -0700

Before this commit, arch_cpu_idle_dead() was called with IRQs enabled.
This commit moved the local_irq_disable() before calling
arch_cpu_idle_dead() but it seems no one looked at the various arch
implementations to clean those up. Quite how arch people are supposed
to spot this and clean up after such a commit, I'm not sure.

The local_irq_disable() that you're asking about has been there ever
since the inception of SMP on 32-bit ARM in this commit:

commit a054a811597a17ffbe92bc4db04a4dc2f1b1ea55
Author: Russell King 
Date:   Wed Nov 2 22:24:33 2005 +

Where cpu_die() was later renamed to arch_cpu_idle_dead(). So it's
purely a case of a change being made to core code and arch code not
receiving any fixups for it.

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 80Mbps down 10Mbps up. Decent connectivity at last!



Re: [PATCH] arm: smp: Avoid false positive CPU hotplug Lockdep-RCU splat

2024-03-07 Thread Boqun Feng
[Cc Frederic]

On Thu, Mar 07, 2024 at 05:09:51PM +0100, Stefan Wiehler wrote:
> With CONFIG_PROVE_RCU_LIST=y and by executing
> 
>   $ echo 0 > /sys/devices/system/cpu/cpu1/online
> 
> one can trigger the following Lockdep-RCU splat on ARM:
> 
>   =
>   WARNING: suspicious RCU usage
>   6.8.0-rc7-1-g0db1d0ed8958 #10 Not tainted
>   -
>   kernel/locking/lockdep.c:3762 RCU-list traversed in non-reader section!!
> 
>   other info that might help us debug this:
> 
>   RCU used illegally from offline CPU!
>   rcu_scheduler_active = 2, debug_locks = 1
>   no locks held by swapper/1/0.
> 
>   stack backtrace:
>   CPU: 1 PID: 0 Comm: swapper/1 Not tainted 6.8.0-rc7-1-g0db1d0ed8958 #10
>   Hardware name: Allwinner sun8i Family
>unwind_backtrace from show_stack+0x10/0x14
>show_stack from dump_stack_lvl+0x60/0x90
>dump_stack_lvl from lockdep_rcu_suspicious+0x150/0x1a0
>lockdep_rcu_suspicious from __lock_acquire+0x11fc/0x29f8
>__lock_acquire from lock_acquire+0x10c/0x348
>lock_acquire from _raw_spin_lock_irqsave+0x50/0x6c
>_raw_spin_lock_irqsave from check_and_switch_context+0x7c/0x4a8
>check_and_switch_context from arch_cpu_idle_dead+0x10/0x7c
>arch_cpu_idle_dead from do_idle+0xbc/0x138
>do_idle from cpu_startup_entry+0x28/0x2c
>cpu_startup_entry from secondary_start_kernel+0x11c/0x124
>secondary_start_kernel from 0x401018a0
> 
> The CPU is already reported as offline from RCU perspective in
> cpuhp_report_idle_dead() before arch_cpu_idle_dead() is invoked. Above
> RCU-Lockdep splat is then triggered by check_and_switch_context() acquiring 
> the
> ASID spinlock.
> 
> Avoid the false-positive Lockdep-RCU splat by briefly reporting the CPU as

I'm not sure this is a false positive: if you traverse an RCU list
without RCU watching the CPU, a grace period may pass, the list's next
pointer can become garbage, and following it can take you anywhere. Or
am I missing something, such as CPU hotplug blocking a grace period?

But codewise, it looks good to me. Although I hope we could avoid calling
rcutree_report_cpu_{starting,dead}() here and use something simpler, after
a few attempts I have not found a better way.

Regards,
Boqun

> online again while the spinlock is held.
> 
> Signed-off-by: Stefan Wiehler 
> ---
>  arch/arm/kernel/smp.c | 7 +++
>  1 file changed, 7 insertions(+)
> 
> diff --git a/arch/arm/kernel/smp.c b/arch/arm/kernel/smp.c
> index 3431c0553f45..6875e2c5dd50 100644
> --- a/arch/arm/kernel/smp.c
> +++ b/arch/arm/kernel/smp.c
> @@ -319,7 +319,14 @@ void __noreturn arch_cpu_idle_dead(void)
>  {
>   unsigned int cpu = smp_processor_id();
>  
> + /*
> +  * Briefly report CPU as online again to avoid false positive
> +  * Lockdep-RCU splat when check_and_switch_context() acquires ASID
> +  * spinlock.
> +  */
> + rcutree_report_cpu_starting(cpu);
>   idle_task_exit();
> + rcutree_report_cpu_dead();
>  
>   local_irq_disable();
>  
> -- 
> 2.42.0
> 



Re: [PATCH] net: raise RCU qs after each threaded NAPI poll

2024-03-07 Thread Steven Rostedt
On Thu, 7 Mar 2024 16:57:33 +
Mark Rutland  wrote:

> * Use rcu_tasks_trace to synchronize updates?

Yes. I think I wanted both. The above to make sure it covers all cases
where something could be preempted, and a case for those covered when RCU
isn't watching (which always has preemption disabled).

My mistake was I thought synchronize_rcu_tasks_rude() did both. But I just
found out recently that it is not a superset of synchronize_rcu_tasks().

But it really needs it in every location that synchronize_rcu_tasks_rude()
is called.

-- Steve



Re: [PATCH] net: raise RCU qs after each threaded NAPI poll

2024-03-07 Thread Mark Rutland
On Thu, Mar 07, 2024 at 04:57:33PM +, Mark Rutland wrote:
> On Mon, Mar 04, 2024 at 04:16:01AM -0500, Joel Fernandes wrote:
> > On 3/2/2024 8:01 PM, Joel Fernandes wrote:
> > Case 1: For !CONFIG_DYNAMIC_FTRACE update of ftrace_trace_function
> > 
> > This config is itself expected to be slow. However seeing what it does, it 
> > is
> > trying to make sure the global function pointer "ftrace_trace_function" is
> > updated and any readers of that pointers would have finished reading it. I 
> > don't
> > personally think preemption has to be disabled across the entirety of the
> > section that calls into this function. So sensitivity to preempt disabling
> > should not be relevant for this case IMO, but lets see if ftrace folks 
> > disagree
> > (on CC). It has more to do with, any callers of this function pointer are no
> > longer calling into the old function.
> 
> I've been looking into this case for the last couple of days, and the short
> story is that the existing code is broken today for PREEMPT_FULL, and the code
> for CONFIG_DYNAMIC_FTRACE=y is similarly broken. A number of architectures 
> have
> also implemented the entry assembly incorrectly...

> I believe our options are:
> 
> * Avoid the mismatch by construction:
> 
>   - On architectures with trampolines, we could ensure that the list_ops gets
> its own trampoline and that we *always* use a trampoline, never using a
> common ftrace_caller. When we switch callers from one trampoline to 
> another
> they'd atomically get the new func+ops.
> 
> I reckon that might be a good option for x86-64?
> 
>   - On architectures without trampolines we could require that the
> ftrace_caller loads ops->func from the ops pointer (see the sketch
> after this list).
> 
> That'd mean removing the 'ftrace_trace_function' pointer and removing
> patching of the call to the trace function (but the actual tracee 
> callsites
> would still be patched).
> 
> I'd be in favour of this for arm64 since that matches the way CALL_OPS
> works; the only difference is we'd load a global ops pointer rather than a
> per-callsite ops pointer.
> 
> * Use rcu_tasks_trace to synchronize updates?
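
As a very rough C-level sketch of that last shape (hypothetical code; the
real thing is entry assembly): if the callback is always taken from the ops
pointer that was just loaded, the func/ops pair can never be observed
mismatched, even if the task is preempted in between.

#include <linux/ftrace.h>

/* Hypothetical single global ops pointer, updated on the write side. */
static struct ftrace_ops *global_ops;

static void ftrace_entry_sketch(unsigned long ip, unsigned long parent_ip,
				struct ftrace_regs *fregs)
{
	/* One load of the ops pointer... */
	struct ftrace_ops *ops = READ_ONCE(global_ops);

	/* ...and the callback comes from that same ops, so the two cannot
	 * be mismatched the way ftrace_trace_function/function_trace_op
	 * can be in the existing code. */
	ops->func(ip, parent_ip, ops, fregs);
}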

Having acquainted myself with the RCU flavours, I think the RCU Tasks Trace
suggestion wouldn't help, but *some* flavour of RCU might give us what we need.

That said, my preference is the "avoid the mismatch by construction" camp, as
even if we need to wait for uses of the old func+ops to finish, we'd have fewer
transitions (and consequently less patching) if we have:

switch_to_new_ops();
wait_for_old_ops_usage_to_finish();

... rather than:

switch_to_list_func();
wait_for_old_ops_usage_to_finish();
switch_to_new_ops();
ensure_new_ops_are_visible();
switch_to_new_func();

Mark.



Re: [PATCH] arm: smp: Avoid false positive CPU hotplug Lockdep-RCU splat

2024-03-07 Thread Paul E. McKenney
On Thu, Mar 07, 2024 at 09:45:36AM -0800, Paul E. McKenney wrote:
> On Thu, Mar 07, 2024 at 05:09:51PM +0100, Stefan Wiehler wrote:
> > With CONFIG_PROVE_RCU_LIST=y and by executing
> > 
> >   $ echo 0 > /sys/devices/system/cpu/cpu1/online
> > 
> > one can trigger the following Lockdep-RCU splat on ARM:
> > 
> >   =
> >   WARNING: suspicious RCU usage
> >   6.8.0-rc7-1-g0db1d0ed8958 #10 Not tainted
> >   -
> >   kernel/locking/lockdep.c:3762 RCU-list traversed in non-reader section!!
> > 
> >   other info that might help us debug this:
> > 
> >   RCU used illegally from offline CPU!
> >   rcu_scheduler_active = 2, debug_locks = 1
> >   no locks held by swapper/1/0.
> > 
> >   stack backtrace:
> >   CPU: 1 PID: 0 Comm: swapper/1 Not tainted 6.8.0-rc7-1-g0db1d0ed8958 
> > #10
> >   Hardware name: Allwinner sun8i Family
> >unwind_backtrace from show_stack+0x10/0x14
> >show_stack from dump_stack_lvl+0x60/0x90
> >dump_stack_lvl from lockdep_rcu_suspicious+0x150/0x1a0
> >lockdep_rcu_suspicious from __lock_acquire+0x11fc/0x29f8
> >__lock_acquire from lock_acquire+0x10c/0x348
> >lock_acquire from _raw_spin_lock_irqsave+0x50/0x6c
> >_raw_spin_lock_irqsave from check_and_switch_context+0x7c/0x4a8
> >check_and_switch_context from arch_cpu_idle_dead+0x10/0x7c
> >arch_cpu_idle_dead from do_idle+0xbc/0x138
> >do_idle from cpu_startup_entry+0x28/0x2c
> >cpu_startup_entry from secondary_start_kernel+0x11c/0x124
> >secondary_start_kernel from 0x401018a0
> > 
> > The CPU is already reported as offline from RCU perspective in
> > cpuhp_report_idle_dead() before arch_cpu_idle_dead() is invoked. Above
> > RCU-Lockdep splat is then triggered by check_and_switch_context() acquiring 
> > the
> > ASID spinlock.
> > 
> > Avoid the false-positive Lockdep-RCU splat by briefly reporting the CPU as
> > online again while the spinlock is held.
> > 
> > Signed-off-by: Stefan Wiehler 
> 
> From an RCU perspective, this looks plausible.  One question
> below.

But one additional caution...  If execution is delayed during that call
to idle_task_exit(), RCU will stall and won't have a reasonable way of
motivating this CPU.  Such delays could be due to vCPU preemption or
due to firmware grabbing the CPU.

But this is only a caution, not opposition.  After all, you could have
the same problem with an online CPU that gets similarly delayed while
its interrupts are disabled.

Thanx, Paul

> > ---
> >  arch/arm/kernel/smp.c | 7 +++
> >  1 file changed, 7 insertions(+)
> > 
> > diff --git a/arch/arm/kernel/smp.c b/arch/arm/kernel/smp.c
> > index 3431c0553f45..6875e2c5dd50 100644
> > --- a/arch/arm/kernel/smp.c
> > +++ b/arch/arm/kernel/smp.c
> > @@ -319,7 +319,14 @@ void __noreturn arch_cpu_idle_dead(void)
> >  {
> > unsigned int cpu = smp_processor_id();
> >  
> > +   /*
> > +* Briefly report CPU as online again to avoid false positive
> > +* Lockdep-RCU splat when check_and_switch_context() acquires ASID
> > +* spinlock.
> > +*/
> > +   rcutree_report_cpu_starting(cpu);
> > idle_task_exit();
> > +   rcutree_report_cpu_dead();
> >  
> > local_irq_disable();
> 
> Both rcutree_report_cpu_starting() and rcutree_report_cpu_dead() complain
> bitterly via lockdep if interrupts are enabled.  And the call sites have
> interrupts disabled.  So I don't understand what this local_irq_disable()
> is needed for.



Re: [PATCH] arm: smp: Avoid false positive CPU hotplug Lockdep-RCU splat

2024-03-07 Thread Paul E. McKenney
On Thu, Mar 07, 2024 at 05:09:51PM +0100, Stefan Wiehler wrote:
> With CONFIG_PROVE_RCU_LIST=y and by executing
> 
>   $ echo 0 > /sys/devices/system/cpu/cpu1/online
> 
> one can trigger the following Lockdep-RCU splat on ARM:
> 
>   =
>   WARNING: suspicious RCU usage
>   6.8.0-rc7-1-g0db1d0ed8958 #10 Not tainted
>   -
>   kernel/locking/lockdep.c:3762 RCU-list traversed in non-reader section!!
> 
>   other info that might help us debug this:
> 
>   RCU used illegally from offline CPU!
>   rcu_scheduler_active = 2, debug_locks = 1
>   no locks held by swapper/1/0.
> 
>   stack backtrace:
>   CPU: 1 PID: 0 Comm: swapper/1 Not tainted 6.8.0-rc7-1-g0db1d0ed8958 #10
>   Hardware name: Allwinner sun8i Family
>unwind_backtrace from show_stack+0x10/0x14
>show_stack from dump_stack_lvl+0x60/0x90
>dump_stack_lvl from lockdep_rcu_suspicious+0x150/0x1a0
>lockdep_rcu_suspicious from __lock_acquire+0x11fc/0x29f8
>__lock_acquire from lock_acquire+0x10c/0x348
>lock_acquire from _raw_spin_lock_irqsave+0x50/0x6c
>_raw_spin_lock_irqsave from check_and_switch_context+0x7c/0x4a8
>check_and_switch_context from arch_cpu_idle_dead+0x10/0x7c
>arch_cpu_idle_dead from do_idle+0xbc/0x138
>do_idle from cpu_startup_entry+0x28/0x2c
>cpu_startup_entry from secondary_start_kernel+0x11c/0x124
>secondary_start_kernel from 0x401018a0
> 
> The CPU is already reported as offline from RCU perspective in
> cpuhp_report_idle_dead() before arch_cpu_idle_dead() is invoked. Above
> RCU-Lockdep splat is then triggered by check_and_switch_context() acquiring 
> the
> ASID spinlock.
> 
> Avoid the false-positive Lockdep-RCU splat by briefly reporting the CPU as
> online again while the spinlock is held.
> 
> Signed-off-by: Stefan Wiehler 

From an RCU perspective, this looks plausible.  One question
below.

Thanx, Paul

> ---
>  arch/arm/kernel/smp.c | 7 +++
>  1 file changed, 7 insertions(+)
> 
> diff --git a/arch/arm/kernel/smp.c b/arch/arm/kernel/smp.c
> index 3431c0553f45..6875e2c5dd50 100644
> --- a/arch/arm/kernel/smp.c
> +++ b/arch/arm/kernel/smp.c
> @@ -319,7 +319,14 @@ void __noreturn arch_cpu_idle_dead(void)
>  {
>   unsigned int cpu = smp_processor_id();
>  
> + /*
> +  * Briefly report CPU as online again to avoid false positive
> +  * Lockdep-RCU splat when check_and_switch_context() acquires ASID
> +  * spinlock.
> +  */
> + rcutree_report_cpu_starting(cpu);
>   idle_task_exit();
> + rcutree_report_cpu_dead();
>  
>   local_irq_disable();

Both rcutree_report_cpu_starting() and rcutree_report_cpu_dead() complain
bitterly via lockdep if interrupts are enabled.  And the call sites have
interrupts disabled.  So I don't understand what this local_irq_disable()
is needed for.



Re: [PATCH] net: raise RCU qs after each threaded NAPI poll

2024-03-07 Thread Mark Rutland
Hi Joel, Paul, Steve,

On Mon, Mar 04, 2024 at 04:16:01AM -0500, Joel Fernandes wrote:
> Hi Paul,
> 
> On 3/2/2024 8:01 PM, Joel Fernandes wrote:
> >> As you noted, one thing that Ankur's series changes is that preemption
> >> can occur anywhere that it is not specifically disabled in kernels
> >> built with CONFIG_PREEMPT_NONE=y or CONFIG_PREEMPT_VOLUNTARY=y.  This in
> >> turn changes Tasks Rude RCU's definition of a quiescent state for these
> >> kernels, adding all code regions where preemption is not specifically
> >> disabled to the list of such quiescent states.
> >>
> >> Although from what I know, this is OK, it would be good to check the
> >> calls to call_rcu_tasks_rude() or synchronize_rcu_tasks_rude() are set
> >> up so as to expect these new quiescent states.  One example where it
> >> would definitely be OK is if there was a call to synchronize_rcu_tasks()
> >> right before or after that call to synchronize_rcu_tasks_rude().
> >>
> >> Would you be willing to check the call sites to verify that they
> >> are OK with this change in 
> > Yes, I will analyze and make sure those users did not unexpectedly
> > assume something about AUTO (i.e. preempt enabled sections using
> > readers).
> 
> Other than RCU test code, there are just 3 call sites for RUDE right now,
> all in ftrace.c.
> 
> (Long story short, PREEMPT_AUTO should not cause wreckage in TASKS_RCU_RUDE
> other than any preexisting wreckage that !PREEMPT_AUTO already had. Steve is
> on CC as well to CMIIW).
> 
> Case 1: For !CONFIG_DYNAMIC_FTRACE update of ftrace_trace_function
> 
> This config is itself expected to be slow. However, seeing what it does, it is
> trying to make sure the global function pointer "ftrace_trace_function" is
> updated and any readers of that pointer would have finished reading it. I don't
> personally think preemption has to be disabled across the entirety of the
> section that calls into this function. So sensitivity to preempt disabling
> should not be relevant for this case IMO, but let's see if ftrace folks
> disagree (on CC). It has more to do with: any callers of this function pointer
> are no longer calling into the old function.

I've been looking into this case for the last couple of days, and the short
story is that the existing code is broken today for PREEMPT_FULL, and the code
for CONFIG_DYNAMIC_FTRACE=y is similarly broken. A number of architectures have
also implemented the entry assembly incorrectly...

The !CONFIG_DYNAMIC_FTRACE code in update_ftrace_function() is intended to
ensure that 'ftrace_trace_function' and 'function_trace_op' are used as a
consistent pair; i.e. for a given 'op', op->func() should be called with that
specific 'op' as an argument. That was introduced back in commit:

  405e1d834807e51b ("ftrace: Synchronize setting function_trace_op with ftrace_trace_function")

... and subsequently reworked to use synchronize_rcu_tasks_rude() in commit:

  e5a971d76d701dbf ("ftrace: Use synchronize_rcu_tasks_rude() instead of ftrace_sync()")

The key detail is that the ftrace entry code doesn't disable preemption, and so
when full preemption is used the entry code can be preempted between loading
'ftrace_trace_function' and 'function_trace_op', and there's a race of the
form:

  Thread A                          Thread B

  // in ftrace entry asm            // in update_ftrace_function()

  tmp_func = ftrace_trace_function;
  < preempted >
                                    ftrace_trace_function = ftrace_ops_list_func;
                                    /*
                                     * Make sure all CPUs see this. Yes this is
                                     * slow, but static tracing is slow and nasty
                                     * to have enabled.
                                     */
                                    synchronize_rcu_tasks_rude();
                                    /* Now all cpus are using the list ops. */
                                    function_trace_op = set_function_trace_op;
  < restored >
  tmp_ops = function_trace_op;

  tmp_func(..., tmp_ops, ...);

... and so we can end up using the old func with the new ops.



For static ftrace, this is only a problem when full preemption is used *and*
the architecture defines ARCH_SUPPORTS_FTRACE_OPS=1. Otherwise architecture
code isn't expected to provide the ops, and core code forces the use of the
list func, which iterates over all the ops to call them, e.g.

do_for_each_ftrace_op(op, ftrace_ops_list) {
...
op->func(ip, parent_ip, op, fregs);
} while_for_each_ftrace_op(op);

For static ftrace, a simple fix would be to force the use of the list func, and
not support static ftrace && ARCH_SUPPORTS_FTRACE_OPS=1. That's what x86 does
today.

Another option is to follow the example of DYNAMIC_FTRACE_WITH_CALL_OP

[PATCH] arm: smp: Avoid false positive CPU hotplug Lockdep-RCU splat

2024-03-07 Thread Stefan Wiehler
With CONFIG_PROVE_RCU_LIST=y and by executing

  $ echo 0 > /sys/devices/system/cpu/cpu1/online

one can trigger the following Lockdep-RCU splat on ARM:

  =============================
  WARNING: suspicious RCU usage
  6.8.0-rc7-1-g0db1d0ed8958 #10 Not tainted
  -----------------------------
  kernel/locking/lockdep.c:3762 RCU-list traversed in non-reader section!!

  other info that might help us debug this:

  RCU used illegally from offline CPU!
  rcu_scheduler_active = 2, debug_locks = 1
  no locks held by swapper/1/0.

  stack backtrace:
  CPU: 1 PID: 0 Comm: swapper/1 Not tainted 6.8.0-rc7-1-g0db1d0ed8958 #10
  Hardware name: Allwinner sun8i Family
   unwind_backtrace from show_stack+0x10/0x14
   show_stack from dump_stack_lvl+0x60/0x90
   dump_stack_lvl from lockdep_rcu_suspicious+0x150/0x1a0
   lockdep_rcu_suspicious from __lock_acquire+0x11fc/0x29f8
   __lock_acquire from lock_acquire+0x10c/0x348
   lock_acquire from _raw_spin_lock_irqsave+0x50/0x6c
   _raw_spin_lock_irqsave from check_and_switch_context+0x7c/0x4a8
   check_and_switch_context from arch_cpu_idle_dead+0x10/0x7c
   arch_cpu_idle_dead from do_idle+0xbc/0x138
   do_idle from cpu_startup_entry+0x28/0x2c
   cpu_startup_entry from secondary_start_kernel+0x11c/0x124
   secondary_start_kernel from 0x401018a0

The CPU is already reported as offline from RCU perspective in
cpuhp_report_idle_dead() before arch_cpu_idle_dead() is invoked. Above
RCU-Lockdep splat is then triggered by check_and_switch_context() acquiring the
ASID spinlock.

Avoid the false-positive Lockdep-RCU splat by briefly reporting the CPU as
online again while the spinlock is held.

Signed-off-by: Stefan Wiehler 
---
 arch/arm/kernel/smp.c | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/arch/arm/kernel/smp.c b/arch/arm/kernel/smp.c
index 3431c0553f45..6875e2c5dd50 100644
--- a/arch/arm/kernel/smp.c
+++ b/arch/arm/kernel/smp.c
@@ -319,7 +319,14 @@ void __noreturn arch_cpu_idle_dead(void)
 {
unsigned int cpu = smp_processor_id();
 
+   /*
+* Briefly report CPU as online again to avoid false positive
+* Lockdep-RCU splat when check_and_switch_context() acquires ASID
+* spinlock.
+*/
+   rcutree_report_cpu_starting(cpu);
idle_task_exit();
+   rcutree_report_cpu_dead();
 
local_irq_disable();
 
-- 
2.42.0




Re: [PATCH] rcutorture: Fix rcu_torture_pipe_update_one()/rcu_torture_writer() data race and concurrency bug

2024-03-07 Thread Steven Rostedt
On Thu, 7 Mar 2024 08:20:59 -0500
Steven Rostedt  wrote:

> When a write happens, it looks to see if the smallest watermark is hit,
> if so, calls irqwork to wakeup all the waiters.
> 
> The waiters will wake up, check to see if a signal is pending or if the
> ring buffer has hit the watermark the waiter was waiting for and exit the
> wait loop.
> 
> What the wait_index does, is just a way to force all waiters out of the wait
> loop regardless of the watermark the waiter is waiting for. Before a waiter
> goes into the wait loop, it saves the current wait_index. The waker will
> increment the wait_index and then call the same irq_work to wake up all the
> waiters.
> 
> After the wakeup happens, the waiter will test if the new wait_index
> matches what it was before it entered the loop, and if it is different, it
> falls out of the loop. Then the caller of the ring_buffer_wait() can
> re-evaluate if it needs to enter the wait again.
> 
> The wait_index loop exit was needed for when the file descriptor of a file
> that uses a ring buffer closes and it needs to wake up all the readers of
> that file descriptor to notify their tasks that the file closed.
> 
> So we can switch the:
> 
>   rbwork->wait_index++;
>   smp_wmb();
> 
> into just a:
> 
>   (void)atomic_inc_return_release(&rbwork->wait_index);
> 
> and the:
> 
>   smp_rmb()
>   if (wait_index != work->wait_index)
> 
> into:
> 
>   if (wait_index != atomic_read_acquire(&rb->wait_index))
> 
> I'll write up a patch.
> 
> Hmm, I have the same wait_index logic at the higher level doing basically
> the same thing (at the close of the file). I'll switch that over too.

Discussing this with Mathieu on IRC, we found two bugs with the current
implementation. One was a stupid bug with an easy fix, and the other is
actually a design flaw.

The first bug was the (wait_index != work->wait_index) check was done
*after* the schedule() call and not before it.
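
For reference, the conventional pattern re-checks the condition after
prepare_to_wait() and before schedule(), so that a wakeup arriving in that
window is not lost. A minimal sketch of that ordering (illustrative only, not
the ring-buffer code itself; my_wq and my_cond are made-up names):

	#include <linux/sched.h>
	#include <linux/sched/signal.h>
	#include <linux/wait.h>

	static DECLARE_WAIT_QUEUE_HEAD(my_wq);
	static bool my_cond;	/* wakers do: WRITE_ONCE(my_cond, true); wake_up(&my_wq); */

	static int wait_for_cond(void)
	{
		DEFINE_WAIT(wait);
		int ret = 0;

		for (;;) {
			prepare_to_wait(&my_wq, &wait, TASK_INTERRUPTIBLE);
			if (READ_ONCE(my_cond))		/* check *before* sleeping */
				break;
			if (signal_pending(current)) {
				ret = -EINTR;
				break;
			}
			/* A wakeup issued after prepare_to_wait() is not missed. */
			schedule();
		}
		finish_wait(&my_wq, &wait);
		return ret;
	}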

The second more fundamental bug is that there's still a race between the
first read of wait_index and the call to prepare_to_wait().

The ring_buffer code doesn't have enough context to know whether to loop or
not. If a file is being closed when another thread is just entering this
code, it could miss the wakeup.

As the callers of ring_buffer_wait() also do a loop, it's redundant to have
ring_buffer_wait() do a loop. It should just do a single wait, and then
exit and let the callers decide if it should loop again.

This will get rid of the need for the rbwork->wait_index and simplifies the
code.

Working on that patch now.

-- Steve



Re: [PATCH] rcutorture: Fix rcu_torture_pipe_update_one()/rcu_torture_writer() data race and concurrency bug

2024-03-07 Thread Mathieu Desnoyers

On 2024-03-06 22:37, Paul E. McKenney wrote:

On Wed, Mar 06, 2024 at 10:06:21PM -0500, Mathieu Desnoyers wrote:

[...]



As far as the WRITE_ONCE(x, READ_ONCE(x) + 1) pattern
is concerned, the only valid use-case I can think of is
split counters or RCU implementations where there is a
single updater doing the increment, and one or more
concurrent reader threads that need to snapshot a
consistent value with READ_ONCE().



[...]


So what would you use that pattern for?

One possibility is a per-CPU statistical counter in userspace on a
fastpath, in cases where losing the occasional count is OK.  Then learning
your CPU (and possibly being immediately migrated to some other CPU),
READ_ONCE() of the count, increment, and WRITE_ONCE() might (or might not)
make sense.

I suppose the same in the kernel if there was a fastpath so extreme you
could not afford to disable preemption.

At best, very niche.

Or am I suffering a failure of imagination yet again?  ;-)


The (niche) use-cases I have in mind are split-counters and RCU
grace period tracking, where precise counters totals are needed
(no lost count).

In the kernel, this could be:

- A per-cpu counter, each counter incremented from thread context with
  preemption disabled (single updater per counter), read concurrently by
  other threads. WRITE_ONCE/READ_ONCE is useful to make sure there
  is no store/load tearing there. Atomics on the update would be stronger
  than necessary on the increment fast-path.

- A per-thread counter (e.g. within task_struct), only incremented by the
  single thread, read by various threads concurrently.

- A counter which increment happens to be already protected by a lock, read
  by various threads without taking the lock. (technically doable, but
  I'm not sure I see a relevant use-case for it)

In user-space:

- The "per-cpu" counter would have to use rseq for increments to prevent
  inopportune migrations, which needs to be implemented in assembler anyway.
  The counter reads would have to use READ_ONCE().

- The per-thread counter (Thread-Local Storage) incremented by a single
  thread, read by various threads concurrently, is a good target
  for WRITE_ONCE()/READ_ONCE() pairing. This is actually what we do in
  various liburcu implementations which track read-side critical sections
  per-thread.

- Same as for the kernel, a counter increment protected by a lock which
  needs to be read from various threads concurrently without taking
  the lock could be a valid use-case, though I fail to see how it is
  useful due to lack of imagination on my part. ;-)

I'm possibly missing other use-cases, but those come to mind as not
involving racy counter increments.
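
To make the non-racy increment pattern concrete, here is a minimal kernel-style
sketch of the first two cases above (illustrative only, not taken from existing
code; my_count and the helpers are made-up names). The single updater runs with
preemption disabled and publishes with WRITE_ONCE(), while readers snapshot
with READ_ONCE(), so no increment is lost and no load/store tearing can occur:

	#include <linux/compiler.h>
	#include <linux/cpumask.h>
	#include <linux/percpu.h>

	static DEFINE_PER_CPU(unsigned long, my_count);

	/* Single updater per counter: caller has preemption disabled. */
	static inline void my_count_inc(void)
	{
		unsigned long *c = this_cpu_ptr(&my_count);

		/*
		 * A plain read is fine for the lone updater; WRITE_ONCE()
		 * pairs with the READ_ONCE() in my_count_total().
		 */
		WRITE_ONCE(*c, *c + 1);
	}

	/* Concurrent readers: each per-CPU value is read without tearing. */
	static unsigned long my_count_total(void)
	{
		unsigned long sum = 0;
		int cpu;

		for_each_possible_cpu(cpu)
			sum += READ_ONCE(*per_cpu_ptr(&my_count, cpu));

		/* The total is approximate in time, but no increment is lost. */
		return sum;
	}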

I agree that use-cases are so niche that we probably do _not_ want to
introduce ADD_SHARED() for that purpose in a common header file,
because I suspect that it would then become misused in plenty of
scenarios where the updates are actually racy and would be better
served by atomics or local-atomics.

Thanks,

Mathieu

--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com




Re: [PATCH] rcutorture: Fix rcu_torture_pipe_update_one()/rcu_torture_writer() data race and concurrency bug

2024-03-07 Thread Steven Rostedt
On Wed, 6 Mar 2024 12:06:00 -0800
Linus Torvalds  wrote:

> On Wed, 6 Mar 2024 at 11:45, Steven Rostedt  wrote:
> >
> > Here's the back story. I received the following patch:
> >
> >   
> > https://lore.kernel.org/all/tencent_ba1473492bc618b473864561ea3ab1418...@qq.com/
> >
> > I didn't like it. My reply was:
> >  
> > > - rbwork->wait_index++;
> > > + WRITE_ONCE(rbwork->wait_index, READ_ONCE(rbwork->wait_index) + 1);
> >
> > I mean the above is really ugly. If this is the new thing to do, we need
> > better macros.
> >
> > If anything, just convert it to an atomic_t.  
> 
> The right thing is definitely to convert it to an atomic_t.

Agreed.

> 
> The memory barriers can probably also be turned into atomic ordering,
> although we don't always have all the variates.
> 
> But for example, that
> 
> /* Make sure to see the new wait index */
> smp_rmb();
> if (wait_index != work->wait_index)
> break;
> 
> looks odd, and should probably do an "atomic_read_acquire()" instead
> of a rmb and a (non-atomic and non-READ_ONCE thing).
> 
> The first READ_ONCE() should probably also be that atomic_read_acquire() op.
> 
> On the writing side, my gut feel is that the
> 
> rbwork->wait_index++;
> /* make sure the waiters see the new index */
> smp_wmb();
> 
> should be an "atomic_inc_release(&rbwork->wait_index);" but we don't
> actually have that operation. We only have the "release" versions for
> things that return a value.
> 
> So it would probably need to be either
> 
> atomic_inc(&rbwork->wait_index);
> /* make sure the waiters see the new index */
> smp_wmb();
> 
> or
> 
> atomic_inc_return_release(&rbwork->wait_index);
> 
> or we'd need to add the "basic atomics with ordering semantics" (which
> we aren't going to do unless we end up with a lot more people who want
> them).
> 
> I dunno. I didn't look all *that* closely at the code. The above might
> be garbage too. Somebody who actually knows the code should think
> about what ordering they actually were looking for.
> 
> (And I note that 'wait_index' is of type 'long' in 'struct
> rb_irq_work', so I guess it should be "atomic_long_t" instead -  just
> shows how little attention I paid on the first read-through, which
> should make everybody go "I need to double-check Linus here")

It doesn't need to be long. I'm not even sure why I made it long. Probably
for natural alignment.

The code isn't that complex. You can have a list of waiters waiting for the
ring buffer to fill to various watermarks (in most cases, it's just one waiter).

When a write happens, it looks to see if the smallest watermark is hit,
if so, calls irqwork to wakeup all the waiters.

The waiters will wake up, check to see if a signal is pending or if the
ring buffer has hit the watermark the waiter was waiting for and exit the
wait loop.

What the wait_index does, is just a way to force all waiters out of the wait
loop regardless of the watermark the waiter is waiting for. Before a waiter
goes into the wait loop, it saves the current wait_index. The waker will
increment the wait_index and then call the same irq_work to wake up all the
waiters.

After the wakeup happens, the waiter will test if the new wait_index
matches what it was before it entered the loop, and if it is different, it
falls out of the loop. Then the caller of the ring_buffer_wait() can
re-evaluate if it needs to enter the wait again.

The wait_index loop exit was needed for when the file descriptor of a file
that uses a ring buffer closes and it needs to wake up all the readers of
that file descriptor to notify their tasks that the file closed.

So we can switch the:

rbwork->wait_index++;
smp_wmb();

into just a:

(void)atomic_inc_return_release(&rbwork->wait_index);

and the:

smp_rmb()
if (wait_index != work->wait_index)

into:

if (wait_index != atomic_read_acquire(&rb->wait_index))

I'll write up a patch.

Hmm, I have the same wait_index logic at the higher level doing basically
the same thing (at the close of the file). I'll switch that over too.
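
To make the conversion concrete, here is a rough sketch of the atomic_t version
(names follow the discussion, but this is illustrative, not the actual patch;
the acquire/release pairing mirrors the smp_rmb()/smp_wmb() it replaces):

	#include <linux/atomic.h>

	struct rb_irq_work_sketch {		/* stand-in for struct rb_irq_work */
		atomic_t	wait_index;
	};

	/* Waker side: replaces "rbwork->wait_index++; smp_wmb();". */
	static void rb_wake_up_waiters_sketch(struct rb_irq_work_sketch *rbwork)
	{
		(void)atomic_inc_return_release(&rbwork->wait_index);
		/* The existing irq_work is then queued to wake all waiters. */
	}

	/* Waiter side: take the snapshot with acquire before sleeping... */
	static int rb_wait_index_snapshot(struct rb_irq_work_sketch *rbwork)
	{
		return atomic_read_acquire(&rbwork->wait_index);
	}

	/* ...and this replaces "smp_rmb(); if (wait_index != work->wait_index)". */
	static bool rb_forced_out(struct rb_irq_work_sketch *rbwork, int snapshot)
	{
		return snapshot != atomic_read_acquire(&rbwork->wait_index);
	}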

-- Steve



Re: [PATCH 1/2] rcu: Do not release a wait-head from a GP kthread

2024-03-07 Thread Joel Fernandes



On 3/7/2024 7:57 AM, Uladzislau Rezki wrote:
> On Wed, Mar 06, 2024 at 05:31:31PM -0500, Joel Fernandes wrote:
>>
>>
>> On 3/5/2024 2:57 PM, Uladzislau Rezki (Sony) wrote:
>>> Fix a below race by not releasing a wait-head from the
>>> GP-kthread as it can lead for reusing it whereas a worker
>>> can still access it thus execute newly added callbacks too
>>> early.
>>>
>>> CPU 0  CPU 1
>>> -  -
>>>
>>> // wait_tail == HEAD1
>>> rcu_sr_normal_gp_cleanup() {
>>> // has passed SR_MAX_USERS_WAKE_FROM_GP
>>> wait_tail->next = next;
>>> // done_tail = HEAD1
>>> smp_store_release(&rcu_state.srs_done_tail, wait_tail);
>>> queue_work() {
>>> test_and_set_bit(WORK_STRUCT_PENDING_BIT, work_data_bits(work)
>>> __queue_work()
>>> }
>>> }
>>>
>>>set_work_pool_and_clear_pending()
>>>rcu_sr_normal_gp_cleanup_work() {
[..]
>>>
>>> Reported-by: Frederic Weisbecker 
>>> Fixes: 05a10b921000 ("rcu: Support direct wake-up of synchronize_rcu() 
>>> users")
>>> Signed-off-by: Uladzislau Rezki (Sony) 
>>> ---
>>>  kernel/rcu/tree.c | 22 --
>>>  1 file changed, 8 insertions(+), 14 deletions(-)
>>>
>>> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
>>> index 31f3a61f9c38..475647620b12 100644
>>> --- a/kernel/rcu/tree.c
>>> +++ b/kernel/rcu/tree.c
>>> @@ -1656,21 +1656,11 @@ static void rcu_sr_normal_gp_cleanup(void)
>>> WARN_ON_ONCE(!rcu_sr_is_wait_head(wait_tail));
>>>  
>>> /*
>>> -* Process (a) and (d) cases. See an illustration. Apart of
>>> -* that it handles the scenario when all clients are done,
>>> -* wait-head is released if last. The worker is not kicked.
>>> +* Process (a) and (d) cases. See an illustration.
>>>  */
>>> llist_for_each_safe(rcu, next, wait_tail->next) {
>>> -   if (rcu_sr_is_wait_head(rcu)) {
>>> -   if (!rcu->next) {
>>> -   rcu_sr_put_wait_head(rcu);
>>> -   wait_tail->next = NULL;
>>> -   } else {
>>> -   wait_tail->next = rcu;
>>> -   }
>>> -
>>> +   if (rcu_sr_is_wait_head(rcu))
>>> break;
>>> -   }
>>>  
>>> rcu_sr_normal_complete(rcu);
>>> // It can be last, update a next on this step.
>>> @@ -1684,8 +1674,12 @@ static void rcu_sr_normal_gp_cleanup(void)
>>> smp_store_release(&rcu_state.srs_done_tail, wait_tail);
>>> ASSERT_EXCLUSIVE_WRITER(rcu_state.srs_done_tail);
>>>  
>>> -   if (wait_tail->next)
>>> -   queue_work(system_highpri_wq, &rcu_state.srs_cleanup_work);
>>> +   /*
>>> +* We schedule a work in order to perform a final processing
>>> +* of outstanding users(if still left) and releasing wait-heads
>>> +* added by rcu_sr_normal_gp_init() call.
>>> +*/
>>> +   queue_work(system_highpri_wq, &rcu_state.srs_cleanup_work);
>>>  }
>>
>> Ah, nice. So instead of allocating/freeing in GP thread and freeing in worker,
>> you allocate heads only in GP thread and free them only in worker, thus
>> essentially fixing the UAF that Frederic found.
>>
>> AFAICS, this fixes the issue.
>>
>> Reviewed-by: Joel Fernandes (Google) 
>>
> Thank you for the review-by!
> 
>> There might be a way to prevent queuing new work as a fast-path optimization,
>> in case the CBs per GP will always be < SR_MAX_USERS_WAKE_FROM_GP, but I could
>> not find a workqueue API that helps there, and work_busy() has comments saying
>> not to use that.
>>
> This is not really critical but yes, we can think of it.
> 

Thanks, I have a patch that does that. I could not help but write it as soon as
I woke up in the morning, ;-). It passes torture and I will push it for further
review after some more testing.

thanks,

 - Joel



Re: [PATCH 1/2] rcu: Do not release a wait-head from a GP kthread

2024-03-07 Thread Uladzislau Rezki
On Wed, Mar 06, 2024 at 05:31:31PM -0500, Joel Fernandes wrote:
> 
> 
> On 3/5/2024 2:57 PM, Uladzislau Rezki (Sony) wrote:
> > Fix a below race by not releasing a wait-head from the
> > GP-kthread as it can lead for reusing it whereas a worker
> > can still access it thus execute newly added callbacks too
> > early.
> > 
> > CPU 0  CPU 1
> > -  -
> > 
> > // wait_tail == HEAD1
> > rcu_sr_normal_gp_cleanup() {
> > // has passed SR_MAX_USERS_WAKE_FROM_GP
> > wait_tail->next = next;
> > // done_tail = HEAD1
> > smp_store_release(&rcu_state.srs_done_tail, wait_tail);
> > queue_work() {
> > test_and_set_bit(WORK_STRUCT_PENDING_BIT, work_data_bits(work)
> > __queue_work()
> > }
> > }
> > 
> >set_work_pool_and_clear_pending()
> >rcu_sr_normal_gp_cleanup_work() {
> > // new GP, wait_tail == HEAD2
> > rcu_sr_normal_gp_cleanup() {
> > // executes all completion, but stop at HEAD1
> > wait_tail->next = HEAD1;
> > // done_tail = HEAD2
> > smp_store_release(&rcu_state.srs_done_tail, wait_tail);
> > queue_work() {
> > test_and_set_bit(WORK_STRUCT_PENDING_BIT, work_data_bits(work)
> > __queue_work()
> > }
> > }
> >  // done = HEAD2
> >  done = 
> > smp_load_acquire(&rcu_state.srs_done_tail);
> >  // head = HEAD1
> >  head = done->next;
> >  done->next = NULL;
> >  llist_for_each_safe() {
> >  // completes all callbacks, release HEAD1
> >  }
> >}
> >// Process second queue
> >set_work_pool_and_clear_pending()
> >rcu_sr_normal_gp_cleanup_work() {
> >// done = HEAD2
> >done = 
> > smp_load_acquire(&rcu_state.srs_done_tail);
> > 
> > // new GP, wait_tail == HEAD3
> > rcu_sr_normal_gp_cleanup() {
> > // Finds HEAD2 with ->next == NULL at the end
> > rcu_sr_put_wait_head(HEAD2)
> > ...
> > 
> > // A few more GPs later
> > rcu_sr_normal_gp_init() {
> >  HEAD2 = rcu_sr_get_wait_head();
> >  llist_add(HEAD2, &rcu_state.srs_next);
> >// head == rcu_state.srs_next
> >head = done->next;
> >done->next = NULL;
> >llist_for_each_safe() {
> > // EXECUTE CALLBACKS TOO EARLY!!!
> > }
> >}
> > 
> > Reported-by: Frederic Weisbecker 
> > Fixes: 05a10b921000 ("rcu: Support direct wake-up of synchronize_rcu() 
> > users")
> > Signed-off-by: Uladzislau Rezki (Sony) 
> > ---
> >  kernel/rcu/tree.c | 22 --
> >  1 file changed, 8 insertions(+), 14 deletions(-)
> > 
> > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > index 31f3a61f9c38..475647620b12 100644
> > --- a/kernel/rcu/tree.c
> > +++ b/kernel/rcu/tree.c
> > @@ -1656,21 +1656,11 @@ static void rcu_sr_normal_gp_cleanup(void)
> > WARN_ON_ONCE(!rcu_sr_is_wait_head(wait_tail));
> >  
> > /*
> > -* Process (a) and (d) cases. See an illustration. Apart of
> > -* that it handles the scenario when all clients are done,
> > -* wait-head is released if last. The worker is not kicked.
> > +* Process (a) and (d) cases. See an illustration.
> >  */
> > llist_for_each_safe(rcu, next, wait_tail->next) {
> > -   if (rcu_sr_is_wait_head(rcu)) {
> > -   if (!rcu->next) {
> > -   rcu_sr_put_wait_head(rcu);
> > -   wait_tail->next = NULL;
> > -   } else {
> > -   wait_tail->next = rcu;
> > -   }
> > -
> > +   if (rcu_sr_is_wait_head(rcu))
> > break;
> > -   }
> >  
> > rcu_sr_normal_complete(rcu);
> > // It can be last, update a next on this step.
> > @@ -1684,8 +1674,12 @@ static void rcu_sr_normal_gp_cleanup(void)
> > smp_store_release(&rcu_state.srs_done_tail, wait_tail);
> > ASSERT_EXCLUSIVE_WRITER(rcu_state.srs_done_tail);
> >  
> > -   if (wait_tail->next)
> > -   queue_work(system_highpri_wq, &rcu_state.srs_cleanup_work);
> > +   /*
> > +* We schedule a work in order to perform a final processing
> > +* of outstanding users(if still left) and releasing wait-heads
> > +* added by rcu_sr_normal_gp_init() call.
> > +*/
> > +   queue_work(system_highpri_wq, &rcu_state.srs_cleanup_work);
> >  }
> 
> Ah, nice. So instead of allocating/freeing in GP thre

Re: [PATCH 1/2] rcu: Do not release a wait-head from a GP kthread

2024-03-07 Thread Uladzislau Rezki
On Thu, Mar 07, 2024 at 02:09:38AM -0500, Joel Fernandes wrote:
> 
> 
> On 3/7/2024 1:21 AM, Joel Fernandes wrote:
> > 
> > 
> > On 3/6/2024 5:31 PM, Joel Fernandes wrote:
> >>
> >>
> >> On 3/5/2024 2:57 PM, Uladzislau Rezki (Sony) wrote:
> >>> Fix a below race by not releasing a wait-head from the
> >>> GP-kthread as it can lead for reusing it whereas a worker
> >>> can still access it thus execute newly added callbacks too
> >>> early.
> >>>
> >>> CPU 0  CPU 1
> >>> -  -
> >>>
> >>> // wait_tail == HEAD1
> >>> rcu_sr_normal_gp_cleanup() {
> >>> // has passed SR_MAX_USERS_WAKE_FROM_GP
> >>> wait_tail->next = next;
> >>> // done_tail = HEAD1
> >>> smp_store_release(&rcu_state.srs_done_tail, wait_tail);
> >>> queue_work() {
> >>> test_and_set_bit(WORK_STRUCT_PENDING_BIT, work_data_bits(work)
> >>> __queue_work()
> >>> }
> >>> }
> >>>
> >>>set_work_pool_and_clear_pending()
> >>>rcu_sr_normal_gp_cleanup_work() {
> >>> // new GP, wait_tail == HEAD2
> >>> rcu_sr_normal_gp_cleanup() {
> >>> // executes all completion, but stop at HEAD1
> >>> wait_tail->next = HEAD1;
> >>> // done_tail = HEAD2
> >>> smp_store_release(&rcu_state.srs_done_tail, wait_tail);
> >>> queue_work() {
> >>> test_and_set_bit(WORK_STRUCT_PENDING_BIT, work_data_bits(work)
> >>> __queue_work()
> >>> }
> >>> }
> >>>  // done = HEAD2
> >>>  done = 
> >>> smp_load_acquire(&rcu_state.srs_done_tail);
> >>>  // head = HEAD1
> >>>  head = done->next;
> >>>  done->next = NULL;
> >>>  llist_for_each_safe() {
> >>>  // completes all callbacks, release HEAD1
> >>>  }
> >>>}
> >>>// Process second queue
> >>>set_work_pool_and_clear_pending()
> >>>rcu_sr_normal_gp_cleanup_work() {
> >>>// done = HEAD2
> >>>done = 
> >>> smp_load_acquire(&rcu_state.srs_done_tail);
> >>>
> >>> // new GP, wait_tail == HEAD3
> >>> rcu_sr_normal_gp_cleanup() {
> >>> // Finds HEAD2 with ->next == NULL at the end
> >>> rcu_sr_put_wait_head(HEAD2)
> >>> ...
> >>>
> >>> // A few more GPs later
> >>> rcu_sr_normal_gp_init() {
> >>>  HEAD2 = rcu_sr_get_wait_head();
> >>>  llist_add(HEAD2, &rcu_state.srs_next);
> >>>// head == rcu_state.srs_next
> >>>head = done->next;
> >>>done->next = NULL;
> >>>llist_for_each_safe() {
> >>> // EXECUTE CALLBACKS TOO EARLY!!!
> >>> }
> >>>}
> >>>
> >>> Reported-by: Frederic Weisbecker 
> >>> Fixes: 05a10b921000 ("rcu: Support direct wake-up of synchronize_rcu() 
> >>> users")
> >>> Signed-off-by: Uladzislau Rezki (Sony) 
> >>> ---
> >>>  kernel/rcu/tree.c | 22 --
> >>>  1 file changed, 8 insertions(+), 14 deletions(-)
> >>>
> >>> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> >>> index 31f3a61f9c38..475647620b12 100644
> >>> --- a/kernel/rcu/tree.c
> >>> +++ b/kernel/rcu/tree.c
> >>> @@ -1656,21 +1656,11 @@ static void rcu_sr_normal_gp_cleanup(void)
> >>>   WARN_ON_ONCE(!rcu_sr_is_wait_head(wait_tail));
> >>>  
> >>>   /*
> >>> -  * Process (a) and (d) cases. See an illustration. Apart of
> >>> -  * that it handles the scenario when all clients are done,
> >>> -  * wait-head is released if last. The worker is not kicked.
> >>> +  * Process (a) and (d) cases. See an illustration.
> >>>*/
> >>>   llist_for_each_safe(rcu, next, wait_tail->next) {
> >>> - if (rcu_sr_is_wait_head(rcu)) {
> >>> - if (!rcu->next) {
> >>> - rcu_sr_put_wait_head(rcu);
> >>> - wait_tail->next = NULL;
> >>> - } else {
> >>> - wait_tail->next = rcu;
> >>> - }
> >>> -
> >>> + if (rcu_sr_is_wait_head(rcu))
> >>>   break;
> >>> - }
> >>>  
> >>>   rcu_sr_normal_complete(rcu);
> >>>   // It can be last, update a next on this step.
> >>> @@ -1684,8 +1674,12 @@ static void rcu_sr_normal_gp_cleanup(void)
> >>>   smp_store_release(&rcu_state.srs_done_tail, wait_tail);
> >>>   ASSERT_EXCLUSIVE_WRITER(rcu_state.srs_done_tail);
> >>>  
> >>> - if (wait_tail->next)
> >>> - queue_work(system_highpri_wq, &rcu_state.srs_cleanup_work);
> >>> + /*
> >>> +  * We schedule a work in order to perform a fina

Re: [PATCH 1/2] rcu: Do not release a wait-head from a GP kthread

2024-03-07 Thread Uladzislau Rezki
On Thu, Mar 07, 2024 at 01:21:50AM -0500, Joel Fernandes wrote:
> 
> 
> On 3/6/2024 5:31 PM, Joel Fernandes wrote:
> > 
> > 
> > On 3/5/2024 2:57 PM, Uladzislau Rezki (Sony) wrote:
> >> Fix a below race by not releasing a wait-head from the
> >> GP-kthread as it can lead for reusing it whereas a worker
> >> can still access it thus execute newly added callbacks too
> >> early.
> >>
> >> CPU 0  CPU 1
> >> -  -
> >>
> >> // wait_tail == HEAD1
> >> rcu_sr_normal_gp_cleanup() {
> >> // has passed SR_MAX_USERS_WAKE_FROM_GP
> >> wait_tail->next = next;
> >> // done_tail = HEAD1
> >> smp_store_release(&rcu_state.srs_done_tail, wait_tail);
> >> queue_work() {
> >> test_and_set_bit(WORK_STRUCT_PENDING_BIT, work_data_bits(work)
> >> __queue_work()
> >> }
> >> }
> >>
> >>set_work_pool_and_clear_pending()
> >>rcu_sr_normal_gp_cleanup_work() {
> >> // new GP, wait_tail == HEAD2
> >> rcu_sr_normal_gp_cleanup() {
> >> // executes all completion, but stop at HEAD1
> >> wait_tail->next = HEAD1;
> >> // done_tail = HEAD2
> >> smp_store_release(&rcu_state.srs_done_tail, wait_tail);
> >> queue_work() {
> >> test_and_set_bit(WORK_STRUCT_PENDING_BIT, work_data_bits(work)
> >> __queue_work()
> >> }
> >> }
> >>  // done = HEAD2
> >>  done = 
> >> smp_load_acquire(&rcu_state.srs_done_tail);
> >>  // head = HEAD1
> >>  head = done->next;
> >>  done->next = NULL;
> >>  llist_for_each_safe() {
> >>  // completes all callbacks, release HEAD1
> >>  }
> >>}
> >>// Process second queue
> >>set_work_pool_and_clear_pending()
> >>rcu_sr_normal_gp_cleanup_work() {
> >>// done = HEAD2
> >>done = 
> >> smp_load_acquire(&rcu_state.srs_done_tail);
> >>
> >> // new GP, wait_tail == HEAD3
> >> rcu_sr_normal_gp_cleanup() {
> >> // Finds HEAD2 with ->next == NULL at the end
> >> rcu_sr_put_wait_head(HEAD2)
> >> ...
> >>
> >> // A few more GPs later
> >> rcu_sr_normal_gp_init() {
> >>  HEAD2 = rcu_sr_get_wait_head();
> >>  llist_add(HEAD2, &rcu_state.srs_next);
> >>// head == rcu_state.srs_next
> >>head = done->next;
> >>done->next = NULL;
> >>llist_for_each_safe() {
> >> // EXECUTE CALLBACKS TOO EARLY!!!
> >> }
> >>}
> >>
> >> Reported-by: Frederic Weisbecker 
> >> Fixes: 05a10b921000 ("rcu: Support direct wake-up of synchronize_rcu() 
> >> users")
> >> Signed-off-by: Uladzislau Rezki (Sony) 
> >> ---
> >>  kernel/rcu/tree.c | 22 --
> >>  1 file changed, 8 insertions(+), 14 deletions(-)
> >>
> >> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> >> index 31f3a61f9c38..475647620b12 100644
> >> --- a/kernel/rcu/tree.c
> >> +++ b/kernel/rcu/tree.c
> >> @@ -1656,21 +1656,11 @@ static void rcu_sr_normal_gp_cleanup(void)
> >>WARN_ON_ONCE(!rcu_sr_is_wait_head(wait_tail));
> >>  
> >>/*
> >> -   * Process (a) and (d) cases. See an illustration. Apart of
> >> -   * that it handles the scenario when all clients are done,
> >> -   * wait-head is released if last. The worker is not kicked.
> >> +   * Process (a) and (d) cases. See an illustration.
> >> */
> >>llist_for_each_safe(rcu, next, wait_tail->next) {
> >> -  if (rcu_sr_is_wait_head(rcu)) {
> >> -  if (!rcu->next) {
> >> -  rcu_sr_put_wait_head(rcu);
> >> -  wait_tail->next = NULL;
> >> -  } else {
> >> -  wait_tail->next = rcu;
> >> -  }
> >> -
> >> +  if (rcu_sr_is_wait_head(rcu))
> >>break;
> >> -  }
> >>  
> >>rcu_sr_normal_complete(rcu);
> >>// It can be last, update a next on this step.
> >> @@ -1684,8 +1674,12 @@ static void rcu_sr_normal_gp_cleanup(void)
> >>smp_store_release(&rcu_state.srs_done_tail, wait_tail);
> >>ASSERT_EXCLUSIVE_WRITER(rcu_state.srs_done_tail);
> >>  
> >> -  if (wait_tail->next)
> >> -  queue_work(system_highpri_wq, &rcu_state.srs_cleanup_work);
> >> +  /*
> >> +   * We schedule a work in order to perform a final processing
> >> +   * of outstanding users(if still left) and releasing wait-heads
> >> +   * added by rcu_sr_normal_gp_init() call.
> >> +   

Re: [PATCH 1/2] rcu: Do not release a wait-head from a GP kthread

2024-03-07 Thread Uladzislau Rezki
On Wed, Mar 06, 2024 at 05:44:04PM -0500, Joel Fernandes wrote:
> On Wed, Mar 6, 2024 at 5:31 PM Joel Fernandes  wrote:
> >
> >
> >
> > On 3/5/2024 2:57 PM, Uladzislau Rezki (Sony) wrote:
> > > Fix a below race by not releasing a wait-head from the
> > > GP-kthread as it can lead for reusing it whereas a worker
> > > can still access it thus execute newly added callbacks too
> > > early.
> > >
> [...]
> > There might be a way to prevent queuing new work as a fast-path optimization,
> > in case the CBs per GP will always be < SR_MAX_USERS_WAKE_FROM_GP, but I could
> > not find a workqueue API that helps there, and work_busy() has comments saying
> > not to use that.
> 
> One way to do this would be to maintain a count of how many CBs are in
> flight via the worker route, and then fast-path-free the thing if the
> count is 0. Should I send a patch around something like that? It saves
> 1 more wakeup per synchronize_rcu() I think.
> 
We can release the last wait-head if we know that the worker is not
pending/running. Then we guarantee that Frederic's case is not possible.
On the other hand, it will again introduce more mess, because the idea
was, in the beginning, to start with something really simple :)

--
Uladzislau Rezki



Re: [PATCH 1/2] rcu: Do not release a wait-head from a GP kthread

2024-03-07 Thread Uladzislau Rezki
On Thu, Mar 07, 2024 at 01:04:25AM +0100, Frederic Weisbecker wrote:
> Le Tue, Mar 05, 2024 at 08:57:19PM +0100, Uladzislau Rezki (Sony) a écrit :
> > Fix a below race by not releasing a wait-head from the
> > GP-kthread as it can lead for reusing it whereas a worker
> > can still access it thus execute newly added callbacks too
> > early.
> > 
> > CPU 0  CPU 1
> > -  -
> > 
> > // wait_tail == HEAD1
> > rcu_sr_normal_gp_cleanup() {
> > // has passed SR_MAX_USERS_WAKE_FROM_GP
> > wait_tail->next = next;
> > // done_tail = HEAD1
> > smp_store_release(&rcu_state.srs_done_tail, wait_tail);
> > queue_work() {
> > test_and_set_bit(WORK_STRUCT_PENDING_BIT, work_data_bits(work)
> > __queue_work()
> > }
> > }
> > 
> >set_work_pool_and_clear_pending()
> >rcu_sr_normal_gp_cleanup_work() {
> > // new GP, wait_tail == HEAD2
> > rcu_sr_normal_gp_cleanup() {
> > // executes all completion, but stop at HEAD1
> > wait_tail->next = HEAD1;
> > // done_tail = HEAD2
> > smp_store_release(&rcu_state.srs_done_tail, wait_tail);
> > queue_work() {
> > test_and_set_bit(WORK_STRUCT_PENDING_BIT, work_data_bits(work)
> > __queue_work()
> > }
> > }
> >  // done = HEAD2
> >  done = 
> > smp_load_acquire(&rcu_state.srs_done_tail);
> >  // head = HEAD1
> >  head = done->next;
> >  done->next = NULL;
> >  llist_for_each_safe() {
> >  // completes all callbacks, release HEAD1
> >  }
> >}
> >// Process second queue
> >set_work_pool_and_clear_pending()
> >rcu_sr_normal_gp_cleanup_work() {
> >// done = HEAD2
> >done = 
> > smp_load_acquire(&rcu_state.srs_done_tail);
> > 
> > // new GP, wait_tail == HEAD3
> > rcu_sr_normal_gp_cleanup() {
> > // Finds HEAD2 with ->next == NULL at the end
> > rcu_sr_put_wait_head(HEAD2)
> > ...
> > 
> > // A few more GPs later
> > rcu_sr_normal_gp_init() {
> >  HEAD2 = rcu_sr_get_wait_head();
> >  llist_add(HEAD2, &rcu_state.srs_next);
> >// head == rcu_state.srs_next
> >head = done->next;
> >done->next = NULL;
> >llist_for_each_safe() {
> > // EXECUTE CALLBACKS TOO EARLY!!!
> > }
> >}
> > 
> > Reported-by: Frederic Weisbecker 
> > Fixes: 05a10b921000 ("rcu: Support direct wake-up of synchronize_rcu() 
> > users")
> > Signed-off-by: Uladzislau Rezki (Sony) 
> 
> Reviewed-by: Frederic Weisbecker 
>
Thank you, I will update it with your Reviewed-by tag!

--
Uladzislau Rezki



Re: [PATCH 2/2] rcu: Allocate WQ with WQ_MEM_RECLAIM bit set

2024-03-07 Thread Uladzislau Rezki
On Wed, Mar 06, 2024 at 12:57:25PM -0500, Joel Fernandes wrote:
> 
> 
> On 3/6/2024 6:56 AM, Uladzislau Rezki wrote:
> > On Wed, Mar 06, 2024 at 10:15:44AM +0800, Z qiang wrote:
> >>>
> >>> synchronize_rcu() users have to be processed regardless
> >>> of memory pressure, so our private WQ needs to have at least
> >>> one execution context, which is what the WQ_MEM_RECLAIM flag guarantees.
> >>>
> >>> Signed-off-by: Uladzislau Rezki (Sony) 
> >>> ---
> >>>  kernel/rcu/tree.c | 6 +-
> >>>  1 file changed, 5 insertions(+), 1 deletion(-)
> >>>
> >>> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> >>> index 475647620b12..59881a68dd26 100644
> >>> --- a/kernel/rcu/tree.c
> >>> +++ b/kernel/rcu/tree.c
> >>> @@ -1581,6 +1581,7 @@ static void rcu_sr_put_wait_head(struct llist_node 
> >>> *node)
> >>>  /* Disabled by default. */
> >>>  static int rcu_normal_wake_from_gp;
> >>>  module_param(rcu_normal_wake_from_gp, int, 0644);
> >>> +static struct workqueue_struct *sync_wq;
> >>>
> >>>  static void rcu_sr_normal_complete(struct llist_node *node)
> >>>  {
> >>> @@ -1679,7 +1680,7 @@ static void rcu_sr_normal_gp_cleanup(void)
> >>>  * of outstanding users(if still left) and releasing wait-heads
> >>>  * added by rcu_sr_normal_gp_init() call.
> >>>  */
> >>> -   queue_work(system_highpri_wq, &rcu_state.srs_cleanup_work);
> >>> +   queue_work(sync_wq, &rcu_state.srs_cleanup_work);
> >>>  }
> >>>
> >>>  /*
> >>> @@ -5584,6 +5585,9 @@ void __init rcu_init(void)
> >>> rcu_gp_wq = alloc_workqueue("rcu_gp", WQ_MEM_RECLAIM, 0);
> >>> WARN_ON(!rcu_gp_wq);
> >>>
> >>> +   sync_wq = alloc_workqueue("sync_wq", WQ_MEM_RECLAIM, 0);
> >>
> >> Why was WQ_HIGHPRI removed?
> >>
> > I would like to check perf. figures with it and send it out as a
> > separate patch if it is worth it.
> 
> I guess one thing to note is that there are also other RCU-related WQs which
> have WQ_MEM_RECLAIM but not WQ_HIGHPRI (such as for expedited RCU, at least in
> some configs). So for consistency, this makes sense to me.
> 
> Reviewed-by: Joel Fernandes (Google)  
Thanks. I will update it with your Reviewed-by tag!

--
Uladzislau Rezki