Re: rcu_preempt caused oom

2018-12-07 Thread Paul E. McKenney
_gp_kthread_wake(rsp);
> }
> }

This is a completely different code path.  The rcu_start_this_gp()
function is trying to start a new grace period.  In contrast, this
rcu_report_qs_rdp() function reports a quiescent state for a currently
running grace period.  In your earlier trace, there was no currently
running grace period, so rcu_report_qs_rdp() exiting early is expected
behavior.
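
To make the distinction concrete, here is a toy model (ordinary userspace
C, not the kernel source; all names are illustrative) of the decision
rcu_report_qs_rdp() is making:

#include <stdbool.h>
#include <stdio.h>

/* Toy model: rcu_report_qs_rdp() has work to do only when a grace
 * period is in progress and this CPU still owes it a quiescent state.
 * With RCU idle, the early exit is the expected branch. */
struct toy_rdp {
	unsigned long gp_seq;	/* this CPU's snapshot of the GP number */
	bool qs_pending;	/* quiescent state still owed? */
};

static void report_qs_rdp(struct toy_rdp *rdp, unsigned long gp_seq,
			  bool gp_in_progress)
{
	if (!gp_in_progress || rdp->gp_seq != gp_seq || !rdp->qs_pending) {
		printf("early exit: nothing to report\n");
		return;
	}
	rdp->qs_pending = false;
	printf("QS reported for GP %lu\n", gp_seq);
}

int main(void)
{
	struct toy_rdp rdp = { .gp_seq = 100, .qs_pending = true };

	report_qs_rdp(&rdp, 100, false);	/* idle RCU: early exit */
	report_qs_rdp(&rdp, 100, true);		/* GP running: QS reported */
	return 0;
}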

Thanx, Paul

> -Original Message-
> From: Paul E. McKenney  
> Sent: Friday, December 7, 2018 1:38 AM
> To: He, Bo 
> Cc: Steven Rostedt ; linux-kernel@vger.kernel.org; 
> j...@joshtriplett.org; mathieu.desnoy...@efficios.com; 
> jiangshan...@gmail.com; Zhang, Jun ; Xiao, Jin 
> ; Zhang, Yanmin ; Bai, Jie A 
> 
> Subject: Re: rcu_preempt caused oom
> 
> On Thu, Dec 06, 2018 at 01:23:01PM +, He, Bo wrote:
> > 1. The test is positive after setting the kthread priority to SCHED_FIFO 
> > without CONFIG_RCU_BOOST; the issue has not reproduced so far.
> > 2. Here is the previous log with ftrace_dump enabled, which gives us 4 seconds 
> > of ftrace. The panic log was triggered with the enclosed debug patch, which 
> > replaced wait_for_completion(&rs_array[i].completion) with 
> > wait_for_completion_timeout(&rs_array[i].completion, 3*HZ) in 
> > __wait_rcu_gp(). The logs enabled lockdep to dump the locks, and dumped 
> > all tasks' backtraces.
> 
> Thank you for collecting this information!
> 
> (By the way, the usual downside of the priority increase is increased 
> context-switch rate and thus CPU overhead.)
> 
> And all three grace-period kthreads are blocked apparently in their top-level 
> loops (though inlining and all that).  There are quite a few preemptions 
> ("72738.702815: rcu_preempt_task: rcu_preempt"), but they are all blocking 
> the next grace period (29041008), not the current one (29041004).  And the 
> "rcu_unlock_preempted_task" trace records flag the current grace-period 
> sequence number as 29041004, which means that there is no grace period in 
> progress, that is, RCU is idle.
> 
> Which explains why there is no RCU CPU stall warning -- after all, if there 
> is no grace period in flight, it is not possible to stall that non-existent 
> grace period.
> 
> That also could explain why increasing the priority of the grace-period 
> kthreads gets things going again.  There have been a great number of requests 
> for a new grace period (for example, "rcu_future_grace_period:
> rcu_preempt 29041004 29041008 0 0 3 Startleaf"), so as soon as the 
> grace-period kthread wakes up, a new grace period will start.
> 
> Except that the rcu_preempt task says "I" rather than "R", as you noted in an 
> earlier email.
> 
> And there should have been multiple attempts to wake up the grace-period 
> kthread, because there are lots of callbacks queued as in 136,045 of them 
> ("rcu_callback: rcu_preempt rhp=66f735c9 func=file_free_rcu 
> 2811/136045").  Which is of course why you are seeing the OOM.
> 
> So the question becomes "Why is the grace-period kthread being awakened so 
> many times, but not actually waking up?"  In the past, there was a scheduler 
> bug that could cause that, but that was -way- before the v4.19 that you are 
> running.  More recently, there have been timer-related problems, but those 
> only happened while a grace period was active, and were also long before 
> v4.19.
> 
> Hmmm...  One possibility is that you have somehow managed to invoke
> call_rcu() with interrupts disabled, which would in turn disable the extra 
> wakeups that RCU sends when it sees excessive numbers of callbacks.
> Except that in that case, boosting the priority wouldn't help.  Besides, the 
> scheduling-clock interrupt should also check for this, and should push things 
> forward if need be.
> 
> If RCU managed to put all of its callbacks into the RCU_NEXT_READY_TAIL 
> bucket on all CPUs, that would defeat the wakeup-if-no-grace-period checks 
> (RCU is supposed to have started the relevant grace period before putting 
> callbacks into that bucket).  But that cannot be the case here, because new 
> callbacks are being enqueued throughout, and these would then trigger RCU's 
> start-a-new-grace-period checks.
> 
> But it would be good to confirm that this is actually working like I would 
> expect it to.  Could you please add scheduler wakeup to your tracing, if 
> possible, only displaying those sent to the rcu_preempt task?
> 
>   Thanx, Paul
> 
> > -Original Message-
> > From: Paul E. McKenney 
> > Sent: Thursday, December 6, 2018 1:45 AM
> > To: He, Bo 
> 

Re: [PATCH] Linux: Implement membarrier function

2018-12-06 Thread Paul E. McKenney
Hello, David,

I took a crack at extending LKMM to accommodate what I think would
support what you have in your paper.  Please see the very end of this
email for a patch against the "dev" branch of my -rcu tree.

This gives the expected result for the following three litmus tests,
but is probably deficient or otherwise misguided in other ways.  I have
added the LKMM maintainers on CC for their amusement.  ;-)

Thoughts?

Thanx, Paul



C C-Goldblat-memb-1
{
}

P0(int *x0, int *x1)
{
WRITE_ONCE(*x0, 1);
r1 = READ_ONCE(*x1);
}


P1(int *x0, int *x1)
{
WRITE_ONCE(*x1, 1);
smp_memb();
r2 = READ_ONCE(*x0);
}

exists (0:r1=0 /\ 1:r2=0)



C C-Goldblat-memb-2
{
}

P0(int *x0, int *x1)
{
WRITE_ONCE(*x0, 1);
r1 = READ_ONCE(*x1);
}


P1(int *x1, int *x2)
{
WRITE_ONCE(*x1, 1);
smp_memb();
r1 = READ_ONCE(*x2);
}

P2(int *x2, int *x0)
{
WRITE_ONCE(*x2, 1);
r1 = READ_ONCE(*x0);
}

exists (0:r1=0 /\ 1:r1=0 /\ 2:r1=0)



C C-Goldblat-memb-3
{
}

P0(int *x0, int *x1)
{
WRITE_ONCE(*x0, 1);
r1 = READ_ONCE(*x1);
}


P1(int *x1, int *x2)
{
WRITE_ONCE(*x1, 1);
smp_memb();
r1 = READ_ONCE(*x2);
}

P2(int *x2, int *x3)
{
WRITE_ONCE(*x2, 1);
r1 = READ_ONCE(*x3);
}

P3(int *x3, int *x0)
{
WRITE_ONCE(*x3, 1);
smp_memb();
r1 = READ_ONCE(*x0);
}

exists (0:r1=0 /\ 1:r1=0 /\ 2:r1=0 /\ 3:r1=0)
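

For concreteness, C-Goldblat-memb-1 corresponds roughly to the following
userspace sketch, in which P1's smp_memb() is played by the membarrier()
system call (MEMBARRIER_CMD_GLOBAL, which needs no prior registration)
and P0 relies only on a compiler barrier.  This is an illustration of the
intended usage, not part of the patch, and a single run of course cannot
demonstrate that the "exists" outcome is forbidden:

#define _GNU_SOURCE
#include <linux/membarrier.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

static atomic_int x0, x1;
static int r1, r2;

static void *p0(void *arg)
{
	atomic_store_explicit(&x0, 1, memory_order_relaxed);
	__asm__ __volatile__("" ::: "memory");	/* compiler barrier only */
	r1 = atomic_load_explicit(&x1, memory_order_relaxed);
	return NULL;
}

static void *p1(void *arg)
{
	atomic_store_explicit(&x1, 1, memory_order_relaxed);
	syscall(__NR_membarrier, MEMBARRIER_CMD_GLOBAL, 0);  /* "smp_memb()" */
	r2 = atomic_load_explicit(&x0, memory_order_relaxed);
	return NULL;
}

int main(void)
{
	pthread_t t0, t1;

	pthread_create(&t0, NULL, p0, NULL);
	pthread_create(&t1, NULL, p1, NULL);
	pthread_join(t0, NULL);
	pthread_join(t1, NULL);
	printf("r1=%d r2=%d\n", r1, r2);  /* forbidden: r1 == 0 && r2 == 0 */
	return 0;
}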



On Thu, Nov 29, 2018 at 11:02:17AM -0800, David Goldblatt wrote:
> One note with the suggested patch is that
> `atomic_thread_fence(memory_order_acq_rel)` should probably be
> `atomic_thread_fence (memory_order_seq_cst)` (otherwise the call would
> be a no-op on, say, x86, which it very much isn't).
> 
> The non-transitivity thing makes the resulting description arguably
> incorrect, but this is informal enough that it might not be a big deal
> to add something after "For these threads, the membarrier function
> call turns an existing compiler barrier (see above) executed by these
> threads into full memory barriers" that clarifies it. E.g. you could
> make it into "turns an existing compiler barrier [...] into full
> memory barriers, with respect to the calling thread".
> 
> Since this is targeting the description of the OS call (and doesn't
> have to concern itself with also being implementable by other
> asymmetric techniques or degrading to architectural barriers), I think
> that the description in "approach 2" in P1202 would also make sense
> for a formal description of the syscall. (Of course, without the
> kernel itself committing to a rigorous semantics, anything specified
> on top of it will be on slightly shaky ground).
> 
> - David
> 
> On Thu, Nov 29, 2018 at 7:04 AM Paul E. McKenney  
> wrote:
> >
> > On Thu, Nov 29, 2018 at 09:44:22AM -0500, Mathieu Desnoyers wrote:
> > > - On Nov 29, 2018, at 8:50 AM, Florian Weimer fwei...@redhat.com 
> > > wrote:
> > >
> > > > * Torvald Riegel:
> > > >
> > > >> On Wed, 2018-11-28 at 16:05 +0100, Florian Weimer wrote:
> > > >>> This is essentially a repost of last year's patch, rebased to the 
> > > >>> glibc
> > > >>> 2.29 symbol version and reflecting the introduction of
> > > >>> MEMBARRIER_CMD_GLOBAL.
> > > >>>
> > > >>> I'm not including any changes to manual/ here because the set of
> > > >>> supported operations is evolving rapidly, we could not get consensus 
> > > >>> for
> > > >>> the language I proposed the last time, and I do not want to contribute
> > > >>> to the manual for the time being.
> > > >>
> > > >> Fair enough.  Nonetheless, can you summarize how far you're along with
> > > >> properly defining the semantics (eg, based on the C/C++ memory model)?
> > > >
> > > > I wrote down what you could, but no one liked it.
> > > >
> > > > <https://sourceware.org/ml/libc-alpha/2017-12/msg00796.html>
> > > >
> > > > I expect that a formalization would interact in non-trivial ways with
> > > > any potential formalization of usable relaxed memory order semantics,
> > > > and I'm not sure if anyone knows how to do the latter today.
> > >
> > > Addin

Re: rcu_preempt caused oom

2018-12-06 Thread Paul E. McKenney
On Thu, Dec 06, 2018 at 01:23:01PM +, He, Bo wrote:
> 1. The test is positive after setting the kthread priority to SCHED_FIFO 
> without CONFIG_RCU_BOOST; the issue has not reproduced so far.
> 2. Here is the previous log with ftrace_dump enabled, which gives us 4 seconds 
> of ftrace. The panic log was triggered with the enclosed debug patch, which 
> replaced wait_for_completion(&rs_array[i].completion) with 
> wait_for_completion_timeout(&rs_array[i].completion, 3*HZ) in 
> __wait_rcu_gp(). The logs enabled lockdep to dump the locks, and dumped all 
> tasks' backtraces.

Thank you for collecting this information!

(By the way, the usual downside of the priority increase is increased
context-switch rate and thus CPU overhead.)

And all three grace-period kthreads are blocked apparently in their
top-level loops (though inlining and all that).  There are quite a few
preemptions ("72738.702815: rcu_preempt_task: rcu_preempt"), but they
are all blocking the next grace period (29041008), not the current one
(29041004).  And the "rcu_unlock_preempted_task" trace records flag the
current grace-period sequence number as 29041004, which means that there
is no grace period in progress, that is, RCU is idle.
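
The arithmetic behind that last deduction can be checked directly: the low
two bits of gp_seq encode the grace-period phase (zero meaning idle) and
the upper bits count grace periods, as in the v4.19-era helpers in
kernel/rcu/rcu.h.  A standalone sketch applied to the numbers above:

#include <stdio.h>

/* Mirrors the v4.19-era gp_seq encoding (kernel/rcu/rcu.h). */
#define RCU_SEQ_CTR_SHIFT	2
#define RCU_SEQ_STATE_MASK	((1UL << RCU_SEQ_CTR_SHIFT) - 1)

static unsigned long rcu_seq_ctr(unsigned long s)
{
	return s >> RCU_SEQ_CTR_SHIFT;	/* number of grace periods */
}

static unsigned long rcu_seq_state(unsigned long s)
{
	return s & RCU_SEQ_STATE_MASK;	/* 0 means no GP in progress */
}

int main(void)
{
	unsigned long cur = 29041004, req = 29041008;

	printf("current:   ctr=%lu state=%lu\n",
	       rcu_seq_ctr(cur), rcu_seq_state(cur));	/* state 0: idle */
	printf("requested: ctr=%lu state=%lu\n",
	       rcu_seq_ctr(req), rcu_seq_state(req));	/* cur + 4: next GP */
	return 0;
}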

Which explains why there is no RCU CPU stall warning -- after all, if
there is no grace period in flight, it is not possible to stall that
non-existent grace period.

That also could explain why increasing the priority of the grace-period
kthreads gets things going again.  There have been a great number of
requests for a new grace period (for example, "rcu_future_grace_period:
rcu_preempt 29041004 29041008 0 0 3 Startleaf"), so as soon as the
grace-period kthread wakes up, a new grace period will start.

Except that the rcu_preempt task says "I" rather than "R", as you noted
in an earlier email.

And there should have been multiple attempts to wake up the grace-period
kthread, because there are lots of callbacks queued as in 136,045 of
them ("rcu_callback: rcu_preempt rhp=66f735c9 func=file_free_rcu
2811/136045").  Which is of course why you are seeing the OOM.

So the question becomes "Why is the grace-period kthread being awakened
so many times, but not actually waking up?"  In the past, there was a
scheduler bug that could cause that, but that was -way- before the v4.19
that you are running.  More recently, there have been timer-related
problems, but those only happened while a grace period was active,
and were also long before v4.19.

Hmmm...  One possibility is that you have somehow managed to invoke
call_rcu() with interrupts disabled, which would in turn disable the
extra wakeups that RCU sends when it sees excessive numbers of callbacks.
Except that in that case, boosting the priority wouldn't help.  Besides,
the scheduling-clock interrupt should also check for this, and should
push things forward if need be.

If RCU managed to put all of its callbacks into the RCU_NEXT_READY_TAIL
bucket on all CPUs, that would defeat the wakeup-if-no-grace-period
checks (RCU is supposed to have started the relevant grace period before
putting callbacks into that bucket).  But that cannot be the case here,
because new callbacks are being enqueued throughout, and these would
then trigger RCU's start-a-new-grace-period checks.
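
(For orientation, the segments in question, paraphrasing the RCU_*_TAIL
definitions in include/linux/rcu_segcblist.h from this era:)

/* Segments of the v4.19 rcu_segcblist, oldest to newest.  Callbacks in
 * RCU_NEXT_READY_TAIL are supposed to have had their grace period
 * requested already, which is why parking everything there would
 * defeat the wakeup-if-no-grace-period checks. */
enum {
	RCU_DONE_TAIL,		/* grace period elapsed; ready to invoke */
	RCU_WAIT_TAIL,		/* waiting for the current grace period */
	RCU_NEXT_READY_TAIL,	/* waiting for the next grace period */
	RCU_NEXT_TAIL,		/* not yet associated with a grace period */
};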

But it would be good to confirm that this is actually working like I would
expect it to.  Could you please add scheduler wakeup to your tracing,
if possible, only displaying those sent to the rcu_preempt task?

    Thanx, Paul

> -Original Message-
> From: Paul E. McKenney  
> Sent: Thursday, December 6, 2018 1:45 AM
> To: He, Bo 
> Cc: Steven Rostedt ; linux-kernel@vger.kernel.org; 
> j...@joshtriplett.org; mathieu.desnoy...@efficios.com; 
> jiangshan...@gmail.com; Zhang, Jun ; Xiao, Jin 
> ; Zhang, Yanmin ; Bai, Jie A 
> 
> Subject: Re: rcu_preempt caused oom
> 
> On Wed, Dec 05, 2018 at 08:42:54AM +, He, Bo wrote:
> > I double checked the .config; we don't enable CONFIG_NO_HZ_FULL.
> > Our previous logs can dump all the task backtraces, and the kthreads (the 
> > rcu_preempt, rcu_sched, and rcu_bh tasks) are all in the "I" state, not the "R" 
> > state. My understanding is that if this is the side effect of causing RCU's 
> > kthreads to run at SCHED_FIFO priority 1, the kthreads should be in the R 
> > state.
> 
> Hmmm...  Well, the tasks could in theory be waiting on a blocking mutex.
> But in practice the grace-period kthreads wait on events, so that makes no 
> sense.
> 
> Is it possible for you to dump out the grace-period kthread's stack, for 
> example, with sysreq-t?  (Steve might know a better way to do this.)
> 
> > I will do more experiments and keep you update once we have more findin

Re: [tip:core/rcu] rcutorture: Make initrd/init execute in userspace

2018-12-05 Thread Paul E. McKenney
On Wed, Dec 05, 2018 at 04:58:27PM -0800, Josh Triplett wrote:
> On Thu, Dec 06, 2018 at 01:51:47AM +0100, Andrea Parri wrote:
> > > commit 4f8f751961b536f77c8f82394963e8e2d26efd84
> > > Author: Paul E. McKenney 
> > > Date:   Tue Dec 4 14:59:12 2018 -0800
> > > 
> > > torture: Explain and simplify odd "for" loop in mkinitrd.sh
> > > 
> > > Why a Bourne-shell "for" loop?  And why 192 instances of "a"?  This 
> > > commit
> > > adds a shell comment to present the answer to these mysteries.  It 
> > > also
> > > uses a series of factor-of-four Bourne-shell assignments to make it
> > > easy to see how many instances there are, replacing the earlier wall 
> > > of
> > > 'a' characters.
> > > 
> > > Reported-by: Josh Triplett 
> > > Signed-off-by: Paul E. McKenney 
> > > 
> > > diff --git a/tools/testing/selftests/rcutorture/bin/mkinitrd.sh 
> > > b/tools/testing/selftests/rcutorture/bin/mkinitrd.sh
> > > index da298394daa2..ff69190604ea 100755
> > > --- a/tools/testing/selftests/rcutorture/bin/mkinitrd.sh
> > > +++ b/tools/testing/selftests/rcutorture/bin/mkinitrd.sh
> > > @@ -40,17 +40,24 @@ mkdir $T
> > >  cat > $T/init << '__EOF___'
> > >  #!/bin/sh
> > >  # Run in userspace a few milliseconds every second.  This helps to
> > > -# exercise the NO_HZ_FULL portions of RCU.
> > > +# exercise the NO_HZ_FULL portions of RCU.  The 192 instances of "a" was
> > > +# empirically shown to give a nice multi-millisecond burst of user-mode
> > > +# execution on a 2GHz CPU, as desired.  Modern CPUs will vary from a
> > > +# couple of milliseconds up to perhaps 100 milliseconds, which is an
> > > +# acceptable range.
> > > +#
> > > +# Why not calibrate an exact delay?  Because within this initrd, we
> > > +# are restricted to Bourne-shell builtins, which as far as I know do not
> > > +# provide any means of obtaining a fine-grained timestamp.
> > > +
> > > +a4="a a a a"
> > > +a16="$a4 $a4 $a4 $a4"
> > > +a64="$a8 $a8 $a8 $a8"
> > 
> > Mmh, are you sure you don't want s/a8/a16/ here? ;-)
> 
> ... *facepalm*

Yeah, me as well...

> Good catch.

Thank you both!!!

Thanx, Paul



Re: [tip:core/rcu] rcutorture: Make initrd/init execute in userspace

2018-12-05 Thread Paul E. McKenney
On Thu, Dec 06, 2018 at 01:51:47AM +0100, Andrea Parri wrote:
> > commit 4f8f751961b536f77c8f82394963e8e2d26efd84
> > Author: Paul E. McKenney 
> > Date:   Tue Dec 4 14:59:12 2018 -0800
> > 
> > torture: Explain and simplify odd "for" loop in mkinitrd.sh
> > 
> > Why a Bourne-shell "for" loop?  And why 192 instances of "a"?  This 
> > commit
> > adds a shell comment to present the answer to these mysteries.  It also
> > uses a series of factor-of-four Bourne-shell assignments to make it
> > easy to see how many instances there are, replacing the earlier wall of
> >     'a' characters.
> > 
> > Reported-by: Josh Triplett 
> > Signed-off-by: Paul E. McKenney 
> > 
> > diff --git a/tools/testing/selftests/rcutorture/bin/mkinitrd.sh 
> > b/tools/testing/selftests/rcutorture/bin/mkinitrd.sh
> > index da298394daa2..ff69190604ea 100755
> > --- a/tools/testing/selftests/rcutorture/bin/mkinitrd.sh
> > +++ b/tools/testing/selftests/rcutorture/bin/mkinitrd.sh
> > @@ -40,17 +40,24 @@ mkdir $T
> >  cat > $T/init << '__EOF___'
> >  #!/bin/sh
> >  # Run in userspace a few milliseconds every second.  This helps to
> > -# exercise the NO_HZ_FULL portions of RCU.
> > +# exercise the NO_HZ_FULL portions of RCU.  The 192 instances of "a" was
> > +# empirically shown to give a nice multi-millisecond burst of user-mode
> > +# execution on a 2GHz CPU, as desired.  Modern CPUs will vary from a
> > +# couple of milliseconds up to perhaps 100 milliseconds, which is an
> > +# acceptable range.
> > +#
> > +# Why not calibrate an exact delay?  Because within this initrd, we
> > +# are restricted to Bourne-shell builtins, which as far as I know do not
> > +# provide any means of obtaining a fine-grained timestamp.
> > +
> > +a4="a a a a"
> > +a16="$a4 $a4 $a4 $a4"
> > +a64="$a8 $a8 $a8 $a8"
> 
> Mmh, are you sure you don't want s/a8/a16/ here? ;-)

Indeed I do!  How about the following?

Thanx, Paul



commit 94cae122408cdc55470360868a1a4b8f160e576d
Author: Paul E. McKenney 
Date:   Tue Dec 4 14:59:12 2018 -0800

torture: Explain and simplify odd "for" loop in mkinitrd.sh

Why a Bourne-shell "for" loop?  And why 192 instances of "a"?  This commit
adds a shell comment to present the answer to these mysteries.  It also
uses a series of factor-of-four Bourne-shell assignments to make it
easy to see how many instances there are, replacing the earlier wall of
'a' characters.

Reported-by: Josh Triplett 
Signed-off-by: Paul E. McKenney 
Reviewed-by: Josh Triplett 
[ paulmck: Fix wrong-variable bugs noted by Andrea Parri. ]

diff --git a/tools/testing/selftests/rcutorture/bin/mkinitrd.sh 
b/tools/testing/selftests/rcutorture/bin/mkinitrd.sh
index da298394daa2..e79eb35c41e2 100755
--- a/tools/testing/selftests/rcutorture/bin/mkinitrd.sh
+++ b/tools/testing/selftests/rcutorture/bin/mkinitrd.sh
@@ -40,17 +40,24 @@ mkdir $T
 cat > $T/init << '__EOF___'
 #!/bin/sh
 # Run in userspace a few milliseconds every second.  This helps to
-# exercise the NO_HZ_FULL portions of RCU.
+# exercise the NO_HZ_FULL portions of RCU.  The 192 instances of "a" was
+# empirically shown to give a nice multi-millisecond burst of user-mode
+# execution on a 2GHz CPU, as desired.  Modern CPUs will vary from a
+# couple of milliseconds up to perhaps 100 milliseconds, which is an
+# acceptable range.
+#
+# Why not calibrate an exact delay?  Because within this initrd, we
+# are restricted to Bourne-shell builtins, which as far as I know do not
+# provide any means of obtaining a fine-grained timestamp.
+
+a4="a a a a"
+a16="$a4 $a4 $a4 $a4"
+a64="$a16 $a16 $a16 $a16"
+a192="$a64 $a64 $a64"
 while :
 do
q=
-   for i in \
-   a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a \
-   a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a \
-   a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a \
-   a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a \
-   a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a \
-   a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a
+   for i in $a192
do
q="$q $i"
done



Re: [tip:core/rcu] rcutorture: Make initrd/init execute in userspace

2018-12-05 Thread Paul E. McKenney
On Wed, Dec 05, 2018 at 04:27:38PM -0800, Josh Triplett wrote:
> On Wed, Dec 05, 2018 at 04:08:09PM -0800, Paul E. McKenney wrote:
> > On Wed, Dec 05, 2018 at 02:25:24PM -0800, Josh Triplett wrote:
> > > On Tue, Dec 04, 2018 at 03:04:23PM -0800, Paul E. McKenney wrote:
> > > > On Tue, Dec 04, 2018 at 02:24:13PM -0800, Josh Triplett wrote:
> > > > > On Tue, Dec 04, 2018 at 02:09:42PM -0800, tip-bot for Paul E. 
> > > > > McKenney wrote:
> > > > > > --- a/tools/testing/selftests/rcutorture/bin/mkinitrd.sh
> > > > > > +++ b/tools/testing/selftests/rcutorture/bin/mkinitrd.sh
> > > > > > @@ -39,9 +39,22 @@ mkdir $T
> > > > > >  
> > > > > >  cat > $T/init << '__EOF___'
> > > > > >  #!/bin/sh
> > > > > > +# Run in userspace a few milliseconds every second.  This helps to
> > > > > > +# exercise the NO_HZ_FULL portions of RCU.
> > > > > >  while :
> > > > > >  do
> > > > > > -   sleep 100
> > > > > > +   q=
> > > > > > +   for i in \
> > > > > > +   a a a a a a a a a a a a a a a a a a a a a a a a a a a a 
> > > > > > a a a \
> > > > > > +   a a a a a a a a a a a a a a a a a a a a a a a a a a a a 
> > > > > > a a a \
> > > > > > +   a a a a a a a a a a a a a a a a a a a a a a a a a a a a 
> > > > > > a a a \
> > > > > > +   a a a a a a a a a a a a a a a a a a a a a a a a a a a a 
> > > > > > a a a \
> > > > > > +   a a a a a a a a a a a a a a a a a a a a a a a a a a a a 
> > > > > > a a a \
> > > > > > +   a a a a a a a a a a a a a a a a a a a a a a a a a a a a 
> > > > > > a a a
> > > > > 
> > > > > Ow. If there's no better way to do this, please do at least comment 
> > > > > how many 'a's
> > > > > this is. (And why 186, exactly?)
> > > > 
> > > > Yeah, that is admittedly a bit strange.  The reason for 186 occurrences 
> > > > of
> > > > "a" to one-time calibration, measuring a few millisecond's worth of 
> > > > delay.
> > > > 
> > > > > Please also consider calibrating the delay loop as you do in the C 
> > > > > code.
> > > > 
> > > > Good point.  And a quick web search finds me "date '+%s%N'", which gives
> > > > me nanoseconds since the epoch.  I probably don't want to do a 2038 to
> > > > myself (after all, I might still be alive then), so I should probably 
> > > > try
> > > > to make something work with "date '+%N'".  Or use something like this:
> > > > 
> > > > $ date '+%4N'; date '+%4N';date '+%4N'; date '+%4N'
> > > > 6660
> > > > 6685
> > > > 6697
> > > > 6710
> > > > 
> > > > Ah, but that means I need to add the "date" command to my initrd, 
> > > > doesn't
> > > > it?  And calculation requires either bash or the "test" command.  And it
> > > > would be quite good to restrict this to what can be done with Bourne 
> > > > shell
> > > > built-in commands, since a big point of this is to maintain a 
> > > > small-sized
> > > > initrd.  :-/
> > > 
> > > Sure, and I'm not suggesting adding commands to the initrd, hence my
> > > mention of "If there's no better way".
> > > 
> > > > So how about the following patch, which attempts to explain the 
> > > > situation?
> > > 
> > > That would help, but please also consider consolidating with something
> > > like a10="a a a a a a a a a a" to make it more readable (and perhaps
> > > rounding up to 200 for simplicity).
> > 
> > How about powers of four and one factor of three for 192, as shown below?
> 
> Perfect, thanks. That's much better.
> 
> Reviewed-by: Josh Triplett 

Applied, thank you!

Thanx, Paul

> > 
> > 
> > commit 4f8f751961b536f77c8f82394963e8e2d26efd84
> > Author: Paul E. McKenney 
> > Date:   Tue Dec 4 14:59:12 2018 -0800
> > 
> > torture: Explain and simplify odd "for" loop in mkinitrd.sh

Re: [tip:core/rcu] rcutorture: Make initrd/init execute in userspace

2018-12-05 Thread Paul E. McKenney
On Wed, Dec 05, 2018 at 02:25:24PM -0800, Josh Triplett wrote:
> On Tue, Dec 04, 2018 at 03:04:23PM -0800, Paul E. McKenney wrote:
> > On Tue, Dec 04, 2018 at 02:24:13PM -0800, Josh Triplett wrote:
> > > On Tue, Dec 04, 2018 at 02:09:42PM -0800, tip-bot for Paul E. McKenney 
> > > wrote:
> > > > --- a/tools/testing/selftests/rcutorture/bin/mkinitrd.sh
> > > > +++ b/tools/testing/selftests/rcutorture/bin/mkinitrd.sh
> > > > @@ -39,9 +39,22 @@ mkdir $T
> > > >  
> > > >  cat > $T/init << '__EOF___'
> > > >  #!/bin/sh
> > > > +# Run in userspace a few milliseconds every second.  This helps to
> > > > +# exercise the NO_HZ_FULL portions of RCU.
> > > >  while :
> > > >  do
> > > > -   sleep 100
> > > > +   q=
> > > > +   for i in \
> > > > +   a a a a a a a a a a a a a a a a a a a a a a a a a a a a 
> > > > a a a \
> > > > +   a a a a a a a a a a a a a a a a a a a a a a a a a a a a 
> > > > a a a \
> > > > +   a a a a a a a a a a a a a a a a a a a a a a a a a a a a 
> > > > a a a \
> > > > +   a a a a a a a a a a a a a a a a a a a a a a a a a a a a 
> > > > a a a \
> > > > +   a a a a a a a a a a a a a a a a a a a a a a a a a a a a 
> > > > a a a \
> > > > +   a a a a a a a a a a a a a a a a a a a a a a a a a a a a 
> > > > a a a
> > > 
> > > Ow. If there's no better way to do this, please do at least comment how 
> > > many 'a's
> > > this is. (And why 186, exactly?)
> > 
> > Yeah, that is admittedly a bit strange.  The reason for 186 occurrences of
> > "a" to one-time calibration, measuring a few millisecond's worth of delay.
> > 
> > > Please also consider calibrating the delay loop as you do in the C code.
> > 
> > Good point.  And a quick web search finds me "date '+%s%N'", which gives
> > me nanoseconds since the epoch.  I probably don't want to do a 2038 to
> > myself (after all, I might still be alive then), so I should probably try
> > to make something work with "date '+%N'".  Or use something like this:
> > 
> > $ date '+%4N'; date '+%4N';date '+%4N'; date '+%4N'
> > 6660
> > 6685
> > 6697
> > 6710
> > 
> > Ah, but that means I need to add the "date" command to my initrd, doesn't
> > it?  And calculation requires either bash or the "test" command.  And it
> > would be quite good to restrict this to what can be done with Bourne shell
> > built-in commands, since a big point of this is to maintain a small-sized
> > initrd.  :-/
> 
> Sure, and I'm not suggesting adding commands to the initrd, hence my
> mention of "If there's no better way".
> 
> > So how about the following patch, which attempts to explain the situation?
> 
> That would help, but please also consider consolidating with something
> like a10="a a a a a a a a a a" to make it more readable (and perhaps
> rounding up to 200 for simplicity).

How about powers of four and one factor of three for 192, as shown below?

Thanx, Paul

----

commit 4f8f751961b536f77c8f82394963e8e2d26efd84
Author: Paul E. McKenney 
Date:   Tue Dec 4 14:59:12 2018 -0800

torture: Explain and simplify odd "for" loop in mkinitrd.sh

Why a Bourne-shell "for" loop?  And why 192 instances of "a"?  This commit
adds a shell comment to present the answer to these mysteries.  It also
uses a series of factor-of-four Bourne-shell assignments to make it
easy to see how many instances there are, replacing the earlier wall of
'a' characters.

Reported-by: Josh Triplett 
Signed-off-by: Paul E. McKenney 

diff --git a/tools/testing/selftests/rcutorture/bin/mkinitrd.sh 
b/tools/testing/selftests/rcutorture/bin/mkinitrd.sh
index da298394daa2..ff69190604ea 100755
--- a/tools/testing/selftests/rcutorture/bin/mkinitrd.sh
+++ b/tools/testing/selftests/rcutorture/bin/mkinitrd.sh
@@ -40,17 +40,24 @@ mkdir $T
 cat > $T/init << '__EOF___'
 #!/bin/sh
 # Run in userspace a few milliseconds every second.  This helps to
-# exercise the NO_HZ_FULL portion

Re: rcu_preempt caused oom

2018-12-05 Thread Paul E. McKenney
On Wed, Dec 05, 2018 at 08:42:54AM +, He, Bo wrote:
> I double checked the .config; we don't enable CONFIG_NO_HZ_FULL.
> Our previous logs can dump all the task backtraces, and the kthreads (the 
> rcu_preempt, rcu_sched, and rcu_bh tasks) are all in the "I" state, not the "R" 
> state. My understanding is that if this is the side effect of causing RCU's 
> kthreads to run at SCHED_FIFO priority 1, the kthreads should be in the R 
> state.

Hmmm...  Well, the tasks could in theory be waiting on a blocking mutex.
But in practice the grace-period kthreads wait on events, so that makes
no sense.

Is it possible for you to dump out the grace-period kthread's stack,
for example, with sysreq-t?  (Steve might know a better way to do this.)

> I will do more experiments and keep you updated once we have more findings:
> 1. set the kthread priority to SCHED_FIFO without CONFIG_RCU_BOOST and see if 
> the issue can reproduce.

That sounds like a most excellent experiment!

> 2. check more ftrace to double confirm why there is no 
> trace_rcu_quiescent_state_report and most of the trace_rcu_grace_period are 
> in "AccWaitCB".

As noted earlier, to see something interesting, you will need to start
the ftrace before the grace period starts.  This would probably mean
having ftrace running before starting the test.  Starting the ftrace
after the hang commences is unlikely to produce useful information.

    Thanx, Paul

> -Original Message-
> From: Paul E. McKenney  
> Sent: Wednesday, December 5, 2018 3:50 AM
> To: He, Bo 
> Cc: Steven Rostedt ; linux-kernel@vger.kernel.org; 
> j...@joshtriplett.org; mathieu.desnoy...@efficios.com; 
> jiangshan...@gmail.com; Zhang, Jun ; Xiao, Jin 
> ; Zhang, Yanmin ; Bai, Jie A 
> 
> Subject: Re: rcu_preempt caused oom
> 
> On Tue, Dec 04, 2018 at 07:50:04AM +, He, Bo wrote:
> > Hi, Paul:
> > the enclosed log triggers the 120s hung_task_panic without other 
> > debug patches; the hung task is blocked at __wait_rcu_gp, which means the 
> > RCU CPU stall detector can't catch the scenario:
> > echo 1 > /proc/sys/kernel/panic_on_rcu_stall
> > echo 7 > /sys/module/rcupdate/parameters/rcu_cpu_stall_timeout
> 
> Not necessarily.  If there is an RCU CPU stall warning, blocking within
> __wait_rcu_gp() is expected behavior.  It is possible that the problem is 
> that although the grace period is completing as required, the callbacks are 
> not being invoked in a timely fashion.  And that could happen if you had 
> CONFIG_NO_HZ_FULL and a bunch of nohz_full CPUs, or, alternatively, callback 
> offloading enabled.  But I don't see these in your previous emails.  Another 
> possible cause is that the grace-period kthread is being delayed, so that the 
> grace period never starts.  This seems unlikely, but it is the only thing 
> thus far that matches the symptoms.
> 
> CONFIG_RCU_BOOST=y has the side-effect of causing RCU's kthreads to be run at 
> SCHED_FIFO priority 1, and that would help in the case where RCU's 
> grace-period kthread (the rcu_preempt, rcu_sched, and rcu_bh tasks, all of 
> which execute in the rcu_gp_kthread() function) was being starved of CPU time.
> 
> Does that sound likely?
> 
>   Thanx, Paul
> 
> > -Original Message-
> > From: Paul E. McKenney 
> > Sent: Monday, December 3, 2018 9:57 PM
> > To: He, Bo 
> > Cc: Steven Rostedt ; 
> > linux-kernel@vger.kernel.org; j...@joshtriplett.org; 
> > mathieu.desnoy...@efficios.com; jiangshan...@gmail.com; Zhang, Jun 
> > ; Xiao, Jin ; Zhang, Yanmin 
> > 
> > Subject: Re: rcu_preempt caused oom
> > 
> > On Mon, Dec 03, 2018 at 07:44:03AM +, He, Bo wrote:
> > > Thanks, we have run the test for the whole weekend and did not reproduce the 
> > > issue, so we confirm that CONFIG_RCU_BOOST fixes the issue.
> > 
> > Very good, that is encouraging.  Perhaps I should think about making 
> > CONFIG_RCU_BOOST=y the default for CONFIG_PREEMPT in mainline, at least for 
> > architectures for which rt_mutexes are implemented.
> > 
> > > We have enabled the rcupdate.rcu_cpu_stall_timeout=7 and also set panic 
> > > on rcu stall and will see if we can see the panic, and will keep you posted 
> > > with the test results.
> > > echo 1 > /proc/sys/kernel/panic_on_rcu_stall
> > 
> > Looking forward to seeing what is going on!  Of course, to reproduce, you 
> > will need to again build with CONFIG_RCU_BOOST=n.
> > 
> > Thanx, Paul
> > 
> > > -Original Message-
> > > From: Paul E. 

Re: [tip:core/rcu] rcutorture: Make initrd/init execute in userspace

2018-12-04 Thread Paul E. McKenney
On Tue, Dec 04, 2018 at 02:24:13PM -0800, Josh Triplett wrote:
> On Tue, Dec 04, 2018 at 02:09:42PM -0800, tip-bot for Paul E. McKenney wrote:
> > --- a/tools/testing/selftests/rcutorture/bin/mkinitrd.sh
> > +++ b/tools/testing/selftests/rcutorture/bin/mkinitrd.sh
> > @@ -39,9 +39,22 @@ mkdir $T
> >  
> >  cat > $T/init << '__EOF___'
> >  #!/bin/sh
> > +# Run in userspace a few milliseconds every second.  This helps to
> > +# exercise the NO_HZ_FULL portions of RCU.
> >  while :
> >  do
> > -   sleep 100
> > +   q=
> > +   for i in \
> > +   a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a \
> > +   a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a \
> > +   a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a \
> > +   a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a \
> > +   a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a \
> > +   a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a
> 
> Ow. If there's no better way to do this, please do at least comment how many 
> 'a's
> this is. (And why 186, exactly?)

Yeah, that is admittedly a bit strange.  The reason for 186 occurrences of
"a" to one-time calibration, measuring a few millisecond's worth of delay.

> Please also consider calibrating the delay loop as you do in the C code.

Good point.  And a quick web search finds me "date '+%s%N'", which gives
me nanoseconds since the epoch.  I probably don't want to do a 2038 to
myself (after all, I might still be alive then), so I should probably try
to make something work with "date '+%N'".  Or use something like this:

$ date '+%4N'; date '+%4N';date '+%4N'; date '+%4N'
6660
6685
6697
6710

Ah, but that means I need to add the "date" command to my initrd, doesn't
it?  And calculation requires either bash or the "test" command.  And it
would be quite good to restrict this to what can be done with Bourne shell
built-in commands, since a big point of this is to maintain a small-sized
initrd.  :-/

So how about the following patch, which attempts to explain the situation?

    Thanx, Paul



commit 23c304cbeda435acd4096ab3213502d6ae9720f3
Author: Paul E. McKenney 
Date:   Tue Dec 4 14:59:12 2018 -0800

torture: Explain odd "for" loop in mkinitrd.sh

    Why a Bourne-shell "for" loop?  And why 186 instances of "a"?  This commit
adds a shell comment to present the answer to these mysteries.

Reported-by: Josh Triplett 
Signed-off-by: Paul E. McKenney 

diff --git a/tools/testing/selftests/rcutorture/bin/mkinitrd.sh 
b/tools/testing/selftests/rcutorture/bin/mkinitrd.sh
index da298394daa2..1df0bbbfde7c 100755
--- a/tools/testing/selftests/rcutorture/bin/mkinitrd.sh
+++ b/tools/testing/selftests/rcutorture/bin/mkinitrd.sh
@@ -40,7 +40,15 @@ mkdir $T
 cat > $T/init << '__EOF___'
 #!/bin/sh
 # Run in userspace a few milliseconds every second.  This helps to
-# exercise the NO_HZ_FULL portions of RCU.
+# exercise the NO_HZ_FULL portions of RCU.  Yes, there are 186 instances
+# of "a", which was empirically shown to give a nice multi-millisecond
+# burst of user-mode execution on a 2GHz CPU, as desired.  Modern CPUs
+# will vary from a couple of milliseconds up to perhaps 100 milliseconds,
+# which is an acceptable range.
+#
+# Why not calibrate an exact delay?  Because within this initrd, we
+# are restricted to Bourne-shell builtins, which as far as I know do not
+# provide any means of obtaining a fine-grained timestamp.
 while :
 do
q=



[tip:core/rcu] tools/kernel.h: Replace synchronize_sched() with synchronize_rcu()

2018-12-04 Thread tip-bot for Paul E. McKenney
Commit-ID:  4a67e3a79e3bdc47dfd0c85a1888067d95a0282c
Gitweb: https://git.kernel.org/tip/4a67e3a79e3bdc47dfd0c85a1888067d95a0282c
Author: Paul E. McKenney 
AuthorDate: Wed, 7 Nov 2018 15:25:13 -0800
Committer:  Paul E. McKenney 
CommitDate: Sat, 1 Dec 2018 12:38:51 -0800

tools/kernel.h: Replace synchronize_sched() with synchronize_rcu()

Now that synchronize_rcu() waits for preempt-disable regions of code
as well as RCU read-side critical sections, synchronize_sched() can be
replaced by synchronize_rcu().  This commit therefore makes this change,
even though it is but a comment.

Signed-off-by: Paul E. McKenney 
Cc: Matthew Wilcox 
Cc: 
---
 tools/include/linux/kernel.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tools/include/linux/kernel.h b/tools/include/linux/kernel.h
index 6935ef94e77a..857d9e22826e 100644
--- a/tools/include/linux/kernel.h
+++ b/tools/include/linux/kernel.h
@@ -116,6 +116,6 @@ int scnprintf(char * buf, size_t size, const char * fmt, 
...);
 #define round_down(x, y) ((x) & ~__round_mask(x, y))
 
 #define current_gfp_context(k) 0
-#define synchronize_sched()
+#define synchronize_rcu()
 
 #endif


[tip:core/rcu] tracing: Replace synchronize_sched() and call_rcu_sched()

2018-12-04 Thread tip-bot for Paul E. McKenney
Commit-ID:  7440172974e85b1828bdd84ac6b23b5bcad9c5eb
Gitweb: https://git.kernel.org/tip/7440172974e85b1828bdd84ac6b23b5bcad9c5eb
Author: Paul E. McKenney 
AuthorDate: Tue, 6 Nov 2018 18:44:52 -0800
Committer:  Paul E. McKenney 
CommitDate: Tue, 27 Nov 2018 09:21:41 -0800

tracing: Replace synchronize_sched() and call_rcu_sched()

Now that synchronize_rcu() waits for preempt-disable regions of code
as well as RCU read-side critical sections, synchronize_sched() can
be replaced by synchronize_rcu().  Similarly, call_rcu_sched() can be
replaced by call_rcu().  This commit therefore makes these changes.

Signed-off-by: Paul E. McKenney 
Cc: Ingo Molnar 
Cc: 
Acked-by: Steven Rostedt (VMware) 
---
 include/linux/tracepoint.h |  2 +-
 kernel/trace/ftrace.c  | 24 
 kernel/trace/ring_buffer.c | 12 ++--
 kernel/trace/trace.c   | 10 +-
 kernel/trace/trace_events_filter.c |  4 ++--
 kernel/trace/trace_kprobe.c|  2 +-
 kernel/tracepoint.c|  4 ++--
 7 files changed, 29 insertions(+), 29 deletions(-)

diff --git a/include/linux/tracepoint.h b/include/linux/tracepoint.h
index 538ba1a58f5b..432080b59c26 100644
--- a/include/linux/tracepoint.h
+++ b/include/linux/tracepoint.h
@@ -82,7 +82,7 @@ int unregister_tracepoint_module_notifier(struct 
notifier_block *nb)
 static inline void tracepoint_synchronize_unregister(void)
 {
	synchronize_srcu(&tracepoint_srcu);
-   synchronize_sched();
+   synchronize_rcu();
 }
 #else
 static inline void tracepoint_synchronize_unregister(void)
diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c
index f536f601bd46..5b4f73e4fd56 100644
--- a/kernel/trace/ftrace.c
+++ b/kernel/trace/ftrace.c
@@ -173,7 +173,7 @@ static void ftrace_sync(struct work_struct *work)
 {
/*
 * This function is just a stub to implement a hard force
-* of synchronize_sched(). This requires synchronizing
+* of synchronize_rcu(). This requires synchronizing
 * tasks even in userspace and idle.
 *
 * Yes, function tracing is rude.
@@ -934,7 +934,7 @@ ftrace_profile_write(struct file *filp, const char __user 
*ubuf,
ftrace_profile_enabled = 0;
/*
 * unregister_ftrace_profiler calls stop_machine
-* so this acts like an synchronize_sched.
+* so this acts like an synchronize_rcu.
 */
unregister_ftrace_profiler();
}
@@ -1086,7 +1086,7 @@ struct ftrace_ops *ftrace_ops_trampoline(unsigned long 
addr)
 
/*
 * Some of the ops may be dynamically allocated,
-* they are freed after a synchronize_sched().
+* they are freed after a synchronize_rcu().
 */
preempt_disable_notrace();
 
@@ -1286,7 +1286,7 @@ static void free_ftrace_hash_rcu(struct ftrace_hash *hash)
 {
if (!hash || hash == EMPTY_HASH)
return;
-   call_rcu_sched(&hash->rcu, __free_ftrace_hash_rcu);
+   call_rcu(&hash->rcu, __free_ftrace_hash_rcu);
 }
 
 void ftrace_free_filter(struct ftrace_ops *ops)
@@ -1501,7 +1501,7 @@ static bool hash_contains_ip(unsigned long ip,
  * the ip is not in the ops->notrace_hash.
  *
  * This needs to be called with preemption disabled as
- * the hashes are freed with call_rcu_sched().
+ * the hashes are freed with call_rcu().
  */
 static int
 ftrace_ops_test(struct ftrace_ops *ops, unsigned long ip, void *regs)
@@ -4496,7 +4496,7 @@ unregister_ftrace_function_probe_func(char *glob, struct 
trace_array *tr,
if (ftrace_enabled && !ftrace_hash_empty(hash))
ftrace_run_modify_code(&probe->ops, FTRACE_UPDATE_CALLS,
   &old_hash_ops);
-   synchronize_sched();
+   synchronize_rcu();
 
hlist_for_each_entry_safe(entry, tmp, &hhd, hlist) {
hlist_del(&entry->hlist);
@@ -5314,7 +5314,7 @@ ftrace_graph_release(struct inode *inode, struct file 
*file)
mutex_unlock(&graph_lock);
 
/* Wait till all users are no longer using the old hash */
-   synchronize_sched();
+   synchronize_rcu();
 
free_ftrace_hash(old_hash);
}
@@ -5707,7 +5707,7 @@ void ftrace_release_mod(struct module *mod)
list_for_each_entry_safe(mod_map, n, &ftrace_mod_maps, list) {
if (mod_map->mod == mod) {
list_del_rcu(&mod_map->list);
-   call_rcu_sched(&mod_map->rcu, ftrace_free_mod_map);
+   call_rcu(&mod_map->rcu, ftrace_free_mod_map);
break;
}
}
@@ -5927,7 +5927,7 @@ ftrace_mod_address_lookup(unsigned long addr, unsigned 
long *size,
struct ftrace_mod_map *mod_map;
const char *ret = NULL;
 
-   /* mod_map is freed via call_rcu_sched() */
+

Re: [GIT PULL rcu/next] RCU commits for 4.21/5.0

2018-12-04 Thread Paul E. McKenney
On Tue, Dec 04, 2018 at 02:38:17PM +0100, Willy Tarreau wrote:
> Hi Ingo,
> 
> On Tue, Dec 04, 2018 at 09:08:37AM +0100, Ingo Molnar wrote:
> > I noticed this bit from Willy:
> > 
> > >  tools/testing/selftests/rcutorture/bin/nolibc.h| 2197 
> > > 
> > 
> > So nolibc.h is a rather large header and it comes with very little 
> > documentation - but once you read through the header it's obvious what it 
> > does, the code is clean and it's pretty cool all around, and in hindsight 
> > the name is a strong hint about what the header does as well. ;)
> 
> Thanks for the positive comment, as it was initially not designed to be
> merged into the kernel and was just a local home project. I figured it
> could be a perfect solution to Paul's executable size issues and offered
> some help to get it in relatively quickly, but surely we can do much better!
> 
> > Still it would be nice to at least add a top level description to the 
> > header to make people (like me) who are reading the code before the 
> > changelogs wonder less. For tooling headers we require a similar 
> > self-explanatory, feel-fuzzy structure as for kernel headers.
> 
> I'm fine with doing this. I even wrote the very small header at the last
> minute, without knowing if there was any chance it survives a review :-)
> 
> > Beyond adding a bit more documentation it would also be useful to factor 
> > nolibc.h out into tools/include/nolibc/ or so, no reason to hide it in 
> > rcutorture, I bet there's a number of other testcases and smaller 
> > utilities in tools/ that could make good use of it.
> 
> Fine as well. It's important however to keep in mind that I only covered
> the few architectures I could test (i386/x86_64/arm/arm64/mips), and even
> there the coverage is still limited. I don't want it to become too much of
> a pain to use for other utilities just by lack of initial coverage. However
> I agree that better exposure will help contributions come in.
> 
> > My long term hope would be that eventually we could even create a minimal 
> > klibc from it (a minimal libc provided by the kernel itself), giving 
> > minimalist binaries a mechanism to link against klibc.so:
> > 
> > - klibc would be an explicit opt-in mechanism, i.e. binaries that are 
> >   coupled with the kernel anyway (and initrd executables certainly are) 
> >   could use this.
> 
> In fact it's very similar to my goal. I'm using it in initramfs and initrds
> that do very little stuff and where it's acceptable to have a few #ifdef to
> adapt to this or that libc. However I found it extremely convenient *not* to
> require any external symbol, thus not to have to link against anything. But
> I'm well aware that this position cannot last forever and that at some
> point if we want to go further we'll possibly have a few layers (naked
> syscalls returning -errno, decorated syscalls making use of an explicit
> errno, libc-specific stuff like string functions). Possibly that in this
> case only the naked version would remain in the .h and that the rest will
> require linking with the .so/.a.
> 
> > - We could also add a way for the kernel to provide (non-swappable) 
> >   binaries via an automatic /klib/ mount point or so. This would allow 
> >   features like a minimal, console based rescue/debug shell that doesn't 
> >   rely on any filesystem state or external library dependencies, other 
> >   than the initial kernel+initrd image.
> 
> This could be convenient indeed, I never thought about this. I'm currently
> doing something comparable using initramfs, so maybe in the end we don't
> need the kernel to create anything beyond this, but instead just let the
> user choose in the configuration what utilities should be added to the
> initramfs sources.
> 
> (...)
> > - klibc would also eventually allow deeper integration with the vDSO 
> >   proper: for example on klibc based embedded systems we could link klibc
> >   and the vDSO into a single vDSO library, further simplifying and 
> >   optimizing it.
> 
> I already looked at how to implement vDSO. I figured it was not very difficult 
> but would require that I maintain variables with the AUXV; then I thought 
> that it went beyond the scope of this minimalist implementation and 
> postponed it.
> 
> > - klibc would also allow faster feature propagation from kernel to libc 
> >   as well, as we could prototype, test and benchmark new system calls and 
> >   new features on klibc - i.e. klibc integration and testcases could be a
> >   requirement for new system calls.
> 
> This actually is a good idea. There was already a discussion in another
> thread about exposing syscalls better in the kernel for better interactions
> with the libc, but it could start this way with test cases. It also increases
> the likelihood that an awkward API is detected early when the person starts 
> to write his/her own part of the libc side.
> 
> > - There's no upper limit to how such a minimal kernel-shell (root only) 
> >   environment 

Re: rcu_preempt caused oom

2018-12-04 Thread Paul E. McKenney
On Tue, Dec 04, 2018 at 07:50:04AM +, He, Bo wrote:
> Hi, Paul:
> the enclosed log triggers the 120s hung_task_panic without other debug 
> patches; the hung task is blocked at __wait_rcu_gp, which means the 
> RCU CPU stall detector can't catch the scenario:
> echo 1 > /proc/sys/kernel/panic_on_rcu_stall
> echo 7 > /sys/module/rcupdate/parameters/rcu_cpu_stall_timeout

Not necessarily.  If there is an RCU CPU stall warning, blocking within
__wait_rcu_gp() is expected behavior.  It is possible that the problem is
that although the grace period is completing as required, the callbacks
are not being invoked in a timely fashion.  And that could happen if you
had CONFIG_NO_HZ_FULL and a bunch of nohz_full CPUs, or, alternatively,
callback offloading enabled.  But I don't see these in your previous
emails.  Another possible cause is that the grace-period kthread is being
delayed, so that the grace period never starts.  This seems unlikely,
but it is the only thing thus far that matches the symptoms.

CONFIG_RCU_BOOST=y has the side-effect of causing RCU's kthreads to
be run at SCHED_FIFO priority 1, and that would help in the case where
RCU's grace-period kthread (the rcu_preempt, rcu_sched, and rcu_bh tasks,
all of which execute in the rcu_gp_kthread() function) was being starved
of CPU time.
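
(For reference, a userspace analogue of the resulting policy; the kernel
itself applies it with sched_setscheduler_nocheck() when spawning the
grace-period kthreads:)

#include <pthread.h>
#include <sched.h>

/* Move a thread to SCHED_FIFO priority 1, as CONFIG_RCU_BOOST does for
 * RCU's kthreads, so it cannot be starved by SCHED_OTHER CPU hogs.
 * Illustrative sketch; requires CAP_SYS_NICE to succeed. */
static int boost_to_fifo(pthread_t t)
{
	struct sched_param sp = { .sched_priority = 1 };

	return pthread_setschedparam(t, SCHED_FIFO, &sp);
}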

Does that sound likely?

Thanx, Paul

> -Original Message-----
> From: Paul E. McKenney  
> Sent: Monday, December 3, 2018 9:57 PM
> To: He, Bo 
> Cc: Steven Rostedt ; linux-kernel@vger.kernel.org; 
> j...@joshtriplett.org; mathieu.desnoy...@efficios.com; 
> jiangshan...@gmail.com; Zhang, Jun ; Xiao, Jin 
> ; Zhang, Yanmin 
> Subject: Re: rcu_preempt caused oom
> 
> On Mon, Dec 03, 2018 at 07:44:03AM +, He, Bo wrote:
> > Thanks, we have run the test for the whole weekend and did not reproduce the 
> > issue, so we confirm that CONFIG_RCU_BOOST fixes the issue.
> 
> Very good, that is encouraging.  Perhaps I should think about making 
> CONFIG_RCU_BOOST=y the default for CONFIG_PREEMPT in mainline, at least for 
> architectures for which rt_mutexes are implemented.
> 
> > We have enabled the rcupdate.rcu_cpu_stall_timeout=7 and also set panic on 
> > rcu stall and will see if we can see the panic, and will keep you posted with 
> > the test results.
> > echo 1 > /proc/sys/kernel/panic_on_rcu_stall
> 
> Looking forward to seeing what is going on!  Of course, to reproduce, you 
> will need to again build with CONFIG_RCU_BOOST=n.
> 
>   Thanx, Paul
> 
> > -Original Message-
> > From: Paul E. McKenney 
> > Sent: Saturday, December 1, 2018 12:49 AM
> > To: He, Bo 
> > Cc: Steven Rostedt ; 
> > linux-kernel@vger.kernel.org; j...@joshtriplett.org; 
> > mathieu.desnoy...@efficios.com; jiangshan...@gmail.com; Zhang, Jun 
> > ; Xiao, Jin ; Zhang, Yanmin 
> > 
> > Subject: Re: rcu_preempt caused oom
> > 
> > On Fri, Nov 30, 2018 at 03:18:58PM +, He, Bo wrote:
> > > Here is the kernel cmdline:
> > 
> > Thank you!
> > 
> > > Kernel command line: androidboot.acpio_idx=0
> > > androidboot.bootloader=efiwrapper-02_03-userdebug_kernelflinger-06_0
> > > 3- userdebug androidboot.diskbus=00.0 
> > > androidboot.verifiedbootstate=green
> > > androidboot.bootreason=power-on androidboot.serialno=R1J56L6006a7bb
> > > g_ffs.iSerialNumber=R1J56L6006a7bb no_timer_check noxsaves 
> > > reboot_panic=p,w i915.hpd_sense_invert=0x7 mem=2G nokaslr nopti 
> > > ftrace_dump_on_oops trace_buf_size=1024K intel_iommu=off gpt
> > > loglevel=4 androidboot.hardware=gordon_peak 
> > > firmware_class.path=/vendor/firmware relative_sleep_states=1
> > > enforcing=0 androidboot.selinux=permissive cpu_init_udelay=10 
> > > androidboot.android_dt_dir=/sys/bus/platform/devices/ANDR0001:00/pro
> > > pe rties/android/ pstore.backend=ramoops memmap=0x140$0x5000
> > > ramoops.mem_address=0x5000 ramoops.mem_size=0x140
> > > ramoops.record_size=0x4000 ramoops.console_size=0x100
> > > ramoops.ftrace_size=0x1 ramoops.dump_oops=1 vga=current
> > > i915.modeset=1 drm.atomic=1 i915.nuclear_pageflip=1 
> > > drm.vblankoffdelay=
> > 
> > And no sign of any suppression of RCU CPU stall warnings.  Hmmm...
> > It does take more than 21 seconds to OOM?  Or do things happen faster than 
> > that?  If they do happen faster than that, then one approach would be to add 
> > something like this to the kernel command line:
> > 
> > rcupdate.rcu_cpu_stall_timeout=7
> > 
> > This would set the sta

Re: [RFC PATCH 1/1] epoll: use rwlock in order to reduce ep_poll_callback() contention

2018-12-04 Thread Paul E. McKenney
On Tue, Dec 04, 2018 at 12:23:08PM -0500, Jason Baron wrote:
> 
> 
> On 12/3/18 6:02 AM, Roman Penyaev wrote:
> > Hi all,
> > 
> > The goal of this patch is to reduce contention of ep_poll_callback() which
> > can be called concurrently from different CPUs in case of high events
> > rates and many fds per epoll.  Problem can be very well reproduced by
> > generating events (write to pipe or eventfd) from many threads, while
> > consumer thread does polling.  In other words this patch increases the
> > bandwidth of events which can be delivered from sources to the poller by
> > adding poll items in a lockless way to the list.
> > 
> > The main change is in replacement of the spinlock with a rwlock, which is
> > taken on read in ep_poll_callback(), and then by adding poll items to the
> > tail of the list using xchg atomic instruction.  Write lock is taken
> > everywhere else in order to stop list modifications and guarantee that list
> > updates are fully completed (I assume that write side of a rwlock does not
> > starve, it seems qrwlock implementation has these guarantees).
> > 
> > The following are some microbenchmark results based on the test [1] which
> > starts threads which generate N events each.  The test ends when all
> > events are successfully fetched by the poller thread:
> > 
> > spinlock
> > 
> > 
> > threads   run time    events per ms
> > -------   --------    -------------
> >       8    13191ms          6064/ms
> >      16    30758ms          5201/ms
> >      32    44315ms          7220/ms
> > 
> > rwlock + xchg
> > =
> > 
> > threads   run time    events per ms
> > -------   --------    -------------
> >       8     8581ms          9323/ms
> >      16    13800ms         11594/ms
> >      32    24167ms         13240/ms
> > 
> > According to the results bandwidth of delivered events is significantly
> > increased, thus execution time is reduced.
> > 
> > This is RFC because I did not run any benchmarks comparing current
> > qrwlock and spinlock implementations (4.19 kernel), although I did
> > not notice any epoll performance degradations in other benchmarks.
> > 
> > Also I'm not quite sure where to put very special lockless variant
> > of adding element to the list (list_add_tail_lockless() in this
> > patch).  Seems keeping it locally is safer.
> > 
> > [1] https://github.com/rouming/test-tools/blob/master/stress-epoll.c
> > 
> > Signed-off-by: Roman Penyaev 
> > Cc: Alexander Viro 
> > Cc: "Paul E. McKenney" 
> > Cc: Linus Torvalds 
> > Cc: linux-fsde...@vger.kernel.org
> > Cc: linux-kernel@vger.kernel.org
> > ---
> >  fs/eventpoll.c | 107 +++--
> >  1 file changed, 69 insertions(+), 38 deletions(-)
> > 
> > diff --git a/fs/eventpoll.c b/fs/eventpoll.c
> > index 42bbe6824b4b..89debda47aca 100644
> > --- a/fs/eventpoll.c
> > +++ b/fs/eventpoll.c
> > @@ -50,10 +50,10 @@
> >   *
> >   * 1) epmutex (mutex)
> >   * 2) ep->mtx (mutex)
> > - * 3) ep->wq.lock (spinlock)
> > + * 3) ep->lock (rwlock)
> >   *
> >   * The acquire order is the one listed above, from 1 to 3.
> > - * We need a spinlock (ep->wq.lock) because we manipulate objects
> > + * We need a rwlock (ep->lock) because we manipulate objects
> >   * from inside the poll callback, that might be triggered from
> >   * a wake_up() that in turn might be called from IRQ context.
> >   * So we can't sleep inside the poll callback and hence we need
> > @@ -85,7 +85,7 @@
> >   * of epoll file descriptors, we use the current recursion depth as
> >   * the lockdep subkey.
> >   * It is possible to drop the "ep->mtx" and to use the global
> > - * mutex "epmutex" (together with "ep->wq.lock") to have it working,
> > + * mutex "epmutex" (together with "ep->lock") to have it working,
> >   * but having "ep->mtx" will make the interface more scalable.
> >   * Events that require holding "epmutex" are very rare, while for
> >   * normal operations the epoll private "ep->mtx" will guarantee
> > @@ -182,8 +182,6 @@ struct epitem {
> >   * This structure is stored inside the "private_data" member of the file
> >   * structure and represents the main data structure for the eventpoll
> >   * interface.
> > - *
> > - * Access to it is protected by the loc
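
The lockless tail insertion the patch description refers to follows a
well-known shape; here is a generic sketch in C11 atomics (not the
patch's exact list_add_tail_lockless()) for readers unfamiliar with the
technique:

#include <stdatomic.h>
#include <stddef.h>

struct node {
	_Atomic(struct node *) next;
};

/* Many producers append with a single atomic exchange on the tail
 * pointer; whoever owned the old tail then publishes the new link.
 * Concurrent adders never corrupt the list; a consumer can at worst
 * observe a momentarily not-yet-linked node. */
static void enqueue(_Atomic(struct node *) *tail, struct node *new)
{
	struct node *prev;

	atomic_store_explicit(&new->next, NULL, memory_order_relaxed);
	prev = atomic_exchange_explicit(tail, new, memory_order_acq_rel);
	atomic_store_explicit(&prev->next, new, memory_order_release);
}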

Re: [PATCH memory-model 0/3] Updates to the formal memory model

2018-12-04 Thread Paul E. McKenney
On Wed, Dec 05, 2018 at 12:40:01AM +0900, Akira Yokosawa wrote:
> On 2018/12/03 15:51:27 -0800, Paul E. McKenney wrote:
> > On Tue, Dec 04, 2018 at 08:28:03AM +0900, Akira Yokosawa wrote:
> >> On 2018/12/03 15:04:11 -0800, Paul E. McKenney wrote:
> >>> Hello, Ingo!
> >>>
> >>> This series contains updates to the Linux kernel's formal memory model
> >>> in tools/memory-model.  These patches are ready for inclusion into -tip.
> >>>
> >>> 1.Model smp_mb__after_unlock_lock(), courtesy of Andrea Parri.
> >>>
> >>> 2.Add scripts to check github litmus tests.
> >>>
> >>> 3.Make scripts take "-j" abbreviation for "--jobs".
> >>>
> >>> There is another series in preparation to model SRCU, but this series
> >>> requires hot-off-the presses changes to the herd tool that have not yet
> >>> been released.  This SRCU series is therefore targeting the merge window
> >>> after the upcoming one.  People wishing to experiment with the prototype
> >>> SRCU model may obtain it from my -rcu tree at branch "dev", and use
> >>> a bleeding-edge herd7 built from https://github.com/herd/herdtools7/,
> >>> version 7.51+2(dev), which is (commit 10403b24070c) or later.
> >>
> >> On the master branch of herdtools7, SRCU support was added in version
> >> 7.51+4(dev), which is commit 6ec9da1f4d58, or later.
> > 
> > It has been working for me with version 7.51+2(dev), but perhaps I
> > have just been getting lucky.  It wouldn't be the first time!  ;-)
> 
> Sounds like you've been at the HEAD of topic branch "srcu".

You are quite right.  And this situation does confirm the wisdom of
waiting until a herd release containing the SRCU support.  ;-)

Thanx, Paul

> Thanks, Akira
> > 
> > Thanx, Paul
> > 
> >> Thanks, Akira
> >>
> >>>
> >>>   Thanx, Paul
> >>>
> >>> 
> >>>
> >>>  .gitignore |1 
> >>>  README |2 
> >>>  linux-kernel.bell  |3 
> >>>  linux-kernel.cat   |4 -
> >>>  linux-kernel.def   |1 
> >>>  scripts/README |   70 ++
> >>>  scripts/checkalllitmus.sh  |   53 +++--
> >>>  scripts/checkghlitmus.sh   |   65 
> >>>  scripts/checklitmus.sh |   74 +++
> >>>  scripts/checklitmushist.sh |   60 +++
> >>>  scripts/cmplitmushist.sh   |   87 +++
> >>>  scripts/initlitmushist.sh  |   68 +
> >>>  scripts/judgelitmus.sh |   78 +
> >>>  scripts/newlitmushist.sh   |   61 +++
> >>>  scripts/parseargs.sh   |  140 -
> >>>  scripts/runlitmushist.sh   |   87 +++
> >>>  16 files changed, 757 insertions(+), 97 deletions(-)
> >>>
> >>
> > 
> 



[tip:locking/core] tools/memory-model: Make scripts take "-j" abbreviation for "--jobs"

2018-12-03 Thread tip-bot for Paul E. McKenney
Commit-ID:  a6f1de04276d036b61c4d1dbd0367e6b430d8783
Gitweb: https://git.kernel.org/tip/a6f1de04276d036b61c4d1dbd0367e6b430d8783
Author: Paul E. McKenney 
AuthorDate: Mon, 3 Dec 2018 15:04:51 -0800
Committer:  Ingo Molnar 
CommitDate: Tue, 4 Dec 2018 07:29:52 +0100

tools/memory-model: Make scripts take "-j" abbreviation for "--jobs"

The "--jobs" argument to the litmus-test scripts is similar to the "-jN"
argument to "make", so this commit allows the "-jN" form as well.  While
in the area, it also prohibits the various forms of "-j0".
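
For concreteness, a sketch of invocations under the new parsing
(checkghlitmus.sh stands in for any of the litmus-test scripts; the
job counts are illustrative):

	scripts/checkghlitmus.sh --jobs 16	# long form
	scripts/checkghlitmus.sh -j 16		# short form, separate argument
	scripts/checkghlitmus.sh -j16		# make-style abbreviation
	scripts/checkghlitmus.sh -j0		# rejected, as are other -j0 forms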

Suggested-by: Alan Stern 
Signed-off-by: Paul E. McKenney 
Cc: Linus Torvalds 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Cc: aki...@gmail.com
Cc: boqun.f...@gmail.com
Cc: dhowe...@redhat.com
Cc: j.algl...@ucl.ac.uk
Cc: linux-a...@vger.kernel.org
Cc: luc.maran...@inria.fr
Cc: npig...@gmail.com
Cc: parri.and...@gmail.com
Cc: will.dea...@arm.com
Link: http://lkml.kernel.org/r/20181203230451.28921-3-paul...@linux.ibm.com
Signed-off-by: Ingo Molnar 
---
 tools/memory-model/scripts/parseargs.sh | 14 --
 1 file changed, 12 insertions(+), 2 deletions(-)

diff --git a/tools/memory-model/scripts/parseargs.sh b/tools/memory-model/scripts/parseargs.sh
index 96b307c8d64a..859e1d581e05 100644
--- a/tools/memory-model/scripts/parseargs.sh
+++ b/tools/memory-model/scripts/parseargs.sh
@@ -95,8 +95,18 @@ do
LKMM_HERD_OPTIONS="$2"
shift
;;
-   --jobs|--job)
-   checkarg --jobs "(number)" "$#" "$2" '^[0-9]\+$' '^--'
+   -j[1-9]*)
+   njobs="`echo $1 | sed -e 's/^-j//'`"
+   trailchars="`echo $njobs | sed -e 's/[0-9]\+\(.*\)$/\1/'`"
+   if test -n "$trailchars"
+   then
+   echo $1 trailing characters "'$trailchars'"
+   usagehelp
+   fi
+   LKMM_JOBS="`echo $njobs | sed -e 's/^\([0-9]\+\).*$/\1/'`"
+   ;;
+   --jobs|--job|-j)
+   checkarg --jobs "(number)" "$#" "$2" '^[1-9][0-9]\+$' '^--'
LKMM_JOBS="$2"
shift
;;


[tip:locking/core] tools/memory-model: Add scripts to check github litmus tests

2018-12-03 Thread tip-bot for Paul E. McKenney
Commit-ID:  e188d24a382d609ec7ca6c1a00396202565b7831
Gitweb: https://git.kernel.org/tip/e188d24a382d609ec7ca6c1a00396202565b7831
Author: Paul E. McKenney 
AuthorDate: Mon, 3 Dec 2018 15:04:50 -0800
Committer:  Ingo Molnar 
CommitDate: Tue, 4 Dec 2018 07:29:52 +0100

tools/memory-model: Add scripts to check github litmus tests

The https://github.com/paulmckrcu/litmus repository contains a large
number of C-language litmus tests that include "Result:" comments
predicting the verification result.  This commit adds a number of scripts
that run tests on these litmus tests:

checkghlitmus.sh:
Runs all litmus tests in the https://github.com/paulmckrcu/litmus
archive that are C-language and that have "Result:" comment lines
documenting expected results, comparing the actual results to
those expected.  Clones the repository if it has not already
been cloned into the "tools/memory-model/litmus" directory.

initlitmushist.sh
Run all litmus tests having no more than the specified number
of processes given a specified timeout, recording the results in
.litmus.out files.  Clones the repository if it has not already
been cloned into the "tools/memory-model/litmus" directory.

newlitmushist.sh
For all new or updated litmus tests having no more than the
specified number of processes given a specified timeout, run
and record the results in .litmus.out files.

checklitmushist.sh
Run all litmus tests having .litmus.out files from previous
initlitmushist.sh or newlitmushist.sh runs, comparing the
herd output to that of the original runs.

The above scripts will run litmus tests concurrently, by default with
one job per available CPU.  Giving any of these scripts the --help
argument will cause them to print usage information.
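
For example, one possible workflow, sketched under the assumption that
herd7 is on your PATH and that the commands are run from the
tools/memory-model directory (flag values are illustrative):

	scripts/initlitmushist.sh --procs 5 --timeout 1m  # record baseline
	# ... hack on the memory model ...
	scripts/newlitmushist.sh --procs 5 --timeout 1m   # rerun new/changed tests
	scripts/checklitmushist.sh                        # compare against baseline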

This commit also adds a number of helper scripts that are not intended
to be invoked from the command line:

cmplitmushist.sh: Compare the output of two different runs of the same
litmus test.

judgelitmus.sh: Compare the output of a litmus test to its "Result:"
comment line.

parseargs.sh: Parse command-line arguments.

runlitmushist.sh: Run the litmus tests whose pathnames are provided one
per line on standard input.

While in the area, this commit also makes the existing checklitmus.sh
and checkalllitmus.sh scripts use parseargs.sh in order to provide a
bit of uniformity.  In addition, per-litmus-test status output is directed
to stdout, while end-of-test summary information is directed to stderr.
Finally, the error flag standardizes on "!!!" to assist those familiar
with rcutorture output.

The defaults for the parseargs.sh arguments may be overridden by using
environment variables: LKMM_DESTDIR for --destdir, LKMM_HERD_OPTIONS
for --herdoptions, LKMM_JOBS for --jobs, LKMM_PROCS for --procs, and
LKMM_TIMEOUT for --timeout.
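
For example (a hypothetical invocation; the values are illustrative):

	LKMM_JOBS=4 LKMM_PROCS=5 scripts/checkghlitmus.sh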

[ paulmck: History-check summary-line changes per Alan Stern feedback. ]
Signed-off-by: Paul E. McKenney 
Cc: Linus Torvalds 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Cc: aki...@gmail.com
Cc: boqun.f...@gmail.com
Cc: dhowe...@redhat.com
Cc: j.algl...@ucl.ac.uk
Cc: linux-a...@vger.kernel.org
Cc: luc.maran...@inria.fr
Cc: npig...@gmail.com
Cc: parri.and...@gmail.com
Cc: st...@rowland.harvard.edu
Cc: will.dea...@arm.com
Link: http://lkml.kernel.org/r/20181203230451.28921-2-paul...@linux.ibm.com
Signed-off-by: Ingo Molnar 
---
 tools/memory-model/.gitignore |   1 +
 tools/memory-model/README |   2 +
 tools/memory-model/scripts/README |  70 ++
 tools/memory-model/scripts/checkalllitmus.sh  |  53 +--
 tools/memory-model/scripts/checkghlitmus.sh   |  65 +
 tools/memory-model/scripts/checklitmus.sh |  74 +++
 tools/memory-model/scripts/checklitmushist.sh |  60 
 tools/memory-model/scripts/cmplitmushist.sh   |  87 ++
 tools/memory-model/scripts/initlitmushist.sh  |  68 ++
 tools/memory-model/scripts/judgelitmus.sh |  78 
 tools/memory-model/scripts/newlitmushist.sh   |  61 +
 tools/memory-model/scripts/parseargs.sh   | 126 ++
 tools/memory-model/scripts/runlitmushist.sh   |  87 ++
 13 files changed, 739 insertions(+), 93 deletions(-)

diff --git a/tools/memory-model/.gitignore b/tools/memory-model/.gitignore
new file mode 100644
index ..b1d34c52f3c3
--- /dev/null
+++ b/tools/memory-model/.gitignore
@@ -0,0 +1 @@
+litmus
diff --git a/tools/memory-model/README b/tools/memory-model/README
index acf9077cffaa..0f2c366518c6 100644
--- a/tools/memory-model/README
+++ b/tools/memory-model/README
@@ -156,6 +156,8 @@ lock.cat
 README
This file.
 
+scripts		Various scripts, see scripts/README.
+
 
 ===
 LIMITATIONS
diff --git a/tools/memory-model/

Re: [PATCH memory-model 0/3] Updates to the formal memory model

2018-12-03 Thread Paul E. McKenney
On Tue, Dec 04, 2018 at 08:28:03AM +0900, Akira Yokosawa wrote:
> On 2018/12/03 15:04:11 -0800, Paul E. McKenney wrote:
> > Hello, Ingo!
> > 
> > This series contains updates to the Linux kernel's formal memory model
> > in tools/memory-model.  These patches are ready for inclusion into -tip.
> > 
> > 1.  Model smp_mb__after_unlock_lock(), courtesy of Andrea Parri.
> > 
> > 2.  Add scripts to check github litmus tests.
> > 
> > 3.  Make scripts take "-j" abbreviation for "--jobs".
> > 
> > There is another series in preparation to model SRCU, but this series
> > requires hot-off-the-presses changes to the herd tool that have not yet
> > been released.  This SRCU series is therefore targeting the merge window
> > after the upcoming one.  People wishing to experiment with the prototype
> > SRCU model may obtain it from my -rcu tree at branch "dev", and use
> > a bleeding-edge herd7 built from https://github.com/herd/herdtools7/,
> > version 7.51+2(dev), which is commit 10403b24070c or later.
> 
> On the master branch of herdtools7, SRCU support was added in version
> 7.51+4(dev), which is commit 6ec9da1f4d58, or later.

It has been working for me with version 7.51+2(dev), but perhaps I
have just been getting lucky.  It wouldn't be the first time!  ;-)

Thanx, Paul

> Thanks, Akira
> 
> > 
> > Thanx, Paul
> > 
> > 
> > 
> >  .gitignore |1 
> >  README |2 
> >  linux-kernel.bell  |3 
> >  linux-kernel.cat   |4 -
> >  linux-kernel.def   |1 
> >  scripts/README |   70 ++
> >  scripts/checkalllitmus.sh  |   53 +++--
> >  scripts/checkghlitmus.sh   |   65 
> >  scripts/checklitmus.sh |   74 +++
> >  scripts/checklitmushist.sh |   60 +++
> >  scripts/cmplitmushist.sh   |   87 +++
> >  scripts/initlitmushist.sh  |   68 +
> >  scripts/judgelitmus.sh |   78 +
> >  scripts/newlitmushist.sh   |   61 +++
> >  scripts/parseargs.sh   |  140 -
> >  scripts/runlitmushist.sh   |   87 +++
> >  16 files changed, 757 insertions(+), 97 deletions(-)
> > 
> 



[PATCH memory-model 1/3] tools/memory-model: Model smp_mb__after_unlock_lock()

2018-12-03 Thread Paul E. McKenney
From: Andrea Parri 

>From the header comment for smp_mb__after_unlock_lock():

  "Place this after a lock-acquisition primitive to guarantee that
   an UNLOCK+LOCK pair acts as a full barrier.  This guarantee applies
   if the UNLOCK and LOCK are executed by the same CPU or if the
   UNLOCK and LOCK operate on the same lock variable."

This formalizes the above guarantee by defining (new) mb-links according
to the law:

  ([M] ; po ; [UL] ; (co | po) ; [LKW] ;
fencerel(After-unlock-lock) ; [M])

where the component ([UL] ; co ; [LKW]) identifies "UNLOCK+LOCK pairs on
the same lock variable" and the component ([UL] ; po ; [LKW]) identifies
"UNLOCK+LOCK pairs executed by the same CPU".

In particular, the LKMM forbids the following two behaviors (the second
litmus test below is based on

  Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.html

c.f., Section "Tree RCU Grace Period Memory Ordering Building Blocks"):

C after-unlock-lock-same-cpu

(*
 * Result: Never
 *)

{}

P0(spinlock_t *s, spinlock_t *t, int *x, int *y)
{
int r0;

spin_lock(s);
WRITE_ONCE(*x, 1);
spin_unlock(s);
spin_lock(t);
smp_mb__after_unlock_lock();
r0 = READ_ONCE(*y);
spin_unlock(t);
}

P1(int *x, int *y)
{
int r0;

WRITE_ONCE(*y, 1);
smp_mb();
r0 = READ_ONCE(*x);
}

exists (0:r0=0 /\ 1:r0=0)

C after-unlock-lock-same-lock-variable

(*
 * Result: Never
 *)

{}

P0(spinlock_t *s, int *x, int *y)
{
int r0;

spin_lock(s);
WRITE_ONCE(*x, 1);
r0 = READ_ONCE(*y);
spin_unlock(s);
}

P1(spinlock_t *s, int *y, int *z)
{
int r0;

spin_lock(s);
smp_mb__after_unlock_lock();
WRITE_ONCE(*y, 1);
r0 = READ_ONCE(*z);
spin_unlock(s);
}

P2(int *z, int *x)
{
int r0;

WRITE_ONCE(*z, 1);
smp_mb();
r0 = READ_ONCE(*x);
}

exists (0:r0=0 /\ 1:r0=0 /\ 2:r0=0)

Signed-off-by: Andrea Parri 
Cc: Alan Stern 
Cc: Will Deacon 
Cc: Peter Zijlstra 
Cc: Boqun Feng 
Cc: Nicholas Piggin 
Cc: David Howells 
Cc: Jade Alglave 
Cc: Luc Maranget 
Cc: "Paul E. McKenney" 
Cc: Akira Yokosawa 
Cc: Daniel Lustig 
Signed-off-by: Paul E. McKenney 
---
 tools/memory-model/linux-kernel.bell | 3 ++-
 tools/memory-model/linux-kernel.cat  | 4 +++-
 tools/memory-model/linux-kernel.def  | 1 +
 3 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/tools/memory-model/linux-kernel.bell b/tools/memory-model/linux-kernel.bell
index b84fb2f67109..796513362c05 100644
--- a/tools/memory-model/linux-kernel.bell
+++ b/tools/memory-model/linux-kernel.bell
@@ -29,7 +29,8 @@ enum Barriers = 'wmb (*smp_wmb*) ||
'sync-rcu (*synchronize_rcu*) ||
'before-atomic (*smp_mb__before_atomic*) ||
'after-atomic (*smp_mb__after_atomic*) ||
-   'after-spinlock (*smp_mb__after_spinlock*)
+   'after-spinlock (*smp_mb__after_spinlock*) ||
+   'after-unlock-lock (*smp_mb__after_unlock_lock*)
 instructions F[Barriers]
 
 (* Compute matching pairs of nested Rcu-lock and Rcu-unlock *)
diff --git a/tools/memory-model/linux-kernel.cat b/tools/memory-model/linux-kernel.cat
index 882fc33274ac..8f23c74a96fd 100644
--- a/tools/memory-model/linux-kernel.cat
+++ b/tools/memory-model/linux-kernel.cat
@@ -30,7 +30,9 @@ let wmb = [W] ; fencerel(Wmb) ; [W]
 let mb = ([M] ; fencerel(Mb) ; [M]) |
([M] ; fencerel(Before-atomic) ; [RMW] ; po? ; [M]) |
([M] ; po? ; [RMW] ; fencerel(After-atomic) ; [M]) |
-   ([M] ; po? ; [LKW] ; fencerel(After-spinlock) ; [M])
+   ([M] ; po? ; [LKW] ; fencerel(After-spinlock) ; [M]) |
+   ([M] ; po ; [UL] ; (co | po) ; [LKW] ;
+   fencerel(After-unlock-lock) ; [M])
 let gp = po ; [Sync-rcu] ; po?
 
 let strong-fence = mb | gp
diff --git a/tools/memory-model/linux-kernel.def b/tools/memory-model/linux-kernel.def
index 6fa3eb28d40b..b27911cc087d 100644
--- a/tools/memory-model/linux-kernel.def
+++ b/tools/memory-model/linux-kernel.def
@@ -23,6 +23,7 @@ smp_wmb() { __fence{wmb}; }
 smp_mb__before_atomic() { __fence{before-atomic}; }
 smp_mb__after_atomic() { __fence{after-atomic}; }
 smp_mb__after_spinlock() { __fence{after-spinlock}; }
+smp_mb__after_unlock_lock() { __fence{after-unlock-lock}; }
 
 // Exchange
 xchg(X,V)  __xchg{mb}(X,V)
-- 
2.17.1



[PATCH memory-model 2/3] EXP tools/memory-model: Add scripts to check github litmus tests

2018-12-03 Thread Paul E. McKenney
From: "Paul E. McKenney" 

The https://github.com/paulmckrcu/litmus repository contains a large
number of C-language litmus tests that include "Result:" comments
predicting the verification result.  This commit adds a number of scripts
that run tests on these litmus tests:

checkghlitmus.sh:
Runs all litmus tests in the https://github.com/paulmckrcu/litmus
archive that are C-language and that have "Result:" comment lines
documenting expected results, comparing the actual results to
those expected.  Clones the repository if it has not already
been cloned into the "tools/memory-model/litmus" directory.

initlitmushist.sh
Run all litmus tests having no more than the specified number
of processes given a specified timeout, recording the results in
.litmus.out files.  Clones the repository if it has not already
been cloned into the "tools/memory-model/litmus" directory.

newlitmushist.sh
For all new or updated litmus tests having no more than the
specified number of processes given a specified timeout, run
and record the results in .litmus.out files.

checklitmushist.sh
Run all litmus tests having .litmus.out files from previous
initlitmushist.sh or newlitmushist.sh runs, comparing the
herd output to that of the original runs.

The above scripts will run litmus tests concurrently, by default with
one job per available CPU.  Giving any of these scripts the --help
argument will cause them to print usage information.

This commit also adds a number of helper scripts that are not intended
to be invoked from the command line:

cmplitmushist.sh: Compare the output of two different runs of the same
litmus test.

judgelitmus.sh: Compare the output of a litmus test to its "Result:"
comment line.

parseargs.sh: Parse command-line arguments.

runlitmushist.sh: Run the litmus tests whose pathnames are provided one
per line on standard input.

While in the area, this commit also makes the existing checklitmus.sh
and checkalllitmus.sh scripts use parseargs.sh in order to provide a
bit of uniformity.  In addition, per-litmus-test status output is directed
to stdout, while end-of-test summary information is directed to stderr.
Finally, the error flag standardizes on "!!!" to assist those familiar
with rcutorture output.

The defaults for the parseargs.sh arguments may be overridden by using
environment variables: LKMM_DESTDIR for --destdir, LKMM_HERD_OPTIONS
for --herdoptions, LKMM_JOBS for --jobs, LKMM_PROCS for --procs, and
LKMM_TIMEOUT for --timeout.

Signed-off-by: Paul E. McKenney 
[ paulmck: History-check summary-line changes per Alan Stern feedback. ]
---
 tools/memory-model/.gitignore |   1 +
 tools/memory-model/README |   2 +
 tools/memory-model/scripts/README |  70 ++
 tools/memory-model/scripts/checkalllitmus.sh  |  53 
 tools/memory-model/scripts/checkghlitmus.sh   |  65 +
 tools/memory-model/scripts/checklitmus.sh |  74 ++
 tools/memory-model/scripts/checklitmushist.sh |  60 +
 tools/memory-model/scripts/cmplitmushist.sh   |  87 
 tools/memory-model/scripts/initlitmushist.sh  |  68 ++
 tools/memory-model/scripts/judgelitmus.sh |  78 +++
 tools/memory-model/scripts/newlitmushist.sh   |  61 +
 tools/memory-model/scripts/parseargs.sh   | 126 ++
 tools/memory-model/scripts/runlitmushist.sh   |  87 
 13 files changed, 739 insertions(+), 93 deletions(-)
 create mode 100644 tools/memory-model/.gitignore
 create mode 100644 tools/memory-model/scripts/README
 create mode 100755 tools/memory-model/scripts/checkghlitmus.sh
 create mode 100755 tools/memory-model/scripts/checklitmushist.sh
 create mode 100644 tools/memory-model/scripts/cmplitmushist.sh
 create mode 100755 tools/memory-model/scripts/initlitmushist.sh
 create mode 100755 tools/memory-model/scripts/judgelitmus.sh
 create mode 100755 tools/memory-model/scripts/newlitmushist.sh
 create mode 100755 tools/memory-model/scripts/parseargs.sh
 create mode 100755 tools/memory-model/scripts/runlitmushist.sh

diff --git a/tools/memory-model/.gitignore b/tools/memory-model/.gitignore
new file mode 100644
index ..b1d34c52f3c3
--- /dev/null
+++ b/tools/memory-model/.gitignore
@@ -0,0 +1 @@
+litmus
diff --git a/tools/memory-model/README b/tools/memory-model/README
index acf9077cffaa..0f2c366518c6 100644
--- a/tools/memory-model/README
+++ b/tools/memory-model/README
@@ -156,6 +156,8 @@ lock.cat
 README
This file.
 
+scripts		Various scripts, see scripts/README.
+
 
 ===
 LIMITATIONS
diff --git a/tools/memory-model/scripts/README b/tools/memory-model/scripts/README
new file mode 100644
index ..29375a1fbb

[PATCH memory-model 3/3] EXP tools/memory-model: Make scripts take "-j" abbreviation for "--jobs"

2018-12-03 Thread Paul E. McKenney
From: "Paul E. McKenney" 

The "--jobs" argument to the litmus-test scripts is similar to the "-jN"
argument to "make", so this commit allows the "-jN" form as well.  While
in the area, it also prohibits the various forms of "-j0".

Suggested-by: Alan Stern 
Signed-off-by: Paul E. McKenney 
---
 tools/memory-model/scripts/parseargs.sh | 14 --
 1 file changed, 12 insertions(+), 2 deletions(-)

diff --git a/tools/memory-model/scripts/parseargs.sh b/tools/memory-model/scripts/parseargs.sh
index 96b307c8d64a..859e1d581e05 100755
--- a/tools/memory-model/scripts/parseargs.sh
+++ b/tools/memory-model/scripts/parseargs.sh
@@ -95,8 +95,18 @@ do
LKMM_HERD_OPTIONS="$2"
shift
;;
-   --jobs|--job)
-   checkarg --jobs "(number)" "$#" "$2" '^[0-9]\+$' '^--'
+   -j[1-9]*)
+   njobs="`echo $1 | sed -e 's/^-j//'`"
+   trailchars="`echo $njobs | sed -e 's/[0-9]\+\(.*\)$/\1/'`"
+   if test -n "$trailchars"
+   then
+   echo $1 trailing characters "'$trailchars'"
+   usagehelp
+   fi
+   LKMM_JOBS="`echo $njobs | sed -e 's/^\([0-9]\+\).*$/\1/'`"
+   ;;
+   --jobs|--job|-j)
+   checkarg --jobs "(number)" "$#" "$2" '^[1-9][0-9]\+$' '^--'
LKMM_JOBS="$2"
shift
;;
-- 
2.17.1



[PATCH memory-model 0/3] Updates to the formal memory model

2018-12-03 Thread Paul E. McKenney
Hello, Ingo!

This series contains updates to the Linux kernel's formal memory model
in tools/memory-model.  These patches are ready for inclusion into -tip.

1.  Model smp_mb__after_unlock_lock(), courtesy of Andrea Parri.

2.  Add scripts to check github litmus tests.

3.  Make scripts take "-j" abbreviation for "--jobs".

There is another series in preparation to model SRCU, but this series
requires hot-off-the-presses changes to the herd tool that have not yet
been released.  This SRCU series is therefore targeting the merge window
after the upcoming one.  People wishing to experiment with the prototype
SRCU model may obtain it from my -rcu tree at branch "dev", and use
a bleeding-edge herd7 built from https://github.com/herd/herdtools7/,
version 7.51+2(dev), which is commit 10403b24070c or later.

Thanx, Paul



 .gitignore |1 
 README |2 
 linux-kernel.bell  |3 
 linux-kernel.cat   |4 -
 linux-kernel.def   |1 
 scripts/README |   70 ++
 scripts/checkalllitmus.sh  |   53 +++--
 scripts/checkghlitmus.sh   |   65 
 scripts/checklitmus.sh |   74 +++
 scripts/checklitmushist.sh |   60 +++
 scripts/cmplitmushist.sh   |   87 +++
 scripts/initlitmushist.sh  |   68 +
 scripts/judgelitmus.sh |   78 +
 scripts/newlitmushist.sh   |   61 +++
 scripts/parseargs.sh   |  140 -
 scripts/runlitmushist.sh   |   87 +++
 16 files changed, 757 insertions(+), 97 deletions(-)



[GIT PULL rcu/next] RCU commits for 4.21/5.0

2018-12-03 Thread Paul E. McKenney
Hello, Ingo,

This pull request contains the following changes:

1.  Convert RCU's BUG_ON() and similar calls to WARN_ON() and similar
(a conversion sketch follows this list).

http://lkml.kernel.org/r/2018193156.ga3...@linux.ibm.com

2.  Replace calls to RCU-bh and RCU-sched update-side functions
with their vanilla RCU counterparts.  This series is a step
towards complete removal of the RCU-bh and RCU-sched update-side
functions.

http://lkml.kernel.org/r/2018194104.ga4...@linux.ibm.com

Note that several of the patches sent out have been dropped from
this series due to having been pulled in by their respective
maintainers.

3.  Documentation updates, including a number of flavor-consolidation
updates from Joel Fernandes.

http://lkml.kernel.org/r/2018195619.ga6...@linux.ibm.com

4.  Miscellaneous fixes.

http://lkml.kernel.org/r/2018192839.ga32...@linux.ibm.com

5.  Automate generation of the initrd filesystem used for
rcutorture testing.

http://lkml.kernel.org/r/2018200127.ga9...@linux.ibm.com

6.  Convert spin_is_locked() assertions to instead use lockdep
(a conversion sketch follows this list).

http://lkml.kernel.org/r/2018200421.ga10...@linux.ibm.com

Note that several of the patches sent out have been dropped from
this series due to having been pulled in by their respective
maintainers.

7.  SRCU updates, especially including a fix from Dennis Krein
for a bag-on-head-class bug.

http://lkml.kernel.org/r/2018200834.ga10...@linux.ibm.com

8.  RCU torture-test updates.

http://lkml.kernel.org/r/2018201956.ga11...@linux.ibm.com
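
As illustrative sketches of the conversion patterns in items 1 and 6
above (hypothetical call sites, not hunks from the series):

	/* Item 1: warn once and attempt to continue rather than halting. */
	-	BUG_ON(!rcu_is_watching());
	+	WARN_ON_ONCE(!rcu_is_watching());

	/* Item 6: lockdep can check that the current task holds the lock;
	 * spin_is_locked() cannot, and is unreliable on UP builds. */
	-	WARN_ON_ONCE(!spin_is_locked(&foo->lock));
	+	lockdep_assert_held(&foo->lock);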

All of these changes have been subjected to 0day Test Robot and -next
testing, and are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu.git for-mingo

for you to fetch changes up to 5ac7cdc29897e5fc3f5e214f3f8c8b03ef8d7029:

  rcutorture: Don't do busted forward-progress testing (2018-12-01 12:45:42 -0800)


Connor Shu (1):
  rcutorture: Automatically create initrd directory

Dennis Krein (1):
  srcu: Lock srcu_data structure in srcu_gp_start()

Joe Perches (1):
  checkpatch: Create table of obsolete APIs and apply to RCU

Joel Fernandes (Google) (19):
  rcu: Remove unused rcu_state externs
  rcu: Fix rcu_{node,data} comments about gp_seq_needed
  doc: Clarify RCU data-structure comment about rcu_tree fanout
  doc: Remove rcu_preempt_state reference in stallwarn
  doc: Update information about resched_cpu
  doc: Remove rcu_dynticks from Data-Structures
  doc: rcu: Update Data-Structures for RCU flavor consolidation
  doc: rcu: Better clarify the rcu_segcblist ->len field
  doc: rcu: Update description of gp_seq fields in rcu_data
  doc: rcu: Update core and full API in whatisRCU
  doc: rcu: Add more rationale for using rcu_read_lock_sched in checklist
  doc: rcu: Remove obsolete suggestion from checklist
  doc: rcu: Remove obsolete checklist item about synchronize_rcu usage
  doc: rcu: Encourage use of rcu_barrier in checklist
  doc: Make reader aware of rcu_dereference_protected
  doc: Remove obsolete (non-)requirement about disabling preemption
  doc: Make listing in RCU perf/scale requirements use rcu_assign_pointer()
  doc: Correct parameter in stallwarn
  doc: Fix "struction" typo in RCU memory-ordering documentation

Lance Roy (7):
  x86/PCI: Replace spin_is_locked() with lockdep
  sfc: Replace spin_is_locked() with lockdep
  smsc: Replace spin_is_locked() with lockdep
  userfaultfd: Replace spin_is_locked() with lockdep
  locking/mutex: Replace spin_is_locked() with lockdep
  mm: Replace spin_is_locked() with lockdep
  KVM: arm/arm64: vgic: Replace spin_is_locked() with lockdep

Paul E. McKenney (77):
  rcu: Eliminate BUG_ON() for sync.c
  rcu: Eliminate BUG_ON() for kernel/rcu/tree.c
  rcu: Eliminate synchronize_rcu_mult()
  rcu: Consolidate the RCU update functions invoked by sync.c
  sched/membarrier: Replace synchronize_sched() with synchronize_rcu()
  sparc/oprofile: Convert timer_stop() to use synchronize_rcu()
  s390/mm: Convert tlb_table_flush() to use call_rcu()
  powerpc: Convert hugepd_free() to use call_rcu()
  doc: Set down forward-progress requirements
  rcutorture: Add initrd support for systems lacking dracut
  rcutorture: Make initrd/init execute in userspace
  rcutorture: Add cross-compile capability to initrd.sh
  srcu: Prevent __call_srcu() counter wrap with read-side critical section
  rcu: Stop expedited grace periods from relying on stop-machine
  rcu: Eliminate BUG_ON() for kernel/rcu/tree_plugin.h
  rcu: Eliminate BUG_ON() for kernel/rcu/update.c
  doc: Document rcutorture forward-progres

Re: rcu_preempt caused oom

2018-12-03 Thread Paul E. McKenney
On Mon, Dec 03, 2018 at 07:44:03AM +, He, Bo wrote:
> Thanks, we have run the test for the whole weekend and did not reproduce the
> issue, so we confirm that CONFIG_RCU_BOOST fixes the issue.

Very good, that is encouraging.  Perhaps I should think about making
CONFIG_RCU_BOOST=y the default for CONFIG_PREEMPT in mainline, at least
for architectures for which rt_mutexes are implemented.
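
For reference, a sketch of the .config fragment involved (on v4.19,
RCU_BOOST sits behind RCU_EXPERT; the delay value shown is the usual
default):

	CONFIG_RCU_EXPERT=y
	CONFIG_RCU_BOOST=y
	CONFIG_RCU_BOOST_DELAY=500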

> We have enabled the rcupdate.rcu_cpu_stall_timeout=7 and also set panic on 
> rcu stall and will see if we can see the panic, will keep you posed with the 
> test results.
> echo 1 > /proc/sys/kernel/panic_on_rcu_stall

Looking forward to seeing what is going on!  Of course, to reproduce, you
will need to again build with CONFIG_RCU_BOOST=n.

Thanx, Paul

> -Original Message-
> From: Paul E. McKenney  
> Sent: Saturday, December 1, 2018 12:49 AM
> To: He, Bo 
> Cc: Steven Rostedt ; linux-kernel@vger.kernel.org; 
> j...@joshtriplett.org; mathieu.desnoy...@efficios.com; 
> jiangshan...@gmail.com; Zhang, Jun ; Xiao, Jin 
> ; Zhang, Yanmin 
> Subject: Re: rcu_preempt caused oom
> 
> On Fri, Nov 30, 2018 at 03:18:58PM +, He, Bo wrote:
> > Here is the kernel cmdline:
> 
> Thank you!
> 
> > Kernel command line: androidboot.acpio_idx=0  
> > androidboot.bootloader=efiwrapper-02_03-userdebug_kernelflinger-06_03-
> > userdebug androidboot.diskbus=00.0 androidboot.verifiedbootstate=green 
> > androidboot.bootreason=power-on androidboot.serialno=R1J56L6006a7bb 
> > g_ffs.iSerialNumber=R1J56L6006a7bb no_timer_check noxsaves 
> > reboot_panic=p,w i915.hpd_sense_invert=0x7 mem=2G nokaslr nopti 
> > ftrace_dump_on_oops trace_buf_size=1024K intel_iommu=off gpt 
> > loglevel=4 androidboot.hardware=gordon_peak 
> > firmware_class.path=/vendor/firmware relative_sleep_states=1 
> > enforcing=0 androidboot.selinux=permissive cpu_init_udelay=10 
> > androidboot.android_dt_dir=/sys/bus/platform/devices/ANDR0001:00/properties/android/
> > pstore.backend=ramoops memmap=0x140$0x5000 
> > ramoops.mem_address=0x5000 ramoops.mem_size=0x140 
> > ramoops.record_size=0x4000 ramoops.console_size=0x100 
> > ramoops.ftrace_size=0x1 ramoops.dump_oops=1 vga=current
> > i915.modeset=1 drm.atomic=1 i915.nuclear_pageflip=1 
> > drm.vblankoffdelay=
> 
> And no sign of any suppression of RCU CPU stall warnings.  Hmmm...
> It does take more than 21 seconds to OOM?  Or do things happen faster than 
> that?  If they do happen faster than that, then one approach would be to add 
> something like this to the kernel command line:
> 
>   rcupdate.rcu_cpu_stall_timeout=7
> 
> This would set the stall timeout to seven seconds.  Note that timeouts less 
> than three seconds are silently interpreted as three seconds.
> 
>   Thanx, Paul
> 
> > -Original Message-
> > From: Steven Rostedt 
> > Sent: Friday, November 30, 2018 11:17 PM
> > To: Paul E. McKenney 
> > Cc: He, Bo ; linux-kernel@vger.kernel.org; 
> > j...@joshtriplett.org; mathieu.desnoy...@efficios.com; 
> > jiangshan...@gmail.com; Zhang, Jun ; Xiao, Jin 
> > ; Zhang, Yanmin 
> > Subject: Re: rcu_preempt caused oom
> > 
> > On Fri, 30 Nov 2018 06:43:17 -0800
> > "Paul E. McKenney"  wrote:
> > 
> > > Could you please send me your list of kernel boot parameters?  They 
> > > usually appear near the start of your console output.
> > 
> > Or just: cat /proc/cmdline
> > 
> > -- Steve
> > 
> 



Re: [PATCH] locktorture: style fix - spaces required around

2018-12-01 Thread Paul E. McKenney
On Sat, Dec 01, 2018 at 04:40:39PM +0800, Wen Yang wrote:
> This patch fixes the following checkpatch.pl errors:
> 
> ERROR: spaces required around that ':' (ctx:VxW)
> +torture_type, tag, cxt.debug_lock ? " [debug]": "",
>^
> 
> Signed-off-by: Wen Yang 
> CC: Davidlohr Bueso 
> CC: "Paul E. McKenney" 
> CC: Josh Triplett 
> CC: linux-kernel@vger.kernel.org

Adding the current maintainers on CC.

Thanx, Paul

> ---
>  kernel/locking/locktorture.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/kernel/locking/locktorture.c b/kernel/locking/locktorture.c
> index cd95c01491d8..d1d8356b770a 100644
> --- a/kernel/locking/locktorture.c
> +++ b/kernel/locking/locktorture.c
> @@ -790,7 +790,7 @@ lock_torture_print_module_parms(struct lock_torture_ops *cur_ops,
>  {
>   pr_alert("%s" TORTURE_FLAG
>   "--- %s%s: nwriters_stress=%d nreaders_stress=%d stat_interval=%d verbose=%d shuffle_interval=%d stutter=%d shutdown_secs=%d onoff_interval=%d onoff_holdoff=%d\n",
> -  torture_type, tag, cxt.debug_lock ? " [debug]": "",
> +  torture_type, tag, cxt.debug_lock ? " [debug]" : "",
>cxt.nrealwriters_stress, cxt.nrealreaders_stress, 
> stat_interval,
>verbose, shuffle_interval, stutter, shutdown_secs,
>onoff_interval, onoff_holdoff);
> -- 
> 2.19.1
> 



Re: [PATCH] locktorture: Fix assignment of boolean variables

2018-12-01 Thread Paul E. McKenney
On Sat, Dec 01, 2018 at 04:31:49PM +0800, Wen Yang wrote:
> Fix the following warnings reported by coccinelle:
> 
> kernel/locking/locktorture.c:703:6-10: WARNING: Assignment of bool to 0/1
> kernel/locking/locktorture.c:918:2-20: WARNING: Assignment of bool to 0/1
> kernel/locking/locktorture.c:949:3-20: WARNING: Assignment of bool to 0/1
> kernel/locking/locktorture.c:682:2-19: WARNING: Assignment of bool to 0/1
> kernel/locking/locktorture.c:688:2-19: WARNING: Assignment of bool to 0/1
> kernel/locking/locktorture.c:648:2-20: WARNING: Assignment of bool to 0/1
> kernel/locking/locktorture.c:654:2-20: WARNING: Assignment of bool to 0/1
> 
> This patch also makes the code more readable.
> 
> Signed-off-by: Wen Yang 
> CC: Davidlohr Bueso 
> CC: "Paul E. McKenney" 
> CC: Josh Triplett 
> CC: linux-kernel@vger.kernel.org

Adding the current maintainers on CC.

Thanx, Paul

> ---
>  kernel/locking/locktorture.c | 14 +++---
>  1 file changed, 7 insertions(+), 7 deletions(-)
> 
> diff --git a/kernel/locking/locktorture.c b/kernel/locking/locktorture.c
> index 7d0b0ed74404..cd95c01491d8 100644
> --- a/kernel/locking/locktorture.c
> +++ b/kernel/locking/locktorture.c
> @@ -645,13 +645,13 @@ static int lock_torture_writer(void *arg)
>   cxt.cur_ops->writelock();
>   if (WARN_ON_ONCE(lock_is_write_held))
>   lwsp->n_lock_fail++;
> - lock_is_write_held = 1;
> + lock_is_write_held = true;
>   if (WARN_ON_ONCE(lock_is_read_held))
>   lwsp->n_lock_fail++; /* rare, but... */
>  
>   lwsp->n_lock_acquired++;
>   cxt.cur_ops->write_delay();
> - lock_is_write_held = 0;
> + lock_is_write_held = false;
>   cxt.cur_ops->writeunlock();
>  
>   stutter_wait("lock_torture_writer");
> @@ -679,13 +679,13 @@ static int lock_torture_reader(void *arg)
>   schedule_timeout_uninterruptible(1);
>  
>   cxt.cur_ops->readlock();
> - lock_is_read_held = 1;
> + lock_is_read_held = true;
>   if (WARN_ON_ONCE(lock_is_write_held))
>   lrsp->n_lock_fail++; /* rare, but... */
>  
>   lrsp->n_lock_acquired++;
>   cxt.cur_ops->read_delay();
> - lock_is_read_held = 0;
> + lock_is_read_held = false;
>   cxt.cur_ops->readunlock();
>  
>   stutter_wait("lock_torture_reader");
> @@ -700,7 +700,7 @@ static int lock_torture_reader(void *arg)
>  static void __torture_print_stats(char *page,
> struct lock_stress_stats *statp, bool write)
>  {
> - bool fail = 0;
> + bool fail = false;
>   int i, n_stress;
>   long max = 0, min = statp ? statp[0].n_lock_acquired : 0;
>   long long sum = 0;
> @@ -915,7 +915,7 @@ static int __init lock_torture_init(void)
>  
>   /* Initialize the statistics so that each run gets its own numbers. */
>   if (nwriters_stress) {
> - lock_is_write_held = 0;
> + lock_is_write_held = false;
>   cxt.lwsa = kmalloc_array(cxt.nrealwriters_stress,
>sizeof(*cxt.lwsa),
>GFP_KERNEL);
> @@ -946,7 +946,7 @@ static int __init lock_torture_init(void)
>   }
>  
>   if (nreaders_stress) {
> - lock_is_read_held = 0;
> + lock_is_read_held = false;
>   cxt.lrsa = kmalloc_array(cxt.nrealreaders_stress,
>sizeof(*cxt.lrsa),
>GFP_KERNEL);
> -- 
> 2.19.1
> 



Re: rcu_preempt caused oom

2018-11-30 Thread Paul E. McKenney
On Fri, Nov 30, 2018 at 03:18:58PM +, He, Bo wrote:
> Here is the kernel cmdline:

Thank you!

> Kernel command line: androidboot.acpio_idx=0  
> androidboot.bootloader=efiwrapper-02_03-userdebug_kernelflinger-06_03-userdebug
>  androidboot.diskbus=00.0 androidboot.verifiedbootstate=green 
> androidboot.bootreason=power-on androidboot.serialno=R1J56L6006a7bb 
> g_ffs.iSerialNumber=R1J56L6006a7bb no_timer_check noxsaves reboot_panic=p,w 
> i915.hpd_sense_invert=0x7 mem=2G nokaslr nopti ftrace_dump_on_oops 
> trace_buf_size=1024K intel_iommu=off gpt loglevel=4 
> androidboot.hardware=gordon_peak firmware_class.path=/vendor/firmware 
> relative_sleep_states=1 enforcing=0 androidboot.selinux=permissive 
> cpu_init_udelay=10 
> androidboot.android_dt_dir=/sys/bus/platform/devices/ANDR0001:00/properties/android/
>  pstore.backend=ramoops memmap=0x140$0x5000 
> ramoops.mem_address=0x5000 ramoops.mem_size=0x140 
> ramoops.record_size=0x4000 ramoops.console_size=0x100 
> ramoops.ftrace_size=0x1 ramoops.dump_oops=1 vga=current 
> i915.modeset=1 drm.atomic=1 i915.nuclear_pageflip=1 drm.vblankoffdelay=

And no sign of any suppression of RCU CPU stall warnings.  Hmmm...
It does take more than 21 seconds to OOM?  Or do things happen faster
than that?  If they do happen faster than that, then one approach would
be to add something like this to the kernel command line:

rcupdate.rcu_cpu_stall_timeout=7

This would set the stall timeout to seven seconds.  Note that timeouts
less than three seconds are silently interpreted as three seconds.
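
The same setting can also be applied at runtime through the module
parameter (a sketch, assuming sysfs is mounted in the usual place):

	echo 7 > /sys/module/rcupdate/parameters/rcu_cpu_stall_timeout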

Thanx, Paul

> -Original Message-
> From: Steven Rostedt  
> Sent: Friday, November 30, 2018 11:17 PM
> To: Paul E. McKenney 
> Cc: He, Bo ; linux-kernel@vger.kernel.org; 
> j...@joshtriplett.org; mathieu.desnoy...@efficios.com; 
> jiangshan...@gmail.com; Zhang, Jun ; Xiao, Jin 
> ; Zhang, Yanmin 
> Subject: Re: rcu_preempt caused oom
> 
> On Fri, 30 Nov 2018 06:43:17 -0800
> "Paul E. McKenney"  wrote:
> 
> > Could you please send me your list of kernel boot parameters?  They 
> > usually appear near the start of your console output.
> 
> Or just: cat /proc/cmdline
> 
> -- Steve
> 



Re: rcu_preempt caused oom

2018-11-30 Thread Paul E. McKenney
On Fri, Nov 30, 2018 at 08:03:38AM +, He, Bo wrote:
> Thanks for your great suggestions.
> After enabling CONFIG_RCU_BOOST=y, we have not reproduced the issue so far;
> we will keep it running and update you with the test results.
> 
> The enclosed is the kernel config, here is the config I grep with the RCU, we 
> don't enable the CONFIG_RCU_BOOST in our build.
> # RCU Subsystem
> CONFIG_PREEMPT_RCU=y
> # CONFIG_RCU_EXPERT is not set
> CONFIG_SRCU=y
> CONFIG_TREE_SRCU=y
> CONFIG_TASKS_RCU=y
> CONFIG_RCU_STALL_COMMON=y
> CONFIG_RCU_NEED_SEGCBLIST=y
> # RCU Debugging
> CONFIG_RCU_PERF_TEST=m
> CONFIG_RCU_TORTURE_TEST=m
> CONFIG_RCU_CPU_STALL_TIMEOUT=21
> CONFIG_RCU_TRACE=y
> CONFIG_RCU_EQS_DEBUG=y

Thank you!

What likely happened is that a low-priority RCU reader was preempted
indefinitely.  Though I would have expected an RCU CPU stall warning
in that case, so it might well be that something else is going on.
Could you please send me your list of kernel boot parameters?  They
usually appear near the start of your console output.
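
For illustration, the hypothesized failure mode looks roughly like this
(a sketch; the names are made up):

	rcu_read_lock();                /* low-priority reader enters */
	p = rcu_dereference(global_ptr);
	do_slow_work(p);                /* preempted here by higher-priority
					 * tasks; with RCU_BOOST=n nothing
					 * boosts the reader, so the grace
					 * period waits indefinitely */
	rcu_read_unlock();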

        Thanx, Paul

> -Original Message-
> From: Paul E. McKenney  
> Sent: Thursday, November 29, 2018 10:27 PM
> To: He, Bo 
> Cc: linux-kernel@vger.kernel.org; j...@joshtriplett.org; rost...@goodmis.org; 
> mathieu.desnoy...@efficios.com; jiangshan...@gmail.com; Zhang, Jun 
> ; Xiao, Jin ; Zhang, Yanmin 
> 
> Subject: Re: rcu_preempt caused oom
> 
> On Thu, Nov 29, 2018 at 05:06:47AM -0800, Paul E. McKenney wrote:
> > On Thu, Nov 29, 2018 at 08:49:35AM +, He, Bo wrote:
> > > Hi, 
> > >   we test on kernel 4.19.0 on android, after run more than 24 Hours 
> > > monkey stress test, we see OOM on 1/10 2G memory board, the issue is not 
> > > seen on the 4.14 kernel.
> > > we have done some debugs:
> > > 1. OOM is due to the filp consume too many memory: 300M vs 2G board.
> > > 2. with the 120s hung task detect, most of the tasks will block at 
> > __wait_rcu_gp: wait_for_completion(&rs_array[i].completion);
> 
> Did you see any RCU CPU stall warnings?  Or have those been disabled?
> If they have been disabled, could you please rerun with them enabled?
> 
> > > [47571.863839] Kernel panic - not syncing: hung_task: blocked tasks
> > > [47571.875446] CPU: 1 PID: 13626 Comm: FinalizerDaemon Tainted: G U O  4.19.0-quilt-2e5dc0ac-gf3f313245eb6 #1
> > > [47571.887603] Call Trace:
> > > [47571.890547]  dump_stack+0x70/0xa5
> > > [47571.894456]  panic+0xe3/0x241
> > > [47571.897977]  ? wait_for_completion_timeout+0x72/0x1b0
> > > [47571.903830]  __wait_rcu_gp+0x17b/0x180
> > > [47571.908226]  synchronize_rcu.part.76+0x38/0x50
> > > [47571.913393]  ? __call_rcu.constprop.79+0x3a0/0x3a0
> > > [47571.918948]  ? __bpf_trace_rcu_invoke_callback+0x10/0x10
> > > [47571.925094]  synchronize_rcu+0x43/0x50
> > > [47571.929487]  evdev_detach_client+0x59/0x60
> > > [47571.934264]  evdev_release+0x4e/0xd0
> > > [47571.938464]  __fput+0xfa/0x1f0
> > > [47571.942072]  fput+0xe/0x10
> > > [47571.945683]  task_work_run+0x90/0xc0
> > > [47571.949884]  exit_to_usermode_loop+0x9f/0xb0
> > > [47571.954855]  do_syscall_64+0xfa/0x110
> > > [47571.959151]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
> 
> This is indeed a task waiting on synchronize_rcu().
> 
> > > 3. after enable the rcu trace, we don't see rcu_quiescent_state_report 
> > > trace in a long time, we see rcu_callback: rcu_preempt will never 
> > > response with the rcu_invoke_callback.
> > > [47572.040668]  ps-12388   1d..1 47566097572us : rcu_grace_period: 
> > > rcu_preempt 23716088 AccWaitCB
> > > [47572.040707]  ps-12388   1d... 47566097621us : rcu_callback: 
> > > rcu_preempt rhp=783a728b func=file_free_rcu 4354/82824
> > > [47572.040734]  ps-12388   1d..1 47566097622us : 
> > > rcu_future_grace_period: rcu_preempt 23716088 23716092 0 0 3 Startleaf
> > > [47572.040756]  ps-12388   1d..1 47566097623us : 
> > > rcu_future_grace_period: rcu_preempt 23716088 23716092 0 0 3 Prestarted
> > > [47572.040778]  ps-12388   1d..1 47566097623us : rcu_grace_period: 
> > > rcu_preempt 23716088 AccWaitCB
> > > [47572.040802]  ps-12388   1d... 47566097674us : rcu_callback: 
> > > rcu_preempt rhp=42c76521 func=file_free_rcu 4354/82825
> > > [47572.040824]  ps-12388   1d..1 47566097676us : 
> > > rcu_future_grace_period: rcu_preempt 23716088 23716092 0 0 3 Startleaf
> > > [47572.040847]  ps-1238

Re: rcu_preempt caused oom

2018-11-29 Thread Paul E. McKenney
On Thu, Nov 29, 2018 at 05:06:47AM -0800, Paul E. McKenney wrote:
> On Thu, Nov 29, 2018 at 08:49:35AM +, He, Bo wrote:
> > Hi, 
> >   we test on kernel 4.19.0 on android, after run more than 24 Hours 
> > monkey stress test, we see OOM on 1/10 2G memory board, the issue is not 
> > seen on the 4.14 kernel.
> > we have done some debugs:
> > 1. OOM is due to the filp consume too many memory: 300M vs 2G board.
> > 2. with the 120s hung task detect, most of the tasks will block at 
> > __wait_rcu_gp: wait_for_completion(&rs_array[i].completion);

Did you see any RCU CPU stall warnings?  Or have those been disabled?
If they have been disabled, could you please rerun with them enabled?

> > [47571.863839] Kernel panic - not syncing: hung_task: blocked tasks
> > [47571.875446] CPU: 1 PID: 13626 Comm: FinalizerDaemon Tainted: G U O  4.19.0-quilt-2e5dc0ac-gf3f313245eb6 #1
> > [47571.887603] Call Trace:
> > [47571.890547]  dump_stack+0x70/0xa5
> > [47571.894456]  panic+0xe3/0x241
> > [47571.897977]  ? wait_for_completion_timeout+0x72/0x1b0
> > [47571.903830]  __wait_rcu_gp+0x17b/0x180
> > [47571.908226]  synchronize_rcu.part.76+0x38/0x50
> > [47571.913393]  ? __call_rcu.constprop.79+0x3a0/0x3a0
> > [47571.918948]  ? __bpf_trace_rcu_invoke_callback+0x10/0x10
> > [47571.925094]  synchronize_rcu+0x43/0x50
> > [47571.929487]  evdev_detach_client+0x59/0x60
> > [47571.934264]  evdev_release+0x4e/0xd0
> > [47571.938464]  __fput+0xfa/0x1f0
> > [47571.942072]  fput+0xe/0x10
> > [47571.945683]  task_work_run+0x90/0xc0
> > [47571.949884]  exit_to_usermode_loop+0x9f/0xb0
> > [47571.954855]  do_syscall_64+0xfa/0x110
> > [47571.959151]  entry_SYSCALL_64_after_hwframe+0x49/0xbe

This is indeed a task waiting on synchronize_rcu().

> > 3. after enable the rcu trace, we don't see rcu_quiescent_state_report 
> > trace in a long time, we see rcu_callback: rcu_preempt will never response 
> > with the rcu_invoke_callback.
> > [47572.040668]  ps-12388   1d..1 47566097572us : rcu_grace_period: 
> > rcu_preempt 23716088 AccWaitCB
> > [47572.040707]  ps-12388   1d... 47566097621us : rcu_callback: 
> > rcu_preempt rhp=783a728b func=file_free_rcu 4354/82824
> > [47572.040734]  ps-12388   1d..1 47566097622us : 
> > rcu_future_grace_period: rcu_preempt 23716088 23716092 0 0 3 Startleaf
> > [47572.040756]  ps-12388   1d..1 47566097623us : 
> > rcu_future_grace_period: rcu_preempt 23716088 23716092 0 0 3 Prestarted
> > [47572.040778]  ps-12388   1d..1 47566097623us : rcu_grace_period: 
> > rcu_preempt 23716088 AccWaitCB
> > [47572.040802]  ps-12388   1d... 47566097674us : rcu_callback: 
> > rcu_preempt rhp=42c76521 func=file_free_rcu 4354/82825
> > [47572.040824]  ps-12388   1d..1 47566097676us : 
> > rcu_future_grace_period: rcu_preempt 23716088 23716092 0 0 3 Startleaf
> > [47572.040847]  ps-12388   1d..1 47566097676us : 
> > rcu_future_grace_period: rcu_preempt 23716088 23716092 0 0 3 Prestarted
> > [47572.040868]  ps-12388   1d..1 47566097676us : rcu_grace_period: 
> > rcu_preempt 23716088 AccWaitCB
> > [47572.040895]  ps-12388   1d..1 47566097716us : rcu_callback: 
> > rcu_preempt rhp=5e40fde2 func=avc_node_free 4354/82826
> > [47572.040919]  ps-12388   1d..1 47566097735us : rcu_callback: 
> > rcu_preempt rhp=f80fe353 func=avc_node_free 4354/82827
> > [47572.040943]  ps-12388   1d..1 47566097758us : rcu_callback: 
> > rcu_preempt rhp=7486f400 func=avc_node_free 4354/82828
> > [47572.040967]  ps-12388   1d..1 47566097760us : rcu_callback: 
> > rcu_preempt rhp=b87872a8 func=avc_node_free 4354/82829
> > [47572.040990]  ps-12388   1d... 47566097789us : rcu_callback: 
> > rcu_preempt rhp=8c656343 func=file_free_rcu 4354/82830
> > [47572.041013]  ps-12388   1d..1 47566097790us : 
> > rcu_future_grace_period: rcu_preempt 23716088 23716092 0 0 3 Startleaf
> > [47572.041036]  ps-12388   1d..1 47566097790us : 
> > rcu_future_grace_period: rcu_preempt 23716088 23716092 0 0 3 Prestarted
> > [47572.041057]  ps-12388   1d..1 47566097791us : rcu_grace_period: 
> > rcu_preempt 23716088 AccWaitCB
> > [47572.041081]  ps-12388   1d... 47566097871us : rcu_callback: 
> > rcu_preempt rhp=7e6c898c func=file_free_rcu 4354/82831
> > [47572.041103]  ps-12388   1d..1 47566097872us : 
> > rcu_future_grace_period: rcu_preempt 23716088 23716092 0 0 3 Startleaf
> > [47572.041126]  ps-12388   1d..1 47566097872us : 
> > rcu_future_grace_period: rcu_preempt 23716088 2371609

Re: rcu_preempt caused oom

2018-11-29 Thread Paul E. McKenney
On Thu, Nov 29, 2018 at 08:49:35AM +, He, Bo wrote:
> Hi, 
>   we test on kernel 4.19.0 on android, after run more than 24 Hours 
> monkey stress test, we see OOM on 1/10 2G memory board, the issue is not seen 
> on the 4.14 kernel.
> we have done some debugs:
> 1. OOM is due to the filp consume too many memory: 300M vs 2G board.
> 2. with the 120s hung task detect, most of the tasks will block at 
> __wait_rcu_gp: wait_for_completion(&rs_array[i].completion);
> [47571.863839] Kernel panic - not syncing: hung_task: blocked tasks
> [47571.875446] CPU: 1 PID: 13626 Comm: FinalizerDaemon Tainted: G U O  4.19.0-quilt-2e5dc0ac-gf3f313245eb6 #1
> [47571.887603] Call Trace:
> [47571.890547]  dump_stack+0x70/0xa5
> [47571.894456]  panic+0xe3/0x241
> [47571.897977]  ? wait_for_completion_timeout+0x72/0x1b0
> [47571.903830]  __wait_rcu_gp+0x17b/0x180
> [47571.908226]  synchronize_rcu.part.76+0x38/0x50
> [47571.913393]  ? __call_rcu.constprop.79+0x3a0/0x3a0
> [47571.918948]  ? __bpf_trace_rcu_invoke_callback+0x10/0x10
> [47571.925094]  synchronize_rcu+0x43/0x50
> [47571.929487]  evdev_detach_client+0x59/0x60
> [47571.934264]  evdev_release+0x4e/0xd0
> [47571.938464]  __fput+0xfa/0x1f0
> [47571.942072]  fput+0xe/0x10
> [47571.945683]  task_work_run+0x90/0xc0
> [47571.949884]  exit_to_usermode_loop+0x9f/0xb0
> [47571.954855]  do_syscall_64+0xfa/0x110
> [47571.959151]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
> 3. after enable the rcu trace, we don't see rcu_quiescent_state_report trace 
> in a long time, we see rcu_callback: rcu_preempt will never response with the 
> rcu_invoke_callback.
> [47572.040668]  ps-12388   1d..1 47566097572us : rcu_grace_period: 
> rcu_preempt 23716088 AccWaitCB
> [47572.040707]  ps-12388   1d... 47566097621us : rcu_callback: 
> rcu_preempt rhp=783a728b func=file_free_rcu 4354/82824
> [47572.040734]  ps-12388   1d..1 47566097622us : rcu_future_grace_period: 
> rcu_preempt 23716088 23716092 0 0 3 Startleaf
> [47572.040756]  ps-12388   1d..1 47566097623us : rcu_future_grace_period: 
> rcu_preempt 23716088 23716092 0 0 3 Prestarted
> [47572.040778]  ps-12388   1d..1 47566097623us : rcu_grace_period: 
> rcu_preempt 23716088 AccWaitCB
> [47572.040802]  ps-12388   1d... 47566097674us : rcu_callback: 
> rcu_preempt rhp=42c76521 func=file_free_rcu 4354/82825
> [47572.040824]  ps-12388   1d..1 47566097676us : rcu_future_grace_period: 
> rcu_preempt 23716088 23716092 0 0 3 Startleaf
> [47572.040847]  ps-12388   1d..1 47566097676us : rcu_future_grace_period: 
> rcu_preempt 23716088 23716092 0 0 3 Prestarted
> [47572.040868]  ps-12388   1d..1 47566097676us : rcu_grace_period: 
> rcu_preempt 23716088 AccWaitCB
> [47572.040895]  ps-12388   1d..1 47566097716us : rcu_callback: 
> rcu_preempt rhp=5e40fde2 func=avc_node_free 4354/82826
> [47572.040919]  ps-12388   1d..1 47566097735us : rcu_callback: 
> rcu_preempt rhp=f80fe353 func=avc_node_free 4354/82827
> [47572.040943]  ps-12388   1d..1 47566097758us : rcu_callback: 
> rcu_preempt rhp=7486f400 func=avc_node_free 4354/82828
> [47572.040967]  ps-12388   1d..1 47566097760us : rcu_callback: 
> rcu_preempt rhp=b87872a8 func=avc_node_free 4354/82829
> [47572.040990]  ps-12388   1d... 47566097789us : rcu_callback: 
> rcu_preempt rhp=8c656343 func=file_free_rcu 4354/82830
> [47572.041013]  ps-12388   1d..1 47566097790us : rcu_future_grace_period: 
> rcu_preempt 23716088 23716092 0 0 3 Startleaf
> [47572.041036]  ps-12388   1d..1 47566097790us : rcu_future_grace_period: 
> rcu_preempt 23716088 23716092 0 0 3 Prestarted
> [47572.041057]  ps-12388   1d..1 47566097791us : rcu_grace_period: 
> rcu_preempt 23716088 AccWaitCB
> [47572.041081]  ps-12388   1d... 47566097871us : rcu_callback: 
> rcu_preempt rhp=7e6c898c func=file_free_rcu 4354/82831
> [47572.041103]  ps-12388   1d..1 47566097872us : rcu_future_grace_period: 
> rcu_preempt 23716088 23716092 0 0 3 Startleaf
> [47572.041126]  ps-12388   1d..1 47566097872us : rcu_future_grace_period: 
> rcu_preempt 23716088 23716092 0 0 3 Prestarted
> [47572.041147]  ps-12388   1d..1 47566097873us : rcu_grace_period: 
> rcu_preempt 23716088 AccWaitCB
> [47572.041170]  ps-12388   1d... 47566097945us : rcu_callback: 
> rcu_preempt rhp=32f4f174 func=file_free_rcu 4354/82832
> [47572.041193]  ps-12388   1d..1 47566097946us : rcu_future_grace_period: 
> rcu_preempt 23716088 23716092 0 0 3 Startleaf
> 
> Do you have any suggestions to debug the issue?

If you do not already have CONFIG_RCU_BOOST=y set, could you please
rebuild with that?

Could you also please send your .config file?

Thanx, Paul



Re: [PATCH 0/3] tools/memory-model: Add SRCU support

2018-11-27 Thread Paul E. McKenney
On Wed, Nov 28, 2018 at 07:34:14AM +0900, Akira Yokosawa wrote:
> On 2018/11/27 09:17:46 -0800, Paul E. McKenney wrote:
> > On Tue, Nov 27, 2018 at 01:26:42AM +0100, Andrea Parri wrote:
> >>> commit 72f61917f12236514a70017d1ebafb9b8d34a9b6
> >>> Author: Paul E. McKenney 
> >>> Date:   Mon Nov 26 14:26:43 2018 -0800
> >>>
> >>> tools/memory-model: Update README for addition of SRCU
> >>> 
> >>> This commit updates the section on LKMM limitations to no longer say
> >>> that SRCU is not modeled, but instead describe how LKMM's modeling of
> >>> SRCU departs from the Linux-kernel implementation.
> >>> 
> >>> TL;DR:  There is no known valid use case that cares about the Linux
> >>> kernel's ability to have partially overlapping SRCU read-side critical
> >>> sections.
> >>> 
> >>> Signed-off-by: Paul E. McKenney 
> >>
> >> Indeed!,
> >>
> >> Acked-by: Andrea Parri 
> > 
> > Thank you, applied!
> > 
> > I moved this commit and Alan's three SRCU commits to the branch destined
> > for the upcoming merge window.
> 
> We need to bump the version of herdtools7 in "REQUIREMENTS". Would it be
> 7.52?

Good catch!  And I am currently using 7.51+2(dev), so I suspect that
you are right.  But 7.52 appears to still be in the future.

> Removing the explicit version number might be a better idea. Just
> say "The latest version of ...".
> 
> Thoughts?

That approach would be easier for us, but might be painful for someone
(say) five years from now trying to run the v4.20 kernel's memory model.

Thanx, Paul

> Thanks, Akira
> > 
> > Thanx, Paul
> > 
> >>   Andrea
> >>
> >>
> >>>
> >>> diff --git a/tools/memory-model/README b/tools/memory-model/README
> >>> index 0f2c366518c6..9d7d4f23503f 100644
> >>> --- a/tools/memory-model/README
> >>> +++ b/tools/memory-model/README
> >>> @@ -221,8 +221,29 @@ The Linux-kernel memory model has the following limitations:
> >>>   additional call_rcu() process to the site of the
> >>>   emulated rcu-barrier().
> >>>  
> >>> - e.  Sleepable RCU (SRCU) is not modeled.  It can be
> >>> - emulated, but perhaps not simply.
> >>> + e.  Although sleepable RCU (SRCU) is now modeled, there
> >>> + are some subtle differences between its semantics and
> >>> + those in the Linux kernel.  For example, the kernel
> >>> + might interpret the following sequence as two partially
> >>> + overlapping SRCU read-side critical sections:
> >>> +
> >>> +  1  r1 = srcu_read_lock(&my_srcu);
> >>> +  2  do_something_1();
> >>> +  3  r2 = srcu_read_lock(&my_srcu);
> >>> +  4  do_something_2();
> >>> +  5  srcu_read_unlock(&my_srcu, r1);
> >>> +  6  do_something_3();
> >>> +  7  srcu_read_unlock(&my_srcu, r2);
> >>> +
> >>> + In contrast, LKMM will interpret this as a nested pair of
> >>> + SRCU read-side critical sections, with the outer critical
> >>> + section spanning lines 1-7 and the inner critical section
> >>> + spanning lines 3-5.
> >>> +
> >>> + This difference would be more of a concern had anyone
> >>> + identified a reasonable use case for partially overlapping
> >>> + SRCU read-side critical sections.  For more information,
> >>> + please see: https://paulmck.livejournal.com/40593.html
> >>>  
> >>>   f.  Reader-writer locking is not modeled.  It can be
> >>>   emulated in litmus tests using atomic read-modify-write
> >>>
> >>
> > 
> 



Re: [PATCH 0/3] tools/memory-model: Add SRCU support

2018-11-27 Thread Paul E. McKenney
On Tue, Nov 27, 2018 at 01:26:42AM +0100, Andrea Parri wrote:
> > commit 72f61917f12236514a70017d1ebafb9b8d34a9b6
> > Author: Paul E. McKenney 
> > Date:   Mon Nov 26 14:26:43 2018 -0800
> > 
> > tools/memory-model: Update README for addition of SRCU
> > 
> > This commit updates the section on LKMM limitations to no longer say
> > that SRCU is not modeled, but instead describe how LKMM's modeling of
> > SRCU departs from the Linux-kernel implementation.
> > 
> > TL;DR:  There is no known valid use case that cares about the Linux
> > kernel's ability to have partially overlapping SRCU read-side critical
> > sections.
> > 
> > Signed-off-by: Paul E. McKenney 
> 
> Indeed!,
> 
> Acked-by: Andrea Parri 

Thank you, applied!

I moved this commit and Alan's three SRCU commits to the branch destined
for the upcoming merge window.

Thanx, Paul

>   Andrea
> 
> 
> > 
> > diff --git a/tools/memory-model/README b/tools/memory-model/README
> > index 0f2c366518c6..9d7d4f23503f 100644
> > --- a/tools/memory-model/README
> > +++ b/tools/memory-model/README
> > @@ -221,8 +221,29 @@ The Linux-kernel memory model has the following limitations:
> > additional call_rcu() process to the site of the
> > emulated rcu-barrier().
> >  
> > -   e.  Sleepable RCU (SRCU) is not modeled.  It can be
> > -   emulated, but perhaps not simply.
> > +   e.  Although sleepable RCU (SRCU) is now modeled, there
> > +   are some subtle differences between its semantics and
> > +   those in the Linux kernel.  For example, the kernel
> > +   might interpret the following sequence as two partially
> > +   overlapping SRCU read-side critical sections:
> > +
> > +1  r1 = srcu_read_lock(&my_srcu);
> > +2  do_something_1();
> > +3  r2 = srcu_read_lock(&my_srcu);
> > +4  do_something_2();
> > +5  srcu_read_unlock(&my_srcu, r1);
> > +6  do_something_3();
> > +7  srcu_read_unlock(&my_srcu, r2);
> > +
> > +   In contrast, LKMM will interpret this as a nested pair of
> > +   SRCU read-side critical sections, with the outer critical
> > +   section spanning lines 1-7 and the inner critical section
> > +   spanning lines 3-5.
> > +
> > +   This difference would be more of a concern had anyone
> > +   identified a reasonable use case for partially overlapping
> > +   SRCU read-side critical sections.  For more information,
> > +   please see: https://paulmck.livejournal.com/40593.html
> >  
> > f.  Reader-writer locking is not modeled.  It can be
> > emulated in litmus tests using atomic read-modify-write
> > 
> 



Re: [PATCH 0/3] tools/memory-model: Add SRCU support

2018-11-26 Thread Paul E. McKenney
On Mon, Nov 19, 2018 at 01:01:20PM +0100, Andrea Parri wrote:
> On Thu, Nov 15, 2018 at 11:19:24AM -0500, Alan Stern wrote:
> > Paul and other LKMM maintainers:
> > 
> > The following series of patches adds support for SRCU to the Linux
> > Kernel Memory Model.  That is, it adds the srcu_read_lock(),
> > srcu_read_unlock(), and synchronize_srcu() primitives to the model.
> > 
> > Patch 1/3 does some renaming of the RCU parts of the
> > memory model's existing CAT code, to help distinguish them
> > from the upcoming SRCU parts.
> > 
> > Patch 2/3 refactors the definitions of some RCU relations
> > in the CAT code, in a way that the SRCU portions will need.
> > 
> > Patch 3/3 actually adds the SRCU support.
> > 
> > This new code requires herd7 version 7.51+4(dev) or later (now 
> > available in the herdtools7 github repository) to run.  Thanks to Luc 
> > for making the necessary changes to support SRCU.
> 
> Thank you Alan, Luc.
> 
> My only suggestion is to integrate changes to explanation.txt (in part.,
> Sect. 22), which the above renaming and refactoring make out-of-date.

Also tools/memory-model/README, but please see below.

> For this version,
> 
> Tested-by: Andrea Parri 

Applied to all three, thank you!

Thanx, Paul

>   Andrea
> 
> 
> > 
> > The code does not check that the index argument passed to 
> > srcu_read_unlock() is the same as the value returned by the 
> > corresponding srcu_read_lock() call.  This is deemed to be a semantic 
> > issue, not directly relevant to the memory model.
> > 
> > Alan



commit 72f61917f12236514a70017d1ebafb9b8d34a9b6
Author: Paul E. McKenney 
Date:   Mon Nov 26 14:26:43 2018 -0800

tools/memory-model: Update README for addition of SRCU

This commit updates the section on LKMM limitations to no longer say
that SRCU is not modeled, but instead describe how LKMM's modeling of
SRCU departs from the Linux-kernel implementation.

    TL;DR:  There is no known valid use case that cares about the Linux
kernel's ability to have partially overlapping SRCU read-side critical
sections.

Signed-off-by: Paul E. McKenney 

diff --git a/tools/memory-model/README b/tools/memory-model/README
index 0f2c366518c6..9d7d4f23503f 100644
--- a/tools/memory-model/README
+++ b/tools/memory-model/README
@@ -221,8 +221,29 @@ The Linux-kernel memory model has the following 
limitations:
additional call_rcu() process to the site of the
emulated rcu-barrier().
 
-   e.  Sleepable RCU (SRCU) is not modeled.  It can be
-   emulated, but perhaps not simply.
+   e.  Although sleepable RCU (SRCU) is now modeled, there
+   are some subtle differences between its semantics and
+   those in the Linux kernel.  For example, the kernel
+   might interpret the following sequence as two partially
+   overlapping SRCU read-side critical sections:
+
+1  r1 = srcu_read_lock(&my_srcu);
+2  do_something_1();
+3  r2 = srcu_read_lock(&my_srcu);
+4  do_something_2();
+5  srcu_read_unlock(&my_srcu, r1);
+6  do_something_3();
+7  srcu_read_unlock(&my_srcu, r2);
+
+   In contrast, LKMM will interpret this as a nested pair of
+   SRCU read-side critical sections, with the outer critical
+   section spanning lines 1-7 and the inner critical section
+   spanning lines 3-5.
+
+   This difference would be more of a concern had anyone
+   identified a reasonable use case for partially overlapping
+   SRCU read-side critical sections.  For more information,
+   please see: https://paulmck.livejournal.com/40593.html
 
f.  Reader-writer locking is not modeled.  It can be
emulated in litmus tests using atomic read-modify-write

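
For those wanting to experiment with the new support, here is a
minimal message-passing litmus test with an SRCU reader, in the
format that these patches enable.  The test name, variable names,
and the "Never" claim are illustrative assumptions, not part of the
above patch:

C C-MP-srcu-sketch

(*
 * The reader should never see the second store without also seeing
 * the first: Never.
 *)

{}

P0(int *x, int *y, struct srcu_struct *s)
{
	int r1;
	int r2;
	int idx;

	idx = srcu_read_lock(s);
	r1 = READ_ONCE(*x);
	r2 = READ_ONCE(*y);
	srcu_read_unlock(s, idx);
}

P1(int *x, int *y, struct srcu_struct *s)
{
	WRITE_ONCE(*y, 1);
	synchronize_srcu(s);
	WRITE_ONCE(*x, 1);
}

exists (0:r1=1 /\ 0:r2=0)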


Re: [REGRESSION 4.20-rc1] 45975c7d21a1 ("rcu: Define RCU-sched API in terms of RCU for Tree RCU PREEMPT builds")

2018-11-26 Thread Paul E. McKenney
On Wed, Nov 14, 2018 at 12:20:13PM -0800, Paul E. McKenney wrote:
> On Tue, Nov 13, 2018 at 07:10:37AM -0800, Paul E. McKenney wrote:
> > On Tue, Nov 13, 2018 at 03:54:53PM +0200, Ville Syrjälä wrote:
> > > Hi Paul,
> > > 
> > > After 4.20-rc1 some of my 32bit UP machines no longer reboot/shutdown.
> > > I bisected this down to commit 45975c7d21a1 ("rcu: Define RCU-sched
> > > API in terms of RCU for Tree RCU PREEMPT builds").
> > > 
> > > I traced the hang into
> > > -> cpufreq_suspend()
> > >  -> cpufreq_stop_governor()
> > >   -> cpufreq_dbs_governor_stop()
> > >-> gov_clear_update_util()
> > > -> synchronize_sched()
> > >  -> synchronize_rcu()
> > > 
> > > Only PREEMPT=y is affected for obvious reasons, but that couldn't
> > > explain why the same UP kernel booted on an SMP machine worked fine.
> > > Eventually I realized that the difference between working and
> > > non-working machine was IOAPIC vs. PIC. With initcall_debug I saw
> > > that we mask everything in the PIC before cpufreq is shut down,
> > > and came up with the following fix:
> > > 
> > > diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
> > > index 7aa3dcad2175..f88bf3c77fc0 100644
> > > --- a/drivers/cpufreq/cpufreq.c
> > > +++ b/drivers/cpufreq/cpufreq.c
> > > @@ -2605,4 +2605,4 @@ static int __init cpufreq_core_init(void)
> > > return 0;
> > >  }
> > >  module_param(off, int, 0444);
> > > -core_initcall(cpufreq_core_init);
> > > +late_initcall(cpufreq_core_init);
> > 
> > Thank you for testing this and tracking it down!
> > 
> > I am glad that you have a fix, but I hope that we can arrive at a less
> > constraining one.
> > 
> > > Here's the resulting change in initcall_debug:
> > >   pci :00:00.1: shutdown
> > >   hub 4-0:1.0: hub_ext_port_status failed (err = -110)
> > >   agpgart-intel :00:00.0: shutdown
> > > + PM: Calling cpufreq_suspend+0x0/0x100
> > >   PM: Calling mce_syscore_shutdown+0x0/0x10
> > >   PM: Calling i8259A_shutdown+0x0/0x10
> > > - PM: Calling cpufreq_suspend+0x0/0x100
> > > + reboot: Restarting system
> > > + reboot: machine restart
> > > 
> > > I didn't really look into what other ramifications the cpufreq
> > > initcall change might have. cpufreq_global_kobject worries
> > > me a bit. Maybe that one has to remain in core_initcall() and
> > > we could just move the suspend to late_initcall()? Anyways,
> > > I figured I'd leave this for someone more familiar with the
> > > code to figure out ;) 
> > 
> > Let me guess...
> > 
> > When the system suspends or shuts down, there comes a point after which
> > there is only a single CPU that is running with preemption and interrupts
> > are disabled.  At this point, RCU must change the way that it works, and
> > the commit you bisected to would make the change more necessary.  But if
> > I am guessing correctly, we have just been getting lucky in the past.
> > 
> > It looks like RCU needs to create a struct syscore_ops with a shutdown
> > function and pass this to register_syscore_ops().  Maybe a suspend
> > function as well.  And RCU needs to invoke register_syscore_ops() at
> > a time that causes RCU's shutdown function to be invoked in the right
> > order with respect to the other work in flight.  The hope would be that
> > RCU's suspend function gets called just as the system transitions into
> > a mode where the scheduler is no longer active, give or take.
> > 
> > Does this make sense, or am I confused?
> 
> Well, it certainly does not make sense in that blocking is still legal
> at .shutdown() invocation time, which means that RCU cannot revert to
> its boot-time approach at that point.  Looks like I need hooks in a
> bunch of arch-dependent functions.  Which is certainly doable, but will
> take a bit more digging.

A bit more detail, after some additional discussion at Linux Plumbers
conference...

The preferred approach is to hook into syscore_suspend(),
syscore_resume(), and syscore_shutdown().  This can be done easily by
creating an appropriately initialized struct syscore_ops and passing a
pointer to it to register_syscore_ops() during boot.  Taking these three
functions in turn:

syscore_suspend():

o   arch/x86/kernel/apm_32.c suspend(), standby()

These calls to syscore_suspend() have interrupts disabled, which
is very good, but they are immediately re-enabled, and only
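
For reference, the registration step described above is straightforward;
a minimal sketch in which the rcu_pm_*() names and their empty bodies
are assumptions rather than actual kernel code:

#include <linux/init.h>
#include <linux/syscore_ops.h>

static int rcu_pm_suspend(void)
{
	/* Here RCU would switch to its single-CPU mode of operation. */
	return 0;
}

static void rcu_pm_resume(void)
{
	/* Here RCU would resume normal multi-CPU operation. */
}

static void rcu_pm_shutdown(void)
{
	/* As for suspend, but on the way to reboot or power-off. */
}

static struct syscore_ops rcu_syscore_ops = {
	.suspend	= rcu_pm_suspend,
	.resume		= rcu_pm_resume,
	.shutdown	= rcu_pm_shutdown,
};

static int __init rcu_register_syscore(void)
{
	register_syscore_ops(&rcu_syscore_ops);
	return 0;
}
core_initcall(rcu_register_syscore);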

Re: [PATCH tip/core/rcu 23/41] sched: Replace synchronize_sched() with synchronize_rcu()

2018-11-26 Thread Paul E. McKenney
On Mon, Nov 12, 2018 at 02:21:12PM -0800, Paul E. McKenney wrote:
> On Mon, Nov 12, 2018 at 07:17:41PM +0100, Peter Zijlstra wrote:
> > On Mon, Nov 12, 2018 at 05:28:52AM -0800, Paul E. McKenney wrote:
> > > On Mon, Nov 12, 2018 at 10:00:47AM +0100, Peter Zijlstra wrote:
> > 
> > > > Still, better safe than sorry. It was a rather big change in behaviour,
> > > > so it wouldn't have been strange to call that out.
> > > 
> > > This guy:
> > > 
> > > 45975c7d21a1 ("rcu: Define RCU-sched API in terms of RCU for Tree RCU 
> > > PREEMPT builds")
> > > 
> > > Has a commit log that says:
> > > 
> > >   Now that RCU-preempt knows about preemption disabling, its
> > >   implementation of synchronize_rcu() works for synchronize_sched(),
> > >   and likewise for the other RCU-sched update-side API members.
> > >   This commit therefore confines the RCU-sched update-side code
> > >   to CONFIG_PREEMPT=n builds, and defines RCU-sched's update-side
> > >   API members in terms of those of RCU-preempt.
> > > 
> > > That last phrase seems pretty explicit.  What am I missing here?
> > 
> > That does not explicitly state that because RCU-preempt
> > synchronize_rcu() can take _much_ longer, the new synchronize_sched()
> > can now take _much_ longer too.
> > 
> > So when someone bisects a problem to this commit and he reads the
> > Changelog, he might get the impression that this was unexpected.
> 
> Of course, a preempt_disable() section of code can still be preempted
> by the underlying hypervisor, so in a surprisingly large fraction of
> the installed base, there really isn't that much difference.
> 
> > > Not that it matters, given that I know of no way to change a mainlined
> > > commit log.  I suppose I could ask Jon if he would be willing to take
> > > a 2018 RCU API LWN article, if that would help.
> > 
> > Yes, it is water under the bridge; but Changelogs should be explicit
> > about behavioural changes.
> > 
> > And while the merged RCU has the semantic behaviour required, the timing
> > behaviour did change significantly.
> 
> When running on bare metal, potentially.  From what I see, preemption
> of RCU read-side critical sections is the exception rather than the rule.
> And again, when running on hypervisors, even irq-disable regions of code
> can be preempted.  (And yes, there is work in flight to allow RCU to deal
> with this.)
> 
> > > > > > Again, the patch didn't say that.
> > > > > > 
> > > > > > If the Changelog would've read something like:
> > > > > > 
> > > > > > "Since synchronize_sched() is now equivalent to synchronize_rcu(),
> > > > > > replace the synchronize_sched() usage such that we can eventually 
> > > > > > remove
> > > > > > the interface."
> > > > > > 
> > > > > > It would've been clear that the patch is a nop and what the purpose
> > > > > > was.
> > > > > 
> > > > > I can easily make that change.
> > > > 
> > > > Please, sufficient doesn't imply necessary etc.. A changelog should
> > > > always clarify why we do the patch.
> > > 
> > > ???  Did you mean to say "necessary doesn't imply sufficient"?  If so,
> > > what else do you feel is missing?
> > 
> > No, I meant to say that your original Changelog only states that
> > sync_rcu now covers rcu-sched behaviour.  Which means that the change is
> > sufficient.
> > 
> > It completely and utterly fails to explain _why_ you're doing the
> > change. Ie. you do not address why it is necessary.
> > 
> > A Changelog should always explain why the change is needed.
> > 
> > In this case because you want to get rid of the sync_sched() api.
> 
> Right, which is stated in your suggested wording above.  So I am still
> not seeing what you want added to this:
> 
>   "Since synchronize_sched() is now equivalent to synchronize_rcu(),
>   replace the synchronize_sched() usage such that we can eventually
>   remove the interface."

Finally getting back to this.  I removed this commit from the group that
I intend to send in next week's -tip pull request, and updated its commit
log as shown below.  Does this work for you?

        Thanx, Paul



commit 52ffe7fbe615e8989f054432c76a7e43b8c35607
Au

Re: [PATCH tip/core/rcu 02/19] rcu: Defer reporting RCU-preempt quiescent states when disabled

2018-11-26 Thread Paul E. McKenney
On Mon, Nov 26, 2018 at 01:55:37PM +, Ran Rozenstein wrote:
> > 
> > Hearing no objections, here is the updated patch.
> > 
> > Thanx, Paul
> > 
> > 
> > 
> > commit 970cab5d3d206029ed27274a98ea1c3d7e780e53
> > Author: Paul E. McKenney 
> > Date:   Mon Oct 29 07:36:50 2018 -0700
> > 
> > rcu: Avoid signed integer overflow in rcu_preempt_deferred_qs()
> > 
> > Subtracting INT_MIN can be interpreted as unconditional signed integer
> > overflow, which according to the C standard is undefined behavior.
> > Therefore, kernel build arguments notwithstanding, it would be good to
> > future-proof the code.  This commit therefore substitutes INT_MAX for
> > INT_MIN in order to avoid undefined behavior.
> > 
> > While in the neighborhood, this commit also creates some meaningful
> > names
> > for INT_MAX and friends in order to improve readability, as suggested
> > by Joel Fernandes.
> > 
> > Reported-by: Ran Rozenstein 
> > Signed-off-by: Paul E. McKenney 
> > 
> > squash! rcu: Avoid signed integer overflow in rcu_preempt_deferred_qs()
> > 
> > While in the neighborhood, use macros to give meaningful names.
> > 
> > Signed-off-by: Paul E. McKenney 
> 
> Hi,
> 
> What is the acceptance status of this patch?

It is queued in -rcu.  If no problems arise beforehand, I intend to submit
it as part of a pull request into -tip, which (again if no problems arise)
will be pulled into mainline during the next merge window.

Oddly enough, a couple of weeks ago the C++ Standards Committee voted
in a proposal for C++20 removing undefined behavior for signed integer
overflow.  This is C++ rather than C, and C must support additional
hardware that wouldn't much like forcing twos complement for signed
integer overflow.  But still...  ;-)
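
To make the hazard concrete, a standalone sketch (not the actual RCU
code; the values here are illustrative):

#include <limits.h>
#include <stdio.h>

int main(void)
{
	int special = 0;

	/*
	 * special -= INT_MIN would overflow: 0 - INT_MIN is INT_MAX + 1,
	 * which is not representable in an int, hence undefined behavior.
	 * Subtracting INT_MAX instead stays in range for the values that
	 * the code in question can actually hold, so it is well defined.
	 */
	special -= INT_MAX;
	printf("%d\n", special);
	return 0;
}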

Thanx, Paul



Re: Function rcu_dynticks_eqs_exit spent more cycles to processor

2018-11-25 Thread Paul E. McKenney
On Sun, Nov 25, 2018 at 09:41:26PM +0200, Corcodel Marian wrote:
> Hi, below is a modified function from kernel/rcu/tree.c that does not
> stall; run perf for more info.
> The atomic_t dynticks field in the rcu_dynticks structure can be
> replaced with a u8 type, because Intel guarantees atomic operations
> on bytes.
>  Eg, rdtp->dynticks = ~RCU_DYNTICK_CTRL_MASK;

I am not clear on exactly what change you are suggesting.  But regardless,
this is core code, so it must run on all CPUs that the Linux kernel
supports, not just Intel x86.  Furthermore, a straight store of
~RCU_DYNTICK_CTRL_MASK would be rather destructive in this function,
if that is what you are getting at with your assignment statement above.
So again, I am not clear on exactly what change you are suggesting, but
whatever it is, it must build and run on all architectures.
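
To illustrate the distinction, a sketch (not a proposed change), in
which the low bit stands in for RCU_DYNTICK_CTRL_MASK and the upper
bits for the dyntick counter:

#include <linux/atomic.h>

#define EXAMPLE_CTRL_MASK 0x1	/* stand-in for RCU_DYNTICK_CTRL_MASK */

static void clear_ctrl_bit(atomic_t *dynticks)
{
	/*
	 * Read-modify-write: clears only the control bit, preserving
	 * the counter value in the remaining bits.
	 */
	atomic_andnot(EXAMPLE_CTRL_MASK, dynticks);

	/*
	 * By contrast, a straight store of ~EXAMPLE_CTRL_MASK would set
	 * every counter bit to one, destroying the counter:
	 *
	 *	atomic_set(dynticks, ~EXAMPLE_CTRL_MASK);
	 */
}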

Adding LKML on CC, develop in the open and all that.

Thanx, Paul

> static void rcu_dynticks_eqs_exit(void)
> {
>   struct rcu_dynticks *rdtp = this_cpu_ptr(&rcu_dynticks);
> 
>   /*
>* CPUs seeing atomic_add_return() must see prior idle
> sojourns,
>* and we also must force ordering with the next RCU read-side
>* critical section.
>*/
>   atomic_andnot(RCU_DYNTICK_CTRL_MASK, &rdtp->dynticks);
>   smp_mb__after_atomic(); /* _exit after clearing mask. */
>   /* Prefer duplicate flushes to losing a flush. */
>   rcu_eqs_special_exit();
> }
> 



Re: dyntick-idle CPU and node's qsmask

2018-11-21 Thread Paul E. McKenney
On Tue, Nov 20, 2018 at 08:37:22PM -0800, Joel Fernandes wrote:
> On Tue, Nov 20, 2018 at 06:41:07PM -0800, Paul E. McKenney wrote:
> [...] 
> > > > > I was thinking if we could simplify rcu_note_context_switch (the 
> > > > > parts that
> > > > > call rcu_momentary_dyntick_idle), if we did the following in
> > > > > rcu_implicit_dynticks_qs.
> > > > > 
> > > > > Since we already call rcu_qs in rcu_note_context_switch, that would 
> > > > > clear the
> > > > > rdp->cpu_no_qs flag. Then there should be no need to call
> > > > > rcu_momentary_dyntick_idle from rcu_note_context switch.
> > > > 
> > > > But does this also work for the rcu_all_qs() code path?
> > > 
> > > Could we not do something like this in rcu_all_qs? as some over-simplified
> > > pseudo code:
> > > 
> > > rcu_all_qs() {
> > >   if (!urgent_qs || !heavy_qs)
> > >  return;
> > > 
> > >   rcu_qs();   // This clears the rdp->cpu_no_qs flags which we can 
> > > monitor in
> > >   //  the diff in my last email (from 
> > > rcu_implicit_dynticks_qs)
> > > }
> > 
> > Except that rcu_qs() doesn't necessarily report the quiescent state to
> > the RCU core.  Keeping down context-switch overhead and all that.
> 
> Sure yeah, but I think the QS will be reported indirectly anyway by the force_qs_rnp()
> path if we detect that rcu_qs() happened on the CPU?

The force_qs_rnp() path won't see anything that has not already been
reported to the RCU core.

> > > > > I think this would simplify cond_resched as well.  Could this avoid 
> > > > > the need
> > > > > for having an rcu_all_qs at all? Hopefully I didn't miss some Tasks-RCU 
> > > > > corner cases..
> > > > 
> > > > There is also the code path from cond_resched() in PREEMPT=n kernels.
> > > > This needs rcu_all_qs().  Though it is quite possible that some 
> > > > additional
> > > > code collapsing is possible.
> > > > 
> > > > > Basically for some background, I was thinking can we simplify the 
> > > > > code that
> > > > > calls "rcu_momentary_dyntick_idle" since we already register a qs in 
> > > > > other
> > > > > ways (like by resetting cpu_no_qs).
> > > > 
> > > > One complication is that rcu_all_qs() is invoked with interrupts
> > > > and preemption enabled, while rcu_note_context_switch() is
> > > > invoked with interrupts disabled.  Also, as you say, Tasks RCU.
> > > > Plus rcu_all_qs() wants to exit immediately if there is nothing to
> > > > do, while rcu_note_context_switch() must unconditionally do rcu_qs()
> > > > -- yes, it could check, but that would be redundant with the checks
> > > 
> > > This immediate exit is taken care of in the above pseudo code, would that
> > > help the cond_resched performance?
> > 
> > It look like you are cautiously edging towards the two wrapper functions
> > calling common code, relying on inlining and simplification.  Why not just
> > try doing it?  ;-)
> 
> Sure yeah. I was more thinking of the ambitious goal of getting rid of the
> complexity and exploring the general design idea, than containing/managing
> the complexity with reducing code duplication. :D
> 
> > > > > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > > > > index c818e0c91a81..5aa0259c014d 100644
> > > > > --- a/kernel/rcu/tree.c
> > > > > +++ b/kernel/rcu/tree.c
> > > > > @@ -1063,7 +1063,7 @@ static int rcu_implicit_dynticks_qs(struct 
> > > > > rcu_data *rdp)
> > > > >* read-side critical section that started before the beginning
> > > > >* of the current RCU grace period.
> > > > >*/
> > > > > - if (rcu_dynticks_in_eqs_since(rdp, rdp->dynticks_snap)) {
> > > > > + if (rcu_dynticks_in_eqs_since(rdp, rdp->dynticks_snap) || 
> > > > > !rdp->cpu_no_qs.b.norm) {
> > > > 
> > > > If I am not too confused, this change could cause trouble for
> > > > nohz_full CPUs looping in the kernel.  Such CPUs don't necessarily take
> > > > scheduler-clock interrupts, last I checked, and this could prevent the
> > > > CPU from reporting its quiescent state to core RCU.
> > > 
> > > Would that still be a problem if rcu_al

Re: dyntick-idle CPU and node's qsmask

2018-11-20 Thread Paul E. McKenney
On Tue, Nov 20, 2018 at 06:06:12PM -0800, Joel Fernandes wrote:
> On Tue, Nov 20, 2018 at 02:28:14PM -0800, Paul E. McKenney wrote:
> > On Tue, Nov 20, 2018 at 12:42:43PM -0800, Joel Fernandes wrote:
> > > On Sun, Nov 11, 2018 at 10:36:18AM -0800, Paul E. McKenney wrote:
> > > > On Sun, Nov 11, 2018 at 10:09:16AM -0800, Joel Fernandes wrote:
> > > > > On Sat, Nov 10, 2018 at 08:22:10PM -0800, Paul E. McKenney wrote:
> > > > > > On Sat, Nov 10, 2018 at 07:09:25PM -0800, Joel Fernandes wrote:
> > > > > > > On Sat, Nov 10, 2018 at 03:04:36PM -0800, Paul E. McKenney wrote:
> > > > > > > > On Sat, Nov 10, 2018 at 01:46:59PM -0800, Joel Fernandes wrote:
> > > > > > > > > Hi Paul and everyone,
> > > > > > > > > 
> > > > > > > > > I was tracing/studying the RCU code today in paul/dev branch 
> > > > > > > > > and noticed that
> > > > > > > > > for dyntick-idle CPUs, the RCU GP thread is clearing the 
> > > > > > > > > rnp->qsmask
> > > > > > > > > corresponding to the leaf node for the idle CPU, and 
> > > > > > > > > reporting a QS on their
> > > > > > > > > behalf.
> > > > > > > > > 
> > > > > > > > > rcu_sched-10[003]40.008039: rcu_fqs:  
> > > > > > > > > rcu_sched 792 0 dti
> > > > > > > > > rcu_sched-10[003]40.008039: rcu_fqs:  
> > > > > > > > > rcu_sched 801 2 dti
> > > > > > > > > rcu_sched-10[003]40.008041: 
> > > > > > > > > rcu_quiescent_state_report: rcu_sched 805 5>0 0 0 3 0
> > > > > > > > > 
> > > > > > > > > That's all good but I was wondering if we can do better for 
> > > > > > > > > the idle CPUs if
> > > > > > > > > we can somehow not set the qsmask of the node in the first
> > > > > > > > > place. Then no reporting of quiescent states would be
> > > > > > > > > needed for idle CPUs, right?
> > > > > > > > > And we would also not need to acquire the rnp lock I think.
> > > > > > > > > 
> > > > > > > > > At least for a single node tree RCU system, it seems that 
> > > > > > > > > would avoid needing
> > > > > > > > > to acquire the lock without complications. Anyway let me know 
> > > > > > > > > your thoughts
> > > > > > > > > and happy to discuss this at the hallways of the LPC as well 
> > > > > > > > > for folks
> > > > > > > > > attending :)
> > > > > > > > 
> > > > > > > > We could, but that would require consulting the rcu_data 
> > > > > > > > structure for
> > > > > > > > each CPU while initializing the grace period, thus increasing 
> > > > > > > > the number
> > > > > > > > of cache misses during grace-period initialization and also 
> > > > > > > > shortly after
> > > > > > > > for any non-idle CPUs.  This seems backwards on busy systems 
> > > > > > > > where each
> > > > > > > 
> > > > > > > When I traced, it appears to me that rcu_data structure of a 
> > > > > > > remote CPU was
> > > > > > > being consulted anyway by the rcu_sched thread. So it seems like 
> > > > > > > such cache
> > > > > > > miss would happen anyway whether it is during grace-period 
> > > > > > > initialization or
> > > > > > > during the fqs stage? I guess I'm trying to say, the consultation 
> > > > > > > of remote
> > > > > > > CPU's rcu_data happens anyway.
> > > > > > 
> > > > > > Hmmm...
> > > > > > 
> > > > > > The rcu_gp_init() function does access an rcu_data structure, but 
> > > > > > it is
> > > > > > that of the current CPU, so shouldn't involve a communications 
> > > > > > cache miss,
> > > > > >

Re: dyntick-idle CPU and node's qsmask

2018-11-20 Thread Paul E. McKenney
On Tue, Nov 20, 2018 at 02:28:13PM -0800, Paul E. McKenney wrote:
> On Tue, Nov 20, 2018 at 12:42:43PM -0800, Joel Fernandes wrote:
> > On Sun, Nov 11, 2018 at 10:36:18AM -0800, Paul E. McKenney wrote:
> > > On Sun, Nov 11, 2018 at 10:09:16AM -0800, Joel Fernandes wrote:
> > > > On Sat, Nov 10, 2018 at 08:22:10PM -0800, Paul E. McKenney wrote:
> > > > > On Sat, Nov 10, 2018 at 07:09:25PM -0800, Joel Fernandes wrote:
> > > > > > On Sat, Nov 10, 2018 at 03:04:36PM -0800, Paul E. McKenney wrote:
> > > > > > > On Sat, Nov 10, 2018 at 01:46:59PM -0800, Joel Fernandes wrote:
> > > > > > > > Hi Paul and everyone,
> > > > > > > > 
> > > > > > > > I was tracing/studying the RCU code today in paul/dev branch 
> > > > > > > > and noticed that
> > > > > > > > for dyntick-idle CPUs, the RCU GP thread is clearing the 
> > > > > > > > rnp->qsmask
> > > > > > > > corresponding to the leaf node for the idle CPU, and reporting 
> > > > > > > > a QS on their
> > > > > > > > behalf.
> > > > > > > > 
> > > > > > > > rcu_sched-10[003]40.008039: rcu_fqs:  
> > > > > > > > rcu_sched 792 0 dti
> > > > > > > > rcu_sched-10[003]40.008039: rcu_fqs:  
> > > > > > > > rcu_sched 801 2 dti
> > > > > > > > rcu_sched-10[003]40.008041: rcu_quiescent_state_report: 
> > > > > > > > rcu_sched 805 5>0 0 0 3 0
> > > > > > > > 
> > > > > > > > That's all good but I was wondering if we can do better for the 
> > > > > > > > idle CPUs if
> > > > > > > > we can somehow not set the qsmask of the node in the first
> > > > > > > > place. Then no reporting of quiescent states would be
> > > > > > > > needed for idle CPUs, right?
> > > > > > > > And we would also not need to acquire the rnp lock I think.
> > > > > > > > 
> > > > > > > > At least for a single node tree RCU system, it seems that would 
> > > > > > > > avoid needing
> > > > > > > > to acquire the lock without complications. Anyway let me know 
> > > > > > > > your thoughts
> > > > > > > > and happy to discuss this at the hallways of the LPC as well 
> > > > > > > > for folks
> > > > > > > > attending :)
> > > > > > > 
> > > > > > > We could, but that would require consulting the rcu_data 
> > > > > > > structure for
> > > > > > > each CPU while initializing the grace period, thus increasing the 
> > > > > > > number
> > > > > > > of cache misses during grace-period initialization and also 
> > > > > > > shortly after
> > > > > > > for any non-idle CPUs.  This seems backwards on busy systems 
> > > > > > > where each
> > > > > > 
> > > > > > When I traced, it appears to me that rcu_data structure of a remote 
> > > > > > CPU was
> > > > > > being consulted anyway by the rcu_sched thread. So it seems like 
> > > > > > such cache
> > > > > > miss would happen anyway whether it is during grace-period 
> > > > > > initialization or
> > > > > > during the fqs stage? I guess I'm trying to say, the consultation 
> > > > > > of remote
> > > > > > CPU's rcu_data happens anyway.
> > > > > 
> > > > > Hmmm...
> > > > > 
> > > > > The rcu_gp_init() function does access an rcu_data structure, but it 
> > > > > is
> > > > > that of the current CPU, so shouldn't involve a communications cache 
> > > > > miss,
> > > > > at least not in the common case.
> > > > > 
> > > > > Or are you seeing these cross-CPU rcu_data accesses in rcu_gp_fqs() or
> > > > > functions that it calls?  In that case, please see below.
> > > > 
> > > > Yes, it was rcu_implicit_dynticks_qs called from rcu_gp_fqs.
> > > > 
> > > > > > > CP

Re: dyntick-idle CPU and node's qsmask

2018-11-20 Thread Paul E. McKenney
On Tue, Nov 20, 2018 at 12:42:43PM -0800, Joel Fernandes wrote:
> On Sun, Nov 11, 2018 at 10:36:18AM -0800, Paul E. McKenney wrote:
> > On Sun, Nov 11, 2018 at 10:09:16AM -0800, Joel Fernandes wrote:
> > > On Sat, Nov 10, 2018 at 08:22:10PM -0800, Paul E. McKenney wrote:
> > > > On Sat, Nov 10, 2018 at 07:09:25PM -0800, Joel Fernandes wrote:
> > > > > On Sat, Nov 10, 2018 at 03:04:36PM -0800, Paul E. McKenney wrote:
> > > > > > On Sat, Nov 10, 2018 at 01:46:59PM -0800, Joel Fernandes wrote:
> > > > > > > Hi Paul and everyone,
> > > > > > > 
> > > > > > > I was tracing/studying the RCU code today in paul/dev branch and 
> > > > > > > noticed that
> > > > > > > for dyntick-idle CPUs, the RCU GP thread is clearing the 
> > > > > > > rnp->qsmask
> > > > > > > corresponding to the leaf node for the idle CPU, and reporting a 
> > > > > > > QS on their
> > > > > > > behalf.
> > > > > > > 
> > > > > > > rcu_sched-10[003]40.008039: rcu_fqs:  
> > > > > > > rcu_sched 792 0 dti
> > > > > > > rcu_sched-10[003]40.008039: rcu_fqs:  
> > > > > > > rcu_sched 801 2 dti
> > > > > > > rcu_sched-10[003]40.008041: rcu_quiescent_state_report: 
> > > > > > > rcu_sched 805 5>0 0 0 3 0
> > > > > > > 
> > > > > > > That's all good but I was wondering if we can do better for the 
> > > > > > > idle CPUs if
> > > > > > > we can somehow not set the qsmask of the node in the first
> > > > > > > place. Then no reporting of quiescent states would be
> > > > > > > needed for idle CPUs, right?
> > > > > > > And we would also not need to acquire the rnp lock I think.
> > > > > > > 
> > > > > > > At least for a single node tree RCU system, it seems that would 
> > > > > > > avoid needing
> > > > > > > to acquire the lock without complications. Anyway let me know 
> > > > > > > your thoughts
> > > > > > > and happy to discuss this at the hallways of the LPC as well for 
> > > > > > > folks
> > > > > > > attending :)
> > > > > > 
> > > > > > We could, but that would require consulting the rcu_data structure 
> > > > > > for
> > > > > > each CPU while initializing the grace period, thus increasing the 
> > > > > > number
> > > > > > of cache misses during grace-period initialization and also shortly 
> > > > > > after
> > > > > > for any non-idle CPUs.  This seems backwards on busy systems where 
> > > > > > each
> > > > > 
> > > > > When I traced, it appears to me that rcu_data structure of a remote 
> > > > > CPU was
> > > > > being consulted anyway by the rcu_sched thread. So it seems like such 
> > > > > cache
> > > > > miss would happen anyway whether it is during grace-period 
> > > > > initialization or
> > > > > during the fqs stage? I guess I'm trying to say, the consultation of 
> > > > > remote
> > > > > CPU's rcu_data happens anyway.
> > > > 
> > > > Hmmm...
> > > > 
> > > > The rcu_gp_init() function does access an rcu_data structure, but it is
> > > > that of the current CPU, so shouldn't involve a communications cache 
> > > > miss,
> > > > at least not in the common case.
> > > > 
> > > > Or are you seeing these cross-CPU rcu_data accesses in rcu_gp_fqs() or
> > > > functions that it calls?  In that case, please see below.
> > > 
> > > Yes, it was rcu_implicit_dynticks_qs called from rcu_gp_fqs.
> > > 
> > > > > > CPU will with high probability report its own quiescent state 
> > > > > > before three
> > > > > > jiffies pass, in which case the cache misses on the rcu_data 
> > > > > > structures
> > > > > > would be wasted motion.
> > > > > 
> > > > > If all the CPUs are busy and reporting their QS themselves, then I 
> > > > > think the

Re: [PATCH tip/core/rcu 6/7] mm: Replace spin_is_locked() with lockdep

2018-11-15 Thread Paul E. McKenney
On Thu, Nov 15, 2018 at 10:49:17AM -0800, Davidlohr Bueso wrote:
> On Sun, 11 Nov 2018, Paul E. McKenney wrote:
> 
> From: Lance Roy 
> >
> >lockdep_assert_held() is better suited to checking locking requirements,
> >since it only checks if the current thread holds the lock regardless of
> >whether someone else does. This is also a step towards possibly removing
> >spin_is_locked().
> 
> So fyi I'm not crazy about this kind of patch, simply because lockdep
> is rarely enabled outside of lab environments, so we can be missing
> potential offenders. There's obviously nothing wrong with what you describe
> above per se, just my two cents.

Fair point!

One countervailing advantage of lockdep is that it is not subject to the
false negatives that can happen if someone else happens to be currently
holding the lock.  But what would you suggest instead?
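
For illustration, the difference looks like this; struct foo and
frob_counter() are made-up names:

#include <linux/lockdep.h>
#include <linux/spinlock.h>

struct foo {
	spinlock_t lock;
	int counter;
};

static void frob_counter(struct foo *f)
{
	/*
	 * Complains only if the current thread fails to hold f->lock.
	 * A spin_is_locked() check would also be satisfied when some
	 * other thread happened to hold the lock, hence the false
	 * negatives mentioned above.
	 */
	lockdep_assert_held(&f->lock);
	f->counter++;
}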

Thanx, Paul



Re: [PATCH 0/3] tools/memory-model: Add SRCU support

2018-11-15 Thread Paul E. McKenney
On Thu, Nov 15, 2018 at 11:19:24AM -0500, Alan Stern wrote:
> Paul and other LKMM maintainers:
> 
> The following series of patches adds support for SRCU to the Linux
> Kernel Memory Model.  That is, it adds the srcu_read_lock(),
> srcu_read_unlock(), and synchronize_srcu() primitives to the model.
> 
>   Patch 1/3 does some renaming of the RCU parts of the
>   memory model's existing CAT code, to help distinguish them
>   from the upcoming SRCU parts.
> 
>   Patch 2/3 refactors the definitions of some RCU relations
>   in the CAT code, in a way that the SRCU portions will need.
> 
>   Patch 3/3 actually adds the SRCU support.
> 
> This new code requires herd7 version 7.51+4(dev) or later (now 
> available in the herdtools7 github repository) to run.  Thanks to Luc 
> for making the necessary changes to support SRCU.

These patches pass the tests that I have constructed, and also regression
tests, very nice!  Applied and pushed, thank you.

> The code does not check that the index argument passed to 
> srcu_read_unlock() is the same as the value returned by the 
> corresponding srcu_read_lock() call.  This is deemed to be a semantic 
> issue, not directly relevant to the memory model.

Agreed.

If I understand correctly, there are in theory some use cases that these
patches do not support, for example:

r1 = srcu_read_lock(a);
do_1();
r2 = srcu_read_lock(a);
do_2();
srcu_read_unlock(a, r1);
do_3();
srcu_read_unlock(a, r2);

In practice, I would be more worried about this had I ever managed to
find a non-bogus use case for this pattern.  ;-)

Thanx, Paul



Q from "Concurrency with tools/memory-model"

2018-11-15 Thread Paul E. McKenney
Hello!

Good turnout and some good questions here in Vancouver BC, please see
below for rough notes.  ;-)

Thanx, Paul



"Concurrency with tools/memory-model"

Andrea Parri presenting.

Rough notes of Q&A

o   Want atomic bit operation.

o   But smp_read_barrier_depends() not there, so how to note pairing?
A:  Note the dependency as the other end of the pairing.

o   Speculation barriers, as in Spectre and Meltdown?  A: This would
require adding timing, not in the immediate future.

o   What ordering do system calls provide?  A: None that we know of.
Boqun: Userspace needs to explicitly provide the needed ordering
when interacting with the kernel.  Some architectures do provide
full barriers, but not to be counted on.

o   Why herd7?  A: Based on other formalizations -- note that herd7
had a number of hardware models.  Paul: Plus the founder of the
LKMM project is a co-author of herd, which might have had some
effect.

o   Why not also model interrupts and NMIs?  Promela and spin have
been used for this.  A: Cannot currently model them.  You can
emulate them with additional threads and locks, if you wish.
Vincent Nimal and Lihao Liang have done some academic work on
these topics.
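
For example, a masked "interrupt handler" can be emulated as an extra
thread serialized with the interrupted code by a lock; a sketch with
illustrative names, modeling local_irq_disable()/enable() as
spin_lock()/spin_unlock() of an "irq" lock:

C C-irq-emulation-sketch

{}

P0(int *x, spinlock_t *irq)
{
	spin_lock(irq);		/* models local_irq_disable() */
	WRITE_ONCE(*x, 1);
	spin_unlock(irq);	/* models local_irq_enable() */
}

P1(int *x, int *y, spinlock_t *irq)
{
	int r1;

	spin_lock(irq);		/* the emulated interrupt handler */
	r1 = READ_ONCE(*x);
	WRITE_ONCE(*y, r1);
	spin_unlock(irq);
}

exists (1:r1=1)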



Re: [REGRESSION 4.20-rc1] 45975c7d21a1 ("rcu: Define RCU-sched API in terms of RCU for Tree RCU PREEMPT builds")

2018-11-14 Thread Paul E. McKenney
On Tue, Nov 13, 2018 at 07:10:37AM -0800, Paul E. McKenney wrote:
> On Tue, Nov 13, 2018 at 03:54:53PM +0200, Ville Syrjälä wrote:
> > Hi Paul,
> > 
> > After 4.20-rc1 some of my 32bit UP machines no longer reboot/shutdown.
> > I bisected this down to commit 45975c7d21a1 ("rcu: Define RCU-sched
> > API in terms of RCU for Tree RCU PREEMPT builds").
> > 
> > I traced the hang into
> > -> cpufreq_suspend()
> >  -> cpufreq_stop_governor()
> >   -> cpufreq_dbs_governor_stop()
> >-> gov_clear_update_util()
> > -> synchronize_sched()
> >  -> synchronize_rcu()
> > 
> > Only PREEMPT=y is affected for obvious reasons, but that couldn't
> > explain why the same UP kernel booted on an SMP machine worked fine.
> > Eventually I realized that the difference between working and
> > non-working machine was IOAPIC vs. PIC. With initcall_debug I saw
> > that we mask everything in the PIC before cpufreq is shut down,
> > and came up with the following fix:
> > 
> > diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
> > index 7aa3dcad2175..f88bf3c77fc0 100644
> > --- a/drivers/cpufreq/cpufreq.c
> > +++ b/drivers/cpufreq/cpufreq.c
> > @@ -2605,4 +2605,4 @@ static int __init cpufreq_core_init(void)
> > return 0;
> >  }
> >  module_param(off, int, 0444);
> > -core_initcall(cpufreq_core_init);
> > +late_initcall(cpufreq_core_init);
> 
> Thank you for testing this and tracking it down!
> 
> I am glad that you have a fix, but I hope that we can arrive at a less
> constraining one.
> 
> > Here's the resulting change in initcall_debug:
> >   pci :00:00.1: shutdown
> >   hub 4-0:1.0: hub_ext_port_status failed (err = -110)
> >   agpgart-intel :00:00.0: shutdown
> > + PM: Calling cpufreq_suspend+0x0/0x100
> >   PM: Calling mce_syscore_shutdown+0x0/0x10
> >   PM: Calling i8259A_shutdown+0x0/0x10
> > - PM: Calling cpufreq_suspend+0x0/0x100
> > + reboot: Restarting system
> > + reboot: machine restart
> > 
> > I didn't really look into what other ramifications the cpufreq
> > initcall change might have. cpufreq_global_kobject worries
> > me a bit. Maybe that one has to remain in core_initcall() and
> > we could just move the suspend to late_initcall()? Anyways,
> > I figured I'd leave this for someone more familiar with the
> > code to figure out ;) 
> 
> Let me guess...
> 
> When the system suspends or shuts down, there comes a point after which
> there is only a single CPU that is running with preemption and interrupts
> are disabled.  At this point, RCU must change the way that it works, and
> the commit you bisected to would make the change more necessary.  But if
> I am guessing correctly, we have just been getting lucky in the past.
> 
> It looks like RCU needs to create a struct syscore_ops with a shutdown
> function and pass this to register_syscore_ops().  Maybe a suspend
> function as well.  And RCU needs to invoke register_syscore_ops() at
> a time that causes RCU's shutdown function to be invoked in the right
> order with respect to the other work in flight.  The hope would be that
> RCU's suspend function gets called just as the system transitions into
> a mode where the scheduler is no longer active, give or take.
> 
> Does this make sense, or am I confused?

Well, it certainly does not make sense in that blocking is still legal
at .shutdown() invocation time, which means that RCU cannot revert to
its boot-time approach at that point.  Looks like I need hooks in a
bunch of arch-dependent functions.  Which is certainly doable, but will
take a bit more digging.

Thanx, Paul



Re: KMSAN: uninit-value in rcu_accelerate_cbs / KMSAN: uninit-value in rcu_process_callbacks

2018-11-14 Thread Paul E. McKenney
On Wed, Nov 14, 2018 at 04:31:11PM +0100, Alexander Potapenko wrote:
> On Wed, Nov 14, 2018 at 4:09 PM Paul E. McKenney  
> wrote:
> >
> > On Wed, Nov 14, 2018 at 04:03:33AM -0500, Kyungtae Kim wrote:
> > > We report two crashes in v4.19-rc8 (4.20-rc1 as well, I guess):
> > > (Unfortunately, there is no repro for those.)
> > >
> > > The two crashes seem to share the same issue.
> > > In both cases, (uninitialized) memory access violation occurs
> > > when "rdp->cblist" is about to be accessed (kernel/rcu/tree.c:2838,1728).
> > > I guess those are freed before the use, but I still haven't figured
> > > out the reason why.
> > > I'm looking forward to some help.
> 
> First of all, I'd avoid reporting KMSAN bugs without clear reproducers.
> The tool is still in beta and may still give false positives due to
> either missed initialization or rare memory corruptions.

OK, I will set this aside, then, thank you!

Thanx, Paul

> > You lost me on this one.  In both cases, rdp references a per-CPU
> > variable that is implicitly initialized to all zeroes, due to being
> > (sort of) a C-language global.
> >
> > If a callback is queued early, then the following lines in __call_rcu()
> > will make an honest list of that field because of the following:
> >
> > if (rcu_segcblist_empty(&rdp->cblist))
> > rcu_segcblist_init(&rdp->cblist);
> >
> > Otherwise, when rcu_init() is invoked during early boot, we have this
> > in rcu_init_percpu_data(), which is called from rcutree_prepare_cpu()
> > which is called from rcu_init(), which is called from start_kernel():
> >
> > if (rcu_segcblist_empty(&rdp->cblist) && /* No early-boot CBs? */
> > !init_nocb_callback_list(rdp))
> > rcu_segcblist_init(&rdp->cblist);  /* Re-enable callbacks. */
> >
> > So either init_nocb_callback_list() initializes the alternative callback
> > lists for a no-CBs CPU or rcu_segcblist_init() again makes an honest
> > list of that field.
> >
> > My guess is that your tool is missing the
> >
> > rdp = this_cpu_ptr(rsp->rda);
> >
> > in the __call_rcu() case, and also missing the
> >
> > struct rcu_data *rdp = per_cpu_ptr(rsp->rda, cpu);
> >
> > Note that the ->rda field is explicitly compile-time initialized to
> > the base address of the per-CPU variable, which is rcu_preempt_data,
> > rcu_bh_data, or rcu_sched_data, depending on which RCU flavor is at hand.
> > (In v4.20-rc1, these are all merged into a single flavor to rule them all.)
> >
> > Alternatively, your tool might be missing the implicit initialization
> > of per-CPU variables.
> This used to be fine, but after rebasing to v4.20-rc2 I also started
> seeing strange reports on per-CPU variables. Taking a look.
> > Or maybe I am missing something.  If so, please let me know what it is.
> >
> > Thanx, Paul
> >
> > > Crash log 1
> > > =
> > > BUG: KMSAN: uninit-value in __rcu_process_callbacks
> > > kernel/rcu/tree.c:2838 [inline]
> > > BUG: KMSAN: uninit-value in rcu_process_callbacks+0x5ac/0x1cb0
> > > kernel/rcu/tree.c:2864
> > > CPU: 0 PID: 20 Comm: kauditd Not tainted 4.19.0-rc8+ #18
> > > Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 
> > > 01/01/2011
> > > Call Trace:
> > >  
> > >  __dump_stack lib/dump_stack.c:77 [inline]
> > >  dump_stack+0x305/0x460 lib/dump_stack.c:113
> > >  kmsan_report+0x1a2/0x2e0 mm/kmsan/kmsan.c:917
> > >  __msan_warning+0x7d/0xe0 mm/kmsan/kmsan_instr.c:500
> > >  __rcu_process_callbacks kernel/rcu/tree.c:2838 [inline]
> > >  rcu_process_callbacks+0x5ac/0x1cb0 kernel/rcu/tree.c:2864
> > >  __do_softirq+0x5ff/0xa55 kernel/softirq.c:292
> > >  invoke_softirq kernel/softirq.c:373 [inline]
> > >  irq_exit+0x22d/0x270 kernel/softirq.c:414
> > >  exiting_irq+0xe/0x10 arch/x86/include/asm/apic.h:536
> > >  smp_apic_timer_interrupt+0x64/0x90 arch/x86/kernel/apic/apic.c:1059
> > >  apic_timer_interrupt+0xf/0x20 arch/x86/entry/entry_64.S:869
> > >  
> > > RIP: 0010:finish_lock_switch+0x2b/0x40 kernel/sched/core.c:2578
> > > Code: 48 89 e5 53 48 89 fb e8 e3 43 9a 00 8b b8 88 0c 00 00 48 8b 00
> > > 48 85 c0 75 12 48 89 df e8 7d 38 9a 00 c6 00 00 c6 03 00 fb 5b <5d> c3
> > > 

Re: KMSAN: uninit-value in rcu_accelerate_cbs / KMSAN: uninit-value in rcu_process_callbacks

2018-11-14 Thread Paul E. McKenney
On Wed, Nov 14, 2018 at 04:03:33AM -0500, Kyungtae Kim wrote:
> We report two crashes in v4.19-rc8 (4.20-rc1 as well, I guess):
> (Unfortunately, there is no repro for those.)
> 
> The two crashes seem to share the same issue.
> In both cases, (uninitialized) memory access violation occurs
> when "rdp->cblist" is about to be accessed (kernel/rcu/tree.c:2838,1728).
> I guess those are freed before the use, but I still haven't figured
> out the reason why.
> I'm looking forward to some help.

You lost me on this one.  In both cases, rdp references a per-CPU
variable that is implicitly initialized to all zeroes, due to being
(sort of) a C-language global.

If a callback is queued early, then the following lines in __call_rcu()
will make an honest list of that field because of the following:

if (rcu_segcblist_empty(&rdp->cblist))
rcu_segcblist_init(&rdp->cblist);

Otherwise, when rcu_init() is invoked during early boot, we have this
in rcu_init_percpu_data(), which is called from rcutree_prepare_cpu()
which is called from rcu_init(), which is called from start_kernel():

if (rcu_segcblist_empty(&rdp->cblist) && /* No early-boot CBs? */
!init_nocb_callback_list(rdp))
rcu_segcblist_init(&rdp->cblist);  /* Re-enable callbacks. */

So either init_nocb_callback_list() initializes the alternative callback
lists for a no-CBs CPU or rcu_segcblist_init() again makes an honest
list of that field.

My guess is that your tool is missing the

rdp = this_cpu_ptr(rsp->rda);

in the __call_rcu() case, and also missing the

struct rcu_data *rdp = per_cpu_ptr(rsp->rda, cpu);

Note that the ->rda field is explicitly compile-time initialized to
the base address of the per-CPU variable, which is rcu_preempt_data,
rcu_bh_data, or rcu_sched_data, depending on which RCU flavor is at hand.
(In v4.20-rc1, these are all merged into a single flavor to rule them all.)

Alternatively, your tool might be missing the implicit initialization
of per-CPU variables.
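
For completeness, the implicit initialization works as for any other
C object with static storage duration; a sketch with made-up names,
not the RCU code itself:

#include <linux/percpu.h>

struct example_data {
	int counter;
};

/* Every CPU's instance starts out all-zeroes, no initializer needed. */
static DEFINE_PER_CPU(struct example_data, example_data);

static void bump_this_cpu(void)
{
	struct example_data *edp = this_cpu_ptr(&example_data);

	edp->counter++;	/* first increment operates on a zeroed field */
}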

Or maybe I am missing something.  If so, please let me know what it is.

Thanx, Paul

> Crash log 1
> =
> BUG: KMSAN: uninit-value in __rcu_process_callbacks
> kernel/rcu/tree.c:2838 [inline]
> BUG: KMSAN: uninit-value in rcu_process_callbacks+0x5ac/0x1cb0
> kernel/rcu/tree.c:2864
> CPU: 0 PID: 20 Comm: kauditd Not tainted 4.19.0-rc8+ #18
> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
> Call Trace:
>  
>  __dump_stack lib/dump_stack.c:77 [inline]
>  dump_stack+0x305/0x460 lib/dump_stack.c:113
>  kmsan_report+0x1a2/0x2e0 mm/kmsan/kmsan.c:917
>  __msan_warning+0x7d/0xe0 mm/kmsan/kmsan_instr.c:500
>  __rcu_process_callbacks kernel/rcu/tree.c:2838 [inline]
>  rcu_process_callbacks+0x5ac/0x1cb0 kernel/rcu/tree.c:2864
>  __do_softirq+0x5ff/0xa55 kernel/softirq.c:292
>  invoke_softirq kernel/softirq.c:373 [inline]
>  irq_exit+0x22d/0x270 kernel/softirq.c:414
>  exiting_irq+0xe/0x10 arch/x86/include/asm/apic.h:536
>  smp_apic_timer_interrupt+0x64/0x90 arch/x86/kernel/apic/apic.c:1059
>  apic_timer_interrupt+0xf/0x20 arch/x86/entry/entry_64.S:869
>  
> RIP: 0010:finish_lock_switch+0x2b/0x40 kernel/sched/core.c:2578
> Code: 48 89 e5 53 48 89 fb e8 e3 43 9a 00 8b b8 88 0c 00 00 48 8b 00
> 48 85 c0 75 12 48 89 df e8 7d 38 9a 00 c6 00 00 c6 03 00 fb 5b <5d> c3
> e8 de 42 9a 00 eb e7 66 66 66 2e 0f 1f 84 00 00 00 00 00 55
> RSP: 0018:88010622fca0 EFLAGS: 0286 ORIG_RAX: ff13
> RAX: 8801105bcc40 RBX: 8801061554c0 RCX: 8801105bdc40
> RDX: 8801105bdc40 RSI: b000 RDI: ea00077ec560
> RBP: 88010622fca0 R08: 7fff R09: 0002
> R10:  R11:  R12: 8800751cb880
> R13:  R14: 880106155db8 R15: 88013fcb9c40
>  finish_task_switch+0xe3/0x270 kernel/sched/core.c:2679
>  context_switch kernel/sched/core.c:2832 [inline]
>  __schedule+0x78f/0x8f0 kernel/sched/core.c:3479
>  schedule+0x1cc/0x300 kernel/sched/core.c:3523
>  kauditd_thread+0xc64/0xee0 kernel/audit.c:889
>  kthread+0x5b1/0x5f0 kernel/kthread.c:247
>  ret_from_fork+0x35/0x40 arch/x86/entry/entry_64.S:416
> 
> Uninit was created at:
>  kmsan_save_stack_with_flags mm/kmsan/kmsan.c:255 [inline]
>  kmsan_internal_alloc_meta_for_pages+0x157/0x730 mm/kmsan/kmsan.c:693
>  kmsan_alloc_page+0x80/0xe0 mm/kmsan/kmsan_hooks.c:320
>  __alloc_pages_nodemask+0x128c/0x69b0 mm/page_alloc.c:4416
>  alloc_pages_current+0x51f/0x760 mm/mempolicy.c:2093
>  alloc_pages include/linux/gfp.h:511 [inline]
>  alloc_slab_page mm/slub.c:1459 [inline]
>  allocate_slab mm/slub.c:1604 [inline]
>  new_slab+0x552/0x1f30 mm/slub.c:1675
>  new_slab_objects mm/slub.c:2438 [inline]
>  ___slab_alloc+0x1414/0x1dd0 mm/slub.c:2590
>  __slab_alloc mm/slub.c:2630 [inline]
>  slab_alloc_node mm/slub.c:2693 [inline]
>  slab_alloc mm/slub.c:2735 [inline]
>  

Re: [PATCH tip/core/rcu 20/41] kprobes: Replace synchronize_sched() with synchronize_rcu()

2018-11-13 Thread Paul E. McKenney
On Tue, Nov 13, 2018 at 10:08:36AM -0800, Masami Hiramatsu wrote:
> On Sun, 11 Nov 2018 19:19:16 -0800
> "Paul E. McKenney"  wrote:
> 
> > On Mon, Nov 12, 2018 at 12:00:48PM +0900, Masami Hiramatsu wrote:
> > > On Sun, 11 Nov 2018 11:43:49 -0800
> > > "Paul E. McKenney"  wrote:
> > > 
> > > > Now that synchronize_rcu() waits for preempt-disable regions of code
> > > > as well as RCU read-side critical sections, synchronize_sched() can be
> > > > replaced by synchronize_rcu().  This commit therefore makes this change.
> > > 
> > > Would you mean synchronize_rcu() can ensure that any interrupt handler
> > > (which should run under preempt-disable state) run out (even on 
> > > non-preemptive
> > > kernel)?
> > 
> > Yes, but only as of this merge window.  See this commit:
> > 
> > 3e3100989869 ("rcu: Defer reporting RCU-preempt quiescent states when 
> > disabled")
> 
> OK, I also found that now those are same.
> 
> 45975c7d21a1 ("rcu: Define RCU-sched API in terms of RCU for Tree RCU PREEMPT 
> builds")
> 
> Acked-by: Masami Hiramatsu 

Applied, thank you!

Thanx, Paul

> Thank you!
> 
> > 
> > Don't try this in v4.19 or earlier, but v4.20 and later is OK.  ;-)
> > 
> > Thanx, Paul
> > 
> > > If so, I agree with these changes.
> > > 
> > > Thank you,
> > > 
> > > > 
> > > > Signed-off-by: Paul E. McKenney 
> > > > Cc: "Naveen N. Rao" 
> > > > Cc: Anil S Keshavamurthy 
> > > > Cc: "David S. Miller" 
> > > > Cc: Masami Hiramatsu 
> > > > ---
> > > >  kernel/kprobes.c | 10 +-
> > > >  1 file changed, 5 insertions(+), 5 deletions(-)
> > > > 
> > > > diff --git a/kernel/kprobes.c b/kernel/kprobes.c
> > > > index 90e98e233647..08e31d863191 100644
> > > > --- a/kernel/kprobes.c
> > > > +++ b/kernel/kprobes.c
> > > > @@ -229,7 +229,7 @@ static int collect_garbage_slots(struct 
> > > > kprobe_insn_cache *c)
> > > > struct kprobe_insn_page *kip, *next;
> > > >  
> > > > /* Ensure no-one is interrupted on the garbages */
> > > > -   synchronize_sched();
> > > > +   synchronize_rcu();
> > > >  
> > > > list_for_each_entry_safe(kip, next, &c->pages, list) {
> > > > int i;
> > > > @@ -1382,7 +1382,7 @@ static int register_aggr_kprobe(struct kprobe 
> > > > *orig_p, struct kprobe *p)
> > > > if (ret) {
> > > > ap->flags |= KPROBE_FLAG_DISABLED;
> > > > list_del_rcu(&ap->list);
> > > > -   synchronize_sched();
> > > > +   synchronize_rcu();
> > > > }
> > > > }
> > > > }
> > > > @@ -1597,7 +1597,7 @@ int register_kprobe(struct kprobe *p)
> > > > ret = arm_kprobe(p);
> > > > if (ret) {
> > > > hlist_del_rcu(&p->hlist);
> > > > -   synchronize_sched();
> > > > +   synchronize_rcu();
> > > > goto out;
> > > > }
> > > > }
> > > > @@ -1776,7 +1776,7 @@ void unregister_kprobes(struct kprobe **kps, int 
> > > > num)
> > > > kps[i]->addr = NULL;
> > > > mutex_unlock(&kprobe_mutex);
> > > >  
> > > > -   synchronize_sched();
> > > > +   synchronize_rcu();
> > > > for (i = 0; i < num; i++)
> > > > if (kps[i]->addr)
> > > > __unregister_kprobe_bottom(kps[i]);
> > > > @@ -1966,7 +1966,7 @@ void unregister_kretprobes(struct kretprobe 
> > > > **rps, int num)
> > > > rps[i]->kp.addr = NULL;
> > > > mutex_unlock(&kprobe_mutex);
> > > >  
> > > > -   synchronize_sched();
> > > > +   synchronize_rcu();
> > > > for (i = 0; i < num; i++) {
> > > > if (rps[i]->kp.addr) {
> > > > __unregister_kprobe_bottom(&rps[i]->kp);
> > > > -- 
> > > > 2.17.1
> > > > 
> > > 
> > > 
> > > -- 
> > > Masami Hiramatsu 
> > > 
> > 
> 
> 
> -- 
> Masami Hiramatsu 
> 



Re: [PATCH tip/core/rcu 25/41] workqueue: Replace call_rcu_sched() with call_rcu()

2018-11-13 Thread Paul E. McKenney
On Tue, Nov 13, 2018 at 07:48:03AM -0800, Tejun Heo wrote:
> On Sun, Nov 11, 2018 at 11:43:54AM -0800, Paul E. McKenney wrote:
> > Now that call_rcu()'s callback is not invoked until after all
> > preempt-disable regions of code have completed (in addition to explicitly
> > marked RCU read-side critical sections), call_rcu() can be used in place
> > of call_rcu_sched().  This commit therefore makes that change.
> > 
> > Signed-off-by: Paul E. McKenney 
> > Cc: Tejun Heo 
> > Cc: Lai Jiangshan 
> 
> Acked-by: Tejun Heo 

Applied all four, thank you!

Thanx, Paul



Re: [REGRESSION 4.20-rc1] 45975c7d21a1 ("rcu: Define RCU-sched API in terms of RCU for Tree RCU PREEMPT builds")

2018-11-13 Thread Paul E. McKenney
On Tue, Nov 13, 2018 at 03:54:53PM +0200, Ville Syrjälä wrote:
> Hi Paul,
> 
> After 4.20-rc1 some of my 32bit UP machines no longer reboot/shutdown.
> I bisected this down to commit 45975c7d21a1 ("rcu: Define RCU-sched
> API in terms of RCU for Tree RCU PREEMPT builds").
> 
> I traced the hang into
> -> cpufreq_suspend()
>  -> cpufreq_stop_governor()
>   -> cpufreq_dbs_governor_stop()
>-> gov_clear_update_util()
> -> synchronize_sched()
>  -> synchronize_rcu()
> 
> Only PREEMPT=y is affected for obvious reasons, but that couldn't
> explain why the same UP kernel booted on an SMP machine worked fine.
> Eventually I realized that the difference between working and
> non-working machine was IOAPIC vs. PIC. With initcall_debug I saw
> that we mask everything in the PIC before cpufreq is shut down,
> and came up with the following fix:
> 
> diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
> index 7aa3dcad2175..f88bf3c77fc0 100644
> --- a/drivers/cpufreq/cpufreq.c
> +++ b/drivers/cpufreq/cpufreq.c
> @@ -2605,4 +2605,4 @@ static int __init cpufreq_core_init(void)
> return 0;
>  }
>  module_param(off, int, 0444);
> -core_initcall(cpufreq_core_init);
> +late_initcall(cpufreq_core_init);

Thank you for testing this and tracking it down!

I am glad that you have a fix, but I hope that we can arrive at a less
constraining one.

> Here's the resulting change in initcall_debug:
>   pci :00:00.1: shutdown
>   hub 4-0:1.0: hub_ext_port_status failed (err = -110)
>   agpgart-intel :00:00.0: shutdown
> + PM: Calling cpufreq_suspend+0x0/0x100
>   PM: Calling mce_syscore_shutdown+0x0/0x10
>   PM: Calling i8259A_shutdown+0x0/0x10
> - PM: Calling cpufreq_suspend+0x0/0x100
> + reboot: Restarting system
> + reboot: machine restart
> 
> I didn't really look into what other ramifications the cpufreq
> initcall change might have. cpufreq_global_kobject worries
> me a bit. Maybe that one has to remain in core_initcall() and
> we could just move the suspend to late_initcall()? Anyways,
> I figured I'd leave this for someone more familiar with the
> code to figure out ;) 

Let me guess...

When the system suspends or shuts down, there comes a point after which
there is only a single CPU that is running with preemption and interrupts
are disabled.  At this point, RCU must change the way that it works, and
the commit you bisected to would make the change more necessary.  But if
I am guessing correctly, we have just been getting lucky in the past.

It looks like RCU needs to create a struct syscore_ops with a shutdown
function and pass this to register_syscore_ops().  Maybe a suspend
function as well.  And RCU needs to invoke register_syscore_ops() at
a time that causes RCU's shutdown function to be invoked in the right
order with respect to the other work in flight.  The hope would be that
RCU's suspend function gets called just as the system transitions into
a mode where the scheduler is no longer active, give or take.

Does this make sense, or am I confused?

Thanx, Paul



Re: [PATCH tip/core/rcu 23/41] sched: Replace synchronize_sched() with synchronize_rcu()

2018-11-12 Thread Paul E. McKenney
On Mon, Nov 12, 2018 at 07:17:41PM +0100, Peter Zijlstra wrote:
> On Mon, Nov 12, 2018 at 05:28:52AM -0800, Paul E. McKenney wrote:
> > On Mon, Nov 12, 2018 at 10:00:47AM +0100, Peter Zijlstra wrote:
> 
> > > Still, better safe than sorry. It was a rather big change in behaviour,
> > > so it wouldn't have been strange to call that out.
> > 
> > This guy:
> > 
> > 45975c7d21a1 ("rcu: Define RCU-sched API in terms of RCU for Tree RCU 
> > PREEMPT builds")
> > 
> > Has a commit log that says:
> > 
> > Now that RCU-preempt knows about preemption disabling, its
> > implementation of synchronize_rcu() works for synchronize_sched(),
> > and likewise for the other RCU-sched update-side API members.
> > This commit therefore confines the RCU-sched update-side code
> > to CONFIG_PREEMPT=n builds, and defines RCU-sched's update-side
> > API members in terms of those of RCU-preempt.
> > 
> > That last phrase seems pretty explicit.  What am I missing here?
> 
> That does not explicitly state that because RCU-preempt
> synchronize_rcu() can take _much_ longer, the new synchronize_sched()
> can now take _much_ longer too.
> 
> So when someone bisects a problem to this commit and he reads the
> Changelog, he might get the impression that this was unexpected.

Of course, a preempt_disable() section of code can still be preempted
by the underlying hypervisor, so in a surprisingly large fraction of
the installed base, there really isn't that much difference.

> > Not that it matters, given that I know of no way to change a mainlined
> > commit log.  I suppose I could ask Jon if he would be willing to take
> > a 2018 RCU API LWN article, if that would help.
> 
> Yes, it is water under the bridge; but Changelogs should be explicit
> about behavioural changes.
> 
> And while the merged RCU has the semantic behaviour required, the timing
> behaviour did change significantly.

When running on bare metal, potentially.  From what I see, preemption
of RCU read-side critical sections is the exception rather than the rule.
And again, when running on hypervisors, even irq-disable regions of code
can be preempted.  (And yes, there is work in flight to allow RCU to deal
with this.)

> > > > > Again, the patch didn't say that.
> > > > > 
> > > > > If the Changelog would've read something like:
> > > > > 
> > > > > "Since synchronize_sched() is now equivalent to synchronize_rcu(),
> > > > > replace the synchronize_sched() usage such that we can eventually 
> > > > > remove
> > > > > the interface."
> > > > > 
> > > > > It would've been clear that the patch is a nop and what the purpose
> > > > > was.
> > > > 
> > > > I can easily make that change.
> > > 
> > > Please, sufficient doesn't imply necessary etc.. A changelog should
> > > always clarify why we do the patch.
> > 
> > ???  Did you mean to say "necessary doesn't imply sufficient"?  If so,
> > what else do you feel is missing?
> 
> No, I meant to say that your original Changelog only states that
> sync_rcu now covers rcu-sched behaviour.  Which means that the change is
> sufficient.
> 
> It completely and utterly fails to explain _why_ you're doing the
> change. Ie. you do not address why it is necessary.
> 
> A Changelog should always explain why the change is needed.
> 
> In this case because you want to get rid of the sync_sched() api.

Right, which is stated in your suggested wording above.  So I am still
not seeing what you want added to this:

"Since synchronize_sched() is now equivalent to synchronize_rcu(),
replace the synchronize_sched() usage such that we can eventually
remove the interface."
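
For concreteness, the conversions follow this mechanical pattern;
struct item and remove_item() are made-up illustrative names, not
taken from any of the patches:

#include <linux/rculist.h>
#include <linux/slab.h>

struct item {
	struct list_head list;
	int payload;
};

static void remove_item(struct item *it)
{
	list_del_rcu(&it->list);
	/*
	 * This was synchronize_sched().  As of v4.20, synchronize_rcu()
	 * also waits for preempt-disabled regions of code, so the two
	 * are equivalent and the old API member can eventually go away.
	 */
	synchronize_rcu();
	kfree(it);
}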

Thanx, Paul



Re: [PATCH tip/core/rcu 0/41] More RCU flavor consolidation cleanup for v4.21/v5.0

2018-11-12 Thread Paul E. McKenney
On Mon, Nov 12, 2018 at 04:40:23PM -0500, Sasha Levin wrote:
> On Mon, Nov 12, 2018 at 08:01:37AM -0800, Paul E. McKenney wrote:
> >On Mon, Nov 12, 2018 at 09:07:50AM -0500, Mathieu Desnoyers wrote:
> >>- On Nov 11, 2018, at 2:41 PM, paulmck paul...@linux.ibm.com wrote:
> >>
> >>> Hello!
> >>>
> >>> This series does additional cleanup for the RCU flavor consolidation,
> >>> focusing primarily on uses of old API members, for example, so that
> >>> call_rcu_bh() becomes call_rcu().  There are also a few straggling
> >>> internal-to-RCU cleanups.
> >>>
> >>> 1.  Remove unused rcu_state externs, courtesy of Joel Fernandes.
> >>>
> >>> 2.  Fix rcu_{node,data} comments about gp_seq_needed, courtesy of
> >>>   Joel Fernandes.
> >>>
> >>> 3.  Eliminate synchronize_rcu_mult() and its sole caller.
> >>>
> >>> 4.  Consolidate the RCU update functions invoked by sync.c.
> >>>
> >>> 5-41. Replace old flavorful RCU API calls with the corresponding
> >>>   vanilla calls.
> >>
> >>Hi Paul,
> >>
> >>Just a heads up: we might want to spell out warnings in very big letters
> >>for anyone trying to backport code using RCU from post-4.21 kernels
> >>back to older kernels. I fear that newer code will build just fine
> >>on older kernels, but will spectacularly fail in hard-to-debug ways at
> >>runtime.
> >>
> >>Renaming synchronize_rcu() and call_rcu() to something that did not
> >>exist in prior kernels would prevent that. It may not be as pretty
> >>though.
> >
> >From v4.20 rather than v4.21, but yes.  Would it make sense to have Sasha
> >automatically flag -stable candidates going back past that boundary that
> >contain call_rcu(), synchronize_rcu(), etc.?  Adding Sasha on CC, and
> >I might be able to touch base with him this week.
> 
> We had a similar issue recently with a vfs change
> (https://patchwork.kernel.org/patch/10604339/) leading to potentially
> the same results as described above, we took it as is to avoid these
> issues in the future, though this is a much smaller change than what's
> proposed here.
> 
> We can look into a good way to solve this. While I can alert on
> post-4.20 stable-tagged patches that touch rcu, do you really want to be
> dealing with this for the next 10+ years? It'll also mean each of those
> patches will need a manual backport.
> 
> Let's talk at Plumbers :)

Sounds like a plan!  ;-)

Thanx, Paul



Re: [PATCH tip/core/rcu 24/41] modules: Replace synchronize_sched() and call_rcu_sched()

2018-11-12 Thread Paul E. McKenney
On Mon, Nov 12, 2018 at 01:48:52PM +0100, Jessica Yu wrote:
> +++ Paul E. McKenney [11/11/18 11:43 -0800]:
> >Now that synchronize_rcu() waits for preempt-disable regions of code
> >as well as RCU read-side critical sections, synchronize_sched() can
> >be replaced by synchronize_rcu().  Similarly, call_rcu_sched() can be
> >replaced by call_rcu().  This commit therefore makes these changes.
> >
> >Signed-off-by: Paul E. McKenney 
> >Cc: Jessica Yu 
> 
> Acked-by: Jessica Yu 

Applied, thank you!

Thanx, Paul

> Thanks!
> 
> >---
> >kernel/module.c | 14 +++---
> >1 file changed, 7 insertions(+), 7 deletions(-)
> >
> >diff --git a/kernel/module.c b/kernel/module.c
> >index 49a405891587..99b46c32d579 100644
> >--- a/kernel/module.c
> >+++ b/kernel/module.c
> >@@ -2159,7 +2159,7 @@ static void free_module(struct module *mod)
> > /* Remove this module from bug list, this uses list_del_rcu */
> > module_bug_cleanup(mod);
> > /* Wait for RCU-sched synchronizing before releasing mod->list and 
> > buglist. */
> >-synchronize_sched();
> >+synchronize_rcu();
> > mutex_unlock(&module_mutex);
> >
> > /* This may be empty, but that's OK */
> >@@ -3507,15 +3507,15 @@ static noinline int do_init_module(struct module 
> >*mod)
> > /*
> >  * We want to free module_init, but be aware that kallsyms may be
> >  * walking this with preempt disabled.  In all the failure paths, we
> >- * call synchronize_sched(), but we don't want to slow down the success
> >+ * call synchronize_rcu(), but we don't want to slow down the success
> >  * path, so use actual RCU here.
> >  * Note that module_alloc() on most architectures creates W+X page
> >  * mappings which won't be cleaned up until do_free_init() runs.  Any
> >  * code such as mark_rodata_ro() which depends on those mappings to
> >  * be cleaned up needs to sync with the queued work - ie
> >- * rcu_barrier_sched()
> >+ * rcu_barrier()
> >  */
> >-call_rcu_sched(&mod->rcu, do_free_init);
> >+call_rcu(&mod->rcu, do_free_init);
> > mutex_unlock(&module_mutex);
> > wake_up_all(&module_wq);
> >
> >@@ -3526,7 +3526,7 @@ static noinline int do_init_module(struct module *mod)
> >fail:
> > /* Try to protect us from buggy refcounters. */
> > mod->state = MODULE_STATE_GOING;
> >-synchronize_sched();
> >+synchronize_rcu();
> > module_put(mod);
> > blocking_notifier_call_chain(&module_notify_list,
> >  MODULE_STATE_GOING, mod);
> >@@ -3819,7 +3819,7 @@ static int load_module(struct load_info *info, const 
> >char __user *uargs,
> > ddebug_cleanup:
> > ftrace_release_mod(mod);
> > dynamic_debug_remove(mod, info->debug);
> >-synchronize_sched();
> >+synchronize_rcu();
> > kfree(mod->args);
> > free_arch_cleanup:
> > module_arch_cleanup(mod);
> >@@ -3834,7 +3834,7 @@ static int load_module(struct load_info *info, const 
> >char __user *uargs,
> > mod_tree_remove(mod);
> > wake_up_all(&module_wq);
> > /* Wait for RCU-sched synchronizing before releasing mod->list. */
> >-synchronize_sched();
> >+synchronize_rcu();
> > mutex_unlock(&module_mutex);
> > free_module:
> > /* Free lock-classes; relies on the preceding sync_rcu() */
> >-- 
> >2.17.1
> >
> 



Re: [PATCH tip/core/rcu 0/41] More RCU flavor consolidation cleanup for v4.21/v5.0

2018-11-12 Thread Paul E. McKenney
On Mon, Nov 12, 2018 at 09:07:50AM -0500, Mathieu Desnoyers wrote:
> - On Nov 11, 2018, at 2:41 PM, paulmck paul...@linux.ibm.com wrote:
> 
> > Hello!
> > 
> > This series does additional cleanup for the RCU flavor consolidation,
> > focusing primarily on uses of old API members, for example, so that
> > call_rcu_bh() becomes call_rcu().  There are also a few straggling
> > internal-to-RCU cleanups.
> > 
> > 1.  Remove unused rcu_state externs, courtesy of Joel Fernandes.
> > 
> > 2.  Fix rcu_{node,data} comments about gp_seq_needed, courtesy of
> > Joel Fernandes.
> > 
> > 3.  Eliminate synchronize_rcu_mult() and its sole caller.
> > 
> > 4.  Consolidate the RCU update functions invoked by sync.c.
> > 
> > 5-41.   Replace old flavorful RCU API calls with the corresponding
> > vanilla calls.
> 
> Hi Paul,
> 
> Just a heads up: we might want to spell out warnings in very big letters
> for anyone trying to backport code using RCU from post-4.21 kernels
> back to older kernels. I fear that newer code will build just fine
> on older kernels, but will spectacularly fail in hard-to-debug ways at
> runtime.
> 
> Renaming synchronize_rcu() and call_rcu() to something that did not
> exist in prior kernels would prevent that. It may not be as pretty
> though.

From v4.20 rather than v4.21, but yes.  Would it make sense to have Sasha
automatically flag -stable candidates going back past that boundary that
contain call_rcu(), synchronize_rcu(), etc.?  Adding Sasha on CC, and
I might be able to touch base with him this week.

Thanx, Paul



Re: [PATCH tip/core/rcu 23/41] sched: Replace synchronize_sched() with synchronize_rcu()

2018-11-12 Thread Paul E. McKenney
On Mon, Nov 12, 2018 at 10:00:47AM +0100, Peter Zijlstra wrote:
> On Sun, Nov 11, 2018 at 06:24:55PM -0800, Paul E. McKenney wrote:
> 
> > > > There were quite a few commits involved in making this happen.  Perhaps
> > > > the most pertinent are these:
> > > > 
> > > > 3e3100989869 ("rcu: Defer reporting RCU-preempt quiescent states when 
> > > > disabled")
> > > > 45975c7d21a1 ("rcu: Define RCU-sched API in terms of RCU for Tree RCU 
> > > > PREEMPT builds")
> > > 
> > > The latter; it does not mention that this will possibly make
> > > synchronize_sched() quite a bit more expensive on PREEMPT=y builds :/
> > 
> > In theory, sure.  In practice, people have switched any number of
> > things from RCU-sched to RCU and back without problems.
> 
> Still, better safe than sorry. It was a rather big change in behaviour,
> so it wouldn't have been strange to call that out.

This guy:

45975c7d21a1 ("rcu: Define RCU-sched API in terms of RCU for Tree RCU PREEMPT 
builds")

Has a commit log that says:

Now that RCU-preempt knows about preemption disabling, its
implementation of synchronize_rcu() works for synchronize_sched(),
and likewise for the other RCU-sched update-side API members.
This commit therefore confines the RCU-sched update-side code
to CONFIG_PREEMPT=n builds, and defines RCU-sched's update-side
API members in terms of those of RCU-preempt.

That last phrase seems pretty explicit.  What am I missing here?

Not that it matters, given that I know of no way to change a mainlined
commit log.  I suppose I could ask Jon if he would be willing to take
a 2018 RCU API LWN article, if that would help.

> > > But for PREEMPT=y synchronize_sched() can be quite a bit shorter than
> > > synchronize_rcu(), since we don't have to wait for preempted read side
> > > stuff.
> > 
> > Again, there are quite a few places that have managed that transition
> > without issue.  Why do you expect this change to have problems that have
> > not been seen elsewhere?
> 
> I'm not, I'm just taking issue with the Changelog.

OK, good.

> > > Again, the patch didn't say that.
> > > 
> > > If the Changelog would've read something like:
> > > 
> > > "Since synchronize_sched() is now equivalent to synchronize_rcu(),
> > > replace the synchronize_sched() usage such that we can eventually remove
> > > the interface."
> > > 
> > > It would've been clear that the patch is a nop and what the purpose
> > > was.
> > 
> > I can easily make that change.
> 
> Please, sufficient doesn't imply necessary etc.. A changelog should
> always clarify why we do the patch.

???  Did you mean to say "necessary doesn't imply sufficient"?  If so,
what else do you feel is missing?

If not, color me confused.

Thanx, Paul



Re: [PATCH tip/core/rcu 20/41] kprobes: Replace synchronize_sched() with synchronize_rcu()

2018-11-11 Thread Paul E. McKenney
On Mon, Nov 12, 2018 at 12:00:48PM +0900, Masami Hiramatsu wrote:
> On Sun, 11 Nov 2018 11:43:49 -0800
> "Paul E. McKenney"  wrote:
> 
> > Now that synchronize_rcu() waits for preempt-disable regions of code
> > as well as RCU read-side critical sections, synchronize_sched() can be
> > replaced by synchronize_rcu().  This commit therefore makes this change.
> 
> Do you mean that synchronize_rcu() can ensure that any interrupt handler
> (which should run in a preempt-disabled state) has run to completion (even on
> a non-preemptive kernel)?

Yes, but only as of this merge window.  See this commit:

3e3100989869 ("rcu: Defer reporting RCU-preempt quiescent states when disabled")

Don't try this in v4.19 or earlier, but v4.20 and later is OK.  ;-)
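
For illustration, here is a minimal sketch (made-up names, not taken from
the kprobes code) of the pattern this enables:

#include <linux/rculist.h>
#include <linux/rcupdate.h>
#include <linux/slab.h>

struct foo {
	struct hlist_node hlist;
	int data;
};

static HLIST_HEAD(foo_list);

/* Reader: runs in an interrupt handler, hence with preemption disabled. */
static void foo_handle_irq(void)
{
	struct foo *p;

	hlist_for_each_entry_rcu(p, &foo_list, hlist)
		foo_process(p->data);	/* foo_process() is hypothetical. */
}

/* Updater: safe on v4.20 and later only. */
static void foo_remove(struct foo *p)
{
	hlist_del_rcu(&p->hlist);
	synchronize_rcu();	/* Now also waits for preempt-disable regions. */
	kfree(p);
}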

Thanx, Paul

> If so, I agree with these changes.
> 
> Thank you,
> 
> > 
> > Signed-off-by: Paul E. McKenney 
> > Cc: "Naveen N. Rao" 
> > Cc: Anil S Keshavamurthy 
> > Cc: "David S. Miller" 
> > Cc: Masami Hiramatsu 
> > ---
> >  kernel/kprobes.c | 10 +-
> >  1 file changed, 5 insertions(+), 5 deletions(-)
> > 
> > diff --git a/kernel/kprobes.c b/kernel/kprobes.c
> > index 90e98e233647..08e31d863191 100644
> > --- a/kernel/kprobes.c
> > +++ b/kernel/kprobes.c
> > @@ -229,7 +229,7 @@ static int collect_garbage_slots(struct 
> > kprobe_insn_cache *c)
> > struct kprobe_insn_page *kip, *next;
> >  
> > /* Ensure no-one is interrupted on the garbages */
> > -   synchronize_sched();
> > +   synchronize_rcu();
> >  
> > list_for_each_entry_safe(kip, next, &c->pages, list) {
> > int i;
> > @@ -1382,7 +1382,7 @@ static int register_aggr_kprobe(struct kprobe 
> > *orig_p, struct kprobe *p)
> > if (ret) {
> > ap->flags |= KPROBE_FLAG_DISABLED;
> > -   list_del_rcu(&ap->list);
> > -   synchronize_sched();
> > +   synchronize_rcu();
> > }
> > }
> > }
> > @@ -1597,7 +1597,7 @@ int register_kprobe(struct kprobe *p)
> > ret = arm_kprobe(p);
> > if (ret) {
> > -   hlist_del_rcu(&p->hlist);
> > -   synchronize_sched();
> > +   synchronize_rcu();
> > goto out;
> > }
> > }
> > @@ -1776,7 +1776,7 @@ void unregister_kprobes(struct kprobe **kps, int num)
> > kps[i]->addr = NULL;
> > mutex_unlock(&kprobe_mutex);
> >  
> > -   synchronize_sched();
> > +   synchronize_rcu();
> > for (i = 0; i < num; i++)
> > if (kps[i]->addr)
> > __unregister_kprobe_bottom(kps[i]);
> > @@ -1966,7 +1966,7 @@ void unregister_kretprobes(struct kretprobe **rps, 
> > int num)
> > rps[i]->kp.addr = NULL;
> > mutex_unlock(&kprobe_mutex);
> >  
> > -   synchronize_sched();
> > +   synchronize_rcu();
> > for (i = 0; i < num; i++) {
> > if (rps[i]->kp.addr) {
> > __unregister_kprobe_bottom(&rps[i]->kp);
> > -- 
> > 2.17.1
> > 
> 
> 
> -- 
> Masami Hiramatsu 
> 



Re: [PATCH tip/core/rcu 23/41] sched: Replace synchronize_sched() with synchronize_rcu()

2018-11-11 Thread Paul E. McKenney
On Mon, Nov 12, 2018 at 03:07:10AM +0100, Peter Zijlstra wrote:
> On Sun, Nov 11, 2018 at 05:47:36PM -0800, Paul E. McKenney wrote:
> > On Mon, Nov 12, 2018 at 01:53:29AM +0100, Peter Zijlstra wrote:
> > > On Sun, Nov 11, 2018 at 04:45:28PM -0800, Paul E. McKenney wrote:
> > > > On Mon, Nov 12, 2018 at 01:12:33AM +0100, Peter Zijlstra wrote:
> > > > > On Sun, Nov 11, 2018 at 11:43:52AM -0800, Paul E. McKenney wrote:
> > > > > > Now that synchronize_rcu() waits for preempt-disable regions of code
> > > > > > as well as RCU read-side critical sections, synchronize_sched() can 
> > > > > > be
> > > > > > replaced by synchronize_rcu().  This commit therefore makes this 
> > > > > > change.
> > > > > 
> > > > > Yes, but it also waits for an actual RCU quiescent state, which makes
> > > > > synchronize_rcu() potentially much more expensive than an actual
> > > > > synchronize_sched().
> > > > 
> > > > None of the readers have changed.
> > > > 
> > > > For the updaters, if CONFIG_PREEMPT=n, synchronize_rcu() and
> > > > synchronize_sched() always were one and the same.  When 
> > > > CONFIG_PREEMPT=y,
> > > > synchronize_rcu() and synchronize_sched() are now one and the same.
> > > 
> > > The Changelog does not state this; and does the commit that makes that
> > > happen state the regression potential?
> > 
> > The Changelog says this:
> > 
> > Now that synchronize_rcu() waits for preempt-disable
> > regions of code as well as RCU read-side critical sections,
> > synchronize_sched() can be replaced by synchronize_rcu().
> > This commit therefore makes this change.
> > 
> > The "synchronize_rcu() waits for preempt-disable regions of code as
> > well as RCU read-side critical sections" seems pretty unambiguous to me.
> > Exactly what more are you wanting said there?
> 
> The quoted bit only states that synchronize_rcu() is sufficient; it does
> not say it is equivalent and the patch is a nop. It also doesn't say
> that the purpose is to get rid of the synchronize_sched() function.
> 
> > There were quite a few commits involved in making this happen.  Perhaps
> > the most pertinent are these:
> > 
> > 3e3100989869 ("rcu: Defer reporting RCU-preempt quiescent states when 
> > disabled")
> > 45975c7d21a1 ("rcu: Define RCU-sched API in terms of RCU for Tree RCU 
> > PREEMPT builds")
> 
> The latter; it does not mention that this will possible make
> synchronize_sched() quite a bit more expensive on PREEMPT=y builds :/

In theory, sure.  In practice, people have switched any number of
things from RCU-sched to RCU and back without problems.

> > Normal grace periods are almost always quite long compared to typical
> > read-side critical sections, preempt-disable regions of code, and so on.
> > So in the common case this should be OK.  Or are you instead worried
> > about synchronize_sched_expedited()?
> 
> No, I still feel expedited should not exist at all ;-)

I figured as much.  ;-)

> But for PREEMPT=y synchronize_sched() can be quite a bit shorter than
> synchronize_rcu(), since we don't have to wait for preempted read side
> stuff.

Again, there are quite a few places that have managed that transition
without issue.  Why do you expect this change to have problems that have
not been seen elsewhere?

> > > > > So why are we doing this?
> > > > 
> > > > Given that synchronize_rcu() and synchronize_sched() are now always one
> > > > and the same, this is a distinction without a difference.
> > > 
> > > The Changelog did not state a reason for the patch. Therefore it is a
> > > bad patch.
> > 
> > ???  Here is the current definition of synchronize_sched() in mainline:
> > 
> > static inline void synchronize_sched(void)
> > {
> > synchronize_rcu();
> > }
> 
> Again, the patch didn't say that.
> 
> If the Changelog would've read something like:
> 
> "Since synchronize_sched() is now equivalent to synchronize_rcu(),
> replace the synchronize_sched() usage such that we can eventually remove
> the interface."
> 
> It would've been clear that the patch is a nop and what the purpose
> was.

I can easily make that change.

Thanx, Paul



Re: [PATCH tip/core/rcu 1/4] rcu: Eliminate BUG_ON() for sync.c

2018-11-11 Thread Paul E. McKenney
On Sun, Nov 11, 2018 at 09:07:04PM -0500, Steven Rostedt wrote:
> On Sun, 11 Nov 2018 11:32:14 -0800
> "Paul E. McKenney"  wrote:
> 
> > The sync.c file has a number of calls to BUG_ON(), which panics the
> > kernel, which is not a good strategy for devices (like embedded) that
> > don't have a way to capture console output.  This commit therefore
> > changes these BUG_ON() calls to WARN_ON_ONCE(), but does so quite naively.
> > 
> > Reported-by: Linus Torvalds 
> > Signed-off-by: Paul E. McKenney 
> > Acked-by: Oleg Nesterov 
> > Cc: Peter Zijlstra 
> > ---
> >  kernel/rcu/sync.c | 13 ++---
> >  1 file changed, 6 insertions(+), 7 deletions(-)
> > 
> > diff --git a/kernel/rcu/sync.c b/kernel/rcu/sync.c
> > index 3f943efcf61c..a6ba446a9693 100644
> > --- a/kernel/rcu/sync.c
> > +++ b/kernel/rcu/sync.c
> > @@ -125,8 +125,7 @@ void rcu_sync_enter(struct rcu_sync *rsp)
> > rsp->gp_state = GP_PENDING;
> > spin_unlock_irq(&rsp->rss_lock);
> >  
> > -   BUG_ON(need_wait && need_sync);
> > -
> > +   WARN_ON_ONCE(need_wait && need_sync);
> > if (need_sync) {
> > gp_ops[rsp->gp_type].sync();
> > rsp->gp_state = GP_PASSED;
> > @@ -139,7 +138,7 @@ void rcu_sync_enter(struct rcu_sync *rsp)
> >  * Nobody has yet been allowed the 'fast' path and thus we can
> >  * avoid doing any sync(). The callback will get 'dropped'.
> >  */
> > -   BUG_ON(rsp->gp_state != GP_PASSED);
> > +   WARN_ON_ONCE(rsp->gp_state != GP_PASSED);
> > }
> >  }
> >  
> > @@ -166,8 +165,8 @@ static void rcu_sync_func(struct rcu_head *rhp)
> > struct rcu_sync *rsp = container_of(rhp, struct rcu_sync, cb_head);
> > unsigned long flags;
> >  
> > -   BUG_ON(rsp->gp_state != GP_PASSED);
> > -   BUG_ON(rsp->cb_state == CB_IDLE);
> > +   WARN_ON_ONCE(rsp->gp_state != GP_PASSED);
> > +   WARN_ON_ONCE(rsp->cb_state == CB_IDLE);
> >  
> > spin_lock_irqsave(&rsp->rss_lock, flags);
> > if (rsp->gp_count) {
> > @@ -225,7 +224,7 @@ void rcu_sync_dtor(struct rcu_sync *rsp)
> >  {
> > int cb_state;
> >  
> > -   BUG_ON(rsp->gp_count);
> > +   WARN_ON_ONCE(rsp->gp_count);
> >  
> > spin_lock_irq(&rsp->rss_lock);
> > if (rsp->cb_state == CB_REPLAY)
> > @@ -235,6 +234,6 @@ void rcu_sync_dtor(struct rcu_sync *rsp)
> >  
> > if (cb_state != CB_IDLE) {
> > gp_ops[rsp->gp_type].wait();
> > -   BUG_ON(rsp->cb_state != CB_IDLE);
> > +   WARN_ON_ONCE(rsp->cb_state != CB_IDLE);
> > }
> >  }
> 
> I take it that if any of these WARN_ON_ONCE() calls triggers, it won't cause
> immediate catastrophe, and/or there's no gentle way out like you have
> with the other patches exiting the function when one is hit.

Oleg was actually OK with removing them entirely:

"I added these BUG_ON's for documentation when I was prototyping
this code, perhaps we can simply remove them."

And they are "cannot happen" types of things (famous last words).
Oleg also has another approach that could rip-and-replace the current
implementation, which would render these WARN*()s moot.

Thanx, Paul



Re: [PATCH tip/core/rcu 23/41] sched: Replace synchronize_sched() with synchronize_rcu()

2018-11-11 Thread Paul E. McKenney
On Mon, Nov 12, 2018 at 01:53:29AM +0100, Peter Zijlstra wrote:
> On Sun, Nov 11, 2018 at 04:45:28PM -0800, Paul E. McKenney wrote:
> > On Mon, Nov 12, 2018 at 01:12:33AM +0100, Peter Zijlstra wrote:
> > > On Sun, Nov 11, 2018 at 11:43:52AM -0800, Paul E. McKenney wrote:
> > > > Now that synchronize_rcu() waits for preempt-disable regions of code
> > > > as well as RCU read-side critical sections, synchronize_sched() can be
> > > > replaced by synchronize_rcu().  This commit therefore makes this change.
> > > 
> > > Yes, but it also waits for an actual RCU quiescent state, which makes
> > > synchronize_rcu() potentially much more expensive than an actual
> > > synchronize_sched().
> > 
> > None of the readers have changed.
> > 
> > For the updaters, if CONFIG_PREEMPT=n, synchronize_rcu() and
> > synchronize_sched() always were one and the same.  When CONFIG_PREEMPT=y,
> > synchronize_rcu() and synchronize_sched() are now one and the same.
> 
> The Changelog does not state this; and does the commit that makes that
> happen state the regression potential?

The Changelog says this:

Now that synchronize_rcu() waits for preempt-disable
regions of code as well as RCU read-side critical sections,
synchronize_sched() can be replaced by synchronize_rcu().
This commit therefore makes this change.

The "synchronize_rcu() waits for preempt-disable regions of code as
well as RCU read-side critical sections" seems pretty unambiguous to me.
Exactly what more are you wanting said there?

There were quite a few commits involved in making this happen.  Perhaps
the most pertinent are these:

3e3100989869 ("rcu: Defer reporting RCU-preempt quiescent states when disabled")
45975c7d21a1 ("rcu: Define RCU-sched API in terms of RCU for Tree RCU PREEMPT 
builds")

Normal grace periods are almost always quite long compared to typical
read-side critical sections, preempt-disable regions of code, and so on.
So in the common case this should be OK.  Or are you instead worried
about synchronize_sched_expedited()?

> > > So why are we doing this?
> > 
> > Given that synchronize_rcu() and synchronize_sched() are now always one
> > and the same, this is a distinction without a difference.
> 
> The Changelog did not state a reason for the patch. Therefore it is a
> bad patch.

???  Here is the current definition of synchronize_sched() in mainline:

static inline void synchronize_sched(void)
{
synchronize_rcu();
}

Thanx, Paul



Re: [PATCH tip/core/rcu 23/41] sched: Replace synchronize_sched() with synchronize_rcu()

2018-11-11 Thread Paul E. McKenney
On Mon, Nov 12, 2018 at 01:12:33AM +0100, Peter Zijlstra wrote:
> On Sun, Nov 11, 2018 at 11:43:52AM -0800, Paul E. McKenney wrote:
> > Now that synchronize_rcu() waits for preempt-disable regions of code
> > as well as RCU read-side critical sections, synchronize_sched() can be
> > replaced by synchronize_rcu().  This commit therefore makes this change.
> 
> Yes, but it also waits for an actual RCU quiescent state, which makes
> synchronize_rcu() potentially much more expensive than an actual
> synchronize_sched().

None of the readers have changed.

For the updaters, if CONFIG_PREEMPT=n, synchronize_rcu() and
synchronize_sched() always were one and the same.  When CONFIG_PREEMPT=y,
synchronize_rcu() and synchronize_sched() are now one and the same.

> So why are we doing this?

Given that synchronize_rcu() and synchronize_sched() are now always one
and the same, this is a distinction without a difference.  So we might
as well get rid of the _bh and _sched APIs.  (See the tail end of current
mainline's include/linux/rcupdate.h.)
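
For reference, the tail end of rcupdate.h defines the old API members
roughly as follows (a paraphrase from memory, not a verbatim quote):

static inline void synchronize_rcu_bh(void)
{
	synchronize_rcu();
}

static inline void call_rcu_bh(struct rcu_head *head, rcu_callback_t func)
{
	call_rcu(head, func);
}

static inline void synchronize_sched(void)
{
	synchronize_rcu();
}

static inline void call_rcu_sched(struct rcu_head *head, rcu_callback_t func)
{
	call_rcu(head, func);
}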

If you are instead asking why the RCU flavors (RCU-bh, RCU-preempt,
and RCU-sched) got merged, it was due to a security incident stemming
from confusion between two of the flavors, with the resulting bug turning
out to be exploitable.  Linus therefore requested that I do something
to make this not happen again, which I did.
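
To see what such flavor confusion looks like, consider this minimal sketch
(hypothetical code, not the actual incident) under the old pre-v4.20
semantics:

#include <linux/rcupdate.h>
#include <linux/slab.h>
#include <linux/spinlock.h>

struct foo {
	int a;
	struct rcu_head rh;
};

static DEFINE_SPINLOCK(foo_lock);
static struct foo __rcu *foo_gp;

static void foo_free_cb(struct rcu_head *rhp)
{
	kfree(container_of(rhp, struct foo, rh));
}

static void foo_reader(void)	/* RCU-preempt reader. */
{
	struct foo *p;

	rcu_read_lock();
	p = rcu_dereference(foo_gp);
	if (p)
		foo_use(p->a);	/* foo_use() is hypothetical. */
	rcu_read_unlock();
}

static void foo_updater(struct foo *newp)
{
	struct foo *old;

	spin_lock(&foo_lock);
	old = rcu_dereference_protected(foo_gp, lockdep_is_held(&foo_lock));
	rcu_assign_pointer(foo_gp, newp);
	spin_unlock(&foo_lock);
	call_rcu_bh(&old->rh, foo_free_cb);	/* Waits only for BH readers! */
}

Before the consolidation, the RCU-bh grace period was not obligated to wait
for the rcu_read_lock() reader, so foo_free_cb() could free the structure
while foo_reader() was still using it.  With the flavors consolidated, any
such mix-and-match is safe.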

Thanx, Paul



[PATCH tip/core/rcu 03/20] doc: Remove rcu_preempt_state reference in stallwarn

2018-11-11 Thread Paul E. McKenney
From: "Joel Fernandes (Google)" 

Consolidation of RCU-bh, RCU-preempt, and RCU-sched into one RCU flavor
to rule them all resulted in the removal of rcu_preempt_state.  However,
stallwarn.txt still mentions rcu_preempt_state.  This commit therefore
updates the stallwarn documentation accordingly.

Signed-off-by: Joel Fernandes (Google) 
Cc: 
Signed-off-by: Paul E. McKenney 
---
 Documentation/RCU/stallwarn.txt | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/Documentation/RCU/stallwarn.txt b/Documentation/RCU/stallwarn.txt
index 491043fd976f..b01bcafc64aa 100644
--- a/Documentation/RCU/stallwarn.txt
+++ b/Documentation/RCU/stallwarn.txt
@@ -176,9 +176,8 @@ causing stalls, and that the stall was affecting RCU-sched. 
 This message
 will normally be followed by stack dumps for each CPU.  Please note that
 PREEMPT_RCU builds can be stalled by tasks as well as by CPUs, and that
 the tasks will be indicated by PID, for example, "P3421".  It is even
-possible for a rcu_preempt_state stall to be caused by both CPUs -and-
-tasks, in which case the offending CPUs and tasks will all be called
-out in the list.
+possible for an rcu_state stall to be caused by both CPUs -and- tasks,
+in which case the offending CPUs and tasks will all be called out in the list.
 
 CPU 2's "(3 GPs behind)" indicates that this CPU has not interacted with
 the RCU core for the past three grace periods.  In contrast, CPU 16's "(0
-- 
2.17.1



[PATCH tip/core/rcu 04/20] doc: rcu: Update information about resched_cpu

2018-11-11 Thread Paul E. McKenney
From: "Joel Fernandes (Google)" 

Since commit fced9c8cfe6b ("rcu: Avoid resched_cpu() when rescheduling
the current CPU"), resched_cpu is not directly called from
sync_sched_exp_handler().  Update the documentation accordingly.

Signed-off-by: Joel Fernandes (Google) 
Cc: 
Signed-off-by: Paul E. McKenney 
---
 .../Expedited-Grace-Periods/Expedited-Grace-Periods.html| 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git 
a/Documentation/RCU/Design/Expedited-Grace-Periods/Expedited-Grace-Periods.html 
b/Documentation/RCU/Design/Expedited-Grace-Periods/Expedited-Grace-Periods.html
index e62c7c34a369..8e4f873b979f 100644
--- 
a/Documentation/RCU/Design/Expedited-Grace-Periods/Expedited-Grace-Periods.html
+++ 
b/Documentation/RCU/Design/Expedited-Grace-Periods/Expedited-Grace-Periods.html
@@ -160,9 +160,9 @@ was in flight.
 If the CPU is idle, then sync_sched_exp_handler() reports
 the quiescent state.
 
-
-Otherwise, the handler invokes resched_cpu(), which forces
-a future context switch.
+ Otherwise, the handler forces a future context switch by setting the
+NEED_RESCHED flag of the current task's thread flag and the CPU preempt
+counter.
 At the time of the context switch, the CPU reports the quiescent state.
 Should the CPU go offline first, it will report the quiescent state
 at that time.
-- 
2.17.1



[PATCH tip/core/rcu 01/20] doc: Set down forward-progress requirements

2018-11-11 Thread Paul E. McKenney
From: "Paul E. McKenney" 

This commit adds a section to the requirements documentation setting down
requirements for grace-period and callback-invocation forward progress.

Signed-off-by: Paul E. McKenney 
---
 .../RCU/Design/Requirements/Requirements.html | 110 +-
 1 file changed, 108 insertions(+), 2 deletions(-)

diff --git a/Documentation/RCU/Design/Requirements/Requirements.html 
b/Documentation/RCU/Design/Requirements/Requirements.html
index 43c4e2f05f40..7efc1c1da7af 100644
--- a/Documentation/RCU/Design/Requirements/Requirements.html
+++ b/Documentation/RCU/Design/Requirements/Requirements.html
@@ -1381,6 +1381,7 @@ Classes of quality-of-implementation requirements are as 
follows:
 
Specialization
Performance and Scalability
+   Forward Progress
Composability
Corner Cases
 
@@ -1822,6 +1823,106 @@ so it is too early to tell whether they will stand the 
test of time.
 RCU thus provides a range of tools to allow updaters to strike the
 required tradeoff between latency, flexibility and CPU overhead.
 
+Forward Progress
+
+
+In theory, delaying grace-period completion and callback invocation
+is harmless.
+In practice, not only are memory sizes finite but also callbacks sometimes
+do wakeups, and sufficiently deferred wakeups can be difficult
+to distinguish from system hangs.
+Therefore, RCU must provide a number of mechanisms to promote forward
+progress.
+
+
+These mechanisms are not foolproof, nor can they be.
+For one simple example, an infinite loop in an RCU read-side critical
+section must by definition prevent later grace periods from ever completing.
+For a more involved example, consider a 64-CPU system built with
+CONFIG_RCU_NOCB_CPU=y and booted with rcu_nocbs=1-63,
+where CPUs 1 through 63 spin in tight loops that invoke
+call_rcu().
+Even if these tight loops also contain calls to cond_resched()
+(thus allowing grace periods to complete), CPU 0 simply will
+not be able to invoke callbacks as fast as the other 63 CPUs can
+register them, at least not until the system runs out of memory.
+In both of these examples, the Spiderman principle applies:  With great
+power comes great responsibility.
+However, short of this level of abuse, RCU is required to
+ensure timely completion of grace periods and timely invocation of
+callbacks.
+
+
+RCU takes the following steps to encourage timely completion of
+grace periods:
+
+
+   If a grace period fails to complete within 100 milliseconds,
+   RCU causes future invocations of cond_resched() on
+   the holdout CPUs to provide an RCU quiescent state.
+   RCU also causes those CPUs' need_resched() invocations
+   to return true, but only after the corresponding CPU's
+   next scheduling-clock interrupt.
+   CPUs mentioned in the nohz_full kernel boot parameter
+   can run indefinitely in the kernel without scheduling-clock
+   interrupts, which defeats the above need_resched()
+   stratagem.
+   RCU will therefore invoke resched_cpu() on any
+   nohz_full CPUs still holding out after
+   109 milliseconds.
+   In kernels built with CONFIG_RCU_BOOST=y, if a given
+   task that has been preempted within an RCU read-side critical
+   section is holding out for more than 500 milliseconds,
+   RCU will resort to priority boosting.
+   If a CPU is still holding out 10 seconds into the grace
+   period, RCU will invoke resched_cpu() on it regardless
+   of its nohz_full state.
+
+
+
+The above values are defaults for systems running with HZ=1000.
+They will vary as the value of HZ varies, and can also be
+changed using the relevant Kconfig options and kernel boot parameters.
+RCU currently does not do much sanity checking of these
+parameters, so please use caution when changing them.
+Note that these forward-progress measures are provided only for RCU,
+not for
+SRCU or
+Tasks RCU.
+
+
+RCU takes the following steps in call_rcu() to encourage timely
+invocation of callbacks when any given non-rcu_nocbs CPU has
+10,000 callbacks, or has 10,000 more callbacks than it had the last time
+encouragement was provided:
+
+
+   Starts a grace period, if one is not already in progress.
+   Forces immediate checking for quiescent states, rather than
+   waiting for three milliseconds to have elapsed since the
+   beginning of the grace period.
+   Immediately tags the CPU's callbacks with their grace period
+   completion numbers, rather than waiting for the RCU_SOFTIRQ
+   handler to get around to it.
+   Lifts callback-execution batch limits, which speeds up callback
+   invocation at the expense of degrading realtime response.
+
+
+
+Again, these are default values when running at HZ=1000,
+and can be overridden.
+Again, these forward-progress measures are provided only for RCU,
+not for
+SRCU or
+Tasks RCU.
+Even for RCU, callback-invocation forward progress for rcu_nocbs
+CPUs is much less well-developed, in part because workloads benefiting
+from rcu_

[PATCH RFC LKMM 1/3] tools/memory-model: Model smp_mb__after_unlock_lock()

2018-11-11 Thread Paul E. McKenney
From: Andrea Parri 

From the header comment for smp_mb__after_unlock_lock():

  "Place this after a lock-acquisition primitive to guarantee that
   an UNLOCK+LOCK pair acts as a full barrier.  This guarantee applies
   if the UNLOCK and LOCK are executed by the same CPU or if the
   UNLOCK and LOCK operate on the same lock variable."

This formalizes the above guarantee by defining (new) mb-links according
to the law:

  ([M] ; po ; [UL] ; (co | po) ; [LKW] ;
fencerel(After-unlock-lock) ; [M])

where the component ([UL] ; co ; [LKW]) identifies "UNLOCK+LOCK pairs on
the same lock variable" and the component ([UL] ; po ; [LKW]) identifies
"UNLOCK+LOCK pairs executed by the same CPU".

In particular, the LKMM forbids the following two behaviors (the second
litmus test below is based on

  Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.html

c.f., Section "Tree RCU Grace Period Memory Ordering Building Blocks"):

C after-unlock-lock-same-cpu

(*
 * Result: Never
 *)

{}

P0(spinlock_t *s, spinlock_t *t, int *x, int *y)
{
int r0;

spin_lock(s);
WRITE_ONCE(*x, 1);
spin_unlock(s);
spin_lock(t);
smp_mb__after_unlock_lock();
r0 = READ_ONCE(*y);
spin_unlock(t);
}

P1(int *x, int *y)
{
int r0;

WRITE_ONCE(*y, 1);
smp_mb();
r0 = READ_ONCE(*x);
}

exists (0:r0=0 /\ 1:r0=0)

C after-unlock-lock-same-lock-variable

(*
 * Result: Never
 *)

{}

P0(spinlock_t *s, int *x, int *y)
{
int r0;

spin_lock(s);
WRITE_ONCE(*x, 1);
r0 = READ_ONCE(*y);
spin_unlock(s);
}

P1(spinlock_t *s, int *y, int *z)
{
int r0;

spin_lock(s);
smp_mb__after_unlock_lock();
WRITE_ONCE(*y, 1);
r0 = READ_ONCE(*z);
spin_unlock(s);
}

P2(int *z, int *x)
{
int r0;

WRITE_ONCE(*z, 1);
smp_mb();
r0 = READ_ONCE(*x);
}

exists (0:r0=0 /\ 1:r0=0 /\ 2:r0=0)

Signed-off-by: Andrea Parri 
Cc: Alan Stern 
Cc: Will Deacon 
Cc: Peter Zijlstra 
Cc: Boqun Feng 
Cc: Nicholas Piggin 
Cc: David Howells 
Cc: Jade Alglave 
Cc: Luc Maranget 
Cc: "Paul E. McKenney" 
Cc: Akira Yokosawa 
Cc: Daniel Lustig 
Signed-off-by: Paul E. McKenney 
---
 tools/memory-model/linux-kernel.bell | 3 ++-
 tools/memory-model/linux-kernel.cat  | 4 +++-
 tools/memory-model/linux-kernel.def  | 1 +
 3 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/tools/memory-model/linux-kernel.bell 
b/tools/memory-model/linux-kernel.bell
index b84fb2f67109..796513362c05 100644
--- a/tools/memory-model/linux-kernel.bell
+++ b/tools/memory-model/linux-kernel.bell
@@ -29,7 +29,8 @@ enum Barriers = 'wmb (*smp_wmb*) ||
'sync-rcu (*synchronize_rcu*) ||
'before-atomic (*smp_mb__before_atomic*) ||
'after-atomic (*smp_mb__after_atomic*) ||
-   'after-spinlock (*smp_mb__after_spinlock*)
+   'after-spinlock (*smp_mb__after_spinlock*) ||
+   'after-unlock-lock (*smp_mb__after_unlock_lock*)
 instructions F[Barriers]
 
 (* Compute matching pairs of nested Rcu-lock and Rcu-unlock *)
diff --git a/tools/memory-model/linux-kernel.cat 
b/tools/memory-model/linux-kernel.cat
index 882fc33274ac..8f23c74a96fd 100644
--- a/tools/memory-model/linux-kernel.cat
+++ b/tools/memory-model/linux-kernel.cat
@@ -30,7 +30,9 @@ let wmb = [W] ; fencerel(Wmb) ; [W]
 let mb = ([M] ; fencerel(Mb) ; [M]) |
([M] ; fencerel(Before-atomic) ; [RMW] ; po? ; [M]) |
([M] ; po? ; [RMW] ; fencerel(After-atomic) ; [M]) |
-   ([M] ; po? ; [LKW] ; fencerel(After-spinlock) ; [M])
+   ([M] ; po? ; [LKW] ; fencerel(After-spinlock) ; [M]) |
+   ([M] ; po ; [UL] ; (co | po) ; [LKW] ;
+   fencerel(After-unlock-lock) ; [M])
 let gp = po ; [Sync-rcu] ; po?
 
 let strong-fence = mb | gp
diff --git a/tools/memory-model/linux-kernel.def 
b/tools/memory-model/linux-kernel.def
index 6fa3eb28d40b..b27911cc087d 100644
--- a/tools/memory-model/linux-kernel.def
+++ b/tools/memory-model/linux-kernel.def
@@ -23,6 +23,7 @@ smp_wmb() { __fence{wmb}; }
 smp_mb__before_atomic() { __fence{before-atomic}; }
 smp_mb__after_atomic() { __fence{after-atomic}; }
 smp_mb__after_spinlock() { __fence{after-spinlock}; }
+smp_mb__after_unlock_lock() { __fence{after-unlock-lock}; }
 
 // Exchange
 xchg(X,V)  __xchg{mb}(X,V)
-- 
2.17.1



[PATCH RFC LKMM 3/3] EXP tools/memory-model: Make scripts take "-j" abbreviation for "--jobs"

2018-11-11 Thread Paul E. McKenney
From: "Paul E. McKenney" 

The "--jobs" argument to the litmus-test scripts is similar to the "-jN"
argument to "make", so this commit allows the "-jN" form as well.  While
in the area, it also prohibits the various forms of "-j0".

Suggested-by: Alan Stern 
Signed-off-by: Paul E. McKenney 
---
 tools/memory-model/scripts/parseargs.sh | 14 --
 1 file changed, 12 insertions(+), 2 deletions(-)

diff --git a/tools/memory-model/scripts/parseargs.sh 
b/tools/memory-model/scripts/parseargs.sh
index 96b307c8d64a..859e1d581e05 100755
--- a/tools/memory-model/scripts/parseargs.sh
+++ b/tools/memory-model/scripts/parseargs.sh
@@ -95,8 +95,18 @@ do
LKMM_HERD_OPTIONS="$2"
shift
;;
-   --jobs|--job)
-   checkarg --jobs "(number)" "$#" "$2" '^[0-9]\+$' '^--'
+   -j[1-9]*)
+   njobs="`echo $1 | sed -e 's/^-j//'`"
+   trailchars="`echo $njobs | sed -e 's/[0-9]\+\(.*\)$/\1/'`"
+   if test -n "$trailchars"
+   then
+   echo $1 trailing characters "'$trailchars'"
+   usagehelp
+   fi
+   LKMM_JOBS="`echo $njobs | sed -e 's/^\([0-9]\+\).*$/\1/'`"
+   ;;
+   --jobs|--job|-j)
+   checkarg --jobs "(number)" "$#" "$2" '^[1-9][0-9]*$' '^--'
LKMM_JOBS="$2"
shift
;;
-- 
2.17.1



[PATCH RFC LKMM 2/3] EXP tools/memory-model: Add scripts to check github litmus tests

2018-11-11 Thread Paul E. McKenney
From: "Paul E. McKenney" 

The https://github.com/paulmckrcu/litmus repository contains a large
number of C-language litmus tests that include "Result:" comments
predicting the verification result.  This commit adds a number of scripts
that run tests on these litmus tests:

checkghlitmus.sh:
Runs all litmus tests in the https://github.com/paulmckrcu/litmus
archive that are C-language and that have "Result:" comment lines
documenting expected results, comparing the actual results to
those expected.  Clones the repository if it has not already
been cloned into the "tools/memory-model/litmus" directory.

initlitmushist.sh
Run all litmus tests having no more than the specified number
of processes given a specified timeout, recording the results in
.litmus.out files.  Clones the repository if it has not already
been cloned into the "tools/memory-model/litmus" directory.

newlitmushist.sh
For all new or updated litmus tests having no more than the
specified number of processes given a specified timeout, run
and record the results in .litmus.out files.

checklitmushist.sh
Run all litmus tests having .litmus.out files from previous
initlitmushist.sh or newlitmushist.sh runs, comparing the
herd output to that of the original runs.

The above scripts will run litmus tests concurrently, by default with
one job per available CPU.  Giving any of these scripts the --help
argument will cause them to print usage information.

This commit also adds a number of helper scripts that are not intended
to be invoked from the command line:

cmplitmushist.sh: Compare the output of two different runs of the same
litmus test.

judgelitmus.sh: Compare the output of a litmus test to its "Result:"
comment line.

parseargs.sh: Parse command-line arguments.

runlitmushist.sh: Run the litmus tests whose pathnames are provided one
per line on standard input.

While in the area, this commit also makes the existing checklitmus.sh
and checkalllitmus.sh scripts use parseargs.sh in order to provide a
bit of uniformity.  In addition, per-litmus-test status output is directed
to stdout, while end-of-test summary information is directed to stderr.
Finally, the error flag standardizes on "!!!" to assist those familiar
with rcutorture output.

The defaults for the parseargs.sh arguments may be overridden by using
environment variables: LKMM_DESTDIR for --destdir, LKMM_HERD_OPTIONS
for --herdoptions, LKMM_JOBS for --jobs, LKMM_PROCS for --procs, and
LKMM_TIMEOUT for --timeout.

Signed-off-by: Paul E. McKenney 
[ paulmck: History-check summary-line changes per Alan Stern feedback. ]
---
 tools/memory-model/.gitignore |   1 +
 tools/memory-model/README |   2 +
 tools/memory-model/scripts/README |  70 ++
 tools/memory-model/scripts/checkalllitmus.sh  |  53 
 tools/memory-model/scripts/checkghlitmus.sh   |  65 +
 tools/memory-model/scripts/checklitmus.sh |  74 ++
 tools/memory-model/scripts/checklitmushist.sh |  60 +
 tools/memory-model/scripts/cmplitmushist.sh   |  87 
 tools/memory-model/scripts/initlitmushist.sh  |  68 ++
 tools/memory-model/scripts/judgelitmus.sh |  78 +++
 tools/memory-model/scripts/newlitmushist.sh   |  61 +
 tools/memory-model/scripts/parseargs.sh   | 126 ++
 tools/memory-model/scripts/runlitmushist.sh   |  87 
 13 files changed, 739 insertions(+), 93 deletions(-)
 create mode 100644 tools/memory-model/.gitignore
 create mode 100644 tools/memory-model/scripts/README
 create mode 100755 tools/memory-model/scripts/checkghlitmus.sh
 create mode 100755 tools/memory-model/scripts/checklitmushist.sh
 create mode 100644 tools/memory-model/scripts/cmplitmushist.sh
 create mode 100755 tools/memory-model/scripts/initlitmushist.sh
 create mode 100755 tools/memory-model/scripts/judgelitmus.sh
 create mode 100755 tools/memory-model/scripts/newlitmushist.sh
 create mode 100755 tools/memory-model/scripts/parseargs.sh
 create mode 100755 tools/memory-model/scripts/runlitmushist.sh

diff --git a/tools/memory-model/.gitignore b/tools/memory-model/.gitignore
new file mode 100644
index 000000000000..b1d34c52f3c3
--- /dev/null
+++ b/tools/memory-model/.gitignore
@@ -0,0 +1 @@
+litmus
diff --git a/tools/memory-model/README b/tools/memory-model/README
index acf9077cffaa..0f2c366518c6 100644
--- a/tools/memory-model/README
+++ b/tools/memory-model/README
@@ -156,6 +156,8 @@ lock.cat
 README
This file.
 
+scripts		Various scripts, see scripts/README.
+
 
 ===
 LIMITATIONS
diff --git a/tools/memory-model/scripts/README 
b/tools/memory-model/scripts/README
new file mode 100644
index 0000000000..29375a1fbb

[PATCH RFC memory-model 0/3] LKMM updates for v4.21/v5.0

2018-11-11 Thread Paul E. McKenney
Hello!

This series contains updates for the Linux-kernel memory model:

1.  Model smp_mb__after_unlock_lock(), courtesy of Andrea Parri.

2.  Add scripts to check github litmus tests.

3.  Make scripts take "-j" abbreviation for "--jobs".

Thanx, Paul



 .gitignore |1 
 README |2 
 linux-kernel.bell  |3 
 linux-kernel.cat   |4 -
 linux-kernel.def   |1 
 scripts/README |   70 ++
 scripts/checkalllitmus.sh  |   53 +++--
 scripts/checkghlitmus.sh   |   65 
 scripts/checklitmus.sh |   74 +++
 scripts/checklitmushist.sh |   60 +++
 scripts/cmplitmushist.sh   |   87 +++
 scripts/initlitmushist.sh  |   68 +
 scripts/judgelitmus.sh |   78 +
 scripts/newlitmushist.sh   |   61 +++
 scripts/parseargs.sh   |  140 -
 scripts/runlitmushist.sh   |   87 +++
 16 files changed, 757 insertions(+), 97 deletions(-)



[PATCH tip/core/rcu 0/7] Use lockdep instead of asserting spin_is_locked() for v4.21/v5.0

2018-11-11 Thread Paul E. McKenney
Hello!

This series converts assertions of spin_is_locked() into
lockdep_assert_held(), all courtesy of Lance Roy.

Thanx, Paul



 arch/x86/pci/i386.c  |2 +-
 drivers/net/ethernet/sfc/efx.c   |2 +-
 drivers/net/ethernet/smsc/smsc911x.h |2 +-
 fs/userfaultfd.c |2 +-
 kernel/locking/mutex-debug.c |4 ++--
 mm/khugepaged.c  |4 ++--
 mm/swap.c|3 +--
 virt/kvm/arm/vgic/vgic.c |   12 ++--
 8 files changed, 15 insertions(+), 16 deletions(-)



[PATCH tip/core/rcu 5/7] locking/mutex: Replace spin_is_locked() with lockdep

2018-11-11 Thread Paul E. McKenney
From: Lance Roy 

lockdep_assert_held() is better suited to checking locking requirements,
since it only checks if the current thread holds the lock regardless of
whether someone else does. This is also a step towards possibly removing
spin_is_locked().

Signed-off-by: Lance Roy 
Cc: Peter Zijlstra 
Cc: Ingo Molnar 
Cc: Will Deacon 
Signed-off-by: Paul E. McKenney 
---
 kernel/locking/mutex-debug.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/locking/mutex-debug.c b/kernel/locking/mutex-debug.c
index 9aa713629387..771d4ca96dda 100644
--- a/kernel/locking/mutex-debug.c
+++ b/kernel/locking/mutex-debug.c
@@ -36,7 +36,7 @@ void debug_mutex_lock_common(struct mutex *lock, struct 
mutex_waiter *waiter)
 
 void debug_mutex_wake_waiter(struct mutex *lock, struct mutex_waiter *waiter)
 {
-   SMP_DEBUG_LOCKS_WARN_ON(!spin_is_locked(&lock->wait_lock));
+   lockdep_assert_held(&lock->wait_lock);
DEBUG_LOCKS_WARN_ON(list_empty(&lock->wait_list));
DEBUG_LOCKS_WARN_ON(waiter->magic != waiter);
DEBUG_LOCKS_WARN_ON(list_empty(&waiter->list));
@@ -51,7 +51,7 @@ void debug_mutex_free_waiter(struct mutex_waiter *waiter)
 void debug_mutex_add_waiter(struct mutex *lock, struct mutex_waiter *waiter,
struct task_struct *task)
 {
-   SMP_DEBUG_LOCKS_WARN_ON(!spin_is_locked(&lock->wait_lock));
+   lockdep_assert_held(&lock->wait_lock);
 
/* Mark the current thread as blocked on the lock: */
task->blocked_on = waiter;
-- 
2.17.1
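
To make the difference concrete, here is a minimal sketch (made-up lock and
function names):

static DEFINE_SPINLOCK(foo_lock);

static void foo_check_locking(void)
{
	/*
	 * spin_is_locked() is true if *any* context holds foo_lock, so
	 * this check passes even when some other CPU holds the lock and
	 * the current thread forgot to take it.  And on !SMP builds
	 * without spinlock debugging, spin_is_locked() is always zero,
	 * which is why the mm/ callers later in this series guard it
	 * with "NR_CPUS != 1 &&".
	 */
	WARN_ON(!spin_is_locked(&foo_lock));

	/*
	 * lockdep_assert_held() complains unless the current context
	 * itself holds foo_lock, and it compiles to nothing when
	 * lockdep is disabled.
	 */
	lockdep_assert_held(&foo_lock);
}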



[PATCH tip/core/rcu 4/7] userfaultfd: Replace spin_is_locked() with lockdep

2018-11-11 Thread Paul E. McKenney
From: Lance Roy 

lockdep_assert_held() is better suited to checking locking requirements,
since it only checks if the current thread holds the lock regardless of
whether someone else does. This is also a step towards possibly removing
spin_is_locked().

Signed-off-by: Lance Roy 
Cc: Alexander Viro 
Cc: 
Signed-off-by: Paul E. McKenney 
---
 fs/userfaultfd.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 356d2b8568c1..681881dc8a9d 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -926,7 +926,7 @@ static inline struct userfaultfd_wait_queue 
*find_userfault_in(
wait_queue_entry_t *wq;
struct userfaultfd_wait_queue *uwq;
 
-   VM_BUG_ON(!spin_is_locked(&wqh->lock));
+   lockdep_assert_held(&wqh->lock);
 
uwq = NULL;
if (!waitqueue_active(wqh))
-- 
2.17.1



[PATCH tip/core/rcu 03/17] rcutorture: Remove cbflood facility

2018-11-11 Thread Paul E. McKenney
From: "Paul E. McKenney" 

Now that the forward-progress code does a full-bore continuous callback
flood lasting multiple seconds, there is little point in also posting a
mere 60,000 callbacks every second or so.  This commit therefore removes
the old cbflood testing.  Over time, it may be desirable to concurrently
do full-bore continuous callback floods on all CPUs simultaneously, but
one dragon at a time.

Signed-off-by: Paul E. McKenney 
---
 .../admin-guide/kernel-parameters.txt | 18 
 kernel/rcu/rcutorture.c   | 86 +--
 2 files changed, 1 insertion(+), 103 deletions(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt 
b/Documentation/admin-guide/kernel-parameters.txt
index 81d1d5a74728..6c53d6eb4594 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -3743,24 +3743,6 @@
in microseconds.  The default of zero says
no holdoff.
 
-   rcutorture.cbflood_inter_holdoff= [KNL]
-   Set holdoff time (jiffies) between successive
-   callback-flood tests.
-
-   rcutorture.cbflood_intra_holdoff= [KNL]
-   Set holdoff time (jiffies) between successive
-   bursts of callbacks within a given callback-flood
-   test.
-
-   rcutorture.cbflood_n_burst= [KNL]
-   Set the number of bursts making up a given
-   callback-flood test.  Set this to zero to
-   disable callback-flood testing.
-
-   rcutorture.cbflood_n_per_burst= [KNL]
-   Set the number of callbacks to be registered
-   in a given burst of a callback-flood test.
-
rcutorture.fqs_duration= [KNL]
Set duration of force_quiescent_state bursts
in microseconds.
diff --git a/kernel/rcu/rcutorture.c b/kernel/rcu/rcutorture.c
index 8cf700ca7845..17f480129a78 100644
--- a/kernel/rcu/rcutorture.c
+++ b/kernel/rcu/rcutorture.c
@@ -80,13 +80,6 @@ MODULE_AUTHOR("Paul E. McKenney  and Josh Triplett ");
-   if (cbflood_n_per_burst > 0 &&
-   cbflood_inter_holdoff > 0 &&
-   cbflood_intra_holdoff > 0 &&
-   cur_ops->call &&
-   cur_ops->cb_barrier) {
-   rhp = vmalloc(array3_size(cbflood_n_burst,
- cbflood_n_per_burst,
- sizeof(*rhp)));
-   err = !rhp;
-   }
-   if (err) {
-   VERBOSE_TOROUT_STRING("rcu_torture_cbflood disabled: Bad args 
or OOM");
-   goto wait_for_stop;
-   }
-   VERBOSE_TOROUT_STRING("rcu_torture_cbflood task started");
-   do {
-   schedule_timeout_interruptible(cbflood_inter_holdoff);
-   atomic_long_inc(&n_cbfloods);
-   WARN_ON(signal_pending(current));
-   for (i = 0; i < cbflood_n_burst; i++) {
-   for (j = 0; j < cbflood_n_per_burst; j++) {
-   cur_ops->call(&rhp[i * cbflood_n_per_burst + j],
- rcu_torture_cbflood_cb);
-   }
-   schedule_timeout_interruptible(cbflood_intra_holdoff);
-   WARN_ON(signal_pending(current));
-   }
-   cur_ops->cb_barrier();
-   stutter_wait("rcu_torture_cbflood");
-   } while (!torture_must_stop());
-   vfree(rhp);
-wait_for_stop:
-   torture_kthread_stopping("rcu_torture_cbflood");
-   return 0;
-}
-
 /*
  * RCU torture force-quiescent-state kthread.  Repeatedly induces
  * bursts of calls to force_quiescent_state(), increasing the probability
@@ -1460,11 +1397,10 @@ rcu_torture_stats_print(void)
n_rcu_torture_boosts,
atomic_long_read(&n_rcu_torture_timers));
torture_onoff_stats();
-   pr_cont("barrier: %ld/%ld:%ld ",
+   pr_cont("barrier: %ld/%ld:%ld\n",
n_barrier_successes,
n_barrier_attempts,
n_rcu_torture_barrier_error);
-   pr_cont("cbflood: %ld\n", atomic_long_read(_cbfloods));
 
pr_alert("%s%s ", torture_type, TORTURE_FLAG);
if (atomic_read(&n_rcu_torture_mberror) != 0 ||
@@ -2093,8 +2029,6 @@ rcu_torture_cleanup(void)
 cur_ops->name, gp_seq, flags);
torture_stop_kthread(rcu_torture_stats, stats_task);
torture_stop_kthread(rcu_torture_fqs, fqs_task);
-   for (i = 0; i < ncbflooders; i++)
-   torture_stop_kthread(rcu_torture_cbflood, cbflood_task[i]);
if (rcu_torture_can_boost())
cpuhp_remove_state(rcutor_hp);
 
@@ -2377,24 +2311

[PATCH tip/core/rcu 2/7] sfc: Replace spin_is_locked() with lockdep

2018-11-11 Thread Paul E. McKenney
From: Lance Roy 

lockdep_assert_held() is better suited to checking locking requirements,
since it only checks if the current thread holds the lock regardless of
whether someone else does. This is also a step towards possibly removing
spin_is_locked().

Signed-off-by: Lance Roy 
Cc: Solarflare linux maintainers 
Cc: Edward Cree 
Cc: Bert Kenward 
Cc: "David S. Miller" 
Cc: 
Signed-off-by: Paul E. McKenney 
---
 drivers/net/ethernet/sfc/efx.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/sfc/efx.c b/drivers/net/ethernet/sfc/efx.c
index 98fe7e762e17..3643015a55cf 100644
--- a/drivers/net/ethernet/sfc/efx.c
+++ b/drivers/net/ethernet/sfc/efx.c
@@ -3167,7 +3167,7 @@ struct hlist_head *efx_rps_hash_bucket(struct efx_nic 
*efx,
 {
u32 hash = efx_filter_spec_hash(spec);
 
-   WARN_ON(!spin_is_locked(&efx->rps_hash_lock));
+   lockdep_assert_held(&efx->rps_hash_lock);
if (!efx->rps_hash_table)
return NULL;
return &efx->rps_hash_table[hash % EFX_ARFS_HASH_TABLE_SIZE];
-- 
2.17.1



[PATCH tip/core/rcu 6/7] mm: Replace spin_is_locked() with lockdep

2018-11-11 Thread Paul E. McKenney
From: Lance Roy 

lockdep_assert_held() is better suited to checking locking requirements,
since it only checks if the current thread holds the lock regardless of
whether someone else does. This is also a step towards possibly removing
spin_is_locked().

Signed-off-by: Lance Roy 
Cc: Andrew Morton 
Cc: "Kirill A. Shutemov" 
Cc: Yang Shi 
Cc: Matthew Wilcox 
Cc: Mel Gorman 
Acked-by: Vlastimil Babka 
Cc: Jan Kara 
Cc: Shakeel Butt 
Cc: 
Signed-off-by: Paul E. McKenney 
---
 mm/khugepaged.c | 4 ++--
 mm/swap.c   | 3 +--
 2 files changed, 3 insertions(+), 4 deletions(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index c13625c1ad5e..7b86600a47c9 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1225,7 +1225,7 @@ static void collect_mm_slot(struct mm_slot *mm_slot)
 {
struct mm_struct *mm = mm_slot->mm;
 
-   VM_BUG_ON(NR_CPUS != 1 && !spin_is_locked(&khugepaged_mm_lock));
+   lockdep_assert_held(&khugepaged_mm_lock);
 
if (khugepaged_test_exit(mm)) {
/* free mm_slot */
@@ -1631,7 +1631,7 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int 
pages,
int progress = 0;
 
VM_BUG_ON(!pages);
-   VM_BUG_ON(NR_CPUS != 1 && !spin_is_locked(&khugepaged_mm_lock));
+   lockdep_assert_held(&khugepaged_mm_lock);
 
if (khugepaged_scan.mm_slot)
mm_slot = khugepaged_scan.mm_slot;
diff --git a/mm/swap.c b/mm/swap.c
index aa483719922e..5d786019eab9 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -823,8 +823,7 @@ void lru_add_page_tail(struct page *page, struct page 
*page_tail,
VM_BUG_ON_PAGE(!PageHead(page), page);
VM_BUG_ON_PAGE(PageCompound(page_tail), page);
VM_BUG_ON_PAGE(PageLRU(page_tail), page);
-   VM_BUG_ON(NR_CPUS != 1 &&
- !spin_is_locked(&lruvec_pgdat(lruvec)->lru_lock));
+   lockdep_assert_held(&lruvec_pgdat(lruvec)->lru_lock);
 
if (!list)
SetPageLRU(page_tail);
-- 
2.17.1



[PATCH tip/core/rcu 11/17] rcutorture: Print GP age upon forward-progress failure

2018-11-11 Thread Paul E. McKenney
Signed-off-by: Paul E. McKenney 
---
 kernel/rcu/tree.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index cef7d9867508..95a3825b1b19 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -2679,6 +2679,8 @@ void rcu_fwd_progress_check(unsigned long j)
struct rcu_data *rdp;
 
if (rcu_gp_in_progress()) {
+   pr_info("%s: GP age %lu jiffies\n",
+   __func__, jiffies - rcu_state.gp_start);
show_rcu_gp_kthreads();
} else {
preempt_disable();
-- 
2.17.1



[PATCH tip/core/rcu 13/17] rcutorture: Print time since GP end upon forward-progress failure

2018-11-11 Thread Paul E. McKenney
If rcutorture's forward-progress tests fail while a grace period is not
in progress, it is useful to print the time since the last grace period
ended as a way to detect failure to launch a new grace period.  This
commit therefore makes this change.

Signed-off-by: Paul E. McKenney 
---
 kernel/rcu/tree.c | 5 -
 kernel/rcu/tree.h | 2 ++
 2 files changed, 6 insertions(+), 1 deletion(-)

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 95a3825b1b19..4d8b50a7750a 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -1997,7 +1997,8 @@ static void rcu_gp_cleanup(void)
 
WRITE_ONCE(rcu_state.gp_activity, jiffies);
raw_spin_lock_irq_rcu_node(rnp);
-   gp_duration = jiffies - rcu_state.gp_start;
+   rcu_state.gp_end = jiffies;
+   gp_duration = rcu_state.gp_end - rcu_state.gp_start;
if (gp_duration > rcu_state.gp_max)
rcu_state.gp_max = gp_duration;
 
@@ -2683,6 +2684,8 @@ void rcu_fwd_progress_check(unsigned long j)
__func__, jiffies - rcu_state.gp_start);
show_rcu_gp_kthreads();
} else {
+   pr_info("%s: Last GP end %lu jiffies ago\n",
+   __func__, jiffies - rcu_state.gp_end);
preempt_disable();
rdp = this_cpu_ptr(&rcu_data);
rcu_check_gp_start_stall(rdp->mynode, rdp, j);
diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h
index 14f6758f0989..3397089490ec 100644
--- a/kernel/rcu/tree.h
+++ b/kernel/rcu/tree.h
@@ -328,6 +328,8 @@ struct rcu_state {
/*  force_quiescent_state(). */
unsigned long gp_start; /* Time at which GP started, */
/*  but in jiffies. */
+   unsigned long gp_end;   /* Time last GP ended, again */
+   /*  in jiffies. */
unsigned long gp_activity;  /* Time of last GP kthread */
/*  activity in jiffies. */
unsigned long gp_req_activity;  /* Time of last GP request */
-- 
2.17.1



[PATCH tip/core/rcu 05/17] rcutorture: Affinity forward-progress test to avoid housekeeping CPUs

2018-11-11 Thread Paul E. McKenney
This commit affinities the forward-progress tests to avoid hogging a
housekeeping CPU on the theory that the offloaded callbacks will be
running on those housekeeping CPUs.

Signed-off-by: Paul E. McKenney 
[ paulmck: Fix NULL-pointer issue located by kbuild test robot. ]
Tested-by: Rong Chen 
---
 kernel/rcu/rcu.h |  2 ++
 kernel/rcu/rcutorture.c  |  1 +
 kernel/rcu/tree_plugin.h | 11 +++
 3 files changed, 14 insertions(+)

diff --git a/kernel/rcu/rcu.h b/kernel/rcu/rcu.h
index 2866166863f0..0f0f5ae8c3d4 100644
--- a/kernel/rcu/rcu.h
+++ b/kernel/rcu/rcu.h
@@ -539,8 +539,10 @@ extern struct workqueue_struct *rcu_par_gp_wq;
 
 #ifdef CONFIG_RCU_NOCB_CPU
 bool rcu_is_nocb_cpu(int cpu);
+void rcu_bind_current_to_nocb(void);
 #else
 static inline bool rcu_is_nocb_cpu(int cpu) { return false; }
+static inline void rcu_bind_current_to_nocb(void) { }
 #endif
 
 #endif /* __LINUX_RCU_H */
diff --git a/kernel/rcu/rcutorture.c b/kernel/rcu/rcutorture.c
index bcc33bb8d9a6..36a3bc42782d 100644
--- a/kernel/rcu/rcutorture.c
+++ b/kernel/rcu/rcutorture.c
@@ -1803,6 +1803,7 @@ static int rcu_torture_fwd_prog(void *args)
int tested_tries = 0;
 
VERBOSE_TOROUT_STRING("rcu_torture_fwd_progress task started");
+   rcu_bind_current_to_nocb();
if (!IS_ENABLED(CONFIG_SMP) || !IS_ENABLED(CONFIG_RCU_BOOST))
set_user_nice(current, MAX_NICE);
do {
diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index 05915e536336..1db2b0780c62 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -2587,6 +2587,17 @@ static bool init_nocb_callback_list(struct rcu_data *rdp)
return true;
 }
 
+/*
+ * Bind the current task to the offloaded CPUs.  If there are no offloaded
+ * CPUs, leave the task unbound.  Splat if the bind attempt fails.
+ */
+void rcu_bind_current_to_nocb(void)
+{
+   if (cpumask_available(rcu_nocb_mask) && cpumask_weight(rcu_nocb_mask))
+   WARN_ON(sched_setaffinity(current->pid, rcu_nocb_mask));
+}
+EXPORT_SYMBOL_GPL(rcu_bind_current_to_nocb);
+
 #else /* #ifdef CONFIG_RCU_NOCB_CPU */
 
 static bool rcu_nocb_cpu_needs_barrier(int cpu)
-- 
2.17.1



[PATCH tip/core/rcu 09/17] rcu: Account for nocb-CPU callback counts in RCU CPU stall warnings

2018-11-11 Thread Paul E. McKenney
The RCU CPU stall warnings print an estimate of the total number of
RCU callbacks queued in the system, but this estimate leaves out
the callbacks queued for nocbs CPUs.  This commit therefore introduces
rcu_get_n_cbs_cpu(), which gives an accurate callback estimate for
both nocbs and normal CPUs, and uses this new function as needed.

This commit also introduces a rcu_get_n_cbs_nocb_cpu() helper function
that returns the number of callbacks for nocbs CPUs or zero otherwise,
and also uses this function in place of direct access to ->nocb_q_count
while in the area (fewer characters, you see).

Signed-off-by: Paul E. McKenney 
---
 kernel/rcu/tree.c        | 19 +++++++++++++++----
 kernel/rcu/tree.h        |  1 +
 kernel/rcu/tree_plugin.h | 24 +++++++++++++++++++-----
 3 files changed, 35 insertions(+), 9 deletions(-)

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 6f04352011d7..67f2c7a055b6 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -207,6 +207,19 @@ static int rcu_gp_in_progress(void)
return rcu_seq_state(rcu_seq_current(&rcu_state.gp_seq));
 }
 
+/*
+ * Return the number of callbacks queued on the specified CPU.
+ * Handles both the nocbs and normal cases.
+ */
+static long rcu_get_n_cbs_cpu(int cpu)
+{
+   struct rcu_data *rdp = per_cpu_ptr(&rcu_data, cpu);
+
+   if (rcu_segcblist_is_enabled(&rdp->cblist)) /* Online normal CPU? */
+   return rcu_segcblist_n_cbs(&rdp->cblist);
+   return rcu_get_n_cbs_nocb_cpu(rdp); /* Works for offline, too. */
+}
+
 void rcu_softirq_qs(void)
 {
rcu_qs();
@@ -1262,8 +1275,7 @@ static void print_other_cpu_stall(unsigned long gp_seq)
 
print_cpu_stall_info_end();
for_each_possible_cpu(cpu)
-   totqlen += rcu_segcblist_n_cbs(&per_cpu_ptr(&rcu_data,
-   cpu)->cblist);
+   totqlen += rcu_get_n_cbs_cpu(cpu);
pr_cont("(detected by %d, t=%ld jiffies, g=%ld, q=%lu)\n",
   smp_processor_id(), (long)(jiffies - rcu_state.gp_start),
   (long)rcu_seq_current(&rcu_state.gp_seq), totqlen);
@@ -1323,8 +1335,7 @@ static void print_cpu_stall(void)
raw_spin_unlock_irqrestore_rcu_node(rdp->mynode, flags);
print_cpu_stall_info_end();
for_each_possible_cpu(cpu)
-   totqlen += rcu_segcblist_n_cbs(&per_cpu_ptr(&rcu_data,
-   cpu)->cblist);
+   totqlen += rcu_get_n_cbs_cpu(cpu);
pr_cont(" (t=%lu jiffies g=%ld q=%lu)\n",
jiffies - rcu_state.gp_start,
(long)rcu_seq_current(&rcu_state.gp_seq), totqlen);
diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h
index 703e19ff532d..14f6758f0989 100644
--- a/kernel/rcu/tree.h
+++ b/kernel/rcu/tree.h
@@ -466,6 +466,7 @@ static void __init rcu_spawn_nocb_kthreads(void);
 static void __init rcu_organize_nocb_kthreads(void);
 #endif /* #ifdef CONFIG_RCU_NOCB_CPU */
 static bool init_nocb_callback_list(struct rcu_data *rdp);
+static unsigned long rcu_get_n_cbs_nocb_cpu(struct rcu_data *rdp);
 static void rcu_bind_gp_kthread(void);
 static bool rcu_nohz_full_cpu(void);
 static void rcu_dynticks_task_enter(void);
diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index 1db2b0780c62..6d0c651c5f35 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -1997,7 +1997,7 @@ static bool rcu_nocb_cpu_needs_barrier(int cpu)
 * (if a callback is in fact needed).  This is associated with an
 * atomic_inc() in the caller.
 */
-   ret = atomic_long_read(&rdp->nocb_q_count);
+   ret = rcu_get_n_cbs_nocb_cpu(rdp);
 
 #ifdef CONFIG_PROVE_RCU
rhp = READ_ONCE(rdp->nocb_head);
@@ -2052,7 +2052,7 @@ static void __call_rcu_nocb_enqueue(struct rcu_data *rdp,
TPS("WakeNotPoll"));
return;
}
-   len = atomic_long_read(&rdp->nocb_q_count);
+   len = rcu_get_n_cbs_nocb_cpu(rdp);
if (old_rhpp == &rdp->nocb_head) {
if (!irqs_disabled_flags(flags)) {
/* ... if queue was empty ... */
@@ -2101,11 +2101,11 @@ static bool __call_rcu_nocb(struct rcu_data *rdp, struct rcu_head *rhp,
trace_rcu_kfree_callback(rcu_state.name, rhp,
 (unsigned long)rhp->func,
 
 -atomic_long_read(&rdp->nocb_q_count_lazy),
--atomic_long_read(&rdp->nocb_q_count));
+-rcu_get_n_cbs_nocb_cpu(rdp));
else
trace_rcu_callback(rcu_state.name, rhp,
   -atomic_long_read(&rdp->nocb_q_count_lazy),
-  -atomic_long_read(&rdp->nocb_q_count));
+  -rcu_get_n_cbs_nocb_cpu(rdp));
 
/*
 * If called from an extended quiesc

[PATCH tip/core/rcu 16/17] rcutorture: Use 100ms buckets for forward-progress callback histograms

2018-11-11 Thread Paul E. McKenney
This commit narrows the scope of each bucket of the forward-progress
callback-invocation histograms from one second to 100 milliseconds, which
aids debugging of forward-progress problems by making shorter-duration
callback-invocation stalls visible.

Signed-off-by: Paul E. McKenney 
---
 kernel/rcu/rcutorture.c | 8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/kernel/rcu/rcutorture.c b/kernel/rcu/rcutorture.c
index afa98162575d..a4c4a24bdcaa 100644
--- a/kernel/rcu/rcutorture.c
+++ b/kernel/rcu/rcutorture.c
@@ -1629,7 +1629,8 @@ static bool rcu_fwd_emergency_stop;
 #define MAX_FWD_CB_JIFFIES (8 * HZ) /* Maximum CB test duration. */
#define MIN_FWD_CB_LAUNDERS 3   /* This many CB invocations to count. */
 #define MIN_FWD_CBS_LAUNDERED  100 /* Number of counted CBs. */
-static long n_launders_hist[2 * MAX_FWD_CB_JIFFIES / HZ];
+#define FWD_CBS_HIST_DIV   10  /* Histogram buckets/second. */
+static long n_launders_hist[2 * MAX_FWD_CB_JIFFIES / (HZ / FWD_CBS_HIST_DIV)];
 
 static void rcu_torture_fwd_cb_hist(void)
 {
@@ -1642,7 +1643,8 @@ static void rcu_torture_fwd_cb_hist(void)
pr_alert("%s: Callback-invocation histogram (duration %lu jiffies):",
 __func__, jiffies - rcu_fwd_startat);
for (j = 0; j <= i; j++)
-   pr_cont(" %ds: %ld", j + 1, n_launders_hist[j]);
+   pr_cont(" %ds/%d: %ld",
+   j + 1, FWD_CBS_HIST_DIV, n_launders_hist[j]);
pr_cont("\n");
 }
 
@@ -1661,7 +1663,7 @@ static void rcu_torture_fwd_cb_cr(struct rcu_head *rhp)
rcu_fwd_cb_tail = &rfcp->rfc_next;
WRITE_ONCE(*rfcpp, rfcp);
WRITE_ONCE(n_launders_cb, n_launders_cb + 1);
-   i = ((jiffies - rcu_fwd_startat) / HZ);
+   i = ((jiffies - rcu_fwd_startat) / (HZ / FWD_CBS_HIST_DIV));
if (i >= ARRAY_SIZE(n_launders_hist))
i = ARRAY_SIZE(n_launders_hist) - 1;
n_launders_hist[i]++;
-- 
2.17.1
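
For illustration, a minimal userspace sketch of the bucket arithmetic,
assuming HZ=1000 (the kernel's actual value is configuration-dependent):

#include <stdio.h>

#define HZ			1000	/* Assumed for this sketch only. */
#define MAX_FWD_CB_JIFFIES	(8 * HZ)
#define FWD_CBS_HIST_DIV	10	/* Histogram buckets/second. */

static long n_launders_hist[2 * MAX_FWD_CB_JIFFIES / (HZ / FWD_CBS_HIST_DIV)];

int main(void)
{
	/* 160 buckets of HZ/10 = 100 jiffies (100ms) each at HZ=1000. */
	unsigned long delta = 1234;	/* Jiffies since rcu_fwd_startat. */
	unsigned long i = delta / (HZ / FWD_CBS_HIST_DIV);

	if (i >= sizeof(n_launders_hist) / sizeof(n_launders_hist[0]))
		i = sizeof(n_launders_hist) / sizeof(n_launders_hist[0]) - 1;
	n_launders_hist[i]++;
	printf("delta=%lu jiffies -> bucket %lu (%lums-%lums)\n",
	       delta, i, i * 100, (i + 1) * 100);
	return 0;
}

A callback invoked 1234 jiffies into the test thus lands in bucket 12, the
1200ms-1300ms bucket, where the old one-second buckets would have lumped it
together with everything from 1s to 2s.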



[PATCH tip/core/rcu 15/17] rcutorture: Recover from OOM during forward-progress tests

2018-11-11 Thread Paul E. McKenney
This commit causes the OOM handler to do rcu_barrier() calls and to
free up forward-progress callbacks in order to recover from OOM events.
The current test is terminated, but subsequent forward-progress tests can
proceed.  This allows a long test to result in multiple forward-progress
failures, greatly reducing the required testing time.

Signed-off-by: Paul E. McKenney 
---
 kernel/rcu/rcutorture.c | 60 +
 1 file changed, 49 insertions(+), 11 deletions(-)

diff --git a/kernel/rcu/rcutorture.c b/kernel/rcu/rcutorture.c
index 080b5ac6340c..afa98162575d 100644
--- a/kernel/rcu/rcutorture.c
+++ b/kernel/rcu/rcutorture.c
@@ -1649,13 +1649,14 @@ static void rcu_torture_fwd_cb_hist(void)
 /* Callback function for continuous-flood RCU callbacks. */
 static void rcu_torture_fwd_cb_cr(struct rcu_head *rhp)
 {
+   unsigned long flags;
int i;
struct rcu_fwd_cb *rfcp = container_of(rhp, struct rcu_fwd_cb, rh);
struct rcu_fwd_cb **rfcpp;
 
rfcp->rfc_next = NULL;
rfcp->rfc_gps++;
-   spin_lock(&rcu_fwd_lock);
+   spin_lock_irqsave(&rcu_fwd_lock, flags);
rfcpp = rcu_fwd_cb_tail;
rcu_fwd_cb_tail = &rfcp->rfc_next;
WRITE_ONCE(*rfcpp, rfcp);
@@ -1664,7 +1665,33 @@ static void rcu_torture_fwd_cb_cr(struct rcu_head *rhp)
if (i >= ARRAY_SIZE(n_launders_hist))
i = ARRAY_SIZE(n_launders_hist) - 1;
n_launders_hist[i]++;
-   spin_unlock(&rcu_fwd_lock);
+   spin_unlock_irqrestore(&rcu_fwd_lock, flags);
+}
+
+/*
+ * Free all callbacks on the rcu_fwd_cb_head list, either because the
+ * test is over or because we hit an OOM event.
+ */
+static unsigned long rcu_torture_fwd_prog_cbfree(void)
+{
+   unsigned long flags;
+   unsigned long freed = 0;
+   struct rcu_fwd_cb *rfcp;
+
+   for (;;) {
+   spin_lock_irqsave(&rcu_fwd_lock, flags);
+   rfcp = rcu_fwd_cb_head;
+   if (!rfcp)
+   break;
+   rcu_fwd_cb_head = rfcp->rfc_next;
+   if (!rcu_fwd_cb_head)
+   rcu_fwd_cb_tail = &rcu_fwd_cb_head;
+   spin_unlock_irqrestore(&rcu_fwd_lock, flags);
+   kfree(rfcp);
+   freed++;
+   }
+   spin_unlock_irqrestore(&rcu_fwd_lock, flags);
+   return freed;
 }
 
 /* Carry out need_resched()/cond_resched() forward-progress testing. */
@@ -1743,6 +1770,9 @@ static void rcu_torture_fwd_prog_cr(void)
unsigned long stopat;
unsigned long stoppedat;
 
+   if (READ_ONCE(rcu_fwd_emergency_stop))
+   return; /* Get out of the way quickly, no GP wait! */
+
/* Loop continuously posting RCU callbacks. */
WRITE_ONCE(rcu_fwd_cb_nodelay, true);
cur_ops->sync(); /* Later readers see above write. */
@@ -1788,16 +1818,10 @@ static void rcu_torture_fwd_prog_cr(void)
cver = READ_ONCE(rcu_torture_current_version) - cver;
gps = rcutorture_seq_diff(cur_ops->get_gp_seq(), gps);
cur_ops->cb_barrier(); /* Wait for callbacks to be invoked. */
-   for (;;) {
-   rfcp = rcu_fwd_cb_head;
-   if (!rfcp)
-   break;
-   rcu_fwd_cb_head = rfcp->rfc_next;
-   kfree(rfcp);
-   }
-   rcu_fwd_cb_tail = &rcu_fwd_cb_head;
+   (void)rcu_torture_fwd_prog_cbfree();
+
WRITE_ONCE(rcu_fwd_cb_nodelay, false);
-   if (!torture_must_stop()) {
+   if (!torture_must_stop() && !READ_ONCE(rcu_fwd_emergency_stop)) {
WARN_ON(n_max_gps < MIN_FWD_CBS_LAUNDERED);
pr_alert("%s Duration %lu barrier: %lu pending %ld n_launders: 
%ld n_launders_sa: %ld n_max_gps: %ld n_max_cbs: %ld cver %ld gps %ld\n",
 __func__,
@@ -1817,9 +1841,23 @@ static void rcu_torture_fwd_prog_cr(void)
 static int rcutorture_oom_notify(struct notifier_block *self,
 unsigned long notused, void *nfreed)
 {
+   WARN(1, "%s invoked upon OOM during forward-progress testing.\n",
+__func__);
rcu_torture_fwd_cb_hist();
rcu_fwd_progress_check(1 + (jiffies - READ_ONCE(rcu_fwd_startat)) / 2);
WRITE_ONCE(rcu_fwd_emergency_stop, true);
+   smp_mb(); /* Emergency stop before free and wait to avoid hangs. */
+   pr_info("%s: Freed %lu RCU callbacks.\n",
+   __func__, rcu_torture_fwd_prog_cbfree());
+   rcu_barrier();
+   pr_info("%s: Freed %lu RCU callbacks.\n",
+   __func__, rcu_torture_fwd_prog_cbfree());
+   rcu_barrier();
+   pr_info("%s: Freed %lu RCU callbacks.\n",
+   __func__, rcu_torture_fwd_prog_cbfree());
+   smp_mb(); /* Frees before return to avoid redoing OOM. */
+   (*(unsigned long *)nfreed)++; /* Forward progress CBs freed! */
+   pr_info("%s returning after OOM processing.\n", __func__);
return NOTIFY_OK;
 }
 
-- 
2.17.1
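
The free/barrier/free sequence generalizes to any callback flooder: the
first pass reclaims elements whose callbacks have already been invoked,
and each rcu_barrier() forces the still-in-flight callbacks through so
that the next pass can reclaim those as well.  A minimal kernel-style
sketch of the pattern (my_free_done_list() is a hypothetical helper that
empties a private done-list and returns the count freed; this is not code
from the patch):

#include <linux/notifier.h>
#include <linux/oom.h>
#include <linux/rcupdate.h>

static unsigned long my_free_done_list(void);	/* Hypothetical helper. */

static int my_oom_notify(struct notifier_block *self,
			 unsigned long notused, void *nfreed)
{
	unsigned long freed;

	freed = my_free_done_list();	/* CBs already invoked. */
	rcu_barrier();			/* Push in-flight CBs through. */
	freed += my_free_done_list();	/* CBs invoked by the wait. */
	(*(unsigned long *)nfreed) += freed; /* Report progress to OOM. */
	return NOTIFY_OK;
}

static struct notifier_block my_oom_nb = {
	.notifier_call = my_oom_notify,
};

/* Registered around each test interval: register_oom_notifier(&my_oom_nb); */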



[PATCH tip/core/rcu 12/17] rcutorture: Print histogram of CB invocation at OOM time

2018-11-11 Thread Paul E. McKenney
One reason why a forward-progress test might fail would be if something
prevented or delayed callback invocation.  This commit therefore adds a
callback-invocation histogram printout when OOM is reported to rcutorture.

Signed-off-by: Paul E. McKenney 
---
 kernel/rcu/rcutorture.c | 24 ++++++++++++++++--------
 1 file changed, 16 insertions(+), 8 deletions(-)

diff --git a/kernel/rcu/rcutorture.c b/kernel/rcu/rcutorture.c
index f28b88ecb47a..329f4fb13125 100644
--- a/kernel/rcu/rcutorture.c
+++ b/kernel/rcu/rcutorture.c
@@ -1631,6 +1631,20 @@ static bool rcu_fwd_emergency_stop;
 #define MIN_FWD_CBS_LAUNDERED  100 /* Number of counted CBs. */
 static long n_launders_hist[2 * MAX_FWD_CB_JIFFIES / HZ];
 
+static void rcu_torture_fwd_cb_hist(void)
+{
+   int i;
+   int j;
+
+   for (i = ARRAY_SIZE(n_launders_hist) - 1; i > 0; i--)
+   if (n_launders_hist[i] > 0)
+   break;
+   pr_alert("%s: Callback-invocation histogram:", __func__);
+   for (j = 0; j <= i; j++)
+   pr_cont(" %ds: %ld", j + 1, n_launders_hist[j]);
+   pr_cont("\n");
+}
+
 /* Callback function for continuous-flood RCU callbacks. */
 static void rcu_torture_fwd_cb_cr(struct rcu_head *rhp)
 {
@@ -1718,7 +1732,6 @@ static void rcu_torture_fwd_prog_cr(void)
unsigned long cver;
unsigned long gps;
int i;
-   int j;
long n_launders;
long n_launders_cb_snap;
long n_launders_sa;
@@ -1791,13 +1804,7 @@ static void rcu_torture_fwd_prog_cr(void)
 n_launders + n_max_cbs - n_launders_cb_snap,
 n_launders, n_launders_sa,
 n_max_gps, n_max_cbs, cver, gps);
-   for (i = ARRAY_SIZE(n_launders_hist) - 1; i > 0; i--)
-   if (n_launders_hist[i] > 0)
-   break;
-   pr_alert("Callback-invocation histogram:");
-   for (j = 0; j <= i; j++)
-   pr_cont(" %ds: %ld", j + 1, n_launders_hist[j]);
-   pr_cont("\n");
+   rcu_torture_fwd_cb_hist();
}
 }
 
@@ -1809,6 +1816,7 @@ static void rcu_torture_fwd_prog_cr(void)
 static int rcutorture_oom_notify(struct notifier_block *self,
 unsigned long notused, void *nfreed)
 {
+   rcu_torture_fwd_cb_hist();
rcu_fwd_progress_check(1 + (jiffies - READ_ONCE(rcu_fwd_startat)) / 2);
WRITE_ONCE(rcu_fwd_emergency_stop, true);
return NOTIFY_OK;
-- 
2.17.1



[PATCH tip/core/rcu 10/17] rcu: Print per-CPU callback counts for forward-progress failures

2018-11-11 Thread Paul E. McKenney
This commit prints out the non-zero per-CPU callback counts when a
forward-progress error (OOM event) occurs.

Signed-off-by: Paul E. McKenney 
[ paulmck: Fix a pair of uninitialized locals spotted by kbuild test robot. ]
---
 kernel/rcu/tree.c | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 67f2c7a055b6..cef7d9867508 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -2672,6 +2672,10 @@ rcu_check_gp_start_stall(struct rcu_node *rnp, struct rcu_data *rdp)
  */
 void rcu_fwd_progress_check(unsigned long j)
 {
+   unsigned long cbs;
+   int cpu;
+   unsigned long max_cbs = 0;
+   int max_cpu = -1;
struct rcu_data *rdp;
 
if (rcu_gp_in_progress()) {
@@ -2682,6 +2686,20 @@ void rcu_fwd_progress_check(unsigned long j)
rcu_check_gp_start_stall(rdp->mynode, rdp, j);
preempt_enable();
}
+   for_each_possible_cpu(cpu) {
+   cbs = rcu_get_n_cbs_cpu(cpu);
+   if (!cbs)
+   continue;
+   if (max_cpu < 0)
+   pr_info("%s: callbacks", __func__);
+   pr_cont(" %d: %lu", cpu, cbs);
+   if (cbs <= max_cbs)
+   continue;
+   max_cbs = cbs;
+   max_cpu = cpu;
+   }
+   if (max_cpu >= 0)
+   pr_cont("\n");
 }
 EXPORT_SYMBOL_GPL(rcu_fwd_progress_check);
 
-- 
2.17.1



[PATCH tip/core/rcu 17/17] rcutorture: Don't do busted forward-progress testing

2018-11-11 Thread Paul E. McKenney
The "busted" rcutorture type is an intentionally broken implementation
of RCU.  Doing forward-progress testing on this implementation is not
particularly meaningful on the one hand and can result in fatal abuse
of the memory allocator on the other.  This commit therefore disables
forward-progress testing of the "busted" rcutorture type.

Reported-by: kernel test robot 
Signed-off-by: Paul E. McKenney 
---
 kernel/rcu/rcutorture.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/kernel/rcu/rcutorture.c b/kernel/rcu/rcutorture.c
index a4c4a24bdcaa..f6e85faa4ff4 100644
--- a/kernel/rcu/rcutorture.c
+++ b/kernel/rcu/rcutorture.c
@@ -1900,7 +1900,8 @@ static int __init rcu_torture_fwd_prog_init(void)
 {
if (!fwd_progress)
return 0; /* Not requested, so don't do it. */
-   if (!cur_ops->stall_dur || cur_ops->stall_dur() <= 0) {
+   if (!cur_ops->stall_dur || cur_ops->stall_dur() <= 0 ||
+   cur_ops == &rcu_busted_ops) {
VERBOSE_TOROUT_STRING("rcu_torture_fwd_prog_init: Disabled, unsupported by RCU flavor under test");
return 0;
}
-- 
2.17.1



[PATCH tip/core/rcu 14/17] rcutorture: Print forward-progress test age upon failure

2018-11-11 Thread Paul E. McKenney
This commit prints the age of the forward-progress test in jiffies,
in order to allow better interpretation of the callback-invocation
histograms.

Signed-off-by: Paul E. McKenney 
---
 kernel/rcu/rcutorture.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/kernel/rcu/rcutorture.c b/kernel/rcu/rcutorture.c
index 329f4fb13125..080b5ac6340c 100644
--- a/kernel/rcu/rcutorture.c
+++ b/kernel/rcu/rcutorture.c
@@ -1639,7 +1639,8 @@ static void rcu_torture_fwd_cb_hist(void)
for (i = ARRAY_SIZE(n_launders_hist) - 1; i > 0; i--)
if (n_launders_hist[i] > 0)
break;
-   pr_alert("%s: Callback-invocation histogram:", __func__);
+   pr_alert("%s: Callback-invocation histogram (duration %lu jiffies):",
+__func__, jiffies - rcu_fwd_startat);
for (j = 0; j <= i; j++)
pr_cont(" %ds: %ld", j + 1, n_launders_hist[j]);
pr_cont("\n");
-- 
2.17.1



[PATCH tip/core/rcu 01/17] rcutorture: Add call_rcu() flooding forward-progress tests

2018-11-11 Thread Paul E. McKenney
From: "Paul E. McKenney" 

This commit adds a call_rcu() flooding loop to the forward-progress test.
This emulates tight userspace loops that force call_rcu() invocations,
for example, the infamous loop containing close(open()) that instigated
the addition of blimit.  If RCU does not make sufficient forward progress
in invoking the resulting flood of callbacks, rcutorture emits a warning.

Signed-off-by: Paul E. McKenney 
---
 kernel/rcu/rcutorture.c | 129 +++-
 1 file changed, 127 insertions(+), 2 deletions(-)

diff --git a/kernel/rcu/rcutorture.c b/kernel/rcu/rcutorture.c
index 210c77460365..8cf700ca7845 100644
--- a/kernel/rcu/rcutorture.c
+++ b/kernel/rcu/rcutorture.c
@@ -259,6 +259,8 @@ static atomic_t barrier_cbs_invoked;	/* Barrier callbacks invoked. */
 static wait_queue_head_t *barrier_cbs_wq; /* Coordinate barrier testing. */
 static DECLARE_WAIT_QUEUE_HEAD(barrier_wq);
 
+static bool rcu_fwd_cb_nodelay;	/* Short rcu_torture_delay() delays. */
+
 /*
  * Allocate an element from the rcu_tortures pool.
  */
@@ -348,7 +350,8 @@ rcu_read_delay(struct torture_random_state *rrsp, struct rt_read_seg *rtrsp)
 * period, and we want a long delay occasionally to trigger
 * force_quiescent_state. */
 
-   if (!(torture_random(rrsp) % (nrealreaders * 2000 * longdelay_ms))) {
+   if (!rcu_fwd_cb_nodelay &&
+   !(torture_random(rrsp) % (nrealreaders * 2000 * longdelay_ms))) {
started = cur_ops->get_gp_seq();
ts = rcu_trace_clock_local();
if (preempt_count() & (SOFTIRQ_MASK | HARDIRQ_MASK))
@@ -1674,6 +1677,43 @@ static void rcu_torture_fwd_prog_cb(struct rcu_head *rhp)
cur_ops->call(&fcsp->rh, rcu_torture_fwd_prog_cb);
 }
 
+/* State for continuous-flood RCU callbacks. */
+struct rcu_fwd_cb {
+   struct rcu_head rh;
+   struct rcu_fwd_cb *rfc_next;
+   int rfc_gps;
+};
+static DEFINE_SPINLOCK(rcu_fwd_lock);
+static struct rcu_fwd_cb *rcu_fwd_cb_head;
+static struct rcu_fwd_cb **rcu_fwd_cb_tail = &rcu_fwd_cb_head;
+static long n_launders_cb;
+static unsigned long rcu_fwd_startat;
+#define MAX_FWD_CB_JIFFIES (8 * HZ) /* Maximum CB test duration. */
+#define MIN_FWD_CB_LAUNDERS 3   /* This many CB invocations to count. */
+#define MIN_FWD_CBS_LAUNDERED  100 /* Number of counted CBs. */
+static long n_launders_hist[2 * MAX_FWD_CB_JIFFIES / HZ];
+
+/* Callback function for continuous-flood RCU callbacks. */
+static void rcu_torture_fwd_cb_cr(struct rcu_head *rhp)
+{
+   int i;
+   struct rcu_fwd_cb *rfcp = container_of(rhp, struct rcu_fwd_cb, rh);
+   struct rcu_fwd_cb **rfcpp;
+
+   rfcp->rfc_next = NULL;
+   rfcp->rfc_gps++;
+   spin_lock(&rcu_fwd_lock);
+   rfcpp = rcu_fwd_cb_tail;
+   rcu_fwd_cb_tail = &rfcp->rfc_next;
+   WRITE_ONCE(*rfcpp, rfcp);
+   WRITE_ONCE(n_launders_cb, n_launders_cb + 1);
+   i = ((jiffies - rcu_fwd_startat) / HZ);
+   if (i >= ARRAY_SIZE(n_launders_hist))
+   i = ARRAY_SIZE(n_launders_hist) - 1;
+   n_launders_hist[i]++;
+   spin_unlock(&rcu_fwd_lock);
+}
+
 /* Carry out grace-period forward-progress testing. */
 static int rcu_torture_fwd_prog(void *args)
 {
@@ -1681,11 +1721,21 @@ static int rcu_torture_fwd_prog(void *args)
unsigned long dur;
struct fwd_cb_state fcs;
unsigned long gps;
+   int i;
int idx;
+   int j;
+   long n_launders;
+   long n_launders_cb_snap;
+   long n_launders_sa;
+   long n_max_cbs;
+   long n_max_gps;
+   struct rcu_fwd_cb *rfcp;
+   struct rcu_fwd_cb *rfcpn;
int sd;
int sd4;
bool selfpropcb = false;
unsigned long stopat;
+   unsigned long stoppedat;
int tested = 0;
int tested_tries = 0;
static DEFINE_TORTURE_RANDOM(trs);
@@ -1699,6 +1749,8 @@ static int rcu_torture_fwd_prog(void *args)
}
do {
schedule_timeout_interruptible(fwd_progress_holdoff * HZ);
+
+   /* Tight loop containing cond_resched(). */
if  (selfpropcb) {
WRITE_ONCE(fcs.stop, 0);
cur_ops->call(&fcs.rh, rcu_torture_fwd_prog_cb);
@@ -1708,7 +1760,8 @@ static int rcu_torture_fwd_prog(void *args)
sd = cur_ops->stall_dur() + 1;
sd4 = (sd + fwd_progress_div - 1) / fwd_progress_div;
dur = sd4 + torture_random(&trs) % (sd - sd4);
-   stopat = jiffies + dur;
+   rcu_fwd_startat = jiffies;
+   stopat = rcu_fwd_startat + dur;
while (time_before(jiffies, stopat) && !torture_must_stop()) {
idx = cur_ops->readlock();
udelay(10);
@@ -1729,6 +1782,78 @@ static int rcu_torture_fwd_prog(void *args)
cur_ops->sync(); /* Wait for running

[PATCH tip/core/rcu 07/17] rcutorture: Prepare for asynchronous access to rcu_fwd_startat

2018-11-11 Thread Paul E. McKenney
Because rcutorture's forward-progress checking will trigger from an
OOM notifier, this notifier will introduce asynchronous concurrent
access to the rcu_fwd_startat variable.  This commit therefore prepares
for this by converting updates to WRITE_ONCE().

Signed-off-by: Paul E. McKenney 
---
 kernel/rcu/rcutorture.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/rcu/rcutorture.c b/kernel/rcu/rcutorture.c
index 36a3bc42782d..c4fd61dccedb 100644
--- a/kernel/rcu/rcutorture.c
+++ b/kernel/rcu/rcutorture.c
@@ -1679,7 +1679,7 @@ static void rcu_torture_fwd_prog_nr(int *tested, int *tested_tries)
sd = cur_ops->stall_dur() + 1;
sd4 = (sd + fwd_progress_div - 1) / fwd_progress_div;
dur = sd4 + torture_random(&trs) % (sd - sd4);
-   rcu_fwd_startat = jiffies;
+   WRITE_ONCE(rcu_fwd_startat, jiffies);
stopat = rcu_fwd_startat + dur;
while (time_before(jiffies, stopat) && !torture_must_stop()) {
idx = cur_ops->readlock();
@@ -1728,7 +1728,7 @@ static void rcu_torture_fwd_prog_cr(void)
/* Loop continuously posting RCU callbacks. */
WRITE_ONCE(rcu_fwd_cb_nodelay, true);
cur_ops->sync(); /* Later readers see above write. */
-   rcu_fwd_startat = jiffies;
+   WRITE_ONCE(rcu_fwd_startat, jiffies);
stopat = rcu_fwd_startat + MAX_FWD_CB_JIFFIES;
n_launders = 0;
n_launders_cb = 0;
-- 
2.17.1
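
The discipline being established is one marked writer (the test kthread)
and one marked asynchronous reader (the OOM notifier).  Here is a
userspace sketch of the same idiom, with simplified ONCE macros standing
in for the kernel's (an assumption of this sketch; the kernel versions
handle more cases):

#include <pthread.h>
#include <stdio.h>

#define WRITE_ONCE(x, val) (*(volatile __typeof__(x) *)&(x) = (val))
#define READ_ONCE(x)	   (*(volatile __typeof__(x) *)&(x))

static unsigned long rcu_fwd_startat;

static void *oom_notifier_side(void *arg)
{
	/* The volatile access forces a real load, so the compiler can
	 * neither tear it nor cache a stale copy. */
	printf("startat snapshot: %lu\n", READ_ONCE(rcu_fwd_startat));
	return NULL;
}

int main(void)
{
	pthread_t tid;

	WRITE_ONCE(rcu_fwd_startat, 42);	/* Test-kthread side. */
	if (pthread_create(&tid, NULL, oom_notifier_side, NULL))
		return 1;
	pthread_join(tid, NULL);
	return 0;
}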



[PATCH tip/core/rcu 04/17] rcutorture: Break up too-long rcu_torture_fwd_prog() function

2018-11-11 Thread Paul E. McKenney
From: "Paul E. McKenney" 

This commit splits rcu_torture_fwd_prog_nr() and rcu_torture_fwd_prog_cr()
functions out of rcu_torture_fwd_prog() in order to reduce indentation
pain and because rcu_torture_fwd_prog() was getting a bit too long.
In addition, this will enable easier conditional execution of the
rcu_torture_fwd_prog_cr() function, which can give false-positive
failures in some NO_HZ_FULL configurations due to overloading the
housekeeping CPUs.

Signed-off-by: Paul E. McKenney 
---
 kernel/rcu/rcutorture.c | 254 +---
 1 file changed, 135 insertions(+), 119 deletions(-)

diff --git a/kernel/rcu/rcutorture.c b/kernel/rcu/rcutorture.c
index 17f480129a78..bcc33bb8d9a6 100644
--- a/kernel/rcu/rcutorture.c
+++ b/kernel/rcu/rcutorture.c
@@ -1650,15 +1650,70 @@ static void rcu_torture_fwd_cb_cr(struct rcu_head *rhp)
spin_unlock(&rcu_fwd_lock);
 }
 
-/* Carry out grace-period forward-progress testing. */
-static int rcu_torture_fwd_prog(void *args)
+/* Carry out need_resched()/cond_resched() forward-progress testing. */
+static void rcu_torture_fwd_prog_nr(int *tested, int *tested_tries)
 {
unsigned long cver;
unsigned long dur;
struct fwd_cb_state fcs;
unsigned long gps;
-   int i;
int idx;
+   int sd;
+   int sd4;
+   bool selfpropcb = false;
+   unsigned long stopat;
+   static DEFINE_TORTURE_RANDOM(trs);
+
+   if  (cur_ops->call && cur_ops->sync && cur_ops->cb_barrier) {
+   init_rcu_head_on_stack(&fcs.rh);
+   selfpropcb = true;
+   }
+
+   /* Tight loop containing cond_resched(). */
+   if  (selfpropcb) {
+   WRITE_ONCE(fcs.stop, 0);
+   cur_ops->call(&fcs.rh, rcu_torture_fwd_prog_cb);
+   }
+   cver = READ_ONCE(rcu_torture_current_version);
+   gps = cur_ops->get_gp_seq();
+   sd = cur_ops->stall_dur() + 1;
+   sd4 = (sd + fwd_progress_div - 1) / fwd_progress_div;
+   dur = sd4 + torture_random(&trs) % (sd - sd4);
+   rcu_fwd_startat = jiffies;
+   stopat = rcu_fwd_startat + dur;
+   while (time_before(jiffies, stopat) && !torture_must_stop()) {
+   idx = cur_ops->readlock();
+   udelay(10);
+   cur_ops->readunlock(idx);
+   if (!fwd_progress_need_resched || need_resched())
+   cond_resched();
+   }
+   (*tested_tries)++;
+   if (!time_before(jiffies, stopat) && !torture_must_stop()) {
+   (*tested)++;
+   cver = READ_ONCE(rcu_torture_current_version) - cver;
+   gps = rcutorture_seq_diff(cur_ops->get_gp_seq(), gps);
+   WARN_ON(!cver && gps < 2);
+   pr_alert("%s: Duration %ld cver %ld gps %ld\n", __func__, dur, cver, gps);
+   }
+   if (selfpropcb) {
+   WRITE_ONCE(fcs.stop, 1);
+   cur_ops->sync(); /* Wait for running CB to complete. */
+   cur_ops->cb_barrier(); /* Wait for queued callbacks. */
+   }
+
+   if (selfpropcb) {
+   WARN_ON(READ_ONCE(fcs.stop) != 2);
+   destroy_rcu_head_on_stack(&fcs.rh);
+   }
+}
+
+/* Carry out call_rcu() forward-progress testing. */
+static void rcu_torture_fwd_prog_cr(void)
+{
+   unsigned long cver;
+   unsigned long gps;
+   int i;
int j;
long n_launders;
long n_launders_cb_snap;
@@ -1667,136 +1722,97 @@ static int rcu_torture_fwd_prog(void *args)
long n_max_gps;
struct rcu_fwd_cb *rfcp;
struct rcu_fwd_cb *rfcpn;
-   int sd;
-   int sd4;
-   bool selfpropcb = false;
unsigned long stopat;
unsigned long stoppedat;
+
+   /* Loop continuously posting RCU callbacks. */
+   WRITE_ONCE(rcu_fwd_cb_nodelay, true);
+   cur_ops->sync(); /* Later readers see above write. */
+   rcu_fwd_startat = jiffies;
+   stopat = rcu_fwd_startat + MAX_FWD_CB_JIFFIES;
+   n_launders = 0;
+   n_launders_cb = 0;
+   n_launders_sa = 0;
+   n_max_cbs = 0;
+   n_max_gps = 0;
+   for (i = 0; i < ARRAY_SIZE(n_launders_hist); i++)
+   n_launders_hist[i] = 0;
+   cver = READ_ONCE(rcu_torture_current_version);
+   gps = cur_ops->get_gp_seq();
+   while (time_before(jiffies, stopat) && !torture_must_stop()) {
+   rfcp = READ_ONCE(rcu_fwd_cb_head);
+   rfcpn = NULL;
+   if (rfcp)
+   rfcpn = READ_ONCE(rfcp->rfc_next);
+   if (rfcpn) {
+   if (rfcp->rfc_gps >= MIN_FWD_CB_LAUNDERS &&
+   ++n_max_gps >= MIN_FWD_CBS_LAUNDERED)
+   break;
+   rcu_fwd_cb_head = rfcpn;
+   n_launders++;
+   n_launders_sa++;
+

[PATCH tip/core/rcu 06/17] torture: Remove unnecessary "ret" variables

2018-11-11 Thread Paul E. McKenney
From: Pierce Griffiths 

Remove return variables (declared as "ret") in cases where,
depending on whether a condition evaluates as true, the result of a
function call can be immediately returned instead of storing the result in
the return variable. When the condition evaluates as false, the constant
initially stored in the return variable at declaration is returned instead.

Signed-off-by: Pierce Griffiths 
Signed-off-by: Paul E. McKenney 
---
 kernel/torture.c | 22 ++++++++--------------
 1 file changed, 8 insertions(+), 14 deletions(-)

diff --git a/kernel/torture.c b/kernel/torture.c
index 9410d1bf84d6..bbf6d473e50c 100644
--- a/kernel/torture.c
+++ b/kernel/torture.c
@@ -245,16 +245,15 @@ torture_onoff(void *arg)
  */
 int torture_onoff_init(long ooholdoff, long oointerval)
 {
-   int ret = 0;
-
 #ifdef CONFIG_HOTPLUG_CPU
onoff_holdoff = ooholdoff;
onoff_interval = oointerval;
if (onoff_interval <= 0)
return 0;
-   ret = torture_create_kthread(torture_onoff, NULL, onoff_task);
-#endif /* #ifdef CONFIG_HOTPLUG_CPU */
-   return ret;
+   return torture_create_kthread(torture_onoff, NULL, onoff_task);
+#else /* #ifdef CONFIG_HOTPLUG_CPU */
+   return 0;
+#endif /* #else #ifdef CONFIG_HOTPLUG_CPU */
 }
 EXPORT_SYMBOL_GPL(torture_onoff_init);
 
@@ -525,15 +524,13 @@ static int torture_shutdown(void *arg)
  */
 int torture_shutdown_init(int ssecs, void (*cleanup)(void))
 {
-   int ret = 0;
-
torture_shutdown_hook = cleanup;
if (ssecs > 0) {
shutdown_time = ktime_add(ktime_get(), ktime_set(ssecs, 0));
-   ret = torture_create_kthread(torture_shutdown, NULL,
+   return torture_create_kthread(torture_shutdown, NULL,
 shutdown_task);
}
-   return ret;
+   return 0;
 }
 EXPORT_SYMBOL_GPL(torture_shutdown_init);
 
@@ -632,13 +629,10 @@ static int torture_stutter(void *arg)
 /*
  * Initialize and kick off the torture_stutter kthread.
  */
-int torture_stutter_init(int s)
+int torture_stutter_init(const int s)
 {
-   int ret;
-
stutter = s;
-   ret = torture_create_kthread(torture_stutter, NULL, stutter_task);
-   return ret;
+   return torture_create_kthread(torture_stutter, NULL, stutter_task);
 }
 EXPORT_SYMBOL_GPL(torture_stutter_init);
 
-- 
2.17.1



[PATCH tip/core/rcu 08/17] rcutorture: Dump grace-period diagnostics upon forward-progress OOM

2018-11-11 Thread Paul E. McKenney
This commit adds an OOM notifier during rcutorture forward-progress
testing.  If this notifier is invoked, it dumps out some grace-period
state to help debug the forward-progress problem.

Signed-off-by: Paul E. McKenney 
---
 kernel/rcu/rcu.h        |  2 ++
 kernel/rcu/rcutorture.c | 31 ++++++++++++++++++++++++++++---
 kernel/rcu/tree.c       | 20 ++++++++++++++++++++
 3 files changed, 50 insertions(+), 3 deletions(-)

diff --git a/kernel/rcu/rcu.h b/kernel/rcu/rcu.h
index 0f0f5ae8c3d4..a393e24a9195 100644
--- a/kernel/rcu/rcu.h
+++ b/kernel/rcu/rcu.h
@@ -526,12 +526,14 @@ srcu_batches_completed(struct srcu_struct *sp) { return 0; }
 static inline void rcu_force_quiescent_state(void) { }
 static inline void show_rcu_gp_kthreads(void) { }
 static inline int rcu_get_gp_kthreads_prio(void) { return 0; }
+static inline void rcu_fwd_progress_check(unsigned long j) { }
 #else /* #ifdef CONFIG_TINY_RCU */
 unsigned long rcu_get_gp_seq(void);
 unsigned long rcu_exp_batches_completed(void);
 unsigned long srcu_batches_completed(struct srcu_struct *sp);
 void show_rcu_gp_kthreads(void);
 int rcu_get_gp_kthreads_prio(void);
+void rcu_fwd_progress_check(unsigned long j);
 void rcu_force_quiescent_state(void);
 extern struct workqueue_struct *rcu_gp_wq;
 extern struct workqueue_struct *rcu_par_gp_wq;
diff --git a/kernel/rcu/rcutorture.c b/kernel/rcu/rcutorture.c
index c4fd61dccedb..f28b88ecb47a 100644
--- a/kernel/rcu/rcutorture.c
+++ b/kernel/rcu/rcutorture.c
@@ -56,6 +56,7 @@
 #include 
 #include 
 #include 
+#include <linux/oom.h>
 
 #include "rcu.h"
 
@@ -1624,6 +1625,7 @@ static struct rcu_fwd_cb *rcu_fwd_cb_head;
 static struct rcu_fwd_cb **rcu_fwd_cb_tail = &rcu_fwd_cb_head;
 static long n_launders_cb;
 static unsigned long rcu_fwd_startat;
+static bool rcu_fwd_emergency_stop;
 #define MAX_FWD_CB_JIFFIES (8 * HZ) /* Maximum CB test duration. */
#define MIN_FWD_CB_LAUNDERS 3   /* This many CB invocations to count. */
 #define MIN_FWD_CBS_LAUNDERED  100 /* Number of counted CBs. */
@@ -1681,7 +1683,8 @@ static void rcu_torture_fwd_prog_nr(int *tested, int *tested_tries)
dur = sd4 + torture_random(&trs) % (sd - sd4);
WRITE_ONCE(rcu_fwd_startat, jiffies);
stopat = rcu_fwd_startat + dur;
-   while (time_before(jiffies, stopat) && !torture_must_stop()) {
+   while (time_before(jiffies, stopat) &&
+  !READ_ONCE(rcu_fwd_emergency_stop) && !torture_must_stop()) {
idx = cur_ops->readlock();
udelay(10);
cur_ops->readunlock(idx);
@@ -1689,7 +1692,8 @@ static void rcu_torture_fwd_prog_nr(int *tested, int *tested_tries)
cond_resched();
}
(*tested_tries)++;
-   if (!time_before(jiffies, stopat) && !torture_must_stop()) {
+   if (!time_before(jiffies, stopat) &&
+   !READ_ONCE(rcu_fwd_emergency_stop) && !torture_must_stop()) {
(*tested)++;
cver = READ_ONCE(rcu_torture_current_version) - cver;
gps = rcutorture_seq_diff(cur_ops->get_gp_seq(), gps);
@@ -1739,7 +1743,8 @@ static void rcu_torture_fwd_prog_cr(void)
n_launders_hist[i] = 0;
cver = READ_ONCE(rcu_torture_current_version);
gps = cur_ops->get_gp_seq();
-   while (time_before(jiffies, stopat) && !torture_must_stop()) {
+   while (time_before(jiffies, stopat) &&
+  !READ_ONCE(rcu_fwd_emergency_stop) && !torture_must_stop()) {
rfcp = READ_ONCE(rcu_fwd_cb_head);
rfcpn = NULL;
if (rfcp)
@@ -1796,6 +1801,23 @@ static void rcu_torture_fwd_prog_cr(void)
}
 }
 
+
+/*
+ * OOM notifier, but this only prints diagnostic information for the
+ * current forward-progress test.
+ */
+static int rcutorture_oom_notify(struct notifier_block *self,
+unsigned long notused, void *nfreed)
+{
+   rcu_fwd_progress_check(1 + (jiffies - READ_ONCE(rcu_fwd_startat)) / 2);
+   WRITE_ONCE(rcu_fwd_emergency_stop, true);
+   return NOTIFY_OK;
+}
+
+static struct notifier_block rcutorture_oom_nb = {
+   .notifier_call = rcutorture_oom_notify
+};
+
 /* Carry out grace-period forward-progress testing. */
 static int rcu_torture_fwd_prog(void *args)
 {
@@ -1808,8 +1830,11 @@ static int rcu_torture_fwd_prog(void *args)
set_user_nice(current, MAX_NICE);
do {
schedule_timeout_interruptible(fwd_progress_holdoff * HZ);
+   WRITE_ONCE(rcu_fwd_emergency_stop, false);
+   register_oom_notifier(&rcutorture_oom_nb);
rcu_torture_fwd_prog_nr(, _tries);
rcu_torture_fwd_prog_cr();
+   unregister_oom_notifier(&rcutorture_oom_nb);
 
/* Avoid slow periods, better to test when busy. */
stutter_wait("rcu_torture_fwd_prog");
diff --git a/kernel/rcu/tree.c b/k

[PATCH tip/core/rcu 02/17] torture: Bring any extra CPUs online during kernel startup

2018-11-11 Thread Paul E. McKenney
From: "Paul E. McKenney" 

Currently, the torture scripts rely on the initrd/init script to bring
any extra CPUs online, for example, in the case where the kernel and
qemu have different ideas about how many CPUs are present.  This works,
but is an unnecessary dependency on initrd, which needs to vary depending
on the distro.  This commit therefore causes torture_onoff() to check
for additional CPUs, attempting to bring any found online. Errors are
ignored, just as they are by the initrd/init script.

Signed-off-by: Paul E. McKenney 
---
 kernel/torture.c | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/kernel/torture.c b/kernel/torture.c
index 17d91f5fba2a..9410d1bf84d6 100644
--- a/kernel/torture.c
+++ b/kernel/torture.c
@@ -194,11 +194,23 @@ torture_onoff(void *arg)
int cpu;
int maxcpu = -1;
DEFINE_TORTURE_RANDOM(rand);
+   int ret;
 
VERBOSE_TOROUT_STRING("torture_onoff task started");
for_each_online_cpu(cpu)
maxcpu = cpu;
WARN_ON(maxcpu < 0);
+   if (!IS_MODULE(CONFIG_TORTURE_TEST))
+   for_each_possible_cpu(cpu) {
+   if (cpu_online(cpu))
+   continue;
+   ret = cpu_up(cpu);
+   if (ret && verbose) {
+   pr_alert("%s" TORTURE_FLAG
+"%s: Initial online %d: errno %d\n",
+__func__, torture_type, cpu, ret);
+   }
+   }
 
if (maxcpu == 0) {
VERBOSE_TOROUT_STRING("Only one CPU, so CPU-hotplug testing is 
disabled");
-- 
2.17.1



[PATCH tip/core/rcu 0/17] Torture-test updates for v4.21/v5.0

2018-11-11 Thread Paul E. McKenney
Hello!

This series contains torture-test updates:

1.  Add call_rcu() flooding forward-progress tests.  If people
are going to be seeing forward-progress issues with RCU, then
rcutorture needs to up its game.

2.  Bring any extra CPUs online during kernel startup.

3.  Remove cbflood facility due to its being obsoleted by flooding
forward-progress tests.

4.  Break up too-long rcu_torture_fwd_prog() function.

5.  Affinity forward-progress test to avoid housekeeping CPUs.

6.  Remove unnecessary "ret" variables, courtesy of Pierce Griffiths.

7.  Prepare for asynchronous access to rcu_fwd_startat.

8.  Dump grace-period diagnostics upon forward-progress OOM.

9.  Account for nocb-CPU callback counts in RCU CPU stall warnings.

10. Print per-CPU callback counts for forward-progress failures.

11. Print GP age upon forward-progress failure.

12. Print histogram of CB invocation at OOM time.

13. Print time since GP end upon forward-progress failure.

14. Print forward-progress test age upon failure.

15. Recover from OOM during forward-progress tests.

16. Use 100ms buckets for forward-progress callback histograms.

17. Don't do forward-progress testing of known-bad "RCU" variants.

Thanx, Paul



 Documentation/admin-guide/kernel-parameters.txt |   18 
 kernel/rcu/rcu.h                                |    4 
 kernel/rcu/rcutorture.c                         |  603 ++--
 kernel/rcu/tree.c                               |   64 ++
 kernel/rcu/tree.h                               |    3 
 kernel/rcu/tree_plugin.h                        |   35 +
 kernel/torture.c                                |   34 -
 7 files changed, 484 insertions(+), 277 deletions(-)



[PATCH tip/core/rcu 3/4] srcu: Lock srcu_data structure in srcu_gp_start()

2018-11-11 Thread Paul E. McKenney
From: Dennis Krein 

The srcu_gp_start() function is called with the srcu_struct structure's
->lock held, but not with the srcu_data structure's ->lock.  This is
problematic because this function accesses and updates the srcu_data
structure's ->srcu_cblist, which is protected by that lock.  Failing to
hold this lock can result in corruption of the SRCU callback lists,
which in turn can result in arbitrarily bad results.

This commit therefore makes srcu_gp_start() acquire the srcu_data
structure's ->lock across the calls to rcu_segcblist_advance() and
rcu_segcblist_accelerate(), thus preventing this corruption.

Reported-by: Bart Van Assche 
Reported-by: Christoph Hellwig 
Reported-by: Sebastian Kuzminsky 
Signed-off-by: Dennis Krein 
Signed-off-by: Paul E. McKenney 
Tested-by: Dennis Krein 
Cc:  # 4.12.x
---
 kernel/rcu/srcutree.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c
index 60f3236beaf7..697a2d7e8e8a 100644
--- a/kernel/rcu/srcutree.c
+++ b/kernel/rcu/srcutree.c
@@ -451,10 +451,12 @@ static void srcu_gp_start(struct srcu_struct *sp)
 
lockdep_assert_held(&ACCESS_PRIVATE(sp, lock));
WARN_ON_ONCE(ULONG_CMP_GE(sp->srcu_gp_seq, sp->srcu_gp_seq_needed));
+   spin_lock_rcu_node(sdp);  /* Interrupts already disabled. */
rcu_segcblist_advance(&sdp->srcu_cblist,
  rcu_seq_current(&sp->srcu_gp_seq));
(void)rcu_segcblist_accelerate(&sdp->srcu_cblist,
   rcu_seq_snap(&sp->srcu_gp_seq));
+   spin_unlock_rcu_node(sdp);  /* Interrupts remain disabled. */
smp_mb(); /* Order prior store to ->srcu_gp_seq_needed vs. GP start. */
rcu_seq_start(&sp->srcu_gp_seq);
state = rcu_seq_state(READ_ONCE(sp->srcu_gp_seq));
-- 
2.17.1



[PATCH tip/core/rcu 2/4] srcu: Prevent __call_srcu() counter wrap with read-side critical section

2018-11-11 Thread Paul E. McKenney
From: "Paul E. McKenney" 

Ever since cdf7abc4610a ("srcu: Allow use of Tiny/Tree SRCU from
both process and interrupt context"), it has been permissible
to use SRCU read-side critical sections in interrupt context.
This allows __call_srcu() to use SRCU read-side critical sections to
prevent a new SRCU grace period from ending before the call to either
srcu_funnel_gp_start() or srcu_funnel_exp_start() completes, thus preventing
SRCU grace-period counter overflow during that time.

Note that this does not permit removal of the counter-wrap checks in
srcu_gp_end().  These checks are necessary to handle the case where
a given CPU does not interact at all with SRCU for an extended time
period.

This commit therefore adds an SRCU read-side critical section to
__call_srcu() in order to prevent grace period counter wrap during
the funnel-locking process.

Signed-off-by: Paul E. McKenney 
---
 kernel/rcu/srcutree.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c
index a8846ed7f352..60f3236beaf7 100644
--- a/kernel/rcu/srcutree.c
+++ b/kernel/rcu/srcutree.c
@@ -858,6 +858,7 @@ void __call_srcu(struct srcu_struct *sp, struct rcu_head *rhp,
 rcu_callback_t func, bool do_norm)
 {
unsigned long flags;
+   int idx;
bool needexp = false;
bool needgp = false;
unsigned long s;
@@ -871,6 +872,7 @@ void __call_srcu(struct srcu_struct *sp, struct rcu_head *rhp,
return;
}
rhp->func = func;
+   idx = srcu_read_lock(sp);
local_irq_save(flags);
sdp = this_cpu_ptr(sp->sda);
spin_lock_rcu_node(sdp);
@@ -892,6 +894,7 @@ void __call_srcu(struct srcu_struct *sp, struct rcu_head *rhp,
srcu_funnel_gp_start(sp, sdp, s, do_norm);
else if (needexp)
srcu_funnel_exp_start(sp, sdp->mynode, s);
+   srcu_read_unlock(sp, idx);
 }
 
 /**
-- 
2.17.1
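
Reduced to its essentials, the pattern is (illustrative, using the real
SRCU API but not the patch's exact code; sp is the srcu_struct pointer):

	unsigned long s;
	int idx;

	idx = srcu_read_lock(sp);	/* Readers block GP completion... */
	s = rcu_seq_snap(&sp->srcu_gp_seq); /* ...so this snapshot holds. */
	/* ... funnel locking compares counters against s ... */
	srcu_read_unlock(sp, idx);	/* Counters may advance again. */

Because no grace period can complete while the read-side critical section
is in force, the sequence counter cannot advance far enough to wrap past
the snapshot before the funnel-locking code is done with it.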



[PATCH tip/core/rcu 4/4] srcu: Use "ssp" instead of "sp" for srcu_struct pointer

2018-11-11 Thread Paul E. McKenney
In RCU, the distinction between "rsp", "rnp", and "rdp" has served well
for a great many years, but in SRCU, "sp" vs. "sdp" has proven confusing.
This commit therefore renames SRCU's "sp" pointers to "ssp", so that there
is "ssp" for srcu_struct pointer, "snp" for srcu_node pointer, and "sdp"
for srcu_data pointer.

Signed-off-by: Paul E. McKenney 
---
 include/linux/srcu.h     |  78 +++
 include/linux/srcutiny.h |  24 +-
 include/linux/srcutree.h |   8 +-
 kernel/rcu/srcutiny.c    | 120 +-
 kernel/rcu/srcutree.c    | 488 +++
 5 files changed, 359 insertions(+), 359 deletions(-)

diff --git a/include/linux/srcu.h b/include/linux/srcu.h
index ebd5f1511690..c614375cd264 100644
--- a/include/linux/srcu.h
+++ b/include/linux/srcu.h
@@ -38,20 +38,20 @@ struct srcu_struct;
 
 #ifdef CONFIG_DEBUG_LOCK_ALLOC
 
-int __init_srcu_struct(struct srcu_struct *sp, const char *name,
+int __init_srcu_struct(struct srcu_struct *ssp, const char *name,
   struct lock_class_key *key);
 
-#define init_srcu_struct(sp) \
+#define init_srcu_struct(ssp) \
 ({ \
static struct lock_class_key __srcu_key; \
\
-   __init_srcu_struct((sp), #sp, &__srcu_key); \
+   __init_srcu_struct((ssp), #ssp, &__srcu_key); \
 })
 
 #define __SRCU_DEP_MAP_INIT(srcu_name) .dep_map = { .name = #srcu_name },
 #else /* #ifdef CONFIG_DEBUG_LOCK_ALLOC */
 
-int init_srcu_struct(struct srcu_struct *sp);
+int init_srcu_struct(struct srcu_struct *ssp);
 
 #define __SRCU_DEP_MAP_INIT(srcu_name)
 #endif /* #else #ifdef CONFIG_DEBUG_LOCK_ALLOC */
@@ -67,28 +67,28 @@ int init_srcu_struct(struct srcu_struct *sp);
 struct srcu_struct { };
 #endif
 
-void call_srcu(struct srcu_struct *sp, struct rcu_head *head,
+void call_srcu(struct srcu_struct *ssp, struct rcu_head *head,
void (*func)(struct rcu_head *head));
-void _cleanup_srcu_struct(struct srcu_struct *sp, bool quiesced);
-int __srcu_read_lock(struct srcu_struct *sp) __acquires(sp);
-void __srcu_read_unlock(struct srcu_struct *sp, int idx) __releases(sp);
-void synchronize_srcu(struct srcu_struct *sp);
+void _cleanup_srcu_struct(struct srcu_struct *ssp, bool quiesced);
+int __srcu_read_lock(struct srcu_struct *ssp) __acquires(ssp);
+void __srcu_read_unlock(struct srcu_struct *ssp, int idx) __releases(ssp);
+void synchronize_srcu(struct srcu_struct *ssp);
 
 /**
  * cleanup_srcu_struct - deconstruct a sleep-RCU structure
- * @sp: structure to clean up.
+ * @ssp: structure to clean up.
  *
  * Must invoke this after you are finished using a given srcu_struct that
  * was initialized via init_srcu_struct(), else you leak memory.
  */
-static inline void cleanup_srcu_struct(struct srcu_struct *sp)
+static inline void cleanup_srcu_struct(struct srcu_struct *ssp)
 {
-   _cleanup_srcu_struct(sp, false);
+   _cleanup_srcu_struct(ssp, false);
 }
 
 /**
  * cleanup_srcu_struct_quiesced - deconstruct a quiesced sleep-RCU structure
- * @sp: structure to clean up.
+ * @ssp: structure to clean up.
  *
  * Must invoke this after you are finished using a given srcu_struct that
  * was initialized via init_srcu_struct(), else you leak memory.  Also,
@@ -103,16 +103,16 @@ static inline void cleanup_srcu_struct(struct srcu_struct *sp)
  * (with high probability, anyway), and will also cause the srcu_struct
  * to be leaked.
  */
-static inline void cleanup_srcu_struct_quiesced(struct srcu_struct *sp)
+static inline void cleanup_srcu_struct_quiesced(struct srcu_struct *ssp)
 {
-   _cleanup_srcu_struct(sp, true);
+   _cleanup_srcu_struct(ssp, true);
 }
 
 #ifdef CONFIG_DEBUG_LOCK_ALLOC
 
 /**
  * srcu_read_lock_held - might we be in SRCU read-side critical section?
- * @sp: The srcu_struct structure to check
+ * @ssp: The srcu_struct structure to check
  *
  * If CONFIG_DEBUG_LOCK_ALLOC is selected, returns nonzero iff in an SRCU
  * read-side critical section.  In absence of CONFIG_DEBUG_LOCK_ALLOC,
@@ -126,16 +126,16 @@ static inline void cleanup_srcu_struct_quiesced(struct srcu_struct *sp)
  * relies on normal RCU, it can be called from the CPU which
  * is in the idle loop from an RCU point of view or offline.
  */
-static inline int srcu_read_lock_held(const struct srcu_struct *sp)
+static inline int srcu_read_lock_held(const struct srcu_struct *ssp)
 {
if (!debug_lockdep_rcu_enabled())
return 1;
-   return lock_is_held(&sp->dep_map);
+   return lock_is_held(&ssp->dep_map);
 }
 
 #else /* #ifdef CONFIG_DEBUG_LOCK_ALLOC */
 
-static inline int srcu_read_lock_held(const struct srcu_struct *sp)
+static inline int srcu_read_lock_held(const struct srcu_struct *ssp)
 {
return 1;
 }
@@ -145,7 +145,7 @@ static inline int srcu_read_lock_held(const struct srcu_struct *sp)
 /**
 * srcu_dereference_check - fetch SRCU-protected pointer for later dereferencing
  * @p: th

[PATCH tip/core/rcu 1/4] srcu: Fix kernel-doc missing notation

2018-11-11 Thread Paul E. McKenney
From: Randy Dunlap 

Fix kernel-doc warnings for missing parameter descriptions:

../include/linux/srcu.h:175: warning: Function parameter or member 'p' not described in 'srcu_dereference_notrace'
../include/linux/srcu.h:175: warning: Function parameter or member 'sp' not described in 'srcu_dereference_notrace'

Fixes: 0b764a6e4e19d ("srcu: Add notrace variant of srcu_dereference")

Signed-off-by: Randy Dunlap 
Cc: Lai Jiangshan 
Cc: "Paul E. McKenney" 
Cc: Josh Triplett 
Cc: Steven Rostedt 
Cc: Mathieu Desnoyers 
Cc: Joel Fernandes (Google) 
Signed-off-by: Paul E. McKenney 
---
 include/linux/srcu.h | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/include/linux/srcu.h b/include/linux/srcu.h
index 67135d4a8a30..ebd5f1511690 100644
--- a/include/linux/srcu.h
+++ b/include/linux/srcu.h
@@ -171,6 +171,9 @@ static inline int srcu_read_lock_held(const struct srcu_struct *sp)
 
 /**
  * srcu_dereference_notrace - no tracing and no lockdep calls from here
+ * @p: the pointer to fetch and protect for later dereferencing
+ * @sp: pointer to the srcu_struct, which is used to check that we
+ * really are in an SRCU read-side critical section.
  */
 #define srcu_dereference_notrace(p, sp) srcu_dereference_check((p), (sp), 1)
 
-- 
2.17.1



[PATCH tip/core/rcu 0/4] SRCU updates for v4.21/v5.0

2018-11-11 Thread Paul E. McKenney
Hello!

This series contains SRCU updates:

1.  Fix kernel-doc missing notation, courtesy of Randy Dunlap.

2.  Prevent __call_srcu() counter wrap with read-side critical section.

3.  Lock srcu_data structure in srcu_gp_start(), fixing an extremely
rare but also extremely embarrassing concurrency bug, courtesy
of Dennis Krein.

4.  Use "ssp" instead of "sp" for srcu_struct pointer, in the hope
that this prevents confusion about which lock is held.

Thanx, Paul



 include/linux/srcu.h     |   81 ---
 include/linux/srcutiny.h |   24 +-
 include/linux/srcutree.h |    8 
 kernel/rcu/srcutiny.c    |  120 +--
 kernel/rcu/srcutree.c    |  493 +++
 5 files changed, 367 insertions(+), 359 deletions(-)



[PATCH tip/core/rcu 3/7] smsc: Replace spin_is_locked() with lockdep

2018-11-11 Thread Paul E. McKenney
From: Lance Roy 

lockdep_assert_held() is better suited to checking locking requirements,
since it checks that the current thread itself holds the lock, rather
than merely that some thread does. This is also a step towards possibly removing
spin_is_locked().

Signed-off-by: Lance Roy 
Cc: Steve Glendinning 
Cc: "David S. Miller" 
Cc: 
Signed-off-by: Paul E. McKenney 
---
 drivers/net/ethernet/smsc/smsc911x.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/smsc/smsc911x.h b/drivers/net/ethernet/smsc/smsc911x.h
index 8d75508acd2b..51b2fc1a395f 100644
--- a/drivers/net/ethernet/smsc/smsc911x.h
+++ b/drivers/net/ethernet/smsc/smsc911x.h
@@ -67,7 +67,7 @@
 
 #ifdef CONFIG_DEBUG_SPINLOCK
 #define SMSC_ASSERT_MAC_LOCK(pdata) \
-   WARN_ON_SMP(!spin_is_locked(&pdata->mac_lock))
+   lockdep_assert_held(&pdata->mac_lock)
 #else
 #define SMSC_ASSERT_MAC_LOCK(pdata) do {} while (0)
 #endif /* CONFIG_DEBUG_SPINLOCK */
-- 
2.17.1
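
The practical difference shows up in a sketch like the following
(my_mac_update() is a made-up illustration in the driver's context,
not code from the patch):

static void my_mac_update(struct smsc911x_data *pdata)
{
	/* Fails only if the calling thread does not hold mac_lock;
	 * spin_is_locked() would be satisfied by any CPU holding it,
	 * and is always false on !SMP builds, hence WARN_ON_SMP(). */
	lockdep_assert_held(&pdata->mac_lock);
	/* ... modify state protected by mac_lock ... */
}

The lockdep form also compiles away entirely when lockdep is disabled,
so the CONFIG_DEBUG_SPINLOCK guard becomes a belt-and-suspenders measure
rather than a necessity.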



[PATCH tip/core/rcu 1/7] x86/PCI: Replace spin_is_locked() with lockdep

2018-11-11 Thread Paul E. McKenney
From: Lance Roy 

lockdep_assert_held() is better suited to checking locking requirements,
since it checks that the current thread itself holds the lock, rather
than merely that some thread does. This is also a step towards possibly removing
spin_is_locked().

Signed-off-by: Lance Roy 
Cc: Bjorn Helgaas 
Cc: Thomas Gleixner 
Cc: Ingo Molnar 
Cc: Borislav Petkov 
Cc: "H. Peter Anvin" 
Cc: 
Cc: 
Signed-off-by: Paul E. McKenney 
---
 arch/x86/pci/i386.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/pci/i386.c b/arch/x86/pci/i386.c
index 8cd66152cdb0..9df652d3d927 100644
--- a/arch/x86/pci/i386.c
+++ b/arch/x86/pci/i386.c
@@ -59,7 +59,7 @@ static struct pcibios_fwaddrmap *pcibios_fwaddrmap_lookup(struct pci_dev *dev)
 {
struct pcibios_fwaddrmap *map;
 
-   WARN_ON_SMP(!spin_is_locked(&pcibios_fwaddrmap_lock));
+   lockdep_assert_held(&pcibios_fwaddrmap_lock);
 
list_for_each_entry(map, &pcibios_fwaddrmappings, list)
if (map->dev == dev)
-- 
2.17.1



[PATCH tip/core/rcu 2/8] rcutorture: Add initrd support for systems lacking dracut

2018-11-11 Thread Paul E. McKenney
From: "Paul E. McKenney" 

Although the support for creating initrd directories using dracut is a great
improvement over having to always hand-create them, it is a bit annoying
to have to install some otherwise irrelevant package just to be able to
run rcutorture.  This commit therefore adds support for creating initrd
directories on systems innocent of dracut.  You do need gcc, but then
again you need that to build the kernel (or to build llvm) in any case.

The idea is to create an initrd directory containing nothing but a
statically linked binary having a for-loop over a long-term sleep().
The result is a Linux kernel with almost no userspace: even the
time-honored /dev, /lib, /tmp, and /usr directories are gone.  In fact,
the only directory present is "/", but only because I don't know how to
get rid of it, at least short of not having an initrd in the first place.
Although statically linked binaries are much maligned, and rightly so,
their disadvantages seem to be irrelevant for this particular use case.
From https://www.akkadia.org/drepper/no_static_linking.html:

1.  Fixes are difficult to apply to hordes of widely scattered
statically linked binaries.  In this case, however, there is only
one binary, where there would otherwise be no fewer than four libraries.

2.  Security measures like local address randomization cannot be used.
Prudence prevents me from asserting that it is impossible to
base a remote attack on a networking-free rcutorture instance.
Nevertheless, bonus points to the first person who comes up with
such an attack!

3.  More efficient use of physical memory.  Not in this case, given
that libc is 1.8MB and the statically linked binary "only" 800K.

4.  Features such as locales, name service switch (NSS),
internationalized domain names (IDN) tool, and so on require
dynamic linking.  Bonus points to the first person coming up
with a valid rcutorture use case requiring these features in
its initrd.

5.  Accidental violations of (L)GPL.  This change actually
helps -avoid- such violations by reducing the temptation to
pass around tarballs of rcutorture-ready initrd directories.
After all, the rcutorture scripts automatically create an initrd
directory for you, so why bother with the tarballs?

6.  Tools and hacks like ltrace, LD_PRELOAD, LD_PROFILE, and LD_AUDIT
don't work.  Again, bonus points to the first person coming up
with a valid rcutorture use case requiring these features in
its initrd.

Nevertheless, the script will use dracut if available, and will create the
statically linked binary only when dracut is missing.  Those preferring
the smaller initrd directory resulting from the statically linked binary
(like me) are free to hand-edit mkinitrd.sh to remove the code using
dracut.  ;-)

Signed-off-by: Paul E. McKenney 
---
 .../selftests/rcutorture/bin/mkinitrd.sh  | 40 ++--
 .../selftests/rcutorture/doc/initrd.txt   | 99 +++
 2 files changed, 45 insertions(+), 94 deletions(-)

diff --git a/tools/testing/selftests/rcutorture/bin/mkinitrd.sh b/tools/testing/selftests/rcutorture/bin/mkinitrd.sh
index ae773760f396..87a87ffeaa85 100755
--- a/tools/testing/selftests/rcutorture/bin/mkinitrd.sh
+++ b/tools/testing/selftests/rcutorture/bin/mkinitrd.sh
@@ -46,15 +46,41 @@ done
 __EOF___
 
 # Try using dracut to create initrd
-command -v dracut >/dev/null 2>&1 || { echo >&2 "Dracut not installed"; exit 1; }
-echo Creating $D/initrd using dracut.
+if command -v dracut >/dev/null 2>&1
+then
+   echo Creating $D/initrd using dracut.
+   # Filesystem creation
+   dracut --force --no-hostonly --no-hostonly-cmdline --module "base" $T/initramfs.img
+   cd $D
+   mkdir initrd
+   cd initrd
+   zcat $T/initramfs.img | cpio -id
+   cp $T/init init
+   chmod +x init
+   echo Done creating $D/initrd using dracut
+   exit 0
+fi
 
-# Filesystem creation
-dracut --force --no-hostonly --no-hostonly-cmdline --module "base" $T/initramfs.img
+# No dracut, so create a C-language initrd/init program and statically
+# link it.  This results in a very small initrd, but might be a bit less
+# future-proof than dracut.
+echo "Could not find dracut, attempting C initrd"
 cd $D
 mkdir initrd
 cd initrd
-zcat $T/initramfs.img | cpio -id
-cp $T/init init
-echo Done creating $D/initrd using dracut
+cat > init.c << '___EOF___'
+#include <unistd.h>
+
+int main(int argc, char *argv[])
+{
+   for (;;)
+   sleep(1000*1000*1000); /* One gigasecond is ~30 years. */
+   return 0;
+}
+___EOF___
+gcc -static -Os -o init init.c
+strip init
+rm init.c
+echo "Done creating a statically linked C-language initrd"
+
 exit 0
diff --git a/tools/testing/selftests/rcutorture/

[PATCH tip/core/rcu 3/8] rcutorture: Make initrd/init execute in userspace

2018-11-11 Thread Paul E. McKenney
From: "Paul E. McKenney" 

Currently, the initrd/init script and executable remain blocked almost
all the time.  However, it is necessary to test nohz_full userspace
execution, which both variants of initrd/init fail to do.  This commit
therefore causes initrd/init to spend about a millisecond per second
executing in userspace.

Reported-by: Josh Triplett 
Signed-off-by: Paul E. McKenney 
---
 .../selftests/rcutorture/bin/mkinitrd.sh  | 43 +--
 1 file changed, 39 insertions(+), 4 deletions(-)

diff --git a/tools/testing/selftests/rcutorture/bin/mkinitrd.sh b/tools/testing/selftests/rcutorture/bin/mkinitrd.sh
index 87a87ffeaa85..b48c504edfe1 100755
--- a/tools/testing/selftests/rcutorture/bin/mkinitrd.sh
+++ b/tools/testing/selftests/rcutorture/bin/mkinitrd.sh
@@ -39,9 +39,22 @@ mkdir $T
 
 cat > $T/init << '__EOF___'
 #!/bin/sh
+# Run in userspace a few milliseconds every second.  This helps to
+# exercise the NO_HZ_FULL portions of RCU.
 while :
 do
-   sleep 100
+   q=
+   for i in \
+   a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a \
+   a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a \
+   a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a \
+   a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a \
+   a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a \
+   a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a
+   do
+   q="$q $i"
+   done
+   sleep 1
 done
 __EOF___
 
@@ -70,15 +83,37 @@ mkdir initrd
 cd initrd
 cat > init.c << '___EOF___'
 #include <unistd.h>
+#include <sys/time.h>
+
+volatile unsigned long delaycount;
 
int main(int argc, char *argv[])
 {
-   for (;;)
-   sleep(1000*1000*1000); /* One gigasecond is ~30 years. */
+   int i;
+   struct timeval tv;
+   struct timeval tvb;
+
+   for (;;) {
+   sleep(1);
+   /* Need some userspace time. */
+   if (gettimeofday(&tvb, NULL))
+   continue;
+   do {
+   for (i = 0; i < 1000 * 100; i++)
+   delaycount = i * i;
+   if (gettimeofday(&tv, NULL))
+   break;
+   tv.tv_sec -= tvb.tv_sec;
+   if (tv.tv_sec > 1)
+   break;
+   tv.tv_usec += tv.tv_sec * 1000 * 1000;
+   tv.tv_usec -= tvb.tv_usec;
+   } while (tv.tv_usec < 1000);
+   }
return 0;
 }
 ___EOF___
-gcc -static -Os -o init init.c
+cc -static -Os -o init init.c
 strip init
 rm init.c
 echo "Done creating a statically linked C-language initrd"
-- 
2.17.1



[PATCH tip/core/rcu 5/8] rcutorture: Always strip using the cross-compiler

2018-11-11 Thread Paul E. McKenney
From: Willy Tarreau 

Strip using -s on the compiler command line instead of calling the "strip"
utility as the latter isn't necessarily compatible with the target arch.

Signed-off-by: Willy Tarreau 
Signed-off-by: Paul E. McKenney 
---
 tools/testing/selftests/rcutorture/bin/mkinitrd.sh | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/tools/testing/selftests/rcutorture/bin/mkinitrd.sh b/tools/testing/selftests/rcutorture/bin/mkinitrd.sh
index 70661457e3d6..dbb6f0160281 100755
--- a/tools/testing/selftests/rcutorture/bin/mkinitrd.sh
+++ b/tools/testing/selftests/rcutorture/bin/mkinitrd.sh
@@ -113,8 +113,7 @@ int main(int argc, char *argv[])
return 0;
 }
 ___EOF___
-${CROSS_COMPILE}gcc -static -Os -o init init.c
-strip init
+${CROSS_COMPILE}gcc -s -static -Os -o init init.c
 rm init.c
 echo "Done creating a statically linked C-language initrd"
 
-- 
2.17.1


