Re: [patch V6 12/37] x86/entry: Provide idtentry_entry/exit_cond_rcu()

2020-05-21 Thread Paul E. McKenney
On Thu, May 21, 2020 at 08:41:11PM +0200, Thomas Gleixner wrote:
> "Paul E. McKenney"  writes:
> > On Thu, May 21, 2020 at 10:31:11AM +0200, Thomas Gleixner wrote:
> >> And I made this a NOP for !NOHZ_FULL systems and avoided the call if
> >> context tracking is not enabled at boot.
> >> 
> >> void __rcu_irq_enter_check_tick(void);
> >> 
> >> static inline void rcu_irq_enter_check_tick(void)
> >> {
> >>if (context_tracking_enabled())
> >>__rcu_irq_enter_check_tick();
> >> }
> >
> > That certainly is a better approach!
> >
> > So let's go with your version.  But could you please post it, just in
> > case reviewing an alternative version causes me to spot something?
> 
> I'm just testing the complete rework of this series and will post it if
> it survives smoke testing.

Fair enough!

Thanx, Paul
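The guard pattern Thomas quotes above (a cheap inline check that skips an out-of-line slow path) can be modeled in plain userspace C. This is a sketch only: a plain boolean stands in for the kernel's static-key-backed context_tracking_enabled(), and the counter is purely for illustration.

```c
#include <stdbool.h>

/* Stand-in for the boot-time context-tracking switch; in the kernel this
 * is a static key queried by context_tracking_enabled(). */
static bool context_tracking_on;

/* Counts invocations of the slow path, for demonstration only. */
static int slow_path_calls;

/* Out-of-line slow path, analogous to __rcu_irq_enter_check_tick(). */
static void model_rcu_irq_enter_check_tick_slow(void)
{
    slow_path_calls++;
}

/* Cheap inline guard: the slow path is reached only when context
 * tracking was enabled at boot, so !NOHZ_FULL configurations pay
 * nothing but one predictable branch. */
static inline void model_rcu_irq_enter_check_tick(void)
{
    if (context_tracking_on)
        model_rcu_irq_enter_check_tick_slow();
}
```

With the switch off, the wrapper is effectively a no-op, which is the point of the design being discussed.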


Re: [patch V6 12/37] x86/entry: Provide idtentry_entry/exit_cond_rcu()

2020-05-21 Thread Thomas Gleixner
"Paul E. McKenney"  writes:
> On Thu, May 21, 2020 at 10:31:11AM +0200, Thomas Gleixner wrote:
>> And I made this a NOP for !NOHZ_FULL systems and avoided the call if
>> context tracking is not enabled at boot.
>> 
>> void __rcu_irq_enter_check_tick(void);
>> 
>> static inline void rcu_irq_enter_check_tick(void)
>> {
>>  if (context_tracking_enabled())
>>  __rcu_irq_enter_check_tick();
>> }
>
> That certainly is a better approach!
>
> So let's go with your version.  But could you please post it, just in
> case reviewing an alternative version causes me to spot something?

I'm just testing the complete rework of this series and will post it if
it survives smoke testing.

Thanks,

tglx


Re: [patch V6 12/37] x86/entry: Provide idtentry_entry/exit_cond_rcu()

2020-05-21 Thread Paul E. McKenney
On Thu, May 21, 2020 at 10:31:11AM +0200, Thomas Gleixner wrote:
> "Paul E. McKenney"  writes:
> > On Wed, May 20, 2020 at 03:15:31PM -0700, Paul E. McKenney wrote:
> > Same patch, but with updated commit log based on IRC discussion
> > with Andy.
> 
> Fun. I came up with the same thing before going to bed. Just that I
> named the function differently: rcu_irq_enter_check_tick()

I am good with that name.

> >  #if defined(CONFIG_TINY_RCU)
> >  
> > +static inline void tickle_nohz_for_rcu(void)
> > +{
> > +}
> > +
> >  static inline void rcu_nmi_enter(void)
> >  {
> >  }
> > @@ -23,6 +27,7 @@ static inline void rcu_nmi_exit(void)
> >  }
> >  
> >  #else
> > +extern void tickle_nohz_for_rcu(void);
> 
> And I made this a NOP for !NOHZ_FULL systems and avoided the call if
> context tracking is not enabled at boot.
> 
> void __rcu_irq_enter_check_tick(void);
> 
> static inline void rcu_irq_enter_check_tick(void)
> {
>   if (context_tracking_enabled())
>   __rcu_irq_enter_check_tick();
> }

That certainly is a better approach!

So let's go with your version.  But could you please post it, just in
case reviewing an alternative version causes me to spot something?

Thanx, Paul


Re: [patch V6 12/37] x86/entry: Provide idtentry_entry/exit_cond_rcu()

2020-05-21 Thread Thomas Gleixner
"Paul E. McKenney"  writes:
> On Wed, May 20, 2020 at 03:15:31PM -0700, Paul E. McKenney wrote:
> Same patch, but with updated commit log based on IRC discussion
> with Andy.

Fun. I came up with the same thing before going to bed. Just that I
named the function differently: rcu_irq_enter_check_tick()

>  #if defined(CONFIG_TINY_RCU)
>  
> +static inline void tickle_nohz_for_rcu(void)
> +{
> +}
> +
>  static inline void rcu_nmi_enter(void)
>  {
>  }
> @@ -23,6 +27,7 @@ static inline void rcu_nmi_exit(void)
>  }
>  
>  #else
> +extern void tickle_nohz_for_rcu(void);

And I made this a NOP for !NOHZ_FULL systems and avoided the call if
context tracking is not enabled at boot.

void __rcu_irq_enter_check_tick(void);

static inline void rcu_irq_enter_check_tick(void)
{
if (context_tracking_enabled())
__rcu_irq_enter_check_tick();
}

Thanks,

tglx



Re: [patch V6 12/37] x86/entry: Provide idtentry_entry/exit_cond_rcu()

2020-05-20 Thread Paul E. McKenney
On Wed, May 20, 2020 at 03:15:31PM -0700, Paul E. McKenney wrote:
> On Wed, May 20, 2020 at 09:49:18PM +0200, Thomas Gleixner wrote:
> > "Paul E. McKenney"  writes:
> > > On Wed, May 20, 2020 at 09:51:17AM -0700, Andy Lutomirski wrote:
> > >> Paul, the major change here is that if an IRQ hits normal kernel code
> > >> (i.e. code where RCU is watching and we're not in an EQS), the IRQ
> > >> won't call rcu_irq_enter() and rcu_irq_exit().  Instead it will call
> > >> rcu_tickle() on entry and nothing on exit.  Does that cover all the
> > >> bases?
> > >
> > > From an RCU viewpoint, yes, give or take my concerns about someone
> > > putting rcu_tickle() on entry and rcu_irq_exit() on exit.  Perhaps
> > > I can bring some lockdep trickery to bear.
> > 
> > A surplus rcu_irq_exit() should already trigger alarms today.
> 
> Fair point!
> 
> > > But I must defer to Thomas and Peter on the non-RCU/non-NO_HZ_FULL
> > > portions of this.
> > 
> > I don't see a problem. Let me write that into actual testable patch
> > form.
> 
> Here is the RCU part, with my current best guess for the commit log.
> 
> Please note that this is on top of my -rcu stack, so some adjustment
> will likely be needed to pull it underneath Joel's series that removes
> the special-purpose bits at the bottom of the ->dynticks counter.
> 
> But a starting point, anyway.

Same patch, but with updated commit log based on IRC discussion
with Andy.

Thanx, Paul



commit 1771ea9fac5748d1424d9214c51b2f79cc1176b6
Author: Paul E. McKenney 
Date:   Wed May 20 15:03:07 2020 -0700

rcu: Abstract out tickle_nohz_for_rcu() from rcu_nmi_enter()

There will likely be exception handlers that can sleep, which rules
out the usual approach of invoking rcu_nmi_enter() on entry and also
rcu_nmi_exit() on all exit paths.  However, the alternative approach of
just not calling anything can prevent RCU from coaxing quiescent states
from nohz_full CPUs that are looping in the kernel:  RCU must instead
IPI them explicitly.  It would be better to enable the scheduler tick
on such CPUs to interact with RCU in a lighter-weight manner, and this
enabling is one of the things that rcu_nmi_enter() currently does.

What is needed is something that helps RCU coax quiescent states while
not preventing subsequent sleeps.  This commit therefore splits out the
nohz_full scheduler-tick enabling from the rest of the rcu_nmi_enter()
logic into a new function named tickle_nohz_for_rcu().

Suggested-by: Andy Lutomirski 
Signed-off-by: Paul E. McKenney 

diff --git a/include/linux/hardirq.h b/include/linux/hardirq.h
index 621556e..d4be42a 100644
--- a/include/linux/hardirq.h
+++ b/include/linux/hardirq.h
@@ -14,6 +14,10 @@ extern bool synchronize_hardirq(unsigned int irq);
 
 #if defined(CONFIG_TINY_RCU)
 
+static inline void tickle_nohz_for_rcu(void)
+{
+}
+
 static inline void rcu_nmi_enter(void)
 {
 }
@@ -23,6 +27,7 @@ static inline void rcu_nmi_exit(void)
 }
 
 #else
+extern void tickle_nohz_for_rcu(void);
 extern void rcu_nmi_enter(void);
 extern void rcu_nmi_exit(void);
 #endif
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 7812574..0a3cad4 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -806,6 +806,67 @@ void noinstr rcu_user_exit(void)
 #endif /* CONFIG_NO_HZ_FULL */
 
 /**
+ * tickle_nohz_for_rcu - Enable scheduler tick on CPU if RCU needs it.
+ *
+ * The scheduler tick is not normally enabled when CPUs enter the kernel
+ * from nohz_full userspace execution.  After all, nohz_full userspace
+ * execution is an RCU quiescent state and the time executing in the kernel
+ * is quite short.  Except of course when it isn't.  And it is not hard to
+ * cause a large system to spend tens of seconds or even minutes looping
> > + * in the kernel, which can cause a number of problems, including RCU CPU
+ * stall warnings.
+ *
+ * Therefore, if a nohz_full CPU fails to report a quiescent state
+ * in a timely manner, the RCU grace-period kthread sets that CPU's
+ * ->rcu_urgent_qs flag with the expectation that the next interrupt or
+ * exception will invoke this function, which will turn on the scheduler
+ * tick, which will enable RCU to detect that CPU's quiescent states,
+ * for example, due to cond_resched() calls in CONFIG_PREEMPT=n kernels.
+ * The tick will be disabled once a quiescent state is reported for
+ * this CPU.
+ *
+ * Of course, in carefully tuned systems, there might never be an
+ * interrupt or exception.  In that case, the RCU grace-period kthread
+ * will eventually cause one to happen.  However, in less carefully
+ * controlled environments, this function allows RCU to get what it
+ * needs without creating otherwise useless interruptions.
+ */
+noinstr void tickle_nohz_for_rcu(void)
+{
> > +   struct rcu_data *rdp = this_cpu_ptr(&rcu_data);
+
> > +   if (in_nmi())
> > +   return; // Enabling tick is unsafe in NMI handlers.
> > +   RCU_LOCKDEP_WARN(rcu_dynticks_curr_cpu_in_eqs(),
> > +"Illegal tickle_nohz_for_rcu from extended quiescent state");
> > +   instrumentation_begin();
> > +   if (!tick_nohz_full_cpu(rdp->cpu) ||
> > +   !READ_ONCE(rdp->rcu_urgent_qs) ||
> > +   READ_ONCE(rdp->rcu_forced_tick)) {
> > +   // RCU doesn't need nohz_full help from this CPU, or it is
> > +   // already getting that help.
> > +   instrumentation_end();
> > +   return;
> > +   }

Re: [patch V6 12/37] x86/entry: Provide idtentry_entry/exit_cond_rcu()

2020-05-20 Thread Paul E. McKenney
On Wed, May 20, 2020 at 09:49:18PM +0200, Thomas Gleixner wrote:
> "Paul E. McKenney"  writes:
> > On Wed, May 20, 2020 at 09:51:17AM -0700, Andy Lutomirski wrote:
> >> Paul, the major change here is that if an IRQ hits normal kernel code
> >> (i.e. code where RCU is watching and we're not in an EQS), the IRQ
> >> won't call rcu_irq_enter() and rcu_irq_exit().  Instead it will call
> >> rcu_tickle() on entry and nothing on exit.  Does that cover all the
> >> bases?
> >
> > From an RCU viewpoint, yes, give or take my concerns about someone
> > putting rcu_tickle() on entry and rcu_irq_exit() on exit.  Perhaps
> > I can bring some lockdep trickery to bear.
> 
> A surplus rcu_irq_exit() should already trigger alarms today.

Fair point!

> > But I must defer to Thomas and Peter on the non-RCU/non-NO_HZ_FULL
> > portions of this.
> 
> I don't see a problem. Let me write that into actual testable patch
> form.

Here is the RCU part, with my current best guess for the commit log.

Please note that this is on top of my -rcu stack, so some adjustment
will likely be needed to pull it underneath Joel's series that removes
the special-purpose bits at the bottom of the ->dynticks counter.

But a starting point, anyway.

Thanx, Paul



commit ca05838a9a1809fafee63f488a7be8b30e1c2a6a
Author: Paul E. McKenney 
Date:   Wed May 20 15:03:07 2020 -0700

rcu: Abstract out tickle_nohz_for_rcu() from rcu_nmi_enter()

This commit splits out the nohz_full scheduler-tick enabling from the
rest of the rcu_nmi_enter() logic.  This allows short exception handlers
that interrupt kernel code regions that RCU is already watching to just
invoke tickle_nohz_for_rcu() at exception entry instead of having to
invoke rcu_nmi_enter() on entry and also rcu_nmi_exit() on all exit paths.

Suggested-by: Andy Lutomirski 
Signed-off-by: Paul E. McKenney 

diff --git a/include/linux/hardirq.h b/include/linux/hardirq.h
index 621556e..d4be42a 100644
--- a/include/linux/hardirq.h
+++ b/include/linux/hardirq.h
@@ -14,6 +14,10 @@ extern bool synchronize_hardirq(unsigned int irq);
 
 #if defined(CONFIG_TINY_RCU)
 
+static inline void tickle_nohz_for_rcu(void)
+{
+}
+
 static inline void rcu_nmi_enter(void)
 {
 }
@@ -23,6 +27,7 @@ static inline void rcu_nmi_exit(void)
 }
 
 #else
+extern void tickle_nohz_for_rcu(void);
 extern void rcu_nmi_enter(void);
 extern void rcu_nmi_exit(void);
 #endif
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 7812574..0a3cad4 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -806,6 +806,67 @@ void noinstr rcu_user_exit(void)
 #endif /* CONFIG_NO_HZ_FULL */
 
 /**
+ * tickle_nohz_for_rcu - Enable scheduler tick on CPU if RCU needs it.
+ *
+ * The scheduler tick is not normally enabled when CPUs enter the kernel
+ * from nohz_full userspace execution.  After all, nohz_full userspace
+ * execution is an RCU quiescent state and the time executing in the kernel
+ * is quite short.  Except of course when it isn't.  And it is not hard to
+ * cause a large system to spend tens of seconds or even minutes looping
+ * in the kernel, which can cause a number of problems, including RCU CPU
+ * stall warnings.
+ *
+ * Therefore, if a nohz_full CPU fails to report a quiescent state
+ * in a timely manner, the RCU grace-period kthread sets that CPU's
+ * ->rcu_urgent_qs flag with the expectation that the next interrupt or
+ * exception will invoke this function, which will turn on the scheduler
+ * tick, which will enable RCU to detect that CPU's quiescent states,
+ * for example, due to cond_resched() calls in CONFIG_PREEMPT=n kernels.
+ * The tick will be disabled once a quiescent state is reported for
+ * this CPU.
+ *
+ * Of course, in carefully tuned systems, there might never be an
+ * interrupt or exception.  In that case, the RCU grace-period kthread
+ * will eventually cause one to happen.  However, in less carefully
+ * controlled environments, this function allows RCU to get what it
+ * needs without creating otherwise useless interruptions.
+ */
+noinstr void tickle_nohz_for_rcu(void)
+{
+   struct rcu_data *rdp = this_cpu_ptr(&rcu_data);
+
+   if (in_nmi())
+   return; // Enabling tick is unsafe in NMI handlers.
+   RCU_LOCKDEP_WARN(rcu_dynticks_curr_cpu_in_eqs(),
+"Illegal tickle_nohz_for_rcu from extended quiescent state");
+   instrumentation_begin();
+   if (!tick_nohz_full_cpu(rdp->cpu) ||
+   !READ_ONCE(rdp->rcu_urgent_qs) ||
+   READ_ONCE(rdp->rcu_forced_tick)) {
+   // RCU doesn't need nohz_full help from this CPU, or it is
+   // already getting that help.
+   instrumentation_end();
+   return;
+   }
+
+   // We get here only when not in an extended quiescent state and
+   // from interrupts (as opposed to NMIs).  

Re: [patch V6 12/37] x86/entry: Provide idtentry_entry/exit_cond_rcu()

2020-05-20 Thread Thomas Gleixner
"Paul E. McKenney"  writes:
> On Wed, May 20, 2020 at 09:51:17AM -0700, Andy Lutomirski wrote:
>> Paul, the major change here is that if an IRQ hits normal kernel code
>> (i.e. code where RCU is watching and we're not in an EQS), the IRQ
>> won't call rcu_irq_enter() and rcu_irq_exit().  Instead it will call
>> rcu_tickle() on entry and nothing on exit.  Does that cover all the
>> bases?
>
> From an RCU viewpoint, yes, give or take my concerns about someone
> putting rcu_tickle() on entry and rcu_irq_exit() on exit.  Perhaps
> I can bring some lockdep trickery to bear.

A surplus rcu_irq_exit() should already trigger alarms today.

> But I must defer to Thomas and Peter on the non-RCU/non-NO_HZ_FULL
> portions of this.

I don't see a problem. Let me write that into actual testable patch
form.

Thanks,

tglx


Re: [patch V6 12/37] x86/entry: Provide idtentry_entry/exit_cond_rcu()

2020-05-20 Thread Paul E. McKenney
On Wed, May 20, 2020 at 09:24:46PM +0200, Thomas Gleixner wrote:
> Andy Lutomirski  writes:
> > On Wed, May 20, 2020 at 8:36 AM Andy Lutomirski  wrote:
> > if (user_mode(regs)) {
> > enter_from_user_mode();
> > } else {
> > if (!__rcu_is_watching()) {
> > /*
> >  * If RCU is not watching then the same careful
> >  * sequence vs. lockdep and tracing is required.
> >  *
> >  * This only happens for IRQs that hit the idle loop, and
> >  * even that only happens if we aren't using the sane
> >  * MWAIT-while-IF=0 mode.
> >  */
> > lockdep_hardirqs_off(CALLER_ADDR0);
> > rcu_irq_enter();
> > instrumentation_begin();
> > trace_hardirqs_off_prepare();
> > instrumentation_end();
> > return true;
> > } else {
> > /*
> >  * If RCU is watching then the combo function
> >  * can be used.
> >  */
> > instrumentation_begin();
> > trace_hardirqs_off();
> > rcu_tickle();
> > instrumentation_end();
> > }
> > }
> > return false;
> >
> > This is exactly what you have except that the cond_rcu part is gone
> > and I added rcu_tickle().
> >
> > Paul, the major change here is that if an IRQ hits normal kernel code
> > (i.e. code where RCU is watching and we're not in an EQS), the IRQ
> > won't call rcu_irq_enter() and rcu_irq_exit().  Instead it will call
> > rcu_tickle() on entry and nothing on exit.  Does that cover all the
> > bases?
> 
> Just chatted with Paul on IRC and he thinks this should work, but he's
> not sure whether it's actually sane :)

I will have more to say after coding it up.  ;-)

Thanx, Paul
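The three cases in Andy's sketched entry code (user mode, kernel with RCU not watching, kernel with RCU watching) can be condensed into a pure classification function. This is a hedged userspace model of the control flow only; the enum and function names are invented for illustration and carry none of the lockdep/tracing ordering that the real sequence exists to preserve.

```c
#include <stdbool.h>

/* Which action the sketched idtentry code would take on entry. */
enum entry_action { ENTER_FROM_USER, RCU_IRQ_ENTER, RCU_TICKLE };

/* Returns the value the sketched code returns: true means the exit path
 * must pair this entry with rcu_irq_exit(); false means only the
 * lightweight tickle (or user-mode entry) happened, with nothing to
 * undo on exit. */
static bool classify_entry(bool user_mode, bool rcu_watching,
                           enum entry_action *action)
{
    if (user_mode) {
        *action = ENTER_FROM_USER;
        return false;
    }
    if (!rcu_watching) {
        /* IRQ hit the idle loop: full rcu_irq_enter(), which the
         * caller must balance with rcu_irq_exit() on the way out. */
        *action = RCU_IRQ_ENTER;
        return true;
    }
    /* Normal kernel code, RCU already watching: tickle only. */
    *action = RCU_TICKLE;
    return false;
}
```

The asymmetry Paul is being asked to bless is visible in the return value: only the idle-loop case creates an obligation on the exit path.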


Re: [patch V6 12/37] x86/entry: Provide idtentry_entry/exit_cond_rcu()

2020-05-20 Thread Thomas Gleixner
Andy Lutomirski  writes:
> On Wed, May 20, 2020 at 8:36 AM Andy Lutomirski  wrote:
> if (user_mode(regs)) {
> enter_from_user_mode();
> } else {
> if (!__rcu_is_watching()) {
> /*
>  * If RCU is not watching then the same careful
>  * sequence vs. lockdep and tracing is required.
>  *
>  * This only happens for IRQs that hit the idle loop, and
>  * even that only happens if we aren't using the sane
>  * MWAIT-while-IF=0 mode.
>  */
> lockdep_hardirqs_off(CALLER_ADDR0);
> rcu_irq_enter();
> instrumentation_begin();
> trace_hardirqs_off_prepare();
> instrumentation_end();
> return true;
> } else {
> /*
>  * If RCU is watching then the combo function
>  * can be used.
>  */
> instrumentation_begin();
> trace_hardirqs_off();
> rcu_tickle();
> instrumentation_end();
> }
> }
> return false;
>
> This is exactly what you have except that the cond_rcu part is gone
> and I added rcu_tickle().
>
> Paul, the major change here is that if an IRQ hits normal kernel code
> (i.e. code where RCU is watching and we're not in an EQS), the IRQ
> won't call rcu_irq_enter() and rcu_irq_exit().  Instead it will call
> rcu_tickle() on entry and nothing on exit.  Does that cover all the
> bases?

Just chatted with Paul on IRC and he thinks this should work, but he's
not sure whether it's actually sane :)

Thanks,

tglx


Re: [patch V6 12/37] x86/entry: Provide idtentry_entry/exit_cond_rcu()

2020-05-20 Thread Thomas Gleixner
Andy Lutomirski  writes:
> So maybe the code can change to:
>
> if (user_mode(regs)) {
> enter_from_user_mode();
> } else {
> if (!__rcu_is_watching()) {
> /*
>  * If RCU is not watching then the same careful
>  * sequence vs. lockdep and tracing is required.
>  *
>  * This only happens for IRQs that hit the idle loop, and
>  * even that only happens if we aren't using the sane
>  * MWAIT-while-IF=0 mode.
>  */
> lockdep_hardirqs_off(CALLER_ADDR0);
> rcu_irq_enter();
> instrumentation_begin();
> trace_hardirqs_off_prepare();
> instrumentation_end();
> return true;
> } else {
> /*
>  * If RCU is watching then the combo function
>  * can be used.
>  */
> instrumentation_begin();
> trace_hardirqs_off();
> rcu_tickle();
> instrumentation_end();
> }
> }
> return false;
>
> This is exactly what you have except that the cond_rcu part is gone
> and I added rcu_tickle().
>
> Paul, the major change here is that if an IRQ hits normal kernel code
> (i.e. code where RCU is watching and we're not in an EQS), the IRQ
> won't call rcu_irq_enter() and rcu_irq_exit().  Instead it will call
> rcu_tickle() on entry and nothing on exit.  Does that cover all the
> bases?

Fine with me, but the final vote needs to come from Paul and Joel.

Thanks,

tglx


Re: [patch V6 12/37] x86/entry: Provide idtentry_entry/exit_cond_rcu()

2020-05-20 Thread Paul E. McKenney
On Wed, May 20, 2020 at 10:47:29AM -0700, Andy Lutomirski wrote:
> On Wed, May 20, 2020 at 10:38 AM Paul E. McKenney  wrote:
> >
> > On Wed, May 20, 2020 at 08:36:06AM -0700, Andy Lutomirski wrote:
> > > On Tue, May 19, 2020 at 7:23 PM Paul E. McKenney  
> > > wrote:
> > > > On Tue, May 19, 2020 at 05:26:58PM -0700, Andy Lutomirski wrote:
> > > > > On Tue, May 19, 2020 at 2:20 PM Thomas Gleixner  
> > > > > wrote:
> > > > > > Andy Lutomirski  writes:
> > > > > > > On Tue, May 19, 2020 at 1:20 PM Thomas Gleixner 
> > > > > > >  wrote:
> > > > > > >> Thomas Gleixner  writes:
> > > > > > >> It's about this:
> > > > > > >>
> > > > > > >> rcu_nmi_enter()
> > > > > > >> {
> > > > > > >> if (!rcu_is_watching()) {
> > > > > > >> make it watch;
> > > > > > >> } else if (!in_nmi()) {
> > > > > > >> do_magic_nohz_dyntick_muck();
> > > > > > >> }
> > > > > > >>
> > > > > > >> So if we do all irq/system vector entries conditional then the
> > > > > > >> do_magic() gets never executed. After that I got lost...
> > > > > > >
> > > > > > > I'm also baffled by that magic, but I'm also not suggesting doing 
> > > > > > > this
> > > > > > > to *all* entries -- just the not-super-magic ones that use
> > > > > > > idtentry_enter().
> > > > > > >
> > > > > > > Paul, what is this code actually trying to do?
> > > > > >
> > > > > > Citing Paul from IRC:
> > > > > >
> > > > > >   "The way things are right now, you can leave out the 
> > > > > > rcu_irq_enter()
> > > > > >if this is not a nohz_full CPU.
> > > > > >
> > > > > >Or if this is a nohz_full CPU, and the tick is already
> > > > > >enabled, in that case you could also leave out the 
> > > > > > rcu_irq_enter().
> > > > > >
> > > > > >Or even if this is a nohz_full CPU and it does not have the tick
> > > > > >enabled, if it has been in the kernel less than a few tens of
> > > > > >milliseconds, still OK to avoid invoking rcu_irq_enter()
> > > > > >
> > > > > >But my guess is that it would be a lot simpler to just always 
> > > > > > call
> > > > > >it.
> > > > > >
> > > > > > Hope that helps.
> > > > >
> > > > > Maybe?
> > > > >
> > > > > Unless I've missed something, the effect here is that #PF hitting in
> > > > > an RCU-watching context will skip rcu_irq_enter(), whereas all IRQs
> > > > > (because you converted them) as well as other faults and traps will
> > > > > call rcu_irq_enter().
> > > > >
> > > > > Once upon a time, we did this horrible thing where, on entry from user
> > > > > mode, we would turn on interrupts while still in CONTEXT_USER, which
> > > > > means we could get an IRQ in an extended quiescent state.  This means
> > > > > that the IRQ code had to end the EQS so that IRQ handlers could use
> > > > > RCU.  But I killed this a few years ago -- x86 Linux now has a rule
> > > > > that, if IF=1, we are *not* in an EQS with the sole exception of the
> > > > > idle code.
> > > > >
> > > > > In my dream world, we would never ever get IRQs while in an EQS -- we
> > > > > would do MWAIT with IF=0 and we would exit the EQS before taking the
> > > > > interrupt.  But I guess we still need to support HLT, which means we
> > > > > have this mess.
> > > > >
> > > > > But I still think we can plausibly get rid of the conditional.
> > > >
> > > > You mean the conditional in rcu_nmi_enter()?  In a NO_HZ_FULL=n system,
> > > > this becomes:
> > >
> > > So, I meant the conditional in tglx's patch that makes page faults 
> > > special.
> >
> > OK.
> >
> > > > > If we
> > > > > get an IRQ or (egads!) a fault in idle context, we'll have
> > > > > !__rcu_is_watching(), but, AFAICT, we also have preemption off.
> > > >
> > > > Or we could be early in the kernel-entry code or late in the kernel-exit
> > > > code, but as far as I know, preemption is disabled on those code paths.
> > > > As are interrupts, right?  And interrupts are disabled on the portions
> > > > of the CPU-hotplug code where RCU is not watching, if I recall 
> > > > correctly.
> > >
> > > Interrupts are off in the parts of the entry/exit that RCU considers
> > > to be user mode.  We can get various faults, although these should be
> > > either NMI-like or events that genuinely or effectively happened in
> > > user mode.
> >
> > Fair enough!
> >
> > > > A nohz_full CPU does not enable the scheduling-clock interrupt upon
> > > > entry to the kernel.  Normally, this is fine because that CPU will very
> > > > quickly exit back to nohz_full userspace execution, so that RCU will
> > > > see the quiescent state, either by sampling it directly or by deducing
> > > > the CPU's passage through that quiescent state by comparing with state
> > > > that was captured earlier.  The grace-period kthread notices the lack
> > > > of a quiescent state and will eventually set ->rcu_urgent_qs to
> > > > trigger this code.
> > > >
> > > > But if the nohz_full CPU stays in the kernel for 

Re: [patch V6 12/37] x86/entry: Provide idtentry_entry/exit_cond_rcu()

2020-05-20 Thread Paul E. McKenney
On Wed, May 20, 2020 at 09:51:17AM -0700, Andy Lutomirski wrote:
> On Wed, May 20, 2020 at 8:36 AM Andy Lutomirski  wrote:
> >
> > On Tue, May 19, 2020 at 7:23 PM Paul E. McKenney  wrote:
> > >
> > > On Tue, May 19, 2020 at 05:26:58PM -0700, Andy Lutomirski wrote:
> > > > On Tue, May 19, 2020 at 2:20 PM Thomas Gleixner  
> > > > wrote:
> 
> First, the patch as you submitted it is Acked-by: Andy Lutomirski
> .  I think there are cleanups that should happen, but
> I think the patch is correct.
> 
> About cleanups, concretely:  I think that everything that calls
> __idtenter_entry() is called in one of a small number of relatively
> sane states:
> 
> 1. User mode.  This is easy.
> 
> 2. Kernel, RCU is watching, everything is sane.  We don't actually
> need to do any RCU entry/exit pairs -- we should be okay with just a
> hypothetical RCU tickle (and IRQ tracing, etc).  This variant can
> sleep after the entry part finishes if regs->flags & IF and no one
> turned off preemption.
> 
> 3. Kernel, RCU is not watching, system was idle.  This can only be an
> actual interrupt.
> 
> So maybe the code can change to:
> 
> if (user_mode(regs)) {
> enter_from_user_mode();
> } else {
> if (!__rcu_is_watching()) {
> /*
>  * If RCU is not watching then the same careful
>  * sequence vs. lockdep and tracing is required.
>  *
>  * This only happens for IRQs that hit the idle loop, and
>  * even that only happens if we aren't using the sane
>  * MWAIT-while-IF=0 mode.
>  */
> lockdep_hardirqs_off(CALLER_ADDR0);
> rcu_irq_enter();
> instrumentation_begin();
> trace_hardirqs_off_prepare();
> instrumentation_end();
> return true;
> } else {
> /*
>  * If RCU is watching then the combo function
>  * can be used.
>  */
> instrumentation_begin();
> trace_hardirqs_off();
> rcu_tickle();
> instrumentation_end();
> }
> }
> return false;
> 
> This is exactly what you have except that the cond_rcu part is gone
> and I added rcu_tickle().
> 
> Paul, the major change here is that if an IRQ hits normal kernel code
> (i.e. code where RCU is watching and we're not in an EQS), the IRQ
> won't call rcu_irq_enter() and rcu_irq_exit().  Instead it will call
> rcu_tickle() on entry and nothing on exit.  Does that cover all the
> bases?

>From an RCU viewpoint, yes, give or take my concerns about someone
putting rcu_tickle() on entry and rcu_irq_exit() on exit.  Perhaps
I can bring some lockdep trickery to bear.

But I must defer to Thomas and Peter on the non-RCU/non-NO_HZ_FULL
portions of this.

Thanx, Paul
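Paul's worry above (someone pairing the tickle on entry with rcu_irq_exit() on exit) and Thomas's reply that a surplus rcu_irq_exit() already trips alarms both come down to nesting-count bookkeeping. A minimal userspace sketch, with invented model_ names and a counter standing in for the kernel's dynticks nesting state and WARN_ON_ONCE():

```c
/* Nesting depth of modeled rcu_irq_enter()/rcu_irq_exit() pairs. */
static int irq_nesting;

/* Counts tripped alarms; in the kernel a WARN_ON_ONCE() fires instead. */
static int warnings;

static void model_rcu_irq_enter(void)
{
    irq_nesting++;
}

static void model_rcu_irq_exit(void)
{
    if (irq_nesting <= 0) {
        /* Surplus exit: no matching enter, so the alarm trips. */
        warnings++;
        return;
    }
    irq_nesting--;
}
```

An entry path that only tickles never increments the count, so a mistaken rcu_irq_exit() on the corresponding exit path is exactly the surplus-exit case and is caught.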


Re: [patch V6 12/37] x86/entry: Provide idtentry_entry/exit_cond_rcu()

2020-05-20 Thread Andy Lutomirski
On Wed, May 20, 2020 at 10:38 AM Paul E. McKenney  wrote:
>
> On Wed, May 20, 2020 at 08:36:06AM -0700, Andy Lutomirski wrote:
> > On Tue, May 19, 2020 at 7:23 PM Paul E. McKenney  wrote:
> > > On Tue, May 19, 2020 at 05:26:58PM -0700, Andy Lutomirski wrote:
> > > > On Tue, May 19, 2020 at 2:20 PM Thomas Gleixner  
> > > > wrote:
> > > > > Andy Lutomirski  writes:
> > > > > > On Tue, May 19, 2020 at 1:20 PM Thomas Gleixner 
> > > > > >  wrote:
> > > > > >> Thomas Gleixner  writes:
> > > > > >> It's about this:
> > > > > >>
> > > > > >> rcu_nmi_enter()
> > > > > >> {
> > > > > >> if (!rcu_is_watching()) {
> > > > > >> make it watch;
> > > > > >> } else if (!in_nmi()) {
> > > > > >> do_magic_nohz_dyntick_muck();
> > > > > >> }
> > > > > >>
> > > > > >> So if we do all irq/system vector entries conditional then the
> > > > > >> do_magic() gets never executed. After that I got lost...
> > > > > >
> > > > > > I'm also baffled by that magic, but I'm also not suggesting doing 
> > > > > > this
> > > > > > to *all* entries -- just the not-super-magic ones that use
> > > > > > idtentry_enter().
> > > > > >
> > > > > > Paul, what is this code actually trying to do?
> > > > >
> > > > > Citing Paul from IRC:
> > > > >
> > > > >   "The way things are right now, you can leave out the rcu_irq_enter()
> > > > >if this is not a nohz_full CPU.
> > > > >
> > > > >Or if this is a nohz_full CPU, and the tick is already
> > > > >enabled, in that case you could also leave out the rcu_irq_enter().
> > > > >
> > > > >Or even if this is a nohz_full CPU and it does not have the tick
> > > > >enabled, if it has been in the kernel less than a few tens of
> > > > >milliseconds, still OK to avoid invoking rcu_irq_enter()
> > > > >
> > > > >But my guess is that it would be a lot simpler to just always call
> > > > >it.
> > > > >
> > > > > Hope that helps.
> > > >
> > > > Maybe?
> > > >
> > > > Unless I've missed something, the effect here is that #PF hitting in
> > > > an RCU-watching context will skip rcu_irq_enter(), whereas all IRQs
> > > > (because you converted them) as well as other faults and traps will
> > > > call rcu_irq_enter().
> > > >
> > > > Once upon a time, we did this horrible thing where, on entry from user
> > > > mode, we would turn on interrupts while still in CONTEXT_USER, which
> > > > means we could get an IRQ in an extended quiescent state.  This means
> > > > that the IRQ code had to end the EQS so that IRQ handlers could use
> > > > RCU.  But I killed this a few years ago -- x86 Linux now has a rule
> > > > that, if IF=1, we are *not* in an EQS with the sole exception of the
> > > > idle code.
> > > >
> > > > In my dream world, we would never ever get IRQs while in an EQS -- we
> > > > would do MWAIT with IF=0 and we would exit the EQS before taking the
> > > > interrupt.  But I guess we still need to support HLT, which means we
> > > > have this mess.
> > > >
> > > > But I still think we can plausibly get rid of the conditional.
> > >
> > > You mean the conditional in rcu_nmi_enter()?  In a NO_HZ_FULL=n system,
> > > this becomes:
> >
> > So, I meant the conditional in tglx's patch that makes page faults special.
>
> OK.
>
> > > > If we
> > > > get an IRQ or (egads!) a fault in idle context, we'll have
> > > > !__rcu_is_watching(), but, AFAICT, we also have preemption off.
> > >
> > > Or we could be early in the kernel-entry code or late in the kernel-exit
> > > code, but as far as I know, preemption is disabled on those code paths.
> > > As are interrupts, right?  And interrupts are disabled on the portions
> > > of the CPU-hotplug code where RCU is not watching, if I recall correctly.
> >
> > Interrupts are off in the parts of the entry/exit that RCU considers
> > to be user mode.  We can get various faults, although these should be
> > either NMI-like or events that genuinely or effectively happened in
> > user mode.
>
> Fair enough!
>
> > > A nohz_full CPU does not enable the scheduling-clock interrupt upon
> > > entry to the kernel.  Normally, this is fine because that CPU will very
> > > quickly exit back to nohz_full userspace execution, so that RCU will
> > > see the quiescent state, either by sampling it directly or by deducing
> > > the CPU's passage through that quiescent state by comparing with state
> > > that was captured earlier.  The grace-period kthread notices the lack
> > > of a quiescent state and will eventually set ->rcu_urgent_qs to
> > > trigger this code.
> > >
> > > But if the nohz_full CPU stays in the kernel for an extended time,
> > > perhaps due to OOM handling or due to processing of some huge I/O that
> > > hits in-memory buffers/cache, then RCU needs some way of detecting
> > > quiescent states on that CPU.  This requires the scheduling-clock
> > > interrupt to be alive and well.
> > >
> > > Are there other ways to get this done?  But 

Re: [patch V6 12/37] x86/entry: Provide idtentry_entry/exit_cond_rcu()

2020-05-20 Thread Paul E. McKenney
On Wed, May 20, 2020 at 08:36:06AM -0700, Andy Lutomirski wrote:
> On Tue, May 19, 2020 at 7:23 PM Paul E. McKenney  wrote:
> > On Tue, May 19, 2020 at 05:26:58PM -0700, Andy Lutomirski wrote:
> > > On Tue, May 19, 2020 at 2:20 PM Thomas Gleixner  
> > > wrote:
> > > > Andy Lutomirski  writes:
> > > > > On Tue, May 19, 2020 at 1:20 PM Thomas Gleixner  
> > > > > wrote:
> > > > >> Thomas Gleixner  writes:
> > > > >> It's about this:
> > > > >>
> > > > >> rcu_nmi_enter()
> > > > >> {
> > > > >> if (!rcu_is_watching()) {
> > > > >> make it watch;
> > > > >> } else if (!in_nmi()) {
> > > > >> do_magic_nohz_dyntick_muck();
> > > > >> }
> > > > >>
> > > > >> So if we do all irq/system vector entries conditional then the
> > > > >> do_magic() gets never executed. After that I got lost...
> > > > >
> > > > > I'm also baffled by that magic, but I'm also not suggesting doing this
> > > > > to *all* entries -- just the not-super-magic ones that use
> > > > > idtentry_enter().
> > > > >
> > > > > Paul, what is this code actually trying to do?
> > > >
> > > > Citing Paul from IRC:
> > > >
> > > >   "The way things are right now, you can leave out the rcu_irq_enter()
> > > >if this is not a nohz_full CPU.
> > > >
> > > >Or if this is a nohz_full CPU, and the tick is already
> > > >enabled, in that case you could also leave out the rcu_irq_enter().
> > > >
> > > >Or even if this is a nohz_full CPU and it does not have the tick
> > > >enabled, if it has been in the kernel less than a few tens of
> > > >milliseconds, still OK to avoid invoking rcu_irq_enter()
> > > >
> > > >But my guess is that it would be a lot simpler to just always call
> > > >it.
> > > >
> > > > Hope that helps.
> > >
> > > Maybe?
> > >
> > > Unless I've missed something, the effect here is that #PF hitting in
> > > an RCU-watching context will skip rcu_irq_enter(), whereas all IRQs
> > > (because you converted them) as well as other faults and traps will
> > > call rcu_irq_enter().
> > >
> > > Once upon a time, we did this horrible thing where, on entry from user
> > > mode, we would turn on interrupts while still in CONTEXT_USER, which
> > > means we could get an IRQ in an extended quiescent state.  This means
> > > that the IRQ code had to end the EQS so that IRQ handlers could use
> > > RCU.  But I killed this a few years ago -- x86 Linux now has a rule
> > > that, if IF=1, we are *not* in an EQS with the sole exception of the
> > > idle code.
> > >
> > > In my dream world, we would never ever get IRQs while in an EQS -- we
> > > would do MWAIT with IF=0 and we would exit the EQS before taking the
> > > interrupt.  But I guess we still need to support HLT, which means we
> > > have this mess.
> > >
> > > But I still think we can plausibly get rid of the conditional.
> >
> > You mean the conditional in rcu_nmi_enter()?  In a NO_HZ_FULL=n system,
> > this becomes:
> 
> So, I meant the conditional in tglx's patch that makes page faults special.

OK.

> > > If we
> > > get an IRQ or (egads!) a fault in idle context, we'll have
> > > !__rcu_is_watching(), but, AFAICT, we also have preemption off.
> >
> > Or we could be early in the kernel-entry code or late in the kernel-exit
> > code, but as far as I know, preemption is disabled on those code paths.
> > As are interrupts, right?  And interrupts are disabled on the portions
> > of the CPU-hotplug code where RCU is not watching, if I recall correctly.
> 
> Interrupts are off in the parts of the entry/exit that RCU considers
> to be user mode.  We can get various faults, although these should be
> either NMI-like or events that genuinely or effectively happened in
> user mode.

Fair enough!

> > A nohz_full CPU does not enable the scheduling-clock interrupt upon
> > entry to the kernel.  Normally, this is fine because that CPU will very
> > quickly exit back to nohz_full userspace execution, so that RCU will
> > see the quiescent state, either by sampling it directly or by deducing
> > the CPU's passage through that quiescent state by comparing with state
> > that was captured earlier.  The grace-period kthread notices the lack
> > of a quiescent state and will eventually set ->rcu_urgent_qs to
> > trigger this code.
> >
> > But if the nohz_full CPU stays in the kernel for an extended time,
> > perhaps due to OOM handling or due to processing of some huge I/O that
> > hits in-memory buffers/cache, then RCU needs some way of detecting
> > quiescent states on that CPU.  This requires the scheduling-clock
> > interrupt to be alive and well.
> >
> > Are there other ways to get this done?  But of course!  RCU could
> > for example use smp_call_function_single() or use workqueues to force
> > execution onto that CPU and enable the tick that way.  This gets a
> > little involved in order to avoid deadlock, but if the added check
> > in rcu_nmi_enter() is causing 

Re: [patch V6 12/37] x86/entry: Provide idtentry_entry/exit_cond_rcu()

2020-05-20 Thread Andy Lutomirski
On Wed, May 20, 2020 at 8:36 AM Andy Lutomirski  wrote:
>
> On Tue, May 19, 2020 at 7:23 PM Paul E. McKenney  wrote:
> >
> > On Tue, May 19, 2020 at 05:26:58PM -0700, Andy Lutomirski wrote:
> > > On Tue, May 19, 2020 at 2:20 PM Thomas Gleixner  
> > > wrote:

First, the patch as you submitted it is Acked-by: Andy Lutomirski.
I think there are cleanups that should happen, but I think the patch
is correct.

About cleanups, concretely:  I think that everything that calls
idtentry_enter() is called in one of a small number of relatively
sane states:

1. User mode.  This is easy.

2. Kernel, RCU is watching, everything is sane.  We don't actually
need to do any RCU entry/exit pairs -- we should be okay with just a
hypothetical RCU tickle (and IRQ tracing, etc).  This variant can
sleep after the entry part finishes if regs->flags & IF and no one
turned off preemption.

3. Kernel, RCU is not watching, system was idle.  This can only be an
actual interrupt.

So maybe the code can change to:

if (user_mode(regs)) {
enter_from_user_mode();
} else {
if (!__rcu_is_watching()) {
/*
 * If RCU is not watching then the same careful
 * sequence vs. lockdep and tracing is required.
 *
 * This only happens for IRQs that hit the idle loop, and
 * even that only happens if we aren't using the sane
 * MWAIT-while-IF=0 mode.
 */
lockdep_hardirqs_off(CALLER_ADDR0);
rcu_irq_enter();
instrumentation_begin();
trace_hardirqs_off_prepare();
instrumentation_end();
return true;
} else {
/*
 * If RCU is watching then the combo function
 * can be used.
 */
instrumentation_begin();
trace_hardirqs_off();
rcu_tickle();
instrumentation_end();
}
}
return false;

This is exactly what you have except that the cond_rcu part is gone
and I added rcu_tickle().

Paul, the major change here is that if an IRQ hits normal kernel code
(i.e. code where RCU is watching and we're not in an EQS), the IRQ
won't call rcu_irq_enter() and rcu_irq_exit().  Instead it will call
rcu_tickle() on entry and nothing on exit.  Does that cover all the
bases?


Re: [patch V6 12/37] x86/entry: Provide idtentry_entry/exit_cond_rcu()

2020-05-20 Thread Andy Lutomirski
On Tue, May 19, 2020 at 7:23 PM Paul E. McKenney  wrote:
>
> On Tue, May 19, 2020 at 05:26:58PM -0700, Andy Lutomirski wrote:
> > On Tue, May 19, 2020 at 2:20 PM Thomas Gleixner  wrote:
> > >
> > > Andy Lutomirski  writes:
> > > > On Tue, May 19, 2020 at 1:20 PM Thomas Gleixner  
> > > > wrote:
> > > >> Thomas Gleixner  writes:
> > > >> It's about this:
> > > >>
> > > >> rcu_nmi_enter()
> > > >> {
> > > >> if (!rcu_is_watching()) {
> > > >> make it watch;
> > > >> } else if (!in_nmi()) {
> > > >> do_magic_nohz_dyntick_muck();
> > > >> }
> > > >>
> > > >> So if we do all irq/system vector entries conditional then the
> > > >> do_magic() gets never executed. After that I got lost...
> > > >
> > > > I'm also baffled by that magic, but I'm also not suggesting doing this
> > > > to *all* entries -- just the not-super-magic ones that use
> > > > idtentry_enter().
> > > >
> > > > Paul, what is this code actually trying to do?
> > >
> > > Citing Paul from IRC:
> > >
> > >   "The way things are right now, you can leave out the rcu_irq_enter()
> > >if this is not a nohz_full CPU.
> > >
> > >Or if this is a nohz_full CPU, and the tick is already
> > >enabled, in that case you could also leave out the rcu_irq_enter().
> > >
> > >Or even if this is a nohz_full CPU and it does not have the tick
> > >enabled, if it has been in the kernel less than a few tens of
> > >milliseconds, still OK to avoid invoking rcu_irq_enter()
> > >
> > >But my guess is that it would be a lot simpler to just always call
> > >it.
> > >
> > > Hope that helps.
> >
> > Maybe?
> >
> > Unless I've missed something, the effect here is that #PF hitting in
> > an RCU-watching context will skip rcu_irq_enter(), whereas all IRQs
> > (because you converted them) as well as other faults and traps will
> > call rcu_irq_enter().
> >
> > Once upon a time, we did this horrible thing where, on entry from user
> > mode, we would turn on interrupts while still in CONTEXT_USER, which
> > means we could get an IRQ in an extended quiescent state.  This means
> > that the IRQ code had to end the EQS so that IRQ handlers could use
> > RCU.  But I killed this a few years ago -- x86 Linux now has a rule
> > that, if IF=1, we are *not* in an EQS with the sole exception of the
> > idle code.
> >
> > In my dream world, we would never ever get IRQs while in an EQS -- we
> > would do MWAIT with IF=0 and we would exit the EQS before taking the
> > interrupt.  But I guess we still need to support HLT, which means we
> > have this mess.
> >
> > But I still think we can plausibly get rid of the conditional.
>
> You mean the conditional in rcu_nmi_enter()?  In a NO_HZ_FULL=n system,
> this becomes:

So, I meant the conditional in tglx's patch that makes page faults special.

>
> > If we
> > get an IRQ or (egads!) a fault in idle context, we'll have
> > !__rcu_is_watching(), but, AFAICT, we also have preemption off.
>
> Or we could be early in the kernel-entry code or late in the kernel-exit
> code, but as far as I know, preemption is disabled on those code paths.
> As are interrupts, right?  And interrupts are disabled on the portions
> of the CPU-hotplug code where RCU is not watching, if I recall correctly.

Interrupts are off in the parts of the entry/exit that RCU considers
to be user mode.  We can get various faults, although these should be
either NMI-like or events that genuinely or effectively happened in
user mode.

>
> A nohz_full CPU does not enable the scheduling-clock interrupt upon
> entry to the kernel.  Normally, this is fine because that CPU will very
> quickly exit back to nohz_full userspace execution, so that RCU will
> see the quiescent state, either by sampling it directly or by deducing
> the CPU's passage through that quiescent state by comparing with state
> that was captured earlier.  The grace-period kthread notices the lack
> of a quiescent state and will eventually set ->rcu_urgent_qs to
> trigger this code.
>
> But if the nohz_full CPU stays in the kernel for an extended time,
> perhaps due to OOM handling or due to processing of some huge I/O that
> hits in-memory buffers/cache, then RCU needs some way of detecting
> quiescent states on that CPU.  This requires the scheduling-clock
> interrupt to be alive and well.
>
> Are there other ways to get this done?  But of course!  RCU could
> for example use smp_call_function_single() or use workqueues to force
> execution onto that CPU and enable the tick that way.  This gets a
> little involved in order to avoid deadlock, but if the added check
> in rcu_nmi_enter() is causing trouble, something can be arranged.
> Though that something would cause more latency excursions than
> does the current code.
>
> Or did you have something else in mind?

I'm trying to understand when we actually need to call the function.
Is it just the scheduling interrupt that's 

Re: [patch V6 12/37] x86/entry: Provide idtentry_entry/exit_cond_rcu()

2020-05-20 Thread Thomas Gleixner
Andy Lutomirski  writes:
> On Tue, May 19, 2020 at 2:20 PM Thomas Gleixner  wrote:
> Unless I've missed something, the effect here is that #PF hitting in
> an RCU-watching context will skip rcu_irq_enter(), whereas all IRQs
> (because you converted them) as well as other faults and traps will
> call rcu_irq_enter().

The only reason why this is needed for #PF is that a kernel mode #PF may
sleep. And of course you cannot sleep after calling rcu_irq_enter().

All other interrupts/traps/system vectors cannot sleep ever. So it's a
straightforward enter/exit.

> Once upon a time, we did this horrible thing where, on entry from user
> mode, we would turn on interrupts while still in CONTEXT_USER, which
> means we could get an IRQ in an extended quiescent state.  This means
> that the IRQ code had to end the EQS so that IRQ handlers could use
> RCU.  But I killed this a few years ago -- x86 Linux now has a rule
> that, if IF=1, we are *not* in an EQS with the sole exception of the
> idle code.
>
> In my dream world, we would never ever get IRQs while in an EQS -- we
> would do MWAIT with IF=0 and we would exit the EQS before taking the
> interrupt.  But I guess we still need to support HLT, which means we
> have this mess.

You always can dream, but dont complain about the nightmares :)

Thanks,

tglx


Re: [patch V6 12/37] x86/entry: Provide idtentry_entry/exit_cond_rcu()

2020-05-19 Thread Paul E. McKenney
On Tue, May 19, 2020 at 05:26:58PM -0700, Andy Lutomirski wrote:
> On Tue, May 19, 2020 at 2:20 PM Thomas Gleixner  wrote:
> >
> > Andy Lutomirski  writes:
> > > On Tue, May 19, 2020 at 1:20 PM Thomas Gleixner  
> > > wrote:
> > >> Thomas Gleixner  writes:
> > >> It's about this:
> > >>
> > >> rcu_nmi_enter()
> > >> {
> > >> if (!rcu_is_watching()) {
> > >> make it watch;
> > >> } else if (!in_nmi()) {
> > >> do_magic_nohz_dyntick_muck();
> > >> }
> > >>
> > >> So if we do all irq/system vector entries conditional then the
> > >> do_magic() gets never executed. After that I got lost...
> > >
> > > I'm also baffled by that magic, but I'm also not suggesting doing this
> > > to *all* entries -- just the not-super-magic ones that use
> > > idtentry_enter().
> > >
> > > Paul, what is this code actually trying to do?
> >
> > Citing Paul from IRC:
> >
> >   "The way things are right now, you can leave out the rcu_irq_enter()
> >if this is not a nohz_full CPU.
> >
> >Or if this is a nohz_full CPU, and the tick is already
> >enabled, in that case you could also leave out the rcu_irq_enter().
> >
> >Or even if this is a nohz_full CPU and it does not have the tick
> >enabled, if it has been in the kernel less than a few tens of
> >milliseconds, still OK to avoid invoking rcu_irq_enter()
> >
> >But my guess is that it would be a lot simpler to just always call
> >it.
> >
> > Hope that helps.
> 
> Maybe?
> 
> Unless I've missed something, the effect here is that #PF hitting in
> an RCU-watching context will skip rcu_irq_enter(), whereas all IRQs
> (because you converted them) as well as other faults and traps will
> call rcu_irq_enter().
> 
> Once upon a time, we did this horrible thing where, on entry from user
> mode, we would turn on interrupts while still in CONTEXT_USER, which
> means we could get an IRQ in an extended quiescent state.  This means
> that the IRQ code had to end the EQS so that IRQ handlers could use
> RCU.  But I killed this a few years ago -- x86 Linux now has a rule
> that, if IF=1, we are *not* in an EQS with the sole exception of the
> idle code.
> 
> In my dream world, we would never ever get IRQs while in an EQS -- we
> would do MWAIT with IF=0 and we would exit the EQS before taking the
> interrupt.  But I guess we still need to support HLT, which means we
> have this mess.
> 
> But I still think we can plausibly get rid of the conditional.

You mean the conditional in rcu_nmi_enter()?  In a NO_HZ_FULL=n system,
this becomes:

if (!rcu_is_watching()) {
make it watch;
} else if (!in_nmi()) {
instrumentation_begin();
if (tick_nohz_full_cpu(rdp->cpu) && ... {
do stuff
}
instrumentation_end();
}

But tick_nohz_full_cpu() is compile-time-known false, so as long as the
compiler can ditch the instrumentation_begin() and instrumentation_end(),
the entire "else if" clause evaporates.

> If we
> get an IRQ or (egads!) a fault in idle context, we'll have
> !__rcu_is_watching(), but, AFAICT, we also have preemption off.

Or we could be early in the kernel-entry code or late in the kernel-exit
code, but as far as I know, preemption is disabled on those code paths.
As are interrupts, right?  And interrupts are disabled on the portions
of the CPU-hotplug code where RCU is not watching, if I recall correctly.

I am guessing that interrupts from userspace are not at issue here, but
completeness and all that.

>  So it
> should be okay to do rcu_irq_enter().  OTOH, if we get an IRQ or a
> fault anywhere else, then we either have a severe bug in the RCU code
> itself and the RCU code faulted (in which case we get what we deserve)
> or RCU is watching and all is well.  This means that the rule will be
> that, if preemption is on, it's fine to schedule inside an
> idtentry_begin()/idtentry_end() pair.

On this, I must defer to you guys.

> The remaining bit is just the urgent thing, and I don't understand
> what's going on.  Paul, could we split out the urgent logic all by
> itself so that the IRQ handlers could do rcu_poke_urgent()?  Or am I
> entirely misunderstanding its purpose?

A nohz_full CPU does not enable the scheduling-clock interrupt upon
entry to the kernel.  Normally, this is fine because that CPU will very
quickly exit back to nohz_full userspace execution, so that RCU will
see the quiescent state, either by sampling it directly or by deducing
the CPU's passage through that quiescent state by comparing with state
that was captured earlier.  The grace-period kthread notices the lack
of a quiescent state and will eventually set ->rcu_urgent_qs to
trigger this code.

But if the nohz_full CPU stays in the kernel for an extended time,
perhaps due to OOM handling or due to processing of some huge 

Re: [patch V6 12/37] x86/entry: Provide idtentry_entry/exit_cond_rcu()

2020-05-19 Thread Andy Lutomirski
On Tue, May 19, 2020 at 2:20 PM Thomas Gleixner  wrote:
>
> Andy Lutomirski  writes:
> > On Tue, May 19, 2020 at 1:20 PM Thomas Gleixner  wrote:
> >> Thomas Gleixner  writes:
> >> It's about this:
> >>
> >> rcu_nmi_enter()
> >> {
> >> if (!rcu_is_watching()) {
> >> make it watch;
> >> } else if (!in_nmi()) {
> >> do_magic_nohz_dyntick_muck();
> >> }
> >>
> >> So if we do all irq/system vector entries conditional then the
> >> do_magic() gets never executed. After that I got lost...
> >
> > I'm also baffled by that magic, but I'm also not suggesting doing this
> > to *all* entries -- just the not-super-magic ones that use
> > idtentry_enter().
> >
> > Paul, what is this code actually trying to do?
>
> Citing Paul from IRC:
>
>   "The way things are right now, you can leave out the rcu_irq_enter()
>if this is not a nohz_full CPU.
>
>Or if this is a nohz_full CPU, and the tick is already
>enabled, in that case you could also leave out the rcu_irq_enter().
>
>Or even if this is a nohz_full CPU and it does not have the tick
>enabled, if it has been in the kernel less than a few tens of
>milliseconds, still OK to avoid invoking rcu_irq_enter()
>
>But my guess is that it would be a lot simpler to just always call
>it.
>
> Hope that helps.

Maybe?

Unless I've missed something, the effect here is that #PF hitting in
an RCU-watching context will skip rcu_irq_enter(), whereas all IRQs
(because you converted them) as well as other faults and traps will
call rcu_irq_enter().

Once upon a time, we did this horrible thing where, on entry from user
mode, we would turn on interrupts while still in CONTEXT_USER, which
means we could get an IRQ in an extended quiescent state.  This means
that the IRQ code had to end the EQS so that IRQ handlers could use
RCU.  But I killed this a few years ago -- x86 Linux now has a rule
that, if IF=1, we are *not* in an EQS with the sole exception of the
idle code.

In my dream world, we would never ever get IRQs while in an EQS -- we
would do MWAIT with IF=0 and we would exit the EQS before taking the
interrupt.  But I guess we still need to support HLT, which means we
have this mess.

But I still think we can plausibly get rid of the conditional.  If we
get an IRQ or (egads!) a fault in idle context, we'll have
!__rcu_is_watching(), but, AFAICT, we also have preemption off.  So it
should be okay to do rcu_irq_enter().  OTOH, if we get an IRQ or a
fault anywhere else, then we either have a severe bug in the RCU code
itself and the RCU code faulted (in which case we get what we deserve)
or RCU is watching and all is well.  This means that the rule will be
that, if preemption is on, it's fine to schedule inside an
idtentry_begin()/idtentry_end() pair.

The remaining bit is just the urgent thing, and I don't understand
what's going on.  Paul, could we split out the urgent logic all by
itself so that the IRQ handlers could do rcu_poke_urgent()?  Or am I
entirely misunderstanding its purpose?


Re: [patch V6 12/37] x86/entry: Provide idtentry_entry/exit_cond_rcu()

2020-05-19 Thread Thomas Gleixner
Andy Lutomirski  writes:
> On Tue, May 19, 2020 at 1:20 PM Thomas Gleixner  wrote:
>> Thomas Gleixner  writes:
>> It's about this:
>>
>> rcu_nmi_enter()
>> {
>> if (!rcu_is_watching()) {
>> make it watch;
>> } else if (!in_nmi()) {
>> do_magic_nohz_dyntick_muck();
>> }
>>
>> So if we do all irq/system vector entries conditional then the
>> do_magic() gets never executed. After that I got lost...
>
> I'm also baffled by that magic, but I'm also not suggesting doing this
> to *all* entries -- just the not-super-magic ones that use
> idtentry_enter().
>
> Paul, what is this code actually trying to do?

Citing Paul from IRC:

  "The way things are right now, you can leave out the rcu_irq_enter()
   if this is not a nohz_full CPU.

   Or if this is a nohz_full CPU, and the tick is already
   enabled, in that case you could also leave out the rcu_irq_enter().

   Or even if this is a nohz_full CPU and it does not have the tick
   enabled, if it has been in the kernel less than a few tens of
   milliseconds, still OK to avoid invoking rcu_irq_enter()

   But my guess is that it would be a lot simpler to just always call
   it.

Hope that helps.

Thanks,

tglx


Re: [patch V6 12/37] x86/entry: Provide idtentry_entry/exit_cond_rcu()

2020-05-19 Thread Andy Lutomirski
On Tue, May 19, 2020 at 1:20 PM Thomas Gleixner  wrote:
>
> Thomas Gleixner  writes:
> > Andy Lutomirski  writes:
> >> On Fri, May 15, 2020 at 5:10 PM Thomas Gleixner  wrote:
> >>> The pagefault handler cannot use the regular idtentry_enter() because that
> >>> invokes rcu_irq_enter() if the pagefault was caused in the kernel. Not a
> >>> problem per se, but kernel side page faults can schedule which is not
> >>> possible without invoking rcu_irq_exit().
> >>>
> >>> Adding rcu_irq_exit() and a matching rcu_irq_enter() into the actual
> >>> pagefault handling code would be possible, but not pretty either.
> >>>
> >>> Provide idtentry_entry/exit_cond_rcu() which calls rcu_irq_enter() only
> >>> when RCU is not watching. The conditional RCU enabling is a correctness
> >>> issue: A kernel page fault which hits an RCU idle region can neither
> >>> schedule nor is it likely to survive. But avoiding RCU warnings or RCU
> >>> side effects is at least increasing the chance for useful debug output.
> >>>
> >>> The function is also useful for implementing lightweight reschedule IPI and
> >>> KVM posted interrupt IPI entry handling later.
> >>
> >> Why is this conditional?  That is, couldn't we do this for all
> >> idtentry_enter() calls instead of just for page faults?  Evil things
> >> like NMI shouldn't go through this path at all.
> >
> > I thought about that, but then ended up with the conclusion that RCU
> > might be unhappy, but my conclusion might be fundamentally wrong.
>
> It's about this:
>
> rcu_nmi_enter()
> {
> if (!rcu_is_watching()) {
> make it watch;
> } else if (!in_nmi()) {
> do_magic_nohz_dyntick_muck();
> }
>
> So if we do all irq/system vector entries conditional then the
> do_magic() gets never executed. After that I got lost...

I'm also baffled by that magic, but I'm also not suggesting doing this
to *all* entries -- just the not-super-magic ones that use
idtentry_enter().

Paul, what is this code actually trying to do?


Re: [patch V6 12/37] x86/entry: Provide idtentry_entry/exit_cond_rcu()

2020-05-19 Thread Thomas Gleixner
Thomas Gleixner  writes:
> Andy Lutomirski  writes:
>> On Fri, May 15, 2020 at 5:10 PM Thomas Gleixner  wrote:
>>> The pagefault handler cannot use the regular idtentry_enter() because that
>>> invokes rcu_irq_enter() if the pagefault was caused in the kernel. Not a
>>> problem per se, but kernel side page faults can schedule which is not
>>> possible without invoking rcu_irq_exit().
>>>
>>> Adding rcu_irq_exit() and a matching rcu_irq_enter() into the actual
>>> pagefault handling code would be possible, but not pretty either.
>>>
>>> Provide idtentry_entry/exit_cond_rcu() which calls rcu_irq_enter() only
>>> when RCU is not watching. The conditional RCU enabling is a correctness
>>> issue: A kernel page fault which hits an RCU idle region can neither
>>> schedule nor is it likely to survive. But avoiding RCU warnings or RCU side
>>> effects is at least increasing the chance for useful debug output.
>>>
>>> The function is also useful for implementing lightweight reschedule IPI and
>>> KVM posted interrupt IPI entry handling later.
>>
>> Why is this conditional?  That is, couldn't we do this for all
>> idtentry_enter() calls instead of just for page faults?  Evil things
>> like NMI shouldn't go through this path at all.
>
> I thought about that, but then ended up with the conclusion that RCU
> might be unhappy, but my conclusion might be fundamentally wrong.

It's about this:

rcu_nmi_enter()
{
if (!rcu_is_watching()) {
make it watch;
} else if (!in_nmi()) {
do_magic_nohz_dyntick_muck();
}

So if we do all irq/system vector entries conditional then the
do_magic() gets never executed. After that I got lost...

Thanks,

 tglx


Re: [patch V6 12/37] x86/entry: Provide idtentry_entry/exit_cond_rcu()

2020-05-19 Thread Thomas Gleixner
Andy Lutomirski  writes:
> On Fri, May 15, 2020 at 5:10 PM Thomas Gleixner  wrote:
>> The pagefault handler cannot use the regular idtentry_enter() because that
>> invokes rcu_irq_enter() if the pagefault was caused in the kernel. Not a
>> problem per se, but kernel side page faults can schedule which is not
>> possible without invoking rcu_irq_exit().
>>
>> Adding rcu_irq_exit() and a matching rcu_irq_enter() into the actual
>> pagefault handling code would be possible, but not pretty either.
>>
>> Provide idtentry_entry/exit_cond_rcu() which calls rcu_irq_enter() only
>> when RCU is not watching. The conditional RCU enabling is a correctness
>> issue: A kernel page fault which hits an RCU idle region can neither
>> schedule nor is it likely to survive. But avoiding RCU warnings or RCU side
>> effects is at least increasing the chance for useful debug output.
>>
>> The function is also useful for implementing lightweight reschedule IPI and
>> KVM posted interrupt IPI entry handling later.
>
> Why is this conditional?  That is, couldn't we do this for all
> idtentry_enter() calls instead of just for page faults?  Evil things
> like NMI shouldn't go through this path at all.

I thought about that, but then ended up with the conclusion that RCU
might be unhappy, but my conclusion might be fundamentally wrong.

Paul?

Thanks,

tglx


Re: [patch V6 12/37] x86/entry: Provide idtentry_entry/exit_cond_rcu()

2020-05-19 Thread Andy Lutomirski
On Fri, May 15, 2020 at 5:10 PM Thomas Gleixner  wrote:
>
>
> The pagefault handler cannot use the regular idtentry_enter() because that
> invokes rcu_irq_enter() if the pagefault was caused in the kernel. Not a
> problem per se, but kernel side page faults can schedule which is not
> possible without invoking rcu_irq_exit().
>
> Adding rcu_irq_exit() and a matching rcu_irq_enter() into the actual
> pagefault handling code would be possible, but not pretty either.
>
> Provide idtentry_entry/exit_cond_rcu() which calls rcu_irq_enter() only
> when RCU is not watching. The conditional RCU enabling is a correctness
> issue: A kernel page fault which hits an RCU idle region can neither
> schedule nor is it likely to survive. But avoiding RCU warnings or RCU side
> effects is at least increasing the chance for useful debug output.
>
> The function is also useful for implementing lightweight reschedule IPI and
> KVM posted interrupt IPI entry handling later.

Why is this conditional?  That is, couldn't we do this for all
idtentry_enter() calls instead of just for page faults?  Evil things
like NMI shouldn't go through this path at all.

--Andy