Re: smr_grace_wait(): Skip halted CPUs

2023-08-13 Thread Visa Hankala
On Sat, Aug 12, 2023 at 02:40:31PM +0200, Martin Pieuchot wrote:
> So do we want to keep the existing requirement of being able to execute
> a thread on a CPU that has been removed from the scheduler?  That's is
> what smr_flush() currently needs.  I find it surprising but I can add
> that as a requirement for the upcoming scheduler.  I don't know if other
> options are possible or even attractive.

I think it is useful that the kernel can force a thread to run on
a specific CPU even when general scheduling has been stopped. It is
maybe a bit crude, but nonetheless effective and machine independent,
way of synchronization.

The kernel has a few current uses of sched_peg_curproc(). Perhaps
the most prominent ones are interrupt barriers and the SMR grace wait
mechanism.



Re: smr_grace_wait(): Skip halted CPUs

2023-08-12 Thread Theo de Raadt
Visa Hankala  wrote:

> On Sat, Aug 12, 2023 at 01:29:10PM +0200, Martin Pieuchot wrote:
> > On 12/08/23(Sat) 10:57, Visa Hankala wrote:
> > > On Fri, Aug 11, 2023 at 09:52:15PM +0200, Martin Pieuchot wrote:
> > > > When stopping a machine, with "halt -p" for example, secondary CPUs are
> > > > removed from the scheduler before smr_flush() is called.  So there's no
> > > > need for the SMR thread to peg itself to such CPUs.  This currently
> > > > isn't a problem because we use per-CPU runqueues but it doesn't work
> > > > with a global one.  So the diff below skip halted CPUs.  It should also
> > > > speed up rebooting/halting on machine with a huge number of CPUs.
> > > 
> > > Because SPCF_HALTED does not (?) imply that the CPU has stopped
> > > processing interrupts, this skipping is not safe as is. Interrupt
> > > handlers might access SMR-protected data.
> > 
> > Interesting.  This is worse than I expected.  It seems we completely
> > forgot about suspend/resume and rebooting when we started pinning
> > interrupts on secondary CPUs, no?  Previously sched_stop_secondary_cpus()
> > was enough to ensure no more code would be executed on secondary CPUs,
> > no?  Wouldn't it be better to remap interrupts to the primary CPU in
> > those cases?  Is it easily doable? 
> 
> I think device interrupt stopping already happens through
> config_suspend_all().

Right.  DVACT_QUIESCE is supposed to stop the device's transaction flow,
which implies it stops scheduling more things which will cause interrupts.
DVACT_SUSPEND is supposed to stop the device even further, so that it can
save a hardware state into memory, which is viable for recovery later on.

For hibernate, this is even more strict, because memory side effects must
not occur after these phases.

I suppose there is nothing which says interrupts must be stopped for
DVACT_SUSPEND.  A driver could ACK the interrupts, but create no software
side effects.

Anyways, for the halt/reboot case, config_suspend_all() is not done.
We have architectures that don't suspend/resume, but must halt/reboot.
We have not reviewed their driver stack to check if config_suspend_all()
would do the right thing.  This is not trivial, it isn't just in the
drivers.  Some of our non-suspending architectures could have incorrectly
ordered device trees...



Re: smr_grace_wait(): Skip halted CPUs

2023-08-12 Thread Martin Pieuchot
On 12/08/23(Sat) 11:48, Visa Hankala wrote:
> On Sat, Aug 12, 2023 at 01:29:10PM +0200, Martin Pieuchot wrote:
> > On 12/08/23(Sat) 10:57, Visa Hankala wrote:
> > > On Fri, Aug 11, 2023 at 09:52:15PM +0200, Martin Pieuchot wrote:
> > > > When stopping a machine, with "halt -p" for example, secondary CPUs are
> > > > removed from the scheduler before smr_flush() is called.  So there's no
> > > > need for the SMR thread to peg itself to such CPUs.  This currently
> > > > isn't a problem because we use per-CPU runqueues but it doesn't work
> > > > with a global one.  So the diff below skip halted CPUs.  It should also
> > > > speed up rebooting/halting on machine with a huge number of CPUs.
> > > 
> > > Because SPCF_HALTED does not (?) imply that the CPU has stopped
> > > processing interrupts, this skipping is not safe as is. Interrupt
> > > handlers might access SMR-protected data.
> > 
> > Interesting.  This is worse than I expected.  It seems we completely
> > forgot about suspend/resume and rebooting when we started pinning
> > interrupts on secondary CPUs, no?  Previously sched_stop_secondary_cpus()
> > was enough to ensure no more code would be executed on secondary CPUs,
> > no?  Wouldn't it be better to remap interrupts to the primary CPU in
> > those cases?  Is it easily doable? 
> 
> I think device interrupt stopping already happens through
> config_suspend_all().

Indeed.  I'm a bit puzzled about the order of operations though.  In the
case of reboot/halt the existing order of operations are:

sched_stop_secondary_cpus() <--- remove secondary CPUs from the scheduler


vfs_shutdown()
if_downall()
uvm_swap_finicrypt_all()<--- happens on a single CPU but with
interrupts possibly on secondary CPUs


smr_flush() <--- tells the SMR thread to execute itself
on all CPUs even if they are out of the
scheduler

config_suspend_all()<--- stop interrupts from firing

x86_broadcast_ipi(X86_IPI_HALT) <--- stop secondary CPUs (on x86)


So do we want to keep the existing requirement of being able to execute
a thread on a CPU that has been removed from the scheduler?  That's is
what smr_flush() currently needs.  I find it surprising but I can add
that as a requirement for the upcoming scheduler.  I don't know if other
options are possible or even attractive.



Re: smr_grace_wait(): Skip halted CPUs

2023-08-12 Thread Visa Hankala
On Sat, Aug 12, 2023 at 01:29:10PM +0200, Martin Pieuchot wrote:
> On 12/08/23(Sat) 10:57, Visa Hankala wrote:
> > On Fri, Aug 11, 2023 at 09:52:15PM +0200, Martin Pieuchot wrote:
> > > When stopping a machine, with "halt -p" for example, secondary CPUs are
> > > removed from the scheduler before smr_flush() is called.  So there's no
> > > need for the SMR thread to peg itself to such CPUs.  This currently
> > > isn't a problem because we use per-CPU runqueues but it doesn't work
> > > with a global one.  So the diff below skip halted CPUs.  It should also
> > > speed up rebooting/halting on machine with a huge number of CPUs.
> > 
> > Because SPCF_HALTED does not (?) imply that the CPU has stopped
> > processing interrupts, this skipping is not safe as is. Interrupt
> > handlers might access SMR-protected data.
> 
> Interesting.  This is worse than I expected.  It seems we completely
> forgot about suspend/resume and rebooting when we started pinning
> interrupts on secondary CPUs, no?  Previously sched_stop_secondary_cpus()
> was enough to ensure no more code would be executed on secondary CPUs,
> no?  Wouldn't it be better to remap interrupts to the primary CPU in
> those cases?  Is it easily doable? 

I think device interrupt stopping already happens through
config_suspend_all().



Re: smr_grace_wait(): Skip halted CPUs

2023-08-12 Thread Martin Pieuchot
On 12/08/23(Sat) 10:57, Visa Hankala wrote:
> On Fri, Aug 11, 2023 at 09:52:15PM +0200, Martin Pieuchot wrote:
> > When stopping a machine, with "halt -p" for example, secondary CPUs are
> > removed from the scheduler before smr_flush() is called.  So there's no
> > need for the SMR thread to peg itself to such CPUs.  This currently
> > isn't a problem because we use per-CPU runqueues but it doesn't work
> > with a global one.  So the diff below skip halted CPUs.  It should also
> > speed up rebooting/halting on machine with a huge number of CPUs.
> 
> Because SPCF_HALTED does not (?) imply that the CPU has stopped
> processing interrupts, this skipping is not safe as is. Interrupt
> handlers might access SMR-protected data.

Interesting.  This is worse than I expected.  It seems we completely
forgot about suspend/resume and rebooting when we started pinning
interrupts on secondary CPUs, no?  Previously sched_stop_secondary_cpus()
was enough to ensure no more code would be executed on secondary CPUs,
no?  Wouldn't it be better to remap interrupts to the primary CPU in
those cases?  Is it easily doable? 

> One possible solution is to spin. When smr_grace_wait() sees
> SPCF_HALTED, it should probably call cpu_unidle(ci) and spin until
> condition READ_ONCE(ci->ci_schedstate.spc_smrgp) == smrgp becomes true.
> However, for this to work, sched_idle() needs to invoke smr_idle().
> Here is a potential problem since the cpu_idle_{enter,cycle,leave}()
> logic is not consistent between architectures.

We're trying to move away from per-CPU runqueues.  That's how I found
this issue.  I don't see how this possible solution could work with a
global runqueue.  Do you?

> I think the intent in sched_idle() was that cpu_idle_enter() should
> block interrupts so that sched_idle() could check without races if
> the CPU can sleep. Now, on some architectures cpu_idle_enter() is
> a no-op. These architectures have to check the idle state in their
> cpu_idle_cycle() function before pausing the CPU.
> 
> To avoid touching architecture-specific code, cpu_is_idle() could
> be redefined to
> 
> ((ci)->ci_schedstate.spc_whichqs == 0 &&
>  (ci)->ci_schedstate.spc_smrgp == READ_ONCE(smr_grace_period))
> 
> Then the loop conditions
> 
>   while (!cpu_is_idle(curcpu())) {
> 
> and
> 
>   while (spc->spc_whichqs == 0) {
> 
> in sched_idle() would have to be changed to
> 
>   while (spc->spc_whichqs != 0) {
> 
> and
> 
>   while (cpu_is_idle(ci)) {
> 
> 
> :(
> 
> > Index: kern/kern_smr.c
> > ===
> > RCS file: /cvs/src/sys/kern/kern_smr.c,v
> > retrieving revision 1.16
> > diff -u -p -r1.16 kern_smr.c
> > --- kern/kern_smr.c 14 Aug 2022 01:58:27 -  1.16
> > +++ kern/kern_smr.c 11 Aug 2023 19:43:54 -
> > @@ -158,6 +158,8 @@ smr_grace_wait(void)
> > CPU_INFO_FOREACH(cii, ci) {
> > if (!CPU_IS_RUNNING(ci))
> > continue;
> > +   if (ci->ci_schedstate.spc_schedflags & SPCF_HALTED)
> > +   continue;
> > if (READ_ONCE(ci->ci_schedstate.spc_smrgp) == smrgp)
> > continue;
> > sched_peg_curproc(ci);
> > 



Re: smr_grace_wait(): Skip halted CPUs

2023-08-12 Thread Visa Hankala
On Fri, Aug 11, 2023 at 09:52:15PM +0200, Martin Pieuchot wrote:
> When stopping a machine, with "halt -p" for example, secondary CPUs are
> removed from the scheduler before smr_flush() is called.  So there's no
> need for the SMR thread to peg itself to such CPUs.  This currently
> isn't a problem because we use per-CPU runqueues but it doesn't work
> with a global one.  So the diff below skip halted CPUs.  It should also
> speed up rebooting/halting on machine with a huge number of CPUs.

Because SPCF_HALTED does not (?) imply that the CPU has stopped
processing interrupts, this skipping is not safe as is. Interrupt
handlers might access SMR-protected data.

One possible solution is to spin. When smr_grace_wait() sees
SPCF_HALTED, it should probably call cpu_unidle(ci) and spin until
condition READ_ONCE(ci->ci_schedstate.spc_smrgp) == smrgp becomes true.
However, for this to work, sched_idle() needs to invoke smr_idle().
Here is a potential problem since the cpu_idle_{enter,cycle,leave}()
logic is not consistent between architectures.

I think the intent in sched_idle() was that cpu_idle_enter() should
block interrupts so that sched_idle() could check without races if
the CPU can sleep. Now, on some architectures cpu_idle_enter() is
a no-op. These architectures have to check the idle state in their
cpu_idle_cycle() function before pausing the CPU.

To avoid touching architecture-specific code, cpu_is_idle() could
be redefined to

((ci)->ci_schedstate.spc_whichqs == 0 &&
 (ci)->ci_schedstate.spc_smrgp == READ_ONCE(smr_grace_period))

Then the loop conditions

while (!cpu_is_idle(curcpu())) {

and

while (spc->spc_whichqs == 0) {

in sched_idle() would have to be changed to

while (spc->spc_whichqs != 0) {

and

while (cpu_is_idle(ci)) {


:(

> Index: kern/kern_smr.c
> ===
> RCS file: /cvs/src/sys/kern/kern_smr.c,v
> retrieving revision 1.16
> diff -u -p -r1.16 kern_smr.c
> --- kern/kern_smr.c   14 Aug 2022 01:58:27 -  1.16
> +++ kern/kern_smr.c   11 Aug 2023 19:43:54 -
> @@ -158,6 +158,8 @@ smr_grace_wait(void)
>   CPU_INFO_FOREACH(cii, ci) {
>   if (!CPU_IS_RUNNING(ci))
>   continue;
> + if (ci->ci_schedstate.spc_schedflags & SPCF_HALTED)
> + continue;
>   if (READ_ONCE(ci->ci_schedstate.spc_smrgp) == smrgp)
>   continue;
>   sched_peg_curproc(ci);
>