Visa Hankala <v...@hankala.org> wrote:

> On Sat, Aug 12, 2023 at 01:29:10PM +0200, Martin Pieuchot wrote:
> > On 12/08/23(Sat) 10:57, Visa Hankala wrote:
> > > On Fri, Aug 11, 2023 at 09:52:15PM +0200, Martin Pieuchot wrote:
> > > > When stopping a machine, with "halt -p" for example, secondary CPUs are
> > > > removed from the scheduler before smr_flush() is called.  So there's no
> > > > need for the SMR thread to peg itself to such CPUs.  This currently
> > > > isn't a problem because we use per-CPU runqueues but it doesn't work
> > > > with a global one.  So the diff below skip halted CPUs.  It should also
> > > > speed up rebooting/halting on machine with a huge number of CPUs.
> > > 
> > > Because SPCF_HALTED does not (?) imply that the CPU has stopped
> > > processing interrupts, this skipping is not safe as is. Interrupt
> > > handlers might access SMR-protected data.
> > 
> > Interesting.  This is worse than I expected.  It seems we completely
> > forgot about suspend/resume and rebooting when we started pinning
> > interrupts on secondary CPUs, no?  Previously sched_stop_secondary_cpus()
> > was enough to ensure no more code would be executed on secondary CPUs,
> > no?  Wouldn't it be better to remap interrupts to the primary CPU in
> > those cases?  Is it easily doable? 
> 
> I think device interrupt stopping already happens through
> config_suspend_all().

Right.  DVACT_QUIESCE is supposed to stop the device's transaction flow,
which implies it stops scheduling more things which will cause interrupts.
DVACT_SUSPEND is supposed to stop the device even further, so that it can
save a hardware state into memory, which is viable for recovery later on.

For hibernate, this is even more strict, because memory side effects must
not occur after these phases.

I suppose there is nothing which says interrupts must be stopped for
DVACT_SUSPEND.  A driver could ACK the interrupts, but create no software
side effects.

Anyways, for the halt/reboot case, config_suspend_all() is not done.
We have architectures that don't suspend/resume, but must halt/reboot.
We have not reviewed their driver stack to check if config_suspend_all()
would do the right thing.  This is not trivial, it isn't just in the
drivers.  Some of our non-suspending architectures could have incorrectly
ordered device trees...

Reply via email to