Re: [PATCH] powerpc/pseries/hotplug-cpu: increase wait time for vCPU death

2020-08-11 Thread Nathan Lynch
Michael Ellerman  writes:

> Michael Roth  writes:
>> Quoting Nathan Lynch (2020-08-07 02:05:09)
> ...
>>> wait_for_cpu_stopped() should be able to accommodate a time-based
>>> warning if necessary, but speaking as a likely recipient of any bug
>>> reports that would arise here, I'm not convinced of the need and I
>>> don't know what a good value would be. It's relatively easy to sample
>>> the stack of a task that's apparently failing to make progress, plus I
>>> probably would use 'perf probe' or similar to report the inputs and
>>> outputs for the RTAS call.
>>
>> I think if we make the timeout sufficiently high like 2 minutes or so
>> it wouldn't hurt and if we did see them it would probably point to an
>> actual bug. But I don't have a strong feeling either way.
>
> I think we should print a warning after 2 minutes.
>
> It's true that there are fairly easy mechanisms to work out where the
> thread is stuck, but customers are unlikely to use them. They're just
> going to report that it's stuck with no further info, and probably
> reboot the machine before we get a chance to get any further info.
>
> Whereas if the kernel prints a warning with a stack trace we at least
> have that to go on in an initial bug report.
>
>>> I'm happy to make this a proper submission after I can clean it up and
>>> retest it, or Michael R. is welcome to appropriate it, assuming it's
>>> acceptable.
>>> 
>>
>> I've given it a shot with this patch and it seems to be holding up in
>> testing. If we don't think the ~2 minutes warning message is needed I
>> can clean it up to post:
>>
>> https://github.com/mdroth/linux/commit/354b8c97bf0dc1146e36aa72273f5b33fe90d09e
>>
>> I'd likely break the refactoring patches out to a separate patch under
>> Nathan's name since it potentially fixes a separate bug.
>
> While I like Nathan's refactoring, we probably want to do the minimal
> fix first to ease backporting.
>
> Then do the refactoring on top of that.

Fair enough, thanks.


Re: [PATCH] powerpc/pseries/hotplug-cpu: increase wait time for vCPU death

2020-08-11 Thread Michael Ellerman
Michael Roth  writes:
> Quoting Nathan Lynch (2020-08-07 02:05:09)
...
>> wait_for_cpu_stopped() should be able to accommodate a time-based
>> warning if necessary, but speaking as a likely recipient of any bug
>> reports that would arise here, I'm not convinced of the need and I
>> don't know what a good value would be. It's relatively easy to sample
>> the stack of a task that's apparently failing to make progress, plus I
>> probably would use 'perf probe' or similar to report the inputs and
>> outputs for the RTAS call.
>
> I think if we make the timeout sufficiently high like 2 minutes or so
> it wouldn't hurt and if we did see them it would probably point to an
> actual bug. But I don't have a strong feeling either way.

I think we should print a warning after 2 minutes.

It's true that there are fairly easy mechanisms to work out where the
thread is stuck, but customers are unlikely to use them. They're just
going to report that it's stuck with no further info, and probably
reboot the machine before we get a chance to get any further info.

Whereas if the kernel prints a warning with a stack trace we at least
have that to go on in an initial bug report.
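To make the suggestion concrete, here is a rough userspace sketch of a poll loop that never gives up but prints a single warning once a deadline passes. All names here are hypothetical and the poll is simulated; in the kernel this would poll smp_query_cpu_stopped(), compare jiffies with time_after(), and warn via pr_warn() or a WARN() so a stack trace lands in the log.

```c
#include <stdbool.h>
#include <stdio.h>
#include <time.h>

/* Hypothetical stand-in for smp_query_cpu_stopped(): pretend the
 * remote CPU reaches the stopped state after a fixed number of polls. */
static bool cpu_stopped(int polls_remaining)
{
        return polls_remaining <= 0;
}

/* Poll until the CPU stops; print one warning if the wait exceeds
 * warn_secs. Returns the number of polls performed. */
static int wait_for_stop(int polls_needed, double warn_secs, bool *warned)
{
        time_t start = time(NULL);
        int polls = 0;

        *warned = false;
        while (!cpu_stopped(polls_needed - polls)) {
                polls++;
                if (!*warned && difftime(time(NULL), start) >= warn_secs) {
                        fprintf(stderr, "CPU slow to stop, still polling\n");
                        *warned = true; /* warn once, then keep waiting */
                }
        }
        return polls;
}
```

With a 120-second deadline this matches the 2-minute proposal: the loop keeps spinning rather than giving up, it just becomes observable in a bug report.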

>> I'm happy to make this a proper submission after I can clean it up and
>> retest it, or Michael R. is welcome to appropriate it, assuming it's
>> acceptable.
>> 
>
> I've given it a shot with this patch and it seems to be holding up in
> testing. If we don't think the ~2 minutes warning message is needed I
> can clean it up to post:
>
> https://github.com/mdroth/linux/commit/354b8c97bf0dc1146e36aa72273f5b33fe90d09e
>
> I'd likely break the refactoring patches out to a separate patch under
> Nathan's name since it potentially fixes a separate bug.

While I like Nathan's refactoring, we probably want to do the minimal
fix first to ease backporting.

Then do the refactoring on top of that.

cheers


Re: [PATCH] powerpc/pseries/hotplug-cpu: increase wait time for vCPU death

2020-08-10 Thread Michael Roth
Quoting Nathan Lynch (2020-08-07 02:05:09)
> Hi everyone,
> 
> Michael Ellerman  writes:
> > Greg Kurz  writes:
> >> On Tue, 04 Aug 2020 23:35:10 +1000
> >> Michael Ellerman  wrote:
> >>> Spinning forever seems like a bad idea, but as has been demonstrated at
> >>> least twice now, continuing when we don't know the state of the other
> >>> CPU can lead to straight up crashes.
> >>> 
> >>> So I think I'm persuaded that it's preferable to have the kernel stuck
> >>> spinning rather than oopsing.
> >>> 
> >>
> >> +1
> >>
> >>> I'm 50/50 on whether we should have a cond_resched() in the loop. My
> >>> first instinct is no, if we're stuck here for 20s a stack trace would be
> >>> good. But then we will probably hit that on some big and/or heavily
> >>> loaded machine.
> >>> 
> >>> So possibly we should call cond_resched() but have some custom logic in
> >>> the loop to print a warning if we are stuck for more than some
> >>> sufficiently long amount of time.
> >>
> >> How long should that be ?
> >
> > Yeah good question.
> >
> > I guess step one would be seeing how long it can take on the 384 vcpu
> > machine. And we can probably test on some other big machines.
> >
> > Hopefully Nathan can give us some idea of how long he's seen it take on
> > large systems? I know he was concerned about the 20s timeout of the
> > softlockup detector.
> 
> Maybe I'm not quite clear what this is referring to, but I don't think
> stop-self/query-stopped-state latency increases with processor count, at
> least not on PowerVM. And IIRC I was concerned with the earlier patch's
> potential for causing the softlockup watchdog to rightly complain by
> polling the stopped state without ever scheduling away.
> 
> The fact that smp_query_cpu_stopped() kind of collapses the two distinct
> results from the query-cpu-stopped-state RTAS call into one return value
> may make it harder than necessary to reason about the questions around
> cond_resched() and whether to warn.
> 
> Sorry to pull this stunt but I have had some code sitting in a neglected
> branch that I think gets the logic around this right.
> 
> What we should have is a simple C wrapper for the RTAS call that reflects the
> architected inputs and outputs:
> 
> 
> (-- rtas.c --)
> 
> /**
>  * rtas_query_cpu_stopped_state() - Call RTAS query-cpu-stopped-state.
>  * @hwcpu: Identifies the processor thread to be queried.
>  * @status: Pointer to status, valid only on success.
>  *
>  * Determine whether the given processor thread is in the stopped
>  * state.  If successful and @status is non-NULL, the thread's status
>  * is stored to @status.
>  *
>  * Return:
>  * * 0   - Success
>  * * -1  - Hardware error
>  * * -2  - Busy, try again later
>  */
> int rtas_query_cpu_stopped_state(unsigned int hwcpu, unsigned int *status)
> {
>         unsigned int cpu_status;
>         int token;
>         int fwrc;
> 
>         token = rtas_token("query-cpu-stopped-state");
> 
>         fwrc = rtas_call(token, 1, 2, &cpu_status, hwcpu);
>         if (fwrc != 0)
>                 goto out;
> 
>         if (status != NULL)
>                 *status = cpu_status;
> out:
>         return fwrc;
> }
> 
> 
> 
> And then a utility function that waits for the remote thread to enter
> stopped state, with higher-level logic for rescheduling and warning. The
> fact that smp_query_cpu_stopped() currently does not handle a -2/busy
> status is a bug, fixed below by using rtas_busy_delay(). Note the
> justification for the explicit cond_resched() in the outer loop:
> 
> 
> (-- rtas.h --)
> 
> /* query-cpu-stopped-state CPU_status */
> #define RTAS_QCSS_STATUS_STOPPED 0
> #define RTAS_QCSS_STATUS_IN_PROGRESS 1
> #define RTAS_QCSS_STATUS_NOT_STOPPED 2
> 
> (-- pseries/hotplug-cpu.c --)
> 
> /**
>  * wait_for_cpu_stopped() - Wait for a cpu to enter RTAS stopped state.
>  */
> static void wait_for_cpu_stopped(unsigned int cpu)
> {
>         unsigned int status;
>         unsigned int hwcpu;
> 
>         hwcpu = get_hard_smp_processor_id(cpu);
> 
>         do {
>                 int fwrc;
> 
>                 /*
>                  * rtas_busy_delay() will yield only if RTAS returns a
>                  * busy status. Since query-cpu-stopped-state can
>                  * yield RTAS_QCSS_STATUS_IN_PROGRESS or
>                  * RTAS_QCSS_STATUS_NOT_STOPPED for an unbounded
>                  * period before the target thread stops, we must take
>                  * care to explicitly reschedule while polling.
>                  */
>                 cond_resched();
> 
>                 do {
>                         fwrc = rtas_query_cpu_stopped_state(hwcpu, &status);
>                 } while (rtas_busy_delay(fwrc));
> 
>                 if (fwrc == 0)
>                         continue;
> 
>pr_err_ratelimited("query-cpu-stopped-state for "
>  

Re: [PATCH] powerpc/pseries/hotplug-cpu: increase wait time for vCPU death

2020-08-07 Thread Nathan Lynch
Hi everyone,

Michael Ellerman  writes:
> Greg Kurz  writes:
>> On Tue, 04 Aug 2020 23:35:10 +1000
>> Michael Ellerman  wrote:
>>> Spinning forever seems like a bad idea, but as has been demonstrated at
>>> least twice now, continuing when we don't know the state of the other
>>> CPU can lead to straight up crashes.
>>> 
>>> So I think I'm persuaded that it's preferable to have the kernel stuck
>>> spinning rather than oopsing.
>>> 
>>
>> +1
>>
>>> I'm 50/50 on whether we should have a cond_resched() in the loop. My
>>> first instinct is no, if we're stuck here for 20s a stack trace would be
>>> good. But then we will probably hit that on some big and/or heavily
>>> loaded machine.
>>> 
>>> So possibly we should call cond_resched() but have some custom logic in
>>> the loop to print a warning if we are stuck for more than some
>>> sufficiently long amount of time.
>>
>> How long should that be ?
>
> Yeah good question.
>
> I guess step one would be seeing how long it can take on the 384 vcpu
> machine. And we can probably test on some other big machines.
>
> Hopefully Nathan can give us some idea of how long he's seen it take on
> large systems? I know he was concerned about the 20s timeout of the
> softlockup detector.

Maybe I'm not quite clear what this is referring to, but I don't think
stop-self/query-stopped-state latency increases with processor count, at
least not on PowerVM. And IIRC I was concerned with the earlier patch's
potential for causing the softlockup watchdog to rightly complain by
polling the stopped state without ever scheduling away.

The fact that smp_query_cpu_stopped() kind of collapses the two distinct
results from the query-cpu-stopped-state RTAS call into one return value
may make it harder than necessary to reason about the questions around
cond_resched() and whether to warn.

Sorry to pull this stunt but I have had some code sitting in a neglected
branch that I think gets the logic around this right.

What we should have is a simple C wrapper for the RTAS call that reflects the
architected inputs and outputs:


(-- rtas.c --)

/**
 * rtas_query_cpu_stopped_state() - Call RTAS query-cpu-stopped-state.
 * @hwcpu: Identifies the processor thread to be queried.
 * @status: Pointer to status, valid only on success.
 *
 * Determine whether the given processor thread is in the stopped
 * state.  If successful and @status is non-NULL, the thread's status
 * is stored to @status.
 *
 * Return:
 * * 0   - Success
 * * -1  - Hardware error
 * * -2  - Busy, try again later
 */
int rtas_query_cpu_stopped_state(unsigned int hwcpu, unsigned int *status)
{
        unsigned int cpu_status;
        int token;
        int fwrc;

        token = rtas_token("query-cpu-stopped-state");

        fwrc = rtas_call(token, 1, 2, &cpu_status, hwcpu);
        if (fwrc != 0)
                goto out;

        if (status != NULL)
                *status = cpu_status;
out:
        return fwrc;
}



And then a utility function that waits for the remote thread to enter
stopped state, with higher-level logic for rescheduling and warning. The
fact that smp_query_cpu_stopped() currently does not handle a -2/busy
status is a bug, fixed below by using rtas_busy_delay(). Note the
justification for the explicit cond_resched() in the outer loop:


(-- rtas.h --)

/* query-cpu-stopped-state CPU_status */
#define RTAS_QCSS_STATUS_STOPPED 0
#define RTAS_QCSS_STATUS_IN_PROGRESS 1
#define RTAS_QCSS_STATUS_NOT_STOPPED 2

(-- pseries/hotplug-cpu.c --)

/**
 * wait_for_cpu_stopped() - Wait for a cpu to enter RTAS stopped state.
 */
static void wait_for_cpu_stopped(unsigned int cpu)
{
        unsigned int status;
        unsigned int hwcpu;

        hwcpu = get_hard_smp_processor_id(cpu);

        do {
                int fwrc;

                /*
                 * rtas_busy_delay() will yield only if RTAS returns a
                 * busy status. Since query-cpu-stopped-state can
                 * yield RTAS_QCSS_STATUS_IN_PROGRESS or
                 * RTAS_QCSS_STATUS_NOT_STOPPED for an unbounded
                 * period before the target thread stops, we must take
                 * care to explicitly reschedule while polling.
                 */
                cond_resched();

                do {
                        fwrc = rtas_query_cpu_stopped_state(hwcpu, &status);
                } while (rtas_busy_delay(fwrc));

                if (fwrc == 0)
                        continue;

                pr_err_ratelimited("query-cpu-stopped-state for thread 0x%x returned %d\n",
                                   hwcpu, fwrc);
                goto out;

        } while (status == RTAS_QCSS_STATUS_NOT_STOPPED ||
                 status == RTAS_QCSS_STATUS_IN_PROGRESS);

        if (status != RTAS_QCSS_STATUS_STOPPED) {
   

Re: [PATCH] powerpc/pseries/hotplug-cpu: increase wait time for vCPU death

2020-08-06 Thread Michael Ellerman
Michael Roth  writes:
> Quoting Michael Roth (2020-08-04 23:37:32)
>> Quoting Michael Ellerman (2020-08-04 22:07:08)
>> > Greg Kurz  writes:
>> > > On Tue, 04 Aug 2020 23:35:10 +1000
>> > > Michael Ellerman  wrote:
>> > >> Spinning forever seems like a bad idea, but as has been demonstrated at
>> > >> least twice now, continuing when we don't know the state of the other
>> > >> CPU can lead to straight up crashes.
>> > >> 
>> > >> So I think I'm persuaded that it's preferable to have the kernel stuck
>> > >> spinning rather than oopsing.
>> > >> 
>> > >
>> > > +1
>> > >
>> > >> I'm 50/50 on whether we should have a cond_resched() in the loop. My
>> > >> first instinct is no, if we're stuck here for 20s a stack trace would be
>> > >> good. But then we will probably hit that on some big and/or heavily
>> > >> loaded machine.
>> > >> 
>> > >> So possibly we should call cond_resched() but have some custom logic in
>> > >> the loop to print a warning if we are stuck for more than some
>> > >> sufficiently long amount of time.
>> > >
>> > > How long should that be ?
>> > 
>> > Yeah good question.
>> > 
>> > I guess step one would be seeing how long it can take on the 384 vcpu
>> > machine. And we can probably test on some other big machines.
>> > 
>> > Hopefully Nathan can give us some idea of how long he's seen it take on
>> > large systems? I know he was concerned about the 20s timeout of the
>> > softlockup detector.
>> > 
>> > Maybe a minute or two?
>> 
>> Hmm, so I took a stab at this where I called cond_resched() after
>> every 5 seconds of polling and printed a warning at the same time (FWIW
>> that doesn't seem to trigger any warnings on a loaded 96-core mihawk
>> system using KVM running the 384vcpu unplug loop)
>> 
>> But it sounds like that's not quite what you had in mind. How frequently
>> do you think we should call cond_resched()? Maybe after 25 iterations
>> of polling smp_query_cpu_stopped() to keep original behavior somewhat
>> similar?

I think we can just call it on every iteration, it should be cheap
compared to an RTAS call.

The concern was just that by doing so you effectively prevent the
softlockup detector from reporting you as stuck in that path. Hence the
desire to manually print a warning after ~60s or something.
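As a rough userspace sketch of that combined policy (all names are hypothetical and time is modelled as one counter tick per poll; in the kernel the yield would be cond_resched() and the deadline would be jiffies-based):

```c
#include <stdbool.h>
#include <stdio.h>

/*
 * Illustrative only: yield on every iteration of the query loop
 * (cond_resched() in-kernel, modelled here as a counter) and emit a
 * single warning once the wait exceeds a long deadline.
 */
struct wait_stats {
        long polls;     /* iterations of the query loop */
        long yields;    /* stand-in for cond_resched() calls */
        bool warned;    /* warning printed once past the deadline */
};

static void poll_iteration(struct wait_stats *s, long warn_after_polls)
{
        s->polls++;
        s->yields++;    /* yield every iteration: cheap next to an RTAS call */
        if (!s->warned && s->polls >= warn_after_polls) {
                fprintf(stderr, "still waiting on CPU stop after %ld polls\n",
                        s->polls);
                s->warned = true;       /* one-shot warning, keep polling */
        }
}
```

The point of the shape is that the softlockup detector stays quiet (we always yield) while the manual deadline restores the diagnostic it would otherwise have provided.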

cheers


Re: [PATCH] powerpc/pseries/hotplug-cpu: increase wait time for vCPU death

2020-08-05 Thread Michael Roth
Quoting Michael Roth (2020-08-05 17:29:28)
> Quoting Michael Roth (2020-08-04 23:37:32)
> > Quoting Michael Ellerman (2020-08-04 22:07:08)
> > > Greg Kurz  writes:
> > > > On Tue, 04 Aug 2020 23:35:10 +1000
> > > > Michael Ellerman  wrote:
> > > >> Spinning forever seems like a bad idea, but as has been demonstrated at
> > > >> least twice now, continuing when we don't know the state of the other
> > > >> CPU can lead to straight up crashes.
> > > >> 
> > > >> So I think I'm persuaded that it's preferable to have the kernel stuck
> > > >> spinning rather than oopsing.
> > > >> 
> > > >
> > > > +1
> > > >
> > > >> I'm 50/50 on whether we should have a cond_resched() in the loop. My
> > > >> first instinct is no, if we're stuck here for 20s a stack trace would be
> > > >> good. But then we will probably hit that on some big and/or heavily
> > > >> loaded machine.
> > > >> 
> > > >> So possibly we should call cond_resched() but have some custom logic in
> > > >> the loop to print a warning if we are stuck for more than some
> > > >> sufficiently long amount of time.
> > > >
> > > > How long should that be ?
> > > 
> > > Yeah good question.
> > > 
> > > I guess step one would be seeing how long it can take on the 384 vcpu
> > > machine. And we can probably test on some other big machines.
> > > 
> > > Hopefully Nathan can give us some idea of how long he's seen it take on
> > > large systems? I know he was concerned about the 20s timeout of the
> > > softlockup detector.
> > > 
> > > Maybe a minute or two?
> > 
> > Hmm, so I took a stab at this where I called cond_resched() after
> > every 5 seconds of polling and printed a warning at the same time (FWIW
> > that doesn't seem to trigger any warnings on a loaded 96-core mihawk
> > system using KVM running the 384vcpu unplug loop)
> > 
> > But it sounds like that's not quite what you had in mind. How frequently
> > do you think we should call cond_resched()? Maybe after 25 iterations
> > of polling smp_query_cpu_stopped() to keep original behavior somewhat
> > similar?
> > 
> > I'll let the current patch run on the mihawk system overnight in the
> > meantime so we at least have that data point, but would be good to
> > know what things look like on a large pHyp machine.
> 
> At one point I did manage to get the system in a state where unplug
> operations were taking 1-2s, but still not enough to trigger any
> 5s warning, and I wasn't able to reproduce that in subsequent runs.
> 
> I also tried reworking the patch so that we print a warning and
> cond_resched() after 200 ms to make sure that path gets executed, but
> only managed to trigger the warning twice after a few hours.
> 
> So, if we print a warning after a couple minutes, that seems pretty
> conservative as far as avoiding spurious warnings. And if we
> cond_resched() after 25 loops of polling (~0.1 ms in the cases

~0.1 seconds I mean

> that caused the original crash), that would avoid most of the
> default RCU/lockup warnings.
> 
> But having a second timeout to trigger the cond_resched() after some
> set interval like 2s seems more deterministic since we're less
> susceptible to longer delays due to things like the RTAS calls
> contending for QEMU's global mutex in the KVM case.
> 
> 
> > 
> > Thanks!
> > 
> > > 
> > > >> > Fixes: eac1e731b59ee ("powerpc/xive: guest exploitation of the XIVE interrupt controller")
> > > >> > Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=1856588
> > > >> 
> > > >> This is not public.
> > > >
> > > > I'll have a look at changing that.
> > > 
> > > Thanks.
> > > 
> > > cheers


Re: [PATCH] powerpc/pseries/hotplug-cpu: increase wait time for vCPU death

2020-08-05 Thread Michael Roth
Quoting Michael Roth (2020-08-04 23:37:32)
> Quoting Michael Ellerman (2020-08-04 22:07:08)
> > Greg Kurz  writes:
> > > On Tue, 04 Aug 2020 23:35:10 +1000
> > > Michael Ellerman  wrote:
> > >> Spinning forever seems like a bad idea, but as has been demonstrated at
> > >> least twice now, continuing when we don't know the state of the other
> > >> CPU can lead to straight up crashes.
> > >> 
> > >> So I think I'm persuaded that it's preferable to have the kernel stuck
> > >> spinning rather than oopsing.
> > >> 
> > >
> > > +1
> > >
> > >> I'm 50/50 on whether we should have a cond_resched() in the loop. My
> > >> first instinct is no, if we're stuck here for 20s a stack trace would be
> > >> good. But then we will probably hit that on some big and/or heavily
> > >> loaded machine.
> > >> 
> > >> So possibly we should call cond_resched() but have some custom logic in
> > >> the loop to print a warning if we are stuck for more than some
> > >> sufficiently long amount of time.
> > >
> > > How long should that be ?
> > 
> > Yeah good question.
> > 
> > I guess step one would be seeing how long it can take on the 384 vcpu
> > machine. And we can probably test on some other big machines.
> > 
> > Hopefully Nathan can give us some idea of how long he's seen it take on
> > large systems? I know he was concerned about the 20s timeout of the
> > softlockup detector.
> > 
> > Maybe a minute or two?
> 
> Hmm, so I took a stab at this where I called cond_resched() after
> every 5 seconds of polling and printed a warning at the same time (FWIW
> that doesn't seem to trigger any warnings on a loaded 96-core mihawk
> system using KVM running the 384vcpu unplug loop)
> 
> But it sounds like that's not quite what you had in mind. How frequently
> do you think we should call cond_resched()? Maybe after 25 iterations
> of polling smp_query_cpu_stopped() to keep original behavior somewhat
> similar?
> 
> I'll let the current patch run on the mihawk system overnight in the
> meantime so we at least have that data point, but would be good to
> know what things look like on a large pHyp machine.

At one point I did manage to get the system in a state where unplug
operations were taking 1-2s, but still not enough to trigger any
5s warning, and I wasn't able to reproduce that in subsequent runs.

I also tried reworking the patch so that we print a warning and
cond_resched() after 200 ms to make sure that path gets executed, but
only managed to trigger the warning twice after a few hours.

So, if we print a warning after a couple minutes, that seems pretty
conservative as far as avoiding spurious warnings. And if we
cond_resched() after 25 loops of polling (~0.1 ms in the cases
that caused the original crash), that would avoid most of the
default RCU/lockup warnings.

But having a second timeout to trigger the cond_resched() after some
set interval like 2s seems more deterministic since we're less
susceptible to longer delays due to things like the RTAS calls
contending for QEMU's global mutex in the KVM case.


> 
> Thanks!
> 
> > 
> > >> > Fixes: eac1e731b59ee ("powerpc/xive: guest exploitation of the XIVE interrupt controller")
> > >> > Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=1856588
> > >> 
> > >> This is not public.
> > >
> > > I'll have a look at changing that.
> > 
> > Thanks.
> > 
> > cheers


Re: [PATCH] powerpc/pseries/hotplug-cpu: increase wait time for vCPU death

2020-08-04 Thread Michael Roth
Quoting Michael Ellerman (2020-08-04 22:07:08)
> Greg Kurz  writes:
> > On Tue, 04 Aug 2020 23:35:10 +1000
> > Michael Ellerman  wrote:
> >> There is a bit of history to this code, but not in a good way :)
> >> 
> >> Michael Roth  writes:
> >> > For a power9 KVM guest with XIVE enabled, running a test loop
> >> > where we hotplug 384 vcpus and then unplug them, the following traces
> >> > can be seen (generally within a few loops) either from the unplugged
> >> > vcpu:
> >> >
> >> >   [ 1767.353447] cpu 65 (hwid 65) Ready to die...
> >> >   [ 1767.952096] Querying DEAD? cpu 66 (66) shows 2
> >> >   [ 1767.952311] list_del corruption. next->prev should be c00a02470208, but was c00a02470048
> >> ...
> >> >
> >> > At that point the worker thread assumes the unplugged CPU is in some
> >> > unknown/dead state and proceeds with the cleanup, causing the race with
> >> > the XIVE cleanup code executed by the unplugged CPU.
> >> >
> >> > Fix this by inserting an msleep() after each RTAS call to avoid
> >> 
> >> We previously had an msleep(), but it was removed:
> >> 
> >>   b906cfa397fd ("powerpc/pseries: Fix cpu hotplug")
> >
> > Ah, I hadn't seen that one...
> >
> >> > pseries_cpu_die() returning prematurely, and double the number of
> >> > attempts so we wait at least a total of 5 seconds. While this isn't an
> >> > ideal solution, it is similar to how we dealt with a similar issue for
> >> > cede_offline mode in the past (940ce422a3).
> >> 
> >> Thiago tried to fix this previously but there was a bit of discussion
> >> that didn't quite resolve:
> >> 
> >>   https://lore.kernel.org/linuxppc-dev/20190423223914.3882-1-bauer...@linux.ibm.com/
> >
> > Yeah it appears that the motivation at the time was to make the
> > "Querying DEAD?" messages disappear and to avoid potentially
> > concurrent calls to rtas-stop-self, which is prohibited by PAPR...
> > not fixing actual crashes.
> 
> I'm pretty sure at one point we were triggering crashes *in* RTAS via
> this path, I think that got resolved.
> 
> >> Spinning forever seems like a bad idea, but as has been demonstrated at
> >> least twice now, continuing when we don't know the state of the other
> >> CPU can lead to straight up crashes.
> >> 
> >> So I think I'm persuaded that it's preferable to have the kernel stuck
> >> spinning rather than oopsing.
> >> 
> >
> > +1
> >
> >> I'm 50/50 on whether we should have a cond_resched() in the loop. My
> >> first instinct is no, if we're stuck here for 20s a stack trace would be
> >> good. But then we will probably hit that on some big and/or heavily
> >> loaded machine.
> >> 
> >> So possibly we should call cond_resched() but have some custom logic in
> >> the loop to print a warning if we are stuck for more than some
> >> sufficiently long amount of time.
> >
> > How long should that be ?
> 
> Yeah good question.
> 
> I guess step one would be seeing how long it can take on the 384 vcpu
> machine. And we can probably test on some other big machines.
> 
> Hopefully Nathan can give us some idea of how long he's seen it take on
> large systems? I know he was concerned about the 20s timeout of the
> softlockup detector.
> 
> Maybe a minute or two?

Hmm, so I took a stab at this where I called cond_resched() after
every 5 seconds of polling and printed a warning at the same time (FWIW
that doesn't seem to trigger any warnings on a loaded 96-core mihawk
system using KVM running the 384vcpu unplug loop)

But it sounds like that's not quite what you had in mind. How frequently
do you think we should call cond_resched()? Maybe after 25 iterations
of polling smp_query_cpu_stopped() to keep original behavior somewhat
similar?

I'll let the current patch run on the mihawk system overnight in the
meantime so we at least have that data point, but would be good to
know what things look like on a large pHyp machine.

Thanks!

> 
> >> > Fixes: eac1e731b59ee ("powerpc/xive: guest exploitation of the XIVE interrupt controller")
> >> > Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=1856588
> >> 
> >> This is not public.
> >
> > I'll have a look at changing that.
> 
> Thanks.
> 
> cheers


Re: [PATCH] powerpc/pseries/hotplug-cpu: increase wait time for vCPU death

2020-08-04 Thread Thiago Jung Bauermann


Michael Ellerman  writes:

> Greg Kurz  writes:
>> On Tue, 04 Aug 2020 23:35:10 +1000
>> Michael Ellerman  wrote:
>>> There is a bit of history to this code, but not in a good way :)
>>>
>>> Michael Roth  writes:
>>> > For a power9 KVM guest with XIVE enabled, running a test loop
>>> > where we hotplug 384 vcpus and then unplug them, the following traces
>>> > can be seen (generally within a few loops) either from the unplugged
>>> > vcpu:
>>> >
>>> >   [ 1767.353447] cpu 65 (hwid 65) Ready to die...
>>> >   [ 1767.952096] Querying DEAD? cpu 66 (66) shows 2
>>> >   [ 1767.952311] list_del corruption. next->prev should be c00a02470208, but was c00a02470048
>>> ...
>>> >
>>> > At that point the worker thread assumes the unplugged CPU is in some
>>> > unknown/dead state and proceeds with the cleanup, causing the race with
>>> > the XIVE cleanup code executed by the unplugged CPU.
>>> >
>>> > Fix this by inserting an msleep() after each RTAS call to avoid
>>>
>>> We previously had an msleep(), but it was removed:
>>>
>>>   b906cfa397fd ("powerpc/pseries: Fix cpu hotplug")
>>
>> Ah, I hadn't seen that one...
>>
>>> > pseries_cpu_die() returning prematurely, and double the number of
>>> > attempts so we wait at least a total of 5 seconds. While this isn't an
>>> > ideal solution, it is similar to how we dealt with a similar issue for
>>> > cede_offline mode in the past (940ce422a3).
>>>
>>> Thiago tried to fix this previously but there was a bit of discussion
>>> that didn't quite resolve:
>>>
>>>   https://lore.kernel.org/linuxppc-dev/20190423223914.3882-1-bauer...@linux.ibm.com/
>>
>> Yeah it appears that the motivation at the time was to make the
>> "Querying DEAD?" messages disappear and to avoid potentially
>> concurrent calls to rtas-stop-self, which is prohibited by PAPR...
>> not fixing actual crashes.
>
> I'm pretty sure at one point we were triggering crashes *in* RTAS via
> this path, I think that got resolved.

Yes, pHyp's RTAS now tolerates concurrent calls to stop-self. The
original bug that was reported when I worked on this ended in an RTAS
crash because of this timeout. The crash was fixed then.

--
Thiago Jung Bauermann
IBM Linux Technology Center


Re: [PATCH] powerpc/pseries/hotplug-cpu: increase wait time for vCPU death

2020-08-04 Thread Michael Ellerman
Greg Kurz  writes:
> On Tue, 04 Aug 2020 23:35:10 +1000
> Michael Ellerman  wrote:
>> There is a bit of history to this code, but not in a good way :)
>> 
>> Michael Roth  writes:
>> > For a power9 KVM guest with XIVE enabled, running a test loop
>> > where we hotplug 384 vcpus and then unplug them, the following traces
>> > can be seen (generally within a few loops) either from the unplugged
>> > vcpu:
>> >
>> >   [ 1767.353447] cpu 65 (hwid 65) Ready to die...
>> >   [ 1767.952096] Querying DEAD? cpu 66 (66) shows 2
>> >   [ 1767.952311] list_del corruption. next->prev should be c00a02470208, but was c00a02470048
>> ...
>> >
>> > At that point the worker thread assumes the unplugged CPU is in some
>> > unknown/dead state and proceeds with the cleanup, causing the race with
>> > the XIVE cleanup code executed by the unplugged CPU.
>> >
>> > Fix this by inserting an msleep() after each RTAS call to avoid
>> 
>> We previously had an msleep(), but it was removed:
>> 
>>   b906cfa397fd ("powerpc/pseries: Fix cpu hotplug")
>
> Ah, I hadn't seen that one...
>
>> > pseries_cpu_die() returning prematurely, and double the number of
>> > attempts so we wait at least a total of 5 seconds. While this isn't an
>> > ideal solution, it is similar to how we dealt with a similar issue for
>> > cede_offline mode in the past (940ce422a3).
>> 
>> Thiago tried to fix this previously but there was a bit of discussion
>> that didn't quite resolve:
>> 
>>   https://lore.kernel.org/linuxppc-dev/20190423223914.3882-1-bauer...@linux.ibm.com/
>
> Yeah it appears that the motivation at the time was to make the
> "Querying DEAD?" messages disappear and to avoid potentially
> concurrent calls to rtas-stop-self, which is prohibited by PAPR...
> not fixing actual crashes.

I'm pretty sure at one point we were triggering crashes *in* RTAS via
this path, I think that got resolved.

>> Spinning forever seems like a bad idea, but as has been demonstrated at
>> least twice now, continuing when we don't know the state of the other
>> CPU can lead to straight up crashes.
>> 
>> So I think I'm persuaded that it's preferable to have the kernel stuck
>> spinning rather than oopsing.
>> 
>
> +1
>
>> I'm 50/50 on whether we should have a cond_resched() in the loop. My
>> first instinct is no, if we're stuck here for 20s a stack trace would be
>> good. But then we will probably hit that on some big and/or heavily
>> loaded machine.
>> 
>> So possibly we should call cond_resched() but have some custom logic in
>> the loop to print a warning if we are stuck for more than some
>> sufficiently long amount of time.
>
> How long should that be?

Yeah good question.

I guess step one would be seeing how long it can take on the 384-vcpu
machine. And we can probably test on some other big machines.

Hopefully Nathan can give us some idea of how long he's seen it take on
large systems? I know he was concerned about the 20s timeout of the
softlockup detector.

Maybe a minute or two?

>> > Fixes: eac1e731b59ee ("powerpc/xive: guest exploitation of the XIVE interrupt controller")
>> > Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=1856588
>> 
>> This is not public.
>
> I'll have a look at changing that.

Thanks.

cheers


Re: [PATCH] powerpc/pseries/hotplug-cpu: increase wait time for vCPU death

2020-08-04 Thread Greg Kurz
On Tue, 04 Aug 2020 23:35:10 +1000
Michael Ellerman  wrote:

> Hi Mike,
> 
> There is a bit of history to this code, but not in a good way :)
> 
> Michael Roth  writes:
> > For a power9 KVM guest with XIVE enabled, running a test loop
> > where we hotplug 384 vcpus and then unplug them, the following traces
> > can be seen (generally within a few loops) either from the unplugged
> > vcpu:
> >
> >   [ 1767.353447] cpu 65 (hwid 65) Ready to die...
> >   [ 1767.952096] Querying DEAD? cpu 66 (66) shows 2
> >   [ 1767.952311] list_del corruption. next->prev should be c00a02470208, but was c00a02470048
> ...
> >
> > At that point the worker thread assumes the unplugged CPU is in some
> > unknown/dead state and proceeds with the cleanup, causing the race with
> > the XIVE cleanup code executed by the unplugged CPU.
> >
> > Fix this by inserting an msleep() after each RTAS call to avoid
> 
> We previously had an msleep(), but it was removed:
> 
>   b906cfa397fd ("powerpc/pseries: Fix cpu hotplug")
> 

Ah, I hadn't seen that one...

> > pseries_cpu_die() returning prematurely, and double the number of
> > attempts so we wait at least a total of 5 seconds. While this isn't an
> > ideal solution, it is similar to how we dealt with a similar issue for
> > cede_offline mode in the past (940ce422a3).
> 
> Thiago tried to fix this previously but there was a bit of discussion
> that didn't quite resolve:
> 
>   https://lore.kernel.org/linuxppc-dev/20190423223914.3882-1-bauer...@linux.ibm.com/
> 

Yeah, it appears the motivation at the time was to make the "Querying DEAD?"
messages disappear and to avoid potentially concurrent calls to rtas-stop-self,
which is prohibited by PAPR... not fixing actual crashes.

> 
> Spinning forever seems like a bad idea, but as has been demonstrated at
> least twice now, continuing when we don't know the state of the other
> CPU can lead to straight up crashes.
> 
> So I think I'm persuaded that it's preferable to have the kernel stuck
> spinning rather than oopsing.
> 

+1

> I'm 50/50 on whether we should have a cond_resched() in the loop. My
> first instinct is no, if we're stuck here for 20s a stack trace would be
> good. But then we will probably hit that on some big and/or heavily
> loaded machine.
> 
> So possibly we should call cond_resched() but have some custom logic in
> the loop to print a warning if we are stuck for more than some
> sufficiently long amount of time.
> 

How long should that be?

> 
> > Fixes: eac1e731b59ee ("powerpc/xive: guest exploitation of the XIVE interrupt controller")
> > Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=1856588
> 
> This is not public.
> 

I'll have a look at changing that.

> I tend to trim Bugzilla links from the change log, because I'm not
> convinced they will last forever, but it is good to have them in the
> mail archive.
> 
> cheers
> 

Cheers,

--
Greg

> > Cc: Michael Ellerman 
> > Cc: Cedric Le Goater 
> > Cc: Greg Kurz 
> > Cc: Nathan Lynch 
> > Signed-off-by: Michael Roth 
> > ---
> >  arch/powerpc/platforms/pseries/hotplug-cpu.c | 5 ++---
> >  1 file changed, 2 insertions(+), 3 deletions(-)
> >
> > diff --git a/arch/powerpc/platforms/pseries/hotplug-cpu.c 
> > b/arch/powerpc/platforms/pseries/hotplug-cpu.c
> > index c6e0d8abf75e..3cb172758052 100644
> > --- a/arch/powerpc/platforms/pseries/hotplug-cpu.c
> > +++ b/arch/powerpc/platforms/pseries/hotplug-cpu.c
> > @@ -111,13 +111,12 @@ static void pseries_cpu_die(unsigned int cpu)
> > int cpu_status = 1;
> > unsigned int pcpu = get_hard_smp_processor_id(cpu);
> >  
> > -   for (tries = 0; tries < 25; tries++) {
> > +   for (tries = 0; tries < 50; tries++) {
> > cpu_status = smp_query_cpu_stopped(pcpu);
> > if (cpu_status == QCSS_STOPPED ||
> > cpu_status == QCSS_HARDWARE_ERROR)
> > break;
> > -   cpu_relax();
> > -
> > +   msleep(100);
> > }
> >  
> > if (cpu_status != 0) {
> > -- 
> > 2.17.1



Re: [PATCH] powerpc/pseries/hotplug-cpu: increase wait time for vCPU death

2020-08-04 Thread Michael Ellerman
Hi Mike,

There is a bit of history to this code, but not in a good way :)

Michael Roth  writes:
> For a power9 KVM guest with XIVE enabled, running a test loop
> where we hotplug 384 vcpus and then unplug them, the following traces
> can be seen (generally within a few loops) either from the unplugged
> vcpu:
>
>   [ 1767.353447] cpu 65 (hwid 65) Ready to die...
>   [ 1767.952096] Querying DEAD? cpu 66 (66) shows 2
>   [ 1767.952311] list_del corruption. next->prev should be c00a02470208, but was c00a02470048
...
>
> At that point the worker thread assumes the unplugged CPU is in some
> unknown/dead state and proceeds with the cleanup, causing the race with
> the XIVE cleanup code executed by the unplugged CPU.
>
> Fix this by inserting an msleep() after each RTAS call to avoid

We previously had an msleep(), but it was removed:

  b906cfa397fd ("powerpc/pseries: Fix cpu hotplug")

> pseries_cpu_die() returning prematurely, and double the number of
> attempts so we wait at least a total of 5 seconds. While this isn't an
> ideal solution, it is similar to how we dealt with a similar issue for
> cede_offline mode in the past (940ce422a3).

Thiago tried to fix this previously but there was a bit of discussion
that didn't quite resolve:

  https://lore.kernel.org/linuxppc-dev/20190423223914.3882-1-bauer...@linux.ibm.com/


Spinning forever seems like a bad idea, but as has been demonstrated at
least twice now, continuing when we don't know the state of the other
CPU can lead to straight up crashes.

So I think I'm persuaded that it's preferable to have the kernel stuck
spinning rather than oopsing.

I'm 50/50 on whether we should have a cond_resched() in the loop. My
first instinct is no, if we're stuck here for 20s a stack trace would be
good. But then we will probably hit that on some big and/or heavily
loaded machine.

So possibly we should call cond_resched() but have some custom logic in
the loop to print a warning if we are stuck for more than some
sufficiently long amount of time.


> Fixes: eac1e731b59ee ("powerpc/xive: guest exploitation of the XIVE interrupt controller")
> Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=1856588

This is not public.

I tend to trim Bugzilla links from the change log, because I'm not
convinced they will last forever, but it is good to have them in the
mail archive.

cheers

> Cc: Michael Ellerman 
> Cc: Cedric Le Goater 
> Cc: Greg Kurz 
> Cc: Nathan Lynch 
> Signed-off-by: Michael Roth 
> ---
>  arch/powerpc/platforms/pseries/hotplug-cpu.c | 5 ++---
>  1 file changed, 2 insertions(+), 3 deletions(-)
>
> diff --git a/arch/powerpc/platforms/pseries/hotplug-cpu.c 
> b/arch/powerpc/platforms/pseries/hotplug-cpu.c
> index c6e0d8abf75e..3cb172758052 100644
> --- a/arch/powerpc/platforms/pseries/hotplug-cpu.c
> +++ b/arch/powerpc/platforms/pseries/hotplug-cpu.c
> @@ -111,13 +111,12 @@ static void pseries_cpu_die(unsigned int cpu)
>   int cpu_status = 1;
>   unsigned int pcpu = get_hard_smp_processor_id(cpu);
>  
> - for (tries = 0; tries < 25; tries++) {
> + for (tries = 0; tries < 50; tries++) {
>   cpu_status = smp_query_cpu_stopped(pcpu);
>   if (cpu_status == QCSS_STOPPED ||
>   cpu_status == QCSS_HARDWARE_ERROR)
>   break;
> - cpu_relax();
> -
> + msleep(100);
>   }
>  
>   if (cpu_status != 0) {
> -- 
> 2.17.1


[PATCH] powerpc/pseries/hotplug-cpu: increase wait time for vCPU death

2020-08-03 Thread Michael Roth
For a power9 KVM guest with XIVE enabled, running a test loop
where we hotplug 384 vcpus and then unplug them, the following traces
can be seen (generally within a few loops) either from the unplugged
vcpu:

  [ 1767.353447] cpu 65 (hwid 65) Ready to die...
  [ 1767.952096] Querying DEAD? cpu 66 (66) shows 2
  [ 1767.952311] list_del corruption. next->prev should be c00a02470208, but was c00a02470048
  [ 1767.952322] [ cut here ]
  [ 1767.952323] kernel BUG at lib/list_debug.c:56!
  [ 1767.952325] Oops: Exception in kernel mode, sig: 5 [#1]
  [ 1767.952326] LE SMP NR_CPUS=2048 NUMA pSeries
  [ 1767.952328] Modules linked in: fuse nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 
nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct 
nf_tables_set nft_chain_nat_ipv6 nf_nat_ipv6 nft_chain_route_ipv6 
nft_chain_nat_ipv4 nf_nat_ipv4 nf_nat nf_conntrack nf_defrag_ipv6 
nf_defrag_ipv4 nft_chain_route_ipv4 ip6_tables nft_compat ip_set nf_tables 
nfnetlink uio_pdrv_genirq ip_tables xfs libcrc32c sd_mod sg xts vmx_crypto 
virtio_net net_failover failover virtio_scsi dm_multipath dm_mirror 
dm_region_hash dm_log dm_mod be2iscsi bnx2i cnic uio cxgb4i cxgb4 libcxgbi 
libcxgb qla4xxx iscsi_boot_sysfs iscsi_tcp libiscsi_tcp libiscsi 
scsi_transport_iscsi
  [ 1767.952352] CPU: 66 PID: 0 Comm: swapper/66 Kdump: loaded Not tainted 
4.18.0-221.el8.ppc64le #1
  [ 1767.952354] NIP:  c07ab50c LR: c07ab508 CTR: 
03ac
  [ 1767.952355] REGS: c009e5a17840 TRAP: 0700   Not tainted  
(4.18.0-221.el8.ppc64le)
  [ 1767.952355] MSR:  8282b033   CR: 
28000842  XER: 2004
  [ 1767.952360] CFAR: c01ffe64 IRQMASK: 1
  [ 1767.952360] GPR00: c07ab508 c009e5a17ac0 c1ac0700 
0054
  [ 1767.952360] GPR04: c009f056cf90 c009f05f5628 c009ed40d654 
c1c90700
  [ 1767.952360] GPR08: 0007 c009f0573980 0009ef2b 
7562202c38303230
  [ 1767.952360] GPR12:  c007fff6ce80 c00a02470208 

  [ 1767.952360] GPR16: 0001 c009f05fbb00 0800 
c009ff3d4980
  [ 1767.952360] GPR20: c009f05fbb10 5deadbeef100 5deadbeef200 
00187961
  [ 1767.952360] GPR24: c009e5a17b78  0001 

  [ 1767.952360] GPR28: c00a02470200 c009f05fbb10 c009f05fbb10 

  [ 1767.952375] NIP [c07ab50c] __list_del_entry_valid+0xac/0x100
  [ 1767.952376] LR [c07ab508] __list_del_entry_valid+0xa8/0x100
  [ 1767.952377] Call Trace:
  [ 1767.952378] [c009e5a17ac0] [c07ab508] 
__list_del_entry_valid+0xa8/0x100 (unreliable)
  [ 1767.952381] [c009e5a17b20] [c0476e18] 
free_pcppages_bulk+0x1f8/0x940
  [ 1767.952383] [c009e5a17c20] [c047a9a0] 
free_unref_page+0xd0/0x100
  [ 1767.952386] [c009e5a17c50] [c00bc2a8] 
xive_spapr_cleanup_queue+0x148/0x1b0
  [ 1767.952388] [c009e5a17cf0] [c00b96dc] 
xive_teardown_cpu+0x1bc/0x240
  [ 1767.952391] [c009e5a17d30] [c010bf28] 
pseries_mach_cpu_die+0x78/0x2f0
  [ 1767.952393] [c009e5a17de0] [c005c8d8] cpu_die+0x48/0x70
  [ 1767.952394] [c009e5a17e00] [c0021cf0] 
arch_cpu_idle_dead+0x20/0x40
  [ 1767.952397] [c009e5a17e20] [c01b4294] do_idle+0x2f4/0x4c0
  [ 1767.952399] [c009e5a17ea0] [c01b46a8] 
cpu_startup_entry+0x38/0x40
  [ 1767.952400] [c009e5a17ed0] [c005c43c] 
start_secondary+0x7bc/0x8f0
  [ 1767.952403] [c009e5a17f90] [c000ac70] 
start_secondary_prolog+0x10/0x14

or on the worker thread handling the unplug:

  [ 1538.253044] pseries-hotplug-cpu: Attempting to remove CPU , drc index: 113a
  [ 1538.360259] Querying DEAD? cpu 314 (314) shows 2
  [ 1538.360736] BUG: Bad page state in process kworker/u768:3  pfn:95de1
  [ 1538.360746] cpu 314 (hwid 314) Ready to die...
  [ 1538.360784] page:c00a02577840 refcount:0 mapcount:-128 
mapping: index:0x0
  [ 1538.361881] flags: 0x5c()
  [ 1538.361908] raw: 005c 5deadbeef100 5deadbeef200 

  [ 1538.361955] raw:   ff7f 

  [ 1538.362002] page dumped because: nonzero mapcount
  [ 1538.362033] Modules linked in: kvm xt_CHECKSUM ipt_MASQUERADE xt_conntrack 
ipt_REJECT nft_counter nf_nat_tftp nft_objref nf_conntrack_tftp tun bridge stp 
llc nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet 
nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nf_tables_set nft_chain_nat 
nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip6_tables nft_compat ip_set 
nf_tables nfnetlink sunrpc xts vmx_crypto ip_tables xfs libcrc32c sd_mod sg 
virtio_net net_failover virtio_scsi failover dm_mirror dm_region_hash dm_log 
dm_mod
  [ 1538.362613] CPU: 0 PID: 548 Comm: kworker/u768:3 Kdump: loaded Not tainted