Re: Linux 5.3-rc7

2019-09-09 Thread Bandan Das
Linus Torvalds  writes:

> On Sat, Sep 7, 2019 at 12:17 PM Linus Torvalds
>  wrote:
>>
>> I'm really not clear on why it's a good idea to clear the LDR bits on
>> shutdown, and commit 558682b52919 ("x86/apic: Include the LDR when
>> clearing out APIC registers") just looks pointless. And now it has
>> proven to break some machines.
>>
>> So why wouldn't we just revert it?
>
> Side note: looking around for the discussion about this patch, at
> least one version of the patch from Bandan had
>
> +   if (!x2apic_enabled) {
>
> rather than
>
> +   if (!x2apic_enabled()) {
>

I believe this crept up by accident when I was preparing the series, my testing
was with x2apic_enabled() but I didn't test CPU hotplug - only the kdump path
with 32 bit guest. In hindsight, I should have been more careful with testing,
sorry about that.

Bandan

> which meant that whatever Bandan tested at that point was actually a
> complete no-op, since "!x2apic_enabled" is never true (it tests a
> function pointer against NULL, which it won't be).
>
> Then that was fixed by the time it hit -tip (and eventually my tree),
> but it kind of shows how the patch history of this is all
> questionable. Further strengthened by a quote from that discussion:
>
>  "this is really a KVM bug but it doesn't hurt to clear out the LDR in
> the guest and then, it wouldn't need a hypervisor fix"
>
> and clearly it *does* hurt to clear the LDR in the guest, making the
> whole thinking behind the patch wrong and broken. The kernel clearly
> _does_ depend on LDR having the right contents.
>
> Now, I still suspect the boot problem then comes from our
> cpu0_logical_apicid use mentioned in that previous email, but at this
> point I think the proper fix is "revert for now, and we can look at
> this as a cleanup with the cpu0_logical_apicid thing for 5.4 instead".
>
> Hmm?
>
>Linus


Re: Linux 5.3-rc7

2019-09-08 Thread Sasha Levin

On Sat, Sep 07, 2019 at 02:13:22PM -0700, Linus Torvalds wrote:

On Sat, Sep 7, 2019 at 1:44 PM Thomas Gleixner  wrote:


That's what I just replied to Chris. Can you do it right away or should I queue 
it up?


Done.


I'd like to bring back a discussion we had last year on ksummit-discuss:
https://lists.linuxfoundation.org/pipermail/ksummit-discuss/2018-May/005122.html
. I've pointed out that some of the commits that go in the -rc cycles
are of low quality and are untested, you seemed to agree but said that
it's "by-design" because late -rc cycle commits are more complex.

Is this commit and it's fallout really how our development process
should be working?

This commit was rushed through the process: it was authored and merged
into -tip of the same day, and pulled in by you just a few days later.
There was no meaningful time for review, testing, or really any sort of
QA.

We really do have a better story now for catching the sort of issues
introduced by these patch: multiple CI systems tripped on this, but
people still need the time to look into it, make sure that the failure
is real and bisect it.

What was the rush in making it skip all of our safeguards? The "bug" has
been there forever, the fix isn't urgent, and no one seemed to care for
quite a while.

Even if this patch was fixing a bug introduced in this merge window, is
the tradeoff around rushing an untested fix worth it vs giving it more
time and shipping it as part of our stable tree?

I'm not trying to pick on this patch in particular - I feel that this is
a systematic issue and should be addressed as part of our process.

--
Thanks,
Sasha


Re: Linux 5.3-rc7

2019-09-07 Thread Linus Torvalds
On Sat, Sep 7, 2019 at 1:44 PM Thomas Gleixner  wrote:
>
> That's what I just replied to Chris. Can you do it right away or should I 
> queue it up?

Done.

Thanks,
  Linus


Re: Linux 5.3-rc7

2019-09-07 Thread Thomas Gleixner
On Sat, 7 Sep 2019, Linus Torvalds wrote:
> On Sat, Sep 7, 2019 at 8:00 AM Thomas Gleixner  wrote:
> 
> So why wouldn't we just revert it?

That's what I just replied to Chris. Can you do it right away or should I queue 
it up?

Thanks,

tglx


Re: Linux 5.3-rc7

2019-09-07 Thread Thomas Gleixner
On Sat, 7 Sep 2019, Chris Wilson wrote:
> Quoting Thomas Gleixner (2019-09-07 16:00:17)
> > Does this only happen with that CPU0 hotplug stuff enabled or on CPUs other
> > than CPU0 as well? That hotplug CPU0 stuff is a bandaid so I wouldn't be
> > surprised if we broke that somehow.
> 
> If I ignore cpu0 in that test and so use
> 
> [  133.847187] smpboot: CPU 1 is now offline
> [  134.861861] x86: Booting SMP configuration:
> [  134.861875] smpboot: Booting Node 0 Processor 1 APIC 0x2
> [  134.880218] smpboot: CPU 2 is now offline
> [  135.893806] smpboot: Booting Node 0 Processor 2 APIC 0x1
> [  135.935115] smpboot: CPU 3 is now offline
> [  136.949760] smpboot: Booting Node 0 Processor 3 APIC 0x3
> 
> that has run for 10 minutes without failure, so it seems confined to
> cpu0 hotplugging. All we are doing in the test to generate the hotplugs
> is:

Right, but you also have that config bit enabled which allows CPU0 hotplug
which usually is off even in testing and that's why nobody noticed so far.

So I looked at that code and I know why it's broken. I guess we'll end up
reverting that commit for now as fixing it proper will be not just a one
liner.

Thanks for providing all the information!


   tglx


Re: Linux 5.3-rc7

2019-09-07 Thread Linus Torvalds
On Sat, Sep 7, 2019 at 12:17 PM Linus Torvalds
 wrote:
>
> I'm really not clear on why it's a good idea to clear the LDR bits on
> shutdown, and commit 558682b52919 ("x86/apic: Include the LDR when
> clearing out APIC registers") just looks pointless. And now it has
> proven to break some machines.
>
> So why wouldn't we just revert it?

Side note: looking around for the discussion about this patch, at
least one version of the patch from Bandan had

+   if (!x2apic_enabled) {

rather than

+   if (!x2apic_enabled()) {

which meant that whatever Bandan tested at that point was actually a
complete no-op, since "!x2apic_enabled" is never true (it tests a
function pointer against NULL, which it won't be).

Then that was fixed by the time it hit -tip (and eventually my tree),
but it kind of shows how the patch history of this is all
questionable. Further strengthened by a quote from that discussion:

 "this is really a KVM bug but it doesn't hurt to clear out the LDR in
the guest and then, it wouldn't need a hypervisor fix"

and clearly it *does* hurt to clear the LDR in the guest, making the
whole thinking behind the patch wrong and broken. The kernel clearly
_does_ depend on LDR having the right contents.

Now, I still suspect the boot problem then comes from our
cpu0_logical_apicid use mentioned in that previous email, but at this
point I think the proper fix is "revert for now, and we can look at
this as a cleanup with the cpu0_logical_apicid thing for 5.4 instead".

Hmm?

   Linus


Re: Linux 5.3-rc7

2019-09-07 Thread Linus Torvalds
On Sat, Sep 7, 2019 at 8:00 AM Thomas Gleixner  wrote:
>
> Ok let me find a testbox to figure out whats wrong there.

Honestly, it looks like we should just revert that commit, since we
never used to clear the LDR bits before either, and the bug it "fixes"
doesn't really seem to be a bug (well, it's a bug in KVM, but that's a
different thing).

And I wouldn't be at all surprised if it confuses some BIOS code.

We use the LDR bits ourselves in smp_get_logical_apicid(), and so
clearing them out seems entirely bogus.

At a guess, it's wakeup_cpu_via_init_nmi() that does that

if (apic->dest_logical == APIC_DEST_LOGICAL)
id = cpu0_logical_apicid;
else
id = apicid;

and now that we've cleared the APIC LDR bits, we no longer wake the
BSP. We send the NMI to the _old_ APIC ID, but we've overwritten it
with 0 when we put it to sleep, so now nothing happens.

I'm really not clear on why it's a good idea to clear the LDR bits on
shutdown, and commit 558682b52919 ("x86/apic: Include the LDR when
clearing out APIC registers") just looks pointless. And now it has
proven to break some machines.

So why wouldn't we just revert it?

 Linus


Re: Linux 5.3-rc7

2019-09-07 Thread Chris Wilson
Quoting Thomas Gleixner (2019-09-07 16:00:17)
> Does this only happen with that CPU0 hotplug stuff enabled or on CPUs other
> than CPU0 as well? That hotplug CPU0 stuff is a bandaid so I wouldn't be
> surprised if we broke that somehow.

If I ignore cpu0 in that test and so use

[  133.847187] smpboot: CPU 1 is now offline
[  134.861861] x86: Booting SMP configuration:
[  134.861875] smpboot: Booting Node 0 Processor 1 APIC 0x2
[  134.880218] smpboot: CPU 2 is now offline
[  135.893806] smpboot: Booting Node 0 Processor 2 APIC 0x1
[  135.935115] smpboot: CPU 3 is now offline
[  136.949760] smpboot: Booting Node 0 Processor 3 APIC 0x3

that has run for 10 minutes without failure, so it seems confined to
cpu0 hotplugging. All we are doing in the test to generate the hotplugs
is:

for (int cpu = 0;; cpu++) {
char name[128];
int cpufd;

snprintf(name, sizeof(name),
 "/sys/devices/system/cpu/cpu%d/online",
 cpu), sizeof(name));
cpufd = open(name, O_WRONLY);
if (cpufd < 0)
break;

write(cpufd, "0", 2);
usleep(1e6);
write(cpufd, "1", 2);

close(cpufd);
}

-Chris


Re: Linux 5.3-rc7

2019-09-07 Thread Thomas Gleixner
On Sat, 7 Sep 2019, Chris Wilson wrote:
> Quoting Thomas Gleixner (2019-09-07 15:29:19)
> > On Sat, 7 Sep 2019, Chris Wilson wrote:
> > > Quoting Linus Torvalds (2019-09-02 18:28:26)
> > > > Bandan Das:
> > > >   x86/apic: Include the LDR when clearing out APIC registers
> > > 
> > > Apologies if this is known already, I'm way behind on email.
> > > 
> > > I've bisected
> > > 
> > > [   18.693846] smpboot: CPU 0 is now offline
> > > [   19.707737] smpboot: Booting Node 0 Processor 0 APIC 0x0
> > > [   29.707602] smpboot: do_boot_cpu failed(-1) to wakeup CPU#0
> > > 
> > > https://intel-gfx-ci.01.org/tree/drm-tip/igt@perf_...@cpu-hotplug.html
> > > 
> > > to 558682b52919. (Reverts cleanly and fixes the problem.)
> > > 
> > > I'm guessing that this is also behind the suspend failures, missing
> > > /dev/cpu/0/msr, and random perf_event_open() failures we have observed
> > > in our CI since -rc7 across all generations of Intel cpus.
> > 
> > So is this on bare metal or in a VM?
> 
> Our single virtualised piece of kit doesn't support cpu hotplug, so this
> test is not being run. We have failures on
> icl (2019), glk (2017), kbl (2017), bxt (2016), skl (2015),
> bsw (2016), hsw (2013), byt (2013), snb (2011), elk (2008),
> bwr (2006), blb (2007)

Ok let me find a testbox to figure out whats wrong there.

Does this only happen with that CPU0 hotplug stuff enabled or on CPUs other
than CPU0 as well? That hotplug CPU0 stuff is a bandaid so I wouldn't be
surprised if we broke that somehow.

Thanks,

tglx


Re: Linux 5.3-rc7

2019-09-07 Thread Chris Wilson
Quoting Thomas Gleixner (2019-09-07 15:29:19)
> On Sat, 7 Sep 2019, Chris Wilson wrote:
> > Quoting Linus Torvalds (2019-09-02 18:28:26)
> > > Bandan Das:
> > >   x86/apic: Include the LDR when clearing out APIC registers
> > 
> > Apologies if this is known already, I'm way behind on email.
> > 
> > I've bisected
> > 
> > [   18.693846] smpboot: CPU 0 is now offline
> > [   19.707737] smpboot: Booting Node 0 Processor 0 APIC 0x0
> > [   29.707602] smpboot: do_boot_cpu failed(-1) to wakeup CPU#0
> > 
> > https://intel-gfx-ci.01.org/tree/drm-tip/igt@perf_...@cpu-hotplug.html
> > 
> > to 558682b52919. (Reverts cleanly and fixes the problem.)
> > 
> > I'm guessing that this is also behind the suspend failures, missing
> > /dev/cpu/0/msr, and random perf_event_open() failures we have observed
> > in our CI since -rc7 across all generations of Intel cpus.
> 
> So is this on bare metal or in a VM?

Our single virtualised piece of kit doesn't support cpu hotplug, so this
test is not being run. We have failures on
icl (2019), glk (2017), kbl (2017), bxt (2016), skl (2015),
bsw (2016), hsw (2013), byt (2013), snb (2011), elk (2008),
bwr (2006), blb (2007)
-Chris


Re: Linux 5.3-rc7

2019-09-07 Thread Thomas Gleixner
On Sat, 7 Sep 2019, Chris Wilson wrote:
> Quoting Linus Torvalds (2019-09-02 18:28:26)
> > Bandan Das:
> >   x86/apic: Include the LDR when clearing out APIC registers
> 
> Apologies if this is known already, I'm way behind on email.
> 
> I've bisected
> 
> [   18.693846] smpboot: CPU 0 is now offline
> [   19.707737] smpboot: Booting Node 0 Processor 0 APIC 0x0
> [   29.707602] smpboot: do_boot_cpu failed(-1) to wakeup CPU#0
> 
> https://intel-gfx-ci.01.org/tree/drm-tip/igt@perf_...@cpu-hotplug.html
> 
> to 558682b52919. (Reverts cleanly and fixes the problem.)
> 
> I'm guessing that this is also behind the suspend failures, missing
> /dev/cpu/0/msr, and random perf_event_open() failures we have observed
> in our CI since -rc7 across all generations of Intel cpus.

So is this on bare metal or in a VM?

Thanks,

tglx


Re: Linux 5.3-rc7

2019-09-07 Thread Chris Wilson
Quoting Linus Torvalds (2019-09-02 18:28:26)
> Bandan Das:
>   x86/apic: Include the LDR when clearing out APIC registers

Apologies if this is known already, I'm way behind on email.

I've bisected

[   18.693846] smpboot: CPU 0 is now offline
[   19.707737] smpboot: Booting Node 0 Processor 0 APIC 0x0
[   29.707602] smpboot: do_boot_cpu failed(-1) to wakeup CPU#0

https://intel-gfx-ci.01.org/tree/drm-tip/igt@perf_...@cpu-hotplug.html

to 558682b52919. (Reverts cleanly and fixes the problem.)

I'm guessing that this is also behind the suspend failures, missing
/dev/cpu/0/msr, and random perf_event_open() failures we have observed
in our CI since -rc7 across all generations of Intel cpus.
-Chris