Re: Regression: unable to boot after commit bd9240a18edf ("x86/apic: Add TSC_DEADLINE quirk due to errata") - Surface Pro 4 SKL
On Wed, Dec 20, 2017 at 10:08:21PM +0800, Zhang Rui wrote: > > Does this help? > > > No. Bah!, does this at least get you a IRQ0 line in /proc/interrupts? > > diff --git a/arch/x86/kernel/time.c b/arch/x86/kernel/time.c > > index 749d189f8cd4..45675072771c 100644 > > --- a/arch/x86/kernel/time.c > > +++ b/arch/x86/kernel/time.c > > @@ -69,8 +69,6 @@ static struct irqaction irq0 = { > > > > static void __init setup_default_timer_irq(void) > > { > > - if (!nr_legacy_irqs()) > > - return; > > setup_irq(0, &irq0); > > } > >
Re: Regression: unable to boot after commit bd9240a18edf ("x86/apic: Add TSC_DEADLINE quirk due to errata") - Surface Pro 4 SKL
On Tue, 2017-12-19 at 18:23 +0100, Peter Zijlstra wrote: > On Tue, Dec 19, 2017 at 05:01:55PM +0100, Peter Zijlstra wrote: > > > > On Tue, Dec 19, 2017 at 11:42:41PM +0800, Zhang Rui wrote: > > > > > > On Tue, 2017-12-19 at 16:23 +0100, Peter Zijlstra wrote: > > > > > > > > > > > [0.00] ACPI: INT_SRC_OVR (bus 0 bus_irq 0 global_irq 2 > > > > dfl dfl) > > > > [0.00] ACPI: IRQ0 used by override. > > > > > > > > So your ACPI table has an override for IRQ2 and routes it to > > > > IRQ0. > > ^ this > > > > > > > > > > > > > The HPET document says: > > > > > > > > If the ENABLE_CNF bit and the LEG_RT_CNF bit are both set, > > > > then the > > > > interrupts will be routed as follows: > > > > > > > > Timer 0 will be routed to IRQ0 in Non-APIC or IRQ2 in the > > > > I/O APIC > > > But AFAICS, the HPET emulated timer interrupts goes to IRQ0 > > Right, so see that ACPI override, that routes I/O APIC IRQ2 to > > IRQ0, or > > it _should_. > > > > Clearly something is messed up here.. but I've no idea what. That > > whole > > IRQ routing stuff is confusing. > Does this help? > No. thanks, rui > diff --git a/arch/x86/kernel/time.c b/arch/x86/kernel/time.c > index 749d189f8cd4..45675072771c 100644 > --- a/arch/x86/kernel/time.c > +++ b/arch/x86/kernel/time.c > @@ -69,8 +69,6 @@ static struct irqaction irq0 = { > > static void __init setup_default_timer_irq(void) > { > - if (!nr_legacy_irqs()) > - return; > setup_irq(0, &irq0); > } >
Re: Regression: unable to boot after commit bd9240a18edf ("x86/apic: Add TSC_DEADLINE quirk due to errata") - Surface Pro 4 SKL
On Tue, Dec 19, 2017 at 05:01:55PM +0100, Peter Zijlstra wrote: > On Tue, Dec 19, 2017 at 11:42:41PM +0800, Zhang Rui wrote: > > On Tue, 2017-12-19 at 16:23 +0100, Peter Zijlstra wrote: > > > > [0.00] ACPI: INT_SRC_OVR (bus 0 bus_irq 0 global_irq 2 dfl dfl) > > > [0.00] ACPI: IRQ0 used by override. > > > > > > So your ACPI table has an override for IRQ2 and routes it to IRQ0. > > ^ this > > > > The HPET document says: > > > > > > If the ENABLE_CNF bit and the LEG_RT_CNF bit are both set, then the > > > interrupts will be routed as follows: > > > > > > Timer 0 will be routed to IRQ0 in Non-APIC or IRQ2 in the I/O APIC > > > > But AFAICS, the HPET emulated timer interrupts goes to IRQ0 > > Right, so see that ACPI override, that routes I/O APIC IRQ2 to IRQ0, or > it _should_. > > Clearly something is messed up here.. but I've no idea what. That whole > IRQ routing stuff is confusing. Does this help? diff --git a/arch/x86/kernel/time.c b/arch/x86/kernel/time.c index 749d189f8cd4..45675072771c 100644 --- a/arch/x86/kernel/time.c +++ b/arch/x86/kernel/time.c @@ -69,8 +69,6 @@ static struct irqaction irq0 = { static void __init setup_default_timer_irq(void) { - if (!nr_legacy_irqs()) - return; setup_irq(0, &irq0); }
Re: Regression: unable to boot after commit bd9240a18edf ("x86/apic: Add TSC_DEADLINE quirk due to errata") - Surface Pro 4 SKL
On Tue, Dec 19, 2017 at 11:42:41PM +0800, Zhang Rui wrote: > On Tue, 2017-12-19 at 16:23 +0100, Peter Zijlstra wrote: > > [0.00] ACPI: INT_SRC_OVR (bus 0 bus_irq 0 global_irq 2 dfl dfl) > > [0.00] ACPI: IRQ0 used by override. > > > > So your ACPI table has an override for IRQ2 and routes it to IRQ0. ^ this > > The HPET document says: > > > > If the ENABLE_CNF bit and the LEG_RT_CNF bit are both set, then the > > interrupts will be routed as follows: > > > > Timer 0 will be routed to IRQ0 in Non-APIC or IRQ2 in the I/O APIC > > But AFAICS, the HPET emulated timer interrupts goes to IRQ0 Right, so see that ACPI override, that routes I/O APIC IRQ2 to IRQ0, or it _should_. Clearly something is messed up here.. but I've no idea what. That whole IRQ routing stuff is confusing.
Re: Regression: unable to boot after commit bd9240a18edf ("x86/apic: Add TSC_DEADLINE quirk due to errata") - Surface Pro 4 SKL
On Tue, 2017-12-19 at 16:31 +0100, Peter Zijlstra wrote: > On Tue, Dec 19, 2017 at 04:23:07PM +0100, Peter Zijlstra wrote: > > > > On Tue, Dec 19, 2017 at 06:48:24PM +0800, Zhang Rui wrote: > > > > > > On Mon, 2017-12-18 at 21:28 +0100, Peter Zijlstra wrote: > > > > > > > > Hi, can you see if this makes you Surface boot? > > > > > > > No, it does not boot. > > So I'm confused on the lapic calibration. > > > > That stuff uses global_clock_event, which is initially the i8253 > > (PIT), > > but because !PIC this thing won't be there either on your platform. > > > > Then we initialize I/O APIC, and your machine has: > > > > [0.00] IOAPIC[0]: apic_id 2, version 32, address > > 0xfec0, GSI 0-119 > > [0.00] ACPI: INT_SRC_OVR (bus 0 bus_irq 0 global_irq 2 dfl > > dfl) > > [0.00] ACPI: INT_SRC_OVR (bus 0 bus_irq 9 global_irq 9 high > > level) > > [0.00] ACPI: IRQ0 used by override. > > [0.00] ACPI: IRQ9 used by override. > > > > So your ACPI table has an override for IRQ2 and routes it to IRQ0. > > > > Then we initialize HPET, and we _always_ do > > hpet_enable_legacy_int(), > > which sets the LegacyRouting bit. The HPET document says: > > > > If the ENABLE_CNF bit and the LEG_RT_CNF bit are both set, then > > the > > interrupts will be routed as follows: > > > > Timer 0 will be routed to IRQ0 in Non-APIC or IRQ2 in the I/O > > APIC > > Timer 1 will be routed to IRQ8 in Non-APIC or IRQ8 in the I/O > > APIC > > Timer 2-n will be routed as per the routing in the timer n > > config registers. > > > > If the LegacyReplacement Route bit is set, the individual routing > > bits > > for timers 0 and 1 (APIC or FSB) will have no impact. > > > > And then we set global_clock_event to &hpet_clockevent. > > > > At this point that _SHOULD_ work afaict, even without actual PIC > > present. > > > > Sometime after that we call into calibrate_APIC_clock() -- because > > !TSC_DEADLINE -- and this is where you get stuck, because > > global_clock_event is not in fact delivering interrupts. > > > > Thomas may have more clue, we'll have to wait for him to have a > > time-slot available. > What does your /proc/interrupts look like (on a tsc-deadline boot) ? > > 0: is the HPET, LOC: is the lapic/tsc-deadline > > On my SKL desktop I get 21 PIT/HPET ticks on CPU0 before lapic takes > over. No irq0. LOC: 220071 210079 184892 176494 Local timer interrupts thanks, rui
Re: Regression: unable to boot after commit bd9240a18edf ("x86/apic: Add TSC_DEADLINE quirk due to errata") - Surface Pro 4 SKL
On Tue, 2017-12-19 at 16:23 +0100, Peter Zijlstra wrote: > On Tue, Dec 19, 2017 at 06:48:24PM +0800, Zhang Rui wrote: > > > > On Mon, 2017-12-18 at 21:28 +0100, Peter Zijlstra wrote: > > > > > > Hi, can you see if this makes you Surface boot? > > > > > No, it does not boot. > So I'm confused on the lapic calibration. > > That stuff uses global_clock_event, which is initially the i8253 > (PIT), > but because !PIC this thing won't be there either on your platform. > > Then we initialize I/O APIC, and your machine has: > > [0.00] IOAPIC[0]: apic_id 2, version 32, address 0xfec0, > GSI 0-119 > [0.00] ACPI: INT_SRC_OVR (bus 0 bus_irq 0 global_irq 2 dfl > dfl) > [0.00] ACPI: INT_SRC_OVR (bus 0 bus_irq 9 global_irq 9 high > level) > [0.00] ACPI: IRQ0 used by override. > [0.00] ACPI: IRQ9 used by override. > > So your ACPI table has an override for IRQ2 and routes it to IRQ0. > > Then we initialize HPET, and we _always_ do hpet_enable_legacy_int(), > which sets the LegacyRouting bit. Right. > The HPET document says: > > If the ENABLE_CNF bit and the LEG_RT_CNF bit are both set, then the > interrupts will be routed as follows: > > Timer 0 will be routed to IRQ0 in Non-APIC or IRQ2 in the I/O > APIC But AFAICS, the HPET emulated timer interrupts goes to IRQ0 on all the machines I have tested, but on this MS Surface Pro 4, there is no irq 0 row in /proc/interrupts. $ cat /proc/interrupts CPU0 CPU1 CPU2 CPU3 8: 0 1 0 0 IR-IO-APIC8- edge rtc0 9:476144 4573132 IR-IO-APIC9- fasteoi acpi 14: 0 0 0 0 IR-IO-APIC 14- edge INT344B:00 16:646 4971 63574 3973 IR-IO-APIC 16- fasteoi idma64.0, MRVL_PCIE, i2c_designware.0 17: 0 0 0 0 IR-IO-APIC 17- fasteoi idma64.1, i2c_designware.1 18: 0 0 0 0 IR-IO-APIC 18- fasteoi idma64.2, i2c_designware.2 19: 0 0 0 0 IR-IO-APIC 19- fasteoi idma64.3, i2c_designware.3 Maybe I need to check if IRQ0 is overridden on other platforms, if no, registering to IRQ2 for HPET Timer 0 could help in this case? > Timer 1 will be routed to IRQ8 in Non-APIC or IRQ8 in the I/O > APIC > Timer 2-n will be routed as per the routing in the timer n config > registers. > > If the LegacyReplacement Route bit is set, the individual routing > bits > for timers 0 and 1 (APIC or FSB) will have no impact. > > And then we set global_clock_event to &hpet_clockevent. Yes, I can confirm this. > > At this point that _SHOULD_ work afaict, even without actual PIC > present. so IOAPIC is ready when we calibrating Lapic timer? > > Sometime after that we call into calibrate_APIC_clock() -- because > !TSC_DEADLINE -- and this is where you get stuck, because > global_clock_event is not in fact delivering interrupts. right. > > Thomas may have more clue, we'll have to wait for him to have a > time-slot available. okay. thanks, rui
Re: Regression: unable to boot after commit bd9240a18edf ("x86/apic: Add TSC_DEADLINE quirk due to errata") - Surface Pro 4 SKL
On Tue, Dec 19, 2017 at 04:23:07PM +0100, Peter Zijlstra wrote: > On Tue, Dec 19, 2017 at 06:48:24PM +0800, Zhang Rui wrote: > > On Mon, 2017-12-18 at 21:28 +0100, Peter Zijlstra wrote: > > > Hi, can you see if this makes you Surface boot? > > > > > No, it does not boot. > > So I'm confused on the lapic calibration. > > That stuff uses global_clock_event, which is initially the i8253 (PIT), > but because !PIC this thing won't be there either on your platform. > > Then we initialize I/O APIC, and your machine has: > > [0.00] IOAPIC[0]: apic_id 2, version 32, address 0xfec0, GSI 0-119 > [0.00] ACPI: INT_SRC_OVR (bus 0 bus_irq 0 global_irq 2 dfl dfl) > [0.00] ACPI: INT_SRC_OVR (bus 0 bus_irq 9 global_irq 9 high level) > [0.00] ACPI: IRQ0 used by override. > [0.00] ACPI: IRQ9 used by override. > > So your ACPI table has an override for IRQ2 and routes it to IRQ0. > > Then we initialize HPET, and we _always_ do hpet_enable_legacy_int(), > which sets the LegacyRouting bit. The HPET document says: > > If the ENABLE_CNF bit and the LEG_RT_CNF bit are both set, then the > interrupts will be routed as follows: > > Timer 0 will be routed to IRQ0 in Non-APIC or IRQ2 in the I/O APIC > Timer 1 will be routed to IRQ8 in Non-APIC or IRQ8 in the I/O APIC > Timer 2-n will be routed as per the routing in the timer n config > registers. > > If the LegacyReplacement Route bit is set, the individual routing bits > for timers 0 and 1 (APIC or FSB) will have no impact. > > And then we set global_clock_event to &hpet_clockevent. > > At this point that _SHOULD_ work afaict, even without actual PIC > present. > > Sometime after that we call into calibrate_APIC_clock() -- because > !TSC_DEADLINE -- and this is where you get stuck, because > global_clock_event is not in fact delivering interrupts. > > Thomas may have more clue, we'll have to wait for him to have a > time-slot available. What does your /proc/interrupts look like (on a tsc-deadline boot) ? 0: is the HPET, LOC: is the lapic/tsc-deadline On my SKL desktop I get 21 PIT/HPET ticks on CPU0 before lapic takes over.
Re: Regression: unable to boot after commit bd9240a18edf ("x86/apic: Add TSC_DEADLINE quirk due to errata") - Surface Pro 4 SKL
On Tue, Dec 19, 2017 at 06:48:24PM +0800, Zhang Rui wrote: > On Mon, 2017-12-18 at 21:28 +0100, Peter Zijlstra wrote: > > Hi, can you see if this makes you Surface boot? > > > No, it does not boot. So I'm confused on the lapic calibration. That stuff uses global_clock_event, which is initially the i8253 (PIT), but because !PIC this thing won't be there either on your platform. Then we initialize I/O APIC, and your machine has: [0.00] IOAPIC[0]: apic_id 2, version 32, address 0xfec0, GSI 0-119 [0.00] ACPI: INT_SRC_OVR (bus 0 bus_irq 0 global_irq 2 dfl dfl) [0.00] ACPI: INT_SRC_OVR (bus 0 bus_irq 9 global_irq 9 high level) [0.00] ACPI: IRQ0 used by override. [0.00] ACPI: IRQ9 used by override. So your ACPI table has an override for IRQ2 and routes it to IRQ0. Then we initialize HPET, and we _always_ do hpet_enable_legacy_int(), which sets the LegacyRouting bit. The HPET document says: If the ENABLE_CNF bit and the LEG_RT_CNF bit are both set, then the interrupts will be routed as follows: Timer 0 will be routed to IRQ0 in Non-APIC or IRQ2 in the I/O APIC Timer 1 will be routed to IRQ8 in Non-APIC or IRQ8 in the I/O APIC Timer 2-n will be routed as per the routing in the timer n config registers. If the LegacyReplacement Route bit is set, the individual routing bits for timers 0 and 1 (APIC or FSB) will have no impact. And then we set global_clock_event to &hpet_clockevent. At this point that _SHOULD_ work afaict, even without actual PIC present. Sometime after that we call into calibrate_APIC_clock() -- because !TSC_DEADLINE -- and this is where you get stuck, because global_clock_event is not in fact delivering interrupts. Thomas may have more clue, we'll have to wait for him to have a time-slot available.
Re: Regression: unable to boot after commit bd9240a18edf ("x86/apic: Add TSC_DEADLINE quirk due to errata") - Surface Pro 4 SKL
On Tue, 2017-12-19 at 14:15 +0100, Peter Zijlstra wrote: > On Tue, Dec 19, 2017 at 06:48:24PM +0800, Zhang Rui wrote: > > > > On Mon, 2017-12-18 at 21:28 +0100, Peter Zijlstra wrote: > > > > > > Hi, can you see if this makes you Surface boot? > > > > > No, it does not boot. > Bah, staring at the lapic calibrate now, that is a bit of a mess.. > > > > > > > > > I tested it on my IVB by making has_legacy_pic() return > > > unconditional > > > true. > > > > > > [0.024000] tsc: Unable to calibrate against PIT > > > [0.025000] tsc: using HPET reference calibration > > > [0.026000] tsc: Detected 2792.451 MHz processor > > > > > > --- > > > > > > > > > diff --git a/arch/x86/include/asm/i8259.h > > > b/arch/x86/include/asm/i8259.h > > > index c8376b40e882..e2cfc4b52ee4 100644 > > > --- a/arch/x86/include/asm/i8259.h > > > +++ b/arch/x86/include/asm/i8259.h > > > @@ -69,6 +69,11 @@ struct legacy_pic { > > > extern struct legacy_pic *legacy_pic; > > > extern struct legacy_pic null_legacy_pic; > > > > > > +static inline bool has_legacy_pic(void) > > > +{ > > > + return legacy_pic == &null_legacy_pic; > > > +} > > > + > > shouldn't this be > > return legacy_pic == &default_legacy_pic; > > ? > != &null, but yes, I mess that up. I see. I have changed that and the platform can not boot neither. thanks, rui
Re: Regression: unable to boot after commit bd9240a18edf ("x86/apic: Add TSC_DEADLINE quirk due to errata") - Surface Pro 4 SKL
On Tue, Dec 19, 2017 at 06:48:24PM +0800, Zhang Rui wrote: > On Mon, 2017-12-18 at 21:28 +0100, Peter Zijlstra wrote: > > Hi, can you see if this makes you Surface boot? > > > No, it does not boot. Bah, staring at the lapic calibrate now, that is a bit of a mess.. > > I tested it on my IVB by making has_legacy_pic() return unconditional > > true. > > > > [0.024000] tsc: Unable to calibrate against PIT > > [0.025000] tsc: using HPET reference calibration > > [0.026000] tsc: Detected 2792.451 MHz processor > > > > --- > > > > > > diff --git a/arch/x86/include/asm/i8259.h > > b/arch/x86/include/asm/i8259.h > > index c8376b40e882..e2cfc4b52ee4 100644 > > --- a/arch/x86/include/asm/i8259.h > > +++ b/arch/x86/include/asm/i8259.h > > @@ -69,6 +69,11 @@ struct legacy_pic { > > extern struct legacy_pic *legacy_pic; > > extern struct legacy_pic null_legacy_pic; > > > > +static inline bool has_legacy_pic(void) > > +{ > > + return legacy_pic == &null_legacy_pic; > > +} > > + > shouldn't this be > return legacy_pic == &default_legacy_pic; > ? != &null, but yes, I mess that up.
Re: Regression: unable to boot after commit bd9240a18edf ("x86/apic: Add TSC_DEADLINE quirk due to errata") - Surface Pro 4 SKL
On Mon, 2017-12-18 at 21:28 +0100, Peter Zijlstra wrote: > Hi, can you see if this makes you Surface boot? > No, it does not boot. > I tested it on my IVB by making has_legacy_pic() return unconditional > true. > > [0.024000] tsc: Unable to calibrate against PIT > [0.025000] tsc: using HPET reference calibration > [0.026000] tsc: Detected 2792.451 MHz processor > > --- > > > diff --git a/arch/x86/include/asm/i8259.h > b/arch/x86/include/asm/i8259.h > index c8376b40e882..e2cfc4b52ee4 100644 > --- a/arch/x86/include/asm/i8259.h > +++ b/arch/x86/include/asm/i8259.h > @@ -69,6 +69,11 @@ struct legacy_pic { > extern struct legacy_pic *legacy_pic; > extern struct legacy_pic null_legacy_pic; > > +static inline bool has_legacy_pic(void) > +{ > + return legacy_pic == &null_legacy_pic; > +} > + shouldn't this be return legacy_pic == &default_legacy_pic; ? thanks, rui > static inline int nr_legacy_irqs(void) > { > return legacy_pic->nr_legacy_irqs; > diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c > index 8ea117f8142e..2afc623b2280 100644 > --- a/arch/x86/kernel/tsc.c > +++ b/arch/x86/kernel/tsc.c > @@ -25,6 +25,7 @@ > #include > #include > #include > +#include > > unsigned int __read_mostly cpu_khz; /* TSC clocks / usec, not > used here */ > EXPORT_SYMBOL(cpu_khz); > @@ -363,6 +364,15 @@ static unsigned long pit_calibrate_tsc(u32 > latch, unsigned long ms, int loopmin) > unsigned long tscmin, tscmax; > int pitcnt; > > + if (!has_legacy_pic()) { > + udelay(10 * USEC_PER_MSEC); > + udelay(10 * USEC_PER_MSEC); > + udelay(10 * USEC_PER_MSEC); > + udelay(10 * USEC_PER_MSEC); > + udelay(10 * USEC_PER_MSEC); > + return ULONG_MAX; > + } > + > /* Set the Gate high, disable speaker */ > outb((inb(0x61) & ~0x02) | 0x01, 0x61); > > @@ -487,6 +497,9 @@ static unsigned long quick_pit_calibrate(void) > u64 tsc, delta; > unsigned long d1, d2; > > + if (!has_legacy_pic()) > + return 0; > + > /* Set the Gate high, disable speaker */ > outb((inb(0x61) & ~0x02) | 0x01, 0x61); >
Re: Regression: unable to boot after commit bd9240a18edf ("x86/apic: Add TSC_DEADLINE quirk due to errata") - Surface Pro 4 SKL
Hi, can you see if this makes you Surface boot? I tested it on my IVB by making has_legacy_pic() return unconditional true. [0.024000] tsc: Unable to calibrate against PIT [0.025000] tsc: using HPET reference calibration [0.026000] tsc: Detected 2792.451 MHz processor --- diff --git a/arch/x86/include/asm/i8259.h b/arch/x86/include/asm/i8259.h index c8376b40e882..e2cfc4b52ee4 100644 --- a/arch/x86/include/asm/i8259.h +++ b/arch/x86/include/asm/i8259.h @@ -69,6 +69,11 @@ struct legacy_pic { extern struct legacy_pic *legacy_pic; extern struct legacy_pic null_legacy_pic; +static inline bool has_legacy_pic(void) +{ + return legacy_pic == &null_legacy_pic; +} + static inline int nr_legacy_irqs(void) { return legacy_pic->nr_legacy_irqs; diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c index 8ea117f8142e..2afc623b2280 100644 --- a/arch/x86/kernel/tsc.c +++ b/arch/x86/kernel/tsc.c @@ -25,6 +25,7 @@ #include #include #include +#include unsigned int __read_mostly cpu_khz;/* TSC clocks / usec, not used here */ EXPORT_SYMBOL(cpu_khz); @@ -363,6 +364,15 @@ static unsigned long pit_calibrate_tsc(u32 latch, unsigned long ms, int loopmin) unsigned long tscmin, tscmax; int pitcnt; + if (!has_legacy_pic()) { + udelay(10 * USEC_PER_MSEC); + udelay(10 * USEC_PER_MSEC); + udelay(10 * USEC_PER_MSEC); + udelay(10 * USEC_PER_MSEC); + udelay(10 * USEC_PER_MSEC); + return ULONG_MAX; + } + /* Set the Gate high, disable speaker */ outb((inb(0x61) & ~0x02) | 0x01, 0x61); @@ -487,6 +497,9 @@ static unsigned long quick_pit_calibrate(void) u64 tsc, delta; unsigned long d1, d2; + if (!has_legacy_pic()) + return 0; + /* Set the Gate high, disable speaker */ outb((inb(0x61) & ~0x02) | 0x01, 0x61);
Re: Regression: unable to boot after commit bd9240a18edf ("x86/apic: Add TSC_DEADLINE quirk due to errata") - Surface Pro 4 SKL
On Tue, 2017-11-28 at 13:36 +0100, Peter Zijlstra wrote: > On Tue, Nov 28, 2017 at 06:59:01PM +0800, Zhang Rui wrote: > > > > > > > > > > > > > > > > > > > > > > > > > > My Surface Pro 4 is unable to boot after 4.12. The symptom > > > > > > is > > > > yes. Tried 4.4 distro and 4.12 vanilla kernel, kernel always > > freezes > > with boot option "notscdeadline"/"lapic=notscdeadline". > Then for some mysterious reason, your Surface thing has a borked > LAPIC. > I agree. But however, on this platform, forcing a cpu microcode upgrade after upgrading kernel does not seem like a good solution to users. Thus I still like to dig into the problem further to see if we can workaround this "regression" in software. As the kernel freezes at a very early stage, do you have any advice on how to narrow down the problem? thanks, rui
Re: Regression: unable to boot after commit bd9240a18edf ("x86/apic: Add TSC_DEADLINE quirk due to errata") - Surface Pro 4 SKL
On Tue, Nov 28, 2017 at 06:59:01PM +0800, Zhang Rui wrote: > > > > > My Surface Pro 4 is unable to boot after 4.12. The symptom is > yes. Tried 4.4 distro and 4.12 vanilla kernel, kernel always freezes > with boot option "notscdeadline"/"lapic=notscdeadline". Then for some mysterious reason, your Surface thing has a borked LAPIC. No idea how or why..
Re: Regression: unable to boot after commit bd9240a18edf ("x86/apic: Add TSC_DEADLINE quirk due to errata") - Surface Pro 4 SKL
On Tue, 2017-11-28 at 10:34 +0100, Thomas Gleixner wrote: > On Tue, 28 Nov 2017, Zhang Rui wrote: > > > > > On Tue, 2017-11-28 at 09:14 +0100, Peter Zijlstra wrote: > > > > > > On Tue, Nov 28, 2017 at 10:08:53AM +0800, Zhang Rui wrote: > > > > > > > > > > > > Hi, All, > > > > > > > > My Surface Pro 4 is unable to boot after 4.12. The symptom is > > > > that > > > > kernel freezes during boot, and the last message in the screen > > > > is > > > > loading the initrd image. And I have bisected it to this commit > > > > > > > > +// DEADLINE_MODEL_MATCH_REV ( > > > > INTEL_FAM6_SKYLAKE_MOBILE, > > > > 0xb2), > > > And what microcode version do you run? Have you installed the > > > latest > > > microcode package and updated your initrd to include it? My > > > skylake > > > is > > > running 0xba. > > No, I didn't upgrade my microcode. > > > > $ cat /proc/cpuinfo > > ... > > processor : 0 > > vendor_id : GenuineIntel > > cpu family : 6 > > model : 78 > > model name : Intel(R) Core(TM) i7-6650U CPU @ 2.20GHz > > stepping: 3 > > microcode : 0x9e > > ... > > > > I suppose the problem should be gone if I upgrade the microcode. > > But the real problem to me is that the system FREEZES after kernel > > upgrade. > Confused. So the match disables the deadline timer due to 0x9e < > 0xb2. And > not having deadline timer makes the boot fail despite the fact that > deadline timer is borked with microcode < 0xb2. > > Can you verify by adding 'lapic=notscdeadline' to the command line of > a > 'working' kernel? That should cause a freeze as well then. > yes. Tried 4.4 distro and 4.12 vanilla kernel, kernel always freezes with boot option "notscdeadline"/"lapic=notscdeadline". thanks, rui > Thanks, > > tglx
Re: Regression: unable to boot after commit bd9240a18edf ("x86/apic: Add TSC_DEADLINE quirk due to errata") - Surface Pro 4 SKL
On Tue, Nov 28, 2017 at 04:22:15PM +0800, Zhang Rui wrote: > On Tue, 2017-11-28 at 09:14 +0100, Peter Zijlstra wrote: > > On Tue, Nov 28, 2017 at 10:08:53AM +0800, Zhang Rui wrote: > > > > > > Hi, All, > > > > > > My Surface Pro 4 is unable to boot after 4.12. The symptom is that > > > kernel freezes during boot, and the last message in the screen is > > > loading the initrd image. And I have bisected it to this commit > > > > > > +// DEADLINE_MODEL_MATCH_REV ( INTEL_FAM6_SKYLAKE_MOBILE, > > > 0xb2), > > And what microcode version do you run? Have you installed the latest > > microcode package and updated your initrd to include it? My skylake > > is > > running 0xba. > > No, I didn't upgrade my microcode. > > $ cat /proc/cpuinfo > ... > processor : 0 > vendor_id : GenuineIntel > cpu family: 6 > model : 78 > model name: Intel(R) Core(TM) i7-6650U CPU @ 2.20GHz > stepping : 3 > microcode : 0x9e > ... > > I suppose the problem should be gone if I upgrade the microcode. > But the real problem to me is that the system FREEZES after kernel > upgrade. Yes, that is weird and unexpected. The patch basically kills TSC_DEADLINE usage on older microcode. This makes the kernel then use the old LAPIC timer mode. Does your machine boot with "notscdeadline" ? Which _should_ be the same.
Re: Regression: unable to boot after commit bd9240a18edf ("x86/apic: Add TSC_DEADLINE quirk due to errata") - Surface Pro 4 SKL
On Tue, 28 Nov 2017, Zhang Rui wrote: > On Tue, 2017-11-28 at 09:14 +0100, Peter Zijlstra wrote: > > On Tue, Nov 28, 2017 at 10:08:53AM +0800, Zhang Rui wrote: > > > > > > Hi, All, > > > > > > My Surface Pro 4 is unable to boot after 4.12. The symptom is that > > > kernel freezes during boot, and the last message in the screen is > > > loading the initrd image. And I have bisected it to this commit > > > > > > +// DEADLINE_MODEL_MATCH_REV ( INTEL_FAM6_SKYLAKE_MOBILE, > > > 0xb2), > > And what microcode version do you run? Have you installed the latest > > microcode package and updated your initrd to include it? My skylake > > is > > running 0xba. > > No, I didn't upgrade my microcode. > > $ cat /proc/cpuinfo > ... > processor : 0 > vendor_id : GenuineIntel > cpu family: 6 > model : 78 > model name: Intel(R) Core(TM) i7-6650U CPU @ 2.20GHz > stepping : 3 > microcode : 0x9e > ... > > I suppose the problem should be gone if I upgrade the microcode. > But the real problem to me is that the system FREEZES after kernel > upgrade. Confused. So the match disables the deadline timer due to 0x9e < 0xb2. And not having deadline timer makes the boot fail despite the fact that deadline timer is borked with microcode < 0xb2. Can you verify by adding 'lapic=notscdeadline' to the command line of a 'working' kernel? That should cause a freeze as well then. Thanks, tglx
Re: Regression: unable to boot after commit bd9240a18edf ("x86/apic: Add TSC_DEADLINE quirk due to errata") - Surface Pro 4 SKL
On Tue, 2017-11-28 at 09:14 +0100, Peter Zijlstra wrote: > On Tue, Nov 28, 2017 at 10:08:53AM +0800, Zhang Rui wrote: > > > > Hi, All, > > > > My Surface Pro 4 is unable to boot after 4.12. The symptom is that > > kernel freezes during boot, and the last message in the screen is > > loading the initrd image. And I have bisected it to this commit > > > > +// DEADLINE_MODEL_MATCH_REV ( INTEL_FAM6_SKYLAKE_MOBILE, > > 0xb2), > And what microcode version do you run? Have you installed the latest > microcode package and updated your initrd to include it? My skylake > is > running 0xba. No, I didn't upgrade my microcode. $ cat /proc/cpuinfo ... processor : 0 vendor_id : GenuineIntel cpu family : 6 model : 78 model name : Intel(R) Core(TM) i7-6650U CPU @ 2.20GHz stepping: 3 microcode : 0x9e ... I suppose the problem should be gone if I upgrade the microcode. But the real problem to me is that the system FREEZES after kernel upgrade. thanks, rui
Re: Regression: unable to boot after commit bd9240a18edf ("x86/apic: Add TSC_DEADLINE quirk due to errata") - Surface Pro 4 SKL
On Tue, Nov 28, 2017 at 10:08:53AM +0800, Zhang Rui wrote: > Hi, All, > > My Surface Pro 4 is unable to boot after 4.12. The symptom is that > kernel freezes during boot, and the last message in the screen is > loading the initrd image. And I have bisected it to this commit > +// DEADLINE_MODEL_MATCH_REV ( INTEL_FAM6_SKYLAKE_MOBILE, 0xb2), And what microcode version do you run? Have you installed the latest microcode package and updated your initrd to include it? My skylake is running 0xba.
Regression: unable to boot after commit bd9240a18edf ("x86/apic: Add TSC_DEADLINE quirk due to errata") - Surface Pro 4 SKL
Hi, All, My Surface Pro 4 is unable to boot after 4.12. The symptom is that kernel freezes during boot, and the last message in the screen is loading the initrd image. And I have bisected it to this commit commit bd9240a18edfbfa72e957fc2ba831cf1f13ea073 (refs/bisect/bad) Author: Peter Zijlstra AuthorDate: Wed May 31 17:52:03 2017 +0200 Commit: Thomas Gleixner CommitDate: Sun Jun 4 21:55:53 2017 +0200 x86/apic: Add TSC_DEADLINE quirk due to errata Due to errata it is possible for the TSC_DEADLINE timer to misbehave after using TSC_ADJUST. A microcode update is available to fix this situation. Avoid using the TSC_DEADLINE timer if it is affected by this issue and report the required microcode version. [ tglx: Renamed function to apic_check_deadline_errata() ] Signed-off-by: Peter Zijlstra (Intel) Cc: kevin.b.stan...@intel.com Link: http://lkml.kernel.org/r/20170531155306.050849877@infradead.o rg Signed-off-by: Thomas Gleixner Currently, I'm using v4.14 kernel with the following workaround on top. --- arch/x86/kernel/apic/apic.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c index ff89177..cd419d4 100644 --- a/arch/x86/kernel/apic/apic.c +++ b/arch/x86/kernel/apic/apic.c @@ -596,7 +596,7 @@ static const struct x86_cpu_id deadline_match[] = { DEADLINE_MODEL_MATCH_REV ( INTEL_FAM6_BROADWELL_CORE, 0x25), DEADLINE_MODEL_MATCH_REV ( INTEL_FAM6_BROADWELL_GT3E, 0x17), - DEADLINE_MODEL_MATCH_REV ( INTEL_FAM6_SKYLAKE_MOBILE, 0xb2), +// DEADLINE_MODEL_MATCH_REV ( INTEL_FAM6_SKYLAKE_MOBILE, 0xb2), DEADLINE_MODEL_MATCH_REV ( INTEL_FAM6_SKYLAKE_DESKTOP, 0xb2), DEADLINE_MODEL_MATCH_REV ( INTEL_FAM6_KABYLAKE_MOBILE, 0x52), -- 2.7.4 I'm reading the related code but have not figured out the root cause yet. Anyone suggestions? thanks, rui