>>> On 16.02.18 at 18:46, <igor.druzhi...@citrix.com> wrote:
> We're noticing a reproducible system boot hang on certain
> post-Skylake platforms where the BIOS is configured in
> legacy boot mode with x2APIC disabled. The system stalls
> immediately after writing the first SMP initialization
> sequence into APIC ICR.
> The cause of the problem is watchdog NMI handler execution -
> somewhere near the end of NMI handling (after it's already
> rescheduled the next NMI) it tries to access IO port 0x61
> to get the actual NMI reason on CPU0. Unfortunately, this
> port is emulated by BIOS using SMIs and this emulation for
> some reason takes more time than we expect during INIT-SIPI-SIPI
> sequence. As a result, the system is constantly moving between
> the NMI and SMI handlers and not making any progress.
> To avoid this, initialize the watchdog after SMP bootstrap on
> CPU0 and, additionally, protect the NMI handler by moving
> IO port access before NMI re-scheduling.
Much better, yet what about post-boot onlining of CPUs? I think we
assume this is safe in that case just because at that time we run at
a lower nmi_hz. Might be worthwhile to spell this out above.
> @@ -1714,6 +1714,12 @@ void do_nmi(const struct cpu_user_regs *regs)
> if ( nmi_callback(regs, cpu) )
> + /* This IO port access is likely to produce SMI which, in turn,
> + * may take enough time for the next NMI tick to happen. To avoid having
> + * nested NMIs as the result let's call it before watchdog re-scheduling
Please correct the comment style (/* and */ on their own lines,
full stop after the second sentence). Also, following the earlier
discussion, I don't think "likely" is appropriate - how about "not
impossible"? Also perhaps "do it" instead of "call it" (as you're
talking about a port access, not a function call)?
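
For illustration, a minimal sketch of how the comment and the ordering
might look with the above points applied. The helper names
(read_nmi_reason, watchdog_rearm, handle_nmi) and the trace array are
hypothetical stand-ins so the fragment compiles on its own; they are not
the actual Xen functions, and the comment wording is only one possible
rendering:

```c
#include <stddef.h>

/* Hypothetical trace of the order in which the two steps run. */
enum step { STEP_PORT_READ, STEP_WATCHDOG_REARM };

static enum step trace[4];
static size_t trace_len;

/* Hypothetical stand-in for the port 0x61 read (no real I/O here). */
static unsigned char read_nmi_reason(void)
{
    trace[trace_len++] = STEP_PORT_READ;
    return 0;
}

/* Hypothetical stand-in for re-scheduling the next watchdog NMI. */
static void watchdog_rearm(void)
{
    trace[trace_len++] = STEP_WATCHDOG_REARM;
}

static unsigned char handle_nmi(void)
{
    /*
     * It is not impossible for this IO port access to produce an SMI
     * which, in turn, may take enough time for the next NMI tick to
     * happen. To avoid having nested NMIs as the result let's do it
     * before watchdog re-scheduling.
     */
    unsigned char reason = read_nmi_reason();

    watchdog_rearm();

    return reason;
}
```

The only point of the sketch is the ordering (port access first, re-arm
second) and the comment layout; everything else is scaffolding.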