Re: [Xen-devel] [PATCH] x86/nmi: lower initial watchdog frequency to avoid boot hangs
>>> On 08.02.18 at 13:18,wrote: > We switch the NMI frequency to ~2Hz after the calibration, but that is > after having run the BSP at 100Hz for a long period of time, and the APs > at this rate for a short while. Irrespective of the exact fix here, it > is simply not a good idea to be running with this NMI frequency, other > than possibly during the immediate calibration logic. And I've previously agreed with this general statement. It's just that the so far suggested workaround doesn't really do what you say (and I'm pretty determined that the "possibly" above should be removed). Jan ___ Xen-devel mailing list Xen-devel@lists.xenproject.org https://lists.xenproject.org/mailman/listinfo/xen-devel
Re: [Xen-devel] [PATCH] x86/nmi: lower initial watchdog frequency to avoid boot hangs
On Thu, 8 Feb 2018 15:00:33 + Andrew Cooperwrote: >On 08/02/18 14:37, Alexey G wrote: >> On Thu, 8 Feb 2018 12:40:41 + >> Andrew Cooper wrote: >>> - Perf/Oprofile. This is currently mutually exclusive with Xen >>> using the watchdog, but needn't be and hopefully won't be in the >>> future. Most of the time we deal with watchdog NMIs, while all others should be somewhat rare. The thing is, we actually need to read I/O port 61h on system NMIs only. If the main problem lies in a flow of SMIs due to reading port 61h on every NMI watchdog tick -- why not to avoid reading it? There are at least 2 ways to check if the NMI was due to a watchdog tick: - LAPIC (SDM states that "When a performance monitoring counters interrupt is generated, the mask bit for its associated LVT entry is set") - perf MSR overflow bit So, if we detect it was a NMI due to a watchdog using these methods (early in the NMI handler), we can avoid touching the port 61h and thus triggering SMI I/O trap on it. >>> The problem is having multiple NMIs arriving. Like all other edge >>> triggered interrupts, extra arrivals get dropped. By skipping the >>> 0x61 read if we believe it was a watchdog NMI, we've opened a race >>> condition where we will completely miss the system NMI. >> There shouldn't be any problem I think. NMIs don't need to be cleared >> with EOI and it's a common practice to handle NMIs one-by-one (as a >> NMI handler is not reentrant in a typical scenario). >> >> Execution of SMI doesn't cause a pending (blocked) NMI to get >> dropped, similar mechanisms might be employed for a single NMI which >> arrived in blocked-by-NMI state. Otherwise the whole thing will >> break -- merely handling arbitrary NMI will be enough to miss any >> other NMIs. This is a too obvious flaw. So normally it should be >> just a matter which NMI of two will be serviced first. >> This assumption can be verified empirically by requesting the chipset >> to send an external NMI while serving a watchdog NMI and checking if >> it arrive later on. >NMI handling works just like other interrupts, except that its >equivalent of the ISR/IRR state is hidden. > >One new NMI will become pending while an NMI is in progress (because >there is an IRR bit to be set), but any further will be dropped. > >You can demonstrate this easily by having CPUs or the chipset send >NMIs. One in service with one pending is enough for our scenario. It covers the situation when either watchdog or external NMI being serviced and another (single) NMI arrived. As long as two of these are serviced one after another, there shouldn't be trouble in missing NMIs due to NMI status misunderstanding -- all related status bits are sticky (incl. Mask bit in LAPIC) and and won't go away if two NMIs executed in either order, provided they both executed. ___ Xen-devel mailing list Xen-devel@lists.xenproject.org https://lists.xenproject.org/mailman/listinfo/xen-devel
Re: [Xen-devel] [PATCH] x86/nmi: lower initial watchdog frequency to avoid boot hangs
On 08/02/18 14:37, Alexey G wrote: > On Thu, 8 Feb 2018 12:40:41 + > Andrew Cooperwrote: >> - Perf/Oprofile. This is currently mutually exclusive with Xen using >> the watchdog, but needn't be and hopefully won't be in the future. >> >>> Most of the time we deal with watchdog NMIs, while all others should >>> be somewhat rare. The thing is, we actually need to read I/O port >>> 61h on system NMIs only. >>> >>> If the main problem lies in a flow of SMIs due to reading port 61h on >>> every NMI watchdog tick -- why not to avoid reading it? >>> >>> There are at least 2 ways to check if the NMI was due to a watchdog >>> tick: >>> - LAPIC (SDM states that "When a performance monitoring counters >>> interrupt is generated, the mask bit for its associated LVT entry is >>> set") >>> - perf MSR overflow bit >>> >>> So, if we detect it was a NMI due to a watchdog using these >>> methods (early in the NMI handler), we can avoid touching the port >>> 61h and thus triggering SMI I/O trap on it. >> The problem is having multiple NMIs arriving. Like all other edge >> triggered interrupts, extra arrivals get dropped. By skipping the 0x61 >> read if we believe it was a watchdog NMI, we've opened a race condition >> where we will completely miss the system NMI. > There shouldn't be any problem I think. NMIs don't need to be cleared > with EOI and it's a common practice to handle NMIs one-by-one (as a NMI > handler is not reentrant in a typical scenario). > > Execution of SMI doesn't cause a pending (blocked) NMI to get dropped, > similar mechanisms might be employed for a single NMI which arrived in > blocked-by-NMI state. Otherwise the whole thing will break -- merely > handling arbitrary NMI will be enough to miss any other NMIs. This is a > too obvious flaw. So normally it should be just a matter which NMI of > two will be serviced first. > This assumption can be verified empirically by requesting the chipset > to send an external NMI while serving a watchdog NMI and checking if it > arrive later on. NMI handling works just like other interrupts, except that its equivalent of the ISR/IRR state is hidden. One new NMI will become pending while an NMI is in progress (because there is an IRR bit to be set), but any further will be dropped. You can demonstrate this easily by having CPUs or the chipset send NMIs. ~Andrew ___ Xen-devel mailing list Xen-devel@lists.xenproject.org https://lists.xenproject.org/mailman/listinfo/xen-devel
Re: [Xen-devel] [PATCH] x86/nmi: lower initial watchdog frequency to avoid boot hangs
On Thu, 8 Feb 2018 12:40:41 + Andrew Cooperwrote: >- Perf/Oprofile. This is currently mutually exclusive with Xen using >the watchdog, but needn't be and hopefully won't be in the future. > >> >> Most of the time we deal with watchdog NMIs, while all others should >> be somewhat rare. The thing is, we actually need to read I/O port >> 61h on system NMIs only. >> >> If the main problem lies in a flow of SMIs due to reading port 61h on >> every NMI watchdog tick -- why not to avoid reading it? >> >> There are at least 2 ways to check if the NMI was due to a watchdog >> tick: >> - LAPIC (SDM states that "When a performance monitoring counters >> interrupt is generated, the mask bit for its associated LVT entry is >> set") >> - perf MSR overflow bit >> >> So, if we detect it was a NMI due to a watchdog using these >> methods (early in the NMI handler), we can avoid touching the port >> 61h and thus triggering SMI I/O trap on it. > >The problem is having multiple NMIs arriving. Like all other edge >triggered interrupts, extra arrivals get dropped. By skipping the 0x61 >read if we believe it was a watchdog NMI, we've opened a race condition >where we will completely miss the system NMI. There shouldn't be any problem I think. NMIs don't need to be cleared with EOI and it's a common practice to handle NMIs one-by-one (as a NMI handler is not reentrant in a typical scenario). Execution of SMI doesn't cause a pending (blocked) NMI to get dropped, similar mechanisms might be employed for a single NMI which arrived in blocked-by-NMI state. Otherwise the whole thing will break -- merely handling arbitrary NMI will be enough to miss any other NMIs. This is a too obvious flaw. So normally it should be just a matter which NMI of two will be serviced first. This assumption can be verified empirically by requesting the chipset to send an external NMI while serving a watchdog NMI and checking if it arrive later on. ___ Xen-devel mailing list Xen-devel@lists.xenproject.org https://lists.xenproject.org/mailman/listinfo/xen-devel
Re: [Xen-devel] [PATCH] x86/nmi: lower initial watchdog frequency to avoid boot hangs
On 08/02/18 12:32, Alexey G wrote: > On Thu, 8 Feb 2018 10:47:45 + > Igor Druzhininwrote: >> I've done this measurement before. So what we are seeing exactly is >> that the time we are spending in SMI is spiking (sometimes up to >> 200ms) at the moment we go through INIT-SIPI-SIPI sequence. Looks like >> this is enough to push the system into a livelock spiral. So I agree >> with Jan to some point that the proposed workaround might not be >> working on some systems. > According to the Xen code, NMI expected for 2 primary purposes: > - watchdog NMI from LAPIC > - "system" NMIs (like due to SERR) - Perf/Oprofile. This is currently mutually exclusive with Xen using the watchdog, but needn't be and hopefully won't be in the future. > > Most of the time we deal with watchdog NMIs, while all others should be > somewhat rare. The thing is, we actually need to read I/O port 61h on > system NMIs only. > > If the main problem lies in a flow of SMIs due to reading port 61h on > every NMI watchdog tick -- why not to avoid reading it? > > There are at least 2 ways to check if the NMI was due to a watchdog > tick: > - LAPIC (SDM states that "When a performance monitoring counters > interrupt is generated, the mask bit for its associated LVT entry is > set") > - perf MSR overflow bit > > So, if we detect it was a NMI due to a watchdog using these > methods (early in the NMI handler), we can avoid touching the port 61h > and thus triggering SMI I/O trap on it. The problem is having multiple NMIs arriving. Like all other edge triggered interrupts, extra arrivals get dropped. By skipping the 0x61 read if we believe it was a watchdog NMI, we've opened a race condition where we will completely miss the system NMI. ~Andrew ___ Xen-devel mailing list Xen-devel@lists.xenproject.org https://lists.xenproject.org/mailman/listinfo/xen-devel
Re: [Xen-devel] [PATCH] x86/nmi: lower initial watchdog frequency to avoid boot hangs
On Thu, 8 Feb 2018 10:47:45 + Igor Druzhininwrote: >I've done this measurement before. So what we are seeing exactly is >that the time we are spending in SMI is spiking (sometimes up to >200ms) at the moment we go through INIT-SIPI-SIPI sequence. Looks like >this is enough to push the system into a livelock spiral. So I agree >with Jan to some point that the proposed workaround might not be >working on some systems. According to the Xen code, NMI expected for 2 primary purposes: - watchdog NMI from LAPIC - "system" NMIs (like due to SERR) Most of the time we deal with watchdog NMIs, while all others should be somewhat rare. The thing is, we actually need to read I/O port 61h on system NMIs only. If the main problem lies in a flow of SMIs due to reading port 61h on every NMI watchdog tick -- why not to avoid reading it? There are at least 2 ways to check if the NMI was due to a watchdog tick: - LAPIC (SDM states that "When a performance monitoring counters interrupt is generated, the mask bit for its associated LVT entry is set") - perf MSR overflow bit So, if we detect it was a NMI due to a watchdog using these methods (early in the NMI handler), we can avoid touching the port 61h and thus triggering SMI I/O trap on it. >> There might be a chance that perf counter frequency is calculated >> wrong for some systems, resulting in a very high rate of NMI >> watchdog ticks instead of long SMI handler execution time. >10ms >> just looks... too extreme. >> > >We ruled that out. > >> Huawei Server 2488 V5 BIOS -- similar SMI I/O trap handler for the >> port 61h found. Some differences with gigabyte H270 system though: >> >> - no "allocated" I/O traps anymore, but one additional SMI I/O trap >> encountered: port 900h, dword size. Possibly related to PCIe PM >> facilities. >> >> - port 61h SMI handler now has multiple calls to debug/assert stub >> functions -- there might be a chance that some of impacted systems >> had debug build on, resulting in those stubs expanded to some real >> debugging code with negative impact on SMI handling speed. >> >> Few additional observations: >> >> - port 61h I/O Trap SMI handler checks accessed I/O address/size to >> be equal to 61h/1byte. There might be some difference when reading >> port 61h via inw(0x60)/inl(0x60)/etc >> >> - looks like there exist an alternative way to read NMI status >> without triggering SMI -- via ports 63h/65h/67h, but this depends on >> undocumented bit in Generic Control and Status register >> ___ Xen-devel mailing list Xen-devel@lists.xenproject.org https://lists.xenproject.org/mailman/listinfo/xen-devel
Re: [Xen-devel] [PATCH] x86/nmi: lower initial watchdog frequency to avoid boot hangs
On 08/02/18 09:12, Jan Beulich wrote: On 07.02.18 at 18:08,wrote: >> On 07/02/18 15:06, Jan Beulich wrote: >>> Also you completely ignore my argument against the seemingly >>> random division by 10, including the resulting question of what you >>> mean to do once 10Hz also turns out too high a frequency. >> We've got to pick a frequency. The current 100Hz is just as arbitrary >> as the proposed new 10Hz. > Not exactly - the 100Hz is simply the frequency we run the main tick > at, so while random it is not as random as any further derived value > which has no proper reason behind it. > > There's one more point wrt your argument of overhead: If servicing > an NMI takes that long on those boxes, you're basically saying you > are happy to waste at least 1% of a core's bandwidth on a > debugging feature. Is that reasonable for a production setup? And > considering that I'd expect the patch to have chosen e.g. HZ / 5, > HZ / 4, or even HZ / 2 if that worked reliably, I could even conclude > you're happy to spend somewhere between 5 and 10% of one > core's bandwidth. (FAOD all this is based on the 1Hz frequency we > - iirc - run the NMI at later on.) To me this is another clear argument > to turn off the watchdog on those systems, rather than trying to > "fix" its probing. It is not a debugging feature; it's a reliability feature. With clustered storage in particular, it is absolutely paramount to guarantee that a struggling host fences itself cleanly, or you lose the entire cluster. This particular issue is a failure to boot, but by far the most common issue we see in the field is a fence when all-but-one CPU is waiting in the calibration rendezvous, by which point the host has effectively been dead for 5 seconds already. Turning the watchdog off isn't a viable or reasonable solution to the problem. We switch the NMI frequency to ~2Hz after the calibration, but that is after having run the BSP at 100Hz for a long period of time, and the APs at this rate for a short while. Irrespective of the exact fix here, it is simply not a good idea to be running with this NMI frequency, other than possibly during the immediate calibration logic. ~Andrew ___ Xen-devel mailing list Xen-devel@lists.xenproject.org https://lists.xenproject.org/mailman/listinfo/xen-devel
Re: [Xen-devel] [PATCH] x86/nmi: lower initial watchdog frequency to avoid boot hangs
On 08/02/18 06:37, Alexey G wrote: > On Wed, 7 Feb 2018 13:01:08 + > Igor Druzhininwrote: >> So far the issue confirmed: >> Dell PowerEdge R740, Huawei systems based on Xeon Gold 6152 (the one >> that it was tested on), Intel S2600XX, etc. >> >> Also see: >> https://bugs.xenserver.org/browse/XSO-774 >> >> Well, no-watchdog is what we currently recommend in that case but we >> hoped there is a general solution here from Xen side. You have your >> point that they should fix this on their side because it's their fault >> indeed. But the user experience is also important for us I think. > > Igor, > > It would be nice to measure the actual SMI handling time on affected > systems (eg. via rdtsc before/after inb(0x61) + averaging for > multiple reads perhaps), is it really 10+ ms. > I've done this measurement before. So what we are seeing exactly is that the time we are spending in SMI is spiking (sometimes up to 200ms) at the moment we go through INIT-SIPI-SIPI sequence. Looks like this is enough to push the system into a livelock spiral. So I agree with Jan to some point that the proposed workaround might not be working on some systems. > There might be a chance that perf counter frequency is calculated wrong > for some systems, resulting in a very high rate of NMI watchdog ticks > instead of long SMI handler execution time. >10ms just looks... too > extreme. > We ruled that out. > Huawei Server 2488 V5 BIOS -- similar SMI I/O trap handler for the port > 61h found. Some differences with gigabyte H270 system though: > > - no "allocated" I/O traps anymore, but one additional SMI I/O trap > encountered: port 900h, dword size. Possibly related to PCIe PM > facilities. > > - port 61h SMI handler now has multiple calls to debug/assert stub > functions -- there might be a chance that some of impacted systems > had debug build on, resulting in those stubs expanded to some real > debugging code with negative impact on SMI handling speed. > > Few additional observations: > > - port 61h I/O Trap SMI handler checks accessed I/O address/size to be > equal to 61h/1byte. There might be some difference when reading port > 61h via inw(0x60)/inl(0x60)/etc > > - looks like there exist an alternative way to read NMI status without > triggering SMI -- via ports 63h/65h/67h, but this depends on > undocumented bit in Generic Control and Status register > ___ Xen-devel mailing list Xen-devel@lists.xenproject.org https://lists.xenproject.org/mailman/listinfo/xen-devel
Re: [Xen-devel] [PATCH] x86/nmi: lower initial watchdog frequency to avoid boot hangs
>>> On 07.02.18 at 18:08,wrote: > On 07/02/18 15:06, Jan Beulich wrote: >> Also you completely ignore my argument against the seemingly >> random division by 10, including the resulting question of what you >> mean to do once 10Hz also turns out too high a frequency. > > We've got to pick a frequency. The current 100Hz is just as arbitrary > as the proposed new 10Hz. Not exactly - the 100Hz is simply the frequency we run the main tick at, so while random it is not as random as any further derived value which has no proper reason behind it. There's one more point wrt your argument of overhead: If servicing an NMI takes that long on those boxes, you're basically saying you are happy to waste at least 1% of a core's bandwidth on a debugging feature. Is that reasonable for a production setup? And considering that I'd expect the patch to have chosen e.g. HZ / 5, HZ / 4, or even HZ / 2 if that worked reliably, I could even conclude you're happy to spend somewhere between 5 and 10% of one core's bandwidth. (FAOD all this is based on the 1Hz frequency we - iirc - run the NMI at later on.) To me this is another clear argument to turn off the watchdog on those systems, rather than trying to "fix" its probing. Jan ___ Xen-devel mailing list Xen-devel@lists.xenproject.org https://lists.xenproject.org/mailman/listinfo/xen-devel
Re: [Xen-devel] [PATCH] x86/nmi: lower initial watchdog frequency to avoid boot hangs
On Wed, 7 Feb 2018 13:01:08 + Igor Druzhininwrote: >So far the issue confirmed: >Dell PowerEdge R740, Huawei systems based on Xeon Gold 6152 (the one >that it was tested on), Intel S2600XX, etc. > >Also see: >https://bugs.xenserver.org/browse/XSO-774 > >Well, no-watchdog is what we currently recommend in that case but we >hoped there is a general solution here from Xen side. You have your >point that they should fix this on their side because it's their fault >indeed. But the user experience is also important for us I think. Igor, It would be nice to measure the actual SMI handling time on affected systems (eg. via rdtsc before/after inb(0x61) + averaging for multiple reads perhaps), is it really 10+ ms. There might be a chance that perf counter frequency is calculated wrong for some systems, resulting in a very high rate of NMI watchdog ticks instead of long SMI handler execution time. >10ms just looks... too extreme. Huawei Server 2488 V5 BIOS -- similar SMI I/O trap handler for the port 61h found. Some differences with gigabyte H270 system though: - no "allocated" I/O traps anymore, but one additional SMI I/O trap encountered: port 900h, dword size. Possibly related to PCIe PM facilities. - port 61h SMI handler now has multiple calls to debug/assert stub functions -- there might be a chance that some of impacted systems had debug build on, resulting in those stubs expanded to some real debugging code with negative impact on SMI handling speed. Few additional observations: - port 61h I/O Trap SMI handler checks accessed I/O address/size to be equal to 61h/1byte. There might be some difference when reading port 61h via inw(0x60)/inl(0x60)/etc - looks like there exist an alternative way to read NMI status without triggering SMI -- via ports 63h/65h/67h, but this depends on undocumented bit in Generic Control and Status register ___ Xen-devel mailing list Xen-devel@lists.xenproject.org https://lists.xenproject.org/mailman/listinfo/xen-devel
Re: [Xen-devel] [PATCH] x86/nmi: lower initial watchdog frequency to avoid boot hangs
On 07/02/18 15:06, Jan Beulich wrote: On 07.02.18 at 14:24,wrote: >> On 07/02/18 13:08, Jan Beulich wrote: >> On 07.02.18 at 14:01, wrote: So far the issue confirmed: Dell PowerEdge R740, Huawei systems based on Xeon Gold 6152 (the one that it was tested on), Intel S2600XX, etc. Also see: https://bugs.xenserver.org/browse/XSO-774 Well, no-watchdog is what we currently recommend in that case but we hoped there is a general solution here from Xen side. You have your point that they should fix this on their side because it's their fault indeed. But the user experience is also important for us I think. >>> Of course, hence the suggestion of possible alternative workarounds. >>> Impacting everyone is, as said, not a desirable approach in a case >>> like this one. I also continue to dislike the seemingly random division >>> by 10. >> Xen's usability is crap, which is in large part due to attitude like >> this. It is not ok to expect the end user to know diagnose/debug issues >> like this, and it is entirely unreasonable to expect the end user to >> have to manually work around it. > Excuse me? The watchdog is off by default. Anyone turning it on > ought to know what they do. You (iirc) turning it on unilaterally in > XenServer puts the burden of avoidng users to have to diagnose > the issue on you. And we have taken the burden of diagnosing the issue, as well as proposing a fix. > >> This particular issue does want feeding back to Intel so they can try >> and fix it, but whatever is wrong is present in a large amount of >> Skylake systems in the field. Xen needs to be able to cope. > But in a reasonable way. > >> Finally, as to boot times, your argument is backwards seeing as you care >> about elapsed boot time. Slowing the frequency will speed everything >> up, as we aren't executing a large chunk of the BSP boot path with 100hz >> NMI constantly interrupting. > How long does handling a single NMI take? Microseconds, I assume. > Contrast this with waiting for two NMIs to arrive in wait_for_nmis(), > which goes up from 20ms to 200ms with this change. So you're argument is to not change the frequency because an off-by-default option will *in the best case* add a few hundred milliseconds extra to the boot time? Times to boot computers are measured in minutes, not milliseconds. I don't know how long servicing an NMI takes, at a minimum of a rdmsr, wrmsr and then a further mmio write or wrmsr, I doubt it is that quick. > Also you completely ignore my argument against the seemingly > random division by 10, including the resulting question of what you > mean to do once 10Hz also turns out too high a frequency. We've got to pick a frequency. The current 100Hz is just as arbitrary as the proposed new 10Hz. > I wouldn't, btw, mind an attempt to avoid the high rate NMIs > during early boot (if those occur in the first place, which from > two successive replies by Igor yesterday I wasn't sure anymore > is an actual fact), but that's independent of the issue at hand. The 100Hz NMI is active from BSP APIC init, IO-APIC, deadline timer calibration, mwait idle, the entirety of HVM setup and full AP bringup. On one of my fastest boxes, it is about 1 second of wallclock time. ~Andrew ___ Xen-devel mailing list Xen-devel@lists.xenproject.org https://lists.xenproject.org/mailman/listinfo/xen-devel
Re: [Xen-devel] [PATCH] x86/nmi: lower initial watchdog frequency to avoid boot hangs
>>> On 07.02.18 at 14:24,wrote: > On 07/02/18 13:08, Jan Beulich wrote: > On 07.02.18 at 14:01, wrote: >>> So far the issue confirmed: >>> Dell PowerEdge R740, Huawei systems based on Xeon Gold 6152 (the one >>> that it was tested on), Intel S2600XX, etc. >>> >>> Also see: >>> https://bugs.xenserver.org/browse/XSO-774 >>> >>> Well, no-watchdog is what we currently recommend in that case but we >>> hoped there is a general solution here from Xen side. You have your >>> point that they should fix this on their side because it's their fault >>> indeed. But the user experience is also important for us I think. >> Of course, hence the suggestion of possible alternative workarounds. >> Impacting everyone is, as said, not a desirable approach in a case >> like this one. I also continue to dislike the seemingly random division >> by 10. > > Xen's usability is crap, which is in large part due to attitude like > this. It is not ok to expect the end user to know diagnose/debug issues > like this, and it is entirely unreasonable to expect the end user to > have to manually work around it. Excuse me? The watchdog is off by default. Anyone turning it on ought to know what they do. You (iirc) turning it on unilaterally in XenServer puts the burden of avoidng users to have to diagnose the issue on you. > This particular issue does want feeding back to Intel so they can try > and fix it, but whatever is wrong is present in a large amount of > Skylake systems in the field. Xen needs to be able to cope. But in a reasonable way. > Finally, as to boot times, your argument is backwards seeing as you care > about elapsed boot time. Slowing the frequency will speed everything > up, as we aren't executing a large chunk of the BSP boot path with 100hz > NMI constantly interrupting. How long does handling a single NMI take? Microseconds, I assume. Contrast this with waiting for two NMIs to arrive in wait_for_nmis(), which goes up from 20ms to 200ms with this change. Also you completely ignore my argument against the seemingly random division by 10, including the resulting question of what you mean to do once 10Hz also turns out too high a frequency. I wouldn't, btw, mind an attempt to avoid the high rate NMIs during early boot (if those occur in the first place, which from two successive replies by Igor yesterday I wasn't sure anymore is an actual fact), but that's independent of the issue at hand. Jan ___ Xen-devel mailing list Xen-devel@lists.xenproject.org https://lists.xenproject.org/mailman/listinfo/xen-devel
Re: [Xen-devel] [PATCH] x86/nmi: lower initial watchdog frequency to avoid boot hangs
On 07/02/18 13:08, Jan Beulich wrote: On 07.02.18 at 14:01,wrote: >> So far the issue confirmed: >> Dell PowerEdge R740, Huawei systems based on Xeon Gold 6152 (the one >> that it was tested on), Intel S2600XX, etc. >> >> Also see: >> https://bugs.xenserver.org/browse/XSO-774 >> >> Well, no-watchdog is what we currently recommend in that case but we >> hoped there is a general solution here from Xen side. You have your >> point that they should fix this on their side because it's their fault >> indeed. But the user experience is also important for us I think. > > Of course, hence the suggestion of possible alternative workarounds. > Impacting everyone is, as said, not a desirable approach in a case > like this one. I also continue to dislike the seemingly random division > by 10. > > Jan > There is also a workaround by initializing the watchdog later (after SMP bootstrap) on CPU0 - as Linux does AFAIK. But I don't think this would be acceptable either. Igor ___ Xen-devel mailing list Xen-devel@lists.xenproject.org https://lists.xenproject.org/mailman/listinfo/xen-devel
Re: [Xen-devel] [PATCH] x86/nmi: lower initial watchdog frequency to avoid boot hangs
On 07/02/18 13:08, Jan Beulich wrote: On 07.02.18 at 14:01,wrote: >> So far the issue confirmed: >> Dell PowerEdge R740, Huawei systems based on Xeon Gold 6152 (the one >> that it was tested on), Intel S2600XX, etc. >> >> Also see: >> https://bugs.xenserver.org/browse/XSO-774 >> >> Well, no-watchdog is what we currently recommend in that case but we >> hoped there is a general solution here from Xen side. You have your >> point that they should fix this on their side because it's their fault >> indeed. But the user experience is also important for us I think. > Of course, hence the suggestion of possible alternative workarounds. > Impacting everyone is, as said, not a desirable approach in a case > like this one. I also continue to dislike the seemingly random division > by 10. Xen's usability is crap, which is in large part due to attitude like this. It is not ok to expect the end user to know diagnose/debug issues like this, and it is entirely unreasonable to expect the end user to have to manually work around it. This particular issue does want feeding back to Intel so they can try and fix it, but whatever is wrong is present in a large amount of Skylake systems in the field. Xen needs to be able to cope. Finally, as to boot times, your argument is backwards seeing as you care about elapsed boot time. Slowing the frequency will speed everything up, as we aren't executing a large chunk of the BSP boot path with 100hz NMI constantly interrupting. ~Andrew ___ Xen-devel mailing list Xen-devel@lists.xenproject.org https://lists.xenproject.org/mailman/listinfo/xen-devel
Re: [Xen-devel] [PATCH] x86/nmi: lower initial watchdog frequency to avoid boot hangs
>>> On 07.02.18 at 14:01,wrote: > So far the issue confirmed: > Dell PowerEdge R740, Huawei systems based on Xeon Gold 6152 (the one > that it was tested on), Intel S2600XX, etc. > > Also see: > https://bugs.xenserver.org/browse/XSO-774 > > Well, no-watchdog is what we currently recommend in that case but we > hoped there is a general solution here from Xen side. You have your > point that they should fix this on their side because it's their fault > indeed. But the user experience is also important for us I think. Of course, hence the suggestion of possible alternative workarounds. Impacting everyone is, as said, not a desirable approach in a case like this one. I also continue to dislike the seemingly random division by 10. Jan ___ Xen-devel mailing list Xen-devel@lists.xenproject.org https://lists.xenproject.org/mailman/listinfo/xen-devel
Re: [Xen-devel] [PATCH] x86/nmi: lower initial watchdog frequency to avoid boot hangs
On 07/02/18 09:13, Jan Beulich wrote: On 06.02.18 at 22:51,wrote: >> The problem with a quirk/commandline parameter is that the issue is >> reported for a wide variety of systems and, as it looks like, depends on >> the default BIOS setup - means it's hard to identify particular >> machines. We should obviously sort this out with Intel but until then >> lowering the initial frequency is our only option. > > "Wide variety" is interesting, considering that we've had no earlier > reports. As the description of the patch talks about "post-Skylake" - > are these production machines? If not, a command line option > would quite certainly be sufficient here. If yes, I'd like "wide variety" > to be further qualified. After all we're talking about a processing > overhead on the order of 10ms here, which is absurd. There are > systems anyway where the watchdog doesn't work - we may need > to consider to suggest to people to simply not enable the watchdog > on such systems until the firmware issue has been taken care of. > > As mentioned before - if firmware takes on the order of 10ms to > process the SMI intercept, I can't see why it wouldn't be possible > for them to screw up further and take 20, 50, or 100ms, at which > point your seemingly random HZ / 10 would no longer work either. > The same goes for the case of someone coming along and > changing HZ to a higher value (with a good reason provided). > > Jan > So far the issue confirmed: Dell PowerEdge R740, Huawei systems based on Xeon Gold 6152 (the one that it was tested on), Intel S2600XX, etc. Also see: https://bugs.xenserver.org/browse/XSO-774 Well, no-watchdog is what we currently recommend in that case but we hoped there is a general solution here from Xen side. You have your point that they should fix this on their side because it's their fault indeed. But the user experience is also important for us I think. Igor ___ Xen-devel mailing list Xen-devel@lists.xenproject.org https://lists.xenproject.org/mailman/listinfo/xen-devel
Re: [Xen-devel] [PATCH] x86/nmi: lower initial watchdog frequency to avoid boot hangs
>>> On 06.02.18 at 22:51,wrote: > The problem with a quirk/commandline parameter is that the issue is > reported for a wide variety of systems and, as it looks like, depends on > the default BIOS setup - means it's hard to identify particular > machines. We should obviously sort this out with Intel but until then > lowering the initial frequency is our only option. "Wide variety" is interesting, considering that we've had no earlier reports. As the description of the patch talks about "post-Skylake" - are these production machines? If not, a command line option would quite certainly be sufficient here. If yes, I'd like "wide variety" to be further qualified. After all we're talking about a processing overhead on the order of 10ms here, which is absurd. There are systems anyway where the watchdog doesn't work - we may need to consider to suggest to people to simply not enable the watchdog on such systems until the firmware issue has been taken care of. As mentioned before - if firmware takes on the order of 10ms to process the SMI intercept, I can't see why it wouldn't be possible for them to screw up further and take 20, 50, or 100ms, at which point your seemingly random HZ / 10 would no longer work either. The same goes for the case of someone coming along and changing HZ to a higher value (with a good reason provided). Jan ___ Xen-devel mailing list Xen-devel@lists.xenproject.org https://lists.xenproject.org/mailman/listinfo/xen-devel
Re: [Xen-devel] [PATCH] x86/nmi: lower initial watchdog frequency to avoid boot hangs
If the actual SMI source is not related to some place in the NMI handler code but was eg. due to some SMI timer, lowering NMI watchdog frequency might not fix the issue completely, but lower its reproducibility (perhaps to some very rare occurrences). So it's better be sure what was the real source of SMI. >>> >>> This *is* related to this instruction - it was confirmed >>> empirically. Removing this instruction stops SMIs from occurring >>> and effectively removes the issue leaving the frequency unchanged. >> >> Hmm, it would be interesting to know for what evil purpose does it >> need to trap I/O port 61h. >> BTW, on which motherboard model the issue was reproduced? >> > >The issue has been reported for some Dell/Huawei Skylake platforms (one >of them PowerEdge R740 to be precise) but I don't think the others are >unaffected (the issue supposedly originates from Intel's reference >code) >- the default BIOS setup indeed matters. Here is a bit of info you might find useful. I did a quick research on my test system (Gigabyte GA-H270M-D3H) in order to confirm if BIOS traps I/O port 61h (NMI status) and for what purposes. Well, turns out it really does. Moreover, it's actually the only fixed I/O port location trapped by SMI I/O traps on this system. Few others are simply 'allocated' ones, meaning the real I/O port address being trapped is chosen dynamically by supplying Address=0 to a corresponding call to EFI I/O Trap interface function -- such I/O traps may be used as interfaces with a SMI handler in a manner similar to the SW SMI interface. The EFI module responsible for installing port 61h SMI I/O Trap is PchInitSmm in my case. The related code is: ... mov eax, 61h lea r9, qword_5778 mov [rsp+98h+io_trap_ctx.io_address], ax mov rax, cs:pIoTrapIF lea r8, [rsp+98h+io_trap_ctx] lea rdx, Port61h_IoTrapHandler mov rcx, rax mov [rsp+98h+io_trap_ctx.trap_type], ebp ; trap reads mov [rsp+98h+io_trap_ctx.io_len], bp ; ebp=1 callqword ptr [rax] ... The actual handler (named Port61h_IoTrapHandler in the above code) is fairly lightweight and does a bit of useless black magic. First, there is a loop for all CPUs which finds which CPU actually caused trapped I/O operation by reading NMI status port. Then it reads the original port 61h value and set NMI_SC bit4 to its inverted previous state for the selected CPU' bit. And then updated AL register value is returned to the NMI_SC-reading user code (via patching RAX register value in SMRAM saved state): ; ebp = 61h, rbx = CPU index ... mov edx, ebp in al, dx mov r8, cs:bmNmiRefTogglesForCpus mov rcx, rbx mov edx, 1 shl edx, cl mov r9, rbx movsxd rcx, edx mov dl, al and al, 0EFh xor r8, rcx or dl, 10h mov cs:bmNmiRefTogglesForCpus, r8 and r8, rcx movzx ecx, al movzx eax, dl testr8, r8 mov edx, 1 cmovnz ecx, eax lea rax, [rsp+58h+al_to_return] lea r8d, [rdx+25h] ; EFI_SMM_SAVE_STATE_REGISTER_RAX mov [rsp+58h+func_arg0], rax mov rax, cs:pEFI_SMM_CPU_PROTOCOL_GUID_IF mov [rsp+58h+al_to_return], cl mov rcx, rax callqword ptr [rax+8] ; WriteSaveState ... So, the only purpose of this stuff is emulating REF_TOGGLE bit toggling logic (simply by alternating ones and zeros on each NMI_SC read), nothing more. Sort of workaround for some legacy code which depends on REF_TOGGLE rolling (which is now being marked Reserved in docs). On this particular system SMI I/O trap for port 61h neither do anything time-consuming nor anything really useful. That Dell system must have something similar (thanks to common EFI ref code from Intel Igor mentioned), leaving the question why port 61h reading is so slow there. ___ Xen-devel mailing list Xen-devel@lists.xenproject.org https://lists.xenproject.org/mailman/listinfo/xen-devel
Re: [Xen-devel] [PATCH] x86/nmi: lower initial watchdog frequency to avoid boot hangs
On 06/02/18 16:23, Jan Beulich wrote: On 06.02.18 at 17:14,wrote: >> On 06/02/18 16:07, Jan Beulich wrote: >> On 05.02.18 at 22:18, wrote: --- a/xen/arch/x86/nmi.c +++ b/xen/arch/x86/nmi.c @@ -34,7 +34,8 @@ #include unsigned int nmi_watchdog = NMI_NONE; -static unsigned int nmi_hz = HZ; +/* initial watchdog frequency - shouldn't be too high to avoid boot hangs >> */ +static unsigned int nmi_hz = HZ / 10; >>> >>> For one - the comment should explain what "too high" means. >>> Further - what if on another system 10Hz is still too high? I also hope >>> you realize that you slow down boot a little for everyone just >>> because of this one machine model. Can the lower frequency perhaps >>> be set via DMI quirk, or otherwise obtain a command line override >>> (perhaps something like "watchdog=probe:10Hz")? >>> >> >> I can improve the comment message. >> Why does this change slow down anything while I'm lowering the frequency >> - not making it higher? > > We wait for two occurrences of the NMI in wait_for_nmis(). > >> The alternative approach would be to reshuffle >> the code and take the reason before programming the next interrupt as >> suggested before. In that case the actual frequency would be adjusted >> naturally I think. > > Thinking about this, reading the reason early seems like a good idea > to me irrespective of the issue here. > I ran a couple of experiments with different layouts in NMI handler: it looks like it doesn't really help as merely having this instruction inside the handler and running it at 100Hz breaks a number of timeouts in SMP bootstrap code and makes it unstable. So we are back to lowering the frequency as I'm now out of ideas. The problem with a quirk/commandline parameter is that the issue is reported for a wide variety of systems and, as it looks like, depends on the default BIOS setup - means it's hard to identify particular machines. We should obviously sort this out with Intel but until then lowering the initial frequency is our only option. Igor ___ Xen-devel mailing list Xen-devel@lists.xenproject.org https://lists.xenproject.org/mailman/listinfo/xen-devel
Re: [Xen-devel] [PATCH] x86/nmi: lower initial watchdog frequency to avoid boot hangs
On Tue, 6 Feb 2018 17:21:19 + Igor Druzhininwrote: >On 06/02/18 17:08, Alexey G wrote: >> The major concern here is the possiblity of SMI being triggered _not_ >> by some specific I/O port access. Primarily, if it actually was a >> periodic SMI. >> >> If the actual SMI source is not related to some place in the NMI >> handler code but was eg. due to some SMI timer, lowering NMI watchdog >> frequency might not fix the issue completely, but lower its >> reproducibility (perhaps to some very rare occurrences). So it's >> better be sure what was the real source of SMI. >> > >This *is* related to this instruction - it was confirmed empirically. >Removing this instruction stops SMIs from occurring and effectively >removes the issue leaving the frequency unchanged. Hmm, it would be interesting to know for what evil purpose does it need to trap I/O port 61h. BTW, on which motherboard model the issue was reproduced? >> 2. According to the code, it looks like NMI status reading happens >> while NMIs are still blocked -- this means that SMI handler must >> exec IRET by itself to reset NMI blocking state -- again, this is >> possible (eg. in unreal->protmode switching code), but not likely. >> > >According to SDM one NMI might be pending while taken in SMI mode (see >ch. 34.8). This is actually even true if NMI comes while servicing >another NMI. So when we return to the NMI handler from SMI and finish >it properly the next one appears immediately. If the SMI handler doesn't mess up with NMI blocking state, it means that SMI handler processes every reading of port 61h longer than a watchdog NMI period duration... which is quite long. Motherboard vendor did something very wrong with I/O trap handling in the SMI handler code if it takes so much. ___ Xen-devel mailing list Xen-devel@lists.xenproject.org https://lists.xenproject.org/mailman/listinfo/xen-devel
Re: [Xen-devel] [PATCH] x86/nmi: lower initial watchdog frequency to avoid boot hangs
On 06/02/18 17:08, Alexey G wrote: > On Tue, 6 Feb 2018 14:21:12 + > Andrew Cooperwrote: > >> On 06/02/18 03:10, Alexey G wrote: >>> I/O port 61h normally is not emulated by SMI legacy kbd handling code >>> in BIOS, only ports like 60h, 64h, etc. >>> Contrary to USB legacy emulation, it has to intercept port 61h via a >>> different approach -- generic SMI I/O trap, which is not common (at >>> least it was) to use by BIOSes... although it is possible as EFI >>> interface and code for this is available. The assumption about port >>> 61h being trapped by the SMI handler must be explicitly confirmed by >>> checking I/O Trap control regs in the RCBA region. >>> >>> If I/O trap regs won't show an active I/O trap on I/O port 61h -- the >>> root cause might be different (might even be related to stuff like >>> NMI2SMI logic). >>> >>> If the problem is actually due to NMI handler being preempted by >>> another NMI which occurred after (a long) execution of triggered SMI >>> handler, it might be better to do all sensitive stuff before >>> re-enabling NMIs by IRET in the NMI handler. >> >> The problem is that the SMI handler executes enough instructions to >> trigger another NMI (which is based on the retired instruction count), >> which gets delivered once the SMI handler returns, and servicing the >> NMI triggers a new SMI, which triggers a new NMI. This is why the >> system stalls. >> >> I'll leave the how/why port 0x61 is trapping to SMI to Igor, but it is >> only a secondary concern here. We cannot reasonably have the watchdog >> able to trip because of exclusively SMI activity, or we'll potentially >> livelock anywhere. > > The major concern here is the possiblity of SMI being triggered _not_ > by some specific I/O port access. Primarily, if it actually was a > periodic SMI. > > If the actual SMI source is not related to some place in the NMI > handler code but was eg. due to some SMI timer, lowering NMI watchdog > frequency might not fix the issue completely, but lower its > reproducibility (perhaps to some very rare occurrences). So it's better > be sure what was the real source of SMI. > This *is* related to this instruction - it was confirmed empirically. Removing this instruction stops SMIs from occurring and effectively removes the issue leaving the frequency unchanged. > There are 2 weak spots in the analysis: > > 1. Triggering SMI by reading I/O port 61h -- intercepting the port 61h > is non-typical behavior for BIOS (unlike 60h/64h) > Apparently, it is now. > 2. According to the code, it looks like NMI status reading happens while > NMIs are still blocked -- this means that SMI handler must exec IRET > by itself to reset NMI blocking state -- again, this is possible (eg. > in unreal->protmode switching code), but not likely. > According to SDM one NMI might be pending while taken in SMI mode (see ch. 34.8). This is actually even true if NMI comes while servicing another NMI. So when we return to the NMI handler from SMI and finish it properly the next one appears immediately. Igor ___ Xen-devel mailing list Xen-devel@lists.xenproject.org https://lists.xenproject.org/mailman/listinfo/xen-devel
Re: [Xen-devel] [PATCH] x86/nmi: lower initial watchdog frequency to avoid boot hangs
On Tue, 6 Feb 2018 14:21:12 + Andrew Cooperwrote: >On 06/02/18 03:10, Alexey G wrote: >> I/O port 61h normally is not emulated by SMI legacy kbd handling code >> in BIOS, only ports like 60h, 64h, etc. >> Contrary to USB legacy emulation, it has to intercept port 61h via a >> different approach -- generic SMI I/O trap, which is not common (at >> least it was) to use by BIOSes... although it is possible as EFI >> interface and code for this is available. The assumption about port >> 61h being trapped by the SMI handler must be explicitly confirmed by >> checking I/O Trap control regs in the RCBA region. >> >> If I/O trap regs won't show an active I/O trap on I/O port 61h -- the >> root cause might be different (might even be related to stuff like >> NMI2SMI logic). >> >> If the problem is actually due to NMI handler being preempted by >> another NMI which occurred after (a long) execution of triggered SMI >> handler, it might be better to do all sensitive stuff before >> re-enabling NMIs by IRET in the NMI handler. > >The problem is that the SMI handler executes enough instructions to >trigger another NMI (which is based on the retired instruction count), >which gets delivered once the SMI handler returns, and servicing the >NMI triggers a new SMI, which triggers a new NMI. This is why the >system stalls. > >I'll leave the how/why port 0x61 is trapping to SMI to Igor, but it is >only a secondary concern here. We cannot reasonably have the watchdog >able to trip because of exclusively SMI activity, or we'll potentially >livelock anywhere. The major concern here is the possiblity of SMI being triggered _not_ by some specific I/O port access. Primarily, if it actually was a periodic SMI. If the actual SMI source is not related to some place in the NMI handler code but was eg. due to some SMI timer, lowering NMI watchdog frequency might not fix the issue completely, but lower its reproducibility (perhaps to some very rare occurrences). So it's better be sure what was the real source of SMI. There are 2 weak spots in the analysis: 1. Triggering SMI by reading I/O port 61h -- intercepting the port 61h is non-typical behavior for BIOS (unlike 60h/64h) 2. According to the code, it looks like NMI status reading happens while NMIs are still blocked -- this means that SMI handler must exec IRET by itself to reset NMI blocking state -- again, this is possible (eg. in unreal->protmode switching code), but not likely. ___ Xen-devel mailing list Xen-devel@lists.xenproject.org https://lists.xenproject.org/mailman/listinfo/xen-devel
Re: [Xen-devel] [PATCH] x86/nmi: lower initial watchdog frequency to avoid boot hangs
On 06/02/18 16:23, Jan Beulich wrote: On 06.02.18 at 17:14,wrote: >> On 06/02/18 16:07, Jan Beulich wrote: >> On 05.02.18 at 22:18, wrote: --- a/xen/arch/x86/nmi.c +++ b/xen/arch/x86/nmi.c @@ -34,7 +34,8 @@ #include unsigned int nmi_watchdog = NMI_NONE; -static unsigned int nmi_hz = HZ; +/* initial watchdog frequency - shouldn't be too high to avoid boot hangs >> */ +static unsigned int nmi_hz = HZ / 10; >>> >>> For one - the comment should explain what "too high" means. >>> Further - what if on another system 10Hz is still too high? I also hope >>> you realize that you slow down boot a little for everyone just >>> because of this one machine model. Can the lower frequency perhaps >>> be set via DMI quirk, or otherwise obtain a command line override >>> (perhaps something like "watchdog=probe:10Hz")? >>> >> >> I can improve the comment message. >> Why does this change slow down anything while I'm lowering the frequency >> - not making it higher? > > We wait for two occurrences of the NMI in wait_for_nmis(). Yes, you right, ignore my previous mail. We are waiting synchronously for them. I'll rewrite the code so I'll not change the frequency. Igor > >> The alternative approach would be to reshuffle >> the code and take the reason before programming the next interrupt as >> suggested before. In that case the actual frequency would be adjusted >> naturally I think. > > Thinking about this, reading the reason early seems like a good idea > to me irrespective of the issue here. > > Jan > ___ Xen-devel mailing list Xen-devel@lists.xenproject.org https://lists.xenproject.org/mailman/listinfo/xen-devel
Re: [Xen-devel] [PATCH] x86/nmi: lower initial watchdog frequency to avoid boot hangs
On 06/02/18 16:23, Jan Beulich wrote: On 06.02.18 at 17:14,wrote: >> On 06/02/18 16:07, Jan Beulich wrote: >> On 05.02.18 at 22:18, wrote: --- a/xen/arch/x86/nmi.c +++ b/xen/arch/x86/nmi.c @@ -34,7 +34,8 @@ #include unsigned int nmi_watchdog = NMI_NONE; -static unsigned int nmi_hz = HZ; +/* initial watchdog frequency - shouldn't be too high to avoid boot hangs >> */ +static unsigned int nmi_hz = HZ / 10; >>> >>> For one - the comment should explain what "too high" means. >>> Further - what if on another system 10Hz is still too high? I also hope >>> you realize that you slow down boot a little for everyone just >>> because of this one machine model. Can the lower frequency perhaps >>> be set via DMI quirk, or otherwise obtain a command line override >>> (perhaps something like "watchdog=probe:10Hz")? >>> >> >> I can improve the comment message. >> Why does this change slow down anything while I'm lowering the frequency >> - not making it higher? > > We wait for two occurrences of the NMI in wait_for_nmis(). > This happens *much* later in the boot sequence after we actually enable the watchdog on CPU0. Till that time we already get hundreds of them regardless of the frequency. >> The alternative approach would be to reshuffle >> the code and take the reason before programming the next interrupt as >> suggested before. In that case the actual frequency would be adjusted >> naturally I think. > > Thinking about this, reading the reason early seems like a good idea > to me irrespective of the issue here. > > Jan > ___ Xen-devel mailing list Xen-devel@lists.xenproject.org https://lists.xenproject.org/mailman/listinfo/xen-devel
Re: [Xen-devel] [PATCH] x86/nmi: lower initial watchdog frequency to avoid boot hangs
>>> On 06.02.18 at 17:14,wrote: > On 06/02/18 16:07, Jan Beulich wrote: > On 05.02.18 at 22:18, wrote: >>> --- a/xen/arch/x86/nmi.c >>> +++ b/xen/arch/x86/nmi.c >>> @@ -34,7 +34,8 @@ >>> #include >>> >>> unsigned int nmi_watchdog = NMI_NONE; >>> -static unsigned int nmi_hz = HZ; >>> +/* initial watchdog frequency - shouldn't be too high to avoid boot hangs > */ >>> +static unsigned int nmi_hz = HZ / 10; >> >> For one - the comment should explain what "too high" means. >> Further - what if on another system 10Hz is still too high? I also hope >> you realize that you slow down boot a little for everyone just >> because of this one machine model. Can the lower frequency perhaps >> be set via DMI quirk, or otherwise obtain a command line override >> (perhaps something like "watchdog=probe:10Hz")? >> > > I can improve the comment message. > Why does this change slow down anything while I'm lowering the frequency > - not making it higher? We wait for two occurrences of the NMI in wait_for_nmis(). > The alternative approach would be to reshuffle > the code and take the reason before programming the next interrupt as > suggested before. In that case the actual frequency would be adjusted > naturally I think. Thinking about this, reading the reason early seems like a good idea to me irrespective of the issue here. Jan ___ Xen-devel mailing list Xen-devel@lists.xenproject.org https://lists.xenproject.org/mailman/listinfo/xen-devel
Re: [Xen-devel] [PATCH] x86/nmi: lower initial watchdog frequency to avoid boot hangs
On 06/02/18 16:07, Jan Beulich wrote: On 05.02.18 at 22:18,wrote: >> --- a/xen/arch/x86/nmi.c >> +++ b/xen/arch/x86/nmi.c >> @@ -34,7 +34,8 @@ >> #include >> >> unsigned int nmi_watchdog = NMI_NONE; >> -static unsigned int nmi_hz = HZ; >> +/* initial watchdog frequency - shouldn't be too high to avoid boot hangs */ >> +static unsigned int nmi_hz = HZ / 10; > > For one - the comment should explain what "too high" means. > Further - what if on another system 10Hz is still too high? I also hope > you realize that you slow down boot a little for everyone just > because of this one machine model. Can the lower frequency perhaps > be set via DMI quirk, or otherwise obtain a command line override > (perhaps something like "watchdog=probe:10Hz")? > I can improve the comment message. Why does this change slow down anything while I'm lowering the frequency - not making it higher? The alternative approach would be to reshuffle the code and take the reason before programming the next interrupt as suggested before. In that case the actual frequency would be adjusted naturally I think. Igor > Jan > ___ Xen-devel mailing list Xen-devel@lists.xenproject.org https://lists.xenproject.org/mailman/listinfo/xen-devel
Re: [Xen-devel] [PATCH] x86/nmi: lower initial watchdog frequency to avoid boot hangs
>>> On 05.02.18 at 22:18,wrote: > --- a/xen/arch/x86/nmi.c > +++ b/xen/arch/x86/nmi.c > @@ -34,7 +34,8 @@ > #include > > unsigned int nmi_watchdog = NMI_NONE; > -static unsigned int nmi_hz = HZ; > +/* initial watchdog frequency - shouldn't be too high to avoid boot hangs */ > +static unsigned int nmi_hz = HZ / 10; For one - the comment should explain what "too high" means. Further - what if on another system 10Hz is still too high? I also hope you realize that you slow down boot a little for everyone just because of this one machine model. Can the lower frequency perhaps be set via DMI quirk, or otherwise obtain a command line override (perhaps something like "watchdog=probe:10Hz")? Jan ___ Xen-devel mailing list Xen-devel@lists.xenproject.org https://lists.xenproject.org/mailman/listinfo/xen-devel
Re: [Xen-devel] [PATCH] x86/nmi: lower initial watchdog frequency to avoid boot hangs
On 06/02/18 03:10, Alexey G wrote: > On Mon, 5 Feb 2018 21:18:42 + > Igor Druzhininwrote: > >> We're noticing a reproducible system boot hang on certain >> post-Skylake platforms where the BIOS is configured in >> legacy boot mode with x2APIC disabled. The system stalls >> immediately after writing the first SMP initialization >> sequence into APIC ICR. >> >> The cause of the problem is watchdog NMI handler execution - >> somewhere near the end of NMI handling (after it's already >> rescheduled the next NMI) it tries to access IO port 0x61 >> to get the actual NMI reason on CPU0. Unfortunately, this >> port is emulated by BIOS using SMIs and this emulation >> apparently might take more than we expect under certain >> conditions. As the result, the system is constantly moving >> between NMI and SMI handler and not making any progress. >> >> Just lower the initial frequency for now as we lower it later >> even more anyway. > I/O port 61h normally is not emulated by SMI legacy kbd handling code > in BIOS, only ports like 60h, 64h, etc. > Contrary to USB legacy emulation, it has to intercept port 61h via a > different approach -- generic SMI I/O trap, which is not common (at least > it was) to use by BIOSes... although it is possible as EFI interface and > code for this is available. The assumption about port 61h being trapped by > the SMI handler must be explicitly confirmed by checking I/O Trap control > regs in the RCBA region. > > If I/O trap regs won't show an active I/O trap on I/O port 61h -- the > root cause might be different (might even be related to stuff like > NMI2SMI logic). > > If the problem is actually due to NMI handler being preempted by another > NMI which occurred after (a long) execution of triggered SMI handler, it > might be better to do all sensitive stuff before re-enabling NMIs by IRET in > the NMI handler. The problem is that the SMI handler executes enough instructions to trigger another NMI (which is based on the retired instruction count), which gets delivered once the SMI handler returns, and servicing the NMI triggers a new SMI, which triggers a new NMI. This is why the system stalls. I'll leave the how/why port 0x61 is trapping to SMI to Igor, but it is only a secondary concern here. We cannot reasonably have the watchdog able to trip because of exclusively SMI activity, or we'll potentially livelock anywhere. ~Andrew ___ Xen-devel mailing list Xen-devel@lists.xenproject.org https://lists.xenproject.org/mailman/listinfo/xen-devel
Re: [Xen-devel] [PATCH] x86/nmi: lower initial watchdog frequency to avoid boot hangs
On 05/02/18 21:18, Igor Druzhinin wrote: > We're noticing a reproducible system boot hang on certain > post-Skylake platforms where the BIOS is configured in Its just a plain Skylake Server, from what I can see. > legacy boot mode with x2APIC disabled. The system stalls > immediately after writing the first SMP initialization > sequence into APIC ICR. > > The cause of the problem is watchdog NMI handler execution - > somewhere near the end of NMI handling (after it's already > rescheduled the next NMI) it tries to access IO port 0x61 > to get the actual NMI reason on CPU0. Unfortunately, this > port is emulated by BIOS using SMIs and this emulation > apparently might take more than we expect under certain > conditions. As the result, the system is constantly moving > between NMI and SMI handler and not making any progress. > > Just lower the initial frequency for now as we lower it later > even more anyway. > > Signed-off-by: Igor DruzhininAcked-by: Andrew Cooper I can independently confirm these findings, and that the fix works. The NMI watchdog setup is rather crazy and complicated, but lets not get into that rats nest here. */ > --- > xen/arch/x86/nmi.c | 3 ++- > 1 file changed, 2 insertions(+), 1 deletion(-) > > diff --git a/xen/arch/x86/nmi.c b/xen/arch/x86/nmi.c > index d7fce28..1eb2a32 100644 > --- a/xen/arch/x86/nmi.c > +++ b/xen/arch/x86/nmi.c > @@ -34,7 +34,8 @@ > #include > > unsigned int nmi_watchdog = NMI_NONE; > -static unsigned int nmi_hz = HZ; > +/* initial watchdog frequency - shouldn't be too high to avoid boot hangs */ > +static unsigned int nmi_hz = HZ / 10; > static unsigned int nmi_perfctr_msr; /* the MSR to reset in NMI handler */ > static unsigned int nmi_p4_cccr_val; > static DEFINE_PER_CPU(struct timer, nmi_timer); ___ Xen-devel mailing list Xen-devel@lists.xenproject.org https://lists.xenproject.org/mailman/listinfo/xen-devel
Re: [Xen-devel] [PATCH] x86/nmi: lower initial watchdog frequency to avoid boot hangs
On Mon, 5 Feb 2018 21:18:42 + Igor Druzhininwrote: >We're noticing a reproducible system boot hang on certain >post-Skylake platforms where the BIOS is configured in >legacy boot mode with x2APIC disabled. The system stalls >immediately after writing the first SMP initialization >sequence into APIC ICR. > >The cause of the problem is watchdog NMI handler execution - >somewhere near the end of NMI handling (after it's already >rescheduled the next NMI) it tries to access IO port 0x61 >to get the actual NMI reason on CPU0. Unfortunately, this >port is emulated by BIOS using SMIs and this emulation >apparently might take more than we expect under certain >conditions. As the result, the system is constantly moving >between NMI and SMI handler and not making any progress. > >Just lower the initial frequency for now as we lower it later >even more anyway. I/O port 61h normally is not emulated by SMI legacy kbd handling code in BIOS, only ports like 60h, 64h, etc. Contrary to USB legacy emulation, it has to intercept port 61h via a different approach -- generic SMI I/O trap, which is not common (at least it was) to use by BIOSes... although it is possible as EFI interface and code for this is available. The assumption about port 61h being trapped by the SMI handler must be explicitly confirmed by checking I/O Trap control regs in the RCBA region. If I/O trap regs won't show an active I/O trap on I/O port 61h -- the root cause might be different (might even be related to stuff like NMI2SMI logic). If the problem is actually due to NMI handler being preempted by another NMI which occurred after (a long) execution of triggered SMI handler, it might be better to do all sensitive stuff before re-enabling NMIs by IRET in the NMI handler. >Signed-off-by: Igor Druzhinin >--- > xen/arch/x86/nmi.c | 3 ++- > 1 file changed, 2 insertions(+), 1 deletion(-) > >diff --git a/xen/arch/x86/nmi.c b/xen/arch/x86/nmi.c >index d7fce28..1eb2a32 100644 >--- a/xen/arch/x86/nmi.c >+++ b/xen/arch/x86/nmi.c >@@ -34,7 +34,8 @@ > #include > > unsigned int nmi_watchdog = NMI_NONE; >-static unsigned int nmi_hz = HZ; >+/* initial watchdog frequency - shouldn't be too high to avoid boot hangs >*/ +static unsigned int nmi_hz = HZ / 10; > static unsigned int nmi_perfctr_msr; /* the MSR to reset in NMI > handler */ static unsigned int nmi_p4_cccr_val; > static DEFINE_PER_CPU(struct timer, nmi_timer); ___ Xen-devel mailing list Xen-devel@lists.xenproject.org https://lists.xenproject.org/mailman/listinfo/xen-devel
[Xen-devel] [PATCH] x86/nmi: lower initial watchdog frequency to avoid boot hangs
We're noticing a reproducible system boot hang on certain post-Skylake platforms where the BIOS is configured in legacy boot mode with x2APIC disabled. The system stalls immediately after writing the first SMP initialization sequence into APIC ICR. The cause of the problem is watchdog NMI handler execution - somewhere near the end of NMI handling (after it's already rescheduled the next NMI) it tries to access IO port 0x61 to get the actual NMI reason on CPU0. Unfortunately, this port is emulated by BIOS using SMIs and this emulation apparently might take more than we expect under certain conditions. As the result, the system is constantly moving between NMI and SMI handler and not making any progress. Just lower the initial frequency for now as we lower it later even more anyway. Signed-off-by: Igor Druzhinin--- xen/arch/x86/nmi.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/xen/arch/x86/nmi.c b/xen/arch/x86/nmi.c index d7fce28..1eb2a32 100644 --- a/xen/arch/x86/nmi.c +++ b/xen/arch/x86/nmi.c @@ -34,7 +34,8 @@ #include unsigned int nmi_watchdog = NMI_NONE; -static unsigned int nmi_hz = HZ; +/* initial watchdog frequency - shouldn't be too high to avoid boot hangs */ +static unsigned int nmi_hz = HZ / 10; static unsigned int nmi_perfctr_msr; /* the MSR to reset in NMI handler */ static unsigned int nmi_p4_cccr_val; static DEFINE_PER_CPU(struct timer, nmi_timer); -- 2.7.4 ___ Xen-devel mailing list Xen-devel@lists.xenproject.org https://lists.xenproject.org/mailman/listinfo/xen-devel