Re: [Xen-devel] [PATCH] x86/nmi: lower initial watchdog frequency to avoid boot hangs

2018-02-13 Thread Jan Beulich
>>> On 08.02.18 at 13:18,  wrote:
> We switch the NMI frequency to ~2Hz after the calibration, but that is
> after having run the BSP at 100Hz for a long period of time, and the APs
> at this rate for a short while.  Irrespective of the exact fix here, it
> is simply not a good idea to be running with this NMI frequency, other
> than possibly during the immediate calibration logic.

And I've previously agreed with this general statement. It's
just that the so far suggested workaround doesn't really do
what you say (and I'm pretty determined that the "possibly"
above should be removed).

Jan


___
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

Re: [Xen-devel] [PATCH] x86/nmi: lower initial watchdog frequency to avoid boot hangs

2018-02-08 Thread Alexey G
On Thu, 8 Feb 2018 15:00:33 +
Andrew Cooper  wrote:

>On 08/02/18 14:37, Alexey G wrote:
>> On Thu, 8 Feb 2018 12:40:41 +
>> Andrew Cooper  wrote:  
>>> - Perf/Oprofile.  This is currently mutually exclusive with Xen
>>> using the watchdog, but needn't be and hopefully won't be in the
>>> future. 
 Most of the time we deal with watchdog NMIs, while all others
 should be somewhat rare. The thing is, we actually need to read
 I/O port 61h on system NMIs only. 

 If the main problem lies in a flow of SMIs due to reading port 61h
 on every NMI watchdog tick -- why not to avoid reading it?

 There are at least 2 ways to check if the NMI was due to a watchdog
 tick:
 - LAPIC (SDM states that "When a performance monitoring counters
 interrupt is generated, the mask bit for its associated LVT entry
 is set")
 - perf MSR overflow bit

 So, if we detect it was a NMI due to a watchdog using these
 methods (early in the NMI handler), we can avoid touching the port
 61h and thus triggering SMI I/O trap on it.
>>> The problem is having multiple NMIs arriving.  Like all other edge
>>> triggered interrupts, extra arrivals get dropped.  By skipping the
>>> 0x61 read if we believe it was a watchdog NMI, we've opened a race
>>> condition where we will completely miss the system NMI.  
>> There shouldn't be any problem I think. NMIs don't need to be cleared
>> with EOI and it's a common practice to handle NMIs one-by-one (as a
>> NMI handler is not reentrant in a typical scenario).
>>
>> Execution of SMI doesn't cause a pending (blocked) NMI to get
>> dropped, similar mechanisms might be employed for a single NMI which
>> arrived in blocked-by-NMI state. Otherwise the whole thing will
>> break -- merely handling arbitrary NMI will be enough to miss any
>> other NMIs. This is a too obvious flaw. So normally it should be
>> just a matter which NMI of two will be serviced first.
>> This assumption can be verified empirically by requesting the chipset
>> to send an external NMI while serving a watchdog NMI and checking if
>> it arrive later on.  

>NMI handling works just like other interrupts, except that its
>equivalent of the ISR/IRR state is hidden.
>
>One new NMI will become pending while an NMI is in progress (because
>there is an IRR bit to be set), but any further will be dropped.
>
>You can demonstrate this easily by having CPUs or the chipset send
>NMIs.

One in service with one pending is enough for our scenario. It covers
the situation when either watchdog or external NMI being serviced and
another (single) NMI arrived. As long as two of these are serviced one
after another, there shouldn't be trouble in missing NMIs due to NMI
status misunderstanding -- all related status bits are sticky (incl.
Mask bit in LAPIC) and and won't go away if two NMIs executed in either
order, provided they both executed.

___
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

Re: [Xen-devel] [PATCH] x86/nmi: lower initial watchdog frequency to avoid boot hangs

2018-02-08 Thread Andrew Cooper
On 08/02/18 14:37, Alexey G wrote:
> On Thu, 8 Feb 2018 12:40:41 +
> Andrew Cooper  wrote:
>> - Perf/Oprofile.  This is currently mutually exclusive with Xen using
>> the watchdog, but needn't be and hopefully won't be in the future.
>>
>>> Most of the time we deal with watchdog NMIs, while all others should
>>> be somewhat rare. The thing is, we actually need to read I/O port
>>> 61h on system NMIs only. 
>>>
>>> If the main problem lies in a flow of SMIs due to reading port 61h on
>>> every NMI watchdog tick -- why not to avoid reading it?
>>>
>>> There are at least 2 ways to check if the NMI was due to a watchdog
>>> tick:
>>> - LAPIC (SDM states that "When a performance monitoring counters
>>> interrupt is generated, the mask bit for its associated LVT entry is
>>> set")
>>> - perf MSR overflow bit
>>>
>>> So, if we detect it was a NMI due to a watchdog using these
>>> methods (early in the NMI handler), we can avoid touching the port
>>> 61h and thus triggering SMI I/O trap on it.  
>> The problem is having multiple NMIs arriving.  Like all other edge
>> triggered interrupts, extra arrivals get dropped.  By skipping the 0x61
>> read if we believe it was a watchdog NMI, we've opened a race condition
>> where we will completely miss the system NMI.
> There shouldn't be any problem I think. NMIs don't need to be cleared
> with EOI and it's a common practice to handle NMIs one-by-one (as a NMI
> handler is not reentrant in a typical scenario).
>
> Execution of SMI doesn't cause a pending (blocked) NMI to get dropped,
> similar mechanisms might be employed for a single NMI which arrived in
> blocked-by-NMI state. Otherwise the whole thing will break -- merely
> handling arbitrary NMI will be enough to miss any other NMIs. This is a
> too obvious flaw. So normally it should be just a matter which NMI of
> two will be serviced first.
> This assumption can be verified empirically by requesting the chipset
> to send an external NMI while serving a watchdog NMI and checking if it
> arrive later on.

NMI handling works just like other interrupts, except that its
equivalent of the ISR/IRR state is hidden.

One new NMI will become pending while an NMI is in progress (because
there is an IRR bit to be set), but any further will be dropped.

You can demonstrate this easily by having CPUs or the chipset send NMIs.

~Andrew

___
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

Re: [Xen-devel] [PATCH] x86/nmi: lower initial watchdog frequency to avoid boot hangs

2018-02-08 Thread Alexey G
On Thu, 8 Feb 2018 12:40:41 +
Andrew Cooper  wrote:
>- Perf/Oprofile.  This is currently mutually exclusive with Xen using
>the watchdog, but needn't be and hopefully won't be in the future.
>
>>
>> Most of the time we deal with watchdog NMIs, while all others should
>> be somewhat rare. The thing is, we actually need to read I/O port
>> 61h on system NMIs only. 
>>
>> If the main problem lies in a flow of SMIs due to reading port 61h on
>> every NMI watchdog tick -- why not to avoid reading it?
>>
>> There are at least 2 ways to check if the NMI was due to a watchdog
>> tick:
>> - LAPIC (SDM states that "When a performance monitoring counters
>> interrupt is generated, the mask bit for its associated LVT entry is
>> set")
>> - perf MSR overflow bit
>>
>> So, if we detect it was a NMI due to a watchdog using these
>> methods (early in the NMI handler), we can avoid touching the port
>> 61h and thus triggering SMI I/O trap on it.  
>
>The problem is having multiple NMIs arriving.  Like all other edge
>triggered interrupts, extra arrivals get dropped.  By skipping the 0x61
>read if we believe it was a watchdog NMI, we've opened a race condition
>where we will completely miss the system NMI.

There shouldn't be any problem I think. NMIs don't need to be cleared
with EOI and it's a common practice to handle NMIs one-by-one (as a NMI
handler is not reentrant in a typical scenario).

Execution of SMI doesn't cause a pending (blocked) NMI to get dropped,
similar mechanisms might be employed for a single NMI which arrived in
blocked-by-NMI state. Otherwise the whole thing will break -- merely
handling arbitrary NMI will be enough to miss any other NMIs. This is a
too obvious flaw. So normally it should be just a matter which NMI of
two will be serviced first.
This assumption can be verified empirically by requesting the chipset
to send an external NMI while serving a watchdog NMI and checking if it
arrive later on.

___
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

Re: [Xen-devel] [PATCH] x86/nmi: lower initial watchdog frequency to avoid boot hangs

2018-02-08 Thread Andrew Cooper
On 08/02/18 12:32, Alexey G wrote:
> On Thu, 8 Feb 2018 10:47:45 +
> Igor Druzhinin  wrote:
>> I've done this measurement before. So what we are seeing exactly is
>> that the time we are spending in SMI is spiking (sometimes up to
>> 200ms) at the moment we go through INIT-SIPI-SIPI sequence. Looks like
>> this is enough to push the system into a livelock spiral. So I agree
>> with Jan to some point that the proposed workaround might not be
>> working on some systems.
> According to the Xen code, NMI expected for 2 primary purposes:
> - watchdog NMI from LAPIC
> - "system" NMIs (like due to SERR)

- Perf/Oprofile.  This is currently mutually exclusive with Xen using
the watchdog, but needn't be and hopefully won't be in the future.

>
> Most of the time we deal with watchdog NMIs, while all others should be
> somewhat rare. The thing is, we actually need to read I/O port 61h on
> system NMIs only. 
>
> If the main problem lies in a flow of SMIs due to reading port 61h on
> every NMI watchdog tick -- why not to avoid reading it?
>
> There are at least 2 ways to check if the NMI was due to a watchdog
> tick:
> - LAPIC (SDM states that "When a performance monitoring counters
> interrupt is generated, the mask bit for its associated LVT entry is
> set")
> - perf MSR overflow bit
>
> So, if we detect it was a NMI due to a watchdog using these
> methods (early in the NMI handler), we can avoid touching the port 61h
> and thus triggering SMI I/O trap on it.

The problem is having multiple NMIs arriving.  Like all other edge
triggered interrupts, extra arrivals get dropped.  By skipping the 0x61
read if we believe it was a watchdog NMI, we've opened a race condition
where we will completely miss the system NMI.

~Andrew

___
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

Re: [Xen-devel] [PATCH] x86/nmi: lower initial watchdog frequency to avoid boot hangs

2018-02-08 Thread Alexey G
On Thu, 8 Feb 2018 10:47:45 +
Igor Druzhinin  wrote:
>I've done this measurement before. So what we are seeing exactly is
>that the time we are spending in SMI is spiking (sometimes up to
>200ms) at the moment we go through INIT-SIPI-SIPI sequence. Looks like
>this is enough to push the system into a livelock spiral. So I agree
>with Jan to some point that the proposed workaround might not be
>working on some systems.

According to the Xen code, NMI expected for 2 primary purposes:
- watchdog NMI from LAPIC
- "system" NMIs (like due to SERR)

Most of the time we deal with watchdog NMIs, while all others should be
somewhat rare. The thing is, we actually need to read I/O port 61h on
system NMIs only. 

If the main problem lies in a flow of SMIs due to reading port 61h on
every NMI watchdog tick -- why not to avoid reading it?

There are at least 2 ways to check if the NMI was due to a watchdog
tick:
- LAPIC (SDM states that "When a performance monitoring counters
interrupt is generated, the mask bit for its associated LVT entry is
set")
- perf MSR overflow bit

So, if we detect it was a NMI due to a watchdog using these
methods (early in the NMI handler), we can avoid touching the port 61h
and thus triggering SMI I/O trap on it.

>> There might be a chance that perf counter frequency is calculated
>> wrong for some systems, resulting in a very high rate of NMI
>> watchdog ticks instead of long SMI handler execution time. >10ms
>> just looks... too extreme.
>>   
>
>We ruled that out.
>
>> Huawei Server 2488 V5 BIOS -- similar SMI I/O trap handler for the
>> port 61h found. Some differences with gigabyte H270 system though:
>> 
>> - no "allocated" I/O traps anymore, but one additional SMI I/O trap
>>   encountered: port 900h, dword size. Possibly related to PCIe PM
>>   facilities.
>> 
>> - port 61h SMI handler now has multiple calls to debug/assert stub
>>   functions -- there might be a chance that some of impacted systems
>>   had debug build on, resulting in those stubs expanded to some real
>>   debugging code with negative impact on SMI handling speed.
>> 
>> Few additional observations:
>> 
>> - port 61h I/O Trap SMI handler checks accessed I/O address/size to
>> be equal to 61h/1byte. There might be some difference when reading
>> port 61h via inw(0x60)/inl(0x60)/etc
>> 
>> - looks like there exist an alternative way to read NMI status
>> without triggering SMI -- via ports 63h/65h/67h, but this depends on
>>   undocumented bit in Generic Control and Status register
>>   


___
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

Re: [Xen-devel] [PATCH] x86/nmi: lower initial watchdog frequency to avoid boot hangs

2018-02-08 Thread Andrew Cooper
On 08/02/18 09:12, Jan Beulich wrote:
 On 07.02.18 at 18:08,  wrote:
>> On 07/02/18 15:06, Jan Beulich wrote:
>>> Also you completely ignore my argument against the seemingly
>>> random division by 10, including the resulting question of what you
>>> mean to do once 10Hz also turns out too high a frequency.
>> We've got to pick a frequency.  The current 100Hz is just as arbitrary
>> as the proposed new 10Hz.
> Not exactly - the 100Hz is simply the frequency we run the main tick
> at, so while random it is not as random as any further derived value
> which has no proper reason behind it.
>
> There's one more point wrt your argument of overhead: If servicing
> an NMI takes that long on those boxes, you're basically saying you
> are happy to waste at least 1% of a core's bandwidth on a
> debugging feature. Is that reasonable for a production setup? And
> considering that I'd expect the patch to have chosen e.g. HZ / 5,
> HZ / 4, or even HZ / 2 if that worked reliably, I could even conclude
> you're happy to spend somewhere between 5 and 10% of one
> core's bandwidth. (FAOD all this is based on the 1Hz frequency we
> - iirc - run the NMI at later on.) To me this is another clear argument
> to turn off the watchdog on those systems, rather than trying to
> "fix" its probing.

It is not a debugging feature; it's a reliability feature.  With
clustered storage in particular, it is absolutely paramount to guarantee
that a struggling host fences itself cleanly, or you lose the entire
cluster.

This particular issue is a failure to boot, but by far the most common
issue we see in the field is a fence when all-but-one CPU is waiting in
the calibration rendezvous, by which point the host has effectively been
dead for 5 seconds already.  Turning the watchdog off isn't a viable or
reasonable solution to the problem.

We switch the NMI frequency to ~2Hz after the calibration, but that is
after having run the BSP at 100Hz for a long period of time, and the APs
at this rate for a short while.  Irrespective of the exact fix here, it
is simply not a good idea to be running with this NMI frequency, other
than possibly during the immediate calibration logic.

~Andrew

___
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

Re: [Xen-devel] [PATCH] x86/nmi: lower initial watchdog frequency to avoid boot hangs

2018-02-08 Thread Igor Druzhinin
On 08/02/18 06:37, Alexey G wrote:
> On Wed, 7 Feb 2018 13:01:08 +
> Igor Druzhinin  wrote:
>> So far the issue confirmed:
>> Dell PowerEdge R740, Huawei systems based on Xeon Gold 6152 (the one
>> that it was tested on), Intel S2600XX, etc.
>>
>> Also see:
>> https://bugs.xenserver.org/browse/XSO-774
>>
>> Well, no-watchdog is what we currently recommend in that case but we
>> hoped there is a general solution here from Xen side. You have your
>> point that they should fix this on their side because it's their fault
>> indeed. But the user experience is also important for us I think.
> 
> Igor,
> 
> It would be nice to measure the actual SMI handling time on affected
> systems (eg. via rdtsc before/after inb(0x61) + averaging for
> multiple reads perhaps), is it really 10+ ms.
> 

I've done this measurement before. So what we are seeing exactly is that
the time we are spending in SMI is spiking (sometimes up to 200ms) at
the moment we go through INIT-SIPI-SIPI sequence. Looks like this is
enough to push the system into a livelock spiral. So I agree with Jan to
some point that the proposed workaround might not be working on some
systems.

> There might be a chance that perf counter frequency is calculated wrong
> for some systems, resulting in a very high rate of NMI watchdog ticks
> instead of long SMI handler execution time. >10ms just looks... too
> extreme.
> 

We ruled that out.

> Huawei Server 2488 V5 BIOS -- similar SMI I/O trap handler for the port
> 61h found. Some differences with gigabyte H270 system though:
> 
> - no "allocated" I/O traps anymore, but one additional SMI I/O trap
>   encountered: port 900h, dword size. Possibly related to PCIe PM
>   facilities.
> 
> - port 61h SMI handler now has multiple calls to debug/assert stub
>   functions -- there might be a chance that some of impacted systems
>   had debug build on, resulting in those stubs expanded to some real
>   debugging code with negative impact on SMI handling speed.
> 
> Few additional observations:
> 
> - port 61h I/O Trap SMI handler checks accessed I/O address/size to be
>   equal to 61h/1byte. There might be some difference when reading port
>   61h via inw(0x60)/inl(0x60)/etc
> 
> - looks like there exist an alternative way to read NMI status without
>   triggering SMI -- via ports 63h/65h/67h, but this depends on
>   undocumented bit in Generic Control and Status register
> 

___
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

Re: [Xen-devel] [PATCH] x86/nmi: lower initial watchdog frequency to avoid boot hangs

2018-02-08 Thread Jan Beulich
>>> On 07.02.18 at 18:08,  wrote:
> On 07/02/18 15:06, Jan Beulich wrote:
>> Also you completely ignore my argument against the seemingly
>> random division by 10, including the resulting question of what you
>> mean to do once 10Hz also turns out too high a frequency.
> 
> We've got to pick a frequency.  The current 100Hz is just as arbitrary
> as the proposed new 10Hz.

Not exactly - the 100Hz is simply the frequency we run the main tick
at, so while random it is not as random as any further derived value
which has no proper reason behind it.

There's one more point wrt your argument of overhead: If servicing
an NMI takes that long on those boxes, you're basically saying you
are happy to waste at least 1% of a core's bandwidth on a
debugging feature. Is that reasonable for a production setup? And
considering that I'd expect the patch to have chosen e.g. HZ / 5,
HZ / 4, or even HZ / 2 if that worked reliably, I could even conclude
you're happy to spend somewhere between 5 and 10% of one
core's bandwidth. (FAOD all this is based on the 1Hz frequency we
- iirc - run the NMI at later on.) To me this is another clear argument
to turn off the watchdog on those systems, rather than trying to
"fix" its probing.

Jan


___
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

Re: [Xen-devel] [PATCH] x86/nmi: lower initial watchdog frequency to avoid boot hangs

2018-02-07 Thread Alexey G
On Wed, 7 Feb 2018 13:01:08 +
Igor Druzhinin  wrote:
>So far the issue confirmed:
>Dell PowerEdge R740, Huawei systems based on Xeon Gold 6152 (the one
>that it was tested on), Intel S2600XX, etc.
>
>Also see:
>https://bugs.xenserver.org/browse/XSO-774
>
>Well, no-watchdog is what we currently recommend in that case but we
>hoped there is a general solution here from Xen side. You have your
>point that they should fix this on their side because it's their fault
>indeed. But the user experience is also important for us I think.

Igor,

It would be nice to measure the actual SMI handling time on affected
systems (eg. via rdtsc before/after inb(0x61) + averaging for
multiple reads perhaps), is it really 10+ ms.

There might be a chance that perf counter frequency is calculated wrong
for some systems, resulting in a very high rate of NMI watchdog ticks
instead of long SMI handler execution time. >10ms just looks... too
extreme.

Huawei Server 2488 V5 BIOS -- similar SMI I/O trap handler for the port
61h found. Some differences with gigabyte H270 system though:

- no "allocated" I/O traps anymore, but one additional SMI I/O trap
  encountered: port 900h, dword size. Possibly related to PCIe PM
  facilities.

- port 61h SMI handler now has multiple calls to debug/assert stub
  functions -- there might be a chance that some of impacted systems
  had debug build on, resulting in those stubs expanded to some real
  debugging code with negative impact on SMI handling speed.

Few additional observations:

- port 61h I/O Trap SMI handler checks accessed I/O address/size to be
  equal to 61h/1byte. There might be some difference when reading port
  61h via inw(0x60)/inl(0x60)/etc

- looks like there exist an alternative way to read NMI status without
  triggering SMI -- via ports 63h/65h/67h, but this depends on
  undocumented bit in Generic Control and Status register

___
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

Re: [Xen-devel] [PATCH] x86/nmi: lower initial watchdog frequency to avoid boot hangs

2018-02-07 Thread Andrew Cooper
On 07/02/18 15:06, Jan Beulich wrote:
 On 07.02.18 at 14:24,  wrote:
>> On 07/02/18 13:08, Jan Beulich wrote:
>> On 07.02.18 at 14:01,  wrote:
 So far the issue confirmed:
 Dell PowerEdge R740, Huawei systems based on Xeon Gold 6152 (the one
 that it was tested on), Intel S2600XX, etc.

 Also see:
 https://bugs.xenserver.org/browse/XSO-774 

 Well, no-watchdog is what we currently recommend in that case but we
 hoped there is a general solution here from Xen side. You have your
 point that they should fix this on their side because it's their fault
 indeed. But the user experience is also important for us I think.
>>> Of course, hence the suggestion of possible alternative workarounds.
>>> Impacting everyone is, as said, not a desirable approach in a case
>>> like this one. I also continue to dislike the seemingly random division
>>> by 10.
>> Xen's usability is crap, which is in large part due to attitude like
>> this.  It is not ok to expect the end user to know diagnose/debug issues
>> like this, and it is entirely unreasonable to expect the end user to
>> have to manually work around it.
> Excuse me? The watchdog is off by default. Anyone turning it on
> ought to know what they do. You (iirc) turning it on unilaterally in
> XenServer puts the burden of avoidng users to have to diagnose
> the issue on you.

And we have taken the burden of diagnosing the issue, as well as
proposing a fix.

>
>> This particular issue does want feeding back to Intel so they can try
>> and fix it, but whatever is wrong is present in a large amount of
>> Skylake systems in the field.  Xen needs to be able to cope.
> But in a reasonable way.
>
>> Finally, as to boot times, your argument is backwards seeing as you care
>> about elapsed boot time.  Slowing the frequency will speed everything
>> up, as we aren't executing a large chunk of the BSP boot path with 100hz
>> NMI constantly interrupting.
> How long does handling a single NMI take? Microseconds, I assume.
> Contrast this with waiting for two NMIs to arrive in wait_for_nmis(),
> which goes up from 20ms to 200ms with this change.

So you're argument is to not change the frequency because an
off-by-default option will *in the best case* add a few hundred
milliseconds extra to the boot time?  Times to boot computers are
measured in minutes, not milliseconds.

I don't know how long servicing an NMI takes, at a minimum of a rdmsr,
wrmsr and then a further mmio write or wrmsr, I doubt it is that quick.

> Also you completely ignore my argument against the seemingly
> random division by 10, including the resulting question of what you
> mean to do once 10Hz also turns out too high a frequency.

We've got to pick a frequency.  The current 100Hz is just as arbitrary
as the proposed new 10Hz.

> I wouldn't, btw, mind an attempt to avoid the high rate NMIs
> during early boot (if those occur in the first place, which from
> two successive replies by Igor yesterday I wasn't sure anymore
> is an actual fact), but that's independent of the issue at hand.

The 100Hz NMI is active from BSP APIC init, IO-APIC, deadline timer
calibration, mwait idle, the entirety of HVM setup and full AP bringup. 
On one of my fastest boxes, it is about 1 second of wallclock time.

~Andrew

___
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

Re: [Xen-devel] [PATCH] x86/nmi: lower initial watchdog frequency to avoid boot hangs

2018-02-07 Thread Jan Beulich
>>> On 07.02.18 at 14:24,  wrote:
> On 07/02/18 13:08, Jan Beulich wrote:
> On 07.02.18 at 14:01,  wrote:
>>> So far the issue confirmed:
>>> Dell PowerEdge R740, Huawei systems based on Xeon Gold 6152 (the one
>>> that it was tested on), Intel S2600XX, etc.
>>>
>>> Also see:
>>> https://bugs.xenserver.org/browse/XSO-774 
>>>
>>> Well, no-watchdog is what we currently recommend in that case but we
>>> hoped there is a general solution here from Xen side. You have your
>>> point that they should fix this on their side because it's their fault
>>> indeed. But the user experience is also important for us I think.
>> Of course, hence the suggestion of possible alternative workarounds.
>> Impacting everyone is, as said, not a desirable approach in a case
>> like this one. I also continue to dislike the seemingly random division
>> by 10.
> 
> Xen's usability is crap, which is in large part due to attitude like
> this.  It is not ok to expect the end user to know diagnose/debug issues
> like this, and it is entirely unreasonable to expect the end user to
> have to manually work around it.

Excuse me? The watchdog is off by default. Anyone turning it on
ought to know what they do. You (iirc) turning it on unilaterally in
XenServer puts the burden of avoidng users to have to diagnose
the issue on you.

> This particular issue does want feeding back to Intel so they can try
> and fix it, but whatever is wrong is present in a large amount of
> Skylake systems in the field.  Xen needs to be able to cope.

But in a reasonable way.

> Finally, as to boot times, your argument is backwards seeing as you care
> about elapsed boot time.  Slowing the frequency will speed everything
> up, as we aren't executing a large chunk of the BSP boot path with 100hz
> NMI constantly interrupting.

How long does handling a single NMI take? Microseconds, I assume.
Contrast this with waiting for two NMIs to arrive in wait_for_nmis(),
which goes up from 20ms to 200ms with this change.

Also you completely ignore my argument against the seemingly
random division by 10, including the resulting question of what you
mean to do once 10Hz also turns out too high a frequency.

I wouldn't, btw, mind an attempt to avoid the high rate NMIs
during early boot (if those occur in the first place, which from
two successive replies by Igor yesterday I wasn't sure anymore
is an actual fact), but that's independent of the issue at hand.

Jan


___
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

Re: [Xen-devel] [PATCH] x86/nmi: lower initial watchdog frequency to avoid boot hangs

2018-02-07 Thread Igor Druzhinin
On 07/02/18 13:08, Jan Beulich wrote:
 On 07.02.18 at 14:01,  wrote:
>> So far the issue confirmed:
>> Dell PowerEdge R740, Huawei systems based on Xeon Gold 6152 (the one
>> that it was tested on), Intel S2600XX, etc.
>>
>> Also see:
>> https://bugs.xenserver.org/browse/XSO-774 
>>
>> Well, no-watchdog is what we currently recommend in that case but we
>> hoped there is a general solution here from Xen side. You have your
>> point that they should fix this on their side because it's their fault
>> indeed. But the user experience is also important for us I think.
> 
> Of course, hence the suggestion of possible alternative workarounds.
> Impacting everyone is, as said, not a desirable approach in a case
> like this one. I also continue to dislike the seemingly random division
> by 10.
> 
> Jan
> 

There is also a workaround by initializing the watchdog later (after SMP
bootstrap) on CPU0 - as Linux does AFAIK. But I don't think this would
be acceptable either.

Igor

___
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

Re: [Xen-devel] [PATCH] x86/nmi: lower initial watchdog frequency to avoid boot hangs

2018-02-07 Thread Andrew Cooper
On 07/02/18 13:08, Jan Beulich wrote:
 On 07.02.18 at 14:01,  wrote:
>> So far the issue confirmed:
>> Dell PowerEdge R740, Huawei systems based on Xeon Gold 6152 (the one
>> that it was tested on), Intel S2600XX, etc.
>>
>> Also see:
>> https://bugs.xenserver.org/browse/XSO-774 
>>
>> Well, no-watchdog is what we currently recommend in that case but we
>> hoped there is a general solution here from Xen side. You have your
>> point that they should fix this on their side because it's their fault
>> indeed. But the user experience is also important for us I think.
> Of course, hence the suggestion of possible alternative workarounds.
> Impacting everyone is, as said, not a desirable approach in a case
> like this one. I also continue to dislike the seemingly random division
> by 10.

Xen's usability is crap, which is in large part due to attitude like
this.  It is not ok to expect the end user to know diagnose/debug issues
like this, and it is entirely unreasonable to expect the end user to
have to manually work around it.

This particular issue does want feeding back to Intel so they can try
and fix it, but whatever is wrong is present in a large amount of
Skylake systems in the field.  Xen needs to be able to cope.

Finally, as to boot times, your argument is backwards seeing as you care
about elapsed boot time.  Slowing the frequency will speed everything
up, as we aren't executing a large chunk of the BSP boot path with 100hz
NMI constantly interrupting.

~Andrew

___
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

Re: [Xen-devel] [PATCH] x86/nmi: lower initial watchdog frequency to avoid boot hangs

2018-02-07 Thread Jan Beulich
>>> On 07.02.18 at 14:01,  wrote:
> So far the issue confirmed:
> Dell PowerEdge R740, Huawei systems based on Xeon Gold 6152 (the one
> that it was tested on), Intel S2600XX, etc.
> 
> Also see:
> https://bugs.xenserver.org/browse/XSO-774 
> 
> Well, no-watchdog is what we currently recommend in that case but we
> hoped there is a general solution here from Xen side. You have your
> point that they should fix this on their side because it's their fault
> indeed. But the user experience is also important for us I think.

Of course, hence the suggestion of possible alternative workarounds.
Impacting everyone is, as said, not a desirable approach in a case
like this one. I also continue to dislike the seemingly random division
by 10.

Jan


___
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

Re: [Xen-devel] [PATCH] x86/nmi: lower initial watchdog frequency to avoid boot hangs

2018-02-07 Thread Igor Druzhinin
On 07/02/18 09:13, Jan Beulich wrote:
 On 06.02.18 at 22:51,  wrote:
>> The problem with a quirk/commandline parameter is that the issue is
>> reported for a wide variety of systems and, as it looks like, depends on
>> the default BIOS setup - means it's hard to identify particular
>> machines. We should obviously sort this out with Intel but until then
>> lowering the initial frequency is our only option.
> 
> "Wide variety" is interesting, considering that we've had no earlier
> reports. As the description of the patch talks about "post-Skylake" -
> are these production machines? If not, a command line option
> would quite certainly be sufficient here. If yes, I'd like "wide variety"
> to be further qualified. After all we're talking about a processing
> overhead on the order of 10ms here, which is absurd. There are
> systems anyway where the watchdog doesn't work - we may need
> to consider to suggest to people to simply not enable the watchdog
> on such systems until the firmware issue has been taken care of.
> 
> As mentioned before - if firmware takes on the order of 10ms to
> process the SMI intercept, I can't see why it wouldn't be possible
> for them to screw up further and take 20, 50, or 100ms, at which
> point your seemingly random HZ / 10 would no longer work either.
> The same goes for the case of someone coming along and
> changing HZ to a higher value (with a good reason provided).
> 
> Jan
> 

So far the issue confirmed:
Dell PowerEdge R740, Huawei systems based on Xeon Gold 6152 (the one
that it was tested on), Intel S2600XX, etc.

Also see:
https://bugs.xenserver.org/browse/XSO-774

Well, no-watchdog is what we currently recommend in that case but we
hoped there is a general solution here from Xen side. You have your
point that they should fix this on their side because it's their fault
indeed. But the user experience is also important for us I think.

Igor

___
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

Re: [Xen-devel] [PATCH] x86/nmi: lower initial watchdog frequency to avoid boot hangs

2018-02-07 Thread Jan Beulich
>>> On 06.02.18 at 22:51,  wrote:
> The problem with a quirk/commandline parameter is that the issue is
> reported for a wide variety of systems and, as it looks like, depends on
> the default BIOS setup - means it's hard to identify particular
> machines. We should obviously sort this out with Intel but until then
> lowering the initial frequency is our only option.

"Wide variety" is interesting, considering that we've had no earlier
reports. As the description of the patch talks about "post-Skylake" -
are these production machines? If not, a command line option
would quite certainly be sufficient here. If yes, I'd like "wide variety"
to be further qualified. After all we're talking about a processing
overhead on the order of 10ms here, which is absurd. There are
systems anyway where the watchdog doesn't work - we may need
to consider to suggest to people to simply not enable the watchdog
on such systems until the firmware issue has been taken care of.

As mentioned before - if firmware takes on the order of 10ms to
process the SMI intercept, I can't see why it wouldn't be possible
for them to screw up further and take 20, 50, or 100ms, at which
point your seemingly random HZ / 10 would no longer work either.
The same goes for the case of someone coming along and
changing HZ to a higher value (with a good reason provided).

Jan


___
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

Re: [Xen-devel] [PATCH] x86/nmi: lower initial watchdog frequency to avoid boot hangs

2018-02-06 Thread Alexey G

 If the actual SMI source is not related to some place in the NMI
 handler code but was eg. due to some SMI timer, lowering NMI
 watchdog frequency might not fix the issue completely, but lower
 its reproducibility (perhaps to some very rare occurrences). So
 it's better be sure what was the real source of SMI.
 
>>>
>>> This *is* related to this instruction - it was confirmed
>>> empirically. Removing this instruction stops SMIs from occurring
>>> and effectively removes the issue leaving the frequency unchanged.  
>> 
>> Hmm, it would be interesting to know for what evil purpose does it
>> need to trap I/O port 61h.
>> BTW, on which motherboard model the issue was reproduced?
>>   
>
>The issue has been reported for some Dell/Huawei Skylake platforms (one
>of them PowerEdge R740 to be precise) but I don't think the others are
>unaffected (the issue supposedly originates from Intel's reference
>code)
>- the default BIOS setup indeed matters.

Here is a bit of info you might find useful. I did a quick research on
my test system (Gigabyte GA-H270M-D3H) in order to confirm if BIOS traps
I/O port 61h (NMI status) and for what purposes.

Well, turns out it really does.
Moreover, it's actually the only fixed I/O port location trapped by SMI
I/O traps on this system. Few others are simply 'allocated' ones,
meaning the real I/O port address being trapped is chosen dynamically by
supplying Address=0 to a corresponding call to EFI I/O Trap interface
function -- such I/O traps may be used as interfaces with a SMI handler
in a manner similar to the SW SMI interface.

The EFI module responsible for installing port 61h SMI I/O Trap is
PchInitSmm in my case. The related code is:
...
mov eax, 61h
lea r9, qword_5778
mov [rsp+98h+io_trap_ctx.io_address], ax
mov rax, cs:pIoTrapIF
lea r8, [rsp+98h+io_trap_ctx]
lea rdx, Port61h_IoTrapHandler
mov rcx, rax
mov [rsp+98h+io_trap_ctx.trap_type], ebp  ; trap reads
mov [rsp+98h+io_trap_ctx.io_len], bp  ; ebp=1
callqword ptr [rax]
...

The actual handler (named Port61h_IoTrapHandler in the above code) is
fairly lightweight and does a bit of useless black magic.

First, there is a loop for all CPUs which finds which CPU actually
caused trapped I/O operation by reading NMI status port.

Then it reads the original port 61h value and set NMI_SC bit4 to its
inverted previous state for the selected CPU' bit. And then updated AL
register value is returned to the NMI_SC-reading user code (via
patching RAX register value in SMRAM saved state):

; ebp = 61h, rbx = CPU index
...
mov edx, ebp
in  al, dx

mov r8, cs:bmNmiRefTogglesForCpus
mov rcx, rbx
mov edx, 1
shl edx, cl
mov r9, rbx
movsxd  rcx, edx
mov dl, al
and al, 0EFh
xor r8, rcx
or  dl, 10h
mov cs:bmNmiRefTogglesForCpus, r8
and r8, rcx
movzx   ecx, al
movzx   eax, dl
testr8, r8
mov edx, 1
cmovnz  ecx, eax
lea rax, [rsp+58h+al_to_return]
lea r8d, [rdx+25h]  ; EFI_SMM_SAVE_STATE_REGISTER_RAX
mov [rsp+58h+func_arg0], rax
mov rax, cs:pEFI_SMM_CPU_PROTOCOL_GUID_IF
mov [rsp+58h+al_to_return], cl
mov rcx, rax
callqword ptr [rax+8] ; WriteSaveState
...

So, the only purpose of this stuff is emulating REF_TOGGLE bit toggling
logic (simply by alternating ones and zeros on each NMI_SC read),
nothing more. Sort of workaround for some legacy code which depends on
REF_TOGGLE rolling (which is now being marked Reserved in docs).

On this particular system SMI I/O trap for port 61h neither do anything
time-consuming nor anything really useful. That Dell system must have
something similar (thanks to common EFI ref code from Intel Igor
mentioned), leaving the question why port 61h reading is so slow there.

___
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

Re: [Xen-devel] [PATCH] x86/nmi: lower initial watchdog frequency to avoid boot hangs

2018-02-06 Thread Igor Druzhinin
On 06/02/18 16:23, Jan Beulich wrote:
 On 06.02.18 at 17:14,  wrote:
>> On 06/02/18 16:07, Jan Beulich wrote:
>> On 05.02.18 at 22:18,  wrote:
 --- a/xen/arch/x86/nmi.c
 +++ b/xen/arch/x86/nmi.c
 @@ -34,7 +34,8 @@
  #include 
  
  unsigned int nmi_watchdog = NMI_NONE;
 -static unsigned int nmi_hz = HZ;
 +/* initial watchdog frequency - shouldn't be too high to avoid boot hangs 
>> */
 +static unsigned int nmi_hz = HZ / 10;
>>>
>>> For one - the comment should explain what "too high" means.
>>> Further - what if on another system 10Hz is still too high? I also hope
>>> you realize that you slow down boot a little for everyone just
>>> because of this one machine model. Can the lower frequency perhaps
>>> be set via DMI quirk, or otherwise obtain a command line override
>>> (perhaps something like "watchdog=probe:10Hz")?
>>>
>>
>> I can improve the comment message.
>> Why does this change slow down anything while I'm lowering the frequency
>> - not making it higher?
> 
> We wait for two occurrences of the NMI in wait_for_nmis().
> 
>> The alternative approach would be to reshuffle
>> the code and take the reason before programming the next interrupt as
>> suggested before. In that case the actual frequency would be adjusted
>> naturally I think.
> 
> Thinking about this, reading the reason early seems like a good idea
> to me irrespective of the issue here.
>

I ran a couple of experiments with different layouts in NMI handler:
it looks like it doesn't really help as merely having this instruction
inside the handler and running it at 100Hz breaks a number of timeouts
in SMP bootstrap code and makes it unstable. So we are back to lowering
the frequency as I'm now out of ideas.

The problem with a quirk/commandline parameter is that the issue is
reported for a wide variety of systems and, as it looks like, depends on
the default BIOS setup - means it's hard to identify particular
machines. We should obviously sort this out with Intel but until then
lowering the initial frequency is our only option.

Igor

___
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

Re: [Xen-devel] [PATCH] x86/nmi: lower initial watchdog frequency to avoid boot hangs

2018-02-06 Thread Alexey G
On Tue, 6 Feb 2018 17:21:19 +
Igor Druzhinin  wrote:
>On 06/02/18 17:08, Alexey G wrote:
>> The major concern here is the possiblity of SMI being triggered _not_
>> by some specific I/O port access. Primarily, if it actually was a
>> periodic SMI.
>> 
>> If the actual SMI source is not related to some place in the NMI
>> handler code but was eg. due to some SMI timer, lowering NMI watchdog
>> frequency might not fix the issue completely, but lower its
>> reproducibility (perhaps to some very rare occurrences). So it's
>> better be sure what was the real source of SMI.
>>   
>
>This *is* related to this instruction - it was confirmed empirically.
>Removing this instruction stops SMIs from occurring and effectively
>removes the issue leaving the frequency unchanged.

Hmm, it would be interesting to know for what evil purpose does it need
to trap I/O port 61h.
BTW, on which motherboard model the issue was reproduced?

>> 2. According to the code, it looks like NMI status reading happens
>> while NMIs are still blocked -- this means that SMI handler must
>> exec IRET by itself to reset NMI blocking state -- again, this is
>> possible (eg. in unreal->protmode switching code), but not likely.
>>   
>
>According to SDM one NMI might be pending while taken in SMI mode (see
>ch. 34.8). This is actually even true if NMI comes while servicing
>another NMI. So when we return to the NMI handler from SMI and finish
>it properly the next one appears immediately.

If the SMI handler doesn't mess up with NMI blocking state, it
means that SMI handler processes every reading of port 61h longer than a
watchdog NMI period duration... which is quite long. Motherboard vendor
did something very wrong with I/O trap handling in the SMI handler code
if it takes so much.

___
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

Re: [Xen-devel] [PATCH] x86/nmi: lower initial watchdog frequency to avoid boot hangs

2018-02-06 Thread Igor Druzhinin
On 06/02/18 17:08, Alexey G wrote:
> On Tue, 6 Feb 2018 14:21:12 +
> Andrew Cooper  wrote:
> 
>> On 06/02/18 03:10, Alexey G wrote:
>>> I/O port 61h normally is not emulated by SMI legacy kbd handling code
>>> in BIOS, only ports like 60h, 64h, etc.
>>> Contrary to USB legacy emulation, it has to intercept port 61h via a
>>> different approach -- generic SMI I/O trap, which is not common (at
>>> least it was) to use by BIOSes... although it is possible as EFI
>>> interface and code for this is available. The assumption about port
>>> 61h being trapped by the SMI handler must be explicitly confirmed by
>>> checking I/O Trap control regs in the RCBA region.
>>>
>>> If I/O trap regs won't show an active I/O trap on I/O port 61h -- the
>>> root cause might be different (might even be related to stuff like
>>> NMI2SMI logic).
>>>
>>> If the problem is actually due to NMI handler being preempted by
>>> another NMI which occurred after (a long) execution of triggered SMI
>>> handler, it might be better to do all sensitive stuff before
>>> re-enabling NMIs by IRET in the NMI handler.  
>>
>> The problem is that the SMI handler executes enough instructions to
>> trigger another NMI (which is based on the retired instruction count),
>> which gets delivered once the SMI handler returns, and servicing the
>> NMI triggers a new SMI, which triggers a new NMI.  This is why the
>> system stalls.
>>
>> I'll leave the how/why port 0x61 is trapping to SMI to Igor, but it is
>> only a secondary concern here.  We cannot reasonably have the watchdog
>> able to trip because of exclusively SMI activity, or we'll potentially
>> livelock anywhere.
> 
> The major concern here is the possiblity of SMI being triggered _not_
> by some specific I/O port access. Primarily, if it actually was a
> periodic SMI.
> 
> If the actual SMI source is not related to some place in the NMI
> handler code but was eg. due to some SMI timer, lowering NMI watchdog
> frequency might not fix the issue completely, but lower its
> reproducibility (perhaps to some very rare occurrences). So it's better
> be sure what was the real source of SMI.
> 

This *is* related to this instruction - it was confirmed empirically.
Removing this instruction stops SMIs from occurring and effectively
removes the issue leaving the frequency unchanged.

> There are 2 weak spots in the analysis:
> 
> 1. Triggering SMI by reading I/O port 61h -- intercepting the port 61h
> is non-typical behavior for BIOS (unlike 60h/64h)
> 

Apparently, it is now.

> 2. According to the code, it looks like NMI status reading happens while
> NMIs are still blocked -- this means that SMI handler must exec IRET
> by itself to reset NMI blocking state -- again, this is possible (eg.
> in unreal->protmode switching code), but not likely.
> 

According to SDM one NMI might be pending while taken in SMI mode (see
ch. 34.8). This is actually even true if NMI comes while servicing
another NMI. So when we return to the NMI handler from SMI and finish it
properly the next one appears immediately.

Igor

___
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

Re: [Xen-devel] [PATCH] x86/nmi: lower initial watchdog frequency to avoid boot hangs

2018-02-06 Thread Alexey G
On Tue, 6 Feb 2018 14:21:12 +
Andrew Cooper  wrote:

>On 06/02/18 03:10, Alexey G wrote:
>> I/O port 61h normally is not emulated by SMI legacy kbd handling code
>> in BIOS, only ports like 60h, 64h, etc.
>> Contrary to USB legacy emulation, it has to intercept port 61h via a
>> different approach -- generic SMI I/O trap, which is not common (at
>> least it was) to use by BIOSes... although it is possible as EFI
>> interface and code for this is available. The assumption about port
>> 61h being trapped by the SMI handler must be explicitly confirmed by
>> checking I/O Trap control regs in the RCBA region.
>>
>> If I/O trap regs won't show an active I/O trap on I/O port 61h -- the
>> root cause might be different (might even be related to stuff like
>> NMI2SMI logic).
>>
>> If the problem is actually due to NMI handler being preempted by
>> another NMI which occurred after (a long) execution of triggered SMI
>> handler, it might be better to do all sensitive stuff before
>> re-enabling NMIs by IRET in the NMI handler.  
>
>The problem is that the SMI handler executes enough instructions to
>trigger another NMI (which is based on the retired instruction count),
>which gets delivered once the SMI handler returns, and servicing the
>NMI triggers a new SMI, which triggers a new NMI.  This is why the
>system stalls.
>
>I'll leave the how/why port 0x61 is trapping to SMI to Igor, but it is
>only a secondary concern here.  We cannot reasonably have the watchdog
>able to trip because of exclusively SMI activity, or we'll potentially
>livelock anywhere.

The major concern here is the possiblity of SMI being triggered _not_
by some specific I/O port access. Primarily, if it actually was a
periodic SMI.

If the actual SMI source is not related to some place in the NMI
handler code but was eg. due to some SMI timer, lowering NMI watchdog
frequency might not fix the issue completely, but lower its
reproducibility (perhaps to some very rare occurrences). So it's better
be sure what was the real source of SMI.

There are 2 weak spots in the analysis:

1. Triggering SMI by reading I/O port 61h -- intercepting the port 61h
is non-typical behavior for BIOS (unlike 60h/64h)

2. According to the code, it looks like NMI status reading happens while
NMIs are still blocked -- this means that SMI handler must exec IRET
by itself to reset NMI blocking state -- again, this is possible (eg.
in unreal->protmode switching code), but not likely.

___
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

Re: [Xen-devel] [PATCH] x86/nmi: lower initial watchdog frequency to avoid boot hangs

2018-02-06 Thread Igor Druzhinin
On 06/02/18 16:23, Jan Beulich wrote:
 On 06.02.18 at 17:14,  wrote:
>> On 06/02/18 16:07, Jan Beulich wrote:
>> On 05.02.18 at 22:18,  wrote:
 --- a/xen/arch/x86/nmi.c
 +++ b/xen/arch/x86/nmi.c
 @@ -34,7 +34,8 @@
  #include 
  
  unsigned int nmi_watchdog = NMI_NONE;
 -static unsigned int nmi_hz = HZ;
 +/* initial watchdog frequency - shouldn't be too high to avoid boot hangs 
>> */
 +static unsigned int nmi_hz = HZ / 10;
>>>
>>> For one - the comment should explain what "too high" means.
>>> Further - what if on another system 10Hz is still too high? I also hope
>>> you realize that you slow down boot a little for everyone just
>>> because of this one machine model. Can the lower frequency perhaps
>>> be set via DMI quirk, or otherwise obtain a command line override
>>> (perhaps something like "watchdog=probe:10Hz")?
>>>
>>
>> I can improve the comment message.
>> Why does this change slow down anything while I'm lowering the frequency
>> - not making it higher?
> 
> We wait for two occurrences of the NMI in wait_for_nmis().

Yes, you right, ignore my previous mail. We are waiting synchronously
for them. I'll rewrite the code so I'll not change the frequency.

Igor

> 
>> The alternative approach would be to reshuffle
>> the code and take the reason before programming the next interrupt as
>> suggested before. In that case the actual frequency would be adjusted
>> naturally I think.
> 
> Thinking about this, reading the reason early seems like a good idea
> to me irrespective of the issue here.
> 
> Jan
> 

___
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

Re: [Xen-devel] [PATCH] x86/nmi: lower initial watchdog frequency to avoid boot hangs

2018-02-06 Thread Igor Druzhinin
On 06/02/18 16:23, Jan Beulich wrote:
 On 06.02.18 at 17:14,  wrote:
>> On 06/02/18 16:07, Jan Beulich wrote:
>> On 05.02.18 at 22:18,  wrote:
 --- a/xen/arch/x86/nmi.c
 +++ b/xen/arch/x86/nmi.c
 @@ -34,7 +34,8 @@
  #include 
  
  unsigned int nmi_watchdog = NMI_NONE;
 -static unsigned int nmi_hz = HZ;
 +/* initial watchdog frequency - shouldn't be too high to avoid boot hangs 
>> */
 +static unsigned int nmi_hz = HZ / 10;
>>>
>>> For one - the comment should explain what "too high" means.
>>> Further - what if on another system 10Hz is still too high? I also hope
>>> you realize that you slow down boot a little for everyone just
>>> because of this one machine model. Can the lower frequency perhaps
>>> be set via DMI quirk, or otherwise obtain a command line override
>>> (perhaps something like "watchdog=probe:10Hz")?
>>>
>>
>> I can improve the comment message.
>> Why does this change slow down anything while I'm lowering the frequency
>> - not making it higher?
> 
> We wait for two occurrences of the NMI in wait_for_nmis().
> 

This happens *much* later in the boot sequence after we actually enable
the watchdog on CPU0. Till that time we already get hundreds of them
regardless of the frequency.

>> The alternative approach would be to reshuffle
>> the code and take the reason before programming the next interrupt as
>> suggested before. In that case the actual frequency would be adjusted
>> naturally I think.
> 
> Thinking about this, reading the reason early seems like a good idea
> to me irrespective of the issue here.
> 
> Jan
> 

___
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

Re: [Xen-devel] [PATCH] x86/nmi: lower initial watchdog frequency to avoid boot hangs

2018-02-06 Thread Jan Beulich
>>> On 06.02.18 at 17:14,  wrote:
> On 06/02/18 16:07, Jan Beulich wrote:
> On 05.02.18 at 22:18,  wrote:
>>> --- a/xen/arch/x86/nmi.c
>>> +++ b/xen/arch/x86/nmi.c
>>> @@ -34,7 +34,8 @@
>>>  #include 
>>>  
>>>  unsigned int nmi_watchdog = NMI_NONE;
>>> -static unsigned int nmi_hz = HZ;
>>> +/* initial watchdog frequency - shouldn't be too high to avoid boot hangs 
> */
>>> +static unsigned int nmi_hz = HZ / 10;
>> 
>> For one - the comment should explain what "too high" means.
>> Further - what if on another system 10Hz is still too high? I also hope
>> you realize that you slow down boot a little for everyone just
>> because of this one machine model. Can the lower frequency perhaps
>> be set via DMI quirk, or otherwise obtain a command line override
>> (perhaps something like "watchdog=probe:10Hz")?
>>
> 
> I can improve the comment message.
> Why does this change slow down anything while I'm lowering the frequency
> - not making it higher?

We wait for two occurrences of the NMI in wait_for_nmis().

> The alternative approach would be to reshuffle
> the code and take the reason before programming the next interrupt as
> suggested before. In that case the actual frequency would be adjusted
> naturally I think.

Thinking about this, reading the reason early seems like a good idea
to me irrespective of the issue here.

Jan


___
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

Re: [Xen-devel] [PATCH] x86/nmi: lower initial watchdog frequency to avoid boot hangs

2018-02-06 Thread Igor Druzhinin
On 06/02/18 16:07, Jan Beulich wrote:
 On 05.02.18 at 22:18,  wrote:
>> --- a/xen/arch/x86/nmi.c
>> +++ b/xen/arch/x86/nmi.c
>> @@ -34,7 +34,8 @@
>>  #include 
>>  
>>  unsigned int nmi_watchdog = NMI_NONE;
>> -static unsigned int nmi_hz = HZ;
>> +/* initial watchdog frequency - shouldn't be too high to avoid boot hangs */
>> +static unsigned int nmi_hz = HZ / 10;
> 
> For one - the comment should explain what "too high" means.
> Further - what if on another system 10Hz is still too high? I also hope
> you realize that you slow down boot a little for everyone just
> because of this one machine model. Can the lower frequency perhaps
> be set via DMI quirk, or otherwise obtain a command line override
> (perhaps something like "watchdog=probe:10Hz")?
>

I can improve the comment message.
Why does this change slow down anything while I'm lowering the frequency
- not making it higher? The alternative approach would be to reshuffle
the code and take the reason before programming the next interrupt as
suggested before. In that case the actual frequency would be adjusted
naturally I think.

Igor

> Jan
> 

___
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

Re: [Xen-devel] [PATCH] x86/nmi: lower initial watchdog frequency to avoid boot hangs

2018-02-06 Thread Jan Beulich
>>> On 05.02.18 at 22:18,  wrote:
> --- a/xen/arch/x86/nmi.c
> +++ b/xen/arch/x86/nmi.c
> @@ -34,7 +34,8 @@
>  #include 
>  
>  unsigned int nmi_watchdog = NMI_NONE;
> -static unsigned int nmi_hz = HZ;
> +/* initial watchdog frequency - shouldn't be too high to avoid boot hangs */
> +static unsigned int nmi_hz = HZ / 10;

For one - the comment should explain what "too high" means.
Further - what if on another system 10Hz is still too high? I also hope
you realize that you slow down boot a little for everyone just
because of this one machine model. Can the lower frequency perhaps
be set via DMI quirk, or otherwise obtain a command line override
(perhaps something like "watchdog=probe:10Hz")?

Jan


___
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

Re: [Xen-devel] [PATCH] x86/nmi: lower initial watchdog frequency to avoid boot hangs

2018-02-06 Thread Andrew Cooper
On 06/02/18 03:10, Alexey G wrote:
> On Mon, 5 Feb 2018 21:18:42 +
> Igor Druzhinin  wrote:
>
>> We're noticing a reproducible system boot hang on certain
>> post-Skylake platforms where the BIOS is configured in
>> legacy boot mode with x2APIC disabled. The system stalls
>> immediately after writing the first SMP initialization
>> sequence into APIC ICR.
>>
>> The cause of the problem is watchdog NMI handler execution -
>> somewhere near the end of NMI handling (after it's already
>> rescheduled the next NMI) it tries to access IO port 0x61
>> to get the actual NMI reason on CPU0. Unfortunately, this
>> port is emulated by BIOS using SMIs and this emulation
>> apparently might take more than we expect under certain
>> conditions. As the result, the system is constantly moving
>> between NMI and SMI handler and not making any progress.
>>
>> Just lower the initial frequency for now as we lower it later
>> even more anyway.
> I/O port 61h normally is not emulated by SMI legacy kbd handling code
> in BIOS, only ports like 60h, 64h, etc.
> Contrary to USB legacy emulation, it has to intercept port 61h via a
> different approach -- generic SMI I/O trap, which is not common (at least
> it was) to use by BIOSes... although it is possible as EFI interface and
> code for this is available. The assumption about port 61h being trapped by
> the SMI handler must be explicitly confirmed by checking I/O Trap control
> regs in the RCBA region.
>
> If I/O trap regs won't show an active I/O trap on I/O port 61h -- the
> root cause might be different (might even be related to stuff like
> NMI2SMI logic).
>
> If the problem is actually due to NMI handler being preempted by another
> NMI which occurred after (a long) execution of triggered SMI handler, it
> might be better to do all sensitive stuff before re-enabling NMIs by IRET in
> the NMI handler.

The problem is that the SMI handler executes enough instructions to
trigger another NMI (which is based on the retired instruction count),
which gets delivered once the SMI handler returns, and servicing the NMI
triggers a new SMI, which triggers a new NMI.  This is why the system
stalls.

I'll leave the how/why port 0x61 is trapping to SMI to Igor, but it is
only a secondary concern here.  We cannot reasonably have the watchdog
able to trip because of exclusively SMI activity, or we'll potentially
livelock anywhere.

~Andrew

___
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

Re: [Xen-devel] [PATCH] x86/nmi: lower initial watchdog frequency to avoid boot hangs

2018-02-06 Thread Andrew Cooper
On 05/02/18 21:18, Igor Druzhinin wrote:
> We're noticing a reproducible system boot hang on certain
> post-Skylake platforms where the BIOS is configured in

Its just a plain Skylake Server, from what I can see.

> legacy boot mode with x2APIC disabled. The system stalls
> immediately after writing the first SMP initialization
> sequence into APIC ICR.
>
> The cause of the problem is watchdog NMI handler execution -
> somewhere near the end of NMI handling (after it's already
> rescheduled the next NMI) it tries to access IO port 0x61
> to get the actual NMI reason on CPU0. Unfortunately, this
> port is emulated by BIOS using SMIs and this emulation
> apparently might take more than we expect under certain
> conditions. As the result, the system is constantly moving
> between NMI and SMI handler and not making any progress.
>
> Just lower the initial frequency for now as we lower it later
> even more anyway.
>
> Signed-off-by: Igor Druzhinin 

Acked-by: Andrew Cooper 

I can independently confirm these findings, and that the fix works.  The
NMI watchdog setup is rather crazy and complicated, but lets not get
into that rats nest here. */

> ---
>  xen/arch/x86/nmi.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/xen/arch/x86/nmi.c b/xen/arch/x86/nmi.c
> index d7fce28..1eb2a32 100644
> --- a/xen/arch/x86/nmi.c
> +++ b/xen/arch/x86/nmi.c
> @@ -34,7 +34,8 @@
>  #include 
>  
>  unsigned int nmi_watchdog = NMI_NONE;
> -static unsigned int nmi_hz = HZ;
> +/* initial watchdog frequency - shouldn't be too high to avoid boot hangs */
> +static unsigned int nmi_hz = HZ / 10;
>  static unsigned int nmi_perfctr_msr; /* the MSR to reset in NMI handler */
>  static unsigned int nmi_p4_cccr_val;
>  static DEFINE_PER_CPU(struct timer, nmi_timer);


___
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

Re: [Xen-devel] [PATCH] x86/nmi: lower initial watchdog frequency to avoid boot hangs

2018-02-05 Thread Alexey G
On Mon, 5 Feb 2018 21:18:42 +
Igor Druzhinin  wrote:

>We're noticing a reproducible system boot hang on certain
>post-Skylake platforms where the BIOS is configured in
>legacy boot mode with x2APIC disabled. The system stalls
>immediately after writing the first SMP initialization
>sequence into APIC ICR.
>
>The cause of the problem is watchdog NMI handler execution -
>somewhere near the end of NMI handling (after it's already
>rescheduled the next NMI) it tries to access IO port 0x61
>to get the actual NMI reason on CPU0. Unfortunately, this
>port is emulated by BIOS using SMIs and this emulation
>apparently might take more than we expect under certain
>conditions. As the result, the system is constantly moving
>between NMI and SMI handler and not making any progress.
>
>Just lower the initial frequency for now as we lower it later
>even more anyway.

I/O port 61h normally is not emulated by SMI legacy kbd handling code
in BIOS, only ports like 60h, 64h, etc.
Contrary to USB legacy emulation, it has to intercept port 61h via a
different approach -- generic SMI I/O trap, which is not common (at least
it was) to use by BIOSes... although it is possible as EFI interface and
code for this is available. The assumption about port 61h being trapped by
the SMI handler must be explicitly confirmed by checking I/O Trap control
regs in the RCBA region.

If I/O trap regs won't show an active I/O trap on I/O port 61h -- the
root cause might be different (might even be related to stuff like
NMI2SMI logic).

If the problem is actually due to NMI handler being preempted by another
NMI which occurred after (a long) execution of triggered SMI handler, it
might be better to do all sensitive stuff before re-enabling NMIs by IRET in
the NMI handler.

>Signed-off-by: Igor Druzhinin 
>---
> xen/arch/x86/nmi.c | 3 ++-
> 1 file changed, 2 insertions(+), 1 deletion(-)
>
>diff --git a/xen/arch/x86/nmi.c b/xen/arch/x86/nmi.c
>index d7fce28..1eb2a32 100644
>--- a/xen/arch/x86/nmi.c
>+++ b/xen/arch/x86/nmi.c
>@@ -34,7 +34,8 @@
> #include 
> 
> unsigned int nmi_watchdog = NMI_NONE;
>-static unsigned int nmi_hz = HZ;
>+/* initial watchdog frequency - shouldn't be too high to avoid boot hangs
>*/ +static unsigned int nmi_hz = HZ / 10;
> static unsigned int nmi_perfctr_msr;  /* the MSR to reset in NMI
> handler */ static unsigned int nmi_p4_cccr_val;
> static DEFINE_PER_CPU(struct timer, nmi_timer);


___
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

[Xen-devel] [PATCH] x86/nmi: lower initial watchdog frequency to avoid boot hangs

2018-02-05 Thread Igor Druzhinin
We're noticing a reproducible system boot hang on certain
post-Skylake platforms where the BIOS is configured in
legacy boot mode with x2APIC disabled. The system stalls
immediately after writing the first SMP initialization
sequence into APIC ICR.

The cause of the problem is watchdog NMI handler execution -
somewhere near the end of NMI handling (after it's already
rescheduled the next NMI) it tries to access IO port 0x61
to get the actual NMI reason on CPU0. Unfortunately, this
port is emulated by BIOS using SMIs and this emulation
apparently might take more than we expect under certain
conditions. As the result, the system is constantly moving
between NMI and SMI handler and not making any progress.

Just lower the initial frequency for now as we lower it later
even more anyway.

Signed-off-by: Igor Druzhinin 
---
 xen/arch/x86/nmi.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/xen/arch/x86/nmi.c b/xen/arch/x86/nmi.c
index d7fce28..1eb2a32 100644
--- a/xen/arch/x86/nmi.c
+++ b/xen/arch/x86/nmi.c
@@ -34,7 +34,8 @@
 #include 
 
 unsigned int nmi_watchdog = NMI_NONE;
-static unsigned int nmi_hz = HZ;
+/* initial watchdog frequency - shouldn't be too high to avoid boot hangs */
+static unsigned int nmi_hz = HZ / 10;
 static unsigned int nmi_perfctr_msr;   /* the MSR to reset in NMI handler */
 static unsigned int nmi_p4_cccr_val;
 static DEFINE_PER_CPU(struct timer, nmi_timer);
-- 
2.7.4


___
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel