Re: [PATCH v3 4/4] x86/mce: Add Zhaoxin LMCE support

2019-09-17 Thread Luck, Tony
On Tue, Sep 17, 2019 at 06:54:05AM +, Tony W Wang-oc wrote:
> But I have a question about the code below:
>	if (mcgstatus & MCG_STATUS_RIPV) {
>		mce_wrmsrl(MSR_IA32_MCG_STATUS, 0);
>		return true;
>	}
> This seems to require every #MC exception to set MCG_STATUS_RIPV = 1
> in order to skip synchronization, which is what "return true;" does here.
> 
> As the Intel SDM shows, "Recoverable-not-continuable SRAR Type" errors may
> set MCG_STATUS_RIPV = 0, PCC = 0. When such #MC errors are broadcast
> to an offline CPU, they may cause a kernel panic from a synchronization
> timeout (the offline CPU can't skip synchronization in this case).
> 
> Could "return true;" be moved outside the if block?
>	if (mcgstatus & MCG_STATUS_RIPV) {
>		mce_wrmsrl(MSR_IA32_MCG_STATUS, 0);
>	}
>	return true;

If RIPV bit is not set in mcgstatus, then where will the CPU return
to if you simply return from the #MC handler? RIPV=1 means that the
CPU pushed a valid return instruction pointer onto the stack.

E.g. in the not-continuable case you mention above? The CPU
will likely do something undefined if you try to continue a
not-continuable instruction.
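
To make that concrete, here is a rough user-space sketch of the decision
(not kernel code). The bit positions follow the IA32_MCG_STATUS layout,
but `struct sim_cpu` and `sim_offline_cpu_may_return()` are made-up names
for illustration, and the MSR write is only modelled as a store:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* IA32_MCG_STATUS bits. */
#define MCG_STATUS_RIPV (1ULL << 0) /* restart IP on the stack is valid */
#define MCG_STATUS_EIPV (1ULL << 1) /* error IP is valid */
#define MCG_STATUS_MCIP (1ULL << 2) /* machine check in progress */

/* Hypothetical stand-in for one CPU's MCG_STATUS MSR; not a kernel type. */
struct sim_cpu {
	uint64_t mcg_status;
};

/*
 * Same shape as the quoted check: an offline CPU may leave the handler
 * early only when RIPV says the pushed return address is usable.
 */
static bool sim_offline_cpu_may_return(struct sim_cpu *cpu)
{
	assert(cpu != NULL);

	if (cpu->mcg_status & MCG_STATUS_RIPV) {
		/* Models mce_wrmsrl(MSR_IA32_MCG_STATUS, 0). */
		cpu->mcg_status = 0;
		return true;
	}

	/* RIPV=0: there is no valid address to resume at. */
	return false;
}
```

With RIPV clear the function leaves the status untouched and reports that
an early return is not safe, which is the case the proposed change would
turn into an unconditional "return true".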

-Tony


Re: [PATCH v3 4/4] x86/mce: Add Zhaoxin LMCE support

2019-09-16 Thread Tony W Wang-oc
On Mon, Sep 16, 2019, Luck, Tony wrote:
>On Mon, Sep 16, 2019 at 11:37:18AM +, Tony W Wang-oc wrote:
>> Newer Zhaoxin CPUs support LMCE that is compatible with Intel's
>> "Machine-Check Architecture", so add support for Zhaoxin LMCE
>> in mce/core.c.
>>
>> Signed-off-by: Tony W Wang-oc 
>> ---
>>  arch/x86/kernel/cpu/mce/core.c | 35 +--
>>  1 file changed, 33 insertions(+), 2 deletions(-)
>>
>> diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
>> index 65c5a1f..acdd76b 100644
>> --- a/arch/x86/kernel/cpu/mce/core.c
>> +++ b/arch/x86/kernel/cpu/mce/core.c
>> @@ -1132,6 +1132,27 @@ static bool __mc_check_crashing_cpu(int cpu)
>>  u64 mcgstatus;
>>
>>  mcgstatus = mce_rdmsrl(MSR_IA32_MCG_STATUS);
>> +
>> +if (boot_cpu_data.x86_vendor == X86_VENDOR_ZHAOXIN) {
>> +if (mcgstatus & MCG_STATUS_LMCES)
>> +return false;
>> +
>> +if (!(mcgstatus & MCG_STATUS_LMCES)) {
>
>Don't really need this test ... you already did "return false" if
>the LMCES bit was set ... so this test is redundant (and you can avoid
>indenting the next dozen lines).

Got it, Thank you.

But I have a question about the code below:
	if (mcgstatus & MCG_STATUS_RIPV) {
		mce_wrmsrl(MSR_IA32_MCG_STATUS, 0);
		return true;
	}
This seems to require every #MC exception to set MCG_STATUS_RIPV = 1
in order to skip synchronization, which is what "return true;" does here.

As the Intel SDM shows, "Recoverable-not-continuable SRAR Type" errors may
set MCG_STATUS_RIPV = 0, PCC = 0. When such #MC errors are broadcast
to an offline CPU, they may cause a kernel panic from a synchronization
timeout (the offline CPU can't skip synchronization in this case).

Could "return true;" be moved outside the if block?
	if (mcgstatus & MCG_STATUS_RIPV) {
		mce_wrmsrl(MSR_IA32_MCG_STATUS, 0);
	}
	return true;

Sincerely
TonyWWang-oc


Re: [PATCH v3 4/4] x86/mce: Add Zhaoxin LMCE support

2019-09-16 Thread Luck, Tony
On Mon, Sep 16, 2019 at 11:37:18AM +, Tony W Wang-oc wrote:
> Newer Zhaoxin CPUs support LMCE that is compatible with Intel's
> "Machine-Check Architecture", so add support for Zhaoxin LMCE
> in mce/core.c.
> 
> Signed-off-by: Tony W Wang-oc 
> ---
>  arch/x86/kernel/cpu/mce/core.c | 35 +--
>  1 file changed, 33 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
> index 65c5a1f..acdd76b 100644
> --- a/arch/x86/kernel/cpu/mce/core.c
> +++ b/arch/x86/kernel/cpu/mce/core.c
> @@ -1132,6 +1132,27 @@ static bool __mc_check_crashing_cpu(int cpu)
>   u64 mcgstatus;
>  
>   mcgstatus = mce_rdmsrl(MSR_IA32_MCG_STATUS);
> +
> + if (boot_cpu_data.x86_vendor == X86_VENDOR_ZHAOXIN) {
> + if (mcgstatus & MCG_STATUS_LMCES)
> + return false;
> +
> + if (!(mcgstatus & MCG_STATUS_LMCES)) {

Don't really need this test ... you already did "return false" if
the LMCES bit was set ... so this test is redundant (and you can avoid
indenting the next dozen lines).
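
Roughly what I mean, as a user-space sketch rather than the actual hunk
(`sim_zhaoxin_skip_sync()` is a made-up name, and the MSR is modelled as
a plain variable passed by pointer):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* IA32_MCG_STATUS bits. */
#define MCG_STATUS_RIPV  (1ULL << 0) /* restart IP valid */
#define MCG_STATUS_LMCES (1ULL << 3) /* local machine check signalled */

/* Hypothetical model of the Zhaoxin branch; *mcgstatus stands in for the MSR. */
static bool sim_zhaoxin_skip_sync(uint64_t *mcgstatus)
{
	assert(mcgstatus != NULL);

	/* Local MCE: do not take the early exit. */
	if (*mcgstatus & MCG_STATUS_LMCES)
		return false;

	/*
	 * LMCES is known to be clear from here on, so no second test and
	 * no extra indentation level is needed.
	 */
	if (*mcgstatus & MCG_STATUS_RIPV)
		*mcgstatus = 0; /* models mce_wrmsrl(MSR_IA32_MCG_STATUS, 0) */

	return true;
}
```

The behaviour is the same as in the patch as posted; only the nesting
changes.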

> + /*
> +  * Clear the MCG_STATUS_RIPV valid status
> +  * bit so that a second MCE won't cause a
> +  * shutdown.
> +  */
> + if (mcgstatus & MCG_STATUS_RIPV)
> + mce_wrmsrl(MSR_IA32_MCG_STATUS, 0);
> + /*
> +  * On this CPU, skip synchronize regardless
> +  * of MCG_STATUS_RIPV status.
> +  */
> + return true;
> + }
> + }
> +

Otherwise I'm OK with the series.  My earlier comment about
wanting to clean up all the vendor/family/model checks should
be seen as a longer term goal. I don't want to block this waiting
until the day we figure out how to make this prettier.

-Tony

[The "Content-Language: zh-CN" in the mail headers is still freaking out
my version of mutt (Mutt 1.11.3 (2019-02-01)) ... but I figured out a
simple script to download a raw copy of each patch from lore.kernel.org
to work around that]