Re: Re: Re: [tip:x86/urgent] x86/mce: Ensure offline CPUs don't participate in rendezvous process

On Fri, May 31, 2019, Raj, Ashok wrote:
> On Thu, May 30, 2019 at 09:13:39AM +, Tony W Wang-oc wrote:
> > On Thu, May 30, 2019, Tony W Wang-oc wrote:
> > > Hi Ashok,
> > > I have two questions about this patch, could you help to check:
> > >
> > > 1. For broadcast #MC exceptions, this patch seems to require that
> > >    the error sets MCG_STATUS_RIPV = 1. But on Intel CPUs, some #MC
> > >    errors set MCG_STATUS_RIPV = 0 (like "Recoverable-not-continuable
> > >    SRAR Type" errors). For these errors the patch doesn't seem to
> > >    work; is that okay?
> > >
> > > 2. For LMCE exceptions, this patch seems to require that the error
> > >    sets MCG_STATUS_RIPV = 0, so that the LMCE is handled normally
> > >    even on an offline CPU. For LMCE errors that set
> > >    MCG_STATUS_RIPV = 1, the patch prevents the offline CPU from
> > >    handling them; is that okay?
> > >
> >
> > More specifically, this patch seems to require that #MC exceptions
> > meet the condition "MCG_STATUS_RIPV ^ MCG_STATUS_LMCES == 1". But on
> > a Xeon X5650 machine (SMP),
>
> The offline CPU will never get a LMCE=1, since those only happen on the
> CPU that's doing active work. Offline CPUs are just sitting in idle.
>
> The specific error here is a PCC=1, so irrespective of what happens, we
> do capture the errors in the per-cpu log, and the kernel would panic.
>
> What this patch specifically tries to avoid is leaving an error sitting
> with MCG_STATUS.MCIP=1, where another recoverable error would shut the
> system down.

Yes, I agree with you on this point. But for question 1: when a #MC
exception that sets MCG_STATUS_RIPV = 0 and PCC = 0 (like a
"Recoverable-not-continuable SRAR Type" error) is broadcast to an offline
CPU, does the same problem still occur: "Kernel panic - not syncing:
Timeout: Not all CPUs entered broadcast exception handler"?

Thanks

> I don't see anything wrong with what this patch does..
>
> > "Data CACHE Level-2 Generic Error" does not meet this condition.
> >
> > I got the message below from:
> > https://www.centos.org/forums/viewtopic.php?p=292742
> >
> > Hardware event. This is not a software error.
> > MCE 0
> > CPU 4 BANK 6 TSC b7065eeaa18b0
> > TIME 1545643603 Mon Dec 24 10:26:43 2018
> > MCG status:MCIP
> > MCi status:
> > Uncorrected error
> > Error enabled
> > Processor context corrupt
> > MCA: Data CACHE Level-2 Generic Error
> > STATUS b2008106 MCGSTATUS 4
> > MCGCAP 1c09 APICID 4 SOCKETID 0
> > CPUID Vendor Intel Family 6 Model 44
> >
> > > Thanks
> > > Tony W Wang-oc
Re: Re: Re: [tip:x86/urgent] x86/mce: Ensure offline CPUs don't participate in rendezvous process

> -Original Mail-
> Sender: Raj, Ashok
> Time: 2019.05.31 1:11
> To: Tony W Wang-oc
> CC: tip...@zytor.com; b...@suse.de; h...@zytor.com;
> linux-e...@vger.kernel.org; linux-kernel@vger.kernel.org;
> linux-tip-comm...@vger.kernel.org; mi...@kernel.org; pet...@infradead.org;
> sta...@vger.kernel.org; t...@linutronix.de; tony.l...@intel.com;
> torva...@linux-foundation.org; David Wang; Ashok Raj
> Topic: Re: Re: Re: [tip:x86/urgent] x86/mce: Ensure offline CPUs don't
> participate in rendezvous process
>
> On Thu, May 30, 2019 at 09:13:39AM +, Tony W Wang-oc wrote:
> > On Thu, May 30, 2019, Tony W Wang-oc wrote:
> > > Hi Ashok,
> > > I have two questions about this patch, could you help to check:
> > >
> > > 1. For broadcast #MC exceptions, this patch seems to require that
> > >    the error sets MCG_STATUS_RIPV = 1. But on Intel CPUs, some #MC
> > >    errors set MCG_STATUS_RIPV = 0 (like "Recoverable-not-continuable
> > >    SRAR Type" errors). For these errors the patch doesn't seem to
> > >    work; is that okay?
> > >
> > > 2. For LMCE exceptions, this patch seems to require that the error
> > >    sets MCG_STATUS_RIPV = 0, so that the LMCE is handled normally
> > >    even on an offline CPU. For LMCE errors that set
> > >    MCG_STATUS_RIPV = 1, the patch prevents the offline CPU from
> > >    handling them; is that okay?
> > >
> >
> > More specifically, this patch seems to require that #MC exceptions
> > meet the condition "MCG_STATUS_RIPV ^ MCG_STATUS_LMCES == 1". But on
> > a Xeon X5650 machine (SMP),
>
> The offline CPU will never get a LMCE=1, since those only happen on the
> CPU that's doing active work. Offline CPUs are just sitting in idle.

So, on Intel CPUs, is LMCE only for thread-level (or core-level) errors?
If not, suppose two threads share a level-2 cache, thread 0 is active,
and thread 1 was offlined by software. When an MCE occurs for this
level-2 cache, thread 1 will become active. When thread 1 reads
mcgstatus.lmce, will the result always be 0?

Thanks.

> The specific error here is a PCC=1, so irrespective of what happens, we
> do capture the errors in the per-cpu log, and the kernel would panic.
>
> What this patch specifically tries to avoid is leaving an error sitting
> with MCG_STATUS.MCIP=1, where another recoverable error would shut the
> system down.
>
> I don't see anything wrong with what this patch does..
>
> > "Data CACHE Level-2 Generic Error" does not meet this condition.
> >
> > I got the message below from:
> > https://www.centos.org/forums/viewtopic.php?p=292742
> >
> > Hardware event. This is not a software error.
> > MCE 0
> > CPU 4 BANK 6 TSC b7065eeaa18b0
> > TIME 1545643603 Mon Dec 24 10:26:43 2018
> > MCG status:MCIP
> > MCi status:
> > Uncorrected error
> > Error enabled
> > Processor context corrupt
> > MCA: Data CACHE Level-2 Generic Error
> > STATUS b2008106 MCGSTATUS 4
> > MCGCAP 1c09 APICID 4 SOCKETID 0
> > CPUID Vendor Intel Family 6 Model 44
> >
> > > Thanks
> > > Tony W Wang-oc
Re: Re: Re: [tip:x86/urgent] x86/mce: Ensure offline CPUs don't participate in rendezvous process

On Thu, May 30, 2019 at 09:13:39AM +, Tony W Wang-oc wrote:
> On Thu, May 30, 2019, Tony W Wang-oc wrote:
> > Hi Ashok,
> > I have two questions about this patch, could you help to check:
> >
> > 1. For broadcast #MC exceptions, this patch seems to require that
> >    the error sets MCG_STATUS_RIPV = 1. But on Intel CPUs, some #MC
> >    errors set MCG_STATUS_RIPV = 0 (like "Recoverable-not-continuable
> >    SRAR Type" errors). For these errors the patch doesn't seem to
> >    work; is that okay?
> >
> > 2. For LMCE exceptions, this patch seems to require that the error
> >    sets MCG_STATUS_RIPV = 0, so that the LMCE is handled normally
> >    even on an offline CPU. For LMCE errors that set
> >    MCG_STATUS_RIPV = 1, the patch prevents the offline CPU from
> >    handling them; is that okay?
> >
>
> More specifically, this patch seems to require that #MC exceptions
> meet the condition "MCG_STATUS_RIPV ^ MCG_STATUS_LMCES == 1". But on
> a Xeon X5650 machine (SMP),

The offline CPU will never get a LMCE=1, since those only happen on the
CPU that's doing active work. Offline CPUs are just sitting in idle.

The specific error here is a PCC=1, so irrespective of what happens, we
do capture the errors in the per-cpu log, and the kernel would panic.

What this patch specifically tries to avoid is leaving an error sitting
with MCG_STATUS.MCIP=1, where another recoverable error would shut the
system down.

I don't see anything wrong with what this patch does..

> "Data CACHE Level-2 Generic Error" does not meet this condition.
>
> I got the message below from:
> https://www.centos.org/forums/viewtopic.php?p=292742
>
> Hardware event. This is not a software error.
> MCE 0
> CPU 4 BANK 6 TSC b7065eeaa18b0
> TIME 1545643603 Mon Dec 24 10:26:43 2018
> MCG status:MCIP
> MCi status:
> Uncorrected error
> Error enabled
> Processor context corrupt
> MCA: Data CACHE Level-2 Generic Error
> STATUS b2008106 MCGSTATUS 4
> MCGCAP 1c09 APICID 4 SOCKETID 0
> CPUID Vendor Intel Family 6 Model 44
>
> > Thanks
> > Tony W Wang-oc
Re: Re: [tip:x86/urgent] x86/mce: Ensure offline CPUs don't participate in rendezvous process

On Thu, May 30, 2019, Tony W Wang-oc wrote:
> Hi Ashok,
> I have two questions about this patch, could you help to check:
>
> 1. For broadcast #MC exceptions, this patch seems to require that
>    the error sets MCG_STATUS_RIPV = 1. But on Intel CPUs, some #MC
>    errors set MCG_STATUS_RIPV = 0 (like "Recoverable-not-continuable
>    SRAR Type" errors). For these errors the patch doesn't seem to
>    work; is that okay?
>
> 2. For LMCE exceptions, this patch seems to require that the error
>    sets MCG_STATUS_RIPV = 0, so that the LMCE is handled normally
>    even on an offline CPU. For LMCE errors that set
>    MCG_STATUS_RIPV = 1, the patch prevents the offline CPU from
>    handling them; is that okay?
>

More specifically, this patch seems to require that #MC exceptions meet
the condition "MCG_STATUS_RIPV ^ MCG_STATUS_LMCES == 1". But on a Xeon
X5650 machine (SMP), "Data CACHE Level-2 Generic Error" does not meet
this condition.

I got the message below from:
https://www.centos.org/forums/viewtopic.php?p=292742

Hardware event. This is not a software error.
MCE 0
CPU 4 BANK 6 TSC b7065eeaa18b0
TIME 1545643603 Mon Dec 24 10:26:43 2018
MCG status:MCIP
MCi status:
Uncorrected error
Error enabled
Processor context corrupt
MCA: Data CACHE Level-2 Generic Error
STATUS b2008106 MCGSTATUS 4
MCGCAP 1c09 APICID 4 SOCKETID 0
CPUID Vendor Intel Family 6 Model 44

> Thanks
> Tony W Wang-oc
Re: [tip:x86/urgent] x86/mce: Ensure offline CPUs don't participate in rendezvous process

Hi Ashok,
I have two questions about this patch, could you help to check:

1. For broadcast #MC exceptions, this patch seems to require that
   the error sets MCG_STATUS_RIPV = 1. But on Intel CPUs, some #MC
   errors set MCG_STATUS_RIPV = 0 (like "Recoverable-not-continuable
   SRAR Type" errors). For these errors the patch doesn't seem to
   work; is that okay?

2. For LMCE exceptions, this patch seems to require that the error
   sets MCG_STATUS_RIPV = 0, so that the LMCE is handled normally
   even on an offline CPU. For LMCE errors that set
   MCG_STATUS_RIPV = 1, the patch prevents the offline CPU from
   handling them; is that okay?

Thanks
Tony W Wang-oc
Re: [tip:x86/urgent] x86/mce: Ensure offline CPUs don't participate in rendezvous process

Hi,

Does this patch require all #MC exception errors to set
MCG_STATUS_RIPV = 1? On offline CPUs, for #MC exception errors that set
MCG_STATUS_RIPV = 0 (like "Recoverable-not-continuable SRAR Type"
errors), this patch doesn't seem to work. Is this patch's "return;" in
the wrong place?

Thanks
Tony W Wang-oc
[tip:x86/urgent] x86/mce: Ensure offline CPUs don't participate in rendezvous process

Commit-ID:  d90167a941f62860f35eb960e1012aa2d30e7e94
Gitweb:     http://git.kernel.org/tip/d90167a941f62860f35eb960e1012aa2d30e7e94
Author:     Ashok Raj
AuthorDate: Thu, 10 Dec 2015 11:12:26 +0100
Committer:  Thomas Gleixner
CommitDate: Sat, 19 Dec 2015 09:55:31 +0100

x86/mce: Ensure offline CPUs don't participate in rendezvous process

Intel's MCA implementation broadcasts MCEs to all CPUs on the
node. This poses a problem for offlined CPUs which cannot
participate in the rendezvous process:

  Kernel panic - not syncing: Timeout: Not all CPUs entered broadcast exception handler
  Kernel Offset: disabled
  Rebooting in 100 seconds..

More specifically, Linux does a soft offline of a CPU when
writing a 0 to /sys/devices/system/cpu/cpuX/online, which
doesn't prevent the #MC exception from being broadcasted to that
CPU.

Ensure that offline CPUs don't participate in the MCE rendezvous
and clear the RIP valid status bit so that a second MCE won't
cause a shutdown.

Without the patch, mce_start() will increment mce_callin and
wait for all CPUs. Offlined CPUs should avoid participating in
the rendezvous process altogether.

Signed-off-by: Ashok Raj
[ Massage commit message. ]
Signed-off-by: Borislav Petkov
Reviewed-by: Tony Luck
Cc:
Cc: H. Peter Anvin
Cc: Linus Torvalds
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Cc: linux-edac
Link: http://lkml.kernel.org/r/1449742346-21470-2-git-send-email...@alien8.de
Signed-off-by: Ingo Molnar
Signed-off-by: Thomas Gleixner
---
 arch/x86/kernel/cpu/mcheck/mce.c | 11 +++
 1 file changed, 11 insertions(+)

diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index c5b0d56..7e8a736 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -999,6 +999,17 @@ void do_machine_check(struct pt_regs *regs, long error_code)
 	int flags = MF_ACTION_REQUIRED;
 	int lmce = 0;
 
+	/* If this CPU is offline, just bail out. */
+	if (cpu_is_offline(smp_processor_id())) {
+		u64 mcgstatus;
+
+		mcgstatus = mce_rdmsrl(MSR_IA32_MCG_STATUS);
+		if (mcgstatus & MCG_STATUS_RIPV) {
+			mce_wrmsrl(MSR_IA32_MCG_STATUS, 0);
+			return;
+		}
+	}
+
 	ist_enter(regs);
 
 	this_cpu_inc(mce_exception_count);
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
[tip:x86/urgent] x86/mce: Ensure offline CPUs don't participate in rendezvous process

Commit-ID:  06f337b7c7eb86254c86e8e717273d1e356d5a1b
Gitweb:     http://git.kernel.org/tip/06f337b7c7eb86254c86e8e717273d1e356d5a1b
Author:     Ashok Raj
AuthorDate: Thu, 10 Dec 2015 11:12:26 +0100
Committer:  Ingo Molnar
CommitDate: Fri, 11 Dec 2015 08:59:48 +0100

x86/mce: Ensure offline CPUs don't participate in rendezvous process

Intel's MCA implementation broadcasts MCEs to all CPUs on the
node. This poses a problem for offlined CPUs which cannot
participate in the rendezvous process:

  Kernel panic - not syncing: Timeout: Not all CPUs entered broadcast exception handler
  Kernel Offset: disabled
  Rebooting in 100 seconds..

More specifically, Linux does a soft offline of a CPU when
writing a 0 to /sys/devices/system/cpu/cpuX/online, which
doesn't prevent the #MC exception from being broadcasted to that
CPU.

Ensure that offline CPUs don't participate in the MCE rendezvous
and clear the RIP valid status bit so that a second MCE won't
cause a shutdown.

Without the patch, mce_start() will increment mce_callin and
wait for all CPUs. Offlined CPUs should avoid participating in
the rendezvous process altogether.

Signed-off-by: Ashok Raj
[ Massage commit message. ]
Signed-off-by: Borislav Petkov
Reviewed-by: Tony Luck
Cc:
Cc: H. Peter Anvin
Cc: Linus Torvalds
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Cc: linux-edac
Link: http://lkml.kernel.org/r/1449742346-21470-2-git-send-email...@alien8.de
Signed-off-by: Ingo Molnar
---
 arch/x86/kernel/cpu/mcheck/mce.c | 11 +++
 1 file changed, 11 insertions(+)

diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index c5b0d56..7e8a736 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -999,6 +999,17 @@ void do_machine_check(struct pt_regs *regs, long error_code)
 	int flags = MF_ACTION_REQUIRED;
 	int lmce = 0;
 
+	/* If this CPU is offline, just bail out. */
+	if (cpu_is_offline(smp_processor_id())) {
+		u64 mcgstatus;
+
+		mcgstatus = mce_rdmsrl(MSR_IA32_MCG_STATUS);
+		if (mcgstatus & MCG_STATUS_RIPV) {
+			mce_wrmsrl(MSR_IA32_MCG_STATUS, 0);
+			return;
+		}
+	}
+
 	ist_enter(regs);
 
 	this_cpu_inc(mce_exception_count);