Re: [PATCH] x86/mce: Keep quiet in case of broadcasted mce after system panic

2017-02-21 Thread Xunlei Pang
On 02/22/2017 at 02:20 AM, Luck, Tony wrote:
>> It's from my understanding, I didn't get the explicit description from the
>> intel SDM on this point.
>> If a broadcast SRAO comes on real hardware, will MSR_IA32_MCG_STATUS of each
>> cpu have MCG_STATUS_RIPV bit set?
> MCG_STATUS is a per-thread

RE: [PATCH] x86/mce: Keep quiet in case of broadcasted mce after system panic

2017-02-21 Thread Luck, Tony
> It's from my understanding, I didn't get the explicit description from the
> intel SDM on this point.
> If a broadcast SRAO comes on real hardware, will MSR_IA32_MCG_STATUS of each
> cpu have MCG_STATUS_RIPV bit set?
MCG_STATUS is a per-thread MSR and will contain the status appropriate for th
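For readers following the RIPV question, here is a minimal user-space sketch of how the low bits of an MCG_STATUS value decode. The bit positions are the ones the Intel SDM gives for IA32_MCG_STATUS; the helper names `mcg_restartable` and `mcg_in_progress` are made up for illustration and are not kernel functions. Since the MSR is per-thread, each cpu would decode its own value.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit positions of IA32_MCG_STATUS as given in the Intel SDM. */
#define MCG_STATUS_RIPV (1ULL << 0) /* restart IP valid */
#define MCG_STATUS_EIPV (1ULL << 1) /* error IP valid */
#define MCG_STATUS_MCIP (1ULL << 2) /* machine check in progress */

/* Hypothetical helper: true if this thread's MCG_STATUS value says
 * execution can be restarted at the saved instruction pointer. */
static inline bool mcg_restartable(uint64_t mcg_status)
{
    return (mcg_status & MCG_STATUS_RIPV) != 0;
}

/* Hypothetical helper: true while a machine check is being handled. */
static inline bool mcg_in_progress(uint64_t mcg_status)
{
    return (mcg_status & MCG_STATUS_MCIP) != 0;
}
```

With these definitions a value of 0x5 decodes as RIPV and MCIP set with EIPV clear.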

Re: [PATCH] x86/mce: Keep quiet in case of broadcasted mce after system panic

2017-02-17 Thread Xunlei Pang
On 02/17/2017 at 05:07 PM, Borislav Petkov wrote:
> On Fri, Feb 17, 2017 at 09:53:21AM +0800, Xunlei Pang wrote:
>> It changes the value of cpu_online_mask/etc which will cause confusion to
>> vmcore analysis.
> Then export the crashing_cpu variable, initialize it to something
> invalid in the fir

Re: [PATCH] x86/mce: Keep quiet in case of broadcasted mce after system panic

2017-02-17 Thread Borislav Petkov
On Fri, Feb 17, 2017 at 09:53:21AM +0800, Xunlei Pang wrote:
> It changes the value of cpu_online_mask/etc which will cause confusion to
> vmcore analysis.
Then export the crashing_cpu variable, initialize it to something invalid in the first kernel, -1 for example, and test it in the #MC handlie
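The crashing_cpu idea can be sketched in user space roughly as follows. This is only an illustration under stated assumptions, not the patch: the snippet above is cut off before it says what the #MC handler should test, so the condition below (stay quiet on any cpu other than the crashing one once crashing_cpu is set) is an assumption, and `note_crashing_cpu` and `mce_after_crash` are hypothetical names.

```c
#include <stdbool.h>

/* -1 means "no crash in progress", as suggested in the thread. The crash
 * path would store the number of the cpu that boots the kdump kernel. */
static int crashing_cpu = -1;

/* Hypothetical stand-in for the crash code recording its cpu. */
static inline void note_crashing_cpu(int cpu)
{
    crashing_cpu = cpu;
}

/* Hypothetical check for the top of the #MC handler (assumed condition):
 * true when a crash has started and this cpu is one of the cpus left
 * behind in the first kernel, so it should not join MCE synchronization. */
static inline bool mce_after_crash(int this_cpu)
{
    return crashing_cpu != -1 && this_cpu != crashing_cpu;
}
```

Unlike clearing bits in cpu_online_mask, a separate variable leaves the masks untouched for vmcore analysis, which is the objection quoted above.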

Re: [PATCH] x86/mce: Keep quiet in case of broadcasted mce after system panic

2017-02-16 Thread Xunlei Pang
On 02/16/2017 at 08:22 PM, Borislav Petkov wrote:
> On Thu, Feb 16, 2017 at 07:52:09PM +0800, Xunlei Pang wrote:
>> then mce will be broadcast to the other cpus which are still running
>> in the first kernel(i.e. looping in crash_nmi_callback).
> Simple: the crash code should really mark CP

Re: [PATCH] x86/mce: Keep quiet in case of broadcasted mce after system panic

2017-02-16 Thread Borislav Petkov
On Thu, Feb 16, 2017 at 07:52:09PM +0800, Xunlei Pang wrote:
> then mce will be broadcast to the other cpus which are still running
> in the first kernel(i.e. looping in crash_nmi_callback).
Simple: the crash code should really mark CPUs as not being online:
void do_machine_check(struct p

Re: [PATCH] x86/mce: Keep quiet in case of broadcasted mce after system panic

2017-02-16 Thread Xunlei Pang
On 02/16/2017 at 06:18 PM, Borislav Petkov wrote:
> On Thu, Feb 16, 2017 at 01:36:37PM +0800, Xunlei Pang wrote:
>> I tried to use qemu to inject SRAO("mce -b 0 0 0xb100 0x5 0x0
>> 0x0"),
>> it works well in 1st kernel, but it doesn't work for 1st kernel after kdump
>> boots(seems
>>

Re: [PATCH] x86/mce: Keep quiet in case of broadcasted mce after system panic

2017-02-16 Thread Borislav Petkov
On Thu, Feb 16, 2017 at 01:36:37PM +0800, Xunlei Pang wrote:
> I tried to use qemu to inject SRAO("mce -b 0 0 0xb100 0x5 0x0
> 0x0"),
> it works well in 1st kernel, but it doesn't work for 1st kernel after kdump
> boots(seems
> the cpus remain in 1st kernel don't respond to the simula

Re: [PATCH] x86/mce: Keep quiet in case of broadcasted mce after system panic

2017-02-15 Thread Xunlei Pang
On 01/26/2017 at 02:44 PM, Borislav Petkov wrote:
> On Thu, Jan 26, 2017 at 02:30:02PM +0800, Xunlei Pang wrote:
>> The hardware machine check is hard to reproduce, but the mce code of
>> RHEL7 is quite the same as that of tip/master, anyway we are able to
>> inject software mce to reproduce it.
>

Re: [PATCH] x86/mce: Keep quiet in case of broadcasted mce after system panic

2017-01-25 Thread Borislav Petkov
On Thu, Jan 26, 2017 at 02:30:02PM +0800, Xunlei Pang wrote:
> The hardware machine check is hard to reproduce, but the mce code of
> RHEL7 is quite the same as that of tip/master, anyway we are able to
> inject software mce to reproduce it.
Please give me your exact steps so that I can try to rep

Re: [PATCH] x86/mce: Keep quiet in case of broadcasted mce after system panic

2017-01-25 Thread Xunlei Pang
On 01/24/2017 at 08:22 PM, Borislav Petkov wrote:
> On Tue, Jan 24, 2017 at 09:27:45AM +0800, Xunlei Pang wrote:
>> It occurred on real hardware when testing crash dump.
>>
>> 1) SysRq-c was injected for the test in 1st kernel
>> [ 49.897279] SysRq : Trigger a crash
>> 2) The 2nd kernel started for kd

Re: [PATCH] x86/mce: Keep quiet in case of broadcasted mce after system panic

2017-01-24 Thread Borislav Petkov
On Tue, Jan 24, 2017 at 09:27:45AM +0800, Xunlei Pang wrote:
> It occurred on real hardware when testing crash dump.
>
> 1) SysRq-c was injected for the test in 1st kernel
> [ 49.897279] SysRq : Trigger a crash
> 2) The 2nd kernel started for kdump
> [ 0.00] Command line: BOOT_IMAGE=/vmlinuz-

Re: [PATCH] x86/mce: Keep quiet in case of broadcasted mce after system panic

2017-01-23 Thread Xunlei Pang
On 01/24/2017 at 02:14 AM, Borislav Petkov wrote:
> On Mon, Jan 23, 2017 at 10:01:53AM -0800, Luck, Tony wrote:
>> will ignore the machine check on the other cpus ... assuming
>> that "cpu_is_offline(smp_processor_id())" does the right thing
>> in the kexec case where this is an "old" cpu that isn'

Re: [PATCH] x86/mce: Keep quiet in case of broadcasted mce after system panic

2017-01-23 Thread Xunlei Pang
On 01/24/2017 at 09:46 AM, Xunlei Pang wrote:
> On 01/24/2017 at 01:51 AM, Borislav Petkov wrote:
>> Hey Tony,
>>
>> a "welcome back" is in order? :-)
>>
>> On Mon, Jan 23, 2017 at 09:40:09AM -0800, Luck, Tony wrote:
>>> If the system had experienced some memory corruption, but
>>> recovered ... th

Re: [PATCH] x86/mce: Keep quiet in case of broadcasted mce after system panic

2017-01-23 Thread Xunlei Pang
On 01/24/2017 at 01:51 AM, Borislav Petkov wrote:
> Hey Tony,
>
> a "welcome back" is in order? :-)
>
> On Mon, Jan 23, 2017 at 09:40:09AM -0800, Luck, Tony wrote:
>> If the system had experienced some memory corruption, but
>> recovered ... then there would be some pages sitting around
>> that the

Re: [PATCH] x86/mce: Keep quiet in case of broadcasted mce after system panic

2017-01-23 Thread Xunlei Pang
On 01/23/2017 at 10:50 PM, Borislav Petkov wrote:
> On Mon, Jan 23, 2017 at 09:35:53PM +0800, Xunlei Pang wrote:
>> One possible timing sequence would be:
>> 1st kernel running on multiple cpus panicked
>> then the crash dump code starts
>> the crash dump code stops the others cpus except the crash

Re: [PATCH] x86/mce: Keep quiet in case of broadcasted mce after system panic

2017-01-23 Thread Borislav Petkov
On Mon, Jan 23, 2017 at 10:01:53AM -0800, Luck, Tony wrote:
> will ignore the machine check on the other cpus ... assuming
> that "cpu_is_offline(smp_processor_id())" does the right thing
> in the kexec case where this is an "old" cpu that isn't online
> in the new kernel.
Nice. And kdump did do t

Re: [PATCH] x86/mce: Keep quiet in case of broadcasted mce after system panic

2017-01-23 Thread Luck, Tony
On Mon, Jan 23, 2017 at 06:51:30PM +0100, Borislav Petkov wrote:
> Hey Tony,
>
> a "welcome back" is in order? :-)
Yes - first day back today. Lots of catching up to do.
> And apparently crash knows about poisoned pages and handles them:
>
> static int __init crash_save_vmcoreinfo_init(void)
>

Re: [PATCH] x86/mce: Keep quiet in case of broadcasted mce after system panic

2017-01-23 Thread Borislav Petkov
Hey Tony,
a "welcome back" is in order? :-)
On Mon, Jan 23, 2017 at 09:40:09AM -0800, Luck, Tony wrote:
> If the system had experienced some memory corruption, but
> recovered ... then there would be some pages sitting around
> that the old kernel had marked as POISON and stopped using.
> The kex

Re: [PATCH] x86/mce: Keep quiet in case of broadcasted mce after system panic

2017-01-23 Thread Luck, Tony
On Mon, Jan 23, 2017 at 03:50:56PM +0100, Borislav Petkov wrote:
> On Mon, Jan 23, 2017 at 09:35:53PM +0800, Xunlei Pang wrote:
> > One possible timing sequence would be:
> > 1st kernel running on multiple cpus panicked
> > then the crash dump code starts
> > the crash dump code stops the others cp

Re: [PATCH] x86/mce: Keep quiet in case of broadcasted mce after system panic

2017-01-23 Thread Borislav Petkov
On Mon, Jan 23, 2017 at 09:35:53PM +0800, Xunlei Pang wrote:
> One possible timing sequence would be:
> 1st kernel running on multiple cpus panicked
> then the crash dump code starts
> the crash dump code stops the others cpus except the crashing one
> 2nd kernel boots up on the crash cpu with "nr_

Re: [PATCH] x86/mce: Keep quiet in case of broadcasted mce after system panic

2017-01-23 Thread Xunlei Pang
On 01/23/2017 at 08:51 PM, Borislav Petkov wrote:
> On Mon, Jan 23, 2017 at 04:01:51PM +0800, Xunlei Pang wrote:
>> We met an issue for kdump: after kdump kernel boots up,
>> and there comes a broadcasted mce in first kernel, the
> How does that even happen?
>
> Lemme try to understand this correct

Re: [PATCH] x86/mce: Keep quiet in case of broadcasted mce after system panic

2017-01-23 Thread Borislav Petkov
On Mon, Jan 23, 2017 at 04:01:51PM +0800, Xunlei Pang wrote:
> We met an issue for kdump: after kdump kernel boots up,
> and there comes a broadcasted mce in first kernel, the
How does that even happen?
Lemme try to understand this correctly: the first kernel gets an MCE, kdump starts and boots a

[PATCH] x86/mce: Keep quiet in case of broadcasted mce after system panic

2017-01-23 Thread Xunlei Pang
We met an issue for kdump: after kdump kernel boots up, and there comes a broadcasted mce in first kernel, the other cpus remaining in first kernel will enter the old mce handler of first kernel, then timeout and panic due to MCE synchronization, finally reset the kdump cpus. This patch lets cpus