Re: [PATCH] x86/mce: Keep quiet in case of broadcasted mce after system panic

2017-02-21 Thread Xunlei Pang
On 02/22/2017 at 02:20 AM, Luck, Tony wrote:
>> It's from my understanding, I didn't get the explicit description from the
>> intel SDM on this point.
>> If a broadcast SRAO comes on real hardware, will MSR_IA32_MCG_STATUS of each
>> cpu have MCG_STATUS_RIPV bit set?
> MCG_STATUS is a per-thread

RE: [PATCH] x86/mce: Keep quiet in case of broadcasted mce after system panic

2017-02-21 Thread Luck, Tony
> It's from my understanding, I didn't get the explicit description from the > intel SDM on this point. > If a broadcast SRAO comes on real hardware, will MSR_IA32_MCG_STATUS of each > cpu have MCG_STATUS_RIPV bit set? MCG_STATUS is a per-thread MSR and will contain the status appropriate for

Re: [PATCH] x86/mce: Keep quiet in case of broadcasted mce after system panic

2017-02-17 Thread Xunlei Pang
On 02/17/2017 at 05:07 PM, Borislav Petkov wrote:
> On Fri, Feb 17, 2017 at 09:53:21AM +0800, Xunlei Pang wrote:
>> It changes the value of cpu_online_mask/etc which will cause confusion to
>> vmcore analysis.
> Then export the crashing_cpu variable, initialize it to something
> invalid in the

Re: [PATCH] x86/mce: Keep quiet in case of broadcasted mce after system panic

2017-02-17 Thread Borislav Petkov
On Fri, Feb 17, 2017 at 09:53:21AM +0800, Xunlei Pang wrote:
> It changes the value of cpu_online_mask/etc which will cause confusion to
> vmcore analysis.
Then export the crashing_cpu variable, initialize it to something invalid in the first kernel, -1 for example, and test it in the #MC
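
The suggestion above is cut off in this listing, so the following is only a rough userspace sketch of the idea, not the patch that was posted. It assumes crashing_cpu holds -1 until a crash starts and that the #MC handler returns early on every CPU other than the crashing one. The helper name mce_should_ignore() is hypothetical and is not a kernel function.

```c
#include <stdbool.h>

/*
 * Stand-in for the exported crashing_cpu variable discussed above.
 * Assumption: -1 means "no crash in progress".
 */
static int crashing_cpu = -1;

/*
 * Hypothetical helper (not a kernel API): decide whether the
 * machine check handler running on `cpu` should stay quiet.
 */
static bool mce_should_ignore(int cpu)
{
	/* No crash has started: handle the machine check as usual. */
	if (crashing_cpu == -1)
		return false;

	/*
	 * A crash is in progress. Every CPU except the crashing one is
	 * parked in the first kernel, so it ignores the broadcast.
	 */
	return cpu != crashing_cpu;
}
```

With this check the crashing CPU, which boots the kdump kernel, still processes the event, while the parked CPUs do not take part in the rendezvous.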

Re: [PATCH] x86/mce: Keep quiet in case of broadcasted mce after system panic

2017-02-16 Thread Xunlei Pang
On 02/16/2017 at 08:22 PM, Borislav Petkov wrote:
> On Thu, Feb 16, 2017 at 07:52:09PM +0800, Xunlei Pang wrote:
>> then mce will be broadcast to the other cpus which are still running
>> in the first kernel(i.e. looping in crash_nmi_callback).
> Simple: the crash code should really mark

Re: [PATCH] x86/mce: Keep quiet in case of broadcasted mce after system panic

2017-02-16 Thread Borislav Petkov
On Thu, Feb 16, 2017 at 07:52:09PM +0800, Xunlei Pang wrote:
> then mce will be broadcast to the other cpus which are still running
> in the first kernel(i.e. looping in crash_nmi_callback).
Simple: the crash code should really mark CPUs as not being online:
void do_machine_check(struct

Re: [PATCH] x86/mce: Keep quiet in case of broadcasted mce after system panic

2017-02-16 Thread Xunlei Pang
On 02/16/2017 at 06:18 PM, Borislav Petkov wrote:
> On Thu, Feb 16, 2017 at 01:36:37PM +0800, Xunlei Pang wrote:
>> I tried to use qemu to inject SRAO("mce -b 0 0 0xb100 0x5 0x0 0x0"),
>> it works well in 1st kernel, but it doesn't work for 1st kernel after kdump boots(seems
>>

Re: [PATCH] x86/mce: Keep quiet in case of broadcasted mce after system panic

2017-02-16 Thread Borislav Petkov
On Thu, Feb 16, 2017 at 01:36:37PM +0800, Xunlei Pang wrote:
> I tried to use qemu to inject SRAO("mce -b 0 0 0xb100 0x5 0x0 0x0"),
> it works well in 1st kernel, but it doesn't work for 1st kernel after kdump boots(seems
> the cpus remain in 1st kernel don't respond to the

Re: [PATCH] x86/mce: Keep quiet in case of broadcasted mce after system panic

2017-02-15 Thread Xunlei Pang
On 01/26/2017 at 02:44 PM, Borislav Petkov wrote:
> On Thu, Jan 26, 2017 at 02:30:02PM +0800, Xunlei Pang wrote:
>> The hardware machine check is hard to reproduce, but the mce code of
>> RHEL7 is quite the same as that of tip/master, anyway we are able to
>> inject software mce to reproduce it.
>

Re: [PATCH] x86/mce: Keep quiet in case of broadcasted mce after system panic

2017-01-25 Thread Borislav Petkov
On Thu, Jan 26, 2017 at 02:30:02PM +0800, Xunlei Pang wrote:
> The hardware machine check is hard to reproduce, but the mce code of
> RHEL7 is quite the same as that of tip/master, anyway we are able to
> inject software mce to reproduce it.
Please give me your exact steps so that I can try to

Re: [PATCH] x86/mce: Keep quiet in case of broadcasted mce after system panic

2017-01-25 Thread Xunlei Pang
On 01/24/2017 at 08:22 PM, Borislav Petkov wrote:
> On Tue, Jan 24, 2017 at 09:27:45AM +0800, Xunlei Pang wrote:
>> It occurred on real hardware when testing crash dump.
>>
>> 1) SysRq-c was injected for the test in 1st kernel
>> [ 49.897279] SysRq : Trigger a crash 2) The 2nd kernel started for

Re: [PATCH] x86/mce: Keep quiet in case of broadcasted mce after system panic

2017-01-24 Thread Borislav Petkov
On Tue, Jan 24, 2017 at 09:27:45AM +0800, Xunlei Pang wrote:
> It occurred on real hardware when testing crash dump.
>
> 1) SysRq-c was injected for the test in 1st kernel
> [ 49.897279] SysRq : Trigger a crash 2) The 2nd kernel started for kdump
>[ 0.00] Command line:

Re: [PATCH] x86/mce: Keep quiet in case of broadcasted mce after system panic

2017-01-24 Thread Xunlei Pang
On 01/23/2017 at 10:50 PM, Borislav Petkov wrote:
> On Mon, Jan 23, 2017 at 09:35:53PM +0800, Xunlei Pang wrote:
>> One possible timing sequence would be:
>> 1st kernel running on multiple cpus panicked
>> then the crash dump code starts
>> the crash dump code stops the others cpus except the

Re: [PATCH] x86/mce: Keep quiet in case of broadcasted mce after system panic

2017-01-23 Thread Xunlei Pang
On 01/24/2017 at 02:14 AM, Borislav Petkov wrote:
> On Mon, Jan 23, 2017 at 10:01:53AM -0800, Luck, Tony wrote:
>> will ignore the machine check on the other cpus ... assuming
>> that "cpu_is_offline(smp_processor_id())" does the right thing
>> in the kexec case where this is an "old" cpu that

Re: [PATCH] x86/mce: Keep quiet in case of broadcasted mce after system panic

2017-01-23 Thread Xunlei Pang
On 01/24/2017 at 09:46 AM, Xunlei Pang wrote:
> On 01/24/2017 at 01:51 AM, Borislav Petkov wrote:
>> Hey Tony,
>>
>> a "welcome back" is in order? :-)
>>
>> On Mon, Jan 23, 2017 at 09:40:09AM -0800, Luck, Tony wrote:
>>> If the system had experienced some memory corruption, but
>>> recovered ...

Re: [PATCH] x86/mce: Keep quiet in case of broadcasted mce after system panic

2017-01-23 Thread Xunlei Pang
On 01/24/2017 at 01:51 AM, Borislav Petkov wrote:
> Hey Tony,
>
> a "welcome back" is in order? :-)
>
> On Mon, Jan 23, 2017 at 09:40:09AM -0800, Luck, Tony wrote:
>> If the system had experienced some memory corruption, but
>> recovered ... then there would be some pages sitting around
>> that

Re: [PATCH] x86/mce: Keep quiet in case of broadcasted mce after system panic

2017-01-23 Thread Borislav Petkov
On Mon, Jan 23, 2017 at 10:01:53AM -0800, Luck, Tony wrote:
> will ignore the machine check on the other cpus ... assuming
> that "cpu_is_offline(smp_processor_id())" does the right thing
> in the kexec case where this is an "old" cpu that isn't online
> in the new kernel.
Nice. And kdump did do

Re: [PATCH] x86/mce: Keep quiet in case of broadcasted mce after system panic

2017-01-23 Thread Luck, Tony
On Mon, Jan 23, 2017 at 06:51:30PM +0100, Borislav Petkov wrote:
> Hey Tony,
>
> a "welcome back" is in order? :-)
Yes - first day back today. Lots of catching up to do.
> And apparently crash knows about poisoned pages and handles them:
>
> static int __init crash_save_vmcoreinfo_init(void)
>

Re: [PATCH] x86/mce: Keep quiet in case of broadcasted mce after system panic

2017-01-23 Thread Borislav Petkov
Hey Tony,
a "welcome back" is in order? :-)
On Mon, Jan 23, 2017 at 09:40:09AM -0800, Luck, Tony wrote:
> If the system had experienced some memory corruption, but
> recovered ... then there would be some pages sitting around
> that the old kernel had marked as POISON and stopped using.
> The

Re: [PATCH] x86/mce: Keep quiet in case of broadcasted mce after system panic

2017-01-23 Thread Luck, Tony
On Mon, Jan 23, 2017 at 03:50:56PM +0100, Borislav Petkov wrote:
> On Mon, Jan 23, 2017 at 09:35:53PM +0800, Xunlei Pang wrote:
> > One possible timing sequence would be:
> > 1st kernel running on multiple cpus panicked
> > then the crash dump code starts
> > the crash dump code stops the others

Re: [PATCH] x86/mce: Keep quiet in case of broadcasted mce after system panic

2017-01-23 Thread Borislav Petkov
On Mon, Jan 23, 2017 at 09:35:53PM +0800, Xunlei Pang wrote:
> One possible timing sequence would be:
> 1st kernel running on multiple cpus panicked
> then the crash dump code starts
> the crash dump code stops the others cpus except the crashing one
> 2nd kernel boots up on the crash cpu with

Re: [PATCH] x86/mce: Keep quiet in case of broadcasted mce after system panic

2017-01-23 Thread Xunlei Pang
On 01/23/2017 at 08:51 PM, Borislav Petkov wrote:
> On Mon, Jan 23, 2017 at 04:01:51PM +0800, Xunlei Pang wrote:
>> We met an issue for kdump: after kdump kernel boots up,
>> and there comes a broadcasted mce in first kernel, the
> How does that even happen?
>
> Lemme try to understand this

Re: [PATCH] x86/mce: Keep quiet in case of broadcasted mce after system panic

2017-01-23 Thread Borislav Petkov
On Mon, Jan 23, 2017 at 04:01:51PM +0800, Xunlei Pang wrote:
> We met an issue for kdump: after kdump kernel boots up,
> and there comes a broadcasted mce in first kernel, the
How does that even happen?
Lemme try to understand this correctly: the first kernel gets an MCE, kdump starts and boots