On 02/22/2017 at 02:20 AM, Luck, Tony wrote:
>> It's from my understanding, I didn't get the explicit description from the
>> intel SDM on this point.
>> If a broadcast SRAO comes on real hardware, will MSR_IA32_MCG_STATUS of each
>> cpu have MCG_STATUS_RIPV bit set?
> MCG_STATUS is a per-thread
> It's from my understanding, I didn't get the explicit description from the
> intel SDM on this point.
> If a broadcast SRAO comes on real hardware, will MSR_IA32_MCG_STATUS of each
> cpu have MCG_STATUS_RIPV bit set?
MCG_STATUS is a per-thread MSR and will contain the status appropriate for th
On 02/17/2017 at 05:07 PM, Borislav Petkov wrote:
> On Fri, Feb 17, 2017 at 09:53:21AM +0800, Xunlei Pang wrote:
>> It changes the value of cpu_online_mask/etc which will cause confusion to
>> vmcore analysis.
> Then export the crashing_cpu variable, initialize it to something
> invalid in the fir
On Fri, Feb 17, 2017 at 09:53:21AM +0800, Xunlei Pang wrote:
> It changes the value of cpu_online_mask/etc which will cause confusion to
> vmcore analysis.
Then export the crashing_cpu variable, initialize it to something
invalid in the first kernel, -1 for example, and test it in the #MC
handlie
On 02/16/2017 at 08:22 PM, Borislav Petkov wrote:
> On Thu, Feb 16, 2017 at 07:52:09PM +0800, Xunlei Pang wrote:
>> then mce will be broadcast to the other cpus which are still running
>> in the first kernel(i.e. looping in crash_nmi_callback).
> Simple: the crash code should really mark CP
On Thu, Feb 16, 2017 at 07:52:09PM +0800, Xunlei Pang wrote:
> then mce will be broadcast to the other cpus which are still running
> in the first kernel(i.e. looping in crash_nmi_callback).
Simple: the crash code should really mark CPUs as not being online:
void do_machine_check(struct p
On 02/16/2017 at 06:18 PM, Borislav Petkov wrote:
> On Thu, Feb 16, 2017 at 01:36:37PM +0800, Xunlei Pang wrote:
>> I tried to use qemu to inject SRAO("mce -b 0 0 0xb100 0x5 0x0
>> 0x0"),
>> it works well in 1st kernel, but it doesn't work for 1st kernel after kdump
>> boots(seems
>>
On Thu, Feb 16, 2017 at 01:36:37PM +0800, Xunlei Pang wrote:
> I tried to use qemu to inject SRAO("mce -b 0 0 0xb100 0x5 0x0
> 0x0"),
> it works well in 1st kernel, but it doesn't work for 1st kernel after kdump
> boots(seems
> the cpus remain in 1st kernel don't respond to the simula
On 01/26/2017 at 02:44 PM, Borislav Petkov wrote:
> On Thu, Jan 26, 2017 at 02:30:02PM +0800, Xunlei Pang wrote:
>> The hardware machine check is hard to reproduce, but the mce code of
>> RHEL7 is quite the same as that of tip/master, anyway we are able to
>> inject software mce to reproduce it.
>
On Thu, Jan 26, 2017 at 02:30:02PM +0800, Xunlei Pang wrote:
> The hardware machine check is hard to reproduce, but the mce code of
> RHEL7 is quite the same as that of tip/master, anyway we are able to
> inject software mce to reproduce it.
Please give me your exact steps so that I can try to rep
On 01/24/2017 at 08:22 PM, Borislav Petkov wrote:
> On Tue, Jan 24, 2017 at 09:27:45AM +0800, Xunlei Pang wrote:
>> It occurred on real hardware when testing crash dump.
>>
>> 1) SysRq-c was injected for the test in 1st kernel
>> [ 49.897279] SysRq : Trigger a crash 2) The 2nd kernel started for kd
On Tue, Jan 24, 2017 at 09:27:45AM +0800, Xunlei Pang wrote:
> It occurred on real hardware when testing crash dump.
>
> 1) SysRq-c was injected for the test in 1st kernel
> [ 49.897279] SysRq : Trigger a crash 2) The 2nd kernel started for kdump
>[ 0.00] Command line: BOOT_IMAGE=/vmlinuz-
On 01/24/2017 at 02:14 AM, Borislav Petkov wrote:
> On Mon, Jan 23, 2017 at 10:01:53AM -0800, Luck, Tony wrote:
>> will ignore the machine check on the other cpus ... assuming
>> that "cpu_is_offline(smp_processor_id())" does the right thing
>> in the kexec case where this is an "old" cpu that isn'
On 01/24/2017 at 09:46 AM, Xunlei Pang wrote:
> On 01/24/2017 at 01:51 AM, Borislav Petkov wrote:
>> Hey Tony,
>>
>> a "welcome back" is in order? :-)
>>
>> On Mon, Jan 23, 2017 at 09:40:09AM -0800, Luck, Tony wrote:
>>> If the system had experienced some memory corruption, but
>>> recovered ... th
On 01/24/2017 at 01:51 AM, Borislav Petkov wrote:
> Hey Tony,
>
> a "welcome back" is in order? :-)
>
> On Mon, Jan 23, 2017 at 09:40:09AM -0800, Luck, Tony wrote:
>> If the system had experienced some memory corruption, but
>> recovered ... then there would be some pages sitting around
>> that the
On 01/23/2017 at 10:50 PM, Borislav Petkov wrote:
> On Mon, Jan 23, 2017 at 09:35:53PM +0800, Xunlei Pang wrote:
>> One possible timing sequence would be:
>> 1st kernel running on multiple cpus panicked
>> then the crash dump code starts
>> the crash dump code stops the others cpus except the crash
On Mon, Jan 23, 2017 at 10:01:53AM -0800, Luck, Tony wrote:
> will ignore the machine check on the other cpus ... assuming
> that "cpu_is_offline(smp_processor_id())" does the right thing
> in the kexec case where this is an "old" cpu that isn't online
> in the new kernel.
Nice. And kdump did do t
On Mon, Jan 23, 2017 at 06:51:30PM +0100, Borislav Petkov wrote:
> Hey Tony,
>
> a "welcome back" is in order? :-)
Yes - first day back today. Lots of catching up to do.
> And apparently crash knows about poisoned pages and handles them:
>
> static int __init crash_save_vmcoreinfo_init(void)
>
Hey Tony,
a "welcome back" is in order? :-)
On Mon, Jan 23, 2017 at 09:40:09AM -0800, Luck, Tony wrote:
> If the system had experienced some memory corruption, but
> recovered ... then there would be some pages sitting around
> that the old kernel had marked as POISON and stopped using.
> The kex
On Mon, Jan 23, 2017 at 03:50:56PM +0100, Borislav Petkov wrote:
> On Mon, Jan 23, 2017 at 09:35:53PM +0800, Xunlei Pang wrote:
> > One possible timing sequence would be:
> > 1st kernel running on multiple cpus panicked
> > then the crash dump code starts
> > the crash dump code stops the others cp
On Mon, Jan 23, 2017 at 09:35:53PM +0800, Xunlei Pang wrote:
> One possible timing sequence would be:
> 1st kernel running on multiple cpus panicked
> then the crash dump code starts
> the crash dump code stops the others cpus except the crashing one
> 2nd kernel boots up on the crash cpu with "nr_
On 01/23/2017 at 08:51 PM, Borislav Petkov wrote:
> On Mon, Jan 23, 2017 at 04:01:51PM +0800, Xunlei Pang wrote:
>> We met an issue for kdump: after kdump kernel boots up,
>> and there comes a broadcasted mce in first kernel, the
> How does that even happen?
>
> Lemme try to understand this correct
On Mon, Jan 23, 2017 at 04:01:51PM +0800, Xunlei Pang wrote:
> We met an issue for kdump: after kdump kernel boots up,
> and there comes a broadcasted mce in first kernel, the
How does that even happen?
Lemme try to understand this correctly: the first kernel gets an
MCE, kdump starts and boots a
We met an issue for kdump: after kdump kernel boots up,
and there comes a broadcasted mce in first kernel, the
other cpus remaining in first kernel will enter the old
mce handler of first kernel, then timeout and panic due
to MCE synchronization, finally reset the kdump cpus.
This patch lets cpus
24 matches
Mail list logo