> Your test case is presumably doing something that involves setting
> undocumented registers* to program the CPU or memory controller to
> generate a machine check on access to some address. Presumably this
> is done by broadcasting an SMI and programming the registers in SMM.
Good theory - but
On Tue, Nov 18, 2014 at 10:30 AM, Luck, Tony wrote:
>>> The lost cpu is *really* lost. Warm reset doesn't fix the machine, I
>>> usually
>>> have to do a full power cycle.
>
>> How is it even possible that I did that with a few lines of asm?
>
> Probably not directly your fault - some
>> The lost cpu is *really* lost. Warm reset doesn't fix the machine, I usually
>> have to do a full power cycle.
> How is it even possible that I did that with a few lines of asm?
Probably not directly your fault - some cascade of errors may have
occurred.
> Could this be a hardware
On Mon, Nov 17, 2014 at 12:05:59PM -0800, Andy Lutomirski wrote:
> https://git.kernel.org/cgit/linux/kernel/git/luto/linux.git/log/?h=x86/paranoid
>
> I'm not quite ready to send v3. I want to do two things first:
>
> 1. Consider disabling the stack switch for double_fault.
Sounds
On Mon, Nov 17, 2014 at 12:05:59PM -0800, Andy Lutomirski wrote:
https://git.kernel.org/cgit/linux/kernel/git/luto/linux.git/log/?h=x86/paranoid
I'm not quite ready to send v3. I want to do two things first:
1. Consider disabling the stack switch for double_fault.
Sounds conservatively
The lost cpu is *really* lost. Warm reset doesn't fix the machine, I usually
have to do a full power cycle.
How is it even possible that I did that with a few lines of asm?
Probably not directly your fault - some cascade of errors may have
occurred.
Could this be a hardware bug? Is
On Tue, Nov 18, 2014 at 10:30 AM, Luck, Tony tony.l...@intel.com wrote:
The lost cpu is *really* lost. Warm reset doesn't fix the machine, I
usually
have to do a full power cycle.
How is it even possible that I did that with a few lines of asm?
Probably not directly your fault - some
Your test case is presumably doing something that involves setting
undocumented registers* to program the CPU or memory controller to
generate a machine check on access to some address. Presumably this
is done by broadcasting an SMI and programming the registers in SMM.
Good theory - but not
On Mon, Nov 17, 2014 at 4:22 PM, Luck, Tony wrote:
>> It could also be interesting to tweak mce_panic to not actually panic
>> the machine but to try to return and stop the test instead. Then real
>> debugging could be possible :)
>
> The lost cpu is *really* lost. Warm reset doesn't fix the
> It could also be interesting to tweak mce_panic to not actually panic
> the machine but to try to return and stop the test instead. Then real
> debugging could be possible :)
The lost cpu is *really* lost. Warm reset doesn't fix the machine, I usually
have to do a full power cycle.
-Tony
On Mon, Nov 17, 2014 at 3:16 PM, Luck, Tony wrote:
>> I still wonder whether the timeout code is the real culprit. My patch
>> will slow down entry into do_machine_check by tens of cycles, several
>> cachelines, and possibly a couple of TLB misses. Given that the
>> timing seemed marginal to
> I still wonder whether the timeout code is the real culprit. My patch
> will slow down entry into do_machine_check by tens of cycles, several
> cachelines, and possibly a couple of TLB misses. Given that the
> timing seemed marginal to me, it's possible (albeit not that likely)
> that it
On Mon, Nov 17, 2014 at 1:55 PM, Luck, Tony wrote:
>>> However, I'd like to be very sure this thing doesn't introduce any
>>> regressions to the MCA code. So even if Tony's testing passes, I'd like
>>> to be very conservative here and stress it more than usual. Because once
>>> this thing hits
>> However, I'd like to be very sure this thing doesn't introduce any
>> regressions to the MCA code. So even if Tony's testing passes, I'd like
>> to be very conservative here and stress it more than usual. Because once
>> this thing hits upstream and stuff starts breaking, it'll be a serious
>>
On Mon, Nov 17, 2014 at 12:03 PM, Borislav Petkov wrote:
> On Mon, Nov 17, 2014 at 11:57:22AM -0800, Andy Lutomirski wrote:
>> Would it be worth making a decision on task_work_add vs. stack
>> switching first?
>
> Probably a prudent thing to do in order to save unnecessary cycles :-)
>
>> Stack
On Mon, Nov 17, 2014 at 11:57:22AM -0800, Andy Lutomirski wrote:
> Would it be worth making a decision on task_work_add vs. stack
> switching first?
Probably a prudent thing to do in order to save unnecessary cycles :-)
> Stack switching pros: all this lockless allocation stuff is completely
>
On Mon, Nov 17, 2014 at 10:50 AM, Borislav Petkov wrote:
> On Fri, Nov 14, 2014 at 09:56:38PM +, Luck, Tony wrote:
>> ...
>> But I think that means we need more than one of these structures ...
>> we may not be done with one before a new machine check occurs. So
>> we'd have to make an
On Fri, Nov 14, 2014 at 09:56:38PM +, Luck, Tony wrote:
> ...
> But I think that means we need more than one of these structures ...
> we may not be done with one before a new machine check occurs. So
> we'd have to make an NMI-safe allocator to grab one for use inside
> do_machine_check()
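
For illustration only, a lock-free pool of the kind an "NMI-safe allocator" could use might look like the sketch below. It is a user-space C11 approximation, not code from this thread or from the kernel; `struct mce_slot`, `MCE_POOL_SLOTS`, `mce_slot_get()` and `mce_slot_put()` are hypothetical names. A slot is claimed with a compare-and-swap on a bitmap, so a second machine check that interrupts the first one only forces a retry and never waits on a lock.

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define MCE_POOL_SLOTS 8	/* hypothetical pool size */

/* Hypothetical record standing in for "one of these structures". */
struct mce_slot {
	uint64_t addr;
	int flags;
};

static struct mce_slot pool[MCE_POOL_SLOTS];
static atomic_uint pool_used;	/* bit i set means pool[i] is in use */

/* Claim a free slot without taking a lock; returns NULL if the pool is full. */
static struct mce_slot *mce_slot_get(void)
{
	unsigned int old = atomic_load(&pool_used);

	for (;;) {
		unsigned int i;

		/* Find the lowest clear bit in our snapshot of the bitmap. */
		for (i = 0; i < MCE_POOL_SLOTS; i++)
			if (!(old & (1u << i)))
				break;
		if (i == MCE_POOL_SLOTS)
			return NULL;

		/* On failure 'old' is refreshed with the current value; retry. */
		if (atomic_compare_exchange_weak(&pool_used, &old, old | (1u << i)))
			return &pool[i];
	}
}

/* Return a slot to the pool by clearing its bit. */
static void mce_slot_put(struct mce_slot *s)
{
	atomic_fetch_and(&pool_used, ~(1u << (unsigned int)(s - pool)));
}
```

The handler would call `mce_slot_get()` on entry and the deferred work would call `mce_slot_put()` when it is finished with the record; what to do when the pool is empty is left open here.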
On Fri, Nov 14, 2014 at 09:56:38PM +, Luck, Tony wrote:
...
But I think that means we need more than one of these structures ...
we may not be done with one before a new machine check occurs. So
we'd have to make an NMI-safe allocator to grab one for use inside
do_machine_check()
Well, I
On Mon, Nov 17, 2014 at 10:50 AM, Borislav Petkov b...@alien8.de wrote:
On Fri, Nov 14, 2014 at 09:56:38PM +, Luck, Tony wrote:
...
But I think that means we need more than one of these structures ...
we may not be done with one before a new machine check occurs. So
we'd have to make an
On Mon, Nov 17, 2014 at 11:57:22AM -0800, Andy Lutomirski wrote:
Would it be worth making a decision on task_work_add vs. stack
switching first?
Probably a prudent thing to do in order to save unnecessary cycles :-)
Stack switching pros: all this lockless allocation stuff is completely
On Mon, Nov 17, 2014 at 12:03 PM, Borislav Petkov b...@alien8.de wrote:
On Mon, Nov 17, 2014 at 11:57:22AM -0800, Andy Lutomirski wrote:
Would it be worth making a decision on task_work_add vs. stack
switching first?
Probably a prudent thing to do in order to save unnecessary cycles :-)
However, I'd like to be very sure this thing doesn't introduce any
regressions to the MCA code. So even if Tony's testing passes, I'd like
to be very conservative here and stress it more than usual. Because once
this thing hits upstream and stuff starts breaking, it'll be a serious
PITA
On Mon, Nov 17, 2014 at 1:55 PM, Luck, Tony tony.l...@intel.com wrote:
However, I'd like to be very sure this thing doesn't introduce any
regressions to the MCA code. So even if Tony's testing passes, I'd like
to be very conservative here and stress it more than usual. Because once
this thing
I still wonder whether the timeout code is the real culprit. My patch
will slow down entry into do_machine_check by tens of cycles, several
cachelines, and possibly a couple of TLB misses. Given that the
timing seemed marginal to me, it's possible (albeit not that likely)
that it pushed the
On Mon, Nov 17, 2014 at 3:16 PM, Luck, Tony tony.l...@intel.com wrote:
I still wonder whether the timeout code is the real culprit. My patch
will slow down entry into do_machine_check by tens of cycles, several
cachelines, and possibly a couple of TLB misses. Given that the
timing seemed
It could also be interesting to tweak mce_panic to not actually panic
the machine but to try to return and stop the test instead. Then real
debugging could be possible :)
The lost cpu is *really* lost. Warm reset doesn't fix the machine, I usually
have to do a full power cycle.
-Tony
On Mon, Nov 17, 2014 at 4:22 PM, Luck, Tony tony.l...@intel.com wrote:
It could also be interesting to tweak mce_panic to not actually panic
the machine but to try to return and stop the test instead. Then real
debugging could be possible :)
The lost cpu is *really* lost. Warm reset doesn't
On Fri, Nov 14, 2014 at 1:56 PM, Luck, Tony wrote:
>>> Right, I can do it in the meantime and we can always experiment more
>>> later. Getting rid of _TIF_MCE_NOTIFY is a good thing already.
>>
>> Yep, it looks pretty simple - not tested yet, it builds though.
>
> It seems pretty solid under test
>> Right, I can do it in the meantime and we can always experiment more
>> later. Getting rid of _TIF_MCE_NOTIFY is a good thing already.
>
> Yep, it looks pretty simple - not tested yet, it builds though.
It seems pretty solid under test so far.
Can we make it pass the address/flag to
> So far, the only thing I've come up with is that do_machine_check
> seems to be missing exception_enter or the equivalent. Do you have
> CONFIG_CONTEXT_TRACKING on and/or full nohz enabled? I don't think
> that this explains my bug, though.
Yes to both:
$ grep CONTEXT_TRACK .config
On Fri, Nov 14, 2014 at 9:49 AM, Luck, Tony wrote:
>> Can you also try rebasing onto what will probably be v3?
>>
>> https://git.kernel.org/cgit/linux/kernel/git/luto/linux.git/tag/?id=paranoid-stack-v2.9
>
> Built that - with none of my other changes ... i.e. still use TIF_MCE_NOTIFY
> etc. No
On Fri, Nov 14, 2014 at 09:26:26AM -0800, Andy Lutomirski wrote:
> I was hoping for an actual worked-out example of what the parameters
> should be :)
Sorry, I haven't played with this myself either - haven't had a box with
EINJ yet. Maybe Tony has something.
--
Regards/Gruss,
Boris.
Sent
>> It adds debugging for inappropriate reschedules from the wrong stack.
>> Setting CONFIG_DEBUG_ATOMIC_SLEEP might also be a good idea.
>
> Will add that for next build/test
Didn't see anything new. System died at 1108 recoveries with the
"Timeout synchronization ..." panic
-Tony
> Can you also try rebasing onto what will probably be v3?
>
> https://git.kernel.org/cgit/linux/kernel/git/luto/linux.git/tag/?id=paranoid-stack-v2.9
Built that - with none of my other changes ... i.e. still use TIF_MCE_NOTIFY
etc. No printk() in the MCE context.
System ran 736
On Fri, Nov 14, 2014 at 9:24 AM, Borislav Petkov wrote:
> On Fri, Nov 14, 2014 at 09:18:51AM -0800, Andy Lutomirski wrote:
>> Grr. Do you or Tony have any pointers for how to test this myself? I
>> don't know enough about the acpi error injection thing, which I assume
>> is what Tony is using.
>
On Fri, Nov 14, 2014 at 09:18:51AM -0800, Andy Lutomirski wrote:
> Grr. Do you or Tony have any pointers for how to test this myself? I
> don't know enough about the acpi error injection thing, which I assume
> is what Tony is using.
Maybe that would help:
Documentation/acpi/apei/einj.txt
On Nov 14, 2014 2:34 AM, "Borislav Petkov" wrote:
>
> On Wed, Nov 12, 2014 at 07:03:21PM -0800, Andy Lutomirski wrote:
> > printk seems to work just fine in do_machine_check.
>
> That must be pure luck. Has anything changed which I missed to make
> printk NMI-safe?
Heh. Probably not. Now I
On Wed, Nov 12, 2014 at 07:03:21PM -0800, Andy Lutomirski wrote:
> printk seems to work just fine in do_machine_check.
That must be pure luck. Has anything changed which I missed to make
printk NMI-safe?
--
Regards/Gruss,
Boris.
Sent from a fat crate under my desk. Formatting is fine.
--
On Nov 14, 2014 2:34 AM, Borislav Petkov b...@alien8.de wrote:
On Wed, Nov 12, 2014 at 07:03:21PM -0800, Andy Lutomirski wrote:
printk seems to work just fine in do_machine_check.
That must be pure luck. Has anything changed which I missed to make
printk NMI-safe?
Heh. Probably not. Now
On Fri, Nov 14, 2014 at 09:18:51AM -0800, Andy Lutomirski wrote:
Grr. Do you or Tony have any pointers for how to test this myself? I
don't know enough about the acpi error injection thing, which I assume
is what Tony is using.
Maybe that would help:
Documentation/acpi/apei/einj.txt
provided
On Fri, Nov 14, 2014 at 9:24 AM, Borislav Petkov b...@alien8.de wrote:
On Fri, Nov 14, 2014 at 09:18:51AM -0800, Andy Lutomirski wrote:
Grr. Do you or Tony have any pointers for how to test this myself? I
don't know enough about the acpi error injection thing, which I assume
is what Tony is
Can you also try rebasing onto what will probably be v3?
https://git.kernel.org/cgit/linux/kernel/git/luto/linux.git/tag/?id=paranoid-stack-v2.9
Built that - with none of my other changes ... i.e. still use TIF_MCE_NOTIFY
etc. No printk() in the MCE context.
System ran 736
It adds debugging for inappropriate reschedules from the wrong stack.
Setting CONFIG_DEBUG_ATOMIC_SLEEP might also be a good idea.
Will add that for next build/test
Didn't see anything new. System died at 1108 recoveries with the
Timeout synchronization ... panic
-Tony
On Fri, Nov 14, 2014 at 09:26:26AM -0800, Andy Lutomirski wrote:
I was hoping for an actual worked-out example of what the parameters
should be :)
Sorry, I haven't played with this myself either - haven't had a box with
EINJ yet. Maybe Tony has something.
--
Regards/Gruss,
Boris.
Sent
On Fri, Nov 14, 2014 at 9:49 AM, Luck, Tony tony.l...@intel.com wrote:
Can you also try rebasing onto what will probably be v3?
https://git.kernel.org/cgit/linux/kernel/git/luto/linux.git/tag/?id=paranoid-stack-v2.9
Built that - with none of my other changes ... i.e. still use TIF_MCE_NOTIFY
So far, the only thing I've come up with is that do_machine_check
seems to be missing exception_enter or the equivalent. Do you have
CONFIG_CONTEXT_TRACKING on and/or full nohz enabled? I don't think
that this explains my bug, though.
Yes to both:
$ grep CONTEXT_TRACK .config
Right, I can do it in the meantime and we can always experiment more
later. Getting rid of _TIF_MCE_NOTIFY is a good thing already.
Yep, it looks pretty simple - not tested yet, it builds though.
It seems pretty solid under test so far.
Can we make it pass the address/flag to
On Fri, Nov 14, 2014 at 1:56 PM, Luck, Tony tony.l...@intel.com wrote:
Right, I can do it in the meantime and we can always experiment more
later. Getting rid of _TIF_MCE_NOTIFY is a good thing already.
Yep, it looks pretty simple - not tested yet, it builds though.
It seems pretty solid
On Thu, Nov 13, 2014 at 5:20 PM, Luck, Tony wrote:
> "worst ==
> MCE_AR_SEVERITY but regs->cs == 0 (i.e. in kernel)"
>
> This can't happen. We can only declare AR severity for a user mode fault.
I believe you, and I see that in the code, but the code is mightily twisted.
Anyway, my v3 will also
"worst ==
MCE_AR_SEVERITY but regs->cs == 0 (i.e. in kernel)"
This can't happen. We can only declare AR severity for a user mode fault.
Sent from my iPhone
> On Nov 13, 2014, at 16:50, Andy Lutomirski wrote:
>
> worst ==
> MCE_AR_SEVERITY but regs->cs == 0 (i.e. in kernel)
On Thu, Nov 13, 2014 at 3:13 PM, Andy Lutomirski wrote:
> On Thu, Nov 13, 2014 at 2:47 PM, Andy Lutomirski wrote:
>> On Thu, Nov 13, 2014 at 2:33 PM, Luck, Tony wrote:
>>>> Are you sure that this works in an unmodified kernel
>>>
>>> Unmodified kernel has run tens of thousands of
>>>
On Thu, Nov 13, 2014 at 2:47 PM, Andy Lutomirski wrote:
> On Thu, Nov 13, 2014 at 2:33 PM, Luck, Tony wrote:
>>> Are you sure that this works in an unmodified kernel
>>
>> Unmodified kernel has run tens of thousands of
>> injection/consumption/recovery cycles.
>>
>> I did get a crash with the
On Thu, Nov 13, 2014 at 2:33 PM, Luck, Tony wrote:
>> Are you sure that this works in an unmodified kernel
>
> Unmodified kernel has run tens of thousands of injection/consumption/recovery
> cycles.
>
> I did get a crash with the entry/exit traces you asked for. Last 2 lines
> of console
> Are you sure that this works in an unmodified kernel
Unmodified kernel has run tens of thousands of injection/consumption/recovery
cycles.
I did get a crash with the entry/exit traces you asked for. Last 2 lines
of console log
attached. There are a couple of OOPs before things fall
On Thu, Nov 13, 2014 at 2:23 PM, Andy Lutomirski wrote:
> On Thu, Nov 13, 2014 at 10:43 AM, Luck, Tony wrote:
>>> printk seems to work just fine in do_machine_check. Any chance you
>>> can instrument, for each cpu, all entries to do_machine_check, all
>>> calls to do_machine_check, all returns,
On Thu, Nov 13, 2014 at 10:43 AM, Luck, Tony wrote:
>> printk seems to work just fine in do_machine_check. Any chance you
>> can instrument, for each cpu, all entries to do_machine_check, all
>> calls to do_machine_check, all returns, and everything that tries to
>> do memory_failure?
>
> I
On Thu, Nov 13, 2014 at 11:59:37AM +0100, Borislav Petkov wrote:
> I've been thinking about it recently too - adding MCA functionality to
> qemu/kvm could be very useful, especially the thresholding stuff, for
> testing RAS kernel code.
Btw, qemu monitor has a mce injection command with which I
> printk seems to work just fine in do_machine_check. Any chance you
> can instrument, for each cpu, all entries to do_machine_check, all
> calls to do_machine_check, all returns, and everything that tries to
> do memory_failure?
I first added a printk() just for the cpu that calls
On Wed, Nov 12, 2014 at 05:22:25PM +0100, Borislav Petkov wrote:
> > Less intrusive is certainly true.
>
> Right, I can do it in the meantime and we can always experiment more
> later. Getting rid of _TIF_MCE_NOTIFY is a good thing already.
Yep, it looks pretty simple - not tested yet, it builds
On Thu, Nov 13, 2014 at 12:31:30AM +, Luck, Tony wrote:
> > Is this something I can try under KVM?
>
> I don't know if KVM has a way to simulate a machine check event.
I've been thinking about it recently too - adding MCA functionality to
qemu/kvm could be very useful, especially the
On Thu, Nov 13, 2014 at 12:31:30AM +, Luck, Tony wrote:
Is this something I can try under KVM?
I don't know if KVM has a way to simulate a machine check event.
I've been thinking about it recently too - adding MCA functionality to
qemu/kvm could be very useful, especially the
On Wed, Nov 12, 2014 at 05:22:25PM +0100, Borislav Petkov wrote:
Less intrusive is certainly true.
Right, I can do it in the meantime and we can always experiment more
later. Getting rid of _TIF_MCE_NOTIFY is a good thing already.
Yep, it looks pretty simple - not tested yet, it builds
printk seems to work just fine in do_machine_check. Any chance you
can instrument, for each cpu, all entries to do_machine_check, all
calls to do_machine_check, all returns, and everything that tries to
do memory_failure?
I first added a printk() just for the cpu that calls
On Thu, Nov 13, 2014 at 11:59:37AM +0100, Borislav Petkov wrote:
I've been thinking about it recently too - adding MCA functionality to
qemu/kvm could be very useful, especially the thresholding stuff, for
testing RAS kernel code.
Btw, qemu monitor has a mce injection command with which I was
On Thu, Nov 13, 2014 at 10:43 AM, Luck, Tony tony.l...@intel.com wrote:
printk seems to work just fine in do_machine_check. Any chance you
can instrument, for each cpu, all entries to do_machine_check, all
calls to do_machine_check, all returns, and everything that tries to
do memory_failure?
On Thu, Nov 13, 2014 at 2:23 PM, Andy Lutomirski l...@amacapital.net wrote:
On Thu, Nov 13, 2014 at 10:43 AM, Luck, Tony tony.l...@intel.com wrote:
printk seems to work just fine in do_machine_check. Any chance you
can instrument, for each cpu, all entries to do_machine_check, all
calls to
Are you sure that this works in an unmodified kernel
Unmodified kernel has run tens of thousands of injection/consumption/recovery
cycles.
I did get a crash with the entry/exit traces you asked for. Last 2 lines
of console log
attached. There are a couple of OOPs before things fall
On Thu, Nov 13, 2014 at 2:33 PM, Luck, Tony tony.l...@intel.com wrote:
Are you sure that this works in an unmodified kernel
Unmodified kernel has run tens of thousands of injection/consumption/recovery
cycles.
I did get a crash with the entry/exit traces you asked for. Last 2 lines
On Thu, Nov 13, 2014 at 2:47 PM, Andy Lutomirski l...@amacapital.net wrote:
On Thu, Nov 13, 2014 at 2:33 PM, Luck, Tony tony.l...@intel.com wrote:
Are you sure that this works in an unmodified kernel
Unmodified kernel has run tens of thousands of
injection/consumption/recovery cycles.
I
On Thu, Nov 13, 2014 at 3:13 PM, Andy Lutomirski l...@amacapital.net wrote:
On Thu, Nov 13, 2014 at 2:47 PM, Andy Lutomirski l...@amacapital.net wrote:
On Thu, Nov 13, 2014 at 2:33 PM, Luck, Tony tony.l...@intel.com wrote:
Are you sure that this works in an unmodified kernel
Unmodified kernel
worst ==
MCE_AR_SEVERITY but regs->cs == 0 (i.e. in kernel)
This can't happen. We can only declare AR severity for a user mode fault.
Sent from my iPhone
On Nov 13, 2014, at 16:50, Andy Lutomirski l...@amacapital.net wrote:
worst ==
MCE_AR_SEVERITY but regs->cs == 0 (i.e. in kernel)
On Thu, Nov 13, 2014 at 5:20 PM, Luck, Tony tony.l...@intel.com wrote:
worst ==
MCE_AR_SEVERITY but regs->cs == 0 (i.e. in kernel)
This can't happen. We can only declare AR severity for a user mode fault.
I believe you, and I see that in the code, but the code is mightily twisted.
Anyway, my
On Wed, Nov 12, 2014 at 4:31 PM, Luck, Tony wrote:
>> v2's not going to make a difference unless you're using uprobes at the
>> same time.
>
> Not (knowingly) using uprobes. System is installed with a RHEL7 userspace ...
> but is essentially
> idle except for my test program.
>
>> In the
> v2's not going to make a difference unless you're using uprobes at the
> same time.
Not (knowingly) using uprobes. System is installed with a RHEL7 userspace ...
but is essentially
idle except for my test program.
> In the interest of my sanity, can you add something like
>
On Wed, Nov 12, 2014 at 3:41 PM, Luck, Tony wrote:
>> v2 coming soon with these changes and some additional comment cleanups.
>
v2's not going to make a difference unless you're using uprobes at the
same time.
> So v1 + do_machine_check change is not surviving some real testing. I'm
>
> v2 coming soon with these changes and some additional comment cleanups.
So v1 + do_machine_check change is not surviving some real testing. I'm
injecting and
consuming errors sequentially with a small delay in between - so no fancy
corner cases with
multiple errors being processed ... we get
On Wed, Nov 12, 2014 at 2:00 PM, Oleg Nesterov wrote:
> Andy,
>
> As I said many times I do not understand asm ;) so most probably I missed
> something but let me ask anyway.
You must be the most competent non-asm-speaking asm reviewer in the world :)
>
> On 11/11, Andy Lutomirski wrote:
>>
>>
Andy,
As I said many times I do not understand asm ;) so most probably I missed
something but let me ask anyway.
On 11/11, Andy Lutomirski wrote:
>
> --- a/arch/x86/kernel/entry_64.S
> +++ b/arch/x86/kernel/entry_64.S
> @@ -1064,6 +1064,9 @@ ENTRY(\sym)
> CFI_ADJUST_CFA_OFFSET ORIG_RAX-R15
On Wed, Nov 12, 2014 at 05:17:55PM +, Luck, Tony wrote:
> > Not that easy for testing the #MC path - there we have to inject real
> > MCEs and then noodle through the memory_failure() code. I'd be very much
> > interested to see what would happen if two MCEs happen back-to-back with
> > your
> Not that easy for testing the #MC path - there we have to inject real
> MCEs and then noodle through the memory_failure() code. I'd be very much
> interested to see what would happen if two MCEs happen back-to-back with
> your change, the second one being raised when we're on the kernel stack
>
On Wed, Nov 12, 2014 at 07:48:15AM -0800, Andy Lutomirski wrote:
> I only switch stacks on entry from userspace, and the kernel stack is
> completely empty if that happens.
Ok, fair enough. There's still the argument that something might've
corrupted the kernel stack memory while the MCE_STACK is
On Nov 12, 2014 2:30 AM, "Borislav Petkov" wrote:
>
> On Tue, Nov 11, 2014 at 06:06:48PM -0800, Tony Luck wrote:
> > > Innocent bystanders have RIPV=1, EIPV=0 in MCG_STATUS ... so they
> > > are quite easy to spot.
> >
> > Bother ... except for the SRAO cases where *everyone* is an innocent
> >
On Tue, Nov 11, 2014 at 06:06:48PM -0800, Tony Luck wrote:
> > Innocent bystanders have RIPV=1, EIPV=0 in MCG_STATUS ... so they
> > are quite easy to spot.
>
> Bother ... except for the SRAO cases where *everyone* is an innocent
> bystander - but someone should go look for the error and queue up
On Tue, Nov 11, 2014 at 06:06:48PM -0800, Tony Luck wrote:
Innocent bystanders have RIPV=1, EIPV=0 in MCG_STATUS ... so they
are quite easy to spot.
Bother ... except for the SRAO cases where *everyone* is an innocent
bystander - but someone should go look for the error and queue up
a
On Nov 12, 2014 2:30 AM, Borislav Petkov b...@alien8.de wrote:
On Tue, Nov 11, 2014 at 06:06:48PM -0800, Tony Luck wrote:
Innocent bystanders have RIPV=1, EIPV=0 in MCG_STATUS ... so they
are quite easy to spot.
Bother ... except for the SRAO cases where *everyone* is an innocent
On Wed, Nov 12, 2014 at 07:48:15AM -0800, Andy Lutomirski wrote:
I only switch stacks on entry from userspace, and the kernel stack is
completely empty if that happens.
Ok, fair enough. There's still the argument that something might've
corrupted the kernel stack memory while the MCE_STACK is
Not that easy for testing the #MC path - there we have to inject real
MCEs and then noodle through the memory_failure() code. I'd be very much
interested to see what would happen if two MCEs happen back-to-back with
your change, the second one being raised when we're on the kernel stack
and
On Wed, Nov 12, 2014 at 05:17:55PM +, Luck, Tony wrote:
Not that easy for testing the #MC path - there we have to inject real
MCEs and then noodle through the memory_failure() code. I'd be very much
interested to see what would happen if two MCEs happen back-to-back with
your change,
Andy,
As I said many times I do not understand asm ;) so most probably I missed
something but let me ask anyway.
On 11/11, Andy Lutomirski wrote:
--- a/arch/x86/kernel/entry_64.S
+++ b/arch/x86/kernel/entry_64.S
@@ -1064,6 +1064,9 @@ ENTRY(\sym)
CFI_ADJUST_CFA_OFFSET ORIG_RAX-R15
On Wed, Nov 12, 2014 at 2:00 PM, Oleg Nesterov o...@redhat.com wrote:
Andy,
As I said many times I do not understand asm ;) so most probably I missed
something but let me ask anyway.
You must be the most competent non-asm-speaking asm reviewer in the world :)
On 11/11, Andy Lutomirski
v2 coming soon with these changes and some additional comment cleanups.
So v1 + do_machine_check change is not surviving some real testing. I'm
injecting and
consuming errors sequentially with a small delay in between - so no fancy
corner cases with
multiple errors being processed ... we get
On Wed, Nov 12, 2014 at 3:41 PM, Luck, Tony tony.l...@intel.com wrote:
v2 coming soon with these changes and some additional comment cleanups.
v2's not going to make a difference unless you're using uprobes at the
same time.
So v1 + do_machine_check change is not surviving some real testing.
v2's not going to make a difference unless you're using uprobes at the
same time.
Not (knowingly) using uprobes. System is installed with a RHEL7 userspace ...
but is essentially
idle except for my test program.
In the interest of my sanity, can you add something like
On Wed, Nov 12, 2014 at 4:31 PM, Luck, Tony tony.l...@intel.com wrote:
v2's not going to make a difference unless you're using uprobes at the
same time.
Not (knowingly) using uprobes. System is installed with a RHEL7 userspace ...
but is essentially
idle except for my test program.
In the
> Innocent bystanders have RIPV=1, EIPV=0 in MCG_STATUS ... so they
> are quite easy to spot.
Bother ... except for the SRAO cases where *everyone* is an innocent
bystander - but someone should go look for the error and queue up
a page offline event. Perhaps for this we'd do the self-ipi trick
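
As a rough sketch of the "innocent bystander" test described above, the check comes down to two bits of IA32_MCG_STATUS. The bit positions are the architectural ones (RIPV is bit 0, EIPV is bit 1, MCIP is bit 2); the helper name `mce_innocent_bystander()` is made up for this example and is not a kernel function.

```c
#include <stdbool.h>
#include <stdint.h>

/* Architectural bit positions in IA32_MCG_STATUS. */
#define MCG_STATUS_RIPV	(1ULL << 0)	/* restart IP valid */
#define MCG_STATUS_EIPV	(1ULL << 1)	/* error IP valid */
#define MCG_STATUS_MCIP	(1ULL << 2)	/* machine check in progress */

/*
 * Hypothetical helper: true when this CPU can restart (RIPV=1) and the
 * saved IP is not tied to the error (EIPV=0).
 */
static bool mce_innocent_bystander(uint64_t mcg_status)
{
	return (mcg_status & MCG_STATUS_RIPV) &&
	       !(mcg_status & MCG_STATUS_EIPV);
}
```

As noted above, this test alone is not enough in the SRAO case, where every CPU reports RIPV=1, EIPV=0 and one of them still has to find the error and queue the page offline.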
On Tue, Nov 11, 2014 at 5:06 PM, Luck, Tony wrote:
>> I've thought about one sneaky option. If we can reliably determine
>> that we're an innocent bystander of a broadcast #MC, can we send an
>> IPI-to-self and return without clearing MCIP? Then we get another
>> interrupt as soon as interrupts