RE: [RFC PATCH] x86, entry: Switch stacks on a paranoid entry from userspace

2014-11-18 Thread Luck, Tony
> Your test case is presumably doing something that involves setting > undocumented registers* to program the CPU or memory controller to > generate a machine check on access to some address. Presumably this > is done by broadcasting an SMI and programming the registers in SMM. Good theory - but

Re: [RFC PATCH] x86, entry: Switch stacks on a paranoid entry from userspace

2014-11-18 Thread Andy Lutomirski
On Tue, Nov 18, 2014 at 10:30 AM, Luck, Tony wrote: >>> The lost cpu is *really* lost. Warm reset doesn't fix the machine, I >>> usually >>> have to do a full power cycle. > >> How is it even possible that I did that with a few lines of asm? > > Probably not directly your fault - some

RE: [RFC PATCH] x86, entry: Switch stacks on a paranoid entry from userspace

2014-11-18 Thread Luck, Tony
>> The lost cpu is *really* lost. Warm reset doesn't fix the machine, I usually >> have to do a full power cycle. > How is it even possible that I did that with a few lines of asm? Probably not directly your fault - some cascade of errors may have occurred. > Could this be a hardware

Re: [RFC PATCH] x86, entry: Switch stacks on a paranoid entry from userspace

2014-11-18 Thread Borislav Petkov
On Mon, Nov 17, 2014 at 12:05:59PM -0800, Andy Lutomirski wrote: > https://git.kernel.org/cgit/linux/kernel/git/luto/linux.git/log/?h=x86/paranoid > > I'm not quite ready to send v3. I want to do two things first: > > 1. Consider disabling the stack switch for double_fault. Sounds

Re: [RFC PATCH] x86, entry: Switch stacks on a paranoid entry from userspace

2014-11-17 Thread Andy Lutomirski
On Mon, Nov 17, 2014 at 4:22 PM, Luck, Tony wrote: >> It could also be interesting to tweak mce_panic to not actually panic >> the machine but to try to return and stop the test instead. Then real >> debugging could be possible :) > > The lost cpu is *really* lost. Warm reset doesn't fix the

RE: [RFC PATCH] x86, entry: Switch stacks on a paranoid entry from userspace

2014-11-17 Thread Luck, Tony
> It could also be interesting to tweak mce_panic to not actually panic > the machine but to try to return and stop the test instead. Then real > debugging could be possible :) The lost cpu is *really* lost. Warm reset doesn't fix the machine, I usually have to do a full power cycle. -Tony

Re: [RFC PATCH] x86, entry: Switch stacks on a paranoid entry from userspace

2014-11-17 Thread Andy Lutomirski
On Mon, Nov 17, 2014 at 3:16 PM, Luck, Tony wrote: >> I still wonder whether the timeout code is the real culprit. My patch >> will slow down entry into do_machine_check by tens of cycles, several >> cachelines, and possibly a couple of TLB misses. Given that the >> timing seemed marginal to

RE: [RFC PATCH] x86, entry: Switch stacks on a paranoid entry from userspace

2014-11-17 Thread Luck, Tony
> I still wonder whether the timeout code is the real culprit. My patch > will slow down entry into do_machine_check by tens of cycles, several > cachelines, and possibly a couple of TLB misses. Given that the > timing seemed marginal to me, it's possible (albeit not that likely) > that it

Re: [RFC PATCH] x86, entry: Switch stacks on a paranoid entry from userspace

2014-11-17 Thread Andy Lutomirski
On Mon, Nov 17, 2014 at 1:55 PM, Luck, Tony wrote: >>> However, I'd like to be very sure this thing doesn't introduce any >>> regressions to the MCA code. So even if Tony's testing passes, I'd like >>> to be very conservative here and stress it more than usual. Because once >>> this thing hits

RE: [RFC PATCH] x86, entry: Switch stacks on a paranoid entry from userspace

2014-11-17 Thread Luck, Tony
>> However, I'd like to be very sure this thing doesn't introduce any >> regressions to the MCA code. So even if Tony's testing passes, I'd like >> to be very conservative here and stress it more than usual. Because once >> this thing hits upstream and stuff starts breaking, it'll be a serious >>

Re: [RFC PATCH] x86, entry: Switch stacks on a paranoid entry from userspace

2014-11-17 Thread Andy Lutomirski
On Mon, Nov 17, 2014 at 12:03 PM, Borislav Petkov wrote: > On Mon, Nov 17, 2014 at 11:57:22AM -0800, Andy Lutomirski wrote: >> Would it be worth making a decision on task_work_add vs. stack >> switching first? > > Probably a prudent thing to do in order to save unnecessary cycles :-) > >> Stack

Re: [RFC PATCH] x86, entry: Switch stacks on a paranoid entry from userspace

2014-11-17 Thread Borislav Petkov
On Mon, Nov 17, 2014 at 11:57:22AM -0800, Andy Lutomirski wrote: > Would it be worth making a decision on task_work_add vs. stack > switching first? Probably a prudent thing to do in order to save unnecessary cycles :-) > Stack switching pros: all this lockless allocation stuff is completely >

Re: [RFC PATCH] x86, entry: Switch stacks on a paranoid entry from userspace

2014-11-17 Thread Andy Lutomirski
On Mon, Nov 17, 2014 at 10:50 AM, Borislav Petkov wrote: > On Fri, Nov 14, 2014 at 09:56:38PM +0000, Luck, Tony wrote: >> ... >> But I think that means we need more than one of these structures ... >> we may not be done with one before a new machine check occurs. So >> we'd have to make an

Re: [RFC PATCH] x86, entry: Switch stacks on a paranoid entry from userspace

2014-11-17 Thread Borislav Petkov
On Fri, Nov 14, 2014 at 09:56:38PM +0000, Luck, Tony wrote: > ... > But I think that means we need more than one of these structures ... > we may not be done with one before a new machine check occurs. So > we'd have to make an NMI-safe allocator to grab one for use inside > do_machine_check()

Re: [RFC PATCH] x86, entry: Switch stacks on a paranoid entry from userspace

2014-11-14 Thread Andy Lutomirski
On Fri, Nov 14, 2014 at 1:56 PM, Luck, Tony wrote: >>> Right, I can do it in the meantime and we can always experiment more >>> later. Getting rid of _TIF_MCE_NOTIFY is a good thing already. >> >> Yep, it looks pretty simple - not tested yet, it builds though. > > It seems pretty solid under test

RE: [RFC PATCH] x86, entry: Switch stacks on a paranoid entry from userspace

2014-11-14 Thread Luck, Tony
>> Right, I can do it in the meantime and we can always experiment more >> later. Getting rid of _TIF_MCE_NOTIFY is a good thing already. > > Yep, it looks pretty simple - not tested yet, it builds though. It seems pretty solid under test so far. Can we make it pass the address/flag to

RE: [RFC PATCH] x86, entry: Switch stacks on a paranoid entry from userspace

2014-11-14 Thread Luck, Tony
> So far, the only thing I've come up with is that do_machine_check > seems to be missing exception_enter or the equivalent. Do you have > CONFIG_CONTEXT_TRACKING on and/or full nohz enabled? I don't think > that this explains my bug, though. Yes to both: $ grep CONTEXT_TRACK .config

Re: [RFC PATCH] x86, entry: Switch stacks on a paranoid entry from userspace

2014-11-14 Thread Andy Lutomirski
On Fri, Nov 14, 2014 at 9:49 AM, Luck, Tony wrote: >> Can you also try rebasing onto what will probably be v3? >> >> https://git.kernel.org/cgit/linux/kernel/git/luto/linux.git/tag/?id=paranoid-stack-v2.9 > > Built that - with none of my other changes ... i.e. still use TIF_NOTIFY_MCE > etc. No

Re: [RFC PATCH] x86, entry: Switch stacks on a paranoid entry from userspace

2014-11-14 Thread Borislav Petkov
On Fri, Nov 14, 2014 at 09:26:26AM -0800, Andy Lutomirski wrote: > I was hoping for an actual worked-out example of what the parameters > should be :) Sorry, I haven't played with this myself either - haven't had a box with EINJ yet. Maybe Tony has something. -- Regards/Gruss, Boris. Sent

RE: [RFC PATCH] x86, entry: Switch stacks on a paranoid entry from userspace

2014-11-14 Thread Luck, Tony
>> It adds debugging for inappropriate reschedules from the wrong stack. >> Setting CONFIG_DEBUG_ATOMIC_SLEEP might also be a good idea. > > Will add that for next build/test Didn't see anything new. System died at 1108 recoveries with the "Timeout synchronization ..." panic -Tony

RE: [RFC PATCH] x86, entry: Switch stacks on a paranoid entry from userspace

2014-11-14 Thread Luck, Tony
> Can you also try rebasing onto what will probably be v3? > > https://git.kernel.org/cgit/linux/kernel/git/luto/linux.git/tag/?id=paranoid-stack-v2.9 Built that - with none of my other changes ... i.e. still use TIF_NOTIFY_MCE etc. No printk() in the MCE context. System ran 736

Re: [RFC PATCH] x86, entry: Switch stacks on a paranoid entry from userspace

2014-11-14 Thread Andy Lutomirski
On Fri, Nov 14, 2014 at 9:24 AM, Borislav Petkov wrote: > On Fri, Nov 14, 2014 at 09:18:51AM -0800, Andy Lutomirski wrote: >> Grr. Do you or Tony have any pointers for how to test this myself? I >> don't know enough about the acpi error injection thing, which I assume >> is what Tony is using. >

Re: [RFC PATCH] x86, entry: Switch stacks on a paranoid entry from userspace

2014-11-14 Thread Borislav Petkov
On Fri, Nov 14, 2014 at 09:18:51AM -0800, Andy Lutomirski wrote: > Grr. Do you or Tony have any pointers for how to test this myself? I > don't know enough about the acpi error injection thing, which I assume > is what Tony is using. Maybe that would help: Documentation/acpi/apei/einj.txt

Re: [RFC PATCH] x86, entry: Switch stacks on a paranoid entry from userspace

2014-11-14 Thread Andy Lutomirski
On Nov 14, 2014 2:34 AM, "Borislav Petkov" wrote: > > On Wed, Nov 12, 2014 at 07:03:21PM -0800, Andy Lutomirski wrote: > > printk seems to work just fine in do_machine_check. > > That must be pure luck. Has anything changed which I missed to make > printk NMI-safe? Heh. Probably not. Now I

Re: [RFC PATCH] x86, entry: Switch stacks on a paranoid entry from userspace

2014-11-14 Thread Borislav Petkov
On Wed, Nov 12, 2014 at 07:03:21PM -0800, Andy Lutomirski wrote: > printk seems to work just fine in do_machine_check. That must be pure luck. Has anything changed which I missed to make printk NMI-safe? -- Regards/Gruss, Boris. Sent from a fat crate under my desk. Formatting is fine. --

Re: [RFC PATCH] x86, entry: Switch stacks on a paranoid entry from userspace

2014-11-13 Thread Andy Lutomirski
On Thu, Nov 13, 2014 at 5:20 PM, Luck, Tony wrote: > "worst == > MCE_AR_SEVERITY but regs->cs == 0 (i.e. in kernel)" > > This can't happen. We can only declare AR severity for a user mode fault. I believe you, and I see that in the code, but the code is mightily twisted. Anyway, my v3 will also

Re: [RFC PATCH] x86, entry: Switch stacks on a paranoid entry from userspace

2014-11-13 Thread Luck, Tony
"worst == MCE_AR_SEVERITY but regs->cs == 0 (i.e. in kernel)" This can't happen. We can only declare AR severity for a user mode fault. Sent from my iPhone > On Nov 13, 2014, at 16:50, Andy Lutomirski wrote: > > worst == > MCE_AR_SEVERITY but regs->cs == 0 (i.e. in kernel)

Re: [RFC PATCH] x86, entry: Switch stacks on a paranoid entry from userspace

2014-11-13 Thread Andy Lutomirski
On Thu, Nov 13, 2014 at 3:13 PM, Andy Lutomirski wrote: > On Thu, Nov 13, 2014 at 2:47 PM, Andy Lutomirski wrote: >> On Thu, Nov 13, 2014 at 2:33 PM, Luck, Tony wrote: >>>> Are you sure that this works in an unmodified kernel >>> >>> Unmodified kernel has run tens of thousands of >>>

Re: [RFC PATCH] x86, entry: Switch stacks on a paranoid entry from userspace

2014-11-13 Thread Andy Lutomirski
On Thu, Nov 13, 2014 at 2:47 PM, Andy Lutomirski wrote: > On Thu, Nov 13, 2014 at 2:33 PM, Luck, Tony wrote: >>> Are you sure that this works in an unmodified kernel >> >> Unmodified kernel has run tens of thousands of >> injection/consumption/recovery cycles. >> >> I did get a crash with the

Re: [RFC PATCH] x86, entry: Switch stacks on a paranoid entry from userspace

2014-11-13 Thread Andy Lutomirski
On Thu, Nov 13, 2014 at 2:33 PM, Luck, Tony wrote: >> Are you sure that this works in an unmodified kernel > > Unmodified kernel has run tens of thousands of injection/consumption/recovery > cycles. > > I did get a crash with the entry/exit traces you asked for. Last 2 lines > of console

RE: [RFC PATCH] x86, entry: Switch stacks on a paranoid entry from userspace

2014-11-13 Thread Luck, Tony
> Are you sure that this works in an unmodified kernel Unmodified kernel has run tens of thousands of injection/consumption/recovery cycles. I did get a crash with the entry/exit traces you asked for. Last 2 lines of console log attached. There are a couple of OOPs before things fall

Re: [RFC PATCH] x86, entry: Switch stacks on a paranoid entry from userspace

2014-11-13 Thread Andy Lutomirski
On Thu, Nov 13, 2014 at 2:23 PM, Andy Lutomirski wrote: > On Thu, Nov 13, 2014 at 10:43 AM, Luck, Tony wrote: >>> printk seems to work just fine in do_machine_check. Any chance you >>> can instrument, for each cpu, all entries to do_machine_check, all >>> calls to do_machine_check, all returns,

Re: [RFC PATCH] x86, entry: Switch stacks on a paranoid entry from userspace

2014-11-13 Thread Andy Lutomirski
On Thu, Nov 13, 2014 at 10:43 AM, Luck, Tony wrote: >> printk seems to work just fine in do_machine_check. Any chance you >> can instrument, for each cpu, all entries to do_machine_check, all >> calls to do_machine_check, all returns, and everything that tries to >> do memory_failure? > > I

Re: [RFC PATCH] x86, entry: Switch stacks on a paranoid entry from userspace

2014-11-13 Thread Borislav Petkov
On Thu, Nov 13, 2014 at 11:59:37AM +0100, Borislav Petkov wrote: > I've been thinking about it recently too - adding MCA functionality to > qemu/kvm could be very useful, especially the thresholding stuff, for > testing RAS kernel code. Btw, qemu monitor has a mce injection command with which I

RE: [RFC PATCH] x86, entry: Switch stacks on a paranoid entry from userspace

2014-11-13 Thread Luck, Tony
> printk seems to work just fine in do_machine_check. Any chance you > can instrument, for each cpu, all entries to do_machine_check, all > calls to do_machine_check, all returns, and everything that tries to > do memory_failure? I first added a printk() just for the cpu that calls

Re: [RFC PATCH] x86, entry: Switch stacks on a paranoid entry from userspace

2014-11-13 Thread Borislav Petkov
On Wed, Nov 12, 2014 at 05:22:25PM +0100, Borislav Petkov wrote: > > Less intrusive is certainly true. > > Right, I can do it in the meantime and we can always experiment more > later. Getting rid of _TIF_MCE_NOTIFY is a good thing already. Yep, it looks pretty simple - not tested yet, it builds

Re: [RFC PATCH] x86, entry: Switch stacks on a paranoid entry from userspace

2014-11-13 Thread Borislav Petkov
On Thu, Nov 13, 2014 at 12:31:30AM +0000, Luck, Tony wrote: > > Is this something I can try under KVM? > > I don't know if KVM has a way to simulate a machine check event. I've been thinking about it recently too - adding MCA functionality to qemu/kvm could be very useful, especially the

Re: [RFC PATCH] x86, entry: Switch stacks on a paranoid entry from userspace

2014-11-12 Thread Andy Lutomirski
On Wed, Nov 12, 2014 at 4:31 PM, Luck, Tony wrote: >> v2's not going to make a difference unless you're using uprobes at the >> same time. > > Not (knowingly) using uprobes. System is installed with a RHEL7 userspace ... > but is essentially > idle except for my test program. > >> In the

Re: [RFC PATCH] x86, entry: Switch stacks on a paranoid entry from userspace

2014-11-12 Thread Andy Lutomirski
On Wed, Nov 12, 2014 at 4:31 PM, Luck, Tony wrote: >> v2's not going to make a difference unless you're using uprobes at the >> same time. > > Not (knowingly) using uprobes. System is installed with a RHEL7 userspace ... > but is essentially > idle except for my test program. > >> In the

RE: [RFC PATCH] x86, entry: Switch stacks on a paranoid entry from userspace

2014-11-12 Thread Luck, Tony
> v2's not going to make a difference unless you're using uprobes at the > same time. Not (knowingly) using uprobes. System is installed with a RHEL7 userspace ... but is essentially idle except for my test program. > In the interest of my sanity, can you add something like >

Re: [RFC PATCH] x86, entry: Switch stacks on a paranoid entry from userspace

2014-11-12 Thread Andy Lutomirski
On Wed, Nov 12, 2014 at 3:41 PM, Luck, Tony wrote: >> v2 coming soon with these changes and some additional comment cleanups. > v2's not going to make a difference unless you're using uprobes at the same time. > So v1 + do_machine_check change is not surviving some real testing. I'm >

RE: [RFC PATCH] x86, entry: Switch stacks on a paranoid entry from userspace

2014-11-12 Thread Luck, Tony
> v2 coming soon with these changes and some additional comment cleanups. So v1 + do_machine_check change is not surviving some real testing. I'm injecting and consuming errors sequentially with a small delay in between - so no fancy corner cases with multiple errors being processed ... we get

Re: [RFC PATCH] x86, entry: Switch stacks on a paranoid entry from userspace

2014-11-12 Thread Andy Lutomirski
On Wed, Nov 12, 2014 at 2:00 PM, Oleg Nesterov wrote: > Andy, > > As I said many times I do not understand asm ;) so most probably I missed > something but let me ask anyway. You must be the most competent non-asm-speaking asm reviewer in the world :) > > On 11/11, Andy Lutomirski wrote: >> >>

Re: [RFC PATCH] x86, entry: Switch stacks on a paranoid entry from userspace

2014-11-12 Thread Oleg Nesterov
Andy,

As I said many times I do not understand asm ;) so most probably I missed
something but let me ask anyway.

On 11/11, Andy Lutomirski wrote:
>
> --- a/arch/x86/kernel/entry_64.S
> +++ b/arch/x86/kernel/entry_64.S
> @@ -1064,6 +1064,9 @@ ENTRY(\sym)
>   CFI_ADJUST_CFA_OFFSET ORIG_RAX-R15

Re: [RFC PATCH] x86, entry: Switch stacks on a paranoid entry from userspace

2014-11-12 Thread Borislav Petkov
On Wed, Nov 12, 2014 at 05:17:55PM +, Luck, Tony wrote:
> > Not that easy for testing the #MC path - there we have to inject real
> > MCEs and then noodle through the memory_failure() code. I'd be very much
> > interested to see what would happen if two MCEs happen back-to-back with
> > your

RE: [RFC PATCH] x86, entry: Switch stacks on a paranoid entry from userspace

2014-11-12 Thread Luck, Tony
> Not that easy for testing the #MC path - there we have to inject real
> MCEs and then noodle through the memory_failure() code. I'd be very much
> interested to see what would happen if two MCEs happen back-to-back with
> your change, the second one being raised when we're on the kernel stack
>

Re: [RFC PATCH] x86, entry: Switch stacks on a paranoid entry from userspace

2014-11-12 Thread Borislav Petkov
On Wed, Nov 12, 2014 at 07:48:15AM -0800, Andy Lutomirski wrote:
> I only switch stacks on entry from userspace, and the kernel stack is
> completely empty if that happens.

Ok, fair enough. There's still the argument that something might've
corrupted the kernel stack memory while the MCE_STACK is
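
[Editor's sketch] The rule quoted above ("I only switch stacks on entry from userspace") can be illustrated in C. This is a minimal sketch, not the patch itself, which does this work in assembly in entry_64.S. It assumes the usual x86 convention that the low two bits of the saved CS selector give the privilege level that was interrupted (3 for user mode). The struct `saved_frame`, the enum, and both function names are hypothetical and exist only for this illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Low two bits of a segment selector: requested privilege level. */
#define SEGMENT_RPL_MASK 0x3
#define USER_RPL         0x3

/* Hypothetical stand-in for the hardware-saved exception frame. */
struct saved_frame {
    uint64_t ip;
    uint64_t cs;
    uint64_t sp;
};

/* Hypothetical result type: which stack the handler should run on. */
enum entry_stack {
    STAY_ON_IST_STACK,      /* interrupted kernel code: do not switch */
    SWITCH_TO_KERNEL_STACK  /* interrupted user code: kernel stack is empty */
};

/* True if the exception interrupted user-mode code. */
static bool entered_from_userspace(const struct saved_frame *f)
{
    return (f->cs & SEGMENT_RPL_MASK) == USER_RPL;
}

/* Switch stacks only when the entry came from userspace. */
static enum entry_stack choose_entry_stack(const struct saved_frame *f)
{
    return entered_from_userspace(f) ? SWITCH_TO_KERNEL_STACK
                                     : STAY_ON_IST_STACK;
}
```

With the selector values Linux uses on x86-64 (0x33 for user code, 0x10 for kernel code), the first case would switch and the second would stay on the IST stack.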

Re: [RFC PATCH] x86, entry: Switch stacks on a paranoid entry from userspace

2014-11-12 Thread Andy Lutomirski
On Nov 12, 2014 2:30 AM, "Borislav Petkov" wrote:
>
> On Tue, Nov 11, 2014 at 06:06:48PM -0800, Tony Luck wrote:
> > > Innocent bystanders have RIPV=1, EIPV=0 in MCG_STATUS ... so they
> > > are quite easy to spot.
> >
> > Bother ... except for the SRAO cases where *everyone* is an innocent
> >

Re: [RFC PATCH] x86, entry: Switch stacks on a paranoid entry from userspace

2014-11-12 Thread Borislav Petkov
On Tue, Nov 11, 2014 at 06:06:48PM -0800, Tony Luck wrote:
> > Innocent bystanders have RIPV=1, EIPV=0 in MCG_STATUS ... so they
> > are quite easy to spot.
>
> Bother ... except for the SRAO cases where *everyone* is an innocent
> bystander - but someone should go look for the error and queue up

Re: [RFC PATCH] x86, entry: Switch stacks on a paranoid entry from userspace

2014-11-11 Thread Tony Luck
> Innocent bystanders have RIPV=1, EIPV=0 in MCG_STATUS ... so they
> are quite easy to spot.

Bother ... except for the SRAO cases where *everyone* is an innocent
bystander - but someone should go look for the error and queue up a
page offline event. Perhaps for this we'd do the self-ipi trick
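
[Editor's sketch] The "RIPV=1, EIPV=0" test mentioned above can be written as a small C predicate. This is a minimal sketch under stated assumptions: the bit positions follow the IA32_MCG_STATUS layout in the Intel SDM (RIPV is bit 0, EIPV is bit 1, MCIP is bit 2), and the helper name `mce_is_innocent_bystander` is hypothetical, not a kernel function. As the message itself points out, this check alone is not enough for SRAO errors, where every CPU looks like a bystander and one of them still has to find the error.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit positions in IA32_MCG_STATUS, per the Intel SDM. */
#define MCG_STATUS_RIPV (1ULL << 0) /* restart IP valid */
#define MCG_STATUS_EIPV (1ULL << 1) /* error IP valid */
#define MCG_STATUS_MCIP (1ULL << 2) /* machine check in progress */

/*
 * Hypothetical helper: a CPU that took a broadcast #MC but whose
 * interrupted instruction is unrelated to the error reports
 * RIPV=1 (safe to restart) and EIPV=0 (saved IP is not tied to the error).
 */
static bool mce_is_innocent_bystander(uint64_t mcg_status)
{
    return (mcg_status & MCG_STATUS_RIPV) &&
           !(mcg_status & MCG_STATUS_EIPV);
}
```

A value read from the MSR during a machine check would normally also have MCIP set; the predicate ignores that bit and looks only at RIPV and EIPV.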

Re: [RFC PATCH] x86, entry: Switch stacks on a paranoid entry from userspace

2014-11-11 Thread Andy Lutomirski
On Tue, Nov 11, 2014 at 5:06 PM, Luck, Tony wrote:
>> I've thought about one sneaky option. If we can reliably determine
>> that we're an innocent bystander of a broadcast #MC, can we send an
>> IPI-to-self and return without clearing MCIP? Then we get another
>> interrupt as soon as interrupts
