Re: [RFC] x86_64: A real proposal for iret-less return to kernel
On Thu, May 22, 2014 at 09:03:34AM +0900, Linus Torvalds wrote:
> No, that's fine, if it's a thread-synchronous thing (ie a memory load
> that causes errors). But for NMI handlers, that is irrelevant: if
> the NMI code itself gets memory errors, the machine really is dead.
> Let's face it, we're going to panic and reboot, there's no other
> real alternative (other than the "just log it, pray, and continue
> in unstable mode", which is actually a perfectly valid alternative
> in many cases, since people don't necessarily care deeply and have
> written their distributed algorithms to not rely on any particular
> thread too much, and will verify the end results anyway).

Oh, definitely. In fact, we'll panic on uncorrectable errors in any
unmovable memory, i.e. kernel code and data, because we simply can't
recover from it. Anything that happens in the NMI handler most probably
falls in that category so...

I was simply pointing out the fact that Andy's algo needs to pay
attention to MCEs and other higher prio exceptions happening.

> The problem is literally the non-synchronous things (like another
> CPU having problems) where things like broadcast will actually turn
> a non-thread-synchronous thing into problems for other CPU's. Then,
> a user-mode memory access error (that we *can* recover from, perhaps
> by killing the process and isolating the page) can turn into an
> unrecoverable error on another CPU because it got interrupted at a
> point where it really couldn't afford to be interrupted.

That definitely sounds like a nasty thing, sure.

Although, there's at least one problem I've been thinking about wrt the
non-broadcast MCE: it is pretty hard to handle an uncorrectable memory
error in a page which is shared by multiple threads running on multiple
cores. So normally one of the cores will detect it, raise an MCE and
deal with it but there's nothing stopping the other cores from touching
that data.

One of the possible things which could happen is, if the other cores
consume that data, they will trigger an MCE too and will have to see
that the first core which detected the error is about to poison that
page so their job in the MCE handler is done and they have to exit.

I'm not saying this is undoable but it is a bit tricky and some
scenarios would need to be played out first to know better.

So, to a certain extent, broadcasting the MCE and keeping the cores in a
holding pattern, not touching any userspace stuff, might've been one way
to deal with situations like that. It certainly makes things easier for
that particular scenario.

I'm not saying it was a good idea due to the point you're making - maybe
they should've talked to software people first. I'm basically trying to
explain to myself what the reasoning behind that broadcasting might be.

> It appears Intel is fixing their braindamage.

Yep, we'd still need to deal with the existing systems but we don't have
a choice anyway.

-- 
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
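The "first core poisons the page, later cores see that and exit" idea in the message above can be illustrated with a small user-space sketch. Everything here is hypothetical: `struct fake_page`, the state names and both functions are invented for illustration and are not kernel structures or APIs. It only shows one possible way for racing handlers to agree on a single owner, using a C11 compare-and-exchange.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical per-page recovery state; not a real kernel structure. */
enum page_state { PAGE_OK = 0, PAGE_POISONING = 1, PAGE_POISONED = 2 };

struct fake_page {
    _Atomic int state;
};

/*
 * Called from a (simulated) MCE handler on a core that consumed bad data.
 * Returns true if this core won the race and must do the poisoning work.
 * Returns false if another core has already claimed the page, in which
 * case this core's job in the handler is done and it can simply exit.
 */
bool claim_page_for_poisoning(struct fake_page *p)
{
    int expected = PAGE_OK;

    /* Only one caller can move the page from PAGE_OK to PAGE_POISONING. */
    return atomic_compare_exchange_strong(&p->state, &expected,
                                          PAGE_POISONING);
}

/* Called by the winning core once the page has been isolated. */
void finish_poisoning(struct fake_page *p)
{
    atomic_store(&p->state, PAGE_POISONED);
}
```

The sketch says nothing about the hard parts raised in the thread, such as what the losing cores may safely touch while the winner is still working.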
Re: [RFC] x86_64: A real proposal for iret-less return to kernel
On Wed, May 21, 2014 at 4:51 PM, Borislav Petkov wrote:
> On Thu, May 22, 2014 at 08:30:33AM +0900, Linus Torvalds wrote:
>> If the OS then decides to take down the whole machine, the OS - not
>> the hardware - can choose to do something that will punch through
>> other CPU's NMI blocking (notably, init/reset), but the hardware doing
>> this on its own is just broken if true.
>
> Not that it is any consolation but MCE is not broadcast on AMD.
>
> Regardless, exceptions like MCE cannot be held pending and do pierce the
> NMI handler on both.
>
> Now, if the NMI handler experiences a non-broadcast MCE on the same CPU,
> while running, we're simply going to panic as we're in kernel space
> anyway.
>
> The only problem is if the NMI handler gets interrupted while running
> on a bystander CPU. And I think we could deal with this because the
> bystander would not see an MCE and will return safely. We just need
> to make sure that it returns back to the said NMI handler and not to
> userspace. Unless I'm missing something ...

Under my "always RET unless returning from IST to weird CS or to
specific known-invalid-stack regions" proposal this should work fine.

In the current code it'll also work fine *unless* it hits really early
in the NMI, in which case a second NMI can kill us.

--Andy
Re: [RFC] x86_64: A real proposal for iret-less return to kernel
On Thu, May 22, 2014 at 8:51 AM, Borislav Petkov wrote:
>
> Regardless, exceptions like MCE cannot be held pending and do pierce the
> NMI handler on both.

No, that's fine, if it's a thread-synchronous thing (ie a memory load
that causes errors). But for NMI handlers, that is irrelevant: if the
NMI code itself gets memory errors, the machine really is dead. Let's
face it, we're going to panic and reboot, there's no other real
alternative (other than the "just log it, pray, and continue in
unstable mode", which is actually a perfectly valid alternative in
many cases, since people don't necessarily care deeply and have
written their distributed algorithms to not rely on any particular
thread too much, and will verify the end results anyway).

The problem is literally the non-synchronous things (like another CPU
having problems) where things like broadcast will actually turn a
non-thread-synchronous thing into problems for other CPU's. Then, a
user-mode memory access error (that we *can* recover from, perhaps by
killing the process and isolating the page) can turn into an
unrecoverable error on another CPU because it got interrupted at a
point where it really couldn't afford to be interrupted.

It appears Intel is fixing their braindamage.

               Linus
Re: [RFC] x86_64: A real proposal for iret-less return to kernel
On Thu, May 22, 2014 at 08:30:33AM +0900, Linus Torvalds wrote:
> If the OS then decides to take down the whole machine, the OS - not
> the hardware - can choose to do something that will punch through
> other CPU's NMI blocking (notably, init/reset), but the hardware doing
> this on its own is just broken if true.

Not that it is any consolation but MCE is not broadcast on AMD.

Regardless, exceptions like MCE cannot be held pending and do pierce the
NMI handler on both.

Now, if the NMI handler experiences a non-broadcast MCE on the same CPU,
while running, we're simply going to panic as we're in kernel space
anyway.

The only problem is if the NMI handler gets interrupted while running
on a bystander CPU. And I think we could deal with this because the
bystander would not see an MCE and will return safely. We just need
to make sure that it returns back to the said NMI handler and not to
userspace. Unless I'm missing something ...

Oh yeah, fun :-\

-- 
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--
RE: [RFC] x86_64: A real proposal for iret-less return to kernel
> MCE is frankly misdesigned. It's a piece of shit, and any of the
> hardware designers that claim that what they do is for system
> stability are out to lunch. This is a prime example of what *NOT* to
> do, and how you can actually spread what was potentially a localized
> and recoverable error, and make it global and unrecoverable.

Latest SDM (version 050 from late February this year) describes how this
is going to be fixed. Recoverable machine checks are going to be thread
local. But current silicon still has the broadcast behavior ... silicon
development pipeline is very long :-(

-Tony
Re: [RFC] x86_64: A real proposal for iret-less return to kernel
On Thu, May 22, 2014 at 8:23 AM, Andi Kleen wrote:
>> Hardware-interrupts during kernel are actually fairly common under
>> network-intensive loads, even outside of idle (but idle is admittedly
>> likely *the* most common one). Many network loads are fairly
>> kernel-intensive.
>
> For network workloads we can arbitrarily coalesce interrupts or just use
> NAPI to lower the costs. No need to optimize network interrupts too much.

BS. Lots of network loads are latency-critical, to the point that
people sometimes actually turn off coalescing.

But even with coalescing, it doesn't do crap for ping-pong kinds of
loads that are not "interrupt storm from tons and tons of separate
packets", but "lots of individual packets that are data-dependent", so
you don't have new ones coming in while processing old ones.

Ask Andy L. He had numbers. Interrupt overhead was quite big for him.

And you ignored the real issue: special-casing idle is *stupid*. It's
more complicated, and gives fewer cases where it helps. It's simply
fundamentally stupid and wrong.

               Linus
Re: [RFC] x86_64: A real proposal for iret-less return to kernel
On Thu, May 22, 2014 at 8:19 AM, Luck, Tony wrote:
>
> Yes. Bystander broadcast machine checks can and will hit processors
> that are in NMI context ... and we must not make that fatal.

.. and this, btw, is just another example of why MCE hardware
designers are f*cking morons that should be given extensive education
about birth control and how not to procreate.

MCE is frankly misdesigned. It's a piece of shit, and any of the
hardware designers that claim that what they do is for system
stability are out to lunch. This is a prime example of what *NOT* to
do, and how you can actually spread what was potentially a localized
and recoverable error, and make it global and unrecoverable.

Can we please get these designers either fired, or re-educated?
Because this shit has been going on too long. I complained about this
to Tony many years ago, and nothing was ever fixed.

Synchronous MCE's are fine for synchronous errors, but then trying to
turn them "synchronous" for other CPU's (where they *weren't*
synchronous errors) is a major mistake. External errors punching
through irq context is wrong, punching through NMI is just
inexcusable.

If the OS then decides to take down the whole machine, the OS - not
the hardware - can choose to do something that will punch through
other CPU's NMI blocking (notably, init/reset), but the hardware doing
this on its own is just broken if true.

Anyway, I repeat: I refuse to fix hardware bugs. As far as we are
concerned, this is "best effort", and the hardware designers should
take a long deep look at their idiotic schemes. If something punches
through NMI, it's deadly. It's that simple.

               Linus
Re: [RFC] x86_64: A real proposal for iret-less return to kernel
> Hardware-interrupts during kernel are actually fairly common under
> network-intensive loads, even outside of idle (but idle is admittedly
> likely *the* most common one). Many network loads are fairly
> kernel-intensive.

For network workloads we can arbitrarily coalesce interrupts or just use
NAPI to lower the costs. No need to optimize network interrupts too much.

-Andi
RE: [RFC] x86_64: A real proposal for iret-less return to kernel
> FWIW, this means that there really is a problem if one of these #MC
> errors hits an innocent bystander who just happens to be handling an
> NMI, at least if we delete the nested NMI code. But I think my
> simplified proposal gets this right.

Yes. Bystander broadcast machine checks can and will hit processors
that are in NMI context ... and we must not make that fatal.

Peek harder at your proposal so you can state confidently that you
get this right. "I think ... gets this right" is a bit too wishy-washy
for mission critical :-)

-Tony
Re: [RFC] x86_64: A real proposal for iret-less return to kernel
On Wed, May 21, 2014 at 4:05 PM, Luck, Tony wrote:
> On Wed, May 21, 2014 at 03:39:11PM -0700, Andy Lutomirski wrote:
>> But if we get a new MCE in here, it will be an MCE from kernel context
>> and it's fatal. So, yes, we'll clobber the stack, but we'll never
>> return (unless tolerant is set to something insane), so who cares?
>
> Remember that machine checks are broadcast. So some other cpu
> can hit a recoverable machine check in user mode ... but that int#18
> goes everywhere. Other cpus are innocent bystanders ... they will
> see MCG_STATUS.RIPV=1, MCG_STATUS.EIPV=0 and nothing important
> in any of their machine check banks.
>
> But if we are still finishing off processing the previous machine check,
> this will be a nested one - and BOOM, we are dead.

Oh. Well, crap.

FWIW, this means that there really is a problem if one of these #MC
errors hits an innocent bystander who just happens to be handling an
NMI, at least if we delete the nested NMI code. But I think my
simplified proposal gets this right.

>
> -Tony
>
> [If you peer closely at the latest edition of the SDM - you'll see the
> bits are defined for a non-broadcast model ... e.g. LMCE_S bit in
> MCG_STATUS but currently shipping silicon doesn't use that]

-- 
Andy Lutomirski
AMA Capital Management, LLC
RE: [RFC] x86_64: A real proposal for iret-less return to kernel
On Wed, May 21, 2014 at 03:39:11PM -0700, Andy Lutomirski wrote:
> But if we get a new MCE in here, it will be an MCE from kernel context
> and it's fatal. So, yes, we'll clobber the stack, but we'll never
> return (unless tolerant is set to something insane), so who cares?

Remember that machine checks are broadcast. So some other cpu
can hit a recoverable machine check in user mode ... but that int#18
goes everywhere. Other cpus are innocent bystanders ... they will
see MCG_STATUS.RIPV=1, MCG_STATUS.EIPV=0 and nothing important
in any of their machine check banks.

But if we are still finishing off processing the previous machine check,
this will be a nested one - and BOOM, we are dead.

-Tony

[If you peer closely at the latest edition of the SDM - you'll see the
bits are defined for a non-broadcast model ... e.g. LMCE_S bit in
MCG_STATUS but currently shipping silicon doesn't use that]
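The bystander signature described above (RIPV=1, EIPV=0, nothing interesting in the banks) could be checked along the following lines. This is a rough sketch, not the kernel's actual logic. The bit positions follow the IA32_MCG_STATUS and IA32_MCi_STATUS layouts in the Intel SDM, but the macro names and the function `looks_like_bystander()` are made up here. Treating "nothing important" as "no bank has its VAL bit set" is a simplifying assumption.

```c
#include <stdbool.h>
#include <stdint.h>

/* IA32_MCG_STATUS bits (names local to this sketch). */
#define SK_MCG_STATUS_RIPV   (1ULL << 0)  /* restart IP valid */
#define SK_MCG_STATUS_EIPV   (1ULL << 1)  /* error IP valid */
#define SK_MCG_STATUS_MCIP   (1ULL << 2)  /* machine check in progress */
#define SK_MCG_STATUS_LMCE_S (1ULL << 3)  /* local machine check signaled */

/* IA32_MCi_STATUS.VAL: the bank holds a valid error record. */
#define SK_MCI_STATUS_VAL    (1ULL << 63)

/*
 * Returns true when the values read on this CPU match the "innocent
 * bystander" pattern: execution can be restarted (RIPV=1), the saved IP
 * is not tied to the error (EIPV=0), and no bank reports anything.
 * Assumption: any valid bank record counts as "important".
 */
bool looks_like_bystander(uint64_t mcg_status,
                          const uint64_t *bank_status, int nbanks)
{
    if (!(mcg_status & SK_MCG_STATUS_RIPV))
        return false;
    if (mcg_status & SK_MCG_STATUS_EIPV)
        return false;

    for (int i = 0; i < nbanks; i++) {
        if (bank_status[i] & SK_MCI_STATUS_VAL)
            return false;
    }
    return true;
}
```

A caller would pass the MSR values it has already read. The sketch deliberately does no MSR access itself so that it stays runnable in user space.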
Re: [RFC] x86_64: A real proposal for iret-less return to kernel
On 05/21/2014 03:41 PM, Andy Lutomirski wrote:
>>
>> I think you are onto something here.
>>
>> In particular, the key observation here is that inside the kernel, we
>> can never *both* have an invalid stack *and* be inside an NMI, #MC or
>> #DB handler, even if nested.
>
> Except for espfix :)

Argh. Yes, I got that wrong... it isn't really about being inside NMI,
#MC or #DB, but rather being on those respective stacks.

If you are on the espfix stack you are on your way back to userspace OR
(and this gets really, really ugly) you took an NMI/MC/DB after a
SYSCALL executed in 16-bit mode, but even then you are in the kernel
entry/exit code and re-enabling NMI is fine.

>> Now, does this prevent us from using RET in the common case? I'm not
>> sure it is a huge loss since kernel-to-kernel is relatively rare.
>
> I don't think so. The most common case should be plain old interrupts
> and I suspect that #PF is a distant second.
>
> In any event, plain old interrupts and #PF are non-IST interrupts and
> they should be unconditionally safe for RET

-hpa
Re: [RFC] x86_64: A real proposal for iret-less return to kernel
On Wed, May 21, 2014 at 03:52:16PM -0700, Andy Lutomirski wrote:
> I'm suggesting that you re-enable interrupts and do the work in
> do_machine_check. I think it'll just work. It might pay to set a flag
> so that you panic very loudly if do_machine_check recurses.

And that might happen very likely if we're trying to poison a page which
is shared by a couple of processes' mm's and some process on some cpu
starts touching it. So keeping all cpus in a holding pattern is much
safer, IMO. (#MC is broadcast on Intel, I'm sure you know).

And even if it made sense, why go to the trouble? To shorten the time
we're in the MCE handler? Well, if we spend too much time in it, then
the box is dying anyway. On a normal, healthy hw, do_machine_check
doesn't run. :-)

> I suspect that, if the hardware is generating machine checks while
> doing memory poisoning, the hardware is broken enough that even
> panicking might not work, though :)

Yeah, in such cases, they tend to escalate to fatal errors very fast so
we panic right on the spot.

-- 
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--
Re: [RFC] x86_64: A real proposal for iret-less return to kernel
On Wed, May 21, 2014 at 3:48 PM, Borislav Petkov wrote:
> On Wed, May 21, 2014 at 03:39:11PM -0700, Andy Lutomirski wrote:
>> But if we get a new MCE in here, it will be an MCE from kernel context
>> and it's fatal. So, yes, we'll clobber the stack, but we'll never
>> return (unless tolerant is set to something insane), so who cares?
>
> Ok, but we still have to do the work before returning to the process. So
> if not mce_notify_process() how else are you suggesting we do this?

I'm suggesting that you re-enable interrupts and do the work in
do_machine_check. I think it'll just work. It might pay to set a flag
so that you panic very loudly if do_machine_check recurses.

I suspect that, if the hardware is generating machine checks while
doing memory poisoning, the hardware is broken enough that even
panicking might not work, though :)

--Andy
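The "set a flag so that you panic very loudly if do_machine_check recurses" suggestion might look roughly like the sketch below. It is a user-space model only. `sketch_do_machine_check()`, `mce_in_progress`, `fake_panic()` and `panic_count` are all hypothetical names, a plain global stands in for a per-CPU variable, and a nested #MC is modeled as a recursive call. A real handler would panic rather than count and return.

```c
#include <stdbool.h>

/* Hypothetical stand-in for a per-CPU "handler is running" flag. */
bool mce_in_progress;

/* Stand-in for panic(): counts calls so the sketch can return. */
int panic_count;

void fake_panic(const char *msg)
{
    (void)msg;
    panic_count++;
}

/*
 * Model of a machine check handler that does its long-running recovery
 * work inline. nested_mcs is the number of further machine checks that
 * arrive while the work is still in progress.
 */
void sketch_do_machine_check(int nested_mcs)
{
    if (mce_in_progress) {
        /* We re-entered while a previous #MC was still being handled. */
        fake_panic("machine check handler recursed");
        return;
    }

    mce_in_progress = true;

    /* Recovery work (walking mms, poisoning the page) would go here. */
    if (nested_mcs > 0)
        sketch_do_machine_check(nested_mcs - 1);

    mce_in_progress = false;
}
```

The flag only makes the recursion visible. It does not address the objection raised elsewhere in the thread that a second machine check arrives on the same, single machine check stack.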
Re: [RFC] x86_64: A real proposal for iret-less return to kernel
On Wed, May 21, 2014 at 03:39:11PM -0700, Andy Lutomirski wrote:
> But if we get a new MCE in here, it will be an MCE from kernel context
> and it's fatal. So, yes, we'll clobber the stack, but we'll never
> return (unless tolerant is set to something insane), so who cares?

Ok, but we still have to do the work before returning to the process. So
if not mce_notify_process() how else are you suggesting we do this?

-- 
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--
Re: [RFC] x86_64: A real proposal for iret-less return to kernel
On Wed, May 21, 2014 at 3:36 PM, H. Peter Anvin wrote:
> On 05/21/2014 11:11 AM, Andy Lutomirski wrote:
>> On Tue, May 20, 2014 at 5:53 PM, Andy Lutomirski wrote:
>>> Here's a real proposal for iret-less return. If this is correct, then
>>> NMIs will never nest, which will probably delete a lot more scariness
>>> than is added by the code I'm describing.
>>
>> OK, here's a case where I'm wrong. An NMI interrupts userspace on a
>> 16-bit stack. The return from NMI goes through the espfix code.
>> Something interrupts while on the espfix stack. Boom! Neither return
>> style is particularly good.
>>
>> More generally, if we got interrupted while on the espfix stack, we
>> need to return back there using IRET. Fortunately, re-enabling NMIs
>> there is harmless, since we've already switched off the NMI stack.
>>
>> This makes me think that maybe the logic should be turned around: have
>> some RIP ranges on which the kernel stack might be invalid (which
>> includes the espfix code and some of the syscall code) and use IRET
>> only on return from NMI, return to nonstandard CS, and return to these
>> special ranges. The NMI code just needs to never do any of this stuff
>> unless it switches off the NMI stack first.
>>
>> For this to work reliably, we'll probably have to change CS before
>> calling into EFI code. That should be straightforward.
>>
>
> I think you are onto something here.
>
> In particular, the key observation here is that inside the kernel, we
> can never *both* have an invalid stack *and* be inside an NMI, #MC or
> #DB handler, even if nested.

Except for espfix :)

>
> Now, does this prevent us from using RET in the common case? I'm not
> sure it is a huge loss since kernel-to-kernel is relatively rare.

I don't think so. The most common case should be plain old interrupts
and I suspect that #PF is a distant second.

In any event, plain old interrupts and #PF are non-IST interrupts and
they should be unconditionally safe for RET

--Andy
Re: [RFC] x86_64: A real proposal for iret-less return to kernel
On Wed, May 21, 2014 at 3:32 PM, Luck, Tony wrote:
>>> The recovery path has to do more than just send a signal - it needs to
>>> walk processes and "mm"s to see which have mapped the physical address
>>> that the h/w told us has gone bad.
>>
>> I still feel like I'm missing something. If we interrupted user space
>> code, then the context we're in should be identical to the context
>> we'll get when we're about to return to userspace.
>
> True. And this far along in do_machine_check() we have set all the other cpus
> free, so they are heading back to whatever context we interrupted them in. So
> we might be able to do all that other stuff inline here ... we interrupted
> user mode, so we know we don't hold any locks. Other cpus are running, so
> they can complete what they are doing to release any locks we might need.
>
> But it will take a while (to scan all those processes). And we haven't yet
> cleared MCG_STATUS ... so another machine check before we do that
> would be fatal (x86 doesn't allow nesting). Even if we moved the work
> after the clear of MCG_STATUS we'd still be vulnerable to a new machine
> check on x86_64 because we are sitting on the one & only machine check
> stack.

But if we get a new MCE in here, it will be an MCE from kernel context
and it's fatal. So, yes, we'll clobber the stack, but we'll never
return (unless tolerant is set to something insane), so who cares?

Anyway, I care less about this now that I don't have to worry about it
re: IRET :)

--Andy
Re: [RFC] x86_64: A real proposal for iret-less return to kernel
On 05/21/2014 11:11 AM, Andy Lutomirski wrote:
> On Tue, May 20, 2014 at 5:53 PM, Andy Lutomirski wrote:
>> Here's a real proposal for iret-less return. If this is correct, then
>> NMIs will never nest, which will probably delete a lot more scariness
>> than is added by the code I'm describing.
>
> OK, here's a case where I'm wrong. An NMI interrupts userspace on a
> 16-bit stack. The return from NMI goes through the espfix code.
> Something interrupts while on the espfix stack. Boom! Neither return
> style is particularly good.
>
> More generally, if we got interrupted while on the espfix stack, we
> need to return back there using IRET. Fortunately, re-enabling NMIs
> there is harmless, since we've already switched off the NMI stack.
>
> This makes me think that maybe the logic should be turned around: have
> some RIP ranges on which the kernel stack might be invalid (which
> includes the espfix code and some of the syscall code) and use IRET
> only on return from NMI, return to nonstandard CS, and return to these
> special ranges. The NMI code just needs to never do any of this stuff
> unless it switches off the NMI stack first.
>
> For this to work reliably, we'll probably have to change CS before
> calling into EFI code. That should be straightforward.
>

I think you are onto something here.

In particular, the key observation here is that inside the kernel, we
can never *both* have an invalid stack *and* be inside an NMI, #MC or
#DB handler, even if nested.

Now, does this prevent us from using RET in the common case? I'm not
sure it is a huge loss since kernel-to-kernel is relatively rare.

-hpa
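The "turned around" rule quoted above (use IRET only when returning from NMI, to a nonstandard CS, or into a RIP range where the kernel stack might be invalid, and RET otherwise) can be written down as a small decision function. This is only an illustration of the rule as stated in the thread, not code from any patch. The types, the function `choose_return()`, the selector value `SKETCH_KERNEL_CS` and any range contents are placeholders invented for the sketch.

```c
#include <stdbool.h>
#include <stdint.h>

enum ret_style { USE_RET, USE_IRET };

/* Half-open address range [start, end) where the stack may be invalid. */
struct rip_range {
    uint64_t start;
    uint64_t end;
};

/* What the exit path knows about the context it is returning to. */
struct return_ctx {
    bool from_nmi;   /* returning from NMI, still on the NMI stack */
    uint16_t cs;     /* saved code segment selector */
    uint64_t rip;    /* saved instruction pointer */
};

/* Placeholder for the normal kernel code segment selector. */
#define SKETCH_KERNEL_CS 0x10

enum ret_style choose_return(const struct return_ctx *c,
                             const struct rip_range *unsafe, int n)
{
    if (c->from_nmi)
        return USE_IRET;            /* IRET is what unblocks NMIs */
    if (c->cs != SKETCH_KERNEL_CS)
        return USE_IRET;            /* nonstandard CS, e.g. EFI code */

    for (int i = 0; i < n; i++) {
        if (c->rip >= unsafe[i].start && c->rip < unsafe[i].end)
            return USE_IRET;        /* espfix or syscall entry ranges */
    }
    return USE_RET;                 /* common kernel-to-kernel case */
}
```

In this model, plain interrupts and #PF that hit ordinary kernel code fall through to `USE_RET`, which matches the claim in the replies that the common case stays on the fast path.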
Re: [RFC] x86_64: A real proposal for iret-less return to kernel
On Wed, May 21, 2014 at 03:20:50PM -0700, Andy Lutomirski wrote:
> It's not even scheduled out, right?

Right.

> This should be just like a signal from a failed page fault, I think.

Right, but there is this additional work which needs to be done
(mce_notify_process()) before sending the signal. So you want to do this
after the MCE handler is done but before you return to the process.

> What switch_to?

Never mind, that was bollocks.

-- 
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--
Re: [RFC] x86_64: A real proposal for iret-less return to kernel
On Thu, May 22, 2014 at 7:25 AM, Andi Kleen wrote:
>
> I suspect the only case that's really interesting here is interrupting
> idle. Maybe it would be possible to do some fast path in this case only.

Hardware-interrupts during kernel are actually fairly common under
network-intensive loads, even outside of idle (but idle is admittedly
likely *the* most common one). Many network loads are fairly
kernel-intensive.

Also, from a kernel perspective, idle isn't really any different from
most other kernel code. Using "ret" to return to the idle handler
would be *more* of a special case than using "ret" to return to just
generic kernel context.

So I disagree vehemently. Do *not* special-case idle. It makes the
code more complex and less generic.

               Linus
RE: [RFC] x86_64: A real proposal for iret-less return to kernel
>> The recovery path has to do more than just send a signal - it needs to
>> walk processes and "mm"s to see which have mapped the physical address
>> that the h/w told us has gone bad.
>
> I still feel like I'm missing something. If we interrupted user space
> code, then the context we're in should be identical to the context
> we'll get when we're about to return to userspace.

True. And this far along in do_machine_check() we have set all the other cpus
free, so they are heading back to whatever context we interrupted them in. So
we might be able to do all that other stuff inline here ... we interrupted user
mode, so we know we don't hold any locks. Other cpus are running, so they can
complete what they are doing to release any locks we might need.

But it will take a while (to scan all those processes). And we haven't yet
cleared MCG_STATUS ... so another machine check before we do that
would be fatal (x86 doesn't allow nesting). Even if we moved the work
after the clear of MCG_STATUS we'd still be vulnerable to a new machine
check on x86_64 because we are sitting on the one & only machine check
stack.

-Tony
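The "another machine check before we clear MCG_STATUS would be fatal" point can be modeled with a toy state machine. Per the Intel SDM, a machine check that arrives while MCG_STATUS.MCIP is still set sends the processor into shutdown. The model below encodes just that rule. `struct fake_cpu`, the enum and both functions are hypothetical names, and the separate problem of the single machine check stack is not modeled at all.

```c
#include <stdint.h>

/* IA32_MCG_STATUS.MCIP, machine check in progress (name local to sketch). */
#define SKETCH_MCG_STATUS_MCIP (1ULL << 2)

enum mc_outcome { MC_HANDLER_ENTERED, MC_SHUTDOWN };

/* Minimal model of the per-CPU state that matters for nesting. */
struct fake_cpu {
    uint64_t mcg_status;
};

/* Hardware side: what happens when a machine check is signaled. */
enum mc_outcome deliver_machine_check(struct fake_cpu *cpu)
{
    if (cpu->mcg_status & SKETCH_MCG_STATUS_MCIP)
        return MC_SHUTDOWN;          /* nested #MC: no recovery */

    cpu->mcg_status |= SKETCH_MCG_STATUS_MCIP;
    return MC_HANDLER_ENTERED;
}

/* Software side: the handler writes zero to MCG_STATUS when done. */
void handler_clears_mcg_status(struct fake_cpu *cpu)
{
    cpu->mcg_status = 0;
}
```

The longer the recovery work runs before `handler_clears_mcg_status()`, the wider the window in which a second `deliver_machine_check()` returns `MC_SHUTDOWN`. That window is the cost of doing the process scan inline.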
Re: [RFC] x86_64: A real proposal for iret-less return to kernel
On Wed, May 21, 2014 at 3:25 PM, Andi Kleen wrote:
>
> Seems like a lot of effort and risk to essentially only optimize in kernel
> interrupt handlers.

The idea is that it might allow us to remove a bunch of scary nested
NMI code as well as speeding things up.

>
> AFAIK the most interesting cases (like user page faults) are not
> affected at all. Usually most workloads don't spend all that much time
> in the kernel, so it won't help most interrupts.
>
> I suspect the only case that's really interesting here is interrupting
> idle. Maybe it would be possible to do some fast path in this case only.
>
> However idle currently has so much overhead that I suspect that there
> are lower hanging fruit elsewhere.

I will gladly buy a meal or beverage for whomever fixes the ttwu stuff
to stop sending IPIs to idle CPUs, which will help a lot.

--Andy
Re: [RFC] x86_64: A real proposal for iret-less return to kernel
Seems like a lot of effort and risk to essentially only optimize in kernel
interrupt handlers.

AFAIK the most interesting cases (like user page faults) are not
affected at all. Usually most workloads don't spend all that much time
in the kernel, so it won't help most interrupts.

I suspect the only case that's really interesting here is interrupting
idle. Maybe it would be possible to do some fast path in this case only.

However idle currently has so much overhead that I suspect that there
are lower hanging fruit elsewhere.

-Andi

--
a...@linux.intel.com -- Speaking for myself only.
Re: [RFC] x86_64: A real proposal for iret-less return to kernel
On Wed, May 21, 2014 at 3:18 PM, Luck, Tony wrote:
>>> That TIF_MCE_NOTIFY prevents the return to user mode, and we end up in
>>> mce_notify_process().
>>
>> Why is this necessary?
>
> The recovery path has to do more than just send a signal - it needs to
> walk processes and "mm"s to see which have mapped the physical address
> that the h/w told us has gone bad.

I still feel like I'm missing something. If we interrupted user space
code, then the context we're in should be identical to the context
we'll get when we're about to return to userspace.

--Andy
Re: [RFC] x86_64: A real proposal for iret-less return to kernel
On Wed, May 21, 2014 at 03:13:54PM -0700, H. Peter Anvin wrote:
> Seems like a comment would be in order, though.

Sure, I'll do a nice one once this discussion quiets down. :-)

--
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--
RE: [RFC] x86_64: A real proposal for iret-less return to kernel
>> That TIF_MCE_NOTIFY prevents the return to user mode, and we end up in
>> mce_notify_process().
>
> Why is this necessary?

The recovery path has to do more than just send a signal - it needs to
walk processes and "mm"s to see which have mapped the physical address
that the h/w told us has gone bad.

-Tony
Re: [RFC] x86_64: A real proposal for iret-less return to kernel
On Wed, May 21, 2014 at 3:17 PM, Borislav Petkov wrote:
> On Wed, May 21, 2014 at 03:13:16PM -0700, Andy Lutomirski wrote:
>> Why is this necessary?
>>
>> If the MCE hit kernel code, then we're going to die anyway. If the
>> MCE hit user code, then we should be in a completely sensible context
>> and we can just send the signal.
>
> Are we guaranteed that the first thing the process will execute when
> scheduled back in are the signal handlers?

It's not even scheduled out, right? This should be just like a signal
from a failed page fault, I think.

>
> And besides, maybe we don't even want to allow to do the switch_to() but
> kill it while it is sleeping.

What switch_to?

--Andy
Re: [RFC] x86_64: A real proposal for iret-less return to kernel
On Wed, May 21, 2014 at 03:13:16PM -0700, Andy Lutomirski wrote:
> Why is this necessary?
>
> If the MCE hit kernel code, then we're going to die anyway. If the
> MCE hit user code, then we should be in a completely sensible context
> and we can just send the signal.

Are we guaranteed that the first thing the process will execute when
scheduled back in are the signal handlers?

And besides, maybe we don't even want to allow to do the switch_to() but
kill it while it is sleeping.

(I know, we're that nasty :-))

--
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--
Re: [RFC] x86_64: A real proposal for iret-less return to kernel
On 05/21/2014 03:11 PM, Borislav Petkov wrote:
> On Wed, May 21, 2014 at 03:00:18PM -0700, H. Peter Anvin wrote:
>> So this is not a problem then.
>
> Yeah, f'get it - it is all good at that front. :-)
>

Seems like a comment would be in order, though.

	-hpa
Re: [RFC] x86_64: A real proposal for iret-less return to kernel
On Wed, May 21, 2014 at 3:01 PM, Luck, Tony wrote:
>> But sending signals from #MC context is definitely a bad idea. I think
>> we had addressed this with irq_work at some point but my memory is very
>> hazy.
>
> We added code for recoverable errors to get out of the MC context
> before trying to lookup the page and send the signal. Bottom of
> do_machine_check():
>
>         if (cfg->tolerant < 3) {
>                 if (no_way_out)
>                         mce_panic("Fatal machine check on current CPU", &m, msg);
>                 if (worst == MCE_AR_SEVERITY) {
>                         /* schedule action before return to userland */
>                         mce_save_info(m.addr, m.mcgstatus & MCG_STATUS_RIPV);
>                         set_thread_flag(TIF_MCE_NOTIFY);
>                 } else if (kill_it) {
>                         force_sig(SIGBUS, current);
>                 }
>         }
>
> That TIF_MCE_NOTIFY prevents the return to user mode, and we end up in
> mce_notify_process().

Why is this necessary?

If the MCE hit kernel code, then we're going to die anyway. If the
MCE hit user code, then we should be in a completely sensible context
and we can just send the signal.

--Andy

>
> The "force_sig()" there is legacy code - and perhaps should just move
> off to mce_notify_process() too (need to save "worst" so it will know
> what to do).
>
> -Tony

--
Andy Lutomirski
AMA Capital Management, LLC
Re: [RFC] x86_64: A real proposal for iret-less return to kernel
On Wed, May 21, 2014 at 03:00:18PM -0700, H. Peter Anvin wrote:
> So this is not a problem then.

Yeah, f'get it - it is all good at that front. :-)

--
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--
RE: [RFC] x86_64: A real proposal for iret-less return to kernel
> But sending signals from #MC context is definitely a bad idea. I think
> we had addressed this with irq_work at some point but my memory is very
> hazy.

We added code for recoverable errors to get out of the MC context
before trying to lookup the page and send the signal. Bottom of
do_machine_check():

        if (cfg->tolerant < 3) {
                if (no_way_out)
                        mce_panic("Fatal machine check on current CPU", &m, msg);
                if (worst == MCE_AR_SEVERITY) {
                        /* schedule action before return to userland */
                        mce_save_info(m.addr, m.mcgstatus & MCG_STATUS_RIPV);
                        set_thread_flag(TIF_MCE_NOTIFY);
                } else if (kill_it) {
                        force_sig(SIGBUS, current);
                }
        }

That TIF_MCE_NOTIFY prevents the return to user mode, and we end up in
mce_notify_process().

The "force_sig()" there is legacy code - and perhaps should just move
off to mce_notify_process() too (need to save "worst" so it will know
what to do).

-Tony
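The "flag now, act on the way back to user mode" pattern Tony describes can be sketched in plain userspace C. This is only an illustration under stated assumptions: `toy_task`, `toy_mce_handler()` and `toy_exit_to_user()` are hypothetical names, the boolean stands in for TIF_MCE_NOTIFY, and the real mce_notify_process() does far more than bump a counter.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-task state, not the kernel's thread_info. */
struct toy_task {
	bool     mce_notify;   /* plays the role of TIF_MCE_NOTIFY */
	uint64_t bad_addr;     /* physical address reported by the h/w */
	bool     restartable;  /* saved copy of the RIPV bit */
	int      sigbus_sent;  /* counts signals, for illustration only */
};

/* Machine-check context: record the error and set the flag, nothing else. */
static void toy_mce_handler(struct toy_task *t, uint64_t addr, bool ripv)
{
	assert(t != NULL);
	t->bad_addr = addr;
	t->restartable = ripv;
	t->mce_notify = true;
}

/* Return-to-user path: ordinary process context, safe to take locks. */
static void toy_exit_to_user(struct toy_task *t)
{
	assert(t != NULL);
	if (!t->mce_notify)
		return;
	t->mce_notify = false;
	/* A real implementation would look up users of bad_addr here. */
	t->sigbus_sent++;
}
```

The point of the split is that nothing which can block or take a lock runs inside the handler itself; the expensive work is done once, later, in a context where it is safe.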
Re: [RFC] x86_64: A real proposal for iret-less return to kernel
On 05/21/2014 02:54 PM, Borislav Petkov wrote:
> On Wed, May 21, 2014 at 02:47:03PM -0700, Andy Lutomirski wrote:
>> Is there anything actually wrong with just panicking if
>> !user_mode_vm(regs)? That would make this a lot more sane.
>
> It does that already - mce_severity().
>

So this is not a problem then.

	-hpa
Re: [RFC] x86_64: A real proposal for iret-less return to kernel
On Wed, May 21, 2014 at 2:55 PM, Borislav Petkov wrote:
> On Wed, May 21, 2014 at 02:52:55PM -0700, Andy Lutomirski wrote:
>> Why is it a problem if user_mode_vm(regs)? Conversely, why is sending
>> a signal a remotely reasonable thing to do if !user_mode_vm(regs)?
>
> Let me quote Jiri:
>
> (1) task sends signal to itself
> (2) it acquires sighand->siglock so that it's able to queue the signal
> (3) MCE triggers

...and !user_mode_vm(regs), and hence we're IN_KERNEL, and we should
presumably just panic instead of trying to send a signal.

I missed the IN_KERNEL thing because I didn't realize that ->cs was
copied to struct mce.

--Andy
Re: [RFC] x86_64: A real proposal for iret-less return to kernel
On Wed, 21 May 2014, Borislav Petkov wrote:

> > Why is it a problem if user_mode_vm(regs)? Conversely, why is sending
> > a signal a remotely reasonable thing to do if !user_mode_vm(regs)?
>
> Let me quote Jiri:
>
> (1) task sends signal to itself
> (2) it acquires sighand->siglock so that it's able to queue the signal
> (3) MCE triggers
> (4) it tries to send a signal to the same task
> (5) it tries to acquire sighand->siglock and loops forever

Ah, alright, but due to what mce_severity() does, this can't happen,
because if the current CPU is in the kernel (which is obviously implied
by holding a spinlock), it never proceeds sending the signal, because
no_way_out gets set and mce_panic() invoked.

--
Jiri Kosina
SUSE Labs
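The rule the thread converges on here (an MCE that interrupted kernel code, or one without RIPV, panics; only a user-mode MCE leads to a signal) can be written down as a tiny decision function. This is a simplified sketch, not the actual mce_severity() table: `toy_mce`, `toy_from_user()` and `toy_classify()` are made-up names, and the user-mode test assumes the low two bits of the saved CS hold the privilege level.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum toy_outcome { TOY_PANIC, TOY_SIGNAL_TASK };

/* Hypothetical subset of what a machine-check record carries. */
struct toy_mce {
	uint16_t cs;    /* code segment that was interrupted */
	bool     ripv;  /* RIP valid: safe to return to the interrupted code */
};

/* Assumption: CPL lives in the low two bits of CS, and 3 means user mode. */
static bool toy_from_user(const struct toy_mce *m)
{
	assert(m != NULL);
	return (m->cs & 3) == 3;
}

static enum toy_outcome toy_classify(const struct toy_mce *m)
{
	if (!m->ripv)
		return TOY_PANIC;        /* nowhere sane to return to */
	if (!toy_from_user(m))
		return TOY_PANIC;        /* kernel may hold siglock: never signal */
	return TOY_SIGNAL_TASK;      /* user mode: no kernel locks are held */
}
```

Because a task that holds sighand->siglock is by definition executing kernel code, the second check is what rules out the self-deadlock in steps (1) to (5).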
Re: [RFC] x86_64: A real proposal for iret-less return to kernel
On Wed, May 21, 2014 at 02:52:55PM -0700, Andy Lutomirski wrote:
> Why is it a problem if user_mode_vm(regs)? Conversely, why is sending
> a signal a remotely reasonable thing to do if !user_mode_vm(regs)?

Let me quote Jiri:

(1) task sends signal to itself
(2) it acquires sighand->siglock so that it's able to queue the signal
(3) MCE triggers
(4) it tries to send a signal to the same task
(5) it tries to acquire sighand->siglock and loops forever

--
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--
Re: [RFC] x86_64: A real proposal for iret-less return to kernel
On Wed, May 21, 2014 at 02:47:03PM -0700, Andy Lutomirski wrote:
> Is there anything actually wrong with just panicking if
> !user_mode_vm(regs)? That would make this a lot more sane.

It does that already - mce_severity().

--
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--
Re: [RFC] x86_64: A real proposal for iret-less return to kernel
On Wed, May 21, 2014 at 2:48 PM, Borislav Petkov wrote:
> On Wed, May 21, 2014 at 02:35:59PM -0700, Andy Lutomirski wrote:
>> If RIPV is set but we interrupted *kernel* code, SIGBUS doesn't seem
>> like the right solution anyway.
>>
>> Are there any machine check exceptions for which it makes sense to
>> continue right where we left off without a signal? Is CMIC such a
>> beast? Can CMIC be delivered when interrupts are off?
>
> I think you mean CMCI and that's not even reported with a MCE exception
> - there's a separate APIC interrupt for that.
>
> I think this signal thing is for killing processes which have poisoned
> memory but this memory can contained within that process and the
> physical page frame can be poisoned so that it doesn't get used ever
> again. In any case, this is an example for an uncorrectable error which
> needs action from us but doesn't necessarily have to kill the whole
> machine.
>
> This is supposed to be more graceful instead of consuming the corrupted
> data and sending it out to disk.
>
> But sending signals from #MC context is definitely a bad idea. I think
> we had addressed this with irq_work at some point but my memory is very
> hazy.

Why is it a problem if user_mode_vm(regs)? Conversely, why is sending
a signal a remotely reasonable thing to do if !user_mode_vm(regs)?

--Andy
Re: [RFC] x86_64: A real proposal for iret-less return to kernel
On Wed, 21 May 2014, H. Peter Anvin wrote:

> > I certainly like this way of handling it. We can even issue a nice
> > banner saying something like "You're f*cked - go change hw."
>
> Actually, it would be a lot better to panic than deadlock (HA systems
> tend to have something in place to catch the panic and/or reboot). Any
> way we can see if the CPU is already holding that lock and panic in that
> case?

Well, spin_trylock() and then either spin_unlock() and proceed sending
the signal, otherwise panic().

--
Jiri Kosina
SUSE Labs
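The trylock-or-panic idea above can be modelled in userspace with a C11 `atomic_flag`. This is a minimal sketch, assuming a toy lock in place of the kernel's spinlock; `toy_lock`, `toy_trylock()` and `mce_pick_action()` are hypothetical names and none of this is kernel API.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a spinlock; not the kernel's spinlock_t. */
struct toy_lock {
	atomic_flag held;
};

enum mce_action { MCE_SEND_SIGNAL, MCE_PANIC };

static void toy_lock_init(struct toy_lock *l)
{
	atomic_flag_clear(&l->held);
}

/* Returns true if the lock was free and is now held by the caller. */
static bool toy_trylock(struct toy_lock *l)
{
	return !atomic_flag_test_and_set_explicit(&l->held,
						  memory_order_acquire);
}

static void toy_unlock(struct toy_lock *l)
{
	atomic_flag_clear_explicit(&l->held, memory_order_release);
}

/*
 * Probe the lock once, without spinning. If it is busy, give up and
 * report that the caller should panic rather than wait forever.
 */
static enum mce_action mce_pick_action(struct toy_lock *siglock)
{
	assert(siglock != NULL);
	if (!toy_trylock(siglock))
		return MCE_PANIC;
	toy_unlock(siglock);
	return MCE_SEND_SIGNAL;
}
```

One caveat of this approach: a failed trylock only says that somebody holds the lock, not that the current CPU does, so the sketch panics in some cases where waiting would have worked. It trades those cases for never deadlocking.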
Re: [RFC] x86_64: A real proposal for iret-less return to kernel
On Wed, May 21, 2014 at 02:35:59PM -0700, Andy Lutomirski wrote:
> If RIPV is set but we interrupted *kernel* code, SIGBUS doesn't seem
> like the right solution anyway.
>
> Are there any machine check exceptions for which it makes sense to
> continue right where we left off without a signal? Is CMIC such a
> beast? Can CMIC be delivered when interrupts are off?

I think you mean CMCI and that's not even reported with a MCE exception
- there's a separate APIC interrupt for that.

I think this signal thing is for killing processes which have poisoned
memory but this memory can be contained within that process and the
physical page frame can be poisoned so that it doesn't get used ever
again. In any case, this is an example for an uncorrectable error which
needs action from us but doesn't necessarily have to kill the whole
machine.

This is supposed to be more graceful instead of consuming the corrupted
data and sending it out to disk.

But sending signals from #MC context is definitely a bad idea. I think
we had addressed this with irq_work at some point but my memory is very
hazy.

@Tony: this is something we need to take a look at soonish.

--
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--
Re: [RFC] x86_64: A real proposal for iret-less return to kernel
On Wed, May 21, 2014 at 2:45 PM, H. Peter Anvin wrote:
> Adding Tony.
>
> On 05/21/2014 02:43 PM, Borislav Petkov wrote:
>> On Thu, May 22, 2014 at 06:37:26AM +0900, Linus Torvalds wrote:
>>> Seriously. If an NMI is interrupted by an MCE, you might as well
>>> consider the machine dead. Don't worry about it. We may or may not
>>> recover, but it is *not* our problem.
>>
>> I certainly like this way of handling it. We can even issue a nice
>> banner saying something like "You're f*cked - go change hw."
>>
>
> Actually, it would be a lot better to panic than deadlock (HA systems
> tend to have something in place to catch the panic and/or reboot). Any
> way we can see if the CPU is already holding that lock and panic in that
> case?
>

Is there anything actually wrong with just panicking if
!user_mode_vm(regs)? That would make this a lot more sane.

--Andy
Re: [RFC] x86_64: A real proposal for iret-less return to kernel
Adding Tony.

On 05/21/2014 02:43 PM, Borislav Petkov wrote:
> On Thu, May 22, 2014 at 06:37:26AM +0900, Linus Torvalds wrote:
>> Seriously. If an NMI is interrupted by an MCE, you might as well
>> consider the machine dead. Don't worry about it. We may or may not
>> recover, but it is *not* our problem.
>
> I certainly like this way of handling it. We can even issue a nice
> banner saying something like "You're f*cked - go change hw."
>

Actually, it would be a lot better to panic than deadlock (HA systems
tend to have something in place to catch the panic and/or reboot). Any
way we can see if the CPU is already holding that lock and panic in that
case?

	-hpa
Re: [RFC] x86_64: A real proposal for iret-less return to kernel
On Thu, May 22, 2014 at 06:37:26AM +0900, Linus Torvalds wrote:
> Seriously. If an NMI is interrupted by an MCE, you might as well
> consider the machine dead. Don't worry about it. We may or may not
> recover, but it is *not* our problem.

I certainly like this way of handling it. We can even issue a nice
banner saying something like "You're f*cked - go change hw."

:-)

--
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--
Re: [RFC] x86_64: A real proposal for iret-less return to kernel
On Thu, May 22, 2014 at 6:25 AM, Jiri Kosina wrote:
>
> Yeah, but it tries to send SIGBUS from MCE context. And if MCE triggered
> at the time the CPU was already holding sighand->siglock for that
> particular task, it'll deadlock against itself.

Don't worry too much about the MCE's. The hardware is f*cking broken,
and nobody sane ever thought that synchronous MCE's were a good idea.
Proof: look at Itanium.

The truly nonmaskable synchronous MCE's are a fatal error. It's that
simple. Anybody who thinks anything else is simply wrong, and has
probably talked to too many hardware engineers that don't actually
understand the bigger picture.

Sane hardware handles anything that *can* be handled in hardware, and
then reports (later) to software about the errors with a regular
non-critical MCE that doesn't punch through NMI or even regular
interrupt disabling.

So the true "MCE punches through even NMI protection" case is relegated
purely to the "hardware is broken and needs to be replaced" situation,
and our only worry as kernel people is to try to be as graceful as
possible about it - but that "as graceful as possible" does *not*
include bending over and worrying about random possible deadlocks or
other crazy situations. It's purely a "best effort" kind of thing where
we try to do whatever logging etc that is easy to do.

Seriously. If an NMI is interrupted by an MCE, you might as well
consider the machine dead. Don't worry about it. We may or may not
recover, but it is *not* our problem.

              Linus
Re: [RFC] x86_64: A real proposal for iret-less return to kernel
On Wed, May 21, 2014 at 2:25 PM, Jiri Kosina wrote:
> On Wed, 21 May 2014, Borislav Petkov wrote:
>
>> > ISTM the do_machine_check code ought to consider any kill-worthy MCE
>> > from kernel space to be non-recoverable, but I want to keep the scope
>> > of these patches under control.
>>
>> MCA has a bit called RIPV which, if set, signals that RIP is valid and
>> it is safe to return provided we've taken proper care of handling even
>> non-correctable errors (memory poisoning, etc).
>
> Yeah, but it tries to send SIGBUS from MCE context. And if MCE triggered
> at the time the CPU was already holding sighand->siglock for that
> particular task, it'll deadlock against itself.
>

If RIPV is set but we interrupted *kernel* code, SIGBUS doesn't seem
like the right solution anyway.

Are there any machine check exceptions for which it makes sense to
continue right where we left off without a signal? Is CMIC such a
beast? Can CMIC be delivered when interrupts are off?

--Andy
Re: [RFC] x86_64: A real proposal for iret-less return to kernel
On Wed, 21 May 2014, Borislav Petkov wrote:

> > ISTM the do_machine_check code ought to consider any kill-worthy MCE
> > from kernel space to be non-recoverable, but I want to keep the scope
> > of these patches under control.
>
> MCA has a bit called RIPV which, if set, signals that RIP is valid and
> it is safe to return provided we've taken proper care of handling even
> non-correctable errors (memory poisoning, etc).

Yeah, but it tries to send SIGBUS from MCE context. And if MCE triggered
at the time the CPU was already holding sighand->siglock for that
particular task, it'll deadlock against itself.

--
Jiri Kosina
SUSE Labs
Re: [RFC] x86_64: A real proposal for iret-less return to kernel
On Tue, May 20, 2014 at 5:53 PM, Andy Lutomirski wrote:
> Here's a real proposal for iret-less return. If this is correct, then
> NMIs will never nest, which will probably delete a lot more scariness
> than is added by the code I'm describing.

OK, here's a case where I'm wrong. An NMI interrupts userspace on a
16-bit stack. The return from NMI goes through the espfix code.
Something interrupts while on the espfix stack. Boom! Neither return
style is particularly good.

More generally, if we got interrupted while on the espfix stack, we
need to return back there using IRET. Fortunately, re-enabling NMIs
there is harmless, since we've already switched off the NMI stack.

This makes me think that maybe the logic should be turned around: have
some RIP ranges on which the kernel stack might be invalid (which
includes the espfix code and some of the syscall code) and use IRET
only on return from NMI, return to nonstandard CS, and return to these
special ranges. The NMI code just needs to never do any of this stuff
unless it switches off the NMI stack first.

For this to work reliably, we'll probably have to change CS before
calling into EFI code. That should be straightforward.

--Andy
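The "turned around" rule in the message above (use IRET only when returning from NMI, to a nonstandard CS, or into one of the special RIP ranges) reduces to a small predicate. The following is a sketch under stated assumptions: `toy_frame`, `rip_range`, `toy_needs_iret()` and the `TOY_KERNEL_CS` value are placeholders invented for illustration, and the actual list of ranges would have to come from the entry code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Placeholder selector value standing in for the normal kernel CS. */
#define TOY_KERNEL_CS 0x10

/* Half-open address range [start, end). */
struct rip_range {
	uint64_t start;
	uint64_t end;
};

/* Hypothetical view of the state being returned to. */
struct toy_frame {
	uint64_t rip;       /* address execution resumes at */
	uint16_t cs;        /* code segment being returned to */
	bool     from_nmi;  /* true when leaving the NMI handler */
};

/* Returns true when the slow IRET path is required. */
static bool toy_needs_iret(const struct toy_frame *f,
			   const struct rip_range *special, size_t n)
{
	assert(f != NULL);
	assert(special != NULL || n == 0);

	if (f->from_nmi)
		return true;             /* IRET is what re-enables NMIs */
	if (f->cs != TOY_KERNEL_CS)
		return true;             /* nonstandard CS */
	for (size_t i = 0; i < n; i++)
		if (f->rip >= special[i].start && f->rip < special[i].end)
			return true;     /* stack may be invalid here */
	return false;                    /* plain "ret"-style return is fine */
}
```

Every other return to kernel context falls through to the fast path, which is the case the proposal is trying to make cheap.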
Re: [RFC] x86_64: A real proposal for iret-less return to kernel
On Wed, May 21, 2014 at 10:52:01AM -0700, Andy Lutomirski wrote:
> I think that some of these exceptions are synchronous things (e.g.
> int3 or page faults) that happen because the kernel caused them.
>
> Anyway, going through the list:
>
> Reset, INIT, and stpclk ought to be irrelevant -- we don't handle them anyway.

Yeah.

> SMI is already supposedly correct wrt nesting inside NMI.

It better be. :)

> Debug register stuff should be handled in my outline. Hopefully
> correctly :) We need to make sure that no breakpoints trip before the
> nmi count is incremented, but that should be straightforward as long
> as we don't do ridiculous things like poking at userspace addresses.
> I don't know how kgdb/kdb fits in -- if someone sets a watchpoint on a
> kernel address (e.g. the nesting count) or enables single-stepping,
> we'll mess up.
>
>
> It may pay to bump the nesting count inside the #DB and #BP handlers
> and to check the RIP that we're returning to,

Right, at a first glance, all those higher prio exceptions' nesting
count could be nicely dealt with in those paranoidzeroentry* macros.

> but that starts to look ugly, and we have to be careful about NMI,
> immediate breakpoint, and then immediate MCE.

Btw, hpa just confirmed that exceptions are never deferred and thus can
happen while the NMI handler runs. Which means, we should defensively
prepare for NMI handlers being interrupted at any point.

> I'd rather just be able to say that there are some very short windows
> in which a debug or breakpoint exception will never happen.

Sounds perfectly fine to me.

--
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--
Re: [RFC] x86_64: A real proposal for iret-less return to kernel
On Wed, May 21, 2014 at 9:30 AM, Borislav Petkov wrote:
> On Wed, May 21, 2014 at 08:21:08AM -0700, Andy Lutomirski wrote:
>> On May 21, 2014 2:46 AM, "Borislav Petkov" wrote:
>> >
>> > On Tue, May 20, 2014 at 07:39:31PM -0700, Andy Lutomirski wrote:
>> > > So the issue here is that we can have an NMI followed immediately by
>> > > an MCE.
>> >
>> > That part might need clarification for me: #MC is higher prio interrupt
>> > than NMI so a machine check exception can interrupt the NMI handler at
>> > any point.
>>
>> Except that NMI can interrupt #MC at any point as well, I think.
>
> No, #MC is higher prio than NMI, actually even the highest along with
> RESET#. And come to think of it, all exceptions which have a higher prio
> than NMI should touch that nmi_mce_nest_count thing.
>
> See Table 8-8 here:
>
> http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2012/10/24593_APM_v21.pdf
>
> That's the prios before 3, i.e. the NMI one.
>
> HOWEVER, this all is spoken with the assumption that higher prio
> interrupts can interrupt the NMI handler too at the first instruction
> boundary they've been recognized.
>
> The text is talking about simultaneous interrupts and not about
> interrupt handler preemption.
>
> But it must be because Steve wouldn't be dealing with exceptions in the
> NMI handler and nested NMIs otherwise...

I think that some of these exceptions are synchronous things (e.g.
int3 or page faults) that happen because the kernel caused them.

Anyway, going through the list:

Reset, INIT, and stpclk ought to be irrelevant -- we don't handle them
anyway.

SMI is already supposedly correct wrt nesting inside NMI.

Debug register stuff should be handled in my outline. Hopefully
correctly :) We need to make sure that no breakpoints trip before the
nmi count is incremented, but that should be straightforward as long
as we don't do ridiculous things like poking at userspace addresses.
I don't know how kgdb/kdb fits in -- if someone sets a watchpoint on a
kernel address (e.g. the nesting count) or enables single-stepping,
we'll mess up.

It may pay to bump the nesting count inside the #DB and #BP handlers
and to check the RIP that we're returning to, but that starts to look
ugly, and we have to be careful about NMI, immediate breakpoint, and
then immediate MCE. I'd rather just be able to say that there are some
very short windows in which a debug or breakpoint exception will never
happen.

--Andy
Re: [RFC] x86_64: A real proposal for iret-less return to kernel
On Wed, May 21, 2014 at 08:21:57AM -0700, Andy Lutomirski wrote:
> ISTM the do_machine_check code ought to consider any kill-worthy MCE
> from kernel space to be non-recoverable, but I want to keep the scope
> of these patches under control.

MCA has a bit called RIPV which, if set, signals that RIP is valid and
it is safe to return provided we've taken proper care of handling even
non-correctable errors (memory poisoning, etc).

If RIPV is not set, we panic anyway.

> That being said, if an MCE that came from CPL0 never tried to return,
> this would be simpler. I don't know enough about the machine check
> architecture to know whether that's a reasonable thing to do.

Yeah, there are cases where MCE can return, see above.

--
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--
Re: [RFC] x86_64: A real proposal for iret-less return to kernel
On Wed, May 21, 2014 at 08:21:08AM -0700, Andy Lutomirski wrote:
> On May 21, 2014 2:46 AM, "Borislav Petkov" wrote:
> >
> > On Tue, May 20, 2014 at 07:39:31PM -0700, Andy Lutomirski wrote:
> > > So the issue here is that we can have an NMI followed immediately by
> > > an MCE.
> >
> > That part might need clarification for me: #MC is higher prio interrupt
> > than NMI so a machine check exception can interrupt the NMI handler at
> > any point.
>
> Except that NMI can interrupt #MC at any point as well, I think.

No, #MC is higher prio than NMI, actually even the highest along with
RESET#. And come to think of it, all exceptions which have a higher prio
than NMI should touch that nmi_mce_nest_count thing.

See Table 8-8 here:

http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2012/10/24593_APM_v21.pdf

That's the prios before 3, i.e. the NMI one.

HOWEVER, this all is spoken with the assumption that higher prio
interrupts can interrupt the NMI handler too at the first instruction
boundary they've been recognized.

The text is talking about simultaneous interrupts and not about
interrupt handler preemption.

But it must be because Steve wouldn't be dealing with exceptions in the
NMI handler and nested NMIs otherwise...

Hmmm.

--
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--
Re: [RFC] x86_64: A real proposal for iret-less return to kernel
On May 21, 2014 5:51 AM, "Jiri Kosina" wrote:
>
> On Tue, 20 May 2014, Andy Lutomirski wrote:
>
> > So the issue here is that we can have an NMI followed immediately by
> > an MCE. The MCE code can call force_sig
>
> This is interesting by itself. force_sig() takes siglock spinlock. This
> really looks like a deadlock sitting there waiting to happen.

ISTM the do_machine_check code ought to consider any kill-worthy MCE from kernel space to be non-recoverable, but I want to keep the scope of these patches under control.

That being said, if an MCE that came from CPL0 never tried to return, this would be simpler. I don't know enough about the machine check architecture to know whether that's a reasonable thing to do.

--Andy

>
> --
> Jiri Kosina
> SUSE Labs
>
Re: [RFC] x86_64: A real proposal for iret-less return to kernel
On May 21, 2014 2:46 AM, "Borislav Petkov" wrote:
>
> On Tue, May 20, 2014 at 07:39:31PM -0700, Andy Lutomirski wrote:
> > So the issue here is that we can have an NMI followed immediately by
> > an MCE.
>
> That part might need clarification for me: #MC is higher prio interrupt
> than NMI so a machine check exception can interrupt the NMI handler at
> any point.

Except that NMI can interrupt #MC at any point as well, I think.

>
> But you're talking only about the small window when nmi_mce_nest_count
> hasn't been incremented yet, right? I.e., this:
>
> "The result is that the only interrupt that can happen with
> `nmi_mce_nest_count == 0` in NMI context is an MCE at the beginning or
> end of the NMI handler."
>
> Correct?

Exactly.

>
> --
> Regards/Gruss,
> Boris.
>
> Sent from a fat crate under my desk. Formatting is fine.
> --
Re: [RFC] x86_64: A real proposal for iret-less return to kernel
On Tue, 20 May 2014, Andy Lutomirski wrote:

> So the issue here is that we can have an NMI followed immediately by
> an MCE. The MCE code can call force_sig

This is interesting by itself. force_sig() takes siglock spinlock. This really looks like a deadlock sitting there waiting to happen.

--
Jiri Kosina
SUSE Labs
Re: [RFC] x86_64: A real proposal for iret-less return to kernel
On Tue, May 20, 2014 at 07:39:31PM -0700, Andy Lutomirski wrote:
> So the issue here is that we can have an NMI followed immediately by
> an MCE.

That part might need clarification for me: #MC is higher prio interrupt than NMI so a machine check exception can interrupt the NMI handler at any point.

But you're talking only about the small window when nmi_mce_nest_count hasn't been incremented yet, right? I.e., this:

"The result is that the only interrupt that can happen with `nmi_mce_nest_count == 0` in NMI context is an MCE at the beginning or end of the NMI handler."

Correct?

--
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--
Re: [RFC] x86_64: A real proposal for iret-less return to kernel
On Wed, May 21, 2014 at 9:30 AM, Borislav Petkov b...@alien8.de wrote:
> On Wed, May 21, 2014 at 08:21:08AM -0700, Andy Lutomirski wrote:
> > On May 21, 2014 2:46 AM, Borislav Petkov b...@alien8.de wrote:
> > >
> > > On Tue, May 20, 2014 at 07:39:31PM -0700, Andy Lutomirski wrote:
> > > > So the issue here is that we can have an NMI followed immediately by an MCE.
> > >
> > > That part might need clarification for me: #MC is higher prio interrupt than NMI so a machine check exception can interrupt the NMI handler at any point.
> >
> > Except that NMI can interrupt #MC at any point as well, I think.
>
> No, #MC is higher prio than NMI, actually even the highest along with RESET#. And come to think of it, all exceptions which have a higher prio than NMI should touch that nmi_mce_nest_count thing. See Table 8-8 here:
>
> http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2012/10/24593_APM_v21.pdf
>
> That's the prios before 3, i.e. the NMI one.
>
> HOWEVER, this all is spoken with the assumption that higher prio interrupts can interrupt the NMI handler too at the first instruction boundary they've been recognized. The text is talking about simultaneous interrupts and not about interrupt handler preemption. But it must be because Steve wouldn't be dealing with exceptions in the NMI handler and nested NMIs otherwise...

I think that some of these exceptions are synchronous things (e.g. int3 or page faults) that happen because the kernel caused them.

Anyway, going through the list:

Reset, INIT, and stpclk ought to be irrelevant -- we don't handle them anyway.

SMI is already supposedly correct wrt nesting inside NMI.

Debug register stuff should be handled in my outline. Hopefully correctly :) We need to make sure that no breakpoints trip before the nmi count is incremented, but that should be straightforward as long as we don't do ridiculous things like poking at userspace addresses. I don't know how kgdb/kdb fits in -- if someone sets a watchpoint on a kernel address (e.g. the nesting count) or enables single-stepping, we'll mess up.

It may pay to bump the nesting count inside the #DB and #BP handlers and to check the RIP that we're returning to, but that starts to look ugly, and we have to be careful about NMI, immediate breakpoint, and then immediate MCE. I'd rather just be able to say that there are some very short windows in which a debug or breakpoint exception will never happen.

--Andy
Re: [RFC] x86_64: A real proposal for iret-less return to kernel
On Wed, May 21, 2014 at 10:52:01AM -0700, Andy Lutomirski wrote:
> I think that some of these exceptions are synchronous things (e.g. int3 or page faults) that happen because the kernel caused them.
>
> Anyway, going through the list:
>
> Reset, INIT, and stpclk ought to be irrelevant -- we don't handle them anyway.

Yeah.

> SMI is already supposedly correct wrt nesting inside NMI.

It better be. :)

> Debug register stuff should be handled in my outline. Hopefully correctly :) We need to make sure that no breakpoints trip before the nmi count is incremented, but that should be straightforward as long as we don't do ridiculous things like poking at userspace addresses. I don't know how kgdb/kdb fits in -- if someone sets a watchpoint on a kernel address (e.g. the nesting count) or enables single-stepping, we'll mess up.
>
> It may pay to bump the nesting count inside the #DB and #BP handlers and to check the RIP that we're returning to,

Right, at a first glance, all those higher prio exceptions' nesting count could be nicely dealt with in those paranoidzeroentry* macros.

> but that starts to look ugly, and we have to be careful about NMI, immediate breakpoint, and then immediate MCE.

Btw, hpa just confirmed that exceptions are never deferred and thus can happen while the NMI handler runs. Which means, we should defensively prepare for NMI handlers being interrupted at any point.

> I'd rather just be able to say that there are some very short windows in which a debug or breakpoint exception will never happen.

Sounds perfectly fine to me.

--
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--
Re: [RFC] x86_64: A real proposal for iret-less return to kernel
On Tue, May 20, 2014 at 5:53 PM, Andy Lutomirski l...@amacapital.net wrote:
> Here's a real proposal for iret-less return. If this is correct, then NMIs will never nest, which will probably delete a lot more scariness than is added by the code I'm describing.

OK, here's a case where I'm wrong.

An NMI interrupts userspace on a 16-bit stack. The return from NMI goes through the espfix code. Something interrupts while on the espfix stack. Boom! Neither return style is particularly good.

More generally, if we got interrupted while on the espfix stack, we need to return back there using IRET. Fortunately, re-enabling NMIs there is harmless, since we've already switched off the NMI stack.

This makes me think that maybe the logic should be turned around: have some RIP ranges on which the kernel stack might be invalid (which includes the espfix code and some of the syscall code) and use IRET only on return from NMI, return to nonstandard CS, and return to these special ranges. The NMI code just needs to never do any of this stuff unless it switches off the NMI stack first.

For this to work reliably, we'll probably have to change CS before calling into EFI code. That should be straightforward.

--Andy
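The "turned around" rule in the message above (IRET only on return from NMI, on return to a nonstandard CS, or on return into a special RIP range) can be sketched as a small predicate. This is a user-space illustration only: every name here (rip_range, ret_frame, must_use_iret, KERNEL_CS_SKETCH) and the selector value are hypothetical, not taken from the patches under discussion.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* All names and values below are hypothetical, for illustration only. */
struct rip_range {
	uint64_t start;	/* inclusive */
	uint64_t end;	/* exclusive */
};

struct ret_frame {
	uint64_t rip;
	uint16_t cs;
	bool from_nmi;	/* true when returning from the NMI handler itself */
};

#define KERNEL_CS_SKETCH 0x10	/* placeholder for the standard kernel CS */

static bool rip_in_ranges(uint64_t rip, const struct rip_range *r, size_t n)
{
	for (size_t i = 0; i < n; i++)
		if (rip >= r[i].start && rip < r[i].end)
			return true;
	return false;
}

/* IRET only for: return from NMI, nonstandard CS, or a special RIP range. */
static bool must_use_iret(const struct ret_frame *f,
			  const struct rip_range *special, size_t n)
{
	return f->from_nmi ||
	       f->cs != KERNEL_CS_SKETCH ||
	       rip_in_ranges(f->rip, special, n);
}
```

In this sketch the special ranges would cover code such as the espfix path, where the kernel stack might be invalid; every other return is free to skip IRET.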
Re: [RFC] x86_64: A real proposal for iret-less return to kernel
On Wed, 21 May 2014, Borislav Petkov wrote:

> > ISTM the do_machine_check code ought to consider any kill-worthy MCE from kernel space to be non-recoverable, but I want to keep the scope of these patches under control.
>
> MCA has a bit called RIPV which, if set, signals that RIP is valid and it is safe to return provided we've taken proper care of handling even non-correctable errors (memory poisoning, etc).

Yeah, but it tries to send SIGBUS from MCE context. And if MCE triggered at the time the CPU was already holding sighand->siglock for that particular task, it'll deadlock against itself.

--
Jiri Kosina
SUSE Labs
Re: [RFC] x86_64: A real proposal for iret-less return to kernel
On Wed, May 21, 2014 at 2:25 PM, Jiri Kosina jkos...@suse.cz wrote:
> On Wed, 21 May 2014, Borislav Petkov wrote:
>
> > > ISTM the do_machine_check code ought to consider any kill-worthy MCE from kernel space to be non-recoverable, but I want to keep the scope of these patches under control.
> >
> > MCA has a bit called RIPV which, if set, signals that RIP is valid and it is safe to return provided we've taken proper care of handling even non-correctable errors (memory poisoning, etc).
>
> Yeah, but it tries to send SIGBUS from MCE context. And if MCE triggered at the time the CPU was already holding sighand->siglock for that particular task, it'll deadlock against itself.

If RIPV is set but we interrupted *kernel* code, SIGBUS doesn't seem like the right solution anyway.

Are there any machine check exceptions for which it makes sense to continue right where we left off without a signal? Is CMIC such a beast? Can CMIC be delivered when interrupts are off?

--Andy
Re: [RFC] x86_64: A real proposal for iret-less return to kernel
On Thu, May 22, 2014 at 6:25 AM, Jiri Kosina jkos...@suse.cz wrote:
>
> Yeah, but it tries to send SIGBUS from MCE context. And if MCE triggered at the time the CPU was already holding sighand->siglock for that particular task, it'll deadlock against itself.

Don't worry too much about the MCE's. The hardware is f*cking broken, and nobody sane ever thought that synchronous MCE's were a good idea. Proof: look at Itanium.

The truly nonmaskable synchronous MCE's are a fatal error. It's that simple. Anybody who thinks anything else is simply wrong, and has probably talked to too many hardware engineers that don't actually understand the bigger picture.

Sane hardware handles anything that *can* be handled in hardware, and then reports (later) to software about the errors with a regular non-critical MCE that doesn't punch through NMI or even regular interrupt disabling.

So the true "MCE punches through even NMI protection" case is relegated purely to the "hardware is broken and needs to be replaced" situation, and our only worry as kernel people is to try to be as graceful as possible about it - but that "as graceful as possible" does *not* include bending over and worrying about random possible deadlocks or other crazy situations. It's purely a "best effort" kind of thing where we try to do whatever logging etc that is easy to do.

Seriously. If an NMI is interrupted by an MCE, you might as well consider the machine dead. Don't worry about it. We may or may not recover, but it is *not* our problem.

Linus
Re: [RFC] x86_64: A real proposal for iret-less return to kernel
On Thu, May 22, 2014 at 06:37:26AM +0900, Linus Torvalds wrote:
> Seriously. If an NMI is interrupted by an MCE, you might as well consider the machine dead. Don't worry about it. We may or may not recover, but it is *not* our problem.

I certainly like this way of handling it. We can even issue a nice banner saying something like

"You're f*cked - go change hw."

:-)

--
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--
Re: [RFC] x86_64: A real proposal for iret-less return to kernel
Adding Tony.

On 05/21/2014 02:43 PM, Borislav Petkov wrote:
> On Thu, May 22, 2014 at 06:37:26AM +0900, Linus Torvalds wrote:
> > Seriously. If an NMI is interrupted by an MCE, you might as well consider the machine dead. Don't worry about it. We may or may not recover, but it is *not* our problem.
>
> I certainly like this way of handling it. We can even issue a nice banner saying something like
>
> "You're f*cked - go change hw."

Actually, it would be a lot better to panic than deadlock (HA systems tend to have something in place to catch the panic and/or reboot). Any way we can see if the CPU is already holding that lock and panic in that case?

-hpa
Re: [RFC] x86_64: A real proposal for iret-less return to kernel
On Wed, May 21, 2014 at 2:45 PM, H. Peter Anvin h...@zytor.com wrote:
> Adding Tony.
>
> On 05/21/2014 02:43 PM, Borislav Petkov wrote:
> > On Thu, May 22, 2014 at 06:37:26AM +0900, Linus Torvalds wrote:
> > > Seriously. If an NMI is interrupted by an MCE, you might as well consider the machine dead. Don't worry about it. We may or may not recover, but it is *not* our problem.
> >
> > I certainly like this way of handling it. We can even issue a nice banner saying something like
> >
> > "You're f*cked - go change hw."
>
> Actually, it would be a lot better to panic than deadlock (HA systems tend to have something in place to catch the panic and/or reboot). Any way we can see if the CPU is already holding that lock and panic in that case?

Is there anything actually wrong with just panicking if !user_mode_vm(regs)? That would make this a lot more sane.

--Andy
Re: [RFC] x86_64: A real proposal for iret-less return to kernel
On Wed, May 21, 2014 at 02:35:59PM -0700, Andy Lutomirski wrote:
> If RIPV is set but we interrupted *kernel* code, SIGBUS doesn't seem like the right solution anyway.
>
> Are there any machine check exceptions for which it makes sense to continue right where we left off without a signal? Is CMIC such a beast? Can CMIC be delivered when interrupts are off?

I think you mean CMCI and that's not even reported with a MCE exception - there's a separate APIC interrupt for that.

I think this signal thing is for killing processes which have poisoned memory but this memory can be contained within that process and the physical page frame can be poisoned so that it doesn't get used ever again. In any case, this is an example for an uncorrectable error which needs action from us but doesn't necessarily have to kill the whole machine.

This is supposed to be more graceful instead of consuming the corrupted data and sending it out to disk.

But sending signals from #MC context is definitely a bad idea. I think we had addressed this with irq_work at some point but my memory is very hazy.

@Tony: this is something we need to take a look at soonish.

--
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--
Re: [RFC] x86_64: A real proposal for iret-less return to kernel
On Wed, 21 May 2014, H. Peter Anvin wrote:

> > I certainly like this way of handling it. We can even issue a nice banner saying something like
> >
> > "You're f*cked - go change hw."
>
> Actually, it would be a lot better to panic than deadlock (HA systems tend to have something in place to catch the panic and/or reboot). Any way we can see if the CPU is already holding that lock and panic in that case?

Well, spin_trylock() and then either spin_unlock() and proceed sending the signal, otherwise panic().

--
Jiri Kosina
SUSE Labs
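The trylock-then-panic idea from the message above can be sketched in user space with a C11 atomic_flag standing in for the spinlock. This is only an illustration of the control flow: toy_lock, toy_trylock(), toy_unlock(), and mce_probe_siglock() are hypothetical names for this example, not the kernel's spin_trylock() or anything in the MCE code.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Toy spinlock; stands in for the task's siglock in this sketch. */
struct toy_lock {
	atomic_flag held;
};

static bool toy_trylock(struct toy_lock *l)
{
	/* test_and_set returns the previous value: false means we took it */
	return !atomic_flag_test_and_set_explicit(&l->held, memory_order_acquire);
}

static void toy_unlock(struct toy_lock *l)
{
	atomic_flag_clear_explicit(&l->held, memory_order_release);
}

enum mce_action { MCE_SEND_SIGNAL, MCE_PANIC };

/* Probe the lock: if it is free, sending the signal cannot self-deadlock. */
static enum mce_action mce_probe_siglock(struct toy_lock *siglock)
{
	if (!toy_trylock(siglock))
		return MCE_PANIC;	/* already held, possibly by this very CPU */
	toy_unlock(siglock);
	return MCE_SEND_SIGNAL;
}
```

One limitation of the probe as sketched: a failed trylock cannot tell "this CPU already holds the lock" apart from "another CPU holds it briefly", so it would panic in both cases.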
Re: [RFC] x86_64: A real proposal for iret-less return to kernel
On Wed, May 21, 2014 at 2:48 PM, Borislav Petkov b...@alien8.de wrote:
> On Wed, May 21, 2014 at 02:35:59PM -0700, Andy Lutomirski wrote:
> > If RIPV is set but we interrupted *kernel* code, SIGBUS doesn't seem like the right solution anyway.
> >
> > Are there any machine check exceptions for which it makes sense to continue right where we left off without a signal? Is CMIC such a beast? Can CMIC be delivered when interrupts are off?
>
> I think you mean CMCI and that's not even reported with a MCE exception - there's a separate APIC interrupt for that.
>
> I think this signal thing is for killing processes which have poisoned memory but this memory can be contained within that process and the physical page frame can be poisoned so that it doesn't get used ever again. In any case, this is an example for an uncorrectable error which needs action from us but doesn't necessarily have to kill the whole machine.
>
> This is supposed to be more graceful instead of consuming the corrupted data and sending it out to disk.
>
> But sending signals from #MC context is definitely a bad idea. I think we had addressed this with irq_work at some point but my memory is very hazy.

Why is it a problem if user_mode_vm(regs)? Conversely, why is sending a signal a remotely reasonable thing to do if !user_mode_vm(regs)?

--Andy
Re: [RFC] x86_64: A real proposal for iret-less return to kernel
On Wed, May 21, 2014 at 02:47:03PM -0700, Andy Lutomirski wrote:
> Is there anything actually wrong with just panicking if !user_mode_vm(regs)? That would make this a lot more sane.

It does that already - mce_severity().

--
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--
Re: [RFC] x86_64: A real proposal for iret-less return to kernel
On Wed, May 21, 2014 at 02:52:55PM -0700, Andy Lutomirski wrote:
> Why is it a problem if user_mode_vm(regs)? Conversely, why is sending a signal a remotely reasonable thing to do if !user_mode_vm(regs)?

Let me quote Jiri:

(1) task sends signal to itself
(2) it acquires sighand->siglock so that it's able to queue the signal
(3) MCE triggers
(4) it tries to send a signal to the same task
(5) it tries to acquire sighand->siglock and loops forever

--
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--
Re: [RFC] x86_64: A real proposal for iret-less return to kernel
On Wed, 21 May 2014, Borislav Petkov wrote:

> > Why is it a problem if user_mode_vm(regs)? Conversely, why is sending a signal a remotely reasonable thing to do if !user_mode_vm(regs)?
>
> Let me quote Jiri:
>
> (1) task sends signal to itself
> (2) it acquires sighand->siglock so that it's able to queue the signal
> (3) MCE triggers
> (4) it tries to send a signal to the same task
> (5) it tries to acquire sighand->siglock and loops forever

Ah, alright, but due to what mce_severity() does, this can't happen, because if the current CPU is in the kernel (which is obviously implied by holding a spinlock), it never proceeds sending the signal, because no_way_out gets set and mce_panic() invoked.

--
Jiri Kosina
SUSE Labs
Re: [RFC] x86_64: A real proposal for iret-less return to kernel
On Wed, May 21, 2014 at 2:55 PM, Borislav Petkov b...@alien8.de wrote:
> On Wed, May 21, 2014 at 02:52:55PM -0700, Andy Lutomirski wrote:
> > Why is it a problem if user_mode_vm(regs)? Conversely, why is sending a signal a remotely reasonable thing to do if !user_mode_vm(regs)?
>
> Let me quote Jiri:
>
> (1) task sends signal to itself
> (2) it acquires sighand->siglock so that it's able to queue the signal
> (3) MCE triggers

...and !user_mode_vm(regs), and hence we're IN_KERNEL, and we should presumably just panic instead of trying to send a signal.

I missed the IN_KERNEL thing because I didn't realize that ->cs was copied to struct mce.

--Andy
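The point about the saved CS deciding between "panic" and "signal" can be illustrated with a small user-space sketch. It assumes the usual x86 convention that the low two bits of a code segment selector give the privilege level; the enum and function names below are hypothetical, and the real severity grading considers much more than this one check.

```c
#include <stdint.h>

/* Hypothetical names; a deliberately simplified view of the decision. */
enum mce_ctx { CTX_USER, CTX_KERNEL };
enum mce_outcome { OUTCOME_PANIC, OUTCOME_SIGNAL_TASK };

/* Low two bits of the saved CS selector hold the privilege level. */
static enum mce_ctx ctx_from_cs(uint16_t cs)
{
	return (cs & 3) == 3 ? CTX_USER : CTX_KERNEL;
}

/*
 * For a kill-worthy error: if kernel code was interrupted, panic rather
 * than signal, so the siglock self-deadlock path is never reached.
 */
static enum mce_outcome classify_kill_worthy(uint16_t saved_cs)
{
	if (ctx_from_cs(saved_cs) == CTX_KERNEL)
		return OUTCOME_PANIC;
	return OUTCOME_SIGNAL_TASK;
}
```

Since a task holding a spinlock is by definition running in the kernel, the deadlock sequence quoted earlier in the thread always lands on the panic branch in this sketch.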
Re: [RFC] x86_64: A real proposal for iret-less return to kernel
On 05/21/2014 02:54 PM, Borislav Petkov wrote:
> On Wed, May 21, 2014 at 02:47:03PM -0700, Andy Lutomirski wrote:
> > Is there anything actually wrong with just panicking if !user_mode_vm(regs)? That would make this a lot more sane.
>
> It does that already - mce_severity().

So this is not a problem then.

-hpa
RE: [RFC] x86_64: A real proposal for iret-less return to kernel
> But sending signals from #MC context is definitely a bad idea. I think we had addressed this with irq_work at some point but my memory is very hazy.

We added code for recoverable errors to get out of the MC context before trying to lookup the page and send the signal. Bottom of do_machine_check():

	if (cfg->tolerant < 3) {
		if (no_way_out)
			mce_panic("Fatal machine check on current CPU", &m, msg);
		if (worst == MCE_AR_SEVERITY) {
			/* schedule action before return to userland */
			mce_save_info(m.addr, m.mcgstatus & MCG_STATUS_RIPV);
			set_thread_flag(TIF_MCE_NOTIFY);
		} else if (kill_it) {
			force_sig(SIGBUS, current);
		}
	}

That TIF_MCE_NOTIFY prevents the return to user mode, and we end up in mce_notify_process().

The force_sig() there is legacy code - and perhaps should just move off to mce_notify_process() too (need to save "worst" so it will know what to do).

-Tony
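The deferral pattern described in the message above (record the error in #MC context, set a flag, and do the real work on the way back to user mode) can be sketched in user space as follows. This is a toy model under stated assumptions: task_sketch, mc_context_record(), and return_to_user_work() are hypothetical names, and the counter merely stands in for the page lookup and SIGBUS delivery that the kernel would do.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-task state for this sketch. */
struct task_sketch {
	bool mce_notify;	/* stands in for TIF_MCE_NOTIFY */
	uint64_t bad_addr;	/* address saved from the machine check */
	bool restartable;	/* saved RIPV state */
	int signals_sent;	/* placeholder for "signal delivered" */
};

/* Exception context: only record what happened, take no locks. */
static void mc_context_record(struct task_sketch *t, uint64_t addr, bool ripv)
{
	t->bad_addr = addr;
	t->restartable = ripv;
	t->mce_notify = true;
}

/* Return-to-user path: ordinary context, safe to do the heavy work. */
static void return_to_user_work(struct task_sketch *t)
{
	if (!t->mce_notify)
		return;
	t->mce_notify = false;
	t->signals_sent++;	/* placeholder for page lookup + SIGBUS */
}
```

The design point is that nothing lock-taking happens in the exception path; the flag only guarantees that the work runs before the task next executes user code.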
Re: [RFC] x86_64: A real proposal for iret-less return to kernel
On Wed, May 21, 2014 at 03:00:18PM -0700, H. Peter Anvin wrote:
> So this is not a problem then.

Yeah, f'get it - it is all good at that front. :-)

--
Regards/Gruss, Boris.
Sent from a fat crate under my desk. Formatting is fine.
Re: [RFC] x86_64: A real proposal for iret-less return to kernel
On Wed, May 21, 2014 at 3:01 PM, Luck, Tony tony.l...@intel.com wrote:
>> But sending signals from #MC context is definitely a bad idea. I think we had addressed this with irq_work at some point but my memory is very hazy.
>
> We added code for recoverable errors to get out of the MC context before trying to look up the page and send the signal. Bottom of do_machine_check():
>
>	if (cfg->tolerant < 3) {
>		if (no_way_out)
>			mce_panic("Fatal machine check on current CPU", &m, msg);
>		if (worst == MCE_AR_SEVERITY) {
>			/* schedule action before return to userland */
>			mce_save_info(m.addr, m.mcgstatus & MCG_STATUS_RIPV);
>			set_thread_flag(TIF_MCE_NOTIFY);
>		} else if (kill_it) {
>			force_sig(SIGBUS, current);
>		}
>	}
>
> That TIF_MCE_NOTIFY prevents the return to user mode, and we end up in mce_notify_process().

Why is this necessary? If the MCE hit kernel code, then we're going to die anyway. If the MCE hit user code, then we should be in a completely sensible context and we can just send the signal.

--Andy

> The force_sig() there is legacy code - and perhaps should just move off to mce_notify_process() too (need to save worst so it will know what to do).
>
> -Tony

--
Andy Lutomirski
AMA Capital Management, LLC
Re: [RFC] x86_64: A real proposal for iret-less return to kernel
On 05/21/2014 03:11 PM, Borislav Petkov wrote:
> On Wed, May 21, 2014 at 03:00:18PM -0700, H. Peter Anvin wrote:
>> So this is not a problem then.
>
> Yeah, f'get it - it is all good at that front. :-)

Seems like a comment would be in order, though.

-hpa
Re: [RFC] x86_64: A real proposal for iret-less return to kernel
On Wed, May 21, 2014 at 03:13:16PM -0700, Andy Lutomirski wrote:
> Why is this necessary? If the MCE hit kernel code, then we're going to die anyway. If the MCE hit user code, then we should be in a completely sensible context and we can just send the signal.

Are we guaranteed that the first thing the process will execute when scheduled back in is the signal handlers?

And besides, maybe we don't even want to allow it to do the switch_to() but kill it while it is sleeping. (I know, we're that nasty :-))

--
Regards/Gruss, Boris.
Sent from a fat crate under my desk. Formatting is fine.
Re: [RFC] x86_64: A real proposal for iret-less return to kernel
On Wed, May 21, 2014 at 3:17 PM, Borislav Petkov b...@alien8.de wrote:
> On Wed, May 21, 2014 at 03:13:16PM -0700, Andy Lutomirski wrote:
>> Why is this necessary? If the MCE hit kernel code, then we're going to die anyway. If the MCE hit user code, then we should be in a completely sensible context and we can just send the signal.
>
> Are we guaranteed that the first thing the process will execute when scheduled back in are the signal handlers?

It's not even scheduled out, right? This should be just like a signal from a failed page fault, I think.

> And besides, maybe we don't even want to allow to do the switch_to() but kill it while it is sleeping.

What switch_to?

--Andy
RE: [RFC] x86_64: A real proposal for iret-less return to kernel
>> That TIF_MCE_NOTIFY prevents the return to user mode, and we end up in mce_notify_process().
>
> Why is this necessary?

The recovery path has to do more than just send a signal - it needs to walk processes and mms to see which have mapped the physical address that the h/w told us has gone bad.

-Tony
Re: [RFC] x86_64: A real proposal for iret-less return to kernel
On Wed, May 21, 2014 at 03:13:54PM -0700, H. Peter Anvin wrote:
> Seems like a comment would be in order, though.

Sure, I'll do a nice one once this discussion quiets down. :-)

--
Regards/Gruss, Boris.
Sent from a fat crate under my desk. Formatting is fine.
Re: [RFC] x86_64: A real proposal for iret-less return to kernel
On Wed, May 21, 2014 at 3:18 PM, Luck, Tony tony.l...@intel.com wrote:
>>> That TIF_MCE_NOTIFY prevents the return to user mode, and we end up in mce_notify_process().
>>
>> Why is this necessary?
>
> The recovery path has to do more than just send a signal - it needs to walk processes and mms to see which have mapped the physical address that the h/w told us has gone bad.

I still feel like I'm missing something. If we interrupted user space code, then the context we're in should be identical to the context we'll get when we're about to return to userspace.

--Andy
Re: [RFC] x86_64: A real proposal for iret-less return to kernel
Seems like a lot of effort and risk to essentially only optimize in-kernel interrupt handlers.

AFAIK the most interesting cases (like user page faults) are not affected at all. Usually most workloads don't spend all that much time in the kernel, so it won't help most interrupts.

I suspect the only case that's really interesting here is interrupting idle. Maybe it would be possible to do some fast path in this case only.

However idle currently has so much overhead that I suspect that there is lower-hanging fruit elsewhere.

-Andi

--
a...@linux.intel.com -- Speaking for myself only.
RE: [RFC] x86_64: A real proposal for iret-less return to kernel
>>> The recovery path has to do more than just send a signal - it needs to walk processes and mms to see which have mapped the physical address that the h/w told us has gone bad.
>>
>> I still feel like I'm missing something. If we interrupted user space code, then the context we're in should be identical to the context we'll get when we're about to return to userspace.

True. And this far along in do_machine_check() we have set all the other cpus free, so they are heading back to whatever context we interrupted them in. So we might be able to do all that other stuff inline here ... we interrupted user mode, so we know we don't hold any locks. Other cpus are running, so they can complete what they are doing to release any locks we might need.

But it will take a while (to scan all those processes). And we haven't yet cleared MCG_STATUS ... so another machine check before we do that would be fatal (x86 doesn't allow nesting). Even if we moved the work after the clear of MCG_STATUS we'd still be vulnerable to a new machine check on x86_64 because we are sitting on the one and only machine check stack.

-Tony
Re: [RFC] x86_64: A real proposal for iret-less return to kernel
On Wed, May 21, 2014 at 3:25 PM, Andi Kleen a...@firstfloor.org wrote:
> Seems like a lot of effort and risk to essentially only optimize in kernel interrupt handlers.

The idea is that it might allow us to remove a bunch of scary nested NMI code as well as speeding things up.

> AFAIK the most interesting cases (like user page faults) are not affected at all. Usually most workloads don't spend all that much time in the kernel, so it won't help most interrupts.
>
> I suspect the only case that's really interesting here is interrupting idle. Maybe it would be possible to do some fast path in this case only.
>
> However idle currently has so much overhead that I suspect that there are lower hanging fruit elsewhere.

I will gladly buy a meal or beverage for whoever fixes the ttwu stuff to stop sending IPIs to idle CPUs, which will help a lot.

--Andy
Re: [RFC] x86_64: A real proposal for iret-less return to kernel
On Thu, May 22, 2014 at 7:25 AM, Andi Kleen a...@firstfloor.org wrote:
> I suspect the only case that's really interesting here is interrupting idle. Maybe it would be possible to do some fast path in this case only.

Hardware interrupts during kernel are actually fairly common under network-intensive loads, even outside of idle (but idle is admittedly likely *the* most common one). Many network loads are fairly kernel-intensive.

Also, from a kernel perspective, idle isn't really any different from most other kernel code. Using "ret" to return to the idle handler would be *more* of a special case than using "ret" to return to just generic kernel context.

So I disagree vehemently. Do *not* special-case idle. It makes the code more complex and less generic.

Linus
Re: [RFC] x86_64: A real proposal for iret-less return to kernel
On Wed, May 21, 2014 at 03:20:50PM -0700, Andy Lutomirski wrote:
> It's not even scheduled out, right?

Right.

> This should be just like a signal from a failed page fault, I think.

Right, but there is this additional work that needs to be done (mce_notify_process()) before sending the signal. So you want to do this after the MCE handler is done but before you return to the process.

> What switch_to?

Nevermind, that was bollocks.

--
Regards/Gruss, Boris.
Sent from a fat crate under my desk. Formatting is fine.
Re: [RFC] x86_64: A real proposal for iret-less return to kernel
On 05/21/2014 11:11 AM, Andy Lutomirski wrote:
> On Tue, May 20, 2014 at 5:53 PM, Andy Lutomirski l...@amacapital.net wrote:
>> Here's a real proposal for iret-less return. If this is correct, then NMIs will never nest, which will probably delete a lot more scariness than is added by the code I'm describing.
>
> OK, here's a case where I'm wrong. An NMI interrupts userspace on a 16-bit stack. The return from NMI goes through the espfix code. Something interrupts while on the espfix stack. Boom! Neither return style is particularly good.
>
> More generally, if we got interrupted while on the espfix stack, we need to return back there using IRET. Fortunately, re-enabling NMIs there is harmless, since we've already switched off the NMI stack.
>
> This makes me think that maybe the logic should be turned around: have some RIP ranges on which the kernel stack might be invalid (which includes the espfix code and some of the syscall code) and use IRET only on return from NMI, return to nonstandard CS, and return to these special ranges. The NMI code just needs to never do any of this stuff unless it switches off the NMI stack first.
>
> For this to work reliably, we'll probably have to change CS before calling into EFI code. That should be straightforward.

I think you are onto something here.

In particular, the key observation here is that inside the kernel, we can never *both* have an invalid stack *and* be inside an NMI, #MC or #DB handler, even if nested.

Now, does this prevent us from using RET in the common case? I'm not sure it is a huge loss since kernel-to-kernel is relatively rare.

-hpa
Re: [RFC] x86_64: A real proposal for iret-less return to kernel
On Wed, May 21, 2014 at 3:32 PM, Luck, Tony tony.l...@intel.com wrote:
>>> The recovery path has to do more than just send a signal - it needs to walk processes and mms to see which have mapped the physical address that the h/w told us has gone bad.
>>
>> I still feel like I'm missing something. If we interrupted user space code, then the context we're in should be identical to the context we'll get when we're about to return to userspace.
>
> True. And this far along in do_machine_check() we have set all the other cpus free, so they are heading back to whatever context we interrupted them in. So we might be able to do all that other stuff inline here ... we interrupted user mode, so we know we don't hold any locks. Other cpus are running, so they can complete what they are doing to release any locks we might need.
>
> But it will take a while (to scan all those processes). And we haven't yet cleared MCG_STATUS ... so another machine check before we do that would be fatal (x86 doesn't allow nesting). Even if we moved the work after the clear of MCG_STATUS we'd still be vulnerable to a new machine check on x86_64 because we are sitting on the one and only machine check stack.

But if we get a new MCE in here, it will be an MCE from kernel context and it's fatal. So, yes, we'll clobber the stack, but we'll never return (unless tolerant is set to something insane), so who cares?

Anyway, I care less about this now that I don't have to worry about it re: IRET :)

--Andy
Re: [RFC] x86_64: A real proposal for iret-less return to kernel
On Wed, May 21, 2014 at 3:36 PM, H. Peter Anvin h...@zytor.com wrote:
> On 05/21/2014 11:11 AM, Andy Lutomirski wrote:
>> On Tue, May 20, 2014 at 5:53 PM, Andy Lutomirski l...@amacapital.net wrote:
>>> Here's a real proposal for iret-less return. If this is correct, then NMIs will never nest, which will probably delete a lot more scariness than is added by the code I'm describing.
>>
>> OK, here's a case where I'm wrong. An NMI interrupts userspace on a 16-bit stack. The return from NMI goes through the espfix code. Something interrupts while on the espfix stack. Boom! Neither return style is particularly good.
>>
>> More generally, if we got interrupted while on the espfix stack, we need to return back there using IRET. Fortunately, re-enabling NMIs there is harmless, since we've already switched off the NMI stack.
>>
>> This makes me think that maybe the logic should be turned around: have some RIP ranges on which the kernel stack might be invalid (which includes the espfix code and some of the syscall code) and use IRET only on return from NMI, return to nonstandard CS, and return to these special ranges. The NMI code just needs to never do any of this stuff unless it switches off the NMI stack first.
>>
>> For this to work reliably, we'll probably have to change CS before calling into EFI code. That should be straightforward.
>
> I think you are onto something here.
>
> In particular, the key observation here is that inside the kernel, we can never *both* have an invalid stack *and* be inside an NMI, #MC or #DB handler, even if nested.

Except for espfix :)

> Now, does this prevent us from using RET in the common case? I'm not sure it is a huge loss since kernel-to-kernel is relatively rare.

I don't think so. The most common case should be plain old interrupts and I suspect that #PF is a distant second.

In any event, plain old interrupts and #PF are non-IST interrupts and they should be unconditionally safe for RET.

--Andy
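The "turned around" rule from this subthread (use IRET only when returning from an NMI, to a nonstandard CS, or into a RIP range where the kernel stack may be invalid) can be written as a small predicate. The following is a user-space sketch of the decision only, not proposed kernel code; struct return_frame, struct rip_range, must_use_iret() and the MODEL_KERNEL_CS value are all assumptions made for illustration.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed selector value for "standard kernel CS" in this model. */
#define MODEL_KERNEL_CS 0x10u

/* Half-open address range [start, end). */
struct rip_range {
	uint64_t start;
	uint64_t end;
};

/* Hypothetical summary of the frame we are about to return to. */
struct return_frame {
	uint64_t rip;
	uint16_t cs;
	bool     from_nmi; /* true when this is the return from an NMI */
};

/* True if rip falls inside any of the n listed ranges. */
static bool rip_in_ranges(uint64_t rip, const struct rip_range *r, size_t n)
{
	for (size_t i = 0; i < n; i++)
		if (rip >= r[i].start && rip < r[i].end)
			return true;
	return false;
}

/* IRET for the three special cases from the thread, RET otherwise. */
static bool must_use_iret(const struct return_frame *f,
			  const struct rip_range *unsafe, size_t n)
{
	if (f->from_nmi)
		return true;
	if (f->cs != MODEL_KERNEL_CS)
		return true;
	return rip_in_ranges(f->rip, unsafe, n);
}
```

In this model the list of unsafe ranges would cover things like the espfix code and the syscall entry paths mentioned above; ordinary interrupts and #PF returning to plain kernel code fall through to the RET case.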
Re: [RFC] x86_64: A real proposal for iret-less return to kernel
On Wed, May 21, 2014 at 03:39:11PM -0700, Andy Lutomirski wrote:
> But if we get a new MCE in here, it will be an MCE from kernel context and it's fatal. So, yes, we'll clobber the stack, but we'll never return (unless tolerant is set to something insane), so who cares?

Ok, but we still have to do the work before returning to the process. So if not mce_notify_process() how else are you suggesting we do this?

--
Regards/Gruss, Boris.
Sent from a fat crate under my desk. Formatting is fine.
Re: [RFC] x86_64: A real proposal for iret-less return to kernel
On Wed, May 21, 2014 at 3:48 PM, Borislav Petkov b...@alien8.de wrote:
> On Wed, May 21, 2014 at 03:39:11PM -0700, Andy Lutomirski wrote:
>> But if we get a new MCE in here, it will be an MCE from kernel context and it's fatal. So, yes, we'll clobber the stack, but we'll never return (unless tolerant is set to something insane), so who cares?
>
> Ok, but we still have to do the work before returning to the process. So if not mce_notify_process() how else are you suggesting we do this?

I'm suggesting that you re-enable interrupts and do the work in do_machine_check. I think it'll just work.

It might pay to set a flag so that you panic very loudly if do_machine_check recurses. I suspect that, if the hardware is generating machine checks while doing memory poisoning, the hardware is broken enough that even panicking might not work, though :)

--Andy
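The "set a flag so that you panic very loudly if do_machine_check recurses" idea is an ordinary re-entrancy guard. Below is a single-threaded user-space sketch of it; a real version would need per-CPU state and a real panic, and every name here (in_machine_check, model_panic, do_machine_check_model) is hypothetical.

```c
#include <stdbool.h>

static bool in_machine_check; /* would be per-CPU in a real kernel */
static int panics;            /* counts calls to the stand-in panic */

/* Stand-in for panic(); a real panic would not return. */
static void model_panic(void)
{
	panics++;
}

/*
 * Models a handler that does its recovery work inline. If it is
 * entered again while the flag is set, it "panics" instead of
 * continuing on the already-clobbered stack.
 */
static void do_machine_check_model(bool nested_fault)
{
	if (in_machine_check) {
		model_panic();
		return;
	}
	in_machine_check = true;
	if (nested_fault)
		do_machine_check_model(false); /* simulate a second #MC */
	in_machine_check = false;
}
```

The guard does not make the nested case survivable; it only turns silent stack corruption into an explicit failure.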
Re: [RFC] x86_64: A real proposal for iret-less return to kernel
On Wed, May 21, 2014 at 03:52:16PM -0700, Andy Lutomirski wrote:
> I'm suggesting that you re-enable interrupts and do the work in do_machine_check. I think it'll just work.
>
> It might pay to set a flag so that you panic very loudly if do_machine_check recurses.

And that might very likely happen if we're trying to poison a page which is shared by a couple of processes' mm's and some process on some cpu starts touching it. So keeping all cpus in a holding pattern is much safer, IMO. (#MC is broadcast on Intel, I'm sure you know).

And even if it made sense, why go to the trouble? To shorten the time we're in the MCE handler? Well, if we spend too much time in it, then the box is dying anyway. On normal, healthy hw, do_machine_check doesn't run. :-)

> I suspect that, if the hardware is generating machine checks while doing memory poisoning, the hardware is broken enough that even panicking might not work, though :)

Yeah, in such cases, they tend to escalate to fatal errors very fast so we panic right on the spot.

--
Regards/Gruss, Boris.
Sent from a fat crate under my desk. Formatting is fine.
Re: [RFC] x86_64: A real proposal for iret-less return to kernel
On 05/21/2014 03:41 PM, Andy Lutomirski wrote:
>> I think you are onto something here.
>>
>> In particular, the key observation here is that inside the kernel, we can never *both* have an invalid stack *and* be inside an NMI, #MC or #DB handler, even if nested.
>
> Except for espfix :)

Argh. Yes, I got that wrong... it isn't really about being inside NMI, #MC or #DB, but rather being on those respective stacks.

If you are on the espfix stack you are on your way back to userspace OR (and this gets really, really ugly) you took an NMI/MC/DB after a SYSCALL executed in 16-bit mode, but even then you are in the kernel entry/exit code and re-enabling NMI is fine.

>> Now, does this prevent us from using RET in the common case? I'm not sure it is a huge loss since kernel-to-kernel is relatively rare.
>
> I don't think so. The most common case should be plain old interrupts and I suspect that #PF is a distant second.
>
> In any event, plain old interrupts and #PF are non-IST interrupts and they should be unconditionally safe for RET

-hpa