Re: [RFC] x86_64: A real proposal for iret-less return to kernel

2014-05-22 Thread Borislav Petkov
On Thu, May 22, 2014 at 09:03:34AM +0900, Linus Torvalds wrote:
> No, that's fine, if it's a thread-synchronous thing (ie a memory load
> that causes errors). But for NMI handlers, that is irrelevant: if
> the NMI code itself gets memory errors, the machine really is dead.
> Let's face it, we're going to panic and reboot, there's no other
> real alternative (other than the "just log it, pray, and continue
> in unstable mode", which is actually a perfectly valid alternative
> in many cases, since people don't necessarily care deeply and have
> written their distributed algorithms to not rely on any particular
> thread too much, and will verify the end results anyway).

Oh, definitely.

In fact, we'll panic on uncorrectable errors in any unmovable memory,
i.e. kernel code and data, because we simply can't recover from it.
Anything that happens in the NMI handler most probably falls in that
category so...

I was simply pointing out the fact that Andy's algo needs to pay
attention to MCEs and other higher prio exceptions happening.

> The problem is literally the non-synchronous things (like another
> CPU having problems) where things like broadcast will actually turn
> a non-thread-synchronous thing into problems for other CPU's. Then,
> a user-mode memory access error (that we *can* recover from, perhaps
> by killing the process and isolating the page) can turn into a
> unrecoverable error on another CPU because it got interrupted at a
> point where it really couldn't afford to be interrupted.

That definitely sounds like a nasty thing, sure.

Although, there's at least one problem I've been thinking about wrt the
non-broadcast MCE: it is pretty hard to handle an uncorrectable memory
error in a page which is shared by multiple threads running on multiple
cores.

So normally one of the cores will detect it, raise an MCE and deal with
it but there's nothing stopping the other cores from touching that data.

One of the possible things which could happen is, if the other cores
consume that data, they will trigger an MCE too and will have to see
that the first core which detected the error is about to poison that
page so their job in the MCE handler is done and they have to exit.

I'm not saying this is undoable but it is a bit tricky and some
scenarios would need to be played out first to know better.

So, to a certain extent, broadcasting the MCE and keeping the cores in a
holding pattern, not touching any userspace stuff might've been one way
to deal with situations like that. It certainly makes things easier for
that particular scenario.

I'm not saying it was a good idea due to the point you're making - maybe
they should've talked to software people first. I'm basically trying to
explain to myself what the reasoning behind that broadcasting might be.

> It appears Intel is fixing their braindamage.

Yep, we'd still need to deal with the existing systems but we don't have
a choice anyway.

-- 
Regards/Gruss,
Boris.

Sent from a fat crate under my desk. Formatting is fine.
--
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
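
The shared-page scenario in the message above (one core detects the error
and starts poisoning the page, other cores then consume the same data, take
their own MCE, and have to notice that the page is already being handled)
can be sketched as a small model. This is only an illustration in Python,
not kernel code: the PoisonTracker class, its method names and the use of a
lock are assumptions made for the example, and a real #MC handler could not
take an ordinary lock like this.

```python
import threading


class PoisonTracker:
    """Toy model of 'first core to report a bad page owns the recovery'."""

    def __init__(self):
        self._lock = threading.Lock()
        self._claimed = {}  # page frame number -> id of the claiming core

    def claim(self, pfn, cpu):
        """Return True if this cpu is the first reporter and must do the work.

        Returns False if another core already claimed the page, in which
        case the caller's job in its handler is done and it can exit.
        """
        with self._lock:
            if pfn in self._claimed:
                return False
            self._claimed[pfn] = cpu
            return True

    def owner(self, pfn):
        """Return the core that claimed pfn, or None if nobody has."""
        with self._lock:
            return self._claimed.get(pfn)
```

In this model the second and later cores that hit the same page simply get
False back and return; the tricky parts discussed in the thread (ordering,
what the late cores may touch meanwhile) are deliberately left out.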


Re: [RFC] x86_64: A real proposal for iret-less return to kernel

2014-05-21 Thread Andy Lutomirski
On Wed, May 21, 2014 at 4:51 PM, Borislav Petkov  wrote:
> On Thu, May 22, 2014 at 08:30:33AM +0900, Linus Torvalds wrote:
>> If the OS then decides to take down the whole machine, the OS - not
>> the hardware - can choose to do something that will punch through
>> other CPU's NMI blocking (notably, init/reset), but the hardware doing
>> this on its own is just broken if true.
>
> Not that it is any consolation but MCE is not broadcast on AMD.
>
> Regardless, exceptions like MCE cannot be held pending and do pierce the
> NMI handler on both.
>
> Now, if the NMI handler experiences a non-broadcast MCE on the same CPU,
> while running, we're simply going to panic as we're in kernel space
> anyway.
>
> The only problem is if the NMI handler gets interrupted while running
> on a bystander CPU. And I think we could deal with this because the
> bystander would not see an MCE and will return safely. We just need
> to make sure that it returns back to the said NMI handler and not to
> userspace. Unless I'm missing something ...

Under my "always RET unless returning from IST, to weird CS, or to
specific known-invalid-stack regions" proposal this should work fine.
In the current code it'll also work fine *unless* it hits really early
in the NMI, in which case a second NMI can kill us.

--Andy


Re: [RFC] x86_64: A real proposal for iret-less return to kernel

2014-05-21 Thread Linus Torvalds
On Thu, May 22, 2014 at 8:51 AM, Borislav Petkov  wrote:
>
> Regardless, exceptions like MCE cannot be held pending and do pierce the
> NMI handler on both.

No, that's fine, if it's a thread-synchronous thing (ie a memory load
that causes errors). But for NMI handlers, that is irrelevant: if the
NMI code itself gets memory errors, the machine really is dead. Let's
face it, we're going to panic and reboot, there's no other real
alternative (other than the "just log it, pray, and continue in
unstable mode", which is actually a perfectly valid alternative in
many cases, since people don't necessarily care deeply and have
written their distributed algorithms to not rely on any particular
thread too much, and will verify the end results anyway).

The problem is literally the non-synchronous things (like another CPU
having problems) where things like broadcast will actually turn a
non-thread-synchronous thing into problems for other CPU's. Then, a
user-mode memory access error (that we *can* recover from, perhaps by
killing the process and isolating the page) can turn into a
unrecoverable error on another CPU because it got interrupted at a
point where it really couldn't afford to be interrupted.

It appears Intel is fixing their braindamage.

  Linus


Re: [RFC] x86_64: A real proposal for iret-less return to kernel

2014-05-21 Thread Borislav Petkov
On Thu, May 22, 2014 at 08:30:33AM +0900, Linus Torvalds wrote:
> If the OS then decides to take down the whole machine, the OS - not
> the hardware - can choose to do something that will punch through
> other CPU's NMI blocking (notably, init/reset), but the hardware doing
> this on its own is just broken if true.

Not that it is any consolation but MCE is not broadcast on AMD.

Regardless, exceptions like MCE cannot be held pending and do pierce the
NMI handler on both.

Now, if the NMI handler experiences a non-broadcast MCE on the same CPU,
while running, we're simply going to panic as we're in kernel space
anyway.

The only problem is if the NMI handler gets interrupted while running
on a bystander CPU. And I think we could deal with this because the
bystander would not see an MCE and will return safely. We just need
to make sure that it returns back to the said NMI handler and not to
userspace. Unless I'm missing something ...

Oh yeah, fun :-\

-- 
Regards/Gruss,
Boris.

Sent from a fat crate under my desk. Formatting is fine.
Please read the FAQ at  http://www.tux.org/lkml/


RE: [RFC] x86_64: A real proposal for iret-less return to kernel

2014-05-21 Thread Luck, Tony
> MCE is frankly misdesigned. It's a piece of shit, and any of the
> hardware designers that claim that what they do is for system
> stability are out to lunch. This is a prime example of what *NOT* to
> do, and how you can actually spread what was potentially a localized
> and recoverable error, and make it global and unrecoverable.

Latest SDM (version 050 from late February this year) describes how
this is going to be fixed. Recoverable machine checks are going to be
thread local. But current silicon still has the broadcast behavior ...
silicon development pipeline is very long :-(

-Tony


Re: [RFC] x86_64: A real proposal for iret-less return to kernel

2014-05-21 Thread Linus Torvalds
On Thu, May 22, 2014 at 8:23 AM, Andi Kleen  wrote:
>> Hardware-interrupts during kernel are actually fairly common under
>> network-intensive loads, even outside of idle (but idle is admittedly
>> likely *the* most common one). Many network loads are fairly
>> kernel-intensive.
>
> For network workloads we can arbitrarily coalesce interrupts or just use NAPI
> to lower the costs.  No need to optimize network interrupts too much.

BS. Lots of network loads are latency-critical, to the point that
people sometimes actually turn off coalescing. But even with
coalescing, it doesn't do crap for ping-pong kinds of loads that are
not "interrupt storm from tons and tons of separate packets", but
"lots of individual packets that are data-dependent", so you don't
have new ones coming in while processing old ones.

Ask Andy L. He had numbers. Interrupt overhead was quite big for him.

And you ignored the real issue: special-casing idle is *stupid*. It's
more complicated, and gives fewer cases where it helps. It's simply
fundamentally stupid and wrong.

 Linus


Re: [RFC] x86_64: A real proposal for iret-less return to kernel

2014-05-21 Thread Linus Torvalds
On Thu, May 22, 2014 at 8:19 AM, Luck, Tony  wrote:
>
> Yes. Bystander broadcast machine checks can and will hit processors
> that are in NMI context ... and we must not make that fatal.

.. and this, btw, is just another example of why MCE hardware
designers are f*cking morons that should be given extensive education
about birth control and how not to procreate.

MCE is frankly misdesigned. It's a piece of shit, and any of the
hardware designers that claim that what they do is for system
stability are out to lunch. This is a prime example of what *NOT* to
do, and how you can actually spread what was potentially a localized
and recoverable error, and make it global and unrecoverable.

Can we please get these designers either fired, or re-educated?
Because this shit has been going on too long. I complained about this
to Tony many years ago, and nothing was ever fixed.

Synchronous MCE's are fine for synchronous errors, but then trying to
turn them "synchronous" for other CPU's (where they *weren't*
synchronous errors) is a major mistake. External errors punching
through irq context is wrong, punching through NMI is just
inexcusable.

If the OS then decides to take down the whole machine, the OS - not
the hardware - can choose to do something that will punch through
other CPU's NMI blocking (notably, init/reset), but the hardware doing
this on its own is just broken if true.

Anyway, I repeat: I refuse to fix hardware bugs. As far as we are
concerned, this is "best effort", and the hardware designers should
take a long deep look at their idiotic schemes. If something punches
through NMI, it's deadly. It's that simple.

 Linus


Re: [RFC] x86_64: A real proposal for iret-less return to kernel

2014-05-21 Thread Andi Kleen
> Hardware-interrupts during kernel are actually fairly common under
> network-intensive loads, even outside of idle (but idle is admittedly
> likely *the* most common one). Many network loads are fairly
> kernel-intensive.

For network workloads we can arbitrarily coalesce interrupts or just use NAPI
to lower the costs.  No need to optimize network interrupts too much.

-Andi



RE: [RFC] x86_64: A real proposal for iret-less return to kernel

2014-05-21 Thread Luck, Tony
> FWIW, this means that there really is a problem if one of these #MC
> errors hits an innocent bystander who just happens to be handling an
> NMI, at least if we delete the nested NMI code.  But I think my
> simplified proposal gets this right.

Yes. Bystander broadcast machine checks can and will hit processors
that are in NMI context ... and we must not make that fatal. Peek
harder at your proposal so you can state confidently that you get
this right.  "I think ... gets this right" is a bit too wishy-washy for
mission critical :-)

-Tony


Re: [RFC] x86_64: A real proposal for iret-less return to kernel

2014-05-21 Thread Andy Lutomirski
On Wed, May 21, 2014 at 4:05 PM, Luck, Tony  wrote:
> On Wed, May 21, 2014 at 03:39:11PM -0700, Andy Lutomirski wrote:
>> But if we get a new MCE in here, it will be an MCE from kernel context
>> and it's fatal. So, yes, we'll clobber the stack, but we'll never
>> return (unless tolerant is set to something insane), so who cares?
>
> Remember that machine checks are broadcast.  So some other cpu
> can hit a recoverable machine check in user mode ... but that int#18
> goes everywhere.  Other cpus are innocent bystanders ... they will
> see MCG_STATUS.RIPV=1, MCG_STATUS.EIPV=0 and nothing important
> in any of their machine check banks.
>
> But if we are still finishing off processing the previous machine check,
> this will be a nested one - and BOOM, we are dead.

Oh.  Well, crap.

FWIW, this means that there really is a problem if one of these #MC
errors hits an innocent bystander who just happens to be handling an
NMI, at least if we delete the nested NMI code.  But I think my
simplified proposal gets this right.

>
> -Tony
>
> [If you peer closely at the latest edition of the SDM - you'll see the
> bits are defined for a non-broadcast model ... e.g. LMCE_S bit in
> MCG_STATUS  but currently shipping silicon doesn't use that]



-- 
Andy Lutomirski
AMA Capital Management, LLC


RE: [RFC] x86_64: A real proposal for iret-less return to kernel

2014-05-21 Thread Luck, Tony
On Wed, May 21, 2014 at 03:39:11PM -0700, Andy Lutomirski wrote:
> But if we get a new MCE in here, it will be an MCE from kernel context
> and it's fatal. So, yes, we'll clobber the stack, but we'll never
> return (unless tolerant is set to something insane), so who cares?

Remember that machine checks are broadcast.  So some other cpu
can hit a recoverable machine check in user mode ... but that int#18
goes everywhere.  Other cpus are innocent bystanders ... they will
see MCG_STATUS.RIPV=1, MCG_STATUS.EIPV=0 and nothing important
in any of their machine check banks.

But if we are still finishing off processing the previous machine check,
this will be a nested one - and BOOM, we are dead.

-Tony

[If you peer closely at the latest edition of the SDM - you'll see the
bits are defined for a non-broadcast model ... e.g. LMCE_S bit in
MCG_STATUS  but currently shipping silicon doesn't use that]
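
The bystander signature described above can be written down as a small
check. This is a hedged sketch in Python rather than kernel code. The bit
positions are those of the architectural IA32_MCG_STATUS register (RIPV
bit 0, EIPV bit 1, MCIP bit 2, LMCE_S bit 3) and the VAL bit (bit 63) of
the per-bank status registers. The function name is made up for the
example, and treating "nothing important in any bank" as "no bank has VAL
set" is a simplifying assumption.

```python
MCG_STATUS_RIPV = 1 << 0    # restart IP valid
MCG_STATUS_EIPV = 1 << 1    # error IP valid
MCG_STATUS_MCIP = 1 << 2    # machine check in progress
MCG_STATUS_LMCE_S = 1 << 3  # local machine check signaled

MCI_STATUS_VAL = 1 << 63    # bank holds a valid error record


def looks_like_bystander(mcg_status, bank_statuses):
    """Return True if this cpu only received the broadcast #MC.

    Assumption: a bystander sees RIPV=1, EIPV=0 and no bank with a
    valid record, which is a simplification of the description above.
    """
    ripv = bool(mcg_status & MCG_STATUS_RIPV)
    eipv = bool(mcg_status & MCG_STATUS_EIPV)
    nothing_logged = not any(s & MCI_STATUS_VAL for s in bank_statuses)
    return ripv and not eipv and nothing_logged
```

A cpu for which this returns True could, in principle, resume what it was
doing, which is exactly why a nested #MC arriving at that moment is so
painful.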


Re: [RFC] x86_64: A real proposal for iret-less return to kernel

2014-05-21 Thread H. Peter Anvin
On 05/21/2014 03:41 PM, Andy Lutomirski wrote:
>>
>> I think you are onto something here.
>>
>> In particular, the key observation here is that inside the kernel, we
>> can never *both* have an invalid stack *and* be inside an NMI, #MC or
>> #DB handler, even if nested.
> 
> Except for espfix :)

Argh.  Yes, I got that wrong... it isn't really about being inside NMI,
#MC or #DB, but rather being on those respective stacks.  If you are on
the espfix stack you are on your way back to userspace OR (and this gets
really, really ugly) you took an NMI/MC/DB after a SYSCALL executed in
16-bit mode, but even then you are in the kernel entry/exit code and
re-enabling NMI is fine.

>> Now, does this prevent us from using RET in the common case?  I'm not
>> sure it is a huge loss since kernel-to-kernel is relatively rare.
> 
> I don't think so.  The most common case should be plain old interrupts
> and I suspect that #PF is a distant second.
> 
> In any event, plain old interrupts and #PF are non-IST interrupts and
> they should be unconditionally safe for RET

-hpa




Re: [RFC] x86_64: A real proposal for iret-less return to kernel

2014-05-21 Thread Borislav Petkov
On Wed, May 21, 2014 at 03:52:16PM -0700, Andy Lutomirski wrote:
> I'm suggesting that you re-enable interrupts and do the work in
> do_machine_check. I think it'll just work. It might pay to set a flag
> so that you panic very loudly if do_machine_check recurses.

And that is quite likely to happen if we're trying to poison a page which
is shared by a couple of processes' mm's and some process on some cpu
starts touching it.

So keeping all cpus in a holding pattern is much safer, IMO. (#MC is
broadcast on Intel, I'm sure you know).

And even if it made sense, why go to the trouble? To shorten the time we're
in the MCE handler? Well, if we spend too much time in it, then the box
is dying anyway. On normal, healthy hw, do_machine_check doesn't run.

:-)

> I suspect that, if the hardware is generating machine checks while
> doing memory poisoning, the hardware is broken enough that even
> panicking might not work, though :)

Yeah, in such cases, they tend to escalate to fatal errors very fast so
we panic right on the spot.

-- 
Regards/Gruss,
Boris.

Sent from a fat crate under my desk. Formatting is fine.
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] x86_64: A real proposal for iret-less return to kernel

2014-05-21 Thread Andy Lutomirski
On Wed, May 21, 2014 at 3:48 PM, Borislav Petkov  wrote:
> On Wed, May 21, 2014 at 03:39:11PM -0700, Andy Lutomirski wrote:
>> But if we get a new MCE in here, it will be an MCE from kernel context
>> and it's fatal. So, yes, we'll clobber the stack, but we'll never
>> return (unless tolerant is set to something insane), so who cares?
>
> Ok, but we still have to do the work before returning to the process. So
> if not mce_notify_process() how else are you suggesting we do this?

I'm suggesting that you re-enable interrupts and do the work in
do_machine_check.  I think it'll just work.  It might pay to set a
flag so that you panic very loudly if do_machine_check recurses.

I suspect that, if the hardware is generating machine checks while
doing memory poisoning, the hardware is broken enough that even
panicking might not work, though :)

--Andy
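
The "set a flag so that you panic very loudly if do_machine_check
recurses" idea above can be modelled in a few lines. This is a Python
sketch under stated assumptions, not the kernel's code: the per-cpu set,
the exception type standing in for panic(), and the function name
do_machine_check_model are all invented for the illustration.

```python
class MachineCheckRecursion(RuntimeError):
    """Stands in for a loud panic() in this model."""


_in_handler = set()  # cpus currently inside the modelled handler


def do_machine_check_model(cpu, work):
    """Run work() as the handler body for cpu, refusing to nest.

    If the same cpu re-enters while the flag is set, raise instead of
    continuing on what would be an already clobbered stack.
    """
    if cpu in _in_handler:
        raise MachineCheckRecursion(f"nested machine check on cpu {cpu}")
    _in_handler.add(cpu)
    try:
        return work()
    finally:
        # Clear the flag on every exit path so later checks still work.
        _in_handler.discard(cpu)
```

The flag is per cpu on purpose: with broadcast #MC several cpus are in the
handler at once, and only re-entry on the same cpu is the fatal case.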


Re: [RFC] x86_64: A real proposal for iret-less return to kernel

2014-05-21 Thread Borislav Petkov
On Wed, May 21, 2014 at 03:39:11PM -0700, Andy Lutomirski wrote:
> But if we get a new MCE in here, it will be an MCE from kernel context
> and it's fatal. So, yes, we'll clobber the stack, but we'll never
> return (unless tolerant is set to something insane), so who cares?

Ok, but we still have to do the work before returning to the process. So
if not mce_notify_process() how else are you suggesting we do this?

-- 
Regards/Gruss,
Boris.

Sent from a fat crate under my desk. Formatting is fine.
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] x86_64: A real proposal for iret-less return to kernel

2014-05-21 Thread Andy Lutomirski
On Wed, May 21, 2014 at 3:36 PM, H. Peter Anvin  wrote:
> On 05/21/2014 11:11 AM, Andy Lutomirski wrote:
>> On Tue, May 20, 2014 at 5:53 PM, Andy Lutomirski  wrote:
>>> Here's a real proposal for iret-less return.  If this is correct, then
>>> NMIs will never nest, which will probably delete a lot more scariness
>>> than is added by the code I'm describing.
>>
>> OK, here's a case where I'm wrong.  An NMI interrupts userspace on a
>> 16-bit stack.  The return from NMI goes through the espfix code.
>> Something interrupts while on the espfix stack.  Boom!  Neither return
>> style is particularly good.
>>
>> More generally, if we got interrupted while on the espfix stack, we
>> need to return back there using IRET.  Fortunately, re-enabling NMIs
>> there is harmless, since we've already switched off the NMI stack.
>>
>> This makes me think that maybe the logic should be turned around: have
>> some RIP ranges on which the kernel stack might be invalid (which
>> includes the espfix code and some of the syscall code) and use IRET
>> only on return from NMI, return to nonstandard CS, and return to these
>> special ranges.  The NMI code just needs to never do any of this stuff
>> unless it switches off the NMI stack first.
>>
>> For this to work reliably, we'll probably have to change CS before
>> calling into EFI code.  That should be straightforward.
>>
>
> I think you are onto something here.
>
> In particular, the key observation here is that inside the kernel, we
> can never *both* have an invalid stack *and* be inside an NMI, #MC or
> #DB handler, even if nested.

Except for espfix :)

>
> Now, does this prevent us from using RET in the common case?  I'm not
> sure it is a huge loss since kernel-to-kernel is relatively rare.

I don't think so.  The most common case should be plain old interrupts
and I suspect that #PF is a distant second.

In any event, plain old interrupts and #PF are non-IST interrupts and
they should be unconditionally safe for RET

--Andy
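
The "turned around" rule quoted above (use IRET only on return from NMI,
on return to a nonstandard CS, and on return to RIP ranges where the
kernel stack might be invalid, and plain RET otherwise) amounts to a small
decision function. The following is a Python model only, not the proposed
entry code: the Frame type, the needs_iret name, the KERNEL_CS value used
as the "standard" selector and the example ranges are assumptions made for
illustration.

```python
from dataclasses import dataclass

KERNEL_CS = 0x10  # assumed "standard" kernel code segment for this model


@dataclass(frozen=True)
class Frame:
    from_nmi: bool  # are we returning from the NMI handler?
    cs: int         # code segment we are returning to
    rip: int        # instruction pointer we are returning to


def needs_iret(frame, unsafe_rip_ranges, kernel_cs=KERNEL_CS):
    """Return True if the return must use IRET, False if RET is enough.

    unsafe_rip_ranges is a list of half-open (start, end) address ranges,
    standing in for code such as espfix or parts of the syscall path.
    """
    if frame.from_nmi:
        return True  # IRET is what unblocks NMIs again
    if frame.cs != kernel_cs:
        return True  # nonstandard CS, for example firmware code
    return any(lo <= frame.rip < hi for lo, hi in unsafe_rip_ranges)
```

Everything that falls through all three tests, which per the discussion
should include ordinary interrupts and #PF taken in the kernel, would take
the cheaper RET path.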


Re: [RFC] x86_64: A real proposal for iret-less return to kernel

2014-05-21 Thread Andy Lutomirski
On Wed, May 21, 2014 at 3:32 PM, Luck, Tony  wrote:
>>> The recovery path has to do more than just send a signal - it needs to walk
>>> processes and "mm"s to see which have mapped the physical address that the
>>> h/w told us has gone bad.
>>
>> I still feel like I'm missing something.  If we interrupted user space
>> code, then the context we're in should be identical to the context
>> we'll get when we're about to return to userspace.
>
> True. And this far along in do_machine_check() we have set all the other cpus
> free, so they are heading back to whatever context we interrupted them in. So
> we might be able to do all that other stuff inline here ... we interrupted
> user mode, so we know we don't hold any locks. Other cpus are running, so
> they can complete what they are doing to release any locks we might need.
>
> But it will take a while (to scan all those processes). And we haven't yet
> cleared MCG_STATUS ... so another machine check before we do that
> would be fatal (x86 doesn't allow nesting).  Even if we moved the work
> after the clear of MCG_STATUS we'd still be vulnerable to a new machine
> check on x86_64 because we are sitting on the one & only machine check
> stack.

But if we get a new MCE in here, it will be an MCE from kernel context
and it's fatal.  So, yes, we'll clobber the stack, but we'll never
return (unless tolerant is set to something insane), so who cares?

Anyway, I care less about this now that I don't have to worry about it
re: IRET :)

--Andy


Re: [RFC] x86_64: A real proposal for iret-less return to kernel

2014-05-21 Thread H. Peter Anvin
On 05/21/2014 11:11 AM, Andy Lutomirski wrote:
> On Tue, May 20, 2014 at 5:53 PM, Andy Lutomirski  wrote:
>> Here's a real proposal for iret-less return.  If this is correct, then
>> NMIs will never nest, which will probably delete a lot more scariness
>> than is added by the code I'm describing.
> 
> OK, here's a case where I'm wrong.  An NMI interrupts userspace on a
> 16-bit stack.  The return from NMI goes through the espfix code.
> Something interrupts while on the espfix stack.  Boom!  Neither return
> style is particularly good.
> 
> More generally, if we got interrupted while on the espfix stack, we
> need to return back there using IRET.  Fortunately, re-enabling NMIs
> there is harmless, since we've already switched off the NMI stack.
> 
> This makes me think that maybe the logic should be turned around: have
> some RIP ranges on which the kernel stack might be invalid (which
> includes the espfix code and some of the syscall code) and use IRET
> only on return from NMI, return to nonstandard CS, and return to these
> special ranges.  The NMI code just needs to never do any of this stuff
> unless it switches off the NMI stack first.
> 
> For this to work reliably, we'll probably have to change CS before
> calling into EFI code.  That should be straightforward.
> 

I think you are onto something here.

In particular, the key observation here is that inside the kernel, we
can never *both* have an invalid stack *and* be inside an NMI, #MC or
#DB handler, even if nested.

Now, does this prevent us from using RET in the common case?  I'm not
sure it is a huge loss since kernel-to-kernel is relatively rare.

-hpa



Re: [RFC] x86_64: A real proposal for iret-less return to kernel

2014-05-21 Thread Borislav Petkov
On Wed, May 21, 2014 at 03:20:50PM -0700, Andy Lutomirski wrote:
> It's not even scheduled out, right?

Right.

> This should be just like a signal from a failed page fault, I think.

Right, but there is this additional work that needs to be done
(mce_notify_process()) before sending the signal. So you want to do this
after the MCE handler is done but before you return to the process.

> What switch_to?

Nevermind, that was bollocks.

-- 
Regards/Gruss,
Boris.

Sent from a fat crate under my desk. Formatting is fine.
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] x86_64: A real proposal for iret-less return to kernel

2014-05-21 Thread Linus Torvalds
On Thu, May 22, 2014 at 7:25 AM, Andi Kleen  wrote:
>
> I suspect the only case that's really interesting here is interrupting
> idle. Maybe it would be possible to do some fast path in this case only.

Hardware-interrupts during kernel are actually fairly common under
network-intensive loads, even outside of idle (but idle is admittedly
likely *the* most common one). Many network loads are fairly
kernel-intensive.

Also, from a kernel perspective, idle isn't really any different from
most other kernel code. Using "ret" to return to the idle handler
would be *more* of a special case than using "ret" to return to just
generic kernel context.

So I disagree vehemently. Do *not* special-case idle. It makes the
code more complex and less generic.

Linus


RE: [RFC] x86_64: A real proposal for iret-less return to kernel

2014-05-21 Thread Luck, Tony
>> The recovery path has to do more than just send a signal - it needs to
>> walk processes and "mm"s to see which have mapped the physical address
>> that the h/w told us has gone bad.
>
> I still feel like I'm missing something.  If we interrupted user space
> code, then the context we're in should be identical to the context
> we'll get when we're about to return to userspace.

True. And this far along in do_machine_check() we have set all the other cpus
free, so they are heading back to whatever context we interrupted them in. So
we might be able to do all that other stuff inline here ... we interrupted user
mode, so we know we don't hold any locks. Other cpus are running, so they can
complete what they are doing to release any locks we might need.

But it will take a while (to scan all those processes). And we haven't yet
cleared MCG_STATUS ... so another machine check before we do that
would be fatal (x86 doesn't allow nesting).  Even if we moved the work
after the clear of MCG_STATUS we'd still be vulnerable to a new machine
check on x86_64 because we are sitting on the one & only machine check
stack.

-Tony


Re: [RFC] x86_64: A real proposal for iret-less return to kernel

2014-05-21 Thread Andy Lutomirski
On Wed, May 21, 2014 at 3:25 PM, Andi Kleen  wrote:
>
> Seems like a lot of effort and risk to essentially only optimize in kernel
> interrupt handlers.

The idea is that it might allow us to remove a bunch of scary nested
NMI code as well as speeding things up.

>
> AFAIK the most interesting cases (like user page faults) are not
> affected at all. Usually most workloads don't spend all that much time
> in the kernel, so it won't help most interrupts.
>
> I suspect the only case that's really interesting here is interrupting
> idle. Maybe it would be possible to do some fast path in this case only.
>
> However idle currently has so much overhead that I suspect that there
> are lower hanging fruit elsewhere.

I will gladly buy a meal or beverage for whomever fixes the ttwu stuff
to stop sending IPIs to idle CPUs, which will help a lot.

--Andy


Re: [RFC] x86_64: A real proposal for iret-less return to kernel

2014-05-21 Thread Andi Kleen

Seems like a lot of effort and risk to essentially only optimize in kernel
interrupt handlers.

AFAIK the most interesting cases (like user page faults) are not
affected at all. Usually most workloads don't spend all that much time
in the kernel, so it won't help most interrupts.

I suspect the only case that's really interesting here is interrupting
idle. Maybe it would be possible to do some fast path in this case only.

However idle currently has so much overhead that I suspect that there 
are lower hanging fruit elsewhere.

-Andi

-- 
a...@linux.intel.com -- Speaking for myself only.


Re: [RFC] x86_64: A real proposal for iret-less return to kernel

2014-05-21 Thread Andy Lutomirski
On Wed, May 21, 2014 at 3:18 PM, Luck, Tony  wrote:
>>> That TIF_MCE_NOTIFY prevents the return to user mode, and we end up in 
>>> mce_notify_process().
>>
>> Why is this necessary?
>
> The recovery path has to do more than just send a signal - it needs to
> walk processes and "mm"s to see which have mapped the physical address
> that the h/w told us has gone bad.

I still feel like I'm missing something.  If we interrupted user space
code, then the context we're in should be identical to the context
we'll get when we're about to return to userspace.

--Andy


Re: [RFC] x86_64: A real proposal for iret-less return to kernel

2014-05-21 Thread Borislav Petkov
On Wed, May 21, 2014 at 03:13:54PM -0700, H. Peter Anvin wrote:
> Seems like a comment would be in order, though.

Sure, I'll do a nice one once this discussion quiets down. :-)

-- 
Regards/Gruss,
Boris.

Sent from a fat crate under my desk. Formatting is fine.


RE: [RFC] x86_64: A real proposal for iret-less return to kernel

2014-05-21 Thread Luck, Tony
>> That TIF_MCE_NOTIFY prevents the return to user mode, and we end up in 
>> mce_notify_process().
>
> Why is this necessary?

The recovery path has to do more than just send a signal - it needs to
walk processes and "mm"s to see which have mapped the physical address
that the h/w told us has gone bad.

-Tony


Re: [RFC] x86_64: A real proposal for iret-less return to kernel

2014-05-21 Thread Andy Lutomirski
On Wed, May 21, 2014 at 3:17 PM, Borislav Petkov  wrote:
> On Wed, May 21, 2014 at 03:13:16PM -0700, Andy Lutomirski wrote:
>> Why is this necessary?
>>
>> If the MCE hit kernel code, then we're going to die anyway.  If the
>> MCE hit user code, then we should be in a completely sensible context
>> and we can just send the signal.
>
> Are we guaranteed that the first thing the process will execute when
> scheduled back in are the signal handlers?

It's not even scheduled out, right?  This should be just like a signal
from a failed page fault, I think.

>
> And besides, maybe we don't even want to allow to do the switch_to() but
> kill it while it is sleeping.

What switch_to?

--Andy


Re: [RFC] x86_64: A real proposal for iret-less return to kernel

2014-05-21 Thread Borislav Petkov
On Wed, May 21, 2014 at 03:13:16PM -0700, Andy Lutomirski wrote:
> Why is this necessary?
> 
> If the MCE hit kernel code, then we're going to die anyway.  If the
> MCE hit user code, then we should be in a completely sensible context
> and we can just send the signal.

Are we guaranteed that the first thing the process will execute when
scheduled back in are the signal handlers?

And besides, maybe we don't even want to allow to do the switch_to() but
kill it while it is sleeping.

(I know, we're that nasty :-))

-- 
Regards/Gruss,
Boris.

Sent from a fat crate under my desk. Formatting is fine.


Re: [RFC] x86_64: A real proposal for iret-less return to kernel

2014-05-21 Thread H. Peter Anvin
On 05/21/2014 03:11 PM, Borislav Petkov wrote:
> On Wed, May 21, 2014 at 03:00:18PM -0700, H. Peter Anvin wrote:
>> So this is not a problem then.
> 
> Yeah, f'get it - it is all good at that front. :-)
> 

Seems like a comment would be in order, though.

-hpa



Re: [RFC] x86_64: A real proposal for iret-less return to kernel

2014-05-21 Thread Andy Lutomirski
On Wed, May 21, 2014 at 3:01 PM, Luck, Tony  wrote:
>> But sending signals from #MC context is definitely a bad idea. I think
>> we had addressed this with irq_work at some point but my memory is very
>> hazy.
>
> We added code for recoverable errors to get out of the MC context
> before trying to lookup the page and send the signal.  Bottom of
> do_machine_check():
>
>     if (cfg->tolerant < 3) {
>         if (no_way_out)
>             mce_panic("Fatal machine check on current CPU", &m, msg);
>         if (worst == MCE_AR_SEVERITY) {
>             /* schedule action before return to userland */
>             mce_save_info(m.addr, m.mcgstatus & MCG_STATUS_RIPV);
>             set_thread_flag(TIF_MCE_NOTIFY);
>         } else if (kill_it) {
>             force_sig(SIGBUS, current);
>         }
>     }
>
> That TIF_MCE_NOTIFY prevents the return to user mode, and we end up in 
> mce_notify_process().

Why is this necessary?

If the MCE hit kernel code, then we're going to die anyway.  If the
MCE hit user code, then we should be in a completely sensible context
and we can just send the signal.

--Andy

>
> The "force_sig()" there is legacy code - and perhaps should just move off
> to mce_notify_process() too (need to save "worst" so it will know what to
> do).
>
> -Tony



-- 
Andy Lutomirski
AMA Capital Management, LLC


Re: [RFC] x86_64: A real proposal for iret-less return to kernel

2014-05-21 Thread Borislav Petkov
On Wed, May 21, 2014 at 03:00:18PM -0700, H. Peter Anvin wrote:
> So this is not a problem then.

Yeah, f'get it - it is all good at that front. :-)

-- 
Regards/Gruss,
Boris.

Sent from a fat crate under my desk. Formatting is fine.


RE: [RFC] x86_64: A real proposal for iret-less return to kernel

2014-05-21 Thread Luck, Tony
> But sending signals from #MC context is definitely a bad idea. I think
> we had addressed this with irq_work at some point but my memory is very
> hazy.

We added code for recoverable errors to get out of the MC context
before trying to lookup the page and send the signal.  Bottom of
do_machine_check():

    if (cfg->tolerant < 3) {
        if (no_way_out)
            mce_panic("Fatal machine check on current CPU", &m, msg);
        if (worst == MCE_AR_SEVERITY) {
            /* schedule action before return to userland */
            mce_save_info(m.addr, m.mcgstatus & MCG_STATUS_RIPV);
            set_thread_flag(TIF_MCE_NOTIFY);
        } else if (kill_it) {
            force_sig(SIGBUS, current);
        }
    }

That TIF_MCE_NOTIFY prevents the return to user mode, and we end up in 
mce_notify_process().

The "force_sig()" there is legacy code - and perhaps should just move off
to mce_notify_process() too (need to save "worst" so it will know what to
do).
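
The deferral pattern described above (record the facts in #MC context, do the heavy lifting on the way back to user mode) can be modelled in a few lines of plain C. This is only a user-space sketch, not the kernel's code: `struct fake_task`, `fake_mc_handler()` and `fake_exit_to_user()` are hypothetical names, with `mce_notify` standing in for TIF_MCE_NOTIFY and the saved fields standing in for what mce_save_info() records.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-task state; not a real kernel structure. */
struct fake_task {
    bool     mce_notify;    /* plays the role of TIF_MCE_NOTIFY */
    uint64_t bad_addr;      /* physical address reported by the hardware */
    bool     restartable;   /* plays the role of the RIPV bit */
    int      pages_handled; /* counts deferred recoveries, for illustration */
};

/* Runs in "#MC context": only records facts, takes no locks, sends no signals. */
static void fake_mc_handler(struct fake_task *t, uint64_t addr, bool ripv)
{
    t->bad_addr = addr;
    t->restartable = ripv;
    t->mce_notify = true;
}

/* Runs on the way back to user mode, in ordinary process context. */
static void fake_exit_to_user(struct fake_task *t)
{
    if (!t->mce_notify)
        return;
    t->mce_notify = false;
    /* The expensive work (walking mms, sending the signal) would go here. */
    t->pages_handled++;
}
```

The point of the split is that the second function runs where sleeping and taking locks are allowed, so nothing in the first one has to.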

-Tony


Re: [RFC] x86_64: A real proposal for iret-less return to kernel

2014-05-21 Thread H. Peter Anvin
On 05/21/2014 02:54 PM, Borislav Petkov wrote:
> On Wed, May 21, 2014 at 02:47:03PM -0700, Andy Lutomirski wrote:
>> Is there anything actually wrong with just panicking if
>> !user_mode_vm(regs)?  That would make this a lot more sane.
> 
> It does that already - mce_severity().
> 

So this is not a problem then.

-hpa



Re: [RFC] x86_64: A real proposal for iret-less return to kernel

2014-05-21 Thread Andy Lutomirski
On Wed, May 21, 2014 at 2:55 PM, Borislav Petkov  wrote:
> On Wed, May 21, 2014 at 02:52:55PM -0700, Andy Lutomirski wrote:
>> Why is it a problem if user_mode_vm(regs)?  Conversely, why is sending
>> a signal a remotely reasonable thing to do if !user_mode_vm(regs)?
>
> Let me quote Jiri:
>
> (1) task sends signal to itself
> (2) it acquires sighand->siglock so that it's able to queue the signal
> (3) MCE triggers

...and !user_mode_vm(regs), and hence we're IN_KERNEL, and we should
presumably just panic instead of trying to send a signal.

I missed the IN_KERNEL thing because I didn't realize that ->cs was
copied to struct mce.

--Andy


Re: [RFC] x86_64: A real proposal for iret-less return to kernel

2014-05-21 Thread Jiri Kosina
On Wed, 21 May 2014, Borislav Petkov wrote:

> > Why is it a problem if user_mode_vm(regs)?  Conversely, why is sending
> > a signal a remotely reasonable thing to do if !user_mode_vm(regs)?
> 
> Let me quote Jiri:
> 
> (1) task sends signal to itself
> (2) it acquires sighand->siglock so that it's able to queue the signal
> (3) MCE triggers
> (4) it tries to send a signal to the same task
> (5) it tries to acquire sighand->siglock and loops forever

Ah, alright, but due to what mce_severity() does, this can't happen,
because if the current CPU is in the kernel (which is obviously implied by
holding a spinlock), it never proceeds to sending the signal, because
no_way_out gets set and mce_panic() is invoked.
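
As a rough illustration of that argument, here is a user-space sketch of the decision. It is deliberately simplified and is not what mce_severity() actually does (the real code grades a table of severities); `struct fake_regs`, `fake_user_mode()` and `classify_mce()` are hypothetical names, and the check only looks at the privilege bits of CS.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical outcomes; the real kernel has more severity levels. */
enum mce_action { MCE_ACT_PANIC, MCE_ACT_SIGNAL };

/* Hypothetical, simplified stand-in for the saved register state. */
struct fake_regs {
    uint16_t cs;   /* code segment selector at the time of the #MC */
};

/* The low two bits of CS hold the privilege level: 3 means user mode. */
static bool fake_user_mode(const struct fake_regs *regs)
{
    return (regs->cs & 3) == 3;
}

/*
 * If the #MC interrupted kernel code, the CPU may be holding a spinlock
 * such as sighand->siglock, so signalling is never attempted.
 */
static enum mce_action classify_mce(const struct fake_regs *regs)
{
    if (!fake_user_mode(regs))
        return MCE_ACT_PANIC;
    return MCE_ACT_SIGNAL;
}
```

Because a task can only hold a spinlock while in kernel mode, steps (4) and (5) of the quoted scenario are unreachable under this rule.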

-- 
Jiri Kosina
SUSE Labs


Re: [RFC] x86_64: A real proposal for iret-less return to kernel

2014-05-21 Thread Borislav Petkov
On Wed, May 21, 2014 at 02:52:55PM -0700, Andy Lutomirski wrote:
> Why is it a problem if user_mode_vm(regs)?  Conversely, why is sending
> a signal a remotely reasonable thing to do if !user_mode_vm(regs)?

Let me quote Jiri:

(1) task sends signal to itself
(2) it acquires sighand->siglock so that it's able to queue the signal
(3) MCE triggers
(4) it tries to send a signal to the same task
(5) it tries to acquire sighand->siglock and loops forever

-- 
Regards/Gruss,
Boris.

Sent from a fat crate under my desk. Formatting is fine.


Re: [RFC] x86_64: A real proposal for iret-less return to kernel

2014-05-21 Thread Borislav Petkov
On Wed, May 21, 2014 at 02:47:03PM -0700, Andy Lutomirski wrote:
> Is there anything actually wrong with just panicking if
> !user_mode_vm(regs)?  That would make this a lot more sane.

It does that already - mce_severity().

-- 
Regards/Gruss,
Boris.

Sent from a fat crate under my desk. Formatting is fine.


Re: [RFC] x86_64: A real proposal for iret-less return to kernel

2014-05-21 Thread Andy Lutomirski
On Wed, May 21, 2014 at 2:48 PM, Borislav Petkov  wrote:
> On Wed, May 21, 2014 at 02:35:59PM -0700, Andy Lutomirski wrote:
>> If RIPV is set but we interrupted *kernel* code, SIGBUS doesn't seem
>> like the right solution anyway.
>>
>> Are there any machine check exceptions for which it makes sense to
>> continue right where we left off without a signal?  Is CMIC such a
>> beast?  Can CMIC be delivered when interrupts are off?
>
> I think you mean CMCI and that's not even reported with a MCE exception
> - there's a separate APIC interrupt for that.
>
> I think this signal thing is for killing processes which have poisoned
> memory but this memory can be contained within that process and the
> physical page frame can be poisoned so that it doesn't get used ever
> again. In any case, this is an example for an uncorrectable error which
> needs action from us but doesn't necessarily have to kill the whole
> machine.
>
> This is supposed to be more graceful instead of consuming the corrupted
> data and sending it out to disk.
>
> But sending signals from #MC context is definitely a bad idea. I think
> we had addressed this with irq_work at some point but my memory is very
> hazy.

Why is it a problem if user_mode_vm(regs)?  Conversely, why is sending
a signal a remotely reasonable thing to do if !user_mode_vm(regs)?

--Andy


Re: [RFC] x86_64: A real proposal for iret-less return to kernel

2014-05-21 Thread Jiri Kosina
On Wed, 21 May 2014, H. Peter Anvin wrote:

> > I certainly like this way of handling it. We can even issue a nice
> > banner saying something like "You're f*cked - go change hw."
> 
> Actually, it would be a lot better to panic than deadlock (HA systems
> tend to have something in place to catch the panic and/or reboot).  Any
> way we can see if the CPU is already holding that lock and panic in that
> case?

Well, spin_trylock() and then either spin_unlock() and proceed sending 
the signal, otherwise panic().
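
A minimal user-space sketch of that try-lock idea, using a C11 atomic_flag as a hypothetical stand-in for sighand->siglock (none of these helper names exist in the kernel):

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for sighand->siglock, modelled as a test-and-set lock. */
static atomic_flag fake_siglock = ATOMIC_FLAG_INIT;

static bool fake_trylock(atomic_flag *lock)
{
    /* test_and_set returns the previous value: false means we got the lock. */
    return !atomic_flag_test_and_set_explicit(lock, memory_order_acquire);
}

static void fake_unlock(atomic_flag *lock)
{
    atomic_flag_clear_explicit(lock, memory_order_release);
}

/* Returns true if signalling looks safe, false if the caller should panic. */
static bool siglock_is_free(void)
{
    if (!fake_trylock(&fake_siglock))
        return false;            /* held by someone, possibly us: do not spin */
    fake_unlock(&fake_siglock);  /* it was free; release and go send the signal */
    return true;
}
```

Note the trade-off: a failed try-lock cannot tell "this CPU already holds it" apart from "another CPU holds it briefly", so this sketch would also panic in the second, harmless case.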

-- 
Jiri Kosina
SUSE Labs


Re: [RFC] x86_64: A real proposal for iret-less return to kernel

2014-05-21 Thread Borislav Petkov
On Wed, May 21, 2014 at 02:35:59PM -0700, Andy Lutomirski wrote:
> If RIPV is set but we interrupted *kernel* code, SIGBUS doesn't seem
> like the right solution anyway.
>
> Are there any machine check exceptions for which it makes sense to
> continue right where we left off without a signal?  Is CMIC such a
> beast?  Can CMIC be delivered when interrupts are off?

I think you mean CMCI and that's not even reported with a MCE exception
- there's a separate APIC interrupt for that.

I think this signal thing is for killing processes which have poisoned
memory but this memory can be contained within that process and the
physical page frame can be poisoned so that it doesn't get used ever
again. In any case, this is an example for an uncorrectable error which
needs action from us but doesn't necessarily have to kill the whole
machine.

This is supposed to be more graceful instead of consuming the corrupted
data and sending it out to disk.

But sending signals from #MC context is definitely a bad idea. I think
we had addressed this with irq_work at some point but my memory is very
hazy.

@Tony: this is something we need to take a look at soonish.

-- 
Regards/Gruss,
Boris.

Sent from a fat crate under my desk. Formatting is fine.


Re: [RFC] x86_64: A real proposal for iret-less return to kernel

2014-05-21 Thread Andy Lutomirski
On Wed, May 21, 2014 at 2:45 PM, H. Peter Anvin  wrote:
> Adding Tony.
>
> On 05/21/2014 02:43 PM, Borislav Petkov wrote:
>> On Thu, May 22, 2014 at 06:37:26AM +0900, Linus Torvalds wrote:
>>> Seriously. If an NMI is interrupted by an MCE, you might as well
>>> consider the machine dead. Don't worry about it. We may or may not
>>> recover, but it is *not* our problem.
>>
>> I certainly like this way of handling it. We can even issue a nice
>> banner saying something like "You're f*cked - go change hw."
>>
>
> Actually, it would be a lot better to panic than deadlock (HA systems
> tend to have something in place to catch the panic and/or reboot).  Any
> way we can see if the CPU is already holding that lock and panic in that
> case?
>

Is there anything actually wrong with just panicking if
!user_mode_vm(regs)?  That would make this a lot more sane.

--Andy


Re: [RFC] x86_64: A real proposal for iret-less return to kernel

2014-05-21 Thread H. Peter Anvin
Adding Tony.

On 05/21/2014 02:43 PM, Borislav Petkov wrote:
> On Thu, May 22, 2014 at 06:37:26AM +0900, Linus Torvalds wrote:
>> Seriously. If an NMI is interrupted by an MCE, you might as well
>> consider the machine dead. Don't worry about it. We may or may not
>> recover, but it is *not* our problem.
> 
> I certainly like this way of handling it. We can even issue a nice
> banner saying something like "You're f*cked - go change hw."
> 

Actually, it would be a lot better to panic than deadlock (HA systems
tend to have something in place to catch the panic and/or reboot).  Any
way we can see if the CPU is already holding that lock and panic in that
case?

-hpa




Re: [RFC] x86_64: A real proposal for iret-less return to kernel

2014-05-21 Thread Borislav Petkov
On Thu, May 22, 2014 at 06:37:26AM +0900, Linus Torvalds wrote:
> Seriously. If an NMI is interrupted by an MCE, you might as well
> consider the machine dead. Don't worry about it. We may or may not
> recover, but it is *not* our problem.

I certainly like this way of handling it. We can even issue a nice
banner saying something like "You're f*cked - go change hw."

:-)

-- 
Regards/Gruss,
Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

Re: [RFC] x86_64: A real proposal for iret-less return to kernel

2014-05-21 Thread Linus Torvalds
On Thu, May 22, 2014 at 6:25 AM, Jiri Kosina  wrote:
>
> Yeah, but it tries to send SIGBUS from MCE context. And if MCE triggered
> at the time the CPU was already holding sighand->siglock for that
> particular task, it'll deadlock against itself.

Don't worry too much about the MCE's. The hardware is f*cking broken,
and nobody sane ever thought that synchronous MCE's were a good idea.

Proof: look at Itanium.

The truly nonmaskable synchronous MCE's are a fatal error. It's that
simple. Anybody who thinks anything else is simply wrong, and has
probably talked to too many hardware engineers that don't actually
understand the bigger picture.

Sane hardware handles anything that *can* be handled in hardware, and
then reports (later) to software about the errors with a regular
non-critical MCE that doesn't punch through NMI or even regular
interrupt disabling.

So the true "MCE punches through even NMI protection" case is
relegated purely to the "hardware is broken and needs to be replaced"
situation, and our only worry as kernel people is to try to be as
graceful as possible about it - but that "as graceful as possible"
does *not* include bending over and worrying about random possible
deadlocks or other crazy situations. It's purely a "best effort" kind
of thing where we try to do whatever logging etc that is easy to do.

Seriously. If an NMI is interrupted by an MCE, you might as well
consider the machine dead. Don't worry about it. We may or may not
recover, but it is *not* our problem.

Linus


Re: [RFC] x86_64: A real proposal for iret-less return to kernel

2014-05-21 Thread Andy Lutomirski
On Wed, May 21, 2014 at 2:25 PM, Jiri Kosina  wrote:
> On Wed, 21 May 2014, Borislav Petkov wrote:
>
>> > ISTM the do_machine_check code ought to consider any kill-worthy MCE
>> > from kernel space to be non-recoverable, but I want to keep the scope
>> > of these patches under control.
>>
>> MCA has a bit called RIPV which, if set, signals that RIP is valid and
>> it is safe to return provided we've taken proper care of handling even
>> non-correctable errors (memory poisoning, etc).
>
> Yeah, but it tries to send SIGBUS from MCE context. And if MCE triggered
> at the time the CPU was already holding sighand->siglock for that
> particular task, it'll deadlock against itself.
>

If RIPV is set but we interrupted *kernel* code, SIGBUS doesn't seem
like the right solution anyway.

Are there any machine check exceptions for which it makes sense to
continue right where we left off without a signal?  Is CMIC such a
beast?  Can CMIC be delivered when interrupts are off?

--Andy


Re: [RFC] x86_64: A real proposal for iret-less return to kernel

2014-05-21 Thread Jiri Kosina
On Wed, 21 May 2014, Borislav Petkov wrote:

> > ISTM the do_machine_check code ought to consider any kill-worthy MCE
> > from kernel space to be non-recoverable, but I want to keep the scope
> > of these patches under control.
> 
> MCA has a bit called RIPV which, if set, signals that RIP is valid and
> it is safe to return provided we've taken proper care of handling even
> non-correctable errors (memory poisoning, etc).

Yeah, but it tries to send SIGBUS from MCE context. And if MCE triggered 
at the time the CPU was already holding sighand->siglock for that 
particular task, it'll deadlock against itself.

-- 
Jiri Kosina
SUSE Labs


Re: [RFC] x86_64: A real proposal for iret-less return to kernel

2014-05-21 Thread Andy Lutomirski
On Tue, May 20, 2014 at 5:53 PM, Andy Lutomirski  wrote:
> Here's a real proposal for iret-less return.  If this is correct, then
> NMIs will never nest, which will probably delete a lot more scariness
> than is added by the code I'm describing.

OK, here's a case where I'm wrong.  An NMI interrupts userspace on a
16-bit stack.  The return from NMI goes through the espfix code.
Something interrupts while on the espfix stack.  Boom!  Neither return
style is particularly good.

More generally, if we got interrupted while on the espfix stack, we
need to return back there using IRET.  Fortunately, re-enabling NMIs
there is harmless, since we've already switched off the NMI stack.

This makes me think that maybe the logic should be turned around: have
some RIP ranges on which the kernel stack might be invalid (which
includes the espfix code and some of the syscall code) and use IRET
only on return from NMI, return to nonstandard CS, and return to these
special ranges.  The NMI code just needs to never do any of this stuff
unless it switches off the NMI stack first.

For this to work reliably, we'll probably have to change CS before
calling into EFI code.  That should be straightforward.
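
The turned-around rule above can be sketched as a single predicate. This is a hedged illustration only: the range table, the addresses in it, `FAKE_KERNEL_CS` and `must_use_iret()` are all made up for the example, and a real implementation would live in the assembly return path and take its ranges from linker symbols.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical half-open [start, end) range of code running on a fragile stack. */
struct rip_range {
    uint64_t start;
    uint64_t end;
};

/* Made-up addresses, purely for illustration. */
static const struct rip_range fragile_ranges[] = {
    { 0x1000, 0x1040 },   /* e.g. the espfix code */
    { 0x2000, 0x2020 },   /* e.g. syscall code before the stack is valid */
};

#define FAKE_KERNEL_CS 0x10   /* assumed "standard" kernel code segment */

static bool must_use_iret(uint64_t rip, uint16_t cs, bool returning_from_nmi)
{
    size_t n = sizeof(fragile_ranges) / sizeof(fragile_ranges[0]);

    if (returning_from_nmi)
        return true;                 /* IRET is what unblocks NMIs */
    if (cs != FAKE_KERNEL_CS)
        return true;                 /* return to a nonstandard CS */
    for (size_t i = 0; i < n; i++) {
        if (rip >= fragile_ranges[i].start && rip < fragile_ranges[i].end)
            return true;             /* the kernel stack may be invalid here */
    }
    return false;                    /* the cheaper RET-style return is allowed */
}
```

Everything that falls through to the final `return false` is the common kernel-to-kernel case that the proposal wants to make cheap.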

--Andy


Re: [RFC] x86_64: A real proposal for iret-less return to kernel

2014-05-21 Thread Borislav Petkov
On Wed, May 21, 2014 at 10:52:01AM -0700, Andy Lutomirski wrote:
> I think that some of these exceptions are synchronous things (e.g.
> int3 or page faults) that happen because the kernel caused them.
> 
> Anyway, going through the list:
> 
> Reset, INIT, and stpclk ought to be irrelevant -- we don't handle them anyway.

Yeah.

> SMI is already supposedly correct wrt nesting inside NMI.

It better be. :)

> Debug register stuff should be handled in my outline.  Hopefully
> correctly :)  We need to make sure that no breakpoints trip before the
> nmi count is incremented, but that should be straightforward as long
> as we don't do ridiculous things like poking at userspace addresses.
> I don't know how kgdb/kdb fits in -- if someone sets a watchpoint on a
> kernel address (e.g. the nesting count) or enables single-stepping,
> we'll mess up.
> 
> 
> It may pay to bump the nesting count inside the #DB and #BP handlers
> and to check the RIP that we're returning to,

Right, at a first glance, all those higher prio exceptions' nesting
count could be nicely dealt with in those paranoidzeroentry* macros.

> but that starts to look ugly, and we have to be careful about NMI,
> immediate breakpoint, and then immediate MCE.

Btw, hpa just confirmed that exceptions are never deferred and thus can
happen while the NMI handler runs. Which means, we should defensively
prepare for NMI handlers being interrupted at any point.

> I'd rather just be able to say that there are some very short windows
> in which a debug or breakpoint exception will never happen.

Sounds perfectly fine to me.

-- 
Regards/Gruss,
Boris.

Sent from a fat crate under my desk. Formatting is fine.


Re: [RFC] x86_64: A real proposal for iret-less return to kernel

2014-05-21 Thread Andy Lutomirski
On Wed, May 21, 2014 at 9:30 AM, Borislav Petkov  wrote:
> On Wed, May 21, 2014 at 08:21:08AM -0700, Andy Lutomirski wrote:
>> On May 21, 2014 2:46 AM, "Borislav Petkov"  wrote:
>> >
>> > On Tue, May 20, 2014 at 07:39:31PM -0700, Andy Lutomirski wrote:
>> > > So the issue here is that we can have an NMI followed immediately by
>> > > an MCE.
>> >
>> > That part might need clarification for me: #MC is higher prio interrupt
>> > than NMI so a machine check exception can interrupt the NMI handler at
>> > any point.
>>
>> Except that NMI can interrupt #MC at any point as well, I think.
>
> No, #MC is higher prio than NMI, actually even the highest along with
> RESET#. And come to think of it, all exceptions which have a higher prio
> than NMI should touch that nmi_mce_nest_count thing.
>
> See Table 8-8 here:
>
> http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2012/10/24593_APM_v21.pdf
>
> That's the prios before 3, i.e. the NMI one.
>
> HOWEVER, this all is spoken with the assumption that higher prio
> interrupts can interrupt the NMI handler too at the first instruction
> boundary they've been recognized.
>
> The text is talking about simultaneous interrupts and not about
> interrupt handler preemption.
>
> But it must be because Steve wouldn't be dealing with exceptions in the
> NMI handler and nested NMIs otherwise...

I think that some of these exceptions are synchronous things (e.g.
int3 or page faults) that happen because the kernel caused them.

Anyway, going through the list:

Reset, INIT, and stpclk ought to be irrelevant -- we don't handle them anyway.

SMI is already supposedly correct wrt nesting inside NMI.

Debug register stuff should be handled in my outline.  Hopefully
correctly :)  We need to make sure that no breakpoints trip before the
nmi count is incremented, but that should be straightforward as long
as we don't do ridiculous things like poking at userspace addresses.
I don't know how kgdb/kdb fits in -- if someone sets a watchpoint on a
kernel address (e.g. the nesting count) or enables single-stepping,
we'll mess up.


It may pay to bump the nesting count inside the #DB and #BP handlers
and to check the RIP that we're returning to, but that starts to look
ugly, and we have to be careful about NMI, immediate breakpoint, and
then immediate MCE.  I'd rather just be able to say that there are
some very short windows in which a debug or breakpoint exception will
never happen.

--Andy


Re: [RFC] x86_64: A real proposal for iret-less return to kernel

2014-05-21 Thread Borislav Petkov
On Wed, May 21, 2014 at 08:21:57AM -0700, Andy Lutomirski wrote:
> ISTM the do_machine_check code ought to consider any kill-worthy MCE
> from kernel space to be non-recoverable, but I want to keep the scope
> of these patches under control.

MCA has a bit called RIPV which, if set, signals that RIP is valid and
it is safe to return provided we've taken proper care of handling even
non-correctable errors (memory poisoning, etc).

If RIPV is not set, we panic anyway.
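
A minimal user-space sketch of that check, for illustration only. The bit
positions follow the architectural layout of the MCG_STATUS MSR (RIPV is
bit 0, EIPV bit 1, MCIP bit 2); the function and enum names are made up
here, and this is not the actual do_machine_check() logic.

```c
#include <assert.h>
#include <stdint.h>

/* Architectural bits in the IA32_MCG_STATUS MSR. */
#define MCG_STATUS_RIPV (1ULL << 0) /* restart IP valid */
#define MCG_STATUS_EIPV (1ULL << 1) /* error IP valid */
#define MCG_STATUS_MCIP (1ULL << 2) /* machine check in progress */

/* Hypothetical outcome type, for illustration only. */
enum mce_outcome { MCE_PANIC, MCE_MAY_RETURN };

/*
 * Toy decision: without RIPV the saved RIP cannot be trusted, so there
 * is nowhere safe to return to. With RIPV set, returning is possible
 * once the error itself has been dealt with (e.g. the page poisoned).
 */
static enum mce_outcome toy_mce_outcome(uint64_t mcg_status, int error_handled)
{
    if (!(mcg_status & MCG_STATUS_RIPV))
        return MCE_PANIC;
    return error_handled ? MCE_MAY_RETURN : MCE_PANIC;
}
```

The error_handled flag stands in for whatever recovery the handler
managed to do; how that is decided is outside this sketch.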

> That being said, if an MCE that came from CPL0 never tried to return,
> this would be simpler.  I don't know enough about the machine check
> architecture to know whether that's a reasonable thing to do.

Yeah, there are cases where MCE can return, see above.

-- 
Regards/Gruss,
Boris.

Sent from a fat crate under my desk. Formatting is fine.
--


Re: [RFC] x86_64: A real proposal for iret-less return to kernel

2014-05-21 Thread Borislav Petkov
On Wed, May 21, 2014 at 08:21:08AM -0700, Andy Lutomirski wrote:
> On May 21, 2014 2:46 AM, "Borislav Petkov"  wrote:
> >
> > On Tue, May 20, 2014 at 07:39:31PM -0700, Andy Lutomirski wrote:
> > > So the issue here is that we can have an NMI followed immediately by
> > > an MCE.
> >
> > That part might need clarification for me: #MC is higher prio interrupt
> > than NMI so a machine check exception can interrupt the NMI handler at
> > any point.
> 
> Except that NMI can interrupt #MC at any point as well, I think.

No, #MC is higher prio than NMI, actually even the highest along with
RESET#. And come to think of it, all exceptions which have a higher prio
than NMI should touch that nmi_mce_nest_count thing.

See Table 8-8 here:

http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2012/10/24593_APM_v21.pdf

That's the prios before 3, i.e. the NMI one.

HOWEVER, this all is spoken with the assumption that higher prio
interrupts can interrupt the NMI handler too at the first instruction
boundary they've been recognized.

The text is talking about simultaneous interrupts and not about
interrupt handler preemption.

But it must be because Steve wouldn't be dealing with exceptions in the
NMI handler and nested NMIs otherwise...

Hmmm.

-- 
Regards/Gruss,
Boris.

Sent from a fat crate under my desk. Formatting is fine.
--


Re: [RFC] x86_64: A real proposal for iret-less return to kernel

2014-05-21 Thread Andy Lutomirski
On May 21, 2014 5:51 AM, "Jiri Kosina"  wrote:
>
> On Tue, 20 May 2014, Andy Lutomirski wrote:
>
> > So the issue here is that we can have an NMI followed immediately by
> > an MCE.  The MCE code can call force_sig
>
> This is interesting by itself. force_sig() takes siglock spinlock. This
> really looks like a deadlock sitting there waiting to happen.

ISTM the do_machine_check code ought to consider any kill-worthy MCE
from kernel space to be non-recoverable, but I want to keep the scope
of these patches under control.

That being said, if an MCE that came from CPL0 never tried to return,
this would be simpler.  I don't know enough about the machine check
architecture to know whether that's a reasonable thing to do.

--Andy

>
> --
> Jiri Kosina
> SUSE Labs
>


Re: [RFC] x86_64: A real proposal for iret-less return to kernel

2014-05-21 Thread Andy Lutomirski
On May 21, 2014 2:46 AM, "Borislav Petkov"  wrote:
>
> On Tue, May 20, 2014 at 07:39:31PM -0700, Andy Lutomirski wrote:
> > So the issue here is that we can have an NMI followed immediately by
> > an MCE.
>
> That part might need clarification for me: #MC is higher prio interrupt
> than NMI so a machine check exception can interrupt the NMI handler at
> any point.

Except that NMI can interrupt #MC at any point as well, I think.

>
> But you're talking only about the small window when nmi_mce_nest_count
> hasn't been incremented yet, right? I.e., this:
>
> "The result is that the only interrupt that can happen with
> `nmi_mce_nest_count == 0` in NMI context is an MCE at the beginning or
> end of the NMI handler."
>
> Correct?

Exactly.

>
> --
> Regards/Gruss,
> Boris.
>
> Sent from a fat crate under my desk. Formatting is fine.
> --


Re: [RFC] x86_64: A real proposal for iret-less return to kernel

2014-05-21 Thread Jiri Kosina
On Tue, 20 May 2014, Andy Lutomirski wrote:

> So the issue here is that we can have an NMI followed immediately by
> an MCE.  The MCE code can call force_sig

This is interesting by itself. force_sig() takes siglock spinlock. This 
really looks like a deadlock sitting there waiting to happen.

-- 
Jiri Kosina
SUSE Labs



Re: [RFC] x86_64: A real proposal for iret-less return to kernel

2014-05-21 Thread Borislav Petkov
On Tue, May 20, 2014 at 07:39:31PM -0700, Andy Lutomirski wrote:
> So the issue here is that we can have an NMI followed immediately by
> an MCE.

That part might need clarification for me: #MC is higher prio interrupt
than NMI so a machine check exception can interrupt the NMI handler at
any point.

But you're talking only about the small window when nmi_mce_nest_count
hasn't been incremented yet, right? I.e., this:

"The result is that the only interrupt that can happen with
`nmi_mce_nest_count == 0` in NMI context is an MCE at the beginning or
end of the NMI handler."

Correct?
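
To make the window concrete, here is a toy model. It assumes only that
the counter is bumped as one of the first steps of the handler and
dropped as one of the last; the handler body and the probe function are
hypothetical, and only the name nmi_mce_nest_count comes from the thread.

```c
#include <assert.h>

/* Toy stand-in for the per-CPU counter discussed in the thread. */
static int nmi_mce_nest_count;

/* What a nested MCE would observe at three points in the handler. */
static int seen_at_entry, seen_in_body, seen_at_exit;

/* Pretend an MCE arrives right now and reads the counter. */
static int toy_mce_observes(void)
{
    return nmi_mce_nest_count;
}

static void toy_nmi_handler(void)
{
    seen_at_entry = toy_mce_observes(); /* window: not yet incremented */
    nmi_mce_nest_count++;
    seen_in_body = toy_mce_observes();  /* protected: count is nonzero */
    nmi_mce_nest_count--;
    seen_at_exit = toy_mce_observes();  /* window: already decremented */
}
```

After one call to toy_nmi_handler(), the entry and exit probes have seen
0 and the body probe has seen 1, which is the "beginning or end of the
NMI handler" case from the quoted sentence.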

-- 
Regards/Gruss,
Boris.

Sent from a fat crate under my desk. Formatting is fine.
--


Re: [RFC] x86_64: A real proposal for iret-less return to kernel

2014-05-21 Thread Andy Lutomirski
On Tue, May 20, 2014 at 5:53 PM, Andy Lutomirski <l...@amacapital.net> wrote:
> Here's a real proposal for iret-less return.  If this is correct, then
> NMIs will never nest, which will probably delete a lot more scariness
> than is added by the code I'm describing.

OK, here's a case where I'm wrong.  An NMI interrupts userspace on a
16-bit stack.  The return from NMI goes through the espfix code.
Something interrupts while on the espfix stack.  Boom!  Neither return
style is particularly good.

More generally, if we got interrupted while on the espfix stack, we
need to return back there using IRET.  Fortunately, re-enabling NMIs
there is harmless, since we've already switched off the NMI stack.

This makes me think that maybe the logic should be turned around: have
some RIP ranges on which the kernel stack might be invalid (which
includes the espfix code and some of the syscall code) and use IRET
only on return from NMI, return to nonstandard CS, and return to these
special ranges.  The NMI code just needs to never do any of this stuff
unless it switches off the NMI stack first.
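
A rough sketch of what that turned-around check could look like. All
names, the range table, and the addresses are hypothetical; the only
thing taken from the text above is the rule itself: IRET on return from
NMI, on return to a nonstandard CS, and on return into one of the
special RIP ranges.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical register snapshot; field names are illustrative. */
struct toy_regs {
    uint64_t ip;
    uint16_t cs;
};

struct ip_range { uint64_t start, end; }; /* half-open: [start, end) */

/* Made-up addresses standing in for espfix and syscall-entry code. */
static const struct ip_range unsafe_stack_ranges[] = {
    { 0x1000, 0x1100 },
    { 0x2000, 0x2040 },
};

#define TOY_KERNEL_CS 0x10 /* assumed "standard" kernel code segment */

/* Returns true when the return path has to fall back to IRET. */
static bool toy_needs_iret(const struct toy_regs *regs, bool returning_from_nmi)
{
    size_t n = sizeof(unsafe_stack_ranges) / sizeof(unsafe_stack_ranges[0]);

    if (returning_from_nmi)
        return true;
    if (regs->cs != TOY_KERNEL_CS)
        return true;
    for (size_t i = 0; i < n; i++) {
        if (regs->ip >= unsafe_stack_ranges[i].start &&
            regs->ip < unsafe_stack_ranges[i].end)
            return true;
    }
    return false;
}
```

Everything that does not hit one of the three conditions would take the
iret-less path in this model.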

For this to work reliably, we'll probably have to change CS before
calling into EFI code.  That should be straightforward.

--Andy


Re: [RFC] x86_64: A real proposal for iret-less return to kernel

2014-05-21 Thread Jiri Kosina
On Wed, 21 May 2014, Borislav Petkov wrote:

> > ISTM the do_machine_check code ought to consider any kill-worthy MCE
> > from kernel space to be non-recoverable, but I want to keep the scope
> > of these patches under control.
>
> MCA has a bit called RIPV which, if set, signals that RIP is valid and
> it is safe to return provided we've taken proper care of handling even
> non-correctable errors (memory poisoning, etc).

Yeah, but it tries to send SIGBUS from MCE context. And if MCE triggered
at the time the CPU was already holding sighand->siglock for that
particular task, it'll deadlock against itself.

-- 
Jiri Kosina
SUSE Labs


Re: [RFC] x86_64: A real proposal for iret-less return to kernel

2014-05-21 Thread Andy Lutomirski
On Wed, May 21, 2014 at 2:25 PM, Jiri Kosina <jkos...@suse.cz> wrote:
> On Wed, 21 May 2014, Borislav Petkov wrote:
>
> > > ISTM the do_machine_check code ought to consider any kill-worthy MCE
> > > from kernel space to be non-recoverable, but I want to keep the scope
> > > of these patches under control.
> >
> > MCA has a bit called RIPV which, if set, signals that RIP is valid and
> > it is safe to return provided we've taken proper care of handling even
> > non-correctable errors (memory poisoning, etc).
>
> Yeah, but it tries to send SIGBUS from MCE context. And if MCE triggered
> at the time the CPU was already holding sighand->siglock for that
> particular task, it'll deadlock against itself.


If RIPV is set but we interrupted *kernel* code, SIGBUS doesn't seem
like the right solution anyway.

Are there any machine check exceptions for which it makes sense to
continue right where we left off without a signal?  Is CMIC such a
beast?  Can CMIC be delivered when interrupts are off?

--Andy


Re: [RFC] x86_64: A real proposal for iret-less return to kernel

2014-05-21 Thread Linus Torvalds
On Thu, May 22, 2014 at 6:25 AM, Jiri Kosina <jkos...@suse.cz> wrote:
>
> Yeah, but it tries to send SIGBUS from MCE context. And if MCE triggered
> at the time the CPU was already holding sighand->siglock for that
> particular task, it'll deadlock against itself.

Don't worry too much about the MCE's. The hardware is f*cking broken,
and nobody sane ever thought that synchronous MCE's were a good idea.

Proof: look at Itanium.

The truly nonmaskable synchronous MCE's are a fatal error. It's that
simple. Anybody who thinks anything else is simply wrong, and has
probably talked to too many hardware engineers that don't actually
understand the bigger picture.

Sane hardware handles anything that *can* be handled in hardware, and
then reports (later) to software about the errors with a regular
non-critical MCE that doesn't punch through NMI or even regular
interrupt disabling.

So the true "MCE punches through even NMI protection" case is
relegated purely to the "hardware is broken and needs to be replaced"
situation, and our only worry as kernel people is to try to be as
graceful as possible about it - but that "as graceful as possible"
does *not* include bending over and worrying about random possible
deadlocks or other crazy situations. It's purely a "best effort" kind
of thing where we try to do whatever logging etc that is easy to do.

Seriously. If an NMI is interrupted by an MCE, you might as well
consider the machine dead. Don't worry about it. We may or may not
recover, but it is *not* our problem.

Linus


Re: [RFC] x86_64: A real proposal for iret-less return to kernel

2014-05-21 Thread Borislav Petkov
On Thu, May 22, 2014 at 06:37:26AM +0900, Linus Torvalds wrote:
> Seriously. If an NMI is interrupted by an MCE, you might as well
> consider the machine dead. Don't worry about it. We may or may not
> recover, but it is *not* our problem.

I certainly like this way of handling it. We can even issue a nice
banner saying something like "You're f*cked - go change hw."

:-)

-- 
Regards/Gruss,
Boris.

Sent from a fat crate under my desk. Formatting is fine.
--


Re: [RFC] x86_64: A real proposal for iret-less return to kernel

2014-05-21 Thread H. Peter Anvin
Adding Tony.

On 05/21/2014 02:43 PM, Borislav Petkov wrote:
> On Thu, May 22, 2014 at 06:37:26AM +0900, Linus Torvalds wrote:
> > Seriously. If an NMI is interrupted by an MCE, you might as well
> > consider the machine dead. Don't worry about it. We may or may not
> > recover, but it is *not* our problem.
>
> I certainly like this way of handling it. We can even issue a nice
> banner saying something like "You're f*cked - go change hw."
>

Actually, it would be a lot better to panic than deadlock (HA systems
tend to have something in place to catch the panic and/or reboot).  Any
way we can see if the CPU is already holding that lock and panic in that
case?

-hpa




Re: [RFC] x86_64: A real proposal for iret-less return to kernel

2014-05-21 Thread Andy Lutomirski
On Wed, May 21, 2014 at 2:45 PM, H. Peter Anvin <h...@zytor.com> wrote:
> Adding Tony.
>
> On 05/21/2014 02:43 PM, Borislav Petkov wrote:
> > On Thu, May 22, 2014 at 06:37:26AM +0900, Linus Torvalds wrote:
> > > Seriously. If an NMI is interrupted by an MCE, you might as well
> > > consider the machine dead. Don't worry about it. We may or may not
> > > recover, but it is *not* our problem.
> >
> > I certainly like this way of handling it. We can even issue a nice
> > banner saying something like "You're f*cked - go change hw."
> >
>
> Actually, it would be a lot better to panic than deadlock (HA systems
> tend to have something in place to catch the panic and/or reboot).  Any
> way we can see if the CPU is already holding that lock and panic in that
> case?


Is there anything actually wrong with just panicking if
!user_mode_vm(regs)?  That would make this a lot more sane.
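
Spelled out as a toy model: the sketch below treats the low two bits of
the saved CS as the privilege level, which is how a 64-bit user-mode
test works, but the struct and function names are invented for this
example and it is not the kernel's user_mode_vm().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical saved frame; only the saved CS matters here. */
struct toy_frame { uint16_t cs; };

/* The low two bits of CS hold the privilege level; 0 means kernel. */
static bool toy_user_mode(const struct toy_frame *frame)
{
    return (frame->cs & 3) != 0;
}

enum toy_mce_action { TOY_MCE_PANIC, TOY_MCE_KILL_TASK };

/* Kill-worthy MCE: kill the task if it hit user code, else panic. */
static enum toy_mce_action toy_kill_worthy_mce(const struct toy_frame *frame)
{
    return toy_user_mode(frame) ? TOY_MCE_KILL_TASK : TOY_MCE_PANIC;
}
```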

--Andy


Re: [RFC] x86_64: A real proposal for iret-less return to kernel

2014-05-21 Thread Borislav Petkov
On Wed, May 21, 2014 at 02:35:59PM -0700, Andy Lutomirski wrote:
> If RIPV is set but we interrupted *kernel* code, SIGBUS doesn't seem
> like the right solution anyway.
>
> Are there any machine check exceptions for which it makes sense to
> continue right where we left off without a signal?  Is CMIC such a
> beast?  Can CMIC be delivered when interrupts are off?

I think you mean CMCI and that's not even reported with a MCE exception
- there's a separate APIC interrupt for that.

I think this signal thing is for killing processes which have poisoned
memory but this memory can be contained within that process and the
physical page frame can be poisoned so that it doesn't get used ever
again. In any case, this is an example for an uncorrectable error which
needs action from us but doesn't necessarily have to kill the whole
machine.

This is supposed to be more graceful instead of consuming the corrupted
data and sending it out to disk.

But sending signals from #MC context is definitely a bad idea. I think
we had addressed this with irq_work at some point but my memory is very
hazy.
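
The general shape of such a deferral, as a user-space sketch: the
exception context only records what needs doing with a single atomic
operation, and a later context that is allowed to take locks does the
actual work. This models the idea only; it does not use the kernel's
irq_work API, and every name in it is hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* One-slot deferred request: 0 means empty, otherwise a task id. */
static atomic_int pending_kill;
static int last_signalled; /* what the safe context acted on */

/* Called from the toy "#MC context": no locks, one atomic operation. */
static bool toy_mce_defer_kill(int task_id)
{
    int expected = 0;

    /* Fails if a request is already queued and not yet processed. */
    return atomic_compare_exchange_strong(&pending_kill, &expected, task_id);
}

/* Called later from a context where taking locks is fine. */
static void toy_run_deferred(void)
{
    int task_id = atomic_exchange(&pending_kill, 0);

    if (task_id)
        last_signalled = task_id; /* stand-in for sending SIGBUS */
}
```

A single slot is enough for illustration; a real implementation would
need to cope with more than one pending error.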

@Tony: this is something we need to take a look at soonish.

-- 
Regards/Gruss,
Boris.

Sent from a fat crate under my desk. Formatting is fine.
--


Re: [RFC] x86_64: A real proposal for iret-less return to kernel

2014-05-21 Thread Jiri Kosina
On Wed, 21 May 2014, H. Peter Anvin wrote:

> > I certainly like this way of handling it. We can even issue a nice
> > banner saying something like "You're f*cked - go change hw."
>
> Actually, it would be a lot better to panic than deadlock (HA systems
> tend to have something in place to catch the panic and/or reboot).  Any
> way we can see if the CPU is already holding that lock and panic in that
> case?

Well, spin_trylock() and then either spin_unlock() and proceed sending 
the signal, otherwise panic().
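
Roughly, as a user-space sketch. The lock here is a C11 atomic_flag
standing in for siglock, not the kernel's spinlock_t, and the function
returns an action code rather than really signalling or panicking; all
names are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Stand-in for a spinlock such as siglock. */
struct toy_spinlock { atomic_flag locked; };

static bool toy_spin_trylock(struct toy_spinlock *l)
{
    /* test_and_set returns the previous value: false means we got it. */
    return !atomic_flag_test_and_set_explicit(&l->locked,
                                              memory_order_acquire);
}

static void toy_spin_unlock(struct toy_spinlock *l)
{
    atomic_flag_clear_explicit(&l->locked, memory_order_release);
}

enum toy_action { TOY_SEND_SIGNAL, TOY_PANIC };

static enum toy_action toy_mce_signal_or_panic(struct toy_spinlock *siglock)
{
    if (!toy_spin_trylock(siglock))
        return TOY_PANIC;      /* lock already held by someone */
    toy_spin_unlock(siglock);  /* probe only; the sender retakes it */
    return TOY_SEND_SIGNAL;
}
```

Note that a failed trylock cannot tell "held by this CPU" apart from
"held by another CPU", so in this form the check would also panic on
ordinary contention.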

-- 
Jiri Kosina
SUSE Labs


Re: [RFC] x86_64: A real proposal for iret-less return to kernel

2014-05-21 Thread Andy Lutomirski
On Wed, May 21, 2014 at 2:48 PM, Borislav Petkov <b...@alien8.de> wrote:
> On Wed, May 21, 2014 at 02:35:59PM -0700, Andy Lutomirski wrote:
> > If RIPV is set but we interrupted *kernel* code, SIGBUS doesn't seem
> > like the right solution anyway.
> >
> > Are there any machine check exceptions for which it makes sense to
> > continue right where we left off without a signal?  Is CMIC such a
> > beast?  Can CMIC be delivered when interrupts are off?
>
> I think you mean CMCI and that's not even reported with a MCE exception
> - there's a separate APIC interrupt for that.
>
> I think this signal thing is for killing processes which have poisoned
> memory but this memory can be contained within that process and the
> physical page frame can be poisoned so that it doesn't get used ever
> again. In any case, this is an example for an uncorrectable error which
> needs action from us but doesn't necessarily have to kill the whole
> machine.
>
> This is supposed to be more graceful instead of consuming the corrupted
> data and sending it out to disk.
>
> But sending signals from #MC context is definitely a bad idea. I think
> we had addressed this with irq_work at some point but my memory is very
> hazy.

Why is it a problem if user_mode_vm(regs)?  Conversely, why is sending
a signal a remotely reasonable thing to do if !user_mode_vm(regs)?

--Andy


Re: [RFC] x86_64: A real proposal for iret-less return to kernel

2014-05-21 Thread Borislav Petkov
On Wed, May 21, 2014 at 02:47:03PM -0700, Andy Lutomirski wrote:
> Is there anything actually wrong with just panicking if
> !user_mode_vm(regs)?  That would make this a lot more sane.

It does that already - mce_severity().

-- 
Regards/Gruss,
Boris.

Sent from a fat crate under my desk. Formatting is fine.
--


Re: [RFC] x86_64: A real proposal for iret-less return to kernel

2014-05-21 Thread Borislav Petkov
On Wed, May 21, 2014 at 02:52:55PM -0700, Andy Lutomirski wrote:
> Why is it a problem if user_mode_vm(regs)?  Conversely, why is sending
> a signal a remotely reasonable thing to do if !user_mode_vm(regs)?

Let me quote Jiri:

(1) task sends signal to itself
(2) it acquires sighand->siglock so that it's able to queue the signal
(3) MCE triggers
(4) it tries to send a signal to the same task
(5) it tries to acquire sighand->siglock and loops forever

-- 
Regards/Gruss,
Boris.

Sent from a fat crate under my desk. Formatting is fine.
--


Re: [RFC] x86_64: A real proposal for iret-less return to kernel

2014-05-21 Thread Jiri Kosina
On Wed, 21 May 2014, Borislav Petkov wrote:

> > Why is it a problem if user_mode_vm(regs)?  Conversely, why is sending
> > a signal a remotely reasonable thing to do if !user_mode_vm(regs)?
>
> Let me quote Jiri:
>
> (1) task sends signal to itself
> (2) it acquires sighand->siglock so that it's able to queue the signal
> (3) MCE triggers
> (4) it tries to send a signal to the same task
> (5) it tries to acquire sighand->siglock and loops forever

Ah, alright, but due to what mce_severity() does, this can't happen,
because if the current CPU is in the kernel (which is obviously implied by
holding a spinlock), it never proceeds to send the signal, because
no_way_out gets set and mce_panic() is invoked.

-- 
Jiri Kosina
SUSE Labs


Re: [RFC] x86_64: A real proposal for iret-less return to kernel

2014-05-21 Thread Andy Lutomirski
On Wed, May 21, 2014 at 2:55 PM, Borislav Petkov <b...@alien8.de> wrote:
> On Wed, May 21, 2014 at 02:52:55PM -0700, Andy Lutomirski wrote:
> > Why is it a problem if user_mode_vm(regs)?  Conversely, why is sending
> > a signal a remotely reasonable thing to do if !user_mode_vm(regs)?
>
> Let me quote Jiri:
>
> (1) task sends signal to itself
> (2) it acquires sighand->siglock so that it's able to queue the signal
> (3) MCE triggers

...and !user_mode_vm(regs), and hence we're IN_KERNEL, and we should
presumably just panic instead of trying to send a signal.

I missed the IN_KERNEL thing because I didn't realize that ->cs was
copied to struct mce.
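
A minimal sketch of that classification, assuming the context is derived
from the privilege bits of the saved CS. The names demo_mce,
demo_error_context() and demo_no_way_out() are hypothetical, and the real
severity grading looks at much more than the context:

```c
#include <stdbool.h>

enum demo_context { DEMO_IN_USER, DEMO_IN_KERNEL };

/* Hypothetical, cut-down record; the real struct mce has more fields. */
struct demo_mce {
	unsigned short cs;	/* code segment selector saved at exception time */
};

/* The low two bits of CS hold the privilege level; 3 means user mode. */
static enum demo_context demo_error_context(const struct demo_mce *m)
{
	return ((m->cs & 3) == 3) ? DEMO_IN_USER : DEMO_IN_KERNEL;
}

/*
 * Simplifying assumption: an error that would otherwise need a signal
 * is treated as fatal whenever it interrupted kernel context.
 */
static bool demo_no_way_out(const struct demo_mce *m)
{
	return demo_error_context(m) == DEMO_IN_KERNEL;
}
```

With this rule the siglock deadlock quoted above cannot be reached: a task
holding a spinlock is in kernel context, so the handler panics before it
ever tries to send a signal.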

--Andy


Re: [RFC] x86_64: A real proposal for iret-less return to kernel

2014-05-21 Thread H. Peter Anvin
On 05/21/2014 02:54 PM, Borislav Petkov wrote:
> On Wed, May 21, 2014 at 02:47:03PM -0700, Andy Lutomirski wrote:
> > Is there anything actually wrong with just panicking if
> > !user_mode_vm(regs)?  That would make this a lot more sane.
>
> It does that already - mce_severity().
>

So this is not a problem then.

-hpa



RE: [RFC] x86_64: A real proposal for iret-less return to kernel

2014-05-21 Thread Luck, Tony
> But sending signals from #MC context is definitely a bad idea. I think
> we had addressed this with irq_work at some point but my memory is very
> hazy.

We added code for recoverable errors to get out of the MC context
before trying to lookup the page and send the signal.  Bottom of
do_machine_check():

	if (cfg->tolerant < 3) {
		if (no_way_out)
			mce_panic("Fatal machine check on current CPU", &m, msg);
		if (worst == MCE_AR_SEVERITY) {
			/* schedule action before return to userland */
			mce_save_info(m.addr, m.mcgstatus & MCG_STATUS_RIPV);
			set_thread_flag(TIF_MCE_NOTIFY);
		} else if (kill_it) {
			force_sig(SIGBUS, current);
		}
	}

That TIF_MCE_NOTIFY prevents the return to user mode, and we end up in
mce_notify_process().

The force_sig() there is legacy code - and perhaps should just move off to
mce_notify_process() too (need to save "worst" so it will know what to do).
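
The deferral pattern described here can be sketched in plain C. This is
only an illustration under assumptions: demo_task, demo_machine_check()
and demo_exit_to_user() are hypothetical names, a bool stands in for the
TIF_MCE_NOTIFY thread flag, and a counter stands in for the real page
lookup and signal delivery.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-task state; the bool plays the role of a thread flag. */
struct demo_task {
	bool mce_notify;	/* work pending before return to user mode */
	uint64_t bad_addr;	/* physical address reported by the hardware */
	bool restartable;	/* saved RIPV-style "can resume" bit */
	int handled;		/* counts deferred recoveries, for the demo */
};

/* Exception context: record what happened and defer the heavy work. */
static void demo_machine_check(struct demo_task *t, uint64_t addr, bool ripv)
{
	t->bad_addr = addr;
	t->restartable = ripv;
	t->mce_notify = true;
}

/* Return-to-user path: ordinary process context, safe to do real work. */
static void demo_exit_to_user(struct demo_task *t)
{
	if (!t->mce_notify)
		return;
	t->mce_notify = false;
	/* A real handler would look up the page and send the signal here. */
	t->handled++;
}
```

The point of the split is that the exception handler only writes a few
fields, while anything that may take locks or run for a long time happens
later on the way back to user mode.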

-Tony


Re: [RFC] x86_64: A real proposal for iret-less return to kernel

2014-05-21 Thread Borislav Petkov
On Wed, May 21, 2014 at 03:00:18PM -0700, H. Peter Anvin wrote:
> So this is not a problem then.

Yeah, f'get it - it is all good at that front. :-)

-- 
Regards/Gruss,
Boris.

Sent from a fat crate under my desk. Formatting is fine.
--


Re: [RFC] x86_64: A real proposal for iret-less return to kernel

2014-05-21 Thread Andy Lutomirski
On Wed, May 21, 2014 at 3:01 PM, Luck, Tony <tony.l...@intel.com> wrote:
> > But sending signals from #MC context is definitely a bad idea. I think
> > we had addressed this with irq_work at some point but my memory is very
> > hazy.
>
> We added code for recoverable errors to get out of the MC context
> before trying to lookup the page and send the signal.  Bottom of
> do_machine_check():
>
> 	if (cfg->tolerant < 3) {
> 		if (no_way_out)
> 			mce_panic("Fatal machine check on current CPU", &m, msg);
> 		if (worst == MCE_AR_SEVERITY) {
> 			/* schedule action before return to userland */
> 			mce_save_info(m.addr, m.mcgstatus & MCG_STATUS_RIPV);
> 			set_thread_flag(TIF_MCE_NOTIFY);
> 		} else if (kill_it) {
> 			force_sig(SIGBUS, current);
> 		}
> 	}
>
> That TIF_MCE_NOTIFY prevents the return to user mode, and we end up in
> mce_notify_process().

Why is this necessary?

If the MCE hit kernel code, then we're going to die anyway.  If the
MCE hit user code, then we should be in a completely sensible context
and we can just send the signal.

--Andy


> The force_sig() there is legacy code - and perhaps should just move off to
> mce_notify_process() too (need to save "worst" so it will know what to do).
>
> -Tony



-- 
Andy Lutomirski
AMA Capital Management, LLC


Re: [RFC] x86_64: A real proposal for iret-less return to kernel

2014-05-21 Thread H. Peter Anvin
On 05/21/2014 03:11 PM, Borislav Petkov wrote:
> On Wed, May 21, 2014 at 03:00:18PM -0700, H. Peter Anvin wrote:
> > So this is not a problem then.
>
> Yeah, f'get it - it is all good at that front. :-)
>

Seems like a comment would be in order, though.

-hpa



Re: [RFC] x86_64: A real proposal for iret-less return to kernel

2014-05-21 Thread Borislav Petkov
On Wed, May 21, 2014 at 03:13:16PM -0700, Andy Lutomirski wrote:
> Why is this necessary?
>
> If the MCE hit kernel code, then we're going to die anyway.  If the
> MCE hit user code, then we should be in a completely sensible context
> and we can just send the signal.

Are we guaranteed that the first thing the process will execute when
scheduled back in are the signal handlers?

And besides, maybe we don't even want to allow the switch_to() to happen
but kill it while it is sleeping.

(I know, we're that nasty :-))

-- 
Regards/Gruss,
Boris.

Sent from a fat crate under my desk. Formatting is fine.
--


Re: [RFC] x86_64: A real proposal for iret-less return to kernel

2014-05-21 Thread Andy Lutomirski
On Wed, May 21, 2014 at 3:17 PM, Borislav Petkov <b...@alien8.de> wrote:
> On Wed, May 21, 2014 at 03:13:16PM -0700, Andy Lutomirski wrote:
> > Why is this necessary?
> >
> > If the MCE hit kernel code, then we're going to die anyway.  If the
> > MCE hit user code, then we should be in a completely sensible context
> > and we can just send the signal.
>
> Are we guaranteed that the first thing the process will execute when
> scheduled back in are the signal handlers?

It's not even scheduled out, right?  This should be just like a signal
from a failed page fault, I think.


> And besides, maybe we don't even want to allow the switch_to() to happen
> but kill it while it is sleeping.

What switch_to?

--Andy


RE: [RFC] x86_64: A real proposal for iret-less return to kernel

2014-05-21 Thread Luck, Tony
> > That TIF_MCE_NOTIFY prevents the return to user mode, and we end up in
> > mce_notify_process().
>
> Why is this necessary?

The recovery path has to do more than just send a signal - it needs to walk
processes and mms to see which have mapped the physical address that the
h/w told us has gone bad.

-Tony


Re: [RFC] x86_64: A real proposal for iret-less return to kernel

2014-05-21 Thread Borislav Petkov
On Wed, May 21, 2014 at 03:13:54PM -0700, H. Peter Anvin wrote:
> Seems like a comment would be in order, though.

Sure, I'll do a nice one once this discussion quiets down. :-)

-- 
Regards/Gruss,
Boris.

Sent from a fat crate under my desk. Formatting is fine.
--


Re: [RFC] x86_64: A real proposal for iret-less return to kernel

2014-05-21 Thread Andy Lutomirski
On Wed, May 21, 2014 at 3:18 PM, Luck, Tony <tony.l...@intel.com> wrote:
> > > That TIF_MCE_NOTIFY prevents the return to user mode, and we end up in
> > > mce_notify_process().
> >
> > Why is this necessary?
>
> The recovery path has to do more than just send a signal - it needs to walk
> processes and mms to see which have mapped the physical address that the
> h/w told us has gone bad.

I still feel like I'm missing something.  If we interrupted user space
code, then the context we're in should be identical to the context
we'll get when we're about to return to userspace.

--Andy


Re: [RFC] x86_64: A real proposal for iret-less return to kernel

2014-05-21 Thread Andi Kleen

Seems like a lot of effort and risk to essentially only optimize in kernel
interrupt handlers.

AFAIK the most interesting cases (like user page faults) are not
affected at all. Usually most workloads don't spend all that much time
in the kernel, so it won't help most interrupts.

I suspect the only case that's really interesting here is interrupting
idle. Maybe it would be possible to do some fast path in this case only.

However idle currently has so much overhead that I suspect that there 
are lower hanging fruit elsewhere.

-Andi

-- 
a...@linux.intel.com -- Speaking for myself only.


RE: [RFC] x86_64: A real proposal for iret-less return to kernel

2014-05-21 Thread Luck, Tony
> > The recovery path has to do more than just send a signal - it needs to walk
> > processes and mms to see which have mapped the physical address that the
> > h/w told us has gone bad.
>
> I still feel like I'm missing something.  If we interrupted user space
> code, then the context we're in should be identical to the context
> we'll get when we're about to return to userspace.

True. And this far along in do_machine_check() we have set all the other cpus
free, so they are heading back to whatever context we interrupted them in. So
we might be able to do all that other stuff inline here ... we interrupted user
mode, so we know we don't hold any locks. Other cpus are running, so they can
complete what they are doing to release any locks we might need.

But it will take a while (to scan all those processes). And we haven't yet
cleared MCG_STATUS ... so another machine check before we do that
would be fatal (x86 doesn't allow nesting).  Even if we moved the work
after the clear of MCG_STATUS we'd still be vulnerable to a new machine
check on x86_64 because we are sitting on the one & only machine check
stack.

-Tony


Re: [RFC] x86_64: A real proposal for iret-less return to kernel

2014-05-21 Thread Andy Lutomirski
On Wed, May 21, 2014 at 3:25 PM, Andi Kleen <a...@firstfloor.org> wrote:
>
> Seems like a lot of effort and risk to essentially only optimize in kernel
> interrupt handlers.

The idea is that it might allow us to remove a bunch of scary nested
NMI code as well as speeding things up.


> AFAIK the most interesting cases (like user page faults) are not
> affected at all. Usually most workloads don't spend all that much time
> in the kernel, so it won't help most interrupts.
>
> I suspect the only case that's really interesting here is interrupting
> idle. Maybe it would be possible to do some fast path in this case only.
>
> However idle currently has so much overhead that I suspect that there
> are lower hanging fruit elsewhere.

I will gladly buy a meal or beverage for whomever fixes the ttwu stuff
to stop sending IPIs to idle CPUs, which will help a lot.

--Andy


Re: [RFC] x86_64: A real proposal for iret-less return to kernel

2014-05-21 Thread Linus Torvalds
On Thu, May 22, 2014 at 7:25 AM, Andi Kleen <a...@firstfloor.org> wrote:
>
> I suspect the only case that's really interesting here is interrupting
> idle. Maybe it would be possible to do some fast path in this case only.

Hardware-interrupts during kernel are actually fairly common under
network-intensive loads, even outside of idle (but idle is admittedly
likely *the* most common one). Many network loads are fairly
kernel-intensive.

Also, from a kernel perspective, idle isn't really any different from
most other kernel code. Using ret to return to the idle handler
would be *more* of a special case than using ret to return to just
generic kernel context.

So I disagree vehemently. Do *not* special-case idle. It makes the
code more complex and less generic.

Linus


Re: [RFC] x86_64: A real proposal for iret-less return to kernel

2014-05-21 Thread Borislav Petkov
On Wed, May 21, 2014 at 03:20:50PM -0700, Andy Lutomirski wrote:
> It's not even scheduled out, right?

Right.

> This should be just like a signal from a failed page fault, I think.

Right, but there is this additional work that needs to be done
(mce_notify_process()) before sending the signal. So you want to do this
after the MCE handler is done but before you return to the process.

> What switch_to?

Nevermind, that was bollocks.

-- 
Regards/Gruss,
Boris.

Sent from a fat crate under my desk. Formatting is fine.
--


Re: [RFC] x86_64: A real proposal for iret-less return to kernel

2014-05-21 Thread H. Peter Anvin
On 05/21/2014 11:11 AM, Andy Lutomirski wrote:
> On Tue, May 20, 2014 at 5:53 PM, Andy Lutomirski <l...@amacapital.net> wrote:
> > Here's a real proposal for iret-less return.  If this is correct, then
> > NMIs will never nest, which will probably delete a lot more scariness
> > than is added by the code I'm describing.
>
> OK, here's a case where I'm wrong.  An NMI interrupts userspace on a
> 16-bit stack.  The return from NMI goes through the espfix code.
> Something interrupts while on the espfix stack.  Boom!  Neither return
> style is particularly good.
>
> More generally, if we got interrupted while on the espfix stack, we
> need to return back there using IRET.  Fortunately, re-enabling NMIs
> there is harmless, since we've already switched off the NMI stack.
>
> This makes me think that maybe the logic should be turned around: have
> some RIP ranges on which the kernel stack might be invalid (which
> includes the espfix code and some of the syscall code) and use IRET
> only on return from NMI, return to nonstandard CS, and return to these
> special ranges.  The NMI code just needs to never do any of this stuff
> unless it switches off the NMI stack first.
>
> For this to work reliably, we'll probably have to change CS before
> calling into EFI code.  That should be straightforward.
>

I think you are onto something here.

In particular, the key observation here is that inside the kernel, we
can never *both* have an invalid stack *and* be inside an NMI, #MC or
#DB handler, even if nested.

Now, does this prevent us from using RET in the common case?  I'm not
sure it is a huge loss since kernel-to-kernel is relatively rare.

-hpa



Re: [RFC] x86_64: A real proposal for iret-less return to kernel

2014-05-21 Thread Andy Lutomirski
On Wed, May 21, 2014 at 3:32 PM, Luck, Tony <tony.l...@intel.com> wrote:
> > > The recovery path has to do more than just send a signal - it needs to walk
> > > processes and mms to see which have mapped the physical address that the
> > > h/w told us has gone bad.
> >
> > I still feel like I'm missing something.  If we interrupted user space
> > code, then the context we're in should be identical to the context
> > we'll get when we're about to return to userspace.
>
> True. And this far along in do_machine_check() we have set all the other cpus
> free, so they are heading back to whatever context we interrupted them in. So
> we might be able to do all that other stuff inline here ... we interrupted
> user mode, so we know we don't hold any locks. Other cpus are running, so
> they can complete what they are doing to release any locks we might need.
>
> But it will take a while (to scan all those processes). And we haven't yet
> cleared MCG_STATUS ... so another machine check before we do that
> would be fatal (x86 doesn't allow nesting).  Even if we moved the work
> after the clear of MCG_STATUS we'd still be vulnerable to a new machine
> check on x86_64 because we are sitting on the one & only machine check
> stack.

But if we get a new MCE in here, it will be an MCE from kernel context
and it's fatal.  So, yes, we'll clobber the stack, but we'll never
return (unless tolerant is set to something insane), so who cares?

Anyway, I care less about this now that I don't have to worry about it
re: IRET :)

--Andy


Re: [RFC] x86_64: A real proposal for iret-less return to kernel

2014-05-21 Thread Andy Lutomirski
On Wed, May 21, 2014 at 3:36 PM, H. Peter Anvin <h...@zytor.com> wrote:
> On 05/21/2014 11:11 AM, Andy Lutomirski wrote:
> > On Tue, May 20, 2014 at 5:53 PM, Andy Lutomirski <l...@amacapital.net> wrote:
> > > Here's a real proposal for iret-less return.  If this is correct, then
> > > NMIs will never nest, which will probably delete a lot more scariness
> > > than is added by the code I'm describing.
> >
> > OK, here's a case where I'm wrong.  An NMI interrupts userspace on a
> > 16-bit stack.  The return from NMI goes through the espfix code.
> > Something interrupts while on the espfix stack.  Boom!  Neither return
> > style is particularly good.
> >
> > More generally, if we got interrupted while on the espfix stack, we
> > need to return back there using IRET.  Fortunately, re-enabling NMIs
> > there is harmless, since we've already switched off the NMI stack.
> >
> > This makes me think that maybe the logic should be turned around: have
> > some RIP ranges on which the kernel stack might be invalid (which
> > includes the espfix code and some of the syscall code) and use IRET
> > only on return from NMI, return to nonstandard CS, and return to these
> > special ranges.  The NMI code just needs to never do any of this stuff
> > unless it switches off the NMI stack first.
> >
> > For this to work reliably, we'll probably have to change CS before
> > calling into EFI code.  That should be straightforward.
> >
>
> I think you are onto something here.
>
> In particular, the key observation here is that inside the kernel, we
> can never *both* have an invalid stack *and* be inside an NMI, #MC or
> #DB handler, even if nested.

Except for espfix :)


> Now, does this prevent us from using RET in the common case?  I'm not
> sure it is a huge loss since kernel-to-kernel is relatively rare.

I don't think so.  The most common case should be plain old interrupts
and I suspect that #PF is a distant second.

In any event, plain old interrupts and #PF are non-IST interrupts and
they should be unconditionally safe for RET
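
The "turned around" rule quoted earlier in this message (IRET only on
return from NMI, to a nonstandard CS, or into the special RIP ranges) can
be sketched as a predicate. This is a hedged illustration, not the
proposed patch: demo_frame, demo_range, DEMO_KERNEL_CS and
demo_must_use_iret() are hypothetical names, and the addresses used with
it are made up.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed selector for ordinary kernel code in this sketch. */
#define DEMO_KERNEL_CS 0x10

/* Half-open address range [start, end) where the stack may be invalid. */
struct demo_range {
	uint64_t start;
	uint64_t end;
};

/* Hypothetical view of the frame we are about to return to. */
struct demo_frame {
	uint64_t rip;
	uint16_t cs;
	bool from_nmi;	/* true when returning from the NMI handler */
};

static bool demo_rip_in_ranges(uint64_t rip, const struct demo_range *r,
			       size_t n)
{
	for (size_t i = 0; i < n; i++)
		if (rip >= r[i].start && rip < r[i].end)
			return true;
	return false;
}

/* IRET for the three special cases; everything else may use RET. */
static bool demo_must_use_iret(const struct demo_frame *f,
			       const struct demo_range *fragile, size_t n)
{
	if (f->from_nmi)
		return true;
	if (f->cs != DEMO_KERNEL_CS)
		return true;
	return demo_rip_in_ranges(f->rip, fragile, n);
}
```

Under these assumptions a plain interrupt or #PF that hit ordinary kernel
code falls through all three checks and takes the RET path.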

--Andy


Re: [RFC] x86_64: A real proposal for iret-less return to kernel

2014-05-21 Thread Borislav Petkov
On Wed, May 21, 2014 at 03:39:11PM -0700, Andy Lutomirski wrote:
> But if we get a new MCE in here, it will be an MCE from kernel context
> and it's fatal. So, yes, we'll clobber the stack, but we'll never
> return (unless tolerant is set to something insane), so who cares?

Ok, but we still have to do the work before returning to the process. So
if not mce_notify_process() how else are you suggesting we do this?

-- 
Regards/Gruss,
Boris.

Sent from a fat crate under my desk. Formatting is fine.
--


Re: [RFC] x86_64: A real proposal for iret-less return to kernel

2014-05-21 Thread Andy Lutomirski
On Wed, May 21, 2014 at 3:48 PM, Borislav Petkov <b...@alien8.de> wrote:
> On Wed, May 21, 2014 at 03:39:11PM -0700, Andy Lutomirski wrote:
> > But if we get a new MCE in here, it will be an MCE from kernel context
> > and it's fatal. So, yes, we'll clobber the stack, but we'll never
> > return (unless tolerant is set to something insane), so who cares?
>
> Ok, but we still have to do the work before returning to the process. So
> if not mce_notify_process() how else are you suggesting we do this?

I'm suggesting that you re-enable interrupts and do the work in
do_machine_check.  I think it'll just work.  It might pay to set a
flag so that you panic very loudly if do_machine_check recurses.
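
A minimal sketch of such a recursion flag, in userspace C. All names here
(demo_in_mce, demo_panic(), demo_do_machine_check(), demo_nested_fault())
are hypothetical; a real version would presumably keep the flag per CPU
and panic for real instead of counting.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flag; per-CPU in a real kernel, a single global here. */
static bool demo_in_mce;
static int demo_panics;

/* Stand-in for a loud panic: just count it so the demo can continue. */
static void demo_panic(const char *why)
{
	(void)why;
	demo_panics++;
}

/* Run the slow recovery work inline, guarded against re-entry. */
static void demo_do_machine_check(void (*slow_work)(void))
{
	if (demo_in_mce) {
		demo_panic("machine check handler recursed");
		return;
	}
	demo_in_mce = true;
	if (slow_work)
		slow_work();	/* e.g. the page poisoning work */
	demo_in_mce = false;
}

/* Simulates a second machine check arriving during the slow work. */
static void demo_nested_fault(void)
{
	demo_do_machine_check(NULL);
}
```

Calling demo_do_machine_check(demo_nested_fault) models a machine check
that arrives while the first one is still being handled; the guard turns
it into a panic instead of a silent stack clobber.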

I suspect that, if the hardware is generating machine checks while
doing memory poisoning, the hardware is broken enough that even
panicking might not work, though :)

--Andy


Re: [RFC] x86_64: A real proposal for iret-less return to kernel

2014-05-21 Thread Borislav Petkov
On Wed, May 21, 2014 at 03:52:16PM -0700, Andy Lutomirski wrote:
> I'm suggesting that you re-enable interrupts and do the work in
> do_machine_check. I think it'll just work. It might pay to set a flag
> so that you panic very loudly if do_machine_check recurses.

And that is quite likely to happen if we're trying to poison a page which
is shared by a couple of processes' mm's and some process on some cpu
starts touching it.

So keeping all cpus in a holding pattern is much safer, IMO. (#MC is
broadcast on Intel, I'm sure you know).

And even if it made sense, why go to the trouble? To shorten the time we're
in the MCE handler? Well, if we spend too much time in it, then the box
is dying anyway. On normal, healthy hw, do_machine_check doesn't run.

:-)

> I suspect that, if the hardware is generating machine checks while
> doing memory poisoning, the hardware is broken enough that even
> panicking might not work, though :)

Yeah, in such cases, they tend to escalate to fatal errors very fast so
we panic right on the spot.

-- 
Regards/Gruss,
Boris.

Sent from a fat crate under my desk. Formatting is fine.
--


Re: [RFC] x86_64: A real proposal for iret-less return to kernel

2014-05-21 Thread H. Peter Anvin
On 05/21/2014 03:41 PM, Andy Lutomirski wrote:

> > I think you are onto something here.
> >
> > In particular, the key observation here is that inside the kernel, we
> > can never *both* have an invalid stack *and* be inside an NMI, #MC or
> > #DB handler, even if nested.
>
> Except for espfix :)

Argh.  Yes, I got that wrong... it isn't really about being inside NMI,
#MC or #DB, but rather being on those respective stacks.  If you are on
the espfix stack you are on your way back to userspace OR (and this gets
really, really ugly) you took an NMI/MC/DB after a SYSCALL executed in
16-bit mode, but even then you are in the kernel entry/exit code and
re-enabling NMI is fine.

> > Now, does this prevent us from using RET in the common case?  I'm not
> > sure it is a huge loss since kernel-to-kernel is relatively rare.
>
> I don't think so.  The most common case should be plain old interrupts
> and I suspect that #PF is a distant second.
>
> In any event, plain old interrupts and #PF are non-IST interrupts and
> they should be unconditionally safe for RET

-hpa



