Re: Request for help: what did I do wrong with idtentry?

2014-11-15 Thread Andy Lutomirski
On Sat, Nov 15, 2014 at 10:28 AM, Andi Kleen wrote:
>> I'm not
>> completely thrilled with what it does to double_fault, though.  If we
>> somehow get a double fault caused by an interrupt hitting userspace
>> with a bad kernel_stack, then we'll end up page faulting in the
>> double_fault prologue.  I'm not convinced that this is worth worrying
>> about.  It would be easy enough to fix, though, even if it would
>> further uglify the code.
>
> If you're "cleaning up" good, working code, the functionality should
> be the same as before. The old code handled this situation fine,
> so your new code should handle it too.

First, this failure mode should be almost impossible.  We'd really
have to screw up to have the kernel stack point to a bad address.
(This isn't the stack *pointer* being bad -- it's the value in the
TSS.)

If this happens, the existing code will die (no recovery possible,
unlike with normal OOPSes).  The new code will log a kernel-mode page
fault on the DF stack (as shown in the stack trace, assuming that
logic works), complain some more in do_exit, and make some sort of
effort to recover, which might even work.

In other words, I'd be happy to "fix" it, but I'm not entirely
convinced that this change should count as a regression in the first
place.

If we go for the fix-it approach, we could either add a fixup in
sync_regs and probe the kernel_stack, or add a paranoid=2 mode for
double_fault.

>
> In general, yes, handling all the corner cases makes code ugly.
> That is how the existing code got to be the way it is.

Most of those corner cases are at least in code paths that are
supposed to work.  This particular corner case is in a handler that's
just trying to print something useful rather than silently rebooting,
and it should still work well enough to print something useful.

--Andy
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Request for help: what did I do wrong with idtentry?

2014-11-15 Thread Andi Kleen
> I'm not
> completely thrilled with what it does to double_fault, though.  If we
> somehow get a double fault caused by an interrupt hitting userspace
> with a bad kernel_stack, then we'll end up page faulting in the
> double_fault prologue.  I'm not convinced that this is worth worrying
> about.  It would be easy enough to fix, though, even if it would
> further uglify the code.

If you're "cleaning up" good, working code, the functionality should
be the same as before. The old code handled this situation fine,
so your new code should handle it too.

In general, yes, handling all the corner cases makes code ugly.
That is how the existing code got to be the way it is.

-Andi




Re: Request for help: what did I do wrong with idtentry?

2014-11-14 Thread Andy Lutomirski
On Fri, Nov 14, 2014 at 4:52 PM, Luck, Tony wrote:
>> causes Tony's MCE stress test to fail, presumably when some CPU either
>> becomes permanently non-interruptible or otherwise wanders off into
>> the weeds.
>
> It might be that recent "improvements" I made to my test harness have
> messed things up.  I trimmed one delay (between injection and consumption),
> but it turns out the other delay in the code never gets executed (because we
> take a SIGBUS on consumption and then longjmp).  So my test, which used
> to pause a bit between iterations, was running almost back-to-back
> consumption and injection of the next error.

Hmm.

Am I right that the timeout code in mce.c is overly aggressive, too?

>
> This meant the serial console was a huge bottleneck (especially as my
> development BIOS is also kicking its own debug junk onto the same port).
> Some of the errors pointed obliquely at console.
>
> I've slowed things back down to where they used to be, and things are
> ticking along nicely (with 0.6 second delay between iterations).  Just
> passed the 2800 mark and still going.  I'm leaving it running over the
> weekend - if it makes it into the 50k level I'm willing to call it good.
>

Phew :)

FWIW, I've confirmed that my code survives int3 from userspace, int3
from normal kernel code, and int3 from kernel with user gs.  I'm not
completely thrilled with what it does to double_fault, though.  If we
somehow get a double fault caused by an interrupt hitting userspace
with a bad kernel_stack, then we'll end up page faulting in the
double_fault prologue.  I'm not convinced that this is worth worrying
about.  It would be easy enough to fix, though, even if it would
further uglify the code.
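
For anyone who wants to poke at the first of those cases, here is a
minimal userspace sketch, not the test actually used in this thread. It
covers only int3 from userspace; the two kernel-side cases cannot be
reached this way. The names on_sigtrap, breakpoint and
count_survived_breakpoints are made up for illustration, and on non-x86
builds the breakpoint is merely simulated with raise(SIGTRAP).

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <string.h>

/* Number of SIGTRAPs delivered so far (hypothetical bookkeeping). */
static volatile sig_atomic_t traps_seen;

static void on_sigtrap(int sig)
{
    (void)sig;
    traps_seen++;
}

/* Execute a breakpoint.  On x86, int3 is a trap: the saved instruction
 * pointer already points past the instruction, so returning from the
 * signal handler resumes right after it. */
static void breakpoint(void)
{
#if defined(__x86_64__) || defined(__i386__)
    __asm__ volatile("int3");
#else
    raise(SIGTRAP); /* assumption: stand-in on other architectures */
#endif
}

/* Hit n breakpoints and report how many the process survived,
 * or -1 if the handler could not be installed. */
static int count_survived_breakpoints(int n)
{
    struct sigaction sa;

    assert(n >= 0);
    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = on_sigtrap;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGTRAP, &sa, NULL) != 0)
        return -1;

    traps_seen = 0;
    for (int i = 0; i < n; i++)
        breakpoint();
    return (int)traps_seen;
}
```

Calling count_survived_breakpoints(n) from main and comparing the result
with n is enough for a smoke test; run it outside a debugger, since a
tracer would intercept the SIGTRAP.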

--Andy


RE: Request for help: what did I do wrong with idtentry?

2014-11-14 Thread Luck, Tony
> causes Tony's MCE stress test to fail, presumably when some CPU either
> becomes permanently non-interruptible or otherwise wanders off into
> the weeds.

It might be that recent "improvements" I made to my test harness have
messed things up.  I trimmed one delay (between injection and consumption),
but it turns out the other delay in the code never gets executed (because we
take a SIGBUS on consumption and then longjmp).  So my test, which used
to pause a bit between iterations, was running almost back-to-back
consumption and injection of the next error.

This meant the serial console was a huge bottleneck (especially as my
development BIOS is also kicking its own debug junk onto the same port).
Some of the errors pointed obliquely at console.

I've slowed things back down to where they used to be, and things are
ticking along nicely (with 0.6 second delay between iterations).  Just
passed the 2800 mark and still going.  I'm leaving it running over the
weekend - if it makes it into the 50k level I'm willing to call it good.
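
The skipped-delay problem described above is easy to reproduce in
miniature. What follows is a sketch under stated assumptions, not the
real harness (its source is not part of this thread): raise(SIGBUS)
stands in for consuming an injected error, and on_sigbus,
consume_error, pause_between_iterations and run_iterations are
hypothetical names. A delay placed after the consuming access is never
reached because the handler longjmps past it; a delay on the recovery
path does run.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <setjmp.h>
#include <signal.h>
#include <string.h>
#include <time.h>

static sigjmp_buf recover_point;
static int delays_executed;

static void on_sigbus(int sig)
{
    (void)sig;
    siglongjmp(recover_point, 1); /* abandon the faulting code path */
}

static void pause_between_iterations(long nanoseconds)
{
    struct timespec ts = { nanoseconds / 1000000000L,
                           nanoseconds % 1000000000L };
    nanosleep(&ts, NULL);
    delays_executed++;
}

/* Assumption: stand-in for touching memory with an injected error. */
static void consume_error(void)
{
    raise(SIGBUS);
}

/* Run `count` iterations and return how many delays actually executed.
 * delay_after_recovery == 0 models the buggy layout, 1 the fixed one. */
static int run_iterations(int count, int delay_after_recovery, long delay_ns)
{
    struct sigaction sa;

    assert(delay_ns >= 0);
    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = on_sigbus;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGBUS, &sa, NULL) != 0)
        return -1;

    delays_executed = 0;
    for (volatile int i = 0; i < count; i++) {
        /* savemask=1 so SIGBUS is unblocked again after the jump */
        if (sigsetjmp(recover_point, 1) == 0) {
            consume_error();
            /* Never reached: the handler jumps back to sigsetjmp. */
            pause_between_iterations(delay_ns);
        } else if (delay_after_recovery) {
            pause_between_iterations(delay_ns);
        }
    }
    return delays_executed;
}
```

With delay_after_recovery set to 0 the function reports zero delays no
matter how many iterations run, which matches the back-to-back behaviour
described above; setting it to 1 gives one delay per iteration.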

-Tony

