Re: Request for help: what did I do wrong with idtentry?
On Sat, Nov 15, 2014 at 10:28 AM, Andi Kleen wrote:
>> I'm not completely thrilled with what it does to double_fault, though.
>> If we somehow get a double fault caused by an interrupt hitting
>> userspace with a bad kernel_stack, then we'll end up page faulting in
>> the double_fault prologue. I'm not convinced that this is worth
>> worrying about. It would be easy enough to fix, though, even if it
>> would further uglify the code.
>
> If you're "cleaning up" good and working code the functionality should
> be the same as before. The old code handled this situation fine.
> So your new code should handle this too.

First, this failure mode should be almost impossible. We'd really have
to screw up to have the kernel stack point to a bad address. (This
isn't the stack *pointer* being bad -- it's the value in the TSS.)

If this happens, the existing code will die (no recovery possible,
unlike with normal OOPSes). The new code will log a kernel-mode page
fault on the DF stack (as shown on the stack trace, assuming that logic
works), complain some more in do_exit, and make some sort of effort to
recover, which might even work.

In other words, I'd be happy to "fix" it, but I'm not entirely
convinced that this change should count as a regression in the first
place.

If we go for the fix-it approach, we could add a fixup in sync_regs
and probe the kernel_stack, or we could add a paranoid=2 mode for
double_fault.

>
> In general, yes, handling all the corner cases makes code ugly.
> That is how the existing code got to be the way it is.

Most of those corner cases are at least in code paths that are
supposed to work. This particular corner case is in a handler that's
just trying to print something useful rather than silently rebooting,
and it should still work well enough to print something useful.

--Andy
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: Request for help: what did I do wrong with idtentry?
> I'm not completely thrilled with what it does to double_fault, though.
> If we somehow get a double fault caused by an interrupt hitting
> userspace with a bad kernel_stack, then we'll end up page faulting in
> the double_fault prologue. I'm not convinced that this is worth
> worrying about. It would be easy enough to fix, though, even if it
> would further uglify the code.

If you're "cleaning up" good and working code the functionality should
be the same as before. The old code handled this situation fine.
So your new code should handle this too.

In general, yes, handling all the corner cases makes code ugly.
That is how the existing code got to be the way it is.

-Andi
Re: Request for help: what did I do wrong with idtentry?
On Fri, Nov 14, 2014 at 4:52 PM, Luck, Tony wrote:
>> causes Tony's MCE stress test to fail, presumably when some CPU either
>> becomes permanently non-interruptible or otherwise wanders off into
>> the weeds.
>
> It might be that recent "improvements" I made to my test harness have
> messed things up. I trimmed one delay (between injection and
> consumption), but it turns out the other delay in the code never gets
> executed (because we take a SIGBUS on consumption and then longjmp).
> So my test that used to pause a bit between iterations was running
> almost back-to-back consumption and injection of the next error.

Hmm. Am I right that the timeout code in mce.c is overly aggressive, too?

>
> This meant the serial console was a huge bottleneck (especially as my
> development BIOS is also kicking its own debug junk onto the same port).
> Some of the errors pointed obliquely at console.
>
> I've slowed things back down to where they used to be, and things are
> ticking along nicely (with 0.6 second delay between iterations). Just
> passed the 2800 mark and still going. I'm leaving it running over the
> weekend - if it makes it into the 50k level I'm willing to call it good.
>

Phew :)

FWIW, I've confirmed that my code survives int3 from userspace, int3
from normal kernel code, and int3 from kernel with user gs.

I'm not completely thrilled with what it does to double_fault, though.
If we somehow get a double fault caused by an interrupt hitting
userspace with a bad kernel_stack, then we'll end up page faulting in
the double_fault prologue. I'm not convinced that this is worth
worrying about. It would be easy enough to fix, though, even if it
would further uglify the code.

--Andy
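For readers who want to poke at the first of those three cases, here is a rough sketch of an "int3 from userspace" check. It is not the test Andy ran, which the thread does not show. The names fire_breakpoints() and on_sigtrap() are made up for illustration, and the sketch assumes a POSIX system where a userspace int3 is delivered to the process as SIGTRAP (on non-x86 builds it falls back to raise(SIGTRAP), which only exercises the signal path). The two kernel-mode cases cannot be reproduced from a plain userspace program and are not covered.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <string.h>

/* Counts SIGTRAP deliveries; sig_atomic_t is safe to touch in a handler. */
static volatile sig_atomic_t traps_seen;

/* Hypothetical handler: just record that the breakpoint trap arrived. */
static void on_sigtrap(int sig)
{
    (void)sig;
    traps_seen++;
}

/*
 * Hypothetical helper: execute `count` breakpoints from userspace and
 * return how many SIGTRAPs the process received. int3 is a trap, so the
 * saved instruction pointer already points past it and execution simply
 * continues once the handler returns.
 */
int fire_breakpoints(int count)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_sigtrap;
    sigemptyset(&sa.sa_mask);
    int rc = sigaction(SIGTRAP, &sa, NULL);
    assert(rc == 0);

    traps_seen = 0;
    for (int i = 0; i < count; i++) {
#if defined(__x86_64__) || defined(__i386__)
        __asm__ volatile("int3");   /* enters the kernel's int3 entry path */
#else
        raise(SIGTRAP);             /* fallback: no int3 instruction here */
#endif
    }
    return traps_seen;
}
```

If the entry code mishandles the trap, the process would be expected to die or hang rather than return from fire_breakpoints(). Running this under a debugger changes the picture, since the tracer sees the SIGTRAP first.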
RE: Request for help: what did I do wrong with idtentry?
> causes Tony's MCE stress test to fail, presumably when some CPU either
> becomes permanently non-interruptible or otherwise wanders off into
> the weeds.

It might be that recent "improvements" I made to my test harness have
messed things up. I trimmed one delay (between injection and
consumption), but it turns out the other delay in the code never gets
executed (because we take a SIGBUS on consumption and then longjmp).
So my test that used to pause a bit between iterations was running
almost back-to-back consumption and injection of the next error.

This meant the serial console was a huge bottleneck (especially as my
development BIOS is also kicking its own debug junk onto the same port).
Some of the errors pointed obliquely at console.

I've slowed things back down to where they used to be, and things are
ticking along nicely (with 0.6 second delay between iterations). Just
passed the 2800 mark and still going. I'm leaving it running over the
weekend - if it makes it into the 50k level I'm willing to call it good.

-Tony
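The skipped delay is easy to reproduce in miniature. The sketch below is not Tony's harness, which the thread does not show. It assumes a loop that recovers from SIGBUS with sigsetjmp/siglongjmp, and it simulates "consumption" with raise(SIGBUS) instead of touching a poisoned page. All names (run_buggy, run_fixed, consume_error, pause_between_iterations) are hypothetical, and the pause is 1 ms rather than the 0.6 seconds mentioned above so that the sketch runs quickly.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <setjmp.h>
#include <signal.h>
#include <string.h>
#include <time.h>

static sigjmp_buf recover_point;  /* where the SIGBUS handler jumps back to */
static int delays_run;            /* how many inter-iteration pauses actually ran */

/* Recover from the (simulated) machine check by abandoning the iteration. */
static void on_sigbus(int sig)
{
    (void)sig;
    siglongjmp(recover_point, 1);
}

static void install_handler(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_sigbus;
    sigemptyset(&sa.sa_mask);
    int rc = sigaction(SIGBUS, &sa, NULL);
    assert(rc == 0);
}

/* Assumption: stands in for reading the injected error; always faults. */
static void consume_error(void)
{
    raise(SIGBUS);
}

static void pause_between_iterations(void)
{
    struct timespec ts = { .tv_sec = 0, .tv_nsec = 1000000 };  /* 1 ms */
    nanosleep(&ts, NULL);
    delays_run++;
}

/* Buggy shape: the pause follows the faulting access, so it is never reached. */
int run_buggy(int iterations)
{
    install_handler();
    delays_run = 0;
    for (volatile int i = 0; i < iterations; i++) {
        if (sigsetjmp(recover_point, 1) == 0) {
            consume_error();             /* jumps away via the handler */
            pause_between_iterations();  /* dead code on every iteration */
        }
    }
    return delays_run;
}

/* Fixed shape: the pause sits after the recovery point, so both paths hit it. */
int run_fixed(int iterations)
{
    install_handler();
    delays_run = 0;
    for (volatile int i = 0; i < iterations; i++) {
        if (sigsetjmp(recover_point, 1) == 0)
            consume_error();
        pause_between_iterations();
    }
    return delays_run;
}
```

With this sketch, run_buggy(5) returns 0 because every iteration leaves through the signal handler before reaching the pause, while run_fixed(5) returns 5. Passing 1 as the second argument to sigsetjmp matters: it restores the signal mask on the jump, so SIGBUS is not left blocked for the next iteration.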