Re: [PATCH 10/19] x86/dumpstack: add get_stack_info() interface

2016-07-26 Thread Andy Lutomirski
On Tue, Jul 26, 2016 at 3:24 PM, Josh Poimboeuf  wrote:
> On Tue, Jul 26, 2016 at 01:59:20PM -0700, Andy Lutomirski wrote:
>> On Tue, Jul 26, 2016 at 9:26 AM, Josh Poimboeuf  wrote:
>> > On Mon, Jul 25, 2016 at 05:09:44PM -0700, Andy Lutomirski wrote:
>> >> On Sat, Jul 23, 2016 at 7:04 AM, Josh Poimboeuf  
>> >> wrote:
>> >> >> > Unless I'm missing something, I think it should be fine for nested 
>> >> >> > NMIs,
>> >> >> > since they're all on the same stack.  I can try to test it.  What in
>> >> >> > particular are you worried about?
>> >> >> >
>> >> >>
>> >> >> The top of the NMI frame contains no less than *three* (SS, SP, FLAGS,
>> >> >> CS, IP) records.  Off the top of my head, the record that matters is
>> >> >> the third one, so it should be regs[-15].  If an MCE hits while this
>> >> >> mess is being set up, good luck unwinding *that*.  If you really want
>> >> >> to know, take a deep breath, read the long comment in entry_64.S after
>> >> >> .Lnmi_from_kernel, then give up on x86 and start hacking on some other
>> >> >> architecture.
>> >> >
>> >> > I did read that comment.  Fortunately there's a big difference between
>> >> > reading and understanding so I can go on being an ignorant x86 hacker!
>> >> >
>> >> > For nested NMIs, it does look like the stack of the exception which
>> >> > interrupted the first NMI would get skipped by the stack dump.  (But
>> >> > that's a general problem, not specific to my patch set.)
>> >>
>> >> If we end up with task -> IST -> NMI -> same IST, we're doomed and
>> >> we're going to crash, so it doesn't matter much whether the unwinder
>> >> works.  Is that what you mean?
>> >
>> > I read the NMI entry code again, and now I realize my comment was
>> > completely misinformed, so never mind.
>> >
>> > Is "IST -> NMI -> same IST" even possible, since the other IST's are
>> > higher priority than NMI?
>>
>> Priority only matters for events that happen concurrently.
>> Synchronous things like #DB will always fire if the conditions that
>> trigger them are hit,
>
> So just to clarify, are you saying a lower priority exception like NMI
> can interrupt a higher priority exception handler like #DB?  I'm getting
> a different conclusion from reading section 6.9 of the Intel System
> Programming Guide.

Yes, effectively.  From the CPU's perspective, it's done with the #DB
as soon as it finishes pushing the stack frame and starts running
instructions again.  So the chain of events looks like:


<-- CPU is delivering #DB.  NMI can't be delivered.
debug:
<-- Oh boy, done with delivering #DB.  NMIs can be delivered again!
  pushq $whatever
  ...
  iretq  <-- CPU has no idea that this is related to the #DB

>
>> >> > Am I correct in understanding that there can only be one level of NMI
>> >> > nesting at any given time?  If so, could we make it easier on the
>> >> > unwinder by putting the nested NMI on a separate software stack, so the
>> >> > "next stack" pointers are always in the same place?  Or am I just being
>> >> > naive?
>> >>
>> >> I think you're being naive :)
>> >>
>> >> But we don't really need the unwinder to be 100% faithful here.  If we 
>> >> have:
>> >>
>> >> task stack
>> >> NMI
>> >> nested NMI
>> >>
>> >> then the nested NMI code won't call into C and thus it should be
>> >> impossible to ever invoke your unwinder on that state.  Instead the
>> >> nested NMI code will fiddle with the saved regs and return in such a
>> >> way that the outer NMI will be forced to loop through again.  So it
>> >> *should* (assuming I'm remembering all this crap correctly) be
>> >> sufficient to find the "outermost" pt_regs, which is sitting at
>> >> (struct pt_regs *)(end - 12) - 1 or thereabouts and look at it's ->sp
>> >> value.  This ought to be the same thing that the frame-based unwinder
>> >> would naturally try to do.  But if you make this change, ISTM you
>> >> should make it separately because it does change behavior and Linus is
>> >> understandably leery about that.
>> >
>> > Ok, I think that makes sense to me now.  As I understand it, the
>> > "outermost" RIP is the authoritative one, because it was written by the
>> > original NMI.  Any nested NMIs will update the original and/or iret
>> > RIPs, which will only ever point to NMI entry code, and so they should
>> > be ignored.
>> >
>> > But I think there's a case where this wouldn't work:
>> >
>> > task stack
>> > NMI
>> > IST
>> > stack dump
>> >
>> > If the IST interrupt hits before the NMI has a chance to update the
>> > outermost regs, the authoritative RIP would be the original one written
>> > by HW, right?
>>
>> This should be impossible unless that last entry is MCE.  If we
>> actually fire an event that isn't MCE early in NMI entry, something
>> already went very wrong.
>
> So we don't need to support breakpoints in the early NMI entry code?

No.  Instead we try not to let it happen.  See, for example:

commit 

Re: [PATCH 10/19] x86/dumpstack: add get_stack_info() interface

2016-07-26 Thread Andy Lutomirski
On Tue, Jul 26, 2016 at 3:24 PM, Josh Poimboeuf  wrote:
> On Tue, Jul 26, 2016 at 01:59:20PM -0700, Andy Lutomirski wrote:
>> On Tue, Jul 26, 2016 at 9:26 AM, Josh Poimboeuf  wrote:
>> > On Mon, Jul 25, 2016 at 05:09:44PM -0700, Andy Lutomirski wrote:
>> >> On Sat, Jul 23, 2016 at 7:04 AM, Josh Poimboeuf  
>> >> wrote:
>> >> >> > Unless I'm missing something, I think it should be fine for nested 
>> >> >> > NMIs,
>> >> >> > since they're all on the same stack.  I can try to test it.  What in
>> >> >> > particular are you worried about?
>> >> >> >
>> >> >>
>> >> >> The top of the NMI frame contains no less than *three* (SS, SP, FLAGS,
>> >> >> CS, IP) records.  Off the top of my head, the record that matters is
>> >> >> the third one, so it should be regs[-15].  If an MCE hits while this
>> >> >> mess is being set up, good luck unwinding *that*.  If you really want
>> >> >> to know, take a deep breath, read the long comment in entry_64.S after
>> >> >> .Lnmi_from_kernel, then give up on x86 and start hacking on some other
>> >> >> architecture.
>> >> >
>> >> > I did read that comment.  Fortunately there's a big difference between
>> >> > reading and understanding so I can go on being an ignorant x86 hacker!
>> >> >
>> >> > For nested NMIs, it does look like the stack of the exception which
>> >> > interrupted the first NMI would get skipped by the stack dump.  (But
>> >> > that's a general problem, not specific to my patch set.)
>> >>
>> >> If we end up with task -> IST -> NMI -> same IST, we're doomed and
>> >> we're going to crash, so it doesn't matter much whether the unwinder
>> >> works.  Is that what you mean?
>> >
>> > I read the NMI entry code again, and now I realize my comment was
>> > completely misinformed, so never mind.
>> >
>> > Is "IST -> NMI -> same IST" even possible, since the other IST's are
>> > higher priority than NMI?
>>
>> Priority only matters for events that happen concurrently.
>> Synchronous things like #DB will always fire if the conditions that
>> trigger them are hit,
>
> So just to clarify, are you saying a lower priority exception like NMI
> can interrupt a higher priority exception handler like #DB?  I'm getting
> a different conclusion from reading section 6.9 of the Intel System
> Programming Guide.

Yes, effectively.  From the CPU's perspective, it's done with the #DB
as soon as it finishes pushing the stack frame and starts running
instructions again.  So the chain of events looks like:


<-- CPU is delivering #DB.  NMI can't be delivered.
debug:
<-- Oh boy, done with delivering #DB.  NMIs can be delivered again!
  pushq $whatever
  ...
  iretq  <-- CPU has no idea that this is related to the #DB

>
>> >> > Am I correct in understanding that there can only be one level of NMI
>> >> > nesting at any given time?  If so, could we make it easier on the
>> >> > unwinder by putting the nested NMI on a separate software stack, so the
>> >> > "next stack" pointers are always in the same place?  Or am I just being
>> >> > naive?
>> >>
>> >> I think you're being naive :)
>> >>
>> >> But we don't really need the unwinder to be 100% faithful here.  If we 
>> >> have:
>> >>
>> >> task stack
>> >> NMI
>> >> nested NMI
>> >>
>> >> then the nested NMI code won't call into C and thus it should be
>> >> impossible to ever invoke your unwinder on that state.  Instead the
>> >> nested NMI code will fiddle with the saved regs and return in such a
>> >> way that the outer NMI will be forced to loop through again.  So it
>> >> *should* (assuming I'm remembering all this crap correctly) be
>> >> sufficient to find the "outermost" pt_regs, which is sitting at
>> >> (struct pt_regs *)(end - 12) - 1 or thereabouts and look at it's ->sp
>> >> value.  This ought to be the same thing that the frame-based unwinder
>> >> would naturally try to do.  But if you make this change, ISTM you
>> >> should make it separately because it does change behavior and Linus is
>> >> understandably leery about that.
>> >
>> > Ok, I think that makes sense to me now.  As I understand it, the
>> > "outermost" RIP is the authoritative one, because it was written by the
>> > original NMI.  Any nested NMIs will update the original and/or iret
>> > RIPs, which will only ever point to NMI entry code, and so they should
>> > be ignored.
>> >
>> > But I think there's a case where this wouldn't work:
>> >
>> > task stack
>> > NMI
>> > IST
>> > stack dump
>> >
>> > If the IST interrupt hits before the NMI has a chance to update the
>> > outermost regs, the authoritative RIP would be the original one written
>> > by HW, right?
>>
>> This should be impossible unless that last entry is MCE.  If we
>> actually fire an event that isn't MCE early in NMI entry, something
>> already went very wrong.
>
> So we don't need to support breakpoints in the early NMI entry code?

No.  Instead we try not to let it happen.  See, for example:

commit e5779e8e12299f77c2421a707855d8d124171d85
Author: Andy Lutomirski 
Date:   Thu Jul 30 20:32:40 

Re: [PATCH 10/19] x86/dumpstack: add get_stack_info() interface

2016-07-26 Thread Steven Rostedt
On Tue, 26 Jul 2016 17:24:54 -0500
Josh Poimboeuf  wrote:

> > This should be impossible unless that last entry is MCE.  If we
> > actually fire an event that isn't MCE early in NMI entry, something
> > already went very wrong.  
> 
> So we don't need to support breakpoints in the early NMI entry code?

Yes, if that happens, then bad things can really happen.

The only way a breakpoint could be added there, is perhaps with KGDB,
and that's just asking for trouble anyway.

-- Steve



Re: [PATCH 10/19] x86/dumpstack: add get_stack_info() interface

2016-07-26 Thread Steven Rostedt
On Tue, 26 Jul 2016 17:24:54 -0500
Josh Poimboeuf  wrote:

> > This should be impossible unless that last entry is MCE.  If we
> > actually fire an event that isn't MCE early in NMI entry, something
> > already went very wrong.  
> 
> So we don't need to support breakpoints in the early NMI entry code?

Yes, if that happens, then bad things can really happen.

The only way a breakpoint could be added there, is perhaps with KGDB,
and that's just asking for trouble anyway.

-- Steve



Re: [PATCH 10/19] x86/dumpstack: add get_stack_info() interface

2016-07-26 Thread Josh Poimboeuf
On Tue, Jul 26, 2016 at 01:59:20PM -0700, Andy Lutomirski wrote:
> On Tue, Jul 26, 2016 at 9:26 AM, Josh Poimboeuf  wrote:
> > On Mon, Jul 25, 2016 at 05:09:44PM -0700, Andy Lutomirski wrote:
> >> On Sat, Jul 23, 2016 at 7:04 AM, Josh Poimboeuf  
> >> wrote:
> >> >> > Unless I'm missing something, I think it should be fine for nested 
> >> >> > NMIs,
> >> >> > since they're all on the same stack.  I can try to test it.  What in
> >> >> > particular are you worried about?
> >> >> >
> >> >>
> >> >> The top of the NMI frame contains no less than *three* (SS, SP, FLAGS,
> >> >> CS, IP) records.  Off the top of my head, the record that matters is
> >> >> the third one, so it should be regs[-15].  If an MCE hits while this
> >> >> mess is being set up, good luck unwinding *that*.  If you really want
> >> >> to know, take a deep breath, read the long comment in entry_64.S after
> >> >> .Lnmi_from_kernel, then give up on x86 and start hacking on some other
> >> >> architecture.
> >> >
> >> > I did read that comment.  Fortunately there's a big difference between
> >> > reading and understanding so I can go on being an ignorant x86 hacker!
> >> >
> >> > For nested NMIs, it does look like the stack of the exception which
> >> > interrupted the first NMI would get skipped by the stack dump.  (But
> >> > that's a general problem, not specific to my patch set.)
> >>
> >> If we end up with task -> IST -> NMI -> same IST, we're doomed and
> >> we're going to crash, so it doesn't matter much whether the unwinder
> >> works.  Is that what you mean?
> >
> > I read the NMI entry code again, and now I realize my comment was
> > completely misinformed, so never mind.
> >
> > Is "IST -> NMI -> same IST" even possible, since the other IST's are
> > higher priority than NMI?
> 
> Priority only matters for events that happen concurrently.
> Synchronous things like #DB will always fire if the conditions that
> trigger them are hit,

So just to clarify, are you saying a lower priority exception like NMI
can interrupt a higher priority exception handler like #DB?  I'm getting
a different conclusion from reading section 6.9 of the Intel System
Programming Guide.

> >> > Am I correct in understanding that there can only be one level of NMI
> >> > nesting at any given time?  If so, could we make it easier on the
> >> > unwinder by putting the nested NMI on a separate software stack, so the
> >> > "next stack" pointers are always in the same place?  Or am I just being
> >> > naive?
> >>
> >> I think you're being naive :)
> >>
> >> But we don't really need the unwinder to be 100% faithful here.  If we 
> >> have:
> >>
> >> task stack
> >> NMI
> >> nested NMI
> >>
> >> then the nested NMI code won't call into C and thus it should be
> >> impossible to ever invoke your unwinder on that state.  Instead the
> >> nested NMI code will fiddle with the saved regs and return in such a
> >> way that the outer NMI will be forced to loop through again.  So it
> >> *should* (assuming I'm remembering all this crap correctly) be
> >> sufficient to find the "outermost" pt_regs, which is sitting at
> >> (struct pt_regs *)(end - 12) - 1 or thereabouts and look at it's ->sp
> >> value.  This ought to be the same thing that the frame-based unwinder
> >> would naturally try to do.  But if you make this change, ISTM you
> >> should make it separately because it does change behavior and Linus is
> >> understandably leery about that.
> >
> > Ok, I think that makes sense to me now.  As I understand it, the
> > "outermost" RIP is the authoritative one, because it was written by the
> > original NMI.  Any nested NMIs will update the original and/or iret
> > RIPs, which will only ever point to NMI entry code, and so they should
> > be ignored.
> >
> > But I think there's a case where this wouldn't work:
> >
> > task stack
> > NMI
> > IST
> > stack dump
> >
> > If the IST interrupt hits before the NMI has a chance to update the
> > outermost regs, the authoritative RIP would be the original one written
> > by HW, right?
> 
> This should be impossible unless that last entry is MCE.  If we
> actually fire an event that isn't MCE early in NMI entry, something
> already went very wrong.

So we don't need to support breakpoints in the early NMI entry code?

> For NMI -> MCE -> stack dump, the frame-based unwinder will do better
> than get_stack_info() unless get_stack_info() learns to use the
> top-of-stack hardware copy if the actual RSP it finds is too high
> (above the "outermost" frame).

Ok.

> >> Hmm.  I wonder if it would make sense to decode this thing both ways
> >> and display it.  So show_trace_log_lvl() could print something like:
> >>
> >> <#DB (0xwhatever000-0xwhateverfff), next frame is at 
> >> 0xsomething>
> >>
> >> and, in the case where the frame unwinder disagrees, it'll at least be
> >> visible in that 0xsomething won't be between 0xwhatever000 and
> >> 0xwhateverfff.
> >>
> >> Then 

Re: [PATCH 10/19] x86/dumpstack: add get_stack_info() interface

2016-07-26 Thread Josh Poimboeuf
On Tue, Jul 26, 2016 at 01:59:20PM -0700, Andy Lutomirski wrote:
> On Tue, Jul 26, 2016 at 9:26 AM, Josh Poimboeuf  wrote:
> > On Mon, Jul 25, 2016 at 05:09:44PM -0700, Andy Lutomirski wrote:
> >> On Sat, Jul 23, 2016 at 7:04 AM, Josh Poimboeuf  
> >> wrote:
> >> >> > Unless I'm missing something, I think it should be fine for nested 
> >> >> > NMIs,
> >> >> > since they're all on the same stack.  I can try to test it.  What in
> >> >> > particular are you worried about?
> >> >> >
> >> >>
> >> >> The top of the NMI frame contains no less than *three* (SS, SP, FLAGS,
> >> >> CS, IP) records.  Off the top of my head, the record that matters is
> >> >> the third one, so it should be regs[-15].  If an MCE hits while this
> >> >> mess is being set up, good luck unwinding *that*.  If you really want
> >> >> to know, take a deep breath, read the long comment in entry_64.S after
> >> >> .Lnmi_from_kernel, then give up on x86 and start hacking on some other
> >> >> architecture.
> >> >
> >> > I did read that comment.  Fortunately there's a big difference between
> >> > reading and understanding so I can go on being an ignorant x86 hacker!
> >> >
> >> > For nested NMIs, it does look like the stack of the exception which
> >> > interrupted the first NMI would get skipped by the stack dump.  (But
> >> > that's a general problem, not specific to my patch set.)
> >>
> >> If we end up with task -> IST -> NMI -> same IST, we're doomed and
> >> we're going to crash, so it doesn't matter much whether the unwinder
> >> works.  Is that what you mean?
> >
> > I read the NMI entry code again, and now I realize my comment was
> > completely misinformed, so never mind.
> >
> > Is "IST -> NMI -> same IST" even possible, since the other IST's are
> > higher priority than NMI?
> 
> Priority only matters for events that happen concurrently.
> Synchronous things like #DB will always fire if the conditions that
> trigger them are hit,

So just to clarify, are you saying a lower priority exception like NMI
can interrupt a higher priority exception handler like #DB?  I'm getting
a different conclusion from reading section 6.9 of the Intel System
Programming Guide.

> >> > Am I correct in understanding that there can only be one level of NMI
> >> > nesting at any given time?  If so, could we make it easier on the
> >> > unwinder by putting the nested NMI on a separate software stack, so the
> >> > "next stack" pointers are always in the same place?  Or am I just being
> >> > naive?
> >>
> >> I think you're being naive :)
> >>
> >> But we don't really need the unwinder to be 100% faithful here.  If we 
> >> have:
> >>
> >> task stack
> >> NMI
> >> nested NMI
> >>
> >> then the nested NMI code won't call into C and thus it should be
> >> impossible to ever invoke your unwinder on that state.  Instead the
> >> nested NMI code will fiddle with the saved regs and return in such a
> >> way that the outer NMI will be forced to loop through again.  So it
> >> *should* (assuming I'm remembering all this crap correctly) be
> >> sufficient to find the "outermost" pt_regs, which is sitting at
> >> (struct pt_regs *)(end - 12) - 1 or thereabouts and look at it's ->sp
> >> value.  This ought to be the same thing that the frame-based unwinder
> >> would naturally try to do.  But if you make this change, ISTM you
> >> should make it separately because it does change behavior and Linus is
> >> understandably leery about that.
> >
> > Ok, I think that makes sense to me now.  As I understand it, the
> > "outermost" RIP is the authoritative one, because it was written by the
> > original NMI.  Any nested NMIs will update the original and/or iret
> > RIPs, which will only ever point to NMI entry code, and so they should
> > be ignored.
> >
> > But I think there's a case where this wouldn't work:
> >
> > task stack
> > NMI
> > IST
> > stack dump
> >
> > If the IST interrupt hits before the NMI has a chance to update the
> > outermost regs, the authoritative RIP would be the original one written
> > by HW, right?
> 
> This should be impossible unless that last entry is MCE.  If we
> actually fire an event that isn't MCE early in NMI entry, something
> already went very wrong.

So we don't need to support breakpoints in the early NMI entry code?

> For NMI -> MCE -> stack dump, the frame-based unwinder will do better
> than get_stack_info() unless get_stack_info() learns to use the
> top-of-stack hardware copy if the actual RSP it finds is too high
> (above the "outermost" frame).

Ok.

> >> Hmm.  I wonder if it would make sense to decode this thing both ways
> >> and display it.  So show_trace_log_lvl() could print something like:
> >>
> >> <#DB (0xwhatever000-0xwhateverfff), next frame is at 
> >> 0xsomething>
> >>
> >> and, in the case where the frame unwinder disagrees, it'll at least be
> >> visible in that 0xsomething won't be between 0xwhatever000 and
> >> 0xwhateverfff.
> >>
> >> Then Linus is happy because the unwinder works 

Re: [PATCH 10/19] x86/dumpstack: add get_stack_info() interface

2016-07-26 Thread Andy Lutomirski
On Tue, Jul 26, 2016 at 9:26 AM, Josh Poimboeuf  wrote:
> On Mon, Jul 25, 2016 at 05:09:44PM -0700, Andy Lutomirski wrote:
>> On Sat, Jul 23, 2016 at 7:04 AM, Josh Poimboeuf  wrote:
>> >> > Unless I'm missing something, I think it should be fine for nested NMIs,
>> >> > since they're all on the same stack.  I can try to test it.  What in
>> >> > particular are you worried about?
>> >> >
>> >>
>> >> The top of the NMI frame contains no less than *three* (SS, SP, FLAGS,
>> >> CS, IP) records.  Off the top of my head, the record that matters is
>> >> the third one, so it should be regs[-15].  If an MCE hits while this
>> >> mess is being set up, good luck unwinding *that*.  If you really want
>> >> to know, take a deep breath, read the long comment in entry_64.S after
>> >> .Lnmi_from_kernel, then give up on x86 and start hacking on some other
>> >> architecture.
>> >
>> > I did read that comment.  Fortunately there's a big difference between
>> > reading and understanding so I can go on being an ignorant x86 hacker!
>> >
>> > For nested NMIs, it does look like the stack of the exception which
>> > interrupted the first NMI would get skipped by the stack dump.  (But
>> > that's a general problem, not specific to my patch set.)
>>
>> If we end up with task -> IST -> NMI -> same IST, we're doomed and
>> we're going to crash, so it doesn't matter much whether the unwinder
>> works.  Is that what you mean?
>
> I read the NMI entry code again, and now I realize my comment was
> completely misinformed, so never mind.
>
> Is "IST -> NMI -> same IST" even possible, since the other IST's are
> higher priority than NMI?

Priority only matters for events that happen concurrently.
Synchronous things like #DB will always fire if the conditions that
trigger them are hit,

>
>> > Am I correct in understanding that there can only be one level of NMI
>> > nesting at any given time?  If so, could we make it easier on the
>> > unwinder by putting the nested NMI on a separate software stack, so the
>> > "next stack" pointers are always in the same place?  Or am I just being
>> > naive?
>>
>> I think you're being naive :)
>>
>> But we don't really need the unwinder to be 100% faithful here.  If we have:
>>
>> task stack
>> NMI
>> nested NMI
>>
>> then the nested NMI code won't call into C and thus it should be
>> impossible to ever invoke your unwinder on that state.  Instead the
>> nested NMI code will fiddle with the saved regs and return in such a
>> way that the outer NMI will be forced to loop through again.  So it
>> *should* (assuming I'm remembering all this crap correctly) be
>> sufficient to find the "outermost" pt_regs, which is sitting at
>> (struct pt_regs *)(end - 12) - 1 or thereabouts and look at it's ->sp
>> value.  This ought to be the same thing that the frame-based unwinder
>> would naturally try to do.  But if you make this change, ISTM you
>> should make it separately because it does change behavior and Linus is
>> understandably leery about that.
>
> Ok, I think that makes sense to me now.  As I understand it, the
> "outermost" RIP is the authoritative one, because it was written by the
> original NMI.  Any nested NMIs will update the original and/or iret
> RIPs, which will only ever point to NMI entry code, and so they should
> be ignored.
>
> But I think there's a case where this wouldn't work:
>
> task stack
> NMI
> IST
> stack dump
>
> If the IST interrupt hits before the NMI has a chance to update the
> outermost regs, the authoritative RIP would be the original one written
> by HW, right?

This should be impossible unless that last entry is MCE.  If we
actually fire an event that isn't MCE early in NMI entry, something
already went very wrong.

For NMI -> MCE -> stack dump, the frame-based unwinder will do better
than get_stack_info() unless get_stack_info() learns to use the
top-of-stack hardware copy if the actual RSP it finds is too high
(above the "outermost" frame).

>
>> Hmm.  I wonder if it would make sense to decode this thing both ways
>> and display it.  So show_trace_log_lvl() could print something like:
>>
>> <#DB (0xwhatever000-0xwhateverfff), next frame is at 0xsomething>
>>
>> and, in the case where the frame unwinder disagrees, it'll at least be
>> visible in that 0xsomething won't be between 0xwhatever000 and
>> 0xwhateverfff.
>>
>> Then Linus is happy because the unwinder works just like it did before
>> but people like me are also happy because it's clearer what's going
>> on.  FWIW, I've personally debugged crashes in the NMI code where the
>> current unwinder falls apart completely and it's not fun -- having a
>> display indicating that the unwinder got confused would be nice.
>
> Hm, maybe.  Another idea would be to have the unwinder print some kind
> of warning if it goes off the rails.  It should be able to detect that,
> because every stack trace should end at a user pt_regs.

I like that.

Be careful, though: 

Re: [PATCH 10/19] x86/dumpstack: add get_stack_info() interface

2016-07-26 Thread Andy Lutomirski
On Tue, Jul 26, 2016 at 9:26 AM, Josh Poimboeuf  wrote:
> On Mon, Jul 25, 2016 at 05:09:44PM -0700, Andy Lutomirski wrote:
>> On Sat, Jul 23, 2016 at 7:04 AM, Josh Poimboeuf  wrote:
>> >> > Unless I'm missing something, I think it should be fine for nested NMIs,
>> >> > since they're all on the same stack.  I can try to test it.  What in
>> >> > particular are you worried about?
>> >> >
>> >>
>> >> The top of the NMI frame contains no less than *three* (SS, SP, FLAGS,
>> >> CS, IP) records.  Off the top of my head, the record that matters is
>> >> the third one, so it should be regs[-15].  If an MCE hits while this
>> >> mess is being set up, good luck unwinding *that*.  If you really want
>> >> to know, take a deep breath, read the long comment in entry_64.S after
>> >> .Lnmi_from_kernel, then give up on x86 and start hacking on some other
>> >> architecture.
>> >
>> > I did read that comment.  Fortunately there's a big difference between
>> > reading and understanding so I can go on being an ignorant x86 hacker!
>> >
>> > For nested NMIs, it does look like the stack of the exception which
>> > interrupted the first NMI would get skipped by the stack dump.  (But
>> > that's a general problem, not specific to my patch set.)
>>
>> If we end up with task -> IST -> NMI -> same IST, we're doomed and
>> we're going to crash, so it doesn't matter much whether the unwinder
>> works.  Is that what you mean?
>
> I read the NMI entry code again, and now I realize my comment was
> completely misinformed, so never mind.
>
> Is "IST -> NMI -> same IST" even possible, since the other IST's are
> higher priority than NMI?

Priority only matters for events that happen concurrently.
Synchronous things like #DB will always fire if the conditions that
trigger them are hit,

>
>> > Am I correct in understanding that there can only be one level of NMI
>> > nesting at any given time?  If so, could we make it easier on the
>> > unwinder by putting the nested NMI on a separate software stack, so the
>> > "next stack" pointers are always in the same place?  Or am I just being
>> > naive?
>>
>> I think you're being naive :)
>>
>> But we don't really need the unwinder to be 100% faithful here.  If we have:
>>
>> task stack
>> NMI
>> nested NMI
>>
>> then the nested NMI code won't call into C and thus it should be
>> impossible to ever invoke your unwinder on that state.  Instead the
>> nested NMI code will fiddle with the saved regs and return in such a
>> way that the outer NMI will be forced to loop through again.  So it
>> *should* (assuming I'm remembering all this crap correctly) be
>> sufficient to find the "outermost" pt_regs, which is sitting at
>> (struct pt_regs *)(end - 12) - 1 or thereabouts and look at it's ->sp
>> value.  This ought to be the same thing that the frame-based unwinder
>> would naturally try to do.  But if you make this change, ISTM you
>> should make it separately because it does change behavior and Linus is
>> understandably leery about that.
>
> Ok, I think that makes sense to me now.  As I understand it, the
> "outermost" RIP is the authoritative one, because it was written by the
> original NMI.  Any nested NMIs will update the original and/or iret
> RIPs, which will only ever point to NMI entry code, and so they should
> be ignored.
>
> But I think there's a case where this wouldn't work:
>
> task stack
> NMI
> IST
> stack dump
>
> If the IST interrupt hits before the NMI has a chance to update the
> outermost regs, the authoritative RIP would be the original one written
> by HW, right?

This should be impossible unless that last entry is MCE.  If we
actually fire an event that isn't MCE early in NMI entry, something
already went very wrong.

For NMI -> MCE -> stack dump, the frame-based unwinder will do better
than get_stack_info() unless get_stack_info() learns to use the
top-of-stack hardware copy if the actual RSP it finds is too high
(above the "outermost" frame).

>
>> Hmm.  I wonder if it would make sense to decode this thing both ways
>> and display it.  So show_trace_log_lvl() could print something like:
>>
>> <#DB (0xwhatever000-0xwhateverfff), next frame is at 0xsomething>
>>
>> and, in the case where the frame unwinder disagrees, it'll at least be
>> visible in that 0xsomething won't be between 0xwhatever000 and
>> 0xwhateverfff.
>>
>> Then Linus is happy because the unwinder works just like it did before
>> but people like me are also happy because it's clearer what's going
>> on.  FWIW, I've personally debugged crashes in the NMI code where the
>> current unwinder falls apart completely and it's not fun -- having a
>> display indicating that the unwinder got confused would be nice.
>
> Hm, maybe.  Another idea would be to have the unwinder print some kind
> of warning if it goes off the rails.  It should be able to detect that,
> because every stack trace should end at a user pt_regs.

I like that.

Be careful, though: kernel threads might not have a "user" 

Re: [PATCH 10/19] x86/dumpstack: add get_stack_info() interface

2016-07-26 Thread Josh Poimboeuf
On Tue, Jul 26, 2016 at 01:49:06PM -0400, Brian Gerst wrote:
> On Tue, Jul 26, 2016 at 12:47 PM, Josh Poimboeuf  wrote:
> > On Mon, Jul 25, 2016 at 05:09:44PM -0700, Andy Lutomirski wrote:
> >> On Sat, Jul 23, 2016 at 7:04 AM, Josh Poimboeuf  
> >> wrote:
> >> > Am I correct in understanding that there can only be one level of NMI
> >> > nesting at any given time?  If so, could we make it easier on the
> >> > unwinder by putting the nested NMI on a separate software stack, so the
> >> > "next stack" pointers are always in the same place?  Or am I just being
> >> > naive?
> >>
> >> I think you're being naive :)
> >
> > Another dumb question: since NMIs are reentrant, have you considered
> > removing the NMI IST entry, and instead just have NMIs keep using the
> > current stack?
> >
> > The first NMI could then be switched to an NMI software stack, like IRQs
> > (assuming there's a way to do that atomically!).  And then determining
> > the context of subsequent NMIs would be straightforward, and we'd no
> > longer need to jump through all those horrible hoops in the entry code
> > to deal with NMI nesting.
> >
> > Now you can tell me what else I'm missing...
> 
> There are several places (most notably SYSCALL entry) where the kernel
> stack pointer is unsafe/user controlled for a brief time.  Since an
> NMI can interrupt anywhere in the kernel, you have to use an IST to
> protect against that case.

Ah, that makes sense.  Thanks.

-- 
Josh


Re: [PATCH 10/19] x86/dumpstack: add get_stack_info() interface

2016-07-26 Thread Josh Poimboeuf
On Tue, Jul 26, 2016 at 01:49:06PM -0400, Brian Gerst wrote:
> On Tue, Jul 26, 2016 at 12:47 PM, Josh Poimboeuf  wrote:
> > On Mon, Jul 25, 2016 at 05:09:44PM -0700, Andy Lutomirski wrote:
> >> On Sat, Jul 23, 2016 at 7:04 AM, Josh Poimboeuf  
> >> wrote:
> >> > Am I correct in understanding that there can only be one level of NMI
> >> > nesting at any given time?  If so, could we make it easier on the
> >> > unwinder by putting the nested NMI on a separate software stack, so the
> >> > "next stack" pointers are always in the same place?  Or am I just being
> >> > naive?
> >>
> >> I think you're being naive :)
> >
> > Another dumb question: since NMIs are reentrant, have you considered
> > removing the NMI IST entry, and instead just have NMIs keep using the
> > current stack?
> >
> > The first NMI could then be switched to an NMI software stack, like IRQs
> > (assuming there's a way to do that atomically!).  And then determining
> > the context of subsequent NMIs would be straightforward, and we'd no
> > longer need to jump through all those horrible hoops in the entry code
> > to deal with NMI nesting.
> >
> > Now you can tell me what else I'm missing...
> 
> There are several places (most notably SYSCALL entry) where the kernel
> stack pointer is unsafe/user controlled for a brief time.  Since an
> NMI can interrupt anywhere in the kernel, you have to use an IST to
> protect against that case.

Ah, that makes sense.  Thanks.

-- 
Josh


Re: [PATCH 10/19] x86/dumpstack: add get_stack_info() interface

2016-07-26 Thread Josh Poimboeuf
On Tue, Jul 26, 2016 at 01:51:27PM -0400, Steven Rostedt wrote:
> On Tue, 26 Jul 2016 11:26:42 -0500
> Josh Poimboeuf  wrote:
> 
> 
> > Ok, I think that makes sense to me now.  As I understand it, the
> > "outermost" RIP is the authoritative one, because it was written by the
> > original NMI.  Any nested NMIs will update the original and/or iret
> > RIPs, which will only ever point to NMI entry code, and so they should
> > be ignored.
> 
> Just to confirm:
> 
>   -- top-of-stack --
>   [ hardware written stack ] <- what the NMI hardware mechanism wrote
>   [ internal variables ] <- you don't need to know what this is
>   [ where to go next ] <- the stack to use to return on current NMI
>   [ original copy of hardware stack ] <- the stack of the first NMI
> 
> IIRC, the original version had the "where to go next" stack last, but
> to keep pt_regs in line with the stack, it made sense to have the
> original NMI stack at the bottom, just above pt_regs, like a real
> interrupt would.
> 
> > 
> > But I think there's a case where this wouldn't work:
> > 
> > task stack
> > NMI
> > IST
> > stack dump
> > 
> > If the IST interrupt hits before the NMI has a chance to update the
> > outermost regs, the authoritative RIP would be the original one written
> > by HW, right?
> 
> The only IST interrupt that would hit there is MCE and it would
> probably be a critical error. Do we really need to worry about such an
> unlikely scenario? The system is probably doomed anyway.

According to entry_64.S:

/*
 * We allow breakpoints in NMIs. If a breakpoint occurs, then
 * the iretq it performs will take us out of NMI context.
 * This means that we can have nested NMIs where the next
 * NMI is using the top of the stack of the previous NMI.

So I think this means that when a debug exception returns to an NMI with
iret, further NMIs are no longer masked.

-- 
Josh


Re: [PATCH 10/19] x86/dumpstack: add get_stack_info() interface

2016-07-26 Thread Josh Poimboeuf
On Tue, Jul 26, 2016 at 01:51:27PM -0400, Steven Rostedt wrote:
> On Tue, 26 Jul 2016 11:26:42 -0500
> Josh Poimboeuf  wrote:
> 
> 
> > Ok, I think that makes sense to me now.  As I understand it, the
> > "outermost" RIP is the authoritative one, because it was written by the
> > original NMI.  Any nested NMIs will update the original and/or iret
> > RIPs, which will only ever point to NMI entry code, and so they should
> > be ignored.
> 
> Just to confirm:
> 
>   -- top-of-stack --
>   [ hardware written stack ] <- what the NMI hardware mechanism wrote
>   [ internal variables ] <- you don't need to know what this is
>   [ where to go next ] <- the stack to use to return on current NMI
>   [ original copy of hardware stack ] <- the stack of the first NMI
> 
> IIRC, the original version had the "where to go next" stack last, but
> to keep pt_regs in line with the stack, it made sense to have the
> original NMI stack at the bottom, just above pt_regs, like a real
> interrupt would.
> 
> > 
> > But I think there's a case where this wouldn't work:
> > 
> > task stack
> > NMI
> > IST
> > stack dump
> > 
> > If the IST interrupt hits before the NMI has a chance to update the
> > outermost regs, the authoritative RIP would be the original one written
> > by HW, right?
> 
> The only IST interrupt that would hit there is MCE and it would
> probably be a critical error. Do we really need to worry about such an
> unlikely scenario? The system is probably doomed anyway.

According to entry_64.S:

/*
 * We allow breakpoints in NMIs. If a breakpoint occurs, then
 * the iretq it performs will take us out of NMI context.
 * This means that we can have nested NMIs where the next
 * NMI is using the top of the stack of the previous NMI.

So I think this means that when a debug exception returns to an NMI with
iret, further NMIs are no longer masked.

-- 
Josh


Re: [PATCH 10/19] x86/dumpstack: add get_stack_info() interface

2016-07-26 Thread Steven Rostedt
On Tue, 26 Jul 2016 11:26:42 -0500
Josh Poimboeuf  wrote:


> Ok, I think that makes sense to me now.  As I understand it, the
> "outermost" RIP is the authoritative one, because it was written by the
> original NMI.  Any nested NMIs will update the original and/or iret
> RIPs, which will only ever point to NMI entry code, and so they should
> be ignored.

Just to confirm:

  -- top-of-stack --
  [ hardware written stack ] <- what the NMI hardware mechanism wrote
  [ internal variables ] <- you don't need to know what this is
  [ where to go next ] <- the stack to use to return on current NMI
  [ original copy of hardware stack ] <- the stack of the first NMI

IIRC, the original version had the "where to go next" stack last, but
to keep pt_regs in line with the stack, it made sense to have the
original NMI stack at the bottom, just above pt_regs, like a real
interrupt would.

> 
> But I think there's a case where this wouldn't work:
> 
> task stack
> NMI
> IST
> stack dump
> 
> If the IST interrupt hits before the NMI has a chance to update the
> outermost regs, the authoritative RIP would be the original one written
> by HW, right?

The only IST interrupt that would hit there is MCE and it would
probably be a critical error. Do we really need to worry about such an
unlikely scenario? The system is probably doomed anyway.

-- Steve


Re: [PATCH 10/19] x86/dumpstack: add get_stack_info() interface

2016-07-26 Thread Steven Rostedt
On Tue, 26 Jul 2016 11:26:42 -0500
Josh Poimboeuf  wrote:


> Ok, I think that makes sense to me now.  As I understand it, the
> "outermost" RIP is the authoritative one, because it was written by the
> original NMI.  Any nested NMIs will update the original and/or iret
> RIPs, which will only ever point to NMI entry code, and so they should
> be ignored.

Just to confirm:

  -- top-of-stack --
  [ hardware written stack ] <- what the NMI hardware mechanism wrote
  [ internal variables ] <- you don't need to know what this is
  [ where to go next ] <- the stack to use to return on current NMI
  [ original copy of hardware stack ] <- the stack of the first NMI

IIRC, the original version had the "where to go next" stack last, but
to keep pt_regs in line with the stack, it made sense to have the
original NMI stack at the bottom, just above pt_regs, like a real
interrupt would.

> 
> But I think there's a case where this wouldn't work:
> 
> task stack
> NMI
> IST
> stack dump
> 
> If the IST interrupt hits before the NMI has a chance to update the
> outermost regs, the authoritative RIP would be the original one written
> by HW, right?

The only IST interrupt that would hit there is MCE and it would
probably be a critical error. Do we really need to worry about such an
unlikely scenario? The system is probably doomed anyway.

-- Steve


Re: [PATCH 10/19] x86/dumpstack: add get_stack_info() interface

2016-07-26 Thread Brian Gerst
On Tue, Jul 26, 2016 at 12:47 PM, Josh Poimboeuf  wrote:
> On Mon, Jul 25, 2016 at 05:09:44PM -0700, Andy Lutomirski wrote:
>> On Sat, Jul 23, 2016 at 7:04 AM, Josh Poimboeuf  wrote:
>> > Am I correct in understanding that there can only be one level of NMI
>> > nesting at any given time?  If so, could we make it easier on the
>> > unwinder by putting the nested NMI on a separate software stack, so the
>> > "next stack" pointers are always in the same place?  Or am I just being
>> > naive?
>>
>> I think you're being naive :)
>
> Another dumb question: since NMIs are reentrant, have you considered
> removing the NMI IST entry, and instead just have NMIs keep using the
> current stack?
>
> The first NMI could then be switched to an NMI software stack, like IRQs
> (assuming there's a way to do that atomically!).  And then determining
> the context of subsequent NMIs would be straightforward, and we'd no
> longer need to jump through all those horrible hoops in the entry code
> to deal with NMI nesting.
>
> Now you can tell me what else I'm missing...

There are several places (most notably SYSCALL entry) where the kernel
stack pointer is unsafe/user controlled for a brief time.  Since an
NMI can interrupt anywhere in the kernel, you have to use an IST to
protect against that case.

Blame Intel's legacy behavior for this mess, because IRET
unconditionally re-enables NMIs even if you are returning from another
exception like a page fault.  This wasn't a problem on the 8086 which
didn't have an MMU, but makes makes no sense on modern systems.

--
Brian Gerst


Re: [PATCH 10/19] x86/dumpstack: add get_stack_info() interface

2016-07-26 Thread Brian Gerst
On Tue, Jul 26, 2016 at 12:47 PM, Josh Poimboeuf  wrote:
> On Mon, Jul 25, 2016 at 05:09:44PM -0700, Andy Lutomirski wrote:
>> On Sat, Jul 23, 2016 at 7:04 AM, Josh Poimboeuf  wrote:
>> > Am I correct in understanding that there can only be one level of NMI
>> > nesting at any given time?  If so, could we make it easier on the
>> > unwinder by putting the nested NMI on a separate software stack, so the
>> > "next stack" pointers are always in the same place?  Or am I just being
>> > naive?
>>
>> I think you're being naive :)
>
> Another dumb question: since NMIs are reentrant, have you considered
> removing the NMI IST entry, and instead just have NMIs keep using the
> current stack?
>
> The first NMI could then be switched to an NMI software stack, like IRQs
> (assuming there's a way to do that atomically!).  And then determining
> the context of subsequent NMIs would be straightforward, and we'd no
> longer need to jump through all those horrible hoops in the entry code
> to deal with NMI nesting.
>
> Now you can tell me what else I'm missing...

There are several places (most notably SYSCALL entry) where the kernel
stack pointer is unsafe/user controlled for a brief time.  Since an
NMI can interrupt anywhere in the kernel, you have to use an IST to
protect against that case.

Blame Intel's legacy behavior for this mess, because IRET
unconditionally re-enables NMIs even if you are returning from another
exception like a page fault.  This wasn't a problem on the 8086 which
didn't have an MMU, but makes makes no sense on modern systems.

--
Brian Gerst


Re: [PATCH 10/19] x86/dumpstack: add get_stack_info() interface

2016-07-26 Thread Josh Poimboeuf
On Mon, Jul 25, 2016 at 05:09:44PM -0700, Andy Lutomirski wrote:
> On Sat, Jul 23, 2016 at 7:04 AM, Josh Poimboeuf  wrote:
> > Am I correct in understanding that there can only be one level of NMI
> > nesting at any given time?  If so, could we make it easier on the
> > unwinder by putting the nested NMI on a separate software stack, so the
> > "next stack" pointers are always in the same place?  Or am I just being
> > naive?
> 
> I think you're being naive :)

Another dumb question: since NMIs are reentrant, have you considered
removing the NMI IST entry, and instead just have NMIs keep using the
current stack?

The first NMI could then be switched to an NMI software stack, like IRQs
(assuming there's a way to do that atomically!).  And then determining
the context of subsequent NMIs would be straightforward, and we'd no
longer need to jump through all those horrible hoops in the entry code
to deal with NMI nesting.

Now you can tell me what else I'm missing...

-- 
Josh


Re: [PATCH 10/19] x86/dumpstack: add get_stack_info() interface

2016-07-26 Thread Josh Poimboeuf
On Mon, Jul 25, 2016 at 05:09:44PM -0700, Andy Lutomirski wrote:
> On Sat, Jul 23, 2016 at 7:04 AM, Josh Poimboeuf  wrote:
> > Am I correct in understanding that there can only be one level of NMI
> > nesting at any given time?  If so, could we make it easier on the
> > unwinder by putting the nested NMI on a separate software stack, so the
> > "next stack" pointers are always in the same place?  Or am I just being
> > naive?
> 
> I think you're being naive :)

Another dumb question: since NMIs are reentrant, have you considered
removing the NMI IST entry, and instead just have NMIs keep using the
current stack?

The first NMI could then be switched to an NMI software stack, like IRQs
(assuming there's a way to do that atomically!).  And then determining
the context of subsequent NMIs would be straightforward, and we'd no
longer need to jump through all those horrible hoops in the entry code
to deal with NMI nesting.

Now you can tell me what else I'm missing...

-- 
Josh


Re: [PATCH 10/19] x86/dumpstack: add get_stack_info() interface

2016-07-26 Thread Josh Poimboeuf
On Mon, Jul 25, 2016 at 05:09:44PM -0700, Andy Lutomirski wrote:
> On Sat, Jul 23, 2016 at 7:04 AM, Josh Poimboeuf  wrote:
> >> > Unless I'm missing something, I think it should be fine for nested NMIs,
> >> > since they're all on the same stack.  I can try to test it.  What in
> >> > particular are you worried about?
> >> >
> >>
> >> The top of the NMI frame contains no less than *three* (SS, SP, FLAGS,
> >> CS, IP) records.  Off the top of my head, the record that matters is
> >> the third one, so it should be regs[-15].  If an MCE hits while this
> >> mess is being set up, good luck unwinding *that*.  If you really want
> >> to know, take a deep breath, read the long comment in entry_64.S after
> >> .Lnmi_from_kernel, then give up on x86 and start hacking on some other
> >> architecture.
> >
> > I did read that comment.  Fortunately there's a big difference between
> > reading and understanding so I can go on being an ignorant x86 hacker!
> >
> > For nested NMIs, it does look like the stack of the exception which
> > interrupted the first NMI would get skipped by the stack dump.  (But
> > that's a general problem, not specific to my patch set.)
> 
> If we end up with task -> IST -> NMI -> same IST, we're doomed and
> we're going to crash, so it doesn't matter much whether the unwinder
> works.  Is that what you mean?

I read the NMI entry code again, and now I realize my comment was
completely misinformed, so never mind.

Is "IST -> NMI -> same IST" even possible, since the other IST's are
higher priority than NMI?

> > Am I correct in understanding that there can only be one level of NMI
> > nesting at any given time?  If so, could we make it easier on the
> > unwinder by putting the nested NMI on a separate software stack, so the
> > "next stack" pointers are always in the same place?  Or am I just being
> > naive?
> 
> I think you're being naive :)
> 
> But we don't really need the unwinder to be 100% faithful here.  If we have:
> 
> task stack
> NMI
> nested NMI
> 
> then the nested NMI code won't call into C and thus it should be
> impossible to ever invoke your unwinder on that state.  Instead the
> nested NMI code will fiddle with the saved regs and return in such a
> way that the outer NMI will be forced to loop through again.  So it
> *should* (assuming I'm remembering all this crap correctly) be
> sufficient to find the "outermost" pt_regs, which is sitting at
> (struct pt_regs *)(end - 12) - 1 or thereabouts and look at it's ->sp
> value.  This ought to be the same thing that the frame-based unwinder
> would naturally try to do.  But if you make this change, ISTM you
> should make it separately because it does change behavior and Linus is
> understandably leery about that.

Ok, I think that makes sense to me now.  As I understand it, the
"outermost" RIP is the authoritative one, because it was written by the
original NMI.  Any nested NMIs will update the original and/or iret
RIPs, which will only ever point to NMI entry code, and so they should
be ignored.

But I think there's a case where this wouldn't work:

task stack
NMI
IST
stack dump

If the IST interrupt hits before the NMI has a chance to update the
outermost regs, the authoritative RIP would be the original one written
by HW, right?

> Hmm.  I wonder if it would make sense to decode this thing both ways
> and display it.  So show_trace_log_lvl() could print something like:
> 
> <#DB (0xwhatever000-0xwhateverfff), next frame is at 0xsomething>
> 
> and, in the case where the frame unwinder disagrees, it'll at least be
> visible in that 0xsomething won't be between 0xwhatever000 and
> 0xwhateverfff.
> 
> Then Linus is happy because the unwinder works just like it did before
> but people like me are also happy because it's clearer what's going
> on.  FWIW, I've personally debugged crashes in the NMI code where the
> current unwinder falls apart completely and it's not fun -- having a
> display indicating that the unwinder got confused would be nice.

Hm, maybe.  Another idea would be to have the unwinder print some kind
of warning if it goes off the rails.  It should be able to detect that,
because every stack trace should end at a user pt_regs.

-- 
Josh


Re: [PATCH 10/19] x86/dumpstack: add get_stack_info() interface

2016-07-26 Thread Josh Poimboeuf
On Mon, Jul 25, 2016 at 05:09:44PM -0700, Andy Lutomirski wrote:
> On Sat, Jul 23, 2016 at 7:04 AM, Josh Poimboeuf  wrote:
> >> > Unless I'm missing something, I think it should be fine for nested NMIs,
> >> > since they're all on the same stack.  I can try to test it.  What in
> >> > particular are you worried about?
> >> >
> >>
> >> The top of the NMI frame contains no less than *three* (SS, SP, FLAGS,
> >> CS, IP) records.  Off the top of my head, the record that matters is
> >> the third one, so it should be regs[-15].  If an MCE hits while this
> >> mess is being set up, good luck unwinding *that*.  If you really want
> >> to know, take a deep breath, read the long comment in entry_64.S after
> >> .Lnmi_from_kernel, then give up on x86 and start hacking on some other
> >> architecture.
> >
> > I did read that comment.  Fortunately there's a big difference between
> > reading and understanding so I can go on being an ignorant x86 hacker!
> >
> > For nested NMIs, it does look like the stack of the exception which
> > interrupted the first NMI would get skipped by the stack dump.  (But
> > that's a general problem, not specific to my patch set.)
> 
> If we end up with task -> IST -> NMI -> same IST, we're doomed and
> we're going to crash, so it doesn't matter much whether the unwinder
> works.  Is that what you mean?

I read the NMI entry code again, and now I realize my comment was
completely misinformed, so never mind.

Is "IST -> NMI -> same IST" even possible, since the other IST's are
higher priority than NMI?

> > Am I correct in understanding that there can only be one level of NMI
> > nesting at any given time?  If so, could we make it easier on the
> > unwinder by putting the nested NMI on a separate software stack, so the
> > "next stack" pointers are always in the same place?  Or am I just being
> > naive?
> 
> I think you're being naive :)
> 
> But we don't really need the unwinder to be 100% faithful here.  If we have:
> 
> task stack
> NMI
> nested NMI
> 
> then the nested NMI code won't call into C and thus it should be
> impossible to ever invoke your unwinder on that state.  Instead the
> nested NMI code will fiddle with the saved regs and return in such a
> way that the outer NMI will be forced to loop through again.  So it
> *should* (assuming I'm remembering all this crap correctly) be
> sufficient to find the "outermost" pt_regs, which is sitting at
> (struct pt_regs *)(end - 12) - 1 or thereabouts and look at it's ->sp
> value.  This ought to be the same thing that the frame-based unwinder
> would naturally try to do.  But if you make this change, ISTM you
> should make it separately because it does change behavior and Linus is
> understandably leery about that.

Ok, I think that makes sense to me now.  As I understand it, the
"outermost" RIP is the authoritative one, because it was written by the
original NMI.  Any nested NMIs will update the original and/or iret
RIPs, which will only ever point to NMI entry code, and so they should
be ignored.

But I think there's a case where this wouldn't work:

task stack
NMI
IST
stack dump

If the IST interrupt hits before the NMI has a chance to update the
outermost regs, the authoritative RIP would be the original one written
by HW, right?

> Hmm.  I wonder if it would make sense to decode this thing both ways
> and display it.  So show_trace_log_lvl() could print something like:
> 
> <#DB (0xwhatever000-0xwhateverfff), next frame is at 0xsomething>
> 
> and, in the case where the frame unwinder disagrees, it'll at least be
> visible in that 0xsomething won't be between 0xwhatever000 and
> 0xwhateverfff.
> 
> Then Linus is happy because the unwinder works just like it did before
> but people like me are also happy because it's clearer what's going
> on.  FWIW, I've personally debugged crashes in the NMI code where the
> current unwinder falls apart completely and it's not fun -- having a
> display indicating that the unwinder got confused would be nice.

Hm, maybe.  Another idea would be to have the unwinder print some kind
of warning if it goes off the rails.  It should be able to detect that,
because every stack trace should end at a user pt_regs.

-- 
Josh


Re: [PATCH 10/19] x86/dumpstack: add get_stack_info() interface

2016-07-25 Thread Andy Lutomirski
On Sat, Jul 23, 2016 at 7:04 AM, Josh Poimboeuf  wrote:
> On Fri, Jul 22, 2016 at 05:15:03PM -0700, Andy Lutomirski wrote:
>> On Fri, Jul 22, 2016 at 4:54 PM, Josh Poimboeuf  wrote:
>> >> > +static bool in_hardirq_stack(unsigned long *stack, struct stack_info 
>> >> > *info,
>> >> > +unsigned long *visit_mask)
>> >> > +{
>> >> > +   unsigned long *begin = (unsigned long 
>> >> > *)this_cpu_read(hardirq_stack);
>> >> > +   unsigned long *end   = begin + (THREAD_SIZE / sizeof(long));
>> >> > +
>> >> > +   if (stack < begin || stack >= end)
>> >> > +   return false;
>> >> > +
>> >> > +   if (visit_mask && test_and_set_bit(STACK_TYPE_IRQ, visit_mask))
>> >> > +   return false;
>> >> > +
>> >> > +   info->type  = STACK_TYPE_IRQ;
>> >> > +   info->begin = begin;
>> >> > +   info->end   = end;
>> >> > +   info->next  = (unsigned long *)*begin;
>> >>
>> >> This works, but it's a bit magic.  I don't suppose we could get rid of
>> >> this ->next thing entirely and teach show_stack_log_lvl(), etc. to
>> >> move from stack to stack by querying the stack type of whatever the
>> >> frame base address is if the frame base address ends up being out of
>> >> bounds for the current stack?  Or maybe the unwinder could even do
>> >> this by itself.
>> >
>> > I'm not quite sure what you mean here.  The ->next stack pointer is
>> > quite useful and it abstracts that ugliness away from the callers of
>> > get_stack_info().  I'm open to any specific suggestions.
>>
>> So far I've found two users of this thing.  One is
>> show_stack_log_lvl(), and it makes sense there, but maybe
>> info->heuristic_next_stack would be a better name.  The other is the
>> unwinder itself, and I think that walking from stack to stack using
>> this heuristic is the wrong approach there, at least in the long term.
>> I'd rather we just followed the bp chain wherever it leads us, as long
>> as it leads us to a valid stack that we haven't visited before.
>>
>> As a concrete example of what I think is wrong with the current
>> approach, ISTM it would be totally valid to implement stack switching
>> like this:
>>
>> some_func:
>>  push %rbp
>>  mov %rsp, %rbp
>>  ...
>>  mov [next stack], %rsp
>>  call some_other_func
>>  mov %rbp, %rsp
>>  pop %rbp
>>  ret
>>
>> With the current approach, you can't unwind out of that function,
>> because there's no way to populate info->next.  I'm not actually
>> suggesting that the kernel should ever do such a thing on x86, and my
>> proposed rewrite of the IRQ stack code [1] will be fully compatible
>> with your approach, but it seems odd to me that the unwinder should
>> depend on idea that the stacks in use are chained together in a way
>> that can be decoded without .  (But maybe some of the Go compilers do
>> work this way -- I've never looked at their precise stack layout.)
>
> I don't think relying on frame pointers to switch between stacks is
> necessarily a good idea:
>
> - It requires CONFIG_FRAME_POINTER, which makes it unwinder-specific.
>   The current approach is unwinder-agnostic.
>
> - Instead of relying on a single correct "next stack" pointer, it
>   requires relying on potentially dozens of correct frame pointers,
>   across multiple stacks.  So a lot of things have to go right, instead
>   of just one.  And then show_trace_log_lvl() becomes more dependent on
>   the unwinder not screwing things up.

That's a fair point, at least for show_trace_log_lvl().  So let's
leave it alone for now.  We can always revisit it later.

>>
>> If you can write it as:
>>
>> struct pt_regs *regs = (struct pt_regs *)end - 1;
>> info->next = regs->sp;
>>
>> and it still works, then no comment required :)
>
> Yeah.  in_irq_stack() does something similar, though it uses end[-1].
> And its regs are actually stored on the thread stack.  So something
> doesn't quite add up for irqs.  I still need to do some homework there.

I can do your homework for you: the irq stacks doesn't contain pt_regs.

The current code is quite hard to understand, but this patch of mine
(which I'll try to dust off and send in soon) cleans it up and should
be much easier to understand:

https://git.kernel.org/cgit/linux/kernel/git/luto/linux.git/commit/?h=x86/entry_ist=2231ec7e0bcc1a2bc94a17081511ab54cc6badd1

So a comment like /* When the IRQ stack is in use, the top word stores
the previous stack pointer. */ should do the trick.

>> >
>> > Unless I'm missing something, I think it should be fine for nested NMIs,
>> > since they're all on the same stack.  I can try to test it.  What in
>> > particular are you worried about?
>> >
>>
>> The top of the NMI frame contains no less than *three* (SS, SP, FLAGS,
>> CS, IP) records.  Off the top of my head, the record that matters is
>> the third one, so it should be regs[-15].  If an MCE hits while this
>> mess is being set up, good luck unwinding *that*.  If you really 

Re: [PATCH 10/19] x86/dumpstack: add get_stack_info() interface

2016-07-25 Thread Andy Lutomirski
On Sat, Jul 23, 2016 at 7:04 AM, Josh Poimboeuf  wrote:
> On Fri, Jul 22, 2016 at 05:15:03PM -0700, Andy Lutomirski wrote:
>> On Fri, Jul 22, 2016 at 4:54 PM, Josh Poimboeuf  wrote:
>> >> > +static bool in_hardirq_stack(unsigned long *stack, struct stack_info 
>> >> > *info,
>> >> > +unsigned long *visit_mask)
>> >> > +{
>> >> > +   unsigned long *begin = (unsigned long 
>> >> > *)this_cpu_read(hardirq_stack);
>> >> > +   unsigned long *end   = begin + (THREAD_SIZE / sizeof(long));
>> >> > +
>> >> > +   if (stack < begin || stack >= end)
>> >> > +   return false;
>> >> > +
>> >> > +   if (visit_mask && test_and_set_bit(STACK_TYPE_IRQ, visit_mask))
>> >> > +   return false;
>> >> > +
>> >> > +   info->type  = STACK_TYPE_IRQ;
>> >> > +   info->begin = begin;
>> >> > +   info->end   = end;
>> >> > +   info->next  = (unsigned long *)*begin;
>> >>
>> >> This works, but it's a bit magic.  I don't suppose we could get rid of
>> >> this ->next thing entirely and teach show_stack_log_lvl(), etc. to
>> >> move from stack to stack by querying the stack type of whatever the
>> >> frame base address is if the frame base address ends up being out of
>> >> bounds for the current stack?  Or maybe the unwinder could even do
>> >> this by itself.
>> >
>> > I'm not quite sure what you mean here.  The ->next stack pointer is
>> > quite useful and it abstracts that ugliness away from the callers of
>> > get_stack_info().  I'm open to any specific suggestions.
>>
>> So far I've found two users of this thing.  One is
>> show_stack_log_lvl(), and it makes sense there, but maybe
>> info->heuristic_next_stack would be a better name.  The other is the
>> unwinder itself, and I think that walking from stack to stack using
>> this heuristic is the wrong approach there, at least in the long term.
>> I'd rather we just followed the bp chain wherever it leads us, as long
>> as it leads us to a valid stack that we haven't visited before.
>>
>> As a concrete example of what I think is wrong with the current
>> approach, ISTM it would be totally valid to implement stack switching
>> like this:
>>
>> some_func:
>>  push %rbp
>>  mov %rsp, %rbp
>>  ...
>>  mov [next stack], %rsp
>>  call some_other_func
>>  mov %rbp, %rsp
>>  pop %rbp
>>  ret
>>
>> With the current approach, you can't unwind out of that function,
>> because there's no way to populate info->next.  I'm not actually
>> suggesting that the kernel should ever do such a thing on x86, and my
>> proposed rewrite of the IRQ stack code [1] will be fully compatible
>> with your approach, but it seems odd to me that the unwinder should
>> depend on idea that the stacks in use are chained together in a way
>> that can be decoded without .  (But maybe some of the Go compilers do
>> work this way -- I've never looked at their precise stack layout.)
>
> I don't think relying on frame pointers to switch between stacks is
> necessarily a good idea:
>
> - It requires CONFIG_FRAME_POINTER, which makes it unwinder-specific.
>   The current approach is unwinder-agnostic.
>
> - Instead of relying on a single correct "next stack" pointer, it
>   requires relying on potentially dozens of correct frame pointers,
>   across multiple stacks.  So a lot of things have to go right, instead
>   of just one.  And then show_trace_log_lvl() becomes more dependent on
>   the unwinder not screwing things up.

That's a fair point, at least for show_trace_log_lvl().  So let's
leave it alone for now.  We can always revisit it later.

>>
>> If you can write it as:
>>
>> struct pt_regs *regs = (struct pt_regs *)end - 1;
>> info->next = regs->sp;
>>
>> and it still works, then no comment required :)
>
> Yeah.  in_irq_stack() does something similar, though it uses end[-1].
> And its regs are actually stored on the thread stack.  So something
> doesn't quite add up for irqs.  I still need to do some homework there.

I can do your homework for you: the irq stacks doesn't contain pt_regs.

The current code is quite hard to understand, but this patch of mine
(which I'll try to dust off and send in soon) cleans it up and should
be much easier to understand:

https://git.kernel.org/cgit/linux/kernel/git/luto/linux.git/commit/?h=x86/entry_ist=2231ec7e0bcc1a2bc94a17081511ab54cc6badd1

So a comment like /* When the IRQ stack is in use, the top word stores
the previous stack pointer. */ should do the trick.

>> >
>> > Unless I'm missing something, I think it should be fine for nested NMIs,
>> > since they're all on the same stack.  I can try to test it.  What in
>> > particular are you worried about?
>> >
>>
>> The top of the NMI frame contains no less than *three* (SS, SP, FLAGS,
>> CS, IP) records.  Off the top of my head, the record that matters is
>> the third one, so it should be regs[-15].  If an MCE hits while this
>> mess is being set up, good luck unwinding *that*.  If you really want
>> to know, take a deep breath, read 

Re: [PATCH 10/19] x86/dumpstack: add get_stack_info() interface

2016-07-23 Thread Josh Poimboeuf
On Fri, Jul 22, 2016 at 05:15:03PM -0700, Andy Lutomirski wrote:
> On Fri, Jul 22, 2016 at 4:54 PM, Josh Poimboeuf  wrote:
> >> > +static bool in_hardirq_stack(unsigned long *stack, struct stack_info 
> >> > *info,
> >> > +unsigned long *visit_mask)
> >> > +{
> >> > +   unsigned long *begin = (unsigned long 
> >> > *)this_cpu_read(hardirq_stack);
> >> > +   unsigned long *end   = begin + (THREAD_SIZE / sizeof(long));
> >> > +
> >> > +   if (stack < begin || stack >= end)
> >> > +   return false;
> >> > +
> >> > +   if (visit_mask && test_and_set_bit(STACK_TYPE_IRQ, visit_mask))
> >> > +   return false;
> >> > +
> >> > +   info->type  = STACK_TYPE_IRQ;
> >> > +   info->begin = begin;
> >> > +   info->end   = end;
> >> > +   info->next  = (unsigned long *)*begin;
> >>
> >> This works, but it's a bit magic.  I don't suppose we could get rid of
> >> this ->next thing entirely and teach show_stack_log_lvl(), etc. to
> >> move from stack to stack by querying the stack type of whatever the
> >> frame base address is if the frame base address ends up being out of
> >> bounds for the current stack?  Or maybe the unwinder could even do
> >> this by itself.
> >
> > I'm not quite sure what you mean here.  The ->next stack pointer is
> > quite useful and it abstracts that ugliness away from the callers of
> > get_stack_info().  I'm open to any specific suggestions.
> 
> So far I've found two users of this thing.  One is
> show_stack_log_lvl(), and it makes sense there, but maybe
> info->heuristic_next_stack would be a better name.  The other is the
> unwinder itself, and I think that walking from stack to stack using
> this heuristic is the wrong approach there, at least in the long term.
> I'd rather we just followed the bp chain wherever it leads us, as long
> as it leads us to a valid stack that we haven't visited before.
>
> As a concrete example of what I think is wrong with the current
> approach, ISTM it would be totally valid to implement stack switching
> like this:
> 
> some_func:
>  push %rbp
>  mov %rsp, %rbp
>  ...
>  mov [next stack], %rsp
>  call some_other_func
>  mov %rbp, %rsp
>  pop %rbp
>  ret
> 
> With the current approach, you can't unwind out of that function,
> because there's no way to populate info->next.  I'm not actually
> suggesting that the kernel should ever do such a thing on x86, and my
> proposed rewrite of the IRQ stack code [1] will be fully compatible
> with your approach, but it seems odd to me that the unwinder should
> depend on idea that the stacks in use are chained together in a way
> that can be decoded without .  (But maybe some of the Go compilers do
> work this way -- I've never looked at their precise stack layout.)

I don't think relying on frame pointers to switch between stacks is
necessarily a good idea:

- It requires CONFIG_FRAME_POINTER, which makes it unwinder-specific.
  The current approach is unwinder-agnostic.

- Instead of relying on a single correct "next stack" pointer, it
  requires relying on potentially dozens of correct frame pointers,
  across multiple stacks.  So a lot of things have to go right, instead
  of just one.  And then show_trace_log_lvl() becomes more dependent on
  the unwinder not screwing things up.

> Also, if you ever intend to port this thing to other architectures, I
> think there are architectures that have separate exception stacks and
> that track the next available slot on those stacks dynamically.  I
> think that x86_32 is an example of this if task gates are used in a
> back-and-forth manner, although Linux doesn't do that.  (x86_64 should
> have done this for IST, but it didn't.)  On those architectures, you
> can have two separate switches onto the same stack live at the same
> time, and your current approach won't work.  (Even if you make the
> change I'm suggesting, visit_mask will break, too, but fixing that
> would be a much less invasive change.)
>
> Am I making any sense?  This is a suggestion for making it better, not
> something I see as a requirement for getting the x86 code upstream.

I think porting these interfaces to other architectures could eventually
be a good idea, and you're right that the current approach might need to
be tweaked in order to work everywhere.  (But I agree this needs more
thought and this discussion can wait until later.)

> >> > +static bool in_exception_stack(unsigned long *s, struct stack_info 
> >> > *info,
> >> > +  unsigned long *visit_mask)
> >> >  {
> >> > unsigned long stack = (unsigned long)s;
> >> > unsigned long begin, end;
> >> > @@ -44,55 +63,62 @@ static unsigned long *in_exception_stack(unsigned 
> >> > long *s, char **name,
> >> > if (stack < begin || stack >= end)
> >> > continue;
> >> >
> >> > -   if (test_and_set_bit(k, visit_mask))
> >> > +   if 

Re: [PATCH 10/19] x86/dumpstack: add get_stack_info() interface

2016-07-23 Thread Josh Poimboeuf
On Fri, Jul 22, 2016 at 05:15:03PM -0700, Andy Lutomirski wrote:
> On Fri, Jul 22, 2016 at 4:54 PM, Josh Poimboeuf  wrote:
> >> > +static bool in_hardirq_stack(unsigned long *stack, struct stack_info 
> >> > *info,
> >> > +unsigned long *visit_mask)
> >> > +{
> >> > +   unsigned long *begin = (unsigned long 
> >> > *)this_cpu_read(hardirq_stack);
> >> > +   unsigned long *end   = begin + (THREAD_SIZE / sizeof(long));
> >> > +
> >> > +   if (stack < begin || stack >= end)
> >> > +   return false;
> >> > +
> >> > +   if (visit_mask && test_and_set_bit(STACK_TYPE_IRQ, visit_mask))
> >> > +   return false;
> >> > +
> >> > +   info->type  = STACK_TYPE_IRQ;
> >> > +   info->begin = begin;
> >> > +   info->end   = end;
> >> > +   info->next  = (unsigned long *)*begin;
> >>
> >> This works, but it's a bit magic.  I don't suppose we could get rid of
> >> this ->next thing entirely and teach show_stack_log_lvl(), etc. to
> >> move from stack to stack by querying the stack type of whatever the
> >> frame base address is if the frame base address ends up being out of
> >> bounds for the current stack?  Or maybe the unwinder could even do
> >> this by itself.
> >
> > I'm not quite sure what you mean here.  The ->next stack pointer is
> > quite useful and it abstracts that ugliness away from the callers of
> > get_stack_info().  I'm open to any specific suggestions.
> 
> So far I've found two users of this thing.  One is
> show_stack_log_lvl(), and it makes sense there, but maybe
> info->heuristic_next_stack would be a better name.  The other is the
> unwinder itself, and I think that walking from stack to stack using
> this heuristic is the wrong approach there, at least in the long term.
> I'd rather we just followed the bp chain wherever it leads us, as long
> as it leads us to a valid stack that we haven't visited before.
>
> As a concrete example of what I think is wrong with the current
> approach, ISTM it would be totally valid to implement stack switching
> like this:
> 
> some_func:
>  push %rbp
>  mov %rsp, %rbp
>  ...
>  mov [next stack], %rsp
>  call some_other_func
>  mov %rbp, %rsp
>  pop %rbp
>  ret
> 
> With the current approach, you can't unwind out of that function,
> because there's no way to populate info->next.  I'm not actually
> suggesting that the kernel should ever do such a thing on x86, and my
> proposed rewrite of the IRQ stack code [1] will be fully compatible
> with your approach, but it seems odd to me that the unwinder should
> depend on idea that the stacks in use are chained together in a way
> that can be decoded without .  (But maybe some of the Go compilers do
> work this way -- I've never looked at their precise stack layout.)

I don't think relying on frame pointers to switch between stacks is
necessarily a good idea:

- It requires CONFIG_FRAME_POINTER, which makes it unwinder-specific.
  The current approach is unwinder-agnostic.

- Instead of relying on a single correct "next stack" pointer, it
  requires relying on potentially dozens of correct frame pointers,
  across multiple stacks.  So a lot of things have to go right, instead
  of just one.  And then show_trace_log_lvl() becomes more dependent on
  the unwinder not screwing things up.

> Also, if you ever intend to port this thing to other architectures, I
> think there are architectures that have separate exception stacks and
> that track the next available slot on those stacks dynamically.  I
> think that x86_32 is an example of this if task gates are used in a
> back-and-forth manner, although Linux doesn't do that.  (x86_64 should
> have done this for IST, but it didn't.)  On those architectures, you
> can have two separate switches onto the same stack live at the same
> time, and your current approach won't work.  (Even if you make the
> change I'm suggesting, visit_mask will break, too, but fixing that
> would be a much less invasive change.)
>
> Am I making any sense?  This is a suggestion for making it better, not
> something I see as a requirement for getting the x86 code upstream.

I think porting these interfaces to other architectures could eventually
be a good idea, and you're right that the current approach might need to
be tweaked in order to work everywhere.  (But I agree this needs more
thought and this discussion can wait until later.)

> >> > +static bool in_exception_stack(unsigned long *s, struct stack_info 
> >> > *info,
> >> > +  unsigned long *visit_mask)
> >> >  {
> >> > unsigned long stack = (unsigned long)s;
> >> > unsigned long begin, end;
> >> > @@ -44,55 +63,62 @@ static unsigned long *in_exception_stack(unsigned 
> >> > long *s, char **name,
> >> > if (stack < begin || stack >= end)
> >> > continue;
> >> >
> >> > -   if (test_and_set_bit(k, visit_mask))
> >> > +   if (visit_mask &&
> >> 

Re: [PATCH 10/19] x86/dumpstack: add get_stack_info() interface

2016-07-23 Thread Josh Poimboeuf
On Fri, Jul 22, 2016 at 04:52:10PM -0700, Andy Lutomirski wrote:
> On Fri, Jul 22, 2016 at 4:26 PM, Andy Lutomirski  wrote:
> > On Thu, Jul 21, 2016 at 2:21 PM, Josh Poimboeuf  wrote:
> >> valid_stack_ptr() is buggy: it assumes that all stacks are of size
> >> THREAD_SIZE, which is not true for exception stacks.  So the
> >> walk_stack() callbacks will need to know the location of the beginning
> >> of the stack as well as the end.
> >>
> >> Another issue is that in general the various features of a stack (type,
> >> size, next stack pointer, description string) are scattered around in
> >> various places throughout the stack dump code.
> >
> > I finally figured out what visit_info is.  But would it make more
> > sense to track it in the unwind state so that the unwinder can
> > directly make sure it doesn't start looping?
> >
> 
> I just realized that it *is* in the unwind state.  But maybe this code
> in update_stack_state:
> 
> sp = info->next;
> if (!sp).
> goto unknown;
> 
> if (get_stack_info(sp, state->task, info, >stack_mask))
> goto unknown;
> 
> if (!on_stack(info, addr, len))
> goto unknown;
> 
> should do something like:
> 
> if (get_stack_info(addr, ...))
>   goto unknown.
> 
> sp = info->end;
> 
> instead.  Alternatively, maybe it would make sense to keep sp as is
> (have update_stack_state return bool instead of returning a pointer)
> so that a frame that switches stacks still shows the actual sp at the
> time that the frame called whatever the it called.
> 
> I'm really quite confused by what state->sp means in your current
> code.  In the non-stack-switching case (everything is on the thread
> stack), it appears to always match state->bp.  Am I missing something?
>  If I'm understanding this correctly, when you're pointing at a call
> frame, state->bp is that frame's base address (the top of the stack
> frame), unwind_get_return_address() returns the address to which that
> frame would return, and, in the future, unwind_get_gpr(UNWIND_DI) or
> whatever it ends up looking like will return RDI at the time that the
> frame called whatever function it called, if known.  By that logic,
> shouldn't state->sp be sp on entry to the call instruction?  (Or could
> sp just be removed?  Does it do anything?)

Yeah, I think sp has no purpose and can actually just be removed.

(It was leftover from a previous iteration of the code where it did have
a purpose and I forgot to remove it.)

> I guess the reason I'm still not 100% comfortable with the idea that
> pt_regs frames don't exist a real frames is that I keep waffling as to
> how I should think about the regs associated with a frame in the
> future DWARF world.  I think I imagine them being the regs at the time
> that the frame did it's call to the next frame, which, by an
> admittedly weak analogy, means that the pt_regs state would be the
> regs at the time that the exception or interrupt happened.  That
> offers a third silly option for dealing with the annoying '?': emit
> two frames for a pt_regs, but have the frame in the entry code show
> NULL for its return address because it's not a normal return.

Well, I'd say let's not get ahead of ourselves.  I think the current
regs-aren't-a-frame design works fine for now, and the code is fairly
simple.  If/when we get a DWARF unwinder, we can revisit that decision.

-- 
Josh


Re: [PATCH 10/19] x86/dumpstack: add get_stack_info() interface

2016-07-23 Thread Josh Poimboeuf
On Fri, Jul 22, 2016 at 04:52:10PM -0700, Andy Lutomirski wrote:
> On Fri, Jul 22, 2016 at 4:26 PM, Andy Lutomirski  wrote:
> > On Thu, Jul 21, 2016 at 2:21 PM, Josh Poimboeuf  wrote:
> >> valid_stack_ptr() is buggy: it assumes that all stacks are of size
> >> THREAD_SIZE, which is not true for exception stacks.  So the
> >> walk_stack() callbacks will need to know the location of the beginning
> >> of the stack as well as the end.
> >>
> >> Another issue is that in general the various features of a stack (type,
> >> size, next stack pointer, description string) are scattered around in
> >> various places throughout the stack dump code.
> >
> > I finally figured out what visit_info is.  But would it make more
> > sense to track it in the unwind state so that the unwinder can
> > directly make sure it doesn't start looping?
> >
> 
> I just realized that it *is* in the unwind state.  But maybe this code
> in update_stack_state:
> 
> sp = info->next;
> if (!sp).
> goto unknown;
> 
> if (get_stack_info(sp, state->task, info, >stack_mask))
> goto unknown;
> 
> if (!on_stack(info, addr, len))
> goto unknown;
> 
> should do something like:
> 
> if (get_stack_info(addr, ...))
>   goto unknown.
> 
> sp = info->end;
> 
> instead.  Alternatively, maybe it would make sense to keep sp as is
> (have update_stack_state return bool instead of returning a pointer)
> so that a frame that switches stacks still shows the actual sp at the
> time that the frame called whatever the it called.
> 
> I'm really quite confused by what state->sp means in your current
> code.  In the non-stack-switching case (everything is on the thread
> stack), it appears to always match state->bp.  Am I missing something?
>  If I'm understanding this correctly, when you're pointing at a call
> frame, state->bp is that frame's base address (the top of the stack
> frame), unwind_get_return_address() returns the address to which that
> frame would return, and, in the future, unwind_get_gpr(UNWIND_DI) or
> whatever it ends up looking like will return RDI at the time that the
> frame called whatever function it called, if known.  By that logic,
> shouldn't state->sp be sp on entry to the call instruction?  (Or could
> sp just be removed?  Does it do anything?)

Yeah, I think sp has no purpose and can actually just be removed.

(It was leftover from a previous iteration of the code where it did have
a purpose and I forgot to remove it.)

> I guess the reason I'm still not 100% comfortable with the idea that
> pt_regs frames don't exist a real frames is that I keep waffling as to
> how I should think about the regs associated with a frame in the
> future DWARF world.  I think I imagine them being the regs at the time
> that the frame did it's call to the next frame, which, by an
> admittedly weak analogy, means that the pt_regs state would be the
> regs at the time that the exception or interrupt happened.  That
> offers a third silly option for dealing with the annoying '?': emit
> two frames for a pt_regs, but have the frame in the entry code show
> NULL for its return address because it's not a normal return.

Well, I'd say let's not get ahead of ourselves.  I think the current
regs-aren't-a-frame design works fine for now, and the code is fairly
simple.  If/when we get a DWARF unwinder, we can revisit that decision.

-- 
Josh


Re: [PATCH 10/19] x86/dumpstack: add get_stack_info() interface

2016-07-22 Thread Andy Lutomirski
On Fri, Jul 22, 2016 at 4:54 PM, Josh Poimboeuf  wrote:
>> > +static bool in_hardirq_stack(unsigned long *stack, struct stack_info 
>> > *info,
>> > +unsigned long *visit_mask)
>> > +{
>> > +   unsigned long *begin = (unsigned long 
>> > *)this_cpu_read(hardirq_stack);
>> > +   unsigned long *end   = begin + (THREAD_SIZE / sizeof(long));
>> > +
>> > +   if (stack < begin || stack >= end)
>> > +   return false;
>> > +
>> > +   if (visit_mask && test_and_set_bit(STACK_TYPE_IRQ, visit_mask))
>> > +   return false;
>> > +
>> > +   info->type  = STACK_TYPE_IRQ;
>> > +   info->begin = begin;
>> > +   info->end   = end;
>> > +   info->next  = (unsigned long *)*begin;
>>
>> This works, but it's a bit magic.  I don't suppose we could get rid of
>> this ->next thing entirely and teach show_stack_log_lvl(), etc. to
>> move from stack to stack by querying the stack type of whatever the
>> frame base address is if the frame base address ends up being out of
>> bounds for the current stack?  Or maybe the unwinder could even do
>> this by itself.
>
> I'm not quite sure what you mean here.  The ->next stack pointer is
> quite useful and it abstracts that ugliness away from the callers of
> get_stack_info().  I'm open to any specific suggestions.

So far I've found two users of this thing.  One is
show_stack_log_lvl(), and it makes sense there, but maybe
info->heuristic_next_stack would be a better name.  The other is the
unwinder itself, and I think that walking from stack to stack using
this heuristic is the wrong approach there, at least in the long term.
I'd rather we just followed the bp chain wherever it leads us, as long
as it leads us to a valid stack that we haven't visited before.

As a concrete example of what I think is wrong with the current
approach, ISTM it would be totally valid to implement stack switching
like this:

some_func:
 push %rbp
 mov %rsp, %rbp
 ...
 mov [next stack], %rsp
 call some_other_func
 mov %rbp, %rsp
 pop %rbp
 ret

With the current approach, you can't unwind out of that function,
because there's no way to populate info->next.  I'm not actually
suggesting that the kernel should ever do such a thing on x86, and my
proposed rewrite of the IRQ stack code [1] will be fully compatible
with your approach, but it seems odd to me that the unwinder should
depend on idea that the stacks in use are chained together in a way
that can be decoded without .  (But maybe some of the Go compilers do
work this way -- I've never looked at their precise stack layout.)

Also, if you ever intend to port this thing to other architectures, I
think there are architectures that have separate exception stacks and
that track the next available slot on those stacks dynamically.  I
think that x86_32 is an example of this if task gates are used in a
back-and-forth manner, although Linux doesn't do that.  (x86_64 should
have done this for IST, but it didn't.)  On those architectures, you
can have two separate switches onto the same stack live at the same
time, and your current approach won't work.  (Even if you make the
change I'm suggesting, visit_mask will break, too, but fixing that
would be a much less invasive change.)

Am I making any sense?  This is a suggestion for making it better, not
something I see as a requirement for getting the x86 code upstream.

>>
>> > +static bool in_exception_stack(unsigned long *s, struct stack_info *info,
>> > +  unsigned long *visit_mask)
>> >  {
>> > unsigned long stack = (unsigned long)s;
>> > unsigned long begin, end;
>> > @@ -44,55 +63,62 @@ static unsigned long *in_exception_stack(unsigned long 
>> > *s, char **name,
>> > if (stack < begin || stack >= end)
>> > continue;
>> >
>> > -   if (test_and_set_bit(k, visit_mask))
>> > +   if (visit_mask &&
>> > +   test_and_set_bit(STACK_TYPE_EXCEPTION + k, visit_mask))
>> > return false;
>> >
>> > -   *name = exception_stack_names[k];
>> > -   return (unsigned long *)end;
>> > +   info->type  = STACK_TYPE_EXCEPTION + k;
>> > +   info->begin = (unsigned long *)begin;
>> > +   info->end   = (unsigned long *)end;
>> > +   info->next  = (unsigned long *)info->end[-2];
>>
>> This is so magical that I don't immediately see why it's correct.
>> Presumably it's because the thing two slots down from the top of the
>> stack is regs->sp?  If so, that needs a comment.
>
> Heck if I know, I just stole it from dump_trace() ;-)
>
> I'll figure it out and add a comment.

If you can write it as:

struct pt_regs *regs = (struct pt_regs *)end - 1;
info->next = regs->sp;

and it still works, then no comment required :)

>
>> But again, couldn't we use the fact that we now know how to decode
>> 

Re: [PATCH 10/19] x86/dumpstack: add get_stack_info() interface

2016-07-22 Thread Andy Lutomirski
On Fri, Jul 22, 2016 at 4:54 PM, Josh Poimboeuf  wrote:
>> > +static bool in_hardirq_stack(unsigned long *stack, struct stack_info 
>> > *info,
>> > +unsigned long *visit_mask)
>> > +{
>> > +   unsigned long *begin = (unsigned long 
>> > *)this_cpu_read(hardirq_stack);
>> > +   unsigned long *end   = begin + (THREAD_SIZE / sizeof(long));
>> > +
>> > +   if (stack < begin || stack >= end)
>> > +   return false;
>> > +
>> > +   if (visit_mask && test_and_set_bit(STACK_TYPE_IRQ, visit_mask))
>> > +   return false;
>> > +
>> > +   info->type  = STACK_TYPE_IRQ;
>> > +   info->begin = begin;
>> > +   info->end   = end;
>> > +   info->next  = (unsigned long *)*begin;
>>
>> This works, but it's a bit magic.  I don't suppose we could get rid of
>> this ->next thing entirely and teach show_stack_log_lvl(), etc. to
>> move from stack to stack by querying the stack type of whatever the
>> frame base address is if the frame base address ends up being out of
>> bounds for the current stack?  Or maybe the unwinder could even do
>> this by itself.
>
> I'm not quite sure what you mean here.  The ->next stack pointer is
> quite useful and it abstracts that ugliness away from the callers of
> get_stack_info().  I'm open to any specific suggestions.

So far I've found two users of this thing.  One is
show_stack_log_lvl(), and it makes sense there, but maybe
info->heuristic_next_stack would be a better name.  The other is the
unwinder itself, and I think that walking from stack to stack using
this heuristic is the wrong approach there, at least in the long term.
I'd rather we just followed the bp chain wherever it leads us, as long
as it leads us to a valid stack that we haven't visited before.

As a concrete example of what I think is wrong with the current
approach, ISTM it would be totally valid to implement stack switching
like this:

some_func:
 push %rbp
 mov %rsp, %rbp
 ...
 mov [next stack], %rsp
 call some_other_func
 mov %rbp, %rsp
 pop %rbp
 ret

With the current approach, you can't unwind out of that function,
because there's no way to populate info->next.  I'm not actually
suggesting that the kernel should ever do such a thing on x86, and my
proposed rewrite of the IRQ stack code [1] will be fully compatible
with your approach, but it seems odd to me that the unwinder should
depend on idea that the stacks in use are chained together in a way
that can be decoded without .  (But maybe some of the Go compilers do
work this way -- I've never looked at their precise stack layout.)

Also, if you ever intend to port this thing to other architectures, I
think there are architectures that have separate exception stacks and
that track the next available slot on those stacks dynamically.  I
think that x86_32 is an example of this if task gates are used in a
back-and-forth manner, although Linux doesn't do that.  (x86_64 should
have done this for IST, but it didn't.)  On those architectures, you
can have two separate switches onto the same stack live at the same
time, and your current approach won't work.  (Even if you make the
change I'm suggesting, visit_mask will break, too, but fixing that
would be a much less invasive change.)

Am I making any sense?  This is a suggestion for making it better, not
something I see as a requirement for getting the x86 code upstream.

>>
>> > +static bool in_exception_stack(unsigned long *s, struct stack_info *info,
>> > +  unsigned long *visit_mask)
>> >  {
>> > unsigned long stack = (unsigned long)s;
>> > unsigned long begin, end;
>> > @@ -44,55 +63,62 @@ static unsigned long *in_exception_stack(unsigned long 
>> > *s, char **name,
>> > if (stack < begin || stack >= end)
>> > continue;
>> >
>> > -   if (test_and_set_bit(k, visit_mask))
>> > +   if (visit_mask &&
>> > +   test_and_set_bit(STACK_TYPE_EXCEPTION + k, visit_mask))
>> > return false;
>> >
>> > -   *name = exception_stack_names[k];
>> > -   return (unsigned long *)end;
>> > +   info->type  = STACK_TYPE_EXCEPTION + k;
>> > +   info->begin = (unsigned long *)begin;
>> > +   info->end   = (unsigned long *)end;
>> > +   info->next  = (unsigned long *)info->end[-2];
>>
>> This is so magical that I don't immediately see why it's correct.
>> Presumably it's because the thing two slots down from the top of the
>> stack is regs->sp?  If so, that needs a comment.
>
> Heck if I know, I just stole it from dump_trace() ;-)
>
> I'll figure it out and add a comment.

If you can write it as:

struct pt_regs *regs = (struct pt_regs *)end - 1;
info->next = regs->sp;

and it still works, then no comment required :)

>
>> But again, couldn't we use the fact that we now know how to decode
>> pt_regs to avoid needing 

Re: [PATCH 10/19] x86/dumpstack: add get_stack_info() interface

2016-07-22 Thread Josh Poimboeuf
On Fri, Jul 22, 2016 at 04:26:46PM -0700, Andy Lutomirski wrote:
> On Thu, Jul 21, 2016 at 2:21 PM, Josh Poimboeuf  wrote:
> > valid_stack_ptr() is buggy: it assumes that all stacks are of size
> > THREAD_SIZE, which is not true for exception stacks.  So the
> > walk_stack() callbacks will need to know the location of the beginning
> > of the stack as well as the end.
> >
> > Another issue is that in general the various features of a stack (type,
> > size, next stack pointer, description string) are scattered around in
> > various places throughout the stack dump code.
> 
> I finally figured out what visit_info is.  But would it make more
> sense to track it in the unwind state so that the unwinder can
> directly make sure it doesn't start looping?

Well, the unwinders aren't the only users of get_stack_info() and the
visit_mask.  show_trace_log_lvl() also uses it.

But it would probably be cleaner to at least do the visit_mask bit
testing/setting in get_stack_info() rather than in the in_*_stack()
functions.

> And please remove test_and_set_bit() -- it's pointlessly slow.

Ok.

> 
> > +static bool in_hardirq_stack(unsigned long *stack, struct stack_info *info,
> > +unsigned long *visit_mask)
> > +{
> > +   unsigned long *begin = (unsigned long 
> > *)this_cpu_read(hardirq_stack);
> > +   unsigned long *end   = begin + (THREAD_SIZE / sizeof(long));
> > +
> > +   if (stack < begin || stack >= end)
> > +   return false;
> > +
> > +   if (visit_mask && test_and_set_bit(STACK_TYPE_IRQ, visit_mask))
> > +   return false;
> > +
> > +   info->type  = STACK_TYPE_IRQ;
> > +   info->begin = begin;
> > +   info->end   = end;
> > +   info->next  = (unsigned long *)*begin;
> 
> This works, but it's a bit magic.  I don't suppose we could get rid of
> this ->next thing entirely and teach show_stack_log_lvl(), etc. to
> move from stack to stack by querying the stack type of whatever the
> frame base address is if the frame base address ends up being out of
> bounds for the current stack?  Or maybe the unwinder could even do
> this by itself.

I'm not quite sure what you mean here.  The ->next stack pointer is
quite useful and it abstracts that ugliness away from the callers of
get_stack_info().  I'm open to any specific suggestions.

> 
> > +static bool in_exception_stack(unsigned long *s, struct stack_info *info,
> > +  unsigned long *visit_mask)
> >  {
> > unsigned long stack = (unsigned long)s;
> > unsigned long begin, end;
> > @@ -44,55 +63,62 @@ static unsigned long *in_exception_stack(unsigned long 
> > *s, char **name,
> > if (stack < begin || stack >= end)
> > continue;
> >
> > -   if (test_and_set_bit(k, visit_mask))
> > +   if (visit_mask &&
> > +   test_and_set_bit(STACK_TYPE_EXCEPTION + k, visit_mask))
> > return false;
> >
> > -   *name = exception_stack_names[k];
> > -   return (unsigned long *)end;
> > +   info->type  = STACK_TYPE_EXCEPTION + k;
> > +   info->begin = (unsigned long *)begin;
> > +   info->end   = (unsigned long *)end;
> > +   info->next  = (unsigned long *)info->end[-2];
> 
> This is so magical that I don't immediately see why it's correct.
> Presumably it's because the thing two slots down from the top of the
> stack is regs->sp?  If so, that needs a comment.

Heck if I know, I just stole it from dump_trace() ;-)

I'll figure it out and add a comment.

> But again, couldn't we use the fact that we now know how to decode
> pt_regs to avoid needing this?  I can imagine it being useful as a
> fallback in the event that the unwinder fails, but this is just a
> fallback.

Yeah, this is needed as a fallback.  But I wouldn't call it "just" a
fallback: the stack dump code *needs* to be able to still traverse the
stacks if frame pointers fail.

> Also, NMI is weird and I'm wondering whether this works at
> all when trying to unwind from a looped NMI.

Unless I'm missing something, I think it should be fine for nested NMIs,
since they're all on the same stack.  I can try to test it.  What in
particular are you worried about?

> Fixing this up could be a followup after this series is in, I think --
> you're preserving existing behavior AFAICS.  I just don't particularly
> like the existing behavior.
> 
> FWIW, I think this code needs to be explicitly tested for the 32-bit
> double fault case.  It's highly magical.

Ok, I'll test it.

-- 
Josh


Re: [PATCH 10/19] x86/dumpstack: add get_stack_info() interface

2016-07-22 Thread Josh Poimboeuf
On Fri, Jul 22, 2016 at 04:26:46PM -0700, Andy Lutomirski wrote:
> On Thu, Jul 21, 2016 at 2:21 PM, Josh Poimboeuf  wrote:
> > valid_stack_ptr() is buggy: it assumes that all stacks are of size
> > THREAD_SIZE, which is not true for exception stacks.  So the
> > walk_stack() callbacks will need to know the location of the beginning
> > of the stack as well as the end.
> >
> > Another issue is that in general the various features of a stack (type,
> > size, next stack pointer, description string) are scattered around in
> > various places throughout the stack dump code.
> 
> I finally figured out what visit_info is.  But would it make more
> sense to track it in the unwind state so that the unwinder can
> directly make sure it doesn't start looping?

Well, the unwinders aren't the only users of get_stack_info() and the
visit_mask.  show_trace_log_lvl() also uses it.

But it would probably be cleaner to at least do the visit_mask bit
testing/setting in get_stack_info() rather than in the in_*_stack()
functions.

> And please remove test_and_set_bit() -- it's pointlessly slow.

Ok.

> 
> > +static bool in_hardirq_stack(unsigned long *stack, struct stack_info *info,
> > +unsigned long *visit_mask)
> > +{
> > +   unsigned long *begin = (unsigned long 
> > *)this_cpu_read(hardirq_stack);
> > +   unsigned long *end   = begin + (THREAD_SIZE / sizeof(long));
> > +
> > +   if (stack < begin || stack >= end)
> > +   return false;
> > +
> > +   if (visit_mask && test_and_set_bit(STACK_TYPE_IRQ, visit_mask))
> > +   return false;
> > +
> > +   info->type  = STACK_TYPE_IRQ;
> > +   info->begin = begin;
> > +   info->end   = end;
> > +   info->next  = (unsigned long *)*begin;
> 
> This works, but it's a bit magic.  I don't suppose we could get rid of
> this ->next thing entirely and teach show_stack_log_lvl(), etc. to
> move from stack to stack by querying the stack type of whatever the
> frame base address is if the frame base address ends up being out of
> bounds for the current stack?  Or maybe the unwinder could even do
> this by itself.

I'm not quite sure what you mean here.  The ->next stack pointer is
quite useful and it abstracts that ugliness away from the callers of
get_stack_info().  I'm open to any specific suggestions.

> 
> > +static bool in_exception_stack(unsigned long *s, struct stack_info *info,
> > +  unsigned long *visit_mask)
> >  {
> > unsigned long stack = (unsigned long)s;
> > unsigned long begin, end;
> > @@ -44,55 +63,62 @@ static unsigned long *in_exception_stack(unsigned long 
> > *s, char **name,
> > if (stack < begin || stack >= end)
> > continue;
> >
> > -   if (test_and_set_bit(k, visit_mask))
> > +   if (visit_mask &&
> > +   test_and_set_bit(STACK_TYPE_EXCEPTION + k, visit_mask))
> > return false;
> >
> > -   *name = exception_stack_names[k];
> > -   return (unsigned long *)end;
> > +   info->type  = STACK_TYPE_EXCEPTION + k;
> > +   info->begin = (unsigned long *)begin;
> > +   info->end   = (unsigned long *)end;
> > +   info->next  = (unsigned long *)info->end[-2];
> 
> This is so magical that I don't immediately see why it's correct.
> Presumably it's because the thing two slots down from the top of the
> stack is regs->sp?  If so, that needs a comment.

Heck if I know, I just stole it from dump_trace() ;-)

I'll figure it out and add a comment.

> But again, couldn't we use the fact that we now know how to decode
> pt_regs to avoid needing this?  I can imagine it being useful as a
> fallback in the event that the unwinder fails, but this is just a
> fallback.

Yeah, this is needed as a fallback.  But I wouldn't call it "just" a
fallback: the stack dump code *needs* to be able to still traverse the
stacks if frame pointers fail.

> Also, NMI is weird and I'm wondering whether this works at
> all when trying to unwind from a looped NMI.

Unless I'm missing something, I think it should be fine for nested NMIs,
since they're all on the same stack.  I can try to test it.  What in
particular are you worried about?

> Fixing this up could be a followup after this series is in, I think --
> you're preserving existing behavior AFAICS.  I just don't particularly
> like the existing behavior.
> 
> FWIW, I think this code needs to be explicitly tested for the 32-bit
> double fault case.  It's highly magical.

Ok, I'll test it.

-- 
Josh


Re: [PATCH 10/19] x86/dumpstack: add get_stack_info() interface

2016-07-22 Thread Andy Lutomirski
On Fri, Jul 22, 2016 at 4:26 PM, Andy Lutomirski  wrote:
> On Thu, Jul 21, 2016 at 2:21 PM, Josh Poimboeuf  wrote:
>> valid_stack_ptr() is buggy: it assumes that all stacks are of size
>> THREAD_SIZE, which is not true for exception stacks.  So the
>> walk_stack() callbacks will need to know the location of the beginning
>> of the stack as well as the end.
>>
>> Another issue is that in general the various features of a stack (type,
>> size, next stack pointer, description string) are scattered around in
>> various places throughout the stack dump code.
>
> I finally figured out what visit_info is.  But would it make more
> sense to track it in the unwind state so that the unwinder can
> directly make sure it doesn't start looping?
>

I just realized that it *is* in the unwind state.  But maybe this code
in update_stack_state:

sp = info->next;
if (!sp).
goto unknown;

if (get_stack_info(sp, state->task, info, >stack_mask))
goto unknown;

if (!on_stack(info, addr, len))
goto unknown;

should do something like:

if (get_stack_info(addr, ...))
  goto unknown.

sp = info->end;

instead.  Alternatively, maybe it would make sense to keep sp as is
(have update_stack_state return bool instead of returning a pointer)
so that a frame that switches stacks still shows the actual sp at the
time that the frame called whatever the it called.

I'm really quite confused by what state->sp means in your current
code.  In the non-stack-switching case (everything is on the thread
stack), it appears to always match state->bp.  Am I missing something?
 If I'm understanding this correctly, when you're pointing at a call
frame, state->bp is that frame's base address (the top of the stack
frame), unwind_get_return_address() returns the address to which that
frame would return, and, in the future, unwind_get_gpr(UNWIND_DI) or
whatever it ends up looking like will return RDI at the time that the
frame called whatever function it called, if known.  By that logic,
shouldn't state->sp be sp on entry to the call instruction?  (Or could
sp just be removed?  Does it do anything?)

I guess the reason I'm still not 100% comfortable with the idea that
pt_regs frames don't exist a real frames is that I keep waffling as to
how I should think about the regs associated with a frame in the
future DWARF world.  I think I imagine them being the regs at the time
that the frame did it's call to the next frame, which, by an
admittedly weak analogy, means that the pt_regs state would be the
regs at the time that the exception or interrupt happened.  That
offers a third silly option for dealing with the annoying '?': emit
two frames for a pt_regs, but have the frame in the entry code show
NULL for its return address because it's not a normal return.

> And please remove test_and_set_bit() -- it's pointlessly slow.
>
>> +static bool in_hardirq_stack(unsigned long *stack, struct stack_info *info,
>> +unsigned long *visit_mask)
>> +{
>> +   unsigned long *begin = (unsigned long *)this_cpu_read(hardirq_stack);
>> +   unsigned long *end   = begin + (THREAD_SIZE / sizeof(long));
>> +
>> +   if (stack < begin || stack >= end)
>> +   return false;
>> +
>> +   if (visit_mask && test_and_set_bit(STACK_TYPE_IRQ, visit_mask))
>> +   return false;
>> +
>> +   info->type  = STACK_TYPE_IRQ;
>> +   info->begin = begin;
>> +   info->end   = end;
>> +   info->next  = (unsigned long *)*begin;
>
> This works, but it's a bit magic.  I don't suppose we could get rid of
> this ->next thing entirely and teach show_stack_log_lvl(), etc. to
> move from stack to stack by querying the stack type of whatever the
> frame base address is if the frame base address ends up being out of
> bounds for the current stack?  Or maybe the unwinder could even do
> this by itself.
>
>> +static bool in_exception_stack(unsigned long *s, struct stack_info *info,
>> +  unsigned long *visit_mask)
>>  {
>> unsigned long stack = (unsigned long)s;
>> unsigned long begin, end;
>> @@ -44,55 +63,62 @@ static unsigned long *in_exception_stack(unsigned long 
>> *s, char **name,
>> if (stack < begin || stack >= end)
>> continue;
>>
>> -   if (test_and_set_bit(k, visit_mask))
>> +   if (visit_mask &&
>> +   test_and_set_bit(STACK_TYPE_EXCEPTION + k, visit_mask))
>> return false;
>>
>> -   *name = exception_stack_names[k];
>> -   return (unsigned long *)end;
>> +   info->type  = STACK_TYPE_EXCEPTION + k;
>> +   info->begin = (unsigned long *)begin;
>> +   info->end   = (unsigned long *)end;
>> +   info->next  = (unsigned long *)info->end[-2];
>
> This is so magical that I don't immediately see why 

Re: [PATCH 10/19] x86/dumpstack: add get_stack_info() interface

2016-07-22 Thread Andy Lutomirski
On Fri, Jul 22, 2016 at 4:26 PM, Andy Lutomirski  wrote:
> On Thu, Jul 21, 2016 at 2:21 PM, Josh Poimboeuf  wrote:
>> valid_stack_ptr() is buggy: it assumes that all stacks are of size
>> THREAD_SIZE, which is not true for exception stacks.  So the
>> walk_stack() callbacks will need to know the location of the beginning
>> of the stack as well as the end.
>>
>> Another issue is that in general the various features of a stack (type,
>> size, next stack pointer, description string) are scattered around in
>> various places throughout the stack dump code.
>
> I finally figured out what visit_info is.  But would it make more
> sense to track it in the unwind state so that the unwinder can
> directly make sure it doesn't start looping?
>

I just realized that it *is* in the unwind state.  But maybe this code
in update_stack_state:

sp = info->next;
if (!sp).
goto unknown;

if (get_stack_info(sp, state->task, info, >stack_mask))
goto unknown;

if (!on_stack(info, addr, len))
goto unknown;

should do something like:

if (get_stack_info(addr, ...))
  goto unknown.

sp = info->end;

instead.  Alternatively, maybe it would make sense to keep sp as is
(have update_stack_state return bool instead of returning a pointer)
so that a frame that switches stacks still shows the actual sp at the
time that the frame called whatever the it called.

I'm really quite confused by what state->sp means in your current
code.  In the non-stack-switching case (everything is on the thread
stack), it appears to always match state->bp.  Am I missing something?
 If I'm understanding this correctly, when you're pointing at a call
frame, state->bp is that frame's base address (the top of the stack
frame), unwind_get_return_address() returns the address to which that
frame would return, and, in the future, unwind_get_gpr(UNWIND_DI) or
whatever it ends up looking like will return RDI at the time that the
frame called whatever function it called, if known.  By that logic,
shouldn't state->sp be sp on entry to the call instruction?  (Or could
sp just be removed?  Does it do anything?)

I guess the reason I'm still not 100% comfortable with the idea that
pt_regs frames don't exist a real frames is that I keep waffling as to
how I should think about the regs associated with a frame in the
future DWARF world.  I think I imagine them being the regs at the time
that the frame did it's call to the next frame, which, by an
admittedly weak analogy, means that the pt_regs state would be the
regs at the time that the exception or interrupt happened.  That
offers a third silly option for dealing with the annoying '?': emit
two frames for a pt_regs, but have the frame in the entry code show
NULL for its return address because it's not a normal return.

> And please remove test_and_set_bit() -- it's pointlessly slow.
>
>> +static bool in_hardirq_stack(unsigned long *stack, struct stack_info *info,
>> +unsigned long *visit_mask)
>> +{
>> +   unsigned long *begin = (unsigned long *)this_cpu_read(hardirq_stack);
>> +   unsigned long *end   = begin + (THREAD_SIZE / sizeof(long));
>> +
>> +   if (stack < begin || stack >= end)
>> +   return false;
>> +
>> +   if (visit_mask && test_and_set_bit(STACK_TYPE_IRQ, visit_mask))
>> +   return false;
>> +
>> +   info->type  = STACK_TYPE_IRQ;
>> +   info->begin = begin;
>> +   info->end   = end;
>> +   info->next  = (unsigned long *)*begin;
>
> This works, but it's a bit magic.  I don't suppose we could get rid of
> this ->next thing entirely and teach show_stack_log_lvl(), etc. to
> move from stack to stack by querying the stack type of whatever the
> frame base address is if the frame base address ends up being out of
> bounds for the current stack?  Or maybe the unwinder could even do
> this by itself.
>
>> +static bool in_exception_stack(unsigned long *s, struct stack_info *info,
>> +  unsigned long *visit_mask)
>>  {
>> unsigned long stack = (unsigned long)s;
>> unsigned long begin, end;
>> @@ -44,55 +63,62 @@ static unsigned long *in_exception_stack(unsigned long 
>> *s, char **name,
>> if (stack < begin || stack >= end)
>> continue;
>>
>> -   if (test_and_set_bit(k, visit_mask))
>> +   if (visit_mask &&
>> +   test_and_set_bit(STACK_TYPE_EXCEPTION + k, visit_mask))
>> return false;
>>
>> -   *name = exception_stack_names[k];
>> -   return (unsigned long *)end;
>> +   info->type  = STACK_TYPE_EXCEPTION + k;
>> +   info->begin = (unsigned long *)begin;
>> +   info->end   = (unsigned long *)end;
>> +   info->next  = (unsigned long *)info->end[-2];
>
> This is so magical that I don't immediately see why it's correct.
> Presumably it's because the 

Re: [PATCH 10/19] x86/dumpstack: add get_stack_info() interface

2016-07-22 Thread Andy Lutomirski
On Thu, Jul 21, 2016 at 2:21 PM, Josh Poimboeuf  wrote:
> valid_stack_ptr() is buggy: it assumes that all stacks are of size
> THREAD_SIZE, which is not true for exception stacks.  So the
> walk_stack() callbacks will need to know the location of the beginning
> of the stack as well as the end.
>
> Another issue is that in general the various features of a stack (type,
> size, next stack pointer, description string) are scattered around in
> various places throughout the stack dump code.

I finally figured out what visit_info is.  But would it make more
sense to track it in the unwind state so that the unwinder can
directly make sure it doesn't start looping?

And please remove test_and_set_bit() -- it's pointlessly slow.

> +static bool in_hardirq_stack(unsigned long *stack, struct stack_info *info,
> +unsigned long *visit_mask)
> +{
> +   unsigned long *begin = (unsigned long *)this_cpu_read(hardirq_stack);
> +   unsigned long *end   = begin + (THREAD_SIZE / sizeof(long));
> +
> +   if (stack < begin || stack >= end)
> +   return false;
> +
> +   if (visit_mask && test_and_set_bit(STACK_TYPE_IRQ, visit_mask))
> +   return false;
> +
> +   info->type  = STACK_TYPE_IRQ;
> +   info->begin = begin;
> +   info->end   = end;
> +   info->next  = (unsigned long *)*begin;

This works, but it's a bit magic.  I don't suppose we could get rid of
this ->next thing entirely and teach show_stack_log_lvl(), etc. to
move from stack to stack by querying the stack type of whatever the
frame base address is if the frame base address ends up being out of
bounds for the current stack?  Or maybe the unwinder could even do
this by itself.

> +static bool in_exception_stack(unsigned long *s, struct stack_info *info,
> +  unsigned long *visit_mask)
>  {
> unsigned long stack = (unsigned long)s;
> unsigned long begin, end;
> @@ -44,55 +63,62 @@ static unsigned long *in_exception_stack(unsigned long 
> *s, char **name,
> if (stack < begin || stack >= end)
> continue;
>
> -   if (test_and_set_bit(k, visit_mask))
> +   if (visit_mask &&
> +   test_and_set_bit(STACK_TYPE_EXCEPTION + k, visit_mask))
> return false;
>
> -   *name = exception_stack_names[k];
> -   return (unsigned long *)end;
> +   info->type  = STACK_TYPE_EXCEPTION + k;
> +   info->begin = (unsigned long *)begin;
> +   info->end   = (unsigned long *)end;
> +   info->next  = (unsigned long *)info->end[-2];

This is so magical that I don't immediately see why it's correct.
Presumably it's because the thing two slots down from the top of the
stack is regs->sp?  If so, that needs a comment.

But again, couldn't we use the fact that we now know how to decode
pt_regs to avoid needing this?  I can imagine it being useful as a
fallback in the event that the unwinder fails, but this is just a
fallback.  Also, NMI is weird and I'm wondering whether this works at
all when trying to unwind from a looped NMI.

Fixing this up could be a followup after this series is in, I think --
you're preserving existing behavior AFAICS.  I just don't particularly
like the existing behavior.

FWIW, I think this code needs to be explicitly tested for the 32-bit
double fault case.  It's highly magical.


Re: [PATCH 10/19] x86/dumpstack: add get_stack_info() interface

2016-07-22 Thread Andy Lutomirski
On Thu, Jul 21, 2016 at 2:21 PM, Josh Poimboeuf  wrote:
> valid_stack_ptr() is buggy: it assumes that all stacks are of size
> THREAD_SIZE, which is not true for exception stacks.  So the
> walk_stack() callbacks will need to know the location of the beginning
> of the stack as well as the end.
>
> Another issue is that in general the various features of a stack (type,
> size, next stack pointer, description string) are scattered around in
> various places throughout the stack dump code.

I finally figured out what visit_info is.  But would it make more
sense to track it in the unwind state so that the unwinder can
directly make sure it doesn't start looping?

And please remove test_and_set_bit() -- it's pointlessly slow.

> +static bool in_hardirq_stack(unsigned long *stack, struct stack_info *info,
> +unsigned long *visit_mask)
> +{
> +   unsigned long *begin = (unsigned long *)this_cpu_read(hardirq_stack);
> +   unsigned long *end   = begin + (THREAD_SIZE / sizeof(long));
> +
> +   if (stack < begin || stack >= end)
> +   return false;
> +
> +   if (visit_mask && test_and_set_bit(STACK_TYPE_IRQ, visit_mask))
> +   return false;
> +
> +   info->type  = STACK_TYPE_IRQ;
> +   info->begin = begin;
> +   info->end   = end;
> +   info->next  = (unsigned long *)*begin;

This works, but it's a bit magic.  I don't suppose we could get rid of
this ->next thing entirely and teach show_stack_log_lvl(), etc. to
move from stack to stack by querying the stack type of whatever the
frame base address is if the frame base address ends up being out of
bounds for the current stack?  Or maybe the unwinder could even do
this by itself.

> +static bool in_exception_stack(unsigned long *s, struct stack_info *info,
> +  unsigned long *visit_mask)
>  {
> unsigned long stack = (unsigned long)s;
> unsigned long begin, end;
> @@ -44,55 +63,62 @@ static unsigned long *in_exception_stack(unsigned long 
> *s, char **name,
> if (stack < begin || stack >= end)
> continue;
>
> -   if (test_and_set_bit(k, visit_mask))
> +   if (visit_mask &&
> +   test_and_set_bit(STACK_TYPE_EXCEPTION + k, visit_mask))
> return false;
>
> -   *name = exception_stack_names[k];
> -   return (unsigned long *)end;
> +   info->type  = STACK_TYPE_EXCEPTION + k;
> +   info->begin = (unsigned long *)begin;
> +   info->end   = (unsigned long *)end;
> +   info->next  = (unsigned long *)info->end[-2];

This is so magical that I don't immediately see why it's correct.
Presumably it's because the thing two slots down from the top of the
stack is regs->sp?  If so, that needs a comment.

But again, couldn't we use the fact that we now know how to decode
pt_regs to avoid needing this?  I can imagine it being useful as a
fallback in the event that the unwinder fails, but this is just a
fallback.  Also, NMI is weird and I'm wondering whether this works at
all when trying to unwind from a looped NMI.

Fixing this up could be a followup after this series is in, I think --
you're preserving existing behavior AFAICS.  I just don't particularly
like the existing behavior.

FWIW, I think this code needs to be explicitly tested for the 32-bit
double fault case.  It's highly magical.