Re: [RFC PATCH v2 05/18] sched: add task flag for preempt IRQ tracking
On Thu, Jun 23, 2016 at 1:40 PM, Josh Poimboeufwrote: > On Thu, Jun 23, 2016 at 01:31:32PM -0500, Josh Poimboeuf wrote: >> On Thu, Jun 23, 2016 at 09:35:29AM -0700, Andy Lutomirski wrote: >> > > So which is the least-bad option? To summarize: >> > > >> > > 1) task flag(s) for preemption and page faults >> > > >> > > 2) turn pt_regs into a stack frame >> > > >> > > 3) annotate all calls from entry code in a table >> > > >> > > 4) encode rbp on entry >> > > >> > > They all have their issues, though I'm partial to #2. >> > > >> > > Any more hare-brained ideas? :-) >> > >> > I'll try to take a closer look at #2 and see just how much I dislike >> > all the stack frame munging. >> >> Ok. >> >> > Also, in principle, it's only the >> > sleeping calls and the calls that make it into real (non-entry) kernel >> > code that really want to be unwindable through this mechanism. >> >> Yeah, that's true. We could modify options 2 or 3 to be less absolute. >> Though I think that makes them more prone to future breakage. >> >> > FWIW, I don't care that much about preserving gdb's partial ability to >> > unwind through pt_regs, especially because gdb really ought to be able >> > to use DWARF, too. >> >> Hm, that's a good point. I really don't know if there are any other >> external tools out there that would care. Maybe we could try option 4 >> and then see if anybody complains. > > I'm starting to think hare-brained option 4 is the way to go. Any > external tooling should really be relying on DWARF anyway. > > Here's a sneak preview. If this general approach looks ok to you, I'll > go ahead and port all the in-tree unwinders and post a proper patch. > > Instead of using xor -1 on the pt_regs pointer, I just cleared the > high-order bit. That makes the unwinding experience much more pleasant > for a human stack walker, and also ensures that anybody trying to > dereference it gets slapped with an oops, at least in the 48-bit address > space era. > > diff --git a/arch/x86/entry/calling.h b/arch/x86/entry/calling.h > index 9a9e588..bf397426 100644 > --- a/arch/x86/entry/calling.h > +++ b/arch/x86/entry/calling.h > @@ -201,6 +201,23 @@ For 32-bit we have the following conventions - kernel is > built with > .byte 0xf1 > .endm > > + /* > +* This is a sneaky trick to help the unwinder find pt_regs on the > +* stack. The frame pointer is replaced with an encoded pointer to > +* pt_regs. The encoding is just a clearing of the highest-order bit, > +* which makes it an invalid address and is also a signal to the > +* unwinder that it's a pt_regs pointer in disguise. > +* > +* NOTE: This must be called *after* SAVE_EXTRA_REGS because it > +* corrupts rbp. > +*/ > + .macro ENCODE_FRAME_POINTER ptregs_offset=0 > +#ifdef CONFIG_FRAME_POINTER > + leaq \ptregs_offset(%rsp), %rbp > + btr $63, %rbp > +#endif > + .endm > + Maybe optimize slightly: .ifeq \ptregs_offset mov %rsp, %rbp .else leaq \ptregs_offset(%rsp), %rbp .endif ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [RFC PATCH v2 05/18] sched: add task flag for preempt IRQ tracking
On Thu, Jun 23, 2016 at 01:31:32PM -0500, Josh Poimboeuf wrote: > On Thu, Jun 23, 2016 at 09:35:29AM -0700, Andy Lutomirski wrote: > > > So which is the least-bad option? To summarize: > > > > > > 1) task flag(s) for preemption and page faults > > > > > > 2) turn pt_regs into a stack frame > > > > > > 3) annotate all calls from entry code in a table > > > > > > 4) encode rbp on entry > > > > > > They all have their issues, though I'm partial to #2. > > > > > > Any more hare-brained ideas? :-) > > > > I'll try to take a closer look at #2 and see just how much I dislike > > all the stack frame munging. > > Ok. > > > Also, in principle, it's only the > > sleeping calls and the calls that make it into real (non-entry) kernel > > code that really want to be unwindable through this mechanism. > > Yeah, that's true. We could modify options 2 or 3 to be less absolute. > Though I think that makes them more prone to future breakage. > > > FWIW, I don't care that much about preserving gdb's partial ability to > > unwind through pt_regs, especially because gdb really ought to be able > > to use DWARF, too. > > Hm, that's a good point. I really don't know if there are any other > external tools out there that would care. Maybe we could try option 4 > and then see if anybody complains. I'm starting to think hare-brained option 4 is the way to go. Any external tooling should really be relying on DWARF anyway. Here's a sneak preview. If this general approach looks ok to you, I'll go ahead and port all the in-tree unwinders and post a proper patch. Instead of using xor -1 on the pt_regs pointer, I just cleared the high-order bit. That makes the unwinding experience much more pleasant for a human stack walker, and also ensures that anybody trying to dereference it gets slapped with an oops, at least in the 48-bit address space era. diff --git a/arch/x86/entry/calling.h b/arch/x86/entry/calling.h index 9a9e588..bf397426 100644 --- a/arch/x86/entry/calling.h +++ b/arch/x86/entry/calling.h @@ -201,6 +201,23 @@ For 32-bit we have the following conventions - kernel is built with .byte 0xf1 .endm + /* +* This is a sneaky trick to help the unwinder find pt_regs on the +* stack. The frame pointer is replaced with an encoded pointer to +* pt_regs. The encoding is just a clearing of the highest-order bit, +* which makes it an invalid address and is also a signal to the +* unwinder that it's a pt_regs pointer in disguise. +* +* NOTE: This must be called *after* SAVE_EXTRA_REGS because it +* corrupts rbp. +*/ + .macro ENCODE_FRAME_POINTER ptregs_offset=0 +#ifdef CONFIG_FRAME_POINTER + leaq \ptregs_offset(%rsp), %rbp + btr $63, %rbp +#endif + .endm + #endif /* CONFIG_X86_64 */ /* diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S index 9ee0da1..eb79652 100644 --- a/arch/x86/entry/entry_64.S +++ b/arch/x86/entry/entry_64.S @@ -431,6 +431,7 @@ END(irq_entries_start) ALLOC_PT_GPREGS_ON_STACK SAVE_C_REGS SAVE_EXTRA_REGS + ENCODE_FRAME_POINTER testb $3, CS(%rsp) jz 1f @@ -893,6 +894,7 @@ ENTRY(xen_failsafe_callback) ALLOC_PT_GPREGS_ON_STACK SAVE_C_REGS SAVE_EXTRA_REGS + ENCODE_FRAME_POINTER jmp error_exit END(xen_failsafe_callback) @@ -936,6 +938,7 @@ ENTRY(paranoid_entry) cld SAVE_C_REGS 8 SAVE_EXTRA_REGS 8 + ENCODE_FRAME_POINTER 8 movl$1, %ebx movl$MSR_GS_BASE, %ecx rdmsr @@ -983,6 +986,7 @@ ENTRY(error_entry) cld SAVE_C_REGS 8 SAVE_EXTRA_REGS 8 + ENCODE_FRAME_POINTER 8 xorl%ebx, %ebx testb $3, CS+8(%rsp) jz .Lerror_kernelspace @@ -1165,6 +1169,7 @@ ENTRY(nmi) pushq %r13/* pt_regs->r13 */ pushq %r14/* pt_regs->r14 */ pushq %r15/* pt_regs->r15 */ + ENCODE_FRAME_POINTER /* * At this point we no longer need to worry about stack damage @@ -1182,7 +1187,7 @@ ENTRY(nmi) * do_nmi doesn't modify pt_regs. */ SWAPGS - jmp restore_c_regs_and_iret + jmp restore_regs_and_iret .Lnmi_from_kernel: /* ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [RFC PATCH v2 05/18] sched: add task flag for preempt IRQ tracking
On Thu, Jun 23, 2016 at 09:35:29AM -0700, Andy Lutomirski wrote: > > So which is the least-bad option? To summarize: > > > > 1) task flag(s) for preemption and page faults > > > > 2) turn pt_regs into a stack frame > > > > 3) annotate all calls from entry code in a table > > > > 4) encode rbp on entry > > > > They all have their issues, though I'm partial to #2. > > > > Any more hare-brained ideas? :-) > > I'll try to take a closer look at #2 and see just how much I dislike > all the stack frame munging. Ok. > Also, in principle, it's only the > sleeping calls and the calls that make it into real (non-entry) kernel > code that really want to be unwindable through this mechanism. Yeah, that's true. We could modify options 2 or 3 to be less absolute. Though I think that makes them more prone to future breakage. > FWIW, I don't care that much about preserving gdb's partial ability to > unwind through pt_regs, especially because gdb really ought to be able > to use DWARF, too. Hm, that's a good point. I really don't know if there are any other external tools out there that would care. Maybe we could try option 4 and then see if anybody complains. -- Josh ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [RFC PATCH v2 05/18] sched: add task flag for preempt IRQ tracking
On Thu, Jun 23, 2016 at 9:19 AM, Josh Poimboeufwrote: > On Wed, Jun 22, 2016 at 12:17:25PM -0700, Andy Lutomirski wrote: >> On Wed, Jun 22, 2016 at 11:40 AM, Josh Poimboeuf wrote: >> > On Wed, Jun 22, 2016 at 11:26:21AM -0700, Andy Lutomirski wrote: >> >> > >> >> > So are you suggesting something like: >> >> > >> >> > .macro ENTRY_CALL func pt_regs_offset=0 >> >> > call \func >> >> > 1: .pushsection .entry_calls, "a" >> >> > .long 1b - . >> >> > .long \pt_regs_offset >> >> > .popsection >> >> > .endm >> >> > >> >> > and then change every call in the entry code to ENTRY_CALL? >> >> >> >> Yes, exactly, modulo whether the section name is good. hpa is >> >> probably the authority on that. >> > >> > Well, as you probably know, I don't really like peppering ENTRY_CALL >> > everywhere. :-/ >> >> Me neither. But at least it's less constraining on the >> already-fairly-hairy code. >> >> > >> > Also I wonder how we could annotate the hypercalls, for example >> > DISABLE_INTERRUPTS actually wraps the call in a push/pop pair. >> >> Oh, yuck. But forcing all the DISABLE_INTERRUPTS and >> ENABLE_INTERRUPTS invocations to be in frame pointer regions isn't so >> great either. > > Hm, I don't follow this statement. Why not? The more frame pointer > coverage, the better, especially if it doesn't add any additional > overhead. Less flexibility, and it's IMO annoying to make the Xen case have extra constraints. It also makes it very awkward or impossible to take advantage of the sti interrupt window, although admittedly that doesn't work on Xen either, so maybe that's moot. > >> DWARF solves this problem completely and IMO fairly cleanly. Maybe we >> should add your task flag and then consider removing it again when >> DWARF happens. > > I tend to doubt we'd be able to remove it later. As you said before, > many embedded platforms probably won't be able to switch to DWARF, and > they'll want to do live patching too. > > So which is the least-bad option? To summarize: > > 1) task flag(s) for preemption and page faults > > 2) turn pt_regs into a stack frame > > 3) annotate all calls from entry code in a table > > 4) encode rbp on entry > > They all have their issues, though I'm partial to #2. > > Any more hare-brained ideas? :-) I'll try to take a closer look at #2 and see just how much I dislike all the stack frame munging. Also, in principle, it's only the sleeping calls and the calls that make it into real (non-entry) kernel code that really want to be unwindable through this mechanism. FWIW, I don't care that much about preserving gdb's partial ability to unwind through pt_regs, especially because gdb really ought to be able to use DWARF, too. --Andy ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [RFC PATCH v2 05/18] sched: add task flag for preempt IRQ tracking
On Wed, Jun 22, 2016 at 12:17:25PM -0700, Andy Lutomirski wrote: > On Wed, Jun 22, 2016 at 11:40 AM, Josh Poimboeufwrote: > > On Wed, Jun 22, 2016 at 11:26:21AM -0700, Andy Lutomirski wrote: > >> > > >> > So are you suggesting something like: > >> > > >> > .macro ENTRY_CALL func pt_regs_offset=0 > >> > call \func > >> > 1: .pushsection .entry_calls, "a" > >> > .long 1b - . > >> > .long \pt_regs_offset > >> > .popsection > >> > .endm > >> > > >> > and then change every call in the entry code to ENTRY_CALL? > >> > >> Yes, exactly, modulo whether the section name is good. hpa is > >> probably the authority on that. > > > > Well, as you probably know, I don't really like peppering ENTRY_CALL > > everywhere. :-/ > > Me neither. But at least it's less constraining on the > already-fairly-hairy code. > > > > > Also I wonder how we could annotate the hypercalls, for example > > DISABLE_INTERRUPTS actually wraps the call in a push/pop pair. > > Oh, yuck. But forcing all the DISABLE_INTERRUPTS and > ENABLE_INTERRUPTS invocations to be in frame pointer regions isn't so > great either. Hm, I don't follow this statement. Why not? The more frame pointer coverage, the better, especially if it doesn't add any additional overhead. > DWARF solves this problem completely and IMO fairly cleanly. Maybe we > should add your task flag and then consider removing it again when > DWARF happens. I tend to doubt we'd be able to remove it later. As you said before, many embedded platforms probably won't be able to switch to DWARF, and they'll want to do live patching too. So which is the least-bad option? To summarize: 1) task flag(s) for preemption and page faults 2) turn pt_regs into a stack frame 3) annotate all calls from entry code in a table 4) encode rbp on entry They all have their issues, though I'm partial to #2. Any more hare-brained ideas? :-) -- Josh ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [RFC PATCH v2 05/18] sched: add task flag for preempt IRQ tracking
On Wed, Jun 22, 2016 at 05:09:11PM -0700, Andy Lutomirski wrote: > On Wed, Jun 22, 2016 at 9:30 AM, Josh Poimboeufwrote: > > Andy, > > > > So I got a chance to look at this some more. I'm thinking that to make > > this feature more consistently useful, we shouldn't only annotate > > pt_regs frames for calls to handlers; other calls should be annotated as > > well: preempt_schedule_irq, CALL_enter_from_user_mode, > > prepare_exit_to_usermode, SWAPGS, TRACE_IRQS_OFF, DISABLE_INTERRUPTS, > > etc. That way, the unwinder will always be able to find pt_regs from an > > interrupt/exception, even if starting from one of these other calls. > > > > But then, things get ugly. You have to either setup and tear down the > > frame for every possible call, or do a higher-level setup/teardown > > across multiple calls, which invalidates several assumptions in the > > entry code about the location of pt_regs on the stack. > > > > Here's yet another harebrained idea. Maybe it works better than my > previous harebrained ideas :) > > Your patch is already creating a somewhat nonstandard stack frame: > > + movq%rbp, 0*8(%rsp) > + movq$entry_frame_ret, 1*8(%rsp) > + movq%rsp, %rbp > > It's kind of a normal stack frame, but rbp points at something odd, > and to unwind it fully correctly, the unwinder needs to know about it. > > What if we made it even more special, along the lines of: > > leaq offset_to_ptregs(%rsp), %rbp > xorq $-1, %rbp > > IOW, don't write anything to the stack at all, and just put a special > value into RBP that says "the next frame is pt_regs at such-and-such > address". Do this once on entry and make sure to restore RBP (from > pt_regs) on exit. Now the unwinder can notice that RBP has the high > bits clear *and* that the negation of it points to the stack, and it > can figure out what's going on. > > What do you think? Am I nuts or could this work? > > It had better not have much risk of breaking things worse than they > currently are, given that current kernel allow user code to stick any > value it likes into the very last element of the RBP chain. I think it's a good idea, and it could work... BUT it would break external unwinders like gdb for the in-kernel entry case. For interrupts and exceptions in kernel mode, rbp *is* valid. Sure, it doesn't tell you the interrupted function, but it does tell you its caller. A generic frame pointer unwinder skips the interrupted function, but at least it keeps going. If we encoded rbp on entry, that would break. -- Josh ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [RFC PATCH v2 05/18] sched: add task flag for preempt IRQ tracking
On Wed, Jun 22, 2016 at 9:30 AM, Josh Poimboeufwrote: > Andy, > > So I got a chance to look at this some more. I'm thinking that to make > this feature more consistently useful, we shouldn't only annotate > pt_regs frames for calls to handlers; other calls should be annotated as > well: preempt_schedule_irq, CALL_enter_from_user_mode, > prepare_exit_to_usermode, SWAPGS, TRACE_IRQS_OFF, DISABLE_INTERRUPTS, > etc. That way, the unwinder will always be able to find pt_regs from an > interrupt/exception, even if starting from one of these other calls. > > But then, things get ugly. You have to either setup and tear down the > frame for every possible call, or do a higher-level setup/teardown > across multiple calls, which invalidates several assumptions in the > entry code about the location of pt_regs on the stack. > Here's yet another harebrained idea. Maybe it works better than my previous harebrained ideas :) Your patch is already creating a somewhat nonstandard stack frame: + movq%rbp, 0*8(%rsp) + movq$entry_frame_ret, 1*8(%rsp) + movq%rsp, %rbp It's kind of a normal stack frame, but rbp points at something odd, and to unwind it fully correctly, the unwinder needs to know about it. What if we made it even more special, along the lines of: leaq offset_to_ptregs(%rsp), %rbp xorq $-1, %rbp IOW, don't write anything to the stack at all, and just put a special value into RBP that says "the next frame is pt_regs at such-and-such address". Do this once on entry and make sure to restore RBP (from pt_regs) on exit. Now the unwinder can notice that RBP has the high bits clear *and* that the negation of it points to the stack, and it can figure out what's going on. What do you think? Am I nuts or could this work? It had better not have much risk of breaking things worse than they currently are, given that current kernel allow user code to stick any value it likes into the very last element of the RBP chain. --Andy ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [RFC PATCH v2 05/18] sched: add task flag for preempt IRQ tracking
On Wed, Jun 22, 2016 at 11:40 AM, Josh Poimboeufwrote: > On Wed, Jun 22, 2016 at 11:26:21AM -0700, Andy Lutomirski wrote: >> > >> > So are you suggesting something like: >> > >> > .macro ENTRY_CALL func pt_regs_offset=0 >> > call \func >> > 1: .pushsection .entry_calls, "a" >> > .long 1b - . >> > .long \pt_regs_offset >> > .popsection >> > .endm >> > >> > and then change every call in the entry code to ENTRY_CALL? >> >> Yes, exactly, modulo whether the section name is good. hpa is >> probably the authority on that. > > Well, as you probably know, I don't really like peppering ENTRY_CALL > everywhere. :-/ Me neither. But at least it's less constraining on the already-fairly-hairy code. > > Also I wonder how we could annotate the hypercalls, for example > DISABLE_INTERRUPTS actually wraps the call in a push/pop pair. Oh, yuck. But forcing all the DISABLE_INTERRUPTS and ENABLE_INTERRUPTS invocations to be in frame pointer regions isn't so great either. DWARF solves this problem completely and IMO fairly cleanly. Maybe we should add your task flag and then consider removing it again when DWARF happens. ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [RFC PATCH v2 05/18] sched: add task flag for preempt IRQ tracking
On Wed, Jun 22, 2016 at 11:26:21AM -0700, Andy Lutomirski wrote: > On Wed, Jun 22, 2016 at 11:22 AM, Josh Poimboeufwrote: > > On Wed, Jun 22, 2016 at 10:59:23AM -0700, Andy Lutomirski wrote: > >> > So I got a chance to look at this some more. I'm thinking that to make > >> > this feature more consistently useful, we shouldn't only annotate > >> > pt_regs frames for calls to handlers; other calls should be annotated as > >> > well: preempt_schedule_irq, CALL_enter_from_user_mode, > >> > prepare_exit_to_usermode, SWAPGS, TRACE_IRQS_OFF, DISABLE_INTERRUPTS, > >> > etc. That way, the unwinder will always be able to find pt_regs from an > >> > interrupt/exception, even if starting from one of these other calls. > >> > > >> > But then, things get ugly. You have to either setup and tear down the > >> > frame for every possible call, or do a higher-level setup/teardown > >> > across multiple calls, which invalidates several assumptions in the > >> > entry code about the location of pt_regs on the stack. > >> > > >> > Also problematic is that several of the macros (like TRACE_IRQS_IRETQ) > >> > make assumptions about the location of pt_regs. And they're used by > >> > both syscall and interrupt code. So if we didn't create a frame pointer > >> > header for syscalls, we'd basically need two versions of the macros: one > >> > for irqs/exceptions and one for syscalls. > >> > > >> > So I think the cleanest way to handle this is to always allocate two > >> > extra registers on the stack in ALLOC_PT_GPREGS_ON_STACK. Then all > >> > entry code can assume that pt_regs is at a constant location, and all > >> > the above problems go away. Another benefit is that we'd only need two > >> > saves instead of three -- the pointer to pt_regs is no longer needed > >> > since pt_regs is always immediately after the frame header. > >> > > >> > I worked up a patch to implement this -- see below. It writes the frame > >> > pointer in all entry paths, including syscalls. This helps keep the > >> > code simple. > >> > > >> > The downside is a small performance penalty: with getppid()-in-a-loop on > >> > my laptop, the average syscall went from 52ns to 53ns, which is about a > >> > 2% slowdown. But I doubt it would be measurable in a real-world > >> > workload. > >> > > >> > It looks like about half the slowdown is due to the extra stack > >> > allocation (which presumably adds a little d-cache pressure on the stack > >> > memory) and the other half is due to the stack writes. > >> > > >> > I could remove the writes from the syscall path but it would only save > >> > about half a ns, and it would make the code less robust. Plus it's nice > >> > to have the consistency of having *all* pt_regs frames annotated. > >> > >> This is a bit messy, and I'm not really sure that the entry code > >> should be have to operate under constraints like this. Also, > >> convincing myself this works for NMI sounds unpleasant. > >> > >> Maybe we should go back to my idea of just listing the call sites in a > >> table. > > > > So are you suggesting something like: > > > > .macro ENTRY_CALL func pt_regs_offset=0 > > call \func > > 1: .pushsection .entry_calls, "a" > > .long 1b - . > > .long \pt_regs_offset > > .popsection > > .endm > > > > and then change every call in the entry code to ENTRY_CALL? > > Yes, exactly, modulo whether the section name is good. hpa is > probably the authority on that. Well, as you probably know, I don't really like peppering ENTRY_CALL everywhere. :-/ Also I wonder how we could annotate the hypercalls, for example DISABLE_INTERRUPTS actually wraps the call in a push/pop pair. -- Josh ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [RFC PATCH v2 05/18] sched: add task flag for preempt IRQ tracking
On Wed, Jun 22, 2016 at 11:22 AM, Josh Poimboeufwrote: > On Wed, Jun 22, 2016 at 10:59:23AM -0700, Andy Lutomirski wrote: >> > So I got a chance to look at this some more. I'm thinking that to make >> > this feature more consistently useful, we shouldn't only annotate >> > pt_regs frames for calls to handlers; other calls should be annotated as >> > well: preempt_schedule_irq, CALL_enter_from_user_mode, >> > prepare_exit_to_usermode, SWAPGS, TRACE_IRQS_OFF, DISABLE_INTERRUPTS, >> > etc. That way, the unwinder will always be able to find pt_regs from an >> > interrupt/exception, even if starting from one of these other calls. >> > >> > But then, things get ugly. You have to either setup and tear down the >> > frame for every possible call, or do a higher-level setup/teardown >> > across multiple calls, which invalidates several assumptions in the >> > entry code about the location of pt_regs on the stack. >> > >> > Also problematic is that several of the macros (like TRACE_IRQS_IRETQ) >> > make assumptions about the location of pt_regs. And they're used by >> > both syscall and interrupt code. So if we didn't create a frame pointer >> > header for syscalls, we'd basically need two versions of the macros: one >> > for irqs/exceptions and one for syscalls. >> > >> > So I think the cleanest way to handle this is to always allocate two >> > extra registers on the stack in ALLOC_PT_GPREGS_ON_STACK. Then all >> > entry code can assume that pt_regs is at a constant location, and all >> > the above problems go away. Another benefit is that we'd only need two >> > saves instead of three -- the pointer to pt_regs is no longer needed >> > since pt_regs is always immediately after the frame header. >> > >> > I worked up a patch to implement this -- see below. It writes the frame >> > pointer in all entry paths, including syscalls. This helps keep the >> > code simple. >> > >> > The downside is a small performance penalty: with getppid()-in-a-loop on >> > my laptop, the average syscall went from 52ns to 53ns, which is about a >> > 2% slowdown. But I doubt it would be measurable in a real-world >> > workload. >> > >> > It looks like about half the slowdown is due to the extra stack >> > allocation (which presumably adds a little d-cache pressure on the stack >> > memory) and the other half is due to the stack writes. >> > >> > I could remove the writes from the syscall path but it would only save >> > about half a ns, and it would make the code less robust. Plus it's nice >> > to have the consistency of having *all* pt_regs frames annotated. >> >> This is a bit messy, and I'm not really sure that the entry code >> should be have to operate under constraints like this. Also, >> convincing myself this works for NMI sounds unpleasant. >> >> Maybe we should go back to my idea of just listing the call sites in a table. > > So are you suggesting something like: > > .macro ENTRY_CALL func pt_regs_offset=0 > call \func > 1: .pushsection .entry_calls, "a" > .long 1b - . > .long \pt_regs_offset > .popsection > .endm > > and then change every call in the entry code to ENTRY_CALL? Yes, exactly, modulo whether the section name is good. hpa is probably the authority on that. --Andy ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [RFC PATCH v2 05/18] sched: add task flag for preempt IRQ tracking
On Wed, Jun 22, 2016 at 10:59:23AM -0700, Andy Lutomirski wrote: > > So I got a chance to look at this some more. I'm thinking that to make > > this feature more consistently useful, we shouldn't only annotate > > pt_regs frames for calls to handlers; other calls should be annotated as > > well: preempt_schedule_irq, CALL_enter_from_user_mode, > > prepare_exit_to_usermode, SWAPGS, TRACE_IRQS_OFF, DISABLE_INTERRUPTS, > > etc. That way, the unwinder will always be able to find pt_regs from an > > interrupt/exception, even if starting from one of these other calls. > > > > But then, things get ugly. You have to either setup and tear down the > > frame for every possible call, or do a higher-level setup/teardown > > across multiple calls, which invalidates several assumptions in the > > entry code about the location of pt_regs on the stack. > > > > Also problematic is that several of the macros (like TRACE_IRQS_IRETQ) > > make assumptions about the location of pt_regs. And they're used by > > both syscall and interrupt code. So if we didn't create a frame pointer > > header for syscalls, we'd basically need two versions of the macros: one > > for irqs/exceptions and one for syscalls. > > > > So I think the cleanest way to handle this is to always allocate two > > extra registers on the stack in ALLOC_PT_GPREGS_ON_STACK. Then all > > entry code can assume that pt_regs is at a constant location, and all > > the above problems go away. Another benefit is that we'd only need two > > saves instead of three -- the pointer to pt_regs is no longer needed > > since pt_regs is always immediately after the frame header. > > > > I worked up a patch to implement this -- see below. It writes the frame > > pointer in all entry paths, including syscalls. This helps keep the > > code simple. > > > > The downside is a small performance penalty: with getppid()-in-a-loop on > > my laptop, the average syscall went from 52ns to 53ns, which is about a > > 2% slowdown. But I doubt it would be measurable in a real-world > > workload. > > > > It looks like about half the slowdown is due to the extra stack > > allocation (which presumably adds a little d-cache pressure on the stack > > memory) and the other half is due to the stack writes. > > > > I could remove the writes from the syscall path but it would only save > > about half a ns, and it would make the code less robust. Plus it's nice > > to have the consistency of having *all* pt_regs frames annotated. > > This is a bit messy, and I'm not really sure that the entry code > should be have to operate under constraints like this. Also, > convincing myself this works for NMI sounds unpleasant. > > Maybe we should go back to my idea of just listing the call sites in a table. So are you suggesting something like: .macro ENTRY_CALL func pt_regs_offset=0 call \func 1: .pushsection .entry_calls, "a" .long 1b - . .long \pt_regs_offset .popsection .endm and then change every call in the entry code to ENTRY_CALL? -- Josh ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [RFC PATCH v2 05/18] sched: add task flag for preempt IRQ tracking
On Wed, Jun 22, 2016 at 9:30 AM, Josh Poimboeufwrote: > On Mon, May 23, 2016 at 08:52:12PM -0700, Andy Lutomirski wrote: >> On May 23, 2016 7:28 PM, "Josh Poimboeuf" wrote: >> > > Maybe I'm coming around to liking this idea. >> > >> > Ok, good :-) >> > >> > > In an ideal world (DWARF support, high-quality unwinder, nice pretty >> > > printer, etc), unwinding across a kernel exception would look like: >> > > >> > > - some_func >> > > - some_other_func >> > > - do_page_fault >> > > - page_fault >> > > >> > > After page_fault, the next unwind step takes us to the faulting RIP >> > > (faulting_func) and reports that all GPRs are known. It should >> > > probably learn this fact from DWARF if DWARF is available, instead of >> > > directly decoding pt_regs, due to a few funny cases in which pt_regs >> > > may be incomplete. A nice pretty printer could now print all the >> > > regs. >> > > >> > > - faulting_func >> > > - etc. >> > > >> > > For this to work, we don't actually need the unwinder to explicitly >> > > know where pt_regs is. >> > >> > That's true (but only for DWARF). >> > >> > > Food for thought, though: if user code does SYSENTER with TF set, >> > > you'll end up with partial pt_regs. There's nothing the kernel can do >> > > about it. DWARF will handle it without any fanfare, but I don't know >> > > if it's going to cause trouble for you pre-DWARF. >> > >> > In this case it should see the stack pointer is past the pt_regs offset, >> > so it would just report it as an empty stack. >> >> OK >> >> > >> > > I'm also not sure it makes sense to apply this before the unwinder >> > > that can consume it is ready. Maybe, if it would be consistent with >> > > your plans, it would make sense to rewrite the unwinder first, then >> > > apply this and teach live patching to use the new unwinder, and *then* >> > > add DWARF support? >> > >> > For the purposes of livepatch, the reliable unwinder needs to detect >> > whether an interrupt/exception pt_regs frame exists on a sleeping task >> > (or current). This patch would allow us to do that. >> > >> > So my preferred order of doing things would be: >> > >> > 1) Brian Gerst's switch_to() cleanup and any related unwinder fixes >> > 2) this patch for annotating pt_regs stack frames >> > 3) reliable unwinder, similar to what I already posted, except it relies >> >on this patch instead of PF_PREEMPT_IRQ, and knows how to deal with >> >the new inactive task frame format of #1 >> > 4) livepatch consistency model which uses the reliable unwinder >> > 5) rewrite unwinder, and port all users to the new interface >> > 6) add DWARF unwinder >> > >> > 1-4 are pretty much already written, whereas 5 and 6 will take >> > considerably more work. >> >> Fair enough. >> >> > >> > > > + /* >> > > > +* Create a stack frame for the saved pt_regs. This allows >> > > > frame >> > > > +* pointer based unwinders to find pt_regs on the stack. >> > > > +*/ >> > > > + .macro CREATE_PT_REGS_FRAME regs=%rsp >> > > > +#ifdef CONFIG_FRAME_POINTER >> > > > + pushq \regs >> > > > + pushq $pt_regs+1 >> > > >> > > Why the +1? >> > >> > Some unwinders like gdb are smart enough to report the function which >> > contains the instruction *before* the return address. Without the +1, >> > they would show the wrong function. >> >> Lovely. Want to add a comment? >> >> > >> > > > + pushq %rbp >> > > > + movq%rsp, %rbp >> > > > +#endif >> > > > + .endm >> > > >> > > I keep wanting this to be only two pushes and to fudge rbp to make it >> > > work, but I don't see a good way. But let's call it >> > > CREATE_NESTED_ENTRY_FRAME or something, and let's rename pt_regs to >> > > nested_frame or similar. >> > >> > Or, if we aren't going to annotate syscall pt_regs, we could give it a >> > more specific name. CREATE_INTERRUPT_FRAME and interrupt_frame()? >> >> CREATE_INTERRUPT_FRAME is confusing because it applies to idtentry, >> too. CREATE_PT_REGS_FRAME is probably fine. >> >> > > > + >> > > > +/* fake function which allows stack unwinders to detect pt_regs >> > > > frames */ >> > > > +#ifdef CONFIG_FRAME_POINTER >> > > > +ENTRY(pt_regs) >> > > > + nop >> > > > + nop >> > > > +ENDPROC(pt_regs) >> > > > +#endif /* CONFIG_FRAME_POINTER */ >> > > >> > > Why is this two bytes long? Is there some reason it has to be more >> > > than one byte? >> > >> > Similar to above, this is related to the need to support various >> > unwinders. Whether the unwinder displays the ret_addr or the >> > instruction preceding it, either way the instruction needs to be inside >> > the function for the function to be reported. >> >> OK. > > Andy, > > So I got a chance to look at this some more. I'm thinking that to make > this feature more consistently useful, we shouldn't only annotate > pt_regs frames for calls to handlers; other calls should be annotated as > well:
Re: [RFC PATCH v2 05/18] sched: add task flag for preempt IRQ tracking
On Mon, May 23, 2016 at 08:52:12PM -0700, Andy Lutomirski wrote: > On May 23, 2016 7:28 PM, "Josh Poimboeuf"wrote: > > > Maybe I'm coming around to liking this idea. > > > > Ok, good :-) > > > > > In an ideal world (DWARF support, high-quality unwinder, nice pretty > > > printer, etc), unwinding across a kernel exception would look like: > > > > > > - some_func > > > - some_other_func > > > - do_page_fault > > > - page_fault > > > > > > After page_fault, the next unwind step takes us to the faulting RIP > > > (faulting_func) and reports that all GPRs are known. It should > > > probably learn this fact from DWARF if DWARF is available, instead of > > > directly decoding pt_regs, due to a few funny cases in which pt_regs > > > may be incomplete. A nice pretty printer could now print all the > > > regs. > > > > > > - faulting_func > > > - etc. > > > > > > For this to work, we don't actually need the unwinder to explicitly > > > know where pt_regs is. > > > > That's true (but only for DWARF). > > > > > Food for thought, though: if user code does SYSENTER with TF set, > > > you'll end up with partial pt_regs. There's nothing the kernel can do > > > about it. DWARF will handle it without any fanfare, but I don't know > > > if it's going to cause trouble for you pre-DWARF. > > > > In this case it should see the stack pointer is past the pt_regs offset, > > so it would just report it as an empty stack. > > OK > > > > > > I'm also not sure it makes sense to apply this before the unwinder > > > that can consume it is ready. Maybe, if it would be consistent with > > > your plans, it would make sense to rewrite the unwinder first, then > > > apply this and teach live patching to use the new unwinder, and *then* > > > add DWARF support? > > > > For the purposes of livepatch, the reliable unwinder needs to detect > > whether an interrupt/exception pt_regs frame exists on a sleeping task > > (or current). This patch would allow us to do that. > > > > So my preferred order of doing things would be: > > > > 1) Brian Gerst's switch_to() cleanup and any related unwinder fixes > > 2) this patch for annotating pt_regs stack frames > > 3) reliable unwinder, similar to what I already posted, except it relies > >on this patch instead of PF_PREEMPT_IRQ, and knows how to deal with > >the new inactive task frame format of #1 > > 4) livepatch consistency model which uses the reliable unwinder > > 5) rewrite unwinder, and port all users to the new interface > > 6) add DWARF unwinder > > > > 1-4 are pretty much already written, whereas 5 and 6 will take > > considerably more work. > > Fair enough. > > > > > > > + /* > > > > +* Create a stack frame for the saved pt_regs. This allows > > > > frame > > > > +* pointer based unwinders to find pt_regs on the stack. > > > > +*/ > > > > + .macro CREATE_PT_REGS_FRAME regs=%rsp > > > > +#ifdef CONFIG_FRAME_POINTER > > > > + pushq \regs > > > > + pushq $pt_regs+1 > > > > > > Why the +1? > > > > Some unwinders like gdb are smart enough to report the function which > > contains the instruction *before* the return address. Without the +1, > > they would show the wrong function. > > Lovely. Want to add a comment? > > > > > > > + pushq %rbp > > > > + movq%rsp, %rbp > > > > +#endif > > > > + .endm > > > > > > I keep wanting this to be only two pushes and to fudge rbp to make it > > > work, but I don't see a good way. But let's call it > > > CREATE_NESTED_ENTRY_FRAME or something, and let's rename pt_regs to > > > nested_frame or similar. > > > > Or, if we aren't going to annotate syscall pt_regs, we could give it a > > more specific name. CREATE_INTERRUPT_FRAME and interrupt_frame()? > > CREATE_INTERRUPT_FRAME is confusing because it applies to idtentry, > too. CREATE_PT_REGS_FRAME is probably fine. > > > > > + > > > > +/* fake function which allows stack unwinders to detect pt_regs frames > > > > */ > > > > +#ifdef CONFIG_FRAME_POINTER > > > > +ENTRY(pt_regs) > > > > + nop > > > > + nop > > > > +ENDPROC(pt_regs) > > > > +#endif /* CONFIG_FRAME_POINTER */ > > > > > > Why is this two bytes long? Is there some reason it has to be more > > > than one byte? > > > > Similar to above, this is related to the need to support various > > unwinders. Whether the unwinder displays the ret_addr or the > > instruction preceding it, either way the instruction needs to be inside > > the function for the function to be reported. > > OK. Andy, So I got a chance to look at this some more. I'm thinking that to make this feature more consistently useful, we shouldn't only annotate pt_regs frames for calls to handlers; other calls should be annotated as well: preempt_schedule_irq, CALL_enter_from_user_mode, prepare_exit_to_usermode, SWAPGS, TRACE_IRQS_OFF, DISABLE_INTERRUPTS, etc. That way, the unwinder will always be able to find pt_regs from an interrupt/exception, even
Re: [RFC PATCH v2 05/18] sched: add task flag for preempt IRQ tracking
On May 23, 2016 7:28 PM, "Josh Poimboeuf"wrote: > > Maybe I'm coming around to liking this idea. > > Ok, good :-) > > > In an ideal world (DWARF support, high-quality unwinder, nice pretty > > printer, etc), unwinding across a kernel exception would look like: > > > > - some_func > > - some_other_func > > - do_page_fault > > - page_fault > > > > After page_fault, the next unwind step takes us to the faulting RIP > > (faulting_func) and reports that all GPRs are known. It should > > probably learn this fact from DWARF if DWARF is available, instead of > > directly decoding pt_regs, due to a few funny cases in which pt_regs > > may be incomplete. A nice pretty printer could now print all the > > regs. > > > > - faulting_func > > - etc. > > > > For this to work, we don't actually need the unwinder to explicitly > > know where pt_regs is. > > That's true (but only for DWARF). > > > Food for thought, though: if user code does SYSENTER with TF set, > > you'll end up with partial pt_regs. There's nothing the kernel can do > > about it. DWARF will handle it without any fanfare, but I don't know > > if it's going to cause trouble for you pre-DWARF. > > In this case it should see the stack pointer is past the pt_regs offset, > so it would just report it as an empty stack. OK > > > I'm also not sure it makes sense to apply this before the unwinder > > that can consume it is ready. Maybe, if it would be consistent with > > your plans, it would make sense to rewrite the unwinder first, then > > apply this and teach live patching to use the new unwinder, and *then* > > add DWARF support? > > For the purposes of livepatch, the reliable unwinder needs to detect > whether an interrupt/exception pt_regs frame exists on a sleeping task > (or current). This patch would allow us to do that. > > So my preferred order of doing things would be: > > 1) Brian Gerst's switch_to() cleanup and any related unwinder fixes > 2) this patch for annotating pt_regs stack frames > 3) reliable unwinder, similar to what I already posted, except it relies >on this patch instead of PF_PREEMPT_IRQ, and knows how to deal with >the new inactive task frame format of #1 > 4) livepatch consistency model which uses the reliable unwinder > 5) rewrite unwinder, and port all users to the new interface > 6) add DWARF unwinder > > 1-4 are pretty much already written, whereas 5 and 6 will take > considerably more work. Fair enough. > > > > + /* > > > +* Create a stack frame for the saved pt_regs. This allows frame > > > +* pointer based unwinders to find pt_regs on the stack. > > > +*/ > > > + .macro CREATE_PT_REGS_FRAME regs=%rsp > > > +#ifdef CONFIG_FRAME_POINTER > > > + pushq \regs > > > + pushq $pt_regs+1 > > > > Why the +1? > > Some unwinders like gdb are smart enough to report the function which > contains the instruction *before* the return address. Without the +1, > they would show the wrong function. Lovely. Want to add a comment? > > > > + pushq %rbp > > > + movq%rsp, %rbp > > > +#endif > > > + .endm > > > > I keep wanting this to be only two pushes and to fudge rbp to make it > > work, but I don't see a good way. But let's call it > > CREATE_NESTED_ENTRY_FRAME or something, and let's rename pt_regs to > > nested_frame or similar. > > Or, if we aren't going to annotate syscall pt_regs, we could give it a > more specific name. CREATE_INTERRUPT_FRAME and interrupt_frame()? CREATE_INTERRUPT_FRAME is confusing because it applies to idtentry, too. CREATE_PT_REGS_FRAME is probably fine. > > > + > > > +/* fake function which allows stack unwinders to detect pt_regs frames */ > > > +#ifdef CONFIG_FRAME_POINTER > > > +ENTRY(pt_regs) > > > + nop > > > + nop > > > +ENDPROC(pt_regs) > > > +#endif /* CONFIG_FRAME_POINTER */ > > > > Why is this two bytes long? Is there some reason it has to be more > > than one byte? > > Similar to above, this is related to the need to support various > unwinders. Whether the unwinder displays the ret_addr or the > instruction preceding it, either way the instruction needs to be inside > the function for the function to be reported. OK. --Andy ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [RFC PATCH v2 05/18] sched: add task flag for preempt IRQ tracking
On Mon, May 23, 2016 at 02:34:56PM -0700, Andy Lutomirski wrote: > On Thu, May 19, 2016 at 4:15 PM, Josh Poimboeufwrote: > > On Mon, May 02, 2016 at 08:52:41AM -0700, Andy Lutomirski wrote: > >> On Mon, May 2, 2016 at 6:52 AM, Josh Poimboeuf wrote: > >> > On Fri, Apr 29, 2016 at 05:08:50PM -0700, Andy Lutomirski wrote: > >> >> On Apr 29, 2016 3:41 PM, "Josh Poimboeuf" wrote: > >> >> > > >> >> > On Fri, Apr 29, 2016 at 02:37:41PM -0700, Andy Lutomirski wrote: > >> >> > > On Fri, Apr 29, 2016 at 2:25 PM, Josh Poimboeuf > >> >> > > wrote: > >> >> > > >> I suppose we could try to rejigger the code so that rbp points to > >> >> > > >> pt_regs or similar. > >> >> > > > > >> >> > > > I think we should avoid doing something like that because it > >> >> > > > would break > >> >> > > > gdb and all the other unwinders who don't know about it. > >> >> > > > >> >> > > How so? > >> >> > > > >> >> > > Currently, rbp in the entry code is meaningless. I'm suggesting > >> >> > > that, > >> >> > > when we do, for example, 'call \do_sym' in idtentry, we point rbp to > >> >> > > the pt_regs. Currently it points to something stale (which the > >> >> > > dump_stack code might be relying on. Hmm.) But it's probably also > >> >> > > safe to assume that if you unwind to the 'call \do_sym', then > >> >> > > pt_regs > >> >> > > is the next thing on the stack, so just doing the section thing > >> >> > > would > >> >> > > work. > >> >> > > >> >> > Yes, rbp is meaningless on the entry from user space. But if an > >> >> > in-kernel interrupt occurs (e.g. page fault, preemption) and you have > >> >> > nested entry, rbp keeps its old value, right? So the unwinder can > >> >> > walk > >> >> > past the nested entry frame and keep going until it gets to the > >> >> > original > >> >> > entry. > >> >> > >> >> Yes. > >> >> > >> >> It would be nice if we could do better, though, and actually notice > >> >> the pt_regs and identify the entry. For example, I'd love to see > >> >> "page fault, RIP=xyz" printed in the middle of a stack dump on a > >> >> crash. > >> >> > >> >> Also, I think that just following rbp links will lose the > >> >> actual function that took the page fault (or whatever function > >> >> pt_regs->ip actually points to). > >> > > >> > Hm. I think we could fix all that in a more standard way. Whenever a > >> > new pt_regs frame gets saved on entry, we could also create a new stack > >> > frame which points to a fake kernel_entry() function. That would tell > >> > the unwinder there's a pt_regs frame without otherwise breaking frame > >> > pointers across the frame. > >> > > >> > Then I guess we wouldn't need my other solution of putting the idt > >> > entries in a special section. > >> > > >> > How does that sound? > >> > >> Let me try to understand. > >> > >> The normal call sequence is call; push %rbp; mov %rsp, %rbp. So rbp > >> points to (prev rbp, prev rip) on the stack, and you can follow the > >> chain back. Right now, on a user access page fault or similar, we > >> have rbp (probably) pointing to the interrupted frame, and the > >> interrupted rip isn't saved anywhere that a naive unwinder can find > >> it. (It's in pt_regs, but the rbp chain skips right over that.) > >> > >> We could change the entry code so that an interrupt / idtentry does: > >> > >> push pt_regs > >> push kernel_entry > >> push %rbp > >> mov %rsp, %rbp > >> call handler > >> pop %rbp > >> addq $8, %rsp > >> > >> or similar. That would make it appear that the actual C handler was > >> caused by a dummy function "kernel_entry". Now the unwinder would get > >> to kernel_entry, but it *still* wouldn't find its way to the calling > >> frame, which only solves part of the problem. We could at least teach > >> the unwinder how kernel_entry works and let it decode pt_regs to > >> continue unwinding. This would be nice, and I think it could work. > >> > >> I think I like this, except that, if it used a separate section, it > >> could potentially be faster, as, for each actual entry type, the > >> offset from the C handler frame to pt_regs is a foregone conclusion. > >> But this is pretty simple and performance is already abysmal in most > >> handlers. > >> > >> There's an added benefit to using a separate section, though: we could > >> also annotate the calls with what type of entry they were so the > >> unwinder could print it out nicely. > >> > >> I could be convinced either way. > > > > Ok, I took a stab at this. See the patch below. > > > > In addition to annotating interrupt/exception pt_regs frames, I also > > annotated all the syscall pt_regs, for consistency. > > > > As you mentioned, it will affect performance a bit, but I think it will > > be insignificant. > > > > I think I like this approach better than putting the > > interrupt/idtentry's in a special section, because this is much more > > precise. Especially now that I'm annotating
Re: [RFC PATCH v2 05/18] sched: add task flag for preempt IRQ tracking
On Mon, May 23, 2016 at 4:02 PM, Jiri Kosinawrote: > On Fri, 20 May 2016, Andy Lutomirski wrote: > >> I think it would be negligible, at least for interrupts, since >> interrupts are already extremely expensive. But I don't love adding >> assembly code that makes them even slower. The real thing I dislike >> about this approach is that it's not a normal stack frame, so you need >> code in the unwinder to unwind through it correctly, which makes me >> think that you're not saving much complexity by adding the pushes. > > I fail to see what is so special about the stack frame; it's in fact > pretty normal. > > It has added semantic value for "those who know", but the others will > (pretty much correctly) consider it to be a stackframe from a function > call, and be done with it. > > What am I missing? In Josh's code, the stack looks like: ... interrupted frame pt_regs pointer to pt_regs address of pt_regs dummy function rbp handler frame A naive unwinder won't unwind this correctly, as there's no link back to the interrupted frame's RIP. If the layout changed to: ... interrupted frame pt_regs interrupted RIP rbp handler frame then I think it would unwind correctly, but the pt_regs would be invisible, which is IMO a bit unfortunate. It could also be (I think): ... interrupted frame pt_regs interrupted rbp interrupted RIP pointer to pt_regs address of pt_regs dummy function pointer to "interrupted RIP" stack slot handler frame but now this is *five* pushes for the dummy frame, which I think is getting a bit out of hand. --Andy ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [RFC PATCH v2 05/18] sched: add task flag for preempt IRQ tracking
On Fri, 20 May 2016, Andy Lutomirski wrote: > I think it would be negligible, at least for interrupts, since > interrupts are already extremely expensive. But I don't love adding > assembly code that makes them even slower. The real thing I dislike > about this approach is that it's not a normal stack frame, so you need > code in the unwinder to unwind through it correctly, which makes me > think that you're not saving much complexity by adding the pushes. I fail to see what is so special about the stack frame; it's in fact pretty normal. It has added semantic value for "those who know", but the others will (pretty much correctly) consider it to be a stackframe from a function call, and be done with it. What am I missing? Thanks, -- Jiri Kosina SUSE Labs ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [RFC PATCH v2 05/18] sched: add task flag for preempt IRQ tracking
On Thu, May 19, 2016 at 4:15 PM, Josh Poimboeufwrote: > On Mon, May 02, 2016 at 08:52:41AM -0700, Andy Lutomirski wrote: >> On Mon, May 2, 2016 at 6:52 AM, Josh Poimboeuf wrote: >> > On Fri, Apr 29, 2016 at 05:08:50PM -0700, Andy Lutomirski wrote: >> >> On Apr 29, 2016 3:41 PM, "Josh Poimboeuf" wrote: >> >> > >> >> > On Fri, Apr 29, 2016 at 02:37:41PM -0700, Andy Lutomirski wrote: >> >> > > On Fri, Apr 29, 2016 at 2:25 PM, Josh Poimboeuf >> >> > > wrote: >> >> > > >> I suppose we could try to rejigger the code so that rbp points to >> >> > > >> pt_regs or similar. >> >> > > > >> >> > > > I think we should avoid doing something like that because it would >> >> > > > break >> >> > > > gdb and all the other unwinders who don't know about it. >> >> > > >> >> > > How so? >> >> > > >> >> > > Currently, rbp in the entry code is meaningless. I'm suggesting that, >> >> > > when we do, for example, 'call \do_sym' in idtentry, we point rbp to >> >> > > the pt_regs. Currently it points to something stale (which the >> >> > > dump_stack code might be relying on. Hmm.) But it's probably also >> >> > > safe to assume that if you unwind to the 'call \do_sym', then pt_regs >> >> > > is the next thing on the stack, so just doing the section thing would >> >> > > work. >> >> > >> >> > Yes, rbp is meaningless on the entry from user space. But if an >> >> > in-kernel interrupt occurs (e.g. page fault, preemption) and you have >> >> > nested entry, rbp keeps its old value, right? So the unwinder can walk >> >> > past the nested entry frame and keep going until it gets to the original >> >> > entry. >> >> >> >> Yes. >> >> >> >> It would be nice if we could do better, though, and actually notice >> >> the pt_regs and identify the entry. For example, I'd love to see >> >> "page fault, RIP=xyz" printed in the middle of a stack dump on a >> >> crash. >> >> >> >> Also, I think that just following rbp links will lose the >> >> actual function that took the page fault (or whatever function >> >> pt_regs->ip actually points to). >> > >> > Hm. I think we could fix all that in a more standard way. Whenever a >> > new pt_regs frame gets saved on entry, we could also create a new stack >> > frame which points to a fake kernel_entry() function. That would tell >> > the unwinder there's a pt_regs frame without otherwise breaking frame >> > pointers across the frame. >> > >> > Then I guess we wouldn't need my other solution of putting the idt >> > entries in a special section. >> > >> > How does that sound? >> >> Let me try to understand. >> >> The normal call sequence is call; push %rbp; mov %rsp, %rbp. So rbp >> points to (prev rbp, prev rip) on the stack, and you can follow the >> chain back. Right now, on a user access page fault or similar, we >> have rbp (probably) pointing to the interrupted frame, and the >> interrupted rip isn't saved anywhere that a naive unwinder can find >> it. (It's in pt_regs, but the rbp chain skips right over that.) >> >> We could change the entry code so that an interrupt / idtentry does: >> >> push pt_regs >> push kernel_entry >> push %rbp >> mov %rsp, %rbp >> call handler >> pop %rbp >> addq $8, %rsp >> >> or similar. That would make it appear that the actual C handler was >> caused by a dummy function "kernel_entry". Now the unwinder would get >> to kernel_entry, but it *still* wouldn't find its way to the calling >> frame, which only solves part of the problem. We could at least teach >> the unwinder how kernel_entry works and let it decode pt_regs to >> continue unwinding. This would be nice, and I think it could work. >> >> I think I like this, except that, if it used a separate section, it >> could potentially be faster, as, for each actual entry type, the >> offset from the C handler frame to pt_regs is a foregone conclusion. >> But this is pretty simple and performance is already abysmal in most >> handlers. >> >> There's an added benefit to using a separate section, though: we could >> also annotate the calls with what type of entry they were so the >> unwinder could print it out nicely. >> >> I could be convinced either way. > > Ok, I took a stab at this. See the patch below. > > In addition to annotating interrupt/exception pt_regs frames, I also > annotated all the syscall pt_regs, for consistency. > > As you mentioned, it will affect performance a bit, but I think it will > be insignificant. > > I think I like this approach better than putting the > interrupt/idtentry's in a special section, because this is much more > precise. Especially now that I'm annotating pt_regs syscalls. > > Also I think with a few minor changes we could implement your idea of > annotating the calls with what type of entry they are. But I don't > think that's really needed, because the name of the interrupt/idtentry > is already on the stack trace. > > Before: > > [] dump_stack+0x85/0xc2 > []
Re: [RFC PATCH v2 05/18] sched: add task flag for preempt IRQ tracking
On Fri, May 20, 2016 at 09:59:38AM -0700, Andy Lutomirski wrote: > On Fri, May 20, 2016 at 9:41 AM, Josh Poimboeufwrote: > > On Fri, May 20, 2016 at 08:41:00AM -0700, Andy Lutomirski wrote: > >> On Fri, May 20, 2016 at 7:05 AM, Josh Poimboeuf > >> wrote: > >> > On Thu, May 19, 2016 at 04:39:51PM -0700, Andy Lutomirski wrote: > >> >> On Thu, May 19, 2016 at 4:15 PM, Josh Poimboeuf > >> >> wrote: > >> >> > Note this example is with today's unwinder. It could be made smarter > >> >> > to > >> >> > get the RIP from the pt_regs so the '?' could be removed from > >> >> > copy_page_to_iter(). > >> >> > > >> >> > Thoughts? > >> >> > >> >> I think we should do that. The silly sample patch I sent you (or at > >> >> least that I think I sent you) did that, and it worked nicely. > >> > > >> > Yeah, we can certainly do something similar to make the unwinder > >> > smarter. It should be very simple with this approach: if it finds the > >> > pt_regs() function on the stack, the (struct pt_regs *) pointer will be > >> > right after it. > >> > >> That seems barely easier than checking if it finds a function in > >> .entry that's marked on the stack, > > > > Don't forget you'd also have to look up the function's pt_regs offset in > > the table. > > > >> and the latter has no runtime cost. > > > > Well, I had been assuming the three extra pushes and one extra pop for > > interrupts would be negligible, but I'm definitely no expert there. Any > > idea how I can measure it? > > I think it would be negligible, at least for interrupts, since > interrupts are already extremely expensive. But I don't love adding > assembly code that makes them even slower. I just don't find that very convincing :-) If the performance impact is negligible, we should ignore it. We should instead be focusing on finding the simplest solution. > The real thing I dislike about this approach is that it's not a normal > stack frame, so you need code in the unwinder to unwind through it > correctly, which makes me think that you're not saving much complexity > by adding the pushes. But maybe I'm wrong. Hm, I view it differently. Sure the stack frame is a bit unusual, but as far as unwinders go, it _is_ normal. So even a stock unwinder can show the user that a pt_regs() -- or interrupt_frame() or whatever we call it -- function was called. Just knowing that an interrupt occurred could be useful information, even without the contents of RIP. > >> > I'm not sure about the idea of a table. I get the feeling it would add > >> > more complexity to both the entry code and the unwinder. How would you > >> > specify the pt_regs location when it's on a different stack? (See the > >> > interrupt macro: non-nested interrupts will place pt_regs on the task > >> > stack before switching to the irq stack.) > >> > >> Hmm. I need to think about the interrupt stack case a bit. Although > >> the actual top of the interrupt stack has a nearly fixed format, and I > >> have old patches to clean it up and make it actually be fixed. I'll > >> try to dust those off and resend them soon. > > > > How would that solve the problem? Would pt_regs be moved or copied to > > the irq stack? > > Hmm. > > Maybe the right way would be to unwind off the IRQ stack in two steps. > Step one would be to figure out that you're on the IRQ stack and pop > your way off it. Step two would be to find pt_regs. > > But that could be rather nasty to implement. Maybe what we actually > want to do is this: > > First, apply this thing: > > https://git.kernel.org/cgit/linux/kernel/git/luto/linux.git/commit/?h=x86/entry_ist=2231ec7e0bcc1a2bc94a17081511ab54cc6badd1 > > that gives us a common format for the IRQ stack. > > Second, use my idea of making a table of offsets to pt_regs, so we'd > have, roughly: > > ENTER_IRQ_STACK old_rsp=%r11 > call __do_softirq > ANNOTATE_IRQSTACK_PTREGS_CALL offset=0 > LEAVE_IRQ_STACK > > the idea here is that offset=0 means that there is no offset beyond > that implied by ENTER_IRQ_STACK. What actually gets written to the > table is offset 8, because ENTER_IRQ_STACK itself does one push. > > So far, this has no runtime overhead at all. > > Then, in the unwinder, the logic becomes: > > If the return address is annotated in the ptregs entry table, look up > the offset. If the offset is in bounds, then use it. If the offset > is out of bounds and we're on the IRQ stack, then unwind the > ENTER_IRQ_STACK as well. > > Does that seem reasonable? I can try to implement it and see what it > looks like. To be honest I'm not crazy about it. The ANNOTATE_IRQSTACK_PTREGS_CALL macro needs knowledge about the implementation of ENTER_IRQ_STACK and how many pushes it does. I think that makes the entry code more complex and harder to understand than my patch. Also the unwinders would need to be quite a bit more complex, and would need to know more of the intimate details of the irq
Re: [RFC PATCH v2 05/18] sched: add task flag for preempt IRQ tracking
On Fri, May 20, 2016 at 9:41 AM, Josh Poimboeufwrote: > On Fri, May 20, 2016 at 08:41:00AM -0700, Andy Lutomirski wrote: >> On Fri, May 20, 2016 at 7:05 AM, Josh Poimboeuf wrote: >> > On Thu, May 19, 2016 at 04:39:51PM -0700, Andy Lutomirski wrote: >> >> On Thu, May 19, 2016 at 4:15 PM, Josh Poimboeuf >> >> wrote: >> >> > Note this example is with today's unwinder. It could be made smarter to >> >> > get the RIP from the pt_regs so the '?' could be removed from >> >> > copy_page_to_iter(). >> >> > >> >> > Thoughts? >> >> >> >> I think we should do that. The silly sample patch I sent you (or at >> >> least that I think I sent you) did that, and it worked nicely. >> > >> > Yeah, we can certainly do something similar to make the unwinder >> > smarter. It should be very simple with this approach: if it finds the >> > pt_regs() function on the stack, the (struct pt_regs *) pointer will be >> > right after it. >> >> That seems barely easier than checking if it finds a function in >> .entry that's marked on the stack, > > Don't forget you'd also have to look up the function's pt_regs offset in > the table. > >> and the latter has no runtime cost. > > Well, I had been assuming the three extra pushes and one extra pop for > interrupts would be negligible, but I'm definitely no expert there. Any > idea how I can measure it? I think it would be negligible, at least for interrupts, since interrupts are already extremely expensive. But I don't love adding assembly code that makes them even slower. The real thing I dislike about this approach is that it's not a normal stack frame, so you need code in the unwinder to unwind through it correctly, which makes me think that you're not saving much complexity by adding the pushes. But maybe I'm wrong. > >> > I'm not sure about the idea of a table. I get the feeling it would add >> > more complexity to both the entry code and the unwinder. How would you >> > specify the pt_regs location when it's on a different stack? (See the >> > interrupt macro: non-nested interrupts will place pt_regs on the task >> > stack before switching to the irq stack.) >> >> Hmm. I need to think about the interrupt stack case a bit. Although >> the actual top of the interrupt stack has a nearly fixed format, and I >> have old patches to clean it up and make it actually be fixed. I'll >> try to dust those off and resend them soon. > > How would that solve the problem? Would pt_regs be moved or copied to > the irq stack? Hmm. Maybe the right way would be to unwind off the IRQ stack in two steps. Step one would be to figure out that you're on the IRQ stack and pop your way off it. Step two would be to find pt_regs. But that could be rather nasty to implement. Maybe what we actually want to do is this: First, apply this thing: https://git.kernel.org/cgit/linux/kernel/git/luto/linux.git/commit/?h=x86/entry_ist=2231ec7e0bcc1a2bc94a17081511ab54cc6badd1 that gives us a common format for the IRQ stack. Second, use my idea of making a table of offsets to pt_regs, so we'd have, roughly: ENTER_IRQ_STACK old_rsp=%r11 call __do_softirq ANNOTATE_IRQSTACK_PTREGS_CALL offset=0 LEAVE_IRQ_STACK the idea here is that offset=0 means that there is no offset beyond that implied by ENTER_IRQ_STACK. What actually gets written to the table is offset 8, because ENTER_IRQ_STACK itself does one push. So far, this has no runtime overhead at all. Then, in the unwinder, the logic becomes: If the return address is annotated in the ptregs entry table, look up the offset. If the offset is in bounds, then use it. If the offset is out of bounds and we're on the IRQ stack, then unwind the ENTER_IRQ_STACK as well. Does that seem reasonable? I can try to implement it and see what it looks like. --Andy > >> > If you're worried about performance, I can remove the syscall >> > annotations. They're optional anyway, since the pt_regs is always at >> > the same place on the stack for syscalls. >> > >> > I think three extra pushes wouldn't be a performance issue for >> > interrupts/exceptions. And they'll go away when we finally bury >> > CONFIG_FRAME_POINTER. >> >> I bet we'll always need to support CONFIG_FRAME_POINTER for some >> embedded systems. > > Yeah, probably. > >> I'll play with this a bit. > > Thanks, looking forward to seeing what you come up with. > > -- > Josh -- Andy Lutomirski AMA Capital Management, LLC ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [RFC PATCH v2 05/18] sched: add task flag for preempt IRQ tracking
On Fri, May 20, 2016 at 08:41:00AM -0700, Andy Lutomirski wrote: > On Fri, May 20, 2016 at 7:05 AM, Josh Poimboeufwrote: > > On Thu, May 19, 2016 at 04:39:51PM -0700, Andy Lutomirski wrote: > >> On Thu, May 19, 2016 at 4:15 PM, Josh Poimboeuf > >> wrote: > >> > Note this example is with today's unwinder. It could be made smarter to > >> > get the RIP from the pt_regs so the '?' could be removed from > >> > copy_page_to_iter(). > >> > > >> > Thoughts? > >> > >> I think we should do that. The silly sample patch I sent you (or at > >> least that I think I sent you) did that, and it worked nicely. > > > > Yeah, we can certainly do something similar to make the unwinder > > smarter. It should be very simple with this approach: if it finds the > > pt_regs() function on the stack, the (struct pt_regs *) pointer will be > > right after it. > > That seems barely easier than checking if it finds a function in > .entry that's marked on the stack, Don't forget you'd also have to look up the function's pt_regs offset in the table. > and the latter has no runtime cost. Well, I had been assuming the three extra pushes and one extra pop for interrupts would be negligible, but I'm definitely no expert there. Any idea how I can measure it? > > I'm not sure about the idea of a table. I get the feeling it would add > > more complexity to both the entry code and the unwinder. How would you > > specify the pt_regs location when it's on a different stack? (See the > > interrupt macro: non-nested interrupts will place pt_regs on the task > > stack before switching to the irq stack.) > > Hmm. I need to think about the interrupt stack case a bit. Although > the actual top of the interrupt stack has a nearly fixed format, and I > have old patches to clean it up and make it actually be fixed. I'll > try to dust those off and resend them soon. How would that solve the problem? Would pt_regs be moved or copied to the irq stack? > > If you're worried about performance, I can remove the syscall > > annotations. They're optional anyway, since the pt_regs is always at > > the same place on the stack for syscalls. > > > > I think three extra pushes wouldn't be a performance issue for > > interrupts/exceptions. And they'll go away when we finally bury > > CONFIG_FRAME_POINTER. > > I bet we'll always need to support CONFIG_FRAME_POINTER for some > embedded systems. Yeah, probably. > I'll play with this a bit. Thanks, looking forward to seeing what you come up with. -- Josh ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [RFC PATCH v2 05/18] sched: add task flag for preempt IRQ tracking
On Fri, May 20, 2016 at 7:05 AM, Josh Poimboeufwrote: > On Thu, May 19, 2016 at 04:39:51PM -0700, Andy Lutomirski wrote: >> On Thu, May 19, 2016 at 4:15 PM, Josh Poimboeuf wrote: >> > Note this example is with today's unwinder. It could be made smarter to >> > get the RIP from the pt_regs so the '?' could be removed from >> > copy_page_to_iter(). >> > >> > Thoughts? >> >> I think we should do that. The silly sample patch I sent you (or at >> least that I think I sent you) did that, and it worked nicely. > > Yeah, we can certainly do something similar to make the unwinder > smarter. It should be very simple with this approach: if it finds the > pt_regs() function on the stack, the (struct pt_regs *) pointer will be > right after it. That seems barely easier than checking if it finds a function in .entry that's marked on the stack, and the latter has no runtime cost. > >> > diff --git a/arch/x86/entry/calling.h b/arch/x86/entry/calling.h >> > index 9a9e588..f54886a 100644 >> > --- a/arch/x86/entry/calling.h >> > +++ b/arch/x86/entry/calling.h >> > @@ -201,6 +201,32 @@ For 32-bit we have the following conventions - kernel >> > is built with >> > .byte 0xf1 >> > .endm >> > >> > + /* >> > +* Create a stack frame for the saved pt_regs. This allows frame >> > +* pointer based unwinders to find pt_regs on the stack. >> > +*/ >> > + .macro CREATE_PT_REGS_FRAME regs=%rsp >> > +#ifdef CONFIG_FRAME_POINTER >> > + pushq \regs >> > + pushq $pt_regs+1 >> > + pushq %rbp >> > + movq%rsp, %rbp >> > +#endif >> > + .endm >> >> I don't love this part. It's going to hurt performance, and, given >> that we need to change the unwinder anyway to make it useful, let's >> just emit a table somewhere in .rodata and use it directly. > > I'm not sure about the idea of a table. I get the feeling it would add > more complexity to both the entry code and the unwinder. How would you > specify the pt_regs location when it's on a different stack? (See the > interrupt macro: non-nested interrupts will place pt_regs on the task > stack before switching to the irq stack.) Hmm. I need to think about the interrupt stack case a bit. Although the actual top of the interrupt stack has a nearly fixed format, and I have old patches to clean it up and make it actually be fixed. I'll try to dust those off and resend them soon. > > If you're worried about performance, I can remove the syscall > annotations. They're optional anyway, since the pt_regs is always at > the same place on the stack for syscalls. > > I think three extra pushes wouldn't be a performance issue for > interrupts/exceptions. And they'll go away when we finally bury > CONFIG_FRAME_POINTER. I bet we'll always need to support CONFIG_FRAME_POINTER for some embedded systems. I'll play with this a bit. --Andy ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [RFC PATCH v2 05/18] sched: add task flag for preempt IRQ tracking
On Thu, May 19, 2016 at 04:39:51PM -0700, Andy Lutomirski wrote: > On Thu, May 19, 2016 at 4:15 PM, Josh Poimboeufwrote: > > Note this example is with today's unwinder. It could be made smarter to > > get the RIP from the pt_regs so the '?' could be removed from > > copy_page_to_iter(). > > > > Thoughts? > > I think we should do that. The silly sample patch I sent you (or at > least that I think I sent you) did that, and it worked nicely. Yeah, we can certainly do something similar to make the unwinder smarter. It should be very simple with this approach: if it finds the pt_regs() function on the stack, the (struct pt_regs *) pointer will be right after it. > > diff --git a/arch/x86/entry/calling.h b/arch/x86/entry/calling.h > > index 9a9e588..f54886a 100644 > > --- a/arch/x86/entry/calling.h > > +++ b/arch/x86/entry/calling.h > > @@ -201,6 +201,32 @@ For 32-bit we have the following conventions - kernel > > is built with > > .byte 0xf1 > > .endm > > > > + /* > > +* Create a stack frame for the saved pt_regs. This allows frame > > +* pointer based unwinders to find pt_regs on the stack. > > +*/ > > + .macro CREATE_PT_REGS_FRAME regs=%rsp > > +#ifdef CONFIG_FRAME_POINTER > > + pushq \regs > > + pushq $pt_regs+1 > > + pushq %rbp > > + movq%rsp, %rbp > > +#endif > > + .endm > > I don't love this part. It's going to hurt performance, and, given > that we need to change the unwinder anyway to make it useful, let's > just emit a table somewhere in .rodata and use it directly. I'm not sure about the idea of a table. I get the feeling it would add more complexity to both the entry code and the unwinder. How would you specify the pt_regs location when it's on a different stack? (See the interrupt macro: non-nested interrupts will place pt_regs on the task stack before switching to the irq stack.) If you're worried about performance, I can remove the syscall annotations. They're optional anyway, since the pt_regs is always at the same place on the stack for syscalls. I think three extra pushes wouldn't be a performance issue for interrupts/exceptions. And they'll go away when we finally bury CONFIG_FRAME_POINTER. Here's the same patch without syscall annotations: diff --git a/arch/x86/entry/calling.h b/arch/x86/entry/calling.h index 9a9e588..f54886a 100644 --- a/arch/x86/entry/calling.h +++ b/arch/x86/entry/calling.h @@ -201,6 +201,32 @@ For 32-bit we have the following conventions - kernel is built with .byte 0xf1 .endm + /* +* Create a stack frame for the saved pt_regs. This allows frame +* pointer based unwinders to find pt_regs on the stack. +*/ + .macro CREATE_PT_REGS_FRAME regs=%rsp +#ifdef CONFIG_FRAME_POINTER + pushq \regs + pushq $pt_regs+1 + pushq %rbp + movq%rsp, %rbp +#endif + .endm + + .macro REMOVE_PT_REGS_FRAME +#ifdef CONFIG_FRAME_POINTER + popq%rbp + addq$0x10, %rsp +#endif + .endm + + .macro CALL_HANDLER handler regs=%rsp + CREATE_PT_REGS_FRAME \regs + call\handler + REMOVE_PT_REGS_FRAME + .endm + #endif /* CONFIG_X86_64 */ /* diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S index 9ee0da1..93bc8f0 100644 --- a/arch/x86/entry/entry_64.S +++ b/arch/x86/entry/entry_64.S @@ -468,7 +468,7 @@ END(irq_entries_start) /* We entered an interrupt context - irqs are off: */ TRACE_IRQS_OFF - call\func /* rdi points to pt_regs */ + CALL_HANDLER \func regs=%rdi .endm /* @@ -495,7 +495,7 @@ ret_from_intr: /* Interrupt came from user space */ GLOBAL(retint_user) mov %rsp,%rdi - callprepare_exit_to_usermode + CALL_HANDLER prepare_exit_to_usermode TRACE_IRQS_IRETQ SWAPGS jmp restore_regs_and_iret @@ -509,7 +509,7 @@ retint_kernel: jnc 1f 0: cmpl$0, PER_CPU_VAR(__preempt_count) jnz 1f - callpreempt_schedule_irq + CALL_HANDLER preempt_schedule_irq jmp 0b 1: #endif @@ -688,8 +688,6 @@ ENTRY(\sym) .endif .endif - movq%rsp, %rdi /* pt_regs pointer */ - .if \has_error_code movqORIG_RAX(%rsp), %rsi/* get error code */ movq$-1, ORIG_RAX(%rsp) /* no syscall to restart */ @@ -701,7 +699,8 @@ ENTRY(\sym) subq$EXCEPTION_STKSZ, CPU_TSS_IST(\shift_ist) .endif - call\do_sym + movq%rsp, %rdi /* pt_regs pointer */ + CALL_HANDLER \do_sym .if \shift_ist != -1 addq$EXCEPTION_STKSZ, CPU_TSS_IST(\shift_ist) @@ -728,8 +727,6 @@ ENTRY(\sym) callsync_regs movq%rax, %rsp /* switch stack */ - movq%rsp,
Re: [RFC PATCH v2 05/18] sched: add task flag for preempt IRQ tracking
On Thu, May 19, 2016 at 4:15 PM, Josh Poimboeufwrote: > On Mon, May 02, 2016 at 08:52:41AM -0700, Andy Lutomirski wrote: >> On Mon, May 2, 2016 at 6:52 AM, Josh Poimboeuf wrote: >> > On Fri, Apr 29, 2016 at 05:08:50PM -0700, Andy Lutomirski wrote: >> >> On Apr 29, 2016 3:41 PM, "Josh Poimboeuf" wrote: >> >> > >> >> > On Fri, Apr 29, 2016 at 02:37:41PM -0700, Andy Lutomirski wrote: >> >> > > On Fri, Apr 29, 2016 at 2:25 PM, Josh Poimboeuf >> >> > > wrote: >> >> > > >> I suppose we could try to rejigger the code so that rbp points to >> >> > > >> pt_regs or similar. >> >> > > > >> >> > > > I think we should avoid doing something like that because it would >> >> > > > break >> >> > > > gdb and all the other unwinders who don't know about it. >> >> > > >> >> > > How so? >> >> > > >> >> > > Currently, rbp in the entry code is meaningless. I'm suggesting that, >> >> > > when we do, for example, 'call \do_sym' in idtentry, we point rbp to >> >> > > the pt_regs. Currently it points to something stale (which the >> >> > > dump_stack code might be relying on. Hmm.) But it's probably also >> >> > > safe to assume that if you unwind to the 'call \do_sym', then pt_regs >> >> > > is the next thing on the stack, so just doing the section thing would >> >> > > work. >> >> > >> >> > Yes, rbp is meaningless on the entry from user space. But if an >> >> > in-kernel interrupt occurs (e.g. page fault, preemption) and you have >> >> > nested entry, rbp keeps its old value, right? So the unwinder can walk >> >> > past the nested entry frame and keep going until it gets to the original >> >> > entry. >> >> >> >> Yes. >> >> >> >> It would be nice if we could do better, though, and actually notice >> >> the pt_regs and identify the entry. For example, I'd love to see >> >> "page fault, RIP=xyz" printed in the middle of a stack dump on a >> >> crash. >> >> >> >> Also, I think that just following rbp links will lose the >> >> actual function that took the page fault (or whatever function >> >> pt_regs->ip actually points to). >> > >> > Hm. I think we could fix all that in a more standard way. Whenever a >> > new pt_regs frame gets saved on entry, we could also create a new stack >> > frame which points to a fake kernel_entry() function. That would tell >> > the unwinder there's a pt_regs frame without otherwise breaking frame >> > pointers across the frame. >> > >> > Then I guess we wouldn't need my other solution of putting the idt >> > entries in a special section. >> > >> > How does that sound? >> >> Let me try to understand. >> >> The normal call sequence is call; push %rbp; mov %rsp, %rbp. So rbp >> points to (prev rbp, prev rip) on the stack, and you can follow the >> chain back. Right now, on a user access page fault or similar, we >> have rbp (probably) pointing to the interrupted frame, and the >> interrupted rip isn't saved anywhere that a naive unwinder can find >> it. (It's in pt_regs, but the rbp chain skips right over that.) >> >> We could change the entry code so that an interrupt / idtentry does: >> >> push pt_regs >> push kernel_entry >> push %rbp >> mov %rsp, %rbp >> call handler >> pop %rbp >> addq $8, %rsp >> >> or similar. That would make it appear that the actual C handler was >> caused by a dummy function "kernel_entry". Now the unwinder would get >> to kernel_entry, but it *still* wouldn't find its way to the calling >> frame, which only solves part of the problem. We could at least teach >> the unwinder how kernel_entry works and let it decode pt_regs to >> continue unwinding. This would be nice, and I think it could work. >> >> I think I like this, except that, if it used a separate section, it >> could potentially be faster, as, for each actual entry type, the >> offset from the C handler frame to pt_regs is a foregone conclusion. >> But this is pretty simple and performance is already abysmal in most >> handlers. >> >> There's an added benefit to using a separate section, though: we could >> also annotate the calls with what type of entry they were so the >> unwinder could print it out nicely. >> >> I could be convinced either way. > > Ok, I took a stab at this. See the patch below. > > In addition to annotating interrupt/exception pt_regs frames, I also > annotated all the syscall pt_regs, for consistency. > > As you mentioned, it will affect performance a bit, but I think it will > be insignificant. > > I think I like this approach better than putting the > interrupt/idtentry's in a special section, because this is much more > precise. Especially now that I'm annotating pt_regs syscalls. > > Also I think with a few minor changes we could implement your idea of > annotating the calls with what type of entry they are. But I don't > think that's really needed, because the name of the interrupt/idtentry > is already on the stack trace. > > Before: > > [] dump_stack+0x85/0xc2 > []
Re: [RFC PATCH v2 05/18] sched: add task flag for preempt IRQ tracking
On Mon, May 02, 2016 at 08:52:41AM -0700, Andy Lutomirski wrote: > On Mon, May 2, 2016 at 6:52 AM, Josh Poimboeufwrote: > > On Fri, Apr 29, 2016 at 05:08:50PM -0700, Andy Lutomirski wrote: > >> On Apr 29, 2016 3:41 PM, "Josh Poimboeuf" wrote: > >> > > >> > On Fri, Apr 29, 2016 at 02:37:41PM -0700, Andy Lutomirski wrote: > >> > > On Fri, Apr 29, 2016 at 2:25 PM, Josh Poimboeuf > >> > > wrote: > >> > > >> I suppose we could try to rejigger the code so that rbp points to > >> > > >> pt_regs or similar. > >> > > > > >> > > > I think we should avoid doing something like that because it would > >> > > > break > >> > > > gdb and all the other unwinders who don't know about it. > >> > > > >> > > How so? > >> > > > >> > > Currently, rbp in the entry code is meaningless. I'm suggesting that, > >> > > when we do, for example, 'call \do_sym' in idtentry, we point rbp to > >> > > the pt_regs. Currently it points to something stale (which the > >> > > dump_stack code might be relying on. Hmm.) But it's probably also > >> > > safe to assume that if you unwind to the 'call \do_sym', then pt_regs > >> > > is the next thing on the stack, so just doing the section thing would > >> > > work. > >> > > >> > Yes, rbp is meaningless on the entry from user space. But if an > >> > in-kernel interrupt occurs (e.g. page fault, preemption) and you have > >> > nested entry, rbp keeps its old value, right? So the unwinder can walk > >> > past the nested entry frame and keep going until it gets to the original > >> > entry. > >> > >> Yes. > >> > >> It would be nice if we could do better, though, and actually notice > >> the pt_regs and identify the entry. For example, I'd love to see > >> "page fault, RIP=xyz" printed in the middle of a stack dump on a > >> crash. > >> > >> Also, I think that just following rbp links will lose the > >> actual function that took the page fault (or whatever function > >> pt_regs->ip actually points to). > > > > Hm. I think we could fix all that in a more standard way. Whenever a > > new pt_regs frame gets saved on entry, we could also create a new stack > > frame which points to a fake kernel_entry() function. That would tell > > the unwinder there's a pt_regs frame without otherwise breaking frame > > pointers across the frame. > > > > Then I guess we wouldn't need my other solution of putting the idt > > entries in a special section. > > > > How does that sound? > > Let me try to understand. > > The normal call sequence is call; push %rbp; mov %rsp, %rbp. So rbp > points to (prev rbp, prev rip) on the stack, and you can follow the > chain back. Right now, on a user access page fault or similar, we > have rbp (probably) pointing to the interrupted frame, and the > interrupted rip isn't saved anywhere that a naive unwinder can find > it. (It's in pt_regs, but the rbp chain skips right over that.) > > We could change the entry code so that an interrupt / idtentry does: > > push pt_regs > push kernel_entry > push %rbp > mov %rsp, %rbp > call handler > pop %rbp > addq $8, %rsp > > or similar. That would make it appear that the actual C handler was > caused by a dummy function "kernel_entry". Now the unwinder would get > to kernel_entry, but it *still* wouldn't find its way to the calling > frame, which only solves part of the problem. We could at least teach > the unwinder how kernel_entry works and let it decode pt_regs to > continue unwinding. This would be nice, and I think it could work. > > I think I like this, except that, if it used a separate section, it > could potentially be faster, as, for each actual entry type, the > offset from the C handler frame to pt_regs is a foregone conclusion. > But this is pretty simple and performance is already abysmal in most > handlers. > > There's an added benefit to using a separate section, though: we could > also annotate the calls with what type of entry they were so the > unwinder could print it out nicely. > > I could be convinced either way. Ok, I took a stab at this. See the patch below. In addition to annotating interrupt/exception pt_regs frames, I also annotated all the syscall pt_regs, for consistency. As you mentioned, it will affect performance a bit, but I think it will be insignificant. I think I like this approach better than putting the interrupt/idtentry's in a special section, because this is much more precise. Especially now that I'm annotating pt_regs syscalls. Also I think with a few minor changes we could implement your idea of annotating the calls with what type of entry they are. But I don't think that's really needed, because the name of the interrupt/idtentry is already on the stack trace. Before: [] dump_stack+0x85/0xc2 [] __do_page_fault+0x576/0x5a0 [] trace_do_page_fault+0x5c/0x2e0 [] do_async_page_fault+0x2c/0xa0 [] async_page_fault+0x28/0x30 [] ? copy_page_to_iter+0x70/0x440 [] ? pagecache_get_page+0x2c/0x290 []
RE: [RFC PATCH v2 05/18] sched: add task flag for preempt IRQ tracking
From: Andy Lutomirski > Sent: 02 May 2016 19:13 ... > I hope your plans include rewriting the current stack unwinder > completely. The thing in print_context_stack is (a) > hard-to-understand and hard-to-modify crap and (b) is called in a loop > from another file using totally ridiculous conventions. I've seen a 'stack unwinder' that parsed the instruction stream forwards looking for 'return' instructions. I fixed it to add a few extra instructions needs to sort out the exit path from hardware interrupts. It only had to understand instructions that modified %sp and %bp and remember which branch instructions and branch targets it had used in order to find the correct exit path from a function. Worked reasonably well without any debug info or guaranteed frame pointers. It did have to fall back on scanning the stack if it was inside an infinite loop. Even on x86 it is reasonably possible to check for 'call' instructions in this case. David ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [RFC PATCH v2 05/18] sched: add task flag for preempt IRQ tracking
On Mon, May 2, 2016 at 1:00 PM, Jiri Kosinawrote: > On Mon, 2 May 2016, Jiri Kosina wrote: > >> > FWIW, I just tried this: >> > >> > static bool is_entry_text(unsigned long addr) >> > { >> > return addr >= (unsigned long)__entry_text_start && >> > addr < (unsigned long)__entry_text_end; >> > } >> > >> > it works. So the entry code is already annotated reasonably well :) >> > >> > I just hacked it up here: >> > >> > https://git.kernel.org/cgit/linux/kernel/git/luto/linux.git/commit/?h=stack=085eacfe0edfc18768e48340084415dba9a6bd21 >> > >> > and it seems to work, at least for page faults. A better >> > implementation would print out the entire contents of pt_regs so that >> > people reading the stack trace will know the registers at the time of >> > the exception, which might be helpful. >> >> Sorry for being dense, but how do you distinguish here between a "real" >> kernel entry, that pushes pt_regs, and any "non-entry" function call that >> passes pt_regs around? > > Umm, actually, the more tricky part is the other way around -- how do you > make sure that whenever you are calling out from a code between > __entry_text_start and __entry_text_end, pt_regs will be at the place > you're looking for it? How's that guaranteed? It's not guaranteed in my code. I think we'd want to add a little table of call sites and their pt_regs offsets. This was just meant to test that the general idea works (and it does indeed generate better traces than the stock kernel, which gets it unconditionally wrong). --Andy > > Thanks, > > -- > Jiri Kosina > SUSE Labs > -- Andy Lutomirski AMA Capital Management, LLC ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [RFC PATCH v2 05/18] sched: add task flag for preempt IRQ tracking
On Mon, 2 May 2016, Jiri Kosina wrote: > > FWIW, I just tried this: > > > > static bool is_entry_text(unsigned long addr) > > { > > return addr >= (unsigned long)__entry_text_start && > > addr < (unsigned long)__entry_text_end; > > } > > > > it works. So the entry code is already annotated reasonably well :) > > > > I just hacked it up here: > > > > https://git.kernel.org/cgit/linux/kernel/git/luto/linux.git/commit/?h=stack=085eacfe0edfc18768e48340084415dba9a6bd21 > > > > and it seems to work, at least for page faults. A better > > implementation would print out the entire contents of pt_regs so that > > people reading the stack trace will know the registers at the time of > > the exception, which might be helpful. > > Sorry for being dense, but how do you distinguish here between a "real" > kernel entry, that pushes pt_regs, and any "non-entry" function call that > passes pt_regs around? Umm, actually, the more tricky part is the other way around -- how do you make sure that whenever you are calling out from a code between __entry_text_start and __entry_text_end, pt_regs will be at the place you're looking for it? How's that guaranteed? Thanks, -- Jiri Kosina SUSE Labs ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [RFC PATCH v2 05/18] sched: add task flag for preempt IRQ tracking
On Mon, 2 May 2016, Andy Lutomirski wrote: > FWIW, I just tried this: > > static bool is_entry_text(unsigned long addr) > { > return addr >= (unsigned long)__entry_text_start && > addr < (unsigned long)__entry_text_end; > } > > it works. So the entry code is already annotated reasonably well :) > > I just hacked it up here: > > https://git.kernel.org/cgit/linux/kernel/git/luto/linux.git/commit/?h=stack=085eacfe0edfc18768e48340084415dba9a6bd21 > > and it seems to work, at least for page faults. A better > implementation would print out the entire contents of pt_regs so that > people reading the stack trace will know the registers at the time of > the exception, which might be helpful. Sorry for being dense, but how do you distinguish here between a "real" kernel entry, that pushes pt_regs, and any "non-entry" function call that passes pt_regs around? -- Jiri Kosina SUSE Labs ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [RFC PATCH v2 05/18] sched: add task flag for preempt IRQ tracking
On Mon, May 02, 2016 at 11:12:39AM -0700, Andy Lutomirski wrote: > On Mon, May 2, 2016 at 10:31 AM, Josh Poimboeufwrote: > > On Mon, May 02, 2016 at 08:52:41AM -0700, Andy Lutomirski wrote: > >> On Mon, May 2, 2016 at 6:52 AM, Josh Poimboeuf wrote: > >> > On Fri, Apr 29, 2016 at 05:08:50PM -0700, Andy Lutomirski wrote: > >> >> On Apr 29, 2016 3:41 PM, "Josh Poimboeuf" wrote: > >> >> > > >> >> > On Fri, Apr 29, 2016 at 02:37:41PM -0700, Andy Lutomirski wrote: > >> >> > > On Fri, Apr 29, 2016 at 2:25 PM, Josh Poimboeuf > >> >> > > wrote: > >> >> > > >> I suppose we could try to rejigger the code so that rbp points to > >> >> > > >> pt_regs or similar. > >> >> > > > > >> >> > > > I think we should avoid doing something like that because it > >> >> > > > would break > >> >> > > > gdb and all the other unwinders who don't know about it. > >> >> > > > >> >> > > How so? > >> >> > > > >> >> > > Currently, rbp in the entry code is meaningless. I'm suggesting > >> >> > > that, > >> >> > > when we do, for example, 'call \do_sym' in idtentry, we point rbp to > >> >> > > the pt_regs. Currently it points to something stale (which the > >> >> > > dump_stack code might be relying on. Hmm.) But it's probably also > >> >> > > safe to assume that if you unwind to the 'call \do_sym', then > >> >> > > pt_regs > >> >> > > is the next thing on the stack, so just doing the section thing > >> >> > > would > >> >> > > work. > >> >> > > >> >> > Yes, rbp is meaningless on the entry from user space. But if an > >> >> > in-kernel interrupt occurs (e.g. page fault, preemption) and you have > >> >> > nested entry, rbp keeps its old value, right? So the unwinder can > >> >> > walk > >> >> > past the nested entry frame and keep going until it gets to the > >> >> > original > >> >> > entry. > >> >> > >> >> Yes. > >> >> > >> >> It would be nice if we could do better, though, and actually notice > >> >> the pt_regs and identify the entry. For example, I'd love to see > >> >> "page fault, RIP=xyz" printed in the middle of a stack dump on a > >> >> crash. > >> >> > >> >> Also, I think that just following rbp links will lose the > >> >> actual function that took the page fault (or whatever function > >> >> pt_regs->ip actually points to). > >> > > >> > Hm. I think we could fix all that in a more standard way. Whenever a > >> > new pt_regs frame gets saved on entry, we could also create a new stack > >> > frame which points to a fake kernel_entry() function. That would tell > >> > the unwinder there's a pt_regs frame without otherwise breaking frame > >> > pointers across the frame. > >> > > >> > Then I guess we wouldn't need my other solution of putting the idt > >> > entries in a special section. > >> > > >> > How does that sound? > >> > >> Let me try to understand. > >> > >> The normal call sequence is call; push %rbp; mov %rsp, %rbp. So rbp > >> points to (prev rbp, prev rip) on the stack, and you can follow the > >> chain back. Right now, on a user access page fault or similar, we > >> have rbp (probably) pointing to the interrupted frame, and the > >> interrupted rip isn't saved anywhere that a naive unwinder can find > >> it. (It's in pt_regs, but the rbp chain skips right over that.) > >> > >> We could change the entry code so that an interrupt / idtentry does: > >> > >> push pt_regs > >> push kernel_entry > >> push %rbp > >> mov %rsp, %rbp > >> call handler > >> pop %rbp > >> addq $8, %rsp > >> > >> or similar. That would make it appear that the actual C handler was > >> caused by a dummy function "kernel_entry". Now the unwinder would get > >> to kernel_entry, but it *still* wouldn't find its way to the calling > >> frame, which only solves part of the problem. We could at least teach > >> the unwinder how kernel_entry works and let it decode pt_regs to > >> continue unwinding. This would be nice, and I think it could work. > > > > Yeah, that's about what I had in mind. > > FWIW, I just tried this: > > static bool is_entry_text(unsigned long addr) > { > return addr >= (unsigned long)__entry_text_start && > addr < (unsigned long)__entry_text_end; > } > > it works. So the entry code is already annotated reasonably well :) > > I just hacked it up here: > > https://git.kernel.org/cgit/linux/kernel/git/luto/linux.git/commit/?h=stack=085eacfe0edfc18768e48340084415dba9a6bd21 > > and it seems to work, at least for page faults. A better > implementation would print out the entire contents of pt_regs so that > people reading the stack trace will know the registers at the time of > the exception, which might be helpful. I still think we would need more specific annotations to do that reliably: a call from entry code doesn't necessarily correlate with a pt_regs frame. > >> I think I like this, except that, if it used a separate section, it > >> could potentially be faster, as, for each actual entry type, the > >>
Re: [RFC PATCH v2 05/18] sched: add task flag for preempt IRQ tracking
* Andy Lutomirskiwrote: > > Another idea to detect missing frames: for each return address on the > > stack, > > ensure there's a corresponding "call " instruction immediately > > preceding > > the return location, where matches what's on the stack. > > Hmm, interesting. > > I hope your plans include rewriting the current stack unwinder completely. > The > thing in print_context_stack is (a) hard-to-understand and hard-to-modify > crap > and (b) is called in a loop from another file using totally ridiculous > conventions. So we had several attempts at making it better, any further improvements (including radical rewrites) are more than welcome! The generalization between the various stack walking methods certainly didn't make things easier to read - we might want to eliminate that by using better primitives to iterate over the stack frame. Thanks, Ingo ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [RFC PATCH v2 05/18] sched: add task flag for preempt IRQ tracking
On Mon, May 2, 2016 at 10:31 AM, Josh Poimboeufwrote: > On Mon, May 02, 2016 at 08:52:41AM -0700, Andy Lutomirski wrote: >> On Mon, May 2, 2016 at 6:52 AM, Josh Poimboeuf wrote: >> > On Fri, Apr 29, 2016 at 05:08:50PM -0700, Andy Lutomirski wrote: >> >> On Apr 29, 2016 3:41 PM, "Josh Poimboeuf" wrote: >> >> > >> >> > On Fri, Apr 29, 2016 at 02:37:41PM -0700, Andy Lutomirski wrote: >> >> > > On Fri, Apr 29, 2016 at 2:25 PM, Josh Poimboeuf >> >> > > wrote: >> >> > > >> I suppose we could try to rejigger the code so that rbp points to >> >> > > >> pt_regs or similar. >> >> > > > >> >> > > > I think we should avoid doing something like that because it would >> >> > > > break >> >> > > > gdb and all the other unwinders who don't know about it. >> >> > > >> >> > > How so? >> >> > > >> >> > > Currently, rbp in the entry code is meaningless. I'm suggesting that, >> >> > > when we do, for example, 'call \do_sym' in idtentry, we point rbp to >> >> > > the pt_regs. Currently it points to something stale (which the >> >> > > dump_stack code might be relying on. Hmm.) But it's probably also >> >> > > safe to assume that if you unwind to the 'call \do_sym', then pt_regs >> >> > > is the next thing on the stack, so just doing the section thing would >> >> > > work. >> >> > >> >> > Yes, rbp is meaningless on the entry from user space. But if an >> >> > in-kernel interrupt occurs (e.g. page fault, preemption) and you have >> >> > nested entry, rbp keeps its old value, right? So the unwinder can walk >> >> > past the nested entry frame and keep going until it gets to the original >> >> > entry. >> >> >> >> Yes. >> >> >> >> It would be nice if we could do better, though, and actually notice >> >> the pt_regs and identify the entry. For example, I'd love to see >> >> "page fault, RIP=xyz" printed in the middle of a stack dump on a >> >> crash. >> >> >> >> Also, I think that just following rbp links will lose the >> >> actual function that took the page fault (or whatever function >> >> pt_regs->ip actually points to). >> > >> > Hm. I think we could fix all that in a more standard way. Whenever a >> > new pt_regs frame gets saved on entry, we could also create a new stack >> > frame which points to a fake kernel_entry() function. That would tell >> > the unwinder there's a pt_regs frame without otherwise breaking frame >> > pointers across the frame. >> > >> > Then I guess we wouldn't need my other solution of putting the idt >> > entries in a special section. >> > >> > How does that sound? >> >> Let me try to understand. >> >> The normal call sequence is call; push %rbp; mov %rsp, %rbp. So rbp >> points to (prev rbp, prev rip) on the stack, and you can follow the >> chain back. Right now, on a user access page fault or similar, we >> have rbp (probably) pointing to the interrupted frame, and the >> interrupted rip isn't saved anywhere that a naive unwinder can find >> it. (It's in pt_regs, but the rbp chain skips right over that.) >> >> We could change the entry code so that an interrupt / idtentry does: >> >> push pt_regs >> push kernel_entry >> push %rbp >> mov %rsp, %rbp >> call handler >> pop %rbp >> addq $8, %rsp >> >> or similar. That would make it appear that the actual C handler was >> caused by a dummy function "kernel_entry". Now the unwinder would get >> to kernel_entry, but it *still* wouldn't find its way to the calling >> frame, which only solves part of the problem. We could at least teach >> the unwinder how kernel_entry works and let it decode pt_regs to >> continue unwinding. This would be nice, and I think it could work. > > Yeah, that's about what I had in mind. FWIW, I just tried this: static bool is_entry_text(unsigned long addr) { return addr >= (unsigned long)__entry_text_start && addr < (unsigned long)__entry_text_end; } it works. So the entry code is already annotated reasonably well :) I just hacked it up here: https://git.kernel.org/cgit/linux/kernel/git/luto/linux.git/commit/?h=stack=085eacfe0edfc18768e48340084415dba9a6bd21 and it seems to work, at least for page faults. A better implementation would print out the entire contents of pt_regs so that people reading the stack trace will know the registers at the time of the exception, which might be helpful. > >> I think I like this, except that, if it used a separate section, it >> could potentially be faster, as, for each actual entry type, the >> offset from the C handler frame to pt_regs is a foregone conclusion. > > Hm, this I don't really follow. It's true that the unwinder can easily > find RIP from pt_regs, which will always be a known offset from the > kernel_entry pointer on the stack. But why would having the entry code > in a separate section make that faster? It doesn't make the unwinder faster -- it makes the entry code faster. > >> But this is pretty simple and performance is already abysmal in most >>
Re: [RFC PATCH v2 05/18] sched: add task flag for preempt IRQ tracking
On Mon, May 02, 2016 at 08:52:41AM -0700, Andy Lutomirski wrote: > On Mon, May 2, 2016 at 6:52 AM, Josh Poimboeufwrote: > > On Fri, Apr 29, 2016 at 05:08:50PM -0700, Andy Lutomirski wrote: > >> On Apr 29, 2016 3:41 PM, "Josh Poimboeuf" wrote: > >> > > >> > On Fri, Apr 29, 2016 at 02:37:41PM -0700, Andy Lutomirski wrote: > >> > > On Fri, Apr 29, 2016 at 2:25 PM, Josh Poimboeuf > >> > > wrote: > >> > > >> I suppose we could try to rejigger the code so that rbp points to > >> > > >> pt_regs or similar. > >> > > > > >> > > > I think we should avoid doing something like that because it would > >> > > > break > >> > > > gdb and all the other unwinders who don't know about it. > >> > > > >> > > How so? > >> > > > >> > > Currently, rbp in the entry code is meaningless. I'm suggesting that, > >> > > when we do, for example, 'call \do_sym' in idtentry, we point rbp to > >> > > the pt_regs. Currently it points to something stale (which the > >> > > dump_stack code might be relying on. Hmm.) But it's probably also > >> > > safe to assume that if you unwind to the 'call \do_sym', then pt_regs > >> > > is the next thing on the stack, so just doing the section thing would > >> > > work. > >> > > >> > Yes, rbp is meaningless on the entry from user space. But if an > >> > in-kernel interrupt occurs (e.g. page fault, preemption) and you have > >> > nested entry, rbp keeps its old value, right? So the unwinder can walk > >> > past the nested entry frame and keep going until it gets to the original > >> > entry. > >> > >> Yes. > >> > >> It would be nice if we could do better, though, and actually notice > >> the pt_regs and identify the entry. For example, I'd love to see > >> "page fault, RIP=xyz" printed in the middle of a stack dump on a > >> crash. > >> > >> Also, I think that just following rbp links will lose the > >> actual function that took the page fault (or whatever function > >> pt_regs->ip actually points to). > > > > Hm. I think we could fix all that in a more standard way. Whenever a > > new pt_regs frame gets saved on entry, we could also create a new stack > > frame which points to a fake kernel_entry() function. That would tell > > the unwinder there's a pt_regs frame without otherwise breaking frame > > pointers across the frame. > > > > Then I guess we wouldn't need my other solution of putting the idt > > entries in a special section. > > > > How does that sound? > > Let me try to understand. > > The normal call sequence is call; push %rbp; mov %rsp, %rbp. So rbp > points to (prev rbp, prev rip) on the stack, and you can follow the > chain back. Right now, on a user access page fault or similar, we > have rbp (probably) pointing to the interrupted frame, and the > interrupted rip isn't saved anywhere that a naive unwinder can find > it. (It's in pt_regs, but the rbp chain skips right over that.) > > We could change the entry code so that an interrupt / idtentry does: > > push pt_regs > push kernel_entry > push %rbp > mov %rsp, %rbp > call handler > pop %rbp > addq $8, %rsp > > or similar. That would make it appear that the actual C handler was > caused by a dummy function "kernel_entry". Now the unwinder would get > to kernel_entry, but it *still* wouldn't find its way to the calling > frame, which only solves part of the problem. We could at least teach > the unwinder how kernel_entry works and let it decode pt_regs to > continue unwinding. This would be nice, and I think it could work. Yeah, that's about what I had in mind. > I think I like this, except that, if it used a separate section, it > could potentially be faster, as, for each actual entry type, the > offset from the C handler frame to pt_regs is a foregone conclusion. Hm, this I don't really follow. It's true that the unwinder can easily find RIP from pt_regs, which will always be a known offset from the kernel_entry pointer on the stack. But why would having the entry code in a separate section make that faster? > But this is pretty simple and performance is already abysmal in most > handlers. > > There's an added benefit to using a separate section, though: we could > also annotate the calls with what type of entry they were so the > unwinder could print it out nicely. Yeah, that could be a nice feature... but doesn't printing the name of the C handler pretty much already give that information? In any case, once we have a working DWARF unwinder, I think it will show the name of the idt entry anyway. > >> Have you looked at my vdso unwinding test at all? If we could do > >> something similar for the kernel, IMO it would make testing much more > >> pleasant. > > > > I found it, but I'm not sure what it would mean to do something similar > > for the kernel. Do you mean doing something like an NMI sampling-based > > approach where we periodically do a random stack sanity check? > > I was imagining something a little more strict: single-step >
Re: [RFC PATCH v2 05/18] sched: add task flag for preempt IRQ tracking
On Mon, May 2, 2016 at 6:52 AM, Josh Poimboeufwrote: > On Fri, Apr 29, 2016 at 05:08:50PM -0700, Andy Lutomirski wrote: >> On Apr 29, 2016 3:41 PM, "Josh Poimboeuf" wrote: >> > >> > On Fri, Apr 29, 2016 at 02:37:41PM -0700, Andy Lutomirski wrote: >> > > On Fri, Apr 29, 2016 at 2:25 PM, Josh Poimboeuf >> > > wrote: >> > > >> I suppose we could try to rejigger the code so that rbp points to >> > > >> pt_regs or similar. >> > > > >> > > > I think we should avoid doing something like that because it would >> > > > break >> > > > gdb and all the other unwinders who don't know about it. >> > > >> > > How so? >> > > >> > > Currently, rbp in the entry code is meaningless. I'm suggesting that, >> > > when we do, for example, 'call \do_sym' in idtentry, we point rbp to >> > > the pt_regs. Currently it points to something stale (which the >> > > dump_stack code might be relying on. Hmm.) But it's probably also >> > > safe to assume that if you unwind to the 'call \do_sym', then pt_regs >> > > is the next thing on the stack, so just doing the section thing would >> > > work. >> > >> > Yes, rbp is meaningless on the entry from user space. But if an >> > in-kernel interrupt occurs (e.g. page fault, preemption) and you have >> > nested entry, rbp keeps its old value, right? So the unwinder can walk >> > past the nested entry frame and keep going until it gets to the original >> > entry. >> >> Yes. >> >> It would be nice if we could do better, though, and actually notice >> the pt_regs and identify the entry. For example, I'd love to see >> "page fault, RIP=xyz" printed in the middle of a stack dump on a >> crash. >> >> Also, I think that just following rbp links will lose the >> actual function that took the page fault (or whatever function >> pt_regs->ip actually points to). > > Hm. I think we could fix all that in a more standard way. Whenever a > new pt_regs frame gets saved on entry, we could also create a new stack > frame which points to a fake kernel_entry() function. That would tell > the unwinder there's a pt_regs frame without otherwise breaking frame > pointers across the frame. > > Then I guess we wouldn't need my other solution of putting the idt > entries in a special section. > > How does that sound? Let me try to understand. The normal call sequence is call; push %rbp; mov %rsp, %rbp. So rbp points to (prev rbp, prev rip) on the stack, and you can follow the chain back. Right now, on a user access page fault or similar, we have rbp (probably) pointing to the interrupted frame, and the interrupted rip isn't saved anywhere that a naive unwinder can find it. (It's in pt_regs, but the rbp chain skips right over that.) We could change the entry code so that an interrupt / idtentry does: push pt_regs push kernel_entry push %rbp mov %rsp, %rbp call handler pop %rbp addq $8, %rsp or similar. That would make it appear that the actual C handler was caused by a dummy function "kernel_entry". Now the unwinder would get to kernel_entry, but it *still* wouldn't find its way to the calling frame, which only solves part of the problem. We could at least teach the unwinder how kernel_entry works and let it decode pt_regs to continue unwinding. This would be nice, and I think it could work. I think I like this, except that, if it used a separate section, it could potentially be faster, as, for each actual entry type, the offset from the C handler frame to pt_regs is a foregone conclusion. But this is pretty simple and performance is already abysmal in most handlers. There's an added benefit to using a separate section, though: we could also annotate the calls with what type of entry they were so the unwinder could print it out nicely. I could be convinced either way. > >> Have you looked at my vdso unwinding test at all? If we could do >> something similar for the kernel, IMO it would make testing much more >> pleasant. > > I found it, but I'm not sure what it would mean to do something similar > for the kernel. Do you mean doing something like an NMI sampling-based > approach where we periodically do a random stack sanity check? I was imagining something a little more strict: single-step interesting parts of the kernel and make sure that each step unwinds correctly. That could detect missing frames and similar. ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [RFC PATCH v2 05/18] sched: add task flag for preempt IRQ tracking
On Fri, Apr 29, 2016 at 05:08:50PM -0700, Andy Lutomirski wrote: > On Apr 29, 2016 3:41 PM, "Josh Poimboeuf"wrote: > > > > On Fri, Apr 29, 2016 at 02:37:41PM -0700, Andy Lutomirski wrote: > > > On Fri, Apr 29, 2016 at 2:25 PM, Josh Poimboeuf > > > wrote: > > > >> I suppose we could try to rejigger the code so that rbp points to > > > >> pt_regs or similar. > > > > > > > > I think we should avoid doing something like that because it would break > > > > gdb and all the other unwinders who don't know about it. > > > > > > How so? > > > > > > Currently, rbp in the entry code is meaningless. I'm suggesting that, > > > when we do, for example, 'call \do_sym' in idtentry, we point rbp to > > > the pt_regs. Currently it points to something stale (which the > > > dump_stack code might be relying on. Hmm.) But it's probably also > > > safe to assume that if you unwind to the 'call \do_sym', then pt_regs > > > is the next thing on the stack, so just doing the section thing would > > > work. > > > > Yes, rbp is meaningless on the entry from user space. But if an > > in-kernel interrupt occurs (e.g. page fault, preemption) and you have > > nested entry, rbp keeps its old value, right? So the unwinder can walk > > past the nested entry frame and keep going until it gets to the original > > entry. > > Yes. > > It would be nice if we could do better, though, and actually notice > the pt_regs and identify the entry. For example, I'd love to see > "page fault, RIP=xyz" printed in the middle of a stack dump on a > crash. > > Also, I think that just following rbp links will lose the > actual function that took the page fault (or whatever function > pt_regs->ip actually points to). Hm. I think we could fix all that in a more standard way. Whenever a new pt_regs frame gets saved on entry, we could also create a new stack frame which points to a fake kernel_entry() function. That would tell the unwinder there's a pt_regs frame without otherwise breaking frame pointers across the frame. Then I guess we wouldn't need my other solution of putting the idt entries in a special section. How does that sound? > Have you looked at my vdso unwinding test at all? If we could do > something similar for the kernel, IMO it would make testing much more > pleasant. I found it, but I'm not sure what it would mean to do something similar for the kernel. Do you mean doing something like an NMI sampling-based approach where we periodically do a random stack sanity check? (If so, I do have something like that planned.) -- Josh ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [RFC PATCH v2 05/18] sched: add task flag for preempt IRQ tracking
On Apr 29, 2016 3:11 PM, "Jiri Kosina"wrote: > > On Fri, 29 Apr 2016, Andy Lutomirski wrote: > > > > NMI, MCE and interrupts aren't a problem because they have dedicated > > > stacks, which are easy to detect. If the tasks' stack is on an > > > exception stack or an irq stack, we consider it unreliable. > > > > Only on x86_64. > > Well, MCEs are more or less x86-specific as well. But otherwise good > point, thanks Andy. > > So, how does stack layout generally look like in case when NMI is actually > running on proper kernel stack? I thought it's guaranteed to contain > pt_regs anyway in all cases. Is that not guaranteed to be the case? > On x86, at least, there will still be pt_regs for the NMI. For the interrupted state, though, there might not be pt_regs, as the NMI might have happened while still populating pt_regs. In fact, the NMI stack could overlap task_pt_regs. For x86_32, there's no guarantee that pt_regs contains sp due to hardware silliness. You need to parse it more carefully, as, !user_mode(regs), then the old sp is just above pt_regs. --Andy ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [RFC PATCH v2 05/18] sched: add task flag for preempt IRQ tracking
On Apr 29, 2016 3:41 PM, "Josh Poimboeuf"wrote: > > On Fri, Apr 29, 2016 at 02:37:41PM -0700, Andy Lutomirski wrote: > > On Fri, Apr 29, 2016 at 2:25 PM, Josh Poimboeuf wrote: > > > I think the easiest way to make it work would be to modify the idtentry > > > macro to put all the idt entries in a dedicated section. Then the > > > unwinder could easily detect any calls from that code. > > > > That would work. Would it make sense to do the same for the irq entries? > > Yes, I think so. > > > >> I suppose we could try to rejigger the code so that rbp points to > > >> pt_regs or similar. > > > > > > I think we should avoid doing something like that because it would break > > > gdb and all the other unwinders who don't know about it. > > > > How so? > > > > Currently, rbp in the entry code is meaningless. I'm suggesting that, > > when we do, for example, 'call \do_sym' in idtentry, we point rbp to > > the pt_regs. Currently it points to something stale (which the > > dump_stack code might be relying on. Hmm.) But it's probably also > > safe to assume that if you unwind to the 'call \do_sym', then pt_regs > > is the next thing on the stack, so just doing the section thing would > > work. > > Yes, rbp is meaningless on the entry from user space. But if an > in-kernel interrupt occurs (e.g. page fault, preemption) and you have > nested entry, rbp keeps its old value, right? So the unwinder can walk > past the nested entry frame and keep going until it gets to the original > entry. Yes. It would be nice if we could do better, though, and actually notice the pt_regs and identify the entry. For example, I'd love to see "page fault, RIP=xyz" printed in the middle of a stack dump on a crash. Also, I think that just following rbp links will lose the actual function that took the page fault (or whatever function pt_regs->ip actually points to). > > > We should really re-add DWARF some day. > > Working on it :-) Excellent. Have you looked at my vdso unwinding test at all? If we could do something similar for the kernel, IMO it would make testing much more pleasant. --Andy ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [RFC PATCH v2 05/18] sched: add task flag for preempt IRQ tracking
On Sat, Apr 30, 2016 at 12:11:45AM +0200, Jiri Kosina wrote: > On Fri, 29 Apr 2016, Andy Lutomirski wrote: > > > NMI, MCE and interrupts aren't a problem because they have dedicated > > > stacks, which are easy to detect. If the tasks' stack is on an > > > exception stack or an irq stack, we consider it unreliable. > > > > Only on x86_64. > > Well, MCEs are more or less x86-specific as well. But otherwise good > point, thanks Andy. > > So, how does stack layout generally look like in case when NMI is actually > running on proper kernel stack? I thought it's guaranteed to contain > pt_regs anyway in all cases. Is that not guaranteed to be the case? If the NMI were using the normal kernel stack and it interrupted kernel space, pt_regs would be placed in the "middle" of the stack rather than the bottom, and there's currently no way to detect that. However, NMIs don't sleep, and we only consider sleeping tasks for stack reliability, so it wouldn't be an issue anyway. -- Josh ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [RFC PATCH v2 05/18] sched: add task flag for preempt IRQ tracking
On Fri, Apr 29, 2016 at 02:37:41PM -0700, Andy Lutomirski wrote: > On Fri, Apr 29, 2016 at 2:25 PM, Josh Poimboeufwrote: > > I think the easiest way to make it work would be to modify the idtentry > > macro to put all the idt entries in a dedicated section. Then the > > unwinder could easily detect any calls from that code. > > That would work. Would it make sense to do the same for the irq entries? Yes, I think so. > >> I suppose we could try to rejigger the code so that rbp points to > >> pt_regs or similar. > > > > I think we should avoid doing something like that because it would break > > gdb and all the other unwinders who don't know about it. > > How so? > > Currently, rbp in the entry code is meaningless. I'm suggesting that, > when we do, for example, 'call \do_sym' in idtentry, we point rbp to > the pt_regs. Currently it points to something stale (which the > dump_stack code might be relying on. Hmm.) But it's probably also > safe to assume that if you unwind to the 'call \do_sym', then pt_regs > is the next thing on the stack, so just doing the section thing would > work. Yes, rbp is meaningless on the entry from user space. But if an in-kernel interrupt occurs (e.g. page fault, preemption) and you have nested entry, rbp keeps its old value, right? So the unwinder can walk past the nested entry frame and keep going until it gets to the original entry. > We should really re-add DWARF some day. Working on it :-) -- Josh ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [RFC PATCH v2 05/18] sched: add task flag for preempt IRQ tracking
On Fri, 29 Apr 2016, Andy Lutomirski wrote: > > NMI, MCE and interrupts aren't a problem because they have dedicated > > stacks, which are easy to detect. If the tasks' stack is on an > > exception stack or an irq stack, we consider it unreliable. > > Only on x86_64. Well, MCEs are more or less x86-specific as well. But otherwise good point, thanks Andy. So, how does stack layout generally look like in case when NMI is actually running on proper kernel stack? I thought it's guaranteed to contain pt_regs anyway in all cases. Is that not guaranteed to be the case? Thanks, -- Jiri Kosina SUSE Labs ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [RFC PATCH v2 05/18] sched: add task flag for preempt IRQ tracking
On Fri, Apr 29, 2016 at 2:25 PM, Josh Poimboeufwrote: > On Fri, Apr 29, 2016 at 01:32:53PM -0700, Andy Lutomirski wrote: >> On Fri, Apr 29, 2016 at 1:27 PM, Josh Poimboeuf wrote: >> > On Fri, Apr 29, 2016 at 01:19:23PM -0700, Andy Lutomirski wrote: >> >> On Fri, Apr 29, 2016 at 1:11 PM, Josh Poimboeuf >> >> wrote: >> >> > On Fri, Apr 29, 2016 at 11:06:53AM -0700, Andy Lutomirski wrote: >> >> >> On Thu, Apr 28, 2016 at 1:44 PM, Josh Poimboeuf >> >> >> wrote: >> >> >> > A preempted function might not have had a chance to save the frame >> >> >> > pointer to the stack yet, which can result in its caller getting >> >> >> > skipped >> >> >> > on a stack trace. >> >> >> > >> >> >> > Add a flag to indicate when the task has been preempted so that stack >> >> >> > dump code can determine whether the stack trace is reliable. >> >> >> >> >> >> I think I like this, but how do you handle the rather similar case in >> >> >> which a task goes to sleep because it's waiting on IO that happened in >> >> >> response to get_user, put_user, copy_from_user, etc? >> >> > >> >> > Hm, good question. I was thinking that page faults had a dedicated >> >> > stack, but now looking at the entry and traps code, that doesn't seem to >> >> > be the case. >> >> > >> >> > Anyway I think it shouldn't be a problem if we make sure that any kernel >> >> > function which might trigger a valid page fault (e.g., >> >> > copy_user_generic_string) do the proper frame pointer setup first. Then >> >> > the stack should still be reliable. >> >> > >> >> > In fact I might be able to teach objtool to enforce that: any function >> >> > which uses an exception table should create a stack frame. >> >> > >> >> > Or alternatively, maybe set some kind of flag for page faults, similar >> >> > to what I did with this patch. >> >> > >> >> >> >> How about doing it the other way around: teach the unwinder to detect >> >> when it hits a non-outermost entry (i.e. it lands in idtentry, etc) >> >> and use some reasonable heuristic as to whether it's okay to keep >> >> unwinding. You should be able to handle preemption like that, too -- >> >> the unwind process will end up in an IRQ frame. >> > >> > How exactly would the unwinder detect if a text address is in an >> > idtentry? Maybe put all the idt entries in a special ELF section? >> > >> >> Hmm. >> >> What actually happens when you unwind all the way into the entry code? >> Don't you end up in something that isn't in an ELF function? Can you >> detect that? > > For entry from user space (e.g., syscalls), it's easy to detect because > there's always a pt_regs at the bottom of the stack. So if the unwinder > reaches the stack address at (thread.sp0 - sizeof(pt_regs)), it knows > it's done. > > But for nested entry (e.g. in-kernel irqs/exceptions like preemption and > page faults which don't have dedicated stacks), where the pt_regs is > stored somewhere in the middle of the stack instead of the bottom, > there's no reliable way to detect that. > >> Ideally, the unwinder could actually detect that it's >> hit a pt_regs struct and report that. If used for stack dumps, it >> could display some indication of this and then continue its unwinding >> by decoding the pt_regs. If used for patching, it could take some >> other appropriate action. >> >> I would have no objection to annotating all the pt_regs-style entry >> code, whether by putting it in a separate section or by making a table >> of addresses. > > I think the easiest way to make it work would be to modify the idtentry > macro to put all the idt entries in a dedicated section. Then the > unwinder could easily detect any calls from that code. That would work. Would it make sense to do the same for the irq entries? I'd be glad to review a patch. It should be straightforward. > >> There are a couple of nasty cases if NMI or MCE is involved but, as of >> 4.6, outside of NMI, MCE, and vmalloc faults (ugh!), there should >> always be a complete pt_regs on the stack before interrupts get >> enabled for each entry. Of course, finding the thing may be >> nontrivial in case other things were pushed. > > NMI, MCE and interrupts aren't a problem because they have dedicated > stacks, which are easy to detect. If the tasks' stack is on an > exception stack or an irq stack, we consider it unreliable. Only on x86_64. > > And also, they don't sleep. The stack of any running task (other than > current) is automatically considered unreliable anyway, since they could > be modifying it while we're reading it. True. > >> I suppose we could try to rejigger the code so that rbp points to >> pt_regs or similar. > > I think we should avoid doing something like that because it would break > gdb and all the other unwinders who don't know about it. How so? Currently, rbp in the entry code is meaningless. I'm suggesting that, when we do, for example, 'call \do_sym' in idtentry,
Re: [RFC PATCH v2 05/18] sched: add task flag for preempt IRQ tracking
On Fri, Apr 29, 2016 at 01:32:53PM -0700, Andy Lutomirski wrote: > On Fri, Apr 29, 2016 at 1:27 PM, Josh Poimboeufwrote: > > On Fri, Apr 29, 2016 at 01:19:23PM -0700, Andy Lutomirski wrote: > >> On Fri, Apr 29, 2016 at 1:11 PM, Josh Poimboeuf > >> wrote: > >> > On Fri, Apr 29, 2016 at 11:06:53AM -0700, Andy Lutomirski wrote: > >> >> On Thu, Apr 28, 2016 at 1:44 PM, Josh Poimboeuf > >> >> wrote: > >> >> > A preempted function might not have had a chance to save the frame > >> >> > pointer to the stack yet, which can result in its caller getting > >> >> > skipped > >> >> > on a stack trace. > >> >> > > >> >> > Add a flag to indicate when the task has been preempted so that stack > >> >> > dump code can determine whether the stack trace is reliable. > >> >> > >> >> I think I like this, but how do you handle the rather similar case in > >> >> which a task goes to sleep because it's waiting on IO that happened in > >> >> response to get_user, put_user, copy_from_user, etc? > >> > > >> > Hm, good question. I was thinking that page faults had a dedicated > >> > stack, but now looking at the entry and traps code, that doesn't seem to > >> > be the case. > >> > > >> > Anyway I think it shouldn't be a problem if we make sure that any kernel > >> > function which might trigger a valid page fault (e.g., > >> > copy_user_generic_string) do the proper frame pointer setup first. Then > >> > the stack should still be reliable. > >> > > >> > In fact I might be able to teach objtool to enforce that: any function > >> > which uses an exception table should create a stack frame. > >> > > >> > Or alternatively, maybe set some kind of flag for page faults, similar > >> > to what I did with this patch. > >> > > >> > >> How about doing it the other way around: teach the unwinder to detect > >> when it hits a non-outermost entry (i.e. it lands in idtentry, etc) > >> and use some reasonable heuristic as to whether it's okay to keep > >> unwinding. You should be able to handle preemption like that, too -- > >> the unwind process will end up in an IRQ frame. > > > > How exactly would the unwinder detect if a text address is in an > > idtentry? Maybe put all the idt entries in a special ELF section? > > > > Hmm. > > What actually happens when you unwind all the way into the entry code? > Don't you end up in something that isn't in an ELF function? Can you > detect that? For entry from user space (e.g., syscalls), it's easy to detect because there's always a pt_regs at the bottom of the stack. So if the unwinder reaches the stack address at (thread.sp0 - sizeof(pt_regs)), it knows it's done. But for nested entry (e.g. in-kernel irqs/exceptions like preemption and page faults which don't have dedicated stacks), where the pt_regs is stored somewhere in the middle of the stack instead of the bottom, there's no reliable way to detect that. > Ideally, the unwinder could actually detect that it's > hit a pt_regs struct and report that. If used for stack dumps, it > could display some indication of this and then continue its unwinding > by decoding the pt_regs. If used for patching, it could take some > other appropriate action. > > I would have no objection to annotating all the pt_regs-style entry > code, whether by putting it in a separate section or by making a table > of addresses. I think the easiest way to make it work would be to modify the idtentry macro to put all the idt entries in a dedicated section. Then the unwinder could easily detect any calls from that code. > There are a couple of nasty cases if NMI or MCE is involved but, as of > 4.6, outside of NMI, MCE, and vmalloc faults (ugh!), there should > always be a complete pt_regs on the stack before interrupts get > enabled for each entry. Of course, finding the thing may be > nontrivial in case other things were pushed. NMI, MCE and interrupts aren't a problem because they have dedicated stacks, which are easy to detect. If the tasks' stack is on an exception stack or an irq stack, we consider it unreliable. And also, they don't sleep. The stack of any running task (other than current) is automatically considered unreliable anyway, since they could be modifying it while we're reading it. > I suppose we could try to rejigger the code so that rbp points to > pt_regs or similar. I think we should avoid doing something like that because it would break gdb and all the other unwinders who don't know about it. -- Josh ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [RFC PATCH v2 05/18] sched: add task flag for preempt IRQ tracking
On Fri, Apr 29, 2016 at 1:27 PM, Josh Poimboeufwrote: > On Fri, Apr 29, 2016 at 01:19:23PM -0700, Andy Lutomirski wrote: >> On Fri, Apr 29, 2016 at 1:11 PM, Josh Poimboeuf wrote: >> > On Fri, Apr 29, 2016 at 11:06:53AM -0700, Andy Lutomirski wrote: >> >> On Thu, Apr 28, 2016 at 1:44 PM, Josh Poimboeuf >> >> wrote: >> >> > A preempted function might not have had a chance to save the frame >> >> > pointer to the stack yet, which can result in its caller getting skipped >> >> > on a stack trace. >> >> > >> >> > Add a flag to indicate when the task has been preempted so that stack >> >> > dump code can determine whether the stack trace is reliable. >> >> >> >> I think I like this, but how do you handle the rather similar case in >> >> which a task goes to sleep because it's waiting on IO that happened in >> >> response to get_user, put_user, copy_from_user, etc? >> > >> > Hm, good question. I was thinking that page faults had a dedicated >> > stack, but now looking at the entry and traps code, that doesn't seem to >> > be the case. >> > >> > Anyway I think it shouldn't be a problem if we make sure that any kernel >> > function which might trigger a valid page fault (e.g., >> > copy_user_generic_string) do the proper frame pointer setup first. Then >> > the stack should still be reliable. >> > >> > In fact I might be able to teach objtool to enforce that: any function >> > which uses an exception table should create a stack frame. >> > >> > Or alternatively, maybe set some kind of flag for page faults, similar >> > to what I did with this patch. >> > >> >> How about doing it the other way around: teach the unwinder to detect >> when it hits a non-outermost entry (i.e. it lands in idtentry, etc) >> and use some reasonable heuristic as to whether it's okay to keep >> unwinding. You should be able to handle preemption like that, too -- >> the unwind process will end up in an IRQ frame. > > How exactly would the unwinder detect if a text address is in an > idtentry? Maybe put all the idt entries in a special ELF section? > Hmm. What actually happens when you unwind all the way into the entry code? Don't you end up in something that isn't in an ELF function? Can you detect that? Ideally, the unwinder could actually detect that it's hit a pt_regs struct and report that. If used for stack dumps, it could display some indication of this and then continue its unwinding by decoding the pt_regs. If used for patching, it could take some other appropriate action. I would have no objection to annotating all the pt_regs-style entry code, whether by putting it in a separate section or by making a table of addresses. There are a couple of nasty cases if NMI or MCE is involved but, as of 4.6, outside of NMI, MCE, and vmalloc faults (ugh!), there should always be a complete pt_regs on the stack before interrupts get enabled for each entry. Of course, finding the thing may be nontrivial in case other things were pushed. I suppose we could try to rejigger the code so that rbp points to pt_regs or similar. --Andy ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [RFC PATCH v2 05/18] sched: add task flag for preempt IRQ tracking
On Fri, Apr 29, 2016 at 01:19:23PM -0700, Andy Lutomirski wrote: > On Fri, Apr 29, 2016 at 1:11 PM, Josh Poimboeufwrote: > > On Fri, Apr 29, 2016 at 11:06:53AM -0700, Andy Lutomirski wrote: > >> On Thu, Apr 28, 2016 at 1:44 PM, Josh Poimboeuf > >> wrote: > >> > A preempted function might not have had a chance to save the frame > >> > pointer to the stack yet, which can result in its caller getting skipped > >> > on a stack trace. > >> > > >> > Add a flag to indicate when the task has been preempted so that stack > >> > dump code can determine whether the stack trace is reliable. > >> > >> I think I like this, but how do you handle the rather similar case in > >> which a task goes to sleep because it's waiting on IO that happened in > >> response to get_user, put_user, copy_from_user, etc? > > > > Hm, good question. I was thinking that page faults had a dedicated > > stack, but now looking at the entry and traps code, that doesn't seem to > > be the case. > > > > Anyway I think it shouldn't be a problem if we make sure that any kernel > > function which might trigger a valid page fault (e.g., > > copy_user_generic_string) do the proper frame pointer setup first. Then > > the stack should still be reliable. > > > > In fact I might be able to teach objtool to enforce that: any function > > which uses an exception table should create a stack frame. > > > > Or alternatively, maybe set some kind of flag for page faults, similar > > to what I did with this patch. > > > > How about doing it the other way around: teach the unwinder to detect > when it hits a non-outermost entry (i.e. it lands in idtentry, etc) > and use some reasonable heuristic as to whether it's okay to keep > unwinding. You should be able to handle preemption like that, too -- > the unwind process will end up in an IRQ frame. How exactly would the unwinder detect if a text address is in an idtentry? Maybe put all the idt entries in a special ELF section? -- Josh ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [RFC PATCH v2 05/18] sched: add task flag for preempt IRQ tracking
On Fri, Apr 29, 2016 at 1:11 PM, Josh Poimboeufwrote: > On Fri, Apr 29, 2016 at 11:06:53AM -0700, Andy Lutomirski wrote: >> On Thu, Apr 28, 2016 at 1:44 PM, Josh Poimboeuf wrote: >> > A preempted function might not have had a chance to save the frame >> > pointer to the stack yet, which can result in its caller getting skipped >> > on a stack trace. >> > >> > Add a flag to indicate when the task has been preempted so that stack >> > dump code can determine whether the stack trace is reliable. >> >> I think I like this, but how do you handle the rather similar case in >> which a task goes to sleep because it's waiting on IO that happened in >> response to get_user, put_user, copy_from_user, etc? > > Hm, good question. I was thinking that page faults had a dedicated > stack, but now looking at the entry and traps code, that doesn't seem to > be the case. > > Anyway I think it shouldn't be a problem if we make sure that any kernel > function which might trigger a valid page fault (e.g., > copy_user_generic_string) do the proper frame pointer setup first. Then > the stack should still be reliable. > > In fact I might be able to teach objtool to enforce that: any function > which uses an exception table should create a stack frame. > > Or alternatively, maybe set some kind of flag for page faults, similar > to what I did with this patch. > How about doing it the other way around: teach the unwinder to detect when it hits a non-outermost entry (i.e. it lands in idtentry, etc) and use some reasonable heuristic as to whether it's okay to keep unwinding. You should be able to handle preemption like that, too -- the unwind process will end up in an IRQ frame. --Andy ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [RFC PATCH v2 05/18] sched: add task flag for preempt IRQ tracking
On Fri, Apr 29, 2016 at 11:06:53AM -0700, Andy Lutomirski wrote: > On Thu, Apr 28, 2016 at 1:44 PM, Josh Poimboeufwrote: > > A preempted function might not have had a chance to save the frame > > pointer to the stack yet, which can result in its caller getting skipped > > on a stack trace. > > > > Add a flag to indicate when the task has been preempted so that stack > > dump code can determine whether the stack trace is reliable. > > I think I like this, but how do you handle the rather similar case in > which a task goes to sleep because it's waiting on IO that happened in > response to get_user, put_user, copy_from_user, etc? Hm, good question. I was thinking that page faults had a dedicated stack, but now looking at the entry and traps code, that doesn't seem to be the case. Anyway I think it shouldn't be a problem if we make sure that any kernel function which might trigger a valid page fault (e.g., copy_user_generic_string) do the proper frame pointer setup first. Then the stack should still be reliable. In fact I might be able to teach objtool to enforce that: any function which uses an exception table should create a stack frame. Or alternatively, maybe set some kind of flag for page faults, similar to what I did with this patch. -- Josh ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [RFC PATCH v2 05/18] sched: add task flag for preempt IRQ tracking
On Thu, Apr 28, 2016 at 1:44 PM, Josh Poimboeufwrote: > A preempted function might not have had a chance to save the frame > pointer to the stack yet, which can result in its caller getting skipped > on a stack trace. > > Add a flag to indicate when the task has been preempted so that stack > dump code can determine whether the stack trace is reliable. I think I like this, but how do you handle the rather similar case in which a task goes to sleep because it's waiting on IO that happened in response to get_user, put_user, copy_from_user, etc? --Andy ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
[RFC PATCH v2 05/18] sched: add task flag for preempt IRQ tracking
A preempted function might not have had a chance to save the frame pointer to the stack yet, which can result in its caller getting skipped on a stack trace. Add a flag to indicate when the task has been preempted so that stack dump code can determine whether the stack trace is reliable. Signed-off-by: Josh Poimboeuf--- include/linux/sched.h | 1 + kernel/fork.c | 2 +- kernel/sched/core.c | 4 3 files changed, 6 insertions(+), 1 deletion(-) diff --git a/include/linux/sched.h b/include/linux/sched.h index 3d31572..fb364a0 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -2137,6 +2137,7 @@ extern void thread_group_cputime_adjusted(struct task_struct *p, cputime_t *ut, #define PF_SWAPWRITE 0x0080 /* Allowed to write to swap */ #define PF_NO_SETAFFINITY 0x0400 /* Userland is not allowed to meddle with cpus_allowed */ #define PF_MCE_EARLY0x0800 /* Early kill for mce process policy */ +#define PF_PREEMPT_IRQ 0x1000 /* Thread is preempted by an irq */ #define PF_MUTEX_TESTER0x2000 /* Thread belongs to the rt mutex tester */ #define PF_FREEZER_SKIP0x4000 /* Freezer should not count it as freezable */ #define PF_SUSPEND_TASK 0x8000 /* this thread called freeze_processes and should not be frozen */ diff --git a/kernel/fork.c b/kernel/fork.c index b73a539..d2fe04a 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -1373,7 +1373,7 @@ static struct task_struct *copy_process(unsigned long clone_flags, goto bad_fork_cleanup_count; delayacct_tsk_init(p); /* Must remain after dup_task_struct() */ - p->flags &= ~(PF_SUPERPRIV | PF_WQ_WORKER); + p->flags &= ~(PF_SUPERPRIV | PF_WQ_WORKER | PF_PREEMPT_IRQ); p->flags |= PF_FORKNOEXEC; INIT_LIST_HEAD(>children); INIT_LIST_HEAD(>sibling); diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 9d84d60..7594267 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -3422,6 +3422,8 @@ asmlinkage __visible void __sched preempt_schedule_irq(void) prev_state = exception_enter(); + current->flags |= PF_PREEMPT_IRQ; + do { preempt_disable(); local_irq_enable(); @@ -3430,6 +3432,8 @@ asmlinkage __visible void __sched preempt_schedule_irq(void) sched_preempt_enable_no_resched(); } while (need_resched()); + current->flags &= ~PF_PREEMPT_IRQ; + exception_exit(prev_state); } -- 2.4.11 ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev