Re: [RFC PATCH v2 05/18] sched: add task flag for preempt IRQ tracking

2016-06-23 Thread Andy Lutomirski
On Thu, Jun 23, 2016 at 1:40 PM, Josh Poimboeuf  wrote:
> On Thu, Jun 23, 2016 at 01:31:32PM -0500, Josh Poimboeuf wrote:
>> On Thu, Jun 23, 2016 at 09:35:29AM -0700, Andy Lutomirski wrote:
>> > > So which is the least-bad option?  To summarize:
>> > >
>> > >   1) task flag(s) for preemption and page faults
>> > >
>> > >   2) turn pt_regs into a stack frame
>> > >
>> > >   3) annotate all calls from entry code in a table
>> > >
>> > >   4) encode rbp on entry
>> > >
>> > > They all have their issues, though I'm partial to #2.
>> > >
>> > > Any more hare-brained ideas? :-)
>> >
>> > I'll try to take a closer look at #2 and see just how much I dislike
>> > all the stack frame munging.
>>
>> Ok.
>>
>> > Also, in principle, it's only the
>> > sleeping calls and the calls that make it into real (non-entry) kernel
>> > code that really want to be unwindable through this mechanism.
>>
>> Yeah, that's true.  We could modify options 2 or 3 to be less absolute.
>> Though I think that makes them more prone to future breakage.
>>
>> > FWIW, I don't care that much about preserving gdb's partial ability to
>> > unwind through pt_regs, especially because gdb really ought to be able
>> > to use DWARF, too.
>>
>> Hm, that's a good point.  I really don't know if there are any other
>> external tools out there that would care.  Maybe we could try option 4
>> and then see if anybody complains.
>
> I'm starting to think hare-brained option 4 is the way to go.  Any
> external tooling should really be relying on DWARF anyway.
>
> Here's a sneak preview.  If this general approach looks ok to you, I'll
> go ahead and port all the in-tree unwinders and post a proper patch.
>
> Instead of using xor -1 on the pt_regs pointer, I just cleared the
> high-order bit.  That makes the unwinding experience much more pleasant
> for a human stack walker, and also ensures that anybody trying to
> dereference it gets slapped with an oops, at least in the 48-bit address
> space era.
>
> diff --git a/arch/x86/entry/calling.h b/arch/x86/entry/calling.h
> index 9a9e588..bf397426 100644
> --- a/arch/x86/entry/calling.h
> +++ b/arch/x86/entry/calling.h
> @@ -201,6 +201,23 @@ For 32-bit we have the following conventions - kernel is 
> built with
> .byte 0xf1
> .endm
>
> +   /*
> +* This is a sneaky trick to help the unwinder find pt_regs on the
> +* stack.  The frame pointer is replaced with an encoded pointer to
> +* pt_regs.  The encoding is just a clearing of the highest-order bit,
> +* which makes it an invalid address and is also a signal to the
> +* unwinder that it's a pt_regs pointer in disguise.
> +*
> +* NOTE: This must be called *after* SAVE_EXTRA_REGS because it
> +* corrupts rbp.
> +*/
> +   .macro ENCODE_FRAME_POINTER ptregs_offset=0
> +#ifdef CONFIG_FRAME_POINTER
> +   leaq \ptregs_offset(%rsp), %rbp
> +   btr $63, %rbp
> +#endif
> +   .endm
> +

Maybe optimize slightly:

.ifeq \ptregs_offset
mov %rsp, %rbp
.else
leaq \ptregs_offset(%rsp), %rbp
.endif
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [RFC PATCH v2 05/18] sched: add task flag for preempt IRQ tracking

2016-06-23 Thread Josh Poimboeuf
On Thu, Jun 23, 2016 at 01:31:32PM -0500, Josh Poimboeuf wrote:
> On Thu, Jun 23, 2016 at 09:35:29AM -0700, Andy Lutomirski wrote:
> > > So which is the least-bad option?  To summarize:
> > >
> > >   1) task flag(s) for preemption and page faults
> > >
> > >   2) turn pt_regs into a stack frame
> > >
> > >   3) annotate all calls from entry code in a table
> > >
> > >   4) encode rbp on entry
> > >
> > > They all have their issues, though I'm partial to #2.
> > >
> > > Any more hare-brained ideas? :-)
> > 
> > I'll try to take a closer look at #2 and see just how much I dislike
> > all the stack frame munging.
> 
> Ok.
> 
> > Also, in principle, it's only the
> > sleeping calls and the calls that make it into real (non-entry) kernel
> > code that really want to be unwindable through this mechanism.
> 
> Yeah, that's true.  We could modify options 2 or 3 to be less absolute.
> Though I think that makes them more prone to future breakage.
> 
> > FWIW, I don't care that much about preserving gdb's partial ability to
> > unwind through pt_regs, especially because gdb really ought to be able
> > to use DWARF, too.
> 
> Hm, that's a good point.  I really don't know if there are any other
> external tools out there that would care.  Maybe we could try option 4
> and then see if anybody complains.

I'm starting to think hare-brained option 4 is the way to go.  Any
external tooling should really be relying on DWARF anyway.

Here's a sneak preview.  If this general approach looks ok to you, I'll
go ahead and port all the in-tree unwinders and post a proper patch.

Instead of using xor -1 on the pt_regs pointer, I just cleared the
high-order bit.  That makes the unwinding experience much more pleasant
for a human stack walker, and also ensures that anybody trying to
dereference it gets slapped with an oops, at least in the 48-bit address
space era.

diff --git a/arch/x86/entry/calling.h b/arch/x86/entry/calling.h
index 9a9e588..bf397426 100644
--- a/arch/x86/entry/calling.h
+++ b/arch/x86/entry/calling.h
@@ -201,6 +201,23 @@ For 32-bit we have the following conventions - kernel is 
built with
.byte 0xf1
.endm
 
+   /*
+* This is a sneaky trick to help the unwinder find pt_regs on the
+* stack.  The frame pointer is replaced with an encoded pointer to
+* pt_regs.  The encoding is just a clearing of the highest-order bit,
+* which makes it an invalid address and is also a signal to the
+* unwinder that it's a pt_regs pointer in disguise.
+*
+* NOTE: This must be called *after* SAVE_EXTRA_REGS because it
+* corrupts rbp.
+*/
+   .macro ENCODE_FRAME_POINTER ptregs_offset=0
+#ifdef CONFIG_FRAME_POINTER
+   leaq \ptregs_offset(%rsp), %rbp
+   btr $63, %rbp
+#endif
+   .endm
+
 #endif /* CONFIG_X86_64 */
 
 /*
diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 9ee0da1..eb79652 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -431,6 +431,7 @@ END(irq_entries_start)
ALLOC_PT_GPREGS_ON_STACK
SAVE_C_REGS
SAVE_EXTRA_REGS
+   ENCODE_FRAME_POINTER
 
testb   $3, CS(%rsp)
jz  1f
@@ -893,6 +894,7 @@ ENTRY(xen_failsafe_callback)
ALLOC_PT_GPREGS_ON_STACK
SAVE_C_REGS
SAVE_EXTRA_REGS
+   ENCODE_FRAME_POINTER
jmp error_exit
 END(xen_failsafe_callback)
 
@@ -936,6 +938,7 @@ ENTRY(paranoid_entry)
cld
SAVE_C_REGS 8
SAVE_EXTRA_REGS 8
+   ENCODE_FRAME_POINTER 8
movl$1, %ebx
movl$MSR_GS_BASE, %ecx
rdmsr
@@ -983,6 +986,7 @@ ENTRY(error_entry)
cld
SAVE_C_REGS 8
SAVE_EXTRA_REGS 8
+   ENCODE_FRAME_POINTER 8
xorl%ebx, %ebx
testb   $3, CS+8(%rsp)
jz  .Lerror_kernelspace
@@ -1165,6 +1169,7 @@ ENTRY(nmi)
pushq   %r13/* pt_regs->r13 */
pushq   %r14/* pt_regs->r14 */
pushq   %r15/* pt_regs->r15 */
+   ENCODE_FRAME_POINTER
 
/*
 * At this point we no longer need to worry about stack damage
@@ -1182,7 +1187,7 @@ ENTRY(nmi)
 * do_nmi doesn't modify pt_regs.
 */
SWAPGS
-   jmp restore_c_regs_and_iret
+   jmp restore_regs_and_iret
 
 .Lnmi_from_kernel:
/*
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [RFC PATCH v2 05/18] sched: add task flag for preempt IRQ tracking

2016-06-23 Thread Josh Poimboeuf
On Thu, Jun 23, 2016 at 09:35:29AM -0700, Andy Lutomirski wrote:
> > So which is the least-bad option?  To summarize:
> >
> >   1) task flag(s) for preemption and page faults
> >
> >   2) turn pt_regs into a stack frame
> >
> >   3) annotate all calls from entry code in a table
> >
> >   4) encode rbp on entry
> >
> > They all have their issues, though I'm partial to #2.
> >
> > Any more hare-brained ideas? :-)
> 
> I'll try to take a closer look at #2 and see just how much I dislike
> all the stack frame munging.

Ok.

> Also, in principle, it's only the
> sleeping calls and the calls that make it into real (non-entry) kernel
> code that really want to be unwindable through this mechanism.

Yeah, that's true.  We could modify options 2 or 3 to be less absolute.
Though I think that makes them more prone to future breakage.

> FWIW, I don't care that much about preserving gdb's partial ability to
> unwind through pt_regs, especially because gdb really ought to be able
> to use DWARF, too.

Hm, that's a good point.  I really don't know if there are any other
external tools out there that would care.  Maybe we could try option 4
and then see if anybody complains.

-- 
Josh
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [RFC PATCH v2 05/18] sched: add task flag for preempt IRQ tracking

2016-06-23 Thread Andy Lutomirski
On Thu, Jun 23, 2016 at 9:19 AM, Josh Poimboeuf  wrote:
> On Wed, Jun 22, 2016 at 12:17:25PM -0700, Andy Lutomirski wrote:
>> On Wed, Jun 22, 2016 at 11:40 AM, Josh Poimboeuf  wrote:
>> > On Wed, Jun 22, 2016 at 11:26:21AM -0700, Andy Lutomirski wrote:
>> >> >
>> >> > So are you suggesting something like:
>> >> >
>> >> > .macro ENTRY_CALL func pt_regs_offset=0
>> >> > call \func
>> >> > 1:  .pushsection .entry_calls, "a"
>> >> > .long 1b - .
>> >> > .long \pt_regs_offset
>> >> > .popsection
>> >> > .endm
>> >> >
>> >> > and then change every call in the entry code to ENTRY_CALL?
>> >>
>> >> Yes, exactly, modulo whether the section name is good.  hpa is
>> >> probably the authority on that.
>> >
>> > Well, as you probably know, I don't really like peppering ENTRY_CALL
>> > everywhere. :-/
>>
>> Me neither.  But at least it's less constraining on the
>> already-fairly-hairy code.
>>
>> >
>> > Also I wonder how we could annotate the hypercalls, for example
>> > DISABLE_INTERRUPTS actually wraps the call in a push/pop pair.
>>
>> Oh, yuck.  But forcing all the DISABLE_INTERRUPTS and
>> ENABLE_INTERRUPTS invocations to be in frame pointer regions isn't so
>> great either.
>
> Hm, I don't follow this statement.  Why not?  The more frame pointer
> coverage, the better, especially if it doesn't add any additional
> overhead.

Less flexibility, and it's IMO annoying to make the Xen case have
extra constraints.  It also makes it very awkward or impossible to
take advantage of the sti interrupt window, although admittedly that
doesn't work on Xen either, so maybe that's moot.

>
>> DWARF solves this problem completely and IMO fairly cleanly.  Maybe we
>> should add your task flag and then consider removing it again when
>> DWARF happens.
>
> I tend to doubt we'd be able to remove it later.  As you said before,
> many embedded platforms probably won't be able to switch to DWARF, and
> they'll want to do live patching too.
>
> So which is the least-bad option?  To summarize:
>
>   1) task flag(s) for preemption and page faults
>
>   2) turn pt_regs into a stack frame
>
>   3) annotate all calls from entry code in a table
>
>   4) encode rbp on entry
>
> They all have their issues, though I'm partial to #2.
>
> Any more hare-brained ideas? :-)

I'll try to take a closer look at #2 and see just how much I dislike
all the stack frame munging.  Also, in principle, it's only the
sleeping calls and the calls that make it into real (non-entry) kernel
code that really want to be unwindable through this mechanism.

FWIW, I don't care that much about preserving gdb's partial ability to
unwind through pt_regs, especially because gdb really ought to be able
to use DWARF, too.

--Andy
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [RFC PATCH v2 05/18] sched: add task flag for preempt IRQ tracking

2016-06-23 Thread Josh Poimboeuf
On Wed, Jun 22, 2016 at 12:17:25PM -0700, Andy Lutomirski wrote:
> On Wed, Jun 22, 2016 at 11:40 AM, Josh Poimboeuf  wrote:
> > On Wed, Jun 22, 2016 at 11:26:21AM -0700, Andy Lutomirski wrote:
> >> >
> >> > So are you suggesting something like:
> >> >
> >> > .macro ENTRY_CALL func pt_regs_offset=0
> >> > call \func
> >> > 1:  .pushsection .entry_calls, "a"
> >> > .long 1b - .
> >> > .long \pt_regs_offset
> >> > .popsection
> >> > .endm
> >> >
> >> > and then change every call in the entry code to ENTRY_CALL?
> >>
> >> Yes, exactly, modulo whether the section name is good.  hpa is
> >> probably the authority on that.
> >
> > Well, as you probably know, I don't really like peppering ENTRY_CALL
> > everywhere. :-/
> 
> Me neither.  But at least it's less constraining on the
> already-fairly-hairy code.
> 
> >
> > Also I wonder how we could annotate the hypercalls, for example
> > DISABLE_INTERRUPTS actually wraps the call in a push/pop pair.
> 
> Oh, yuck.  But forcing all the DISABLE_INTERRUPTS and
> ENABLE_INTERRUPTS invocations to be in frame pointer regions isn't so
> great either.

Hm, I don't follow this statement.  Why not?  The more frame pointer
coverage, the better, especially if it doesn't add any additional
overhead.

> DWARF solves this problem completely and IMO fairly cleanly.  Maybe we
> should add your task flag and then consider removing it again when
> DWARF happens.

I tend to doubt we'd be able to remove it later.  As you said before,
many embedded platforms probably won't be able to switch to DWARF, and
they'll want to do live patching too.

So which is the least-bad option?  To summarize:

  1) task flag(s) for preemption and page faults

  2) turn pt_regs into a stack frame

  3) annotate all calls from entry code in a table

  4) encode rbp on entry

They all have their issues, though I'm partial to #2.

Any more hare-brained ideas? :-)

-- 
Josh
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [RFC PATCH v2 05/18] sched: add task flag for preempt IRQ tracking

2016-06-23 Thread Josh Poimboeuf
On Wed, Jun 22, 2016 at 05:09:11PM -0700, Andy Lutomirski wrote:
> On Wed, Jun 22, 2016 at 9:30 AM, Josh Poimboeuf  wrote:
> > Andy,
> >
> > So I got a chance to look at this some more.  I'm thinking that to make
> > this feature more consistently useful, we shouldn't only annotate
> > pt_regs frames for calls to handlers; other calls should be annotated as
> > well: preempt_schedule_irq, CALL_enter_from_user_mode,
> > prepare_exit_to_usermode, SWAPGS, TRACE_IRQS_OFF, DISABLE_INTERRUPTS,
> > etc.  That way, the unwinder will always be able to find pt_regs from an
> > interrupt/exception, even if starting from one of these other calls.
> >
> > But then, things get ugly.  You have to either setup and tear down the
> > frame for every possible call, or do a higher-level setup/teardown
> > across multiple calls, which invalidates several assumptions in the
> > entry code about the location of pt_regs on the stack.
> >
> 
> Here's yet another harebrained idea.  Maybe it works better than my
> previous harebrained ideas :)
> 
> Your patch is already creating a somewhat nonstandard stack frame:
> 
> +   movq%rbp,   0*8(%rsp)
> +   movq$entry_frame_ret,   1*8(%rsp)
> +   movq%rsp, %rbp
> 
> It's kind of a normal stack frame, but rbp points at something odd,
> and to unwind it fully correctly, the unwinder needs to know about it.
> 
> What if we made it even more special, along the lines of:
> 
> leaq offset_to_ptregs(%rsp), %rbp
> xorq $-1, %rbp
> 
> IOW, don't write anything to the stack at all, and just put a special
> value into RBP that says "the next frame is pt_regs at such-and-such
> address".  Do this once on entry and make sure to restore RBP (from
> pt_regs) on exit.  Now the unwinder can notice that RBP has the high
> bits clear *and* that the negation of it points to the stack, and it
> can figure out what's going on.
> 
> What do you think?  Am I nuts or could this work?
> 
> It had better not have much risk of breaking things worse than they
> currently are, given that current kernel allow user code to stick any
> value it likes into the very last element of the RBP chain.

I think it's a good idea, and it could work... BUT it would break
external unwinders like gdb for the in-kernel entry case.

For interrupts and exceptions in kernel mode, rbp *is* valid.  Sure, it
doesn't tell you the interrupted function, but it does tell you its
caller.  A generic frame pointer unwinder skips the interrupted
function, but at least it keeps going.  If we encoded rbp on entry, that
would break.

-- 
Josh
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [RFC PATCH v2 05/18] sched: add task flag for preempt IRQ tracking

2016-06-22 Thread Andy Lutomirski
On Wed, Jun 22, 2016 at 9:30 AM, Josh Poimboeuf  wrote:
> Andy,
>
> So I got a chance to look at this some more.  I'm thinking that to make
> this feature more consistently useful, we shouldn't only annotate
> pt_regs frames for calls to handlers; other calls should be annotated as
> well: preempt_schedule_irq, CALL_enter_from_user_mode,
> prepare_exit_to_usermode, SWAPGS, TRACE_IRQS_OFF, DISABLE_INTERRUPTS,
> etc.  That way, the unwinder will always be able to find pt_regs from an
> interrupt/exception, even if starting from one of these other calls.
>
> But then, things get ugly.  You have to either setup and tear down the
> frame for every possible call, or do a higher-level setup/teardown
> across multiple calls, which invalidates several assumptions in the
> entry code about the location of pt_regs on the stack.
>

Here's yet another harebrained idea.  Maybe it works better than my
previous harebrained ideas :)

Your patch is already creating a somewhat nonstandard stack frame:

+   movq%rbp,   0*8(%rsp)
+   movq$entry_frame_ret,   1*8(%rsp)
+   movq%rsp, %rbp

It's kind of a normal stack frame, but rbp points at something odd,
and to unwind it fully correctly, the unwinder needs to know about it.

What if we made it even more special, along the lines of:

leaq offset_to_ptregs(%rsp), %rbp
xorq $-1, %rbp

IOW, don't write anything to the stack at all, and just put a special
value into RBP that says "the next frame is pt_regs at such-and-such
address".  Do this once on entry and make sure to restore RBP (from
pt_regs) on exit.  Now the unwinder can notice that RBP has the high
bits clear *and* that the negation of it points to the stack, and it
can figure out what's going on.

What do you think?  Am I nuts or could this work?

It had better not have much risk of breaking things worse than they
currently are, given that current kernel allow user code to stick any
value it likes into the very last element of the RBP chain.

--Andy
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [RFC PATCH v2 05/18] sched: add task flag for preempt IRQ tracking

2016-06-22 Thread Andy Lutomirski
On Wed, Jun 22, 2016 at 11:40 AM, Josh Poimboeuf  wrote:
> On Wed, Jun 22, 2016 at 11:26:21AM -0700, Andy Lutomirski wrote:
>> >
>> > So are you suggesting something like:
>> >
>> > .macro ENTRY_CALL func pt_regs_offset=0
>> > call \func
>> > 1:  .pushsection .entry_calls, "a"
>> > .long 1b - .
>> > .long \pt_regs_offset
>> > .popsection
>> > .endm
>> >
>> > and then change every call in the entry code to ENTRY_CALL?
>>
>> Yes, exactly, modulo whether the section name is good.  hpa is
>> probably the authority on that.
>
> Well, as you probably know, I don't really like peppering ENTRY_CALL
> everywhere. :-/

Me neither.  But at least it's less constraining on the
already-fairly-hairy code.

>
> Also I wonder how we could annotate the hypercalls, for example
> DISABLE_INTERRUPTS actually wraps the call in a push/pop pair.

Oh, yuck.  But forcing all the DISABLE_INTERRUPTS and
ENABLE_INTERRUPTS invocations to be in frame pointer regions isn't so
great either.

DWARF solves this problem completely and IMO fairly cleanly.  Maybe we
should add your task flag and then consider removing it again when
DWARF happens.
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [RFC PATCH v2 05/18] sched: add task flag for preempt IRQ tracking

2016-06-22 Thread Josh Poimboeuf
On Wed, Jun 22, 2016 at 11:26:21AM -0700, Andy Lutomirski wrote:
> On Wed, Jun 22, 2016 at 11:22 AM, Josh Poimboeuf  wrote:
> > On Wed, Jun 22, 2016 at 10:59:23AM -0700, Andy Lutomirski wrote:
> >> > So I got a chance to look at this some more.  I'm thinking that to make
> >> > this feature more consistently useful, we shouldn't only annotate
> >> > pt_regs frames for calls to handlers; other calls should be annotated as
> >> > well: preempt_schedule_irq, CALL_enter_from_user_mode,
> >> > prepare_exit_to_usermode, SWAPGS, TRACE_IRQS_OFF, DISABLE_INTERRUPTS,
> >> > etc.  That way, the unwinder will always be able to find pt_regs from an
> >> > interrupt/exception, even if starting from one of these other calls.
> >> >
> >> > But then, things get ugly.  You have to either setup and tear down the
> >> > frame for every possible call, or do a higher-level setup/teardown
> >> > across multiple calls, which invalidates several assumptions in the
> >> > entry code about the location of pt_regs on the stack.
> >> >
> >> > Also problematic is that several of the macros (like TRACE_IRQS_IRETQ)
> >> > make assumptions about the location of pt_regs.  And they're used by
> >> > both syscall and interrupt code.  So if we didn't create a frame pointer
> >> > header for syscalls, we'd basically need two versions of the macros: one
> >> > for irqs/exceptions and one for syscalls.
> >> >
> >> > So I think the cleanest way to handle this is to always allocate two
> >> > extra registers on the stack in ALLOC_PT_GPREGS_ON_STACK.  Then all
> >> > entry code can assume that pt_regs is at a constant location, and all
> >> > the above problems go away.  Another benefit is that we'd only need two
> >> > saves instead of three -- the pointer to pt_regs is no longer needed
> >> > since pt_regs is always immediately after the frame header.
> >> >
> >> > I worked up a patch to implement this -- see below.  It writes the frame
> >> > pointer in all entry paths, including syscalls.  This helps keep the
> >> > code simple.
> >> >
> >> > The downside is a small performance penalty: with getppid()-in-a-loop on
> >> > my laptop, the average syscall went from 52ns to 53ns, which is about a
> >> > 2% slowdown.  But I doubt it would be measurable in a real-world
> >> > workload.
> >> >
> >> > It looks like about half the slowdown is due to the extra stack
> >> > allocation (which presumably adds a little d-cache pressure on the stack
> >> > memory) and the other half is due to the stack writes.
> >> >
> >> > I could remove the writes from the syscall path but it would only save
> >> > about half a ns, and it would make the code less robust.  Plus it's nice
> >> > to have the consistency of having *all* pt_regs frames annotated.
> >>
> >> This is a bit messy, and I'm not really sure that the entry code
> >> should be have to operate under constraints like this.  Also,
> >> convincing myself this works for NMI sounds unpleasant.
> >>
> >> Maybe we should go back to my idea of just listing the call sites in a 
> >> table.
> >
> > So are you suggesting something like:
> >
> > .macro ENTRY_CALL func pt_regs_offset=0
> > call \func
> > 1:  .pushsection .entry_calls, "a"
> > .long 1b - .
> > .long \pt_regs_offset
> > .popsection
> > .endm
> >
> > and then change every call in the entry code to ENTRY_CALL?
> 
> Yes, exactly, modulo whether the section name is good.  hpa is
> probably the authority on that.

Well, as you probably know, I don't really like peppering ENTRY_CALL
everywhere. :-/

Also I wonder how we could annotate the hypercalls, for example
DISABLE_INTERRUPTS actually wraps the call in a push/pop pair.

-- 
Josh
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [RFC PATCH v2 05/18] sched: add task flag for preempt IRQ tracking

2016-06-22 Thread Andy Lutomirski
On Wed, Jun 22, 2016 at 11:22 AM, Josh Poimboeuf  wrote:
> On Wed, Jun 22, 2016 at 10:59:23AM -0700, Andy Lutomirski wrote:
>> > So I got a chance to look at this some more.  I'm thinking that to make
>> > this feature more consistently useful, we shouldn't only annotate
>> > pt_regs frames for calls to handlers; other calls should be annotated as
>> > well: preempt_schedule_irq, CALL_enter_from_user_mode,
>> > prepare_exit_to_usermode, SWAPGS, TRACE_IRQS_OFF, DISABLE_INTERRUPTS,
>> > etc.  That way, the unwinder will always be able to find pt_regs from an
>> > interrupt/exception, even if starting from one of these other calls.
>> >
>> > But then, things get ugly.  You have to either setup and tear down the
>> > frame for every possible call, or do a higher-level setup/teardown
>> > across multiple calls, which invalidates several assumptions in the
>> > entry code about the location of pt_regs on the stack.
>> >
>> > Also problematic is that several of the macros (like TRACE_IRQS_IRETQ)
>> > make assumptions about the location of pt_regs.  And they're used by
>> > both syscall and interrupt code.  So if we didn't create a frame pointer
>> > header for syscalls, we'd basically need two versions of the macros: one
>> > for irqs/exceptions and one for syscalls.
>> >
>> > So I think the cleanest way to handle this is to always allocate two
>> > extra registers on the stack in ALLOC_PT_GPREGS_ON_STACK.  Then all
>> > entry code can assume that pt_regs is at a constant location, and all
>> > the above problems go away.  Another benefit is that we'd only need two
>> > saves instead of three -- the pointer to pt_regs is no longer needed
>> > since pt_regs is always immediately after the frame header.
>> >
>> > I worked up a patch to implement this -- see below.  It writes the frame
>> > pointer in all entry paths, including syscalls.  This helps keep the
>> > code simple.
>> >
>> > The downside is a small performance penalty: with getppid()-in-a-loop on
>> > my laptop, the average syscall went from 52ns to 53ns, which is about a
>> > 2% slowdown.  But I doubt it would be measurable in a real-world
>> > workload.
>> >
>> > It looks like about half the slowdown is due to the extra stack
>> > allocation (which presumably adds a little d-cache pressure on the stack
>> > memory) and the other half is due to the stack writes.
>> >
>> > I could remove the writes from the syscall path but it would only save
>> > about half a ns, and it would make the code less robust.  Plus it's nice
>> > to have the consistency of having *all* pt_regs frames annotated.
>>
>> This is a bit messy, and I'm not really sure that the entry code
>> should be have to operate under constraints like this.  Also,
>> convincing myself this works for NMI sounds unpleasant.
>>
>> Maybe we should go back to my idea of just listing the call sites in a table.
>
> So are you suggesting something like:
>
> .macro ENTRY_CALL func pt_regs_offset=0
> call \func
> 1:  .pushsection .entry_calls, "a"
> .long 1b - .
> .long \pt_regs_offset
> .popsection
> .endm
>
> and then change every call in the entry code to ENTRY_CALL?

Yes, exactly, modulo whether the section name is good.  hpa is
probably the authority on that.

--Andy
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [RFC PATCH v2 05/18] sched: add task flag for preempt IRQ tracking

2016-06-22 Thread Josh Poimboeuf
On Wed, Jun 22, 2016 at 10:59:23AM -0700, Andy Lutomirski wrote:
> > So I got a chance to look at this some more.  I'm thinking that to make
> > this feature more consistently useful, we shouldn't only annotate
> > pt_regs frames for calls to handlers; other calls should be annotated as
> > well: preempt_schedule_irq, CALL_enter_from_user_mode,
> > prepare_exit_to_usermode, SWAPGS, TRACE_IRQS_OFF, DISABLE_INTERRUPTS,
> > etc.  That way, the unwinder will always be able to find pt_regs from an
> > interrupt/exception, even if starting from one of these other calls.
> >
> > But then, things get ugly.  You have to either setup and tear down the
> > frame for every possible call, or do a higher-level setup/teardown
> > across multiple calls, which invalidates several assumptions in the
> > entry code about the location of pt_regs on the stack.
> >
> > Also problematic is that several of the macros (like TRACE_IRQS_IRETQ)
> > make assumptions about the location of pt_regs.  And they're used by
> > both syscall and interrupt code.  So if we didn't create a frame pointer
> > header for syscalls, we'd basically need two versions of the macros: one
> > for irqs/exceptions and one for syscalls.
> >
> > So I think the cleanest way to handle this is to always allocate two
> > extra registers on the stack in ALLOC_PT_GPREGS_ON_STACK.  Then all
> > entry code can assume that pt_regs is at a constant location, and all
> > the above problems go away.  Another benefit is that we'd only need two
> > saves instead of three -- the pointer to pt_regs is no longer needed
> > since pt_regs is always immediately after the frame header.
> >
> > I worked up a patch to implement this -- see below.  It writes the frame
> > pointer in all entry paths, including syscalls.  This helps keep the
> > code simple.
> >
> > The downside is a small performance penalty: with getppid()-in-a-loop on
> > my laptop, the average syscall went from 52ns to 53ns, which is about a
> > 2% slowdown.  But I doubt it would be measurable in a real-world
> > workload.
> >
> > It looks like about half the slowdown is due to the extra stack
> > allocation (which presumably adds a little d-cache pressure on the stack
> > memory) and the other half is due to the stack writes.
> >
> > I could remove the writes from the syscall path but it would only save
> > about half a ns, and it would make the code less robust.  Plus it's nice
> > to have the consistency of having *all* pt_regs frames annotated.
> 
> This is a bit messy, and I'm not really sure that the entry code
> should be have to operate under constraints like this.  Also,
> convincing myself this works for NMI sounds unpleasant.
> 
> Maybe we should go back to my idea of just listing the call sites in a table.

So are you suggesting something like:

.macro ENTRY_CALL func pt_regs_offset=0
call \func
1:  .pushsection .entry_calls, "a"
.long 1b - .
.long \pt_regs_offset
.popsection
.endm

and then change every call in the entry code to ENTRY_CALL?

-- 
Josh
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [RFC PATCH v2 05/18] sched: add task flag for preempt IRQ tracking

2016-06-22 Thread Andy Lutomirski
On Wed, Jun 22, 2016 at 9:30 AM, Josh Poimboeuf  wrote:
> On Mon, May 23, 2016 at 08:52:12PM -0700, Andy Lutomirski wrote:
>> On May 23, 2016 7:28 PM, "Josh Poimboeuf"  wrote:
>> > > Maybe I'm coming around to liking this idea.
>> >
>> > Ok, good :-)
>> >
>> > > In an ideal world (DWARF support, high-quality unwinder, nice pretty
>> > > printer, etc), unwinding across a kernel exception would look like:
>> > >
>> > >  - some_func
>> > >  - some_other_func
>> > >  - do_page_fault
>> > >  - page_fault
>> > >
>> > > After page_fault, the next unwind step takes us to the faulting RIP
>> > > (faulting_func) and reports that all GPRs are known.  It should
>> > > probably learn this fact from DWARF if DWARF is available, instead of
>> > > directly decoding pt_regs, due to a few funny cases in which pt_regs
>> > > may be incomplete.  A nice pretty printer could now print all the
>> > > regs.
>> > >
>> > >  - faulting_func
>> > >  - etc.
>> > >
>> > > For this to work, we don't actually need the unwinder to explicitly
>> > > know where pt_regs is.
>> >
>> > That's true (but only for DWARF).
>> >
>> > > Food for thought, though: if user code does SYSENTER with TF set,
>> > > you'll end up with partial pt_regs.  There's nothing the kernel can do
>> > > about it.  DWARF will handle it without any fanfare, but I don't know
>> > > if it's going to cause trouble for you pre-DWARF.
>> >
>> > In this case it should see the stack pointer is past the pt_regs offset,
>> > so it would just report it as an empty stack.
>>
>> OK
>>
>> >
>> > > I'm also not sure it makes sense to apply this before the unwinder
>> > > that can consume it is ready.  Maybe, if it would be consistent with
>> > > your plans, it would make sense to rewrite the unwinder first, then
>> > > apply this and teach live patching to use the new unwinder, and *then*
>> > > add DWARF support?
>> >
>> > For the purposes of livepatch, the reliable unwinder needs to detect
>> > whether an interrupt/exception pt_regs frame exists on a sleeping task
>> > (or current).  This patch would allow us to do that.
>> >
>> > So my preferred order of doing things would be:
>> >
>> > 1) Brian Gerst's switch_to() cleanup and any related unwinder fixes
>> > 2) this patch for annotating pt_regs stack frames
>> > 3) reliable unwinder, similar to what I already posted, except it relies
>> >on this patch instead of PF_PREEMPT_IRQ, and knows how to deal with
>> >the new inactive task frame format of #1
>> > 4) livepatch consistency model which uses the reliable unwinder
>> > 5) rewrite unwinder, and port all users to the new interface
>> > 6) add DWARF unwinder
>> >
>> > 1-4 are pretty much already written, whereas 5 and 6 will take
>> > considerably more work.
>>
>> Fair enough.
>>
>> >
>> > > > +   /*
>> > > > +* Create a stack frame for the saved pt_regs.  This allows 
>> > > > frame
>> > > > +* pointer based unwinders to find pt_regs on the stack.
>> > > > +*/
>> > > > +   .macro CREATE_PT_REGS_FRAME regs=%rsp
>> > > > +#ifdef CONFIG_FRAME_POINTER
>> > > > +   pushq   \regs
>> > > > +   pushq   $pt_regs+1
>> > >
>> > > Why the +1?
>> >
>> > Some unwinders like gdb are smart enough to report the function which
>> > contains the instruction *before* the return address.  Without the +1,
>> > they would show the wrong function.
>>
>> Lovely.  Want to add a comment?
>>
>> >
>> > > > +   pushq   %rbp
>> > > > +   movq%rsp, %rbp
>> > > > +#endif
>> > > > +   .endm
>> > >
>> > > I keep wanting this to be only two pushes and to fudge rbp to make it
>> > > work, but I don't see a good way.  But let's call it
>> > > CREATE_NESTED_ENTRY_FRAME or something, and let's rename pt_regs to
>> > > nested_frame or similar.
>> >
>> > Or, if we aren't going to annotate syscall pt_regs, we could give it a
>> > more specific name.  CREATE_INTERRUPT_FRAME and interrupt_frame()?
>>
>> CREATE_INTERRUPT_FRAME is confusing because it applies to idtentry,
>> too.  CREATE_PT_REGS_FRAME is probably fine.
>>
>> > > > +
>> > > > +/* fake function which allows stack unwinders to detect pt_regs 
>> > > > frames */
>> > > > +#ifdef CONFIG_FRAME_POINTER
>> > > > +ENTRY(pt_regs)
>> > > > +   nop
>> > > > +   nop
>> > > > +ENDPROC(pt_regs)
>> > > > +#endif /* CONFIG_FRAME_POINTER */
>> > >
>> > > Why is this two bytes long?  Is there some reason it has to be more
>> > > than one byte?
>> >
>> > Similar to above, this is related to the need to support various
>> > unwinders.  Whether the unwinder displays the ret_addr or the
>> > instruction preceding it, either way the instruction needs to be inside
>> > the function for the function to be reported.
>>
>> OK.
>
> Andy,
>
> So I got a chance to look at this some more.  I'm thinking that to make
> this feature more consistently useful, we shouldn't only annotate
> pt_regs frames for calls to handlers; other calls should be annotated as
> well: 

Re: [RFC PATCH v2 05/18] sched: add task flag for preempt IRQ tracking

2016-06-22 Thread Josh Poimboeuf
On Mon, May 23, 2016 at 08:52:12PM -0700, Andy Lutomirski wrote:
> On May 23, 2016 7:28 PM, "Josh Poimboeuf"  wrote:
> > > Maybe I'm coming around to liking this idea.
> >
> > Ok, good :-)
> >
> > > In an ideal world (DWARF support, high-quality unwinder, nice pretty
> > > printer, etc), unwinding across a kernel exception would look like:
> > >
> > >  - some_func
> > >  - some_other_func
> > >  - do_page_fault
> > >  - page_fault
> > >
> > > After page_fault, the next unwind step takes us to the faulting RIP
> > > (faulting_func) and reports that all GPRs are known.  It should
> > > probably learn this fact from DWARF if DWARF is available, instead of
> > > directly decoding pt_regs, due to a few funny cases in which pt_regs
> > > may be incomplete.  A nice pretty printer could now print all the
> > > regs.
> > >
> > >  - faulting_func
> > >  - etc.
> > >
> > > For this to work, we don't actually need the unwinder to explicitly
> > > know where pt_regs is.
> >
> > That's true (but only for DWARF).
> >
> > > Food for thought, though: if user code does SYSENTER with TF set,
> > > you'll end up with partial pt_regs.  There's nothing the kernel can do
> > > about it.  DWARF will handle it without any fanfare, but I don't know
> > > if it's going to cause trouble for you pre-DWARF.
> >
> > In this case it should see the stack pointer is past the pt_regs offset,
> > so it would just report it as an empty stack.
> 
> OK
> 
> >
> > > I'm also not sure it makes sense to apply this before the unwinder
> > > that can consume it is ready.  Maybe, if it would be consistent with
> > > your plans, it would make sense to rewrite the unwinder first, then
> > > apply this and teach live patching to use the new unwinder, and *then*
> > > add DWARF support?
> >
> > For the purposes of livepatch, the reliable unwinder needs to detect
> > whether an interrupt/exception pt_regs frame exists on a sleeping task
> > (or current).  This patch would allow us to do that.
> >
> > So my preferred order of doing things would be:
> >
> > 1) Brian Gerst's switch_to() cleanup and any related unwinder fixes
> > 2) this patch for annotating pt_regs stack frames
> > 3) reliable unwinder, similar to what I already posted, except it relies
> >on this patch instead of PF_PREEMPT_IRQ, and knows how to deal with
> >the new inactive task frame format of #1
> > 4) livepatch consistency model which uses the reliable unwinder
> > 5) rewrite unwinder, and port all users to the new interface
> > 6) add DWARF unwinder
> >
> > 1-4 are pretty much already written, whereas 5 and 6 will take
> > considerably more work.
> 
> Fair enough.
> 
> >
> > > > +   /*
> > > > +* Create a stack frame for the saved pt_regs.  This allows 
> > > > frame
> > > > +* pointer based unwinders to find pt_regs on the stack.
> > > > +*/
> > > > +   .macro CREATE_PT_REGS_FRAME regs=%rsp
> > > > +#ifdef CONFIG_FRAME_POINTER
> > > > +   pushq   \regs
> > > > +   pushq   $pt_regs+1
> > >
> > > Why the +1?
> >
> > Some unwinders like gdb are smart enough to report the function which
> > contains the instruction *before* the return address.  Without the +1,
> > they would show the wrong function.
> 
> Lovely.  Want to add a comment?
> 
> >
> > > > +   pushq   %rbp
> > > > +   movq%rsp, %rbp
> > > > +#endif
> > > > +   .endm
> > >
> > > I keep wanting this to be only two pushes and to fudge rbp to make it
> > > work, but I don't see a good way.  But let's call it
> > > CREATE_NESTED_ENTRY_FRAME or something, and let's rename pt_regs to
> > > nested_frame or similar.
> >
> > Or, if we aren't going to annotate syscall pt_regs, we could give it a
> > more specific name.  CREATE_INTERRUPT_FRAME and interrupt_frame()?
> 
> CREATE_INTERRUPT_FRAME is confusing because it applies to idtentry,
> too.  CREATE_PT_REGS_FRAME is probably fine.
> 
> > > > +
> > > > +/* fake function which allows stack unwinders to detect pt_regs frames 
> > > > */
> > > > +#ifdef CONFIG_FRAME_POINTER
> > > > +ENTRY(pt_regs)
> > > > +   nop
> > > > +   nop
> > > > +ENDPROC(pt_regs)
> > > > +#endif /* CONFIG_FRAME_POINTER */
> > >
> > > Why is this two bytes long?  Is there some reason it has to be more
> > > than one byte?
> >
> > Similar to above, this is related to the need to support various
> > unwinders.  Whether the unwinder displays the ret_addr or the
> > instruction preceding it, either way the instruction needs to be inside
> > the function for the function to be reported.
> 
> OK.

Andy,

So I got a chance to look at this some more.  I'm thinking that to make
this feature more consistently useful, we shouldn't only annotate
pt_regs frames for calls to handlers; other calls should be annotated as
well: preempt_schedule_irq, CALL_enter_from_user_mode,
prepare_exit_to_usermode, SWAPGS, TRACE_IRQS_OFF, DISABLE_INTERRUPTS,
etc.  That way, the unwinder will always be able to find pt_regs from an
interrupt/exception, even 

Re: [RFC PATCH v2 05/18] sched: add task flag for preempt IRQ tracking

2016-05-23 Thread Andy Lutomirski
On May 23, 2016 7:28 PM, "Josh Poimboeuf"  wrote:
> > Maybe I'm coming around to liking this idea.
>
> Ok, good :-)
>
> > In an ideal world (DWARF support, high-quality unwinder, nice pretty
> > printer, etc), unwinding across a kernel exception would look like:
> >
> >  - some_func
> >  - some_other_func
> >  - do_page_fault
> >  - page_fault
> >
> > After page_fault, the next unwind step takes us to the faulting RIP
> > (faulting_func) and reports that all GPRs are known.  It should
> > probably learn this fact from DWARF if DWARF is available, instead of
> > directly decoding pt_regs, due to a few funny cases in which pt_regs
> > may be incomplete.  A nice pretty printer could now print all the
> > regs.
> >
> >  - faulting_func
> >  - etc.
> >
> > For this to work, we don't actually need the unwinder to explicitly
> > know where pt_regs is.
>
> That's true (but only for DWARF).
>
> > Food for thought, though: if user code does SYSENTER with TF set,
> > you'll end up with partial pt_regs.  There's nothing the kernel can do
> > about it.  DWARF will handle it without any fanfare, but I don't know
> > if it's going to cause trouble for you pre-DWARF.
>
> In this case it should see the stack pointer is past the pt_regs offset,
> so it would just report it as an empty stack.

OK

>
> > I'm also not sure it makes sense to apply this before the unwinder
> > that can consume it is ready.  Maybe, if it would be consistent with
> > your plans, it would make sense to rewrite the unwinder first, then
> > apply this and teach live patching to use the new unwinder, and *then*
> > add DWARF support?
>
> For the purposes of livepatch, the reliable unwinder needs to detect
> whether an interrupt/exception pt_regs frame exists on a sleeping task
> (or current).  This patch would allow us to do that.
>
> So my preferred order of doing things would be:
>
> 1) Brian Gerst's switch_to() cleanup and any related unwinder fixes
> 2) this patch for annotating pt_regs stack frames
> 3) reliable unwinder, similar to what I already posted, except it relies
>on this patch instead of PF_PREEMPT_IRQ, and knows how to deal with
>the new inactive task frame format of #1
> 4) livepatch consistency model which uses the reliable unwinder
> 5) rewrite unwinder, and port all users to the new interface
> 6) add DWARF unwinder
>
> 1-4 are pretty much already written, whereas 5 and 6 will take
> considerably more work.

Fair enough.

>
> > > +   /*
> > > +* Create a stack frame for the saved pt_regs.  This allows frame
> > > +* pointer based unwinders to find pt_regs on the stack.
> > > +*/
> > > +   .macro CREATE_PT_REGS_FRAME regs=%rsp
> > > +#ifdef CONFIG_FRAME_POINTER
> > > +   pushq   \regs
> > > +   pushq   $pt_regs+1
> >
> > Why the +1?
>
> Some unwinders like gdb are smart enough to report the function which
> contains the instruction *before* the return address.  Without the +1,
> they would show the wrong function.

Lovely.  Want to add a comment?

>
> > > +   pushq   %rbp
> > > +   movq%rsp, %rbp
> > > +#endif
> > > +   .endm
> >
> > I keep wanting this to be only two pushes and to fudge rbp to make it
> > work, but I don't see a good way.  But let's call it
> > CREATE_NESTED_ENTRY_FRAME or something, and let's rename pt_regs to
> > nested_frame or similar.
>
> Or, if we aren't going to annotate syscall pt_regs, we could give it a
> more specific name.  CREATE_INTERRUPT_FRAME and interrupt_frame()?

CREATE_INTERRUPT_FRAME is confusing because it applies to idtentry,
too.  CREATE_PT_REGS_FRAME is probably fine.

> > > +
> > > +/* fake function which allows stack unwinders to detect pt_regs frames */
> > > +#ifdef CONFIG_FRAME_POINTER
> > > +ENTRY(pt_regs)
> > > +   nop
> > > +   nop
> > > +ENDPROC(pt_regs)
> > > +#endif /* CONFIG_FRAME_POINTER */
> >
> > Why is this two bytes long?  Is there some reason it has to be more
> > than one byte?
>
> Similar to above, this is related to the need to support various
> unwinders.  Whether the unwinder displays the ret_addr or the
> instruction preceding it, either way the instruction needs to be inside
> the function for the function to be reported.

OK.

--Andy
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [RFC PATCH v2 05/18] sched: add task flag for preempt IRQ tracking

2016-05-23 Thread Josh Poimboeuf
On Mon, May 23, 2016 at 02:34:56PM -0700, Andy Lutomirski wrote:
> On Thu, May 19, 2016 at 4:15 PM, Josh Poimboeuf  wrote:
> > On Mon, May 02, 2016 at 08:52:41AM -0700, Andy Lutomirski wrote:
> >> On Mon, May 2, 2016 at 6:52 AM, Josh Poimboeuf  wrote:
> >> > On Fri, Apr 29, 2016 at 05:08:50PM -0700, Andy Lutomirski wrote:
> >> >> On Apr 29, 2016 3:41 PM, "Josh Poimboeuf"  wrote:
> >> >> >
> >> >> > On Fri, Apr 29, 2016 at 02:37:41PM -0700, Andy Lutomirski wrote:
> >> >> > > On Fri, Apr 29, 2016 at 2:25 PM, Josh Poimboeuf 
> >> >> > >  wrote:
> >> >> > > >> I suppose we could try to rejigger the code so that rbp points to
> >> >> > > >> pt_regs or similar.
> >> >> > > >
> >> >> > > > I think we should avoid doing something like that because it 
> >> >> > > > would break
> >> >> > > > gdb and all the other unwinders who don't know about it.
> >> >> > >
> >> >> > > How so?
> >> >> > >
> >> >> > > Currently, rbp in the entry code is meaningless.  I'm suggesting 
> >> >> > > that,
> >> >> > > when we do, for example, 'call \do_sym' in idtentry, we point rbp to
> >> >> > > the pt_regs.  Currently it points to something stale (which the
> >> >> > > dump_stack code might be relying on.  Hmm.)  But it's probably also
> >> >> > > safe to assume that if you unwind to the 'call \do_sym', then 
> >> >> > > pt_regs
> >> >> > > is the next thing on the stack, so just doing the section thing 
> >> >> > > would
> >> >> > > work.
> >> >> >
> >> >> > Yes, rbp is meaningless on the entry from user space.  But if an
> >> >> > in-kernel interrupt occurs (e.g. page fault, preemption) and you have
> >> >> > nested entry, rbp keeps its old value, right?  So the unwinder can 
> >> >> > walk
> >> >> > past the nested entry frame and keep going until it gets to the 
> >> >> > original
> >> >> > entry.
> >> >>
> >> >> Yes.
> >> >>
> >> >> It would be nice if we could do better, though, and actually notice
> >> >> the pt_regs and identify the entry.  For example, I'd love to see
> >> >> "page fault, RIP=xyz" printed in the middle of a stack dump on a
> >> >> crash.
> >> >>
> >> >> Also, I think that just following rbp links will lose the
> >> >> actual function that took the page fault (or whatever function
> >> >> pt_regs->ip actually points to).
> >> >
> >> > Hm.  I think we could fix all that in a more standard way.  Whenever a
> >> > new pt_regs frame gets saved on entry, we could also create a new stack
> >> > frame which points to a fake kernel_entry() function.  That would tell
> >> > the unwinder there's a pt_regs frame without otherwise breaking frame
> >> > pointers across the frame.
> >> >
> >> > Then I guess we wouldn't need my other solution of putting the idt
> >> > entries in a special section.
> >> >
> >> > How does that sound?
> >>
> >> Let me try to understand.
> >>
> >> The normal call sequence is call; push %rbp; mov %rsp, %rbp.  So rbp
> >> points to (prev rbp, prev rip) on the stack, and you can follow the
> >> chain back.  Right now, on a user access page fault or similar, we
> >> have rbp (probably) pointing to the interrupted frame, and the
> >> interrupted rip isn't saved anywhere that a naive unwinder can find
> >> it.  (It's in pt_regs, but the rbp chain skips right over that.)
> >>
> >> We could change the entry code so that an interrupt / idtentry does:
> >>
> >> push pt_regs
> >> push kernel_entry
> >> push %rbp
> >> mov %rsp, %rbp
> >> call handler
> >> pop %rbp
> >> addq $8, %rsp
> >>
> >> or similar.  That would make it appear that the actual C handler was
> >> caused by a dummy function "kernel_entry".  Now the unwinder would get
> >> to kernel_entry, but it *still* wouldn't find its way to the calling
> >> frame, which only solves part of the problem.  We could at least teach
> >> the unwinder how kernel_entry works and let it decode pt_regs to
> >> continue unwinding.  This would be nice, and I think it could work.
> >>
> >> I think I like this, except that, if it used a separate section, it
> >> could potentially be faster, as, for each actual entry type, the
> >> offset from the C handler frame to pt_regs is a foregone conclusion.
> >> But this is pretty simple and performance is already abysmal in most
> >> handlers.
> >>
> >> There's an added benefit to using a separate section, though: we could
> >> also annotate the calls with what type of entry they were so the
> >> unwinder could print it out nicely.
> >>
> >> I could be convinced either way.
> >
> > Ok, I took a stab at this.  See the patch below.
> >
> > In addition to annotating interrupt/exception pt_regs frames, I also
> > annotated all the syscall pt_regs, for consistency.
> >
> > As you mentioned, it will affect performance a bit, but I think it will
> > be insignificant.
> >
> > I think I like this approach better than putting the
> > interrupt/idtentry's in a special section, because this is much more
> > precise.  Especially now that I'm annotating 

Re: [RFC PATCH v2 05/18] sched: add task flag for preempt IRQ tracking

2016-05-23 Thread Andy Lutomirski
On Mon, May 23, 2016 at 4:02 PM, Jiri Kosina  wrote:
> On Fri, 20 May 2016, Andy Lutomirski wrote:
>
>> I think it would be negligible, at least for interrupts, since
>> interrupts are already extremely expensive.  But I don't love adding
>> assembly code that makes them even slower.  The real thing I dislike
>> about this approach is that it's not a normal stack frame, so you need
>> code in the unwinder to unwind through it correctly, which makes me
>> think that you're not saving much complexity by adding the pushes.
>
> I fail to see what is so special about the stack frame; it's in fact
> pretty normal.
>
> It has added semantic value for "those who know", but the others will
> (pretty much correctly) consider it to be a stackframe from a function
> call, and be done with it.
>
> What am I missing?

In Josh's code, the stack looks like:

...
interrupted frame
pt_regs
pointer to pt_regs
address of pt_regs dummy function
rbp
handler frame

A naive unwinder won't unwind this correctly, as there's no link back
to the interrupted frame's RIP.  If the layout changed to:


...
interrupted frame
pt_regs
interrupted RIP
rbp
handler frame

then I think it would unwind correctly, but the pt_regs would be
invisible, which is IMO a bit unfortunate.  It could also be (I
think):


...
interrupted frame
pt_regs
interrupted rbp
interrupted RIP
pointer to pt_regs
address of pt_regs dummy function
pointer to "interrupted RIP" stack slot
handler frame

but now this is *five* pushes for the dummy frame, which I think is
getting a bit out of hand.

--Andy
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [RFC PATCH v2 05/18] sched: add task flag for preempt IRQ tracking

2016-05-23 Thread Jiri Kosina
On Fri, 20 May 2016, Andy Lutomirski wrote:

> I think it would be negligible, at least for interrupts, since
> interrupts are already extremely expensive.  But I don't love adding
> assembly code that makes them even slower.  The real thing I dislike
> about this approach is that it's not a normal stack frame, so you need
> code in the unwinder to unwind through it correctly, which makes me
> think that you're not saving much complexity by adding the pushes.

I fail to see what is so special about the stack frame; it's in fact 
pretty normal.

It has added semantic value for "those who know", but the others will 
(pretty much correctly) consider it to be a stackframe from a function 
call, and be done with it.

What am I missing?

Thanks,

-- 
Jiri Kosina
SUSE Labs

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [RFC PATCH v2 05/18] sched: add task flag for preempt IRQ tracking

2016-05-23 Thread Andy Lutomirski
On Thu, May 19, 2016 at 4:15 PM, Josh Poimboeuf  wrote:
> On Mon, May 02, 2016 at 08:52:41AM -0700, Andy Lutomirski wrote:
>> On Mon, May 2, 2016 at 6:52 AM, Josh Poimboeuf  wrote:
>> > On Fri, Apr 29, 2016 at 05:08:50PM -0700, Andy Lutomirski wrote:
>> >> On Apr 29, 2016 3:41 PM, "Josh Poimboeuf"  wrote:
>> >> >
>> >> > On Fri, Apr 29, 2016 at 02:37:41PM -0700, Andy Lutomirski wrote:
>> >> > > On Fri, Apr 29, 2016 at 2:25 PM, Josh Poimboeuf  
>> >> > > wrote:
>> >> > > >> I suppose we could try to rejigger the code so that rbp points to
>> >> > > >> pt_regs or similar.
>> >> > > >
>> >> > > > I think we should avoid doing something like that because it would 
>> >> > > > break
>> >> > > > gdb and all the other unwinders who don't know about it.
>> >> > >
>> >> > > How so?
>> >> > >
>> >> > > Currently, rbp in the entry code is meaningless.  I'm suggesting that,
>> >> > > when we do, for example, 'call \do_sym' in idtentry, we point rbp to
>> >> > > the pt_regs.  Currently it points to something stale (which the
>> >> > > dump_stack code might be relying on.  Hmm.)  But it's probably also
>> >> > > safe to assume that if you unwind to the 'call \do_sym', then pt_regs
>> >> > > is the next thing on the stack, so just doing the section thing would
>> >> > > work.
>> >> >
>> >> > Yes, rbp is meaningless on the entry from user space.  But if an
>> >> > in-kernel interrupt occurs (e.g. page fault, preemption) and you have
>> >> > nested entry, rbp keeps its old value, right?  So the unwinder can walk
>> >> > past the nested entry frame and keep going until it gets to the original
>> >> > entry.
>> >>
>> >> Yes.
>> >>
>> >> It would be nice if we could do better, though, and actually notice
>> >> the pt_regs and identify the entry.  For example, I'd love to see
>> >> "page fault, RIP=xyz" printed in the middle of a stack dump on a
>> >> crash.
>> >>
>> >> Also, I think that just following rbp links will lose the
>> >> actual function that took the page fault (or whatever function
>> >> pt_regs->ip actually points to).
>> >
>> > Hm.  I think we could fix all that in a more standard way.  Whenever a
>> > new pt_regs frame gets saved on entry, we could also create a new stack
>> > frame which points to a fake kernel_entry() function.  That would tell
>> > the unwinder there's a pt_regs frame without otherwise breaking frame
>> > pointers across the frame.
>> >
>> > Then I guess we wouldn't need my other solution of putting the idt
>> > entries in a special section.
>> >
>> > How does that sound?
>>
>> Let me try to understand.
>>
>> The normal call sequence is call; push %rbp; mov %rsp, %rbp.  So rbp
>> points to (prev rbp, prev rip) on the stack, and you can follow the
>> chain back.  Right now, on a user access page fault or similar, we
>> have rbp (probably) pointing to the interrupted frame, and the
>> interrupted rip isn't saved anywhere that a naive unwinder can find
>> it.  (It's in pt_regs, but the rbp chain skips right over that.)
>>
>> We could change the entry code so that an interrupt / idtentry does:
>>
>> push pt_regs
>> push kernel_entry
>> push %rbp
>> mov %rsp, %rbp
>> call handler
>> pop %rbp
>> addq $8, %rsp
>>
>> or similar.  That would make it appear that the actual C handler was
>> caused by a dummy function "kernel_entry".  Now the unwinder would get
>> to kernel_entry, but it *still* wouldn't find its way to the calling
>> frame, which only solves part of the problem.  We could at least teach
>> the unwinder how kernel_entry works and let it decode pt_regs to
>> continue unwinding.  This would be nice, and I think it could work.
>>
>> I think I like this, except that, if it used a separate section, it
>> could potentially be faster, as, for each actual entry type, the
>> offset from the C handler frame to pt_regs is a foregone conclusion.
>> But this is pretty simple and performance is already abysmal in most
>> handlers.
>>
>> There's an added benefit to using a separate section, though: we could
>> also annotate the calls with what type of entry they were so the
>> unwinder could print it out nicely.
>>
>> I could be convinced either way.
>
> Ok, I took a stab at this.  See the patch below.
>
> In addition to annotating interrupt/exception pt_regs frames, I also
> annotated all the syscall pt_regs, for consistency.
>
> As you mentioned, it will affect performance a bit, but I think it will
> be insignificant.
>
> I think I like this approach better than putting the
> interrupt/idtentry's in a special section, because this is much more
> precise.  Especially now that I'm annotating pt_regs syscalls.
>
> Also I think with a few minor changes we could implement your idea of
> annotating the calls with what type of entry they are.  But I don't
> think that's really needed, because the name of the interrupt/idtentry
> is already on the stack trace.
>
> Before:
>
>   [] dump_stack+0x85/0xc2
>   [] 

Re: [RFC PATCH v2 05/18] sched: add task flag for preempt IRQ tracking

2016-05-20 Thread Josh Poimboeuf
On Fri, May 20, 2016 at 09:59:38AM -0700, Andy Lutomirski wrote:
> On Fri, May 20, 2016 at 9:41 AM, Josh Poimboeuf  wrote:
> > On Fri, May 20, 2016 at 08:41:00AM -0700, Andy Lutomirski wrote:
> >> On Fri, May 20, 2016 at 7:05 AM, Josh Poimboeuf  
> >> wrote:
> >> > On Thu, May 19, 2016 at 04:39:51PM -0700, Andy Lutomirski wrote:
> >> >> On Thu, May 19, 2016 at 4:15 PM, Josh Poimboeuf  
> >> >> wrote:
> >> >> > Note this example is with today's unwinder.  It could be made smarter 
> >> >> > to
> >> >> > get the RIP from the pt_regs so the '?' could be removed from
> >> >> > copy_page_to_iter().
> >> >> >
> >> >> > Thoughts?
> >> >>
> >> >> I think we should do that.  The silly sample patch I sent you (or at
> >> >> least that I think I sent you) did that, and it worked nicely.
> >> >
> >> > Yeah, we can certainly do something similar to make the unwinder
> >> > smarter.  It should be very simple with this approach: if it finds the
> >> > pt_regs() function on the stack, the (struct pt_regs *) pointer will be
> >> > right after it.
> >>
> >> That seems barely easier than checking if it finds a function in
> >> .entry that's marked on the stack,
> >
> > Don't forget you'd also have to look up the function's pt_regs offset in
> > the table.
> >
> >> and the latter has no runtime cost.
> >
> > Well, I had been assuming the three extra pushes and one extra pop for
> > interrupts would be negligible, but I'm definitely no expert there.  Any
> > idea how I can measure it?
> 
> I think it would be negligible, at least for interrupts, since
> interrupts are already extremely expensive.  But I don't love adding
> assembly code that makes them even slower.

I just don't find that very convincing :-)  If the performance impact is
negligible, we should ignore it.  We should instead be focusing on
finding the simplest solution.

> The real thing I dislike about this approach is that it's not a normal
> stack frame, so you need code in the unwinder to unwind through it
> correctly, which makes me think that you're not saving much complexity
> by adding the pushes.  But maybe I'm wrong.

Hm, I view it differently.  Sure the stack frame is a bit unusual, but
as far as unwinders go, it _is_ normal.  So even a stock unwinder can
show the user that a pt_regs() -- or interrupt_frame() or whatever we
call it -- function was called.  Just knowing that an interrupt occurred
could be useful information, even without the contents of RIP.

> >> > I'm not sure about the idea of a table.  I get the feeling it would add
> >> > more complexity to both the entry code and the unwinder.  How would you
> >> > specify the pt_regs location when it's on a different stack?  (See the
> >> > interrupt macro: non-nested interrupts will place pt_regs on the task
> >> > stack before switching to the irq stack.)
> >>
> >> Hmm.  I need to think about the interrupt stack case a bit.  Although
> >> the actual top of the interrupt stack has a nearly fixed format, and I
> >> have old patches to clean it up and make it actually be fixed.  I'll
> >> try to dust those off and resend them soon.
> >
> > How would that solve the problem?  Would pt_regs be moved or copied to
> > the irq stack?
> 
> Hmm.
> 
> Maybe the right way would be to unwind off the IRQ stack in two steps.
> Step one would be to figure out that you're on the IRQ stack and pop
> your way off it.  Step two would be to find pt_regs.
> 
> But that could be rather nasty to implement.  Maybe what we actually
> want to do is this:
> 
> First, apply this thing:
> 
> https://git.kernel.org/cgit/linux/kernel/git/luto/linux.git/commit/?h=x86/entry_ist=2231ec7e0bcc1a2bc94a17081511ab54cc6badd1
> 
> that gives us a common format for the IRQ stack.
> 
> Second, use my idea of making a table of offsets to pt_regs, so we'd
> have, roughly:
> 
> ENTER_IRQ_STACK old_rsp=%r11
> call __do_softirq
> ANNOTATE_IRQSTACK_PTREGS_CALL offset=0
> LEAVE_IRQ_STACK
> 
> the idea here is that offset=0 means that there is no offset beyond
> that implied by ENTER_IRQ_STACK.  What actually gets written to the
> table is offset 8, because ENTER_IRQ_STACK itself does one push.
> 
> So far, this has no runtime overhead at all.
> 
> Then, in the unwinder, the logic becomes:
> 
> If the return address is annotated in the ptregs entry table, look up
> the offset.  If the offset is in bounds, then use it.  If the offset
> is out of bounds and we're on the IRQ stack, then unwind the
> ENTER_IRQ_STACK as well.
> 
> Does that seem reasonable?  I can try to implement it and see what it
> looks like.

To be honest I'm not crazy about it.  The ANNOTATE_IRQSTACK_PTREGS_CALL
macro needs knowledge about the implementation of ENTER_IRQ_STACK and
how many pushes it does.  I think that makes the entry code more complex
and harder to understand than my patch.

Also the unwinders would need to be quite a bit more complex, and would
need to know more of the intimate details of the irq 

Re: [RFC PATCH v2 05/18] sched: add task flag for preempt IRQ tracking

2016-05-20 Thread Andy Lutomirski
On Fri, May 20, 2016 at 9:41 AM, Josh Poimboeuf  wrote:
> On Fri, May 20, 2016 at 08:41:00AM -0700, Andy Lutomirski wrote:
>> On Fri, May 20, 2016 at 7:05 AM, Josh Poimboeuf  wrote:
>> > On Thu, May 19, 2016 at 04:39:51PM -0700, Andy Lutomirski wrote:
>> >> On Thu, May 19, 2016 at 4:15 PM, Josh Poimboeuf  
>> >> wrote:
>> >> > Note this example is with today's unwinder.  It could be made smarter to
>> >> > get the RIP from the pt_regs so the '?' could be removed from
>> >> > copy_page_to_iter().
>> >> >
>> >> > Thoughts?
>> >>
>> >> I think we should do that.  The silly sample patch I sent you (or at
>> >> least that I think I sent you) did that, and it worked nicely.
>> >
>> > Yeah, we can certainly do something similar to make the unwinder
>> > smarter.  It should be very simple with this approach: if it finds the
>> > pt_regs() function on the stack, the (struct pt_regs *) pointer will be
>> > right after it.
>>
>> That seems barely easier than checking if it finds a function in
>> .entry that's marked on the stack,
>
> Don't forget you'd also have to look up the function's pt_regs offset in
> the table.
>
>> and the latter has no runtime cost.
>
> Well, I had been assuming the three extra pushes and one extra pop for
> interrupts would be negligible, but I'm definitely no expert there.  Any
> idea how I can measure it?

I think it would be negligible, at least for interrupts, since
interrupts are already extremely expensive.  But I don't love adding
assembly code that makes them even slower.  The real thing I dislike
about this approach is that it's not a normal stack frame, so you need
code in the unwinder to unwind through it correctly, which makes me
think that you're not saving much complexity by adding the pushes.
But maybe I'm wrong.

>
>> > I'm not sure about the idea of a table.  I get the feeling it would add
>> > more complexity to both the entry code and the unwinder.  How would you
>> > specify the pt_regs location when it's on a different stack?  (See the
>> > interrupt macro: non-nested interrupts will place pt_regs on the task
>> > stack before switching to the irq stack.)
>>
>> Hmm.  I need to think about the interrupt stack case a bit.  Although
>> the actual top of the interrupt stack has a nearly fixed format, and I
>> have old patches to clean it up and make it actually be fixed.  I'll
>> try to dust those off and resend them soon.
>
> How would that solve the problem?  Would pt_regs be moved or copied to
> the irq stack?

Hmm.

Maybe the right way would be to unwind off the IRQ stack in two steps.
Step one would be to figure out that you're on the IRQ stack and pop
your way off it.  Step two would be to find pt_regs.

But that could be rather nasty to implement.  Maybe what we actually
want to do is this:

First, apply this thing:

https://git.kernel.org/cgit/linux/kernel/git/luto/linux.git/commit/?h=x86/entry_ist=2231ec7e0bcc1a2bc94a17081511ab54cc6badd1

that gives us a common format for the IRQ stack.

Second, use my idea of making a table of offsets to pt_regs, so we'd
have, roughly:

ENTER_IRQ_STACK old_rsp=%r11
call __do_softirq
ANNOTATE_IRQSTACK_PTREGS_CALL offset=0
LEAVE_IRQ_STACK

the idea here is that offset=0 means that there is no offset beyond
that implied by ENTER_IRQ_STACK.  What actually gets written to the
table is offset 8, because ENTER_IRQ_STACK itself does one push.

So far, this has no runtime overhead at all.

Then, in the unwinder, the logic becomes:

If the return address is annotated in the ptregs entry table, look up
the offset.  If the offset is in bounds, then use it.  If the offset
is out of bounds and we're on the IRQ stack, then unwind the
ENTER_IRQ_STACK as well.

Does that seem reasonable?  I can try to implement it and see what it
looks like.

--Andy

>
>> > If you're worried about performance, I can remove the syscall
>> > annotations.  They're optional anyway, since the pt_regs is always at
>> > the same place on the stack for syscalls.
>> >
>> > I think three extra pushes wouldn't be a performance issue for
>> > interrupts/exceptions.  And they'll go away when we finally bury
>> > CONFIG_FRAME_POINTER.
>>
>> I bet we'll always need to support CONFIG_FRAME_POINTER for some
>> embedded systems.
>
> Yeah, probably.
>
>> I'll play with this a bit.
>
> Thanks, looking forward to seeing what you come up with.
>
> --
> Josh



-- 
Andy Lutomirski
AMA Capital Management, LLC
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [RFC PATCH v2 05/18] sched: add task flag for preempt IRQ tracking

2016-05-20 Thread Josh Poimboeuf
On Fri, May 20, 2016 at 08:41:00AM -0700, Andy Lutomirski wrote:
> On Fri, May 20, 2016 at 7:05 AM, Josh Poimboeuf  wrote:
> > On Thu, May 19, 2016 at 04:39:51PM -0700, Andy Lutomirski wrote:
> >> On Thu, May 19, 2016 at 4:15 PM, Josh Poimboeuf  
> >> wrote:
> >> > Note this example is with today's unwinder.  It could be made smarter to
> >> > get the RIP from the pt_regs so the '?' could be removed from
> >> > copy_page_to_iter().
> >> >
> >> > Thoughts?
> >>
> >> I think we should do that.  The silly sample patch I sent you (or at
> >> least that I think I sent you) did that, and it worked nicely.
> >
> > Yeah, we can certainly do something similar to make the unwinder
> > smarter.  It should be very simple with this approach: if it finds the
> > pt_regs() function on the stack, the (struct pt_regs *) pointer will be
> > right after it.
> 
> That seems barely easier than checking if it finds a function in
> .entry that's marked on the stack,

Don't forget you'd also have to look up the function's pt_regs offset in
the table.

> and the latter has no runtime cost.

Well, I had been assuming the three extra pushes and one extra pop for
interrupts would be negligible, but I'm definitely no expert there.  Any
idea how I can measure it?

> > I'm not sure about the idea of a table.  I get the feeling it would add
> > more complexity to both the entry code and the unwinder.  How would you
> > specify the pt_regs location when it's on a different stack?  (See the
> > interrupt macro: non-nested interrupts will place pt_regs on the task
> > stack before switching to the irq stack.)
> 
> Hmm.  I need to think about the interrupt stack case a bit.  Although
> the actual top of the interrupt stack has a nearly fixed format, and I
> have old patches to clean it up and make it actually be fixed.  I'll
> try to dust those off and resend them soon.

How would that solve the problem?  Would pt_regs be moved or copied to
the irq stack?

> > If you're worried about performance, I can remove the syscall
> > annotations.  They're optional anyway, since the pt_regs is always at
> > the same place on the stack for syscalls.
> >
> > I think three extra pushes wouldn't be a performance issue for
> > interrupts/exceptions.  And they'll go away when we finally bury
> > CONFIG_FRAME_POINTER.
> 
> I bet we'll always need to support CONFIG_FRAME_POINTER for some
> embedded systems.

Yeah, probably.

> I'll play with this a bit.

Thanks, looking forward to seeing what you come up with.

-- 
Josh
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [RFC PATCH v2 05/18] sched: add task flag for preempt IRQ tracking

2016-05-20 Thread Andy Lutomirski
On Fri, May 20, 2016 at 7:05 AM, Josh Poimboeuf  wrote:
> On Thu, May 19, 2016 at 04:39:51PM -0700, Andy Lutomirski wrote:
>> On Thu, May 19, 2016 at 4:15 PM, Josh Poimboeuf  wrote:
>> > Note this example is with today's unwinder.  It could be made smarter to
>> > get the RIP from the pt_regs so the '?' could be removed from
>> > copy_page_to_iter().
>> >
>> > Thoughts?
>>
>> I think we should do that.  The silly sample patch I sent you (or at
>> least that I think I sent you) did that, and it worked nicely.
>
> Yeah, we can certainly do something similar to make the unwinder
> smarter.  It should be very simple with this approach: if it finds the
> pt_regs() function on the stack, the (struct pt_regs *) pointer will be
> right after it.

That seems barely easier than checking if it finds a function in
.entry that's marked on the stack, and the latter has no runtime cost.

>
>> > diff --git a/arch/x86/entry/calling.h b/arch/x86/entry/calling.h
>> > index 9a9e588..f54886a 100644
>> > --- a/arch/x86/entry/calling.h
>> > +++ b/arch/x86/entry/calling.h
>> > @@ -201,6 +201,32 @@ For 32-bit we have the following conventions - kernel 
>> > is built with
>> > .byte 0xf1
>> > .endm
>> >
>> > +   /*
>> > +* Create a stack frame for the saved pt_regs.  This allows frame
>> > +* pointer based unwinders to find pt_regs on the stack.
>> > +*/
>> > +   .macro CREATE_PT_REGS_FRAME regs=%rsp
>> > +#ifdef CONFIG_FRAME_POINTER
>> > +   pushq   \regs
>> > +   pushq   $pt_regs+1
>> > +   pushq   %rbp
>> > +   movq%rsp, %rbp
>> > +#endif
>> > +   .endm
>>
>> I don't love this part.  It's going to hurt performance, and, given
>> that we need to change the unwinder anyway to make it useful, let's
>> just emit a table somewhere in .rodata and use it directly.
>
> I'm not sure about the idea of a table.  I get the feeling it would add
> more complexity to both the entry code and the unwinder.  How would you
> specify the pt_regs location when it's on a different stack?  (See the
> interrupt macro: non-nested interrupts will place pt_regs on the task
> stack before switching to the irq stack.)

Hmm.  I need to think about the interrupt stack case a bit.  Although
the actual top of the interrupt stack has a nearly fixed format, and I
have old patches to clean it up and make it actually be fixed.  I'll
try to dust those off and resend them soon.

>
> If you're worried about performance, I can remove the syscall
> annotations.  They're optional anyway, since the pt_regs is always at
> the same place on the stack for syscalls.
>
> I think three extra pushes wouldn't be a performance issue for
> interrupts/exceptions.  And they'll go away when we finally bury
> CONFIG_FRAME_POINTER.

I bet we'll always need to support CONFIG_FRAME_POINTER for some
embedded systems.

I'll play with this a bit.

--Andy
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [RFC PATCH v2 05/18] sched: add task flag for preempt IRQ tracking

2016-05-20 Thread Josh Poimboeuf
On Thu, May 19, 2016 at 04:39:51PM -0700, Andy Lutomirski wrote:
> On Thu, May 19, 2016 at 4:15 PM, Josh Poimboeuf  wrote:
> > Note this example is with today's unwinder.  It could be made smarter to
> > get the RIP from the pt_regs so the '?' could be removed from
> > copy_page_to_iter().
> >
> > Thoughts?
> 
> I think we should do that.  The silly sample patch I sent you (or at
> least that I think I sent you) did that, and it worked nicely.

Yeah, we can certainly do something similar to make the unwinder
smarter.  It should be very simple with this approach: if it finds the
pt_regs() function on the stack, the (struct pt_regs *) pointer will be
right after it.

> > diff --git a/arch/x86/entry/calling.h b/arch/x86/entry/calling.h
> > index 9a9e588..f54886a 100644
> > --- a/arch/x86/entry/calling.h
> > +++ b/arch/x86/entry/calling.h
> > @@ -201,6 +201,32 @@ For 32-bit we have the following conventions - kernel 
> > is built with
> > .byte 0xf1
> > .endm
> >
> > +   /*
> > +* Create a stack frame for the saved pt_regs.  This allows frame
> > +* pointer based unwinders to find pt_regs on the stack.
> > +*/
> > +   .macro CREATE_PT_REGS_FRAME regs=%rsp
> > +#ifdef CONFIG_FRAME_POINTER
> > +   pushq   \regs
> > +   pushq   $pt_regs+1
> > +   pushq   %rbp
> > +   movq%rsp, %rbp
> > +#endif
> > +   .endm
> 
> I don't love this part.  It's going to hurt performance, and, given
> that we need to change the unwinder anyway to make it useful, let's
> just emit a table somewhere in .rodata and use it directly.

I'm not sure about the idea of a table.  I get the feeling it would add
more complexity to both the entry code and the unwinder.  How would you
specify the pt_regs location when it's on a different stack?  (See the
interrupt macro: non-nested interrupts will place pt_regs on the task
stack before switching to the irq stack.)

If you're worried about performance, I can remove the syscall
annotations.  They're optional anyway, since the pt_regs is always at
the same place on the stack for syscalls.

I think three extra pushes wouldn't be a performance issue for
interrupts/exceptions.  And they'll go away when we finally bury
CONFIG_FRAME_POINTER.

Here's the same patch without syscall annotations:


diff --git a/arch/x86/entry/calling.h b/arch/x86/entry/calling.h
index 9a9e588..f54886a 100644
--- a/arch/x86/entry/calling.h
+++ b/arch/x86/entry/calling.h
@@ -201,6 +201,32 @@ For 32-bit we have the following conventions - kernel is 
built with
.byte 0xf1
.endm
 
+   /*
+* Create a stack frame for the saved pt_regs.  This allows frame
+* pointer based unwinders to find pt_regs on the stack.
+*/
+   .macro CREATE_PT_REGS_FRAME regs=%rsp
+#ifdef CONFIG_FRAME_POINTER
+   pushq   \regs
+   pushq   $pt_regs+1
+   pushq   %rbp
+   movq%rsp, %rbp
+#endif
+   .endm
+
+   .macro REMOVE_PT_REGS_FRAME
+#ifdef CONFIG_FRAME_POINTER
+   popq%rbp
+   addq$0x10, %rsp
+#endif
+   .endm
+
+   .macro CALL_HANDLER handler regs=%rsp
+   CREATE_PT_REGS_FRAME \regs
+   call\handler
+   REMOVE_PT_REGS_FRAME
+   .endm
+
 #endif /* CONFIG_X86_64 */
 
 /*
diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 9ee0da1..93bc8f0 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -468,7 +468,7 @@ END(irq_entries_start)
/* We entered an interrupt context - irqs are off: */
TRACE_IRQS_OFF
 
-   call\func   /* rdi points to pt_regs */
+   CALL_HANDLER \func regs=%rdi
.endm
 
/*
@@ -495,7 +495,7 @@ ret_from_intr:
/* Interrupt came from user space */
 GLOBAL(retint_user)
mov %rsp,%rdi
-   callprepare_exit_to_usermode
+   CALL_HANDLER prepare_exit_to_usermode
TRACE_IRQS_IRETQ
SWAPGS
jmp restore_regs_and_iret
@@ -509,7 +509,7 @@ retint_kernel:
jnc 1f
 0: cmpl$0, PER_CPU_VAR(__preempt_count)
jnz 1f
-   callpreempt_schedule_irq
+   CALL_HANDLER preempt_schedule_irq
jmp 0b
 1:
 #endif
@@ -688,8 +688,6 @@ ENTRY(\sym)
.endif
.endif
 
-   movq%rsp, %rdi  /* pt_regs pointer */
-
.if \has_error_code
movqORIG_RAX(%rsp), %rsi/* get error code */
movq$-1, ORIG_RAX(%rsp) /* no syscall to restart */
@@ -701,7 +699,8 @@ ENTRY(\sym)
subq$EXCEPTION_STKSZ, CPU_TSS_IST(\shift_ist)
.endif
 
-   call\do_sym
+   movq%rsp, %rdi  /* pt_regs pointer */
+   CALL_HANDLER \do_sym
 
.if \shift_ist != -1
addq$EXCEPTION_STKSZ, CPU_TSS_IST(\shift_ist)
@@ -728,8 +727,6 @@ ENTRY(\sym)
callsync_regs
movq%rax, %rsp  /* switch stack */
 
-   movq%rsp, 

Re: [RFC PATCH v2 05/18] sched: add task flag for preempt IRQ tracking

2016-05-19 Thread Andy Lutomirski
On Thu, May 19, 2016 at 4:15 PM, Josh Poimboeuf  wrote:
> On Mon, May 02, 2016 at 08:52:41AM -0700, Andy Lutomirski wrote:
>> On Mon, May 2, 2016 at 6:52 AM, Josh Poimboeuf  wrote:
>> > On Fri, Apr 29, 2016 at 05:08:50PM -0700, Andy Lutomirski wrote:
>> >> On Apr 29, 2016 3:41 PM, "Josh Poimboeuf"  wrote:
>> >> >
>> >> > On Fri, Apr 29, 2016 at 02:37:41PM -0700, Andy Lutomirski wrote:
>> >> > > On Fri, Apr 29, 2016 at 2:25 PM, Josh Poimboeuf  
>> >> > > wrote:
>> >> > > >> I suppose we could try to rejigger the code so that rbp points to
>> >> > > >> pt_regs or similar.
>> >> > > >
>> >> > > > I think we should avoid doing something like that because it would 
>> >> > > > break
>> >> > > > gdb and all the other unwinders who don't know about it.
>> >> > >
>> >> > > How so?
>> >> > >
>> >> > > Currently, rbp in the entry code is meaningless.  I'm suggesting that,
>> >> > > when we do, for example, 'call \do_sym' in idtentry, we point rbp to
>> >> > > the pt_regs.  Currently it points to something stale (which the
>> >> > > dump_stack code might be relying on.  Hmm.)  But it's probably also
>> >> > > safe to assume that if you unwind to the 'call \do_sym', then pt_regs
>> >> > > is the next thing on the stack, so just doing the section thing would
>> >> > > work.
>> >> >
>> >> > Yes, rbp is meaningless on the entry from user space.  But if an
>> >> > in-kernel interrupt occurs (e.g. page fault, preemption) and you have
>> >> > nested entry, rbp keeps its old value, right?  So the unwinder can walk
>> >> > past the nested entry frame and keep going until it gets to the original
>> >> > entry.
>> >>
>> >> Yes.
>> >>
>> >> It would be nice if we could do better, though, and actually notice
>> >> the pt_regs and identify the entry.  For example, I'd love to see
>> >> "page fault, RIP=xyz" printed in the middle of a stack dump on a
>> >> crash.
>> >>
>> >> Also, I think that just following rbp links will lose the
>> >> actual function that took the page fault (or whatever function
>> >> pt_regs->ip actually points to).
>> >
>> > Hm.  I think we could fix all that in a more standard way.  Whenever a
>> > new pt_regs frame gets saved on entry, we could also create a new stack
>> > frame which points to a fake kernel_entry() function.  That would tell
>> > the unwinder there's a pt_regs frame without otherwise breaking frame
>> > pointers across the frame.
>> >
>> > Then I guess we wouldn't need my other solution of putting the idt
>> > entries in a special section.
>> >
>> > How does that sound?
>>
>> Let me try to understand.
>>
>> The normal call sequence is call; push %rbp; mov %rsp, %rbp.  So rbp
>> points to (prev rbp, prev rip) on the stack, and you can follow the
>> chain back.  Right now, on a user access page fault or similar, we
>> have rbp (probably) pointing to the interrupted frame, and the
>> interrupted rip isn't saved anywhere that a naive unwinder can find
>> it.  (It's in pt_regs, but the rbp chain skips right over that.)
>>
>> We could change the entry code so that an interrupt / idtentry does:
>>
>> push pt_regs
>> push kernel_entry
>> push %rbp
>> mov %rsp, %rbp
>> call handler
>> pop %rbp
>> addq $8, %rsp
>>
>> or similar.  That would make it appear that the actual C handler was
>> caused by a dummy function "kernel_entry".  Now the unwinder would get
>> to kernel_entry, but it *still* wouldn't find its way to the calling
>> frame, which only solves part of the problem.  We could at least teach
>> the unwinder how kernel_entry works and let it decode pt_regs to
>> continue unwinding.  This would be nice, and I think it could work.
>>
>> I think I like this, except that, if it used a separate section, it
>> could potentially be faster, as, for each actual entry type, the
>> offset from the C handler frame to pt_regs is a foregone conclusion.
>> But this is pretty simple and performance is already abysmal in most
>> handlers.
>>
>> There's an added benefit to using a separate section, though: we could
>> also annotate the calls with what type of entry they were so the
>> unwinder could print it out nicely.
>>
>> I could be convinced either way.
>
> Ok, I took a stab at this.  See the patch below.
>
> In addition to annotating interrupt/exception pt_regs frames, I also
> annotated all the syscall pt_regs, for consistency.
>
> As you mentioned, it will affect performance a bit, but I think it will
> be insignificant.
>
> I think I like this approach better than putting the
> interrupt/idtentry's in a special section, because this is much more
> precise.  Especially now that I'm annotating pt_regs syscalls.
>
> Also I think with a few minor changes we could implement your idea of
> annotating the calls with what type of entry they are.  But I don't
> think that's really needed, because the name of the interrupt/idtentry
> is already on the stack trace.
>
> Before:
>
>   [] dump_stack+0x85/0xc2
>   [] 

Re: [RFC PATCH v2 05/18] sched: add task flag for preempt IRQ tracking

2016-05-19 Thread Josh Poimboeuf
On Mon, May 02, 2016 at 08:52:41AM -0700, Andy Lutomirski wrote:
> On Mon, May 2, 2016 at 6:52 AM, Josh Poimboeuf  wrote:
> > On Fri, Apr 29, 2016 at 05:08:50PM -0700, Andy Lutomirski wrote:
> >> On Apr 29, 2016 3:41 PM, "Josh Poimboeuf"  wrote:
> >> >
> >> > On Fri, Apr 29, 2016 at 02:37:41PM -0700, Andy Lutomirski wrote:
> >> > > On Fri, Apr 29, 2016 at 2:25 PM, Josh Poimboeuf  
> >> > > wrote:
> >> > > >> I suppose we could try to rejigger the code so that rbp points to
> >> > > >> pt_regs or similar.
> >> > > >
> >> > > > I think we should avoid doing something like that because it would 
> >> > > > break
> >> > > > gdb and all the other unwinders who don't know about it.
> >> > >
> >> > > How so?
> >> > >
> >> > > Currently, rbp in the entry code is meaningless.  I'm suggesting that,
> >> > > when we do, for example, 'call \do_sym' in idtentry, we point rbp to
> >> > > the pt_regs.  Currently it points to something stale (which the
> >> > > dump_stack code might be relying on.  Hmm.)  But it's probably also
> >> > > safe to assume that if you unwind to the 'call \do_sym', then pt_regs
> >> > > is the next thing on the stack, so just doing the section thing would
> >> > > work.
> >> >
> >> > Yes, rbp is meaningless on the entry from user space.  But if an
> >> > in-kernel interrupt occurs (e.g. page fault, preemption) and you have
> >> > nested entry, rbp keeps its old value, right?  So the unwinder can walk
> >> > past the nested entry frame and keep going until it gets to the original
> >> > entry.
> >>
> >> Yes.
> >>
> >> It would be nice if we could do better, though, and actually notice
> >> the pt_regs and identify the entry.  For example, I'd love to see
> >> "page fault, RIP=xyz" printed in the middle of a stack dump on a
> >> crash.
> >>
> >> Also, I think that just following rbp links will lose the
> >> actual function that took the page fault (or whatever function
> >> pt_regs->ip actually points to).
> >
> > Hm.  I think we could fix all that in a more standard way.  Whenever a
> > new pt_regs frame gets saved on entry, we could also create a new stack
> > frame which points to a fake kernel_entry() function.  That would tell
> > the unwinder there's a pt_regs frame without otherwise breaking frame
> > pointers across the frame.
> >
> > Then I guess we wouldn't need my other solution of putting the idt
> > entries in a special section.
> >
> > How does that sound?
> 
> Let me try to understand.
> 
> The normal call sequence is call; push %rbp; mov %rsp, %rbp.  So rbp
> points to (prev rbp, prev rip) on the stack, and you can follow the
> chain back.  Right now, on a user access page fault or similar, we
> have rbp (probably) pointing to the interrupted frame, and the
> interrupted rip isn't saved anywhere that a naive unwinder can find
> it.  (It's in pt_regs, but the rbp chain skips right over that.)
> 
> We could change the entry code so that an interrupt / idtentry does:
> 
> push pt_regs
> push kernel_entry
> push %rbp
> mov %rsp, %rbp
> call handler
> pop %rbp
> addq $8, %rsp
> 
> or similar.  That would make it appear that the actual C handler was
> caused by a dummy function "kernel_entry".  Now the unwinder would get
> to kernel_entry, but it *still* wouldn't find its way to the calling
> frame, which only solves part of the problem.  We could at least teach
> the unwinder how kernel_entry works and let it decode pt_regs to
> continue unwinding.  This would be nice, and I think it could work.
> 
> I think I like this, except that, if it used a separate section, it
> could potentially be faster, as, for each actual entry type, the
> offset from the C handler frame to pt_regs is a foregone conclusion.
> But this is pretty simple and performance is already abysmal in most
> handlers.
> 
> There's an added benefit to using a separate section, though: we could
> also annotate the calls with what type of entry they were so the
> unwinder could print it out nicely.
> 
> I could be convinced either way.

Ok, I took a stab at this.  See the patch below.

In addition to annotating interrupt/exception pt_regs frames, I also
annotated all the syscall pt_regs, for consistency.

As you mentioned, it will affect performance a bit, but I think it will
be insignificant.

I think I like this approach better than putting the
interrupt/idtentry's in a special section, because this is much more
precise.  Especially now that I'm annotating pt_regs syscalls.

Also I think with a few minor changes we could implement your idea of
annotating the calls with what type of entry they are.  But I don't
think that's really needed, because the name of the interrupt/idtentry
is already on the stack trace.

Before:

  [] dump_stack+0x85/0xc2
  [] __do_page_fault+0x576/0x5a0
  [] trace_do_page_fault+0x5c/0x2e0
  [] do_async_page_fault+0x2c/0xa0
  [] async_page_fault+0x28/0x30
  [] ? copy_page_to_iter+0x70/0x440
  [] ? pagecache_get_page+0x2c/0x290
  [] 

RE: [RFC PATCH v2 05/18] sched: add task flag for preempt IRQ tracking

2016-05-04 Thread David Laight
From: Andy Lutomirski
> Sent: 02 May 2016 19:13
...
> I hope your plans include rewriting the current stack unwinder
> completely.  The thing in print_context_stack is (a)
> hard-to-understand and hard-to-modify crap and (b) is called in a loop
> from another file using totally ridiculous conventions.

I've seen a 'stack unwinder' that parsed the instruction stream
forwards looking for 'return' instructions.
I fixed it to add a few extra instructions needs to sort out the
exit path from hardware interrupts.

It only had to understand instructions that modified %sp and %bp
and remember which branch instructions and branch targets it had
used in order to find the correct exit path from a function.

Worked reasonably well without any debug info or guaranteed frame
pointers.
It did have to fall back on scanning the stack if it was inside
an infinite loop. Even on x86 it is reasonably possible to check
for 'call' instructions in this case.

David

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [RFC PATCH v2 05/18] sched: add task flag for preempt IRQ tracking

2016-05-02 Thread Andy Lutomirski
On Mon, May 2, 2016 at 1:00 PM, Jiri Kosina  wrote:
> On Mon, 2 May 2016, Jiri Kosina wrote:
>
>> > FWIW, I just tried this:
>> >
>> > static bool is_entry_text(unsigned long addr)
>> > {
>> > return addr >= (unsigned long)__entry_text_start &&
>> > addr < (unsigned long)__entry_text_end;
>> > }
>> >
>> > it works.  So the entry code is already annotated reasonably well :)
>> >
>> > I just hacked it up here:
>> >
>> > https://git.kernel.org/cgit/linux/kernel/git/luto/linux.git/commit/?h=stack=085eacfe0edfc18768e48340084415dba9a6bd21
>> >
>> > and it seems to work, at least for page faults.  A better
>> > implementation would print out the entire contents of pt_regs so that
>> > people reading the stack trace will know the registers at the time of
>> > the exception, which might be helpful.
>>
>> Sorry for being dense, but how do you distinguish here between a "real"
>> kernel entry, that pushes pt_regs, and any "non-entry" function call that
>> passes pt_regs around?
>
> Umm, actually, the more tricky part is the other way around -- how do you
> make sure that whenever you are calling out from a code between
> __entry_text_start and __entry_text_end, pt_regs will be at the place
> you're looking for it? How's that guaranteed?

It's not guaranteed in my code.  I think we'd want to add a little
table of call sites and their pt_regs offsets.  This was just meant to
test that the general idea works (and it does indeed generate better
traces than the stock kernel, which gets it unconditionally wrong).

--Andy

>
> Thanks,
>
> --
> Jiri Kosina
> SUSE Labs
>



-- 
Andy Lutomirski
AMA Capital Management, LLC
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [RFC PATCH v2 05/18] sched: add task flag for preempt IRQ tracking

2016-05-02 Thread Jiri Kosina
On Mon, 2 May 2016, Jiri Kosina wrote:

> > FWIW, I just tried this:
> > 
> > static bool is_entry_text(unsigned long addr)
> > {
> > return addr >= (unsigned long)__entry_text_start &&
> > addr < (unsigned long)__entry_text_end;
> > }
> > 
> > it works.  So the entry code is already annotated reasonably well :)
> > 
> > I just hacked it up here:
> > 
> > https://git.kernel.org/cgit/linux/kernel/git/luto/linux.git/commit/?h=stack=085eacfe0edfc18768e48340084415dba9a6bd21
> > 
> > and it seems to work, at least for page faults.  A better
> > implementation would print out the entire contents of pt_regs so that
> > people reading the stack trace will know the registers at the time of
> > the exception, which might be helpful.
> 
> Sorry for being dense, but how do you distinguish here between a "real" 
> kernel entry, that pushes pt_regs, and any "non-entry" function call that 
> passes pt_regs around?

Umm, actually, the more tricky part is the other way around -- how do you 
make sure that whenever you are calling out from a code between 
__entry_text_start and __entry_text_end, pt_regs will be at the place 
you're looking for it? How's that guaranteed?

Thanks,

-- 
Jiri Kosina
SUSE Labs

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [RFC PATCH v2 05/18] sched: add task flag for preempt IRQ tracking

2016-05-02 Thread Jiri Kosina
On Mon, 2 May 2016, Andy Lutomirski wrote:

> FWIW, I just tried this:
> 
> static bool is_entry_text(unsigned long addr)
> {
> return addr >= (unsigned long)__entry_text_start &&
> addr < (unsigned long)__entry_text_end;
> }
> 
> it works.  So the entry code is already annotated reasonably well :)
> 
> I just hacked it up here:
> 
> https://git.kernel.org/cgit/linux/kernel/git/luto/linux.git/commit/?h=stack=085eacfe0edfc18768e48340084415dba9a6bd21
> 
> and it seems to work, at least for page faults.  A better
> implementation would print out the entire contents of pt_regs so that
> people reading the stack trace will know the registers at the time of
> the exception, which might be helpful.

Sorry for being dense, but how do you distinguish here between a "real" 
kernel entry, that pushes pt_regs, and any "non-entry" function call that 
passes pt_regs around?

-- 
Jiri Kosina
SUSE Labs

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [RFC PATCH v2 05/18] sched: add task flag for preempt IRQ tracking

2016-05-02 Thread Josh Poimboeuf
On Mon, May 02, 2016 at 11:12:39AM -0700, Andy Lutomirski wrote:
> On Mon, May 2, 2016 at 10:31 AM, Josh Poimboeuf  wrote:
> > On Mon, May 02, 2016 at 08:52:41AM -0700, Andy Lutomirski wrote:
> >> On Mon, May 2, 2016 at 6:52 AM, Josh Poimboeuf  wrote:
> >> > On Fri, Apr 29, 2016 at 05:08:50PM -0700, Andy Lutomirski wrote:
> >> >> On Apr 29, 2016 3:41 PM, "Josh Poimboeuf"  wrote:
> >> >> >
> >> >> > On Fri, Apr 29, 2016 at 02:37:41PM -0700, Andy Lutomirski wrote:
> >> >> > > On Fri, Apr 29, 2016 at 2:25 PM, Josh Poimboeuf 
> >> >> > >  wrote:
> >> >> > > >> I suppose we could try to rejigger the code so that rbp points to
> >> >> > > >> pt_regs or similar.
> >> >> > > >
> >> >> > > > I think we should avoid doing something like that because it 
> >> >> > > > would break
> >> >> > > > gdb and all the other unwinders who don't know about it.
> >> >> > >
> >> >> > > How so?
> >> >> > >
> >> >> > > Currently, rbp in the entry code is meaningless.  I'm suggesting 
> >> >> > > that,
> >> >> > > when we do, for example, 'call \do_sym' in idtentry, we point rbp to
> >> >> > > the pt_regs.  Currently it points to something stale (which the
> >> >> > > dump_stack code might be relying on.  Hmm.)  But it's probably also
> >> >> > > safe to assume that if you unwind to the 'call \do_sym', then 
> >> >> > > pt_regs
> >> >> > > is the next thing on the stack, so just doing the section thing 
> >> >> > > would
> >> >> > > work.
> >> >> >
> >> >> > Yes, rbp is meaningless on the entry from user space.  But if an
> >> >> > in-kernel interrupt occurs (e.g. page fault, preemption) and you have
> >> >> > nested entry, rbp keeps its old value, right?  So the unwinder can 
> >> >> > walk
> >> >> > past the nested entry frame and keep going until it gets to the 
> >> >> > original
> >> >> > entry.
> >> >>
> >> >> Yes.
> >> >>
> >> >> It would be nice if we could do better, though, and actually notice
> >> >> the pt_regs and identify the entry.  For example, I'd love to see
> >> >> "page fault, RIP=xyz" printed in the middle of a stack dump on a
> >> >> crash.
> >> >>
> >> >> Also, I think that just following rbp links will lose the
> >> >> actual function that took the page fault (or whatever function
> >> >> pt_regs->ip actually points to).
> >> >
> >> > Hm.  I think we could fix all that in a more standard way.  Whenever a
> >> > new pt_regs frame gets saved on entry, we could also create a new stack
> >> > frame which points to a fake kernel_entry() function.  That would tell
> >> > the unwinder there's a pt_regs frame without otherwise breaking frame
> >> > pointers across the frame.
> >> >
> >> > Then I guess we wouldn't need my other solution of putting the idt
> >> > entries in a special section.
> >> >
> >> > How does that sound?
> >>
> >> Let me try to understand.
> >>
> >> The normal call sequence is call; push %rbp; mov %rsp, %rbp.  So rbp
> >> points to (prev rbp, prev rip) on the stack, and you can follow the
> >> chain back.  Right now, on a user access page fault or similar, we
> >> have rbp (probably) pointing to the interrupted frame, and the
> >> interrupted rip isn't saved anywhere that a naive unwinder can find
> >> it.  (It's in pt_regs, but the rbp chain skips right over that.)
> >>
> >> We could change the entry code so that an interrupt / idtentry does:
> >>
> >> push pt_regs
> >> push kernel_entry
> >> push %rbp
> >> mov %rsp, %rbp
> >> call handler
> >> pop %rbp
> >> addq $8, %rsp
> >>
> >> or similar.  That would make it appear that the actual C handler was
> >> caused by a dummy function "kernel_entry".  Now the unwinder would get
> >> to kernel_entry, but it *still* wouldn't find its way to the calling
> >> frame, which only solves part of the problem.  We could at least teach
> >> the unwinder how kernel_entry works and let it decode pt_regs to
> >> continue unwinding.  This would be nice, and I think it could work.
> >
> > Yeah, that's about what I had in mind.
> 
> FWIW, I just tried this:
> 
> static bool is_entry_text(unsigned long addr)
> {
> return addr >= (unsigned long)__entry_text_start &&
> addr < (unsigned long)__entry_text_end;
> }
> 
> it works.  So the entry code is already annotated reasonably well :)
> 
> I just hacked it up here:
> 
> https://git.kernel.org/cgit/linux/kernel/git/luto/linux.git/commit/?h=stack=085eacfe0edfc18768e48340084415dba9a6bd21
> 
> and it seems to work, at least for page faults.  A better
> implementation would print out the entire contents of pt_regs so that
> people reading the stack trace will know the registers at the time of
> the exception, which might be helpful.

I still think we would need more specific annotations to do that
reliably: a call from entry code doesn't necessarily correlate with a
pt_regs frame.

> >> I think I like this, except that, if it used a separate section, it
> >> could potentially be faster, as, for each actual entry type, the
> >> 

Re: [RFC PATCH v2 05/18] sched: add task flag for preempt IRQ tracking

2016-05-02 Thread Ingo Molnar

* Andy Lutomirski  wrote:

> > Another idea to detect missing frames: for each return address on the 
> > stack, 
> > ensure there's a corresponding "call " instruction immediately 
> > preceding 
> > the return location, where  matches what's on the stack.
> 
> Hmm, interesting.
> 
> I hope your plans include rewriting the current stack unwinder completely.  
> The 
> thing in print_context_stack is (a) hard-to-understand and hard-to-modify 
> crap 
> and (b) is called in a loop from another file using totally ridiculous 
> conventions.

So we had several attempts at making it better, any further improvements 
(including radical rewrites) are more than welcome!

The generalization between the various stack walking methods certainly didn't 
make 
things easier to read - we might want to eliminate that by using better 
primitives 
to iterate over the stack frame.

Thanks,

Ingo
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [RFC PATCH v2 05/18] sched: add task flag for preempt IRQ tracking

2016-05-02 Thread Andy Lutomirski
On Mon, May 2, 2016 at 10:31 AM, Josh Poimboeuf  wrote:
> On Mon, May 02, 2016 at 08:52:41AM -0700, Andy Lutomirski wrote:
>> On Mon, May 2, 2016 at 6:52 AM, Josh Poimboeuf  wrote:
>> > On Fri, Apr 29, 2016 at 05:08:50PM -0700, Andy Lutomirski wrote:
>> >> On Apr 29, 2016 3:41 PM, "Josh Poimboeuf"  wrote:
>> >> >
>> >> > On Fri, Apr 29, 2016 at 02:37:41PM -0700, Andy Lutomirski wrote:
>> >> > > On Fri, Apr 29, 2016 at 2:25 PM, Josh Poimboeuf  
>> >> > > wrote:
>> >> > > >> I suppose we could try to rejigger the code so that rbp points to
>> >> > > >> pt_regs or similar.
>> >> > > >
>> >> > > > I think we should avoid doing something like that because it would 
>> >> > > > break
>> >> > > > gdb and all the other unwinders who don't know about it.
>> >> > >
>> >> > > How so?
>> >> > >
>> >> > > Currently, rbp in the entry code is meaningless.  I'm suggesting that,
>> >> > > when we do, for example, 'call \do_sym' in idtentry, we point rbp to
>> >> > > the pt_regs.  Currently it points to something stale (which the
>> >> > > dump_stack code might be relying on.  Hmm.)  But it's probably also
>> >> > > safe to assume that if you unwind to the 'call \do_sym', then pt_regs
>> >> > > is the next thing on the stack, so just doing the section thing would
>> >> > > work.
>> >> >
>> >> > Yes, rbp is meaningless on the entry from user space.  But if an
>> >> > in-kernel interrupt occurs (e.g. page fault, preemption) and you have
>> >> > nested entry, rbp keeps its old value, right?  So the unwinder can walk
>> >> > past the nested entry frame and keep going until it gets to the original
>> >> > entry.
>> >>
>> >> Yes.
>> >>
>> >> It would be nice if we could do better, though, and actually notice
>> >> the pt_regs and identify the entry.  For example, I'd love to see
>> >> "page fault, RIP=xyz" printed in the middle of a stack dump on a
>> >> crash.
>> >>
>> >> Also, I think that just following rbp links will lose the
>> >> actual function that took the page fault (or whatever function
>> >> pt_regs->ip actually points to).
>> >
>> > Hm.  I think we could fix all that in a more standard way.  Whenever a
>> > new pt_regs frame gets saved on entry, we could also create a new stack
>> > frame which points to a fake kernel_entry() function.  That would tell
>> > the unwinder there's a pt_regs frame without otherwise breaking frame
>> > pointers across the frame.
>> >
>> > Then I guess we wouldn't need my other solution of putting the idt
>> > entries in a special section.
>> >
>> > How does that sound?
>>
>> Let me try to understand.
>>
>> The normal call sequence is call; push %rbp; mov %rsp, %rbp.  So rbp
>> points to (prev rbp, prev rip) on the stack, and you can follow the
>> chain back.  Right now, on a user access page fault or similar, we
>> have rbp (probably) pointing to the interrupted frame, and the
>> interrupted rip isn't saved anywhere that a naive unwinder can find
>> it.  (It's in pt_regs, but the rbp chain skips right over that.)
>>
>> We could change the entry code so that an interrupt / idtentry does:
>>
>> push pt_regs
>> push kernel_entry
>> push %rbp
>> mov %rsp, %rbp
>> call handler
>> pop %rbp
>> addq $8, %rsp
>>
>> or similar.  That would make it appear that the actual C handler was
>> caused by a dummy function "kernel_entry".  Now the unwinder would get
>> to kernel_entry, but it *still* wouldn't find its way to the calling
>> frame, which only solves part of the problem.  We could at least teach
>> the unwinder how kernel_entry works and let it decode pt_regs to
>> continue unwinding.  This would be nice, and I think it could work.
>
> Yeah, that's about what I had in mind.

FWIW, I just tried this:

static bool is_entry_text(unsigned long addr)
{
return addr >= (unsigned long)__entry_text_start &&
addr < (unsigned long)__entry_text_end;
}

it works.  So the entry code is already annotated reasonably well :)

I just hacked it up here:

https://git.kernel.org/cgit/linux/kernel/git/luto/linux.git/commit/?h=stack=085eacfe0edfc18768e48340084415dba9a6bd21

and it seems to work, at least for page faults.  A better
implementation would print out the entire contents of pt_regs so that
people reading the stack trace will know the registers at the time of
the exception, which might be helpful.

>
>> I think I like this, except that, if it used a separate section, it
>> could potentially be faster, as, for each actual entry type, the
>> offset from the C handler frame to pt_regs is a foregone conclusion.
>
> Hm, this I don't really follow.  It's true that the unwinder can easily
> find RIP from pt_regs, which will always be a known offset from the
> kernel_entry pointer on the stack.  But why would having the entry code
> in a separate section make that faster?

It doesn't make the unwinder faster -- it makes the entry code faster.

>
>> But this is pretty simple and performance is already abysmal in most
>> 

Re: [RFC PATCH v2 05/18] sched: add task flag for preempt IRQ tracking

2016-05-02 Thread Josh Poimboeuf
On Mon, May 02, 2016 at 08:52:41AM -0700, Andy Lutomirski wrote:
> On Mon, May 2, 2016 at 6:52 AM, Josh Poimboeuf  wrote:
> > On Fri, Apr 29, 2016 at 05:08:50PM -0700, Andy Lutomirski wrote:
> >> On Apr 29, 2016 3:41 PM, "Josh Poimboeuf"  wrote:
> >> >
> >> > On Fri, Apr 29, 2016 at 02:37:41PM -0700, Andy Lutomirski wrote:
> >> > > On Fri, Apr 29, 2016 at 2:25 PM, Josh Poimboeuf  
> >> > > wrote:
> >> > > >> I suppose we could try to rejigger the code so that rbp points to
> >> > > >> pt_regs or similar.
> >> > > >
> >> > > > I think we should avoid doing something like that because it would 
> >> > > > break
> >> > > > gdb and all the other unwinders who don't know about it.
> >> > >
> >> > > How so?
> >> > >
> >> > > Currently, rbp in the entry code is meaningless.  I'm suggesting that,
> >> > > when we do, for example, 'call \do_sym' in idtentry, we point rbp to
> >> > > the pt_regs.  Currently it points to something stale (which the
> >> > > dump_stack code might be relying on.  Hmm.)  But it's probably also
> >> > > safe to assume that if you unwind to the 'call \do_sym', then pt_regs
> >> > > is the next thing on the stack, so just doing the section thing would
> >> > > work.
> >> >
> >> > Yes, rbp is meaningless on the entry from user space.  But if an
> >> > in-kernel interrupt occurs (e.g. page fault, preemption) and you have
> >> > nested entry, rbp keeps its old value, right?  So the unwinder can walk
> >> > past the nested entry frame and keep going until it gets to the original
> >> > entry.
> >>
> >> Yes.
> >>
> >> It would be nice if we could do better, though, and actually notice
> >> the pt_regs and identify the entry.  For example, I'd love to see
> >> "page fault, RIP=xyz" printed in the middle of a stack dump on a
> >> crash.
> >>
> >> Also, I think that just following rbp links will lose the
> >> actual function that took the page fault (or whatever function
> >> pt_regs->ip actually points to).
> >
> > Hm.  I think we could fix all that in a more standard way.  Whenever a
> > new pt_regs frame gets saved on entry, we could also create a new stack
> > frame which points to a fake kernel_entry() function.  That would tell
> > the unwinder there's a pt_regs frame without otherwise breaking frame
> > pointers across the frame.
> >
> > Then I guess we wouldn't need my other solution of putting the idt
> > entries in a special section.
> >
> > How does that sound?
> 
> Let me try to understand.
> 
> The normal call sequence is call; push %rbp; mov %rsp, %rbp.  So rbp
> points to (prev rbp, prev rip) on the stack, and you can follow the
> chain back.  Right now, on a user access page fault or similar, we
> have rbp (probably) pointing to the interrupted frame, and the
> interrupted rip isn't saved anywhere that a naive unwinder can find
> it.  (It's in pt_regs, but the rbp chain skips right over that.)
> 
> We could change the entry code so that an interrupt / idtentry does:
> 
> push pt_regs
> push kernel_entry
> push %rbp
> mov %rsp, %rbp
> call handler
> pop %rbp
> addq $8, %rsp
> 
> or similar.  That would make it appear that the actual C handler was
> caused by a dummy function "kernel_entry".  Now the unwinder would get
> to kernel_entry, but it *still* wouldn't find its way to the calling
> frame, which only solves part of the problem.  We could at least teach
> the unwinder how kernel_entry works and let it decode pt_regs to
> continue unwinding.  This would be nice, and I think it could work.

Yeah, that's about what I had in mind.

> I think I like this, except that, if it used a separate section, it
> could potentially be faster, as, for each actual entry type, the
> offset from the C handler frame to pt_regs is a foregone conclusion.

Hm, this I don't really follow.  It's true that the unwinder can easily
find RIP from pt_regs, which will always be a known offset from the
kernel_entry pointer on the stack.  But why would having the entry code
in a separate section make that faster?

> But this is pretty simple and performance is already abysmal in most
> handlers.
> 
> There's an added benefit to using a separate section, though: we could
> also annotate the calls with what type of entry they were so the
> unwinder could print it out nicely.

Yeah, that could be a nice feature... but doesn't printing the name of
the C handler pretty much already give that information?

In any case, once we have a working DWARF unwinder, I think it will show
the name of the idt entry anyway.

> >> Have you looked at my vdso unwinding test at all?  If we could do
> >> something similar for the kernel, IMO it would make testing much more
> >> pleasant.
> >
> > I found it, but I'm not sure what it would mean to do something similar
> > for the kernel.  Do you mean doing something like an NMI sampling-based
> > approach where we periodically do a random stack sanity check?
> 
> I was imagining something a little more strict: single-step
> 

Re: [RFC PATCH v2 05/18] sched: add task flag for preempt IRQ tracking

2016-05-02 Thread Andy Lutomirski
On Mon, May 2, 2016 at 6:52 AM, Josh Poimboeuf  wrote:
> On Fri, Apr 29, 2016 at 05:08:50PM -0700, Andy Lutomirski wrote:
>> On Apr 29, 2016 3:41 PM, "Josh Poimboeuf"  wrote:
>> >
>> > On Fri, Apr 29, 2016 at 02:37:41PM -0700, Andy Lutomirski wrote:
>> > > On Fri, Apr 29, 2016 at 2:25 PM, Josh Poimboeuf  
>> > > wrote:
>> > > >> I suppose we could try to rejigger the code so that rbp points to
>> > > >> pt_regs or similar.
>> > > >
>> > > > I think we should avoid doing something like that because it would 
>> > > > break
>> > > > gdb and all the other unwinders who don't know about it.
>> > >
>> > > How so?
>> > >
>> > > Currently, rbp in the entry code is meaningless.  I'm suggesting that,
>> > > when we do, for example, 'call \do_sym' in idtentry, we point rbp to
>> > > the pt_regs.  Currently it points to something stale (which the
>> > > dump_stack code might be relying on.  Hmm.)  But it's probably also
>> > > safe to assume that if you unwind to the 'call \do_sym', then pt_regs
>> > > is the next thing on the stack, so just doing the section thing would
>> > > work.
>> >
>> > Yes, rbp is meaningless on the entry from user space.  But if an
>> > in-kernel interrupt occurs (e.g. page fault, preemption) and you have
>> > nested entry, rbp keeps its old value, right?  So the unwinder can walk
>> > past the nested entry frame and keep going until it gets to the original
>> > entry.
>>
>> Yes.
>>
>> It would be nice if we could do better, though, and actually notice
>> the pt_regs and identify the entry.  For example, I'd love to see
>> "page fault, RIP=xyz" printed in the middle of a stack dump on a
>> crash.
>>
>> Also, I think that just following rbp links will lose the
>> actual function that took the page fault (or whatever function
>> pt_regs->ip actually points to).
>
> Hm.  I think we could fix all that in a more standard way.  Whenever a
> new pt_regs frame gets saved on entry, we could also create a new stack
> frame which points to a fake kernel_entry() function.  That would tell
> the unwinder there's a pt_regs frame without otherwise breaking frame
> pointers across the frame.
>
> Then I guess we wouldn't need my other solution of putting the idt
> entries in a special section.
>
> How does that sound?

Let me try to understand.

The normal call sequence is call; push %rbp; mov %rsp, %rbp.  So rbp
points to (prev rbp, prev rip) on the stack, and you can follow the
chain back.  Right now, on a user access page fault or similar, we
have rbp (probably) pointing to the interrupted frame, and the
interrupted rip isn't saved anywhere that a naive unwinder can find
it.  (It's in pt_regs, but the rbp chain skips right over that.)

We could change the entry code so that an interrupt / idtentry does:

push pt_regs
push kernel_entry
push %rbp
mov %rsp, %rbp
call handler
pop %rbp
addq $8, %rsp

or similar.  That would make it appear that the actual C handler was
caused by a dummy function "kernel_entry".  Now the unwinder would get
to kernel_entry, but it *still* wouldn't find its way to the calling
frame, which only solves part of the problem.  We could at least teach
the unwinder how kernel_entry works and let it decode pt_regs to
continue unwinding.  This would be nice, and I think it could work.

I think I like this, except that, if it used a separate section, it
could potentially be faster, as, for each actual entry type, the
offset from the C handler frame to pt_regs is a foregone conclusion.
But this is pretty simple and performance is already abysmal in most
handlers.

There's an added benefit to using a separate section, though: we could
also annotate the calls with what type of entry they were so the
unwinder could print it out nicely.

I could be convinced either way.


>
>> Have you looked at my vdso unwinding test at all?  If we could do
>> something similar for the kernel, IMO it would make testing much more
>> pleasant.
>
> I found it, but I'm not sure what it would mean to do something similar
> for the kernel.  Do you mean doing something like an NMI sampling-based
> approach where we periodically do a random stack sanity check?

I was imagining something a little more strict: single-step
interesting parts of the kernel and make sure that each step unwinds
correctly.  That could detect missing frames and similar.
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [RFC PATCH v2 05/18] sched: add task flag for preempt IRQ tracking

2016-05-02 Thread Josh Poimboeuf
On Fri, Apr 29, 2016 at 05:08:50PM -0700, Andy Lutomirski wrote:
> On Apr 29, 2016 3:41 PM, "Josh Poimboeuf"  wrote:
> >
> > On Fri, Apr 29, 2016 at 02:37:41PM -0700, Andy Lutomirski wrote:
> > > On Fri, Apr 29, 2016 at 2:25 PM, Josh Poimboeuf  
> > > wrote:
> > > >> I suppose we could try to rejigger the code so that rbp points to
> > > >> pt_regs or similar.
> > > >
> > > > I think we should avoid doing something like that because it would break
> > > > gdb and all the other unwinders who don't know about it.
> > >
> > > How so?
> > >
> > > Currently, rbp in the entry code is meaningless.  I'm suggesting that,
> > > when we do, for example, 'call \do_sym' in idtentry, we point rbp to
> > > the pt_regs.  Currently it points to something stale (which the
> > > dump_stack code might be relying on.  Hmm.)  But it's probably also
> > > safe to assume that if you unwind to the 'call \do_sym', then pt_regs
> > > is the next thing on the stack, so just doing the section thing would
> > > work.
> >
> > Yes, rbp is meaningless on the entry from user space.  But if an
> > in-kernel interrupt occurs (e.g. page fault, preemption) and you have
> > nested entry, rbp keeps its old value, right?  So the unwinder can walk
> > past the nested entry frame and keep going until it gets to the original
> > entry.
> 
> Yes.
> 
> It would be nice if we could do better, though, and actually notice
> the pt_regs and identify the entry.  For example, I'd love to see
> "page fault, RIP=xyz" printed in the middle of a stack dump on a
> crash.
>
> Also, I think that just following rbp links will lose the
> actual function that took the page fault (or whatever function
> pt_regs->ip actually points to).

Hm.  I think we could fix all that in a more standard way.  Whenever a
new pt_regs frame gets saved on entry, we could also create a new stack
frame which points to a fake kernel_entry() function.  That would tell
the unwinder there's a pt_regs frame without otherwise breaking frame
pointers across the frame.

Then I guess we wouldn't need my other solution of putting the idt
entries in a special section.

How does that sound?

> Have you looked at my vdso unwinding test at all?  If we could do
> something similar for the kernel, IMO it would make testing much more
> pleasant.

I found it, but I'm not sure what it would mean to do something similar
for the kernel.  Do you mean doing something like an NMI sampling-based
approach where we periodically do a random stack sanity check?

(If so, I do have something like that planned.)

-- 
Josh
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [RFC PATCH v2 05/18] sched: add task flag for preempt IRQ tracking

2016-04-29 Thread Andy Lutomirski
On Apr 29, 2016 3:11 PM, "Jiri Kosina"  wrote:
>
> On Fri, 29 Apr 2016, Andy Lutomirski wrote:
>
> > > NMI, MCE and interrupts aren't a problem because they have dedicated
> > > stacks, which are easy to detect.  If the tasks' stack is on an
> > > exception stack or an irq stack, we consider it unreliable.
> >
> > Only on x86_64.
>
> Well, MCEs are more or less x86-specific as well. But otherwise good
> point, thanks Andy.
>
> So, how does stack layout generally look like in case when NMI is actually
> running on proper kernel stack? I thought it's guaranteed to contain
> pt_regs anyway in all cases. Is that not guaranteed to be the case?
>

On x86, at least, there will still be pt_regs for the NMI.  For the
interrupted state, though, there might not be pt_regs, as the NMI
might have happened while still populating pt_regs.  In fact, the NMI
stack could overlap task_pt_regs.

For x86_32, there's no guarantee that pt_regs contains sp due to
hardware silliness.  You need to parse it more carefully, as,
!user_mode(regs), then the old sp is just above pt_regs.

--Andy
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [RFC PATCH v2 05/18] sched: add task flag for preempt IRQ tracking

2016-04-29 Thread Andy Lutomirski
On Apr 29, 2016 3:41 PM, "Josh Poimboeuf"  wrote:
>
> On Fri, Apr 29, 2016 at 02:37:41PM -0700, Andy Lutomirski wrote:
> > On Fri, Apr 29, 2016 at 2:25 PM, Josh Poimboeuf  wrote:
> > > I think the easiest way to make it work would be to modify the idtentry
> > > macro to put all the idt entries in a dedicated section.  Then the
> > > unwinder could easily detect any calls from that code.
> >
> > That would work.  Would it make sense to do the same for the irq entries?
>
> Yes, I think so.
>
> > >> I suppose we could try to rejigger the code so that rbp points to
> > >> pt_regs or similar.
> > >
> > > I think we should avoid doing something like that because it would break
> > > gdb and all the other unwinders who don't know about it.
> >
> > How so?
> >
> > Currently, rbp in the entry code is meaningless.  I'm suggesting that,
> > when we do, for example, 'call \do_sym' in idtentry, we point rbp to
> > the pt_regs.  Currently it points to something stale (which the
> > dump_stack code might be relying on.  Hmm.)  But it's probably also
> > safe to assume that if you unwind to the 'call \do_sym', then pt_regs
> > is the next thing on the stack, so just doing the section thing would
> > work.
>
> Yes, rbp is meaningless on the entry from user space.  But if an
> in-kernel interrupt occurs (e.g. page fault, preemption) and you have
> nested entry, rbp keeps its old value, right?  So the unwinder can walk
> past the nested entry frame and keep going until it gets to the original
> entry.

Yes.

It would be nice if we could do better, though, and actually notice
the pt_regs and identify the entry.  For example, I'd love to see
"page fault, RIP=xyz" printed in the middle of a stack dump on a
crash.  Also, I think that just following rbp links will lose the
actual function that took the page fault (or whatever function
pt_regs->ip actually points to).

>
> > We should really re-add DWARF some day.
>
> Working on it :-)

Excellent.

Have you looked at my vdso unwinding test at all?  If we could do
something similar for the kernel, IMO it would make testing much more
pleasant.

--Andy
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [RFC PATCH v2 05/18] sched: add task flag for preempt IRQ tracking

2016-04-29 Thread Josh Poimboeuf
On Sat, Apr 30, 2016 at 12:11:45AM +0200, Jiri Kosina wrote:
> On Fri, 29 Apr 2016, Andy Lutomirski wrote:
> > > NMI, MCE and interrupts aren't a problem because they have dedicated
> > > stacks, which are easy to detect.  If the tasks' stack is on an
> > > exception stack or an irq stack, we consider it unreliable.
> > 
> > Only on x86_64.
> 
> Well, MCEs are more or less x86-specific as well. But otherwise good 
> point, thanks Andy.
> 
> So, how does stack layout generally look like in case when NMI is actually 
> running on proper kernel stack? I thought it's guaranteed to contain 
> pt_regs anyway in all cases. Is that not guaranteed to be the case?

If the NMI were using the normal kernel stack and it interrupted kernel
space, pt_regs would be placed in the "middle" of the stack rather than
the bottom, and there's currently no way to detect that.

However, NMIs don't sleep, and we only consider sleeping tasks for stack
reliability, so it wouldn't be an issue anyway.

-- 
Josh
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [RFC PATCH v2 05/18] sched: add task flag for preempt IRQ tracking

2016-04-29 Thread Josh Poimboeuf
On Fri, Apr 29, 2016 at 02:37:41PM -0700, Andy Lutomirski wrote:
> On Fri, Apr 29, 2016 at 2:25 PM, Josh Poimboeuf  wrote:
> > I think the easiest way to make it work would be to modify the idtentry
> > macro to put all the idt entries in a dedicated section.  Then the
> > unwinder could easily detect any calls from that code.
> 
> That would work.  Would it make sense to do the same for the irq entries?

Yes, I think so.

> >> I suppose we could try to rejigger the code so that rbp points to
> >> pt_regs or similar.
> >
> > I think we should avoid doing something like that because it would break
> > gdb and all the other unwinders who don't know about it.
> 
> How so?
> 
> Currently, rbp in the entry code is meaningless.  I'm suggesting that,
> when we do, for example, 'call \do_sym' in idtentry, we point rbp to
> the pt_regs.  Currently it points to something stale (which the
> dump_stack code might be relying on.  Hmm.)  But it's probably also
> safe to assume that if you unwind to the 'call \do_sym', then pt_regs
> is the next thing on the stack, so just doing the section thing would
> work.

Yes, rbp is meaningless on the entry from user space.  But if an
in-kernel interrupt occurs (e.g. page fault, preemption) and you have
nested entry, rbp keeps its old value, right?  So the unwinder can walk
past the nested entry frame and keep going until it gets to the original
entry.

> We should really re-add DWARF some day.

Working on it :-)

-- 
Josh
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [RFC PATCH v2 05/18] sched: add task flag for preempt IRQ tracking

2016-04-29 Thread Jiri Kosina
On Fri, 29 Apr 2016, Andy Lutomirski wrote:

> > NMI, MCE and interrupts aren't a problem because they have dedicated
> > stacks, which are easy to detect.  If the tasks' stack is on an
> > exception stack or an irq stack, we consider it unreliable.
> 
> Only on x86_64.

Well, MCEs are more or less x86-specific as well. But otherwise good 
point, thanks Andy.

So, how does stack layout generally look like in case when NMI is actually 
running on proper kernel stack? I thought it's guaranteed to contain 
pt_regs anyway in all cases. Is that not guaranteed to be the case?

Thanks,

-- 
Jiri Kosina
SUSE Labs

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [RFC PATCH v2 05/18] sched: add task flag for preempt IRQ tracking

2016-04-29 Thread Andy Lutomirski
On Fri, Apr 29, 2016 at 2:25 PM, Josh Poimboeuf  wrote:
> On Fri, Apr 29, 2016 at 01:32:53PM -0700, Andy Lutomirski wrote:
>> On Fri, Apr 29, 2016 at 1:27 PM, Josh Poimboeuf  wrote:
>> > On Fri, Apr 29, 2016 at 01:19:23PM -0700, Andy Lutomirski wrote:
>> >> On Fri, Apr 29, 2016 at 1:11 PM, Josh Poimboeuf  
>> >> wrote:
>> >> > On Fri, Apr 29, 2016 at 11:06:53AM -0700, Andy Lutomirski wrote:
>> >> >> On Thu, Apr 28, 2016 at 1:44 PM, Josh Poimboeuf  
>> >> >> wrote:
>> >> >> > A preempted function might not have had a chance to save the frame
>> >> >> > pointer to the stack yet, which can result in its caller getting 
>> >> >> > skipped
>> >> >> > on a stack trace.
>> >> >> >
>> >> >> > Add a flag to indicate when the task has been preempted so that stack
>> >> >> > dump code can determine whether the stack trace is reliable.
>> >> >>
>> >> >> I think I like this, but how do you handle the rather similar case in
>> >> >> which a task goes to sleep because it's waiting on IO that happened in
>> >> >> response to get_user, put_user, copy_from_user, etc?
>> >> >
>> >> > Hm, good question.  I was thinking that page faults had a dedicated
>> >> > stack, but now looking at the entry and traps code, that doesn't seem to
>> >> > be the case.
>> >> >
>> >> > Anyway I think it shouldn't be a problem if we make sure that any kernel
>> >> > function which might trigger a valid page fault (e.g.,
>> >> > copy_user_generic_string) do the proper frame pointer setup first.  Then
>> >> > the stack should still be reliable.
>> >> >
>> >> > In fact I might be able to teach objtool to enforce that: any function
>> >> > which uses an exception table should create a stack frame.
>> >> >
>> >> > Or alternatively, maybe set some kind of flag for page faults, similar
>> >> > to what I did with this patch.
>> >> >
>> >>
>> >> How about doing it the other way around: teach the unwinder to detect
>> >> when it hits a non-outermost entry (i.e. it lands in idtentry, etc)
>> >> and use some reasonable heuristic as to whether it's okay to keep
>> >> unwinding.  You should be able to handle preemption like that, too --
>> >> the unwind process will end up in an IRQ frame.
>> >
>> > How exactly would the unwinder detect if a text address is in an
>> > idtentry?  Maybe put all the idt entries in a special ELF section?
>> >
>>
>> Hmm.
>>
>> What actually happens when you unwind all the way into the entry code?
>>  Don't you end up in something that isn't in an ELF function?  Can you
>> detect that?
>
> For entry from user space (e.g., syscalls), it's easy to detect because
> there's always a pt_regs at the bottom of the stack.  So if the unwinder
> reaches the stack address at (thread.sp0 - sizeof(pt_regs)), it knows
> it's done.
>
> But for nested entry (e.g. in-kernel irqs/exceptions like preemption and
> page faults which don't have dedicated stacks), where the pt_regs is
> stored somewhere in the middle of the stack instead of the bottom,
> there's no reliable way to detect that.

>
>> Ideally, the unwinder could actually detect that it's
>> hit a pt_regs struct and report that.  If used for stack dumps, it
>> could display some indication of this and then continue its unwinding
>> by decoding the pt_regs.  If used for patching, it could take some
>> other appropriate action.
>>
>> I would have no objection to annotating all the pt_regs-style entry
>> code, whether by putting it in a separate section or by making a table
>> of addresses.
>
> I think the easiest way to make it work would be to modify the idtentry
> macro to put all the idt entries in a dedicated section.  Then the
> unwinder could easily detect any calls from that code.

That would work.  Would it make sense to do the same for the irq entries?

I'd be glad to review a patch.  It should be straightforward.

>
>> There are a couple of nasty cases if NMI or MCE is involved but, as of
>> 4.6, outside of NMI, MCE, and vmalloc faults (ugh!), there should
>> always be a complete pt_regs on the stack before interrupts get
>> enabled for each entry.  Of course, finding the thing may be
>> nontrivial in case other things were pushed.
>
> NMI, MCE and interrupts aren't a problem because they have dedicated
> stacks, which are easy to detect.  If the tasks' stack is on an
> exception stack or an irq stack, we consider it unreliable.

Only on x86_64.

>
> And also, they don't sleep.  The stack of any running task (other than
> current) is automatically considered unreliable anyway, since they could
> be modifying it while we're reading it.

True.

>
>> I suppose we could try to rejigger the code so that rbp points to
>> pt_regs or similar.
>
> I think we should avoid doing something like that because it would break
> gdb and all the other unwinders who don't know about it.

How so?

Currently, rbp in the entry code is meaningless.  I'm suggesting that,
when we do, for example, 'call \do_sym' in idtentry, 

Re: [RFC PATCH v2 05/18] sched: add task flag for preempt IRQ tracking

2016-04-29 Thread Josh Poimboeuf
On Fri, Apr 29, 2016 at 01:32:53PM -0700, Andy Lutomirski wrote:
> On Fri, Apr 29, 2016 at 1:27 PM, Josh Poimboeuf  wrote:
> > On Fri, Apr 29, 2016 at 01:19:23PM -0700, Andy Lutomirski wrote:
> >> On Fri, Apr 29, 2016 at 1:11 PM, Josh Poimboeuf  
> >> wrote:
> >> > On Fri, Apr 29, 2016 at 11:06:53AM -0700, Andy Lutomirski wrote:
> >> >> On Thu, Apr 28, 2016 at 1:44 PM, Josh Poimboeuf  
> >> >> wrote:
> >> >> > A preempted function might not have had a chance to save the frame
> >> >> > pointer to the stack yet, which can result in its caller getting 
> >> >> > skipped
> >> >> > on a stack trace.
> >> >> >
> >> >> > Add a flag to indicate when the task has been preempted so that stack
> >> >> > dump code can determine whether the stack trace is reliable.
> >> >>
> >> >> I think I like this, but how do you handle the rather similar case in
> >> >> which a task goes to sleep because it's waiting on IO that happened in
> >> >> response to get_user, put_user, copy_from_user, etc?
> >> >
> >> > Hm, good question.  I was thinking that page faults had a dedicated
> >> > stack, but now looking at the entry and traps code, that doesn't seem to
> >> > be the case.
> >> >
> >> > Anyway I think it shouldn't be a problem if we make sure that any kernel
> >> > function which might trigger a valid page fault (e.g.,
> >> > copy_user_generic_string) do the proper frame pointer setup first.  Then
> >> > the stack should still be reliable.
> >> >
> >> > In fact I might be able to teach objtool to enforce that: any function
> >> > which uses an exception table should create a stack frame.
> >> >
> >> > Or alternatively, maybe set some kind of flag for page faults, similar
> >> > to what I did with this patch.
> >> >
> >>
> >> How about doing it the other way around: teach the unwinder to detect
> >> when it hits a non-outermost entry (i.e. it lands in idtentry, etc)
> >> and use some reasonable heuristic as to whether it's okay to keep
> >> unwinding.  You should be able to handle preemption like that, too --
> >> the unwind process will end up in an IRQ frame.
> >
> > How exactly would the unwinder detect if a text address is in an
> > idtentry?  Maybe put all the idt entries in a special ELF section?
> >
> 
> Hmm.
> 
> What actually happens when you unwind all the way into the entry code?
>  Don't you end up in something that isn't in an ELF function?  Can you
> detect that?

For entry from user space (e.g., syscalls), it's easy to detect because
there's always a pt_regs at the bottom of the stack.  So if the unwinder
reaches the stack address at (thread.sp0 - sizeof(pt_regs)), it knows
it's done.

But for nested entry (e.g. in-kernel irqs/exceptions like preemption and
page faults which don't have dedicated stacks), where the pt_regs is
stored somewhere in the middle of the stack instead of the bottom,
there's no reliable way to detect that.

> Ideally, the unwinder could actually detect that it's
> hit a pt_regs struct and report that.  If used for stack dumps, it
> could display some indication of this and then continue its unwinding
> by decoding the pt_regs.  If used for patching, it could take some
> other appropriate action.
>
> I would have no objection to annotating all the pt_regs-style entry
> code, whether by putting it in a separate section or by making a table
> of addresses.

I think the easiest way to make it work would be to modify the idtentry
macro to put all the idt entries in a dedicated section.  Then the
unwinder could easily detect any calls from that code.

> There are a couple of nasty cases if NMI or MCE is involved but, as of
> 4.6, outside of NMI, MCE, and vmalloc faults (ugh!), there should
> always be a complete pt_regs on the stack before interrupts get
> enabled for each entry.  Of course, finding the thing may be
> nontrivial in case other things were pushed.

NMI, MCE and interrupts aren't a problem because they have dedicated
stacks, which are easy to detect.  If the tasks' stack is on an
exception stack or an irq stack, we consider it unreliable.

And also, they don't sleep.  The stack of any running task (other than
current) is automatically considered unreliable anyway, since they could
be modifying it while we're reading it.

> I suppose we could try to rejigger the code so that rbp points to
> pt_regs or similar.

I think we should avoid doing something like that because it would break
gdb and all the other unwinders who don't know about it.

-- 
Josh
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [RFC PATCH v2 05/18] sched: add task flag for preempt IRQ tracking

2016-04-29 Thread Andy Lutomirski
On Fri, Apr 29, 2016 at 1:27 PM, Josh Poimboeuf  wrote:
> On Fri, Apr 29, 2016 at 01:19:23PM -0700, Andy Lutomirski wrote:
>> On Fri, Apr 29, 2016 at 1:11 PM, Josh Poimboeuf  wrote:
>> > On Fri, Apr 29, 2016 at 11:06:53AM -0700, Andy Lutomirski wrote:
>> >> On Thu, Apr 28, 2016 at 1:44 PM, Josh Poimboeuf  
>> >> wrote:
>> >> > A preempted function might not have had a chance to save the frame
>> >> > pointer to the stack yet, which can result in its caller getting skipped
>> >> > on a stack trace.
>> >> >
>> >> > Add a flag to indicate when the task has been preempted so that stack
>> >> > dump code can determine whether the stack trace is reliable.
>> >>
>> >> I think I like this, but how do you handle the rather similar case in
>> >> which a task goes to sleep because it's waiting on IO that happened in
>> >> response to get_user, put_user, copy_from_user, etc?
>> >
>> > Hm, good question.  I was thinking that page faults had a dedicated
>> > stack, but now looking at the entry and traps code, that doesn't seem to
>> > be the case.
>> >
>> > Anyway I think it shouldn't be a problem if we make sure that any kernel
>> > function which might trigger a valid page fault (e.g.,
>> > copy_user_generic_string) do the proper frame pointer setup first.  Then
>> > the stack should still be reliable.
>> >
>> > In fact I might be able to teach objtool to enforce that: any function
>> > which uses an exception table should create a stack frame.
>> >
>> > Or alternatively, maybe set some kind of flag for page faults, similar
>> > to what I did with this patch.
>> >
>>
>> How about doing it the other way around: teach the unwinder to detect
>> when it hits a non-outermost entry (i.e. it lands in idtentry, etc)
>> and use some reasonable heuristic as to whether it's okay to keep
>> unwinding.  You should be able to handle preemption like that, too --
>> the unwind process will end up in an IRQ frame.
>
> How exactly would the unwinder detect if a text address is in an
> idtentry?  Maybe put all the idt entries in a special ELF section?
>

Hmm.

What actually happens when you unwind all the way into the entry code?
 Don't you end up in something that isn't in an ELF function?  Can you
detect that?  Ideally, the unwinder could actually detect that it's
hit a pt_regs struct and report that.  If used for stack dumps, it
could display some indication of this and then continue its unwinding
by decoding the pt_regs.  If used for patching, it could take some
other appropriate action.

I would have no objection to annotating all the pt_regs-style entry
code, whether by putting it in a separate section or by making a table
of addresses.

There are a couple of nasty cases if NMI or MCE is involved but, as of
4.6, outside of NMI, MCE, and vmalloc faults (ugh!), there should
always be a complete pt_regs on the stack before interrupts get
enabled for each entry.  Of course, finding the thing may be
nontrivial in case other things were pushed.  I suppose we could try
to rejigger the code so that rbp points to pt_regs or similar.

--Andy
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [RFC PATCH v2 05/18] sched: add task flag for preempt IRQ tracking

2016-04-29 Thread Josh Poimboeuf
On Fri, Apr 29, 2016 at 01:19:23PM -0700, Andy Lutomirski wrote:
> On Fri, Apr 29, 2016 at 1:11 PM, Josh Poimboeuf  wrote:
> > On Fri, Apr 29, 2016 at 11:06:53AM -0700, Andy Lutomirski wrote:
> >> On Thu, Apr 28, 2016 at 1:44 PM, Josh Poimboeuf  
> >> wrote:
> >> > A preempted function might not have had a chance to save the frame
> >> > pointer to the stack yet, which can result in its caller getting skipped
> >> > on a stack trace.
> >> >
> >> > Add a flag to indicate when the task has been preempted so that stack
> >> > dump code can determine whether the stack trace is reliable.
> >>
> >> I think I like this, but how do you handle the rather similar case in
> >> which a task goes to sleep because it's waiting on IO that happened in
> >> response to get_user, put_user, copy_from_user, etc?
> >
> > Hm, good question.  I was thinking that page faults had a dedicated
> > stack, but now looking at the entry and traps code, that doesn't seem to
> > be the case.
> >
> > Anyway I think it shouldn't be a problem if we make sure that any kernel
> > function which might trigger a valid page fault (e.g.,
> > copy_user_generic_string) do the proper frame pointer setup first.  Then
> > the stack should still be reliable.
> >
> > In fact I might be able to teach objtool to enforce that: any function
> > which uses an exception table should create a stack frame.
> >
> > Or alternatively, maybe set some kind of flag for page faults, similar
> > to what I did with this patch.
> >
> 
> How about doing it the other way around: teach the unwinder to detect
> when it hits a non-outermost entry (i.e. it lands in idtentry, etc)
> and use some reasonable heuristic as to whether it's okay to keep
> unwinding.  You should be able to handle preemption like that, too --
> the unwind process will end up in an IRQ frame.

How exactly would the unwinder detect if a text address is in an
idtentry?  Maybe put all the idt entries in a special ELF section?

-- 
Josh
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [RFC PATCH v2 05/18] sched: add task flag for preempt IRQ tracking

2016-04-29 Thread Andy Lutomirski
On Fri, Apr 29, 2016 at 1:11 PM, Josh Poimboeuf  wrote:
> On Fri, Apr 29, 2016 at 11:06:53AM -0700, Andy Lutomirski wrote:
>> On Thu, Apr 28, 2016 at 1:44 PM, Josh Poimboeuf  wrote:
>> > A preempted function might not have had a chance to save the frame
>> > pointer to the stack yet, which can result in its caller getting skipped
>> > on a stack trace.
>> >
>> > Add a flag to indicate when the task has been preempted so that stack
>> > dump code can determine whether the stack trace is reliable.
>>
>> I think I like this, but how do you handle the rather similar case in
>> which a task goes to sleep because it's waiting on IO that happened in
>> response to get_user, put_user, copy_from_user, etc?
>
> Hm, good question.  I was thinking that page faults had a dedicated
> stack, but now looking at the entry and traps code, that doesn't seem to
> be the case.
>
> Anyway I think it shouldn't be a problem if we make sure that any kernel
> function which might trigger a valid page fault (e.g.,
> copy_user_generic_string) do the proper frame pointer setup first.  Then
> the stack should still be reliable.
>
> In fact I might be able to teach objtool to enforce that: any function
> which uses an exception table should create a stack frame.
>
> Or alternatively, maybe set some kind of flag for page faults, similar
> to what I did with this patch.
>

How about doing it the other way around: teach the unwinder to detect
when it hits a non-outermost entry (i.e. it lands in idtentry, etc)
and use some reasonable heuristic as to whether it's okay to keep
unwinding.  You should be able to handle preemption like that, too --
the unwind process will end up in an IRQ frame.

--Andy
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [RFC PATCH v2 05/18] sched: add task flag for preempt IRQ tracking

2016-04-29 Thread Josh Poimboeuf
On Fri, Apr 29, 2016 at 11:06:53AM -0700, Andy Lutomirski wrote:
> On Thu, Apr 28, 2016 at 1:44 PM, Josh Poimboeuf  wrote:
> > A preempted function might not have had a chance to save the frame
> > pointer to the stack yet, which can result in its caller getting skipped
> > on a stack trace.
> >
> > Add a flag to indicate when the task has been preempted so that stack
> > dump code can determine whether the stack trace is reliable.
> 
> I think I like this, but how do you handle the rather similar case in
> which a task goes to sleep because it's waiting on IO that happened in
> response to get_user, put_user, copy_from_user, etc?

Hm, good question.  I was thinking that page faults had a dedicated
stack, but now looking at the entry and traps code, that doesn't seem to
be the case.

Anyway I think it shouldn't be a problem if we make sure that any kernel
function which might trigger a valid page fault (e.g.,
copy_user_generic_string) do the proper frame pointer setup first.  Then
the stack should still be reliable.

In fact I might be able to teach objtool to enforce that: any function
which uses an exception table should create a stack frame.

Or alternatively, maybe set some kind of flag for page faults, similar
to what I did with this patch.

-- 
Josh
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [RFC PATCH v2 05/18] sched: add task flag for preempt IRQ tracking

2016-04-29 Thread Andy Lutomirski
On Thu, Apr 28, 2016 at 1:44 PM, Josh Poimboeuf  wrote:
> A preempted function might not have had a chance to save the frame
> pointer to the stack yet, which can result in its caller getting skipped
> on a stack trace.
>
> Add a flag to indicate when the task has been preempted so that stack
> dump code can determine whether the stack trace is reliable.

I think I like this, but how do you handle the rather similar case in
which a task goes to sleep because it's waiting on IO that happened in
response to get_user, put_user, copy_from_user, etc?

--Andy
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

[RFC PATCH v2 05/18] sched: add task flag for preempt IRQ tracking

2016-04-28 Thread Josh Poimboeuf
A preempted function might not have had a chance to save the frame
pointer to the stack yet, which can result in its caller getting skipped
on a stack trace.

Add a flag to indicate when the task has been preempted so that stack
dump code can determine whether the stack trace is reliable.

Signed-off-by: Josh Poimboeuf 
---
 include/linux/sched.h | 1 +
 kernel/fork.c | 2 +-
 kernel/sched/core.c   | 4 
 3 files changed, 6 insertions(+), 1 deletion(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 3d31572..fb364a0 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2137,6 +2137,7 @@ extern void thread_group_cputime_adjusted(struct 
task_struct *p, cputime_t *ut,
 #define PF_SWAPWRITE   0x0080  /* Allowed to write to swap */
 #define PF_NO_SETAFFINITY 0x0400   /* Userland is not allowed to meddle 
with cpus_allowed */
 #define PF_MCE_EARLY0x0800  /* Early kill for mce process policy */
+#define PF_PREEMPT_IRQ 0x1000  /* Thread is preempted by an irq */
 #define PF_MUTEX_TESTER0x2000  /* Thread belongs to the rt 
mutex tester */
 #define PF_FREEZER_SKIP0x4000  /* Freezer should not count it 
as freezable */
 #define PF_SUSPEND_TASK 0x8000  /* this thread called freeze_processes 
and should not be frozen */
diff --git a/kernel/fork.c b/kernel/fork.c
index b73a539..d2fe04a 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1373,7 +1373,7 @@ static struct task_struct *copy_process(unsigned long 
clone_flags,
goto bad_fork_cleanup_count;
 
delayacct_tsk_init(p);  /* Must remain after dup_task_struct() */
-   p->flags &= ~(PF_SUPERPRIV | PF_WQ_WORKER);
+   p->flags &= ~(PF_SUPERPRIV | PF_WQ_WORKER | PF_PREEMPT_IRQ);
p->flags |= PF_FORKNOEXEC;
INIT_LIST_HEAD(>children);
INIT_LIST_HEAD(>sibling);
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 9d84d60..7594267 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3422,6 +3422,8 @@ asmlinkage __visible void __sched 
preempt_schedule_irq(void)
 
prev_state = exception_enter();
 
+   current->flags |= PF_PREEMPT_IRQ;
+
do {
preempt_disable();
local_irq_enable();
@@ -3430,6 +3432,8 @@ asmlinkage __visible void __sched 
preempt_schedule_irq(void)
sched_preempt_enable_no_resched();
} while (need_resched());
 
+   current->flags &= ~PF_PREEMPT_IRQ;
+
exception_exit(prev_state);
 }
 
-- 
2.4.11

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev