Re: ppc64le reliable stack unwinder and scheduled tasks

2019-01-17 Thread Segher Boessenkool
On Sun, Jan 13, 2019 at 11:33:56PM +1100, Balbir Singh wrote:
> On Sat, Jan 12, 2019 at 02:45:41AM -0600, Segher Boessenkool wrote:
> > On Sat, Jan 12, 2019 at 12:09:14PM +1100, Balbir Singh wrote:
> > > Could you please define interesting frame on top a bit more? Usually
> > > the topmost return address is in LR
> > 
> > There is no reliable way (other than DWARF unwind info) to find out where
> > the value of LR at function entry currently lives (if anywhere). It may or
> > may not be still available in LR, it may or may not be saved to the return
> > stack slot.  It can also live in some GPR, or in some other stack slot.
> > 
> > (The same is true for all other registers).
> > 
> > The only thing the ABI guarantees you is that you can find all stack frames
> > via the back chain.  If you want more you can use some heuristics and do
> > some heroics (like GDB does), but this is not fully reliable.  Using DWARF
> > unwind info is, but that requires big tables.
> >
> 
> Thanks, so are you suggesting that a reliable stack is not possible on
> ppc64le? Even with the restricted scope of the kernel?

It depends on what you mean with "reliable stack unwinder".  You can unwind
the stack reliably on Power, but you want more, you want to know where some
state local to functions is kept on the stack.


Segher


Re: ppc64le reliable stack unwinder and scheduled tasks

2019-01-14 Thread Joe Lawrence

On 1/14/19 12:09 PM, Josh Poimboeuf wrote:

On Mon, Jan 14, 2019 at 11:46:59AM -0500, Joe Lawrence wrote:

@@ -158,11 +158,21 @@ save_stack_trace_tsk_reliable(struct task_struct *tsk,
return 1; /* invalid backlink, too far up. */
}
  
+		/* We can only trust the bottom frame's backlink, the rest

+* of the frame may be uninitialized, continue to the next. */
+   if (firstframe--)
+   goto next;


Won't this decrement firstframe on every iteration, so when firstframe
is 0, it will decrement it to -1, causing it to 'goto next' on all
future iterations?



Argg, yes, that should be:

if (!firstframe) {
firstframe = 0;
goto next;
}

Apologies for the monday-morning crap-patch.

/runsoff to find some more caffeine

-- Joe


Re: ppc64le reliable stack unwinder and scheduled tasks

2019-01-14 Thread Josh Poimboeuf
On Mon, Jan 14, 2019 at 11:46:59AM -0500, Joe Lawrence wrote:
> @@ -158,11 +158,21 @@ save_stack_trace_tsk_reliable(struct task_struct *tsk,
>   return 1; /* invalid backlink, too far up. */
>   }
>  
> + /* We can only trust the bottom frame's backlink, the rest
> +  * of the frame may be uninitialized, continue to the next. */
> + if (firstframe--)
> + goto next;

Won't this decrement firstframe on every iteration, so when firstframe
is 0, it will decrement it to -1, causing it to 'goto next' on all
future iterations?

-- 
Josh


Re: ppc64le reliable stack unwinder and scheduled tasks

2019-01-14 Thread Joe Lawrence
On Mon, Jan 14, 2019 at 08:21:40AM +0100, Nicolai Stange wrote:
> Joe Lawrence  writes:
> 
> > We should be careful when inspecting the bottom-most stack frame (the
> > first to be unwound), particularly for scheduled-out tasks.  As Nicolai
> > Stange explains, "If I'm reading the code in _switch() correctly, the
> > first frame is completely uninitialized except for the pointer back to
> > the caller's stack frame."  If a previous do_IRQ() invocation, for
> > example, has left a residual exception-marker on the first frame, the
> > stack tracer would incorrectly report this task's trace as unreliable.
> >
> 
> FWIW, it's not been do_IRQ() who wrote the exception marker, but it's
> caller hardware_interrupt_common(), more specifically the
> EXCEPTION_PROLOG_COMMON_3 part therof.
> 

Hi Nicolai,

Yeah, I was sloppy with the description there. :)

> I thought about this a little more and can't see anything that would
> prevent higher, i.e. non-_switch() frames to also alias with prior
> exception frames. That STACK_FRAME_REGS_MARKER is written to a stack
> frame's "parameter area" and most functions probably don't initialize
> this either. So, AFAICS, higher stack frames could potentially be
> affected by the very same problem.

Hmm, I suppose a callee could leave that stack-word untouched and then
make subsquent calls, which would be confusing for the unwinder.

> I think the best solution would be to clear the STACK_FRAME_REGS_MARKER
> upon exception return. I have a patch ready for that and will post it
> after it has passed some basic testing -- hopefully later this day.
> 

I agree that this seems like the simplest way to clean up the exception
stack frame state. 

> That being said, I still think that your patch should also get applied
> in some form -- looking at unitialized memory is just not a good thing
> to do.
> 
> [ ... snip ...]

> I would perhaps not limit this to the STACK_FRAME_REGS_MARKER, but also
> not emit the ip obtained from the first frame into the resulting trace.
> 
> I.e., how about moving all the sp/newsp handling to the beginning of the
> loop and doing an 'if (firstframe) continue;' right after that?

Good point, there is a bunch of ip and trace entries bookkeeping that
shouldn't apply in this case.

I gave the following some very light testing (5.0.0-rc2 + Petr's atomic
patches as to include and run the selftests) ... if you want to take a
bigger hammer to refactor some of the sp/newsp code (perhaps it could be
incorporated into the for() loop itself), feel free to go for it.  You
could add something like this as a 2nd patch to the previously mentioned
STACK_FRAME_REGS_MARKER cleanup fix.

Thanks,

-- Joe

-->8-- -->8-- -->8-- -->8-- -->8-- -->8-- -->8-- -->8-- -->8-- -->8--

>From b87f9e81cf59a6e7e2309400e1b417562414cd5c Mon Sep 17 00:00:00 2001
From: Joe Lawrence 
Date: Sun, 13 Jan 2019 21:02:01 -0500
Subject: [PATCH] powerpc/livepatch: relax reliable stack tracer checks for
 first-frame

The bottom-most stack frame (the first to be unwound) may be largely
uninitialized, for the "Power Architecture 64-Bit ELF V2 ABI" only
requires its backchain pointer to be set.

The reliable stack tracer should be careful when verifying this frame:
skip checks on STACK_FRAME_LR_SAVE and STACK_FRAME_MARKER offsets that
may contain uninitialized residual data.

Fixes: df78d3f61480 ("powerpc/livepatch: Implement reliable stack tracing for 
the consistency model")
Suggested-by: Nicolai Stange 
Signed-off-by: Joe Lawrence 
---
 arch/powerpc/kernel/stacktrace.c | 33 +---
 1 file changed, 22 insertions(+), 11 deletions(-)

diff --git a/arch/powerpc/kernel/stacktrace.c b/arch/powerpc/kernel/stacktrace.c
index e2c50b55138f..46096687a5a8 100644
--- a/arch/powerpc/kernel/stacktrace.c
+++ b/arch/powerpc/kernel/stacktrace.c
@@ -84,6 +84,12 @@ save_stack_trace_regs(struct pt_regs *regs, struct 
stack_trace *trace)
 EXPORT_SYMBOL_GPL(save_stack_trace_regs);
 
 #ifdef CONFIG_HAVE_RELIABLE_STACKTRACE
+/*
+ * This function returns an error if it detects any unreliable features of the
+ * stack.  Otherwise it guarantees that the stack trace is reliable.
+ *
+ * If the task is not 'current', the caller *must* ensure the task is inactive.
+ */
 int
 save_stack_trace_tsk_reliable(struct task_struct *tsk,
struct stack_trace *trace)
@@ -142,12 +148,6 @@ save_stack_trace_tsk_reliable(struct task_struct *tsk,
if (sp & 0xF)
return 1;
 
-   /* Mark stacktraces with exception frames as unreliable. */
-   if (sp <= stack_end - STACK_INT_FRAME_SIZE &&
-   stack[STACK_FRAME_MARKER] == STACK_FRAME_REGS_MARKER) {
-   return 1;
-   }
-
newsp = stack[0];
/* Stack grows downwards; unwinder may only go up. */
if (newsp <= sp)
@@ -158,11 +158,21 @@ save_stack_trace_tsk_reliable(struct task_struct *tsk,
  

Re: ppc64le reliable stack unwinder and scheduled tasks

2019-01-13 Thread Nicolai Stange
Joe Lawrence  writes:

> On Fri, Jan 11, 2019 at 08:51:54AM +0100, Nicolai Stange wrote:
>> Joe Lawrence  writes:
>> 
>> > On Fri, Jan 11, 2019 at 01:00:38AM +0100, Nicolai Stange wrote:
>
> [ ... snip ... ]
>
>> >> For testing, could you try whether clearing the word at STACK_FRAME_MARKER
>> >> from _switch() helps?
>> >> 
>> >> I.e. something like (completely untested):
>> >
>> > I'll kick off some builds tonight and try to get tests lined up
>> > tomorrow.  Unfortunately these take a bit of time to run setup, schedule
>> > and complete, so perhaps by next week.
>> 
>> Ok, that's probably still a good test for confirmation, but the solution
>> you proposed below is much better.
>
> Good news, 40 tests (RHEL-7 backport) with clearing out the
> STACK_FRAME_MARKER were all successful.  Previously I would see stalled
> livepatch transitions due to the unreliable report in ~10% of test
> cases.
>
>> >> diff --git a/arch/powerpc/kernel/entry_64.S 
>> >> b/arch/powerpc/kernel/entry_64.S
>> >> index 435927f549c4..b747d0647ec4 100644
>> >> --- a/arch/powerpc/kernel/entry_64.S
>> >> +++ b/arch/powerpc/kernel/entry_64.S
>> >> @@ -596,6 +596,10 @@ _GLOBAL(_switch)
>> >>   SAVE_8GPRS(14, r1)
>> >>   SAVE_10GPRS(22, r1)
>> >>   std r0,_NIP(r1) /* Return to switch caller */
>> >> +
>> >> + li  r23,0
>> >> + std r23,96(r1)  /* 96 == STACK_FRAME_MARKER * sizeof(long) */
>> >> +
>> >>   mfcrr23
>> >>   std r23,_CCR(r1)
>> >>   std r1,KSP(r3)  /* Set old stack pointer */
>> >> 
>> >> 
>> >
>> > This may be sufficient to avoid the condition, but if the contents of
>> > frame 0 are truely uninitialized (not to be trusted), should the
>> > unwinder be even looking at that frame (for STACK_FRAMES_REGS_MARKER),
>> > aside from the LR and other stack size geometry sanity checks?
>> 
>> That's a very good point: we'll only ever be examining scheduled out tasks
>> and current (which at that time is running klp_try_complete_transition()).
>> 
>> current won't be in an interrupt/exception when it's walking its own
>> stack. And scheduled out tasks can't have an exception/interrupt frame
>> as their frame 0, correct?
>> 
>> Thus, AFAICS, whenever klp_try_complete_transition() finds a
>> STACK_FRAMES_REGS_MARKER in frame 0, it is known to be garbage, as you
>> said.
>
> I gave this about 20 tests and they were also all successful, what do
> you think?  (The log could probably use some trimming and cleanup.)  I
> think either solution would fix the issue.
>
> -->8-- -->8-- -->8-- -->8-- -->8-- -->8-- -->8-- -->8-- -->8-- -->8--
>
> From edd3fb9e8a16d5a7ccc98759e9397f386f0e82ca Mon Sep 17 00:00:00 2001
> From: Joe Lawrence 
> Date: Fri, 11 Jan 2019 10:25:26 -0500
> Subject: [PATCH] powerpc/livepatch: relax reliable stack tracer exception
>  check
>
> The ppc64le reliable stack tracer iterates over a given task's stack,
> verifying a set of conditions to determine whether the trace is
> 'reliable'.  These checks include the presence of any exception frames.
>
> We should be careful when inspecting the bottom-most stack frame (the
> first to be unwound), particularly for scheduled-out tasks.  As Nicolai
> Stange explains, "If I'm reading the code in _switch() correctly, the
> first frame is completely uninitialized except for the pointer back to
> the caller's stack frame."  If a previous do_IRQ() invocation, for
> example, has left a residual exception-marker on the first frame, the
> stack tracer would incorrectly report this task's trace as unreliable.
>

FWIW, it's not been do_IRQ() who wrote the exception marker, but it's
caller hardware_interrupt_common(), more specifically the
EXCEPTION_PROLOG_COMMON_3 part therof.


> save_stack_trace_tsk_reliable() already skips the first frame when
> verifying the saved Link Register.  It should do the same when looking
> for the STACK_FRAME_REGS_MARKER.  The task it is unwinding will be
> either 'current', in which case the tracer will have never been called
> from an exception path, or the task will be inactive and its first
> frame's contents will be uninitialized (aside from backchain pointer).

I thought about this a little more and can't see anything that would
prevent higher, i.e. non-_switch() frames to also alias with prior
exception frames. That STACK_FRAME_REGS_MARKER is written to a stack
frame's "parameter area" and most functions probably don't initialize
this either. So, AFAICS, higher stack frames could potentially be
affected by the very same problem.

I think the best solution would be to clear the STACK_FRAME_REGS_MARKER
upon exception return. I have a patch ready for that and will post it
after it has passed some basic testing -- hopefully later this day.

That being said, I still think that your patch should also get applied
in some form -- looking at unitialized memory is just not a good thing
to do.


>
> Fixes: df78d3f61480 ("powerpc/livepatch: Implement reliable stack tracing for 
> the consistency model")
> Signed-off-by: Joe 

Re: ppc64le reliable stack unwinder and scheduled tasks

2019-01-13 Thread Joe Lawrence
On Fri, Jan 11, 2019 at 08:51:54AM +0100, Nicolai Stange wrote:
> Joe Lawrence  writes:
> 
> > On Fri, Jan 11, 2019 at 01:00:38AM +0100, Nicolai Stange wrote:

[ ... snip ... ]

> >> For testing, could you try whether clearing the word at STACK_FRAME_MARKER
> >> from _switch() helps?
> >> 
> >> I.e. something like (completely untested):
> >
> > I'll kick off some builds tonight and try to get tests lined up
> > tomorrow.  Unfortunately these take a bit of time to run setup, schedule
> > and complete, so perhaps by next week.
> 
> Ok, that's probably still a good test for confirmation, but the solution
> you proposed below is much better.

Good news, 40 tests (RHEL-7 backport) with clearing out the
STACK_FRAME_MARKER were all successful.  Previously I would see stalled
livepatch transitions due to the unreliable report in ~10% of test
cases.

> >> diff --git a/arch/powerpc/kernel/entry_64.S 
> >> b/arch/powerpc/kernel/entry_64.S
> >> index 435927f549c4..b747d0647ec4 100644
> >> --- a/arch/powerpc/kernel/entry_64.S
> >> +++ b/arch/powerpc/kernel/entry_64.S
> >> @@ -596,6 +596,10 @@ _GLOBAL(_switch)
> >>SAVE_8GPRS(14, r1)
> >>SAVE_10GPRS(22, r1)
> >>std r0,_NIP(r1) /* Return to switch caller */
> >> +
> >> +  li  r23,0
> >> +  std r23,96(r1)  /* 96 == STACK_FRAME_MARKER * sizeof(long) */
> >> +
> >>mfcrr23
> >>std r23,_CCR(r1)
> >>std r1,KSP(r3)  /* Set old stack pointer */
> >> 
> >> 
> >
> > This may be sufficient to avoid the condition, but if the contents of
> > frame 0 are truely uninitialized (not to be trusted), should the
> > unwinder be even looking at that frame (for STACK_FRAMES_REGS_MARKER),
> > aside from the LR and other stack size geometry sanity checks?
> 
> That's a very good point: we'll only ever be examining scheduled out tasks
> and current (which at that time is running klp_try_complete_transition()).
> 
> current won't be in an interrupt/exception when it's walking its own
> stack. And scheduled out tasks can't have an exception/interrupt frame
> as their frame 0, correct?
> 
> Thus, AFAICS, whenever klp_try_complete_transition() finds a
> STACK_FRAMES_REGS_MARKER in frame 0, it is known to be garbage, as you
> said.

I gave this about 20 tests and they were also all successful, what do
you think?  (The log could probably use some trimming and cleanup.)  I
think either solution would fix the issue.

-->8-- -->8-- -->8-- -->8-- -->8-- -->8-- -->8-- -->8-- -->8-- -->8--

>From edd3fb9e8a16d5a7ccc98759e9397f386f0e82ca Mon Sep 17 00:00:00 2001
From: Joe Lawrence 
Date: Fri, 11 Jan 2019 10:25:26 -0500
Subject: [PATCH] powerpc/livepatch: relax reliable stack tracer exception
 check

The ppc64le reliable stack tracer iterates over a given task's stack,
verifying a set of conditions to determine whether the trace is
'reliable'.  These checks include the presence of any exception frames.

We should be careful when inspecting the bottom-most stack frame (the
first to be unwound), particularly for scheduled-out tasks.  As Nicolai
Stange explains, "If I'm reading the code in _switch() correctly, the
first frame is completely uninitialized except for the pointer back to
the caller's stack frame."  If a previous do_IRQ() invocation, for
example, has left a residual exception-marker on the first frame, the
stack tracer would incorrectly report this task's trace as unreliable.

save_stack_trace_tsk_reliable() already skips the first frame when
verifying the saved Link Register.  It should do the same when looking
for the STACK_FRAME_REGS_MARKER.  The task it is unwinding will be
either 'current', in which case the tracer will have never been called
from an exception path, or the task will be inactive and its first
frame's contents will be uninitialized (aside from backchain pointer).

Fixes: df78d3f61480 ("powerpc/livepatch: Implement reliable stack tracing for 
the consistency model")
Signed-off-by: Joe Lawrence 
---
 arch/powerpc/kernel/stacktrace.c | 8 +++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/kernel/stacktrace.c b/arch/powerpc/kernel/stacktrace.c
index e2c50b55138f..0793e75c45a6 100644
--- a/arch/powerpc/kernel/stacktrace.c
+++ b/arch/powerpc/kernel/stacktrace.c
@@ -84,6 +84,12 @@ save_stack_trace_regs(struct pt_regs *regs, struct 
stack_trace *trace)
 EXPORT_SYMBOL_GPL(save_stack_trace_regs);
 
 #ifdef CONFIG_HAVE_RELIABLE_STACKTRACE
+/*
+ * This function returns an error if it detects any unreliable features of the
+ * stack.  Otherwise it guarantees that the stack trace is reliable.
+ *
+ * If the task is not 'current', the caller *must* ensure the task is inactive.
+ */
 int
 save_stack_trace_tsk_reliable(struct task_struct *tsk,
struct stack_trace *trace)
@@ -143,7 +149,7 @@ save_stack_trace_tsk_reliable(struct task_struct *tsk,
return 1;
 
/* Mark stacktraces with exception frames as unreliable. */
-   if (sp 

Re: ppc64le reliable stack unwinder and scheduled tasks

2019-01-13 Thread Torsten Duwe
On Sun, 13 Jan 2019 23:33:56 +1100
Balbir Singh  wrote:

> On Sat, Jan 12, 2019 at 02:45:41AM -0600, Segher Boessenkool wrote:
> > On Sat, Jan 12, 2019 at 12:09:14PM +1100, Balbir Singh wrote:
> > > Could you please define interesting frame on top a bit more?
> > > Usually the topmost return address is in LR
> > 
> > There is no reliable way (other than DWARF unwind info) to find out
> > where the value of LR at function entry currently lives (if
> > anywhere). It may or may not be still available in LR, it may or
> > may not be saved to the return stack slot.  It can also live in
> > some GPR, or in some other stack slot.
> > 
> > (The same is true for all other registers).
> > 
> > The only thing the ABI guarantees you is that you can find all
> > stack frames via the back chain.  If you want more you can use some
> > heuristics and do some heroics (like GDB does), but this is not
> > fully reliable.  Using DWARF unwind info is, but that requires big
> > tables.
> >
> 
> Thanks, so are you suggesting that a reliable stack is not possible on
> ppc64le? Even with the restricted scope of the kernel?

The LR value location is _always_ hard to determine for the topmost
frame. This is not a problem for voluntarily sleeping tasks, because the
topmost function will always be well known. It is a problem for tasks preempted
by an interrupt, or those handling an exception, that's why these need
to report "unreliable".

Note that this is a very general problem, across _all_ RISC-like
architectures. It should thus be handled as generically as possible.

Torsten



Re: ppc64le reliable stack unwinder and scheduled tasks

2019-01-13 Thread Balbir Singh
On Sat, Jan 12, 2019 at 02:45:41AM -0600, Segher Boessenkool wrote:
> On Sat, Jan 12, 2019 at 12:09:14PM +1100, Balbir Singh wrote:
> > Could you please define interesting frame on top a bit more? Usually
> > the topmost return address is in LR
> 
> There is no reliable way (other than DWARF unwind info) to find out where
> the value of LR at function entry currently lives (if anywhere). It may or
> may not be still available in LR, it may or may not be saved to the return
> stack slot.  It can also live in some GPR, or in some other stack slot.
> 
> (The same is true for all other registers).
> 
> The only thing the ABI guarantees you is that you can find all stack frames
> via the back chain.  If you want more you can use some heuristics and do
> some heroics (like GDB does), but this is not fully reliable.  Using DWARF
> unwind info is, but that requires big tables.
>

Thanks, so are you suggesting that a reliable stack is not possible on
ppc64le? Even with the restricted scope of the kernel?

Balbir Singh.
 


Re: ppc64le reliable stack unwinder and scheduled tasks

2019-01-12 Thread Segher Boessenkool
On Sat, Jan 12, 2019 at 12:09:14PM +1100, Balbir Singh wrote:
> Could you please define interesting frame on top a bit more? Usually
> the topmost return address is in LR

There is no reliable way (other than DWARF unwind info) to find out where
the value of LR at function entry currently lives (if anywhere). It may or
may not be still available in LR, it may or may not be saved to the return
stack slot.  It can also live in some GPR, or in some other stack slot.

(The same is true for all other registers).

The only thing the ABI guarantees you is that you can find all stack frames
via the back chain.  If you want more you can use some heuristics and do
some heroics (like GDB does), but this is not fully reliable.  Using DWARF
unwind info is, but that requires big tables.


Segher


Re: ppc64le reliable stack unwinder and scheduled tasks

2019-01-11 Thread Balbir Singh
On Thu, Jan 10, 2019 at 04:14:00PM -0500, Joe Lawrence wrote:
> Hi all,
> 
> tl;dr: On ppc64le, what is top-most stack frame for scheduled tasks
>about?
> 
> I am looking at a bug in which ~10% of livepatch tests on RHEL-7 and
> RHEL-8 distro kernels, the ppc64le reliable stack unwinder consistently
> reports an unreliable stack for a given task.  This condition can
> eventually resolve itself, but sometimes this state remains wedged for
> hours and I can live-debug the system with crash-utility.
>

In summary, you think the reliable stack tracer is being conservative?
xmon (built in debugger) also allows for live-debugging and might
be more easier to get some of this done.
 
> I have tried reproducing with upstream 4.20 kernel, but haven't seen
> this exact condition yet.  More on upstream in a bit.
> 
> That said, I have a question about the ppc64 stack frame layout with
> regard to scheduled tasks.  In each sticky "unreliable" stack trace
> instance that I've looked at, the task's stack has an interesting
> frame at the top:
>

Could you please define interesting frame on top a bit more? Usually
the topmost return address is in LR
 
> 
> Example 1 (RHEL-7)
> ==
> 
> crash> struct task_struct.thread c0022fd015c0 | grep ksp
> ksp = 0xc000288af9c0
> 
> crash> rd 0xc000288af9c0 -e 0xc000288b
> 
>  - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
> 
> sp[0]:
> 
> c000288af9c0:  c000288afb90 00dd   ...(
> c000288af9d0:  c0002a94 c1c60a00   .*..
> 
> crash> sym c0002a94
> c0002a94 (T) hardware_interrupt_common+0x114

What does bt look like in this case? what does r1 point to? I would
look at xmon and see what the "t" (stack trace) looks like, I think
it's a more familiar tool.

> 
> c000288af9e0:  c1c60a80    
> c000288af9f0:  c000288afbc0 00dd   ...(
> c000288afa00:  c14322e0 c1c60a00   ."C.
> c000288afa10:  c002303ae380 c002303ae380   ..:0..:0
> c000288afa20:  7265677368657265 2200   erehsger."..
> 
> Uh-oh...
> 
> /* Mark stacktraces with exception frames as unreliable. */
> stack[STACK_FRAME_MARKER] == STACK_FRAME_REGS_MARKER
> 
> c000288afa30:  c01240ac c000288afcb0   .@.(
> c000288afa40:  c13e4d00    .M>.
> c000288afa50:  0001 0001   
> c000288afa60:  c14322e0 0804   ."C.
> c000288afa70:  c000288ac080 c0022fd015c0   ...(.../
> c000288afa80:  c000288afae0    ...(
> c000288afa90:  c000288afae0 c7b80900   ...(
> c000288afaa0:  c0e90a00 c0e90a00   
> c000288afab0:  c01240ac c0e90a00   .@..
> c000288afac0:  c0e90a00 c4790580   ..y.
> c000288afad0:  0001 c0e90a00   
> c000288afae0:   0004   
> c000288afaf0:  c0022fd01ad0 c0e73be0   .../.;..
> c000288afb00:  c0023900f450 c1488190   P..9..H.
> c000288afb10:  00ad c0023900ef40   @..9
> c000288afb20:  c000288ac000 c0e73be0   ...(.;..
> c000288afb30:  c001b514 c0022fd01628   (../
> c000288afb40:  c000288afbb0 c0008800   ...(
> c000288afb50:  c0162880    .(..
> c000288afb60:  240f3022 0004   "0.$
> c000288afb70:  c0e90a00 c0022fd01a20    ../
> c000288afb80:  c000288afbf0 c14322e0   ...(."C.
> 
>  - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
> 
> sp[1]:
> 
> c000288afb90:  c000288afbf0 c1960a00   ...(
> c000288afba0:  c001b514 0004   
> 
> crash> sym c001b514
> c001b514 (T) __switch_to+0x264
> 
> c000288afbb0:  c0e90a00    
> c000288afbc0:  c000288ac000 c14322e0   ...(."C.
> c000288afbd0:  c0e90a00 c1960a00   
> c000288afbe0:  c0022fd015c0 c0023900ef40   .../@..9
> 
>  - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
> 
> sp[2]:
> 
> c000288afbf0:  c000288afcd0 c0003300   ...(.3..
> c000288afc00:  c0a83918 c13e4d00   .9...M>.
> 
> crash> sym c0a83918
> c0a83918 (t) __schedule+0x428
> 
>  - - - 

Re: ppc64le reliable stack unwinder and scheduled tasks

2019-01-10 Thread Nicolai Stange
Joe Lawrence  writes:

> On Fri, Jan 11, 2019 at 01:00:38AM +0100, Nicolai Stange wrote:
>> Hi Joe,
>> 
>> Joe Lawrence  writes:
>> 
>> > tl;dr: On ppc64le, what is top-most stack frame for scheduled tasks
>> >about?
>> 
>> If I'm reading the code in _switch() correctly, the first frame is
>> completely uninitialized except for the pointer back to the caller's
>> stack frame.
>> 
>> For completeness: _switch() saves the return address, i.e. the link
>> register into its parent's stack frame, as is mandated by the ABI and
>> consistent with your findings below: it's always the second stack frame
>> where the return address into __switch_to() is kept.
>>
>
> Hi Nicolai,
>
> Good, that makes a lot of sense.  I couldn't find any reference
> explaining the contents of frame 0, only unwinding code here and there
> (as in crash-utility) that stepped over it.

FWIW, I learned about general stack frame usage on ppc from part 4 of
the introductionary series starting at [1]: it's a good reading and I
can definitely recommend it.

Summary:
- Callers of other functions always allocate a stack frame and only
  set the pointer to the previous stack frame (that's the
  'stdu r1, -STACK_FRAME_OVERHEAD(r1)' insn).
- Callees save their stuff into the stack frame allocated by the caller
  if needed. Where "if needed" == callee in turn calls another function.

The insignificance of frame 0's contents follows from this ABI: the
caller might not have called any callee yet, the callee might be a leaf
and so on.

Finally, as I understand it, the only purpose of _switch() creating a
standard stack frame at the bottom of scheduled out tasks is that the
higher ones can be found (for e.g. the backtracing): otherwise
there would be a pt_regs at the bottom of the stack. But I might be
wrong here.



>> 
>> 
>> >
>> >
>> > Example 1 (RHEL-7)
>> > ==
>> >
>> > crash> struct task_struct.thread c0022fd015c0 | grep ksp
>> > ksp = 0xc000288af9c0
>> >
>> > crash> rd 0xc000288af9c0 -e 0xc000288b
>> >
>> >  - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
>> >
>> > sp[0]:
>> >
>> > c000288af9c0:  c000288afb90 00dd   ...(
>> > c000288af9d0:  c0002a94 c1c60a00   .*..
>> >
>> > crash> sym c0002a94
>> > c0002a94 (T) hardware_interrupt_common+0x114
>> 
>> So that c0002a94 certainly wasn't stored by _switch(). I think
>> what might have happened is that the switching frame aliased with some
>> prior interrupt frame as setup by hardware_interrupt_common().
>> 
>> The interrupt and switching frames seem to share a common layout as far
>> as the lower STACK_FRAME_OVERHEAD + sizeof(struct pt_regs) bytes are
>> concerned.
>> 
>> That address into hardware_interrupt_common() could have been written by
>> the do_IRQ() called from there.
>> 
>
> That was my initial theory, but then when I saw an ordinary scheduled
> task with a similarly strange frame 0, I thought that _switch() might
> have been doing something clever (or not).  But according your earlier
> explanation, it would line up that these values may be inherited from
> do_IRQ() or the like.
>
>> 
>> > c000288af9e0:  c1c60a80    
>> > c000288af9f0:  c000288afbc0 00dd   ...(
>> > c000288afa00:  c14322e0 c1c60a00   ."C.
>> > c000288afa10:  c002303ae380 c002303ae380   ..:0..:0
>> > c000288afa20:  7265677368657265 2200   erehsger."..
>> >
>> > Uh-oh...
>> >
>> > /* Mark stacktraces with exception frames as unreliable. */
>> > stack[STACK_FRAME_MARKER] == STACK_FRAME_REGS_MARKER
>> 
>> 
>> Aliasing of the switching stack frame with some prior interrupt stack
>> frame would explain why that STACK_FRAME_REGS_MARKER is still found on
>> the stack, i.e. it's a leftover.
>> 
>> For testing, could you try whether clearing the word at STACK_FRAME_MARKER
>> from _switch() helps?
>> 
>> I.e. something like (completely untested):
>
> I'll kick off some builds tonight and try to get tests lined up
> tomorrow.  Unfortunately these take a bit of time to run setup, schedule
> and complete, so perhaps by next week.

Ok, that's probably still a good test for confirmation, but the solution
you proposed below is much better.


>> 
>> diff --git a/arch/powerpc/kernel/entry_64.S b/arch/powerpc/kernel/entry_64.S
>> index 435927f549c4..b747d0647ec4 100644
>> --- a/arch/powerpc/kernel/entry_64.S
>> +++ b/arch/powerpc/kernel/entry_64.S
>> @@ -596,6 +596,10 @@ _GLOBAL(_switch)
>>  SAVE_8GPRS(14, r1)
>>  SAVE_10GPRS(22, r1)
>>  std r0,_NIP(r1) /* Return to switch caller */
>> +
>> +li  r23,0
>> +std r23,96(r1)  /* 96 == STACK_FRAME_MARKER * sizeof(long) */
>> +
>>  mfcrr23
>>  std r23,_CCR(r1)
>>  std r1,KSP(r3)  /* Set old 

Re: ppc64le reliable stack unwinder and scheduled tasks

2019-01-10 Thread Joe Lawrence
On Fri, Jan 11, 2019 at 01:00:38AM +0100, Nicolai Stange wrote:
> Hi Joe,
> 
> Joe Lawrence  writes:
> 
> > tl;dr: On ppc64le, what is top-most stack frame for scheduled tasks
> >about?
> 
> If I'm reading the code in _switch() correctly, the first frame is
> completely uninitialized except for the pointer back to the caller's
> stack frame.
> 
> For completeness: _switch() saves the return address, i.e. the link
> register into its parent's stack frame, as is mandated by the ABI and
> consistent with your findings below: it's always the second stack frame
> where the return address into __switch_to() is kept.
>

Hi Nicolai,

Good, that makes a lot of sense.  I couldn't find any reference
explaining the contents of frame 0, only unwinding code here and there
(as in crash-utility) that stepped over it.
 
> 
> 
> >
> >
> > Example 1 (RHEL-7)
> > ==
> >
> > crash> struct task_struct.thread c0022fd015c0 | grep ksp
> > ksp = 0xc000288af9c0
> >
> > crash> rd 0xc000288af9c0 -e 0xc000288b
> >
> >  - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
> >
> > sp[0]:
> >
> > c000288af9c0:  c000288afb90 00dd   ...(
> > c000288af9d0:  c0002a94 c1c60a00   .*..
> >
> > crash> sym c0002a94
> > c0002a94 (T) hardware_interrupt_common+0x114
> 
> So that c0002a94 certainly wasn't stored by _switch(). I think
> what might have happened is that the switching frame aliased with some
> prior interrupt frame as setup by hardware_interrupt_common().
> 
> The interrupt and switching frames seem to share a common layout as far
> as the lower STACK_FRAME_OVERHEAD + sizeof(struct pt_regs) bytes are
> concerned.
> 
> That address into hardware_interrupt_common() could have been written by
> the do_IRQ() called from there.
> 

That was my initial theory, but then when I saw an ordinary scheduled
task with a similarly strange frame 0, I thought that _switch() might
have been doing something clever (or not).  But according your earlier
explanation, it would line up that these values may be inherited from
do_IRQ() or the like.

> 
> > c000288af9e0:  c1c60a80    
> > c000288af9f0:  c000288afbc0 00dd   ...(
> > c000288afa00:  c14322e0 c1c60a00   ."C.
> > c000288afa10:  c002303ae380 c002303ae380   ..:0..:0
> > c000288afa20:  7265677368657265 2200   erehsger."..
> >
> > Uh-oh...
> >
> > /* Mark stacktraces with exception frames as unreliable. */
> > stack[STACK_FRAME_MARKER] == STACK_FRAME_REGS_MARKER
> 
> 
> Aliasing of the switching stack frame with some prior interrupt stack
> frame would explain why that STACK_FRAME_REGS_MARKER is still found on
> the stack, i.e. it's a leftover.
> 
> For testing, could you try whether clearing the word at STACK_FRAME_MARKER
> from _switch() helps?
> 
> I.e. something like (completely untested):

I'll kick off some builds tonight and try to get tests lined up
tomorrow.  Unfortunately these take a bit of time to run setup, schedule
and complete, so perhaps by next week.

> 
> diff --git a/arch/powerpc/kernel/entry_64.S b/arch/powerpc/kernel/entry_64.S
> index 435927f549c4..b747d0647ec4 100644
> --- a/arch/powerpc/kernel/entry_64.S
> +++ b/arch/powerpc/kernel/entry_64.S
> @@ -596,6 +596,10 @@ _GLOBAL(_switch)
>   SAVE_8GPRS(14, r1)
>   SAVE_10GPRS(22, r1)
>   std r0,_NIP(r1) /* Return to switch caller */
> +
> + li  r23,0
> + std r23,96(r1)  /* 96 == STACK_FRAME_MARKER * sizeof(long) */
> +
>   mfcrr23
>   std r23,_CCR(r1)
>   std r1,KSP(r3)  /* Set old stack pointer */
> 
> 

This may be sufficient to avoid the condition, but if the contents of
frame 0 are truely uninitialized (not to be trusted), should the
unwinder be even looking at that frame (for STACK_FRAMES_REGS_MARKER),
aside from the LR and other stack size geometry sanity checks?

> 
> 
> >
> > save_stack_trace_tsk_reliable
> > =
> >
> > arch/powerpc/kernel/stacktrace.c :: save_stack_trace_tsk_reliable() does
> > take into account the first stackframe, but only to verify that the link
> > register is indeed pointing at kernel code address.
> 
> It's actually the other way around:
> 
>   if (!firstframe && !__kernel_text_address(ip))
>   return 1;
> 
> 
> So the address gets sanitized only if it's _not_ coming from the first
> frame.

Yup, that's right, I had it backwards.

Thanks!

-- Joe


Re: ppc64le reliable stack unwinder and scheduled tasks

2019-01-10 Thread Nicolai Stange
Hi Joe,

Joe Lawrence  writes:

> tl;dr: On ppc64le, what is top-most stack frame for scheduled tasks
>about?

If I'm reading the code in _switch() correctly, the first frame is
completely uninitialized except for the pointer back to the caller's
stack frame.

For completeness: _switch() saves the return address, i.e. the link
register into its parent's stack frame, as is mandated by the ABI and
consistent with your findings below: it's always the second stack frame
where the return address into __switch_to() is kept.



>
>
> Example 1 (RHEL-7)
> ==
>
> crash> struct task_struct.thread c0022fd015c0 | grep ksp
> ksp = 0xc000288af9c0
>
> crash> rd 0xc000288af9c0 -e 0xc000288b
>
>  - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
>
> sp[0]:
>
> c000288af9c0:  c000288afb90 00dd   ...(
> c000288af9d0:  c0002a94 c1c60a00   .*..
>
> crash> sym c0002a94
> c0002a94 (T) hardware_interrupt_common+0x114

So that c0002a94 certainly wasn't stored by _switch(). I think
what might have happened is that the switching frame aliased with some
prior interrupt frame as setup by hardware_interrupt_common().

The interrupt and switching frames seem to share a common layout as far
as the lower STACK_FRAME_OVERHEAD + sizeof(struct pt_regs) bytes are
concerned.

That address into hardware_interrupt_common() could have been written by
the do_IRQ() called from there.


> c000288af9e0:  c1c60a80    
> c000288af9f0:  c000288afbc0 00dd   ...(
> c000288afa00:  c14322e0 c1c60a00   ."C.
> c000288afa10:  c002303ae380 c002303ae380   ..:0..:0
> c000288afa20:  7265677368657265 2200   erehsger."..
>
> Uh-oh...
>
> /* Mark stacktraces with exception frames as unreliable. */
> stack[STACK_FRAME_MARKER] == STACK_FRAME_REGS_MARKER


Aliasing of the switching stack frame with some prior interrupt stack
frame would explain why that STACK_FRAME_REGS_MARKER is still found on
the stack, i.e. it's a leftover.

For testing, could you try whether clearing the word at STACK_FRAME_MARKER
from _switch() helps?

I.e. something like (completely untested):

diff --git a/arch/powerpc/kernel/entry_64.S b/arch/powerpc/kernel/entry_64.S
index 435927f549c4..b747d0647ec4 100644
--- a/arch/powerpc/kernel/entry_64.S
+++ b/arch/powerpc/kernel/entry_64.S
@@ -596,6 +596,10 @@ _GLOBAL(_switch)
SAVE_8GPRS(14, r1)
SAVE_10GPRS(22, r1)
std r0,_NIP(r1) /* Return to switch caller */
+
+   li  r23,0
+   std r23,96(r1)  /* 96 == STACK_FRAME_MARKER * sizeof(long) */
+
mfcrr23
std r23,_CCR(r1)
std r1,KSP(r3)  /* Set old stack pointer */




>
> save_stack_trace_tsk_reliable
> =
>
> arch/powerpc/kernel/stacktrace.c :: save_stack_trace_tsk_reliable() does
> take into account the first stackframe, but only to verify that the link
> register is indeed pointing at kernel code address.

It's actually the other way around:

if (!firstframe && !__kernel_text_address(ip))
return 1;


So the address gets sanitized only if it's _not_ coming from the first
frame.


Thanks,

Nicolai

-- 
SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton,
HRB 21284 (AG Nürnberg)