Re: ARM64 Livepatch based on SFrame

2023-12-15 Thread Madhavan T. Venkataraman



On 12/15/23 07:04, Mark Rutland wrote:
> On Thu, Dec 14, 2023 at 02:49:29PM -0600, Madhavan T. Venkataraman wrote:
>> Hi Mark,
> 
> Hi Madhavan,
> 
>> I attended your presentation at LPC. You mentioned that you could use some
>> help with some prerequisites for the Livepatch feature. I would like to lend
>> a hand.
> 
> Cool!
> 
> I've been meaning to send a mail round with a summary of the current state of
> things, and what needs to be done going forward, but I haven't had the time
> since LPC to put that together (as e.g. that requires learning some more about
> SFrame).  I'll be disappearing for the holiday shortly, and I intend to pick
> that up in the new year.
> 
>> What would you like me to implement?
> 
> I'm not currently sure exactly what we need/want to implement, and as above I
> think that needs to wait until the new year.
> 

OK.

> However, one thing that you can do that would be very useful is to write up
> and report the GCC DWARF issues that you mentioned in:
> 
>   https://lore.kernel.org/linux-arm-kernel/20230202074036.507249-1-madve...@linux.microsoft.com/
> 
> ... as (depending on exactly what those are) those could also affect SFrame
> generation (and thus we'll need to get those fixed in GCC), and regardless it
> would be useful information to know.
> 
> I understood that you planned to do that from:
> 
>   https://lore.kernel.org/linux-arm-kernel/054ce0d6-70f0-b834-d4e5-1049c8df7...@linux.microsoft.com/
> 
> ... but I couldn't spot any relevant mails or issues in the GCC bugzilla, so
> either I'm failing to search hard enough, or did that get forgotten about?
> 

Yeah. I had notes on that, but I seem to have lost them. I need to reproduce
the problems and analyze them again, which is not trivial. So, I have been
procrastinating.

I am also disappearing for the rest of this year. I will try to look at it in
the new year.

>> I would also like to implement Unwind Hints for the feature. If you want a
>> specific format for the hints, let me know.
> 
> I will get back to you on that in the new year; I think the specifics we want
> are going to depend on other details above we need to analyse first.
> 

OK.

For now, I will implement something and send it out just for reference. We can
revisit this topic sometime next year.

Thanks.

Madhavan



ARM64 Livepatch based on SFrame

2023-12-14 Thread Madhavan T. Venkataraman
Hi Mark,

I attended your presentation at LPC. You mentioned that you could use some
help with some prerequisites for the Livepatch feature. I would like to lend a
hand.

What would you like me to implement?

I would also like to implement Unwind Hints for the feature. If you want a 
specific format for the hints, let me know.

Looking forward to helping out with the feature.

Madhavan



Re: [RFC PATCH v2 0/1] arm64: Implement stack trace termination record

2021-04-19 Thread Madhavan T. Venkataraman
Sorry. Forgot to include link to v2. Here it is:

https://lore.kernel.org/linux-arm-kernel/20210402032404.47239-1-madve...@linux.microsoft.com/

Thanks!

Madhavan

On 4/19/21 1:16 PM, Madhavan T. Venkataraman wrote:
> CCing Pavel Tatashin  on request.
> 
> Pasha,
> 
> This is v2. v1 is here:
> 
> https://lore.kernel.org/linux-arm-kernel/20210324184607.120948-1-madve...@linux.microsoft.com/
> 
> Thanks!
>
> Madhavan
> 
> On 4/1/21 10:24 PM, madve...@linux.microsoft.com wrote:
>> From: "Madhavan T. Venkataraman" 
>>
>> Reliable stacktracing requires that we identify when a stacktrace is
>> terminated early. We can do this by ensuring all tasks have a final
>> frame record at a known location on their task stack, and checking
>> that this is the final frame record in the chain.
>>
>> All tasks have a pt_regs structure right after the task stack in the stack
>> page. The pt_regs structure contains a stackframe field. Make this stackframe
>> field the final frame in the task stack so all stack traces end at a fixed
>> stack offset.
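>>
>> (For reference, the fixed location comes from task_pt_regs(); on arm64 it
>> is essentially the sketch below, so the stackframe field sits at a known
>> offset from the top of the stack page:)
>>
>> 	#define task_pt_regs(p) \
>> 		((struct pt_regs *)(THREAD_SIZE + task_stack_page(p)) - 1)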
>>
>> For kernel tasks, this is simple to understand. For user tasks, there is
>> some extra detail. User tasks get created via fork() et al. Once they return
>> from fork, they enter the kernel only on an EL0 exception. In arm64,
>> system calls are also EL0 exceptions.
>>
>> The EL0 exception handler uses the task pt_regs mentioned above to save
>> register state and call different exception functions. All stack traces
>> from EL0 exception code must end at the pt_regs. So, make pt_regs->stackframe
>> the final frame in the EL0 exception stack.
>>
>> To summarize, task_pt_regs(task)->stackframe will always be the final frame
>> in a stack trace.
>>
>> Sample stack traces
>> ===
>>
>> The final frame for the idle tasks is different from v1. The rest of the
>> stack traces are the same.
>>
>> Primary CPU's idle task (changed from v1)
>> ===
>>
>> [0.022365]   arch_stack_walk+0x0/0xd0
>> [0.022376]   callfd_stack+0x30/0x60
>> [0.022387]   rest_init+0xd8/0xf8
>> [0.022397]   arch_call_rest_init+0x18/0x24
>> [0.022411]   start_kernel+0x5b8/0x5f4
>> [0.022424]   __primary_switched+0xa8/0xac
>>
>> Secondary CPU's idle task (changed from v1)
>> =
>>
>> [0.022484]   arch_stack_walk+0x0/0xd0
>> [0.022494]   callfd_stack+0x30/0x60
>> [0.022502]   secondary_start_kernel+0x188/0x1e0
>> [0.022513]   __secondary_switched+0x80/0x84
>>
>> ---
>> Changelog:
>>
>> v1
>>  - Set up task_pt_regs(current)->stackframe as the final frame
>>when a new task is initialized in copy_thread().
>>
>>  - Create pt_regs for the idle tasks and set up pt_regs->stackframe
>>as the final frame for the idle tasks.
>>
>>  - Set up task_pt_regs(current)->stackframe as the final frame in
>>the EL0 exception handler so the EL0 exception stack trace ends
>>there.
>>
>>  - Terminate the stack trace successfully in unwind_frame() when
>>the FP reaches task_pt_regs(current)->stackframe.
>>
>>  - The stack traces (above) in the kernel will terminate at the
>>    correct place. Debuggers may show an extra record 0x0 at the end
>>    for pt_regs->stackframe. That said, I did not see that extra frame
>>    when I did stack traces using gdb.
>> v2
>>  - Changed some wordings as suggested by Mark Rutland.
>>
>>  - Removed the synthetic return PC for idle tasks. Changed the
>>branches to start_kernel() and secondary_start_kernel() to
>>calls so that they will have a proper return PC.
>>
>> Madhavan T. Venkataraman (1):
>>   arm64: Implement stack trace termination record
>>
>>  arch/arm64/kernel/entry.S  |  8 +---
>>  arch/arm64/kernel/head.S   | 29 +++--
>>  arch/arm64/kernel/process.c|  5 +
>>  arch/arm64/kernel/stacktrace.c | 10 +-
>>  4 files changed, 38 insertions(+), 14 deletions(-)
>>
>>
>> base-commit: 0d02ec6b3136c73c09e7859f0d0e4e2c4c07b49b
>>


Re: [RFC PATCH v2 0/1] arm64: Implement stack trace termination record

2021-04-19 Thread Madhavan T. Venkataraman
CCing Pavel Tatashin  on request.

Pasha,

This is v2. v1 is here:

https://lore.kernel.org/linux-arm-kernel/20210324184607.120948-1-madve...@linux.microsoft.com/

Thanks!
   
Madhavan

On 4/1/21 10:24 PM, madve...@linux.microsoft.com wrote:
> From: "Madhavan T. Venkataraman" 
> 
> Reliable stacktracing requires that we identify when a stacktrace is
> terminated early. We can do this by ensuring all tasks have a final
> frame record at a known location on their task stack, and checking
> that this is the final frame record in the chain.
> 
> All tasks have a pt_regs structure right after the task stack in the stack
> page. The pt_regs structure contains a stackframe field. Make this stackframe
> field the final frame in the task stack so all stack traces end at a fixed
> stack offset.
> 
> For kernel tasks, this is simple to understand. For user tasks, there is
> some extra detail. User tasks get created via fork() et al. Once they return
> from fork, they enter the kernel only on an EL0 exception. In arm64,
> system calls are also EL0 exceptions.
> 
> The EL0 exception handler uses the task pt_regs mentioned above to save
> register state and call different exception functions. All stack traces
> from EL0 exception code must end at the pt_regs. So, make pt_regs->stackframe
> the final frame in the EL0 exception stack.
> 
> To summarize, task_pt_regs(task)->stackframe will always be the final frame
> in a stack trace.
> 
> Sample stack traces
> ===
> 
> The final frame for the idle tasks is different from v1. The rest of the
> stack traces are the same.
> 
> Primary CPU's idle task (changed from v1)
> ===
> 
> [0.022365]   arch_stack_walk+0x0/0xd0
> [0.022376]   callfd_stack+0x30/0x60
> [0.022387]   rest_init+0xd8/0xf8
> [0.022397]   arch_call_rest_init+0x18/0x24
> [0.022411]   start_kernel+0x5b8/0x5f4
> [0.022424]   __primary_switched+0xa8/0xac
> 
> Secondary CPU's idle task (changed from v1)
> =
> 
> [0.022484]   arch_stack_walk+0x0/0xd0
> [0.022494]   callfd_stack+0x30/0x60
> [0.022502]   secondary_start_kernel+0x188/0x1e0
> [0.022513]   __secondary_switched+0x80/0x84
> 
> ---
> Changelog:
> 
> v1
>   - Set up task_pt_regs(current)->stackframe as the final frame
> when a new task is initialized in copy_thread().
> 
>   - Create pt_regs for the idle tasks and set up pt_regs->stackframe
> as the final frame for the idle tasks.
> 
>   - Set up task_pt_regs(current)->stackframe as the final frame in
> the EL0 exception handler so the EL0 exception stack trace ends
> there.
> 
>   - Terminate the stack trace successfully in unwind_frame() when
> the FP reaches task_pt_regs(current)->stackframe.
> 
>   - The stack traces (above) in the kernel will terminate at the
> correct place. Debuggers may show an extra record 0x0 at the end
> for pt_regs->stackframe. That said, I did not see that extra frame
> when I did stack traces using gdb.
> v2
>   - Changed some wordings as suggested by Mark Rutland.
> 
>   - Removed the synthetic return PC for idle tasks. Changed the
> branches to start_kernel() and secondary_start_kernel() to
> calls so that they will have a proper return PC.
> 
> Madhavan T. Venkataraman (1):
>   arm64: Implement stack trace termination record
> 
>  arch/arm64/kernel/entry.S  |  8 +---
>  arch/arm64/kernel/head.S   | 29 +++--
>  arch/arm64/kernel/process.c|  5 +
>  arch/arm64/kernel/stacktrace.c | 10 +-
>  4 files changed, 38 insertions(+), 14 deletions(-)
> 
> 
> base-commit: 0d02ec6b3136c73c09e7859f0d0e4e2c4c07b49b
> 


Re: [RFC PATCH v2 1/1] arm64: Implement stack trace termination record

2021-04-16 Thread Madhavan T. Venkataraman
Thanks!

Madhavan

On 4/16/21 11:17 AM, Mark Brown wrote:
> On Thu, Apr 01, 2021 at 10:24:04PM -0500, madve...@linux.microsoft.com wrote:
> 
>> Reliable stacktracing requires that we identify when a stacktrace is
>> terminated early. We can do this by ensuring all tasks have a final
>> frame record at a known location on their task stack, and checking
>> that this is the final frame record in the chain.
> 
> Reviewed-by: Mark Brown 
> 


Re: [RFC PATCH v2 0/4] arm64: Implement stack trace reliability checks

2021-04-16 Thread Madhavan T. Venkataraman



On 4/14/21 5:23 AM, Madhavan T. Venkataraman wrote:
> In any case, I have absolutely no problems in implementing your section idea. 
> I will
> make an attempt to do that in version 3 of my patch series.

So, I attempted a patch that just declares all .entry.text functions unreliable
by checking just the section bounds. It does work for EL1 exceptions. But there
are other functions that are actually reliable that show up as unreliable. The
example in my test is el0_sync(), which is at the base of all system call
stacks.
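
The check I tried is roughly the following sketch (__entry_text_start/end are
the linker-provided section bounds from asm/sections.h; the per-frame reliable
flag is the one this series proposes):

	/* Sketch: treat any return PC inside .entry.text as unreliable. */
	static bool pc_in_entry_text(unsigned long pc)
	{
		return pc >= (unsigned long)__entry_text_start &&
		       pc < (unsigned long)__entry_text_end;
	}

	...
	if (pc_in_entry_text(frame->pc))
		frame->reliable = false;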

How would you prefer I handle this? Should I place all SYM_CODE functions that
are actually safe for the unwinder in a separate section? I could just take
some approach and solve this. But I would like to get your opinion and Mark
Rutland's opinion so we are all on the same page.

Please let me know.

Madhavan


Re: [RFC PATCH v2 1/1] arm64: Implement stack trace termination record

2021-04-14 Thread Madhavan T. Venkataraman
Hi Mark Rutland, Mark Brown,

Could you take a look at this version for proper stack termination and let me
know what you think?

Thanks!

Madhavan

On 4/1/21 10:24 PM, madve...@linux.microsoft.com wrote:
> From: "Madhavan T. Venkataraman" 
> 
> Reliable stacktracing requires that we identify when a stacktrace is
> terminated early. We can do this by ensuring all tasks have a final
> frame record at a known location on their task stack, and checking
> that this is the final frame record in the chain.
> 
> Kernel Tasks
> 
> 
> All tasks except the idle task have a pt_regs structure right after the
> task stack. This is called the task pt_regs. The pt_regs structure has a
> special stackframe field. Make this stackframe field the final frame in the
> task stack. This needs to be done in copy_thread() which initializes a new
> task's pt_regs and initial CPU context.
> 
> For the idle task, there is no task pt_regs. For our purpose, we need one.
> So, create a pt_regs just like other kernel tasks and make
> pt_regs->stackframe the final frame in the idle task stack. This needs to be
> done at two places:
> 
>   - On the primary CPU, the boot task runs. It calls start_kernel()
> and eventually becomes the idle task for the primary CPU. Just
> before start_kernel() is called, set up the final frame.
> 
>   - On each secondary CPU, a startup task runs that calls
> secondary_startup_kernel() and eventually becomes the idle task
> on the secondary CPU. Just before secondary_start_kernel() is
> called, set up the final frame.
> 
> User Tasks
> ==
> 
> User tasks are initially set up like kernel tasks when they are created.
> Then, they return to userland after fork via ret_from_fork(). After that,
> they enter the kernel only on an EL0 exception. (In arm64, system calls are
> also EL0 exceptions). The EL0 exception handler stores state in the task
> pt_regs and calls different functions based on the type of exception. The
> stack trace for an EL0 exception must end at the task pt_regs. So, make
> task pt_regs->stackframe as the final frame in the EL0 exception stack.
> 
> In summary, task pt_regs->stackframe is where a successful stack trace ends.
> 
> Stack trace termination
> ===
> 
> In the unwinder, terminate the stack trace successfully when
> task_pt_regs(task)->stackframe is reached. For stack traces in the kernel,
> this will correctly terminate the stack trace at the right place.
> 
> However, debuggers terminate the stack trace when FP == 0. In the
> pt_regs->stackframe, the PC is 0 as well. So, stack traces taken in the
> debugger may print an extra record 0x0 at the end. While this is not
> pretty, this does not do any harm. This is a small price to pay for
> having reliable stack trace termination in the kernel.
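> 
> A minimal sketch of that check in unwind_frame() (illustrative only; -ENOENT
> is how the arm64 unwinder signals a graceful end of stack):
> 
> 	/* Sketch: graceful termination at the task's final frame record. */
> 	if (frame->fp == (unsigned long)task_pt_regs(tsk)->stackframe)
> 		return -ENOENT;	/* final frame: trace terminated cleanly */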
> 
> Signed-off-by: Madhavan T. Venkataraman 
> ---
>  arch/arm64/kernel/entry.S  |  8 +---
>  arch/arm64/kernel/head.S   | 29 +++--
>  arch/arm64/kernel/process.c|  5 +
>  arch/arm64/kernel/stacktrace.c | 10 +-
>  4 files changed, 38 insertions(+), 14 deletions(-)
> 
> diff --git a/arch/arm64/kernel/entry.S b/arch/arm64/kernel/entry.S
> index a31a0a713c85..e2dc2e998934 100644
> --- a/arch/arm64/kernel/entry.S
> +++ b/arch/arm64/kernel/entry.S
> @@ -261,16 +261,18 @@ alternative_else_nop_endif
>   stp lr, x21, [sp, #S_LR]
>  
>   /*
> -  * For exceptions from EL0, terminate the callchain here.
> +  * For exceptions from EL0, terminate the callchain here at
> +  * task_pt_regs(current)->stackframe.
> +  *
>* For exceptions from EL1, create a synthetic frame record so the
>* interrupted code shows up in the backtrace.
>*/
>   .if \el == 0
> - mov x29, xzr
> + stp xzr, xzr, [sp, #S_STACKFRAME]
>   .else
>   stp x29, x22, [sp, #S_STACKFRAME]
> - add x29, sp, #S_STACKFRAME
>   .endif
> + add x29, sp, #S_STACKFRAME
>  
>  #ifdef CONFIG_ARM64_SW_TTBR0_PAN
>  alternative_if_not ARM64_HAS_PAN
> diff --git a/arch/arm64/kernel/head.S b/arch/arm64/kernel/head.S
> index 840bda1869e9..743c019a42c7 100644
> --- a/arch/arm64/kernel/head.S
> +++ b/arch/arm64/kernel/head.S
> @@ -393,6 +393,23 @@ SYM_FUNC_START_LOCAL(__create_page_tables)
>   ret x28
>  SYM_FUNC_END(__create_page_tables)
>  
> + /*
> +  * The boot task becomes the idle task for the primary CPU. The
> +  * CPU startup task on each secondary CPU becomes the idle task
> +  * for the secondary CPU.
> +  *
> +  * The idle task does no

Re: [RFC PATCH v2 0/4] arm64: Implement stack trace reliability checks

2021-04-14 Thread Madhavan T. Venkataraman



On 4/13/21 6:02 AM, Mark Brown wrote:
> On Mon, Apr 12, 2021 at 02:55:35PM -0500, Madhavan T. Venkataraman wrote:
> 
>>
>> OK. Just so I am clear on the whole picture, let me state my understanding
>> so far. Correct me if I am wrong.
> 
>> 1. We are hoping that we can convert a significant number of SYM_CODE
>>    functions to SYM_FUNC functions by providing them with a proper FP prolog
>>    and epilog so that we can get objtool coverage for them. These don't need
>>    any blacklisting.
> 
> I wouldn't expect to be converting lots of SYM_CODE to SYM_FUNC.  I'd
> expect the overwhelming majority of SYM_CODE to be SYM_CODE because it's
> required to be non standard due to some external interface - things like
> the exception vectors, ftrace, and stuff around suspend/hibernate.  A
> quick grep seems to confirm this.
> 

OK. Fair enough.

>> 3. We are going to assume that the reliable unwinder is only for livepatch
>>    purposes and will only be invoked on a task that is not currently running.
>>    The task either
> 
> The reliable unwinder can also be invoked on itself.
> 

I have not called out the self-directed case because I am assuming that the
reliable unwinder is only used for livepatch. So, AFAICT, this is applicable to
the task that performs the livepatch operation itself. In this case, there
should be no unreliable functions on the self-directed stack trace (otherwise,
livepatching would always fail).

>> 4. So, the only functions that will need blacklisting are the remaining
>>    SYM_CODE functions that might give up the CPU voluntarily. At this point,
>>    I am not even sure how many of these will exist. One hopes that all of
>>    these would have ended up as SYM_FUNC functions in (1).
> 
> There's stuff like ret_from_fork there.
> 

OK. There would be a few functions that fit this category. I agree.

>> I suggest we do (3) first. Then, review the assembly functions to do (1).
>> Then, review the remaining ones to see which ones must be blacklisted, if
>> any.
> 
> I'm not clear what the concrete steps you're planning to do first are
> there - your 3 seems like a statement of assumptions.  For flagging
> functions I do think it'd be safer to default to assuming that all
> SYM_CODE functions can't be unwound reliably rather than only explicitly
> listing ones that cause problems.
> 

They are not assumptions; they are true statements. But I probably did not do
a good job of explaining. Josh sent out a patch that updates the documentation
and explains what I said a lot better.

In any case, I have absolutely no problems in implementing your section idea.
I will make an attempt to do that in version 3 of my patch series.

Stay tuned.

And, thanks for all the input. It is very helpful.

Madhavan


Re: [RFC PATCH v2 0/4] arm64: Implement stack trace reliability checks

2021-04-12 Thread Madhavan T. Venkataraman



On 4/12/21 12:36 PM, Mark Brown wrote:
> On Fri, Apr 09, 2021 at 04:37:41PM -0500, Josh Poimboeuf wrote:
>> On Fri, Apr 09, 2021 at 01:09:09PM +0100, Mark Rutland wrote:
> 
>>> Further, I believe all the special cases are assembly functions, and
>>> most of those are already in special sections to begin with. I reckon
>>> it'd be simpler and more robust to reject unwinding based on the
>>> section. If we need to unwind across specific functions in those
>>> sections, we could opt-in with some metadata. So e.g. we could reject
>>> all functions in ".entry.text", special casing the EL0 entry functions
>>> if necessary.
> 
>> Couldn't this also end up being somewhat fragile?  Saying "certain
>> sections are deemed unreliable" isn't necessarily obvious to somebody
>> who doesn't already know about it, and it could be overlooked or
>> forgotten over time.  And there's no way to enforce it stays that way.
> 
> Anything in this area is going to have some opportunity for fragility
> and missed assumptions somewhere.  I do find the idea of using the
> SYM_CODE annotations that we already have and use for other purposes to
> flag code that we don't expect to be suitable for reliable unwinding
> appealing from that point of view.  It's pretty clear at the points
> where they're used that they're needed, even with a pretty surface level
> review, and the bit actually pushing things into a section is going to
> be in a single place where the macro is defined.  That seems relatively
> robust as these things go, it seems no worse than our reliance on
> SYM_FUNC to create BTI annotations.  Missing those causes oopses when we
> try to branch to the function.
> 

OK. Just so I am clear on the whole picture, let me state my understanding so
far. Correct me if I am wrong.

1. We are hoping that we can convert a significant number of SYM_CODE functions
   to SYM_FUNC functions by providing them with a proper FP prolog and epilog
   so that we can get objtool coverage for them. These don't need any
   blacklisting.

2. If we can locate the pt_regs structures created on the stack cleanly for EL1
   exceptions, etc., then we can handle those cases in the unwinder without
   needing any blacklisting.

   I have a solution for this in version 3 that does it without encoding the FP
   or matching values on the stack. I have addressed all of the objections so
   far on that count. I will send the patch series out soon.

3. We are going to assume that the reliable unwinder is only for livepatch
   purposes and will only be invoked on a task that is not currently running.
   The task either voluntarily gave up the CPU or was pre-empted. We can safely
   ignore all SYM_CODE functions that will never voluntarily give up the CPU.
   They can only be pre-empted, and pre-emption is already handled in (2). We
   don't need to blacklist any of these functions.

4. So, the only functions that will need blacklisting are the remaining
   SYM_CODE functions that might give up the CPU voluntarily. At this point, I
   am not even sure how many of these will exist. One hopes that all of these
   would have ended up as SYM_FUNC functions in (1).

So, IMHO, placing code in a blacklisted section should be the last step and not
the first one. This also satisfies Mark Rutland's requirement that no one muck
with the entry text while he is sorting out that code.

I suggest we do (3) first. Then, review the assembly functions to do (1). Then,
review the remaining ones to see which ones must be blacklisted, if any.

Do you agree?

Madhavan


Re: [RFC PATCH v2 0/4] arm64: Implement stack trace reliability checks

2021-04-11 Thread Madhavan T. Venkataraman



On 4/9/21 5:53 PM, Josh Poimboeuf wrote:
> On Fri, Apr 09, 2021 at 05:32:27PM -0500, Josh Poimboeuf wrote:
>> On Fri, Apr 09, 2021 at 05:05:58PM -0500, Madhavan T. Venkataraman wrote:
>>>> FWIW, over the years we've had zero issues with encoding the frame
>>>> pointer on x86.  After you save pt_regs, you encode the frame pointer to
>>>> point to it.  Ideally in the same macro so it's hard to overlook.
>>>>
>>>
>>> I had the same opinion. In fact, in my encoding scheme, I have additional
>>> checks to make absolutely sure that it is a true encoding and not stack
>>> corruption. The chances of all of those values accidentally matching are,
>>> well, null.
>>
>> Right, stack corruption -- which is already exceedingly rare -- would
>> have to be combined with a miracle or two in order to come out of the
>> whole thing marked as 'reliable' :-)
>>
>> And really, we already take a similar risk today by "trusting" the frame
>> pointer value on the stack to a certain extent.
> 
> Oh yeah, I forgot to mention some more benefits of encoding the frame
> pointer (or marking pt_regs in some other way):
> 
> a) Stack addresses can be printed properly: '%pS' for printing regs->pc
>and '%pB' for printing call returns.
> 
>Using '%pS' for call returns (as arm64 seems to do today) will result
>in printing the wrong function when you have tail calls to noreturn
>functions on the stack (which is actually quite common for calls to
>panic(), die(), etc).
> 
>More details:
> 
>https://lkml.kernel.org/r/20210403155948.ubbgtwmlsdyar7yp@treble
> 
> b) Stack dumps to the console can dump the exception registers they find
>along the way.  This is actually quite nice for debugging.
> 
> 

Great.

I am preparing version 3 taking into account comments from yourself,
Mark Rutland and Mark Brown.

Stay tuned.

Madhavan


Re: [RFC PATCH v2 0/4] arm64: Implement stack trace reliability checks

2021-04-09 Thread Madhavan T. Venkataraman



On 4/9/21 4:37 PM, Josh Poimboeuf wrote:
> On Fri, Apr 09, 2021 at 01:09:09PM +0100, Mark Rutland wrote:
>> On Mon, Apr 05, 2021 at 03:43:09PM -0500, madve...@linux.microsoft.com wrote:
>>> From: "Madhavan T. Venkataraman" 
>>>
>>> There are a number of places in kernel code where the stack trace is not
>>> reliable. Enhance the unwinder to check for those cases and mark the
>>> stack trace as unreliable. Once all of the checks are in place, the unwinder
>>> can provide a reliable stack trace. But before this can be used for livepatch,
>>> some other entity needs to guarantee that the frame pointers are all set up
>>> correctly in kernel functions. objtool is currently being worked on to
>>> fill that gap.
>>>
>>> Except for the return address check, all the other checks involve checking
>>> the return PC of every frame against certain kernel functions. To do this,
>>> implement some infrastructure code:
>>>
>>> - Define a special_functions[] array and populate the array with
>>>   the special functions
>>
>> I'm not too keen on having to manually collate this within the unwinder,
>> as it's very painful from a maintenance perspective.
> 
> Agreed.
> 
>> I'd much rather we could associate this information with the
>> implementations of these functions, so that they're more likely to
>> stay in sync.
>>
>> Further, I believe all the special cases are assembly functions, and
>> most of those are already in special sections to begin with. I reckon
>> it'd be simpler and more robust to reject unwinding based on the
>> section. If we need to unwind across specific functions in those
>> sections, we could opt-in with some metadata. So e.g. we could reject
>> all functions in ".entry.text", special casing the EL0 entry functions
>> if necessary.
> 
> Couldn't this also end up being somewhat fragile?  Saying "certain
> sections are deemed unreliable" isn't necessarily obvious to somebody
> who doesn't already know about it, and it could be overlooked or
> forgotten over time.  And there's no way to enforce it stays that way.
> 

Good point!

> FWIW, over the years we've had zero issues with encoding the frame
> pointer on x86.  After you save pt_regs, you encode the frame pointer to
> point to it.  Ideally in the same macro so it's hard to overlook.
> 

I had the same opinion. In fact, in my encoding scheme, I have additional
checks to make absolutely sure that it is a true encoding and not stack
corruption. The chances of all of those values accidentally matching are,
well, null.
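
For reference, the basic marking works roughly like this (a sketch, not my
actual scheme; it relies on frame pointers being word-aligned so bit 0 is
free):

	/* Sketch: bit 0 of a saved FP marks "this frame points at pt_regs". */
	#define FP_REGS_MARK	1UL

	static inline bool fp_is_regs(unsigned long fp)
	{
		return fp & FP_REGS_MARK;
	}

	static inline struct pt_regs *fp_to_regs(unsigned long fp)
	{
		return (struct pt_regs *)(fp & ~FP_REGS_MARK);
	}

The additional checks I mentioned sit on top of something like this.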

> If you're concerned about debuggers getting confused by the encoding -
> which debuggers specifically?  In my experience, if vmlinux has
> debuginfo, gdb and most other debuggers will use DWARF (which is already
> broken in asm code) and completely ignore frame pointers.
> 

Yes. I checked gdb actually. It did not show a problem.

>> I think there's a lot more code that we cannot unwind, e.g. KVM
>> exception code, or almost anything marked with SYM_CODE_END().
> 
> Just a reminder that livepatch only unwinds blocked tasks (plus the
> 'current' task which calls into livepatch).  So practically speaking, it
> doesn't matter whether the 'unreliable' detection has full coverage.
> The only exceptions which really matter are those which end up calling
> schedule(), e.g. preemption or page faults.
> 
> Being able to consistently detect *all* possible unreliable paths would
> be nice in theory, but it's unnecessary and may not be worth the extra
> complexity.
> 

You do have a point. I tried to think of arch_stack_walk_reliable() as
something that should be implemented independent of livepatching. But
I could not really come up with a single example of where else it would
really be useful.

So, if we assume that the reliable stack trace is solely for the purpose
of livepatching, I agree with your earlier comments as well.

Thanks!

Madhavan


Re: [RFC PATCH v2 3/4] arm64: Detect FTRACE cases that make the stack trace unreliable

2021-04-09 Thread Madhavan T. Venkataraman


>> Also, the Function Graph Tracer modifies the return address of a traced
>> function to a return trampoline to gather tracing data on function return.
>> Stack traces taken from that trampoline and functions it calls are
>> unreliable as the original return address may not be available in
>> that context. Mark the stack trace unreliable accordingly.
>>
>> Signed-off-by: Madhavan T. Venkataraman 
>> ---
>>  arch/arm64/kernel/entry-ftrace.S | 12 +++
>>  arch/arm64/kernel/stacktrace.c   | 61 
>>  2 files changed, 73 insertions(+)
>>
>> diff --git a/arch/arm64/kernel/entry-ftrace.S b/arch/arm64/kernel/entry-ftrace.S
>> index b3e4f9a088b1..1f0714a50c71 100644
>> --- a/arch/arm64/kernel/entry-ftrace.S
>> +++ b/arch/arm64/kernel/entry-ftrace.S
>> @@ -86,6 +86,18 @@ SYM_CODE_START(ftrace_caller)
>>  b   ftrace_common
>>  SYM_CODE_END(ftrace_caller)
>>  
>> +/*
>> + * A stack trace taken from anywhere in the FTRACE trampoline code should be
>> + * considered unreliable as a tracer function (patched at ftrace_call) could
>> + * potentially set pt_regs->pc and redirect execution to a function different
>> + * than the traced function. E.g., livepatch.
> 
> IIUC the issue here that we have two copies of the pc: one in the regs,
> and one in a frame record, and so after the update to the regs, the
> frame record is stale.
> 
> This is something that we could fix by having
> ftrace_instruction_pointer_set() set both.
> 

Yes. I will look at this.
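
Presumably something along these lines (a rough sketch only; it assumes the
frame record saved by the ftrace entry code is reachable as a stackframe field
in pt_regs, with the duplicate PC in stackframe[1]):

	/* Sketch: update both copies of the PC so the frame record does
	 * not go stale when a tracer (e.g. livepatch) redirects execution.
	 */
	static void set_pc_and_frame(struct pt_regs *regs, unsigned long ip)
	{
		regs->pc = ip;
		regs->stackframe[1] = ip;
	}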

> However, as noted elsewhere there are other issues which mean we'd still
> need special unwinding code for this.
> 

The only other cases we have discussed are EL1 exceptions in the ftrace code
and the return trampoline for function graph tracing. Is there any other case?

Thanks.

Madhavan


Re: [RFC PATCH v2 0/4] arm64: Implement stack trace reliability checks

2021-04-09 Thread Madhavan T. Venkataraman



On 4/9/21 7:09 AM, Mark Rutland wrote:
> Hi Madhavan,
> 
> I've noted some concerns below. At a high-level, I'm not keen on the
> blacklisting approach, and I think there's some other preparatory work
> that would be more valuable in the short term.
> 

Some kind of blacklisting has to be done whichever way you do it.

> On Mon, Apr 05, 2021 at 03:43:09PM -0500, madve...@linux.microsoft.com wrote:
>> From: "Madhavan T. Venkataraman" 
>>
>> There are a number of places in kernel code where the stack trace is not
>> reliable. Enhance the unwinder to check for those cases and mark the
>> stack trace as unreliable. Once all of the checks are in place, the unwinder
>> can provide a reliable stack trace. But before this can be used for livepatch,
>> some other entity needs to guarantee that the frame pointers are all set up
>> correctly in kernel functions. objtool is currently being worked on to
>> fill that gap.
>>
>> Except for the return address check, all the other checks involve checking
>> the return PC of every frame against certain kernel functions. To do this,
>> implement some infrastructure code:
>>
>>  - Define a special_functions[] array and populate the array with
>>the special functions
> 
> I'm not too keen on having to manually collate this within the unwinder,
> as it's very painful from a maintenance perspective. I'd much rather we
> could associate this information with the implementations of these
> functions, so that they're more likely to stay in sync.
> 
> Further, I believe all the special cases are assembly functions, and
> most of those are already in special sections to begin with. I reckon
> it'd be simpler and more robust to reject unwinding based on the
> section. If we need to unwind across specific functions in those
> sections, we could opt-in with some metadata. So e.g. we could reject
> all functions in ".entry.text", special casing the EL0 entry functions
> if necessary.
> 

Yes. I have already agreed that using sections is the way to go. I am working
on that now.

> As I mentioned before, I'm currently reworking the entry assembly to
> make this simpler to do. I'd prefer to not make invasive changes in that
> area until that's sorted.
> 

I don't plan to make any invasive changes. But a couple of cosmetic changes
may be necessary. I don't know yet. But I will keep in mind that you don't
want any invasive changes there.

> I think there's a lot more code that we cannot unwind, e.g. KVM
> exception code, or almost anything marked with SYM_CODE_END().
> 

As Mark Brown suggested, I will take a look at all code that is marked as
SYM_CODE. His idea of placing all SYM_CODE in a separate section, blacklisting
that to begin with, and refining things as we go along appears to me to be a
reasonable approach.

>>  - Using kallsyms_lookup(), lookup the symbol table entries for the
>>functions and record their address ranges
>>
>>  - Define an is_reliable_function(pc) to match a return PC against
>>the special functions.
>>
>> The unwinder calls is_reliable_function(pc) for every return PC and marks
>> the stack trace as reliable or unreliable accordingly.
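>>
>> A rough sketch of that infrastructure (NR_SPECIAL and the field names are
>> illustrative):
>>
>> 	struct function_range {
>> 		unsigned long start;	/* filled in via kallsyms_lookup() */
>> 		unsigned long end;
>> 	};
>> 	static struct function_range special_functions[NR_SPECIAL];
>>
>> 	static bool is_reliable_function(unsigned long pc)
>> 	{
>> 		int i;
>>
>> 		for (i = 0; i < NR_SPECIAL; i++)
>> 			if (pc >= special_functions[i].start &&
>> 			    pc < special_functions[i].end)
>> 				return false;	/* PC is in a special function */
>> 		return true;
>> 	}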
>>
>> Return address check
>> 
>>
>> Check the return PC of every stack frame to make sure that it is a valid
>> kernel text address (and not some generated code, for example).
>>
>> Detect EL1 exception frame
>> ==
>>
>> EL1 exceptions can happen on any instruction including instructions in
>> the frame pointer prolog or epilog. Depending on where exactly they happen,
>> they could render the stack trace unreliable.
>>
>> Add all of the EL1 exception handlers to special_functions[].
>>
>>  - el1_sync()
>>  - el1_irq()
>>  - el1_error()
>>  - el1_sync_invalid()
>>  - el1_irq_invalid()
>>  - el1_fiq_invalid()
>>  - el1_error_invalid()
>>
>> Detect ftrace frame
>> ===
>>
>> When FTRACE executes at the beginning of a traced function, it creates two
>> frames and calls the tracer function:
>>
>>  - One frame for the traced function
>>
>>  - One frame for the caller of the traced function
>>
>> That gives a sensible stack trace while executing in the tracer function.
>> When FTRACE returns to the traced function, the frames are popped and
>> everything is back to normal.
>>
>> However, in cases like live patch, the tracer function redirects execution
>> to a different function. When FTRACE returns, control will g

Re: [RFC PATCH v2 3/4] arm64: Detect FTRACE cases that make the stack trace unreliable

2021-04-09 Thread Madhavan T. Venkataraman



On 4/9/21 6:31 AM, Mark Brown wrote:
> On Thu, Apr 08, 2021 at 02:23:39PM -0500, Madhavan T. Venkataraman wrote:
>> On 4/8/21 11:58 AM, Mark Brown wrote:
> 
>>> This looks good to me however I'd really like someone who has a firmer
>>> understanding of what ftrace is doing to double check as it is entirely
>>> likely that I am missing cases here, it seems likely that if I am
>>> missing stuff it's extra stuff that needs to be added and we're not
>>> actually making use of the reliability information yet.
> 
>> OK. So, do you have some specific reviewer(s) in mind? Apart from yourself,
>> Mark Rutland and Josh Poimboeuf, these are some reviewers I can think of
>> (in alphabetical order):
> 
> Mainly Mark Rutland, but generally someone else who has looked at ftrace
> on arm64 in detail.  It was mainly a comment to say I wasn't going to do
> any kind of Reviewed-by but also hadn't spotted any issues.
> 

Understood.

Madhavan


Re: [RFC PATCH v2 1/4] arm64: Implement infrastructure for stack trace reliability checks

2021-04-08 Thread Madhavan T. Venkataraman



On 4/8/21 2:30 PM, Madhavan T. Venkataraman wrote:
> 
> 
> On 4/8/21 12:17 PM, Mark Brown wrote:
>> On Mon, Apr 05, 2021 at 03:43:10PM -0500, madve...@linux.microsoft.com wrote:
>>
>>> These checks will involve checking the return PC to see if it falls inside
>>> any special functions where the stack trace is considered unreliable.
>>> Implement the infrastructure needed for this.
>>
>> Following up again based on an off-list discussion with Mark Rutland:
>> while I think this is a reasonable implementation for specifically
>> listing functions that cause problems we could make life easier for
>> ourselves by instead using annotations at the call sites to put things
>> into sections which indicate that they're unsafe for unwinding, we can
>> then check for any address in one of those sections (or possibly do the
>> reverse and check for any address in a section we specifically know is
>> safe) rather than having to enumerate problematic functions in the
>> unwinder.  This also has the advantage of not having a list that's
>> separate to the functions themselves so it's less likely that the
>> unwinder will get out of sync with the rest of the code as things evolve.
>>
>> We already have SYM_CODE_START() annotations in the code for assembly
>> functions that aren't using the standard calling convention which should
>> help a lot here, we could add a variant of that for things that we know
>> are safe on stacks (like those we expect to find at the bottom of
>> stacks).
>>
> 
> As I already mentioned before, I like the idea of sections. The only reason
> that I did not try it was that I have to address FTRACE trampolines and the
> kretprobe_trampoline (and optprobes in the future).
> 
> I have the following options:
> 
> 1. Create a common section (I will have to come up with an appropriate name)
>    and put all such functions in that one section.
> 
> 2. Create one section for each logical type (exception section, ftrace
>    section, and kprobe section) or some such.
> 

For now, I will start with idea 2. I will create a special section for each
class of functions (EL1 exception handlers, FTRACE trampolines, KPROBE
trampolines). Instead of a special functions array, I will implement a
special_sections array. The rest of the code should just fall into place.
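
Roughly like this (a sketch; the section names are placeholders I have not
settled on):

	/* Sketch: hypothetical per-class section bounds (placeholder names). */
	extern char __el1_exception_text_start[], __el1_exception_text_end[];
	extern char __ftrace_tramp_text_start[], __ftrace_tramp_text_end[];
	extern char __kprobes_tramp_text_start[], __kprobes_tramp_text_end[];

	static const struct {
		char *start, *end;
	} special_sections[] = {
		{ __el1_exception_text_start, __el1_exception_text_end },
		{ __ftrace_tramp_text_start,  __ftrace_tramp_text_end  },
		{ __kprobes_tramp_text_start, __kprobes_tramp_text_end },
	};

	static bool pc_in_special_section(unsigned long pc)
	{
		int i;

		for (i = 0; i < ARRAY_SIZE(special_sections); i++)
			if (pc >= (unsigned long)special_sections[i].start &&
			    pc < (unsigned long)special_sections[i].end)
				return true;
		return false;
	}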

Let me know if you prefer something different.

Thanks.

Madhavan

> 3. Use the section idea only for the el1 exceptions. For the others, use the
>    current special_functions[] approach.
> 
> Which one do you and Mark Rutland prefer? Or, is there another choice?
> 
> Madhavan
> 


Re: [RFC PATCH v2 1/4] arm64: Implement infrastructure for stack trace reliability checks

2021-04-08 Thread Madhavan T. Venkataraman



On 4/8/21 12:17 PM, Mark Brown wrote:
> On Mon, Apr 05, 2021 at 03:43:10PM -0500, madve...@linux.microsoft.com wrote:
> 
>> These checks will involve checking the return PC to see if it falls inside
>> any special functions where the stack trace is considered unreliable.
>> Implement the infrastructure needed for this.
> 
> Following up again based on an off-list discussion with Mark Rutland:
> while I think this is a reasonable implementation for specifically
> listing functions that cause problems we could make life easier for
> ourselves by instead using annotations at the call sites to put things
> into sections which indicate that they're unsafe for unwinding, we can
> then check for any address in one of those sections (or possibly do the
> reverse and check for any address in a section we specifically know is
> safe) rather than having to enumerate problematic functions in the
> unwinder.  This also has the advantage of not having a list that's
> separate to the functions themselves so it's less likely that the
> unwinder will get out of sync with the rest of the code as things evolve.
> 
> We already have SYM_CODE_START() annotations in the code for assembly
> functions that aren't using the standard calling convention which should
> help a lot here, we could add a variant of that for things that we know
> are safe on stacks (like those we expect to find at the bottom of
> stacks).
> 

As I already mentioned before, I like the idea of sections. The only reason
that I did not try it was that I have to address FTRACE trampolines and the
kretprobe_trampoline (and optprobes in the future).

I have the following options:

1. Create a common section (I will have to come up with an appropriate name)
   and put all such functions in that one section.

2. Create one section for each logical type (exception section, ftrace section,
   and kprobe section) or some such.

3. Use the section idea only for the el1 exceptions. For the others, use the
   current special_functions[] approach.

Which one do you and Mark Rutland prefer? Or, is there another choice?

Madhavan


Re: [RFC PATCH v2 3/4] arm64: Detect FTRACE cases that make the stack trace unreliable

2021-04-08 Thread Madhavan T. Venkataraman



On 4/8/21 11:58 AM, Mark Brown wrote:
> On Mon, Apr 05, 2021 at 03:43:12PM -0500, madve...@linux.microsoft.com wrote:
>> From: "Madhavan T. Venkataraman" 
>>
>> When CONFIG_DYNAMIC_FTRACE_WITH_REGS is enabled and tracing is activated
>> for a function, the ftrace infrastructure is called for the function at
>> the very beginning. Ftrace creates two frames:
> 
> This looks good to me however I'd really like someone who has a firmer
> understanding of what ftrace is doing to double check as it is entirely
> likely that I am missing cases here, it seems likely that if I am
> missing stuff it's extra stuff that needs to be added and we're not
> actually making use of the reliability information yet.
> 

OK. So, do you have some specific reviewer(s) in mind? Apart from yourself,
Mark Rutland and Josh Poimboeuf, these are some reviewers I can think of (in
alphabetical order):

AKASHI Takahiro 
Ard Biesheuvel 
Catalin Marinas 
Josh Poimboeuf 
Steven Rostedt (VMware) 
Torsten Duwe 
Will Deacon 

Sorry if I missed out any of the other experts.

Thanks.

Madhavan 


Re: [RFC PATCH v1 0/4] arm64: Implement stack trace reliability checks

2021-04-05 Thread Madhavan T. Venkataraman



On 4/5/21 9:56 AM, Madhavan T. Venkataraman wrote:
> 
> 
> On 4/5/21 8:24 AM, Masami Hiramatsu wrote:
>> Hi Madhavan,
>>
>> On Sat, 3 Apr 2021 22:29:12 -0500
>> "Madhavan T. Venkataraman"  wrote:
>>
>>
>>>>> Check for kretprobe
>>>>> ===
>>>>>
>>>>> For functions with a kretprobe set up, probe code executes on entry
>>>>> to the function and replaces the return address in the stack frame with a
>>>>> kretprobe trampoline. Whenever the function returns, control is
>>>>> transferred to the trampoline. The trampoline eventually returns to the
>>>>> original return address.
>>>>>
>>>>> A stack trace taken while executing in the function (or in functions that
>>>>> get called from the function) will not show the original return address.
>>>>> Similarly, a stack trace taken while executing in the trampoline itself
>>>>> (and functions that get called from the trampoline) will not show the
>>>>> original return address. This means that the caller of the probed function
>>>>> will not show. This makes the stack trace unreliable.
>>>>>
>>>>> Add the kretprobe trampoline to special_functions[].
>>>>>
>>>>> FYI, each task contains a task->kretprobe_instances list that can
>>>>> theoretically be consulted to find the original return address. But I am
>>>>> not entirely sure how to safely traverse that list for stack traces
>>>>> not on the current process. So, I have taken the easy way out.
>>>>
>>>> For kretprobes, unwinding from the trampoline or kretprobe handler
>>>> shouldn't be a reliability concern for live patching, for similar
>>>> reasons as above.
>>>>
>>>
>>> Please see previous answer.
>>>
>>>> Otherwise, when unwinding from a blocked task which has
>>>> 'kretprobe_trampoline' on the stack, the unwinder needs a way to get the
>>>> original return address.  Masami has been working on an interface to
>>>> make that possible for x86.  I assume something similar could be done
>>>> for arm64.
>>>>
>>>
>>> OK. Until that is available, this case needs to be addressed.
>>
>> Actually, I've done that on arm64 :) See below patch.
>> (and I also have a similar code for arm32, what I'm considering is how
>> to unify x86/arm/arm64 kretprobe_find_ret_addr(), since those are very
>> similar.)
>>
>> This is applicable on my x86 series v5
>>
>> https://lore.kernel.org/bpf/161676170650.330141.6214727134265514123.stgit@devnote2/
>>
>> Thank you,
>>
>>
> 
> I took a brief look at your changes. Looks reasonable.
> 
> However, for now, I am going to include the kretprobe_trampoline in the
> special_functions[] array until your changes are merged. At that point, it is
> just a matter of deleting kretprobe_trampoline from the special_functions[]
> array. That is all.
> 
> I hope that is fine with everyone.
> 

Actually, there may still be a problem to solve.

If arch_stack_walk_reliable() is ever called from within kretprobe_trampoline()
for debugging or other purposes after the instance is deleted from the task
instance list, it would not be able to retrieve the original return address.

The stack trace would be unreliable in that case, would it not?
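
(Until then, the conservative check would be something like this sketch, where
kretprobe_trampoline_end is an illustrative bound:)

	/* Sketch: a return PC inside the trampoline cannot be mapped back
	 * to the original return address here, so mark the trace unreliable.
	 */
	if (pc >= (unsigned long)kretprobe_trampoline &&
	    pc < (unsigned long)kretprobe_trampoline_end)
		frame->reliable = false;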

Madhavan



Re: [RFC PATCH v1 0/4] arm64: Implement stack trace reliability checks

2021-04-05 Thread Madhavan T. Venkataraman



On 4/5/21 8:24 AM, Masami Hiramatsu wrote:
> Hi Madhavan,
> 
> On Sat, 3 Apr 2021 22:29:12 -0500
> "Madhavan T. Venkataraman"  wrote:
> 
> 
>>>> Check for kretprobe
>>>> ===
>>>>
>>>> For functions with a kretprobe set up, probe code executes on entry
>>>> to the function and replaces the return address in the stack frame with a
>>>> kretprobe trampoline. Whenever the function returns, control is
>>>> transferred to the trampoline. The trampoline eventually returns to the
>>>> original return address.
>>>>
>>>> A stack trace taken while executing in the function (or in functions that
>>>> get called from the function) will not show the original return address.
>>>> Similarly, a stack trace taken while executing in the trampoline itself
>>>> (and functions that get called from the trampoline) will not show the
>>>> original return address. This means that the caller of the probed function
>>>> will not show. This makes the stack trace unreliable.
>>>>
>>>> Add the kretprobe trampoline to special_functions[].
>>>>
>>>> FYI, each task contains a task->kretprobe_instances list that can
>>>> theoretically be consulted to find the original return address. But I am
>>>> not entirely sure how to safely traverse that list for stack traces
>>>> not on the current process. So, I have taken the easy way out.
>>>
>>> For kretprobes, unwinding from the trampoline or kretprobe handler
>>> shouldn't be a reliability concern for live patching, for similar
>>> reasons as above.
>>>
>>
>> Please see previous answer.
>>
>>> Otherwise, when unwinding from a blocked task which has
>>> 'kretprobe_trampoline' on the stack, the unwinder needs a way to get the
>>> original return address.  Masami has been working on an interface to
>>> make that possible for x86.  I assume something similar could be done
>>> for arm64.
>>>
>>
>> OK. Until that is available, this case needs to be addressed.
> 
> Actually, I've done that on arm64 :) See below patch.
> (and I also have a similar code for arm32, what I'm considering is how
> to unify x86/arm/arm64 kretprobe_find_ret_addr(), since those are very
> similar.)
> 
> This is applicable on my x86 series v5
> 
> https://lore.kernel.org/bpf/161676170650.330141.6214727134265514123.stgit@devnote2/
> 
> Thank you,
> 
> 

I took a brief look at your changes. Looks reasonable.

However, for now, I am going to include the kretprobe_trampoline in the
special_functions[] array until your changes are merged. At that point, it is
just a matter of deleting kretprobe_trampoline from the special_functions[]
array. That is all.

I hope that is fine with everyone.

Madhavan



Re: [RFC PATCH v1 0/4] arm64: Implement stack trace reliability checks

2021-04-05 Thread Madhavan T. Venkataraman



On 4/5/21 8:24 AM, Masami Hiramatsu wrote:
> Hi Madhavan,
> 
> On Sat, 3 Apr 2021 22:29:12 -0500
> "Madhavan T. Venkataraman"  wrote:
> 
> 
>>>> Check for kretprobe
>>>> ===
>>>>
>>>> For functions with a kretprobe set up, probe code executes on entry
>>>> to the function and replaces the return address in the stack frame with a
>>>> kretprobe trampoline. Whenever the function returns, control is
>>>> transferred to the trampoline. The trampoline eventually returns to the
>>>> original return address.
>>>>
>>>> A stack trace taken while executing in the function (or in functions that
>>>> get called from the function) will not show the original return address.
>>>> Similarly, a stack trace taken while executing in the trampoline itself
>>>> (and functions that get called from the trampoline) will not show the
>>>> original return address. This means that the caller of the probed function
>>>> will not show. This makes the stack trace unreliable.
>>>>
>>>> Add the kretprobe trampoline to special_functions[].
>>>>
>>>> FYI, each task contains a task->kretprobe_instances list that can
>>>> theoretically be consulted to find the orginal return address. But I am
>>>> not entirely sure how to safely traverse that list for stack traces
>>>> not on the current process. So, I have taken the easy way out.
>>>
>>> For kretprobes, unwinding from the trampoline or kretprobe handler
>>> shouldn't be a reliability concern for live patching, for similar
>>> reasons as above.
>>>
>>
>> Please see previous answer.
>>
>>> Otherwise, when unwinding from a blocked task which has
>>> 'kretprobe_trampoline' on the stack, the unwinder needs a way to get the
>>> original return address.  Masami has been working on an interface to
>>> make that possible for x86.  I assume something similar could be done
>>> for arm64.
>>>
>>
>> OK. Until that is available, this case needs to be addressed.
> 
> Actually, I've done that on arm64 :) See below patch.
> (and I also have a similar code for arm32, what I'm considering is how
> to unify x86/arm/arm64 kretprobe_find_ret_addr(), since those are very
> similar.)
> 
> This is applicable on my x86 series v5
> 
> https://lore.kernel.org/bpf/161676170650.330141.6214727134265514123.stgit@devnote2/
> 
> Thank you,
> 
> 

OK. I will take a look.

Thanks.

Madhavan


Re: [RFC PATCH v2 1/1] arm64: Implement stack trace termination record

2021-04-04 Thread Madhavan T. Venkataraman



On 4/3/21 11:40 PM, Madhavan T. Venkataraman wrote:
> 
> 
> On 4/3/21 10:46 PM, Madhavan T. Venkataraman wrote:
>>> I'm somewhat arm-ignorant, so take the following comments with a grain
>>> of salt.
>>>
>>>
>>> I don't think changing these to 'bl' is necessary, unless you wanted
>>> __primary_switched() and __secondary_switched() to show up in the
>>> stacktrace for some reason?  If so, that seems like a separate patch.
>>>
>> The problem is with __secondary_switched. If you trace the code back to where
>> a secondary CPU is started, I don't see any calls anywhere. There are only
>> branches if I am not mistaken. So, the return address register never gets
>> set up with a proper address. The stack trace shows some hexadecimal value
>> instead of a symbol name.
>>
> 
> Actually, I take that back. There are calls in that code path. But I did only
> see some hexadecimal value instead of a proper address in the stack trace.
> Sorry about that confusion.
> 

Again, I apologize. I had this confused with something else in my notes.

So, the stack trace looks like this without my change that converts the branch
to secondary_start_kernel() into a call:

 ...
[0.022492]   secondary_start_kernel+0x188/0x1e0
[0.022503]   0xf8689e1cc

It looks like the code calls __enable_mmu before reaching the place where it
branches to secondary_start_kernel().

bl  __enable_mmu

The return address register should be set to the next instruction address. I am
guessing that the return address is 0xf8689e1cc because of the idmap stuff.

Madhavan


Re: [RFC PATCH v2 1/1] arm64: Implement stack trace termination record

2021-04-03 Thread Madhavan T. Venkataraman



On 4/3/21 10:46 PM, Madhavan T. Venkataraman wrote:
>> I'm somewhat arm-ignorant, so take the following comments with a grain
>> of salt.
>>
>>
>> I don't think changing these to 'bl' is necessary, unless you wanted
>> __primary_switched() and __secondary_switched() to show up in the
>> stacktrace for some reason?  If so, that seems like a separate patch.
>>
> The problem is with __secondary_switched. If you trace the code back to where
> a secondary CPU is started, I don't see any calls anywhere. There are only
> branches if I am not mistaken. So, the return address register never gets
> set up with a proper address. The stack trace shows some hexadecimal value
> instead of a symbol name.
> 

Actually, I take that back. There are calls in that code path. But I did only
see some hexadecimal value instead of a proper address in the stack trace.
Sorry about that confusion.

My reason to convert the branches to calls is this: the value of the return
address register at that point is the return PC of the previous branch-and-link
instruction, wherever that happens to be. I think that is a little arbitrary.

Instead, if I call start_kernel() and secondary_start_kernel(), the return
address gets set up to the next instruction which, IMHO, is better.

But I am open to other suggestions.

Madhavan


Re: [RFC PATCH v2 1/1] arm64: Implement stack trace termination record

2021-04-03 Thread Madhavan T. Venkataraman



On 4/3/21 10:59 AM, Josh Poimboeuf wrote:
> On Thu, Apr 01, 2021 at 10:24:04PM -0500, madve...@linux.microsoft.com wrote:
>> From: "Madhavan T. Venkataraman" 
>> @@ -447,9 +464,9 @@ SYM_FUNC_START_LOCAL(__primary_switched)
>>  #endif
>>  bl  switch_to_vhe   // Prefer VHE if possible
>>  add sp, sp, #16
>> -mov x29, #0
>> -mov x30, #0
>> -b   start_kernel
>> +setup_final_frame
>> +bl  start_kernel
>> +nop
>>  SYM_FUNC_END(__primary_switched)
>>  
>>  .pushsection ".rodata", "a"
>> @@ -606,14 +623,14 @@ SYM_FUNC_START_LOCAL(__secondary_switched)
>>  cbz x2, __secondary_too_slow
>>  msr sp_el0, x2
>>  scs_load x2, x3
>> -mov x29, #0
>> -mov x30, #0
>> +setup_final_frame
>>  
>>  #ifdef CONFIG_ARM64_PTR_AUTH
>>  ptrauth_keys_init_cpu x2, x3, x4, x5
>>  #endif
>>  
>> -b   secondary_start_kernel
>> +bl  secondary_start_kernel
>> +nop
>>  SYM_FUNC_END(__secondary_switched)
> 
> I'm somewhat arm-ignorant, so take the following comments with a grain
> of salt.
> 
> 
> I don't think changing these to 'bl' is necessary, unless you wanted
> __primary_switched() and __secondary_switched() to show up in the
> stacktrace for some reason?  If so, that seems like a separate patch.
> 

The problem is with __secondary_switched. If you trace the code back to where
a secondary CPU is started, I don't see any calls anywhere. There are only
branches if I am not mistaken. So, the return address register never gets
set up with a proper address. The stack trace shows some hexadecimal value
instead of a symbol name.

On ARM64, the call instruction is actually a branch instruction IIUC. The
only extra thing it does is to load the link register (return address register)
with the return address. That is all.

Instead of the link register pointing to some arbitrary code in startup that
did not call start_kernel() or secondary_start_kernel(), I wanted to set it
up as shown above.

> 
> Also, why are nops added after the calls?  My guess would be because,
> since these are basically tail calls to "noreturn" functions, the stack
> dump code would otherwise show the wrong function, i.e. whatever
> function happens to be after the 'bl'.
> 

That is correct. The stack trace shows something arbitrary.

> We had the same issue for x86.  It can be fixed by using '%pB' instead
> of '%pS' when printing the address in dump_backtrace_entry().  See
> sprint_backtrace() for more details.
> 
> BTW I think the same issue exists for GCC-generated code.  The following
> shows several such cases:
> 
>   objdump -dr vmlinux |awk '/bl   / {bl=1;l=$0;next} bl == 1 && /^$/ {print l; print} // {bl=0}'
> 
> 
> However, looking at how arm64 unwinds through exceptions in kernel
> space, using '%pB' might have side effects when the exception LR
> (elr_el1) points to the beginning of a function.  Then '%pB' would show
> the end of the previous function, instead of the function which was
> interrupted.
> 
> So you may need to rethink how to unwind through in-kernel exceptions.
> 
> Basically, when printing a stack return address, you want to use '%pB'
> for a call return address and '%pS' for an interrupted address.
> 
> On x86, with the frame pointer unwinder, we encode the frame pointer by
> setting a bit in %rbp which tells the unwinder that it's a special
> pt_regs frame.  Then instead of treating it like a normal call frame,
> the stack dump code prints the registers, and the return address
> (regs->ip) gets printed with '%pS'.
> 

Yes. But there are objections to that kind of encoding.

Having the nop above does not do any harm. It just adds 4 bytes to the
function text. I would rather keep this simple right now because this is only
for getting a sensible stack trace for idle tasks.
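
That said, for reference, my understanding of the selection you describe is
something like this (just a sketch; the "exception" flag is hypothetical):

	/*
	 * %pS resolves the exact address (an interrupted PC); %pB backs up
	 * to the preceding call instruction (a call return address).
	 */
	if (exception)
		printk(" %pS\n", (void *)addr);
	else
		printk(" %pB\n", (void *)addr);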

Is there any other problem that you can see?

>>  SYM_FUNC_START_LOCAL(__secondary_too_slow)
>> diff --git a/arch/arm64/kernel/process.c b/arch/arm64/kernel/process.c
>> index 325c83b1a24d..906baa232a89 100644
>> --- a/arch/arm64/kernel/process.c
>> +++ b/arch/arm64/kernel/process.c
>> @@ -437,6 +437,11 @@ int copy_thread(unsigned long clone_flags, unsigned long stack_start,
>>  }
>>  p->thread.cpu_context.pc = (unsigned long)ret_from_fork;
>>  p->thread.cpu_context.sp = (unsigned long)childregs;
>> +/*
>> + * For the benefit of the unwinder, set up childregs->stackframe
>> + * as the final frame for the new task.
>> + */
>> +p->thread.cpu_context.fp = (unsigned long)childregs->stackframe;

Re: [RFC PATCH v1 0/4] arm64: Implement stack trace reliability checks

2021-04-03 Thread Madhavan T. Venkataraman



On 4/3/21 12:01 PM, Josh Poimboeuf wrote:
> On Tue, Mar 30, 2021 at 02:09:51PM -0500, madve...@linux.microsoft.com wrote:
>> From: "Madhavan T. Venkataraman" 
>>
>> There are a number of places in kernel code where the stack trace is not
>> reliable. Enhance the unwinder to check for those cases and mark the
>> stack trace as unreliable. Once all of the checks are in place, the unwinder
>> can be used for livepatching.
> 
> This assumes all such places are known.  That's a big assumption, as
> 
> a) hand-written asm code may inadvertently skip frame pointer setup;
> 
> b) for inline asm which calls a function, the compiler may blindly
>insert it into a function before the frame pointer setup.
> 
> That's where objtool stack validation would come in.
>

Yes. I meant that the reliable stack trace in the kernel is necessary. I did
not imply that it was sufficient. Clearly, it is not sufficient. It relies
on the frame pointer being set up correctly for all functions. That has to be
guaranteed by another entity such as objtool.

So, I will improve the wording and make it clear in the next version.

>> Detect EL1 exception frame
>> ==========================
>>
>> EL1 exceptions can happen on any instruction including instructions in
>> the frame pointer prolog or epilog. Depending on where exactly they happen,
>> they could render the stack trace unreliable.
>>
>> Add all of the EL1 exception handlers to special_functions[].
>>
>>  - el1_sync()
>>  - el1_irq()
>>  - el1_error()
>>  - el1_sync_invalid()
>>  - el1_irq_invalid()
>>  - el1_fiq_invalid()
>>  - el1_error_invalid()
> 
> A possibly more robust alternative would be to somehow mark el1
> exception frames so the unwinder can detect them more generally.
> 
> For example, as described in my previous email, encode the frame pointer
> so the unwinder can detect el1 frames automatically.
> 

Encoding the frame pointer by setting the LSB (like X86) was my first solution.
Mark Rutland NAKed it. His objection was that it would confuse the debuggers
which are expecting the last 4 bits of the frame pointer to be 0. I agree with
this objection.

My problem with the encoding was also that it is not possible to know if the
LSB was set because of encoding or because of stack corruption.

My second attempt was to encode the frame pointer indirectly. That is, make
pt_regs->stackframe the exception frame and use other fields in the pt_regs
(including a frame type encoding field) for verification.

Mark Rutland NAKed it. His objection (if I am rephrasing it correctly) was that
garbage on the stack may accidentally match the values the unwinder checks in
the pt_regs (however unlikely that match might be).

The consensus was that the return PC must be checked against special functions
to recognize those special cases as the special functions are only invoked in
those special contexts and nowhere else.

As an aside, Mark Brown suggested (if I recall correctly) that the exception
functions could be placed in a special exception section so the unwinder can
check a return PC against the section bounds instead of individual functions.
I did consider implementing this. But I needed a way to address FTRACE
trampolines and KPROBE trampolines as well. So, I did not do that.
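
For reference, that check would have been something simple like this (a
sketch; the section marker names are hypothetical, not existing symbols):

	extern char __exception_text_start[], __exception_text_end[];

	static bool pc_in_exception_text(unsigned long pc)
	{
		return pc >= (unsigned long)__exception_text_start &&
		       pc < (unsigned long)__exception_text_end;
	}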


>> Detect ftrace frame
>> ===================
>>
>> When FTRACE executes at the beginning of a traced function, it creates two
>> frames and calls the tracer function:
>>
>>  - One frame for the traced function
>>
>>  - One frame for the caller of the traced function
>>
>> That gives a sensible stack trace while executing in the tracer function.
>> When FTRACE returns to the traced function, the frames are popped and
>> everything is back to normal.
>>
>> However, in cases like live patch, the tracer function redirects execution
>> to a different function. When FTRACE returns, control will go to that target
>> function. A stack trace taken in the tracer function will not show the target
>> function. The target function is the real function that we want to track.
>> So, the stack trace is unreliable.
> 
> I don't think this is a real problem.  Livepatch only checks the stacks
> of blocked tasks (and the task calling into livepatch).  So the
> reliability of unwinding from the livepatch tracer function itself
> (klp_ftrace_handler) isn't a concern since it doesn't sleep.
> 

My thinking was - arch_stack_walk_reliable() should provide a reliable stack
trace and not assume anything about its consumers. It should not assume that
livepatch is the only consumer although it might be.

Theoretically, there can be a tracer fun

Re: [RFC PATCH v1 3/4] arm64: Detect FTRACE cases that make the stack trace unreliable

2021-04-01 Thread Madhavan T. Venkataraman



On 4/1/21 1:53 PM, Madhavan T. Venkataraman wrote:
> 
> 
> On 4/1/21 1:40 PM, Madhavan T. Venkataraman wrote:
>>>> So, it is only defined if CONFIG_FUNCTION_GRAPH_TRACER is defined. I can
>>>> address this as well as your comment by defining another label whose name
>>>> is more meaningful to our use:
>>>>
>>>> +SYM_INNER_LABEL(ftrace_trampoline, SYM_L_GLOBAL) // checked by the unwinder
>>>> #ifdef CONFIG_FUNCTION_GRAPH_TRACER
>>>> SYM_INNER_LABEL(ftrace_graph_call, SYM_L_GLOBAL) // ftrace_graph_caller();
>>>> nop // If enabled, this will be replaced
>>>>     // "b ftrace_graph_caller"
>>>> #endif
>>> I'm not sure we need to bother with that, you'd still need the & I think.
>> I think we need to bother with that. If CONFIG_FUNCTION_GRAPH_TRACER is not
>> on but CONFIG_DYNAMIC_FTRACE_WITH_REGS is, then ftrace_graph_call() will not
>> occur in the stack trace taken from a tracer function. The unwinder still
>> needs to recognize an ftrace frame. I don't want to assume
>> ftrace_common_return, which is the label that currently follows the above
>> code. So, we need a different label outside the above ifdef.
> 
> Alternatively, I could just move the SYM_INNER_LABEL(ftrace_graph_call..) to outside the ifdef.
> 
> Madhavan
> 

Or, even better, I could just use ftrace_call+4 because that would be the return
address for the tracer function at ftrace_call:

SYM_CODE_START(ftrace_common)
sub x0, x30, #AARCH64_INSN_SIZE // ip (callsite's BL insn)
mov x1, x9  // parent_ip (callsite's LR)
ldr_l   x2, function_trace_op   // op
mov x3, sp  // regs

SYM_INNER_LABEL(ftrace_call, SYM_L_GLOBAL)
bl  ftrace_stub

I think that would be cleaner. And, I don't need the complicated comments for ftrace_graph_call.
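
The special_functions[] entry would then become something like this (a
sketch, assuming an extern declaration for the ftrace_call label along the
lines of the existing ftrace_graph_call one):

	/* Return PC of the "bl ftrace_stub" patched at ftrace_call. */
	{ (unsigned long) &ftrace_call + AARCH64_INSN_SIZE, 0 },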

Is this acceptable?

Madhavan


Re: [RFC PATCH v1 3/4] arm64: Detect FTRACE cases that make the stack trace unreliable

2021-04-01 Thread Madhavan T. Venkataraman



On 4/1/21 1:40 PM, Madhavan T. Venkataraman wrote:
>>> So, it is only defined if CONFIG_FUNCTION_GRAPH_TRACER is defined. I can
>>> address this as well as your comment by defining another label whose name
>>> is more meaningful to our use:
>>> +SYM_INNER_LABEL(ftrace_trampoline, SYM_L_GLOBAL) // checked by the unwinder
>>> #ifdef CONFIG_FUNCTION_GRAPH_TRACER
>>> SYM_INNER_LABEL(ftrace_graph_call, SYM_L_GLOBAL) // ftrace_graph_caller();
>>> nop // If enabled, this will be replaced
>>> // "b ftrace_graph_caller"
>>> #endif
>> I'm not sure we need to bother with that, you'd still need the & I think.
> I think we need to bother with that. If CONFIG_FUNCTION_GRAPH_TRACER is not
> on but CONFIG_DYNAMIC_FTRACE_WITH_REGS is, then ftrace_graph_call() will not
> occur in the stack trace taken from a tracer function. The unwinder still
> needs to recognize an ftrace frame. I don't want to assume
> ftrace_common_return, which is the label that currently follows the above
> code. So, we need a different label outside the above ifdef.

Alternatively, I could just move the SYM_INNER_LABEL(ftrace_graph_call..) to outside the ifdef.

Madhavan


Re: [RFC PATCH v1 3/4] arm64: Detect FTRACE cases that make the stack trace unreliable

2021-04-01 Thread Madhavan T. Venkataraman



On 4/1/21 1:28 PM, Mark Brown wrote:
> On Thu, Apr 01, 2021 at 12:43:25PM -0500, Madhavan T. Venkataraman wrote:
> 
>>>> +#ifdef CONFIG_DYNAMIC_FTRACE_WITH_REGS
>>>> +  { (unsigned long) &ftrace_graph_call, 0 },
>>>> +#ifdef CONFIG_FUNCTION_GRAPH_TRACER
>>>> +  { (unsigned long) ftrace_graph_caller, 0 },
> 
>>> It's weird that we take the address of ftrace_graph_call but not the
>>> other functions - we should be consistent or explain why.  It'd probably
>>> also look nicer to not nest the ifdefs, the dependencies in Kconfig will
>>> ensure we only get things when we should.
> 
>> I have explained it in the comment in the FTRACE trampoline right above
>> ftrace_graph_call().
> 
> Ah, right - it's a result of it being an inner label.  I'd suggest
> putting a brief note right at that line of code explaining this (eg,
> "Inner label, not a function"), it wasn't confusing due to the use of
> that symbol but rather due to it being different from everything else
> in the list and that's kind of lost in the main comment.
> 

OK, So, I will add a note in the main comment above the list. I will add the
comment line you have suggested at the exact line.

>> So, it is only defined if CONFIG_FUNCTION_GRAPH_TRACER is defined. I can 
>> address
>> this as well as your comment by defining another label whose name is more 
>> meaningful
>> to our use:
> 
>> +SYM_INNER_LABEL(ftrace_trampoline, SYM_L_GLOBAL) // checked by the unwinder
>> #ifdef CONFIG_FUNCTION_GRAPH_TRACER
>> SYM_INNER_LABEL(ftrace_graph_call, SYM_L_GLOBAL) // ftrace_graph_caller();
>> nop // If enabled, this will be replaced
>> // "b ftrace_graph_caller"
>> #endif
> 
> I'm not sure we need to bother with that, you'd still need the & I think.

I think we need to bother with that. If CONFIG_FUNCTION_GRAPH_TRACER is not on
but CONFIG_DYNAMIC_FTRACE_WITH_REGS is, then ftrace_graph_call() will not occur
in the stack trace taken from a tracer function. The unwinder still needs to
recognize an ftrace frame. I don't want to assume ftrace_common_return, which
is the label that currently follows the above code. So, we need a different
label outside the above ifdef.

As for the &, I needed it because ftrace_graph_call is currently defined
elsewhere as:

extern unsigned long ftrace_graph_call;

I did not want to change that.
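
That is, the two symbols are declared with different kinds, which is why the
entries look inconsistent (sketch):

	extern unsigned long ftrace_graph_call;		/* inner label */
	extern void ftrace_graph_caller(void);		/* real function */

	{ (unsigned long) &ftrace_graph_call, 0 },	/* & required for data */
	{ (unsigned long) ftrace_graph_caller, 0 },	/* name decays to address */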

Thanks,

Madhavan



Re: [RFC PATCH v1 3/4] arm64: Detect FTRACE cases that make the stack trace unreliable

2021-04-01 Thread Madhavan T. Venkataraman



On 4/1/21 9:27 AM, Mark Brown wrote:
> On Tue, Mar 30, 2021 at 02:09:54PM -0500, madve...@linux.microsoft.com wrote:
> 
>> + * FTRACE trampolines.
>> + */
>> +#ifdef CONFIG_DYNAMIC_FTRACE_WITH_REGS
>> +{ (unsigned long) &ftrace_graph_call, 0 },
>> +#ifdef CONFIG_FUNCTION_GRAPH_TRACER
>> +{ (unsigned long) ftrace_graph_caller, 0 },
>> +{ (unsigned long) return_to_handler, 0 },
>> +#endif
>> +#endif
> 
> It's weird that we take the address of ftrace_graph_call but not the
> other functions - we should be consistent or explain why.  It'd probably
> also look nicer to not nest the ifdefs, the dependencies in Kconfig will
> ensure we only get things when we should.
> 

Sorry. I forgot to respond to the nested ifdef comment. I will fix that.

Thanks!

Madhavan


Re: [RFC PATCH v1 1/4] arm64: Implement infrastructure for stack trace reliability checks

2021-04-01 Thread Madhavan T. Venkataraman



On 4/1/21 10:27 AM, Mark Brown wrote:
> On Tue, Mar 30, 2021 at 02:09:52PM -0500, madve...@linux.microsoft.com wrote:
>> From: "Madhavan T. Venkataraman" 
>>
>> Implement a check_reliability() function that will contain checks for the
>> presence of various features and conditions that can render the stack trace
>> unreliable.
> 
> This looks good to me with one minor stylistic thing:
> 
>> +/*
>> + * Special functions where the stack trace is unreliable.
>> + */
>> +static struct function_range special_functions[] = {
>> +{ 0, 0 }
>> +};
> 
> Might be good to put a comment here saying that this is terminating the
> list rather than detecting a NULL function pointer:
> 
>   { /* sentinel */ }
> 
> is a common idiom for that.
> 
> Given that it's a fixed array we could also...
> 
>> +for (func = special_functions; func->start; func++) {
>> +if (pc >= func->start && pc < func->end)
> 
> ...do these as
> 
>   for (i = 0; i < ARRAY_SIZE(special_functions); i++) 
> 
> so you don't need something like that, though that gets awkward when you
> have to write out special_functions[i].field a lot.
> 
> So many different potential colours for the bikeshed!

I will make the above changes.
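
For the record, with both changes the list and the lookup would read
something like this (a sketch of what I plan for the next version):

	static struct function_range special_functions[] = {
		/* entries are added by the later patches */
		{ /* sentinel */ }
	};

	for (func = special_functions; func->start; func++) {
		if (pc >= func->start && pc < func->end)
			return true;
	}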

Thanks!

Madhavan


Re: [RFC PATCH v1 3/4] arm64: Detect FTRACE cases that make the stack trace unreliable

2021-04-01 Thread Madhavan T. Venkataraman



On 4/1/21 9:27 AM, Mark Brown wrote:
> On Tue, Mar 30, 2021 at 02:09:54PM -0500, madve...@linux.microsoft.com wrote:
> 
>> + * FTRACE trampolines.
>> + */
>> +#ifdef CONFIG_DYNAMIC_FTRACE_WITH_REGS
>> +{ (unsigned long) &ftrace_graph_call, 0 },
>> +#ifdef CONFIG_FUNCTION_GRAPH_TRACER
>> +{ (unsigned long) ftrace_graph_caller, 0 },
>> +{ (unsigned long) return_to_handler, 0 },
>> +#endif
>> +#endif
> 
> It's weird that we take the address of ftrace_graph_call but not the
> other functions - we should be consistent or explain why.  It'd probably
> also look nicer to not nest the ifdefs, the dependencies in Kconfig will
> ensure we only get things when we should.
> 

I have explained it in the comment in the FTRACE trampoline right above
ftrace_graph_call().

/*
 * The only call in the FTRACE trampoline code is above. The above
 * instruction is patched to call a tracer function. Its return
 * address is below (ftrace_graph_call). In a stack trace taken from
 * a tracer function, ftrace_graph_call() will show. The unwinder
 * checks this for reliable stack trace. Please see the comments
 * in stacktrace.c. If another call is added in the FTRACE
 * trampoline code, the special_functions[] array in stacktrace.c
 * must be updated.
 */

I also noticed that I have to fix something here. The label ftrace_graph_call
is defined like this:


#ifdef CONFIG_FUNCTION_GRAPH_TRACER
SYM_INNER_LABEL(ftrace_graph_call, SYM_L_GLOBAL) // ftrace_graph_caller();
nop // If enabled, this will be replaced
// "b ftrace_graph_caller"
#endif

So, it is only defined if CONFIG_FUNCTION_GRAPH_TRACER is defined. I can
address this as well as your comment by defining another label whose name is
more meaningful to our use:


+SYM_INNER_LABEL(ftrace_trampoline, SYM_L_GLOBAL) // checked by the unwinder
#ifdef CONFIG_FUNCTION_GRAPH_TRACER
SYM_INNER_LABEL(ftrace_graph_call, SYM_L_GLOBAL) // ftrace_graph_caller();
nop // If enabled, this will be replaced
// "b ftrace_graph_caller"
#endif
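
On the unwinder side, special_functions[] would then reference the new label,
something like this (sketch, mirroring the existing ftrace_graph_call
declaration):

	extern unsigned long ftrace_trampoline;	/* inner label, not a function */

	{ (unsigned long) &ftrace_trampoline, 0 },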

Is this acceptable?

Thanks.

Madhavan


Re: [RFC PATCH v1 1/1] arm64: Implement stack trace termination record

2021-03-29 Thread Madhavan T. Venkataraman



On 3/29/21 6:27 AM, Mark Rutland wrote:
> Hi Madhavan,
> 
> Overall this looks pretty good; I have a few comments below.
> 
> On Wed, Mar 24, 2021 at 01:46:07PM -0500, madve...@linux.microsoft.com wrote:
>> From: "Madhavan T. Venkataraman" 
>>
>> The unwinder needs to be able to reliably tell when it has reached the end
>> of a stack trace. One way to do this is to have the last stack frame at a
>> fixed offset from the base of the task stack. When the unwinder reaches
>> that offset, it knows it is done.
> 
> To make the relationship with reliable stacktrace clearer, how about:
> 
> | Reliable stacktracing requires that we identify when a stacktrace is
> | terminated early. We can do this by ensuring all tasks have a final
> | frame record at a known location on their task stack, and checking
> | that this is the final frame record in the chain.
> 
> Currently we use inconsistent terminology to refer to the final frame
> record, and it would be good if we could be consistent. The existing
> code uses "terminal record" (which I appreciate isn't clear), and this
> series largely uses "last frame". It'd be nice to make that consistent.
> 
> For clarity could we please use "final" rather than "last"? That avoids
> the ambiguity of "last" also meaning "previous".
> 
> e.g. below this'd mean having `setup_final_frame`.

OK. I will make the above changes.
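
So, in the unwinder, the successful-termination check becomes something like
this (a sketch using the final-frame naming, not the exact patch text):

	/* Final frame record reached; terminate the stack trace. */
	if (frame->fp == (unsigned long)task_pt_regs(tsk)->stackframe)
		return -ENOENT;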

> 
>>
>> Kernel Tasks
>> 
>>
>> All tasks except the idle task have a pt_regs structure right after the
>> task stack. This is called the task pt_regs. The pt_regs structure has a
>> special stackframe field. Make this stackframe field the last frame in the
>> task stack. This needs to be done in copy_thread() which initializes a new
>> task's pt_regs and initial CPU context.
>>
>> For the idle task, there is no task pt_regs. For our purpose, we need one.
>> So, create a pt_regs just like other kernel tasks and make
>> pt_regs->stackframe the last frame in the idle task stack. This needs to be
>> done at two places:
>>
>>  - On the primary CPU, the boot task runs. It calls start_kernel()
>>    and eventually becomes the idle task for the primary CPU. Just
>>    before start_kernel() is called, set up the last frame.
>>
>>  - On each secondary CPU, a startup task runs that calls
>>    secondary_start_kernel() and eventually becomes the idle task
>>    on the secondary CPU. Just before secondary_start_kernel() is
>>    called, set up the last frame.
>>
>> User Tasks
>> ==
>>
>> User tasks are initially set up like kernel tasks when they are created.
>> Then, they return to userland after fork via ret_from_fork(). After that,
>> they enter the kernel only on an EL0 exception. (In arm64, system calls are
>> also EL0 exceptions). The EL0 exception handler stores state in the task
>> pt_regs and calls different functions based on the type of exception. The
>> stack trace for an EL0 exception must end at the task pt_regs. So, make
>> task pt_regs->stackframe as the last frame in the EL0 exception stack.
>>
>> In summary, task pt_regs->stackframe is where a successful stack trace ends.
>>
>> Stack trace termination
>> =======================
>>
>> In the unwinder, terminate the stack trace successfully when
>> task_pt_regs(task)->stackframe is reached. For stack traces in the kernel,
>> this will correctly terminate the stack trace at the right place.
>>
>> However, debuggers terminate the stack trace when FP == 0. In the
>> pt_regs->stackframe, the PC is 0 as well. So, stack traces taken in the
>> debugger may print an extra record 0x0 at the end. While this is not
>> pretty, this does not do any harm. This is a small price to pay for
>> having reliable stack trace termination in the kernel.
>>
>> Signed-off-by: Madhavan T. Venkataraman 
>> ---
>>  arch/arm64/kernel/entry.S  |  8 +---
>>  arch/arm64/kernel/head.S   | 28 
>>  arch/arm64/kernel/process.c|  5 +
>>  arch/arm64/kernel/stacktrace.c |  8 
>>  4 files changed, 38 insertions(+), 11 deletions(-)
>>
>> diff --git a/arch/arm64/kernel/entry.S b/arch/arm64/kernel/entry.S
>> index a31a0a713c85..e2dc2e998934 100644
>> --- a/arch/arm64/kernel/entry.S
>> +++ b/arch/arm64/kernel/entry.S
>> @@ -261,16 +261,18 @@ alternative_else_nop_endif
>>  stp lr, x21, [sp, #S_LR]
>>  
>>  /*
>> - * For exceptions f

Re: [RFC PATCH v2 5/8] arm64: Detect an FTRACE frame and mark a stack trace unreliable

2021-03-23 Thread Madhavan T. Venkataraman
Thanks for all the input - Mark Rutland and Mark Brown.

I will send out the stack termination patch next. Since I am splitting
the original series into 3 separate series, I will change the titles and
start with version 1 in each case, if there is no objection.

Again, Thanks.

Madhavan

On 3/23/21 3:24 PM, Madhavan T. Venkataraman wrote:
> 
> 
> On 3/23/21 1:30 PM, Mark Rutland wrote:
>> On Tue, Mar 23, 2021 at 12:23:34PM -0500, Madhavan T. Venkataraman wrote:
>>> On 3/23/21 12:02 PM, Mark Rutland wrote:
>>
>> [...]
>>
>>> I think that I did a bad job of explaining what I wanted to do. It is not
>>> for any additional protection at all.
>>>
>>> So, let us say we create a field in the task structure:
>>>
>>> u64 unreliable_stack;
>>>
>>> Whenever an EL1 exception is entered or FTRACE is entered and pt_regs get
>>> set up and pt_regs->stackframe gets chained, increment unreliable_stack.
>>> On exiting the above, decrement unreliable_stack.
>>>
>>> In arch_stack_walk_reliable(), simply do this check upfront:
>>>
>>> if (task->unreliable_stack)
>>> return -EINVAL;
>>>
>>> This way, the function does not even bother unwinding the stack to find
>>> exception frames or checking for different return addresses or anything.
>>> We also don't have to worry about code being reorganized, functions
>>> being renamed, etc. It also may help in debugging to know if a task is
>>> experiencing an exception and the level of nesting, etc.
>>
>> As in my other reply, since this is an optimization that is not
>> necessary for functional correctness, I would prefer to avoid this for
>> now. We can reconsider that in future if we encounter performance
>> problems.
>>
>> Even with this there will be cases where we have to identify
>> non-unwindable functions explicitly (e.g. the patchable-function-entry
>> trampolines, where the real return address is in x9), and I'd prefer
>> that we use one mechanism consistently.
>>
>> I suspect that in the future we'll need to unwind across exception
>> boundaries using metadata, and we can treat the non-unwindable metadata
>> in the same way.
>>
>> [...]
>>
>>>> 3. Figure out exception boundary handling. I'm currently working to
>>>>simplify the entry assembly down to a uniform set of stubs, and I'd
>>>>prefer to get that sorted before we teach the unwinder about
>>>>exception boundaries, as it'll be significantly simpler to reason
>>>>about and won't end up clashing with the rework.
>>>
>>> So, here is where I still have a question. Is it necessary for the unwinder
>>> to know the exception boundaries? Is it not enough if it knows if there are
>>> exceptions present? For instance, using something like num_special_frames
>>> I suggested above?
>>
>> I agree that it would be legitimate to bail out early if we knew there
>> was going to be an exception somewhere in the trace. Regardless, I think
>> it's simpler overall to identify non-unwindability during the trace, and
>> doing that during the trace aligns more closely with the structure that
>> we'll need to permit unwinding across these boundaries in future, so I'd
>> prefer we do that rather than trying to optimize for early returns
>> today.
>>
> 
> OK. Fair enough.
> 
> Thanks.
> 
> Madhavan
> 


Re: [RFC PATCH v2 5/8] arm64: Detect an FTRACE frame and mark a stack trace unreliable

2021-03-23 Thread Madhavan T. Venkataraman



On 3/23/21 1:30 PM, Mark Rutland wrote:
> On Tue, Mar 23, 2021 at 12:23:34PM -0500, Madhavan T. Venkataraman wrote:
>> On 3/23/21 12:02 PM, Mark Rutland wrote:
> 
> [...]
> 
>> I think that I did a bad job of explaining what I wanted to do. It is not
>> for any additional protection at all.
>>
>> So, let us say we create a field in the task structure:
>>
>>  u64 unreliable_stack;
>>
>> Whenever an EL1 exception is entered or FTRACE is entered and pt_regs get
>> set up and pt_regs->stackframe gets chained, increment unreliable_stack.
>> On exiting the above, decrement unreliable_stack.
>>
>> In arch_stack_walk_reliable(), simply do this check upfront:
>>
>>  if (task->unreliable_stack)
>>  return -EINVAL;
>>
>> This way, the function does not even bother unwinding the stack to find
>> exception frames or checking for different return addresses or anything.
>> We also don't have to worry about code being reorganized, functions
>> being renamed, etc. It also may help in debugging to know if a task is
>> experiencing an exception and the level of nesting, etc.
> 
> As in my other reply, since this is an optimization that is not
> necessary for functional correctness, I would prefer to avoid this for
> now. We can reconsider that in future if we encounter performance
> problems.
> 
> Even with this there will be cases where we have to identify
> non-unwindable functions explicitly (e.g. the patchable-function-entry
> trampolines, where the real return address is in x9), and I'd prefer
> that we use one mechanism consistently.
> 
> I suspect that in the future we'll need to unwind across exception
> boundaries using metadata, and we can treat the non-unwindable metadata
> in the same way.
> 
> [...]
> 
>>> 3. Figure out exception boundary handling. I'm currently working to
>>>simplify the entry assembly down to a uniform set of stubs, and I'd
>>>prefer to get that sorted before we teach the unwinder about
>>>exception boundaries, as it'll be significantly simpler to reason
>>>about and won't end up clashing with the rework.
>>
>> So, here is where I still have a question. Is it necessary for the unwinder
>> to know the exception boundaries? Is it not enough if it knows if there are
>> exceptions present? For instance, using something like num_special_frames
>> I suggested above?
> 
> I agree that it would be legitimate to bail out early if we knew there
> was going to be an exception somewhere in the trace. Regardless, I think
> it's simpler overall to identify non-unwindability during the trace, and
> doing that during the trace aligns more closely with the structure that
> we'll need to permit unwinding across these boundaries in future, so I'd
> prefer we do that rather than trying to optimize for early returns
> today.
> 

OK. Fair enough.

Thanks.

Madhavan


Re: [RFC PATCH v2 5/8] arm64: Detect an FTRACE frame and mark a stack trace unreliable

2021-03-23 Thread Madhavan T. Venkataraman



On 3/23/21 1:27 PM, Mark Brown wrote:
> On Tue, Mar 23, 2021 at 12:23:34PM -0500, Madhavan T. Venkataraman wrote:
>> On 3/23/21 12:02 PM, Mark Rutland wrote:
> 
>>> 3. Figure out exception boundary handling. I'm currently working to
>>>simplify the entry assembly down to a uniform set of stubs, and I'd
>>>prefer to get that sorted before we teach the unwinder about
>>>exception boundaries, as it'll be significantly simpler to reason
>>>about and won't end up clashing with the rework.
> 
>> So, here is where I still have a question. Is it necessary for the unwinder
>> to know the exception boundaries? Is it not enough if it knows if there are
>> exceptions present? For instance, using something like num_special_frames
>> I suggested above?
> 
> For reliable stack trace we can live with just flagging things as
> unreliable when we know there's an exception boundary somewhere but (as
> Mark mentioned elsewhere) being able to actually go through a subset of
> exception boundaries safely is likely to help usefully improve the
> performance of live patching, and for defensiveness we want to try to
> detect during an actual unwind anyway so it ends up being a performance
> improvement and double check rather than saving us code.  Better
> understanding of what's going on in the presence of exceptions may also
> help other users of the unwinder which can use stacks which aren't
> reliable get better results.
> 

Actually, I was not suggesting that the counter replace the unwinder
intelligence to recognize exception boundaries. I was only suggesting
the use of the counter for arch_stack_walk_reliable().

But I am fine with not implementing the counter for now.

Madhavan


Re: [RFC PATCH v2 5/8] arm64: Detect an FTRACE frame and mark a stack trace unreliable

2021-03-23 Thread Madhavan T. Venkataraman



On 3/23/21 12:23 PM, Madhavan T. Venkataraman wrote:
> 
> 
> On 3/23/21 12:02 PM, Mark Rutland wrote:
>> On Tue, Mar 23, 2021 at 11:20:44AM -0500, Madhavan T. Venkataraman wrote:
>>> On 3/23/21 10:26 AM, Madhavan T. Venkataraman wrote:
>>>> On 3/23/21 9:57 AM, Mark Rutland wrote:
>>>>> On Tue, Mar 23, 2021 at 09:15:36AM -0500, Madhavan T. Venkataraman wrote:
>>>> So, my next question is - can we define a practical limit for the
>>>> nesting so that any nesting beyond that is fatal? The reason I ask
>>>> is - if there is a max, then we can allocate an array of stack
>>>> frames out of band for the special frames so they are not part of
>>>> the stack and will not likely get corrupted.
>>>>
>>>> Also, we don't have to do any special detection. If the number of
>>>> out of band frames used is one or more then we have exceptions and
>>>> the stack trace is unreliable.
>>>
>>> Alternatively, if we can just increment a counter in the task
>>> structure when an exception is entered and decrement it when an
>>> exception returns, that counter will tell us that the stack trace is
>>> unreliable.
>>
> As I noted earlier, *any* EL1 exception boundary needs to
>> be treated as unreliable for unwinding, and per my other comments w.r.t.
>> corrupting the call chain I don't think we need additional protection on
>> exception boundaries specifically.
>>
>>> Is this feasible?
>>>
>>> I think I have enough for v3 at this point. If you think that the
>>> counter idea is OK, I can implement it in v3. Once you confirm, I will
>>> start working on v3.
>>
>> Currently, I don't see a compelling reason to need this, and would
>> prefer to avoid it.
>>
> 
> I think that I did a bad job of explaining what I wanted to do. It is not
> for any additional protection at all.
> 
> So, let us say we create a field in the task structure:
> 
>   u64 unreliable_stack;
> 
> Whenever an EL1 exception is entered or FTRACE is entered and pt_regs get
> set up and pt_regs->stackframe gets chained, increment unreliable_stack.
> On exiting the above, decrement unreliable_stack.
> 
> In arch_stack_walk_reliable(), simply do this check upfront:
> 
>   if (task->unreliable_stack)
>   return -EINVAL;
> 
> This way, the function does not even bother unwinding the stack to find
> exception frames or checking for different return addresses or anything.
> We also don't have to worry about code being reorganized, functions
> being renamed, etc. It also may help in debugging to know if a task is
> experiencing an exception and the level of nesting, etc.
> 
>> More generally, could we please break this work into smaller steps? I
>> reckon we can break this down into the following chunks:
>>
>> 1. Add the explicit final frame and associated handling. I suspect that
>>this is complicated enough on its own to be an independent series,
>>and it's something that we can merge without all the bits and pieces
>>necessary for truly reliable stacktracing.
>>
> 
> OK. I can do that.
> 
>> 2. Figure out how we must handle kprobes and ftrace. That probably means
>>rejecting unwinds from specific places, but we might also want to
>>adjust the trampolines if that makes this easier.
>>
> 
> I think I am already doing all the checks except the one you mentioned
> earlier. Yes, I can do this separately.
> 
>> 3. Figure out exception boundary handling. I'm currently working to
>>simplify the entry assembly down to a uniform set of stubs, and I'd
>>prefer to get that sorted before we teach the unwinder about
>>exception boundaries, as it'll be significantly simpler to reason
>>about and won't end up clashing with the rework.
>>
> 
> So, here is where I still have a question. Is it necessary for the unwinder
> to know the exception boundaries? Is it not enough if it knows if there are
> exceptions present? For instance, using something like num_special_frames

Typo - num_special_frames should be unreliable_stack. That is the name of
the counter I used above.

Sorry about that.

Madhavan


Re: [RFC PATCH v2 5/8] arm64: Detect an FTRACE frame and mark a stack trace unreliable

2021-03-23 Thread Madhavan T. Venkataraman



On 3/23/21 12:02 PM, Mark Rutland wrote:
> On Tue, Mar 23, 2021 at 11:20:44AM -0500, Madhavan T. Venkataraman wrote:
>> On 3/23/21 10:26 AM, Madhavan T. Venkataraman wrote:
>>> On 3/23/21 9:57 AM, Mark Rutland wrote:
>>>> On Tue, Mar 23, 2021 at 09:15:36AM -0500, Madhavan T. Venkataraman wrote:
>>> So, my next question is - can we define a practical limit for the
>>> nesting so that any nesting beyond that is fatal? The reason I ask
>>> is - if there is a max, then we can allocate an array of stack
>>> frames out of band for the special frames so they are not part of
>>> the stack and will not likely get corrupted.
>>>
>>> Also, we don't have to do any special detection. If the number of
>>> out of band frames used is one or more then we have exceptions and
>>> the stack trace is unreliable.
>>
>> Alternatively, if we can just increment a counter in the task
>> structure when an exception is entered and decrement it when an
>> exception returns, that counter will tell us that the stack trace is
>> unreliable.
> 
> As I noted earlier, *any* EL1 exception boundary needs to
> be treated as unreliable for unwinding, and per my other comments w.r.t.
> corrupting the call chain I don't think we need additional protection on
> exception boundaries specifically.
> 
>> Is this feasible?
>>
>> I think I have enough for v3 at this point. If you think that the
>> counter idea is OK, I can implement it in v3. Once you confirm, I will
>> start working on v3.
> 
> Currently, I don't see a compelling reason to need this, and would
> prefer to avoid it.
> 

I think that I did a bad job of explaining what I wanted to do. It is not
for any additional protection at all.

So, let us say we create a field in the task structure:

u64 unreliable_stack;

Whenever an EL1 exception is entered or FTRACE is entered and pt_regs get
set up and pt_regs->stackframe gets chained, increment unreliable_stack.
On exiting the above, decrement unreliable_stack.

In arch_stack_walk_reliable(), simply do this check upfront:

if (task->unreliable_stack)
return -EINVAL;

This way, the function does not even bother unwinding the stack to find
exception frames or checking for different return addresses or anything.
We also don't have to worry about code being reorganized, functions
being renamed, etc. It also may help in debugging to know if a task is
experiencing an exception and the level of nesting, etc.
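
For concreteness, the bookkeeping would be nothing more than this (a sketch;
the helper names are made up):

	/* Call after pt_regs->stackframe is chained on exception entry. */
	static inline void task_frame_enter(struct task_struct *tsk)
	{
		tsk->unreliable_stack++;
	}

	/* Call just before the frame is unchained on exception exit. */
	static inline void task_frame_exit(struct task_struct *tsk)
	{
		tsk->unreliable_stack--;
	}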

> More generally, could we please break this work into smaller steps? I
> reckon we can break this down into the following chunks:
> 
> 1. Add the explicit final frame and associated handling. I suspect that
>this is complicated enough on its own to be an independent series,
>and it's something that we can merge without all the bits and pieces
>necessary for truly reliable stacktracing.
> 

OK. I can do that.

> 2. Figure out how we must handle kprobes and ftrace. That probably means
>rejecting unwinds from specific places, but we might also want to
>adjust the trampolines if that makes this easier.
> 

I think I am already doing all the checks except the one you mentioned
earlier. Yes, I can do this separately.

> 3. Figure out exception boundary handling. I'm currently working to
>simplify the entry assembly down to a uniform set of stubs, and I'd
>prefer to get that sorted before we teach the unwinder about
>exception boundaries, as it'll be significantly simpler to reason
>about and won't end up clashing with the rework.
> 

So, here is where I still have a question. Is it necessary for the unwinder
to know the exception boundaries? Is it not enough if it knows if there are
exceptions present? For instance, using something like num_special_frames
I suggested above?

Thanks,

Madhavan


Re: [RFC PATCH v2 5/8] arm64: Detect an FTRACE frame and mark a stack trace unreliable

2021-03-23 Thread Madhavan T. Venkataraman



On 3/23/21 11:48 AM, Mark Rutland wrote:
> On Tue, Mar 23, 2021 at 10:26:50AM -0500, Madhavan T. Venkataraman wrote:
>> On 3/23/21 9:57 AM, Mark Rutland wrote:
>> Thanks for explaining the nesting. It is now clear to me.
> 
> No problem!
> 
>> So, my next question is - can we define a practical limit for the
>> nesting so that any nesting beyond that is fatal? The reason I ask is
>> - if there is a max, then we can allocate an array of stack frames out
>> of band for the special frames so they are not part of the stack and
>> will not likely get corrupted.
> 
> I suspect we can't define such a fatal limit without introducing a local
> DoS vector on some otherwise legitimate workload, and I fear this will
> further complicate the entry/exit logic, so I'd prefer to avoid
> introducing a new limit.
> 

I suspected as much. But I thought I will ask anyway.

> What exactly do you mean by a "special frame", and why do those need
> additional protection over regular frame records?
> 

Special frame just means pt_regs->stackframe that is used for exceptions.
No additional protection is needed. I just meant that since they are
out of band, we can reliably tell that there are exceptions without
examining the stack. That is all.

>> Also, we don't have to do any special detection. If the number of out
>> of band frames used is one or more then we have exceptions and the
>> stack trace is unreliable.
> 
> What is this expected to protect against?
> 

It is not a protection thing. I just wanted a reliable way to tell that there
is an exception without having to unwind the stack up to the exception frame.
That is all.

Thanks.

Madhavan


Re: [RFC PATCH v2 5/8] arm64: Detect an FTRACE frame and mark a stack trace unreliable

2021-03-23 Thread Madhavan T. Venkataraman



On 3/23/21 10:26 AM, Madhavan T. Venkataraman wrote:
> 
> 
> On 3/23/21 9:57 AM, Mark Rutland wrote:
>> On Tue, Mar 23, 2021 at 09:15:36AM -0500, Madhavan T. Venkataraman wrote:
>>> Hi Mark,
>>>
>>> I have a general question. When exceptions are nested, how does it work? 
>>> Let us consider 2 cases:
>>>
>>> 1. Exception in a page fault handler itself. In this case, I guess one more
>>>    pt_regs will get established in the task stack for the second exception.
>>
>> Generally (ignoring SDEI and stack overflow exceptions) the regs will be
>> placed on the stack that was in use when the exception occurred, e.g.
>>
>>   task -> task
>>   irq -> irq
>>   overflow -> overflow
>>
>> For SDEI and stack overflow, we'll place the regs on the relevant SDEI
>> or overflow stack, e.g.
>>
>>   task -> overflow
>>   irq -> overflow
>>
>>   task -> sdei
>>   irq -> sdei
>>
>> I tried to explain the nesting rules in:
>>
>>   
>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/arm64/kernel/stacktrace.c?h=v5.11#n59
>>   
>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/arch/arm64/kernel/stacktrace.c?h=v5.11&id=592700f094be229b5c9cc1192d5cea46eb4c7afc
>>
>>> 2. Exception in an interrupt handler. Here the interrupt handler is running
>>>    on the IRQ stack. Will the pt_regs get created on the IRQ stack?
>>
>> For an interrupt the regs will be placed on the stack that was in use
>> when the interrupt was taken. The kernel switches to the IRQ stack
>> *after* stacking the registers. e.g.
>>
>>   task -> task // subsequently switches to IRQ stack
>>   irq -> irq
>>
>>> Also, is there a maximum nesting for exceptions?
>>
>> In practice, yes, but the specific number isn't a constant, so in the
>> unwind code we have to act as if there is no limit other than stack
>> sizing.
>>
> We try to prevent certain exceptions from nesting (e.g. debug exceptions
> cannot nest), but there are still several levels of nesting, and some
> exceptions which can be nested safely (like faults). For example, it's
> possible to have a chain:
>
>  syscall -> fault -> interrupt -> fault -> pNMI -> fault -> SError ->
>  fault -> watchpoint -> fault -> overflow -> fault -> BRK
>>
>> ... and potentially longer than that.
>>
>> The practical limit is the size of all the stacks, and the unwinder's 
>> stack monotonicity checks ensure that an unwind will terminate.
>>
> 
> Thanks for explaining the nesting. It is now clear to me.
> 
> So, my next question is - can we define a practical limit for the nesting so
> that any nesting beyond that is fatal? The reason I ask is - if there is a
> max, then we can allocate an array of stack frames out of band for the
> special frames so they are not part of the stack and will not likely get
> corrupted.
> 
> Also, we don't have to do any special detection. If the number of out of
> band frames used is one or more then we have exceptions and the stack trace
> is unreliable.
> 

Alternatively, if we can just increment a counter in the task structure when
an exception is entered and decrement it when an exception returns, that
counter will tell us that the stack trace is unreliable.

Is this feasible?

I think I have enough for v3 at this point. If you think that the counter
idea is OK, I can implement it in v3. Once you confirm, I will start working
on v3.

Thanks for all the input.

Madhavan


Re: [RFC PATCH v2 5/8] arm64: Detect an FTRACE frame and mark a stack trace unreliable

2021-03-23 Thread Madhavan T. Venkataraman



On 3/23/21 9:57 AM, Mark Rutland wrote:
> On Tue, Mar 23, 2021 at 09:15:36AM -0500, Madhavan T. Venkataraman wrote:
>> Hi Mark,
>>
>> I have a general question. When exceptions are nested, how does it work?
>> Let us consider 2 cases:
>>
>> 1. Exception in a page fault handler itself. In this case, I guess one more
>>    pt_regs will get established in the task stack for the second exception.
> 
> Generally (ignoring SDEI and stack overflow exceptions) the regs will be
> placed on the stack that was in use when the exception occurred, e.g.
> 
>   task -> task
>   irq -> irq
>   overflow -> overflow
> 
> For SDEI and stack overflow, we'll place the regs on the relevant SDEI
> or overflow stack, e.g.
> 
>   task -> overflow
>   irq -> overflow
> 
>   task -> sdei
>   irq -> sdei
> 
> I tried to explain the nesting rules in:
> 
>   
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/arm64/kernel/stacktrace.c?h=v5.11#n59
>   
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/arch/arm64/kernel/stacktrace.c?h=v5.11&id=592700f094be229b5c9cc1192d5cea46eb4c7afc
> 
>> 2. Exception in an interrupt handler. Here the interrupt handler is running
>>    on the IRQ stack. Will the pt_regs get created on the IRQ stack?
> 
> For an interrupt the regs will be placed on the stack that was in use
> when the interrupt was taken. The kernel switches to the IRQ stack
> *after* stacking the registers. e.g.
> 
>   task -> task // subsequently switches to IRQ stack
>   irq -> irq
> 
>> Also, is there a maximum nesting for exceptions?
> 
> In practice, yes, but the specific number isn't a constant, so in the
> unwind code we have to act as if there is no limit other than stack
> sizing.
> 
> We try to prevent certain exceptions from nesting (e.g. debug exceptions
> cannot nest), but there are still several levels of nesting, and some
> exceptions which can be nested safely (like faults). For example, it's
> possible to have a chain:
> 
>  syscall -> fault -> interrupt -> fault -> pNMI -> fault -> SError ->
>  fault -> watchpoint -> fault -> overflow -> fault -> BRK
> 
> ... and potentially longer than that.
> 
> The practical limit is the size of all the stacks, and the unwinder's 
> stack monotonicity checks ensure that an unwind will terminate.
> 

Thanks for explaining the nesting. It is now clear to me.

So, my next question is - can we define a practical limit for the nesting so
that any nesting beyond that is fatal? The reason I ask is - if there is a
max, then we can allocate an array of stack frames out of band for the
special frames so they are not part of the stack and will not likely get
corrupted.

Also, we don't have to do any special detection. If the number of out of band
frames used is one or more then we have exceptions and the stack trace is
unreliable.

Thanks.

Madhavan


Re: [RFC PATCH v2 4/8] arm64: Detect an EL1 exception frame and mark a stack trace unreliable

2021-03-23 Thread Madhavan T. Venkataraman



On 3/23/21 9:33 AM, Mark Rutland wrote:
> On Tue, Mar 23, 2021 at 08:31:50AM -0500, Madhavan T. Venkataraman wrote:
>> On 3/23/21 8:04 AM, Mark Rutland wrote:
>>> On Tue, Mar 23, 2021 at 07:46:10AM -0500, Madhavan T. Venkataraman wrote:
>>>> On 3/23/21 5:42 AM, Mark Rutland wrote:
>>>>> On Mon, Mar 15, 2021 at 11:57:56AM -0500, madve...@linux.microsoft.com 
>>>>> wrote:
>>>>>> From: "Madhavan T. Venkataraman" 
>>>>>>
>>>>>> EL1 exceptions can happen on any instruction including instructions in
>>>>>> the frame pointer prolog or epilog. Depending on where exactly they 
>>>>>> happen,
>>>>>> they could render the stack trace unreliable.
>>>>>>
>>>>>> If an EL1 exception frame is found on the stack, mark the stack trace as
>>>>>> unreliable.
>>>>>>
>>>>>> Now, the EL1 exception frame is not at any well-known offset on the 
>>>>>> stack.
>>>>>> It can be anywhere on the stack. In order to properly detect an EL1
>>>>>> exception frame the following checks must be done:
>>>>>>
>>>>>>  - The frame type must be EL1_FRAME.
>>>>>>
>>>>>>  - When the register state is saved in the EL1 pt_regs, the frame
>>>>>>    pointer x29 is saved in pt_regs->regs[29] and the return PC
>>>>>>    is saved in pt_regs->pc. These must match with the current
>>>>>>    frame.
>>>>>
>>>>> Before you can do this, you need to reliably identify that you have a
>>>>> pt_regs on the stack, but this patch uses a heuristic, which is not
>>>>> reliable.
>>>>>
>>>>> However, instead you can identify whether you're trying to unwind
>>>>> through one of the EL1 entry functions, which tells you the same thing
>>>>> without even having to look at the pt_regs.
>>>>>
>>>>> We can do that based on the entry functions all being in .entry.text,
>>>>> which we could further sub-divide to split the EL0 and EL1 entry
>>>>> functions.
>>>>
>>>> Yes. I will check the entry functions. But I still think that we should
>>>> not rely on just one check. The additional checks will make it robust.
>>>> So, I suggest that the return address be checked first. If that passes,
>>>> then we can be reasonably sure that there are pt_regs. Then, check
>>>> the other things in pt_regs.
>>>
>>> What do you think this will catch?
>>
>> I am not sure that I have an exact example to mention here. But I will 
>> attempt
>> one. Let us say that a task has called arch_stack_walk() in the recent past.
>> The unwinder may have copied a stack frame onto some location in the stack
>> with one of the return addresses we check. Let us assume that there is some
>> stack corruption that makes a frame pointer point to that exact record. Now,
>> we will get a match on the return address on the next unwind.
> 
> I don't see how this is material to the pt_regs case, as either:
> 
> * When the unwinder considers this frame, it appears to be in the middle
>   of an EL1 entry function, and the unwinder must mark the unwinding as
>   unreliable regardless of the contents of any regs (so there's no need
>   to look at the regs).
> 
> * When the unwinder considers this frame, it does not appear to be in
>   the middle of an EL1 entry function, so the unwinder does not think
>   there are any regs to consider, and so we cannot detect this case.
> 
> ... unless I've misunderstood the example?
> 
> There's a general problem that it's possible to corrupt any portion of
> the chain to skip records, e.g.
> 
>   A -> B -> C -> D -> E -> F -> G -> H -> [final]
> 
> ... could get corrupted to:
> 
>   A -> B -> D -> H -> [final]
> 
> ... regardless of whether C/E/F/G had associated pt_regs. AFAICT there's
> no good way to catch this generally unless we have additional metadata
> to check the unwinding against.
> 
> The likelihood of this happening without triggering other checks is
> vanishingly low, and as we don't have a reliable mechanism for detecting
> this, I don't think it's worthwhile attempting to do so.
> 
> If and when we try to unwind across EL1 exception boundaries, the
> potential mismatch between the frame record and regs will be more
> significant, and I agree at that point thisd will need more thought.
> 
>> Pardon me if the example is somewhat crude. My point is that it is
>> highly unlikely but not impossible for the return address to be on the
>> stack and for the unwinder to get an unfortunate match.
> 
> I agree that this is possible in theory, but as above I don't think this
> is a practical concern.
> 

OK. What you say makes sense.

Thanks.

Madhavan


Re: [RFC PATCH v2 5/8] arm64: Detect an FTRACE frame and mark a stack trace unreliable

2021-03-23 Thread Madhavan T. Venkataraman
Hi Mark,

I have a general question. When exceptions are nested, how does it work? Let us 
consider 2 cases:

1. Exception in a page fault handler itself. In this case, I guess one more
   pt_regs will get established in the task stack for the second exception.

2. Exception in an interrupt handler. Here the interrupt handler is running on
   the IRQ stack. Will the pt_regs get created on the IRQ stack?

Also, is there a maximum nesting for exceptions?

Madhavan


Re: [RFC PATCH v2 5/8] arm64: Detect an FTRACE frame and mark a stack trace unreliable

2021-03-23 Thread Madhavan T. Venkataraman



On 3/23/21 8:36 AM, Mark Rutland wrote:
> On Tue, Mar 23, 2021 at 07:56:40AM -0500, Madhavan T. Venkataraman wrote:
>>
>>
>> On 3/23/21 5:51 AM, Mark Rutland wrote:
>>> On Mon, Mar 15, 2021 at 11:57:57AM -0500, madve...@linux.microsoft.com 
>>> wrote:
>>>> From: "Madhavan T. Venkataraman" 
>>>>
>>>> When CONFIG_DYNAMIC_FTRACE_WITH_REGS is enabled and tracing is activated
>>>> for a function, the ftrace infrastructure is called for the function at
>>>> the very beginning. Ftrace creates two frames:
>>>>
>>>>- One for the traced function
>>>>
>>>>- One for the caller of the traced function
>>>>
>>>> That gives a reliable stack trace while executing in the ftrace
>>>> infrastructure code. When ftrace returns to the traced function, the frames
>>>> are popped and everything is back to normal.
>>>>
>>>> However, in cases like live patch, execution is redirected to a different
>>>> function when ftrace returns. A stack trace taken while still in the ftrace
>>>> infrastructure code will not show the target function. The target function
>>>> is the real function that we want to track.
>>>>
>>>> So, if an FTRACE frame is detected on the stack, just mark the stack trace
>>>> as unreliable.
>>>
>>> To identify this case, please identify the ftrace trampolines instead,
>>> e.g. ftrace_regs_caller, return_to_handler.
>>>
>>
>> Yes. As part of the return address checking, I will check this. IIUC, I
>> think that I need to check for the inner labels that are defined at the
>> point where the instructions are patched for ftrace. E.g., ftrace_call
>> and ftrace_graph_call.
>>
>> SYM_INNER_LABEL(ftrace_call, SYM_L_GLOBAL)
>> bl  ftrace_stub  <===
>>
>> #ifdef CONFIG_FUNCTION_GRAPH_TRACER
>> SYM_INNER_LABEL(ftrace_graph_call, SYM_L_GLOBAL) // ftrace_graph_caller();
>> nop  <===  // If enabled, this will be replaced
>> // "b ftrace_graph_caller"
>> #endif
>>
>> For instance, the stack trace I got while tracing do_mmap() with the stack
>> trace tracer looks like this:
>>
>>   ...
>> [  338.911793]   trace_function+0xc4/0x160
>> [  338.911801]   function_stack_trace_call+0xac/0x130
>> [  338.911807]   ftrace_graph_call+0x0/0x4
>> [  338.911813]   do_mmap+0x8/0x598
>> [  338.911820]   vm_mmap_pgoff+0xf4/0x188
>> [  338.911826]   ksys_mmap_pgoff+0x1d8/0x220
>> [  338.911832]   __arm64_sys_mmap+0x38/0x50
>> [  338.911839]   el0_svc_common.constprop.0+0x70/0x1a8
>> [  338.911846]   do_el0_svc+0x2c/0x98
>> [  338.911851]   el0_svc+0x2c/0x70
>> [  338.911859]   el0_sync_handler+0xb0/0xb8
>> [  338.911864]   el0_sync+0x180/0x1c0
>>
>>> It'd be good to check *exactly* when we need to reject, since IIUC when
>>> we have a graph stack entry the unwind will be correct from livepatch's
>>> PoV.
>>>
>>
>> The current unwinder already handles this like this:
>>
>> #ifdef CONFIG_FUNCTION_GRAPH_TRACER
>> if (tsk->ret_stack &&
>> (ptrauth_strip_insn_pac(frame->pc) == (unsigned long)return_to_handler)) {
>> struct ftrace_ret_stack *ret_stack;
>> /*
>>  * This is a case where function graph tracer has
>>  * modified a return address (LR) in a stack frame
>>  * to hook a function return.
>>  * So replace it to an original value.
>>  */
>> ret_stack = ftrace_graph_get_ret_stack(tsk, frame->graph++);
>> if (WARN_ON_ONCE(!ret_stack))
>> return -EINVAL;
>> frame->pc = ret_stack->ret;
>> }
>> #endif /* CONFIG_FUNCTION_GRAPH_TRACER */
> 
> Beware that this handles the case where a function will return to
> return_to_handler, but doesn't handle unwinding from *within*
> return_to_handler, which we can't do reliably today, so that might need
> special handling.
> 

OK. I will take a look at this.
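
In the meantime, the reliability check I have in mind is along these lines (a
sketch; the exact set of labels to check is what we are discussing here):

	/* Mark the trace unreliable while inside the ftrace trampolines. */
	if (frame->pc == (unsigned long)&ftrace_graph_call ||
	    frame->pc == (unsigned long)return_to_handler)
		frame->reliable = false;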

>> Is there anything else that needs handling here?
> 
> I wrote up a few cases to consider in:
> 
> https://www.kernel.org/doc/html/latest/livepatch/reliable-stacktrace.html
> 
> ... e.g. the "Obscuring of return addresses" case.
> 

Re: [RFC PATCH v2 4/8] arm64: Detect an EL1 exception frame and mark a stack trace unreliable

2021-03-23 Thread Madhavan T. Venkataraman



On 3/23/21 8:04 AM, Mark Rutland wrote:
> On Tue, Mar 23, 2021 at 07:46:10AM -0500, Madhavan T. Venkataraman wrote:
>> On 3/23/21 5:42 AM, Mark Rutland wrote:
>>> On Mon, Mar 15, 2021 at 11:57:56AM -0500, madve...@linux.microsoft.com 
>>> wrote:
>>>> From: "Madhavan T. Venkataraman" 
>>>>
>>>> EL1 exceptions can happen on any instruction including instructions in
>>>> the frame pointer prolog or epilog. Depending on where exactly they happen,
>>>> they could render the stack trace unreliable.
>>>>
>>>> If an EL1 exception frame is found on the stack, mark the stack trace as
>>>> unreliable.
>>>>
>>>> Now, the EL1 exception frame is not at any well-known offset on the stack.
>>>> It can be anywhere on the stack. In order to properly detect an EL1
>>>> exception frame the following checks must be done:
>>>>
>>>>- The frame type must be EL1_FRAME.
>>>>
>>>>- When the register state is saved in the EL1 pt_regs, the frame
>>>>  pointer x29 is saved in pt_regs->regs[29] and the return PC
>>>>  is saved in pt_regs->pc. These must match with the current
>>>>  frame.
>>>
>>> Before you can do this, you need to reliably identify that you have a
>>> pt_regs on the stack, but this patch uses a heuristic, which is not
>>> reliable.
>>>
>>> However, instead you can identify whether you're trying to unwind
>>> through one of the EL1 entry functions, which tells you the same thing
>>> without even having to look at the pt_regs.
>>>
>>> We can do that based on the entry functions all being in .entry.text,
>>> which we could further sub-divide to split the EL0 and EL1 entry
>>> functions.
>>
>> Yes. I will check the entry functions. But I still think that we should
>> not rely on just one check. The additional checks will make it robust.
>> So, I suggest that the return address be checked first. If that passes,
>> then we can be reasonably sure that there are pt_regs. Then, check
>> the other things in pt_regs.
> 
> What do you think this will catch?
> 

I am not sure that I have an exact example to mention here, but I will attempt
one. Let us say that a task has called arch_stack_walk() in the recent past.
The unwinder may have copied a stack frame onto some location in the stack
with one of the return addresses we check. Let us assume that there is some
stack corruption that makes a frame pointer point to that exact record. Now,
we will get a match on the return address on the next unwind.

Pardon me if the example is somewhat crude. My point is that it is highly
unlikely, but not impossible, for one of these return addresses to be on the
stack and for the unwinder to get an unfortunate match.

> The only way to correctly identify whether or not we have a pt_regs is
> to check whether we're in specific portions of the EL1 entry assembly
> where the regs exist. However, as any EL1<->EL1 transition cannot be
> safely unwound, we'd mark any trace going through the EL1 entry assembly
> as unreliable.
> 
> Given that, I don't think it's useful to check the regs, and I'd prefer
> to avoid the subtleties involved in attempting to do so.
> 

I agree that the return address check is a good check. I would like to add
extra checks to be absolutely sure.

> [...]
> 
>>>> +static void check_if_reliable(unsigned long fp, struct stackframe *frame,
>>>> +			      struct stack_info *info)
>>>> +{
>>>> +	struct pt_regs *regs;
>>>> +	unsigned long regs_start, regs_end;
>>>> +
>>>> +	/*
>>>> +	 * If the stack trace has already been marked unreliable, just
>>>> +	 * return.
>>>> +	 */
>>>> +	if (!frame->reliable)
>>>> +		return;
>>>> +
>>>> +	/*
>>>> +	 * Assume that this is an intermediate marker frame inside a pt_regs
>>>> +	 * structure created on the stack and get the pt_regs pointer. Other
>>>> +	 * checks will be done below to make sure that this is a marker
>>>> +	 * frame.
>>>> +	 */
>>>
>>> Sorry, but NAK to this approach specifically. This isn't reliable (since
>>> it can be influenced by arbitrary data on the stack), and it's far more
>>> complicated than identifying the entry functions specifically.
>>
>> As I mentioned above, I agree that we should check the return address. But
>> just as a precaution, I think we should double check the pt_regs.
>>
>> Is that OK with you? It does not take away anything or increase the risk in
>> any way. I think it makes it more robust.
> 
> As above, I think that the work necessary to correctly access the regs
> means that it's not helpful to check the regs themselves. If you have
> something in mind where checking the regs is helpful I'm happy to
> consider that, but my general preference would be to stay away from the
> regs for now.
> 

I have mentioned a possibility above. Please take a look and let me know.

Thanks.

Madhavan


Re: [RFC PATCH v2 5/8] arm64: Detect an FTRACE frame and mark a stack trace unreliable

2021-03-23 Thread Madhavan T. Venkataraman



On 3/23/21 5:51 AM, Mark Rutland wrote:
> On Mon, Mar 15, 2021 at 11:57:57AM -0500, madve...@linux.microsoft.com wrote:
>> From: "Madhavan T. Venkataraman" 
>>
>> When CONFIG_DYNAMIC_FTRACE_WITH_REGS is enabled and tracing is activated
>> for a function, the ftrace infrastructure is called for the function at
>> the very beginning. Ftrace creates two frames:
>>
>>  - One for the traced function
>>
>>  - One for the caller of the traced function
>>
>> That gives a reliable stack trace while executing in the ftrace
>> infrastructure code. When ftrace returns to the traced function, the frames
>> are popped and everything is back to normal.
>>
>> However, in cases like live patch, execution is redirected to a different
>> function when ftrace returns. A stack trace taken while still in the ftrace
>> infrastructure code will not show the target function. The target function
>> is the real function that we want to track.
>>
>> So, if an FTRACE frame is detected on the stack, just mark the stack trace
>> as unreliable.
> 
> To identify this case, please identify the ftrace trampolines instead,
> e.g. ftrace_regs_caller, return_to_handler.
> 

Yes. As part of the return address checking, I will check this. IIUC, I think
that I need to check for the inner labels that are defined at the point where
the instructions are patched for ftrace, e.g., ftrace_call and
ftrace_graph_call.

SYM_INNER_LABEL(ftrace_call, SYM_L_GLOBAL)
	bl	ftrace_stub			// <=== patched call site

#ifdef CONFIG_FUNCTION_GRAPH_TRACER
SYM_INNER_LABEL(ftrace_graph_call, SYM_L_GLOBAL)	// ftrace_graph_caller();
	nop					// <=== If enabled, this will be replaced
						// "b ftrace_graph_caller"
#endif

For instance, the stack trace I got while tracing do_mmap() with the function
tracer's stack trace option looks like this:

 ...
[  338.911793]   trace_function+0xc4/0x160
[  338.911801]   function_stack_trace_call+0xac/0x130
[  338.911807]   ftrace_graph_call+0x0/0x4
[  338.911813]   do_mmap+0x8/0x598
[  338.911820]   vm_mmap_pgoff+0xf4/0x188
[  338.911826]   ksys_mmap_pgoff+0x1d8/0x220
[  338.911832]   __arm64_sys_mmap+0x38/0x50
[  338.911839]   el0_svc_common.constprop.0+0x70/0x1a8
[  338.911846]   do_el0_svc+0x2c/0x98
[  338.911851]   el0_svc+0x2c/0x70
[  338.911859]   el0_sync_handler+0xb0/0xb8
[  338.911864]   el0_sync+0x180/0x1c0
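
For reference, the check I have in mind would look roughly like this. This is
a sketch only: the three symbols exist in the arm64 entry code, but the helper
itself and exactly where it plugs into the unwinder are still open questions.

/* These are code labels; we only compare addresses. */
extern char ftrace_call[], ftrace_graph_call[], return_to_handler[];

static bool pc_in_ftrace_code(unsigned long pc)
{
	/*
	 * A return address at one of the ftrace patch sites or trampolines
	 * means the trace was taken from within the ftrace infrastructure,
	 * so the trace should be marked unreliable. A real implementation
	 * may need to allow for the +4 of a return address after a bl.
	 */
	return pc == (unsigned long)ftrace_call ||
	       pc == (unsigned long)ftrace_graph_call ||
	       pc == (unsigned long)return_to_handler;
}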

> It'd be good to check *exactly* when we need to reject, since IIUC when
> we have a graph stack entry the unwind will be correct from livepatch's
> PoV.
> 

The current unwinder already handles this case as follows:

#ifdef CONFIG_FUNCTION_GRAPH_TRACER
	if (tsk->ret_stack &&
	    (ptrauth_strip_insn_pac(frame->pc) == (unsigned long)return_to_handler)) {
		struct ftrace_ret_stack *ret_stack;
		/*
		 * This is a case where function graph tracer has
		 * modified a return address (LR) in a stack frame
		 * to hook a function return.
		 * So replace it to an original value.
		 */
		ret_stack = ftrace_graph_get_ret_stack(tsk, frame->graph++);
		if (WARN_ON_ONCE(!ret_stack))
			return -EINVAL;
		frame->pc = ret_stack->ret;
	}
#endif /* CONFIG_FUNCTION_GRAPH_TRACER */

Is there anything else that needs handling here?

Thanks,

Madhavan


Re: [RFC PATCH v2 4/8] arm64: Detect an EL1 exception frame and mark a stack trace unreliable

2021-03-23 Thread Madhavan T. Venkataraman



On 3/23/21 5:42 AM, Mark Rutland wrote:
> On Mon, Mar 15, 2021 at 11:57:56AM -0500, madve...@linux.microsoft.com wrote:
>> From: "Madhavan T. Venkataraman" 
>>
>> EL1 exceptions can happen on any instruction including instructions in
>> the frame pointer prolog or epilog. Depending on where exactly they happen,
>> they could render the stack trace unreliable.
>>
>> If an EL1 exception frame is found on the stack, mark the stack trace as
>> unreliable.
>>
>> Now, the EL1 exception frame is not at any well-known offset on the stack.
>> It can be anywhere on the stack. In order to properly detect an EL1
>> exception frame the following checks must be done:
>>
>>  - The frame type must be EL1_FRAME.
>>
>>  - When the register state is saved in the EL1 pt_regs, the frame
>>pointer x29 is saved in pt_regs->regs[29] and the return PC
>>is saved in pt_regs->pc. These must match with the current
>>frame.
> 
> Before you can do this, you need to reliably identify that you have a
> pt_regs on the stack, but this patch uses a heuristic, which is not
> reliable.
> 
> However, instead you can identify whether you're trying to unwind
> through one of the EL1 entry functions, which tells you the same thing
> without even having to look at the pt_regs.
> 
> We can do that based on the entry functions all being in .entry.text,
> which we could further sub-divide to split the EL0 and EL1 entry
> functions.
> 

Yes. I will check the entry functions. But I still think that we should
not rely on just one check. The additional checks will make it robust.
So, I suggest that the return address be checked first. If that passes,
then we can be reasonably sure that there are pt_regs. Then, check
the other things in pt_regs.

>>
>> Interrupts encountered in kernel code are also EL1 exceptions. At the end
>> of an interrupt, the interrupt handler checks if the current task must be
>> preempted for any reason. If so, it calls the preemption code which takes
>> the task off the CPU. A stack trace taken on the task after the preemption
>> will show the EL1 frame and will be considered unreliable. This is correct
>> behavior as preemption can happen practically at any point in code
>> including the frame pointer prolog and epilog.
>>
>> Breakpoints encountered in kernel code are also EL1 exceptions. The probing
>> infrastructure uses breakpoints for executing probe code. While in the probe
>> code, the stack trace will show an EL1 frame and will be considered
>> unreliable. This is also correct behavior.
>>
>> Signed-off-by: Madhavan T. Venkataraman 
>> ---
>>  arch/arm64/include/asm/stacktrace.h |  2 +
>>  arch/arm64/kernel/stacktrace.c  | 57 +
>>  2 files changed, 59 insertions(+)
>>
>> diff --git a/arch/arm64/include/asm/stacktrace.h b/arch/arm64/include/asm/stacktrace.h
>> index eb29b1fe8255..684f65808394 100644
>> --- a/arch/arm64/include/asm/stacktrace.h
>> +++ b/arch/arm64/include/asm/stacktrace.h
>> @@ -59,6 +59,7 @@ struct stackframe {
>>  #ifdef CONFIG_FUNCTION_GRAPH_TRACER
>>  	int graph;
>>  #endif
>> +	bool reliable;
>>  };
>>  
>>  extern int unwind_frame(struct task_struct *tsk, struct stackframe *frame);
>> @@ -169,6 +170,7 @@ static inline void start_backtrace(struct stackframe *frame,
>>  	bitmap_zero(frame->stacks_done, __NR_STACK_TYPES);
>>  	frame->prev_fp = 0;
>>  	frame->prev_type = STACK_TYPE_UNKNOWN;
>> +	frame->reliable = true;
>>  }
>>  
>>  #endif  /* __ASM_STACKTRACE_H */
>> diff --git a/arch/arm64/kernel/stacktrace.c b/arch/arm64/kernel/stacktrace.c
>> index 504cd161339d..6ae103326f7b 100644
>> --- a/arch/arm64/kernel/stacktrace.c
>> +++ b/arch/arm64/kernel/stacktrace.c
>> @@ -18,6 +18,58 @@
>>  #include 
>>  #include 
>>  
>> +static void check_if_reliable(unsigned long fp, struct stackframe *frame,
>> +			      struct stack_info *info)
>> +{
>> +	struct pt_regs *regs;
>> +	unsigned long regs_start, regs_end;
>> +
>> +	/*
>> +	 * If the stack trace has already been marked unreliable, just
>> +	 * return.
>> +	 */
>> +	if (!frame->reliable)
>> +		return;
>> +
>> +	/*
>> +	 * Assume that this is an intermediate marker frame inside a pt_regs
>> +	 * structure created on the stack and get the pt_regs pointer. Other
>> +	 * checks will be done below to make sure that this is a marker
>> +	 * frame.
>> +	 */
> 
> Sorry, but NAK to this approach specifically. This isn't reliable (since
> it can be influenced by arbitrary data on the stack), and it's far more
> complicated than identifying the entry functions specifically.
> 

As I mentioned above, I agree that we should check the return address. But
just as a precaution, I think we should double check the pt_regs.

Is that OK with you? It does not take away anything or increase the risk in
any way. I think it makes it more robust.

Thanks.

Madhavan


Re: [RFC PATCH v2 3/8] arm64: Terminate the stack trace at TASK_FRAME and EL0_FRAME

2021-03-23 Thread Madhavan T. Venkataraman



On 3/23/21 5:36 AM, Mark Rutland wrote:
> On Thu, Mar 18, 2021 at 03:29:19PM -0500, Madhavan T. Venkataraman wrote:
>>
>>
>> On 3/18/21 1:26 PM, Mark Brown wrote:
>>> On Mon, Mar 15, 2021 at 11:57:55AM -0500, madve...@linux.microsoft.com wrote:
>>>
>>>> +	/* Terminal record, nothing to unwind */
>>>> +	if (fp == (unsigned long) regs->stackframe) {
>>>> +		if (regs->frame_type == TASK_FRAME ||
>>>> +		    regs->frame_type == EL0_FRAME)
>>>> +			return -ENOENT;
>>>> 		return -EINVAL;
>>>> +	}
>>>
>>> This is conflating the reliable stacktrace checks (which your series
>>> will later flag up with frame->reliable) with verifying that we found
>>> the bottom of the stack by looking for this terminal stack frame record.
>>> For the purposes of determining if the unwinder got to the bottom of the
>>> stack we don't care what stack type we're looking at, we just care if it
>>> managed to walk to this defined final record.  
>>>
>>> At the minute nothing except reliable stack trace has any intention of
>>> checking the specific return code but it's clearer to be consistent.
>>>
>>
>> So, you are saying that the type check is redundant. OK. I will remove it
>> and just return -ENOENT on reaching the final record.
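
In other words, the terminal check would reduce to something like this
(sketch of the planned change, not the final patch):

	/* Terminal record, nothing to unwind */
	if (fp == (unsigned long)regs->stackframe)
		return -ENOENT;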
> 
> Yes please; and please fold that into the same patch that adds the final
> records.
> 

Will do.

Thanks.

Madhavan


Re: [RFC PATCH v2 1/8] arm64: Implement stack trace termination record

2021-03-23 Thread Madhavan T. Venkataraman



On 3/23/21 5:24 AM, Mark Rutland wrote:
> On Fri, Mar 19, 2021 at 05:03:09PM -0500, Madhavan T. Venkataraman wrote:
>> I solved this by using existing functions logically instead of inventing a
>> dummy function. I initialize pt_regs->stackframe[1] to an existing function
>> so that the stack trace will not show a 0x0 entry as well as the kernel and
>> gdb will show identical stack traces.
>>
>> For all task stack traces including the idle tasks, the stack trace will
>> end at copy_thread() as copy_thread() is the function that initializes the
>> pt_regs and the first stack frame for a task.
> 
> I don't think this is a good idea, as it will mean that copy_thread()
> will appear to be live in every thread, and therefore will not be
> patchable.
> 
> There are other things people need to be aware of when using an external
> debugger (e.g. during EL0<->ELx transitions there are periods when X29
> and X30 contain the EL0 values, and cannot be used to unwind), so I
> don't think there's a strong need to make this look prettier to an
> external debugger.
> 

OK.

>> For EL0 exceptions, the stack trace will end with vectors() as vectors
>> entries call the EL0 handlers.
>>
>> Here are sample stack traces (I only show the ending of each trace):
>>
>> Idle task on primary CPU
>> ========================
>>
>>   ...
>> [0.022557]   start_kernel+0x5b8/0x5f4
>> [0.022570]   __primary_switched+0xa8/0xb8
>> [0.022578]   copy_thread+0x0/0x188
>>
>> Idle task on secondary CPU
>> ==========================
>>
>>   ...
>> [0.023397]   secondary_start_kernel+0x188/0x1e0
>> [0.023406]   __secondary_switched+0x40/0x88
>> [0.023415]   copy_thread+0x0/0x188
>>
>> All other kernel threads
>> ========================
>>
>>   ...
>> [   13.501062]   ret_from_fork+0x10/0x18
>> [   13.507998]   copy_thread+0x0/0x188
>>
>> User threads (EL0 exception)
>> ============================
>>
>> write(2) system call example:
>>
>>   ...
>> [  521.686148]   vfs_write+0xc8/0x2c0
>> [  521.686156]   ksys_write+0x74/0x108
>> [  521.686161]   __arm64_sys_write+0x24/0x30
>> [  521.686166]   el0_svc_common.constprop.0+0x70/0x1a8
>> [  521.686175]   do_el0_svc+0x2c/0x98
>> [  521.686180]   el0_svc+0x2c/0x70
>> [  521.686188]   el0_sync_handler+0xb0/0xb8
>> [  521.686193]   el0_sync+0x17c/0x180
>> [  521.686198]   vectors+0x0/0x7d8
> 
> [...]
> 
>> If you approve, the above will become RFC Patch v3 1/8 in the next version.
> 
> As above, I don't think we should repurpose an existing function here,
> and my preference is to use 0x0.
> 

OK.

>> Let me know.
>>
>> Also, I could introduce an extra frame in the EL1 exception stack trace that
>> includes vectors so the stack trace would look like this (timer interrupt 
>> example):
>>
>> call_timer_fn
>> run_timer_softirq
>> __do_softirq
>> irq_exit
>> __handle_domain_irq
>> gic_handle_irq
>> el1_irq
>> vectors
>>
This way, if the unwinder finds vectors, it knows that it is an exception frame.
> 
> I can see this might make it simpler to detect exception boundaries, but
> I suspect that we need other information anyway, so this doesn't become
> all that helpful. For EL0<->EL1 exception boundaries we want to
> successfully terminate a robust stacktrace whereas for EL1<->EL1
> exception boundaries we want to fail a robust stacktrace.
> 
> I reckon we have to figure that out from the el1_* and el0_* entry
> points (which I am working to reduce/simplify as part of the entry
> assembly conversion to C). With that we can terminate unwind at the
> el0_* parts, and reject unwinding across any other bit of .entry.text.
> 

OK. That is fine.
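
For concreteness, the rule could look something like this. This is a sketch:
__entry_text_start/end exist today, while the el0_* sub-section bounds are
part of the rework you describe and are hypothetical here.

extern char el0_entry_start[], el0_entry_end[];	/* hypothetical sub-section */

static int unwind_check_entry_text(unsigned long pc)
{
	if (pc >= (unsigned long)el0_entry_start &&
	    pc < (unsigned long)el0_entry_end)
		return -ENOENT;	/* EL0<->EL1 boundary: terminate cleanly */

	if (pc >= (unsigned long)__entry_text_start &&
	    pc < (unsigned long)__entry_text_end)
		return -EINVAL;	/* EL1<->EL1 exception: unreliable */

	return 0;
}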

Thanks.

Madhavan


Re: [RFC PATCH v2 1/8] arm64: Implement stack trace termination record

2021-03-19 Thread Madhavan T. Venkataraman



On 3/19/21 1:19 PM, Madhavan T. Venkataraman wrote:
> 
> 
> On 3/19/21 9:29 AM, Madhavan T. Venkataraman wrote:
>>
>>
>> On 3/19/21 7:30 AM, Mark Brown wrote:
>>> On Thu, Mar 18, 2021 at 03:26:13PM -0500, Madhavan T. Venkataraman wrote:
>>>> On 3/18/21 10:09 AM, Mark Brown wrote:
>>>
>>>>> If we are going to add the extra record there would probably be less
>>>>> potential for confusion if we pointed it at some sensibly named dummy
>>>>> function so anything or anyone that does see it on the stack doesn't get
>>>>> confused by a NULL.
>>>
>>>> I agree. I will think about this some more. If no other solution presents
>>>> itself, I will add the dummy function.
>>>
>>> After discussing this with Mark Rutland offlist he convinced me that so
>>> long as we ensure the kernel doesn't print the NULL record we're
>>> probably OK here, the effort setting the function pointer up correctly
>>> in all circumstances (especially when we're not in the normal memory
>>> map) is probably not worth it for the limited impact it's likely to have
>>> to see the NULL pointer (probably mainly a person working with some
>>> external debugger).  It should be noted in the changelog though, and/or
>>> merged in with the relevant change to the unwinder.
>>>
>>
>> OK. I will add a comment as well as note it in the changelog.
>>
>> Thanks to both of you.
>>
>> Madhavan
>>
> 
> I thought about this some more. I think I have a simple solution. I will
> prepare a patch and send it out. If you and Mark Rutland approve, I will
> include it in the next version of this RFC.
> 
> Madhavan
> 

I solved this by using existing functions logically instead of inventing a
dummy function. I initialize pt_regs->stackframe[1] to an existing function
so that the stack trace will not show a 0x0 entry as well as the kernel and
gdb will show identical stack traces.

For all task stack traces including the idle tasks, the stack trace will
end at copy_thread() as copy_thread() is the function that initializes the
pt_regs and the first stack frame for a task.

For EL0 exceptions, the stack trace will end with vectors() as vectors
entries call the EL0 handlers.

Here are sample stack traces (I only show the ending of each trace):

Idle task on primary CPU
========================

 ...
[0.022557]   start_kernel+0x5b8/0x5f4
[0.022570]   __primary_switched+0xa8/0xb8
[0.022578]   copy_thread+0x0/0x188

Idle task on secondary CPU
==========================

 ...
[0.023397]   secondary_start_kernel+0x188/0x1e0
[0.023406]   __secondary_switched+0x40/0x88
[0.023415]   copy_thread+0x0/0x188

All other kernel threads
========================

 ...
[   13.501062]   ret_from_fork+0x10/0x18
[   13.507998]   copy_thread+0x0/0x188

User threads (EL0 exception)
============================

write(2) system call example:

 ...
[  521.686148]   vfs_write+0xc8/0x2c0
[  521.686156]   ksys_write+0x74/0x108
[  521.686161]   __arm64_sys_write+0x24/0x30
[  521.686166]   el0_svc_common.constprop.0+0x70/0x1a8
[  521.686175]   do_el0_svc+0x2c/0x98
[  521.686180]   el0_svc+0x2c/0x70
[  521.686188]   el0_sync_handler+0xb0/0xb8
[  521.686193]   el0_sync+0x17c/0x180
[  521.686198]   vectors+0x0/0x7d8

Here are the code changes:


diff --git a/arch/arm64/kernel/entry.S b/arch/arm64/kernel/entry.S
index a31a0a713c85..514307e80b79 100644
--- a/arch/arm64/kernel/entry.S
+++ b/arch/arm64/kernel/entry.S
@@ -261,16 +261,19 @@ alternative_else_nop_endif
 	stp	lr, x21, [sp, #S_LR]
 
 	/*
-	 * For exceptions from EL0, terminate the callchain here.
+	 * For exceptions from EL0, terminate the callchain here at
+	 * task_pt_regs(current)->stackframe.
+	 *
 	 * For exceptions from EL1, create a synthetic frame record so the
 	 * interrupted code shows up in the backtrace.
 	 */
 	.if	\el == 0
-	mov	x29, xzr
+	ldr	x17, =vectors
+	stp	xzr, x17, [sp, #S_STACKFRAME]
 	.else
 	stp	x29, x22, [sp, #S_STACKFRAME]
-	add	x29, sp, #S_STACKFRAME
 	.endif
+	add	x29, sp, #S_STACKFRAME
 
 #ifdef CONFIG_ARM64_SW_TTBR0_PAN
 alternative_if_not ARM64_HAS_PAN
diff --git a/arch/arm64/kernel/head.S b/arch/arm64/kernel/head.S
index 66b0e0b66e31..699e0dd313a1 100644
--- a/arch/arm64/kernel/head.S
+++ b/arch/arm64/kernel/head.S
@@ -393,6 +393,29 @@ SYM_FUNC_START_LOCAL(__create_page_tables)
 	ret	x28
 SYM_FUNC_END(__create_page_tables)
 
+	/*
+	 * The boot task becomes the idle task for the primary CPU. The
+	 * CPU

Re: [RFC PATCH v2 1/8] arm64: Implement stack trace termination record

2021-03-19 Thread Madhavan T. Venkataraman



On 3/19/21 9:29 AM, Madhavan T. Venkataraman wrote:
> 
> 
> On 3/19/21 7:30 AM, Mark Brown wrote:
>> On Thu, Mar 18, 2021 at 03:26:13PM -0500, Madhavan T. Venkataraman wrote:
>>> On 3/18/21 10:09 AM, Mark Brown wrote:
>>
>>>> If we are going to add the extra record there would probably be less
>>>> potential for confusion if we pointed it at some sensibly named dummy
>>>> function so anything or anyone that does see it on the stack doesn't get
>>>> confused by a NULL.
>>
>>> I agree. I will think about this some more. If no other solution presents
>>> itself, I will add the dummy function.
>>
>> After discussing this with Mark Rutland offlist he convinced me that so
>> long as we ensure the kernel doesn't print the NULL record we're
>> probably OK here, the effort setting the function pointer up correctly
>> in all circumstances (especially when we're not in the normal memory
>> map) is probably not worth it for the limited impact it's likely to have
>> to see the NULL pointer (probably mainly a person working with some
>> external debugger).  It should be noted in the changelog though, and/or
>> merged in with the relevant change to the unwinder.
>>
> 
> OK. I will add a comment as well as note it in the changelog.
> 
> Thanks to both of you.
> 
> Madhavan
> 

I thought about this some more. I think I have a simple solution. I will
prepare a patch and send it out. If you and Mark Rutland approve, I will
include it in the next version of this RFC.

Madhavan


Re: [RFC PATCH v2 2/8] arm64: Implement frame types

2021-03-19 Thread Madhavan T. Venkataraman



On 3/19/21 11:20 AM, Mark Brown wrote:
> On Fri, Mar 19, 2021 at 10:02:52AM -0500, Madhavan T. Venkataraman wrote:
>> On 3/19/21 9:40 AM, Madhavan T. Venkataraman wrote:
> 
>>> Actually now I look again it's just not adding anything on EL2 entries
>>> at all, they use a separate set of macros which aren't updated - this
>>> will only update things for EL0 and EL1 entries so my comment above
>>> about this tracking EL2 as EL1 isn't accurate.
> 
>> So, do I need to do anything here?
> 
> Probably worth some note somewhere about other stack types existing and
> how they end up being handled, in the changelog at least.
> 
OK.

Madhavan


Re: [RFC PATCH v2 2/8] arm64: Implement frame types

2021-03-19 Thread Madhavan T. Venkataraman



On 3/19/21 9:40 AM, Madhavan T. Venkataraman wrote:
>> Are you referring to ARMv8.1 VHE extension where the kernel can run
>> at EL2? Could you elaborate? I thought that EL2 was basically for
>> Hypervisors.
> KVM is the main case, yes - IIRC otherwise it's mainly error handlers
> but I might be missing something.  We do recommend that the kernel is
> started at EL2 where possible.
> 
> Actually now I look again it's just not adding anything on EL2 entries
> at all, they use a separate set of macros which aren't updated - this
> will only update things for EL0 and EL1 entries so my comment above
> about this tracking EL2 as EL1 isn't accurate.

So, do I need to do anything here?

Madhavan


Re: [RFC PATCH v2 2/8] arm64: Implement frame types

2021-03-19 Thread Madhavan T. Venkataraman



On 3/19/21 8:22 AM, Mark Brown wrote:
> On Thu, Mar 18, 2021 at 05:22:49PM -0500, Madhavan T. Venkataraman wrote:
>> On 3/18/21 12:40 PM, Mark Brown wrote:
> 
>>> Unless I'm misreading what's going on here this is more trying to set a
>>> type for the stack as a whole than for a specific stack frame.  I'm also
>>> finding this a bit confusing as the unwinder already tracks things it
>>> calls frame types and it handles types that aren't covered here like
>>> SDEI.  At the very least there's a naming issue here.
> 
>> Both these frames are on the task stack. So, it is not a stack type.
> 
> OTOH it's also not something that applies to every frame but only to the
> base frame from each stack which I think was more where I was coming
> from there.  In any case, the issue is also that there's already another
> thing that the unwinder calls a frame type so there's at least that
> collision which needs to be resolved if nothing else.
> 

It applies to the base frame from each stack as well as to intermediate
marker frames such as the EL1 frame and the FTRACE frame.

As for the frame type, I will try to come up with a better name.

>>> Taking a step back though do we want to be tracking this via pt_regs?
>>> It's reliant on us robustly finding the correct pt_regs and on having
>>> the things that make the stack unreliable explicitly go in and set the
>>> appropriate type.  That seems like it will be error prone, I'd been
>>> expecting to do something more like using sections to filter code for
>>> unreliable features based on the addresses of the functions we find on
>>> the stack or similar.  This could still go wrong of course but there's
>>> fewer moving pieces, and especially fewer moving pieces specific to
>>> reliable stack trace.
> 
>> In that case, I suggest doing both. That is, check the type as well
>> as specific functions. For instance, in the EL1 pt_regs, in addition
>> to the above checks, check the PC against el1_sync(), el1_irq() and
>> el1_error(). I have suggested this in the cover letter.
> 
>> If this is OK with you, we could do that. We want to make really sure that
>> nothing goes wrong with detecting the exception frame.
> 
> ...
> 
>> If you dislike the frame type, I could remove it and just do the
>> following checks:
> 
>>  FP == pt_regs->regs[29]
>>  PC == pt_regs->pc
>>  and the address check against el1_*() functions
> 
>> and similar changes for EL0 as well.
> 
>> I still think that the frame type check makes it more robust.
> 
> Yeah, we know the entry points so they can serve the same role as
> checking an explicitly written value.  It does mean one less operation
> on exception entry, though I'm not sure that's that a big enough
> overhead to actually worry about.  I don't have *super* strong opinions
> against adding the explicitly written value other than it being one more
> thing we don't otherwise use which we have to get right for reliable
> stack trace, there's a greater risk of bitrot if it's not something that
> we ever look at outside of the reliable stack trace code.
> 

So, I will add the address checks for robustness. I will think some more
about the frame type.

>>>> EL1_FRAME
>>>>EL1 exception frame.
> 
>>> We do trap into EL2 as well, the patch will track EL2 frames as EL1
>>> frames.  Even if we can treat them the same the naming ought to be
>>> clear.
> 
>> Are you referring to ARMv8.1 VHE extension where the kernel can run
>> at EL2? Could you elaborate? I thought that EL2 was basically for
>> Hypervisors.
> 
> KVM is the main case, yes - IIRC otherwise it's mainly error handlers
> but I might be missing something.  We do recommend that the kernel is
> started at EL2 where possible.
> 
> Actually now I look again it's just not adding anything on EL2 entries
> at all, they use a separate set of macros which aren't updated - this
> will only update things for EL0 and EL1 entries so my comment above
> about this tracking EL2 as EL1 isn't accurate.
> 

OK.

Madhavan


Re: [RFC PATCH v2 1/8] arm64: Implement stack trace termination record

2021-03-19 Thread Madhavan T. Venkataraman



On 3/19/21 7:30 AM, Mark Brown wrote:
> On Thu, Mar 18, 2021 at 03:26:13PM -0500, Madhavan T. Venkataraman wrote:
>> On 3/18/21 10:09 AM, Mark Brown wrote:
> 
>>> If we are going to add the extra record there would probably be less
>>> potential for confusion if we pointed it at some sensibly named dummy
>>> function so anything or anyone that does see it on the stack doesn't get
>>> confused by a NULL.
> 
>> I agree. I will think about this some more. If no other solution presents
>> itself, I will add the dummy function.
> 
> After discussing this with Mark Rutland offlist he convinced me that so
> long as we ensure the kernel doesn't print the NULL record we're
> probably OK here, the effort setting the function pointer up correctly
> in all circumstances (especially when we're not in the normal memory
> map) is probably not worth it for the limited impact it's likely to have
> to see the NULL pointer (probably mainly a person working with some
> external debugger).  It should be noted in the changelog though, and/or
> merged in with the relevant change to the unwinder.
> 

OK. I will add a comment as well as note it in the changelog.

Thanks to both of you.

Madhavan


Re: [RFC PATCH v2 2/8] arm64: Implement frame types

2021-03-18 Thread Madhavan T. Venkataraman



On 3/18/21 12:40 PM, Mark Brown wrote:
> On Mon, Mar 15, 2021 at 11:57:54AM -0500, madve...@linux.microsoft.com wrote:
> 
>> To summarize, pt_regs->stackframe is used (or will be used) as a marker
>> frame in stack traces. To enable the unwinder to detect these frames, tag
>> each pt_regs->stackframe with a type. To record the type, use the unused2
>> field in struct pt_regs and rename it to frame_type. The types are:
> 
> Unless I'm misreading what's going on here this is more trying to set a
> type for the stack as a whole than for a specific stack frame.  I'm also
> finding this a bit confusing as the unwinder already tracks things it
> calls frame types and it handles types that aren't covered here like
> SDEI.  At the very least there's a naming issue here.
> 

When the unwinder gets to EL1 pt_regs->stackframe, it needs to be sure that
it is indeed a frame inside an EL1 pt_regs structure. It performs the
following checks:

FP == pt_regs->regs[29]
PC == pt_regs->pc
type == EL1_FRAME

to confirm that the frame is EL1 pt_regs->stackframe.

Similarly, for EL0, the type is EL0_FRAME.

Both these frames are on the task stack. So, it is not a stack type.
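
In code, the check would be roughly as follows. This is a sketch:
pt_regs->stackframe already exists; frame_type, EL1_FRAME and the helper name
are what this series adds.

static bool is_el1_marker_frame(unsigned long fp)
{
	unsigned long *record = (unsigned long *)fp;
	struct pt_regs *regs;

	/* The candidate record would be pt_regs->stackframe. */
	regs = (struct pt_regs *)(fp - offsetof(struct pt_regs, stackframe));

	return record[0] == regs->regs[29] &&	/* saved x29 */
	       record[1] == regs->pc &&		/* saved return PC */
	       regs->frame_type == EL1_FRAME;
}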

> Taking a step back though do we want to be tracking this via pt_regs?
> It's reliant on us robustly finding the correct pt_regs and on having
> the things that make the stack unreliable explicitly go in and set the
> appropriate type.  That seems like it will be error prone, I'd been
> expecting to do something more like using sections to filter code for
> unreliable features based on the addresses of the functions we find on
> the stack or similar.  This could still go wrong of course but there's
> fewer moving pieces, and especially fewer moving pieces specific to
> reliable stack trace.
> 

In that case, I suggest doing both. That is, check the type as well
as specific functions. For instance, in the EL1 pt_regs, in addition
to the above checks, check the PC against el1_sync(), el1_irq() and
el1_error(). I have suggested this in the cover letter.

If this is OK with you, we could do that. We want to make really sure that
nothing goes wrong with detecting the exception frame.

> I'm wary of tracking data that only ever gets used for the reliable
> stack trace path given that it's going to be fairly infrequently used
> and hence tested, especially things that only crop up in cases that are
> hard to provoke reliably.  If there's a way to detect things that
> doesn't use special data that seems safer.
> 

If you dislike the frame type, I could remove it and just do the
following checks:

FP == pt_regs->regs[29]
PC == pt_regs->pc
and the address check against el1_*() functions

and similar changes for EL0 as well.

I still think that the frame type check makes it more robust.

>> EL1_FRAME
>>  EL1 exception frame.
> 
> We do trap into EL2 as well, the patch will track EL2 frames as EL1
> frames.  Even if we can treat them the same the naming ought to be
> clear.
> 

Are you referring to ARMv8.1 VHE extension where the kernel can run
at EL2? Could you elaborate? I thought that EL2 was basically for
Hypervisors.

Thanks.

>> FTRACE_FRAME
>> FTRACE frame.
> 
> This is implemented later in the series.  If using this approach I'd
> suggest pulling the change in entry-ftrace.S that sets this into this
> patch, it's easier than adding a note about this being added later and
> should help with any bisect issues.
> 

OK. Good point.

Madhavan


Re: [RFC PATCH v2 3/8] arm64: Terminate the stack trace at TASK_FRAME and EL0_FRAME

2021-03-18 Thread Madhavan T. Venkataraman



On 3/18/21 1:26 PM, Mark Brown wrote:
> On Mon, Mar 15, 2021 at 11:57:55AM -0500, madve...@linux.microsoft.com wrote:
> 
>> +	/* Terminal record, nothing to unwind */
>> +	if (fp == (unsigned long) regs->stackframe) {
>> +		if (regs->frame_type == TASK_FRAME ||
>> +		    regs->frame_type == EL0_FRAME)
>> +			return -ENOENT;
>> 		return -EINVAL;
>> +	}
> 
> This is conflating the reliable stacktrace checks (which your series
> will later flag up with frame->reliable) with verifying that we found
> the bottom of the stack by looking for this terminal stack frame record.
> For the purposes of determining if the unwinder got to the bottom of the
> stack we don't care what stack type we're looking at, we just care if it
> managed to walk to this defined final record.  
> 
> At the minute nothing except reliable stack trace has any intention of
> checking the specific return code but it's clearer to be consistent.
> 

So, you are saying that the type check is redundant. OK. I will remove it
and just return -ENOENT on reaching the final record.

Madhavan



Re: [RFC PATCH v2 1/8] arm64: Implement stack trace termination record

2021-03-18 Thread Madhavan T. Venkataraman



On 3/18/21 10:09 AM, Mark Brown wrote:
> On Mon, Mar 15, 2021 at 11:57:53AM -0500, madve...@linux.microsoft.com wrote:
> 
>> In summary, task pt_regs->stackframe is where a successful stack trace ends.
> 
>>  	.if	\el == 0
>> -	mov	x29, xzr
>> +	stp	xzr, xzr, [sp, #S_STACKFRAME]
>>  	.else
>>  	stp	x29, x22, [sp, #S_STACKFRAME]
>> -	add	x29, sp, #S_STACKFRAME
>>  	.endif
>> +	add	x29, sp, #S_STACKFRAME
> 
> For both user and kernel threads this patch (at least by itself) results
> in an additional record being reported in stack traces with a NULL
> function pointer since it keeps the existing record where it is and adds
> this new fixed record below it.  This is addressed for the kernel later
> in the series, by "arm64: Terminate the stack trace at TASK_FRAME and
> EL0_FRAME", but will still be visible to other unwinders such as
> debuggers.  I'm not sure that this *matters* but it might and should at
> least be called out more explicitly.
> 
> If we are going to add the extra record there would probably be less
> potential for confusion if we pointed it at some sensibly named dummy
> function so anything or anyone that does see it on the stack doesn't get
> confused by a NULL.
> 

I agree. I will think about this some more. If no other solution presents
itself, I will add the dummy function.

Madhavan



Re: [RFC PATCH v2 0/8] arm64: Implement reliable stack trace

2021-03-15 Thread Madhavan T. Venkataraman



On 3/15/21 11:57 AM, madve...@linux.microsoft.com wrote:
> Proper termination of the stack trace
> =====================================
> 
> In the unwinder, check the following for properly terminating the stack
> trace:
> 
>   - Check every frame to see if it is task_pt_regs(stack)->stackframe.
> If it is, terminate the stack trace successfully.
> 

There is a typo in the above sentence. task_pt_regs(stack)->stackframe
should be task_pt_regs(task)->stackframe.

Sorry about that.

Madhavan


Re: [RFC PATCH v1 1/1] arm64: Unwinder enhancements for reliable stack trace

2021-03-01 Thread Madhavan T. Venkataraman



On 2/25/21 5:58 AM, Mark Rutland wrote:
> On Wed, Feb 24, 2021 at 01:34:09PM -0600, Madhavan T. Venkataraman wrote:
>> On 2/24/21 6:17 AM, Mark Rutland wrote:
>>> On Tue, Feb 23, 2021 at 12:12:43PM -0600, madve...@linux.microsoft.com wrote:
>>>> From: "Madhavan T. Venkataraman" 
>>>>Termination
>>>>===========
>>>>
>>>>Currently, the unwinder terminates when both the FP (frame pointer)
>>>>and the PC (return address) of a frame are 0. But a frame could get
>>>>corrupted and zeroed. There needs to be a better check.
>>>>
>>>>The following special terminating frame and function have been
>>>>defined for this purpose:
>>>>
>>>>const u64 arm64_last_frame[2] __attribute__ ((aligned (16)));
>>>>
>>>>void arm64_last_func(void)
>>>>{
>>>>}
>>>>
>>>>So, set the FP to arm64_last_frame and the PC to arm64_last_func in
>>>>the bottom most frame.
>>>
>>> My expectation was that we'd do this per-task, creating an empty frame
>>> record (i.e. with fp=NULL and lr=NULL) on the task's stack at the
>>> instant it was created, and chaining this into x29. That way the address
>>> is known (since it can be derived from the task), and the frame will
>>> also implicitly check that the callchain terminates on the task stack
>>> without loops. That also means that we can use it to detect the entry
>>> code going wrong (e.g. if the SP gets corrupted), since in that case the
>>> entry code would place the record at a different location.
>>
>> That is exactly what this is doing. arm64_last_frame[] is a marker frame
>> that contains fp=0 and pc=0.
> 
> Almost! What I meant was that rather that each task should have its own
> final/marker frame record on its task task rather than sharing a common
> global variable.
> 
> That way a check for that frame record implicitly checks that a task
> started at the expected location *on that stack*, which catches
> additional stack corruption cases (e.g. if we left data on the stack
> prior to an EL0 entry).
> 

Ok. I will think about this.
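
For reference, with a per-task record the terminal check becomes derivable
from the task alone, something like (sketch):

	/* The final frame record lives in task_pt_regs(tsk)->stackframe. */
	if (fp == (unsigned long)task_pt_regs(tsk)->stackframe)
		return -ENOENT;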

> [...]
> 
>>> ... I reckon once we've moved the last of the exception triage out to C
>>> this will be relatively simple, since all of the exception handlers will
>>> look like:
>>>
>>> | SYM_CODE_START_LOCAL(elX_exception)
>>> |   kernel_entry X
>>> |   mov x0, sp
>>> |   bl  elX_exception_handler
>>> |   kernel_exit X
>>> | SYM_CODE_END(elX_exception)
>>>
>>> ... and so we just need to identify the set of elX_exception functions
>>> (which we'll never expect to take exceptions from directly). We could be
>>> strict and reject unwinding into arbitrary bits of the entry code (e.g.
>>> if we took an unexpected exception), and only permit unwinding to the
>>> BL.
>>>
>>>>It can do this if the FP and PC are also recorded elsewhere in the
>>>>pt_regs for comparison. Currently, the FP is also stored in
>>>>regs->regs[29]. The PC is stored in regs->pc. However, regs->pc can
>>>>be changed by lower level functions.
>>>>
>>>>Create a new field, pt_regs->orig_pc, and record the return address
>>>>PC there. With this, the unwinder can validate the exception frame
>>>>and set a flag so that the caller of the unwinder can know when
>>>>an exception frame is encountered.
>>>
>>> I don't understand the case you're trying to solve here. When is
>>> regs->pc changed in a way that's problematic?
>>>
>>
>> For instance, I used a test driver in which the driver calls a function
>> pointer which is NULL. The low level fault handler sends a signal to the
>> task. Looks like it changes regs->pc for this.
> 
> I'm struggling to follow what you mean here.
> 
> If the kernel branches to NULL, the CPU should raise an instruction
> abort from the current EL, and our handling of that should terminate the
> thread via die_kernel_fault(), without returning to the faulting
> context, and without altering the PC in the faulting context.
> 
> Signals are delivered to userspace and alter the userspace PC, not a
> kernel PC, so this doesn't seem relevant. Do you mean an exception fixup
> handler rather than a signal?
> 
>> When I dump the stack from the low level handler, the comparison with
>> regs->pc does not work.  But comparison with regs->orig_pc works.
> 
> As above, I'm struggling with this; could you share a concrete example? 
> 

Actually, my bad. I needed the orig_pc because of something that my test
driver did. And, it slipped my mind entirely.

Thanks for pointing it out. I will fix this in my resend.

Thanks again for your comments.

I am currently studying probing/tracing. As soon as I have a solution for that,
I will send out the next version. I look forward to the in-depth review.

Thanks,

Madhavan


Re: [RFC PATCH v1 1/1] arm64: Unwinder enhancements for reliable stack trace

2021-02-24 Thread Madhavan T. Venkataraman



On 2/24/21 6:17 AM, Mark Rutland wrote:
> Hi Madhavan,
> 
> As Mark Brown says, I think this needs to be split into several
> patches. I have some comments on the general approach, but I'll save
> in-depth review until this has been split.
> 

OK.

> On Tue, Feb 23, 2021 at 12:12:43PM -0600, madve...@linux.microsoft.com wrote:
>> From: "Madhavan T. Venkataraman" 
>>
>> Unwinder changes
>> ================
>>
>>  Termination
>>  ===========
>>
>>  Currently, the unwinder terminates when both the FP (frame pointer)
>>  and the PC (return address) of a frame are 0. But a frame could get
>>  corrupted and zeroed. There needs to be a better check.
>>
>>  The following special terminating frame and function have been
>>  defined for this purpose:
>>
>>  const u64 arm64_last_frame[2] __attribute__ ((aligned (16)));
>>
>>  void arm64_last_func(void)
>>  {
>>  }
>>
>>  So, set the FP to arm64_last_frame and the PC to arm64_last_func in
>>  the bottom most frame.
> 
> My expectation was that we'd do this per-task, creating an empty frame
> record (i.e. with fp=NULL and lr=NULL) on the task's stack at the
> instant it was created, and chaining this into x29. That way the address
> is known (since it can be derived from the task), and the frame will
> also implicitly check that the callchain terminates on the task stack
> without loops. That also means that we can use it to detect the entry
> code going wrong (e.g. if the SP gets corrupted), since in that case the
> entry code would place the record at a different location.
> 

That is exactly what this is doing. arm64_last_frame[] is a marker frame
that contains fp=0 and pc=0.

>>
>>  Exception/Interrupt detection
>>  =============================
>>
>>  An EL1 exception renders the stack trace unreliable as it can happen
>>  anywhere including the frame pointer prolog and epilog. The
>>  unwinder needs to be able to detect the exception on the stack.
>>
>>  Currently, the EL1 exception handler sets up pt_regs on the stack
>>  and chains pt_regs->stackframe with the other frames on the stack.
>>  But, the unwinder does not know where this exception frame is in
>>  the stack trace.
>>
>>  Set the LSB of the exception frame FP to allow the unwinder to
>>  detect the exception frame. When the unwinder detects the frame,
>>  it needs to make sure that it is really an exception frame and
>>  not the result of any stack corruption.
> 
> I'm not keen on messing with the encoding of the frame record as this
> will break external unwinders (e.g. using GDB on a kernel running under
> QEMU). I'd rather that we detected the exception boundary based on the
> LR, similar to what we did in commit:
> 

OK. I will take a look at the commit you mentioned.

>   7326749801396105 ("arm64: unwind: reference pt_regs via embedded stack frame")
> 
> ... I reckon once we've moved the last of the exception triage out to C
> this will be relatively simple, since all of the exception handlers will
> look like:
> 
> | SYM_CODE_START_LOCAL(elX_exception)
> | kernel_entry X
> | mov x0, sp
> | bl  elX_exception_handler
> | kernel_exit X
> | SYM_CODE_END(elX_exception)
> 
> ... and so we just need to identify the set of elX_exception functions
> (which we'll never expect to take exceptions from directly). We could be
> strict and reject unwinding into arbitrary bits of the entry code (e.g.
> if we took an unexpected exception), and only permit unwinding to the
> BL.
> 
>>  It can do this if the FP and PC are also recorded elsewhere in the
>>  pt_regs for comparison. Currently, the FP is also stored in
>>  regs->regs[29]. The PC is stored in regs->pc. However, regs->pc can
>>  be changed by lower level functions.
>>
>>  Create a new field, pt_regs->orig_pc, and record the return address
>>  PC there. With this, the unwinder can validate the exception frame
>>  and set a flag so that the caller of the unwinder can know when
>>  an exception frame is encountered.
> 
> I don't understand the case you're trying to solve here. When is
> regs->pc changed in a way that's problematic?
> 

For instance, I used a test driver in which the driver calls a function
pointer which is NULL. The low level fault handler sends a signal to the
task. Looks like it changes regs->pc for this. When I dump the stack
from the low level handler, the comparison with regs->pc does not work.
But comparison with regs->orig_pc works.

Re: [RFC PATCH v1 1/1] arm64: Unwinder enhancements for reliable stack trace

2021-02-24 Thread Madhavan T. Venkataraman



On 2/24/21 6:33 AM, Mark Brown wrote:
> On Tue, Feb 23, 2021 at 01:20:49PM -0600, Madhavan T. Venkataraman wrote:
>> On 2/23/21 1:02 PM, Mark Brown wrote:
>>> On Tue, Feb 23, 2021 at 12:12:43PM -0600, madve...@linux.microsoft.com wrote:
> 
>>>> Reliable stack trace function
>>>> =============================
>>>>
>>>> Implement arch_stack_walk_reliable(). This function walks the stack like
>>>> the existing stack trace functions with a couple of additional checks:
> 
>>> Again, this should be at least one separate patch.  How does this ensure
>>> that we don't have any issues with any of the various probe mechanisms?
>>> If there's no need to explicitly check anything that should be called
>>> out in the changelog.
> 
>> I am trying to do this in an incremental fashion. I have to study the probe
>> mechanisms a little bit more before I can come up with a solution. But
>> if you want to see that addressed in this patch set, I could do that.
>> It will take a little bit of time. That is all.
> 
> Handling of the probes stuff seems like it's critical to reliable stack
> walk so we shouldn't claim to have support for reliable stack walk
> without it.  If it was a working implementation we could improve that'd
> be one thing but this would be buggy which is a different thing.
> 

OK. I will address the probe stuff in my resend.

>>>> +	(void) on_accessible_stack(task, stackframe, &info);
> 
>>> Shouldn't we return NULL if we are not on an accessible stack?
> 
>> The prev_fp has already been checked by the unwinder in the previous
>> frame. That is why I don't check the return value. If that is acceptable,
>> I will add a comment.
> 
> TBH if you're adding the comment it seems like you may as well add the
> check, it's not like it's expensive and it means there's no possibility
> that some future change could result in this assumption being broken.
> 

OK. I will add the check.
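
That is, something like:

	if (!on_accessible_stack(task, stackframe, &info))
		return NULL;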

Thanks.

Madhavan


Re: [RFC PATCH v1 1/1] arm64: Unwinder enhancements for reliable stack trace

2021-02-23 Thread Madhavan T. Venkataraman



On 2/23/21 1:02 PM, Mark Brown wrote:
> On Tue, Feb 23, 2021 at 12:12:43PM -0600, madve...@linux.microsoft.com wrote:
>> From: "Madhavan T. Venkataraman" 
>>
>> Unwinder changes
>> ================
> 
> This is making several different changes so should be split into a patch
> series - for example the change to terminate on a specific function
> pointer rather than NULL and the changes to the exception/interupt
> detection should be split.  Please see submitting-patches.rst for some
> discussion about how to split things up.  In general if you've got a
> changelog enumerating a number of different changes in a patch that's a
> warning sign that it might be good split things up.
> 

Will do.

> You should also copy the architecture maintainers (Catalin and Will) on
> any arch/arm64 submissions.
> 

Will do when I resubmit.

>>  Unwinder return value
>>  =====================
>>
>>  Currently, the unwinder returns -EINVAL for stack trace termination
>>  as well as stack trace error. Return -ENOENT for stack trace
>>  termination and -EINVAL for error to disambiguate. This idea has
>>  been borrowed from Mark Brown.
> 
> You could just include my patch for this in your series.
> 

OK.

>> Reliable stack trace function
>> =============================
>>
>> Implement arch_stack_walk_reliable(). This function walks the stack like
>> the existing stack trace functions with a couple of additional checks:
>>
>>  Return address check
>>  ====================
>>
>>  For each frame, check the return address to see if it is a
>>  proper kernel text address. If not, return -EINVAL.
>>
>>  Exception frame check
>>  ---------------------
>>
>>  Check each frame to see if it is an EL1 exception frame. If it is,
>>  return -EINVAL.
> 
> Again, this should be at least one separate patch.  How does this ensure
> that we don't have any issues with any of the various probe mechanisms?
> If there's no need to explicitly check anything that should be called
> out in the changelog.
> 

I am trying to do this in an incremental fashion. I have to study the probe
mechanisms a little bit more before I can come up with a solution. But
if you want to see that addressed in this patch set, I could do that.
It will take a little bit of time. That is all.

> Since all these changes are mixed up this is a fairly superficial
> review of the actual code.
> 

Understood. I will split things up and we can take it from there.

>> +static notrace struct pt_regs *get_frame_regs(struct task_struct *task,
>> +					      struct stackframe *frame)
>> +{
>> +	unsigned long stackframe, regs_start, regs_end;
>> +	struct stack_info info;
>> +
>> +	stackframe = frame->prev_fp;
>> +	if (!stackframe)
>> +		return NULL;
>> +
>> +	(void) on_accessible_stack(task, stackframe, &info);
> 
> Shouldn't we return NULL if we are not on an accessible stack?
> 

The prev_fp has already been checked by the unwinder in the previous
frame. That is why I don't check the return value. If that is acceptable,
I will add a comment.

>> +static notrace int update_frame(struct task_struct *task,
>> +				struct stackframe *frame)
> 
> This function really needs some documentation, the function is just
> called update_frame() which doesn't say what sort of updates it's
> supposed to do and most of the checks aren't explained, not all of them
> are super obvious.
> 

I will add the documentation as well as try think of a better name.

>> +{
>> +	unsigned long lsb = frame->fp & 0xf;
>> +	unsigned long fp = frame->fp & ~lsb;
>> +	unsigned long pc = frame->pc;
>> +	struct pt_regs *regs;
>> +
>> +	frame->exception_frame = false;
>> +
>> +	if (fp == (unsigned long) arm64_last_frame &&
>> +	    pc == (unsigned long) arm64_last_func)
>> +		return -ENOENT;
>> +
>> +	if (!lsb)
>> +		return 0;
>> +	if (lsb != 1)
>> +		return -EINVAL;
>> +
>> +	/*
>> +	 * This looks like an EL1 exception frame.
> 
> For clarity it would be good to spell out the properties of an EL1
> exception frame.  It is not clear to me why we don't reference the frame
> type information the unwinder already records as part of these checks.
> 
> In general, especially for the bits specific to reliable stack trace, I
> think we want to err on the side of verbosity here so that it is crystal
> clear what all the checks are supposed to be doing and it's that much
> easier to tie everything through to the requirements document.

OK. I will improve the documentation.

Madhavan


Re: [RFC PATCH 00/17] objtool: add base support for arm64

2021-01-28 Thread Madhavan T. Venkataraman
Hi,

I sent this suggestion to linux-arm-kernel in response to the Reliable
Stacktrace RFC from Mark Brown and Mark Rutland. I am repeating it here for
two reasons:

- It involves objtool.

- There are many more recipients in this thread that may be interested in
  this topic.

Please let me know if this suggestion is acceptable. If it is not, please let
me know why. Thanks.

Also, I apologize to all of you who have received this more than once.

FP and no-FP functions
======================

I have a suggestion for objtool and the unwinder for ARM64.

IIUC, objtool is responsible for walking all the code paths (except unreachable
and ignored ones) and making sure that every function has proper frame pointer
code (prolog, epilog, etc). If a function is found to not have it, the kernel
build is failed. Is this understanding correct?

If so, can we take a different approach for ARM64?

Instead of failing the kernel build, can we just mark the functions as:

FP  Functions that have proper FP code
no-FP   Functions that don't

Maybe we can add an "FP" flag in the symbol table entry for this.

Then, the unwinder can check the functions it encounters in the stack trace
and inform the caller if it found any no-FP functions (a sketch of this check
follows the list below). The caller of the unwinder can decide what it wants
to do with that information.

- the caller can ignore it

- the caller can print the stack trace with a warning that no-FP functions
  were found

- if the caller is livepatch, the caller can retry until the no-FP functions
  disappear from the stack trace. This way, we can have live patching even
  when some of the functions in the kernel are no-FP.
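
A rough sketch of the unwinder side (symbol_has_fp() is hypothetical and
stands in for whatever FP marking ends up in the symbol table; frame->reliable
is the flag proposed in the reliable stack trace RFC):

static void unwind_check_fp(struct stackframe *frame)
{
	/*
	 * If the function containing this return address does not have
	 * proper frame pointer code, flag the stack trace and let the
	 * caller decide what to do with it.
	 */
	if (!symbol_has_fp(frame->pc))
		frame->reliable = false;
}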

Does this make any sense? Is this acceptable? What are the pitfalls?

If we can do this, the unwinder could detect cases such as:

- If gcc thinks that a function is a leaf function but the function contains
  inline assembly code that calls another function.

- If a call to a function bounces through some intermediate code such as a
  trampoline.

- etc.

For specific no-FP functions, the unwinder might be able to deduce the
original caller. In these cases, the stack trace would still be reliable. For
all the others, the stack trace would be considered unreliable.

Compiler instead of objtool
===========================

If the above suggestion is acceptable, I have another suggestion.

It is a lot of work for every new architecture to add frame pointer verification
support in objtool. Can we get some help from the compiler?

The compiler knows which C functions it generates the FP prolog and epilog
for. It can mark those functions as FP. As for assembly functions, kernel
developers could manually annotate functions that have proper FP code. The
compiler/assembler would mark them as FP. Only a small subset of assembly
functions would even have FP prolog and epilog.

Is this acceptable? What are the pitfalls?

This can be implemented easily for all architectures for which the compiler
generates FP code.

Can this be implemented using a GCC plugin? I know squat about GCC plugins.

Thanks!

Madhavan


Re: Live patching on ARM64

2021-01-26 Thread Madhavan T. Venkataraman
Hi all,

Mark Rutland had sent me some ideas on what work is pending for ARM64 live
patching. I sent some questions to Mark Rutland. I forgot to include everyone
in the email. Sorry about that. I have reproduced my questions and his
responses below. Please chime in with any comments:

Thanks!




On Mon, Jan 25, 2021 at 11:58:47AM -0600, Madhavan T. Venkataraman wrote:
> Some questions below:

I've answered those below.

If possible, I'd prefer to handle future queries on a public list (so
that others can chime in, and so that it gets archived), so if you could
direct further questions to a thread on LAKML, that would be much
appreciated.

> On 1/15/21 6:33 AM, Mark Rutland wrote:
> [...]
>>
>> One general thing that I believe we'll need to do is to rework code to
>> be patch-safe (which implies being noinstr-safe too). For example, we'll
>> need to rework the instruction patching code such that this cannot end
>> up patching itself (or anything that has instrumented it) in an unsafe
>> way.
>>
>
> OK. I understand that. Are there other scenarios that make patching
> unsafe?

I suspect so; these are simply the cases I'm immediately aware of. I
suspect there are other cases that we will need to consider that don't
immediately spring to mind.

> I expect the kernel already handles scenarios such as two CPUs patching
> the same location at the same time or a thread executing at a location that is
> currently being patched.

IIRC that is supposed to be catered for by ftrace (and so I assume for
livepatching too); I'm not certain about kprobes. In addition to
synchronization in the core ftrace code, arm64's ftrace_modify_code()
has a sanity-check with a non-atomic RMW sequence. We might be able to
> make that more robust with a faultable cmpxchg, and some changes around
> ftrace_update_ftrace_func() and ftrace_make_nop() to get rid of the
unvalidated cases.

> Any other scenarios to be considered?

I'm not immediately aware of others, but suspect more cases will become
apparent as work progresses on the bits we already know about.

>> Once we have objtool it should be possible to identify those cases
>> automatically. Currently I'm aware that we'll need to do something in at
>> least the following places:
>>
>
> OK. AFAIK, objtool checks for the following:
>
> - returning from noinstr function with instrumentation enabled
>
> - calling instrumentable functions from noinstr code without:
>
> instrumentation_begin();
> instrumentation_end();
>
> Is that what you mean?

That's what I was thinking of, yes -- this should highlight some places
that will need attention.

> Does objtool check other things as well that is relevant to (un)safe
> patching?

I'm not entirely familiar with objtool, so I'm not exactly sure what it
can do; I expect Josh and Julien can give more detail here.

>> * The insn framework (which is used by some patching code), since the
>>   bulk of it lives in arch/arm64/kernel/insn.c and isn't marked noinstr.
>>   
>>   We can probably shift the bulk of the aarch64_insn_gen_*() and
>>   aarch64_get_*() helpers into a header as __always_inline functions,
>>   which would allow them to be used in noinstr code. As those are
>>   typically invoked with a number of constant arguments that the
>>   compiler can fold, this /might/ work out as an optimization if the
>>   compiler can elide the error paths.
>
> OK. I will take a look at the insn code.
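
As a strawman for what a headerized helper could look like (illustrative names
and structure only; the real generators live in arch/arm64/kernel/insn.c):

	/*
	 * As __always_inline functions in a header, generators like these
	 * become usable from noinstr code, and calls with constant
	 * arguments can be folded by the compiler.
	 */
	static __always_inline u32 insn_gen_nop(void)
	{
		return 0xd503201f;	/* AArch64 NOP */
	}

	static __always_inline u32 insn_gen_b(unsigned long pc,
					      unsigned long target)
	{
		long offset = (long)(target - pc);

		/* B takes a 26-bit immediate, scaled by 4: +/-128MB. */
		if (offset < -(long)SZ_128M || offset >= (long)SZ_128M)
			return 0;	/* error path elided in this sketch */

		return 0x14000000 | (((u32)offset >> 2) & 0x03ffffff);
	}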

IIRC Julien's objtool series had some patches moving
the insn code about, so it'd be worth checking whether that's a help or
a hindrance. If it's possible to split out a set of preparatory patches
that make that ready both for objtool and the kernel, that would make it
easier to review that and queue it early.

>> * The alternatives code, since we call instrumentable and patchable
>>   functions between updating instructions and performing all the
>>   necessary maintenance. There are a number of cases within
>>   __apply_alternatives(), e.g.
>>
>>   - test_bit()
>>   - cpus_have_cap()
>>   - pr_info_once()
>>   - lm_alias()
>>   - alt_cb, if the callback is not marked as noinstr, or if it calls
>> instrumentable code (e.g. from the insn framework).
>>   - clean_dcache_range_nopatch(), as read_sanitised_ftr_reg() and
>> related code can be instrumented.
>>
>>   This might need some underlying rework elsewhere (e.g. in the
>>   cpufeature code, or atomics framework).
>>
>> So on the kernel side, maybe a first step would be to try to headerize
>> the insn generation code as __always_inline, and see whether that looks
>> ok? With that out of the way it'd be a bit easier to rework patching
>> code depending on the insn framework.
>
> OK. I will study this.

Great, thanks!

Mark.




Re: [RFC PATCH 00/17] objtool: add base support for arm64

2021-01-22 Thread Madhavan T. Venkataraman



On 1/22/21 3:43 PM, Ard Biesheuvel wrote:
> On Fri, 22 Jan 2021 at 22:15, Madhavan T. Venkataraman
>  wrote:
>>
>>
>>
>> On 1/22/21 11:43 AM, Mark Brown wrote:
>>> On Thu, Jan 21, 2021 at 12:54:52PM -0600, Josh Poimboeuf wrote:
>>>
>>>> 2) The shadow stack idea sounds promising -- how hard would it be to
>>>>make a prototype reliable unwinder?
>>>
>>> In theory it doesn't look too hard and I can't see a particular reason
>>> not to try doing this - there's going to be edge cases but hopefully for
>>> reliable stack trace they're all in areas where we would be happy to
>>> just decide the stack isn't reliable anyway, things like nesting which
>>> allocates separate shadow stacks for each nested level for example.
>>> I'll take a look.
>>>
>>
>> I am a new comer to this discussion and I am learning. Just have some
>> questions. Pardon me if they are obvious or if they have already been
>> asked and answered.
>>
>> Doesn't Clang already have support for a shadow stack implementation for 
>> ARM64?
>> We could take a look at how Clang does it.
>>
>> Will there not be a significant performance hit? May be, some of it can be
>> mitigated by using a parallel shadow stack rather than a compact one.
>>
>> Are there any longjmp style situations in the kernel where the stack is
>> unwound by several frames? In these cases, the shadow stack must be unwound
>> accordingly.
>>
> 
> Hello Madhavan,
> 
> Let's discuss the details of shadow call stacks on a separate thread,
> instead of further hijacking Julien's series.
> 

OK. Sounds good.


Re: [RFC PATCH 00/17] objtool: add base support for arm64

2021-01-22 Thread Madhavan T. Venkataraman



On 1/22/21 11:43 AM, Mark Brown wrote:
> On Thu, Jan 21, 2021 at 12:54:52PM -0600, Josh Poimboeuf wrote:
> 
>> 2) The shadow stack idea sounds promising -- how hard would it be to
>>make a prototype reliable unwinder?
> 
> In theory it doesn't look too hard and I can't see a particular reason
> not to try doing this - there's going to be edge cases but hopefully for
> reliable stack trace they're all in areas where we would be happy to
> just decide the stack isn't reliable anyway, things like nesting which
> allocates separate shadow stacks for each nested level for example.
> I'll take a look.
> 

I am a new comer to this discussion and I am learning. Just have some
questions. Pardon me if they are obvious or if they have already been
asked and answered.

Doesn't Clang already have support for a shadow stack implementation for ARM64?
We could take a look at how Clang does it.

Will there not be a significant performance hit? May be, some of it can be
mitigated by using a parallel shadow stack rather than a compact one.

Are there any longjmp style situations in the kernel where the stack is
unwound by several frames? In these cases, the shadow stack must be unwound
accordingly.
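
To make the compact-vs-parallel distinction concrete, here is a toy userspace
sketch (sizes and names made up; not how a kernel implementation would look):

	#include <stddef.h>
	#include <stdint.h>

	#define SLOTS	((1 << 16) / sizeof(uintptr_t))

	/* Compact shadow stack: return addresses stored contiguously, with
	   a top-of-stack index that must track every call and return --
	   including longjmp-style unwinds of several frames at once. */
	static uintptr_t compact[SLOTS];
	static size_t compact_top;

	static void compact_push(uintptr_t ret) { compact[compact_top++] = ret; }
	static uintptr_t compact_pop(void) { return compact[--compact_top]; }

	/* Parallel shadow stack: indexed by the frame's offset from the
	   stack base, mirroring the real stack layout. There is no separate
	   bookkeeping, so unwinding by several frames needs no fixup. */
	static uintptr_t parallel[SLOTS];

	static void parallel_store(uintptr_t base, uintptr_t sp, uintptr_t ret)
	{
		parallel[(base - sp) / sizeof(uintptr_t)] = ret;
	}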

Madhavan




Re: Live patching on ARM64

2021-01-19 Thread Madhavan T. Venkataraman


> Sorry for the late reply. The last RFC for arm64 support in objtool is a bit 
> old because it was preferable to split things into smaller series.
> 
> I haven't touched it much lately, so I'm picking it back up and will try to get a git
> branch into shape on a recent mainline (a few things need fixing since the 
> last time I rebased it).
> 
> I'll update you once I have something at least usable/presentable.
> 
> Cheers,
> 

Great. Thanks!

Madhavan


Re: Live patching on ARM64

2021-01-17 Thread Madhavan T. Venkataraman



On 1/15/21 6:33 AM, Mark Rutland wrote:

>> It looks like the most recent work in this area has been from the
>> following folks:
>>
>> Mark Brown and Mark Rutland:
>>  Kernel changes to providing reliable stack traces.
>>
>> Julien Thierry:
>>  Providing ARM64 support in objtool.
>>
>> Torsten Duwe:
>>  Ftrace with regs.
> 
> IIRC that's about right. I'm also trying to make arm64 patch-safe (more
> on that below), and there's a long tail of work there for anyone
> interested.
> 

OK.

>> I apologize if I have missed anyone else who is working on Live Patching
>> for ARM64. Do let me know.
>>
>> Is there any work I can help with? Any areas that need investigation, any
>> code that needs to be written, any work that needs to be reviewed, any
>> testing that needs to be done? You folks are probably super busy and would
>> not mind an extra hand.
> 
> One general thing that I believe we'll need to do is to rework code to
> be patch-safe (which implies being noinstr-safe too). For example, we'll
> need to rework the instruction patching code such that this cannot end
> up patching itself (or anything that has instrumented it) in an unsafe
> way.
> 

OK.

> Once we have objtool it should be possible to identify those cases
> automatically. Currently I'm aware that we'll need to do something in at
> least the following places:
> 
> * The entry code -- I'm currently chipping away at this.
> 

OK.

> * The insn framework (which is used by some patching code), since the
>   bulk of it lives in arch/arm64/kernel/insn.c and isn't marked noinstr.
>   
>   We can probably shift the bulk of the aarch64_insn_gen_*() and
>   aarch64_get_*() helpers into a header as __always_inline functions,
>   which would allow them to be used in noinstr code. As those are
>   typically invoked with a number of constant arguments that the
>   compiler can fold, this /might/ work out as an optimization if the
>   compiler can elide the error paths.
> 
> * The alternatives code, since we call instrumentable and patchable
>   functions between updating instructions and performing all the
>   necessary maintenance. There are a number of cases within
>   __apply_alternatives(), e.g.
> 
>   - test_bit()
>   - cpus_have_cap()
>   - pr_info_once()
>   - lm_alias()
>   - alt_cb, if the callback is not marked as noinstr, or if it calls
> instrumentable code (e.g. from the insn framework).
>   - clean_dcache_range_nopatch(), as read_sanitised_ftr_reg() and
> related code can be instrumented.
> 
>   This might need some underlying rework elsewhere (e.g. in the
>   cpufeature code, or atomics framework).
> 

OK.

> So on the kernel side, maybe a first step would be to try to headerize
> the insn generation code as __always_inline, and see whether that looks
> ok? With that out of the way it'd be a bit easier to rework patching
> code depending on the insn framework.
> 

OK.

I have an understanding of some of the above already. I will come up to
speed on the others. I will email you any questions I might have.

> I'm not sure about the objtool side, so I'll leave that to Julien and co
> to answer.
> 

Thanks for the information.

Madhavan


Re: [PATCH v2 0/4] [RFC] Implement Trampoline File Descriptor

2020-09-27 Thread Madhavan T. Venkataraman
Before I implement the user land solution recommended by reviewers, I just want
an opinion on where the code should reside.

I am thinking glibc. The other choice would be a separate library, say,
libtramp. What do you recommend?

Madhavan


Re: [PATCH v2 0/4] [RFC] Implement Trampoline File Descriptor

2020-09-27 Thread Madhavan T. Venkataraman



On 9/26/20 10:55 AM, Arvind Sankar wrote:
> On Fri, Sep 25, 2020 at 05:44:56PM -0500, Madhavan T. Venkataraman wrote:
>>
>>
>> On 9/24/20 6:43 PM, Arvind Sankar wrote:
>>>
>>> The source PC will generally not be available if the compiler decided to
>>> tail-call optimize the call to the trampoline into a jump.
>>>
>>
>> This is still work in progress. But I am thinking that labels can be used.
>> So, if the code is:
>>
>>  invoke_tramp:
>>  (*tramp)();
>>
>> then, invoke_tramp can be supplied as the calling PC.
>>
>> Similarly, labels can be used in assembly functions as well.
>>
>> Like I said, I have to think about this more.
> 
> What I mean is that the kernel won't have access to the actual source
> PC. If I followed your v1 correctly, it works by making any branch to
> the trampoline code trigger a page fault. At this point, the PC has
> already been updated to the trampoline entry, so the only thing the
> fault handler can know is the return address on the top of the stack,
> which (a) might not be where the branch actually originated, either
> because it was a jump, or you've already been hacked and you got here
> using a ret; (b) is available to userspace anyway.

Like I said, this is work in progress. I have to spend time to figure out
how this would work or if this would work. So, let us brainstorm this
a little bit.

There are two ways to invoke the trampoline:

(1) By just branching to the trampoline address.

(2) Or, by treating the address as a function pointer and calling it.
In the libffi case, it is (2).

If it is (2), it is easier. We can figure out the return address of the
call which would be the location after the call instruction.

If it is (1), it is harder as you point out. So, we can support this
at least for (2). The user can inform trampfd as to the type of
invocation for the trampoline.

For (1), the return address would be that of the call to the function
that contains the branch. If the kernel can get that call instruction
and figure out the function address, then we can do something.

I admit this is a bit hairy at the moment. I have to work it out.
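
To spell out case (2), a small illustration (hypothetical code; it only shows
how a function-pointer call exposes its own call site):

	/*
	 * Case (2): the trampoline address is invoked as a function pointer.
	 * The call instruction records a return address, so when the
	 * non-executable trampoline page faults, the handler can find the
	 * location just past the call -- here, the point after tramp().
	 */
	typedef void (*tramp_fn)(void);

	void invoke(tramp_fn tramp)
	{
		tramp();	/* return address = next instruction */
	}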

> 
>>
>>> What's special about these trampolines anyway? Any indirect function
>>> call could have these same problems -- an attacker could have
>>> overwritten the pointer the same way, whether it's supposed to point to
>>> a normal function or it is the target of this trampoline.
>>>
>>> For making them a bit safer, userspace could just map the page holding
>>> the data pointers/destination address(es) as read-only after
>>> initialization.
>>>
>>
>> You need to look at version 1 of trampfd for how to do "allowed pcs".
>> As an example, libffi defines ABI handlers for every arch-ABI combo.
>> These ABI handler pointers could be placed in an array in .rodata.
>> Then, the array can be written into trampfd for setting allowed PCS.
>> When the target PC is set for a trampoline, the kernel will check
>> it against allowed PCs and reject it if it has been overwritten.
> 
> I'm not asking how it's implemented. I'm asking what's the point? On a
> typical linux system, at least on x86, every library function call is an
> indirect branch. The protection they get is that the dynamic linker can
> map the pointer table read-only after initializing it.
> 

The security subsystem is concerned about dynamic code, not the indirect
branches set up for dynamic linking.


> For the RO mapping, libffi could be mapping both the entire closure
> structure, as well as the structure that describes the arguments and
> return types of the function, read-only once they are initialized.
> 

This has been suggested in some form before. The general problem with
this approach is that when the page is still writable, an attacker
can inject his code potentially. Making the page read-only after the
fact may not help. In specific use cases, it may work. But it is
not OK as a general approach to solving this problem.

> For libffi, there are three indirect branches for every trampoline call
> with your suggested trampoline: one to get to the trampoline, one to
> jump to the handler, and one to call the actual user function. If we are
> particularly concerned about the trampoline to handler branch for some
> reason, we could just replace it with a direct branch: if the kernel was
> generating the code, there's no reason to allow the data pointer or code
> target to be changed after the trampoline was created. It can just
> hard-code them in the generated code and be done with it. Even with
> user-space trampolines, you can use a direct call. All you need is
> libffi-trampoline.so whi

Re: [PATCH v2 0/4] [RFC] Implement Trampoline File Descriptor

2020-09-25 Thread Madhavan T. Venkataraman



On 9/24/20 6:43 PM, Arvind Sankar wrote:
> On Thu, Sep 24, 2020 at 03:23:52PM -0500, Madhavan T. Venkataraman wrote:
>>
>>
>>> Which ISA does not support PIC objects? You mentioned i386 below, but
>>> i386 does support them, it just needs to copy the PC into a GPR first
>>> (see below).
>>
>> Position Independent Code needs PC-relative branches. I was referring
>> to PC-relative data references. Like RIP-relative data references in
>> X64. i386 ISA does not support this.
> 
> I was talking about PC-relative data references too: they are a
> requirement for PIC code that wants to access any global data. They can
> be implemented easily on i386 even though it doesn't have an addressing
> mode that uses the PC.
> 
>> Otherwise, using an ABI quirk or a calling convention side effect to load the
>> PC into a GPR is, IMO, non-standard or non-compliant or non-approved or
>> whatever you want to call it. I would be conservative and not use it. Who 
>> knows
>> what incompatibility there will be with some future software or hardware
>> features?
>>
>> For instance, in the i386 example, we do a call without a matching return.
>> Also, we use a pop to undo the call. Can anyone tell me if this kind of use
>> is an ABI approved one?
> 
> This doesn't have anything to do with the ABI, since what happened here
> isn't visible to any caller or callee. Any machine instruction sequence
> that has the effect of copying the PC into a GPR is acceptable, but this
> is basically the only possible solution on i386. If you don't like the
> call/pop mismatch (though that's supported by the hardware, and is what
> clang likes to use), you can use the slightly different technique used
> in my example, which copies the top of stack into a GPR after a call.
> 
> This is how all i386 PIC code has always worked.
> 

I have responded to this in my reply to Florian. Basically, I accept the opinion
of the reviewers. I will assume that any trick we use to get the current PC into
a GPR will not cause an ABI compliance issue in the future.

>> Standard API for all userland for all architectures
>> ---
>>
>> The next advantage in using the kernel is standardization.
>>
>> If the kernel supplies this, then all applications and libraries can use
>> it for all architectures with one single, simple API. Without this, each
>> application/library has to roll its own solution for every architecture-ABI
>> combo it wants to support.
> 
> But you can get even more standardization out of a userspace library,
> because that can work even on non-linux OS's, as well as versions of
> linux where the new syscall isn't available.
> 

Dealing with old vs new kernels is the same as dealing with old vs new libs.

In any case, what you have suggested above has already been suggested before
and I have accepted everyone's opinion. Please see my response to Florian's 
email.

>>
>> Furthermore, if this work gets accepted, I plan to add a glibc wrapper for
>> the kernel API. The glibc API would look something like this:
>>
>>  Allocate a trampoline
>>  ---------------------
>>
>>  tramp = alloc_tramp();
>>
>>  Set trampoline parameters
>>  -------------------------
>>
>>  init_tramp(tramp, code, data);
>>
>>  Free the trampoline
>>  -------------------
>>
>>  free_tramp(tramp);
>>
>> glibc will allocate and manage the code and data tables, handle kernel API
>> details and manage the trampoline table.
> 
> glibc could do this already if it wants, even without the syscall,
> because this can be done in userspace already.
> 

I am wary of using ABI tricks or calling convention side-effects. However,
since the reviewers feel it is OK, I have accepted that opinion. I have
assumed now that any trick to load the current PC into a GPR can be used
without any risk. I hope that assumption is correct.

>>
>> Secure vs Performant trampoline
>> ---
>>
>> If you recall, in version 1, I presented a trampoline type that is
>> implemented in the kernel. When an application invokes the trampoline,
>> it traps into the kernel and the kernel performs the work of the trampoline.
>>
>> The disadvantage is that a trip to the kernel is needed. That can be
>> expensive.
>>
>> The advantage is that the kernel can add security checks before doing the
>> work. Mainly, I am looking at checks that might prevent the trampoline
>> from being used in an ROP/BOP chain. Some half-baked ideas:
>>
>>  - Check 

Re: [PATCH v2 0/4] [RFC] Implement Trampoline File Descriptor

2020-09-25 Thread Madhavan T. Venkataraman



On 9/24/20 3:52 PM, Florian Weimer wrote:
> * Madhavan T. Venkataraman:
> 
>> Otherwise, using an ABI quirk or a calling convention side effect to
>> load the PC into a GPR is, IMO, non-standard or non-compliant or
>> non-approved or whatever you want to call it. I would be
>> conservative and not use it. Who knows what incompatibility there
>> will be with some future software or hardware features?
> 
> AArch64 PAC makes a backwards-incompatible change that touches this
> area, but we'll see if they can actually get away with it.
> 
> In general, these things are baked into the ABI, even if they are not
> spelled out explicitly in the psABI supplement.
> 
>> For instance, in the i386 example, we do a call without a matching return.
>> Also, we use a pop to undo the call. Can anyone tell me if this kind of use
>> is an ABI approved one?
> 
> Yes, for i386, this is completely valid from an ABI point of view.
> It's equally possible to use a regular function call and just read the
> return address that has been pushed to the stack.  Then there's no
> stack mismatch at all.  Return stack predictors (including the one
> used by SHSTK) also recognize the CALL 0 construct, so that's fine as
> well.  The i386 psABI does not use function descriptors, and either
> approach (out-of-line thunk or CALL 0) is in common use to materialize
> the program counter in a register and construct the GOT pointer.
> 
>> If the kernel supplies this, then all applications and libraries can use
>> it for all architectures with one single, simple API. Without this, each
>> application/library has to roll its own solution for every architecture-ABI
>> combo it wants to support.
> 
> Is there any other user for these type-generic trampolines?
> Everything else I've seen generates machine code specific to the
> function being called.  libffi is quite the outlier in my experience
> because the trampoline calls a generic data-driven
> marshaller/unmarshaller.  The other trampoline generators put this
> marshalling code directly into the generated trampoline.
> 
> I'm still not convinced that this can't be done directly in libffi,
> without kernel help.  Hiding the architecture-specific code in the
> kernel doesn't reduce overall system complexity.
> 

See below. I have accepted the community's recommendation to implement it
in user land. However, this is not just for libffi. It is for all dynamic
code. libffi is just the first use case I am addressing with this.

>> As an example, in libffi:
>>
>>  ffi_closure_alloc() would call alloc_tramp()
>>
>>  ffi_prep_closure_loc() would call init_tramp()
>>
>>  ffi_closure_free() would call free_tramp()
>>
>> That is it! It works on all the architectures supported in the kernel for
>> trampfd.
> 
> ffi_prep_closure_loc would still need to check whether the trampoline
> has been allocated by alloc_tramp because some applications supply
> their own (executable and writable) mapping.  ffi_closure_alloc would
> need to support different sizes (not matching the trampoline).  It's
> also unclear to me to what extent software out there writes to the
> trampoline data directly, bypassing the libffi API (the structs are
> not opaque, after all).  And all the existing libffi memory management
> code (including the embedded dlmalloc copy) would be needed to support
> kernels without trampfd for years to come.
> 

In the libffi patch I have included, I have handled this. The closure
structure contains a tramp field:

  char tramp[FFI_TRAMPOLINE_SIZE];

If trampfd is not used, this array will contain the actual
trampoline code. If trampfd is used, then we don't need the array for
storing any trampoline code. That space can be used for storing trampfd
related information.

So, there is no change to the closure structure.

Also, the code can tell if the closure has been allocated from dlmalloc()
called from ffi_closure_alloc() or has been allocated by the caller
directly without calling ffi_closure_alloc(). I have written this function:

int ffi_closure_alloc_called(void *closure)
{
  /*
   * segment_holding() asks the embedded dlmalloc instance (gm) whether
   * the pointer lies inside one of its segments. If it does, the
   * closure must have come from ffi_closure_alloc().
   */
  msegmentptr seg = segment_holding (gm, closure);
  return (seg != NULL);
}

Using this function, I can tell how the closure has been allocated. I use
trampfd only for closures that have been allocated using ffi_closure_alloc().
So, I believe I have handled all the cases. If I have missed anything,
let me know. I will address it.
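
(For readers unfamiliar with the interfaces being discussed, here is a minimal
closure example using the stock libffi API; a trampfd-based implementation
would sit underneath these same calls.)

	#include <ffi.h>

	/* Generic handler, reached through the closure's trampoline. */
	static void handler(ffi_cif *cif, void *ret, void **args, void *user_data)
	{
		(void)cif;	/* unused in this sketch */
		*(ffi_arg *)ret = *(int *)args[0] + *(int *)user_data;
	}

	int main(void)
	{
		ffi_cif cif;
		ffi_type *argtypes[] = { &ffi_type_sint };
		void *code;
		int bias = 40;

		ffi_closure *closure = ffi_closure_alloc(sizeof(ffi_closure), &code);

		ffi_prep_cif(&cif, FFI_DEFAULT_ABI, 1, &ffi_type_sint, argtypes);
		ffi_prep_closure_loc(closure, &cif, handler, &bias, code);

		int (*fn)(int) = (int (*)(int))code;
		int r = fn(2);		/* 42, via the trampoline */

		ffi_closure_free(closure);
		return r == 42 ? 0 : 1;
	}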

> I very much agree that we have a gap in libffi when it comes to
> JIT-less operation.  But I'm not convinced that kernel support is
> needed to close it, or that it is even the right design.
> 

I have taken into account most of the comments received so far and I have
come up with a proposal:

I would like to do this in two

Re: [PATCH v2 0/4] [RFC] Implement Trampoline File Descriptor

2020-09-24 Thread Madhavan T. Venkataraman



On 9/23/20 2:51 PM, Arvind Sankar wrote:
> On Wed, Sep 23, 2020 at 02:17:30PM -0500, Madhavan T. Venkataraman wrote:
>>
>>
>> On 9/23/20 4:11 AM, Arvind Sankar wrote:
>>> For libffi, I think the proposed standard trampoline won't actually
>>> work, because not all ABIs have two scratch registers available to use
>>> as code_reg and data_reg. Eg i386 fastcall only has one, and register
>>> has zero scratch registers. I believe 32-bit ARM only has one scratch
>>> register as well.
>>
>> The trampoline is invoked as a function call in the libffi case. Any
>> caller saved register can be used as code_reg, can it not? And the
>> scratch register is needed only to jump to the code. After that, it
>> can be reused for any other purpose.
>>
>> However, for ARM, you are quite correct. There is only one scratch
>> register. This means that I have to provide two types of trampolines:
>>
>>  - If an architecture has enough scratch registers, use the currently
>>defined trampoline.
>>
>>  - If the architecture has only one scratch register, but has PC-relative
>>data references, then embed the code address at the bottom of the
>>trampoline and access it using PC-relative addressing.
>>
>> Thanks for pointing this out.
>>
>> Madhavan
> 
> libffi is trying to provide closures with non-standard ABIs as well: the
> actual user function is standard ABI, but the closure can be called with
> a different ABI. If the closure was created with FFI_REGISTER abi, there
> are no registers available for the trampoline to use: EAX, EDX and ECX
> contain the first three arguments of the function, and every other
> register is callee-save.
> 
> I provided a sample of the kind of trampoline that would be needed in
> this case -- it's position-independent and doesn't clobber any registers
> at all, and you get 255 trampolines per page. If I take another 16-byte
> slot out of the page for the end trampoline that does the actual work,
> I'm sure I could even come up with one that can just call a normal C
> function, only the return might need special handling depending on the
> return type.
> 
> And again, do you actually have any example of an architecture that
> cannot run position-independent code? PC-relative addressing is an
> implementation detail: the fact that it's available for x86_64 but not
> for i386 just makes position-independent code more cumbersome on i386,
> but it doesn't make it impossible. For the tiny trampolines here, it
> makes almost no difference.
> 

I have tried to answer all of your previous comments here. Let me know
if I missed anything:


> Which ISA does not support PIC objects? You mentioned i386 below, but
> i386 does support them, it just needs to copy the PC into a GPR first
> (see below).

Position Independent Code needs PC-relative branches. I was referring
to PC-relative data references. Like RIP-relative data references in
X64. i386 ISA does not support this.
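
(To make the distinction concrete in C terms -- the generated code described
in the comments is typical compiler output, not something i386 forbids at the
source level:)

	/* A position-independent read of global data. On x86_64 the
	   compiler emits a single RIP-relative load for this; on i386 it
	   must first materialize the PC in a register (the call/pop idiom)
	   and then address the data relative to that. */
	static int value;

	int get_value(void)
	{
		return value;	/* x86_64: movl value(%rip), %eax */
	}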

> i386 just needs a tiny bit of code to copy the PC into a GPR first, i.e.
> the trampoline would be:
> 
> 	call	1f
> 1:	pop	%data_reg
> 	movl	(code_table + X - 1b)(%data_reg), %code_reg
> 	movl	(data_table + X - 1b)(%data_reg), %data_reg
> 	jmp	*(%code_reg)
> 
> I do not understand the point about passing data at runtime. This
> trampoline is to achieve exactly that, no?

PC-relative data referencing
----------------------------

I agree that the current PC value can be loaded in a GPR using the trick
of call, pop on i386.

Perhaps, on other architectures, we can do similar things. For instance,
in architectures that load the return address in a designated register
instead of pushing it on the stack, the trampoline could call a leaf function
that moves the value of that register into data_reg so that at the location
after the call instruction, the current PC is already loaded in data_reg.
SPARC is one example I can think of.

My take is - if the ISA supports PC-relative data referencing explicitly (like
X64 or ARM64), then we can use it. Or, if the ABI specification documents an
approved way to load the PC into a GPR, we can use it.

Otherwise, using an ABI quirk or a calling convention side effect to load the
PC into a GPR is, IMO, non-standard or non-compliant or non-approved or
whatever you want to call it. I would be conservative and not use it. Who knows
what incompatibility there will be with some future software or hardware
features?

For instance, in the i386 example, we do a call without a matching return.
Also, we use a pop to undo the call. Can anyone tell me if this kind of use
is an ABI approved one?

Kernel supplied trampoline
--------------------------

One advanta

Re: [PATCH v2 0/4] [RFC] Implement Trampoline File Descriptor

2020-09-23 Thread Madhavan T. Venkataraman



On 9/23/20 9:39 AM, Florian Weimer wrote:
> * Solar Designer:
> 
>> While I share my opinion here, I don't mean that to block Madhavan's
>> work.  I'd rather defer to people more knowledgeable in current userland
>> and ABI issues/limitations and plans on dealing with those, especially
>> to Florian Weimer.  I haven't seen Florian say anything specific for or
>> against Madhavan's proposal, and I'd like to.  (Have I missed that?)
> 
> There was a previous discussion, where I provided feedback (not much
> different from the feedback here, given that the mechanism is mostly the
> same).
> 
> I think it's unnecessary for the libffi use case.  Precompiled code can
> be loaded from disk because the libffi trampolines are so regular.  On
> most architectures, it's not even the code that's patched, but some of
> the data driving it, which happens to be located on the same page due to
> a libffi quirk.
> 
> The libffi use case is a bit strange anyway: its trampolines are
> type-generic, and the per-call adjustment is data-driven.  This means
> that once you have libffi in the process, you have a generic
> data-to-function-call mechanism available that can be abused (it's even
> fully CET compatible in recent versions).  And then you need to look at
> the processes that use libffi.  A lot of them contain bytecode
> interpreters, and those enable data-driven arbitrary code execution as
> well.  I know that there are efforts under way to harden Python, but
> it's going to be tough to get to the point where things are still
> difficult for an attacker once they have the ability to make mprotect
> calls.
> 
> It was pointed out to me that libffi is doing things wrong, and the
> trampolines should not be type-generic, but generated so that they match
> the function being called.  That is, the marshal/unmarshal code would be
> open-coded in the trampoline, rather than using some generic mechanism
> plus run-time dispatch on data tables describing the function type.
> That is a very different design (and typically used by compilers (JIT or
> not JIT) to implement native calls).  Mapping some code page with a
> repeating pattern would no longer work to defeat anti-JIT measures
> because it's closer to real JIT.  I don't know if kernel support could
> make sense in this context, but it would be a completely different
> patch.
> 
> Thanks,
> Florian
> 
Hi Florian,

I am making myself familiar with anti-JIT measures before I can respond
to this comment. Bear with me. I will also respond to the above
libffi comment.

Madhavan


Re: [PATCH v2 0/4] [RFC] Implement Trampoline File Descriptor

2020-09-23 Thread Madhavan T. Venkataraman



On 9/23/20 2:51 PM, Arvind Sankar wrote:
> On Wed, Sep 23, 2020 at 02:17:30PM -0500, Madhavan T. Venkataraman wrote:
>>
>>
>> On 9/23/20 4:11 AM, Arvind Sankar wrote:
>>> For libffi, I think the proposed standard trampoline won't actually
>>> work, because not all ABIs have two scratch registers available to use
>>> as code_reg and data_reg. Eg i386 fastcall only has one, and register
>>> has zero scratch registers. I believe 32-bit ARM only has one scratch
>>> register as well.
>>
>> The trampoline is invoked as a function call in the libffi case. Any
>> caller saved register can be used as code_reg, can it not? And the
>> scratch register is needed only to jump to the code. After that, it
>> can be reused for any other purpose.
>>
>> However, for ARM, you are quite correct. There is only one scratch
>> register. This means that I have to provide two types of trampolines:
>>
>>  - If an architecture has enough scratch registers, use the currently
>>defined trampoline.
>>
>>  - If the architecture has only one scratch register, but has PC-relative
>>data references, then embed the code address at the bottom of the
>>trampoline and access it using PC-relative addressing.
>>
>> Thanks for pointing this out.
>>
>> Madhavan
> 
> libffi is trying to provide closures with non-standard ABIs as well: the
> actual user function is standard ABI, but the closure can be called with
> a different ABI. If the closure was created with FFI_REGISTER abi, there
> are no registers available for the trampoline to use: EAX, EDX and ECX
> contain the first three arguments of the function, and every other
> register is callee-save.
> 
> I provided a sample of the kind of trampoline that would be needed in
> this case -- it's position-independent and doesn't clobber any registers
> at all, and you get 255 trampolines per page. If I take another 16-byte
> slot out of the page for the end trampoline that does the actual work,
> I'm sure I could even come up with one that can just call a normal C
> function, only the return might need special handling depending on the
> return type.
> 
> And again, do you actually have any example of an architecture that
> cannot run position-independent code? PC-relative addressing is an
> implementation detail: the fact that it's available for x86_64 but not
> for i386 just makes position-independent code more cumbersome on i386,
> but it doesn't make it impossible. For the tiny trampolines here, it
> makes almost no difference.
> 

Hi Arvind,

I am preparing a response for all of your comments. I will send it out
tomorrow. Sorry for the delay.

Madhavan


Re: [PATCH v2 0/4] [RFC] Implement Trampoline File Descriptor

2020-09-23 Thread Madhavan T. Venkataraman



On 9/23/20 3:51 PM, Pavel Machek wrote:
> Hi!
> 
>>>> Scenario 2
>>>> ----------
>>>>
>>>> We know what code we need in advance. User trampolines are a good example of
>>>> this. It is possible to define such code statically with some help from the
>>>> kernel.
>>>>
>>>> This RFC addresses (2). (1) needs a general purpose trusted code generator
>>>> and is out of scope for this RFC.
>>>
>>> This is slightly less crazy talk than introduction talking about holes
>>> in W^X. But it is very, very far from normal Unix system, where you
>>> have selection of interpretters to run your malware on (sh, python,
>>> awk, emacs, ...) and often you can even compile malware from sources. 
>>>
>>> And as you noted, we don't have "a general purpose trusted code
>>> generator" for our systems.
>>>
>>> I believe you should simply delete confusing "introduction" and
>>> provide details of super-secure system where your patches would be
>>> useful, instead.
>>
>> This RFC talks about converting dynamic code (which cannot be authenticated)
>> to static code that can be authenticated using signature verification. That
>> is the scope of this RFC.
>>
>> If I have not been clear before, by dynamic code, I mean machine code that is
>> dynamic in nature. Scripts are beyond the scope of this RFC.
>>
>> Also, malware compiled from sources is not dynamic code. That is orthogonal
>> to this RFC. If such malware has a valid signature that the kernel permits
>> its execution, we have a systemic problem.
>>
>> I am not saying that script authentication or compiled malware are not
>> problems. I am just saying that this RFC is not trying to solve all of the
>> security problems. It is trying to define one way to convert dynamic code to
>> static code to address one class of problems.
> 
> Well, you don't have to solve all problems at once.
> 
> But solutions have to exist, and AFAIK in this case they don't. You
> are armoring doors, but ignoring open windows.
> 

I am afraid I don't agree that the other open security issues must be
addressed for this RFC to make sense. If you think that any of those
issues actually has a bad interaction/intersection with this RFC,
let me know how and I will address it.

> Or very probably you are thinking about something different than
> normal desktop distros (Debian 10). Because on my systems, I have
> python, gdb and gcc...
> 
> It would be nice to specify what other pieces need to be present for
> this to make sense -- because it makes no sense on Debian 10.
> 

Since this RFC pertains to converting dynamic machine code to static
code, it has nothing to do with the other items you have mentioned.
I am not disagreeing that the other items need to be addressed. But
they are orthogonal.

Madhavan


Re: [PATCH v2 0/4] [RFC] Implement Trampoline File Descriptor

2020-09-23 Thread Madhavan T. Venkataraman



On 9/23/20 4:14 AM, Solar Designer wrote:
>>> The W^X implementation today is not complete. There exist many user level
>>> tricks that can be used to load and execute dynamic code. E.g.,
>>>
>>> - Load the code into a file and map the file with R-X.
>>>
>>> - Load the code in an RW- page. Change the permissions to R--. Then,
>>>   change the permissions to R-X.
>>>
>>> - Load the code in an RW- page. Remap the page with R-X to get a separate
>>>   mapping to the same underlying physical page.
>>>
>>> IMO, these are all security holes as an attacker can exploit them to inject
>>> his own code.
>> IMO, you are smoking crack^H^H very seriously misunderstanding what
>> W^X is supposed to protect from.
>>
>> W^X is not supposed to protect you from attackers that can already do
>> system calls. So loading code into a file then mapping the file as R-X
>> is in no way security hole in W^X.
>>
>> If you want to provide protection from attackers that _can_ do system
>> calls, fine, but please don't talk about W^X and please specify what
>> types of attacks you want to prevent and why that's good thing.
> On one hand, Pavel is absolutely right.  It is ridiculous to say that
> "these are all security holes as an attacker can exploit them to inject
> his own code."
> 

Why? Isn't it possible that an attacker can exploit some vulnerability such
as buffer overflow and overwrite the buffer that contains the dynamic code?


> On the other hand, "what W^X is supposed to protect from" depends on how
> the term W^X is defined (historically, by PaX and OpenBSD).  It may be
> that W^X is partially not a feature to defeat attacks per se, but also a
> policy enforcement feature preventing use of dangerous techniques (JIT).
> 
> Such policy might or might not make sense.  It might make sense for ease
> of reasoning, e.g. "I've flipped this setting, and now I'm certain the
> system doesn't have JIT within a process (can still have it through
> dynamically creating and invoking an entire new program), so there are
> no opportunities for an attacker to inject code nor generate previously
> non-existing ROP gadgets into an executable mapping within a process."
> 
> I do find it questionable whether such policy and such reasoning make
> sense beyond academia.
> 
> Then, there might be even more ways in which W^X is not perfect enough
> to enable such reasoning.  What about using ptrace(2) to inject code?
> Should enabling W^X also disable ability to debug programs by non-root?
> We already have Yama ptrace_scope, which can achieve that at the highest
> setting, although that's rather inconvenient and is probably unexpected
> by most to be a requirement for having (ridiculously?) full W^X allowing
> for the academic reasoning.
> 

I am not suggesting that W^X be fixed. That is up to the maintainers of that
code. I am saying that if the security subsystem is enhanced in the future with
policies and settings that prevent the user tricks I mentioned, then it becomes
impossible to execute dynamic code except by making security exceptions on a
case by case basis.

As an alternative to making security exceptions, one could convert dynamic code
to static code which can then be authenticated.

> Personally, I am for policies that make more practical sense.  For
> example, years ago I advocated here on kernel-hardening that we should
> have a mode where ELF flags enabling/disabling executable stack are
> ignored, and non-executable stack is always enforced.  This should also
> be extended to default (at program startup) permissions on more than
> just stack (but also on .bss, typical libcs' heap allocations, etc.)
> However, I am not convinced there's enough value in extending the policy
> to restricting explicit uses of mprotect(2).
> 
> Yes, PaX did that, and its emutramp.txt said "runtime code generation is
> by its nature incompatible with PaX's PAGEEXEC/SEGMEXEC and MPROTECT
> features, therefore the real solution is not in emulation but by
> designing a kernel API for runtime code generation and modifying
> userland to make use of it."  However, not being convinced in the
> MPROTECT feature having enough practical value, I am also not convinced
> "a kernel API for runtime code generation and modifying userland to make
> use of it" is the way to go.
> 

In a separate email, I will try to answer this and provide justification
for why it is better to do it in the kernel.

> Having static instead of dynamically-generated trampolines in userland
> code where possible (and making other userland/ABI changes to make that
> possible in more/all cases) is an obvious improvement, and IMO should be
> a priority over the above.
>

> While I share my opinion here, I don't mean that to block Madhavan's
> work.  I'd rather defer to people more knowledgeable in current userland
> and ABI issues/limitations and plans on dealing with those, especially
> to Florian Weimer.  I haven't seen Florian say anything specific for or
> against Madhavan's 

Re: [PATCH v2 0/4] [RFC] Implement Trampoline File Descriptor

2020-09-23 Thread Madhavan T. Venkataraman



On 9/23/20 4:11 AM, Arvind Sankar wrote:
> For libffi, I think the proposed standard trampoline won't actually
> work, because not all ABIs have two scratch registers available to use
> as code_reg and data_reg. Eg i386 fastcall only has one, and register
> has zero scratch registers. I believe 32-bit ARM only has one scratch
> register as well.

The trampoline is invoked as a function call in the libffi case. Any
caller saved register can be used as code_reg, can it not? And the
scratch register is needed only to jump to the code. After that, it
can be reused for any other purpose.

However, for ARM, you are quite correct. There is only one scratch
register. This means that I have to provide two types of trampolines:

- If an architecture has enough scratch registers, use the currently
  defined trampoline.

- If the architecture has only one scratch register, but has PC-relative
  data references, then embed the code address at the bottom of the
  trampoline and access it using PC-relative addressing.
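
A possible shape for the second variant (field names and sizes are assumed,
purely as a sketch): the code and data addresses live at fixed offsets from
the trampoline itself, so one PC-relative load reaches them without consuming
a second register.

	#include <stdint.h>

	/*
	 * Hypothetical one-scratch-register trampoline slot. The
	 * instructions load 'code' with a PC-relative (literal) load into
	 * the lone scratch register and branch to it; 'code' and 'data'
	 * are patched per trampoline at initialization time.
	 */
	struct tramp_slot {
		uint32_t insn[2];	/* load literal + indirect branch */
		void *code;		/* target, reached PC-relative */
		void *data;		/* closure data, reached the same way */
	};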

Thanks for pointing this out.

Madhavan


Re: [PATCH v2 0/4] [RFC] Implement Trampoline File Descriptor

2020-09-23 Thread Madhavan T. Venkataraman



On 9/23/20 3:42 AM, Pavel Machek wrote:
> Hi!
> 
>> Solution proposed in this RFC
>> =============================
>>
>> From this RFC's perspective, there are two scenarios for dynamic code:
>>
>> Scenario 1
>> ----------
>>
>> We know what code we need only at runtime. For instance, JIT code generated
>> for frequently executed Java methods. Only at runtime do we know what
>> methods need to be JIT compiled. Such code cannot be statically defined. It
>> has to be generated at runtime.
>>
>> Scenario 2
>> ----------
>>
>> We know what code we need in advance. User trampolines are a good example of
>> this. It is possible to define such code statically with some help from the
>> kernel.
>>
>> This RFC addresses (2). (1) needs a general purpose trusted code generator
>> and is out of scope for this RFC.
> 
> This is slightly less crazy talk than introduction talking about holes
> in W^X. But it is very, very far from normal Unix system, where you
> have selection of interpretters to run your malware on (sh, python,
> awk, emacs, ...) and often you can even compile malware from sources. 
> 
> And as you noted, we don't have "a general purpose trusted code
> generator" for our systems.
> 
> I believe you should simply delete confusing "introduction" and
> provide details of super-secure system where your patches would be
> useful, instead.
> 
> Best regards,
>   Pavel
> 

This RFC talks about converting dynamic code (which cannot be authenticated)
to static code that can be authenticated using signature verification. That
is the scope of this RFC.

If I have not been clear before, by dynamic code, I mean machine code that is
dynamic in nature. Scripts are beyond the scope of this RFC.

Also, malware compiled from sources is not dynamic code. That is orthogonal
to this RFC. If such malware has a valid signature that the kernel permits its
execution, we have a systemic problem.

I am not saying that script authentication or compiled malware are not problems.
I am just saying that this RFC is not trying to solve all of the security
problems. It is trying to define one way to convert dynamic code to static code
to address one class of problems.

Madhavan


Re: [PATCH v2 0/4] [RFC] Implement Trampoline File Descriptor

2020-09-23 Thread Madhavan T. Venkataraman
...
>> The W^X implementation today is not complete. There exist many user level
>> tricks that can be used to load and execute dynamic code. E.g.,
>>
>> - Load the code into a file and map the file with R-X.
>>
>> - Load the code in an RW- page. Change the permissions to R--. Then,
>>   change the permissions to R-X.
>>
>> - Load the code in an RW- page. Remap the page with R-X to get a separate
>>   mapping to the same underlying physical page.
>>
>> IMO, these are all security holes as an attacker can exploit them to inject
>> his own code.
> 
> IMO, you are smoking crack^H^H very seriously misunderstanding what
> W^X is supposed to protect from.
> 
> W^X is not supposed to protect you from attackers that can already do
> system calls. So loading code into a file then mapping the file as R-X
> is in no way security hole in W^X.
> 
> If you want to provide protection from attackers that _can_ do system
> calls, fine, but please don't talk about W^X and please specify what
> types of attacks you want to prevent and why that's good thing.
> 


There are two things here - the idea behind W^X and the current realization
of that idea in actual implementation. The idea behind W^X, as I understand,
is to prevent a user from loading arbitrary code into a page and getting it
to execute. If the user code contains a vulnerability, an attacker can 
exploit it to potentially inject his own code and get it to execute. This
cannot be denied.

From that perspective, all of the above tricks I have mentioned are tricks
that user code can use to load arbitrary code into a page and get it to
execute.
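
(As a concrete example, trick (b) above is just the following sequence of
standard POSIX calls, shown to make clear what a stricter policy would have
to catch:)

	#include <string.h>
	#include <sys/mman.h>

	/* Stage 'len' bytes of machine code and make them executable
	   without ever holding W and X at the same time -- yet arbitrary
	   bytes still end up in an executable page. */
	void *stage_code(const unsigned char *code, size_t len)
	{
		void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
			       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

		if (p == MAP_FAILED)
			return NULL;

		memcpy(p, code, len);

		if (mprotect(p, len, PROT_READ) ||		/* RW- -> R-- */
		    mprotect(p, len, PROT_READ | PROT_EXEC))	/* R-- -> R-X */
			return NULL;

		return p;
	}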

Now, I don't want the discussion to be stuck in a mere name. If what I am
suggesting needs a name other than "W^X" in the opinion of the reviewers,
that is fine with me. But I don't believe there is any disagreement that
the above user tricks are security holes.

Madhavan


Re: [PATCH v2 0/4] [RFC] Implement Trampoline File Descriptor

2020-09-22 Thread Madhavan T. Venkataraman
I just resent the trampfd v2 RFC. I forgot to CC the reviewers who provided 
comments before.
So sorry.

Madhavan

On 9/22/20 4:53 PM, madve...@linux.microsoft.com wrote:
> From: "Madhavan T. Venkataraman" 
> 
> Introduction
> ============
> 
> Dynamic code is used in many different user applications. Dynamic code is
> often generated at runtime. Dynamic code can also just be a pre-defined
> sequence of machine instructions in a data buffer. Examples of dynamic
> code are trampolines, JIT code, DBT code, etc.
> 
> Dynamic code is placed either in a data page or in a stack page. In order
> to execute dynamic code, the page it resides in needs to be mapped with
> execute permissions. Writable pages with execute permissions provide an
> attack surface for hackers. Attackers can use this to inject malicious
> code, modify existing code or do other harm.
> 
> To mitigate this, LSMs such as SELinux implement W^X. That is, they may not
> allow pages to have both write and execute permissions. This prevents
> dynamic code from executing and blocks applications that use it. To allow
> genuine applications to run, exceptions have to be made for them (by setting
> execmem, etc) which opens the door to security issues.
> 
> The W^X implementation today is not complete. There exist many user level
> tricks that can be used to load and execute dynamic code. E.g.,
> 
> - Load the code into a file and map the file with R-X.
> 
> - Load the code in an RW- page. Change the permissions to R--. Then,
>   change the permissions to R-X.
> 
> - Load the code in an RW- page. Remap the page with R-X to get a separate
>   mapping to the same underlying physical page.
> 
> IMO, these are all security holes as an attacker can exploit them to inject
> his own code.
> 
> In the future, these holes will definitely be closed. For instance, LSMs
> (such as the IPE proposal [1]) may only allow code in properly signed object
> files to be mapped with execute permissions. This will do two things:
> 
>   - user level tricks using anonymous pages will fail as anonymous
> pages have no file identity
> 
>   - loading the code in a temporary file and mapping it with R-X
> will fail as the temporary file would not have a signature
> 
> We need a way to execute such code without making security exceptions.
> Trampolines are a good example of dynamic code. A couple of examples
> of trampolines are given below. My first use case for this RFC is
> libffi.
> 
> Examples of trampolines
> =======================
> 
> libffi (A Portable Foreign Function Interface Library):
> 
> libffi allows a user to define functions with an arbitrary list of
> arguments and return value through a feature called "Closures".
> Closures use trampolines to jump to ABI handlers that handle calling
> conventions and call a target function. libffi is used by a lot
> of different applications. To name a few:
> 
>   - Python
>   - Java
>   - Javascript
>   - Ruby FFI
>   - Lisp
>   - Objective C
> 
> GCC nested functions:
> 
> GCC has traditionally used trampolines for implementing nested
> functions. The trampoline is placed on the user stack. So, the stack
> needs to be executable.
> 
> Currently available solution
> ============================
> 
> One solution that has been proposed to allow trampolines to be executed
> without making security exceptions is Trampoline Emulation. See:
> 
> https://pax.grsecurity.net/docs/emutramp.txt
> 
> In this solution, the kernel recognizes certain sequences of instructions
> as "well-known" trampolines. When such a trampoline is executed, a page
> fault happens because the trampoline page does not have execute permission.
> The kernel recognizes the trampoline and emulates it. Basically, the
> kernel does the work of the trampoline on behalf of the application.
> 
> Currently, the emulated trampolines are the ones used in libffi and GCC
> nested functions. To my knowledge, only X86 is supported at this time.
> 
> As noted in emutramp.txt, this is not a generic solution. For every new
> trampoline that needs to be supported, new instruction sequences need to
> be recognized by the kernel and emulated. And this has to be done for
> every architecture that needs to be supported.
> 
> emutramp.txt notes the following:
> 
> "... the real solution is not in emulation but by designing a kernel API
> for runtime code generation and modifying userland to make use of it."
> 
> Solution proposed in this RFC
> =============================
> 
> From this RFC's perspective, there are two scenarios for dynamic code:
> 
> Scenario 1
> ----------

Re: [PATCH v2 0/4] [RFC] Implement Trampoline File Descriptor

2020-09-17 Thread Madhavan T. Venkataraman



On 9/17/20 10:36 AM, Madhavan T. Venkataraman wrote:
>>> libffi
>>> ======
>>>
>>> I have implemented my solution for libffi and provided the changes for
>>> X86 and ARM, 32-bit and 64-bit. Here is the reference patch:
>>>
>>> http://linux.microsoft.com/~madvenka/libffi/libffi.v2.txt
>> The URL does not appear to work, I get a 403 error.
> I apologize for that. That site is supposed to be accessible publicly.
> I will contact the administrator and get this resolved.
> 
> Sorry for the annoyance.
> 

Could you try the link again and confirm that you can access it?
Again, sorry for the trouble.

Madhavan


Re: [PATCH v2 0/4] [RFC] Implement Trampoline File Descriptor

2020-09-17 Thread Madhavan T. Venkataraman



On 9/16/20 8:04 PM, Florian Weimer wrote:
> * madvenka:
> 
>> Examples of trampolines
>> =======================
>>
>> libffi (A Portable Foreign Function Interface Library):
>>
>> libffi allows a user to define functions with an arbitrary list of
>> arguments and return value through a feature called "Closures".
>> Closures use trampolines to jump to ABI handlers that handle calling
>> conventions and call a target function. libffi is used by a lot
>> of different applications. To name a few:
>>
>>  - Python
>>  - Java
>>  - Javascript
>>  - Ruby FFI
>>  - Lisp
>>  - Objective C
> 
> libffi does not actually need this.  It currently collocates
> trampolines and the data they need on the same page, but that's
> actually unnecessary. It's possible to avoid doing this just by
> changing libffi, without any kernel changes.
> 
> I think this has already been done for the iOS port.
> 

The trampoline table that has been implemented for the iOS port (MACH)
is based on PC-relative data referencing. That is, the code and data
are placed in adjacent pages so that the code can access the data using
an address relative to the current PC.

This is an ISA feature that is not supported on all architectures.

Now, if it is a performance feature, we can include some architectures
and exclude others. But this is a security feature. IMO, we cannot
exclude any architecture even if it is a legacy one as long as Linux
is running on the architecture. So, we need a solution that does
not assume any specific ISA feature.
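
(For reference, the adjacent-page scheme reduces to fixed-offset arithmetic
like the following sketch, with a one-page code/data separation assumed:)

	#include <stdint.h>

	#define TRAMP_PAGE_SIZE	4096UL

	/*
	 * An executable code page is mapped directly after a writable data
	 * page. The trampoline at offset X in the code page finds its data
	 * slot at the same offset X in the page before it, using only its
	 * own PC -- no absolute addresses baked into the code.
	 */
	static inline void **data_slot_for(void *tramp_pc)
	{
		return (void **)((uintptr_t)tramp_pc - TRAMP_PAGE_SIZE);
	}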

>> The code for trampoline X in the trampoline table is:
>>
>>	load	&code_table[X], code_reg
>>	load	(code_reg), code_reg
>>	load	&data_table[X], data_reg
>>	load	(data_reg), data_reg
>>	jump	code_reg
>>
>> The addresses &code_table[X] and &data_table[X] are baked into the
>> trampoline code. So, PC-relative data references are not needed. The user
>> can modify code_table[X] and data_table[X] dynamically.
> 
> You can put this code into the libffi shared object and map it from
> there, just like the rest of the libffi code.  To get more
> trampolines, you can map the page containing the trampolines multiple
> times, each instance preceded by a separate data page with the control
> information.
> 

If you put the code in the libffi shared object, how do you pass data to
the code at runtime? If the code we are talking about is a function, then
there is an ABI defined way to pass data to the function. But if the
code we are talking about is some arbitrary code such as a trampoline,
there is no ABI defined way to pass data to it except in a couple of
platforms such as HP PA-RISC that have support for function descriptors
in the ABI itself.

As mentioned before, if the ISA supports PC-relative data references
(e.g., X86 64-bit platforms support RIP-relative data references)
then we can pass data to that code by placing the code and data in
adjacent pages. So, you can implement the trampoline table for X64.
i386 does not support it.


> I think the previous patch submission has also resulted in several
> comments along those lines, so I'm not sure why you are reposting
> this.

IIRC, I have answered all of those comments by mentioning the point
that we need to support all architectures without requiring special
ISA features. Taking the kernel's help in this is one solution.


> 
>> libffi
>> ======
>>
>> I have implemented my solution for libffi and provided the changes for
>> X86 and ARM, 32-bit and 64-bit. Here is the reference patch:
>>
>> http://linux.microsoft.com/~madvenka/libffi/libffi.v2.txt
> 
> The URL does not appear to work, I get a 403 error.

I apologize for that. That site is supposed to be accessible publicly.
I will contact the administrator and get this resolved.

Sorry for the annoyance.

> 
>> If the trampfd patchset gets accepted, I will send the libffi changes
>> to the maintainers for a review. BTW, I have also successfully executed
>> the libffi self tests.
> 
> I have not seen your libffi changes, but I expect that the complexity
> is about the same as a userspace-only solution.
> 
> 

I agree. The complexity is about the same. But the support is for all
architectures. Once the common code is in place, the changes for each
architecture are trivial.

Madhavan

> Cc:ing libffi upstream for awareness.  The start of the thread is
> here:
> 
> 
> 


Re: [PATCH v1 0/4] [RFC] Implement Trampoline File Descriptor

2020-08-12 Thread Madhavan T. Venkataraman



On 8/12/20 5:06 AM, Mark Rutland wrote:
> [..]
>>
>> The general principle of the mitigation is W^X. I would argue that
>> the above options are violations of the W^X principle. If they are
>> allowed today, they must be fixed. And they will be. So, we cannot
>> rely on them.
> 
> Hold on.
> 
> Contemporary W^X means that a given virtual alias cannot be writeable
> and executeable simultaneously, permitting (a) and (b). If you read the
> references on the Wikipedia page for W^X you'll see the OpenBSD 3.3
> release notes and related presentation make this clear, and further they
> expect (b) to occur with JITS flipping W/X with mprotect().
> 
> Please don't conflate your assumed stronger semantics with the general
> principle. It not matching you expectations does not necessarily mean
> that it is wrong.
> 
> If you want a stronger W^X semantics, please refer to this specifically
> with a distinct name.

OK. Fair enough. We can give a different name to the stronger requirement.
Just for the sake of this discussion and for the want of a better name,
let us call it WX2.


> 
>> a) This requires a remap operation. Two mappings point to the same
>>  physical page. One mapping has W and the other one has X. This
>>  is a violation of W^X.
>>
>> b) This is again a violation. The kernel should refuse to give execute
>>  permission to a page that was writeable in the past and refuse to
>>  give write permission to a page that was executable in the past.
>>
>> c) This is just a variation of (a).
> 
> As above, this is not true.
> 
> If you have a rationale for why this is desirable or necessary, please
> justify that before using this as justification for additional features.
> 

I already supplied the justification. Any user level method can potentially
be hijacked by an attacker for his purpose.

WX does not prevent all of the methods. We need WX2.


>> In general, the problem with user-level methods to map and execute
>> dynamic code is that the kernel cannot tell if a genuine application is
>> using them or an attacker is using them or piggy-backing on them.
> 
> Yes, and as I pointed out the same is true for trampfd unless you can
> somehow authenticate the calls are legitimate (in both callsite and the
> set of arguments), and I don't see any reasonable way of doing that.
> 

I am afraid I am not in agreement with this. If WX2 is not implemented,
an attacker can hack both code and data. If WX2 is implemented, an attacker
can only attack data. The attack surface is reduced.

Also, trampfd calls coming from code from a signed file can be authenticated.
trampfd calls coming from an attacker's generated code cannot be authenticated.


> If you relax your threat model to an attacker not being able to make
> arbitrary syscalls, then your suggestion that userspace can perorm
> chceks between syscalls may be sufficient, but as I pointed out that's
> equally true for a sealed memfd or similar.
> 

Actually, I did not suggest that userspace can perform checks. I said that
the kernel can perform checks.

User space cannot reliably perform checks between calls. A clever hacker
can cover his tracks.

In any case, the kernel has no knowledge of these checks. So, when execute
permissions are requested for a page, a properly implemented WX2 can refuse.


>> Off the top of my head, I have tried to identify some examples
>> where we can have more trust on dynamic code and have the kernel
>> permit its execution.
>>
>> 1. If the kernel can do the job, then that is one safe way. Here, the kernel
>>     is the code. There is no code generation involved. This is what I
>>     have presented in the patch series as the first cut.
> 
> This is sleight-of-hand; it doesn't matter where the logic is performed
> if the power is identical. Practically speaking this is equivalent to
> some dynamic code generation.
> 
> I think that it's misleading to say that because the kernel emulates
> something it is safe when the provenance of the syscall arguments cannot
> be verified.


I submit that there are two aspects - code and data. In one case, both
code and data can be hacked. So, an attacker can modify both code
and data. In the other case, the attacker can only modify data.
The power is not identical. The attack surface is not the same.

Most of the times, security measures are mitigations. They are not a 100%.
This approach of not allowing the user to do certain things that can be
exploited and having the kernel doing them increases our confidence.
From that perspective, the two approaches are different and it is worth
pursuing a kernel based mitigation.


> 
> [...]
> 
>> Anyway, these are just examples. The principle is - if we can identify
>> dynamic code that has a certain measure of trust, can the kernel
>> permit their execution?
> 
> My point generally is that the kernel cannot identify this, and if
> userspace code is trusted to dynamically generate trampfd arguments it
> can equally be trusted to dynamically 

Re: [PATCH v1 0/4] [RFC] Implement Trampoline File Descriptor

2020-08-11 Thread Madhavan T. Venkataraman
I am working on version 2 of trampfd. Will send it out soon.

Thanks for all the comments so far!

Madhavan

On 8/10/20 12:34 PM, Madhavan T. Venkataraman wrote:
> Resending because of mailer problems. Some of the recipients did not receive
> my email. I apologize. Sigh.
> 
> Here is a redefinition of trampfd based on review comments.
> 
> I wanted to address dynamic code in 3 different ways:
> 
> Remove the need for dynamic code where possible
> -----------------------------------------------
> 
> If the kernel itself can perform the work of some dynamic code, then
> the code can be replaced by the kernel.
> 
> This is what I implemented in the patchset. But reviewers objected
> to the performance impact. One trip to the kernel was needed for each
> trampoline invocation. So, I have decided to defer this approach.
> 
> Convert dynamic code to static code where possible
> --------------------------------------------------
> 
> This is possible with help from the kernel. This has no performance
> impact and can be used in libffi, GCC nested functions, etc. I have
> described the approach below.
> 
> Deal with code generation
> -------------------------
> 
> For cases like generating JIT code from Java byte code, I wanted to
> establish a framework. However, reviewers felt that details are missing.
> 
> Should the kernel generate code or should it use a user-level code 
> generator?
> How do you make sure that a user level code generator can be trusted?
> How would the communication work? ABI details? Architecture support?
> Support for different types - JIT, DBT, etc?
> 
> I have come to the conclusion that this is best done separately.
> 
> My main interest is to provide a way to convert dynamic code such as
> trampolines to static code without any special architecture support.
> This can be done with the kernel's help. Any code that gets written in
> the future can conform to this as well.
> 
> So, in version 2 of the Trampfd RFC, I would like to simplify trampfd and
> just address item 2. I will reimplement the support in libffi and present it.
> 
> Convert dynamic code to static code
> -----------------------------------
> 
> One problem with dynamic code is that it cannot be verified or authenticated
> by the kernel. The kernel cannot tell the difference between genuine dynamic
> code and an attacker's code. Where possible, dynamic code should be converted
> to static code and placed in the text segment of a binary file. This allows
> the kernel to verify the code by verifying the signature of the file.
> 
> The other problem is using user-level methods to load and execute dynamic code
> can potentially be exploited by an attacker to inject his code and have it be
> executed. To prevent this, a system may enforce W^X. If W^X is enforced
> properly, genuine dynamic code will not be able to run. This is another
> reason to convert dynamic code to static code.
> 
> The issue in converting dynamic code to static code is that the data is
> dynamic. The code does not know before hand where the data is going to be
> at runtime.
> 
> Some architectures support PC-relative data references. So, if you co-locate
> code and data, then the code can find the data at runtime. But this is not
> supported on all architectures. When supported, there may be limitations to
> deal with. Plus you have to take the trouble to co-locate code and data.
> And, to deal with W^X, code and data need to be in different pages.
> 
> All architectures must be supported without any limitations. Fortunately,
> the kernel can solve this problem quite easily. I suggest the following:
> 
> Convert dynamic code to static code like this:
> 
> - Decide which register should point to the data that the code needs.
>   Call it register R.
> 
> - Write the static code assuming that R already points to the data.
> 
> - Use trampfd and pass the following to the kernel:
> 
> - pointers to the code and data
> - the name of the register R
> 
> The kernel will write the following instructions in a trampoline page
> mapped into the caller's address space with R-X.
> 
> - Load the data address in register R
> - Jump to the static code
> 
> Basically, the kernel provides a trampoline to jump to the user's code
> and returns the kernel-provided trampoline's address to the user.
> 
> It is trivial to implement a trampoline table in the trampoline page to
> conserve memory.
> 
> Issues raised previously
> ------------------------
> 
> I believe that the following issues that were raised by reviewers are not
> a problem in this scheme. Please re-review.

Re: [PATCH v1 0/4] [RFC] Implement Trampoline File Descriptor

2020-08-11 Thread Madhavan T. Venkataraman



On 8/11/20 8:08 AM, Pavel Machek wrote:
> Hi!
> 
>>>> Thanks for the lively discussion. I have tried to answer some of the
>>>> comments below.
>>>
>>>>> There are options today, e.g.
>>>>>
>>>>> a) If the restriction is only per-alias, you can have distinct aliases
>>>>>    where one is writable and another is executable, and you can make it
>>>>>    hard to find the relationship between the two.
>>>>>
>>>>> b) If the restriction is only temporal, you can write instructions into
>>>>>    an RW- buffer, transition the buffer to R--, verify the buffer
>>>>>    contents, then transition it to --X.
>>>>>
>>>>> c) You can have two processes A and B where A generates instructions into
>>>>>    a buffer that (only) B can execute (where B may be restricted from
>>>>>    making syscalls like write, mprotect, etc).
>>>>
>>>> The general principle of the mitigation is W^X. I would argue that
>>>> the above options are violations of the W^X principle. If they are
>>>> allowed today, they must be fixed. And they will be. So, we cannot
>>>> rely on them.
>>>
>>> Would you mind describing your threat model?
>>>
>>> Because I believe you are using model different from everyone else.
>>>
>>> In particular, I don't believe b) is a problem or should be fixed.
>>
>> It is a problem because a kernel that implements W^X properly
>> will not allow it. It has no idea what has been done in userland.
>> It has no idea that the user has checked and verified the buffer
>> contents after transitioning the page to R--.
> 
> No, it is not a problem. W^X is designed to protect from attackers
> doing buffer overflows, not attackers doing arbitrary syscalls.
> 

Hey Pavel,

You are correct. The W^X implementation today still has some holes.
IIUC, the principle of W^X is - user should not be able to (W) write code
into a page and use some trick to get it to (X) execute. So, what I
was trying to say was that the W^X principle is not implemented
completely today.

Mark Rutland mentioned some other tricks as well which are being used
today.

For instance, Microsoft has submitted this proposal:

 https://microsoft.github.io/ipe/

IPE is an LSM. In this proposal, only mappings that are backed by a
signature verified file can have execute permissions. This means that
all anonymous page based tricks will fail. And, file mapping based
tricks will fail as well when temporary files are used to load code
and mmap() it. That is the intent.

Thanks!

Madhavan


Re: [PATCH v1 0/4] [RFC] Implement Trampoline File Descriptor

2020-08-11 Thread Madhavan T. Venkataraman



On 8/8/20 5:17 PM, Pavel Machek wrote:
> Hi!
> 
>> Thanks for the lively discussion. I have tried to answer some of the
>> comments below.
> 
>>> There are options today, e.g.
>>>
>>> a) If the restriction is only per-alias, you can have distinct aliases
>>>where one is writable and another is executable, and you can make it
>>>hard to find the relationship between the two.
>>>
>>> b) If the restriction is only temporal, you can write instructions into
>>>an RW- buffer, transition the buffer to R--, verify the buffer
>>>contents, then transition it to --X.
>>>
>>> c) You can have two processes A and B where A generates instructions into
>>>a buffer that (only) B can execute (where B may be restricted from
>>>making syscalls like write, mprotect, etc).
>>
>> The general principle of the mitigation is W^X. I would argue that
>> the above options are violations of the W^X principle. If they are
>> allowed today, they must be fixed. And they will be. So, we cannot
>> rely on them.
> 
> Would you mind describing your threat model?
> 
> Because I believe you are using model different from everyone else.
> 
> In particular, I don't believe b) is a problem or should be fixed.

It is a problem because a kernel that implements W^X properly
will not allow it. It has no idea what has been done in userland.
It has no idea that the user has checked and verified the buffer
contents after transitioning the page to R--.


> 
> I'll add d), application mmaps a file(R--), and uses write syscall to change
> trampolines in it.
> 

No matter how you do it, these are all user-level methods that can be
hacked. The kernel cannot be sure that an attacker's code has
not found its way into the file.

>> b) This is again a violation. The kernel should refuse to give execute
>>  permission to a page that was writeable in the past and refuse to
>>  give write permission to a page that was executable in the past.
> 
> Why?

I don't know about the latter part. I guess I need to think about it.
But the former is valid. When a page is RW-, a hacker could hack the
page. Then it does not matter that the page is transitioned to R--.
Again, the kernel cannot be sure that the user has verified the contents
after R--.

IMO, W^X needs to be enforced temporally as well.

Madhavan


Re: [PATCH v1 0/4] [RFC] Implement Trampoline File Descriptor

2020-08-10 Thread Madhavan T. Venkataraman
Resending because of mailer problems. Some of the recipients did not receive
my email. I apologize. Sigh.

Here is a redefinition of trampfd based on review comments.

I wanted to address dynamic code in 3 different ways:

Remove the need for dynamic code where possible
-----------------------------------------------

If the kernel itself can perform the work of some dynamic code, then
the code can be replaced by the kernel.

This is what I implemented in the patchset. But reviewers objected
to the performance impact. One trip to the kernel was needed for each
trampoline invocation. So, I have decided to defer this approach.

Convert dynamic code to static code where possible
--------------------------------------------------

This is possible with help from the kernel. This has no performance
impact and can be used in libffi, GCC nested functions, etc. I have
described the approach below.

Deal with code generation
-------------------------

For cases like generating JIT code from Java byte code, I wanted to
establish a framework. However, reviewers felt that details are missing.

Should the kernel generate code or should it use a user-level code 
generator?
How do you make sure that a user level code generator can be trusted?
How would the communication work? ABI details? Architecture support?
Support for different types - JIT, DBT, etc?

I have come to the conclusion that this is best done separately.

My main interest is to provide a way to convert dynamic code such as
trampolines to static code without any special architecture support.
This can be done with the kernel's help. Any code that gets written in
the future can conform to this as well.

So, in version 2 of the Trampfd RFC, I would like to simplify trampfd and
just address item 2. I will reimplement the support in libffi and present it.

Convert dynamic code to static code
-----------------------------------

One problem with dynamic code is that it cannot be verified or authenticated
by the kernel. The kernel cannot tell the difference between genuine dynamic
code and an attacker's code. Where possible, dynamic code should be converted
to static code and placed in the text segment of a binary file. This allows
the kernel to verify the code by verifying the signature of the file.

The other problem is using user-level methods to load and execute dynamic code
can potentially be exploited by an attacker to inject his code and have it be
executed. To prevent this, a system may enforce W^X. If W^X is enforced
properly, genuine dynamic code will not be able to run. This is another
reason to convert dynamic code to static code.

The issue in converting dynamic code to static code is that the data is
dynamic. The code does not know before hand where the data is going to be
at runtime.

Some architectures support PC-relative data references. So, if you co-locate
code and data, then the code can find the data at runtime. But this is not
supported on all architectures. When supported, there may be limitations to
deal with. Plus you have to take the trouble to co-locate code and data.
And, to deal with W^X, code and data need to be in different pages.

All architectures must be supported without any limitations. Fortunately,
the kernel can solve this problem quite easily. I suggest the following:

Convert dynamic code to static code like this:

- Decide which register should point to the data that the code needs.
  Call it register R.

- Write the static code assuming that R already points to the data.

- Use trampfd and pass the following to the kernel:

- pointers to the code and data
- the name of the register R

The kernel will write the following instructions in a trampoline page
mapped into the caller's address space with R-X.

- Load the data address in register R
- Jump to the static code

Basically, the kernel provides a trampoline to jump to the user's code
and returns the kernel-provided trampoline's address to the user.

It is trivial to implement a trampoline table in the trampoline page to
conserve memory.
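
For concreteness, the user-side flow could look roughly like the sketch
below. Every name in it (the syscall number, the TRAMPFD_* constants,
the struct layouts, the register encoding) is made up for illustration
and is not a settled ABI:

    #include <stdint.h>
    #include <sys/mman.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    #define TRAMPFD_STATIC      1    /* hypothetical request type    */
    #define TRAMPFD_MAP_OFFSET  0    /* hypothetical pread() offset  */

    struct trampfd_static {
        void     *code;  /* static code, written as if R holds data  */
        void     *data;  /* the dynamic data to load into R          */
        uint32_t  reg;   /* which register is R (arch encoding)      */
    };

    struct trampfd_map {
        uint64_t offset;
        uint64_t size;
        uint32_t prot;
        uint32_t flags;
    };

    /* Ask the kernel for "load the data address into R, then jump
     * to the static code", mapped R-X, and return its address. */
    static void *make_static_tramp(void *code, void *data, uint32_t reg)
    {
        struct trampfd_static req = { code, data, reg };
        struct trampfd_map map;
        long fd = syscall(440, TRAMPFD_STATIC, &req, 0);

        if (fd < 0)
            return NULL;
        pread(fd, &map, sizeof(map), TRAMPFD_MAP_OFFSET);
        return mmap(NULL, map.size, map.prot, map.flags, fd, map.offset);
    }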

Issues raised previously
------------------------

I believe that the following issues that were raised by reviewers are not
a problem in this scheme. Please re-review.

- Florian mentioned the libffi trampoline table. Trampoline tables can be
  implemented in this scheme easily.

- Florian mentioned stack unwinders. I am not an expert on unwinders.
  But I don't see an issue with unwinders.

- Mark Rutland mentioned Intel's CET and CFI. Don't see a problem there.

- Mark Rutland mentioned PAC+BTI on ARM64. Don't see a problem there.

If I have missed addressing any previously raised issue, I apologize.
Please let me know.

Thanks!

Madhavan




Re: [PATCH v1 0/4] [RFC] Implement Trampoline File Descriptor

2020-08-06 Thread Madhavan T. Venkataraman
Thanks for the lively discussion. I have tried to answer some of the
comments below.

On 8/4/20 9:30 AM, Mark Rutland wrote:
>
>> So, the context is - if security settings in a system disallow a page to have
>> both write and execute permissions, how do you allow the execution of
>> genuine trampolines that are runtime generated and placed in a data
>> page or a stack page?
> There are options today, e.g.
>
> a) If the restriction is only per-alias, you can have distinct aliases
>    where one is writable and another is executable, and you can make it
>    hard to find the relationship between the two.
>
> b) If the restriction is only temporal, you can write instructions into
>    an RW- buffer, transition the buffer to R--, verify the buffer
>    contents, then transition it to --X.
>
> c) You can have two processes A and B where A generates instructions into
>    a buffer that (only) B can execute (where B may be restricted from
>    making syscalls like write, mprotect, etc).

The general principle of the mitigation is W^X. I would argue that
the above options are violations of the W^X principle. If they are
allowed today, they must be fixed. And they will be. So, we cannot
rely on them.

a) This requires a remap operation. Two mappings point to the same
 physical page. One mapping has W and the other one has X. This
 is a violation of W^X.

b) This is again a violation. The kernel should refuse to give execute
 permission to a page that was writeable in the past and refuse to
 give write permission to a page that was executable in the past.

c) This is just a variation of (a).
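
For concreteness, (b) above is the familiar staging sequence shown
below. This is a minimal sketch, assuming x86-64 Linux, with a single
ret instruction standing in for real generated code. It is precisely
this write-then-execute history that I am arguing the kernel should
refuse:

    #include <assert.h>
    #include <sys/mman.h>

    int main(void)
    {
        unsigned char *buf = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                                  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        assert(buf != MAP_FAILED);
        buf[0] = 0xc3;                  /* ret: the "generated code" */
        mprotect(buf, 4096, PROT_READ); /* 1. RW- becomes R--        */
        assert(buf[0] == 0xc3);         /* 2. verify the contents    */
        mprotect(buf, 4096, PROT_EXEC); /* 3. R-- becomes --X        */
        ((void (*)(void))buf)();        /* run the staged code       */
        return 0;
    }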

In general, the problem with user-level methods to map and execute
dynamic code is that the kernel cannot tell if a genuine application is
using them or an attacker is using them or piggy-backing on them.
If a security subsystem blocks all user-level methods for this reason,
we need a kernel mechanism to deal with the problem.

The kernel mechanism is not to be a backdoor. It is there to define
ways in which safe dynamic code can be executed.

I admit I have to provide more proof that my API and framework can
cover different cases. So, that is what I am doing now. I am in the process
of identifying other examples (per Andy's comment) and attempting to
show that this API and framework can address them. It will take a little time.


>>
>> IIUC, you are suggesting that the user hands the kernel a code fragment
>> and requests it to be placed in an r-x page, correct? However, the
>> kernel cannot trust any code given to it by the user. Nor can it scan any
>> piece of code and reliably decide if it is safe or not.
> Per that same logic the kernel cannot trust trampfd creation calls to be
> legitimate as the adversary could mess with the arguments. It doesn't
> matter if the kernel's codegen is trustworthy if it's potentially driven
> by an adversary.

That is not true. IMO, this is not a deficiency in trampfd. This is
something that is there even for regular system calls. For instance,
the write() system call will faithfully write out a buffer to a file
even if the buffer contents have been hacked by an attacker.
A system call can perform certain checks on incoming arguments.
But it cannot tell if a hacker has modified them.

So, there are two aspects in dynamic code that I am considering -
data and code. I submit that the data part can be hacked if an
application has a vulnerability such as buffer overflow. I don't see
how we can ever help that.

So, I am focused on the code generation part. Not all dynamic code
is the same. They have different degrees of trust.

Off the top of my head, I have tried to identify some examples
where we can have more trust on dynamic code and have the kernel
permit its execution.

1. If the kernel can do the job, then that is one safe way. Here, the kernel
    is the code. There is no code generation involved. This is what I
    have presented in the patch series as the first cut.

2. If the kernel can generate the code, then that code has a measure
    of trust. For trampolines, I agreed to do this for performance.

3. If the code resides in a signed file, then we know that it comes from
    an known source and it was generated at build time. So, it is not
    hacker generated. So, there is a measure of trust.

    This is not just program text. This could also be a buffer that contains
    trampoline code that resides in the read-only data section of a binary.

4. If the code resides in a signed file and is emulated (e.g. by QEMU)
    and we generate code for dynamic binary translation, we should
    be able to do that provided the code generator itself is not suspect.
    See the next point.   

5. The above are examples of actual machine code or equivalent.
    We could also have source code from which we generate machine
    code. E.g., JIT code from Java byte code. In this case, if the source
   code is in a signed file, we have a measure of trust on the source.
   If the kernel 

Re: [PATCH v1 0/4] [RFC] Implement Trampoline File Descriptor

2020-08-04 Thread Madhavan T. Venkataraman
Hey Mark,

I am working on putting together an improved definition of trampfd per
Andy's comment. I will try to address your comments in that improved
definition. Once I send that out, I will respond to your emails as well.

Thanks.

Madhavan

On 8/4/20 8:55 AM, Mark Rutland wrote:
> On Mon, Aug 03, 2020 at 12:58:04PM -0500, Madhavan T. Venkataraman wrote:
>> On 7/31/20 1:31 PM, Mark Rutland wrote:
>>> On Fri, Jul 31, 2020 at 12:13:49PM -0500, Madhavan T. Venkataraman wrote:
>>>> On 7/30/20 3:54 PM, Andy Lutomirski wrote:
>>>>> On Thu, Jul 30, 2020 at 7:24 AM Madhavan T. Venkataraman
>>>>>  wrote:
>>>> When the kernel generates the code for a trampoline, it can hard code
>>>> data values
>>>> in the generated code itself so it does not need PC-relative data 
>>>> referencing.
>>>>
>>>> And, for ISAs that do support the large offset, we do have to implement and
>>>> maintain the code page stuff for different ISAs for each application and 
>>>> library
>>>> if we did not use trampfd.
>>> Trampoline code is architecture specific today, so I don't see that as a
>>> major issue. Common structural bits can probably be shared even if the
>>> specifid machine code cannot.
>> True. But an implementor may prefer a standard mechanism provided by
>> the kernel so all of his architectures can be supported easily with less
>> effort.
>>
>> If you look at the libffi reference patch I have included, the architecture
>> specific changes to use trampfd just involve a single C function call to
>> a common code function.
> Sure but in addition to that each architecture backend had to define a
> set of arguments to that. I view the C function is analagous to the
> "common structural bits".
>
> I appreciate that your patch is small today (and architectures seem to
> largely align on what they need), but I don't think it's necessarily
> true that things will remain so simple as architecture are extended and
> their calling conventions evolve, and I also don't think it's clear that
> this will work for more complex cases elsewhere.
>
> [...]
>
>>>> With the user level trampoline table approach, the data part of the 
>>>> trampoline table
>>>> can be hacked by an attacker if an application has a vulnerability. 
>>>> Specifically, the
>>>> target PC can be altered to some arbitrary location. Trampfd implements an
>>>> "Allowed PCS" context. In the libffi changes, I have created a read-only 
>>>> array of
>>>> all ABI handlers used in closures for each architecture. This read-only 
>>>> array
>>>> can be used to restrict the PC values for libffi trampolines to prevent 
>>>> hacking.
>>>>
>>>> To generalize, we can implement security rules/features if the trampoline
>>>> object is in the kernel.
>>> I don't follow this argument. If it's possible to statically define that
>>> in the kernel, it's also possible to do that in userspace without any
>>> new kernel support.
>> It is not statically defined in the kernel.
>>
>> Let us take the libffi example. In the 64-bit X86 arch code, there are 3
>> ABI handlers:
>>
>>     ffi_closure_unix64_sse
>>     ffi_closure_unix64
>>     ffi_closure_win64
>>
>> I could create an "Allowed PCs" context like this:
>>
>> struct my_allowed_pcs {
>>     struct trampfd_values    pcs;
>>     __u64                    pc_values[3];
>> };
>>
>> const struct my_allowed_pcs my_allowed_pcs = {
>>     { 3, 0 },
>>     (uintptr_t) ffi_closure_unix64_sse,
>>     (uintptr_t) ffi_closure_unix64,
>>     (uintptr_t) ffi_closure_win64,
>> };
>>
>> I have created a read-only array of allowed ABI handlers that closures use.
>>
>> When I set up the context for a closure trampoline, I could do this:
>>
>>     pwrite(trampfd, &my_allowed_pcs, sizeof(my_allowed_pcs),
>>            TRAMPFD_ALLOWED_PCS_OFFSET);
>>    
>> This copies the array into the trampoline object in the kernel.
>> When the register context is set for the trampoline, the kernel checks
>> the PC register value against allowed PCs.
>>
>> Because my_allowed_pcs is read-only, a hacker cannot modify it. So, the only
>> permitted target PCs enforced by the kernel are the ABI handlers.
> Sorry, when I said "statically define" meant when you knew legitimate
> targets ahead of time when you create the trampoline (i.e. whether you
> could 

Re: [PATCH v1 0/4] [RFC] Implement Trampoline File Descriptor

2020-08-04 Thread Madhavan T. Venkataraman



On 8/4/20 9:33 AM, David Laight wrote:
>>> If you look at the libffi reference patch I have included, the architecture
>>> specific changes to use trampfd just involve a single C function call to
>>> a common code function.
> No idea what libffi is, but it must surely be simpler to
> rewrite it to avoid nested function definitions.

Sorry if I wasn't clear.

libffi is a separate use case and GCC nested functions is a separate one.
libffi is not used to solve the nested function stuff.

For nested functions, GCC generates trampoline code and arranges to
place it on the stack and execute it.

I agree with your other points about nested function implementation.

Madhavan
> Or find a book from the 1960s on how to do recursive
> calls and nested functions in FORTRAN-IV.
>
>   David
>
> -
> Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 
> 1PT, UK
> Registration No: 1397386 (Wales)



Re: [PATCH v1 0/4] [RFC] Implement Trampoline File Descriptor

2020-08-03 Thread Madhavan T. Venkataraman



On 8/2/20 3:00 PM, Andy Lutomirski wrote:
> I feel like trampfd is too poorly defined at this point to evaluate.

Point taken. It is because I wanted to start with something small
and specific and expand it in the future. So, I did not really describe the big
picture - the overall vision, future work, that sort of thing. In retrospect,
may be, I should have done that.

I will take all of the input I have received so far and all of the responses
I have given, refine the definition of trampfd and send it out. Please
review that and let me know if anything is still missing from the
definition.

Thanks.

Madhavan



Re: [PATCH v1 0/4] [RFC] Implement Trampoline File Descriptor

2020-08-03 Thread Madhavan T. Venkataraman



On 7/31/20 1:31 PM, Mark Rutland wrote:
> On Fri, Jul 31, 2020 at 12:13:49PM -0500, Madhavan T. Venkataraman wrote:
>> On 7/30/20 3:54 PM, Andy Lutomirski wrote:
>>> On Thu, Jul 30, 2020 at 7:24 AM Madhavan T. Venkataraman
>>>  wrote:
>> Dealing with multiple architectures
>> ---
>>
>> One good reason to use trampfd is multiple architecture support. The
>> trampoline table in a code page approach is neat. I don't deny that at
>> all. But my question is - can it be used in all cases?
>>
>> It requires PC-relative data references. I have not worked on all 
>> architectures.
>> So, I need to study this. But do all ISAs support PC-relative data 
>> references?
> Not all do, but pretty much any recent ISA will as it's a practical
> necessity for fast position-independent code.

So, two questions:

1. IIUC, for position independent code, we need PC-relative control transfers.
   I know that PC-relative control transfers are kinda fundamental. So, I
   expect most architectures support it. But to implement the trampoline
   table suggestion, we need PC-relative data references. Like:

    movq    X(%rip), %rax

2. Do you know which architectures do not support PC-relative data references?
   I am going to study this. But if you have some information, I would
   appreciate it.

In any case, I think we should support all of the architectures on which
Linux currently runs even if they are legacy.

>
>> Even in an ISA that supports it, there would be a maximum supported offset
>> from the current PC that can be reached for a data reference. That maximum
>> needs to be at least the size of a base page in the architecture. This is 
>> because
>> the code page and the data page need to be separate for security reasons.
>> Do all ISAs support a sufficiently large offset?
> ISAs with pc-relative addessing can usually generate PC-relative
> addresses into a GPR, from which they can apply an arbitrarily large
> offset.

I will study this. I need to nail down the list of architectures that cannot
do this.

>
>> When the kernel generates the code for a trampoline, it can hard code data 
>> values
>> in the generated code itself so it does not need PC-relative data 
>> referencing.
>>
>> And, for ISAs that do support the large offset, we do have to implement and
>> maintain the code page stuff for different ISAs for each application and 
>> library
>> if we did not use trampfd.
> Trampoline code is architecture specific today, so I don't see that as a
> major issue. Common structural bits can probably be shared even if the
> specifid machine code cannot.

True. But an implementor may prefer a standard mechanism provided by
the kernel so all of his architectures can be supported easily with less
effort.

If you look at the libffi reference patch I have included, the architecture
specific changes to use trampfd just involve a single C function call to
a common code function.

So, from the point of view of adoption, IMHO, the kernel provided method
is preferable.

>
> [...]
>
>> Security
>> ---
>>
>> With the user level trampoline table approach, the data part of the 
>> trampoline table
>> can be hacked by an attacker if an application has a vulnerability. 
>> Specifically, the
>> target PC can be altered to some arbitrary location. Trampfd implements an
>> "Allowed PCS" context. In the libffi changes, I have created a read-only 
>> array of
>> all ABI handlers used in closures for each architecture. This read-only array
>> can be used to restrict the PC values for libffi trampolines to prevent 
>> hacking.
>>
>> To generalize, we can implement security rules/features if the trampoline
>> object is in the kernel.
> I don't follow this argument. If it's possible to statically define that
> in the kernel, it's also possible to do that in userspace without any
> new kernel support.
It is not statically defined in the kernel.

Let us take the libffi example. In the 64-bit X86 arch code, there are 3
ABI handlers:

    ffi_closure_unix64_sse
    ffi_closure_unix64
    ffi_closure_win64

I could create an "Allowed PCs" context like this:

struct my_allowed_pcs {
    struct trampfd_values    pcs;
    __u64                    pc_values[3];
};

const struct my_allowed_pcs    my_allowed_pcs = {
    { 3, 0 },
    (uintptr_t) ffi_closure_unix64_sse,
    (uintptr_t) ffi_closure_unix64,
    (uintptr_t) ffi_closure_win64,
};

I have created a read-only array of allowed ABI handlers that closures use.

When I set up the context for a closure trampoline, I could do this:

    pwrite(trampfd, &my_allowed_pcs, sizeof(my_allowed_pcs),
           TRAMPFD_ALLOWED_PCS_OFFSET);

This copies the array into the trampoline object in the kernel.
When the register context is set for the trampoline, the kernel checks
the PC register value against allowed PCs.

Because my_allowed_pcs is read-only, a hacker cannot modify it. So, the
only permitted target PCs enforced by the kernel are the ABI handlers.
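
For illustration, the kernel-side check could be as simple as the
sketch below. The layout of struct trampfd_values is not spelled out
here, so I am guessing from the { 3, 0 } initializer above that it is
a count followed by a reserved field:

    #include <stdbool.h>
    #include <stdint.h>

    struct trampfd_values {
        uint64_t count;      /* number of entries that follow (guess) */
        uint64_t reserved;   /* padding/flags (guess)                 */
    };

    struct allowed_pcs {
        struct trampfd_values pcs;
        uint64_t pc_values[];
    };

    /* Refuse any target PC that is not in the caller-supplied,
     * read-only list. */
    static bool pc_allowed(const struct allowed_pcs *ctx, uint64_t pc)
    {
        for (uint64_t i = 0; i < ctx->pcs.count; i++)
            if (ctx->pc_values[i] == pc)
                return true;
        return false;
    }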

Re: [PATCH v1 0/4] [RFC] Implement Trampoline File Descriptor

2020-08-03 Thread Madhavan T. Venkataraman



On 8/3/20 11:57 AM, David Laight wrote:
> From: Madhavan T. Venkataraman
>> Sent: 03 August 2020 17:03
>>
>> On 8/3/20 3:27 AM, David Laight wrote:
>>> From: Mark Rutland
>>>> Sent: 31 July 2020 19:32
>>> ...
>>>>> It requires PC-relative data references. I have not worked on all 
>>>>> architectures.
>>>>> So, I need to study this. But do all ISAs support PC-relative data 
>>>>> references?
>>>> Not all do, but pretty much any recent ISA will as it's a practical
>>>> necessity for fast position-independent code.
>>> i386 has neither PC-relative addressing nor moves from %pc.
>>> The cpu architecture knows that the sequence:
>>>     call    1f
>>> 1:  pop     %reg
>>> is used to get the %pc value so is treated specially so that
>>> it doesn't 'trash' the return stack.
>>>
>>> So PIC code isn't too bad, but you have to use the correct
>>> sequence.
>> Is that true only for 32-bit systems? I thought RIP-relative addressing
>> was introduced in 64-bit mode. Please confirm.
> I said i386 not amd64 or x86-64.

I am sorry. My bad.

>
> So yes, 64bit code has PC-relative addressing.
> But I'm pretty sure it has no other way to get the PC itself
> except using call - certainly nothing in the 'usual' instructions.

OK.

Madhavan


Re: [PATCH v1 0/4] [RFC] Implement Trampoline File Descriptor

2020-08-03 Thread Madhavan T. Venkataraman
Responses inline..

On 7/31/20 1:09 PM, Mark Rutland wrote:
> Hi,
>
> On Tue, Jul 28, 2020 at 08:10:46AM -0500, madve...@linux.microsoft.com wrote:
>> From: "Madhavan T. Venkataraman" 
>> Trampoline code is placed either in a data page or in a stack page. In
>> order to execute a trampoline, the page it resides in needs to be mapped
>> with execute permissions. Writable pages with execute permissions provide
>> an attack surface for hackers. Attackers can use this to inject malicious
>> code, modify existing code or do other harm.
> For the purpose of below, IIUC this assumes the adversary has an
> arbitrary write.
>
>> To mitigate this, LSMs such as SELinux may not allow pages to have both
>> write and execute permissions. This prevents trampolines from executing
>> and blocks applications that use trampolines. To allow genuine applications
>> to run, exceptions have to be made for them (by setting execmem, etc).
>> In this case, the attack surface is just the pages of such applications.
>>
>> An application that is not allowed to have writable executable pages
>> may try to load trampoline code into a file and map the file with execute
>> permissions. In this case, the attack surface is just the buffer that
>> contains trampoline code. However, a successful exploit may provide the
>> hacker with means to load his own code in a file, map it and execute it.
> It's not clear to me what power the adversary is assumed to have here,
> and consequently it's not clear to me how the proposal mitigates this.
>
> For example, if the attack can control the arguments to syscalls, and
> has an arbitrary write as above, what prevents them from creating a
> trampfd of their own?

That is the point. If a process is allowed to have pages that are both
writable and executable, a hacker can exploit some vulnerability such
as buffer overflow to write his own code into a page and somehow
contrive to execute that.

So, the context is - if security settings in a system disallow a page to have
both write and execute permissions, how do you allow the execution of
genuine trampolines that are runtime generated and placed in a data
page or a stack page?

trampfd tries to address that. So, trampfd is not a measure that increases
the security of a system or mitigates a security problem. It is a framework
to allow safe forms of dynamic code to execute when security settings
will block them otherwise.

>
> [...]
>
>> GCC has traditionally used trampolines for implementing nested
>> functions. The trampoline is placed on the user stack. So, the stack
>> needs to be executable.
> IIUC generally nested functions are avoided these days, specifically to
> prevent the creation of gadgets on the stack. So I don't think those are
> relevant as a cased to care about. Applications using them should move
> to not using them, and would be more secure generally for doing so.

Could not agree with you more.
>
> [...]
>
>> Trampoline File Descriptor (trampfd)
>> --
>>
>> I am proposing a kernel API using anonymous file descriptors that
>> can be used to create and execute trampolines with the help of the
>> kernel. In this solution also, the kernel does the work of the trampoline.
> What's the rationale for the kernel emulating the trampoline here?
>
> In ther case of EMUTRAMP this was necessary to work with existing
> application binaries and kernel ABIs which placed instructions onto the
> stack, and the stack needed to remain RW for other reasons. That
> restriction doesn't apply here.

In addition to the stack, EMUTRAMP also allows the emulation
of the same well-known trampolines placed in a non-stack data page.
For instance, libffi closures embed a trampoline in a closure structure.
That gets executed when the caller of libffi invokes it.

The goal of EMUTRAMP is to allow safe trampolines to execute when
security settings disallow their execution. Mainly, it permits applications
that use libffi to run. A lot of applications use libffi.

They chose the emulation method so that no changes need to be made
to application code to use them. But the EMUTRAMP implementors note
in their description that the real solution to the problem is a kernel
API that is backed by a safe code generator.

trampfd is an attempt to define such an API. This is just a starting point.
I realize that we need to have a lot of discussion to refine the approach.

> Assuming trampfd creation is somehow authenticated, the code could be
> placed in a r-x page (which the kernel could refuse to add write
> permission), in order to prevent modification. If that's sufficient,
> it's not much of a leap to allow userspace to generate the code.

IIUC, you are suggesting that the user hands the kernel a code fragment
and requests it to be placed in an r-x page, correct? However, the
kernel cannot trust any code given to it by the user. Nor can it scan any
piece of code and reliably decide if it is safe or not.

Re: [PATCH v1 0/4] [RFC] Implement Trampoline File Descriptor

2020-08-03 Thread Madhavan T. Venkataraman



On 8/3/20 3:27 AM, David Laight wrote:
> From: Mark Rutland
>> Sent: 31 July 2020 19:32
> ...
>>> It requires PC-relative data references. I have not worked on all 
>>> architectures.
>>> So, I need to study this. But do all ISAs support PC-relative data 
>>> references?
>> Not all do, but pretty much any recent ISA will as it's a practical
>> necessity for fast position-independent code.
> i386 has neither PC-relative addressing nor moves from %pc.
> The cpu architecture knows that the sequence:
>       call    1f
> 1:    pop     %reg
> is used to get the %pc value so is treated specially so that
> it doesn't 'trash' the return stack.
>
> So PIC code isn't too bad, but you have to use the correct
> sequence.

Is that true only for 32-bit systems? I thought RIP-relative addressing was
introduced in 64-bit mode. Please confirm.

Madhavan
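
For reference, the call/pop idiom David describes can be written as GNU
C inline asm, roughly as below. This is a sketch for i386; on x86-64
the compiler's use of the red zone makes a bare push/pop in inline asm
unsafe, so RIP-relative lea is used there instead:

    /* Returns the address of the instruction after the call, i.e. a
     * runtime program counter value, using only call/pop. */
    static void *get_pc(void)
    {
        void *pc;

        __asm__ volatile ("call 1f\n"
                          "1:\tpop %0"
                          : "=r" (pc));
        return pc;
    }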


Re: [PATCH v1 0/4] [RFC] Implement Trampoline File Descriptor

2020-08-03 Thread Madhavan T. Venkataraman



On 8/3/20 3:23 AM, David Laight wrote:
> From: Madhavan T. Venkataraman
>> Sent: 02 August 2020 19:55
>> To: Andy Lutomirski 
>> Cc: Kernel Hardening ; Linux API 
>> ;
>> linux-arm-kernel ; Linux FS Devel 
>> > fsde...@vger.kernel.org>; linux-integrity ; 
>> LKML > ker...@vger.kernel.org>; LSM List ; 
>> Oleg Nesterov
>> ; X86 ML 
>> Subject: Re: [PATCH v1 0/4] [RFC] Implement Trampoline File Descriptor
>>
>> More responses inline..
>>
>> On 7/28/20 12:31 PM, Andy Lutomirski wrote:
>>>> On Jul 28, 2020, at 6:11 AM, madve...@linux.microsoft.com wrote:
>>>>
>>>> From: "Madhavan T. Venkataraman" 
>>>>
>>> 2. Use existing kernel functionality.  Raise a signal, modify the
>>> state, and return from the signal.  This is very flexible and may not
>>> be all that much slower than trampfd.
>> Let me understand this. You are saying that the trampoline code
>> would raise a signal and, in the signal handler, set up the context
>> so that when the signal handler returns, we end up in the target
>> function with the context correctly set up. And, this trampoline code
>> can be generated statically at build time so that there are no
>> security issues using it.
>>
>> Have I understood your suggestion correctly?
> I was thinking that you'd just let the 'not executable' page fault
> signal happen (SIGSEGV?) when the code jumps to on-stack trampoline
> is executed.
>
> The user signal handler can then decode the faulting instruction
> and, if it matches the expected on-stack trampoline, modify the
> saved registers before returning from the signal.
>
> No kernel changes and all you need to add to the program is
> an architecture-dependant signal handler.

Understood.

Madhavan
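
For reference, a bare-bones illustration of the scheme David describes
is sketched below: let the jump to a non-executable page fault, and
redirect by editing the saved registers in the signal frame. x86-64
Linux is assumed, and a real implementation would also verify the
trampoline bytes and set up argument registers, which this sketch
omits:

    #define _GNU_SOURCE
    #include <signal.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/mman.h>
    #include <ucontext.h>

    static void *tramp_page;   /* R-- page standing in for a trampoline */

    static void target(void)
    {
        puts("redirected here by the SIGSEGV handler");
    }

    static void handler(int sig, siginfo_t *si, void *ucontext)
    {
        ucontext_t *uc = ucontext;

        if (si->si_addr != tramp_page)  /* only our known page */
            _exit(1);
        /* Point the saved PC at the real target; execution resumes
         * in target() when the handler returns. */
        uc->uc_mcontext.gregs[REG_RIP] = (greg_t)target;
    }

    int main(void)
    {
        struct sigaction sa = { 0 };

        sa.sa_sigaction = handler;
        sa.sa_flags = SA_SIGINFO;
        sigemptyset(&sa.sa_mask);
        sigaction(SIGSEGV, &sa, NULL);

        tramp_page = mmap(NULL, 4096, PROT_READ,
                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (tramp_page == MAP_FAILED)
            return 1;
        ((void (*)(void))tramp_page)(); /* NX fault -> handler -> target */
        return 0;
    }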


Re: [PATCH v1 0/4] [RFC] Implement Trampoline File Descriptor

2020-08-03 Thread Madhavan T. Venkataraman



On 8/3/20 3:08 AM, David Laight wrote:
> From: Pavel Machek 
>> Sent: 02 August 2020 12:56
>> Hi!
>>
 This is quite clever, but now I???m wondering just how much kernel help
 is really needed. In your series, the trampoline is an non-executable
 page.  I can think of at least two alternative approaches, and I'd
 like to know the pros and cons.

 1. Entirely userspace: a return trampoline would be something like:

 1:
 pushq %rax
 pushq %rbc
 pushq %rcx
 ...
 pushq %r15
 movq %rsp, %rdi # pointer to saved regs
 leaq 1b(%rip), %rsi # pointer to the trampoline itself
 callq trampoline_handler # see below
>>> For nested calls (where the trampoline needs to pass the
>>> original stack frame to the nested function) I think you
>>> just need a page full of:
>>> mov $0, scratch_reg; jmp trampoline_handler
>> I believe you could do with mov %pc, scratch_reg; jmp ...
>>
>> That has advantage of being able to share single physical
>> page across multiple virtual pages...
> A lot of architectures don't let you copy %pc that way so you would
> have to use 'call' - but that trashes the return address cache.
> It also needs the trampoline handler to know the addresses
> of the trampolines.

Do you know which ones don't allow you to copy %pc?

Some of the architectures do not have PC-relative data references.
If they do not allow you to copy the PC into a general purpose
register, then there is no way to implement the statically defined
trampoline that has been discussed so far. In these cases, the
trampoline has to be generated at runtime.

Thanks.

Madhavan



Re: [PATCH v1 0/4] [RFC] Implement Trampoline File Descriptor

2020-08-02 Thread Madhavan T. Venkataraman



On 8/2/20 3:00 PM, Andy Lutomirski wrote:
> On Sun, Aug 2, 2020 at 11:54 AM Madhavan T. Venkataraman
>  wrote:
>> More responses inline..
>>
>> On 7/28/20 12:31 PM, Andy Lutomirski wrote:
>>>> On Jul 28, 2020, at 6:11 AM, madve...@linux.microsoft.com wrote:
>>>>
>>>> From: "Madhavan T. Venkataraman" 
>>>>
>>> 2. Use existing kernel functionality.  Raise a signal, modify the
>>> state, and return from the signal.  This is very flexible and may not
>>> be all that much slower than trampfd.
>> Let me understand this. You are saying that the trampoline code
>> would raise a signal and, in the signal handler, set up the context
>> so that when the signal handler returns, we end up in the target
>> function with the context correctly set up. And, this trampoline code
>> can be generated statically at build time so that there are no
>> security issues using it.
>>
>> Have I understood your suggestion correctly?
> yes.
>
>> So, my argument would be that this would always incur the overhead
>> of a trip to the kernel. I think twice the overhead if I am not mistaken.
>> With trampfd, we can have the kernel generate the code so that there
>> is no performance penalty at all.
> I feel like trampfd is too poorly defined at this point to evaluate.
> There are three general things it could do.  It could generate actual
> code that varies by instance.  It could have static code that does not
> vary.  And it could actually involve a kernel entry.
>
> If it involves a kernel entry, then it's slow.  Maybe this is okay for
> some use cases.

Yes. IMO, it is OK for most cases except where dynamic code
is used specifically for enhancing performance such as interpreters
using JIT code for frequently executed sequences and dynamic
binary translation.

> If it involves only static code, I see no good reason that it should
> be in the kernel.

It does not involve only static code. This is meant for dynamic code.
However, see below.

> If it involves dynamic code, then I think it needs a clearly defined
> use case that actually requires dynamic code.

Fair enough. I will work on this and get back to you. This might take
a little time. So, bear with me.

But I would like to make one point here. There are many applications
and libraries out there that use trampolines. They may all require the
same sort of things:

    - set register context
    - push stuff on stack
    - jump to a target PC

But in each case, the context would be different:

    - only register context
    - only stack context
    - both register and stack contexts
    - different registers
    - different values pushed on the stack
    - different target PCs

If we had to do this purely at user level, each application/library would
need to roll its own solution, the solution has to be implemented for
each supported architecture and maintained. While the code is static
in each separate case, it is dynamic across all of them.

That is, the kernel will generate the code on the fly for each trampoline
instance based on its current context. It will not maintain any static
trampoline code at all.

Basically, it will supply the context to an arch-specific function and say:

    - generate instructions for loading these regs with these values
    - generate instructions to push these values on the stack
    - generate an instruction to jump to this target PC

It will place all of those generated instructions on a page and return the 
address.

So, even with the static case, there is a lot of value in the kernel providing
this. Plus, it has the framework to handle dynamic code.
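
Roughly, I imagine the context handed to that arch-specific generator
looking something like the sketch below. This is purely illustrative,
not a proposed ABI; the names and limits are made up:

    #include <stdint.h>

    #define TRAMP_MAX_REGS    8
    #define TRAMP_MAX_PUSHES  8

    /* One register assignment the generated code must perform. */
    struct tramp_reg {
        uint32_t reg;        /* arch-specific register number */
        uint64_t value;
    };

    /* Everything needed to emit one trampoline instance. */
    struct tramp_context {
        struct tramp_reg regs[TRAMP_MAX_REGS];
        uint32_t         nregs;
        uint64_t         pushes[TRAMP_MAX_PUSHES];  /* pushed in order */
        uint32_t         npushes;
        uint64_t         target_pc;  /* where the trampoline jumps */
    };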

>> Also, signals are asynchronous. So, they are vulnerable to race conditions.
>> To prevent other signals from coming in while handling the raised signal,
>> we would need to block and unblock signals. This will cause more
>> overhead.
> If you're worried about raise() racing against signals from out of
> thread, you have bigger problems to deal with.

Agreed. The signal blocking is just one example of problems related
to signals. There are other bigger problems as well. So, let us remove
the signal-based approach from our discussions.

Thanks.

Madhavan


Re: [PATCH v1 0/4] [RFC] Implement Trampoline File Descriptor

2020-08-02 Thread Madhavan T. Venkataraman
More responses inline..

On 7/28/20 12:31 PM, Andy Lutomirski wrote:
>> On Jul 28, 2020, at 6:11 AM, madve...@linux.microsoft.com wrote:
>>
>> From: "Madhavan T. Venkataraman" 
>>
>
> 2. Use existing kernel functionality.  Raise a signal, modify the
> state, and return from the signal.  This is very flexible and may not
> be all that much slower than trampfd.

Let me understand this. You are saying that the trampoline code
would raise a signal and, in the signal handler, set up the context
so that when the signal handler returns, we end up in the target
function with the context correctly set up. And, this trampoline code
can be generated statically at build time so that there are no
security issues using it.

Have I understood your suggestion correctly?

So, my argument would be that this would always incur the overhead
of a trip to the kernel. I think twice the overhead if I am not mistaken.
With trampfd, we can have the kernel generate the code so that there
is no performance penalty at all.

Signals have many problems. Which signal number should we use for this
purpose? If we use an existing one, that might conflict with what the 
application
is already handling. Getting a new signal number for this could meet
with resistance from the community.

Also, signals are asynchronous. So, they are vulnerable to race conditions.
To prevent other signals from coming in while handling the raised signal,
we would need to block and unblock signals. This will cause more
overhead.

> 3. Use a syscall.  Instead of having the kernel handle page faults,
> have the trampoline code push the syscall nr register, load a special
> new syscall nr into the syscall nr register, and do a syscall. On
> x86_64, this would be:
>
> pushq %rax
> movq __NR_magic_trampoline, %rax
> syscall
>
> with some adjustment if the stack slot you're clobbering is important.

How is this better than the kernel handling an address fault?
The system call still needs to do the same work as the fault handler.
We do need to specify the register and stack contexts before hand
so the system call can do its job.

Also, this always incurs a trip to the kernel. With trampfd, the kernel
could generate the code to avoid the performance penalty.


>
> Also, will using trampfd cause issues with various unwinders?  I can
> easily imagine unwinders expecting code to be readable, although this
> is slowly going away for other reasons.

I need to study unwinders a little before I respond to this question.
So, bear with me.

> All this being said, I think that the kernel should absolutely add a
> sensible interface for JITs to use to materialize their code.  This
> would integrate sanely with LSMs and wouldn't require hacks like using
> files, etc.  A cleverly designed JIT interface could function without
> seriailization IPIs, and even lame architectures like x86 could
> potentially avoid shootdown IPIs if the interface copied code instead
> of playing virtual memory games.  At its very simplest, this could be:
>
> void *jit_create_code(const void *source, size_t len);
>
> and the result would be a new anonymous mapping that contains exactly
> the code requested.  There could also be:
>
> int jittfd_create(...);
>
> that does something similar but creates a memfd.  A nicer
> implementation for short JIT sequences would allow appending more code
> to an existing JIT region.  On x86, an appendable JIT region would
> start filled with 0xCC, and I bet there's a way to materialize new
> code into a previously 0xcc-filled virtual page wthout any
> synchronization.  One approach would be to start with:
>
> 
> 0xcc
> 0xcc
> ...
> 0xcc
>
> and to create a whole new page like:
>
> 
> 
> 0xcc
> ...
> 0xcc
>
> so that the only difference is that some code changed to some more
> code.  Then replace the PTE to swap from the old page to the new page,
> and arrange to avoid freeing the old page until we're sure it's gone
> from all TLBs.  This may not work if  spans a page
> boundary.  The #BP fixup would zap the TLB and retry.  Even just
> directly copying code over some 0xcc bytes almost works, but there's a
> nasty corner case involving instructions that fetch I$ fetch
> boundaries.  I'm not sure to what extent I$ snooping helps.

I am thinking that the trampfd API can be used for addressing JIT
code as well. I have not yet started thinking about the details. But I
think the API is sufficient. E.g.,

    struct trampfd_jit {
        void    *source;
        size_t  len;
    };

    struct trampfd_jit  jit;
    struct trampfd_map  map;
    void    *addr;

    jit.source = blah;
    jit.len = blah;

    fd = syscall(440, TRAMPFD_JIT, &jit, flags);
    pread(fd, &map, sizeof(map), TRAMPFD_MAP_OFFSET);
    addr = mmap(NULL, map.size, map.prot, map.flags, fd, map.offset);

And addr would be used to invoke the generated JIT code.

Madhavan


