Re: [PATCH v6] arm64: implement ftrace with regs

2019-01-17 Thread Torsten Duwe
On Wed, Jan 16, 2019 at 09:57:39AM +, Julien Thierry wrote:
> 
> I wanted to test this patch (and try to benchmark having the "mov x9,
> x30" always present in function prelude vs having two nops), but I
> cannot get this patch to apply (despite having a version including both
> commits below).
> 
> Could you provide a git branch from which I could try to rebase the
> patch? (Or a new version of the series)

It also applies with just a single, harmless fuzz 1 onto 5.0-rc2.
Does that help?

I'm currently over v7, with Mark's requests included and the mov patching
done only once at bootup...

Torsten



Re: [PATCH v6] arm64: implement ftrace with regs

2019-01-16 Thread Julien Thierry



On 16/01/2019 15:56, Julien Thierry wrote:
> On 14/01/2019 12:26, Mark Rutland wrote:
>> On Mon, Jan 14, 2019 at 11:13:59PM +1100, Balbir Singh wrote:
>>> On Fri, Jan 04, 2019 at 05:50:18PM +, Mark Rutland wrote:
 Hi Torsten,

 On Fri, Jan 04, 2019 at 03:10:53PM +0100, Torsten Duwe wrote:
> Use -fpatchable-function-entry (gcc8) to add 2 NOPs at the beginning
> of each function. Replace the first NOP thus generated with a quick LR
> saver (move it to scratch reg x9), so the 2nd replacement insn, the call
> to ftrace, does not clobber the value. Ftrace will then generate the
> standard stack frames.
>>>
>>> Do we know what the overhead would be, if this was a link time change
>>> for the first instruction?
>>
>> No, but it should be possible to benchamrk that for a given workload,
>> which is what I'd like to see.
>>
> 
> So, I hacked up something to have the -fpachable-function-entry=2 in the
> build and then have ftrace_init() patch in the "mov x9, lr" in the first
> nop of the function preludes.
> 
> I tested it on a 8 x Cortex A-57 machine and compared with a version
> that just has the two nops in the function prelude.
> 
> On workloads like hackbench, the average difference is within the noise
> (<1%). Time results below are in seconds.
> 
>   +++
>   | "nop; nop" | "mov x9, lr; nop"  |
>   +++
>   | 43.497 | 42.694 |
>   | 43.464 | 43.148 |
>   | 43.599 | 43.131 |
>   | 43.785 |  43.63 |
>   | 43.458 | 43.281 |
>   |   44.3 | 43.328 |
>   | 43.541 | 43.059 |
>   | 43.529 | 43.298 |
>   |  43.58 | 43.937 |
>   | 43.385 | 43.122 |
>   | 43.514 | 43.825 |
>   | 45.508 | 43.268 |
>   | 43.757 | 43.316 |
>   | 43.392 | 43.146 |
>   | 44.029 | 43.236 |
>   | 43.515 | 43.139 |
>   |  43.22 | 43.108 |
>   | 43.496 | 43.836 |
>   | 43.669 | 43.083 |
>   | 43.388 |  43.38 |
>   +++
> average   |43.6813 |   43.29825 |
>   +++
> 
Here are also some results running hackbench on 4 x Cortex-A53 (pay no
attention to the fact that the timescales are similar, I changed the
number of iteration done by hackbench so it wouldn't take too long)

++---+
| "nop; nop" | "mov x9, lr; nop" |
++---+
| 43.815 |44.455 |
| 43.758 |45.173 |
| 44.075 | 43.95 |
| 44.021 |44.185 |
| 43.959 |44.826 |
| 44.039 |44.478 |
| 43.836 |44.626 |
| 44.071 |45.177 |
| 43.619 |45.033 |
| 44.052 |45.095 |
| 43.903 |44.802 |
| 43.773 |44.955 |
| 43.908 | 45.02 |
| 43.441 |44.986 |
| 44.167 |45.182 |
| 44.106 |45.229 |
| 43.974 | 45.07 |
| 43.859 |45.283 |
| 43.706 |44.892 |
| 43.897 |44.194 |
++---+
average | 43.899 |44.835 |
++---+


So, in this case the performance take a ~2% hit from keeping the mov
always present in the function prelude instead of a nop.

Makes it a bit less obvious whether the always having that mov there
(whether patched at build time or run time) is good enough.

Cheers,

-- 
Julien Thierry


Re: [PATCH v6] arm64: implement ftrace with regs

2019-01-16 Thread Julien Thierry
Hi,

On 14/01/2019 12:26, Mark Rutland wrote:
> On Mon, Jan 14, 2019 at 11:13:59PM +1100, Balbir Singh wrote:
>> On Fri, Jan 04, 2019 at 05:50:18PM +, Mark Rutland wrote:
>>> Hi Torsten,
>>>
>>> On Fri, Jan 04, 2019 at 03:10:53PM +0100, Torsten Duwe wrote:
 Use -fpatchable-function-entry (gcc8) to add 2 NOPs at the beginning
 of each function. Replace the first NOP thus generated with a quick LR
 saver (move it to scratch reg x9), so the 2nd replacement insn, the call
 to ftrace, does not clobber the value. Ftrace will then generate the
 standard stack frames.
>>
>> Do we know what the overhead would be, if this was a link time change
>> for the first instruction?
> 
> No, but it should be possible to benchamrk that for a given workload,
> which is what I'd like to see.
> 

So, I hacked up something to have the -fpachable-function-entry=2 in the
build and then have ftrace_init() patch in the "mov x9, lr" in the first
nop of the function preludes.

I tested it on a 8 x Cortex A-57 machine and compared with a version
that just has the two nops in the function prelude.

On workloads like hackbench, the average difference is within the noise
(<1%). Time results below are in seconds.

+++
| "nop; nop" | "mov x9, lr; nop"  |
+++
| 43.497 | 42.694 |
| 43.464 | 43.148 |
| 43.599 | 43.131 |
| 43.785 |  43.63 |
| 43.458 | 43.281 |
|   44.3 | 43.328 |
| 43.541 | 43.059 |
| 43.529 | 43.298 |
|  43.58 | 43.937 |
| 43.385 | 43.122 |
| 43.514 | 43.825 |
| 45.508 | 43.268 |
| 43.757 | 43.316 |
| 43.392 | 43.146 |
| 44.029 | 43.236 |
| 43.515 | 43.139 |
|  43.22 | 43.108 |
| 43.496 | 43.836 |
| 43.669 | 43.083 |
| 43.388 |  43.38 |
+++
average |43.6813 |   43.29825 |
+++


On a kernel build from defconfig, there seems to be around 5%
difference, but funnily enough it's the version with "mov x9, lr" that
seems faster (but maybe that might be caused by delays from the disk or
other IO related stuff).

I'll try a bit more runs of the kernel builds to make sure, but having
"mov x9, lr; nop" does not appear to deteriorate the performance
compared to "nop; nop" as function prelude.

Cheers,

-- 
Julien Thierry


Re: [PATCH v6] arm64: implement ftrace with regs

2019-01-16 Thread Julien Thierry



On 16/01/2019 09:57, Julien Thierry wrote:
> Hi Torsten,
> 
> On 04/01/2019 14:10, Torsten Duwe wrote:
>> Use -fpatchable-function-entry (gcc8) to add 2 NOPs at the beginning
>> of each function. Replace the first NOP thus generated with a quick LR
>> saver (move it to scratch reg x9), so the 2nd replacement insn, the call
>> to ftrace, does not clobber the value. Ftrace will then generate the
>> standard stack frames.
>>
>> Note that patchable-function-entry in GCC disables IPA-RA, which means
>> ABI register calling conventions are obeyed *and* scratch registers
>> such as x9 are available.
>>
>> Introduce and handle an ftrace_regs_trampoline for module PLTs, right
>> after ftrace_trampoline, and double the size of this special section.
>>
>> Signed-off-by: Torsten Duwe 
>>
> 
> I wanted to test this patch (and try to benchmark having the "mov x9,
> x30" always present in function prelude vs having two nops), but I
> cannot get this patch to apply (despite having a version including both
> commits below).
> 
> Could you provide a git branch from which I could try to rebase the
> patch? (Or a new version of the series)
> 
>> ---
>>
>> This patch applies on 4.20 with the additional changes
>> bdb85cd1d20669dfae813555dddb745ad09323ba
>> (arm64/module: switch to ADRP/ADD sequences for PLT entries)
>> and
>> 7dc48bf96aa0fc8aa5b38cc3e5c36ac03171e680
>> (arm64: ftrace: always pass instrumented pc in x0)
>> along with their respective series, or alternatively on Linus' master,
>> which already has these.
>>
>> changes since v5:
>>
>> * fix mentioned pc in x0 to hold the start address of the call site,
>>   not the return address or the branch address.
>>   This resolves the problem found by Amit.
>>
>> ---
>>  arch/arm64/Kconfig|2 
>>  arch/arm64/Makefile   |4 +
>>  arch/arm64/include/asm/assembler.h|1 
>>  arch/arm64/include/asm/ftrace.h   |   13 +++
>>  arch/arm64/include/asm/module.h   |3 
>>  arch/arm64/kernel/Makefile|6 -
>>  arch/arm64/kernel/entry-ftrace.S  |  131 
>> ++
>>  arch/arm64/kernel/ftrace.c|  125 
>> 
>>  arch/arm64/kernel/module-plts.c   |3 
>>  arch/arm64/kernel/module.c|2 
>>  drivers/firmware/efi/libstub/Makefile |3 
>>  include/asm-generic/vmlinux.lds.h |1 
>>  include/linux/compiler_types.h|4 +
>>  13 files changed, 262 insertions(+), 36 deletions(-)
> 
> [...]
> 
>> --- a/arch/arm64/kernel/entry-ftrace.S
>> +++ b/arch/arm64/kernel/entry-ftrace.S
> 
> [...]
> 
>> @@ -122,6 +124,7 @@ skip_ftrace_call:// }
>>  ENDPROC(_mcount)
>>  
>>  #else /* CONFIG_DYNAMIC_FTRACE */
>> +#ifndef CONFIG_DYNAMIC_FTRACE_WITH_REGS
>>  /*
>>   * _mcount() is used to build the kernel with -pg option, but all the branch
>>   * instructions to _mcount() are replaced to NOP initially at kernel start 
>> up,
>> @@ -159,6 +162,124 @@ GLOBAL(ftrace_graph_call)  // ftrace_gra
>>  
>>  mcount_exit
>>  ENDPROC(ftrace_caller)
>> +#else /* CONFIG_DYNAMIC_FTRACE_WITH_REGS */
>> +
>> +/*
>> + * Since no -pg or similar compiler flag is used, there should really be
>> + * no reference to _mcount; so do not define one. Only some value for
>> + * MCOUNT_ADDR is needed for comparison. Let it point here to have some
>> + * sort of magic value that can be recognised when debugging.
>> + */
>> +GLOBAL(_mcount)
>> +ret /* make it differ from regs caller */
> 
> There's something I can't figure out. Since there are no callers to
> _mcount, how does the ftrace core builds up its record of patchable
> functions?
> 
> I don't understand fully the core ftrace code but I've got the
> impression that without this record of struct dyn_ftrace, ftrace cannot
> patch in calls to tracers in the future.
> 
> Am I missing something?
> 

Forget that second part, I just saw the vmlinux.lds.h change.

-- 
Julien Thierry


Re: [PATCH v6] arm64: implement ftrace with regs

2019-01-16 Thread Julien Thierry
Hi Torsten,

On 04/01/2019 14:10, Torsten Duwe wrote:
> Use -fpatchable-function-entry (gcc8) to add 2 NOPs at the beginning
> of each function. Replace the first NOP thus generated with a quick LR
> saver (move it to scratch reg x9), so the 2nd replacement insn, the call
> to ftrace, does not clobber the value. Ftrace will then generate the
> standard stack frames.
> 
> Note that patchable-function-entry in GCC disables IPA-RA, which means
> ABI register calling conventions are obeyed *and* scratch registers
> such as x9 are available.
> 
> Introduce and handle an ftrace_regs_trampoline for module PLTs, right
> after ftrace_trampoline, and double the size of this special section.
> 
> Signed-off-by: Torsten Duwe 
> 

I wanted to test this patch (and try to benchmark having the "mov x9,
x30" always present in function prelude vs having two nops), but I
cannot get this patch to apply (despite having a version including both
commits below).

Could you provide a git branch from which I could try to rebase the
patch? (Or a new version of the series)

> ---
> 
> This patch applies on 4.20 with the additional changes
> bdb85cd1d20669dfae813555dddb745ad09323ba
> (arm64/module: switch to ADRP/ADD sequences for PLT entries)
> and
> 7dc48bf96aa0fc8aa5b38cc3e5c36ac03171e680
> (arm64: ftrace: always pass instrumented pc in x0)
> along with their respective series, or alternatively on Linus' master,
> which already has these.
> 
> changes since v5:
> 
> * fix mentioned pc in x0 to hold the start address of the call site,
>   not the return address or the branch address.
>   This resolves the problem found by Amit.
> 
> ---
>  arch/arm64/Kconfig|2 
>  arch/arm64/Makefile   |4 +
>  arch/arm64/include/asm/assembler.h|1 
>  arch/arm64/include/asm/ftrace.h   |   13 +++
>  arch/arm64/include/asm/module.h   |3 
>  arch/arm64/kernel/Makefile|6 -
>  arch/arm64/kernel/entry-ftrace.S  |  131 
> ++
>  arch/arm64/kernel/ftrace.c|  125 
>  arch/arm64/kernel/module-plts.c   |3 
>  arch/arm64/kernel/module.c|2 
>  drivers/firmware/efi/libstub/Makefile |3 
>  include/asm-generic/vmlinux.lds.h |1 
>  include/linux/compiler_types.h|4 +
>  13 files changed, 262 insertions(+), 36 deletions(-)

[...]

> --- a/arch/arm64/kernel/entry-ftrace.S
> +++ b/arch/arm64/kernel/entry-ftrace.S

[...]

> @@ -122,6 +124,7 @@ skip_ftrace_call: // }
>  ENDPROC(_mcount)
>  
>  #else /* CONFIG_DYNAMIC_FTRACE */
> +#ifndef CONFIG_DYNAMIC_FTRACE_WITH_REGS
>  /*
>   * _mcount() is used to build the kernel with -pg option, but all the branch
>   * instructions to _mcount() are replaced to NOP initially at kernel start 
> up,
> @@ -159,6 +162,124 @@ GLOBAL(ftrace_graph_call)   // ftrace_gra
>  
>   mcount_exit
>  ENDPROC(ftrace_caller)
> +#else /* CONFIG_DYNAMIC_FTRACE_WITH_REGS */
> +
> +/*
> + * Since no -pg or similar compiler flag is used, there should really be
> + * no reference to _mcount; so do not define one. Only some value for
> + * MCOUNT_ADDR is needed for comparison. Let it point here to have some
> + * sort of magic value that can be recognised when debugging.
> + */
> +GLOBAL(_mcount)
> + ret /* make it differ from regs caller */

There's something I can't figure out. Since there are no callers to
_mcount, how does the ftrace core builds up its record of patchable
functions?

I don't understand fully the core ftrace code but I've got the
impression that without this record of struct dyn_ftrace, ftrace cannot
patch in calls to tracers in the future.

Am I missing something?

Thanks,

-- 
Julien Thierry


Re: [PATCH v6] arm64: implement ftrace with regs

2019-01-14 Thread Mark Rutland
On Mon, Jan 14, 2019 at 11:13:59PM +1100, Balbir Singh wrote:
> On Fri, Jan 04, 2019 at 05:50:18PM +, Mark Rutland wrote:
> > Hi Torsten,
> > 
> > On Fri, Jan 04, 2019 at 03:10:53PM +0100, Torsten Duwe wrote:
> > > Use -fpatchable-function-entry (gcc8) to add 2 NOPs at the beginning
> > > of each function. Replace the first NOP thus generated with a quick LR
> > > saver (move it to scratch reg x9), so the 2nd replacement insn, the call
> > > to ftrace, does not clobber the value. Ftrace will then generate the
> > > standard stack frames.
> 
> Do we know what the overhead would be, if this was a link time change
> for the first instruction?

No, but it should be possible to benchamrk that for a given workload,
which is what I'd like to see.

> Also, I was under the impression that some arch's do ftrace_call_replace
> under stop_machine(), is that a possibility here?

Something like that is a possibility.

I think we need numbers either way.

Thanks,
Mark.


Re: [PATCH v6] arm64: implement ftrace with regs

2019-01-14 Thread Balbir Singh
On Fri, Jan 04, 2019 at 05:50:18PM +, Mark Rutland wrote:
> Hi Torsten,
> 
> On Fri, Jan 04, 2019 at 03:10:53PM +0100, Torsten Duwe wrote:
> > Use -fpatchable-function-entry (gcc8) to add 2 NOPs at the beginning
> > of each function. Replace the first NOP thus generated with a quick LR
> > saver (move it to scratch reg x9), so the 2nd replacement insn, the call
> > to ftrace, does not clobber the value. Ftrace will then generate the
> > standard stack frames.
> >

Do we know what the overhead would be, if this was a link time change
for the first instruction?

Also, I was under the impression that some arch's do ftrace_call_replace
under stop_machine(), is that a possibility here?

Balbir Singh 


Re: [PATCH v6] arm64: implement ftrace with regs

2019-01-07 Thread Mark Rutland
On Fri, Jan 04, 2019 at 11:41:45PM +0100, Torsten Duwe wrote:
> On Fri, Jan 04, 2019 at 01:06:48PM -0500, Steven Rostedt wrote:
> > On Fri, 4 Jan 2019 17:50:18 +
> > Mark Rutland  wrote:
> > 
> > > At Linux Plumbers, I had a conversation with Steve Rostedt, and we came
> > > to the conclusion that (withut heavyweight synchronization) patching two
> > > NOPs at runtime isn't safe, since a CPU might have executed the first
> > > NOP as a NOP before another CPU patches both instructions. So a CPU
> > > might execute:
> > > 
> > >   NOP
> > >   BL  ftrace_regs_caller
> > > 
> > > ... rather than the expected:
> > > 
> > >   MOV X9, X30
> > >   BL  ftrace_regs_caller
> > > 
> > > ... and therefore X9 contains some UNKNOWN value, rather than the
> > > original LR value.
> 
> I'm perfectly aware of that; an earlier version had barriers, attempting
> to avoid just that, which Mark(?) wrote weren't neccessary.

The problem was that even with barriers, the only guarantee you get is
that instructions are made visible in order, not what the other CPU has
executed.

For example:

I.e. 

CPU#1   CPU#2
NOP#1
Patches NOP#1 -> INSN#1
Cache maintenance
Barrier

// INSN#1 now visible to CPU#2,
// but NOP#1 was already
// executed as a NOP.

Patches NOP#2 -> INSN#2
Cache maintenance
Barrier
INSN#2

> But is this a realistic scenario? All function entries are aligned 8 bytes.
> Are there arm64 implementations out there that fetch only 4 bytes and
> give a chance to mess with the 2nd 4 bytes? You at arm.com should know, and
> I won't be surprised if the answer is a weird "yes". Or maybe it's just
> another erratum lurking somewhere...

The alignment of the instructions provides no guarantee here. Regardless
of what contemporary implementations *may* do, the architecture provides
absolutely no guarantee.

For example, even if CPU#2 fetched both NOPs together, the cache
maintenance and barrier may cause it to throw away any speculative work
after executing NOP#1. Upon re-fetching, it could see both new INSNs,
but as it's already executed the first as a NOP, it will not re-execute
it as INSN#1.

Also consider pre-emption by a hypervisor or firmware may occur
mid-sequence.

> My point is: those 2 insn will _never_ be split by any alignment
> boundary > 8; does that mean anything, have you considered this?

This has no impact whatsoever.

> 
> > > I wonder if we could solve that by patching the kernel at build-time, to
> > > add the MOV X9, X30 in place of the first NOP. If we were to do that, we
> > > could also update the addresses to pooint at the second NOP, simplifying
> > > the changes to the runtime code.
> > 
> > You can also patch it at boot up when there's only one CPU running, and
> > interrupts are disabled.
> 
> May I remind about possible performance hits? 

Sure; please get some numbers either way.

> Even the NOPs had a tiny impact
> on certain in-order implementations. I'd rather switch between the mov and
> a "b +2".

Be careful; the architecture only permits live patching between certain
instructions. Please see ARM DDI 0487D.a, section B2.2.5, "Concurrent
modification and execution of instructions".

Per that, it's not safe to live-patch MOV->B or B->MOV.

It's *also* not safe to live-patch NOP->MOV, or vice-versa.

So I strongly suspect we must unconditionally patch the MOV in early.

Thanks,
Mark.


Re: [PATCH v6] arm64: implement ftrace with regs

2019-01-06 Thread Amit Daniel Kachhap
On Fri, Jan 4, 2019 at 8:05 PM Torsten Duwe  wrote:
>
> Use -fpatchable-function-entry (gcc8) to add 2 NOPs at the beginning
> of each function. Replace the first NOP thus generated with a quick LR
> saver (move it to scratch reg x9), so the 2nd replacement insn, the call
> to ftrace, does not clobber the value. Ftrace will then generate the
> standard stack frames.
>
> Note that patchable-function-entry in GCC disables IPA-RA, which means
> ABI register calling conventions are obeyed *and* scratch registers
> such as x9 are available.
>
> Introduce and handle an ftrace_regs_trampoline for module PLTs, right
> after ftrace_trampoline, and double the size of this special section.
>
> Signed-off-by: Torsten Duwe 
>
> ---
>
> This patch applies on 4.20 with the additional changes
> bdb85cd1d20669dfae813555dddb745ad09323ba
> (arm64/module: switch to ADRP/ADD sequences for PLT entries)
> and
> 7dc48bf96aa0fc8aa5b38cc3e5c36ac03171e680
> (arm64: ftrace: always pass instrumented pc in x0)
> along with their respective series, or alternatively on Linus' master,
> which already has these.
>
> changes since v5:
>
> * fix mentioned pc in x0 to hold the start address of the call site,
>   not the return address or the branch address.
>   This resolves the problem found by Amit.

Function graph tracer display works fine with this version. From my side,
Tested by: Amit Daniel Kachhap 

// Amit
>
> ---
>  arch/arm64/Kconfig|2
>  arch/arm64/Makefile   |4 +
>  arch/arm64/include/asm/assembler.h|1
>  arch/arm64/include/asm/ftrace.h   |   13 +++
>  arch/arm64/include/asm/module.h   |3
>  arch/arm64/kernel/Makefile|6 -
>  arch/arm64/kernel/entry-ftrace.S  |  131 
> ++
>  arch/arm64/kernel/ftrace.c|  125 
>  arch/arm64/kernel/module-plts.c   |3
>  arch/arm64/kernel/module.c|2
>  drivers/firmware/efi/libstub/Makefile |3
>  include/asm-generic/vmlinux.lds.h |1
>  include/linux/compiler_types.h|4 +
>  13 files changed, 262 insertions(+), 36 deletions(-)
> --- a/arch/arm64/Kconfig
> +++ b/arch/arm64/Kconfig
> @@ -131,6 +131,8 @@ config ARM64
> select HAVE_DEBUG_KMEMLEAK
> select HAVE_DMA_CONTIGUOUS
> select HAVE_DYNAMIC_FTRACE
> +   select HAVE_DYNAMIC_FTRACE_WITH_REGS \
> +   if $(cc-option,-fpatchable-function-entry=2)
> select HAVE_EFFICIENT_UNALIGNED_ACCESS
> select HAVE_FTRACE_MCOUNT_RECORD
> select HAVE_FUNCTION_TRACER
> --- a/arch/arm64/Makefile
> +++ b/arch/arm64/Makefile
> @@ -79,6 +79,10 @@ ifeq ($(CONFIG_ARM64_MODULE_PLTS),y)
>  KBUILD_LDFLAGS_MODULE  += -T $(srctree)/arch/arm64/kernel/module.lds
>  endif
>
> +ifeq ($(CONFIG_DYNAMIC_FTRACE_WITH_REGS),y)
> +  CC_FLAGS_FTRACE := -fpatchable-function-entry=2
> +endif
> +
>  # Default value
>  head-y := arch/arm64/kernel/head.o
>
> --- a/arch/arm64/include/asm/ftrace.h
> +++ b/arch/arm64/include/asm/ftrace.h
> @@ -17,6 +17,19 @@
>  #define MCOUNT_ADDR((unsigned long)_mcount)
>  #define MCOUNT_INSN_SIZE   AARCH64_INSN_SIZE
>
> +/*
> + * DYNAMIC_FTRACE_WITH_REGS is implemented by adding 2 NOPs at the beginning
> + * of each function, with the second NOP actually calling ftrace. In contrary
> + * to a classic _mcount call, the call instruction to be modified is thus
> + * the second one, and not the only one.
> + */
> +#ifdef CONFIG_DYNAMIC_FTRACE_WITH_REGS
> +#define ARCH_SUPPORTS_FTRACE_OPS 1
> +#define REC_IP_BRANCH_OFFSET AARCH64_INSN_SIZE
> +#else
> +#define REC_IP_BRANCH_OFFSET 0
> +#endif
> +
>  #ifndef __ASSEMBLY__
>  #include 
>
> --- a/arch/arm64/kernel/Makefile
> +++ b/arch/arm64/kernel/Makefile
> @@ -7,9 +7,9 @@ CPPFLAGS_vmlinux.lds:= -DTEXT_OFFSET=$(
>  AFLAGS_head.o  := -DTEXT_OFFSET=$(TEXT_OFFSET)
>  CFLAGS_armv8_deprecated.o := -I$(src)
>
> -CFLAGS_REMOVE_ftrace.o = -pg
> -CFLAGS_REMOVE_insn.o = -pg
> -CFLAGS_REMOVE_return_address.o = -pg
> +CFLAGS_REMOVE_ftrace.o = -pg $(CC_FLAGS_FTRACE)
> +CFLAGS_REMOVE_insn.o = -pg $(CC_FLAGS_FTRACE)
> +CFLAGS_REMOVE_return_address.o = -pg $(CC_FLAGS_FTRACE)
>
>  # Object file lists.
>  arm64-obj-y:= debug-monitors.o entry.o irq.o fpsimd.o
>   \
> --- a/drivers/firmware/efi/libstub/Makefile
> +++ b/drivers/firmware/efi/libstub/Makefile
> @@ -13,7 +13,8 @@ cflags-$(CONFIG_X86)  += -m$(BITS) -D__K
>
>  # arm64 uses the full KBUILD_CFLAGS so it's necessary to explicitly
>  # disable the stackleak plugin
> -cflags-$(CONFIG_ARM64) := $(subst -pg,,$(KBUILD_CFLAGS)) -fpie \
> +cflags-$(CONFIG_ARM64) := $(filter-out -pg $(CC_FLAGS_FTRACE)\
> + ,$(KBUILD_CFLAGS)) -fpie \
>$(DISABLE_STACKLEAK_PLUGIN)
>  cflags-$(CONFIG_ARM)   := $(subst -pg,,$(KBUILD_CFLAGS)) \
>-fno-builtin -

Re: [PATCH v6] arm64: implement ftrace with regs

2019-01-05 Thread Steven Rostedt
On Sat, 5 Jan 2019 12:05:43 +0100
Torsten Duwe  wrote:

> > My point is: those 2 insn will _never_ be split by any alignment
> > boundary > 8; does that mean anything, have you considered this?  
> 
> Forget that. Steve mentioned the keyword *interrupt*, which creates a
> completely different situation. In short, only the instruction pointer
> will be saved; and i-cache and pipeline will be freshly reloaded on return,
> so this threat is highly unlikely (interrupt taken exactly after 1st nop),
> but not impossible. "Puking horses..." as we say in German.

Correct.

> 
> > > > I wonder if we could solve that by patching the kernel at build-time, to
> > > > add the MOV X9, X30 in place of the first NOP. If we were to do that, we
> > > > could also update the addresses to pooint at the second NOP, simplifying
> > > > the changes to the runtime code.  
> > > 
> > > You can also patch it at boot up when there's only one CPU running, and
> > > interrupts are disabled.  
> > 
> > May I remind about possible performance hits? Even the NOPs had a tiny 
> > impact
> > on certain in-order implementations. I'd rather switch between the mov and
> > a "b +2".  
> 
> This one however still holds.

Now, if you can add one of the changes, do a synchronization to make
sure that all tasks are not preempted there, and see that first change,
then make the other change to complete the transaction, there may be a
solution: synchronize_rcu_tasks()!

convert all first nops to "MOV X9, X30"
synchronize_rcu_tasks();
convert all second nops to "BL ftrace_regs_caller"

That would work. What synchronize_rcu_tasks() does, is that it wont
return until all tasks have either called schedule voluntarily (not
preempted), goes into user space, or goes idle.

Tasks that are idle (not preempted) are not counted.

Then you are guaranteed that no task was preempted at the first nop and
will come back and call "BL ftrace_regs_caller".

The only caveat is that synchronize_rcu_tasks() can take some time to
complete (seconds even) if something was preempted and is starved from
the CPU for some time. This is why you would need to group the
conversions together, by changing all the first nops for all the
functions you want to trace before calling the synchronization routine.

-- Steve


Re: [PATCH v6] arm64: implement ftrace with regs

2019-01-05 Thread Torsten Duwe
On Fri, Jan 04, 2019 at 11:41:45PM +0100, Torsten Duwe wrote:
> On Fri, Jan 04, 2019 at 01:06:48PM -0500, Steven Rostedt wrote:
> > On Fri, 4 Jan 2019 17:50:18 +
> > Mark Rutland  wrote:
> > 
> > > At Linux Plumbers, I had a conversation with Steve Rostedt, and we came
> > > to the conclusion that (withut heavyweight synchronization) patching two
> > > NOPs at runtime isn't safe, since a CPU might have executed the first
> > > NOP as a NOP before another CPU patches both instructions. So a CPU
> > > might execute:
> > > 
> > >   NOP
> > >   BL  ftrace_regs_caller
> > > 
> > > ... rather than the expected:
> > > 
> > >   MOV X9, X30
> > >   BL  ftrace_regs_caller
> > > 
> > > ... and therefore X9 contains some UNKNOWN value, rather than the
> > > original LR value.
> 
> I'm perfectly aware of that; an earlier version had barriers, attempting
> to avoid just that, which Mark(?) wrote weren't neccessary.
> 
> But is this a realistic scenario? All function entries are aligned 8 bytes.
> Are there arm64 implementations out there that fetch only 4 bytes and
> give a chance to mess with the 2nd 4 bytes? You at arm.com should know, and
> I won't be surprised if the answer is a weird "yes". Or maybe it's just
> another erratum lurking somewhere...
> 
> My point is: those 2 insn will _never_ be split by any alignment
> boundary > 8; does that mean anything, have you considered this?

Forget that. Steve mentioned the keyword *interrupt*, which creates a
completely different situation. In short, only the instruction pointer
will be saved; and i-cache and pipeline will be freshly reloaded on return,
so this threat is highly unlikely (interrupt taken exactly after 1st nop),
but not impossible. "Puking horses..." as we say in German.

> > > I wonder if we could solve that by patching the kernel at build-time, to
> > > add the MOV X9, X30 in place of the first NOP. If we were to do that, we
> > > could also update the addresses to pooint at the second NOP, simplifying
> > > the changes to the runtime code.
> > 
> > You can also patch it at boot up when there's only one CPU running, and
> > interrupts are disabled.
> 
> May I remind about possible performance hits? Even the NOPs had a tiny impact
> on certain in-order implementations. I'd rather switch between the mov and
> a "b +2".

This one however still holds.

Torsten



Re: [PATCH v6] arm64: implement ftrace with regs

2019-01-04 Thread Torsten Duwe
On Fri, Jan 04, 2019 at 01:06:48PM -0500, Steven Rostedt wrote:
> On Fri, 4 Jan 2019 17:50:18 +
> Mark Rutland  wrote:
> 
> > At Linux Plumbers, I had a conversation with Steve Rostedt, and we came
> > to the conclusion that (withut heavyweight synchronization) patching two
> > NOPs at runtime isn't safe, since a CPU might have executed the first
> > NOP as a NOP before another CPU patches both instructions. So a CPU
> > might execute:
> > 
> > NOP
> > BL  ftrace_regs_caller
> > 
> > ... rather than the expected:
> > 
> > MOV X9, X30
> > BL  ftrace_regs_caller
> > 
> > ... and therefore X9 contains some UNKNOWN value, rather than the
> > original LR value.

I'm perfectly aware of that; an earlier version had barriers, attempting
to avoid just that, which Mark(?) wrote weren't neccessary.

But is this a realistic scenario? All function entries are aligned 8 bytes.
Are there arm64 implementations out there that fetch only 4 bytes and
give a chance to mess with the 2nd 4 bytes? You at arm.com should know, and
I won't be surprised if the answer is a weird "yes". Or maybe it's just
another erratum lurking somewhere...

My point is: those 2 insn will _never_ be split by any alignment
boundary > 8; does that mean anything, have you considered this?

> > I wonder if we could solve that by patching the kernel at build-time, to
> > add the MOV X9, X30 in place of the first NOP. If we were to do that, we
> > could also update the addresses to pooint at the second NOP, simplifying
> > the changes to the runtime code.
> 
> You can also patch it at boot up when there's only one CPU running, and
> interrupts are disabled.

May I remind about possible performance hits? Even the NOPs had a tiny impact
on certain in-order implementations. I'd rather switch between the mov and
a "b +2".

Torsten



Re: [PATCH v6] arm64: implement ftrace with regs

2019-01-04 Thread Steven Rostedt
On Fri, 4 Jan 2019 17:50:18 +
Mark Rutland  wrote:

> At Linux Plumbers, I had a conversation with Steve Rostedt, and we came
> to the conclusion that (withut heavyweight synchronization) patching two
> NOPs at runtime isn't safe, since a CPU might have executed the first
> NOP as a NOP before another CPU patches both instructions. So a CPU
> might execute:
> 
>   NOP
>   BL  ftrace_regs_caller
> 
> ... rather than the expected:
> 
>   MOV X9, X30
>   BL  ftrace_regs_caller
> 
> ... and therefore X9 contains some UNKNOWN value, rather than the
> original LR value.
> 
> I wonder if we could solve that by patching the kernel at build-time, to
> add the MOV X9, X30 in place of the first NOP. If we were to do that, we
> could also update the addresses to pooint at the second NOP, simplifying
> the changes to the runtime code.

You can also patch it at boot up when there's only one CPU running, and
interrupts are disabled.

-- Steve


Re: [PATCH v6] arm64: implement ftrace with regs

2019-01-04 Thread Mark Rutland
Hi Torsten,

On Fri, Jan 04, 2019 at 03:10:53PM +0100, Torsten Duwe wrote:
> Use -fpatchable-function-entry (gcc8) to add 2 NOPs at the beginning
> of each function. Replace the first NOP thus generated with a quick LR
> saver (move it to scratch reg x9), so the 2nd replacement insn, the call
> to ftrace, does not clobber the value. Ftrace will then generate the
> standard stack frames.
> 
> Note that patchable-function-entry in GCC disables IPA-RA, which means
> ABI register calling conventions are obeyed *and* scratch registers
> such as x9 are available.
> 
> Introduce and handle an ftrace_regs_trampoline for module PLTs, right
> after ftrace_trampoline, and double the size of this special section.
> 
> Signed-off-by: Torsten Duwe 
> 
> ---
> 
> This patch applies on 4.20 with the additional changes
> bdb85cd1d20669dfae813555dddb745ad09323ba
> (arm64/module: switch to ADRP/ADD sequences for PLT entries)
> and
> 7dc48bf96aa0fc8aa5b38cc3e5c36ac03171e680
> (arm64: ftrace: always pass instrumented pc in x0)
> along with their respective series, or alternatively on Linus' master,
> which already has these.
> 
> changes since v5:
> 
> * fix mentioned pc in x0 to hold the start address of the call site,
>   not the return address or the branch address.
>   This resolves the problem found by Amit.
> 
> ---
>  arch/arm64/Kconfig|2 
>  arch/arm64/Makefile   |4 +
>  arch/arm64/include/asm/assembler.h|1 
>  arch/arm64/include/asm/ftrace.h   |   13 +++
>  arch/arm64/include/asm/module.h   |3 
>  arch/arm64/kernel/Makefile|6 -
>  arch/arm64/kernel/entry-ftrace.S  |  131 
> ++
>  arch/arm64/kernel/ftrace.c|  125 
>  arch/arm64/kernel/module-plts.c   |3 
>  arch/arm64/kernel/module.c|2 
>  drivers/firmware/efi/libstub/Makefile |3 
>  include/asm-generic/vmlinux.lds.h |1 
>  include/linux/compiler_types.h|4 +
>  13 files changed, 262 insertions(+), 36 deletions(-)
> --- a/arch/arm64/Kconfig
> +++ b/arch/arm64/Kconfig
> @@ -131,6 +131,8 @@ config ARM64
>   select HAVE_DEBUG_KMEMLEAK
>   select HAVE_DMA_CONTIGUOUS
>   select HAVE_DYNAMIC_FTRACE
> + select HAVE_DYNAMIC_FTRACE_WITH_REGS \
> + if $(cc-option,-fpatchable-function-entry=2)
>   select HAVE_EFFICIENT_UNALIGNED_ACCESS
>   select HAVE_FTRACE_MCOUNT_RECORD
>   select HAVE_FUNCTION_TRACER
> --- a/arch/arm64/Makefile
> +++ b/arch/arm64/Makefile
> @@ -79,6 +79,10 @@ ifeq ($(CONFIG_ARM64_MODULE_PLTS),y)
>  KBUILD_LDFLAGS_MODULE+= -T $(srctree)/arch/arm64/kernel/module.lds
>  endif
>  
> +ifeq ($(CONFIG_DYNAMIC_FTRACE_WITH_REGS),y)
> +  CC_FLAGS_FTRACE := -fpatchable-function-entry=2
> +endif
> +
>  # Default value
>  head-y   := arch/arm64/kernel/head.o
>  
> --- a/arch/arm64/include/asm/ftrace.h
> +++ b/arch/arm64/include/asm/ftrace.h
> @@ -17,6 +17,19 @@
>  #define MCOUNT_ADDR  ((unsigned long)_mcount)
>  #define MCOUNT_INSN_SIZE AARCH64_INSN_SIZE
>  
> +/*
> + * DYNAMIC_FTRACE_WITH_REGS is implemented by adding 2 NOPs at the beginning
> + * of each function, with the second NOP actually calling ftrace. In contrary
> + * to a classic _mcount call, the call instruction to be modified is thus
> + * the second one, and not the only one.
> + */
> +#ifdef CONFIG_DYNAMIC_FTRACE_WITH_REGS
> +#define ARCH_SUPPORTS_FTRACE_OPS 1
> +#define REC_IP_BRANCH_OFFSET AARCH64_INSN_SIZE
> +#else
> +#define REC_IP_BRANCH_OFFSET 0
> +#endif

At Linux Plumbers, I had a conversation with Steve Rostedt, and we came
to the conclusion that (withut heavyweight synchronization) patching two
NOPs at runtime isn't safe, since a CPU might have executed the first
NOP as a NOP before another CPU patches both instructions. So a CPU
might execute:

NOP
BL  ftrace_regs_caller

... rather than the expected:

MOV X9, X30
BL  ftrace_regs_caller

... and therefore X9 contains some UNKNOWN value, rather than the
original LR value.

I wonder if we could solve that by patching the kernel at build-time, to
add the MOV X9, X30 in place of the first NOP. If we were to do that, we
could also update the addresses to pooint at the second NOP, simplifying
the changes to the runtime code.

> +
>  #ifndef __ASSEMBLY__
>  #include 
>  
> --- a/arch/arm64/kernel/Makefile
> +++ b/arch/arm64/kernel/Makefile
> @@ -7,9 +7,9 @@ CPPFLAGS_vmlinux.lds  := -DTEXT_OFFSET=$(
>  AFLAGS_head.o:= -DTEXT_OFFSET=$(TEXT_OFFSET)
>  CFLAGS_armv8_deprecated.o := -I$(src)
>  
> -CFLAGS_REMOVE_ftrace.o = -pg
> -CFLAGS_REMOVE_insn.o = -pg
> -CFLAGS_REMOVE_return_address.o = -pg
> +CFLAGS_REMOVE_ftrace.o = -pg $(CC_FLAGS_FTRACE)
> +CFLAGS_REMOVE_insn.o = -pg $(CC_FLAGS_FTRACE)
> +CFLAGS_REMOVE_return_address.o = -pg $(CC_FLAGS_FTRACE)
>  
>  # Object file lists.
>  arm64-obj-y  := debug-monitors.o entr