Re: [PATCH 3/3] x86/ftrace: Use text_poke()

2019-10-23 Thread Alexei Starovoitov
On Wed, Oct 23, 2019 at 12:23:06PM -0400, Steven Rostedt wrote:
> On Tue, 22 Oct 2019 14:58:43 -0700
> Alexei Starovoitov  wrote:
> 
> > Neither of the two statements is true. The per-function generated trampoline
> > I'm talking about is bpf specific. For a function with two arguments it's 
> > just:
> > push rbp 
> > mov rbp, rsp
> > push rdi
> > push rsi
> > lea  rdi,[rbp-0x10]
> > call jited_bpf_prog
> > pop rsi
> > pop rdi
> > leave
> > ret
> > 
> > fentry's nop is replaced with call to the above.
> > That's it.
> > kprobe and live patching have no use for it.
> > 
> 
> Below is a patch that allows you to do this, and you don't need to
> worry about patching the nops. And it also allows you to hook directly
> to any function and still allow kprobes and tracing on those same
> functions (as long as they don't modify the ip, but in the future, we
> may be able to allow that too!). And this code does not depend on
> Peter's code either.
> 
> All you need to do is:
> 
>   register_ftrace_direct((unsigned long)func_you_want_to_trace,
>  (unsigned long)your_trampoline);
> 
> 
> I added to trace-event-samples.c:
> 
> void my_direct_func(raw_spinlock_t *lock)
> {
>   trace_printk("taking %p\n", lock);
> }
> 
> extern void my_tramp(void *);
> 
> asm (
> "   .pushsection.text, \"ax\", @progbits\n"
> "   my_tramp:"
> #if 1
> "   pushq %rbp\n"
> " movq %rsp, %rbp\n"
> " pushq %rdi\n"
> " call my_direct_func\n"
> " popq %rdi\n"
> " leave\n"
> #endif
> " ret\n"
> "   .popsection\n"
> );
> 
> 
> (the #if was for testing purposes)
> 
> And then in the module load and unload:
> 
>   ret = register_ftrace_direct((unsigned long)do_raw_spin_lock,
>(unsigned long)my_tramp);
> 
>   unregister_ftrace_direct((unsigned long)do_raw_spin_lock,
>(unsigned long)my_tramp);
> 
> respectively.
> 
> And what this does is if there's only a single callback to the
> registered function, it changes the nop in the function to call your
> trampoline directly (just like you want this to do!). But if we add
> another callback, a direct_ops ftrace_ops gets added to the list of the
> functions to go through, and this will set up the code to call your
> trampoline after it calls all the other callbacks.
> 
> The called trampoline will be called as if it was called directly from
> the nop.
> 
> OK, I wrote this up quickly, and it has some bugs, but nothing that
> can't be straightened out (specifically, the checks fail if you add a
> function trace to one of the direct callbacks, but this can be dealt
> with).
> 
> Note, the work needed to port this to other archs is rather minimal
> (just need to tweak the ftrace_regs_caller and have a way to pass back
> the call address via pt_regs that is not saved).
> 
> Alexei,
> 
> Would this work for you?

Yes!
Looks great. More comments below.

> 
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index 6adaf18b3365..de3372bd08ae 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -159,6 +159,7 @@ config X86
>   select HAVE_DYNAMIC_FTRACE
>   select HAVE_DYNAMIC_FTRACE_WITH_REGS
>   select HAVE_DYNAMIC_FTRACE_WITH_IPMODIFY
> + select HAVE_DYNAMIC_FTRACE_WITH_DIRECT_CALLS
>   select HAVE_EBPF_JIT
>   select HAVE_EFFICIENT_UNALIGNED_ACCESS
>   select HAVE_EISA
> diff --git a/arch/x86/include/asm/ftrace.h b/arch/x86/include/asm/ftrace.h
> index c38a1576..34da1e424391 100644
> --- a/arch/x86/include/asm/ftrace.h
> +++ b/arch/x86/include/asm/ftrace.h
> @@ -28,6 +28,12 @@ static inline unsigned long ftrace_call_adjust(unsigned long addr)
>   return addr;
>  }
>  
> +static inline void ftrace_set_call_func(struct pt_regs *regs, unsigned long addr)
> +{
> + /* Emulate a call */
> + regs->orig_ax = addr;

This probably needs a longer comment :)
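For illustration, here is a sketch of what a longer comment could say, inferred
from the ftrace_64.S hunks quoted below; the wording is illustrative, not part
of the posted patch:

#include <asm/ptrace.h>		/* struct pt_regs, orig_ax */

/*
 * Sketch of a fuller comment:
 *
 * ftrace_regs_caller clears ORIG_RAX when it saves the mcount register
 * frame.  A callback that wants the traced function to "call" a custom
 * trampoline stores the trampoline address here; the exit path
 * (restore_mcount_regs make_call=1) then places that address so the final
 * ret lands in the trampoline as if the traced function had called it
 * directly from its fentry site.
 */
static inline void ftrace_set_call_func(struct pt_regs *regs, unsigned long addr)
{
	regs->orig_ax = addr;	/* emulate a call to addr */
}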

> + .if \make_call
> + movq ORIG_RAX(%rsp), %rax
> + movq %rax, MCOUNT_REG_SIZE-8(%rsp)

reading asm helps.

> +config HAVE_DYNAMIC_FTRACE_WITH_DIRECT_CALLS
> + bool
> +
>  config HAVE_FTRACE_MCOUNT_RECORD
>   bool
>   help
> @@ -565,6 +568,11 @@ config DYNAMIC_FTRACE_WITH_IPMODIFY_ONLY
>   depends on DYNAMIC_FTRACE
>   depends on HAVE_DYNAMIC_FTRACE_WITH_IPMODIFY
>  
> +config DYNAMIC_FTRACE_WITH_DIRECT_CALLS
> + def_bool y
> + depends on DYNAMIC_FTRACE
> + depends on HAVE_DYNAMIC_FTRACE_WITH_DIRECT_CALLS

It seems to me that it's a bit of overkill to add new config knob
for every ftrace feature.
HAVE_DYNAMIC_FTRACE_WITH_DIRECT_CALLS (that arch defined) would
be enough to check and return error in register_ftrace_direct()
right?
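A minimal sketch of that alternative, reusing the function names from the patch
above; the error code and the exact header placement are illustrative only:

#include <linux/errno.h>

#ifdef CONFIG_HAVE_DYNAMIC_FTRACE_WITH_DIRECT_CALLS
int register_ftrace_direct(unsigned long ip, unsigned long addr);
int unregister_ftrace_direct(unsigned long ip, unsigned long addr);
#else
/* Arch did not select HAVE_DYNAMIC_FTRACE_WITH_DIRECT_CALLS: just fail. */
static inline int register_ftrace_direct(unsigned long ip, unsigned long addr)
{
	return -ENOTSUPP;
}
static inline int unregister_ftrace_direct(unsigned long ip, unsigned long addr)
{
	return -ENOTSUPP;
}
#endif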

> -static struct ftrace_hash *
> -__ftrace_hash_move(struct ftrace_hash *src)
> +static void transfer_hash(struct ftrace_hash *dst, struct ftrace_hash *src)
>  {
>   struct ftrace_func_entry *entry;
> - struct hlist_node *tn;
>   struct hlist_head *hhd;
> + struct hlist_node *tn;

Re: [PATCH 3/3] x86/ftrace: Use text_poke()

2019-10-23 Thread Steven Rostedt
On Wed, 23 Oct 2019 12:23:06 -0400
Steven Rostedt  wrote:

> All you need to do is:
> 
>   register_ftrace_direct((unsigned long)func_you_want_to_trace,
>  (unsigned long)your_trampoline);
> 


> 
> Alexei,
> 
> Would this work for you?

I just pushed a test branch up to:

 git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace.git

 branch: ftrace/direct

Again, it's mostly development code, and may be buggy. (don't try
tracing when a direct function is added yet).

If this is something that you can use, then I'll work to clean it up
and sort out all the bugs.

Thanks!

-- Steve


Re: [PATCH 3/3] x86/ftrace: Use text_poke()

2019-10-23 Thread Steven Rostedt
On Tue, 22 Oct 2019 14:58:43 -0700
Alexei Starovoitov  wrote:

> Neither of the two statements is true. The per-function generated trampoline
> I'm talking about is bpf specific. For a function with two arguments it's 
> just:
> push rbp 
> mov rbp, rsp
> push rdi
> push rsi
> lea  rdi,[rbp-0x10]
> call jited_bpf_prog
> pop rsi
> pop rdi
> leave
> ret
> 
> fentry's nop is replaced with call to the above.
> That's it.
> kprobe and live patching have no use for it.
> 

Below is a patch that allows you to do this, and you don't need to
worry about patching the nops. And it also allows you to hook directly
to any function and still allow kprobes and tracing on those same
functions (as long as they don't modify the ip, but in the future, we
may be able to allow that too!). And this code does not depend on
Peter's code either.

All you need to do is:

register_ftrace_direct((unsigned long)func_you_want_to_trace,
   (unsigned long)your_trampoline);


I added to trace-event-samples.c:

void my_direct_func(raw_spinlock_t *lock)
{
trace_printk("taking %p\n", lock);
}

extern void my_tramp(void *);

asm (
"   .pushsection.text, \"ax\", @progbits\n"
"   my_tramp:"
#if 1
"   pushq %rbp\n"
"   movq %rsp, %rbp\n"
"   pushq %rdi\n"
"   call my_direct_func\n"
"   popq %rdi\n"
"   leave\n"
#endif
"   ret\n"
"   .popsection\n"
);


(the #if was for testing purposes)

And then in the module load and unload:

ret = register_ftrace_direct((unsigned long)do_raw_spin_lock,
 (unsigned long)my_tramp);

unregister_ftrace_direct((unsigned long)do_raw_spin_lock,
 (unsigned long)my_tramp);

respectively.

And what this does is if there's only a single callback to the
registered function, it changes the nop in the function to call your
trampoline directly (just like you want this to do!). But if we add
another callback, a direct_ops ftrace_ops gets added to the list of the
functions to go through, and this will set up the code to call your
trampoline after it calls all the other callbacks.

The called trampoline will be called as if it was called directly from
the nop.
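Roughly, the two cases described above look like this (symbol names follow the
sample code earlier in this mail and are only illustrative):

	only the direct trampoline attached:

		do_raw_spin_lock
		   call my_tramp            (fentry nop patched to call the
		                             trampoline directly)

	another ftrace callback attached as well:

		do_raw_spin_lock
		   call ftrace_regs_caller  (runs the other callbacks, then
		                             exits through my_tramp via the
		                             address stashed in orig_ax)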

OK, I wrote this up quickly, and it has some bugs, but nothing that
can't be straightened out (specifically, the checks fail if you add a
function trace to one of the direct callbacks, but this can be dealt
with).

Note, the work needed to port this to other archs is rather minimal
(just need to tweak the ftrace_regs_caller and have a way to pass back
the call address via pt_regs that is not saved).

Alexei,

Would this work for you?

-- Steve

[ This is on top of the last patch I sent ]

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 6adaf18b3365..de3372bd08ae 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -159,6 +159,7 @@ config X86
select HAVE_DYNAMIC_FTRACE
select HAVE_DYNAMIC_FTRACE_WITH_REGS
select HAVE_DYNAMIC_FTRACE_WITH_IPMODIFY
+   select HAVE_DYNAMIC_FTRACE_WITH_DIRECT_CALLS
select HAVE_EBPF_JIT
select HAVE_EFFICIENT_UNALIGNED_ACCESS
select HAVE_EISA
diff --git a/arch/x86/include/asm/ftrace.h b/arch/x86/include/asm/ftrace.h
index c38a1576..34da1e424391 100644
--- a/arch/x86/include/asm/ftrace.h
+++ b/arch/x86/include/asm/ftrace.h
@@ -28,6 +28,12 @@ static inline unsigned long ftrace_call_adjust(unsigned long addr)
return addr;
 }
 
+static inline void ftrace_set_call_func(struct pt_regs *regs, unsigned long addr)
+{
+   /* Emulate a call */
+   regs->orig_ax = addr;
+}
+
 #ifdef CONFIG_DYNAMIC_FTRACE
 
 struct dyn_arch_ftrace {
diff --git a/arch/x86/kernel/ftrace_64.S b/arch/x86/kernel/ftrace_64.S
index 5930810bf71d..3fcaf03566b0 100644
--- a/arch/x86/kernel/ftrace_64.S
+++ b/arch/x86/kernel/ftrace_64.S
@@ -88,6 +88,7 @@ EXPORT_SYMBOL(__fentry__)
movq %rdi, RDI(%rsp)
movq %r8, R8(%rsp)
movq %r9, R9(%rsp)
+   movq $0, ORIG_RAX(%rsp)
/*
 * Save the original RBP. Even though the mcount ABI does not
 * require this, it helps out callers.
@@ -114,7 +115,23 @@ EXPORT_SYMBOL(__fentry__)
subq $MCOUNT_INSN_SIZE, %rdi
.endm
 
-.macro restore_mcount_regs
+.macro restore_mcount_regs make_call=0 save_flags=0
+
+   /* ftrace_regs_caller can modify %rbp */
+   movq RBP(%rsp), %rbp
+
+   .if \make_call
+   movq ORIG_RAX(%rsp), %rax
+   movq %rax, MCOUNT_REG_SIZE-8(%rsp)
+
+   /* Swap the flags with orig_rax */
+   .if \save_flags
+   movq MCOUNT_REG_SIZE(%rsp), %rdi
+   movq %rdi, MCOUNT_REG_SIZE-8(%rsp)
+   movq %rax, MCOUNT_REG_SIZE(%rsp)
+   .endif
+   .endif
+
movq R9(%rsp), %r9
movq R8(%rsp), %r8
movq RDI(%rsp), %rdi
@@ -123,10 +140,11 @@ EXPORT_SYMBOL(__fentry__)
movq RCX(%rsp), %rcx
movq RAX(%rsp), %rax
 
-   /* ftrace_regs_caller can modify %rbp */
-   

Re: [PATCH 3/3] x86/ftrace: Use text_poke()

2019-10-23 Thread Peter Zijlstra
On Tue, Oct 22, 2019 at 09:20:27PM -0700, Andy Lutomirski wrote:
> Also, Alexei, are you testing on a CONFIG_FRAME_POINTER=y kernel?  The
> ftrace code has a somewhat nasty special case to make
> CONFIG_FRAME_POINTER=y work right, and your example trampoline does
> not but arguably should have exactly the same fixup.  For good
> performance, you should be using CONFIG_FRAME_POINTER=n.

Trampolines and JITs (which both lack ORC data) should always have frame
pointers. That way, when ORC fails, it can fall back to FP and we can
still get sensible unwinds.

See:

  ae6a45a08689 ("x86/unwind/orc: Fall back to using frame pointers for generated code")


Re: [PATCH 3/3] x86/ftrace: Use text_poke()

2019-10-22 Thread Andy Lutomirski
On Tue, Oct 22, 2019 at 4:49 PM Alexei Starovoitov
 wrote:
>
> On Tue, Oct 22, 2019 at 03:45:26PM -0700, Andy Lutomirski wrote:
> >
> >
> > >> On Oct 22, 2019, at 2:58 PM, Alexei Starovoitov 
> > >>  wrote:
> > >>
> > >> On Tue, Oct 22, 2019 at 05:04:30PM -0400, Steven Rostedt wrote:
> > >> I gave a solution for this. And that is to add another flag to allow
> > >> for just the minimum to change the ip. And we can even add another flag
> > >> to allow for changing the stack if needed (to emulate a call with the
> > >> same parameters).
> > >
> > > your solution is to reduce the overhead.
> > > my solution is to remove it completely. See the difference?
> > >
> > >> By doing this work, live kernel patching will also benefit. Because it
> > >> is also dealing with the unnecessary overhead of saving regs.
> > >> And we could possibly even have kprobes benefit from this if a kprobe
> > >> doesn't need full regs.
> > >
> > > Neither of the two statements is true. The per-function generated trampoline
> > > I'm talking about is bpf specific. For a function with two arguments it's 
> > > just:
> > > push rbp
> > > mov rbp, rsp
> > > push rdi
> > > push rsi
> > > lea  rdi,[rbp-0x10]
> > > call jited_bpf_prog
> > > pop rsi
> > > pop rdi
> > > leave
> > > ret
> >
> > Why are you saving rsi?  You said upthread that you’re saving the args, but 
> > rsi is already available in rsi.
>
> because rsi is caller saved. The above example is for probing something
> like tcp_set_state(struct sock *sk, int state) that everyone used to
> kprobe until we got a tracepoint there.
> The main bpf prog has only one argument R1 == rdi on x86,
> but it's allowed to clobber all caller saved regs.
> Just like x86 function that accepts one argument in rdi can clobber rsi and 
> others.
> So it's essential to save 'sk' and 'state' for tcp_set_state()
> to continue as nothing happened.

Oh, right, you're hijacking the very first instruction, so you know
that the rest of the arg regs as well as rax are unused.

But I find it hard to believe that this is a particularly meaningful
optimization compared to the version that saves all the C-clobbered
registers.

Also, Alexei, are you testing on a CONFIG_FRAME_POINTER=y kernel?  The
ftrace code has a somewhat nasty special case to make
CONFIG_FRAME_POINTER=y work right, and your example trampoline does
not but arguably should have exactly the same fixup.  For good
performance, you should be using CONFIG_FRAME_POINTER=n.

Steven, with your benchmark, could you easily make your actual ftrace
hook do nothing at all and get a perf report on the result (i.e. call
the traced function in a loop a bunch of times under perf record -e
cycles or similar)?  It would be interesting to see exactly what
trampoline code you're generating and just how bad it is.  ISTM it
should be possible to squeeze very good performance out of ftrace.  I
suppose you could also have a fancier mode than just "IP" that
specifies that the caller knows exactly which registers are live and
what they are.  Then you could generate code that's exactly as good as
Alexei's.


Re: [PATCH 3/3] x86/ftrace: Use text_poke()

2019-10-22 Thread Steven Rostedt
On Tue, 22 Oct 2019 18:17:40 -0400
Steven Rostedt  wrote:

> > your solution is to reduce the overhead.
> > my solution is to remove it completely. See the difference?
> 
> You're just trimming it down. I'm curious as to what overhead you save by
> not saving all parameter registers and doing it on a case-by-case basis?

I played with adding a ftrace_ip_caller, that only modifies the ip, and
passes a regs with bare minimum saved (allowing the callback to modify
the regs->ip).

I modified the test-events-sample to just hijack the do_raw_spin_lock()
function, which is a very hot path, as it is called by most of the
spin_lock code.

I then ran:

 # perf stat -r 20 /work/c/hackbench 50

vanilla, as nothing is happening.

I then enabled the test-events-sample with the new ftrace ip
modification only.

I then had it do the hijack of do_raw_spin_lock() with the current
ftrace_regs_caller that kprobes and live kernel patching use.

The improvement was quite noticeable.

Baseline: (no tracing)

# perf stat -r 20 /work/c/hackbench 50
Time: 1.146
Time: 1.120
Time: 1.102
Time: 1.099
Time: 1.136
Time: 1.123
Time: 1.128
Time: 1.115
Time: 1.111
Time: 1.192
Time: 1.160
Time: 1.156
Time: 1.135
Time: 1.144
Time: 1.150
Time: 1.120
Time: 1.121
Time: 1.120
Time: 1.106
Time: 1.127

 Performance counter stats for '/work/c/hackbench 50' (20 runs):

          9,056.18 msec task-clock                #    6.770 CPUs utilized            ( +-  0.26% )
           148,461      context-switches          #    0.016 M/sec                    ( +-  2.58% )
            18,397      cpu-migrations            #    0.002 M/sec                    ( +-  3.24% )
            54,029      page-faults               #    0.006 M/sec                    ( +-  0.63% )
    33,227,808,738      cycles                    #    3.669 GHz                      ( +-  0.25% )
    23,314,273,335      stalled-cycles-frontend   #   70.16% frontend cycles idle     ( +-  0.25% )
    23,568,794,330      instructions              #    0.71  insn per cycle
                                                  #    0.99  stalled cycles per insn  ( +-  0.26% )
     3,756,521,438      branches                  #  414.802 M/sec                    ( +-  0.29% )
        36,949,690      branch-misses             #    0.98% of all branches          ( +-  0.46% )

            1.3377 +- 0.0213 seconds time elapsed  ( +-  1.59% )

Using IPMODIFY without regs:

# perf stat -r 20 /work/c/hackbench 50
Time: 1.193
Time: 1.115
Time: 1.145
Time: 1.129
Time: 1.121
Time: 1.132
Time: 1.136
Time: 1.156
Time: 1.133
Time: 1.131
Time: 1.138
Time: 1.150
Time: 1.147
Time: 1.137
Time: 1.178
Time: 1.121
Time: 1.142
Time: 1.158
Time: 1.143
Time: 1.168

 Performance counter stats for '/work/c/hackbench 50' (20 runs):

          9,231.39 msec task-clock                #    6.917 CPUs utilized            ( +-  0.39% )
           136,822      context-switches          #    0.015 M/sec                    ( +-  3.06% )
            17,308      cpu-migrations            #    0.002 M/sec                    ( +-  2.61% )
            54,521      page-faults               #    0.006 M/sec                    ( +-  0.57% )
    33,875,937,399      cycles                    #    3.670 GHz                      ( +-  0.39% )
    23,575,136,580      stalled-cycles-frontend   #   69.59% frontend cycles idle     ( +-  0.41% )
    24,246,404,007      instructions              #    0.72  insn per cycle
                                                  #    0.97  stalled cycles per insn  ( +-  0.33% )
     3,878,453,510      branches                  #  420.138 M/sec                    ( +-  0.36% )
        47,653,911      branch-misses             #    1.23% of all branches          ( +-  0.43% )

           1.33462 +- 0.00440 seconds time elapsed  ( +-  0.33% )


The current ftrace_regs_caller: (old way of doing it)

# perf stat -r 20 /work/c/hackbench 50
Time: 1.114
Time: 1.153
Time: 1.236
Time: 1.208
Time: 1.179
Time: 1.183
Time: 1.217
Time: 1.190
Time: 1.157
Time: 1.172
Time: 1.215
Time: 1.165
Time: 1.171
Time: 1.188
Time: 1.176
Time: 1.217
Time: 1.341
Time: 1.238
Time: 1.363
Time: 1.287

 Performance counter stats for '/work/c/hackbench 50' (20 runs):

          9,522.76 msec task-clock                #    6.653 CPUs utilized            ( +-  0.36% )
           131,347      context-switches          #    0.014 M/sec                    ( +-  3.37% )
            17,090      cpu-migrations            #    0.002 M/sec                    ( +-  3.56% )
            54,126      page-faults               #    0.006 M/sec                    ( +-  0.71% )
    34,942,273,861      cycles                    #    3.669 GHz                      ( +-  0.35% )
    24,517,757,272      stalled-cycles-frontend   #   70.17% frontend cycles idle     ( +-  0.35% )
    24,684,124,265      instructions              #    0.71  insn per cycle
                                                  #    0.99  stalled cycles per insn

Re: [PATCH 3/3] x86/ftrace: Use text_poke()

2019-10-22 Thread Alexei Starovoitov
On Tue, Oct 22, 2019 at 03:45:26PM -0700, Andy Lutomirski wrote:
> 
> 
> >> On Oct 22, 2019, at 2:58 PM, Alexei Starovoitov 
> >>  wrote:
> >> 
> >> On Tue, Oct 22, 2019 at 05:04:30PM -0400, Steven Rostedt wrote:
> >> I gave a solution for this. And that is to add another flag to allow
> >> for just the minimum to change the ip. And we can even add another flag
> >> to allow for changing the stack if needed (to emulate a call with the
> >> same parameters).
> > 
> > your solution is to reduce the overhead.
> > my solution is to remove it completely. See the difference?
> > 
> >> By doing this work, live kernel patching will also benefit. Because it
> >> is also dealing with the unnecessary overhead of saving regs.
> >> And we could possibly even have kprobes benefit from this if a kprobe
> >> doesn't need full regs.
> > 
> > Neither of the two statements is true. The per-function generated trampoline
> > I'm talking about is bpf specific. For a function with two arguments it's 
> > just:
> > push rbp 
> > mov rbp, rsp
> > push rdi
> > push rsi
> > lea  rdi,[rbp-0x10]
> > call jited_bpf_prog
> > pop rsi
> > pop rdi
> > leave
> > ret
> 
> Why are you saving rsi?  You said upthread that you’re saving the args, but 
> rsi is already available in rsi.

because rsi is caller saved. The above example is for probing something
like tcp_set_state(struct sock *sk, int state) that everyone used to
kprobe until we got a tracepoint there.
The main bpf prog has only one argument R1 == rdi on x86,
but it's allowed to clobber all caller saved regs.
Just like x86 function that accepts one argument in rdi can clobber rsi and 
others.
So it's essential to save 'sk' and 'state' for tcp_set_state()
to continue as nothing happened.

>  But I’m wondering whether the bpf jitted code could just directly access the 
> frame instead of indirecting through a register.

That's an excellent question!
We've debated a ton whether to extend main prog from R1 to all R1-R5
like bpf subprograms allow. The problem is a lot of existing infra
assume single R1==ctx. Passing 6th argument is not defined either.
But it's nice to see all arguments of the kernel function.
Also bpf is 64-bit ISA. Even when it's running on 32-bit arch.
Just taking values from registers doesn't work there.
Whereas when args are indirectly passed as a bunch of u64s in the stack
the bpf prog becomes portable across architectures
(not 100% of course, but close).

> Is it entirely specific to the probed function? 

yes. It is specific to the probed function. The verifier makes sure
that only first two arguments are accessed as read-only
via *(u64*)(r1 + 0) and *(u64*)(r1 + 8)
But the program is allowed to use r2-r5 without saving them.
r6-r10 are saved in implicit program prologue.
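To make the portability point concrete, here is a hedged sketch of the C side
of such a program; the context struct and the section name are assumptions made
up for illustration, not an existing UAPI:

#include <linux/types.h>

/* Assumed convention: the trampoline spills the probed function's arguments
 * as an array of u64 on its stack and passes a pointer to it in R1. */
struct tramp_ctx {
	__u64 arg[2];	/* for tcp_set_state(): arg[0] = sk, arg[1] = state */
};

__attribute__((section("fentry/tcp_set_state"), used))	/* hypothetical section */
static int trace_tcp_set_state(struct tramp_ctx *ctx)
{
	__u64 sk    = ctx->arg[0];	/* verified load of *(u64 *)(r1 + 0) */
	__u64 state = ctx->arg[1];	/* verified load of *(u64 *)(r1 + 8) */

	/* the program never needs to know which host registers held sk/state */
	(void)sk;
	(void)state;
	return 0;
}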

> In any event, I think you can’t *just* use text_poke.  Something needs to 
> coordinate to ensure that, if you bpf trace an already-kprobed function, the 
> right thing happens.

Right. Not _just_. I'm looking at Peter's patches and as a minimum I need to 
grab
text_mutex and make sure it's a nop being replaced.
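A rough sketch of that minimum, with caveats: the expected nop pattern and the
patching primitive are simplified (patching live text safely is exactly what
Peter's text_poke() series is about), and attach_bpf_trampoline() is a made-up
name:

#include <linux/types.h>
#include <linux/errno.h>
#include <linux/string.h>
#include <linux/mutex.h>
#include <linux/memory.h>		/* text_mutex */
#include <asm/nops.h>			/* ideal_nops[NOP_ATOMIC5] */
#include <asm/text-patching.h>		/* text_poke() */

static int attach_bpf_trampoline(void *fentry_ip, void *tramp)
{
	u8 call_insn[5] = { 0xe8 };	/* call rel32 */
	s32 rel = (s32)((long)tramp - ((long)fentry_ip + 5));
	int ret = 0;

	memcpy(&call_insn[1], &rel, sizeof(rel));

	mutex_lock(&text_mutex);
	/* only patch if the fentry site still holds the original 5-byte nop */
	if (memcmp(fentry_ip, ideal_nops[NOP_ATOMIC5], 5))
		ret = -EBUSY;
	else
		text_poke(fentry_ip, call_insn, 5);
	mutex_unlock(&text_mutex);
	return ret;
}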

> FWIW, if you are going to use a trampoline like this, consider using r11 for 
> the caller frame instead of rsi.

r11 can be used by jited code that doing divide and multiply.
It's possible to refactor that code and free it up.
I prefer to save such micro-optimization for later.
Dealing with interpreter is also a pain to the point I'm considering
to avoid supporting it for these new things.
Folks should be using CONFIG_BPF_JIT_ALWAYS_ON anyway.



Re: [PATCH 3/3] x86/ftrace: Use text_poke()

2019-10-22 Thread Steven Rostedt
On Tue, 22 Oct 2019 15:45:26 -0700
Andy Lutomirski  wrote:

> >> On Oct 22, 2019, at 2:58 PM, Alexei Starovoitov 
> >>  wrote:
> >> 
> >> On Tue, Oct 22, 2019 at 05:04:30PM -0400, Steven Rostedt wrote:
> >> I gave a solution for this. And that is to add another flag to allow
> >> for just the minimum to change the ip. And we can even add another flag
> >> to allow for changing the stack if needed (to emulate a call with the
> >> same parameters).  
> > 
> > your solution is to reduce the overhead.
> > my solution is to remove it completely. See the difference?
> >   
> >> By doing this work, live kernel patching will also benefit. Because it
> >> is also dealing with the unnecessary overhead of saving regs.
> >> And we could possibly even have kprobes benefit from this if a kprobe
> >> doesn't need full regs.  
> > 
> > Neither of the two statements is true. The per-function generated trampoline
> > I'm talking about is bpf specific. For a function with two arguments it's 
> > just:
> > push rbp 
> > mov rbp, rsp
> > push rdi
> > push rsi
> > lea  rdi,[rbp-0x10]
> > call jited_bpf_prog
> > pop rsi
> > pop rdi
> > leave
> > ret  
> 
> Why are you saving rsi?  You said upthread that you’re saving the
> args, but rsi is already available in rsi.

The above is for two parameters, and is being called before the
function with those two parameters. The jited_bpf_prog will be called
with those two parameters as well, but it may also clobber them. Then
we need to restore those two parameters before calling the original
function.

> 
> Just how custom is this bpf program?  It seems to clobber no regs
> (except args), and it doesn’t return anything. Is it entirely
> specific to the probed function?  If so, why not just call it
> directly?

It's injecting the jited_bpf_prog to be called when the probed function
is called, with the same parameters as the probed function.

my_probed_function
  call trampoline


trampoline
  save parameters
  call jited_bpf_prog (with same parameters)
  restore parameters
  ret

Jumps back to the my_probed_function, where my_probed_function is
clueless it was just interrupted.

No need to save any clobbered registers but the parameters, as the
jited_bpf_prog needs to save any registers that it clobbers just like
the my_probed_function saves its.

-- Steve


Re: [PATCH 3/3] x86/ftrace: Use text_poke()

2019-10-22 Thread Andy Lutomirski



>> On Oct 22, 2019, at 2:58 PM, Alexei Starovoitov 
>>  wrote:
>> 
>> On Tue, Oct 22, 2019 at 05:04:30PM -0400, Steven Rostedt wrote:
>> I gave a solution for this. And that is to add another flag to allow
>> for just the minimum to change the ip. And we can even add another flag
>> to allow for changing the stack if needed (to emulate a call with the
>> same parameters).
> 
> your solution is to reduce the overhead.
> my solution is to remove it completely. See the difference?
> 
>> By doing this work, live kernel patching will also benefit. Because it
>> is also dealing with the unnecessary overhead of saving regs.
>> And we could possibly even have kprobes benefit from this if a kprobe
>> doesn't need full regs.
> 
> Neither of the two statements is true. The per-function generated trampoline
> I'm talking about is bpf specific. For a function with two arguments it's 
> just:
> push rbp 
> mov rbp, rsp
> push rdi
> push rsi
> lea  rdi,[rbp-0x10]
> call jited_bpf_prog
> pop rsi
> pop rdi
> leave
> ret

Why are you saving rsi?  You said upthread that you’re saving the args, but rsi 
is already available in rsi.

Just how custom is this bpf program?  It seems to clobber no regs (except 
args), and it doesn’t return anything. Is it entirely specific to the probed 
function?  If so, why not just call it directly?

In any event, I think you can’t *just* use text_poke.  Something needs to 
coordinate to ensure that, if you bpf trace an already-kprobed function, the 
right thing happens.

FWIW, if you are going to use a trampoline like this, consider using r11 for 
the caller frame instead of rsi.  You won’t need to restore it. But I’m 
wondering whether the bpf jitted code could just directly access the frame 
instead of indirecting through a register. (Or am I entirely misunderstanding 
what rdi is for in your example?)

Re: [PATCH 3/3] x86/ftrace: Use text_poke()

2019-10-22 Thread Steven Rostedt
On Tue, 22 Oct 2019 14:58:43 -0700
Alexei Starovoitov  wrote:

> On Tue, Oct 22, 2019 at 05:04:30PM -0400, Steven Rostedt wrote:
> > 
> > I gave a solution for this. And that is to add another flag to allow
> > for just the minimum to change the ip. And we can even add another flag
> > to allow for changing the stack if needed (to emulate a call with the
> > same parameters).  
> 
> your solution is to reduce the overhead.
> my solution is to remove it completely. See the difference?

You're just trimming it down. I'm curious as to what overhead you save by
not saving all parameter registers and doing it on a case-by-case basis?

> 
> > By doing this work, live kernel patching will also benefit. Because it
> > is also dealing with the unnecessary overhead of saving regs.
> > 
> > And we could possibly even have kprobes benefit from this if a kprobe
> > doesn't need full regs.  
> 
> Neither of the two statements is true. The per-function generated trampoline
> I'm talking about is bpf specific. For a function with two arguments it's 
> just:
> push rbp 
> mov rbp, rsp
> push rdi
> push rsi
> lea  rdi,[rbp-0x10]
> call jited_bpf_prog

What exactly does the jited_bpf_prog do? Does it modify the context,
or is it for monitoring only?

Do only GPL BPF programs get this access?

> pop rsi
> pop rdi
> leave
> ret
> 
> fentry's nop is replaced with call to the above.
> That's it.
> kprobe and live patching have no use for it.
> 
> > But you said that you can't have this and trace the functions at the
> > same time. Which also means you can't do live kernel patching on these
> > functions either.  
> 
> I don't think it's a real use case, but to avoid further arguing
> I'll add one nop to the front of generated bpf trampoline so that
> ftrace and livepatch can use it.

And how does this nop get accounted for? It needs to update the ftrace
dyn_ftrace array that stores all the function locations.

-- Steve


Re: [PATCH 3/3] x86/ftrace: Use text_poke()

2019-10-22 Thread Alexei Starovoitov
On Tue, Oct 22, 2019 at 05:04:30PM -0400, Steven Rostedt wrote:
> 
> I gave a solution for this. And that is to add another flag to allow
> for just the minimum to change the ip. And we can even add another flag
> to allow for changing the stack if needed (to emulate a call with the
> same parameters).

your solution is to reduce the overhead.
my solution is to remove it completely. See the difference?

> By doing this work, live kernel patching will also benefit. Because it
> is also dealing with the unnecessary overhead of saving regs.
> 
> And we could possibly even have kprobes benefit from this if a kprobe
> doesn't need full regs.

Neither of the two statements is true. The per-function generated trampoline
I'm talking about is bpf specific. For a function with two arguments it's just:
push rbp 
mov rbp, rsp
push rdi
push rsi
lea  rdi,[rbp-0x10]
call jited_bpf_prog
pop rsi
pop rdi
leave
ret

fentry's nop is replaced with call to the above.
That's it.
kprobe and live patching have no use for it.

> But you said that you can't have this and trace the functions at the
> same time. Which also means you can't do live kernel patching on these
> functions either.

I don't think it's a real use case, but to avoid further arguing
I'll add one nop to the front of generated bpf trampoline so that
ftrace and livepatch can use it.
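Purely illustrative, that would mean the generated trampoline starts with a
patchable nop ahead of its real prologue (the encoding shown is the usual
5-byte nopl; everything else about the trampoline is elided):

asm (
"	.pushsection	.text, \"ax\", @progbits\n"
"	.type		bpf_tramp_with_nop_sketch, @function\n"
"bpf_tramp_with_nop_sketch:\n"
"	.byte	0x0f, 0x1f, 0x44, 0x00, 0x00\n"	/* 5-byte nop for ftrace/livepatch */
"	pushq	%rbp\n"
"	movq	%rsp, %rbp\n"
	/* ... save args, call the attached bpf progs, restore args ... */
"	leave\n"
"	retq\n"
"	.size		bpf_tramp_with_nop_sketch, .-bpf_tramp_with_nop_sketch\n"
"	.popsection\n"
);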



Re: [PATCH 3/3] x86/ftrace: Use text_poke()

2019-10-22 Thread Steven Rostedt
On Tue, 22 Oct 2019 13:46:23 -0700
Alexei Starovoitov  wrote:

> On Tue, Oct 22, 2019 at 02:10:21PM -0400, Steven Rostedt wrote:
> > On Tue, 22 Oct 2019 10:50:56 -0700
> > Alexei Starovoitov  wrote:
> >   
> > > > +static void my_hijack_func(unsigned long ip, unsigned long pip,
> > > > +  struct ftrace_ops *ops, struct pt_regs 
> > > > *regs)
> > > 
> > > 1.
> > > To pass regs into the callback ftrace_regs_caller() has huge amount
> > > of stores to do. Saving selector registers is not fast. pushf/popf are 
> > > even slower.
> > > ~200 bytes of stack is being used for save/restore.
> > > This is pure overhead that bpf execution cannot afford.
> > > bpf is different from traditional ftrace and other tracing, since
> > > it's often active 24/7. Every nanosecond counts.  
> > 
> > Live patching is the same as what you have. If not even more critical.
> > 
> > Note, it would be easy to also implement a "just give me IP regs" flag,
> > or have that be the default if IPMODIFY is set and REGS is not.  
> 
> And that will reduce overhead from 20+ regs save/restore into 3-4 ?

Huge difference.

> Say, it's only two regs (rbp and rip). Why bpf side has to pay this runtime
> penalty? I see no good technical reason.

Because we have existing infrastructure, and we don't need to rewrite
the world from scratch. Let's try this out first, and then you show me
how much of an overhead this is to prove that bpf deserves to create a
new infrastructure.

> 
> >   
> > > So for bpf I'm generating assembler trampoline tailored to specific kernel
> > > function that has its fentry nop replaced with a call to that trampoline.
> > > Instead of 20+ register save I save only arguments of that kernel 
> > > function.
> > > For example the trampoline that attaches to kfree_skb() will save only 
> > > two registers
> > > (rbp and rdi on x86) and will use 16 bytes of stack.
> > > 
> > > 2.
> > > The common ftrace callback api allows ftrace infra to use generic 
> > > ftrace_ops_list_func()
> > > that works for all ftracers, but it doesn't scale.  
> > 
> > That only happens if you have more than one callback to a same
> > function. Otherwise you get a dedicated trampoline.  
> 
> That's exactly what I tried to explain below. We have production use case
> with 2 kprobes (backed by ftrace) at the same function.
> fwiw the function is tcp_retransmit_skb.

As I state below, you can merge it into a single ftrace_ops.


> 
> > > We see different teams writing bpf services that attach to the same 
> > > function.
> > > In addition there are 30+ kprobes active in other places, so for every
> > > fentry the ftrace_ops_list_func() walks long linked list and does hash
> > > lookup for each. That's not efficient and we see this slowdown in 
> > > practice.
> > > Because of unique trampoline for every kernel function single
> > > generic list caller is not possible.
> > > Instead generated unique trampoline handles all attached bpf program
> > > for this particular kernel function in a sequence of calls.  
> > 
> > Why not have a single ftrace_ops() that calls this utility and do the
> > multiplexing then?  
> 
> because it's not an acceptable overhead.

Prove it! (with the non regs case)

> 
> > > No link lists to walk, no hash tables to lookup.
> > > All overhead is gone.
> > > 
> > > 3.
> > > The existing kprobe bpf progs are using pt_regs to read arguments. Due to 
> > >  
> > 
> > That was done because kprobes in general work off of int3. And the
> > saving of pt_regs was to reuse the code and allow kprobes to work both
> > with or without a ftrace helper.  
> 
> sure. that makes sense. That's why ftrace-based kprobes are the best
> and fastest kernel infra today for bpf to attach to.
> But after using it all in prod we see that it's too slow for ever growing
> demands which are bpf specific. All the optimizations that went
> into kprobe handling do help. No doubt about that. But here I'm talking
> about removing _all_ overhead. Not just making kprobe 2-3 times faster.

You said yourself that the reg saving overhead is too much. The
pushf/popf is hard too. And I agree that that is unacceptable overhead.
That's why we have two function callers: One with and one without
saving regs. Because I knew that was expensive and we didn't need to
cause everyone to suffer just to have kprobes work with ftrace.

I gave a solution for this. And that is to add another flag to allow
for just the minimum to change the ip. And we can even add another flag
to allow for changing the stack if needed (to emulate a call with the
same parameters).

The world is not just bpf. Does Facebook now own the Linux kernel?

By doing this work, live kernel patching will also benefit. Because it
is also dealing with the unnecessary overhead of saving regs.

And we could possibly even have kprobes benefit from this if a kprobe
doesn't need full regs.

> 
> > > that ugliness all of them are written for single architecture (x86-64).
> > > Porting them 

Re: [PATCH 3/3] x86/ftrace: Use text_poke()

2019-10-22 Thread Alexei Starovoitov
On Tue, Oct 22, 2019 at 02:10:21PM -0400, Steven Rostedt wrote:
> On Tue, 22 Oct 2019 10:50:56 -0700
> Alexei Starovoitov  wrote:
> 
> > > +static void my_hijack_func(unsigned long ip, unsigned long pip,
> > > +struct ftrace_ops *ops, struct pt_regs *regs)  
> > 
> > 1.
> > To pass regs into the callback ftrace_regs_caller() has huge amount
> > of stores to do. Saving selector registers is not fast. pushf/popf are even 
> > slower.
> > ~200 bytes of stack is being used for save/restore.
> > This is pure overhead that bpf execution cannot afford.
> > bpf is different from traditional ftrace and other tracing, since
> > it's often active 24/7. Every nanosecond counts.
> 
> Live patching is the same as what you have. If not even more critical.
> 
> Note, it would be easy to also implement a "just give me IP regs" flag,
> or have that be the default if IPMODIFY is set and REGS is not.

And that will reduce overhead from 20+ regs save/restore into 3-4 ?
Say, it's only two regs (rbp and rip). Why bpf side has to pay this runtime
penalty? I see no good technical reason.

> 
> > So for bpf I'm generating assembler trampoline tailored to specific kernel
> > function that has its fentry nop replaced with a call to that trampoline.
> > Instead of 20+ register save I save only arguments of that kernel function.
> > For example the trampoline that attaches to kfree_skb() will save only two 
> > registers
> > (rbp and rdi on x86) and will use 16 bytes of stack.
> > 
> > 2.
> > The common ftrace callback api allows ftrace infra to use generic 
> > ftrace_ops_list_func()
> > that works for all ftracers, but it doesn't scale.
> 
> That only happens if you have more than one callback to a same
> function. Otherwise you get a dedicated trampoline.

That's exactly what I tried to explain below. We have production use case
with 2 kprobes (backed by ftrace) at the same function.
fwiw the function is tcp_retransmit_skb.

> > We see different teams writing bpf services that attach to the same 
> > function.
> > In addition there are 30+ kprobes active in other places, so for every
> > fentry the ftrace_ops_list_func() walks long linked list and does hash
> > lookup for each. That's not efficient and we see this slowdown in practice.
> > Because of unique trampoline for every kernel function single
> > generic list caller is not possible.
> > Instead generated unique trampoline handles all attached bpf program
> > for this particular kernel function in a sequence of calls.
> 
> Why not have a single ftrace_ops() that calls this utility and do the
> multiplexing then?

because it's not an acceptable overhead.

> > No link lists to walk, no hash tables to lookup.
> > All overhead is gone.
> > 
> > 3.
> > The existing kprobe bpf progs are using pt_regs to read arguments. Due to
> 
> That was done because kprobes in general work off of int3. And the
> saving of pt_regs was to reuse the code and allow kprobes to work both
> with or without a ftrace helper.

sure. that makes sense. That's why ftrace-based kprobes are the best
and fastest kernel infra today for bpf to attach to.
But after using it all in prod we see that it's too slow for ever growing
demands which are bpf specific. All the optimizations that went
into kprobe handling do help. No doubt about that. But here I'm talking
about removing _all_ overhead. Not just making kprobe 2-3 times faster.

> > that ugliness all of them are written for single architecture (x86-64).
> > Porting them to arm64 is not that hard, but porting to 32-bit arch is close
> > to impossible. With custom generated trampoline we'll have bpf progs that
> > work as-is on all archs. raw_tracepoint bpf progs already demonstrated
> > that such portability is possible. This new kprobe++ progs will be similar.
> > 
> > 4.
> > Due to uniqueness of bpf trampoline sharing trampoline between ftracers
> > and bpf progs is not possible, so users would have to choose whether to
> > ftrace that particular kernel function or attach bpf to it.
> > Attach is not going to stomp on each other. I'm reusing ftrace_make_call/nop
> approach that checks that it's a 'nop' being replaced.
> 
> What about the approach I showed here? Just register a ftrace_ops with
> ip modify set, and then call your unique trampoline directly.

It's 100% unnecessary overhead.

> It would keep the modification all in one place instead of having
> multiple implementations of it. We can make ftrace call your trampoline
> just like it was called directly, without writing a whole new
> infrastructure.

That is not true at all.
I haven't posted the code yet, but you're already arguing about
hypothetical code duplication.
There is none so far. I'm not reinventing ftrace.
There will be no FTRACE_OPS_FL_* flags, no ftrace_ops equivalent.
None of it applies.
Since Peter is replacing ftrace specific nop->int3->call patching
with text_poke() I can just use that.



Re: [PATCH 3/3] x86/ftrace: Use text_poke()

2019-10-22 Thread Steven Rostedt
On Tue, 22 Oct 2019 10:50:56 -0700
Alexei Starovoitov  wrote:

> > +static void my_hijack_func(unsigned long ip, unsigned long pip,
> > +  struct ftrace_ops *ops, struct pt_regs *regs)  
> 
> 1.
> To pass regs into the callback ftrace_regs_caller() has huge amount
> of stores to do. Saving selector registers is not fast. pushf/popf are even 
> slower.
> ~200 bytes of stack is being used for save/restore.
> This is pure overhead that bpf execution cannot afford.
> bpf is different from traditional ftrace and other tracing, since
> it's often active 24/7. Every nanosecond counts.

Live patching is the same as what you have. If not even more critical.

Note, it would be easy to also implement a "just give me IP regs" flag,
or have that be the default if IPMODIFY is set and REGS is not.


> So for bpf I'm generating assembler trampoline tailored to specific kernel
> function that has its fentry nop replaced with a call to that trampoline.
> Instead of 20+ register save I save only arguments of that kernel function.
> For example the trampoline that attaches to kfree_skb() will save only two 
> registers
> (rbp and rdi on x86) and will use 16 bytes of stack.
> 
> 2.
> The common ftrace callback api allows ftrace infra to use generic 
> ftrace_ops_list_func()
> that works for all ftracers, but it doesn't scale.

That only happens if you have more than one callback to a same
function. Otherwise you get a dedicated trampoline.


> We see different teams writing bpf services that attach to the same function.
> In addition there are 30+ kprobes active in other places, so for every
> fentry the ftrace_ops_list_func() walks long linked list and does hash
> lookup for each. That's not efficient and we see this slowdown in practice.
> Because of unique trampoline for every kernel function single
> generic list caller is not possible.
> Instead generated unique trampoline handles all attached bpf program
> for this particular kernel function in a sequence of calls.

Why not have a single ftrace_ops() that calls this utility and do the
multiplexing then?
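A sketch of what that single multiplexing ftrace_ops could look like; the
lookup structure and the per-function dispatch hook are assumptions, only the
ftrace_ops registration itself is existing API:

#include <linux/ftrace.h>
#include <linux/hashtable.h>

/* one registered callsite -> whatever runs the bpf progs attached to it */
struct bpf_attach_site {
	struct hlist_node node;
	unsigned long ip;
	void (*run_progs)(unsigned long ip, struct pt_regs *regs);
};

static DEFINE_HASHTABLE(bpf_attach_sites, 8);

static void bpf_mux_func(unsigned long ip, unsigned long parent_ip,
			 struct ftrace_ops *ops, struct pt_regs *regs)
{
	struct bpf_attach_site *site;

	/* shared ftrace_ops: hash the callsite ip back to its dispatcher */
	hash_for_each_possible(bpf_attach_sites, site, node, ip)
		if (site->ip == ip)
			site->run_progs(ip, regs);
}

static struct ftrace_ops bpf_mux_ops = {
	.func	= bpf_mux_func,
	.flags	= FTRACE_OPS_FL_SAVE_REGS | FTRACE_OPS_FL_RECURSION_SAFE,
};

Alexei's objection elsewhere in the thread is precisely that this path still
pays for the pt_regs save and a lookup on every call.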

> No link lists to walk, no hash tables to lookup.
> All overhead is gone.
> 
> 3.
> The existing kprobe bpf progs are using pt_regs to read arguments. Due to

That was done because kprobes in general work off of int3. And the
saving of pt_regs was to reuse the code and allow kprobes to work both
with or without a ftrace helper.

> that ugliness all of them are written for single architecture (x86-64).
> Porting them to arm64 is not that hard, but porting to 32-bit arch is close
> to impossible. With custom generated trampoline we'll have bpf progs that
> work as-is on all archs. raw_tracepoint bpf progs already demonstrated
> that such portability is possible. This new kprobe++ progs will be similar.
> 
> 4.
> Due to uniqueness of bpf trampoline sharing trampoline between ftracers
> and bpf progs is not possible, so users would have to choose whether to
> ftrace that particular kernel function or attach bpf to it.
> Attach is not going to stomp on each other. I'm reusing ftrace_make_call/nop
approach that checks that it's a 'nop' being replaced.

What about the approach I showed here? Just register a ftrace_ops with
ip modify set, and then call your unique trampoline directly.

It would keep the modification all in one place instead of having
multiple implementations of it. We can make ftrace call your trampoline
just like it was called directly, without writing a whole new
infrastructure.

-- Steve


> 
> There probably will be some gotchas and unforeseen issues, since prototype
> is very rough and not in reviewable form yet. Will share when it's ready.



Re: [PATCH 3/3] x86/ftrace: Use text_poke()

2019-10-22 Thread Alexei Starovoitov
On Tue, Oct 22, 2019 at 09:44:55AM -0400, Steven Rostedt wrote:
> On Tue, 22 Oct 2019 07:19:56 -0400
> Steven Rostedt  wrote:
> 
> > > I'm not touching dyn_ftrace.
> > > Actually calling my stuff ftrace+bpf is probably not correct either.
> > > I'm reusing code patching of nop into call that ftrace does. That's it.
> > > Turned out I cannot use 99% of ftrace facilities.
> > > ftrace_caller, ftrace_call, ftrace_ops_list_func and the whole ftrace api
> > > with ip, parent_ip and pt_regs cannot be used for this part of the work.
> > > bpf prog needs to access raw function arguments. To achieve that I'm  
> > 
> > You can do that today with the ftrace facility, just like live patching
> > does. You register a ftrace_ops with the flag FTRACE_OPS_FL_IPMODIFY,
> > and your func will set the regs->ip to your bpf handler. When the
> > ftrace_ops->func returns, instead of going back to the called
> > function, it can jump to your bpf_handler. You can create a shadow stack
> > (like function graph tracer does) to save the return address for where
> your bpf handler needs to return to. As your bpf_handler needs raw
> > access to the parameters, it may not even need the shadow stack because
> > it should know the function it is reading the parameters from.
> 
> To show just how easy this is, I wrote up a quick hack that hijacks the
> wake_up_process() function and adds a trace_printk() to see what was
> woken up. My output from the trace is this:
> 
>           <idle>-0     [007] ..s1    68.517276: my_wake_up: We are waking up rcu_preempt:10
>            <...>-1240  [001]         68.517727: my_wake_up: We are waking up kthreadd:2
>            <...>-1240  [001] d..1    68.517973: my_wake_up: We are waking up kworker/1:0:17
>             bash-1188  [003] d..2    68.519020: my_wake_up: We are waking up kworker/u16:3:140
>             bash-1188  [003] d..2    68.519138: my_wake_up: We are waking up kworker/u16:3:140
>             sshd-1187  [005] d.s2    68.519295: my_wake_up: We are waking up kworker/5:2:517
>           <idle>-0     [007] ..s1    68.522293: my_wake_up: We are waking up rcu_preempt:10
>           <idle>-0     [007] ..s1    68.526309: my_wake_up: We are waking up rcu_preempt:10
> 
> I added the code to the trace-event-sample.c sample module, and got the
> above when I loaded that module (modprobe trace-event-sample).
> 
> It's mostly non arch specific (that is, you can do this with any
> arch that supports the IPMODIFY flag). The only parts that would need
> arch specific code is the regs->ip compare. The pip check can also be
> done less "hacky". But this shows you how easy this can be done today.
> Not sure what is missing that you need.

yes. I've studied all these possibilities, tried a bunch of ways to extend
and reuse ftrace_ops, but every time found something fundamental to
the existing ftrace design that doesn't work for bpf usage.
See below:

> Here's the patch:
> 
> diff --git a/samples/trace_events/trace-events-sample.c b/samples/trace_events/trace-events-sample.c
> index 1a72b7d95cdc..526a6098c811 100644
> --- a/samples/trace_events/trace-events-sample.c
> +++ b/samples/trace_events/trace-events-sample.c
> @@ -11,6 +11,41 @@
>  #define CREATE_TRACE_POINTS
>  #include "trace-events-sample.h"
>  
> +#include 
> +
> +int wake_up_process(struct task_struct *p);
> +
> +int x;
> +
> +static int my_wake_up(struct task_struct *p)
> +{
> + int ret;
> +
> + trace_printk("We are waking up %s:%d\n", p->comm, p->pid);
> + ret = wake_up_process(p);
> + /* Force not having a tail call */
> + if (!x)
> + return ret;
> + return 0;
> +}
> +
> +static void my_hijack_func(unsigned long ip, unsigned long pip,
> +struct ftrace_ops *ops, struct pt_regs *regs)

1.
To pass regs into the callback ftrace_regs_caller() has huge amount
of stores to do. Saving selector registers is not fast. pushf/popf are even 
slower.
~200 bytes of stack is being used for save/restore.
This is pure overhead that bpf execution cannot afford.
bpf is different from traditional ftrace and other tracing, since
it's often active 24/7. Every nanosecond counts.
So for bpf I'm generating assembler trampoline tailored to specific kernel
function that has its fentry nop replaced with a call to that trampoline.
Instead of 20+ register save I save only arguments of that kernel function.
For example the trampoline that attaches to kfree_skb() will save only two 
registers
(rbp and rdi on x86) and will use 16 bytes of stack.
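For the one-argument kfree_skb() case that would look roughly like the
two-argument example elsewhere in this thread -- a sketch, not the actual
generated code, with jited_bpf_prog standing in for the jited program:

asm (
"	.pushsection	.text, \"ax\", @progbits\n"
"	.type		kfree_skb_tramp_sketch, @function\n"
"kfree_skb_tramp_sketch:\n"
"	pushq	%rbp\n"			/* 8 bytes ...                          */
"	movq	%rsp, %rbp\n"
"	pushq	%rdi\n"			/* ... + 8 = 16 bytes: the skb argument */
"	leaq	-8(%rbp), %rdi\n"	/* R1 = pointer to the saved argument   */
"	call	jited_bpf_prog\n"	/* placeholder symbol                   */
"	popq	%rdi\n"			/* restore the argument for kfree_skb() */
"	leave\n"
"	retq\n"
"	.size		kfree_skb_tramp_sketch, .-kfree_skb_tramp_sketch\n"
"	.popsection\n"
);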

2.
The common ftrace callback api allows ftrace infra to use generic 
ftrace_ops_list_func()
that works for all ftracers, but it doesn't scale.
We see different teams writing bpf services that attach to the same function.
In addition there are 30+ kprobes active in other places, so for every
fentry the ftrace_ops_list_func() walks long linked list and does hash
lookup for each. That's not efficient and we see this slowdown in practice.
Because of unique trampoline 

Re: [PATCH 3/3] x86/ftrace: Use text_poke()

2019-10-22 Thread Steven Rostedt
On Tue, 22 Oct 2019 07:19:56 -0400
Steven Rostedt  wrote:

> > I'm not touching dyn_ftrace.
> > Actually calling my stuff ftrace+bpf is probably not correct either.
> > I'm reusing code patching of nop into call that ftrace does. That's it.
> > Turned out I cannot use 99% of ftrace facilities.
> > ftrace_caller, ftrace_call, ftrace_ops_list_func and the whole ftrace api
> > with ip, parent_ip and pt_regs cannot be used for this part of the work.
> > bpf prog needs to access raw function arguments. To achieve that I'm  
> 
> You can do that today with the ftrace facility, just like live patching
> does. You register a ftrace_ops with the flag FTRACE_OPS_FL_IPMODIFY,
> and your func will set the regs->ip to your bpf handler. When the
> ftrace_ops->func returns, instead of going back to the called
> function, it can jump to your bpf_handler. You can create a shadow stack
> (like function graph tracer does) to save the return address for where
> your bpf handler needs to return to. As your bpf_handler needs raw
> access to the parameters, it may not even need the shadow stack because
> it should know the function it is reading the parameters from.

To show just how easy this is, I wrote up a quick hack that hijacks the
wake_up_process() function and adds a trace_printk() to see what was
woken up. My output from the trace is this:

          <idle>-0     [007] ..s1    68.517276: my_wake_up: We are waking up rcu_preempt:10
           <...>-1240  [001]         68.517727: my_wake_up: We are waking up kthreadd:2
           <...>-1240  [001] d..1    68.517973: my_wake_up: We are waking up kworker/1:0:17
            bash-1188  [003] d..2    68.519020: my_wake_up: We are waking up kworker/u16:3:140
            bash-1188  [003] d..2    68.519138: my_wake_up: We are waking up kworker/u16:3:140
            sshd-1187  [005] d.s2    68.519295: my_wake_up: We are waking up kworker/5:2:517
          <idle>-0     [007] ..s1    68.522293: my_wake_up: We are waking up rcu_preempt:10
          <idle>-0     [007] ..s1    68.526309: my_wake_up: We are waking up rcu_preempt:10

I added the code to the trace-event-sample.c sample module, and got the
above when I loaded that module (modprobe trace-event-sample).

It's mostly non arch specific (that is, you can do this with any
arch that supports the IPMODIFY flag). The only parts that would need
arch specific code is the regs->ip compare. The pip check can also be
done less "hacky". But this shows you how easy this can be done today.
Not sure what is missing that you need.

Here's the patch:

diff --git a/samples/trace_events/trace-events-sample.c b/samples/trace_events/trace-events-sample.c
index 1a72b7d95cdc..526a6098c811 100644
--- a/samples/trace_events/trace-events-sample.c
+++ b/samples/trace_events/trace-events-sample.c
@@ -11,6 +11,41 @@
 #define CREATE_TRACE_POINTS
 #include "trace-events-sample.h"
 
+#include 
+
+int wake_up_process(struct task_struct *p);
+
+int x;
+
+static int my_wake_up(struct task_struct *p)
+{
+   int ret;
+
+   trace_printk("We are waking up %s:%d\n", p->comm, p->pid);
+   ret = wake_up_process(p);
+   /* Force not having a tail call */
+   if (!x)
+   return ret;
+   return 0;
+}
+
+static void my_hijack_func(unsigned long ip, unsigned long pip,
+  struct ftrace_ops *ops, struct pt_regs *regs)
+{
+   unsigned long this_func = (unsigned long)my_wake_up;
+
+   if (pip >= this_func && pip <= this_func + 0x1)
+   return;
+
+   regs->ip = my_wake_up;
+}
+
+static struct ftrace_ops my_ops = {
+   .func = my_hijack_func,
+   .flags = FTRACE_OPS_FL_IPMODIFY | FTRACE_OPS_FL_RECURSION_SAFE |
+  FTRACE_OPS_FL_SAVE_REGS,
+};
+
 static const char *random_strings[] = {
"Mother Goose",
"Snoopy",
@@ -115,6 +150,11 @@ void foo_bar_unreg(void)
 
 static int __init trace_event_init(void)
 {
+   int ret;
+
+   ret = ftrace_set_filter_ip(&my_ops, (unsigned long)wake_up_process, 0, 0);
+   if (!ret)
+           register_ftrace_function(&my_ops);
simple_tsk = kthread_run(simple_thread, NULL, "event-sample");
if (IS_ERR(simple_tsk))
return -1;
@@ -124,6 +164,7 @@ static int __init trace_event_init(void)
 
 static void __exit trace_event_exit(void)
 {
+   unregister_ftrace_function(&my_ops);
kthread_stop(simple_tsk);
mutex_lock(&thread_mutex);
if (simple_tsk_fn)


-- Steve


Re: [PATCH 3/3] x86/ftrace: Use text_poke()

2019-10-22 Thread Steven Rostedt
On Mon, 21 Oct 2019 21:05:33 -0700
Alexei Starovoitov  wrote:

> On Mon, Oct 21, 2019 at 11:19:04PM -0400, Steven Rostedt wrote:
> > On Mon, 21 Oct 2019 23:16:30 -0400
> > Steven Rostedt  wrote:
> >   
> > > > what bugs you're seeing?
> > > > The IPI frequency that was mentioned in this thread or something else?
> > > > I'm hacking ftrace+bpf stuff in the same spot and would like to
> > > > base my work on the latest and greatest.
> > 
> > I'm also going to be touching some of this code too, as I'm waiting for
> > Peter's code to settle down. What are you touching? I'm going to be
> > working on making the dyn_ftrace records smaller, and this is going to
> > change the way the iterations work on modifying the code.  
> 
> I'm not touching dyn_ftrace.
> Actually calling my stuff ftrace+bpf is probably not correct either.
> I'm reusing code patching of nop into call that ftrace does. That's it.
> Turned out I cannot use 99% of ftrace facilities.
> ftrace_caller, ftrace_call, ftrace_ops_list_func and the whole ftrace api
> with ip, parent_ip and pt_regs cannot be used for this part of the work.
> bpf prog needs to access raw function arguments. To achieve that I'm

You can do that today with the ftrace facility, just like live patching
does. You register a ftrace_ops with the flag FTRACE_OPS_FL_IPMODIFY,
and your func will set the regs->ip to your bpf handler. When the
ftrace_ops->func returns, instead of going back to the called
function, it can jump to your bpf_handler. You can create a shadow stack
(like function graph tracer does) to save the return address for where
your bpf handler needs to return to. As your bpf_handler needs raw
access to the parameters, it may not even need the shadow stack because
it should know the function it is reading the parameters from.

If you still want the return address without the shadow stack, it
wouldn't be hard to add another ftrace_ops flag, that allows the
ftrace_ops to inject a return address to simulate a call function
before it jumps to the modified IP.
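In code, the first paragraph amounts to roughly the following sketch (the
handler name is made up; the full worked example is the trace-events-sample
patch posted elsewhere in this thread):

#include <linux/ftrace.h>
#include <linux/ptrace.h>

void my_bpf_handler(void);		/* hypothetical code to divert to */

static void divert_func(unsigned long ip, unsigned long parent_ip,
			struct ftrace_ops *ops, struct pt_regs *regs)
{
	/* resume in the handler instead of in the traced function */
	regs->ip = (unsigned long)my_bpf_handler;
}

static struct ftrace_ops divert_ops = {
	.func	= divert_func,
	.flags	= FTRACE_OPS_FL_IPMODIFY | FTRACE_OPS_FL_SAVE_REGS |
		  FTRACE_OPS_FL_RECURSION_SAFE,
};

divert_ops would then be attached with ftrace_set_filter_ip() plus
register_ftrace_function(), exactly as the sample patch does.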

> generating code on the fly. Just like bpf jits do.
> As soon as I have something reviewable I'll share it.
> That's the stuff I mentioned to you at KR.
> First nop of a function will be replaced with a call into bpf.
> Very similar to what existing kprobe+bpf does, but faster and safer.
> Second part of calling real ftrace from bpf is on todo list.

Having two different infrastructures modifying the first nop is going
to cause a huge nightmare to maintain. This is why live patching didn't
do it. I strongly suggest that you do not give bpf its own
infrastructure.

-- Steve


Re: [PATCH 3/3] x86/ftrace: Use text_poke()

2019-10-21 Thread Alexei Starovoitov
On Mon, Oct 21, 2019 at 11:19:04PM -0400, Steven Rostedt wrote:
> On Mon, 21 Oct 2019 23:16:30 -0400
> Steven Rostedt  wrote:
> 
> > > what bugs you're seeing?
> > > The IPI frequency that was mentioned in this thread or something else?
> > > I'm hacking ftrace+bpf stuff in the same spot and would like to
> > > base my work on the latest and greatest.  
> 
> I'm also going to be touching some of this code too, as I'm waiting for
> Peter's code to settle down. What are you touching? I'm going to be
> working on making the dyn_ftrace records smaller, and this is going to
> change the way the iterations work on modifying the code.

I'm not touching dyn_ftrace.
Actually calling my stuff ftrace+bpf is probably not correct either.
I'm reusing code patching of nop into call that ftrace does. That's it.
Turned out I cannot use 99% of ftrace facilities.
ftrace_caller, ftrace_call, ftrace_ops_list_func and the whole ftrace api
with ip, parent_ip and pt_regs cannot be used for this part of the work.
bpf prog needs to access raw function arguments. To achieve that I'm
generating code on the fly. Just like bpf jits do.
As soon as I have something reviewable I'll share it.
That's the stuff I mentioned to you at KR.
First nop of a function will be replaced with a call into bpf.
Very similar to what existing kprobe+bpf does, but faster and safer.
Second part of calling real ftrace from bpf is on todo list.



Re: [PATCH 3/3] x86/ftrace: Use text_poke()

2019-10-21 Thread Alexei Starovoitov
On Mon, Oct 21, 2019 at 11:16:30PM -0400, Steven Rostedt wrote:
> On Mon, 21 Oct 2019 20:10:09 -0700
> Alexei Starovoitov  wrote:
> 
> > On Mon, Oct 21, 2019 at 5:43 PM Steven Rostedt  wrote:
> > >
> > > On Mon, 21 Oct 2019 17:36:54 -0700
> > > Alexei Starovoitov  wrote:
> > >
> > >  
> > > > What is the status of this set ?
> > > > Steven, did you apply it ?  
> > >
> > > There's still bugs to figure out.  
> > 
> > what bugs you're seeing?
> > The IPI frequency that was mentioned in this thread or something else?
> > I'm hacking ftrace+bpf stuff in the same spot and would like to
> > base my work on the latest and greatest.
> 
> There's real bugs (crashes), talked about in later versions of the
> patch set:
> 
>  https://lore.kernel.org/r/20191009224135.2dcf7...@oasis.local.home
> 
> And there was another crash on Peter's latest I just reported:
> 
>  https://lore.kernel.org/r/20191021222110.49044...@oasis.local.home

ahh. I missed the whole thread. Thanks for pointers.



Re: [PATCH 3/3] x86/ftrace: Use text_poke()

2019-10-21 Thread Steven Rostedt
On Mon, 21 Oct 2019 23:16:30 -0400
Steven Rostedt  wrote:

> > what bugs you're seeing?
> > The IPI frequency that was mentioned in this thread or something else?
> > I'm hacking ftrace+bpf stuff in the same spot and would like to
> > base my work on the latest and greatest.  

I'm also going to be touching some of this code too, as I'm waiting for
Peter's code to settle down. What are you touching? I'm going to be
working on making the dyn_ftrace records smaller, and this is going to
change the way the iterations work on modifying the code.

This is what I discussed on stage at Kernel Recipes with Willy Tarreau.

-- Steve


Re: [PATCH 3/3] x86/ftrace: Use text_poke()

2019-10-21 Thread Steven Rostedt
On Mon, 21 Oct 2019 20:10:09 -0700
Alexei Starovoitov  wrote:

> On Mon, Oct 21, 2019 at 5:43 PM Steven Rostedt  wrote:
> >
> > On Mon, 21 Oct 2019 17:36:54 -0700
> > Alexei Starovoitov  wrote:
> >
> >  
> > > What is the status of this set ?
> > > Steven, did you apply it ?  
> >
> > There's still bugs to figure out.  
> 
> what bugs you're seeing?
> The IPI frequency that was mentioned in this thread or something else?
> I'm hacking ftrace+bpf stuff in the same spot and would like to
> base my work on the latest and greatest.

There's real bugs (crashes), talked about in later versions of the
patch set:

 https://lore.kernel.org/r/20191009224135.2dcf7...@oasis.local.home

And there was another crash on Peter's latest I just reported:

 https://lore.kernel.org/r/20191021222110.49044...@oasis.local.home

-- Steve


Re: [PATCH 3/3] x86/ftrace: Use text_poke()

2019-10-21 Thread Alexei Starovoitov
On Mon, Oct 21, 2019 at 5:43 PM Steven Rostedt  wrote:
>
> On Mon, 21 Oct 2019 17:36:54 -0700
> Alexei Starovoitov  wrote:
>
>
> > What is the status of this set ?
> > Steven, did you apply it ?
>
> There's still bugs to figure out.

what bugs you're seeing?
The IPI frequency that was mentioned in this thread or something else?
I'm hacking ftrace+bpf stuff in the same spot and would like to
base my work on the latest and greatest.


Re: [PATCH 3/3] x86/ftrace: Use text_poke()

2019-10-21 Thread Steven Rostedt
On Mon, 21 Oct 2019 17:36:54 -0700
Alexei Starovoitov  wrote:


> What is the status of this set ?
> Steven, did you apply it ?

There's still bugs to figure out.

-- Steve



Re: [PATCH 3/3] x86/ftrace: Use text_poke()

2019-10-21 Thread Alexei Starovoitov
On Fri, Oct 4, 2019 at 6:45 AM Steven Rostedt  wrote:
>
> On Fri, 4 Oct 2019 13:22:37 +0200
> Peter Zijlstra  wrote:
>
> > On Thu, Oct 03, 2019 at 06:10:45PM -0400, Steven Rostedt wrote:
> > > But still, we are going from 120 to 660 IPIs for every CPU. Not saying
> > > it's a problem, but something that we should note. Someone (those that
> > > don't like kernel interference) may complain.
> >
> > It is machine wide function tracing, interference is going to happen..
> > :-)
> >
> > Anyway, we can grow the batch size if sufficient benefit can be shown to
> > exist.
>
> Yeah, I was thinking that we just go with these patches and then fix
> the IPI issue when someone starts complaining ;-)
>
> Anyway, is this series ready to go? I can pull them in (I guess I
> should get an ack from Thomas or Ingo as they are x86 specific). I'm
> currently working on code that affects the same code paths as this
> patch, and would like to build my changes on top of this, instead of
> monkeying around with major conflicts.

What is the status of this set ?
Steven, did you apply it ?


Re: [PATCH 3/3] x86/ftrace: Use text_poke()

2019-10-11 Thread Steven Rostedt
On Fri, 11 Oct 2019 09:37:10 +0200
Daniel Bristot de Oliveira  wrote:

> But, yes, we will need [ as an optimization ] to sort the addresses right
> before inserting them in the batch. Still, having the ftrace_pages ordered
> seems to be a good thing, as in many cases, the ftrace_pages are disjoint sets.

I think it would be fine if we run the batches according to the ftrace
page groups, which will be guaranteed to be sorted. I try to keep the
number of groups to a minimum, so there shouldn't be too many IPI bursts.

Although, my new work may actually make more page groups by trying to
cut down on the dyn_ftrace size by using offsets. If a function extends
outside the offset range, a new page group will need to be created.
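
For illustration only, the offset idea could look something like this (all
names and field widths below are invented for the sketch, not the actual
rework):

/*
 * Sketch: store a 32-bit offset from the page group's base ip instead of a
 * full pointer per record.  A function whose address does not fit in the
 * offset range then forces a new page group.
 */
struct ftrace_page_group {
	struct ftrace_page_group	*next;
	unsigned long			base_ip;	/* ip of the first record */
	struct dyn_ftrace_small		*records;	/* sorted by offset */
	int				index;
	int				size;
};

struct dyn_ftrace_small {
	u32	offset;		/* record ip == base_ip + offset */
	u32	flags;
};

static inline unsigned long rec_ip(struct ftrace_page_group *pg, int i)
{
	return pg->base_ip + pg->records[i].offset;
}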

-- Steve



Re: [PATCH 3/3] x86/ftrace: Use text_poke()

2019-10-11 Thread Peter Zijlstra
On Fri, Oct 11, 2019 at 09:37:10AM +0200, Daniel Bristot de Oliveira wrote:
> On 11/10/2019 09:01, Peter Zijlstra wrote:
> > On Fri, Oct 04, 2019 at 10:10:47AM +0200, Daniel Bristot de Oliveira wrote:
> >> Currently, ftrace_rec entries are ordered inside the group of functions,
> >> but "groups of function" are not ordered. So, the current int3 handler
> >> does a (*):
> > We can insert a sort() of the vector right before doing
> > text_poke_bp_batch() of course...
> 
> I agree!
> 
> What I tried to do earlier this week was to order the ftrace_pages in the
> insertion [1], and so, while sequentially reading the pages with
> do_for_each_ftrace_rec() we would already see the "ip"s in order.
> 
> As ftrace_pages are inserted only at boot and during a load of a module, this
> would push the ordering for a very very slow path.
> 
> It works! But under the assumption that the address of functions in a module
> does not intersect with the address of other modules/kernel, e.g.:
> 
> kernel:  module A:  module B:
> [ 1, 2, 3, 4 ]   [ 7, 8, 9 ][ 15, 16, 19 ]
> 
> But this does not happen in practice, as I saw things like:
> 
> kernel:  module A:  module B:
> [ 1, 2, 3, 4 ]   [ 7, 8, 18 ]   [ 15, 16, 19 ]
>  ^^ <--- greater than the first of the next
> 
> Is this expected?

Expected is I think a big word. That is, I would certainly not expect
that, but I think I can see how it can happen.

Suppose that init sections are placed at the end (didn't verify, but
makes sense), then suppose that A's 18 is in an init section, which gets
freed once the module load is completed. We then proceed to load B,
which then overlaps with what used to be A's init.

If this were the case, then while we could retain the patch location for
A's 18, we would never actually modify it, because it's gotten marked
freed or something (that's how jump_labels work, I didn't check what
ftrace actually does here).

So while the above contains some assumptions, it is a semi-plausible
explanation for what you've described.


Re: [PATCH 3/3] x86/ftrace: Use text_poke()

2019-10-11 Thread Daniel Bristot de Oliveira
On 11/10/2019 09:01, Peter Zijlstra wrote:
> On Fri, Oct 04, 2019 at 10:10:47AM +0200, Daniel Bristot de Oliveira wrote:
>> Currently, ftrace_rec entries are ordered inside the group of functions, but
>> "groups of function" are not ordered. So, the current int3 handler does a 
>> (*):
> We can insert a sort() of the vector right before doing
> text_poke_bp_batch() of course...

I agree!

What I tried to do earlier this week was to order the ftrace_pages in the
insertion [1], and so, while sequentially reading the pages with
do_for_each_ftrace_rec() we would already see the "ip"s in order.

As ftrace_pages are inserted only at boot and during a load of a module, this
would push the ordering for a very very slow path.

It works! But under the assumption that the address of functions in a module
does not intersect with the address of other modules/kernel, e.g.:

kernel:  module A:  module B:
[ 1, 2, 3, 4 ]   [ 7, 8, 9 ][ 15, 16, 19 ]

But this does not happen in practice, as I saw things like:

kernel:  module A:  module B:
[ 1, 2, 3, 4 ]   [ 7, 8, 18 ]   [ 15, 16, 19 ]
 ^^ <--- greater than the first of the next

Is this expected?

At this point, I stopped working on it to give my brain some time to think about
a better solution, also because Steve is reworking ftrace_pages to save some
space. So, it was better to wait.

But, yes, we will need [ as an optimization ] to sort the addresses right before
inserting them in the batch. Still, having the ftrace_pages ordered seems to be
a good thing, as in many cases, the ftrace_pages are disjoint sets.

[1] see ftrace_pages_insert() and ftrace_pages_start.
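
For reference, a sketch of what such an ordered insert could look like (this
is only an illustration of the description above, not the code from [1];
locking and the empty-group case are ignored):

static void ftrace_pages_insert(struct ftrace_page *new_pg)
{
	struct ftrace_page **p;

	/*
	 * Keep the list of ftrace_page groups ordered by the ip of their
	 * first record, so do_for_each_ftrace_rec() walks ips in ascending
	 * order.
	 */
	for (p = &ftrace_pages_start; *p; p = &(*p)->next) {
		if ((*p)->records[0].ip > new_pg->records[0].ip)
			break;
	}
	new_pg->next = *p;
	*p = new_pg;
}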

-- Daniel



Re: [PATCH 3/3] x86/ftrace: Use text_poke()

2019-10-11 Thread Peter Zijlstra
On Fri, Oct 04, 2019 at 10:10:47AM +0200, Daniel Bristot de Oliveira wrote:
> Currently, ftrace_rec entries are ordered inside the group of functions, but
> "groups of function" are not ordered. So, the current int3 handler does a (*):

We can insert a sort() of the vector right before doing
text_poke_bp_batch() of course...
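
Something along these lines, assuming the static tp_vec[]/tp_vec_nr pair from
the batching patch (the text_poke_loc layout and helper names may differ from
the final code):

#include <linux/sort.h>

/* order the queued pokes by address so the int3 handler can bsearch them */
static int tp_vec_cmp(const void *a, const void *b)
{
	const struct text_poke_loc *tpa = a, *tpb = b;

	if (tpa->addr > tpb->addr)
		return 1;
	if (tpa->addr < tpb->addr)
		return -1;
	return 0;
}

static void text_poke_flush_sorted(void)
{
	sort(tp_vec, tp_vec_nr, sizeof(tp_vec[0]), tp_vec_cmp, NULL);
	text_poke_bp_batch(tp_vec, tp_vec_nr);
	tp_vec_nr = 0;
}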


Re: [PATCH 3/3] x86/ftrace: Use text_poke()

2019-10-07 Thread Peter Zijlstra
On Fri, Oct 04, 2019 at 10:10:47AM +0200, Daniel Bristot de Oliveira wrote:
> 1) the enabling/disabling ftrace path
> 2) the int3 path - if a thread/irq is running a kernel function
> 3) the IPI - that affects all CPUs, even those that are not "hitting" trace
> code, e.g., user-space.
> 
> The first one is for sure a cold-path. The second one is a hot-path: any task
> running kernel functions will hit it. But IMHO, the hottest one is the IPIs,
> because it will run on all CPUs, e.g., even isolated CPUs that are running in
> user-space.

Well, we can fix that, just like RCU. In fact, I suspect RCU has all the
bits required to cure this.

For NOHZ_FULL CPUs you can forgo the IPI and delay it until the next kernel
entry, just like RCU.


Re: [PATCH 3/3] x86/ftrace: Use text_poke()

2019-10-04 Thread Steven Rostedt
On Fri, 4 Oct 2019 16:44:35 +0200
Daniel Bristot de Oliveira  wrote:

> On 04/10/2019 15:40, Steven Rostedt wrote:
> > On Fri, 4 Oct 2019 10:10:47 +0200
> > Daniel Bristot de Oliveira  wrote:
> >   
> >> [ In addition ]
> >>
> >> Currently, ftrace_rec entries are ordered inside the group of functions,
> >> but "groups of function" are not ordered. So, the current int3 handler
> >> does a (*):
> >>
> >> for_each_group_of_functions:
> >>    check if the ip is in the range   -> n by the number of groups.
> >>    do a bsearch.                     -> log(n) by the numbers of entry
> >>                                         in the group.
> >>
> >> If, instead, it uses an ordered vector, the complexity would be log(n) by
> >> the total number of entries, which is better. So, how bad is the idea of:
> > BTW, I'm currently rewriting the grouping of the vectors, in order to
> > shrink the size of each dyn_ftrace_rec (as we discussed at Kernel
> > Recipes). I can make the groups all sorted in doing so, thus we can
> > load the sorted if that's needed, without doing anything special.
> >   
> 
> Good! If you keep them sorted and store the number of entries in a variable, we
> can have the groundwork done for a future "optimized" version.
> 

You mean something like this?

-- Steve


From 9b85c08d796dc953cf9b9331d1aed4c3052b09b9 Mon Sep 17 00:00:00 2001
From: "Steven Rostedt (VMware)" 
Date: Tue, 1 Oct 2019 14:38:07 -0400
Subject: [PATCH] ftrace: Add information on number of page groups allocated

Looking for ways to shrink the size of the dyn_ftrace structure, knowing the
information about how many pages and the number of groups of those pages, is
useful in working out the best ways to save on memory.

This adds one info print on how many groups of pages were used to allocate
the ftrace dyn_ftrace structures, and also shows the number of pages and
groups in the dyn_ftrace_total_info (which is used for debugging).

Signed-off-by: Steven Rostedt (VMware) 
---
 kernel/trace/ftrace.c | 14 ++
 kernel/trace/trace.c  | 21 +++--
 kernel/trace/trace.h  |  2 ++
 3 files changed, 31 insertions(+), 6 deletions(-)

diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c
index c4cc048eb594..244e2d9083a6 100644
--- a/kernel/trace/ftrace.c
+++ b/kernel/trace/ftrace.c
@@ -2860,6 +2860,8 @@ static void ftrace_shutdown_sysctl(void)
 
 static u64 ftrace_update_time;
 unsigned long  ftrace_update_tot_cnt;
+unsigned long  ftrace_number_of_pages;
+unsigned long  ftrace_number_of_groups;
 
 static inline int ops_traces_mod(struct ftrace_ops *ops)
 {
@@ -2984,6 +2986,9 @@ static int ftrace_allocate_records(struct ftrace_page *pg, int count)
goto again;
}
 
+   ftrace_number_of_pages += 1 << order;
+   ftrace_number_of_groups++;
+
cnt = (PAGE_SIZE << order) / ENTRY_SIZE;
pg->size = cnt;
 
@@ -3039,6 +3044,8 @@ ftrace_allocate_pages(unsigned long num_to_init)
start_pg = pg->next;
kfree(pg);
pg = start_pg;
+   ftrace_number_of_pages -= 1 << order;
+   ftrace_number_of_groups--;
}
pr_info("ftrace: FAILED to allocate memory for functions\n");
return NULL;
@@ -5788,6 +5795,8 @@ void ftrace_release_mod(struct module *mod)
free_pages((unsigned long)pg->records, order);
tmp_page = pg->next;
kfree(pg);
+   ftrace_number_of_pages -= 1 << order;
+   ftrace_number_of_groups--;
}
 }
 
@@ -6129,6 +6138,8 @@ void ftrace_free_mem(struct module *mod, void *start_ptr, void *end_ptr)
*last_pg = pg->next;
order = get_count_order(pg->size / ENTRIES_PER_PAGE);
free_pages((unsigned long)pg->records, order);
+   ftrace_number_of_pages -= 1 << order;
+   ftrace_number_of_groups--;
kfree(pg);
pg = container_of(last_pg, struct ftrace_page, next);
if (!(*last_pg))
@@ -6184,6 +6195,9 @@ void __init ftrace_init(void)
  __start_mcount_loc,
  __stop_mcount_loc);
 
+   pr_info("ftrace: allocated %ld pages with %ld groups\n",
+   ftrace_number_of_pages, ftrace_number_of_groups);
+
set_ftrace_early_filters();
 
return;
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index e917aa783675..bb41a70c5189 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -7548,14 +7548,23 @@ static ssize_t
 tracing_read_dyn_info(struct file *filp, char __user *ubuf,
  size_t cnt, loff_t *ppos)
 {
-   unsigned long *p = filp->private_data;
-   char buf[64]; /* Not too big for a shallow stack */
+   unsigned long *p = ftrace_update_tot_cnt;
+   ssize_t ret;
+   

Re: [PATCH 3/3] x86/ftrace: Use text_poke()

2019-10-04 Thread Daniel Bristot de Oliveira
On 04/10/2019 15:40, Steven Rostedt wrote:
> On Fri, 4 Oct 2019 10:10:47 +0200
> Daniel Bristot de Oliveira  wrote:
> 
>> [ In addition ]
>>
>> Currently, ftrace_rec entries are ordered inside the group of functions, but
>> "groups of function" are not ordered. So, the current int3 handler does a 
>> (*):
>>
>> for_each_group_of_functions:
>>  check if the ip is in the range   -> n by the number of groups.
>>  do a bsearch.                     -> log(n) by the numbers of entry
>>                                        in the group.
>>
>> If, instead, it uses an ordered vector, the complexity would be log(n) by the
>> total number of entries, which is better. So, how bad is the idea of:
> BTW, I'm currently rewriting the grouping of the vectors, in order to
> shrink the size of each dyn_ftrace_rec (as we discussed at Kernel
> Recipes). I can make the groups all sorted in doing so, thus we can
> load the sorted if that's needed, without doing anything special.
> 

Good! If you keep them sorted and store the number of entries in a variable, we
can have the groundwork done for a future "optimized" version.

-- Daniel



Re: [PATCH 3/3] x86/ftrace: Use text_poke()

2019-10-04 Thread Steven Rostedt
On Fri, 4 Oct 2019 13:22:37 +0200
Peter Zijlstra  wrote:

> On Thu, Oct 03, 2019 at 06:10:45PM -0400, Steven Rostedt wrote:
> > But still, we are going from 120 to 660 IPIs for every CPU. Not saying
> > it's a problem, but something that we should note. Someone (those that
> > don't like kernel interference) may complain.  
> 
> It is machine wide function tracing, interference is going to happen..
> :-)
> 
> Anyway, we can grow the batch size if sufficient benefit can be shown to
> exist.

Yeah, I was thinking that we just go with these patches and then fix
the IPI issue when someone starts complaining ;-)

Anyway, is this series ready to go? I can pull them in (I guess I
should get an ack from Thomas or Ingo as they are x86 specific). I'm
currently working on code that affects the same code paths as this
patch, and would like to build my changes on top of this, instead of
monkeying around with major conflicts.

-- Steve


Re: [PATCH 3/3] x86/ftrace: Use text_poke()

2019-10-04 Thread Steven Rostedt
On Fri, 4 Oct 2019 10:10:47 +0200
Daniel Bristot de Oliveira  wrote:

> [ In addition ]
> 
> Currently, ftrace_rec entries are ordered inside the group of functions, but
> "groups of function" are not ordered. So, the current int3 handler does a (*):
> 
> for_each_group_of_functions:
>   check if the ip is in the range   -> n by the number of groups.
>   do a bsearch.                     -> log(n) by the numbers of entry
>                                         in the group.
> 
> If, instead, it uses an ordered vector, the complexity would be log(n) by the
> total number of entries, which is better. So, how bad is the idea of:

BTW, I'm currently rewriting the grouping of the vectors, in order to
shrink the size of each dyn_ftrace_rec (as we discussed at Kernel
Recipes). I can make the groups all sorted in doing so, thus we can
load the sorted if that's needed, without doing anything special.

> 
>   in the enabling ftrace code path, it:
>   discover the number of entries
>   alloc a buffer
>   discover the order of the groups
>   for each group in the correct order
>   queue the entry in the buffer
>   apply the changes using the text_poke...
> 
> In this way we would optimize the two hot-paths:
>   int3 will be log(n)
>   IPIs bounded to 3.
> 
> I am not saying we need to do it now, as Steve said, not sure if this is a big
> problem, but... those that don't like kernel interference may complain. But if
> we leave the per-use-case vector in the text_poke_batch interface, things will
> be easier to fix.
> 
> NOTE: the other IPIs are generated by hooking the tracepoints and switching 
> the
> code to RO/RW...

Yeah, I did a trace of who is doing the on_each_cpu() call, and saw it
coming from the RO/RW changes, which this patch series removes.

-- Steve


>   
> * as far as I understood ftrace_location_range().
> 
> -- Daniel
> 
> > -- Steve
> >   



Re: [PATCH 3/3] x86/ftrace: Use text_poke()

2019-10-04 Thread Peter Zijlstra
On Thu, Oct 03, 2019 at 06:10:45PM -0400, Steven Rostedt wrote:
> But still, we are going from 120 to 660 IPIs for every CPU. Not saying
> it's a problem, but something that we should note. Someone (those that
> don't like kernel interference) may complain.

It is machine wide function tracing, interference is going to happen..
:-)

Anyway, we can grow the batch size if sufficient benefit can be shown to
exist.
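
Concretely, "growing the batch" would just mean spending more pages on the
static vector behind text_poke_queue(), assuming patch 2/3 sizes it at one
page, roughly:

/* assumed current form, one page worth of entries:
 *	#define TP_VEC_MAX	(PAGE_SIZE / sizeof(struct text_poke_loc))
 *	static struct text_poke_loc tp_vec[TP_VEC_MAX];
 *
 * grown form, e.g. eight pages:
 */
#define TP_VEC_MAX	(8 * PAGE_SIZE / sizeof(struct text_poke_loc))
static struct text_poke_loc tp_vec[TP_VEC_MAX];
static unsigned int tp_vec_nr;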


Re: [PATCH 3/3] x86/ftrace: Use text_poke()

2019-10-04 Thread Daniel Bristot de Oliveira
On 04/10/2019 00:10, Steven Rostedt wrote:
> On Wed, 2 Oct 2019 20:21:06 +0200
> Peter Zijlstra  wrote:
> 
>> On Wed, Oct 02, 2019 at 06:35:26PM +0200, Daniel Bristot de Oliveira wrote:
>>
>>> ftrace was already batching the updates, for instance, causing 3 IPIs to 
>>> enable
>>> all functions. The text_poke() batching also works. But because of the 
>>> limited
>>> buffer [ see the reply to the patch 2/3 ], it is flushing the buffer during 
>>> the
>>> operation, causing more IPIs than the previous code. Using the 5.4-rc1 in a 
>>> VM,
>>> when enabling the function tracer, I see 250+ intermediate 
>>> text_poke_finish()
>>> because of a full buffer...
>>>
>>> Would this be the case of trying to use a dynamically allocated buffer?
>>>
>>> Thoughts?  
>>
>> Is it a problem? I tried growing the buffer (IIRC I made it 10 times
>> bigger) and didn't see any performance improvements because of it.
> 
> I'm just worried if people are going to complain about the IPI burst.
> Although, I just tried it out before applying this patch, and there's
> still a bit of a burst. Not sure why. I did:
> 
> # cat /proc/interrupts > /tmp/before; echo function > 
> /debug/tracing/current_tracer; cat /proc/interrupts > /tmp/after
> # cat /proc/interrupts > /tmp/before1; echo nop > 
> /debug/tracing/current_tracer; cat /proc/interrupts > /tmp/after1
> 
> Before this patch:
> 
> # diff /tmp/before /tmp/after
> < CAL:   2342   2347   2116   2175   2446   2030  
>  2416      Function call interrupts
> ---
>> CAL:   2462   2467   2236   2295   2446   2150   
>> 2536   2342   Function call interrupts
> 
> (Just showing the function call interrupts)
> 
> There appears to be 120 IPIs sent to all CPUS for enabling function tracer.
> 
> # diff /tmp/before1 /tmp/after1
> < CAL:   2462   2467   2236   2295   2446   2150  
>  2536   2342   Function call interrupts
> ---
>> CAL:   2577   2582   2351   2410   2446   2265   
>> 2651   2457   Function call interrupts
> 
> And 151 IPIs for disabling it.
> 
> After applying this patch:
> 
> # diff /tmp/before /tmp/after
> < CAL:  66070  46620  59955  59236  68707  63397  
> 61644  62742   Function call interrupts
> ---
>> CAL:  66727  47277  59955  59893  69364  64054  
>> 62301  63399   Function call interrupts
> 
> # diff /tmp/before1 /tmp/after1
> < CAL:  66727  47277  59955  59893  69364  64054  
> 62301  63399   Function call interrupts
> ---
>> CAL:  67358  47938  59985  60554  70025  64715  
>> 62962  64060   Function call interrupts
> 
> 
> We get 657 IPIs for enabling function tracer, and 661 for disabling it.
> Funny how it's more on the disable than the enable with the patch but
> the other way without it.
> 
> But still, we are going from 120 to 660 IPIs for every CPU. Not saying
> it's a problem, but something that we should note. Someone (those that
> don't like kernel interference) may complain.

That is the point I was raising.

When enabling ftrace, we have three different paths:

1) the enabling/disabling ftrace path
2) the int3 path - if a thread/irq is running a kernel function
3) the IPI - that affects all CPUs, even those that are not "hitting" trace
code, e.g., user-space.

The first one is for sure a cold-path. The second one is a hot-path: any task
running kernel functions will hit it. But IMHO, the hottest one is the IPIs,
because it will run on all CPUs, e.g., even isolated CPUs that are running in
user-space.

Currently, ftrace does:

for_ftrace_rec:
	Install all breakpoints
send IPI

for_ftrace_rec:
	write the end of the instruction
send IPI

for_ftrace_rec:
	write the first byte of the instruction
send IPI

And that is the same thing we do with the batch mode, and so it would be better
to integrate both.

The problem is that, considering patch 2/3, the number of entries we can
batch in text_poke is limited, and so we batch in chunks of TP_VEC_MAX
entries. So, rather than doing 3 IPIs to change the code, we do:

(ftrace_rec count/TP_VEC_MAX) * 3 IPIs.

One possible solution for this would be to allocate a buffer with "number of
ftrace_rec" and use it in the text_poke batch mode.

But to do it, we should keep the old interface (the one that patch 2/3 is
changing). Well, at least using a per-use-case buffer.
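
A rough sketch of that direction (illustrative only: new_code_for() is a
made-up helper returning the 5-byte call/nop for a record,
text_poke_loc_init() stands in for whatever fills one batch entry, and it
assumes the batch interface accepts a caller-supplied vector of any size):

static int ftrace_replace_code_one_batch(int enable)
{
	struct ftrace_rec_iter *iter;
	struct text_poke_loc *vec;
	unsigned int nr = 0;

	/* ftrace_update_tot_cnt is used as an upper bound on the record count */
	vec = kvmalloc_array(ftrace_update_tot_cnt, sizeof(*vec), GFP_KERNEL);
	if (!vec)
		return -ENOMEM;

	for_ftrace_rec_iter(iter) {
		struct dyn_ftrace *rec = ftrace_rec_iter_record(iter);

		/* real code would skip FTRACE_UPDATE_IGNORE records here */
		text_poke_loc_init(&vec[nr++], (void *)rec->ip,
				   new_code_for(rec, enable),
				   MCOUNT_INSN_SIZE, NULL);
	}

	/* one batch -> the int3/sync IPIs happen only three times in total */
	text_poke_bp_batch(vec, nr);
	kvfree(vec);
	return 0;
}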

[ In addition ]

Currently, ftrace_rec entries are ordered inside the group of functions, but
"groups of function" are not ordered. So, the current int3 handler does a (*):

for_each_group_of_functions:
check if the ip is in the range   -> n by the number of groups.
do a bsearch.                     -> log(n) by the numbers of entry
                                     in the group.

Re: [PATCH 3/3] x86/ftrace: Use text_poke()

2019-10-03 Thread Steven Rostedt
On Wed, 2 Oct 2019 20:21:06 +0200
Peter Zijlstra  wrote:

> On Wed, Oct 02, 2019 at 06:35:26PM +0200, Daniel Bristot de Oliveira wrote:
> 
> > ftrace was already batching the updates, for instance, causing 3 IPIs to 
> > enable
> > all functions. The text_poke() batching also works. But because of the 
> > limited
> > buffer [ see the reply to the patch 2/3 ], it is flushing the buffer during 
> > the
> > operation, causing more IPIs than the previous code. Using the 5.4-rc1 in a 
> > VM,
> > when enabling the function tracer, I see 250+ intermediate 
> > text_poke_finish()
> > because of a full buffer...
> > 
> > Would this be the case of trying to use a dynamically allocated buffer?
> > 
> > Thoughts?  
> 
> Is it a problem? I tried growing the buffer (IIRC I made it 10 times
> bigger) and didn't see any performance improvements because of it.

I'm just worried if people are going to complain about the IPI burst.
Although, I just tried it out before applying this patch, and there's
still a bit of a burst. Not sure why. I did:

# cat /proc/interrupts > /tmp/before; echo function > 
/debug/tracing/current_tracer; cat /proc/interrupts > /tmp/after
# cat /proc/interrupts > /tmp/before1; echo nop > 
/debug/tracing/current_tracer; cat /proc/interrupts > /tmp/after1

Before this patch:

# diff /tmp/before /tmp/after
< CAL:   2342   2347   2116   2175   2446   2030   
2416      Function call interrupts
---
> CAL:   2462   2467   2236   2295   2446   2150   
> 2536   2342   Function call interrupts

(Just showing the function call interrupts)

There appears to be 120 IPIs sent to all CPUS for enabling function tracer.

# diff /tmp/before1 /tmp/after1
< CAL:   2462   2467   2236   2295   2446   2150   
2536   2342   Function call interrupts
---
> CAL:   2577   2582   2351   2410   2446   2265   
> 2651   2457   Function call interrupts

And 151 IPIs for disabling it.

After applying this patch:

# diff /tmp/before /tmp/after
< CAL:  66070  46620  59955  59236  68707  63397  
61644  62742   Function call interrupts
---
> CAL:  66727  47277  59955  59893  69364  64054  
> 62301  63399   Function call interrupts

# diff /tmp/before1 /tmp/after1
< CAL:  66727  47277  59955  59893  69364  64054  
62301  63399   Function call interrupts
---
> CAL:  67358  47938  59985  60554  70025  64715  
> 62962  64060   Function call interrupts


We get 657 IPIs for enabling function tracer, and 661 for disabling it.
Funny how it's more on the disable than the enable with the patch but
the other way without it.

But still, we are going from 120 to 660 IPIs for every CPU. Not saying
it's a problem, but something that we should note. Someone (those that
don't like kernel interference) may complain.

-- Steve


Re: [PATCH 3/3] x86/ftrace: Use text_poke()

2019-10-02 Thread Masami Hiramatsu
On Wed, 2 Oct 2019 18:35:26 +0200
Daniel Bristot de Oliveira  wrote:

> ftrace was already batching the updates, for instance, causing 3 IPIs to 
> enable
> all functions. The text_poke() batching also works. But because of the limited
> buffer [ see the reply to the patch 2/3 ], it is flushing the buffer during 
> the
> operation, causing more IPIs than the previous code. Using the 5.4-rc1 in a 
> VM,
> when enabling the function tracer, I see 250+ intermediate text_poke_finish()
> because of a full buffer...

Would you have any performance numbers comparing the previous code and this patch applied?

Thank you,


-- 
Masami Hiramatsu 


Re: [PATCH 3/3] x86/ftrace: Use text_poke()

2019-10-02 Thread Peter Zijlstra
On Wed, Oct 02, 2019 at 06:35:26PM +0200, Daniel Bristot de Oliveira wrote:

> ftrace was already batching the updates, for instance, causing 3 IPIs to 
> enable
> all functions. The text_poke() batching also works. But because of the limited
> buffer [ see the reply to the patch 2/3 ], it is flushing the buffer during 
> the
> operation, causing more IPIs than the previous code. Using the 5.4-rc1 in a 
> VM,
> when enabling the function tracer, I see 250+ intermediate text_poke_finish()
> because of a full buffer...
> 
> Would this be the case of trying to use a dynamically allocated buffer?
> 
> Thoughts?

Is it a problem? I tried growing the buffer (IIRC I made it 10 times
bigger) and didn't see any performance improvements because of it.


Re: [PATCH 3/3] x86/ftrace: Use text_poke()

2019-10-02 Thread Daniel Bristot de Oliveira
On 27/08/2019 20:06, Peter Zijlstra wrote:
> Move ftrace over to using the generic x86 text_poke functions; this
> avoids having a second/different copy of that code around.
> 
> This also avoids ftrace violating the (new) W^X rule and avoids
> fragmenting the kernel text page-tables, due to no longer having to
> toggle them RW.

I tested this patch, and it works... but it generates more IPIs than the
previous one.

> Cc: Steven Rostedt 
> Cc: Daniel Bristot de Oliveira 
> Signed-off-by: Peter Zijlstra (Intel) 
> ---
>  arch/x86/include/asm/ftrace.h |2 
>  arch/x86/kernel/alternative.c |4 
>  arch/x86/kernel/ftrace.c  |  630 
> ++
>  arch/x86/kernel/traps.c   |9 
>  4 files changed, 93 insertions(+), 552 deletions(-)
> 
> --- a/arch/x86/include/asm/ftrace.h
> +++ b/arch/x86/include/asm/ftrace.h

[ ... jumping to the point ...]

> -
>  void ftrace_replace_code(int enable)
>  {
>   struct ftrace_rec_iter *iter;
>   struct dyn_ftrace *rec;
> - const char *report = "adding breakpoints";
> - int count = 0;
> + const char *new, *old;
>   int ret;
>  
>   for_ftrace_rec_iter(iter) {
>   rec = ftrace_rec_iter_record(iter);
>  
> - ret = add_breakpoints(rec, enable);
> - if (ret)
> - goto remove_breakpoints;
> - count++;
> - }
> -
> - run_sync();

ftrace was already batching the updates, for instance, causing 3 IPIs to enable
all functions. The text_poke() batching also works. But because of the limited
buffer [ see the reply to the patch 2/3 ], it is flushing the buffer during the
operation, causing more IPIs than the previous code. Using the 5.4-rc1 in a VM,
when enabling the function tracer, I see 250+ intermediate text_poke_finish()
because of a full buffer...

Would this be the case of trying to use a dynamically allocated buffer?

Thoughts?

-- Daniel

> -
> - report = "updating code";
> - count = 0;
> -
> - for_ftrace_rec_iter(iter) {
> - rec = ftrace_rec_iter_record(iter);
> -
> - ret = add_update(rec, enable);
> - if (ret)
> - goto remove_breakpoints;
> - count++;
> + switch (ftrace_test_record(rec, enable)) {
> + case FTRACE_UPDATE_IGNORE:
> + default:
> + continue;
> +
> + case FTRACE_UPDATE_MAKE_CALL:
> + old = ftrace_nop_replace();
> + break;
> +
> + case FTRACE_UPDATE_MODIFY_CALL:
> + case FTRACE_UPDATE_MAKE_NOP:
> + old = ftrace_call_replace(rec->ip, ftrace_get_addr_curr(rec));
> + break;
> + };
> +
> + ret = ftrace_verify_code(rec->ip, old);
> + if (ret) {
> + ftrace_bug(ret, rec);
> + return;
> + }
>   }
>  
> - run_sync();
> -
> - report = "removing breakpoints";
> - count = 0;
> -
>   for_ftrace_rec_iter(iter) {
>   rec = ftrace_rec_iter_record(iter);
>  
> - ret = finish_update(rec, enable);
> - if (ret)
> - goto remove_breakpoints;
> - count++;
> - }
> -
> - run_sync();
> -
> - return;
> + switch (ftrace_test_record(rec, enable)) {
> + case FTRACE_UPDATE_IGNORE:
> + default:
> + continue;
> +
> + case FTRACE_UPDATE_MAKE_CALL:
> + case FTRACE_UPDATE_MODIFY_CALL:
> + new = ftrace_call_replace(rec->ip, ftrace_get_addr_new(rec));
> + break;
> +
> + case FTRACE_UPDATE_MAKE_NOP:
> + new = ftrace_nop_replace();
> + break;
> + };
>  
> - remove_breakpoints:
> - pr_warn("Failed on %s (%d):\n", report, count);
> - ftrace_bug(ret, rec);
> - for_ftrace_rec_iter(iter) {
> - rec = ftrace_rec_iter_record(iter);
> - /*
> -  * Breakpoints are handled only when this function is in
> -  * progress. The system could not work with them.
> -  */
> - if (remove_breakpoint(rec))
> - BUG();
> + text_poke_queue((void *)rec->ip, new, MCOUNT_INSN_SIZE, NULL);
> + ftrace_update_record(rec, enable);
>   }
> - run_sync();
> -}
> -
> -static int
> -ftrace_modify_code(unsigned long ip, unsigned const char *old_code,
> -unsigned const char *new_code)
> -{
> - int ret;
> -
> - ret = add_break(ip, old_code);
> - if (ret)
> - goto out;
> -
> - run_sync();
> -
> - ret = add_update_code(ip, new_code);
> - if (ret)
> - goto fail_update;
> -
> - run_sync();
> -
> - ret = ftrace_write(ip, new_code, 1);
> - /*
> -  * The breakpoint is handled 

[PATCH 3/3] x86/ftrace: Use text_poke()

2019-08-27 Thread Peter Zijlstra
Move ftrace over to using the generic x86 text_poke functions; this
avoids having a second/different copy of that code around.

This also avoids ftrace violating the (new) W^X rule and avoids
fragmenting the kernel text page-tables, due to no longer having to
toggle them RW.

Cc: Steven Rostedt 
Cc: Daniel Bristot de Oliveira 
Signed-off-by: Peter Zijlstra (Intel) 
---
 arch/x86/include/asm/ftrace.h |2 
 arch/x86/kernel/alternative.c |4 
 arch/x86/kernel/ftrace.c  |  630 ++
 arch/x86/kernel/traps.c   |9 
 4 files changed, 93 insertions(+), 552 deletions(-)

--- a/arch/x86/include/asm/ftrace.h
+++ b/arch/x86/include/asm/ftrace.h
@@ -35,8 +35,6 @@ struct dyn_arch_ftrace {
/* No extra data needed for x86 */
 };
 
-int ftrace_int3_handler(struct pt_regs *regs);
-
 #define FTRACE_GRAPH_TRAMP_ADDR FTRACE_GRAPH_ADDR
 
 #endif /*  CONFIG_DYNAMIC_FTRACE */
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -941,7 +941,7 @@ static struct bp_patching_desc {
int nr_entries;
 } bp_patching;
 
-static int patch_cmp(const void *key, const void *elt)
+static int notrace patch_cmp(const void *key, const void *elt)
 {
struct text_poke_loc *tp = (struct text_poke_loc *) elt;
 
@@ -953,7 +953,7 @@ static int patch_cmp(const void *key, co
 }
 NOKPROBE_SYMBOL(patch_cmp);
 
-int poke_int3_handler(struct pt_regs *regs)
+int notrace poke_int3_handler(struct pt_regs *regs)
 {
struct text_poke_loc *tp;
void *ip;
--- a/arch/x86/kernel/ftrace.c
+++ b/arch/x86/kernel/ftrace.c
@@ -43,16 +43,12 @@ int ftrace_arch_code_modify_prepare(void
 * ftrace has it set to "read/write".
 */
mutex_lock(&text_mutex);
-   set_kernel_text_rw();
-   set_all_modules_text_rw();
return 0;
 }
 
 int ftrace_arch_code_modify_post_process(void)
 __releases(&text_mutex)
 {
-   set_all_modules_text_ro();
-   set_kernel_text_ro();
mutex_unlock(&text_mutex);
return 0;
 }
@@ -60,67 +56,34 @@ int ftrace_arch_code_modify_post_process
 union ftrace_code_union {
char code[MCOUNT_INSN_SIZE];
struct {
-   unsigned char op;
+   char op;
int offset;
} __attribute__((packed));
 };
 
-static int ftrace_calc_offset(long ip, long addr)
-{
-   return (int)(addr - ip);
-}
-
-static unsigned char *
-ftrace_text_replace(unsigned char op, unsigned long ip, unsigned long addr)
+static const char *ftrace_text_replace(char op, unsigned long ip, unsigned long addr)
 {
static union ftrace_code_union calc;
 
-   calc.op = op;
-   calc.offset = ftrace_calc_offset(ip + MCOUNT_INSN_SIZE, addr);
+   calc.op = op;
+   calc.offset = (int)(addr - (ip + MCOUNT_INSN_SIZE));
 
return calc.code;
 }
 
-static unsigned char *
-ftrace_call_replace(unsigned long ip, unsigned long addr)
-{
-   return ftrace_text_replace(0xe8, ip, addr);
-}
-
-static inline int
-within(unsigned long addr, unsigned long start, unsigned long end)
-{
-   return addr >= start && addr < end;
-}
-
-static unsigned long text_ip_addr(unsigned long ip)
+static const char *ftrace_nop_replace(void)
 {
-   /*
-* On x86_64, kernel text mappings are mapped read-only, so we use
-* the kernel identity mapping instead of the kernel text mapping
-* to modify the kernel text.
-*
-* For 32bit kernels, these mappings are same and we can use
-* kernel identity mapping to modify code.
-*/
-   if (within(ip, (unsigned long)_text, (unsigned long)_etext))
-   ip = (unsigned long)__va(__pa_symbol(ip));
-
-   return ip;
+   return ideal_nops[NOP_ATOMIC5];
 }
 
-static const unsigned char *ftrace_nop_replace(void)
+static const char *ftrace_call_replace(unsigned long ip, unsigned long addr)
 {
-   return ideal_nops[NOP_ATOMIC5];
+   return ftrace_text_replace(CALL_INSN_OPCODE, ip, addr);
 }
 
-static int
-ftrace_modify_code_direct(unsigned long ip, unsigned const char *old_code,
-  unsigned const char *new_code)
+static int ftrace_verify_code(unsigned long ip, const char *old_code)
 {
-   unsigned char replaced[MCOUNT_INSN_SIZE];
-
-   ftrace_expected = old_code;
+   char cur_code[MCOUNT_INSN_SIZE];
 
/*
 * Note:
@@ -129,31 +92,38 @@ ftrace_modify_code_direct(unsigned long
 * Carefully read and modify the code with probe_kernel_*(), and make
 * sure what we read is what we expected it to be before modifying it.
 */
-
/* read the text we want to modify */
-   if (probe_kernel_read(replaced, (void *)ip, MCOUNT_INSN_SIZE))
+   if (probe_kernel_read(cur_code, (void *)ip, MCOUNT_INSN_SIZE)) {
+   WARN_ON(1);
return -EFAULT;
+   }
 
/* Make sure it is what we expect it to be */
-   if (memcmp(replaced, old_code, MCOUNT_INSN_SIZE) != 0)
+   if 

Re: [RFC][PATCH 3/3] x86/ftrace: Use text_poke()

2019-08-26 Thread Peter Zijlstra
On Mon, Aug 26, 2019 at 02:51:41PM +0200, Peter Zijlstra wrote:
> Move ftrace over to using the generic x86 text_poke functions; this
> avoids having a second/different copy of that code around.
> 
> Cc: Daniel Bristot de Oliveira 
> Cc: Steven Rostedt 
> Signed-off-by: Peter Zijlstra (Intel) 

*sigh*.. one of those days again. I seem to have tested without this
patch applied.

Let me go fix this.


[RFC][PATCH 3/3] x86/ftrace: Use text_poke()

2019-08-26 Thread Peter Zijlstra
Move ftrace over to using the generic x86 text_poke functions; this
avoids having a second/different copy of that code around.

Cc: Daniel Bristot de Oliveira 
Cc: Steven Rostedt 
Signed-off-by: Peter Zijlstra (Intel) 
---
 arch/x86/include/asm/ftrace.h |2 
 arch/x86/kernel/ftrace.c  |  571 +++---
 arch/x86/kernel/traps.c   |9 
 3 files changed, 51 insertions(+), 531 deletions(-)

--- a/arch/x86/include/asm/ftrace.h
+++ b/arch/x86/include/asm/ftrace.h
@@ -35,8 +35,6 @@ struct dyn_arch_ftrace {
/* No extra data needed for x86 */
 };
 
-int ftrace_int3_handler(struct pt_regs *regs);
-
 #define FTRACE_GRAPH_TRAMP_ADDR FTRACE_GRAPH_ADDR
 
 #endif /*  CONFIG_DYNAMIC_FTRACE */
--- a/arch/x86/kernel/ftrace.c
+++ b/arch/x86/kernel/ftrace.c
@@ -43,16 +43,17 @@ int ftrace_arch_code_modify_prepare(void
 * ftrace has it set to "read/write".
 */
mutex_lock(&text_mutex);
-   set_kernel_text_rw();
-   set_all_modules_text_rw();
+   WARN_ON_ONCE(tp_vec_nr);
return 0;
 }
 
 int ftrace_arch_code_modify_post_process(void)
 __releases(&text_mutex)
 {
-   set_all_modules_text_ro();
-   set_kernel_text_ro();
+   if (tp_vec_nr) {
+   text_poke_bp_batch(tp_vec, tp_vec_nr);
+   tp_vec_nr = 0;
+   }
mutex_unlock(&text_mutex);
return 0;
 }
@@ -60,67 +61,36 @@ int ftrace_arch_code_modify_post_process
 union ftrace_code_union {
char code[MCOUNT_INSN_SIZE];
struct {
-   unsigned char op;
+   char op;
int offset;
} __attribute__((packed));
 };
 
-static int ftrace_calc_offset(long ip, long addr)
-{
-   return (int)(addr - ip);
-}
-
-static unsigned char *
-ftrace_text_replace(unsigned char op, unsigned long ip, unsigned long addr)
+static char *ftrace_text_replace(char op, unsigned long ip, unsigned long addr)
 {
static union ftrace_code_union calc;
 
-   calc.op = op;
-   calc.offset = ftrace_calc_offset(ip + MCOUNT_INSN_SIZE, addr);
+   calc.op = op;
+   calc.offset = (int)(addr - (ip + MCOUNT_INSN_SIZE));
 
return calc.code;
 }
 
-static unsigned char *
-ftrace_call_replace(unsigned long ip, unsigned long addr)
-{
-   return ftrace_text_replace(0xe8, ip, addr);
-}
-
-static inline int
-within(unsigned long addr, unsigned long start, unsigned long end)
+static char *ftrace_nop_replace(void)
 {
-   return addr >= start && addr < end;
-}
-
-static unsigned long text_ip_addr(unsigned long ip)
-{
-   /*
-* On x86_64, kernel text mappings are mapped read-only, so we use
-* the kernel identity mapping instead of the kernel text mapping
-* to modify the kernel text.
-*
-* For 32bit kernels, these mappings are same and we can use
-* kernel identity mapping to modify code.
-*/
-   if (within(ip, (unsigned long)_text, (unsigned long)_etext))
-   ip = (unsigned long)__va(__pa_symbol(ip));
-
-   return ip;
+   return ideal_nops[NOP_ATOMIC5];
 }
 
-static const unsigned char *ftrace_nop_replace(void)
+static char *ftrace_call_replace(unsigned long ip, unsigned long addr)
 {
-   return ideal_nops[NOP_ATOMIC5];
+   return ftrace_text_replace(CALL_INSN_OPCODE, ip, addr);
 }
 
-static int
-ftrace_modify_code_direct(unsigned long ip, unsigned const char *old_code,
-  unsigned const char *new_code)
+static int
+ftrace_modify_code_direct(unsigned long ip, const char *old_code,
+ const char *new_code)
 {
-   unsigned char replaced[MCOUNT_INSN_SIZE];
-
-   ftrace_expected = old_code;
+   char cur_code[MCOUNT_INSN_SIZE];
 
/*
 * Note:
@@ -129,31 +99,23 @@ ftrace_modify_code_direct(unsigned long
 * Carefully read and modify the code with probe_kernel_*(), and make
 * sure what we read is what we expected it to be before modifying it.
 */
-
/* read the text we want to modify */
-   if (probe_kernel_read(replaced, (void *)ip, MCOUNT_INSN_SIZE))
+   if (probe_kernel_read(cur_code, (void *)ip, MCOUNT_INSN_SIZE))
return -EFAULT;
 
/* Make sure it is what we expect it to be */
-   if (memcmp(replaced, old_code, MCOUNT_INSN_SIZE) != 0)
+   if (memcmp(cur_code, old_code, MCOUNT_INSN_SIZE) != 0)
return -EINVAL;
 
-   ip = text_ip_addr(ip);
-
/* replace the text with the new text */
-   if (probe_kernel_write((void *)ip, new_code, MCOUNT_INSN_SIZE))
-   return -EPERM;
-
-   sync_core();
-
+   text_poke_early((void *)ip, new_code, MCOUNT_INSN_SIZE);
return 0;
 }
 
-int ftrace_make_nop(struct module *mod,
-   struct dyn_ftrace *rec, unsigned long addr)
+int ftrace_make_nop(struct module *mod, struct dyn_ftrace *rec, unsigned long addr)
 {
-   unsigned const char *new, *old;
unsigned long