Re: How to trace kvm:kvm_exit using fprobe?

2024-05-30 Thread Google
Hi don,

Thanks for reporting!
I confirmed that the raw tracepoint probe event does not work on
tracepoints in modules (not only kvm but other modules as well).
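
For reference, the tracepoint lookup used for the 't' probe events only walks
the built-in tracepoint section, roughly like the sketch below (a simplified
sketch based on kernel/trace/trace_fprobe.c; exact details may differ), which
is why tracepoints registered by modules are never found:

struct __find_tracepoint_cb_data {
        const char *tp_name;
        struct tracepoint *tpoint;
};

static void __find_tracepoint_cb(struct tracepoint *tp, void *priv)
{
        struct __find_tracepoint_cb_data *data = priv;

        if (!data->tpoint && !strcmp(data->tp_name, tp->name))
                data->tpoint = tp;
}

static struct tracepoint *find_tracepoint(const char *tp_name)
{
        struct __find_tracepoint_cb_data data = {
                .tp_name = tp_name,
        };

        /* Iterates built-in tracepoints only; module tracepoints are skipped. */
        for_each_kernel_tracepoint(__find_tracepoint_cb, &data);

        return data.tpoint;
}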

Let me fix the issue.

Thank you,

On Tue, 21 May 2024 17:28:48 -0700
don  wrote:

> Hi, Mr. Masami Hiramatsu,
> I saw your presentation "Function Parameters with BTF Function tracing with
> parameters", and I started using fprobe and dynamic_events to trace functions.
> 
> I have a question regarding tracepoints using fprobe/dynamic_events.
> I can trace the kvm:kvm_exit event using trace-cmd:
> 
> trace-cmd stream -e kvm:kvm_exit
> 
> But if I echo into dynamic_events, I get an "Invalid argument" error:
> 
> # cd /sys/kernel/debug/tracing
> # echo 't:kvm kvm_exit' >> dynamic_events
> -bash: echo: write error: Invalid argument
> 
> How to solve this problem?
> 
> Thanks,
> don


-- 
Masami Hiramatsu (Google) 



Re: [PATCH 0/3] tracing: Fix some selftest issues

2024-05-28 Thread Google
On Wed, 29 May 2024 01:46:40 +0900
Masami Hiramatsu (Google)  wrote:

> On Mon, 27 May 2024 19:29:07 -0400
> Steven Rostedt  wrote:
> 
> > On Sun, 26 May 2024 19:10:57 +0900
> > "Masami Hiramatsu (Google)"  wrote:
> > 
> > > Hi,
> > > 
> > > Here is a series of some fixes/improvements for the test modules and boot
> > > time selftest of kprobe events. I found a WARNING message with some boot
> > > time selftest configuration, which came from the combination of embedded
> > > kprobe generate API tests module and ftrace boot-time selftest. So the main
> > > problem is that the test module should not be built-in. But I also think
> > > this WARNING message is useless (because there are warning messages already)
> > > and the cleanup code is redundant. This series fixes those issues.
> > 
> > Note, when I enable trace tests as builtin instead of modules, I just
> > disable the bootup self tests when it detects this. This helps with
> > doing tests via config options than having to add user space code that
> > loads modules.
> > 
> > Could you do something similar?
> 
> OK, in that case, I would like to move the test cleanup code from the
> module_exit function to the end of the module_init function.
> It looks like there is no reason to split those into two parts.

Wait, I would like to hear Tom's opinion. I found the following usage comments
in the code.

 * Following that are a few examples using the created events to test
 * various ways of tracing a synthetic event.
 *
 * To test, select CONFIG_SYNTH_EVENT_GEN_TEST and build the module.
 * Then:
 *
 * # insmod kernel/trace/synth_event_gen_test.ko
 * # cat /sys/kernel/tracing/trace
 *
 * You should see several events in the trace buffer -
 * "create_synth_test", "empty_synth_test", and several instances of
 * "gen_synth_test".
 *
 * To remove the events, remove the module:
 *
 * # rmmod synth_event_gen_test

Tom, is that the intended behavior? And do you expect these events to be
reused outside of the module, e.g. by loading the test module and running
some user-space test script which uses those events?

As far as I can see, those tests are not intended to be embedded in the
kernel, because those events are expected to be removed.

Thank you,

> 
> Thank you,
> 
> > 
> > -- Steve
> > 
> > 
> > > 
> > > Thank you,
> > > 
> > > ---
> > > 
> > > Masami Hiramatsu (Google) (3):
> > >   tracing: Build event generation tests only as modules
> > >   tracing/kprobe: Remove unneeded WARN_ON_ONCE() in selftests
> > >   tracing/kprobe: Remove cleanup code unrelated to selftest
> > > 
> 
> 
> -- 
> Masami Hiramatsu (Google) 


-- 
Masami Hiramatsu (Google) 



Re: [PATCH 0/3] tracing: Fix some selftest issues

2024-05-28 Thread Google
On Mon, 27 May 2024 19:29:07 -0400
Steven Rostedt  wrote:

> On Sun, 26 May 2024 19:10:57 +0900
> "Masami Hiramatsu (Google)"  wrote:
> 
> > Hi,
> > 
> > Here is a series of some fixes/improvements for the test modules and boot
> > time selftest of kprobe events. I found a WARNING message with some boot 
> > time selftest configuration, which came from the combination of embedded
> > kprobe generate API tests module and ftrace boot-time selftest. So the main
> > problem is that the test module should not be built-in. But I also think
> > this WARNING message is useless (because there are warning messages already)
> > and the cleanup code is redundant. This series fixes those issues.
> 
> Note, when I enable trace tests as builtin instead of modules, I just
> disable the bootup self tests when it detects this. This helps with
> doing tests via config options than having to add user space code that
> loads modules.
> 
> Could you do something similar?

OK, in that case, I would like to move the test cleanup code from the
module_exit function to the end of the module_init function.
It looks like there is no reason to split those into two parts.

Thank you,

> 
> -- Steve
> 
> 
> > 
> > Thank you,
> > 
> > ---
> > 
> > Masami Hiramatsu (Google) (3):
> >   tracing: Build event generation tests only as modules
> >   tracing/kprobe: Remove unneeded WARN_ON_ONCE() in selftests
> >   tracing/kprobe: Remove cleanup code unrelated to selftest
> > 


-- 
Masami Hiramatsu (Google) 



Re: [PATCH 1/2] objpool: enable inlining objpool_push() and objpool_pop() operations

2024-05-28 Thread Google
Hi,

Sorry for late reply.

On Fri, 10 May 2024 10:20:56 +0200
Vlastimil Babka  wrote:

> On 5/10/24 9:59 AM, wuqiang.matt wrote:
> > On 2024/5/7 21:55, Vlastimil Babka wrote:
>  >>
> >>> + } while (!try_cmpxchg_acquire(&slot->tail, &tail, tail + 1));
> >>> +
> >>> + /* now the tail position is reserved for the given obj */
> >>> + WRITE_ONCE(slot->entries[tail & slot->mask], obj);
> >>> + /* update sequence to make this obj available for pop() */
> >>> + smp_store_release(&slot->last, tail + 1);
> >>> +
> >>> + return 0;
> >>> +}
> >>>   
> >>>   /**
> >>>* objpool_push() - reclaim the object and return back to objpool
> >>> @@ -134,7 +219,19 @@ void *objpool_pop(struct objpool_head *pool);
> >>>* return: 0 or error code (it fails only when user tries to push
> >>>* the same object multiple times or wrong "objects" into objpool)
> >>>*/
> >>> -int objpool_push(void *obj, struct objpool_head *pool);
> >>> +static inline int objpool_push(void *obj, struct objpool_head *pool)
> >>> +{
> >>> + unsigned long flags;
> >>> + int rc;
> >>> +
> >>> + /* disable local irq to avoid preemption & interruption */
> >>> + raw_local_irq_save(flags);
> >>> + rc = __objpool_try_add_slot(obj, pool, raw_smp_processor_id());
> >> 
> >> And IIUC, we could in theory objpool_pop() on one cpu, then later another
> >> cpu might do objpool_push() and cause the latter cpu's pool to go over
> >> capacity? Is there some implicit requirements of objpool users to take care
> >> of having matched cpu for pop and push? Are the current objpool users
> >> obeying this requirement? (I can see the selftests do, not sure about the
> >> actual users).
> >> Or am I missing something? Thanks.
> > 
> > The objects are all pre-allocated along with creation of the new objpool
> > and the total number of objects never exceeds the capacity on local node.
> 
> Aha, I see, the capacity of entries is enough to hold objects from all nodes
> in the most unfortunate case they all end up freed from a single cpu.
> 
> > So objpool_push() would always find an available slot from the ring-array
> > for the given object to insert back. objpool_pop() would try looping all
> > the percpu slots until an object is found or whole objpool is empty.
> 
> So it's correct, but seems rather wasteful to have the whole capacity for
> entries replicated on every cpu? It does make objpool_push() simple and
> fast, but as you say, objpool_pop() still has to search potentially all
> non-local percpu slots, with disabled irqs, which is far from ideal.

For the kretprobe/fprobe use-case, it is important to push (return) the object
fast. We can reserve enough objects when registering, but the push operation
will always happen on a random CPU.
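
For reference, the pop side roughly looks like the sketch below (simplified;
the helper name __objpool_try_get_slot and other details are assumptions for
illustration). It scans per-CPU slots with IRQs disabled until it finds an
object, which is why the search cost Vlastimil mentions matters:

static inline void *objpool_pop_sketch(struct objpool_head *pool)
{
        unsigned long flags;
        void *obj;
        int cpu;

        /* disable local irq to avoid being preempted or migrated */
        raw_local_irq_save(flags);

        /* try the local slot first, then fall back to every possible CPU */
        obj = __objpool_try_get_slot(pool, raw_smp_processor_id());
        if (!obj) {
                for_each_possible_cpu(cpu) {
                        obj = __objpool_try_get_slot(pool, cpu);
                        if (obj)
                                break;
                }
        }

        raw_local_irq_restore(flags);
        return obj;
}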

> 
> And the "abort if the slot was already full" comment for
> objpool_try_add_slot() seems still misleading? Maybe that was your initial
> idea but changed later?

Ah, it should not happen...

> 
> > Currently kretprobe is the only actual usecase of objpool.

Note that fprobe also uses this objpool, but currently I'm working on
integrating fprobe with the function-graph tracer[1], which will make fprobe
stop using objpool. I'm also planning to replace kretprobe with the new
fprobe eventually. So if SLUB will use objpool for frontend caching, it
sounds good to me. (Maybe it can speed up object allocation/free.)

> > 
> > I'm testing an updated objpool in our HIDS project for critical paths,
> > which is widely deployed on servers inside my company. The new version
> > eliminates the raw_local_irq_save and raw_local_irq_restore pair in
> > objpool_push and gains up to 5% of performance boost.
> 
> Mind Ccing me and linux-mm once you are posting that?

Can you add me too?

Thank you,

> 
> Thanks,
> Vlastimil
> 


-- 
Masami Hiramatsu (Google) 



Re: [PATCH v2 2/2] selftests/user_events: Add non-spacing separator check

2024-05-27 Thread Google
On Tue, 23 Apr 2024 16:23:38 +
Beau Belgrave  wrote:

> The ABI documentation indicates that field separators do not need a
> space between them, only a ';'. When no spacing is used, the register
> must work. Any subsequent register, with or without spaces, must match
> and not return -EADDRINUSE.
> 
> Add a non-spacing separator case to our self-test register case to ensure
> it works going forward.
> 

Looks good to me.

Acked-by: Masami Hiramatsu (Google) 

Thanks!

> Signed-off-by: Beau Belgrave 
> ---
>  tools/testing/selftests/user_events/ftrace_test.c | 8 
>  1 file changed, 8 insertions(+)
> 
> diff --git a/tools/testing/selftests/user_events/ftrace_test.c 
> b/tools/testing/selftests/user_events/ftrace_test.c
> index dcd7509fe2e0..0bb46793dcd4 100644
> --- a/tools/testing/selftests/user_events/ftrace_test.c
> +++ b/tools/testing/selftests/user_events/ftrace_test.c
> @@ -261,6 +261,12 @@ TEST_F(user, register_events) {
>   ASSERT_EQ(0, ioctl(self->data_fd, DIAG_IOCSREG, &reg));
>   ASSERT_EQ(0, reg.write_index);
>  
> + /* Register without separator spacing should still match */
> + reg.enable_bit = 29;
> + reg.name_args = (__u64)"__test_event u32 field1;u32 field2";
> + ASSERT_EQ(0, ioctl(self->data_fd, DIAG_IOCSREG, &reg));
> + ASSERT_EQ(0, reg.write_index);
> +
>   /* Multiple registers to same name but different args should fail */
>   reg.enable_bit = 29;
>   reg.name_args = (__u64)"__test_event u32 field1;";
> @@ -288,6 +294,8 @@ TEST_F(user, register_events) {
>   ASSERT_EQ(0, ioctl(self->data_fd, DIAG_IOCSUNREG, &unreg));
>   unreg.disable_bit = 30;
>   ASSERT_EQ(0, ioctl(self->data_fd, DIAG_IOCSUNREG, &unreg));
> + unreg.disable_bit = 29;
> + ASSERT_EQ(0, ioctl(self->data_fd, DIAG_IOCSUNREG, &unreg));
>  
>   /* Delete should have been auto-done after close and unregister */
>   close(self->data_fd);
> -- 
> 2.34.1
> 


-- 
Masami Hiramatsu (Google) 



Re: [PATCH v2] tracing/probes: fix error check in parse_btf_field()

2024-05-27 Thread Google
On Mon, 27 May 2024 11:43:52 +0200
Carlos López  wrote:

> btf_find_struct_member() might return NULL or an error via the
> ERR_PTR() macro. However, its caller in parse_btf_field() only checks
> for the NULL condition. Fix this by using IS_ERR() and returning the
> error up the stack.
> 

Thanks for updating! This version looks good to me.
Let me pick this to probes/fixes.

Thank you,


> Fixes: c440adfbe3025 ("tracing/probes: Support BTF based data structure field 
> access")
> Signed-off-by: Carlos López 
> ---
> v2: added call to trace_probe_log_err()
> 
>  kernel/trace/trace_probe.c | 4 
>  1 file changed, 4 insertions(+)
> 
> diff --git a/kernel/trace/trace_probe.c b/kernel/trace/trace_probe.c
> index 5e263c141574..39877c80d6cb 100644
> --- a/kernel/trace/trace_probe.c
> +++ b/kernel/trace/trace_probe.c
> @@ -554,6 +554,10 @@ static int parse_btf_field(char *fieldname, const struct 
> btf_type *type,
>   anon_offs = 0;
>   field = btf_find_struct_member(ctx->btf, type, 
> fieldname,
>  &anon_offs);
> + if (IS_ERR(field)) {
> + trace_probe_log_err(ctx->offset, BAD_BTF_TID);
> + return PTR_ERR(field);
> + }
>   if (!field) {
>   trace_probe_log_err(ctx->offset, NO_BTF_FIELD);
>   return -ENOENT;
> -- 
> 2.35.3
> 


-- 
Masami Hiramatsu (Google) 



Re: [PATCH 00/20] function_graph: Allow multiple users for function graph tracing

2024-05-26 Thread Google
On Fri, 24 May 2024 22:36:52 -0400
Steven Rostedt  wrote:

> [
>   Resend for some of you as I missed a comma in the Cc and
>   quilt died sending this out.
> ]
> 
> This is a continuation of the function graph multi user code.
> I wrote a proof of concept back in 2019 of this code[1] and
> Masami started cleaning it up. I started from Masami's work v10
> that can be found here:
> 
>  
> https://lore.kernel.org/linux-trace-kernel/171509088006.162236.7227326999861366050.stgit@devnote2/
> 
> This is *only* the code that allows multiple users of function
> graph tracing. This is not the fprobe work that Masami is working
> to add on top of it. As Masami took my proof of concept, there
> was still several things I disliked about that code. Instead of
> having Masami clean it up even more, I decided to take over on just
> my code and change it up a bit.
> 
> The biggest changes from where Masami left off is mostly renaming more
> variables, macros, and function names. I fixed up the current comments
> and added more to make the code a bit more understandable.
> 
> At the end of the series, I added two patches to optimize the entry
> and exit. On entry, there was a loop that iterated the 16 elements
> of the fgraph_array[] looking for any that may have a gops registered
> to it. It's quite a waste to do that loop if there's only one
> registered user. To fix that, I added a fgraph_array_bitmask that has
> its bits set that correspond to the elements of the array. Then
> a simple for_each_set_bit() is used for the iteration. I do the same
> thing at the exit callback of the function where it iterates over the
> bits of the bitmap saved on the ret_stack.
> 
> I also noticed that Masami added code to handle tail calls in the
> unwinder and had that in one of my patches. I took that code out
> and made it a separate patch with Masami as the author.
> 
> The diff between this and Masami's last update is at the end of this email.

Thanks for updating the patches!
I think your changes are good. I just have some comments and have replied to them.
So, the plan is to push this series to tracing/for-next? I will
port my fprobe part on top of it and run some tests.

Thank you,

> 
> Based on Linus commit: 0eb03c7e8e2a4cc3653eb5eeb2d2001182071215
> 
> [1] https://lore.kernel.org/all/20190525031633.811342...@goodmis.org/
> 
> Masami Hiramatsu (Google) (3):
>   function_graph: Handle tail calls for stack unwinding
>   function_graph: Use a simple LRU for fgraph_array index number
>   ftrace: Add multiple fgraph storage selftest
> 
> Steven Rostedt (Google) (2):
>   function_graph: Use for_each_set_bit() in __ftrace_return_to_handler()
>   function_graph: Use bitmask to loop on fgraph entry
> 
> Steven Rostedt (VMware) (15):
>   function_graph: Convert ret_stack to a series of longs
>   fgraph: Use BUILD_BUG_ON() to make sure we have structures divisible by 
> long
>   function_graph: Add an array structure that will allow multiple 
> callbacks
>   function_graph: Allow multiple users to attach to function graph
>   function_graph: Remove logic around ftrace_graph_entry and return
>   ftrace/function_graph: Pass fgraph_ops to function graph callbacks
>   ftrace: Allow function_graph tracer to be enabled in instances
>   ftrace: Allow ftrace startup flags to exist without dynamic ftrace
>   function_graph: Have the instances use their own ftrace_ops for 
> filtering
>   function_graph: Add "task variables" per task for fgraph_ops
>   function_graph: Move set_graph_function tests to shadow stack global var
>   function_graph: Move graph depth stored data to shadow stack global var
>   function_graph: Move graph notrace bit to shadow stack global var
>   function_graph: Implement fgraph_reserve_data() and 
> fgraph_retrieve_data()
>   function_graph: Add selftest for passing local variables
> 
> 
>  arch/arm64/kernel/ftrace.c   |  21 +-
>  arch/loongarch/kernel/ftrace_dyn.c   |  15 +-
>  arch/powerpc/kernel/trace/ftrace.c   |   3 +-
>  arch/riscv/kernel/ftrace.c   |  15 +-
>  arch/x86/kernel/ftrace.c |  19 +-
>  include/linux/ftrace.h   |  42 +-
>  include/linux/sched.h|   2 +-
>  include/linux/trace_recursion.h  |  39 --
>  kernel/trace/fgraph.c| 994 
> ---
>  kernel/trace/ftrace.c|  11 +-
>  kernel/trace/ftrace_internal.h   |   2 -
>  kernel/trace/trace.h |  94 +++-
>  kernel/trace/trace_functions.c   |   8 +
>  kernel/trace/trace_functions_graph.c |  96 ++--
>  kernel/trace/trace_irqsoff.c |  10 +-
>  kernel/trace/trace_sched_wa

Re: [PATCH 04/20] function_graph: Allow multiple users to attach to function graph

2024-05-26 Thread Google
On Fri, 24 May 2024 22:36:56 -0400
Steven Rostedt  wrote:

> From: "Steven Rostedt (VMware)" 
> 
> Allow for multiple users to attach to function graph tracer at the same
> time. Only 16 simultaneous users can attach to the tracer. This is because
> there's an array that stores the pointers to the attached fgraph_ops. When
> a function being traced is entered, each of the ftrace_ops entryfunc is
> called and if it returns non zero, its index into the array will be added
> to the shadow stack.
> 
> On exit of the function being traced, the shadow stack will contain the
> indexes of the ftrace_ops on the array that want their retfunc to be
> called.
> 
> Because a function may sleep for a long time (if a task sleeps itself),
> the return of the function may be literally days later. If the ftrace_ops
> is removed, its place on the array is replaced with a ftrace_ops that
> contains the stub functions and that will be called when the function
> finally returns.
> 
> If another ftrace_ops is added that happens to get the same index into the
> array, its return function may be called. But that's actually the way
> things currently work with the old function graph tracer. If one tracer is
> removed and another is added, the new one will get the return calls of the
> function traced by the previous one, thus this is not a regression. This
> can be fixed by adding a counter to each time the array item is updated and
> save that on the shadow stack as well, such that it won't be called if the
> index saved does not match the index on the array.
> 
> Note, being able to filter functions when both are called is not completely
> handled yet, but that shouldn't be too hard to manage.
> 
> Co-developed with Masami Hiramatsu:
> Link: 
> https://lore.kernel.org/linux-trace-kernel/171509096221.162236.8806372072523195752.stgit@devnote2
> 

Thanks for updating this. I have some comments below.

> Signed-off-by: Steven Rostedt (VMware) 
> Signed-off-by: Masami Hiramatsu (Google) 
> Signed-off-by: Steven Rostedt (Google) 
> ---
[...]

> @@ -110,11 +253,13 @@ void ftrace_graph_stop(void)
>  /* Add a function return address to the trace stack on thread info.*/
>  static int
>  ftrace_push_return_trace(unsigned long ret, unsigned long func,
> -  unsigned long frame_pointer, unsigned long *retp)
> +  unsigned long frame_pointer, unsigned long *retp,
> +  int fgraph_idx)

We do not need this fgraph_idx parameter anymore because this patch removed
the reuse-frame check.

>  {
>   struct ftrace_ret_stack *ret_stack;
>   unsigned long long calltime;
> - int index;
> + unsigned long val;
> + int offset;
>  
>   if (unlikely(ftrace_graph_is_dead()))
>   return -EBUSY;
> @@ -124,24 +269,57 @@ ftrace_push_return_trace(unsigned long ret, unsigned 
> long func,
>  
>   BUILD_BUG_ON(SHADOW_STACK_SIZE % sizeof(long));
>  
> + /* Set val to "reserved" with the delta to the new fgraph frame */
> + val = (FGRAPH_TYPE_RESERVED << FGRAPH_TYPE_SHIFT) | FGRAPH_FRAME_OFFSET;
> +
>   /*
>* We must make sure the ret_stack is tested before we read
>* anything else.
>*/
>   smp_rmb();
>  
> - /* The return trace stack is full */
> - if (current->curr_ret_stack >= SHADOW_STACK_MAX_INDEX) {
> + /*
> +  * Check if there's room on the shadow stack to fit a fraph frame
> +  * and a bitmap word.
> +  */
> + if (current->curr_ret_stack + FGRAPH_FRAME_OFFSET + 1 >= 
> SHADOW_STACK_MAX_OFFSET) {
>   atomic_inc(&current->trace_overrun);
>   return -EBUSY;
>   }
>  
>   calltime = trace_clock_local();
>  
> - index = current->curr_ret_stack;
> - RET_STACK_INC(current->curr_ret_stack);
> - ret_stack = RET_STACK(current, index);
> + offset = READ_ONCE(current->curr_ret_stack);
> + ret_stack = RET_STACK(current, offset);
> + offset += FGRAPH_FRAME_OFFSET;
> +
> + /* ret offset = FGRAPH_FRAME_OFFSET ; type = reserved */
> + current->ret_stack[offset] = val;
> + ret_stack->ret = ret;
> + /*
> +  * The unwinders expect curr_ret_stack to point to either zero
> +  * or an offset where to find the next ret_stack. Even though the
> +  * ret stack might be bogus, we want to write the ret and the
> +  * offset to find the ret_stack before we increment the stack point.
> +  * If an interrupt comes in now before we increment the curr_ret_stack
> +  * it may blow away what we wrote. But that's fine, because the
> +  * offset will still be correct (even though the 'ret' 

Re: [PATCH 20/20] function_graph: Use bitmask to loop on fgraph entry

2024-05-26 Thread Google
On Fri, 24 May 2024 22:37:12 -0400
Steven Rostedt  wrote:

> From: "Steven Rostedt (Google)" 
> 
> Instead of looping through all the elements of fgraph_array[] to see if
> there's a gops attached to one and then calling its gops->func(), create
> a fgraph_array_bitmask that has bits set when an index in the array is
> reserved (via the simple lru algorithm). Then only the bits set in this
> bitmask need to be looked at, i.e. only the elements in the array that
> have ops registered.
> 
> Note, we do not care about races. If a bit is set before the gops is
> assigned, it only wastes time looking at the element and ignoring it (as
> it did before this bitmask is added).

This is OK because we check gops == &fgraph_stub anyway.
By the way, shouldn't we also make the "if (gops == &fgraph_stub)"
check unlikely()?
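
i.e. something like this (just a sketch of the suggested tweak, combining the
loop from this patch with the stub check from the previous one):

        for_each_set_bit(i, &fgraph_array_bitmask,
                         sizeof(fgraph_array_bitmask) * BITS_PER_BYTE) {
                struct fgraph_ops *gops = fgraph_array[i];

                if (unlikely(gops == &fgraph_stub))
                        continue;
                /* ... call gops->entryfunc() as before ... */
        }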

This change looks good to me.

Reviewed-by: Masami Hiramatsu (Google) 

Thank you,

> 
> Signed-off-by: Steven Rostedt (Google) 
> ---
>  kernel/trace/fgraph.c | 8 +++-
>  1 file changed, 7 insertions(+), 1 deletion(-)
> 
> diff --git a/kernel/trace/fgraph.c b/kernel/trace/fgraph.c
> index 5e8e13ffcfb6..1aae521e5997 100644
> --- a/kernel/trace/fgraph.c
> +++ b/kernel/trace/fgraph.c
> @@ -173,6 +173,7 @@ DEFINE_STATIC_KEY_FALSE(kill_ftrace_graph);
>  int ftrace_graph_active;
>  
>  static struct fgraph_ops *fgraph_array[FGRAPH_ARRAY_SIZE];
> +static unsigned long fgraph_array_bitmask;
>  
>  /* LRU index table for fgraph_array */
>  static int fgraph_lru_table[FGRAPH_ARRAY_SIZE];
> @@ -197,6 +198,8 @@ static int fgraph_lru_release_index(int idx)
>  
>   fgraph_lru_table[fgraph_lru_last] = idx;
>   fgraph_lru_last = (fgraph_lru_last + 1) % FGRAPH_ARRAY_SIZE;
> +
> + clear_bit(idx, &fgraph_array_bitmask);
>   return 0;
>  }
>  
> @@ -211,6 +214,8 @@ static int fgraph_lru_alloc_index(void)
>  
>   fgraph_lru_table[fgraph_lru_next] = -1;
>   fgraph_lru_next = (fgraph_lru_next + 1) % FGRAPH_ARRAY_SIZE;
> +
> + set_bit(idx, &fgraph_array_bitmask);
>   return idx;
>  }
>  
> @@ -632,7 +637,8 @@ int function_graph_enter(unsigned long ret, unsigned long 
> func,
>   if (offset < 0)
>   goto out;
>  
> - for (i = 0; i < FGRAPH_ARRAY_SIZE; i++) {
> + for_each_set_bit(i, &fgraph_array_bitmask,
> +  sizeof(fgraph_array_bitmask) * BITS_PER_BYTE) {
>   struct fgraph_ops *gops = fgraph_array[i];
>   int save_curr_ret_stack;
>  
> -- 
> 2.43.0
> 
> 


-- 
Masami Hiramatsu (Google) 



Re: [PATCH 19/20] function_graph: Use for_each_set_bit() in __ftrace_return_to_handler()

2024-05-26 Thread Google
On Fri, 24 May 2024 22:37:11 -0400
Steven Rostedt  wrote:

> From: "Steven Rostedt (Google)" 
> 
> Instead of iterating through the entire fgraph_array[] and seeing if one
> of the bitmap bits are set to know to call the array's retfunc() function,
> use for_each_set_bit() on the bitmap itself. This will only iterate for
> the number of set bits.
> 
> Signed-off-by: Steven Rostedt (Google) 
> ---
>  kernel/trace/fgraph.c | 5 ++---
>  1 file changed, 2 insertions(+), 3 deletions(-)
> 
> diff --git a/kernel/trace/fgraph.c b/kernel/trace/fgraph.c
> index 4d503b3e45ad..5e8e13ffcfb6 100644
> --- a/kernel/trace/fgraph.c
> +++ b/kernel/trace/fgraph.c
> @@ -827,11 +827,10 @@ static unsigned long __ftrace_return_to_handler(struct 
> fgraph_ret_regs *ret_regs
>  #endif
>  
>   bitmap = get_bitmap_bits(current, offset);
> - for (i = 0; i < FGRAPH_ARRAY_SIZE; i++) {
> +
> + for_each_set_bit(i, &bitmap, sizeof(bitmap) * BITS_PER_BYTE) {
>   struct fgraph_ops *gops = fgraph_array[i];
>  
> - if (!(bitmap & BIT(i)))
> - continue;
>       if (gops == &fgraph_stub)

Ah, nit: maybe this should be unlikely()?

Thank you,


-- 
Masami Hiramatsu (Google) 



Re: [PATCH 19/20] function_graph: Use for_each_set_bit() in __ftrace_return_to_handler()

2024-05-26 Thread Google
On Fri, 24 May 2024 22:37:11 -0400
Steven Rostedt  wrote:

> From: "Steven Rostedt (Google)" 
> 
> Instead of iterating through the entire fgraph_array[] and seeing if one
> of the bitmap bits are set to know to call the array's retfunc() function,
> use for_each_set_bit() on the bitmap itself. This will only iterate for
> the number of set bits.
> 

Looks good to me.

Reviewed-by: Masami Hiramatsu (Google) 

Thanks,

> Signed-off-by: Steven Rostedt (Google) 
> ---
>  kernel/trace/fgraph.c | 5 ++---
>  1 file changed, 2 insertions(+), 3 deletions(-)
> 
> diff --git a/kernel/trace/fgraph.c b/kernel/trace/fgraph.c
> index 4d503b3e45ad..5e8e13ffcfb6 100644
> --- a/kernel/trace/fgraph.c
> +++ b/kernel/trace/fgraph.c
> @@ -827,11 +827,10 @@ static unsigned long __ftrace_return_to_handler(struct 
> fgraph_ret_regs *ret_regs
>  #endif
>  
>   bitmap = get_bitmap_bits(current, offset);
> - for (i = 0; i < FGRAPH_ARRAY_SIZE; i++) {
> +
> + for_each_set_bit(i, &bitmap, sizeof(bitmap) * BITS_PER_BYTE) {
>   struct fgraph_ops *gops = fgraph_array[i];
>  
> - if (!(bitmap & BIT(i)))
> - continue;
>   if (gops == &fgraph_stub)
>   continue;
>  
> -- 
> 2.43.0
> 
> 


-- 
Masami Hiramatsu (Google) 



Re: [PATCH] ftrace: Fix stack trace entry generated by ftrace_pid_func()

2024-05-26 Thread Google
On Sun, 26 May 2024 22:51:53 +0800
kernel test robot  wrote:

> Hi Tatsuya,
> 
> kernel test robot noticed the following build warnings:
> 
> [auto build test WARNING on linus/master]
> [also build test WARNING on rostedt-trace/for-next v6.9 next-20240523]
> [cannot apply to rostedt-trace/for-next-urgent]
> [If your patch is applied to the wrong git tree, kindly drop us a note.
> And when submitting patch, we suggest to use '--base' as documented in
> https://git-scm.com/docs/git-format-patch#_base_tree_information]
> 
> url:
> https://github.com/intel-lab-lkp/linux/commits/Tatsuya-S/ftrace-Fix-stack-trace-entry-generated-by-ftrace_pid_func/20240526-193149
> base:   linus/master
> patch link:
> https://lore.kernel.org/r/20240526112658.46740-1-tatsuya.s2862%40gmail.com
> patch subject: [PATCH] ftrace: Fix stack trace entry generated by 
> ftrace_pid_func()
> config: x86_64-buildonly-randconfig-002-20240526 
> (https://download.01.org/0day-ci/archive/20240526/202405262232.l4xh8q6o-...@intel.com/config)
> compiler: clang version 18.1.5 (https://github.com/llvm/llvm-project 
> 617a15a9eac96088ae5e9134248d8236e34b91b1)
> reproduce (this is a W=1 build): 
> (https://download.01.org/0day-ci/archive/20240526/202405262232.l4xh8q6o-...@intel.com/reproduce)
> 
> If you fix the issue in a separate patch/commit (i.e. not just a new version 
> of
> the same patch/commit), kindly add following tags
> | Reported-by: kernel test robot 
> | Closes: 
> https://lore.kernel.org/oe-kbuild-all/202405262232.l4xh8q6o-...@intel.com/
> 
> All warnings (new ones prefixed by >>):
> 
> >> kernel/trace/ftrace.c:102:6: warning: no previous prototype for function 
> >> 'ftrace_pids_enabled' [-Wmissing-prototypes]
>  102 | bool ftrace_pids_enabled(struct ftrace_ops *ops)
>  |  ^
>kernel/trace/ftrace.c:102:1: note: declare 'static' if the function is not 
> intended to be used outside of this translation unit
>  102 | bool ftrace_pids_enabled(struct ftrace_ops *ops)
>  | ^
>  | static 
>1 warning generated.

This is because the prototype in linux/ftrace.h is placed in the 
#ifdef CONFIG_DYNAMIC_FTRACE block. The prototype needs to be moved
outside of the block.
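
That is, something like this in include/linux/ftrace.h (a rough sketch of the
suggested placement; the exact surrounding header context is from memory):

/* Keep the prototype visible even when CONFIG_DYNAMIC_FTRACE is disabled,
 * so the definition in kernel/trace/ftrace.c always has a previous prototype.
 */
bool ftrace_pids_enabled(struct ftrace_ops *ops);

#ifdef CONFIG_DYNAMIC_FTRACE
/* ... declarations that actually depend on dynamic ftrace stay in here ... */
#endif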

Thank you,

> 
> 
> vim +/ftrace_pids_enabled +102 kernel/trace/ftrace.c
> 
>    101
>  > 102 bool ftrace_pids_enabled(struct ftrace_ops *ops)
>    103 {
>    104         struct trace_array *tr;
>    105
>    106         if (!(ops->flags & FTRACE_OPS_FL_PID) || !ops->private)
>    107                 return false;
>    108
>    109         tr = ops->private;
>    110
>    111         return tr->function_pids != NULL || tr->function_no_pids != NULL;
>    112 }
>    113
> 
> -- 
> 0-DAY CI Kernel Test Service
> https://github.com/intel/lkp-tests/wiki


-- 
Masami Hiramatsu (Google) 



Re: [PATCH] tracing/probes: fix error check in parse_btf_field()

2024-05-26 Thread Google
On Sun, 26 May 2024 14:27:56 +0200
Carlos López  wrote:

> 
> Hi,
> 
> On 26/5/24 12:17, Masami Hiramatsu (Google) wrote:
> > On Sat, 25 May 2024 20:21:32 +0200
> > Carlos López  wrote:
> > 
> >> btf_find_struct_member() might return NULL or an error via the
> >> ERR_PTR() macro. However, its caller in parse_btf_field() only checks
> >> for the NULL condition. Fix this by using IS_ERR() and returning the
> >> error up the stack.
> >>
> > 
> > Thanks for finding it!
> > I think this requires new error message for error_log file.
> > Can you add the log as
> > 
> > trace_probe_log_err(ctx->offset, BTF_ERROR);
> > 
> > And define BTF_ERROR in ERRORS@kernel/trace/trace_probe.h ?
> 
> Sounds good, but should we perhaps reuse BAD_BTF_TID?
> 
> ```
> C(BAD_BTF_TID,"Failed to get BTF type info."),\
> ```
> 
> `btf_find_struct_member()` fails if `type` is not a struct or if it runs
> OOM while allocating the anon stack, so it seems appropriate.

Good point, it sounds reasonable.

Thanks!

> 
> Best,
> Carlos
> 
> > Thank you,
> > 
> >> Fixes: c440adfbe3025 ("tracing/probes: Support BTF based data structure 
> >> field access")
> >> Signed-off-by: Carlos López 
> >> ---
> >>   kernel/trace/trace_probe.c | 2 ++
> >>   1 file changed, 2 insertions(+)
> >>
> >> diff --git a/kernel/trace/trace_probe.c b/kernel/trace/trace_probe.c
> >> index 5e263c141574..5417e9712157 100644
> >> --- a/kernel/trace/trace_probe.c
> >> +++ b/kernel/trace/trace_probe.c
> >> @@ -554,6 +554,8 @@ static int parse_btf_field(char *fieldname, const 
> >> struct btf_type *type,
> >>anon_offs = 0;
> >>field = btf_find_struct_member(ctx->btf, type, 
> >> fieldname,
> >>   &anon_offs);
> >> +  if (IS_ERR(field))
> >> +  return PTR_ERR(field);
> >>if (!field) {
> >>trace_probe_log_err(ctx->offset, NO_BTF_FIELD);
> >>return -ENOENT;
> >> -- 
> >> 2.35.3
> >>
> > 
> > 
> 
> -- 
> Carlos López
> Security Engineer
> SUSE Software Solutions


-- 
Masami Hiramatsu (Google) 



Re: [PATCH] tracing/probes: fix error check in parse_btf_field()

2024-05-26 Thread Google
On Sat, 25 May 2024 20:21:32 +0200
Carlos López  wrote:

> btf_find_struct_member() might return NULL or an error via the
> ERR_PTR() macro. However, its caller in parse_btf_field() only checks
> for the NULL condition. Fix this by using IS_ERR() and returning the
> error up the stack.
> 

Thanks for finding it!
I think this requires new error message for error_log file.
Can you add the log as

trace_probe_log_err(ctx->offset, BTF_ERROR);

And define BTF_ERROR in ERRORS@kernel/trace/trace_probe.h ?

Thank you,

> Fixes: c440adfbe3025 ("tracing/probes: Support BTF based data structure field 
> access")
> Signed-off-by: Carlos López 
> ---
>  kernel/trace/trace_probe.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/kernel/trace/trace_probe.c b/kernel/trace/trace_probe.c
> index 5e263c141574..5417e9712157 100644
> --- a/kernel/trace/trace_probe.c
> +++ b/kernel/trace/trace_probe.c
> @@ -554,6 +554,8 @@ static int parse_btf_field(char *fieldname, const struct 
> btf_type *type,
>   anon_offs = 0;
>   field = btf_find_struct_member(ctx->btf, type, 
> fieldname,
>  &anon_offs);
> + if (IS_ERR(field))
> + return PTR_ERR(field);
>   if (!field) {
>   trace_probe_log_err(ctx->offset, NO_BTF_FIELD);
>   return -ENOENT;
> -- 
> 2.35.3
> 


-- 
Masami Hiramatsu (Google) 



[PATCH 3/3] tracing/kprobe: Remove cleanup code unrelated to selftest

2024-05-26 Thread Masami Hiramatsu (Google)
From: Masami Hiramatsu (Google) 

This cleanup of all kprobe events is not related to the selftest
itself, and it can fail for reasons unrelated to this test.
If the test is successful, the generated events are cleaned up.
And if not, we cannot guarantee that the kprobe events will work
correctly. So, either way, there is no need to clean them up.

Signed-off-by: Masami Hiramatsu (Google) 
---
 kernel/trace/trace_kprobe.c |5 -
 1 file changed, 5 deletions(-)

diff --git a/kernel/trace/trace_kprobe.c b/kernel/trace/trace_kprobe.c
index 4abed36544d0..f94628c15c14 100644
--- a/kernel/trace/trace_kprobe.c
+++ b/kernel/trace/trace_kprobe.c
@@ -2129,11 +2129,6 @@ static __init int kprobe_trace_self_tests_init(void)
}
 
 end:
-   ret = dyn_events_release_all(&trace_kprobe_ops);
-   if (ret) {
-   pr_warn("error on cleaning up probes.\n");
-   warn++;
-   }
/*
 * Wait for the optimizer work to finish. Otherwise it might fiddle
 * with probes in already freed __init text.




[PATCH 2/3] tracing/kprobe: Remove unneeded WARN_ON_ONCE() in selftests

2024-05-26 Thread Masami Hiramatsu (Google)
From: Masami Hiramatsu (Google) 

Since the kprobe-events selftest shows OK or NG with the reason, the
WARN_ON_ONCE()s for each place are redundant. Let's remove them.

Signed-off-by: Masami Hiramatsu (Google) 
---
 kernel/trace/trace_kprobe.c |   26 +-
 1 file changed, 13 insertions(+), 13 deletions(-)

diff --git a/kernel/trace/trace_kprobe.c b/kernel/trace/trace_kprobe.c
index 16383247bdbf..4abed36544d0 100644
--- a/kernel/trace/trace_kprobe.c
+++ b/kernel/trace/trace_kprobe.c
@@ -2023,18 +2023,18 @@ static __init int kprobe_trace_self_tests_init(void)
pr_info("Testing kprobe tracing: ");
 
ret = create_or_delete_trace_kprobe("p:testprobe 
kprobe_trace_selftest_target $stack $stack0 +0($stack)");
-   if (WARN_ON_ONCE(ret)) {
+   if (ret) {
pr_warn("error on probing function entry.\n");
warn++;
} else {
/* Enable trace point */
tk = find_trace_kprobe("testprobe", KPROBE_EVENT_SYSTEM);
-   if (WARN_ON_ONCE(tk == NULL)) {
+   if (tk == NULL) {
pr_warn("error on getting new probe.\n");
warn++;
} else {
file = find_trace_probe_file(tk, top_trace_array());
-   if (WARN_ON_ONCE(file == NULL)) {
+   if (file == NULL) {
pr_warn("error on getting probe file.\n");
warn++;
} else
@@ -2044,18 +2044,18 @@ static __init int kprobe_trace_self_tests_init(void)
}
 
ret = create_or_delete_trace_kprobe("r:testprobe2 
kprobe_trace_selftest_target $retval");
-   if (WARN_ON_ONCE(ret)) {
+   if (ret) {
pr_warn("error on probing function return.\n");
warn++;
} else {
/* Enable trace point */
tk = find_trace_kprobe("testprobe2", KPROBE_EVENT_SYSTEM);
-   if (WARN_ON_ONCE(tk == NULL)) {
+   if (tk == NULL) {
pr_warn("error on getting 2nd new probe.\n");
warn++;
} else {
file = find_trace_probe_file(tk, top_trace_array());
-   if (WARN_ON_ONCE(file == NULL)) {
+   if (file == NULL) {
pr_warn("error on getting probe file.\n");
warn++;
} else
@@ -2079,7 +2079,7 @@ static __init int kprobe_trace_self_tests_init(void)
 
/* Disable trace points before removing it */
tk = find_trace_kprobe("testprobe", KPROBE_EVENT_SYSTEM);
-   if (WARN_ON_ONCE(tk == NULL)) {
+   if (tk == NULL) {
pr_warn("error on getting test probe.\n");
warn++;
} else {
@@ -2089,7 +2089,7 @@ static __init int kprobe_trace_self_tests_init(void)
}
 
file = find_trace_probe_file(tk, top_trace_array());
-   if (WARN_ON_ONCE(file == NULL)) {
+   if (file == NULL) {
pr_warn("error on getting probe file.\n");
warn++;
} else
@@ -2098,7 +2098,7 @@ static __init int kprobe_trace_self_tests_init(void)
}
 
tk = find_trace_kprobe("testprobe2", KPROBE_EVENT_SYSTEM);
-   if (WARN_ON_ONCE(tk == NULL)) {
+   if (tk == NULL) {
pr_warn("error on getting 2nd test probe.\n");
warn++;
} else {
@@ -2108,7 +2108,7 @@ static __init int kprobe_trace_self_tests_init(void)
}
 
file = find_trace_probe_file(tk, top_trace_array());
-   if (WARN_ON_ONCE(file == NULL)) {
+   if (file == NULL) {
pr_warn("error on getting probe file.\n");
warn++;
} else
@@ -2117,20 +2117,20 @@ static __init int kprobe_trace_self_tests_init(void)
}
 
ret = create_or_delete_trace_kprobe("-:testprobe");
-   if (WARN_ON_ONCE(ret)) {
+   if (ret) {
pr_warn("error on deleting a probe.\n");
warn++;
}
 
ret = create_or_delete_trace_kprobe("-:testprobe2");
-   if (WARN_ON_ONCE(ret)) {
+   if (ret) {
pr_warn("error on deleting a probe.\n");
warn++;
}
 
 end:
ret = dyn_events_release_all(&trace_kprobe_ops);
-   if (WARN_ON_ONCE(ret)) {
+   if (ret) {
pr_warn("error on cleaning up probes.\n");
warn++;
}




[PATCH 1/3] tracing: Build event generation tests only as modules

2024-05-26 Thread Masami Hiramatsu (Google)
From: Masami Hiramatsu (Google) 

Since the kprobe and synth event generation tests add and enable
generated events in init_module() and delete them in exit_module(),
if we make them built-in, those events are left in the kernel and
cause the kprobe event self-test to fail.

[   97.349708] [ cut here ]
[   97.353453] WARNING: CPU: 3 PID: 1 at kernel/trace/trace_kprobe.c:2133 
kprobe_trace_self_tests_init+0x3f1/0x480
[   97.357106] Modules linked in:
[   97.358488] CPU: 3 PID: 1 Comm: swapper/0 Not tainted 
6.9.0-g699646734ab5-dirty #14
[   97.361556] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
1.15.0-1 04/01/2014
[   97.363880] RIP: 0010:kprobe_trace_self_tests_init+0x3f1/0x480
[   97.365538] Code: a8 24 08 82 e9 ae fd ff ff 90 0f 0b 90 48 c7 c7 e5 aa 0b 
82 e9 ee fc ff ff 90 0f 0b 90 48 c7 c7 2d 61 06 82 e9 8e fd ff ff 90 <0f> 0b 90 
48 c7 c7 33 0b 0c 82 89 c6 e8 6e 03 1f ff 41 ff c7 e9 90
[   97.370429] RSP: :c9013b50 EFLAGS: 00010286
[   97.371852] RAX: fff0 RBX: 888005919c00 RCX: 
[   97.373829] RDX: 888003f4 RSI: 8236a598 RDI: 888003f40a68
[   97.375715] RBP:  R08: 0001 R09: 
[   97.377675] R10: 811c9ae5 R11: 8120c4e0 R12: 
[   97.379591] R13: 0001 R14: 0015 R15: 
[   97.381536] FS:  () GS:88807dcc() 
knlGS:
[   97.383813] CS:  0010 DS:  ES:  CR0: 80050033
[   97.385449] CR2:  CR3: 02244000 CR4: 06b0
[   97.387347] DR0:  DR1:  DR2: 
[   97.389277] DR3:  DR6: fffe0ff0 DR7: 0400
[   97.391196] Call Trace:
[   97.391967]  
[   97.392647]  ? __warn+0xcc/0x180
[   97.393640]  ? kprobe_trace_self_tests_init+0x3f1/0x480
[   97.395181]  ? report_bug+0xbd/0x150
[   97.396234]  ? handle_bug+0x3e/0x60
[   97.397311]  ? exc_invalid_op+0x1a/0x50
[   97.398434]  ? asm_exc_invalid_op+0x1a/0x20
[   97.399652]  ? trace_kprobe_is_busy+0x20/0x20
[   97.400904]  ? tracing_reset_all_online_cpus+0x15/0x90
[   97.402304]  ? kprobe_trace_self_tests_init+0x3f1/0x480
[   97.403773]  ? init_kprobe_trace+0x50/0x50
[   97.404972]  do_one_initcall+0x112/0x240
[   97.406113]  do_initcall_level+0x95/0xb0
[   97.407286]  ? kernel_init+0x1a/0x1a0
[   97.408401]  do_initcalls+0x3f/0x70
[   97.409452]  kernel_init_freeable+0x16f/0x1e0
[   97.410662]  ? rest_init+0x1f0/0x1f0
[   97.411738]  kernel_init+0x1a/0x1a0
[   97.412788]  ret_from_fork+0x39/0x50
[   97.413817]  ? rest_init+0x1f0/0x1f0
[   97.414844]  ret_from_fork_asm+0x11/0x20
[   97.416285]  
[   97.417134] irq event stamp: 13437323
[   97.418376] hardirqs last  enabled at (13437337): [] 
console_unlock+0x11c/0x150
[   97.421285] hardirqs last disabled at (13437370): [] 
console_unlock+0x101/0x150
[   97.423838] softirqs last  enabled at (13437366): [] 
handle_softirqs+0x23f/0x2a0
[   97.426450] softirqs last disabled at (13437393): [] 
__irq_exit_rcu+0x66/0xd0
[   97.428850] ---[ end trace  ]---

To avoid this issue, build these tests only as modules.

Fixes: 9fe41efaca08 ("tracing: Add synth event generation test module")
Fixes: 64836248dda2 ("tracing: Add kprobe event command generation test module")
Signed-off-by: Masami Hiramatsu (Google) 
---
 kernel/trace/Kconfig |4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig
index 166ad5444eea..721c3b221048 100644
--- a/kernel/trace/Kconfig
+++ b/kernel/trace/Kconfig
@@ -1136,7 +1136,7 @@ config PREEMPTIRQ_DELAY_TEST
 
 config SYNTH_EVENT_GEN_TEST
tristate "Test module for in-kernel synthetic event generation"
-   depends on SYNTH_EVENTS
+   depends on SYNTH_EVENTS && m
help
   This option creates a test module to check the base
   functionality of in-kernel synthetic event definition and
@@ -1149,7 +1149,7 @@ config SYNTH_EVENT_GEN_TEST
 
 config KPROBE_EVENT_GEN_TEST
tristate "Test module for in-kernel kprobe event generation"
-   depends on KPROBE_EVENTS
+   depends on KPROBE_EVENTS && m
help
   This option creates a test module to check the base
   functionality of in-kernel kprobe event definition.




[PATCH 0/3] tracing: Fix some selftest issues

2024-05-26 Thread Masami Hiramatsu (Google)
Hi,

Here is a series of some fixes/improvements for the test modules and boot
time selftest of kprobe events. I found a WARNING message with some boot 
time selftest configuration, which came from the combination of embedded
kprobe generate API tests module and ftrace boot-time selftest. So the main
problem is that the test module should not be built-in. But I also think
this WARNING message is useless (because there are warning messages already)
and the cleanup code is redundant. This series fixes those issues.

Thank you,

---

Masami Hiramatsu (Google) (3):
  tracing: Build event generation tests only as modules
  tracing/kprobe: Remove unneeded WARN_ON_ONCE() in selftests
  tracing/kprobe: Remove cleanup code unrelated to selftest


 kernel/trace/Kconfig|4 ++--
 kernel/trace/trace_kprobe.c |   29 -
 2 files changed, 14 insertions(+), 19 deletions(-)

--
Masami Hiramatsu (Google) 



Re: [PATCH v10 00/36] tracing: fprobe: function_graph: Multi-function graph and fprobe on fgraph

2024-05-25 Thread Google
On Fri, 24 May 2024 18:41:56 -0400
Steven Rostedt  wrote:

> On Tue,  7 May 2024 23:08:00 +0900
> "Masami Hiramatsu (Google)"  wrote:
> 
> > Steven Rostedt (VMware) (15):
> >   function_graph: Convert ret_stack to a series of longs
> >   fgraph: Use BUILD_BUG_ON() to make sure we have structures divisible 
> > by long
> >   function_graph: Add an array structure that will allow multiple 
> > callbacks
> >   function_graph: Allow multiple users to attach to function graph
> >   function_graph: Remove logic around ftrace_graph_entry and return
> >   ftrace/function_graph: Pass fgraph_ops to function graph callbacks
> >   ftrace: Allow function_graph tracer to be enabled in instances
> >   ftrace: Allow ftrace startup flags exist without dynamic ftrace
> >   function_graph: Have the instances use their own ftrace_ops for 
> > filtering
> >   function_graph: Add "task variables" per task for fgraph_ops
> >   function_graph: Move set_graph_function tests to shadow stack global 
> > var
> >   function_graph: Move graph depth stored data to shadow stack global 
> > var
> >   function_graph: Move graph notrace bit to shadow stack global var
> >   function_graph: Implement fgraph_reserve_data() and 
> > fgraph_retrieve_data()
> >   function_graph: Add selftest for passing local variables
> 
> Hi Masami,
> 
> While reviewing these patches, I realized there's several things I dislike
> about the patches I wrote. So I took these patches and started cleaning
> them up a little. Mostly renaming functions and adding comments.

Thanks for cleaning up the patches!!

> 
> As this is a major change to the function graph tracer, and I feel nervous
> about building something on top of this, how about I take over these
> patches and push them out for the next merge window. I'm hoping to get them
> into linux-next by v6.10-rc2 (I spent the day working on them, and it's
> mostly minor tweaks).

OK.

> Then I can push it out to 6.11 and get some good testing against it. Then
> we can add your stuff on top and get that merged in 6.12.

Yeah, it is a reasonable plan. I am also concerned about the stability. Especially,
this involves fprobe-side changes too. If we introduce both at once, it may
mess up many things.

> 
> If all goes well, I'm hoping to get a series on just these patches (and
> your selftest addition) by tonight.
> 
> Thoughts?

I agree with you.

Thank you,

> 
> -- Steve


-- 
Masami Hiramatsu (Google) 



Re: [PATCH v10 07/36] function_graph: Allow multiple users to attach to function graph

2024-05-25 Thread Google
On Fri, 24 May 2024 21:32:08 -0400
Steven Rostedt  wrote:

> On Tue,  7 May 2024 23:09:22 +0900
> "Masami Hiramatsu (Google)"  wrote:
> 
> > @@ -109,6 +244,21 @@ ftrace_push_return_trace(unsigned long ret, unsigned 
> > long func,
> > if (!current->ret_stack)
> > return -EBUSY;
> >  
> > +   /*
> > +* At first, check whether the previous fgraph callback is pushed by
> > +* the fgraph on the same function entry.
> > +* But if @func is the self tail-call function, we also need to ensure
> > +* the ret_stack is not for the previous call by checking whether the
> > +* bit of @fgraph_idx is set or not.
> > +*/
> > +   ret_stack = get_ret_stack(current, current->curr_ret_stack, &offset);
> > +   if (ret_stack && ret_stack->func == func &&
> > +   get_fgraph_type(current, offset + FGRAPH_FRAME_OFFSET) == 
> > FGRAPH_TYPE_BITMAP &&
> > +   !is_fgraph_index_set(current, offset + FGRAPH_FRAME_OFFSET, 
> > fgraph_idx))
> > +   return offset + FGRAPH_FRAME_OFFSET;
> > +
> > +   val = (FGRAPH_TYPE_RESERVED << FGRAPH_TYPE_SHIFT) | FGRAPH_FRAME_OFFSET;
> > +
> > BUILD_BUG_ON(SHADOW_STACK_SIZE % sizeof(long));
> 
> I'm trying to figure out what the above is trying to do. This gets called
> once in function_graph_enter() (or function_graph_enter_ops()). What
> exactly are you trying to catch here?

Aah, good catch! This was originally for catching the self tail-call case with
multiple fgraph callbacks on the same function, but that was my misreading.
In a later patch ([12/36]), we introduced function_graph_enter_ops() so that
we can skip checking the hash table and directly pass the fgraph_ops to the
user callback. I thought this function_graph_enter_ops() was used even if
multiple fgraphs were set on the same function. In that case, we always need
to check whether the stack can be reused (pushed by another fgraph_ops on the
same function) or not.
But as we discussed, function_graph_enter_ops() is used only when just one
fgraph is set on the function (if there are multiple fgraphs set on the same
function, function_graph_enter() is used), so we are sure that
ftrace_push_return_trace() is called only once when hooking the function entry.
Thus we don't need to reuse it.

> 
> Is it from this email:
> 
>   
> https://lore.kernel.org/all/20231110105154.df937bf9f200a0c16806c...@kernel.org/
> 
> As that's the last version before you added the above code.
> 
> But you also noticed it may not be needed, but triggered a crash without it
> in v3:
> 
>   
> https://lore.kernel.org/all/20231205234511.3839128259dfec153ea7d...@kernel.org/
> 
> I removed this code in my version and it runs just fine. Perhaps there was
> another bug that this was hiding that you fixed in later versions?

No problem. I think we can remove this block safely.

Thank you,

> 
> -- Steve
> 


-- 
Masami Hiramatsu (Google) 



Re: [PATCH v10 03/36] x86: tracing: Add ftrace_regs definition in the header

2024-05-23 Thread Google
On Thu, 23 May 2024 19:14:59 -0400
Steven Rostedt  wrote:

> On Tue,  7 May 2024 23:08:35 +0900
> "Masami Hiramatsu (Google)"  wrote:
> 
> > From: Masami Hiramatsu (Google) 
> > 
> > Add ftrace_regs definition for x86_64 in the ftrace header to
> > clarify what register will be accessible from ftrace_regs.
> > 
> > Signed-off-by: Masami Hiramatsu (Google) 
> > ---
> >  Changes in v3:
> >   - Add rip to be saved.
> >  Changes in v2:
> >   - Newly added.
> > ---
> >  arch/x86/include/asm/ftrace.h |6 ++
> >  1 file changed, 6 insertions(+)
> > 
> > diff --git a/arch/x86/include/asm/ftrace.h b/arch/x86/include/asm/ftrace.h
> > index cf88cc8cc74d..c88bf47f46da 100644
> > --- a/arch/x86/include/asm/ftrace.h
> > +++ b/arch/x86/include/asm/ftrace.h
> > @@ -36,6 +36,12 @@ static inline unsigned long ftrace_call_adjust(unsigned 
> > long addr)
> >  
> >  #ifdef CONFIG_HAVE_DYNAMIC_FTRACE_WITH_ARGS
> >  struct ftrace_regs {
> > +   /*
> > +* On the x86_64, the ftrace_regs saves;
> > +* rax, rcx, rdx, rdi, rsi, r8, r9, rbp, rip and rsp.
> > +* Also orig_ax is used for passing direct trampoline address.
> > +* x86_32 doesn't support ftrace_regs.
> 
> Should add a comment that if fregs->regs.cs is set, then all of the pt_regs
> is valid.

But what about rbx and r1*? Is regs->cs the only thing that should be checked
to tell whether pt_regs is valid? Or did you mean "the ftrace_regs is valid"?

> And x86_32 does support ftrace_regs, it just doesn't support
> having a subset of it.

Oh, thanks. I'll update the comment about x86_32.
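
(For reference, I think the check being discussed is this one in
arch/x86/include/asm/ftrace.h; quoting from memory, so treat the details as
approximate:)

static __always_inline struct pt_regs *
arch_ftrace_get_regs(struct ftrace_regs *fregs)
{
        /* Only when FL_SAVE_REGS is set, cs will be non zero */
        if (!fregs->regs.cs)
                return NULL;
        return &fregs->regs;
}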

Thank you,

> 
> -- Steve
> 
> 
> > +*/
> > struct pt_regs  regs;
> >  };
> >  
> 
> 


-- 
Masami Hiramatsu (Google) 



Re: [PATCH v10 01/36] tracing: Add a comment about ftrace_regs definition

2024-05-23 Thread Google
On Thu, 23 May 2024 19:10:31 -0400
Steven Rostedt  wrote:

> On Tue,  7 May 2024 23:08:12 +0900
> "Masami Hiramatsu (Google)"  wrote:
> 
> > From: Masami Hiramatsu (Google) 
> > 
> > To clarify what will be expected on ftrace_regs, add a comment to the
> > architecture independent definition of the ftrace_regs.
> > 
> > Signed-off-by: Masami Hiramatsu (Google) 
> > Acked-by: Mark Rutland 
> > ---
> >  Changes in v8:
> >   - Update that the saved registers depends on the context.
> >  Changes in v3:
> >   - Add instruction pointer
> >  Changes in v2:
> >   - newly added.
> > ---
> >  include/linux/ftrace.h |   26 ++
> >  1 file changed, 26 insertions(+)
> > 
> > diff --git a/include/linux/ftrace.h b/include/linux/ftrace.h
> > index 54d53f345d14..b81f1afa82a1 100644
> > --- a/include/linux/ftrace.h
> > +++ b/include/linux/ftrace.h
> > @@ -118,6 +118,32 @@ extern int ftrace_enabled;
> >  
> >  #ifndef CONFIG_HAVE_DYNAMIC_FTRACE_WITH_ARGS
> >  
> > +/**
> > + * ftrace_regs - ftrace partial/optimal register set
> > + *
> > + * ftrace_regs represents a group of registers which is used at the
> > + * function entry and exit. There are three types of registers.
> > + *
> > + * - Registers for passing the parameters to callee, including the stack
> > + *   pointer. (e.g. rcx, rdx, rdi, rsi, r8, r9 and rsp on x86_64)
> > + * - Registers for passing the return values to caller.
> > + *   (e.g. rax and rdx on x86_64)
> > + * - Registers for hooking the function call and return including the
> > + *   frame pointer (the frame pointer is architecture/config dependent)
> > + *   (e.g. rip, rbp and rsp for x86_64)
> > + *
> > + * Also, architecture dependent fields can be used for internal process.
> > + * (e.g. orig_ax on x86_64)
> > + *
> > + * On the function entry, those registers will be restored except for
> > + * the stack pointer, so that user can change the function parameters
> > + * and instruction pointer (e.g. live patching.)
> > + * On the function exit, only registers which is used for return values
> > + * are restored.
> 
> I wonder if we should also add a note about some architectures in some
> circumstances may store all pt_regs in ftrace_regs. For example, if an
> architecture supports FTRACE_WITH_REGS, it may pass the pt_regs within the
> ftrace_regs. If that is the case, then ftrace_get_regs() called on it will
> return a pointer to a valid pt_regs, or NULL if it is not supported or the
> ftrace_regs does not have a all the registers.

Agreed. That case should also be noted. Thanks for pointing it out!
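
For example, the note could show how a callback is expected to use it, roughly
like this (the callback name and the flag mentioned in the comment are just
illustrative, not part of the patch):

static void my_callback(unsigned long ip, unsigned long parent_ip,
                        struct ftrace_ops *op, struct ftrace_regs *fregs)
{
        struct pt_regs *regs = ftrace_get_regs(fregs);

        if (regs) {
                /* Full pt_regs were saved (e.g. FTRACE_OPS_FL_SAVE_REGS). */
        } else {
                /* Only the partial ftrace_regs register set is valid here. */
        }
}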


> 
> -- Steve
> 
> 
> > + *
> > + * NOTE: user *must not* access regs directly, only do it via APIs, because
> > + * the member can be changed according to the architecture.
> > + */
> >  struct ftrace_regs {
> > struct pt_regs  regs;
> >  };
> 


-- 
Masami Hiramatsu (Google) 



Re: [PATCH] uprobes: prevent mutex_lock() under rcu_read_lock()

2024-05-23 Thread Google
;   return 0;
> @@ -1031,10 +1032,13 @@ static void uretprobe_trace_func(struct trace_uprobe 
> *tu, unsigned long func,
>struct uprobe_cpu_buffer **ucbp)
>  {
>   struct event_file_link *link;
> + struct uprobe_cpu_buffer *ucb;
> +
> + ucb = prepare_uprobe_buffer(tu, regs, ucbp);
>  
>   rcu_read_lock();
>   trace_probe_for_each_link_rcu(link, &tu->tp)
> - __uprobe_trace_func(tu, func, regs, ucbp, link->file);
> + __uprobe_trace_func(tu, func, regs, ucb, link->file);
>   rcu_read_unlock();
>  }
>  
> -- 
> 2.43.0
> 


-- 
Masami Hiramatsu (Google) 



Re: [PATCH] kernel: trace: preemptirq_delay_test: add MODULE_DESCRIPTION()

2024-05-18 Thread Google
On Sat, 18 May 2024 15:54:49 -0700
Jeff Johnson  wrote:

> Fix the 'make W=1' warning:
> 
> WARNING: modpost: missing MODULE_DESCRIPTION() in 
> kernel/trace/preemptirq_delay_test.o
> 

Looks good to me.

Acked-by: Masami Hiramatsu (Google) 

Fixes: f96e8577da10 ("lib: Add module for testing preemptoff/irqsoff latency 
tracers")

Thanks,

> Signed-off-by: Jeff Johnson 
> ---
>  kernel/trace/preemptirq_delay_test.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/kernel/trace/preemptirq_delay_test.c 
> b/kernel/trace/preemptirq_delay_test.c
> index 8c4ffd076162..cb0871fbdb07 100644
> --- a/kernel/trace/preemptirq_delay_test.c
> +++ b/kernel/trace/preemptirq_delay_test.c
> @@ -215,4 +215,5 @@ static void __exit preemptirq_delay_exit(void)
>  
>  module_init(preemptirq_delay_init)
>  module_exit(preemptirq_delay_exit)
> +MODULE_DESCRIPTION("Preempt / IRQ disable delay thread to test latency 
> tracers");
>  MODULE_LICENSE("GPL v2");
> 
> ---
> base-commit: 674143feb6a8c02d899e64e2ba0f992896afd532
> change-id: 20240518-md-preemptirq_delay_test-552cd20e7b0b
> 


-- 
Masami Hiramatsu (Google) 



Re: [PATCHv5 bpf-next 6/8] x86/shstk: Add return uprobe support

2024-05-13 Thread Google
On Sat, 11 May 2024 15:09:48 -0600
Jiri Olsa  wrote:

> On Thu, May 09, 2024 at 04:24:37PM +, Edgecombe, Rick P wrote:
> > On Thu, 2024-05-09 at 10:30 +0200, Jiri Olsa wrote:
> > > > Per the earlier discussion, this cannot be reached unless uretprobes are
> > > > in use, which cannot happen without something with privileges taking an
> > > > action. But are uretprobes ever used for monitoring applications where
> > > > security is important? Or is it strictly a debug-time thing?
> > > 
> > > sorry, I don't have that level of detail, but we do have customers
> > > that use uprobes in general or want to use it and complain about
> > > the speed
> > > 
> > > there are several tools in bcc [1] that use uretprobes in scripts,
> > > like:
> > >   memleak, sslsniff, trace, bashreadline, gethostlatency, argdist,
> > >   funclatency
> > 
> > Is it possible to have shadow stack only use the non-syscall solution? It 
> > seems
> > it exposes a more limited compatibility in that it only allows writing the
> > specific trampoline address. (IIRC) Then shadow stack users could still use
> > uretprobes, but just not the new optimized solution. There are already
> > operations that are slower with shadow stack, like longjmp(), so this could 
> > be
> > ok maybe.
> 
> I guess it's doable, we'd need to keep both trampolines around, because
> shadow stack is enabled by app dynamically and use one based on the
> state of shadow stack when uretprobe is installed
> 
> so you're worried the optimized syscall path could be somehow exploited
> to add data on shadow stack?

Good point. For the security concern (e.g. leaking sensitive information
from a secure process which uses shadow stack), we need another limitation
which prohibits probing such a process even for debugging. But I think that
needs another series of patches. We also need to discuss when it should be
prohibited and how (e.g. audit interface? SELinux?).
But I think this series is just optimizing the currently available uprobes with
a new syscall. I don't think it changes such security concerns.

Thank you,

> 
> jirka


-- 
Masami Hiramatsu (Google) 



Re: [PATCHv5 bpf-next 7/8] selftests/x86: Add return uprobe shadow stack test

2024-05-13 Thread Google
ough sigsetjmp above
> +  * or succeeds and we're good.
> +  */
> + uretprobe_trigger();
> +
> + printf("[OK]\tUretprobe test\n");
> + err = 0;
> +
> +out:
> + ARCH_PRCTL(ARCH_SHSTK_DISABLE, ARCH_SHSTK_SHSTK);
> + signal(SIGSEGV, SIG_DFL);
> + if (fd)
> + close(fd);
> + return err;
> +}
> +
>  void segv_handler_ptrace(int signum, siginfo_t *si, void *uc)
>  {
>   /* The SSP adjustment caused a segfault. */
> @@ -867,6 +1003,12 @@ int main(int argc, char *argv[])
>   goto out;
>   }
>  
> + if (test_uretprobe()) {
> + ret = 1;
> + printf("[FAIL]\turetprobe test\n");
> + goto out;
> + }
> +
>   return ret;
>  
>  out:
> -- 
> 2.44.0
> 


-- 
Masami Hiramatsu (Google) 



Re: [PATCH RESEND v8 07/16] mm/execmem, arch: convert simple overrides of module_alloc to execmem

2024-05-07 Thread Google
On Sun,  5 May 2024 19:06:19 +0300
Mike Rapoport  wrote:

> From: "Mike Rapoport (IBM)" 
> 
> Several architectures override module_alloc() only to define address
> range for code allocations different than VMALLOC address space.
> 
> Provide a generic implementation in execmem that uses the parameters for
> address space ranges, required alignment and page protections provided
> by architectures.
> 
> The architectures must fill execmem_info structure and implement
> execmem_arch_setup() that returns a pointer to that structure. This way the
> execmem initialization won't be called from every architecture, but rather
> from a central place, namely a core_initcall() in execmem.
> 
> The execmem provides execmem_alloc() API that wraps __vmalloc_node_range()
> with the parameters defined by the architectures.  If an architecture does
> not implement execmem_arch_setup(), execmem_alloc() will fall back to
> module_alloc().
> 

Looks good to me.

Reviewed-by: Masami Hiramatsu (Google) 

Thanks,

> Signed-off-by: Mike Rapoport (IBM) 
> Acked-by: Song Liu 
> ---
>  arch/loongarch/kernel/module.c | 19 --
>  arch/mips/kernel/module.c  | 20 --
>  arch/nios2/kernel/module.c | 21 ---
>  arch/parisc/kernel/module.c| 24 
>  arch/riscv/kernel/module.c | 24 
>  arch/sparc/kernel/module.c | 20 --
>  include/linux/execmem.h| 47 
>  mm/execmem.c   | 67 --
>  mm/mm_init.c   |  2 +
>  9 files changed, 210 insertions(+), 34 deletions(-)
> 
> diff --git a/arch/loongarch/kernel/module.c b/arch/loongarch/kernel/module.c
> index c7d0338d12c1..ca6dd7ea1610 100644
> --- a/arch/loongarch/kernel/module.c
> +++ b/arch/loongarch/kernel/module.c
> @@ -18,6 +18,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -490,10 +491,22 @@ int apply_relocate_add(Elf_Shdr *sechdrs, const char 
> *strtab,
>   return 0;
>  }
>  
> -void *module_alloc(unsigned long size)
> +static struct execmem_info execmem_info __ro_after_init;
> +
> +struct execmem_info __init *execmem_arch_setup(void)
>  {
> - return __vmalloc_node_range(size, 1, MODULES_VADDR, MODULES_END,
> - GFP_KERNEL, PAGE_KERNEL, 0, NUMA_NO_NODE, 
> __builtin_return_address(0));
> + execmem_info = (struct execmem_info){
> + .ranges = {
> + [EXECMEM_DEFAULT] = {
> + .start  = MODULES_VADDR,
> + .end= MODULES_END,
> + .pgprot = PAGE_KERNEL,
> + .alignment = 1,
> + },
> + },
> + };
> +
> + return &execmem_info;
>  }
>  
>  static void module_init_ftrace_plt(const Elf_Ehdr *hdr,
> diff --git a/arch/mips/kernel/module.c b/arch/mips/kernel/module.c
> index 9a6c96014904..59225a3cf918 100644
> --- a/arch/mips/kernel/module.c
> +++ b/arch/mips/kernel/module.c
> @@ -20,6 +20,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  
>  struct mips_hi16 {
> @@ -32,11 +33,22 @@ static LIST_HEAD(dbe_list);
>  static DEFINE_SPINLOCK(dbe_lock);
>  
>  #ifdef MODULES_VADDR
> -void *module_alloc(unsigned long size)
> +static struct execmem_info execmem_info __ro_after_init;
> +
> +struct execmem_info __init *execmem_arch_setup(void)
>  {
> - return __vmalloc_node_range(size, 1, MODULES_VADDR, MODULES_END,
> - GFP_KERNEL, PAGE_KERNEL, 0, NUMA_NO_NODE,
> - __builtin_return_address(0));
> + execmem_info = (struct execmem_info){
> + .ranges = {
> + [EXECMEM_DEFAULT] = {
> + .start  = MODULES_VADDR,
> + .end= MODULES_END,
> + .pgprot = PAGE_KERNEL,
> + .alignment = 1,
> + },
> + },
> + };
> +
> + return &execmem_info;
>  }
>  #endif
>  
> diff --git a/arch/nios2/kernel/module.c b/arch/nios2/kernel/module.c
> index 9c97b7513853..0d1ee86631fc 100644
> --- a/arch/nios2/kernel/module.c
> +++ b/arch/nios2/kernel/module.c
> @@ -18,15 +18,26 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  #include 
>  
> -void *module_alloc(unsigned long size)
> +static struct execmem_info execmem_info __ro_after_init;
> +
> +struct execmem_info __init *execmem_arch_setup(void)
>  {
> - return __vmalloc_node_range(size, 1, MODULES

Re: [PATCH RESEND v8 05/16] module: make module_memory_{alloc,free} more self-contained

2024-05-07 Thread Google
On Sun,  5 May 2024 19:06:17 +0300
Mike Rapoport  wrote:

> From: "Mike Rapoport (IBM)" 
> 
> Move the logic related to the memory allocation and freeing into
> module_memory_alloc() and module_memory_free().
> 

Looks good to me.

Reviewed-by: Masami Hiramatsu (Google) 

Thanks,

> Signed-off-by: Mike Rapoport (IBM) 
> Reviewed-by: Philippe Mathieu-Daudé 
> ---
>  kernel/module/main.c | 64 +++-
>  1 file changed, 39 insertions(+), 25 deletions(-)
> 
> diff --git a/kernel/module/main.c b/kernel/module/main.c
> index e1e8a7a9d6c1..5b82b069e0d3 100644
> --- a/kernel/module/main.c
> +++ b/kernel/module/main.c
> @@ -1203,15 +1203,44 @@ static bool mod_mem_use_vmalloc(enum mod_mem_type 
> type)
>   mod_mem_type_is_core_data(type);
>  }
>  
> -static void *module_memory_alloc(unsigned int size, enum mod_mem_type type)
> +static int module_memory_alloc(struct module *mod, enum mod_mem_type type)
>  {
> + unsigned int size = PAGE_ALIGN(mod->mem[type].size);
> + void *ptr;
> +
> + mod->mem[type].size = size;
> +
>   if (mod_mem_use_vmalloc(type))
> - return vzalloc(size);
> - return module_alloc(size);
> + ptr = vmalloc(size);
> + else
> + ptr = module_alloc(size);
> +
> + if (!ptr)
> + return -ENOMEM;
> +
> + /*
> +  * The pointer to these blocks of memory are stored on the module
> +  * structure and we keep that around so long as the module is
> +  * around. We only free that memory when we unload the module.
> +  * Just mark them as not being a leak then. The .init* ELF
> +  * sections *do* get freed after boot so we *could* treat them
> +  * slightly differently with kmemleak_ignore() and only grey
> +  * them out as they work as typical memory allocations which
> +  * *do* eventually get freed, but let's just keep things simple
> +  * and avoid *any* false positives.
> +  */
> + kmemleak_not_leak(ptr);
> +
> + memset(ptr, 0, size);
> + mod->mem[type].base = ptr;
> +
> + return 0;
>  }
>  
> -static void module_memory_free(void *ptr, enum mod_mem_type type)
> +static void module_memory_free(struct module *mod, enum mod_mem_type type)
>  {
> + void *ptr = mod->mem[type].base;
> +
>   if (mod_mem_use_vmalloc(type))
>   vfree(ptr);
>   else
> @@ -1229,12 +1258,12 @@ static void free_mod_mem(struct module *mod)
>   /* Free lock-classes; relies on the preceding sync_rcu(). */
>   lockdep_free_key_range(mod_mem->base, mod_mem->size);
>   if (mod_mem->size)
> - module_memory_free(mod_mem->base, type);
> + module_memory_free(mod, type);
>   }
>  
>   /* MOD_DATA hosts mod, so free it at last */
>   lockdep_free_key_range(mod->mem[MOD_DATA].base, 
> mod->mem[MOD_DATA].size);
> - module_memory_free(mod->mem[MOD_DATA].base, MOD_DATA);
> + module_memory_free(mod, MOD_DATA);
>  }
>  
>  /* Free a module, remove from lists, etc. */
> @@ -2225,7 +2254,6 @@ static int find_module_sections(struct module *mod, 
> struct load_info *info)
>  static int move_module(struct module *mod, struct load_info *info)
>  {
>   int i;
> - void *ptr;
>   enum mod_mem_type t = 0;
>   int ret = -ENOMEM;
>  
> @@ -2234,26 +2262,12 @@ static int move_module(struct module *mod, struct 
> load_info *info)
>   mod->mem[type].base = NULL;
>   continue;
>   }
> - mod->mem[type].size = PAGE_ALIGN(mod->mem[type].size);
> - ptr = module_memory_alloc(mod->mem[type].size, type);
> - /*
> - * The pointer to these blocks of memory are stored on the 
> module
> - * structure and we keep that around so long as the module is
> - * around. We only free that memory when we unload the 
> module.
> - * Just mark them as not being a leak then. The .init* ELF
> - * sections *do* get freed after boot so we *could* treat 
> them
> - * slightly differently with kmemleak_ignore() and only grey
> - * them out as they work as typical memory allocations which
> - * *do* eventually get freed, but let's just keep things 
> simple
> - * and avoid *any* false positives.
> -  */
> - kmemleak_not_leak(ptr);
> - if (!ptr) {
> +
> + ret = module_memory_alloc(mod, type);
> +

[PATCH v10 36/36] fgraph: Skip recording calltime/rettime if it is not needed

2024-05-07 Thread Masami Hiramatsu (Google)
From: Masami Hiramatsu (Google) 

Skip recording calltime and rettime if the fgraph_ops does not need it.
This is a kind of performance optimization for fprobe. Since the fprobe
user does not use these entries, recording the timestamp in fgraph is
just overhead (e.g. for eBPF and ftrace users). So introduce the
skip_timestamp flag, and if all fgraph_ops set this flag, skip recording
calltime and rettime.
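
For example, a fgraph user that does not consume calltime/rettime could opt
out as below (just a sketch; the callback names are hypothetical and their
prototypes are omitted):

	static struct fgraph_ops my_gops = {
		.entryfunc	= my_entry,	/* hypothetical callbacks, */
		.retfunc	= my_return,	/* prototypes omitted here */
		/* no calltime/rettime needed, so skip trace_clock_local() */
		.skip_timestamp	= true,
	};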

Suggested-by: Jiri Olsa 
Signed-off-by: Masami Hiramatsu (Google) 
---
 Changes in v10:
  - Add likely() to skipping timestamp.
 Changes in v9:
  - Newly added.
---
 include/linux/ftrace.h |2 ++
 kernel/trace/fgraph.c  |   51 +---
 kernel/trace/fprobe.c  |1 +
 3 files changed, 47 insertions(+), 7 deletions(-)

diff --git a/include/linux/ftrace.h b/include/linux/ftrace.h
index 64ca91d1527f..eb9de9d70829 100644
--- a/include/linux/ftrace.h
+++ b/include/linux/ftrace.h
@@ -1156,6 +1156,8 @@ struct fgraph_ops {
struct ftrace_ops   ops; /* for the hash lists */
void*private;
int idx;
+   /* If skip_timestamp is true, this does not record timestamps. */
+   boolskip_timestamp;
 };
 
 void *fgraph_reserve_data(int idx, int size_bytes);
diff --git a/kernel/trace/fgraph.c b/kernel/trace/fgraph.c
index 40f47fcbc6c3..13b41485ce49 100644
--- a/kernel/trace/fgraph.c
+++ b/kernel/trace/fgraph.c
@@ -138,6 +138,7 @@ DEFINE_STATIC_KEY_FALSE(kill_ftrace_graph);
 int ftrace_graph_active;
 
 static struct fgraph_ops *fgraph_array[FGRAPH_ARRAY_SIZE];
+static bool fgraph_skip_timestamp;
 
 /* LRU index table for fgraph_array */
 static int fgraph_lru_table[FGRAPH_ARRAY_SIZE];
@@ -483,7 +484,7 @@ void ftrace_graph_stop(void)
 static int
 ftrace_push_return_trace(unsigned long ret, unsigned long func,
 unsigned long frame_pointer, unsigned long *retp,
-int fgraph_idx)
+int fgraph_idx, bool skip_ts)
 {
struct ftrace_ret_stack *ret_stack;
unsigned long long calltime;
@@ -506,8 +507,12 @@ ftrace_push_return_trace(unsigned long ret, unsigned long 
func,
ret_stack = get_ret_stack(current, current->curr_ret_stack, &offset);
if (ret_stack && ret_stack->func == func &&
get_fgraph_type(current, offset + FGRAPH_FRAME_OFFSET) == 
FGRAPH_TYPE_BITMAP &&
-   !is_fgraph_index_set(current, offset + FGRAPH_FRAME_OFFSET, 
fgraph_idx))
+   !is_fgraph_index_set(current, offset + FGRAPH_FRAME_OFFSET, 
fgraph_idx)) {
+   /* If previous one skips calltime, update it. */
+   if (!skip_ts && !ret_stack->calltime)
+   ret_stack->calltime = trace_clock_local();
return offset + FGRAPH_FRAME_OFFSET;
+   }
 
val = (FGRAPH_TYPE_RESERVED << FGRAPH_TYPE_SHIFT) | FGRAPH_FRAME_OFFSET;
 
@@ -525,7 +530,11 @@ ftrace_push_return_trace(unsigned long ret, unsigned long 
func,
return -EBUSY;
}
 
-   calltime = trace_clock_local();
+   /* This is not really 'likely' but for keeping the least path to be 
faster. */
+   if (likely(skip_ts))
+   calltime = 0LL;
+   else
+   calltime = trace_clock_local();
 
offset = READ_ONCE(current->curr_ret_stack);
ret_stack = RET_STACK(current, offset);
@@ -609,7 +618,8 @@ int function_graph_enter_regs(unsigned long ret, unsigned 
long func,
trace.func = func;
trace.depth = ++current->curr_ret_depth;
 
-   offset = ftrace_push_return_trace(ret, func, frame_pointer, retp, 0);
+   offset = ftrace_push_return_trace(ret, func, frame_pointer, retp, 0,
+ fgraph_skip_timestamp);
if (offset < 0)
goto out;
 
@@ -662,7 +672,8 @@ int function_graph_enter_ops(unsigned long ret, unsigned 
long func,
return -ENODEV;
 
/* Use start for the distance to ret_stack (skipping over reserve) */
-   offset = ftrace_push_return_trace(ret, func, frame_pointer, retp, 
gops->idx);
+   offset = ftrace_push_return_trace(ret, func, frame_pointer, retp, 
gops->idx,
+ gops->skip_timestamp);
if (offset < 0)
return offset;
type = get_fgraph_type(current, offset);
@@ -740,6 +751,7 @@ ftrace_pop_return_trace(struct ftrace_graph_ret *trace, 
unsigned long *ret,
*ret = ret_stack->ret;
trace->func = ret_stack->func;
trace->calltime = ret_stack->calltime;
+   trace->rettime = 0;
trace->overrun = atomic_read(&current->trace_overrun);
trace->depth = current->curr_ret_depth;
/*
@@ -800,7 +812,6 @@ __ftrace_return_to_handler(struct ftrace_regs *fregs, 
unsigned long frame_pointe
re

[PATCH v10 35/36] Documentation: probes: Update fprobe on function-graph tracer

2024-05-07 Thread Masami Hiramatsu (Google)
From: Masami Hiramatsu (Google) 

Update fprobe documentation for the new fprobe on function-graph
tracer. This includes some behavior changes and the pt_regs to
ftrace_regs interface change.

Signed-off-by: Masami Hiramatsu (Google) 
---
 Changes in v2:
  - Update @fregs parameter explanation.
---
 Documentation/trace/fprobe.rst |   42 ++--
 1 file changed, 27 insertions(+), 15 deletions(-)

diff --git a/Documentation/trace/fprobe.rst b/Documentation/trace/fprobe.rst
index 196f52386aaa..f58bdc64504f 100644
--- a/Documentation/trace/fprobe.rst
+++ b/Documentation/trace/fprobe.rst
@@ -9,9 +9,10 @@ Fprobe - Function entry/exit probe
 Introduction
 
 
-Fprobe is a function entry/exit probe mechanism based on ftrace.
-Instead of using ftrace full feature, if you only want to attach callbacks
-on function entry and exit, similar to the kprobes and kretprobes, you can
+Fprobe is a function entry/exit probe mechanism based on the function-graph
+tracer.
+Instead of tracing all functions, if you want to attach callbacks on specific
+function entry and exit, similar to the kprobes and kretprobes, you can
 use fprobe. Compared with kprobes and kretprobes, fprobe gives faster
 instrumentation for multiple functions with single handler. This document
 describes how to use fprobe.
@@ -91,12 +92,14 @@ The prototype of the entry/exit callback function are as 
follows:
 
 .. code-block:: c
 
- int entry_callback(struct fprobe *fp, unsigned long entry_ip, unsigned long 
ret_ip, struct pt_regs *regs, void *entry_data);
+ int entry_callback(struct fprobe *fp, unsigned long entry_ip, unsigned long 
ret_ip, struct ftrace_regs *fregs, void *entry_data);
 
- void exit_callback(struct fprobe *fp, unsigned long entry_ip, unsigned long 
ret_ip, struct pt_regs *regs, void *entry_data);
+ void exit_callback(struct fprobe *fp, unsigned long entry_ip, unsigned long 
ret_ip, struct ftrace_regs *fregs, void *entry_data);
 
-Note that the @entry_ip is saved at function entry and passed to exit handler.
-If the entry callback function returns !0, the corresponding exit callback 
will be cancelled.
+Note that the @entry_ip is saved at function entry and passed to exit
+handler.
+If the entry callback function returns !0, the corresponding exit callback
+will be cancelled.
 
 @fp
 This is the address of `fprobe` data structure related to this handler.
@@ -112,12 +115,10 @@ If the entry callback function returns !0, the 
corresponding exit callback will
 This is the return address that the traced function will return to,
 somewhere in the caller. This can be used at both entry and exit.
 
-@regs
-This is the `pt_regs` data structure at the entry and exit. Note that
-the instruction pointer of @regs may be different from the @entry_ip
-in the entry_handler. If you need traced instruction pointer, you need
-to use @entry_ip. On the other hand, in the exit_handler, the 
instruction
-pointer of @regs is set to the current return address.
+@fregs
+This is the `ftrace_regs` data structure at the entry and exit. This
+includes the function parameters, or the return values. So user can
+access those values via appropriate `ftrace_regs_*` APIs.
 
 @entry_data
 This is a local storage to share the data between entry and exit 
handlers.
@@ -125,6 +126,17 @@ If the entry callback function returns !0, the 
corresponding exit callback will
 and `entry_data_size` field when registering the fprobe, the storage is
 allocated and passed to both `entry_handler` and `exit_handler`.
 
+Entry data size and exit handlers on the same function
+=======================================================
+
+Since the entry data is passed via the per-task stack, which has a limited
+size, the entry data size per probe is limited to `15 * sizeof(long)`. Also
+note that if different fprobes probe the same function, this limit becomes
+smaller. The entry data size is aligned to `sizeof(long)` and each fprobe
+which has an exit handler uses a `sizeof(long)` slot on the stack, so you
+should keep the number of fprobes probing the same function as small as
+possible.
+
 Share the callbacks with kprobes
 
 
@@ -165,8 +177,8 @@ This counter counts up when;
  - fprobe fails to take ftrace_recursion lock. This usually means that a 
function
which is traced by other ftrace users is called from the entry_handler.
 
- - fprobe fails to setup the function exit because of the shortage of rethook
-   (the shadow stack for hooking the function return.)
+ - fprobe fails to setup the function exit because of failing to allocate the
+   data buffer from the per-task shadow stack.
 
 The `fprobe::nmissed` field counts up in both cases. Therefore, the former
 skips both of entry and exit callback and the latter skips the exit
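
As an illustration of the `entry_data` / `entry_data_size` description updated
above, here is a minimal sketch (the struct and handler names are hypothetical)
of an fprobe sharing per-call data between its entry and exit callbacks, using
the ftrace_regs based prototypes from this patch:

	#include <linux/fprobe.h>
	#include <linux/jiffies.h>
	#include <linux/printk.h>

	struct my_entry_data {
		unsigned long t_entry;
	};

	static int my_entry(struct fprobe *fp, unsigned long entry_ip,
			    unsigned long ret_ip, struct ftrace_regs *fregs,
			    void *entry_data)
	{
		struct my_entry_data *d = entry_data;

		d->t_entry = jiffies;
		return 0;	/* returning !0 would cancel the exit callback */
	}

	static void my_exit(struct fprobe *fp, unsigned long entry_ip,
			    unsigned long ret_ip, struct ftrace_regs *fregs,
			    void *entry_data)
	{
		struct my_entry_data *d = entry_data;

		pr_info("spent %lu jiffies\n", jiffies - d->t_entry);
	}

	static struct fprobe my_fprobe = {
		.entry_handler	 = my_entry,
		.exit_handler	 = my_exit,
		/* must fit in the 15 * sizeof(long) budget described above */
		.entry_data_size = sizeof(struct my_entry_data),
	};

Registering it with register_fprobe(&my_fprobe, "vfs_read", NULL) then gives
both handlers a shared, per-call my_entry_data buffer.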




[PATCH v10 17/36] function_graph: Move graph notrace bit to shadow stack global var

2024-05-07 Thread Masami Hiramatsu (Google)
From: Steven Rostedt (VMware) 

The use of the task->trace_recursion for the logic used for the function
graph no-trace was a bit of an abuse of that variable. Now that there
exists global vars that are per stack for registered graph traces, use
that instead.

Signed-off-by: Steven Rostedt (VMware) 
Signed-off-by: Masami Hiramatsu (Google) 
---
 Changes in v2:
  - Make description lines shorter than 76 chars.
---
 include/linux/trace_recursion.h  |7 ---
 kernel/trace/trace.h |9 +
 kernel/trace/trace_functions_graph.c |   10 ++
 3 files changed, 15 insertions(+), 11 deletions(-)

diff --git a/include/linux/trace_recursion.h b/include/linux/trace_recursion.h
index fdfb6f66718a..ae04054a1be3 100644
--- a/include/linux/trace_recursion.h
+++ b/include/linux/trace_recursion.h
@@ -44,13 +44,6 @@ enum {
  */
TRACE_IRQ_BIT,
 
-   /*
-* To implement set_graph_notrace, if this bit is set, we ignore
-* function graph tracing of called functions, until the return
-* function is called to clear it.
-*/
-   TRACE_GRAPH_NOTRACE_BIT,
-
/* Used to prevent recursion recording from recursing. */
TRACE_RECORD_RECURSION_BIT,
 };
diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h
index 7ab731b9ebc8..f23b6fbd547d 100644
--- a/kernel/trace/trace.h
+++ b/kernel/trace/trace.h
@@ -918,8 +918,17 @@ enum {
 
TRACE_GRAPH_DEPTH_START_BIT,
TRACE_GRAPH_DEPTH_END_BIT,
+
+   /*
+* To implement set_graph_notrace, if this bit is set, we ignore
+* function graph tracing of called functions, until the return
+* function is called to clear it.
+*/
+   TRACE_GRAPH_NOTRACE_BIT,
 };
 
#define TRACE_GRAPH_NOTRACE	(1 << TRACE_GRAPH_NOTRACE_BIT)
+
 static inline unsigned long ftrace_graph_depth(unsigned long *task_var)
 {
return (*task_var >> TRACE_GRAPH_DEPTH_START_BIT) & 3;
diff --git a/kernel/trace/trace_functions_graph.c 
b/kernel/trace/trace_functions_graph.c
index 66cce73e94f8..13d0387ac6a6 100644
--- a/kernel/trace/trace_functions_graph.c
+++ b/kernel/trace/trace_functions_graph.c
@@ -130,6 +130,7 @@ static inline int ftrace_graph_ignore_irqs(void)
 int trace_graph_entry(struct ftrace_graph_ent *trace,
  struct fgraph_ops *gops)
 {
+   unsigned long *task_var = fgraph_get_task_var(gops);
struct trace_array *tr = gops->private;
struct trace_array_cpu *data;
unsigned long flags;
@@ -138,7 +139,7 @@ int trace_graph_entry(struct ftrace_graph_ent *trace,
int ret;
int cpu;
 
-   if (trace_recursion_test(TRACE_GRAPH_NOTRACE_BIT))
+   if (*task_var & TRACE_GRAPH_NOTRACE)
return 0;
 
/*
@@ -149,7 +150,7 @@ int trace_graph_entry(struct ftrace_graph_ent *trace,
 * returning from the function.
 */
if (ftrace_graph_notrace_addr(trace->func)) {
-   trace_recursion_set(TRACE_GRAPH_NOTRACE_BIT);
+   *task_var |= TRACE_GRAPH_NOTRACE_BIT;
/*
 * Need to return 1 to have the return called
 * that will clear the NOTRACE bit.
@@ -240,6 +241,7 @@ void __trace_graph_return(struct trace_array *tr,
 void trace_graph_return(struct ftrace_graph_ret *trace,
struct fgraph_ops *gops)
 {
+   unsigned long *task_var = fgraph_get_task_var(gops);
struct trace_array *tr = gops->private;
struct trace_array_cpu *data;
unsigned long flags;
@@ -249,8 +251,8 @@ void trace_graph_return(struct ftrace_graph_ret *trace,
 
ftrace_graph_addr_finish(gops, trace);
 
-   if (trace_recursion_test(TRACE_GRAPH_NOTRACE_BIT)) {
-   trace_recursion_clear(TRACE_GRAPH_NOTRACE_BIT);
+   if (*task_var & TRACE_GRAPH_NOTRACE) {
+   *task_var &= ~TRACE_GRAPH_NOTRACE;
return;
}
 




[PATCH v10 34/36] selftests/ftrace: Add a test case for repeating register/unregister fprobe

2024-05-07 Thread Masami Hiramatsu (Google)
From: Masami Hiramatsu (Google) 

This test case repeatedly defines and undefines the fprobe dynamic event
to ensure that fprobe does not cause any issues with such operations.

Signed-off-by: Masami Hiramatsu (Google) 
---
 .../test.d/dynevent/add_remove_fprobe_repeat.tc|   19 +++
 1 file changed, 19 insertions(+)
 create mode 100644 
tools/testing/selftests/ftrace/test.d/dynevent/add_remove_fprobe_repeat.tc

diff --git 
a/tools/testing/selftests/ftrace/test.d/dynevent/add_remove_fprobe_repeat.tc 
b/tools/testing/selftests/ftrace/test.d/dynevent/add_remove_fprobe_repeat.tc
new file mode 100644
index ..b4ad09237e2a
--- /dev/null
+++ b/tools/testing/selftests/ftrace/test.d/dynevent/add_remove_fprobe_repeat.tc
@@ -0,0 +1,19 @@
+#!/bin/sh
+# SPDX-License-Identifier: GPL-2.0
+# description: Generic dynamic event - Repeating add/remove fprobe events
+# requires: dynamic_events "f[:[<group>/][<event>]] <func-name> [%return] [<args>]":README
+
+echo 0 > events/enable
+echo > dynamic_events
+
+PLACE=$FUNCTION_FORK
+REPEAT_TIMES=64
+
+for i in `seq 1 $REPEAT_TIMES`; do
+  echo "f:myevent $PLACE" >> dynamic_events
+  grep -q myevent dynamic_events
+  test -d events/fprobes/myevent
+  echo > dynamic_events
+done
+
+clear_trace




[PATCH v10 33/36] selftests: ftrace: Remove obsolete maxactive syntax check

2024-05-07 Thread Masami Hiramatsu (Google)
From: Masami Hiramatsu (Google) 

Since the fprobe event does not support maxactive anymore, stop
testing the maxactive syntax error checking.

Signed-off-by: Masami Hiramatsu (Google) 
---
 .../ftrace/test.d/dynevent/fprobe_syntax_errors.tc |4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git 
a/tools/testing/selftests/ftrace/test.d/dynevent/fprobe_syntax_errors.tc 
b/tools/testing/selftests/ftrace/test.d/dynevent/fprobe_syntax_errors.tc
index 61877d166451..c9425a34fae3 100644
--- a/tools/testing/selftests/ftrace/test.d/dynevent/fprobe_syntax_errors.tc
+++ b/tools/testing/selftests/ftrace/test.d/dynevent/fprobe_syntax_errors.tc
@@ -16,9 +16,7 @@ aarch64)
   REG=%r0 ;;
 esac
 
-check_error 'f^100 vfs_read'   # MAXACT_NO_KPROBE
-check_error 'f^1a111 vfs_read' # BAD_MAXACT
-check_error 'f^100000 vfs_read'# MAXACT_TOO_BIG
+check_error 'f^100 vfs_read'   # BAD_MAXACT
 
 check_error 'f ^non_exist_func'# BAD_PROBE_ADDR (enoent)
 check_error 'f ^vfs_read+10'   # BAD_PROBE_ADDR




[PATCH v10 32/36] tracing/fprobe: Remove nr_maxactive from fprobe

2024-05-07 Thread Masami Hiramatsu (Google)
From: Masami Hiramatsu (Google) 

Remove the deprecated fprobe::nr_maxactive. This makes fprobe events
reject the maxactive number.

Signed-off-by: Masami Hiramatsu (Google) 
---
 Changes in v2:
  - Newly added.
---
 include/linux/fprobe.h  |2 --
 kernel/trace/trace_fprobe.c |   44 ++-
 2 files changed, 6 insertions(+), 40 deletions(-)

diff --git a/include/linux/fprobe.h b/include/linux/fprobe.h
index 2d06bbd99601..a86b3e4df2a0 100644
--- a/include/linux/fprobe.h
+++ b/include/linux/fprobe.h
@@ -54,7 +54,6 @@ struct fprobe_hlist {
  * @nmissed: The counter for missing events.
  * @flags: The status flag.
  * @entry_data_size: The private data storage size.
- * @nr_maxactive: The max number of active functions. (*deprecated)
  * @entry_handler: The callback function for function entry.
  * @exit_handler: The callback function for function exit.
  * @hlist_array: The fprobe_hlist for fprobe search from IP hash table.
@@ -63,7 +62,6 @@ struct fprobe {
unsigned long   nmissed;
unsigned intflags;
size_t  entry_data_size;
-   int nr_maxactive;
 
fprobe_entry_cb entry_handler;
fprobe_exit_cb  exit_handler;
diff --git a/kernel/trace/trace_fprobe.c b/kernel/trace/trace_fprobe.c
index 86cd6a8c806a..20ef5cd5d419 100644
--- a/kernel/trace/trace_fprobe.c
+++ b/kernel/trace/trace_fprobe.c
@@ -422,7 +422,6 @@ static struct trace_fprobe *alloc_trace_fprobe(const char 
*group,
   const char *event,
   const char *symbol,
   struct tracepoint *tpoint,
-  int maxactive,
   int nargs, bool is_return)
 {
struct trace_fprobe *tf;
@@ -442,7 +441,6 @@ static struct trace_fprobe *alloc_trace_fprobe(const char 
*group,
tf->fp.entry_handler = fentry_dispatcher;
 
tf->tpoint = tpoint;
-   tf->fp.nr_maxactive = maxactive;
 
ret = trace_probe_init(&tf->tp, event, group, false, nargs);
if (ret < 0)
@@ -1021,12 +1019,11 @@ static int __trace_fprobe_create(int argc, const char 
*argv[])
 *  FETCHARG:TYPE : use TYPE instead of unsigned long.
 */
struct trace_fprobe *tf = NULL;
-   int i, len, new_argc = 0, ret = 0;
+   int i, new_argc = 0, ret = 0;
bool is_return = false;
char *symbol = NULL;
const char *event = NULL, *group = FPROBE_EVENT_SYSTEM;
const char **new_argv = NULL;
-   int maxactive = 0;
char buf[MAX_EVENT_NAME_LEN];
char gbuf[MAX_EVENT_NAME_LEN];
char sbuf[KSYM_NAME_LEN];
@@ -1048,33 +1045,13 @@ static int __trace_fprobe_create(int argc, const char 
*argv[])
 
trace_probe_log_init("trace_fprobe", argc, argv);
 
-   event = strchr(&argv[0][1], ':');
-   if (event)
-   event++;
-
-   if (isdigit(argv[0][1])) {
-   if (event)
-   len = event - &argv[0][1] - 1;
-   else
-   len = strlen(&argv[0][1]);
-   if (len > MAX_EVENT_NAME_LEN - 1) {
-   trace_probe_log_err(1, BAD_MAXACT);
-   goto parse_error;
-   }
-   memcpy(buf, &argv[0][1], len);
-   buf[len] = '\0';
-   ret = kstrtouint(buf, 0, &maxactive);
-   if (ret || !maxactive) {
+   if (argv[0][1] != '\0') {
+   if (argv[0][1] != ':') {
+   trace_probe_log_set_index(0);
trace_probe_log_err(1, BAD_MAXACT);
goto parse_error;
}
-   /* fprobe rethook instances are iterated over via a list. The
-* maximum should stay reasonable.
-*/
-   if (maxactive > RETHOOK_MAXACTIVE_MAX) {
-   trace_probe_log_err(1, MAXACT_TOO_BIG);
-   goto parse_error;
-   }
+   event = &argv[0][2];
}
 
trace_probe_log_set_index(1);
@@ -1084,12 +1061,6 @@ static int __trace_fprobe_create(int argc, const char 
*argv[])
if (ret < 0)
goto parse_error;
 
-   if (!is_return && maxactive) {
-   trace_probe_log_set_index(0);
-   trace_probe_log_err(1, BAD_MAXACT_TYPE);
-   goto parse_error;
-   }
-
trace_probe_log_set_index(0);
if (event) {
ret = traceprobe_parse_event_name(, , gbuf,
@@ -1147,8 +1118,7 @@ static int __trace_fprobe_create(int argc, const char 
*argv[])
goto out;
 
/* setup a probe */
-   tf = alloc_trace_fprobe(group, event, symbol, tpoint, maxactive,
-   argc, is_return);
+   tf = alloc_trace

[PATCH v10 31/36] fprobe: Rewrite fprobe on function-graph tracer

2024-05-07 Thread Masami Hiramatsu (Google)
From: Masami Hiramatsu (Google) 

Rewrite the fprobe implementation on top of the function-graph tracer.
Major API changes are:
 -  The 'nr_maxactive' field is deprecated.
 -  This depends on CONFIG_DYNAMIC_FTRACE_WITH_ARGS or
    !CONFIG_HAVE_DYNAMIC_FTRACE_WITH_ARGS, and
    CONFIG_HAVE_FUNCTION_GRAPH_FREGS. So currently it works only
    on x86_64.
 -  Currently the entry size is limited to 15 * sizeof(long).
 -  If too many fprobe exit handlers are set on the same
    function, it will fail to probe.

Signed-off-by: Masami Hiramatsu (Google) 
---
 Changes in v9:
  - Remove unneeded prototype of ftrace_regs_get_return_address().
  - Fix entry data address calculation.
  - Remove DIV_ROUND_UP() from hotpath.
 Changes in v8:
  - Use trace_func_graph_ret/ent_t for fgraph_ops.
  - Update CONFIG_FPROBE dependencies.
  - Add ftrace_regs_get_return_address() for each arch.
 Changes in v3:
  - Update for new reserve_data/retrieve_data API.
  - Fix internal push/pop on fgraph data logic so that it can
correctly save/restore the returning fprobes.
 Changes in v2:
  - Add more lockdep_assert_held(fprobe_mutex)
  - Use READ_ONCE() and WRITE_ONCE() for fprobe_hlist_node::fp.
  - Add NOKPROBE_SYMBOL() for the functions which is called from
entry/exit callback.
---
 arch/arm64/include/asm/ftrace.h |6 
 arch/loongarch/include/asm/ftrace.h |6 
 arch/powerpc/include/asm/ftrace.h   |6 
 arch/s390/include/asm/ftrace.h  |6 
 arch/x86/include/asm/ftrace.h   |6 
 include/linux/fprobe.h  |   53 ++-
 kernel/trace/Kconfig|8 
 kernel/trace/fprobe.c   |  638 +--
 lib/test_fprobe.c   |   45 --
 9 files changed, 529 insertions(+), 245 deletions(-)

diff --git a/arch/arm64/include/asm/ftrace.h b/arch/arm64/include/asm/ftrace.h
index 95a8f349f871..800c75f46a13 100644
--- a/arch/arm64/include/asm/ftrace.h
+++ b/arch/arm64/include/asm/ftrace.h
@@ -143,6 +143,12 @@ ftrace_regs_get_frame_pointer(const struct ftrace_regs 
*fregs)
return fregs->fp;
 }
 
+static __always_inline unsigned long
+ftrace_regs_get_return_address(const struct ftrace_regs *fregs)
+{
+   return fregs->lr;
+}
+
 static __always_inline struct pt_regs *
 ftrace_partial_regs(const struct ftrace_regs *fregs, struct pt_regs *regs)
 {
diff --git a/arch/loongarch/include/asm/ftrace.h 
b/arch/loongarch/include/asm/ftrace.h
index 14a1576bf948..b8432b7cc9d4 100644
--- a/arch/loongarch/include/asm/ftrace.h
+++ b/arch/loongarch/include/asm/ftrace.h
@@ -81,6 +81,12 @@ ftrace_regs_set_instruction_pointer(struct ftrace_regs 
*fregs, unsigned long ip)
 #define ftrace_regs_get_frame_pointer(fregs) \
((fregs)->regs.regs[22])
 
+static __always_inline unsigned long
+ftrace_regs_get_return_address(struct ftrace_regs *fregs)
+{
+   return *(unsigned long *)(fregs->regs.regs[1]);
+}
+
 #define ftrace_graph_func ftrace_graph_func
 void ftrace_graph_func(unsigned long ip, unsigned long parent_ip,
   struct ftrace_ops *op, struct ftrace_regs *fregs);
diff --git a/arch/powerpc/include/asm/ftrace.h 
b/arch/powerpc/include/asm/ftrace.h
index 51245fd6b45b..d8a74a6570f8 100644
--- a/arch/powerpc/include/asm/ftrace.h
+++ b/arch/powerpc/include/asm/ftrace.h
@@ -77,6 +77,12 @@ ftrace_regs_get_instruction_pointer(struct ftrace_regs 
*fregs)
 #define ftrace_regs_query_register_offset(name) \
regs_query_register_offset(name)
 
+static __always_inline unsigned long
+ftrace_regs_get_return_address(struct ftrace_regs *fregs)
+{
+   return fregs->regs.link;
+}
+
 struct ftrace_ops;
 
 #define ftrace_graph_func ftrace_graph_func
diff --git a/arch/s390/include/asm/ftrace.h b/arch/s390/include/asm/ftrace.h
index cb8d60a5fe1d..d8ca1776c554 100644
--- a/arch/s390/include/asm/ftrace.h
+++ b/arch/s390/include/asm/ftrace.h
@@ -89,6 +89,12 @@ ftrace_regs_get_frame_pointer(struct ftrace_regs *fregs)
return sp[0];   /* return backchain */
 }
 
+static __always_inline unsigned long
+ftrace_regs_get_return_address(const struct ftrace_regs *fregs)
+{
+   return fregs->regs.gprs[14];
+}
+
 #define arch_ftrace_fill_perf_regs(fregs, _regs)do {   \
(_regs)->psw.addr = (fregs)->regs.psw.addr; \
(_regs)->gprs[15] = (fregs)->regs.gprs[15]; \
diff --git a/arch/x86/include/asm/ftrace.h b/arch/x86/include/asm/ftrace.h
index 7625887fc49b..979d3458a328 100644
--- a/arch/x86/include/asm/ftrace.h
+++ b/arch/x86/include/asm/ftrace.h
@@ -82,6 +82,12 @@ arch_ftrace_get_regs(struct ftrace_regs *fregs)
 #define ftrace_regs_get_frame_pointer(fregs) \
frame_pointer(&(fregs)->regs)
 
+static __always_inline unsigned long
+ftrace_regs_get_return_address(struct ftrace_regs *fregs)
+{
+   return *(unsigned long *)ftrace_regs_get_stack_pointer(fregs);
+}
+
 struct ftrace_ops;
 #define ftrace_graph_func ftrace_graph_func
 void ftrace_graph_func(unsi

[PATCH v10 30/36] ftrace: Add CONFIG_HAVE_FTRACE_GRAPH_FUNC

2024-05-07 Thread Masami Hiramatsu (Google)
From: Masami Hiramatsu (Google) 

Add the CONFIG_HAVE_FTRACE_GRAPH_FUNC kconfig symbol in addition to the
ftrace_graph_func macro check. This lets other features (e.g. FPROBE)
which require access to ftrace_regs from fgraph_ops::entryfunc() avoid
being compiled in when fgraph cannot pass a valid ftrace_regs.

Signed-off-by: Masami Hiramatsu (Google) 
---
 Changes in v8:
  - Newly added.
---
 arch/arm64/Kconfig |1 +
 arch/loongarch/Kconfig |1 +
 arch/powerpc/Kconfig   |1 +
 arch/riscv/Kconfig |1 +
 arch/x86/Kconfig   |1 +
 kernel/trace/Kconfig   |5 +
 6 files changed, 10 insertions(+)

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 8d5047bc13bc..e0a5c69eeda2 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -206,6 +206,7 @@ config ARM64
select HAVE_SAMPLE_FTRACE_DIRECT_MULTI
select HAVE_EFFICIENT_UNALIGNED_ACCESS
select HAVE_FAST_GUP
+   select HAVE_FTRACE_GRAPH_FUNC
select HAVE_FTRACE_MCOUNT_RECORD
select HAVE_FUNCTION_TRACER
select HAVE_FUNCTION_ERROR_INJECTION
diff --git a/arch/loongarch/Kconfig b/arch/loongarch/Kconfig
index 21dc39ae6bc2..0276a6825e6d 100644
--- a/arch/loongarch/Kconfig
+++ b/arch/loongarch/Kconfig
@@ -121,6 +121,7 @@ config LOONGARCH
select HAVE_EFFICIENT_UNALIGNED_ACCESS if !ARCH_STRICT_ALIGN
select HAVE_EXIT_THREAD
select HAVE_FAST_GUP
+   select HAVE_FTRACE_GRAPH_FUNC
select HAVE_FTRACE_MCOUNT_RECORD
select HAVE_FUNCTION_ARG_ACCESS_API
select HAVE_FUNCTION_ERROR_INJECTION
diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 1c4be3373686..b79d16c5846a 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -237,6 +237,7 @@ config PPC
select HAVE_EBPF_JIT
select HAVE_EFFICIENT_UNALIGNED_ACCESS
select HAVE_FAST_GUP
+   select HAVE_FTRACE_GRAPH_FUNC
select HAVE_FTRACE_MCOUNT_RECORD
select HAVE_FUNCTION_ARG_ACCESS_API
select HAVE_FUNCTION_DESCRIPTORSif PPC64_ELF_ABI_V1
diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
index b58b8e81b510..6fd2a166904b 100644
--- a/arch/riscv/Kconfig
+++ b/arch/riscv/Kconfig
@@ -127,6 +127,7 @@ config RISCV
select HAVE_DYNAMIC_FTRACE if !XIP_KERNEL && MMU && 
(CLANG_SUPPORTS_DYNAMIC_FTRACE || GCC_SUPPORTS_DYNAMIC_FTRACE)
select HAVE_DYNAMIC_FTRACE_WITH_DIRECT_CALLS
select HAVE_DYNAMIC_FTRACE_WITH_REGS if HAVE_DYNAMIC_FTRACE
+   select HAVE_FTRACE_GRAPH_FUNC
select HAVE_FTRACE_MCOUNT_RECORD if !XIP_KERNEL
select HAVE_FUNCTION_GRAPH_TRACER
select HAVE_FUNCTION_GRAPH_FREGS
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 4da23dc0b07c..bd86e598e31d 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -225,6 +225,7 @@ config X86
select HAVE_EXIT_THREAD
select HAVE_FAST_GUP
select HAVE_FENTRY  if X86_64 || DYNAMIC_FTRACE
+   select HAVE_FTRACE_GRAPH_FUNC   if HAVE_FUNCTION_GRAPH_TRACER
select HAVE_FTRACE_MCOUNT_RECORD
select HAVE_FUNCTION_GRAPH_FREGSif HAVE_FUNCTION_GRAPH_TRACER
select HAVE_FUNCTION_GRAPH_TRACER   if X86_32 || (X86_64 && 
DYNAMIC_FTRACE)
diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig
index 0b6ce0a38967..0e4c33f1ab43 100644
--- a/kernel/trace/Kconfig
+++ b/kernel/trace/Kconfig
@@ -34,6 +34,11 @@ config HAVE_FUNCTION_GRAPH_TRACER
 config HAVE_FUNCTION_GRAPH_FREGS
bool
 
+config HAVE_FTRACE_GRAPH_FUNC
+   bool
+   help
+ True if ftrace_graph_func() is defined.
+
 config HAVE_DYNAMIC_FTRACE
bool
help




[PATCH v10 29/36] bpf: Enable kprobe_multi feature if CONFIG_FPROBE is enabled

2024-05-07 Thread Masami Hiramatsu (Google)
From: Masami Hiramatsu (Google) 

Enable kprobe_multi feature if CONFIG_FPROBE is enabled. The pt_regs is
converted from ftrace_regs by ftrace_partial_regs(), thus some registers
may always return 0. But it should be enough for function entry (access
arguments) and exit (access return value).

Signed-off-by: Masami Hiramatsu (Google) 
Acked-by: Florent Revest 
---
 Changes in v9:
  - Avoid wasting memory for bpf_kprobe_multi_pt_regs when
CONFIG_HAVE_PT_REGS_TO_FTRACE_REGS_CAST=y
---
 kernel/trace/bpf_trace.c |   27 ++-
 1 file changed, 14 insertions(+), 13 deletions(-)

diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index e51a6ef87167..b779f4a83361 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -2577,7 +2577,7 @@ static int __init bpf_event_init(void)
 fs_initcall(bpf_event_init);
 #endif /* CONFIG_MODULES */
 
-#if defined(CONFIG_FPROBE) && defined(CONFIG_DYNAMIC_FTRACE_WITH_REGS)
+#ifdef CONFIG_FPROBE
 struct bpf_kprobe_multi_link {
struct bpf_link link;
struct fprobe fp;
@@ -2600,6 +2600,13 @@ struct user_syms {
char *buf;
 };
 
+#ifndef CONFIG_HAVE_PT_REGS_TO_FTRACE_REGS_CAST
+static DEFINE_PER_CPU(struct pt_regs, bpf_kprobe_multi_pt_regs);
+#define bpf_kprobe_multi_pt_regs_ptr() this_cpu_ptr(&bpf_kprobe_multi_pt_regs)
+#else
+#define bpf_kprobe_multi_pt_regs_ptr() (NULL)
+#endif
+
 static int copy_user_syms(struct user_syms *us, unsigned long __user *usyms, 
u32 cnt)
 {
unsigned long __user usymbol;
@@ -2792,13 +2799,14 @@ static u64 bpf_kprobe_multi_entry_ip(struct bpf_run_ctx 
*ctx)
 
 static int
 kprobe_multi_link_prog_run(struct bpf_kprobe_multi_link *link,
-  unsigned long entry_ip, struct pt_regs *regs)
+  unsigned long entry_ip, struct ftrace_regs *fregs)
 {
struct bpf_kprobe_multi_run_ctx run_ctx = {
.link = link,
.entry_ip = entry_ip,
};
struct bpf_run_ctx *old_run_ctx;
+   struct pt_regs *regs;
int err;
 
if (unlikely(__this_cpu_inc_return(bpf_prog_active) != 1)) {
@@ -2809,6 +2817,7 @@ kprobe_multi_link_prog_run(struct bpf_kprobe_multi_link 
*link,
 
migrate_disable();
rcu_read_lock();
+   regs = ftrace_partial_regs(fregs, bpf_kprobe_multi_pt_regs_ptr());
old_run_ctx = bpf_set_run_ctx(&run_ctx.run_ctx);
err = bpf_prog_run(link->link.prog, regs);
bpf_reset_run_ctx(old_run_ctx);
@@ -2826,13 +2835,9 @@ kprobe_multi_link_handler(struct fprobe *fp, unsigned 
long fentry_ip,
  void *data)
 {
struct bpf_kprobe_multi_link *link;
-   struct pt_regs *regs = ftrace_get_regs(fregs);
-
-   if (!regs)
-   return 0;
 
link = container_of(fp, struct bpf_kprobe_multi_link, fp);
-   kprobe_multi_link_prog_run(link, get_entry_ip(fentry_ip), regs);
+   kprobe_multi_link_prog_run(link, get_entry_ip(fentry_ip), fregs);
return 0;
 }
 
@@ -2842,13 +2847,9 @@ kprobe_multi_link_exit_handler(struct fprobe *fp, 
unsigned long fentry_ip,
   void *data)
 {
struct bpf_kprobe_multi_link *link;
-   struct pt_regs *regs = ftrace_get_regs(fregs);
-
-   if (!regs)
-   return;
 
link = container_of(fp, struct bpf_kprobe_multi_link, fp);
-   kprobe_multi_link_prog_run(link, get_entry_ip(fentry_ip), regs);
+   kprobe_multi_link_prog_run(link, get_entry_ip(fentry_ip), fregs);
 }
 
 static int symbols_cmp_r(const void *a, const void *b, const void *priv)
@@ -3107,7 +3108,7 @@ int bpf_kprobe_multi_link_attach(const union bpf_attr 
*attr, struct bpf_prog *pr
kvfree(cookies);
return err;
 }
-#else /* !CONFIG_FPROBE || !CONFIG_DYNAMIC_FTRACE_WITH_REGS */
+#else /* !CONFIG_FPROBE */
 int bpf_kprobe_multi_link_attach(const union bpf_attr *attr, struct bpf_prog 
*prog)
 {
return -EOPNOTSUPP;




[PATCH v10 28/36] tracing/fprobe: Enable fprobe events with CONFIG_DYNAMIC_FTRACE_WITH_ARGS

2024-05-07 Thread Masami Hiramatsu (Google)
From: Masami Hiramatsu (Google) 

Allow fprobe events to be enabled with CONFIG_DYNAMIC_FTRACE_WITH_ARGS.
With this change, fprobe events mostly use ftrace_regs instead of pt_regs.
Note that if the arch doesn't enable HAVE_PT_REGS_COMPAT_FTRACE_REGS,
fprobe events will not be able to be used from perf.

Signed-off-by: Masami Hiramatsu (Google) 
---
 Changes in v9:
  - Copy store_trace_entry_data() as store_fprobe_entry_data() for
fprobe.
 Chagnes in v3:
  - Use ftrace_regs_get_return_value().
 Changes in v2:
  - Define ftrace_regs_get_kernel_stack_nth() for
!CONFIG_HAVE_REGS_AND_STACK_ACCESS_API.
 Changes from previous series: Update against the new series.
---
 include/linux/ftrace.h  |   17 ++
 kernel/trace/Kconfig|1 
 kernel/trace/trace_fprobe.c |  107 +--
 kernel/trace/trace_probe_tmpl.h |2 -
 4 files changed, 86 insertions(+), 41 deletions(-)

diff --git a/include/linux/ftrace.h b/include/linux/ftrace.h
index 3871823c1429..64ca91d1527f 100644
--- a/include/linux/ftrace.h
+++ b/include/linux/ftrace.h
@@ -256,6 +256,23 @@ static __always_inline bool ftrace_regs_has_args(struct 
ftrace_regs *fregs)
frame_pointer(&(fregs)->regs)
 #endif
 
+#ifdef CONFIG_HAVE_REGS_AND_STACK_ACCESS_API
+static __always_inline unsigned long
+ftrace_regs_get_kernel_stack_nth(struct ftrace_regs *fregs, unsigned int nth)
+{
+   unsigned long *stackp;
+
+   stackp = (unsigned long *)ftrace_regs_get_stack_pointer(fregs);
+   if (((unsigned long)(stackp + nth) & ~(THREAD_SIZE - 1)) ==
+   ((unsigned long)stackp & ~(THREAD_SIZE - 1)))
+   return *(stackp + nth);
+
+   return 0;
+}
+#else /* !CONFIG_HAVE_REGS_AND_STACK_ACCESS_API */
+#define ftrace_regs_get_kernel_stack_nth(fregs, nth)   (0L)
+#endif /* CONFIG_HAVE_REGS_AND_STACK_ACCESS_API */
+
 typedef void (*ftrace_func_t)(unsigned long ip, unsigned long parent_ip,
  struct ftrace_ops *op, struct ftrace_regs *fregs);
 
diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig
index 7df7b1fb305c..0b6ce0a38967 100644
--- a/kernel/trace/Kconfig
+++ b/kernel/trace/Kconfig
@@ -680,7 +680,6 @@ config FPROBE_EVENTS
select TRACING
select PROBE_EVENTS
select DYNAMIC_EVENTS
-   depends on DYNAMIC_FTRACE_WITH_REGS
default y
help
  This allows user to add tracing events on the function entry and
diff --git a/kernel/trace/trace_fprobe.c b/kernel/trace/trace_fprobe.c
index 273cdf3cf70c..86cd6a8c806a 100644
--- a/kernel/trace/trace_fprobe.c
+++ b/kernel/trace/trace_fprobe.c
@@ -133,7 +133,7 @@ static int
 process_fetch_insn(struct fetch_insn *code, void *rec, void *edata,
   void *dest, void *base)
 {
-   struct pt_regs *regs = rec;
+   struct ftrace_regs *fregs = rec;
unsigned long val;
int ret;
 
@@ -141,17 +141,17 @@ process_fetch_insn(struct fetch_insn *code, void *rec, 
void *edata,
/* 1st stage: get value from context */
switch (code->op) {
case FETCH_OP_STACK:
-   val = regs_get_kernel_stack_nth(regs, code->param);
+   val = ftrace_regs_get_kernel_stack_nth(fregs, code->param);
break;
case FETCH_OP_STACKP:
-   val = kernel_stack_pointer(regs);
+   val = ftrace_regs_get_stack_pointer(fregs);
break;
case FETCH_OP_RETVAL:
-   val = regs_return_value(regs);
+   val = ftrace_regs_get_return_value(fregs);
break;
 #ifdef CONFIG_HAVE_FUNCTION_ARG_ACCESS_API
case FETCH_OP_ARG:
-   val = regs_get_kernel_argument(regs, code->param);
+   val = ftrace_regs_get_argument(fregs, code->param);
break;
case FETCH_OP_EDATA:
val = *(unsigned long *)((unsigned long)edata + code->offset);
@@ -174,7 +174,7 @@ NOKPROBE_SYMBOL(process_fetch_insn)
 /* function entry handler */
 static nokprobe_inline void
 __fentry_trace_func(struct trace_fprobe *tf, unsigned long entry_ip,
-   struct pt_regs *regs,
+   struct ftrace_regs *fregs,
struct trace_event_file *trace_file)
 {
struct fentry_trace_entry_head *entry;
@@ -188,41 +188,71 @@ __fentry_trace_func(struct trace_fprobe *tf, unsigned 
long entry_ip,
if (trace_trigger_soft_disabled(trace_file))
return;
 
-   dsize = __get_data_size(&tf->tp, regs, NULL);
+   dsize = __get_data_size(&tf->tp, fregs, NULL);
 
entry = trace_event_buffer_reserve(&fbuffer, trace_file,
   sizeof(*entry) + tf->tp.size + 
dsize);
if (!entry)
return;
 
-   fbuffer.regs = regs;
+   fbuffer.regs = ftrace_get_regs(fregs);
entry = fbuffer.entry = ring_buffer_event_data(fbuffer.event);
entry->ip = entry_ip;
-   

[PATCH v10 27/36] tracing: Add ftrace_fill_perf_regs() for perf event

2024-05-07 Thread Masami Hiramatsu (Google)
From: Masami Hiramatsu (Google) 

Add ftrace_fill_perf_regs() which should be compatible with the
perf_fetch_caller_regs(). In other words, the pt_regs returned from the
ftrace_fill_perf_regs() must satisfy 'user_mode(regs) == false' and can be
used for stack tracing.
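
As a usage sketch (the surrounding callback name is hypothetical), a perf-side
consumer of ftrace_regs could obtain a stack-traceable pt_regs like this:

	static void my_perf_hook(struct ftrace_regs *fregs)
	{
		struct pt_regs regs_buf;
		struct pt_regs *regs;

		/* fill only the fields perf needs (ip/sp/fp, kernel mode) */
		regs = ftrace_fill_perf_regs(fregs, &regs_buf);

		/*
		 * 'regs' now satisfies user_mode(regs) == false and can be
		 * handed to the perf stack-sampling code, just like the
		 * output of perf_fetch_caller_regs().
		 */
	}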

Signed-off-by: Masami Hiramatsu (Google) 
---
  Changes from previous series: NOTHING, just forward ported.
---
 arch/arm64/include/asm/ftrace.h   |7 +++
 arch/powerpc/include/asm/ftrace.h |7 +++
 arch/s390/include/asm/ftrace.h|5 +
 arch/x86/include/asm/ftrace.h |7 +++
 include/linux/ftrace.h|   31 +++
 5 files changed, 57 insertions(+)

diff --git a/arch/arm64/include/asm/ftrace.h b/arch/arm64/include/asm/ftrace.h
index aab2b7a0f78c..95a8f349f871 100644
--- a/arch/arm64/include/asm/ftrace.h
+++ b/arch/arm64/include/asm/ftrace.h
@@ -154,6 +154,13 @@ ftrace_partial_regs(const struct ftrace_regs *fregs, 
struct pt_regs *regs)
return regs;
 }
 
+#define arch_ftrace_fill_perf_regs(fregs, _regs) do {  \
+   (_regs)->pc = (fregs)->pc;  \
+   (_regs)->regs[29] = (fregs)->fp;\
+   (_regs)->sp = (fregs)->sp;  \
+   (_regs)->pstate = PSR_MODE_EL1h;\
+   } while (0)
+
 int ftrace_regs_query_register_offset(const char *name);
 
 int ftrace_init_nop(struct module *mod, struct dyn_ftrace *rec);
diff --git a/arch/powerpc/include/asm/ftrace.h 
b/arch/powerpc/include/asm/ftrace.h
index cfec6c5a47d0..51245fd6b45b 100644
--- a/arch/powerpc/include/asm/ftrace.h
+++ b/arch/powerpc/include/asm/ftrace.h
@@ -44,6 +44,13 @@ static __always_inline struct pt_regs 
*arch_ftrace_get_regs(struct ftrace_regs *
return fregs->regs.msr ? &fregs->regs : NULL;
 }
 
+#define arch_ftrace_fill_perf_regs(fregs, _regs) do {  \
+   (_regs)->result = 0;\
+   (_regs)->nip = (fregs)->regs.nip;   \
+   (_regs)->gpr[1] = (fregs)->regs.gpr[1]; \
+   asm volatile("mfmsr %0" : "=r" ((_regs)->msr)); \
+   } while (0)
+
 static __always_inline void
 ftrace_regs_set_instruction_pointer(struct ftrace_regs *fregs,
unsigned long ip)
diff --git a/arch/s390/include/asm/ftrace.h b/arch/s390/include/asm/ftrace.h
index 9f8cc6d13bec..cb8d60a5fe1d 100644
--- a/arch/s390/include/asm/ftrace.h
+++ b/arch/s390/include/asm/ftrace.h
@@ -89,6 +89,11 @@ ftrace_regs_get_frame_pointer(struct ftrace_regs *fregs)
return sp[0];   /* return backchain */
 }
 
+#define arch_ftrace_fill_perf_regs(fregs, _regs)do {   \
+   (_regs)->psw.addr = (fregs)->regs.psw.addr; \
+   (_regs)->gprs[15] = (fregs)->regs.gprs[15]; \
+   } while (0)
+
 #ifdef CONFIG_DYNAMIC_FTRACE_WITH_DIRECT_CALLS
 /*
  * When an ftrace registered caller is tracing a function that is
diff --git a/arch/x86/include/asm/ftrace.h b/arch/x86/include/asm/ftrace.h
index 8d6db2b7d03a..7625887fc49b 100644
--- a/arch/x86/include/asm/ftrace.h
+++ b/arch/x86/include/asm/ftrace.h
@@ -54,6 +54,13 @@ arch_ftrace_get_regs(struct ftrace_regs *fregs)
return &fregs->regs;
 }
 
+#define arch_ftrace_fill_perf_regs(fregs, _regs) do {  \
+   (_regs)->ip = (fregs)->regs.ip; \
+   (_regs)->sp = (fregs)->regs.sp; \
+   (_regs)->cs = __KERNEL_CS;  \
+   (_regs)->flags = 0; \
+   } while (0)
+
 #define ftrace_regs_set_instruction_pointer(fregs, _ip)\
do { (fregs)->regs.ip = (_ip); } while (0)
 
diff --git a/include/linux/ftrace.h b/include/linux/ftrace.h
index 914f451b0d69..3871823c1429 100644
--- a/include/linux/ftrace.h
+++ b/include/linux/ftrace.h
@@ -194,6 +194,37 @@ ftrace_partial_regs(struct ftrace_regs *fregs, struct 
pt_regs *regs)
 
 #endif /* !CONFIG_HAVE_DYNAMIC_FTRACE_WITH_ARGS || 
CONFIG_HAVE_PT_REGS_TO_FTRACE_REGS_CAST */
 
+#ifdef CONFIG_HAVE_DYNAMIC_FTRACE_WITH_ARGS
+
+/*
+ * Please define arch dependent pt_regs which compatible to the
+ * perf_arch_fetch_caller_regs() but based on ftrace_regs.
+ * This requires
+ *   - user_mode(_regs) returns false (always kernel mode).
+ *   - able to use the _regs for stack trace.
+ */
+#ifndef arch_ftrace_fill_perf_regs
+/* As same as perf_arch_fetch_caller_regs(), do nothing by default */
+#define arch_ftrace_fill_perf_regs(fregs, _regs) do {} while (0)
+#endif
+
+static __always_inline struct pt_regs *
+ftrace_fill_perf_regs(struct ftrace_regs *fregs, struct pt_regs *regs)
+{
+   arch_ftrace_fill_perf_regs(fregs, regs);
+   return regs;
+}
+
+#else /* !CONFIG_HAVE_DYNAMIC_FTRACE_WITH_ARGS */
+
+static __always_inline struct pt_regs *
+ftrace_fill_perf_regs(struc

[PATCH v10 26/36] tracing: Add ftrace_partial_regs() for converting ftrace_regs to pt_regs

2024-05-07 Thread Masami Hiramatsu (Google)
From: Masami Hiramatsu (Google) 

Add ftrace_partial_regs() which converts the ftrace_regs to pt_regs.
This is for the eBPF which needs this to keep the same pt_regs interface
to access registers.
Thus when replacing the pt_regs with ftrace_regs in fprobes (which is
used by kprobe_multi eBPF event), this will be used.

If the architecture defines its own ftrace_regs, this copies partial
registers to pt_regs and returns it. If not, ftrace_regs is the same as
pt_regs and ftrace_partial_regs() will return ftrace_regs::regs.
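
A usage sketch (the handler name is hypothetical): a callback that must hand a
pt_regs to existing pt_regs based code, such as a BPF program, converts it on
demand:

	static void my_handler(struct ftrace_regs *fregs)
	{
		struct pt_regs regs_buf;	/* scratch, may stay unused */
		struct pt_regs *regs;

		regs = ftrace_partial_regs(fregs, &regs_buf);
		/*
		 * On arches where ftrace_regs embeds or matches pt_regs,
		 * 'regs' points into fregs; otherwise the saved registers
		 * were copied into regs_buf. Either way it can be passed
		 * to code that expects a (partial) pt_regs.
		 */
	}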

Signed-off-by: Masami Hiramatsu (Google) 
Acked-by: Florent Revest 
---
 Changes in v8:
  - Add the reason why this required in changelog.
 Changes from previous series: NOTHING, just forward ported.
---
 arch/arm64/include/asm/ftrace.h |   11 +++
 include/linux/ftrace.h  |   17 +
 2 files changed, 28 insertions(+)

diff --git a/arch/arm64/include/asm/ftrace.h b/arch/arm64/include/asm/ftrace.h
index ac82dc43a57d..aab2b7a0f78c 100644
--- a/arch/arm64/include/asm/ftrace.h
+++ b/arch/arm64/include/asm/ftrace.h
@@ -143,6 +143,17 @@ ftrace_regs_get_frame_pointer(const struct ftrace_regs 
*fregs)
return fregs->fp;
 }
 
+static __always_inline struct pt_regs *
+ftrace_partial_regs(const struct ftrace_regs *fregs, struct pt_regs *regs)
+{
+   memcpy(regs->regs, fregs->regs, sizeof(u64) * 9);
+   regs->sp = fregs->sp;
+   regs->pc = fregs->pc;
+   regs->regs[29] = fregs->fp;
+   regs->regs[30] = fregs->lr;
+   return regs;
+}
+
 int ftrace_regs_query_register_offset(const char *name);
 
 int ftrace_init_nop(struct module *mod, struct dyn_ftrace *rec);
diff --git a/include/linux/ftrace.h b/include/linux/ftrace.h
index ff2966a1b529..914f451b0d69 100644
--- a/include/linux/ftrace.h
+++ b/include/linux/ftrace.h
@@ -177,6 +177,23 @@ static __always_inline struct pt_regs 
*ftrace_get_regs(struct ftrace_regs *fregs
return arch_ftrace_get_regs(fregs);
 }
 
+#if !defined(CONFIG_HAVE_DYNAMIC_FTRACE_WITH_ARGS) || \
+   defined(CONFIG_HAVE_PT_REGS_TO_FTRACE_REGS_CAST)
+
+static __always_inline struct pt_regs *
+ftrace_partial_regs(struct ftrace_regs *fregs, struct pt_regs *regs)
+{
+   /*
+* If CONFIG_HAVE_PT_REGS_TO_FTRACE_REGS_CAST=y, ftrace_regs memory
+* layout is the same as pt_regs. So always returns that address.
+* Since arch_ftrace_get_regs() will check some members and may return
+* NULL, we can not use it.
+*/
+   return &fregs->regs;
+}
+
+#endif /* !CONFIG_HAVE_DYNAMIC_FTRACE_WITH_ARGS || 
CONFIG_HAVE_PT_REGS_TO_FTRACE_REGS_CAST */
+
 /*
  * When true, the ftrace_regs_{get,set}_*() functions may be used on fregs.
  * Note: this can be true even when ftrace_get_regs() cannot provide a pt_regs.




[PATCH v10 25/36] fprobe: Use ftrace_regs in fprobe exit handler

2024-05-07 Thread Masami Hiramatsu (Google)
From: Masami Hiramatsu (Google) 

Change the fprobe exit handler to use the ftrace_regs structure instead of
pt_regs. This also introduces HAVE_PT_REGS_TO_FTRACE_REGS_CAST, which means
the ftrace_regs memory layout is equal to that of pt_regs so that one can be
cast to the other. Fprobe introduces a new dependency on that.
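
In other words, a sketch of the guarantee (not code from this patch): when
CONFIG_HAVE_PT_REGS_TO_FTRACE_REGS_CAST=y, the following is legal because the
static_assert() added below enforces identical layouts:

	static unsigned long example(struct pt_regs *regs)
	{
		/* valid only under CONFIG_HAVE_PT_REGS_TO_FTRACE_REGS_CAST */
		struct ftrace_regs *fregs = (struct ftrace_regs *)regs;

		return ftrace_regs_get_return_value(fregs);
	}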

Signed-off-by: Masami Hiramatsu (Google) 
---
  Changes in v3:
   - Use ftrace_regs_get_return_value()
  Changes from previous series: NOTHING, just forward ported.
---
 arch/loongarch/Kconfig  |1 +
 arch/s390/Kconfig   |1 +
 arch/x86/Kconfig|1 +
 include/linux/fprobe.h  |2 +-
 include/linux/ftrace.h  |6 ++
 kernel/trace/Kconfig|8 
 kernel/trace/bpf_trace.c|6 +-
 kernel/trace/fprobe.c   |3 ++-
 kernel/trace/trace_fprobe.c |6 +-
 lib/test_fprobe.c   |6 +++---
 samples/fprobe/fprobe_example.c |2 +-
 11 files changed, 34 insertions(+), 8 deletions(-)

diff --git a/arch/loongarch/Kconfig b/arch/loongarch/Kconfig
index 6897eacac063..21dc39ae6bc2 100644
--- a/arch/loongarch/Kconfig
+++ b/arch/loongarch/Kconfig
@@ -114,6 +114,7 @@ config LOONGARCH
select HAVE_DMA_CONTIGUOUS
select HAVE_DYNAMIC_FTRACE
select HAVE_DYNAMIC_FTRACE_WITH_ARGS
+   select HAVE_PT_REGS_TO_FTRACE_REGS_CAST
select HAVE_DYNAMIC_FTRACE_WITH_DIRECT_CALLS
select HAVE_DYNAMIC_FTRACE_WITH_REGS
select HAVE_EBPF_JIT
diff --git a/arch/s390/Kconfig b/arch/s390/Kconfig
index 1ed25b72eb47..ce85b3165065 100644
--- a/arch/s390/Kconfig
+++ b/arch/s390/Kconfig
@@ -170,6 +170,7 @@ config S390
select HAVE_DMA_CONTIGUOUS
select HAVE_DYNAMIC_FTRACE
select HAVE_DYNAMIC_FTRACE_WITH_ARGS
+   select HAVE_PT_REGS_TO_FTRACE_REGS_CAST
select HAVE_DYNAMIC_FTRACE_WITH_DIRECT_CALLS
select HAVE_DYNAMIC_FTRACE_WITH_REGS
select HAVE_EBPF_JIT if HAVE_MARCH_Z196_FEATURES
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index ea198adb11d2..4da23dc0b07c 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -215,6 +215,7 @@ config X86
select HAVE_DYNAMIC_FTRACE
select HAVE_DYNAMIC_FTRACE_WITH_REGS
select HAVE_DYNAMIC_FTRACE_WITH_ARGSif X86_64
+   select HAVE_PT_REGS_TO_FTRACE_REGS_CAST if X86_64
select HAVE_DYNAMIC_FTRACE_WITH_DIRECT_CALLS
select HAVE_SAMPLE_FTRACE_DIRECTif X86_64
select HAVE_SAMPLE_FTRACE_DIRECT_MULTI  if X86_64
diff --git a/include/linux/fprobe.h b/include/linux/fprobe.h
index ca64ee5e45d2..ef609bcca0f9 100644
--- a/include/linux/fprobe.h
+++ b/include/linux/fprobe.h
@@ -14,7 +14,7 @@ typedef int (*fprobe_entry_cb)(struct fprobe *fp, unsigned 
long entry_ip,
   void *entry_data);
 
 typedef void (*fprobe_exit_cb)(struct fprobe *fp, unsigned long entry_ip,
-  unsigned long ret_ip, struct pt_regs *regs,
+  unsigned long ret_ip, struct ftrace_regs *regs,
   void *entry_data);
 
 /**
diff --git a/include/linux/ftrace.h b/include/linux/ftrace.h
index 1c185fefd932..ff2966a1b529 100644
--- a/include/linux/ftrace.h
+++ b/include/linux/ftrace.h
@@ -163,6 +163,12 @@ struct ftrace_regs {
 #define ftrace_regs_set_instruction_pointer(fregs, ip) do { } while (0)
 #endif /* CONFIG_HAVE_DYNAMIC_FTRACE_WITH_ARGS */
 
+#ifdef CONFIG_HAVE_PT_REGS_TO_FTRACE_REGS_CAST
+
+static_assert(sizeof(struct pt_regs) == sizeof(struct ftrace_regs));
+
+#endif /* CONFIG_HAVE_PT_REGS_TO_FTRACE_REGS_CAST */
+
 static __always_inline struct pt_regs *ftrace_get_regs(struct ftrace_regs 
*fregs)
 {
if (!fregs)
diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig
index 4a850250dab4..7df7b1fb305c 100644
--- a/kernel/trace/Kconfig
+++ b/kernel/trace/Kconfig
@@ -57,6 +57,13 @@ config HAVE_DYNAMIC_FTRACE_WITH_ARGS
 This allows for use of ftrace_regs_get_argument() and
 ftrace_regs_get_stack_pointer().
 
+config HAVE_PT_REGS_TO_FTRACE_REGS_CAST
+   bool
+   help
+If this is set, the memory layout of the ftrace_regs data structure
+is the same as the pt_regs. So the pt_regs is possible to be casted
+to ftrace_regs.
+
 config HAVE_DYNAMIC_FTRACE_NO_PATCHABLE
bool
help
@@ -288,6 +295,7 @@ config FPROBE
bool "Kernel Function Probe (fprobe)"
depends on FUNCTION_TRACER
depends on DYNAMIC_FTRACE_WITH_REGS || DYNAMIC_FTRACE_WITH_ARGS
+   depends on HAVE_PT_REGS_TO_FTRACE_REGS_CAST || 
!HAVE_DYNAMIC_FTRACE_WITH_ARGS
depends on HAVE_RETHOOK
select RETHOOK
default n
diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index 7837cf4e39d9..e51a6ef87167 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -2838,10 +2838,14 @@ kprobe_multi_link_handler(struct fprobe *fp, unsigned 
long fentry_ip,
 
 s

[PATCH v10 24/36] fprobe: Use ftrace_regs in fprobe entry handler

2024-05-07 Thread Masami Hiramatsu (Google)
From: Masami Hiramatsu (Google) 

This allows fprobes to be available with CONFIG_DYNAMIC_FTRACE_WITH_ARGS
instead of CONFIG_DYNAMIC_FTRACE_WITH_REGS, so that we can enable fprobe
on arm64.

Signed-off-by: Masami Hiramatsu (Google) 
Acked-by: Florent Revest 
---
 Changes in v6:
  - Keep using SAVE_REGS flag to avoid breaking bpf kprobe-multi test.
---
 include/linux/fprobe.h  |2 +-
 kernel/trace/Kconfig|3 ++-
 kernel/trace/bpf_trace.c|   10 +++---
 kernel/trace/fprobe.c   |3 ++-
 kernel/trace/trace_fprobe.c |6 +-
 lib/test_fprobe.c   |4 ++--
 samples/fprobe/fprobe_example.c |2 +-
 7 files changed, 20 insertions(+), 10 deletions(-)

diff --git a/include/linux/fprobe.h b/include/linux/fprobe.h
index f39869588117..ca64ee5e45d2 100644
--- a/include/linux/fprobe.h
+++ b/include/linux/fprobe.h
@@ -10,7 +10,7 @@
 struct fprobe;
 
 typedef int (*fprobe_entry_cb)(struct fprobe *fp, unsigned long entry_ip,
-  unsigned long ret_ip, struct pt_regs *regs,
+  unsigned long ret_ip, struct ftrace_regs *regs,
   void *entry_data);
 
 typedef void (*fprobe_exit_cb)(struct fprobe *fp, unsigned long entry_ip,
diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig
index f909792713ff..4a850250dab4 100644
--- a/kernel/trace/Kconfig
+++ b/kernel/trace/Kconfig
@@ -287,7 +287,7 @@ config DYNAMIC_FTRACE_WITH_ARGS
 config FPROBE
bool "Kernel Function Probe (fprobe)"
depends on FUNCTION_TRACER
-   depends on DYNAMIC_FTRACE_WITH_REGS
+   depends on DYNAMIC_FTRACE_WITH_REGS || DYNAMIC_FTRACE_WITH_ARGS
depends on HAVE_RETHOOK
select RETHOOK
default n
@@ -672,6 +672,7 @@ config FPROBE_EVENTS
select TRACING
select PROBE_EVENTS
select DYNAMIC_EVENTS
+   depends on DYNAMIC_FTRACE_WITH_REGS
default y
help
  This allows user to add tracing events on the function entry and
diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index 9dc605f08a23..7837cf4e39d9 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -2577,7 +2577,7 @@ static int __init bpf_event_init(void)
 fs_initcall(bpf_event_init);
 #endif /* CONFIG_MODULES */
 
-#ifdef CONFIG_FPROBE
+#if defined(CONFIG_FPROBE) && defined(CONFIG_DYNAMIC_FTRACE_WITH_REGS)
 struct bpf_kprobe_multi_link {
struct bpf_link link;
struct fprobe fp;
@@ -2822,10 +2822,14 @@ kprobe_multi_link_prog_run(struct bpf_kprobe_multi_link 
*link,
 
 static int
 kprobe_multi_link_handler(struct fprobe *fp, unsigned long fentry_ip,
- unsigned long ret_ip, struct pt_regs *regs,
+ unsigned long ret_ip, struct ftrace_regs *fregs,
  void *data)
 {
struct bpf_kprobe_multi_link *link;
+   struct pt_regs *regs = ftrace_get_regs(fregs);
+
+   if (!regs)
+   return 0;
 
link = container_of(fp, struct bpf_kprobe_multi_link, fp);
kprobe_multi_link_prog_run(link, get_entry_ip(fentry_ip), regs);
@@ -3099,7 +3103,7 @@ int bpf_kprobe_multi_link_attach(const union bpf_attr 
*attr, struct bpf_prog *pr
kvfree(cookies);
return err;
 }
-#else /* !CONFIG_FPROBE */
+#else /* !CONFIG_FPROBE || !CONFIG_DYNAMIC_FTRACE_WITH_REGS */
 int bpf_kprobe_multi_link_attach(const union bpf_attr *attr, struct bpf_prog 
*prog)
 {
return -EOPNOTSUPP;
diff --git a/kernel/trace/fprobe.c b/kernel/trace/fprobe.c
index 9ff018245840..3d3789283873 100644
--- a/kernel/trace/fprobe.c
+++ b/kernel/trace/fprobe.c
@@ -46,7 +46,7 @@ static inline void __fprobe_handler(unsigned long ip, 
unsigned long parent_ip,
}
 
if (fp->entry_handler)
-   ret = fp->entry_handler(fp, ip, parent_ip, 
ftrace_get_regs(fregs), entry_data);
+   ret = fp->entry_handler(fp, ip, parent_ip, fregs, entry_data);
 
/* If entry_handler returns !0, nmissed is not counted. */
if (rh) {
@@ -182,6 +182,7 @@ static void fprobe_init(struct fprobe *fp)
fp->ops.func = fprobe_kprobe_handler;
else
fp->ops.func = fprobe_handler;
+
fp->ops.flags |= FTRACE_OPS_FL_SAVE_REGS;
 }
 
diff --git a/kernel/trace/trace_fprobe.c b/kernel/trace/trace_fprobe.c
index 62e6a8f4aae9..b2c20d4fdfd7 100644
--- a/kernel/trace/trace_fprobe.c
+++ b/kernel/trace/trace_fprobe.c
@@ -338,12 +338,16 @@ NOKPROBE_SYMBOL(fexit_perf_func);
 #endif /* CONFIG_PERF_EVENTS */
 
 static int fentry_dispatcher(struct fprobe *fp, unsigned long entry_ip,
-unsigned long ret_ip, struct pt_regs *regs,
+unsigned long ret_ip, struct ftrace_regs *fregs,
 void *entry_data)
 {
struct trace_fprobe *tf = container_of(fp, struct trace_fprobe, fp);
+   struct pt_regs *regs = ftr

[PATCH v10 23/36] function_graph: Pass ftrace_regs to retfunc

2024-05-07 Thread Masami Hiramatsu (Google)
From: Masami Hiramatsu (Google) 

Pass ftrace_regs to the fgraph_ops::retfunc(). If ftrace_regs is not
available, it passes NULL instead. The user callback function can access
some registers (including the return address) via this ftrace_regs.
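
As a rough sketch (not part of the patch) of callbacks written against this
signature, with the pr_info() reporting being purely illustrative, note the
NULL check since fregs may not be available on every architecture:

#include <linux/ftrace.h>
#include <linux/printk.h>

static int my_entry(struct ftrace_graph_ent *trace, struct fgraph_ops *gops,
		    struct ftrace_regs *fregs)
{
	return 1;	/* non-zero: have the retfunc called for this function */
}

static void my_return(struct ftrace_graph_ret *trace, struct fgraph_ops *gops,
		      struct ftrace_regs *fregs)
{
	/* fregs can be NULL if the architecture cannot provide it. */
	if (fregs)
		pr_info("%ps returned 0x%lx\n", (void *)trace->func,
			ftrace_regs_get_return_value(fregs));
}

static struct fgraph_ops my_gops = {
	.entryfunc = my_entry,
	.retfunc   = my_return,
};
/* register_ftrace_graph(&my_gops) / unregister_ftrace_graph(&my_gops) as usual */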

Signed-off-by: Masami Hiramatsu (Google) 
---
 Changes in v8:
  - Pass ftrace_regs to retfunc, instead of adding retregfunc.
 Changes in v6:
  - update to use ftrace_regs_get_return_value() because of reordering
patches.
 Changes in v3:
  - Update for new multiple fgraph.
  - Save the return address to instruction pointer in ftrace_regs.
---
 include/linux/ftrace.h   |3 ++-
 kernel/trace/fgraph.c|   14 ++
 kernel/trace/ftrace.c|3 ++-
 kernel/trace/trace.h |3 ++-
 kernel/trace/trace_functions_graph.c |7 ---
 kernel/trace/trace_irqsoff.c |3 ++-
 kernel/trace/trace_sched_wakeup.c|3 ++-
 kernel/trace/trace_selftest.c|3 ++-
 8 files changed, 26 insertions(+), 13 deletions(-)

diff --git a/include/linux/ftrace.h b/include/linux/ftrace.h
index 0af83b115886..1c185fefd932 100644
--- a/include/linux/ftrace.h
+++ b/include/linux/ftrace.h
@@ -1067,7 +1067,8 @@ struct fgraph_ops;
 
 /* Type of the callback handlers for tracing function graph*/
 typedef void (*trace_func_graph_ret_t)(struct ftrace_graph_ret *,
-  struct fgraph_ops *); /* return */
+  struct fgraph_ops *,
+  struct ftrace_regs *); /* return */
 typedef int (*trace_func_graph_ent_t)(struct ftrace_graph_ent *,
  struct fgraph_ops *,
  struct ftrace_regs *); /* entry */
diff --git a/kernel/trace/fgraph.c b/kernel/trace/fgraph.c
index d520b2b78918..40f47fcbc6c3 100644
--- a/kernel/trace/fgraph.c
+++ b/kernel/trace/fgraph.c
@@ -264,7 +264,8 @@ static int entry_run(struct ftrace_graph_ent *trace, struct 
fgraph_ops *ops,
 }
 
 /* ftrace_graph_return set to this to tell some archs to run function graph */
-static void return_run(struct ftrace_graph_ret *trace, struct fgraph_ops *ops)
+static void return_run(struct ftrace_graph_ret *trace, struct fgraph_ops *ops,
+  struct ftrace_regs *fregs)
 {
 }
 
@@ -455,7 +456,8 @@ int ftrace_graph_entry_stub(struct ftrace_graph_ent *trace,
 }
 
 static void ftrace_graph_ret_stub(struct ftrace_graph_ret *trace,
- struct fgraph_ops *gops)
+ struct fgraph_ops *gops,
+ struct ftrace_regs *fregs)
 {
 }
 
@@ -799,6 +801,9 @@ __ftrace_return_to_handler(struct ftrace_regs *fregs, 
unsigned long frame_pointe
}
 
trace.rettime = trace_clock_local();
+   if (fregs)
+   ftrace_regs_set_instruction_pointer(fregs, ret);
+
 #ifdef CONFIG_FUNCTION_GRAPH_RETVAL
trace.retval = ftrace_regs_get_return_value(fregs);
 #endif
@@ -812,7 +817,7 @@ __ftrace_return_to_handler(struct ftrace_regs *fregs, 
unsigned long frame_pointe
		if (gops == &fgraph_stub)
continue;
 
-		gops->retfunc(&trace, gops);
+		gops->retfunc(&trace, gops, fregs);
}
 
/*
@@ -976,7 +981,8 @@ void ftrace_graph_sleep_time_control(bool enable)
  * Simply points to ftrace_stub, but with the proper protocol.
  * Defined by the linker script in linux/vmlinux.lds.h
  */
-void ftrace_stub_graph(struct ftrace_graph_ret *trace, struct fgraph_ops 
*gops);
+void ftrace_stub_graph(struct ftrace_graph_ret *trace, struct fgraph_ops *gops,
+  struct ftrace_regs *fregs);
 
 /* The callbacks that hook a function */
 trace_func_graph_ret_t ftrace_graph_return = ftrace_stub_graph;
diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c
index 5377a0b22ec9..e869258efc52 100644
--- a/kernel/trace/ftrace.c
+++ b/kernel/trace/ftrace.c
@@ -835,7 +835,8 @@ static int profile_graph_entry(struct ftrace_graph_ent 
*trace,
 }
 
 static void profile_graph_return(struct ftrace_graph_ret *trace,
-struct fgraph_ops *gops)
+struct fgraph_ops *gops,
+struct ftrace_regs *fregs)
 {
struct ftrace_ret_stack *ret_stack;
struct ftrace_profile_stat *stat;
diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h
index 8221b6febb51..81cb2a90cbda 100644
--- a/kernel/trace/trace.h
+++ b/kernel/trace/trace.h
@@ -681,7 +681,8 @@ void trace_latency_header(struct seq_file *m);
 void trace_default_header(struct seq_file *m);
 void print_trace_header(struct seq_file *m, struct trace_iterator *iter);
 
-void trace_graph_return(struct ftrace_graph_ret *trace, struct fgraph_ops 
*gops);
+void trace_graph_return(struct ftrace_graph_ret *trace, struct fgraph_ops 
*gops,
+   struct ftrace_regs *fregs);
 int trace_graph_entry(

[PATCH v10 22/36] function_graph: Replace fgraph_ret_regs with ftrace_regs

2024-05-07 Thread Masami Hiramatsu (Google)
From: Masami Hiramatsu (Google) 

Use ftrace_regs instead of fgraph_ret_regs for tracing the return value
in the function_graph tracer, to simplify the callback interface.

The CONFIG_HAVE_FUNCTION_GRAPH_RETVAL is also replaced by
CONFIG_HAVE_FUNCTION_GRAPH_FREGS.
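
Roughly speaking (a sketch, not an authoritative requirement list), an
architecture selecting HAVE_FUNCTION_GRAPH_FREGS fills a struct ftrace_regs
in its return trampoline, and the generic return path then only needs the
existing accessors instead of the ad-hoc fgraph_ret_regs helpers; the
function name below is made up:

#include <linux/ftrace.h>
#include <linux/printk.h>

/* What the generic return side can rely on once ftrace_regs is provided. */
static void example_return_side(struct ftrace_regs *fregs)
{
	unsigned long retval = ftrace_regs_get_return_value(fregs);
	unsigned long frame  = ftrace_regs_get_frame_pointer(fregs);

	pr_debug("retval=0x%lx fp=0x%lx\n", retval, frame);
}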

Signed-off-by: Masami Hiramatsu (Google) 
---
 Changes in v8:
  - Newly added.
---
 arch/arm64/Kconfig  |2 +-
 arch/arm64/include/asm/ftrace.h |   23 ++-
 arch/arm64/kernel/asm-offsets.c |   12 
 arch/arm64/kernel/entry-ftrace.S|   32 ++--
 arch/loongarch/Kconfig  |2 +-
 arch/loongarch/include/asm/ftrace.h |   24 ++--
 arch/loongarch/kernel/asm-offsets.c |   12 
 arch/loongarch/kernel/mcount.S  |   17 ++---
 arch/loongarch/kernel/mcount_dyn.S  |   14 +++---
 arch/riscv/Kconfig  |2 +-
 arch/riscv/include/asm/ftrace.h |   21 -
 arch/riscv/kernel/mcount.S  |   24 +---
 arch/s390/Kconfig   |2 +-
 arch/s390/include/asm/ftrace.h  |   26 +-
 arch/s390/kernel/asm-offsets.c  |6 --
 arch/s390/kernel/mcount.S   |9 +
 arch/x86/Kconfig|2 +-
 arch/x86/include/asm/ftrace.h   |   22 ++
 arch/x86/kernel/ftrace_32.S |   15 +--
 arch/x86/kernel/ftrace_64.S |   17 +
 include/linux/ftrace.h  |   14 +++---
 kernel/trace/Kconfig|4 ++--
 kernel/trace/fgraph.c   |   21 +
 23 files changed, 117 insertions(+), 206 deletions(-)

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 7b11c98b3e84..8d5047bc13bc 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -209,7 +209,7 @@ config ARM64
select HAVE_FTRACE_MCOUNT_RECORD
select HAVE_FUNCTION_TRACER
select HAVE_FUNCTION_ERROR_INJECTION
-   select HAVE_FUNCTION_GRAPH_RETVAL if HAVE_FUNCTION_GRAPH_TRACER
+   select HAVE_FUNCTION_GRAPH_FREGS
select HAVE_FUNCTION_GRAPH_TRACER
select HAVE_GCC_PLUGINS
select HAVE_HARDLOCKUP_DETECTOR_PERF if PERF_EVENTS && \
diff --git a/arch/arm64/include/asm/ftrace.h b/arch/arm64/include/asm/ftrace.h
index ab158196480c..ac82dc43a57d 100644
--- a/arch/arm64/include/asm/ftrace.h
+++ b/arch/arm64/include/asm/ftrace.h
@@ -137,6 +137,12 @@ ftrace_override_function_with_return(struct ftrace_regs 
*fregs)
fregs->pc = fregs->lr;
 }
 
+static __always_inline unsigned long
+ftrace_regs_get_frame_pointer(const struct ftrace_regs *fregs)
+{
+   return fregs->fp;
+}
+
 int ftrace_regs_query_register_offset(const char *name);
 
 int ftrace_init_nop(struct module *mod, struct dyn_ftrace *rec);
@@ -194,23 +200,6 @@ static inline bool arch_syscall_match_sym_name(const char 
*sym,
 
 #ifndef __ASSEMBLY__
 #ifdef CONFIG_FUNCTION_GRAPH_TRACER
-struct fgraph_ret_regs {
-   /* x0 - x7 */
-   unsigned long regs[8];
-
-   unsigned long fp;
-   unsigned long __unused;
-};
-
-static inline unsigned long fgraph_ret_regs_return_value(struct 
fgraph_ret_regs *ret_regs)
-{
-   return ret_regs->regs[0];
-}
-
-static inline unsigned long fgraph_ret_regs_frame_pointer(struct 
fgraph_ret_regs *ret_regs)
-{
-   return ret_regs->fp;
-}
 
 void prepare_ftrace_return(unsigned long self_addr, unsigned long *parent,
   unsigned long frame_pointer);
diff --git a/arch/arm64/kernel/asm-offsets.c b/arch/arm64/kernel/asm-offsets.c
index 81496083c041..81bb6704ff5a 100644
--- a/arch/arm64/kernel/asm-offsets.c
+++ b/arch/arm64/kernel/asm-offsets.c
@@ -200,18 +200,6 @@ int main(void)
   DEFINE(FTRACE_OPS_FUNC,  offsetof(struct ftrace_ops, func));
 #endif
   BLANK();
-#ifdef CONFIG_FUNCTION_GRAPH_TRACER
-  DEFINE(FGRET_REGS_X0,offsetof(struct 
fgraph_ret_regs, regs[0]));
-  DEFINE(FGRET_REGS_X1,offsetof(struct 
fgraph_ret_regs, regs[1]));
-  DEFINE(FGRET_REGS_X2,offsetof(struct 
fgraph_ret_regs, regs[2]));
-  DEFINE(FGRET_REGS_X3,offsetof(struct 
fgraph_ret_regs, regs[3]));
-  DEFINE(FGRET_REGS_X4,offsetof(struct 
fgraph_ret_regs, regs[4]));
-  DEFINE(FGRET_REGS_X5,offsetof(struct 
fgraph_ret_regs, regs[5]));
-  DEFINE(FGRET_REGS_X6,offsetof(struct 
fgraph_ret_regs, regs[6]));
-  DEFINE(FGRET_REGS_X7,offsetof(struct 
fgraph_ret_regs, regs[7]));
-  DEFINE(FGRET_REGS_FP,offsetof(struct 
fgraph_ret_regs, fp));
-  DEFINE(FGRET_REGS_SIZE,  sizeof(struct fgraph_ret_regs));
-#endif
 #ifdef CONFIG_DYNAMIC_FTRACE_WITH_DIRECT_CALLS
   DEFINE(FTRACE_OPS_DIRECT_CALL,   offsetof(struct f

[PATCH v10 21/36] function_graph: Pass ftrace_regs to entryfunc

2024-05-07 Thread Masami Hiramatsu (Google)
From: Masami Hiramatsu (Google) 

Pass ftrace_regs to the fgraph_ops::entryfunc(). If ftrace_regs is not
available, it passes NULL instead. The user callback function can access
some registers (including the return address) via this ftrace_regs.
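
For illustration (not part of the patch), an entryfunc written against this
signature might look like the sketch below; reading the first argument with
ftrace_regs_get_argument() assumes the architecture provides argument
access, and fregs may be NULL:

#include <linux/ftrace.h>
#include <linux/printk.h>

static int my_entry(struct ftrace_graph_ent *trace, struct fgraph_ops *gops,
		    struct ftrace_regs *fregs)
{
	/* fregs is NULL when the architecture cannot supply it. */
	if (fregs)
		pr_info("%ps called, arg0=0x%lx\n", (void *)trace->func,
			ftrace_regs_get_argument(fregs, 0));

	return 1;	/* non-zero: also call the retfunc on return */
}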

Signed-off-by: Masami Hiramatsu (Google) 
---
 Changes in v8:
  - Just pass ftrace_regs to the handler instead of adding a new
entryregfunc.
  - Update riscv ftrace_graph_func().
 Changes in v3:
  - Update for new multiple fgraph.
---
 arch/arm64/kernel/ftrace.c   |2 +
 arch/loongarch/kernel/ftrace_dyn.c   |2 +
 arch/powerpc/kernel/trace/ftrace.c   |2 +
 arch/powerpc/kernel/trace/ftrace_64_pg.c |   10 ---
 arch/riscv/kernel/ftrace.c   |2 +
 arch/x86/kernel/ftrace.c |   42 --
 include/linux/ftrace.h   |   20 +++---
 kernel/trace/fgraph.c|   21 +--
 kernel/trace/ftrace.c|3 +-
 kernel/trace/trace.h |3 +-
 kernel/trace/trace_functions_graph.c |3 +-
 kernel/trace/trace_irqsoff.c |3 +-
 kernel/trace/trace_sched_wakeup.c|3 +-
 kernel/trace/trace_selftest.c|8 --
 14 files changed, 76 insertions(+), 48 deletions(-)

diff --git a/arch/arm64/kernel/ftrace.c b/arch/arm64/kernel/ftrace.c
index b96740829798..779b975f03f5 100644
--- a/arch/arm64/kernel/ftrace.c
+++ b/arch/arm64/kernel/ftrace.c
@@ -497,7 +497,7 @@ void ftrace_graph_func(unsigned long ip, unsigned long 
parent_ip,
return;
 
if (!function_graph_enter_ops(*parent, ip, frame_pointer,
- (void *)frame_pointer, gops))
+ (void *)frame_pointer, fregs, gops))
		*parent = (unsigned long)&return_to_handler;
 
ftrace_test_recursion_unlock(bit);
diff --git a/arch/loongarch/kernel/ftrace_dyn.c 
b/arch/loongarch/kernel/ftrace_dyn.c
index 920eb673b32b..155bdaba2012 100644
--- a/arch/loongarch/kernel/ftrace_dyn.c
+++ b/arch/loongarch/kernel/ftrace_dyn.c
@@ -254,7 +254,7 @@ void ftrace_graph_func(unsigned long ip, unsigned long 
parent_ip,
 
old = *parent;
 
-   if (!function_graph_enter_ops(old, ip, 0, parent, gops))
+   if (!function_graph_enter_ops(old, ip, 0, parent, fregs, gops))
*parent = return_hooker;
 }
 #else
diff --git a/arch/powerpc/kernel/trace/ftrace.c 
b/arch/powerpc/kernel/trace/ftrace.c
index 4a9294821c0d..501adb80fc8d 100644
--- a/arch/powerpc/kernel/trace/ftrace.c
+++ b/arch/powerpc/kernel/trace/ftrace.c
@@ -435,7 +435,7 @@ void ftrace_graph_func(unsigned long ip, unsigned long 
parent_ip,
if (bit < 0)
goto out;
 
-   if (!function_graph_enter_ops(parent_ip, ip, 0, (unsigned long *)sp, 
gops))
+   if (!function_graph_enter_ops(parent_ip, ip, 0, (unsigned long *)sp, 
fregs, gops))
parent_ip = ppc_function_entry(return_to_handler);
 
ftrace_test_recursion_unlock(bit);
diff --git a/arch/powerpc/kernel/trace/ftrace_64_pg.c 
b/arch/powerpc/kernel/trace/ftrace_64_pg.c
index 12fab1803bcf..4ae9eeb1c8f1 100644
--- a/arch/powerpc/kernel/trace/ftrace_64_pg.c
+++ b/arch/powerpc/kernel/trace/ftrace_64_pg.c
@@ -800,7 +800,8 @@ int ftrace_disable_ftrace_graph_caller(void)
  * in current thread info. Return the address we want to divert to.
  */
 static unsigned long
-__prepare_ftrace_return(unsigned long parent, unsigned long ip, unsigned long 
sp)
+__prepare_ftrace_return(unsigned long parent, unsigned long ip, unsigned long 
sp,
+   struct ftrace_regs *fregs)
 {
unsigned long return_hooker;
int bit;
@@ -817,7 +818,7 @@ __prepare_ftrace_return(unsigned long parent, unsigned long 
ip, unsigned long sp
 
return_hooker = ppc_function_entry(return_to_handler);
 
-   if (!function_graph_enter(parent, ip, 0, (unsigned long *)sp))
+   if (!function_graph_enter_regs(parent, ip, 0, (unsigned long *)sp, 
fregs))
parent = return_hooker;
 
ftrace_test_recursion_unlock(bit);
@@ -829,13 +830,14 @@ __prepare_ftrace_return(unsigned long parent, unsigned 
long ip, unsigned long sp
 void ftrace_graph_func(unsigned long ip, unsigned long parent_ip,
   struct ftrace_ops *op, struct ftrace_regs *fregs)
 {
-   fregs->regs.link = __prepare_ftrace_return(parent_ip, ip, 
fregs->regs.gpr[1]);
+   fregs->regs.link = __prepare_ftrace_return(parent_ip, ip,
+  fregs->regs.gpr[1], fregs);
 }
 #else
 unsigned long prepare_ftrace_return(unsigned long parent, unsigned long ip,
unsigned long sp)
 {
-   return __prepare_ftrace_return(parent, ip, sp);
+   return __prepare_ftrace_return(parent, ip, sp, NULL);
 }
 #endif
 #endif /* CONFIG_FUNCTION_GRAPH_TRACER */
diff --git a/arch/riscv/kernel/ftrace.c b/ar

[PATCH v10 20/36] ftrace: Add multiple fgraph storage selftest

2024-05-07 Thread Masami Hiramatsu (Google)
From: Masami Hiramatsu (Google) 

Add a selftest for multiple function graph tracers with storage on the same
function. In this case, the shadow stack entry will be shared among those
fgraph_ops with different data storage, so this ensures that fgraph does
not mix up the stored data.
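
As a rough sketch of the situation being exercised (an illustration, not the
selftest code itself): two fgraph_ops attached to the same function each
reserve their own words in the shared shadow stack frame, and each must get
only its own data back on return:

#include <linux/ftrace.h>

/* Instance A reserves one word, instance B reserves four, on the same function. */
static int entry_a(struct ftrace_graph_ent *trace, struct fgraph_ops *gops)
{
	long *p = fgraph_reserve_data(gops->idx, sizeof(long));

	if (p)
		*p = 0xaaaa;
	return p != NULL;
}

static int entry_b(struct ftrace_graph_ent *trace, struct fgraph_ops *gops)
{
	long *p = fgraph_reserve_data(gops->idx, 4 * sizeof(long));

	if (p)
		p[0] = 0xbbbb;
	return p != NULL;
}

/*
 * On return, each instance calls fgraph_retrieve_data(gops->idx, &size) from
 * its retfunc and must only ever see the words its own entryfunc reserved,
 * even though both reservations live in the same shadow stack frame.
 */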

Signed-off-by: Masami Hiramatsu (Google) 
Suggested-by: Steven Rostedt (Google) 
---
 Changes in v8:
  - Newly added.
---
 kernel/trace/trace_selftest.c |  171 ++---
 1 file changed, 126 insertions(+), 45 deletions(-)

diff --git a/kernel/trace/trace_selftest.c b/kernel/trace/trace_selftest.c
index fcdc744c245e..369efc569238 100644
--- a/kernel/trace/trace_selftest.c
+++ b/kernel/trace/trace_selftest.c
@@ -762,28 +762,32 @@ trace_selftest_startup_function(struct tracer *trace, 
struct trace_array *tr)
 #define SHORT_NUMBER 12345
 #define WORD_NUMBER 1234567890
 #define LONG_NUMBER 1234567890123456789LL
-
-static int fgraph_store_size __initdata;
-static const char *fgraph_store_type_name __initdata;
-static char *fgraph_error_str __initdata;
-static char fgraph_error_str_buf[128] __initdata;
+#define ERRSTR_BUFLEN 128
+
+struct fgraph_fixture {
+   struct fgraph_ops gops;
+   int store_size;
+   const char *store_type_name;
+   char error_str_buf[ERRSTR_BUFLEN];
+   char *error_str;
+};
 
 static __init int store_entry(struct ftrace_graph_ent *trace,
  struct fgraph_ops *gops)
 {
-   const char *type = fgraph_store_type_name;
-   int size = fgraph_store_size;
+   struct fgraph_fixture *fixture = container_of(gops, struct 
fgraph_fixture, gops);
+   const char *type = fixture->store_type_name;
+   int size = fixture->store_size;
void *p;
 
p = fgraph_reserve_data(gops->idx, size);
if (!p) {
-   snprintf(fgraph_error_str_buf, sizeof(fgraph_error_str_buf),
+   snprintf(fixture->error_str_buf, ERRSTR_BUFLEN,
 "Failed to reserve %s\n", type);
-   fgraph_error_str = fgraph_error_str_buf;
return 0;
}
 
-   switch (fgraph_store_size) {
+   switch (size) {
case 1:
*(char *)p = BYTE_NUMBER;
break;
@@ -804,7 +808,8 @@ static __init int store_entry(struct ftrace_graph_ent 
*trace,
 static __init void store_return(struct ftrace_graph_ret *trace,
struct fgraph_ops *gops)
 {
-   const char *type = fgraph_store_type_name;
+   struct fgraph_fixture *fixture = container_of(gops, struct 
fgraph_fixture, gops);
+   const char *type = fixture->store_type_name;
long long expect = 0;
long long found = -1;
int size;
@@ -812,20 +817,18 @@ static __init void store_return(struct ftrace_graph_ret 
*trace,
 
	p = fgraph_retrieve_data(gops->idx, &size);
if (!p) {
-   snprintf(fgraph_error_str_buf, sizeof(fgraph_error_str_buf),
+   snprintf(fixture->error_str_buf, ERRSTR_BUFLEN,
 "Failed to retrieve %s\n", type);
-   fgraph_error_str = fgraph_error_str_buf;
return;
}
-   if (fgraph_store_size > size) {
-   snprintf(fgraph_error_str_buf, sizeof(fgraph_error_str_buf),
+   if (fixture->store_size > size) {
+   snprintf(fixture->error_str_buf, ERRSTR_BUFLEN,
 "Retrieved size %d is smaller than expected %d\n",
-size, (int)fgraph_store_size);
-   fgraph_error_str = fgraph_error_str_buf;
+size, (int)fixture->store_size);
return;
}
 
-   switch (fgraph_store_size) {
+   switch (fixture->store_size) {
case 1:
expect = BYTE_NUMBER;
found = *(char *)p;
@@ -845,45 +848,44 @@ static __init void store_return(struct ftrace_graph_ret 
*trace,
}
 
if (found != expect) {
-   snprintf(fgraph_error_str_buf, sizeof(fgraph_error_str_buf),
+   snprintf(fixture->error_str_buf, ERRSTR_BUFLEN,
 "%s returned not %lld but %lld\n", type, expect, 
found);
-   fgraph_error_str = fgraph_error_str_buf;
return;
}
-   fgraph_error_str = NULL;
+   fixture->error_str = NULL;
 }
 
-static struct fgraph_ops store_bytes __initdata = {
-   .entryfunc  = store_entry,
-   .retfunc= store_return,
-};
-
-static int __init test_graph_storage_type(const char *name, int size)
+static int __init init_fgraph_fixture(struct fgraph_fixture *fixture)
 {
char *func_name;
int len;
-   int ret;
 
-   fgraph_store_type_name = name;
-   fgraph_store_size = size;
+   snprintf(fixture->error_str_buf, ERRSTR_BUFLEN,
+"Failed to execute 

[PATCH v10 19/36] function_graph: Add selftest for passing local variables

2024-05-07 Thread Masami Hiramatsu (Google)
From: Steven Rostedt (VMware) 

Add a boot-up selftest that passes variables from a function entry to a
function exit, and makes sure that they do get passed around.

Signed-off-by: Steven Rostedt (VMware) 
Signed-off-by: Masami Hiramatsu (Google) 
---
 Changes in v2:
  - Add reserved size test.
  - Use pr_*() instead of printk(KERN_*).
---
 kernel/trace/trace_selftest.c |  169 +
 1 file changed, 169 insertions(+)

diff --git a/kernel/trace/trace_selftest.c b/kernel/trace/trace_selftest.c
index f8f55fd79e53..fcdc744c245e 100644
--- a/kernel/trace/trace_selftest.c
+++ b/kernel/trace/trace_selftest.c
@@ -756,6 +756,173 @@ trace_selftest_startup_function(struct tracer *trace, 
struct trace_array *tr)
 
 #ifdef CONFIG_FUNCTION_GRAPH_TRACER
 
+#ifdef CONFIG_DYNAMIC_FTRACE
+
+#define BYTE_NUMBER 123
+#define SHORT_NUMBER 12345
+#define WORD_NUMBER 1234567890
+#define LONG_NUMBER 1234567890123456789LL
+
+static int fgraph_store_size __initdata;
+static const char *fgraph_store_type_name __initdata;
+static char *fgraph_error_str __initdata;
+static char fgraph_error_str_buf[128] __initdata;
+
+static __init int store_entry(struct ftrace_graph_ent *trace,
+ struct fgraph_ops *gops)
+{
+   const char *type = fgraph_store_type_name;
+   int size = fgraph_store_size;
+   void *p;
+
+   p = fgraph_reserve_data(gops->idx, size);
+   if (!p) {
+   snprintf(fgraph_error_str_buf, sizeof(fgraph_error_str_buf),
+"Failed to reserve %s\n", type);
+   fgraph_error_str = fgraph_error_str_buf;
+   return 0;
+   }
+
+   switch (fgraph_store_size) {
+   case 1:
+   *(char *)p = BYTE_NUMBER;
+   break;
+   case 2:
+   *(short *)p = SHORT_NUMBER;
+   break;
+   case 4:
+   *(int *)p = WORD_NUMBER;
+   break;
+   case 8:
+   *(long long *)p = LONG_NUMBER;
+   break;
+   }
+
+   return 1;
+}
+
+static __init void store_return(struct ftrace_graph_ret *trace,
+   struct fgraph_ops *gops)
+{
+   const char *type = fgraph_store_type_name;
+   long long expect = 0;
+   long long found = -1;
+   int size;
+   char *p;
+
+	p = fgraph_retrieve_data(gops->idx, &size);
+   if (!p) {
+   snprintf(fgraph_error_str_buf, sizeof(fgraph_error_str_buf),
+"Failed to retrieve %s\n", type);
+   fgraph_error_str = fgraph_error_str_buf;
+   return;
+   }
+   if (fgraph_store_size > size) {
+   snprintf(fgraph_error_str_buf, sizeof(fgraph_error_str_buf),
+"Retrieved size %d is smaller than expected %d\n",
+size, (int)fgraph_store_size);
+   fgraph_error_str = fgraph_error_str_buf;
+   return;
+   }
+
+   switch (fgraph_store_size) {
+   case 1:
+   expect = BYTE_NUMBER;
+   found = *(char *)p;
+   break;
+   case 2:
+   expect = SHORT_NUMBER;
+   found = *(short *)p;
+   break;
+   case 4:
+   expect = WORD_NUMBER;
+   found = *(int *)p;
+   break;
+   case 8:
+   expect = LONG_NUMBER;
+   found = *(long long *)p;
+   break;
+   }
+
+   if (found != expect) {
+   snprintf(fgraph_error_str_buf, sizeof(fgraph_error_str_buf),
+"%s returned not %lld but %lld\n", type, expect, 
found);
+   fgraph_error_str = fgraph_error_str_buf;
+   return;
+   }
+   fgraph_error_str = NULL;
+}
+
+static struct fgraph_ops store_bytes __initdata = {
+   .entryfunc  = store_entry,
+   .retfunc= store_return,
+};
+
+static int __init test_graph_storage_type(const char *name, int size)
+{
+   char *func_name;
+   int len;
+   int ret;
+
+   fgraph_store_type_name = name;
+   fgraph_store_size = size;
+
+   snprintf(fgraph_error_str_buf, sizeof(fgraph_error_str_buf),
+"Failed to execute storage %s\n", name);
+   fgraph_error_str = fgraph_error_str_buf;
+
+   pr_cont("PASSED\n");
+   pr_info("Testing fgraph storage of %d byte%s: ", size, size > 1 ? "s" : 
"");
+
+   func_name = "*" __stringify(DYN_FTRACE_TEST_NAME);
+   len = strlen(func_name);
+
+	ret = ftrace_set_filter(&store_bytes.ops, func_name, len, 1);
+   if (ret && ret != -ENODEV) {
+   pr_cont("*Could not set filter* ");
+   return -1;
+   }
+
+	ret = register_ftrace_graph(&store_bytes);
+   if (ret) {
+   pr_warn("Failed to init store_by

[PATCH v10 04/36] function_graph: Convert ret_stack to a series of longs

2024-05-07 Thread Masami Hiramatsu (Google)
From: Steven Rostedt (VMware) 

In order to make it possible to have multiple callbacks registered with the
function_graph tracer, the ret_stack needs to be converted from an array of
ftrace_ret_stack structures to an array of longs. This will allow storing
the list of callbacks on the stack for the return side of the functions.
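
As a small worked illustration of the new layout (a stand-alone user-space
sketch; the stand-in structure's exact fields, and therefore the numbers
printed, depend on the kernel configuration), one ret_stack entry simply
maps onto a block of longs:

#include <stdio.h>
#include <stddef.h>

/* Stand-in for struct ftrace_ret_stack (illustrative field set only). */
struct ftrace_ret_stack {
	unsigned long ret;
	unsigned long func;
	unsigned long long calltime;
	unsigned long fp;
	unsigned long *retp;
};

#define ALIGN(x, a)		(((x) + (a) - 1) & ~((a) - 1))
#define FGRAPH_RET_SIZE		sizeof(struct ftrace_ret_stack)
#define FGRAPH_RET_INDEX	(ALIGN(FGRAPH_RET_SIZE, sizeof(long)) / sizeof(long))

int main(void)
{
	/* One entry occupies this many longs on the shadow stack, and
	 * curr_ret_stack now advances in units of FGRAPH_RET_INDEX. */
	printf("%zu bytes per entry -> %zu longs per entry\n",
	       (size_t)FGRAPH_RET_SIZE, (size_t)FGRAPH_RET_INDEX);
	return 0;
}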

Signed-off-by: Steven Rostedt (VMware) 
Signed-off-by: Masami Hiramatsu (Google) 
---
 include/linux/sched.h |2 -
 kernel/trace/fgraph.c |  124 -
 2 files changed, 71 insertions(+), 55 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 3c2abbc587b4..e453ad8d2d79 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1396,7 +1396,7 @@ struct task_struct {
int curr_ret_depth;
 
/* Stack of return addresses for return function tracing: */
-   struct ftrace_ret_stack *ret_stack;
+   unsigned long   *ret_stack;
 
/* Timestamp for last schedule: */
unsigned long long  ftrace_timestamp;
diff --git a/kernel/trace/fgraph.c b/kernel/trace/fgraph.c
index c83c005e654e..30edeb6d4aa9 100644
--- a/kernel/trace/fgraph.c
+++ b/kernel/trace/fgraph.c
@@ -25,6 +25,18 @@
 #define ASSIGN_OPS_HASH(opsname, val)
 #endif
 
+#define FGRAPH_RET_SIZE sizeof(struct ftrace_ret_stack)
+#define FGRAPH_RET_INDEX (ALIGN(FGRAPH_RET_SIZE, sizeof(long)) / sizeof(long))
+#define SHADOW_STACK_SIZE (PAGE_SIZE)
+#define SHADOW_STACK_INDEX \
+   (ALIGN(SHADOW_STACK_SIZE, sizeof(long)) / sizeof(long))
+/* Leave on a buffer at the end */
+#define SHADOW_STACK_MAX_INDEX (SHADOW_STACK_INDEX - FGRAPH_RET_INDEX)
+
+#define RET_STACK(t, index) ((struct ftrace_ret_stack 
*)(&(t)->ret_stack[index]))
+#define RET_STACK_INC(c) ({ c += FGRAPH_RET_INDEX; })
+#define RET_STACK_DEC(c) ({ c -= FGRAPH_RET_INDEX; })
+
 DEFINE_STATIC_KEY_FALSE(kill_ftrace_graph);
 int ftrace_graph_active;
 
@@ -69,6 +81,7 @@ static int
 ftrace_push_return_trace(unsigned long ret, unsigned long func,
 unsigned long frame_pointer, unsigned long *retp)
 {
+   struct ftrace_ret_stack *ret_stack;
unsigned long long calltime;
int index;
 
@@ -85,23 +98,25 @@ ftrace_push_return_trace(unsigned long ret, unsigned long 
func,
smp_rmb();
 
/* The return trace stack is full */
-   if (current->curr_ret_stack == FTRACE_RETFUNC_DEPTH - 1) {
+   if (current->curr_ret_stack >= SHADOW_STACK_MAX_INDEX) {
		atomic_inc(&current->trace_overrun);
return -EBUSY;
}
 
calltime = trace_clock_local();
 
-   index = ++current->curr_ret_stack;
+   index = current->curr_ret_stack;
+   RET_STACK_INC(current->curr_ret_stack);
+   ret_stack = RET_STACK(current, index);
barrier();
-   current->ret_stack[index].ret = ret;
-   current->ret_stack[index].func = func;
-   current->ret_stack[index].calltime = calltime;
+   ret_stack->ret = ret;
+   ret_stack->func = func;
+   ret_stack->calltime = calltime;
 #ifdef HAVE_FUNCTION_GRAPH_FP_TEST
-   current->ret_stack[index].fp = frame_pointer;
+   ret_stack->fp = frame_pointer;
 #endif
 #ifdef HAVE_FUNCTION_GRAPH_RET_ADDR_PTR
-   current->ret_stack[index].retp = retp;
+   ret_stack->retp = retp;
 #endif
return 0;
 }
@@ -148,7 +163,7 @@ int function_graph_enter(unsigned long ret, unsigned long 
func,
 
return 0;
  out_ret:
-   current->curr_ret_stack--;
+   RET_STACK_DEC(current->curr_ret_stack);
  out:
current->curr_ret_depth--;
return -EBUSY;
@@ -159,11 +174,13 @@ static void
 ftrace_pop_return_trace(struct ftrace_graph_ret *trace, unsigned long *ret,
unsigned long frame_pointer)
 {
+   struct ftrace_ret_stack *ret_stack;
int index;
 
index = current->curr_ret_stack;
+   RET_STACK_DEC(index);
 
-   if (unlikely(index < 0 || index >= FTRACE_RETFUNC_DEPTH)) {
+   if (unlikely(index < 0 || index > SHADOW_STACK_MAX_INDEX)) {
ftrace_graph_stop();
WARN_ON(1);
/* Might as well panic, otherwise we have no where to go */
@@ -171,6 +188,7 @@ ftrace_pop_return_trace(struct ftrace_graph_ret *trace, 
unsigned long *ret,
return;
}
 
+   ret_stack = RET_STACK(current, index);
 #ifdef HAVE_FUNCTION_GRAPH_FP_TEST
/*
 * The arch may choose to record the frame pointer used
@@ -186,22 +204,22 @@ ftrace_pop_return_trace(struct ftrace_graph_ret *trace, 
unsigned long *ret,
 * Note, -mfentry does not use frame pointers, and this test
 *  is not needed if CC_USING_FENTRY is set.
 */
-   if (unlikely(current->ret_stack[index].fp != frame_pointer)) {
+   if (unlikely(ret_stack->fp != f

[PATCH v10 18/36] function_graph: Implement fgraph_reserve_data() and fgraph_retrieve_data()

2024-05-07 Thread Masami Hiramatsu (Google)
From: Steven Rostedt (VMware) 

Add functions that can be called by a fgraph_ops entryfunc and retfunc to
store state between the entry of the function being traced and the exit of
the same function. The fgraph_ops entryfunc() may call
fgraph_reserve_data() to store up to 32 words onto the task's shadow
ret_stack, and this can then be retrieved by fgraph_retrieve_data() called
from the corresponding retfunc().
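
For illustration (not part of the patch), a pair of callbacks using the new
API might look like the sketch below; the timestamp use case and the
pr_debug() reporting are assumptions made for the example:

#include <linux/ftrace.h>
#include <linux/trace_clock.h>

/* Store a timestamp at entry and read it back at the matching return. */
static int stamp_entry(struct ftrace_graph_ent *trace, struct fgraph_ops *gops)
{
	u64 *stamp = fgraph_reserve_data(gops->idx, sizeof(*stamp));

	if (!stamp)
		return 0;	/* no room on the shadow stack: skip this call */

	*stamp = trace_clock_local();
	return 1;
}

static void stamp_return(struct ftrace_graph_ret *trace, struct fgraph_ops *gops)
{
	int size;
	u64 *stamp = fgraph_retrieve_data(gops->idx, &size);

	if (stamp && size >= (int)sizeof(*stamp))
		pr_debug("%ps took %llu ns\n", (void *)trace->func,
			 trace_clock_local() - *stamp);
}

static struct fgraph_ops stamp_ops = {
	.entryfunc = stamp_entry,
	.retfunc   = stamp_return,
};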

Signed-off-by: Steven Rostedt (VMware) 
Signed-off-by: Masami Hiramatsu (Google) 
---
 Changes in v10:
 - Fix to support data sizes up to 32 words (previously it only supported up to
   31 words because the max size check did not account for 32 words)
  - Use "offset" instead of "index".
 Changes in v8:
  - Avoid using DIV_ROUND_UP() in the hot path.
 Changes in v3:
  - Store fgraph_array index to the data entry.
  - Both function requires fgraph_array index to store/retrieve data.
  - Reserve correct size of the data.
  - Return correct data area.
 Changes in v2:
  - Retrieve the reserved size by fgraph_retrieve_data().
  - Expand the maximum data size to 32 words.
  - Update stack index with __get_index(val) if FGRAPH_TYPE_ARRAY entry.
  - fix typos and make description lines shorter than 76 chars.
---
 include/linux/ftrace.h |3 +
 kernel/trace/fgraph.c  |  179 ++--
 2 files changed, 175 insertions(+), 7 deletions(-)

diff --git a/include/linux/ftrace.h b/include/linux/ftrace.h
index 97f7d1cf4f8f..1eee028b9a75 100644
--- a/include/linux/ftrace.h
+++ b/include/linux/ftrace.h
@@ -1075,6 +1075,9 @@ struct fgraph_ops {
int idx;
 };
 
+void *fgraph_reserve_data(int idx, int size_bytes);
+void *fgraph_retrieve_data(int idx, int *size_bytes);
+
 /*
  * Stack of return addresses for functions
  * of a thread.
diff --git a/kernel/trace/fgraph.c b/kernel/trace/fgraph.c
index 3498e8fd8e53..4f62e82448f6 100644
--- a/kernel/trace/fgraph.c
+++ b/kernel/trace/fgraph.c
@@ -37,14 +37,20 @@
  * bits: 10 - 11   Type of storage
  *   0 - reserved
  *   1 - bitmap of fgraph_array index
+ *   2 - reserved data
  *
  * For bitmap of fgraph_array index
  *  bits: 12 - 27  The bitmap of fgraph_ops fgraph_array index
  *
+ * For reserved data:
+ *  bits: 12 - 17  The size in words that is stored
+ *  bits: 18 - 23  The index of fgraph_array, which shows who is stored
+ *
  * That is, at the end of function_graph_enter, if the first and forth
  * fgraph_ops on the fgraph_array[] (index 0 and 3) needs their retfunc called
- * on the return of the function being traced, this is what will be on the
- * task's shadow ret_stack: (the stack grows upward)
+ * on the return of the function being traced, and the forth fgraph_ops
+ * stored two words of data, this is what will be on the task's shadow
+ * ret_stack: (the stack grows upward)
  *
  *  ret_stack[SHADOW_STACK_IN_WORD]
  * | SHADOW_STACK_TASK_VARS(ret_stack)[15]  |
@@ -54,9 +60,17 @@
  * ...
  * || <- task->curr_ret_stack
  * ++
+ * | (3 << FGRAPH_DATA_INDEX_SHIFT)| \  | This is for fgraph_ops[3].
+ * | ((2 - 1) << FGRAPH_DATA_SHIFT)| \  | The data size is 2 words.
+ * | (FGRAPH_TYPE_DATA << FGRAPH_TYPE_SHIFT)| \ |
+ * | (offset2:FGRAPH_FRAME_OFFSET+3)| <- the offset2 is from here
+ * ++ ( It is 4 words from the 
ret_stack)
+ * |STORED DATA WORD 2  |
+ * |STORED DATA WORD 1  |
+ * +--i-+
  * | (BIT(3)|BIT(0)) << FGRAPH_INDEX_SHIFT | \  |
  * | FGRAPH_TYPE_BITMAP << FGRAPH_TYPE_SHIFT| \ |
- * | (offset:FGRAPH_FRAME_OFFSET)   | <- the offset is from here
+ * | (offset1:FGRAPH_FRAME_OFFSET)  | <- the offset1 is from here
  * ++
  * | struct ftrace_ret_stack|
  * |   (stores the saved ret pointer)   | <- the offset points here
@@ -83,12 +97,26 @@
 enum {
FGRAPH_TYPE_RESERVED= 0,
FGRAPH_TYPE_BITMAP  = 1,
+   FGRAPH_TYPE_DATA= 2,
 };
 
 #define FGRAPH_INDEX_SIZE  16
 #define FGRAPH_INDEX_MASK  GENMASK(FGRAPH_INDEX_SIZE - 1, 0)
 #define FGRAPH_INDEX_SHIFT (FGRAPH_TYPE_SHIFT + FGRAPH_TYPE_SIZE)
 
+/* The data size == 0 means 1 word, and 31 (=2^5 - 1) means 32 words. */
+#define FGRAPH_DATA_SIZE   5
+#define FGRAPH_DATA_MASK   GENMASK(FGRAPH_DATA_SIZE - 1, 0)
+#define FGRAPH_DATA_SHIFT  (FGRAPH_TYPE_SHIFT + FGRAPH_TYPE_SIZE)
+#define FGRAPH_MAX_DATA_SIZE (sizeof(long) * (1 << FGRAPH_DATA_SIZE))
+
+#define FGRAPH_DATA_INDEX_SIZE 4
+#define FGRAPH_DATA_INDEX_MASK GENMASK(FGRAPH_DATA_INDEX_SIZE - 1, 0)
+#define FGRAPH_DATA_INDEX_SHIFT

[PATCH v10 16/36] function_graph: Move graph depth stored data to shadow stack global var

2024-05-07 Thread Masami Hiramatsu (Google)
From: Steven Rostedt (VMware) 

The use of task->trace_recursion for the function graph depth logic was a
bit of an abuse of that variable. Now that there exist global variables
that are per stack for registered graph tracers, use those instead.

Signed-off-by: Steven Rostedt (VMware) 
Signed-off-by: Masami Hiramatsu (Google) 
---
 include/linux/trace_recursion.h |   29 -
 kernel/trace/trace.h|   34 --
 2 files changed, 32 insertions(+), 31 deletions(-)

diff --git a/include/linux/trace_recursion.h b/include/linux/trace_recursion.h
index 02e6afc6d7fe..fdfb6f66718a 100644
--- a/include/linux/trace_recursion.h
+++ b/include/linux/trace_recursion.h
@@ -44,25 +44,6 @@ enum {
  */
TRACE_IRQ_BIT,
 
-   /*
-* In the very unlikely case that an interrupt came in
-* at a start of graph tracing, and we want to trace
-* the function in that interrupt, the depth can be greater
-* than zero, because of the preempted start of a previous
-* trace. In an even more unlikely case, depth could be 2
-* if a softirq interrupted the start of graph tracing,
-* followed by an interrupt preempting a start of graph
-* tracing in the softirq, and depth can even be 3
-* if an NMI came in at the start of an interrupt function
-* that preempted a softirq start of a function that
-* preempted normal context Luckily, it can't be
-* greater than 3, so the next two bits are a mask
-* of what the depth is when we set TRACE_GRAPH_FL
-*/
-
-   TRACE_GRAPH_DEPTH_START_BIT,
-   TRACE_GRAPH_DEPTH_END_BIT,
-
/*
 * To implement set_graph_notrace, if this bit is set, we ignore
 * function graph tracing of called functions, until the return
@@ -78,16 +59,6 @@ enum {
 #define trace_recursion_clear(bit) do { (current)->trace_recursion &= 
~(1<<(bit)); } while (0)
 #define trace_recursion_test(bit)  ((current)->trace_recursion & 
(1<<(bit)))
 
-#define trace_recursion_depth() \
-   (((current)->trace_recursion >> TRACE_GRAPH_DEPTH_START_BIT) & 3)
-#define trace_recursion_set_depth(depth) \
-   do {\
-   current->trace_recursion &= \
-   ~(3 << TRACE_GRAPH_DEPTH_START_BIT);\
-   current->trace_recursion |= \
-   ((depth) & 3) << TRACE_GRAPH_DEPTH_START_BIT;   \
-   } while (0)
-
 #define TRACE_CONTEXT_BITS 4
 
 #define TRACE_FTRACE_START TRACE_FTRACE_BIT
diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h
index c7c7e7c9f700..7ab731b9ebc8 100644
--- a/kernel/trace/trace.h
+++ b/kernel/trace/trace.h
@@ -899,8 +899,38 @@ extern void free_fgraph_ops(struct trace_array *tr);
 
 enum {
TRACE_GRAPH_FL  = 1,
+
+   /*
+* In the very unlikely case that an interrupt came in
+* at a start of graph tracing, and we want to trace
+* the function in that interrupt, the depth can be greater
+* than zero, because of the preempted start of a previous
+* trace. In an even more unlikely case, depth could be 2
+* if a softirq interrupted the start of graph tracing,
+* followed by an interrupt preempting a start of graph
+* tracing in the softirq, and depth can even be 3
+* if an NMI came in at the start of an interrupt function
+* that preempted a softirq start of a function that
+* preempted normal context Luckily, it can't be
+* greater than 3, so the next two bits are a mask
+* of what the depth is when we set TRACE_GRAPH_FL
+*/
+
+   TRACE_GRAPH_DEPTH_START_BIT,
+   TRACE_GRAPH_DEPTH_END_BIT,
 };
 
+static inline unsigned long ftrace_graph_depth(unsigned long *task_var)
+{
+   return (*task_var >> TRACE_GRAPH_DEPTH_START_BIT) & 3;
+}
+
+static inline void ftrace_graph_set_depth(unsigned long *task_var, int depth)
+{
+   *task_var &= ~(3 << TRACE_GRAPH_DEPTH_START_BIT);
+   *task_var |= (depth & 3) << TRACE_GRAPH_DEPTH_START_BIT;
+}
+
 #ifdef CONFIG_DYNAMIC_FTRACE
 extern struct ftrace_hash __rcu *ftrace_graph_hash;
 extern struct ftrace_hash __rcu *ftrace_graph_notrace_hash;
@@ -933,7 +963,7 @@ ftrace_graph_addr(unsigned long *task_var, struct 
ftrace_graph_ent *trace)
 * when the depth is zero.
 */
*task_var |= TRACE_GRAPH_FL;
-   trace_recursion_set_depth(trace->depth);
+   ftrace_graph_set_depth(task_var, trace->depth);
 
/*
 * If no irqs are to be traced, but a set_graph_function
@@ -958,7 +988,7 @@ ftrace_graph_addr_finish(struct fgra

[PATCH v10 15/36] function_graph: Move set_graph_function tests to shadow stack global var

2024-05-07 Thread Masami Hiramatsu (Google)
From: Steven Rostedt (VMware) 

The use of task->trace_recursion for the set_graph_function logic was a
bit of an abuse of that variable. Now that there exist global variables
that are per stack for registered graph tracers, use those instead.

Signed-off-by: Steven Rostedt (VMware) 
Signed-off-by: Masami Hiramatsu (Google) 
---
 include/linux/trace_recursion.h  |5 +
 kernel/trace/trace.h |   32 +---
 kernel/trace/trace_functions_graph.c |6 +++---
 kernel/trace/trace_irqsoff.c |4 ++--
 kernel/trace/trace_sched_wakeup.c|4 ++--
 5 files changed, 29 insertions(+), 22 deletions(-)

diff --git a/include/linux/trace_recursion.h b/include/linux/trace_recursion.h
index 24ea8ac049b4..02e6afc6d7fe 100644
--- a/include/linux/trace_recursion.h
+++ b/include/linux/trace_recursion.h
@@ -44,9 +44,6 @@ enum {
  */
TRACE_IRQ_BIT,
 
-   /* Set if the function is in the set_graph_function file */
-   TRACE_GRAPH_BIT,
-
/*
 * In the very unlikely case that an interrupt came in
 * at a start of graph tracing, and we want to trace
@@ -60,7 +57,7 @@ enum {
 * that preempted a softirq start of a function that
 * preempted normal context Luckily, it can't be
 * greater than 3, so the next two bits are a mask
-* of what the depth is when we set TRACE_GRAPH_BIT
+* of what the depth is when we set TRACE_GRAPH_FL
 */
 
TRACE_GRAPH_DEPTH_START_BIT,
diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h
index 9995d6b00a93..c7c7e7c9f700 100644
--- a/kernel/trace/trace.h
+++ b/kernel/trace/trace.h
@@ -897,11 +897,16 @@ extern void init_array_fgraph_ops(struct trace_array *tr, 
struct ftrace_ops *ops
 extern int allocate_fgraph_ops(struct trace_array *tr, struct ftrace_ops *ops);
 extern void free_fgraph_ops(struct trace_array *tr);
 
+enum {
+   TRACE_GRAPH_FL  = 1,
+};
+
 #ifdef CONFIG_DYNAMIC_FTRACE
 extern struct ftrace_hash __rcu *ftrace_graph_hash;
 extern struct ftrace_hash __rcu *ftrace_graph_notrace_hash;
 
-static inline int ftrace_graph_addr(struct ftrace_graph_ent *trace)
+static inline int
+ftrace_graph_addr(unsigned long *task_var, struct ftrace_graph_ent *trace)
 {
unsigned long addr = trace->func;
int ret = 0;
@@ -923,12 +928,11 @@ static inline int ftrace_graph_addr(struct 
ftrace_graph_ent *trace)
}
 
if (ftrace_lookup_ip(hash, addr)) {
-
/*
 * This needs to be cleared on the return functions
 * when the depth is zero.
 */
-   trace_recursion_set(TRACE_GRAPH_BIT);
+   *task_var |= TRACE_GRAPH_FL;
trace_recursion_set_depth(trace->depth);
 
/*
@@ -948,11 +952,14 @@ static inline int ftrace_graph_addr(struct 
ftrace_graph_ent *trace)
return ret;
 }
 
-static inline void ftrace_graph_addr_finish(struct ftrace_graph_ret *trace)
+static inline void
+ftrace_graph_addr_finish(struct fgraph_ops *gops, struct ftrace_graph_ret 
*trace)
 {
-   if (trace_recursion_test(TRACE_GRAPH_BIT) &&
+   unsigned long *task_var = fgraph_get_task_var(gops);
+
+   if ((*task_var & TRACE_GRAPH_FL) &&
trace->depth == trace_recursion_depth())
-   trace_recursion_clear(TRACE_GRAPH_BIT);
+   *task_var &= ~TRACE_GRAPH_FL;
 }
 
 static inline int ftrace_graph_notrace_addr(unsigned long addr)
@@ -979,7 +986,7 @@ static inline int ftrace_graph_notrace_addr(unsigned long 
addr)
 }
 
 #else
-static inline int ftrace_graph_addr(struct ftrace_graph_ent *trace)
+static inline int ftrace_graph_addr(unsigned long *task_var, struct 
ftrace_graph_ent *trace)
 {
return 1;
 }
@@ -988,17 +995,20 @@ static inline int ftrace_graph_notrace_addr(unsigned long 
addr)
 {
return 0;
 }
-static inline void ftrace_graph_addr_finish(struct ftrace_graph_ret *trace)
+static inline void ftrace_graph_addr_finish(struct fgraph_ops *gops, struct 
ftrace_graph_ret *trace)
 { }
 #endif /* CONFIG_DYNAMIC_FTRACE */
 
 extern unsigned int fgraph_max_depth;
 
-static inline bool ftrace_graph_ignore_func(struct ftrace_graph_ent *trace)
+static inline bool
+ftrace_graph_ignore_func(struct fgraph_ops *gops, struct ftrace_graph_ent 
*trace)
 {
+   unsigned long *task_var = fgraph_get_task_var(gops);
+
/* trace it when it is-nested-in or is a function enabled. */
-   return !(trace_recursion_test(TRACE_GRAPH_BIT) ||
-ftrace_graph_addr(trace)) ||
+   return !((*task_var & TRACE_GRAPH_FL) ||
+ftrace_graph_addr(task_var, trace)) ||
(trace->depth < 0) ||
(fgraph_max_depth && trace->depth >= fgraph_max_depth);
 }
diff --git a/kernel/trace/trace_functions_graph.c 
b/kernel/trace/trace_functions_graph.c
index 7f30652f0e97..66cce73e94f8 10

[PATCH v10 07/36] function_graph: Allow multiple users to attach to function graph

2024-05-07 Thread Masami Hiramatsu (Google)
From: Steven Rostedt (VMware) 

Allow for multiple users to attach to the function graph tracer at the same
time. Only 16 simultaneous users can attach to the tracer. This is because
there's an array that stores the pointers to the attached fgraph_ops. When
a function being traced is entered, each of the fgraph_ops entryfuncs is
called and if it returns non-zero, its index into the array will be added
to the shadow stack.

On exit of the function being traced, the shadow stack will contain the
indexes of the fgraph_ops in the array that want their retfunc to be
called.

Because a function may sleep for a long time (if a task sleeps itself),
the return of the function may be literally days later. If the fgraph_ops
is removed, its place in the array is replaced with a fgraph_ops that
contains the stub functions and that will be called when the function
finally returns.

If another fgraph_ops is added that happens to get the same index into the
array, its return function may be called. But that's actually the way
things currently work with the old function graph tracer. If one tracer is
removed and another is added, the new one will get the return calls of the
function traced by the previous one, thus this is not a regression. This
can be fixed by adding a counter to each time the array item is updated and
saving that on the shadow stack as well, such that it won't be called if the
index saved does not match the index on the array.

Note, being able to filter functions when both are called is not completely
handled yet, but that shouldn't be too hard to manage.
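
For illustration (not part of the patch), registering two users at once
might look like the sketch below, using the callback signatures as they
stand at this point in the series; the callbacks are intentionally empty
stubs and the error handling is minimal:

#include <linux/ftrace.h>

static int nop_entry(struct ftrace_graph_ent *trace)
{
	return 1;
}

static void nop_return(struct ftrace_graph_ret *trace)
{
}

/* Two independent users; with this patch both can be attached at once. */
static struct fgraph_ops user_a = { .entryfunc = nop_entry, .retfunc = nop_return };
static struct fgraph_ops user_b = { .entryfunc = nop_entry, .retfunc = nop_return };

static int __init example_init(void)
{
	int ret = register_ftrace_graph(&user_a);

	if (ret)
		return ret;

	ret = register_ftrace_graph(&user_b);	/* second simultaneous user */
	if (ret)
		unregister_ftrace_graph(&user_a);
	return ret;
}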

Signed-off-by: Steven Rostedt (VMware) 
Signed-off-by: Masami Hiramatsu (Google) 
---
 Changes in v10:
  - Rephrase all "index" on shadow stack to "offset" and some "ret_stack" to
   "frame".
  - Macros are renamed;
FGRAPH_RET_SIZE to FGRAPH_FRAME_SIZE
FGRAPH_RET_INDEX to FGRAPH_FRAME_OFFSET
FGRAPH_RET_INDEX_SIZE to FGRAPH_FRAME_OFFSET_SIZE
FGRAPH_RET_INDEX_MASK to FGRAPH_FRAME_OFFSET_MASK
SHADOW_STACK_INDEX to SHADOW_STACK_IN_WORD
SHADOW_STACK_MAX_INDEX to SHADOW_STACK_MAX_OFFSET
  - Remove unused FGRAPH_MAX_INDEX macro.
  - Fix wrong explanation of the shadow stack entry.
 Changes in v7:
  - Fix max limitation check in ftrace_graph_push_return().
 - Rewrite the shadow stack implementation using a bitmap entry. This allows
   us to detect recursive calls/tail calls more easily. (This implementation is
   moved from a later patch in the series.)
 Changes in v2:
  - Check return value of the ftrace_pop_return_trace() instead of 'ret'
since 'ret' is set to the address of panic().
  - Fix typo and make lines shorter than 76 chars in description.
---
 include/linux/ftrace.h |3 
 kernel/trace/fgraph.c  |  378 +++-
 2 files changed, 310 insertions(+), 71 deletions(-)

diff --git a/include/linux/ftrace.h b/include/linux/ftrace.h
index d5df5f8fc35a..0a1a7316de7b 100644
--- a/include/linux/ftrace.h
+++ b/include/linux/ftrace.h
@@ -1066,6 +1066,7 @@ extern int ftrace_graph_entry_stub(struct 
ftrace_graph_ent *trace);
 struct fgraph_ops {
trace_func_graph_ent_t  entryfunc;
trace_func_graph_ret_t  retfunc;
+   int idx;
 };
 
 /*
@@ -1100,7 +1101,7 @@ function_graph_enter(unsigned long ret, unsigned long 
func,
 unsigned long frame_pointer, unsigned long *retp);
 
 struct ftrace_ret_stack *
-ftrace_graph_get_ret_stack(struct task_struct *task, int idx);
+ftrace_graph_get_ret_stack(struct task_struct *task, int skip);
 
 unsigned long ftrace_graph_ret_addr(struct task_struct *task, int *idx,
unsigned long ret, unsigned long *retp);
diff --git a/kernel/trace/fgraph.c b/kernel/trace/fgraph.c
index 3f9dd213e7d8..6b06962657fe 100644
--- a/kernel/trace/fgraph.c
+++ b/kernel/trace/fgraph.c
@@ -7,6 +7,7 @@
  *
  * Highly modified by Steven Rostedt (VMware).
  */
+#include 
 #include 
 #include 
 #include 
@@ -25,25 +26,157 @@
 #define ASSIGN_OPS_HASH(opsname, val)
 #endif
 
-#define FGRAPH_RET_SIZE sizeof(struct ftrace_ret_stack)
-#define FGRAPH_RET_INDEX DIV_ROUND_UP(FGRAPH_RET_SIZE, sizeof(long))
+#define FGRAPH_FRAME_SIZE sizeof(struct ftrace_ret_stack)
+#define FGRAPH_FRAME_OFFSET DIV_ROUND_UP(FGRAPH_FRAME_SIZE, sizeof(long))
+
+/*
+ * On entry to a function (via function_graph_enter()), a new ftrace_ret_stack
+ * is allocated on the task's ret_stack with bitmap entry, then each
+ * fgraph_ops on the fgraph_array[]'s entryfunc is called and if that returns
+ * non-zero, the index into the fgraph_array[] for that fgraph_ops is recorded
+ * on the bitmap entry as a bit flag.
+ *
+ * The top of the ret_stack (when not empty) will always have a reference
+ * to the last ftrace_ret_stack saved. All references to the
+ * ftrace_ret_stack has the format of:
+ *
+ * bits:  0 -  9   offset in words from the previous ftrace_ret_stack
+ * 

[PATCH v10 14/36] function_graph: Add "task variables" per task for fgraph_ops

2024-05-07 Thread Masami Hiramatsu (Google)
From: Steven Rostedt (VMware) 

Add a "task variables" array on the tasks shadow ret_stack that is the
size of longs for each possible registered fgraph_ops. That's a total
of 16, taking up 8 * 16 = 128 bytes (out of a page size 4k).

This will allow fgraph_ops to implement specific features on a per-task
basis, having a way to maintain state for each task.
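
For illustration (not part of the patch), a per-task call nesting counter
kept in the task variable might look like the sketch below:

#include <linux/ftrace.h>

/* Keep a per-task call nesting count in this fgraph_ops' task variable. */
static int nest_entry(struct ftrace_graph_ent *trace, struct fgraph_ops *gops)
{
	unsigned long *nested = fgraph_get_task_var(gops);

	(*nested)++;
	return 1;
}

static void nest_return(struct ftrace_graph_ret *trace, struct fgraph_ops *gops)
{
	unsigned long *nested = fgraph_get_task_var(gops);

	if (*nested)
		(*nested)--;
}

static struct fgraph_ops nest_ops = {
	.entryfunc = nest_entry,
	.retfunc   = nest_return,
};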

Signed-off-by: Steven Rostedt (VMware) 
Signed-off-by: Masami Hiramatsu (Google) 
---
 Changes in v10:
  - Explain where the task vars is placed in shadow stack.
 Changes in v3:
  - Move fgraph_ops::idx to previous patch in the series.
 Changes in v2:
  - Make description lines shorter than 76 chars.
---
 include/linux/ftrace.h |1 +
 kernel/trace/fgraph.c  |   74 +++-
 2 files changed, 74 insertions(+), 1 deletion(-)

diff --git a/include/linux/ftrace.h b/include/linux/ftrace.h
index b11af9d88438..97f7d1cf4f8f 100644
--- a/include/linux/ftrace.h
+++ b/include/linux/ftrace.h
@@ -1116,6 +1116,7 @@ ftrace_graph_get_ret_stack(struct task_struct *task, int 
skip);
 
 unsigned long ftrace_graph_ret_addr(struct task_struct *task, int *idx,
unsigned long ret, unsigned long *retp);
+unsigned long *fgraph_get_task_var(struct fgraph_ops *gops);
 
 /*
  * Sometimes we don't want to trace a function with the function
diff --git a/kernel/trace/fgraph.c b/kernel/trace/fgraph.c
index c00a299decb1..3498e8fd8e53 100644
--- a/kernel/trace/fgraph.c
+++ b/kernel/trace/fgraph.c
@@ -46,6 +46,10 @@
  * on the return of the function being traced, this is what will be on the
  * task's shadow ret_stack: (the stack grows upward)
  *
+ *  ret_stack[SHADOW_STACK_IN_WORD]
+ * | SHADOW_STACK_TASK_VARS(ret_stack)[15]  |
+ * ...
+ * | SHADOW_STACK_TASK_VARS(ret_stack)[0]   |
  *  ret_stack[SHADOW_STACK_MAX_OFFSET]
  * ...
  * || <- task->curr_ret_stack
@@ -90,10 +94,18 @@ enum {
 #define SHADOW_STACK_SIZE (PAGE_SIZE)
 #define SHADOW_STACK_IN_WORD (SHADOW_STACK_SIZE / sizeof(long))
 /* Leave on a buffer at the end */
-#define SHADOW_STACK_MAX_OFFSET (SHADOW_STACK_IN_WORD - (FGRAPH_FRAME_OFFSET + 
1))
+#define SHADOW_STACK_MAX_OFFSET\
+   (SHADOW_STACK_IN_WORD - (FGRAPH_FRAME_OFFSET + 1 + FGRAPH_ARRAY_SIZE))
 
 #define RET_STACK(t, offset) ((struct ftrace_ret_stack 
*)(&(t)->ret_stack[offset]))
 
+/*
+ * Each fgraph_ops has a reservered unsigned long at the end (top) of the
+ * ret_stack to store task specific state.
+ */
+#define SHADOW_STACK_TASK_VARS(ret_stack) \
+   ((unsigned long *)(&(ret_stack)[SHADOW_STACK_IN_WORD - 
FGRAPH_ARRAY_SIZE]))
+
 DEFINE_STATIC_KEY_FALSE(kill_ftrace_graph);
 int ftrace_graph_active;
 
@@ -184,6 +196,44 @@ static void return_run(struct ftrace_graph_ret *trace, 
struct fgraph_ops *ops)
 {
 }
 
+static void ret_stack_set_task_var(struct task_struct *t, int idx, long val)
+{
+   unsigned long *gvals = SHADOW_STACK_TASK_VARS(t->ret_stack);
+
+   gvals[idx] = val;
+}
+
+static unsigned long *
+ret_stack_get_task_var(struct task_struct *t, int idx)
+{
+   unsigned long *gvals = SHADOW_STACK_TASK_VARS(t->ret_stack);
+
	return &gvals[idx];
+}
+
+static void ret_stack_init_task_vars(unsigned long *ret_stack)
+{
+   unsigned long *gvals = SHADOW_STACK_TASK_VARS(ret_stack);
+
+   memset(gvals, 0, sizeof(*gvals) * FGRAPH_ARRAY_SIZE);
+}
+
+/**
+ * fgraph_get_task_var - retrieve a task specific state variable
+ * @gops: The ftrace_ops that owns the task specific variable
+ *
+ * Every registered fgraph_ops has a task state variable
+ * reserved on the task's ret_stack. This function returns the
+ * address to that variable.
+ *
+ * Returns the address to the fgraph_ops @gops tasks specific
+ * unsigned long variable.
+ */
+unsigned long *fgraph_get_task_var(struct fgraph_ops *gops)
+{
+   return ret_stack_get_task_var(current, gops->idx);
+}
+
 /*
  * @offset: The offset into @t->ret_stack to find the ret_stack entry
  * @frame_offset: Where to place the offset into @t->ret_stack of that entry
@@ -793,6 +843,7 @@ static int alloc_retstack_tasklist(unsigned long 
**ret_stack_list)
 
if (t->ret_stack == NULL) {
			atomic_set(&t->trace_overrun, 0);
+   ret_stack_init_task_vars(ret_stack_list[start]);
t->curr_ret_stack = 0;
t->curr_ret_depth = -1;
/* Make sure the tasks see the 0 first: */
@@ -853,6 +904,7 @@ static void
 graph_init_task(struct task_struct *t, unsigned long *ret_stack)
 {
	atomic_set(&t->trace_overrun, 0);
+   ret_stack_init_task_vars(ret_stack);
t->ftrace_timestamp = 0;
t->curr_ret_stack = 0;
t->curr_ret_depth = -1;
@@ -951,6 +1003,24 @@ static int start_graph_tracing(void)
return ret;
 }
 
+static void init_task_v

[PATCH v10 13/36] function_graph: Use a simple LRU for fgraph_array index number

2024-05-07 Thread Masami Hiramatsu (Google)
From: Masami Hiramatsu (Google) 

Since the fgraph_array index is used for the bitmap on the shadow
stack, some entries may be left on it after a function_graph instance is
removed. Thus if another instance reuses the fgraph_array index soon
after releasing it, the fgraph core may end up calling the newer callback
for entries that were pushed by the older instance.
To avoid reusing a fgraph_array index soon after it is released, introduce
a simple LRU table for managing the index numbers. This will reduce the
possibility of this confusion.
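
As a stand-alone user-space illustration of the LRU behavior (mirroring the
table logic added below, not kernel code): a released index goes to the back
of the queue and is not handed out again until the other indexes have been
cycled through:

#include <stdio.h>

#define FGRAPH_ARRAY_SIZE 16

static int table[FGRAPH_ARRAY_SIZE];
static int next_idx, last_idx;

static void lru_init(void)
{
	for (int i = 0; i < FGRAPH_ARRAY_SIZE; i++)
		table[i] = i;
}

static int lru_alloc(void)
{
	int idx = table[next_idx];

	if (idx == -1)
		return -1;				/* no index available */
	table[next_idx] = -1;
	next_idx = (next_idx + 1) % FGRAPH_ARRAY_SIZE;
	return idx;
}

static void lru_release(int idx)
{
	table[last_idx] = idx;				/* back of the queue */
	last_idx = (last_idx + 1) % FGRAPH_ARRAY_SIZE;
}

int main(void)
{
	lru_init();
	int a = lru_alloc();				/* gets index 0 */
	lru_release(a);					/* 0 is queued for reuse */
	printf("released %d, next allocation gives %d\n", a, lru_alloc());
	return 0;
}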

Signed-off-by: Masami Hiramatsu (Google) 
---
 Changes in v8:
  - Add a WARN_ON_ONCE() if fgraph_lru_table[] is broken when releasing
index, and remove WARN_ON_ONCE() from unregister_ftrace_graph()
  - Fix to release allocated index if register_ftrace_graph() fails.
  - Add comments and code cleanup.
 Changes in v5:
  - Fix the underflow bug in fgraph_lru_release_index() and return 0
   if the release succeeded.
 Changes in v4:
  - Newly added.
---
 kernel/trace/fgraph.c |   71 +++--
 1 file changed, 50 insertions(+), 21 deletions(-)

diff --git a/kernel/trace/fgraph.c b/kernel/trace/fgraph.c
index b6949e4fda79..c00a299decb1 100644
--- a/kernel/trace/fgraph.c
+++ b/kernel/trace/fgraph.c
@@ -97,10 +97,48 @@ enum {
 DEFINE_STATIC_KEY_FALSE(kill_ftrace_graph);
 int ftrace_graph_active;
 
-static int fgraph_array_cnt;
-
 static struct fgraph_ops *fgraph_array[FGRAPH_ARRAY_SIZE];
 
+/* LRU index table for fgraph_array */
+static int fgraph_lru_table[FGRAPH_ARRAY_SIZE];
+static int fgraph_lru_next;
+static int fgraph_lru_last;
+
+/* Initialize fgraph_lru_table with unused index */
+static void fgraph_lru_init(void)
+{
+   int i;
+
+   for (i = 0; i < FGRAPH_ARRAY_SIZE; i++)
+   fgraph_lru_table[i] = i;
+}
+
+/* Release the used index to the LRU table */
+static int fgraph_lru_release_index(int idx)
+{
+   if (idx < 0 || idx >= FGRAPH_ARRAY_SIZE ||
+   WARN_ON_ONCE(fgraph_lru_table[fgraph_lru_last] != -1))
+   return -1;
+
+   fgraph_lru_table[fgraph_lru_last] = idx;
+   fgraph_lru_last = (fgraph_lru_last + 1) % FGRAPH_ARRAY_SIZE;
+   return 0;
+}
+
+/* Allocate a new index from LRU table */
+static int fgraph_lru_alloc_index(void)
+{
+   int idx = fgraph_lru_table[fgraph_lru_next];
+
+   /* No id is available */
+   if (idx == -1)
+   return -1;
+
+   fgraph_lru_table[fgraph_lru_next] = -1;
+   fgraph_lru_next = (fgraph_lru_next + 1) % FGRAPH_ARRAY_SIZE;
+   return idx;
+}
+
 static inline int get_frame_offset(struct task_struct *t, int offset)
 {
return t->ret_stack[offset] & FGRAPH_FRAME_OFFSET_MASK;
@@ -365,7 +403,7 @@ int function_graph_enter(unsigned long ret, unsigned long 
func,
if (offset < 0)
goto out;
 
-   for (i = 0; i < fgraph_array_cnt; i++) {
+   for (i = 0; i < FGRAPH_ARRAY_SIZE; i++) {
struct fgraph_ops *gops = fgraph_array[i];
 
		if (gops == &fgraph_stub)
@@ -917,7 +955,7 @@ int register_ftrace_graph(struct fgraph_ops *gops)
 {
int command = 0;
int ret = 0;
-   int i;
+   int i = -1;
 
	mutex_lock(&ftrace_lock);
 
@@ -933,21 +971,16 @@ int register_ftrace_graph(struct fgraph_ops *gops)
/* The array must always have real data on it */
for (i = 0; i < FGRAPH_ARRAY_SIZE; i++)
			fgraph_array[i] = &fgraph_stub;
+   fgraph_lru_init();
}
 
-   /* Look for an available spot */
-   for (i = 0; i < FGRAPH_ARRAY_SIZE; i++) {
-		if (fgraph_array[i] == &fgraph_stub)
-   break;
-   }
-   if (i >= FGRAPH_ARRAY_SIZE) {
+   i = fgraph_lru_alloc_index();
+	if (i < 0 || WARN_ON_ONCE(fgraph_array[i] != &fgraph_stub)) {
ret = -ENOSPC;
goto out;
}
 
fgraph_array[i] = gops;
-   if (i + 1 > fgraph_array_cnt)
-   fgraph_array_cnt = i + 1;
gops->idx = i;
 
ftrace_graph_active++;
@@ -971,6 +1004,7 @@ int register_ftrace_graph(struct fgraph_ops *gops)
if (ret) {
		fgraph_array[i] = &fgraph_stub;
ftrace_graph_active--;
+   fgraph_lru_release_index(i);
}
 out:
mutex_unlock(_lock);
@@ -980,25 +1014,20 @@ int register_ftrace_graph(struct fgraph_ops *gops)
 void unregister_ftrace_graph(struct fgraph_ops *gops)
 {
int command = 0;
-   int i;
 
	mutex_lock(&ftrace_lock);
 
if (unlikely(!ftrace_graph_active))
goto out;
 
-   if (unlikely(gops->idx < 0 || gops->idx >= fgraph_array_cnt))
+   if (unlikely(gops->idx < 0 || gops->idx >= FGRAPH_ARRAY_SIZE ||
+fgraph_array[gops->idx] != gops))
goto out;
 
-   WARN_ON_ONCE(fgraph_array[gops->idx] != gops);
+   

[PATCH v10 12/36] function_graph: Have the instances use their own ftrace_ops for filtering

2024-05-07 Thread Masami Hiramatsu (Google)
From: Steven Rostedt (VMware) 

Allow for instances to have their own ftrace_ops as part of the fgraph_ops,
which makes the function_graph tracer filter on the set_ftrace_filter file
of the instance and not the top instance.

Note that this also requires updating ftrace_graph_func() to call the new
function_graph_enter_ops() instead of function_graph_enter(), so that it
avoids pushing onto the shadow stack multiple times for the same function.
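
For illustration (not part of the patch), a user setting a filter on its own
fgraph_ops might look like the sketch below; the "kmem_cache_*" pattern is
an arbitrary example:

#include <linux/ftrace.h>
#include <linux/string.h>

static int filt_entry(struct ftrace_graph_ent *trace, struct fgraph_ops *gops)
{
	return 1;
}

static void filt_return(struct ftrace_graph_ret *trace, struct fgraph_ops *gops)
{
}

static struct fgraph_ops filt_ops = {
	.entryfunc = filt_entry,
	.retfunc   = filt_return,
};

static int __init filt_init(void)
{
	char *pat = "kmem_cache_*";
	int ret;

	/* The filter now applies to this fgraph_ops' own ftrace_ops. */
	ret = ftrace_set_filter(&filt_ops.ops, pat, strlen(pat), 1);
	if (ret)
		return ret;

	return register_ftrace_graph(&filt_ops);
}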

Signed-off-by: Steven Rostedt (VMware) 
Signed-off-by: Masami Hiramatsu (Google) 
---
 Changes in v10:
  - Use "offset" for shadow stack instead of "index".
 Changes in v9:
  - Fix to clear fgraph_array correctly when ftrace_startup() fails.
  - Return -ENOSPC if fgraph_array is full.
 Changes in v8:
  - Fix a compilation error in loongarch implementation.
  - Update riscv implementation of ftrace_graph_func().
 Changes in v7:
  - Move FGRAPH_TYPE_BITMAP type implementation to earlier patch (
which implements FGRAPH_TYPE_ARRAY) so that it does not need to
replace the FGRAPH_TYPE_ARRAY type.
  - Update loongarch and powerpc implementation of ftrace_graph_func().
  - Update description.
 Changes in v6:
  - Fix to check whether the fgraph_ops is already unregistered in
function_graph_enter_ops().
  - Fix stack unwinder error on arm64 because of passing wrong value
as retp. Thanks Mark!
 Changes in v4:
  - Simplify get_ret_stack() sanity check and use WARN_ON_ONCE() for
obviously wrong value.
  - Do not check ret == return_to_handler but always read the previous
ret_stack in ftrace_push_return_trace() to check it is reusable.
  - Set the bit 0 of the bitmap entry always in function_graph_enter()
because it uses bit 0 to check re-usability.
  - Fix to ensure the ret_stack entry is bitmap type when checking the
bitmap.
 Changes in v3:
  - Pass current fgraph_ops to the new entry handler
   (function_graph_enter_ops) if fgraph use ftrace.
  - Add fgraph_ops::idx in this patch.
  - Replace the array type with the bitmap type so that it can record
which fgraph is called.
  - Fix some helper function to use passed task_struct instead of current.
  - Reduce the ret-index size to 1024 words.
  - Make the ret-index directly points the ret_stack.
  - Fix ftrace_graph_ret_addr() to handle tail-call case correctly.
 Changes in v2:
  - Use ftrace_graph_func and FTRACE_OPS_GRAPH_STUB instead of
ftrace_stub and FTRACE_OPS_FL_STUB for new ftrace based fgraph.
---
 arch/arm64/kernel/ftrace.c   |   21 +-
 arch/loongarch/kernel/ftrace_dyn.c   |   15 
 arch/powerpc/kernel/trace/ftrace.c   |3 +
 arch/riscv/kernel/ftrace.c   |   15 
 arch/x86/kernel/ftrace.c |   19 +
 include/linux/ftrace.h   |6 ++
 kernel/trace/fgraph.c|  125 +-
 kernel/trace/ftrace.c|4 +
 kernel/trace/trace.h |   16 ++--
 kernel/trace/trace_functions.c   |2 -
 kernel/trace/trace_functions_graph.c |8 ++
 11 files changed, 183 insertions(+), 51 deletions(-)

diff --git a/arch/arm64/kernel/ftrace.c b/arch/arm64/kernel/ftrace.c
index a650f5e11fc5..b96740829798 100644
--- a/arch/arm64/kernel/ftrace.c
+++ b/arch/arm64/kernel/ftrace.c
@@ -481,7 +481,26 @@ void prepare_ftrace_return(unsigned long self_addr, 
unsigned long *parent,
 void ftrace_graph_func(unsigned long ip, unsigned long parent_ip,
   struct ftrace_ops *op, struct ftrace_regs *fregs)
 {
-   prepare_ftrace_return(ip, &fregs->lr, fregs->fp);
+   struct fgraph_ops *gops = container_of(op, struct fgraph_ops, ops);
+   unsigned long frame_pointer = fregs->fp;
+   unsigned long *parent = &fregs->lr;
+   int bit;
+
+   if (unlikely(ftrace_graph_is_dead()))
+   return;
+
+   if (unlikely(atomic_read(&current->tracing_graph_pause)))
+   return;
+
+   bit = ftrace_test_recursion_trylock(ip, *parent);
+   if (bit < 0)
+   return;
+
+   if (!function_graph_enter_ops(*parent, ip, frame_pointer,
+ (void *)frame_pointer, gops))
+   *parent = (unsigned long)&return_to_handler;
+
+   ftrace_test_recursion_unlock(bit);
 }
 #else
 /*
diff --git a/arch/loongarch/kernel/ftrace_dyn.c 
b/arch/loongarch/kernel/ftrace_dyn.c
index 73858c9029cc..920eb673b32b 100644
--- a/arch/loongarch/kernel/ftrace_dyn.c
+++ b/arch/loongarch/kernel/ftrace_dyn.c
@@ -241,10 +241,21 @@ void prepare_ftrace_return(unsigned long self_addr, 
unsigned long *parent)
 void ftrace_graph_func(unsigned long ip, unsigned long parent_ip,
   struct ftrace_ops *op, struct ftrace_regs *fregs)
 {
+   struct fgraph_ops *gops = container_of(op, struct fgraph_ops, ops);
+   unsigned long return_hooker = (unsigned long)&return_to_handler;
	struct pt_regs *regs = &fregs->regs;
-   unsigned long *parent = (unsigned long *)&regs->regs[1];
+   unsigned long *parent;

[PATCH v10 11/36] ftrace: Allow ftrace startup flags exist without dynamic ftrace

2024-05-07 Thread Masami Hiramatsu (Google)
From: Steven Rostedt (VMware) 

Some of the flags for ftrace_startup() may be exposed even when
CONFIG_DYNAMIC_FTRACE is not configured in. This is fine as the difference
between dynamic ftrace and static ftrace is done within the internals of
ftrace itself. No need to have use cases fail to compile because dynamic
ftrace is disabled.

This change is needed to move some of the logic of what is passed to
ftrace_startup() out of the parameters of ftrace_startup().

Signed-off-by: Steven Rostedt (VMware) 
Signed-off-by: Masami Hiramatsu (Google) 
---
 include/linux/ftrace.h |   18 +-
 1 file changed, 9 insertions(+), 9 deletions(-)

diff --git a/include/linux/ftrace.h b/include/linux/ftrace.h
index ff70ee437209..c4d81e0ec862 100644
--- a/include/linux/ftrace.h
+++ b/include/linux/ftrace.h
@@ -538,6 +538,15 @@ static inline void stack_tracer_disable(void) { }
 static inline void stack_tracer_enable(void) { }
 #endif
 
+enum {
+   FTRACE_UPDATE_CALLS = (1 << 0),
+   FTRACE_DISABLE_CALLS= (1 << 1),
+   FTRACE_UPDATE_TRACE_FUNC= (1 << 2),
+   FTRACE_START_FUNC_RET   = (1 << 3),
+   FTRACE_STOP_FUNC_RET= (1 << 4),
+   FTRACE_MAY_SLEEP= (1 << 5),
+};
+
 #ifdef CONFIG_DYNAMIC_FTRACE
 
 void ftrace_arch_code_modify_prepare(void);
@@ -632,15 +641,6 @@ void ftrace_set_global_notrace(unsigned char *buf, int 
len, int reset);
 void ftrace_free_filter(struct ftrace_ops *ops);
 void ftrace_ops_set_global_filter(struct ftrace_ops *ops);
 
-enum {
-   FTRACE_UPDATE_CALLS = (1 << 0),
-   FTRACE_DISABLE_CALLS= (1 << 1),
-   FTRACE_UPDATE_TRACE_FUNC= (1 << 2),
-   FTRACE_START_FUNC_RET   = (1 << 3),
-   FTRACE_STOP_FUNC_RET= (1 << 4),
-   FTRACE_MAY_SLEEP= (1 << 5),
-};
-
 /*
  * The FTRACE_UPDATE_* enum is used to pass information back
  * from the ftrace_update_record() and ftrace_test_record()




[PATCH v10 10/36] ftrace: Allow function_graph tracer to be enabled in instances

2024-05-07 Thread Masami Hiramatsu (Google)
From: Steven Rostedt (VMware) 

Now that function graph tracing can handle more than one user, allow it to
be enabled in the ftrace instances. Note, the filtering of the functions is
still joined by the top level set_ftrace_filter and friends, as well as the
graph and nograph files.
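
As a rough sketch of what the per-instance wiring can look like (this is an
approximation, not necessarily the exact code in the diff below;
trace_graph_entry/trace_graph_return are the existing graph callbacks):

int allocate_fgraph_ops(struct trace_array *tr)
{
	struct fgraph_ops *gops;

	gops = kzalloc(sizeof(*gops), GFP_KERNEL);
	if (!gops)
		return -ENOMEM;

	gops->entryfunc = &trace_graph_entry;
	gops->retfunc = &trace_graph_return;

	tr->gops = gops;
	gops->private = tr;
	return 0;
}

void free_fgraph_ops(struct trace_array *tr)
{
	kfree(tr->gops);
}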

Signed-off-by: Steven Rostedt (VMware) 
Signed-off-by: Masami Hiramatsu (Google) 
---
 Changes in v2:
  - Fix to remove set_graph_array() completely.
---
 include/linux/ftrace.h   |1 +
 kernel/trace/ftrace.c|1 +
 kernel/trace/trace.h |   13 ++-
 kernel/trace/trace_functions.c   |8 
 kernel/trace/trace_functions_graph.c |   65 +-
 kernel/trace/trace_selftest.c|4 +-
 6 files changed, 64 insertions(+), 28 deletions(-)

diff --git a/include/linux/ftrace.h b/include/linux/ftrace.h
index f9e51cbf9a97..ff70ee437209 100644
--- a/include/linux/ftrace.h
+++ b/include/linux/ftrace.h
@@ -1070,6 +1070,7 @@ extern int ftrace_graph_entry_stub(struct 
ftrace_graph_ent *trace, struct fgraph
 struct fgraph_ops {
trace_func_graph_ent_t  entryfunc;
trace_func_graph_ret_t  retfunc;
+   void*private;
int idx;
 };
 
diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c
index 4b0708106692..92abb9869198 100644
--- a/kernel/trace/ftrace.c
+++ b/kernel/trace/ftrace.c
@@ -7326,6 +7326,7 @@ __init void ftrace_init_global_array_ops(struct 
trace_array *tr)
	tr->ops = &global_ops;
tr->ops->private = tr;
ftrace_init_trace_array(tr);
+   init_array_fgraph_ops(tr);
 }
 
 void ftrace_init_array_ops(struct trace_array *tr, ftrace_func_t func)
diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h
index 55bb9a3bf322..114b120afd2a 100644
--- a/kernel/trace/trace.h
+++ b/kernel/trace/trace.h
@@ -396,6 +396,9 @@ struct trace_array {
struct ftrace_ops   *ops;
struct trace_pid_list   __rcu *function_pids;
struct trace_pid_list   __rcu *function_no_pids;
+#ifdef CONFIG_FUNCTION_GRAPH_TRACER
+   struct fgraph_ops   *gops;
+#endif
 #ifdef CONFIG_DYNAMIC_FTRACE
/* All of these are protected by the ftrace_lock */
struct list_headfunc_probes;
@@ -680,7 +683,6 @@ void print_trace_header(struct seq_file *m, struct 
trace_iterator *iter);
 
 void trace_graph_return(struct ftrace_graph_ret *trace, struct fgraph_ops 
*gops);
 int trace_graph_entry(struct ftrace_graph_ent *trace, struct fgraph_ops *gops);
-void set_graph_array(struct trace_array *tr);
 
 void tracing_start_cmdline_record(void);
 void tracing_stop_cmdline_record(void);
@@ -891,6 +893,9 @@ extern int __trace_graph_entry(struct trace_array *tr,
 extern void __trace_graph_return(struct trace_array *tr,
 struct ftrace_graph_ret *trace,
 unsigned int trace_ctx);
+extern void init_array_fgraph_ops(struct trace_array *tr);
+extern int allocate_fgraph_ops(struct trace_array *tr);
+extern void free_fgraph_ops(struct trace_array *tr);
 
 #ifdef CONFIG_DYNAMIC_FTRACE
 extern struct ftrace_hash __rcu *ftrace_graph_hash;
@@ -1003,6 +1008,12 @@ print_graph_function_flags(struct trace_iterator *iter, 
u32 flags)
 {
return TRACE_TYPE_UNHANDLED;
 }
+static inline void init_array_fgraph_ops(struct trace_array *tr) { }
+static inline int allocate_fgraph_ops(struct trace_array *tr)
+{
+   return 0;
+}
+static inline void free_fgraph_ops(struct trace_array *tr) { }
 #endif /* CONFIG_FUNCTION_GRAPH_TRACER */
 
 extern struct list_head ftrace_pids;
diff --git a/kernel/trace/trace_functions.c b/kernel/trace/trace_functions.c
index 9f1bfbe105e8..8e8da0d0ee52 100644
--- a/kernel/trace/trace_functions.c
+++ b/kernel/trace/trace_functions.c
@@ -80,6 +80,7 @@ void ftrace_free_ftrace_ops(struct trace_array *tr)
 int ftrace_create_function_files(struct trace_array *tr,
 struct dentry *parent)
 {
+   int ret;
/*
 * The top level array uses the "global_ops", and the files are
 * created on boot up.
@@ -90,6 +91,12 @@ int ftrace_create_function_files(struct trace_array *tr,
if (!tr->ops)
return -EINVAL;
 
+   ret = allocate_fgraph_ops(tr);
+   if (ret) {
+   kfree(tr->ops);
+   return ret;
+   }
+
ftrace_create_filter_files(tr->ops, parent);
 
return 0;
@@ -99,6 +106,7 @@ void ftrace_destroy_function_files(struct trace_array *tr)
 {
ftrace_destroy_filter_files(tr->ops);
ftrace_free_ftrace_ops(tr);
+   free_fgraph_ops(tr);
 }
 
 static ftrace_func_t select_trace_function(u32 flags_val)
diff --git a/kernel/trace/trace_functions_graph.c 
b/kernel/trace/trace_functions_graph.c
index b7b142b65299..9ccc904a7703 100644
--- a/kernel/trace/trace_functions_graph.c
+++ b/kernel/trace/trace_functions_graph.c

[PATCH v10 09/36] ftrace/function_graph: Pass fgraph_ops to function graph callbacks

2024-05-07 Thread Masami Hiramatsu (Google)
From: Steven Rostedt (VMware) 

Pass the fgraph_ops structure to the function graph callbacks. This will
allow callbacks to add a descriptor to a fgraph_ops private field that will
be added in the future and use it for the callbacks. This will be useful
when more than one callback can be registered to the function graph tracer.
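
A hypothetical callback using the new prototype could then find its own
context through the extra argument (my_entry and struct my_data are
illustrative names; the private field itself is added later in the series):

static int my_entry(struct ftrace_graph_ent *trace, struct fgraph_ops *gops)
{
	struct my_data *data = gops->private;	/* illustrative private data */

	/* Return non-zero to also get the return callback for this function. */
	return data != NULL;
}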

Signed-off-by: Steven Rostedt (VMware) 
Signed-off-by: Masami Hiramatsu (Google) 
---
 Changes in v2:
  - cleanup to set argument name on function prototype.
---
 include/linux/ftrace.h   |   10 +++---
 kernel/trace/fgraph.c|   16 +---
 kernel/trace/ftrace.c|6 --
 kernel/trace/trace.h |4 ++--
 kernel/trace/trace_functions_graph.c |   11 +++
 kernel/trace/trace_irqsoff.c |6 --
 kernel/trace/trace_sched_wakeup.c|6 --
 kernel/trace/trace_selftest.c|5 +++--
 8 files changed, 40 insertions(+), 24 deletions(-)

diff --git a/include/linux/ftrace.h b/include/linux/ftrace.h
index 0a1a7316de7b..f9e51cbf9a97 100644
--- a/include/linux/ftrace.h
+++ b/include/linux/ftrace.h
@@ -1055,11 +1055,15 @@ struct ftrace_graph_ret {
unsigned long long rettime;
 } __packed;
 
+struct fgraph_ops;
+
 /* Type of the callback handlers for tracing function graph*/
-typedef void (*trace_func_graph_ret_t)(struct ftrace_graph_ret *); /* return */
-typedef int (*trace_func_graph_ent_t)(struct ftrace_graph_ent *); /* entry */
+typedef void (*trace_func_graph_ret_t)(struct ftrace_graph_ret *,
+  struct fgraph_ops *); /* return */
+typedef int (*trace_func_graph_ent_t)(struct ftrace_graph_ent *,
+ struct fgraph_ops *); /* entry */
 
-extern int ftrace_graph_entry_stub(struct ftrace_graph_ent *trace);
+extern int ftrace_graph_entry_stub(struct ftrace_graph_ent *trace, struct 
fgraph_ops *gops);
 
 #ifdef CONFIG_FUNCTION_GRAPH_TRACER
 
diff --git a/kernel/trace/fgraph.c b/kernel/trace/fgraph.c
index c50edd1344e5..1fcbae5cc6d6 100644
--- a/kernel/trace/fgraph.c
+++ b/kernel/trace/fgraph.c
@@ -144,13 +144,13 @@ add_fgraph_index_bitmap(struct task_struct *t, int 
offset, unsigned long bitmap)
 }
 
 /* ftrace_graph_entry set to this to tell some archs to run function graph */
-static int entry_run(struct ftrace_graph_ent *trace)
+static int entry_run(struct ftrace_graph_ent *trace, struct fgraph_ops *ops)
 {
return 0;
 }
 
 /* ftrace_graph_return set to this to tell some archs to run function graph */
-static void return_run(struct ftrace_graph_ret *trace)
+static void return_run(struct ftrace_graph_ret *trace, struct fgraph_ops *ops)
 {
 }
 
@@ -211,12 +211,14 @@ int __weak ftrace_disable_ftrace_graph_caller(void)
 }
 #endif
 
-int ftrace_graph_entry_stub(struct ftrace_graph_ent *trace)
+int ftrace_graph_entry_stub(struct ftrace_graph_ent *trace,
+   struct fgraph_ops *gops)
 {
return 0;
 }
 
-static void ftrace_graph_ret_stub(struct ftrace_graph_ret *trace)
+static void ftrace_graph_ret_stub(struct ftrace_graph_ret *trace,
+ struct fgraph_ops *gops)
 {
 }
 
@@ -377,7 +379,7 @@ int function_graph_enter(unsigned long ret, unsigned long 
func,
		if (gops == &fgraph_stub)
continue;
 
-   if (gops->entryfunc(&trace))
+   if (gops->entryfunc(&trace, gops))
bitmap |= BIT(i);
}
 
@@ -525,7 +527,7 @@ static unsigned long __ftrace_return_to_handler(struct 
fgraph_ret_regs *ret_regs
		if (gops == &fgraph_stub)
continue;
 
-   gops->retfunc(&trace);
+   gops->retfunc(&trace, gops);
}
 
/*
@@ -679,7 +681,7 @@ void ftrace_graph_sleep_time_control(bool enable)
  * Simply points to ftrace_stub, but with the proper protocol.
  * Defined by the linker script in linux/vmlinux.lds.h
  */
-extern void ftrace_stub_graph(struct ftrace_graph_ret *);
+void ftrace_stub_graph(struct ftrace_graph_ret *trace, struct fgraph_ops 
*gops);
 
 /* The callbacks that hook a function */
 trace_func_graph_ret_t ftrace_graph_return = ftrace_stub_graph;
diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c
index fef833f63647..4b0708106692 100644
--- a/kernel/trace/ftrace.c
+++ b/kernel/trace/ftrace.c
@@ -815,7 +815,8 @@ void ftrace_graph_graph_time_control(bool enable)
fgraph_graph_time = enable;
 }
 
-static int profile_graph_entry(struct ftrace_graph_ent *trace)
+static int profile_graph_entry(struct ftrace_graph_ent *trace,
+  struct fgraph_ops *gops)
 {
struct ftrace_ret_stack *ret_stack;
 
@@ -832,7 +833,8 @@ static int profile_graph_entry(struct ftrace_graph_ent 
*trace)
return 1;
 }
 
-static void profile_graph_return(struct ftrace_graph_ret *trace)
+static void profile_graph_return(struct ftrace_graph_ret *trace,
+				 struct fgraph_ops *gops)

[PATCH v10 08/36] function_graph: Remove logic around ftrace_graph_entry and return

2024-05-07 Thread Masami Hiramatsu (Google)
From: Steven Rostedt (VMware) 

The function pointers ftrace_graph_entry and ftrace_graph_return are no
longer called via the function_graph tracer. Instead, an array structure is
now used that will allow for multiple users of the function_graph
infrastructure. The variables are still used by the architecture code for
non dynamic ftrace configs, where a test is made against them to see if
they point to the default stub function or not. This is how the static
function tracing knows to call into the function graph tracer
infrastructure or not.

Two new stub functions are made. entry_run() and return_run(). The
ftrace_graph_entry and ftrace_graph_return are set to them respectively
when the function graph tracer is enabled, and this will trigger the
architecture specific function graph code to be executed.

This also requires checking the global_ops hash for all calls into the
function_graph tracer.

Signed-off-by: Steven Rostedt (VMware) 
Signed-off-by: Masami Hiramatsu (Google) 
---
 Changes in v2:
  - Fix typo and make lines shorter than 76 chars in the description.
  - Remove unneeded return from return_run() function.
---
 kernel/trace/fgraph.c  |   67 +---
 kernel/trace/ftrace.c  |2 -
 kernel/trace/ftrace_internal.h |2 -
 3 files changed, 15 insertions(+), 56 deletions(-)

diff --git a/kernel/trace/fgraph.c b/kernel/trace/fgraph.c
index 6b06962657fe..c50edd1344e5 100644
--- a/kernel/trace/fgraph.c
+++ b/kernel/trace/fgraph.c
@@ -143,6 +143,17 @@ add_fgraph_index_bitmap(struct task_struct *t, int offset, 
unsigned long bitmap)
t->ret_stack[offset] |= (bitmap << FGRAPH_INDEX_SHIFT);
 }
 
+/* ftrace_graph_entry set to this to tell some archs to run function graph */
+static int entry_run(struct ftrace_graph_ent *trace)
+{
+   return 0;
+}
+
+/* ftrace_graph_return set to this to tell some archs to run function graph */
+static void return_run(struct ftrace_graph_ret *trace)
+{
+}
+
 /*
  * @offset: The offset into @t->ret_stack to find the ret_stack entry
  * @frame_offset: Where to place the offset into @t->ret_stack of that entry
@@ -673,7 +684,6 @@ extern void ftrace_stub_graph(struct ftrace_graph_ret *);
 /* The callbacks that hook a function */
 trace_func_graph_ret_t ftrace_graph_return = ftrace_stub_graph;
 trace_func_graph_ent_t ftrace_graph_entry = ftrace_graph_entry_stub;
-static trace_func_graph_ent_t __ftrace_graph_entry = ftrace_graph_entry_stub;
 
 /* Try to assign a return stack array on FTRACE_RETSTACK_ALLOC_SIZE tasks. */
 static int alloc_retstack_tasklist(unsigned long **ret_stack_list)
@@ -756,46 +766,6 @@ ftrace_graph_probe_sched_switch(void *ignore, bool preempt,
}
 }
 
-static int ftrace_graph_entry_test(struct ftrace_graph_ent *trace)
-{
-   if (!ftrace_ops_test(&global_ops, trace->func, NULL))
-   return 0;
-   return __ftrace_graph_entry(trace);
-}
-
-/*
- * The function graph tracer should only trace the functions defined
- * by set_ftrace_filter and set_ftrace_notrace. If another function
- * tracer ops is registered, the graph tracer requires testing the
- * function against the global ops, and not just trace any function
- * that any ftrace_ops registered.
- */
-void update_function_graph_func(void)
-{
-   struct ftrace_ops *op;
-   bool do_test = false;
-
-   /*
-* The graph and global ops share the same set of functions
-* to test. If any other ops is on the list, then
-* the graph tracing needs to test if its the function
-* it should call.
-*/
-   do_for_each_ftrace_op(op, ftrace_ops_list) {
-   if (op != &global_ops && op != &graph_ops &&
-   op != &ftrace_list_end) {
-   do_test = true;
-   /* in double loop, break out with goto */
-   goto out;
-   }
-   } while_for_each_ftrace_op(op);
- out:
-   if (do_test)
-   ftrace_graph_entry = ftrace_graph_entry_test;
-   else
-   ftrace_graph_entry = __ftrace_graph_entry;
-}
-
 static DEFINE_PER_CPU(unsigned long *, idle_ret_stack);
 
 static void
@@ -937,18 +907,12 @@ int register_ftrace_graph(struct fgraph_ops *gops)
ftrace_graph_active--;
goto out;
}
-
-   ftrace_graph_return = gops->retfunc;
-
/*
-* Update the indirect function to the entryfunc, and the
-* function that gets called to the entry_test first. Then
-* call the update fgraph entry function to determine if
-* the entryfunc should be called directly or not.
+* Some archs just test to see if these are not
+* the default function
 */
-   __ftrace_graph_entry = gops->entryfunc;
-   ftrace_graph_entry = ftrace_graph_entry_test;
- 

[PATCH v10 06/36] function_graph: Add an array structure that will allow multiple callbacks

2024-05-07 Thread Masami Hiramatsu (Google)
From: Steven Rostedt (VMware) 

Add an array structure that will eventually allow the function graph tracer
to have up to 16 simultaneous callbacks attached. It's an array of 16
fgraph_ops pointers, that is assigned when one is registered. On entry of a
function, the entryfunc of the first item in the array is called; if it
returns non-zero, the corresponding return callback will be called on exit
of the function.

The array will simplify the process of having more than one callback
attached to the same function, as its index into the array can be stored on
the shadow stack. We need to only save the index, because this will allow
the fgraph_ops to be freed before the function returns (which may happen if
the function calls schedule() and sleeps for a long time).
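
For orientation, a user of the interface as it stands at this point in the
series might look like the hypothetical sketch below (callbacks do not yet
take a fgraph_ops argument):

static int my_entry(struct ftrace_graph_ent *trace)
{
	return 1;	/* non-zero: also call the return callback */
}

static void my_return(struct ftrace_graph_ret *trace)
{
}

static struct fgraph_ops my_gops = {
	.entryfunc	= my_entry,
	.retfunc	= my_return,
};

static int __init my_tracer_init(void)
{
	return register_ftrace_graph(&my_gops);
}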

Signed-off-by: Steven Rostedt (VMware) 
Signed-off-by: Masami Hiramatsu (Google) 
---
 Changes in v2:
  - Remove unneeded brace.
---
 kernel/trace/fgraph.c |  114 +++--
 1 file changed, 81 insertions(+), 33 deletions(-)

diff --git a/kernel/trace/fgraph.c b/kernel/trace/fgraph.c
index 6f8d36370994..3f9dd213e7d8 100644
--- a/kernel/trace/fgraph.c
+++ b/kernel/trace/fgraph.c
@@ -39,6 +39,11 @@
 DEFINE_STATIC_KEY_FALSE(kill_ftrace_graph);
 int ftrace_graph_active;
 
+static int fgraph_array_cnt;
+#define FGRAPH_ARRAY_SIZE  16
+
+static struct fgraph_ops *fgraph_array[FGRAPH_ARRAY_SIZE];
+
 /* Both enabled by default (can be cleared by function_graph tracer flags */
 static bool fgraph_sleep_time = true;
 
@@ -62,6 +67,20 @@ int __weak ftrace_disable_ftrace_graph_caller(void)
 }
 #endif
 
+int ftrace_graph_entry_stub(struct ftrace_graph_ent *trace)
+{
+   return 0;
+}
+
+static void ftrace_graph_ret_stub(struct ftrace_graph_ret *trace)
+{
+}
+
+static struct fgraph_ops fgraph_stub = {
+   .entryfunc = ftrace_graph_entry_stub,
+   .retfunc = ftrace_graph_ret_stub,
+};
+
 /**
  * ftrace_graph_stop - set to permanently disable function graph tracing
  *
@@ -159,7 +178,7 @@ int function_graph_enter(unsigned long ret, unsigned long 
func,
goto out;
 
/* Only trace if the calling function expects to */
-   if (!ftrace_graph_entry(&trace))
+   if (!fgraph_array[0]->entryfunc(&trace))
goto out_ret;
 
return 0;
@@ -274,7 +293,7 @@ static unsigned long __ftrace_return_to_handler(struct 
fgraph_ret_regs *ret_regs
trace.retval = fgraph_ret_regs_return_value(ret_regs);
 #endif
trace.rettime = trace_clock_local();
-   ftrace_graph_return(&trace);
+   fgraph_array[0]->retfunc(&trace);
/*
 * The ftrace_graph_return() may still access the current
 * ret_stack structure, we need to make sure the update of
@@ -410,11 +429,6 @@ void ftrace_graph_sleep_time_control(bool enable)
fgraph_sleep_time = enable;
 }
 
-int ftrace_graph_entry_stub(struct ftrace_graph_ent *trace)
-{
-   return 0;
-}
-
 /*
  * Simply points to ftrace_stub, but with the proper protocol.
  * Defined by the linker script in linux/vmlinux.lds.h
@@ -652,37 +666,54 @@ static int start_graph_tracing(void)
 int register_ftrace_graph(struct fgraph_ops *gops)
 {
int ret = 0;
+   int i;
 
	mutex_lock(&ftrace_lock);
 
-   /* we currently allow only one tracer registered at a time */
-   if (ftrace_graph_active) {
+   if (!fgraph_array[0]) {
+   /* The array must always have real data on it */
+   for (i = 0; i < FGRAPH_ARRAY_SIZE; i++)
+   fgraph_array[i] = &fgraph_stub;
+   }
+
+   /* Look for an available spot */
+   for (i = 0; i < FGRAPH_ARRAY_SIZE; i++) {
+   if (fgraph_array[i] == &fgraph_stub)
+   break;
+   }
+   if (i >= FGRAPH_ARRAY_SIZE) {
ret = -EBUSY;
goto out;
}
 
-   register_pm_notifier(&ftrace_suspend_notifier);
+   fgraph_array[i] = gops;
+   if (i + 1 > fgraph_array_cnt)
+   fgraph_array_cnt = i + 1;
 
ftrace_graph_active++;
-   ret = start_graph_tracing();
-   if (ret) {
-   ftrace_graph_active--;
-   goto out;
-   }
 
-   ftrace_graph_return = gops->retfunc;
+   if (ftrace_graph_active == 1) {
+   register_pm_notifier(&ftrace_suspend_notifier);
+   ret = start_graph_tracing();
+   if (ret) {
+   ftrace_graph_active--;
+   goto out;
+   }
+
+   ftrace_graph_return = gops->retfunc;
 
-   /*
-* Update the indirect function to the entryfunc, and the
-* function that gets called to the entry_test first. Then
-* call the update fgraph entry function to determine if
-* the entryfunc should be called directly or not.
-*/
-   __ftrace_graph_entry = gops->entryfunc;
-   ftrace_graph_entry = ftrace_graph_entry_test;
-  

[PATCH v10 05/36] fgraph: Use BUILD_BUG_ON() to make sure we have structures divisible by long

2024-05-07 Thread Masami Hiramatsu (Google)
From: Steven Rostedt (VMware) 

Instead of using "ALIGN()", use BUILD_BUG_ON() as the structures should
always be divisible by sizeof(long).
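
To make the arithmetic concrete (the numbers are hypothetical and depend on
the architecture and config):

/*
 * If FGRAPH_RET_SIZE were 40 bytes and sizeof(long) 8, then
 * DIV_ROUND_UP(40, 8) == 5, so one ftrace_ret_stack spans 5 longs on the
 * shadow stack.  BUILD_BUG_ON(FGRAPH_RET_SIZE % sizeof(long)) makes the
 * build fail if the structure ever stops being a whole multiple of
 * sizeof(long), which is what the ALIGN() was papering over before.
 */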

Link: 
http://lkml.kernel.org/r/2019052444.gi2...@hirez.programming.kicks-ass.net

Suggested-by: Peter Zijlstra 
Signed-off-by: Steven Rostedt (VMware) 
Signed-off-by: Masami Hiramatsu (Google) 
---
 Changes in v7:
  - Use DIV_ROUND_UP() to calculate FGRAPH_RET_INDEX
---
 kernel/trace/fgraph.c |9 ++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/kernel/trace/fgraph.c b/kernel/trace/fgraph.c
index 30edeb6d4aa9..6f8d36370994 100644
--- a/kernel/trace/fgraph.c
+++ b/kernel/trace/fgraph.c
@@ -26,10 +26,9 @@
 #endif
 
 #define FGRAPH_RET_SIZE sizeof(struct ftrace_ret_stack)
-#define FGRAPH_RET_INDEX (ALIGN(FGRAPH_RET_SIZE, sizeof(long)) / sizeof(long))
+#define FGRAPH_RET_INDEX DIV_ROUND_UP(FGRAPH_RET_SIZE, sizeof(long))
 #define SHADOW_STACK_SIZE (PAGE_SIZE)
-#define SHADOW_STACK_INDEX \
-   (ALIGN(SHADOW_STACK_SIZE, sizeof(long)) / sizeof(long))
+#define SHADOW_STACK_INDEX (SHADOW_STACK_SIZE / sizeof(long))
 /* Leave on a buffer at the end */
 #define SHADOW_STACK_MAX_INDEX (SHADOW_STACK_INDEX - FGRAPH_RET_INDEX)
 
@@ -91,6 +90,8 @@ ftrace_push_return_trace(unsigned long ret, unsigned long 
func,
if (!current->ret_stack)
return -EBUSY;
 
+   BUILD_BUG_ON(SHADOW_STACK_SIZE % sizeof(long));
+
/*
 * We must make sure the ret_stack is tested before we read
 * anything else.
@@ -325,6 +326,8 @@ ftrace_graph_get_ret_stack(struct task_struct *task, int 
idx)
 {
int index = task->curr_ret_stack;
 
+   BUILD_BUG_ON(FGRAPH_RET_SIZE % sizeof(long));
+
index -= FGRAPH_RET_INDEX * (idx + 1);
if (index < 0)
return NULL;




[PATCH v10 02/36] tracing: Rename ftrace_regs_return_value to ftrace_regs_get_return_value

2024-05-07 Thread Masami Hiramatsu (Google)
From: Masami Hiramatsu (Google) 

Rename ftrace_regs_return_value to ftrace_regs_get_return_value, the same as
other ftrace_regs_get/set_* APIs.
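
A minimal, hypothetical exit-side handler using the renamed accessor (the
handler name and signature here are illustrative only):

static void my_exit_handler(unsigned long ip, struct ftrace_regs *fregs)
{
	unsigned long retval = ftrace_regs_get_return_value(fregs);

	pr_debug("%ps returned %lx\n", (void *)ip, retval);
}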

Signed-off-by: Masami Hiramatsu (Google) 
Acked-by: Mark Rutland 
---
 Changes in v6:
  - Moved to top of the series.
 Changes in v3:
  - Newly added.
---
 arch/loongarch/include/asm/ftrace.h |2 +-
 arch/powerpc/include/asm/ftrace.h   |2 +-
 arch/s390/include/asm/ftrace.h  |2 +-
 arch/x86/include/asm/ftrace.h   |2 +-
 include/linux/ftrace.h  |2 +-
 5 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/arch/loongarch/include/asm/ftrace.h 
b/arch/loongarch/include/asm/ftrace.h
index de891c2c83d4..b43acfc5776c 100644
--- a/arch/loongarch/include/asm/ftrace.h
+++ b/arch/loongarch/include/asm/ftrace.h
@@ -70,7 +70,7 @@ ftrace_regs_set_instruction_pointer(struct ftrace_regs 
*fregs, unsigned long ip)
regs_get_kernel_argument(&(fregs)->regs, n)
 #define ftrace_regs_get_stack_pointer(fregs) \
kernel_stack_pointer(&(fregs)->regs)
-#define ftrace_regs_return_value(fregs) \
+#define ftrace_regs_get_return_value(fregs) \
regs_return_value(&(fregs)->regs)
 #define ftrace_regs_set_return_value(fregs, ret) \
regs_set_return_value(&(fregs)->regs, ret)
diff --git a/arch/powerpc/include/asm/ftrace.h 
b/arch/powerpc/include/asm/ftrace.h
index 107fc5a48456..cfec6c5a47d0 100644
--- a/arch/powerpc/include/asm/ftrace.h
+++ b/arch/powerpc/include/asm/ftrace.h
@@ -61,7 +61,7 @@ ftrace_regs_get_instruction_pointer(struct ftrace_regs *fregs)
regs_get_kernel_argument(&(fregs)->regs, n)
 #define ftrace_regs_get_stack_pointer(fregs) \
kernel_stack_pointer(&(fregs)->regs)
-#define ftrace_regs_return_value(fregs) \
+#define ftrace_regs_get_return_value(fregs) \
regs_return_value(&(fregs)->regs)
 #define ftrace_regs_set_return_value(fregs, ret) \
regs_set_return_value(&(fregs)->regs, ret)
diff --git a/arch/s390/include/asm/ftrace.h b/arch/s390/include/asm/ftrace.h
index 621f23d5ae30..1912b598d1b8 100644
--- a/arch/s390/include/asm/ftrace.h
+++ b/arch/s390/include/asm/ftrace.h
@@ -88,7 +88,7 @@ ftrace_regs_set_instruction_pointer(struct ftrace_regs *fregs,
regs_get_kernel_argument(&(fregs)->regs, n)
 #define ftrace_regs_get_stack_pointer(fregs) \
kernel_stack_pointer(&(fregs)->regs)
-#define ftrace_regs_return_value(fregs) \
+#define ftrace_regs_get_return_value(fregs) \
regs_return_value(&(fregs)->regs)
 #define ftrace_regs_set_return_value(fregs, ret) \
regs_set_return_value(&(fregs)->regs, ret)
diff --git a/arch/x86/include/asm/ftrace.h b/arch/x86/include/asm/ftrace.h
index 897cf02c20b1..cf88cc8cc74d 100644
--- a/arch/x86/include/asm/ftrace.h
+++ b/arch/x86/include/asm/ftrace.h
@@ -58,7 +58,7 @@ arch_ftrace_get_regs(struct ftrace_regs *fregs)
regs_get_kernel_argument(&(fregs)->regs, n)
 #define ftrace_regs_get_stack_pointer(fregs) \
kernel_stack_pointer(&(fregs)->regs)
-#define ftrace_regs_return_value(fregs) \
+#define ftrace_regs_get_return_value(fregs) \
regs_return_value(&(fregs)->regs)
 #define ftrace_regs_set_return_value(fregs, ret) \
regs_set_return_value(&(fregs)->regs, ret)
diff --git a/include/linux/ftrace.h b/include/linux/ftrace.h
index b81f1afa82a1..d5df5f8fc35a 100644
--- a/include/linux/ftrace.h
+++ b/include/linux/ftrace.h
@@ -184,7 +184,7 @@ static __always_inline bool ftrace_regs_has_args(struct 
ftrace_regs *fregs)
regs_get_kernel_argument(ftrace_get_regs(fregs), n)
 #define ftrace_regs_get_stack_pointer(fregs) \
kernel_stack_pointer(ftrace_get_regs(fregs))
-#define ftrace_regs_return_value(fregs) \
+#define ftrace_regs_get_return_value(fregs) \
regs_return_value(ftrace_get_regs(fregs))
 #define ftrace_regs_set_return_value(fregs, ret) \
regs_set_return_value(ftrace_get_regs(fregs), ret)




[PATCH v10 03/36] x86: tracing: Add ftrace_regs definition in the header

2024-05-07 Thread Masami Hiramatsu (Google)
From: Masami Hiramatsu (Google) 

Add ftrace_regs definition for x86_64 in the ftrace header to
clarify which registers will be accessible from ftrace_regs.

Signed-off-by: Masami Hiramatsu (Google) 
---
 Changes in v3:
  - Add rip to be saved.
 Changes in v2:
  - Newly added.
---
 arch/x86/include/asm/ftrace.h |6 ++
 1 file changed, 6 insertions(+)

diff --git a/arch/x86/include/asm/ftrace.h b/arch/x86/include/asm/ftrace.h
index cf88cc8cc74d..c88bf47f46da 100644
--- a/arch/x86/include/asm/ftrace.h
+++ b/arch/x86/include/asm/ftrace.h
@@ -36,6 +36,12 @@ static inline unsigned long ftrace_call_adjust(unsigned long 
addr)
 
 #ifdef CONFIG_HAVE_DYNAMIC_FTRACE_WITH_ARGS
 struct ftrace_regs {
+   /*
+* On the x86_64, the ftrace_regs saves;
+* rax, rcx, rdx, rdi, rsi, r8, r9, rbp, rip and rsp.
+* Also orig_ax is used for passing direct trampoline address.
+* x86_32 doesn't support ftrace_regs.
+*/
struct pt_regs  regs;
 };
 




[PATCH v10 01/36] tracing: Add a comment about ftrace_regs definition

2024-05-07 Thread Masami Hiramatsu (Google)
From: Masami Hiramatsu (Google) 

To clarify what will be expected on ftrace_regs, add a comment to the
architecture independent definition of the ftrace_regs.
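
As a hypothetical illustration of the "access only via APIs" rule in the
comment below, a handler reads state through the accessors rather than
poking at the regs member directly:

static void my_entry_handler(struct ftrace_regs *fregs)
{
	unsigned long ip   = ftrace_regs_get_instruction_pointer(fregs);
	unsigned long arg0 = ftrace_regs_get_argument(fregs, 0);

	pr_debug("ip=%lx arg0=%lx\n", ip, arg0);
}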

Signed-off-by: Masami Hiramatsu (Google) 
Acked-by: Mark Rutland 
---
 Changes in v8:
  - Update that the saved registers depends on the context.
 Changes in v3:
  - Add instruction pointer
 Changes in v2:
  - newly added.
---
 include/linux/ftrace.h |   26 ++
 1 file changed, 26 insertions(+)

diff --git a/include/linux/ftrace.h b/include/linux/ftrace.h
index 54d53f345d14..b81f1afa82a1 100644
--- a/include/linux/ftrace.h
+++ b/include/linux/ftrace.h
@@ -118,6 +118,32 @@ extern int ftrace_enabled;
 
 #ifndef CONFIG_HAVE_DYNAMIC_FTRACE_WITH_ARGS
 
+/**
+ * ftrace_regs - ftrace partial/optimal register set
+ *
+ * ftrace_regs represents a group of registers which is used at the
+ * function entry and exit. There are three types of registers.
+ *
+ * - Registers for passing the parameters to callee, including the stack
+ *   pointer. (e.g. rcx, rdx, rdi, rsi, r8, r9 and rsp on x86_64)
+ * - Registers for passing the return values to caller.
+ *   (e.g. rax and rdx on x86_64)
+ * - Registers for hooking the function call and return including the
+ *   frame pointer (the frame pointer is architecture/config dependent)
+ *   (e.g. rip, rbp and rsp for x86_64)
+ *
+ * Also, architecture dependent fields can be used for internal processing.
+ * (e.g. orig_ax on x86_64)
+ *
+ * On the function entry, those registers will be restored except for
+ * the stack pointer, so that user can change the function parameters
+ * and instruction pointer (e.g. live patching.)
+ * On the function exit, only registers which are used for return values
+ * are restored.
+ *
+ * NOTE: user *must not* access regs directly, only do it via APIs, because
+ * the member can be changed according to the architecture.
+ */
 struct ftrace_regs {
struct pt_regs  regs;
 };




[PATCH v10 00/36] tracing: fprobe: function_graph: Multi-function graph and fprobe on fgraph

2024-05-07 Thread Masami Hiramatsu (Google)
Hi,

Here is the 10th version of the series to re-implement the fprobe on
function-graph tracer. The previous version is;

https://lore.kernel.org/all/171318533841.254850.15841395205784342850.stgit@devnote2/

This version is ported on the latest kernel (v6.9-rc6 + probes/for-next)
and fixed some bugs + performance optimizations.
 - [7/36] Fix terminology in comments and code. Use "offset" instead
  of "index" for shadow stack. This also update macros.
 - [18/36] Fix supported data size bug.
 - [29/36] Define bpf_kprobe_multi_pt_regs only if it is used.
 - [36/36] Add likely() to skip timestamp.

Overview

This series does major 2 changes, enable multiple function-graphs on
the ftrace (e.g. allow function-graph on sub instances) and rewrite the
fprobe on this function-graph.

The former changes had been sent from Steven Rostedt 4 years ago (*),
which allows users to set different setting function-graph tracer (and
other tracers based on function-graph) in each trace-instances at the
same time.

(*) https://lore.kernel.org/all/20190525031633.811342...@goodmis.org/

The purposes of the latter change are:

 1) Remove dependency of the rethook from fprobe so that we can reduce
   the return hook code and shadow stack.

 2) Make 'ftrace_regs' the common trace interface for the function
   boundary.

1) Currently we have 2 (or 3) different function return hook codes,
 the function-graph tracer and rethook (and legacy kretprobe).
 But since this is redundant and doubles the maintenance cost,
 I would like to unify those. From the user's viewpoint, function-
 graph tracer is very useful to grasp the execution path. For this
 purpose, it is hard to use the rethook in the function-graph
 tracer, but the opposite is possible. (Strictly speaking, kretprobe
 can not use it because it requires 'pt_regs' for historical reasons.)

2) Now the fprobe provides the 'pt_regs' for its handler, but that is
 wrong for the function entry and exit. Moreover, depending on the
 architecture, there is no way to accurately reproduce 'pt_regs'
 outside of interrupt or exception handlers. This means fprobe should
 not use 'pt_regs' because it does not use such exceptions.
 (Conversely, kprobe should use 'pt_regs' because it is an abstract
  interface of the software breakpoint exception.)

This series changes fprobe to use function-graph tracer for tracing
function entry and exit, instead of mixture of ftrace and rethook.
Unlike the rethook which is a per-task list of system-wide allocated
nodes, the function graph's ret_stack is a per-task shadow stack.
Thus it does not need to set 'nr_maxactive' (which is the number of
pre-allocated nodes).
Also the handlers will get the 'ftrace_regs' instead of 'pt_regs'.
Since eBPF multi_kprobe/multi_kretprobe events still use 'pt_regs' as
their register interface, this changes it to convert 'ftrace_regs' to
'pt_regs'. Of course this conversion makes an incomplete 'pt_regs',
so users must access only registers for function parameters or
return value. 

Design
--
Instead of using ftrace's function entry hook directly, the new fprobe
is built on top of the function-graph's entry and return callbacks
with 'ftrace_regs'.

Since the fprobe requires access to 'ftrace_regs', the architecture
must support CONFIG_HAVE_DYNAMIC_FTRACE_WITH_ARGS and
CONFIG_HAVE_FTRACE_GRAPH_FUNC, which enables to call function-graph
entry callback with 'ftrace_regs', and also
CONFIG_HAVE_FUNCTION_GRAPH_FREGS, which passes the ftrace_regs to
return_to_handler.

All fprobes share a single function-graph ops (means shares a common
ftrace filter) similar to the kprobe-on-ftrace. This needs another
layer to find corresponding fprobe in the common function-graph
callbacks, but has much better scalability, since the number of
registered function-graph ops is limited.

In the entry callback, the fprobe runs its entry_handler and saves the
address of 'fprobe' on the function-graph's shadow stack as data. The
return callback decodes the data to get the 'fprobe' address, and runs
the exit_handler.

The fprobe introduces two hash-tables, one is for entry callback which
searches fprobes related to the given function address passed by entry
callback. The other is for a return callback which checks if the given
'fprobe' data structure pointer is still valid. Note that it is
possible to unregister fprobe before the return callback runs. Thus
the address validation must be done before using it in the return
callback.
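
In rough pseudocode, the flow described above looks like the sketch below.
All helper names (fprobe_hash_lookup, fprobe_is_registered) and the handler
signatures are illustrative, not the actual implementation.

static int fprobe_fgraph_entry(struct ftrace_graph_ent *trace,
			       struct fgraph_ops *gops)
{
	struct fprobe *fp = fprobe_hash_lookup(trace->func);	/* entry hash table */
	struct fprobe **slot;

	if (!fp)
		return 0;

	slot = fgraph_reserve_data(gops->idx, sizeof(*slot));	/* shadow stack data */
	if (!slot)
		return 0;
	*slot = fp;						/* saved for the exit side */

	if (fp->entry_handler)
		fp->entry_handler(fp);				/* simplified signature */
	return 1;						/* ask for the return callback */
}

static void fprobe_fgraph_return(struct ftrace_graph_ret *trace,
				 struct fgraph_ops *gops)
{
	struct fprobe **slot = fgraph_retrieve_data(gops->idx, NULL);

	/* The second hash: check the fprobe was not unregistered meanwhile. */
	if (slot && fprobe_is_registered(*slot) && (*slot)->exit_handler)
		(*slot)->exit_handler(*slot);			/* simplified signature */
}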

This series can be applied against the probes/for-next branch, which
is based on v6.9-rc6.

This series can also be found below branch.

https://git.kernel.org/pub/scm/linux/kernel/git/mhiramat/linux.git/log/?h=topic/fprobe-on-fgraph

Thank you,

---

Masami Hiramatsu (Google) (21):
  tracing: Add a comment about ftrace_regs definition
  tracing: Rename ftrace_regs_return_value to ftrace_regs_get_return_value
  x86: tracing: Add ftrace_regs definition in the header
  function_gra

Re: [PATCH resend ftrace] Asynchronous grace period for register_ftrace_direct()

2024-05-01 Thread Google
On Wed, 1 May 2024 16:12:37 -0700
"Paul E. McKenney"  wrote:

> Note that the immediate pressure for this patch should be relieved by the
> NAPI patch series [1], but this sort of problem could easily arise again.
> 
> When running heavy test workloads with KASAN enabled, RCU Tasks grace
> periods can extend for many tens of seconds, significantly slowing
> trace registration.  Therefore, make the registration-side RCU Tasks
> grace period be asynchronous via call_rcu_tasks().
> 

Good catch! AFAICS, there is no reason to wait for synchronization
when adding a new direct trampoline.
This looks good to me.

Reviewed-by: Masami Hiramatsu (Google) 

Thank you,

> [1] https://lore.kernel.org/all/cover.1710877680.git@cloudflare.com/
> 
> Reported-by: Jakub Kicinski 
> Reported-by: Alexei Starovoitov 
> Reported-by: Chris Mason 
> Signed-off-by: Paul E. McKenney 
> Cc: Steven Rostedt 
> Cc: Masami Hiramatsu 
> Cc: Mark Rutland 
> Cc: Mathieu Desnoyers 
> Cc: 
> 
> diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c
> index 6c96b30f3d63b..32ea92934268c 100644
> --- a/kernel/trace/ftrace.c
> +++ b/kernel/trace/ftrace.c
> @@ -5365,6 +5365,13 @@ static void remove_direct_functions_hash(struct 
> ftrace_hash *hash, unsigned long
>   }
>  }
>  
> +static void register_ftrace_direct_cb(struct rcu_head *rhp)
> +{
> + struct ftrace_hash *fhp = container_of(rhp, struct ftrace_hash, rcu);
> +
> + free_ftrace_hash(fhp);
> +}
> +
>  /**
>   * register_ftrace_direct - Call a custom trampoline directly
>   * for multiple functions registered in @ops
> @@ -5463,10 +5470,8 @@ int register_ftrace_direct(struct ftrace_ops *ops, 
> unsigned long addr)
>   out_unlock:
>   mutex_unlock(&direct_mutex);
>  
> - if (free_hash && free_hash != EMPTY_HASH) {
> - synchronize_rcu_tasks();
> - free_ftrace_hash(free_hash);
> - }
> + if (free_hash && free_hash != EMPTY_HASH)
> + call_rcu_tasks(&free_hash->rcu, register_ftrace_direct_cb);
>  
>   if (new_hash)
>   free_ftrace_hash(new_hash);


-- 
Masami Hiramatsu (Google) 



Re: [v3] tracing/probes: Fix memory leak in traceprobe_parse_probe_arg_body()

2024-05-01 Thread Google
On Mon, 29 Apr 2024 15:55:09 +0200
Markus Elfring  wrote:

> …
> > > it jumps to the label 'out' instead of 'fail' by mistake. As a result,
> …
> >
> > Looks good to me.
> 
> * Do you care for a typo in this change description?
> 
> * Would you like to read any improved (patch) version descriptions (or 
> changelogs)?

Thanks, but those are nitpicks and I don't mind it.

Thank you,

> 
> Regards,
> Markus


-- 
Masami Hiramatsu (Google) 



Re: [PATCH] eventfs/tracing: Add callback for release of an eventfs_inode

2024-05-01 Thread Google
On Tue, 30 Apr 2024 14:23:27 -0400
Steven Rostedt  wrote:

> From: "Steven Rostedt (Google)" 
> 
> Synthetic events create and destroy tracefs files when they are created
> and removed. The tracing subsystem has its own file descriptor
> representing the state of the events attached to the tracefs files.
> There's a race between the eventfs files and this file descriptor of the
> tracing system where the following can cause an issue:
> 
> With two scripts 'A' and 'B' doing:
> 
>   Script 'A':
> echo "hello int aaa" > /sys/kernel/tracing/synthetic_events
> while :
> do
>   echo 0 > /sys/kernel/tracing/events/synthetic/hello/enable
> done
> 
>   Script 'B':
> echo > /sys/kernel/tracing/synthetic_events
> 
> Script 'A' creates a synthetic event "hello" and then just writes zero
> into its enable file.
> 
> Script 'B' removes all synthetic events (including the newly created
> "hello" event).
> 
> What happens is that the opening of the "enable" file has:
> 
>  {
>   struct trace_event_file *file = inode->i_private;
>   int ret;
> 
>   ret = tracing_check_open_get_tr(file->tr);
>  [..]
> 
> But deleting the events frees the "file" descriptor, and a "use after
> free" happens with the dereference at "file->tr".
> 
> The file descriptor does have a reference counter, but there needs to be a
> way to decrement it from the eventfs when the eventfs_inode is removed
> that represents this file descriptor.
> 
> Add an optional "release" callback to the eventfs_entry array structure,
> that gets called when the eventfs file is about to be removed. This allows
> for the creating on the eventfs file to increment the tracing file
> descriptor ref counter. When the eventfs file is deleted, it can call the
> release function that will call the put function for the tracing file
> descriptor.
> 
> This will protect the tracing file from being freed while a eventfs file
> that references it is being opened.
> 

Looks good to me.

Reviewed-by: Masami Hiramatsu (Google) 

Thank you,
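
For context, a user of the new hook pairs it with the existing callback in
the entry array; this is an illustrative sketch only (my_callback,
my_release and my_put_ref are hypothetical names):

static void my_release(const char *name, void *data)
{
	my_put_ref(data);	/* hypothetical: drop the ref taken in my_callback() */
}

static const struct eventfs_entry my_events_entries[] = {
	{
		.name		= "enable",
		.callback	= my_callback,
		.release	= my_release,
	},
};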

> Link: 
> https://lore.kernel.org/linux-trace-kernel/20240426073410.17154-1-tze-nan...@mediatek.com/
> 
> Cc: sta...@vger.kernel.org
> Fixes: 5790b1fb3d672 ("eventfs: Remove eventfs_file and just use 
> eventfs_inode")
> Reported-by: Tze-nan wu 
> Signed-off-by: Steven Rostedt (Google) 
> ---
>  fs/tracefs/event_inode.c|  7 +++
>  include/linux/tracefs.h |  3 +++
>  kernel/trace/trace_events.c | 12 
>  3 files changed, 22 insertions(+)
> 
> diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c
> index 894c6ca1e500..dc97c19f9e0a 100644
> --- a/fs/tracefs/event_inode.c
> +++ b/fs/tracefs/event_inode.c
> @@ -84,10 +84,17 @@ enum {
>  static void release_ei(struct kref *ref)
>  {
>   struct eventfs_inode *ei = container_of(ref, struct eventfs_inode, 
> kref);
> + const struct eventfs_entry *entry;
>   struct eventfs_root_inode *rei;
>  
>   WARN_ON_ONCE(!ei->is_freed);
>  
> + for (int i = 0; i < ei->nr_entries; i++) {
> + entry = &ei->entries[i];
> + if (entry->release)
> + entry->release(entry->name, ei->data);
> + }
> +
>   kfree(ei->entry_attrs);
>   kfree_const(ei->name);
>   if (ei->is_events) {
> diff --git a/include/linux/tracefs.h b/include/linux/tracefs.h
> index 7a5fe17b6bf9..d03f74658716 100644
> --- a/include/linux/tracefs.h
> +++ b/include/linux/tracefs.h
> @@ -62,6 +62,8 @@ struct eventfs_file;
>  typedef int (*eventfs_callback)(const char *name, umode_t *mode, void **data,
>   const struct file_operations **fops);
>  
> +typedef void (*eventfs_release)(const char *name, void *data);
> +
>  /**
>   * struct eventfs_entry - dynamically created eventfs file call back handler
>   * @name:Then name of the dynamic file in an eventfs directory
> @@ -72,6 +74,7 @@ typedef int (*eventfs_callback)(const char *name, umode_t 
> *mode, void **data,
>  struct eventfs_entry {
>   const char  *name;
>   eventfs_callbackcallback;
> + eventfs_release release;
>  };
>  
>  struct eventfs_inode;
> diff --git a/kernel/trace/trace_events.c b/kernel/trace/trace_events.c
> index 52f75c36bbca..6ef29eba90ce 100644
> --- a/kernel/trace/trace_events.c
> +++ b/kernel/trace/trace_events.c
> @@ -2552,6 +2552,14 @@ static int event_callback(const char *name, umode_t 
> *mode, void **data,
>   return 0;
>  }
>  
>

Re: [PATCH v9 00/36] tracing: fprobe: function_graph: Multi-function graph and fprobe on fgraph

2024-04-30 Thread Google
On Mon, 29 Apr 2024 13:25:04 -0700
Andrii Nakryiko  wrote:

> On Mon, Apr 29, 2024 at 6:51 AM Masami Hiramatsu  wrote:
> >
> > Hi Andrii,
> >
> > On Thu, 25 Apr 2024 13:31:53 -0700
> > Andrii Nakryiko  wrote:
> >
> > > Hey Masami,
> > >
> > > I can't really review most of that code as I'm completely unfamiliar
> > > with all those inner workings of fprobe/ftrace/function_graph. I left
> > > a few comments where there were somewhat more obvious BPF-related
> > > pieces.
> > >
> > > But I also did run our BPF benchmarks on probes/for-next as a baseline
> > > and then with your series applied on top. Just to see if there are any
> > > regressions. I think it will be a useful data point for you.
> >
> > Thanks for testing!
> >
> > >
> > > You should be already familiar with the bench tool we have in BPF
> > > selftests (I used it on some other patches for your tree).
> >
> > What patches we need?
> >
> 
> You mean for this `bench` tool? They are part of BPF selftests (under
> tools/testing/selftests/bpf), you can build them by running:
> 
> $ make RELEASE=1 -j$(nproc) bench
> 
> After that you'll get a self-container `bench` binary, which has all
> the self-contained benchmarks.
> 
> You might also find a small script (benchs/run_bench_trigger.sh inside
> BPF selftests directory) helpful, it collects final summary of the
> benchmark run and optionally accepts a specific set of benchmarks. So
> you can use it like this:
> 
> $ benchs/run_bench_trigger.sh kprobe kprobe-multi
> kprobe :   18.731 ± 0.639M/s
> kprobe-multi   :   23.938 ± 0.612M/s
> 
> By default it will run a wider set of benchmarks (no uprobes, but a
> bunch of extra fentry/fexit tests and stuff like this).

origin:
# benchs/run_bench_trigger.sh 
kretprobe :1.329 ± 0.007M/s 
kretprobe-multi:1.341 ± 0.004M/s 
# benchs/run_bench_trigger.sh 
kretprobe :1.288 ± 0.014M/s 
kretprobe-multi:1.365 ± 0.002M/s 
# benchs/run_bench_trigger.sh 
kretprobe :1.329 ± 0.002M/s 
kretprobe-multi:1.331 ± 0.011M/s 
# benchs/run_bench_trigger.sh 
kretprobe :1.311 ± 0.003M/s 
kretprobe-multi:1.318 ± 0.002M/s s

patched: 

# benchs/run_bench_trigger.sh
kretprobe :1.274 ± 0.003M/s 
kretprobe-multi:1.397 ± 0.002M/s 
# benchs/run_bench_trigger.sh
kretprobe :1.307 ± 0.002M/s 
kretprobe-multi:1.406 ± 0.004M/s 
# benchs/run_bench_trigger.sh
kretprobe :1.279 ± 0.004M/s 
kretprobe-multi:1.330 ± 0.014M/s 
# benchs/run_bench_trigger.sh
kretprobe :1.256 ± 0.010M/s 
kretprobe-multi:1.412 ± 0.003M/s 

Hmm, in my case the difference seems smaller (~3%?).
I attached perf report results for those, but I don't see large difference.

> > >
> > > BASELINE
> > > 
> > > kprobe :   24.634 ± 0.205M/s
> > > kprobe-multi   :   28.898 ± 0.531M/s
> > > kretprobe  :   10.478 ± 0.015M/s
> > > kretprobe-multi:   11.012 ± 0.063M/s
> > >
> > > THIS PATCH SET ON TOP
> > > =
> > > kprobe :   25.144 ± 0.027M/s (+2%)
> > > kprobe-multi   :   28.909 ± 0.074M/s
> > > kretprobe  :9.482 ± 0.008M/s (-9.5%)
> > > kretprobe-multi:   13.688 ± 0.027M/s (+24%)
> >
> > This looks good. Kretprobe should also use kretprobe-multi (fprobe)
> > eventually because it should be a single callback version of
> > kretprobe-multi.

I ran another benchmark (prctl loop, attached), the origin kernel result is 
here;

# sh ./benchmark.sh 
count = 1000, took 6.748133 sec

And the patched kernel result;

# sh ./benchmark.sh 
count = 1000, took 6.644095 sec

I confirmed that the perf result shows no big difference.

Thank you,


> >
> > >
> > > These numbers are pretty stable and look to be more or less 
> > > representative.
> > >
> > > As you can see, kprobes got a bit faster, kprobe-multi seems to be
> > > about the same, though.
> > >
> > > Then (I suppose they are "legacy") kretprobes got quite noticeably
> > > slower, almost by 10%. Not sure why, but looks real after re-running
> > > benchmarks a bunch of times and getting stable results.
> >
> > Hmm, kretprobe on x86 should use ftrace + rethook even with my series.
> > So nothing should be changed. Maybe cache access pattern has been
> > changed?
> > I'll check it with tracefs (to remove the effect from bpf related changes)
> >
> > >
> > > On the other hand, multi-kretprobes got significantly faster (+24%!).
> > > Again, I don't know if it is expected or not, but it's a nice
> >

Re: [PATCH v9 29/36] bpf: Enable kprobe_multi feature if CONFIG_FPROBE is enabled

2024-04-29 Thread Google
On Thu, 25 Apr 2024 13:09:32 -0700
Andrii Nakryiko  wrote:

> On Mon, Apr 15, 2024 at 6:22 AM Masami Hiramatsu (Google)
>  wrote:
> >
> > From: Masami Hiramatsu (Google) 
> >
> > Enable kprobe_multi feature if CONFIG_FPROBE is enabled. The pt_regs is
> > converted from ftrace_regs by ftrace_partial_regs(), thus some registers
> > may always returns 0. But it should be enough for function entry (access
> > arguments) and exit (access return value).
> >
> > Signed-off-by: Masami Hiramatsu (Google) 
> > Acked-by: Florent Revest 
> > ---
> >  Changes from previous series: NOTHING, Update against the new series.
> > ---
> >  kernel/trace/bpf_trace.c |   22 +-
> >  1 file changed, 9 insertions(+), 13 deletions(-)
> >
> > diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
> > index e51a6ef87167..57b1174030c9 100644
> > --- a/kernel/trace/bpf_trace.c
> > +++ b/kernel/trace/bpf_trace.c
> > @@ -2577,7 +2577,7 @@ static int __init bpf_event_init(void)
> >  fs_initcall(bpf_event_init);
> >  #endif /* CONFIG_MODULES */
> >
> > -#if defined(CONFIG_FPROBE) && defined(CONFIG_DYNAMIC_FTRACE_WITH_REGS)
> > +#ifdef CONFIG_FPROBE
> >  struct bpf_kprobe_multi_link {
> > struct bpf_link link;
> > struct fprobe fp;
> > @@ -2600,6 +2600,8 @@ struct user_syms {
> > char *buf;
> >  };
> >
> > +static DEFINE_PER_CPU(struct pt_regs, bpf_kprobe_multi_pt_regs);
> 
> this is a waste if CONFIG_HAVE_PT_REGS_TO_FTRACE_REGS_CAST=y, right?
> Can we guard it?

Good catch! Yes, we can guard it.

> 
> 
> > +
> >  static int copy_user_syms(struct user_syms *us, unsigned long __user 
> > *usyms, u32 cnt)
> >  {
> > unsigned long __user usymbol;
> > @@ -2792,13 +2794,14 @@ static u64 bpf_kprobe_multi_entry_ip(struct 
> > bpf_run_ctx *ctx)
> >
> >  static int
> >  kprobe_multi_link_prog_run(struct bpf_kprobe_multi_link *link,
> > -  unsigned long entry_ip, struct pt_regs *regs)
> > +  unsigned long entry_ip, struct ftrace_regs 
> > *fregs)
> >  {
> > struct bpf_kprobe_multi_run_ctx run_ctx = {
> > .link = link,
> > .entry_ip = entry_ip,
> > };
> > struct bpf_run_ctx *old_run_ctx;
> > +   struct pt_regs *regs;
> > int err;
> >
> > if (unlikely(__this_cpu_inc_return(bpf_prog_active) != 1)) {
> > @@ -2809,6 +2812,7 @@ kprobe_multi_link_prog_run(struct 
> > bpf_kprobe_multi_link *link,
> >
> > migrate_disable();
> > rcu_read_lock();
> > +   regs = ftrace_partial_regs(fregs, 
> > this_cpu_ptr(&bpf_kprobe_multi_pt_regs));
> 
> and then pass NULL if defined(CONFIG_HAVE_PT_REGS_TO_FTRACE_REGS_CAST)?

Indeed.

Thank you!
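
Something along these lines is what is being suggested here (a sketch only,
not the final patch):

/* Only keep the per-CPU buffer when ftrace_regs cannot simply be cast to
 * pt_regs; in the cast case ftrace_partial_regs() ignores the second arg. */
#ifndef CONFIG_HAVE_PT_REGS_TO_FTRACE_REGS_CAST
static DEFINE_PER_CPU(struct pt_regs, bpf_kprobe_multi_pt_regs);
#define bpf_kprobe_multi_pt_regs_ptr()	this_cpu_ptr(&bpf_kprobe_multi_pt_regs)
#else
#define bpf_kprobe_multi_pt_regs_ptr()	NULL
#endif

	/* ... and in kprobe_multi_link_prog_run(): */
	regs = ftrace_partial_regs(fregs, bpf_kprobe_multi_pt_regs_ptr());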

> 
> 
> > old_run_ctx = bpf_set_run_ctx(&run_ctx.run_ctx);
> > err = bpf_prog_run(link->link.prog, regs);
> > bpf_reset_run_ctx(old_run_ctx);
> > @@ -2826,13 +2830,9 @@ kprobe_multi_link_handler(struct fprobe *fp, 
> > unsigned long fentry_ip,
> >   void *data)
> >  {
> > struct bpf_kprobe_multi_link *link;
> > -   struct pt_regs *regs = ftrace_get_regs(fregs);
> > -
> > -   if (!regs)
> > -   return 0;
> >
> > link = container_of(fp, struct bpf_kprobe_multi_link, fp);
> > -   kprobe_multi_link_prog_run(link, get_entry_ip(fentry_ip), regs);
> > +   kprobe_multi_link_prog_run(link, get_entry_ip(fentry_ip), fregs);
> > return 0;
> >  }
> >
> > @@ -2842,13 +2842,9 @@ kprobe_multi_link_exit_handler(struct fprobe *fp, 
> > unsigned long fentry_ip,
> >void *data)
> >  {
> > struct bpf_kprobe_multi_link *link;
> > -   struct pt_regs *regs = ftrace_get_regs(fregs);
> > -
> > -   if (!regs)
> > -   return;
> >
> > link = container_of(fp, struct bpf_kprobe_multi_link, fp);
> > -   kprobe_multi_link_prog_run(link, get_entry_ip(fentry_ip), regs);
> > +   kprobe_multi_link_prog_run(link, get_entry_ip(fentry_ip), fregs);
> >  }
> >
> >  static int symbols_cmp_r(const void *a, const void *b, const void *priv)
> > @@ -3107,7 +3103,7 @@ int bpf_kprobe_multi_link_attach(const union bpf_attr 
> > *attr, struct bpf_prog *pr
> > kvfree(cookies);
> > return err;
> >  }
> > -#else /* !CONFIG_FPROBE || !CONFIG_DYNAMIC_FTRACE_WITH_REGS */
> > +#else /* !CONFIG_FPROBE */
> >  int bpf_kprobe_multi_link_attach(const union bpf_attr *attr, struct 
> > bpf_prog *prog)
> >  {
> > return -EOPNOTSUPP;
> >
> >


-- 
Masami Hiramatsu (Google) 



Re: [PATCH v9 36/36] fgraph: Skip recording calltime/rettime if it is not needed

2024-04-29 Thread Google
On Thu, 25 Apr 2024 13:15:08 -0700
Andrii Nakryiko  wrote:

> On Mon, Apr 15, 2024 at 6:25 AM Masami Hiramatsu (Google)
>  wrote:
> >
> > From: Masami Hiramatsu (Google) 
> >
> > Skip recording calltime and rettime if the fgraph_ops does not need it.
> > This is a kind of performance optimization for fprobe. Since the fprobe
> > user does not use these entries, recording timestamp in fgraph is just
> > a overhead (e.g. eBPF, ftrace). So introduce the skip_timestamp flag,
> > and all fgraph_ops sets this flag, skip recording calltime and rettime.
> >
> > Suggested-by: Jiri Olsa 
> > Signed-off-by: Masami Hiramatsu (Google) 
> > ---
> >  Changes in v9:
> >   - Newly added.
> > ---
> >  include/linux/ftrace.h |2 ++
> >  kernel/trace/fgraph.c  |   46 
> > +++---
> >  kernel/trace/fprobe.c  |1 +
> >  3 files changed, 42 insertions(+), 7 deletions(-)
> >
> > diff --git a/include/linux/ftrace.h b/include/linux/ftrace.h
> > index d845a80a3d56..06fc7cbef897 100644
> > --- a/include/linux/ftrace.h
> > +++ b/include/linux/ftrace.h
> > @@ -1156,6 +1156,8 @@ struct fgraph_ops {
> > struct ftrace_ops   ops; /* for the hash lists */
> > void*private;
> > int idx;
> > +   /* If skip_timestamp is true, this does not record timestamps. */
> > +   boolskip_timestamp;
> >  };
> >
> >  void *fgraph_reserve_data(int idx, int size_bytes);
> > diff --git a/kernel/trace/fgraph.c b/kernel/trace/fgraph.c
> > index 7556fbbae323..a5722537bb79 100644
> > --- a/kernel/trace/fgraph.c
> > +++ b/kernel/trace/fgraph.c
> > @@ -131,6 +131,7 @@ DEFINE_STATIC_KEY_FALSE(kill_ftrace_graph);
> >  int ftrace_graph_active;
> >
> >  static struct fgraph_ops *fgraph_array[FGRAPH_ARRAY_SIZE];
> > +static bool fgraph_skip_timestamp;
> >
> >  /* LRU index table for fgraph_array */
> >  static int fgraph_lru_table[FGRAPH_ARRAY_SIZE];
> > @@ -475,7 +476,7 @@ void ftrace_graph_stop(void)
> >  static int
> >  ftrace_push_return_trace(unsigned long ret, unsigned long func,
> >  unsigned long frame_pointer, unsigned long *retp,
> > -int fgraph_idx)
> > +int fgraph_idx, bool skip_ts)
> >  {
> > struct ftrace_ret_stack *ret_stack;
> > unsigned long long calltime;
> > @@ -498,8 +499,12 @@ ftrace_push_return_trace(unsigned long ret, unsigned 
> > long func,
> > ret_stack = get_ret_stack(current, current->curr_ret_stack, &index);
> > if (ret_stack && ret_stack->func == func &&
> > get_fgraph_type(current, index + FGRAPH_RET_INDEX) == 
> > FGRAPH_TYPE_BITMAP &&
> > -   !is_fgraph_index_set(current, index + FGRAPH_RET_INDEX, 
> > fgraph_idx))
> > +   !is_fgraph_index_set(current, index + FGRAPH_RET_INDEX, 
> > fgraph_idx)) {
> > +   /* If previous one skips calltime, update it. */
> > +   if (!skip_ts && !ret_stack->calltime)
> > +   ret_stack->calltime = trace_clock_local();
> > return index + FGRAPH_RET_INDEX;
> > +   }
> >
> > val = (FGRAPH_TYPE_RESERVED << FGRAPH_TYPE_SHIFT) | 
> > FGRAPH_RET_INDEX;
> >
> > @@ -517,7 +522,10 @@ ftrace_push_return_trace(unsigned long ret, unsigned 
> > long func,
> > return -EBUSY;
> > }
> >
> > -   calltime = trace_clock_local();
> > +   if (skip_ts)
> 
> would it be ok to add likely() here to keep the least-overhead code path 
> linear?

It's not "likely", but hmm, yes as you said. We can keep the least overhead.
OK, let me add likely. 
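
Something like this minimal sketch (an assumption about the final form, with
the hint on the skip branch, since that is the path the fprobe/BPF benchmark
exercises):

	if (likely(skip_ts))
		calltime = 0LL;
	else
		calltime = trace_clock_local();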

Thank you,

> 
> > +   calltime = 0LL;
> > +   else
> > +   calltime = trace_clock_local();
> >
> > index = READ_ONCE(current->curr_ret_stack);
> > ret_stack = RET_STACK(current, index);
> > @@ -601,7 +609,8 @@ int function_graph_enter_regs(unsigned long ret, 
> > unsigned long func,
> > trace.func = func;
> > trace.depth = ++current->curr_ret_depth;
> >
> > -   index = ftrace_push_return_trace(ret, func, frame_pointer, retp, 0);
> > +   index = ftrace_push_return_trace(ret, func, frame_pointer, retp, 0,
> > +fgraph_skip_timestamp);
> >   

Re: [PATCH v9 00/36] tracing: fprobe: function_graph: Multi-function graph and fprobe on fgraph

2024-04-29 Thread Google
Hi Andrii,

On Thu, 25 Apr 2024 13:31:53 -0700
Andrii Nakryiko  wrote:

> Hey Masami,
> 
> I can't really review most of that code as I'm completely unfamiliar
> with all those inner workings of fprobe/ftrace/function_graph. I left
> a few comments where there were somewhat more obvious BPF-related
> pieces.
> 
> But I also did run our BPF benchmarks on probes/for-next as a baseline
> and then with your series applied on top. Just to see if there are any
> regressions. I think it will be a useful data point for you.

Thanks for testing!

> 
> You should be already familiar with the bench tool we have in BPF
> selftests (I used it on some other patches for your tree).

Which patches do we need?

> 
> BASELINE
> 
> kprobe :   24.634 ± 0.205M/s
> kprobe-multi   :   28.898 ± 0.531M/s
> kretprobe  :   10.478 ± 0.015M/s
> kretprobe-multi:   11.012 ± 0.063M/s
> 
> THIS PATCH SET ON TOP
> =
> kprobe :   25.144 ± 0.027M/s (+2%)
> kprobe-multi   :   28.909 ± 0.074M/s
> kretprobe  :9.482 ± 0.008M/s (-9.5%)
> kretprobe-multi:   13.688 ± 0.027M/s (+24%)

This looks good. The legacy kretprobe should also move to the
kretprobe-multi (fprobe) backend eventually, because it is essentially a
single-callback version of kretprobe-multi.

> 
> These numbers are pretty stable and look to be more or less representative.
> 
> As you can see, kprobes got a bit faster, kprobe-multi seems to be
> about the same, though.
> 
> Then (I suppose they are "legacy") kretprobes got quite noticeably
> slower, almost by 10%. Not sure why, but looks real after re-running
> benchmarks a bunch of times and getting stable results.

Hmm, kretprobe on x86 should still use ftrace + rethook even with my series,
so nothing should have changed. Maybe the cache access pattern has changed?
I'll check it with tracefs (to remove the effect of the BPF-related changes).

> 
> On the other hand, multi-kretprobes got significantly faster (+24%!).
> Again, I don't know if it is expected or not, but it's a nice
> improvement.

Thanks!

> 
> If you have any idea why kretprobes would get so much slower, it would
> be nice to look into that and see if you can mitigate the regression
> somehow. Thanks!

OK, let me check it.

Thank you!

> 
> 
> >  51 files changed, 2325 insertions(+), 882 deletions(-)
> >  create mode 100644 
> > tools/testing/selftests/ftrace/test.d/dynevent/add_remove_fprobe_repeat.tc
> >
> > --
> > Masami Hiramatsu (Google) 
> >


-- 
Masami Hiramatsu (Google) 



Re: [PATCH v3] tracing/probes: Fix memory leak in traceprobe_parse_probe_arg_body()

2024-04-29 Thread Google
Hi LuMing,

On Sat, 27 Apr 2024 08:23:47 +0100
lumingyindet...@126.com wrote:

> From: LuMingYin 
> 
> If traceprobe_parse_probe_arg_body() failed to allocate 'parg->fmt',
> it jumps to the label 'out' instead of 'fail' by mistake. As a result,
> the buffer 'tmp' is not freed in this case and its memory leaks.
> 
> Thus jump to the label 'fail' in that error case.
> 

Looks good to me.

Acked-by: Masami Hiramatsu (Google) 


Thank you!

> Fixes: 032330abd08b ("tracing/probes: Cleanup probe argument parser")
> Signed-off-by: LuMingYin 
> ---
>  kernel/trace/trace_probe.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/kernel/trace/trace_probe.c b/kernel/trace/trace_probe.c
> index c09fa6fc636e..81c319b92038 100644
> --- a/kernel/trace/trace_probe.c
> +++ b/kernel/trace/trace_probe.c
> @@ -1466,7 +1466,7 @@ static int traceprobe_parse_probe_arg_body(const char 
> *argv, ssize_t *size,
>   parg->fmt = kmalloc(len, GFP_KERNEL);
>   if (!parg->fmt) {
>   ret = -ENOMEM;
> - goto out;
> + goto fail;
>   }
>   snprintf(parg->fmt, len, "%s[%d]", parg->type->fmttype,
>parg->count);
> -- 
> 2.25.1
> 
> 


-- 
Masami Hiramatsu (Google) 



Re: [PATCH] kernel/trace/trace_probe:Fixed memory leak issues in trace_probe.c.

2024-04-26 Thread Google
Hi LuMingYin,

Thanks for finding the problem! But please make a commit message
following Documentation/process/submitting-patches.rst


On Fri, 26 Apr 2024 10:13:43 +0100
lumingyindet...@126.com wrote:

> From: LuMingYin <11570291+yin-lum...@user.noreply.gitee.com>
> 
> At line 1408 of the file /linux/kernel/trace/trace_probe.c, pointer variables 
> named code and tmp are defined. At line 1437, a new dynamic memory area is 
> allocated using the function kcalloc. When the if statement at line 1467 
> evaluates to true, the program jumps to the out label at line 1469. Within 
> this function, there are two labels: out and fail. The difference between 
> these two labels is that fail additionally frees the dynamic memory area 
> pointed to by the variable code. Therefore, the program should jump to the 
> fail label instead of the out label. This commit fixes this bug.
> 

For example, you must break lines after about 70 characters. Also,
please don't use line numbers, because line numbers change easily
(function names are OK). Since this bug is a very clear mistake,
you can just explain it as follows.


 If traceprobe_parse_probe_arg_body() fails to allocate 'parg->fmt', it
 jumps to 'out' instead of 'fail' by mistake. As a result, the 'tmp'
 buffer is not freed in this case and its memory leaks.

 Fix it by jumping to 'fail' in that case.

The first paragraph explains what happens, and the second one explains
how to fix it.

Also, please add this Fixes tag.

Fixes: 032330abd08b ("tracing/probes: Cleanup probe argument parser")

You can easily find this commit number with git blame.

Thank you,

> Signed-off-by: LuMingYin <11570291+yin-lum...@user.noreply.gitee.com>
> ---
>  kernel/trace/trace_probe.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/kernel/trace/trace_probe.c b/kernel/trace/trace_probe.c
> index dfe3ee6035ec..42bc0f362226 100644
> --- a/kernel/trace/trace_probe.c
> +++ b/kernel/trace/trace_probe.c
> @@ -1466,7 +1466,7 @@ static int traceprobe_parse_probe_arg_body(const char 
> *argv, ssize_t *size,
>   parg->fmt = kmalloc(len, GFP_KERNEL);
>   if (!parg->fmt) {
>   ret = -ENOMEM;
> - goto out;
> + goto fail;
>   }
>   snprintf(parg->fmt, len, "%s[%d]", parg->type->fmttype,
>parg->count);
> -- 
> 2.25.1
> 


-- 
Masami Hiramatsu (Google) 



Re: [PATCH 0/2] Objpool performance improvements

2024-04-26 Thread Google
Hi Andrii,

On Wed, 24 Apr 2024 14:52:12 -0700
Andrii Nakryiko  wrote:

> Improve objpool (used heavily in kretprobe hot path) performance with two
> improvements:
>   - inlining performance critical objpool_push()/objpool_pop() operations;
>   - avoiding re-calculating relatively expensive nr_possible_cpus().

Thanks for optimizing objpool. Both looks good to me.

BTW, I don't intend to stop these short-term optimization attempts,
but I would like to ask you to check the new fgraph-based fprobe
(kretprobe-multi)[1] instead of objpool/rethook.

[1] 
https://lore.kernel.org/all/171318533841.254850.15841395205784342850.stgit@devnote2/

I'm considering obsoleting kretprobe (and rethook) in favor of fprobe
and eventually removing them. They have similar features and we should
choose the safer one.

Thank you,

> 
> These opportunities were found when benchmarking and profiling kprobes and
> kretprobes with BPF-based benchmarks. See individual patches for details and
> results.
> 
> Andrii Nakryiko (2):
>   objpool: enable inlining objpool_push() and objpool_pop() operations
>   objpool: cache nr_possible_cpus() and avoid caching nr_cpu_ids
> 
>  include/linux/objpool.h | 105 +++--
>  lib/objpool.c   | 112 +++-
>  2 files changed, 107 insertions(+), 110 deletions(-)
> 
> -- 
> 2.43.0
> 


-- 
Masami Hiramatsu (Google) 



Re: [PATCH v9 00/36] tracing: fprobe: function_graph: Multi-function graph and fprobe on fgraph

2024-04-25 Thread Google
On Wed, 24 Apr 2024 15:35:15 +0200
Florent Revest  wrote:

> Neat! :) I had a look at mostly the "high level" part (fprobe and
> arm64 specific bits) and this seems to be in a good state to me.
> 

Thanks for reviewing this long series!

> Thanks for all that work, that is quite a refactoring :)
> 
> On Mon, Apr 15, 2024 at 2:49 PM Masami Hiramatsu (Google)
>  wrote:
> >
> > Hi,
> >
> > Here is the 9th version of the series to re-implement the fprobe on
> > function-graph tracer. The previous version is;
> >
> > https://lore.kernel.org/all/170887410337.564249.6360118840946697039.stgit@devnote2/
> >
> > This version is ported on the latest kernel (v6.9-rc3 + probes/for-next)
> > and fixed some bugs + performance optimization patch[36/36].
> >  - [12/36] Fix to clear fgraph_array entry in registration failure, also
> >return -ENOSPC when fgraph_array is full.
> >  - [28/36] Add new store_fprobe_entry_data() for fprobe.
> >  - [31/36] Remove DIV_ROUND_UP() and fix entry data address calculation.
> >  - [36/36] Add new flag to skip timestamp recording.
> >
> > Overview
> > 
> > This series does major 2 changes, enable multiple function-graphs on
> > the ftrace (e.g. allow function-graph on sub instances) and rewrite the
> > fprobe on this function-graph.
> >
> > The former changes had been sent from Steven Rostedt 4 years ago (*),
> > which allows users to set different setting function-graph tracer (and
> > other tracers based on function-graph) in each trace-instances at the
> > same time.
> >
> > (*) https://lore.kernel.org/all/20190525031633.811342...@goodmis.org/
> >
> > The purpose of latter change are;
> >
> >  1) Remove dependency of the rethook from fprobe so that we can reduce
> >the return hook code and shadow stack.
> >
> >  2) Make 'ftrace_regs' the common trace interface for the function
> >boundary.
> >
> > 1) Currently we have 2(or 3) different function return hook codes,
> >  the function-graph tracer and rethook (and legacy kretprobe).
> >  But since this  is redundant and needs double maintenance cost,
> >  I would like to unify those. From the user's viewpoint, function-
> >  graph tracer is very useful to grasp the execution path. For this
> >  purpose, it is hard to use the rethook in the function-graph
> >  tracer, but the opposite is possible. (Strictly speaking, kretprobe
> >  can not use it because it requires 'pt_regs' for historical reasons.)
> >
> > 2) Now the fprobe provides the 'pt_regs' for its handler, but that is
> >  wrong for the function entry and exit. Moreover, depending on the
> >  architecture, there is no way to accurately reproduce 'pt_regs'
> >  outside of interrupt or exception handlers. This means fprobe should
> >  not use 'pt_regs' because it does not use such exceptions.
> >  (Conversely, kprobe should use 'pt_regs' because it is an abstract
> >   interface of the software breakpoint exception.)
> >
> > This series changes fprobe to use function-graph tracer for tracing
> > function entry and exit, instead of mixture of ftrace and rethook.
> > Unlike the rethook which is a per-task list of system-wide allocated
> > nodes, the function graph's ret_stack is a per-task shadow stack.
> > Thus it does not need to set 'nr_maxactive' (which is the number of
> > pre-allocated nodes).
> > Also the handlers will get the 'ftrace_regs' instead of 'pt_regs'.
> > Since eBPF mulit_kprobe/multi_kretprobe events still use 'pt_regs' as
> > their register interface, this changes it to convert 'ftrace_regs' to
> > 'pt_regs'. Of course this conversion makes an incomplete 'pt_regs',
> > so users must access only registers for function parameters or
> > return value.
> >
> > Design
> > --
> > Instead of using ftrace's function entry hook directly, the new fprobe
> > is built on top of the function-graph's entry and return callbacks
> > with 'ftrace_regs'.
> >
> > Since the fprobe requires access to 'ftrace_regs', the architecture
> > must support CONFIG_HAVE_DYNAMIC_FTRACE_WITH_ARGS and
> > CONFIG_HAVE_FTRACE_GRAPH_FUNC, which enables to call function-graph
> > entry callback with 'ftrace_regs', and also
> > CONFIG_HAVE_FUNCTION_GRAPH_FREGS, which passes the ftrace_regs to
> > return_to_handler.
> >
> > All fprobes share a single function-graph ops (means shares a common
> > ftrace filter) similar to the kprobe-on-ftrace. This needs another
> > layer to find corresponding fprobe in the common function-graph
> > ca

Re: [PATCH v9 01/36] tracing: Add a comment about ftrace_regs definition

2024-04-24 Thread Google
On Wed, 24 Apr 2024 15:19:24 +0200
Florent Revest  wrote:

> On Wed, Apr 24, 2024 at 2:23 PM Florent Revest  wrote:
> >
> > On Mon, Apr 15, 2024 at 2:49 PM Masami Hiramatsu (Google)
> >  wrote:
> > >
> > > From: Masami Hiramatsu (Google) 
> > >
> > > To clarify what will be expected on ftrace_regs, add a comment to the
> > > architecture independent definition of the ftrace_regs.
> > >
> > > Signed-off-by: Masami Hiramatsu (Google) 
> > > Acked-by: Mark Rutland 
> > > ---
> > >  Changes in v8:
> > >   - Update that the saved registers depends on the context.
> > >  Changes in v3:
> > >   - Add instruction pointer
> > >  Changes in v2:
> > >   - newly added.
> > > ---
> > >  include/linux/ftrace.h |   26 ++
> > >  1 file changed, 26 insertions(+)
> > >
> > > diff --git a/include/linux/ftrace.h b/include/linux/ftrace.h
> > > index 54d53f345d14..b81f1afa82a1 100644
> > > --- a/include/linux/ftrace.h
> > > +++ b/include/linux/ftrace.h
> > > @@ -118,6 +118,32 @@ extern int ftrace_enabled;
> > >
> > >  #ifndef CONFIG_HAVE_DYNAMIC_FTRACE_WITH_ARGS
> > >
> > > +/**
> > > + * ftrace_regs - ftrace partial/optimal register set
> > > + *
> > > + * ftrace_regs represents a group of registers which is used at the
> > > + * function entry and exit. There are three types of registers.
> > > + *
> > > + * - Registers for passing the parameters to callee, including the stack
> > > + *   pointer. (e.g. rcx, rdx, rdi, rsi, r8, r9 and rsp on x86_64)
> > > + * - Registers for passing the return values to caller.
> > > + *   (e.g. rax and rdx on x86_64)
> >
> > Ooc, have we ever considered skipping argument registers that are not
> > return value registers in the exit code paths ? For example, why would
> > we want to save rdi in a return handler ?
> >
> > But if we want to avoid the situation of having "sparse ftrace_regs"
> > all over again, we'd have to split ftrace_regs into a ftrace_args_regs
> > and a ftrace_ret_regs which would make this refactoring even more
> > painful, just to skip a few instructions. :|
> >
> > I don't necessarily think it's worth it, I just wanted to make sure
> > this was considered.
> 
> Ah, well, I just reached patch 22 and noticed that there you add add:
> 
> + * Basically, ftrace_regs stores the registers related to the context.
> + * On function entry, registers for function parameters and hooking the
> + * function call are stored, and on function exit, registers for function
> + * return value and frame pointers are stored.
> 
> So ftrace_regs can be a a sparse structure then. That's fair enough with me! 
> ;)

Yes, and in this patch, I explained that too :)

> + * On the function entry, those registers will be restored except for
> + * the stack pointer, so that user can change the function parameters
> + * and instruction pointer (e.g. live patching.)
> + * On the function exit, only registers which is used for return values
> > + * are restored.

So on function exit, ftrace_regs will be sparse.
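
For example, an exit callback should restrict itself to the return-value
accessor. A minimal sketch (assuming the exit_handler prototype proposed in
this series and the generic ftrace_regs_return_value() helper; this is not
code from the patches):

	static void sample_exit_handler(struct fprobe *fp, unsigned long ip,
					unsigned long ret_ip,
					struct ftrace_regs *fregs, void *data)
	{
		/* OK on exit: return-value registers are saved */
		unsigned long retval = ftrace_regs_return_value(fregs);

		pr_debug("ret=%lx\n", retval);
		/* Not OK on exit: parameter registers are not saved here */
	}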

Thank you,

-- 
Masami Hiramatsu (Google) 



Re: [PATCH 1/2] tracing/user_events: Fix non-spaced field matching

2024-04-22 Thread Google
On Mon, 22 Apr 2024 14:55:25 -0700
Beau Belgrave  wrote:

> On Sat, Apr 20, 2024 at 09:50:52PM +0900, Masami Hiramatsu wrote:
> > On Fri, 19 Apr 2024 14:13:34 -0700
> > Beau Belgrave  wrote:
> > 
> > > On Fri, Apr 19, 2024 at 11:33:05AM +0900, Masami Hiramatsu wrote:
> > > > On Tue, 16 Apr 2024 22:41:01 +
> > > > Beau Belgrave  wrote:
> 
> *SNIP*
> 
> > > > nit: This loop can be simpler, because we are sure fixed has enough 
> > > > length;
> > > > 
> > > > /* insert a space after ';' if there is no space. */
> > > > while(*args) {
> > > > *pos = *args++;
> > > > if (*pos++ == ';' && !isspace(*args))
> > > > *pos++ = ' ';
> > > > }
> > > > 
> > > 
> > > I was worried that if count_semis_no_space() ever had different logic
> > > (maybe after this commit) that it could cause an overflow if the count
> > > was wrong, etc.
> > > 
> > > I don't have an issue making it shorter, but I was trying to be more on
> > > the safe side, since this isn't a fast path (event register).
> > 
> > OK, anyway the current code looks correct. But note that I don't think
> > "pos++; len--;" is safer, since it is not atomic. This pattern
> > easily loses the "len--;" in my experience. So please use it carefully ;)
> > 
> 
> I'll stick with your loop. Perhaps others will chime in on the v2 and
> state a stronger opinion.
> 
> You scared me with the atomic comment, I went back and looked at all the
> paths for this. In the user_events IOCTL the buffer is copied from user
> to kernel, so it cannot change (and no other threads access it). I also
> checked trace_parse_run_command() which is the same. So at least in this
> context the non-atomic part is OK.

Oh, sorry if I scared you. I've seen bugs get introduced into loops like
this many times (while updating the code), so I try to keep it simple.
I'm sure that your code has no bugs.

Thank you,

-- 
Masami Hiramatsu (Google) 



Re: [PATCH v2] uprobes: reduce contention on uprobes_tree access

2024-04-22 Thread Google
 uprobe->pending_list */
> @@ -669,9 +669,9 @@ static struct uprobe *find_uprobe(struct inode *inode, 
> loff_t offset)
>  {
>   struct uprobe *uprobe;
>  
> - spin_lock(&uprobes_treelock);
> + read_lock(&uprobes_treelock);
>   uprobe = __find_uprobe(inode, offset);
> - spin_unlock(&uprobes_treelock);
> + read_unlock(&uprobes_treelock);
>  
>   return uprobe;
>  }
> @@ -701,9 +701,9 @@ static struct uprobe *insert_uprobe(struct uprobe *uprobe)
>  {
>   struct uprobe *u;
>  
> - spin_lock(&uprobes_treelock);
> + write_lock(&uprobes_treelock);
>   u = __insert_uprobe(uprobe);
> - spin_unlock(&uprobes_treelock);
> + write_unlock(&uprobes_treelock);
>  
>   return u;
>  }
> @@ -935,9 +935,9 @@ static void delete_uprobe(struct uprobe *uprobe)
>   if (WARN_ON(!uprobe_is_active(uprobe)))
>   return;
>  
> - spin_lock(&uprobes_treelock);
> + write_lock(&uprobes_treelock);
>   rb_erase(&uprobe->rb_node, &uprobes_tree);
> - spin_unlock(&uprobes_treelock);
> + write_unlock(&uprobes_treelock);
>   RB_CLEAR_NODE(&uprobe->rb_node); /* for uprobe_is_active() */
>   put_uprobe(uprobe);
>  }
> @@ -1298,7 +1298,7 @@ static void build_probe_list(struct inode *inode,
>   min = vaddr_to_offset(vma, start);
>   max = min + (end - start) - 1;
>  
> - spin_lock(&uprobes_treelock);
> + read_lock(&uprobes_treelock);
>   n = find_node_in_range(inode, min, max);
>   if (n) {
>   for (t = n; t; t = rb_prev(t)) {
> @@ -1316,7 +1316,7 @@ static void build_probe_list(struct inode *inode,
>   get_uprobe(u);
>   }
>   }
> - spin_unlock(&uprobes_treelock);
> + read_unlock(&uprobes_treelock);
>  }
>  
>  /* @vma contains reference counter, not the probed instruction. */
> @@ -1407,9 +1407,9 @@ vma_has_uprobes(struct vm_area_struct *vma, unsigned 
> long start, unsigned long e
>   min = vaddr_to_offset(vma, start);
>   max = min + (end - start) - 1;
>  
> - spin_lock(&uprobes_treelock);
> + read_lock(&uprobes_treelock);
>   n = find_node_in_range(inode, min, max);
> - spin_unlock(&uprobes_treelock);
> + read_unlock(&uprobes_treelock);
>  
>   return !!n;
>  }
> -- 
> 2.43.0
> 


-- 
Masami Hiramatsu (Google) 



Re: [PATCH v2] uprobes: reduce contention on uprobes_tree access

2024-04-22 Thread Google
+++---
> >  1 file changed, 11 insertions(+), 11 deletions(-)
> > 
> > diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
> > index e4834d23e1d1..8ae0eefc3a34 100644
> > --- a/kernel/events/uprobes.c
> > +++ b/kernel/events/uprobes.c
> > @@ -39,7 +39,7 @@ static struct rb_root uprobes_tree = RB_ROOT;
> >   */
> >  #define no_uprobe_events() RB_EMPTY_ROOT(_tree)
> >  
> > -static DEFINE_SPINLOCK(uprobes_treelock);  /* serialize rbtree access */
> > +static DEFINE_RWLOCK(uprobes_treelock);/* serialize rbtree access */
> >  
> >  #define UPROBES_HASH_SZ13
> >  /* serialize uprobe->pending_list */
> > @@ -669,9 +669,9 @@ static struct uprobe *find_uprobe(struct inode *inode, 
> > loff_t offset)
> >  {
> > struct uprobe *uprobe;
> >  
> > -   spin_lock(&uprobes_treelock);
> > +   read_lock(&uprobes_treelock);
> > uprobe = __find_uprobe(inode, offset);
> > -   spin_unlock(&uprobes_treelock);
> > +   read_unlock(&uprobes_treelock);
> >  
> > return uprobe;
> >  }
> > @@ -701,9 +701,9 @@ static struct uprobe *insert_uprobe(struct uprobe 
> > *uprobe)
> >  {
> > struct uprobe *u;
> >  
> > -   spin_lock(&uprobes_treelock);
> > +   write_lock(&uprobes_treelock);
> > u = __insert_uprobe(uprobe);
> > -   spin_unlock(&uprobes_treelock);
> > +   write_unlock(&uprobes_treelock);
> >  
> > return u;
> >  }
> > @@ -935,9 +935,9 @@ static void delete_uprobe(struct uprobe *uprobe)
> > if (WARN_ON(!uprobe_is_active(uprobe)))
> > return;
> >  
> > -   spin_lock(&uprobes_treelock);
> > +   write_lock(&uprobes_treelock);
> > rb_erase(&uprobe->rb_node, &uprobes_tree);
> > -   spin_unlock(&uprobes_treelock);
> > +   write_unlock(&uprobes_treelock);
> > RB_CLEAR_NODE(&uprobe->rb_node); /* for uprobe_is_active() */
> > put_uprobe(uprobe);
> >  }
> > @@ -1298,7 +1298,7 @@ static void build_probe_list(struct inode *inode,
> > min = vaddr_to_offset(vma, start);
> > max = min + (end - start) - 1;
> >  
> > -   spin_lock(&uprobes_treelock);
> > +   read_lock(&uprobes_treelock);
> > n = find_node_in_range(inode, min, max);
> > if (n) {
> > for (t = n; t; t = rb_prev(t)) {
> > @@ -1316,7 +1316,7 @@ static void build_probe_list(struct inode *inode,
> > get_uprobe(u);
> > }
> > }
> > -   spin_unlock(&uprobes_treelock);
> > +   read_unlock(&uprobes_treelock);
> >  }
> >  
> >  /* @vma contains reference counter, not the probed instruction. */
> > @@ -1407,9 +1407,9 @@ vma_has_uprobes(struct vm_area_struct *vma, unsigned 
> > long start, unsigned long e
> > min = vaddr_to_offset(vma, start);
> > max = min + (end - start) - 1;
> >  
> > -   spin_lock(&uprobes_treelock);
> > +   read_lock(&uprobes_treelock);
> > n = find_node_in_range(inode, min, max);
> > -   spin_unlock(&uprobes_treelock);
> > +   read_unlock(&uprobes_treelock);
> >  
> > return !!n;
> >  }
> > -- 
> > 2.43.0
> > 
> 


-- 
Masami Hiramatsu (Google) 



Re: [PATCHv3 bpf-next 0/7] uprobe: uretprobe speed up

2024-04-22 Thread Google
Hi Jiri,

On Sun, 21 Apr 2024 21:41:59 +0200
Jiri Olsa  wrote:

> hi,
> as part of the effort on speeding up the uprobes [0] coming with
> return uprobe optimization by using syscall instead of the trap
> on the uretprobe trampoline.
> 
> The speed up depends on instruction type that uprobe is installed
> and depends on specific HW type, please check patch 1 for details.
> 
> Patches 1-6 are based on bpf-next/master, but path 1 and 2 are
> apply-able on linux-trace.git tree probes/for-next branch.
> Patch 7 is based on man-pages master.

Thanks for the update! I reviewed the series and, except for the
man page, it looks good to me.

Reviewed-by: Masami Hiramatsu (Google) 

for the series.
If the Linux API maintainers are OK with it, I can pick this up in
probes/for-next. (BTW, who will pick up the man page patch?)

Thank you,

> 
> v3 changes:
>   - added source ip check if the uretprobe syscall is called from
> trampoline and sending SIGILL to process if it's not
>   - keep x86 compat process to use standard breakpoint
>   - split syscall wiring into separate change
>   - ran ltp and syzkaller locally, no issues found [Masami]
>   - building uprobe_compat binary in selftests which breaks
> CI atm because of missing 32-bit delve packages, I will
> need to fix that in separate changes once this is acked
>   - added man page change
>   - there were several changes so I removed acks [Oleg Andrii]
> 
> Also available at:
>   https://git.kernel.org/pub/scm/linux/kernel/git/jolsa/perf.git
>   uretprobe_syscall
> 
> thanks,
> jirka
> 
> 
> Notes to check list items in Documentation/process/adding-syscalls.rst:
> 
> - System Call Alternatives
> >   New syscall seems like the best way in here, because we need
>   just to quickly enter kernel with no extra arguments processing,
>   which we'd need to do if we decided to use another syscall.
> 
> - Designing the API: Planning for Extension
>   The uretprobe syscall is very specific and most likely won't be
>   extended in the future.
> 
>   At the moment it does not take any arguments and even if it does
>   in future, it's allowed to be called only from trampoline prepared
>   by kernel, so there'll be no broken user.
> 
> - Designing the API: Other Considerations
>   N/A because uretprobe syscall does not return reference to kernel
>   object.
> 
> - Proposing the API
> >   Wiring up of the uretprobe system call is in a separate change,
>   selftests and man page changes are part of the patchset.
> 
> - Generic System Call Implementation
>   There's no CONFIG option for the new functionality because it
>   keeps the same behaviour from the user POV.
> 
> - x86 System Call Implementation
>   It's 64-bit syscall only.
> 
> - Compatibility System Calls (Generic)
>   N/A uretprobe syscall has no arguments and is not supported
>   for compat processes.
> 
> - Compatibility System Calls (x86)
>   N/A uretprobe syscall is not supported for compat processes.
> 
> - System Calls Returning Elsewhere
>   N/A.
> 
> - Other Details
>   N/A.
> 
> - Testing
>   Adding new bpf selftests and ran ltp on top of this change.
> 
> - Man Page
>   Attached.
> 
> - Do not call System Calls in the Kernel
>   N/A.
> 
> 
> [0] https://lore.kernel.org/bpf/ZeCXHKJ--iYYbmLj@krava/
> ---
> Jiri Olsa (6):
>   uprobe: Wire up uretprobe system call
>   uprobe: Add uretprobe syscall to speed up return probe
>   selftests/bpf: Add uretprobe syscall test for regs integrity
>   selftests/bpf: Add uretprobe syscall test for regs changes
>   selftests/bpf: Add uretprobe syscall call from user space test
>   selftests/bpf: Add uretprobe compat test
> 
>  arch/x86/entry/syscalls/syscall_64.tbl|   1 +
>  arch/x86/kernel/uprobes.c | 115 
> ++
>  include/linux/syscalls.h  |   2 +
>  include/linux/uprobes.h   |   3 +
>  include/uapi/asm-generic/unistd.h |   5 +-
>  kernel/events/uprobes.c   |  24 +--
>  kernel/sys_ni.c   |   2 +
>  tools/include/linux/compiler.h|   4 ++
>  tools/testing/selftests/bpf/.gitignore|   1 +
>  tools/testing/selftests/bpf/Makefile  |   6 +-
>  tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c | 123 
> +++-
>  tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c   | 362 
> +
>  tools/testing/selftests/bpf/progs/u

Re: [PATCH 7/7] man2: Add uretprobe syscall page

2024-04-22 Thread Google
On Sun, 21 Apr 2024 21:42:06 +0200
Jiri Olsa  wrote:

> Adding man page for new uretprobe syscall.
> 
> Signed-off-by: Jiri Olsa 
> ---
>  man2/uretprobe.2 | 40 
>  1 file changed, 40 insertions(+)
>  create mode 100644 man2/uretprobe.2
> 
> diff --git a/man2/uretprobe.2 b/man2/uretprobe.2
> new file mode 100644
> index ..c0343a88bb57
> --- /dev/null
> +++ b/man2/uretprobe.2
> @@ -0,0 +1,40 @@
> +.\" Copyright (C) 2024, Jiri Olsa 
> +.\"
> +.\" SPDX-License-Identifier: Linux-man-pages-copyleft
> +.\"
> +.TH uretprobe 2 (date) "Linux man-pages (unreleased)"
> +.SH NAME
> +uretprobe \- execute pending return uprobes
> +.SH SYNOPSIS
> +.nf
> +.B int uretprobe(void)
> +.fi
> +.SH DESCRIPTION
> +On x86_64 architecture the kernel is using uretprobe syscall to trigger
> +uprobe return probe consumers instead of using standard breakpoint 
> instruction.
> +The reason is that it's much faster to do syscall than breakpoint trap
> +on x86_64 architecture.

Should we specify the supported architecture like this? Currently it is
supported only on x86-64, but it could be extended later, right?

This should just be noted in NOTES. Something like: "This syscall is initially
introduced on x86-64 because a syscall is faster than a breakpoint trap there,
but it may be extended to other architectures where a syscall is faster than
a breakpoint trap."

Thank you,

> +
> +The uretprobe syscall is not supposed to be called directly by user, it's 
> allowed
> +to be invoked only through user space trampoline provided by kernel.
> +When called from outside of this trampoline, the calling process will receive
> +.BR SIGILL .
> +
> +.SH RETURN VALUE
> +.BR uretprobe()
> +return value is specific for given architecture.
> +
> +.SH VERSIONS
> +This syscall is not specified in POSIX,
> +and details of its behavior vary across systems.
> +.SH STANDARDS
> +None.
> +.SH NOTES
> +.BR uretprobe()
> +exists only to allow the invocation of return uprobe consumers.
> +It should
> +.B never
> +be called directly.
> +Details of the arguments (if any) passed to
> +.BR uretprobe ()
> +and the return value are specific for given architecture.
> -- 
> 2.44.0
> 


-- 
Masami Hiramatsu (Google) 



Re: [PATCH v5 14/15] kprobes: remove dependency on CONFIG_MODULES

2024-04-22 Thread Google
On Mon, 22 Apr 2024 12:44:35 +0300
Mike Rapoport  wrote:

> From: "Mike Rapoport (IBM)" 
> 
> kprobes depended on CONFIG_MODULES because it has to allocate memory for
> code.
> 
> Since code allocations are now implemented with execmem, kprobes can be
> enabled in non-modular kernels.
> 
> Add #ifdef CONFIG_MODULE guards for the code dealing with kprobes inside
> modules, make CONFIG_KPROBES select CONFIG_EXECMEM and drop the
> dependency of CONFIG_KPROBES on CONFIG_MODULES.

Looks good to me.

Acked-by: Masami Hiramatsu (Google) 

Thank you!

> 
> Signed-off-by: Mike Rapoport (IBM) 
> ---
>  arch/Kconfig|  2 +-
>  include/linux/module.h  |  9 ++
>  kernel/kprobes.c| 55 +++--
>  kernel/trace/trace_kprobe.c | 20 +-
>  4 files changed, 63 insertions(+), 23 deletions(-)
> 
> diff --git a/arch/Kconfig b/arch/Kconfig
> index 7006f71f0110..a48ce6a488b3 100644
> --- a/arch/Kconfig
> +++ b/arch/Kconfig
> @@ -52,9 +52,9 @@ config GENERIC_ENTRY
>  
>  config KPROBES
>   bool "Kprobes"
> - depends on MODULES
>   depends on HAVE_KPROBES
>   select KALLSYMS
> + select EXECMEM
>   select TASKS_RCU if PREEMPTION
>   help
> Kprobes allows you to trap at almost any kernel address and
> diff --git a/include/linux/module.h b/include/linux/module.h
> index 1153b0d99a80..ffa1c603163c 100644
> --- a/include/linux/module.h
> +++ b/include/linux/module.h
> @@ -605,6 +605,11 @@ static inline bool module_is_live(struct module *mod)
>   return mod->state != MODULE_STATE_GOING;
>  }
>  
> +static inline bool module_is_coming(struct module *mod)
> +{
> +return mod->state == MODULE_STATE_COMING;
> +}
> +
>  struct module *__module_text_address(unsigned long addr);
>  struct module *__module_address(unsigned long addr);
>  bool is_module_address(unsigned long addr);
> @@ -857,6 +862,10 @@ void *dereference_module_function_descriptor(struct 
> module *mod, void *ptr)
>   return ptr;
>  }
>  
> +static inline bool module_is_coming(struct module *mod)
> +{
> + return false;
> +}
>  #endif /* CONFIG_MODULES */
>  
>  #ifdef CONFIG_SYSFS
> diff --git a/kernel/kprobes.c b/kernel/kprobes.c
> index ddd7cdc16edf..ca2c6cbd42d2 100644
> --- a/kernel/kprobes.c
> +++ b/kernel/kprobes.c
> @@ -1588,7 +1588,7 @@ static int check_kprobe_address_safe(struct kprobe *p,
>   }
>  
>   /* Get module refcount and reject __init functions for loaded modules. 
> */
> - if (*probed_mod) {
> + if (IS_ENABLED(CONFIG_MODULES) && *probed_mod) {
>   /*
>* We must hold a refcount of the probed module while updating
>* its code to prohibit unexpected unloading.
> @@ -1603,12 +1603,13 @@ static int check_kprobe_address_safe(struct kprobe *p,
>* kprobes in there.
>*/
>   if (within_module_init((unsigned long)p->addr, *probed_mod) &&
> - (*probed_mod)->state != MODULE_STATE_COMING) {
> + !module_is_coming(*probed_mod)) {
>   module_put(*probed_mod);
>   *probed_mod = NULL;
>   ret = -ENOENT;
>   }
>   }
> +
>  out:
>   preempt_enable();
>   jump_label_unlock();
> @@ -2488,24 +2489,6 @@ int kprobe_add_area_blacklist(unsigned long start, 
> unsigned long end)
>   return 0;
>  }
>  
> -/* Remove all symbols in given area from kprobe blacklist */
> -static void kprobe_remove_area_blacklist(unsigned long start, unsigned long 
> end)
> -{
> - struct kprobe_blacklist_entry *ent, *n;
> -
> - list_for_each_entry_safe(ent, n, &kprobe_blacklist, list) {
> - if (ent->start_addr < start || ent->start_addr >= end)
> - continue;
> - list_del(&ent->list);
> - kfree(ent);
> - }
> -}
> -
> -static void kprobe_remove_ksym_blacklist(unsigned long entry)
> -{
> - kprobe_remove_area_blacklist(entry, entry + 1);
> -}
> -
>  int __weak arch_kprobe_get_kallsym(unsigned int *symnum, unsigned long 
> *value,
>  char *type, char *sym)
>  {
> @@ -2570,6 +2553,25 @@ static int __init populate_kprobe_blacklist(unsigned 
> long *start,
>   return ret ? : arch_populate_kprobe_blacklist();
>  }
>  
> +#ifdef CONFIG_MODULES
> +/* Remove all symbols in given area from kprobe blacklist */
> +static void kprobe_remove_area_blacklist(unsigned long start, unsigned long 
> end)
> +{
> + struct kprobe_blacklist_e

Re: [PATCH v4 2/2] rethook: honor CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING in rethook_try_get()

2024-04-20 Thread Google
On Fri, 19 Apr 2024 10:59:09 -0700
Andrii Nakryiko  wrote:

> On Thu, Apr 18, 2024 at 6:00 PM Masami Hiramatsu  wrote:
> >
> > On Thu, 18 Apr 2024 12:09:09 -0700
> > Andrii Nakryiko  wrote:
> >
> > > Take into account CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING when validating
> > > that RCU is watching when trying to set up a rethook on a function entry.
> > >
> > > One notable exception when we force rcu_is_watching() check is
> > > CONFIG_KPROBE_EVENTS_ON_NOTRACE=y case, in which case kretprobes will use
> > > old-style int3-based workflow instead of relying on ftrace, making RCU
> > > watching check important to validate.
> > >
> > > This further (in addition to improvements in the previous patch)
> > > improves BPF multi-kretprobe (which rely on rethook) runtime throughput
> > > by 2.3%, according to BPF benchmarks ([0]).
> > >
> > >   [0] 
> > > https://lore.kernel.org/bpf/caef4bzauq2wkmjzdc9s0rbwa01bybgwhn6andxqshyia47p...@mail.gmail.com/
> > >
> > > Signed-off-by: Andrii Nakryiko 
> >
> >
> > Thanks for the update! This looks good to me.
> 
> Thanks, Masami! Will you take it through your tree, or you'd like to
> route it through bpf-next?

OK, let me take it through linux-trace tree.

Thank you!

> 
> >
> > Acked-by: Masami Hiramatsu (Google) 
> >
> > Thanks,
> >
> > > ---
> > >  kernel/trace/rethook.c | 2 ++
> > >  1 file changed, 2 insertions(+)
> > >
> > > diff --git a/kernel/trace/rethook.c b/kernel/trace/rethook.c
> > > index fa03094e9e69..a974605ad7a5 100644
> > > --- a/kernel/trace/rethook.c
> > > +++ b/kernel/trace/rethook.c
> > > @@ -166,6 +166,7 @@ struct rethook_node *rethook_try_get(struct rethook 
> > > *rh)
> > >   if (unlikely(!handler))
> > >   return NULL;
> > >
> > > +#if defined(CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING) || 
> > > defined(CONFIG_KPROBE_EVENTS_ON_NOTRACE)
> > >   /*
> > >* This expects the caller will set up a rethook on a function 
> > > entry.
> > >* When the function returns, the rethook will eventually be 
> > > reclaimed
> > > @@ -174,6 +175,7 @@ struct rethook_node *rethook_try_get(struct rethook 
> > > *rh)
> > >*/
> > >   if (unlikely(!rcu_is_watching()))
> > >   return NULL;
> > > +#endif
> > >
> > >   return (struct rethook_node *)objpool_pop(&rh->pool);
> > >  }
> > > --
> > > 2.43.0
> > >
> >
> >
> > --
> > Masami Hiramatsu (Google) 


-- 
Masami Hiramatsu (Google) 



Re: [PATCH 1/2] tracing/user_events: Fix non-spaced field matching

2024-04-20 Thread Google
a fast path (event register).

OK, anyway the current code looks correct. But note that I don't think
"pos++; len--;" is safer, since it is not atomic. This pattern
easily loses the "len--;" in my experience. So please use it carefully ;)

> 
> > > +
> > > + /*
> > > +  * len is the length of the copy excluding the null.
> > > +  * This ensures we always have room for a null.
> > > +  */
> > > + *pos = '\0';
> > > +
> > > + return fixed;
> > > +}
> > > +
> > > +static char **user_event_argv_split(char *args, int *argc)
> > > +{
> > > + /* Count how many ';' without a trailing space */
> > > + int count = count_semis_no_space(args);
> > > +
> > > + if (count) {
> > 
> > nit: it is better to exit fast, so 
> > 
> > if (!count)
> > return argv_split(GFP_KERNEL, args, argc);
> > 
> > ...
> 
> Sure, will fix in a v2.
> 
> > 
> > Thank you,
> > 
> > OT: BTW, can this also simplify synthetic events?
> > 
> 
> I'm not sure, I'll check when I have some time. I want to get this fix
> in sooner rather than later.

Ah, never mind. Synthetic events parse the fields with strsep(';') first
and then argv_split(), so they do not have this issue.
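
For illustration only, the two-step split looks roughly like this (a sketch
based on the description above, not copied from trace_events_synth.c):

	char *field, *rest = args;
	char **argv;
	int argc;

	/* ';' is consumed here, so "a;b" and "a; b" yield the same fields */
	while ((field = strsep(&rest, ";")) != NULL) {
		argv = argv_split(GFP_KERNEL, field, &argc);
		if (!argv)
			break;
		/* each field is now split into "type name" tokens */
		argv_free(argv);
	}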

Thank you,

> 
> Thanks,
> -Beau
> 
> > > + /* We must fixup 'field;field' to 'field; field' */
> > > + char *fixed = fix_semis_no_space(args, count);
> > > + char **split;
> > > +
> > > + if (!fixed)
> > > + return NULL;
> > > +
> > > + /* We do a normal split afterwards */
> > > + split = argv_split(GFP_KERNEL, fixed, argc);
> > > +
> > > + /* We can free since argv_split makes a copy */
> > > + kfree(fixed);
> > > +
> > > + return split;
> > > + }
> > > +
> > > + /* No fixup is required */
> > > + return argv_split(GFP_KERNEL, args, argc);
> > > +}
> > > +
> > >  /*
> > >   * Parses the event name, arguments and flags then registers if 
> > > successful.
> > >   * The name buffer lifetime is owned by this method for success cases 
> > > only.
> > > @@ -2012,7 +2098,7 @@ static int user_event_parse(struct user_event_group 
> > > *group, char *name,
> > >   return -EPERM;
> > >  
> > >   if (args) {
> - argv = argv_split(GFP_KERNEL, args, &argc);
> + argv = user_event_argv_split(args, &argc);
> > >  
> > >   if (!argv)
> > >   return -ENOMEM;
> > > -- 
> > > 2.34.1
> > > 
> > 
> > 
> > -- 
> > Masami Hiramatsu (Google) 


-- 
Masami Hiramatsu (Google) 



Re: [PATCH v4 14/15] kprobes: remove dependency on CONFIG_MODULES

2024-04-20 Thread Google
On Sat, 20 Apr 2024 10:33:38 +0300
Mike Rapoport  wrote:

> On Fri, Apr 19, 2024 at 03:59:40PM +, Christophe Leroy wrote:
> > 
> > 
> > Le 19/04/2024 à 17:49, Mike Rapoport a écrit :
> > > Hi Masami,
> > > 
> > > On Thu, Apr 18, 2024 at 06:16:15AM +0900, Masami Hiramatsu wrote:
> > >> Hi Mike,
> > >>
> > >> On Thu, 11 Apr 2024 19:00:50 +0300
> > >> Mike Rapoport  wrote:
> > >>
> > >>> From: "Mike Rapoport (IBM)" 
> > >>>
> > >>> kprobes depended on CONFIG_MODULES because it has to allocate memory for
> > >>> code.
> > >>>
> > >>> Since code allocations are now implemented with execmem, kprobes can be
> > >>> enabled in non-modular kernels.
> > >>>
> > >>> Add #ifdef CONFIG_MODULE guards for the code dealing with kprobes inside
> > >>> modules, make CONFIG_KPROBES select CONFIG_EXECMEM and drop the
> > >>> dependency of CONFIG_KPROBES on CONFIG_MODULES.
> > >>
> > >> Thanks for this work, but this conflicts with the latest fix in v6.9-rc4.
> > >> Also, can you use IS_ENABLED(CONFIG_MODULES) instead of #ifdefs in
> > >> function body? We have enough dummy functions for that, so it should
> > >> not make a problem.
> > > 
> > > The code in check_kprobe_address_safe() that gets the module and checks 
> > > for
> > > __init functions does not compile with IS_ENABLED(CONFIG_MODULES).
> > > I can pull it out to a helper or leave #ifdef in the function body,
> > > whichever you prefer.
> > 
> > As far as I can see, the only problem is MODULE_STATE_COMING.
> > Can we move 'enum module_state' out of #ifdef CONFIG_MODULES in module.h  ?
> 
> There's dereference of 'struct module' there:
>  
>   (*probed_mod)->state != MODULE_STATE_COMING) {
>   ...
>   }
> 
> so moving out 'enum module_state' won't be enough.

Hmm, this part should be inline functions like;

#ifdef CONFIG_MODULES
static inline bool module_is_coming(struct module *mod)
{
return mod->state == MODULE_STATE_COMING;
}
#else
#define module_is_coming(mod) (false)
#endif

Then we don't need the enum.
Thank you,

>  
> > >   
> > >> -- 
> > >> Masami Hiramatsu
> > > 
> 
> -- 
> Sincerely yours,
> Mike.
> 


-- 
Masami Hiramatsu (Google) 



Re: [PATCH v4 05/15] mm: introduce execmem_alloc() and execmem_free()

2024-04-20 Thread Google
On Sat, 20 Apr 2024 07:22:50 +0300
Mike Rapoport  wrote:

> On Fri, Apr 19, 2024 at 02:42:16PM -0700, Song Liu wrote:
> > On Fri, Apr 19, 2024 at 1:00 PM Mike Rapoport  wrote:
> > >
> > > On Fri, Apr 19, 2024 at 10:32:39AM -0700, Song Liu wrote:
> > > > On Fri, Apr 19, 2024 at 10:03 AM Mike Rapoport  wrote:
> > > > [...]
> > > > > > >
> > > > > > > [1] 
> > > > > > > https://lore.kernel.org/all/20240411160526.2093408-1-r...@kernel.org
> > > > > >
> > > > > > For the ROX to work, we need different users (module text, kprobe, 
> > > > > > etc.) to have
> > > > > > the same execmem_range. From [1]:
> > > > > >
> > > > > > static void *execmem_cache_alloc(struct execmem_range *range, 
> > > > > > size_t size)
> > > > > > {
> > > > > > ...
> > > > > >p = __execmem_cache_alloc(size);
> > > > > >if (p)
> > > > > >return p;
> > > > > >   err = execmem_cache_populate(range, size);
> > > > > > ...
> > > > > > }
> > > > > >
> > > > > > We are calling __execmem_cache_alloc() without range. For this to 
> > > > > > work,
> > > > > > we can only call execmem_cache_alloc() with one execmem_range.
> > > > >
> > > > > Actually, on x86 this will "just work" because everything shares the 
> > > > > same
> > > > > address space :)
> > > > >
> > > > > The 2M pages in the cache will be in the modules space, so
> > > > > __execmem_cache_alloc() will always return memory from that address 
> > > > > space.
> > > > >
> > > > > For other architectures this indeed needs to be fixed with passing the
> > > > > range to __execmem_cache_alloc() and limiting search in the cache for 
> > > > > that
> > > > > range.
> > > >
> > > > I think we at least need the "map to" concept (initially proposed by 
> > > > Thomas)
> > > > to get this work. For example, EXECMEM_BPF and EXECMEM_KPROBE
> > > > maps to EXECMEM_MODULE_TEXT, so that all these actually share
> > > > the same range.
> > >
> > > Why?
> > 
> > IIUC, we need to update __execmem_cache_alloc() to take a range pointer as
> > input. module text will use "range" for EXECMEM_MODULE_TEXT, while kprobe
> > will use "range" for EXECMEM_KPROBE. Without "map to" concept or sharing
> > the "range" object, we will have to compare different range parameters to 
> > check
> > we can share cached pages between module text and kprobe, which is not
> > efficient. Did I miss something?

Song, thanks for trying to explain. I think I need to explain why I used
module_alloc() originally.

This depends on how kprobe features are implemented on the architecture, and
how many kprobe features are supported.

Because kprobe jump optimization and kprobe jump-back optimization need to
use a jump instruction to jump into the trampoline and jump back from the
trampoline directly, if the architecture's jmp instruction supports a +-2GB
range like x86, the trampoline buffer needs to be allocated inside that
address space. This requirement is similar to modules (because a module
function needs to call other functions in the kernel, etc.), which is why
kprobes on x86 used module_alloc().

However, if an architecture only supports breakpoint/trap based kprobes,
it does not need to care about where the execmem is allocated.

> 
> We can always share large ROX pages as long as they are within the correct
> address space. The permissions for them are ROX and the alignment
> differences are due to KASAN and this is handled during allocation of the
> large page to refill the cache. __execmem_cache_alloc() only needs to limit
> the search for the address space of the range.

So I don't think EXECMEM_KPROBE is always the same as EXECMEM_MODULE_TEXT;
it should be configured for each arch. Especially if it is only used as a
search parameter, that looks OK to me.
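
As a concrete sketch of what I mean (the names execmem_alloc() and
EXECMEM_KPROBE are taken from this thread and are assumptions, not code from
the series): an arch that needs module-reachable kprobe trampolines can point
its kprobe range at a module-text-compatible area, while a trap-only arch can
use a plain default range, and the generic allocation stays the same:

	/* sketch: kprobe insn/trampoline pages come from a kprobe-specific range */
	void __weak *alloc_insn_page(void)
	{
		return execmem_alloc(EXECMEM_KPROBE, PAGE_SIZE);
	}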

Thank you,

> 
> And regardless, they way we deal with sharing of the cache can be sorted
> out later.
> 
> > Thanks,
> > Song
> 
> -- 
> Sincerely yours,
> Mike.
> 


-- 
Masami Hiramatsu (Google) 



Re: [PATCH v9 07/36] function_graph: Allow multiple users to attach to function graph

2024-04-20 Thread Google
On Fri, 19 Apr 2024 23:52:58 -0400
Steven Rostedt  wrote:

> On Mon, 15 Apr 2024 21:50:20 +0900
> "Masami Hiramatsu (Google)"  wrote:
> 
> > @@ -27,23 +28,157 @@
> >  
> >  #define FGRAPH_RET_SIZE sizeof(struct ftrace_ret_stack)
> >  #define FGRAPH_RET_INDEX DIV_ROUND_UP(FGRAPH_RET_SIZE, sizeof(long))
> > +
> > +/*
> > + * On entry to a function (via function_graph_enter()), a new 
> > ftrace_ret_stack
> > + * is allocated on the task's ret_stack with indexes entry, then each
> > + * fgraph_ops on the fgraph_array[]'s entryfunc is called and if that 
> > returns
> > + * non-zero, the index into the fgraph_array[] for that fgraph_ops is 
> > recorded
> > + * on the indexes entry as a bit flag.
> > + * As the associated ftrace_ret_stack saved for those fgraph_ops needs to
> > + * be found, the index to it is also added to the ret_stack along with the
> > + * index of the fgraph_array[] to each fgraph_ops that needs their retfunc
> > + * called.
> > + *
> > + * The top of the ret_stack (when not empty) will always have a reference
> > + * to the last ftrace_ret_stack saved. All references to the
> > + * ftrace_ret_stack has the format of:
> > + *
> > + * bits:  0 -  9   offset in words from the previous ftrace_ret_stack
> > + * (bitmap type should have FGRAPH_RET_INDEX always)
> > + * bits: 10 - 11   Type of storage
> > + *   0 - reserved
> > + *   1 - bitmap of fgraph_array index
> > + *
> > + * For bitmap of fgraph_array index
> > + *  bits: 12 - 27  The bitmap of fgraph_ops fgraph_array index
> 
> I really hate the terminology I came up with here, and would love to
> get better terminology for describing what is going on. I looked it
> over but I'm constantly getting confused. And I wrote this code!
> 
> Perhaps we should use:
> 
>  @frame : The data that represents a single function call. When a
>   function is traced, all the data used for all the callbacks
>   attached to it, is in a single frame. This would replace the
>   FGRAPH_RET_SIZE as FGRAPH_FRAME_SIZE.

Agreed.

> 
>  @offset : This is the word size position on the stack. It would
>replace INDEX, as I think "index" is being used for more
>than one thing. Perhaps it should be "offset" when dealing
>with where it is on the shadow stack, and "pos" when dealing
>with which callback ops is being referenced.

Indeed. @index is usually used for the index in an array, so we can use
@index for fgraph_array[]. But inside a @frame, @offset would be better.

> 
> 
> > + *
> > + * That is, at the end of function_graph_enter, if the first and forth
> > + * fgraph_ops on the fgraph_array[] (index 0 and 3) needs their retfunc 
> > called
> > + * on the return of the function being traced, this is what will be on the
> > + * task's shadow ret_stack: (the stack grows upward)
> > + *
> > + * || <- task->curr_ret_stack
> > + * ++
> > + * | bitmap_type(bitmap:(BIT(3)|BIT(0)),|
> > + * | offset:FGRAPH_RET_INDEX)   | <- the offset is from 
> > here
> > + * ++
> > + * | struct ftrace_ret_stack|
> > + * |   (stores the saved ret pointer)   | <- the offset points here
> > + * ++
> > + * | (X) | (N)  | ( N words away from
> > + * ||   previous ret_stack)
> > + *
> > + * If a backtrace is required, and the real return pointer needs to be
> > + * fetched, then it looks at the task's curr_ret_stack index, if it
> > + * is greater than zero (reserved, or right before popped), it would mask
> > + * the value by FGRAPH_RET_INDEX_MASK to get the offset index of the
> > + * ftrace_ret_stack structure stored on the shadow stack.
> > + */
> > +
> > +#define FGRAPH_RET_INDEX_SIZE  10
> 
> Replace SIZE with BITS.

Agreed.

> 
> > +#define FGRAPH_RET_INDEX_MASK  GENMASK(FGRAPH_RET_INDEX_SIZE - 1, 0)
> 
>   #define FGRAPH_FRAME_SIZE_BITS  10
>   #define FGRAPH_FRAME_SIZE_MASK  GENMASK(FGRAPH_FRAME_SIZE_BITS - 1, 0)
> 
> 
> > +
> > +#define FGRAPH_TYPE_SIZE   2
> > +#define FGRAPH_TYPE_MASK   GENMASK(FGRAPH_TYPE_SIZE - 1, 0)
> 
>   #define FGRAPH_TYPE_BITS2
>   #define FGRAPH_TYPE_MASKGENMASK(FGRAPH_TYPE_BITS - 1, 0)
>

Re: [PATCH v9 00/36] tracing: fprobe: function_graph: Multi-function graph and fprobe on fgraph

2024-04-18 Thread Google
Hi Steve,

Can you review this series? Especially, [07/36] and [12/36] have been changed
a lot from your original patches.

Thank you,

On Mon, 15 Apr 2024 21:48:59 +0900
"Masami Hiramatsu (Google)"  wrote:

> Hi,
> 
> Here is the 9th version of the series to re-implement the fprobe on
> function-graph tracer. The previous version is;
> 
> https://lore.kernel.org/all/170887410337.564249.6360118840946697039.stgit@devnote2/
> 
> This version is ported on the latest kernel (v6.9-rc3 + probes/for-next)
> and fixed some bugs + performance optimization patch[36/36].
>  - [12/36] Fix to clear fgraph_array entry in registration failure, also
>return -ENOSPC when fgraph_array is full.
>  - [28/36] Add new store_fprobe_entry_data() for fprobe.
>  - [31/36] Remove DIV_ROUND_UP() and fix entry data address calculation.
>  - [36/36] Add new flag to skip timestamp recording.
> 
> Overview
> 
> This series does major 2 changes, enable multiple function-graphs on
> the ftrace (e.g. allow function-graph on sub instances) and rewrite the
> fprobe on this function-graph.
> 
> The former changes had been sent from Steven Rostedt 4 years ago (*),
> which allows users to set different setting function-graph tracer (and
> other tracers based on function-graph) in each trace-instances at the
> same time.
> 
> (*) https://lore.kernel.org/all/20190525031633.811342...@goodmis.org/
> 
> The purpose of latter change are;
> 
>  1) Remove dependency of the rethook from fprobe so that we can reduce
>the return hook code and shadow stack.
> 
>  2) Make 'ftrace_regs' the common trace interface for the function
>boundary.
> 
> 1) Currently we have 2(or 3) different function return hook codes,
>  the function-graph tracer and rethook (and legacy kretprobe).
>  But since this  is redundant and needs double maintenance cost,
>  I would like to unify those. From the user's viewpoint, function-
>  graph tracer is very useful to grasp the execution path. For this
>  purpose, it is hard to use the rethook in the function-graph
>  tracer, but the opposite is possible. (Strictly speaking, kretprobe
>  can not use it because it requires 'pt_regs' for historical reasons.)
> 
> 2) Now the fprobe provides the 'pt_regs' for its handler, but that is
>  wrong for the function entry and exit. Moreover, depending on the
>  architecture, there is no way to accurately reproduce 'pt_regs'
>  outside of interrupt or exception handlers. This means fprobe should
>  not use 'pt_regs' because it does not use such exceptions.
>  (Conversely, kprobe should use 'pt_regs' because it is an abstract
>   interface of the software breakpoint exception.)
> 
> This series changes fprobe to use function-graph tracer for tracing
> function entry and exit, instead of mixture of ftrace and rethook.
> Unlike the rethook which is a per-task list of system-wide allocated
> nodes, the function graph's ret_stack is a per-task shadow stack.
> Thus it does not need to set 'nr_maxactive' (which is the number of
> pre-allocated nodes).
> Also the handlers will get the 'ftrace_regs' instead of 'pt_regs'.
> Since eBPF mulit_kprobe/multi_kretprobe events still use 'pt_regs' as
> their register interface, this changes it to convert 'ftrace_regs' to
> 'pt_regs'. Of course this conversion makes an incomplete 'pt_regs',
> so users must access only registers for function parameters or
> return value. 
> 
> Design
> --
> Instead of using ftrace's function entry hook directly, the new fprobe
> is built on top of the function-graph's entry and return callbacks
> with 'ftrace_regs'.
> 
> Since the fprobe requires access to 'ftrace_regs', the architecture
> must support CONFIG_HAVE_DYNAMIC_FTRACE_WITH_ARGS and
> CONFIG_HAVE_FTRACE_GRAPH_FUNC, which enables to call function-graph
> entry callback with 'ftrace_regs', and also
> CONFIG_HAVE_FUNCTION_GRAPH_FREGS, which passes the ftrace_regs to
> return_to_handler.
> 
> All fprobes share a single function-graph ops (means shares a common
> ftrace filter) similar to the kprobe-on-ftrace. This needs another
> layer to find corresponding fprobe in the common function-graph
> callbacks, but has much better scalability, since the number of
> registered function-graph ops is limited.
> 
> In the entry callback, the fprobe runs its entry_handler and saves the
> address of 'fprobe' on the function-graph's shadow stack as data. The
> return callback decodes the data to get the 'fprobe' address, and runs
> the exit_handler.
> 
> The fprobe introduces two hash-tables, one is for entry callback which
> searches fprobes related to the given function address passed by entry
> callback. The other is for a return callback which checks if the given

Re: [PATCH 1/2] tracing/user_events: Fix non-spaced field matching

2024-04-18 Thread Google
On Tue, 16 Apr 2024 22:41:01 +
Beau Belgrave  wrote:

> When the ABI was updated to prevent same name w/different args, it
> missed an important corner case when fields don't end with a space.
> Typically, space is used for fields to help separate them, like
> "u8 field1; u8 field2". If no spaces are used, like
> "u8 field1;u8 field2", then the parsing works for the first time.
> However, the match check fails on a subsequent register, leading to
> confusion.
> 
> This is because the match check uses argv_split() and assumes that all
> fields will be split upon the space. When spaces are used, we get back
> { "u8", "field1;" }, without spaces we get back { "u8", "field1;u8" }.
> This causes a mismatch, and the user program gets back -EADDRINUSE.
> 
> Add a method to detect this case before calling argv_split(). If found
> force a space after the field separator character ';'. This ensures all
> cases work properly for matching.
> 
> With this fix, the following are all treated as matching:
> u8 field1;u8 field2
> u8 field1; u8 field2
> u8 field1;\tu8 field2
> u8 field1;\nu8 field2

Sounds good to me. I just have some nits.

> 
> Fixes: ba470eebc2f6 ("tracing/user_events: Prevent same name but different 
> args event")
> Signed-off-by: Beau Belgrave 
> ---
>  kernel/trace/trace_events_user.c | 88 +++-
>  1 file changed, 87 insertions(+), 1 deletion(-)
> 
> diff --git a/kernel/trace/trace_events_user.c 
> b/kernel/trace/trace_events_user.c
> index 70d428c394b6..9184d3962b2a 100644
> --- a/kernel/trace/trace_events_user.c
> +++ b/kernel/trace/trace_events_user.c
> @@ -1989,6 +1989,92 @@ static int user_event_set_tp_name(struct user_event 
> *user)
>   return 0;
>  }
>  
> +/*
> + * Counts how many ';' without a trailing space are in the args.
> + */
> +static int count_semis_no_space(char *args)
> +{
> + int count = 0;
> +
> + while ((args = strchr(args, ';'))) {
> + args++;
> +
> + if (!isspace(*args))
> + count++;
> + }
> +
> + return count;
> +}
> +
> +/*
> + * Copies the arguments while ensuring all ';' have a trailing space.
> + */
> +static char *fix_semis_no_space(char *args, int count)

nit: This name does not describe what the function does; 'insert_space_after_semis()'
is more self-descriptive.

> +{
> + char *fixed, *pos;
> + char c, last;
> + int len;
> +
> + len = strlen(args) + count;
> + fixed = kmalloc(len + 1, GFP_KERNEL);
> +
> + if (!fixed)
> + return NULL;
> +
> + pos = fixed;
> + last = '\0';
> +
> + while (len > 0) {
> + c = *args++;
> +
> + if (last == ';' && !isspace(c)) {
> + *pos++ = ' ';
> + len--;
> + }
> +
> + if (len > 0) {
> + *pos++ = c;
> + len--;
> + }
> +
> + last = c;
> + }

nit: This loop can be simpler, because we are sure fixed has enough length;

/* insert a space after ';' if there is no space. */
while(*args) {
*pos = *args++;
if (*pos++ == ';' && !isspace(*args))
*pos++ = ' ';
}

> +
> + /*
> +  * len is the length of the copy excluding the null.
> +  * This ensures we always have room for a null.
> +  */
> + *pos = '\0';
> +
> + return fixed;
> +}
> +
> +static char **user_event_argv_split(char *args, int *argc)
> +{
> + /* Count how many ';' without a trailing space */
> + int count = count_semis_no_space(args);
> +
> + if (count) {

nit: it is better to exit fast, so 

if (!count)
return argv_split(GFP_KERNEL, args, argc);

...

Thank you,

OT: BTW, can this also simplify synthetic events?

> + /* We must fixup 'field;field' to 'field; field' */
> + char *fixed = fix_semis_no_space(args, count);
> + char **split;
> +
> + if (!fixed)
> + return NULL;
> +
> + /* We do a normal split afterwards */
> + split = argv_split(GFP_KERNEL, fixed, argc);
> +
> + /* We can free since argv_split makes a copy */
> + kfree(fixed);
> +
> + return split;
> + }
> +
> + /* No fixup is required */
> + return argv_split(GFP_KERNEL, args, argc);
> +}
> +
>  /*
>   * Parses the event name, arguments and flags then registers if successful.
>   * The name buffer lifetime is owned by this method for success cases only.
> @@ -2012,7 +2098,7 @@ static int user_event_parse(struct user_event_group 
> *group, char *name,
>   return -EPERM;
>  
>   if (args) {
> - argv = argv_split(GFP_KERNEL, args, &argc);
> + argv = user_event_argv_split(args, &argc);
>  
>   if (!argv)
>   return -ENOMEM;
> -- 
> 2.34.1
> 


-- 
Masami Hiramatsu (Google) 



Re: [PATCH v4 2/2] rethook: honor CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING in rethook_try_get()

2024-04-18 Thread Google
On Thu, 18 Apr 2024 12:09:09 -0700
Andrii Nakryiko  wrote:

> Take into account CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING when validating
> that RCU is watching when trying to set up a rethook on a function entry.
> 
> One notable exception when we force rcu_is_watching() check is
> CONFIG_KPROBE_EVENTS_ON_NOTRACE=y case, in which case kretprobes will use
> old-style int3-based workflow instead of relying on ftrace, making RCU
> watching check important to validate.
> 
> This further (in addition to improvements in the previous patch)
> improves BPF multi-kretprobe (which rely on rethook) runtime throughput
> by 2.3%, according to BPF benchmarks ([0]).
> 
>   [0] 
> https://lore.kernel.org/bpf/caef4bzauq2wkmjzdc9s0rbwa01bybgwhn6andxqshyia47p...@mail.gmail.com/
> 
> Signed-off-by: Andrii Nakryiko 


Thanks for update! This looks good to me.

Acked-by: Masami Hiramatsu (Google) 

Thanks,

> ---
>  kernel/trace/rethook.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/kernel/trace/rethook.c b/kernel/trace/rethook.c
> index fa03094e9e69..a974605ad7a5 100644
> --- a/kernel/trace/rethook.c
> +++ b/kernel/trace/rethook.c
> @@ -166,6 +166,7 @@ struct rethook_node *rethook_try_get(struct rethook *rh)
>   if (unlikely(!handler))
>   return NULL;
>  
> +#if defined(CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING) || 
> defined(CONFIG_KPROBE_EVENTS_ON_NOTRACE)
>   /*
>* This expects the caller will set up a rethook on a function entry.
>* When the function returns, the rethook will eventually be reclaimed
> @@ -174,6 +175,7 @@ struct rethook_node *rethook_try_get(struct rethook *rh)
>*/
>   if (unlikely(!rcu_is_watching()))
>   return NULL;
> +#endif
>  
>   return (struct rethook_node *)objpool_pop(&rh->pool);
>  }
> -- 
> 2.43.0
> 


-- 
Masami Hiramatsu (Google) 



Re: [PATCH] uprobes: reduce contention on uprobes_tree access

2024-04-18 Thread Google
> > > > > > > > > > > >  kernel/events/uprobes.c | 22 +++---
> > > > > > > > > > > >  1 file changed, 11 insertions(+), 11 deletions(-)
> > > > > > > > > > > >
> > > > > > > > > > > > diff --git a/kernel/events/uprobes.c 
> > > > > > > > > > > > b/kernel/events/uprobes.c
> > > > > > > > > > > > index 929e98c62965..42bf9b6e8bc0 100644
> > > > > > > > > > > > --- a/kernel/events/uprobes.c
> > > > > > > > > > > > +++ b/kernel/events/uprobes.c
> > > > > > > > > > > > @@ -39,7 +39,7 @@ static struct rb_root uprobes_tree = 
> > > > > > > > > > > > RB_ROOT;
> > > > > > > > > > > >   */
> > > > > > > > > > > >  #define no_uprobe_events()   
> > > > > > > > > > > > RB_EMPTY_ROOT(&uprobes_tree)
> > > > > > > > > > > >
> > > > > > > > > > > > -static DEFINE_SPINLOCK(uprobes_treelock);/* 
> > > > > > > > > > > > serialize rbtree access */
> > > > > > > > > > > > +static DEFINE_RWLOCK(uprobes_treelock);  /* 
> > > > > > > > > > > > serialize rbtree access */
> > > > > > > > > > > >
> > > > > > > > > > > >  #define UPROBES_HASH_SZ  13
> > > > > > > > > > > >  /* serialize uprobe->pending_list */
> > > > > > > > > > > > @@ -669,9 +669,9 @@ static struct uprobe 
> > > > > > > > > > > > *find_uprobe(struct inode *inode, loff_t offset)
> > > > > > > > > > > >  {
> > > > > > > > > > > >   struct uprobe *uprobe;
> > > > > > > > > > > >
> > > > > > > > > > > > - spin_lock(&uprobes_treelock);
> > > > > > > > > > > > + read_lock(&uprobes_treelock);
> > > > > > > > > > > >   uprobe = __find_uprobe(inode, offset);
> > > > > > > > > > > > - spin_unlock(&uprobes_treelock);
> > > > > > > > > > > > + read_unlock(&uprobes_treelock);
> > > > > > > > > > > >
> > > > > > > > > > > >   return uprobe;
> > > > > > > > > > > >  }
> > > > > > > > > > > > @@ -701,9 +701,9 @@ static struct uprobe 
> > > > > > > > > > > > *insert_uprobe(struct uprobe *uprobe)
> > > > > > > > > > > >  {
> > > > > > > > > > > >   struct uprobe *u;
> > > > > > > > > > > >
> > > > > > > > > > > > - spin_lock(&uprobes_treelock);
> > > > > > > > > > > > + write_lock(&uprobes_treelock);
> > > > > > > > > > > >   u = __insert_uprobe(uprobe);
> > > > > > > > > > > > - spin_unlock(&uprobes_treelock);
> > > > > > > > > > > > + write_unlock(&uprobes_treelock);
> > > > > > > > > > > >
> > > > > > > > > > > >   return u;
> > > > > > > > > > > >  }
> > > > > > > > > > > > @@ -935,9 +935,9 @@ static void delete_uprobe(struct 
> > > > > > > > > > > > uprobe *uprobe)
> > > > > > > > > > > >   if (WARN_ON(!uprobe_is_active(uprobe)))
> > > > > > > > > > > >   return;
> > > > > > > > > > > >
> > > > > > > > > > > > - spin_lock(&uprobes_treelock);
> > > > > > > > > > > > + write_lock(&uprobes_treelock);
> > > > > > > > > > > >   rb_erase(&uprobe->rb_node, &uprobes_tree);
> > > > > > > > > > > > - spin_unlock(&uprobes_treelock);
> > > > > > > > > > > > + write_unlock(&uprobes_treelock);
> > > > > > > > > > > >   RB_CLEAR_NODE(&uprobe->rb_node); /* for 
> > > > > > > > > > > > uprobe_is_active() */
> > > > > > > > > > > >   put_uprobe(uprobe);
> > > > > > > > > > > >  }
> > > > > > > > > > > > @@ -1298,7 +1298,7 @@ static void 
> > > > > > > > > > > > build_probe_list(struct inode *inode,
> > > > > > > > > > > >   min = vaddr_to_offset(vma, start);
> > > > > > > > > > > >   max = min + (end - start) - 1;
> > > > > > > > > > > >
> > > > > > > > > > > > - spin_lock(&uprobes_treelock);
> > > > > > > > > > > > + read_lock(&uprobes_treelock);
> > > > > > > > > > > >   n = find_node_in_range(inode, min, max);
> > > > > > > > > > > >   if (n) {
> > > > > > > > > > > >   for (t = n; t; t = rb_prev(t)) {
> > > > > > > > > > > > @@ -1316,7 +1316,7 @@ static void 
> > > > > > > > > > > > build_probe_list(struct inode *inode,
> > > > > > > > > > > >   get_uprobe(u);
> > > > > > > > > > > >   }
> > > > > > > > > > > >   }
> > > > > > > > > > > > - spin_unlock(&uprobes_treelock);
> > > > > > > > > > > > + read_unlock(&uprobes_treelock);
> > > > > > > > > > > >  }
> > > > > > > > > > > >
> > > > > > > > > > > >  /* @vma contains reference counter, not the probed 
> > > > > > > > > > > > instruction. */
> > > > > > > > > > > > @@ -1407,9 +1407,9 @@ vma_has_uprobes(struct 
> > > > > > > > > > > > vm_area_struct *vma, unsigned long start, unsigned long 
> > > > > > > > > > > > e
> > > > > > > > > > > >   min = vaddr_to_offset(vma, start);
> > > > > > > > > > > >   max = min + (end - start) - 1;
> > > > > > > > > > > >
> > > > > > > > > > > > - spin_lock(&uprobes_treelock);
> > > > > > > > > > > > + read_lock(&uprobes_treelock);
> > > > > > > > > > > >   n = find_node_in_range(inode, min, max);
> > > > > > > > > > > > - spin_unlock(&uprobes_treelock);
> > > > > > > > > > > > + read_unlock(&uprobes_treelock);
> > > > > > > > > > > >
> > > > > > > > > > > >   return !!n;
> > > > > > > > > > > >  }
> > > > > > > > > > > > --
> > > > > > > > > > > > 2.43.0
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > --
> > > > > > > > > > > Masami Hiramatsu (Google) 
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > --
> > > > > > > > > Masami Hiramatsu (Google) 
> > > > > > >
> > > > > > >
> > > > > > > --
> > > > > > > Masami Hiramatsu (Google) 
> > > > 
> > > > 
> > > > -- 
> > > > Masami Hiramatsu (Google) 
> > 
> > 
> > -- 
> > Masami Hiramatsu (Google) 
> 


-- 
Masami Hiramatsu (Google) 



Re: [PATCH v4 05/15] mm: introduce execmem_alloc() and execmem_free()

2024-04-17 Thread Google
On Thu, 11 Apr 2024 19:00:41 +0300
Mike Rapoport  wrote:

> From: "Mike Rapoport (IBM)" 
> 
> module_alloc() is used everywhere as a mean to allocate memory for code.
> 
> Beside being semantically wrong, this unnecessarily ties all subsystems
> that need to allocate code, such as ftrace, kprobes and BPF to modules and
> puts the burden of code allocation to the modules code.
> 
> Several architectures override module_alloc() because of various
> constraints where the executable memory can be located and this causes
> additional obstacles for improvements of code allocation.
> 
> Start splitting code allocation from modules by introducing execmem_alloc()
> and execmem_free() APIs.
> 
> Initially, execmem_alloc() is a wrapper for module_alloc() and
> execmem_free() is a replacement of module_memfree() to allow updating all
> call sites to use the new APIs.
> 
> Since architectures define different restrictions on placement,
> permissions, alignment and other parameters for memory that can be used by
> different subsystems that allocate executable memory, execmem_alloc() takes
> a type argument, that will be used to identify the calling subsystem and to
> allow architectures define parameters for ranges suitable for that
> subsystem.
> 
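
(For readers skimming the archive: the initial wrappers described above are
roughly the following shape. This is only a sketch based on the cover text,
not the exact mm/execmem.c from the series.)

	#include <linux/execmem.h>
	#include <linux/moduleloader.h>

	/*
	 * Initially a thin wrapper around module_alloc(); @type is what will
	 * later let architectures define per-subsystem ranges, permissions
	 * and alignment.
	 */
	void *execmem_alloc(enum execmem_type type, size_t size)
	{
		return module_alloc(size);
	}

	void execmem_free(void *ptr)
	{
		/* Counterpart of execmem_alloc(), replacing module_memfree() callers. */
		module_memfree(ptr);
	}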

This looks good to me for the kprobe part.

Acked-by: Masami Hiramatsu (Google) 

Thank you,

> Signed-off-by: Mike Rapoport (IBM) 
> ---
>  arch/powerpc/kernel/kprobes.c|  6 ++--
>  arch/s390/kernel/ftrace.c|  4 +--
>  arch/s390/kernel/kprobes.c   |  4 +--
>  arch/s390/kernel/module.c|  5 +--
>  arch/sparc/net/bpf_jit_comp_32.c |  8 ++---
>  arch/x86/kernel/ftrace.c |  6 ++--
>  arch/x86/kernel/kprobes/core.c   |  4 +--
>  include/linux/execmem.h  | 57 
>  include/linux/moduleloader.h |  3 --
>  kernel/bpf/core.c|  6 ++--
>  kernel/kprobes.c |  8 ++---
>  kernel/module/Kconfig|  1 +
>  kernel/module/main.c | 25 +-
>  mm/Kconfig   |  3 ++
>  mm/Makefile  |  1 +
>  mm/execmem.c | 26 +++
>  16 files changed, 122 insertions(+), 45 deletions(-)
>  create mode 100644 include/linux/execmem.h
>  create mode 100644 mm/execmem.c
> 
> diff --git a/arch/powerpc/kernel/kprobes.c b/arch/powerpc/kernel/kprobes.c
> index bbca90a5e2ec..9fcd01bb2ce6 100644
> --- a/arch/powerpc/kernel/kprobes.c
> +++ b/arch/powerpc/kernel/kprobes.c
> @@ -19,8 +19,8 @@
>  #include 
>  #include 
>  #include 
> -#include 
>  #include 
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -130,7 +130,7 @@ void *alloc_insn_page(void)
>  {
>   void *page;
>  
> - page = module_alloc(PAGE_SIZE);
> + page = execmem_alloc(EXECMEM_KPROBES, PAGE_SIZE);
>   if (!page)
>   return NULL;
>  
> @@ -142,7 +142,7 @@ void *alloc_insn_page(void)
>   }
>   return page;
>  error:
> - module_memfree(page);
> + execmem_free(page);
>   return NULL;
>  }
>  
> diff --git a/arch/s390/kernel/ftrace.c b/arch/s390/kernel/ftrace.c
> index c46381ea04ec..798249ef5646 100644
> --- a/arch/s390/kernel/ftrace.c
> +++ b/arch/s390/kernel/ftrace.c
> @@ -7,13 +7,13 @@
>   *   Author(s): Martin Schwidefsky 
>   */
>  
> -#include 
>  #include 
>  #include 
>  #include 
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -220,7 +220,7 @@ static int __init ftrace_plt_init(void)
>  {
>   const char *start, *end;
>  
> - ftrace_plt = module_alloc(PAGE_SIZE);
> + ftrace_plt = execmem_alloc(EXECMEM_FTRACE, PAGE_SIZE);
>   if (!ftrace_plt)
>   panic("cannot allocate ftrace plt\n");
>  
> diff --git a/arch/s390/kernel/kprobes.c b/arch/s390/kernel/kprobes.c
> index f0cf20d4b3c5..3c1b1be744de 100644
> --- a/arch/s390/kernel/kprobes.c
> +++ b/arch/s390/kernel/kprobes.c
> @@ -9,7 +9,6 @@
>  
>  #define pr_fmt(fmt) "kprobes: " fmt
>  
> -#include 
>  #include 
>  #include 
>  #include 
> @@ -21,6 +20,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -38,7 +38,7 @@ void *alloc_insn_page(void)
>  {
>   void *page;
>  
> - page = module_alloc(PAGE_SIZE);
> + page = execmem_alloc(EXECMEM_KPROBES, PAGE_SIZE);
>   if (!page)
>   return NULL;
>   set_memory_rox((unsigned long)page, 1);
> diff --git a/arch/s390/kernel/module.c b/arch/s390/kernel/module.c
> index 42215f9404af..ac97a905e8cd 100644
> --- a/arch/s3

Re: [PATCH for-next v2] tracing/kprobes: Add symbol counting check when module loads

2024-04-17 Thread Google
Sorry, this is actually v3. (misconfigured...)

Thanks,

On Thu, 18 Apr 2024 05:46:34 +0900
"Masami Hiramatsu (Google)"  wrote:

> From: Masami Hiramatsu (Google) 
> 
> Currently, kprobe event checks whether the target symbol name is unique
> or not, so that it does not put a probe on an unexpected place. But this
> skips the check if the target is on a module because the module may not
> be loaded.
> 
> To fix this issue, this patch checks the number of probe target symbols
> in the target module when the module is loaded. If the probe is not on a
> symbol with a unique name in the module, it will be rejected at that point.
> 
> Note that a symbol which has a unique name in the target module will be
> accepted even if there are same-name symbols in the kernel or other
> modules.
> 
> Signed-off-by: Masami Hiramatsu (Google) 
> ---
>  Changes in v3:
>   - Update the patch description.
>  Updated from last October post, which was dropped by test failure:
> 
> https://lore.kernel.org/linux-trace-kernel/169854904604.132316.12500381416261460174.stgit@devnote2/
>  Changes in v2:
>   - Fix to skip checking uniqueness if the target module is not loaded.
>   - Fix register_module_trace_kprobe() to pass correct symbol name.
>   - Fix to call __register_trace_kprobe() from module callback.
> ---
>  kernel/trace/trace_kprobe.c |  125 
> ---
>  1 file changed, 81 insertions(+), 44 deletions(-)
> 
> diff --git a/kernel/trace/trace_kprobe.c b/kernel/trace/trace_kprobe.c
> index c68d4e830fbe..0113afe2662d 100644
> --- a/kernel/trace/trace_kprobe.c
> +++ b/kernel/trace/trace_kprobe.c
> @@ -670,6 +670,21 @@ static int register_trace_kprobe(struct trace_kprobe *tk)
>   return ret;
>  }
>  
> +static int validate_module_probe_symbol(const char *modname, const char 
> *symbol);
> +
> +static int register_module_trace_kprobe(struct module *mod, struct 
> trace_kprobe *tk)
> +{
> + const char *p;
> + int ret = 0;
> +
> + p = strchr(trace_kprobe_symbol(tk), ':');
> + if (p)
> + ret = validate_module_probe_symbol(module_name(mod), p + 1);
> + if (!ret)
> + ret = __register_trace_kprobe(tk);
> + return ret;
> +}
> +
>  /* Module notifier call back, checking event on the module */
>  static int trace_kprobe_module_callback(struct notifier_block *nb,
>  unsigned long val, void *data)
> @@ -688,7 +703,7 @@ static int trace_kprobe_module_callback(struct 
> notifier_block *nb,
>   if (trace_kprobe_within_module(tk, mod)) {
>   /* Don't need to check busy - this should have gone. */
>   __unregister_trace_kprobe(tk);
> - ret = __register_trace_kprobe(tk);
> + ret = register_module_trace_kprobe(mod, tk);
>   if (ret)
>   pr_warn("Failed to re-register probe %s on %s: 
> %d\n",
>   trace_probe_name(>tp),
> @@ -729,17 +744,68 @@ static int count_mod_symbols(void *data, const char 
> *name, unsigned long unused)
>   return 0;
>  }
>  
> -static unsigned int number_of_same_symbols(char *func_name)
> +static unsigned int number_of_same_symbols(const char *mod, const char 
> *func_name)
>  {
>   struct sym_count_ctx ctx = { .count = 0, .name = func_name };
>  
> - kallsyms_on_each_match_symbol(count_symbols, func_name, );
> + if (!mod)
> + kallsyms_on_each_match_symbol(count_symbols, func_name, 
> );
>  
> - module_kallsyms_on_each_symbol(NULL, count_mod_symbols, );
> + module_kallsyms_on_each_symbol(mod, count_mod_symbols, );
>  
>   return ctx.count;
>  }
>  
> +static int validate_module_probe_symbol(const char *modname, const char 
> *symbol)
> +{
> + unsigned int count = number_of_same_symbols(modname, symbol);
> +
> + if (count > 1) {
> + /*
> +  * Users should use ADDR to remove the ambiguity of
> +  * using KSYM only.
> +  */
> + return -EADDRNOTAVAIL;
> + } else if (count == 0) {
> + /*
> +  * We can return ENOENT earlier than when register the
> +  * kprobe.
> +  */
> + return -ENOENT;
> + }
> + return 0;
> +}
> +
> +static int validate_probe_symbol(char *symbol)
> +{
> + struct module *mod = NULL;
> + char *modname = NULL, *p;
> + int ret = 0;
> +
> + p = strchr(symbol, ':');
> + if (p) {
> + modname = symbol;
> + 

[PATCH for-next v2] tracing/kprobes: Add symbol counting check when module loads

2024-04-17 Thread Masami Hiramatsu (Google)
From: Masami Hiramatsu (Google) 

Currently, kprobe event checks whether the target symbol name is unique
or not, so that it does not put a probe on an unexpected place. But this
skips the check if the target is on a module because the module may not
be loaded.

To fix this issue, this patch checks the number of probe target symbols
in the target module when the module is loaded. If the probe is not on a
symbol with a unique name in the module, it will be rejected at that point.

Note that a symbol which has a unique name in the target module will be
accepted even if there are same-name symbols in the kernel or other
modules.

Signed-off-by: Masami Hiramatsu (Google) 
---
 Changes in v3:
  - Update the patch description.
 Updated from last October post, which was dropped by test failure:

https://lore.kernel.org/linux-trace-kernel/169854904604.132316.12500381416261460174.stgit@devnote2/
 Changes in v2:
  - Fix to skip checking uniqueness if the target module is not loaded.
  - Fix register_module_trace_kprobe() to pass correct symbol name.
  - Fix to call __register_trace_kprobe() from module callback.
---
 kernel/trace/trace_kprobe.c |  125 ---
 1 file changed, 81 insertions(+), 44 deletions(-)

diff --git a/kernel/trace/trace_kprobe.c b/kernel/trace/trace_kprobe.c
index c68d4e830fbe..0113afe2662d 100644
--- a/kernel/trace/trace_kprobe.c
+++ b/kernel/trace/trace_kprobe.c
@@ -670,6 +670,21 @@ static int register_trace_kprobe(struct trace_kprobe *tk)
return ret;
 }
 
+static int validate_module_probe_symbol(const char *modname, const char 
*symbol);
+
+static int register_module_trace_kprobe(struct module *mod, struct 
trace_kprobe *tk)
+{
+   const char *p;
+   int ret = 0;
+
+   p = strchr(trace_kprobe_symbol(tk), ':');
+   if (p)
+   ret = validate_module_probe_symbol(module_name(mod), p + 1);
+   if (!ret)
+   ret = __register_trace_kprobe(tk);
+   return ret;
+}
+
 /* Module notifier call back, checking event on the module */
 static int trace_kprobe_module_callback(struct notifier_block *nb,
   unsigned long val, void *data)
@@ -688,7 +703,7 @@ static int trace_kprobe_module_callback(struct 
notifier_block *nb,
if (trace_kprobe_within_module(tk, mod)) {
/* Don't need to check busy - this should have gone. */
__unregister_trace_kprobe(tk);
-   ret = __register_trace_kprobe(tk);
+   ret = register_module_trace_kprobe(mod, tk);
if (ret)
pr_warn("Failed to re-register probe %s on %s: 
%d\n",
			trace_probe_name(&tk->tp),
@@ -729,17 +744,68 @@ static int count_mod_symbols(void *data, const char 
*name, unsigned long unused)
return 0;
 }
 
-static unsigned int number_of_same_symbols(char *func_name)
+static unsigned int number_of_same_symbols(const char *mod, const char 
*func_name)
 {
struct sym_count_ctx ctx = { .count = 0, .name = func_name };
 
-   kallsyms_on_each_match_symbol(count_symbols, func_name, &ctx);
+   if (!mod)
+   kallsyms_on_each_match_symbol(count_symbols, func_name, 
&ctx);
 
-   module_kallsyms_on_each_symbol(NULL, count_mod_symbols, &ctx);
+   module_kallsyms_on_each_symbol(mod, count_mod_symbols, &ctx);
 
return ctx.count;
 }
 
+static int validate_module_probe_symbol(const char *modname, const char 
*symbol)
+{
+   unsigned int count = number_of_same_symbols(modname, symbol);
+
+   if (count > 1) {
+   /*
+* Users should use ADDR to remove the ambiguity of
+* using KSYM only.
+*/
+   return -EADDRNOTAVAIL;
+   } else if (count == 0) {
+   /*
+* We can return ENOENT earlier than when register the
+* kprobe.
+*/
+   return -ENOENT;
+   }
+   return 0;
+}
+
+static int validate_probe_symbol(char *symbol)
+{
+   struct module *mod = NULL;
+   char *modname = NULL, *p;
+   int ret = 0;
+
+   p = strchr(symbol, ':');
+   if (p) {
+   modname = symbol;
+   symbol = p + 1;
+   *p = '\0';
+   /* Return 0 (defer) if the module does not exist yet. */
+   rcu_read_lock_sched();
+   mod = find_module(modname);
+   if (mod && !try_module_get(mod))
+   mod = NULL;
+   rcu_read_unlock_sched();
+   if (!mod)
+   goto out;
+   }
+
+   ret = validate_module_probe_symbol(modname, symbol);
+out:
+   if (p)
+   *p = ':';
+   if (mod)
+   module_put(mod);
+   return ret;
+}
+
 static int trace_kprobe_entry_handler(struc

Re: [PATCH for-next v2] tracing/kprobes: Add symbol counting check when module loads

2024-04-17 Thread Google
On Tue, 16 Apr 2024 00:47:26 -0400
Steven Rostedt  wrote:

> On Mon, 15 Apr 2024 18:40:23 +0900
> "Masami Hiramatsu (Google)"  wrote:
> 
> > Check the number of probe target symbols in the target module when
> > the module is loaded. If the probe is not on the unique name symbols
> > in the module, it will be rejected at that point.
> > 
> > Note that the symbol which has a unique name in the target module,
> > it will be accepted even if there are same-name symbols in the
> > kernel or other modules,
> 
> This says what it does, but doesn't explain why it is doing it.
> What's the purpose of this patch?

Thank you for pointing it out, I just reused the description which I
sent last year. It needs to be updated.

-
Currently, kprobe event checks whether the target symbol name is
unique, so that it does not put a probe in an unexpected place. But this
check is skipped if the target is in a module. Fix this by checking
that the symbol is unique within the target module when the target
is in a module.
-
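
To illustrate the user-visible effect (my own example; "mymod" and
"shared_sym" below are made-up names): once this check is in place,
defining a probe on a module symbol that is not unique within that
module is expected to fail with EADDRNOTAVAIL, e.g. from a small C
helper like this:

	#include <errno.h>
	#include <fcntl.h>
	#include <stdio.h>
	#include <string.h>
	#include <unistd.h>

	int main(void)
	{
		/* "mymod" and "shared_sym" are placeholders for illustration. */
		const char *cmd = "p:myevent mymod:shared_sym\n";
		int fd = open("/sys/kernel/tracing/kprobe_events",
			      O_WRONLY | O_APPEND);

		if (fd < 0)
			return 1;
		if (write(fd, cmd, strlen(cmd)) < 0)
			printf("define failed: %s\n", strerror(errno));
		close(fd);
		return 0;
	}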

Thanks!

> 
> -- Steve


-- 
Masami Hiramatsu (Google) 



[PATCH] bootconfig: Fix the kerneldoc of _xbc_exit()

2024-04-15 Thread Masami Hiramatsu (Google)
From: Masami Hiramatsu (Google) 

Fix the kerneldoc of _xbc_exit(), which was updated to take an @early
argument and renamed from xbc_exit().

Reported-by: kernel test robot 
Closes: 
https://lore.kernel.org/oe-kbuild-all/202404150036.kpj3hefa-...@intel.com/
Signed-off-by: Masami Hiramatsu (Google) 
---
 lib/bootconfig.c |3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/lib/bootconfig.c b/lib/bootconfig.c
index 8841554432d5..97f8911ea339 100644
--- a/lib/bootconfig.c
+++ b/lib/bootconfig.c
@@ -901,7 +901,8 @@ static int __init xbc_parse_tree(void)
 }
 
 /**
- * xbc_exit() - Clean up all parsed bootconfig
+ * _xbc_exit() - Clean up all parsed bootconfig
+ * @early: Set true if this is called before the buddy system is initialized.
  *
  * This clears all data structures of parsed bootconfig on memory.
  * If you need to reuse xbc_init() with new boot config, you can




[PATCH v9 36/36] fgraph: Skip recording calltime/rettime if it is not needed

2024-04-15 Thread Masami Hiramatsu (Google)
From: Masami Hiramatsu (Google) 

Skip recording calltime and rettime if the fgraph_ops does not need it.
This is a kind of performance optimization for fprobe. Since the fprobe
user does not use these entries, recording timestamps in fgraph is just
an overhead (e.g. for eBPF and ftrace users). So introduce the
skip_timestamp flag, and if all fgraph_ops set this flag, skip recording
calltime and rettime.
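
As a usage sketch (mine, not part of this patch; the callback names are
placeholders and their signatures follow the fgraph_ops definition used in
this series):

	static struct fgraph_ops my_gops = {
		.entryfunc	= my_entry_cb,	/* placeholder callbacks */
		.retfunc	= my_return_cb,
		/* This user never reads calltime/rettime, so skip recording. */
		.skip_timestamp	= true,
	};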

Suggested-by: Jiri Olsa 
Signed-off-by: Masami Hiramatsu (Google) 
---
 Changes in v9:
  - Newly added.
---
 include/linux/ftrace.h |2 ++
 kernel/trace/fgraph.c  |   46 +++---
 kernel/trace/fprobe.c  |1 +
 3 files changed, 42 insertions(+), 7 deletions(-)

diff --git a/include/linux/ftrace.h b/include/linux/ftrace.h
index d845a80a3d56..06fc7cbef897 100644
--- a/include/linux/ftrace.h
+++ b/include/linux/ftrace.h
@@ -1156,6 +1156,8 @@ struct fgraph_ops {
struct ftrace_ops   ops; /* for the hash lists */
void*private;
int idx;
+   /* If skip_timestamp is true, this does not record timestamps. */
+   boolskip_timestamp;
 };
 
 void *fgraph_reserve_data(int idx, int size_bytes);
diff --git a/kernel/trace/fgraph.c b/kernel/trace/fgraph.c
index 7556fbbae323..a5722537bb79 100644
--- a/kernel/trace/fgraph.c
+++ b/kernel/trace/fgraph.c
@@ -131,6 +131,7 @@ DEFINE_STATIC_KEY_FALSE(kill_ftrace_graph);
 int ftrace_graph_active;
 
 static struct fgraph_ops *fgraph_array[FGRAPH_ARRAY_SIZE];
+static bool fgraph_skip_timestamp;
 
 /* LRU index table for fgraph_array */
 static int fgraph_lru_table[FGRAPH_ARRAY_SIZE];
@@ -475,7 +476,7 @@ void ftrace_graph_stop(void)
 static int
 ftrace_push_return_trace(unsigned long ret, unsigned long func,
 unsigned long frame_pointer, unsigned long *retp,
-int fgraph_idx)
+int fgraph_idx, bool skip_ts)
 {
struct ftrace_ret_stack *ret_stack;
unsigned long long calltime;
@@ -498,8 +499,12 @@ ftrace_push_return_trace(unsigned long ret, unsigned long 
func,
	ret_stack = get_ret_stack(current, current->curr_ret_stack, &index);
if (ret_stack && ret_stack->func == func &&
get_fgraph_type(current, index + FGRAPH_RET_INDEX) == 
FGRAPH_TYPE_BITMAP &&
-   !is_fgraph_index_set(current, index + FGRAPH_RET_INDEX, fgraph_idx))
+   !is_fgraph_index_set(current, index + FGRAPH_RET_INDEX, 
fgraph_idx)) {
+   /* If previous one skips calltime, update it. */
+   if (!skip_ts && !ret_stack->calltime)
+   ret_stack->calltime = trace_clock_local();
return index + FGRAPH_RET_INDEX;
+   }
 
val = (FGRAPH_TYPE_RESERVED << FGRAPH_TYPE_SHIFT) | FGRAPH_RET_INDEX;
 
@@ -517,7 +522,10 @@ ftrace_push_return_trace(unsigned long ret, unsigned long 
func,
return -EBUSY;
}
 
-   calltime = trace_clock_local();
+   if (skip_ts)
+   calltime = 0LL;
+   else
+   calltime = trace_clock_local();
 
index = READ_ONCE(current->curr_ret_stack);
ret_stack = RET_STACK(current, index);
@@ -601,7 +609,8 @@ int function_graph_enter_regs(unsigned long ret, unsigned 
long func,
trace.func = func;
trace.depth = ++current->curr_ret_depth;
 
-   index = ftrace_push_return_trace(ret, func, frame_pointer, retp, 0);
+   index = ftrace_push_return_trace(ret, func, frame_pointer, retp, 0,
+fgraph_skip_timestamp);
if (index < 0)
goto out;
 
@@ -654,7 +663,8 @@ int function_graph_enter_ops(unsigned long ret, unsigned 
long func,
return -ENODEV;
 
/* Use start for the distance to ret_stack (skipping over reserve) */
-   index = ftrace_push_return_trace(ret, func, frame_pointer, retp, 
gops->idx);
+   index = ftrace_push_return_trace(ret, func, frame_pointer, retp, 
gops->idx,
+gops->skip_timestamp);
if (index < 0)
return index;
type = get_fgraph_type(current, index);
@@ -732,6 +742,7 @@ ftrace_pop_return_trace(struct ftrace_graph_ret *trace, 
unsigned long *ret,
*ret = ret_stack->ret;
trace->func = ret_stack->func;
trace->calltime = ret_stack->calltime;
+   trace->rettime = 0;
	trace->overrun = atomic_read(&current->trace_overrun);
trace->depth = current->curr_ret_depth;
/*
@@ -792,7 +803,6 @@ __ftrace_return_to_handler(struct ftrace_regs *fregs, 
unsigned long frame_pointe
return (unsigned long)panic;
}
 
-   trace.rettime = trace_clock_local();
if (fregs)
ftrace_regs_set_instruction_pointer(fregs, ret);
 
@@ -808,6 +818,8 @@ __ftrace_retu

[PATCH v9 35/36] Documentation: probes: Update fprobe on function-graph tracer

2024-04-15 Thread Masami Hiramatsu (Google)
From: Masami Hiramatsu (Google) 

Update the fprobe documentation for the new fprobe on function-graph
tracer. This includes some behavior changes and the pt_regs to
ftrace_regs interface change.
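
As an illustration of the new ftrace_regs-based prototypes documented
below (my own sketch; the probed function name is arbitrary and only the
generic ftrace_regs accessors are used):

	#include <linux/fprobe.h>
	#include <linux/ftrace.h>
	#include <linux/printk.h>

	static int my_entry(struct fprobe *fp, unsigned long entry_ip,
			    unsigned long ret_ip, struct ftrace_regs *fregs,
			    void *entry_data)
	{
		/* First argument of the probed function via the accessor API. */
		pr_info("entry: ip=%lx arg0=%lx\n", entry_ip,
			ftrace_regs_get_argument(fregs, 0));
		return 0;	/* 0 keeps the corresponding exit callback */
	}

	static void my_exit(struct fprobe *fp, unsigned long entry_ip,
			    unsigned long ret_ip, struct ftrace_regs *fregs,
			    void *entry_data)
	{
		pr_info("exit: retval=%lx\n",
			ftrace_regs_get_return_value(fregs));
	}

	static struct fprobe my_fprobe = {
		.entry_handler	= my_entry,
		.exit_handler	= my_exit,
	};

	/* ... register_fprobe(&my_fprobe, "vfs_read", NULL); */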

Signed-off-by: Masami Hiramatsu (Google) 
---
 Changes in v2:
  - Update @fregs parameter explanation.
---
 Documentation/trace/fprobe.rst |   42 ++--
 1 file changed, 27 insertions(+), 15 deletions(-)

diff --git a/Documentation/trace/fprobe.rst b/Documentation/trace/fprobe.rst
index 196f52386aaa..f58bdc64504f 100644
--- a/Documentation/trace/fprobe.rst
+++ b/Documentation/trace/fprobe.rst
@@ -9,9 +9,10 @@ Fprobe - Function entry/exit probe
 Introduction
 
 
-Fprobe is a function entry/exit probe mechanism based on ftrace.
-Instead of using ftrace full feature, if you only want to attach callbacks
-on function entry and exit, similar to the kprobes and kretprobes, you can
+Fprobe is a function entry/exit probe mechanism based on the function-graph
+tracer.
+Instead of tracing all functions, if you want to attach callbacks on specific
+function entry and exit, similar to the kprobes and kretprobes, you can
 use fprobe. Compared with kprobes and kretprobes, fprobe gives faster
 instrumentation for multiple functions with single handler. This document
 describes how to use fprobe.
@@ -91,12 +92,14 @@ The prototype of the entry/exit callback function are as 
follows:
 
 .. code-block:: c
 
- int entry_callback(struct fprobe *fp, unsigned long entry_ip, unsigned long 
ret_ip, struct pt_regs *regs, void *entry_data);
+ int entry_callback(struct fprobe *fp, unsigned long entry_ip, unsigned long 
ret_ip, struct ftrace_regs *fregs, void *entry_data);
 
- void exit_callback(struct fprobe *fp, unsigned long entry_ip, unsigned long 
ret_ip, struct pt_regs *regs, void *entry_data);
+ void exit_callback(struct fprobe *fp, unsigned long entry_ip, unsigned long 
ret_ip, struct ftrace_regs *fregs, void *entry_data);
 
-Note that the @entry_ip is saved at function entry and passed to exit handler.
-If the entry callback function returns !0, the corresponding exit callback 
will be cancelled.
+Note that the @entry_ip is saved at function entry and passed to exit
+handler.
+If the entry callback function returns !0, the corresponding exit callback
+will be cancelled.
 
 @fp
 This is the address of `fprobe` data structure related to this handler.
@@ -112,12 +115,10 @@ If the entry callback function returns !0, the 
corresponding exit callback will
 This is the return address that the traced function will return to,
 somewhere in the caller. This can be used at both entry and exit.
 
-@regs
-This is the `pt_regs` data structure at the entry and exit. Note that
-the instruction pointer of @regs may be different from the @entry_ip
-in the entry_handler. If you need traced instruction pointer, you need
-to use @entry_ip. On the other hand, in the exit_handler, the 
instruction
-pointer of @regs is set to the current return address.
+@fregs
+This is the `ftrace_regs` data structure at the entry and exit. This
+includes the function parameters, or the return values. So the user can
+access those values via the appropriate `ftrace_regs_*` APIs.
 
 @entry_data
 This is a local storage to share the data between entry and exit 
handlers.
@@ -125,6 +126,17 @@ If the entry callback function returns !0, the 
corresponding exit callback will
 and `entry_data_size` field when registering the fprobe, the storage is
 allocated and passed to both `entry_handler` and `exit_handler`.
 
+Entry data size and exit handlers on the same function
+=======================================================
+
+Since the entry data is passed via a per-task stack and has a limited size,
+the entry data size per probe is limited to `15 * sizeof(long)`. Also note
+that when different fprobes probe the same function, this limit becomes
+smaller. The entry data size is aligned to `sizeof(long)` and each fprobe
+which has an exit handler uses `sizeof(long)` of space on the stack, so you
+should keep the number of fprobes on the same function as small as possible.
+
 Share the callbacks with kprobes
 
 
@@ -165,8 +177,8 @@ This counter counts up when;
  - fprobe fails to take ftrace_recursion lock. This usually means that a 
function
which is traced by other ftrace users is called from the entry_handler.
 
- - fprobe fails to setup the function exit because of the shortage of rethook
-   (the shadow stack for hooking the function return.)
+ - fprobe fails to setup the function exit because of failing to allocate the
+   data buffer from the per-task shadow stack.
 
 The `fprobe::nmissed` field counts up in both cases. Therefore, the former
 skips both of entry and exit callback and the latter skips the exit




[PATCH v9 34/36] selftests/ftrace: Add a test case for repeating register/unregister fprobe

2024-04-15 Thread Masami Hiramatsu (Google)
From: Masami Hiramatsu (Google) 

This test case repeatedly defines and undefines the fprobe dynamic event
to ensure that fprobe does not cause any issues with such operations.

Signed-off-by: Masami Hiramatsu (Google) 
---
 .../test.d/dynevent/add_remove_fprobe_repeat.tc|   19 +++
 1 file changed, 19 insertions(+)
 create mode 100644 
tools/testing/selftests/ftrace/test.d/dynevent/add_remove_fprobe_repeat.tc

diff --git 
a/tools/testing/selftests/ftrace/test.d/dynevent/add_remove_fprobe_repeat.tc 
b/tools/testing/selftests/ftrace/test.d/dynevent/add_remove_fprobe_repeat.tc
new file mode 100644
index ..b4ad09237e2a
--- /dev/null
+++ b/tools/testing/selftests/ftrace/test.d/dynevent/add_remove_fprobe_repeat.tc
@@ -0,0 +1,19 @@
+#!/bin/sh
+# SPDX-License-Identifier: GPL-2.0
+# description: Generic dynamic event - Repeating add/remove fprobe events
# requires: dynamic_events "f[:[<group>/][<event>]] <func-name>[%return] [<args>]":README
+
+echo 0 > events/enable
+echo > dynamic_events
+
+PLACE=$FUNCTION_FORK
+REPEAT_TIMES=64
+
+for i in `seq 1 $REPEAT_TIMES`; do
+  echo "f:myevent $PLACE" >> dynamic_events
+  grep -q myevent dynamic_events
+  test -d events/fprobes/myevent
+  echo > dynamic_events
+done
+
+clear_trace



