Re: [PATCH v9 07/36] function_graph: Allow multiple users to attach to function graph

2024-04-19 Thread Steven Rostedt
On Mon, 15 Apr 2024 21:50:20 +0900
"Masami Hiramatsu (Google)"  wrote:

> @@ -27,23 +28,157 @@
>  
>  #define FGRAPH_RET_SIZE sizeof(struct ftrace_ret_stack)
>  #define FGRAPH_RET_INDEX DIV_ROUND_UP(FGRAPH_RET_SIZE, sizeof(long))
> +
> +/*
> + * On entry to a function (via function_graph_enter()), a new 
> ftrace_ret_stack
> + * is allocated on the task's ret_stack with indexes entry, then each
> + * fgraph_ops on the fgraph_array[]'s entryfunc is called and if that returns
> + * non-zero, the index into the fgraph_array[] for that fgraph_ops is 
> recorded
> + * on the indexes entry as a bit flag.
> + * As the associated ftrace_ret_stack saved for those fgraph_ops needs to
> + * be found, the index to it is also added to the ret_stack along with the
> + * index of the fgraph_array[] to each fgraph_ops that needs their retfunc
> + * called.
> + *
> + * The top of the ret_stack (when not empty) will always have a reference
> + * to the last ftrace_ret_stack saved. All references to the
> + * ftrace_ret_stack has the format of:
> + *
> + * bits:  0 -  9 offset in words from the previous ftrace_ret_stack
> + *   (bitmap type should have FGRAPH_RET_INDEX always)
> + * bits: 10 - 11 Type of storage
> + * 0 - reserved
> + * 1 - bitmap of fgraph_array index
> + *
> + * For bitmap of fgraph_array index
> + *  bits: 12 - 27 The bitmap of fgraph_ops fgraph_array index

I really hate the terminology I came up with here, and would love to
get better terminology for describing what is going on. I looked it
over but I'm constantly getting confused. And I wrote this code!

Perhaps we should use:

 @frame : The data that represents a single function call. When a
  function is traced, all the data used for all the callbacks
  attached to it is in a single frame. This would replace
  FGRAPH_RET_SIZE with FGRAPH_FRAME_SIZE.

 @offset : This is the position on the stack in word-size units. It
   would replace INDEX, as I think "index" is being used for
   more than one thing. Perhaps it should be "offset" when
   dealing with where it is on the shadow stack, and "pos" when
   dealing with which callback ops is being referenced.
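
For what it's worth, here is a rough, untested sketch of how an entry
word on the shadow stack decodes with the current names, just to make
the layout described above concrete (the helper name is made up):

  /* Hypothetical helper: decode one shadow-stack entry word.
   * Bit layout per the comment above: bits 0-9 offset, 10-11 type,
   * 12-27 bitmap of fgraph_array indexes.
   */
  static inline bool fgraph_entry_wants_ret(unsigned long val, int pos)
  {
          unsigned long type   = (val >> FGRAPH_TYPE_SHIFT) & FGRAPH_TYPE_MASK;
          unsigned long bitmap = (val >> FGRAPH_INDEX_SHIFT) & FGRAPH_INDEX_MASK;

          return type == FGRAPH_TYPE_BITMAP && (bitmap & BIT(pos));
  }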


> + *
> + * That is, at the end of function_graph_enter, if the first and fourth
> + * fgraph_ops on the fgraph_array[] (index 0 and 3) needs their retfunc 
> called
> + * on the return of the function being traced, this is what will be on the
> + * task's shadow ret_stack: (the stack grows upward)
> + *
> + * |                                    | <- task->curr_ret_stack
> + * +------------------------------------+
> + * | bitmap_type(bitmap:(BIT(3)|BIT(0)),|
> + * |         offset:FGRAPH_RET_INDEX)   | <- the offset is from here
> + * +------------------------------------+
> + * | struct ftrace_ret_stack            |
> + * |   (stores the saved ret pointer)   | <- the offset points here
> + * +------------------------------------+
> + * |            (X) | (N)               | ( N words away from
> + * |                                    |   previous ret_stack)
> + *
> + * If a backtrace is required, and the real return pointer needs to be
> + * fetched, then it looks at the task's curr_ret_stack index, if it
> + * is greater than zero (reserved, or right before popped), it would mask
> + * the value by FGRAPH_RET_INDEX_MASK to get the offset index of the
> + * ftrace_ret_stack structure stored on the shadow stack.
> + */
> +
> +#define FGRAPH_RET_INDEX_SIZE  10

Replace SIZE with BITS.

> +#define FGRAPH_RET_INDEX_MASK  GENMASK(FGRAPH_RET_INDEX_SIZE - 1, 0)

  #define FGRAPH_FRAME_SIZE_BITS  10
  #define FGRAPH_FRAME_SIZE_MASK  GENMASK(FGRAPH_FRAME_SIZE_BITS - 1, 0)


> +
> +#define FGRAPH_TYPE_SIZE 2
> +#define FGRAPH_TYPE_MASK GENMASK(FGRAPH_TYPE_SIZE - 1, 0)

  #define FGRAPH_TYPE_BITS  2
  #define FGRAPH_TYPE_MASK  GENMASK(FGRAPH_TYPE_BITS - 1, 0)


> +#define FGRAPH_TYPE_SHIFT  FGRAPH_RET_INDEX_SIZE
> +
> +enum {
> + FGRAPH_TYPE_RESERVED= 0,
> + FGRAPH_TYPE_BITMAP  = 1,
> +};
> +
> +#define FGRAPH_INDEX_SIZE  16

replace "INDEX" with "OPS" as it will be the indexes of ops in the
array.

  #define FGRAPH_OPS_BITS   16
  #define FGRAPH_OPS_MASK   GENMASK(FGRAPH_OPS_BITS - 1, 0)

> +#define FGRAPH_INDEX_MASK  GENMASK(FGRAPH_INDEX_SIZE - 1, 0)
> +#define FGRAPH_INDEX_SHIFT   (FGRAPH_TYPE_SHIFT + FGRAPH_TYPE_SIZE)
> +
> +/* Currently the max stack index can't be more than register callers */
> +#define FGRAPH_MAX_INDEX (FGRAPH_INDEX_SIZE + FGRAPH_RET_INDEX)

FGRAPH_MAX_INDEX isn't even used. Let's delete it.

> +
> +#define FGRAPH_ARRAY_SIZE  FGRAPH_INDEX_SIZE

  #define FGRAPH_ARRAY_SIZE FGRAPH_OPS_BITS

> +
>  #define SHADOW_STACK_SIZE (PAGE_SIZE)
>  #define SHADOW_STACK_INDEX (SHADOW_STACK_SIZE / sizeof(long))
>  /* Leave 

Re: [PATCH net-next v6 0/7] Implement reset reason mechanism to detect

2024-04-19 Thread Steven Rostedt
On Fri, 19 Apr 2024 16:00:20 +0800
Jason Xing  wrote:

> If other experts see this thread, please help me. I would appreciate
> it. I have strong interests and feel strong responsibility to
> implement something like this patch series. It can be very useful!!

I'm not a networking expert, but as I'm Cc'd and this is about tracing,
I'll jump in to see if I can help. Honestly, reading the thread, it
appears that you and Eric are talking past each other.

I believe Eric is concerned about losing the value of the enum. Enums
are types, and if you typecast them to another type, they lose the
previous type, and all the safety that goes with it.

Now, I do not really understand the problem being solved here. I
understand how TCP works, but I have never looked into the
implementation of MPTCP.

You added this:

+static inline enum sk_rst_reason convert_mptcp_reason(u32 reason)
+{
+   return reason += RST_REASON_START;
+}

And used it for places like this:

@@ -309,8 +309,13 @@ static struct dst_entry *subflow_v4_route_req(const struct 
sock *sk,
return dst;
 
dst_release(dst);
-   if (!req->syncookie)
-   tcp_request_sock_ops.send_reset(sk, skb, 
SK_RST_REASON_NOT_SPECIFIED);
+   if (!req->syncookie) {
+   struct mptcp_ext *mpext = mptcp_get_ext(skb);
+   enum sk_rst_reason reason;
+
+   reason = convert_mptcp_reason(mpext->reset_reason);
+   tcp_request_sock_ops.send_reset(sk, skb, reason);
+   }
return NULL;
 }

As I don't know this code or how MPTCP works, I do not understand the
above. It used to pass SK_RST_REASON_NOT_SPECIFIED to send_reset(). But
now it takes a "reset_reason" and calls convert_mptcp_reason() to get
back an enum value.

If you are mapping the reset_reason to enum sk_rst_reason, why not do
it via a real conversion instead of this fragile arithmetic between the two
values?

static inline enum sk_rst_reason convert_mptcp_reason(u32 reason)
{
	switch (reason) {
	case 0: return SK_RST_REASON_MPTCP_RST_EUNSPEC;
	case 1: return SK_RST_REASON_MPTCP_RST_EMPTCP;
	case 2: return SK_RST_REASON_MPTCP_RST_ERESOURCE;
	[..]
	default: return SK_RST_REASON_MAX; // or some other error value
	}
}

I'm not sure if this is any better, but it's not doing any casting and
it's easier to understand. It's a simple mapping between the reason and
the enum and there's no inherent dependency between the values. Could
possibly create enums for the reason numbers and replace the hard-coded
values with them (a rough sketch of that follows below).

That way that helper function is at least doing a real conversion of
one type to another.
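
As a concrete (and purely illustrative) sketch of that suggestion,
assuming the existing MPTCP_RST_* reset codes can be used as the case
labels:

  static inline enum sk_rst_reason convert_mptcp_reason(u32 reason)
  {
          switch (reason) {
          case MPTCP_RST_EUNSPEC:   return SK_RST_REASON_MPTCP_RST_EUNSPEC;
          case MPTCP_RST_EMPTCP:    return SK_RST_REASON_MPTCP_RST_EMPTCP;
          case MPTCP_RST_ERESOURCE: return SK_RST_REASON_MPTCP_RST_ERESOURCE;
          /* ... one case per remaining reset code ... */
          default:                  return SK_RST_REASON_NOT_SPECIFIED;
          }
  }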

But like I said from the beginning. I don't understand the details here
and have not spent the time to dig deeper. I just read the thread and I
agree with Eric that the arithmetic conversion of reason to an enum
looks fragile at best and buggy at worst.

-- Steve



Re: [PATCH] ftrace: Replace ftrace_disabled variable with ftrace_is_dead function

2024-04-19 Thread Steven Rostedt
On Fri, 19 Apr 2024 22:38:44 +0800
"Bang Li"  wrote:

> Use the existing function ftrace_is_dead to replace the variable to make
> the code clearer.
> 
> Signed-off-by: Bang Li 
> ---
>  kernel/trace/ftrace.c | 46 +--
>  1 file changed, 23 insertions(+), 23 deletions(-)
> 
> diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c
> index 50ca4d4f8840..4a08c79db677 100644
> --- a/kernel/trace/ftrace.c
> +++ b/kernel/trace/ftrace.c
> @@ -2693,7 +2693,7 @@ void __weak ftrace_replace_code(int mod_flags)
>   int schedulable = mod_flags & FTRACE_MODIFY_MAY_SLEEP_FL;
>   int failed;
>  
> - if (unlikely(ftrace_disabled))
> + if (unlikely(ftrace_is_dead()))
>   return;
>  

NACK!

ftrace_is_dead() is only there to make the static variable
"ftrace_disabled" available for code outside of ftrace.c. In ftrace.c,
it is perfectly fine to use ftrace_disabled.
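
(For reference, the accessor is essentially just a wrapper around the
variable, roughly:

  /* kernel/trace/ftrace.c */
  int ftrace_is_dead(void)
  {
          return ftrace_disabled;
  }

so calling it from within ftrace.c only adds indirection.)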

-- Steve



Re: [PATCH v9 00/36] tracing: fprobe: function_graph: Multi-function graph and fprobe on fgraph

2024-04-19 Thread Steven Rostedt
On Fri, 19 Apr 2024 14:36:18 +0900
Masami Hiramatsu (Google)  wrote:

> Hi Steve,
> 
> Can you review this series? Especially, [07/36] and [12/36] has been changed
> a lot from your original patch.

I haven't forgotten (just been a bit hectic).

Worst comes to worst, I'll review it tomorrow.

-- Steve

> 
> Thank you,
> 
> On Mon, 15 Apr 2024 21:48:59 +0900
> "Masami Hiramatsu (Google)"  wrote:
> 
> > Hi,
> > 
> > Here is the 9th version of the series to re-implement the fprobe on
> > function-graph tracer. The previous version is;
> > 
> > https://lore.kernel.org/all/170887410337.564249.6360118840946697039.stgit@devnote2/
> > 
> > This version is ported on the latest kernel (v6.9-rc3 + probes/for-next)
> > and fixed some bugs + performance optimization patch[36/36].
> >  - [12/36] Fix to clear fgraph_array entry in registration failure, also
> >return -ENOSPC when fgraph_array is full.
> >  - [28/36] Add new store_fprobe_entry_data() for fprobe.
> >  - [31/36] Remove DIV_ROUND_UP() and fix entry data address calculation.
> >  - [36/36] Add new flag to skip timestamp recording.
> > 
> > Overview
> > 
> > This series does major 2 changes, enable multiple function-graphs on
> > the ftrace (e.g. allow function-graph on sub instances) and rewrite the
> > fprobe on this function-graph.
> > 
> > The former changes had been sent from Steven Rostedt 4 years ago (*),
> > which allows users to set different setting function-graph tracer (and
> > other tracers based on function-graph) in each trace-instances at the
> > same time.
> > 
> > (*) https://lore.kernel.org/all/20190525031633.811342...@goodmis.org/
> > 
> > The purpose of latter change are;
> > 
> >  1) Remove dependency of the rethook from fprobe so that we can reduce
> >the return hook code and shadow stack.
> > 
> >  2) Make 'ftrace_regs' the common trace interface for the function
> >boundary.
> > 
> > 1) Currently we have 2(or 3) different function return hook codes,
> >  the function-graph tracer and rethook (and legacy kretprobe).
> >  But since this  is redundant and needs double maintenance cost,
> >  I would like to unify those. From the user's viewpoint, function-
> >  graph tracer is very useful to grasp the execution path. For this
> >  purpose, it is hard to use the rethook in the function-graph
> >  tracer, but the opposite is possible. (Strictly speaking, kretprobe
> >  can not use it because it requires 'pt_regs' for historical reasons.)
> > 
> > 2) Now the fprobe provides the 'pt_regs' for its handler, but that is
> >  wrong for the function entry and exit. Moreover, depending on the
> >  architecture, there is no way to accurately reproduce 'pt_regs'
> >  outside of interrupt or exception handlers. This means fprobe should
> >  not use 'pt_regs' because it does not use such exceptions.
> >  (Conversely, kprobe should use 'pt_regs' because it is an abstract
> >   interface of the software breakpoint exception.)
> > 
> > This series changes fprobe to use function-graph tracer for tracing
> > function entry and exit, instead of mixture of ftrace and rethook.
> > Unlike the rethook which is a per-task list of system-wide allocated
> > nodes, the function graph's ret_stack is a per-task shadow stack.
> > Thus it does not need to set 'nr_maxactive' (which is the number of
> > pre-allocated nodes).
> > Also the handlers will get the 'ftrace_regs' instead of 'pt_regs'.
> > Since eBPF mulit_kprobe/multi_kretprobe events still use 'pt_regs' as
> > their register interface, this changes it to convert 'ftrace_regs' to
> > 'pt_regs'. Of course this conversion makes an incomplete 'pt_regs',
> > so users must access only registers for function parameters or
> > return value. 
> > 
> > Design
> > --
> > Instead of using ftrace's function entry hook directly, the new fprobe
> > is built on top of the function-graph's entry and return callbacks
> > with 'ftrace_regs'.
> > 
> > Since the fprobe requires access to 'ftrace_regs', the architecture
> > must support CONFIG_HAVE_DYNAMIC_FTRACE_WITH_ARGS and
> > CONFIG_HAVE_FTRACE_GRAPH_FUNC, which enables to call function-graph
> > entry callback with 'ftrace_regs', and also
> > CONFIG_HAVE_FUNCTION_GRAPH_FREGS, which passes the ftrace_regs to
> > return_to_handler.
> > 
> > All fprobes share a single function-graph ops (means shares a common
> > ftrace filter) similar to the kprobe-on-ftrace. This needs another
> > layer to find corresponding fprobe in the 

Re: [PATCH v20 2/5] ring-buffer: Introducing ring-buffer mapping functions

2024-04-18 Thread Steven Rostedt
On Thu, 18 Apr 2024 09:55:55 +0300
Mike Rapoport  wrote:

Hi Mike,

Thanks for doing this review!

> > +/**
> > + * struct trace_buffer_meta - Ring-buffer Meta-page description
> > + * @meta_page_size:Size of this meta-page.
> > + * @meta_struct_len:   Size of this structure.
> > + * @subbuf_size:   Size of each sub-buffer.
> > + * @nr_subbufs:Number of subbufs in the ring-buffer, including 
> > the reader.
> > + * @reader.lost_events:Number of events lost at the time of the reader 
> > swap.
> > + * @reader.id: subbuf ID of the current reader. ID range [0 : 
> > @nr_subbufs - 1]
> > + * @reader.read:   Number of bytes read on the reader subbuf.
> > + * @flags: Placeholder for now, 0 until new features are supported.
> > + * @entries:   Number of entries in the ring-buffer.
> > + * @overrun:   Number of entries lost in the ring-buffer.
> > + * @read:  Number of entries that have been read.
> > + * @Reserved1: Reserved for future use.
> > + * @Reserved2: Reserved for future use.
> > + */
> > +struct trace_buffer_meta {
> > +   __u32   meta_page_size;
> > +   __u32   meta_struct_len;
> > +
> > +   __u32   subbuf_size;
> > +   __u32   nr_subbufs;
> > +
> > +   struct {
> > +   __u64   lost_events;
> > +   __u32   id;
> > +   __u32   read;
> > +   } reader;
> > +
> > +   __u64   flags;
> > +
> > +   __u64   entries;
> > +   __u64   overrun;
> > +   __u64   read;
> > +
> > +   __u64   Reserved1;
> > +   __u64   Reserved2;  
> 
> Why do you need reserved fields? This structure always resides in the
> beginning of a page and the rest of the page is essentially "reserved".

So this code is also going to be used in arm's pkvm hypervisor code,
where it will be using these fields. But since we are looking at
keeping the same interface between the two, we don't want these fields
used by this interface.

We probably should add a comment about that.
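
Perhaps something along these lines (wording only a suggestion):

  /*
   * Reserved for the pkvm hypervisor use of this same layout.
   * Not used by this user-space mapping interface.
   */
  __u64   Reserved1;
  __u64   Reserved2;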

> 
> > +};
> > +
> > +#endif /* _TRACE_MMAP_H_ */
> > diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
> > index cc9ebe593571..793ecc454039 100644
> > --- a/kernel/trace/ring_buffer.c
> > +++ b/kernel/trace/ring_buffer.c  
> 
> ... 
> 
> > +static void rb_setup_ids_meta_page(struct ring_buffer_per_cpu *cpu_buffer,
> > +  unsigned long *subbuf_ids)
> > +{
> > +   struct trace_buffer_meta *meta = cpu_buffer->meta_page;
> > +   unsigned int nr_subbufs = cpu_buffer->nr_pages + 1;
> > +   struct buffer_page *first_subbuf, *subbuf;
> > +   int id = 0;
> > +
> > +   subbuf_ids[id] = (unsigned long)cpu_buffer->reader_page->page;
> > +   cpu_buffer->reader_page->id = id++;
> > +
> > +   first_subbuf = subbuf = rb_set_head_page(cpu_buffer);
> > +   do {
> > +   if (WARN_ON(id >= nr_subbufs))
> > +   break;
> > +
> > +   subbuf_ids[id] = (unsigned long)subbuf->page;
> > +   subbuf->id = id;
> > +
> > +   rb_inc_page();
> > +   id++;
> > +   } while (subbuf != first_subbuf);
> > +
> > +   /* install subbuf ID to kern VA translation */
> > +   cpu_buffer->subbuf_ids = subbuf_ids;
> > +
> > +   /* __rb_map_vma() pads the meta-page to align it with the sub-buffers */
> > +   meta->meta_page_size = PAGE_SIZE << cpu_buffer->buffer->subbuf_order;  
> 
> Isn't this a single page?

One thing we are doing is to make sure that the subbuffers are aligned
by their size. If a subbuffer is 3 pages, it should be aligned on 3
page boundaries. This was something that Linus suggested.

> 
> > +   meta->meta_struct_len = sizeof(*meta);
> > +   meta->nr_subbufs = nr_subbufs;
> > +   meta->subbuf_size = cpu_buffer->buffer->subbuf_size + BUF_PAGE_HDR_SIZE;
> > +
> > +   rb_update_meta_page(cpu_buffer);
> > +}  
> 
> ...
> 
> > +#define subbuf_page(off, start) \
> > +   virt_to_page((void *)((start) + ((off) << PAGE_SHIFT)))
> > +
> > +#define foreach_subbuf_page(sub_order, start, page)\  
> 
> Nit: usually iterators in kernel use for_each

Ah, good catch. Yeah, that should be changed. But then ...

> 
> > +   page = subbuf_page(0, (start)); \
> > +   for (int __off = 0; __off < (1 << (sub_order)); \
> > +__off++, page = subbuf_page(__off, (start)))  
> 
> The pages are allocated with alloc_pages_node(.. subbuf_order) are
> physically contiguous and struct pages for them are also contiguous, so
> inside a subbuf_order allocation you can just do page++.
> 

I'm wondering if we should just nuke the macro. It was there because
the previous implementation did things twice. But now it's just done
once here (a simplified sketch without it follows the snippet):

+   while (s < nr_subbufs && p < nr_pages) {
+   struct page *page;
+
+   foreach_subbuf_page(subbuf_order, cpu_buffer->subbuf_ids[s], 
page) {
+   if (p >= nr_pages)
+   break;
+
+   pages[p++] = page;
+ 
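
That is, something like this (untested) sketch, dropping the macro and
relying on the pages of a sub-buffer being physically contiguous as you
point out:

  while (s < nr_subbufs && p < nr_pages) {
          /* First page of this sub-buffer */
          struct page *page = virt_to_page((void *)cpu_buffer->subbuf_ids[s]);
          int off;

          /* Walk the contiguous pages of the sub-buffer with page++ */
          for (off = 0; off < (1 << subbuf_order) && p < nr_pages; off++, page++)
                  pages[p++] = page;
          s++;
  }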

Re: [PATCH for-next v2] tracing/kprobes: Add symbol counting check when module loads

2024-04-15 Thread Steven Rostedt
On Mon, 15 Apr 2024 18:40:23 +0900
"Masami Hiramatsu (Google)"  wrote:

> Check the number of probe target symbols in the target module when
> the module is loaded. If the probe is not on the unique name symbols
> in the module, it will be rejected at that point.
> 
> Note that the symbol which has a unique name in the target module,
> it will be accepted even if there are same-name symbols in the
> kernel or other modules,

This says what it does, but doesn't explain why it is doing it.
What's the purpose of this patch?

-- Steve



[PATCH] ASoC: tracing: Export SND_SOC_DAPM_DIR_OUT to its value

2024-04-15 Thread Steven Rostedt


The string SND_SOC_DAPM_DIR_OUT is printed in the snd_soc_dapm_path trace
event instead of its value:

   (((REC->path_dir) == SND_SOC_DAPM_DIR_OUT) ? "->" : "<-")

User space cannot parse this, as it has no idea what SND_SOC_DAPM_DIR_OUT
is. Use TRACE_DEFINE_ENUM() to convert it to its value:

   (((REC->path_dir) == 1) ? "->" : "<-")

So that user space tools, such as perf and trace-cmd, can parse it
correctly.

Reported-by: Luca Ceresoli 
Fixes: 6e588a0d839b5 ("ASoC: dapm: Consolidate path trace events")
Signed-off-by: Steven Rostedt (Google) 
---
 include/trace/events/asoc.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/include/trace/events/asoc.h b/include/trace/events/asoc.h
index 4eed9028bb11..517015ef36a8 100644
--- a/include/trace/events/asoc.h
+++ b/include/trace/events/asoc.h
@@ -12,6 +12,8 @@
 #define DAPM_DIRECT "(direct)"
 #define DAPM_ARROW(dir) (((dir) == SND_SOC_DAPM_DIR_OUT) ? "->" : "<-")
 
+TRACE_DEFINE_ENUM(SND_SOC_DAPM_DIR_OUT);
+
 struct snd_soc_jack;
 struct snd_soc_card;
 struct snd_soc_dapm_widget;
-- 
2.43.0




Re: TP_printk() bug with %c, and more?

2024-04-15 Thread Steven Rostedt
On Tue, 16 Apr 2024 04:08:46 +0200
Luca Ceresoli  wrote:

> Thanks for the insight. I'm definitely trying to fix this based on your
> hint as soon as I get my hand on a board.

I have a patch I forgot to send out. Let me do that now.

-- Steve



Re: TP_printk() bug with %c, and more?

2024-04-15 Thread Steven Rostedt
On Mon, 18 Mar 2024 16:43:07 +0100
Luca Ceresoli  wrote:

> However the arrows are still reversed.

This requires a kernel change. The problem is that the print fmt has:

print fmt: "%c%s %s %s %s %s", (int) REC->path_node && (int) REC->path_connect 
? '*' : ' ', __get_str(wname), (((REC->path_dir) == SND_SOC_DAPM_DIR_OUT) ? 
"->" : "<-"), __get_str(pname), (((REC->path_dir) == SND_SOC_DAPM_DIR_OUT) ? 
"->" : "<-"), __get_str(pnname)

User space (trace-cmd and perf) have no idea what SND_SOC_DAPM_DIR_OUT
is. The kernel needs to convert that, otherwise the parsing will fail,
or it will default it to zero.

-- Steve



Re: [RFC PATCH 0/4] perf: Correlating user process data to samples

2024-04-13 Thread Steven Rostedt
On Sat, 13 Apr 2024 12:53:38 +0200
Peter Zijlstra  wrote:

> On Fri, Apr 12, 2024 at 09:37:24AM -0700, Beau Belgrave wrote:
> 
> > > Anyway, since we typically run stuff from NMI context, accessing user
> > > data is 'interesting'. As such I would really like to make this work
> > > depend on the call-graph rework that pushes all the user access bits
> > > into return-to-user.  
> > 
> > Cool, I assume that's the SFRAME work? Are there pointers to work I
> > could look at and think about what a rebase looks like? Or do you have
> > someone in mind I should work with for this?  
> 
> I've been offline for a little while and still need to catch up with
> things myself.
> 
> Josh was working on that when I dropped off IIRC, I'm not entirely sure
> where things are at currently (and there is no way I can ever hope to
> process the backlog).
> 
> Anybody know where we are with that?

It's still very much on my RADAR, but with layoffs and such, my
priorities have unfortunately changed. I'm hoping to start helping out
in the near future though (in a month or two).

Josh was working on it, but I think he got pulled off onto other
priorities too :-p

-- Steve



Re: [PATCH v2] tracing: Add sched_prepare_exec tracepoint

2024-04-11 Thread Steven Rostedt
On Thu, 11 Apr 2024 08:15:05 -0700
Kees Cook  wrote:

> This looks good to me. If tracing wants to take it:
> 
> Acked-by: Kees Cook 
> 
> If not, I can take it in my tree if I get a tracing Ack. :)

You can take it.

Acked-by: Steven Rostedt (Google) 

-- Steve



[PATCH v2 11/11] tracing: Update function tracing output for previous boot buffer

2024-04-10 Thread Steven Rostedt
From: "Steven Rostedt (Google)" 

For a persistent ring buffer that is saved across boots, if function
tracing was performed in the previous boot, it only saves the address of
the functions and uses "%pS" to print their names. But in the current
boot, those functions may be at different locations. The persistent
meta-data saves the text delta between the two boots, which can be used
to find where the saved functions are located in the current boot.

Signed-off-by: Steven Rostedt (Google) 
---
 kernel/trace/trace_output.c | 9 ++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/kernel/trace/trace_output.c b/kernel/trace/trace_output.c
index d8b302d01083..b9d2c64c0648 100644
--- a/kernel/trace/trace_output.c
+++ b/kernel/trace/trace_output.c
@@ -990,8 +990,11 @@ enum print_line_t trace_nop_print(struct trace_iterator 
*iter, int flags,
 }
 
 static void print_fn_trace(struct trace_seq *s, unsigned long ip,
-  unsigned long parent_ip, int flags)
+  unsigned long parent_ip, long delta, int flags)
 {
+   ip += delta;
+   parent_ip += delta;
+
seq_print_ip_sym(s, ip, flags);
 
if ((flags & TRACE_ITER_PRINT_PARENT) && parent_ip) {
@@ -1009,7 +1012,7 @@ static enum print_line_t trace_fn_trace(struct 
trace_iterator *iter, int flags,
 
trace_assign_type(field, iter->ent);
 
-   print_fn_trace(s, field->ip, field->parent_ip, flags);
+   print_fn_trace(s, field->ip, field->parent_ip, iter->tr->text_delta, 
flags);
trace_seq_putc(s, '\n');
 
return trace_handle_return(s);
@@ -1674,7 +1677,7 @@ trace_func_repeats_print(struct trace_iterator *iter, int 
flags,
 
trace_assign_type(field, iter->ent);
 
-   print_fn_trace(s, field->ip, field->parent_ip, flags);
+   print_fn_trace(s, field->ip, field->parent_ip, iter->tr->text_delta, 
flags);
trace_seq_printf(s, " (repeats: %u, last_ts:", field->count);
trace_print_time(s, iter,
 iter->ts - FUNC_REPEATS_GET_DELTA_TS(field));
-- 
2.43.0





[PATCH v2 10/11] tracing: Handle old buffer mappings for event strings and functions

2024-04-10 Thread Steven Rostedt
From: "Steven Rostedt (Google)" 

Use the saved text_delta and data_delta of a persistent memory mapped ring
buffer that was saved from a previous boot, and use the delta in the trace
event print output so that strings and functions show up normally.

That is, for an event like trace_kmalloc() that prints the callsite via
"%pS", if it used the address saved in the ring buffer it will not match
the function that was saved in the previous boot if the kernel remaps
itself between boots.

For RCU events that point to saved static strings where only the address
of the string is saved in the ring buffer, it too will be adjusted to
point to where the string is on the current boot.

Signed-off-by: Steven Rostedt (Google) 
---
 kernel/trace/trace.c | 42 +++---
 1 file changed, 39 insertions(+), 3 deletions(-)

diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 33ef5311fa39..bac5db67cf15 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -3671,8 +3671,11 @@ static void test_can_verify(void)
 void trace_check_vprintf(struct trace_iterator *iter, const char *fmt,
 va_list ap)
 {
+   long text_delta = iter->tr->text_delta;
+   long data_delta = iter->tr->data_delta;
const char *p = fmt;
const char *str;
+   bool good;
int i, j;
 
if (WARN_ON_ONCE(!fmt))
@@ -3691,7 +3694,10 @@ void trace_check_vprintf(struct trace_iterator *iter, 
const char *fmt,
 
j = 0;
 
-   /* We only care about %s and variants */
+   /*
+* We only care about %s and variants
+* as well as %p[sS] if delta is non-zero
+*/
for (i = 0; p[i]; i++) {
if (i + 1 >= iter->fmt_size) {
/*
@@ -3720,6 +3726,11 @@ void trace_check_vprintf(struct trace_iterator *iter, 
const char *fmt,
}
if (p[i+j] == 's')
break;
+
+   if (text_delta && p[i+1] == 'p' &&
+   ((p[i+2] == 's' || p[i+2] == 'S')))
+   break;
+
star = false;
}
j = 0;
@@ -3733,6 +3744,24 @@ void trace_check_vprintf(struct trace_iterator *iter, 
const char *fmt,
iter->fmt[i] = '\0';
trace_seq_vprintf(&iter->seq, iter->fmt, ap);
 
+   /* Add delta to %pS pointers */
+   if (p[i+1] == 'p') {
+   unsigned long addr;
+   char fmt[4];
+
+   fmt[0] = '%';
+   fmt[1] = 'p';
+   fmt[2] = p[i+2]; /* Either %ps or %pS */
+   fmt[3] = '\0';
+
+   addr = va_arg(ap, unsigned long);
+   addr += text_delta;
+   trace_seq_printf(&iter->seq, fmt, (void *)addr);
+
+   p += i + 3;
+   continue;
+   }
+
/*
 * If iter->seq is full, the above call no longer guarantees
 * that ap is in sync with fmt processing, and further calls
@@ -3751,6 +3780,14 @@ void trace_check_vprintf(struct trace_iterator *iter, 
const char *fmt,
/* The ap now points to the string data of the %s */
str = va_arg(ap, const char *);
 
+   good = trace_safe_str(iter, str, star, len);
+
+   /* Could be from the last boot */
+   if (data_delta && !good) {
+   str += data_delta;
+   good = trace_safe_str(iter, str, star, len);
+   }
+
/*
 * If you hit this warning, it is likely that the
 * trace event in question used %s on a string that
@@ -3760,8 +3797,7 @@ void trace_check_vprintf(struct trace_iterator *iter, 
const char *fmt,
 * instead. See samples/trace_events/trace-events-sample.h
 * for reference.
 */
-   if (WARN_ONCE(!trace_safe_str(iter, str, star, len),
- "fmt: '%s' current_buffer: '%s'",
+   if (WARN_ONCE(!good, "fmt: '%s' current_buffer: '%s'",
  fmt, seq_buf_str(&iter->seq.seq))) {
int ret;
 
-- 
2.43.0





[PATCH v2 09/11] tracing/ring-buffer: Add last_boot_info file to boot instance

2024-04-10 Thread Steven Rostedt
From: "Steven Rostedt (Google)" 

If an instance is mapped to memory on boot up, create a new file called
"last_boot_info" that will hold information that can be used to properly
parse the raw data in the ring buffer.

It will export the delta of the text and data addresses from what they
were in the last boot. It does not expose actual addresses (unless you
knew what the actual address was from the last boot).

The output will look like:

 # cat last_boot_info
 text delta:-268435456
 data delta:-268435456

The text and data are kept separate in case they are ever made different.

Signed-off-by: Steven Rostedt (Google) 
---
 include/linux/ring_buffer.h |  3 +++
 kernel/trace/ring_buffer.c  | 23 ++
 kernel/trace/trace.c| 47 -
 kernel/trace/trace.h|  2 ++
 4 files changed, 74 insertions(+), 1 deletion(-)

diff --git a/include/linux/ring_buffer.h b/include/linux/ring_buffer.h
index a50b0223b1d3..55de3798a9b9 100644
--- a/include/linux/ring_buffer.h
+++ b/include/linux/ring_buffer.h
@@ -94,6 +94,9 @@ struct trace_buffer *__ring_buffer_alloc_range(unsigned long 
size, unsigned flag
   unsigned long range_size,
   struct lock_class_key *key);
 
+bool ring_buffer_last_boot_delta(struct trace_buffer *buffer, long *text,
+long *data);
+
 /*
  * Because the ring buffer is generic, if other users of the ring buffer get
  * traced by ftrace, it can produce lockdep warnings. We need to keep each
diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index c68695ae2749..e189f500ac32 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -2390,6 +2390,29 @@ struct trace_buffer *__ring_buffer_alloc_range(unsigned 
long size, unsigned flag
return alloc_buffer(size, flags, order, start, start + range_size, key);
 }
 
+/**
+ * ring_buffer_last_boot_delta - return the delta offset from last boot
+ * @buffer: The buffer to return the delta from
+ * @text: Return text delta
+ * @data: Return data delta
+ *
+ * Returns: true if the delta is non-zero
+ */
+bool ring_buffer_last_boot_delta(struct trace_buffer *buffer, long *text,
+long *data)
+{
+   if (!buffer)
+   return false;
+
+   if (!buffer->last_text_delta)
+   return false;
+
+   *text = buffer->last_text_delta;
+   *data = buffer->last_data_delta;
+
+   return true;
+}
+
 /**
  * ring_buffer_free - free a ring buffer.
  * @buffer: the buffer to free.
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index b2b16be6e4ec..33ef5311fa39 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -6041,6 +6041,18 @@ ssize_t tracing_resize_ring_buffer(struct trace_array 
*tr,
return ret;
 }
 
+static void update_last_data(struct trace_array *tr)
+{
+   if (!tr->text_delta && !tr->data_delta)
+   return;
+
+   /* Clear old data */
+   tracing_reset_online_cpus(&tr->array_buffer);
+
+   /* Using current data now */
+   tr->text_delta = 0;
+   tr->data_delta = 0;
+}
 
 /**
  * tracing_update_buffers - used by tracing facility to expand ring buffers
@@ -6058,6 +6070,9 @@ int tracing_update_buffers(struct trace_array *tr)
int ret = 0;
 
mutex_lock(&trace_types_lock);
+
+   update_last_data(tr);
+
if (!tr->ring_buffer_expanded)
ret = __tracing_resize_ring_buffer(tr, trace_buf_size,
RING_BUFFER_ALL_CPUS);
@@ -6113,6 +6128,8 @@ int tracing_set_tracer(struct trace_array *tr, const char 
*buf)
 
mutex_lock(&trace_types_lock);
 
+   update_last_data(tr);
+
if (!tr->ring_buffer_expanded) {
ret = __tracing_resize_ring_buffer(tr, trace_buf_size,
RING_BUFFER_ALL_CPUS);
@@ -6860,6 +6877,21 @@ tracing_total_entries_read(struct file *filp, char 
__user *ubuf,
return simple_read_from_buffer(ubuf, cnt, ppos, buf, r);
 }
 
+static ssize_t
+tracing_last_boot_read(struct file *filp, char __user *ubuf, size_t cnt, 
loff_t *ppos)
+{
+   struct trace_array *tr = filp->private_data;
+   struct seq_buf seq;
+   char buf[64];
+
+   seq_buf_init(&seq, buf, 64);
+
+   seq_buf_printf(&seq, "text delta:\t%ld\n", tr->text_delta);
+   seq_buf_printf(&seq, "data delta:\t%ld\n", tr->data_delta);
+
+   return simple_read_from_buffer(ubuf, cnt, ppos, buf, seq_buf_used(&seq));
+}
+
 static int tracing_buffer_meta_open(struct inode *inode, struct file *filp)
 {
struct trace_array *tr = inode->i_private;
@@ -7499,6 +7531,13 @@ static const struct file_operations 
trace_time_stamp_mode_fops = {
.release= tracing_single_release_tr,
 };
 
+static const s

[PATCH v2 08/11] ring-buffer: Save text and data locations in mapped meta data

2024-04-10 Thread Steven Rostedt
From: "Steven Rostedt (Google)" 

When a ring buffer is mapped to a specific address, save the address of a
text function and some data. This will be used to determine the delta
between the last boot and the current boot for pointers to functions as
well as to data.

Signed-off-by: Steven Rostedt (Google) 
---
 kernel/trace/ring_buffer.c | 16 
 1 file changed, 16 insertions(+)

diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 6f19e7612472..c68695ae2749 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -45,6 +45,8 @@
 static void update_pages_handler(struct work_struct *work);
 
 struct ring_buffer_meta {
+   unsigned long   text_addr;
+   unsigned long   data_addr;
unsigned long   first_buffer;
unsigned long   head_buffer;
unsigned long   commit_buffer;
@@ -539,6 +541,9 @@ struct trace_buffer {
unsigned long   range_addr_start;
unsigned long   range_addr_end;
 
+   longlast_text_delta;
+   longlast_data_delta;
+
unsigned intsubbuf_size;
unsigned intsubbuf_order;
unsigned intmax_data_size;
@@ -1827,10 +1832,15 @@ static void rb_meta_validate_events(struct 
ring_buffer_per_cpu *cpu_buffer)
}
 }
 
+/* Used to calculate data delta */
+static char rb_data_ptr[] __initdata = "";
+
 static void rb_range_meta_init(struct trace_buffer *buffer, int nr_pages)
 {
struct ring_buffer_meta *meta;
unsigned long delta;
+   unsigned long this_text = (unsigned long)rb_range_meta_init;
+   unsigned long this_data = (unsigned long)rb_data_ptr;
void *subbuf;
int cpu;
int i;
@@ -1847,6 +1857,10 @@ static void rb_range_meta_init(struct trace_buffer 
*buffer, int nr_pages)
meta->first_buffer += delta;
meta->head_buffer += delta;
meta->commit_buffer += delta;
+   buffer->last_text_delta = this_text - meta->text_addr;
+   buffer->last_data_delta = this_data - meta->data_addr;
+   meta->text_addr = this_text;
+   meta->data_addr = this_data;
continue;
}
 
@@ -1863,6 +1877,8 @@ static void rb_range_meta_init(struct trace_buffer 
*buffer, int nr_pages)
subbuf = rb_subbufs_from_meta(meta);
 
meta->first_buffer = (unsigned long)subbuf;
+   meta->text_addr = this_text;
+   meta->data_addr = this_data;
 
/*
 * The buffers[] array holds the order of the sub-buffers
-- 
2.43.0





[PATCH v2 07/11] ring-buffer: Validate boot range memory events

2024-04-10 Thread Steven Rostedt
From: "Steven Rostedt (Google)" 

Make sure all the events in each of the sub-buffers that were mapped in a
memory region are valid. This moves the code that walks the buffers for
time-stamp validation out of the CONFIG_RING_BUFFER_VALIDATE_TIME_DELTAS
ifdef block and is used to validate the content. Only the ring buffer
event meta data is checked and not the data load.

This also has a second purpose. The buffer_page structure that points to
the data sub-buffers has accounting that keeps track of the number of
events that are on the sub-buffer. This updates that counter as well. That
counter is used in reading the buffer and knowing if the ring buffer is
empty or not.

Signed-off-by: Steven Rostedt (Google) 
---
 kernel/trace/ring_buffer.c | 190 -
 1 file changed, 166 insertions(+), 24 deletions(-)

diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index d05b60fd5695..6f19e7612472 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -1663,10 +1663,170 @@ static bool rb_meta_valid(struct ring_buffer_meta 
*meta, int cpu,
subbuf = (void *)subbuf + subbuf_size;
}
 
-   pr_info("Ring buffer meta is from previous boot!\n");
return true;
 }
 
+static int rb_meta_subbuf_idx(struct ring_buffer_meta *meta, void *subbuf);
+
+#ifdef CONFIG_RING_BUFFER_VALIDATE_TIME_DELTAS
+static DEFINE_PER_CPU(atomic_t, checking);
+static atomic_t ts_dump;
+
+#define buffer_warn_return(fmt, ...)   \
+   do {\
+   /* If another report is happening, ignore this one */   \
+   if (atomic_inc_return(&ts_dump) != 1) { \
+   atomic_dec(&ts_dump);   \
+   goto out;   \
+   }   \
+   atomic_inc(&cpu_buffer->record_disabled);   \
+   pr_warn(fmt, ##__VA_ARGS__);\
+   dump_buffer_page(bpage, info, tail);\
+   atomic_dec(&ts_dump);   \
+   /* There's some cases in boot up that this can happen */ \
+   if (WARN_ON_ONCE(system_state != SYSTEM_BOOTING))   \
+   /* Do not re-enable checking */ \
+   return; \
+   } while (0)
+#else
+#define buffer_warn_return(fmt, ...) do { } while (0)
+#endif
+
+static int rb_read_data_buffer(struct buffer_data_page *dpage, int tail, int 
cpu,
+  unsigned long long *timestamp, bool warn)
+{
+   struct ring_buffer_event *event;
+   u64 ts, delta;
+   int events = 0;
+   int e;
+
+   ts = dpage->time_stamp;
+
+   for (e = 0; e < tail; e += rb_event_length(event)) {
+
+   event = (struct ring_buffer_event *)(dpage->data + e);
+
+   switch (event->type_len) {
+
+   case RINGBUF_TYPE_TIME_EXTEND:
+   delta = rb_event_time_stamp(event);
+   ts += delta;
+   break;
+
+   case RINGBUF_TYPE_TIME_STAMP:
+   delta = rb_event_time_stamp(event);
+   if (warn && delta < ts) {
+   buffer_warn_return("[CPU: %d]ABSOLUTE TIME WENT 
BACKWARDS: last ts: %lld absolute ts: %lld\n",
+  cpu, ts, delta);
+   }
+   ts = delta;
+   break;
+
+   case RINGBUF_TYPE_PADDING:
+   if (event->time_delta == 1)
+   break;
+   fallthrough;
+   case RINGBUF_TYPE_DATA:
+   events++;
+   ts += event->time_delta;
+   break;
+
+   default:
+   return -1;
+   }
+   }
+   *timestamp = ts;
+   return events;
+}
+
+static int rb_validate_buffer(struct buffer_data_page *dpage, int cpu)
+{
+   unsigned long long ts;
+   int tail;
+
+   tail = local_read(&dpage->commit);
+   return rb_read_data_buffer(dpage, tail, cpu, &ts, false);
+}
+
+/* If the meta data has been validated, now validate the events */
+static void rb_meta_validate_events(struct ring_buffer_per_cpu *cpu_buffer)
+{
+   struct ring_buffer_meta *meta = cpu_buffer->ring_meta;
+   struct buffer_page *head_page;
+   unsigned long entry_bytes = 0;
+   unsigned long entries = 0;
+   int ret;
+   int i;
+
+   if (!meta || !meta->head_buffer)
+   return;
+
+   /* Do the reader page 

[PATCH v2 06/11] ring-buffer: Add test if range of boot buffer is valid

2024-04-10 Thread Steven Rostedt
From: "Steven Rostedt (Google)" 

Add a test against the ring buffer memory range to see if it has valid
data. The ring_buffer_meta structure is given a new field called
"first_buffer" which holds the address of the first sub-buffer. This is
used to both determine if the other fields are valid as well as finding
the offset between the old addresses of the sub-buffer from the previous
boot to the new addresses of the current boot.

Since the values for nr_subbufs and subbuf_size is to be the same, check
if the values in the meta page match the values calculated.

Take the range of the first_buffer and the total size of all the buffers
and make sure the saved head_buffer and commit_buffer fall in the range.

Iterate through all the sub-buffers to make sure that the values in the
sub-buffer "commit" field (the field that holds the amount of data on the
sub-buffer) is within the end of the sub-buffer. Also check the index
array to make sure that all the indexes are within nr_subbufs.

Signed-off-by: Steven Rostedt (Google) 
---
 kernel/trace/ring_buffer.c | 143 ++---
 1 file changed, 135 insertions(+), 8 deletions(-)

diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index ee3a5c6966e2..d05b60fd5695 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -45,6 +45,7 @@
 static void update_pages_handler(struct work_struct *work);
 
 struct ring_buffer_meta {
+   unsigned long   first_buffer;
unsigned long   head_buffer;
unsigned long   commit_buffer;
__u32   subbuf_size;
@@ -1606,21 +1607,103 @@ static void *rb_range_buffer(struct 
ring_buffer_per_cpu *cpu_buffer, int idx)
return (void *)ptr;
 }
 
+/*
+ * See if the existing memory contains valid ring buffer data.
+ * As the previous kernel must be the same as this kernel, all
+ * the calculations (size of buffers and number of buffers)
+ * must be the same.
+ */
+static bool rb_meta_valid(struct ring_buffer_meta *meta, int cpu,
+ struct trace_buffer *buffer, int nr_pages)
+{
+   int subbuf_size = PAGE_SIZE;
+   struct buffer_data_page *subbuf;
+   unsigned long buffers_start;
+   unsigned long buffers_end;
+   int i;
+
+   /* The subbuffer's size and number of subbuffers must match */
+   if (meta->subbuf_size != subbuf_size ||
+   meta->nr_subbufs != nr_pages + 1) {
+   pr_info("Ring buffer boot meta [%d] mismatch of 
subbuf_size/nr_pages\n", cpu);
+   return false;
+   }
+
+   buffers_start = meta->first_buffer;
+   buffers_end = meta->first_buffer + (subbuf_size * meta->nr_subbufs);
+
+   /* Is the head and commit buffers within the range of buffers? */
+   if (meta->head_buffer < buffers_start ||
+   meta->head_buffer >= buffers_end) {
+   pr_info("Ring buffer boot meta [%d] head buffer out of 
range\n", cpu);
+   return false;
+   }
+
+   if (meta->commit_buffer < buffers_start ||
+   meta->commit_buffer >= buffers_end) {
+   pr_info("Ring buffer boot meta [%d] commit buffer out of 
range\n", cpu);
+   return false;
+   }
+
+   subbuf = rb_subbufs_from_meta(meta);
+
+   /* Do the meta buffers and the subbufs themselves have correct data? */
+   for (i = 0; i < meta->nr_subbufs; i++) {
+   if (meta->buffers[i] < 0 ||
+   meta->buffers[i] >= meta->nr_subbufs) {
+   pr_info("Ring buffer boot meta [%d] array out of 
range\n", cpu);
+   return false;
+   }
+
+   if ((unsigned)local_read(&subbuf->commit) > subbuf_size) {
+   pr_info("Ring buffer boot meta [%d] buffer invalid 
commit\n", cpu);
+   return false;
+   }
+
+   subbuf = (void *)subbuf + subbuf_size;
+   }
+
+   pr_info("Ring buffer meta is from previous boot!\n");
+   return true;
+}
+
 static void rb_range_meta_init(struct trace_buffer *buffer, int nr_pages)
 {
struct ring_buffer_meta *meta;
+   unsigned long delta;
void *subbuf;
int cpu;
int i;
 
for (cpu = 0; cpu < nr_cpu_ids; cpu++) {
+   void *next_meta;
+
meta = rb_range_meta(buffer, nr_pages, cpu);
 
+   if (rb_meta_valid(meta, cpu, buffer, nr_pages)) {
+   /* Make the mappings match the current address */
+   subbuf = rb_subbufs_from_meta(meta);
+   delta = (unsigned long)subbuf - meta->first_buffer;
+   meta->first_buffer += delta;
+   meta->head_buffer += delta;
+   meta->commit_buffer += delta;

[PATCH v2 05/11] ring-buffer: Add output of ring buffer meta page

2024-04-10 Thread Steven Rostedt
From: "Steven Rostedt (Google)" 

Add a buffer_meta per-cpu file for the trace instance that is mapped to
boot memory. This shows the current meta-data and can be used by user
space tools to record off the current mappings to help reconstruct the
ring buffer after a reboot.

It does not expose any virtual addresses, just indexes into the sub-buffer
pages.

Signed-off-by: Steven Rostedt (Google) 
---
 kernel/trace/ring_buffer.c | 77 ++
 kernel/trace/trace.c   | 30 ++-
 kernel/trace/trace.h   |  2 +
 3 files changed, 107 insertions(+), 2 deletions(-)

diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index db8762faef77..ee3a5c6966e2 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -32,6 +32,8 @@
 #include 
 #include 
 
+#include "trace.h"
+
 /*
  * The "absolute" timestamp in the buffer is only 59 bits.
  * If a clock has the 5 MSBs set, it needs to be saved and
@@ -1635,6 +1637,81 @@ static void rb_range_meta_init(struct trace_buffer 
*buffer, int nr_pages)
}
 }
 
+static void *rbm_start(struct seq_file *m, loff_t *pos)
+{
+   struct ring_buffer_per_cpu *cpu_buffer = m->private;
+   struct ring_buffer_meta *meta = cpu_buffer->ring_meta;
+   unsigned long val;
+
+   if (!meta)
+   return NULL;
+
+   if (*pos > meta->nr_subbufs)
+   return NULL;
+
+   val = *pos;
+   val++;
+
+   return (void *)val;
+}
+
+static void *rbm_next(struct seq_file *m, void *v, loff_t *pos)
+{
+   (*pos)++;
+
+   return rbm_start(m, pos);
+}
+
+static int rb_meta_subbuf_idx(struct ring_buffer_meta *meta, void *subbuf);
+
+static int rbm_show(struct seq_file *m, void *v)
+{
+   struct ring_buffer_per_cpu *cpu_buffer = m->private;
+   struct ring_buffer_meta *meta = cpu_buffer->ring_meta;
+   unsigned long val = (unsigned long)v;
+
+   if (val == 1) {
+   seq_printf(m, "head_buffer:   %d\n",
+  rb_meta_subbuf_idx(meta, (void *)meta->head_buffer));
+   seq_printf(m, "commit_buffer: %d\n",
+  rb_meta_subbuf_idx(meta, (void 
*)meta->commit_buffer));
+   seq_printf(m, "subbuf_size:   %d\n", meta->subbuf_size);
+   seq_printf(m, "nr_subbufs:%d\n", meta->nr_subbufs);
+   return 0;
+   }
+
+   val -= 2;
+   seq_printf(m, "buffer[%ld]:%d\n", val, meta->buffers[val]);
+
+   return 0;
+}
+
+static void rbm_stop(struct seq_file *m, void *p)
+{
+}
+
+static const struct seq_operations rb_meta_seq_ops = {
+   .start  = rbm_start,
+   .next   = rbm_next,
+   .show   = rbm_show,
+   .stop   = rbm_stop,
+};
+
+int ring_buffer_meta_seq_init(struct file *file, struct trace_buffer *buffer, 
int cpu)
+{
+   struct seq_file *m;
+   int ret;
+
+   ret = seq_open(file, &rb_meta_seq_ops);
+   if (ret)
+   return ret;
+
+   m = file->private_data;
+   m->private = buffer->buffers[cpu];
+
+   return 0;
+}
+
 static int __rb_allocate_pages(struct ring_buffer_per_cpu *cpu_buffer,
long nr_pages, struct list_head *pages)
 {
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 31067de977fc..b2b16be6e4ec 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -5018,7 +5018,7 @@ static int show_traces_open(struct inode *inode, struct 
file *file)
return 0;
 }
 
-static int show_traces_release(struct inode *inode, struct file *file)
+static int tracing_seq_release(struct inode *inode, struct file *file)
 {
struct trace_array *tr = inode->i_private;
 
@@ -5059,7 +5059,7 @@ static const struct file_operations show_traces_fops = {
.open   = show_traces_open,
.read   = seq_read,
.llseek = seq_lseek,
-   .release= show_traces_release,
+   .release= tracing_seq_release,
 };
 
 static ssize_t
@@ -6860,6 +6860,22 @@ tracing_total_entries_read(struct file *filp, char 
__user *ubuf,
return simple_read_from_buffer(ubuf, cnt, ppos, buf, r);
 }
 
+static int tracing_buffer_meta_open(struct inode *inode, struct file *filp)
+{
+   struct trace_array *tr = inode->i_private;
+   int cpu = tracing_get_cpu(inode);
+   int ret;
+
+   ret = tracing_check_open_get_tr(tr);
+   if (ret)
+   return ret;
+
+   ret = ring_buffer_meta_seq_init(filp, tr->array_buffer.buffer, cpu);
+   if (ret < 0)
+   __trace_array_put(tr);
+   return ret;
+}
+
 static ssize_t
 tracing_free_buffer_write(struct file *filp, const char __user *ubuf,
  size_t cnt, loff_t *ppos)
@@ -7436,6 +7452,13 @@ static const struct file_operations tracing_entries_fops 
= {
 

[PATCH v2 04/11] tracing: Implement creating an instance based on a given memory region

2024-04-10 Thread Steven Rostedt
From: "Steven Rostedt (Google)" 

Allow for creating a new instance by passing in an address and size to map
the ring buffer for the instance to.

This will allow features like a pstore memory mapped region to be used for
an tracing instance ring buffer that can be retrieved from one boot to the
next.

Signed-off-by: Steven Rostedt (Google) 
---
 kernel/trace/trace.c | 50 +++-
 kernel/trace/trace.h |  3 +++
 2 files changed, 43 insertions(+), 10 deletions(-)

diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 83194bf7b1df..31067de977fc 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -4921,6 +4921,11 @@ static int tracing_open(struct inode *inode, struct file 
*file)
 static bool
 trace_ok_for_array(struct tracer *t, struct trace_array *tr)
 {
+#ifdef CONFIG_TRACER_SNAPSHOT
+   /* arrays with mapped buffer range do not have snapshots */
+   if (tr->range_addr_start && t->use_max_tr)
+   return false;
+#endif
return (tr->flags & TRACE_ARRAY_FL_GLOBAL) || t->allow_instances;
 }
 
@@ -8673,11 +8678,13 @@ tracing_init_tracefs_percpu(struct trace_array *tr, 
long cpu)
tr, cpu, &tracing_entries_fops);
 
 #ifdef CONFIG_TRACER_SNAPSHOT
-   trace_create_cpu_file("snapshot", TRACE_MODE_WRITE, d_cpu,
-   tr, cpu, _fops);
+   if (!tr->range_addr_start) {
+   trace_create_cpu_file("snapshot", TRACE_MODE_WRITE, d_cpu,
+ tr, cpu, _fops);
 
-   trace_create_cpu_file("snapshot_raw", TRACE_MODE_READ, d_cpu,
-   tr, cpu, _raw_fops);
+   trace_create_cpu_file("snapshot_raw", TRACE_MODE_READ, d_cpu,
+ tr, cpu, _raw_fops);
+   }
 #endif
 }
 
@@ -9214,7 +9221,18 @@ allocate_trace_buffer(struct trace_array *tr, struct 
array_buffer *buf, int size
 
buf->tr = tr;
 
-   buf->buffer = ring_buffer_alloc(size, rb_flags);
+   if (tr->range_addr_start && tr->range_addr_size) {
+   buf->buffer = ring_buffer_alloc_range(size, rb_flags, 0,
+ tr->range_addr_start,
+ tr->range_addr_size);
+   /*
+* This is basically the same as a mapped buffer,
+* with the same restrictions.
+*/
+   tr->mapped++;
+   } else {
+   buf->buffer = ring_buffer_alloc(size, rb_flags);
+   }
if (!buf->buffer)
return -ENOMEM;
 
@@ -9251,6 +9269,10 @@ static int allocate_trace_buffers(struct trace_array 
*tr, int size)
return ret;
 
 #ifdef CONFIG_TRACER_MAX_TRACE
+   /* Fix mapped buffer trace arrays do not have snapshot buffers */
+   if (tr->range_addr_start)
+   return 0;
+
ret = allocate_trace_buffer(tr, &tr->max_buffer,
allocate_snapshot ? size : 1);
if (MEM_FAIL(ret, "Failed to allocate trace buffer\n")) {
@@ -9351,7 +9373,9 @@ static int trace_array_create_dir(struct trace_array *tr)
 }
 
 static struct trace_array *
-trace_array_create_systems(const char *name, const char *systems)
+trace_array_create_systems(const char *name, const char *systems,
+  unsigned long range_addr_start,
+  unsigned long range_addr_size)
 {
struct trace_array *tr;
int ret;
@@ -9377,6 +9401,10 @@ trace_array_create_systems(const char *name, const char 
*systems)
goto out_free_tr;
}
 
+   /* Only for boot up memory mapped ring buffers */
+   tr->range_addr_start = range_addr_start;
+   tr->range_addr_size = range_addr_size;
+
tr->trace_flags = global_trace.trace_flags & ~ZEROED_TRACE_FLAGS;
 
cpumask_copy(tr->tracing_cpumask, cpu_all_mask);
@@ -9434,7 +9462,7 @@ trace_array_create_systems(const char *name, const char 
*systems)
 
 static struct trace_array *trace_array_create(const char *name)
 {
-   return trace_array_create_systems(name, NULL);
+   return trace_array_create_systems(name, NULL, 0, 0);
 }
 
 static int instance_mkdir(const char *name)
@@ -9488,7 +9516,7 @@ struct trace_array *trace_array_get_by_name(const char 
*name, const char *system
goto out_unlock;
}
 
-   tr = trace_array_create_systems(name, systems);
+   tr = trace_array_create_systems(name, systems, 0, 0);
 
if (IS_ERR(tr))
tr = NULL;
@@ -9681,8 +9709,10 @@ init_tracer_tracefs(struct trace_array *tr, struct 
dentry *d_tracer)
MEM_FAIL(1, "Could not allocate function filter files");
 
 #ifdef CONFIG_TRACER_SNAPSHOT
-   trace_cre

[PATCH v2 03/11] ring-buffer: Add ring_buffer_meta data

2024-04-10 Thread Steven Rostedt
From: "Steven Rostedt (Google)" 

Populate the ring_buffer_meta array. It holds the pointer to the
head_buffer (next to read), the commit_buffer (next to write) the size of
the sub-buffers, number of sub-buffers and an array that keeps track of
the order of the sub-buffers.

This information will be stored in the persistent memory to help on reboot
to reconstruct the ring buffer.

Signed-off-by: Steven Rostedt (Google) 
---
 kernel/trace/ring_buffer.c | 208 -
 1 file changed, 183 insertions(+), 25 deletions(-)

diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index a6d22a77f128..db8762faef77 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -43,6 +43,11 @@
 static void update_pages_handler(struct work_struct *work);
 
 struct ring_buffer_meta {
+   unsigned long   head_buffer;
+   unsigned long   commit_buffer;
+   __u32   subbuf_size;
+   __u32   nr_subbufs;
+   int buffers[];
 };
 
 /*
@@ -498,6 +503,7 @@ struct ring_buffer_per_cpu {
struct mutexmapping_lock;
unsigned long   *subbuf_ids;/* ID to subbuf VA */
struct trace_buffer_meta*meta_page;
+   struct ring_buffer_meta *ring_meta;
 
/* ring buffer pages to update, > 0 to add, < 0 to remove */
longnr_pages_to_update;
@@ -1258,6 +1264,11 @@ static void rb_head_page_activate(struct 
ring_buffer_per_cpu *cpu_buffer)
 * Set the previous list pointer to have the HEAD flag.
 */
rb_set_list_to_head(head->list.prev);
+
+   if (cpu_buffer->ring_meta) {
+   struct ring_buffer_meta *meta = cpu_buffer->ring_meta;
+   meta->head_buffer = (unsigned long)head->page;
+   }
 }
 
 static void rb_list_head_clear(struct list_head *list)
@@ -1505,50 +1516,125 @@ rb_range_align_subbuf(unsigned long addr, int 
subbuf_size, int nr_subbufs)
 }
 
 /*
- * Return a specific sub-buffer for a given @cpu defined by @idx.
+ * Return the ring_buffer_meta for a given @cpu.
  */
-static void *rb_range_buffer(struct trace_buffer *buffer, int cpu, int 
nr_pages, int idx)
+static void *rb_range_meta(struct trace_buffer *buffer, int nr_pages, int cpu)
 {
-   unsigned long ptr;
int subbuf_size = buffer->subbuf_size + BUF_PAGE_HDR_SIZE;
+   unsigned long ptr = buffer->range_addr_start;
+   struct ring_buffer_meta *meta;
int nr_subbufs;
 
-   /* Include the reader page */
-   nr_subbufs = nr_pages + 1;
+   if (!ptr)
+   return NULL;
+
+   /* When nr_pages passed in is zero, the first meta has already been 
initialized */
+   if (!nr_pages) {
+   meta = (struct ring_buffer_meta *)ptr;
+   nr_subbufs = meta->nr_subbufs;
+   } else {
+   meta = NULL;
+   /* Include the reader page */
+   nr_subbufs = nr_pages + 1;
+   }
 
/*
 * The first chunk may not be subbuffer aligned, where as
 * the rest of the chunks are.
 */
-   ptr = buffer->range_addr_start;
-   ptr = rb_range_align_subbuf(ptr, subbuf_size, nr_subbufs);
if (cpu) {
-   unsigned long p;
-
-   ptr += subbuf_size * nr_subbufs;
-
-   /* Save the beginning of this CPU chunk */
-   p = ptr;
-
ptr = rb_range_align_subbuf(ptr, subbuf_size, nr_subbufs);
+   ptr += subbuf_size * nr_subbufs;
 
if (cpu > 1) {
unsigned long size;
+   unsigned long p;
 
+   /* Save the beginning of this CPU chunk */
+   p = ptr;
+   ptr = rb_range_align_subbuf(ptr, subbuf_size, 
nr_subbufs);
ptr += subbuf_size * nr_subbufs;
 
/* Now all chunks after this are the same size */
size = ptr - p;
ptr += size * (cpu - 2);
-
-   ptr = rb_range_align_subbuf(ptr, subbuf_size, 
nr_subbufs);
}
}
-   if (ptr + subbuf_size * nr_subbufs > buffer->range_addr_end)
+   return (void *)ptr;
+}
+
+static void *rb_subbufs_from_meta(struct ring_buffer_meta *meta)
+{
+   int subbuf_size = meta->subbuf_size;
+   unsigned long ptr;
+
+   ptr = (unsigned long)meta;
+   ptr = rb_range_align_subbuf(ptr, subbuf_size, meta->nr_subbufs);
+
+   return (void *)ptr;
+}
+
+/*
+ * Return a specific sub-buffer for a given @cpu defined by @idx.
+ */
+static void *rb_range_buffer(struct ring_buffer_per_cpu *cpu_buffer, int idx)
+{
+   struct ring_buffer_meta *meta;
+   unsigned long ptr;
+   int subbuf_size;
+
+   meta = rb_range_meta(cpu_buffer->buffer, 0, cpu_buffer->cpu);
+   if (!m

[PATCH v2 00/11] tracing: Persistent traces across a reboot or crash

2024-04-10 Thread Steven Rostedt


This is a way to map a ring buffer instance across reboots.
The requirement is that you have a memory region that is not erased.
I tested this on a Debian VM running on qemu on a Debian server,
and even tested it on a baremetal box running Fedora. I was
surprised that it worked on the baremetal box, but it does so
surprisingly consistently.

I changed things up since the POC version (this is no longer a proof
of concept, but now a serious patch set).

  
https://lore.kernel.org/linux-trace-kernel/20240306015910.766510...@goodmis.org/

I got rid of the variables that needed to be set:

  trace_buffer_start and trace_buffer_size

The allocation of the mapping requires a new interface (not in this
patch set yet, as it will be added for pstore or the wildcard
memmap that is under RFC). That is, either a new patch will be
added to connect pstore, or if the new wildcard memmap option is
accepted, it can be created with:

  memmap=12M*4096:trace  trace_instance=boot_mapped@trace

But that is to come later. If you want to test this code, I have
the command line version available at:

  git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace.git
  branch: trace/ring-buffer-prealloc

The name of the instance is no longer defined by the kernel but by
the calling code.

The memory reserved is used by the ring buffer of this instance.
It acts like a memory mapped instance so it has some limitations. It does not
allow snapshots nor does it allow tracers which use a snapshot buffer (like
irqsoff and wakeup tracers).

On boot up, when setting up the ring buffer, it looks at the current
content and does a vigorous test to see if the content is valid.
It even walks the events in all the sub-buffers to make sure the
ring buffer meta data is correct. If it determines that the content
is valid, it will reconstruct the ring buffer to use the content
it has found.

If the buffer is valid, on the next boot, the boot_mapped instance
will contain the data from the previous boot. You can cat the
trace or trace_pipe file, or even run trace-cmd extract on it to
make a trace.dat file that holds the data. This is much better than
dealing with an ftrace_dump_on_oops (I wish I had this a decade ago!)

There are still some limitations of this buffer. One is that it assumes
that the kernel you are booting back into is the same one that crashed.
At the very least, the trace_events (like sched_switch and friends) must all
have the same ids. This holds for the same kernel, as the ids are determined
at link time.

Module events could possibly be a problem as the ids may not match.

This version of the patch series saves a text function address and a data
string address in the persistent memory, and these are used to calculate
the text and data address deltas against the new boot up. Now
function tracing and "%pS" still work across boots. Even the RCU
trace events that point to static strings work as well!

The delta is exported by a new file in the instance called "last_boot_info"
that has something like this:

 # cat last_boot_info
 text delta:-268435456
 data delta:-268435456

This can be used by trace-cmd that reads the trace_pipe_raw data and
now can figure out how to map the print_formats and kallsyms to the raw
data in the buffers.
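
As a rough sketch of what a consumer could do with those values (hypothetical
helper, assuming the delta is defined so that adding it to an address recorded
in the previous boot yields the address in the current boot):

	/* Map a pointer recorded in the persistent buffer to the running kernel */
	static unsigned long map_old_addr(unsigned long recorded, long text_delta)
	{
		return recorded + text_delta;
	}

The same adjustment with the data delta applies to static string pointers,
after which the usual kallsyms and print_fmt lookups behave as if the events
had been recorded in the current boot.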

This is no longer a POC as it seems to be working as expected.

This is based on top of Vincent Donnefort's ring buffer user space mapping code:

  
https://lore.kernel.org/linux-trace-kernel/20240406173649.3210836-1-vdonnef...@google.com/

Enjoy...


Steven Rostedt (Google) (11):
  ring-buffer: Allow mapped field to be set without mapping
  ring-buffer: Add ring_buffer_alloc_range()
  ring-buffer: Add ring_buffer_meta data
  tracing: Implement creating an instance based on a given memory region
  ring-buffer: Add output of ring buffer meta page
  ring-buffer: Add test if range of boot buffer is valid
  ring-buffer: Validate boot range memory events
  ring-buffer: Save text and data locations in mapped meta data
  tracing/ring-buffer: Add last_boot_info file to boot instance
  tracing: Handle old buffer mappings for event strings and functions
  tracing: Update function tracing output for previous boot buffer


 include/linux/ring_buffer.h |  20 ++
 kernel/trace/ring_buffer.c  | 837 
 kernel/trace/trace.c| 167 -
 kernel/trace/trace.h|   7 +
 kernel/trace/trace_output.c |   9 +-
 5 files changed, 953 insertions(+), 87 deletions(-)



[PATCH v2 01/11] ring-buffer: Allow mapped field to be set without mapping

2024-04-10 Thread Steven Rostedt
From: "Steven Rostedt (Google)" 

In preparation for having the ring buffer mapped to a dedicated location,
which will have the same restrictions as user space memory mapped buffers,
allow it to use the "mapped" field of the ring_buffer_per_cpu structure
without having the user space meta page mapping.

When this starts using the mapped field, it will need to handle adding a
user space mapping (and removing it) from a ring buffer that is using a
dedicated memory range.
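
A rough sketch of the intended semantics (hypothetical helper, not part of
this patch): "mapped" becomes a count of users rather than a flag, and only
user-space mappings come with a meta page, so code touching the meta page
has to check for it separately:

	static bool rb_has_user_mapping(struct ring_buffer_per_cpu *cpu_buffer)
	{
		/* A range-backed buffer can be "mapped" without a meta page */
		return cpu_buffer->mapped && cpu_buffer->meta_page;
	}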

Signed-off-by: Steven Rostedt (Google) 
---
 kernel/trace/ring_buffer.c | 10 +++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 793ecc454039..44b1d5f1a99a 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -5223,6 +5223,9 @@ static void rb_update_meta_page(struct 
ring_buffer_per_cpu *cpu_buffer)
 {
struct trace_buffer_meta *meta = cpu_buffer->meta_page;
 
+   if (!meta)
+   return;
+
meta->reader.read = cpu_buffer->reader_page->read;
meta->reader.id = cpu_buffer->reader_page->id;
meta->reader.lost_events = cpu_buffer->lost_events;
@@ -6167,7 +6170,7 @@ rb_get_mapped_buffer(struct trace_buffer *buffer, int cpu)
 
mutex_lock(&cpu_buffer->mapping_lock);
 
-   if (!cpu_buffer->mapped) {
+   if (!cpu_buffer->mapped || !cpu_buffer->meta_page) {
mutex_unlock(&cpu_buffer->mapping_lock);
return ERR_PTR(-ENODEV);
}
@@ -6345,12 +6348,13 @@ int ring_buffer_map(struct trace_buffer *buffer, int 
cpu,
 */
raw_spin_lock_irqsave(&cpu_buffer->reader_lock, flags);
rb_setup_ids_meta_page(cpu_buffer, subbuf_ids);
+
raw_spin_unlock_irqrestore(&cpu_buffer->reader_lock, flags);
 
err = __rb_map_vma(cpu_buffer, vma);
if (!err) {
raw_spin_lock_irqsave(&cpu_buffer->reader_lock, flags);
-   cpu_buffer->mapped = 1;
+   cpu_buffer->mapped++;
raw_spin_unlock_irqrestore(&cpu_buffer->reader_lock, flags);
} else {
kfree(cpu_buffer->subbuf_ids);
@@ -6388,7 +6392,7 @@ int ring_buffer_unmap(struct trace_buffer *buffer, int 
cpu)
mutex_lock(&buffer->mutex);
raw_spin_lock_irqsave(&cpu_buffer->reader_lock, flags);
 
-   cpu_buffer->mapped = 0;
+   cpu_buffer->mapped--;
 
raw_spin_unlock_irqrestore(&cpu_buffer->reader_lock, flags);
 
-- 
2.43.0





[PATCH v2 02/11] ring-buffer: Add ring_buffer_alloc_range()

2024-04-10 Thread Steven Rostedt
From: "Steven Rostedt (Google)" 

In preparation to allowing the trace ring buffer to be allocated in a
range of memory that is persistent across reboots, add
ring_buffer_alloc_range(). It takes a contiguous range of memory and will
split it up evenly for the per CPU ring buffers.

If there's not enough memory to handle all CPUs with the minimum size, it
will fail to allocate the ring buffer.
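
A rough sketch of that check (hypothetical helper, not the patch's code; the
real allocation also has to account for the per-CPU meta data and sub-buffer
ID array):

	static int rb_range_fits(unsigned long range_size, int nr_cpus,
				 unsigned long subbuf_size, int min_subbufs)
	{
		unsigned long per_cpu = range_size / nr_cpus;

		/* +1 for the reader page each per-CPU buffer needs */
		if (per_cpu < (unsigned long)(min_subbufs + 1) * subbuf_size)
			return -ENOMEM;
		return 0;
	}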

Signed-off-by: Steven Rostedt (Google) 
---
 include/linux/ring_buffer.h |  17 +++
 kernel/trace/ring_buffer.c  | 222 ++--
 2 files changed, 204 insertions(+), 35 deletions(-)

diff --git a/include/linux/ring_buffer.h b/include/linux/ring_buffer.h
index 96d2140b471e..a50b0223b1d3 100644
--- a/include/linux/ring_buffer.h
+++ b/include/linux/ring_buffer.h
@@ -89,6 +89,11 @@ void ring_buffer_discard_commit(struct trace_buffer *buffer,
 struct trace_buffer *
 __ring_buffer_alloc(unsigned long size, unsigned flags, struct lock_class_key 
*key);
 
+struct trace_buffer *__ring_buffer_alloc_range(unsigned long size, unsigned 
flags,
+  int order, unsigned long start,
+  unsigned long range_size,
+  struct lock_class_key *key);
+
 /*
  * Because the ring buffer is generic, if other users of the ring buffer get
  * traced by ftrace, it can produce lockdep warnings. We need to keep each
@@ -100,6 +105,18 @@ __ring_buffer_alloc(unsigned long size, unsigned flags, 
struct lock_class_key *k
__ring_buffer_alloc((size), (flags), &__key);   \
 })
 
+/*
+ * Because the ring buffer is generic, if other users of the ring buffer get
+ * traced by ftrace, it can produce lockdep warnings. We need to keep each
+ * ring buffer's lock class separate.
+ */
+#define ring_buffer_alloc_range(size, flags, order, start, range_size) \
+({ \
+   static struct lock_class_key __key; \
+   __ring_buffer_alloc_range((size), (flags), (order), (start),\
+ (range_size), &__key);\
+})
+
 typedef bool (*ring_buffer_cond_fn)(void *data);
 int ring_buffer_wait(struct trace_buffer *buffer, int cpu, int full,
 ring_buffer_cond_fn cond, void *data);
diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 44b1d5f1a99a..a6d22a77f128 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -42,6 +42,9 @@
 
 static void update_pages_handler(struct work_struct *work);
 
+struct ring_buffer_meta {
+};
+
 /*
  * The ring buffer header is special. We must manually up keep it.
  */
@@ -340,7 +343,8 @@ struct buffer_page {
local_t  entries;   /* entries on this page */
unsigned longreal_end;  /* real end of data */
unsigned order; /* order of the page */
-   u32  id;/* ID for external mapping */
+   u32  id:30; /* ID for external mapping */
+   u32  range:1;   /* Mapped via a range */
struct buffer_data_page *page;  /* Actual data page */
 };
 
@@ -371,7 +375,9 @@ static __always_inline unsigned int rb_page_commit(struct 
buffer_page *bpage)
 
 static void free_buffer_page(struct buffer_page *bpage)
 {
-   free_pages((unsigned long)bpage->page, bpage->order);
+   /* Range pages are not to be freed */
+   if (!bpage->range)
+   free_pages((unsigned long)bpage->page, bpage->order);
kfree(bpage);
 }
 
@@ -521,6 +527,9 @@ struct trace_buffer {
struct rb_irq_work  irq_work;
booltime_stamp_abs;
 
+   unsigned long   range_addr_start;
+   unsigned long   range_addr_end;
+
unsigned intsubbuf_size;
unsigned intsubbuf_order;
unsigned intmax_data_size;
@@ -1483,9 +1492,67 @@ static void rb_check_pages(struct ring_buffer_per_cpu 
*cpu_buffer)
}
 }
 
+/*
+ * Take an address, add the meta data size as well as the array of
+ * array subbuffer indexes, then align it to a subbuffer size.
+ */
+static unsigned long
+rb_range_align_subbuf(unsigned long addr, int subbuf_size, int nr_subbufs)
+{
+   addr += sizeof(struct ring_buffer_meta) +
+   sizeof(int) * nr_subbufs;
+   return ALIGN(addr, subbuf_size);
+}
+
+/*
+ * Return a specific sub-buffer for a given @cpu defined by @idx.
+ */
+static void *rb_range_buffer(struct trace_buffer *buffer, int cpu, int 
nr_pages, int idx)
+{
+   unsigned long ptr;
+   int subbuf_size = buffer->subbuf_size + BUF_PAGE_HDR_SIZE;
+   int nr_subbufs;
+
+   /* Include the reader page */
+   nr_subbufs = nr_pages + 1;
+
+   /*
+* The firs

Re: [PATCH] SUNRPC: Fix rpcgss_context trace event acceptor field

2024-04-10 Thread Steven Rostedt
On Wed, 10 Apr 2024 13:07:20 -0400
Chuck Lever  wrote:

> Well I guess I could test it for you. I'll take it for nfsd v6.9-rc.

Thanks!

-- Steve



Re: [PATCH v20 0/5] Introducing trace buffer mapping by user-space

2024-04-10 Thread Steven Rostedt


Hi Andrew, et al.

Linus said it's a hard requirement that this code gets an Acked-by (or
Reviewed-by) from the memory sub-maintainers before he will accept it.
He was upset that we faulted in pages one at a time instead of mapping it
in one go:

  
https://lore.kernel.org/all/CAHk-=wh5wWeib7+kVHpBVtUn7kx7GGadWqb5mW5FYTdewEfL=w...@mail.gmail.com/

Could you take a look at patches 1-3 to make sure they look sane from a
memory management point of view?

I really want this applied in the next merge window.

Thanks!

-- Steve


On Sat,  6 Apr 2024 18:36:44 +0100
Vincent Donnefort  wrote:

> The tracing ring-buffers can be stored on disk or sent to network
> without any copy via splice. However the later doesn't allow real time
> processing of the traces. A solution is to give userspace direct access
> to the ring-buffer pages via a mapping. An application can now become a
> consumer of the ring-buffer, in a similar fashion to what trace_pipe
> offers.
> 
> Support for this new feature can already be found in libtracefs from
> version 1.8, when built with EXTRA_CFLAGS=-DFORCE_MMAP_ENABLE.
> 
> Vincent
> 
> v19 -> v20:
>   * Fix typos in documentation.
>   * Remove useless mmap open and fault callbacks.
>   * add mm.h include for vm_insert_pages
> 
> v18 -> v19:
>   * Use VM_PFNMAP and vm_insert_pages
>   * Allocate ring-buffer subbufs with __GFP_COMP
>   * Pad the meta-page with the zero-page to align on the subbuf_order
>   * Extend the ring-buffer test with mmap() dedicated suite
> 
> v17 -> v18:
>   * Fix lockdep_assert_held
>   * Fix spin_lock_init typo
>   * Fix CONFIG_TRACER_MAX_TRACE typo
> 
> v16 -> v17:
>   * Documentation and comments improvements.
>   * Create get/put_snapshot_map() for clearer code.
>   * Replace kzalloc with kcalloc.
>   * Fix -ENOMEM handling in rb_alloc_meta_page().
>   * Move flush(cpu_buffer->reader_page) behind the reader lock.
>   * Move all inc/dec of cpu_buffer->mapped behind reader lock and buffer
> mutex. (removes READ_ONCE/WRITE_ONCE accesses).
> 
> v15 -> v16:
>   * Add comment for the dcache flush.
>   * Remove now unnecessary WRITE_ONCE for the meta-page.
> 
> v14 -> v15:
>   * Add meta-page and reader-page flush. Intends to fix the mapping
> for VIVT and aliasing-VIPT data caches.
>   * -EPERM on VM_EXEC.
>   * Fix build warning !CONFIG_TRACER_MAX_TRACE.
> 
> v13 -> v14:
>   * All cpu_buffer->mapped readers use READ_ONCE (except for swap_cpu)
>   * on unmap, sync meta-page teardown with the reader_lock instead of
> the synchronize_rcu.
>   * Add a dedicated spinlock for trace_array ->snapshot and ->mapped.
> (intends to fix a lockdep issue)
>   * Add kerneldoc for flags and Reserved fields.
>   * Add kselftest for snapshot/map mutual exclusion.
> 
> v12 -> v13:
>   * Swap subbufs_{touched,lost} for Reserved fields.
>   * Add a flag field in the meta-page.
>   * Fix CONFIG_TRACER_MAX_TRACE.
>   * Rebase on top of trace/urgent.
>   * Add a comment for try_unregister_trigger()
> 
> v11 -> v12:
>   * Fix code sample mmap bug.
>   * Add logging in sample code.
>   * Reset tracer in selftest.
>   * Add a refcount for the snapshot users.
>   * Prevent mapping when there are snapshot users and vice versa.
>   * Refine the meta-page.
>   * Fix types in the meta-page.
>   * Collect Reviewed-by.
> 
> v10 -> v11:
>   * Add Documentation and code sample.
>   * Add a selftest.
>   * Move all the update to the meta-page into a single
> rb_update_meta_page().
>   * rb_update_meta_page() is now called from
> ring_buffer_map_get_reader() to fix NOBLOCK callers.
>   * kerneldoc for struct trace_meta_page.
>   * Add a patch to zero all the ring-buffer allocations.
> 
> v9 -> v10:
>   * Refactor rb_update_meta_page()
>   * In-loop declaration for foreach_subbuf_page()
>   * Check for cpu_buffer->mapped overflow
> 
> v8 -> v9:
>   * Fix the unlock path in ring_buffer_map()
>   * Fix cpu_buffer cast with rb_work_rq->is_cpu_buffer
>   * Rebase on linux-trace/for-next (3cb3091138ca0921c4569bcf7ffa062519639b6a)
> 
> v7 -> v8:
>   * Drop the subbufs renaming into bpages
>   * Use subbuf as a name when relevant
> 
> v6 -> v7:
>   * Rebase onto lore.kernel.org/lkml/20231215175502.106587...@goodmis.org/
>   * Support for subbufs
>   * Rename subbufs into bpages
> 
> v5 -> v6:
>   * Rebase on next-20230802.
>   * (unsigned long) -> (void *) cast for virt_to_page().
>   * Add a wait for the GET_READER_PAGE ioctl.
>   * Move writer fields update (overrun/pages_lost/entries/pages_touched)
> in the irq_work.
>   * Rearrange id in struct buffer_page.
>   * Rearrange the meta-page.
>   * ring_buffer_meta_page -> trace_buffer_meta_page.
>   * Add meta_struct_len into the meta-page.
> 
> v4 -> v5:
>   * Trivial rebase onto 6.5-rc3 (previously 6.4-rc3)
> 
> v3 -> v4:
>   * Add to the meta-page:
>- pages_lost / pages_read (allow to compute how full is the
>ring-buffer)
>- read (allow to compute how many entries can be read)
>- A reader_page 

Re: [PATCH v20 2/5] ring-buffer: Introducing ring-buffer mapping functions

2024-04-10 Thread Steven Rostedt
On Sat,  6 Apr 2024 18:36:46 +0100
Vincent Donnefort  wrote:

> +int ring_buffer_map(struct trace_buffer *buffer, int cpu,
> + struct vm_area_struct *vma)
> +{
> + struct ring_buffer_per_cpu *cpu_buffer;
> + unsigned long flags, *subbuf_ids;
> + int err = 0;
> +
> + if (!cpumask_test_cpu(cpu, buffer->cpumask))
> + return -EINVAL;
> +
> + cpu_buffer = buffer->buffers[cpu];
> +
> + mutex_lock(_buffer->mapping_lock);
> +
> + if (cpu_buffer->mapped) {
> + err = __rb_map_vma(cpu_buffer, vma);
> + if (!err)
> + err = __rb_inc_dec_mapped(cpu_buffer, true);
> + mutex_unlock(_buffer->mapping_lock);
> + return err;
> + }
> +
> + /* prevent another thread from changing buffer/sub-buffer sizes */
> + mutex_lock(>mutex);
> +
> + err = rb_alloc_meta_page(cpu_buffer);
> + if (err)
> + goto unlock;
> +
> + /* subbuf_ids include the reader while nr_pages does not */
> + subbuf_ids = kcalloc(cpu_buffer->nr_pages + 1, sizeof(*subbuf_ids), 
> GFP_KERNEL);
> + if (!subbuf_ids) {
> + rb_free_meta_page(cpu_buffer);
> + err = -ENOMEM;
> + goto unlock;
> + }
> +
> + atomic_inc(_buffer->resize_disabled);
> +
> + /*
> +  * Lock all readers to block any subbuf swap until the subbuf IDs are
> +  * assigned.
> +  */
> + raw_spin_lock_irqsave(_buffer->reader_lock, flags);
> + rb_setup_ids_meta_page(cpu_buffer, subbuf_ids);
> + raw_spin_unlock_irqrestore(_buffer->reader_lock, flags);
> +
> + err = __rb_map_vma(cpu_buffer, vma);
> + if (!err) {
> + raw_spin_lock_irqsave(_buffer->reader_lock, flags);
> + cpu_buffer->mapped = 1;
> + raw_spin_unlock_irqrestore(_buffer->reader_lock, flags);
> + } else {
> + kfree(cpu_buffer->subbuf_ids);
> + cpu_buffer->subbuf_ids = NULL;
> + rb_free_meta_page(cpu_buffer);
> + }
> +unlock:

Nit: For all labels, please add a space before them. Otherwise, diffs will
show "unlock" as the function and not "ring_buffer_map", making it harder
to find where the change is.

Same for the labels below.
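
For example (made-up hunk numbers): with the label at column 0, a later diff
of that area shows

  @@ -6385,7 +6385,7 @@ unlock:

while with " unlock:" (one leading space) git's default hunk-header heuristic
falls back to the enclosing function instead:

  @@ -6385,7 +6385,7 @@ int ring_buffer_map(struct trace_buffer *buffer, int cpu,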

-- Steve


> + mutex_unlock(>mutex);
> + mutex_unlock(_buffer->mapping_lock);
> +
> + return err;
> +}
> +
> +int ring_buffer_unmap(struct trace_buffer *buffer, int cpu)
> +{
> + struct ring_buffer_per_cpu *cpu_buffer;
> + unsigned long flags;
> + int err = 0;
> +
> + if (!cpumask_test_cpu(cpu, buffer->cpumask))
> + return -EINVAL;
> +
> + cpu_buffer = buffer->buffers[cpu];
> +
> + mutex_lock(_buffer->mapping_lock);
> +
> + if (!cpu_buffer->mapped) {
> + err = -ENODEV;
> + goto out;
> + } else if (cpu_buffer->mapped > 1) {
> + __rb_inc_dec_mapped(cpu_buffer, false);
> + goto out;
> + }
> +
> + mutex_lock(>mutex);
> + raw_spin_lock_irqsave(_buffer->reader_lock, flags);
> +
> + cpu_buffer->mapped = 0;
> +
> + raw_spin_unlock_irqrestore(_buffer->reader_lock, flags);
> +
> + kfree(cpu_buffer->subbuf_ids);
> + cpu_buffer->subbuf_ids = NULL;
> + rb_free_meta_page(cpu_buffer);
> + atomic_dec(_buffer->resize_disabled);
> +
> + mutex_unlock(>mutex);
> +out:
> + mutex_unlock(_buffer->mapping_lock);
> +
> + return err;
> +}
> +
> +int ring_buffer_map_get_reader(struct trace_buffer *buffer, int cpu)
> +{
> + struct ring_buffer_per_cpu *cpu_buffer;
> + unsigned long reader_size;
> + unsigned long flags;
> +
> + cpu_buffer = rb_get_mapped_buffer(buffer, cpu);
> + if (IS_ERR(cpu_buffer))
> + return (int)PTR_ERR(cpu_buffer);
> +
> + raw_spin_lock_irqsave(_buffer->reader_lock, flags);
> +consume:
> + if (rb_per_cpu_empty(cpu_buffer))
> + goto out;
> +
> + reader_size = rb_page_size(cpu_buffer->reader_page);
> +
> + /*
> +  * There are data to be read on the current reader page, we can
> +  * return to the caller. But before that, we assume the latter will read
> +  * everything. Let's update the kernel reader accordingly.
> +  */
> + if (cpu_buffer->reader_page->read < reader_size) {
> + while (cpu_buffer->reader_page->read < reader_size)
> + rb_advance_reader(cpu_buffer);
> + goto out;
> + }
> +
> + if (WARN_ON(!rb_get_reader_page(cpu_buffer)))
> + goto out;
> +
> + goto consume;
> +out:
> + /* Some archs do not have data cache coherency between kernel and 
> user-space */
> + flush_dcache_folio(virt_to_folio(cpu_buffer->reader_page->page));
> +
> + rb_update_meta_page(cpu_buffer);
> +
> + raw_spin_unlock_irqrestore(_buffer->reader_lock, flags);
> + rb_put_mapped_buffer(cpu_buffer);
> +
> + return 0;
> +}
> +
>  /*
>   * We only allocate new buffers, never free them if the CPU goes down.
>   * 

Re: [PATCH v20 1/5] ring-buffer: allocate sub-buffers with __GFP_COMP

2024-04-10 Thread Steven Rostedt


Hi Vincent,

Thanks for sending this. Nit: Subject should start with a capital:

  ring-buffer: Allocate sub-buffers with __GFP_COMP

-- Steve


On Sat,  6 Apr 2024 18:36:45 +0100
Vincent Donnefort  wrote:

> In preparation for the ring-buffer memory mapping, allocate compound
> pages for the ring-buffer sub-buffers to enable us to map them to
> user-space with vm_insert_pages().
> 
> Signed-off-by: Vincent Donnefort 
> 
> diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
> index 25476ead681b..cc9ebe593571 100644
> --- a/kernel/trace/ring_buffer.c
> +++ b/kernel/trace/ring_buffer.c
> @@ -1524,7 +1524,7 @@ static int __rb_allocate_pages(struct 
> ring_buffer_per_cpu *cpu_buffer,
>   list_add(>list, pages);
>  
>   page = alloc_pages_node(cpu_to_node(cpu_buffer->cpu),
> - mflags | __GFP_ZERO,
> + mflags | __GFP_COMP | __GFP_ZERO,
>   cpu_buffer->buffer->subbuf_order);
>   if (!page)
>   goto free_pages;
> @@ -1609,7 +1609,7 @@ rb_allocate_cpu_buffer(struct trace_buffer *buffer, 
> long nr_pages, int cpu)
>  
>   cpu_buffer->reader_page = bpage;
>  
> - page = alloc_pages_node(cpu_to_node(cpu), GFP_KERNEL | __GFP_ZERO,
> + page = alloc_pages_node(cpu_to_node(cpu), GFP_KERNEL | __GFP_COMP | 
> __GFP_ZERO,
>   cpu_buffer->buffer->subbuf_order);
>   if (!page)
>   goto fail_free_reader;
> @@ -5579,7 +5579,7 @@ ring_buffer_alloc_read_page(struct trace_buffer 
> *buffer, int cpu)
>   goto out;
>  
>   page = alloc_pages_node(cpu_to_node(cpu),
> - GFP_KERNEL | __GFP_NORETRY | __GFP_ZERO,
> + GFP_KERNEL | __GFP_NORETRY | __GFP_COMP | 
> __GFP_ZERO,
>   cpu_buffer->buffer->subbuf_order);
>   if (!page) {
>   kfree(bpage);




Re: [PATCH] SUNRPC: Fix rpcgss_context trace event acceptor field

2024-04-10 Thread Steven Rostedt
On Wed, 10 Apr 2024 12:38:53 -0400
Chuck Lever  wrote:

> On Wed, Apr 10, 2024 at 12:38:13PM -0400, Steven Rostedt wrote:
> > From: "Steven Rostedt (Google)" 
> > 
> > The rpcgss_context trace event acceptor field is a dynamically sized
> > string that records the "data" parameter. But this parameter is also
> > dependent on the "len" field to determine the size of the data.
> > 
> > It needs to use __string_len() helper macro where the length can be passed
> > in. It also incorrectly uses strncpy() to save it instead of
> > __assign_str(). As these macros can change, it is not wise to open code
> > them in trace events.
> > 
> > As of commit c759e609030c ("tracing: Remove __assign_str_len()"),
> > __assign_str() can be used for both __string() and __string_len() fields.
> > Before that commit, __assign_str_len() is required to be used. This needs
> > to be noted for backporting. (In actuality, commit c1fa617caeb0 ("tracing:
> > Rework __assign_str() and __string() to not duplicate getting the string")
> > is the commit that makes __assign_str_len() obsolete).
> > 
> > Cc: sta...@vger.kernel.org
> > Fixes: 0c77668ddb4e7 ("SUNRPC: Introduce trace points in rpc_auth_gss.ko")
> > Signed-off-by: Steven Rostedt (Google)   
> 
> Acked-by: Chuck Lever 
> 

Thanks, but feel free to take it if you want. Unless you'd rather have me
take it through my tree?

-- Steve



[PATCH] SUNRPC: Fix rpcgss_context trace event acceptor field

2024-04-10 Thread Steven Rostedt
From: "Steven Rostedt (Google)" 

The rpcgss_context trace event acceptor field is a dynamically sized
string that records the "data" parameter. But this parameter is also
dependent on the "len" field to determine the size of the data.

It needs to use __string_len() helper macro where the length can be passed
in. It also incorrectly uses strncpy() to save it instead of
__assign_str(). As these macros can change, it is not wise to open code
them in trace events.

As of commit c759e609030c ("tracing: Remove __assign_str_len()"),
__assign_str() can be used for both __string() and __string_len() fields.
Before that commit, __assign_str_len() is required to be used. This needs
to be noted for backporting. (In actuality, commit c1fa617caeb0 ("tracing:
Rework __assign_str() and __string() to not duplicate getting the string")
is the commit that makes __assign_str_len() obsolete).

Cc: sta...@vger.kernel.org
Fixes: 0c77668ddb4e7 ("SUNRPC: Introduce trace points in rpc_auth_gss.ko")
Signed-off-by: Steven Rostedt (Google) 
---
 include/trace/events/rpcgss.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/include/trace/events/rpcgss.h b/include/trace/events/rpcgss.h
index ba2d96a1bc2f..f50fcafc69de 100644
--- a/include/trace/events/rpcgss.h
+++ b/include/trace/events/rpcgss.h
@@ -609,7 +609,7 @@ TRACE_EVENT(rpcgss_context,
__field(unsigned int, timeout)
__field(u32, window_size)
__field(int, len)
-   __string(acceptor, data)
+   __string_len(acceptor, data, len)
),
 
TP_fast_assign(
@@ -618,7 +618,7 @@ TRACE_EVENT(rpcgss_context,
__entry->timeout = timeout;
__entry->window_size = window_size;
__entry->len = len;
-   strncpy(__get_str(acceptor), data, len);
+   __assign_str(acceptor, data);
),
 
TP_printk("win_size=%u expiry=%lu now=%lu timeout=%u acceptor=%.*s",
-- 
2.43.0




Re: [PATCH] ftrace: Fix use-after-free issue in ftrace_location()

2024-04-10 Thread Steven Rostedt
On Mon, 1 Apr 2024 20:55:43 +0800
Zheng Yejian  wrote:

> KASAN reports a bug:
> 
>   BUG: KASAN: use-after-free in ftrace_location+0x90/0x120
>   Read of size 8 at addr 888141d40010 by task insmod/424
>   CPU: 8 PID: 424 Comm: insmod Tainted: GW  6.9.0-rc2+ #213
>   [...]
>   Call Trace:
>
>dump_stack_lvl+0x68/0xa0
>print_report+0xcf/0x610
>kasan_report+0xb5/0xe0
>ftrace_location+0x90/0x120
>register_kprobe+0x14b/0xa40
>kprobe_init+0x2d/0xff0 [kprobe_example]
>do_one_initcall+0x8f/0x2d0
>do_init_module+0x13a/0x3c0
>load_module+0x3082/0x33d0
>init_module_from_file+0xd2/0x130
>__x64_sys_finit_module+0x306/0x440
>do_syscall_64+0x68/0x140
>entry_SYSCALL_64_after_hwframe+0x71/0x79
> 
> The root cause is that when lookup_rec() is lookuping ftrace record of
> an address in some module, and at the same time in ftrace_release_mod(),
> the memory that saving ftrace records has been freed as that module is
> being deleted.
> 
>   register_kprobes() {
> check_kprobe_address_safe() {
>   arch_check_ftrace_location() {
> ftrace_location() {
>   lookup_rec()  // access memory that has been freed by
> // ftrace_release_mod() !!!
> 
> It seems that the ftrace_lock is required when lookuping records in
> ftrace_location(), so is ftrace_location_range().
> 
> Fixes: ae6aa16fdc16 ("kprobes: introduce ftrace based optimization")
> Signed-off-by: Zheng Yejian 
> ---
>  kernel/trace/ftrace.c | 28 ++--
>  1 file changed, 18 insertions(+), 10 deletions(-)
> 
> diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c
> index da1710499698..838d175709c1 100644
> --- a/kernel/trace/ftrace.c
> +++ b/kernel/trace/ftrace.c
> @@ -1581,7 +1581,7 @@ static struct dyn_ftrace *lookup_rec(unsigned long 
> start, unsigned long end)
>  }
>  
>  /**
> - * ftrace_location_range - return the first address of a traced location
> + * ftrace_location_range_locked - return the first address of a traced 
> location
>   *   if it touches the given ip range
>   * @start: start of range to search.
>   * @end: end of range to search (inclusive). @end points to the last byte
> @@ -1592,7 +1592,7 @@ static struct dyn_ftrace *lookup_rec(unsigned long 
> start, unsigned long end)
>   * that is either a NOP or call to the function tracer. It checks the ftrace
>   * internal tables to determine if the address belongs or not.
>   */
> -unsigned long ftrace_location_range(unsigned long start, unsigned long end)
> +static unsigned long ftrace_location_range_locked(unsigned long start, 
> unsigned long end)
>  {
>   struct dyn_ftrace *rec;
>  
> @@ -1603,6 +1603,17 @@ unsigned long ftrace_location_range(unsigned long 
> start, unsigned long end)
>   return 0;
>  }
>  
> +unsigned long ftrace_location_range(unsigned long start, unsigned long end)
> +{
> + unsigned long loc;
> +
> + mutex_lock(_lock);
> + loc = ftrace_location_range_locked(start, end);
> + mutex_unlock(_lock);

I'm not so sure we can take a mutex in all places that call this function.

What about using RCU?

rcu_read_lock();
loc = ftrace_location_range_rcu(start, end);
rcu_read_unlock();

Then in ftrace_release_mod() we can have:

 out_unlock:
mutex_unlock(&ftrace_lock);

/* Need to synchronize with ftrace_location() */
if (tmp_pages)
synchronize_rcu();
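
For illustration, the lookup side would then look something like (sketch
only, reusing the names from the suggestion above):

	unsigned long ftrace_location_range(unsigned long start, unsigned long end)
	{
		unsigned long loc;

		rcu_read_lock();
		loc = ftrace_location_range_rcu(start, end);
		rcu_read_unlock();

		return loc;
	}

with ftrace_location_range_rcu() being the lockless lookup, and the
synchronize_rcu() in ftrace_release_mod() guaranteeing that no reader can
still be walking the pages being freed.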

-- Steve


> +
> + return loc;
> +}
> +
>  /**
>   * ftrace_location - return the ftrace location
>   * @ip: the instruction pointer to check
> @@ -1614,25 +1625,22 @@ unsigned long ftrace_location_range(unsigned long 
> start, unsigned long end)
>   */
>  unsigned long ftrace_location(unsigned long ip)
>  {
> - struct dyn_ftrace *rec;
> + unsigned long loc;
>   unsigned long offset;
>   unsigned long size;
>  
> - rec = lookup_rec(ip, ip);
> - if (!rec) {
> + loc = ftrace_location_range(ip, ip);
> + if (!loc) {
>   if (!kallsyms_lookup_size_offset(ip, , ))
>   goto out;
>  
>   /* map sym+0 to __fentry__ */
>   if (!offset)
> - rec = lookup_rec(ip, ip + size - 1);
> + loc = ftrace_location_range(ip, ip + size - 1);
>   }
>  
> - if (rec)
> - return rec->ip;
> -
>  out:
> - return 0;
> + return loc;
>  }
>  
>  /**




Re: [PATCH] ring-buffer: Only update pages_touched when a new page is touched

2024-04-09 Thread Steven Rostedt
On Wed, 10 Apr 2024 08:44:00 +0900
Masami Hiramatsu (Google)  wrote:

> Looks good to me.
> 
> Acked-by: Masami Hiramatsu (Google) 

Thanks.

> 
> BTW, isn't this a real bugfix, because the page_touched can be
> bigger than nr_pages without this fix?

Yes, I simply forgot to add the Cc stable.

-- Steve



[PATCH] ring-buffer: Only update pages_touched when a new page is touched

2024-04-09 Thread Steven Rostedt
From: "Steven Rostedt (Google)" 

The "buffer_percent" logic that is used by the ring buffer splice code to
only wake up the tasks when there's no data after the buffer is filled to
the percentage of the "buffer_percent" file is dependent on three
variables that determine the amount of data that is in the ring buffer:

 1) pages_read - incremented whenever a new sub-buffer is consumed
 2) pages_lost - incremented every time a writer overwrites a sub-buffer
 3) pages_touched - incremented when a write goes to a new sub-buffer

The percentage is the calculation of:

  (pages_touched - (pages_lost + pages_read)) / nr_pages

Basically, the amount of data is the total number of sub-bufs that have been
touched, minus the number of sub-bufs lost and sub-bufs consumed. This is
divided by the total count to give the buffer percentage. When the
percentage is greater than the value in the "buffer_percent" file, it
wakes up splice readers waiting for that amount.
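
For example (made-up numbers): with nr_pages = 128, pages_touched = 100,
pages_lost = 20 and pages_read = 16, the fill level is
(100 - (20 + 16)) / 128 = 50%, so any reader waiting with "buffer_percent"
set below 50 would be woken up.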

It was observed that over time, the amount read from the splice was
constantly decreasing the longer the trace was running. That is, if one
asked for 60%, it would read over 60% when it first starts tracing, but
then it would be woken up at under 60%, and the amount of data read after
being woken up would slowly decrease until it was much less than the
buffer percent.

This was due to an error in the accounting of the pages_touched increment. This
value is incremented whenever a writer transfers to a new sub-buffer. But
the place where it was incremented was incorrect. If a writer overflowed
the current sub-buffer it would go to the next one. If it gets preempted
by an interrupt at that time, and the interrupt performs a trace, it too
will end up going to the next sub-buffer. But only one should increment
the counter. Unfortunately, that was not the case.

Change the cmpxchg() that does the real switch of the tail-page into a
try_cmpxchg(), and on success, perform the increment of pages_touched. This
will only increment the counter once for when the writer moves to a new
sub-buffer, and not when there's a race and is incremented for when a
writer and its preempting writer both move to the same new sub-buffer.

Signed-off-by: Steven Rostedt (Google) 
---
 kernel/trace/ring_buffer.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 25476ead681b..6511dc3a00da 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -1393,7 +1393,6 @@ static void rb_tail_page_update(struct 
ring_buffer_per_cpu *cpu_buffer,
old_write = local_add_return(RB_WRITE_INTCNT, &tail_page->write);
old_entries = local_add_return(RB_WRITE_INTCNT, &tail_page->entries);
 
-   local_inc(&cpu_buffer->pages_touched);
/*
 * Just make sure we have seen our old_write and synchronize
 * with any interrupts that come in.
@@ -1430,8 +1429,9 @@ static void rb_tail_page_update(struct 
ring_buffer_per_cpu *cpu_buffer,
 */
local_set(&next_page->page->commit, 0);
 
-   /* Again, either we update tail_page or an interrupt does */
-   (void)cmpxchg(&cpu_buffer->tail_page, tail_page, next_page);
+   /* Either we update tail_page or an interrupt does */
+   if (try_cmpxchg(&cpu_buffer->tail_page, &tail_page, next_page))
+   local_inc(&cpu_buffer->pages_touched);
}
 }
 
-- 
2.43.0




Re: [PATCH net-next v3 6/6] rstreason: make it work in trace world

2024-04-09 Thread Steven Rostedt
On Tue,  9 Apr 2024 18:09:34 +0800
Jason Xing  wrote:

>  /*
>   * tcp event with arguments sk and skb
> @@ -74,20 +75,38 @@ DEFINE_EVENT(tcp_event_sk_skb, tcp_retransmit_skb,
>   TP_ARGS(sk, skb)
>  );
>  
> +#undef FN1
> +#define FN1(reason)  TRACE_DEFINE_ENUM(SK_RST_REASON_##reason);
> +#undef FN2
> +#define FN2(reason)  TRACE_DEFINE_ENUM(SKB_DROP_REASON_##reason);
> +DEFINE_RST_REASON(FN1, FN1)

Interesting. I've never seen the passing of the internal macros to the main
macro before. I see that you are using it for handling both the
SK_RST_REASON and the SK_DROP_REASON.

> +
> +#undef FN1
> +#undef FNe1
> +#define FN1(reason)  { SK_RST_REASON_##reason, #reason },
> +#define FNe1(reason) { SK_RST_REASON_##reason, #reason }
> +
> +#undef FN2
> +#undef FNe2
> +#define FN2(reason)  { SKB_DROP_REASON_##reason, #reason },
> +#define FNe2(reason) { SKB_DROP_REASON_##reason, #reason }

Anyway, from a tracing point of view, it looks like it would work
(I haven't tested it).

Reviewed-by: Steven Rostedt (Google) 

-- Steve



Re: [PATCH 1/3] tracing: Remove dependency of saved_cmdlines_buffer on PID_MAX_DEFAULT

2024-04-09 Thread Steven Rostedt
On Mon,  8 Apr 2024 16:58:17 +0200
Michal Koutný  wrote:

> @@ -294,7 +295,7 @@ static void __trace_find_cmdline(int pid, char comm[])
>   return;
>   }
>  
> - tpid = pid & (PID_MAX_DEFAULT - 1);
> + tpid = pid % PID_MAP_SIZE;

Does that compile to the same? This is a fast path.
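
That is, the two forms being compared are roughly:

	tpid = pid & (PID_MAX_DEFAULT - 1);	/* a single AND */
	tpid = pid % PID_MAP_SIZE;		/* only the same AND if PID_MAP_SIZE is a
						   power of two and the compiler can treat
						   pid as non-negative */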

-- Steve


>   map = savedcmd->map_pid_to_cmdline[tpid];
>   if (map != NO_CMDLINE_MAP) {
>   tpid = savedcmd->map_cmdline_to_pid[map];



Re: [PATCH] tracing: Add new_exec tracepoint

2024-04-09 Thread Steven Rostedt
On Mon,  8 Apr 2024 11:01:54 +0200
Marco Elver  wrote:

> Add "new_exec" tracepoint, which is run right after the point of no
> return but before the current task assumes its new exec identity.
> 
> Unlike the tracepoint "sched_process_exec", the "new_exec" tracepoint
> runs before flushing the old exec, i.e. while the task still has the
> original state (such as original MM), but when the new exec either
> succeeds or crashes (but never returns to the original exec).
> 
> Being able to trace this event can be helpful in a number of use cases:
> 
>   * allowing tracing eBPF programs access to the original MM on exec,
> before current->mm is replaced;
>   * counting exec in the original task (via perf event);
>   * profiling flush time ("new_exec" to "sched_process_exec").
> 
> Example of tracing output ("new_exec" and "sched_process_exec"):

How common is this? And can't you just do the same with adding a kprobe?

-- Steve



Re: [PATCH RFC ftrace] Asynchronous grace period for register_ftrace_direct()

2024-04-03 Thread Steven Rostedt
On Wed, 3 Apr 2024 11:53:14 -0700
"Paul E. McKenney"  wrote:

> @@ -5366,6 +5366,13 @@ static void remove_direct_functions_hash(struct 
> ftrace_hash *hash, unsigned long
>   }
>  }
>  
> +static void register_ftrace_direct_cb(struct rcu_head *rhp)
> +{
> + struct ftrace_hash *fhp = container_of(rhp, struct ftrace_hash, rcu);
> +
> + free_ftrace_hash(fhp);
> +}
> +
>  /**
>   * register_ftrace_direct - Call a custom trampoline directly
>   * for multiple functions registered in @ops
> @@ -5464,10 +5471,8 @@ int register_ftrace_direct(struct ftrace_ops *ops, 
> unsigned long addr)
>   out_unlock:
>   mutex_unlock(_mutex);
>  
> - if (free_hash && free_hash != EMPTY_HASH) {
> - synchronize_rcu_tasks();
> - free_ftrace_hash(free_hash);
> - }
> + if (free_hash && free_hash != EMPTY_HASH)
> + call_rcu_tasks(_hash->rcu, register_ftrace_direct_cb);
>  
>   if (new_hash)
>   free_ftrace_hash(new_hash);

I'm getting ready to go on PTO, but from a quick glance this doesn't look
like it would cause any harm.

-- Steve



Re: [PATCH v19 RESEND 3/5] tracing: Allow user-space mapping of the ring-buffer

2024-04-03 Thread Steven Rostedt
On Wed, 3 Apr 2024 15:39:44 +0100
Vincent Donnefort  wrote:

> > Do you plan on sending out a v20 series?  
> 
> Of course, let me spin that this week! Got also few typos to fix in the doc 
> and
> I believe an include missing for riscv.

No rush, I'll be on PTO until next Tuesday, and will not get to it before then.

-- Steve



Re: [PATCH v19 RESEND 3/5] tracing: Allow user-space mapping of the ring-buffer

2024-04-03 Thread Steven Rostedt
On Fri, 29 Mar 2024 14:40:55 -0400
Steven Rostedt  wrote:

> > +static vm_fault_t tracing_buffers_mmap_fault(struct vm_fault *vmf)
> > +{
> > +   return VM_FAULT_SIGBUS;
> > +}  
> 
> If this is all it does, I don't believe it's needed.
> 
> > +
> > +#ifdef CONFIG_TRACER_MAX_TRACE
> > +static int get_snapshot_map(struct trace_array *tr)
> > +{
> > +   int err = 0;
> > +
> > +   /*
> > +* Called with mmap_lock held. lockdep would be unhappy if we would now
> > +* take trace_types_lock. Instead use the specific
> > +* snapshot_trigger_lock.
> > +*/
> > +   spin_lock(>snapshot_trigger_lock);
> > +
> > +   if (tr->snapshot || tr->mapped == UINT_MAX)
> > +   err = -EBUSY;
> > +   else
> > +   tr->mapped++;
> > +
> > +   spin_unlock(>snapshot_trigger_lock);
> > +
> > +   /* Wait for update_max_tr() to observe iter->tr->mapped */
> > +   if (tr->mapped == 1)
> > +   synchronize_rcu();
> > +
> > +   return err;
> > +
> > +}
> > +static void put_snapshot_map(struct trace_array *tr)
> > +{
> > +   spin_lock(>snapshot_trigger_lock);
> > +   if (!WARN_ON(!tr->mapped))
> > +   tr->mapped--;
> > +   spin_unlock(>snapshot_trigger_lock);
> > +}
> > +#else
> > +static inline int get_snapshot_map(struct trace_array *tr) { return 0; }
> > +static inline void put_snapshot_map(struct trace_array *tr) { }
> > +#endif
> > +
> > +static void tracing_buffers_mmap_close(struct vm_area_struct *vma)
> > +{
> > +   struct ftrace_buffer_info *info = vma->vm_file->private_data;
> > +   struct trace_iterator *iter = >iter;
> > +
> > +   WARN_ON(ring_buffer_unmap(iter->array_buffer->buffer, iter->cpu_file));
> > +   put_snapshot_map(iter->tr);
> > +}
> > +
> > +static void tracing_buffers_mmap_open(struct vm_area_struct *vma) { }  
> 
> Same for the open.
> 
> 
> > +
> > +static const struct vm_operations_struct tracing_buffers_vmops = {
> > +   .open   = tracing_buffers_mmap_open,
> > +   .close  = tracing_buffers_mmap_close,
> > +   .fault  = tracing_buffers_mmap_fault,
> > +};  
> 
> I replaced this with:
> 
> static const struct vm_operations_struct tracing_buffers_vmops = {
>   .close  = tracing_buffers_mmap_close,
> };
> 
> And it appears to work just fine. The mm code handles the NULL cases for
> .open and .fault.
> 
> Is there any reason to do something different than the mm defaults?

Hi Vincent,

Do you plan on sending out a v20 series?

-- Steve



Re: [PATCH] ftrace: make extra rcu_is_watching() validation check optional

2024-04-03 Thread Steven Rostedt
On Tue, 2 Apr 2024 22:21:00 -0700
Andrii Nakryiko  wrote:

> > I just checked our fleet-wide production data for the last 24 hours.
> > Within the kprobe/kretprobe code path (ftrace_trampoline and
> > everything called from it), rcu_is_watching (both calls, see below)
> > cause 0.484% CPU cycles usage, which isn't nothing. So definitely we'd
> > prefer to be able to avoid that in production use cases.
> >  
> 
> I just ran synthetic microbenchmark testing multi-kretprobe
> throughput. We get (in millions of BPF kretprobe-multi program
> invocations per second):
>   - 5.568M/s as baseline;
>   - 5.679M/s with changes in this patch (+2% throughput improvement);
>   - 5.808M/s with disabling rcu_is_watching in rethook_try_get()
> (+2.3% more vs just one of rcu_is_watching, and +4.3% vs baseline).
> 
> It's definitely noticeable.

Ah, thanks for verifying (I should have read the thread before replying to
the previous email).

-- Steve



Re: [PATCH] ftrace: make extra rcu_is_watching() validation check optional

2024-04-03 Thread Steven Rostedt
On Tue, 2 Apr 2024 21:00:19 -0700
Andrii Nakryiko  wrote:

> I just noticed another rcu_is_watching() call, in rethook_try_get(),
> which seems to be a similar and complementary validation check to the
> one we are putting under CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING option
> in this patch. It feels like both of them should be controlled by the
> same settings. WDYT? Can I add CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING
> guard around rcu_is_watching() check in rethook_try_get() as well?

That is totally up to Masami. It may have even less overhead as I'm not
sure how many times that gets called, and there may be more work to do than
with function tracing.

-- Steve



Re: [PATCH] ftrace: make extra rcu_is_watching() validation check optional

2024-04-02 Thread Steven Rostedt
On Wed, 3 Apr 2024 09:40:48 +0900
Masami Hiramatsu (Google)  wrote:

> OK, for me, this last sentence is preferred for the help message. That 
> explains
> what this is for. 
> 
> All callbacks that attach to the function tracing have some sort
> of protection against recursion. This option is only to verify that
>    ftrace (and other users of ftrace_test_recursion_trylock()) are not
> called outside of RCU, as if they are, it can cause a race.
> But it also has a noticeable overhead when enabled.
> 
> BTW, how much overhead does this introduce? and the race case a kernel crash?
> or just messed up the ftrace buffer?

There's a hypothetical race where it can cause a use after free.

That is, after you shut down ftrace, you need to call synchronize_rcu_tasks(),
which requires RCU to be watching. There's a theoretical case where that
task calls the trampoline and misses the synchronization. Note, these
locations are with preemption disabled, as rcu is always watching when
preemption is enabled. Thus it would be extremely fast, whereas the
synchronize_rcu_tasks() is rather slow.

We also have synchronize_rcu_tasks_rude() which would actually keep the
trace from happening, as it would schedule on each CPU forcing all CPUs to
have RCU watching.

I have never heard of this race being hit. I guess it could happen on a VM
where a vCPU gets preempted at the right moment for a long time and the
other CPUs synchronize.

But like lockdep, where deadlocks can crash the kernel, we don't enable it
for production.

The overhead is another function call within the function tracer. I had
numbers before, but I guess I could run tests again and get new numbers.

Thanks,

-- Steve



Re: [PATCH] ftrace: make extra rcu_is_watching() validation check optional

2024-04-01 Thread Steven Rostedt
On Mon, 1 Apr 2024 19:29:46 -0700
Andrii Nakryiko  wrote:

> On Mon, Apr 1, 2024 at 5:38 PM Masami Hiramatsu  wrote:
> >
> > On Mon, 1 Apr 2024 12:09:18 -0400
> > Steven Rostedt  wrote:
> >  
> > > On Mon, 1 Apr 2024 20:25:52 +0900
> > > Masami Hiramatsu (Google)  wrote:
> > >  
> > > > > Masami,
> > > > >
> > > > > Are you OK with just keeping it set to N.  
> > > >
> > > > OK, if it is only for the debugging, I'm OK to set N this.
> > > >  
> > > > >
> > > > > We could have other options like PROVE_LOCKING enable it.  
> > > >
> > > > Agreed (but it should say this is a debug option)  
> > >
> > > It does say "Validate" which to me is a debug option. What would you
> > > suggest?  
> >
> > I think the help message should have "This is for debugging ftrace."
> >  
> 
> Sent v2 with adjusted wording, thanks!

You may want to wait till Masami and I agree ;-)

Masami,

But it isn't really for "debugging", it's for validating. That is, it
doesn't give us any information to debug ftrace. It only validates if it is
executed properly. In other words, I never want to be asked "How can I use
this option to debug ftrace?"

For example, we also have:

  "Verify ring buffer time stamp deltas"

that makes sure the time stamps of the ring buffer are not buggy.

And there's:

  "Verify compile time sorting of ftrace functions"

Which is also used to make sure things are working properly.

Neither of the above says they are for "debugging", even though they are
more useful for debugging than this option.

I'm not sure saying this is "debugging ftrace" is accurate. It may help
debug ftrace if it is called outside of an RCU location, which has a
1 in 100,000,000,000 chance of causing an actual bug, as the race window is
extremely small. 

Now if it is also called outside of instrumentation, that will likely trigger
other warnings even without this code, and this will not be needed to debug
that.

ftrace has all sorts of "verifiers" that is used to make sure things are
working properly. And yes, you can consider it as "debugging". But I would
not consider this an option to enable if ftrace was broken, and you are
looking into why it is broken.

To me, this option is only to verify that ftrace (and other users of
ftrace_test_recursion_trylock()) are not called outside of RCU, as if they
are, it can cause a race. But it also has a noticeable overhead when enabled.

-- Steve





Re: [PATCH] ftrace: make extra rcu_is_watching() validation check optional

2024-04-01 Thread Steven Rostedt
On Mon, 1 Apr 2024 20:25:52 +0900
Masami Hiramatsu (Google)  wrote:

> > Masami,
> > 
> > Are you OK with just keeping it set to N.  
> 
> OK, if it is only for the debugging, I'm OK to set N this.
> 
> > 
> > We could have other options like PROVE_LOCKING enable it.  
> 
> Agreed (but it should say this is a debug option)

It does say "Validate" which to me is a debug option. What would you
suggest?

-- Steve



Re: [PATCH v19 RESEND 3/5] tracing: Allow user-space mapping of the ring-buffer

2024-03-29 Thread Steven Rostedt
On Tue, 26 Mar 2024 10:08:28 +
Vincent Donnefort  wrote:

> Currently, user-space extracts data from the ring-buffer via splice,
> which is handy for storage or network sharing. However, due to splice
> limitations, it is imposible to do real-time analysis without a copy.
> 
> A solution for that problem is to let the user-space map the ring-buffer
> directly.
> 
> The mapping is exposed via the per-CPU file trace_pipe_raw. The first
> element of the mapping is the meta-page. It is followed by each
> subbuffer constituting the ring-buffer, ordered by their unique page ID:
> 
>   * Meta-page -- include/uapi/linux/trace_mmap.h for a description
>   * Subbuf ID 0
>   * Subbuf ID 1
>  ...
> 
> It is therefore easy to translate a subbuf ID into an offset in the
> mapping:
> 
>   reader_id = meta->reader->id;
>   reader_offset = meta->meta_page_size + reader_id * meta->subbuf_size;
> 
> When new data is available, the mapper must call a newly introduced ioctl:
> TRACE_MMAP_IOCTL_GET_READER. This will update the Meta-page reader ID to
> point to the next reader containing unread data.
> 

Thanks for the update Vincent!

> Mapping will prevent snapshot and buffer size modifications.
> 
> CC: 
> Signed-off-by: Vincent Donnefort 
> 
> diff --git a/include/uapi/linux/trace_mmap.h b/include/uapi/linux/trace_mmap.h
> index ffcd8dfcaa4f..d25b9d504a7c 100644
> --- a/include/uapi/linux/trace_mmap.h
> +++ b/include/uapi/linux/trace_mmap.h
> @@ -43,4 +43,6 @@ struct trace_buffer_meta {
>   __u64   Reserved2;
>  };
>  
> +#define TRACE_MMAP_IOCTL_GET_READER  _IO('T', 0x1)
> +
>  #endif /* _TRACE_MMAP_H_ */
> diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
> index 233d1af39fff..0f37aa9860fd 100644
> --- a/kernel/trace/trace.c
> +++ b/kernel/trace/trace.c
> @@ -1191,6 +1191,12 @@ static void tracing_snapshot_instance_cond(struct 
> trace_array *tr,
>   return;
>   }
>  
> + if (tr->mapped) {
> + trace_array_puts(tr, "*** BUFFER MEMORY MAPPED ***\n");
> + trace_array_puts(tr, "*** Can not use snapshot (sorry) ***\n");
> + return;
> + }
> +
>   local_irq_save(flags);
>   update_max_tr(tr, current, smp_processor_id(), cond_data);
>   local_irq_restore(flags);
> @@ -1323,7 +1329,7 @@ static int tracing_arm_snapshot_locked(struct 
> trace_array *tr)
>   lockdep_assert_held(_types_lock);
>  
>   spin_lock(>snapshot_trigger_lock);
> - if (tr->snapshot == UINT_MAX) {
> + if (tr->snapshot == UINT_MAX || tr->mapped) {
>   spin_unlock(>snapshot_trigger_lock);
>   return -EBUSY;
>   }
> @@ -6068,7 +6074,7 @@ static void tracing_set_nop(struct trace_array *tr)
>  {
>   if (tr->current_trace == _trace)
>   return;
> - 
> +
>   tr->current_trace->enabled--;
>  
>   if (tr->current_trace->reset)
> @@ -8194,15 +8200,32 @@ tracing_buffers_splice_read(struct file *file, loff_t 
> *ppos,
>   return ret;
>  }
>  
> -/* An ioctl call with cmd 0 to the ring buffer file will wake up all waiters 
> */
>  static long tracing_buffers_ioctl(struct file *file, unsigned int cmd, 
> unsigned long arg)
>  {
>   struct ftrace_buffer_info *info = file->private_data;
>   struct trace_iterator *iter = >iter;
> + int err;
> +
> + if (cmd == TRACE_MMAP_IOCTL_GET_READER) {
> + if (!(file->f_flags & O_NONBLOCK)) {
> + err = ring_buffer_wait(iter->array_buffer->buffer,
> +iter->cpu_file,
> +iter->tr->buffer_percent,
> +NULL, NULL);
> + if (err)
> + return err;
> + }
>  
> - if (cmd)
> - return -ENOIOCTLCMD;
> + return ring_buffer_map_get_reader(iter->array_buffer->buffer,
> +   iter->cpu_file);
> + } else if (cmd) {
> + return -ENOTTY;
> + }
>  
> + /*
> +  * An ioctl call with cmd 0 to the ring buffer file will wake up all
> +  * waiters
> +  */
>   mutex_lock(_types_lock);
>  
>   /* Make sure the waiters see the new wait_index */
> @@ -8214,6 +8237,94 @@ static long tracing_buffers_ioctl(struct file *file, 
> unsigned int cmd, unsigned
>   return 0;
>  }
>  
> +static vm_fault_t tracing_buffers_mmap_fault(struct vm_fault *vmf)
> +{
> + return VM_FAULT_SIGBUS;
> +}

If this is all it does, I don't believe it's needed.

> +
> +#ifdef CONFIG_TRACER_MAX_TRACE
> +static int get_snapshot_map(struct trace_array *tr)
> +{
> + int err = 0;
> +
> + /*
> +  * Called with mmap_lock held. lockdep would be unhappy if we would now
> +  * take trace_types_lock. Instead use the specific
> +  * snapshot_trigger_lock.
> +  */
> + spin_lock(>snapshot_trigger_lock);
> +
> + if (tr->snapshot || tr->mapped == UINT_MAX)
> +  

Re: [PATCH] trace/sched: add tgid for sched_wakeup_template

2024-03-27 Thread Steven Rostedt
On Wed, 27 Mar 2024 16:50:57 +0800
Tio Zhang  wrote:

> By doing this, we are able to filter tasks by tgid while we are
> tracing wakeup events by ebpf or other methods.
> 
> For example, when we care about tracing a user space process (which has
> uncertain number of LWPs, i.e, pids) to monitor its wakeup latency,
> without tgid available in sched_wakeup tracepoints, we would struggle
> finding out all pids to trace, or we could use kprobe to achieve tgid
> tracing, which is less accurate and much less efficient than using
> tracepoint.

This is a very common trace event, and I really do not want to add more
data than necessary to it, as it increases the size of the event which
means fewer events can be recorded on a fixed size trace ring buffer.

Note, you are not modifying the "tracepoint", but you are actually
modifying a "trace event".

 "tracepoint" is the hook in the kernel code:

   trace_sched_wakeup()

 "trace event" is defined by TRACE_EVENT() macro (and friends) that defines
 what is exposed in the tracefs file system.

I thought ebpf could hook directly to the tracepoint which is:

trace_sched_wakeup(p);

I believe you can have direct access to the 'p' from ebpf before it is
processed.
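
A minimal sketch of that (assuming libbpf and a BTF-enabled kernel, untested
here):

	#include "vmlinux.h"
	#include <bpf/bpf_helpers.h>
	#include <bpf/bpf_tracing.h>
	#include <bpf/bpf_core_read.h>

	char LICENSE[] SEC("license") = "GPL";

	SEC("tp_btf/sched_wakeup")
	int BPF_PROG(handle_sched_wakeup, struct task_struct *p)
	{
		pid_t tgid = BPF_CORE_READ(p, tgid);

		/* filter or aggregate on the thread group id here */
		bpf_printk("wakeup pid=%d tgid=%d", BPF_CORE_READ(p, pid), tgid);
		return 0;
	}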

There's also "trace probes" (I think we are lacking documentation on this,
as well as event probes :-p):

 $ gdb vmlinux
(gdb) p &((struct task_struct *)0)->tgid
$1 = (pid_t *) 0x56c
(gdb) p &((struct task_struct *)0)->pid
$2 = (pid_t *) 0x568

 # echo 't:wakeup sched_waking pid=+0x568($arg1):u32 tgid=+0x56c($arg1):u32' > /sys/kernel/tracing/dynamic_events

 # trace-cmd start -e wakeup
 # trace-cmd show
   trace-cmd-7307[003] d..6. 599486.485762: wakeup: 
(__probestub_sched_waking+0x4/0x10) pid=845 tgid=845
bash-845 [001] d.s4. 599486.486136: wakeup: 
(__probestub_sched_waking+0x4/0x10) pid=17 tgid=17
bash-845 [001] d..4. 599486.486336: wakeup: 
(__probestub_sched_waking+0x4/0x10) pid=5516 tgid=5516
   kworker/u18:2-5516[001] d..4. 599486.486445: wakeup: 
(__probestub_sched_waking+0x4/0x10) pid=818 tgid=818
  -0   [001] d.s4. 599486.491206: wakeup: 
(__probestub_sched_waking+0x4/0x10) pid=17 tgid=17
  -0   [001] d.s5. 599486.493218: wakeup: 
(__probestub_sched_waking+0x4/0x10) pid=17 tgid=17
  -0   [001] d.s4. 599486.497200: wakeup: 
(__probestub_sched_waking+0x4/0x10) pid=17 tgid=17
  -0   [003] d.s4. 599486.829209: wakeup: 
(__probestub_sched_waking+0x4/0x10) pid=70 tgid=70

The above attaches to the tracepoint and $arg1 is the 'struct task_struct *p'.

-- Steve


> 
> Signed-off-by: Tio Zhang 
> Signed-off-by: Dylane Chen 
> ---
>  include/trace/events/sched.h | 6 --
>  1 file changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
> index dbb01b4b7451..ea7e525649e5 100644
> --- a/include/trace/events/sched.h
> +++ b/include/trace/events/sched.h
> @@ -149,6 +149,7 @@ DECLARE_EVENT_CLASS(sched_wakeup_template,
>   __field(pid_t,  pid )
>   __field(int,prio)
>   __field(int,target_cpu  )
> + __field(pid_t,  tgid)
>   ),
>  
>   TP_fast_assign(
> @@ -156,11 +157,12 @@ DECLARE_EVENT_CLASS(sched_wakeup_template,
>   __entry->pid= p->pid;
>   __entry->prio   = p->prio; /* XXX SCHED_DEADLINE */
>   __entry->target_cpu = task_cpu(p);
> + __entry->tgid   = p->tgid;
>   ),
>  
> - TP_printk("comm=%s pid=%d prio=%d target_cpu=%03d",
> + TP_printk("comm=%s pid=%d prio=%d target_cpu=%03d tgid=%d",
> __entry->comm, __entry->pid, __entry->prio,
> -   __entry->target_cpu)
> +   __entry->target_cpu, __entry->tgid)
>  );
>  
>  /*




Re: [PATCH] ftrace: make extra rcu_is_watching() validation check optional

2024-03-26 Thread Steven Rostedt
On Tue, 26 Mar 2024 09:16:33 -0700
Andrii Nakryiko  wrote:

> > It's no different than lockdep. Test boxes should have it enabled, but
> > there's no reason to have this enabled in a production system.
> >  
> 
> I tend to agree with Steven here (which is why I sent this patch as it
> is), but I'm happy to do it as an opt-out, if Masami insists. Please
> do let me know if I need to send v2 or this one is actually the one
> we'll end up using. Thanks!

Masami,

Are you OK with just keeping it set to N.

We could have other options like PROVE_LOCKING enable it.

-- Steve



Re: [PATCH 11/12] [v4] kallsyms: rework symbol lookup return codes

2024-03-26 Thread Steven Rostedt
On Tue, 26 Mar 2024 15:53:38 +0100
Arnd Bergmann  wrote:

> -const char *
> +int
>  ftrace_mod_address_lookup(unsigned long addr, unsigned long *size,
>  unsigned long *off, char **modname, char *sym)
>  {
>   struct ftrace_mod_map *mod_map;
> - const char *ret = NULL;
> + int ret;

This needs to be ret = 0;

>  
>   /* mod_map is freed via call_rcu() */
>   preempt_disable();

As here we have:

list_for_each_entry_rcu(mod_map, _mod_maps, list) {
ret = ftrace_func_address_lookup(mod_map, addr, size, off, sym);
if (ret) {
if (modname)
*modname = mod_map->mod->name;
break;
}
}
preempt_enable();

return ret;
}

Where it is possible for the loop never to be executed.

-- Steve



Re: [PATCH] ftrace: make extra rcu_is_watching() validation check optional

2024-03-25 Thread Steven Rostedt
On Mon, 25 Mar 2024 11:38:48 +0900
Masami Hiramatsu (Google)  wrote:

> On Fri, 22 Mar 2024 09:03:23 -0700
> Andrii Nakryiko  wrote:
> 
> > Introduce CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING config option to
> > control whether ftrace low-level code performs additional
> > rcu_is_watching()-based validation logic in an attempt to catch noinstr
> > violations.
> > 
> > This check is expected to never be true in practice and would be best
> > controlled with extra config to let users decide if they are willing to
> > pay the price.  
> 
> Hmm, for me, it sounds like "WARN_ON(something) never be true in practice
> so disable it by default". I think CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING
> is OK, but tht should be set to Y by default. If you have already verified
> that your system never make it true and you want to optimize your ftrace
> path, you can manually set it to N at your own risk.
> 

Really, it's for debugging. I would argue that it should *not* be default y.
Peter added this to find all the locations that could be called where RCU
is not watching. But the issue I have is that it *does cause
overhead* with function tracing.

I believe we found pretty much all locations that were an issue, and we
should now just make it an option for developers.

It's no different than lockdep. Test boxes should have it enabled, but
there's no reason to have this enabled in a production system.

-- Steve


> > 
> > Cc: Steven Rostedt 
> > Cc: Masami Hiramatsu 
> > Cc: Paul E. McKenney 
> > Signed-off-by: Andrii Nakryiko 
> > ---
> >  include/linux/trace_recursion.h |  2 +-
> >  kernel/trace/Kconfig| 13 +
> >  2 files changed, 14 insertions(+), 1 deletion(-)
> > 
> > diff --git a/include/linux/trace_recursion.h 
> > b/include/linux/trace_recursion.h
> > index d48cd92d2364..24ea8ac049b4 100644
> > --- a/include/linux/trace_recursion.h
> > +++ b/include/linux/trace_recursion.h
> > @@ -135,7 +135,7 @@ extern void ftrace_record_recursion(unsigned long ip, 
> > unsigned long parent_ip);
> >  # define do_ftrace_record_recursion(ip, pip)   do { } while (0)
> >  #endif
> >  
> > -#ifdef CONFIG_ARCH_WANTS_NO_INSTR
> > +#ifdef CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING
> >  # define trace_warn_on_no_rcu(ip)  \
> > ({  \
> > bool __ret = !rcu_is_watching();\  
> 
> BTW, maybe we can add "unlikely" in the next "if" line?
> 
> > diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig
> > index 61c541c36596..19bce4e217d6 100644
> > --- a/kernel/trace/Kconfig
> > +++ b/kernel/trace/Kconfig
> > @@ -974,6 +974,19 @@ config FTRACE_RECORD_RECURSION_SIZE
> >   This file can be reset, but the limit can not change in
> >   size at runtime.
> >  
> > +config FTRACE_VALIDATE_RCU_IS_WATCHING
> > +   bool "Validate RCU is on during ftrace recursion check"
> > +   depends on FUNCTION_TRACER
> > +   depends on ARCH_WANTS_NO_INSTR  
> 
>   default y
> 
> > +   help
> > + All callbacks that attach to the function tracing have some sort
> > + of protection against recursion. This option performs additional
> > + checks to make sure RCU is on when ftrace callbacks recurse.
> > +
> > + This will add more overhead to all ftrace-based invocations.  
> 
>   ... invocations, but keep it safe.
> 
> > +
> > + If unsure, say N  
> 
>   If unsure, say Y
> 
> Thank you,
> 
> > +
> >  config RING_BUFFER_RECORD_RECURSION
> > bool "Record functions that recurse in the ring buffer"
> > depends on FTRACE_RECORD_RECURSION
> > -- 
> > 2.43.0
> >   
> 
> 




Re: [RFC PATCH] riscv: Implement HAVE_DYNAMIC_FTRACE_WITH_CALL_OPS

2024-03-21 Thread Steven Rostedt
On Tue, 12 Mar 2024 13:42:28 +
Mark Rutland  wrote:

> There are ways around that, but they're complicated and/or expensive, e.g.
> 
> * Use a sequence of multiple patches, starting with replacing the JALR with an
>   exception-generating instruction with a fixup handler, which is sort-of what
>   x86 does with UD2. This may require multiple passes with
>   synchronize_rcu_tasks() to make sure all threads have seen the latest
>   instructions, and that cannot be done under stop_machine(), so if you need
>   stop_machine() for CMODx reasons, you may need to use that several times 
> with
>   intervening calls to synchronize_rcu_tasks().

Just for clarification. x86 doesn't use UD2 for updating the call sites. It
uses the breakpoint (0xcc) operation. This is because x86 instructions are
not a fixed size and can cross cache boundaries, making updates to text
sections dangerous as another CPU may get half the old instruction and half
the new one!

Thus, when we have:

0f 1f 44 00 00  nop

and want to convert it to:

e8 e7 57 07 00  call ftrace_caller

We have to first add a breakpoint:

cc 1f 44 00 00

Send an IPI to all CPUs so that they see the breakpoint (IPI is equivalent
to a memory barrier).

We have a ftrace breakpoint handler that will look at the function that the
breakpoint happened on. If it was a nop, it will just skip over the rest of
the instruction, and return from the break point handler just after the
"cc 17 44 00 00". If it's supposed to be a function, it will push the
return to after the instruction onto the stack, and return from the break
point handler to ftrace_caller. (Although things changed a little since
this update is now handled by text_poke_bp_batch()).

Then it changes the rest of the instruction:

cc e7 57 07 00

Sends out another IPI to all CPUs and removes the break point with the new
instruction op.

e8 e7 57 07 00

and now all the callers of this function will call the ftrace_caller.
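
A rough userspace simulation of that three-step sequence (the real update is
done by text_poke_bp_batch() with IPIs between the steps; the byte array
below merely stands in for the patched call site):

#include <stdio.h>

static void show(const char *step, const unsigned char insn[5])
{
	printf("%-20s %02x %02x %02x %02x %02x\n", step,
	       insn[0], insn[1], insn[2], insn[3], insn[4]);
}

int main(void)
{
	/* 0f 1f 44 00 00: the 5-byte nop at the function's call site */
	unsigned char insn[5] = { 0x0f, 0x1f, 0x44, 0x00, 0x00 };
	/* e8 e7 57 07 00: the eventual "call ftrace_caller" */
	const unsigned char call[5] = { 0xe8, 0xe7, 0x57, 0x07, 0x00 };

	show("original nop:", insn);

	insn[0] = 0xcc;			/* 1) add breakpoint, then IPI/sync */
	show("breakpoint added:", insn);

	for (int i = 1; i < 5; i++)	/* 2) patch the tail, then IPI/sync */
		insn[i] = call[i];
	show("tail updated:", insn);

	insn[0] = call[0];		/* 3) remove breakpoint, then IPI/sync */
	show("final call:", insn);

	return 0;
}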

-- Steve



Re: [PATCH v7 1/5] tracing/probes: support '%pd' type for print struct dentry's name

2024-03-21 Thread Steven Rostedt
On Fri, 22 Mar 2024 00:28:05 +0900
Masami Hiramatsu (Google)  wrote:

> On Fri, 22 Mar 2024 00:07:59 +0900
> Masami Hiramatsu (Google)  wrote:
> 
> > > What would be really useful is if we had a way to expose BTF here. 
> > > Something like:
> > > 
> > >  "%pB::"
> > > 
> > > The "%pB" would mean to look up the struct/field offsets and types via 
> > > BTF,
> > > and create the appropriate command to find and print it.  
> > 
> > Would you mean casting the pointer to "<type>"?  
> 
> BTW, for this BTF type casting, I would like to make it more natural, like
> (<type> *)$arg1-><field>, the same as other BTF args.
> 

Sure. I'm just interested in the functionality. I'll let others come up
with the syntax. ;-)

-- Steve



Re: [PATCH v7 1/5] tracing/probes: support '%pd' type for print struct dentry's name

2024-03-21 Thread Steven Rostedt
On Fri, 22 Mar 2024 00:07:59 +0900
Masami Hiramatsu (Google)  wrote:

> > What would be really useful is if we had a way to expose BTF here. 
> > Something like:
> > 
> >  "%pB::"
> > 
> > The "%pB" would mean to look up the struct/field offsets and types via BTF,
> > and create the appropriate command to find and print it.  
> 
> Would you mean casting the pointer to "<type>"?

I mean, instead of having:

 ":%pd"

We could have:

 "+0(*:%pB:dentry:name):string"

Where the parsing could use BTF to see that this is a pointer to "struct dentry"
and the member field is "name".

This would also allow pretty much any other structure dereference.
That is if BTF gives structure member offsets?

-- Steve



Re: [RFC PATCH] riscv: Implement HAVE_DYNAMIC_FTRACE_WITH_CALL_OPS

2024-03-21 Thread Steven Rostedt
On Tue, 12 Mar 2024 13:42:28 +
Mark Rutland  wrote:

> > It would be interesting to see how the per-call performance would
> > improve on x86 with CALL_OPS! ;-)  
> 
> Heh. ;)

But this would require adding -fpatchable-function-entry on x86, which
would increase the size of text and could possibly cause a performance
regression when tracing is disabled.

I'd have no problem if someone were to implement it, but there's a strict
requirement that it does not slow down the system when tracing is disabled,
as tracing is a second-class citizen compared to the rest of the kernel.

-- Steve



Re: [PATCH v7 1/5] tracing/probes: support '%pd' type for print struct dentry's name

2024-03-21 Thread Steven Rostedt
On Wed, 20 Mar 2024 21:29:20 +0800
Ye Bin  wrote:

> Support print type '%pd' for print dentry's  name.
> 

The above is not a very detailed change log. A change log should state not
only what the change is doing, but also why.

Having examples of before and after would be useful in the change log.

> Signed-off-by: Ye Bin 
> ---
>  kernel/trace/trace.c|  2 +-
>  kernel/trace/trace_fprobe.c |  6 +
>  kernel/trace/trace_kprobe.c |  6 +
>  kernel/trace/trace_probe.c  | 50 +
>  kernel/trace/trace_probe.h  |  2 ++
>  5 files changed, 65 insertions(+), 1 deletion(-)
> 
> diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
> index b12f8384a36a..ac26b8446337 100644
> --- a/kernel/trace/trace.c
> +++ b/kernel/trace/trace.c


> +/* @buf: *buf must be equal to NULL. Caller must to free *buf */
> +int traceprobe_expand_dentry_args(int argc, const char *argv[], char **buf)
> +{
> + int i, used, ret;
> + const int bufsize = MAX_DENTRY_ARGS_LEN;
> + char *tmpbuf = NULL;
> +
> + if (*buf)
> + return -EINVAL;
> +
> + used = 0;
> + for (i = 0; i < argc; i++) {
> + if (glob_match("*:%pd", argv[i])) {
> + char *tmp;
> + char *equal;
> +
> + if (!tmpbuf) {
> + tmpbuf = kmalloc(bufsize, GFP_KERNEL);
> + if (!tmpbuf)
> + return -ENOMEM;
> + }
> +
> + tmp = kstrdup(argv[i], GFP_KERNEL);
> + if (!tmp)
> + goto nomem;
> +
> + equal = strchr(tmp, '=');
> + if (equal)
> + *equal = '\0';
> + tmp[strlen(argv[i]) - 4] = '\0';
> + ret = snprintf(tmpbuf + used, bufsize - used,
> +"%s%s+0x0(+0x%zx(%s)):string",
> +equal ? tmp : "", equal ? "=" : "",
> +offsetof(struct dentry, d_name.name),
> +equal ? equal + 1 : tmp);

What would be really useful is if we had a way to expose BTF here. Something 
like:

 "%pB::"

The "%pB" would mean to look up the struct/field offsets and types via BTF,
and create the appropriate command to find and print it.

That would be much more flexible and useful as it would allow reading any
field from any structure without having to use gdb.

-- Steve


> + kfree(tmp);
> + if (ret >= bufsize - used)
> + goto nomem;
> + argv[i] = tmpbuf + used;
> + used += ret + 1;
> + }
> + }
> +
> + *buf = tmpbuf;
> + return 0;
> +nomem:
> + kfree(tmpbuf);
> + return -ENOMEM;
> +}
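
As a rough sketch of the rewrite the helper above performs, in plain
userspace C (the 0x20 offset is only a placeholder for
offsetof(struct dentry, d_name.name), and "name=$arg1:%pd" is a made-up
argument):

#include <stdio.h>

int main(void)
{
	char buf[64];

	/* "name=$arg1:%pd" would be rewritten into roughly this command */
	snprintf(buf, sizeof(buf), "%s=+0x0(+0x%zx(%s)):string",
		 "name", (size_t)0x20, "$arg1");
	printf("%s\n", buf);	/* name=+0x0(+0x20($arg1)):string */
	return 0;
}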



Re: [PATCH v3] net/ipv4: add tracepoint for icmp_send

2024-03-21 Thread Steven Rostedt
On Thu, 21 Mar 2024 10:45:00 +0800
Jason Xing  wrote:

> The format of the whole patch looks strange... Did you send this patch
> by using 'git send-email' instead of pasting the text and sending?

Yeah, it's base64 encoded.

Subject: 
=?UTF-8?B?wqBbUEFUQ0ggdjNdIG5ldC9pcHY0OiBhZGQgdHJhY2Vwb2ludCBmb3IgaWNtcF9zZW5k?=
Content-Type: multipart/mixed;
boundary="=_001_next="
X-MAIL:mse-fl2.zte.com.cn 42L2Ahm2097008
X-Fangmail-Gw-Spam-Type: 0
X-Fangmail-Anti-Spam-Filtered: true
X-Fangmail-MID-QID: 65FB975E.000/4V0TV60kJlz8XrRb



--=_001_next=
Content-Type: multipart/related;
boundary="=_002_next="


--=_002_next=
Content-Type: multipart/alternative;
boundary="=_003_next="


--=_003_next=
Content-Type: text/plain;
charset="UTF-8"
Content-Transfer-Encoding: base64

-- Steve



Re: [RFC][PATCH] tracing: Introduce restart_critical_timings()

2024-03-20 Thread Steven Rostedt
On Wed, 20 Mar 2024 20:46:11 -0400
Waiman Long  wrote:

> I have no objection to that. However, there are now 2 function call 
> overhead in each iteration if either CONFIG_IRQSOFF_TRACER or 
> CONFIG_PREEMPT_TRACER is on. Is it possible to do it with just one 
> function call? IOW, make restart_critical_timings() a real function.

Yeah, I could do that.

-- Steve



Re: [RFC][PATCH] tracing: Introduce restart_critical_timings()

2024-03-20 Thread Steven Rostedt
On Wed, 20 Mar 2024 13:15:39 -0400
Mathieu Desnoyers  wrote:

> > I would like to introduce restart_critical_timings() and place it in
> > locations that have this behavior.  
> 
> Is there any way you could move this to need_resched() rather than
> sprinkle those everywhere ?

Because need_resched() itself does not mean it's going to schedule
immediately. I looked at a few locations where need_resched() is called.
Most are in idle code where the critical timings are already handled.

I'm not sure I'd add it for places like mm/memory.c or 
drivers/md/bcache/btree.c.

A lot of places look to use it more for PREEMPT_NONE situations as an
open-coded cond_resched().

The main reason this one is particularly an issue is that it spins as long
as the owner is still running, which may be some time; here it was 7ms.

-- Steve



[RFC][PATCH] tracing: Introduce restart_critical_timings()

2024-03-20 Thread Steven Rostedt
From: Steven Rostedt (Google) 

I'm debugging some latency issues on a Chromebook and the preemptirqsoff
tracer hit this:

# tracer: preemptirqsoff
#
# preemptirqsoff latency trace v1.1.5 on 5.15.148-21853-g165fd2387469-dirty
# 
# latency: 7813 us, #60/60, CPU#1 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:2)
#    -----------------
#    | task: init-1 (uid:0 nice:0 policy:0 rt_prio:0)
#    -----------------
#  => started at: rwsem_optimistic_spin
#  => ended at:   rwsem_optimistic_spin
#
#
#                    _------=> CPU#
#                   / _-----=> irqs-off
#                  | / _----=> need-resched
#                  || / _---=> hardirq/softirq
#                  ||| / _--=> preempt-depth
#                  |||| / _-=> migrate-disable
#                  ||||| /     delay
#  cmd     pid     |||||| time  |   caller
#     \   /        ||||||  \    |    /
   <...>-1 1...1.0us!: rwsem_optimistic_spin+0x20/0x194 
<-rwsem_optimistic_spin+0x20/0x194
   <...>-1 1.N.1. 7813us : rwsem_optimistic_spin+0x140/0x194 
<-rwsem_optimistic_spin+0x140/0x194
   <...>-1 1.N.1. 7814us+: tracer_preempt_on+0x4c/0x6a 
<-rwsem_optimistic_spin+0x140/0x194
   <...>-1 1.N.1. 7824us : 
 => rwsem_optimistic_spin+0x140/0x194
 => rwsem_down_write_slowpath+0xc9/0x3fe
 => copy_process+0xd08/0x1b8a
 => kernel_clone+0x94/0x256
 => __x64_sys_clone+0x7a/0x9a
 => do_syscall_64+0x51/0xa1
 => entry_SYSCALL_64_after_hwframe+0x5c/0xc6

Which isn't a real issue, as it's in the rwsem_optimistic_spin() code that
spins with interrupts enabled, preempt disabled, and checks for
need_resched(). If it is true, it breaks out and schedules.

Hence, it's hiding real issues, because it can spin for a very long time
and this is not the source of the latency I'm tracking down.

I would like to introduce restart_critical_timings() and place it in
locations that have this behavior.

Signed-off-by: Steven Rostedt (Google) 

diff --git a/include/linux/irqflags.h b/include/linux/irqflags.h
index 147feebd508c..e9f97f60bfc0 100644
--- a/include/linux/irqflags.h
+++ b/include/linux/irqflags.h
@@ -145,6 +145,13 @@ do {   \
 # define start_critical_timings() do { } while (0)
 #endif
 
+/* Used in spins that check need_resched() with preemption disabled */
+static inline void restart_critical_timings(void)
+{
+   stop_critical_timings();
+   start_critical_timings();
+}
+
 #ifdef CONFIG_DEBUG_IRQFLAGS
 extern void warn_bogus_irq_restore(void);
 #define raw_check_bogus_irq_restore()  \
diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index 2340b6d90ec6..fa7b43e53d20 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -27,6 +27,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 
 #ifndef CONFIG_PREEMPT_RT
@@ -780,6 +781,7 @@ rwsem_spin_on_owner(struct rw_semaphore *sem)
 */
barrier();
 
+   restart_critical_timings();
if (need_resched() || !owner_on_cpu(owner)) {
state = OWNER_NONSPINNABLE;
break;
@@ -912,6 +914,7 @@ static bool rwsem_optimistic_spin(struct rw_semaphore *sem)
 * a writer, need_resched() check needs to be done here.
 */
if (owner_state != OWNER_WRITER) {
+   restart_critical_timings();
if (need_resched())
break;
if (rt_task(current) &&



Re: [GIT PULL] tracing/tools: Updates for 6.9

2024-03-20 Thread Steven Rostedt
On Wed, 20 Mar 2024 13:41:12 +0100
Daniel Bristot de Oliveira  wrote:

> On 3/20/24 00:02, Steven Rostedt wrote:
> > On Mon, 18 Mar 2024 18:41:13 +0100
> > Daniel Bristot de Oliveira  wrote:
> >   
> >> Steven,
> >>
> >> Tracing tooling updates for 6.9
> >>
> >> Tracing:
> >> - Update makefiles for latency-collector and RTLA,
> >>   using tools/build/ makefiles like perf does, inheriting
> >>   its benefits. For example, having a proper way to
> >>   handle dependencies.
> >>
> >> - The timerlat tracer has an interface for any tool to use.
> >>   rtla timerlat tool uses this interface dispatching its
> >>   own threads as workload. But, rtla timerlat could also be
> >>   used for any other process. So, add 'rtla timerlat -U'
> >>   option, allowing the timerlat tool to measure the latency of
> >>   any task using the timerlat tracer interface.
> >>
> >> Verification:
> >> - Update makefiles for verification/rv, using tools/build/
> >>   makefiles like perf does, inheriting its benefits.
> >>   For example, having a proper way to handle dependencies.
> >>
> >>
> >> Please pull the latest trace-tools-v6.9 tree, which can be found at:
> >>
> >>
> >>   git://git.kernel.org/pub/scm/linux/kernel/git/bristot/linux.git
> >> trace-tools-v6.9  
> > 
> > Looks like you just built on top of a random commit from Linus's tree:  
> 
> yeah :-/
> 
> > commit f6cef5f8c37f58a3bc95b3754c3ae98e086631ca
> > Merge: 906a93befec8 8f06fb458539
> > Author: Linus Torvalds 
> > Date:   Sun Mar 17 16:59:33 2024 -0700
> > 
> > Merge tag 'i3c/for-6.9' of 
> > git://git.kernel.org/pub/scm/linux/kernel/git/i3c/linux
> > 
> > Linus prefers basing off of real tags or previous pulls from us.  
> 
> Ack, took note. I will do on top v6.8 tag.
> 
> > Can you rebase your changes on v6.8 and resend?
> > 
> >   $ git checkout v6.8
> >   $ git cherry-pick 
> > f6cef5f8c37f58a3bc95b3754c3ae98e086631ca..trace-tools-v6.9
> > 
> > Appears to work fine.  
> 
> questions: when something go wrong in a pull request
> 
>   - Should I keep the old tag, and then do another one with -2
> (it seems you do like this), or delete the old tag and send it again
> with the same name?

Just create a new tag.

> 
>   - Should I resend the PULL request with something in the log or
> at the Subject saying it is a v2 of the pull request?

Yes please.

> 
> I could ask via chat, but I think it is good for the community to
> have access to these info.

+1

Thanks,


-- Steve



Re: [PATCH] tracing: probes: Fix to zero initialize a local variable

2024-03-20 Thread Steven Rostedt
On Wed, 20 Mar 2024 17:10:38 +0900
"Masami Hiramatsu (Google)"  wrote:

> From: Masami Hiramatsu (Google) 
> 
> Fix to initialize 'val' local variable with zero.
> Dan reported that Smatch static code checker reports an error that a local
> 'val' variable needs to be initialized. Actually, the 'val' is expected to
> be initialized by FETCH_OP_ARG in the same loop, but it is not obvious. So
> initialize it with zero.

BTW, that loop should really have a comment stating that FETCH_OP_ARG is
expected to happen before FETCH_OP_ST_EDATA.

-- Steve



Re: [PATCH] tracing: probes: Fix to zero initialize a local variable

2024-03-20 Thread Steven Rostedt
On Wed, 20 Mar 2024 17:10:38 +0900
"Masami Hiramatsu (Google)"  wrote:

> From: Masami Hiramatsu (Google) 
> 
> Fix to initialize 'val' local variable with zero.
> Dan reported that Smatch static code checker reports an error that a local
> 'val' variable needs to be initialized. Actually, the 'val' is expected to
> be initialized by FETCH_OP_ARG in the same loop, but it is not obvious. So
> initialize it with zero.
> 
> Reported-by: Dan Carpenter 
> Closes: 
> https://lore.kernel.org/all/b010488e-68aa-407c-add0-3e059254aaa0@moroto.mountain/
> Fixes: 25f00e40ce79 ("tracing/probes: Support $argN in return probe (kprobe 
> and fprobe)")
> Signed-off-by: Masami Hiramatsu (Google) 
> ---

Reviewed-by: Steven Rostedt (Google) 

-- Steve

>  kernel/trace/trace_probe.c |2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/kernel/trace/trace_probe.c b/kernel/trace/trace_probe.c
> index 217169de0920..dfe3ee6035ec 100644
> --- a/kernel/trace/trace_probe.c
> +++ b/kernel/trace/trace_probe.c
> @@ -839,7 +839,7 @@ int traceprobe_get_entry_data_size(struct trace_probe *tp)
>  void store_trace_entry_data(void *edata, struct trace_probe *tp, struct 
> pt_regs *regs)
>  {
>   struct probe_entry_arg *earg = tp->entry_arg;
> - unsigned long val;
> + unsigned long val = 0;
>   int i;
>  
>   if (!earg)




Re: [bug report] tracing/probes: Support $argN in return probe (kprobe and fprobe)

2024-03-20 Thread Steven Rostedt
On Wed, 20 Mar 2024 12:44:23 +0900
Masami Hiramatsu (Google)  wrote:

> > > kernel/trace/trace_probe.c
> > > 846 return;
> > > 847 
> > > 848 for (i = 0; i < earg->size; i++) {
> > > 849 struct fetch_insn *code = &earg->code[i];
> > > 850 
> > > 851 switch (code->op) {
> > > 852 case FETCH_OP_ARG:
> > > 853 val = regs_get_kernel_argument(regs, 
> > > code->param);
> > > 854 break;
> > > 855 case FETCH_OP_ST_EDATA:  
> > > --> 856 *(unsigned long *)((unsigned long)edata + 
> > > code->offset) = val;
> > > 
> > > Probably the earg->code[i] always has FETCH_OP_ARG before
> > > FETCH_OP_ST_EDATA but Smatch isn't smart enough to figure that out...  
> > 
> > Looks that way:
> > 
> > case FETCH_OP_END:
> > earg->code[i].op = FETCH_OP_ARG;
> > earg->code[i].param = argnum;
> > earg->code[i + 1].op = FETCH_OP_ST_EDATA;
> > earg->code[i + 1].offset = offset;
> > return offset;
> > 
> > But probably should still initialize val to zero or have a WARN_ON() if
> > that doesn't happen.  
> 
> OK, let's set val = 0 in the store_trace_entry_data(), but WARN_ON() in this loop
> is a bit strange. I think we should have a verifier.

Initializing to zero is fine.

-- Steve



Re: [PATCH 0/8] tracing: Persistent traces across a reboot or crash

2024-03-19 Thread Steven Rostedt
On Sat, 9 Mar 2024 12:40:51 -0800
Kees Cook  wrote:

> The part I'd like to get wired up sanely is having pstore find the
> nvdimm area automatically, but it never quite happened:
> https://lore.kernel.org/lkml/CAGXu5jLtmb3qinZnX3rScUJLUFdf+pRDVPjy=cs4kutw9tl...@mail.gmail.com/

The automatic detection is what I'm looking for.

> 
> > Thanks for the info. We use pstore on ChromeOS, but it is currently
> > restricted to 1MB which is too small for the tracing buffers. From what
> > I understand, it's also in a specific location where there's only 1MB
> > available for contiguous memory.  
> 
> That's the area that is specifically hardware backed with persistent
> RAM.
> 
> > I'm looking at finding a way to get consistent memory outside that
> > range. That's what I'll be doing next week ;-)
> > 
> > But this code was just to see if I could get a single contiguous range
> > of memory mapped to ftrace, and this patch set does exactly that.  
> 
> Well, please take a look at pstore. It should be able to do everything
> you mention already; it just needs a way to define multiple regions if
> you want to use an area outside of the persistent ram area defined by
> Chrome OS's platform driver.

I'm not exactly sure how to use pstore here. At boot up I just need some
consistent memory reserved for the tracing buffer. It just needs to be the
same location at every boot up.

I don't need a front end. If you mean a way to access it from user space.
The front end is the tracefs directory, as I need all the features that the
tracefs directory gives.

I'm going to look to see how pstore is set up in ChromeOS and see if I can
use whatever it does to allocate another location.

-- Steve



Re: [PATCH 2/2] tracing: Ignore -Wstring-compare with diagnostic macros

2024-03-19 Thread Steven Rostedt
On Tue, 19 Mar 2024 17:30:41 -0700
Justin Stitt  wrote:


> > diff --git a/include/trace/stages/stage6_event_callback.h 
> > b/include/trace/stages/stage6_event_callback.h
> > index 83da83a0c14f..56a4eea5a48e 100644
> > --- a/include/trace/stages/stage6_event_callback.h
> > +++ b/include/trace/stages/stage6_event_callback.h
> > @@ -35,9 +35,14 @@
> > do {\
> > char *__str__ = __get_str(dst); \
> > int __len__ = __get_dynamic_array_len(dst) - 1; \
> > +   __diag_push();  \
> > +   __diag_ignore(clang, 11, "-Wstring-compare",\
> > + "__builtin_constant_p() ensures strcmp()" \
> > + "will be used for string literals");  \
> > WARN_ON_ONCE(__builtin_constant_p(src) ?\
> >  strcmp((src), __data_offsets.dst##_ptr_) : \
> >  (src) != __data_offsets.dst##_ptr_);   \  
> 
> What exactly is the point of the literal string comparison? Why
> doesn't strcmp do the trick?

This is done in the hotpath, and is only for debugging. The string passed
into __string() should be the same as the string passed into __assign_str().

But this is moot as I ended up always using strcmp() even if it will slow
down the recording of the event.

Next merge window the src parameter (along with the strcmp() checks) are
going away.

-- Steve


> 
> > +   __diag_pop();   \
> > memcpy(__str__, __data_offsets.dst##_ptr_ ? :   \
> >EVENT_NULL_STR, __len__);\
> > __str__[__len__] = '\0';\
> >
> > --
> > 2.44.0
> >  




Re: [GIT PULL] tracing/tools: Updates for 6.9

2024-03-19 Thread Steven Rostedt
On Mon, 18 Mar 2024 18:41:13 +0100
Daniel Bristot de Oliveira  wrote:

> Steven,
> 
> Tracing tooling updates for 6.9
> 
> Tracing:
> - Update makefiles for latency-collector and RTLA,
>   using tools/build/ makefiles like perf does, inheriting
>   its benefits. For example, having a proper way to
>   handle dependencies.
> 
> - The timerlat tracer has an interface for any tool to use.
>   rtla timerlat tool uses this interface dispatching its
>   own threads as workload. But, rtla timerlat could also be
>   used for any other process. So, add 'rtla timerlat -U'
>   option, allowing the timerlat tool to measure the latency of
>   any task using the timerlat tracer interface.
> 
> Verification:
> - Update makefiles for verification/rv, using tools/build/
>   makefiles like perf does, inheriting its benefits.
>   For example, having a proper way to handle dependencies.
> 
> 
> Please pull the latest trace-tools-v6.9 tree, which can be found at:
> 
> 
>   git://git.kernel.org/pub/scm/linux/kernel/git/bristot/linux.git
> trace-tools-v6.9

Looks like you just built on top of a random commit from Linus's tree:

commit f6cef5f8c37f58a3bc95b3754c3ae98e086631ca
Merge: 906a93befec8 8f06fb458539
Author: Linus Torvalds 
Date:   Sun Mar 17 16:59:33 2024 -0700

Merge tag 'i3c/for-6.9' of 
git://git.kernel.org/pub/scm/linux/kernel/git/i3c/linux

Linus prefers basing off of real tags or previous pulls from us.

Can you rebase your changes on v6.8 and resend?

  $ git checkout v6.8
  $ git cherry-pick f6cef5f8c37f58a3bc95b3754c3ae98e086631ca..trace-tools-v6.9

Appears to work fine.

Thanks!

-- Steve


> 
> Tag SHA1: 2eb09a97c56af3c27bd9dcebccb495f70d56d5c0
> Head SHA1: 9c63d9f58a42b979a42bcaed534d9246996ac0d9
> 
> 
> Daniel Bristot de Oliveira (4):
>   tools/tracing: Use tools/build makefiles on latency-collector
>   tools/rtla: Use tools/build makefiles to build rtla
>   tools/verification: Use tools/build makefiles on rv
>   tools/rtla: Add -U/--user-load option to timerlat
> 
> 
>  .../tools/rtla/common_timerlat_options.rst |   6 +
>  tools/tracing/latency/.gitignore   |   5 +-
>  tools/tracing/latency/Build|   1 +
>  tools/tracing/latency/Makefile | 105 --
>  tools/tracing/latency/Makefile.config  |  30 +++
>  tools/tracing/rtla/.gitignore  |   7 +-
>  tools/tracing/rtla/Build   |   1 +
>  tools/tracing/rtla/Makefile| 217 
> +++--
>  tools/tracing/rtla/Makefile.config |  47 +
>  tools/tracing/rtla/Makefile.rtla   |  80 
>  tools/tracing/rtla/Makefile.standalone |  26 +++
>  tools/tracing/rtla/sample/timerlat_load.py |  74 +++
>  tools/tracing/rtla/src/Build   |  11 ++
>  tools/tracing/rtla/src/timerlat_hist.c |  16 +-
>  tools/tracing/rtla/src/timerlat_top.c  |  14 +-
>  tools/verification/rv/.gitignore   |   6 +
>  tools/verification/rv/Build|   1 +
>  tools/verification/rv/Makefile | 207 +++-
>  tools/verification/rv/Makefile.config  |  47 +
>  tools/verification/rv/Makefile.rv  |  51 +
>  tools/verification/rv/src/Build|   4 +
>  21 files changed, 650 insertions(+), 306 deletions(-)
>  create mode 100644 tools/tracing/latency/Build
>  create mode 100644 tools/tracing/latency/Makefile.config
>  create mode 100644 tools/tracing/rtla/Build
>  create mode 100644 tools/tracing/rtla/Makefile.config
>  create mode 100644 tools/tracing/rtla/Makefile.rtla
>  create mode 100644 tools/tracing/rtla/Makefile.standalone
>  create mode 100644 tools/tracing/rtla/sample/timerlat_load.py
>  create mode 100644 tools/tracing/rtla/src/Build
>  create mode 100644 tools/verification/rv/.gitignore
>  create mode 100644 tools/verification/rv/Build
>  create mode 100644 tools/verification/rv/Makefile.config
>  create mode 100644 tools/verification/rv/Makefile.rv
>  create mode 100644 tools/verification/rv/src/Build
> ---




Re: [PATCH 0/2] tracing: Fully silence instance of -Wstring-compare

2024-03-19 Thread Steven Rostedt
On Tue, 19 Mar 2024 09:07:51 -0700
Nathan Chancellor  wrote:

> Hi all,
> 
> This series fully resolves the new instance of -Wstring-compare from
> within the __assign_str() macro. The first patch resolves a build
> failure with GCC that would be seen with just the second patch applied.
> The second patch actually hides the warning.
> 
> NOTE: This is based on trace/for-next, which does not contain the
> minimum LLVM version bump that occurred later in the current merge
> window, so this uses
> 
>   __diag_ignore(clang, 11, ...
> 
> instead of
> 
>   __diag_ignore(clang, 13, ...
> 
> which will be required when this is merged into Linus's tree. If you can
> base this series on a tree that has the merge commit e5eb28f6d1af ("Merge
> tag 'mm-nonmm-stable-2024-03-14-09-36' of
> git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm") in it, then that
> change can be done at application time, rather than merge time (or I can
> send a v2). It would be really nice for this to make the merge window so
> that this warning does not proliferate into other trees that base on
> -rc1.
> 

I'm guessing this isn't needed with the last update.

-- Steve



[PATCH] tracing: Just use strcmp() for testing __string() and __assign_str() match

2024-03-19 Thread Steven Rostedt
From: "Steven Rostedt (Google)" 

As __assign_str() no longer uses its "src" parameter, there's a check to
make sure nothing depends on it being different than what was passed to
__string(). It originally just compared the pointer passed to __string()
with the pointer passed into __assign_str() via the "src" parameter. But
there are a couple of outliers that just pass in a quoted string constant,
where comparing the pointers is UB to the compiler, as the compiler is
free to create multiple copies of the same string constant.

Instead, just use strcmp(). It may slow down the trace event, but this
will eventually be removed.

Also, fix the issue of passing NULL to strcmp() by adding a WARN_ON() to
make sure that both "src" and the pointer saved in __string() are either
both NULL or have content, and then checking if "src" is not NULL before
performing the strcmp().

Link: 
https://lore.kernel.org/all/CAHk-=wjxX16kWd=uxG5wzqt=axoydf1bgwokk+qvmao0zh7...@mail.gmail.com/

Fixes: b1afefa62ca9 ("tracing: Use strcmp() in __assign_str() WARN_ON() check")
Reported-by: Linus Torvalds 
Signed-off-by: Steven Rostedt (Google) 
---
 include/trace/stages/stage6_event_callback.h | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/include/trace/stages/stage6_event_callback.h 
b/include/trace/stages/stage6_event_callback.h
index 83da83a0c14f..3690e677263f 100644
--- a/include/trace/stages/stage6_event_callback.h
+++ b/include/trace/stages/stage6_event_callback.h
@@ -35,9 +35,8 @@
do {\
char *__str__ = __get_str(dst); \
int __len__ = __get_dynamic_array_len(dst) - 1; \
-   WARN_ON_ONCE(__builtin_constant_p(src) ?\
-strcmp((src), __data_offsets.dst##_ptr_) : \
-(src) != __data_offsets.dst##_ptr_);   \
+   WARN_ON_ONCE(!(void *)(src) != !(void *)__data_offsets.dst##_ptr_); \
+   WARN_ON_ONCE((src) && strcmp((src), __data_offsets.dst##_ptr_)); \
memcpy(__str__, __data_offsets.dst##_ptr_ ? :   \
   EVENT_NULL_STR, __len__);\
__str__[__len__] = '\0';\
-- 
2.43.0
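
For background, a small userspace example of why the old pointer comparison
was unreliable for literals: two identical string constants may or may not
share an address, so only strcmp() gives a dependable answer.

#include <stdio.h>
#include <string.h>

int main(void)
{
	const char *a = "sched";
	const char *b = "sched";

	/* May print 0 or 1 depending on whether the compiler merges literals */
	printf("same pointer: %d\n", a == b);
	/* Always prints 1 */
	printf("same content: %d\n", strcmp(a, b) == 0);
	return 0;
}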




Re: [PATCH v2] net/ipv4: add tracepoint for icmp_send

2024-03-19 Thread Steven Rostedt
On Tue, 19 Mar 2024 20:13:52 +0800 (CST)
 wrote:

> From: Peilin He
> 
> Introduce a tracepoint for icmp_send, which can help users to get more
> detail information conveniently when icmp abnormal events happen.
> 
> 1. Giving an usecase example:
> =
> When an application experiences packet loss due to an unreachable UDP
> destination port, the kernel will send an exception message through the
> icmp_send function. By adding a trace point for icmp_send, developers or
> system administrators can obtain detailed information about the UDP
> packet loss, including the type, code, source address, destination address,
> source port, and destination port. This facilitates the trouble-shooting
> of UDP packet loss issues especially for those network-service
> applications.
> 
> 2. Operation Instructions:
> ==
> Switch to the tracing directory.
> cd /sys/kernel/debug/tracing

FYI, that directory is obsolete. Please always reference /sys/kernel/tracing.

> Filter for destination port unreachable.
> echo "type==3 && code==3" > events/icmp/icmp_send/filter
> Enable trace event.
> echo 1 > events/icmp/icmp_send/enable
> 
> 3. Result View:
> 
>  udp_client_erro-11370   [002] ...s.12   124.728002:
>  icmp_send: icmp_send: type=3, code=3.
>  From 127.0.0.1:41895 to 127.0.0.1: ulen=23
>  skbaddr=589b167a
> 
> v1->v2:
> Some fixes according to
> https://lore.kernel.org/all/CANn89iL-y9e_VFpdw=sztrnkru_tnuwqhufqtjvjsv-nz1x...@mail.gmail.com/
>   1. adjust the trace_icmp_send() to more protocols than UDP.
>   2. move the calling of trace_icmp_send after sanity checks
>  in __icmp_send().
> 
> Signed-off-by: Peilin He
> Reviewed-by: xu xin 
> Reviewed-by: Yunkai Zhang 
> Cc: Yang Yang 
> Cc: Liu Chun 
> Cc: Xuexin Jiang 
> ---
>  include/trace/events/icmp.h | 64 
> +
>  net/ipv4/icmp.c |  4 +++
>  2 files changed, 68 insertions(+)
>  create mode 100644 include/trace/events/icmp.h
> 
> diff --git a/include/trace/events/icmp.h b/include/trace/events/icmp.h
> new file mode 100644
> index ..c3dc337be7bc
> --- /dev/null
> +++ b/include/trace/events/icmp.h
> @@ -0,0 +1,64 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#undef TRACE_SYSTEM
> +#define TRACE_SYSTEM icmp
> +
> +#if !defined(_TRACE_ICMP_H) || defined(TRACE_HEADER_MULTI_READ)
> +#define _TRACE_ICMP_H
> +
> +#include 
> +#include 
> +
> +TRACE_EVENT(icmp_send,
> +
> + TP_PROTO(const struct sk_buff *skb, int type, int code),
> +
> + TP_ARGS(skb, type, code),
> +
> + TP_STRUCT__entry(
> + __field(__u16, sport)
> + __field(__u16, dport)
> + __field(int, type)
> + __field(int, code)
> + __array(__u8, saddr, 4)
> + __array(__u8, daddr, 4)
> + __field(const void *, skbaddr)
> + __field(unsigned short, ulen)

Note, to prevent holes, I usually suggest pointers and longs go first,
followed by ints, and then end with char.

__field(const void *, skbaddr)
__field(int, type)
__field(int, code)
__array(__u8, saddr, 4)
__array(__u8, daddr, 4)
__field(__u16, sport)
__field(__u16, dport)
__field(unsigned short, ulen)
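
A small userspace illustration of that ordering advice (sizes assume a
typical 64-bit ABI; the field names are just stand-ins for the tracepoint
fields):

#include <stdio.h>
#include <stdint.h>

struct holey {			/* u16 first forces 6 bytes of padding */
	uint16_t sport;
	const void *skbaddr;
	uint16_t dport;
};

struct packed_by_order {	/* pointer first, small fields grouped */
	const void *skbaddr;
	uint16_t sport;
	uint16_t dport;
};

int main(void)
{
	/* Typically 24 vs 16 bytes on x86-64 */
	printf("holey:           %zu bytes\n", sizeof(struct holey));
	printf("packed_by_order: %zu bytes\n", sizeof(struct packed_by_order));
	return 0;
}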

-- Steve


> + ),
> +
> + TP_fast_assign(
> + struct iphdr *iph = ip_hdr(skb);
> + int proto_4 = iph->protocol;
> + __be32 *p32;
> +
> + __entry->skbaddr = skb;
> + __entry->type = type;
> + __entry->code = code;
> +
> + if (proto_4 == IPPROTO_UDP) {
> + struct udphdr *uh = udp_hdr(skb);
> + __entry->sport = ntohs(uh->source);
> + __entry->dport = ntohs(uh->dest);
> + __entry->ulen = ntohs(uh->len);
> + } else {
> + __entry->sport = 0;
> + __entry->dport = 0;
> + __entry->ulen = 0;
> + }
> +
> + p32 = (__be32 *) __entry->saddr;
> + *p32 = iph->saddr;
> +
> + p32 = (__be32 *) __entry->daddr;
> + *p32 = iph->daddr;
> + ),
> +
> + TP_printk("icmp_send: type=%d, code=%d. From %pI4:%u to %pI4:%u 
> ulen=%d skbaddr=%p",
> + __entry->type, __entry->code,
> + __entry->saddr, __entry->sport, __entry->daddr,
> + __entry->dport, __entry->ulen, __entry->skbaddr)
> +);

Re: [bug report] tracing/probes: Support $argN in return probe (kprobe and fprobe)

2024-03-19 Thread Steven Rostedt
On Tue, 19 Mar 2024 10:19:09 +0300
Dan Carpenter  wrote:

> Hello Masami Hiramatsu (Google),
> 
> Commit 25f00e40ce79 ("tracing/probes: Support $argN in return probe
> (kprobe and fprobe)") from Mar 4, 2024 (linux-next), leads to the
> following Smatch static checker warning:
> 
>   kernel/trace/trace_probe.c:856 store_trace_entry_data()
>   error: uninitialized symbol 'val'.
> 
> kernel/trace/trace_probe.c
> 846 return;
> 847 
> 848 for (i = 0; i < earg->size; i++) {
> 849 struct fetch_insn *code = >code[i];
> 850 
> 851 switch (code->op) {
> 852 case FETCH_OP_ARG:
> 853 val = regs_get_kernel_argument(regs, 
> code->param);
> 854 break;
> 855 case FETCH_OP_ST_EDATA:
> --> 856 *(unsigned long *)((unsigned long)edata + 
> code->offset) = val;  
> 
> Probably the earg->code[i] always has FETCH_OP_ARG before
> FETCH_OP_ST_EDATA but Smatch isn't smart enough to figure that out...

Looks that way:

case FETCH_OP_END:
earg->code[i].op = FETCH_OP_ARG;
earg->code[i].param = argnum;
earg->code[i + 1].op = FETCH_OP_ST_EDATA;
earg->code[i + 1].offset = offset;
return offset;

But probably should still initialize val to zero or have a WARN_ON() if
that doesn't happen.

-- Steve


> 
> 857 break;
> 858 case FETCH_OP_END:
> 859 goto end;
> 860 default:
> 861 break;
> 862 }
> 863 }
> 864 end:
> 865 return;
> 866 }
> 
> regards,
> dan carpenter




Re: TP_printk() bug with %c, and more?

2024-03-18 Thread Steven Rostedt
On Mon, 18 Mar 2024 16:43:07 +0100
Luca Ceresoli  wrote:

> Indeed I was on an older version, apologies.
> 
> I upgraded both libtraceevent and trace-cmd to master and applied your
> patch, now the %c is formatted correctly.
> 
> However the arrows are still reversed.
> 
> Is this what you were expecting?

No, I didn't look at the arrows, just the %c issue. I'll try to get some
time to do that.

-- Steve



Re: TP_printk() bug with %c, and more?

2024-03-15 Thread Steven Rostedt
On Fri, 15 Mar 2024 19:03:12 +0100
Luca Ceresoli  wrote:

> > > 
> > > I've come across an unexpected behaviour in the kernel tracing
> > > infrastructure that looks like a bug, or maybe two.
> > > 
> > > Cc-ing ASoC maintainers for as it appeared using ASoC traces, but it
> > > does not look ASoC-specific.
> > > 
> > > It all started when using this trace-cmd sequence on an ARM64 board
> > > running a mainline 6.8.0-rc7 kernel:
> > > 
> > >   trace-cmd record -e snd_soc_dapm_path ./my-play
> > >   trace-cmd report
> > > 
> > > While this produces perfectly valid traces for other asoc events,
> > > the snd_soc_dapm_path produces:
> > > 
> > >   snd_soc_dapm_path:>c<* MIC1_EN <- (direct) <-
> > > 
> > > instead of the expected:
> > > 
> > >   snd_soc_dapm_path:*MIC1 <- (direct) <- MIC1_EN
> > > 
> > > The originating macro is:
> > > 
> > >   TP_printk("%c%s %s %s %s %s",
> > >   (int) __entry->path_node &&
> > >   (int) __entry->path_connect ? '*' : ' ',
> > >   __get_str(wname), DAPM_ARROW(__entry->path_dir),
> > >   __get_str(pname), DAPM_ARROW(__entry->path_dir),
> > >   __get_str(pnname))
> > > 
> > > It appears as if the %c placeholder always produces the three ">c<"
> > > characters, the '*' or ' ' char is printed as the first %s, all the
> > > other strings are shifted right by one position and the last string is
> > > never printed.
> > > 
> > > On my x86_64 laptop running the default Ubuntu kernel (6.5) I'm able to
> > > trace a few events having a '%c' in their TP_printk() macros and the
> > > result is:
> > > 
> > >   intel_pipe_update_start: dev :00:02.0, pipe >c<, frame=1,
> > >   scanline=107856, min=2208, max=2154
> > > 
> > 
> > What does /sys/kernel/tracing/trace show?  
> 
> It is correct:
> 
>intel_pipe_update_start: dev :00:02.0, pipe B, frame=377644, 
> scanline=1466, min=2154, max=2159
> 
> > If that's fine, then the bug is in libtraceevent and not the kernel.
> > 
> > I'm testing it out now, and I see %c not being processed properly by
> > libtraceevent. I'll take a deeper look.  
> 
> Thanks.
> 
> > > originating from:
> > > 
> > >   TP_printk("dev %s, pipe %c, frame=%u, scanline=%u, min=%u, max=%u",
> > > 
> > > Here it looks like the %c produced ">c<" again, but apparently without
> > > any shifting.
> > > 
> > > Back on the ARM64 board I found a couple interesting clues.
> > > 
> > > First, using the /tracing/ interface instead of trace-cmd, I'm
> > > getting correctly formatted strings:
> > > 
> > > trace-cmd: snd_soc_dapm_path: >c<* HPOUT_L -> (direct) ->
> > > debugfs:   snd_soc_dapm_path: *HPOUT_L <- (direct) <- HPOUT_POP_SOUND_L
> > > 
> > > Notice the arrows pointing to the opposite direction though. The correct
> > > arrow is the one in the debugfs run.  
> 
> This other issue appears a separate bug however.

Can you make user you have the latest libtraceevent from:

   git://git.kernel.org/pub/scm/libs/libtrace/libtraceevent.git

And apply this patch.

Thanks,

-- Steve

diff --git a/src/event-parse.c b/src/event-parse.c
index d607556..61b0966 100644
--- a/src/event-parse.c
+++ b/src/event-parse.c
@@ -3732,8 +3732,19 @@ process_arg_token(struct tep_event *event, struct 
tep_print_arg *arg,
arg->atom.atom = atom;
break;
 
-   case TEP_EVENT_DQUOTE:
case TEP_EVENT_SQUOTE:
+   arg->type = TEP_PRINT_ATOM;
+   /* Make characters into numbers */
+   if (asprintf(&arg->atom.atom, "%d", token[0]) < 0) {
+   free_token(token);
+   *tok = NULL;
+   arg->atom.atom = NULL;
+   return TEP_EVENT_ERROR;
+   }
+   free_token(token);
+   type = read_token_item(event->tep, &token);
+   break;
+   case TEP_EVENT_DQUOTE:
arg->type = TEP_PRINT_ATOM;
arg->atom.atom = token;
type = read_token_item(event->tep, &token);
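
Roughly what the new TEP_EVENT_SQUOTE branch does with a character token,
shown as a stand-alone sketch (glibc asprintf(), not libtraceevent itself):

#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
	const char *token = "B";	/* token text of the literal 'B' */
	char *atom;

	/* Store the character as its numeric value, as the patch does */
	if (asprintf(&atom, "%d", token[0]) < 0)
		return 1;

	/* A later "%c" conversion turns it back into a character */
	printf("atom=%s -> %%c prints '%c'\n", atom, atoi(atom));
	free(atom);
	return 0;
}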



Re: TP_printk() bug with %c, and more?

2024-03-15 Thread Steven Rostedt
On Fri, 15 Mar 2024 17:49:00 +0100
Luca Ceresoli  wrote:

> Hello Linux tracing maintainers,

Hi Luca!

> 
> I've come across an unexpected behaviour in the kernel tracing
> infrastructure that looks like a bug, or maybe two.
> 
> Cc-ing ASoC maintainers for as it appeared using ASoC traces, but it
> does not look ASoC-specific.
> 
> It all started when using this trace-cmd sequence on an ARM64 board
> running a mainline 6.8.0-rc7 kernel:
> 
>   trace-cmd record -e snd_soc_dapm_path ./my-play
>   trace-cmd report
> 
> While this produces perfectly valid traces for other asoc events,
> the snd_soc_dapm_path produces:
> 
>   snd_soc_dapm_path:>c<* MIC1_EN <- (direct) <-
> 
> instead of the expected:
> 
>   snd_soc_dapm_path:*MIC1 <- (direct) <- MIC1_EN
> 
> The originating macro is:
> 
>   TP_printk("%c%s %s %s %s %s",
>   (int) __entry->path_node &&
>   (int) __entry->path_connect ? '*' : ' ',
>   __get_str(wname), DAPM_ARROW(__entry->path_dir),
>   __get_str(pname), DAPM_ARROW(__entry->path_dir),
>   __get_str(pnname))
> 
> It appears as if the %c placeholder always produces the three ">c<"
> characters, the '*' or ' ' char is printed as the first %s, all the
> other strings are shifted right by one position and the last string is
> never printed.
> 
> On my x86_64 laptop running the default Ubuntu kernel (6.5) I'm able to
> trace a few events having a '%c' in their TP_printk() macros and the
> result is:
> 
>   intel_pipe_update_start: dev :00:02.0, pipe >c<, frame=1,
>   scanline=107856, min=2208, max=2154
> 

What does /sys/kernel/tracing/trace show?

If that's fine, then the bug is in libtraceevent and not the kernel.

I'm testing it out now, and I see %c not being processed properly by
libtraceevent. I'll take a deeper look.

Thanks for the report.

-- Steve


> originating from:
> 
>   TP_printk("dev %s, pipe %c, frame=%u, scanline=%u, min=%u, max=%u",
> 
> Here it looks like the %c produced ">c<" again, but apparently without
> any shifting.
> 
> Back on the ARM64 board I found a couple interesting clues.
> 
> First, using the /tracing/ interface instead of trace-cmd, I'm
> getting correctly formatted strings:
> 
> trace-cmd: snd_soc_dapm_path: >c<* HPOUT_L -> (direct) ->
> debugfs:   snd_soc_dapm_path: *HPOUT_L <- (direct) <- HPOUT_POP_SOUND_L
> 
> Notice the arrows pointing to the opposite direction though. The correct
> arrow is the one in the debugfs run.
> 
> Second, I tried a simple test:
> 
>   TP_printk("(%c,%c,%c,%c) [%s,%s,%s,%s]",
>   
>



[PATCH] ring-buffer: Make wake once of ring_buffer_wait() more robust

2024-03-15 Thread Steven Rostedt
From: "Steven Rostedt (Google)" 

The default behavior of ring_buffer_wait() when passed a NULL "cond"
parameter is to exit the function the first time it is woken up. The
current implementation uses a counter that starts at zero and when it is
greater than one it exits the wait_event_interruptible().

But this relies on the internal working of wait_event_interruptible() as
that code basically has:

  if (cond)
return;
  prepare_to_wait();
  if (!cond)
schedule();
  finish_wait();

That is, cond is called twice before it sleeps. The default cond of
ring_buffer_wait() needs to account for that and wait for its counter to
increment twice before exiting.

Instead, use the seq/atomic_inc logic that is used by the tracing code
that calls this function. Add an atomic_t seq to rb_irq_work and when cond
is NULL, have the default callback take a descriptor as its data that
holds the rbwork and the value of the seq when it started.

The wakeups will now increment the rbwork->seq and the cond callback will
simply check if that number is different, and no longer have to rely on
the implementation of wait_event_interruptible().

Signed-off-by: Steven Rostedt (Google) 
---
 kernel/trace/ring_buffer.c | 34 +-
 1 file changed, 21 insertions(+), 13 deletions(-)

diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index f0f3577339b3..c8ff71d16a6e 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -388,6 +388,7 @@ struct rb_irq_work {
struct irq_work work;
wait_queue_head_t   waiters;
wait_queue_head_t   full_waiters;
+   atomic_tseq;
boolwaiters_pending;
boolfull_waiters_pending;
boolwakeup_full;
@@ -763,6 +764,9 @@ static void rb_wake_up_waiters(struct irq_work *work)
 {
struct rb_irq_work *rbwork = container_of(work, struct rb_irq_work, 
work);
 
+   /* For waiters waiting for the first wake up */
+   (void)atomic_fetch_inc_release(&rbwork->seq);
+
wake_up_all(&rbwork->waiters);
if (rbwork->full_waiters_pending || rbwork->wakeup_full) {
/* Only cpu_buffer sets the above flags */
@@ -891,20 +895,21 @@ rb_wait_cond(struct rb_irq_work *rbwork, struct 
trace_buffer *buffer,
return false;
 }
 
+struct rb_wait_data {
+   struct rb_irq_work  *irq_work;
+   int seq;
+};
+
 /*
  * The default wait condition for ring_buffer_wait() is to just to exit the
  * wait loop the first time it is woken up.
  */
 static bool rb_wait_once(void *data)
 {
-   long *once = data;
+   struct rb_wait_data *rdata = data;
+   struct rb_irq_work *rbwork = rdata->irq_work;
 
-   /* wait_event() actually calls this twice before scheduling*/
-   if (*once > 1)
-   return true;
-
-   (*once)++;
-   return false;
+   return atomic_read_acquire(&rbwork->seq) != rdata->seq;
 }
 
 /**
@@ -925,14 +930,9 @@ int ring_buffer_wait(struct trace_buffer *buffer, int cpu, 
int full,
struct ring_buffer_per_cpu *cpu_buffer;
struct wait_queue_head *waitq;
struct rb_irq_work *rbwork;
-   long once = 0;
+   struct rb_wait_data rdata;
int ret = 0;
 
-   if (!cond) {
-   cond = rb_wait_once;
-   data = &once;
-   }
-
/*
 * Depending on what the caller is waiting for, either any
 * data in any cpu buffer, or a specific buffer, put the
@@ -954,6 +954,14 @@ int ring_buffer_wait(struct trace_buffer *buffer, int cpu, 
int full,
else
waitq = >waiters;
 
+   /* Set up to exit loop as soon as it is woken */
+   if (!cond) {
+   cond = rb_wait_once;
+   rdata.irq_work = rbwork;
+   rdata.seq = atomic_read_acquire(&rbwork->seq);
+   data = &rdata;
+   }
+
ret = wait_event_interruptible((*waitq),
rb_wait_cond(rbwork, buffer, cpu, full, cond, 
data));
 
-- 
2.43.0
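
The same "snapshot a sequence number, wake when it changes" idea in a
minimal pthread sketch (illustrative only, not the ring-buffer code):

#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
static int seq;

static void *waker(void *arg)
{
	pthread_mutex_lock(&lock);
	seq++;				/* rb_wake_up_waiters() analogue */
	pthread_cond_broadcast(&cond);
	pthread_mutex_unlock(&lock);
	return NULL;
}

int main(void)
{
	pthread_t t;
	int start;

	pthread_mutex_lock(&lock);
	start = seq;			/* rdata.seq snapshot */
	pthread_create(&t, NULL, waker, NULL);
	while (seq == start)		/* rb_wait_once() analogue */
		pthread_cond_wait(&cond, &lock);
	pthread_mutex_unlock(&lock);
	pthread_join(t, NULL);
	printf("woken after the first wakeup\n");
	return 0;
}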




[PATCH] tracing: Add __string_src() helper to help compilers not to get confused

2024-03-14 Thread Steven Rostedt
From: "Steven Rostedt (Google)" 

The __string() helper macro of the TRACE_EVENT() macro is used to
determine how much of the ring buffer needs to be allocated to fit the
given source string. Some trace events have a string that is dependent on
another variable that could be NULL, and in those cases the string is
passed in to be NULL.

The __string() macro can handle being passed in a NULL pointer for which
it will turn it into "(null)". It does that with:

  strlen((src) ? (const char *)(src) : "(null)") + 1

But if src itself has the same conditional type it can confuse the
compiler. That is:

  __string(r ? dev(r)->name : NULL)

Would turn into:

 strlen((r ? dev(r)->name : NULL) ? (r ? dev(r)->name : NULL) : "(null)") + 1

For which the compiler thinks that NULL is being passed to strlen() and
gives this kind of warning:

./include/trace/stages/stage5_get_offsets.h:50:21: warning: argument 1 null 
where non-null expected [-Wnonnull]
   50 | strlen((src) ? (const char *)(src) : "(null)") + 1)

Instead, create a static inline function that takes the src string and
will return the string if it is not NULL and will return "(null)" if it
is. This will then make the strlen() line:

 strlen(__string_src(src)) + 1

Where the compiler can see that strlen() will not end up with NULL and
does not warn about it.

Note that this depends on commit 51270d573a8d ("tracing/net_sched: Fix
tracepoints that save qdisc_dev() as a string") being applied, as passing
the qdisc_dev() into __string_src() will give an error.

Link: https://lore.kernel.org/all/ZfNmfCmgCs4Nc+EH@aschofie-mobl2/

Reported-by: Alison Schofield 
Signed-off-by: Steven Rostedt (Google) 
---
 include/trace/stages/stage5_get_offsets.h | 14 --
 1 file changed, 12 insertions(+), 2 deletions(-)

diff --git a/include/trace/stages/stage5_get_offsets.h 
b/include/trace/stages/stage5_get_offsets.h
index e6b96608f452..c6a62dfb18ef 100644
--- a/include/trace/stages/stage5_get_offsets.h
+++ b/include/trace/stages/stage5_get_offsets.h
@@ -9,6 +9,16 @@
 #undef __entry
 #define __entry entry
 
+#ifndef __STAGE5_STRING_SRC_H
+#define __STAGE5_STRING_SRC_H
+static inline const char *__string_src(const char *str)
+{
+   if (!str)
+  return EVENT_NULL_STR;
+   return str;
+}
+#endif /* __STAGE5_STRING_SRC_H */
+
 /*
  * Fields should never declare an array: i.e. __field(int, arr[5])
  * If they do, it will cause issues in parsing and possibly corrupt the
@@ -47,7 +57,7 @@
 
 #undef __string
 #define __string(item, src) __dynamic_array(char, item,
\
-   strlen((const char *)(src) ? : EVENT_NULL_STR) + 1) \
+   strlen(__string_src(src)) + 1)  \
__data_offsets->item##_ptr_ = src;
 
 #undef __string_len
@@ -70,7 +80,7 @@
 
 #undef __rel_string
 #define __rel_string(item, src) __rel_dynamic_array(char, item,
\
-   strlen((const char *)(src) ? : EVENT_NULL_STR) + 1) \
+   strlen(__string_src(src)) + 1)  \
__data_offsets->item##_ptr_ = src;
 
 #undef __rel_string_len
-- 
2.43.0




Re: [FYI][PATCH] tracing/treewide: Remove second parameter of __assign_str()

2024-03-14 Thread Steven Rostedt
On Thu, 14 Mar 2024 09:57:57 -0700
Alison Schofield  wrote:

> On Fri, Feb 23, 2024 at 12:56:34PM -0500, Steven Rostedt wrote:
> > From: "Steven Rostedt (Google)" 
> > 
> > [
> >This is a treewide change. I will likely re-create this patch again in
> >the second week of the merge window of v6.9 and submit it then. Hoping
> >to keep the conflicts that it will cause to a minimum.
> > ]

Note, change of plans. I plan on sending this in the next merge window, as
this merge window I have this patch:

  
https://lore.kernel.org/linux-trace-kernel/20240312113002.00031...@gandalf.local.home/

That will warn if the source string of __string() is different than the
source string of __assign_str(). I want to make sure they are identical
before just dropping one of them.


> 
> > diff --git a/drivers/cxl/core/trace.h b/drivers/cxl/core/trace.h
> > index bdf117a33744..07ba4e033347 100644
> > --- a/drivers/cxl/core/trace.h
> > +++ b/drivers/cxl/core/trace.h  
> 
> snip to poison
> 
> > @@ -668,8 +668,8 @@ TRACE_EVENT(cxl_poison,
> > ),
> >  
> > TP_fast_assign(
> > -   __assign_str(memdev, dev_name(&cxlmd->dev));
> > -   __assign_str(host, dev_name(cxlmd->dev.parent));
> > +   __assign_str(memdev);
> > +   __assign_str(host);  
> 
> I think I get that the above changes work because the TP_STRUCT__entry for
> these did:
>   __string(memdev, dev_name(&cxlmd->dev))
>   __string(host, dev_name(cxlmd->dev.parent))

That's the point. They have to be identical or you will likely hit a bug.

The __string(name, src) is used to find the string length of src which
allocates the necessary length on the ring buffer. The __assign_str(name, src)
will copy src into the ring buffer.

Similar to:

len = strlen(src);
buf = malloc(len);
strcpy(buf, src);

Where __string() is strlen() and __assign_str() is strcpy(). It doesn't
make sense to use two different strings, and if you did, it would likely be
a bug.

But the magic behind __string() does much more than just get the length of
the string, and it could easily save the pointer to the string (along with
its length) and have it copy that in the __assign_str() call, making the
src parameter of __assign_str() useless.
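
A minimal userspace sketch of that idea (made-up names): remember the source
pointer when measuring the string, so the copy step no longer needs a src
argument.

#include <stdio.h>
#include <string.h>

struct str_slot {
	const char *src;	/* saved by the "measure" step */
	size_t len;
};

static size_t measure(struct str_slot *slot, const char *src)
{
	slot->src = src ? src : "(null)";
	slot->len = strlen(slot->src) + 1;
	return slot->len;
}

static void assign(struct str_slot *slot, char *dst)
{
	memcpy(dst, slot->src, slot->len);	/* no src parameter needed */
}

int main(void)
{
	struct str_slot slot;
	char buf[64];

	measure(&slot, "memdev0");
	assign(&slot, buf);
	printf("%s\n", buf);
	return 0;
}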

> 
> > __entry->serial = cxlmd->cxlds->serial;
> > __entry->overflow_ts = cxl_poison_overflow(flags, overflow_ts);
> > __entry->dpa = cxl_poison_record_dpa(record);
> > @@ -678,12 +678,12 @@ TRACE_EVENT(cxl_poison,
> > __entry->trace_type = trace_type;
> > __entry->flags = flags;
> > if (region) {
> > -   __assign_str(region, dev_name(&region->dev));
> > +   __assign_str(region);
> > memcpy(__entry->uuid, >params.uuid, 16);
> > __entry->hpa = cxl_trace_hpa(region, cxlmd,
> >  __entry->dpa);
> > } else {
> > -   __assign_str(region, "");
> > +   __assign_str(region);
> > memset(__entry->uuid, 0, 16);
> > __entry->hpa = ULLONG_MAX;  
> 
> For the above 2, there was no helper in TP_STRUCT__entry. A recently
> posted patch is fixing that up to be __string(region, NULL) See [1],
> with the actual assignment still happening in TP_fast_assign.

__string(region, NULL) doesn't make sense. It's like:

len = strlen(NULL);
buf = malloc(len);
strcpy(buf, NULL);

??

I'll reply to that email.

-- Steve

> 
> Does that assign logic need to move to the TP_STRUCT__entry definition
> when you merge these changes? I'm not clear how much logic is able to be
> included, ie like 'C' style code in the TP_STRUCT__entry.
> 
> [1]
> https://lore.kernel.org/linux-cxl/20240314044301.2108650-1-alison.schofi...@intel.com/



Re: [PATCH] net: hns3: tracing: fix hclgevf trace event strings

2024-03-14 Thread Steven Rostedt
On Thu, 14 Mar 2024 15:39:28 +0100
Paolo Abeni  wrote:

> On Wed, 2024-03-13 at 09:34 -0400, Steven Rostedt wrote:
> > From: "Steven Rostedt (Google)" 
> > 
> > [
> >Note, I need to take this patch through my tree, so I'm looking for 
> > acks.  
> 
> Note that this device driver is changing quite rapidly, so I expect
> some conflicts here later. I guess Liuns will have to handle them ;)

Well, it merges fine with linux-next and linus's current master. ;-)

> 
> >This causes the build to fail when I add the __assign_str() check, which
> >I was about to push to Linus, but it breaks allmodconfig due to this 
> > error.
> > ]
> > 
> > The __string() and __assign_str() helper macros of the TRACE_EVENT() macro
> > are going through some optimizations where only the source string of
> > __string() will be used and the __assign_str() source will be ignored and
> > later removed.
> > 
> > To make sure that there's no issues, a new check is added between the
> > __string() src argument and the __assign_str() src argument that does a
> > strcmp() to make sure they are the same string.
> > 
> > The hclgevf trace events have:
> > 
> >   __assign_str(devname, &hdev->nic.kinfo.netdev->name);
> > 
> > Which triggers the warning:
> > 
> > hclgevf_trace.h:34:39: error: passing argument 1 of ‘strcmp’ from 
> > incompatible pointer type [-Werror=incompatible-pointer-types]
> >34 | __assign_str(devname, 
> > >nic.kinfo.netdev->name);
> >  [..]
> > arch/x86/include/asm/string_64.h:75:24: note: expected ‘const char *’ but 
> > argument is of type ‘char (*)[16]’
> >75 | int strcmp(const char *cs, const char *ct);
> >   |^~
> > 
> > 
> > Because __assign_str() now has:
> > 
> > WARN_ON_ONCE(__builtin_constant_p(src) ?\
> >  strcmp((src), __data_offsets.dst##_ptr_) : \
> >  (src) != __data_offsets.dst##_ptr_);   \
> > 
> > The problem is the '&' on hdev->nic.kinfo.netdev->name. That's because
> > that name is:
> > 
> > charname[IFNAMSIZ]
> > 
> > Where passing an address '&' of a char array is not compatible with 
> > strcmp().
> > 
> > The '&' is not necessary, remove it.
> > 
> > Fixes: d8355240cf8fb ("net: hns3: add trace event support for PF/VF 
> > mailbox")  
> 
> checkpactch in strict mode complains the hash is not 12 char long.

Hmm, I wonder why my git blame gives me 13 characters in the sha. (I cut
and pasted it from git blame). My git config has:

[core]  
abbrev = 12


> 
> > Signed-off-by: Steven Rostedt (Google)   
> 
> FWIW
> 
> Acked-by: Paolo Abeni 

Thanks!

-- Steve
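
For background on the type error quoted above, a tiny userspace example: for
char name[16], "name" decays to char *, while "&name" has type char (*)[16],
which strcmp() rejects under -Werror=incompatible-pointer-types.

#include <stdio.h>
#include <string.h>

struct netdev {
	char name[16];		/* stand-in for the IFNAMSIZ-sized name */
};

int main(void)
{
	struct netdev dev = { .name = "eth0" };

	/* OK: the array decays to a char pointer */
	printf("%d\n", strcmp(dev.name, "eth0"));

	/* Rejected by the compiler: &dev.name has type char (*)[16]
	 * printf("%d\n", strcmp(&dev.name, "eth0"));
	 */
	return 0;
}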



Re: [PATCH] tracing: Use strcmp() in __assign_str() WARN_ON() check

2024-03-13 Thread Steven Rostedt
On Wed, 13 Mar 2024 13:45:50 -0400
Steven Rostedt  wrote:

> Let me test to make sure that when src is a string "like this" that it does
> the strcmp(). Otherwise, we may have to always do the strcmp(), which I
> really would like to avoid.

I added the below patch and enabled sched_switch and it triggered the
warning (expected if it went the strcmp() path). I then changed it to be:

#define __assign_str(dst, src)  \
do {\
char *__str__ = __get_str(dst); \
int __len__ = __get_dynamic_array_len(dst) - 1; \
WARN_ON_ONCE(__builtin_constant_p(src) ?\
 strcmp((src), __data_offsets.dst##_ptr_) : \
-(src) != __data_offsets.dst##_ptr_);   \
+(src) == __data_offsets.dst##_ptr_);   \
memcpy(__str__, __data_offsets.dst##_ptr_ ? :   \
   EVENT_NULL_STR, __len__);\
__str__[__len__] = '\0';\
} while (0)

And the sched_switch did not trigger (expected). So it seems that it should
not be a problem.

Note, I only tested this with gcc and not clang.

But I guess there's also the case where we have:

__assign_str(str, field ? field : "NULL")

But hopefully that's not an issue.

-- Steve

diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
index dbb01b4b7451..eaacd0c4e899 100644
--- a/include/trace/events/sched.h
+++ b/include/trace/events/sched.h
@@ -236,6 +236,7 @@ TRACE_EVENT(sched_switch,
__array(char,   next_comm,  TASK_COMM_LEN   )
__field(pid_t,  next_pid)
__field(int,next_prio   )
+   __string( test, "this")
),
 
TP_fast_assign(
@@ -246,6 +247,7 @@ TRACE_EVENT(sched_switch,
memcpy(__entry->prev_comm, prev->comm, TASK_COMM_LEN);
__entry->next_pid   = next->pid;
__entry->next_prio  = next->prio;
+   __assign_str(test, "this");
/* XXX SCHED_DEADLINE */
),
 
diff --git a/include/trace/stages/stage6_event_callback.h 
b/include/trace/stages/stage6_event_callback.h
index 83da83a0c14f..cf301c723fd0 100644
--- a/include/trace/stages/stage6_event_callback.h
+++ b/include/trace/stages/stage6_event_callback.h
@@ -36,7 +36,7 @@
char *__str__ = __get_str(dst); \
int __len__ = __get_dynamic_array_len(dst) - 1; \
WARN_ON_ONCE(__builtin_constant_p(src) ?\
-strcmp((src), __data_offsets.dst##_ptr_) : \
+!strcmp((src), __data_offsets.dst##_ptr_) : \
 (src) != __data_offsets.dst##_ptr_);   \
memcpy(__str__, __data_offsets.dst##_ptr_ ? :   \
   EVENT_NULL_STR, __len__);\



Re: [PATCH] tracing: Use strcmp() in __assign_str() WARN_ON() check

2024-03-13 Thread Steven Rostedt
On Wed, 13 Mar 2024 09:59:03 -0700
Nathan Chancellor  wrote:

> > Reported-by: kernel test robot 
> > Closes: 
> > https://lore.kernel.org/oe-kbuild-all/202402292111.kidexylu-...@intel.com/
> > Fixes: 433e1d88a3be ("tracing: Add warning if string in __assign_str() does 
> > not match __string()")  
> 
> Is this change destined for 6.9 or 6.10? I applied it to current
> trace/core (eb1533d156d3) along with 51270d573a8d but the warning is
> still present. I even tried
> 
> __builtin_choose_expr(__is_constexpr((src)),
>  strcmp((src), __data_offsets.dst##_ptr_),
>  (src) != __data_offsets.dst##_ptr_));
> 
> but not even that silenced the warning. I think we will still need a
> diag directive to fully silence this warning.

Yes, you said that the warning is still there, but the bug it shows should
not be.

I believe that's because clang still evaluates the (src) != ... even when
the source is a constant and it warns about it. But if src is a constant, we
do not want to do the !=, we want to do the slower strcmp().

Let me test to make sure that when src is a string "like this" that it does
the strcmp(). Otherwise, we may have to always do the strcmp(), which I
really would like to avoid.

BTW, I triggered another bug with strcmp():

  https://lore.kernel.org/all/20240313093454.3909a...@gandalf.local.home/

-- Steve



[PATCH] net: hns3: tracing: fix hclgevf trace event strings

2024-03-13 Thread Steven Rostedt
From: "Steven Rostedt (Google)" 

[
   Note, I need to take this patch through my tree, so I'm looking for acks.
   This causes the build to fail when I add the __assign_str() check, which
   I was about to push to Linus, but it breaks allmodconfig due to this error.
]

The __string() and __assign_str() helper macros of the TRACE_EVENT() macro
are going through some optimizations where only the source string of
__string() will be used and the __assign_str() source will be ignored and
later removed.

To make sure that there's no issues, a new check is added between the
__string() src argument and the __assign_str() src argument that does a
strcmp() to make sure they are the same string.

The hclgevf trace events have:

  __assign_str(devname, &hdev->nic.kinfo.netdev->name);

Which triggers the warning:

hclgevf_trace.h:34:39: error: passing argument 1 of ‘strcmp’ from incompatible 
pointer type [-Werror=incompatible-pointer-types]
   34 | __assign_str(devname, &hdev->nic.kinfo.netdev->name);
 [..]
arch/x86/include/asm/string_64.h:75:24: note: expected ‘const char *’ but 
argument is of type ‘char (*)[16]’
   75 | int strcmp(const char *cs, const char *ct);
  |^~


Because __assign_str() now has:

WARN_ON_ONCE(__builtin_constant_p(src) ?\
 strcmp((src), __data_offsets.dst##_ptr_) : \
 (src) != __data_offsets.dst##_ptr_);   \

The problem is the '&' on hdev->nic.kinfo.netdev->name. That's because
that name is:

char    name[IFNAMSIZ]

Where passing an address '&' of a char array is not compatible with strcmp().

The '&' is not necessary, remove it.

Fixes: d8355240cf8fb ("net: hns3: add trace event support for PF/VF mailbox")
Signed-off-by: Steven Rostedt (Google) 
---
 drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_trace.h  | 8 
 .../net/ethernet/hisilicon/hns3/hns3vf/hclgevf_trace.h| 8 
 2 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_trace.h 
b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_trace.h
index 8510b88d4982..f3cd5a376eca 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_trace.h
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_trace.h
@@ -24,7 +24,7 @@ TRACE_EVENT(hclge_pf_mbx_get,
__field(u8, code)
__field(u8, subcode)
__string(pciname, pci_name(hdev->pdev))
-   __string(devname, &hdev->vport[0].nic.kinfo.netdev->name)
+   __string(devname, hdev->vport[0].nic.kinfo.netdev->name)
__array(u32, mbx_data, PF_GET_MBX_LEN)
),
 
@@ -33,7 +33,7 @@ TRACE_EVENT(hclge_pf_mbx_get,
__entry->code = req->msg.code;
__entry->subcode = req->msg.subcode;
__assign_str(pciname, pci_name(hdev->pdev));
-   __assign_str(devname, &hdev->vport[0].nic.kinfo.netdev->name);
+   __assign_str(devname, hdev->vport[0].nic.kinfo.netdev->name);
memcpy(__entry->mbx_data, req,
   sizeof(struct hclge_mbx_vf_to_pf_cmd));
),
@@ -56,7 +56,7 @@ TRACE_EVENT(hclge_pf_mbx_send,
__field(u8, vfid)
__field(u16, code)
__string(pciname, pci_name(hdev->pdev))
-   __string(devname, &hdev->vport[0].nic.kinfo.netdev->name)
+   __string(devname, hdev->vport[0].nic.kinfo.netdev->name)
__array(u32, mbx_data, PF_SEND_MBX_LEN)
),
 
@@ -64,7 +64,7 @@ TRACE_EVENT(hclge_pf_mbx_send,
__entry->vfid = req->dest_vfid;
__entry->code = le16_to_cpu(req->msg.code);
__assign_str(pciname, pci_name(hdev->pdev));
-   __assign_str(devname, &hdev->vport[0].nic.kinfo.netdev->name);
+   __assign_str(devname, hdev->vport[0].nic.kinfo.netdev->name);
memcpy(__entry->mbx_data, req,
   sizeof(struct hclge_mbx_pf_to_vf_cmd));
),
diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_trace.h 
b/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_trace.h
index 5d4895bb57a1..b259e95dd53c 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_trace.h
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_trace.h
@@ -23,7 +23,7 @@ TRACE_EVENT(hclge_vf_mbx_get,
__field(u8, vfid)
__field(u16, code)
__string(pciname, pci_name(hdev->pdev))
-   __string(devname, &hdev->nic.kinfo.netdev->name)
+   __string(devname, hdev->nic.kinfo.netdev->name)
__array(u32, mbx_data, VF_GET_MBX_LEN)
),
 
@@ -31,7 +31,7 @@ TRACE_EVENT(hclge_vf_mbx_get,
__entry->vfid = req->dest_vfid;
__ent

[PATCH v4] ring-buffer: Have mmapped ring buffer keep track of missed events

2024-03-12 Thread Steven Rostedt
From: "Steven Rostedt (Google)" 

While testing libtracefs on the mmapped ring buffer, the test that checks
if missed events are accounted for failed when using the mapped buffer.
This is because the mapped page does not update the missed events that
were dropped because the writer filled up the ring buffer before the
reader could catch it.

Add the missed events to the reader page/sub-buffer when the IOCTL is done
and a new reader page is acquired.

Note that all accesses to the reader_page via rb_page_commit() had to be
switched to rb_page_size(). rb_page_size() was just a copy of
rb_page_commit(), but it now masks out the RB_MISSED bits. This is needed
as the mapped reader page is still active in the ring buffer code and
where it reads the commit field of the bpage for the size, it now must
mask it otherwise the missed bits that are now set will corrupt the size
returned.

Signed-off-by: Steven Rostedt (Google) 
---
Changes since v3: 
https://lore.kernel.org/linux-trace-kernel/20240109212702.5b8e6...@gandalf.local.home

- Almost forgot to add this, but one of my libtracefs tests failed due to it.

- Rebased against the latest code.

- Removed the zeroing out at the end of the page as the ring buffer is now
  zeroed out when allocated. Not too worried about stale tracing data, but
  having non tracing data would be a problem.

 kernel/trace/ring_buffer.c | 53 +-
 1 file changed, 47 insertions(+), 6 deletions(-)

diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 13250856a3a8..abe21c47fb49 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -313,6 +313,8 @@ static u64 rb_event_time_stamp(struct ring_buffer_event 
*event)
 /* Missed count stored at end */
 #define RB_MISSED_STORED   (1 << 30)
 
+#define RB_MISSED_MASK (3 << 30)
+
 struct buffer_data_page {
u64  time_stamp;    /* page time stamp */
local_t  commit;    /* write committed index */
@@ -2274,7 +2276,7 @@ rb_iter_head_event(struct ring_buffer_iter *iter)
 /* Size is determined by what has been committed */
 static __always_inline unsigned rb_page_size(struct buffer_page *bpage)
 {
-   return rb_page_commit(bpage);
+   return rb_page_commit(bpage) & ~RB_MISSED_MASK;
 }
 
 static __always_inline unsigned
@@ -3901,7 +3903,7 @@ static bool rb_per_cpu_empty(struct ring_buffer_per_cpu 
*cpu_buffer)
return true;
 
/* Reader should exhaust content in reader page */
-   if (reader->read != rb_page_commit(reader))
+   if (reader->read != rb_page_size(reader))
return false;
 
/*
@@ -4372,7 +4374,7 @@ int ring_buffer_iter_empty(struct ring_buffer_iter *iter)
return ((iter->head_page == commit_page && iter->head >= commit) ||
(iter->head_page == reader && commit_page == head_page &&
 head_page->read == commit &&
-iter->head == rb_page_commit(cpu_buffer->reader_page)));
+iter->head == rb_page_size(cpu_buffer->reader_page)));
 }
 EXPORT_SYMBOL_GPL(ring_buffer_iter_empty);
 
@@ -5701,7 +5703,7 @@ int ring_buffer_read_page(struct trace_buffer *buffer,
event = rb_reader_event(cpu_buffer);
 
read = reader->read;
-   commit = rb_page_commit(reader);
+   commit = rb_page_size(reader);
 
/* Check if any events were dropped */
missed_events = cpu_buffer->lost_events;
@@ -5778,7 +5780,7 @@ int ring_buffer_read_page(struct trace_buffer *buffer,
} else {
/* update the entry counter */
cpu_buffer->read += rb_page_entries(reader);
-   cpu_buffer->read_bytes += rb_page_commit(reader);
+   cpu_buffer->read_bytes += rb_page_size(reader);
 
/* swap the pages */
rb_init_page(bpage);
@@ -6331,6 +6333,8 @@ struct page *ring_buffer_map_fault(struct trace_buffer 
*buffer, int cpu,
 int ring_buffer_map_get_reader(struct trace_buffer *buffer, int cpu)
 {
struct ring_buffer_per_cpu *cpu_buffer;
+   struct buffer_page *reader;
+   unsigned long missed_events;
unsigned long reader_size;
unsigned long flags;
 
@@ -6356,9 +6360,46 @@ int ring_buffer_map_get_reader(struct trace_buffer 
*buffer, int cpu)
goto out;
}
 
-   if (WARN_ON(!rb_get_reader_page(cpu_buffer)))
+   reader = rb_get_reader_page(cpu_buffer);
+   if (WARN_ON(!reader))
goto out;
 
+   /* Check if any events were dropped */
+   missed_events = cpu_buffer->lost_events;
+
+   if (cpu_buffer->reader_page != cpu_buffer->commit_page) {
+   if (missed_events) {
+   struct buffer_data_page *bpage = reader->page;
+   unsigned int commit;
+

[PATCH] ring-buffer: Do not set shortest_full when full target is hit

2024-03-12 Thread Steven Rostedt
From: "Steven Rostedt (Google)" 

The rb_watermark_hit() checks if the amount of data in the ring buffer is
above the percentage level passed in by the "full" variable. If it is, it
returns true.

But it also sets the "shortest_full" field of the cpu_buffer that informs
writers that it needs to call the irq_work if the amount of data on the
ring buffer is above the requested amount.

The rb_watermark_hit() always sets the shortest_full even if the amount in
the ring buffer is what it wants. As it is not going to wait, because it
has what it wants, there's no reason to set shortest_full.

Cc: sta...@vger.kernel.org
Fixes: 42fb0a1e84ff5 ("tracing/ring-buffer: Have polling block on watermark")
Signed-off-by: Steven Rostedt (Google) 
---
 kernel/trace/ring_buffer.c | 7 ---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 9b887d44b8d9..350607cce869 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -834,9 +834,10 @@ static bool rb_watermark_hit(struct trace_buffer *buffer, 
int cpu, int full)
pagebusy = cpu_buffer->reader_page == cpu_buffer->commit_page;
ret = !pagebusy && full_hit(buffer, cpu, full);
 
-   if (!cpu_buffer->shortest_full ||
-   cpu_buffer->shortest_full > full)
-   cpu_buffer->shortest_full = full;
+   if (!ret && (!cpu_buffer->shortest_full ||
+cpu_buffer->shortest_full > full)) {
+   cpu_buffer->shortest_full = full;
+   }
raw_spin_unlock_irqrestore(&cpu_buffer->reader_lock, flags);
}
return ret;
-- 
2.43.0




Re: [PATCH v2 2/2] ring-buffer: Reuse rb_watermark_hit() for the poll logic

2024-03-12 Thread Steven Rostedt
On Wed, 13 Mar 2024 00:38:42 +0900
Masami Hiramatsu (Google)  wrote:

> On Tue, 12 Mar 2024 09:19:21 -0400
> Steven Rostedt  wrote:
> 
> > From: "Steven Rostedt (Google)" 
> > 
> > The check for knowing if the poll should wait or not is basically the
> > exact same logic as rb_watermark_hit(). The only difference is that
> > rb_watermark_hit() also handles the !full case. But for the full case, the
> > logic is the same. Just call that instead of duplicating the code in
> > ring_buffer_poll_wait().
> >   
> 
> This changes a bit (e.g. adding pagebusy check) but basically that should
> be there. And new version appears to be consistent between ring_buffer_wait()
> and ring_buffer_poll_wait(). So looks good to me.

The pagebusy check is an optimization. If it is true, it means the
writer is still on the reader_page and there are no sub-buffers available. It
just prevents having to do the calculation of the buffer-percentage filled
(what's done by the full_hit() logic).
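
That is, the relevant check in rb_watermark_hit() is:

	pagebusy = cpu_buffer->reader_page == cpu_buffer->commit_page;
	ret = !pagebusy && full_hit(buffer, cpu, full);

If pagebusy is true, full_hit() is never evaluated.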

> 
> Reviewed-by: Masami Hiramatsu (Google) 
>

Thanks!

-- Steve



Re: [PATCH v2 1/2] ring-buffer: Fix full_waiters_pending in poll

2024-03-12 Thread Steven Rostedt
On Wed, 13 Mar 2024 00:22:10 +0900
Masami Hiramatsu (Google)  wrote:

> On Tue, 12 Mar 2024 09:19:20 -0400
> Steven Rostedt  wrote:
> 
> > From: "Steven Rostedt (Google)" 
> > 
> > If a reader of the ring buffer is doing a poll, and waiting for the ring
> > buffer to hit a specific watermark, there could be a case where it gets
> > into an infinite ping-pong loop.
> > 
> > The poll code has:
> > 
> >   rbwork->full_waiters_pending = true;
> >   if (!cpu_buffer->shortest_full ||
> >   cpu_buffer->shortest_full > full)
> >  cpu_buffer->shortest_full = full;
> > 
> > The writer will see full_waiters_pending and check if the ring buffer is
> > filled over the percentage of the shortest_full value. If it is, it calls
> > an irq_work to wake up all the waiters.
> > 
> > But the code could get into a circular loop:
> > 
> > CPU 0   CPU 1
> > -   -
> >  [ Poll ]
> >[ shortest_full = 0 ]
> >rbwork->full_waiters_pending = true;
> >   if (rbwork->full_waiters_pending &&
> >   [ buffer percent ] > 
> > shortest_full) {
> >  rbwork->wakeup_full = true;
> >  [ queue_irqwork ]  
> 
> Oh, so `[ buffer percent ] > shortest_full` does not work because
> if this happens in this order, shortest_full may be 0.

Exactly!

> 
> > 
> >cpu_buffer->shortest_full = full;
> > 
> >   [ IRQ work ]
> >   if (rbwork->wakeup_full) {
> > cpu_buffer->shortest_full = 0;

And here shortest_full gets set back to zero! (But that's not the bug).

> > wakeup poll waiters;
> >   [woken]
> >if ([ buffer percent ] > full)
> >   break;
> >rbwork->full_waiters_pending = true;

The bug is setting full_waiters_pending before updating the shortest_full.

> >   if (rbwork->full_waiters_pending &&
> >   [ buffer percent ] > 
> > shortest_full) {
> >  rbwork->wakeup_full = true;
> >  [ queue_irqwork ]
> > 
> >cpu_buffer->shortest_full = full;
> > 
> >   [ IRQ work ]
> >   if (rbwork->wakeup_full) {
> > cpu_buffer->shortest_full = 0;
> > wakeup poll waiters;
> >   [woken]
> > 
> >  [ Wash, rinse, repeat! ]
> > 
> > In the poll, the shortest_full needs to be set before the
> > full_pending_waiters, as once that is set, the writer will compare the
> > current shortest_full (which is incorrect) to decide to call the irq_work,
> > which will reset the shortest_full (expecting the readers to update it).
> > 
> > Also move the setting of full_waiters_pending after the check if the ring
> > buffer has the required percentage filled. There's no reason to tell the
> > writer to wake up waiters if there are no waiters.
> >   
> 
> Looks good to me.
> 
> Reviewed-by: Masami Hiramatsu (Google) 

Thanks!

I'm running it through my tests and when they finish, I'll be posting the
for-linus patches.

-- Steve



[PATCH] tracing: Use strcmp() in __assign_str() WARN_ON() check

2024-03-12 Thread Steven Rostedt
From: "Steven Rostedt (Google)" 

The WARN_ON() check in __assign_str() to catch where the source variable
to the macro doesn't match the source variable to __string() gives an
error in clang:

>> include/trace/events/sunrpc.h:703:4: warning: result of comparison against a 
>> string literal is unspecified (use an explicit string comparison function 
>> instead) [-Wstring-compare]
 670 | __assign_str(progname, "unknown");

That's because the __assign_str() macro has:

   WARN_ON_ONCE((src) != __data_offsets.dst##_ptr_);

Where "src" is a string literal. Clang warns when comparing a string
literal directly as it is undefined to what the value of the literal is.

Since the check still needs to make sure the string that goes to __string()
is the same one passed to __assign_str(), test whether the source is a
string literal and use strcmp() in those cases.
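
A minimal out-of-tree reproducer of the clang warning (hypothetical
function, not from the tracing code), built with
"clang -Wstring-compare -c repro.c":

/* Comparing a pointer against a string literal is flagged because the
 * address of the literal is unspecified. */
int is_unknown(const char *name)
{
	return name != "unknown";
}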

Note that this depends on commit 51270d573a8d ("tracing/net_sched: Fix
tracepoints that save qdisc_dev() as a string") being applied, as this was
what found that bug.

Reported-by: kernel test robot 
Closes: 
https://lore.kernel.org/oe-kbuild-all/202402292111.kidexylu-...@intel.com/
Fixes: 433e1d88a3be ("tracing: Add warning if string in __assign_str() does not 
match __string()")
Signed-off-by: Steven Rostedt (Google) 
---
 include/trace/stages/stage6_event_callback.h | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/include/trace/stages/stage6_event_callback.h 
b/include/trace/stages/stage6_event_callback.h
index a0c15f67eabe..83da83a0c14f 100644
--- a/include/trace/stages/stage6_event_callback.h
+++ b/include/trace/stages/stage6_event_callback.h
@@ -35,7 +35,9 @@
do {\
char *__str__ = __get_str(dst); \
int __len__ = __get_dynamic_array_len(dst) - 1; \
-   WARN_ON_ONCE((src) != __data_offsets.dst##_ptr_);   \
+   WARN_ON_ONCE(__builtin_constant_p(src) ?\
+strcmp((src), __data_offsets.dst##_ptr_) : \
+(src) != __data_offsets.dst##_ptr_);   \
memcpy(__str__, __data_offsets.dst##_ptr_ ? :   \
   EVENT_NULL_STR, __len__);\
__str__[__len__] = '\0';\
-- 
2.43.0




[PATCH v2 1/2] ring-buffer: Fix full_waiters_pending in poll

2024-03-12 Thread Steven Rostedt
From: "Steven Rostedt (Google)" 

If a reader of the ring buffer is doing a poll, and waiting for the ring
buffer to hit a specific watermark, there could be a case where it gets
into an infinite ping-pong loop.

The poll code has:

  rbwork->full_waiters_pending = true;
  if (!cpu_buffer->shortest_full ||
  cpu_buffer->shortest_full > full)
 cpu_buffer->shortest_full = full;

The writer will see full_waiters_pending and check if the ring buffer is
filled over the percentage of the shortest_full value. If it is, it calls
an irq_work to wake up all the waiters.

But the code could get into a circular loop:

CPU 0   CPU 1
-   -
 [ Poll ]
   [ shortest_full = 0 ]
   rbwork->full_waiters_pending = true;
  if (rbwork->full_waiters_pending &&
  [ buffer percent ] > 
shortest_full) {
 rbwork->wakeup_full = true;
 [ queue_irqwork ]

   cpu_buffer->shortest_full = full;

  [ IRQ work ]
  if (rbwork->wakeup_full) {
cpu_buffer->shortest_full = 0;
wakeup poll waiters;
  [woken]
   if ([ buffer percent ] > full)
  break;
   rbwork->full_waiters_pending = true;
  if (rbwork->full_waiters_pending &&
  [ buffer percent ] > 
shortest_full) {
 rbwork->wakeup_full = true;
 [ queue_irqwork ]

   cpu_buffer->shortest_full = full;

  [ IRQ work ]
  if (rbwork->wakeup_full) {
cpu_buffer->shortest_full = 0;
wakeup poll waiters;
  [woken]

 [ Wash, rinse, repeat! ]

In the poll, the shortest_full needs to be set before the
full_pending_waiters, as once that is set, the writer will compare the
current shortest_full (which is incorrect) to decide to call the irq_work,
which will reset the shortest_full (expecting the readers to update it).

Also move the setting of full_waiters_pending after the check if the ring
buffer has the required percentage filled. There's no reason to tell the
writer to wake up waiters if there are no waiters.

Cc: sta...@vger.kernel.org
Fixes: 42fb0a1e84ff5 ("tracing/ring-buffer: Have polling block on watermark")
Signed-off-by: Steven Rostedt (Google) 
---
 kernel/trace/ring_buffer.c | 27 ---
 1 file changed, 20 insertions(+), 7 deletions(-)

diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index aa332ace108b..adfe603a769b 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -964,16 +964,32 @@ __poll_t ring_buffer_poll_wait(struct trace_buffer 
*buffer, int cpu,
poll_wait(filp, &rbwork->full_waiters, poll_table);
 
raw_spin_lock_irqsave(&cpu_buffer->reader_lock, flags);
-   rbwork->full_waiters_pending = true;
if (!cpu_buffer->shortest_full ||
cpu_buffer->shortest_full > full)
cpu_buffer->shortest_full = full;
raw_spin_unlock_irqrestore(&cpu_buffer->reader_lock, flags);
-   } else {
-   poll_wait(filp, &rbwork->waiters, poll_table);
-   rbwork->waiters_pending = true;
+   if (full_hit(buffer, cpu, full))
+   return EPOLLIN | EPOLLRDNORM;
+   /*
+* Only allow full_waiters_pending update to be seen after
+* the shortest_full is set. If the writer sees the
+* full_waiters_pending flag set, it will compare the
+* amount in the ring buffer to shortest_full. If the amount
+* in the ring buffer is greater than the shortest_full
+* percent, it will call the irq_work handler to wake up
+* this list. The irq_handler will reset shortest_full
+* back to zero. That's done under the reader_lock, but
+* the below smp_mb() makes sure that the update to
+* full_waiters_pending doesn't leak up into the above.
+*/
+   smp_mb();
+   rbwork->full_waiters_pending = true;
+   return 0;
}
 
+   poll_wait(filp, &rbwork->waiters, poll_table);
+   rbwork->waiters_pending = true;
+
/*
 * There's a tight race between setti

[PATCH v2 0/2] ring-buffer: Fix poll wakeup logic

2024-03-12 Thread Steven Rostedt


After making a slight change to wakeups in ring_buffer_wait()
the system would hang. Spending several hours going on a wild goose
chase I found that the change only triggered the bug because it
changed the timings. The bug was there before the update but never
was triggered.

The poll code has:

  rbwork->full_waiters_pending = true;
  if (!cpu_buffer->shortest_full ||
  cpu_buffer->shortest_full > full)
 cpu_buffer->shortest_full = full;

The writer will see full_waiters_pending and check if the ring buffer is
filled over the percentage of the shortest_full value. If it is, it calls
an irq_work to wake up all the waiters.

But the code could get into a circular loop:

CPU 0   CPU 1
-   -
 [ Poll ]
   [ shortest_full = 0 ]
   rbwork->full_waiters_pending = true;
  if (rbwork->full_waiters_pending &&
  [ buffer percent ] > 
shortest_full) {
 rbwork->wakeup_full = true;
 [ queue_irqwork ]

   cpu_buffer->shortest_full = full;

  [ IRQ work ]
  if (rbwork->wakeup_full) {
cpu_buffer->shortest_full = 0;
wakeup poll waiters;
  [woken]
   if ([ buffer percent ] > full)
  break;
   rbwork->full_waiters_pending = true;
  if (rbwork->full_waiters_pending &&
  [ buffer percent ] > 
shortest_full) {
 rbwork->wakeup_full = true;
 [ queue_irqwork ]

   cpu_buffer->shortest_full = full;

  [ IRQ work ]
  if (rbwork->wakeup_full) {
cpu_buffer->shortest_full = 0;
wakeup poll waiters;
  [woken]

 [ Wash, rinse, repeat! ]

The race was triggered when running:

  trace-cmd record -p function -m 5000

Which enables function tracing and then creates two files it is writing
into where each is 2500K in size. The -m is a "max file size". When
trace-cmd writes 2500K to one file it then switches to the other, erasing
the old data. To do this, trace-cmd switches between both poll and
the reader using both methods of wake up. The change to the reader wakeup
was able to change the way the poll was woken to trigger this bug.

The second patch is a clean up and also a way to consolidate the logic
of the shortest_full. The read wakeup uses rb_watermark_hit for both
full wakeups and !full wakeups. But since poll uses the same logic for
full wakeups it can just call that function with full set.

Changes since v1: 
https://lore.kernel.org/all/20240312115455.666920...@goodmis.org/

- Removed unused 'flags' in ring_buffer_poll_wait() as the spin_lock
  is now in rb_watermark_hit().


Steven Rostedt (Google) (2):
  ring-buffer: Fix full_waiters_pending in poll
  ring-buffer: Reuse rb_watermark_hit() for the poll logic


 kernel/trace/ring_buffer.c | 32 +++-
 1 file changed, 19 insertions(+), 13 deletions(-)



[PATCH v2 2/2] ring-buffer: Reuse rb_watermark_hit() for the poll logic

2024-03-12 Thread Steven Rostedt
From: "Steven Rostedt (Google)" 

The check for knowing if the poll should wait or not is basically the
exact same logic as rb_watermark_hit(). The only difference is that
rb_watermark_hit() also handles the !full case. But for the full case, the
logic is the same. Just call that instead of duplicating the code in
ring_buffer_poll_wait().

Signed-off-by: Steven Rostedt (Google) 
---
 kernel/trace/ring_buffer.c | 21 +++--
 1 file changed, 7 insertions(+), 14 deletions(-)

diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index adfe603a769b..857803e8cf07 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -959,25 +959,18 @@ __poll_t ring_buffer_poll_wait(struct trace_buffer 
*buffer, int cpu,
}
 
if (full) {
-   unsigned long flags;
-
poll_wait(filp, &rbwork->full_waiters, poll_table);
 
-   raw_spin_lock_irqsave(&cpu_buffer->reader_lock, flags);
-   if (!cpu_buffer->shortest_full ||
-   cpu_buffer->shortest_full > full)
-   cpu_buffer->shortest_full = full;
-   raw_spin_unlock_irqrestore(&cpu_buffer->reader_lock, flags);
-   if (full_hit(buffer, cpu, full))
+   if (rb_watermark_hit(buffer, cpu, full))
return EPOLLIN | EPOLLRDNORM;
/*
 * Only allow full_waiters_pending update to be seen after
-* the shortest_full is set. If the writer sees the
-* full_waiters_pending flag set, it will compare the
-* amount in the ring buffer to shortest_full. If the amount
-* in the ring buffer is greater than the shortest_full
-* percent, it will call the irq_work handler to wake up
-* this list. The irq_handler will reset shortest_full
+* the shortest_full is set (in rb_watermark_hit). If the
+* writer sees the full_waiters_pending flag set, it will
+* compare the amount in the ring buffer to shortest_full.
+* If the amount in the ring buffer is greater than the
+* shortest_full percent, it will call the irq_work handler
+* to wake up this list. The irq_handler will reset 
shortest_full
 * back to zero. That's done under the reader_lock, but
 * the below smp_mb() makes sure that the update to
 * full_waiters_pending doesn't leak up into the above.
-- 
2.43.0





[PATCH v4 2/2] tracing/ring-buffer: Fix wait_on_pipe() race

2024-03-12 Thread Steven Rostedt
From: "Steven Rostedt (Google)" 

When the trace_pipe_raw file is closed, there should be no new readers on
the file descriptor. This is mostly handled with the waking and wait_index
fields of the iterator. But there's still a slight race.

 CPU 0  CPU 1
 -  -
   wait_index++;
   index = wait_index;
   ring_buffer_wake_waiters();
   wait_on_pipe()
 ring_buffer_wait();

The ring_buffer_wait() will miss the wakeup from CPU 1. The problem is
that the ring_buffer_wait() needs the logic of:

prepare_to_wait();
if (!condition)
schedule();

Where the missing condition check is the iter->wait_index update.

Have the ring_buffer_wait() take a conditional callback function and a
data parameter that can be used within the wait_event_interruptible() of
the ring_buffer_wait() function.

In wait_on_pipe(), pass a condition function that will check if the
wait_index has been updated, if it has, it will return true to break out
of the wait_event_interruptible() loop.

Create a new field "closed" in the trace_iterator and set it in the
.flush() callback before calling ring_buffer_wake_waiters().
This will keep any new readers from waiting on a closed file descriptor.

Have the wait_on_pipe() condition callback also check the closed field.

Change the wait_index field of the trace_iterator to atomic_t. There's no
reason it needs to be 'long' and making it atomic and using
atomic_read_acquire() and atomic_fetch_inc_release() will provide the
necessary memory barriers.

Add a "woken" flag to tracing_buffers_splice_read() to exit the loop after
one more try to fetch data. That is, if it waited for data and something
woke it up, it should try to collect any new data and then exit back to
user space.
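
The shape of the condition callback described above is roughly the
following (an illustrative sketch only; the exact structure and
function names in the patch may differ):

/* Illustrative: the data passed to ring_buffer_wait() bundles the
 * iterator with the wait_index snapshot taken before waiting. */
struct pipe_wait {
	struct trace_iterator	*iter;
	long			wait_index;
};

static bool wait_pipe_cond(void *data)
{
	struct pipe_wait *pwait = data;
	struct trace_iterator *iter = pwait->iter;

	/* A waker bumped wait_index since it was snapshotted */
	if (atomic_read_acquire(&iter->wait_index) != pwait->wait_index)
		return true;

	/* The file was closed, don't keep waiting */
	return iter->closed;
}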

Link: 
https://lore.kernel.org/linux-trace-kernel/CAHk-=wgsngewhfxzajiaqznwpmqetqmi1waes2o1v6l4c_u...@mail.gmail.com/

Cc: sta...@vger.kernel.org
Fixes: f3ddb74ad0790 ("tracing: Wake up ring buffer waiters on closing of the 
file")
Signed-off-by: Steven Rostedt (Google) 
---
 include/linux/ring_buffer.h  |  3 ++-
 include/linux/trace_events.h |  5 -
 kernel/trace/ring_buffer.c   | 13 ++-
 kernel/trace/trace.c | 43 ++--
 4 files changed, 45 insertions(+), 19 deletions(-)

diff --git a/include/linux/ring_buffer.h b/include/linux/ring_buffer.h
index 338a33db1577..dc5ae4e96aee 100644
--- a/include/linux/ring_buffer.h
+++ b/include/linux/ring_buffer.h
@@ -99,7 +99,8 @@ __ring_buffer_alloc(unsigned long size, unsigned flags, 
struct lock_class_key *k
 })
 
 typedef bool (*ring_buffer_cond_fn)(void *data);
-int ring_buffer_wait(struct trace_buffer *buffer, int cpu, int full);
+int ring_buffer_wait(struct trace_buffer *buffer, int cpu, int full,
+ring_buffer_cond_fn cond, void *data);
 __poll_t ring_buffer_poll_wait(struct trace_buffer *buffer, int cpu,
  struct file *filp, poll_table *poll_table, int full);
 void ring_buffer_wake_waiters(struct trace_buffer *buffer, int cpu);
diff --git a/include/linux/trace_events.h b/include/linux/trace_events.h
index d68ff9b1247f..fc6d0af56bb1 100644
--- a/include/linux/trace_events.h
+++ b/include/linux/trace_events.h
@@ -103,13 +103,16 @@ struct trace_iterator {
unsigned int    temp_size;
char    *fmt;   /* modified format holder */
unsigned int    fmt_size;
-   long    wait_index;
+   atomic_t    wait_index;
 
/* trace_seq for __print_flags() and __print_symbolic() etc. */
struct trace_seq    tmp_seq;
 
cpumask_var_t   started;
 
+   /* Set when the file is closed to prevent new waiters */
+   bool    closed;
+
/* it's true when current open file is snapshot */
bool    snapshot;
 
diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index c198ba466853..67d8405f4451 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -901,23 +901,26 @@ static bool rb_wait_once(void *data)
  * @buffer: buffer to wait on
  * @cpu: the cpu buffer to wait on
  * @full: wait until the percentage of pages are available, if @cpu != 
RING_BUFFER_ALL_CPUS
+ * @cond: condition function to break out of wait (NULL to run once)
+ * @data: the data to pass to @cond.
  *
  * If @cpu == RING_BUFFER_ALL_CPUS then the task will wake up as soon
  * as data is added to any of the @buffer's cpu buffers. Otherwise
  * it will wait for data to be added to a specific cpu buffer.
  */
-int ring_buffer_wait(struct trace_buffer *buffer, int cpu, int full)
+int ring_buffer_wait(struct trace_buffer *buffer, int cpu, int full,
+ring_buffer_cond_fn cond, void *data)
 {
struct ring

[PATCH v4 0/2] tracing/ring-buffer: Fix wakeup of ring buffer waiters

2024-03-12 Thread Steven Rostedt
[ Sorry for the spam, but I just noticed kernel-test-robot complained
  about KernelDoc format, which this fixes ]


A patch was sent to "fix" the wait_index variable that is used to help with
waking of waiters on the ring buffer. The patch was rejected, but I started
looking at associated code. Discussing it on IRC with Mathieu Desnoyers
we discovered a design flaw.

The waiter reads "wait_index" then enters a "wait loop". After adding
itself to the wait queue, if the buffer is filled or a signal is pending
it will exit out. If wait_index is different, it will also exit out.
(Note, I noticed that the check for wait_index was after the schedule()
call and should have been before it, but that's beside the point).

The race is what happens if the waker updates the wait_index,
a new waiter comes in and reads the updated wait_index before it adds
itself to the queue, it will miss the update!

These patches fix the design by converting the ring_buffer_wait()
to use the wait_event_interruptible() interface.

Then the wait_on_pipe() caller will pass its own condition into
the ring_buffer_wait() to be checked by the wait_event_interruptible().

- The first patch restructures the ring_buffer_wait() to use the
  wait_event_interruptible() logic. It does not change the interface,
  but adds a local "cond" condition callback function that will be
  what it defaults to in the second patch if NULL is passed in.

  The default cond function is just a "wait once" function. That
  is, the first time it is called (before wait_event_interruptible()
  calls schedule()), it will set the "once" variable to one and return
  false. The next time it is called (after wait_event_interruptible()
  wakes up) it will return true to break out of the loop.

- The second patch changes the ring_buffer_wait() interface to allow
  the caller to define its own "cond" callback. That will be checked
  in wait_event_interruptible() along with checking if the proper
  amount of data in the ring buffer has been hit.

Changes since v3: 
https://lore.kernel.org/all/20240312120252.998983...@goodmis.org/

- I should have checked before sending v2, but after I did, I noticed
  that kernel-test-robot reported that the cond and data parameters
  added to ring_buffer_wait() were not added to the kerneldoc above it.

Changes since v2: 
https://lore.kernel.org/all/20240308202402.234176...@goodmis.org/

- Patches 1-3 of v2 have been accepted.

- Instead of breaking ring_buffer_wait() into three functions that do
  a prepare_to_wait(), wait() and finish_wait(), have ring_buffer_wait()
  take a condition callback function and a data parameter.

  The ring_buffer_wait() now uses the wait_event_interruptible() code,
  and adds a helper function to check its own conditions, and also
  to call the passed in condition function and allow the caller to
  specify other conditions to break out of the event loop.

Changes since v1: 
https://lore.kernel.org/lkml/20240308183816.676883...@goodmis.org/

- My tests triggered a warning about calling a mutex_lock() after a
  prepare_to_wait() that changed the task's state. Convert the affected
  mutex over to a spinlock.


Steven Rostedt (Google) (2):
  ring-buffer: Use wait_event_interruptible() in ring_buffer_wait()
  tracing/ring-buffer: Fix wait_on_pipe() race


 include/linux/ring_buffer.h  |   4 +-
 include/linux/trace_events.h |   5 +-
 kernel/trace/ring_buffer.c   | 119 ++-
 kernel/trace/trace.c |  43 +++-
 4 files changed, 109 insertions(+), 62 deletions(-)



[PATCH v4 1/2] ring-buffer: Use wait_event_interruptible() in ring_buffer_wait()

2024-03-12 Thread Steven Rostedt
From: "Steven Rostedt (Google)" 

Convert ring_buffer_wait() over to wait_event_interruptible(). The default
condition is to execute the wait loop inside __wait_event() just once.

This does not change the ring_buffer_wait() prototype yet, but
restructures the code so that it can take a "cond" and "data" parameter
and will call wait_event_interruptible() with a helper function as the
condition.

The helper function (rb_wait_cond) takes the cond function and data
parameters. It will first check if the buffer hit the watermark defined by
the "full" parameter and then call the passed in condition parameter. If
either are true, it returns true.

If rb_wait_cond() does not return true, it will set the appropriate
"waiters_pending" flag and returns false.

Link: 
https://lore.kernel.org/linux-trace-kernel/CAHk-=wgsngewhfxzajiaqznwpmqetqmi1waes2o1v6l4c_u...@mail.gmail.com/

Cc: sta...@vger.kernel.org
Fixes: f3ddb74ad0790 ("tracing: Wake up ring buffer waiters on closing of the 
file")
Signed-off-by: Steven Rostedt (Google) 
---
 include/linux/ring_buffer.h |   1 +
 kernel/trace/ring_buffer.c  | 116 +---
 2 files changed, 69 insertions(+), 48 deletions(-)

diff --git a/include/linux/ring_buffer.h b/include/linux/ring_buffer.h
index fa802db216f9..338a33db1577 100644
--- a/include/linux/ring_buffer.h
+++ b/include/linux/ring_buffer.h
@@ -98,6 +98,7 @@ __ring_buffer_alloc(unsigned long size, unsigned flags, 
struct lock_class_key *k
__ring_buffer_alloc((size), (flags), &__key);   \
 })
 
+typedef bool (*ring_buffer_cond_fn)(void *data);
 int ring_buffer_wait(struct trace_buffer *buffer, int cpu, int full);
 __poll_t ring_buffer_poll_wait(struct trace_buffer *buffer, int cpu,
  struct file *filp, poll_table *poll_table, int full);
diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 6ef763f57c66..c198ba466853 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -842,43 +842,15 @@ static bool rb_watermark_hit(struct trace_buffer *buffer, 
int cpu, int full)
return ret;
 }
 
-/**
- * ring_buffer_wait - wait for input to the ring buffer
- * @buffer: buffer to wait on
- * @cpu: the cpu buffer to wait on
- * @full: wait until the percentage of pages are available, if @cpu != 
RING_BUFFER_ALL_CPUS
- *
- * If @cpu == RING_BUFFER_ALL_CPUS then the task will wake up as soon
- * as data is added to any of the @buffer's cpu buffers. Otherwise
- * it will wait for data to be added to a specific cpu buffer.
- */
-int ring_buffer_wait(struct trace_buffer *buffer, int cpu, int full)
+static inline bool
+rb_wait_cond(struct rb_irq_work *rbwork, struct trace_buffer *buffer,
+int cpu, int full, ring_buffer_cond_fn cond, void *data)
 {
-   struct ring_buffer_per_cpu *cpu_buffer;
-   DEFINE_WAIT(wait);
-   struct rb_irq_work *work;
-   int ret = 0;
-
-   /*
-* Depending on what the caller is waiting for, either any
-* data in any cpu buffer, or a specific buffer, put the
-* caller on the appropriate wait queue.
-*/
-   if (cpu == RING_BUFFER_ALL_CPUS) {
-   work = &buffer->irq_work;
-   /* Full only makes sense on per cpu reads */
-   full = 0;
-   } else {
-   if (!cpumask_test_cpu(cpu, buffer->cpumask))
-   return -ENODEV;
-   cpu_buffer = buffer->buffers[cpu];
-   work = &cpu_buffer->irq_work;
-   }
+   if (rb_watermark_hit(buffer, cpu, full))
+   return true;
 
-   if (full)
-   prepare_to_wait(&work->full_waiters, &wait, TASK_INTERRUPTIBLE);
-   else
-   prepare_to_wait(&work->waiters, &wait, TASK_INTERRUPTIBLE);
+   if (cond(data))
+   return true;
 
/*
 * The events can happen in critical sections where
@@ -901,27 +873,75 @@ int ring_buffer_wait(struct trace_buffer *buffer, int 
cpu, int full)
 * a task has been queued. It's OK for spurious wake ups.
 */
if (full)
-   work->full_waiters_pending = true;
+   rbwork->full_waiters_pending = true;
else
-   work->waiters_pending = true;
+   rbwork->waiters_pending = true;
 
-   if (rb_watermark_hit(buffer, cpu, full))
-   goto out;
+   return false;
+}
 
-   if (signal_pending(current)) {
-   ret = -EINTR;
-   goto out;
+/*
+ * The default wait condition for ring_buffer_wait() is to just to exit the
+ * wait loop the first time it is woken up.
+ */
+static bool rb_wait_once(void *data)
+{
+   long *once = data;
+
+   /* wait_event() actually calls this twice before scheduling*/
+   if (*once > 1)
+   return true;
+
+   (*once)++;
+   return false;
+}
+
+/**
+ * ring_buffer_wait - wait for input to the rin

[PATCH v3 2/2] tracing/ring-buffer: Fix wait_on_pipe() race

2024-03-12 Thread Steven Rostedt
From: "Steven Rostedt (Google)" 

When the trace_pipe_raw file is closed, there should be no new readers on
the file descriptor. This is mostly handled with the waking and wait_index
fields of the iterator. But there's still a slight race.

 CPU 0  CPU 1
 -  -
   wait_index++;
   index = wait_index;
   ring_buffer_wake_waiters();
   wait_on_pipe()
 ring_buffer_wait();

The ring_buffer_wait() will miss the wakeup from CPU 1. The problem is
that the ring_buffer_wait() needs the logic of:

prepare_to_wait();
if (!condition)
schedule();

Where the missing condition check is the iter->wait_index update.

Have the ring_buffer_wait() take a conditional callback function and a
data parameter that can be used within the wait_event_interruptible() of
the ring_buffer_wait() function.

In wait_on_pipe(), pass a condition function that will check if the
wait_index has been updated, if it has, it will return true to break out
of the wait_event_interruptible() loop.

Create a new field "closed" in the trace_iterator and set it in the
.flush() callback before calling ring_buffer_wake_waiters().
This will keep any new readers from waiting on a closed file descriptor.

Have the wait_on_pipe() condition callback also check the closed field.

Change the wait_index field of the trace_iterator to atomic_t. There's no
reason it needs to be 'long' and making it atomic and using
atomic_read_acquire() and atomic_fetch_inc_release() will provide the
necessary memory barriers.

Add a "woken" flag to tracing_buffers_splice_read() to exit the loop after
one more try to fetch data. That is, if it waited for data and something
woke it up, it should try to collect any new data and then exit back to
user space.

Link: 
https://lore.kernel.org/linux-trace-kernel/CAHk-=wgsngewhfxzajiaqznwpmqetqmi1waes2o1v6l4c_u...@mail.gmail.com/

Cc: sta...@vger.kernel.org
Fixes: f3ddb74ad0790 ("tracing: Wake up ring buffer waiters on closing of the 
file")
Signed-off-by: Steven Rostedt (Google) 
---
 include/linux/ring_buffer.h  |  3 ++-
 include/linux/trace_events.h |  5 -
 kernel/trace/ring_buffer.c   | 11 -
 kernel/trace/trace.c | 43 ++--
 4 files changed, 43 insertions(+), 19 deletions(-)

diff --git a/include/linux/ring_buffer.h b/include/linux/ring_buffer.h
index 338a33db1577..dc5ae4e96aee 100644
--- a/include/linux/ring_buffer.h
+++ b/include/linux/ring_buffer.h
@@ -99,7 +99,8 @@ __ring_buffer_alloc(unsigned long size, unsigned flags, 
struct lock_class_key *k
 })
 
 typedef bool (*ring_buffer_cond_fn)(void *data);
-int ring_buffer_wait(struct trace_buffer *buffer, int cpu, int full);
+int ring_buffer_wait(struct trace_buffer *buffer, int cpu, int full,
+ring_buffer_cond_fn cond, void *data);
 __poll_t ring_buffer_poll_wait(struct trace_buffer *buffer, int cpu,
  struct file *filp, poll_table *poll_table, int full);
 void ring_buffer_wake_waiters(struct trace_buffer *buffer, int cpu);
diff --git a/include/linux/trace_events.h b/include/linux/trace_events.h
index d68ff9b1247f..fc6d0af56bb1 100644
--- a/include/linux/trace_events.h
+++ b/include/linux/trace_events.h
@@ -103,13 +103,16 @@ struct trace_iterator {
unsigned int    temp_size;
char    *fmt;   /* modified format holder */
unsigned int    fmt_size;
-   long    wait_index;
+   atomic_t    wait_index;
 
/* trace_seq for __print_flags() and __print_symbolic() etc. */
struct trace_seq    tmp_seq;
 
cpumask_var_t   started;
 
+   /* Set when the file is closed to prevent new waiters */
+   bool    closed;
+
/* it's true when current open file is snapshot */
bool    snapshot;
 
diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index c198ba466853..81a5303bdc09 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -906,18 +906,19 @@ static bool rb_wait_once(void *data)
  * as data is added to any of the @buffer's cpu buffers. Otherwise
  * it will wait for data to be added to a specific cpu buffer.
  */
-int ring_buffer_wait(struct trace_buffer *buffer, int cpu, int full)
+int ring_buffer_wait(struct trace_buffer *buffer, int cpu, int full,
+ring_buffer_cond_fn cond, void *data)
 {
struct ring_buffer_per_cpu *cpu_buffer;
struct wait_queue_head *waitq;
-   ring_buffer_cond_fn cond;
struct rb_irq_work *rbwork;
-   void *data;
long once = 0;
int ret = 0;
 
-   cond = rb_wait_once;
-   data = &once;
+   if (!cond) {
+   cond = rb_wait_once;
+   data = &once;
+   }
 
   

[PATCH v3 0/2] tracing/ring-buffer: Fix wakeup of ring buffer waiters

2024-03-12 Thread Steven Rostedt



A patch was sent to "fix" the wait_index variable that is used to help with
waking of waiters on the ring buffer. The patch was rejected, but I started
looking at associated code. Discussing it on IRC with Mathieu Desnoyers
we discovered a design flaw.

The waiter reads "wait_index" then enters a "wait loop". After adding
itself to the wait queue, if the buffer is filled or a signal is pending
it will exit out. If wait_index is different, it will also exit out.
(Note, I noticed that the check for wait_index was after the schedule()
call and should have been before it, but that's beside the point).

The race is what happens if the waker updates the wait_index,
a new waiter comes in and reads the updated wait_index before it adds
itself to the queue, it will miss the update!

These patches fix the design by converting the ring_buffer_wait()
to use the wait_event_interruptible() interface.

Then the wait_on_pipe() caller will pass its own condition into
the ring_buffer_wait() to be checked by the wait_event_interruptible().

- The first patch restructures the ring_buffer_wait() to use the
  wait_event_interruptible() logic. It does not change the interface,
  but adds a local "cond" condition callback function that will be
  what it defaults to in the second patch if NULL is passed in.

  The default cond function is just a "wait once" function. That
  is, the first time it is called (before wait_event_interruptible()
  calls schedule()), it will set the "once" variable to one and return
  false. The next time it is called (after wait_event_interruptible()
  wakes up) it will return true to break out of the loop.

- The second patch changes the ring_buffer_wait() interface to allow
  the caller to define its own "cond" callback. That will be checked
  in wait_event_interruptible() along with checking if the proper
  amount of data in the ring buffer has been hit.

Changes since v2: 
https://lore.kernel.org/all/20240308202402.234176...@goodmis.org/

- Patches 1-3 of v2 have been accepted.

- Instead of breaking ring_buffer_wait() into three functions that do
  a prepare_to_wait(), wait() and finish_wait(), have ring_buffer_wait()
  take a condition callback function and a data parameter.

  The ring_buffer_wait() now uses the wait_event_interruptible() code,
  and adds a helper function to check its own conditions, and also
  to call the passed in condition function and allow the caller to
  specify other conditions to break out of the event loop.

Changes since v1: 
https://lore.kernel.org/lkml/20240308183816.676883...@goodmis.org/

- My tests triggered a warning about calling a mutex_lock() after a
  prepare_to_wait() that changed the task's state. Convert the affected
  mutex over to a spinlock.



Steven Rostedt (Google) (2):
  ring-buffer: Use wait_event_interruptible() in ring_buffer_wait()
  tracing/ring-buffer: Fix wait_on_pipe() race


 include/linux/ring_buffer.h  |   4 +-
 include/linux/trace_events.h |   5 +-
 kernel/trace/ring_buffer.c   | 117 +--
 kernel/trace/trace.c |  43 +++-
 4 files changed, 107 insertions(+), 62 deletions(-)



[PATCH v3 1/2] ring-buffer: Use wait_event_interruptible() in ring_buffer_wait()

2024-03-12 Thread Steven Rostedt
From: "Steven Rostedt (Google)" 

Convert ring_buffer_wait() over to wait_event_interruptible(). The default
condition is to execute the wait loop inside __wait_event() just once.

This does not change the ring_buffer_wait() prototype yet, but
restructures the code so that it can take a "cond" and "data" parameter
and will call wait_event_interruptible() with a helper function as the
condition.

The helper function (rb_wait_cond) takes the cond function and data
parameters. It will first check if the buffer hit the watermark defined by
the "full" parameter and then call the passed in condition parameter. If
either are true, it returns true.

If rb_wait_cond() does not return true, it will set the appropriate
"waiters_pending" flag and returns false.

Link: 
https://lore.kernel.org/linux-trace-kernel/CAHk-=wgsngewhfxzajiaqznwpmqetqmi1waes2o1v6l4c_u...@mail.gmail.com/

Cc: sta...@vger.kernel.org
Fixes: f3ddb74ad0790 ("tracing: Wake up ring buffer waiters on closing of the 
file")
Signed-off-by: Steven Rostedt (Google) 
---
 include/linux/ring_buffer.h |   1 +
 kernel/trace/ring_buffer.c  | 116 +---
 2 files changed, 69 insertions(+), 48 deletions(-)

diff --git a/include/linux/ring_buffer.h b/include/linux/ring_buffer.h
index fa802db216f9..338a33db1577 100644
--- a/include/linux/ring_buffer.h
+++ b/include/linux/ring_buffer.h
@@ -98,6 +98,7 @@ __ring_buffer_alloc(unsigned long size, unsigned flags, 
struct lock_class_key *k
__ring_buffer_alloc((size), (flags), &__key);   \
 })
 
+typedef bool (*ring_buffer_cond_fn)(void *data);
 int ring_buffer_wait(struct trace_buffer *buffer, int cpu, int full);
 __poll_t ring_buffer_poll_wait(struct trace_buffer *buffer, int cpu,
  struct file *filp, poll_table *poll_table, int full);
diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 6ef763f57c66..c198ba466853 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -842,43 +842,15 @@ static bool rb_watermark_hit(struct trace_buffer *buffer, 
int cpu, int full)
return ret;
 }
 
-/**
- * ring_buffer_wait - wait for input to the ring buffer
- * @buffer: buffer to wait on
- * @cpu: the cpu buffer to wait on
- * @full: wait until the percentage of pages are available, if @cpu != 
RING_BUFFER_ALL_CPUS
- *
- * If @cpu == RING_BUFFER_ALL_CPUS then the task will wake up as soon
- * as data is added to any of the @buffer's cpu buffers. Otherwise
- * it will wait for data to be added to a specific cpu buffer.
- */
-int ring_buffer_wait(struct trace_buffer *buffer, int cpu, int full)
+static inline bool
+rb_wait_cond(struct rb_irq_work *rbwork, struct trace_buffer *buffer,
+int cpu, int full, ring_buffer_cond_fn cond, void *data)
 {
-   struct ring_buffer_per_cpu *cpu_buffer;
-   DEFINE_WAIT(wait);
-   struct rb_irq_work *work;
-   int ret = 0;
-
-   /*
-* Depending on what the caller is waiting for, either any
-* data in any cpu buffer, or a specific buffer, put the
-* caller on the appropriate wait queue.
-*/
-   if (cpu == RING_BUFFER_ALL_CPUS) {
-   work = &buffer->irq_work;
-   /* Full only makes sense on per cpu reads */
-   full = 0;
-   } else {
-   if (!cpumask_test_cpu(cpu, buffer->cpumask))
-   return -ENODEV;
-   cpu_buffer = buffer->buffers[cpu];
-   work = &cpu_buffer->irq_work;
-   }
+   if (rb_watermark_hit(buffer, cpu, full))
+   return true;
 
-   if (full)
-   prepare_to_wait(&work->full_waiters, &wait, TASK_INTERRUPTIBLE);
-   else
-   prepare_to_wait(&work->waiters, &wait, TASK_INTERRUPTIBLE);
+   if (cond(data))
+   return true;
 
/*
 * The events can happen in critical sections where
@@ -901,27 +873,75 @@ int ring_buffer_wait(struct trace_buffer *buffer, int 
cpu, int full)
 * a task has been queued. It's OK for spurious wake ups.
 */
if (full)
-   work->full_waiters_pending = true;
+   rbwork->full_waiters_pending = true;
else
-   work->waiters_pending = true;
+   rbwork->waiters_pending = true;
 
-   if (rb_watermark_hit(buffer, cpu, full))
-   goto out;
+   return false;
+}
 
-   if (signal_pending(current)) {
-   ret = -EINTR;
-   goto out;
+/*
+ * The default wait condition for ring_buffer_wait() is to just to exit the
+ * wait loop the first time it is woken up.
+ */
+static bool rb_wait_once(void *data)
+{
+   long *once = data;
+
+   /* wait_event() actually calls this twice before scheduling*/
+   if (*once > 1)
+   return true;
+
+   (*once)++;
+   return false;
+}
+
+/**
+ * ring_buffer_wait - wait for input to the rin

[PATCH 2/2] ring-buffer: Reuse rb_watermark_hit() for the poll logic

2024-03-12 Thread Steven Rostedt
From: "Steven Rostedt (Google)" 

The check for knowing if the poll should wait or not is basically the
exact same logic as rb_watermark_hit(). The only difference is that
rb_watermark_hit() also handles the !full case. But for the full case, the
logic is the same. Just call that instead of duplicating the code in
ring_buffer_poll_wait().

Signed-off-by: Steven Rostedt (Google) 
---
 kernel/trace/ring_buffer.c | 19 +++
 1 file changed, 7 insertions(+), 12 deletions(-)

diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index adfe603a769b..6ef763f57c66 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -963,21 +963,16 @@ __poll_t ring_buffer_poll_wait(struct trace_buffer 
*buffer, int cpu,
 
poll_wait(filp, &rbwork->full_waiters, poll_table);
 
-   raw_spin_lock_irqsave(&cpu_buffer->reader_lock, flags);
-   if (!cpu_buffer->shortest_full ||
-   cpu_buffer->shortest_full > full)
-   cpu_buffer->shortest_full = full;
-   raw_spin_unlock_irqrestore(&cpu_buffer->reader_lock, flags);
-   if (full_hit(buffer, cpu, full))
+   if (rb_watermark_hit(buffer, cpu, full))
return EPOLLIN | EPOLLRDNORM;
/*
 * Only allow full_waiters_pending update to be seen after
-* the shortest_full is set. If the writer sees the
-* full_waiters_pending flag set, it will compare the
-* amount in the ring buffer to shortest_full. If the amount
-* in the ring buffer is greater than the shortest_full
-* percent, it will call the irq_work handler to wake up
-* this list. The irq_handler will reset shortest_full
+* the shortest_full is set (in rb_watermark_hit). If the
+* writer sees the full_waiters_pending flag set, it will
+* compare the amount in the ring buffer to shortest_full.
+* If the amount in the ring buffer is greater than the
+* shortest_full percent, it will call the irq_work handler
+* to wake up this list. The irq_handler will reset 
shortest_full
 * back to zero. That's done under the reader_lock, but
 * the below smp_mb() makes sure that the update to
 * full_waiters_pending doesn't leak up into the above.
-- 
2.43.0





[PATCH 1/2] ring-buffer: Fix full_waiters_pending in poll

2024-03-12 Thread Steven Rostedt
From: "Steven Rostedt (Google)" 

If a reader of the ring buffer is doing a poll, and waiting for the ring
buffer to hit a specific watermark, there could be a case where it gets
into an infinite ping-pong loop.

The poll code has:

  rbwork->full_waiters_pending = true;
  if (!cpu_buffer->shortest_full ||
  cpu_buffer->shortest_full > full)
 cpu_buffer->shortest_full = full;

The writer will see full_waiters_pending and check if the ring buffer is
filled over the percentage of the shortest_full value. If it is, it calls
an irq_work to wake up all the waiters.

But the code could get into a circular loop:

CPU 0   CPU 1
-   -
 [ Poll ]
   [ shortest_full = 0 ]
   rbwork->full_waiters_pending = true;
  if (rbwork->full_waiters_pending &&
  [ buffer percent ] > 
shortest_full) {
 rbwork->wakeup_full = true;
 [ queue_irqwork ]

   cpu_buffer->shortest_full = full;

  [ IRQ work ]
  if (rbwork->wakeup_full) {
cpu_buffer->shortest_full = 0;
wakeup poll waiters;
  [woken]
   if ([ buffer percent ] > full)
  break;
   rbwork->full_waiters_pending = true;
  if (rbwork->full_waiters_pending &&
  [ buffer percent ] > 
shortest_full) {
 rbwork->wakeup_full = true;
 [ queue_irqwork ]

   cpu_buffer->shortest_full = full;

  [ IRQ work ]
  if (rbwork->wakeup_full) {
cpu_buffer->shortest_full = 0;
wakeup poll waiters;
  [woken]

 [ Wash, rinse, repeat! ]

In the poll, the shortest_full needs to be set before the
full_pending_waiters, as once that is set, the writer will compare the
current shortest_full (which is incorrect) to decide to call the irq_work,
which will reset the shortest_full (expecting the readers to update it).

Also move the setting of full_waiters_pending after the check if the ring
buffer has the required percentage filled. There's no reason to tell the
writer to wake up waiters if there are no waiters.

Cc: sta...@vger.kernel.org
Fixes: 42fb0a1e84ff5 ("tracing/ring-buffer: Have polling block on watermark")
Signed-off-by: Steven Rostedt (Google) 
---
 kernel/trace/ring_buffer.c | 27 ---
 1 file changed, 20 insertions(+), 7 deletions(-)

diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index aa332ace108b..adfe603a769b 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -964,16 +964,32 @@ __poll_t ring_buffer_poll_wait(struct trace_buffer *buffer, int cpu,
poll_wait(filp, &rbwork->full_waiters, poll_table);
 
raw_spin_lock_irqsave(&cpu_buffer->reader_lock, flags);
-   rbwork->full_waiters_pending = true;
if (!cpu_buffer->shortest_full ||
cpu_buffer->shortest_full > full)
cpu_buffer->shortest_full = full;
raw_spin_unlock_irqrestore(&cpu_buffer->reader_lock, flags);
-   } else {
-   poll_wait(filp, &rbwork->waiters, poll_table);
-   rbwork->waiters_pending = true;
+   if (full_hit(buffer, cpu, full))
+   return EPOLLIN | EPOLLRDNORM;
+   /*
+* Only allow full_waiters_pending update to be seen after
+* the shortest_full is set. If the writer sees the
+* full_waiters_pending flag set, it will compare the
+* amount in the ring buffer to shortest_full. If the amount
+* in the ring buffer is greater than the shortest_full
+* percent, it will call the irq_work handler to wake up
+* this list. The irq_handler will reset shortest_full
+* back to zero. That's done under the reader_lock, but
+* the below smp_mb() makes sure that the update to
+* full_waiters_pending doesn't leak up into the above.
+*/
+   smp_mb();
+   rbwork->full_waiters_pending = true;
+   return 0;
}
 
+   poll_wait(filp, &rbwork->waiters, poll_table);
+   rbwork->waiters_pending = true;
+
/*
 * There's a tight race between setti

[PATCH 0/2] ring-buffer: Fix poll wakeup logic

2024-03-12 Thread Steven Rostedt


After making a slight change to wakeups in ring_buffer_wait()
the system would hang. After spending several hours on a wild goose
chase, I found that the change only triggered the bug because it
changed the timings. The bug was there before the update but was
never triggered.

The poll code has:

  rbwork->full_waiters_pending = true;
  if (!cpu_buffer->shortest_full ||
  cpu_buffer->shortest_full > full)
 cpu_buffer->shortest_full = full;

The writer will see full_waiters_pending and check if the ring buffer is
filled over the percentage of the shortest_full value. If it is, it calls
an irq_work to wake up all the waiters.

But the code could get into a circular loop:

CPU 0   CPU 1
-   -
 [ Poll ]
   [ shortest_full = 0 ]
   rbwork->full_waiters_pending = true;
  if (rbwork->full_waiters_pending &&
  [ buffer percent ] > shortest_full) {
 rbwork->wakeup_full = true;
 [ queue_irqwork ]

   cpu_buffer->shortest_full = full;

  [ IRQ work ]
  if (rbwork->wakeup_full) {
cpu_buffer->shortest_full = 0;
wakeup poll waiters;
  [woken]
   if ([ buffer percent ] > full)
  break;
   rbwork->full_waiters_pending = true;
  if (rbwork->full_waiters_pending &&
  [ buffer percent ] > shortest_full) {
 rbwork->wakeup_full = true;
 [ queue_irqwork ]

   cpu_buffer->shortest_full = full;

  [ IRQ work ]
  if (rbwork->wakeup_full) {
cpu_buffer->shortest_full = 0;
wakeup poll waiters;
  [woken]

 [ Wash, rinse, repeat! ]

The race was triggered when running:

  trace-cmd record -p function -m 5000

This enables function tracing and then creates two files to write into,
each 2500K in size (the -m option is a "max file size"). When trace-cmd
has written 2500K to one file, it switches to the other, erasing the old
data. To do this, trace-cmd alternates between poll and the blocking
reader, exercising both methods of wake up. The change to the reader
wakeup altered the way the poll waiter was woken, and that triggered this bug.

The second patch is a cleanup that also consolidates the shortest_full
logic. The read wakeup uses rb_watermark_hit() for both full wakeups and
!full wakeups. Since poll uses the same logic for full wakeups, it can
simply call that function with full set.
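
Conceptually, the poll path then reduces to roughly the following (a
condensed sketch matching the hunk quoted at the top of this thread;
rb_watermark_hit() records @full in shortest_full under the reader_lock):

	if (full) {
		poll_wait(filp, &rbwork->full_waiters, poll_table);

		if (rb_watermark_hit(buffer, cpu, full))
			return EPOLLIN | EPOLLRDNORM;

		smp_mb();	/* shortest_full is visible before full_waiters_pending */
		rbwork->full_waiters_pending = true;
		return 0;
	}

	poll_wait(filp, &rbwork->waiters, poll_table);
	rbwork->waiters_pending = true;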


Steven Rostedt (Google) (2):
  ring-buffer: Fix full_waiters_pending in poll
  ring-buffer: Reuse rb_watermark_hit() for the poll logic


 kernel/trace/ring_buffer.c | 30 +++---
 1 file changed, 19 insertions(+), 11 deletions(-)



Re: [PATCH 0/6] tracing/ring-buffer: Fix wakeup of ring buffer waiters

2024-03-10 Thread Steven Rostedt
On Fri, 8 Mar 2024 13:41:59 -0800
Linus Torvalds  wrote:

> On Fri, 8 Mar 2024 at 13:39, Linus Torvalds
>  wrote:
> >
> > So the above "complexity" is *literally* just changing the
> >
> >   (new = atomic_read_acquire(>seq)) != old
> >
> > condition to
> >
> >   should_exit ||
> >   (new = atomic_read_acquire(>seq)) != old  
> 
> .. and obviously you'll need to add the exit condition to the actual
> "deal with events" loop too.

I haven't had a chance to rework this part of the patches, but I have
some other fixes to push to you from earlier this week, and I think the
first three patches of this series are also fine: the loop in
ring_buffer_wait() isn't needed, and patches 2 and 3 fix trivial bugs.

I'll send you a pull request for that work and I'll work on this code
later.

-- Steve




Re: [PATCH 0/8] tracing: Persistent traces across a reboot or crash

2024-03-09 Thread Steven Rostedt
On Sat, 9 Mar 2024 10:27:47 -0800
Kees Cook  wrote:

> On Tue, Mar 05, 2024 at 08:59:10PM -0500, Steven Rostedt wrote:
> > This is a way to map a ring buffer instance across reboots.  
> 
> As mentioned on Fedi, check out the persistent storage subsystem
> (pstore)[1]. It already does what you're starting to construct for RAM
> backends (but also supports reed-solomon ECC), and supports several
> other backends including EFI storage (which is default enabled on at
> least Fedora[2]), block devices, etc. It has an existing mechanism for
> handling reservations (including via device tree), and supports multiple
> "frontends" including the Oops handler, console output, and even ftrace
> which does per-cpu recording and event reconstruction (Joel wrote this
> frontend).

Mathieu was telling me about the pmem infrastructure.

This patch set doesn't care where the memory comes from. You just give
it an address and size, and it will do the rest.

> 
> It should be pretty straight forward to implement a new frontend if the
> ftrace one isn't flexible enough. It's a bit clunky still to add one,
> but search for "ftrace" in fs/pstore/ram.c to see how to plumb a new
> frontend into the RAM backend.
> 
> I continue to want to lift the frontend configuration options up into
> the pstore core, since it would avoid a bunch of redundancy, but this is
> where we are currently. :)

Thanks for the info. We use pstore on ChromeOS, but it is currently
restricted to 1MB, which is too small for the tracing buffers. From what
I understand, it's also in a specific location where there's only 1MB
available for contiguous memory.

I'm looking at finding a way to get consistent memory outside that
range. That's what I'll be doing next week ;-)

But this code was just to see if I could get a single contiguous range
of memory mapped to ftrace, and this patch set does exactly that.

> 
> -Kees
> 
> [1] CONFIG_PSTORE et. al. in fs/pstore/ 
> https://docs.kernel.org/admin-guide/ramoops.html
> [2] 
> https://www.freedesktop.org/software/systemd/man/latest/systemd-pstore.service.html
> 

Thanks!

-- Steve



Re: [PATCH 0/6] tracing/ring-buffer: Fix wakeup of ring buffer waiters

2024-03-08 Thread Steven Rostedt
On Fri, 8 Mar 2024 12:39:10 -0800
Linus Torvalds  wrote:

> On Fri, 8 Mar 2024 at 10:38, Steven Rostedt  wrote:
> >
> > A patch was sent to "fix" the wait_index variable that is used to help with
> > waking of waiters on the ring buffer. The patch was rejected, but I started
> > looking at associated code. Discussing it on IRC with Mathieu Desnoyers
> > we discovered a design flaw.  
> 
> Honestly, all of this seems excessively complicated.
> 
> And your new locking shouldn't be necessary if you just do things much
> more simply.

You mean to replace the wait_woken_*() code (that has the new locking)?

> 
> Here's what I *think* you should do:
> 
>   struct xyz {
> ...
> atomic_t seq;
> struct wait_queue_head seq_wait;
> ...
>   };
> 
> with the consumer doing something very simple like this:
> 
> int seq = atomic_read_acquire(>seq);
> for (;;) {
> .. consume outstanding events ..
> seq = wait_for_seq_change(seq, my);
> }
> 
> and the producer being similarly trivial, just having a
> "add_seq_event()" at the end:
> 
> ... add whatever event ..
> add_seq_event(my);
> 
> And the helper functions for this are really darn simple:
> 
>   static inline int wait_for_seq_change(int old, struct xyz *my)
>   {
> int new;
> wait_event(my->seq_wait,
> (new = atomic_read_acquire(>seq)) != old);

But the index isn't the only condition it needs to wake up on. If the file is
closing, it wants to know that too. Or it may just be getting kicked out to
consume whatever is there and ignore the watermark.

> return new;
>   }
> 
>   static inline void add_seq_event(struct xyz *my)
>   {
> atomic_fetch_inc_release(>seq);
> wake_up(>seq_wait);
>   }

But it's not only the producer that does the wakeup. That part wasn't
broken.

The broken part is a third party that comes along and wants to wake up the
consumer and tell them to just consume what's there and exit.

There are two layers:

1) The ring buffer, which has the above simple producer / consumer.
   Here the wake ups happen once the buffer is filled to the amount the
   consumer wants before it starts consuming.

2) The tracing layer. Here, on close of a file, the consumers need to be
   woken up and must not wait again; they just take whatever is there and
   finish reading.

   There's also another case where the ioctl() just kicks the current
   readers out, but doesn't care about new readers.

I'm not sure how the seq can handle both there being enough data to wake up
the consumer and the case where another task just wants the consumer to wake
up and ignore the watermark.

The wake_woken_*() code was only for the second part (to wake up consumers
and tell them to no longer wait for the producer), and had nothing to do
with the producer/consumer part.
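
(Purely as an illustration of your earlier "should_exit ||" suggestion, and
not something in this thread's patches: one hypothetical way to fold that
second case in would be an extra flag checked in the wait condition. The
xyz/my names come from your example above; the "kicked" field is made up.

	static inline int wait_for_seq_change(int old, struct xyz *my)
	{
		int new = old;

		wait_event(my->seq_wait,
			   READ_ONCE(my->kicked) ||
			   (new = atomic_read_acquire(&my->seq)) != old);
		return new;	/* unchanged if the waiter was only kicked */
	}

	/* A third party kicks the consumer without producing an event */
	static inline void kick_consumer(struct xyz *my)
	{
		WRITE_ONCE(my->kicked, true);
		wake_up(&my->seq_wait);
	}

Whether that covers the ioctl() case cleanly is exactly the open question
above.)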

-- Steve



[PATCH v2 6/6] tracing/ring-buffer: Fix wait_on_pipe() race

2024-03-08 Thread Steven Rostedt
From: "Steven Rostedt (Google)" 

When the trace_pipe_raw file is closed, there should be no new readers on
the file descriptor. This is mostly handled with the waking and wait_index
fields of the iterator. But there's still a slight race.

 CPU 0  CPU 1
 -  -
 wait_woken_prepare()
if (waking)
   woken = true;
index = wait_index;
   wait_woken_set()
  waking = true
  wait_index++;
   ring_buffer_wake_waiters();
 wait_on_pipe()
ring_buffer_wait();

The ring_buffer_wait() will miss the wakeup from CPU 1. The problem is
that the ring_buffer_wait() needs the logic of:

prepare_to_wait();
if (!condition)
schedule();

Where the missing condition check is the iter->waking.

Either that condition check needs to be passed to ring_buffer_wait() or
the function needs to be broken up into three parts. This chooses to do
the break up.

Break ring_buffer_wait() into:

ring_buffer_prepare_to_wait();
ring_buffer_wait();
ring_buffer_finish_wait();

Now wait_on_pipe() can have:

ring_buffer_prepare_to_wait();
if (!iter->waking)
ring_buffer_wait();
ring_buffer_finish_wait();

And this will catch the above race, as the waiter will either see waking,
or already have been woken up.

Link: 
https://lore.kernel.org/all/CAHk-=whs5MdtNjzFkTyaUy=vHi=qwwgpi0jgte6oyuymnsr...@mail.gmail.com/

Cc: sta...@vger.kernel.org
Fixes: f3ddb74ad0790 ("tracing: Wake up ring buffer waiters on closing of the 
file")
Signed-off-by: Steven Rostedt (Google) 
---
 include/linux/ring_buffer.h |  4 ++
 kernel/trace/ring_buffer.c  | 88 ++---
 kernel/trace/trace.c| 14 +-
 3 files changed, 78 insertions(+), 28 deletions(-)

diff --git a/include/linux/ring_buffer.h b/include/linux/ring_buffer.h
index fa802db216f9..e5b5903cdc21 100644
--- a/include/linux/ring_buffer.h
+++ b/include/linux/ring_buffer.h
@@ -98,7 +98,11 @@ __ring_buffer_alloc(unsigned long size, unsigned flags, struct lock_class_key *k
__ring_buffer_alloc((size), (flags), &__key);   \
 })
 
+int ring_buffer_prepare_to_wait(struct trace_buffer *buffer, int cpu, int *full,
+   struct wait_queue_entry *wait);
 int ring_buffer_wait(struct trace_buffer *buffer, int cpu, int full);
+void ring_buffer_finish_wait(struct trace_buffer *buffer, int cpu, int full,
+struct wait_queue_entry *wait);
 __poll_t ring_buffer_poll_wait(struct trace_buffer *buffer, int cpu,
  struct file *filp, poll_table *poll_table, int full);
 void ring_buffer_wake_waiters(struct trace_buffer *buffer, int cpu);
diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 856d0e5b0da5..fa7090f6b4fc 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -868,29 +868,29 @@ rb_get_work_queue(struct trace_buffer *buffer, int cpu, int *full)
 }
 
 /**
- * ring_buffer_wait - wait for input to the ring buffer
+ * ring_buffer_prepare_to_wait - Prepare to wait for data on the ring buffer
  * @buffer: buffer to wait on
  * @cpu: the cpu buffer to wait on
- * @full: wait until the percentage of pages are available, if @cpu != RING_BUFFER_ALL_CPUS
+ * @full: wait until the percentage of pages are available,
+ * if @cpu != RING_BUFFER_ALL_CPUS. It may be updated via this function.
+ * @wait: The wait queue entry.
  *
- * If @cpu == RING_BUFFER_ALL_CPUS then the task will wake up as soon
- * as data is added to any of the @buffer's cpu buffers. Otherwise
- * it will wait for data to be added to a specific cpu buffer.
+ * This must be called before ring_buffer_wait(). It calls the prepare_to_wait()
+ * on @wait for the necessary wait queue defined by @buffer, @cpu, and @full.
  */
-int ring_buffer_wait(struct trace_buffer *buffer, int cpu, int full)
+int ring_buffer_prepare_to_wait(struct trace_buffer *buffer, int cpu, int *full,
+struct wait_queue_entry *wait)
 {
struct rb_irq_work *rbwork;
-   DEFINE_WAIT(wait);
-   int ret = 0;
 
-   rbwork = rb_get_work_queue(buffer, cpu, &full);
+   rbwork = rb_get_work_queue(buffer, cpu, full);
if (IS_ERR(rbwork))
return PTR_ERR(rbwork);
 
-   if (full)
-   prepare_to_wait(&rbwork->full_waiters, &wait, TASK_INTERRUPTIBLE);
+   if (*full)
+   prepare_to_wait(&rbwork->full_waiters, wait, TASK_INTERRUPTIBLE);
else
-   prepare_to_wait(&rbwork->waiters, &wait, TASK_INTERRUPTIBLE);
+   prepare_to_wait(&rbwork->waiters, wait, TASK_INTERRUPTIBLE);
 
/*
 * The events can happen in critical sections where
@@ -912,30 +912,66 @@ int ring_buffer_wait(struc

[PATCH v2 5/6] ring-buffer: Restructure ring_buffer_wait() to prepare for updates

2024-03-08 Thread Steven Rostedt
From: "Steven Rostedt (Google)" 

The ring_buffer_wait() needs to be broken into three functions for proper
synchronization from the context of the callers:

  ring_buffer_prepare_to_wait()
  ring_buffer_wait()
  ring_buffer_finish_wait()

To simplify the process, pull out the logic for getting the right work
queue to wait on, as it will be needed for the above functions.

There are three work queues depending on the cpu value.

If cpu == RING_BUFFER_ALL_CPUS, then the main "buffer->irq_work" is used.

Otherwise, the irq_work of the cpu_buffer representing that CPU's buffer is used.

Create a rb_get_work_queue() helper function to retrieve the proper queue.

Also rename "work" to "rbwork", as the variable points to a struct rb_irq_work,
and to be more consistent with the variable naming elsewhere in the file.

Link: 
https://lore.kernel.org/all/CAHk-=whs5MdtNjzFkTyaUy=vHi=qwwgpi0jgte6oyuymnsr...@mail.gmail.com/

Cc: sta...@vger.kernel.org
Fixes: f3ddb74ad0790 ("tracing: Wake up ring buffer waiters on closing of the 
file")
Signed-off-by: Steven Rostedt (Google) 
---
 kernel/trace/ring_buffer.c | 58 +++---
 1 file changed, 35 insertions(+), 23 deletions(-)

diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index aa332ace108b..856d0e5b0da5 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -842,6 +842,31 @@ static bool rb_watermark_hit(struct trace_buffer *buffer, int cpu, int full)
return ret;
 }
 
+static struct rb_irq_work *
+rb_get_work_queue(struct trace_buffer *buffer, int cpu, int *full)
+{
+   struct ring_buffer_per_cpu *cpu_buffer;
+   struct rb_irq_work *rbwork;
+
+   /*
+* Depending on what the caller is waiting for, either any
+* data in any cpu buffer, or a specific buffer, put the
+* caller on the appropriate wait queue.
+*/
+   if (cpu == RING_BUFFER_ALL_CPUS) {
+   rbwork = &buffer->irq_work;
+   /* Full only makes sense on per cpu reads */
+   *full = 0;
+   } else {
+   if (!cpumask_test_cpu(cpu, buffer->cpumask))
+   return ERR_PTR(-ENODEV);
+   cpu_buffer = buffer->buffers[cpu];
+   rbwork = &cpu_buffer->irq_work;
+   }
+
+   return rbwork;
+}
+
 /**
  * ring_buffer_wait - wait for input to the ring buffer
  * @buffer: buffer to wait on
@@ -854,31 +879,18 @@ static bool rb_watermark_hit(struct trace_buffer *buffer, int cpu, int full)
  */
 int ring_buffer_wait(struct trace_buffer *buffer, int cpu, int full)
 {
-   struct ring_buffer_per_cpu *cpu_buffer;
+   struct rb_irq_work *rbwork;
DEFINE_WAIT(wait);
-   struct rb_irq_work *work;
int ret = 0;
 
-   /*
-* Depending on what the caller is waiting for, either any
-* data in any cpu buffer, or a specific buffer, put the
-* caller on the appropriate wait queue.
-*/
-   if (cpu == RING_BUFFER_ALL_CPUS) {
-   work = &buffer->irq_work;
-   /* Full only makes sense on per cpu reads */
-   full = 0;
-   } else {
-   if (!cpumask_test_cpu(cpu, buffer->cpumask))
-   return -ENODEV;
-   cpu_buffer = buffer->buffers[cpu];
-   work = &cpu_buffer->irq_work;
-   }
+   rbwork = rb_get_work_queue(buffer, cpu, &full);
+   if (IS_ERR(rbwork))
+   return PTR_ERR(rbwork);
 
if (full)
-   prepare_to_wait(&work->full_waiters, &wait, TASK_INTERRUPTIBLE);
+   prepare_to_wait(&rbwork->full_waiters, &wait, TASK_INTERRUPTIBLE);
else
-   prepare_to_wait(&work->waiters, &wait, TASK_INTERRUPTIBLE);
+   prepare_to_wait(&rbwork->waiters, &wait, TASK_INTERRUPTIBLE);
 
/*
 * The events can happen in critical sections where
@@ -901,9 +913,9 @@ int ring_buffer_wait(struct trace_buffer *buffer, int cpu, int full)
 * a task has been queued. It's OK for spurious wake ups.
 */
if (full)
-   work->full_waiters_pending = true;
+   rbwork->full_waiters_pending = true;
else
-   work->waiters_pending = true;
+   rbwork->waiters_pending = true;
 
if (rb_watermark_hit(buffer, cpu, full))
goto out;
@@ -916,9 +928,9 @@ int ring_buffer_wait(struct trace_buffer *buffer, int cpu, int full)
schedule();
  out:
if (full)
-   finish_wait(&work->full_waiters, &wait);
+   finish_wait(&rbwork->full_waiters, &wait);
else
-   finish_wait(&work->waiters, &wait);
+   finish_wait(&rbwork->waiters, &wait);
 
if (!ret && !rb_watermark_hit(buffer, cpu, full) && signal_pending(current))
ret = -EINTR;
-- 
2.43.0




