Re: [RFC PATCH 1/4] hazptr: Add initial implementation of hazard pointers
On 2024-09-17 16:33, Boqun Feng wrote:
[...]
> The synchronization between readers and updaters is built around "hazard
> pointer slots": a slot readers can use to store a pointer value.
>
> Reader side protection:
>
> 1. Read the value of a pointer to the target data element.
> 2. Store it to a hazard pointer slot.
> 3. Enforce full ordering (e.g. smp_mb()).
> 4. Re-read the original pointer, reset the slot and retry if the value
>    changed.
> 5. Otherwise, the continued existence of the target data element is
>    guaranteed.
>
> Updater side check:
>
> 1. Unpublish the target data element (e.g. setting the pointer value to
>    NULL).
> 2. Enforce full ordering.
> 3. Read the value from a hazard pointer slot.
> 4. If the value doesn't match the target data element, then this slot
>    doesn't represent a reference to it.
> 5. Otherwise, updater needs to re-check (step 3).

Cool! I look forward to seeing where this is meant to be used. I would
expect it to be a useful tool to speed up reference counting of things
like the mm_struct and for TLB flush IPIs.

On a related note, with a userspace port in mind, the membarrier(2)
syscall can be useful to turn the smp_mb() in (3) from the reader side
into a simple compiler barrier, assuming (2) from the updater is using
membarrier. If IPIs are acceptable (or already required) for some kernel
use-cases, this means a similar asymmetric fence scheme could be used to
speed up readers.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
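[Editor's note: the reader/updater protocol quoted above can be sketched in
userspace C11 atomics. This is an illustrative single-slot analogue, not the
kernel implementation — the names hazptr_acquire/hazptr_release/
hazptr_try_retire are hypothetical, and atomic_thread_fence(seq_cst) stands
in for smp_mb().]

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* One hazard pointer slot shared between one reader and one updater. */
static _Atomic(void *) hazard_slot;
/* The published pointer to the protected object. */
static _Atomic(void *) published;

/* Reader side: returns a protected pointer, or NULL if unpublished. */
static void *hazptr_acquire(void)
{
	void *p;

	for (;;) {
		p = atomic_load(&published);               /* step 1 */
		if (!p)
			return NULL;
		atomic_store(&hazard_slot, p);             /* step 2 */
		atomic_thread_fence(memory_order_seq_cst); /* step 3: full barrier */
		if (atomic_load(&published) == p)          /* step 4 */
			return p;                          /* step 5: protected */
		atomic_store(&hazard_slot, NULL);          /* reset slot, retry */
	}
}

static void hazptr_release(void)
{
	atomic_store(&hazard_slot, NULL);
}

/* Updater side: unpublish, then report whether the object may be freed. */
static int hazptr_try_retire(void *p)
{
	atomic_store(&published, NULL);                    /* step 1 */
	atomic_thread_fence(memory_order_seq_cst);         /* step 2 */
	return atomic_load(&hazard_slot) != p;             /* steps 3-5 */
}
```

In the real scheme the updater busy-waits (or defers) while the slot still
holds the pointer; returning 0 here models the "re-check" case.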
[PATCH v6 5/5] tracing: Convert sys_enter/exit to faultable tracepoints
Convert the definition of the system call enter/exit tracepoints to
faultable tracepoints now that all upstream tracers handle it.

This allows tracers to fault-in userspace system call arguments such as
path strings within their probe callbacks.

Link: https://lore.kernel.org/lkml/20231002202531.3160-1-mathieu.desnoy...@efficios.com/
Co-developed-by: Michael Jeanson
Signed-off-by: Mathieu Desnoyers
Signed-off-by: Michael Jeanson
Reviewed-by: Masami Hiramatsu (Google)
Cc: Steven Rostedt
Cc: Masami Hiramatsu
Cc: Peter Zijlstra
Cc: Alexei Starovoitov
Cc: Yonghong Song
Cc: Paul E. McKenney
Cc: Ingo Molnar
Cc: Arnaldo Carvalho de Melo
Cc: Mark Rutland
Cc: Alexander Shishkin
Cc: Namhyung Kim
Cc: b...@vger.kernel.org
Cc: Joel Fernandes
---
Since v4:
- Use 'guard(preempt_notrace)'.
- Add brackets to multiline 'if' statements.
---
 include/trace/events/syscalls.h |  4 +--
 kernel/trace/trace_syscalls.c   | 52 -
 2 files changed, 46 insertions(+), 10 deletions(-)

diff --git a/include/trace/events/syscalls.h b/include/trace/events/syscalls.h
index b6e0cbc2c71f..dc30e3004818 100644
--- a/include/trace/events/syscalls.h
+++ b/include/trace/events/syscalls.h
@@ -15,7 +15,7 @@
 #ifdef CONFIG_HAVE_SYSCALL_TRACEPOINTS
 
-TRACE_EVENT_FN(sys_enter,
+TRACE_EVENT_FN_MAY_FAULT(sys_enter,
 
 	TP_PROTO(struct pt_regs *regs, long id),
 
@@ -41,7 +41,7 @@ TRACE_EVENT_FN(sys_enter,
 TRACE_EVENT_FLAGS(sys_enter, TRACE_EVENT_FL_CAP_ANY)
 
-TRACE_EVENT_FN(sys_exit,
+TRACE_EVENT_FN_MAY_FAULT(sys_exit,
 
 	TP_PROTO(struct pt_regs *regs, long ret),
 
diff --git a/kernel/trace/trace_syscalls.c b/kernel/trace/trace_syscalls.c
index 9c581d6da843..314666d663b6 100644
--- a/kernel/trace/trace_syscalls.c
+++ b/kernel/trace/trace_syscalls.c
@@ -299,6 +299,12 @@ static void ftrace_syscall_enter(void *data, struct pt_regs *regs, long id)
 	int syscall_nr;
 	int size;
 
+	/*
+	 * Probe called with preemption enabled (may_fault), but ring buffer and
+	 * per-cpu data require preemption to be disabled.
+	 */
+	guard(preempt_notrace)();
+
 	syscall_nr = trace_get_syscall_nr(current, regs);
 	if (syscall_nr < 0 || syscall_nr >= NR_syscalls)
 		return;
@@ -338,6 +344,12 @@ static void ftrace_syscall_exit(void *data, struct pt_regs *regs, long ret)
 	struct trace_event_buffer fbuffer;
 	int syscall_nr;
 
+	/*
+	 * Probe called with preemption enabled (may_fault), but ring buffer and
+	 * per-cpu data require preemption to be disabled.
+	 */
+	guard(preempt_notrace)();
+
 	syscall_nr = trace_get_syscall_nr(current, regs);
 	if (syscall_nr < 0 || syscall_nr >= NR_syscalls)
 		return;
@@ -376,8 +388,11 @@ static int reg_event_syscall_enter(struct trace_event_file *file,
 	if (WARN_ON_ONCE(num < 0 || num >= NR_syscalls))
 		return -ENOSYS;
 	mutex_lock(&syscall_trace_lock);
-	if (!tr->sys_refcount_enter)
-		ret = register_trace_sys_enter(ftrace_syscall_enter, tr);
+	if (!tr->sys_refcount_enter) {
+		ret = register_trace_prio_flags_sys_enter(ftrace_syscall_enter, tr,
+							  TRACEPOINT_DEFAULT_PRIO,
+							  TRACEPOINT_MAY_FAULT);
+	}
 	if (!ret) {
 		rcu_assign_pointer(tr->enter_syscall_files[num], file);
 		tr->sys_refcount_enter++;
@@ -414,8 +429,11 @@ static int reg_event_syscall_exit(struct trace_event_file *file,
 	if (WARN_ON_ONCE(num < 0 || num >= NR_syscalls))
 		return -ENOSYS;
 	mutex_lock(&syscall_trace_lock);
-	if (!tr->sys_refcount_exit)
-		ret = register_trace_sys_exit(ftrace_syscall_exit, tr);
+	if (!tr->sys_refcount_exit) {
+		ret = register_trace_prio_flags_sys_exit(ftrace_syscall_exit, tr,
+							 TRACEPOINT_DEFAULT_PRIO,
+							 TRACEPOINT_MAY_FAULT);
+	}
 	if (!ret) {
 		rcu_assign_pointer(tr->exit_syscall_files[num], file);
 		tr->sys_refcount_exit++;
@@ -582,6 +600,12 @@ static void perf_syscall_enter(void *ignore, struct pt_regs *regs, long id)
 	int rctx;
 	int size;
 
+	/*
+	 * Probe called with preemption enabled (may_fault), but ring buffer and
+	 * per-cpu data require preemption to be disabled.
+	 */
+	guard(preempt_notrace)();
+
 	syscall_nr = trace_get_syscall_nr(current, regs);
 	if (syscall_nr < 0 || syscall_nr >= NR_syscalls)
 		return;
@@ -630,8 +654,11 @@ static int perf_sysenter_enable(struct trace_event_call *call)
 	num = ((struct syscall_metadat
[PATCH v6 4/5] tracing/perf: Add support for faultable tracepoints
In preparation for converting system call enter/exit instrumentation
into faultable tracepoints, make sure that perf can handle registering
to such tracepoints by explicitly disabling preemption within the perf
tracepoint probes to respect the current expectations within perf ring
buffer code.

This change does not yet allow perf to take page faults per se within
its probe, but allows its existing probes to connect to faultable
tracepoints.

Link: https://lore.kernel.org/lkml/20231002202531.3160-1-mathieu.desnoy...@efficios.com/
Co-developed-by: Michael Jeanson
Signed-off-by: Mathieu Desnoyers
Signed-off-by: Michael Jeanson
Reviewed-by: Masami Hiramatsu (Google)
Cc: Steven Rostedt
Cc: Masami Hiramatsu
Cc: Peter Zijlstra
Cc: Alexei Starovoitov
Cc: Yonghong Song
Cc: Paul E. McKenney
Cc: Ingo Molnar
Cc: Arnaldo Carvalho de Melo
Cc: Mark Rutland
Cc: Alexander Shishkin
Cc: Namhyung Kim
Cc: b...@vger.kernel.org
Cc: Joel Fernandes
---
Changes since v4:
- Use DEFINE_INACTIVE_GUARD.
---
 include/trace/perf.h | 22 --
 1 file changed, 20 insertions(+), 2 deletions(-)

diff --git a/include/trace/perf.h b/include/trace/perf.h
index 2c11181c82e0..161e1655b953 100644
--- a/include/trace/perf.h
+++ b/include/trace/perf.h
@@ -12,8 +12,8 @@
 #undef __perf_task
 #define __perf_task(t)	(__task = (t))
 
-#undef DECLARE_EVENT_CLASS
-#define DECLARE_EVENT_CLASS(call, proto, args, tstruct, assign, print)	\
+#undef _DECLARE_EVENT_CLASS
+#define _DECLARE_EVENT_CLASS(call, proto, args, tstruct, assign, print, tp_flags) \
 static notrace void							\
 perf_trace_##call(void *__data, proto)					\
 {									\
@@ -28,6 +28,13 @@ perf_trace_##call(void *__data, proto)			\
 	int __data_size;						\
 	int rctx;							\
 									\
+	DEFINE_INACTIVE_GUARD(preempt_notrace, trace_event_guard);	\
+									\
+	if ((tp_flags) & TRACEPOINT_MAY_FAULT) {			\
+		might_fault();						\
+		activate_guard(preempt_notrace, trace_event_guard)();	\
+	}								\
+									\
 	__data_size = trace_event_get_offsets_##call(&__data_offsets, args); \
 									\
 	head = this_cpu_ptr(event_call->perf_events);			\
@@ -55,6 +62,17 @@ perf_trace_##call(void *__data, proto)			\
 			  head, __task);				\
 }
 
+#undef DECLARE_EVENT_CLASS
+#define DECLARE_EVENT_CLASS(call, proto, args, tstruct, assign, print)	\
+	_DECLARE_EVENT_CLASS(call, PARAMS(proto), PARAMS(args),		\
+			     PARAMS(tstruct), PARAMS(assign), PARAMS(print), 0)
+
+#undef DECLARE_EVENT_CLASS_MAY_FAULT
+#define DECLARE_EVENT_CLASS_MAY_FAULT(call, proto, args, tstruct, assign, print) \
+	_DECLARE_EVENT_CLASS(call, PARAMS(proto), PARAMS(args),		\
+			     PARAMS(tstruct), PARAMS(assign), PARAMS(print), \
+			     TRACEPOINT_MAY_FAULT)
+
 /*
  * This part is compiled out, it is only here as a build time check
  * to make sure that if the tracepoint handling changes, the
-- 
2.39.2
[PATCH v6 3/5] tracing/bpf-trace: Add support for faultable tracepoints
In preparation for converting system call enter/exit instrumentation
into faultable tracepoints, make sure that bpf can handle registering
to such tracepoints by explicitly disabling preemption within the bpf
tracepoint probes to respect the current expectations within bpf
tracing code.

This change does not yet allow bpf to take page faults per se within
its probe, but allows its existing probes to connect to faultable
tracepoints.

Link: https://lore.kernel.org/lkml/20231002202531.3160-1-mathieu.desnoy...@efficios.com/
Co-developed-by: Michael Jeanson
Signed-off-by: Mathieu Desnoyers
Signed-off-by: Michael Jeanson
Reviewed-by: Masami Hiramatsu (Google)
Cc: Steven Rostedt
Cc: Masami Hiramatsu
Cc: Peter Zijlstra
Cc: Alexei Starovoitov
Cc: Yonghong Song
Cc: Paul E. McKenney
Cc: Ingo Molnar
Cc: Arnaldo Carvalho de Melo
Cc: Mark Rutland
Cc: Alexander Shishkin
Cc: Namhyung Kim
Cc: b...@vger.kernel.org
Cc: Joel Fernandes
---
Changes since v4:
- Use DEFINE_INACTIVE_GUARD.
- Add brackets to multiline 'if' statements.
Changes since v5:
- Rebased on v6.11-rc5.
- Pass the TRACEPOINT_MAY_FAULT flag directly to
  tracepoint_probe_register_prio_flags.
---
 include/trace/bpf_probe.h | 21 -
 kernel/trace/bpf_trace.c  |  2 +-
 2 files changed, 17 insertions(+), 6 deletions(-)

diff --git a/include/trace/bpf_probe.h b/include/trace/bpf_probe.h
index a2ea11cc912e..cc96dd1e7c3d 100644
--- a/include/trace/bpf_probe.h
+++ b/include/trace/bpf_probe.h
@@ -42,16 +42,27 @@
 /* tracepoints with more than 12 arguments will hit build error */
 #define CAST_TO_U64(...) CONCATENATE(__CAST, COUNT_ARGS(__VA_ARGS__))(__VA_ARGS__)
 
-#define __BPF_DECLARE_TRACE(call, proto, args)				\
+#define __BPF_DECLARE_TRACE(call, proto, args, tp_flags)		\
 static notrace void							\
 __bpf_trace_##call(void *__data, proto)					\
 {									\
-	CONCATENATE(bpf_trace_run, COUNT_ARGS(args))(__data, CAST_TO_U64(args));\
+	DEFINE_INACTIVE_GUARD(preempt_notrace, bpf_trace_guard);	\
+									\
+	if ((tp_flags) & TRACEPOINT_MAY_FAULT) {			\
+		might_fault();						\
+		activate_guard(preempt_notrace, bpf_trace_guard)();	\
+	}								\
+									\
+	CONCATENATE(bpf_trace_run, COUNT_ARGS(args))(__data, CAST_TO_U64(args)); \
 }
 
 #undef DECLARE_EVENT_CLASS
 #define DECLARE_EVENT_CLASS(call, proto, args, tstruct, assign, print)	\
-	__BPF_DECLARE_TRACE(call, PARAMS(proto), PARAMS(args))
+	__BPF_DECLARE_TRACE(call, PARAMS(proto), PARAMS(args), 0)
+
+#undef DECLARE_EVENT_CLASS_MAY_FAULT
+#define DECLARE_EVENT_CLASS_MAY_FAULT(call, proto, args, tstruct, assign, print) \
+	__BPF_DECLARE_TRACE(call, PARAMS(proto), PARAMS(args), TRACEPOINT_MAY_FAULT)
 
 /*
  * This part is compiled out, it is only here as a build time check
@@ -105,13 +116,13 @@ static inline void bpf_test_buffer_##call(void)	\
 
 #undef DECLARE_TRACE
 #define DECLARE_TRACE(call, proto, args)				\
-	__BPF_DECLARE_TRACE(call, PARAMS(proto), PARAMS(args))		\
+	__BPF_DECLARE_TRACE(call, PARAMS(proto), PARAMS(args), 0)	\
 	__DEFINE_EVENT(call, call, PARAMS(proto), PARAMS(args), 0)
 
 #undef DECLARE_TRACE_WRITABLE
 #define DECLARE_TRACE_WRITABLE(call, proto, args, size) \
 	__CHECK_WRITABLE_BUF_SIZE(call, PARAMS(proto), PARAMS(args), size) \
-	__BPF_DECLARE_TRACE(call, PARAMS(proto), PARAMS(args))		\
+	__BPF_DECLARE_TRACE(call, PARAMS(proto), PARAMS(args), 0)	\
 	__DEFINE_EVENT(call, call, PARAMS(proto), PARAMS(args), size)
 
 #include TRACE_INCLUDE(TRACE_INCLUDE_FILE)
diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index c77eb80cbd7f..ed07283d505b 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -2473,7 +2473,7 @@ int bpf_probe_register(struct bpf_raw_event_map *btp, struct bpf_raw_tp_link *li
 	return tracepoint_probe_register_prio_flags(tp, (void *)btp->bpf_func,
 						    link, TRACEPOINT_DEFAULT_PRIO,
-						    TRACEPOINT_MAY_EXIST);
+						    TRACEPOINT_MAY_EXIST | (tp->flags & TRACEPOINT_MAY_FAULT));
 }
 
 int bpf_probe_unregister(struct bpf_raw_event_map *btp, struct bpf_raw_tp_link *link)
-- 
2.39.2
[PATCH v6 2/5] tracing/ftrace: Add support for faultable tracepoints
In preparation for converting system call enter/exit instrumentation
into faultable tracepoints, make sure that ftrace can handle registering
to such tracepoints by explicitly disabling preemption within the ftrace
tracepoint probes to respect the current expectations within ftrace ring
buffer code.

This change does not yet allow ftrace to take page faults per se within
its probe, but allows its existing probes to connect to faultable
tracepoints.

Link: https://lore.kernel.org/lkml/20231002202531.3160-1-mathieu.desnoy...@efficios.com/
Co-developed-by: Michael Jeanson
Signed-off-by: Michael Jeanson
Signed-off-by: Mathieu Desnoyers
Reviewed-by: Masami Hiramatsu (Google)
Cc: Steven Rostedt
Cc: Masami Hiramatsu
Cc: Peter Zijlstra
Cc: Alexei Starovoitov
Cc: Yonghong Song
Cc: Paul E. McKenney
Cc: Ingo Molnar
Cc: Arnaldo Carvalho de Melo
Cc: Mark Rutland
Cc: Alexander Shishkin
Cc: Namhyung Kim
Cc: b...@vger.kernel.org
Cc: Joel Fernandes
---
Changes since v4:
- Use DEFINE_INACTIVE_GUARD.
- Add brackets to multiline 'if' statements.
Changes since v5:
- Pass the TRACEPOINT_MAY_FAULT flag directly to
  tracepoint_probe_register_prio_flags.
---
 include/trace/trace_events.h | 64 ++--
 kernel/trace/trace_events.c  | 16 +
 2 files changed, 71 insertions(+), 9 deletions(-)

diff --git a/include/trace/trace_events.h b/include/trace/trace_events.h
index df590eea8ae4..c887f7b6fbe9 100644
--- a/include/trace/trace_events.h
+++ b/include/trace/trace_events.h
@@ -45,6 +45,16 @@
 			     PARAMS(print));				\
 	DEFINE_EVENT(name, name, PARAMS(proto), PARAMS(args));
 
+#undef TRACE_EVENT_MAY_FAULT
+#define TRACE_EVENT_MAY_FAULT(name, proto, args, tstruct, assign, print) \
+	DECLARE_EVENT_CLASS_MAY_FAULT(name,				\
+			     PARAMS(proto),				\
+			     PARAMS(args),				\
+			     PARAMS(tstruct),				\
+			     PARAMS(assign),				\
+			     PARAMS(print));				\
+	DEFINE_EVENT(name, name, PARAMS(proto), PARAMS(args));
+
 #include "stages/stage1_struct_define.h"
 
 #undef DECLARE_EVENT_CLASS
@@ -57,6 +67,11 @@
 									\
 	static struct trace_event_class event_class_##name;
 
+#undef DECLARE_EVENT_CLASS_MAY_FAULT
+#define DECLARE_EVENT_CLASS_MAY_FAULT(name, proto, args, tstruct, assign, print) \
+	DECLARE_EVENT_CLASS(name, PARAMS(proto), PARAMS(args),		\
+			    PARAMS(tstruct), PARAMS(assign), PARAMS(print))
+
 #undef DEFINE_EVENT
 #define DEFINE_EVENT(template, name, proto, args)	\
 	static struct trace_event_call __used		\
@@ -80,7 +95,7 @@
 #undef TRACE_EVENT_FN_MAY_FAULT
 #define TRACE_EVENT_FN_MAY_FAULT(name, proto, args, tstruct,		\
 		assign, print, reg, unreg)				\
-	TRACE_EVENT(name, PARAMS(proto), PARAMS(args),			\
+	TRACE_EVENT_MAY_FAULT(name, PARAMS(proto), PARAMS(args),	\
 		PARAMS(tstruct), PARAMS(assign), PARAMS(print))		\
 
 #undef TRACE_EVENT_FN_COND
@@ -123,6 +138,11 @@
 		tstruct;						\
 	};
 
+#undef DECLARE_EVENT_CLASS_MAY_FAULT
+#define DECLARE_EVENT_CLASS_MAY_FAULT(call, proto, args, tstruct, assign, print) \
+	DECLARE_EVENT_CLASS(call, PARAMS(proto), PARAMS(args),		\
+			    PARAMS(tstruct), PARAMS(assign), PARAMS(print))
+
 #undef DEFINE_EVENT
 #define DEFINE_EVENT(template, name, proto, args)
 
@@ -214,6 +234,11 @@ static struct trace_event_functions trace_event_type_funcs_##call = { \
 	.trace			= trace_raw_output_##call,		\
 };
 
+#undef DECLARE_EVENT_CLASS_MAY_FAULT
+#define DECLARE_EVENT_CLASS_MAY_FAULT(call, proto, args, tstruct, assign, print) \
+	DECLARE_EVENT_CLASS(call, PARAMS(proto), PARAMS(args),		\
+			    PARAMS(tstruct), PARAMS(assign), PARAMS(print))
+
 #undef DEFINE_EVENT_PRINT
 #define DEFINE_EVENT_PRINT(template, call, proto, args, print)		\
 static notrace enum print_line_t					\
@@ -250,6 +275,11 @@ static struct trace_event_fields trace_event_fields_##call[] = { \
 	tstruct								\
 	{} };
 
+#undef DECLARE_EVENT_CLASS_MAY_FAULT
+#define DECLARE_EVENT_CLASS_MAY_FAULT(call, proto, args, tstruct, assign, print) \
+	DECLARE_EVENT_CLASS(call, PARAMS(proto), PARAMS(args),		\
+			    PARAMS(tstruct), PARAMS(assign), PARAMS(print))
+
 #undef DEFINE_EVENT_PRINT
 #define DEFINE_EVENT_PRINT(template, name, proto, args, print)
 
@@ -271,6 +301,11 @@ s
[PATCH v6 1/5] tracing: Introduce faultable tracepoints
When invoked from system call enter/exit instrumentation, accessing
user-space data is a common use-case for tracers. However, tracepoints
currently disable preemption around iteration on the registered
tracepoint probes and invocation of the probe callbacks, which prevents
tracers from handling page faults.

Extend the tracepoint and trace event APIs to allow defining a faultable
tracepoint which invokes its callback with preemption enabled.

Also extend the tracepoint API to allow tracers to request specific
probes to be connected to those faultable tracepoints. When the
TRACEPOINT_MAY_FAULT flag is provided on registration, the probe
callback will be called with preemption enabled, and is allowed to take
page faults. Faultable probes can only be registered on faultable
tracepoints and non-faultable probes on non-faultable tracepoints.

The tasks trace rcu mechanism is used to synchronize read-side
marshalling of the registered probes with respect to faultable probes
unregistration and teardown.

Link: https://lore.kernel.org/lkml/20231002202531.3160-1-mathieu.desnoy...@efficios.com/
Co-developed-by: Michael Jeanson
Signed-off-by: Mathieu Desnoyers
Signed-off-by: Michael Jeanson
Reviewed-by: Masami Hiramatsu (Google)
Cc: Steven Rostedt
Cc: Masami Hiramatsu
Cc: Peter Zijlstra
Cc: Alexei Starovoitov
Cc: Yonghong Song
Cc: Paul E. McKenney
Cc: Ingo Molnar
Cc: Arnaldo Carvalho de Melo
Cc: Mark Rutland
Cc: Alexander Shishkin
Cc: Namhyung Kim
Cc: b...@vger.kernel.org
Cc: Joel Fernandes
---
Changes since v1:
- Cleanup __DO_TRACE() implementation.
- Rename "sleepable tracepoints" to "faultable tracepoints", MAYSLEEP to
  MAYFAULT, and use might_fault() rather than might_sleep(), to properly
  convey that the tracepoints are meant to be able to take a page fault,
  which requires to be able to sleep *and* to hold the mmap_sem.
Changes since v2:
- Rename MAYFAULT to MAY_FAULT.
- Rebased on 6.5.5.
- Introduce MAY_EXIST tracepoint flag.
Changes since v3:
- Rebased on 6.6.2.
Changes since v4:
- Rebased on 6.9.6.
- Simplify flag check in tracepoint_probe_register_prio_flags().
- Update MAY_EXIST flag description.
Changes since v5:
- Rebased on v6.11-rc5.
---
 include/linux/tracepoint-defs.h | 14 ++
 include/linux/tracepoint.h      | 88 +++-2
 include/trace/define_trace.h    |  7 +++
 include/trace/trace_events.h    |  6 +++
 init/Kconfig                    |  1 +
 kernel/trace/bpf_trace.c        |  4 +-
 kernel/trace/trace_fprobe.c     |  5 +-
 kernel/tracepoint.c             | 65 ++-2
 8 files changed, 136 insertions(+), 54 deletions(-)

diff --git a/include/linux/tracepoint-defs.h b/include/linux/tracepoint-defs.h
index 4dc4955f0fbf..94e39c86b49f 100644
--- a/include/linux/tracepoint-defs.h
+++ b/include/linux/tracepoint-defs.h
@@ -29,6 +29,19 @@ struct tracepoint_func {
 	int prio;
 };
 
+/**
+ * enum tracepoint_flags - Tracepoint flags
+ * @TRACEPOINT_MAY_EXIST: On registration, don't warn if the tracepoint is
+ *                        already registered.
+ * @TRACEPOINT_MAY_FAULT: The tracepoint probe callback will be called with
+ *                        preemption enabled, and is allowed to take page
+ *                        faults.
+ */
+enum tracepoint_flags {
+	TRACEPOINT_MAY_EXIST = (1 << 0),
+	TRACEPOINT_MAY_FAULT = (1 << 1),
+};
+
 struct tracepoint {
 	const char *name;		/* Tracepoint name */
 	struct static_key key;
@@ -39,6 +52,7 @@ struct tracepoint {
 	int (*regfunc)(void);
 	void (*unregfunc)(void);
 	struct tracepoint_func __rcu *funcs;
+	unsigned int flags;
 };
 
 #ifdef CONFIG_HAVE_ARCH_PREL32_RELOCATIONS
diff --git a/include/linux/tracepoint.h b/include/linux/tracepoint.h
index 6be396bb4297..7ae5496a800c 100644
--- a/include/linux/tracepoint.h
+++ b/include/linux/tracepoint.h
@@ -17,6 +17,7 @@
 #include
 #include
 #include
+#include
 #include
 #include
 
@@ -40,17 +41,10 @@ extern int
 tracepoint_probe_register_prio(struct tracepoint *tp, void *probe,
 			       void *data, int prio);
 extern int
-tracepoint_probe_register_prio_may_exist(struct tracepoint *tp, void *probe, void *data,
-					 int prio);
+tracepoint_probe_register_prio_flags(struct tracepoint *tp, void *probe, void *data,
+				     int prio, unsigned int flags);
 extern int
 tracepoint_probe_unregister(struct tracepoint *tp, void *probe, void *data);
-static inline int
-tracepoint_probe_register_may_exist(struct tracepoint *tp, void *probe,
-				    void *data)
-{
-	return tracepoint_probe_register_prio_may_exist(tp, probe, data,
-							TRACEPOINT_DEFAULT_PRIO);
-}
 extern void
 for_each_kernel_tracepoint(void (*fct)(struct tracepoint *tp, void *priv),
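[Editor's note: the registration rule stated in the commit message above —
faultable probes only on faultable tracepoints, and non-faultable probes only
on non-faultable tracepoints — amounts to requiring the MAY_FAULT bit to match
on both sides. The following userspace C sketch is illustrative only; the
function name check_probe_flags() is hypothetical, the in-kernel check lives
in tracepoint_probe_register_prio_flags().]

```c
#include <assert.h>

#define EINVAL 22	/* stand-in for <errno.h>'s EINVAL */

enum tracepoint_flags {
	TRACEPOINT_MAY_EXIST = (1 << 0),
	TRACEPOINT_MAY_FAULT = (1 << 1),
};

/*
 * Reject registration when the probe's MAY_FAULT expectation does not
 * match the tracepoint definition's: a may-fault probe on a non-faultable
 * tracepoint would fault with preemption disabled, and a non-faultable
 * probe on a faultable tracepoint would run without the preemption
 * protection it assumes.
 */
static int check_probe_flags(unsigned int tp_flags, unsigned int reg_flags)
{
	if ((tp_flags & TRACEPOINT_MAY_FAULT) !=
	    (reg_flags & TRACEPOINT_MAY_FAULT))
		return -EINVAL;
	return 0;
}
```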
[PATCH v6 0/5] Faultable Tracepoints
Wire up the system call tracepoints with Tasks Trace RCU to allow
the ftrace, perf, and eBPF tracers to handle page faults.

This series does the initial wire-up allowing tracers to handle page
faults, but leaves out the actual handling of said page faults as
future work.

I have tested this against a feature branch of lttng-modules which
implements handling of page faults for the filename argument of the
openat(2) system call.

This v6 rebases v5 on top of v6.11-rc5. It requires the
"cleanup.h: Introduce DEFINE_INACTIVE_GUARD()/activate_guard()" series.

Thanks,

Mathieu

Link: https://lore.kernel.org/lkml/20240828143719.828968-1-mathieu.desnoy...@efficios.com/ # cleanup.h dependency
Link: https://lore.kernel.org/lkml/20240627152340.82413-1-mathieu.desnoy...@efficios.com/ # v5
Link: https://lore.kernel.org/lkml/20231120205418.334172-1-mathieu.desnoy...@efficios.com/
Link: https://lore.kernel.org/lkml/e4e9a2bc-1776-4b51-aba4-a147795a5...@efficios.com/
Link: https://lore.kernel.org/lkml/a0ac5f77-411e-4562-9863-81196238f...@efficios.com/
Link: https://lore.kernel.org/lkml/ba543d44-9302-4115-ac4f-d4e9f8d98a90@paulmck-laptop/
Link: https://lore.kernel.org/lkml/20231120221524.gd8...@noisy.programming.kicks-ass.net/
Cc: Peter Zijlstra
Cc: Alexei Starovoitov
Cc: Yonghong Song
Cc: Paul E. McKenney
Cc: Ingo Molnar
Cc: Arnaldo Carvalho de Melo
Cc: Mark Rutland
Cc: Alexander Shishkin
Cc: Namhyung Kim
Cc: b...@vger.kernel.org
Cc: Joel Fernandes
Cc: linux-trace-ker...@vger.kernel.org

Mathieu Desnoyers (5):
  tracing: Introduce faultable tracepoints
  tracing/ftrace: Add support for faultable tracepoints
  tracing/bpf-trace: Add support for faultable tracepoints
  tracing/perf: Add support for faultable tracepoints
  tracing: Convert sys_enter/exit to faultable tracepoints

 include/linux/tracepoint-defs.h | 14 ++
 include/linux/tracepoint.h      | 88 +++--
 include/trace/bpf_probe.h       | 21 ++--
 include/trace/define_trace.h    |  7 +++
 include/trace/events/syscalls.h |  4 +-
 include/trace/perf.h            | 22 -
 include/trace/trace_events.h    | 68 -
 init/Kconfig                    |  1 +
 kernel/trace/bpf_trace.c        |  4 +-
 kernel/trace/trace_events.c     | 16 +++---
 kernel/trace/trace_fprobe.c     |  5 +-
 kernel/trace/trace_syscalls.c   | 52 ---
 kernel/tracepoint.c             | 65 ++--
 13 files changed, 288 insertions(+), 79 deletions(-)

-- 
2.39.2
[PATCH v1 1/2] cleanup.h guard: Rename DEFINE_ prefix to DECLARE_
The convention used in other kernel headers (e.g. wait.h,
percpu-defs.h) is to use "DECLARE_" prefix for macros emitting externs,
static inlines and type definitions. The "DEFINE_" prefix is used for
variable definitions.

In preparation to introduce a "DEFINE_INACTIVE_GUARD()" to actually
define a guard variable, rename all the guard "DEFINE_" prefix to
"DECLARE_".

Signed-off-by: Mathieu Desnoyers
Cc: Peter Zijlstra (Intel)
Cc: Ingo Molnar
---
 crypto/asymmetric_keys/x509_parser.h         |  4 +-
 drivers/cxl/acpi.c                           |  6 +--
 drivers/cxl/core/cdat.c                      |  2 +-
 drivers/cxl/cxl.h                            |  2 +-
 drivers/gpio/gpiolib.h                       |  2 +-
 drivers/net/wireless/intel/iwlwifi/mvm/mvm.h |  2 +-
 drivers/platform/x86/intel/pmc/core_ssram.c  |  2 +-
 fs/fuse/virtio_fs.c                          |  2 +-
 fs/namespace.c                               |  2 +-
 fs/pstore/inode.c                            |  4 +-
 include/linux/bitmap.h                       |  2 +-
 include/linux/cleanup.h                      | 56 ++--
 include/linux/cpuhplock.h                    |  2 +-
 include/linux/cpumask.h                      |  2 +-
 include/linux/device.h                       |  6 +--
 include/linux/file.h                         |  6 +--
 include/linux/firmware.h                     |  2 +-
 include/linux/firmware/qcom/qcom_tzmem.h     |  2 +-
 include/linux/gpio/driver.h                  |  4 +-
 include/linux/iio/iio.h                      |  4 +-
 include/linux/interrupt.h                    |  4 +-
 include/linux/irqflags.h                     |  4 +-
 include/linux/local_lock.h                   | 22 
 include/linux/mutex.h                        |  6 +--
 include/linux/netdevice.h                    |  2 +-
 include/linux/nsproxy.h                      |  2 +-
 include/linux/of.h                           |  2 +-
 include/linux/path.h                         |  2 +-
 include/linux/pci.h                          |  4 +-
 include/linux/percpu.h                       |  2 +-
 include/linux/preempt.h                      |  6 +--
 include/linux/property.h                     |  2 +-
 include/linux/rcupdate.h                     |  2 +-
 include/linux/rtnetlink.h                    |  2 +-
 include/linux/rwsem.h                        | 10 ++--
 include/linux/sched/task.h                   |  4 +-
 include/linux/slab.h                         |  4 +-
 include/linux/spinlock.h                     | 38 ++---
 include/linux/srcu.h                         |  8 +--
 include/sound/pcm.h                          |  6 +--
 kernel/sched/core.c                          |  2 +-
 kernel/sched/sched.h                         | 16 +++---
 kernel/sched/syscalls.c                      |  4 +-
 lib/locking-selftest.c                       | 12 ++---
 sound/core/control_led.c                     |  2 +-
 45 files changed, 141 insertions(+), 141 deletions(-)

diff --git a/crypto/asymmetric_keys/x509_parser.h b/crypto/asymmetric_keys/x509_parser.h
index 0688c222806b..ac0560ccfcdb 100644
--- a/crypto/asymmetric_keys/x509_parser.h
+++ b/crypto/asymmetric_keys/x509_parser.h
@@ -45,8 +45,8 @@ struct x509_certificate {
  * x509_cert_parser.c
  */
 extern void x509_free_certificate(struct x509_certificate *cert);
-DEFINE_FREE(x509_free_certificate, struct x509_certificate *,
-	    if (!IS_ERR(_T)) x509_free_certificate(_T))
+DECLARE_FREE(x509_free_certificate, struct x509_certificate *,
+	     if (!IS_ERR(_T)) x509_free_certificate(_T))
 extern struct x509_certificate *x509_cert_parse(const void *data, size_t datalen);
 extern int x509_decode_time(time64_t *_t,  size_t hdrlen,
 			    unsigned char tag,
diff --git a/drivers/cxl/acpi.c b/drivers/cxl/acpi.c
index 82b78e331d8e..2bb3cef82035 100644
--- a/drivers/cxl/acpi.c
+++ b/drivers/cxl/acpi.c
@@ -337,9 +337,9 @@ static int add_or_reset_cxl_resource(struct resource *parent, struct resource *r
 	return rc;
 }
 
-DEFINE_FREE(put_cxlrd, struct cxl_root_decoder *,
-	    if (!IS_ERR_OR_NULL(_T)) put_device(&_T->cxlsd.cxld.dev))
-DEFINE_FREE(del_cxl_resource, struct resource *, if (_T) del_cxl_resource(_T))
+DECLARE_FREE(put_cxlrd, struct cxl_root_decoder *,
+	     if (!IS_ERR_OR_NULL(_T)) put_device(&_T->cxlsd.cxld.dev))
+DECLARE_FREE(del_cxl_resource, struct resource *, if (_T) del_cxl_resource(_T))
 
 static int __cxl_parse_cfmws(struct acpi_cedt_cfmws *cfmws,
 			     struct cxl_cfmws_context *ctx)
 {
diff --git a/drivers/cxl/core/cdat.c b/drivers/cxl/core/cdat.c
index bb83867d9fec..689143566642 100644
--- a/drivers/cxl/core/cdat.c
+++ b/drivers/cxl/core/cdat.c
@@ -385,7 +385,7 @@ static void discard_dsmas(struct xarray *xa)
 	}
 	xa_destroy(xa);
 }
-DEFINE_FREE(dsmas, struct xarray *, if (_T) discard_dsmas(_T))
+DECLARE
[PATCH v1 2/2] cleanup.h: Introduce DEFINE_INACTIVE_GUARD and activate_guard
To cover scenarios where the scope of the guard differs from the scope
of its activation, introduce DEFINE_INACTIVE_GUARD() and
activate_guard().

Here is an example use for a conditionally activated guard variable:

void func(bool a)
{
	DEFINE_INACTIVE_GUARD(preempt_notrace, myguard);

	[...]
	if (a) {
		might_sleep();
		activate_guard(preempt_notrace, myguard)();
	}
	[ protected code ]
}

Signed-off-by: Mathieu Desnoyers
Cc: Peter Zijlstra (Intel)
Cc: Ingo Molnar
Cc: Linus Torvalds
Cc: Kees Cook
Cc: Greg KH
Cc: Sean Christopherson
---
 include/linux/cleanup.h | 23 ++-
 1 file changed, 22 insertions(+), 1 deletion(-)

diff --git a/include/linux/cleanup.h b/include/linux/cleanup.h
index 1247e67a6161..5f031cce97ca 100644
--- a/include/linux/cleanup.h
+++ b/include/linux/cleanup.h
@@ -149,12 +149,20 @@ static inline class_##_name##_t class_##_name##ext##_constructor(_init_args) \
  *	similar to scoped_guard(), except it does fail when the lock
  *	acquire fails.
  *
+ * DEFINE_INACTIVE_GUARD(name, var):
+ *	define an inactive guard variable in a given scope, initialized to NULL.
+ *
+ * activate_guard(name, var)(args...):
+ *	activate a guard variable with its constructor, if it is not already
+ *	activated.
  */
 
 #define DECLARE_GUARD(_name, _type, _lock, _unlock) \
 	DECLARE_CLASS(_name, _type, if (_T) { _unlock; }, ({ _lock; _T; }), _type _T); \
 	static inline void * class_##_name##_lock_ptr(class_##_name##_t *_T) \
-	{ return *_T; }
+	{ return *_T; } \
+	static inline class_##_name##_t class_##_name##_null(void) \
+	{ return NULL; }
 
 #define DECLARE_GUARD_COND(_name, _ext, _condlock) \
 	EXTEND_CLASS(_name, _ext, \
@@ -178,6 +186,14 @@ static inline class_##_name##_t class_##_name##ext##_constructor(_init_args) \
 	if (!__guard_ptr(_name)(&scope)) _fail; \
 	else
 
+#define DEFINE_INACTIVE_GUARD(_name, _var) \
+	class_##_name##_t _var __cleanup(class_##_name##_destructor) = \
+		class_##_name##_null()
+
+#define activate_guard(_name, _var) \
+	if (!class_##_name##_lock_ptr(&(_var))) \
+		_var = class_##_name##_constructor
+
 /*
  * Additional helper macros for generating lock guards with types, either for
  * locks that don't have a native type (eg. RCU, preempt) or those that need a
@@ -212,6 +228,11 @@ static inline void class_##_name##_destructor(class_##_name##_t *_T) \
 static inline void *class_##_name##_lock_ptr(class_##_name##_t *_T)	\
 {									\
 	return _T->lock;						\
+}									\
+static inline class_##_name##_t class_##_name##_null(void)		\
+{									\
+	class_##_name##_t _t = { .lock = NULL };			\
+	return _t;							\
 }
 
-- 
2.39.2
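[Editor's note: the inactive-guard idiom introduced above can be modeled in
standalone userspace C with the same __attribute__((cleanup)) mechanism the
kernel's cleanup.h builds on. This is a deliberately simplified sketch — the
single-argument DEFINE_INACTIVE_GUARD()/activate_guard() macros, guard_t, and
the preempt_disabled counter are hypothetical stand-ins, not the kernel API,
which takes a class name and appends constructor-call parentheses.]

```c
#include <assert.h>
#include <stddef.h>

static int preempt_disabled;	/* stand-in for the preempt count */

static void guard_lock(void)   { preempt_disabled++; }
static void guard_unlock(void) { preempt_disabled--; }

/* Guard value: NULL means "never activated", so the destructor is a no-op. */
typedef void *guard_t;

static void guard_destructor(guard_t *g)
{
	if (*g)
		guard_unlock();
}

/* Declare an inactive guard: destructor armed, constructor not yet run. */
#define DEFINE_INACTIVE_GUARD(var) \
	guard_t var __attribute__((cleanup(guard_destructor))) = NULL

/* Activate at most once, possibly from a narrower conditional scope. */
#define activate_guard(var) \
	do { if (!(var)) { guard_lock(); (var) = &(var); } } while (0)

static void func(int cond)
{
	DEFINE_INACTIVE_GUARD(myguard);

	if (cond) {
		activate_guard(myguard);
		/* protected section: "preemption" is disabled here */
		assert(preempt_disabled == 1);
	}
	/* guard released automatically at scope exit, only if activated */
}
```

The key property mirrored here is that the destructor runs at the scope of
the declaration while the constructor may run later, conditionally, in an
inner scope — exactly the gap DEFINE_INACTIVE_GUARD() is meant to cover.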
[PATCH v1 0/2] cleanup.h: Introduce DEFINE_INACTIVE_GUARD()/activate_guard()
In preparation to introduce a "DEFINE_INACTIVE_GUARD()" to actually
define a guard variable, rename all the guard "DEFINE_" prefix to
"DECLARE_".

To cover scenarios where the scope of the guard differs from the scope
of its activation, introduce DEFINE_INACTIVE_GUARD() and
activate_guard().

The faultable tracepoints depend on this.

Based on v6.11-rc5.

Thanks,

Mathieu

Cc: Linus Torvalds
Cc: Kees Cook
Cc: Greg KH
Cc: Sean Christopherson
Cc: Steven Rostedt
Cc: Masami Hiramatsu
Cc: Alexei Starovoitov
Cc: Yonghong Song
Cc: Paul E. McKenney
Cc: Arnaldo Carvalho de Melo
Cc: Mark Rutland
Cc: Alexander Shishkin
Cc: Namhyung Kim
Cc: b...@vger.kernel.org
Cc: Joel Fernandes
Cc: linux-trace-ker...@vger.kernel.org

Mathieu Desnoyers (2):
  cleanup.h guard: Rename DEFINE_ prefix to DECLARE_
  cleanup.h: Introduce DEFINE_INACTIVE_GUARD and activate_guard

 crypto/asymmetric_keys/x509_parser.h         |  4 +-
 drivers/cxl/acpi.c                           |  6 +-
 drivers/cxl/core/cdat.c                      |  2 +-
 drivers/cxl/cxl.h                            |  2 +-
 drivers/gpio/gpiolib.h                       |  2 +-
 drivers/net/wireless/intel/iwlwifi/mvm/mvm.h |  2 +-
 drivers/platform/x86/intel/pmc/core_ssram.c  |  2 +-
 fs/fuse/virtio_fs.c                          |  2 +-
 fs/namespace.c                               |  2 +-
 fs/pstore/inode.c                            |  4 +-
 include/linux/bitmap.h                       |  2 +-
 include/linux/cleanup.h                      | 79 +---
 include/linux/cpuhplock.h                    |  2 +-
 include/linux/cpumask.h                      |  2 +-
 include/linux/device.h                       |  6 +-
 include/linux/file.h                         |  6 +-
 include/linux/firmware.h                     |  2 +-
 include/linux/firmware/qcom/qcom_tzmem.h     |  2 +-
 include/linux/gpio/driver.h                  |  4 +-
 include/linux/iio/iio.h                      |  4 +-
 include/linux/interrupt.h                    |  4 +-
 include/linux/irqflags.h                     |  4 +-
 include/linux/local_lock.h                   | 22 +++---
 include/linux/mutex.h                        |  6 +-
 include/linux/netdevice.h                    |  2 +-
 include/linux/nsproxy.h                      |  2 +-
 include/linux/of.h                           |  2 +-
 include/linux/path.h                         |  2 +-
 include/linux/pci.h                          |  4 +-
 include/linux/percpu.h                       |  2 +-
 include/linux/preempt.h                      |  6 +-
 include/linux/property.h                     |  2 +-
 include/linux/rcupdate.h                     |  2 +-
 include/linux/rtnetlink.h                    |  2 +-
 include/linux/rwsem.h                        | 10 +--
 include/linux/sched/task.h                   |  4 +-
 include/linux/slab.h                         |  4 +-
 include/linux/spinlock.h                     | 38 +-
 include/linux/srcu.h                         |  8 +-
 include/sound/pcm.h                          |  6 +-
 kernel/sched/core.c                          |  2 +-
 kernel/sched/sched.h                         | 16 ++--
 kernel/sched/syscalls.c                      |  4 +-
 lib/locking-selftest.c                       | 12 +--
 sound/core/control_led.c                     |  2 +-
 45 files changed, 163 insertions(+), 142 deletions(-)

-- 
2.39.2
Re: [PATCH v5 3/8] cleanup.h: Introduce DEFINE_INACTIVE_GUARD and activate_guard
On 2024-08-20 01:00, Steven Rostedt wrote:
On Thu, 27 Jun 2024 11:23:35 -0400 Mathieu Desnoyers wrote:

To cover scenarios where the scope of the guard differs from the scope of its activation, introduce DEFINE_INACTIVE_GUARD() and activate_guard(). Here is an example use of a conditionally activated guard variable:

void func(bool a)
{
	DEFINE_INACTIVE_GUARD(preempt_notrace, myguard);

	[...]
	if (a) {
		might_sleep();
		activate_guard(preempt_notrace, myguard)();
	}
	[ protected code ]
}

Hi Mathieu,

The three cleanup patches fail to apply (I believe one has already been fixed by Ingo too). Could you make the cleanup patches a separate series that is likely to get in, especially since it's more of a moving target?

Then it would make sense to split this into two series: one for the guard stuff, and one for tracing, which depends on the first one. I'll do that.

Mathieu

Thanks,

-- Steve

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
Re: [PATCH] tracing: Update MAINTAINERS file
On 2024-07-15 15:13, Steven Rostedt wrote:
On Mon, 15 Jul 2024 15:10:17 -0400 Mathieu Desnoyers wrote:
On 2024-07-15 14:47, Steven Rostedt wrote:
From: "Steven Rostedt (Google)"

Gone but never forgotten.

[ Also moved Daniel's name to be consistent with the alphabetical order ]

Hi Steven,

You appear to have missed this entry from SCHEDULER:

R: Daniel Bristot de Oliveira (SCHED_DEADLINE)

Should it be done in a separate commit or folded into this one?

That has already been done:

https://lore.kernel.org/all/20240708075752.gf11...@noisy.programming.kicks-ass.net/

This patch is actually on top of that one.

Sorry I missed that.

Reviewed-by: Mathieu Desnoyers

-- Steve

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
Re: [PATCH] tracing: Update MAINTAINERS file
On 2024-07-15 14:47, Steven Rostedt wrote:
From: "Steven Rostedt (Google)"

Gone but never forgotten.

[ Also moved Daniel's name to be consistent with the alphabetical order ]

Hi Steven,

You appear to have missed this entry from SCHEDULER:

R: Daniel Bristot de Oliveira (SCHED_DEADLINE)

Should it be done in a separate commit or folded into this one?

Thanks,
Mathieu

Signed-off-by: Steven Rostedt (Google)
---
 CREDITS     | 10 +++---
 MAINTAINERS |  3 ---
 2 files changed, 7 insertions(+), 6 deletions(-)

diff --git a/CREDITS b/CREDITS
index 88c4c08cb613..d7798910e99f 100644
--- a/CREDITS
+++ b/CREDITS
@@ -271,9 +271,6 @@
 D: Driver for WaveFront soundcards (Turtle Beach Maui, Tropez, Tropez+)
 D: Various bugfixes and changes to sound drivers
 S: USA
 
-N: Daniel Bristot de Oliveira
-D: Scheduler contributions, notably: SCHED_DEADLINE
-
 N: Carlos Henrique Bauer
 E: chba...@acm.org
 E: ba...@atlas.unisinos.br
@@ -534,6 +531,13 @@
 S: Kopmansg 2
 S: 411 13 Goteborg
 S: Sweden
 
+N: Daniel Bristot de Oliveira
+D: Scheduler contributions, notably: SCHED_DEADLINE
+D: Real-time Linux Analysis
+D: Runtime Verification
+D: OS Noise and Latency Tracers
+S: Brazil and Italy
+
 N: Paul Bristow
 E: p...@paulbristow.net
 W: https://paulbristow.net/linux/idefloppy.html
diff --git a/MAINTAINERS b/MAINTAINERS
index 2e1b8bbacb5e..0b7e4cd4ffd7 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -18895,7 +18895,6 @@
 F:	include/uapi/linux/rtc.h
 F:	tools/testing/selftests/rtc/
 
 Real-time Linux Analysis (RTLA) tools
-M:	Daniel Bristot de Oliveira
 M:	Steven Rostedt
 L:	linux-trace-ker...@vger.kernel.org
 S:	Maintained
@@ -19529,7 +19528,6 @@
 S:	Maintained
 F:	drivers/infiniband/ulp/rtrs/
 
 RUNTIME VERIFICATION (RV)
-M:	Daniel Bristot de Oliveira
 M:	Steven Rostedt
 L:	linux-trace-ker...@vger.kernel.org
 S:	Maintained
@@ -22806,7 +22804,6 @@
 F:	kernel/trace/trace_mmiotrace.c
 
 TRACING OS NOISE / LATENCY TRACERS
 M:	Steven Rostedt
-M:	Daniel Bristot de Oliveira
 S:	Maintained
 F:	Documentation/trace/hwlat_detector.rst
 F:	Documentation/trace/osnoise-tracer.rst

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
Re: [PATCH v23 3/5] tracing: Allow user-space mapping of the ring-buffer
On 2024-07-02 12:51, Steven Rostedt wrote:
On Tue, 2 Jul 2024 11:32:53 -0400 Mathieu Desnoyers wrote:

If we use '*' for user events already, perhaps we'd want to consider using the same range for the ring buffer ioctls? Arguably one is about instrumentation and the other is about ring buffer interaction (data transport), but both are related to tracing.

Yeah, but I'd still rather keep them separate.

No objection, I'm OK either way.

Thanks,
Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
Re: [PATCH v23 3/5] tracing: Allow user-space mapping of the ring-buffer
On 2024-07-02 11:18, Steven Rostedt wrote:
On Tue, 2 Jul 2024 10:36:03 -0400 Mathieu Desnoyers wrote:

I can send a patch this week to update it. Or feel free to send a patch yourself.

You need to reserve an unused ioctl Code and Seq# range within:
Documentation/userspace-api/ioctl/ioctl-number.rst

Ug, it's been so long, I completely forgot about that file. Thanks for catching this.

Otherwise this duplicate will confuse all system call instrumentation tooling.

Agreed, what if we did this then:

-- Steve

diff --git a/Documentation/userspace-api/ioctl/ioctl-number.rst b/Documentation/userspace-api/ioctl/ioctl-number.rst
index a141e8e65c5d..9a97030c6c8d 100644
--- a/Documentation/userspace-api/ioctl/ioctl-number.rst
+++ b/Documentation/userspace-api/ioctl/ioctl-number.rst
@@ -186,6 +186,7 @@ Code  Seq#    Include File                  Comments
 'Q'   all    linux/soundcard.h
 'R'   00-1F  linux/random.h                conflict!
 'R'   01     linux/rfkill.h                conflict!
+'R'   20-2F  linux/trace_mmap.h
 'R'   C0-DF  net/bluetooth/rfcomm.h
 'R'   E0     uapi/linux/fsl_mc.h
 'S'   all    linux/cdrom.h                 conflict!
diff --git a/include/uapi/linux/trace_mmap.h b/include/uapi/linux/trace_mmap.h
index bd1066754220..c102ef35d11e 100644
--- a/include/uapi/linux/trace_mmap.h
+++ b/include/uapi/linux/trace_mmap.h
@@ -43,6 +43,6 @@ struct trace_buffer_meta {
 	__u64	Reserved2;
 };
 
-#define TRACE_MMAP_IOCTL_GET_READER	_IO('T', 0x1)
+#define TRACE_MMAP_IOCTL_GET_READER	_IO('R', 0x20)
 
 #endif /* _TRACE_MMAP_H_ */

Note that user events also has this issue: the ioctl is not reserved in the ioctl-number.rst list. See include/uapi/linux/user_events.h:

#define DIAG_IOC_MAGIC '*'

/* Request to register a user_event */
#define DIAG_IOCSREG _IOWR(DIAG_IOC_MAGIC, 0, struct user_reg *)

/* Request to delete a user_event */
#define DIAG_IOCSDEL _IOW(DIAG_IOC_MAGIC, 1, char *)

/* Requests to unregister a user_event */
#define DIAG_IOCSUNREG _IOW(DIAG_IOC_MAGIC, 2, struct user_unreg *)

Where '*' maps to Code 0x2A. Looking at the list, I don't see any conflicts there, but we should definitely add it.

If we use '*' for user events already, perhaps we'd want to consider using the same range for the ring buffer ioctls? Arguably one is about instrumentation and the other is about ring buffer interaction (data transport), but both are related to tracing.

Thanks,
Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
Re: [PATCH v23 3/5] tracing: Allow user-space mapping of the ring-buffer
On 2024-06-30 08:40, Steven Rostedt wrote:
On Sun, 30 Jun 2024 13:53:23 +0300, "Dmitry V. Levin" wrote:
On Fri, May 10, 2024 at 03:04:32PM +0100, Vincent Donnefort wrote:
[...]

diff --git a/include/uapi/linux/trace_mmap.h b/include/uapi/linux/trace_mmap.h
index b682e9925539..bd1066754220 100644
--- a/include/uapi/linux/trace_mmap.h
+++ b/include/uapi/linux/trace_mmap.h
@@ -43,4 +43,6 @@ struct trace_buffer_meta {
 	__u64	Reserved2;
 };
 
+#define TRACE_MMAP_IOCTL_GET_READER	_IO('T', 0x1)
+

I'm sorry, but among all the numbers this one was probably the least fortunate choice, because it collides with TCGETS on most architectures.

Hmm, that is unfortunate.

For example, this is how strace output would look when TRACE_MMAP_IOCTL_GET_READER support is added:

$ strace -e ioctl stty
ioctl(0, TCGETS or TRACE_MMAP_IOCTL_GET_READER, {c_iflag=ICRNL|IXON, c_oflag=NL0|CR0|TAB0|BS0|VT0|FF0|OPOST|ONLCR, c_cflag=B38400|CS8|CREAD, c_lflag=ISIG|ICANON|ECHO|ECHOE|ECHOK|IEXTEN|ECHOCTL|ECHOKE, ...}) = 0

Even though ioctl numbers are inherently not unique, TCGETS is a very traditional one, so it would be great if you could change TRACE_MMAP_IOCTL_GET_READER to avoid this collision. Given that _IO('T', 0x1) is _IOC(_IOC_NONE, 'T', 0x1, 0), something like _IOC(_IOC_NONE, 'T', 0x1, 0x1) should be OK.

Well, it may not be too late to update this, as it hasn't been officially released in 6.10 yet. It's still only in the -rc, and the library doesn't rely on this yet (I've been holding off until 6.10 was officially released before releasing the library that uses it). I can send a patch this week to update it. Or feel free to send a patch yourself.

You need to reserve an unused ioctl Code and Seq# range within:
Documentation/userspace-api/ioctl/ioctl-number.rst

Otherwise this duplicate will confuse all system call instrumentation tooling.

Thanks,
Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
[PATCH v5 8/8] tracing: Convert sys_enter/exit to faultable tracepoints
Convert the definition of the system call enter/exit tracepoints to faultable tracepoints now that all upstream tracers handle it. This allows tracers to fault-in userspace system call arguments such as path strings within their probe callbacks. Link: https://lore.kernel.org/lkml/20231002202531.3160-1-mathieu.desnoy...@efficios.com/ Co-developed-by: Michael Jeanson Signed-off-by: Mathieu Desnoyers Signed-off-by: Michael Jeanson Cc: Steven Rostedt Cc: Masami Hiramatsu Cc: Peter Zijlstra Cc: Alexei Starovoitov Cc: Yonghong Song Cc: Paul E. McKenney Cc: Ingo Molnar Cc: Arnaldo Carvalho de Melo Cc: Mark Rutland Cc: Alexander Shishkin Cc: Jiri Olsa Cc: Namhyung Kim Cc: b...@vger.kernel.org Cc: Joel Fernandes --- Since v4: - Use 'guard(preempt_notrace)'. - Add brackets to multiline 'if' statements. --- include/trace/events/syscalls.h | 4 +-- kernel/trace/trace_syscalls.c | 52 - 2 files changed, 46 insertions(+), 10 deletions(-) diff --git a/include/trace/events/syscalls.h b/include/trace/events/syscalls.h index b6e0cbc2c71f..dc30e3004818 100644 --- a/include/trace/events/syscalls.h +++ b/include/trace/events/syscalls.h @@ -15,7 +15,7 @@ #ifdef CONFIG_HAVE_SYSCALL_TRACEPOINTS -TRACE_EVENT_FN(sys_enter, +TRACE_EVENT_FN_MAY_FAULT(sys_enter, TP_PROTO(struct pt_regs *regs, long id), @@ -41,7 +41,7 @@ TRACE_EVENT_FN(sys_enter, TRACE_EVENT_FLAGS(sys_enter, TRACE_EVENT_FL_CAP_ANY) -TRACE_EVENT_FN(sys_exit, +TRACE_EVENT_FN_MAY_FAULT(sys_exit, TP_PROTO(struct pt_regs *regs, long ret), diff --git a/kernel/trace/trace_syscalls.c b/kernel/trace/trace_syscalls.c index 9c581d6da843..314666d663b6 100644 --- a/kernel/trace/trace_syscalls.c +++ b/kernel/trace/trace_syscalls.c @@ -299,6 +299,12 @@ static void ftrace_syscall_enter(void *data, struct pt_regs *regs, long id) int syscall_nr; int size; + /* +* Probe called with preemption enabled (may_fault), but ring buffer and +* per-cpu data require preemption to be disabled. 
+*/ + guard(preempt_notrace)(); + syscall_nr = trace_get_syscall_nr(current, regs); if (syscall_nr < 0 || syscall_nr >= NR_syscalls) return; @@ -338,6 +344,12 @@ static void ftrace_syscall_exit(void *data, struct pt_regs *regs, long ret) struct trace_event_buffer fbuffer; int syscall_nr; + /* +* Probe called with preemption enabled (may_fault), but ring buffer and +* per-cpu data require preemption to be disabled. +*/ + guard(preempt_notrace)(); + syscall_nr = trace_get_syscall_nr(current, regs); if (syscall_nr < 0 || syscall_nr >= NR_syscalls) return; @@ -376,8 +388,11 @@ static int reg_event_syscall_enter(struct trace_event_file *file, if (WARN_ON_ONCE(num < 0 || num >= NR_syscalls)) return -ENOSYS; mutex_lock(&syscall_trace_lock); - if (!tr->sys_refcount_enter) - ret = register_trace_sys_enter(ftrace_syscall_enter, tr); + if (!tr->sys_refcount_enter) { + ret = register_trace_prio_flags_sys_enter(ftrace_syscall_enter, tr, + TRACEPOINT_DEFAULT_PRIO, + TRACEPOINT_MAY_FAULT); + } if (!ret) { rcu_assign_pointer(tr->enter_syscall_files[num], file); tr->sys_refcount_enter++; @@ -414,8 +429,11 @@ static int reg_event_syscall_exit(struct trace_event_file *file, if (WARN_ON_ONCE(num < 0 || num >= NR_syscalls)) return -ENOSYS; mutex_lock(&syscall_trace_lock); - if (!tr->sys_refcount_exit) - ret = register_trace_sys_exit(ftrace_syscall_exit, tr); + if (!tr->sys_refcount_exit) { + ret = register_trace_prio_flags_sys_exit(ftrace_syscall_exit, tr, + TRACEPOINT_DEFAULT_PRIO, +TRACEPOINT_MAY_FAULT); + } if (!ret) { rcu_assign_pointer(tr->exit_syscall_files[num], file); tr->sys_refcount_exit++; @@ -582,6 +600,12 @@ static void perf_syscall_enter(void *ignore, struct pt_regs *regs, long id) int rctx; int size; + /* +* Probe called with preemption enabled (may_fault), but ring buffer and +* per-cpu data require preemption to be disabled. 
+*/ + guard(preempt_notrace)(); + syscall_nr = trace_get_syscall_nr(current, regs); if (syscall_nr < 0 || syscall_nr >= NR_syscalls) return; @@ -630,8 +654,11 @@ static int perf_sysenter_enable(struct trace_event_call *call) num = ((struct syscall_metadata *)call->data)->syscall_nr; mu
[PATCH v5 7/8] tracing/perf: Add support for faultable tracepoints
In preparation for converting system call enter/exit instrumentation into faultable tracepoints, make sure that perf can handle registering to such tracepoints by explicitly disabling preemption within the perf tracepoint probes to respect the current expectations within perf ring buffer code. This change does not yet allow perf to take page faults per se within its probe, but allows its existing probes to connect to faultable tracepoints. Link: https://lore.kernel.org/lkml/20231002202531.3160-1-mathieu.desnoy...@efficios.com/ Co-developed-by: Michael Jeanson Signed-off-by: Mathieu Desnoyers Signed-off-by: Michael Jeanson Cc: Steven Rostedt Cc: Masami Hiramatsu Cc: Peter Zijlstra Cc: Alexei Starovoitov Cc: Yonghong Song Cc: Paul E. McKenney Cc: Ingo Molnar Cc: Arnaldo Carvalho de Melo Cc: Mark Rutland Cc: Alexander Shishkin Cc: Jiri Olsa Cc: Namhyung Kim Cc: b...@vger.kernel.org Cc: Joel Fernandes --- Changes since v4: - Use DEFINE_INACTIVE_GUARD. --- include/trace/perf.h | 22 -- 1 file changed, 20 insertions(+), 2 deletions(-) diff --git a/include/trace/perf.h b/include/trace/perf.h index 2c11181c82e0..161e1655b953 100644 --- a/include/trace/perf.h +++ b/include/trace/perf.h @@ -12,8 +12,8 @@ #undef __perf_task #define __perf_task(t) (__task = (t)) -#undef DECLARE_EVENT_CLASS -#define DECLARE_EVENT_CLASS(call, proto, args, tstruct, assign, print) \ +#undef _DECLARE_EVENT_CLASS +#define _DECLARE_EVENT_CLASS(call, proto, args, tstruct, assign, print, tp_flags) \ static notrace void\ perf_trace_##call(void *__data, proto) \ { \ @@ -28,6 +28,13 @@ perf_trace_##call(void *__data, proto) \ int __data_size;\ int rctx; \ \ + DEFINE_INACTIVE_GUARD(preempt_notrace, trace_event_guard); \ + \ + if ((tp_flags) & TRACEPOINT_MAY_FAULT) {\ + might_fault(); \ + activate_guard(preempt_notrace, trace_event_guard)(); \ + } \ + \ __data_size = trace_event_get_offsets_##call(&__data_offsets, args); \ \ head = this_cpu_ptr(event_call->perf_events); \ @@ -55,6 +62,17 @@ 
perf_trace_##call(void *__data, proto) \ head, __task);\ } +#undef DECLARE_EVENT_CLASS +#define DECLARE_EVENT_CLASS(call, proto, args, tstruct, assign, print) \ + _DECLARE_EVENT_CLASS(call, PARAMS(proto), PARAMS(args), \ +PARAMS(tstruct), PARAMS(assign), PARAMS(print), 0) + +#undef DECLARE_EVENT_CLASS_MAY_FAULT +#define DECLARE_EVENT_CLASS_MAY_FAULT(call, proto, args, tstruct, assign, print) \ + _DECLARE_EVENT_CLASS(call, PARAMS(proto), PARAMS(args), \ +PARAMS(tstruct), PARAMS(assign), PARAMS(print), \ +TRACEPOINT_MAY_FAULT) + /* * This part is compiled out, it is only here as a build time check * to make sure that if the tracepoint handling changes, the -- 2.39.2
[PATCH v5 5/8] tracing/ftrace: Add support for faultable tracepoints
In preparation for converting system call enter/exit instrumentation into faultable tracepoints, make sure that ftrace can handle registering to such tracepoints by explicitly disabling preemption within the ftrace tracepoint probes to respect the current expectations within ftrace ring buffer code. This change does not yet allow ftrace to take page faults per se within its probe, but allows its existing probes to connect to faultable tracepoints. Link: https://lore.kernel.org/lkml/20231002202531.3160-1-mathieu.desnoy...@efficios.com/ Co-developed-by: Michael Jeanson Signed-off-by: Michael Jeanson Signed-off-by: Mathieu Desnoyers Cc: Steven Rostedt Cc: Masami Hiramatsu Cc: Peter Zijlstra Cc: Alexei Starovoitov Cc: Yonghong Song Cc: Paul E. McKenney Cc: Ingo Molnar Cc: Arnaldo Carvalho de Melo Cc: Mark Rutland Cc: Alexander Shishkin Cc: Jiri Olsa Cc: Namhyung Kim Cc: b...@vger.kernel.org Cc: Joel Fernandes --- Changes since v4: - Use DEFINE_INACTIVE_GUARD. - Add brackets to multiline 'if' statements. 
--- include/trace/trace_events.h | 64 ++-- kernel/trace/trace_events.c | 28 2 files changed, 83 insertions(+), 9 deletions(-) diff --git a/include/trace/trace_events.h b/include/trace/trace_events.h index df590eea8ae4..c887f7b6fbe9 100644 --- a/include/trace/trace_events.h +++ b/include/trace/trace_events.h @@ -45,6 +45,16 @@ PARAMS(print)); \ DEFINE_EVENT(name, name, PARAMS(proto), PARAMS(args)); +#undef TRACE_EVENT_MAY_FAULT +#define TRACE_EVENT_MAY_FAULT(name, proto, args, tstruct, assign, print) \ + DECLARE_EVENT_CLASS_MAY_FAULT(name,\ +PARAMS(proto),\ +PARAMS(args), \ +PARAMS(tstruct), \ +PARAMS(assign), \ +PARAMS(print)); \ + DEFINE_EVENT(name, name, PARAMS(proto), PARAMS(args)); + #include "stages/stage1_struct_define.h" #undef DECLARE_EVENT_CLASS @@ -57,6 +67,11 @@ \ static struct trace_event_class event_class_##name; +#undef DECLARE_EVENT_CLASS_MAY_FAULT +#define DECLARE_EVENT_CLASS_MAY_FAULT(name, proto, args, tstruct, assign, print) \ + DECLARE_EVENT_CLASS(name, PARAMS(proto), PARAMS(args), \ + PARAMS(tstruct), PARAMS(assign), PARAMS(print)) + #undef DEFINE_EVENT #define DEFINE_EVENT(template, name, proto, args) \ static struct trace_event_call __used \ @@ -80,7 +95,7 @@ #undef TRACE_EVENT_FN_MAY_FAULT #define TRACE_EVENT_FN_MAY_FAULT(name, proto, args, tstruct, \ assign, print, reg, unreg) \ - TRACE_EVENT(name, PARAMS(proto), PARAMS(args), \ + TRACE_EVENT_MAY_FAULT(name, PARAMS(proto), PARAMS(args),\ PARAMS(tstruct), PARAMS(assign), PARAMS(print)) \ #undef TRACE_EVENT_FN_COND @@ -123,6 +138,11 @@ tstruct;\ }; +#undef DECLARE_EVENT_CLASS_MAY_FAULT +#define DECLARE_EVENT_CLASS_MAY_FAULT(call, proto, args, tstruct, assign, print) \ + DECLARE_EVENT_CLASS(call, PARAMS(proto), PARAMS(args), \ + PARAMS(tstruct), PARAMS(assign), PARAMS(print)) + #undef DEFINE_EVENT #define DEFINE_EVENT(template, name, proto, args) @@ -214,6 +234,11 @@ static struct trace_event_functions trace_event_type_funcs_##call = { \ .trace = trace_raw_output_##call, \ }; +#undef 
DECLARE_EVENT_CLASS_MAY_FAULT +#define DECLARE_EVENT_CLASS_MAY_FAULT(call, proto, args, tstruct, assign, print) \ + DECLARE_EVENT_CLASS(call, PARAMS(proto), PARAMS(args), \ + PARAMS(tstruct), PARAMS(assign), PARAMS(print)) + #undef DEFINE_EVENT_PRINT #define DEFINE_EVENT_PRINT(template, call, proto, args, print) \ static notrace enum print_line_t \ @@ -250,6 +275,11 @@ static struct trace_event_fields trace_event_fields_##call[] = { \ tstruct \ {} }; +#undef DECLARE_EVENT_CLASS_MAY_FAULT +#define DECLARE_EVENT_CLASS_MAY_FAULT(call, proto, args, tstruct, assign, print) \ + DECLARE_EVENT_CLASS(call, PARAMS(proto), PARAMS(args), \ + PARAMS(tstruct), PARAMS(assign), PARAMS(print)) + #undef DEFINE_EVENT_PRINT #define DEFINE_EVENT_PRINT(template, name, proto, args, print) @@ -271,6 +301,11 @@ static inline notrace int trace_event_get_offsets_##call( \
[PATCH v5 6/8] tracing/bpf-trace: Add support for faultable tracepoints
In preparation for converting system call enter/exit instrumentation into faultable tracepoints, make sure that bpf can handle registering to such tracepoints by explicitly disabling preemption within the bpf tracepoint probes to respect the current expectations within bpf tracing code. This change does not yet allow bpf to take page faults per se within its probe, but allows its existing probes to connect to faultable tracepoints. Link: https://lore.kernel.org/lkml/20231002202531.3160-1-mathieu.desnoy...@efficios.com/ Co-developed-by: Michael Jeanson Signed-off-by: Mathieu Desnoyers Signed-off-by: Michael Jeanson Cc: Steven Rostedt Cc: Masami Hiramatsu Cc: Peter Zijlstra Cc: Alexei Starovoitov Cc: Yonghong Song Cc: Paul E. McKenney Cc: Ingo Molnar Cc: Arnaldo Carvalho de Melo Cc: Mark Rutland Cc: Alexander Shishkin Cc: Jiri Olsa Cc: Namhyung Kim Cc: b...@vger.kernel.org Cc: Joel Fernandes --- Changes since v4: - Use DEFINE_INACTIVE_GUARD. - Add brackets to multiline 'if' statements. --- include/trace/bpf_probe.h | 20 kernel/trace/bpf_trace.c | 12 +--- 2 files changed, 25 insertions(+), 7 deletions(-) diff --git a/include/trace/bpf_probe.h b/include/trace/bpf_probe.h index e609cd7da47e..96c1269dd88c 100644 --- a/include/trace/bpf_probe.h +++ b/include/trace/bpf_probe.h @@ -42,17 +42,29 @@ /* tracepoints with more than 12 arguments will hit build error */ #define CAST_TO_U64(...) 
CONCATENATE(__CAST, COUNT_ARGS(__VA_ARGS__))(__VA_ARGS__) -#define __BPF_DECLARE_TRACE(call, proto, args) \ +#define __BPF_DECLARE_TRACE(call, proto, args, tp_flags) \ static notrace void\ __bpf_trace_##call(void *__data, proto) \ { \ struct bpf_prog *prog = __data; \ + \ + DEFINE_INACTIVE_GUARD(preempt_notrace, bpf_trace_guard);\ + \ + if ((tp_flags) & TRACEPOINT_MAY_FAULT) {\ + might_fault(); \ + activate_guard(preempt_notrace, bpf_trace_guard)(); \ + } \ + \ CONCATENATE(bpf_trace_run, COUNT_ARGS(args))(prog, CAST_TO_U64(args)); \ } #undef DECLARE_EVENT_CLASS #define DECLARE_EVENT_CLASS(call, proto, args, tstruct, assign, print) \ - __BPF_DECLARE_TRACE(call, PARAMS(proto), PARAMS(args)) + __BPF_DECLARE_TRACE(call, PARAMS(proto), PARAMS(args), 0) + +#undef DECLARE_EVENT_CLASS_MAY_FAULT +#define DECLARE_EVENT_CLASS_MAY_FAULT(call, proto, args, tstruct, assign, print) \ + __BPF_DECLARE_TRACE(call, PARAMS(proto), PARAMS(args), TRACEPOINT_MAY_FAULT) /* * This part is compiled out, it is only here as a build time check @@ -106,13 +118,13 @@ static inline void bpf_test_buffer_##call(void) \ #undef DECLARE_TRACE #define DECLARE_TRACE(call, proto, args) \ - __BPF_DECLARE_TRACE(call, PARAMS(proto), PARAMS(args)) \ + __BPF_DECLARE_TRACE(call, PARAMS(proto), PARAMS(args), 0) \ __DEFINE_EVENT(call, call, PARAMS(proto), PARAMS(args), 0) #undef DECLARE_TRACE_WRITABLE #define DECLARE_TRACE_WRITABLE(call, proto, args, size) \ __CHECK_WRITABLE_BUF_SIZE(call, PARAMS(proto), PARAMS(args), size) \ - __BPF_DECLARE_TRACE(call, PARAMS(proto), PARAMS(args)) \ + __BPF_DECLARE_TRACE(call, PARAMS(proto), PARAMS(args), 0) \ __DEFINE_EVENT(call, call, PARAMS(proto), PARAMS(args), size) #include TRACE_INCLUDE(TRACE_INCLUDE_FILE) diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c index 192de33d961f..873b0e885677 100644 --- a/kernel/trace/bpf_trace.c +++ b/kernel/trace/bpf_trace.c @@ -2443,9 +2443,15 @@ static int __bpf_probe_register(struct bpf_raw_event_map *btp, struct bpf_prog 
* if (prog->aux->max_tp_access > btp->writable_size) return -EINVAL; - return tracepoint_probe_register_prio_flags(tp, (void *)btp->bpf_func, - prog, TRACEPOINT_DEFAULT_PRIO, - TRACEPOINT_MAY_EXIST); + if (tp->flags & TRACEPOINT_MAY_FAULT) { + return tracepoint_probe_register_prio_flags(tp, (void *)btp->bpf_func, + prog, TRACEPOINT_DEFAULT_PRIO, + TRACEPOINT_MAY_EXIST | TRACEPOINT_MAY_FAULT);
[PATCH v5 2/8] cleanup.h guard: Rename DEFINE_ prefix to DECLARE_
The convention used in other kernel headers (e.g. wait.h, percpu-defs.h) is to use "DECLARE_" prefix for macros emitting externs, static inlines and type definitions. The "DEFINE_" prefix is used for variable definitions. In preparation to introduce a "DEFINE_INACTIVE_GUARD()" to actually define a guard variable, rename all the guard "DEFINE_" prefix to "DECLARE_". Signed-off-by: Mathieu Desnoyers Cc: Peter Zijlstra (Intel) Cc: Ingo Molnar --- drivers/cxl/core/cdat.c | 2 +- drivers/cxl/cxl.h | 2 +- drivers/gpio/gpiolib.h | 2 +- drivers/platform/x86/intel/pmc/core_ssram.c | 2 +- fs/fuse/virtio_fs.c | 2 +- fs/pstore/inode.c | 4 +- include/linux/bitmap.h | 2 +- include/linux/cleanup.h | 56 ++--- include/linux/cpu.h | 2 +- include/linux/cpumask.h | 2 +- include/linux/device.h | 6 +-- include/linux/file.h| 4 +- include/linux/firmware.h| 2 +- include/linux/gpio/driver.h | 4 +- include/linux/iio/iio.h | 4 +- include/linux/irqflags.h| 4 +- include/linux/mutex.h | 6 +-- include/linux/of.h | 2 +- include/linux/pci.h | 4 +- include/linux/percpu.h | 2 +- include/linux/preempt.h | 6 +-- include/linux/rcupdate.h| 2 +- include/linux/rwsem.h | 10 ++-- include/linux/sched/task.h | 4 +- include/linux/slab.h| 4 +- include/linux/spinlock.h| 38 +++--- include/linux/srcu.h| 2 +- include/sound/pcm.h | 6 +-- kernel/sched/core.c | 4 +- kernel/sched/sched.h| 16 +++--- lib/locking-selftest.c | 12 ++--- sound/core/control_led.c| 2 +- 32 files changed, 110 insertions(+), 110 deletions(-) diff --git a/drivers/cxl/core/cdat.c b/drivers/cxl/core/cdat.c index bb83867d9fec..689143566642 100644 --- a/drivers/cxl/core/cdat.c +++ b/drivers/cxl/core/cdat.c @@ -385,7 +385,7 @@ static void discard_dsmas(struct xarray *xa) } xa_destroy(xa); } -DEFINE_FREE(dsmas, struct xarray *, if (_T) discard_dsmas(_T)) +DECLARE_FREE(dsmas, struct xarray *, if (_T) discard_dsmas(_T)) void cxl_endpoint_parse_cdat(struct cxl_port *port) { diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h index 036d17db68e0..89cadb029d31 
100644 --- a/drivers/cxl/cxl.h +++ b/drivers/cxl/cxl.h @@ -737,7 +737,7 @@ struct cxl_root *devm_cxl_add_root(struct device *host, const struct cxl_root_ops *ops); struct cxl_root *find_cxl_root(struct cxl_port *port); void put_cxl_root(struct cxl_root *cxl_root); -DEFINE_FREE(put_cxl_root, struct cxl_root *, if (_T) put_cxl_root(_T)) +DECLARE_FREE(put_cxl_root, struct cxl_root *, if (_T) put_cxl_root(_T)) int devm_cxl_enumerate_ports(struct cxl_memdev *cxlmd); void cxl_bus_rescan(void); diff --git a/drivers/gpio/gpiolib.h b/drivers/gpio/gpiolib.h index 8e0e211ebf08..17507a64c284 100644 --- a/drivers/gpio/gpiolib.h +++ b/drivers/gpio/gpiolib.h @@ -199,7 +199,7 @@ struct gpio_chip_guard { int idx; }; -DEFINE_CLASS(gpio_chip_guard, +DECLARE_CLASS(gpio_chip_guard, struct gpio_chip_guard, srcu_read_unlock(&_T.gdev->srcu, _T.idx), ({ diff --git a/drivers/platform/x86/intel/pmc/core_ssram.c b/drivers/platform/x86/intel/pmc/core_ssram.c index 1bde86c54eb9..115f16448406 100644 --- a/drivers/platform/x86/intel/pmc/core_ssram.c +++ b/drivers/platform/x86/intel/pmc/core_ssram.c @@ -29,7 +29,7 @@ #define LPM_REG_COUNT 28 #define LPM_MODE_OFFSET1 -DEFINE_FREE(pmc_core_iounmap, void __iomem *, iounmap(_T)); +DECLARE_FREE(pmc_core_iounmap, void __iomem *, iounmap(_T)); static u32 pmc_core_find_guid(struct pmc_info *list, const struct pmc_reg_map *map) { diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c index bb3e941b9503..d062bafb294a 100644 --- a/fs/fuse/virtio_fs.c +++ b/fs/fuse/virtio_fs.c @@ -852,7 +852,7 @@ static void virtio_fs_cleanup_dax(void *data) put_dax(dax_dev); } -DEFINE_FREE(cleanup_dax, struct dax_dev *, if (!IS_ERR_OR_NULL(_T)) virtio_fs_cleanup_dax(_T)) +DECLARE_FREE(cleanup_dax, struct dax_dev *, if (!IS_ERR_OR_NULL(_T)) virtio_fs_cleanup_dax(_T)) static int virtio_fs_setup_dax(struct virtio_device *vdev, struct virtio_fs *fs) { diff --git a/fs/pstore/inode.c b/fs/pstore/inode.c index 56815799ce79..f34da47d26d4 100644 --- a/fs/pstore/inode.c +++ 
b/fs/pstore/inode.c
[PATCH v5 4/8] tracing: Introduce faultable tracepoints
When invoked from system call enter/exit instrumentation, accessing user-space data is a common use-case for tracers. However, tracepoints currently disable preemption around iteration on the registered tracepoint probes and invocation of the probe callbacks, which prevents tracers from handling page faults. Extend the tracepoint and trace event APIs to allow defining a faultable tracepoint which invokes its callback with preemption enabled. Also extend the tracepoint API to allow tracers to request specific probes to be connected to those faultable tracepoints. When the TRACEPOINT_MAY_FAULT flag is provided on registration, the probe callback will be called with preemption enabled, and is allowed to take page faults. Faultable probes can only be registered on faultable tracepoints and non-faultable probes on non-faultable tracepoints. The tasks trace rcu mechanism is used to synchronize read-side marshalling of the registered probes with respect to faultable probes unregistration and teardown. Link: https://lore.kernel.org/lkml/20231002202531.3160-1-mathieu.desnoy...@efficios.com/ Co-developed-by: Michael Jeanson Signed-off-by: Mathieu Desnoyers Signed-off-by: Michael Jeanson Cc: Steven Rostedt Cc: Masami Hiramatsu Cc: Peter Zijlstra Cc: Alexei Starovoitov Cc: Yonghong Song Cc: Paul E. McKenney Cc: Ingo Molnar Cc: Arnaldo Carvalho de Melo Cc: Mark Rutland Cc: Alexander Shishkin Cc: Jiri Olsa Cc: Namhyung Kim Cc: b...@vger.kernel.org Cc: Joel Fernandes --- Changes since v1: - Cleanup __DO_TRACE() implementation. - Rename "sleepable tracepoints" to "faultable tracepoints", MAYSLEEP to MAYFAULT, and use might_fault() rather than might_sleep(), to properly convey that the tracepoints are meant to be able to take a page fault, which requires to be able to sleep *and* to hold the mmap_sem. Changes since v2: - Rename MAYFAULT to MAY_FAULT. - Rebased on 6.5.5. - Introduce MAY_EXIST tracepoint flag. Changes since v3: - Rebased on 6.6.2. 
Changes since v4: - Rebased on 6.9.6. - Simplify flag check in tracepoint_probe_register_prio_flags(). - Update MAY_EXIST flag description. --- include/linux/tracepoint-defs.h | 14 ++ include/linux/tracepoint.h | 88 +++-- include/trace/define_trace.h| 7 +++ include/trace/trace_events.h| 6 +++ init/Kconfig| 1 + kernel/trace/bpf_trace.c| 5 +- kernel/trace/trace_fprobe.c | 5 +- kernel/tracepoint.c | 65 ++-- 8 files changed, 136 insertions(+), 55 deletions(-) diff --git a/include/linux/tracepoint-defs.h b/include/linux/tracepoint-defs.h index 4dc4955f0fbf..94e39c86b49f 100644 --- a/include/linux/tracepoint-defs.h +++ b/include/linux/tracepoint-defs.h @@ -29,6 +29,19 @@ struct tracepoint_func { int prio; }; +/** + * enum tracepoint_flags - Tracepoint flags + * @TRACEPOINT_MAY_EXIST: On registration, don't warn if the tracepoint is + *already registered. + * @TRACEPOINT_MAY_FAULT: The tracepoint probe callback will be called with + *preemption enabled, and is allowed to take page + *faults. + */ +enum tracepoint_flags { + TRACEPOINT_MAY_EXIST = (1 << 0), + TRACEPOINT_MAY_FAULT = (1 << 1), +}; + struct tracepoint { const char *name; /* Tracepoint name */ struct static_key key; @@ -39,6 +52,7 @@ struct tracepoint { int (*regfunc)(void); void (*unregfunc)(void); struct tracepoint_func __rcu *funcs; + unsigned int flags; }; #ifdef CONFIG_HAVE_ARCH_PREL32_RELOCATIONS diff --git a/include/linux/tracepoint.h b/include/linux/tracepoint.h index 689b6d71590e..eaf8c00b30a3 100644 --- a/include/linux/tracepoint.h +++ b/include/linux/tracepoint.h @@ -18,6 +18,7 @@ #include #include #include +#include #include #include @@ -41,17 +42,10 @@ extern int tracepoint_probe_register_prio(struct tracepoint *tp, void *probe, void *data, int prio); extern int -tracepoint_probe_register_prio_may_exist(struct tracepoint *tp, void *probe, void *data, -int prio); +tracepoint_probe_register_prio_flags(struct tracepoint *tp, void *probe, void *data, + int prio, unsigned int flags); extern int 
tracepoint_probe_unregister(struct tracepoint *tp, void *probe, void *data); -static inline int -tracepoint_probe_register_may_exist(struct tracepoint *tp, void *probe, - void *data) -{ - return tracepoint_probe_register_prio_may_exist(tp, probe, data, - TRACEPOINT_DEFAULT_PRIO); -} extern void for_each_kernel_tracepoint(void (*fct)(struct tracepoint *tp, void *priv), void *priv); @@ -90,6 +84,7 @@ int unregister_tracepoint_module_notifier(struct notifier_
[PATCH v5 0/8] Faultable Tracepoints
Wire up the system call tracepoints with Tasks Trace RCU to allow the ftrace, perf, and eBPF tracers to handle page faults. This series does the initial wire-up allowing tracers to handle page faults, but leaves out the actual handling of said page faults as future work. I have tested this against a feature branch of lttng-modules which implements handling of page faults for the filename argument of the openat(2) system call. This v5 addresses comments from the previous round of review [1]. Steven Rostedt suggested separating tracepoints into two separate sections. It is unclear how that approach would prove to be an improvement over the currently proposed approach, so those changes were not incorporated. See [2] for my detailed reply. In the previous round, Peter Zijlstra suggested use of SRCU rather than Tasks Trace RCU. See my reply about the distinction between SRCU and Tasks Trace RCU [3] and this explanation from Paul E. McKenney about the purpose of Tasks Trace RCU [4]. The macros DEFINE_INACTIVE_GUARD and activate_guard are added to cleanup.h for use in the __DO_TRACE() macro. Those appear to be more flexible than the guard_if() proposed by Peter Zijlstra in the previous round of review [5]. This series is based on kernel v6.9.6. Thanks, Mathieu Link: https://lore.kernel.org/lkml/20231120205418.334172-1-mathieu.desnoy...@efficios.com/ # [1] Link: https://lore.kernel.org/lkml/e4e9a2bc-1776-4b51-aba4-a147795a5...@efficios.com/ # [2] Link: https://lore.kernel.org/lkml/a0ac5f77-411e-4562-9863-81196238f...@efficios.com/ # [3] Link: https://lore.kernel.org/lkml/ba543d44-9302-4115-ac4f-d4e9f8d98a90@paulmck-laptop/ # [4] Link: https://lore.kernel.org/lkml/20231120221524.gd8...@noisy.programming.kicks-ass.net/ # [5] Cc: Peter Zijlstra Cc: Alexei Starovoitov Cc: Yonghong Song Cc: Paul E. 
McKenney Cc: Ingo Molnar Cc: Arnaldo Carvalho de Melo Cc: Mark Rutland Cc: Alexander Shishkin Cc: Jiri Olsa Cc: Namhyung Kim Cc: b...@vger.kernel.org Cc: Joel Fernandes Mathieu Desnoyers (8): cleanup.h: Header include guard should match header name cleanup.h guard: Rename DEFINE_ prefix to DECLARE_ cleanup.h: Introduce DEFINE_INACTIVE_GUARD and activate_guard tracing: Introduce faultable tracepoints tracing/ftrace: Add support for faultable tracepoints tracing/bpf-trace: Add support for faultable tracepoints tracing/perf: Add support for faultable tracepoints tracing: Convert sys_enter/exit to faultable tracepoints drivers/cxl/core/cdat.c | 2 +- drivers/cxl/cxl.h | 2 +- drivers/gpio/gpiolib.h | 2 +- drivers/platform/x86/intel/pmc/core_ssram.c | 2 +- fs/fuse/virtio_fs.c | 2 +- fs/pstore/inode.c | 4 +- include/linux/bitmap.h | 2 +- include/linux/cleanup.h | 85 include/linux/cpu.h | 2 +- include/linux/cpumask.h | 2 +- include/linux/device.h | 6 +- include/linux/file.h| 4 +- include/linux/firmware.h| 2 +- include/linux/gpio/driver.h | 4 +- include/linux/iio/iio.h | 4 +- include/linux/irqflags.h| 4 +- include/linux/mutex.h | 6 +- include/linux/of.h | 2 +- include/linux/pci.h | 4 +- include/linux/percpu.h | 2 +- include/linux/preempt.h | 6 +- include/linux/rcupdate.h| 2 +- include/linux/rwsem.h | 10 +-- include/linux/sched/task.h | 4 +- include/linux/slab.h| 4 +- include/linux/spinlock.h| 38 - include/linux/srcu.h| 2 +- include/linux/tracepoint-defs.h | 14 include/linux/tracepoint.h | 88 +++-- include/sound/pcm.h | 6 +- include/trace/bpf_probe.h | 20 - include/trace/define_trace.h| 7 ++ include/trace/events/syscalls.h | 4 +- include/trace/perf.h| 22 +- include/trace/trace_events.h| 68 +++- init/Kconfig| 1 + kernel/sched/core.c | 4 +- kernel/sched/sched.h| 16 ++-- kernel/trace/bpf_trace.c| 11 ++- kernel/trace/trace_events.c | 28 +-- kernel/trace/trace_fprobe.c | 5 +- kernel/trace/trace_syscalls.c | 52 ++-- kernel/tracepoint.c | 65 +-- lib/locking-selftest.c | 12 +-- 
sound/core/control_led.c| 2 +- 45 files changed, 441
[PATCH v5 3/8] cleanup.h: Introduce DEFINE_INACTIVE_GUARD and activate_guard
To cover scenarios where the scope of the guard differs from the scope of its activation, introduce DEFINE_INACTIVE_GUARD() and activate_guard(). Here is an example use for a conditionally activated guard variable:

void func(bool a)
{
	DEFINE_INACTIVE_GUARD(preempt_notrace, myguard);

	[...]
	if (a) {
		might_sleep();
		activate_guard(preempt_notrace, myguard)();
	}
	[ protected code ]
}

Signed-off-by: Mathieu Desnoyers
Cc: Peter Zijlstra (Intel)
Cc: Ingo Molnar
Cc: Linus Torvalds
Cc: Kees Cook
Cc: Greg KH
Cc: Sean Christopherson
---
 include/linux/cleanup.h | 23 ++-
 1 file changed, 22 insertions(+), 1 deletion(-)

diff --git a/include/linux/cleanup.h b/include/linux/cleanup.h
index 04f03ad5f25d..d6a3d8099d77 100644
--- a/include/linux/cleanup.h
+++ b/include/linux/cleanup.h
@@ -146,12 +146,20 @@ static inline class_##_name##_t class_##_name##ext##_constructor(_init_args) \
  * similar to scoped_guard(), except it does fail when the lock
  * acquire fails.
  *
+ * DEFINE_INACTIVE_GUARD(name, var):
+ *	define an inactive guard variable in a given scope, initialized to NULL.
+ *
+ * activate_guard(name, var)(args...):
+ *	activate a guard variable with its constructor, if it is not already
+ *	activated.
  */

 #define DECLARE_GUARD(_name, _type, _lock, _unlock) \
 	DECLARE_CLASS(_name, _type, if (_T) { _unlock; }, ({ _lock; _T; }), _type _T); \
 	static inline void * class_##_name##_lock_ptr(class_##_name##_t *_T) \
-	{ return *_T; }
+	{ return *_T; } \
+	static inline class_##_name##_t class_##_name##_null(void) \
+	{ return NULL; }

 #define DECLARE_GUARD_COND(_name, _ext, _condlock) \
 	EXTEND_CLASS(_name, _ext, \
@@ -175,6 +183,14 @@ static inline class_##_name##_t class_##_name##ext##_constructor(_init_args) \
 	if (!__guard_ptr(_name)(&scope)) _fail; \
 	else

+#define DEFINE_INACTIVE_GUARD(_name, _var) \
+	class_##_name##_t _var __cleanup(class_##_name##_destructor) = \
+		class_##_name##_null()
+
+#define activate_guard(_name, _var) \
+	if (!class_##_name##_lock_ptr(&(_var))) \
+		_var = class_##_name##_constructor
+
 /*
  * Additional helper macros for generating lock guards with types, either for
  * locks that don't have a native type (eg. RCU, preempt) or those that need a
@@ -209,6 +225,11 @@ static inline void class_##_name##_destructor(class_##_name##_t *_T) \
 static inline void *class_##_name##_lock_ptr(class_##_name##_t *_T)	\
 {									\
 	return _T->lock;						\
+}									\
+static inline class_##_name##_t class_##_name##_null(void)		\
+{									\
+	class_##_name##_t _t = { .lock = NULL };			\
+	return _t;							\
 }
-- 
2.39.2
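For readers unfamiliar with the cleanup-attribute machinery underlying these macros, the pattern can be reproduced in a stand-alone userspace sketch. Everything below is an illustrative mock under simplifying assumptions: `mock_lock`/`mock_unlock` stand in for the real lock class, and the one-argument macro variants are simplified — the real kernel macros are class-based as shown in the diff above:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

int lock_depth;        /* stand-in for "preemption disabled" nesting */
int probe_depth_seen;  /* records what the protected code observed */

void mock_lock(void)   { lock_depth++; }
void mock_unlock(void) { lock_depth--; }

/* cleanup runs at scope exit; a NULL guard means "never activated" */
typedef void *guard_t;
void guard_cleanup(guard_t *g) { if (*g) mock_unlock(); }

#define DEFINE_INACTIVE_GUARD(var) \
	guard_t var __attribute__((cleanup(guard_cleanup))) = NULL

#define activate_guard(var) \
	do { if (!(var)) { mock_lock(); (var) = &(var); } } while (0)

void func(bool a)
{
	DEFINE_INACTIVE_GUARD(myguard);

	if (a)
		activate_guard(myguard);

	probe_depth_seen = lock_depth; /* [ protected code ] runs here */
}
```

The key property matches the commit message: the guard's scope (the whole function body) differs from the scope of its activation (the `if (a)` branch), and the destructor only fires if the guard was actually activated.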
[PATCH v5 1/8] cleanup.h: Header include guard should match header name
The include guard should match the header name. Rename __LINUX_GUARDS_H to __LINUX_CLEANUP_H. Signed-off-by: Mathieu Desnoyers Cc: Peter Zijlstra (Intel) Cc: Ingo Molnar --- include/linux/cleanup.h | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/include/linux/cleanup.h b/include/linux/cleanup.h index c2d09bc4f976..4cf8ad5d27a3 100644 --- a/include/linux/cleanup.h +++ b/include/linux/cleanup.h @@ -1,6 +1,6 @@ /* SPDX-License-Identifier: GPL-2.0 */ -#ifndef __LINUX_GUARDS_H -#define __LINUX_GUARDS_H +#ifndef __LINUX_CLEANUP_H +#define __LINUX_CLEANUP_H #include @@ -247,4 +247,4 @@ __DEFINE_LOCK_GUARD_0(_name, _lock) { return class_##_name##_lock_ptr(_T); } -#endif /* __LINUX_GUARDS_H */ +#endif /* __LINUX_CLEANUP_H */ -- 2.39.2
[RFC PATCH 1/4] kernel/reboot: Introduce pre_restart notifiers
Introduce a new pre_restart notifier chain for callbacks that need to be executed after the system has been made quiescent with syscore_shutdown(), before machine restart. This pre_restart notifier chain should be invoked on machine restart and on emergency machine restart. The use-case for this new notifier chain is to preserve tracing data within pmem areas on systems where the BIOS does not clear memory across warm reboots.

Why do we need a new notifier chain?

1) The reboot and restart_prepare notifiers are called too early in the reboot sequence: they are invoked before syscore_shutdown(), which leaves other CPUs actively running threads while those notifiers are invoked.

2) The "restart" notifier is meant to trigger the actual machine restart, and is not meant to be invoked as a last step immediately before restart. It is also not always used: some architecture code chooses to bypass this restart notifier and reboot directly from the architecture code.

Wiring up the architecture code to call this notifier chain is left to follow-up arch-specific patches.

Signed-off-by: Mathieu Desnoyers Cc: Dan Williams Cc: Vishal Verma Cc: Dave Jiang Cc: Ira Weiny Cc: Steven Rostedt Cc: nvd...@lists.linux.dev Cc: Thomas Gleixner Cc: Ingo Molnar Cc: Borislav Petkov Cc: Dave Hansen Cc: x...@kernel.org Cc: "H.
Peter Anvin" Cc: Catalin Marinas Cc: Will Deacon Cc: linux-arm-ker...@lists.infradead.org
---
 include/linux/reboot.h |  4 ++++
 kernel/reboot.c        | 51 ++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 55 insertions(+)

diff --git a/include/linux/reboot.h b/include/linux/reboot.h
index abcdde4df697..c7f340e81451 100644
--- a/include/linux/reboot.h
+++ b/include/linux/reboot.h
@@ -50,6 +50,10 @@ extern int register_restart_handler(struct notifier_block *);
 extern int unregister_restart_handler(struct notifier_block *);
 extern void do_kernel_restart(char *cmd);

+extern int register_pre_restart_handler(struct notifier_block *);
+extern int unregister_pre_restart_handler(struct notifier_block *);
+extern void do_kernel_pre_restart(char *cmd);
+
 /*
  * Architecture-specific implementations of sys_reboot commands.
  */
diff --git a/kernel/reboot.c b/kernel/reboot.c
index 22c16e2564cc..b7287dd48d35 100644
--- a/kernel/reboot.c
+++ b/kernel/reboot.c
@@ -235,6 +235,57 @@ void do_kernel_restart(char *cmd)
 	atomic_notifier_call_chain(&restart_handler_list, reboot_mode, cmd);
 }

+/*
+ * Notifier list for kernel code which wants to be called immediately
+ * before restarting the system.
+ */
+static ATOMIC_NOTIFIER_HEAD(pre_restart_handler_list);
+
+/**
+ * register_pre_restart_handler - Register function to be called in preparation
+ *				  to reset the system
+ * @nb: Info about handler function to be called
+ *
+ * Registers a function with code to be called in preparation to restart
+ * the system.
+ *
+ * Currently always returns zero, as atomic_notifier_chain_register()
+ * always returns zero.
+ */
+int register_pre_restart_handler(struct notifier_block *nb)
+{
+	return atomic_notifier_chain_register(&pre_restart_handler_list, nb);
+}
+EXPORT_SYMBOL(register_pre_restart_handler);
+
+/**
+ * unregister_pre_restart_handler - Unregister previously registered
+ *				    pre-restart handler
+ * @nb: Hook to be unregistered
+ *
+ * Unregisters a previously registered pre-restart handler function.
+ *
+ * Returns zero on success, or %-ENOENT on failure.
+ */
+int unregister_pre_restart_handler(struct notifier_block *nb)
+{
+	return atomic_notifier_chain_unregister(&pre_restart_handler_list, nb);
+}
+EXPORT_SYMBOL(unregister_pre_restart_handler);
+
+/**
+ * do_kernel_pre_restart - Execute kernel pre-restart handler call chain
+ * @cmd: pointer to buffer containing command to execute for restart,
+ *	 or %NULL
+ *
+ * Calls functions registered with register_pre_restart_handler.
+ *
+ * Expected to be called from machine_restart and
+ * machine_emergency_restart before invoking the restart handlers.
+ */
+void do_kernel_pre_restart(char *cmd)
+{
+	atomic_notifier_call_chain(&pre_restart_handler_list, reboot_mode, cmd);
+}
+
 void migrate_to_reboot_cpu(void)
 {
 	/* The boot cpu is always logical cpu 0 */
-- 
2.39.2
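The pre_restart chain reuses the kernel's atomic notifier infrastructure: registration inserts into a priority-ordered singly linked list, and the call walks it. A minimal userspace sketch of that register/call pattern — simplified, with no RCU or locking; `flush_handler` is a hypothetical stand-in mirroring the pmem flush-before-restart use-case of patch 2:

```c
#include <assert.h>
#include <stddef.h>

#define NOTIFY_DONE 0

struct notifier_block {
	int (*notifier_call)(struct notifier_block *nb, unsigned long ev, void *data);
	struct notifier_block *next;
	int priority;
};

/* insert in descending priority order, like the kernel's notifier chains */
int chain_register(struct notifier_block **head, struct notifier_block *nb)
{
	while (*head && (*head)->priority > nb->priority)
		head = &(*head)->next;
	nb->next = *head;
	*head = nb;
	return 0;
}

/* invoke every handler in order, returning the last handler's verdict */
int chain_call(struct notifier_block *head, unsigned long ev, void *data)
{
	int ret = NOTIFY_DONE;

	for (; head; head = head->next)
		ret = head->notifier_call(head, ev, data);
	return ret;
}

/* hypothetical handler mirroring the pmem flush-before-restart use-case */
int flushed;
int flush_handler(struct notifier_block *nb, unsigned long ev, void *data)
{
	(void)nb; (void)ev; (void)data;
	flushed = 1;	/* the real handler would call arch_wb_cache_pmem() */
	return NOTIFY_DONE;
}

struct notifier_block *pre_restart_list;
struct notifier_block flush_nb = { .notifier_call = flush_handler, .priority = 0 };
```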
[RFC PATCH 3/4] arm64: Invoke pre_restart notifiers
Invoke the pre_restart notifiers after shutdown, before machine restart. This allows preserving pmem memory across warm reboots. Invoke the pre_restart notifiers before emergency machine restart as well to cover the panic() scenario. Signed-off-by: Mathieu Desnoyers Cc: Dan Williams Cc: Vishal Verma Cc: Dave Jiang Cc: Ira Weiny Cc: Steven Rostedt Cc: nvd...@lists.linux.dev Cc: Catalin Marinas Cc: Will Deacon Cc: linux-arm-ker...@lists.infradead.org --- arch/arm64/kernel/process.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/arch/arm64/kernel/process.c b/arch/arm64/kernel/process.c index 4ae31b7af6c3..4a27397617fb 100644 --- a/arch/arm64/kernel/process.c +++ b/arch/arm64/kernel/process.c @@ -129,6 +129,8 @@ void machine_restart(char *cmd) local_irq_disable(); smp_send_stop(); + do_kernel_pre_restart(cmd); + /* * UpdateCapsule() depends on the system being reset via * ResetSystem(). -- 2.39.2
[RFC PATCH 2/4] nvdimm/pmem: Flush to memory before machine restart
Register pre-restart notifiers to flush pmem areas from CPU data cache to memory on reboot, immediately before restarting the machine. This ensures all other CPUs are quiescent before the pmem data is flushed to memory. I did an earlier POC that flushed caches on panic/die oops notifiers [1], but it did not cover the reboot case. I've been made aware that some distribution vendors have started shipping their own modified version of my earlier POC patch. This makes a strong argument for upstreaming this work. Use the newly introduced "pre-restart" notifiers to flush pmem data to memory immediately before machine restart. Delta from my POC patch [1]: Looking at the panic() code, it invokes emergency_restart() to restart the machine, which uses the new pre-restart notifiers. There is therefore no need to hook into panic handlers explicitly. Looking at the die notifiers, those don't actually end up triggering a machine restart, so it does not appear to be relevant to flush pmem to memory there. I must admit I originally looked at how ftrace hooked into panic/die-oops handlers for its ring buffers, but the use-case is different here: we only want to cover machine restart use-cases. Link: https://lore.kernel.org/linux-kernel/f6067e3e-a2bc-483d-b214-6e3fe6691...@efficios.com/ [1] Signed-off-by: Mathieu Desnoyers Cc: Dan Williams Cc: Vishal Verma Cc: Dave Jiang Cc: Ira Weiny Cc: Steven Rostedt Cc: nvd...@lists.linux.dev Cc: Thomas Gleixner Cc: Ingo Molnar Cc: Borislav Petkov Cc: Dave Hansen Cc: x...@kernel.org Cc: "H.
Peter Anvin" Cc: Catalin Marinas Cc: Will Deacon Cc: linux-arm-ker...@lists.infradead.org
---
 drivers/nvdimm/pmem.c | 29 -
 drivers/nvdimm/pmem.h |  2 ++
 2 files changed, 30 insertions(+), 1 deletion(-)

diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 598fe2e89bda..bf1d187a9dca 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -26,12 +26,16 @@
 #include
 #include
 #include
+#include
 #include
 #include
 #include "pmem.h"
 #include "btt.h"
 #include "pfn.h"
 #include "nd.h"

+static int pmem_pre_restart_handler(struct notifier_block *self,
+				    unsigned long ev, void *unused);
+
 static struct device *to_dev(struct pmem_device *pmem)
 {
 	/*
@@ -423,6 +427,7 @@ static void pmem_release_disk(void *__pmem)
 {
 	struct pmem_device *pmem = __pmem;

+	unregister_pre_restart_handler(&pmem->pre_restart_notifier);
 	dax_remove_host(pmem->disk);
 	kill_dax(pmem->dax_dev);
 	put_dax(pmem->dax_dev);
@@ -575,9 +580,14 @@ static int pmem_attach_disk(struct device *dev,
 			goto out_cleanup_dax;
 		dax_write_cache(dax_dev, nvdimm_has_cache(nd_region));
 	}
-	rc = device_add_disk(dev, disk, pmem_attribute_groups);
+	pmem->pre_restart_notifier.notifier_call = pmem_pre_restart_handler;
+	pmem->pre_restart_notifier.priority = 0;
+	rc = register_pre_restart_handler(&pmem->pre_restart_notifier);
 	if (rc)
 		goto out_remove_host;
+	rc = device_add_disk(dev, disk, pmem_attribute_groups);
+	if (rc)
+		goto out_unregister_pre_restart;
 	if (devm_add_action_or_reset(dev, pmem_release_disk, pmem))
 		return -ENOMEM;
@@ -589,6 +599,8 @@ static int pmem_attach_disk(struct device *dev,
 		dev_warn(dev, "'badblocks' notification disabled\n");
 	return 0;
+out_unregister_pre_restart:
+	unregister_pre_restart_handler(&pmem->pre_restart_notifier);
 out_remove_host:
 	dax_remove_host(pmem->disk);
 out_cleanup_dax:
@@ -751,6 +763,21 @@ static void nd_pmem_notify(struct device *dev, enum nvdimm_event event)
 	}
 }

+/*
+ * For volatile memory use-cases where explicit flushing of the data cache is
+ * not useful after stores, the pmem pre-restart notifier is called on
+ * preparation for restart to make sure the content of the pmem memory area is
+ * flushed from data cache to memory, so it can be preserved across warm
+ * reboot.
+ */
+static int pmem_pre_restart_handler(struct notifier_block *self,
+				    unsigned long ev, void *unused)
+{
+	struct pmem_device *pmem = container_of(self, struct pmem_device, pre_restart_notifier);
+
+	arch_wb_cache_pmem(pmem->virt_addr, pmem->size);
+	return NOTIFY_DONE;
+}
+
 MODULE_ALIAS("pmem");
 MODULE_ALIAS_ND_DEVICE(ND_DEVICE_NAMESPACE_IO);
 MODULE_ALIAS_ND_DEVICE(ND_DEVICE_NAMESPACE_PMEM);
diff --git a/drivers/nvdimm/pmem.h b/drivers/nvdimm/pmem.h
index 392b0b38acb9..b8a2a518cf82 100644
--- a/drivers/nvdimm/pmem.h
+++ b/drivers/nvdimm/pmem.h
@@ -4,6 +4,7 @@
 #include
 #include
 #include
+#include
 #include
 #include
 #include
@@ -27,6 +28,7 @@ struct pmem_device {
 	struct dax_device *dax_dev;
 	struct gendisk
[RFC PATCH 0/4] Flush nvdimm/pmem to memory before machine restart
Introduce a new pre_restart notifier chain for callbacks that need to be executed after the system has been made quiescent with syscore_shutdown(), before machine restart. Register pre-restart notifiers to flush pmem areas from CPU data cache to memory on reboot, immediately before restarting the machine. This ensures all other CPUs are quiescent before the pmem data is flushed to memory. The use-case for this new notifier chain is to preserve tracing data within pmem areas on systems where the BIOS does not clear memory across warm reboots. I did an earlier POC that flushed caches on panic/die oops notifiers [1], but it did not cover the reboot case. I've been made aware that some distribution vendors have started shipping their own modified version of my earlier POC patch. This makes a strong argument for upstreaming this work. Link: https://lore.kernel.org/linux-kernel/f6067e3e-a2bc-483d-b214-6e3fe6691...@efficios.com/ [1] Cc: Vishal Verma Cc: Dave Jiang Cc: Ira Weiny Cc: nvd...@lists.linux.dev Cc: Thomas Gleixner Cc: Ingo Molnar Cc: Borislav Petkov Cc: Dave Hansen Cc: x...@kernel.org Cc: "H. Peter Anvin" Cc: Catalin Marinas Cc: Will Deacon Cc: linux-arm-ker...@lists.infradead.org Mathieu Desnoyers (4): kernel/reboot: Introduce pre_restart notifiers nvdimm/pmem: Flush to memory before machine restart arm64: Invoke pre_restart notifiers x86: Invoke pre_restart notifiers arch/arm64/kernel/process.c | 2 ++ arch/x86/kernel/reboot.c| 7 +++-- drivers/nvdimm/pmem.c | 29 - drivers/nvdimm/pmem.h | 2 ++ include/linux/reboot.h | 4 +++ kernel/reboot.c | 51 + 6 files changed, 92 insertions(+), 3 deletions(-) -- 2.39.2
[RFC PATCH 4/4] x86: Invoke pre_restart notifiers
Invoke the pre_restart notifiers after shutdown, before machine restart. This allows preserving pmem memory across warm reboots. Invoke the pre_restart notifiers on emergency_machine_restart to cover the panic() scenario. Signed-off-by: Mathieu Desnoyers Cc: Dan Williams Cc: Vishal Verma Cc: Dave Jiang Cc: Ira Weiny Cc: Steven Rostedt Cc: nvd...@lists.linux.dev Cc: Thomas Gleixner Cc: Ingo Molnar Cc: Borislav Petkov Cc: Dave Hansen Cc: x...@kernel.org Cc: "H. Peter Anvin" --- arch/x86/kernel/reboot.c | 7 +-- 1 file changed, 5 insertions(+), 2 deletions(-) diff --git a/arch/x86/kernel/reboot.c b/arch/x86/kernel/reboot.c index f3130f762784..222619fa63c6 100644 --- a/arch/x86/kernel/reboot.c +++ b/arch/x86/kernel/reboot.c @@ -631,8 +631,10 @@ static void native_machine_emergency_restart(void) int orig_reboot_type = reboot_type; unsigned short mode; - if (reboot_emergency) + if (reboot_emergency) { + do_kernel_pre_restart(NULL); emergency_reboot_disable_virtualization(); + } tboot_shutdown(TB_SHUTDOWN_REBOOT); @@ -760,12 +762,13 @@ static void __machine_emergency_restart(int emergency) machine_ops.emergency_restart(); } -static void native_machine_restart(char *__unused) +static void native_machine_restart(char *cmd) { pr_notice("machine restart\n"); if (!reboot_force) machine_shutdown(); + do_kernel_pre_restart(cmd); __machine_emergency_restart(0); } -- 2.39.2
Re: [PATCH 0/3] Tracepoints and static branch/call in Rust
On 2024-06-06 11:46, Alice Ryhl wrote: On Thu, Jun 6, 2024 at 5:25 PM Mathieu Desnoyers wrote: On 2024-06-06 11:05, Alice Ryhl wrote: This implementation implements support for static keys in Rust so that the actual static branch will end up in the Rust object file. However, it would also be possible to just wrap the trace_##name generated by __DECLARE_TRACE in an extern C function and then call that from Rust. This will simplify the Rust code by removing the need for static branches and calls, but it places the static branch behind an external call, which has performance implications. The tracepoints try very hard to minimize overhead of dormant tracepoints so it is not frowned-upon to have them built into production binaries. This is needed to make sure distribution vendors keep those tracepoints in the kernel binaries that reach end-users. Adding a function call before evaluation of the static branch goes against this major goal. A possible middle ground would be to place just the __DO_TRACE body in an extern C function and to implement the Rust wrapper by doing the static branch in Rust, and then calling into C the code that contains __DO_TRACE when the tracepoint is active. However, this would need some changes to include/linux/tracepoint.h to generate and export a function containing the body of __DO_TRACE when the tracepoint should be callable from Rust. This tradeoff is more acceptable than having a function call before evaluation of the static branch, but I wonder what is the upside of this tradeoff compared to inlining the whole __DO_TRACE in Rust ? So in general, there is a tradeoff between placing parts of the tracepoint (which is perf sensitive) behind an external call, and having code duplicated in both C and Rust (which must be kept in sync when changes are made). This is an important point that I would like feedback on from the C maintainers. 
I don't see how the duplication happens there: __DO_TRACE is meant to be inlined into each C tracepoint caller site, so the code is already meant to be duplicated. Having an explicit function wrapping the tracepoint for Rust would just create an extra instance of __DO_TRACE if it happens to be also inlined into C code. Or do you mean you would like to prevent having to duplicate the implementation of __DO_TRACE in both C and Rust? I'm not sure if you mean to prevent source code duplication between C and Rust or duplication of binary code (instructions). It's a question of maintenance burden. If you change how __DO_TRACE is implemented, then those changes must also be reflected in the Rust version. There's no issue in the binary code. As long as it is only __DO_TRACE that is duplicated between C and Rust, I don't see it as a large burden. Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com
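The inlining argument in this thread can be made concrete with a userspace sketch of why the enabled-check should live at the call site rather than behind an external call. This is a rough approximation: a plain bool stands in for the static key, whereas the kernel patches the branch itself (asm goto / code patching), so a dormant tracepoint in the kernel avoids even the conditional load shown here:

```c
#include <assert.h>
#include <stdbool.h>

/* stand-in for a static key; the real kernel key is code-patched */
bool tracepoint_enabled;
int probe_hits;

void __trace_probe(int arg) { probe_hits += arg; }

/* fast path inlined at every call site: check first, call out only if enabled */
static inline void trace_myevent(int arg)
{
	if (__builtin_expect(tracepoint_enabled, 0))
		__trace_probe(arg);
}
```

Wrapping `trace_myevent()` itself in an out-of-line function would force the caller to pay the call overhead even while the tracepoint is dormant, which is the cost the thread argues against.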
Re: [PATCH 3/3] rust: add tracepoint support
On 2024-06-06 12:16, Alice Ryhl wrote: On Thu, Jun 6, 2024 at 5:29 PM Mathieu Desnoyers wrote: On 2024-06-06 11:05, Alice Ryhl wrote: Make it possible to have Rust code call into tracepoints defined by C code. It is still required that the tracepoint is declared in a C header, and that this header is included in the input to bindgen. Signed-off-by: Alice Ryhl [...] diff --git a/rust/helpers.c b/rust/helpers.c index 2c37a0f5d7a8..0560cc2a512a 100644 --- a/rust/helpers.c +++ b/rust/helpers.c @@ -165,6 +165,30 @@ rust_helper_krealloc(const void *objp, size_t new_size, gfp_t flags) } EXPORT_SYMBOL_GPL(rust_helper_krealloc); +void rust_helper_preempt_enable_notrace(void) +{ + preempt_enable_notrace(); +} +EXPORT_SYMBOL_GPL(rust_helper_preempt_enable_notrace); + +void rust_helper_preempt_disable_notrace(void) +{ + preempt_disable_notrace(); +} +EXPORT_SYMBOL_GPL(rust_helper_preempt_disable_notrace); + +bool rust_helper_current_cpu_online(void) +{ + return cpu_online(raw_smp_processor_id()); +} +EXPORT_SYMBOL_GPL(rust_helper_current_cpu_online); + +void *rust_helper___rcu_dereference_raw(void **p) +{ + return rcu_dereference_raw(p); +} +EXPORT_SYMBOL_GPL(rust_helper___rcu_dereference_raw); Ouch. Doing a function call for each of those small operations will have a rather large performance impact when tracing is active. If it is not possible to inline those in Rust, then implementing __DO_TRACE in a C function would at least allow Rust to only do a single call to C rather than go back and forth between Rust and C. What prevents inlining those helpers in Rust ? There's nothing fundamental that prevents it. But they contain a large amount of architecture specific code, which makes them a significant amount of work to reimplement in Rust. For example, rcu_dereference_raw calls into READ_ONCE. 
READ_ONCE is usually a volatile load, but under arm64+LTO, you get a bunch of inline assembly that relies on ALTERNATIVE for feature detection: https://elixir.bootlin.com/linux/v6.9/source/arch/arm64/include/asm/rwonce.h#L36 And preempt_enable_notrace has a similar story. The solution that Boqun mentions is nice, but it relies on rustc and clang using the same version of LLVM. You are unlikely to have compilers with matching LLVMs unless you intentionally take steps to make that happen. But yes, perhaps these helpers are an argument to have a single call for __DO_TRACE instead. If those helpers end up being inlined into Rust with the solution pointed to by Boqun, then it makes sense to implement __DO_TRACE in Rust. Otherwise doing a single call to C would be more efficient than calling each of the helpers individually. Thanks, Mathieu Alice -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com
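For reference, the generic (non-arm64+LTO) flavor of READ_ONCE discussed above boils down to forcing a single volatile load. A rough userspace approximation, not the kernel's exact definition:

```c
#include <assert.h>

/* force exactly one load that the compiler may not tear, fuse, or re-read */
#define READ_ONCE(x) (*(const volatile __typeof__(x) *)&(x))

int shared = 42;

int read_shared(void)
{
	return READ_ONCE(shared);
}
```

The arm64 rwonce.h version linked above replaces this volatile access with inline assembly selected via ALTERNATIVE, which is what makes a faithful Rust reimplementation laborious.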
Re: [PATCH 3/3] rust: add tracepoint support
On 2024-06-06 11:49, Boqun Feng wrote: On Thu, Jun 06, 2024 at 11:30:03AM -0400, Mathieu Desnoyers wrote: On 2024-06-06 11:05, Alice Ryhl wrote: Make it possible to have Rust code call into tracepoints defined by C code. It is still required that the tracepoint is declared in a C header, and that this header is included in the input to bindgen. Signed-off-by: Alice Ryhl [...] diff --git a/rust/helpers.c b/rust/helpers.c index 2c37a0f5d7a8..0560cc2a512a 100644 --- a/rust/helpers.c +++ b/rust/helpers.c @@ -165,6 +165,30 @@ rust_helper_krealloc(const void *objp, size_t new_size, gfp_t flags) } EXPORT_SYMBOL_GPL(rust_helper_krealloc); +void rust_helper_preempt_enable_notrace(void) +{ + preempt_enable_notrace(); +} +EXPORT_SYMBOL_GPL(rust_helper_preempt_enable_notrace); + +void rust_helper_preempt_disable_notrace(void) +{ + preempt_disable_notrace(); +} +EXPORT_SYMBOL_GPL(rust_helper_preempt_disable_notrace); + +bool rust_helper_current_cpu_online(void) +{ + return cpu_online(raw_smp_processor_id()); +} +EXPORT_SYMBOL_GPL(rust_helper_current_cpu_online); + +void *rust_helper___rcu_dereference_raw(void **p) +{ + return rcu_dereference_raw(p); +} +EXPORT_SYMBOL_GPL(rust_helper___rcu_dereference_raw); Ouch. Doing a function call for each of those small operations will have a rather large performance impact when tracing is active. If it is Long-term plan is to 1) compile the C helpers in some IR and 2) inline the helpers with Rust in IR-level, as what Gary has: https://lore.kernel.org/rust-for-linux/20240529202817.3641974-1-g...@garyguo.net/ and I use it for the upcoming atomic API support: https://github.com/fbq/linux/tree/dev/rust/atomic-rfc and it works very well. Thanks for the pointers, it makes sense. Thanks, Mathieu Regards, Boqun not possible to inline those in Rust, then implementing __DO_TRACE in a C function would at least allow Rust to only do a single call to C rather than go back and forth between Rust and C. What prevents inlining those helpers in Rust ? 
Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com
Re: [PATCH 0/3] Tracepoints and static branch/call in Rust
[...] duplication of binary code (instructions). Thanks, Mathieu Link: https://lore.kernel.org/rust-for-linux/20231101-rust-binder-v1-0-08ba9197f...@google.com/ [1] Link: https://r.android.com/3110088 [2] Signed-off-by: Alice Ryhl --- Alice Ryhl (3): rust: add static_call support rust: add static_key_false rust: add tracepoint support rust/bindings/bindings_helper.h | 1 + rust/bindings/lib.rs| 15 +++ rust/helpers.c | 24 +++ rust/kernel/lib.rs | 3 ++ rust/kernel/static_call.rs | 92 + rust/kernel/static_key.rs | 87 ++ rust/kernel/tracepoint.rs | 92 + scripts/Makefile.build | 2 +- 8 files changed, 315 insertions(+), 1 deletion(-) --- base-commit: 1613e604df0cd359cf2a7fbd9be7a0bcfacfabd0 change-id: 20240606-tracepoint-31e15b90e471 Best regards, -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com
Re: [PATCH 2/3] rust: add static_key_false
On 2024-06-06 11:05, Alice Ryhl wrote: Add just enough support for static key so that we can use it from tracepoints. Tracepoints rely on `static_key_false` even though it is deprecated, so we add the same functionality to Rust. Signed-off-by: Alice Ryhl --- rust/kernel/lib.rs| 1 + rust/kernel/static_key.rs | 87 +++ scripts/Makefile.build| 2 +- 3 files changed, 89 insertions(+), 1 deletion(-) diff --git a/rust/kernel/lib.rs b/rust/kernel/lib.rs index d534b1178955..22e1fedd0774 100644 --- a/rust/kernel/lib.rs +++ b/rust/kernel/lib.rs @@ -39,6 +39,7 @@ pub mod print; mod static_assert; pub mod static_call; +pub mod static_key; #[doc(hidden)] pub mod std_vendor; pub mod str; diff --git a/rust/kernel/static_key.rs b/rust/kernel/static_key.rs new file mode 100644 index ..6c3dbe14c98a --- /dev/null +++ b/rust/kernel/static_key.rs @@ -0,0 +1,87 @@ +// SPDX-License-Identifier: GPL-2.0 This static key code is something that can be generally useful both in kernel and user-space. Is there anything that would prevent licensing this under MIT right away instead so it could eventually be re-used more broadly in userspace as well ? Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com
Re: [PATCH 3/3] rust: add tracepoint support
On 2024-06-06 11:05, Alice Ryhl wrote: Make it possible to have Rust code call into tracepoints defined by C code. It is still required that the tracepoint is declared in a C header, and that this header is included in the input to bindgen. Signed-off-by: Alice Ryhl [...] diff --git a/rust/helpers.c b/rust/helpers.c index 2c37a0f5d7a8..0560cc2a512a 100644 --- a/rust/helpers.c +++ b/rust/helpers.c @@ -165,6 +165,30 @@ rust_helper_krealloc(const void *objp, size_t new_size, gfp_t flags) } EXPORT_SYMBOL_GPL(rust_helper_krealloc); +void rust_helper_preempt_enable_notrace(void) +{ + preempt_enable_notrace(); +} +EXPORT_SYMBOL_GPL(rust_helper_preempt_enable_notrace); + +void rust_helper_preempt_disable_notrace(void) +{ + preempt_disable_notrace(); +} +EXPORT_SYMBOL_GPL(rust_helper_preempt_disable_notrace); + +bool rust_helper_current_cpu_online(void) +{ + return cpu_online(raw_smp_processor_id()); +} +EXPORT_SYMBOL_GPL(rust_helper_current_cpu_online); + +void *rust_helper___rcu_dereference_raw(void **p) +{ + return rcu_dereference_raw(p); +} +EXPORT_SYMBOL_GPL(rust_helper___rcu_dereference_raw); Ouch. Doing a function call for each of those small operations will have a rather large performance impact when tracing is active. If it is not possible to inline those in Rust, then implementing __DO_TRACE in a C function would at least allow Rust to only do a single call to C rather than go back and forth between Rust and C. What prevents inlining those helpers in Rust ? Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com
Re: [PATCH v2 2/3] tracing/fprobe: Support raw tracepoint events on modules
On 2024-06-04 12:34, Steven Rostedt wrote: On Tue, 4 Jun 2024 11:02:16 -0400 Mathieu Desnoyers wrote: I see. It looks like there are a few things we could improve there: 1) With your approach, modules need to be already loaded before attaching an fprobe event to them. This effectively prevents attaching to any module init code. Is there any way we could allow this by implementing a module coming notifier in fprobe as well ? This would require that fprobes are kept around in a data structure that matches the modules when they are loaded in the coming notifier. The above sounds like a nice enhancement, but not something necessary for this series. IMHO it is nevertheless relevant to discuss the impact of supporting this kind of use-case on the ABI presented to userspace, at least to validate that what is exposed today can incrementally be enhanced towards that goal. I'm not saying that it needs to be implemented today, but we should at least give it some thoughts right now to make sure the ABI is a good fit. 2) Given that the fprobe module going notifier is protected by the event_mutex, can we use locking rather than reference counting in fprobe attach to guarantee the target module is not reclaimed concurrently ? This would remove the transient side-effect of holding a module reference count which temporarily prevents module unload. Why do we care about unloading modules during the transition? Note, module unload has always been considered a second class citizen, and there's been talks in the past to even rip it out. As a general rule I try to ensure tracing has as little impact on the system behavior so issues that occur without tracing can be reproduced with instrumentation. On systems where modules are loaded/unloaded with udev, holding references on modules can spuriously prevent module unload, which as a consequence changes the system behavior. 
About the relative importance of the various kernel subsystems, following your reasoning that module unload is considered a second-class citizen within the kernel, I would argue that tracing is a third-class citizen and should not needlessly modify the behavior of classes above it. Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com
Re: [PATCH v2 2/3] tracing/fprobe: Support raw tracepoint events on modules
On 2024-06-03 19:49, Masami Hiramatsu (Google) wrote: On Mon, 3 Jun 2024 15:50:55 -0400 Mathieu Desnoyers wrote: On 2024-06-01 04:22, Masami Hiramatsu (Google) wrote: From: Masami Hiramatsu (Google) Support raw tracepoint event on module by fprobe events. Since it only uses for_each_kernel_tracepoint() to find a tracepoint, the tracepoints on modules are not handled. Thus if user specified a tracepoint on a module, it shows an error. This adds new for_each_module_tracepoint() API to tracepoint subsystem, and uses it to find tracepoints on modules. Hi Masami, Why prevent module unload when a fprobe tracepoint is attached to a module ? This changes the kernel's behavior significantly just for the sake of instrumentation. I don't prevent module unloading all the time, just before registering tracepoint handler (something like booking a ticket :-) ). See the last hunk of this patch, it puts the module before exiting __trace_fprobe_create(). So at least this is transient, but I still have concerns, see below. As an alternative, LTTng-modules attach/detach to/from modules with the coming/going notifiers, so the instrumentation gets removed when a module is unloaded rather than preventing its unload by holding a module reference count. I would recommend a similar approach for fprobe. Yes, since tracepoint subsystem provides a notifier API to notify the tracepoint is gone, fprobe already uses it to find unloading and unregister the target function. (see __tracepoint_probe_module_cb) I see. It looks like there are a few things we could improve there: 1) With your approach, modules need to be already loaded before attaching an fprobe event to them. This effectively prevents attaching to any module init code. Is there any way we could allow this by implementing a module coming notifier in fprobe as well ? This would require that fprobes are kept around in a data structure that matches the modules when they are loaded in the coming notifier. 
2) Given that the fprobe module going notifier is protected by the event_mutex, can we use locking rather than reference counting in fprobe attach to guarantee the target module is not reclaimed concurrently ? This would remove the transient side-effect of holding a module reference count which temporarily prevents module unload. Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com
Re: [PATCH v2 2/3] tracing/fprobe: Support raw tracepoint events on modules
trace_probe_log_err(0, NO_TRACEPOINT); @@ -1110,8 +1142,8 @@ static int __trace_fprobe_create(int argc, const char *argv[]) goto out; /* setup a probe */ - tf = alloc_trace_fprobe(group, event, symbol, tpoint, maxactive, - argc, is_return); + tf = alloc_trace_fprobe(group, event, symbol, tpoint, tp_mod, + maxactive, argc, is_return); if (IS_ERR(tf)) { ret = PTR_ERR(tf); /* This must return -ENOMEM, else there is a bug */ @@ -1119,10 +1151,6 @@ static int __trace_fprobe_create(int argc, const char *argv[]) goto out; /* We know tf is not allocated */ } - if (is_tracepoint) - tf->mod = __module_text_address( - (unsigned long)tf->tpoint->probestub); - /* parse arguments */ for (i = 0; i < argc && i < MAX_TRACE_ARGS; i++) { trace_probe_log_set_index(i + 2); @@ -1155,6 +1183,8 @@ static int __trace_fprobe_create(int argc, const char *argv[]) } out: + if (tp_mod) + module_put(tp_mod); traceprobe_finish_parse(&ctx); trace_probe_log_clear(); kfree(new_argv); -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com
Re: [RFC PATCH 0/4] perf: Correlating user process data to samples
On 2024-04-12 12:28, Beau Belgrave wrote: On Thu, Apr 11, 2024 at 09:52:22PM -0700, Ian Rogers wrote: On Thu, Apr 11, 2024 at 5:17 PM Beau Belgrave wrote: In the Open Telemetry profiling SIG [1], we are trying to find a way to grab a tracing association quickly on a per-sample basis. The team at Elastic has a bespoke way to do this [2], however, I'd like to see a more general way to achieve this. The folks I've been talking with seem open to the idea of just having a TLS value for this we could capture Presumably TLS == Thread Local Storage. Yes, the initial idea is to use thread local storage (TLS). It seems to be the fastest option to save a per-thread value that changes at a fast rate. upon each sample. We could then just state, Open Telemetry SDKs should have a TLS value for span correlation. However, we need a way to sample the TLS or other value(s) when a sampling event is generated. This is supported today on Windows via EventActivityIdControl() [3]. Since Open Telemetry works on both Windows and Linux, ideally we can do something as efficient for Linux based workloads. This series is to explore how it would be best possible to collect supporting data from a user process when a profile sample is collected. Having a value stored in TLS makes a lot of sense for this however there are other ways to explore. Whatever is chosen, kernel samples taken in process context should be able to get this supporting data. In these patches on X64 the fsbase and gsbase are used for this. An option to explore suggested by Mathieu Desnoyers is to utilize rseq for processes to register a value location that can be included when profiling if desired. This would allow a tighter contract between user processes and a profiler. It would allow better labeling/categorizing the correlation values. It is hard to understand this idea. Are you saying stash a cookie in TLS for samples to capture to indicate an activity? 
Restartable sequences are about preemption on a CPU not of a thread, so at least my intuition is that they feel different. You could stash information like this today by changing the thread name which generates comm events. I've wondered about having similar information in some form of reserved for profiling stack slot, for example, to stash a pointer to the name of a function being interpreted. Snapshotting all of a stack is bad performance wise and for security. A stack slot would be able to deal with nesting. You are getting the idea. A slot or tag for a thread would be great! I'm not a fan of overriding the thread comm name (as that already has a use). TLS would be fine, if we could also pass an offset + size + type. Maybe a stack slot that just points to parts of TLS? That way you could have a set of slots that don't require much memory and selectively copy them out of TLS (or where ever those slots point to in user memory). When I was talking to Mathieu about this, it seems that rseq already had a place to potentially put these slots. I'm unsure though how the per thread aspects would work. Mathieu, can you post your ideas here about that? Sure. I'll try to summarize my thoughts here. By all means, let me know if I'm missing important pieces of the puzzle. First of all, here is my understanding of what information we want to share between userspace and kernel. A 128-bit activity ID identifies "uniquely" (as far as a 128-bit random UUID allows) a portion of the dependency chain involved in doing some work (e.g. answer a HTTP request) across one or many participating hosts. Activity IDs have a parent/child relationship: a parent activity ID can create children activity IDs. 
For instance, if one host has the service "dispatch", another host has a "web server", and a third host has a SQL database, we should be able to follow the chain of activities needed to answer a web query by following those activity IDs, linking them together through parent/child relationships. This usually requires the communication protocols to convey those activity IDs across hosts. The reason why this information must be provided from userspace is because it's userspace that knows where to find those activity IDs within its application-layer communication protocols. With tracing, taking a full trace of the activity ID spans begin/end from all hosts allow reconstructing the activity IDs parent/child relationships, so we typically only need to extract information about activity ID span begin/end with parent/child info to a tracer. Using activity IDs from a kernel profiler is trickier, because we do not have access to the complete span begin/end trace to reconstruct the activity ID parent/child relationship. This is where I suspect we'd want to introduce a notion of "activity ID stack", so a profiler could reconstruct the currently active stack of activity IDs for the current thread by walking that stack. This profil
Re: [RFC][PATCH] tracing: Introduce restart_critical_timings()
On 2024-03-20 13:58, Steven Rostedt wrote: On Wed, 20 Mar 2024 13:15:39 -0400 Mathieu Desnoyers wrote: I would like to introduce restart_critical_timings() and place it in locations that have this behavior. Is there any way you could move this to need_resched() rather than sprinkle those everywhere ? Because need_resched() itself does not mean it's going to schedule immediately. I looked at a few locations that need_resched() is called. Most are in idle code where the critical timings are already handled. I'm not sure I'd add it for places like mm/memory.c or drivers/md/bcache/btree.c. A lot of places look to use it more for PREEMPT_NONE situations as a open coded cond_resched(). The main reason this one is particularly an issue, is that it spins as long as the owner is still running. Which may be some time, as here it was 7ms. What I think we should be discussing here is how calling need_resched() should interact with the latency tracked by critical timings. AFAIU, when code explicitly calls need_resched() in a loop, there are two cases: - need_resched() returns false: This means the loop can continue without causing long latency on the system. Technically we could restart the critical timings at this point. - need_resched() returns true: This means the loop should exit quickly and call the scheduler. I would not reset the critical timings there, as whatever code is executed between need_resched() returning true and calling the scheduler is adding to latency. Having stop/start critical timings around idle loops seems to just be an optimization over that. As for mm and driver/md code, what is wrong with doing a critical timings reset when need_resched() returns false ? It would prevent a whole class of false-positives rather than playing whack-a-mole with those that pop up. Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com
Re: [RFC][PATCH] tracing: Introduce restart_critical_timings()
On 2024-03-20 12:20, Steven Rostedt wrote: From: Steven Rostedt (Google) I'm debugging some latency issues on a Chromebook and the preemptirqsoff tracer hit this: # tracer: preemptirqsoff # # preemptirqsoff latency trace v1.1.5 on 5.15.148-21853-g165fd2387469-dirty # # latency: 7813 us, #60/60, CPU#1 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:2) #- #| task: init-1 (uid:0 nice:0 policy:0 rt_prio:0) #- # => started at: rwsem_optimistic_spin # => ended at: rwsem_optimistic_spin # # #_--=> CPU# # / _-=> irqs-off # | / _=> need-resched # || / _---=> hardirq/softirq # ||| / _--=> preempt-depth # / _-=> migrate-disable # | / delay # cmd pid || time | caller # \ /|| \|/ <...>-1 1...1.0us!: rwsem_optimistic_spin+0x20/0x194 <-rwsem_optimistic_spin+0x20/0x194 <...>-1 1.N.1. 7813us : rwsem_optimistic_spin+0x140/0x194 <-rwsem_optimistic_spin+0x140/0x194 <...>-1 1.N.1. 7814us+: tracer_preempt_on+0x4c/0x6a <-rwsem_optimistic_spin+0x140/0x194 <...>-1 1.N.1. 7824us : => rwsem_optimistic_spin+0x140/0x194 => rwsem_down_write_slowpath+0xc9/0x3fe => copy_process+0xd08/0x1b8a => kernel_clone+0x94/0x256 => __x64_sys_clone+0x7a/0x9a => do_syscall_64+0x51/0xa1 => entry_SYSCALL_64_after_hwframe+0x5c/0xc6 Which isn't a real issue, as it's in the rwsem_optimistic_spin() code that spins with interrupts enabled, preempt disabled, and checks for need_resched(). If it is true, it breaks out and schedules. Hence, it's hiding real issues, because it can spin for a very long time and this is not the source of the latency I'm tracking down. I would like to introduce restart_critical_timings() and place it in locations that have this behavior. Is there any way you could move this to need_resched() rather than sprinkle those everywhere ? 
Thanks, Mathieu Signed-off-by: Steven Rostedt (Google) diff --git a/include/linux/irqflags.h b/include/linux/irqflags.h index 147feebd508c..e9f97f60bfc0 100644 --- a/include/linux/irqflags.h +++ b/include/linux/irqflags.h @@ -145,6 +145,13 @@ do { \ # define start_critical_timings() do { } while (0) #endif +/* Used in spins that check need_resched() with preemption disabled */ +static inline void restart_critical_timings(void) +{ + stop_critical_timings(); + start_critical_timings(); +} + #ifdef CONFIG_DEBUG_IRQFLAGS extern void warn_bogus_irq_restore(void); #define raw_check_bogus_irq_restore() \ diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c index 2340b6d90ec6..fa7b43e53d20 100644 --- a/kernel/locking/rwsem.c +++ b/kernel/locking/rwsem.c @@ -27,6 +27,7 @@ #include #include #include +#include #include #ifndef CONFIG_PREEMPT_RT @@ -780,6 +781,7 @@ rwsem_spin_on_owner(struct rw_semaphore *sem) */ barrier(); + restart_critical_timings(); if (need_resched() || !owner_on_cpu(owner)) { state = OWNER_NONSPINNABLE; break; @@ -912,6 +914,7 @@ static bool rwsem_optimistic_spin(struct rw_semaphore *sem) * a writer, need_resched() check needs to be done here. */ if (owner_state != OWNER_WRITER) { + restart_critical_timings(); if (need_resched()) break; if (rt_task(current) && -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com
Re: [PATCH v2] tracing: Limit trace_marker writes to just 4K
On 2024-03-04 22:34, Steven Rostedt wrote: From: "Steven Rostedt (Google)" Limit the max print event of trace_marker to just 4K string size. This must also be less than the amount that can be held by a trace_seq along with the text that is before the output (like the task name, PID, CPU, state, etc). As trace_seq is made to handle large events (some greater than 4K). Make the max size of a trace_marker write event be 4K which is guaranteed to fit in the trace_seq buffer. Suggested-by: Mathieu Desnoyers From my perspective I only attempted to clarify the point Linus made about limiting the trace_marker input to 4kB. Feel adapt the Suggested-by tag accordingly. Reviewed-by: Mathieu Desnoyers Thanks, Mathieu Signed-off-by: Steven Rostedt (Google) --- Changes since v1: https://lore.kernel.org/linux-trace-kernel/20240304192710.4c996...@gandalf.local.home/ - Just make the max limit 4K and not half of the trace_seq size. The trace_seq is already made to handle events greater than 4k. kernel/trace/trace.c | 10 +- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c index 8198bfc54b58..d16b95ca58a7 100644 --- a/kernel/trace/trace.c +++ b/kernel/trace/trace.c @@ -7293,6 +7293,8 @@ tracing_free_buffer_release(struct inode *inode, struct file *filp) return 0; } +#define TRACE_MARKER_MAX_SIZE 4096 + static ssize_t tracing_mark_write(struct file *filp, const char __user *ubuf, size_t cnt, loff_t *fpos) @@ -7320,6 +7322,9 @@ tracing_mark_write(struct file *filp, const char __user *ubuf, if ((ssize_t)cnt < 0) return -EINVAL; + if (cnt > TRACE_MARKER_MAX_SIZE) + cnt = TRACE_MARKER_MAX_SIZE; + meta_size = sizeof(*entry) + 2; /* add '\0' and possible '\n' */ again: size = cnt + meta_size; @@ -7328,11 +7333,6 @@ tracing_mark_write(struct file *filp, const char __user *ubuf, if (cnt < FAULTED_SIZE) size += FAULTED_SIZE - cnt; - if (size > TRACE_SEQ_BUFFER_SIZE) { - cnt -= size - TRACE_SEQ_BUFFER_SIZE; - goto again; - } - buffer = 
tr->array_buffer.buffer; event = __trace_buffer_lock_reserve(buffer, TRACE_PRINT, size, tracing_gen_ctx()); -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com
Re: [PATCH] tracing: Have trace_marker writes be just half of TRACE_SEQ_SIZE
On 2024-03-04 21:37, Steven Rostedt wrote: On Mon, 4 Mar 2024 21:35:38 -0500 Steven Rostedt wrote: And it's not for debugging, it's for validation of assumptions made about an upper bound limit defined for a compile-time check, so as the code evolves issues are caught early. validating is debugging. Did Linus put you up to this? To test me to see if I'm learning how to say "No" ;-) No, he did not. I genuinely think that validating size limits like this either at compile time or, when they can vary at runtime like in this case, with a dynamic check, decreases the cognitive load on the reviewers. We can then assume that whatever limit was put in place is actually enforced and not just wishful thinking. If the "header" size upper bound is not validated at runtime, there is not much point in adding the BUILD_BUG_ON() based on that value in the first place, and you should then just add a runtime check that you don't overflow the output buffer before writing the output to it. Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com
Re: [PATCH] tracing: Have trace_marker writes be just half of TRACE_SEQ_SIZE
On 2024-03-04 20:59, Steven Rostedt wrote: On Mon, 4 Mar 2024 20:42:39 -0500 Mathieu Desnoyers wrote: #define TRACE_OUTPUT_META_DATA_MAX_LEN 80 and a runtime check in the code generating this header. This would avoid adding an unchecked upper limit. That would be a lot of complex code that is for debugging something that has never happened in the past 16 years and Linus has already reprimanded me on doing such things. Is it more complex than remembering the iterator position in print_trace_fmt() right before: if (tr->trace_flags & TRACE_ITER_CONTEXT_INFO) { if (iter->iter_flags & TRACE_FILE_LAT_FMT) trace_print_lat_context(iter); else trace_print_context(iter); } and then checking just after that the offset is not beyond 128 bytes ? Perhaps there is even something internal to "iter" that already knows about this offset (?). This would catch any context going beyond its planned upper limit early. Failing early and not just in rare corner cases is good. And it's not for debugging, it's for validation of assumptions made about an upper bound limit defined for a compile-time check, so as the code evolves issues are caught early. Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com
Re: [PATCH] tracing: Have trace_marker writes be just half of TRACE_SEQ_SIZE
On 2024-03-04 20:41, Steven Rostedt wrote: On Mon, 4 Mar 2024 20:35:16 -0500 Steven Rostedt wrote: BUILD_BUG_ON(TRACING_MARK_MAX_SIZE + sizeof(meta data stuff...) > TRACE_SEQ_SIZE); That's not the meta size I'm worried about. The sizeof(meta data) is the raw event binary data, which is not related to the size of the event output. # cd /sys/kernel/tracing # echo hello > trace_marker # cat trace [..] <...>-999 [001] . 2296.140373: tracing_mark_write: hello ^^^ This is the meta data that is added to trace_seq That said, the meta data is most likely not going to be more than 128 bytes (it shouldn't be more than 80). I could do as you suggest and create a separate TRACE_MARKER_SIZE and just make sure that it's less than TRACE_SEQ_BUFFER_SIZE (as that's the size of the content) by 128 bytes. /* Added meta data should not be more than 128 bytes */ BUILD_BUG_ON((TRACE_MARKER_MAX_SIZE + 128) > TRACE_SEQ_BUFFER_SIZE); Bonus points if you add #define TRACE_OUTPUT_META_DATA_MAX_LEN 80 and a runtime check in the code generating this header. This would avoid adding an unchecked upper limit. Thanks, Mathieu -- Steve -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com
Re: [PATCH] tracing: Have trace_marker writes be just half of TRACE_SEQ_SIZE
On 2024-03-04 20:35, Steven Rostedt wrote: On Mon, 4 Mar 2024 20:15:57 -0500 Mathieu Desnoyers wrote: On 2024-03-04 19:27, Steven Rostedt wrote: From: "Steven Rostedt (Google)" Since the size of trace_seq's buffer is the max an event can output, have the trace_marker be half of the entire TRACE_SEQ_SIZE, which is 4K. That will keep writes that has meta data written from being dropped (but reported), because the total output of the print event is greater than what the trace_seq can hold. Defining the trace_mark limit in terms of "TRACE_SEQ_SIZE / 2" seems backwards. It's basically using a define of the maximum buffer size for the pretty-printing output and defining the maximum input size of a system call to half of that. I'd rather see, in a header file shared between tracing mark write implementation and output implementation: #define TRACING_MARK_MAX_SIZE 4096 and then a static validation that this input fits within your pretty printing output in the output implementation file: BUILD_BUG_ON(TRACING_MARK_MAX_SIZE + sizeof(meta data stuff...) > TRACE_SEQ_SIZE); That's not the meta size I'm worried about. The sizeof(meta data) is the raw event binary data, which is not related to the size of the event output. # cd /sys/kernel/tracing # echo hello > trace_marker # cat trace [..] <...>-999 [001] . 2296.140373: tracing_mark_write: hello ^^^ This is the meta data that is added to trace_seq If this header has a known well-defined upper-limit length, then use that in the BUILD_BUG_ON(). Thanks, Mathieu -- Steve -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com
Re: [PATCH] tracing: Have trace_marker writes be just half of TRACE_SEQ_SIZE
On 2024-03-04 19:27, Steven Rostedt wrote: From: "Steven Rostedt (Google)" Since the size of trace_seq's buffer is the max an event can output, have the trace_marker be half of the entire TRACE_SEQ_SIZE, which is 4K. That will keep writes that has meta data written from being dropped (but reported), because the total output of the print event is greater than what the trace_seq can hold. Defining the trace_mark limit in terms of "TRACE_SEQ_SIZE / 2" seems backwards. It's basically using a define of the maximum buffer size for the pretty-printing output and defining the maximum input size of a system call to half of that. I'd rather see, in a header file shared between tracing mark write implementation and output implementation: #define TRACING_MARK_MAX_SIZE 4096 and then a static validation that this input fits within your pretty printing output in the output implementation file: BUILD_BUG_ON(TRACING_MARK_MAX_SIZE + sizeof(meta data stuff...) > TRACE_SEQ_SIZE); This way we clearly document that the tracing mark write input limit is 4kB, rather than something derived from the size of an output buffer. Thanks, Mathieu Signed-off-by: Steven Rostedt (Google) --- kernel/trace/trace.c | 16 +++- 1 file changed, 11 insertions(+), 5 deletions(-) diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c index 8198bfc54b58..d68544aef65f 100644 --- a/kernel/trace/trace.c +++ b/kernel/trace/trace.c @@ -7320,6 +7320,17 @@ tracing_mark_write(struct file *filp, const char __user *ubuf, if ((ssize_t)cnt < 0) return -EINVAL; + /* +* TRACE_SEQ_SIZE is the total size of trace_seq buffer used +* for output. As the print event outputs more than just +* the string written, keep it smaller than the trace_seq +* as it could drop the event if the extra data makes it bigger +* than what the trace_seq can hold. Half he TRACE_SEQ_SIZE +* is more than enough. 
+*/ + if (cnt > TRACE_SEQ_SIZE / 2) + cnt = TRACE_SEQ_SIZE / 2; + meta_size = sizeof(*entry) + 2; /* add '\0' and possible '\n' */ again: size = cnt + meta_size; @@ -7328,11 +7339,6 @@ tracing_mark_write(struct file *filp, const char __user *ubuf, if (cnt < FAULTED_SIZE) size += FAULTED_SIZE - cnt; - if (size > TRACE_SEQ_BUFFER_SIZE) { - cnt -= size - TRACE_SEQ_BUFFER_SIZE; - goto again; - } - buffer = tr->array_buffer.buffer; event = __trace_buffer_lock_reserve(buffer, TRACE_PRINT, size, tracing_gen_ctx()); -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com
Re: [PATCH] tracing: Remove precision vsnprintf() check from print event
On 2024-03-04 17:43, Steven Rostedt wrote: From: "Steven Rostedt (Google)" This reverts 60be76eeabb3d ("tracing: Add size check when printing trace_marker output"). The only reason the precision check was added was because of a bug that miscalculated the write size of the string into the ring buffer and it truncated it removing the terminating nul byte. On reading the trace it crashed the kernel. But this was due to the bug in the code that happened during development and should never happen in practice. If anything, the precision can hide bugs where the string in the ring buffer isn't nul terminated and it will not be checked. Link: https://lore.kernel.org/all/c7e7af1a-d30f-4d18-b8e5-af1ef5800...@linux.ibm.com/ Link: https://lore.kernel.org/linux-trace-kernel/20240227125706.04279...@gandalf.local.home Link: https://lore.kernel.org/all/20240302111244.3a167...@gandalf.local.home/ Reported-by: Sachin Sant Fixes: 60be76eeabb3d ("tracing: Add size check when printing trace_marker output") Signed-off-by: Steven Rostedt (Google) This is a step in the right direction IMHO. Reviewed-by: Mathieu Desnoyers Just out of curiosity, is there anything to prevent trace_marker from writing a huge string into the ring buffer in the first place ? Is this limit implicit and based on the page size of the architecture or is it a known fixed limit ? (e.g. 4kB strings). It appears to currently be limited by #define TRACE_SEQ_BUFFER_SIZE (PAGE_SIZE * 2 - \ (sizeof(struct seq_buf) + sizeof(size_t) + sizeof(int))) checked within tracing_mark_write(). I would have hoped for a simpler limit (e.g. 4kB) consistent across architectures. But that would belong to a separate change. Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com
Re: [PATCH] ring-buffer: use READ_ONCE() to read cpu_buffer->commit_page in concurrent environment
On 2024-03-01 10:49, Steven Rostedt wrote: On Fri, 1 Mar 2024 13:37:18 +0800 linke wrote: So basically you are worried about read-tearing? That wasn't mentioned in the change log. Yes. Sorry for making this confused, I am not very familiar with this and still learning. No problem. We all have to learn this anyway. Funny part is, if the above timestamp read did a tear, then this would definitely not match, and would return the correct value. That is, the buffer is not empty because the only way for this to get corrupted is if something is in the process of writing to it. I agree with you here. commit = rb_page_commit(commit_page); But if commit_page above is the result of a torn read, the commit field read by rb_page_commit() may not represent a valid value. But commit_page is a word length, and I will argue that any compiler that tears "long" words is broken. ;-) [ For those tuning in, we are discussing ring_buffer_iter_empty() "commit_page = cpu_buffer->commit_page;" racy load. ] I counter-argue that real-world compilers *are* broken based on your personal definition, but we have to deal with them, as documented in Documentation/memory-barriers.txt (see below). What is the added overhead of using a READ_ONCE() there ? Why are we wasting effort trying to guess the compiler behavior if the real-world performance impact is insignificant ? Quote from memory-barrier.txt explaining the purpose of {READ,WRITE}_ONCE(): "(*) For aligned memory locations whose size allows them to be accessed with a single memory-reference instruction, prevents "load tearing" and "store tearing," in which a single large access is replaced by multiple smaller accesses." 
I agree that {READ,WRITE}_ONCE() are really not needed at initialization, when there are demonstrably no concurrent accesses to the data. But trying to eliminate {READ,WRITE}_ONCE() on concurrently accessed fields just adds complexity, prevents static analyzers from properly understanding the code and reporting issues, and just obfuscates the code. Thanks, Mathieu In this case, READ_ONCE() is only needed for the commit_page. But we can at least keep the READ_ONCE() on the commit_page just because it is used in the next instruction. -- Steve -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com
Re: [PATCH v3] ring-buffer: Simplify reservation with try_cmpxchg() loop
On 2024-02-20 09:19, Steven Rostedt wrote: On Mon, 19 Feb 2024 18:20:32 -0500 Steven Rostedt wrote: Instead of using local_add_return() to reserve the ring buffer data, Mathieu Desnoyers suggested using local_cmpxchg(). This would simplify the reservation with the time keeping code. Although, it does not get rid of the double time stamps (before_stamp and write_stamp), using cmpxchg() does get rid of the more complex case when an interrupting event occurs between getting the timestamps and reserving the data, as when that happens, it just tries again instead of dealing with it. Before we had: w = local_read(&tail_page->write); /* get time stamps */ write = local_add_return(length, &tail_page->write); if (write - length == w) { /* do simple case */ } else { /* do complex case */ } By switching the local_add_return() to a local_try_cmpxchg() it can now be: w = local_read(&tail_page->write); again: /* get time stamps */ if (!local_try_cmpxchg(&tail_page->write, &w, w + length)) goto again; /* do simple case */ Something about this logic is causing __rb_next_reserve() to sometimes always return -EAGAIN and triggering the: RB_WARN_ON(cpu_buffer, ++nr_loops > 1000) Which disables the ring buffer. I'm not sure what it is, but until I do, I'm removing the patch from my queue. Try resetting the info->add_timestamp flags to add_ts_default on goto again within __rb_reserve_next(). Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com
Re: [PATCH] dax: add set_dax_nomc() and set_dax_nocache() stub helpers
On 2024-02-16 15:22, Arnd Bergmann wrote: From: Arnd Bergmann In some randconfig builds, the IS_ERR() check appears to not get completely eliminated, resulting in the compiler to insert references to these two functions that cause a link failure: ERROR: modpost: "set_dax_nocache" [drivers/md/dm-mod.ko] undefined! ERROR: modpost: "set_dax_nomc" [drivers/md/dm-mod.ko] undefined! Add more stub functions for the dax-disabled case here to make it build again. Hi Arnd, Note that this is a duplicate of: https://lore.kernel.org/lkml/20240215144633.96437-2-mathieu.desnoy...@efficios.com/ now present in Andrew's tree. The only differences are the subject, commit message and a newline between "set_dax_nomc" and "set_dax_synchronous" in your change. Thanks, Mathieu Fixes: d888f6b0a766 ("dm: treat alloc_dax() -EOPNOTSUPP failure as non-fatal") Reported-by: kernel test robot Closes: https://lore.kernel.org/oe-kbuild-all/202402160420.e4qkwogo-...@intel.com/ Signed-off-by: Arnd Bergmann --- include/linux/dax.h | 12 +--- 1 file changed, 9 insertions(+), 3 deletions(-) diff --git a/include/linux/dax.h b/include/linux/dax.h index df2d52b8a245..4527c10016fb 100644 --- a/include/linux/dax.h +++ b/include/linux/dax.h @@ -64,6 +64,9 @@ void dax_write_cache(struct dax_device *dax_dev, bool wc); bool dax_write_cache_enabled(struct dax_device *dax_dev); bool dax_synchronous(struct dax_device *dax_dev); void set_dax_synchronous(struct dax_device *dax_dev); +void set_dax_nocache(struct dax_device *dax_dev); +void set_dax_nomc(struct dax_device *dax_dev); + size_t dax_recovery_write(struct dax_device *dax_dev, pgoff_t pgoff, void *addr, size_t bytes, struct iov_iter *i); /* @@ -108,6 +111,12 @@ static inline bool dax_synchronous(struct dax_device *dax_dev) static inline void set_dax_synchronous(struct dax_device *dax_dev) { } +static inline void set_dax_nocache(struct dax_device *dax_dev) +{ +} +static inline void set_dax_nomc(struct dax_device *dax_dev) +{ +} static inline bool 
daxdev_mapping_supported(struct vm_area_struct *vma, struct dax_device *dax_dev) { @@ -120,9 +129,6 @@ static inline size_t dax_recovery_write(struct dax_device *dax_dev, } #endif -void set_dax_nocache(struct dax_device *dax_dev); -void set_dax_nomc(struct dax_device *dax_dev); - struct writeback_control; #if defined(CONFIG_BLOCK) && defined(CONFIG_FS_DAX) int dax_add_host(struct dax_device *dax_dev, struct gendisk *disk); -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com
Re: CPU data cache across reboot/kexec for pmem/dax devices
On 2024-02-09 15:15, Dan Williams wrote: Mathieu Desnoyers wrote: Hi Dan, In the context of extracting user-space trace data when the kernel crashes, the LTTng user-space tracer recommends using nvdimm/pmem to reserve an area of physical (volatile) RAM at boot (memmap=nn[KMG]!ss[KMG]), and use the resulting device to create/mount a dax-enabled fs (e.g. ext4). We then use this filesystem to mmap() the shared memory files for the tracer. I want to make sure that the very last events from the userspace tracer written to the memory mapped buffers (mmap()) by userspace are present after a warm-reboot (or kexec/kdump). Note that the LTTng user-space tracer (LTTng-UST) does *not* issue any clflush (or equivalent pmem_persist() from libpmem) for performance reasons: ring buffer data is usually overwritten many times before the system actually crashes, and the only thing we really need to make sure is that the cache lines are not invalidated without write back. So I understand that the main use-case for pmem is nvdimm, and that in order to guarantee persistence of the data on power off an explicit pmem_persist() is needed after each "transaction", but for the needs of tracing, is there some kind of architectural guarantee that the data present in the cpu data cache is not invalidated prior to write back in each of those scenarios ? - reboot with bios explicitly not clearing memory, This one gives me pause, because a trip through the BIOS typically means lots of resets and other low level magic, so this would likely require pushing dirty data out of CPU caches prior to entering the BIOS code paths. So this either needs explicit cache flushing or mapping the memory with write-through semantics. That latter one is not supported in the stack today. I would favor the explicit cache flushing approach over write-through semantics for performance reasons: typically the ring buffers are overwritten, so always storing the data to memory would be wasteful. 
But I would rather do the cache flushing only on crash (die oops/panic notifiers) rather than require the tracer to flush cache lines immediately after each event is produced, again for performance reasons. I have done a crude attempt at registering die oops/panic notifiers from the pmem driver, and flush all pmem areas to memory when die oops/panic notifiers are called (see the untested patch below). I wonder if this is something that would be generally useful ? - kexec/kdump. This should maintain the state of CPU caches. As far as the CPU is concerned it is just long jumping into a new kernel in memory without resetting any CPU cache state. Good to know that this scenario is expected to "Just Work". Thanks, Mathieu diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c index e9898457a7bd..b92f18607d39 100644 --- a/drivers/nvdimm/pmem.c +++ b/drivers/nvdimm/pmem.c @@ -26,12 +26,18 @@ #include #include #include +#include #include #include "pmem.h" #include "btt.h" #include "pfn.h" #include "nd.h" +static int pmem_die_handler(struct notifier_block *self, + unsigned long ev, void *unused); +static int pmem_panic_handler(struct notifier_block *self, + unsigned long ev, void *unused); + static struct device *to_dev(struct pmem_device *pmem) { /* @@ -423,6 +429,9 @@ static void pmem_release_disk(void *__pmem) { struct pmem_device *pmem = __pmem; + atomic_notifier_chain_unregister(&panic_notifier_list, + &pmem->panic_notifier); + unregister_die_notifier(&pmem->die_notifier); dax_remove_host(pmem->disk); kill_dax(pmem->dax_dev); put_dax(pmem->dax_dev); @@ -573,9 +582,20 @@ static int pmem_attach_disk(struct device *dev, goto out_cleanup_dax; dax_write_cache(dax_dev, nvdimm_has_cache(nd_region)); } - rc = device_add_disk(dev, disk, pmem_attribute_groups); + pmem->die_notifier.notifier_call = pmem_die_handler; + pmem->die_notifier.priority = -INT_MAX; + rc = register_die_notifier(&pmem->die_notifier); if (rc) goto out_remove_host; + pmem->panic_notifier.notifier_call = 
pmem_panic_handler; + pmem->panic_notifier.priority = -INT_MAX; + rc = atomic_notifier_chain_register(&panic_notifier_list, + &pmem->panic_notifier); + if (rc) + goto out_unregister_die; + rc = device_add_disk(dev, disk, pmem_attribute_groups); + if (rc) + goto out_unregister_panic; if (devm_add_action_or_reset(dev, pmem_release_disk, pmem)) return -ENOMEM; @@ -587,6 +607,11 @@ static int pmem_attach_disk(struct device *dev, dev_warn(dev, "'badblocks' notification disabled\n"); return
CPU data cache across reboot/kexec for pmem/dax devices
Hi Dan, In the context of extracting user-space trace data when the kernel crashes, the LTTng user-space tracer recommends using nvdimm/pmem to reserve an area of physical (volatile) RAM at boot (memmap=nn[KMG]!ss[KMG]), and use the resulting device to create/mount a dax-enabled fs (e.g. ext4). We then use this filesystem to mmap() the shared memory files for the tracer. I want to make sure that the very last events from the userspace tracer written to the memory mapped buffers (mmap()) by userspace are present after a warm-reboot (or kexec/kdump). Note that the LTTng user-space tracer (LTTng-UST) does *not* issue any clflush (or equivalent pmem_persist() from libpmem) for performance reasons: ring buffer data is usually overwritten many times before the system actually crashes, and the only thing we really need to make sure is that the cache lines are not invalidated without write back. So I understand that the main use-case for pmem is nvdimm, and that in order to guarantee persistence of the data on power off an explicit pmem_persist() is needed after each "transaction", but for the needs of tracing, is there some kind of architectural guarantee that the data present in the cpu data cache is not invalidated prior to write back in each of those scenarios ? - reboot with bios explicitly not clearing memory, - kexec/kdump. Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com
Re: [PATCH v14 4/6] tracing: Allow user-space mapping of the ring-buffer
On 2024-02-05 13:34, Vincent Donnefort wrote: On Mon, Feb 05, 2024 at 11:55:08AM -0500, Mathieu Desnoyers wrote: [...] How are the kernel linear mapping and the userspace mapping made coherent on architectures with virtually aliasing data caches? Ref. https://lore.kernel.org/lkml/20240202210019.88022-1-mathieu.desnoy...@efficios.com/T/#t Hi Mathieu, Thanks for the pointer. We have the exact same problem as DAX. We do modify the data through the kernel linear mapping while user-space can read it through its own. I should probably return an error when used with any of the arch ARM || SPARC || MIPS, until cpu_dcache_is_aliasing() introduces a fine-grained differentiation. You might want to use LTTng's ring buffer approach instead. See https://github.com/lttng/lttng-modules/blob/master/src/lib/ringbuffer/ring_buffer_frontend.c#L1202 lib_ring_buffer_flush_read_subbuf_dcache() Basically, whenever user-space grabs a sub-buffer for reading (through lttng-modules's LTTNG_KERNEL_ABI_RING_BUFFER_GET_SUBBUF ioctl), lttng calls flush_dcache_page() on all pages of this subbuffer (I should really change this to a call to flush_dcache_folio(), which would be more efficient). Note that doing this is not very efficient on architectures which have coherent data caches and incoherent dcache vs icache: in that case, we issue the flush_dcache_page uselessly. I plan on using the new cpu_dcache_is_aliasing() check once/if it makes its way upstream to remove those useless flushes on architectures which define ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE, but do not virtually alias the data cache. The equivalent of LTTng's "get subbuf" operation would be the new TRACE_MMAP_IOCTL_GET_READER ioctl in ftrace AFAIU. Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com
Re: [PATCH v14 4/6] tracing: Allow user-space mapping of the ring-buffer
+ + get_page(page); + vmf->page = page; + vmf->page->mapping = vmf->vma->vm_file->f_mapping; + vmf->page->index = vmf->pgoff; + + return 0; +} + +static void tracing_buffers_mmap_close(struct vm_area_struct *vma) +{ + struct ftrace_buffer_info *info = vma->vm_file->private_data; + struct trace_iterator *iter = &info->iter; + struct trace_array *tr = iter->tr; + + ring_buffer_unmap(iter->array_buffer->buffer, iter->cpu_file); + +#ifdef CONFIG_TRACER_MAX_TRACE + spin_lock(&tr->snapshot_trigger_lock); + if (!WARN_ON(!tr->mapped)) + tr->mapped--; + spin_unlock(&tr->snapshot_trigger_lock); +#endif +} + +static void tracing_buffers_mmap_open(struct vm_area_struct *vma) +{ + struct ftrace_buffer_info *info = vma->vm_file->private_data; + struct trace_iterator *iter = &info->iter; + + WARN_ON(ring_buffer_map(iter->array_buffer->buffer, iter->cpu_file)); +} + +static const struct vm_operations_struct tracing_buffers_vmops = { + .open = tracing_buffers_mmap_open, + .close = tracing_buffers_mmap_close, + .fault = tracing_buffers_mmap_fault, +}; + +static int tracing_buffers_mmap(struct file *filp, struct vm_area_struct *vma) +{ + struct ftrace_buffer_info *info = filp->private_data; + struct trace_iterator *iter = &info->iter; + struct trace_array *tr = iter->tr; + int ret = 0; + + if (vma->vm_flags & VM_WRITE) + return -EPERM; + + vm_flags_mod(vma, VM_DONTCOPY | VM_DONTDUMP, VM_MAYWRITE); + vma->vm_ops = &tracing_buffers_vmops; + +#ifdef CONFIG_TRACER_MAX_TRACE + /* +* We hold mmap_lock here. lockdep would be unhappy if we would now take +* trace_types_lock. Instead use the specific snapshot_trigger_lock. 
+*/ + spin_lock(&tr->snapshot_trigger_lock); + if (tr->snapshot || tr->mapped == UINT_MAX) { + spin_unlock(&tr->snapshot_trigger_lock); + return -EBUSY; + } + tr->mapped++; + spin_unlock(&tr->snapshot_trigger_lock); + + /* Wait for update_max_tr() to observe iter->tr->mapped */ + if (tr->mapped == 1) + synchronize_rcu(); +#endif + ret = ring_buffer_map(iter->array_buffer->buffer, iter->cpu_file); +#ifdef CONFIG_TRACER_MAX_TRACE + if (ret) { + spin_lock(&tr->snapshot_trigger_lock); + iter->tr->mapped--; + spin_unlock(&tr->snapshot_trigger_lock); + } +#endif + return ret; +} + static const struct file_operations tracing_buffers_fops = { .open = tracing_buffers_open, .read = tracing_buffers_read, @@ -8681,6 +8794,7 @@ static const struct file_operations tracing_buffers_fops = { .splice_read= tracing_buffers_splice_read, .unlocked_ioctl = tracing_buffers_ioctl, .llseek = no_llseek, + .mmap = tracing_buffers_mmap, }; static ssize_t diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h index bd312e9afe25..8a96e7a89e6b 100644 --- a/kernel/trace/trace.h +++ b/kernel/trace/trace.h @@ -336,6 +336,7 @@ struct trace_array { boolallocated_snapshot; spinlock_t snapshot_trigger_lock; unsigned intsnapshot; + unsigned intmapped; unsigned long max_latency; #ifdef CONFIG_FSNOTIFY struct dentry *d_max_latency; -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com
Re: [RFC PATCH v2 8/8] dax: Fix incorrect list of dcache aliasing architectures
On 2024-01-30 22:13, Dan Williams wrote: Dave Chinner wrote: On Tue, Jan 30, 2024 at 11:52:55AM -0500, Mathieu Desnoyers wrote: commit d92576f1167c ("dax: does not work correctly with virtual aliasing caches") prevents DAX from building on architectures with virtually aliased dcache with: depends on !(ARM || MIPS || SPARC) This check is too broad (e.g. recent ARMv7 don't have virtually aliased dcaches), and also misses many other architectures with virtually aliased dcache. This is a regression introduced in the v5.13 Linux kernel where the dax mount option is removed for 32-bit ARMv7 boards which have no dcache aliasing, and therefore should work fine with FS_DAX. This was turned into the following implementation of dax_is_supported() by a preparatory change: return !IS_ENABLED(CONFIG_ARM) && !IS_ENABLED(CONFIG_MIPS) && !IS_ENABLED(CONFIG_SPARC); Use dcache_is_aliasing() instead to figure out whether the environment has aliasing dcaches. Fixes: d92576f1167c ("dax: does not work correctly with virtual aliasing caches") Signed-off-by: Mathieu Desnoyers Cc: Andrew Morton Cc: Linus Torvalds Cc: linux...@kvack.org Cc: linux-a...@vger.kernel.org Cc: Dan Williams Cc: Vishal Verma Cc: Dave Jiang Cc: Matthew Wilcox Cc: Arnd Bergmann Cc: Russell King Cc: nvd...@lists.linux.dev Cc: linux-...@vger.kernel.org Cc: linux-fsde...@vger.kernel.org --- include/linux/dax.h | 5 ++--- 1 file changed, 2 insertions(+), 3 deletions(-) diff --git a/include/linux/dax.h b/include/linux/dax.h index cfc8cd4a3eae..f59e604662e4 100644 --- a/include/linux/dax.h +++ b/include/linux/dax.h @@ -5,6 +5,7 @@ #include #include #include +#include typedef unsigned long dax_entry_t; @@ -80,9 +81,7 @@ static inline bool daxdev_mapping_supported(struct vm_area_struct *vma, } static inline bool dax_is_supported(void) { - return !IS_ENABLED(CONFIG_ARM) && - !IS_ENABLED(CONFIG_MIPS) && - !IS_ENABLED(CONFIG_SPARC); + return !dcache_is_aliasing(); Yeah, if this is just a one liner should go into 
fs_dax_get_by_bdev(), similar to the blk_queue_dax() check at the start of the function. I also noticed that device mapper uses fs_dax_get_by_bdev() to determine if it can support DAX, but this patch set does not address that case. Hence it really seems to me like fs_dax_get_by_bdev() is the right place to put this check. Oh, good catch. Yes, I agree this can definitely be pushed down, but then I think it should be pushed down all the way to make alloc_dax() fail. That will need some additional fixups like: diff --git a/drivers/md/dm.c b/drivers/md/dm.c index 8dcabf84d866..a35e60e62440 100644 --- a/drivers/md/dm.c +++ b/drivers/md/dm.c @@ -2126,12 +2126,12 @@ static struct mapped_device *alloc_dev(int minor) md->dax_dev = alloc_dax(md, &dm_dax_ops); if (IS_ERR(md->dax_dev)) { md->dax_dev = NULL; - goto bad; + } else { + set_dax_nocache(md->dax_dev); + set_dax_nomc(md->dax_dev); + if (dax_add_host(md->dax_dev, md->disk)) + goto bad; } - set_dax_nocache(md->dax_dev); - set_dax_nomc(md->dax_dev); - if (dax_add_host(md->dax_dev, md->disk)) - goto bad; } format_dev_t(md->name, MKDEV(_major, minor)); ...to make it not fatal to fail to register the dax_dev. I've had a quick look at other users of alloc_dax() and alloc_dax_region(), and so far I figure that all of those really want to bail out on alloc_dax failure. Is dm.c the only special-case we need to fix to make it non-fatal ? Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com
Re: [RFC PATCH v2 7/8] Introduce dcache_is_aliasing() across all architectures
On 2024-01-30 21:48, Dave Chinner wrote: On Tue, Jan 30, 2024 at 11:52:54AM -0500, Mathieu Desnoyers wrote: Introduce a generic way to query whether the dcache is virtually aliased on all architectures. Its purpose is to ensure that subsystems which are incompatible with virtually aliased data caches (e.g. FS_DAX) can reliably query this. For dcache aliasing, there are three scenarios depending on the architecture. Here is a breakdown based on my understanding: A) The dcache is always aliasing: * arc * csky * m68k (note: shared memory mappings are incoherent ? SHMLBA is missing there.) * sh * parisc /me wonders why the dentry cache aliasing has problems on these systems. Oh, dcache != fs/dcache.c (the VFS dentry cache). Can you please rename this function appropriately so us dumb filesystem people don't confuse cpu data cache configurations with the VFS dentry cache aliasing when we read this code? Something like cpu_dcache_is_aliased(), perhaps? Good point, will do. I'm planning to rename as follows for v3 to eliminate confusion with dentry cache (and with "page cache" in general): ARCH_HAS_CACHE_ALIASING -> ARCH_HAS_CPU_CACHE_ALIASING dcache_is_aliasing() -> cpu_dcache_is_aliasing() I noticed that you suggested "aliased" rather than "aliasing", but I followed what arm64 did for icache_is_aliasing(). Do you have a strong preference one way or another? Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com
[RFC PATCH v2 7/8] Introduce dcache_is_aliasing() across all architectures
Introduce a generic way to query whether the dcache is virtually aliased on all architectures. Its purpose is to ensure that subsystems which are incompatible with virtually aliased data caches (e.g. FS_DAX) can reliably query this. For dcache aliasing, there are three scenarios depending on the architecture. Here is a breakdown based on my understanding:

A) The dcache is always aliasing:

* arc
* csky
* m68k (note: shared memory mappings are incoherent ? SHMLBA is missing there.)
* sh
* parisc

B) The dcache aliasing is statically known or depends on querying CPU state at runtime:

* arm (cache_is_vivt() || cache_is_vipt_aliasing())
* mips (cpu_has_dc_aliases)
* nios2 (NIOS2_DCACHE_SIZE > PAGE_SIZE)
* sparc32 (vac_cache_size > PAGE_SIZE)
* sparc64 (L1DCACHE_SIZE > PAGE_SIZE)
* xtensa (DCACHE_WAY_SIZE > PAGE_SIZE)

C) The dcache is never aliasing:

* alpha
* arm64 (aarch64)
* hexagon
* loongarch (but with incoherent write buffers, which are disabled since commit d23b7795 ("LoongArch: Change SHMLBA from SZ_64K to PAGE_SIZE"))
* microblaze
* openrisc
* powerpc
* riscv
* s390
* um
* x86

Require architectures in A) and B) to select ARCH_HAS_CACHE_ALIASING and implement "dcache_is_aliasing()". Architectures in C) don't select ARCH_HAS_CACHE_ALIASING, and thus dcache_is_aliasing() simply evaluates to "false". Note that this leaves "icache_is_aliasing()" to be implemented as future work. This would be useful to gate features like XIP on architectures which have aliasing dcache-icache but not dcache-dcache. 
Link: https://lore.kernel.org/lkml/20030910210416.ga24...@mail.jlokier.co.uk/ Fixes: d92576f1167c ("dax: does not work correctly with virtual aliasing caches") Signed-off-by: Mathieu Desnoyers Cc: Andrew Morton Cc: Linus Torvalds Cc: linux...@kvack.org Cc: linux-a...@vger.kernel.org Cc: Dan Williams Cc: Vishal Verma Cc: Dave Jiang Cc: Matthew Wilcox Cc: Arnd Bergmann Cc: Russell King Cc: linux-...@vger.kernel.org Cc: nvd...@lists.linux.dev Cc: linux-...@vger.kernel.org --- arch/arc/Kconfig| 1 + arch/arc/include/asm/cachetype.h| 9 + arch/arm/Kconfig| 1 + arch/arm/include/asm/cachetype.h| 2 ++ arch/csky/Kconfig | 1 + arch/csky/include/asm/cachetype.h | 9 + arch/m68k/Kconfig | 1 + arch/m68k/include/asm/cachetype.h | 9 + arch/mips/Kconfig | 1 + arch/mips/include/asm/cachetype.h | 9 + arch/nios2/Kconfig | 1 + arch/nios2/include/asm/cachetype.h | 10 ++ arch/parisc/Kconfig | 1 + arch/parisc/include/asm/cachetype.h | 9 + arch/sh/Kconfig | 1 + arch/sh/include/asm/cachetype.h | 9 + arch/sparc/Kconfig | 1 + arch/sparc/include/asm/cachetype.h | 14 ++ arch/xtensa/Kconfig | 1 + arch/xtensa/include/asm/cachetype.h | 10 ++ include/linux/cacheinfo.h | 6 ++ mm/Kconfig | 6 ++ 22 files changed, 112 insertions(+) create mode 100644 arch/arc/include/asm/cachetype.h create mode 100644 arch/csky/include/asm/cachetype.h create mode 100644 arch/m68k/include/asm/cachetype.h create mode 100644 arch/mips/include/asm/cachetype.h create mode 100644 arch/nios2/include/asm/cachetype.h create mode 100644 arch/parisc/include/asm/cachetype.h create mode 100644 arch/sh/include/asm/cachetype.h create mode 100644 arch/sparc/include/asm/cachetype.h create mode 100644 arch/xtensa/include/asm/cachetype.h diff --git a/arch/arc/Kconfig b/arch/arc/Kconfig index 1b0483c51cc1..969e6740bcf7 100644 --- a/arch/arc/Kconfig +++ b/arch/arc/Kconfig @@ -6,6 +6,7 @@ config ARC def_bool y select ARC_TIMERS + select ARCH_HAS_CACHE_ALIASING select ARCH_HAS_CACHE_LINE_SIZE select ARCH_HAS_DEBUG_VM_PGTABLE select 
ARCH_HAS_DMA_PREP_COHERENT diff --git a/arch/arc/include/asm/cachetype.h b/arch/arc/include/asm/cachetype.h new file mode 100644 index ..290e3cc85845 --- /dev/null +++ b/arch/arc/include/asm/cachetype.h @@ -0,0 +1,9 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +#ifndef __ASM_ARC_CACHETYPE_H +#define __ASM_ARC_CACHETYPE_H + +#include + +#define dcache_is_aliasing() true + +#endif diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig index f8567e95f98b..5adeee5e421f 100644 --- a/arch/arm/Kconfig +++ b/arch/arm/Kconfig @@ -5,6 +5,7 @@ config ARM select ARCH_32BIT_OFF_T select ARCH_CORRECT_STACKTRACE_ON_KRETPROBE if HAVE_KRETPROBES && FRAME_POINTER && !ARM_UNWIND select ARCH_HAS_BINFMT_FLAT + select ARCH_HAS_CACHE_ALIASING select ARCH_HAS_CPU_FINALIZE_INIT if MMU select ARCH_HAS_CURRENT_STACK_POINTER select ARCH_HAS_DEBUG_VIRTUAL if MMU diff --git a/arch/arm/include/
[RFC PATCH v2 8/8] dax: Fix incorrect list of dcache aliasing architectures
commit d92576f1167c ("dax: does not work correctly with virtual aliasing caches") prevents DAX from building on architectures with virtually aliased dcache with: depends on !(ARM || MIPS || SPARC) This check is too broad (e.g. recent ARMv7 don't have virtually aliased dcaches), and also misses many other architectures with virtually aliased dcache. This is a regression introduced in the v5.13 Linux kernel where the dax mount option is removed for 32-bit ARMv7 boards which have no dcache aliasing, and therefore should work fine with FS_DAX. This was turned into the following implementation of dax_is_supported() by a preparatory change: return !IS_ENABLED(CONFIG_ARM) && !IS_ENABLED(CONFIG_MIPS) && !IS_ENABLED(CONFIG_SPARC); Use dcache_is_aliasing() instead to figure out whether the environment has aliasing dcaches. Fixes: d92576f1167c ("dax: does not work correctly with virtual aliasing caches") Signed-off-by: Mathieu Desnoyers Cc: Andrew Morton Cc: Linus Torvalds Cc: linux...@kvack.org Cc: linux-a...@vger.kernel.org Cc: Dan Williams Cc: Vishal Verma Cc: Dave Jiang Cc: Matthew Wilcox Cc: Arnd Bergmann Cc: Russell King Cc: nvd...@lists.linux.dev Cc: linux-...@vger.kernel.org Cc: linux-fsde...@vger.kernel.org --- include/linux/dax.h | 5 ++--- 1 file changed, 2 insertions(+), 3 deletions(-) diff --git a/include/linux/dax.h b/include/linux/dax.h index cfc8cd4a3eae..f59e604662e4 100644 --- a/include/linux/dax.h +++ b/include/linux/dax.h @@ -5,6 +5,7 @@ #include #include #include +#include typedef unsigned long dax_entry_t; @@ -80,9 +81,7 @@ static inline bool daxdev_mapping_supported(struct vm_area_struct *vma, } static inline bool dax_is_supported(void) { - return !IS_ENABLED(CONFIG_ARM) && - !IS_ENABLED(CONFIG_MIPS) && - !IS_ENABLED(CONFIG_SPARC); + return !dcache_is_aliasing(); } #else static inline void *dax_holder(struct dax_device *dax_dev) -- 2.39.2
[RFC PATCH v2 6/8] xfs: Use dax_is_supported()
Use dax_is_supported() to validate whether the architecture has virtually aliased data caches at mount time. Silently disable DAX if dax=always is requested as a mount option on an architecture which does not support DAX. This is relevant for architectures which require a dynamic check to validate whether they have virtually aliased data caches. Fixes: d92576f1167c ("dax: does not work correctly with virtual aliasing caches") Signed-off-by: Mathieu Desnoyers Cc: Chandan Babu R Cc: Darrick J. Wong Cc: linux-...@vger.kernel.org Cc: Andrew Morton Cc: Linus Torvalds Cc: linux...@kvack.org Cc: linux-a...@vger.kernel.org Cc: Dan Williams Cc: Vishal Verma Cc: Dave Jiang Cc: Matthew Wilcox Cc: Arnd Bergmann Cc: Russell King Cc: nvd...@lists.linux.dev Cc: linux-...@vger.kernel.org --- fs/xfs/xfs_iops.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c index a0d77f5f512e..360f640159b0 100644 --- a/fs/xfs/xfs_iops.c +++ b/fs/xfs/xfs_iops.c @@ -1208,7 +1208,7 @@ static bool xfs_inode_should_enable_dax( struct xfs_inode *ip) { - if (!IS_ENABLED(CONFIG_FS_DAX)) + if (!dax_is_supported()) return false; if (xfs_has_dax_never(ip->i_mount)) return false; -- 2.39.2
[RFC PATCH v2 5/8] fuse: Use dax_is_supported()
Use dax_is_supported() to validate whether the architecture has virtually aliased data caches at mount time. Silently disable DAX if dax=always is requested as a mount option on an architecture which does not support DAX. This is relevant for architectures which require a dynamic check to validate whether they have virtually aliased data caches. Fixes: d92576f1167c ("dax: does not work correctly with virtual aliasing caches") Signed-off-by: Mathieu Desnoyers Cc: Miklos Szeredi Cc: linux-fsde...@vger.kernel.org Cc: Andrew Morton Cc: Linus Torvalds Cc: linux...@kvack.org Cc: linux-a...@vger.kernel.org Cc: Dan Williams Cc: Vishal Verma Cc: Dave Jiang Cc: Matthew Wilcox Cc: Arnd Bergmann Cc: Russell King Cc: nvd...@lists.linux.dev Cc: linux-...@vger.kernel.org --- fs/fuse/dax.c | 7 +++ 1 file changed, 7 insertions(+) diff --git a/fs/fuse/dax.c b/fs/fuse/dax.c index 12ef91d170bb..36e1c1abbf8e 100644 --- a/fs/fuse/dax.c +++ b/fs/fuse/dax.c @@ -1336,6 +1336,13 @@ static bool fuse_should_enable_dax(struct inode *inode, unsigned int flags) if (dax_mode == FUSE_DAX_NEVER) return false; + /* +* Silently fallback to 'never' mode if the architecture does +* not support DAX. +*/ + if (!dax_is_supported()) + return false; + /* * fc->dax may be NULL in 'inode' mode when filesystem device doesn't * support DAX, in which case it will silently fallback to 'never' mode. -- 2.39.2
[RFC PATCH v2 4/8] ext4: Use dax_is_supported()
Use dax_is_supported() to validate whether the architecture has virtually aliased data caches at mount time. Mount fails if dax=always is requested as a mount option on an architecture which does not support DAX. This is relevant for architectures which require a dynamic check to validate whether they have virtually aliased data caches. Fixes: d92576f1167c ("dax: does not work correctly with virtual aliasing caches") Signed-off-by: Mathieu Desnoyers Cc: "Theodore Ts'o" Cc: Andreas Dilger Cc: linux-e...@vger.kernel.org Cc: Andrew Morton Cc: Linus Torvalds Cc: linux...@kvack.org Cc: linux-a...@vger.kernel.org Cc: Dan Williams Cc: Vishal Verma Cc: Dave Jiang Cc: Matthew Wilcox Cc: Arnd Bergmann Cc: Russell King Cc: nvd...@lists.linux.dev Cc: linux-...@vger.kernel.org Cc: linux-fsde...@vger.kernel.org --- fs/ext4/super.c | 5 - 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/fs/ext4/super.c b/fs/ext4/super.c index c5fcf377ab1f..f2c11ae3ec29 100644 --- a/fs/ext4/super.c +++ b/fs/ext4/super.c @@ -4743,7 +4743,10 @@ static int ext4_check_feature_compatibility(struct super_block *sb, } if (sbi->s_mount_opt & EXT4_MOUNT_DAX_ALWAYS) { - if (ext4_has_feature_inline_data(sb)) { + if (!dax_is_supported()) { + ext4_msg(sb, KERN_ERR, "DAX unsupported by architecture."); + return -EINVAL; + } else if (ext4_has_feature_inline_data(sb)) { ext4_msg(sb, KERN_ERR, "Cannot use DAX on a filesystem" " that may contain inline data"); return -EINVAL; -- 2.39.2
[RFC PATCH v2 3/8] ext2: Use dax_is_supported()
Use dax_is_supported() to validate whether the architecture has virtually aliased data caches at mount time. Print an error and disable DAX if dax=always is requested as a mount option on an architecture which does not support DAX. This is relevant for architectures which require a dynamic check to validate whether they have virtually aliased data caches. Fixes: d92576f1167c ("dax: does not work correctly with virtual aliasing caches") Signed-off-by: Mathieu Desnoyers Cc: Jan Kara Cc: linux-e...@vger.kernel.org Cc: Andrew Morton Cc: Linus Torvalds Cc: linux...@kvack.org Cc: linux-a...@vger.kernel.org Cc: Dan Williams Cc: Vishal Verma Cc: Dave Jiang Cc: Matthew Wilcox Cc: Arnd Bergmann Cc: Russell King Cc: nvd...@lists.linux.dev Cc: linux-...@vger.kernel.org Cc: linux-fsde...@vger.kernel.org --- fs/ext2/super.c | 6 +- 1 file changed, 5 insertions(+), 1 deletion(-) diff --git a/fs/ext2/super.c b/fs/ext2/super.c index 01f9addc8b1f..30ff57d47ed4 100644 --- a/fs/ext2/super.c +++ b/fs/ext2/super.c @@ -955,7 +955,11 @@ static int ext2_fill_super(struct super_block *sb, void *data, int silent) blocksize = BLOCK_SIZE << le32_to_cpu(sbi->s_es->s_log_block_size); if (test_opt(sb, DAX)) { - if (!sbi->s_daxdev) { + if (!dax_is_supported()) { + ext2_msg(sb, KERN_ERR, + "DAX unsupported by architecture. Turning off DAX."); + clear_opt(sbi->s_mount_opt, DAX); + } else if (!sbi->s_daxdev) { ext2_msg(sb, KERN_ERR, "DAX unsupported by block device. Turning off DAX."); clear_opt(sbi->s_mount_opt, DAX); -- 2.39.2
[RFC PATCH v2 0/8] Introduce dcache_is_aliasing() to fix DAX regression
This commit introduced in v5.13 prevents building FS_DAX on 32-bit ARM, even on ARMv7 which does not have virtually aliased dcaches: commit d92576f1167c ("dax: does not work correctly with virtual aliasing caches") It used to work fine before: I have customers using DAX over pmem on ARMv7, but this regression will likely prevent them from upgrading their kernel. The root of the issue here is the fact that DAX was never designed to handle virtually aliased dcache (VIVT and VIPT with aliased dcache). It touches the pages through their linear mapping, which is not consistent with the userspace mappings on virtually aliased dcaches. This patch series introduces dcache_is_aliasing() with the new Kconfig option ARCH_HAS_CACHE_ALIASING and implements it for all architectures. The implementation of dcache_is_aliasing() is either evaluated to a constant at compile-time or a runtime check, which is what is needed on ARM. With this we can basically narrow down the list of architectures which are unsupported by DAX to those which are really affected. Note that the order of the series was completely changed based on the feedback received on v1, cache_is_aliasing() is renamed to dcache_is_aliasing(), ARCH_HAS_CACHE_ALIASING_DYNAMIC is gone, dcache_is_aliasing() vs ARCH_HAS_CACHE_ALIASING relationship is simplified, and the dax_is_supported() check was moved to its rightful place in all filesystems. 
Feedback is welcome, Thanks, Mathieu Cc: Andrew Morton Cc: Linus Torvalds Cc: linux...@kvack.org Cc: linux-a...@vger.kernel.org Cc: Dan Williams Cc: Vishal Verma Cc: Dave Jiang Cc: Matthew Wilcox Cc: Arnd Bergmann Cc: Russell King Cc: linux-...@vger.kernel.org Cc: nvd...@lists.linux.dev Cc: linux-fsde...@vger.kernel.org Mathieu Desnoyers (8): dax: Introduce dax_is_supported() erofs: Use dax_is_supported() ext2: Use dax_is_supported() ext4: Use dax_is_supported() fuse: Use dax_is_supported() xfs: Use dax_is_supported() Introduce dcache_is_aliasing() across all architectures dax: Fix incorrect list of dcache aliasing architectures arch/arc/Kconfig| 1 + arch/arc/include/asm/cachetype.h| 9 + arch/arm/Kconfig| 1 + arch/arm/include/asm/cachetype.h| 2 ++ arch/csky/Kconfig | 1 + arch/csky/include/asm/cachetype.h | 9 + arch/m68k/Kconfig | 1 + arch/m68k/include/asm/cachetype.h | 9 + arch/mips/Kconfig | 1 + arch/mips/include/asm/cachetype.h | 9 + arch/nios2/Kconfig | 1 + arch/nios2/include/asm/cachetype.h | 10 ++ arch/parisc/Kconfig | 1 + arch/parisc/include/asm/cachetype.h | 9 + arch/sh/Kconfig | 1 + arch/sh/include/asm/cachetype.h | 9 + arch/sparc/Kconfig | 1 + arch/sparc/include/asm/cachetype.h | 14 ++ arch/xtensa/Kconfig | 1 + arch/xtensa/include/asm/cachetype.h | 10 ++ fs/Kconfig | 1 - fs/erofs/super.c| 5 - fs/ext2/super.c | 6 +- fs/ext4/super.c | 5 - fs/fuse/dax.c | 7 +++ fs/xfs/xfs_iops.c | 2 +- include/linux/cacheinfo.h | 6 ++ include/linux/dax.h | 9 + mm/Kconfig | 6 ++ 29 files changed, 142 insertions(+), 5 deletions(-) create mode 100644 arch/arc/include/asm/cachetype.h create mode 100644 arch/csky/include/asm/cachetype.h create mode 100644 arch/m68k/include/asm/cachetype.h create mode 100644 arch/mips/include/asm/cachetype.h create mode 100644 arch/nios2/include/asm/cachetype.h create mode 100644 arch/parisc/include/asm/cachetype.h create mode 100644 arch/sh/include/asm/cachetype.h create mode 100644 arch/sparc/include/asm/cachetype.h create mode 100644 
arch/xtensa/include/asm/cachetype.h -- 2.39.2
[RFC PATCH v2 1/8] dax: Introduce dax_is_supported()
Introduce a new dax_is_supported() static inline to check whether the architecture supports DAX. This replaces the following fs/Kconfig:FS_DAX dependency: depends on !(ARM || MIPS || SPARC) This is done in preparation for its use by each filesystem supporting the dax mount option to validate whether dax is indeed supported. This is done in preparation for using dcache_is_aliasing() in a following change which will properly support architectures which detect dcache aliasing at runtime. Fixes: d92576f1167c ("dax: does not work correctly with virtual aliasing caches") Signed-off-by: Mathieu Desnoyers Cc: Andrew Morton Cc: Linus Torvalds Cc: linux...@kvack.org Cc: linux-a...@vger.kernel.org Cc: Dan Williams Cc: Vishal Verma Cc: Dave Jiang Cc: Matthew Wilcox Cc: Arnd Bergmann Cc: Russell King Cc: nvd...@lists.linux.dev Cc: linux-...@vger.kernel.org Cc: linux-fsde...@vger.kernel.org --- fs/Kconfig | 1 - include/linux/dax.h | 10 ++ 2 files changed, 10 insertions(+), 1 deletion(-) diff --git a/fs/Kconfig b/fs/Kconfig index 42837617a55b..e5efdb3b276b 100644 --- a/fs/Kconfig +++ b/fs/Kconfig @@ -56,7 +56,6 @@ endif # BLOCK config FS_DAX bool "File system based Direct Access (DAX) support" depends on MMU - depends on !(ARM || MIPS || SPARC) depends on ZONE_DEVICE || FS_DAX_LIMITED select FS_IOMAP select DAX diff --git a/include/linux/dax.h b/include/linux/dax.h index b463502b16e1..cfc8cd4a3eae 100644 --- a/include/linux/dax.h +++ b/include/linux/dax.h @@ -78,6 +78,12 @@ static inline bool daxdev_mapping_supported(struct vm_area_struct *vma, return false; return dax_synchronous(dax_dev); } +static inline bool dax_is_supported(void) +{ + return !IS_ENABLED(CONFIG_ARM) && + !IS_ENABLED(CONFIG_MIPS) && + !IS_ENABLED(CONFIG_SPARC); +} #else static inline void *dax_holder(struct dax_device *dax_dev) { @@ -122,6 +128,10 @@ static inline size_t dax_recovery_write(struct dax_device *dax_dev, { return 0; } +static inline bool dax_is_supported(void) +{ + return false; +} #endif void 
set_dax_nocache(struct dax_device *dax_dev); -- 2.39.2
Re: [RFC PATCH 4/7] ext2: Use dax_is_supported()
On 2024-01-30 06:33, Jan Kara wrote: On Mon 29-01-24 16:06:28, Mathieu Desnoyers wrote: Use dax_is_supported() to validate whether the architecture has virtually aliased caches at mount time. This is relevant for architectures which require a dynamic check to validate whether they have virtually aliased data caches (ARCH_HAS_CACHE_ALIASING_DYNAMIC=y). Fixes: d92576f1167c ("dax: does not work correctly with virtual aliasing caches") Signed-off-by: Mathieu Desnoyers Cc: Jan Kara Cc: linux-e...@vger.kernel.org Cc: Andrew Morton Cc: Linus Torvalds Cc: linux...@kvack.org Cc: linux-a...@vger.kernel.org Cc: Dan Williams Cc: Vishal Verma Cc: Dave Jiang Cc: Matthew Wilcox Cc: nvd...@lists.linux.dev Cc: linux-...@vger.kernel.org Looks good to me (although I share Dave's opinion it would be nice to CC the whole series to fsdevel - luckily we have lore these days so it is not that tedious to find the whole series :)). Feel free to add: Acked-by: Jan Kara Hi Jan, Thanks for looking at it! I will do significant changes for v2, so I will hold on before integrating your acked-by if it's OK with you. Thanks! Mathieu Honza --- fs/ext2/super.c | 14 +++--- 1 file changed, 7 insertions(+), 7 deletions(-) diff --git a/fs/ext2/super.c b/fs/ext2/super.c index 01f9addc8b1f..0398e7a90eb6 100644 --- a/fs/ext2/super.c +++ b/fs/ext2/super.c @@ -585,13 +585,13 @@ static int parse_options(char *options, struct super_block *sb, set_opt(opts->s_mount_opt, XIP); fallthrough; case Opt_dax: -#ifdef CONFIG_FS_DAX - ext2_msg(sb, KERN_WARNING, - "DAX enabled. Warning: EXPERIMENTAL, use at your own risk"); - set_opt(opts->s_mount_opt, DAX); -#else - ext2_msg(sb, KERN_INFO, "dax option not supported"); -#endif + if (dax_is_supported()) { + ext2_msg(sb, KERN_WARNING, +"DAX enabled. 
Warning: EXPERIMENTAL, use at your own risk"); + set_opt(opts->s_mount_opt, DAX); + } else { + ext2_msg(sb, KERN_INFO, "dax option not supported"); + } break; #if defined(CONFIG_QUOTA) -- 2.39.2 -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com
Re: [RFC PATCH 7/7] xfs: Use dax_is_supported()
On 2024-01-29 21:38, Dave Chinner wrote: On Mon, Jan 29, 2024 at 04:06:31PM -0500, Mathieu Desnoyers wrote: Use dax_is_supported() to validate whether the architecture has virtually aliased caches at mount time. This is relevant for architectures which require a dynamic check to validate whether they have virtually aliased data caches (ARCH_HAS_CACHE_ALIASING_DYNAMIC=y). Where's the rest of this patchset? I have no idea what dax_is_supported() actually does, how it interacts with CONFIG_FS_DAX, etc. If you are changing anything to do with FSDAX, then cc-ing the -entire- patchset to linux-fsdevel is absolutely necessary so the entire patchset lands in our inboxes and not just a random patch from the middle of a bigger change. Sorry, I will Cc linux-fsdevel on all patches for the next round. Meanwhile you can find the whole series on lore: https://lore.kernel.org/lkml/20240129210631.193493-1-mathieu.desnoy...@efficios.com/ [...] Assuming that I understand what dax_is_supported() is doing, this change isn't right. We're just setting the DAX configuration flags from the mount options here, we don't validate them until we've parsed all options and eliminated conflicts and rejected conflicting options. We validate whether the options are appropriate for the underlying hardware configuration later in the mount process. dax=always suitability is checked in xfs_setup_dax_always(), called later in the mount process when we have enough context and support to open storage devices and check them for DAX support. If the hardware does not support DAX then we simply turn off DAX support; we do not reject the mount as this change does. dax=inode and dax=never are valid options on all configurations, even those without FSDAX support or with hardware that is not capable of using DAX. dax=inode only affects how an inode is instantiated in cache - if the inode has a flag that says "use DAX" and dax is supportable by the hardware, then we turn on DAX for that inode. 
Otherwise we just use the normal non-dax IO paths. Again, we don't error out the filesystem if DAX is not supported, we just don't turn it on. This check is done in xfs_inode_should_enable_dax() and I think all you need to do is replace the IS_ENABLED(CONFIG_FS_DAX) with a dax_is_supported() call... Thanks a lot for the detailed explanation. You are right, I will move the dax_is_supported() check to xfs_inode_should_enable_dax(). Mathieu -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com
Re: [RFC PATCH 0/7] Introduce cache_is_aliasing() to fix DAX regression
On 2024-01-29 16:22, Dan Williams wrote: Mathieu Desnoyers wrote: This commit introduced in v5.13 prevents building FS_DAX on 32-bit ARM, even on ARMv7 which does not have virtually aliased dcaches: commit d92576f1167c ("dax: does not work correctly with virtual aliasing caches") It used to work fine before: I have customers using dax over pmem on ARMv7, but this regression will likely prevent them from upgrading their kernel. The root of the issue here is the fact that DAX was never designed to handle virtually aliased dcache (VIVT and VIPT with aliased dcache). It touches the pages through their linear mapping, which is not consistent with the userspace mappings on virtually aliased dcaches. This patch series introduces cache_is_aliasing() with new Kconfig options: * ARCH_HAS_CACHE_ALIASING * ARCH_HAS_CACHE_ALIASING_DYNAMIC and implements it for all architectures. The "DYNAMIC" implementation implements cache_is_aliasing() as a runtime check, which is what is needed on architectures like 32-bit ARMV6 and ARMV6K. With this we can basically narrow down the list of architectures which are unsupported by DAX to those which are really affected. Feedback is welcome, Hi Mathieu, this looks good overall, just some quibbling about the ordering. Thanks for having a look ! I would introduce dax_is_supported() with the current overly broad interpretation of "!(ARM || MIPS || SPARC)" using IS_ENABLED(), then fixup the filesystems to use the new helper, and finally go back and convert dax_is_supported() to use cache_is_aliasing() internally. Will do. Separately, it is not clear to me why ARCH_HAS_CACHE_ALIASING_DYNAMIC needs to exist. As long as all paths that care are calling cache_is_aliasing() then whether it is dynamic or not is something only the compiler cares about. If those dynamic archs do not want to pay the .text size increase they can always do CONFIG_FS_DAX=n, right? Good point. It will help reduce complexity and improve test coverage. 
I also intend to rename "cache_is_aliasing()" to "dcache_is_aliasing()", so if we introduce an "icache_is_aliasing()" in the future, it won't be confusing. Having aliasing icache-dcache but not dcache-dcache seems to be fairly common. So basically: If an arch selects ARCH_HAS_CACHE_ALIASING, it implements dcache_is_aliasing() (for now), and eventually we can implement icache_is_aliasing() as well. Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com
[RFC PATCH 5/7] ext4: Use dax_is_supported()
Use dax_is_supported() to validate whether the architecture has virtually aliased caches at mount time. This is relevant for architectures which require a dynamic check to validate whether they have virtually aliased data caches (ARCH_HAS_CACHE_ALIASING_DYNAMIC=y). Fixes: d92576f1167c ("dax: does not work correctly with virtual aliasing caches") Signed-off-by: Mathieu Desnoyers Cc: "Theodore Ts'o" Cc: Andreas Dilger Cc: linux-e...@vger.kernel.org Cc: Andrew Morton Cc: Linus Torvalds Cc: linux...@kvack.org Cc: linux-a...@vger.kernel.org Cc: Dan Williams Cc: Vishal Verma Cc: Dave Jiang Cc: Matthew Wilcox Cc: nvd...@lists.linux.dev Cc: linux-...@vger.kernel.org --- fs/ext4/super.c | 52 - 1 file changed, 25 insertions(+), 27 deletions(-) diff --git a/fs/ext4/super.c b/fs/ext4/super.c index c5fcf377ab1f..9e0606289239 100644 --- a/fs/ext4/super.c +++ b/fs/ext4/super.c @@ -2359,34 +2359,32 @@ static int ext4_parse_param(struct fs_context *fc, struct fs_parameter *param) return ext4_parse_test_dummy_encryption(param, ctx); case Opt_dax: case Opt_dax_type: -#ifdef CONFIG_FS_DAX - { - int type = (token == Opt_dax) ? - Opt_dax : result.uint_32; - - switch (type) { - case Opt_dax: - case Opt_dax_always: - ctx_set_mount_opt(ctx, EXT4_MOUNT_DAX_ALWAYS); - ctx_clear_mount_opt2(ctx, EXT4_MOUNT2_DAX_NEVER); - break; - case Opt_dax_never: - ctx_set_mount_opt2(ctx, EXT4_MOUNT2_DAX_NEVER); - ctx_clear_mount_opt(ctx, EXT4_MOUNT_DAX_ALWAYS); - break; - case Opt_dax_inode: - ctx_clear_mount_opt(ctx, EXT4_MOUNT_DAX_ALWAYS); - ctx_clear_mount_opt2(ctx, EXT4_MOUNT2_DAX_NEVER); - /* Strictly for printing options */ - ctx_set_mount_opt2(ctx, EXT4_MOUNT2_DAX_INODE); - break; + if (dax_is_supported()) { + int type = (token == Opt_dax) ? 
+ Opt_dax : result.uint_32; + + switch (type) { + case Opt_dax: + case Opt_dax_always: + ctx_set_mount_opt(ctx, EXT4_MOUNT_DAX_ALWAYS); + ctx_clear_mount_opt2(ctx, EXT4_MOUNT2_DAX_NEVER); + break; + case Opt_dax_never: + ctx_set_mount_opt2(ctx, EXT4_MOUNT2_DAX_NEVER); + ctx_clear_mount_opt(ctx, EXT4_MOUNT_DAX_ALWAYS); + break; + case Opt_dax_inode: + ctx_clear_mount_opt(ctx, EXT4_MOUNT_DAX_ALWAYS); + ctx_clear_mount_opt2(ctx, EXT4_MOUNT2_DAX_NEVER); + /* Strictly for printing options */ + ctx_set_mount_opt2(ctx, EXT4_MOUNT2_DAX_INODE); + break; + } + return 0; + } else { + ext4_msg(NULL, KERN_INFO, "dax option not supported"); + return -EINVAL; } - return 0; - } -#else - ext4_msg(NULL, KERN_INFO, "dax option not supported"); - return -EINVAL; -#endif case Opt_data_err: if (result.uint_32 == Opt_data_err_abort) ctx_set_mount_opt(ctx, m->mount_opt); -- 2.39.2
[RFC PATCH 7/7] xfs: Use dax_is_supported()
Use dax_is_supported() to validate whether the architecture has virtually aliased caches at mount time. This is relevant for architectures which require a dynamic check to validate whether they have virtually aliased data caches (ARCH_HAS_CACHE_ALIASING_DYNAMIC=y). Fixes: d92576f1167c ("dax: does not work correctly with virtual aliasing caches") Signed-off-by: Mathieu Desnoyers Cc: Chandan Babu R Cc: Darrick J. Wong Cc: linux-...@vger.kernel.org Cc: Andrew Morton Cc: Linus Torvalds Cc: linux...@kvack.org Cc: linux-a...@vger.kernel.org Cc: Dan Williams Cc: Vishal Verma Cc: Dave Jiang Cc: Matthew Wilcox Cc: nvd...@lists.linux.dev Cc: linux-...@vger.kernel.org --- fs/xfs/xfs_super.c | 20 ++-- 1 file changed, 14 insertions(+), 6 deletions(-) diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c index 764304595e8b..b27ecb11db66 100644 --- a/fs/xfs/xfs_super.c +++ b/fs/xfs/xfs_super.c @@ -1376,14 +1376,22 @@ xfs_fs_parse_param( case Opt_nodiscard: parsing_mp->m_features &= ~XFS_FEAT_DISCARD; return 0; -#ifdef CONFIG_FS_DAX case Opt_dax: - xfs_mount_set_dax_mode(parsing_mp, XFS_DAX_ALWAYS); - return 0; + if (dax_is_supported()) { + xfs_mount_set_dax_mode(parsing_mp, XFS_DAX_ALWAYS); + return 0; + } else { + xfs_warn(parsing_mp, "dax option not supported."); + return -EINVAL; + } case Opt_dax_enum: - xfs_mount_set_dax_mode(parsing_mp, result.uint_32); - return 0; -#endif + if (dax_is_supported()) { + xfs_mount_set_dax_mode(parsing_mp, result.uint_32); + return 0; + } else { + xfs_warn(parsing_mp, "dax option not supported."); + return -EINVAL; + } /* Following mount options will be removed in September 2025 */ case Opt_ikeep: xfs_fs_warn_deprecated(fc, param, XFS_FEAT_IKEEP, true); -- 2.39.2
[RFC PATCH 6/7] fuse: Introduce fuse_dax_is_supported()
Use dax_is_supported() in addition to IS_ENABLED(CONFIG_FUSE_DAX) to validate whether CONFIG_FUSE_DAX is enabled and the architecture does not have virtually aliased caches. This is relevant for architectures which require a dynamic check to validate whether they have virtually aliased data caches (ARCH_HAS_CACHE_ALIASING_DYNAMIC=y). Fixes: d92576f1167c ("dax: does not work correctly with virtual aliasing caches") Signed-off-by: Mathieu Desnoyers Cc: Miklos Szeredi Cc: linux-fsde...@vger.kernel.org Cc: Andrew Morton Cc: Linus Torvalds Cc: linux...@kvack.org Cc: linux-a...@vger.kernel.org Cc: Dan Williams Cc: Vishal Verma Cc: Dave Jiang Cc: Matthew Wilcox Cc: nvd...@lists.linux.dev Cc: linux-...@vger.kernel.org --- fs/fuse/file.c | 2 +- fs/fuse/fuse_i.h| 36 +- fs/fuse/inode.c | 47 +++-- fs/fuse/virtio_fs.c | 4 ++-- 4 files changed, 62 insertions(+), 27 deletions(-) diff --git a/fs/fuse/file.c b/fs/fuse/file.c index a660f1f21540..133ac8524064 100644 --- a/fs/fuse/file.c +++ b/fs/fuse/file.c @@ -3247,6 +3247,6 @@ void fuse_init_file_inode(struct inode *inode, unsigned int flags) init_waitqueue_head(&fi->page_waitq); fi->writepages = RB_ROOT; - if (IS_ENABLED(CONFIG_FUSE_DAX)) + if (fuse_dax_is_supported()) fuse_dax_inode_init(inode, flags); } diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h index 1df83eebda92..1cbe37106669 100644 --- a/fs/fuse/fuse_i.h +++ b/fs/fuse/fuse_i.h @@ -31,6 +31,7 @@ #include #include #include +#include /** Default max number of pages that can be used in a single read request */ #define FUSE_DEFAULT_MAX_PAGES_PER_REQ 32 @@ -979,6 +980,38 @@ static inline void fuse_sync_bucket_dec(struct fuse_sync_bucket *bucket) rcu_read_unlock(); } +#ifdef CONFIG_FUSE_DAX +static inline struct fuse_inode_dax *fuse_inode_get_dax(struct fuse_inode *inode) +{ + return inode->dax; +} + +static inline enum fuse_dax_mode fuse_conn_get_dax_mode(struct fuse_conn *fc) +{ + return fc->dax_mode; +} + +static inline struct fuse_conn_dax *fuse_conn_get_dax(struct 
fuse_conn *fc) +{ + return fc->dax; +} +#else +static inline struct fuse_inode_dax *fuse_inode_get_dax(struct fuse_inode *inode) +{ + return NULL; +} + +static inline enum fuse_dax_mode fuse_conn_get_dax_mode(struct fuse_conn *fc) +{ + return FUSE_DAX_INODE_DEFAULT; +} + +static inline struct fuse_conn_dax *fuse_conn_get_dax(struct fuse_conn *fc) +{ + return NULL; +} +#endif + /** Device operations */ extern const struct file_operations fuse_dev_operations; @@ -1324,7 +1357,8 @@ void fuse_free_conn(struct fuse_conn *fc); /* dax.c */ -#define FUSE_IS_DAX(inode) (IS_ENABLED(CONFIG_FUSE_DAX) && IS_DAX(inode)) +#define fuse_dax_is_supported()(IS_ENABLED(CONFIG_FUSE_DAX) && dax_is_supported()) +#define FUSE_IS_DAX(inode) (fuse_dax_is_supported() && IS_DAX(inode)) ssize_t fuse_dax_read_iter(struct kiocb *iocb, struct iov_iter *to); ssize_t fuse_dax_write_iter(struct kiocb *iocb, struct iov_iter *from); diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c index 2a6d44f91729..030e6ce5486d 100644 --- a/fs/fuse/inode.c +++ b/fs/fuse/inode.c @@ -108,7 +108,7 @@ static struct inode *fuse_alloc_inode(struct super_block *sb) if (!fi->forget) goto out_free; - if (IS_ENABLED(CONFIG_FUSE_DAX) && !fuse_dax_inode_alloc(sb, fi)) + if (fuse_dax_is_supported() && !fuse_dax_inode_alloc(sb, fi)) goto out_free_forget; return &fi->inode; @@ -126,9 +126,8 @@ static void fuse_free_inode(struct inode *inode) mutex_destroy(&fi->mutex); kfree(fi->forget); -#ifdef CONFIG_FUSE_DAX - kfree(fi->dax); -#endif + if (fuse_dax_is_supported()) + kfree(fuse_inode_get_dax(fi)); kmem_cache_free(fuse_inode_cachep, fi); } @@ -361,7 +360,7 @@ void fuse_change_attributes(struct inode *inode, struct fuse_attr *attr, invalidate_inode_pages2(inode->i_mapping); } - if (IS_ENABLED(CONFIG_FUSE_DAX)) + if (fuse_dax_is_supported()) fuse_dax_dontcache(inode, attr->flags); } @@ -856,14 +855,16 @@ static int fuse_show_options(struct seq_file *m, struct dentry *root) if (sb->s_bdev && sb->s_blocksize != FUSE_DEFAULT_BLKSIZE) 
seq_printf(m, ",blksize=%lu", sb->s_blocksize); } -#ifdef CONFIG_FUSE_DAX - if (fc->dax_mode == FUSE_DAX_ALWAYS) - seq_puts(m, ",dax=always"); - else if (fc->dax_mode == FUSE_DAX_NEVER) - seq_puts(m, ",dax=never"); - else if (fc->dax_mode == FUSE_DAX_INODE_USER) - seq_puts(m, ",dax=inode"); -#endi
[RFC PATCH 2/7] dax: Fix incorrect list of cache aliasing architectures
fs/Kconfig:FS_DAX prevents DAX from building on architectures with virtually aliased dcache with: depends on !(ARM || MIPS || SPARC) This check is too broad (e.g. recent ARMv7 don't have virtually aliased dcaches), and also misses many other architectures with virtually aliased dcache. This is a regression introduced in the v5.13 Linux kernel where the dax mount option is removed for 32-bit ARMv7 boards which have no dcache aliasing, and therefore should work fine with FS_DAX. Use this instead in Kconfig to prevent FS_DAX from being built on architectures with virtually aliased dcache: depends on !ARCH_HAS_CACHE_ALIASING For architectures which detect dcache aliasing at runtime, introduce a new dax_is_supported() static inline which uses "cache_is_aliasing()" to figure out whether the environment has aliasing dcaches. This new dax_is_supported() helper will be used in each filesystem supporting the dax mount option to validate whether dax is indeed supported. Fixes: d92576f1167c ("dax: does not work correctly with virtual aliasing caches") Signed-off-by: Mathieu Desnoyers Cc: Andrew Morton Cc: Linus Torvalds Cc: linux...@kvack.org Cc: linux-a...@vger.kernel.org Cc: Dan Williams Cc: Vishal Verma Cc: Dave Jiang Cc: Matthew Wilcox Cc: nvd...@lists.linux.dev Cc: linux-...@vger.kernel.org --- fs/Kconfig | 2 +- include/linux/dax.h | 9 + 2 files changed, 10 insertions(+), 1 deletion(-) diff --git a/fs/Kconfig b/fs/Kconfig index 42837617a55b..6746fe403761 100644 --- a/fs/Kconfig +++ b/fs/Kconfig @@ -56,7 +56,7 @@ endif # BLOCK config FS_DAX bool "File system based Direct Access (DAX) support" depends on MMU - depends on !(ARM || MIPS || SPARC) + depends on !ARCH_HAS_CACHE_ALIASING depends on ZONE_DEVICE || FS_DAX_LIMITED select FS_IOMAP select DAX diff --git a/include/linux/dax.h b/include/linux/dax.h index b463502b16e1..8c595b04deeb 100644 --- a/include/linux/dax.h +++ b/include/linux/dax.h @@ -5,6 +5,7 @@ #include #include #include +#include typedef unsigned long 
dax_entry_t; @@ -78,6 +79,10 @@ static inline bool daxdev_mapping_supported(struct vm_area_struct *vma, return false; return dax_synchronous(dax_dev); } +static inline bool dax_is_supported(void) +{ + return !cache_is_aliasing(); +} #else static inline void *dax_holder(struct dax_device *dax_dev) { @@ -122,6 +127,10 @@ static inline size_t dax_recovery_write(struct dax_device *dax_dev, { return 0; } +static inline bool dax_is_supported(void) +{ + return false; +} #endif void set_dax_nocache(struct dax_device *dax_dev); -- 2.39.2
[RFC PATCH 4/7] ext2: Use dax_is_supported()
Use dax_is_supported() to validate whether the architecture has virtually aliased caches at mount time. This is relevant for architectures which require a dynamic check to validate whether they have virtually aliased data caches (ARCH_HAS_CACHE_ALIASING_DYNAMIC=y). Fixes: d92576f1167c ("dax: does not work correctly with virtual aliasing caches") Signed-off-by: Mathieu Desnoyers Cc: Jan Kara Cc: linux-e...@vger.kernel.org Cc: Andrew Morton Cc: Linus Torvalds Cc: linux...@kvack.org Cc: linux-a...@vger.kernel.org Cc: Dan Williams Cc: Vishal Verma Cc: Dave Jiang Cc: Matthew Wilcox Cc: nvd...@lists.linux.dev Cc: linux-...@vger.kernel.org --- fs/ext2/super.c | 14 +++--- 1 file changed, 7 insertions(+), 7 deletions(-) diff --git a/fs/ext2/super.c b/fs/ext2/super.c index 01f9addc8b1f..0398e7a90eb6 100644 --- a/fs/ext2/super.c +++ b/fs/ext2/super.c @@ -585,13 +585,13 @@ static int parse_options(char *options, struct super_block *sb, set_opt(opts->s_mount_opt, XIP); fallthrough; case Opt_dax: -#ifdef CONFIG_FS_DAX - ext2_msg(sb, KERN_WARNING, - "DAX enabled. Warning: EXPERIMENTAL, use at your own risk"); - set_opt(opts->s_mount_opt, DAX); -#else - ext2_msg(sb, KERN_INFO, "dax option not supported"); -#endif + if (dax_is_supported()) { + ext2_msg(sb, KERN_WARNING, +"DAX enabled. Warning: EXPERIMENTAL, use at your own risk"); + set_opt(opts->s_mount_opt, DAX); + } else { + ext2_msg(sb, KERN_INFO, "dax option not supported"); + } break; #if defined(CONFIG_QUOTA) -- 2.39.2
[RFC PATCH 1/7] Introduce cache_is_aliasing() across all architectures
Introduce a generic way to query whether the dcache is virtually aliased on all architectures. Its purpose is to ensure that subsystems which are incompatible with virtually aliased caches (e.g. FS_DAX) can reliably query this.

For dcache aliasing, there are three scenarios depending on the architecture. Here is a breakdown based on my understanding:

A) The dcache is always aliasing: (ARCH_HAS_CACHE_ALIASING=y)

* arm V4, V5 (CPU_CACHE_VIVT)
* arc
* csky
* m68k (note: shared memory mappings are incoherent? SHMLBA is missing there.)
* sh
* parisc

B) The dcache aliasing depends on querying CPU state at runtime: (ARCH_HAS_CACHE_ALIASING_DYNAMIC=y)

* arm V6, V6K (CPU_CACHE_VIPT) (cache_is_vipt_aliasing())
* mips (cpu_has_dc_aliases)
* nios2 (NIOS2_DCACHE_SIZE > PAGE_SIZE)
* sparc32 (vac_cache_size > PAGE_SIZE)
* sparc64 (L1DCACHE_SIZE > PAGE_SIZE)
* xtensa (DCACHE_WAY_SIZE > PAGE_SIZE)

C) The dcache is never aliasing:

* arm V7, V7M (unless ARM V6 or V6K are also present) (CPU_CACHE_VIPT)
* alpha
* arm64 (aarch64)
* hexagon
* loongarch (but with incoherent write buffers, which are disabled since commit d23b7795 ("LoongArch: Change SHMLBA from SZ_64K to PAGE_SIZE"))
* microblaze
* openrisc
* powerpc
* riscv
* s390
* um
* x86

Link: https://lore.kernel.org/lkml/20030910210416.ga24...@mail.jlokier.co.uk/ Signed-off-by: Mathieu Desnoyers Cc: Andrew Morton Cc: Linus Torvalds Cc: linux...@kvack.org Cc: linux-a...@vger.kernel.org Cc: Dan Williams Cc: Vishal Verma Cc: Dave Jiang Cc: Matthew Wilcox Cc: linux-...@vger.kernel.org Cc: nvd...@lists.linux.dev --- arch/arc/Kconfig| 1 + arch/arm/include/asm/cachetype.h| 3 +++ arch/arm/mm/Kconfig | 2 ++ arch/csky/Kconfig | 1 + arch/m68k/Kconfig | 1 + arch/mips/Kconfig | 1 + arch/mips/include/asm/cachetype.h | 9 + arch/nios2/Kconfig | 1 + arch/nios2/include/asm/cachetype.h | 10 ++ arch/parisc/Kconfig | 1 + arch/sh/Kconfig | 1 + arch/sparc/Kconfig | 1 + arch/sparc/include/asm/cachetype.h | 14 ++ arch/xtensa/Kconfig | 1 + 
arch/xtensa/include/asm/cachetype.h | 10 ++ include/linux/cacheinfo.h | 8 mm/Kconfig | 10 ++ 17 files changed, 75 insertions(+) create mode 100644 arch/mips/include/asm/cachetype.h create mode 100644 arch/nios2/include/asm/cachetype.h create mode 100644 arch/sparc/include/asm/cachetype.h create mode 100644 arch/xtensa/include/asm/cachetype.h diff --git a/arch/arc/Kconfig b/arch/arc/Kconfig index 1b0483c51cc1..969e6740bcf7 100644 --- a/arch/arc/Kconfig +++ b/arch/arc/Kconfig @@ -6,6 +6,7 @@ config ARC def_bool y select ARC_TIMERS + select ARCH_HAS_CACHE_ALIASING select ARCH_HAS_CACHE_LINE_SIZE select ARCH_HAS_DEBUG_VM_PGTABLE select ARCH_HAS_DMA_PREP_COHERENT diff --git a/arch/arm/include/asm/cachetype.h b/arch/arm/include/asm/cachetype.h index e8c30430be33..b03054b35c74 100644 --- a/arch/arm/include/asm/cachetype.h +++ b/arch/arm/include/asm/cachetype.h @@ -16,6 +16,9 @@ extern unsigned int cacheid; #define cache_is_vipt()cacheid_is(CACHEID_VIPT) #define cache_is_vipt_nonaliasing()cacheid_is(CACHEID_VIPT_NONALIASING) #define cache_is_vipt_aliasing() cacheid_is(CACHEID_VIPT_ALIASING) +#ifdef CONFIG_ARCH_HAS_CACHE_ALIASING_DYNAMIC +#define cache_is_aliasing()cache_is_vipt_aliasing() +#endif #define icache_is_vivt_asid_tagged() cacheid_is(CACHEID_ASID_TAGGED) #define icache_is_vipt_aliasing() cacheid_is(CACHEID_VIPT_I_ALIASING) #define icache_is_pipt() cacheid_is(CACHEID_PIPT) diff --git a/arch/arm/mm/Kconfig b/arch/arm/mm/Kconfig index c164cde50243..23af93cdc03d 100644 --- a/arch/arm/mm/Kconfig +++ b/arch/arm/mm/Kconfig @@ -539,9 +539,11 @@ config CPU_CACHE_NOP bool config CPU_CACHE_VIVT + select ARCH_HAS_CACHE_ALIASING bool config CPU_CACHE_VIPT + select ARCH_HAS_CACHE_ALIASING_DYNAMIC if CPU_V6 || CPU_V6K bool config CPU_CACHE_FA diff --git a/arch/csky/Kconfig b/arch/csky/Kconfig index cf2a6fd7dff8..439d7640deb8 100644 --- a/arch/csky/Kconfig +++ b/arch/csky/Kconfig @@ -2,6 +2,7 @@ config CSKY def_bool y select ARCH_32BIT_OFF_T + select ARCH_HAS_CACHE_ALIASING 
select ARCH_HAS_DMA_PREP_COHERENT select ARCH_HAS_GCOV_PROFILE_ALL select ARCH_HAS_SYNC_DMA_FOR_CPU diff --git a/arch/m68k/Kconfig b/arch/m68k/Kconfig index 4b3e93cac723..216338704f0a 100644 --- a/arch/m68k/Kconfig +++ b/arch/m68k/Kconfig @@ -3,6 +3,7 @@ config M68K bool default y select ARCH_32BIT_OFF_T + select ARCH_HAS_CACHE_ALIASING select ARCH_HAS_BINFMT_FLAT select ARCH_HAS_CP
[RFC PATCH 0/7] Introduce cache_is_aliasing() to fix DAX regression
This commit introduced in v5.13 prevents building FS_DAX on 32-bit ARM, even on ARMv7 which does not have virtually aliased dcaches: commit d92576f1167c ("dax: does not work correctly with virtual aliasing caches") It used to work fine before: I have customers using dax over pmem on ARMv7, but this regression will likely prevent them from upgrading their kernel. The root of the issue here is the fact that DAX was never designed to handle virtually aliased dcache (VIVT and VIPT with aliased dcache). It touches the pages through their linear mapping, which is not consistent with the userspace mappings on virtually aliased dcaches. This patch series introduces cache_is_aliasing() with new Kconfig options: * ARCH_HAS_CACHE_ALIASING * ARCH_HAS_CACHE_ALIASING_DYNAMIC and implements it for all architectures. The "DYNAMIC" implementation implements cache_is_aliasing() as a runtime check, which is what is needed on architectures like 32-bit ARMV6 and ARMV6K. With this we can basically narrow down the list of architectures which are unsupported by DAX to those which are really affected. 
Feedback is welcome, Thanks, Mathieu Cc: Andrew Morton Cc: Linus Torvalds Cc: linux...@kvack.org Cc: linux-a...@vger.kernel.org Cc: Dan Williams Cc: Vishal Verma Cc: Dave Jiang Cc: Matthew Wilcox Cc: linux-...@vger.kernel.org Cc: nvd...@lists.linux.dev Mathieu Desnoyers (7): Introduce cache_is_aliasing() across all architectures dax: Fix incorrect list of cache aliasing architectures erofs: Use dax_is_supported() ext2: Use dax_is_supported() ext4: Use dax_is_supported() fuse: Introduce fuse_dax_is_supported() xfs: Use dax_is_supported() arch/arc/Kconfig| 1 + arch/arm/include/asm/cachetype.h| 3 ++ arch/arm/mm/Kconfig | 2 ++ arch/csky/Kconfig | 1 + arch/m68k/Kconfig | 1 + arch/mips/Kconfig | 1 + arch/mips/include/asm/cachetype.h | 9 + arch/nios2/Kconfig | 1 + arch/nios2/include/asm/cachetype.h | 10 ++ arch/parisc/Kconfig | 1 + arch/sh/Kconfig | 1 + arch/sparc/Kconfig | 1 + arch/sparc/include/asm/cachetype.h | 14 arch/xtensa/Kconfig | 1 + arch/xtensa/include/asm/cachetype.h | 10 ++ fs/Kconfig | 2 +- fs/erofs/super.c| 10 +++--- fs/ext2/super.c | 14 fs/ext4/super.c | 52 ++--- fs/fuse/file.c | 2 +- fs/fuse/fuse_i.h| 36 +++- fs/fuse/inode.c | 47 +- fs/fuse/virtio_fs.c | 4 +-- fs/xfs/xfs_super.c | 20 +++ include/linux/cacheinfo.h | 8 + include/linux/dax.h | 9 + mm/Kconfig | 10 ++ 27 files changed, 198 insertions(+), 73 deletions(-) create mode 100644 arch/mips/include/asm/cachetype.h create mode 100644 arch/nios2/include/asm/cachetype.h create mode 100644 arch/sparc/include/asm/cachetype.h create mode 100644 arch/xtensa/include/asm/cachetype.h -- 2.39.2
Re: [RESEND] [PATCH] eventfs: Have inodes have unique inode numbers
On 2024-01-26 15:12, Steven Rostedt wrote:

[...]

diff --git a/fs/tracefs/inode.c b/fs/tracefs/inode.c
index e1b172c0e091..2187be6d7b23 100644
--- a/fs/tracefs/inode.c
+++ b/fs/tracefs/inode.c
@@ -223,13 +223,41 @@ static const struct inode_operations tracefs_file_inode_operations = {
 	.setattr	= tracefs_setattr,
 };

+/* Copied from get_next_ino() but adds allocation for multiple inodes */
+#define LAST_INO_BATCH 1024
+#define LAST_INO_MASK (~(LAST_INO_BATCH - 1))
+static DEFINE_PER_CPU(unsigned int, last_ino);
+
+unsigned int tracefs_get_next_ino(int files)
+{
+	unsigned int *p = &get_cpu_var(last_ino);
+	unsigned int res = *p;
+
+#ifdef CONFIG_SMP
+	/* Check if adding files+1 overflows */

How does it handle a @files input where:

* (files+1 > LAST_INO_BATCH) ?
* (files+1 == LAST_INO_BATCH) ?

+	if (unlikely(!res || (res & LAST_INO_MASK) != ((res + files + 1) & LAST_INO_MASK))) {
+		static atomic_t shared_last_ino;
+		int next = atomic_add_return(LAST_INO_BATCH, &shared_last_ino);
+
+		res = next - LAST_INO_BATCH;
+	}
+#endif
+
+	res++;
+	/* get_next_ino should not provide a 0 inode number */
+	if (unlikely(!res))
+		res++;

I suspect that bumping this res++ in the 0 case can cause inode range reservation issues at (files+1 == LAST_INO_BATCH-1).

Thanks,

Mathieu

+	*p = res + files;
+	put_cpu_var(last_ino);
+	return res;
+}

-- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com
[REGRESSION] v5.13: FS_DAX unavailable on 32-bit ARM
Hi,

This commit introduced in v5.13 prevents building FS_DAX on 32-bit ARM, even on ARMv7 which does not have virtually aliased dcaches: commit d92576f1167c ("dax: does not work correctly with virtual aliasing caches")

It used to work fine before: I have customers using dax over pmem on ARMv7, but this regression will likely prevent them from upgrading their kernel.

The root of the issue here is the fact that DAX was never designed to handle virtually aliased dcache (VIVT and VIPT with aliased dcache). It touches the pages through their linear mapping, which is not consistent with the userspace mappings on virtually aliased dcaches.

I can see a few ways forward to address this:

A) I have prepared a patch series introducing cache_is_aliasing() with new Kconfig options:

* ARCH_HAS_CACHE_ALIASING
* ARCH_HAS_CACHE_ALIASING_DYNAMIC

and implemented it for all architectures. The "DYNAMIC" implementation implements cache_is_aliasing() as a runtime check, which is what is needed on architectures like 32-bit ARM. With this we can basically narrow down the list of architectures which are unsupported by DAX to those which are really affected, without actually solving the issue for architectures with virtually aliased dcaches.

B) Another approach would be to dig into what exactly DAX is doing with the linear mapping, and try to fix this. I see two options there:

B.1) Either we extend vmap to allow vmap'd pages to be aligned on specific multiples, and use a coloring trick based on SHMLBA like userspace mappings do for all DAX internal pages accesses, or

B.2) We introduce flush_dcache_folio() at relevant spots (perhaps dax_flush()?) to synchronize the linear mapping wrt userspace mappings. (would this work?)

Any thoughts on how to best move forward with this issue are welcome.

Thanks,

Mathieu

-- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com
Re: [PATCH] ring-buffer: Simplify reservation with try_cmpxchg() loop
On 2024-01-20 08:47, Steven Rostedt wrote: On Fri, 19 Jan 2024 20:49:36 -0500 Mathieu Desnoyers wrote: Let's say we have the following ktime_get() values (monotonic timestamp value) for a sequence of events: Timestamp (Hex)Encoding in the trace Packet header timestamp begin 0x12345678 64-bit: 0x12345678 Event 1 0x12345678 16-bit: 0x5678 (When decoded, same value as previous timestamp, no overflow) Event 2 0x1234 16-bit: 0x (When decoded, going from "0x5678" to "0x" does not overflow 16-bit) Event 3 0x1235 16-bit: 0x (When decoded, going from "0x" to "0x" overflow 16-bit exactly once which allows the trace reader to reconstruct timestamp 0x1235 from the previous timestamp and the 16-bit timestamp encoding.) Event 4 0x1237 64-bit: 0x1237 (Encoding over 16-bit not possible because going from 0x1235 to 0x1237 would overflow 16-bit twice, which cannot be detected by a trace reader. Therefore use the full 64-bit timestamp in the "large" event header representation.) I think that's basically what I said, but you are just looking at it differently ;-) Or should I say, you are using bits for optimization. Based on your explanation below, we are really talking about different things here. Let me try to reply to your explanation to try to show where what I am doing completely differs from what you have in mind. This will help explain how I handle 16-bit overflow as well. The events are based off of the last injected timestamp. Incorrect. There is no "injected timestamp". There is only a concept of the "current timestamp" as we either write to or read from the event stream. I will take the point of view of the trace reader for the rest of the discussion. The above example, starts with an timestamp injection into the packet header: 0x12345678, with the lsb 16bits ignore. Wrong again. The 16 least significant bits are not ignored. The "current timestamp" is really 0x12345678 when the packet header is read. 
In the packet header you have 0x12345678, in the first event you have 0x5678; how does that get you the timestamp? If that event had 0xFFFF, when the reader reads this packet, it would take the header 0x12345678, chop off (ignore) the 5678, and add the FFFF, right? We need to consider not only what happens when the 16 low bits increase, but also what happens when they end up with a value smaller than the previous 16 low bits. As a summary from our video meeting discussion, there are 3 cases we care about here: packet header timestamp 0x12345678 followed by either:

A) first event delta from packet header timestamp is 0: 16-bit value 0x5678

B) first event delta from packet header timestamp is <= 0xFFFF:
   B.1) 16-bit value example 0x5699 (no 16-bit overflow from previous value)
   B.2) 16-bit value lower than 0x5678 (exactly one 16-bit overflow from previous value)

C) first event delta from packet header timestamp is larger than 0xFFFF, which would cause the low-order 16 bits to have more than one 16-bit overflow from the previous value. The tracer detects this and uses a full 64-bit timestamp instead of the compact 16 bits.

[...] But how do you detect the overflow? That last timestamp, to know if the tsc overflowed or not, needs to be saved and compared. I would assume you have a similar race that I have. Yes, I save a "last timestamp" per buffer, but the race does not matter because of the order in which it is saved wrt the local cmpxchg updating the reserved position. The algorithm looks like:

do {
    - read current reserved position (old pos)
    - read time
    - compute new reserved position (new pos)
} while (cmpxchg(reserved pos, old pos, new pos) != old pos);
[A]
save_last_tsc()

So the last_tsc that is saved is from the timestamp read before the cmpxchg. Yes. If interrupted at [A] by another trace producer, it will compare with an older "last tsc" than the tsc of the event physically located just before the nested event.
This stale "last tsc" has a value which is necessarily lower than the one we would be saving with the save_last_tsc immediately afterwards, which means that in the worst case we end up using a full 64-bit timestamp when in fact we could use a more compact representation. But this race is rare and therefore it does not matter for size. That's equivalent to me "injecting" an absolute value for the same race. Yes. The fact that I only need this last_tsc value for the sake of optimization, and not for computation of a time delta from a previous injected timestamp, makes it possible to handle the race gracefully.
Re: [PATCH] ring-buffer: Simplify reservation with try_cmpxchg() loop
On 2024-01-19 16:42, Steven Rostedt wrote: On Fri, 19 Jan 2024 15:56:21 -0500 Mathieu Desnoyers wrote: So when an overflow happens, you just insert a timestamp, and then events after that are based on that? No. Let's use an example to show how it works. For reference, LTTng uses 5 bits for the event ID and 27 bits for timestamps in the compact event header representation. But for the sake of making this discussion easier, let's assume a tracer would use 16-bit timestamps in the compact representation. Let's say we have the following ktime_get() values (monotonic timestamp value) for a sequence of events:

                                Timestamp (Hex)   Encoding in the trace
Packet header timestamp begin   0x12345678        64-bit: 0x12345678
Event 1                         0x12345678        16-bit: 0x5678
   (When decoded, same value as previous timestamp, no overflow)
Event 2                         0x1234FFFF        16-bit: 0xFFFF
   (When decoded, going from "0x5678" to "0xFFFF" does not overflow 16-bit)
Event 3                         0x12350000        16-bit: 0x0000
   (When decoded, going from "0xFFFF" to "0x0000" overflows 16-bit exactly
   once, which allows the trace reader to reconstruct timestamp 0x12350000
   from the previous timestamp and the 16-bit timestamp encoding.)
Event 4                         0x12370000        64-bit: 0x12370000
   (Encoding over 16-bit not possible because going from 0x12350000 to
   0x12370000 would overflow 16-bit twice, which cannot be detected by a
   trace reader. Therefore use the full 64-bit timestamp in the "large"
   event header representation.)

I think that's basically what I said, but you are just looking at it differently ;-) Or should I say, you are using bits for optimization. Based on your explanation below, we are really talking about different things here. Let me try to reply to your explanation to show where what I am doing completely differs from what you have in mind. This will help explain how I handle 16-bit overflow as well. The events are based off of the last injected timestamp. Incorrect. There is no "injected timestamp". There is only a concept of the "current timestamp" as we either write to or read from the event stream.
I will take the point of view of the trace reader for the rest of the discussion. The above example starts with a timestamp injection into the packet header: 0x12345678, with the lsb 16 bits ignored. Wrong again. The 16 least significant bits are not ignored. The "current timestamp" is really 0x12345678 when the packet header is read. So in actuality, it inserts 0x1234, plus adds a 5678 that represents the creation of the header (like we discussed in the last tracing meeting). There is no "0x1234" reference time. There is only the actual 0x12345678 current time at packet begin, including those 16 low-order bits. The first event has: 0x5678 which is based on the previous injected timestamp of 0x1234. It is not "based" on the previous injected timestamp. It just represents the low-order 16 bits of the timestamp at event 1. The trace reader takes two pieces of information to compute the complete current timestamp for event 1: 1) The current timestamp just before event 1 is read (0x12345678), 2) The low-order 16 bits of event 1 (0x5678). Given that there is no 16-bit overflow when comparing: 0x12345678 & 0xFFFF and 0x5678, we know that the resulting current timestamp for event 1 is: (0x12345678 & ~0xFFFF) + 0x5678 = 0x12345678. Or basically that time did not move between the packet header and event 1. the events go on just using a delta from that until you see it overflow (or too big of a delta to fit in the 16 bits), and you insert a new full timestamp that everything will be based on: 0x1237 No. The following events use the same algorithm I just described: They use this notion of "current timestamp" and the information provided by the new timestamp field in the event header to figure out the updated current timestamp. It is _never_ based on some kind of arbitrary reference, it is always absolute, and is always computed based on the current timestamp and the timestamp field encountered. Now events following that are just a delta from that timestamp.
But you calculate the delta simply by masking out the lower bits. No, again, there is no delta from an arbitrary injected time. It's always computed from this virtual "current time", which applies to an event stream. But how do you detect the overflow? That last timestamp, to know if the tsc overflowed or not, needs to be saved and compared. I would assume you have a similar race that I have. Yes, I save a "last timestamp" per buffer, but the race does not matter because of the order in which it is saved wrt the local cmpxchg updating the reserved position.
Re: [PATCH] ring-buffer: Simplify reservation with try_cmpxchg() loop
On 2024-01-19 10:37, Steven Rostedt wrote: On Fri, 19 Jan 2024 09:40:27 -0500 Mathieu Desnoyers wrote: On 2024-01-18 18:12, Steven Rostedt wrote: From: "Steven Rostedt (Google)" [...] Although, it does not get rid of the double time stamps (before_stamp and write_stamp), using cmpxchg() does get rid of the more complex case when an interrupting event occurs between getting the timestamps and reserving the data, as when that happens, it just tries again instead of dealing with it. I understand that the reason why you need the before/after stamps and their associated complexity is because the Ftrace ring buffer ABI encodes event timestamps as deltas from the previous event within the buffer as a means of compressing the timestamp fields. If the delta cannot be represented in a given number of bits, then it inserts a 64-bit timestamp (not sure if that one is absolute or a delta from the previous event). There's both. An extended timestamp, which is added when the delta is too big, and that too is just a delta from the previous event. And there is the absolute timestamp as well. I could always just use the absolute one. That event came much later. OK This timestamp encoding as deltas between events introduces a strong inter-dependency between consecutive (nested) events, and is the reason why you are stuck with all this timestamp before/after complexity. The Common Trace Format specifies (and LTTng implements) a different way to achieve the same ring buffer space savings as timestamp deltas while keeping the timestamps semantically absolute from a given reference, hence without all the before/after timestamp complexity. You can see the clock value decoding procedure in the CTF2 SPEC RC9 [1] document. That points to this: -8<- 6.3.
Clock value update procedure

To update DEF_CLK_VAL from an unsigned integer field F having the unsigned integer value V and the class C:

Let L be an unsigned integer initialized to, depending on the type property of C:
  "fixed-length-unsigned-integer": the value of the length property of C.
  "variable-length-unsigned-integer": S × 7, where S is the number of bytes which F occupies within the data stream.
Let MASK be an unsigned integer initialized to 2^L − 1.
Let H be an unsigned integer initialized to DEF_CLK_VAL & ~MASK, where "&" is the bitwise AND operator and "~" is the bitwise NOT operator.
Let CUR be an unsigned integer initialized to DEF_CLK_VAL & MASK, where "&" is the bitwise AND operator.
Set DEF_CLK_VAL to:
  H + V, if V ≥ CUR;
  H + MASK + 1 + V, otherwise.
->8-

There's a lot of missing context there, so I don't see how it relates. This explains how the "current time" is reconstructed by a trace reader when loading an event header timestamp field. But for the sake of this discussion we can focus on the less formal explanation of how the tracer generates this timestamp encoding provided below. The basic idea on the producer side is to record the low-order bits of the current timestamp in the event header (truncating the higher-order bits), and fall back on a full 64-bit value if the number of low-order-bit overflows from the previous timestamp is more than one, or if it is impossible to figure out precisely the timestamp of the previous event due to a race. This achieves the same space savings as delta timestamp encoding without introducing the strong event inter-dependency. So when an overflow happens, you just insert a timestamp, and then events after that are based on that? No. Let's use an example to show how it works. For reference, LTTng uses 5 bits for the event ID and 27 bits for timestamps in the compact event header representation. But for the sake of making this discussion easier, let's assume a tracer would use 16-bit timestamps in the compact representation.
Let's say we have the following ktime_get() values (monotonic timestamp value) for a sequence of events:

                                Timestamp (Hex)   Encoding in the trace
Packet header timestamp begin   0x12345678        64-bit: 0x12345678
Event 1                         0x12345678        16-bit: 0x5678
   (When decoded, same value as previous timestamp, no overflow)
Event 2                         0x1234FFFF        16-bit: 0xFFFF
   (When decoded, going from "0x5678" to "0xFFFF" does not overflow 16-bit)
Event 3                         0x12350000        16-bit: 0x0000
   (When decoded, going from "0xFFFF" to "0x0000" overflows 16-bit exactly
   once, which allows the trace reader to reconstruct timestamp 0x12350000
   from the previous timestamp and the 16-bit timestamp encoding.)
Event 4                         0x12370000        64-bit: 0x12370000
   (Encoding over 16-bit not possible because going from 0x12350000 to
   0x12370000 would overflow 16-bit twice, which cannot be detected by a
   trace reader. Therefore use the full 64-bit timestamp in the "large"
   event header representation.)
Re: [PATCH] ring-buffer: Simplify reservation with try_cmpxchg() loop
On 2024-01-18 18:12, Steven Rostedt wrote: From: "Steven Rostedt (Google)" Instead of using local_add_return() to reserve the ring buffer data, Mathieu Desnoyers suggested using local_cmpxchg(). This would simplify the reservation with the time keeping code. I admire the effort of trying to simplify the Ftrace ring buffer by bringing over ideas that worked well for LTTng. :-) As reviewer of the tracing subsystem, I certainly welcome the simplifications. Although, it does not get rid of the double time stamps (before_stamp and write_stamp), using cmpxchg() does get rid of the more complex case when an interrupting event occurs between getting the timestamps and reserving the data, as when that happens, it just tries again instead of dealing with it. I understand that the reason why you need the before/after stamps and their associated complexity is because the Ftrace ring buffer ABI encodes event timestamps as deltas from the previous event within the buffer as a means of compressing the timestamp fields. If the delta cannot be represented in a given number of bits, then it inserts a 64-bit timestamp (not sure if that one is absolute or a delta from the previous event). This timestamp encoding as deltas between events introduces a strong inter-dependency between consecutive (nested) events, and is the reason why you are stuck with all this timestamp before/after complexity. The Common Trace Format specifies (and LTTng implements) a different way to achieve the same ring buffer space savings as timestamp deltas while keeping the timestamps semantically absolute from a given reference, hence without all the before/after timestamp complexity. You can see the clock value decoding procedure in the CTF2 SPEC RC9 [1] document.
The basic idea on the producer side is to record the low-order bits of the current timestamp in the event header (truncating the higher-order bits), and fall back on a full 64-bit value if the number of low-order-bit overflows from the previous timestamp is more than one, or if it is impossible to figure out precisely the timestamp of the previous event due to a race. This achieves the same space savings as delta timestamp encoding without introducing the strong event inter-dependency. The fact that Ftrace exposes this ring buffer binary layout as a user-space ABI makes it tricky to move to the Common Trace Format timestamp encoding. There are clearly huge simplifications that could be made by moving to this scheme though. Is there any way to introduce a different timestamp encoding scheme as an extension to the Ftrace ring buffer ABI ? This would allow us to introduce this simpler scheme and gradually phase out the more complex delta encoding when no users are left. Thoughts ? Thanks, Mathieu [1] https://diamon.org/ctf/files/CTF2-SPECRC-9.0.html#clk-val-update -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com
Re: [PATCH v11 2/5] ring-buffer: Introducing ring-buffer mapping functions
On 2024-01-11 11:17, Vincent Donnefort wrote: In preparation for allowing the user-space to map a ring-buffer, add a set of mapping functions: ring_buffer_{map,unmap}() ring_buffer_map_fault() And controls on the ring-buffer: ring_buffer_map_get_reader() /* swap reader and head */ Mapping the ring-buffer also involves: A unique ID for each subbuf of the ring-buffer, currently they are only identified through their in-kernel VA. A meta-page, where are stored ring-buffer statistics and a description for the current reader Hi Vincent, The LTTng kernel tracer has supported mmap'd buffers for nearly 15 years [1], and has a lot of similarities with this patch series. LTTng has the notion of "subbuffer id" to allow atomically exchanging a "reader" extra subbuffer with the subbuffer to be read. It implements "get subbuffer" / "put subbuffer" ioctls to allow the consumer (reader) to move the currently read subbuffer position. [2] It would not hurt to compare your approach to LTTng and highlight similarities/differences, and the rationale for the differences. Especially when it comes to designing kernel ABIs, it's good to make sure that all bases are covered, because those choices will have lasting impacts. Thanks, Mathieu [1] https://git.lttng.org/?p=lttng-modules.git;a=blob;f=src/lib/ringbuffer/ring_buffer_mmap.c [2] https://git.lttng.org/?p=lttng-modules.git;a=blob;f=src/lib/ringbuffer/ring_buffer_vfs.c -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com
Re: [PATCH] tracing: Add disable-filter-buf option
On 2023-12-15 13:43, Steven Rostedt wrote: On Fri, 15 Dec 2023 13:25:07 -0500 Mathieu Desnoyers wrote: I am not against exposing an ABI that allows userspace to alter the filter behavior. I disagree on the way you plan to expose the ABI. These are no different than the knobs for sched_debug These are very much different. The sched features are enabled at build-time by modifying kernel/sched/features.h. There is no userspace ABI involved. Exposing this option as an ABI in this way exposes too much internal ring buffer implementation details to userspace. There's already lots of exposure using options. The options are just knobs, nothing more. It exposes the following details which IMHO should be hidden or configurable in a way that allows moving to a whole new mechanism which will have significantly different characteristics in the future: It exposes that: - filtering uses a copy to a temporary buffer, and - that this copy is enabled by default. Once exposed, those design constraints become immutable due to ABI. No it is not. There is no such thing as "immutable ABI". The rule is "don't break user space" If this functionality in the kernel goes away, the knob could become a nop, and I doubt any user space will break because of it. That is, the only burden is keeping this option exposed. But it could be just like that light switch that has nothing connected to it. It's still there, but does nothing if you switch it. This knob can act the same way. This does not in anyway prevent future innovation. I am not comfortable with exposing internal ring buffer implementation details to userspace which may or may not be deprecated as no-ops in the future. This will lead to useless clutter. One approach that might be somewhat better would be to only expose those files when a new CONFIG_TRACING_DEBUG option is enabled. 
This way userspace cannot rely on having those files present as part of the ABI, but you would still be free to use them for selftests by skipping the specific selftests if the config option is not enabled. Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com
Re: [PATCH] tracing: Add disable-filter-buf option
On 2023-12-15 12:34, Steven Rostedt wrote: On Fri, 15 Dec 2023 12:24:14 -0500 Mathieu Desnoyers wrote: On 2023-12-15 12:04, Steven Rostedt wrote: On Fri, 15 Dec 2023 10:53:39 -0500 Mathieu Desnoyers wrote: [...] So rather than stacking tons of "on/off" switches for filter features, how about you let users express the mechanism they want to use for filtering with a string instead ? e.g. filter-method="temp-buffer" filter-method="ring-buffer" filter-method="input-arguments" If I add other ways to filter, it will be a separate file to control that, but all options are on/off switches. Even if I add other functionality to the way buffers are created, this will still have the same functionality to turn the entire thing on or off. I'll be clearer then: I think this is a bad ABI. In your reply, you justify this bad ABI by implementation motivations. What's wrong with a way to stop the copying ? I am not against exposing an ABI that allows userspace to alter the filter behavior. I disagree on the way you plan to expose the ABI. Exposing this option as an ABI in this way exposes too much internal ring buffer implementation details to userspace. It exposes the following details which IMHO should be hidden or configurable in a way that allows moving to a whole new mechanism which will have significantly different characteristics in the future: It exposes that: - filtering uses a copy to a temporary buffer, and - that this copy is enabled by default. Once exposed, those design constraints become immutable due to ABI. The option is just a way to say "I don't want to do the copy into the buffer, I want to go directly into it" My main concern is how this concept, once exposed to userspace, becomes not only an internal implementation detail, but a fundamental part of the design which cannot ever go away without breaking the ABI or making parts of the ABI irrelevant. 
I can make a parallel with the scheduler: this is as if the sched feature knobs (which are there only for development/debugging purposes) were all exposed as userspace ABI. This would seriously limit the evolution of the scheduler design in the future. I see this as the same problem at the ring buffer level. I don't care about the implementation, I care about the ABI, and I feel that your reply is not addressing my concern at all. Maybe I don't understand your concern. It's an on/off switch (like all options are). This is just a way to say "I want to indirect copying of the event before filtering or not". Not all tracefs options are booleans. The "current_tracer" file ABI exposes a string input/output parameter. I would recommend the same for the equivalent of a "current_filter" file. The "input-argument" part above may never happen. What's different between tracefs and LTTng is that all events are created by the subsystem, not by me. You don't use the TRACE_EVENT() macro, but you need to manually create each event you care about yourself. It's more of a burden but you also then have the luxury of hooking to the input portion. That is not exposed, and that is by design. As that could possibly make all tracepoints an ABI, and you'll have people like peterz nacking any new tracepoint in the scheduler. He doesn't allow trace events anymore because of that exposure. I'm not arguing for moving to the input-argument scheme, I just used this as a hypothetical example to show why we should not expose internal implementation details to userspace which will prevent future evolution only for the sake of having debugging knobs. Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com
Re: [PATCH] tracing: Add disable-filter-buf option
On 2023-12-15 12:04, Steven Rostedt wrote: On Fri, 15 Dec 2023 10:53:39 -0500 Mathieu Desnoyers wrote: [...] So rather than stacking tons of "on/off" switches for filter features, how about you let users express the mechanism they want to use for filtering with a string instead ? e.g. filter-method="temp-buffer" filter-method="ring-buffer" filter-method="input-arguments" If I add other ways to filter, it will be a separate file to control that, but all options are on/off switches. Even if I add other functionality to the way buffers are created, this will still have the same functionality to turn the entire thing on or off. I'll be clearer then: I think this is a bad ABI. In your reply, you justify this bad ABI by implementation motivations. I don't care about the implementation, I care about the ABI, and I feel that your reply is not addressing my concern at all. Moreover, double-negative boolean options (disable-X=false) are confusing. Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com
Re: [PATCH] tracing: Add disable-filter-buf option
On 2023-12-15 10:26, Steven Rostedt wrote: From: "Steven Rostedt (Google)" Normally, when the filter is enabled, a temporary buffer is created to copy the event data into it to perform the filtering logic. If the filter passes and the event should be recorded, then the event is copied from the temporary buffer into the ring buffer. If the event is to be discarded then it is simply dropped. If another event comes in via an interrupt, it will not use the temporary buffer as it is busy and will write directly into the ring buffer. The disable-filter-buf option will disable the temporary buffer and always write into the ring buffer. This will avoid the copy when the event is to be recorded, but also adds a bit more overhead on the discard, and if another event were to interrupt the event that is to be discarded, then the event will not be removed from the ring buffer but instead converted to padding that will not be read by the reader. Padding will still take up space on the ring buffer. This option can be beneficial if most events are recorded and not discarded, or simply for debugging the discard functionality of the ring buffer. Also fix some whitespace (that was fixed by editing this in vscode). I'm not convinced that a boolean state is what you need here. Yes, today you are in a position where you have two options: a) use the filter buffer, which falls back on copy to ring buffer if nested, b) disable the filter buffer, and thus always copy to ring buffer before filtering, But I think it would not be far-fetched to improve the implementation of the filter-buffer to have one filter-buffer per nesting level (per execution context), or just implement the filter buffer as a per-cpu stack, which would remove the need to fall back on copy to ring buffer when nested. Or you could even do like LTTng and filter on the input arguments directly. 
But each of these changes would add yet another boolean tunable, which can quickly make the mix hard to understand from a user perspective. So rather than stacking tons of "on/off" switches for filter features, how about you let users express the mechanism they want to use for filtering with a string instead ? e.g. filter-method="temp-buffer" filter-method="ring-buffer" filter-method="input-arguments" ? Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com
Re: [PATCH] ring-buffer: Remove 32bit timestamp logic
On 2023-12-13 21:11, Steven Rostedt wrote: From: "Steven Rostedt (Google)" Each event has a 27 bit timestamp delta that is used to hold the delta from the last event. If the time between events is greater than 2^27, then a timestamp is added that holds a 59 bit absolute timestamp. Until a389d86f7fd09 ("ring-buffer: Have nested events still record running time stamp"), if an interrupt interrupted an event in progress, all the events delta would be zero to not deal with the races that need to be handled. The commit a389d86f7fd09 changed that to handle the races giving all events, even those that preempt other events, still have an accurate timestamp. To handle those races requires performing 64-bit cmpxchg on the timestamps. But doing 64-bit cmpxchg on 32-bit architectures is considered very slow. To try to deal with this the timestamp logic was broken into two and then three 32-bit cmpxchgs, with the thought that two (or three) 32-bit cmpxchgs are still faster than a single 64-bit cmpxchg on 32-bit architectures. Part of the problem with this is that I didn't have any 32-bit architectures to test on. After hitting several subtle bugs in this code, an effort was made to try and see if three 32-bit cmpxchgs are indeed faster than a single 64-bit. After a few people brushed off the dust of their old 32-bit machines, tests were done, and even though 32-bit cmpxchg was faster than a single 64-bit, it was in the order of 50% at best, not 300%. I literally had to dust off my old Eee PC for this :) This means that this complex code is not only complex but also not even faster than just using 64-bit cmpxchg. Nuke it! 
Acked-by: Mathieu Desnoyers

@@ -3720,6 +3517,7 @@ rb_reserve_next_event(struct trace_buffer *buffer,
 	struct rb_event_info info;
 	int nr_loops = 0;
 	int add_ts_default;
+	static int once;

(as you spotted, should be removed)

Thanks, Mathieu

 	/* ring buffer does cmpxchg, make sure it is safe in NMI context */
 	if (!IS_ENABLED(CONFIG_ARCH_HAVE_NMI_SAFE_CMPXCHG) &&

-- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com
Re: [PATCH] ring-buffer: Fix a race in rb_time_cmpxchg() for 32 bit archs
On 2023-12-12 11:53, Steven Rostedt wrote: From: "Steven Rostedt (Google)" Mathieu Desnoyers pointed out an issue in the rb_time_cmpxchg() for 32 bit architectures. That is:

static bool rb_time_cmpxchg(rb_time_t *t, u64 expect, u64 set)
{
	unsigned long cnt, top, bottom, msb;
	unsigned long cnt2, top2, bottom2, msb2;
	u64 val;

	/* The cmpxchg always fails if it interrupted an update */
	if (!__rb_time_read(t, &val, &cnt2))
		return false;

	if (val != expect)
		return false;

<<<< interrupted here!

	cnt = local_read(&t->cnt);

The problem is that the synchronization counter in the rb_time_t is read *after* the value of the timestamp is read. That means if an interrupt were to come in between the value being read and the counter being read, it can change the value and the counter and the interrupted process would be clueless about it! The counter needs to be read first and then the value. That way it is easy to tell if the value is stale or not. If the counter hasn't been updated, then the value is still good.
Link: https://lore.kernel.org/linux-trace-kernel/20231211201324.652870-1-mathieu.desnoy...@efficios.com/
Cc: sta...@vger.kernel.org
Fixes: 10464b4aa605e ("ring-buffer: Add rb_time_t 64 bit operations for speeding up 32 bit")
Reported-by: Mathieu Desnoyers
Signed-off-by: Steven Rostedt (Google)
Reviewed-by: Mathieu Desnoyers
---
 kernel/trace/ring_buffer.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 1d9caee7f542..e110cde685ea 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -706,6 +706,9 @@ static bool rb_time_cmpxchg(rb_time_t *t, u64 expect, u64 set)
 	unsigned long cnt2, top2, bottom2, msb2;
 	u64 val;
 
+	/* Any interruptions in this function should cause a failure */
+	cnt = local_read(&t->cnt);
+
 	/* The cmpxchg always fails if it interrupted an update */
 	if (!__rb_time_read(t, &val, &cnt2))
 		return false;
@@ -713,7 +716,6 @@ static bool rb_time_cmpxchg(rb_time_t *t, u64 expect, u64 set)
 	if (val != expect)
 		return false;
 
-	cnt = local_read(&t->cnt);
 	if ((cnt & 3) != cnt2)
 		return false;

-- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com
[PATCH] ring-buffer: Fix 32-bit rb_time_read() race with rb_time_cmpxchg()
The following race can cause rb_time_read() to observe a corrupted time stamp:

rb_time_cmpxchg()
[...]
	if (!rb_time_read_cmpxchg(&t->msb, msb, msb2))
		return false;
	if (!rb_time_read_cmpxchg(&t->top, top, top2))
		return false;

				__rb_time_read()
				[...]
				do {
					c = local_read(&t->cnt);
					top = local_read(&t->top);
					bottom = local_read(&t->bottom);
					msb = local_read(&t->msb);
				} while (c != local_read(&t->cnt));

				*cnt = rb_time_cnt(top);

				/* If top and msb counts don't match, this interrupted a write */
				if (*cnt != rb_time_cnt(msb))
					return false;

				^ this check fails to catch that "bottom" is still not updated.

So the old "bottom" value is returned, which is wrong. Fix this by checking that all three of msb, top, and bottom 2-bit cnt values match. The reason to favor checking all three fields over requiring a specific update order for both rb_time_set() and rb_time_cmpxchg() is because checking all three fields is more robust to handle partial failures of rb_time_cmpxchg() when interrupted by nested rb_time_set().

Link: https://lore.kernel.org/lkml/20231211201324.652870-1-mathieu.desnoy...@efficios.com/
Fixes: f458a1453424e ("ring-buffer: Test last update in 32bit version of __rb_time_read()")
Signed-off-by: Mathieu Desnoyers
Cc: Steven Rostedt
Cc: Masami Hiramatsu
Cc: linux-trace-ker...@vger.kernel.org
---
 kernel/trace/ring_buffer.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 8d2a4f00eca9..71c225ca2a2b 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -644,8 +644,8 @@ static inline bool __rb_time_read(rb_time_t *t, u64 *ret, unsigned long *cnt)
 	*cnt = rb_time_cnt(top);
 
-	/* If top and msb counts don't match, this interrupted a write */
-	if (*cnt != rb_time_cnt(msb))
+	/* If top, msb or bottom counts don't match, this interrupted a write */
+	if (*cnt != rb_time_cnt(msb) || *cnt != rb_time_cnt(bottom))
 		return false;
 
 	/* The shift to msb will lose its cnt bits */
-- 
2.39.2
Re: [RFC PATCH] ring-buffer: Fix and comment ring buffer rb_time functions on 32-bit
On 2023-12-11 23:38, Steven Rostedt wrote: On Mon, 11 Dec 2023 22:51:04 -0500 Mathieu Desnoyers wrote: [...] For this first issue, here is the race:

rb_time_cmpxchg()
[...]
	if (!rb_time_read_cmpxchg(&t->msb, msb, msb2))
		return false;
	if (!rb_time_read_cmpxchg(&t->top, top, top2))
		return false;

				__rb_time_read()
				[...]
				do {
					c = local_read(&t->cnt);
					top = local_read(&t->top);
					bottom = local_read(&t->bottom);
					msb = local_read(&t->msb);
				} while (c != local_read(&t->cnt));

				*cnt = rb_time_cnt(top);

				/* If top and msb counts don't match, this interrupted a write */
				if (*cnt != rb_time_cnt(msb))
					return false;

				^ this check fails to catch that "bottom" is still not updated.

So the old "bottom" value is returned, which is wrong. Ah, OK that makes more sense. Yeah, if I had the three words from the beginning, I would have tested to make sure they all match and not just the two :-p Technically just checking that the very first and last words which are updated by set/cmpxchg have the same cnt bits would suffice. Because this is just a scenario of __rb_time_read interrupting an update, the updates in between are fine if the first/last words to be updated have the same cnt. As this would fix a commit that tried to fix this before! f458a1453424e ("ring-buffer: Test last update in 32bit version of __rb_time_read()") FYI, that would be the "Fixes" for this patch. OK [...] - A cmpxchg interrupted by 4 writes or cmpxchg overflows the counter and produces corrupted time stamps. This is _not_ fixed by this patch. Except that it's not 4 bits that is compared, but 32 bits.

struct rb_time_struct {
	local_t cnt;
	local_t top;
	local_t bottom;
	local_t msb;
};

The full local_t (32 bits) is used for synchronization. But the other elements do get extra bits and there still might be some issues, but not as severe as you stated here. Let's bring up the race scenario I spotted:

rb_time_cmpxchg()
[...]
/* The cmpxchg always fails if it interrupted an update */ if (!__rb_time_read(t, &val, &cnt2)) return false; if (val != expect) return false; cnt = local_read(&t->cnt); if ((cnt & 3) != cnt2) return false; ^ here (cnt & 3) == cnt2, but @val contains outdated data. This means the piecewise rb_time_read_cmpxchg() calls that follow will derive expected values from the outdated @val. Ah. Of course this would be fixed if we did the local_read(&t->cnt) *before* everything else. But then we could be interrupted after that initial read, before reading val. I suspect we'd want to propagate the full 32-bit cnt that was read by __rb_time_read() to the caller, which is not the case today. With that, we would not have to read it again in rb_time_cmpxchg. It does leave the issue of having only 2 bits in the msb, top, bottom fields to detect races, which are subject to overflow. cnt2 = cnt + 1; rb_time_split(val, &top, &bottom, &msb); top = rb_time_val_cnt(top, cnt); bottom = rb_time_val_cnt(bottom, cnt); ^ top, bottom, and msb contain outdated data, which do not match cnt due to 2-bit overflow. rb_time_split(set, &top2, &bottom2, &msb2); top2 = rb_time_val_cnt(top2, cnt2); bottom2 = rb_time_val_cnt(bottom2, cnt2); if (!rb_time_read_cmpxchg(&t->cnt, cnt, cnt2)) return false; ^ This @cnt cmpxchg succeeds because the re-read cnt is used as the expected value. Sure. And I believe you did find another bug. If we read the cnt first, before reading val, then it would not be outdated. As stated above, I suspect we'd run into other issues if interrupted between read of cnt and reading val. Propagating the full 32-bit cnt value read from __rb_time_read() to the caller would be better I think. 
if (!rb_time_read_cmpxchg(&t->msb, msb, msb2)) return false; if (!rb_time_read_cmpxchg(&t->top, top, top2)) return false; if (!rb_time_read_cmpxchg(&t->bottom, bottom, bottom2)) return false; ^ these cmpxchg have just used the outdated @val as expected values, even though the content of the rb_time was modified by 4 consecutive rb_time_set() or rb_time_cmpxchg(). This means those cmpxchg can fail not only due to being interrupted by another write or cmpxchg, but also simply due to expected value mismatch in any of the fields, which will then cause following __rb_time_read() to fail until a rb_time_set() is done.
Re: [PATCH v2] tracing: Allow for max buffer data size trace_marker writes
On 2023-12-12 09:00, Steven Rostedt wrote: [...] --- a/kernel/trace/trace.c +++ b/kernel/trace/trace.c @@ -7272,6 +7272,7 @@ tracing_mark_write(struct file *filp, const char __user *ubuf, enum event_trigger_type tt = ETT_NONE; struct trace_buffer *buffer; struct print_entry *entry; + int meta_size; ssize_t written; int size; int len; @@ -7286,12 +7287,9 @@ tracing_mark_write(struct file *filp, const char __user *ubuf, if (!(tr->trace_flags & TRACE_ITER_MARKERS)) return -EINVAL; - if (cnt > TRACE_BUF_SIZE) - cnt = TRACE_BUF_SIZE; You're removing an early bound check for a size_t userspace input... - - BUILD_BUG_ON(TRACE_BUF_SIZE >= PAGE_SIZE); - - size = sizeof(*entry) + cnt + 2; /* add '\0' and possible '\n' */ + meta_size = sizeof(*entry) + 2; /* add '\0' and possible '\n' */ + again: + size = cnt + meta_size; ... and then implicitly casting it into an "int" size variable, which can therefore become a negative value. Just for the sake of not having to rely on ring_buffer_lock_reserve catching (length > BUF_MAX_DATA_SIZE), I would recommend adding an early check for negative here. /* If less than "<faulted>", then make sure we can still add that */ if (cnt < FAULTED_SIZE) @@ -7300,9 +7298,25 @@ tracing_mark_write(struct file *filp, const char __user *ubuf, buffer = tr->array_buffer.buffer; event = __trace_buffer_lock_reserve(buffer, TRACE_PRINT, size, tracing_gen_ctx()); - if (unlikely(!event)) + if (unlikely(!event)) { + /* +* If the size was greated than what was allowed, then greater ? Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com
Re: [PATCH] tracing: Add size check when printing trace_marker output
On 2023-12-12 08:44, Steven Rostedt wrote: From: "Steven Rostedt (Google)" If for some reason the trace_marker write does not have a nul byte for the string, it will overflow the print: Does this result in leaking kernel memory to userspace ? If so, it should state "Fixes..." and CC stable. Thanks, Mathieu trace_seq_printf(s, ": %s", field->buf); The field->buf could be missing the nul byte. To prevent overflow, add the max size that the buf can be by using the event size and the field location. int max = iter->ent_size - offsetof(struct print_entry, buf); trace_seq_printf(s, ": %*s", max, field->buf); Signed-off-by: Steven Rostedt (Google) --- kernel/trace/trace_output.c | 6 -- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/kernel/trace/trace_output.c b/kernel/trace/trace_output.c index d8b302d01083..e11fb8996286 100644 --- a/kernel/trace/trace_output.c +++ b/kernel/trace/trace_output.c @@ -1587,11 +1587,12 @@ static enum print_line_t trace_print_print(struct trace_iterator *iter, { struct print_entry *field; struct trace_seq *s = &iter->seq; + int max = iter->ent_size - offsetof(struct print_entry, buf); trace_assign_type(field, iter->ent); seq_print_ip_sym(s, field->ip, flags); - trace_seq_printf(s, ": %s", field->buf); + trace_seq_printf(s, ": %*s", max, field->buf); return trace_handle_return(s); } @@ -1600,10 +1601,11 @@ static enum print_line_t trace_print_raw(struct trace_iterator *iter, int flags, struct trace_event *event) { struct print_entry *field; + int max = iter->ent_size - offsetof(struct print_entry, buf); trace_assign_type(field, iter->ent); - trace_seq_printf(&iter->seq, "# %lx %s", field->ip, field->buf); + trace_seq_printf(&iter->seq, "# %lx %*s", field->ip, max, field->buf); return trace_handle_return(&iter->seq); } -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com
Re: [RFC PATCH] ring-buffer: Fix and comment ring buffer rb_time functions on 32-bit
On 2023-12-11 17:59, Steven Rostedt wrote: On Mon, 11 Dec 2023 15:13:24 -0500 Mathieu Desnoyers wrote: Going through a review of the ring buffer rb_time functions for 32-bit architectures, I updated the comments to match the code, and identified the following issues: Thanks Mathieu! - rb_time_cmpxchg() needs to update the msb last, so it matches the validation of top and msb by __rb_time_read(). This is fixed by this patch. Hmm, does it? This is not parallel programming, it's only protecting against interrupts. Understood, this is indeed the model I had in mind during my review (nested interruption only from local cpu) because preemption is disabled around use of the ring buffer. For this first issue, here is the race: rb_time_cmpxchg() [...] if (!rb_time_read_cmpxchg(&t->msb, msb, msb2)) return false; if (!rb_time_read_cmpxchg(&t->top, top, top2)) return false; __rb_time_read() [...] do { c = local_read(&t->cnt); top = local_read(&t->top); bottom = local_read(&t->bottom); msb = local_read(&t->msb); } while (c != local_read(&t->cnt)); *cnt = rb_time_cnt(top); /* If top and msb counts don't match, this interrupted a write */ if (*cnt != rb_time_cnt(msb)) return false; ^ this check fails to catch that "bottom" is still not updated. So the old "bottom" value is returned, which is wrong. BTW, it's best not to have a fix like this with a comment change this big, as the comment change is highly likely to cause conflicts in any backport. I wanted to start the discussion without having an N-patch series, but I agree that this first fix should be split into a separate patch. Although, for consistency, I wonder if everything else should be changed to go: bottom, top, msb as it would match the order of the bits, as msb has the highest order, top the next, and bottom the least. Doing it as: top, bottom, msb seems out of order. I did that because msb was an afterthought :-/ Agreed on the order change, but I suspect this would belong to another patch (not the fix). 
- A cmpxchg interrupted by 4 writes or cmpxchg overflows the counter and produces corrupted time stamps. This is _not_ fixed by this patch. Except that it's not 4 bits that are compared, but 32 bits. struct rb_time_struct { local_t cnt; local_t top; local_t bottom; local_t msb; }; The full local_t (32 bits) is used for synchronization. But the other elements do get extra bits and there still might be some issues, but not as severe as you stated here. Let's bring up the race scenario I spotted: rb_time_cmpxchg() [...] /* The cmpxchg always fails if it interrupted an update */ if (!__rb_time_read(t, &val, &cnt2)) return false; if (val != expect) return false; cnt = local_read(&t->cnt); if ((cnt & 3) != cnt2) return false; ^ here (cnt & 3) == cnt2, but @val contains outdated data. This means the piecewise rb_time_read_cmpxchg() calls that follow will derive expected values from the outdated @val. cnt2 = cnt + 1; rb_time_split(val, &top, &bottom, &msb); top = rb_time_val_cnt(top, cnt); bottom = rb_time_val_cnt(bottom, cnt); ^ top, bottom, and msb contain outdated data, which do not match cnt due to 2-bit overflow. rb_time_split(set, &top2, &bottom2, &msb2); top2 = rb_time_val_cnt(top2, cnt2); bottom2 = rb_time_val_cnt(bottom2, cnt2); if (!rb_time_read_cmpxchg(&t->cnt, cnt, cnt2)) return false; ^ This @cnt cmpxchg succeeds because the re-read cnt is used as the expected value. if (!rb_time_read_cmpxchg(&t->msb, msb, msb2)) return false; if (!rb_time_read_cmpxchg(&t->top, top, top2)) return false; if (!rb_time_read_cmpxchg(&t->bottom, bottom, bottom2)) return false; ^ these cmpxchg have just used the outdated @val as expected values, even though the content of the rb_time was modified by 4 consecutive rb_time_set() or rb_time_cmpxchg(). 
This means those cmpxchg can fail not only due to being interrupted by another write or cmpxchg, but also simply due to expected value mismatch in any of the fields, which will then cause following __rb_time_read() to fail until a rb_time_set() is done. return true; So this overflow scenario on top of cmpxchg does not cause corrupted time stamps, but does cause subsequent reads and cmpxchg to fail until a write is performed.
[RFC PATCH] ring-buffer: Fix and comment ring buffer rb_time functions on 32-bit
Going through a review of the ring buffer rb_time functions for 32-bit architectures, I updated the comments to match the code, and identified the following issues: - rb_time_cmpxchg() needs to update the msb last, so it matches the validation of top and msb by __rb_time_read(). This is fixed by this patch. - A cmpxchg interrupted by 4 writes or cmpxchg overflows the counter and produces corrupted time stamps. This is _not_ fixed by this patch. - After a cmpxchg fails between updates to top and msb, a write is needed before read and cmpxchg can succeed again. I am not entirely sure the rest of the ring buffer handles this correctly. Signed-off-by: Mathieu Desnoyers Cc: Steven Rostedt Cc: Masami Hiramatsu Cc: linux-trace-ker...@vger.kernel.org --- kernel/trace/ring_buffer.c | 64 +++--- 1 file changed, 46 insertions(+), 18 deletions(-) diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c index 8d2a4f00eca9..f6ed699947cd 100644 --- a/kernel/trace/ring_buffer.c +++ b/kernel/trace/ring_buffer.c @@ -576,34 +576,50 @@ struct ring_buffer_iter { #ifdef RB_TIME_32 /* - * On 32 bit machines, local64_t is very expensive. As the ring - * buffer doesn't need all the features of a true 64 bit atomic, - * on 32 bit, it uses these functions (64 still uses local64_t). + * On 32-bit machines, local64_t is very expensive. As the ring + * buffer doesn't need all the features of a true 64-bit atomic, + * on 32-bit, it uses these functions (64-bit still uses local64_t). * - * For the ring buffer, 64 bit required operations for the time is - * the following: + * For the ring buffer, the operations required to manipulate 64-bit + * time stamps are the following: * - * - Reads may fail if it interrupted a modification of the time stamp. + * - Read may fail if it interrupted a modification of the time stamp. * It will succeed if it did not interrupt another write even if * the read itself is interrupted by a write. 
+ * A read will fail if it follows a cmpxchg which failed between + * updates to its top and msb bits, until a write is performed. + * (note: this limitation may be unexpected in parts of the + * ring buffer algorithm) * It returns whether it was successful or not. * - * - Writes always succeed and will overwrite other writes and writes + * - Write always succeeds and will overwrite other writes and writes * that were done by events interrupting the current write. * * - A write followed by a read of the same time stamp will always succeed, * but may not contain the same value. * * - A cmpxchg will fail if it interrupted another write or cmpxchg. + * A cmpxchg will fail if it follows a cmpxchg which failed between + * updates to its top and msb bits, until a write is performed. + * (note: this limitation may be unexpected in parts of the + * ring buffer algorithm) * Other than that, it acts like a normal cmpxchg. * - * The 60 bit time stamp is broken up by 30 bits in a top and bottom half - * (bottom being the least significant 30 bits of the 60 bit time stamp). + * The 64-bit time stamp is broken up, from most to least significant, + * in: msb, top and bottom fields, of respectively 4, 30, and 30 bits. * - * The two most significant bits of each half holds a 2 bit counter (0-3). + * The two most significant bits of each field hold a 2-bit counter (0-3). * Each update will increment this counter by one. - * When reading the top and bottom, if the two counter bits match then the - * top and bottom together make a valid 60 bit number. + * When reading the top, bottom, and msb fields, if the two counter bits + * match, then the combined values make a valid 64-bit number. + * + * Counter limits. The following situations can generate overflows that + * produce corrupted time stamps: + * + * - A read or a write interrupted by 2^32 writes or cmpxchg. + * + * - A cmpxchg interrupted by 4 writes or cmpxchg. 
+ * (note: this is not sufficient and should be fixed) */ #define RB_TIME_SHIFT 30 #define RB_TIME_VAL_MASK ((1 << RB_TIME_SHIFT) - 1) @@ -632,7 +648,7 @@ static inline bool __rb_time_read(rb_time_t *t, u64 *ret, unsigned long *cnt) /* * If the read is interrupted by a write, then the cnt will -* be different. Loop until both top and bottom have been read +* be different. Loop until top, bottom and msb have been read * without interruption. */ do { @@ -644,7 +660,12 @@ static inline bool __rb_time_read(rb_time_t *t, u64 *ret, unsigned long *cnt) *cnt = rb_time_cnt(top); - /* If top and msb counts don't match, this interrupted a write */ + /* +* If top and msb counts don't match, this either interrupted a +* write or follows a failed cmpxchg. +* This requires the update to bottom to be enclosed between +
Re: [PATCH] tracing: Allow for max buffer data size trace_marker writes
On 2023-12-10 11:38, Steven Rostedt wrote: On Sun, 10 Dec 2023 11:07:22 -0500 Mathieu Desnoyers wrote: It just allows more to be written in one go. I don't see why the tests need to cover this or detect this change. If the purpose of this change is to ensure that the entire trace marker payload is shown within a single event, then there should be a test which exercises this, and which validates that the end result is that the entire payload is indeed shown within a single event record. No, the purpose of the change is not to do that, because there can always be a bigger trace marker write than a single event can hold. This is the way it has always worked. This is an optimization or "enhancement". The 1KB restriction was actually because of a previous implementation years ago (before selftests even existed) that wrote into a temp buffer before copying into the ring buffer. But since we now can copy directly into the ring buffer, there's no reason not to use the maximum that the ring buffer can accept. My point is that the difference between the new "enhanced" behavior and the previous behavior is not tested for. Otherwise there is no permanent validation that this change indeed does what it is intended to do, so it can regress at any time without any test noticing it. What regress? The amount of a trace_marker write that can make it into a the buffer in one go? Now, I agree that we should have a test to make sure that all of the trace marker write gets into the buffer. Yes. This is pretty much my point. But it's always been allowed to break up that write however it wanted to. And the enhanced behavior extends the amount of data that can get written into a single sub-buffer, and this is not tested. Note, because different architectures have different page sizes, how much that can make it in one go is architecture dependent. So you can have a "regression" by simply running your application on a different architecture. 
Which is why, in the following patches you have, expressing the subbuffer size as bytes rather than pages is important at the ABI level. It facilitates portability of tests, and decreases documentation / user burden. Again, it's not a requirement, it's just an enhancement. How does this have anything to do with dispensing from testing the new behavior ? If the new behavior has a bug that causes it to silently truncate the trace marker payloads, how do you catch it with the current tests ? Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com
Re: [PATCH] tracing: Allow for max buffer data size trace_marker writes
On 2023-12-10 10:30, Steven Rostedt wrote: On Sun, 10 Dec 2023 09:09:06 -0500 Mathieu Desnoyers wrote: On 2023-12-09 17:50, Steven Rostedt wrote: From: "Steven Rostedt (Google)" Allow a trace write to be as big as the ring buffer tracing data will allow. Currently, it only allows writes of 1KB in size, but there's no reason that it cannot allow what the ring buffer can hold. I would expect the tests under tools/testing/selftests/ftrace/* to be affected by those changes. If they are not, then it's a hole in the test coverage and should be taken care of as this behavior is modified. Before this change we had: ~# echo -n '12345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456 
7890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123 4567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890 
1234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567 89012345678901234567890123456789012345678901234567890123456789012345678901234' > /sys/kernel/tracing/trace_marker ~# cat /sys/kernel/tracing/trace # tracer: nop # # entries-in-buffer/entries-written: 4/4 #P:8 # #_-=> irqs-off/BH-disabled #
Re: [PATCH 13/14] ring-buffer: Add documentation on the buffer_subbuf_order file
On 2023-12-09 22:54, Steven Rostedt wrote: [...] + buffer_subbuf_order: + + This sets or displays the sub buffer page size order. The ring buffer + is broken up into several same size "sub buffers". An event can not be + bigger than the size of the sub buffer. Normally, the sub buffer is + the size of the architecture's page (4K on x86). The sub buffer also + contains meta data at the start which also limits the size of an event. + That means when the sub buffer is a page size, no event can be larger + than the page size minus the sub buffer meta data. The fact that the user ABI documentation for this tracer parameter needs to dig into details about architecture page size is a good indicator that this ABI is not at the abstraction level it should be (pages vs bytes). Thanks, Mathieu + + The buffer_subbuf_order allows the user to change the size of the sub + buffer. As the sub buffer is a set of pages by the power of 2, thus + the sub buffer total size is defined by the order: + + order size + + 0 PAGE_SIZE + 1 PAGE_SIZE * 2 + 2 PAGE_SIZE * 4 + 3 PAGE_SIZE * 8 + + Changing the order will change the sub buffer size allowing for events + to be larger than the page size. + + Note: When changing the order, tracing is stopped and any data in the + ring buffer and the snapshot buffer will be discarded. + free_buffer: If a process is performing tracing, and the ring buffer should be -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com
Re: [PATCH 14/14] ringbuffer/selftest: Add basic selftest to test changing subbuf order
On 2023-12-09 22:54, Steven Rostedt wrote: [...] +get_buffer_data_size() { + sed -ne 's/^.*data.*size:\([0-9][0-9]*\).*/\1/p' events/header_page +} + +a="1234567890" + +make_str() { +cnt=$1 +s="" +while [ $cnt -gt 10 ]; do +s="${s}${a}" +cnt=$((cnt-10)) +done +while [ $cnt -gt 0 ]; do +s="${s}X" +cnt=$((cnt-1)) +done +echo -n $s +} + +test_buffer() { + + size=`get_buffer_data_size` + + str=`make_str $size` + + echo $str > trace_marker + + grep -q $a trace This test has no clue if the record was truncated or not. It basically repeats the string "1234567890" until it fills the subbuffer size and pads with "X" characters as needed to form the trace marker payload, but the grep looks for the "1234567890" pattern only. The test should be extended to validate whether the trace marker payload was truncated or not, otherwise it is of limited value. Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com
Re: [PATCH 00/14] ring-buffer/tracing: Allow ring buffer to have bigger sub buffers
On 2023-12-09 22:54, Steven Rostedt wrote: [...] Basically, events to the tracing subsystem are limited to just under a PAGE_SIZE, as the ring buffer is split into "sub buffers" of one page size, and an event can not be bigger than a sub buffer. This allows users to change the size of a sub buffer by the order: echo 3 > /sys/kernel/tracing/buffer_subbuf_order Will make each sub buffer a size of 8 pages, allowing events to be almost as big as 8 pages in size (sub buffers do have meta data on them as well, keeping an event from reaching the same size as a sub buffer). Specifying the "order" of subbuffer size as a power-of-two number of pages is a poor UX choice for a user-facing ABI. I would recommend allowing the user to specify the size in bytes, and internally bump the size to the next power of 2, with a minimum of PAGE_SIZE. Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com