Re: [RFC PATCH 1/4] hazptr: Add initial implementation of hazard pointers
On 2024-09-17 16:33, Boqun Feng wrote:
[...]
> The synchronization between readers and updaters is built around "hazard
> pointer slots": a slot readers can use to store a pointer value.
>
> Reader side protection:
>
> 1. Read the value of a pointer to the target data element.
> 2. Store it to a hazard pointer slot.
> 3. Enforce full ordering (e.g. smp_mb()).
> 4. Re-read the original pointer, reset the slot and retry if the value
>    changed.
> 5. Otherwise, the continued existence of the target data element is
>    guaranteed.
>
> Updater side check:
>
> 1. Unpublish the target data element (e.g. setting the pointer value to
>    NULL).
> 2. Enforce full ordering.
> 3. Read the value from a hazard pointer slot.
> 4. If the value doesn't match the target data element, then this slot
>    doesn't represent a reference to it.
> 5. Otherwise, updater needs to re-check (step 3).

Cool! I look forward to seeing where this is meant to be used. I would
expect it to be a useful tool to speed up reference counting of things
like the mm_struct and for TLB flush IPIs.

On a related note, with a userspace port in mind, the membarrier(2)
syscall can be useful to turn the smp_mb() in (3) from the reader side
into a simple compiler barrier, assuming (2) from the updater is using
membarrier. If IPIs are acceptable (or already required) for some kernel
use-cases, this means a similar asymmetric fence scheme could be used to
speed up readers.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
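[Editor's note: the reader/updater protocol quoted above can be sketched in
userspace C11 atomics. This is an illustrative single-slot analogue, not the
kernel implementation — the names hazptr_acquire/hazptr_release/
hazptr_try_retire are hypothetical, and atomic_thread_fence(seq_cst) stands
in for smp_mb().]

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* One hazard pointer slot shared between one reader and one updater. */
static _Atomic(void *) hazard_slot;
/* The published pointer to the protected object. */
static _Atomic(void *) published;

/* Reader side: returns a protected pointer, or NULL if unpublished. */
static void *hazptr_acquire(void)
{
	void *p;

	for (;;) {
		p = atomic_load(&published);               /* step 1 */
		if (!p)
			return NULL;
		atomic_store(&hazard_slot, p);             /* step 2 */
		atomic_thread_fence(memory_order_seq_cst); /* step 3: full barrier */
		if (atomic_load(&published) == p)          /* step 4 */
			return p;                          /* step 5: protected */
		atomic_store(&hazard_slot, NULL);          /* reset slot, retry */
	}
}

static void hazptr_release(void)
{
	atomic_store(&hazard_slot, NULL);
}

/* Updater side: unpublish, then report whether the object may be freed. */
static int hazptr_try_retire(void *p)
{
	atomic_store(&published, NULL);                    /* step 1 */
	atomic_thread_fence(memory_order_seq_cst);         /* step 2 */
	return atomic_load(&hazard_slot) != p;             /* steps 3-5 */
}
```

In the real scheme the updater busy-waits (or defers) while the slot still
holds the pointer; returning 0 here models the "re-check" case.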
[PATCH v6 5/5] tracing: Convert sys_enter/exit to faultable tracepoints
Convert the definition of the system call enter/exit tracepoints to
faultable tracepoints now that all upstream tracers handle it.

This allows tracers to fault-in userspace system call arguments such as
path strings within their probe callbacks.

Link: https://lore.kernel.org/lkml/20231002202531.3160-1-mathieu.desnoy...@efficios.com/
Co-developed-by: Michael Jeanson
Signed-off-by: Mathieu Desnoyers
Signed-off-by: Michael Jeanson
Reviewed-by: Masami Hiramatsu (Google)
Cc: Steven Rostedt
Cc: Masami Hiramatsu
Cc: Peter Zijlstra
Cc: Alexei Starovoitov
Cc: Yonghong Song
Cc: Paul E. McKenney
Cc: Ingo Molnar
Cc: Arnaldo Carvalho de Melo
Cc: Mark Rutland
Cc: Alexander Shishkin
Cc: Namhyung Kim
Cc: b...@vger.kernel.org
Cc: Joel Fernandes
---
Since v4:
- Use 'guard(preempt_notrace)'.
- Add brackets to multiline 'if' statements.
---
 include/trace/events/syscalls.h |  4 +--
 kernel/trace/trace_syscalls.c   | 52 -
 2 files changed, 46 insertions(+), 10 deletions(-)

diff --git a/include/trace/events/syscalls.h b/include/trace/events/syscalls.h
index b6e0cbc2c71f..dc30e3004818 100644
--- a/include/trace/events/syscalls.h
+++ b/include/trace/events/syscalls.h
@@ -15,7 +15,7 @@
 #ifdef CONFIG_HAVE_SYSCALL_TRACEPOINTS
 
-TRACE_EVENT_FN(sys_enter,
+TRACE_EVENT_FN_MAY_FAULT(sys_enter,
 
 	TP_PROTO(struct pt_regs *regs, long id),
 
@@ -41,7 +41,7 @@ TRACE_EVENT_FN(sys_enter,
 TRACE_EVENT_FLAGS(sys_enter, TRACE_EVENT_FL_CAP_ANY)
 
-TRACE_EVENT_FN(sys_exit,
+TRACE_EVENT_FN_MAY_FAULT(sys_exit,
 
 	TP_PROTO(struct pt_regs *regs, long ret),
 
diff --git a/kernel/trace/trace_syscalls.c b/kernel/trace/trace_syscalls.c
index 9c581d6da843..314666d663b6 100644
--- a/kernel/trace/trace_syscalls.c
+++ b/kernel/trace/trace_syscalls.c
@@ -299,6 +299,12 @@ static void ftrace_syscall_enter(void *data, struct pt_regs *regs, long id)
 	int syscall_nr;
 	int size;
 
+	/*
+	 * Probe called with preemption enabled (may_fault), but ring buffer and
+	 * per-cpu data require preemption to be disabled.
+	 */
+	guard(preempt_notrace)();
+
 	syscall_nr = trace_get_syscall_nr(current, regs);
 	if (syscall_nr < 0 || syscall_nr >= NR_syscalls)
 		return;
@@ -338,6 +344,12 @@ static void ftrace_syscall_exit(void *data, struct pt_regs *regs, long ret)
 	struct trace_event_buffer fbuffer;
 	int syscall_nr;
 
+	/*
+	 * Probe called with preemption enabled (may_fault), but ring buffer and
+	 * per-cpu data require preemption to be disabled.
+	 */
+	guard(preempt_notrace)();
+
 	syscall_nr = trace_get_syscall_nr(current, regs);
 	if (syscall_nr < 0 || syscall_nr >= NR_syscalls)
 		return;
@@ -376,8 +388,11 @@ static int reg_event_syscall_enter(struct trace_event_file *file,
 	if (WARN_ON_ONCE(num < 0 || num >= NR_syscalls))
 		return -ENOSYS;
 	mutex_lock(&syscall_trace_lock);
-	if (!tr->sys_refcount_enter)
-		ret = register_trace_sys_enter(ftrace_syscall_enter, tr);
+	if (!tr->sys_refcount_enter) {
+		ret = register_trace_prio_flags_sys_enter(ftrace_syscall_enter, tr,
+							  TRACEPOINT_DEFAULT_PRIO,
+							  TRACEPOINT_MAY_FAULT);
+	}
 	if (!ret) {
 		rcu_assign_pointer(tr->enter_syscall_files[num], file);
 		tr->sys_refcount_enter++;
@@ -414,8 +429,11 @@ static int reg_event_syscall_exit(struct trace_event_file *file,
 	if (WARN_ON_ONCE(num < 0 || num >= NR_syscalls))
 		return -ENOSYS;
 	mutex_lock(&syscall_trace_lock);
-	if (!tr->sys_refcount_exit)
-		ret = register_trace_sys_exit(ftrace_syscall_exit, tr);
+	if (!tr->sys_refcount_exit) {
+		ret = register_trace_prio_flags_sys_exit(ftrace_syscall_exit, tr,
+							 TRACEPOINT_DEFAULT_PRIO,
+							 TRACEPOINT_MAY_FAULT);
+	}
 	if (!ret) {
 		rcu_assign_pointer(tr->exit_syscall_files[num], file);
 		tr->sys_refcount_exit++;
@@ -582,6 +600,12 @@ static void perf_syscall_enter(void *ignore, struct pt_regs *regs, long id)
 	int rctx;
 	int size;
 
+	/*
+	 * Probe called with preemption enabled (may_fault), but ring buffer and
+	 * per-cpu data require preemption to be disabled.
+	 */
+	guard(preempt_notrace)();
+
 	syscall_nr = trace_get_syscall_nr(current, regs);
 	if (syscall_nr < 0 || syscall_nr >= NR_syscalls)
 		return;
@@ -630,8 +654,11 @@ static int perf_sysenter_enable(struct trace_event_call *call)
 	num = ((struct syscall_metadat
[PATCH v6 4/5] tracing/perf: Add support for faultable tracepoints
In preparation for converting system call enter/exit instrumentation
into faultable tracepoints, make sure that perf can handle registering
to such tracepoints by explicitly disabling preemption within the perf
tracepoint probes to respect the current expectations within perf ring
buffer code.

This change does not yet allow perf to take page faults per se within
its probe, but allows its existing probes to connect to faultable
tracepoints.

Link: https://lore.kernel.org/lkml/20231002202531.3160-1-mathieu.desnoy...@efficios.com/
Co-developed-by: Michael Jeanson
Signed-off-by: Mathieu Desnoyers
Signed-off-by: Michael Jeanson
Reviewed-by: Masami Hiramatsu (Google)
Cc: Steven Rostedt
Cc: Masami Hiramatsu
Cc: Peter Zijlstra
Cc: Alexei Starovoitov
Cc: Yonghong Song
Cc: Paul E. McKenney
Cc: Ingo Molnar
Cc: Arnaldo Carvalho de Melo
Cc: Mark Rutland
Cc: Alexander Shishkin
Cc: Namhyung Kim
Cc: b...@vger.kernel.org
Cc: Joel Fernandes
---
Changes since v4:
- Use DEFINE_INACTIVE_GUARD.
---
 include/trace/perf.h | 22 --
 1 file changed, 20 insertions(+), 2 deletions(-)

diff --git a/include/trace/perf.h b/include/trace/perf.h
index 2c11181c82e0..161e1655b953 100644
--- a/include/trace/perf.h
+++ b/include/trace/perf.h
@@ -12,8 +12,8 @@
 #undef __perf_task
 #define __perf_task(t)	(__task = (t))
 
-#undef DECLARE_EVENT_CLASS
-#define DECLARE_EVENT_CLASS(call, proto, args, tstruct, assign, print)	\
+#undef _DECLARE_EVENT_CLASS
+#define _DECLARE_EVENT_CLASS(call, proto, args, tstruct, assign, print, tp_flags) \
 static notrace void							\
 perf_trace_##call(void *__data, proto)					\
 {									\
@@ -28,6 +28,13 @@ perf_trace_##call(void *__data, proto)			\
 	int __data_size;						\
 	int rctx;							\
 									\
+	DEFINE_INACTIVE_GUARD(preempt_notrace, trace_event_guard);	\
+									\
+	if ((tp_flags) & TRACEPOINT_MAY_FAULT) {			\
+		might_fault();						\
+		activate_guard(preempt_notrace, trace_event_guard)();	\
+	}								\
+									\
 	__data_size = trace_event_get_offsets_##call(&__data_offsets, args); \
 									\
 	head = this_cpu_ptr(event_call->perf_events);			\
@@ -55,6 +62,17 @@ perf_trace_##call(void *__data, proto)			\
 			  head, __task);				\
 }
 
+#undef DECLARE_EVENT_CLASS
+#define DECLARE_EVENT_CLASS(call, proto, args, tstruct, assign, print)	\
+	_DECLARE_EVENT_CLASS(call, PARAMS(proto), PARAMS(args),		\
+			     PARAMS(tstruct), PARAMS(assign), PARAMS(print), 0)
+
+#undef DECLARE_EVENT_CLASS_MAY_FAULT
+#define DECLARE_EVENT_CLASS_MAY_FAULT(call, proto, args, tstruct, assign, print) \
+	_DECLARE_EVENT_CLASS(call, PARAMS(proto), PARAMS(args),		\
+			     PARAMS(tstruct), PARAMS(assign), PARAMS(print), \
+			     TRACEPOINT_MAY_FAULT)
+
 /*
  * This part is compiled out, it is only here as a build time check
  * to make sure that if the tracepoint handling changes, the
-- 
2.39.2
[PATCH v6 3/5] tracing/bpf-trace: Add support for faultable tracepoints
In preparation for converting system call enter/exit instrumentation
into faultable tracepoints, make sure that bpf can handle registering
to such tracepoints by explicitly disabling preemption within the bpf
tracepoint probes to respect the current expectations within bpf
tracing code.

This change does not yet allow bpf to take page faults per se within
its probe, but allows its existing probes to connect to faultable
tracepoints.

Link: https://lore.kernel.org/lkml/20231002202531.3160-1-mathieu.desnoy...@efficios.com/
Co-developed-by: Michael Jeanson
Signed-off-by: Mathieu Desnoyers
Signed-off-by: Michael Jeanson
Reviewed-by: Masami Hiramatsu (Google)
Cc: Steven Rostedt
Cc: Masami Hiramatsu
Cc: Peter Zijlstra
Cc: Alexei Starovoitov
Cc: Yonghong Song
Cc: Paul E. McKenney
Cc: Ingo Molnar
Cc: Arnaldo Carvalho de Melo
Cc: Mark Rutland
Cc: Alexander Shishkin
Cc: Namhyung Kim
Cc: b...@vger.kernel.org
Cc: Joel Fernandes
---
Changes since v4:
- Use DEFINE_INACTIVE_GUARD.
- Add brackets to multiline 'if' statements.
Changes since v5:
- Rebased on v6.11-rc5.
- Pass the TRACEPOINT_MAY_FAULT flag directly to
  tracepoint_probe_register_prio_flags.
---
 include/trace/bpf_probe.h | 21 -
 kernel/trace/bpf_trace.c  |  2 +-
 2 files changed, 17 insertions(+), 6 deletions(-)

diff --git a/include/trace/bpf_probe.h b/include/trace/bpf_probe.h
index a2ea11cc912e..cc96dd1e7c3d 100644
--- a/include/trace/bpf_probe.h
+++ b/include/trace/bpf_probe.h
@@ -42,16 +42,27 @@
 /* tracepoints with more than 12 arguments will hit build error */
 #define CAST_TO_U64(...) CONCATENATE(__CAST, COUNT_ARGS(__VA_ARGS__))(__VA_ARGS__)
 
-#define __BPF_DECLARE_TRACE(call, proto, args)				\
+#define __BPF_DECLARE_TRACE(call, proto, args, tp_flags)		\
 static notrace void							\
 __bpf_trace_##call(void *__data, proto)					\
 {									\
-	CONCATENATE(bpf_trace_run, COUNT_ARGS(args))(__data, CAST_TO_U64(args));\
+	DEFINE_INACTIVE_GUARD(preempt_notrace, bpf_trace_guard);	\
+									\
+	if ((tp_flags) & TRACEPOINT_MAY_FAULT) {			\
+		might_fault();						\
+		activate_guard(preempt_notrace, bpf_trace_guard)();	\
+	}								\
+									\
+	CONCATENATE(bpf_trace_run, COUNT_ARGS(args))(__data, CAST_TO_U64(args)); \
 }
 
 #undef DECLARE_EVENT_CLASS
 #define DECLARE_EVENT_CLASS(call, proto, args, tstruct, assign, print)	\
-	__BPF_DECLARE_TRACE(call, PARAMS(proto), PARAMS(args))
+	__BPF_DECLARE_TRACE(call, PARAMS(proto), PARAMS(args), 0)
+
+#undef DECLARE_EVENT_CLASS_MAY_FAULT
+#define DECLARE_EVENT_CLASS_MAY_FAULT(call, proto, args, tstruct, assign, print) \
+	__BPF_DECLARE_TRACE(call, PARAMS(proto), PARAMS(args), TRACEPOINT_MAY_FAULT)
 
 /*
  * This part is compiled out, it is only here as a build time check
@@ -105,13 +116,13 @@ static inline void bpf_test_buffer_##call(void)	\
 
 #undef DECLARE_TRACE
 #define DECLARE_TRACE(call, proto, args)				\
-	__BPF_DECLARE_TRACE(call, PARAMS(proto), PARAMS(args))		\
+	__BPF_DECLARE_TRACE(call, PARAMS(proto), PARAMS(args), 0)	\
 	__DEFINE_EVENT(call, call, PARAMS(proto), PARAMS(args), 0)
 
 #undef DECLARE_TRACE_WRITABLE
 #define DECLARE_TRACE_WRITABLE(call, proto, args, size) \
 	__CHECK_WRITABLE_BUF_SIZE(call, PARAMS(proto), PARAMS(args), size) \
-	__BPF_DECLARE_TRACE(call, PARAMS(proto), PARAMS(args))		\
+	__BPF_DECLARE_TRACE(call, PARAMS(proto), PARAMS(args), 0)	\
 	__DEFINE_EVENT(call, call, PARAMS(proto), PARAMS(args), size)
 
 #include TRACE_INCLUDE(TRACE_INCLUDE_FILE)
diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index c77eb80cbd7f..ed07283d505b 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -2473,7 +2473,7 @@ int bpf_probe_register(struct bpf_raw_event_map *btp, struct bpf_raw_tp_link *li
 	return tracepoint_probe_register_prio_flags(tp, (void *)btp->bpf_func,
 						    link, TRACEPOINT_DEFAULT_PRIO,
-						    TRACEPOINT_MAY_EXIST);
+						    TRACEPOINT_MAY_EXIST | (tp->flags & TRACEPOINT_MAY_FAULT));
 }
 
 int bpf_probe_unregister(struct bpf_raw_event_map *btp, struct bpf_raw_tp_link *link)
-- 
2.39.2
[PATCH v6 2/5] tracing/ftrace: Add support for faultable tracepoints
In preparation for converting system call enter/exit instrumentation
into faultable tracepoints, make sure that ftrace can handle registering
to such tracepoints by explicitly disabling preemption within the ftrace
tracepoint probes to respect the current expectations within ftrace ring
buffer code.

This change does not yet allow ftrace to take page faults per se within
its probe, but allows its existing probes to connect to faultable
tracepoints.

Link: https://lore.kernel.org/lkml/20231002202531.3160-1-mathieu.desnoy...@efficios.com/
Co-developed-by: Michael Jeanson
Signed-off-by: Michael Jeanson
Signed-off-by: Mathieu Desnoyers
Reviewed-by: Masami Hiramatsu (Google)
Cc: Steven Rostedt
Cc: Masami Hiramatsu
Cc: Peter Zijlstra
Cc: Alexei Starovoitov
Cc: Yonghong Song
Cc: Paul E. McKenney
Cc: Ingo Molnar
Cc: Arnaldo Carvalho de Melo
Cc: Mark Rutland
Cc: Alexander Shishkin
Cc: Namhyung Kim
Cc: b...@vger.kernel.org
Cc: Joel Fernandes
---
Changes since v4:
- Use DEFINE_INACTIVE_GUARD.
- Add brackets to multiline 'if' statements.
Changes since v5:
- Pass the TRACEPOINT_MAY_FAULT flag directly to
  tracepoint_probe_register_prio_flags.
---
 include/trace/trace_events.h | 64 ++--
 kernel/trace/trace_events.c  | 16 +
 2 files changed, 71 insertions(+), 9 deletions(-)

diff --git a/include/trace/trace_events.h b/include/trace/trace_events.h
index df590eea8ae4..c887f7b6fbe9 100644
--- a/include/trace/trace_events.h
+++ b/include/trace/trace_events.h
@@ -45,6 +45,16 @@
 			     PARAMS(print));				\
 	DEFINE_EVENT(name, name, PARAMS(proto), PARAMS(args));
 
+#undef TRACE_EVENT_MAY_FAULT
+#define TRACE_EVENT_MAY_FAULT(name, proto, args, tstruct, assign, print) \
+	DECLARE_EVENT_CLASS_MAY_FAULT(name,				\
+			     PARAMS(proto),				\
+			     PARAMS(args),				\
+			     PARAMS(tstruct),				\
+			     PARAMS(assign),				\
+			     PARAMS(print));				\
+	DEFINE_EVENT(name, name, PARAMS(proto), PARAMS(args));
+
 #include "stages/stage1_struct_define.h"
 
 #undef DECLARE_EVENT_CLASS
@@ -57,6 +67,11 @@
 									\
 	static struct trace_event_class event_class_##name;
 
+#undef DECLARE_EVENT_CLASS_MAY_FAULT
+#define DECLARE_EVENT_CLASS_MAY_FAULT(name, proto, args, tstruct, assign, print) \
+	DECLARE_EVENT_CLASS(name, PARAMS(proto), PARAMS(args),		\
+			    PARAMS(tstruct), PARAMS(assign), PARAMS(print))
+
 #undef DEFINE_EVENT
 #define DEFINE_EVENT(template, name, proto, args)	\
 	static struct trace_event_call __used		\
@@ -80,7 +95,7 @@
 #undef TRACE_EVENT_FN_MAY_FAULT
 #define TRACE_EVENT_FN_MAY_FAULT(name, proto, args, tstruct,		\
 		assign, print, reg, unreg)				\
-	TRACE_EVENT(name, PARAMS(proto), PARAMS(args),			\
+	TRACE_EVENT_MAY_FAULT(name, PARAMS(proto), PARAMS(args),	\
 		PARAMS(tstruct), PARAMS(assign), PARAMS(print))		\
 
 #undef TRACE_EVENT_FN_COND
@@ -123,6 +138,11 @@
 		tstruct;						\
 	};
 
+#undef DECLARE_EVENT_CLASS_MAY_FAULT
+#define DECLARE_EVENT_CLASS_MAY_FAULT(call, proto, args, tstruct, assign, print) \
+	DECLARE_EVENT_CLASS(call, PARAMS(proto), PARAMS(args),		\
+			    PARAMS(tstruct), PARAMS(assign), PARAMS(print))
+
 #undef DEFINE_EVENT
 #define DEFINE_EVENT(template, name, proto, args)
 
@@ -214,6 +234,11 @@ static struct trace_event_functions trace_event_type_funcs_##call = { \
 	.trace			= trace_raw_output_##call,		\
 };
 
+#undef DECLARE_EVENT_CLASS_MAY_FAULT
+#define DECLARE_EVENT_CLASS_MAY_FAULT(call, proto, args, tstruct, assign, print) \
+	DECLARE_EVENT_CLASS(call, PARAMS(proto), PARAMS(args),		\
+			    PARAMS(tstruct), PARAMS(assign), PARAMS(print))
+
 #undef DEFINE_EVENT_PRINT
 #define DEFINE_EVENT_PRINT(template, call, proto, args, print)		\
 static notrace enum print_line_t					\
@@ -250,6 +275,11 @@ static struct trace_event_fields trace_event_fields_##call[] = { \
 	tstruct								\
 	{} };
 
+#undef DECLARE_EVENT_CLASS_MAY_FAULT
+#define DECLARE_EVENT_CLASS_MAY_FAULT(call, proto, args, tstruct, assign, print) \
+	DECLARE_EVENT_CLASS(call, PARAMS(proto), PARAMS(args),		\
+			    PARAMS(tstruct), PARAMS(assign), PARAMS(print))
+
 #undef DEFINE_EVENT_PRINT
 #define DEFINE_EVENT_PRINT(template, name, proto, args, print)
 
@@ -271,6 +301,11 @@ s
[PATCH v6 1/5] tracing: Introduce faultable tracepoints
When invoked from system call enter/exit instrumentation, accessing
user-space data is a common use-case for tracers. However, tracepoints
currently disable preemption around iteration on the registered
tracepoint probes and invocation of the probe callbacks, which prevents
tracers from handling page faults.

Extend the tracepoint and trace event APIs to allow defining a faultable
tracepoint which invokes its callback with preemption enabled.

Also extend the tracepoint API to allow tracers to request specific
probes to be connected to those faultable tracepoints. When the
TRACEPOINT_MAY_FAULT flag is provided on registration, the probe
callback will be called with preemption enabled, and is allowed to take
page faults. Faultable probes can only be registered on faultable
tracepoints and non-faultable probes on non-faultable tracepoints.

The tasks trace rcu mechanism is used to synchronize read-side
marshalling of the registered probes with respect to faultable probes
unregistration and teardown.

Link: https://lore.kernel.org/lkml/20231002202531.3160-1-mathieu.desnoy...@efficios.com/
Co-developed-by: Michael Jeanson
Signed-off-by: Mathieu Desnoyers
Signed-off-by: Michael Jeanson
Reviewed-by: Masami Hiramatsu (Google)
Cc: Steven Rostedt
Cc: Masami Hiramatsu
Cc: Peter Zijlstra
Cc: Alexei Starovoitov
Cc: Yonghong Song
Cc: Paul E. McKenney
Cc: Ingo Molnar
Cc: Arnaldo Carvalho de Melo
Cc: Mark Rutland
Cc: Alexander Shishkin
Cc: Namhyung Kim
Cc: b...@vger.kernel.org
Cc: Joel Fernandes
---
Changes since v1:
- Cleanup __DO_TRACE() implementation.
- Rename "sleepable tracepoints" to "faultable tracepoints", MAYSLEEP to
  MAYFAULT, and use might_fault() rather than might_sleep(), to properly
  convey that the tracepoints are meant to be able to take a page fault,
  which requires to be able to sleep *and* to hold the mmap_sem.
Changes since v2:
- Rename MAYFAULT to MAY_FAULT.
- Rebased on 6.5.5.
- Introduce MAY_EXIST tracepoint flag.
Changes since v3:
- Rebased on 6.6.2.
Changes since v4:
- Rebased on 6.9.6.
- Simplify flag check in tracepoint_probe_register_prio_flags().
- Update MAY_EXIST flag description.
Changes since v5:
- Rebased on v6.11-rc5.
---
 include/linux/tracepoint-defs.h | 14 ++
 include/linux/tracepoint.h      | 88 +++-2
 include/trace/define_trace.h    |  7 +++
 include/trace/trace_events.h    |  6 +++
 init/Kconfig                    |  1 +
 kernel/trace/bpf_trace.c        |  4 +-
 kernel/trace/trace_fprobe.c     |  5 +-
 kernel/tracepoint.c             | 65 ++-2
 8 files changed, 136 insertions(+), 54 deletions(-)

diff --git a/include/linux/tracepoint-defs.h b/include/linux/tracepoint-defs.h
index 4dc4955f0fbf..94e39c86b49f 100644
--- a/include/linux/tracepoint-defs.h
+++ b/include/linux/tracepoint-defs.h
@@ -29,6 +29,19 @@ struct tracepoint_func {
 	int prio;
 };
 
+/**
+ * enum tracepoint_flags - Tracepoint flags
+ * @TRACEPOINT_MAY_EXIST: On registration, don't warn if the tracepoint is
+ *                        already registered.
+ * @TRACEPOINT_MAY_FAULT: The tracepoint probe callback will be called with
+ *                        preemption enabled, and is allowed to take page
+ *                        faults.
+ */
+enum tracepoint_flags {
+	TRACEPOINT_MAY_EXIST = (1 << 0),
+	TRACEPOINT_MAY_FAULT = (1 << 1),
+};
+
 struct tracepoint {
 	const char *name;		/* Tracepoint name */
 	struct static_key key;
@@ -39,6 +52,7 @@ struct tracepoint {
 	int (*regfunc)(void);
 	void (*unregfunc)(void);
 	struct tracepoint_func __rcu *funcs;
+	unsigned int flags;
 };
 
 #ifdef CONFIG_HAVE_ARCH_PREL32_RELOCATIONS
diff --git a/include/linux/tracepoint.h b/include/linux/tracepoint.h
index 6be396bb4297..7ae5496a800c 100644
--- a/include/linux/tracepoint.h
+++ b/include/linux/tracepoint.h
@@ -17,6 +17,7 @@
 #include
 #include
 #include
+#include
 #include
 #include
 
@@ -40,17 +41,10 @@ extern int
 tracepoint_probe_register_prio(struct tracepoint *tp, void *probe,
 			       void *data, int prio);
 extern int
-tracepoint_probe_register_prio_may_exist(struct tracepoint *tp, void *probe, void *data,
-					 int prio);
+tracepoint_probe_register_prio_flags(struct tracepoint *tp, void *probe, void *data,
+				     int prio, unsigned int flags);
 extern int
 tracepoint_probe_unregister(struct tracepoint *tp, void *probe, void *data);
-static inline int
-tracepoint_probe_register_may_exist(struct tracepoint *tp, void *probe,
-				    void *data)
-{
-	return tracepoint_probe_register_prio_may_exist(tp, probe, data,
-							TRACEPOINT_DEFAULT_PRIO);
-}
 extern void
 for_each_kernel_tracepoint(void (*fct)(struct tracepoint *tp, void *priv),
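[Editor's note: the registration rule stated in the commit message above —
faultable probes only on faultable tracepoints, and non-faultable probes only
on non-faultable tracepoints — amounts to requiring the MAY_FAULT bit to match
on both sides. The following userspace C sketch is illustrative only; the
function name check_probe_flags() is hypothetical, the in-kernel check lives
in tracepoint_probe_register_prio_flags().]

```c
#include <assert.h>

#define EINVAL 22	/* stand-in for <errno.h>'s EINVAL */

enum tracepoint_flags {
	TRACEPOINT_MAY_EXIST = (1 << 0),
	TRACEPOINT_MAY_FAULT = (1 << 1),
};

/*
 * Reject registration when the probe's MAY_FAULT expectation does not
 * match the tracepoint definition's: a may-fault probe on a non-faultable
 * tracepoint would fault with preemption disabled, and a non-faultable
 * probe on a faultable tracepoint would run without the preemption
 * protection it assumes.
 */
static int check_probe_flags(unsigned int tp_flags, unsigned int reg_flags)
{
	if ((tp_flags & TRACEPOINT_MAY_FAULT) !=
	    (reg_flags & TRACEPOINT_MAY_FAULT))
		return -EINVAL;
	return 0;
}
```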
[PATCH v6 0/5] Faultable Tracepoints
Wire up the system call tracepoints with Tasks Trace RCU to allow
the ftrace, perf, and eBPF tracers to handle page faults.

This series does the initial wire-up allowing tracers to handle page
faults, but leaves out the actual handling of said page faults as
future work.

I have tested this against a feature branch of lttng-modules which
implements handling of page faults for the filename argument of the
openat(2) system call.

This v6 rebases v5 on top of v6.11-rc5. It requires the
"cleanup.h: Introduce DEFINE_INACTIVE_GUARD()/activate_guard()" series.

Thanks,

Mathieu

Link: https://lore.kernel.org/lkml/20240828143719.828968-1-mathieu.desnoy...@efficios.com/ # cleanup.h dependency
Link: https://lore.kernel.org/lkml/20240627152340.82413-1-mathieu.desnoy...@efficios.com/ # v5
Link: https://lore.kernel.org/lkml/20231120205418.334172-1-mathieu.desnoy...@efficios.com/
Link: https://lore.kernel.org/lkml/e4e9a2bc-1776-4b51-aba4-a147795a5...@efficios.com/
Link: https://lore.kernel.org/lkml/a0ac5f77-411e-4562-9863-81196238f...@efficios.com/
Link: https://lore.kernel.org/lkml/ba543d44-9302-4115-ac4f-d4e9f8d98a90@paulmck-laptop/
Link: https://lore.kernel.org/lkml/20231120221524.gd8...@noisy.programming.kicks-ass.net/
Cc: Peter Zijlstra
Cc: Alexei Starovoitov
Cc: Yonghong Song
Cc: Paul E. McKenney
Cc: Ingo Molnar
Cc: Arnaldo Carvalho de Melo
Cc: Mark Rutland
Cc: Alexander Shishkin
Cc: Namhyung Kim
Cc: b...@vger.kernel.org
Cc: Joel Fernandes
Cc: linux-trace-ker...@vger.kernel.org

Mathieu Desnoyers (5):
  tracing: Introduce faultable tracepoints
  tracing/ftrace: Add support for faultable tracepoints
  tracing/bpf-trace: Add support for faultable tracepoints
  tracing/perf: Add support for faultable tracepoints
  tracing: Convert sys_enter/exit to faultable tracepoints

 include/linux/tracepoint-defs.h | 14 ++
 include/linux/tracepoint.h      | 88 +++--
 include/trace/bpf_probe.h       | 21 ++--
 include/trace/define_trace.h    |  7 +++
 include/trace/events/syscalls.h |  4 +-
 include/trace/perf.h            | 22 -
 include/trace/trace_events.h    | 68 -
 init/Kconfig                    |  1 +
 kernel/trace/bpf_trace.c        |  4 +-
 kernel/trace/trace_events.c     | 16 +++---
 kernel/trace/trace_fprobe.c     |  5 +-
 kernel/trace/trace_syscalls.c   | 52 ---
 kernel/tracepoint.c             | 65 ++--
 13 files changed, 288 insertions(+), 79 deletions(-)

-- 
2.39.2
[PATCH v1 1/2] cleanup.h guard: Rename DEFINE_ prefix to DECLARE_
The convention used in other kernel headers (e.g. wait.h,
percpu-defs.h) is to use "DECLARE_" prefix for macros emitting externs,
static inlines and type definitions. The "DEFINE_" prefix is used for
variable definitions.

In preparation to introduce a "DEFINE_INACTIVE_GUARD()" to actually
define a guard variable, rename all the guard "DEFINE_" prefix to
"DECLARE_".

Signed-off-by: Mathieu Desnoyers
Cc: Peter Zijlstra (Intel)
Cc: Ingo Molnar
---
 crypto/asymmetric_keys/x509_parser.h         |  4 +-
 drivers/cxl/acpi.c                           |  6 +--
 drivers/cxl/core/cdat.c                      |  2 +-
 drivers/cxl/cxl.h                            |  2 +-
 drivers/gpio/gpiolib.h                       |  2 +-
 drivers/net/wireless/intel/iwlwifi/mvm/mvm.h |  2 +-
 drivers/platform/x86/intel/pmc/core_ssram.c  |  2 +-
 fs/fuse/virtio_fs.c                          |  2 +-
 fs/namespace.c                               |  2 +-
 fs/pstore/inode.c                            |  4 +-
 include/linux/bitmap.h                       |  2 +-
 include/linux/cleanup.h                      | 56 ++--
 include/linux/cpuhplock.h                    |  2 +-
 include/linux/cpumask.h                      |  2 +-
 include/linux/device.h                       |  6 +--
 include/linux/file.h                         |  6 +--
 include/linux/firmware.h                     |  2 +-
 include/linux/firmware/qcom/qcom_tzmem.h     |  2 +-
 include/linux/gpio/driver.h                  |  4 +-
 include/linux/iio/iio.h                      |  4 +-
 include/linux/interrupt.h                    |  4 +-
 include/linux/irqflags.h                     |  4 +-
 include/linux/local_lock.h                   | 22 
 include/linux/mutex.h                        |  6 +--
 include/linux/netdevice.h                    |  2 +-
 include/linux/nsproxy.h                      |  2 +-
 include/linux/of.h                           |  2 +-
 include/linux/path.h                         |  2 +-
 include/linux/pci.h                          |  4 +-
 include/linux/percpu.h                       |  2 +-
 include/linux/preempt.h                      |  6 +--
 include/linux/property.h                     |  2 +-
 include/linux/rcupdate.h                     |  2 +-
 include/linux/rtnetlink.h                    |  2 +-
 include/linux/rwsem.h                        | 10 ++--
 include/linux/sched/task.h                   |  4 +-
 include/linux/slab.h                         |  4 +-
 include/linux/spinlock.h                     | 38 ++---
 include/linux/srcu.h                         |  8 +--
 include/sound/pcm.h                          |  6 +--
 kernel/sched/core.c                          |  2 +-
 kernel/sched/sched.h                         | 16 +++---
 kernel/sched/syscalls.c                      |  4 +-
 lib/locking-selftest.c                       | 12 ++---
 sound/core/control_led.c                     |  2 +-
 45 files changed, 141 insertions(+), 141 deletions(-)

diff --git a/crypto/asymmetric_keys/x509_parser.h b/crypto/asymmetric_keys/x509_parser.h
index 0688c222806b..ac0560ccfcdb 100644
--- a/crypto/asymmetric_keys/x509_parser.h
+++ b/crypto/asymmetric_keys/x509_parser.h
@@ -45,8 +45,8 @@ struct x509_certificate {
  * x509_cert_parser.c
  */
 extern void x509_free_certificate(struct x509_certificate *cert);
-DEFINE_FREE(x509_free_certificate, struct x509_certificate *,
-	    if (!IS_ERR(_T)) x509_free_certificate(_T))
+DECLARE_FREE(x509_free_certificate, struct x509_certificate *,
+	     if (!IS_ERR(_T)) x509_free_certificate(_T))
 extern struct x509_certificate *x509_cert_parse(const void *data, size_t datalen);
 extern int x509_decode_time(time64_t *_t,  size_t hdrlen,
 			    unsigned char tag,
diff --git a/drivers/cxl/acpi.c b/drivers/cxl/acpi.c
index 82b78e331d8e..2bb3cef82035 100644
--- a/drivers/cxl/acpi.c
+++ b/drivers/cxl/acpi.c
@@ -337,9 +337,9 @@ static int add_or_reset_cxl_resource(struct resource *parent, struct resource *r
 	return rc;
 }
 
-DEFINE_FREE(put_cxlrd, struct cxl_root_decoder *,
-	    if (!IS_ERR_OR_NULL(_T)) put_device(&_T->cxlsd.cxld.dev))
-DEFINE_FREE(del_cxl_resource, struct resource *, if (_T) del_cxl_resource(_T))
+DECLARE_FREE(put_cxlrd, struct cxl_root_decoder *,
+	     if (!IS_ERR_OR_NULL(_T)) put_device(&_T->cxlsd.cxld.dev))
+DECLARE_FREE(del_cxl_resource, struct resource *, if (_T) del_cxl_resource(_T))
 
 static int __cxl_parse_cfmws(struct acpi_cedt_cfmws *cfmws,
 			     struct cxl_cfmws_context *ctx)
 {
diff --git a/drivers/cxl/core/cdat.c b/drivers/cxl/core/cdat.c
index bb83867d9fec..689143566642 100644
--- a/drivers/cxl/core/cdat.c
+++ b/drivers/cxl/core/cdat.c
@@ -385,7 +385,7 @@ static void discard_dsmas(struct xarray *xa)
 	}
 	xa_destroy(xa);
 }
-DEFINE_FREE(dsmas, struct xarray *, if (_T) discard_dsmas(_T))
+DECLARE
[PATCH v1 2/2] cleanup.h: Introduce DEFINE_INACTIVE_GUARD and activate_guard
To cover scenarios where the scope of the guard differs from the scope
of its activation, introduce DEFINE_INACTIVE_GUARD() and
activate_guard().

Here is an example use for a conditionally activated guard variable:

void func(bool a)
{
	DEFINE_INACTIVE_GUARD(preempt_notrace, myguard);

	[...]
	if (a) {
		might_sleep();
		activate_guard(preempt_notrace, myguard)();
	}
	[ protected code ]
}

Signed-off-by: Mathieu Desnoyers
Cc: Peter Zijlstra (Intel)
Cc: Ingo Molnar
Cc: Linus Torvalds
Cc: Kees Cook
Cc: Greg KH
Cc: Sean Christopherson
---
 include/linux/cleanup.h | 23 ++-
 1 file changed, 22 insertions(+), 1 deletion(-)

diff --git a/include/linux/cleanup.h b/include/linux/cleanup.h
index 1247e67a6161..5f031cce97ca 100644
--- a/include/linux/cleanup.h
+++ b/include/linux/cleanup.h
@@ -149,12 +149,20 @@ static inline class_##_name##_t class_##_name##ext##_constructor(_init_args) \
  *	similar to scoped_guard(), except it does fail when the lock
  *	acquire fails.
  *
+ * DEFINE_INACTIVE_GUARD(name, var):
+ *	define an inactive guard variable in a given scope, initialized to NULL.
+ *
+ * activate_guard(name, var)(args...):
+ *	activate a guard variable with its constructor, if it is not already
+ *	activated.
  */
 
 #define DECLARE_GUARD(_name, _type, _lock, _unlock) \
 	DECLARE_CLASS(_name, _type, if (_T) { _unlock; }, ({ _lock; _T; }), _type _T); \
 	static inline void * class_##_name##_lock_ptr(class_##_name##_t *_T) \
-	{ return *_T; }
+	{ return *_T; } \
+	static inline class_##_name##_t class_##_name##_null(void) \
+	{ return NULL; }
 
 #define DECLARE_GUARD_COND(_name, _ext, _condlock) \
 	EXTEND_CLASS(_name, _ext, \
@@ -178,6 +186,14 @@ static inline class_##_name##_t class_##_name##ext##_constructor(_init_args) \
 	if (!__guard_ptr(_name)(&scope)) _fail; \
 	else
 
+#define DEFINE_INACTIVE_GUARD(_name, _var) \
+	class_##_name##_t _var __cleanup(class_##_name##_destructor) = \
+		class_##_name##_null()
+
+#define activate_guard(_name, _var) \
+	if (!class_##_name##_lock_ptr(&(_var))) \
+		_var = class_##_name##_constructor
+
 /*
  * Additional helper macros for generating lock guards with types, either for
  * locks that don't have a native type (eg. RCU, preempt) or those that need a
@@ -212,6 +228,11 @@ static inline void class_##_name##_destructor(class_##_name##_t *_T) \
 static inline void *class_##_name##_lock_ptr(class_##_name##_t *_T)	\
 {									\
 	return _T->lock;						\
+}									\
+static inline class_##_name##_t class_##_name##_null(void)		\
+{									\
+	class_##_name##_t _t = { .lock = NULL };			\
+	return _t;							\
 }
 
-- 
2.39.2
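[Editor's note: the inactive-guard idiom introduced above can be modeled in
standalone userspace C with the same __attribute__((cleanup)) mechanism the
kernel's cleanup.h builds on. This is a deliberately simplified sketch — the
single-argument DEFINE_INACTIVE_GUARD()/activate_guard() macros, guard_t, and
the preempt_disabled counter are hypothetical stand-ins, not the kernel API,
which takes a class name and appends constructor-call parentheses.]

```c
#include <assert.h>
#include <stddef.h>

static int preempt_disabled;	/* stand-in for the preempt count */

static void guard_lock(void)   { preempt_disabled++; }
static void guard_unlock(void) { preempt_disabled--; }

/* Guard value: NULL means "never activated", so the destructor is a no-op. */
typedef void *guard_t;

static void guard_destructor(guard_t *g)
{
	if (*g)
		guard_unlock();
}

/* Declare an inactive guard: destructor armed, constructor not yet run. */
#define DEFINE_INACTIVE_GUARD(var) \
	guard_t var __attribute__((cleanup(guard_destructor))) = NULL

/* Activate at most once, possibly from a narrower conditional scope. */
#define activate_guard(var) \
	do { if (!(var)) { guard_lock(); (var) = &(var); } } while (0)

static void func(int cond)
{
	DEFINE_INACTIVE_GUARD(myguard);

	if (cond) {
		activate_guard(myguard);
		/* protected section: "preemption" is disabled here */
		assert(preempt_disabled == 1);
	}
	/* guard released automatically at scope exit, only if activated */
}
```

The key property mirrored here is that the destructor runs at the scope of
the declaration while the constructor may run later, conditionally, in an
inner scope — exactly the gap DEFINE_INACTIVE_GUARD() is meant to cover.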
[PATCH v1 0/2] cleanup.h: Introduce DEFINE_INACTIVE_GUARD()/activate_guard()
In preparation to introduce a "DEFINE_INACTIVE_GUARD()" to actually
define a guard variable, rename all the guard "DEFINE_" prefix to
"DECLARE_".

To cover scenarios where the scope of the guard differs from the scope
of its activation, introduce DEFINE_INACTIVE_GUARD() and
activate_guard().

The faultable tracepoints depend on this.

Based on v6.11-rc5.

Thanks,

Mathieu

Cc: Linus Torvalds
Cc: Kees Cook
Cc: Greg KH
Cc: Sean Christopherson
Cc: Steven Rostedt
Cc: Masami Hiramatsu
Cc: Alexei Starovoitov
Cc: Yonghong Song
Cc: Paul E. McKenney
Cc: Arnaldo Carvalho de Melo
Cc: Mark Rutland
Cc: Alexander Shishkin
Cc: Namhyung Kim
Cc: b...@vger.kernel.org
Cc: Joel Fernandes
Cc: linux-trace-ker...@vger.kernel.org

Mathieu Desnoyers (2):
  cleanup.h guard: Rename DEFINE_ prefix to DECLARE_
  cleanup.h: Introduce DEFINE_INACTIVE_GUARD and activate_guard

 crypto/asymmetric_keys/x509_parser.h         |  4 +-
 drivers/cxl/acpi.c                           |  6 +-
 drivers/cxl/core/cdat.c                      |  2 +-
 drivers/cxl/cxl.h                            |  2 +-
 drivers/gpio/gpiolib.h                       |  2 +-
 drivers/net/wireless/intel/iwlwifi/mvm/mvm.h |  2 +-
 drivers/platform/x86/intel/pmc/core_ssram.c  |  2 +-
 fs/fuse/virtio_fs.c                          |  2 +-
 fs/namespace.c                               |  2 +-
 fs/pstore/inode.c                            |  4 +-
 include/linux/bitmap.h                       |  2 +-
 include/linux/cleanup.h                      | 79 +---
 include/linux/cpuhplock.h                    |  2 +-
 include/linux/cpumask.h                      |  2 +-
 include/linux/device.h                       |  6 +-
 include/linux/file.h                         |  6 +-
 include/linux/firmware.h                     |  2 +-
 include/linux/firmware/qcom/qcom_tzmem.h     |  2 +-
 include/linux/gpio/driver.h                  |  4 +-
 include/linux/iio/iio.h                      |  4 +-
 include/linux/interrupt.h                    |  4 +-
 include/linux/irqflags.h                     |  4 +-
 include/linux/local_lock.h                   | 22 +++---
 include/linux/mutex.h                        |  6 +-
 include/linux/netdevice.h                    |  2 +-
 include/linux/nsproxy.h                      |  2 +-
 include/linux/of.h                           |  2 +-
 include/linux/path.h                         |  2 +-
 include/linux/pci.h                          |  4 +-
 include/linux/percpu.h                       |  2 +-
 include/linux/preempt.h                      |  6 +-
 include/linux/property.h                     |  2 +-
 include/linux/rcupdate.h                     |  2 +-
 include/linux/rtnetlink.h                    |  2 +-
 include/linux/rwsem.h                        | 10 +--
 include/linux/sched/task.h                   |  4 +-
 include/linux/slab.h                         |  4 +-
 include/linux/spinlock.h                     | 38 +-
 include/linux/srcu.h                         |  8 +-
 include/sound/pcm.h                          |  6 +-
 kernel/sched/core.c                          |  2 +-
 kernel/sched/sched.h                         | 16 ++--
 kernel/sched/syscalls.c                      |  4 +-
 lib/locking-selftest.c                       | 12 +--
 sound/core/control_led.c                     |  2 +-
 45 files changed, 163 insertions(+), 142 deletions(-)

-- 
2.39.2
Re: [PATCH v5 3/8] cleanup.h: Introduce DEFINE_INACTIVE_GUARD and activate_guard
On 2024-08-20 01:00, Steven Rostedt wrote:
On Thu, 27 Jun 2024 11:23:35 -0400 Mathieu Desnoyers wrote:

To cover scenarios where the scope of the guard differs from the scope of its activation, introduce DEFINE_INACTIVE_GUARD() and activate_guard(). Here is an example use of a conditionally activated guard variable:

void func(bool a)
{
	DEFINE_INACTIVE_GUARD(preempt_notrace, myguard);

	[...]
	if (a) {
		might_sleep();
		activate_guard(preempt_notrace, myguard)();
	}
	[ protected code ]
}

Hi Mathieu,

The three cleanup patches fail to apply (I believe one has already been fixed by Ingo too). Could you make the cleanup patches a separate series that is likely to get in, especially since it's more of a moving target?

Then it would make sense to split this into two series: one for the guard stuff, and one for tracing, which depends on the first one. I'll do that.

Mathieu

Thanks,

-- Steve

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
Re: [PATCH] tracing: Update MAINTAINERS file
On 2024-07-15 15:13, Steven Rostedt wrote:
On Mon, 15 Jul 2024 15:10:17 -0400 Mathieu Desnoyers wrote:
On 2024-07-15 14:47, Steven Rostedt wrote:
From: "Steven Rostedt (Google)"

Gone but never forgotten.

[ Also moved Daniel's name to be consistent with the alphabetical order ]

Hi Steven,

You appear to have missed this entry from SCHEDULER:

R: Daniel Bristot de Oliveira (SCHED_DEADLINE)

Should it be done in a separate commit or folded into this one?

That has already been done:

https://lore.kernel.org/all/20240708075752.gf11...@noisy.programming.kicks-ass.net/

This patch is actually on top of that one.

Sorry I missed that.

Reviewed-by: Mathieu Desnoyers

-- Steve

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
Re: [PATCH] tracing: Update MAINTAINERS file
On 2024-07-15 14:47, Steven Rostedt wrote:
From: "Steven Rostedt (Google)"

Gone but never forgotten.

[ Also moved Daniel's name to be consistent with the alphabetical order ]

Hi Steven,

You appear to have missed this entry from SCHEDULER:

R: Daniel Bristot de Oliveira (SCHED_DEADLINE)

Should it be done in a separate commit or folded into this one?

Thanks,
Mathieu

Signed-off-by: Steven Rostedt (Google)
---
 CREDITS     | 10 +++---
 MAINTAINERS |  3 ---
 2 files changed, 7 insertions(+), 6 deletions(-)

diff --git a/CREDITS b/CREDITS
index 88c4c08cb613..d7798910e99f 100644
--- a/CREDITS
+++ b/CREDITS
@@ -271,9 +271,6 @@
 D: Driver for WaveFront soundcards (Turtle Beach Maui, Tropez, Tropez+)
 D: Various bugfixes and changes to sound drivers
 S: USA
 
-N: Daniel Bristot de Oliveira
-D: Scheduler contributions, notably: SCHED_DEADLINE
-
 N: Carlos Henrique Bauer
 E: chba...@acm.org
 E: ba...@atlas.unisinos.br
@@ -534,6 +531,13 @@
 S: Kopmansg 2
 S: 411 13 Goteborg
 S: Sweden
 
+N: Daniel Bristot de Oliveira
+D: Scheduler contributions, notably: SCHED_DEADLINE
+D: Real-time Linux Analysis
+D: Runtime Verification
+D: OS Noise and Latency Tracers
+S: Brazil and Italy
+
 N: Paul Bristow
 E: p...@paulbristow.net
 W: https://paulbristow.net/linux/idefloppy.html
diff --git a/MAINTAINERS b/MAINTAINERS
index 2e1b8bbacb5e..0b7e4cd4ffd7 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -18895,7 +18895,6 @@
 F:	include/uapi/linux/rtc.h
 F:	tools/testing/selftests/rtc/
 
 Real-time Linux Analysis (RTLA) tools
-M:	Daniel Bristot de Oliveira
 M:	Steven Rostedt
 L:	linux-trace-ker...@vger.kernel.org
 S:	Maintained
@@ -19529,7 +19528,6 @@
 S:	Maintained
 F:	drivers/infiniband/ulp/rtrs/
 
 RUNTIME VERIFICATION (RV)
-M:	Daniel Bristot de Oliveira
 M:	Steven Rostedt
 L:	linux-trace-ker...@vger.kernel.org
 S:	Maintained
@@ -22806,7 +22804,6 @@
 F:	kernel/trace/trace_mmiotrace.c
 
 TRACING OS NOISE / LATENCY TRACERS
 M:	Steven Rostedt
-M:	Daniel Bristot de Oliveira
 S:	Maintained
 F:	Documentation/trace/hwlat_detector.rst
 F:	Documentation/trace/osnoise-tracer.rst

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
Re: [PATCH v23 3/5] tracing: Allow user-space mapping of the ring-buffer
On 2024-07-02 12:51, Steven Rostedt wrote:
On Tue, 2 Jul 2024 11:32:53 -0400 Mathieu Desnoyers wrote:

If we use '*' for user events already, perhaps we'd want to consider using the same range for the ring buffer ioctls? Arguably one is about instrumentation and the other is about ring buffer interaction (data transport), but both are related to tracing.

Yeah, but I'd still rather keep them separate.

No objection, I'm OK either way.

Thanks,
Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
Re: [PATCH v23 3/5] tracing: Allow user-space mapping of the ring-buffer
On 2024-07-02 11:18, Steven Rostedt wrote:
On Tue, 2 Jul 2024 10:36:03 -0400 Mathieu Desnoyers wrote:

I can send a patch this week to update it. Or feel free to send a patch yourself.

You need to reserve an unused ioctl Code and Seq# range within:
Documentation/userspace-api/ioctl/ioctl-number.rst

Ug, it's been so long, I completely forgot about that file. Thanks for catching this.

Otherwise this duplicate will confuse all system call instrumentation tooling.

Agreed, what if we did this then:

-- Steve

diff --git a/Documentation/userspace-api/ioctl/ioctl-number.rst b/Documentation/userspace-api/ioctl/ioctl-number.rst
index a141e8e65c5d..9a97030c6c8d 100644
--- a/Documentation/userspace-api/ioctl/ioctl-number.rst
+++ b/Documentation/userspace-api/ioctl/ioctl-number.rst
@@ -186,6 +186,7 @@ Code  Seq#    Include File                  Comments
 'Q'   all    linux/soundcard.h
 'R'   00-1F  linux/random.h                conflict!
 'R'   01     linux/rfkill.h                conflict!
+'R'   20-2F  linux/trace_mmap.h
 'R'   C0-DF  net/bluetooth/rfcomm.h
 'R'   E0     uapi/linux/fsl_mc.h
 'S'   all    linux/cdrom.h                 conflict!
diff --git a/include/uapi/linux/trace_mmap.h b/include/uapi/linux/trace_mmap.h
index bd1066754220..c102ef35d11e 100644
--- a/include/uapi/linux/trace_mmap.h
+++ b/include/uapi/linux/trace_mmap.h
@@ -43,6 +43,6 @@ struct trace_buffer_meta {
 	__u64	Reserved2;
 };
 
-#define TRACE_MMAP_IOCTL_GET_READER	_IO('T', 0x1)
+#define TRACE_MMAP_IOCTL_GET_READER	_IO('R', 0x20)
 
 #endif /* _TRACE_MMAP_H_ */

Note that user events also has this issue: the ioctl is not reserved in the ioctl-number.rst list. See include/uapi/linux/user_events.h:

#define DIAG_IOC_MAGIC '*'

/* Request to register a user_event */
#define DIAG_IOCSREG _IOWR(DIAG_IOC_MAGIC, 0, struct user_reg *)

/* Request to delete a user_event */
#define DIAG_IOCSDEL _IOW(DIAG_IOC_MAGIC, 1, char *)

/* Requests to unregister a user_event */
#define DIAG_IOCSUNREG _IOW(DIAG_IOC_MAGIC, 2, struct user_unreg *)

Where '*' maps to Code 0x2A. Looking at the list, I don't see any conflicts there, but we should definitely add it.

If we use '*' for user events already, perhaps we'd want to consider using the same range for the ring buffer ioctls? Arguably one is about instrumentation and the other is about ring buffer interaction (data transport), but both are related to tracing.

Thanks,
Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
Re: [PATCH v23 3/5] tracing: Allow user-space mapping of the ring-buffer
On 2024-06-30 08:40, Steven Rostedt wrote:
On Sun, 30 Jun 2024 13:53:23 +0300, "Dmitry V. Levin" wrote:
On Fri, May 10, 2024 at 03:04:32PM +0100, Vincent Donnefort wrote:
[...]

diff --git a/include/uapi/linux/trace_mmap.h b/include/uapi/linux/trace_mmap.h
index b682e9925539..bd1066754220 100644
--- a/include/uapi/linux/trace_mmap.h
+++ b/include/uapi/linux/trace_mmap.h
@@ -43,4 +43,6 @@ struct trace_buffer_meta {
 	__u64	Reserved2;
 };
 
+#define TRACE_MMAP_IOCTL_GET_READER	_IO('T', 0x1)
+

I'm sorry, but among all the numbers this one was probably the least fortunate choice, because it collides with TCGETS on most architectures.

Hmm, that is unfortunate.

For example, this is how strace output would look when TRACE_MMAP_IOCTL_GET_READER support is added:

$ strace -e ioctl stty
ioctl(0, TCGETS or TRACE_MMAP_IOCTL_GET_READER, {c_iflag=ICRNL|IXON, c_oflag=NL0|CR0|TAB0|BS0|VT0|FF0|OPOST|ONLCR, c_cflag=B38400|CS8|CREAD, c_lflag=ISIG|ICANON|ECHO|ECHOE|ECHOK|IEXTEN|ECHOCTL|ECHOKE, ...}) = 0

Even though ioctl numbers are inherently not unique, TCGETS is a very traditional one, so it would be great if you could change TRACE_MMAP_IOCTL_GET_READER to avoid this collision. Given that _IO('T', 0x1) is _IOC(_IOC_NONE, 'T', 0x1, 0), something like _IOC(_IOC_NONE, 'T', 0x1, 0x1) should be OK.

Well, it may not be too late to update this, as it hasn't been officially released in 6.10 yet. It's still only in the -rc, and the library doesn't rely on this yet (I've been holding off until 6.10 was officially released before releasing the library that uses it). I can send a patch this week to update it. Or feel free to send a patch yourself.

You need to reserve an unused ioctl Code and Seq# range within:
Documentation/userspace-api/ioctl/ioctl-number.rst

Otherwise this duplicate will confuse all system call instrumentation tooling.

Thanks,
Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
[PATCH v5 8/8] tracing: Convert sys_enter/exit to faultable tracepoints
Convert the definition of the system call enter/exit tracepoints to faultable tracepoints now that all upstream tracers handle it. This allows tracers to fault-in userspace system call arguments such as path strings within their probe callbacks. Link: https://lore.kernel.org/lkml/20231002202531.3160-1-mathieu.desnoy...@efficios.com/ Co-developed-by: Michael Jeanson Signed-off-by: Mathieu Desnoyers Signed-off-by: Michael Jeanson Cc: Steven Rostedt Cc: Masami Hiramatsu Cc: Peter Zijlstra Cc: Alexei Starovoitov Cc: Yonghong Song Cc: Paul E. McKenney Cc: Ingo Molnar Cc: Arnaldo Carvalho de Melo Cc: Mark Rutland Cc: Alexander Shishkin Cc: Jiri Olsa Cc: Namhyung Kim Cc: b...@vger.kernel.org Cc: Joel Fernandes --- Since v4: - Use 'guard(preempt_notrace)'. - Add brackets to multiline 'if' statements. --- include/trace/events/syscalls.h | 4 +-- kernel/trace/trace_syscalls.c | 52 - 2 files changed, 46 insertions(+), 10 deletions(-) diff --git a/include/trace/events/syscalls.h b/include/trace/events/syscalls.h index b6e0cbc2c71f..dc30e3004818 100644 --- a/include/trace/events/syscalls.h +++ b/include/trace/events/syscalls.h @@ -15,7 +15,7 @@ #ifdef CONFIG_HAVE_SYSCALL_TRACEPOINTS -TRACE_EVENT_FN(sys_enter, +TRACE_EVENT_FN_MAY_FAULT(sys_enter, TP_PROTO(struct pt_regs *regs, long id), @@ -41,7 +41,7 @@ TRACE_EVENT_FN(sys_enter, TRACE_EVENT_FLAGS(sys_enter, TRACE_EVENT_FL_CAP_ANY) -TRACE_EVENT_FN(sys_exit, +TRACE_EVENT_FN_MAY_FAULT(sys_exit, TP_PROTO(struct pt_regs *regs, long ret), diff --git a/kernel/trace/trace_syscalls.c b/kernel/trace/trace_syscalls.c index 9c581d6da843..314666d663b6 100644 --- a/kernel/trace/trace_syscalls.c +++ b/kernel/trace/trace_syscalls.c @@ -299,6 +299,12 @@ static void ftrace_syscall_enter(void *data, struct pt_regs *regs, long id) int syscall_nr; int size; + /* +* Probe called with preemption enabled (may_fault), but ring buffer and +* per-cpu data require preemption to be disabled. 
+*/ + guard(preempt_notrace)(); + syscall_nr = trace_get_syscall_nr(current, regs); if (syscall_nr < 0 || syscall_nr >= NR_syscalls) return; @@ -338,6 +344,12 @@ static void ftrace_syscall_exit(void *data, struct pt_regs *regs, long ret) struct trace_event_buffer fbuffer; int syscall_nr; + /* +* Probe called with preemption enabled (may_fault), but ring buffer and +* per-cpu data require preemption to be disabled. +*/ + guard(preempt_notrace)(); + syscall_nr = trace_get_syscall_nr(current, regs); if (syscall_nr < 0 || syscall_nr >= NR_syscalls) return; @@ -376,8 +388,11 @@ static int reg_event_syscall_enter(struct trace_event_file *file, if (WARN_ON_ONCE(num < 0 || num >= NR_syscalls)) return -ENOSYS; mutex_lock(&syscall_trace_lock); - if (!tr->sys_refcount_enter) - ret = register_trace_sys_enter(ftrace_syscall_enter, tr); + if (!tr->sys_refcount_enter) { + ret = register_trace_prio_flags_sys_enter(ftrace_syscall_enter, tr, + TRACEPOINT_DEFAULT_PRIO, + TRACEPOINT_MAY_FAULT); + } if (!ret) { rcu_assign_pointer(tr->enter_syscall_files[num], file); tr->sys_refcount_enter++; @@ -414,8 +429,11 @@ static int reg_event_syscall_exit(struct trace_event_file *file, if (WARN_ON_ONCE(num < 0 || num >= NR_syscalls)) return -ENOSYS; mutex_lock(&syscall_trace_lock); - if (!tr->sys_refcount_exit) - ret = register_trace_sys_exit(ftrace_syscall_exit, tr); + if (!tr->sys_refcount_exit) { + ret = register_trace_prio_flags_sys_exit(ftrace_syscall_exit, tr, + TRACEPOINT_DEFAULT_PRIO, +TRACEPOINT_MAY_FAULT); + } if (!ret) { rcu_assign_pointer(tr->exit_syscall_files[num], file); tr->sys_refcount_exit++; @@ -582,6 +600,12 @@ static void perf_syscall_enter(void *ignore, struct pt_regs *regs, long id) int rctx; int size; + /* +* Probe called with preemption enabled (may_fault), but ring buffer and +* per-cpu data require preemption to be disabled. 
+*/ + guard(preempt_notrace)(); + syscall_nr = trace_get_syscall_nr(current, regs); if (syscall_nr < 0 || syscall_nr >= NR_syscalls) return; @@ -630,8 +654,11 @@ static int perf_sysenter_enable(struct trace_event_call *call) num = ((struct syscall_metadata *)call->data)->syscall_nr; mu
[PATCH v5 7/8] tracing/perf: Add support for faultable tracepoints
In preparation for converting system call enter/exit instrumentation into faultable tracepoints, make sure that perf can handle registering to such tracepoints by explicitly disabling preemption within the perf tracepoint probes to respect the current expectations within perf ring buffer code. This change does not yet allow perf to take page faults per se within its probe, but allows its existing probes to connect to faultable tracepoints. Link: https://lore.kernel.org/lkml/20231002202531.3160-1-mathieu.desnoy...@efficios.com/ Co-developed-by: Michael Jeanson Signed-off-by: Mathieu Desnoyers Signed-off-by: Michael Jeanson Cc: Steven Rostedt Cc: Masami Hiramatsu Cc: Peter Zijlstra Cc: Alexei Starovoitov Cc: Yonghong Song Cc: Paul E. McKenney Cc: Ingo Molnar Cc: Arnaldo Carvalho de Melo Cc: Mark Rutland Cc: Alexander Shishkin Cc: Jiri Olsa Cc: Namhyung Kim Cc: b...@vger.kernel.org Cc: Joel Fernandes --- Changes since v4: - Use DEFINE_INACTIVE_GUARD. --- include/trace/perf.h | 22 -- 1 file changed, 20 insertions(+), 2 deletions(-) diff --git a/include/trace/perf.h b/include/trace/perf.h index 2c11181c82e0..161e1655b953 100644 --- a/include/trace/perf.h +++ b/include/trace/perf.h @@ -12,8 +12,8 @@ #undef __perf_task #define __perf_task(t) (__task = (t)) -#undef DECLARE_EVENT_CLASS -#define DECLARE_EVENT_CLASS(call, proto, args, tstruct, assign, print) \ +#undef _DECLARE_EVENT_CLASS +#define _DECLARE_EVENT_CLASS(call, proto, args, tstruct, assign, print, tp_flags) \ static notrace void\ perf_trace_##call(void *__data, proto) \ { \ @@ -28,6 +28,13 @@ perf_trace_##call(void *__data, proto) \ int __data_size;\ int rctx; \ \ + DEFINE_INACTIVE_GUARD(preempt_notrace, trace_event_guard); \ + \ + if ((tp_flags) & TRACEPOINT_MAY_FAULT) {\ + might_fault(); \ + activate_guard(preempt_notrace, trace_event_guard)(); \ + } \ + \ __data_size = trace_event_get_offsets_##call(&__data_offsets, args); \ \ head = this_cpu_ptr(event_call->perf_events); \ @@ -55,6 +62,17 @@ 
perf_trace_##call(void *__data, proto) \ head, __task);\ } +#undef DECLARE_EVENT_CLASS +#define DECLARE_EVENT_CLASS(call, proto, args, tstruct, assign, print) \ + _DECLARE_EVENT_CLASS(call, PARAMS(proto), PARAMS(args), \ +PARAMS(tstruct), PARAMS(assign), PARAMS(print), 0) + +#undef DECLARE_EVENT_CLASS_MAY_FAULT +#define DECLARE_EVENT_CLASS_MAY_FAULT(call, proto, args, tstruct, assign, print) \ + _DECLARE_EVENT_CLASS(call, PARAMS(proto), PARAMS(args), \ +PARAMS(tstruct), PARAMS(assign), PARAMS(print), \ +TRACEPOINT_MAY_FAULT) + /* * This part is compiled out, it is only here as a build time check * to make sure that if the tracepoint handling changes, the -- 2.39.2
[PATCH v5 5/8] tracing/ftrace: Add support for faultable tracepoints
In preparation for converting system call enter/exit instrumentation into faultable tracepoints, make sure that ftrace can handle registering to such tracepoints by explicitly disabling preemption within the ftrace tracepoint probes to respect the current expectations within ftrace ring buffer code. This change does not yet allow ftrace to take page faults per se within its probe, but allows its existing probes to connect to faultable tracepoints. Link: https://lore.kernel.org/lkml/20231002202531.3160-1-mathieu.desnoy...@efficios.com/ Co-developed-by: Michael Jeanson Signed-off-by: Michael Jeanson Signed-off-by: Mathieu Desnoyers Cc: Steven Rostedt Cc: Masami Hiramatsu Cc: Peter Zijlstra Cc: Alexei Starovoitov Cc: Yonghong Song Cc: Paul E. McKenney Cc: Ingo Molnar Cc: Arnaldo Carvalho de Melo Cc: Mark Rutland Cc: Alexander Shishkin Cc: Jiri Olsa Cc: Namhyung Kim Cc: b...@vger.kernel.org Cc: Joel Fernandes --- Changes since v4: - Use DEFINE_INACTIVE_GUARD. - Add brackets to multiline 'if' statements. 
--- include/trace/trace_events.h | 64 ++-- kernel/trace/trace_events.c | 28 2 files changed, 83 insertions(+), 9 deletions(-) diff --git a/include/trace/trace_events.h b/include/trace/trace_events.h index df590eea8ae4..c887f7b6fbe9 100644 --- a/include/trace/trace_events.h +++ b/include/trace/trace_events.h @@ -45,6 +45,16 @@ PARAMS(print)); \ DEFINE_EVENT(name, name, PARAMS(proto), PARAMS(args)); +#undef TRACE_EVENT_MAY_FAULT +#define TRACE_EVENT_MAY_FAULT(name, proto, args, tstruct, assign, print) \ + DECLARE_EVENT_CLASS_MAY_FAULT(name,\ +PARAMS(proto),\ +PARAMS(args), \ +PARAMS(tstruct), \ +PARAMS(assign), \ +PARAMS(print)); \ + DEFINE_EVENT(name, name, PARAMS(proto), PARAMS(args)); + #include "stages/stage1_struct_define.h" #undef DECLARE_EVENT_CLASS @@ -57,6 +67,11 @@ \ static struct trace_event_class event_class_##name; +#undef DECLARE_EVENT_CLASS_MAY_FAULT +#define DECLARE_EVENT_CLASS_MAY_FAULT(name, proto, args, tstruct, assign, print) \ + DECLARE_EVENT_CLASS(name, PARAMS(proto), PARAMS(args), \ + PARAMS(tstruct), PARAMS(assign), PARAMS(print)) + #undef DEFINE_EVENT #define DEFINE_EVENT(template, name, proto, args) \ static struct trace_event_call __used \ @@ -80,7 +95,7 @@ #undef TRACE_EVENT_FN_MAY_FAULT #define TRACE_EVENT_FN_MAY_FAULT(name, proto, args, tstruct, \ assign, print, reg, unreg) \ - TRACE_EVENT(name, PARAMS(proto), PARAMS(args), \ + TRACE_EVENT_MAY_FAULT(name, PARAMS(proto), PARAMS(args),\ PARAMS(tstruct), PARAMS(assign), PARAMS(print)) \ #undef TRACE_EVENT_FN_COND @@ -123,6 +138,11 @@ tstruct;\ }; +#undef DECLARE_EVENT_CLASS_MAY_FAULT +#define DECLARE_EVENT_CLASS_MAY_FAULT(call, proto, args, tstruct, assign, print) \ + DECLARE_EVENT_CLASS(call, PARAMS(proto), PARAMS(args), \ + PARAMS(tstruct), PARAMS(assign), PARAMS(print)) + #undef DEFINE_EVENT #define DEFINE_EVENT(template, name, proto, args) @@ -214,6 +234,11 @@ static struct trace_event_functions trace_event_type_funcs_##call = { \ .trace = trace_raw_output_##call, \ }; +#undef 
DECLARE_EVENT_CLASS_MAY_FAULT +#define DECLARE_EVENT_CLASS_MAY_FAULT(call, proto, args, tstruct, assign, print) \ + DECLARE_EVENT_CLASS(call, PARAMS(proto), PARAMS(args), \ + PARAMS(tstruct), PARAMS(assign), PARAMS(print)) + #undef DEFINE_EVENT_PRINT #define DEFINE_EVENT_PRINT(template, call, proto, args, print) \ static notrace enum print_line_t \ @@ -250,6 +275,11 @@ static struct trace_event_fields trace_event_fields_##call[] = { \ tstruct \ {} }; +#undef DECLARE_EVENT_CLASS_MAY_FAULT +#define DECLARE_EVENT_CLASS_MAY_FAULT(call, proto, args, tstruct, assign, print) \ + DECLARE_EVENT_CLASS(call, PARAMS(proto), PARAMS(args), \ + PARAMS(tstruct), PARAMS(assign), PARAMS(print)) + #undef DEFINE_EVENT_PRINT #define DEFINE_EVENT_PRINT(template, name, proto, args, print) @@ -271,6 +301,11 @@ static inline notrace int trace_event_get_offsets_##call( \
[PATCH v5 6/8] tracing/bpf-trace: Add support for faultable tracepoints
In preparation for converting system call enter/exit instrumentation into faultable tracepoints, make sure that bpf can handle registering to such tracepoints by explicitly disabling preemption within the bpf tracepoint probes to respect the current expectations within bpf tracing code. This change does not yet allow bpf to take page faults per se within its probe, but allows its existing probes to connect to faultable tracepoints. Link: https://lore.kernel.org/lkml/20231002202531.3160-1-mathieu.desnoy...@efficios.com/ Co-developed-by: Michael Jeanson Signed-off-by: Mathieu Desnoyers Signed-off-by: Michael Jeanson Cc: Steven Rostedt Cc: Masami Hiramatsu Cc: Peter Zijlstra Cc: Alexei Starovoitov Cc: Yonghong Song Cc: Paul E. McKenney Cc: Ingo Molnar Cc: Arnaldo Carvalho de Melo Cc: Mark Rutland Cc: Alexander Shishkin Cc: Jiri Olsa Cc: Namhyung Kim Cc: b...@vger.kernel.org Cc: Joel Fernandes --- Changes since v4: - Use DEFINE_INACTIVE_GUARD. - Add brackets to multiline 'if' statements. --- include/trace/bpf_probe.h | 20 kernel/trace/bpf_trace.c | 12 +--- 2 files changed, 25 insertions(+), 7 deletions(-) diff --git a/include/trace/bpf_probe.h b/include/trace/bpf_probe.h index e609cd7da47e..96c1269dd88c 100644 --- a/include/trace/bpf_probe.h +++ b/include/trace/bpf_probe.h @@ -42,17 +42,29 @@ /* tracepoints with more than 12 arguments will hit build error */ #define CAST_TO_U64(...) 
CONCATENATE(__CAST, COUNT_ARGS(__VA_ARGS__))(__VA_ARGS__) -#define __BPF_DECLARE_TRACE(call, proto, args) \ +#define __BPF_DECLARE_TRACE(call, proto, args, tp_flags) \ static notrace void\ __bpf_trace_##call(void *__data, proto) \ { \ struct bpf_prog *prog = __data; \ + \ + DEFINE_INACTIVE_GUARD(preempt_notrace, bpf_trace_guard);\ + \ + if ((tp_flags) & TRACEPOINT_MAY_FAULT) {\ + might_fault(); \ + activate_guard(preempt_notrace, bpf_trace_guard)(); \ + } \ + \ CONCATENATE(bpf_trace_run, COUNT_ARGS(args))(prog, CAST_TO_U64(args)); \ } #undef DECLARE_EVENT_CLASS #define DECLARE_EVENT_CLASS(call, proto, args, tstruct, assign, print) \ - __BPF_DECLARE_TRACE(call, PARAMS(proto), PARAMS(args)) + __BPF_DECLARE_TRACE(call, PARAMS(proto), PARAMS(args), 0) + +#undef DECLARE_EVENT_CLASS_MAY_FAULT +#define DECLARE_EVENT_CLASS_MAY_FAULT(call, proto, args, tstruct, assign, print) \ + __BPF_DECLARE_TRACE(call, PARAMS(proto), PARAMS(args), TRACEPOINT_MAY_FAULT) /* * This part is compiled out, it is only here as a build time check @@ -106,13 +118,13 @@ static inline void bpf_test_buffer_##call(void) \ #undef DECLARE_TRACE #define DECLARE_TRACE(call, proto, args) \ - __BPF_DECLARE_TRACE(call, PARAMS(proto), PARAMS(args)) \ + __BPF_DECLARE_TRACE(call, PARAMS(proto), PARAMS(args), 0) \ __DEFINE_EVENT(call, call, PARAMS(proto), PARAMS(args), 0) #undef DECLARE_TRACE_WRITABLE #define DECLARE_TRACE_WRITABLE(call, proto, args, size) \ __CHECK_WRITABLE_BUF_SIZE(call, PARAMS(proto), PARAMS(args), size) \ - __BPF_DECLARE_TRACE(call, PARAMS(proto), PARAMS(args)) \ + __BPF_DECLARE_TRACE(call, PARAMS(proto), PARAMS(args), 0) \ __DEFINE_EVENT(call, call, PARAMS(proto), PARAMS(args), size) #include TRACE_INCLUDE(TRACE_INCLUDE_FILE) diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c index 192de33d961f..873b0e885677 100644 --- a/kernel/trace/bpf_trace.c +++ b/kernel/trace/bpf_trace.c @@ -2443,9 +2443,15 @@ static int __bpf_probe_register(struct bpf_raw_event_map *btp, struct bpf_prog 
* if (prog->aux->max_tp_access > btp->writable_size) return -EINVAL; - return tracepoint_probe_register_prio_flags(tp, (void *)btp->bpf_func, - prog, TRACEPOINT_DEFAULT_PRIO, - TRACEPOINT_MAY_EXIST); + if (tp->flags & TRACEPOINT_MAY_FAULT) { + return tracepoint_probe_register_prio_flags(tp, (void *)btp->bpf_func, + prog, TRACEPOINT_DEFAULT_PRIO, + TRACEPOINT_MAY_EXIST | TRACEPOINT_MAY_FAULT);
[PATCH v5 2/8] cleanup.h guard: Rename DEFINE_ prefix to DECLARE_
The convention used in other kernel headers (e.g. wait.h, percpu-defs.h) is to use "DECLARE_" prefix for macros emitting externs, static inlines and type definitions. The "DEFINE_" prefix is used for variable definitions. In preparation to introduce a "DEFINE_INACTIVE_GUARD()" to actually define a guard variable, rename all the guard "DEFINE_" prefix to "DECLARE_". Signed-off-by: Mathieu Desnoyers Cc: Peter Zijlstra (Intel) Cc: Ingo Molnar --- drivers/cxl/core/cdat.c | 2 +- drivers/cxl/cxl.h | 2 +- drivers/gpio/gpiolib.h | 2 +- drivers/platform/x86/intel/pmc/core_ssram.c | 2 +- fs/fuse/virtio_fs.c | 2 +- fs/pstore/inode.c | 4 +- include/linux/bitmap.h | 2 +- include/linux/cleanup.h | 56 ++--- include/linux/cpu.h | 2 +- include/linux/cpumask.h | 2 +- include/linux/device.h | 6 +-- include/linux/file.h| 4 +- include/linux/firmware.h| 2 +- include/linux/gpio/driver.h | 4 +- include/linux/iio/iio.h | 4 +- include/linux/irqflags.h| 4 +- include/linux/mutex.h | 6 +-- include/linux/of.h | 2 +- include/linux/pci.h | 4 +- include/linux/percpu.h | 2 +- include/linux/preempt.h | 6 +-- include/linux/rcupdate.h| 2 +- include/linux/rwsem.h | 10 ++-- include/linux/sched/task.h | 4 +- include/linux/slab.h| 4 +- include/linux/spinlock.h| 38 +++--- include/linux/srcu.h| 2 +- include/sound/pcm.h | 6 +-- kernel/sched/core.c | 4 +- kernel/sched/sched.h| 16 +++--- lib/locking-selftest.c | 12 ++--- sound/core/control_led.c| 2 +- 32 files changed, 110 insertions(+), 110 deletions(-) diff --git a/drivers/cxl/core/cdat.c b/drivers/cxl/core/cdat.c index bb83867d9fec..689143566642 100644 --- a/drivers/cxl/core/cdat.c +++ b/drivers/cxl/core/cdat.c @@ -385,7 +385,7 @@ static void discard_dsmas(struct xarray *xa) } xa_destroy(xa); } -DEFINE_FREE(dsmas, struct xarray *, if (_T) discard_dsmas(_T)) +DECLARE_FREE(dsmas, struct xarray *, if (_T) discard_dsmas(_T)) void cxl_endpoint_parse_cdat(struct cxl_port *port) { diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h index 036d17db68e0..89cadb029d31 
100644 --- a/drivers/cxl/cxl.h +++ b/drivers/cxl/cxl.h @@ -737,7 +737,7 @@ struct cxl_root *devm_cxl_add_root(struct device *host, const struct cxl_root_ops *ops); struct cxl_root *find_cxl_root(struct cxl_port *port); void put_cxl_root(struct cxl_root *cxl_root); -DEFINE_FREE(put_cxl_root, struct cxl_root *, if (_T) put_cxl_root(_T)) +DECLARE_FREE(put_cxl_root, struct cxl_root *, if (_T) put_cxl_root(_T)) int devm_cxl_enumerate_ports(struct cxl_memdev *cxlmd); void cxl_bus_rescan(void); diff --git a/drivers/gpio/gpiolib.h b/drivers/gpio/gpiolib.h index 8e0e211ebf08..17507a64c284 100644 --- a/drivers/gpio/gpiolib.h +++ b/drivers/gpio/gpiolib.h @@ -199,7 +199,7 @@ struct gpio_chip_guard { int idx; }; -DEFINE_CLASS(gpio_chip_guard, +DECLARE_CLASS(gpio_chip_guard, struct gpio_chip_guard, srcu_read_unlock(&_T.gdev->srcu, _T.idx), ({ diff --git a/drivers/platform/x86/intel/pmc/core_ssram.c b/drivers/platform/x86/intel/pmc/core_ssram.c index 1bde86c54eb9..115f16448406 100644 --- a/drivers/platform/x86/intel/pmc/core_ssram.c +++ b/drivers/platform/x86/intel/pmc/core_ssram.c @@ -29,7 +29,7 @@ #define LPM_REG_COUNT 28 #define LPM_MODE_OFFSET1 -DEFINE_FREE(pmc_core_iounmap, void __iomem *, iounmap(_T)); +DECLARE_FREE(pmc_core_iounmap, void __iomem *, iounmap(_T)); static u32 pmc_core_find_guid(struct pmc_info *list, const struct pmc_reg_map *map) { diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c index bb3e941b9503..d062bafb294a 100644 --- a/fs/fuse/virtio_fs.c +++ b/fs/fuse/virtio_fs.c @@ -852,7 +852,7 @@ static void virtio_fs_cleanup_dax(void *data) put_dax(dax_dev); } -DEFINE_FREE(cleanup_dax, struct dax_dev *, if (!IS_ERR_OR_NULL(_T)) virtio_fs_cleanup_dax(_T)) +DECLARE_FREE(cleanup_dax, struct dax_dev *, if (!IS_ERR_OR_NULL(_T)) virtio_fs_cleanup_dax(_T)) static int virtio_fs_setup_dax(struct virtio_device *vdev, struct virtio_fs *fs) { diff --git a/fs/pstore/inode.c b/fs/pstore/inode.c index 56815799ce79..f34da47d26d4 100644 --- a/fs/pstore/inode.c +++ 
b/fs/pstore/inode.c
[PATCH v5 4/8] tracing: Introduce faultable tracepoints
When invoked from system call enter/exit instrumentation, accessing user-space data is a common use-case for tracers. However, tracepoints currently disable preemption around iteration on the registered tracepoint probes and invocation of the probe callbacks, which prevents tracers from handling page faults. Extend the tracepoint and trace event APIs to allow defining a faultable tracepoint which invokes its callback with preemption enabled. Also extend the tracepoint API to allow tracers to request specific probes to be connected to those faultable tracepoints. When the TRACEPOINT_MAY_FAULT flag is provided on registration, the probe callback will be called with preemption enabled, and is allowed to take page faults. Faultable probes can only be registered on faultable tracepoints and non-faultable probes on non-faultable tracepoints. The tasks trace rcu mechanism is used to synchronize read-side marshalling of the registered probes with respect to faultable probes unregistration and teardown. Link: https://lore.kernel.org/lkml/20231002202531.3160-1-mathieu.desnoy...@efficios.com/ Co-developed-by: Michael Jeanson Signed-off-by: Mathieu Desnoyers Signed-off-by: Michael Jeanson Cc: Steven Rostedt Cc: Masami Hiramatsu Cc: Peter Zijlstra Cc: Alexei Starovoitov Cc: Yonghong Song Cc: Paul E. McKenney Cc: Ingo Molnar Cc: Arnaldo Carvalho de Melo Cc: Mark Rutland Cc: Alexander Shishkin Cc: Jiri Olsa Cc: Namhyung Kim Cc: b...@vger.kernel.org Cc: Joel Fernandes --- Changes since v1: - Cleanup __DO_TRACE() implementation. - Rename "sleepable tracepoints" to "faultable tracepoints", MAYSLEEP to MAYFAULT, and use might_fault() rather than might_sleep(), to properly convey that the tracepoints are meant to be able to take a page fault, which requires to be able to sleep *and* to hold the mmap_sem. Changes since v2: - Rename MAYFAULT to MAY_FAULT. - Rebased on 6.5.5. - Introduce MAY_EXIST tracepoint flag. Changes since v3: - Rebased on 6.6.2. 
Changes since v4: - Rebased on 6.9.6. - Simplify flag check in tracepoint_probe_register_prio_flags(). - Update MAY_EXIST flag description. --- include/linux/tracepoint-defs.h | 14 ++ include/linux/tracepoint.h | 88 +++-- include/trace/define_trace.h| 7 +++ include/trace/trace_events.h| 6 +++ init/Kconfig| 1 + kernel/trace/bpf_trace.c| 5 +- kernel/trace/trace_fprobe.c | 5 +- kernel/tracepoint.c | 65 ++-- 8 files changed, 136 insertions(+), 55 deletions(-) diff --git a/include/linux/tracepoint-defs.h b/include/linux/tracepoint-defs.h index 4dc4955f0fbf..94e39c86b49f 100644 --- a/include/linux/tracepoint-defs.h +++ b/include/linux/tracepoint-defs.h @@ -29,6 +29,19 @@ struct tracepoint_func { int prio; }; +/** + * enum tracepoint_flags - Tracepoint flags + * @TRACEPOINT_MAY_EXIST: On registration, don't warn if the tracepoint is + *already registered. + * @TRACEPOINT_MAY_FAULT: The tracepoint probe callback will be called with + *preemption enabled, and is allowed to take page + *faults. + */ +enum tracepoint_flags { + TRACEPOINT_MAY_EXIST = (1 << 0), + TRACEPOINT_MAY_FAULT = (1 << 1), +}; + struct tracepoint { const char *name; /* Tracepoint name */ struct static_key key; @@ -39,6 +52,7 @@ struct tracepoint { int (*regfunc)(void); void (*unregfunc)(void); struct tracepoint_func __rcu *funcs; + unsigned int flags; }; #ifdef CONFIG_HAVE_ARCH_PREL32_RELOCATIONS diff --git a/include/linux/tracepoint.h b/include/linux/tracepoint.h index 689b6d71590e..eaf8c00b30a3 100644 --- a/include/linux/tracepoint.h +++ b/include/linux/tracepoint.h @@ -18,6 +18,7 @@ #include #include #include +#include #include #include @@ -41,17 +42,10 @@ extern int tracepoint_probe_register_prio(struct tracepoint *tp, void *probe, void *data, int prio); extern int -tracepoint_probe_register_prio_may_exist(struct tracepoint *tp, void *probe, void *data, -int prio); +tracepoint_probe_register_prio_flags(struct tracepoint *tp, void *probe, void *data, + int prio, unsigned int flags); extern int 
tracepoint_probe_unregister(struct tracepoint *tp, void *probe, void *data); -static inline int -tracepoint_probe_register_may_exist(struct tracepoint *tp, void *probe, - void *data) -{ - return tracepoint_probe_register_prio_may_exist(tp, probe, data, - TRACEPOINT_DEFAULT_PRIO); -} extern void for_each_kernel_tracepoint(void (*fct)(struct tracepoint *tp, void *priv), void *priv); @@ -90,6 +84,7 @@ int unregister_tracepoint_module_notifier(struct notifier_
[PATCH v5 0/8] Faultable Tracepoints
Wire up the system call tracepoints with Tasks Trace RCU to allow the ftrace, perf, and eBPF tracers to handle page faults. This series does the initial wire-up allowing tracers to handle page faults, but leaves out the actual handling of said page faults as future work. I have tested this against a feature branch of lttng-modules which implements handling of page faults for the filename argument of the openat(2) system call. This v5 addresses comments from the previous round of review [1]. Steven Rostedt suggested separating tracepoints into two separate sections. It is unclear how that approach would prove to be an improvement over the currently proposed approach, so those changes were not incorporated. See [2] for my detailed reply. In the previous round, Peter Zijlstra suggested use of SRCU rather than Tasks Trace RCU. See my reply about the distinction between SRCU and Tasks Trace RCU [3] and this explanation from Paul E. McKenney about the purpose of Tasks Trace RCU [4]. The macros DEFINE_INACTIVE_GUARD and activate_guard are added to cleanup.h for use in the __DO_TRACE() macro. Those appear to be more flexible than the guard_if() proposed by Peter Zijlstra in the previous round of review [5]. This series is based on kernel v6.9.6. Thanks, Mathieu Link: https://lore.kernel.org/lkml/20231120205418.334172-1-mathieu.desnoy...@efficios.com/ # [1] Link: https://lore.kernel.org/lkml/e4e9a2bc-1776-4b51-aba4-a147795a5...@efficios.com/ # [2] Link: https://lore.kernel.org/lkml/a0ac5f77-411e-4562-9863-81196238f...@efficios.com/ # [3] Link: https://lore.kernel.org/lkml/ba543d44-9302-4115-ac4f-d4e9f8d98a90@paulmck-laptop/ # [4] Link: https://lore.kernel.org/lkml/20231120221524.gd8...@noisy.programming.kicks-ass.net/ # [5] Cc: Peter Zijlstra Cc: Alexei Starovoitov Cc: Yonghong Song Cc: Paul E. 
McKenney Cc: Ingo Molnar Cc: Arnaldo Carvalho de Melo Cc: Mark Rutland Cc: Alexander Shishkin Cc: Jiri Olsa Cc: Namhyung Kim Cc: b...@vger.kernel.org Cc: Joel Fernandes Mathieu Desnoyers (8): cleanup.h: Header include guard should match header name cleanup.h guard: Rename DEFINE_ prefix to DECLARE_ cleanup.h: Introduce DEFINE_INACTIVE_GUARD and activate_guard tracing: Introduce faultable tracepoints tracing/ftrace: Add support for faultable tracepoints tracing/bpf-trace: Add support for faultable tracepoints tracing/perf: Add support for faultable tracepoints tracing: Convert sys_enter/exit to faultable tracepoints drivers/cxl/core/cdat.c | 2 +- drivers/cxl/cxl.h | 2 +- drivers/gpio/gpiolib.h | 2 +- drivers/platform/x86/intel/pmc/core_ssram.c | 2 +- fs/fuse/virtio_fs.c | 2 +- fs/pstore/inode.c | 4 +- include/linux/bitmap.h | 2 +- include/linux/cleanup.h | 85 include/linux/cpu.h | 2 +- include/linux/cpumask.h | 2 +- include/linux/device.h | 6 +- include/linux/file.h| 4 +- include/linux/firmware.h| 2 +- include/linux/gpio/driver.h | 4 +- include/linux/iio/iio.h | 4 +- include/linux/irqflags.h| 4 +- include/linux/mutex.h | 6 +- include/linux/of.h | 2 +- include/linux/pci.h | 4 +- include/linux/percpu.h | 2 +- include/linux/preempt.h | 6 +- include/linux/rcupdate.h| 2 +- include/linux/rwsem.h | 10 +-- include/linux/sched/task.h | 4 +- include/linux/slab.h| 4 +- include/linux/spinlock.h| 38 - include/linux/srcu.h| 2 +- include/linux/tracepoint-defs.h | 14 include/linux/tracepoint.h | 88 +++-- include/sound/pcm.h | 6 +- include/trace/bpf_probe.h | 20 - include/trace/define_trace.h| 7 ++ include/trace/events/syscalls.h | 4 +- include/trace/perf.h| 22 +- include/trace/trace_events.h| 68 +++- init/Kconfig| 1 + kernel/sched/core.c | 4 +- kernel/sched/sched.h| 16 ++-- kernel/trace/bpf_trace.c| 11 ++- kernel/trace/trace_events.c | 28 +-- kernel/trace/trace_fprobe.c | 5 +- kernel/trace/trace_syscalls.c | 52 ++-- kernel/tracepoint.c | 65 +-- lib/locking-selftest.c | 12 +-- 
sound/core/control_led.c| 2 +- 45 files changed, 441
[PATCH v5 3/8] cleanup.h: Introduce DEFINE_INACTIVE_GUARD and activate_guard
To cover scenarios where the scope of the guard differs from the scope of its activation, introduce DEFINE_INACTIVE_GUARD() and activate_guard(). Here is an example use for a conditionally activated guard variable:

void func(bool a)
{
	DEFINE_INACTIVE_GUARD(preempt_notrace, myguard);

	[...]
	if (a) {
		might_sleep();
		activate_guard(preempt_notrace, myguard)();
	}
	[ protected code ]
}

Signed-off-by: Mathieu Desnoyers
Cc: Peter Zijlstra (Intel)
Cc: Ingo Molnar
Cc: Linus Torvalds
Cc: Kees Cook
Cc: Greg KH
Cc: Sean Christopherson
---
 include/linux/cleanup.h | 23 ++-
 1 file changed, 22 insertions(+), 1 deletion(-)

diff --git a/include/linux/cleanup.h b/include/linux/cleanup.h
index 04f03ad5f25d..d6a3d8099d77 100644
--- a/include/linux/cleanup.h
+++ b/include/linux/cleanup.h
@@ -146,12 +146,20 @@ static inline class_##_name##_t class_##_name##ext##_constructor(_init_args) \
  * similar to scoped_guard(), except it does fail when the lock
  * acquire fails.
  *
+ * DEFINE_INACTIVE_GUARD(name, var):
+ *	define an inactive guard variable in a given scope, initialized to NULL.
+ *
+ * activate_guard(name, var)(args...):
+ *	activate a guard variable with its constructor, if it is not already
+ *	activated.
  */

 #define DECLARE_GUARD(_name, _type, _lock, _unlock) \
 	DECLARE_CLASS(_name, _type, if (_T) { _unlock; }, ({ _lock; _T; }), _type _T); \
 	static inline void * class_##_name##_lock_ptr(class_##_name##_t *_T) \
-	{ return *_T; }
+	{ return *_T; } \
+	static inline class_##_name##_t class_##_name##_null(void) \
+	{ return NULL; }

 #define DECLARE_GUARD_COND(_name, _ext, _condlock) \
 	EXTEND_CLASS(_name, _ext, \
@@ -175,6 +183,14 @@ static inline class_##_name##_t class_##_name##ext##_constructor(_init_args) \
 	if (!__guard_ptr(_name)(&scope)) _fail; \
 	else

+#define DEFINE_INACTIVE_GUARD(_name, _var) \
+	class_##_name##_t _var __cleanup(class_##_name##_destructor) = \
+		class_##_name##_null()
+
+#define activate_guard(_name, _var) \
+	if (!class_##_name##_lock_ptr(&(_var))) \
+		_var = class_##_name##_constructor
+
 /*
  * Additional helper macros for generating lock guards with types, either for
  * locks that don't have a native type (eg. RCU, preempt) or those that need a
@@ -209,6 +225,11 @@ static inline void class_##_name##_destructor(class_##_name##_t *_T) \
 static inline void *class_##_name##_lock_ptr(class_##_name##_t *_T)	\
 {									\
 	return _T->lock;						\
+}									\
+static inline class_##_name##_t class_##_name##_null(void)		\
+{									\
+	class_##_name##_t _t = { .lock = NULL };			\
+	return _t;							\
 }
-- 
2.39.2
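For readers unfamiliar with the cleanup-attribute machinery underlying these macros, the pattern can be reproduced in a stand-alone userspace sketch. Everything below is an illustrative mock under simplifying assumptions: `mock_lock`/`mock_unlock` stand in for the real lock class, and the one-argument macro variants are simplified — the real kernel macros are class-based as shown in the diff above:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

int lock_depth;        /* stand-in for "preemption disabled" nesting */
int probe_depth_seen;  /* records what the protected code observed */

void mock_lock(void)   { lock_depth++; }
void mock_unlock(void) { lock_depth--; }

/* cleanup runs at scope exit; a NULL guard means "never activated" */
typedef void *guard_t;
void guard_cleanup(guard_t *g) { if (*g) mock_unlock(); }

#define DEFINE_INACTIVE_GUARD(var) \
	guard_t var __attribute__((cleanup(guard_cleanup))) = NULL

#define activate_guard(var) \
	do { if (!(var)) { mock_lock(); (var) = &(var); } } while (0)

void func(bool a)
{
	DEFINE_INACTIVE_GUARD(myguard);

	if (a)
		activate_guard(myguard);

	probe_depth_seen = lock_depth; /* [ protected code ] runs here */
}
```

The key property matches the commit message: the guard's scope (the whole function body) differs from the scope of its activation (the `if (a)` branch), and the destructor only fires if the guard was actually activated.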
[PATCH v5 1/8] cleanup.h: Header include guard should match header name
The include guard should match the header name. Rename __LINUX_GUARDS_H to __LINUX_CLEANUP_H. Signed-off-by: Mathieu Desnoyers Cc: Peter Zijlstra (Intel) Cc: Ingo Molnar --- include/linux/cleanup.h | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/include/linux/cleanup.h b/include/linux/cleanup.h index c2d09bc4f976..4cf8ad5d27a3 100644 --- a/include/linux/cleanup.h +++ b/include/linux/cleanup.h @@ -1,6 +1,6 @@ /* SPDX-License-Identifier: GPL-2.0 */ -#ifndef __LINUX_GUARDS_H -#define __LINUX_GUARDS_H +#ifndef __LINUX_CLEANUP_H +#define __LINUX_CLEANUP_H #include @@ -247,4 +247,4 @@ __DEFINE_LOCK_GUARD_0(_name, _lock) { return class_##_name##_lock_ptr(_T); } -#endif /* __LINUX_GUARDS_H */ +#endif /* __LINUX_CLEANUP_H */ -- 2.39.2
[RFC PATCH 1/4] kernel/reboot: Introduce pre_restart notifiers
Introduce a new pre_restart notifier chain for callbacks that need to be executed after the system has been made quiescent with syscore_shutdown(), before machine restart. This pre_restart notifier chain should be invoked on machine restart and on emergency machine restart. The use-case for this new notifier chain is to preserve tracing data within pmem areas on systems where the BIOS does not clear memory across warm reboots.

Why do we need a new notifier chain?

1) The reboot and restart_prepare notifiers are called too early in the reboot sequence: they are invoked before syscore_shutdown(), which leaves other CPUs actively running threads while those notifiers are invoked.

2) The "restart" notifier is meant to trigger the actual machine restart, and is not meant to be invoked as a last step immediately before restart. It is also not always used: some architecture code chooses to bypass this restart notifier and reboot directly from the architecture code.

Wiring up the architecture code to call this notifier chain is left to follow-up arch-specific patches.

Signed-off-by: Mathieu Desnoyers Cc: Dan Williams Cc: Vishal Verma Cc: Dave Jiang Cc: Ira Weiny Cc: Steven Rostedt Cc: nvd...@lists.linux.dev Cc: Thomas Gleixner Cc: Ingo Molnar Cc: Borislav Petkov Cc: Dave Hansen Cc: x...@kernel.org Cc: "H.
Peter Anvin" Cc: Catalin Marinas Cc: Will Deacon Cc: linux-arm-ker...@lists.infradead.org
---
 include/linux/reboot.h |  4 ++++
 kernel/reboot.c        | 51 ++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 55 insertions(+)

diff --git a/include/linux/reboot.h b/include/linux/reboot.h
index abcdde4df697..c7f340e81451 100644
--- a/include/linux/reboot.h
+++ b/include/linux/reboot.h
@@ -50,6 +50,10 @@ extern int register_restart_handler(struct notifier_block *);
 extern int unregister_restart_handler(struct notifier_block *);
 extern void do_kernel_restart(char *cmd);

+extern int register_pre_restart_handler(struct notifier_block *);
+extern int unregister_pre_restart_handler(struct notifier_block *);
+extern void do_kernel_pre_restart(char *cmd);
+
 /*
  * Architecture-specific implementations of sys_reboot commands.
  */
diff --git a/kernel/reboot.c b/kernel/reboot.c
index 22c16e2564cc..b7287dd48d35 100644
--- a/kernel/reboot.c
+++ b/kernel/reboot.c
@@ -235,6 +235,57 @@ void do_kernel_restart(char *cmd)
 	atomic_notifier_call_chain(&restart_handler_list, reboot_mode, cmd);
 }

+/*
+ * Notifier list for kernel code which wants to be called immediately
+ * before restarting the system.
+ */
+static ATOMIC_NOTIFIER_HEAD(pre_restart_handler_list);
+
+/**
+ * register_pre_restart_handler - Register function to be called in preparation
+ *				  to reset the system
+ * @nb: Info about handler function to be called
+ *
+ * Registers a function with code to be called in preparation to restart
+ * the system.
+ *
+ * Currently always returns zero, as atomic_notifier_chain_register()
+ * always returns zero.
+ */
+int register_pre_restart_handler(struct notifier_block *nb)
+{
+	return atomic_notifier_chain_register(&pre_restart_handler_list, nb);
+}
+EXPORT_SYMBOL(register_pre_restart_handler);
+
+/**
+ * unregister_pre_restart_handler - Unregister previously registered
+ *				    pre-restart handler
+ * @nb: Hook to be unregistered
+ *
+ * Unregisters a previously registered pre-restart handler function.
+ *
+ * Returns zero on success, or %-ENOENT on failure.
+ */
+int unregister_pre_restart_handler(struct notifier_block *nb)
+{
+	return atomic_notifier_chain_unregister(&pre_restart_handler_list, nb);
+}
+EXPORT_SYMBOL(unregister_pre_restart_handler);
+
+/**
+ * do_kernel_pre_restart - Execute kernel pre-restart handler call chain
+ * @cmd: pointer to buffer containing command to execute for restart,
+ *	 or %NULL
+ *
+ * Calls functions registered with register_pre_restart_handler.
+ *
+ * Expected to be called from machine_restart and
+ * machine_emergency_restart before invoking the restart handlers.
+ */
+void do_kernel_pre_restart(char *cmd)
+{
+	atomic_notifier_call_chain(&pre_restart_handler_list, reboot_mode, cmd);
+}
+
 void migrate_to_reboot_cpu(void)
 {
 	/* The boot cpu is always logical cpu 0 */
-- 
2.39.2
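The pre_restart chain reuses the kernel's atomic notifier infrastructure: registration inserts into a priority-ordered singly linked list, and the call walks it. A minimal userspace sketch of that register/call pattern — simplified, with no RCU or locking; `flush_handler` is a hypothetical stand-in mirroring the pmem flush-before-restart use-case of patch 2:

```c
#include <assert.h>
#include <stddef.h>

#define NOTIFY_DONE 0

struct notifier_block {
	int (*notifier_call)(struct notifier_block *nb, unsigned long ev, void *data);
	struct notifier_block *next;
	int priority;
};

/* insert in descending priority order, like the kernel's notifier chains */
int chain_register(struct notifier_block **head, struct notifier_block *nb)
{
	while (*head && (*head)->priority > nb->priority)
		head = &(*head)->next;
	nb->next = *head;
	*head = nb;
	return 0;
}

/* invoke every handler in order, returning the last handler's verdict */
int chain_call(struct notifier_block *head, unsigned long ev, void *data)
{
	int ret = NOTIFY_DONE;

	for (; head; head = head->next)
		ret = head->notifier_call(head, ev, data);
	return ret;
}

/* hypothetical handler mirroring the pmem flush-before-restart use-case */
int flushed;
int flush_handler(struct notifier_block *nb, unsigned long ev, void *data)
{
	(void)nb; (void)ev; (void)data;
	flushed = 1;	/* the real handler would call arch_wb_cache_pmem() */
	return NOTIFY_DONE;
}

struct notifier_block *pre_restart_list;
struct notifier_block flush_nb = { .notifier_call = flush_handler, .priority = 0 };
```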
[RFC PATCH 3/4] arm64: Invoke pre_restart notifiers
Invoke the pre_restart notifiers after shutdown, before machine restart. This allows preserving pmem memory across warm reboots. Invoke the pre_restart notifiers before emergency machine restart as well to cover the panic() scenario. Signed-off-by: Mathieu Desnoyers Cc: Dan Williams Cc: Vishal Verma Cc: Dave Jiang Cc: Ira Weiny Cc: Steven Rostedt Cc: nvd...@lists.linux.dev Cc: Catalin Marinas Cc: Will Deacon Cc: linux-arm-ker...@lists.infradead.org --- arch/arm64/kernel/process.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/arch/arm64/kernel/process.c b/arch/arm64/kernel/process.c index 4ae31b7af6c3..4a27397617fb 100644 --- a/arch/arm64/kernel/process.c +++ b/arch/arm64/kernel/process.c @@ -129,6 +129,8 @@ void machine_restart(char *cmd) local_irq_disable(); smp_send_stop(); + do_kernel_pre_restart(cmd); + /* * UpdateCapsule() depends on the system being reset via * ResetSystem(). -- 2.39.2
[RFC PATCH 2/4] nvdimm/pmem: Flush to memory before machine restart
Register pre-restart notifiers to flush pmem areas from CPU data cache to memory on reboot, immediately before restarting the machine. This ensures all other CPUs are quiescent before the pmem data is flushed to memory. I did an earlier POC that flushed caches on panic/die oops notifiers [1], but it did not cover the reboot case. I've been made aware that some distribution vendors have started shipping their own modified version of my earlier POC patch. This makes a strong argument for upstreaming this work. Use the newly introduced "pre-restart" notifiers to flush pmem data to memory immediately before machine restart. Delta from my POC patch [1]: Looking at the panic() code, it invokes emergency_restart() to restart the machine, which uses the new pre-restart notifiers. There is therefore no need to hook into panic handlers explicitly. Looking at the die notifiers, those don't actually end up triggering a machine restart, so it does not appear to be relevant to flush pmem to memory there. I must admit I originally looked at how ftrace hooked into panic/die-oops handlers for its ring buffers, but the use-case is different here: we only want to cover machine restart use-cases. Link: https://lore.kernel.org/linux-kernel/f6067e3e-a2bc-483d-b214-6e3fe6691...@efficios.com/ [1] Signed-off-by: Mathieu Desnoyers Cc: Dan Williams Cc: Vishal Verma Cc: Dave Jiang Cc: Ira Weiny Cc: Steven Rostedt Cc: nvd...@lists.linux.dev Cc: Thomas Gleixner Cc: Ingo Molnar Cc: Borislav Petkov Cc: Dave Hansen Cc: x...@kernel.org Cc: "H.
Peter Anvin" Cc: Catalin Marinas Cc: Will Deacon Cc: linux-arm-ker...@lists.infradead.org
---
 drivers/nvdimm/pmem.c | 29 -
 drivers/nvdimm/pmem.h |  2 ++
 2 files changed, 30 insertions(+), 1 deletion(-)

diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 598fe2e89bda..bf1d187a9dca 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -26,12 +26,16 @@
 #include
 #include
 #include
+#include
 #include
 #include
 #include "pmem.h"
 #include "btt.h"
 #include "pfn.h"
 #include "nd.h"

+static int pmem_pre_restart_handler(struct notifier_block *self,
+				    unsigned long ev, void *unused);
+
 static struct device *to_dev(struct pmem_device *pmem)
 {
 	/*
@@ -423,6 +427,7 @@ static void pmem_release_disk(void *__pmem)
 {
 	struct pmem_device *pmem = __pmem;

+	unregister_pre_restart_handler(&pmem->pre_restart_notifier);
 	dax_remove_host(pmem->disk);
 	kill_dax(pmem->dax_dev);
 	put_dax(pmem->dax_dev);
@@ -575,9 +580,14 @@ static int pmem_attach_disk(struct device *dev,
 			goto out_cleanup_dax;
 		dax_write_cache(dax_dev, nvdimm_has_cache(nd_region));
 	}
-	rc = device_add_disk(dev, disk, pmem_attribute_groups);
+	pmem->pre_restart_notifier.notifier_call = pmem_pre_restart_handler;
+	pmem->pre_restart_notifier.priority = 0;
+	rc = register_pre_restart_handler(&pmem->pre_restart_notifier);
 	if (rc)
 		goto out_remove_host;
+	rc = device_add_disk(dev, disk, pmem_attribute_groups);
+	if (rc)
+		goto out_unregister_pre_restart;
 	if (devm_add_action_or_reset(dev, pmem_release_disk, pmem))
 		return -ENOMEM;
@@ -589,6 +599,8 @@ static int pmem_attach_disk(struct device *dev,
 		dev_warn(dev, "'badblocks' notification disabled\n");
 	return 0;
+out_unregister_pre_restart:
+	unregister_pre_restart_handler(&pmem->pre_restart_notifier);
 out_remove_host:
 	dax_remove_host(pmem->disk);
 out_cleanup_dax:
@@ -751,6 +763,21 @@ static void nd_pmem_notify(struct device *dev, enum nvdimm_event event)
 	}
 }

+/*
+ * For volatile memory use-cases where explicit flushing of the data cache is
+ * not useful after stores, the pmem pre-restart notifier is called on
+ * preparation for restart to make sure the content of the pmem memory area is
+ * flushed from data cache to memory, so it can be preserved across warm
+ * reboot.
+ */
+static int pmem_pre_restart_handler(struct notifier_block *self,
+				    unsigned long ev, void *unused)
+{
+	struct pmem_device *pmem = container_of(self, struct pmem_device, pre_restart_notifier);
+
+	arch_wb_cache_pmem(pmem->virt_addr, pmem->size);
+	return NOTIFY_DONE;
+}
+
 MODULE_ALIAS("pmem");
 MODULE_ALIAS_ND_DEVICE(ND_DEVICE_NAMESPACE_IO);
 MODULE_ALIAS_ND_DEVICE(ND_DEVICE_NAMESPACE_PMEM);
diff --git a/drivers/nvdimm/pmem.h b/drivers/nvdimm/pmem.h
index 392b0b38acb9..b8a2a518cf82 100644
--- a/drivers/nvdimm/pmem.h
+++ b/drivers/nvdimm/pmem.h
@@ -4,6 +4,7 @@
 #include
 #include
 #include
+#include
 #include
 #include
 #include
@@ -27,6 +28,7 @@ struct pmem_device {
 	struct dax_device *dax_dev;
 	struct gendisk
[RFC PATCH 0/4] Flush nvdimm/pmem to memory before machine restart
Introduce a new pre_restart notifier chain for callbacks that need to be executed after the system has been made quiescent with syscore_shutdown(), before machine restart. Register pre-restart notifiers to flush pmem areas from CPU data cache to memory on reboot, immediately before restarting the machine. This ensures all other CPUs are quiescent before the pmem data is flushed to memory. The use-case for this new notifier chain is to preserve tracing data within pmem areas on systems where the BIOS does not clear memory across warm reboots. I did an earlier POC that flushed caches on panic/die oops notifiers [1], but it did not cover the reboot case. I've been made aware that some distribution vendors have started shipping their own modified version of my earlier POC patch. This makes a strong argument for upstreaming this work. Link: https://lore.kernel.org/linux-kernel/f6067e3e-a2bc-483d-b214-6e3fe6691...@efficios.com/ [1] Cc: Vishal Verma Cc: Dave Jiang Cc: Ira Weiny Cc: nvd...@lists.linux.dev Cc: Thomas Gleixner Cc: Ingo Molnar Cc: Borislav Petkov Cc: Dave Hansen Cc: x...@kernel.org Cc: "H. Peter Anvin" Cc: Catalin Marinas Cc: Will Deacon Cc: linux-arm-ker...@lists.infradead.org Mathieu Desnoyers (4): kernel/reboot: Introduce pre_restart notifiers nvdimm/pmem: Flush to memory before machine restart arm64: Invoke pre_restart notifiers x86: Invoke pre_restart notifiers arch/arm64/kernel/process.c | 2 ++ arch/x86/kernel/reboot.c| 7 +++-- drivers/nvdimm/pmem.c | 29 - drivers/nvdimm/pmem.h | 2 ++ include/linux/reboot.h | 4 +++ kernel/reboot.c | 51 + 6 files changed, 92 insertions(+), 3 deletions(-) -- 2.39.2
[RFC PATCH 4/4] x86: Invoke pre_restart notifiers
Invoke the pre_restart notifiers after shutdown, before machine restart. This allows preserving pmem memory across warm reboots. Invoke the pre_restart notifiers on emergency_machine_restart to cover the panic() scenario. Signed-off-by: Mathieu Desnoyers Cc: Dan Williams Cc: Vishal Verma Cc: Dave Jiang Cc: Ira Weiny Cc: Steven Rostedt Cc: nvd...@lists.linux.dev Cc: Thomas Gleixner Cc: Ingo Molnar Cc: Borislav Petkov Cc: Dave Hansen Cc: x...@kernel.org Cc: "H. Peter Anvin" --- arch/x86/kernel/reboot.c | 7 +-- 1 file changed, 5 insertions(+), 2 deletions(-) diff --git a/arch/x86/kernel/reboot.c b/arch/x86/kernel/reboot.c index f3130f762784..222619fa63c6 100644 --- a/arch/x86/kernel/reboot.c +++ b/arch/x86/kernel/reboot.c @@ -631,8 +631,10 @@ static void native_machine_emergency_restart(void) int orig_reboot_type = reboot_type; unsigned short mode; - if (reboot_emergency) + if (reboot_emergency) { + do_kernel_pre_restart(NULL); emergency_reboot_disable_virtualization(); + } tboot_shutdown(TB_SHUTDOWN_REBOOT); @@ -760,12 +762,13 @@ static void __machine_emergency_restart(int emergency) machine_ops.emergency_restart(); } -static void native_machine_restart(char *__unused) +static void native_machine_restart(char *cmd) { pr_notice("machine restart\n"); if (!reboot_force) machine_shutdown(); + do_kernel_pre_restart(cmd); __machine_emergency_restart(0); } -- 2.39.2
Re: [PATCH 0/3] Tracepoints and static branch/call in Rust
On 2024-06-06 11:46, Alice Ryhl wrote: On Thu, Jun 6, 2024 at 5:25 PM Mathieu Desnoyers wrote: On 2024-06-06 11:05, Alice Ryhl wrote: This implementation implements support for static keys in Rust so that the actual static branch will end up in the Rust object file. However, it would also be possible to just wrap the trace_##name generated by __DECLARE_TRACE in an extern C function and then call that from Rust. This will simplify the Rust code by removing the need for static branches and calls, but it places the static branch behind an external call, which has performance implications. The tracepoints try very hard to minimize overhead of dormant tracepoints so it is not frowned-upon to have them built into production binaries. This is needed to make sure distribution vendors keep those tracepoints in the kernel binaries that reach end-users. Adding a function call before evaluation of the static branch goes against this major goal. A possible middle ground would be to place just the __DO_TRACE body in an extern C function and to implement the Rust wrapper by doing the static branch in Rust, and then calling into C the code that contains __DO_TRACE when the tracepoint is active. However, this would need some changes to include/linux/tracepoint.h to generate and export a function containing the body of __DO_TRACE when the tracepoint should be callable from Rust. This tradeoff is more acceptable than having a function call before evaluation of the static branch, but I wonder what is the upside of this tradeoff compared to inlining the whole __DO_TRACE in Rust ? So in general, there is a tradeoff between placing parts of the tracepoint (which is perf sensitive) behind an external call, and having code duplicated in both C and Rust (which must be kept in sync when changes are made). This is an important point that I would like feedback on from the C maintainers. 
I don't see how the duplication happens there: __DO_TRACE is meant to be inlined into each C tracepoint caller site, so the code is already meant to be duplicated. Having an explicit function wrapping the tracepoint for Rust would just create an extra instance of __DO_TRACE if it happens to be also inlined into C code. Or do you mean you would like to prevent having to duplicate the implementation of __DO_TRACE in both C and Rust? I'm not sure if you mean to prevent source code duplication between C and Rust or duplication of binary code (instructions). It's a question of maintenance burden. If you change how __DO_TRACE is implemented, then those changes must also be reflected in the Rust version. There's no issue in the binary code. As long as it is only __DO_TRACE that is duplicated between C and Rust, I don't see it as a large burden. Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com
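The inlining argument in this thread can be made concrete with a userspace sketch of why the enabled-check should live at the call site rather than behind an external call. This is a rough approximation: a plain bool stands in for the static key, whereas the kernel patches the branch itself (asm goto / code patching), so a dormant tracepoint in the kernel avoids even the conditional load shown here:

```c
#include <assert.h>
#include <stdbool.h>

/* stand-in for a static key; the real kernel key is code-patched */
bool tracepoint_enabled;
int probe_hits;

void __trace_probe(int arg) { probe_hits += arg; }

/* fast path inlined at every call site: check first, call out only if enabled */
static inline void trace_myevent(int arg)
{
	if (__builtin_expect(tracepoint_enabled, 0))
		__trace_probe(arg);
}
```

Wrapping `trace_myevent()` itself in an out-of-line function would force the caller to pay the call overhead even while the tracepoint is dormant, which is the cost the thread argues against.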
Re: [PATCH 3/3] rust: add tracepoint support
On 2024-06-06 12:16, Alice Ryhl wrote: On Thu, Jun 6, 2024 at 5:29 PM Mathieu Desnoyers wrote: On 2024-06-06 11:05, Alice Ryhl wrote: Make it possible to have Rust code call into tracepoints defined by C code. It is still required that the tracepoint is declared in a C header, and that this header is included in the input to bindgen. Signed-off-by: Alice Ryhl [...] diff --git a/rust/helpers.c b/rust/helpers.c index 2c37a0f5d7a8..0560cc2a512a 100644 --- a/rust/helpers.c +++ b/rust/helpers.c @@ -165,6 +165,30 @@ rust_helper_krealloc(const void *objp, size_t new_size, gfp_t flags) } EXPORT_SYMBOL_GPL(rust_helper_krealloc); +void rust_helper_preempt_enable_notrace(void) +{ + preempt_enable_notrace(); +} +EXPORT_SYMBOL_GPL(rust_helper_preempt_enable_notrace); + +void rust_helper_preempt_disable_notrace(void) +{ + preempt_disable_notrace(); +} +EXPORT_SYMBOL_GPL(rust_helper_preempt_disable_notrace); + +bool rust_helper_current_cpu_online(void) +{ + return cpu_online(raw_smp_processor_id()); +} +EXPORT_SYMBOL_GPL(rust_helper_current_cpu_online); + +void *rust_helper___rcu_dereference_raw(void **p) +{ + return rcu_dereference_raw(p); +} +EXPORT_SYMBOL_GPL(rust_helper___rcu_dereference_raw); Ouch. Doing a function call for each of those small operations will have a rather large performance impact when tracing is active. If it is not possible to inline those in Rust, then implementing __DO_TRACE in a C function would at least allow Rust to only do a single call to C rather than go back and forth between Rust and C. What prevents inlining those helpers in Rust ? There's nothing fundamental that prevents it. But they contain a large amount of architecture specific code, which makes them a significant amount of work to reimplement in Rust. For example, rcu_dereference_raw calls into READ_ONCE. 
READ_ONCE is usually a volatile load, but under arm64+LTO, you get a bunch of inline assembly that relies on ALTERNATIVE for feature detection: https://elixir.bootlin.com/linux/v6.9/source/arch/arm64/include/asm/rwonce.h#L36 And preempt_enable_notrace has a similar story. The solution that Boqun mentions is nice, but it relies on rustc and clang using the same version of LLVM. You are unlikely to have compilers with matching LLVMs unless you intentionally take steps to make that happen. But yes, perhaps these helpers are an argument to have a single call for __DO_TRACE instead. If those helpers end up being inlined into Rust with the solution pointed to by Boqun, then it makes sense to implement __DO_TRACE in Rust. Otherwise doing a single call to C would be more efficient than calling each of the helpers individually. Thanks, Mathieu Alice -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com
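For reference, the generic (non-arm64+LTO) flavor of READ_ONCE discussed above boils down to forcing a single volatile load. A rough userspace approximation, not the kernel's exact definition:

```c
#include <assert.h>

/* force exactly one load that the compiler may not tear, fuse, or re-read */
#define READ_ONCE(x) (*(const volatile __typeof__(x) *)&(x))

int shared = 42;

int read_shared(void)
{
	return READ_ONCE(shared);
}
```

The arm64 rwonce.h version linked above replaces this volatile access with inline assembly selected via ALTERNATIVE, which is what makes a faithful Rust reimplementation laborious.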
Re: [PATCH 3/3] rust: add tracepoint support
On 2024-06-06 11:49, Boqun Feng wrote: On Thu, Jun 06, 2024 at 11:30:03AM -0400, Mathieu Desnoyers wrote: On 2024-06-06 11:05, Alice Ryhl wrote: Make it possible to have Rust code call into tracepoints defined by C code. It is still required that the tracepoint is declared in a C header, and that this header is included in the input to bindgen. Signed-off-by: Alice Ryhl [...] diff --git a/rust/helpers.c b/rust/helpers.c index 2c37a0f5d7a8..0560cc2a512a 100644 --- a/rust/helpers.c +++ b/rust/helpers.c @@ -165,6 +165,30 @@ rust_helper_krealloc(const void *objp, size_t new_size, gfp_t flags) } EXPORT_SYMBOL_GPL(rust_helper_krealloc); +void rust_helper_preempt_enable_notrace(void) +{ + preempt_enable_notrace(); +} +EXPORT_SYMBOL_GPL(rust_helper_preempt_enable_notrace); + +void rust_helper_preempt_disable_notrace(void) +{ + preempt_disable_notrace(); +} +EXPORT_SYMBOL_GPL(rust_helper_preempt_disable_notrace); + +bool rust_helper_current_cpu_online(void) +{ + return cpu_online(raw_smp_processor_id()); +} +EXPORT_SYMBOL_GPL(rust_helper_current_cpu_online); + +void *rust_helper___rcu_dereference_raw(void **p) +{ + return rcu_dereference_raw(p); +} +EXPORT_SYMBOL_GPL(rust_helper___rcu_dereference_raw); Ouch. Doing a function call for each of those small operations will have a rather large performance impact when tracing is active. If it is Long-term plan is to 1) compile the C helpers in some IR and 2) inline the helpers with Rust in IR-level, as what Gary has: https://lore.kernel.org/rust-for-linux/20240529202817.3641974-1-g...@garyguo.net/ and I use it for the upcoming atomic API support: https://github.com/fbq/linux/tree/dev/rust/atomic-rfc and it works very well. Thanks for the pointers, it makes sense. Thanks, Mathieu Regards, Boqun not possible to inline those in Rust, then implementing __DO_TRACE in a C function would at least allow Rust to only do a single call to C rather than go back and forth between Rust and C. What prevents inlining those helpers in Rust ? 
Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com
Re: [PATCH 0/3] Tracepoints and static branch/call in Rust
[...] duplication of binary code (instructions). Thanks, Mathieu Link: https://lore.kernel.org/rust-for-linux/20231101-rust-binder-v1-0-08ba9197f...@google.com/ [1] Link: https://r.android.com/3110088 [2] Signed-off-by: Alice Ryhl --- Alice Ryhl (3): rust: add static_call support rust: add static_key_false rust: add tracepoint support rust/bindings/bindings_helper.h | 1 + rust/bindings/lib.rs| 15 +++ rust/helpers.c | 24 +++ rust/kernel/lib.rs | 3 ++ rust/kernel/static_call.rs | 92 + rust/kernel/static_key.rs | 87 ++ rust/kernel/tracepoint.rs | 92 + scripts/Makefile.build | 2 +- 8 files changed, 315 insertions(+), 1 deletion(-) --- base-commit: 1613e604df0cd359cf2a7fbd9be7a0bcfacfabd0 change-id: 20240606-tracepoint-31e15b90e471 Best regards, -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com
Re: [PATCH 2/3] rust: add static_key_false
On 2024-06-06 11:05, Alice Ryhl wrote: Add just enough support for static key so that we can use it from tracepoints. Tracepoints rely on `static_key_false` even though it is deprecated, so we add the same functionality to Rust. Signed-off-by: Alice Ryhl --- rust/kernel/lib.rs| 1 + rust/kernel/static_key.rs | 87 +++ scripts/Makefile.build| 2 +- 3 files changed, 89 insertions(+), 1 deletion(-) diff --git a/rust/kernel/lib.rs b/rust/kernel/lib.rs index d534b1178955..22e1fedd0774 100644 --- a/rust/kernel/lib.rs +++ b/rust/kernel/lib.rs @@ -39,6 +39,7 @@ pub mod print; mod static_assert; pub mod static_call; +pub mod static_key; #[doc(hidden)] pub mod std_vendor; pub mod str; diff --git a/rust/kernel/static_key.rs b/rust/kernel/static_key.rs new file mode 100644 index ..6c3dbe14c98a --- /dev/null +++ b/rust/kernel/static_key.rs @@ -0,0 +1,87 @@ +// SPDX-License-Identifier: GPL-2.0 This static key code is something that can be generally useful both in kernel and user-space. Is there anything that would prevent licensing this under MIT right away instead so it could eventually be re-used more broadly in userspace as well ? Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com
Re: [PATCH 3/3] rust: add tracepoint support
On 2024-06-06 11:05, Alice Ryhl wrote: Make it possible to have Rust code call into tracepoints defined by C code. It is still required that the tracepoint is declared in a C header, and that this header is included in the input to bindgen. Signed-off-by: Alice Ryhl [...] diff --git a/rust/helpers.c b/rust/helpers.c index 2c37a0f5d7a8..0560cc2a512a 100644 --- a/rust/helpers.c +++ b/rust/helpers.c @@ -165,6 +165,30 @@ rust_helper_krealloc(const void *objp, size_t new_size, gfp_t flags) } EXPORT_SYMBOL_GPL(rust_helper_krealloc); +void rust_helper_preempt_enable_notrace(void) +{ + preempt_enable_notrace(); +} +EXPORT_SYMBOL_GPL(rust_helper_preempt_enable_notrace); + +void rust_helper_preempt_disable_notrace(void) +{ + preempt_disable_notrace(); +} +EXPORT_SYMBOL_GPL(rust_helper_preempt_disable_notrace); + +bool rust_helper_current_cpu_online(void) +{ + return cpu_online(raw_smp_processor_id()); +} +EXPORT_SYMBOL_GPL(rust_helper_current_cpu_online); + +void *rust_helper___rcu_dereference_raw(void **p) +{ + return rcu_dereference_raw(p); +} +EXPORT_SYMBOL_GPL(rust_helper___rcu_dereference_raw); Ouch. Doing a function call for each of those small operations will have a rather large performance impact when tracing is active. If it is not possible to inline those in Rust, then implementing __DO_TRACE in a C function would at least allow Rust to only do a single call to C rather than go back and forth between Rust and C. What prevents inlining those helpers in Rust ? Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com
Re: [PATCH v2 2/3] tracing/fprobe: Support raw tracepoint events on modules
On 2024-06-04 12:34, Steven Rostedt wrote: On Tue, 4 Jun 2024 11:02:16 -0400 Mathieu Desnoyers wrote: I see. It looks like there are a few things we could improve there: 1) With your approach, modules need to be already loaded before attaching an fprobe event to them. This effectively prevents attaching to any module init code. Is there any way we could allow this by implementing a module coming notifier in fprobe as well ? This would require that fprobes are kept around in a data structure that matches the modules when they are loaded in the coming notifier. The above sounds like a nice enhancement, but not something necessary for this series. IMHO it is nevertheless relevant to discuss the impact of supporting this kind of use-case on the ABI presented to userspace, at least to validate that what is exposed today can incrementally be enhanced towards that goal. I'm not saying that it needs to be implemented today, but we should at least give it some thoughts right now to make sure the ABI is a good fit. 2) Given that the fprobe module going notifier is protected by the event_mutex, can we use locking rather than reference counting in fprobe attach to guarantee the target module is not reclaimed concurrently ? This would remove the transient side-effect of holding a module reference count which temporarily prevents module unload. Why do we care about unloading modules during the transition? Note, module unload has always been considered a second class citizen, and there's been talks in the past to even rip it out. As a general rule I try to ensure tracing has as little impact on the system behavior so issues that occur without tracing can be reproduced with instrumentation. On systems where modules are loaded/unloaded with udev, holding references on modules can spuriously prevent module unload, which as a consequence changes the system behavior. 
About the relative importance of the various kernel subsystems, following your reasoning that module unload is considered a second-class citizen within the kernel, I would argue that tracing is a third-class citizen and should not needlessly modify the behavior of classes above it. Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com
Re: [PATCH v2 2/3] tracing/fprobe: Support raw tracepoint events on modules
On 2024-06-03 19:49, Masami Hiramatsu (Google) wrote: On Mon, 3 Jun 2024 15:50:55 -0400 Mathieu Desnoyers wrote: On 2024-06-01 04:22, Masami Hiramatsu (Google) wrote: From: Masami Hiramatsu (Google) Support raw tracepoint event on module by fprobe events. Since it only uses for_each_kernel_tracepoint() to find a tracepoint, the tracepoints on modules are not handled. Thus if user specified a tracepoint on a module, it shows an error. This adds new for_each_module_tracepoint() API to tracepoint subsystem, and uses it to find tracepoints on modules. Hi Masami, Why prevent module unload when a fprobe tracepoint is attached to a module ? This changes the kernel's behavior significantly just for the sake of instrumentation. I don't prevent module unloading all the time, just before registering tracepoint handler (something like booking a ticket :-) ). See the last hunk of this patch, it puts the module before exiting __trace_fprobe_create(). So at least this is transient, but I still have concerns, see below. As an alternative, LTTng-modules attach/detach to/from modules with the coming/going notifiers, so the instrumentation gets removed when a module is unloaded rather than preventing its unload by holding a module reference count. I would recommend a similar approach for fprobe. Yes, since tracepoint subsystem provides a notifier API to notify the tracepoint is gone, fprobe already uses it to find unloading and unregister the target function. (see __tracepoint_probe_module_cb) I see. It looks like there are a few things we could improve there: 1) With your approach, modules need to be already loaded before attaching an fprobe event to them. This effectively prevents attaching to any module init code. Is there any way we could allow this by implementing a module coming notifier in fprobe as well ? This would require that fprobes are kept around in a data structure that matches the modules when they are loaded in the coming notifier. 
2) Given that the fprobe module going notifier is protected by the event_mutex, can we use locking rather than reference counting in fprobe attach to guarantee the target module is not reclaimed concurrently ? This would remove the transient side-effect of holding a module reference count which temporarily prevents module unload. Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com
Re: [PATCH v2 2/3] tracing/fprobe: Support raw tracepoint events on modules
trace_probe_log_err(0, NO_TRACEPOINT); @@ -1110,8 +1142,8 @@ static int __trace_fprobe_create(int argc, const char *argv[]) goto out; /* setup a probe */ - tf = alloc_trace_fprobe(group, event, symbol, tpoint, maxactive, - argc, is_return); + tf = alloc_trace_fprobe(group, event, symbol, tpoint, tp_mod, + maxactive, argc, is_return); if (IS_ERR(tf)) { ret = PTR_ERR(tf); /* This must return -ENOMEM, else there is a bug */ @@ -1119,10 +1151,6 @@ static int __trace_fprobe_create(int argc, const char *argv[]) goto out; /* We know tf is not allocated */ } - if (is_tracepoint) - tf->mod = __module_text_address( - (unsigned long)tf->tpoint->probestub); - /* parse arguments */ for (i = 0; i < argc && i < MAX_TRACE_ARGS; i++) { trace_probe_log_set_index(i + 2); @@ -1155,6 +1183,8 @@ static int __trace_fprobe_create(int argc, const char *argv[]) } out: + if (tp_mod) + module_put(tp_mod); traceprobe_finish_parse(&ctx); trace_probe_log_clear(); kfree(new_argv); -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com
Re: [RFC PATCH 0/4] perf: Correlating user process data to samples
On 2024-04-12 12:28, Beau Belgrave wrote: On Thu, Apr 11, 2024 at 09:52:22PM -0700, Ian Rogers wrote: On Thu, Apr 11, 2024 at 5:17 PM Beau Belgrave wrote: In the Open Telemetry profiling SIG [1], we are trying to find a way to grab a tracing association quickly on a per-sample basis. The team at Elastic has a bespoke way to do this [2], however, I'd like to see a more general way to achieve this. The folks I've been talking with seem open to the idea of just having a TLS value for this we could capture Presumably TLS == Thread Local Storage. Yes, the initial idea is to use thread local storage (TLS). It seems to be the fastest option to save a per-thread value that changes at a fast rate. upon each sample. We could then just state, Open Telemetry SDKs should have a TLS value for span correlation. However, we need a way to sample the TLS or other value(s) when a sampling event is generated. This is supported today on Windows via EventActivityIdControl() [3]. Since Open Telemetry works on both Windows and Linux, ideally we can do something as efficient for Linux based workloads. This series is to explore how it would be best possible to collect supporting data from a user process when a profile sample is collected. Having a value stored in TLS makes a lot of sense for this however there are other ways to explore. Whatever is chosen, kernel samples taken in process context should be able to get this supporting data. In these patches on X64 the fsbase and gsbase are used for this. An option to explore suggested by Mathieu Desnoyers is to utilize rseq for processes to register a value location that can be included when profiling if desired. This would allow a tighter contract between user processes and a profiler. It would allow better labeling/categorizing the correlation values. It is hard to understand this idea. Are you saying stash a cookie in TLS for samples to capture to indicate an activity? 
Restartable sequences are about preemption on a CPU not of a thread, so at least my intuition is that they feel different. You could stash information like this today by changing the thread name which generates comm events. I've wondered about having similar information in some form of reserved for profiling stack slot, for example, to stash a pointer to the name of a function being interpreted. Snapshotting all of a stack is bad performance wise and for security. A stack slot would be able to deal with nesting. You are getting the idea. A slot or tag for a thread would be great! I'm not a fan of overriding the thread comm name (as that already has a use). TLS would be fine, if we could also pass an offset + size + type. Maybe a stack slot that just points to parts of TLS? That way you could have a set of slots that don't require much memory and selectively copy them out of TLS (or where ever those slots point to in user memory). When I was talking to Mathieu about this, it seems that rseq already had a place to potentially put these slots. I'm unsure though how the per thread aspects would work. Mathieu, can you post your ideas here about that? Sure. I'll try to summarize my thoughts here. By all means, let me know if I'm missing important pieces of the puzzle. First of all, here is my understanding of what information we want to share between userspace and kernel. A 128-bit activity ID identifies "uniquely" (as far as a 128-bit random UUID allows) a portion of the dependency chain involved in doing some work (e.g. answer a HTTP request) across one or many participating hosts. Activity IDs have a parent/child relationship: a parent activity ID can create children activity IDs. 
For instance, if one host has the service "dispatch", another host has a "web server", and a third host has a SQL database, we should be able to follow the chain of activities needed to answer a web query by following those activity IDs, linking them together through parent/child relationships. This usually requires the communication protocols to convey those activity IDs across hosts. The reason why this information must be provided from userspace is because it's userspace that knows where to find those activity IDs within its application-layer communication protocols. With tracing, taking a full trace of the activity ID spans begin/end from all hosts allow reconstructing the activity IDs parent/child relationships, so we typically only need to extract information about activity ID span begin/end with parent/child info to a tracer. Using activity IDs from a kernel profiler is trickier, because we do not have access to the complete span begin/end trace to reconstruct the activity ID parent/child relationship. This is where I suspect we'd want to introduce a notion of "activity ID stack", so a profiler could reconstruct the currently active stack of activity IDs for the current thread by walking that stack. This profil
Re: [RFC][PATCH] tracing: Introduce restart_critical_timings()
On 2024-03-20 13:58, Steven Rostedt wrote: On Wed, 20 Mar 2024 13:15:39 -0400 Mathieu Desnoyers wrote: I would like to introduce restart_critical_timings() and place it in locations that have this behavior. Is there any way you could move this to need_resched() rather than sprinkle those everywhere ? Because need_resched() itself does not mean it's going to schedule immediately. I looked at a few locations that need_resched() is called. Most are in idle code where the critical timings are already handled. I'm not sure I'd add it for places like mm/memory.c or drivers/md/bcache/btree.c. A lot of places look to use it more for PREEMPT_NONE situations as a open coded cond_resched(). The main reason this one is particularly an issue, is that it spins as long as the owner is still running. Which may be some time, as here it was 7ms. What I think we should be discussing here is how calling need_resched() should interact with the latency tracked by critical timings. AFAIU, when code explicitly calls need_resched() in a loop, there are two cases: - need_resched() returns false: This means the loop can continue without causing long latency on the system. Technically we could restart the critical timings at this point. - need_resched() returns true: This means the loop should exit quickly and call the scheduler. I would not reset the critical timings there, as whatever code is executed between need_resched() returning true and calling the scheduler is adding to latency. Having stop/start critical timings around idle loops seems to just be an optimization over that. As for mm and driver/md code, what is wrong with doing a critical timings reset when need_resched() returns false ? It would prevent a whole class of false-positives rather than playing whack-a-mole with those that pop up. Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com
Re: [RFC][PATCH] tracing: Introduce restart_critical_timings()
On 2024-03-20 12:20, Steven Rostedt wrote: From: Steven Rostedt (Google) I'm debugging some latency issues on a Chromebook and the preemptirqsoff tracer hit this: # tracer: preemptirqsoff # # preemptirqsoff latency trace v1.1.5 on 5.15.148-21853-g165fd2387469-dirty # # latency: 7813 us, #60/60, CPU#1 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:2) #- #| task: init-1 (uid:0 nice:0 policy:0 rt_prio:0) #- # => started at: rwsem_optimistic_spin # => ended at: rwsem_optimistic_spin # # #_--=> CPU# # / _-=> irqs-off # | / _=> need-resched # || / _---=> hardirq/softirq # ||| / _--=> preempt-depth # / _-=> migrate-disable # | / delay # cmd pid || time | caller # \ /|| \|/ <...>-1 1...1.0us!: rwsem_optimistic_spin+0x20/0x194 <-rwsem_optimistic_spin+0x20/0x194 <...>-1 1.N.1. 7813us : rwsem_optimistic_spin+0x140/0x194 <-rwsem_optimistic_spin+0x140/0x194 <...>-1 1.N.1. 7814us+: tracer_preempt_on+0x4c/0x6a <-rwsem_optimistic_spin+0x140/0x194 <...>-1 1.N.1. 7824us : => rwsem_optimistic_spin+0x140/0x194 => rwsem_down_write_slowpath+0xc9/0x3fe => copy_process+0xd08/0x1b8a => kernel_clone+0x94/0x256 => __x64_sys_clone+0x7a/0x9a => do_syscall_64+0x51/0xa1 => entry_SYSCALL_64_after_hwframe+0x5c/0xc6 Which isn't a real issue, as it's in the rwsem_optimistic_spin() code that spins with interrupts enabled, preempt disabled, and checks for need_resched(). If it is true, it breaks out and schedules. Hence, it's hiding real issues, because it can spin for a very long time and this is not the source of the latency I'm tracking down. I would like to introduce restart_critical_timings() and place it in locations that have this behavior. Is there any way you could move this to need_resched() rather than sprinkle those everywhere ? 
Thanks, Mathieu Signed-off-by: Steven Rostedt (Google) diff --git a/include/linux/irqflags.h b/include/linux/irqflags.h index 147feebd508c..e9f97f60bfc0 100644 --- a/include/linux/irqflags.h +++ b/include/linux/irqflags.h @@ -145,6 +145,13 @@ do { \ # define start_critical_timings() do { } while (0) #endif +/* Used in spins that check need_resched() with preemption disabled */ +static inline void restart_critical_timings(void) +{ + stop_critical_timings(); + start_critical_timings(); +} + #ifdef CONFIG_DEBUG_IRQFLAGS extern void warn_bogus_irq_restore(void); #define raw_check_bogus_irq_restore() \ diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c index 2340b6d90ec6..fa7b43e53d20 100644 --- a/kernel/locking/rwsem.c +++ b/kernel/locking/rwsem.c @@ -27,6 +27,7 @@ #include #include #include +#include #include #ifndef CONFIG_PREEMPT_RT @@ -780,6 +781,7 @@ rwsem_spin_on_owner(struct rw_semaphore *sem) */ barrier(); + restart_critical_timings(); if (need_resched() || !owner_on_cpu(owner)) { state = OWNER_NONSPINNABLE; break; @@ -912,6 +914,7 @@ static bool rwsem_optimistic_spin(struct rw_semaphore *sem) * a writer, need_resched() check needs to be done here. */ if (owner_state != OWNER_WRITER) { + restart_critical_timings(); if (need_resched()) break; if (rt_task(current) && -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com
Re: [PATCH v2] tracing: Limit trace_marker writes to just 4K
On 2024-03-04 22:34, Steven Rostedt wrote: From: "Steven Rostedt (Google)" Limit the max print event of trace_marker to just 4K string size. This must also be less than the amount that can be held by a trace_seq along with the text that is before the output (like the task name, PID, CPU, state, etc). As trace_seq is made to handle large events (some greater than 4K). Make the max size of a trace_marker write event be 4K which is guaranteed to fit in the trace_seq buffer. Suggested-by: Mathieu Desnoyers From my perspective I only attempted to clarify the point Linus made about limiting the trace_marker input to 4kB. Feel adapt the Suggested-by tag accordingly. Reviewed-by: Mathieu Desnoyers Thanks, Mathieu Signed-off-by: Steven Rostedt (Google) --- Changes since v1: https://lore.kernel.org/linux-trace-kernel/20240304192710.4c996...@gandalf.local.home/ - Just make the max limit 4K and not half of the trace_seq size. The trace_seq is already made to handle events greater than 4k. kernel/trace/trace.c | 10 +- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c index 8198bfc54b58..d16b95ca58a7 100644 --- a/kernel/trace/trace.c +++ b/kernel/trace/trace.c @@ -7293,6 +7293,8 @@ tracing_free_buffer_release(struct inode *inode, struct file *filp) return 0; } +#define TRACE_MARKER_MAX_SIZE 4096 + static ssize_t tracing_mark_write(struct file *filp, const char __user *ubuf, size_t cnt, loff_t *fpos) @@ -7320,6 +7322,9 @@ tracing_mark_write(struct file *filp, const char __user *ubuf, if ((ssize_t)cnt < 0) return -EINVAL; + if (cnt > TRACE_MARKER_MAX_SIZE) + cnt = TRACE_MARKER_MAX_SIZE; + meta_size = sizeof(*entry) + 2; /* add '\0' and possible '\n' */ again: size = cnt + meta_size; @@ -7328,11 +7333,6 @@ tracing_mark_write(struct file *filp, const char __user *ubuf, if (cnt < FAULTED_SIZE) size += FAULTED_SIZE - cnt; - if (size > TRACE_SEQ_BUFFER_SIZE) { - cnt -= size - TRACE_SEQ_BUFFER_SIZE; - goto again; - } - buffer = 
tr->array_buffer.buffer; event = __trace_buffer_lock_reserve(buffer, TRACE_PRINT, size, tracing_gen_ctx()); -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com
Re: [PATCH] tracing: Have trace_marker writes be just half of TRACE_SEQ_SIZE
On 2024-03-04 21:37, Steven Rostedt wrote: On Mon, 4 Mar 2024 21:35:38 -0500 Steven Rostedt wrote: And it's not for debugging, it's for validation of assumptions made about an upper bound limit defined for a compile-time check, so as the code evolves issues are caught early. validating is debugging. Did Linus put you up to this? To test me to see if I'm learning how to say "No" ;-) No, he did not. I genuinely think that validating size limits like this either at compile time or, when they can vary at runtime like in this case, with a dynamic check, decreases the cognitive load on the reviewers. We can then assume that whatever limit was put in place is actually enforced and not just wishful thinking. If the "header" size upper bound is not validated at runtime, there is not much point in adding the BUILD_BUG_ON() based on that value in the first place, and you should then just add a runtime check that you don't overflow the output buffer before writing the output to it. Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com
Re: [PATCH] tracing: Have trace_marker writes be just half of TRACE_SEQ_SIZE
On 2024-03-04 20:59, Steven Rostedt wrote: On Mon, 4 Mar 2024 20:42:39 -0500 Mathieu Desnoyers wrote: #define TRACE_OUTPUT_META_DATA_MAX_LEN 80 and a runtime check in the code generating this header. This would avoid adding an unchecked upper limit. That would be a lot of complex code that is for debugging something that has never happened in the past 16 years and Linus has already reprimanded me on doing such things. Is it more complex than remembering the iterator position in print_trace_fmt() right before: if (tr->trace_flags & TRACE_ITER_CONTEXT_INFO) { if (iter->iter_flags & TRACE_FILE_LAT_FMT) trace_print_lat_context(iter); else trace_print_context(iter); } and then checking just after that the offset is not beyond 128 bytes ? Perhaps there is even something internal to "iter" that already knows about this offset (?). This would catch any context going beyond its planned upper limit early. Failing early and not just in rare corner cases is good. And it's not for debugging, it's for validation of assumptions made about an upper bound limit defined for a compile-time check, so as the code evolves issues are caught early. Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com
Re: [PATCH] tracing: Have trace_marker writes be just half of TRACE_SEQ_SIZE
On 2024-03-04 20:41, Steven Rostedt wrote: On Mon, 4 Mar 2024 20:35:16 -0500 Steven Rostedt wrote: BUILD_BUG_ON(TRACING_MARK_MAX_SIZE + sizeof(meta data stuff...) > TRACE_SEQ_SIZE); That's not the meta size I'm worried about. The sizeof(meta data) is the raw event binary data, which is not related to the size of the event output. # cd /sys/kernel/tracing # echo hello > trace_marker # cat trace [..] <...>-999 [001] . 2296.140373: tracing_mark_write: hello ^^^ This is the meta data that is added to trace_seq That said, the meta data is most likely not going to be more than 128 bytes (it shouldn't be more than 80). I could do as you suggest and create a separate TRACE_MARKER_SIZE and just make sure that it's less than TRACE_SEQ_BUFFER_SIZE (as that's the size of the content) by 128 bytes. /* Added meta data should not be more than 128 bytes */ BUILD_BUG_ON((TRACE_MARKER_MAX_SIZE + 128) > TRACE_SEQ_BUFFER_SIZE); Bonus points if you add #define TRACE_OUTPUT_META_DATA_MAX_LEN 80 and a runtime check in the code generating this header. This would avoid adding an unchecked upper limit. Thanks, Mathieu -- Steve -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com
Re: [PATCH] tracing: Have trace_marker writes be just half of TRACE_SEQ_SIZE
On 2024-03-04 20:35, Steven Rostedt wrote: On Mon, 4 Mar 2024 20:15:57 -0500 Mathieu Desnoyers wrote: On 2024-03-04 19:27, Steven Rostedt wrote: From: "Steven Rostedt (Google)" Since the size of trace_seq's buffer is the max an event can output, have the trace_marker be half of the entire TRACE_SEQ_SIZE, which is 4K. That will keep writes that has meta data written from being dropped (but reported), because the total output of the print event is greater than what the trace_seq can hold. Defining the trace_mark limit in terms of "TRACE_SEQ_SIZE / 2" seems backwards. It's basically using a define of the maximum buffer size for the pretty-printing output and defining the maximum input size of a system call to half of that. I'd rather see, in a header file shared between tracing mark write implementation and output implementation: #define TRACING_MARK_MAX_SIZE 4096 and then a static validation that this input fits within your pretty printing output in the output implementation file: BUILD_BUG_ON(TRACING_MARK_MAX_SIZE + sizeof(meta data stuff...) > TRACE_SEQ_SIZE); That's not the meta size I'm worried about. The sizeof(meta data) is the raw event binary data, which is not related to the size of the event output. # cd /sys/kernel/tracing # echo hello > trace_marker # cat trace [..] <...>-999 [001] . 2296.140373: tracing_mark_write: hello ^^^ This is the meta data that is added to trace_seq If this header has a known well-defined upper-limit length, then use that in the BUILD_BUG_ON(). Thanks, Mathieu -- Steve -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com
Re: [PATCH] tracing: Have trace_marker writes be just half of TRACE_SEQ_SIZE
On 2024-03-04 19:27, Steven Rostedt wrote: From: "Steven Rostedt (Google)" Since the size of trace_seq's buffer is the max an event can output, have the trace_marker be half of the entire TRACE_SEQ_SIZE, which is 4K. That will keep writes that has meta data written from being dropped (but reported), because the total output of the print event is greater than what the trace_seq can hold. Defining the trace_mark limit in terms of "TRACE_SEQ_SIZE / 2" seems backwards. It's basically using a define of the maximum buffer size for the pretty-printing output and defining the maximum input size of a system call to half of that. I'd rather see, in a header file shared between tracing mark write implementation and output implementation: #define TRACING_MARK_MAX_SIZE 4096 and then a static validation that this input fits within your pretty printing output in the output implementation file: BUILD_BUG_ON(TRACING_MARK_MAX_SIZE + sizeof(meta data stuff...) > TRACE_SEQ_SIZE); This way we clearly document that the tracing mark write input limit is 4kB, rather than something derived from the size of an output buffer. Thanks, Mathieu Signed-off-by: Steven Rostedt (Google) --- kernel/trace/trace.c | 16 +++- 1 file changed, 11 insertions(+), 5 deletions(-) diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c index 8198bfc54b58..d68544aef65f 100644 --- a/kernel/trace/trace.c +++ b/kernel/trace/trace.c @@ -7320,6 +7320,17 @@ tracing_mark_write(struct file *filp, const char __user *ubuf, if ((ssize_t)cnt < 0) return -EINVAL; + /* +* TRACE_SEQ_SIZE is the total size of trace_seq buffer used +* for output. As the print event outputs more than just +* the string written, keep it smaller than the trace_seq +* as it could drop the event if the extra data makes it bigger +* than what the trace_seq can hold. Half he TRACE_SEQ_SIZE +* is more than enough. 
+*/ + if (cnt > TRACE_SEQ_SIZE / 2) + cnt = TRACE_SEQ_SIZE / 2; + meta_size = sizeof(*entry) + 2; /* add '\0' and possible '\n' */ again: size = cnt + meta_size; @@ -7328,11 +7339,6 @@ tracing_mark_write(struct file *filp, const char __user *ubuf, if (cnt < FAULTED_SIZE) size += FAULTED_SIZE - cnt; - if (size > TRACE_SEQ_BUFFER_SIZE) { - cnt -= size - TRACE_SEQ_BUFFER_SIZE; - goto again; - } - buffer = tr->array_buffer.buffer; event = __trace_buffer_lock_reserve(buffer, TRACE_PRINT, size, tracing_gen_ctx()); -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com
Re: [PATCH] tracing: Remove precision vsnprintf() check from print event
On 2024-03-04 17:43, Steven Rostedt wrote: From: "Steven Rostedt (Google)" This reverts 60be76eeabb3d ("tracing: Add size check when printing trace_marker output"). The only reason the precision check was added was because of a bug that miscalculated the write size of the string into the ring buffer and it truncated it removing the terminating nul byte. On reading the trace it crashed the kernel. But this was due to the bug in the code that happened during development and should never happen in practice. If anything, the precision can hide bugs where the string in the ring buffer isn't nul terminated and it will not be checked. Link: https://lore.kernel.org/all/c7e7af1a-d30f-4d18-b8e5-af1ef5800...@linux.ibm.com/ Link: https://lore.kernel.org/linux-trace-kernel/20240227125706.04279...@gandalf.local.home Link: https://lore.kernel.org/all/20240302111244.3a167...@gandalf.local.home/ Reported-by: Sachin Sant Fixes: 60be76eeabb3d ("tracing: Add size check when printing trace_marker output") Signed-off-by: Steven Rostedt (Google) This is a step in the right direction IMHO. Reviewed-by: Mathieu Desnoyers Just out of curiosity, is there anything to prevent trace_marker from writing a huge string into the ring buffer in the first place ? Is this limit implicit and based on the page size of the architecture or is it a known fixed limit ? (e.g. 4kB strings). It appears to currently be limited by #define TRACE_SEQ_BUFFER_SIZE (PAGE_SIZE * 2 - \ (sizeof(struct seq_buf) + sizeof(size_t) + sizeof(int))) checked within tracing_mark_write(). I would have hoped for a simpler limit (e.g. 4kB) consistent across architectures. But that would belong to a separate change. Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com
Re: [PATCH] ring-buffer: use READ_ONCE() to read cpu_buffer->commit_page in concurrent environment
On 2024-03-01 10:49, Steven Rostedt wrote: On Fri, 1 Mar 2024 13:37:18 +0800 linke wrote: So basically you are worried about read-tearing? That wasn't mentioned in the change log. Yes. Sorry for making this confused, I am not very familiar with this and still learning. No problem. We all have to learn this anyway. Funny part is, if the above timestamp read did a tear, then this would definitely not match, and would return the correct value. That is, the buffer is not empty because the only way for this to get corrupted is if something is in the process of writing to it. I agree with you here. commit = rb_page_commit(commit_page); But if commit_page above is the result of a torn read, the commit field read by rb_page_commit() may not represent a valid value. But commit_page is a word length, and I will argue that any compiler that tears "long" words is broken. ;-) [ For those tuning in, we are discussing ring_buffer_iter_empty() "commit_page = cpu_buffer->commit_page;" racy load. ] I counter-argue that real-world compilers *are* broken based on your personal definition, but we have to deal with them, as documented in Documentation/memory-barriers.txt (see below). What is the added overhead of using a READ_ONCE() there ? Why are we wasting effort trying to guess the compiler behavior if the real-world performance impact is insignificant ? Quote from memory-barrier.txt explaining the purpose of {READ,WRITE}_ONCE(): "(*) For aligned memory locations whose size allows them to be accessed with a single memory-reference instruction, prevents "load tearing" and "store tearing," in which a single large access is replaced by multiple smaller accesses." 
I agree that {READ,WRITE}_ONCE() are really not needed at initialization, when there are demonstrably no concurrent accesses to the data. But trying to eliminate {READ,WRITE}_ONCE() on concurrently accessed fields just adds complexity, prevents static analyzers from properly understanding the code and reporting issues, and just obfuscates the code. Thanks, Mathieu In this case, READ_ONCE() is only needed for the commit_page. But we can at least keep the READ_ONCE() on the commit_page just because it is used in the next instruction. -- Steve -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com
Re: [PATCH v3] ring-buffer: Simplify reservation with try_cmpxchg() loop
On 2024-02-20 09:19, Steven Rostedt wrote: On Mon, 19 Feb 2024 18:20:32 -0500 Steven Rostedt wrote: Instead of using local_add_return() to reserve the ring buffer data, Mathieu Desnoyers suggested using local_cmpxchg(). This would simplify the reservation with the time keeping code. Although, it does not get rid of the double time stamps (before_stamp and write_stamp), using cmpxchg() does get rid of the more complex case when an interrupting event occurs between getting the timestamps and reserving the data, as when that happens, it just tries again instead of dealing with it. Before we had: w = local_read(&tail_page->write); /* get time stamps */ write = local_add_return(length, &tail_page->write); if (write - length == w) { /* do simple case */ } else { /* do complex case */ } By switching the local_add_return() to a local_try_cmpxchg() it can now be: w = local_read(&tail_page->write); again: /* get time stamps */ if (!local_try_cmpxchg(&tail_page->write, &w, w + length)) goto again; /* do simple case */ Something about this logic is causing __rb_next_reserve() to sometimes always return -EAGAIN and triggering the: RB_WARN_ON(cpu_buffer, ++nr_loops > 1000) Which disables the ring buffer. I'm not sure what it is, but until I do, I'm removing the patch from my queue. Try resetting the info->add_timestamp flags to add_ts_default on goto again within __rb_reserve_next(). Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com
Re: [PATCH] dax: add set_dax_nomc() and set_dax_nocache() stub helpers
On 2024-02-16 15:22, Arnd Bergmann wrote: From: Arnd Bergmann In some randconfig builds, the IS_ERR() check appears to not get completely eliminated, resulting in the compiler to insert references to these two functions that cause a link failure: ERROR: modpost: "set_dax_nocache" [drivers/md/dm-mod.ko] undefined! ERROR: modpost: "set_dax_nomc" [drivers/md/dm-mod.ko] undefined! Add more stub functions for the dax-disabled case here to make it build again. Hi Arnd, Note that this is a duplicate of: https://lore.kernel.org/lkml/20240215144633.96437-2-mathieu.desnoy...@efficios.com/ now present in Andrew's tree. The only differences are the subject, commit message and a newline between "set_dax_nomc" and "set_dax_synchronous" in your change. Thanks, Mathieu Fixes: d888f6b0a766 ("dm: treat alloc_dax() -EOPNOTSUPP failure as non-fatal") Reported-by: kernel test robot Closes: https://lore.kernel.org/oe-kbuild-all/202402160420.e4qkwogo-...@intel.com/ Signed-off-by: Arnd Bergmann --- include/linux/dax.h | 12 +--- 1 file changed, 9 insertions(+), 3 deletions(-) diff --git a/include/linux/dax.h b/include/linux/dax.h index df2d52b8a245..4527c10016fb 100644 --- a/include/linux/dax.h +++ b/include/linux/dax.h @@ -64,6 +64,9 @@ void dax_write_cache(struct dax_device *dax_dev, bool wc); bool dax_write_cache_enabled(struct dax_device *dax_dev); bool dax_synchronous(struct dax_device *dax_dev); void set_dax_synchronous(struct dax_device *dax_dev); +void set_dax_nocache(struct dax_device *dax_dev); +void set_dax_nomc(struct dax_device *dax_dev); + size_t dax_recovery_write(struct dax_device *dax_dev, pgoff_t pgoff, void *addr, size_t bytes, struct iov_iter *i); /* @@ -108,6 +111,12 @@ static inline bool dax_synchronous(struct dax_device *dax_dev) static inline void set_dax_synchronous(struct dax_device *dax_dev) { } +static inline void set_dax_nocache(struct dax_device *dax_dev) +{ +} +static inline void set_dax_nomc(struct dax_device *dax_dev) +{ +} static inline bool 
daxdev_mapping_supported(struct vm_area_struct *vma, struct dax_device *dax_dev) { @@ -120,9 +129,6 @@ static inline size_t dax_recovery_write(struct dax_device *dax_dev, } #endif -void set_dax_nocache(struct dax_device *dax_dev); -void set_dax_nomc(struct dax_device *dax_dev); - struct writeback_control; #if defined(CONFIG_BLOCK) && defined(CONFIG_FS_DAX) int dax_add_host(struct dax_device *dax_dev, struct gendisk *disk); -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com
Re: CPU data cache across reboot/kexec for pmem/dax devices
On 2024-02-09 15:15, Dan Williams wrote: Mathieu Desnoyers wrote: Hi Dan, In the context of extracting user-space trace data when the kernel crashes, the LTTng user-space tracer recommends using nvdimm/pmem to reserve an area of physical (volatile) RAM at boot (memmap=nn[KMG]!ss[KMG]), and use the resulting device to create/mount a dax-enabled fs (e.g. ext4). We then use this filesystem to mmap() the shared memory files for the tracer. I want to make sure that the very last events from the userspace tracer written to the memory mapped buffers (mmap()) by userspace are present after a warm-reboot (or kexec/kdump). Note that the LTTng user-space tracer (LTTng-UST) does *not* issue any clflush (or equivalent pmem_persist() from libpmem) for performance reasons: ring buffer data is usually overwritten many times before the system actually crashes, and the only thing we really need to make sure is that the cache lines are not invalidated without write back. So I understand that the main use-case for pmem is nvdimm, and that in order to guarantee persistence of the data on power off an explicit pmem_persist() is needed after each "transaction", but for the needs of tracing, is there some kind of architectural guarantee that the data present in the cpu data cache is not invalidated prior to write back in each of those scenarios ? - reboot with bios explicitly not clearing memory, This one gives me pause, because a trip through the BIOS typically means lots of resets and other low level magic, so this would likely require pushing dirty data out of CPU caches prior to entering the BIOS code paths. So this either needs explicit cache flushing or mapping the memory with write-through semantics. That latter one is not supported in the stack today. I would favor the explicit cache flushing approach over write-through semantics for performance reasons: typically the ring buffers are overwritten, so always storing the data to memory would be wasteful. 
But I would rather do the cache flushing only on crash (die oops/panic notifiers) rather than require the tracer to flush cache lines immediately after each event is produced, again for performance reasons. I have done a crude attempt at registering die oops/panic notifiers from the pmem driver, and flush all pmem areas to memory when die oops/panic notifiers are called (see the untested patch below). I wonder if this is something that would be generally useful ? - kexec/kdump. This should maintain the state of CPU caches. As far as the CPU is concerned it is just long jumping into a new kernel in memory without resetting any CPU cache state. Good to know that this scenario is expected to "Just Work". Thanks, Mathieu diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c index e9898457a7bd..b92f18607d39 100644 --- a/drivers/nvdimm/pmem.c +++ b/drivers/nvdimm/pmem.c @@ -26,12 +26,18 @@ #include #include #include +#include #include #include "pmem.h" #include "btt.h" #include "pfn.h" #include "nd.h" +static int pmem_die_handler(struct notifier_block *self, + unsigned long ev, void *unused); +static int pmem_panic_handler(struct notifier_block *self, + unsigned long ev, void *unused); + static struct device *to_dev(struct pmem_device *pmem) { /* @@ -423,6 +429,9 @@ static void pmem_release_disk(void *__pmem) { struct pmem_device *pmem = __pmem; + atomic_notifier_chain_unregister(&panic_notifier_list, + &pmem->panic_notifier); + unregister_die_notifier(&pmem->die_notifier); dax_remove_host(pmem->disk); kill_dax(pmem->dax_dev); put_dax(pmem->dax_dev); @@ -573,9 +582,20 @@ static int pmem_attach_disk(struct device *dev, goto out_cleanup_dax; dax_write_cache(dax_dev, nvdimm_has_cache(nd_region)); } - rc = device_add_disk(dev, disk, pmem_attribute_groups); + pmem->die_notifier.notifier_call = pmem_die_handler; + pmem->die_notifier.priority = -INT_MAX; + rc = register_die_notifier(&pmem->die_notifier); if (rc) goto out_remove_host; + pmem->panic_notifier.notifier_call = 
pmem_panic_handler; + pmem->panic_notifier.priority = -INT_MAX; + rc = atomic_notifier_chain_register(&panic_notifier_list, + &pmem->panic_notifier); + if (rc) + goto out_unregister_die; + rc = device_add_disk(dev, disk, pmem_attribute_groups); + if (rc) + goto out_unregister_panic; if (devm_add_action_or_reset(dev, pmem_release_disk, pmem)) return -ENOMEM; @@ -587,6 +607,11 @@ static int pmem_attach_disk(struct device *dev, dev_warn(dev, "'badblocks' notification disabled\n"); return
CPU data cache across reboot/kexec for pmem/dax devices
Hi Dan, In the context of extracting user-space trace data when the kernel crashes, the LTTng user-space tracer recommends using nvdimm/pmem to reserve an area of physical (volatile) RAM at boot (memmap=nn[KMG]!ss[KMG]), and use the resulting device to create/mount a dax-enabled fs (e.g. ext4). We then use this filesystem to mmap() the shared memory files for the tracer. I want to make sure that the very last events from the userspace tracer written to the memory mapped buffers (mmap()) by userspace are present after a warm-reboot (or kexec/kdump). Note that the LTTng user-space tracer (LTTng-UST) does *not* issue any clflush (or equivalent pmem_persist() from libpmem) for performance reasons: ring buffer data is usually overwritten many times before the system actually crashes, and the only thing we really need to make sure is that the cache lines are not invalidated without write back. So I understand that the main use-case for pmem is nvdimm, and that in order to guarantee persistence of the data on power off an explicit pmem_persist() is needed after each "transaction", but for the needs of tracing, is there some kind of architectural guarantee that the data present in the cpu data cache is not invalidated prior to write back in each of those scenarios ? - reboot with bios explicitly not clearing memory, - kexec/kdump. Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com
Re: [PATCH v14 4/6] tracing: Allow user-space mapping of the ring-buffer
On 2024-02-05 13:34, Vincent Donnefort wrote: On Mon, Feb 05, 2024 at 11:55:08AM -0500, Mathieu Desnoyers wrote: [...] How are the kernel linear mapping and the userspace mapping made coherent on architectures with virtually aliasing data caches? Ref. https://lore.kernel.org/lkml/20240202210019.88022-1-mathieu.desnoy...@efficios.com/T/#t Hi Mathieu, Thanks for the pointer. We have the exact same problem as DAX. We do modify the data through the kernel linear mapping while user-space can read it through its own. I should probably return an error when used with any of the arch ARM || SPARC || MIPS, until cpu_dcache_is_aliasing() introduces a fine-grained differentiation. You might want to use LTTng's ring buffer approach instead. See https://github.com/lttng/lttng-modules/blob/master/src/lib/ringbuffer/ring_buffer_frontend.c#L1202 lib_ring_buffer_flush_read_subbuf_dcache() Basically, whenever user-space grabs a sub-buffer for reading (through lttng-modules's LTTNG_KERNEL_ABI_RING_BUFFER_GET_SUBBUF ioctl), lttng calls flush_dcache_page() on all pages of this subbuffer (I should really change this to a call to flush_dcache_folio(), which would be more efficient). Note that doing this is not very efficient on architectures which have coherent data caches and incoherent dcache vs icache: in that case, we issue the flush_dcache_page uselessly. I plan on using the new cpu_dcache_is_aliasing() check once/if it makes its way upstream to remove those useless flushes on architectures which define ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE, but do not virtually alias the data cache. The equivalent of LTTng's "get subbuf" operation would be the new TRACE_MMAP_IOCTL_GET_READER ioctl in ftrace AFAIU. Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com
Re: [PATCH v14 4/6] tracing: Allow user-space mapping of the ring-buffer
+ + get_page(page); + vmf->page = page; + vmf->page->mapping = vmf->vma->vm_file->f_mapping; + vmf->page->index = vmf->pgoff; + + return 0; +} + +static void tracing_buffers_mmap_close(struct vm_area_struct *vma) +{ + struct ftrace_buffer_info *info = vma->vm_file->private_data; + struct trace_iterator *iter = &info->iter; + struct trace_array *tr = iter->tr; + + ring_buffer_unmap(iter->array_buffer->buffer, iter->cpu_file); + +#ifdef CONFIG_TRACER_MAX_TRACE + spin_lock(&tr->snapshot_trigger_lock); + if (!WARN_ON(!tr->mapped)) + tr->mapped--; + spin_unlock(&tr->snapshot_trigger_lock); +#endif +} + +static void tracing_buffers_mmap_open(struct vm_area_struct *vma) +{ + struct ftrace_buffer_info *info = vma->vm_file->private_data; + struct trace_iterator *iter = &info->iter; + + WARN_ON(ring_buffer_map(iter->array_buffer->buffer, iter->cpu_file)); +} + +static const struct vm_operations_struct tracing_buffers_vmops = { + .open = tracing_buffers_mmap_open, + .close = tracing_buffers_mmap_close, + .fault = tracing_buffers_mmap_fault, +}; + +static int tracing_buffers_mmap(struct file *filp, struct vm_area_struct *vma) +{ + struct ftrace_buffer_info *info = filp->private_data; + struct trace_iterator *iter = &info->iter; + struct trace_array *tr = iter->tr; + int ret = 0; + + if (vma->vm_flags & VM_WRITE) + return -EPERM; + + vm_flags_mod(vma, VM_DONTCOPY | VM_DONTDUMP, VM_MAYWRITE); + vma->vm_ops = &tracing_buffers_vmops; + +#ifdef CONFIG_TRACER_MAX_TRACE + /* +* We hold mmap_lock here. lockdep would be unhappy if we would now take +* trace_types_lock. Instead use the specific snapshot_trigger_lock. 
+*/ + spin_lock(&tr->snapshot_trigger_lock); + if (tr->snapshot || tr->mapped == UINT_MAX) { + spin_unlock(&tr->snapshot_trigger_lock); + return -EBUSY; + } + tr->mapped++; + spin_unlock(&tr->snapshot_trigger_lock); + + /* Wait for update_max_tr() to observe iter->tr->mapped */ + if (tr->mapped == 1) + synchronize_rcu(); +#endif + ret = ring_buffer_map(iter->array_buffer->buffer, iter->cpu_file); +#ifdef CONFIG_TRACER_MAX_TRACE + if (ret) { + spin_lock(&tr->snapshot_trigger_lock); + iter->tr->mapped--; + spin_unlock(&tr->snapshot_trigger_lock); + } +#endif + return ret; +} + static const struct file_operations tracing_buffers_fops = { .open = tracing_buffers_open, .read = tracing_buffers_read, @@ -8681,6 +8794,7 @@ static const struct file_operations tracing_buffers_fops = { .splice_read= tracing_buffers_splice_read, .unlocked_ioctl = tracing_buffers_ioctl, .llseek = no_llseek, + .mmap = tracing_buffers_mmap, }; static ssize_t diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h index bd312e9afe25..8a96e7a89e6b 100644 --- a/kernel/trace/trace.h +++ b/kernel/trace/trace.h @@ -336,6 +336,7 @@ struct trace_array { boolallocated_snapshot; spinlock_t snapshot_trigger_lock; unsigned intsnapshot; + unsigned intmapped; unsigned long max_latency; #ifdef CONFIG_FSNOTIFY struct dentry *d_max_latency; -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com
Re: [RFC PATCH v2 8/8] dax: Fix incorrect list of dcache aliasing architectures
On 2024-01-30 22:13, Dan Williams wrote: Dave Chinner wrote: On Tue, Jan 30, 2024 at 11:52:55AM -0500, Mathieu Desnoyers wrote: commit d92576f1167c ("dax: does not work correctly with virtual aliasing caches") prevents DAX from building on architectures with virtually aliased dcache with: depends on !(ARM || MIPS || SPARC) This check is too broad (e.g. recent ARMv7 don't have virtually aliased dcaches), and also misses many other architectures with virtually aliased dcache. This is a regression introduced in the v5.13 Linux kernel where the dax mount option is removed for 32-bit ARMv7 boards which have no dcache aliasing, and therefore should work fine with FS_DAX. This was turned into the following implementation of dax_is_supported() by a preparatory change: return !IS_ENABLED(CONFIG_ARM) && !IS_ENABLED(CONFIG_MIPS) && !IS_ENABLED(CONFIG_SPARC); Use dcache_is_aliasing() instead to figure out whether the environment has aliasing dcaches. Fixes: d92576f1167c ("dax: does not work correctly with virtual aliasing caches") Signed-off-by: Mathieu Desnoyers Cc: Andrew Morton Cc: Linus Torvalds Cc: linux...@kvack.org Cc: linux-a...@vger.kernel.org Cc: Dan Williams Cc: Vishal Verma Cc: Dave Jiang Cc: Matthew Wilcox Cc: Arnd Bergmann Cc: Russell King Cc: nvd...@lists.linux.dev Cc: linux-...@vger.kernel.org Cc: linux-fsde...@vger.kernel.org --- include/linux/dax.h | 5 ++--- 1 file changed, 2 insertions(+), 3 deletions(-) diff --git a/include/linux/dax.h b/include/linux/dax.h index cfc8cd4a3eae..f59e604662e4 100644 --- a/include/linux/dax.h +++ b/include/linux/dax.h @@ -5,6 +5,7 @@ #include #include #include +#include typedef unsigned long dax_entry_t; @@ -80,9 +81,7 @@ static inline bool daxdev_mapping_supported(struct vm_area_struct *vma, } static inline bool dax_is_supported(void) { - return !IS_ENABLED(CONFIG_ARM) && - !IS_ENABLED(CONFIG_MIPS) && - !IS_ENABLED(CONFIG_SPARC); + return !dcache_is_aliasing(); Yeah, if this is just a one liner should go into 
fs_dax_get_by_bdev(), similar to the blk_queue_dax() check at the start of the function. I also noticed that device mapper uses fs_dax_get_by_bdev() to determine if it can support DAX, but this patch set does not address that case. Hence it really seems to me like fs_dax_get_by_bdev() is the right place to put this check. Oh, good catch. Yes, I agree this can definitely be pushed down, but then I think it should be pushed down all the way to make alloc_dax() fail. That will need some additional fixups like: diff --git a/drivers/md/dm.c b/drivers/md/dm.c index 8dcabf84d866..a35e60e62440 100644 --- a/drivers/md/dm.c +++ b/drivers/md/dm.c @@ -2126,12 +2126,12 @@ static struct mapped_device *alloc_dev(int minor) md->dax_dev = alloc_dax(md, &dm_dax_ops); if (IS_ERR(md->dax_dev)) { md->dax_dev = NULL; - goto bad; + } else { + set_dax_nocache(md->dax_dev); + set_dax_nomc(md->dax_dev); + if (dax_add_host(md->dax_dev, md->disk)) + goto bad; } - set_dax_nocache(md->dax_dev); - set_dax_nomc(md->dax_dev); - if (dax_add_host(md->dax_dev, md->disk)) - goto bad; } format_dev_t(md->name, MKDEV(_major, minor)); ...to make it not fatal to fail to register the dax_dev. I've had a quick look at other users of alloc_dax() and alloc_dax_region(), and so far I figure that all of those really want to bail out on alloc_dax failure. Is dm.c the only special-case we need to fix to make it non-fatal ? Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com
Re: [RFC PATCH v2 7/8] Introduce dcache_is_aliasing() across all architectures
On 2024-01-30 21:48, Dave Chinner wrote: On Tue, Jan 30, 2024 at 11:52:54AM -0500, Mathieu Desnoyers wrote: Introduce a generic way to query whether the dcache is virtually aliased on all architectures. Its purpose is to ensure that subsystems which are incompatible with virtually aliased data caches (e.g. FS_DAX) can reliably query this. For dcache aliasing, there are three scenarios depending on the architecture. Here is a breakdown based on my understanding: A) The dcache is always aliasing: * arc * csky * m68k (note: shared memory mappings are incoherent ? SHMLBA is missing there.) * sh * parisc /me wonders why the dentry cache aliasing has problems on these systems. Oh, dcache != fs/dcache.c (the VFS dentry cache). Can you please rename this function appropriately so us dumb filesystem people don't confuse cpu data cache configurations with the VFS dentry cache aliasing when we read this code? Something like cpu_dcache_is_aliased(), perhaps? Good point, will do. I'm planning to rename as follows for v3 to eliminate confusion with dentry cache (and with "page cache" in general): ARCH_HAS_CACHE_ALIASING -> ARCH_HAS_CPU_CACHE_ALIASING dcache_is_aliasing() -> cpu_dcache_is_aliasing() I noticed that you suggested "aliased" rather than "aliasing", but I followed what arm64 did for icache_is_aliasing(). Do you have a strong preference one way or another? Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com
[RFC PATCH v2 7/8] Introduce dcache_is_aliasing() across all architectures
Introduce a generic way to query whether the dcache is virtually aliased on all architectures. Its purpose is to ensure that subsystems which are incompatible with virtually aliased data caches (e.g. FS_DAX) can reliably query this. For dcache aliasing, there are three scenarios depending on the architecture. Here is a breakdown based on my understanding:

A) The dcache is always aliasing:

* arc
* csky
* m68k (note: shared memory mappings are incoherent ? SHMLBA is missing there.)
* sh
* parisc

B) The dcache aliasing is statically known or depends on querying CPU state at runtime:

* arm (cache_is_vivt() || cache_is_vipt_aliasing())
* mips (cpu_has_dc_aliases)
* nios2 (NIOS2_DCACHE_SIZE > PAGE_SIZE)
* sparc32 (vac_cache_size > PAGE_SIZE)
* sparc64 (L1DCACHE_SIZE > PAGE_SIZE)
* xtensa (DCACHE_WAY_SIZE > PAGE_SIZE)

C) The dcache is never aliasing:

* alpha
* arm64 (aarch64)
* hexagon
* loongarch (but with incoherent write buffers, which are disabled since commit d23b7795 ("LoongArch: Change SHMLBA from SZ_64K to PAGE_SIZE"))
* microblaze
* openrisc
* powerpc
* riscv
* s390
* um
* x86

Require architectures in A) and B) to select ARCH_HAS_CACHE_ALIASING and implement "dcache_is_aliasing()". Architectures in C) don't select ARCH_HAS_CACHE_ALIASING, and thus dcache_is_aliasing() simply evaluates to "false". Note that this leaves "icache_is_aliasing()" to be implemented as future work. This would be useful to gate features like XIP on architectures which have aliasing dcache-icache but not dcache-dcache. 
Link: https://lore.kernel.org/lkml/20030910210416.ga24...@mail.jlokier.co.uk/ Fixes: d92576f1167c ("dax: does not work correctly with virtual aliasing caches") Signed-off-by: Mathieu Desnoyers Cc: Andrew Morton Cc: Linus Torvalds Cc: linux...@kvack.org Cc: linux-a...@vger.kernel.org Cc: Dan Williams Cc: Vishal Verma Cc: Dave Jiang Cc: Matthew Wilcox Cc: Arnd Bergmann Cc: Russell King Cc: linux-...@vger.kernel.org Cc: nvd...@lists.linux.dev Cc: linux-...@vger.kernel.org --- arch/arc/Kconfig| 1 + arch/arc/include/asm/cachetype.h| 9 + arch/arm/Kconfig| 1 + arch/arm/include/asm/cachetype.h| 2 ++ arch/csky/Kconfig | 1 + arch/csky/include/asm/cachetype.h | 9 + arch/m68k/Kconfig | 1 + arch/m68k/include/asm/cachetype.h | 9 + arch/mips/Kconfig | 1 + arch/mips/include/asm/cachetype.h | 9 + arch/nios2/Kconfig | 1 + arch/nios2/include/asm/cachetype.h | 10 ++ arch/parisc/Kconfig | 1 + arch/parisc/include/asm/cachetype.h | 9 + arch/sh/Kconfig | 1 + arch/sh/include/asm/cachetype.h | 9 + arch/sparc/Kconfig | 1 + arch/sparc/include/asm/cachetype.h | 14 ++ arch/xtensa/Kconfig | 1 + arch/xtensa/include/asm/cachetype.h | 10 ++ include/linux/cacheinfo.h | 6 ++ mm/Kconfig | 6 ++ 22 files changed, 112 insertions(+) create mode 100644 arch/arc/include/asm/cachetype.h create mode 100644 arch/csky/include/asm/cachetype.h create mode 100644 arch/m68k/include/asm/cachetype.h create mode 100644 arch/mips/include/asm/cachetype.h create mode 100644 arch/nios2/include/asm/cachetype.h create mode 100644 arch/parisc/include/asm/cachetype.h create mode 100644 arch/sh/include/asm/cachetype.h create mode 100644 arch/sparc/include/asm/cachetype.h create mode 100644 arch/xtensa/include/asm/cachetype.h diff --git a/arch/arc/Kconfig b/arch/arc/Kconfig index 1b0483c51cc1..969e6740bcf7 100644 --- a/arch/arc/Kconfig +++ b/arch/arc/Kconfig @@ -6,6 +6,7 @@ config ARC def_bool y select ARC_TIMERS + select ARCH_HAS_CACHE_ALIASING select ARCH_HAS_CACHE_LINE_SIZE select ARCH_HAS_DEBUG_VM_PGTABLE select 
ARCH_HAS_DMA_PREP_COHERENT diff --git a/arch/arc/include/asm/cachetype.h b/arch/arc/include/asm/cachetype.h new file mode 100644 index ..290e3cc85845 --- /dev/null +++ b/arch/arc/include/asm/cachetype.h @@ -0,0 +1,9 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +#ifndef __ASM_ARC_CACHETYPE_H +#define __ASM_ARC_CACHETYPE_H + +#include + +#define dcache_is_aliasing() true + +#endif diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig index f8567e95f98b..5adeee5e421f 100644 --- a/arch/arm/Kconfig +++ b/arch/arm/Kconfig @@ -5,6 +5,7 @@ config ARM select ARCH_32BIT_OFF_T select ARCH_CORRECT_STACKTRACE_ON_KRETPROBE if HAVE_KRETPROBES && FRAME_POINTER && !ARM_UNWIND select ARCH_HAS_BINFMT_FLAT + select ARCH_HAS_CACHE_ALIASING select ARCH_HAS_CPU_FINALIZE_INIT if MMU select ARCH_HAS_CURRENT_STACK_POINTER select ARCH_HAS_DEBUG_VIRTUAL if MMU diff --git a/arch/arm/include/
[RFC PATCH v2 8/8] dax: Fix incorrect list of dcache aliasing architectures
commit d92576f1167c ("dax: does not work correctly with virtual aliasing caches") prevents DAX from building on architectures with virtually aliased dcache with: depends on !(ARM || MIPS || SPARC) This check is too broad (e.g. recent ARMv7 don't have virtually aliased dcaches), and also misses many other architectures with virtually aliased dcache. This is a regression introduced in the v5.13 Linux kernel where the dax mount option is removed for 32-bit ARMv7 boards which have no dcache aliasing, and therefore should work fine with FS_DAX. This was turned into the following implementation of dax_is_supported() by a preparatory change: return !IS_ENABLED(CONFIG_ARM) && !IS_ENABLED(CONFIG_MIPS) && !IS_ENABLED(CONFIG_SPARC); Use dcache_is_aliasing() instead to figure out whether the environment has aliasing dcaches. Fixes: d92576f1167c ("dax: does not work correctly with virtual aliasing caches") Signed-off-by: Mathieu Desnoyers Cc: Andrew Morton Cc: Linus Torvalds Cc: linux...@kvack.org Cc: linux-a...@vger.kernel.org Cc: Dan Williams Cc: Vishal Verma Cc: Dave Jiang Cc: Matthew Wilcox Cc: Arnd Bergmann Cc: Russell King Cc: nvd...@lists.linux.dev Cc: linux-...@vger.kernel.org Cc: linux-fsde...@vger.kernel.org --- include/linux/dax.h | 5 ++--- 1 file changed, 2 insertions(+), 3 deletions(-) diff --git a/include/linux/dax.h b/include/linux/dax.h index cfc8cd4a3eae..f59e604662e4 100644 --- a/include/linux/dax.h +++ b/include/linux/dax.h @@ -5,6 +5,7 @@ #include #include #include +#include typedef unsigned long dax_entry_t; @@ -80,9 +81,7 @@ static inline bool daxdev_mapping_supported(struct vm_area_struct *vma, } static inline bool dax_is_supported(void) { - return !IS_ENABLED(CONFIG_ARM) && - !IS_ENABLED(CONFIG_MIPS) && - !IS_ENABLED(CONFIG_SPARC); + return !dcache_is_aliasing(); } #else static inline void *dax_holder(struct dax_device *dax_dev) -- 2.39.2
[RFC PATCH v2 6/8] xfs: Use dax_is_supported()
Use dax_is_supported() to validate whether the architecture has virtually aliased data caches at mount time. Silently disable DAX if dax=always is requested as a mount option on an architecture which does not support DAX. This is relevant for architectures which require a dynamic check to validate whether they have virtually aliased data caches. Fixes: d92576f1167c ("dax: does not work correctly with virtual aliasing caches") Signed-off-by: Mathieu Desnoyers Cc: Chandan Babu R Cc: Darrick J. Wong Cc: linux-...@vger.kernel.org Cc: Andrew Morton Cc: Linus Torvalds Cc: linux...@kvack.org Cc: linux-a...@vger.kernel.org Cc: Dan Williams Cc: Vishal Verma Cc: Dave Jiang Cc: Matthew Wilcox Cc: Arnd Bergmann Cc: Russell King Cc: nvd...@lists.linux.dev Cc: linux-...@vger.kernel.org --- fs/xfs/xfs_iops.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c index a0d77f5f512e..360f640159b0 100644 --- a/fs/xfs/xfs_iops.c +++ b/fs/xfs/xfs_iops.c @@ -1208,7 +1208,7 @@ static bool xfs_inode_should_enable_dax( struct xfs_inode *ip) { - if (!IS_ENABLED(CONFIG_FS_DAX)) + if (!dax_is_supported()) return false; if (xfs_has_dax_never(ip->i_mount)) return false; -- 2.39.2
[RFC PATCH v2 5/8] fuse: Use dax_is_supported()
Use dax_is_supported() to validate whether the architecture has virtually aliased data caches at mount time. Silently disable DAX if dax=always is requested as a mount option on an architecture which does not support DAX. This is relevant for architectures which require a dynamic check to validate whether they have virtually aliased data caches. Fixes: d92576f1167c ("dax: does not work correctly with virtual aliasing caches") Signed-off-by: Mathieu Desnoyers Cc: Miklos Szeredi Cc: linux-fsde...@vger.kernel.org Cc: Andrew Morton Cc: Linus Torvalds Cc: linux...@kvack.org Cc: linux-a...@vger.kernel.org Cc: Dan Williams Cc: Vishal Verma Cc: Dave Jiang Cc: Matthew Wilcox Cc: Arnd Bergmann Cc: Russell King Cc: nvd...@lists.linux.dev Cc: linux-...@vger.kernel.org --- fs/fuse/dax.c | 7 +++ 1 file changed, 7 insertions(+) diff --git a/fs/fuse/dax.c b/fs/fuse/dax.c index 12ef91d170bb..36e1c1abbf8e 100644 --- a/fs/fuse/dax.c +++ b/fs/fuse/dax.c @@ -1336,6 +1336,13 @@ static bool fuse_should_enable_dax(struct inode *inode, unsigned int flags) if (dax_mode == FUSE_DAX_NEVER) return false; + /* +* Silently fallback to 'never' mode if the architecture does +* not support DAX. +*/ + if (!dax_is_supported()) + return false; + /* * fc->dax may be NULL in 'inode' mode when filesystem device doesn't * support DAX, in which case it will silently fallback to 'never' mode. -- 2.39.2
[RFC PATCH v2 4/8] ext4: Use dax_is_supported()
Use dax_is_supported() to validate whether the architecture has virtually aliased data caches at mount time. Mount fails if dax=always is requested as a mount option on an architecture which does not support DAX. This is relevant for architectures which require a dynamic check to validate whether they have virtually aliased data caches. Fixes: d92576f1167c ("dax: does not work correctly with virtual aliasing caches") Signed-off-by: Mathieu Desnoyers Cc: "Theodore Ts'o" Cc: Andreas Dilger Cc: linux-e...@vger.kernel.org Cc: Andrew Morton Cc: Linus Torvalds Cc: linux...@kvack.org Cc: linux-a...@vger.kernel.org Cc: Dan Williams Cc: Vishal Verma Cc: Dave Jiang Cc: Matthew Wilcox Cc: Arnd Bergmann Cc: Russell King Cc: nvd...@lists.linux.dev Cc: linux-...@vger.kernel.org Cc: linux-fsde...@vger.kernel.org --- fs/ext4/super.c | 5 - 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/fs/ext4/super.c b/fs/ext4/super.c index c5fcf377ab1f..f2c11ae3ec29 100644 --- a/fs/ext4/super.c +++ b/fs/ext4/super.c @@ -4743,7 +4743,10 @@ static int ext4_check_feature_compatibility(struct super_block *sb, } if (sbi->s_mount_opt & EXT4_MOUNT_DAX_ALWAYS) { - if (ext4_has_feature_inline_data(sb)) { + if (!dax_is_supported()) { + ext4_msg(sb, KERN_ERR, "DAX unsupported by architecture."); + return -EINVAL; + } else if (ext4_has_feature_inline_data(sb)) { ext4_msg(sb, KERN_ERR, "Cannot use DAX on a filesystem" " that may contain inline data"); return -EINVAL; -- 2.39.2
[RFC PATCH v2 3/8] ext2: Use dax_is_supported()
Use dax_is_supported() to validate whether the architecture has virtually aliased data caches at mount time. Print an error and disable DAX if dax=always is requested as a mount option on an architecture which does not support DAX. This is relevant for architectures which require a dynamic check to validate whether they have virtually aliased data caches. Fixes: d92576f1167c ("dax: does not work correctly with virtual aliasing caches") Signed-off-by: Mathieu Desnoyers Cc: Jan Kara Cc: linux-e...@vger.kernel.org Cc: Andrew Morton Cc: Linus Torvalds Cc: linux...@kvack.org Cc: linux-a...@vger.kernel.org Cc: Dan Williams Cc: Vishal Verma Cc: Dave Jiang Cc: Matthew Wilcox Cc: Arnd Bergmann Cc: Russell King Cc: nvd...@lists.linux.dev Cc: linux-...@vger.kernel.org Cc: linux-fsde...@vger.kernel.org --- fs/ext2/super.c | 6 +- 1 file changed, 5 insertions(+), 1 deletion(-) diff --git a/fs/ext2/super.c b/fs/ext2/super.c index 01f9addc8b1f..30ff57d47ed4 100644 --- a/fs/ext2/super.c +++ b/fs/ext2/super.c @@ -955,7 +955,11 @@ static int ext2_fill_super(struct super_block *sb, void *data, int silent) blocksize = BLOCK_SIZE << le32_to_cpu(sbi->s_es->s_log_block_size); if (test_opt(sb, DAX)) { - if (!sbi->s_daxdev) { + if (!dax_is_supported()) { + ext2_msg(sb, KERN_ERR, + "DAX unsupported by architecture. Turning off DAX."); + clear_opt(sbi->s_mount_opt, DAX); + } else if (!sbi->s_daxdev) { ext2_msg(sb, KERN_ERR, "DAX unsupported by block device. Turning off DAX."); clear_opt(sbi->s_mount_opt, DAX); -- 2.39.2
[RFC PATCH v2 0/8] Introduce dcache_is_aliasing() to fix DAX regression
This commit introduced in v5.13 prevents building FS_DAX on 32-bit ARM, even on ARMv7 which does not have virtually aliased dcaches: commit d92576f1167c ("dax: does not work correctly with virtual aliasing caches") It used to work fine before: I have customers using DAX over pmem on ARMv7, but this regression will likely prevent them from upgrading their kernel. The root of the issue here is the fact that DAX was never designed to handle virtually aliased dcache (VIVT and VIPT with aliased dcache). It touches the pages through their linear mapping, which is not consistent with the userspace mappings on virtually aliased dcaches. This patch series introduces dcache_is_aliasing() with the new Kconfig option ARCH_HAS_CACHE_ALIASING and implements it for all architectures. The implementation of dcache_is_aliasing() is either evaluated to a constant at compile-time or a runtime check, which is what is needed on ARM. With this we can basically narrow down the list of architectures which are unsupported by DAX to those which are really affected. Note that the order of the series was completely changed based on the feedback received on v1, cache_is_aliasing() is renamed to dcache_is_aliasing(), ARCH_HAS_CACHE_ALIASING_DYNAMIC is gone, dcache_is_aliasing() vs ARCH_HAS_CACHE_ALIASING relationship is simplified, and the dax_is_supported() check was moved to its rightful place in all filesystems. 
Feedback is welcome, Thanks, Mathieu Cc: Andrew Morton Cc: Linus Torvalds Cc: linux...@kvack.org Cc: linux-a...@vger.kernel.org Cc: Dan Williams Cc: Vishal Verma Cc: Dave Jiang Cc: Matthew Wilcox Cc: Arnd Bergmann Cc: Russell King Cc: linux-...@vger.kernel.org Cc: nvd...@lists.linux.dev Cc: linux-fsde...@vger.kernel.org Mathieu Desnoyers (8): dax: Introduce dax_is_supported() erofs: Use dax_is_supported() ext2: Use dax_is_supported() ext4: Use dax_is_supported() fuse: Use dax_is_supported() xfs: Use dax_is_supported() Introduce dcache_is_aliasing() across all architectures dax: Fix incorrect list of dcache aliasing architectures arch/arc/Kconfig| 1 + arch/arc/include/asm/cachetype.h| 9 + arch/arm/Kconfig| 1 + arch/arm/include/asm/cachetype.h| 2 ++ arch/csky/Kconfig | 1 + arch/csky/include/asm/cachetype.h | 9 + arch/m68k/Kconfig | 1 + arch/m68k/include/asm/cachetype.h | 9 + arch/mips/Kconfig | 1 + arch/mips/include/asm/cachetype.h | 9 + arch/nios2/Kconfig | 1 + arch/nios2/include/asm/cachetype.h | 10 ++ arch/parisc/Kconfig | 1 + arch/parisc/include/asm/cachetype.h | 9 + arch/sh/Kconfig | 1 + arch/sh/include/asm/cachetype.h | 9 + arch/sparc/Kconfig | 1 + arch/sparc/include/asm/cachetype.h | 14 ++ arch/xtensa/Kconfig | 1 + arch/xtensa/include/asm/cachetype.h | 10 ++ fs/Kconfig | 1 - fs/erofs/super.c| 5 - fs/ext2/super.c | 6 +- fs/ext4/super.c | 5 - fs/fuse/dax.c | 7 +++ fs/xfs/xfs_iops.c | 2 +- include/linux/cacheinfo.h | 6 ++ include/linux/dax.h | 9 + mm/Kconfig | 6 ++ 29 files changed, 142 insertions(+), 5 deletions(-) create mode 100644 arch/arc/include/asm/cachetype.h create mode 100644 arch/csky/include/asm/cachetype.h create mode 100644 arch/m68k/include/asm/cachetype.h create mode 100644 arch/mips/include/asm/cachetype.h create mode 100644 arch/nios2/include/asm/cachetype.h create mode 100644 arch/parisc/include/asm/cachetype.h create mode 100644 arch/sh/include/asm/cachetype.h create mode 100644 arch/sparc/include/asm/cachetype.h create mode 100644 
arch/xtensa/include/asm/cachetype.h -- 2.39.2
[RFC PATCH v2 1/8] dax: Introduce dax_is_supported()
Introduce a new dax_is_supported() static inline to check whether the architecture supports DAX. This replaces the following fs/Kconfig:FS_DAX dependency: depends on !(ARM || MIPS || SPARC) This is done in preparation for its use by each filesystem supporting the dax mount option to validate whether dax is indeed supported. This is done in preparation for using dcache_is_aliasing() in a following change which will properly support architectures which detect dcache aliasing at runtime. Fixes: d92576f1167c ("dax: does not work correctly with virtual aliasing caches") Signed-off-by: Mathieu Desnoyers Cc: Andrew Morton Cc: Linus Torvalds Cc: linux...@kvack.org Cc: linux-a...@vger.kernel.org Cc: Dan Williams Cc: Vishal Verma Cc: Dave Jiang Cc: Matthew Wilcox Cc: Arnd Bergmann Cc: Russell King Cc: nvd...@lists.linux.dev Cc: linux-...@vger.kernel.org Cc: linux-fsde...@vger.kernel.org --- fs/Kconfig | 1 - include/linux/dax.h | 10 ++ 2 files changed, 10 insertions(+), 1 deletion(-) diff --git a/fs/Kconfig b/fs/Kconfig index 42837617a55b..e5efdb3b276b 100644 --- a/fs/Kconfig +++ b/fs/Kconfig @@ -56,7 +56,6 @@ endif # BLOCK config FS_DAX bool "File system based Direct Access (DAX) support" depends on MMU - depends on !(ARM || MIPS || SPARC) depends on ZONE_DEVICE || FS_DAX_LIMITED select FS_IOMAP select DAX diff --git a/include/linux/dax.h b/include/linux/dax.h index b463502b16e1..cfc8cd4a3eae 100644 --- a/include/linux/dax.h +++ b/include/linux/dax.h @@ -78,6 +78,12 @@ static inline bool daxdev_mapping_supported(struct vm_area_struct *vma, return false; return dax_synchronous(dax_dev); } +static inline bool dax_is_supported(void) +{ + return !IS_ENABLED(CONFIG_ARM) && + !IS_ENABLED(CONFIG_MIPS) && + !IS_ENABLED(CONFIG_SPARC); +} #else static inline void *dax_holder(struct dax_device *dax_dev) { @@ -122,6 +128,10 @@ static inline size_t dax_recovery_write(struct dax_device *dax_dev, { return 0; } +static inline bool dax_is_supported(void) +{ + return false; +} #endif void 
set_dax_nocache(struct dax_device *dax_dev); -- 2.39.2
Re: [RFC PATCH 4/7] ext2: Use dax_is_supported()
On 2024-01-30 06:33, Jan Kara wrote: On Mon 29-01-24 16:06:28, Mathieu Desnoyers wrote: Use dax_is_supported() to validate whether the architecture has virtually aliased caches at mount time. This is relevant for architectures which require a dynamic check to validate whether they have virtually aliased data caches (ARCH_HAS_CACHE_ALIASING_DYNAMIC=y). Fixes: d92576f1167c ("dax: does not work correctly with virtual aliasing caches") Signed-off-by: Mathieu Desnoyers Cc: Jan Kara Cc: linux-e...@vger.kernel.org Cc: Andrew Morton Cc: Linus Torvalds Cc: linux...@kvack.org Cc: linux-a...@vger.kernel.org Cc: Dan Williams Cc: Vishal Verma Cc: Dave Jiang Cc: Matthew Wilcox Cc: nvd...@lists.linux.dev Cc: linux-...@vger.kernel.org Looks good to me (although I share Dave's opinion it would be nice to CC the whole series to fsdevel - luckily we have lore these days so it is not that tedious to find the whole series :)). Feel free to add: Acked-by: Jan Kara Hi Jan, Thanks for looking at it! I will do significant changes for v2, so I will hold on before integrating your acked-by if it's OK with you. Thanks! Mathieu Honza --- fs/ext2/super.c | 14 +++--- 1 file changed, 7 insertions(+), 7 deletions(-) diff --git a/fs/ext2/super.c b/fs/ext2/super.c index 01f9addc8b1f..0398e7a90eb6 100644 --- a/fs/ext2/super.c +++ b/fs/ext2/super.c @@ -585,13 +585,13 @@ static int parse_options(char *options, struct super_block *sb, set_opt(opts->s_mount_opt, XIP); fallthrough; case Opt_dax: -#ifdef CONFIG_FS_DAX - ext2_msg(sb, KERN_WARNING, - "DAX enabled. Warning: EXPERIMENTAL, use at your own risk"); - set_opt(opts->s_mount_opt, DAX); -#else - ext2_msg(sb, KERN_INFO, "dax option not supported"); -#endif + if (dax_is_supported()) { + ext2_msg(sb, KERN_WARNING, +"DAX enabled. 
Warning: EXPERIMENTAL, use at your own risk"); + set_opt(opts->s_mount_opt, DAX); + } else { + ext2_msg(sb, KERN_INFO, "dax option not supported"); + } break; #if defined(CONFIG_QUOTA) -- 2.39.2 -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com
Re: [RFC PATCH 7/7] xfs: Use dax_is_supported()
On 2024-01-29 21:38, Dave Chinner wrote: On Mon, Jan 29, 2024 at 04:06:31PM -0500, Mathieu Desnoyers wrote: Use dax_is_supported() to validate whether the architecture has virtually aliased caches at mount time. This is relevant for architectures which require a dynamic check to validate whether they have virtually aliased data caches (ARCH_HAS_CACHE_ALIASING_DYNAMIC=y). Where's the rest of this patchset? I have no idea what dax_is_supported() actually does, how it interacts with CONFIG_FS_DAX, etc. If you are changing anything to do with FSDAX, then cc-ing the -entire- patchset to linux-fsdevel is absolutely necessary so the entire patchset lands in our inboxes and not just a random patch from the middle of a bigger change. Sorry, I will Cc linux-fsdevel on all patches for the next round. Meanwhile you can find the whole series on lore: https://lore.kernel.org/lkml/20240129210631.193493-1-mathieu.desnoy...@efficios.com/ [...] Assuming that I understand what dax_is_supported() is doing, this change isn't right. We're just setting the DAX configuration flags from the mount options here, we don't validate them until we've parsed all options and eliminated conflicts and rejected conflicting options. We validate whether the options are appropriate for the underlying hardware configuration later in the mount process. dax=always suitability is checked in xfs_setup_dax_always(), called later in the mount process when we have enough context and support to open storage devices and check them for DAX support. If the hardware does not support DAX then we simply turn off DAX support; we do not reject the mount as this change does. dax=inode and dax=never are valid options on all configurations, even those without FSDAX support or with hardware that is not capable of using DAX. dax=inode only affects how an inode is instantiated in cache - if the inode has a flag that says "use DAX" and dax is supportable by the hardware, then we turn on DAX for that inode. 
Otherwise we just use the normal non-dax IO paths. Again, we don't error out the filesystem if DAX is not supported, we just don't turn it on. This check is done in xfs_inode_should_enable_dax() and I think all you need to do is replace the IS_ENABLED(CONFIG_FS_DAX) with a dax_is_supported() call... Thanks a lot for the detailed explanation. You are right, I will move the dax_is_supported() check to xfs_inode_should_enable_dax(). Mathieu -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com
Re: [RFC PATCH 0/7] Introduce cache_is_aliasing() to fix DAX regression
On 2024-01-29 16:22, Dan Williams wrote: Mathieu Desnoyers wrote: This commit introduced in v5.13 prevents building FS_DAX on 32-bit ARM, even on ARMv7 which does not have virtually aliased dcaches: commit d92576f1167c ("dax: does not work correctly with virtual aliasing caches") It used to work fine before: I have customers using dax over pmem on ARMv7, but this regression will likely prevent them from upgrading their kernel. The root of the issue here is the fact that DAX was never designed to handle virtually aliased dcache (VIVT and VIPT with aliased dcache). It touches the pages through their linear mapping, which is not consistent with the userspace mappings on virtually aliased dcaches. This patch series introduces cache_is_aliasing() with new Kconfig options: * ARCH_HAS_CACHE_ALIASING * ARCH_HAS_CACHE_ALIASING_DYNAMIC and implements it for all architectures. The "DYNAMIC" implementation implements cache_is_aliasing() as a runtime check, which is what is needed on architectures like 32-bit ARMV6 and ARMV6K. With this we can basically narrow down the list of architectures which are unsupported by DAX to those which are really affected. Feedback is welcome, Hi Mathieu, this looks good overall, just some quibbling about the ordering. Thanks for having a look ! I would introduce dax_is_supported() with the current overly broad interpretation of "!(ARM || MIPS || SPARC)" using IS_ENABLED(), then fixup the filesystems to use the new helper, and finally go back and convert dax_is_supported() to use cache_is_aliasing() internally. Will do. Separately, it is not clear to me why ARCH_HAS_CACHE_ALIASING_DYNAMIC needs to exist. As long as all paths that care are calling cache_is_aliasing() then whether it is dynamic or not is something only the compiler cares about. If those dynamic archs do not want to pay the .text size increase they can always do CONFIG_FS_DAX=n, right? Good point. It will help reduce complexity and improve test coverage. 
I also intend to rename "cache_is_aliasing()" to "dcache_is_aliasing()", so if we introduce an "icache_is_aliasing()" in the future, it won't be confusing. Having aliasing icache-dcache but not dcache-dcache seems to be fairly common. So basically: If an arch selects ARCH_HAS_CACHE_ALIASING, it implements dcache_is_aliasing() (for now), and eventually we can implement icache_is_aliasing() as well. Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com
[RFC PATCH 5/7] ext4: Use dax_is_supported()
Use dax_is_supported() to validate whether the architecture has virtually aliased caches at mount time. This is relevant for architectures which require a dynamic check to validate whether they have virtually aliased data caches (ARCH_HAS_CACHE_ALIASING_DYNAMIC=y). Fixes: d92576f1167c ("dax: does not work correctly with virtual aliasing caches") Signed-off-by: Mathieu Desnoyers Cc: "Theodore Ts'o" Cc: Andreas Dilger Cc: linux-e...@vger.kernel.org Cc: Andrew Morton Cc: Linus Torvalds Cc: linux...@kvack.org Cc: linux-a...@vger.kernel.org Cc: Dan Williams Cc: Vishal Verma Cc: Dave Jiang Cc: Matthew Wilcox Cc: nvd...@lists.linux.dev Cc: linux-...@vger.kernel.org --- fs/ext4/super.c | 52 - 1 file changed, 25 insertions(+), 27 deletions(-) diff --git a/fs/ext4/super.c b/fs/ext4/super.c index c5fcf377ab1f..9e0606289239 100644 --- a/fs/ext4/super.c +++ b/fs/ext4/super.c @@ -2359,34 +2359,32 @@ static int ext4_parse_param(struct fs_context *fc, struct fs_parameter *param) return ext4_parse_test_dummy_encryption(param, ctx); case Opt_dax: case Opt_dax_type: -#ifdef CONFIG_FS_DAX - { - int type = (token == Opt_dax) ? - Opt_dax : result.uint_32; - - switch (type) { - case Opt_dax: - case Opt_dax_always: - ctx_set_mount_opt(ctx, EXT4_MOUNT_DAX_ALWAYS); - ctx_clear_mount_opt2(ctx, EXT4_MOUNT2_DAX_NEVER); - break; - case Opt_dax_never: - ctx_set_mount_opt2(ctx, EXT4_MOUNT2_DAX_NEVER); - ctx_clear_mount_opt(ctx, EXT4_MOUNT_DAX_ALWAYS); - break; - case Opt_dax_inode: - ctx_clear_mount_opt(ctx, EXT4_MOUNT_DAX_ALWAYS); - ctx_clear_mount_opt2(ctx, EXT4_MOUNT2_DAX_NEVER); - /* Strictly for printing options */ - ctx_set_mount_opt2(ctx, EXT4_MOUNT2_DAX_INODE); - break; + if (dax_is_supported()) { + int type = (token == Opt_dax) ? 
+ Opt_dax : result.uint_32; + + switch (type) { + case Opt_dax: + case Opt_dax_always: + ctx_set_mount_opt(ctx, EXT4_MOUNT_DAX_ALWAYS); + ctx_clear_mount_opt2(ctx, EXT4_MOUNT2_DAX_NEVER); + break; + case Opt_dax_never: + ctx_set_mount_opt2(ctx, EXT4_MOUNT2_DAX_NEVER); + ctx_clear_mount_opt(ctx, EXT4_MOUNT_DAX_ALWAYS); + break; + case Opt_dax_inode: + ctx_clear_mount_opt(ctx, EXT4_MOUNT_DAX_ALWAYS); + ctx_clear_mount_opt2(ctx, EXT4_MOUNT2_DAX_NEVER); + /* Strictly for printing options */ + ctx_set_mount_opt2(ctx, EXT4_MOUNT2_DAX_INODE); + break; + } + return 0; + } else { + ext4_msg(NULL, KERN_INFO, "dax option not supported"); + return -EINVAL; } - return 0; - } -#else - ext4_msg(NULL, KERN_INFO, "dax option not supported"); - return -EINVAL; -#endif case Opt_data_err: if (result.uint_32 == Opt_data_err_abort) ctx_set_mount_opt(ctx, m->mount_opt); -- 2.39.2
[RFC PATCH 7/7] xfs: Use dax_is_supported()
Use dax_is_supported() to validate whether the architecture has virtually aliased caches at mount time. This is relevant for architectures which require a dynamic check to validate whether they have virtually aliased data caches (ARCH_HAS_CACHE_ALIASING_DYNAMIC=y). Fixes: d92576f1167c ("dax: does not work correctly with virtual aliasing caches") Signed-off-by: Mathieu Desnoyers Cc: Chandan Babu R Cc: Darrick J. Wong Cc: linux-...@vger.kernel.org Cc: Andrew Morton Cc: Linus Torvalds Cc: linux...@kvack.org Cc: linux-a...@vger.kernel.org Cc: Dan Williams Cc: Vishal Verma Cc: Dave Jiang Cc: Matthew Wilcox Cc: nvd...@lists.linux.dev Cc: linux-...@vger.kernel.org --- fs/xfs/xfs_super.c | 20 ++-- 1 file changed, 14 insertions(+), 6 deletions(-) diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c index 764304595e8b..b27ecb11db66 100644 --- a/fs/xfs/xfs_super.c +++ b/fs/xfs/xfs_super.c @@ -1376,14 +1376,22 @@ xfs_fs_parse_param( case Opt_nodiscard: parsing_mp->m_features &= ~XFS_FEAT_DISCARD; return 0; -#ifdef CONFIG_FS_DAX case Opt_dax: - xfs_mount_set_dax_mode(parsing_mp, XFS_DAX_ALWAYS); - return 0; + if (dax_is_supported()) { + xfs_mount_set_dax_mode(parsing_mp, XFS_DAX_ALWAYS); + return 0; + } else { + xfs_warn(parsing_mp, "dax option not supported."); + return -EINVAL; + } case Opt_dax_enum: - xfs_mount_set_dax_mode(parsing_mp, result.uint_32); - return 0; -#endif + if (dax_is_supported()) { + xfs_mount_set_dax_mode(parsing_mp, result.uint_32); + return 0; + } else { + xfs_warn(parsing_mp, "dax option not supported."); + return -EINVAL; + } /* Following mount options will be removed in September 2025 */ case Opt_ikeep: xfs_fs_warn_deprecated(fc, param, XFS_FEAT_IKEEP, true); -- 2.39.2
[RFC PATCH 6/7] fuse: Introduce fuse_dax_is_supported()
Use dax_is_supported() in addition to IS_ENABLED(CONFIG_FUSE_DAX) to validate whether CONFIG_FUSE_DAX is enabled and the architecture does not have virtually aliased caches. This is relevant for architectures which require a dynamic check to validate whether they have virtually aliased data caches (ARCH_HAS_CACHE_ALIASING_DYNAMIC=y). Fixes: d92576f1167c ("dax: does not work correctly with virtual aliasing caches") Signed-off-by: Mathieu Desnoyers Cc: Miklos Szeredi Cc: linux-fsde...@vger.kernel.org Cc: Andrew Morton Cc: Linus Torvalds Cc: linux...@kvack.org Cc: linux-a...@vger.kernel.org Cc: Dan Williams Cc: Vishal Verma Cc: Dave Jiang Cc: Matthew Wilcox Cc: nvd...@lists.linux.dev Cc: linux-...@vger.kernel.org --- fs/fuse/file.c | 2 +- fs/fuse/fuse_i.h| 36 +- fs/fuse/inode.c | 47 +++-- fs/fuse/virtio_fs.c | 4 ++-- 4 files changed, 62 insertions(+), 27 deletions(-) diff --git a/fs/fuse/file.c b/fs/fuse/file.c index a660f1f21540..133ac8524064 100644 --- a/fs/fuse/file.c +++ b/fs/fuse/file.c @@ -3247,6 +3247,6 @@ void fuse_init_file_inode(struct inode *inode, unsigned int flags) init_waitqueue_head(&fi->page_waitq); fi->writepages = RB_ROOT; - if (IS_ENABLED(CONFIG_FUSE_DAX)) + if (fuse_dax_is_supported()) fuse_dax_inode_init(inode, flags); } diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h index 1df83eebda92..1cbe37106669 100644 --- a/fs/fuse/fuse_i.h +++ b/fs/fuse/fuse_i.h @@ -31,6 +31,7 @@ #include #include #include +#include /** Default max number of pages that can be used in a single read request */ #define FUSE_DEFAULT_MAX_PAGES_PER_REQ 32 @@ -979,6 +980,38 @@ static inline void fuse_sync_bucket_dec(struct fuse_sync_bucket *bucket) rcu_read_unlock(); } +#ifdef CONFIG_FUSE_DAX +static inline struct fuse_inode_dax *fuse_inode_get_dax(struct fuse_inode *inode) +{ + return inode->dax; +} + +static inline enum fuse_dax_mode fuse_conn_get_dax_mode(struct fuse_conn *fc) +{ + return fc->dax_mode; +} + +static inline struct fuse_conn_dax *fuse_conn_get_dax(struct 
fuse_conn *fc) +{ + return fc->dax; +} +#else +static inline struct fuse_inode_dax *fuse_inode_get_dax(struct fuse_inode *inode) +{ + return NULL; +} + +static inline enum fuse_dax_mode fuse_conn_get_dax_mode(struct fuse_conn *fc) +{ + return FUSE_DAX_INODE_DEFAULT; +} + +static inline struct fuse_conn_dax *fuse_conn_get_dax(struct fuse_conn *fc) +{ + return NULL; +} +#endif + /** Device operations */ extern const struct file_operations fuse_dev_operations; @@ -1324,7 +1357,8 @@ void fuse_free_conn(struct fuse_conn *fc); /* dax.c */ -#define FUSE_IS_DAX(inode) (IS_ENABLED(CONFIG_FUSE_DAX) && IS_DAX(inode)) +#define fuse_dax_is_supported()(IS_ENABLED(CONFIG_FUSE_DAX) && dax_is_supported()) +#define FUSE_IS_DAX(inode) (fuse_dax_is_supported() && IS_DAX(inode)) ssize_t fuse_dax_read_iter(struct kiocb *iocb, struct iov_iter *to); ssize_t fuse_dax_write_iter(struct kiocb *iocb, struct iov_iter *from); diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c index 2a6d44f91729..030e6ce5486d 100644 --- a/fs/fuse/inode.c +++ b/fs/fuse/inode.c @@ -108,7 +108,7 @@ static struct inode *fuse_alloc_inode(struct super_block *sb) if (!fi->forget) goto out_free; - if (IS_ENABLED(CONFIG_FUSE_DAX) && !fuse_dax_inode_alloc(sb, fi)) + if (fuse_dax_is_supported() && !fuse_dax_inode_alloc(sb, fi)) goto out_free_forget; return &fi->inode; @@ -126,9 +126,8 @@ static void fuse_free_inode(struct inode *inode) mutex_destroy(&fi->mutex); kfree(fi->forget); -#ifdef CONFIG_FUSE_DAX - kfree(fi->dax); -#endif + if (fuse_dax_is_supported()) + kfree(fuse_inode_get_dax(fi)); kmem_cache_free(fuse_inode_cachep, fi); } @@ -361,7 +360,7 @@ void fuse_change_attributes(struct inode *inode, struct fuse_attr *attr, invalidate_inode_pages2(inode->i_mapping); } - if (IS_ENABLED(CONFIG_FUSE_DAX)) + if (fuse_dax_is_supported()) fuse_dax_dontcache(inode, attr->flags); } @@ -856,14 +855,16 @@ static int fuse_show_options(struct seq_file *m, struct dentry *root) if (sb->s_bdev && sb->s_blocksize != FUSE_DEFAULT_BLKSIZE) 
seq_printf(m, ",blksize=%lu", sb->s_blocksize); } -#ifdef CONFIG_FUSE_DAX - if (fc->dax_mode == FUSE_DAX_ALWAYS) - seq_puts(m, ",dax=always"); - else if (fc->dax_mode == FUSE_DAX_NEVER) - seq_puts(m, ",dax=never"); - else if (fc->dax_mode == FUSE_DAX_INODE_USER) - seq_puts(m, ",dax=inode"); -#endi
[RFC PATCH 2/7] dax: Fix incorrect list of cache aliasing architectures
fs/Kconfig:FS_DAX prevents DAX from building on architectures with virtually aliased dcache with: depends on !(ARM || MIPS || SPARC) This check is too broad (e.g. recent ARMv7 don't have virtually aliased dcaches), and also misses many other architectures with virtually aliased dcache. This is a regression introduced in the v5.13 Linux kernel where the dax mount option is removed for 32-bit ARMv7 boards which have no dcache aliasing, and therefore should work fine with FS_DAX. Use this instead in Kconfig to prevent FS_DAX from being built on architectures with virtually aliased dcache: depends on !ARCH_HAS_CACHE_ALIASING For architectures which detect dcache aliasing at runtime, introduce a new dax_is_supported() static inline which uses "cache_is_aliasing()" to figure out whether the environment has aliasing dcaches. This new dax_is_supported() helper will be used in each filesystem supporting the dax mount option to validate whether dax is indeed supported. Fixes: d92576f1167c ("dax: does not work correctly with virtual aliasing caches") Signed-off-by: Mathieu Desnoyers Cc: Andrew Morton Cc: Linus Torvalds Cc: linux...@kvack.org Cc: linux-a...@vger.kernel.org Cc: Dan Williams Cc: Vishal Verma Cc: Dave Jiang Cc: Matthew Wilcox Cc: nvd...@lists.linux.dev Cc: linux-...@vger.kernel.org --- fs/Kconfig | 2 +- include/linux/dax.h | 9 + 2 files changed, 10 insertions(+), 1 deletion(-) diff --git a/fs/Kconfig b/fs/Kconfig index 42837617a55b..6746fe403761 100644 --- a/fs/Kconfig +++ b/fs/Kconfig @@ -56,7 +56,7 @@ endif # BLOCK config FS_DAX bool "File system based Direct Access (DAX) support" depends on MMU - depends on !(ARM || MIPS || SPARC) + depends on !ARCH_HAS_CACHE_ALIASING depends on ZONE_DEVICE || FS_DAX_LIMITED select FS_IOMAP select DAX diff --git a/include/linux/dax.h b/include/linux/dax.h index b463502b16e1..8c595b04deeb 100644 --- a/include/linux/dax.h +++ b/include/linux/dax.h @@ -5,6 +5,7 @@ #include #include #include +#include typedef unsigned long 
dax_entry_t; @@ -78,6 +79,10 @@ static inline bool daxdev_mapping_supported(struct vm_area_struct *vma, return false; return dax_synchronous(dax_dev); } +static inline bool dax_is_supported(void) +{ + return !cache_is_aliasing(); +} #else static inline void *dax_holder(struct dax_device *dax_dev) { @@ -122,6 +127,10 @@ static inline size_t dax_recovery_write(struct dax_device *dax_dev, { return 0; } +static inline bool dax_is_supported(void) +{ + return false; +} #endif void set_dax_nocache(struct dax_device *dax_dev); -- 2.39.2
[RFC PATCH 4/7] ext2: Use dax_is_supported()
Use dax_is_supported() to validate whether the architecture has virtually aliased caches at mount time. This is relevant for architectures which require a dynamic check to validate whether they have virtually aliased data caches (ARCH_HAS_CACHE_ALIASING_DYNAMIC=y). Fixes: d92576f1167c ("dax: does not work correctly with virtual aliasing caches") Signed-off-by: Mathieu Desnoyers Cc: Jan Kara Cc: linux-e...@vger.kernel.org Cc: Andrew Morton Cc: Linus Torvalds Cc: linux...@kvack.org Cc: linux-a...@vger.kernel.org Cc: Dan Williams Cc: Vishal Verma Cc: Dave Jiang Cc: Matthew Wilcox Cc: nvd...@lists.linux.dev Cc: linux-...@vger.kernel.org --- fs/ext2/super.c | 14 +++--- 1 file changed, 7 insertions(+), 7 deletions(-) diff --git a/fs/ext2/super.c b/fs/ext2/super.c index 01f9addc8b1f..0398e7a90eb6 100644 --- a/fs/ext2/super.c +++ b/fs/ext2/super.c @@ -585,13 +585,13 @@ static int parse_options(char *options, struct super_block *sb, set_opt(opts->s_mount_opt, XIP); fallthrough; case Opt_dax: -#ifdef CONFIG_FS_DAX - ext2_msg(sb, KERN_WARNING, - "DAX enabled. Warning: EXPERIMENTAL, use at your own risk"); - set_opt(opts->s_mount_opt, DAX); -#else - ext2_msg(sb, KERN_INFO, "dax option not supported"); -#endif + if (dax_is_supported()) { + ext2_msg(sb, KERN_WARNING, +"DAX enabled. Warning: EXPERIMENTAL, use at your own risk"); + set_opt(opts->s_mount_opt, DAX); + } else { + ext2_msg(sb, KERN_INFO, "dax option not supported"); + } break; #if defined(CONFIG_QUOTA) -- 2.39.2
[RFC PATCH 1/7] Introduce cache_is_aliasing() across all architectures
Introduce a generic way to query whether the dcache is virtually aliased on all architectures. Its purpose is to ensure that subsystems which are incompatible with virtually aliased caches (e.g. FS_DAX) can reliably query this.

For dcache aliasing, there are three scenarios depending on the architecture. Here is a breakdown based on my understanding:

A) The dcache is always aliasing: (ARCH_HAS_CACHE_ALIASING=y)

* arm V4, V5 (CPU_CACHE_VIVT)
* arc
* csky
* m68k (note: shared memory mappings are incoherent? SHMLBA is missing there.)
* sh
* parisc

B) The dcache aliasing depends on querying CPU state at runtime: (ARCH_HAS_CACHE_ALIASING_DYNAMIC=y)

* arm V6, V6K (CPU_CACHE_VIPT) (cache_is_vipt_aliasing())
* mips (cpu_has_dc_aliases)
* nios2 (NIOS2_DCACHE_SIZE > PAGE_SIZE)
* sparc32 (vac_cache_size > PAGE_SIZE)
* sparc64 (L1DCACHE_SIZE > PAGE_SIZE)
* xtensa (DCACHE_WAY_SIZE > PAGE_SIZE)

C) The dcache is never aliasing:

* arm V7, V7M (unless ARM V6 or V6K are also present) (CPU_CACHE_VIPT)
* alpha
* arm64 (aarch64)
* hexagon
* loongarch (but with incoherent write buffers, which are disabled since commit d23b7795 ("LoongArch: Change SHMLBA from SZ_64K to PAGE_SIZE"))
* microblaze
* openrisc
* powerpc
* riscv
* s390
* um
* x86

Link: https://lore.kernel.org/lkml/20030910210416.ga24...@mail.jlokier.co.uk/ Signed-off-by: Mathieu Desnoyers Cc: Andrew Morton Cc: Linus Torvalds Cc: linux...@kvack.org Cc: linux-a...@vger.kernel.org Cc: Dan Williams Cc: Vishal Verma Cc: Dave Jiang Cc: Matthew Wilcox Cc: linux-...@vger.kernel.org Cc: nvd...@lists.linux.dev --- arch/arc/Kconfig| 1 + arch/arm/include/asm/cachetype.h| 3 +++ arch/arm/mm/Kconfig | 2 ++ arch/csky/Kconfig | 1 + arch/m68k/Kconfig | 1 + arch/mips/Kconfig | 1 + arch/mips/include/asm/cachetype.h | 9 + arch/nios2/Kconfig | 1 + arch/nios2/include/asm/cachetype.h | 10 ++ arch/parisc/Kconfig | 1 + arch/sh/Kconfig | 1 + arch/sparc/Kconfig | 1 + arch/sparc/include/asm/cachetype.h | 14 ++ arch/xtensa/Kconfig | 1 + 
arch/xtensa/include/asm/cachetype.h | 10 ++ include/linux/cacheinfo.h | 8 mm/Kconfig | 10 ++ 17 files changed, 75 insertions(+) create mode 100644 arch/mips/include/asm/cachetype.h create mode 100644 arch/nios2/include/asm/cachetype.h create mode 100644 arch/sparc/include/asm/cachetype.h create mode 100644 arch/xtensa/include/asm/cachetype.h diff --git a/arch/arc/Kconfig b/arch/arc/Kconfig index 1b0483c51cc1..969e6740bcf7 100644 --- a/arch/arc/Kconfig +++ b/arch/arc/Kconfig @@ -6,6 +6,7 @@ config ARC def_bool y select ARC_TIMERS + select ARCH_HAS_CACHE_ALIASING select ARCH_HAS_CACHE_LINE_SIZE select ARCH_HAS_DEBUG_VM_PGTABLE select ARCH_HAS_DMA_PREP_COHERENT diff --git a/arch/arm/include/asm/cachetype.h b/arch/arm/include/asm/cachetype.h index e8c30430be33..b03054b35c74 100644 --- a/arch/arm/include/asm/cachetype.h +++ b/arch/arm/include/asm/cachetype.h @@ -16,6 +16,9 @@ extern unsigned int cacheid; #define cache_is_vipt()cacheid_is(CACHEID_VIPT) #define cache_is_vipt_nonaliasing()cacheid_is(CACHEID_VIPT_NONALIASING) #define cache_is_vipt_aliasing() cacheid_is(CACHEID_VIPT_ALIASING) +#ifdef CONFIG_ARCH_HAS_CACHE_ALIASING_DYNAMIC +#define cache_is_aliasing()cache_is_vipt_aliasing() +#endif #define icache_is_vivt_asid_tagged() cacheid_is(CACHEID_ASID_TAGGED) #define icache_is_vipt_aliasing() cacheid_is(CACHEID_VIPT_I_ALIASING) #define icache_is_pipt() cacheid_is(CACHEID_PIPT) diff --git a/arch/arm/mm/Kconfig b/arch/arm/mm/Kconfig index c164cde50243..23af93cdc03d 100644 --- a/arch/arm/mm/Kconfig +++ b/arch/arm/mm/Kconfig @@ -539,9 +539,11 @@ config CPU_CACHE_NOP bool config CPU_CACHE_VIVT + select ARCH_HAS_CACHE_ALIASING bool config CPU_CACHE_VIPT + select ARCH_HAS_CACHE_ALIASING_DYNAMIC if CPU_V6 || CPU_V6K bool config CPU_CACHE_FA diff --git a/arch/csky/Kconfig b/arch/csky/Kconfig index cf2a6fd7dff8..439d7640deb8 100644 --- a/arch/csky/Kconfig +++ b/arch/csky/Kconfig @@ -2,6 +2,7 @@ config CSKY def_bool y select ARCH_32BIT_OFF_T + select ARCH_HAS_CACHE_ALIASING 
select ARCH_HAS_DMA_PREP_COHERENT select ARCH_HAS_GCOV_PROFILE_ALL select ARCH_HAS_SYNC_DMA_FOR_CPU diff --git a/arch/m68k/Kconfig b/arch/m68k/Kconfig index 4b3e93cac723..216338704f0a 100644 --- a/arch/m68k/Kconfig +++ b/arch/m68k/Kconfig @@ -3,6 +3,7 @@ config M68K bool default y select ARCH_32BIT_OFF_T + select ARCH_HAS_CACHE_ALIASING select ARCH_HAS_BINFMT_FLAT select ARCH_HAS_CP
[RFC PATCH 0/7] Introduce cache_is_aliasing() to fix DAX regression
This commit introduced in v5.13 prevents building FS_DAX on 32-bit ARM, even on ARMv7 which does not have virtually aliased dcaches: commit d92576f1167c ("dax: does not work correctly with virtual aliasing caches") It used to work fine before: I have customers using dax over pmem on ARMv7, but this regression will likely prevent them from upgrading their kernel. The root of the issue here is the fact that DAX was never designed to handle virtually aliased dcache (VIVT and VIPT with aliased dcache). It touches the pages through their linear mapping, which is not consistent with the userspace mappings on virtually aliased dcaches. This patch series introduces cache_is_aliasing() with new Kconfig options: * ARCH_HAS_CACHE_ALIASING * ARCH_HAS_CACHE_ALIASING_DYNAMIC and implements it for all architectures. The "DYNAMIC" implementation implements cache_is_aliasing() as a runtime check, which is what is needed on architectures like 32-bit ARMV6 and ARMV6K. With this we can basically narrow down the list of architectures which are unsupported by DAX to those which are really affected. 
Feedback is welcome, Thanks, Mathieu Cc: Andrew Morton Cc: Linus Torvalds Cc: linux...@kvack.org Cc: linux-a...@vger.kernel.org Cc: Dan Williams Cc: Vishal Verma Cc: Dave Jiang Cc: Matthew Wilcox Cc: linux-...@vger.kernel.org Cc: nvd...@lists.linux.dev Mathieu Desnoyers (7): Introduce cache_is_aliasing() across all architectures dax: Fix incorrect list of cache aliasing architectures erofs: Use dax_is_supported() ext2: Use dax_is_supported() ext4: Use dax_is_supported() fuse: Introduce fuse_dax_is_supported() xfs: Use dax_is_supported() arch/arc/Kconfig| 1 + arch/arm/include/asm/cachetype.h| 3 ++ arch/arm/mm/Kconfig | 2 ++ arch/csky/Kconfig | 1 + arch/m68k/Kconfig | 1 + arch/mips/Kconfig | 1 + arch/mips/include/asm/cachetype.h | 9 + arch/nios2/Kconfig | 1 + arch/nios2/include/asm/cachetype.h | 10 ++ arch/parisc/Kconfig | 1 + arch/sh/Kconfig | 1 + arch/sparc/Kconfig | 1 + arch/sparc/include/asm/cachetype.h | 14 arch/xtensa/Kconfig | 1 + arch/xtensa/include/asm/cachetype.h | 10 ++ fs/Kconfig | 2 +- fs/erofs/super.c| 10 +++--- fs/ext2/super.c | 14 fs/ext4/super.c | 52 ++--- fs/fuse/file.c | 2 +- fs/fuse/fuse_i.h| 36 +++- fs/fuse/inode.c | 47 +- fs/fuse/virtio_fs.c | 4 +-- fs/xfs/xfs_super.c | 20 +++ include/linux/cacheinfo.h | 8 + include/linux/dax.h | 9 + mm/Kconfig | 10 ++ 27 files changed, 198 insertions(+), 73 deletions(-) create mode 100644 arch/mips/include/asm/cachetype.h create mode 100644 arch/nios2/include/asm/cachetype.h create mode 100644 arch/sparc/include/asm/cachetype.h create mode 100644 arch/xtensa/include/asm/cachetype.h -- 2.39.2
Re: [RESEND] [PATCH] eventfs: Have inodes have unique inode numbers
On 2024-01-26 15:12, Steven Rostedt wrote:

[...]

diff --git a/fs/tracefs/inode.c b/fs/tracefs/inode.c
index e1b172c0e091..2187be6d7b23 100644
--- a/fs/tracefs/inode.c
+++ b/fs/tracefs/inode.c
@@ -223,13 +223,41 @@ static const struct inode_operations tracefs_file_inode_operations = {
 	.setattr	= tracefs_setattr,
 };

+/* Copied from get_next_ino() but adds allocation for multiple inodes */
+#define LAST_INO_BATCH 1024
+#define LAST_INO_MASK (~(LAST_INO_BATCH - 1))
+static DEFINE_PER_CPU(unsigned int, last_ino);
+
+unsigned int tracefs_get_next_ino(int files)
+{
+	unsigned int *p = &get_cpu_var(last_ino);
+	unsigned int res = *p;
+
+#ifdef CONFIG_SMP
+	/* Check if adding files+1 overflows */

How does it handle a @files input where:

* (files+1 > LAST_INO_BATCH) ?
* (files+1 == LAST_INO_BATCH) ?

+	if (unlikely(!res || (res & LAST_INO_MASK) != ((res + files + 1) & LAST_INO_MASK))) {
+		static atomic_t shared_last_ino;
+		int next = atomic_add_return(LAST_INO_BATCH, &shared_last_ino);
+
+		res = next - LAST_INO_BATCH;
+	}
+#endif
+
+	res++;
+	/* get_next_ino should not provide a 0 inode number */
+	if (unlikely(!res))
+		res++;

I suspect that bumping this res++ in the 0 case can cause inode range reservation issues at (files+1 == LAST_INO_BATCH-1).

Thanks,

Mathieu

+	*p = res + files;
+	put_cpu_var(last_ino);
+	return res;
+}

-- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com
[REGRESSION] v5.13: FS_DAX unavailable on 32-bit ARM
Hi,

This commit introduced in v5.13 prevents building FS_DAX on 32-bit ARM, even on ARMv7 which does not have virtually aliased dcaches: commit d92576f1167c ("dax: does not work correctly with virtual aliasing caches")

It used to work fine before: I have customers using dax over pmem on ARMv7, but this regression will likely prevent them from upgrading their kernel.

The root of the issue here is the fact that DAX was never designed to handle virtually aliased dcache (VIVT and VIPT with aliased dcache). It touches the pages through their linear mapping, which is not consistent with the userspace mappings on virtually aliased dcaches.

I can see a few ways forward to address this:

A) I have prepared a patch series introducing cache_is_aliasing() with new Kconfig options:

* ARCH_HAS_CACHE_ALIASING
* ARCH_HAS_CACHE_ALIASING_DYNAMIC

and implemented it for all architectures. The "DYNAMIC" implementation implements cache_is_aliasing() as a runtime check, which is what is needed on architectures like 32-bit ARM. With this we can basically narrow down the list of architectures which are unsupported by DAX to those which are really affected, without actually solving the issue for architectures with virtually aliased dcaches.

B) Another approach would be to dig into what exactly DAX is doing with the linear mapping, and try to fix this. I see two options there:

B.1) Either we extend vmap to allow vmap'd pages to be aligned on specific multiples, and use a coloring trick based on SHMLBA like userspace mappings do for all DAX internal pages accesses, or

B.2) We introduce flush_dcache_folio() at relevant spots (perhaps dax_flush()?) to synchronize the linear mapping wrt userspace mappings. (would this work?)

Any thoughts on how to best move forward with this issue are welcome.

Thanks,

Mathieu

-- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com
Re: [PATCH] ring-buffer: Simplify reservation with try_cmpxchg() loop
On 2024-01-20 08:47, Steven Rostedt wrote: On Fri, 19 Jan 2024 20:49:36 -0500 Mathieu Desnoyers wrote: Let's say we have the following ktime_get() values (monotonic timestamp value) for a sequence of events: Timestamp (Hex)Encoding in the trace Packet header timestamp begin 0x12345678 64-bit: 0x12345678 Event 1 0x12345678 16-bit: 0x5678 (When decoded, same value as previous timestamp, no overflow) Event 2 0x1234 16-bit: 0x (When decoded, going from "0x5678" to "0x" does not overflow 16-bit) Event 3 0x1235 16-bit: 0x (When decoded, going from "0x" to "0x" overflow 16-bit exactly once which allows the trace reader to reconstruct timestamp 0x1235 from the previous timestamp and the 16-bit timestamp encoding.) Event 4 0x1237 64-bit: 0x1237 (Encoding over 16-bit not possible because going from 0x1235 to 0x1237 would overflow 16-bit twice, which cannot be detected by a trace reader. Therefore use the full 64-bit timestamp in the "large" event header representation.) I think that's basically what I said, but you are just looking at it differently ;-) Or should I say, you are using bits for optimization. Based on your explanation below, we are really talking about different things here. Let me try to reply to your explanation to try to show where what I am doing completely differs from what you have in mind. This will help explain how I handle 16-bit overflow as well. The events are based off of the last injected timestamp. Incorrect. There is no "injected timestamp". There is only a concept of the "current timestamp" as we either write to or read from the event stream. I will take the point of view of the trace reader for the rest of the discussion. The above example, starts with an timestamp injection into the packet header: 0x12345678, with the lsb 16bits ignore. Wrong again. The 16 least significant bits are not ignored. The "current timestamp" is really 0x12345678 when the packet header is read. 
In the packet header you have 0x12345678, in the first event you have 0x5678; how does that get you the timestamp? If that event had 0xFFFF, when the reader reads this packet, it would take the header 0x12345678, chop off (ignore) the 5678, and add the FFFF, right? We need to consider not only what happens when the 16 low bits increase, but also what happens when they end up with a value smaller than the previous 16 low bits. As a summary from our video meeting discussion, there are 3 cases we care about here: packet header timestamp 0x12345678 followed by either:

A) first event delta from packet header timestamp is 0: 16-bit value 0x5678

B) first event delta from packet header timestamp is <= 0xFFFF:
   B.1) 16-bit value example 0x5699 (no 16-bit overflow from previous value)
   B.2) 16-bit value lower than 0x5678 (exactly one 16-bit overflow from previous value)

C) first event delta from packet header timestamp is larger than 0xFFFF, which would cause the low-order 16 bits to have more than one 16-bit overflow from the previous value. The tracer detects this and uses a full 64-bit timestamp instead of the compact 16 bits.

[...] But how do you detect the overflow? That last timestamp, to know if the tsc overflowed or not, needs to be saved and compared. I would assume you have a similar race that I have. Yes, I save a "last timestamp" per buffer, but the race does not matter because of the order in which it is saved wrt the local cmpxchg updating the reserved position. The algorithm looks like:

do {
    - read current reserved position (old pos)
    - read time
    - compute new reserved position (new pos)
} while (cmpxchg(reserved pos, old pos, new pos) != old pos);
[A]
save_last_tsc()

So the last_tsc that is saved is from the timestamp read before the cmpxchg. Yes. If interrupted at [A] by another trace producer, it will compare with an older "last tsc" than the tsc of the event physically located just before the nested event.
This stale "last tsc" has a value which is necessarily lower than the one we would be saving with the save_last_tsc immediately afterwards, which means that in the worst case we end up using a full 64-bit timestamp when in fact we could use a more compact representation. But this race is rare and therefore it does not matter for size. That's equivalent to me "injecting" an absolute value for the same race. Yes. The fact that I only need this last_tsc value for the sake of optimization, and not for computation of a time delta from a previous injected timestamp, makes it possible to handle the race gracefully.
Re: [PATCH] ring-buffer: Simplify reservation with try_cmpxchg() loop
On 2024-01-19 16:42, Steven Rostedt wrote: On Fri, 19 Jan 2024 15:56:21 -0500 Mathieu Desnoyers wrote: So when an overflow happens, you just insert a timestamp, and then events after that are based on that? No. Let's use an example to show how it works. For reference, LTTng uses 5 bits for the event ID and 27 bits for timestamps in the compact event header representation. But for the sake of making this discussion easier, let's assume a tracer would use 16-bit timestamps in the compact representation. Let's say we have the following ktime_get() values (monotonic timestamp value) for a sequence of events:

                                Timestamp (Hex)   Encoding in the trace
Packet header timestamp begin   0x12345678        64-bit: 0x12345678
Event 1                         0x12345678        16-bit: 0x5678
   (When decoded, same value as previous timestamp, no overflow)
Event 2                         0x1234FFFF        16-bit: 0xFFFF
   (When decoded, going from "0x5678" to "0xFFFF" does not overflow 16-bit)
Event 3                         0x12350000        16-bit: 0x0000
   (When decoded, going from "0xFFFF" to "0x0000" overflows 16-bit exactly
   once, which allows the trace reader to reconstruct timestamp 0x12350000
   from the previous timestamp and the 16-bit timestamp encoding.)
Event 4                         0x12370000        64-bit: 0x12370000
   (Encoding over 16-bit not possible because going from 0x12350000 to
   0x12370000 would overflow 16-bit twice, which cannot be detected by a
   trace reader. Therefore use the full 64-bit timestamp in the "large"
   event header representation.)

I think that's basically what I said, but you are just looking at it differently ;-) Or should I say, you are using bits for optimization. Based on your explanation below, we are really talking about different things here. Let me try to reply to your explanation to show where what I am doing completely differs from what you have in mind. This will help explain how I handle 16-bit overflow as well. The events are based off of the last injected timestamp. Incorrect. There is no "injected timestamp". There is only a concept of the "current timestamp" as we either write to or read from the event stream.
I will take the point of view of the trace reader for the rest of the discussion. The above example starts with a timestamp injection into the packet header: 0x12345678, with the lsb 16 bits ignored. Wrong again. The 16 least significant bits are not ignored. The "current timestamp" is really 0x12345678 when the packet header is read. So in actuality, it inserts 0x1234, plus adds a 5678 that represents the creation of the header (like we discussed in the last tracing meeting). There is no "0x1234" reference time. There is only the actual 0x12345678 current time at packet begin, including those 16 low-order bits. The first event has: 0x5678 which is based on the previous injected timestamp of 0x1234. It is not "based" on the previous injected timestamp. It just represents the low-order 16 bits of the timestamp at event 1. The trace reader takes two pieces of information to compute the complete current timestamp for event 1: 1) The current timestamp just before event 1 is read (0x12345678), 2) The low-order 16 bits of event 1 (0x5678). Given that there is no 16-bit overflow when comparing: 0x12345678 & 0xFFFF and 0x5678, we know that the resulting current timestamp for event 1 is: (0x12345678 & ~0xFFFF) + 0x5678 = 0x12345678. Or basically that time did not move between the packet header and event 1. the events go on just using a delta from that until you see it overflow (or too big of a delta to fit in the 16 bits), and you insert a new full timestamp that everything will be based on: 0x1237 No. The following events use the same algorithm I just described: They use this notion of "current timestamp" and the information provided by the new timestamp field in the event header to figure out the updated current timestamp. It is _never_ based on some kind of arbitrary reference, it is always absolute, and is always computed based on the current timestamp and the timestamp field encountered. Now events following that are just a delta from that timestamp.
But you calculate the delta simply by masking out the lower bits. No, again, there is no delta from an arbitrary injected time. It's always computed from this virtual "current time", which applies to an event stream. But how do you detect the overflow? That last timestamp, to know if the tsc overflowed or not, needs to be saved and compared. I would assume you have a similar race that I have. Yes, I save a "last timestamp" per buffer, but the race does not matter because of the order in which it is saved wrt the local cmpxchg updating the reserved position.
Re: [PATCH] ring-buffer: Simplify reservation with try_cmpxchg() loop
On 2024-01-19 10:37, Steven Rostedt wrote: On Fri, 19 Jan 2024 09:40:27 -0500 Mathieu Desnoyers wrote: On 2024-01-18 18:12, Steven Rostedt wrote: From: "Steven Rostedt (Google)" [...] Although, it does not get rid of the double time stamps (before_stamp and write_stamp), using cmpxchg() does get rid of the more complex case when an interrupting event occurs between getting the timestamps and reserving the data, as when that happens, it just tries again instead of dealing with it. I understand that the reason why you need the before/after stamps and their associated complexity is because the Ftrace ring buffer ABI encodes event timestamps as deltas from the previous event within the buffer as a means of compressing the timestamp fields. If the delta cannot be represented in a given number of bits, then it inserts a 64-bit timestamp (not sure if that one is absolute or a delta from the previous event). There's both. An extended timestamp, which is added when the delta is too big, and that too is just a delta from the previous event. And there is the absolute timestamp as well. I could always just use the absolute one. That event came much later. OK This timestamp encoding as deltas between events introduces a strong inter-dependency between consecutive (nested) events, and is the reason why you are stuck with all this timestamp before/after complexity. The Common Trace Format specifies (and LTTng implements) a different way to achieve the same ring buffer space savings as timestamp deltas while keeping the timestamps semantically absolute from a given reference, hence without all the before/after timestamp complexity. You can see the clock value decoding procedure in the CTF2 SPEC RC9 [1] document. That points to this: -8<- 6.3.
Clock value update procedure

To update DEF_CLK_VAL from an unsigned integer field F having the unsigned integer value V and the class C:

Let L be an unsigned integer initialized to, depending on the type property of C:
  "fixed-length-unsigned-integer": the value of the length property of C.
  "variable-length-unsigned-integer": S × 7, where S is the number of bytes which F occupies within the data stream.
Let MASK be an unsigned integer initialized to 2^L − 1.
Let H be an unsigned integer initialized to DEF_CLK_VAL & ~MASK, where "&" is the bitwise AND operator and "~" is the bitwise NOT operator.
Let CUR be an unsigned integer initialized to DEF_CLK_VAL & MASK, where "&" is the bitwise AND operator.
Set DEF_CLK_VAL to:
  H + V, if V ≥ CUR;
  H + MASK + 1 + V, otherwise.
->8-

There's a lot of missing context there, so I don't see how it relates. This explains how the "current time" is reconstructed by a trace reader when loading an event header timestamp field. But for the sake of this discussion we can focus on the less formal explanation of how the tracer generates this timestamp encoding provided below. The basic idea on the producer side is to record the low-order bits of the current timestamp in the event header (truncating the higher-order bits), and fall back on a full 64-bit value if the number of low-order-bit overflows from the previous timestamp is more than one, or if it is impossible to figure out precisely the timestamp of the previous event due to a race. This achieves the same space savings as delta timestamp encoding without introducing the strong event inter-dependency. So when an overflow happens, you just insert a timestamp, and then events after that are based on that? No. Let's use an example to show how it works. For reference, LTTng uses 5 bits for the event ID and 27 bits for timestamps in the compact event header representation. But for the sake of making this discussion easier, let's assume a tracer would use 16-bit timestamps in the compact representation.
Let's say we have the following ktime_get() values (monotonic timestamp value) for a sequence of events:

                                Timestamp (Hex)   Encoding in the trace
Packet header timestamp begin   0x12345678        64-bit: 0x12345678
Event 1                         0x12345678        16-bit: 0x5678
   (When decoded, same value as previous timestamp, no overflow)
Event 2                         0x1234FFFF        16-bit: 0xFFFF
   (When decoded, going from "0x5678" to "0xFFFF" does not overflow 16-bit)
Event 3                         0x12350000        16-bit: 0x0000
   (When decoded, going from "0xFFFF" to "0x0000" overflows 16-bit exactly
   once, which allows the trace reader to reconstruct timestamp 0x12350000
   from the previous timestamp and the 16-bit timestamp encoding.)
Event 4                         0x12370000        64-bit: 0x12370000
   (Encoding over 16-bit not possible because going from 0x12350000 to
   0x12370000 would overflow 16-bit twice, which cannot be detected by a
   trace reader. Therefore use the full 64-bit timestamp in the "large"
   event header representation.)
Re: [PATCH] ring-buffer: Simplify reservation with try_cmpxchg() loop
On 2024-01-18 18:12, Steven Rostedt wrote: From: "Steven Rostedt (Google)" Instead of using local_add_return() to reserve the ring buffer data, Mathieu Desnoyers suggested using local_cmpxchg(). This would simplify the reservation with the time keeping code. I admire the effort of trying to simplify the Ftrace ring buffer by bringing over ideas that worked well for LTTng. :-) As reviewer of the tracing subsystem, I certainly welcome the simplifications. Although, it does not get rid of the double time stamps (before_stamp and write_stamp), using cmpxchg() does get rid of the more complex case when an interrupting event occurs between getting the timestamps and reserving the data, as when that happens, it just tries again instead of dealing with it. I understand that the reason why you need the before/after stamps and their associated complexity is because the Ftrace ring buffer ABI encodes event timestamps as deltas from the previous event within the buffer as a means of compressing the timestamp fields. If the delta cannot be represented in a given number of bits, then it inserts a 64-bit timestamp (not sure if that one is absolute or a delta from the previous event). This timestamp encoding as deltas between events introduces a strong inter-dependency between consecutive (nested) events, and is the reason why you are stuck with all this timestamp before/after complexity. The Common Trace Format specifies (and LTTng implements) a different way to achieve the same ring buffer space savings as timestamp deltas while keeping the timestamps semantically absolute from a given reference, hence without all the before/after timestamp complexity. You can see the clock value decoding procedure in the CTF2 SPEC RC9 [1] document.
The basic idea on the producer side is to record the low-order bits of the current timestamp in the event header (truncating the higher-order bits), and fall back on a full 64-bit value if the number of low-order-bit overflows from the previous timestamp is more than one, or if it is impossible to figure out precisely the timestamp of the previous event due to a race. This achieves the same space savings as delta timestamp encoding without introducing the strong event inter-dependency. The fact that Ftrace exposes this ring buffer binary layout as a user-space ABI makes it tricky to move to the Common Trace Format timestamp encoding. There are clearly huge simplifications that could be made by moving to this scheme though. Is there any way to introduce a different timestamp encoding scheme as an extension to the Ftrace ring buffer ABI ? This would allow us to introduce this simpler scheme and gradually phase out the more complex delta encoding when no users are left. Thoughts ? Thanks, Mathieu [1] https://diamon.org/ctf/files/CTF2-SPECRC-9.0.html#clk-val-update -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com
Re: [PATCH v11 2/5] ring-buffer: Introducing ring-buffer mapping functions
On 2024-01-11 11:17, Vincent Donnefort wrote: In preparation for allowing the user-space to map a ring-buffer, add a set of mapping functions: ring_buffer_{map,unmap}() ring_buffer_map_fault() And controls on the ring-buffer: ring_buffer_map_get_reader() /* swap reader and head */ Mapping the ring-buffer also involves: A unique ID for each subbuf of the ring-buffer, currently they are only identified through their in-kernel VA. A meta-page, where are stored ring-buffer statistics and a description for the current reader Hi Vincent, The LTTng kernel tracer has supported mmap'd buffers for nearly 15 years [1], and has a lot of similarities with this patch series. LTTng has the notion of "subbuffer id" to allow atomically exchanging a "reader" extra subbuffer with the subbuffer to be read. It implements "get subbuffer" / "put subbuffer" ioctls to allow the consumer (reader) to move the currently read subbuffer position. [2] It would not hurt to compare your approach to LTTng and highlight similarities/differences, and the rationale for the differences. Especially when it comes to designing kernel ABIs, it's good to make sure that all bases are covered, because those choices will have lasting impacts. Thanks, Mathieu [1] https://git.lttng.org/?p=lttng-modules.git;a=blob;f=src/lib/ringbuffer/ring_buffer_mmap.c [2] https://git.lttng.org/?p=lttng-modules.git;a=blob;f=src/lib/ringbuffer/ring_buffer_vfs.c -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com
Re: [PATCH] tracing: Add disable-filter-buf option
On 2023-12-15 13:43, Steven Rostedt wrote: On Fri, 15 Dec 2023 13:25:07 -0500 Mathieu Desnoyers wrote: I am not against exposing an ABI that allows userspace to alter the filter behavior. I disagree on the way you plan to expose the ABI. These are no different than the knobs for sched_debug These are very much different. The sched features are enabled at build-time by modifying kernel/sched/features.h. There is no userspace ABI involved. Exposing this option as an ABI in this way exposes too much internal ring buffer implementation details to userspace. There's already lots of exposure using options. The options are just knobs, nothing more. It exposes the following details which IMHO should be hidden or configurable in a way that allows moving to a whole new mechanism which will have significantly different characteristics in the future: It exposes that: - filtering uses a copy to a temporary buffer, and - that this copy is enabled by default. Once exposed, those design constraints become immutable due to ABI. No it is not. There is no such thing as "immutable ABI". The rule is "don't break user space" If this functionality in the kernel goes away, the knob could become a nop, and I doubt any user space will break because of it. That is, the only burden is keeping this option exposed. But it could be just like that light switch that has nothing connected to it. It's still there, but does nothing if you switch it. This knob can act the same way. This does not in anyway prevent future innovation. I am not comfortable with exposing internal ring buffer implementation details to userspace which may or may not be deprecated as no-ops in the future. This will lead to useless clutter. One approach that might be somewhat better would be to only expose those files when a new CONFIG_TRACING_DEBUG option is enabled. 
This way userspace cannot rely on having those files present as part of the ABI, but you would still be free to use them for selftests by skipping the specific selftests if the config option is not enabled. Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com
Re: [PATCH] tracing: Add disable-filter-buf option
On 2023-12-15 12:34, Steven Rostedt wrote: On Fri, 15 Dec 2023 12:24:14 -0500 Mathieu Desnoyers wrote: On 2023-12-15 12:04, Steven Rostedt wrote: On Fri, 15 Dec 2023 10:53:39 -0500 Mathieu Desnoyers wrote: [...] So rather than stacking tons of "on/off" switches for filter features, how about you let users express the mechanism they want to use for filtering with a string instead ? e.g. filter-method="temp-buffer" filter-method="ring-buffer" filter-method="input-arguments" If I add other ways to filter, it will be a separate file to control that, but all options are on/off switches. Even if I add other functionality to the way buffers are created, this will still have the same functionality to turn the entire thing on or off. I'll be clearer then: I think this is a bad ABI. In your reply, you justify this bad ABI by implementation motivations. What's wrong with a way to stop the copying ? I am not against exposing an ABI that allows userspace to alter the filter behavior. I disagree on the way you plan to expose the ABI. Exposing this option as an ABI in this way exposes too much internal ring buffer implementation details to userspace. It exposes the following details which IMHO should be hidden or configurable in a way that allows moving to a whole new mechanism which will have significantly different characteristics in the future: It exposes that: - filtering uses a copy to a temporary buffer, and - that this copy is enabled by default. Once exposed, those design constraints become immutable due to ABI. The option is just a way to say "I don't want to do the copy into the buffer, I want to go directly into it" My main concern is how this concept, once exposed to userspace, becomes not only an internal implementation detail, but a fundamental part of the design which cannot ever go away without breaking the ABI or making parts of the ABI irrelevant. 
I can make a parallel with the scheduler: this is as if the sched feature knobs (which are there only for development/debugging purposes) were all exposed as userspace ABI. This would seriously limit the evolution of the scheduler design in the future. I see this as the same problem at the ring buffer level. I don't care about the implementation, I care about the ABI, and I feel that your reply is not addressing my concern at all. Maybe I don't understand your concern. It's an on/off switch (like all options are). This is just a way to say "I want to indirect copying of the event before filtering or not". Not all tracefs options are booleans. The "current_tracer" file ABI exposes a string input/output parameter. I would recommend the same for the equivalent of a "current_filter" file. The "input-argument" part above may never happen. What's different between tracefs and LTTng is that all events are created by the subsystem, not by me. You don't use the TRACE_EVENT() macro, but you need to manually create each event you care about yourself. It's more of a burden but you also then have the luxury of hooking to the input portion. That is not exposed, and that is by design. As that could possibly make all tracepoints an ABI, and you'll have people like peterz nacking any new tracepoint in the scheduler. He doesn't allow trace events anymore because of that exposure. I'm not arguing for moving to the input-argument scheme, I just used this as a hypothetical example to show why we should not expose internal implementation details to userspace which will prevent future evolution only for the sake of having debugging knobs. Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com
Re: [PATCH] tracing: Add disable-filter-buf option
On 2023-12-15 12:04, Steven Rostedt wrote: On Fri, 15 Dec 2023 10:53:39 -0500 Mathieu Desnoyers wrote: [...] So rather than stacking tons of "on/off" switches for filter features, how about you let users express the mechanism they want to use for filtering with a string instead ? e.g. filter-method="temp-buffer" filter-method="ring-buffer" filter-method="input-arguments" If I add other ways to filter, it will be a separate file to control that, but all options are on/off switches. Even if I add other functionality to the way buffers are created, this will still have the same functionality to turn the entire thing on or off. I'll be clearer then: I think this is a bad ABI. In your reply, you justify this bad ABI by implementation motivations. I don't care about the implementation, I care about the ABI, and I feel that your reply is not addressing my concern at all. Moreover, double-negative boolean options (disable-X=false) are confusing. Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com
Re: [PATCH] tracing: Add disable-filter-buf option
On 2023-12-15 10:26, Steven Rostedt wrote: From: "Steven Rostedt (Google)" Normally, when the filter is enabled, a temporary buffer is created to copy the event data into it to perform the filtering logic. If the filter passes and the event should be recorded, then the event is copied from the temporary buffer into the ring buffer. If the event is to be discarded then it is simply dropped. If another event comes in via an interrupt, it will not use the temporary buffer as it is busy and will write directly into the ring buffer. The disable-filter-buf option will disable the temporary buffer and always write into the ring buffer. This will avoid the copy when the event is to be recorded, but also adds a bit more overhead on the discard, and if another event were to interrupt the event that is to be discarded, then the event will not be removed from the ring buffer but instead converted to padding that will not be read by the reader. Padding will still take up space on the ring buffer. This option can be beneficial if most events are recorded and not discarded, or simply for debugging the discard functionality of the ring buffer. Also fix some whitespace (that was fixed by editing this in vscode). I'm not convinced that a boolean state is what you need here. Yes, today you are in a position where you have two options: a) use the filter buffer, which falls back on copy to ring buffer if nested, b) disable the filter buffer, and thus always copy to ring buffer before filtering, But I think it would not be far-fetched to improve the implementation of the filter-buffer to have one filter-buffer per nesting level (per execution context), or just implement the filter buffer as a per-cpu stack, which would remove the need to fall back on copy to ring buffer when nested. Or you could even do like LTTng and filter on the input arguments directly. 
But each of these changes would add yet another boolean tunable, which can quickly make the mix hard to understand from a user perspective. So rather than stacking tons of "on/off" switches for filter features, how about you let users express the mechanism they want to use for filtering with a string instead ? e.g. filter-method="temp-buffer" filter-method="ring-buffer" filter-method="input-arguments" ? Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com
Re: [PATCH] ring-buffer: Remove 32bit timestamp logic
On 2023-12-13 21:11, Steven Rostedt wrote: From: "Steven Rostedt (Google)" Each event has a 27 bit timestamp delta that is used to hold the delta from the last event. If the time between events is greater than 2^27, then a timestamp is added that holds a 59 bit absolute timestamp. Until a389d86f7fd09 ("ring-buffer: Have nested events still record running time stamp"), if an interrupt interrupted an event in progress, all the events delta would be zero to not deal with the races that need to be handled. The commit a389d86f7fd09 changed that to handle the races giving all events, even those that preempt other events, still have an accurate timestamp. To handle those races requires performing 64-bit cmpxchg on the timestamps. But doing 64-bit cmpxchg on 32-bit architectures is considered very slow. To try to deal with this the timestamp logic was broken into two and then three 32-bit cmpxchgs, with the thought that two (or three) 32-bit cmpxchgs are still faster than a single 64-bit cmpxchg on 32-bit architectures. Part of the problem with this is that I didn't have any 32-bit architectures to test on. After hitting several subtle bugs in this code, an effort was made to try and see if three 32-bit cmpxchgs are indeed faster than a single 64-bit. After a few people brushed off the dust of their old 32-bit machines, tests were done, and even though 32-bit cmpxchg was faster than a single 64-bit, it was in the order of 50% at best, not 300%. I literally had to dust off my old Eee PC for this :) This means that this complex code is not only complex but also not even faster than just using 64-bit cmpxchg. Nuke it! 
Acked-by: Mathieu Desnoyers

@@ -3720,6 +3517,7 @@ rb_reserve_next_event(struct trace_buffer *buffer,
 	struct rb_event_info info;
 	int nr_loops = 0;
 	int add_ts_default;
+	static int once;

(as you spotted, should be removed)

Thanks, Mathieu

 	/* ring buffer does cmpxchg, make sure it is safe in NMI context */
 	if (!IS_ENABLED(CONFIG_ARCH_HAVE_NMI_SAFE_CMPXCHG) &&

-- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com
Re: [PATCH] ring-buffer: Fix a race in rb_time_cmpxchg() for 32 bit archs
On 2023-12-12 11:53, Steven Rostedt wrote: From: "Steven Rostedt (Google)" Mathieu Desnoyers pointed out an issue in the rb_time_cmpxchg() for 32 bit architectures. That is:

static bool rb_time_cmpxchg(rb_time_t *t, u64 expect, u64 set)
{
	unsigned long cnt, top, bottom, msb;
	unsigned long cnt2, top2, bottom2, msb2;
	u64 val;

	/* The cmpxchg always fails if it interrupted an update */
	if (!__rb_time_read(t, &val, &cnt2))
		return false;

	if (val != expect)
		return false;

<<<< interrupted here!

	cnt = local_read(&t->cnt);

The problem is that the synchronization counter in the rb_time_t is read *after* the value of the timestamp is read. That means if an interrupt were to come in between the value being read and the counter being read, it can change the value and the counter and the interrupted process would be clueless about it! The counter needs to be read first and then the value. That way it is easy to tell if the value is stale or not. If the counter hasn't been updated, then the value is still good.
Link: https://lore.kernel.org/linux-trace-kernel/20231211201324.652870-1-mathieu.desnoy...@efficios.com/
Cc: sta...@vger.kernel.org
Fixes: 10464b4aa605e ("ring-buffer: Add rb_time_t 64 bit operations for speeding up 32 bit")
Reported-by: Mathieu Desnoyers
Signed-off-by: Steven Rostedt (Google)
Reviewed-by: Mathieu Desnoyers
---
 kernel/trace/ring_buffer.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 1d9caee7f542..e110cde685ea 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -706,6 +706,9 @@ static bool rb_time_cmpxchg(rb_time_t *t, u64 expect, u64 set)
 	unsigned long cnt2, top2, bottom2, msb2;
 	u64 val;
 
+	/* Any interruptions in this function should cause a failure */
+	cnt = local_read(&t->cnt);
+
 	/* The cmpxchg always fails if it interrupted an update */
 	if (!__rb_time_read(t, &val, &cnt2))
 		return false;
@@ -713,7 +716,6 @@ static bool rb_time_cmpxchg(rb_time_t *t, u64 expect, u64 set)
 	if (val != expect)
 		return false;
 
-	cnt = local_read(&t->cnt);
 	if ((cnt & 3) != cnt2)
 		return false;

-- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com
[PATCH] ring-buffer: Fix 32-bit rb_time_read() race with rb_time_cmpxchg()
The following race can cause rb_time_read() to observe a corrupted time stamp:

rb_time_cmpxchg()
[...]
	if (!rb_time_read_cmpxchg(&t->msb, msb, msb2))
		return false;
	if (!rb_time_read_cmpxchg(&t->top, top, top2))
		return false;

				__rb_time_read()
				[...]
				do {
					c = local_read(&t->cnt);
					top = local_read(&t->top);
					bottom = local_read(&t->bottom);
					msb = local_read(&t->msb);
				} while (c != local_read(&t->cnt));

				*cnt = rb_time_cnt(top);

				/* If top and msb counts don't match, this interrupted a write */
				if (*cnt != rb_time_cnt(msb))
					return false;

				^ this check fails to catch that "bottom" is still not updated.

So the old "bottom" value is returned, which is wrong. Fix this by checking that all three of msb, top, and bottom 2-bit cnt values match. The reason to favor checking all three fields over requiring a specific update order for both rb_time_set() and rb_time_cmpxchg() is because checking all three fields is more robust to handle partial failures of rb_time_cmpxchg() when interrupted by nested rb_time_set().

Link: https://lore.kernel.org/lkml/20231211201324.652870-1-mathieu.desnoy...@efficios.com/
Fixes: f458a1453424e ("ring-buffer: Test last update in 32bit version of __rb_time_read()")
Signed-off-by: Mathieu Desnoyers
Cc: Steven Rostedt
Cc: Masami Hiramatsu
Cc: linux-trace-ker...@vger.kernel.org
---
 kernel/trace/ring_buffer.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 8d2a4f00eca9..71c225ca2a2b 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -644,8 +644,8 @@ static inline bool __rb_time_read(rb_time_t *t, u64 *ret, unsigned long *cnt)
 	*cnt = rb_time_cnt(top);
 
-	/* If top and msb counts don't match, this interrupted a write */
-	if (*cnt != rb_time_cnt(msb))
+	/* If top, msb or bottom counts don't match, this interrupted a write */
+	if (*cnt != rb_time_cnt(msb) || *cnt != rb_time_cnt(bottom))
 		return false;
 
 	/* The shift to msb will lose its cnt bits */
-- 
2.39.2
Re: [RFC PATCH] ring-buffer: Fix and comment ring buffer rb_time functions on 32-bit
On 2023-12-11 23:38, Steven Rostedt wrote: On Mon, 11 Dec 2023 22:51:04 -0500 Mathieu Desnoyers wrote: [...] For this first issue, here is the race:

rb_time_cmpxchg()
[...]
	if (!rb_time_read_cmpxchg(&t->msb, msb, msb2))
		return false;
	if (!rb_time_read_cmpxchg(&t->top, top, top2))
		return false;

				__rb_time_read()
				[...]
				do {
					c = local_read(&t->cnt);
					top = local_read(&t->top);
					bottom = local_read(&t->bottom);
					msb = local_read(&t->msb);
				} while (c != local_read(&t->cnt));

				*cnt = rb_time_cnt(top);

				/* If top and msb counts don't match, this interrupted a write */
				if (*cnt != rb_time_cnt(msb))
					return false;

				^ this check fails to catch that "bottom" is still not updated.

So the old "bottom" value is returned, which is wrong. Ah, OK that makes more sense. Yeah, if I had the three words from the beginning, I would have tested to make sure they all match and not just the two :-p Technically just checking that the very first and last words which are updated by set/cmpxchg have the same cnt bits would suffice. Because this is just a scenario of __rb_time_read interrupting an update, the updates in between are fine if the first/last words to be updated have the same cnt. As this would fix a commit that tried to fix this before! f458a1453424e ("ring-buffer: Test last update in 32bit version of __rb_time_read()") FYI, that would be the "Fixes" for this patch. OK [...] - A cmpxchg interrupted by 4 writes or cmpxchg overflows the counter and produces corrupted time stamps. This is _not_ fixed by this patch. Except that it's not 4 bits that is compared, but 32 bits.

struct rb_time_struct {
	local_t cnt;
	local_t top;
	local_t bottom;
	local_t msb;
};

The full local_t (32 bits) is used for synchronization. But the other elements do get extra bits and there still might be some issues, but not as severe as you stated here. Let's bring up the race scenario I spotted:

rb_time_cmpxchg()
[...]
/* The cmpxchg always fails if it interrupted an update */ if (!__rb_time_read(t, &val, &cnt2)) return false; if (val != expect) return false; cnt = local_read(&t->cnt); if ((cnt & 3) != cnt2) return false; ^ here (cnt & 3) == cnt2, but @val contains outdated data. This means the piecewise rb_time_read_cmpxchg() calls that follow will derive expected values from the outdated @val. Ah. Of course this would be fixed if we did the local_read(&t->cnt) *before* everything else. But then we could be interrupted after that initial read, before reading val. I suspect we'd want to propagate the full 32-bit cnt that was read by __rb_time_read() to the caller, which is not the case today. With that, we would not have to read it again in rb_time_cmpxchg. It does leave the issue of having only 2 bits in the msb, top, bottom fields to detect races, which are subject to overflow. cnt2 = cnt + 1; rb_time_split(val, &top, &bottom, &msb); top = rb_time_val_cnt(top, cnt); bottom = rb_time_val_cnt(bottom, cnt); ^ top, bottom, and msb contain outdated data, which do not match cnt due to 2-bit overflow. rb_time_split(set, &top2, &bottom2, &msb2); top2 = rb_time_val_cnt(top2, cnt2); bottom2 = rb_time_val_cnt(bottom2, cnt2); if (!rb_time_read_cmpxchg(&t->cnt, cnt, cnt2)) return false; ^ This @cnt cmpxchg succeeds because the re-read cnt is used as the expected value. Sure. And I believe you did find another bug. If we read the cnt first, before reading val, then it would not be outdated. As stated above, I suspect we'd run into other issues if interrupted between read of cnt and reading val. Propagating the full 32-bit cnt value read from __rb_time_read() to the caller would be better I think. 
if (!rb_time_read_cmpxchg(&t->msb, msb, msb2)) return false; if (!rb_time_read_cmpxchg(&t->top, top, top2)) return false; if (!rb_time_read_cmpxchg(&t->bottom, bottom, bottom2)) return false; ^ these cmpxchg have just used the outdated @val as expected values, even though the content of the rb_time was modified by 4 consecutive rb_time_set() or rb_time_cmpxchg(). This means those cmpxchg can fail not only due to being interrupted by another write or cmpxchg, but also simply due to expected value mismatch in any of the fields, which will then cause following __rb_time_read() to fail until a rb_time_set() is done.
Re: [PATCH v2] tracing: Allow for max buffer data size trace_marker writes
On 2023-12-12 09:00, Steven Rostedt wrote: [...] --- a/kernel/trace/trace.c +++ b/kernel/trace/trace.c @@ -7272,6 +7272,7 @@ tracing_mark_write(struct file *filp, const char __user *ubuf, enum event_trigger_type tt = ETT_NONE; struct trace_buffer *buffer; struct print_entry *entry; + int meta_size; ssize_t written; int size; int len; @@ -7286,12 +7287,9 @@ tracing_mark_write(struct file *filp, const char __user *ubuf, if (!(tr->trace_flags & TRACE_ITER_MARKERS)) return -EINVAL; - if (cnt > TRACE_BUF_SIZE) - cnt = TRACE_BUF_SIZE; You're removing an early bound check for a size_t userspace input... - - BUILD_BUG_ON(TRACE_BUF_SIZE >= PAGE_SIZE); - - size = sizeof(*entry) + cnt + 2; /* add '\0' and possible '\n' */ + meta_size = sizeof(*entry) + 2; /* add '\0' and possible '\n' */ + again: + size = cnt + meta_size; ... and then implicitly casting it into an "int" size variable, which can therefore become a negative value. Just for the sake of not having to rely on ring_buffer_lock_reserve catching (length > BUF_MAX_DATA_SIZE), I would recommend adding an early check for negative here. /* If less than "<faulted>", then make sure we can still add that */ if (cnt < FAULTED_SIZE) @@ -7300,9 +7298,25 @@ tracing_mark_write(struct file *filp, const char __user *ubuf, buffer = tr->array_buffer.buffer; event = __trace_buffer_lock_reserve(buffer, TRACE_PRINT, size, tracing_gen_ctx()); - if (unlikely(!event)) + if (unlikely(!event)) { + /* +* If the size was greated than what was allowed, then greater ? Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com
Re: [PATCH] tracing: Add size check when printing trace_marker output
On 2023-12-12 08:44, Steven Rostedt wrote: From: "Steven Rostedt (Google)" If for some reason the trace_marker write does not have a nul byte for the string, it will overflow the print: Does this result in leaking kernel memory to userspace ? If so, it should state "Fixes..." and CC stable. Thanks, Mathieu trace_seq_printf(s, ": %s", field->buf); The field->buf could be missing the nul byte. To prevent overflow, add the max size that the buf can be by using the event size and the field location. int max = iter->ent_size - offsetof(struct print_entry, buf); trace_seq_printf(s, ": %*s", max, field->buf); Signed-off-by: Steven Rostedt (Google) --- kernel/trace/trace_output.c | 6 -- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/kernel/trace/trace_output.c b/kernel/trace/trace_output.c index d8b302d01083..e11fb8996286 100644 --- a/kernel/trace/trace_output.c +++ b/kernel/trace/trace_output.c @@ -1587,11 +1587,12 @@ static enum print_line_t trace_print_print(struct trace_iterator *iter, { struct print_entry *field; struct trace_seq *s = &iter->seq; + int max = iter->ent_size - offsetof(struct print_entry, buf); trace_assign_type(field, iter->ent); seq_print_ip_sym(s, field->ip, flags); - trace_seq_printf(s, ": %s", field->buf); + trace_seq_printf(s, ": %*s", max, field->buf); return trace_handle_return(s); } @@ -1600,10 +1601,11 @@ static enum print_line_t trace_print_raw(struct trace_iterator *iter, int flags, struct trace_event *event) { struct print_entry *field; + int max = iter->ent_size - offsetof(struct print_entry, buf); trace_assign_type(field, iter->ent); - trace_seq_printf(&iter->seq, "# %lx %s", field->ip, field->buf); + trace_seq_printf(&iter->seq, "# %lx %*s", field->ip, max, field->buf); return trace_handle_return(&iter->seq); } -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com
Re: [RFC PATCH] ring-buffer: Fix and comment ring buffer rb_time functions on 32-bit
On 2023-12-11 17:59, Steven Rostedt wrote: On Mon, 11 Dec 2023 15:13:24 -0500 Mathieu Desnoyers wrote: Going through a review of the ring buffer rb_time functions for 32-bit architectures, I updated the comments to match the code, and identified the following issues: Thanks Mathieu! - rb_time_cmpxchg() needs to update the msb last, so it matches the validation of top and msb by __rb_time_read(). This is fixed by this patch. Hmm, does it? This is not parallel programming, it's only protecting against interrupts. Understood, this is indeed the model I had in mind during my review (nested interruption only from local cpu) because preemption is disabled around use of the ring buffer. For this first issue, here is the race: rb_time_cmpxchg() [...] if (!rb_time_read_cmpxchg(&t->msb, msb, msb2)) return false; if (!rb_time_read_cmpxchg(&t->top, top, top2)) return false; __rb_time_read() [...] do { c = local_read(&t->cnt); top = local_read(&t->top); bottom = local_read(&t->bottom); msb = local_read(&t->msb); } while (c != local_read(&t->cnt)); *cnt = rb_time_cnt(top); /* If top and msb counts don't match, this interrupted a write */ if (*cnt != rb_time_cnt(msb)) return false; ^ this check fails to catch that "bottom" is still not updated. So the old "bottom" value is returned, which is wrong. BTW, it's best not to have a fix like this with a comment change this big, as the comment change is highly likely to cause conflicts in any backport. I wanted to start the discussion without having an N-patch series, but I agree that this first fix should be split into a separate patch. Although, for consistency, I wonder if everything else should be changed to go: bottom, top, msb as it would match the order of the bits, as msb has the highest order, top the next, and bottom the least. Doing it as: top, bottom, msb seems out of order. I did that because msb was an afterthought :-/ Agreed on the order change, but I suspect this would belong to another patch (not the fix). 
- A cmpxchg interrupted by 4 writes or cmpxchg overflows the counter and produces corrupted time stamps. This is _not_ fixed by this patch. Except that it's not 4 bits that are compared, but 32 bits. struct rb_time_struct { local_t cnt; local_t top; local_t bottom; local_t msb; }; The full local_t (32 bits) is used for synchronization. But the other elements do get extra bits and there still might be some issues, but not as severe as you stated here. Let's bring up the race scenario I spotted: rb_time_cmpxchg() [...] /* The cmpxchg always fails if it interrupted an update */ if (!__rb_time_read(t, &val, &cnt2)) return false; if (val != expect) return false; cnt = local_read(&t->cnt); if ((cnt & 3) != cnt2) return false; ^ here (cnt & 3) == cnt2, but @val contains outdated data. This means the piecewise rb_time_read_cmpxchg() calls that follow will derive expected values from the outdated @val. cnt2 = cnt + 1; rb_time_split(val, &top, &bottom, &msb); top = rb_time_val_cnt(top, cnt); bottom = rb_time_val_cnt(bottom, cnt); ^ top, bottom, and msb contain outdated data, which do not match cnt due to 2-bit overflow. rb_time_split(set, &top2, &bottom2, &msb2); top2 = rb_time_val_cnt(top2, cnt2); bottom2 = rb_time_val_cnt(bottom2, cnt2); if (!rb_time_read_cmpxchg(&t->cnt, cnt, cnt2)) return false; ^ This @cnt cmpxchg succeeds because the re-read cnt is used as the expected value. if (!rb_time_read_cmpxchg(&t->msb, msb, msb2)) return false; if (!rb_time_read_cmpxchg(&t->top, top, top2)) return false; if (!rb_time_read_cmpxchg(&t->bottom, bottom, bottom2)) return false; ^ these cmpxchg have just used the outdated @val as expected values, even though the content of the rb_time was modified by 4 consecutive rb_time_set() or rb_time_cmpxchg(). 
This means those cmpxchg can fail not only due to being interrupted by another write or cmpxchg, but also simply due to expected value mismatch in any of the fields, which will then cause following __rb_time_read() to fail until a rb_time_set() is done. return true; So this overflow scenario on top of cmpxchg does not cause corrupted time stamps, but does cause subsequent reads and cmpxchg to fail until a write is performed.
[RFC PATCH] ring-buffer: Fix and comment ring buffer rb_time functions on 32-bit
Going through a review of the ring buffer rb_time functions for 32-bit architectures, I updated the comments to match the code, and identified the following issues: - rb_time_cmpxchg() needs to update the msb last, so it matches the validation of top and msb by __rb_time_read(). This is fixed by this patch. - A cmpxchg interrupted by 4 writes or cmpxchg overflows the counter and produces corrupted time stamps. This is _not_ fixed by this patch. - After a cmpxchg fails between updates to top and msb, a write is needed before read and cmpxchg can succeed again. I am not entirely sure the rest of the ring buffer handles this correctly. Signed-off-by: Mathieu Desnoyers Cc: Steven Rostedt Cc: Masami Hiramatsu Cc: linux-trace-ker...@vger.kernel.org --- kernel/trace/ring_buffer.c | 64 +++--- 1 file changed, 46 insertions(+), 18 deletions(-) diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c index 8d2a4f00eca9..f6ed699947cd 100644 --- a/kernel/trace/ring_buffer.c +++ b/kernel/trace/ring_buffer.c @@ -576,34 +576,50 @@ struct ring_buffer_iter { #ifdef RB_TIME_32 /* - * On 32 bit machines, local64_t is very expensive. As the ring - * buffer doesn't need all the features of a true 64 bit atomic, - * on 32 bit, it uses these functions (64 still uses local64_t). + * On 32-bit machines, local64_t is very expensive. As the ring + * buffer doesn't need all the features of a true 64-bit atomic, + * on 32-bit, it uses these functions (64-bit still uses local64_t). * - * For the ring buffer, 64 bit required operations for the time is - * the following: + * For the ring buffer, the operations required to manipulate 64-bit + * time stamps are the following: * - * - Reads may fail if it interrupted a modification of the time stamp. + * - Read may fail if it interrupted a modification of the time stamp. * It will succeed if it did not interrupt another write even if * the read itself is interrupted by a write. 
+ * A read will fail if it follows a cmpxchg which failed between + * updates to its top and msb bits, until a write is performed. + * (note: this limitation may be unexpected in parts of the + * ring buffer algorithm) * It returns whether it was successful or not. * - * - Writes always succeed and will overwrite other writes and writes + * - Write always succeeds and will overwrite other writes and writes * that were done by events interrupting the current write. * * - A write followed by a read of the same time stamp will always succeed, * but may not contain the same value. * * - A cmpxchg will fail if it interrupted another write or cmpxchg. + * A cmpxchg will fail if it follows a cmpxchg which failed between + * updates to its top and msb bits, until a write is performed. + * (note: this limitation may be unexpected in parts of the + * ring buffer algorithm) * Other than that, it acts like a normal cmpxchg. * - * The 60 bit time stamp is broken up by 30 bits in a top and bottom half - * (bottom being the least significant 30 bits of the 60 bit time stamp). + * The 64-bit time stamp is broken up, from most to least significant, + * in: msb, top and bottom fields, of respectively 4, 30, and 30 bits. * - * The two most significant bits of each half holds a 2 bit counter (0-3). + * The two most significant bits of each field hold a 2-bit counter (0-3). * Each update will increment this counter by one. - * When reading the top and bottom, if the two counter bits match then the - * top and bottom together make a valid 60 bit number. + * When reading the top, bottom, and msb fields, if the two counter bits + * match, then the combined values make a valid 64-bit number. + * + * Counter limits. The following situations can generate overflows that + * produce corrupted time stamps: + * + * - A read or a write interrupted by 2^32 writes or cmpxchg. + * + * - A cmpxchg interrupted by 4 writes or cmpxchg. 
+ * (note: this is not sufficient and should be fixed) */ #define RB_TIME_SHIFT 30 #define RB_TIME_VAL_MASK ((1 << RB_TIME_SHIFT) - 1) @@ -632,7 +648,7 @@ static inline bool __rb_time_read(rb_time_t *t, u64 *ret, unsigned long *cnt) /* * If the read is interrupted by a write, then the cnt will -* be different. Loop until both top and bottom have been read +* be different. Loop until top, bottom and msb have been read * without interruption. */ do { @@ -644,7 +660,12 @@ static inline bool __rb_time_read(rb_time_t *t, u64 *ret, unsigned long *cnt) *cnt = rb_time_cnt(top); - /* If top and msb counts don't match, this interrupted a write */ + /* +* If top and msb counts don't match, this either interrupted a +* write or follows a failed cmpxchg. +* This requires the update to bottom to be enclosed between +
Re: [PATCH] tracing: Allow for max buffer data size trace_marker writes
On 2023-12-10 11:38, Steven Rostedt wrote: On Sun, 10 Dec 2023 11:07:22 -0500 Mathieu Desnoyers wrote: It just allows more to be written in one go. I don't see why the tests need to cover this or detect this change. If the purpose of this change is to ensure that the entire trace marker payload is shown within a single event, then there should be a test which exercises this, and which validates that the end result is that the entire payload is indeed shown within a single event record. No, the purpose of the change is not to do that, because there can always be a bigger trace marker write than a single event can hold. This is the way it has always worked. This is an optimization or "enhancement". The 1KB restriction was actually because of a previous implementation years ago (before selftests even existed) that wrote into a temp buffer before copying into the ring buffer. But since we now can copy directly into the ring buffer, there's no reason not to use the maximum that the ring buffer can accept. My point is that the difference between the new "enhanced" behavior and the previous behavior is not tested for. Otherwise there is no permanent validation that this change indeed does what it is intended to do, so it can regress at any time without any test noticing it. What regress? The amount of a trace_marker write that can make it into a the buffer in one go? Now, I agree that we should have a test to make sure that all of the trace marker write gets into the buffer. Yes. This is pretty much my point. But it's always been allowed to break up that write however it wanted to. And the enhanced behavior extends the amount of data that can get written into a single sub-buffer, and this is not tested. Note, because different architectures have different page sizes, how much that can make it in one go is architecture dependent. So you can have a "regression" by simply running your application on a different architecture. 
Which is why, in the following patches you have, expressing the subbuffer size as bytes rather than pages is important at the ABI level. It facilitates portability of tests, and decreases documentation / user burden. Again, it's not a requirement, it's just an enhancement. How does this have anything to do with dispensing from testing the new behavior ? If the new behavior has a bug that causes it to silently truncate the trace marker payloads, how do you catch it with the current tests ? Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com
Re: [PATCH] tracing: Allow for max buffer data size trace_marker writes
On 2023-12-10 10:30, Steven Rostedt wrote: On Sun, 10 Dec 2023 09:09:06 -0500 Mathieu Desnoyers wrote: On 2023-12-09 17:50, Steven Rostedt wrote: From: "Steven Rostedt (Google)" Allow a trace write to be as big as the ring buffer tracing data will allow. Currently, it only allows writes of 1KB in size, but there's no reason that it cannot allow what the ring buffer can hold. I would expect the tests under tools/testing/selftests/ftrace/* to be affected by those changes. If they are not, then it's a hole in the test coverage and should be taken care of as this behavior is modified. Before this change we had: ~# echo -n '12345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456 
7890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123 4567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890 
1234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567 89012345678901234567890123456789012345678901234567890123456789012345678901234' > /sys/kernel/tracing/trace_marker ~# cat /sys/kernel/tracing/trace # tracer: nop # # entries-in-buffer/entries-written: 4/4 #P:8 # #_-=> irqs-off/BH-disabled #
Re: [PATCH 13/14] ring-buffer: Add documentation on the buffer_subbuf_order file
On 2023-12-09 22:54, Steven Rostedt wrote: [...] + buffer_subbuf_order: + + This sets or displays the sub buffer page size order. The ring buffer + is broken up into several same size "sub buffers". An event can not be + bigger than the size of the sub buffer. Normally, the sub buffer is + the size of the architecture's page (4K on x86). The sub buffer also + contains meta data at the start which also limits the size of an event. + That means when the sub buffer is a page size, no event can be larger + than the page size minus the sub buffer meta data. The fact that the user ABI documentation for this tracer parameter needs to dig into details about architecture page size is a good indicator that this ABI is not at the abstraction level it should be (pages vs bytes). Thanks, Mathieu + + The buffer_subbuf_order allows the user to change the size of the sub + buffer. As the sub buffer is a set of pages by the power of 2, thus + the sub buffer total size is defined by the order: + + order size + + 0 PAGE_SIZE + 1 PAGE_SIZE * 2 + 2 PAGE_SIZE * 4 + 3 PAGE_SIZE * 8 + + Changing the order will change the sub buffer size allowing for events + to be larger than the page size. + + Note: When changing the order, tracing is stopped and any data in the + ring buffer and the snapshot buffer will be discarded. + free_buffer: If a process is performing tracing, and the ring buffer should be -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com
Re: [PATCH 14/14] ringbuffer/selftest: Add basic selftest to test changing subbuf order
On 2023-12-09 22:54, Steven Rostedt wrote: [...] +get_buffer_data_size() { + sed -ne 's/^.*data.*size:\([0-9][0-9]*\).*/\1/p' events/header_page +} + +a="1234567890" + +make_str() { +cnt=$1 +s="" +while [ $cnt -gt 10 ]; do +s="${s}${a}" +cnt=$((cnt-10)) +done +while [ $cnt -gt 0 ]; do +s="${s}X" +cnt=$((cnt-1)) +done +echo -n $s +} + +test_buffer() { + + size=`get_buffer_data_size` + + str=`make_str $size` + + echo $str > trace_marker + + grep -q $a trace This test has no clue if the record was truncated or not. It basically repeats the string "1234567890" until it fills the subbuffer size and pads with "X" characters as needed to form the trace marker payload, but the grep looks for the "1234567890" pattern only. The test should be extended to validate whether the trace marker payload was truncated or not, otherwise it is of limited value. Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com
Re: [PATCH 00/14] ring-buffer/tracing: Allow ring buffer to have bigger sub buffers
On 2023-12-09 22:54, Steven Rostedt wrote: [...] Basically, events to the tracing subsystem are limited to just under a PAGE_SIZE, as the ring buffer is split into "sub buffers" of one page size, and an event can not be bigger than a sub buffer. This allows users to change the size of a sub buffer by the order: echo 3 > /sys/kernel/tracing/buffer_subbuf_order Will make each sub buffer a size of 8 pages, allowing events to be almost as big as 8 pages in size (sub buffers do have meta data on them as well, keeping an event from reaching the same size as a sub buffer). Specifying the "order" of subbuffer size as a power-of-two number of pages is a poor UX choice for a user-facing ABI. I would recommend allowing the user to specify the size in bytes, and internally bump the size to the next power of 2, with a minimum of PAGE_SIZE. Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com