Re: [RFC PATCH 0/4] perf: Correlating user process data to samples
On 2024-04-12 12:28, Beau Belgrave wrote:

On Thu, Apr 11, 2024 at 09:52:22PM -0700, Ian Rogers wrote:

On Thu, Apr 11, 2024 at 5:17 PM Beau Belgrave wrote:

In the Open Telemetry profiling SIG [1], we are trying to find a way to
grab a tracing association quickly on a per-sample basis. The team at
Elastic has a bespoke way to do this [2], however, I'd like to see a
more general way to achieve this. The folks I've been talking with seem
open to the idea of just having a TLS value for this we could capture

Presumably TLS == Thread Local Storage.

Yes, the initial idea is to use thread local storage (TLS). It seems to
be the fastest option to save a per-thread value that changes at a fast
rate.

upon each sample. We could then just state, Open Telemetry SDKs should
have a TLS value for span correlation. However, we need a way to sample
the TLS or other value(s) when a sampling event is generated. This is
supported today on Windows via EventActivityIdControl() [3]. Since Open
Telemetry works on both Windows and Linux, ideally we can do something
as efficient for Linux based workloads.

This series is to explore how it would be best possible to collect
supporting data from a user process when a profile sample is collected.
Having a value stored in TLS makes a lot of sense for this, however
there are other ways to explore. Whatever is chosen, kernel samples
taken in process context should be able to get this supporting data. In
these patches on X64 the fsbase and gsbase are used for this.

An option to explore suggested by Mathieu Desnoyers is to utilize rseq
for processes to register a value location that can be included when
profiling if desired. This would allow a tighter contract between user
processes and a profiler. It would allow better labeling/categorizing
the correlation values.

It is hard to understand this idea. Are you saying stash a cookie in
TLS for samples to capture to indicate an activity?

Restartable sequences are about preemption on a CPU, not of a thread,
so at least my intuition is that they feel different. You could stash
information like this today by changing the thread name, which
generates comm events. I've wondered about having similar information
in some form of reserved-for-profiling stack slot, for example, to
stash a pointer to the name of a function being interpreted.
Snapshotting all of a stack is bad performance wise and for security. A
stack slot would be able to deal with nesting.

You are getting the idea. A slot or tag for a thread would be great!
I'm not a fan of overriding the thread comm name (as that already has a
use). TLS would be fine, if we could also pass an offset + size + type.
Maybe a stack slot that just points to parts of TLS? That way you could
have a set of slots that don't require much memory and selectively copy
them out of TLS (or wherever those slots point to in user memory).

When I was talking to Mathieu about this, it seems that rseq already
had a place to potentially put these slots. I'm unsure though how the
per-thread aspects would work.

Mathieu, can you post your ideas here about that?

Sure. I'll try to summarize my thoughts here. By all means, let me know
if I'm missing important pieces of the puzzle.

First of all, here is my understanding of what information we want to
share between userspace and kernel.

A 128-bit activity ID identifies "uniquely" (as far as a 128-bit random
UUID allows) a portion of the dependency chain involved in doing some
work (e.g. answer a HTTP request) across one or many participating
hosts. Activity IDs have a parent/child relationship: a parent activity
ID can create children activity IDs.

For instance, if one host has the service "dispatch", another host has
a "web server", and a third host has a SQL database, we should be able
to follow the chain of activities needed to answer a web query by
following those activity IDs, linking them together through
parent/child relationships. This usually requires the communication
protocols to convey those activity IDs across hosts.

The reason why this information must be provided from userspace is
because it's userspace that knows where to find those activity IDs
within its application-layer communication protocols.

With tracing, taking a full trace of the activity ID spans begin/end
from all hosts allows reconstructing the activity IDs parent/child
relationships, so we typically only need to extract information about
activity ID span begin/end with parent/child info to a tracer.

Using activity IDs from a kernel profiler is trickier, because we do
not have access to the complete span begin/end trace to reconstruct the
activity ID parent/child relationship. This is where I suspect we'd
want to introduce a notion of "activity ID stack", so a profiler could
reconstruct the currently active stack of activity IDs for the current
thread by walking that stack. This profiling could be triggered either
from an interrupt (s
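The "activity ID stack" notion discussed above can be sketched in plain
userspace C. This is purely illustrative: the names, the fixed depth,
and keeping the stack directly in TLS are assumptions for the sketch,
not an API anyone proposed in the thread.

```c
/* Hypothetical sketch of a per-thread "activity ID stack": each thread
 * keeps a small TLS stack of 128-bit activity IDs that a profiler could
 * sample and walk to recover the active parent/child chain. */
#include <stdint.h>
#include <string.h>

#define ACTIVITY_STACK_DEPTH 16

struct activity_id {
	uint8_t bytes[16];	/* 128-bit ID, e.g. a random UUID */
};

static __thread struct activity_id activity_stack[ACTIVITY_STACK_DEPTH];
static __thread int activity_depth;

/* Push a child activity ID when a new span begins. */
static int activity_push(const struct activity_id *id)
{
	if (activity_depth >= ACTIVITY_STACK_DEPTH)
		return -1;
	activity_stack[activity_depth++] = *id;
	return 0;
}

/* Pop when the span ends. */
static int activity_pop(void)
{
	if (activity_depth == 0)
		return -1;
	activity_depth--;
	return 0;
}

/* What a profiler would sample for the current thread: the depth plus
 * the stack entries, giving the chain of currently active spans. */
static int activity_current_depth(void)
{
	return activity_depth;
}
```

A real design would still need a registration mechanism (e.g. via rseq,
as suggested above) so the kernel knows where this stack lives.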
Re: [RFC][PATCH] tracing: Introduce restart_critical_timings()
On 2024-03-20 13:58, Steven Rostedt wrote:

On Wed, 20 Mar 2024 13:15:39 -0400 Mathieu Desnoyers wrote:

I would like to introduce restart_critical_timings() and place it in
locations that have this behavior.

Is there any way you could move this to need_resched() rather than
sprinkle those everywhere ?

Because need_resched() itself does not mean it's going to schedule
immediately.

I looked at a few locations that need_resched() is called. Most are in
idle code where the critical timings are already handled. I'm not sure
I'd add it for places like mm/memory.c or drivers/md/bcache/btree.c. A
lot of places look to use it more for PREEMPT_NONE situations as an
open-coded cond_resched().

The main reason this one is particularly an issue, is that it spins as
long as the owner is still running. Which may be some time, as here it
was 7ms.

What I think we should be discussing here is how calling need_resched()
should interact with the latency tracked by critical timings.

AFAIU, when code explicitly calls need_resched() in a loop, there are
two cases:

- need_resched() returns false: This means the loop can continue
  without causing long latency on the system. Technically we could
  restart the critical timings at this point.

- need_resched() returns true: This means the loop should exit quickly
  and call the scheduler. I would not reset the critical timings there,
  as whatever code is executed between need_resched() returning true
  and calling the scheduler is adding to latency.

Having stop/start critical timings around idle loops seems to just be
an optimization over that.

As for mm and driver/md code, what is wrong with doing a critical
timings reset when need_resched() returns false ? It would prevent a
whole class of false-positives rather than playing whack-a-mole with
those that pop up.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
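The two need_resched() cases above can be modeled in plain userspace
code. Everything here is a stand-in: need_resched(),
restart_critical_timings() and the tick counter are stubs that only
model the policy argued for (restart the latency window while
need_resched() is false, stop restarting once it returns true).

```c
/* Userspace model of the spin pattern under discussion. The kernel
 * APIs are replaced by trivial stubs; only the accounting shape of
 * "which part of the spin counts as latency" is illustrated. */
#include <stdbool.h>

static int resched_after = 3;		/* pretend need_resched() flips after 3 polls */
static int critical_window_start;	/* models where the tracked window begins */
static int now;				/* fake time, one "tick" per loop iteration */

static bool need_resched(void) { return now >= resched_after; }
static void restart_critical_timings(void) { critical_window_start = now; }

/* Spin until a resched is needed. While need_resched() is false the
 * spin adds no real latency, so the window is restarted each pass;
 * once it returns true, everything from the last restart until the
 * (elided) call to the scheduler counts toward reported latency. */
static int spin_window_length(void)
{
	for (now = 0; ; now++) {
		if (need_resched())
			break;			/* exit and schedule */
		restart_critical_timings();	/* spin itself is not latency */
	}
	return now - critical_window_start;	/* latency attributed to this spin */
}
```

With this policy, a long optimistic spin contributes at most one poll
interval to the tracked latency, which is the class of false-positive
the reset is meant to eliminate.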
Re: [RFC][PATCH] tracing: Introduce restart_critical_timings()
On 2024-03-20 12:20, Steven Rostedt wrote:

From: Steven Rostedt (Google)

I'm debugging some latency issues on a Chromebook and the
preemptirqsoff tracer hit this:

  # tracer: preemptirqsoff
  #
  # preemptirqsoff latency trace v1.1.5 on 5.15.148-21853-g165fd2387469-dirty
  # --------------------------------------------------------------------
  # latency: 7813 us, #60/60, CPU#1 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:2)
  #    -----------------
  #    | task: init-1 (uid:0 nice:0 policy:0 rt_prio:0)
  #    -----------------
  #  => started at: rwsem_optimistic_spin
  #  => ended at:   rwsem_optimistic_spin
  #
  #                    _------=> CPU#
  #                   / _-----=> irqs-off
  #                  | / _----=> need-resched
  #                  || / _---=> hardirq/softirq
  #                  ||| / _--=> preempt-depth
  #                  |||| / _-=> migrate-disable
  #                  ||||| /     delay
  #  cmd     pid     |||||| time  |   caller
  #     \   /        ||||||  \    |    /
     <...>-1         1...1.    0us!: rwsem_optimistic_spin+0x20/0x194 <-rwsem_optimistic_spin+0x20/0x194
     <...>-1         1.N.1. 7813us : rwsem_optimistic_spin+0x140/0x194 <-rwsem_optimistic_spin+0x140/0x194
     <...>-1         1.N.1. 7814us+: tracer_preempt_on+0x4c/0x6a <-rwsem_optimistic_spin+0x140/0x194
     <...>-1         1.N.1. 7824us : <stack trace>
   => rwsem_optimistic_spin+0x140/0x194
   => rwsem_down_write_slowpath+0xc9/0x3fe
   => copy_process+0xd08/0x1b8a
   => kernel_clone+0x94/0x256
   => __x64_sys_clone+0x7a/0x9a
   => do_syscall_64+0x51/0xa1
   => entry_SYSCALL_64_after_hwframe+0x5c/0xc6

Which isn't a real issue, as it's in the rwsem_optimistic_spin() code
that spins with interrupts enabled, preempt disabled, and checks for
need_resched(). If it is true, it breaks out and schedules. Hence, it's
hiding real issues, because it can spin for a very long time and this
is not the source of the latency I'm tracking down.

I would like to introduce restart_critical_timings() and place it in
locations that have this behavior.

Is there any way you could move this to need_resched() rather than
sprinkle those everywhere ?

Thanks,

Mathieu

Signed-off-by: Steven Rostedt (Google)

diff --git a/include/linux/irqflags.h b/include/linux/irqflags.h
index 147feebd508c..e9f97f60bfc0 100644
--- a/include/linux/irqflags.h
+++ b/include/linux/irqflags.h
@@ -145,6 +145,13 @@ do { \
 # define start_critical_timings() do { } while (0)
 #endif
 
+/* Used in spins that check need_resched() with preemption disabled */
+static inline void restart_critical_timings(void)
+{
+	stop_critical_timings();
+	start_critical_timings();
+}
+
 #ifdef CONFIG_DEBUG_IRQFLAGS
 extern void warn_bogus_irq_restore(void);
 #define raw_check_bogus_irq_restore() \
diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index 2340b6d90ec6..fa7b43e53d20 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -27,6 +27,7 @@
 #include
 #include
 #include
+#include
 #include
 
 #ifndef CONFIG_PREEMPT_RT
@@ -780,6 +781,7 @@ rwsem_spin_on_owner(struct rw_semaphore *sem)
 	 */
 	barrier();
 
+	restart_critical_timings();
 	if (need_resched() || !owner_on_cpu(owner)) {
 		state = OWNER_NONSPINNABLE;
 		break;
@@ -912,6 +914,7 @@ static bool rwsem_optimistic_spin(struct rw_semaphore *sem)
 	 * a writer, need_resched() check needs to be done here.
 	 */
 	if (owner_state != OWNER_WRITER) {
+		restart_critical_timings();
 		if (need_resched())
 			break;
 		if (rt_task(current) &&

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
Re: [PATCH v2] tracing: Limit trace_marker writes to just 4K
On 2024-03-04 22:34, Steven Rostedt wrote:

From: "Steven Rostedt (Google)"

Limit the max print event of trace_marker to just 4K string size. This
must also be less than the amount that can be held by a trace_seq along
with the text that is before the output (like the task name, PID, CPU,
state, etc). As trace_seq is made to handle large events (some greater
than 4K), make the max size of a trace_marker write event be 4K, which
is guaranteed to fit in the trace_seq buffer.

Suggested-by: Mathieu Desnoyers

From my perspective I only attempted to clarify the point Linus made
about limiting the trace_marker input to 4kB. Feel free to adapt the
Suggested-by tag accordingly.

Reviewed-by: Mathieu Desnoyers

Thanks,

Mathieu

Signed-off-by: Steven Rostedt (Google)
---
Changes since v1: https://lore.kernel.org/linux-trace-kernel/20240304192710.4c996...@gandalf.local.home/

- Just make the max limit 4K and not half of the trace_seq size. The
  trace_seq is already made to handle events greater than 4k.

 kernel/trace/trace.c | 10 +-
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 8198bfc54b58..d16b95ca58a7 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -7293,6 +7293,8 @@ tracing_free_buffer_release(struct inode *inode, struct file *filp)
 	return 0;
 }
 
+#define TRACE_MARKER_MAX_SIZE		4096
+
 static ssize_t
 tracing_mark_write(struct file *filp, const char __user *ubuf,
 					size_t cnt, loff_t *fpos)
@@ -7320,6 +7322,9 @@ tracing_mark_write(struct file *filp, const char __user *ubuf,
 	if ((ssize_t)cnt < 0)
 		return -EINVAL;
 
+	if (cnt > TRACE_MARKER_MAX_SIZE)
+		cnt = TRACE_MARKER_MAX_SIZE;
+
 	meta_size = sizeof(*entry) + 2;	/* add '\0' and possible '\n' */
 again:
 	size = cnt + meta_size;
@@ -7328,11 +7333,6 @@ tracing_mark_write(struct file *filp, const char __user *ubuf,
 	if (cnt < FAULTED_SIZE)
 		size += FAULTED_SIZE - cnt;
 
-	if (size > TRACE_SEQ_BUFFER_SIZE) {
-		cnt -= size - TRACE_SEQ_BUFFER_SIZE;
-		goto again;
-	}
-
 	buffer = tr->array_buffer.buffer;
 	event = __trace_buffer_lock_reserve(buffer, TRACE_PRINT, size,
 					    tracing_gen_ctx());

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
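The clamp introduced by the v2 patch reduces to a one-line policy. Here
is a minimal userspace rendition; the constant mirrors the patch, but
the function name is made up for illustration:

```c
#include <stddef.h>

#define TRACE_MARKER_MAX_SIZE 4096	/* same fixed 4K cap as the patch */

/* Silently truncate an oversized trace_marker write to the documented
 * input limit, instead of deriving the limit from the output buffer. */
static size_t clamp_marker_write(size_t cnt)
{
	return cnt > TRACE_MARKER_MAX_SIZE ? TRACE_MARKER_MAX_SIZE : cnt;
}
```

Clamping up front is what removes the `goto again` shrink-and-retry
path deleted by the diff above.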
Re: [PATCH] tracing: Have trace_marker writes be just half of TRACE_SEQ_SIZE
On 2024-03-04 21:37, Steven Rostedt wrote:

On Mon, 4 Mar 2024 21:35:38 -0500 Steven Rostedt wrote:

And it's not for debugging, it's for validation of assumptions made
about an upper bound limit defined for a compile-time check, so as the
code evolves issues are caught early.

validating is debugging. Did Linus put you up to this? To test me to
see if I'm learning how to say "No" ;-)

No, he did not. I genuinely think that validating size limits like
this, either at compile time or, when they can vary at runtime like in
this case, with a dynamic check, decreases the cognitive load on the
reviewers. We can then assume that whatever limit was put in place is
actually enforced and not just wishful thinking.

If the "header" size upper bound is not validated at runtime, there is
not much point in adding the BUILD_BUG_ON() based on that value in the
first place, and you should then just add a runtime check that you
don't overflow the output buffer before writing the output to it.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
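The pairing argued for above, a compile-time bound plus a runtime check
that the bound is actually respected, can be sketched as follows. The
sizes and the header format are invented for illustration; only the
shape of the two checks matters.

```c
/* Sketch: a compile-time check built on an assumed header budget, plus
 * a runtime check that the generated header stays within that budget.
 * All constants and the format string are illustrative, not the
 * kernel's actual values. */
#include <stdio.h>

#define META_MAX_LEN 80				/* assumed header budget */
#define SEQ_BUF_SIZE 1024			/* assumed output buffer size */
#define MARKER_MAX   (SEQ_BUF_SIZE - 128)	/* payload limit */

/* Compile-time: payload plus worst-case meta data must fit the buffer.
 * This is only trustworthy if the 128-byte assumption is enforced. */
_Static_assert(MARKER_MAX + 128 <= SEQ_BUF_SIZE,
	       "marker cannot fit in seq buffer");

/* Runtime: validate the assumption the static check relies on, so a
 * header format that outgrows its budget fails early and loudly. */
static int emit_header(char *buf, int pid, int cpu)
{
	int len = snprintf(buf, META_MAX_LEN + 1,
			   "<...>-%d [%03d] ...: ", pid, cpu);

	return (len < 0 || len > META_MAX_LEN) ? -1 : len;
}
```

Without the runtime half, the BUILD_BUG_ON()-style check only documents
wishful thinking, which is the point being made in the reply.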
Re: [PATCH] tracing: Have trace_marker writes be just half of TRACE_SEQ_SIZE
On 2024-03-04 20:59, Steven Rostedt wrote:

On Mon, 4 Mar 2024 20:42:39 -0500 Mathieu Desnoyers wrote:

	#define TRACE_OUTPUT_META_DATA_MAX_LEN 80

and a runtime check in the code generating this header. This would
avoid adding an unchecked upper limit.

That would be a lot of complex code that is for debugging something
that has never happened in the past 16 years and Linus has already
reprimanded me on doing such things.

Is it more complex than remembering the iterator position in
print_trace_fmt() right before:

	if (tr->trace_flags & TRACE_ITER_CONTEXT_INFO) {
		if (iter->iter_flags & TRACE_FILE_LAT_FMT)
			trace_print_lat_context(iter);
		else
			trace_print_context(iter);
	}

and then checking just after that the offset is not beyond 128 bytes ?
Perhaps there is even something internal to "iter" that already knows
about this offset (?).

This would catch any context going beyond its planned upper limit
early. Failing early and not just in rare corner cases is good.

And it's not for debugging, it's for validation of assumptions made
about an upper bound limit defined for a compile-time check, so as the
code evolves issues are caught early.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
Re: [PATCH] tracing: Have trace_marker writes be just half of TRACE_SEQ_SIZE
On 2024-03-04 20:41, Steven Rostedt wrote:

On Mon, 4 Mar 2024 20:35:16 -0500 Steven Rostedt wrote:

	BUILD_BUG_ON(TRACING_MARK_MAX_SIZE + sizeof(meta data stuff...) > TRACE_SEQ_SIZE);

That's not the meta size I'm worried about. The sizeof(meta data) is
the raw event binary data, which is not related to the size of the
event output.

  # cd /sys/kernel/tracing
  # echo hello > trace_marker
  # cat trace
  [..]
     <...>-999     [001] .    2296.140373: tracing_mark_write: hello
     ^^^
     This is the meta data that is added to trace_seq

That said, the meta data is most likely not going to be more than 128
bytes (it shouldn't be more than 80). I could do as you suggest and
create a separate TRACE_MARKER_SIZE and just make sure that it's less
than TRACE_SEQ_BUFFER_SIZE (as that's the size of the content) by 128
bytes.

	/* Added meta data should not be more than 128 bytes */
	BUILD_BUG_ON((TRACE_MARKER_MAX_SIZE + 128) > TRACE_SEQ_BUFFER_SIZE);

Bonus points if you add

	#define TRACE_OUTPUT_META_DATA_MAX_LEN 80

and a runtime check in the code generating this header. This would
avoid adding an unchecked upper limit.

Thanks,

Mathieu

-- Steve

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
Re: [PATCH] tracing: Have trace_marker writes be just half of TRACE_SEQ_SIZE
On 2024-03-04 20:35, Steven Rostedt wrote:

On Mon, 4 Mar 2024 20:15:57 -0500 Mathieu Desnoyers wrote:

On 2024-03-04 19:27, Steven Rostedt wrote:

From: "Steven Rostedt (Google)"

Since the size of trace_seq's buffer is the max an event can output,
have the trace_marker be half of the entire TRACE_SEQ_SIZE, which is
4K. That will keep writes that have meta data written from being
dropped (but reported), because the total output of the print event is
greater than what the trace_seq can hold.

Defining the trace_mark limit in terms of "TRACE_SEQ_SIZE / 2" seems
backwards. It's basically using a define of the maximum buffer size for
the pretty-printing output and defining the maximum input size of a
system call to half of that.

I'd rather see, in a header file shared between tracing mark write
implementation and output implementation:

	#define TRACING_MARK_MAX_SIZE 4096

and then a static validation that this input fits within your pretty
printing output in the output implementation file:

	BUILD_BUG_ON(TRACING_MARK_MAX_SIZE + sizeof(meta data stuff...) > TRACE_SEQ_SIZE);

That's not the meta size I'm worried about. The sizeof(meta data) is
the raw event binary data, which is not related to the size of the
event output.

  # cd /sys/kernel/tracing
  # echo hello > trace_marker
  # cat trace
  [..]
     <...>-999     [001] .    2296.140373: tracing_mark_write: hello
     ^^^
     This is the meta data that is added to trace_seq

If this header has a known well-defined upper-limit length, then use
that in the BUILD_BUG_ON().

Thanks,

Mathieu

-- Steve

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
Re: [PATCH] tracing: Have trace_marker writes be just half of TRACE_SEQ_SIZE
On 2024-03-04 19:27, Steven Rostedt wrote:

From: "Steven Rostedt (Google)"

Since the size of trace_seq's buffer is the max an event can output,
have the trace_marker be half of the entire TRACE_SEQ_SIZE, which is
4K. That will keep writes that have meta data written from being
dropped (but reported), because the total output of the print event is
greater than what the trace_seq can hold.

Defining the trace_mark limit in terms of "TRACE_SEQ_SIZE / 2" seems
backwards. It's basically using a define of the maximum buffer size for
the pretty-printing output and defining the maximum input size of a
system call to half of that.

I'd rather see, in a header file shared between tracing mark write
implementation and output implementation:

	#define TRACING_MARK_MAX_SIZE 4096

and then a static validation that this input fits within your pretty
printing output in the output implementation file:

	BUILD_BUG_ON(TRACING_MARK_MAX_SIZE + sizeof(meta data stuff...) > TRACE_SEQ_SIZE);

This way we clearly document that the tracing mark write input limit is
4kB, rather than something derived from the size of an output buffer.

Thanks,

Mathieu

Signed-off-by: Steven Rostedt (Google)
---
 kernel/trace/trace.c | 16 +++-
 1 file changed, 11 insertions(+), 5 deletions(-)

diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 8198bfc54b58..d68544aef65f 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -7320,6 +7320,17 @@ tracing_mark_write(struct file *filp, const char __user *ubuf,
 	if ((ssize_t)cnt < 0)
 		return -EINVAL;
 
+	/*
+	 * TRACE_SEQ_SIZE is the total size of trace_seq buffer used
+	 * for output. As the print event outputs more than just
+	 * the string written, keep it smaller than the trace_seq
+	 * as it could drop the event if the extra data makes it bigger
+	 * than what the trace_seq can hold. Half the TRACE_SEQ_SIZE
+	 * is more than enough.
+	 */
+	if (cnt > TRACE_SEQ_SIZE / 2)
+		cnt = TRACE_SEQ_SIZE / 2;
+
 	meta_size = sizeof(*entry) + 2;	/* add '\0' and possible '\n' */
 again:
 	size = cnt + meta_size;
@@ -7328,11 +7339,6 @@ tracing_mark_write(struct file *filp, const char __user *ubuf,
 	if (cnt < FAULTED_SIZE)
 		size += FAULTED_SIZE - cnt;
 
-	if (size > TRACE_SEQ_BUFFER_SIZE) {
-		cnt -= size - TRACE_SEQ_BUFFER_SIZE;
-		goto again;
-	}
-
 	buffer = tr->array_buffer.buffer;
 	event = __trace_buffer_lock_reserve(buffer, TRACE_PRINT, size,
 					    tracing_gen_ctx());

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
Re: [PATCH] tracing: Remove precision vsnprintf() check from print event
On 2024-03-04 17:43, Steven Rostedt wrote:

From: "Steven Rostedt (Google)"

This reverts 60be76eeabb3d ("tracing: Add size check when printing
trace_marker output"). The only reason the precision check was added
was because of a bug that miscalculated the write size of the string
into the ring buffer and it truncated it removing the terminating nul
byte. On reading the trace it crashed the kernel. But this was due to
the bug in the code that happened during development and should never
happen in practice. If anything, the precision can hide bugs where the
string in the ring buffer isn't nul terminated and it will not be
checked.

Link: https://lore.kernel.org/all/c7e7af1a-d30f-4d18-b8e5-af1ef5800...@linux.ibm.com/
Link: https://lore.kernel.org/linux-trace-kernel/20240227125706.04279...@gandalf.local.home
Link: https://lore.kernel.org/all/20240302111244.3a167...@gandalf.local.home/
Reported-by: Sachin Sant
Fixes: 60be76eeabb3d ("tracing: Add size check when printing trace_marker output")
Signed-off-by: Steven Rostedt (Google)

This is a step in the right direction IMHO.

Reviewed-by: Mathieu Desnoyers

Just out of curiosity, is there anything to prevent trace_marker from
writing a huge string into the ring buffer in the first place ? Is this
limit implicit and based on the page size of the architecture or is it
a known fixed limit ? (e.g. 4kB strings).

It appears to currently be limited by

	#define TRACE_SEQ_BUFFER_SIZE	(PAGE_SIZE * 2 - \
		(sizeof(struct seq_buf) + sizeof(size_t) + sizeof(int)))

checked within tracing_mark_write().

I would have hoped for a simpler limit (e.g. 4kB) consistent across
architectures. But that would belong to a separate change.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
Re: [PATCH] ring-buffer: use READ_ONCE() to read cpu_buffer->commit_page in concurrent environment
On 2024-03-01 10:49, Steven Rostedt wrote:

On Fri, 1 Mar 2024 13:37:18 +0800 linke wrote:

So basically you are worried about read-tearing? That wasn't mentioned
in the change log.

Yes. Sorry for making this confused, I am not very familiar with this
and still learning.

No problem. We all have to learn this anyway.

Funny part is, if the above timestamp read did a tear, then this would
definitely not match, and would return the correct value. That is, the
buffer is not empty because the only way for this to get corrupted is
if something is in the process of writing to it.

I agree with you here.

	commit = rb_page_commit(commit_page);

But if commit_page above is the result of a torn read, the commit field
read by rb_page_commit() may not represent a valid value.

But commit_page is a word length, and I will argue that any compiler
that tears "long" words is broken. ;-)

[ For those tuning in, we are discussing the ring_buffer_iter_empty()
  "commit_page = cpu_buffer->commit_page;" racy load. ]

I counter-argue that real-world compilers *are* broken based on your
personal definition, but we have to deal with them, as documented in
Documentation/memory-barriers.txt (see below). What is the added
overhead of using a READ_ONCE() there ? Why are we wasting effort
trying to guess the compiler behavior if the real-world performance
impact is insignificant ?

Quote from memory-barriers.txt explaining the purpose of
{READ,WRITE}_ONCE():

"(*) For aligned memory locations whose size allows them to be accessed
with a single memory-reference instruction, prevents "load tearing" and
"store tearing," in which a single large access is replaced by multiple
smaller accesses."

I agree that {READ,WRITE}_ONCE() are really not needed at
initialization, when there are demonstrably no concurrent accesses to
the data. But trying to eliminate {READ,WRITE}_ONCE() on concurrently
accessed fields just adds complexity, prevents static analyzers from
properly understanding the code and reporting issues, and just
obfuscates the code.

Thanks,

Mathieu

In this case, READ_ONCE() is only needed for the commit_page. But we
can at least keep the READ_ONCE() on the commit_page just because it is
used in the next instruction.

-- Steve

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
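The load-tearing point can be demonstrated with a userspace rendition
of the kernel's READ_ONCE()/WRITE_ONCE() macros. The variable name
mirrors the discussion; this is a sketch, not the kernel's actual
implementation (which also handles sizes larger than a word):

```c
/* Minimal userspace versions of the kernel macros: a volatile access
 * forces a single load/store of the object, so the compiler may not
 * split ("tear") an aligned word-sized access into smaller ones. */
#define READ_ONCE(x)	   (*(const volatile __typeof__(x) *)&(x))
#define WRITE_ONCE(x, val) (*((volatile __typeof__(x) *)&(x)) = (val))

static unsigned long commit_page;	/* stands in for cpu_buffer->commit_page */

static unsigned long read_commit_page(void)
{
	/* One word-sized load: no load tearing, per memory-barriers.txt. */
	return READ_ONCE(commit_page);
}
```

On a single thread this behaves like a plain load; the value of the
annotation is that it also constrains the compiler on the concurrent
path, and documents the racy access for static analyzers.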
Re: [PATCH v3] ring-buffer: Simplify reservation with try_cmpxchg() loop
On 2024-02-20 09:19, Steven Rostedt wrote:

On Mon, 19 Feb 2024 18:20:32 -0500 Steven Rostedt wrote:

Instead of using local_add_return() to reserve the ring buffer data,
Mathieu Desnoyers suggested using local_cmpxchg(). This would simplify
the reservation with the time keeping code. Although it does not get
rid of the double time stamps (before_stamp and write_stamp), using
cmpxchg() does get rid of the more complex case when an interrupting
event occurs between getting the timestamps and reserving the data, as
when that happens, it just tries again instead of dealing with it.

Before we had:

	w = local_read(&tail_page->write);
	/* get time stamps */
	write = local_add_return(length, &tail_page->write);
	if (write - length == w) {
		/* do simple case */
	} else {
		/* do complex case */
	}

By switching the local_add_return() to a local_try_cmpxchg() it can now
be:

	w = local_read(&tail_page->write);
 again:
	/* get time stamps */
	if (!local_try_cmpxchg(&tail_page->write, &w, w + length))
		goto again;
	/* do simple case */

Something about this logic is causing __rb_reserve_next() to sometimes
always return -EAGAIN and triggering the:

	RB_WARN_ON(cpu_buffer, ++nr_loops > 1000)

Which disables the ring buffer. I'm not sure what it is, but until I
do, I'm removing the patch from my queue.

Try resetting the info->add_timestamp flags to add_ts_default on goto
again within __rb_reserve_next().

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
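The retry shape being discussed can be modeled with C11 atomics
standing in for the kernel's local_try_cmpxchg(). Timestamps and the
add_timestamp flags are elided, so this only illustrates why the
"complex case" disappears; names are illustrative.

```c
/* Userspace model of the cmpxchg-based reservation: read the write
 * position, then atomically claim [w, w + length). If another event
 * slipped in between the read and the claim, the CAS fails, w is
 * reloaded, and we simply try again. */
#include <stdatomic.h>

static _Atomic unsigned long write_pos;	/* stands in for tail_page->write */

static unsigned long reserve(unsigned long length)
{
	unsigned long w = atomic_load(&write_pos);

	/* A real implementation would re-read timestamps here on every
	 * retry (and reset any add_timestamp flags to their defaults,
	 * per the suggestion above). */
	while (!atomic_compare_exchange_weak(&write_pos, &w, w + length))
		;	/* w was reloaded by the failed CAS; retry */
	return w;	/* offset of the reserved region: always the simple case */
}
```

The add_return variant instead commits the advance unconditionally and
must then detect and repair interference after the fact, which is the
complex path the loop above avoids.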
Re: [PATCH] dax: add set_dax_nomc() and set_dax_nocache() stub helpers
On 2024-02-16 15:22, Arnd Bergmann wrote:

From: Arnd Bergmann

In some randconfig builds, the IS_ERR() check appears to not get
completely eliminated, resulting in the compiler inserting references
to these two functions that cause a link failure:

  ERROR: modpost: "set_dax_nocache" [drivers/md/dm-mod.ko] undefined!
  ERROR: modpost: "set_dax_nomc" [drivers/md/dm-mod.ko] undefined!

Add more stub functions for the dax-disabled case here to make it build
again.

Hi Arnd,

Note that this is a duplicate of:

https://lore.kernel.org/lkml/20240215144633.96437-2-mathieu.desnoy...@efficios.com/

now present in Andrew's tree. The only differences are the subject,
commit message and a newline between "set_dax_nomc" and
"set_dax_synchronous" in your change.

Thanks,

Mathieu

Fixes: d888f6b0a766 ("dm: treat alloc_dax() -EOPNOTSUPP failure as non-fatal")
Reported-by: kernel test robot
Closes: https://lore.kernel.org/oe-kbuild-all/202402160420.e4qkwogo-...@intel.com/
Signed-off-by: Arnd Bergmann
---
 include/linux/dax.h | 12 +---
 1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/include/linux/dax.h b/include/linux/dax.h
index df2d52b8a245..4527c10016fb 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -64,6 +64,9 @@ void dax_write_cache(struct dax_device *dax_dev, bool wc);
 bool dax_write_cache_enabled(struct dax_device *dax_dev);
 bool dax_synchronous(struct dax_device *dax_dev);
 void set_dax_synchronous(struct dax_device *dax_dev);
+void set_dax_nocache(struct dax_device *dax_dev);
+void set_dax_nomc(struct dax_device *dax_dev);
+
 size_t dax_recovery_write(struct dax_device *dax_dev, pgoff_t pgoff,
 		void *addr, size_t bytes, struct iov_iter *i);
 /*
@@ -108,6 +111,12 @@ static inline bool dax_synchronous(struct dax_device *dax_dev)
 static inline void set_dax_synchronous(struct dax_device *dax_dev)
 {
 }
+static inline void set_dax_nocache(struct dax_device *dax_dev)
+{
+}
+static inline void set_dax_nomc(struct dax_device *dax_dev)
+{
+}
 static inline bool daxdev_mapping_supported(struct vm_area_struct *vma,
 		struct dax_device *dax_dev)
 {
@@ -120,9 +129,6 @@ static inline size_t dax_recovery_write(struct dax_device *dax_dev,
 }
 #endif
 
-void set_dax_nocache(struct dax_device *dax_dev);
-void set_dax_nomc(struct dax_device *dax_dev);
-
 struct writeback_control;
 #if defined(CONFIG_BLOCK) && defined(CONFIG_FS_DAX)
 int dax_add_host(struct dax_device *dax_dev, struct gendisk *disk);

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
Re: CPU data cache across reboot/kexec for pmem/dax devices
On 2024-02-09 15:15, Dan Williams wrote: Mathieu Desnoyers wrote: Hi Dan, In the context of extracting user-space trace data when the kernel crashes, the LTTng user-space tracer recommends using nvdimm/pmem to reserve an area of physical (volatile) RAM at boot (memmap=nn[KMG]!ss[KMG]), and use the resulting device to create/mount a dax-enabled fs (e.g. ext4). We then use this filesystem to mmap() the shared memory files for the tracer. I want to make sure that the very last events from the userspace tracer written to the memory mapped buffers (mmap()) by userspace are present after a warm-reboot (or kexec/kdump). Note that the LTTng user-space tracer (LTTng-UST) does *not* issue any clflush (or equivalent pmem_persist() from libpmem) for performance reasons: ring buffer data is usually overwritten many times before the system actually crashes, and the only thing we really need to make sure is that the cache lines are not invalidated without write back. So I understand that the main use-case for pmem is nvdimm, and that in order to guarantee persistence of the data on power off an explicit pmem_persist() is needed after each "transaction", but for the needs of tracing, is there some kind of architectural guarantee that the data present in the cpu data cache is not invalidated prior to write back in each of those scenarios ? - reboot with bios explicitly not clearing memory, This one gives me pause, because a trip through the BIOS typically means lots of resets and other low level magic, so this would likely require pushing dirty data out of CPU caches prior to entering the BIOS code paths. So this either needs explicit cache flushing or mapping the memory with write-through semantics. That latter one is not supported in the stack today. I would favor the explicit cache flushing approach over write-through semantics for performance reasons: typically the ring buffers are overwritten, so always storing the data to memory would be wasteful. 
But I would rather do the cache flushing only on crash (die oops/panic notifiers) rather than require the tracer to flush cache lines immediately after each event is produced, again for performance reasons. I have done a crude attempt at registering die oops/panic notifiers from the pmem driver, and flush all pmem areas to memory when the die oops/panic notifiers are called (see the untested patch below). I wonder if this is something that would be generally useful? - kexec/kdump. This should maintain the state of CPU caches. As far as the CPU is concerned it is just long jumping into a new kernel in memory without resetting any CPU cache state. Good to know that this scenario is expected to "Just Work". Thanks, Mathieu

diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index e9898457a7bd..b92f18607d39 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -26,12 +26,18 @@
 #include
 #include
 #include
+#include
 #include
 #include "pmem.h"
 #include "btt.h"
 #include "pfn.h"
 #include "nd.h"

+static int pmem_die_handler(struct notifier_block *self,
+		unsigned long ev, void *unused);
+static int pmem_panic_handler(struct notifier_block *self,
+		unsigned long ev, void *unused);
+
 static struct device *to_dev(struct pmem_device *pmem)
 {
	/*
@@ -423,6 +429,9 @@ static void pmem_release_disk(void *__pmem)
 {
	struct pmem_device *pmem = __pmem;

+	atomic_notifier_chain_unregister(&panic_notifier_list,
+			&pmem->panic_notifier);
+	unregister_die_notifier(&pmem->die_notifier);
	dax_remove_host(pmem->disk);
	kill_dax(pmem->dax_dev);
	put_dax(pmem->dax_dev);
@@ -573,9 +582,20 @@ static int pmem_attach_disk(struct device *dev,
			goto out_cleanup_dax;
		dax_write_cache(dax_dev, nvdimm_has_cache(nd_region));
	}
-	rc = device_add_disk(dev, disk, pmem_attribute_groups);
+	pmem->die_notifier.notifier_call = pmem_die_handler;
+	pmem->die_notifier.priority = -INT_MAX;
+	rc = register_die_notifier(&pmem->die_notifier);
	if (rc)
		goto out_remove_host;
+	pmem->panic_notifier.notifier_call = pmem_panic_handler;
+	pmem->panic_notifier.priority = -INT_MAX;
+	rc = atomic_notifier_chain_register(&panic_notifier_list,
+			&pmem->panic_notifier);
+	if (rc)
+		goto out_unregister_die;
+	rc = device_add_disk(dev, disk, pmem_attribute_groups);
+	if (rc)
+		goto out_unregister_panic;

	if (devm_add_action_or_reset(dev, pmem_release_disk, pmem))
		return -ENOMEM;
@@ -587,6 +607,11 @@ static int pmem_attach_disk(struct device *dev,
		dev_warn(dev, "'badblocks' notification disabled\n");
	return 0;

+out_unregister_panic:
+	atomic_notifier_chain_unregister(&panic_notifi
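The patch quoted above registers the notifiers but its handler bodies are cut off. As a hedged sketch (not the submitted code), a panic callback along these lines would write back the CPU data cache for the whole pmem range; `panic_notifier` is assumed to be a field the patch adds to `struct pmem_device`, and `arch_wb_cache_pmem()` is used here as the write-back primitive:

```c
/*
 * Hypothetical sketch of the truncated notifier callback: write back
 * (without invalidating) all cache lines covering the pmem range so
 * the last trace events survive a warm reboot. Field names follow the
 * quoted patch; this is illustrative only.
 */
static int pmem_panic_handler(struct notifier_block *self,
			      unsigned long ev, void *unused)
{
	struct pmem_device *pmem =
		container_of(self, struct pmem_device, panic_notifier);

	arch_wb_cache_pmem(pmem->virt_addr, pmem->size);
	return NOTIFY_DONE;
}
```

The die handler would presumably do the same; running at -INT_MAX priority, as in the patch, makes these callbacks run last so other notifiers still see a fast path.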
CPU data cache across reboot/kexec for pmem/dax devices
Hi Dan, In the context of extracting user-space trace data when the kernel crashes, the LTTng user-space tracer recommends using nvdimm/pmem to reserve an area of physical (volatile) RAM at boot (memmap=nn[KMG]!ss[KMG]), and use the resulting device to create/mount a dax-enabled fs (e.g. ext4). We then use this filesystem to mmap() the shared memory files for the tracer. I want to make sure that the very last events from the userspace tracer written to the memory mapped buffers (mmap()) by userspace are present after a warm-reboot (or kexec/kdump). Note that the LTTng user-space tracer (LTTng-UST) does *not* issue any clflush (or equivalent pmem_persist() from libpmem) for performance reasons: ring buffer data is usually overwritten many times before the system actually crashes, and the only thing we really need to make sure is that the cache lines are not invalidated without write back. So I understand that the main use-case for pmem is nvdimm, and that in order to guarantee persistence of the data on power off an explicit pmem_persist() is needed after each "transaction", but for the needs of tracing, is there some kind of architectural guarantee that the data present in the cpu data cache is not invalidated prior to write back in each of those scenarios ? - reboot with bios explicitly not clearing memory, - kexec/kdump. Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com
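A minimal setup sketch for the workflow described above; addresses, sizes, and mount points are placeholders, and the `memmap` carve-out must match a usable physical range on the machine:

```sh
# Kernel command line: reserve 2G of volatile RAM at physical offset 4G
# as a legacy pmem device (it shows up as /dev/pmem0):
#     memmap=2G!4G

# Create and mount a DAX-capable filesystem on it:
mkfs.ext4 /dev/pmem0
mount -o dax /dev/pmem0 /mnt/tracing

# LTTng-UST shared-memory files placed under /mnt/tracing are then
# mmap()ed through the DAX mapping, so the ring buffers live in the
# reserved region and can survive a warm reboot.
```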
Re: [PATCH v14 4/6] tracing: Allow user-space mapping of the ring-buffer
On 2024-02-05 13:34, Vincent Donnefort wrote: On Mon, Feb 05, 2024 at 11:55:08AM -0500, Mathieu Desnoyers wrote: [...] How are the kernel linear mapping and the userspace mapping made coherent on architectures with virtually aliasing data caches ? Ref. https://lore.kernel.org/lkml/20240202210019.88022-1-mathieu.desnoy...@efficios.com/T/#t Hi Mathieu, Thanks for the pointer. We are in the exact same problem as DAX. We do modify the data through the kernel linear mapping while user-space can read it through its own. I should probably return an error when used with any of the arch ARM || SPARC || MIPS, until cpu_dcache_is_aliasing() introduces a fine-grain differentiation. You might want to use LTTng's ring buffer approach instead. See https://github.com/lttng/lttng-modules/blob/master/src/lib/ringbuffer/ring_buffer_frontend.c#L1202 lib_ring_buffer_flush_read_subbuf_dcache() Basically, whenever user-space grabs a sub-buffer for reading (through lttng-modules's LTTNG_KERNEL_ABI_RING_BUFFER_GET_SUBBUF ioctl), lttng calls flush_dcache_page() on all pages of this subbuffer (I should really change this for a call to flush_dcache_folio() which would be more efficient). Note that doing this is not very efficient on architectures which have coherent data caches and incoherent dcache vs icache: in that case, we issue the flush_dcache_page uselessly. I plan on using the new cpu_dcache_is_aliasing() check once/if it makes its way upstream to remove those useless flushes on architectures which define ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE, but do not virtually alias the data cache. The equivalent of LTTng's "get subbuf" operation would be the new TRACE_MMAP_IOCTL_GET_READER ioctl in ftrace AFAIU. Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com
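The flush-on-read scheme described above is small enough to sketch; this is an illustrative reduction of LTTng's `lib_ring_buffer_flush_read_subbuf_dcache()`, not a copy of it:

```c
/*
 * Illustrative kernel-side sketch: when user space takes a sub-buffer
 * for reading, write back the kernel's dcache lines for each of its
 * pages so the user-space mapping (which may be virtually aliased)
 * observes the data. flush_dcache_page() is a no-op on architectures
 * that do not define ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE.
 */
static void flush_read_subbuf_dcache(struct page **pages,
				     unsigned long nr_pages)
{
	unsigned long i;

	for (i = 0; i < nr_pages; i++)
		flush_dcache_page(pages[i]);
}
```

As discussed above, a `cpu_dcache_is_aliasing()` check could skip this loop entirely on architectures that implement flush_dcache_page() only for dcache-vs-icache coherency.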
Re: [PATCH v14 4/6] tracing: Allow user-space mapping of the ring-buffer
+	vmf->page->mapping = vmf->vma->vm_file->f_mapping;
+	vmf->page->index = vmf->pgoff;
+
+	return 0;
+}
+
+static void tracing_buffers_mmap_close(struct vm_area_struct *vma)
+{
+	struct ftrace_buffer_info *info = vma->vm_file->private_data;
+	struct trace_iterator *iter = &info->iter;
+	struct trace_array *tr = iter->tr;
+
+	ring_buffer_unmap(iter->array_buffer->buffer, iter->cpu_file);
+
+#ifdef CONFIG_TRACER_MAX_TRACE
+	spin_lock(&tr->snapshot_trigger_lock);
+	if (!WARN_ON(!tr->mapped))
+		tr->mapped--;
+	spin_unlock(&tr->snapshot_trigger_lock);
+#endif
+}
+
+static void tracing_buffers_mmap_open(struct vm_area_struct *vma)
+{
+	struct ftrace_buffer_info *info = vma->vm_file->private_data;
+	struct trace_iterator *iter = &info->iter;
+
+	WARN_ON(ring_buffer_map(iter->array_buffer->buffer, iter->cpu_file));
+}
+
+static const struct vm_operations_struct tracing_buffers_vmops = {
+	.open		= tracing_buffers_mmap_open,
+	.close		= tracing_buffers_mmap_close,
+	.fault		= tracing_buffers_mmap_fault,
+};
+
+static int tracing_buffers_mmap(struct file *filp, struct vm_area_struct *vma)
+{
+	struct ftrace_buffer_info *info = filp->private_data;
+	struct trace_iterator *iter = &info->iter;
+	struct trace_array *tr = iter->tr;
+	int ret = 0;
+
+	if (vma->vm_flags & VM_WRITE)
+		return -EPERM;
+
+	vm_flags_mod(vma, VM_DONTCOPY | VM_DONTDUMP, VM_MAYWRITE);
+	vma->vm_ops = &tracing_buffers_vmops;
+
+#ifdef CONFIG_TRACER_MAX_TRACE
+	/*
+	 * We hold mmap_lock here. lockdep would be unhappy if we would now take
+	 * trace_types_lock. Instead use the specific snapshot_trigger_lock.
+	 */
+	spin_lock(&tr->snapshot_trigger_lock);
+	if (tr->snapshot || tr->mapped == UINT_MAX) {
+		spin_unlock(&tr->snapshot_trigger_lock);
+		return -EBUSY;
+	}
+	tr->mapped++;
+	spin_unlock(&tr->snapshot_trigger_lock);
+
+	/* Wait for update_max_tr() to observe iter->tr->mapped */
+	if (tr->mapped == 1)
+		synchronize_rcu();
+#endif
+	ret = ring_buffer_map(iter->array_buffer->buffer, iter->cpu_file);
+#ifdef CONFIG_TRACER_MAX_TRACE
+	if (ret) {
+		spin_lock(&iter->tr->snapshot_trigger_lock);
+		iter->tr->mapped--;
+		spin_unlock(&iter->tr->snapshot_trigger_lock);
+	}
+#endif
+	return ret;
+}
+
 static const struct file_operations tracing_buffers_fops = {
	.open		= tracing_buffers_open,
	.read		= tracing_buffers_read,
@@ -8681,6 +8794,7 @@ static const struct file_operations tracing_buffers_fops = {
	.splice_read	= tracing_buffers_splice_read,
	.unlocked_ioctl	= tracing_buffers_ioctl,
	.llseek		= no_llseek,
+	.mmap		= tracing_buffers_mmap,
 };

 static ssize_t
diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h
index bd312e9afe25..8a96e7a89e6b 100644
--- a/kernel/trace/trace.h
+++ b/kernel/trace/trace.h
@@ -336,6 +336,7 @@ struct trace_array {
	bool			allocated_snapshot;
	spinlock_t		snapshot_trigger_lock;
	unsigned int		snapshot;
+	unsigned int		mapped;
	unsigned long		max_latency;
 #ifdef CONFIG_FSNOTIFY
	struct dentry		*d_max_latency;
-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
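From user space, the new `.mmap` handler above would be exercised roughly as follows. The tracefs path is real, but the mapping length and the commented-out ioctl depend on the series' UAPI header, so treat this as a sketch against a not-yet-merged interface:

```c
/* Hypothetical consumer of the mmap'ed ring buffer. The mapping is
 * read-only by design: the kernel side above rejects VM_WRITE with
 * -EPERM. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	const char *path = "/sys/kernel/tracing/per_cpu/cpu0/trace_pipe_raw";
	long page = sysconf(_SC_PAGESIZE);
	int fd = open(path, O_RDONLY | O_NONBLOCK);

	if (fd < 0) {
		perror("open");
		return 1;
	}
	/* Meta page plus data pages; the real length would come from the
	 * meta page defined by the series' UAPI header (assumed here). */
	void *map = mmap(NULL, 4 * page, PROT_READ, MAP_SHARED, fd, 0);
	if (map == MAP_FAILED) {
		perror("mmap");
		close(fd);
		return 1;
	}
	/* ioctl(fd, TRACE_MMAP_IOCTL_GET_READER);  -- advance the reader
	 * page, the equivalent of LTTng's "get subbuf" (ioctl defined by
	 * the same series). */
	munmap(map, 4 * page);
	close(fd);
	return 0;
}
```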
Re: [RFC PATCH v2 8/8] dax: Fix incorrect list of dcache aliasing architectures
On 2024-01-30 22:13, Dan Williams wrote: Dave Chinner wrote: On Tue, Jan 30, 2024 at 11:52:55AM -0500, Mathieu Desnoyers wrote: commit d92576f1167c ("dax: does not work correctly with virtual aliasing caches") prevents DAX from building on architectures with virtually aliased dcache with: depends on !(ARM || MIPS || SPARC) This check is too broad (e.g. recent ARMv7 don't have virtually aliased dcaches), and also misses many other architectures with virtually aliased dcache. This is a regression introduced in the v5.13 Linux kernel where the dax mount option is removed for 32-bit ARMv7 boards which have no dcache aliasing, and therefore should work fine with FS_DAX. This was turned into the following implementation of dax_is_supported() by a preparatory change: return !IS_ENABLED(CONFIG_ARM) && !IS_ENABLED(CONFIG_MIPS) && !IS_ENABLED(CONFIG_SPARC); Use dcache_is_aliasing() instead to figure out whether the environment has aliasing dcaches. Fixes: d92576f1167c ("dax: does not work correctly with virtual aliasing caches") Signed-off-by: Mathieu Desnoyers Cc: Andrew Morton Cc: Linus Torvalds Cc: linux...@kvack.org Cc: linux-a...@vger.kernel.org Cc: Dan Williams Cc: Vishal Verma Cc: Dave Jiang Cc: Matthew Wilcox Cc: Arnd Bergmann Cc: Russell King Cc: nvd...@lists.linux.dev Cc: linux-...@vger.kernel.org Cc: linux-fsde...@vger.kernel.org --- include/linux/dax.h | 5 ++--- 1 file changed, 2 insertions(+), 3 deletions(-) diff --git a/include/linux/dax.h b/include/linux/dax.h index cfc8cd4a3eae..f59e604662e4 100644 --- a/include/linux/dax.h +++ b/include/linux/dax.h @@ -5,6 +5,7 @@ #include #include #include +#include typedef unsigned long dax_entry_t; @@ -80,9 +81,7 @@ static inline bool daxdev_mapping_supported(struct vm_area_struct *vma, } static inline bool dax_is_supported(void) { - return !IS_ENABLED(CONFIG_ARM) && - !IS_ENABLED(CONFIG_MIPS) && - !IS_ENABLED(CONFIG_SPARC); + return !dcache_is_aliasing(); Yeah, if this is just a one liner should go into 
fs_dax_get_by_bdev(), similar to the blk_queue_dax() check at the start of the function. I also noticed that device mapper uses fs_dax_get_by_bdev() to determine if it can support DAX, but this patch set does not address that case. Hence it really seems to me like fs_dax_get_by_bdev() is the right place to put this check. Oh, good catch. Yes, I agree this can definitely be pushed down, but then I think it should be pushed down all the way to make alloc_dax() fail. That will need some additional fixups like:

diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 8dcabf84d866..a35e60e62440 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -2126,12 +2126,12 @@ static struct mapped_device *alloc_dev(int minor)
 		md->dax_dev = alloc_dax(md, &dm_dax_ops);
 		if (IS_ERR(md->dax_dev)) {
 			md->dax_dev = NULL;
-			goto bad;
+		} else {
+			set_dax_nocache(md->dax_dev);
+			set_dax_nomc(md->dax_dev);
+			if (dax_add_host(md->dax_dev, md->disk))
+				goto bad;
 		}
-		set_dax_nocache(md->dax_dev);
-		set_dax_nomc(md->dax_dev);
-		if (dax_add_host(md->dax_dev, md->disk))
-			goto bad;
 	}

 	format_dev_t(md->name, MKDEV(_major, minor));

...to make it not fatal to fail to register the dax_dev. I've had a quick look at other users of alloc_dax() and alloc_dax_region(), and so far I figure that all of those really want to bail out on alloc_dax() failure. Is dm.c the only special case we need to fix to make it non-fatal? Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com
Re: [RFC PATCH v2 7/8] Introduce dcache_is_aliasing() across all architectures
On 2024-01-30 21:48, Dave Chinner wrote: On Tue, Jan 30, 2024 at 11:52:54AM -0500, Mathieu Desnoyers wrote: Introduce a generic way to query whether the dcache is virtually aliased on all architectures. Its purpose is to ensure that subsystems which are incompatible with virtually aliased data caches (e.g. FS_DAX) can reliably query this. For dcache aliasing, there are three scenarios depending on the architecture. Here is a breakdown based on my understanding:

A) The dcache is always aliasing:

* arc
* csky
* m68k (note: shared memory mappings are incoherent? SHMLBA is missing there.)
* sh
* parisc

/me wonders why the dentry cache aliasing has problems on these systems. Oh, dcache != fs/dcache.c (the VFS dentry cache). Can you please rename this function appropriately so us dumb filesystem people don't confuse CPU data cache configurations with the VFS dentry cache aliasing when we read this code? Something like cpu_dcache_is_aliased(), perhaps? Good point, will do. I'm planning to rename as follows for v3 to eliminate confusion with the dentry cache (and with "page cache" in general): ARCH_HAS_CACHE_ALIASING -> ARCH_HAS_CPU_CACHE_ALIASING dcache_is_aliasing() -> cpu_dcache_is_aliasing() I noticed that you suggested "aliased" rather than "aliasing", but I followed what arm64 did for icache_is_aliasing(). Do you have a strong preference one way or another? Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com
[RFC PATCH v2 7/8] Introduce dcache_is_aliasing() across all architectures
Introduce a generic way to query whether the dcache is virtually aliased on all architectures. Its purpose is to ensure that subsystems which are incompatible with virtually aliased data caches (e.g. FS_DAX) can reliably query this. For dcache aliasing, there are three scenarios depending on the architecture. Here is a breakdown based on my understanding:

A) The dcache is always aliasing:

* arc
* csky
* m68k (note: shared memory mappings are incoherent? SHMLBA is missing there.)
* sh
* parisc

B) The dcache aliasing is statically known or depends on querying CPU state at runtime:

* arm (cache_is_vivt() || cache_is_vipt_aliasing())
* mips (cpu_has_dc_aliases)
* nios2 (NIOS2_DCACHE_SIZE > PAGE_SIZE)
* sparc32 (vac_cache_size > PAGE_SIZE)
* sparc64 (L1DCACHE_SIZE > PAGE_SIZE)
* xtensa (DCACHE_WAY_SIZE > PAGE_SIZE)

C) The dcache is never aliasing:

* alpha
* arm64 (aarch64)
* hexagon
* loongarch (but with incoherent write buffers, which are disabled since commit d23b7795 ("LoongArch: Change SHMLBA from SZ_64K to PAGE_SIZE"))
* microblaze
* openrisc
* powerpc
* riscv
* s390
* um
* x86

Require architectures in A) and B) to select ARCH_HAS_CACHE_ALIASING and implement "dcache_is_aliasing()". Architectures in C) don't select ARCH_HAS_CACHE_ALIASING, and thus dcache_is_aliasing() simply evaluates to "false". Note that this leaves "icache_is_aliasing()" to be implemented as future work. This would be useful to gate features like XIP on architectures which have aliasing dcache-icache but not dcache-dcache. 
Link: https://lore.kernel.org/lkml/20030910210416.ga24...@mail.jlokier.co.uk/
Fixes: d92576f1167c ("dax: does not work correctly with virtual aliasing caches")
Signed-off-by: Mathieu Desnoyers
Cc: Andrew Morton
Cc: Linus Torvalds
Cc: linux...@kvack.org
Cc: linux-a...@vger.kernel.org
Cc: Dan Williams
Cc: Vishal Verma
Cc: Dave Jiang
Cc: Matthew Wilcox
Cc: Arnd Bergmann
Cc: Russell King
Cc: linux-...@vger.kernel.org
Cc: nvd...@lists.linux.dev
Cc: linux-...@vger.kernel.org
---
 arch/arc/Kconfig                    |  1 +
 arch/arc/include/asm/cachetype.h    |  9 +
 arch/arm/Kconfig                    |  1 +
 arch/arm/include/asm/cachetype.h    |  2 ++
 arch/csky/Kconfig                   |  1 +
 arch/csky/include/asm/cachetype.h   |  9 +
 arch/m68k/Kconfig                   |  1 +
 arch/m68k/include/asm/cachetype.h   |  9 +
 arch/mips/Kconfig                   |  1 +
 arch/mips/include/asm/cachetype.h   |  9 +
 arch/nios2/Kconfig                  |  1 +
 arch/nios2/include/asm/cachetype.h  | 10 ++
 arch/parisc/Kconfig                 |  1 +
 arch/parisc/include/asm/cachetype.h |  9 +
 arch/sh/Kconfig                     |  1 +
 arch/sh/include/asm/cachetype.h     |  9 +
 arch/sparc/Kconfig                  |  1 +
 arch/sparc/include/asm/cachetype.h  | 14 ++
 arch/xtensa/Kconfig                 |  1 +
 arch/xtensa/include/asm/cachetype.h | 10 ++
 include/linux/cacheinfo.h           |  6 ++
 mm/Kconfig                          |  6 ++
 22 files changed, 112 insertions(+)
 create mode 100644 arch/arc/include/asm/cachetype.h
 create mode 100644 arch/csky/include/asm/cachetype.h
 create mode 100644 arch/m68k/include/asm/cachetype.h
 create mode 100644 arch/mips/include/asm/cachetype.h
 create mode 100644 arch/nios2/include/asm/cachetype.h
 create mode 100644 arch/parisc/include/asm/cachetype.h
 create mode 100644 arch/sh/include/asm/cachetype.h
 create mode 100644 arch/sparc/include/asm/cachetype.h
 create mode 100644 arch/xtensa/include/asm/cachetype.h

diff --git a/arch/arc/Kconfig b/arch/arc/Kconfig
index 1b0483c51cc1..969e6740bcf7 100644
--- a/arch/arc/Kconfig
+++ b/arch/arc/Kconfig
@@ -6,6 +6,7 @@ config ARC
	def_bool y
	select ARC_TIMERS
+	select ARCH_HAS_CACHE_ALIASING
	select ARCH_HAS_CACHE_LINE_SIZE
	select ARCH_HAS_DEBUG_VM_PGTABLE
	select ARCH_HAS_DMA_PREP_COHERENT
diff --git a/arch/arc/include/asm/cachetype.h b/arch/arc/include/asm/cachetype.h
new file mode 100644
index ..290e3cc85845
--- /dev/null
+++ b/arch/arc/include/asm/cachetype.h
@@ -0,0 +1,9 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef __ASM_ARC_CACHETYPE_H
+#define __ASM_ARC_CACHETYPE_H
+
+#include
+
+#define dcache_is_aliasing()	true
+
+#endif
diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
index f8567e95f98b..5adeee5e421f 100644
--- a/arch/arm/Kconfig
+++ b/arch/arm/Kconfig
@@ -5,6 +5,7 @@ config ARM
	select ARCH_32BIT_OFF_T
	select ARCH_CORRECT_STACKTRACE_ON_KRETPROBE if HAVE_KRETPROBES && FRAME_POINTER && !ARM_UNWIND
	select ARCH_HAS_BINFMT_FLAT
+	select ARCH_HAS_CACHE_ALIASING
	select ARCH_HAS_CPU_FINALIZE_INIT if MMU
	select ARCH_HAS_CURRENT_STACK_POINTER
	select ARCH_HAS_DEBUG_VIRTUAL if MMU
diff --git a/arch/arm/include/asm/c
[RFC PATCH v2 8/8] dax: Fix incorrect list of dcache aliasing architectures
commit d92576f1167c ("dax: does not work correctly with virtual aliasing caches") prevents DAX from building on architectures with virtually aliased dcache with:

  depends on !(ARM || MIPS || SPARC)

This check is too broad (e.g. recent ARMv7 don't have virtually aliased dcaches), and also misses many other architectures with virtually aliased dcache. This is a regression introduced in the v5.13 Linux kernel where the dax mount option is removed for 32-bit ARMv7 boards which have no dcache aliasing, and therefore should work fine with FS_DAX. This was turned into the following implementation of dax_is_supported() by a preparatory change:

  return !IS_ENABLED(CONFIG_ARM) &&
         !IS_ENABLED(CONFIG_MIPS) &&
         !IS_ENABLED(CONFIG_SPARC);

Use dcache_is_aliasing() instead to figure out whether the environment has aliasing dcaches.

Fixes: d92576f1167c ("dax: does not work correctly with virtual aliasing caches")
Signed-off-by: Mathieu Desnoyers
Cc: Andrew Morton
Cc: Linus Torvalds
Cc: linux...@kvack.org
Cc: linux-a...@vger.kernel.org
Cc: Dan Williams
Cc: Vishal Verma
Cc: Dave Jiang
Cc: Matthew Wilcox
Cc: Arnd Bergmann
Cc: Russell King
Cc: nvd...@lists.linux.dev
Cc: linux-...@vger.kernel.org
Cc: linux-fsde...@vger.kernel.org
---
 include/linux/dax.h | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/include/linux/dax.h b/include/linux/dax.h
index cfc8cd4a3eae..f59e604662e4 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -5,6 +5,7 @@
 #include
 #include
 #include
+#include

 typedef unsigned long dax_entry_t;

@@ -80,9 +81,7 @@ static inline bool daxdev_mapping_supported(struct vm_area_struct *vma,
 }
 static inline bool dax_is_supported(void)
 {
-	return !IS_ENABLED(CONFIG_ARM) &&
-		!IS_ENABLED(CONFIG_MIPS) &&
-		!IS_ENABLED(CONFIG_SPARC);
+	return !dcache_is_aliasing();
 }
 #else
 static inline void *dax_holder(struct dax_device *dax_dev)
-- 
2.39.2
[RFC PATCH v2 6/8] xfs: Use dax_is_supported()
Use dax_is_supported() to validate whether the architecture has virtually aliased data caches at mount time. Silently disable DAX if dax=always is requested as a mount option on an architecture which does not support DAX. This is relevant for architectures which require a dynamic check to validate whether they have virtually aliased data caches.

Fixes: d92576f1167c ("dax: does not work correctly with virtual aliasing caches")
Signed-off-by: Mathieu Desnoyers
Cc: Chandan Babu R
Cc: Darrick J. Wong
Cc: linux-...@vger.kernel.org
Cc: Andrew Morton
Cc: Linus Torvalds
Cc: linux...@kvack.org
Cc: linux-a...@vger.kernel.org
Cc: Dan Williams
Cc: Vishal Verma
Cc: Dave Jiang
Cc: Matthew Wilcox
Cc: Arnd Bergmann
Cc: Russell King
Cc: nvd...@lists.linux.dev
Cc: linux-...@vger.kernel.org
---
 fs/xfs/xfs_iops.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index a0d77f5f512e..360f640159b0 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -1208,7 +1208,7 @@ static bool
 xfs_inode_should_enable_dax(
	struct xfs_inode *ip)
 {
-	if (!IS_ENABLED(CONFIG_FS_DAX))
+	if (!dax_is_supported())
		return false;
	if (xfs_has_dax_never(ip->i_mount))
		return false;
-- 
2.39.2
[RFC PATCH v2 5/8] fuse: Use dax_is_supported()
Use dax_is_supported() to validate whether the architecture has virtually aliased data caches at mount time. Silently disable DAX if dax=always is requested as a mount option on an architecture which does not support DAX. This is relevant for architectures which require a dynamic check to validate whether they have virtually aliased data caches.

Fixes: d92576f1167c ("dax: does not work correctly with virtual aliasing caches")
Signed-off-by: Mathieu Desnoyers
Cc: Miklos Szeredi
Cc: linux-fsde...@vger.kernel.org
Cc: Andrew Morton
Cc: Linus Torvalds
Cc: linux...@kvack.org
Cc: linux-a...@vger.kernel.org
Cc: Dan Williams
Cc: Vishal Verma
Cc: Dave Jiang
Cc: Matthew Wilcox
Cc: Arnd Bergmann
Cc: Russell King
Cc: nvd...@lists.linux.dev
Cc: linux-...@vger.kernel.org
---
 fs/fuse/dax.c | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/fs/fuse/dax.c b/fs/fuse/dax.c
index 12ef91d170bb..36e1c1abbf8e 100644
--- a/fs/fuse/dax.c
+++ b/fs/fuse/dax.c
@@ -1336,6 +1336,13 @@ static bool fuse_should_enable_dax(struct inode *inode, unsigned int flags)
	if (dax_mode == FUSE_DAX_NEVER)
		return false;

+	/*
+	 * Silently fallback to 'never' mode if the architecture does
+	 * not support DAX.
+	 */
+	if (!dax_is_supported())
+		return false;
+
	/*
	 * fc->dax may be NULL in 'inode' mode when filesystem device doesn't
	 * support DAX, in which case it will silently fallback to 'never' mode.
-- 
2.39.2
[RFC PATCH v2 4/8] ext4: Use dax_is_supported()
Use dax_is_supported() to validate whether the architecture has virtually aliased data caches at mount time. Mount fails if dax=always is requested as a mount option on an architecture which does not support DAX. This is relevant for architectures which require a dynamic check to validate whether they have virtually aliased data caches.

Fixes: d92576f1167c ("dax: does not work correctly with virtual aliasing caches")
Signed-off-by: Mathieu Desnoyers
Cc: "Theodore Ts'o"
Cc: Andreas Dilger
Cc: linux-e...@vger.kernel.org
Cc: Andrew Morton
Cc: Linus Torvalds
Cc: linux...@kvack.org
Cc: linux-a...@vger.kernel.org
Cc: Dan Williams
Cc: Vishal Verma
Cc: Dave Jiang
Cc: Matthew Wilcox
Cc: Arnd Bergmann
Cc: Russell King
Cc: nvd...@lists.linux.dev
Cc: linux-...@vger.kernel.org
Cc: linux-fsde...@vger.kernel.org
---
 fs/ext4/super.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index c5fcf377ab1f..f2c11ae3ec29 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -4743,7 +4743,10 @@ static int ext4_check_feature_compatibility(struct super_block *sb,
	}

	if (sbi->s_mount_opt & EXT4_MOUNT_DAX_ALWAYS) {
-		if (ext4_has_feature_inline_data(sb)) {
+		if (!dax_is_supported()) {
+			ext4_msg(sb, KERN_ERR, "DAX unsupported by architecture.");
+			return -EINVAL;
+		} else if (ext4_has_feature_inline_data(sb)) {
			ext4_msg(sb, KERN_ERR, "Cannot use DAX on a filesystem"
					" that may contain inline data");
			return -EINVAL;
-- 
2.39.2
[RFC PATCH v2 3/8] ext2: Use dax_is_supported()
Use dax_is_supported() to validate whether the architecture has virtually aliased data caches at mount time. Print an error and disable DAX if dax=always is requested as a mount option on an architecture which does not support DAX. This is relevant for architectures which require a dynamic check to validate whether they have virtually aliased data caches.

Fixes: d92576f1167c ("dax: does not work correctly with virtual aliasing caches")
Signed-off-by: Mathieu Desnoyers
Cc: Jan Kara
Cc: linux-e...@vger.kernel.org
Cc: Andrew Morton
Cc: Linus Torvalds
Cc: linux...@kvack.org
Cc: linux-a...@vger.kernel.org
Cc: Dan Williams
Cc: Vishal Verma
Cc: Dave Jiang
Cc: Matthew Wilcox
Cc: Arnd Bergmann
Cc: Russell King
Cc: nvd...@lists.linux.dev
Cc: linux-...@vger.kernel.org
Cc: linux-fsde...@vger.kernel.org
---
 fs/ext2/super.c | 6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/fs/ext2/super.c b/fs/ext2/super.c
index 01f9addc8b1f..30ff57d47ed4 100644
--- a/fs/ext2/super.c
+++ b/fs/ext2/super.c
@@ -955,7 +955,11 @@ static int ext2_fill_super(struct super_block *sb, void *data, int silent)
	blocksize = BLOCK_SIZE << le32_to_cpu(sbi->s_es->s_log_block_size);

	if (test_opt(sb, DAX)) {
-		if (!sbi->s_daxdev) {
+		if (!dax_is_supported()) {
+			ext2_msg(sb, KERN_ERR,
+				"DAX unsupported by architecture. Turning off DAX.");
+			clear_opt(sbi->s_mount_opt, DAX);
+		} else if (!sbi->s_daxdev) {
			ext2_msg(sb, KERN_ERR,
				"DAX unsupported by block device. Turning off DAX.");
			clear_opt(sbi->s_mount_opt, DAX);
-- 
2.39.2
[RFC PATCH v2 0/8] Introduce dcache_is_aliasing() to fix DAX regression
This commit introduced in v5.13 prevents building FS_DAX on 32-bit ARM, even on ARMv7 which does not have virtually aliased dcaches: commit d92576f1167c ("dax: does not work correctly with virtual aliasing caches") It used to work fine before: I have customers using DAX over pmem on ARMv7, but this regression will likely prevent them from upgrading their kernel. The root of the issue here is the fact that DAX was never designed to handle virtually aliased dcache (VIVT and VIPT with aliased dcache). It touches the pages through their linear mapping, which is not consistent with the userspace mappings on virtually aliased dcaches. This patch series introduces dcache_is_aliasing() with the new Kconfig option ARCH_HAS_CACHE_ALIASING and implements it for all architectures. The implementation of dcache_is_aliasing() is either evaluated to a constant at compile-time or a runtime check, which is what is needed on ARM. With this we can basically narrow down the list of architectures which are unsupported by DAX to those which are really affected. Note that the order of the series was completely changed based on the feedback received on v1, cache_is_aliasing() is renamed to dcache_is_aliasing(), ARCH_HAS_CACHE_ALIASING_DYNAMIC is gone, dcache_is_aliasing() vs ARCH_HAS_CACHE_ALIASING relationship is simplified, and the dax_is_supported() check was moved to its rightful place in all filesystems. 
Feedback is welcome,

Thanks,

Mathieu

Cc: Andrew Morton
Cc: Linus Torvalds
Cc: linux...@kvack.org
Cc: linux-a...@vger.kernel.org
Cc: Dan Williams
Cc: Vishal Verma
Cc: Dave Jiang
Cc: Matthew Wilcox
Cc: Arnd Bergmann
Cc: Russell King
Cc: linux-...@vger.kernel.org
Cc: nvd...@lists.linux.dev
Cc: linux-fsde...@vger.kernel.org

Mathieu Desnoyers (8):
  dax: Introduce dax_is_supported()
  erofs: Use dax_is_supported()
  ext2: Use dax_is_supported()
  ext4: Use dax_is_supported()
  fuse: Use dax_is_supported()
  xfs: Use dax_is_supported()
  Introduce dcache_is_aliasing() across all architectures
  dax: Fix incorrect list of dcache aliasing architectures

 arch/arc/Kconfig                    |  1 +
 arch/arc/include/asm/cachetype.h    |  9 +
 arch/arm/Kconfig                    |  1 +
 arch/arm/include/asm/cachetype.h    |  2 ++
 arch/csky/Kconfig                   |  1 +
 arch/csky/include/asm/cachetype.h   |  9 +
 arch/m68k/Kconfig                   |  1 +
 arch/m68k/include/asm/cachetype.h   |  9 +
 arch/mips/Kconfig                   |  1 +
 arch/mips/include/asm/cachetype.h   |  9 +
 arch/nios2/Kconfig                  |  1 +
 arch/nios2/include/asm/cachetype.h  | 10 ++
 arch/parisc/Kconfig                 |  1 +
 arch/parisc/include/asm/cachetype.h |  9 +
 arch/sh/Kconfig                     |  1 +
 arch/sh/include/asm/cachetype.h     |  9 +
 arch/sparc/Kconfig                  |  1 +
 arch/sparc/include/asm/cachetype.h  | 14 ++
 arch/xtensa/Kconfig                 |  1 +
 arch/xtensa/include/asm/cachetype.h | 10 ++
 fs/Kconfig                          |  1 -
 fs/erofs/super.c                    |  5 -
 fs/ext2/super.c                     |  6 +-
 fs/ext4/super.c                     |  5 -
 fs/fuse/dax.c                       |  7 +++
 fs/xfs/xfs_iops.c                   |  2 +-
 include/linux/cacheinfo.h           |  6 ++
 include/linux/dax.h                 |  9 +
 mm/Kconfig                          |  6 ++
 29 files changed, 142 insertions(+), 5 deletions(-)
 create mode 100644 arch/arc/include/asm/cachetype.h
 create mode 100644 arch/csky/include/asm/cachetype.h
 create mode 100644 arch/m68k/include/asm/cachetype.h
 create mode 100644 arch/mips/include/asm/cachetype.h
 create mode 100644 arch/nios2/include/asm/cachetype.h
 create mode 100644 arch/parisc/include/asm/cachetype.h
 create mode 100644 arch/sh/include/asm/cachetype.h
 create mode 100644 arch/sparc/include/asm/cachetype.h
 create mode 100644 arch/xtensa/include/asm/cachetype.h

-- 
2.39.2
[RFC PATCH v2 1/8] dax: Introduce dax_is_supported()
Introduce a new dax_is_supported() static inline to check whether the architecture supports DAX. This replaces the following fs/Kconfig:FS_DAX dependency:

  depends on !(ARM || MIPS || SPARC)

This is done in preparation for its use by each filesystem supporting the dax mount option to validate whether dax is indeed supported. This is done in preparation for using dcache_is_aliasing() in a following change which will properly support architectures which detect dcache aliasing at runtime.

Fixes: d92576f1167c ("dax: does not work correctly with virtual aliasing caches")
Signed-off-by: Mathieu Desnoyers
Cc: Andrew Morton
Cc: Linus Torvalds
Cc: linux...@kvack.org
Cc: linux-a...@vger.kernel.org
Cc: Dan Williams
Cc: Vishal Verma
Cc: Dave Jiang
Cc: Matthew Wilcox
Cc: Arnd Bergmann
Cc: Russell King
Cc: nvd...@lists.linux.dev
Cc: linux-...@vger.kernel.org
Cc: linux-fsde...@vger.kernel.org
---
 fs/Kconfig          |  1 -
 include/linux/dax.h | 10 ++
 2 files changed, 10 insertions(+), 1 deletion(-)

diff --git a/fs/Kconfig b/fs/Kconfig
index 42837617a55b..e5efdb3b276b 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -56,7 +56,6 @@ endif # BLOCK
 config FS_DAX
	bool "File system based Direct Access (DAX) support"
	depends on MMU
-	depends on !(ARM || MIPS || SPARC)
	depends on ZONE_DEVICE || FS_DAX_LIMITED
	select FS_IOMAP
	select DAX
diff --git a/include/linux/dax.h b/include/linux/dax.h
index b463502b16e1..cfc8cd4a3eae 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -78,6 +78,12 @@ static inline bool daxdev_mapping_supported(struct vm_area_struct *vma,
		return false;
	return dax_synchronous(dax_dev);
 }
+static inline bool dax_is_supported(void)
+{
+	return !IS_ENABLED(CONFIG_ARM) &&
+		!IS_ENABLED(CONFIG_MIPS) &&
+		!IS_ENABLED(CONFIG_SPARC);
+}
 #else
 static inline void *dax_holder(struct dax_device *dax_dev)
 {
@@ -122,6 +128,10 @@ static inline size_t dax_recovery_write(struct dax_device *dax_dev,
 {
	return 0;
 }
+static inline bool dax_is_supported(void)
+{
+	return false;
+}
 #endif

 void set_dax_nocache(struct dax_device *dax_dev);
-- 
2.39.2
Re: [RFC PATCH 4/7] ext2: Use dax_is_supported()
On 2024-01-30 06:33, Jan Kara wrote: On Mon 29-01-24 16:06:28, Mathieu Desnoyers wrote: Use dax_is_supported() to validate whether the architecture has virtually aliased caches at mount time. This is relevant for architectures which require a dynamic check to validate whether they have virtually aliased data caches (ARCH_HAS_CACHE_ALIASING_DYNAMIC=y). Fixes: d92576f1167c ("dax: does not work correctly with virtual aliasing caches") Signed-off-by: Mathieu Desnoyers Cc: Jan Kara Cc: linux-e...@vger.kernel.org Cc: Andrew Morton Cc: Linus Torvalds Cc: linux...@kvack.org Cc: linux-a...@vger.kernel.org Cc: Dan Williams Cc: Vishal Verma Cc: Dave Jiang Cc: Matthew Wilcox Cc: nvd...@lists.linux.dev Cc: linux-...@vger.kernel.org Looks good to me (although I share Dave's opinion it would be nice to CC the whole series to fsdevel - luckily we have lore these days so it is not that tedious to find the whole series :)). Feel free to add: Acked-by: Jan Kara Hi Jan, Thanks for looking at it! I will do significant changes for v2, so I will hold on before integrating your acked-by if it's OK with you. Thanks! Mathieu Honza

---
 fs/ext2/super.c | 14 +++---
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/fs/ext2/super.c b/fs/ext2/super.c
index 01f9addc8b1f..0398e7a90eb6 100644
--- a/fs/ext2/super.c
+++ b/fs/ext2/super.c
@@ -585,13 +585,13 @@ static int parse_options(char *options, struct super_block *sb,
			set_opt(opts->s_mount_opt, XIP);
		fallthrough;
	case Opt_dax:
-#ifdef CONFIG_FS_DAX
-		ext2_msg(sb, KERN_WARNING,
-			"DAX enabled. Warning: EXPERIMENTAL, use at your own risk");
-		set_opt(opts->s_mount_opt, DAX);
-#else
-		ext2_msg(sb, KERN_INFO, "dax option not supported");
-#endif
+		if (dax_is_supported()) {
+			ext2_msg(sb, KERN_WARNING,
+				 "DAX enabled. Warning: EXPERIMENTAL, use at your own risk");
+			set_opt(opts->s_mount_opt, DAX);
+		} else {
+			ext2_msg(sb, KERN_INFO, "dax option not supported");
+		}
		break;
 #if defined(CONFIG_QUOTA)
-- 
2.39.2

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
Re: [RFC PATCH 7/7] xfs: Use dax_is_supported()
On 2024-01-29 21:38, Dave Chinner wrote: On Mon, Jan 29, 2024 at 04:06:31PM -0500, Mathieu Desnoyers wrote: Use dax_is_supported() to validate whether the architecture has virtually aliased caches at mount time. This is relevant for architectures which require a dynamic check to validate whether they have virtually aliased data caches (ARCH_HAS_CACHE_ALIASING_DYNAMIC=y). Where's the rest of this patchset? I have no idea what dax_is_supported() actually does, how it interacts with CONFIG_FS_DAX, etc. If you are changing anything to do with FSDAX, then cc-ing the -entire- patchset to linux-fsdevel is absolutely necessary so the entire patchset lands in our inboxes and not just a random patch from the middle of a bigger change. Sorry, I will Cc linux-fsdevel on all patches for the next round. Meanwhile you can find the whole series on lore: https://lore.kernel.org/lkml/20240129210631.193493-1-mathieu.desnoy...@efficios.com/ [...] Assuming that I understand what dax_is_supported() is doing, this change isn't right. We're just setting the DAX configuration flags from the mount options here, we don't validate them until we've parsed all options and eliminated conflicts and rejected conflicting options. We validate whether the options are appropriate for the underlying hardware configuration later in the mount process. dax=always suitability is checked in xfs_setup_dax_always(), called later in the mount process when we have enough context and support to open storage devices and check them for DAX support. If the hardware does not support DAX then we simply turn off DAX support, we do not reject the mount as this change does. dax=inode and dax=never are valid options on all configurations, even those without FSDAX support or with hardware that is not capable of using DAX. dax=inode only affects how an inode is instantiated in cache - if the inode has a flag that says "use DAX" and DAX is supportable by the hardware, then we turn on DAX for that inode. 
Otherwise we just use the normal non-dax IO paths. Again, we don't error out the filesystem if DAX is not supported, we just don't turn it on. This check is done in xfs_inode_should_enable_dax() and I think all you need to do is replace the IS_ENABLED(CONFIG_FS_DAX) with a dax_is_supported() call... Thanks a lot for the detailed explanation. You are right, I will move the dax_is_supported() check to xfs_inode_should_enable_dax(). Mathieu -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com
Re: [RFC PATCH 0/7] Introduce cache_is_aliasing() to fix DAX regression
On 2024-01-29 16:22, Dan Williams wrote: Mathieu Desnoyers wrote: This commit introduced in v5.13 prevents building FS_DAX on 32-bit ARM, even on ARMv7 which does not have virtually aliased dcaches: commit d92576f1167c ("dax: does not work correctly with virtual aliasing caches") It used to work fine before: I have customers using dax over pmem on ARMv7, but this regression will likely prevent them from upgrading their kernel. The root of the issue here is the fact that DAX was never designed to handle virtually aliased dcache (VIVT and VIPT with aliased dcache). It touches the pages through their linear mapping, which is not consistent with the userspace mappings on virtually aliased dcaches. This patch series introduces cache_is_aliasing() with new Kconfig options: * ARCH_HAS_CACHE_ALIASING * ARCH_HAS_CACHE_ALIASING_DYNAMIC and implements it for all architectures. The "DYNAMIC" implementation implements cache_is_aliasing() as a runtime check, which is what is needed on architectures like 32-bit ARMV6 and ARMV6K. With this we can basically narrow down the list of architectures which are unsupported by DAX to those which are really affected. Feedback is welcome, Hi Mathieu, this looks good overall, just some quibbling about the ordering. Thanks for having a look ! I would introduce dax_is_supported() with the current overly broad interpretation of "!(ARM || MIPS || SPARC)" using IS_ENABLED(), then fixup the filesystems to use the new helper, and finally go back and convert dax_is_supported() to use cache_is_aliasing() internally. Will do. Separately, it is not clear to me why ARCH_HAS_CACHE_ALIASING_DYNAMIC needs to exist. As long as all paths that care are calling cache_is_aliasing() then whether it is dynamic or not is something only the compiler cares about. If those dynamic archs do not want to pay the .text size increase they can always do CONFIG_FS_DAX=n, right? Good point. It will help reduce complexity and improve test coverage. 
I also intend to rename "cache_is_aliasing()" to "dcache_is_aliasing()", so if we introduce an "icache_is_aliasing()" in the future, it won't be confusing. Having aliasing icache-dcache but not dcache-dcache seems to be fairly common. So basically: If an arch selects ARCH_HAS_CACHE_ALIASING, it implements dcache_is_aliasing() (for now), and eventually we can implement icache_is_aliasing() as well. Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com
[RFC PATCH 5/7] ext4: Use dax_is_supported()
Use dax_is_supported() to validate whether the architecture has virtually aliased caches at mount time. This is relevant for architectures which require a dynamic check to validate whether they have virtually aliased data caches (ARCH_HAS_CACHE_ALIASING_DYNAMIC=y). Fixes: d92576f1167c ("dax: does not work correctly with virtual aliasing caches") Signed-off-by: Mathieu Desnoyers Cc: "Theodore Ts'o" Cc: Andreas Dilger Cc: linux-e...@vger.kernel.org Cc: Andrew Morton Cc: Linus Torvalds Cc: linux...@kvack.org Cc: linux-a...@vger.kernel.org Cc: Dan Williams Cc: Vishal Verma Cc: Dave Jiang Cc: Matthew Wilcox Cc: nvd...@lists.linux.dev Cc: linux-...@vger.kernel.org --- fs/ext4/super.c | 52 - 1 file changed, 25 insertions(+), 27 deletions(-) diff --git a/fs/ext4/super.c b/fs/ext4/super.c index c5fcf377ab1f..9e0606289239 100644 --- a/fs/ext4/super.c +++ b/fs/ext4/super.c @@ -2359,34 +2359,32 @@ static int ext4_parse_param(struct fs_context *fc, struct fs_parameter *param) return ext4_parse_test_dummy_encryption(param, ctx); case Opt_dax: case Opt_dax_type: -#ifdef CONFIG_FS_DAX - { - int type = (token == Opt_dax) ? - Opt_dax : result.uint_32; - - switch (type) { - case Opt_dax: - case Opt_dax_always: - ctx_set_mount_opt(ctx, EXT4_MOUNT_DAX_ALWAYS); - ctx_clear_mount_opt2(ctx, EXT4_MOUNT2_DAX_NEVER); - break; - case Opt_dax_never: - ctx_set_mount_opt2(ctx, EXT4_MOUNT2_DAX_NEVER); - ctx_clear_mount_opt(ctx, EXT4_MOUNT_DAX_ALWAYS); - break; - case Opt_dax_inode: - ctx_clear_mount_opt(ctx, EXT4_MOUNT_DAX_ALWAYS); - ctx_clear_mount_opt2(ctx, EXT4_MOUNT2_DAX_NEVER); - /* Strictly for printing options */ - ctx_set_mount_opt2(ctx, EXT4_MOUNT2_DAX_INODE); - break; + if (dax_is_supported()) { + int type = (token == Opt_dax) ? 
+ Opt_dax : result.uint_32; + + switch (type) { + case Opt_dax: + case Opt_dax_always: + ctx_set_mount_opt(ctx, EXT4_MOUNT_DAX_ALWAYS); + ctx_clear_mount_opt2(ctx, EXT4_MOUNT2_DAX_NEVER); + break; + case Opt_dax_never: + ctx_set_mount_opt2(ctx, EXT4_MOUNT2_DAX_NEVER); + ctx_clear_mount_opt(ctx, EXT4_MOUNT_DAX_ALWAYS); + break; + case Opt_dax_inode: + ctx_clear_mount_opt(ctx, EXT4_MOUNT_DAX_ALWAYS); + ctx_clear_mount_opt2(ctx, EXT4_MOUNT2_DAX_NEVER); + /* Strictly for printing options */ + ctx_set_mount_opt2(ctx, EXT4_MOUNT2_DAX_INODE); + break; + } + return 0; + } else { + ext4_msg(NULL, KERN_INFO, "dax option not supported"); + return -EINVAL; } - return 0; - } -#else - ext4_msg(NULL, KERN_INFO, "dax option not supported"); - return -EINVAL; -#endif case Opt_data_err: if (result.uint_32 == Opt_data_err_abort) ctx_set_mount_opt(ctx, m->mount_opt); -- 2.39.2
[RFC PATCH 7/7] xfs: Use dax_is_supported()
Use dax_is_supported() to validate whether the architecture has virtually aliased caches at mount time. This is relevant for architectures which require a dynamic check to validate whether they have virtually aliased data caches (ARCH_HAS_CACHE_ALIASING_DYNAMIC=y). Fixes: d92576f1167c ("dax: does not work correctly with virtual aliasing caches") Signed-off-by: Mathieu Desnoyers Cc: Chandan Babu R Cc: Darrick J. Wong Cc: linux-...@vger.kernel.org Cc: Andrew Morton Cc: Linus Torvalds Cc: linux...@kvack.org Cc: linux-a...@vger.kernel.org Cc: Dan Williams Cc: Vishal Verma Cc: Dave Jiang Cc: Matthew Wilcox Cc: nvd...@lists.linux.dev Cc: linux-...@vger.kernel.org --- fs/xfs/xfs_super.c | 20 ++-- 1 file changed, 14 insertions(+), 6 deletions(-) diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c index 764304595e8b..b27ecb11db66 100644 --- a/fs/xfs/xfs_super.c +++ b/fs/xfs/xfs_super.c @@ -1376,14 +1376,22 @@ xfs_fs_parse_param( case Opt_nodiscard: parsing_mp->m_features &= ~XFS_FEAT_DISCARD; return 0; -#ifdef CONFIG_FS_DAX case Opt_dax: - xfs_mount_set_dax_mode(parsing_mp, XFS_DAX_ALWAYS); - return 0; + if (dax_is_supported()) { + xfs_mount_set_dax_mode(parsing_mp, XFS_DAX_ALWAYS); + return 0; + } else { + xfs_warn(parsing_mp, "dax option not supported."); + return -EINVAL; + } case Opt_dax_enum: - xfs_mount_set_dax_mode(parsing_mp, result.uint_32); - return 0; -#endif + if (dax_is_supported()) { + xfs_mount_set_dax_mode(parsing_mp, result.uint_32); + return 0; + } else { + xfs_warn(parsing_mp, "dax option not supported."); + return -EINVAL; + } /* Following mount options will be removed in September 2025 */ case Opt_ikeep: xfs_fs_warn_deprecated(fc, param, XFS_FEAT_IKEEP, true); -- 2.39.2
[RFC PATCH 6/7] fuse: Introduce fuse_dax_is_supported()
Use dax_is_supported() in addition to IS_ENABLED(CONFIG_FUSE_DAX) to validate whether CONFIG_FUSE_DAX is enabled and the architecture does not have virtually aliased caches. This is relevant for architectures which require a dynamic check to validate whether they have virtually aliased data caches (ARCH_HAS_CACHE_ALIASING_DYNAMIC=y). Fixes: d92576f1167c ("dax: does not work correctly with virtual aliasing caches") Signed-off-by: Mathieu Desnoyers Cc: Miklos Szeredi Cc: linux-fsde...@vger.kernel.org Cc: Andrew Morton Cc: Linus Torvalds Cc: linux...@kvack.org Cc: linux-a...@vger.kernel.org Cc: Dan Williams Cc: Vishal Verma Cc: Dave Jiang Cc: Matthew Wilcox Cc: nvd...@lists.linux.dev Cc: linux-...@vger.kernel.org --- fs/fuse/file.c | 2 +- fs/fuse/fuse_i.h| 36 +- fs/fuse/inode.c | 47 +++-- fs/fuse/virtio_fs.c | 4 ++-- 4 files changed, 62 insertions(+), 27 deletions(-) diff --git a/fs/fuse/file.c b/fs/fuse/file.c index a660f1f21540..133ac8524064 100644 --- a/fs/fuse/file.c +++ b/fs/fuse/file.c @@ -3247,6 +3247,6 @@ void fuse_init_file_inode(struct inode *inode, unsigned int flags) init_waitqueue_head(&fi->page_waitq); fi->writepages = RB_ROOT; - if (IS_ENABLED(CONFIG_FUSE_DAX)) + if (fuse_dax_is_supported()) fuse_dax_inode_init(inode, flags); } diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h index 1df83eebda92..1cbe37106669 100644 --- a/fs/fuse/fuse_i.h +++ b/fs/fuse/fuse_i.h @@ -31,6 +31,7 @@ #include #include #include +#include <linux/dax.h> /** Default max number of pages that can be used in a single read request */ #define FUSE_DEFAULT_MAX_PAGES_PER_REQ 32 @@ -979,6 +980,38 @@ static inline void fuse_sync_bucket_dec(struct fuse_sync_bucket *bucket) rcu_read_unlock(); } +#ifdef CONFIG_FUSE_DAX +static inline struct fuse_inode_dax *fuse_inode_get_dax(struct fuse_inode *inode) +{ + return inode->dax; +} + +static inline enum fuse_dax_mode fuse_conn_get_dax_mode(struct fuse_conn *fc) +{ + return fc->dax_mode; +} + +static inline struct fuse_conn_dax *fuse_conn_get_dax(struct fuse_conn
*fc) +{ + return fc->dax; +} +#else +static inline struct fuse_inode_dax *fuse_inode_get_dax(struct fuse_inode *inode) +{ + return NULL; +} + +static inline enum fuse_dax_mode fuse_conn_get_dax_mode(struct fuse_conn *fc) +{ + return FUSE_DAX_INODE_DEFAULT; +} + +static inline struct fuse_conn_dax *fuse_conn_get_dax(struct fuse_conn *fc) +{ + return NULL; +} +#endif + /** Device operations */ extern const struct file_operations fuse_dev_operations; @@ -1324,7 +1357,8 @@ void fuse_free_conn(struct fuse_conn *fc); /* dax.c */ -#define FUSE_IS_DAX(inode) (IS_ENABLED(CONFIG_FUSE_DAX) && IS_DAX(inode)) +#define fuse_dax_is_supported() (IS_ENABLED(CONFIG_FUSE_DAX) && dax_is_supported()) +#define FUSE_IS_DAX(inode) (fuse_dax_is_supported() && IS_DAX(inode)) ssize_t fuse_dax_read_iter(struct kiocb *iocb, struct iov_iter *to); ssize_t fuse_dax_write_iter(struct kiocb *iocb, struct iov_iter *from); diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c index 2a6d44f91729..030e6ce5486d 100644 --- a/fs/fuse/inode.c +++ b/fs/fuse/inode.c @@ -108,7 +108,7 @@ static struct inode *fuse_alloc_inode(struct super_block *sb) if (!fi->forget) goto out_free; - if (IS_ENABLED(CONFIG_FUSE_DAX) && !fuse_dax_inode_alloc(sb, fi)) + if (fuse_dax_is_supported() && !fuse_dax_inode_alloc(sb, fi)) goto out_free_forget; return &fi->inode; @@ -126,9 +126,8 @@ static void fuse_free_inode(struct inode *inode) mutex_destroy(&fi->mutex); kfree(fi->forget); -#ifdef CONFIG_FUSE_DAX - kfree(fi->dax); -#endif + if (fuse_dax_is_supported()) + kfree(fuse_inode_get_dax(fi)); kmem_cache_free(fuse_inode_cachep, fi); } @@ -361,7 +360,7 @@ void fuse_change_attributes(struct inode *inode, struct fuse_attr *attr, invalidate_inode_pages2(inode->i_mapping); } - if (IS_ENABLED(CONFIG_FUSE_DAX)) + if (fuse_dax_is_supported()) fuse_dax_dontcache(inode, attr->flags); } @@ -856,14 +855,16 @@ static int fuse_show_options(struct seq_file *m, struct dentry *root) if (sb->s_bdev && sb->s_blocksize != FUSE_DEFAULT_BLKSIZE) seq_printf(m,
",blksize=%lu", sb->s_blocksize); } -#ifdef CONFIG_FUSE_DAX - if (fc->dax_mode == FUSE_DAX_ALWAYS) - seq_puts(m, ",dax=always"); - else if (fc->dax_mode == FUSE_DAX_NEVER) - seq_puts(m, ",dax=never"); - else if (fc->dax_mode == FUSE_DAX_INODE_USER) - seq_puts(m, ",dax=inode"); -#endif + if
[RFC PATCH 2/7] dax: Fix incorrect list of cache aliasing architectures
fs/Kconfig:FS_DAX prevents DAX from building on architectures with virtually aliased dcache with: depends on !(ARM || MIPS || SPARC) This check is too broad (e.g. recent ARMv7 don't have virtually aliased dcaches), and also misses many other architectures with virtually aliased dcache. This is a regression introduced in the v5.13 Linux kernel where the dax mount option is removed for 32-bit ARMv7 boards which have no dcache aliasing, and therefore should work fine with FS_DAX. Use this instead in Kconfig to prevent FS_DAX from being built on architectures with virtually aliased dcache: depends on !ARCH_HAS_CACHE_ALIASING For architectures which detect dcache aliasing at runtime, introduce a new dax_is_supported() static inline which uses "cache_is_aliasing()" to figure out whether the environment has aliasing dcaches. This new dax_is_supported() helper will be used in each filesystem supporting the dax mount option to validate whether dax is indeed supported. Fixes: d92576f1167c ("dax: does not work correctly with virtual aliasing caches") Signed-off-by: Mathieu Desnoyers Cc: Andrew Morton Cc: Linus Torvalds Cc: linux...@kvack.org Cc: linux-a...@vger.kernel.org Cc: Dan Williams Cc: Vishal Verma Cc: Dave Jiang Cc: Matthew Wilcox Cc: nvd...@lists.linux.dev Cc: linux-...@vger.kernel.org --- fs/Kconfig | 2 +- include/linux/dax.h | 9 + 2 files changed, 10 insertions(+), 1 deletion(-) diff --git a/fs/Kconfig b/fs/Kconfig index 42837617a55b..6746fe403761 100644 --- a/fs/Kconfig +++ b/fs/Kconfig @@ -56,7 +56,7 @@ endif # BLOCK config FS_DAX bool "File system based Direct Access (DAX) support" depends on MMU - depends on !(ARM || MIPS || SPARC) + depends on !ARCH_HAS_CACHE_ALIASING depends on ZONE_DEVICE || FS_DAX_LIMITED select FS_IOMAP select DAX diff --git a/include/linux/dax.h b/include/linux/dax.h index b463502b16e1..8c595b04deeb 100644 --- a/include/linux/dax.h +++ b/include/linux/dax.h @@ -5,6 +5,7 @@ #include #include #include +#include <linux/cacheinfo.h> typedef unsigned long
dax_entry_t; @@ -78,6 +79,10 @@ static inline bool daxdev_mapping_supported(struct vm_area_struct *vma, return false; return dax_synchronous(dax_dev); } +static inline bool dax_is_supported(void) +{ + return !cache_is_aliasing(); +} #else static inline void *dax_holder(struct dax_device *dax_dev) { @@ -122,6 +127,10 @@ static inline size_t dax_recovery_write(struct dax_device *dax_dev, { return 0; } +static inline bool dax_is_supported(void) +{ + return false; +} #endif void set_dax_nocache(struct dax_device *dax_dev); -- 2.39.2
[RFC PATCH 4/7] ext2: Use dax_is_supported()
Use dax_is_supported() to validate whether the architecture has virtually aliased caches at mount time. This is relevant for architectures which require a dynamic check to validate whether they have virtually aliased data caches (ARCH_HAS_CACHE_ALIASING_DYNAMIC=y). Fixes: d92576f1167c ("dax: does not work correctly with virtual aliasing caches") Signed-off-by: Mathieu Desnoyers Cc: Jan Kara Cc: linux-e...@vger.kernel.org Cc: Andrew Morton Cc: Linus Torvalds Cc: linux...@kvack.org Cc: linux-a...@vger.kernel.org Cc: Dan Williams Cc: Vishal Verma Cc: Dave Jiang Cc: Matthew Wilcox Cc: nvd...@lists.linux.dev Cc: linux-...@vger.kernel.org --- fs/ext2/super.c | 14 +++--- 1 file changed, 7 insertions(+), 7 deletions(-) diff --git a/fs/ext2/super.c b/fs/ext2/super.c index 01f9addc8b1f..0398e7a90eb6 100644 --- a/fs/ext2/super.c +++ b/fs/ext2/super.c @@ -585,13 +585,13 @@ static int parse_options(char *options, struct super_block *sb, set_opt(opts->s_mount_opt, XIP); fallthrough; case Opt_dax: -#ifdef CONFIG_FS_DAX - ext2_msg(sb, KERN_WARNING, - "DAX enabled. Warning: EXPERIMENTAL, use at your own risk"); - set_opt(opts->s_mount_opt, DAX); -#else - ext2_msg(sb, KERN_INFO, "dax option not supported"); -#endif + if (dax_is_supported()) { + ext2_msg(sb, KERN_WARNING, +"DAX enabled. Warning: EXPERIMENTAL, use at your own risk"); + set_opt(opts->s_mount_opt, DAX); + } else { + ext2_msg(sb, KERN_INFO, "dax option not supported"); + } break; #if defined(CONFIG_QUOTA) -- 2.39.2
[RFC PATCH 1/7] Introduce cache_is_aliasing() across all architectures
Introduce a generic way to query whether the dcache is virtually aliased on all architectures. Its purpose is to ensure that subsystems which are incompatible with virtually aliased caches (e.g. FS_DAX) can reliably query this. For dcache aliasing, there are three scenarios depending on the architecture. Here is a breakdown based on my understanding: A) The dcache is always aliasing: (ARCH_HAS_CACHE_ALIASING=y) * arm V4, V5 (CPU_CACHE_VIVT) * arc * csky * m68k (note: shared memory mappings are incoherent ? SHMLBA is missing there.) * sh * parisc B) The dcache aliasing depends on querying CPU state at runtime: (ARCH_HAS_CACHE_ALIASING_DYNAMIC=y) * arm V6, V6K (CPU_CACHE_VIPT) (cache_is_vipt_aliasing()) * mips (cpu_has_dc_aliases) * nios2 (NIOS2_DCACHE_SIZE > PAGE_SIZE) * sparc32 (vac_cache_size > PAGE_SIZE) * sparc64 (L1DCACHE_SIZE > PAGE_SIZE) * xtensa (DCACHE_WAY_SIZE > PAGE_SIZE) C) The dcache is never aliasing: * arm V7, V7M (unless ARM V6 or V6K are also present) (CPU_CACHE_VIPT) * alpha * arm64 (aarch64) * hexagon * loongarch (but with incoherent write buffers, which are disabled since commit d23b7795 ("LoongArch: Change SHMLBA from SZ_64K to PAGE_SIZE")) * microblaze * openrisc * powerpc * riscv * s390 * um * x86 Link: https://lore.kernel.org/lkml/20030910210416.ga24...@mail.jlokier.co.uk/ Signed-off-by: Mathieu Desnoyers Cc: Andrew Morton Cc: Linus Torvalds Cc: linux...@kvack.org Cc: linux-a...@vger.kernel.org Cc: Dan Williams Cc: Vishal Verma Cc: Dave Jiang Cc: Matthew Wilcox Cc: linux-...@vger.kernel.org Cc: nvd...@lists.linux.dev --- arch/arc/Kconfig| 1 + arch/arm/include/asm/cachetype.h| 3 +++ arch/arm/mm/Kconfig | 2 ++ arch/csky/Kconfig | 1 + arch/m68k/Kconfig | 1 + arch/mips/Kconfig | 1 + arch/mips/include/asm/cachetype.h | 9 + arch/nios2/Kconfig | 1 + arch/nios2/include/asm/cachetype.h | 10 ++ arch/parisc/Kconfig | 1 + arch/sh/Kconfig | 1 + arch/sparc/Kconfig | 1 + arch/sparc/include/asm/cachetype.h | 14 ++ arch/xtensa/Kconfig | 1 +
arch/xtensa/include/asm/cachetype.h | 10 ++ include/linux/cacheinfo.h | 8 mm/Kconfig | 10 ++ 17 files changed, 75 insertions(+) create mode 100644 arch/mips/include/asm/cachetype.h create mode 100644 arch/nios2/include/asm/cachetype.h create mode 100644 arch/sparc/include/asm/cachetype.h create mode 100644 arch/xtensa/include/asm/cachetype.h diff --git a/arch/arc/Kconfig b/arch/arc/Kconfig index 1b0483c51cc1..969e6740bcf7 100644 --- a/arch/arc/Kconfig +++ b/arch/arc/Kconfig @@ -6,6 +6,7 @@ config ARC def_bool y select ARC_TIMERS + select ARCH_HAS_CACHE_ALIASING select ARCH_HAS_CACHE_LINE_SIZE select ARCH_HAS_DEBUG_VM_PGTABLE select ARCH_HAS_DMA_PREP_COHERENT diff --git a/arch/arm/include/asm/cachetype.h b/arch/arm/include/asm/cachetype.h index e8c30430be33..b03054b35c74 100644 --- a/arch/arm/include/asm/cachetype.h +++ b/arch/arm/include/asm/cachetype.h @@ -16,6 +16,9 @@ extern unsigned int cacheid; #define cache_is_vipt()cacheid_is(CACHEID_VIPT) #define cache_is_vipt_nonaliasing()cacheid_is(CACHEID_VIPT_NONALIASING) #define cache_is_vipt_aliasing() cacheid_is(CACHEID_VIPT_ALIASING) +#ifdef CONFIG_ARCH_HAS_CACHE_ALIASING_DYNAMIC +#define cache_is_aliasing()cache_is_vipt_aliasing() +#endif #define icache_is_vivt_asid_tagged() cacheid_is(CACHEID_ASID_TAGGED) #define icache_is_vipt_aliasing() cacheid_is(CACHEID_VIPT_I_ALIASING) #define icache_is_pipt() cacheid_is(CACHEID_PIPT) diff --git a/arch/arm/mm/Kconfig b/arch/arm/mm/Kconfig index c164cde50243..23af93cdc03d 100644 --- a/arch/arm/mm/Kconfig +++ b/arch/arm/mm/Kconfig @@ -539,9 +539,11 @@ config CPU_CACHE_NOP bool config CPU_CACHE_VIVT + select ARCH_HAS_CACHE_ALIASING bool config CPU_CACHE_VIPT + select ARCH_HAS_CACHE_ALIASING_DYNAMIC if CPU_V6 || CPU_V6K bool config CPU_CACHE_FA diff --git a/arch/csky/Kconfig b/arch/csky/Kconfig index cf2a6fd7dff8..439d7640deb8 100644 --- a/arch/csky/Kconfig +++ b/arch/csky/Kconfig @@ -2,6 +2,7 @@ config CSKY def_bool y select ARCH_32BIT_OFF_T + select ARCH_HAS_CACHE_ALIASING 
select ARCH_HAS_DMA_PREP_COHERENT select ARCH_HAS_GCOV_PROFILE_ALL select ARCH_HAS_SYNC_DMA_FOR_CPU diff --git a/arch/m68k/Kconfig b/arch/m68k/Kconfig index 4b3e93cac723..216338704f0a 100644 --- a/arch/m68k/Kconfig +++ b/arch/m68k/Kconfig @@ -3,6 +3,7 @@ config M68K bool default y select ARCH_32BIT_OFF_T + select ARCH_HAS_CACHE_ALIASING select ARCH_HAS_BINFMT_FLAT select ARCH_HAS_CP
[RFC PATCH 0/7] Introduce cache_is_aliasing() to fix DAX regression
This commit introduced in v5.13 prevents building FS_DAX on 32-bit ARM, even on ARMv7 which does not have virtually aliased dcaches: commit d92576f1167c ("dax: does not work correctly with virtual aliasing caches") It used to work fine before: I have customers using dax over pmem on ARMv7, but this regression will likely prevent them from upgrading their kernel. The root of the issue here is the fact that DAX was never designed to handle virtually aliased dcache (VIVT and VIPT with aliased dcache). It touches the pages through their linear mapping, which is not consistent with the userspace mappings on virtually aliased dcaches. This patch series introduces cache_is_aliasing() with new Kconfig options: * ARCH_HAS_CACHE_ALIASING * ARCH_HAS_CACHE_ALIASING_DYNAMIC and implements it for all architectures. The "DYNAMIC" implementation implements cache_is_aliasing() as a runtime check, which is what is needed on architectures like 32-bit ARMV6 and ARMV6K. With this we can basically narrow down the list of architectures which are unsupported by DAX to those which are really affected. 
Feedback is welcome, Thanks, Mathieu Cc: Andrew Morton Cc: Linus Torvalds Cc: linux...@kvack.org Cc: linux-a...@vger.kernel.org Cc: Dan Williams Cc: Vishal Verma Cc: Dave Jiang Cc: Matthew Wilcox Cc: linux-...@vger.kernel.org Cc: nvd...@lists.linux.dev Mathieu Desnoyers (7): Introduce cache_is_aliasing() across all architectures dax: Fix incorrect list of cache aliasing architectures erofs: Use dax_is_supported() ext2: Use dax_is_supported() ext4: Use dax_is_supported() fuse: Introduce fuse_dax_is_supported() xfs: Use dax_is_supported() arch/arc/Kconfig| 1 + arch/arm/include/asm/cachetype.h| 3 ++ arch/arm/mm/Kconfig | 2 ++ arch/csky/Kconfig | 1 + arch/m68k/Kconfig | 1 + arch/mips/Kconfig | 1 + arch/mips/include/asm/cachetype.h | 9 + arch/nios2/Kconfig | 1 + arch/nios2/include/asm/cachetype.h | 10 ++ arch/parisc/Kconfig | 1 + arch/sh/Kconfig | 1 + arch/sparc/Kconfig | 1 + arch/sparc/include/asm/cachetype.h | 14 arch/xtensa/Kconfig | 1 + arch/xtensa/include/asm/cachetype.h | 10 ++ fs/Kconfig | 2 +- fs/erofs/super.c| 10 +++--- fs/ext2/super.c | 14 fs/ext4/super.c | 52 ++--- fs/fuse/file.c | 2 +- fs/fuse/fuse_i.h| 36 +++- fs/fuse/inode.c | 47 +- fs/fuse/virtio_fs.c | 4 +-- fs/xfs/xfs_super.c | 20 +++ include/linux/cacheinfo.h | 8 + include/linux/dax.h | 9 + mm/Kconfig | 10 ++ 27 files changed, 198 insertions(+), 73 deletions(-) create mode 100644 arch/mips/include/asm/cachetype.h create mode 100644 arch/nios2/include/asm/cachetype.h create mode 100644 arch/sparc/include/asm/cachetype.h create mode 100644 arch/xtensa/include/asm/cachetype.h -- 2.39.2
Re: [RESEND] [PATCH] eventfs: Have inodes have unique inode numbers
On 2024-01-26 15:12, Steven Rostedt wrote: [...] diff --git a/fs/tracefs/inode.c b/fs/tracefs/inode.c index e1b172c0e091..2187be6d7b23 100644 --- a/fs/tracefs/inode.c +++ b/fs/tracefs/inode.c @@ -223,13 +223,41 @@ static const struct inode_operations tracefs_file_inode_operations = { .setattr= tracefs_setattr, }; +/* Copied from get_next_ino() but adds allocation for multiple inodes */ +#define LAST_INO_BATCH 1024 +#define LAST_INO_MASK (~(LAST_INO_BATCH - 1)) +static DEFINE_PER_CPU(unsigned int, last_ino); + +unsigned int tracefs_get_next_ino(int files) +{ + unsigned int *p = &get_cpu_var(last_ino); + unsigned int res = *p; + +#ifdef CONFIG_SMP + /* Check if adding files+1 overflows */ How does it handle a @files input where: * (files+1 > LAST_INO_BATCH) ? * (files+1 == LAST_INO_BATCH) ? + if (unlikely(!res || (res & LAST_INO_MASK) != ((res + files + 1) & LAST_INO_MASK))) { + static atomic_t shared_last_ino; + int next = atomic_add_return(LAST_INO_BATCH, &shared_last_ino); + + res = next - LAST_INO_BATCH; + } +#endif + + res++; + /* get_next_ino should not provide a 0 inode number */ + if (unlikely(!res)) + res++; I suspect that bumping this res++ in the 0 case can cause inode range reservation issues at (files+1 == LAST_INO_BATCH-1). Thanks, Mathieu + *p = res + files; + put_cpu_var(last_ino); + return res; +} -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com
[REGRESSION] v5.13: FS_DAX unavailable on 32-bit ARM
Hi, This commit introduced in v5.13 prevents building FS_DAX on 32-bit ARM, even on ARMv7 which does not have virtually aliased dcaches: commit d92576f1167c ("dax: does not work correctly with virtual aliasing caches") It used to work fine before: I have customers using dax over pmem on ARMv7, but this regression will likely prevent them from upgrading their kernel. The root of the issue here is the fact that DAX was never designed to handle virtually aliased dcache (VIVT and VIPT with aliased dcache). It touches the pages through their linear mapping, which is not consistent with the userspace mappings on virtually aliased dcaches. I can see a few ways forward to address this: A) I have prepared a patch series introducing cache_is_aliasing() with new Kconfig options: * ARCH_HAS_CACHE_ALIASING * ARCH_HAS_CACHE_ALIASING_DYNAMIC and implemented it for all architectures. The "DYNAMIC" implementation implements cache_is_aliasing() as a runtime check, which is what is needed on architectures like 32-bit ARM. With this we can basically narrow down the list of architectures which are unsupported by DAX to those which are really affected, without actually solving the issue for architectures with virtually aliased dcaches. B) Another approach would be to dig into what exactly DAX is doing with the linear mapping, and try to fix this. I see two options there: B.1) Either we extend vmap to allow vmap'd pages to be aligned on specific multiples, and use a coloring trick based on SHMLBA like userspace mappings do for all DAX internal pages accesses, or B.2) We introduce flush_dcache_folio() at relevant spots (perhaps dax_flush() ?) to synchronize the linear mapping wrt userspace mappings. (would this work ?) Any thoughts on how to best move forward with this issue are welcome. Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com
Re: [PATCH] ring-buffer: Simplify reservation with try_cmpxchg() loop
On 2024-01-20 08:47, Steven Rostedt wrote: On Fri, 19 Jan 2024 20:49:36 -0500 Mathieu Desnoyers wrote: Let's say we have the following ktime_get() values (monotonic timestamp value) for a sequence of events: Timestamp (Hex) Encoding in the trace Packet header timestamp begin 0x12345678 64-bit: 0x12345678 Event 1 0x12345678 16-bit: 0x5678 (When decoded, same value as previous timestamp, no overflow) Event 2 0x1234FFFF 16-bit: 0xFFFF (When decoded, going from "0x5678" to "0xFFFF" does not overflow 16-bit) Event 3 0x12350000 16-bit: 0x0000 (When decoded, going from "0xFFFF" to "0x0000" overflows 16-bit exactly once, which allows the trace reader to reconstruct timestamp 0x12350000 from the previous timestamp and the 16-bit timestamp encoding.) Event 4 0x12370000 64-bit: 0x12370000 (Encoding over 16-bit not possible because going from 0x12350000 to 0x12370000 would overflow 16-bit twice, which cannot be detected by a trace reader. Therefore use the full 64-bit timestamp in the "large" event header representation.) I think that's basically what I said, but you are just looking at it differently ;-) Or should I say, you are using bits for optimization. Based on your explanation below, we are really talking about different things here. Let me try to reply to your explanation to try to show where what I am doing completely differs from what you have in mind. This will help explain how I handle 16-bit overflow as well. The events are based off of the last injected timestamp. Incorrect. There is no "injected timestamp". There is only a concept of the "current timestamp" as we either write to or read from the event stream. I will take the point of view of the trace reader for the rest of the discussion. The above example starts with a timestamp injection into the packet header: 0x12345678, with the lsb 16 bits ignored. Wrong again. The 16 least significant bits are not ignored. The "current timestamp" is really 0x12345678 when the packet header is read.
In the packet header you have 0x12345678 in the first event you have 0x5678 how does that get you the timestamp? If that event had 0xFFFF, when the reader reads this packet, it would take the header 0x12345678 chop off (ignore) the 5678, and add the FFFF, right? We need to consider not only what happens when the 16 low bits increase, but also what happens when they end up with a value smaller than the previous 16 low bits. As a summary from our video meeting discussion: There are 3 cases we care about here: packet header timestamp: 0x12345678 followed by either: A) first event delta from packet header timestamp is 0: 16-bit value 0x5678 B) first event delta from packet header timestamp is <= 0xFFFF: B.1) 16-bit value example 0x5699 (no 16-bit overflow from previous value) B.2) 16-bit value example 0x0000 (exactly one 16-bit overflow from previous value) C) first event delta from packet header timestamp is larger than 0xFFFF, which would cause the low-order 16 bits to have more than one 16-bit overflow from the previous value. The tracer detects this and uses a full 64-bit timestamp instead of the compact 16 bits. [...] But how do you detect the overflow? That last timestamp to know if the tsc overflowed or not needs to be saved and compared. I would assume you have a similar race that I have. Yes, I save a "last timestamp" per buffer, but the race does not matter because of the order in which it is saved wrt local cmpxchg updating the reserved position. The algorithm looks like: do { - read current reserved position (old pos) - read time - compute new reserved position (new pos) } while (cmpxchg(reserved pos, old pos, new pos) != old pos); [A] save_last_tsc() So the last_tsc that is saved is from the timestamp read before the cmpxchg. Yes. If interrupted at [A] by another trace producer, it will compare with an older "last tsc" than the tsc of the event physically located just before the nested event.
This stale "last tsc" has a value which is necessarily lower than the one we would be saving with the save_last_tsc immediately afterwards, which means in the worse case we end up using a full 64-bit timestamp when in fact we could use a more compact representation. But this race is rare and therefore it does not matter for size. That's equivalent to me "injecting" an absolute value for the same race. Yes. The fact that I only need this last_tsc value for the sake of optimization, and not for computation of a time delta from a previous injected timestamp, makes it possible to handle the race gracefully without
Re: [PATCH] ring-buffer: Simplify reservation with try_cmpxchg() loop
On 2024-01-19 16:42, Steven Rostedt wrote: On Fri, 19 Jan 2024 15:56:21 -0500 Mathieu Desnoyers wrote: So when an overflow happens, you just insert a timestamp, and then events after that are based on that? No. Let's use an example to show how it works. For reference, LTTng uses 5-bit for event ID and 27-bit for timestamps in the compact event header representation. But for the sake of making this discussion easier, let's assume a tracer would use 16-bit for timestamps in the compact representation. Let's say we have the following ktime_get() values (monotonic timestamp value) for a sequence of events: Timestamp (Hex) Encoding in the trace Packet header timestamp begin 0x12345678 64-bit: 0x12345678 Event 1 0x12345678 16-bit: 0x5678 (When decoded, same value as previous timestamp, no overflow) Event 2 0x1234FFFF 16-bit: 0xFFFF (When decoded, going from "0x5678" to "0xFFFF" does not overflow 16-bit) Event 3 0x12350000 16-bit: 0x0000 (When decoded, going from "0xFFFF" to "0x0000" overflows 16-bit exactly once, which allows the trace reader to reconstruct timestamp 0x12350000 from the previous timestamp and the 16-bit timestamp encoding.) Event 4 0x12370000 64-bit: 0x12370000 (Encoding over 16-bit not possible because going from 0x12350000 to 0x12370000 would overflow 16-bit twice, which cannot be detected by a trace reader. Therefore use the full 64-bit timestamp in the "large" event header representation.) I think that's basically what I said, but you are just looking at it differently ;-) Or should I say, you are using bits for optimization. Based on your explanation below, we are really talking about different things here. Let me try to reply to your explanation to try to show where what I am doing completely differs from what you have in mind. This will help explain how I handle 16-bit overflow as well. The events are based off of the last injected timestamp. Incorrect. There is no "injected timestamp". There is only a concept of the "current timestamp" as we either write to or read from the event stream.
I will take the point of view of the trace reader for the rest of the discussion.

The above example starts with a timestamp injection into the packet header: 0x12345678, with the lsb 16 bits ignored.

Wrong again. The 16 least significant bits are not ignored. The "current timestamp" is really 0x12345678 when the packet header is read.

So in actuality, it inserts 0x1234, plus adds a 5678 that represents the creation of the header (like we discussed about in the last tracing meeting).

There is no "0x1234" reference time. There is only the actual 0x12345678 current time at packet begin, including those 16 low-order bits.

The first event has: 0x5678 which is based on the previous injected timestamp of 0x1234.

It is not "based" on the previous injected timestamp. It just represents the low-order 16 bits of the timestamp at event 1. The trace reader takes two pieces of information to compute the complete current timestamp for event 1:

1) The current timestamp just before event 1 is read (0x12345678),
2) The low-order 16 bits of event 1 (0x5678).

Given that there is no 16-bit overflow when comparing:

  0x12345678 & 0xffff and 0x5678

We know that the resulting current timestamp for event 1 is:

  (0x12345678 & ~0xffff) + 0x5678 = 0x12345678

Or basically that time did not move between the packet header and event 1.

the events go on just using a delta from that until you see it overflow (or too big of a delta to fit in the 16 bits), and you insert a new full timestamp that everything will be based on: 0x12370000

No. The following events use the same algorithm I just described: they use this notion of "current timestamp" and the information provided by the new timestamp field in the event header to figure out the updated current timestamp. It is _never_ based on some kind of arbitrary reference, it is always absolute, and is always computed based on the current timestamp and the timestamp field encountered.

Now events following that are just a delta from that timestamp.
But you calculate the delta simply by masking out the lower bits.

No, again, there is no delta from an arbitrary injected time. It is always computed from this virtual "current time", which applies to an event stream.

But how do you detect the overflow? That last timestamp, to know whether the tsc overflowed or not, needs to be saved and compared. I would assume you have a similar race that I have.

Yes, I save a "last timestamp" per buffer, but the race does not matter because of the order in which it is saved wrt local cmpxchg up
Re: [PATCH] ring-buffer: Simplify reservation with try_cmpxchg() loop
On 2024-01-19 10:37, Steven Rostedt wrote: On Fri, 19 Jan 2024 09:40:27 -0500 Mathieu Desnoyers wrote: On 2024-01-18 18:12, Steven Rostedt wrote: From: "Steven Rostedt (Google)" [...]

Although, it does not get rid of the double time stamps (before_stamp and write_stamp), using cmpxchg() does get rid of the more complex case when an interrupting event occurs between getting the timestamps and reserving the data, as when that happens, it just tries again instead of dealing with it.

I understand that the reason why you need the before/after stamps and their associated complexity is because the Ftrace ring buffer ABI encodes event timestamps as deltas from the previous event within the buffer as a means of compressing the timestamp fields. If the delta cannot be represented in a given number of bits, then it inserts a 64-bit timestamp (not sure if that one is absolute or a delta from the previous event).

There's both. An extended timestamp, which is added when the delta is too big, and that too is just a delta from the previous event. And there is the absolute timestamp as well. I could always just use the absolute one. That event came much later.

OK

This timestamp encoding as deltas between events introduces a strong inter-dependency between consecutive (nested) events, and is the reason why you are stuck with all this timestamp before/after complexity. The Common Trace Format specifies (and LTTng implements) a different way to achieve the same ring buffer space savings achieved with timestamp deltas while keeping the timestamps semantically absolute from a given reference, hence without all the before/after timestamp complexity. You can see the clock value decoding procedure in the CTF2 SPEC RC9 [1] document.

That points to this: -8<- 6.3.
Clock value update procedure

To update DEF_CLK_VAL from an unsigned integer field F having the unsigned integer value V and the class C:

- Let L be an unsigned integer initialized to, depending on the type property of C:
    "fixed-length-unsigned-integer": the value of the length property of C.
    "variable-length-unsigned-integer": S × 7, where S is the number of bytes which F occupies within the data stream.
- Let MASK be an unsigned integer initialized to 2^L − 1.
- Let H be an unsigned integer initialized to DEF_CLK_VAL & ~MASK, where "&" is the bitwise AND operator and "~" is the bitwise NOT operator.
- Let CUR be an unsigned integer initialized to DEF_CLK_VAL & MASK, where "&" is the bitwise AND operator.
- Set DEF_CLK_VAL to:
    If V ≥ CUR: H + V
    Else: H + MASK + 1 + V
->8-

There's a lot of missing context there, so I don't see how it relates.

This explains how the "current time" is reconstructed by a trace reader when loading an event header timestamp field. But for the sake of this discussion we can focus on the less formal explanation of how the tracer generates this timestamp encoding provided below.

The basic idea on the producer side is to record the low-order bits of the current timestamp in the event header (truncating the higher-order bits), and fall back on a full 64-bit value if the number of overflows of the low-order bits since the previous timestamp is more than one, or if it is impossible to figure out precisely the timestamp of the previous event due to a race. This achieves the same space savings as delta timestamp encoding without introducing the strong event inter-dependency.

So when an overflow happens, you just insert a timestamp, and then events after that are based on that?

No. Let's use an example to show how it works. For reference, LTTng uses 5 bits for the event ID and 27 bits for timestamps in the compact event header representation. But for the sake of making this discussion easier, let's assume a tracer would use 16 bits for timestamps in the compact representation.
Let's say we have the following ktime_get() values (monotonic timestamp value) for a sequence of events:

                                 Timestamp (hex)   Encoding in the trace

 Packet header timestamp begin   0x12345678        64-bit: 0x12345678

 Event 1                         0x12345678        16-bit: 0x5678
   (When decoded, same value as previous timestamp, no overflow.)

 Event 2                         0x1234ffff        16-bit: 0xffff
   (When decoded, going from "0x5678" to "0xffff" does not overflow 16-bit.)

 Event 3                         0x12350000        16-bit: 0x0000
   (When decoded, going from "0xffff" to "0x0000" overflows 16-bit exactly once, which allows the trace reader to reconstruct timestamp 0x12350000 from the previous timestamp and the 16-bit timestamp encoding.)

 Event 4                         0x12370000        64-bit: 0x12370000
   (Encoding over 16-bit is not possible because going from 0x12350000 to 0x12370000 would overflow 16-bit twice, which cannot be detected by a trace reader. Therefore, use the full 64-bit timestamp in the "large" event header representation.)
Re: [PATCH] ring-buffer: Simplify reservation with try_cmpxchg() loop
On 2024-01-18 18:12, Steven Rostedt wrote: From: "Steven Rostedt (Google)"

Instead of using local_add_return() to reserve the ring buffer data, Mathieu Desnoyers suggested using local_cmpxchg(). This would simplify the reservation with the time keeping code.

I admire the effort of trying to simplify the Ftrace ring buffer by bringing over ideas that worked well for LTTng. :-) As reviewer of the tracing subsystem, I certainly welcome the simplifications. Although, it does not get rid of the double time stamps (before_stamp and write_stamp), using cmpxchg() does get rid of the more complex case when an interrupting event occurs between getting the timestamps and reserving the data, as when that happens, it just tries again instead of dealing with it.

I understand that the reason why you need the before/after stamps and their associated complexity is because the Ftrace ring buffer ABI encodes event timestamps as deltas from the previous event within the buffer as a means of compressing the timestamp fields. If the delta cannot be represented in a given number of bits, then it inserts a 64-bit timestamp (not sure if that one is absolute or a delta from the previous event). This timestamp encoding as deltas between events introduces a strong inter-dependency between consecutive (nested) events, and is the reason why you are stuck with all this timestamp before/after complexity.

The Common Trace Format specifies (and LTTng implements) a different way to achieve the same ring buffer space savings achieved with timestamp deltas while keeping the timestamps semantically absolute from a given reference, hence without all the before/after timestamp complexity. You can see the clock value decoding procedure in the CTF2 SPEC RC9 [1] document.
The basic idea on the producer side is to record the low-order bits of the current timestamp in the event header (truncating the higher-order bits), and fall back on a full 64-bit value if the number of overflows of the low-order bits since the previous timestamp is more than one, or if it is impossible to figure out precisely the timestamp of the previous event due to a race. This achieves the same space savings as delta timestamp encoding without introducing the strong event inter-dependency.

The fact that Ftrace exposes this ring buffer binary layout as a user-space ABI makes it tricky to move to the Common Trace Format timestamp encoding. There are clearly huge simplifications that could be made by moving to this scheme though. Is there any way to introduce a different timestamp encoding scheme as an extension to the Ftrace ring buffer ABI ? This would allow us to introduce this simpler scheme and gradually phase out the more complex delta encoding when no users are left. Thoughts ?

Thanks, Mathieu

[1] https://diamon.org/ctf/files/CTF2-SPECRC-9.0.html#clk-val-update

-- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com
Re: [PATCH v11 2/5] ring-buffer: Introducing ring-buffer mapping functions
On 2024-01-11 11:17, Vincent Donnefort wrote: In preparation for allowing the user-space to map a ring-buffer, add a set of mapping functions:

  ring_buffer_{map,unmap}()
  ring_buffer_map_fault()

And controls on the ring-buffer:

  ring_buffer_map_get_reader()  /* swap reader and head */

Mapping the ring-buffer also involves:

  - A unique ID for each subbuf of the ring-buffer, currently they are only identified through their in-kernel VA.
  - A meta-page, where are stored ring-buffer statistics and a description for the current reader.

Hi Vincent,

The LTTng kernel tracer has supported mmap'd buffers for nearly 15 years [1], and has a lot of similarities with this patch series. LTTng has the notion of "subbuffer id" to allow atomically exchanging a "reader" extra subbuffer with the subbuffer to be read. It implements "get subbuffer" / "put subbuffer" ioctls to allow the consumer (reader) to move the currently read subbuffer position. [2]

It would not hurt to compare your approach to LTTng and highlight similarities/differences, and the rationale for the differences. Especially when it comes to designing kernel ABIs, it's good to make sure that all bases are covered, because those choices will have lasting impacts.

Thanks, Mathieu

[1] https://git.lttng.org/?p=lttng-modules.git;a=blob;f=src/lib/ringbuffer/ring_buffer_mmap.c
[2] https://git.lttng.org/?p=lttng-modules.git;a=blob;f=src/lib/ringbuffer/ring_buffer_vfs.c

-- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com
Re: [PATCH] tracing: Add disable-filter-buf option
On 2023-12-15 13:43, Steven Rostedt wrote: On Fri, 15 Dec 2023 13:25:07 -0500 Mathieu Desnoyers wrote: I am not against exposing an ABI that allows userspace to alter the filter behavior. I disagree on the way you plan to expose the ABI. These are no different than the knobs for sched_debug These are very much different. The sched features are enabled at build-time by modifying kernel/sched/features.h. There is no userspace ABI involved. Exposing this option as an ABI in this way exposes too much internal ring buffer implementation details to userspace. There's already lots of exposure using options. The options are just knobs, nothing more. It exposes the following details which IMHO should be hidden or configurable in a way that allows moving to a whole new mechanism which will have significantly different characteristics in the future: It exposes that: - filtering uses a copy to a temporary buffer, and - that this copy is enabled by default. Once exposed, those design constraints become immutable due to ABI. No it is not. There is no such thing as "immutable ABI". The rule is "don't break user space" If this functionality in the kernel goes away, the knob could become a nop, and I doubt any user space will break because of it. That is, the only burden is keeping this option exposed. But it could be just like that light switch that has nothing connected to it. It's still there, but does nothing if you switch it. This knob can act the same way. This does not in anyway prevent future innovation. I am not comfortable with exposing internal ring buffer implementation details to userspace which may or may not be deprecated as no-ops in the future. This will lead to useless clutter. One approach that might be somewhat better would be to only expose those files when a new CONFIG_TRACING_DEBUG option is enabled. 
This way userspace cannot rely on having those files present as part of the ABI, but you would still be free to use them for selftests by skipping the specific selftests if the config option is not enabled. Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com
Re: [PATCH] tracing: Add disable-filter-buf option
On 2023-12-15 12:34, Steven Rostedt wrote: On Fri, 15 Dec 2023 12:24:14 -0500 Mathieu Desnoyers wrote: On 2023-12-15 12:04, Steven Rostedt wrote: On Fri, 15 Dec 2023 10:53:39 -0500 Mathieu Desnoyers wrote: [...] So rather than stacking tons of "on/off" switches for filter features, how about you let users express the mechanism they want to use for filtering with a string instead ? e.g. filter-method="temp-buffer" filter-method="ring-buffer" filter-method="input-arguments" If I add other ways to filter, it will be a separate file to control that, but all options are on/off switches. Even if I add other functionality to the way buffers are created, this will still have the same functionality to turn the entire thing on or off. I'll be clearer then: I think this is a bad ABI. In your reply, you justify this bad ABI by implementation motivations. What's wrong with a way to stop the copying ? I am not against exposing an ABI that allows userspace to alter the filter behavior. I disagree on the way you plan to expose the ABI. Exposing this option as an ABI in this way exposes too much internal ring buffer implementation details to userspace. It exposes the following details which IMHO should be hidden or configurable in a way that allows moving to a whole new mechanism which will have significantly different characteristics in the future: It exposes that: - filtering uses a copy to a temporary buffer, and - that this copy is enabled by default. Once exposed, those design constraints become immutable due to ABI. The option is just a way to say "I don't want to do the copy into the buffer, I want to go directly into it" My main concern is how this concept, once exposed to userspace, becomes not only an internal implementation detail, but a fundamental part of the design which cannot ever go away without breaking the ABI or making parts of the ABI irrelevant. 
I can make a parallel with the scheduler: this is as if the sched feature knobs (which are there only for development/debugging purposes) would all be exposed as userspace ABI. This would seriously limit the evolution of the scheduler design in the future. I see this as the same problem at the ring buffer level. I don't care about the implementation, I care about the ABI, and I feel that your reply is not addressing my concern at all. Maybe I don't understand your concern. It's an on/off switch (like all options are). This is just a way to say "I want to indirect copying of the event before filtering or not". Not all tracefs options are booleans. The "current_tracer" file ABI exposes a string input/output parameter. I would recommend the same for the equivalent of a "current_filter" file. The "input-argument" part above may never happen. What's different between tracefs and LTTng is that all events are created by the subsystem, not by me. You don't use the TRACE_EVENT() macro, but you need to manually create each event you care about yourself. It's more of a burden but you also then have the luxury of hooking to the input portion. That is not exposed, and that is by design. As that could possibly make all tracepoints an ABI, and you'll have people like peterz nacking any new tracepoint in the scheduler. He doesn't allow trace events anymore because of that exposure. I'm not arguing for moving to the input-argument scheme, I just used this as a hypothetical example to show why we should not expose internal implementation details to userspace which will prevent future evolution only for the sake of having debugging knobs. Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com
Re: [PATCH] tracing: Add disable-filter-buf option
On 2023-12-15 12:04, Steven Rostedt wrote: On Fri, 15 Dec 2023 10:53:39 -0500 Mathieu Desnoyers wrote: [...] So rather than stacking tons of "on/off" switches for filter features, how about you let users express the mechanism they want to use for filtering with a string instead ? e.g. filter-method="temp-buffer" filter-method="ring-buffer" filter-method="input-arguments" If I add other ways to filter, it will be a separate file to control that, but all options are on/off switches. Even if I add other functionality to the way buffers are created, this will still have the same functionality to turn the entire thing on or off. I'll be clearer then: I think this is a bad ABI. In your reply, you justify this bad ABI by implementation motivations. I don't care about the implementation, I care about the ABI, and I feel that your reply is not addressing my concern at all. Moreover, double-negative boolean options (disable-X=false) are confusing. Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com
Re: [PATCH] tracing: Add disable-filter-buf option
On 2023-12-15 10:26, Steven Rostedt wrote: From: "Steven Rostedt (Google)" Normally, when the filter is enabled, a temporary buffer is created to copy the event data into it to perform the filtering logic. If the filter passes and the event should be recorded, then the event is copied from the temporary buffer into the ring buffer. If the event is to be discarded then it is simply dropped. If another event comes in via an interrupt, it will not use the temporary buffer as it is busy and will write directly into the ring buffer. The disable-filter-buf option will disable the temporary buffer and always write into the ring buffer. This will avoid the copy when the event is to be recorded, but also adds a bit more overhead on the discard, and if another event were to interrupt the event that is to be discarded, then the event will not be removed from the ring buffer but instead converted to padding that will not be read by the reader. Padding will still take up space on the ring buffer. This option can be beneficial if most events are recorded and not discarded, or simply for debugging the discard functionality of the ring buffer. Also fix some whitespace (that was fixed by editing this in vscode). I'm not convinced that a boolean state is what you need here. Yes, today you are in a position where you have two options: a) use the filter buffer, which falls back on copy to ring buffer if nested, b) disable the filter buffer, and thus always copy to ring buffer before filtering, But I think it would not be far-fetched to improve the implementation of the filter-buffer to have one filter-buffer per nesting level (per execution context), or just implement the filter buffer as a per-cpu stack, which would remove the need to fall back on copy to ring buffer when nested. Or you could even do like LTTng and filter on the input arguments directly. 
But each of these changes would add yet another boolean tunable, which can quickly make the mix hard to understand from a user perspective. So rather than stacking tons of "on/off" switches for filter features, how about you let users express the mechanism they want to use for filtering with a string instead ? e.g. filter-method="temp-buffer" filter-method="ring-buffer" filter-method="input-arguments" ? Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com
Re: [PATCH] ring-buffer: Remove 32bit timestamp logic
On 2023-12-13 21:11, Steven Rostedt wrote: From: "Steven Rostedt (Google)" Each event has a 27-bit timestamp delta that is used to hold the delta from the last event. If the time between events is greater than 2^27, then a timestamp is added that holds a 59-bit absolute timestamp. Until a389d86f7fd09 ("ring-buffer: Have nested events still record running time stamp"), if an interrupt interrupted an event in progress, all the events' deltas would be zero to not deal with the races that need to be handled. The commit a389d86f7fd09 changed that to handle the races so that all events, even those that preempt other events, still have an accurate timestamp. To handle those races requires performing 64-bit cmpxchg on the timestamps. But doing 64-bit cmpxchg on 32-bit architectures is considered very slow. To try to deal with this, the timestamp logic was broken into two and then three 32-bit cmpxchgs, with the thought that two (or three) 32-bit cmpxchgs are still faster than a single 64-bit cmpxchg on 32-bit architectures. Part of the problem with this is that I didn't have any 32-bit architectures to test on. After hitting several subtle bugs in this code, an effort was made to try and see if three 32-bit cmpxchgs are indeed faster than a single 64-bit. After a few people brushed off the dust of their old 32-bit machines, tests were done, and even though 32-bit cmpxchg was faster than a single 64-bit, it was on the order of 50% at best, not 300%. I literally had to dust off my old Eee PC for this :) This means that this complex code is not only complex but also not even faster than just using 64-bit cmpxchg. Nuke it!
Acked-by: Mathieu Desnoyers @@ -3720,6 +3517,7 @@ rb_reserve_next_event(struct trace_buffer *buffer, struct rb_event_info info; int nr_loops = 0; int add_ts_default; + static int once; (as you spotted, should be removed) Thanks, Mathieu /* ring buffer does cmpxchg, make sure it is safe in NMI context */ if (!IS_ENABLED(CONFIG_ARCH_HAVE_NMI_SAFE_CMPXCHG) && -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com
Re: [PATCH] ring-buffer: Fix a race in rb_time_cmpxchg() for 32 bit archs
On 2023-12-12 11:53, Steven Rostedt wrote: From: "Steven Rostedt (Google)" Mathieu Desnoyers pointed out an issue in the rb_time_cmpxchg() for 32-bit architectures. That is:

 static bool rb_time_cmpxchg(rb_time_t *t, u64 expect, u64 set)
 {
 	unsigned long cnt, top, bottom, msb;
 	unsigned long cnt2, top2, bottom2, msb2;
 	u64 val;

 	/* The cmpxchg always fails if it interrupted an update */
 	if (!__rb_time_read(t, &val, &cnt2))
 		return false;

 	if (val != expect)
 		return false;

 <<<< interrupted here!

 	cnt = local_read(&t->cnt);

The problem is that the synchronization counter in the rb_time_t is read *after* the value of the timestamp is read. That means if an interrupt were to come in between the value being read and the counter being read, it can change the value and the counter and the interrupted process would be clueless about it! The counter needs to be read first and then the value. That way it is easy to tell if the value is stale or not. If the counter hasn't been updated, then the value is still good.

Link: https://lore.kernel.org/linux-trace-kernel/20231211201324.652870-1-mathieu.desnoy...@efficios.com/
Cc: sta...@vger.kernel.org
Fixes: 10464b4aa605e ("ring-buffer: Add rb_time_t 64 bit operations for speeding up 32 bit")
Reported-by: Mathieu Desnoyers
Signed-off-by: Steven Rostedt (Google)
Reviewed-by: Mathieu Desnoyers
---
 kernel/trace/ring_buffer.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 1d9caee7f542..e110cde685ea 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -706,6 +706,9 @@ static bool rb_time_cmpxchg(rb_time_t *t, u64 expect, u64 set)
 	unsigned long cnt2, top2, bottom2, msb2;
 	u64 val;
 
+	/* Any interruptions in this function should cause a failure */
+	cnt = local_read(&t->cnt);
+
 	/* The cmpxchg always fails if it interrupted an update */
 	if (!__rb_time_read(t, &val, &cnt2))
 		return false;
@@ -713,7 +716,6 @@ static bool rb_time_cmpxchg(rb_time_t *t, u64 expect, u64 set)
 	if (val != expect)
 		return false;
 
-	cnt = local_read(&t->cnt);
 	if ((cnt & 3) != cnt2)
 		return false;

-- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com
[PATCH] ring-buffer: Fix 32-bit rb_time_read() race with rb_time_cmpxchg()
The following race can cause rb_time_read() to observe a corrupted time stamp:

 rb_time_cmpxchg()
 [...]
         if (!rb_time_read_cmpxchg(&t->msb, msb, msb2))
                 return false;
         if (!rb_time_read_cmpxchg(&t->top, top, top2))
                 return false;
 __rb_time_read()
 [...]
         do {
                 c = local_read(&t->cnt);
                 top = local_read(&t->top);
                 bottom = local_read(&t->bottom);
                 msb = local_read(&t->msb);
         } while (c != local_read(&t->cnt));

         *cnt = rb_time_cnt(top);

         /* If top and msb counts don't match, this interrupted a write */
         if (*cnt != rb_time_cnt(msb))
                 return false;
 ^ this check fails to catch that "bottom" is still not updated.

So the old "bottom" value is returned, which is wrong.

Fix this by checking that all three of the msb, top, and bottom 2-bit cnt values match. The reason to favor checking all three fields over requiring a specific update order for both rb_time_set() and rb_time_cmpxchg() is because checking all three fields is more robust to handle partial failures of rb_time_cmpxchg() when interrupted by nested rb_time_set().

Link: https://lore.kernel.org/lkml/20231211201324.652870-1-mathieu.desnoy...@efficios.com/
Fixes: f458a1453424e ("ring-buffer: Test last update in 32bit version of __rb_time_read()")
Signed-off-by: Mathieu Desnoyers
Cc: Steven Rostedt
Cc: Masami Hiramatsu
Cc: linux-trace-ker...@vger.kernel.org
---
 kernel/trace/ring_buffer.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 8d2a4f00eca9..71c225ca2a2b 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -644,8 +644,8 @@ static inline bool __rb_time_read(rb_time_t *t, u64 *ret, unsigned long *cnt)
 
 	*cnt = rb_time_cnt(top);
 
-	/* If top and msb counts don't match, this interrupted a write */
-	if (*cnt != rb_time_cnt(msb))
+	/* If top, msb or bottom counts don't match, this interrupted a write */
+	if (*cnt != rb_time_cnt(msb) || *cnt != rb_time_cnt(bottom))
 		return false;
 
 	/* The shift to msb will lose its cnt bits */
--
2.39.2
Re: [RFC PATCH] ring-buffer: Fix and comment ring buffer rb_time functions on 32-bit
On 2023-12-11 23:38, Steven Rostedt wrote: On Mon, 11 Dec 2023 22:51:04 -0500 Mathieu Desnoyers wrote: [...]

For this first issue, here is the race:

 rb_time_cmpxchg()
 [...]
         if (!rb_time_read_cmpxchg(&t->msb, msb, msb2))
                 return false;
         if (!rb_time_read_cmpxchg(&t->top, top, top2))
                 return false;
 __rb_time_read()
 [...]
         do {
                 c = local_read(&t->cnt);
                 top = local_read(&t->top);
                 bottom = local_read(&t->bottom);
                 msb = local_read(&t->msb);
         } while (c != local_read(&t->cnt));

         *cnt = rb_time_cnt(top);

         /* If top and msb counts don't match, this interrupted a write */
         if (*cnt != rb_time_cnt(msb))
                 return false;
 ^ this check fails to catch that "bottom" is still not updated. So the old "bottom" value is returned, which is wrong.

Ah, OK that makes more sense. Yeah, if I had the three words from the beginning, I would have tested to make sure they all match and not just the two :-p

Technically, just checking that the very first and last words which are updated by set/cmpxchg have the same cnt bits would suffice. Because this is just a scenario of __rb_time_read() interrupting an update, the updates in between are fine if the first/last words to be updated have the same cnt.

As this would fix a commit that tried to fix this before! f458a1453424e ("ring-buffer: Test last update in 32bit version of __rb_time_read()") FYI, that would be the "Fixes" for this patch.

OK

[...]

- A cmpxchg interrupted by 4 writes or cmpxchg overflows the counter and produces corrupted time stamps. This is _not_ fixed by this patch.

Except that it's not 4 bits that is compared, but 32 bits.

 struct rb_time_struct {
 	local_t cnt;
 	local_t top;
 	local_t bottom;
 	local_t msb;
 };

The full local_t (32 bits) is used for synchronization. But the other elements do get extra bits and there still might be some issues, but not as severe as you stated here.

Let's bring up the race scenario I spotted:

 rb_time_cmpxchg()
 [...]
 	/* The cmpxchg always fails if it interrupted an update */
 	if (!__rb_time_read(t, &val, &cnt2))
 		return false;

 	if (val != expect)
 		return false;

 	cnt = local_read(&t->cnt);
 	if ((cnt & 3) != cnt2)
 		return false;
 ^ here (cnt & 3) == cnt2, but @val contains outdated data. This means the piecewise rb_time_read_cmpxchg() that follow will derive expected values from the outdated @val.

Ah. Of course this would be fixed if we did the local_read(&t->cnt) *before* everything else.

But then we could be interrupted after that initial read, before reading val. I suspect we'd want to propagate the full 32-bit cnt that was read by __rb_time_read() to the caller, which is not the case today. With that, we would not have to read it again in rb_time_cmpxchg(). It does leave the issue of having only 2 bits in the msb, top, bottom fields to detect races, which are subject to overflow.

 	cnt2 = cnt + 1;

 	rb_time_split(val, &top, &bottom, &msb);
 	top = rb_time_val_cnt(top, cnt);
 	bottom = rb_time_val_cnt(bottom, cnt);
 ^ top, bottom, and msb contain outdated data, which do not match cnt due to 2-bit overflow.

 	rb_time_split(set, &top2, &bottom2, &msb2);
 	top2 = rb_time_val_cnt(top2, cnt2);
 	bottom2 = rb_time_val_cnt(bottom2, cnt2);

 	if (!rb_time_read_cmpxchg(&t->cnt, cnt, cnt2))
 		return false;
 ^ This cmpxchg on @cnt succeeds because it uses the re-read cnt as expected value.

Sure.

And I believe you did find another bug. If we read the cnt first, before reading val, then it would not be outdated.

As stated above, I suspect we'd run into other issues if interrupted between the read of cnt and reading val. Propagating the full 32-bit cnt value read from __rb_time_read() to the caller would be better I think.

 	if (!rb_time_read_cmpxchg(&t->msb, msb, msb2))
 		return false;
 	if (!rb_time_read_cmpxchg(&t->top, top, top2))
 		return false;
 	if (!rb_time_read_cmpxchg(&t->bottom, bottom, bottom2))
 		return false;
 ^ these cmpxchg have just used the outdated @val as expected values, even though the content of the rb_time was modified by 4 consecutive rb_time_set() or rb_time_cmpxchg(). This means those cmpxchg can fail not only due to being interrupted by another write or cmpxchg, but also simply due to expected value mismatch in any of the fields, which will then cause

Yes, it is expected that this will fail for being i
Re: [PATCH v2] tracing: Allow for max buffer data size trace_marker writes
On 2023-12-12 09:00, Steven Rostedt wrote: [...]

 --- a/kernel/trace/trace.c
 +++ b/kernel/trace/trace.c
 @@ -7272,6 +7272,7 @@ tracing_mark_write(struct file *filp, const char __user *ubuf,
  	enum event_trigger_type tt = ETT_NONE;
  	struct trace_buffer *buffer;
  	struct print_entry *entry;
 +	int meta_size;
  	ssize_t written;
  	int size;
  	int len;
 @@ -7286,12 +7287,9 @@ tracing_mark_write(struct file *filp, const char __user *ubuf,
  	if (!(tr->trace_flags & TRACE_ITER_MARKERS))
  		return -EINVAL;
 
 -	if (cnt > TRACE_BUF_SIZE)
 -		cnt = TRACE_BUF_SIZE;

You're removing an early bound check for a size_t userspace input...

 -
 -	BUILD_BUG_ON(TRACE_BUF_SIZE >= PAGE_SIZE);
 -
 -	size = sizeof(*entry) + cnt + 2; /* add '\0' and possible '\n' */
 +	meta_size = sizeof(*entry) + 2; /* add '\0' and possible '\n' */
 +again:
 +	size = cnt + meta_size;

... and then implicitly casting it into an "int" size variable, which can therefore become a negative value. Just for the sake of not having to rely on ring_buffer_lock_reserve catching (length > BUF_MAX_DATA_SIZE), I would recommend adding an early check for negative here.

  	/* If less than "<faulted>", then make sure we can still add that */
  	if (cnt < FAULTED_SIZE)
 @@ -7300,9 +7298,25 @@ tracing_mark_write(struct file *filp, const char __user *ubuf,
  	buffer = tr->array_buffer.buffer;
  	event = __trace_buffer_lock_reserve(buffer, TRACE_PRINT, size,
  					    tracing_gen_ctx());
 -	if (unlikely(!event))
 +	if (unlikely(!event)) {
 +		/*
 +		 * If the size was greated than what was allowed, then

greater ?

Thanks, Mathieu

-- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com
Re: [PATCH] tracing: Add size check when printing trace_marker output
On 2023-12-12 08:44, Steven Rostedt wrote: From: "Steven Rostedt (Google)" If for some reason the trace_marker write does not have a nul byte for the string, it will overflow the print: Does this result in leaking kernel memory to userspace ? If so, it should state "Fixes..." and CC stable. Thanks, Mathieu trace_seq_printf(s, ": %s", field->buf); The field->buf could be missing the nul byte. To prevent overflow, add the max size that the buf can be by using the event size and the field location. int max = iter->ent_size - offsetof(struct print_entry, buf); trace_seq_printf(s, ": %*s", max, field->buf); Signed-off-by: Steven Rostedt (Google) --- kernel/trace/trace_output.c | 6 -- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/kernel/trace/trace_output.c b/kernel/trace/trace_output.c index d8b302d01083..e11fb8996286 100644 --- a/kernel/trace/trace_output.c +++ b/kernel/trace/trace_output.c @@ -1587,11 +1587,12 @@ static enum print_line_t trace_print_print(struct trace_iterator *iter, { struct print_entry *field; struct trace_seq *s = &iter->seq; + int max = iter->ent_size - offsetof(struct print_entry, buf); trace_assign_type(field, iter->ent); seq_print_ip_sym(s, field->ip, flags); - trace_seq_printf(s, ": %s", field->buf); + trace_seq_printf(s, ": %*s", max, field->buf); return trace_handle_return(s); } @@ -1600,10 +1601,11 @@ static enum print_line_t trace_print_raw(struct trace_iterator *iter, int flags, struct trace_event *event) { struct print_entry *field; + int max = iter->ent_size - offsetof(struct print_entry, buf); trace_assign_type(field, iter->ent); - trace_seq_printf(&iter->seq, "# %lx %s", field->ip, field->buf); + trace_seq_printf(&iter->seq, "# %lx %*s", field->ip, max, field->buf); return trace_handle_return(&iter->seq); } -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com
Re: [RFC PATCH] ring-buffer: Fix and comment ring buffer rb_time functions on 32-bit
On 2023-12-11 17:59, Steven Rostedt wrote: On Mon, 11 Dec 2023 15:13:24 -0500 Mathieu Desnoyers wrote: Going through a review of the ring buffer rb_time functions for 32-bit architectures, I updated the comments to match the code, and identified the following issues: Thanks Mathieu! - rb_time_cmpxchg() needs to update the msb last, so it matches the validation of top and msb by __rb_time_read(). This is fixed by this patch. Hmm, does it? This is not parallel programming, it's only protecting against interrupts. Understood, this is indeed the model I had in mind during my review (nested interruption only from local cpu) because preemption is disabled around use of the ring buffer. For this first issue, here is the race: rb_time_cmpxchg() [...] if (!rb_time_read_cmpxchg(&t->msb, msb, msb2)) return false; if (!rb_time_read_cmpxchg(&t->top, top, top2)) return false; __rb_time_read() [...] do { c = local_read(&t->cnt); top = local_read(&t->top); bottom = local_read(&t->bottom); msb = local_read(&t->msb); } while (c != local_read(&t->cnt)); *cnt = rb_time_cnt(top); /* If top and msb counts don't match, this interrupted a write */ if (*cnt != rb_time_cnt(msb)) return false; ^ this check fails to catch that "bottom" is still not updated. So the old "bottom" value is returned, which is wrong. BTW, it's best not to have a fix like this with a comment change this big, as the comment change is highly likely to cause conflicts in any backport. I wanted to start the discussion without having a N-patches series, but I agree that this first fix should be split into a separate patch. Although, for consistency, I wonder if everything else should be changed to go: bottom, top, msb as it would match the order of the bits, as msb has the highest order, top the next, and bottom the least. Doing it as: top, bottom, msb seems out of order. I did that because msb was an after thought :-/ Agreed on the order change, but I suspect this would belong to another patch (not the fix).
- A cmpxchg interrupted by 4 writes or cmpxchg overflows the counter and produces corrupted time stamps. This is _not_ fixed by this patch. Except that it's not 4 bits that is compared, but 32 bits. struct rb_time_struct { local_t cnt; local_t top; local_t bottom; local_t msb; }; The full local_t (32 bits) is used for synchronization. But the other elements do get extra bits and there still might be some issues, but not as severe as you stated here. Let's bring up the race scenario I spotted: rb_time_cmpxchg() [...] /* The cmpxchg always fails if it interrupted an update */ if (!__rb_time_read(t, &val, &cnt2)) return false; if (val != expect) return false; cnt = local_read(&t->cnt); if ((cnt & 3) != cnt2) return false; ^ here (cnt & 3) == cnt2, but @val contains outdated data. This means the piecewise rb_time_read_cmpxchg() that follow will derive expected values from the outdated @val. cnt2 = cnt + 1; rb_time_split(val, &top, &bottom, &msb); top = rb_time_val_cnt(top, cnt); bottom = rb_time_val_cnt(bottom, cnt); ^ top, bottom, and msb contain outdated data, which do not match cnt due to 2-bit overflow. rb_time_split(set, &top2, &bottom2, &msb2); top2 = rb_time_val_cnt(top2, cnt2); bottom2 = rb_time_val_cnt(bottom2, cnt2); if (!rb_time_read_cmpxchg(&t->cnt, cnt, cnt2)) return false; ^ This @cnt cmpxchg succeeds because the re-read cnt is used as the expected value. if (!rb_time_read_cmpxchg(&t->msb, msb, msb2)) return false; if (!rb_time_read_cmpxchg(&t->top, top, top2)) return false; if (!rb_time_read_cmpxchg(&t->bottom, bottom, bottom2)) return false; ^ these cmpxchg have just used the outdated @val as expected values, even though the content of the rb_time was modified by 4 consecutive rb_time_set() or rb_time_cmpxchg(). This means those cmpxchg can fail not only due to being interrupted by another write or cmpxchg, but also simply due to expected value mismatch in any of the fields, which will then cause following __rb_time_read() to fail until a rb_time_set() is done.
return true; So this overflow scenario on top of cmpxchg does not cause corrupted time stamps, but does cause subsequent __rb_time_read() and rb_time_cmpxchg() to fail until an eventual rb_time_set(). Although, I should also change this to be: struct rb_time_struct { local_t
[RFC PATCH] ring-buffer: Fix and comment ring buffer rb_time functions on 32-bit
Going through a review of the ring buffer rb_time functions for 32-bit architectures, I updated the comments to match the code, and identified the following issues: - rb_time_cmpxchg() needs to update the msb last, so it matches the validation of top and msb by __rb_time_read(). This is fixed by this patch. - A cmpxchg interrupted by 4 writes or cmpxchg overflows the counter and produces corrupted time stamps. This is _not_ fixed by this patch. - After a cmpxchg fails between updates to top and msb, a write is needed before read and cmpxchg can succeed again. I am not entirely sure the rest of the ring buffer handles this correctly. Signed-off-by: Mathieu Desnoyers Cc: Steven Rostedt Cc: Masami Hiramatsu Cc: linux-trace-ker...@vger.kernel.org --- kernel/trace/ring_buffer.c | 64 +++--- 1 file changed, 46 insertions(+), 18 deletions(-) diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c index 8d2a4f00eca9..f6ed699947cd 100644 --- a/kernel/trace/ring_buffer.c +++ b/kernel/trace/ring_buffer.c @@ -576,34 +576,50 @@ struct ring_buffer_iter { #ifdef RB_TIME_32 /* - * On 32 bit machines, local64_t is very expensive. As the ring - * buffer doesn't need all the features of a true 64 bit atomic, - * on 32 bit, it uses these functions (64 still uses local64_t). + * On 32-bit machines, local64_t is very expensive. As the ring + * buffer doesn't need all the features of a true 64-bit atomic, + * on 32-bit, it uses these functions (64-bit still uses local64_t). * - * For the ring buffer, 64 bit required operations for the time is - * the following: + * For the ring buffer, the operations required to manipulate 64-bit + * time stamps are the following: * - * - Reads may fail if it interrupted a modification of the time stamp. + * - Read may fail if it interrupted a modification of the time stamp. * It will succeed if it did not interrupt another write even if * the read itself is interrupted by a write. 
+ * A read will fail if it follows a cmpxchg which failed between + * updates to its top and msb bits, until a write is performed. + * (note: this limitation may be unexpected in parts of the + * ring buffer algorithm) * It returns whether it was successful or not. * - * - Writes always succeed and will overwrite other writes and writes + * - Write always succeeds and will overwrite other writes and writes * that were done by events interrupting the current write. * * - A write followed by a read of the same time stamp will always succeed, * but may not contain the same value. * * - A cmpxchg will fail if it interrupted another write or cmpxchg. + * A cmpxchg will fail if it follows a cmpxchg which failed between + * updates to its top and msb bits, until a write is performed. + * (note: this limitation may be unexpected in parts of the + * ring buffer algorithm) * Other than that, it acts like a normal cmpxchg. * - * The 60 bit time stamp is broken up by 30 bits in a top and bottom half - * (bottom being the least significant 30 bits of the 60 bit time stamp). + * The 64-bit time stamp is broken up, from most to least significant, + * in: msb, top and bottom fields, of respectively 4, 30, and 30 bits. * - * The two most significant bits of each half holds a 2 bit counter (0-3). + * The two most significant bits of each field hold a 2-bit counter (0-3). * Each update will increment this counter by one. - * When reading the top and bottom, if the two counter bits match then the - * top and bottom together make a valid 60 bit number. + * When reading the top, bottom, and msb fields, if the two counter bits + * match, then the combined values make a valid 64-bit number. + * + * Counter limits. The following situations can generate overflows that + * produce corrupted time stamps: + * + * - A read or a write interrupted by 2^32 writes or cmpxchg. + * + * - A cmpxchg interrupted by 4 writes or cmpxchg. 
+ * (note: this is not sufficient and should be fixed) */ #define RB_TIME_SHIFT 30 #define RB_TIME_VAL_MASK ((1 << RB_TIME_SHIFT) - 1) @@ -632,7 +648,7 @@ static inline bool __rb_time_read(rb_time_t *t, u64 *ret, unsigned long *cnt) /* * If the read is interrupted by a write, then the cnt will -* be different. Loop until both top and bottom have been read +* be different. Loop until top, bottom and msb have been read * without interruption. */ do { @@ -644,7 +660,12 @@ static inline bool __rb_time_read(rb_time_t *t, u64 *ret, unsigned long *cnt) *cnt = rb_time_cnt(top); - /* If top and msb counts don't match, this interrupted a write */ + /* +* If top and msb counts don't match, this either interrupted a +* write or follows a failed cmpxchg. +* This requires the update to bottom to be enclosed between +* updates
Re: [PATCH] tracing: Allow for max buffer data size trace_marker writes
On 2023-12-10 11:38, Steven Rostedt wrote: On Sun, 10 Dec 2023 11:07:22 -0500 Mathieu Desnoyers wrote: It just allows more to be written in one go. I don't see why the tests need to cover this or detect this change. If the purpose of this change is to ensure that the entire trace marker payload is shown within a single event, then there should be a test which exercises this, and which validates that the end result is that the entire payload is indeed shown within a single event record. No, the purpose of the change is not to do that, because there can always be a bigger trace marker write than a single event can hold. This is the way it has always worked. This is an optimization or "enhancement". The 1KB restriction was actually because of a previous implementation years ago (before selftests even existed) that wrote into a temp buffer before copying into the ring buffer. But since we now can copy directly into the ring buffer, there's no reason not to use the maximum that the ring buffer can accept. My point is that the difference between the new "enhanced" behavior and the previous behavior is not tested for. Otherwise there is no permanent validation that this change indeed does what it is intended to do, so it can regress at any time without any test noticing it. What regress? The amount of a trace_marker write that can make it into the buffer in one go? Now, I agree that we should have a test to make sure that all of the trace marker write gets into the buffer. Yes. This is pretty much my point. But it's always been allowed to break up that write however it wanted to. And the enhanced behavior extends the amount of data that can get written into a single sub-buffer, and this is not tested. Note, because different architectures have different page sizes, how much that can make it in one go is architecture dependent. So you can have a "regression" by simply running your application on a different architecture.
Which is why, in the following patches you have, expressing the subbuffer size as bytes rather than pages is important at the ABI level. It facilitates portability of tests, and decreases documentation / user burden. Again, it's not a requirement, it's just an enhancement. How does this have anything to do with dispensing from testing the new behavior ? If the new behavior has a bug that causes it to silently truncate the trace marker payloads, how do you catch it with the current tests ? Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com
Re: [PATCH] tracing: Allow for max buffer data size trace_marker writes
On 2023-12-10 10:30, Steven Rostedt wrote: On Sun, 10 Dec 2023 09:09:06 -0500 Mathieu Desnoyers wrote: On 2023-12-09 17:50, Steven Rostedt wrote: From: "Steven Rostedt (Google)" Allow a trace write to be as big as the ring buffer tracing data will allow. Currently, it only allows writes of 1KB in size, but there's no reason that it cannot allow what the ring buffer can hold. I would expect the tests under tools/testing/selftests/ftrace/* to be affected by those changes. If they are not, then it's a hole in the test coverage and should be taken care of as this behavior is modified. Before this change we had: ~# echo -n '12345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456 
7890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123 4567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890 
1234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567 89012345678901234567890123456789012345678901234567890123456789012345678901234' > /sys/kernel/tracing/trace_marker ~# cat /sys/kernel/tracing/trace # tracer: nop # # entries-in-buffer/entries-written: 4/4 #P:8 # #_-=> irqs-off/BH-disabled # / _---
Re: [PATCH 13/14] ring-buffer: Add documentation on the buffer_subbuf_order file
On 2023-12-09 22:54, Steven Rostedt wrote: [...] + buffer_subbuf_order: + + This sets or displays the sub buffer page size order. The ring buffer + is broken up into several same size "sub buffers". An event can not be + bigger than the size of the sub buffer. Normally, the sub buffer is + the size of the architecture's page (4K on x86). The sub buffer also + contains meta data at the start which also limits the size of an event. + That means when the sub buffer is a page size, no event can be larger + than the page size minus the sub buffer meta data. The fact that the user ABI documentation for this tracer parameter needs to dig into details about architecture page size is a good indicator that this ABI is not at the abstraction level it should be (pages vs bytes). Thanks, Mathieu + + The buffer_subbuf_order allows the user to change the size of the sub + buffer. As the sub buffer is a set of pages by the power of 2, thus + the sub buffer total size is defined by the order: + + order size + + 0 PAGE_SIZE + 1 PAGE_SIZE * 2 + 2 PAGE_SIZE * 4 + 3 PAGE_SIZE * 8 + + Changing the order will change the sub buffer size allowing for events + to be larger than the page size. + + Note: When changing the order, tracing is stopped and any data in the + ring buffer and the snapshot buffer will be discarded. + free_buffer: If a process is performing tracing, and the ring buffer should be -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com
Re: [PATCH 14/14] ringbuffer/selftest: Add basic selftest to test chaning subbuf order
On 2023-12-09 22:54, Steven Rostedt wrote: [...]
+get_buffer_data_size() {
+	sed -ne 's/^.*data.*size:\([0-9][0-9]*\).*/\1/p' events/header_page
+}
+
+a="1234567890"
+
+make_str() {
+	cnt=$1
+	s=""
+	while [ $cnt -gt 10 ]; do
+		s="${s}${a}"
+		cnt=$((cnt-10))
+	done
+	while [ $cnt -gt 0 ]; do
+		s="${s}X"
+		cnt=$((cnt-1))
+	done
+	echo -n $s
+}
+
+test_buffer() {
+
+	size=`get_buffer_data_size`
+
+	str=`make_str $size`
+
+	echo $str > trace_marker
+
+	grep -q $a trace
This test has no clue if the record was truncated or not. It basically repeats the string "1234567890" until it fills the subbuffer size and pads with "X" as needed as trace marker payload, but the grep looks for the "1234567890" pattern only. The test should be extended to validate whether the trace marker payload was truncated or not, otherwise it is of limited value. Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com
Re: [PATCH 00/14] ring-buffer/tracing: Allow ring buffer to have bigger sub buffers
On 2023-12-09 22:54, Steven Rostedt wrote: [...] Basically, events to the tracing subsystem are limited to just under a PAGE_SIZE, as the ring buffer is split into "sub buffers" of one page size, and an event can not be bigger than a sub buffer. This allows users to change the size of a sub buffer by the order: echo 3 > /sys/kernel/tracing/buffer_subbuf_order Will make each sub buffer a size of 8 pages, allowing events to be almost as big as 8 pages in size (sub buffers do have meta data on them as well, keeping an event from reaching the same size as a sub buffer). Specifying the "order" of subbuffer size as a power of two of number of pages is a poor UX choice for a user-facing ABI. I would recommend allowing the user to specify the size in bytes, and internally bump the size to the next power of 2, with a minimum of PAGE_SIZE. Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com
Re: [PATCH] tracing: Have large events show up as '[LINE TOO BIG]' instead of nothing
On 2023-12-09 17:10, Steven Rostedt wrote: [...] <...>-852 [001] . 121.550551: tracing_mark_write[LINE TOO BIG] <...>-852 [001] . 121.550581: tracing_mark_write: 78901234 Failing to print an entire message because it does not fit in the buffer size is rather inconvenient. It would be better to print the partial line, and end the line with a tag. Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com
Re: [PATCH] tracing: Allow for max buffer data size trace_marker writes
On 2023-12-09 17:50, Steven Rostedt wrote: From: "Steven Rostedt (Google)" Allow a trace write to be as big as the ring buffer tracing data will allow. Currently, it only allows writes of 1KB in size, but there's no reason that it cannot allow what the ring buffer can hold. I would expect the tests under tools/testing/selftests/ftrace/* to be affected by those changes. If they are not, then it's a hole in the test coverage and should be taken care of as this behavior is modified. Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com
Re: [PATCH] ring-buffer: Add offset of events in dump on mismatch
On 2023-12-07 17:16, Steven Rostedt wrote: [...] diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c index 8d2a4f00eca9..b10deb8a5647 100644 --- a/kernel/trace/ring_buffer.c +++ b/kernel/trace/ring_buffer.c @@ -3424,23 +3424,27 @@ static void dump_buffer_page(struct buffer_data_page *bpage, case RINGBUF_TYPE_TIME_EXTEND: delta = rb_event_time_stamp(event); ts += delta; - pr_warn(" [%lld] delta:%lld TIME EXTEND\n", ts, delta); + pr_warn(" %x: [%lld] delta:%lld TIME EXTEND\n", Please prefix hex values with "0x", as in: pr_warn(" 0x%x: [%lld] delta:%lld TIME EXTEND\n" Otherwise it can be confusing. Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com
[PATCH] MAINTAINERS: TRACING: Add Mathieu Desnoyers as Reviewer
In order to make sure I get CC'd on tracing changes for which my input would be relevant, add my name as reviewer of the TRACING subsystem. Signed-off-by: Mathieu Desnoyers Cc: Steven Rostedt Cc: Masami Hiramatsu Cc: linux-trace-ker...@vger.kernel.org --- MAINTAINERS | 1 + 1 file changed, 1 insertion(+) diff --git a/MAINTAINERS b/MAINTAINERS index 4cc6bf79fdd8..a7c2092d0063 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -21622,6 +21622,7 @@ F: drivers/hwmon/pmbus/tps546d24.c TRACING M: Steven Rostedt M: Masami Hiramatsu +R: Mathieu Desnoyers L: linux-kernel@vger.kernel.org L: linux-trace-ker...@vger.kernel.org S: Maintained -- 2.39.2
Re: [PATCH RFC] selftests/rseq: fix kselftest Clang build warnings
On 9/8/23 19:03, Justin Stitt wrote: Hi, I am experiencing many warnings when trying to build tools/testing/selftests. Here's one such example from rseq tree: | param_test.c:1234:10: error: address argument to atomic operation must be a pointer to _Atomic type ('intptr_t *' (aka 'long *') invalid) | 1234 | while (!atomic_load(&args->percpu_list_ptr)) {} || ^ ~~ | /usr/local/google/home/justinstitt/repos/tc-build/build/llvm/final/lib/clang/18/include/stdatomic.h:140:29: note: expanded from macro 'atomic_load' |140 | #define atomic_load(object) __c11_atomic_load(object, __ATOMIC_SEQ_CST) || ^ ~~ I added the _Atomic type in various locations to silence _all_ (10) of these warnings. I'm wondering, though, perhaps the absence of these _Atomic types in the first place is on purpose? Am I on the right track to fix these warnings without damaging the legitimacy of the tests at hand? I'd like some feedback about where to go from here and if others are experiencing the same issues. Thanks! FWIW here's my specific build incantation on Clang-18 (49d41de57896e935cd5726719c5208bce22694ae): $ make LLVM=1 -j128 ARCH=x86_64 mrproper headers defconfig kselftest-merge $ make LLVM=1 ARCH=x86_64 -C tools/testing/selftests I should have used the __atomic_load_n() compiler builtin rather than atomic_load(), mainly because it does not require adding the _Atomic annotation to each type it touches.
Thanks, Mathieu Link: https://github.com/ClangBuiltLinux/linux/issues/1698 Link: https://github.com/ClangBuiltLinux/continuous-integration2/issues/61 Signed-off-by: Justin Stitt --- tools/testing/selftests/rseq/param_test.c | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/tools/testing/selftests/rseq/param_test.c b/tools/testing/selftests/rseq/param_test.c index bf951a490bb4..94802aeed2c6 100644 --- a/tools/testing/selftests/rseq/param_test.c +++ b/tools/testing/selftests/rseq/param_test.c @@ -356,7 +356,7 @@ struct inc_thread_test_data { }; struct percpu_list_node { - intptr_t data; + _Atomic intptr_t data; struct percpu_list_node *next; }; @@ -1212,8 +1212,8 @@ static int set_signal_handler(void) /* Test MEMBARRIER_CMD_PRIVATE_RESTART_RSEQ_ON_CPU membarrier command. */ #ifdef TEST_MEMBARRIER struct test_membarrier_thread_args { - int stop; - intptr_t percpu_list_ptr; + _Atomic int stop; + _Atomic intptr_t percpu_list_ptr; }; /* Worker threads modify data in their "active" percpu lists. */ @@ -1240,7 +1240,7 @@ void *test_membarrier_worker_thread(void *arg) int cpu = get_current_cpu_id(); ret = rseq_offset_deref_addv(RSEQ_MO_RELAXED, RSEQ_PERCPU, - &args->percpu_list_ptr, + (intptr_t *)&args->percpu_list_ptr, sizeof(struct percpu_list_entry) * cpu, 1, cpu); } while (rseq_unlikely(ret)); } --- base-commit: 2dde18cd1d8fac735875f2e4987f11817cc0bc2c change-id: 20230908-kselftest-param_test-c-1763b62e762f Best regards, -- Justin Stitt -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com
Re: [PATCH][RFC] tracing: Enable tracepoints via module parameters
- On Apr 20, 2021, at 10:55 AM, rostedt rost...@goodmis.org wrote: > On Tue, 20 Apr 2021 09:29:27 -0400 (EDT) > Mathieu Desnoyers wrote: > >> - On Apr 20, 2021, at 8:55 AM, rostedt rost...@goodmis.org wrote: >> [...] >> > >> > Would adding automatic module parameters be an issue? That is, you can add >> > in the insmod command line a parameter that will enable tracepoints. We >> > could have a way to even see them from the modinfo. I think I had that >> > working once, and it wasn't really that hard to do. >> >> There is one thing we should consider here in terms of namespacing: those >> module >> command line parameters should be specific to each tracer (e.g. ftrace, perf, >> ebpf). >> >> LTTng for instance already tackles early module load tracing in a different >> way: users can enable instrumentation of yet-to-be loaded kernel modules. So >> it would not make sense in that scheme to have module load parameters. >> >> It's a different trade-off in terms of error reporting though: for instance, >> LTTng won't report an error if a user does a typo when entering an event >> name. >> >> So I think those command line parameters should be tracer-specific, do you >> agree >> ? > > > No, I do not agree. I would like to make it consistent with the kernel > command line. As you can put in: "trace_event=sched_switch" and the > sched_switch trace point will be enable (for the tracefs directory) on boot > up. The same should be for modules as well. > > It shouldn't affect LTTng, as you already have a way to enable them as they > get loaded. That sounds fine. So that would be within the "trace_event" (and not tracepoint) namespace for module load parameters as well ? Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. http://www.efficios.com
Re: [PATCH][RFC] tracing: Enable tracepoints via module parameters
- On Apr 20, 2021, at 8:55 AM, rostedt rost...@goodmis.org wrote: [...] > > Would adding automatic module parameters be an issue? That is, you can add > in the insmod command line a parameter that will enable tracepoints. We > could have a way to even see them from the modinfo. I think I had that > working once, and it wasn't really that hard to do. There is one thing we should consider here in terms of namespacing: those module command line parameters should be specific to each tracer (e.g. ftrace, perf, ebpf). LTTng for instance already tackles early module load tracing in a different way: users can enable instrumentation of yet-to-be loaded kernel modules. So it would not make sense in that scheme to have module load parameters. It's a different trade-off in terms of error reporting though: for instance, LTTng won't report an error if a user does a typo when entering an event name. So I think those command line parameters should be tracer-specific, do you agree ? Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. http://www.efficios.com
Re: liburcu: LTO breaking rcu_dereference on arm64 and possibly other architectures ?
- On Apr 16, 2021, at 3:02 PM, paulmck paul...@kernel.org wrote: [...] > > If it can be done reasonably, I suggest also having some way for the > person building userspace RCU to say "I know what I am doing, so do > it with volatile rather than memory_order_consume." Like so ? #define CMM_ACCESS_ONCE(x) (*(__volatile__ __typeof__(x) *)&(x)) #define CMM_LOAD_SHARED(p) CMM_ACCESS_ONCE(p) /* * By defining URCU_DEREFERENCE_USE_VOLATILE, the user requires use of * volatile access to implement rcu_dereference rather than * memory_order_consume load from the C11/C++11 standards. * * This may improve performance on weakly-ordered architectures where * the compiler implements memory_order_consume as a * memory_order_acquire, which is stricter than required by the * standard. * * Note that using volatile accesses for rcu_dereference may cause * LTO to generate incorrectly ordered code starting from C11/C++11. */ #ifdef URCU_DEREFERENCE_USE_VOLATILE # define rcu_dereference(x) CMM_LOAD_SHARED(x) #else # if defined (__cplusplus) # if __cplusplus >= 201103L # include <atomic> # define rcu_dereference(x) ((std::atomic<__typeof__(x)>)(x)).load(std::memory_order_consume) # else # define rcu_dereference(x) CMM_LOAD_SHARED(x) # endif # else # if (defined(__STDC_VERSION__) && __STDC_VERSION__ >= 201112L) # include <stdatomic.h> # define rcu_dereference(x) atomic_load_explicit(&(x), memory_order_consume) # else # define rcu_dereference(x) CMM_LOAD_SHARED(x) # endif # endif #endif Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. http://www.efficios.com
Re: liburcu: LTO breaking rcu_dereference on arm64 and possibly other architectures ?
- On Apr 16, 2021, at 12:01 PM, paulmck paul...@kernel.org wrote: > On Fri, Apr 16, 2021 at 05:17:11PM +0200, Peter Zijlstra wrote: >> On Fri, Apr 16, 2021 at 10:52:16AM -0400, Mathieu Desnoyers wrote: >> > Hi Paul, Will, Peter, >> > >> > I noticed in this discussion https://lkml.org/lkml/2021/4/16/118 that LTO >> > is able to break rcu_dereference. This seems to be taken care of by >> > arch/arm64/include/asm/rwonce.h on arm64 in the Linux kernel tree. >> > >> > In the liburcu user-space library, we have this comment near >> > rcu_dereference() >> > in >> > include/urcu/static/pointer.h: >> > >> > * The compiler memory barrier in CMM_LOAD_SHARED() ensures that >> > value-speculative >> > * optimizations (e.g. VSS: Value Speculation Scheduling) does not perform >> > the >> > * data read before the pointer read by speculating the value of the >> > pointer. >> > * Correct ordering is ensured because the pointer is read as a volatile >> > access. >> > * This acts as a global side-effect operation, which forbids reordering of >> > * dependent memory operations. Note that such concern about >> > dependency-breaking >> > * optimizations will eventually be taken care of by the >> > "memory_order_consume" >> > * addition to forthcoming C++ standard. >> > >> > (note: CMM_LOAD_SHARED() is the equivalent of READ_ONCE(), but was >> > introduced in >> > liburcu as a public API before READ_ONCE() existed in the Linux kernel) >> > >> > Peter tells me the "memory_order_consume" is not something which can be >> > used >> > today. >> > Any information on its status at C/C++ standard levels and >> > implementation-wise ? > > Actually, you really can use memory_order_consume. All current > implementations will compile it as if it was memory_order_acquire. > This will work correctly, but may be slower than you would like on ARM, > PowerPC, and so on. > > On things like x86, the penalty is forgone optimizations, so less > of a problem there. 
OK > >> > Pragmatically speaking, what should we change in liburcu to ensure we don't >> > generate >> > broken code when LTO is enabled ? I suspect there are a few options here: >> > >> > 1) Fail to build if LTO is enabled, >> > 2) Generate slower code for rcu_dereference, either on all architectures >> > or only >> >on weakly-ordered architectures, >> > 3) Generate different code depending on whether LTO is enabled or not. >> > AFAIU >> > this would only >> >work if every compile unit is aware that it will end up being optimized >> > with >> >LTO. Not sure >> >how this could be done in the context of user-space. >> > 4) [ Insert better idea here. ] > > Use memory_order_consume if LTO is enabled. That will work now, and > might generate good code in some hoped-for future. In the context of a user-space library, how does one check whether LTO is enabled with preprocessor directives ? A quick test with gcc seems to show that both with and without -flto cannot be distinguished from a preprocessor POV, e.g. the output of both gcc --std=c11 -O2 -dM -E - < /dev/null and gcc --std=c11 -O2 -flto -dM -E - < /dev/null is exactly the same. Am I missing something here ? If we accept to use memory_order_consume all the time in both C and C++ code starting from C11 and C++11, the following code snippet could do the trick: #define CMM_ACCESS_ONCE(x) (*(__volatile__ __typeof__(x) *)&(x)) #define CMM_LOAD_SHARED(p) CMM_ACCESS_ONCE(p) #if defined (__cplusplus) # if __cplusplus >= 201103L # include <atomic> # define rcu_dereference(x) ((std::atomic<__typeof__(x)>)(x)).load(std::memory_order_consume) # else # define rcu_dereference(x) CMM_LOAD_SHARED(x) # endif #else # if (defined(__STDC_VERSION__) && __STDC_VERSION__ >= 201112L) # include <stdatomic.h> # define rcu_dereference(x) atomic_load_explicit(&(x), memory_order_consume) # else # define rcu_dereference(x) CMM_LOAD_SHARED(x) # endif #endif This uses the volatile approach prior to C11/C++11, and moves to memory_order_consume afterwards.
This will bring a performance penalty on weakly-ordered architectures even when -flto is not specified though. Then the burden is pushed on the compiler people to eventually implement an efficient memory_order_consume. Is that acceptable ? Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. http://www.efficios.com
liburcu: LTO breaking rcu_dereference on arm64 and possibly other architectures ?
Hi Paul, Will, Peter, I noticed in this discussion https://lkml.org/lkml/2021/4/16/118 that LTO is able to break rcu_dereference. This seems to be taken care of by arch/arm64/include/asm/rwonce.h on arm64 in the Linux kernel tree. In the liburcu user-space library, we have this comment near rcu_dereference() in include/urcu/static/pointer.h: * The compiler memory barrier in CMM_LOAD_SHARED() ensures that value-speculative * optimizations (e.g. VSS: Value Speculation Scheduling) does not perform the * data read before the pointer read by speculating the value of the pointer. * Correct ordering is ensured because the pointer is read as a volatile access. * This acts as a global side-effect operation, which forbids reordering of * dependent memory operations. Note that such concern about dependency-breaking * optimizations will eventually be taken care of by the "memory_order_consume" * addition to forthcoming C++ standard. (note: CMM_LOAD_SHARED() is the equivalent of READ_ONCE(), but was introduced in liburcu as a public API before READ_ONCE() existed in the Linux kernel) Peter tells me the "memory_order_consume" is not something which can be used today. Any information on its status at C/C++ standard levels and implementation-wise ? Pragmatically speaking, what should we change in liburcu to ensure we don't generate broken code when LTO is enabled ? I suspect there are a few options here: 1) Fail to build if LTO is enabled, 2) Generate slower code for rcu_dereference, either on all architectures or only on weakly-ordered architectures, 3) Generate different code depending on whether LTO is enabled or not. AFAIU this would only work if every compile unit is aware that it will end up being optimized with LTO. Not sure how this could be done in the context of user-space. 4) [ Insert better idea here. ] Thoughts ? Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. http://www.efficios.com
Re: [PATCH v3 0/3] rseq: minor optimizations
- On Apr 13, 2021, at 4:33 PM, Eric Dumazet eric.duma...@gmail.com wrote: > From: Eric Dumazet > > rseq is a heavy user of copy to/from user data in fast paths. > This series tries to reduce the cost. > > v3: Third patch going back to v1 (only deal with 64bit arches) > v2: Addressed Peter and Mathieu feedbacks, thanks ! For the whole series: Reviewed-by: Mathieu Desnoyers Thanks Eric! Mathieu > > Eric Dumazet (3): > rseq: optimize rseq_update_cpu_id() > rseq: remove redundant access_ok() > rseq: optimise rseq_get_rseq_cs() and clear_rseq_cs() > > kernel/rseq.c | 29 + > 1 file changed, 21 insertions(+), 8 deletions(-) > > -- > 2.31.1.295.g9ea45b61b8-goog -- Mathieu Desnoyers EfficiOS Inc. http://www.efficios.com
Re: [PATCH v2 3/3] rseq: optimise rseq_get_rseq_cs() and clear_rseq_cs()
- On Apr 13, 2021, at 2:22 PM, Eric Dumazet eduma...@google.com wrote: > On Tue, Apr 13, 2021 at 8:00 PM Mathieu Desnoyers > wrote: >> > >> As long as the ifdefs are localized within clearly identified wrappers in the >> rseq code I don't mind doing the special-casing there. >> >> The point which remains is that I don't think we want to optimize for speed >> on 32-bit architectures when it adds special-casing and complexity to the >> 32-bit >> build. I suspect there is less and less testing performed on 32-bit >> architectures >> nowadays, and it's good that as much code as possible is shared between >> 32-bit >> and >> 64-bit builds to share the test coverage. >> > > Quite frankly V1 was fine, I can't really make it looking better. Yes, I'm OK with V1 of that patch. Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. http://www.efficios.com
Re: [PATCH v2 3/3] rseq: optimise rseq_get_rseq_cs() and clear_rseq_cs()
- On Apr 13, 2021, at 1:33 PM, Eric Dumazet eduma...@google.com wrote: > On Tue, Apr 13, 2021 at 7:20 PM Mathieu Desnoyers > wrote: >> >> - On Apr 13, 2021, at 1:07 PM, Eric Dumazet eduma...@google.com wrote: >> >> > On Tue, Apr 13, 2021 at 7:01 PM Eric Dumazet wrote: >> >> >> >> On Tue, Apr 13, 2021 at 6:57 PM Eric Dumazet wrote: >> >> > >> >> > On Tue, Apr 13, 2021 at 6:54 PM Mathieu Desnoyers >> >> > wrote: >> >> > > >> >> > > - On Apr 13, 2021, at 12:22 PM, Eric Dumazet >> >> > > eric.duma...@gmail.com wrote: >> >> > > >> >> > > > From: Eric Dumazet >> >> > > > >> >> > > > Commit ec9c82e03a74 ("rseq: uapi: Declare rseq_cs field as union, >> >> > > > update includes") added regressions for our servers. >> >> > > > >> >> > > > Using copy_from_user() and clear_user() for 64bit values >> >> > > > is suboptimal. >> >> > > > >> >> > > > We can use faster put_user() and get_user(). >> >> > > > >> >> > > > 32bit arches can be changed to use the ptr32 field, >> >> > > > since the padding field must always be zero. >> >> > > > >> >> > > > v2: added ideas from Peter and Mathieu about making this >> >> > > >generic, since my initial patch was only dealing with >> >> > > >64bit arches. >> >> > > >> >> > > Ah, now I remember the reason why reading and clearing the entire >> >> > > 64-bit >> >> > > is important: it's because we don't want to allow user-space >> >> > > processes to >> >> > > use this change in behavior to figure out whether they are running on >> >> > > a >> >> > > 32-bit or in a 32-bit compat mode on a 64-bit kernel. >> >> > > >> >> > > So although I'm fine with making 64-bit kernels faster, we'll want to >> >> > > keep >> >> > > updating the entire 64-bit ptr field on 32-bit kernels as well. >> >> > > >> >> > > Thanks, >> >> > > >> >> > >> >> > So... back to V1 then ? >> >> >> >> Or add more stuff as in : >> > >> > diff against v2, WDYT ? 
>> >> I like this approach slightly better, because it moves the preprocessor >> ifdefs >> into >> rseq_get_rseq_cs and clear_rseq_cs, while keeping the same behavior for a >> 32-bit >> process running on native 32-bit kernel and as compat task on a 64-bit >> kernel. >> >> That being said, I don't expect anyone to care much about performance of >> 32-bit >> kernels, so we could use copy_from_user() on 32-bit kernels to remove >> special-cases >> in 32-bit specific code. This would eliminate the 32-bit specific "padding" >> read, and >> let the TASK_SIZE comparison handle the check for both 32-bit and 64-bit >> kernels. >> >> As for clear_user(), I wonder whether we could simply keep using it, but >> change >> the >> clear_user() macro to figure out that it can use a faster 8-byte put_user ? I >> find it >> odd that performance optimizations which would be relevant elsewhere creep >> into >> the >> rseq code. > > > clear_user() is a maze of arch-dependent macros/functions/assembly > > I guess the same could be said from copy_in_user(), but apparently we removed > special-casing, like in commit a41e0d754240fe8ca9c4f2070bf67e3b0228aa22 > > Definitely it seems odd having to carefully choose between multiple methods. As long as the ifdefs are localized within clearly identified wrappers in the rseq code I don't mind doing the special-casing there. The point which remains is that I don't think we want to optimize for speed on 32-bit architectures when it adds special-casing and complexity to the 32-bit build. I suspect there is less and less testing performed on 32-bit architectures nowadays, and it's good that as much code as possible is shared between 32-bit and 64-bit builds to share the test coverage. Thanks, Mathieu > > >> >> Thanks, >> >> Mathieu >> >> > >> > diff --git a/kernel/rseq.c b/kernel/rseq.c >> > index >> > f2eee3f7f5d330688c81cb2e57d47ca6b843873e..537b1f684efa11069990018ffa3642c209993011 >>
Re: [PATCH v2 3/3] rseq: optimise rseq_get_rseq_cs() and clear_rseq_cs()
- On Apr 13, 2021, at 1:07 PM, Eric Dumazet eduma...@google.com wrote: > On Tue, Apr 13, 2021 at 7:01 PM Eric Dumazet wrote: >> >> On Tue, Apr 13, 2021 at 6:57 PM Eric Dumazet wrote: >> > >> > On Tue, Apr 13, 2021 at 6:54 PM Mathieu Desnoyers >> > wrote: >> > > >> > > - On Apr 13, 2021, at 12:22 PM, Eric Dumazet eric.duma...@gmail.com >> > > wrote: >> > > >> > > > From: Eric Dumazet >> > > > >> > > > Commit ec9c82e03a74 ("rseq: uapi: Declare rseq_cs field as union, >> > > > update includes") added regressions for our servers. >> > > > >> > > > Using copy_from_user() and clear_user() for 64bit values >> > > > is suboptimal. >> > > > >> > > > We can use faster put_user() and get_user(). >> > > > >> > > > 32bit arches can be changed to use the ptr32 field, >> > > > since the padding field must always be zero. >> > > > >> > > > v2: added ideas from Peter and Mathieu about making this >> > > >generic, since my initial patch was only dealing with >> > > >64bit arches. >> > > >> > > Ah, now I remember the reason why reading and clearing the entire 64-bit >> > > is important: it's because we don't want to allow user-space processes to >> > > use this change in behavior to figure out whether they are running on a >> > > 32-bit or in a 32-bit compat mode on a 64-bit kernel. >> > > >> > > So although I'm fine with making 64-bit kernels faster, we'll want to >> > > keep >> > > updating the entire 64-bit ptr field on 32-bit kernels as well. >> > > >> > > Thanks, >> > > >> > >> > So... back to V1 then ? >> >> Or add more stuff as in : > > diff against v2, WDYT ? I like this approach slightly better, because it moves the preprocessor ifdefs into rseq_get_rseq_cs and clear_rseq_cs, while keeping the same behavior for a 32-bit process running on native 32-bit kernel and as compat task on a 64-bit kernel. 
That being said, I don't expect anyone to care much about performance of 32-bit kernels, so we could use copy_from_user() on 32-bit kernels to remove special-cases in 32-bit specific code. This would eliminate the 32-bit specific "padding" read, and let the TASK_SIZE comparison handle the check for both 32-bit and 64-bit kernels. As for clear_user(), I wonder whether we could simply keep using it, but change the clear_user() macro to figure out that it can use a faster 8-byte put_user ? I find it odd that performance optimizations which would be relevant elsewhere creep into the rseq code. Thanks, Mathieu > > diff --git a/kernel/rseq.c b/kernel/rseq.c > index > f2eee3f7f5d330688c81cb2e57d47ca6b843873e..537b1f684efa11069990018ffa3642c209993011 > 100644 > --- a/kernel/rseq.c > +++ b/kernel/rseq.c > @@ -136,6 +136,10 @@ static int rseq_get_cs_ptr(struct rseq_cs __user **uptrp, > { >u32 ptr; > > + if (get_user(ptr, &rseq->rseq_cs.ptr.padding)) > + return -EFAULT; > + if (ptr) > + return -EINVAL; >if (get_user(ptr, &rseq->rseq_cs.ptr.ptr32)) >return -EFAULT; >*uptrp = (struct rseq_cs __user *)ptr; > @@ -150,8 +154,9 @@ static int rseq_get_rseq_cs(struct task_struct *t, > struct rseq_cs *rseq_cs) >u32 sig; >int ret; > > - if (rseq_get_cs_ptr(&urseq_cs, t->rseq)) > - return -EFAULT; > + ret = rseq_get_cs_ptr(&urseq_cs, t->rseq); > + if (ret) > + return ret; >if (!urseq_cs) { >memset(rseq_cs, 0, sizeof(*rseq_cs)); >return 0; > @@ -237,7 +242,8 @@ static int clear_rseq_cs(struct task_struct *t) > #ifdef CONFIG_64BIT >return put_user(0UL, &t->rseq->rseq_cs.ptr64); > #else > - return put_user(0UL, &t->rseq->rseq_cs.ptr.ptr32); > + return put_user(0UL, &t->rseq->rseq_cs.ptr.ptr32) | > + put_user(0UL, &t->rseq->rseq_cs.ptr.padding); > #endif > } -- Mathieu Desnoyers EfficiOS Inc. http://www.efficios.com
Re: [PATCH v2 3/3] rseq: optimise rseq_get_rseq_cs() and clear_rseq_cs()
- On Apr 13, 2021, at 12:57 PM, Eric Dumazet eduma...@google.com wrote: > On Tue, Apr 13, 2021 at 6:54 PM Mathieu Desnoyers > wrote: >> >> - On Apr 13, 2021, at 12:22 PM, Eric Dumazet eric.duma...@gmail.com >> wrote: >> >> > From: Eric Dumazet >> > >> > Commit ec9c82e03a74 ("rseq: uapi: Declare rseq_cs field as union, >> > update includes") added regressions for our servers. >> > >> > Using copy_from_user() and clear_user() for 64bit values >> > is suboptimal. >> > >> > We can use faster put_user() and get_user(). >> > >> > 32bit arches can be changed to use the ptr32 field, >> > since the padding field must always be zero. >> > >> > v2: added ideas from Peter and Mathieu about making this >> >generic, since my initial patch was only dealing with >> >64bit arches. >> >> Ah, now I remember the reason why reading and clearing the entire 64-bit >> is important: it's because we don't want to allow user-space processes to >> use this change in behavior to figure out whether they are running on a >> 32-bit or in a 32-bit compat mode on a 64-bit kernel. >> >> So although I'm fine with making 64-bit kernels faster, we'll want to keep >> updating the entire 64-bit ptr field on 32-bit kernels as well. >> >> Thanks, >> > > So... back to V1 then ? In terms of behavior, yes. And it's probably the "easy" fix, but I hate that it adds lots of preprocessor ifdefs into the rseq code. But this would require auditing get_user()/put_user() for each architecture supported by rseq to ensure they support 8-byte load/store. And it would become an added burden on architecture maintainers wishing to add rseq support for their architecture. One alternative would be to implement rseq_get_user_u64 and rseq_put_user_u64 wrappers as static functions within rseq.c to hide the preprocessor ifdeffery from the higher-level code. I try very hard to avoid mixing preprocessor ifdefs with C code logic whenever I can. Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. http://www.efficios.com
Re: [PATCH v2 3/3] rseq: optimise rseq_get_rseq_cs() and clear_rseq_cs()
- On Apr 13, 2021, at 12:22 PM, Eric Dumazet eric.duma...@gmail.com wrote: > From: Eric Dumazet > > Commit ec9c82e03a74 ("rseq: uapi: Declare rseq_cs field as union, > update includes") added regressions for our servers. > > Using copy_from_user() and clear_user() for 64bit values > is suboptimal. > > We can use faster put_user() and get_user(). > > 32bit arches can be changed to use the ptr32 field, > since the padding field must always be zero. > > v2: added ideas from Peter and Mathieu about making this >generic, since my initial patch was only dealing with >64bit arches. Ah, now I remember the reason why reading and clearing the entire 64-bit is important: it's because we don't want to allow user-space processes to use this change in behavior to figure out whether they are running on a 32-bit or in a 32-bit compat mode on a 64-bit kernel. So although I'm fine with making 64-bit kernels faster, we'll want to keep updating the entire 64-bit ptr field on 32-bit kernels as well. Thanks, Mathieu > > Signed-off-by: Eric Dumazet > Cc: Mathieu Desnoyers > Cc: Peter Zijlstra > Cc: "Paul E. 
McKenney" > Cc: Boqun Feng > Cc: Arjun Roy > Cc: Ingo Molnar > --- > kernel/rseq.c | 41 + > 1 file changed, 33 insertions(+), 8 deletions(-) > > diff --git a/kernel/rseq.c b/kernel/rseq.c > index > cfe01ab5253c1c424c0e8b25acbb6a8e1b41a5b6..f2eee3f7f5d330688c81cb2e57d47ca6b843873e > 100644 > --- a/kernel/rseq.c > +++ b/kernel/rseq.c > @@ -119,23 +119,46 @@ static int rseq_reset_rseq_cpu_id(struct task_struct *t) > return 0; > } > > +#ifdef CONFIG_64BIT > +static int rseq_get_cs_ptr(struct rseq_cs __user **uptrp, > +const struct rseq __user *rseq) > +{ > + u64 ptr; > + > + if (get_user(ptr, &rseq->rseq_cs.ptr64)) > + return -EFAULT; > + *uptrp = (struct rseq_cs __user *)ptr; > + return 0; > +} > +#else > +static int rseq_get_cs_ptr(struct rseq_cs __user **uptrp, > +const struct rseq __user *rseq) > +{ > + u32 ptr; > + > + if (get_user(ptr, &rseq->rseq_cs.ptr.ptr32)) > + return -EFAULT; > + *uptrp = (struct rseq_cs __user *)ptr; > + return 0; > +} > +#endif > + > static int rseq_get_rseq_cs(struct task_struct *t, struct rseq_cs *rseq_cs) > { > struct rseq_cs __user *urseq_cs; > - u64 ptr; > u32 __user *usig; > u32 sig; > int ret; > > - if (copy_from_user(&ptr, &t->rseq->rseq_cs.ptr64, sizeof(ptr))) > + if (rseq_get_cs_ptr(&urseq_cs, t->rseq)) > return -EFAULT; > - if (!ptr) { > + if (!urseq_cs) { > memset(rseq_cs, 0, sizeof(*rseq_cs)); > return 0; > } > - if (ptr >= TASK_SIZE) > + if ((unsigned long)urseq_cs >= TASK_SIZE) > return -EINVAL; > - urseq_cs = (struct rseq_cs __user *)(unsigned long)ptr; > + > if (copy_from_user(rseq_cs, urseq_cs, sizeof(*rseq_cs))) > return -EFAULT; > > @@ -211,9 +234,11 @@ static int clear_rseq_cs(struct task_struct *t) >* >* Set rseq_cs to NULL. 
>*/ > - if (clear_user(&t->rseq->rseq_cs.ptr64, sizeof(t->rseq->rseq_cs.ptr64))) > - return -EFAULT; > - return 0; > +#ifdef CONFIG_64BIT > + return put_user(0UL, &t->rseq->rseq_cs.ptr64); > +#else > + return put_user(0UL, &t->rseq->rseq_cs.ptr.ptr32); > +#endif > } > > /* > -- > 2.31.1.295.g9ea45b61b8-goog -- Mathieu Desnoyers EfficiOS Inc. http://www.efficios.com
Re: [PATCH 3/3] rseq: optimise for 64bit arches
- On Apr 13, 2021, at 11:06 AM, David Laight david.lai...@aculab.com wrote: [...] > > Hmmm... too much replication. > You could do: > #ifdef CONFIG_64BIT > #define PTR_TYPE u64 > #define PTR_FLD ptr64 > #else > #define PTR_TYPE u32 > #define PTR_FLD ptr32 > #endif > > Then have one copy of the code and the #undefs. Good point, agreed. Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. http://www.efficios.com
Re: [PATCH 2/3] rseq: remove redundant access_ok()
- On Apr 13, 2021, at 3:36 AM, Eric Dumazet eric.duma...@gmail.com wrote: > From: Eric Dumazet > > After commit 8f2817701492 ("rseq: Use get_user/put_user rather > than __get_user/__put_user") we no longer need > an access_ok() call from __rseq_handle_notify_resume() While we are doing that, should we also remove the access_ok() check in rseq_syscall() ? It looks like it can also be removed for the exact same reason outlined here. Thanks, Mathieu > > Signed-off-by: Eric Dumazet > Cc: Mathieu Desnoyers > Cc: Peter Zijlstra > Cc: "Paul E. McKenney" > Cc: Boqun Feng > Cc: Arjun Roy > Cc: Ingo Molnar > --- > kernel/rseq.c | 2 -- > 1 file changed, 2 deletions(-) > > diff --git a/kernel/rseq.c b/kernel/rseq.c > index > d2689ccbb132c0fc8ec0924008771e5ee1ca855e..57344f9abb43905c7dd2b6081205ff508d963e1e > 100644 > --- a/kernel/rseq.c > +++ b/kernel/rseq.c > @@ -273,8 +273,6 @@ void __rseq_handle_notify_resume(struct ksignal *ksig, > struct pt_regs *regs) > > if (unlikely(t->flags & PF_EXITING)) > return; > - if (unlikely(!access_ok(t->rseq, sizeof(*t->rseq)))) > - goto error; > ret = rseq_ip_fixup(regs); > if (unlikely(ret < 0)) > goto error; > -- > 2.31.1.295.g9ea45b61b8-goog -- Mathieu Desnoyers EfficiOS Inc. http://www.efficios.com
Re: [PATCH 1/3] rseq: optimize rseq_update_cpu_id()
- On Apr 13, 2021, at 3:36 AM, Eric Dumazet eric.duma...@gmail.com wrote: > From: Eric Dumazet > > Two put_user() in rseq_update_cpu_id() are replaced > by a pair of unsafe_put_user() with appropriate surroundings. > > This removes one stac/clac pair on x86 in fast path. > > Signed-off-by: Eric Dumazet > Cc: Mathieu Desnoyers > Cc: Peter Zijlstra > Cc: "Paul E. McKenney" > Cc: Boqun Feng > Cc: Arjun Roy > Cc: Ingo Molnar > --- > kernel/rseq.c | 15 +++ > 1 file changed, 11 insertions(+), 4 deletions(-) > > diff --git a/kernel/rseq.c b/kernel/rseq.c > index > a4f86a9d6937cdfa2f13d1dcc9be863c1943d06f..d2689ccbb132c0fc8ec0924008771e5ee1ca855e > 100644 > --- a/kernel/rseq.c > +++ b/kernel/rseq.c > @@ -84,13 +84,20 @@ > static int rseq_update_cpu_id(struct task_struct *t) > { > u32 cpu_id = raw_smp_processor_id(); > + struct rseq *r = t->rseq; AFAIU the variable above should be a struct rseq __user *. Elsewhere in the file we use "rseq" rather than "r" for struct rseq __user * variable name, it would be better to keep the naming consistent across the file if possible. Thanks, Mathieu > > - if (put_user(cpu_id, &t->rseq->cpu_id_start)) > - return -EFAULT; > - if (put_user(cpu_id, &t->rseq->cpu_id)) > - return -EFAULT; > + if (!user_write_access_begin(r, sizeof(*r))) > + goto efault; > + unsafe_put_user(cpu_id, &r->cpu_id_start, efault_end); > + unsafe_put_user(cpu_id, &r->cpu_id, efault_end); > + user_write_access_end(); > trace_rseq_update(t); > return 0; > + > +efault_end: > + user_write_access_end(); > +efault: > + return -EFAULT; > } > > static int rseq_reset_rseq_cpu_id(struct task_struct *t) > -- > 2.31.1.295.g9ea45b61b8-goog -- Mathieu Desnoyers EfficiOS Inc. http://www.efficios.com
Re: [PATCH 3/3] rseq: optimise for 64bit arches
- On Apr 13, 2021, at 6:36 AM, David Laight david.lai...@aculab.com wrote: > From: Peter Zijlstra >> Sent: 13 April 2021 10:10 >> >> On Tue, Apr 13, 2021 at 12:36:57AM -0700, Eric Dumazet wrote: >> > From: Eric Dumazet >> > >> > Commit ec9c82e03a74 ("rseq: uapi: Declare rseq_cs field as union, >> > update includes") added regressions for our servers. >> > >> > Using copy_from_user() and clear_user() for 64bit values >> > on 64bit arches is suboptimal. >> > >> > We might revisit this patch once all 32bit arches support >> > get_user() and/or put_user() for 8 bytes values. >> >> Argh, what a mess :/ afaict only nios32 lacks put_user_8, but get_user_8 >> is missing in a fair number of archs. >> >> That said; 32bit archs never have to actually set the top bits in that >> word, so they _could_ only set the low 32 bits. That works provided >> userspace itself keeps the high bits clear. > > Does that work for 32bit BE ? Yes, because uapi/linux/rseq.h defines the ptr32 as: #if (defined(__BYTE_ORDER) && (__BYTE_ORDER == __BIG_ENDIAN)) || defined(__BIG_ENDIAN) __u32 padding; /* Initialized to zero. */ __u32 ptr32; #else /* LITTLE */ __u32 ptr32; __u32 padding; /* Initialized to zero. */ #endif /* ENDIAN */ which takes care of BE vs LE. > > David > >> So I suppose that if we're going to #ifdef this, we might as well do the >> whole thing. >> >> Mathieu; did I forget a reason why this cannot work? The only difference it brings on 32-bit is that the truncation of high bits will be done before the following validation: if (!ptr) { memset(rseq_cs, 0, sizeof(*rseq_cs)); return 0; } if (ptr >= TASK_SIZE) return -EINVAL; The question is whether we really want to issue a segmentation fault if 32-bit user-space has set non-zero high bits, or if silently ignoring those high bits is acceptable. 
Nits below: >> >> diff --git a/kernel/rseq.c b/kernel/rseq.c >> index a4f86a9d6937..94006190b8eb 100644 >> --- a/kernel/rseq.c >> +++ b/kernel/rseq.c >> @@ -115,20 +115,25 @@ static int rseq_reset_rseq_cpu_id(struct task_struct >> *t) >> static int rseq_get_rseq_cs(struct task_struct *t, struct rseq_cs *rseq_cs) >> { >> struct rseq_cs __user *urseq_cs; >> -u64 ptr; >> +unsigned long ptr; I am always reluctant to use long/unsigned long type as type for the get/put_user (x) argument, because it hides the cast deep within architecture-specific macros. I understand that in this specific case it happens that on 64-bit archs we end up casting a u64 to unsigned long (same size), and on 32-bit archs we end up casting a u32 to unsigned long (also same size), so there is no practical concern about type promotion and sign-extension, but I think it would be better to have something explicit, e.g.: #ifdef CONFIG_64BIT static int rseq_get_cs_ptr(struct rseq_cs __user **uptrp, struct rseq_cs *rseq_cs) { u64 ptr; if (get_user(ptr, &rseq_cs->ptr64)) return -EFAULT; *uptrp = (struct rseq_cs __user *)ptr; return 0; } #else static int rseq_get_cs_ptr(struct rseq_cs __user **uptrp, struct rseq_cs *rseq_cs) { u32 ptr; if (get_user(ptr, &rseq_cs->ptr.ptr32)) return -EFAULT; *uptrp = (struct rseq_cs __user *)ptr; return 0; } #endif And use those helpers to get the ptr value. >> u32 __user *usig; >> u32 sig; >> int ret; >> >> -if (copy_from_user(&ptr, &t->rseq->rseq_cs.ptr64, sizeof(ptr))) >> return -EFAULT; >> +#ifdef CONFIG_64BIT >> +if (get_user(ptr, &t->rseq->rseq_cs.ptr64)) >> return -EFAULT; >> +#else >> +if (get_user(ptr, &t->rseq->rseq_cs.ptr32)) Note that this is also not right. It should be &t->rseq->rseq_cs.ptr.ptr32. 
Thanks, Mathieu >> +return -EFAULT; >> +#endif >> if (!ptr) { >> memset(rseq_cs, 0, sizeof(*rseq_cs)); >> return 0; >> } >> if (ptr >= TASK_SIZE) >> return -EINVAL; >> -urseq_cs = (struct rseq_cs __user *)(unsigned long)ptr; >> +urseq_cs = (struct rseq_cs __user *)ptr; >> if (copy_from_user(rseq_cs, urseq_cs, sizeof(*rseq_cs))) >> return -EFAULT; >> > > - > Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, > UK > Registration No: 1397386 (Wales) -- Mathieu Desnoyers EfficiOS Inc. http://www.efficios.com
Re: about seg fault inside rseq critical section
4f\n\t" >> ".popsection\n\t" >> "1:\n\t" >> "leaq cs_obj(%%rip), %%rax\n\t" >> "movq %%rax, %[rseq_cs]\n\t" >> "cmpl %[cpu_id], %[current_cpu_id]\n\t" >> "jnz 4f\n\t" >> /* enable signal testing */ >> "movq $5, %%rdi\n\t" >> "call sleep@plt\n\t" >> /*/ >> "jmp %l[committed]\n\t" >> "2:\n\t" >> /* Disassembler-friendly signature: nopl (%rip). */ >> ".byte 0x0f, 0x1f, 0x05\n\t" >> ".long 0x53053053\n\t" /* RSEQ_SIG */ >> "4:\n\t" >> "jmp %l[aborted]\n\t" >> : /* no outputs */ >> : [cpu_id] "r" (cpu), >> [current_cpu_id] "m" (__rseq_abi.cpu_id), >> [rseq_cs] "m" (__rseq_abi.rseq_cs) >> : "memory", "cc", "rax" >> : aborted, committed >> ); >> committed: >> printf("committed\n"); >> return 0; >> aborted: >> printf("aborted\n"); >> return -1; >> } >> void signal_callback_handler(int signum) { >> printf("Caught signal %d\n", signum); >> } >> int main(int argc, char **argv) { >> signal(SIGINT, signal_callback_handler); >> int cpu, ret; >> register_thread(); >> cpu = RSEQ_ACCESS_ONCE(__rseq_abi.cpu_id_start); >> printf("ret = %d\n", do_test(cpu)); >> return 0; >> } > As the screenshot shows, the program executed the signal handler instead of > the rseq abort handler after the program was interrupted. > I am confused about how rseq deals with signal delivery, as I assumed the abort > handler will be triggered anyway when flags permit it. Could you please > explain > such two cases or could you please share any references (code, article, etc) > here? > Thanks in advance! > My virtual machine environment: >> $ cat /etc/os-release >> NAME="Ubuntu" >> VERSION="20.04.2 LTS (Focal Fossa)" >> ... 
>> $ uname -r >> 5.4.0-66-generic > Best regards, > Mingyi > From: Mathieu Desnoyers > Sent: Wednesday, April 7, 2021 9:25 AM > To: Liu, Mingyi > Cc: linux-kernel ; Peter Zijlstra > ; paulmck ; Boqun Feng > > Subject: Re: about seg fault inside rseq critical section > (re-sent in plain-text for lkml) > - On Apr 6, 2021, at 6:24 PM, Mingyi Liu mingyi...@gatech.edu wrote: > > Hi Mathieu, > > I notice that the program will be terminated with segmentation fault when > > it is > > even seg faulted inside the rseq critical section. In Linux kernel rseq.c, I > > haven't found comment or code regarding this. Could you share any > > references on > > why it doesn't execute user defined abort handler but uses the OS handler > > instead? > > Thanks in advance! > Hi Mingyi, > Please let me add the other rseq maintainers and LKML in CC. I'm a bit > stretched > on time > here, so maybe they will have time to answer before I do. > Meanwhile, if you could provide details about your architecture, kernel > .config, > and a > small reproducer program, it would help bootstrapping the discussion. > Thanks, > Mathieu > > Best, > > Mingyi > -- > Mathieu Desnoyers > EfficiOS Inc. > [ http://www.efficios.com/ | http://www.efficios.com ] -- Mathieu Desnoyers EfficiOS Inc. http://www.efficios.com
Re: about seg fault inside rseq critical section
(re-sent in plain-text for lkml) - On Apr 6, 2021, at 6:24 PM, Mingyi Liu mingyi...@gatech.edu wrote: > Hi Mathieu, > I notice that the program will be terminated with segmentation fault when it > is > even seg faulted inside the rseq critical section. In Linux kernel rseq.c, I > haven't found comment or code regarding this. Could you share any references > on > why it doesn't execute user defined abort handler but uses the OS handler > instead? > Thanks in advance! Hi Mingyi, Please let me add the other rseq maintainers and LKML in CC. I'm a bit stretched on time here, so maybe they will have time to answer before I do. Meanwhile, if you could provide details about your architecture, kernel .config, and a small reproducer program, it would help bootstrapping the discussion. Thanks, Mathieu > Best, > Mingyi -- Mathieu Desnoyers EfficiOS Inc. http://www.efficios.com
Re: [PATCH v2] ptrace: add PTRACE_GET_RSEQ_CONFIGURATION request
- On Mar 11, 2021, at 11:51 AM, Peter Zijlstra pet...@infradead.org wrote: > On Thu, Mar 11, 2021 at 09:51:56AM -0500, Mathieu Desnoyers wrote: >> >> >> - On Feb 26, 2021, at 8:51 AM, Piotr Figiel fig...@google.com wrote: >> >> > For userspace checkpoint and restore (C/R) a way of getting process state >> > containing RSEQ configuration is needed. >> > >> > There are two ways this information is going to be used: >> > - to re-enable RSEQ for threads which had it enabled before C/R >> > - to detect if a thread was in a critical section during C/R >> > >> > Since C/R preserves TLS memory and addresses RSEQ ABI will be restored >> > using the address registered before C/R. >> > >> > Detection whether the thread is in a critical section during C/R is needed >> > to enforce behavior of RSEQ abort during C/R. Attaching with ptrace() >> > before registers are dumped itself doesn't cause RSEQ abort. >> > Restoring the instruction pointer within the critical section is >> > problematic because rseq_cs may get cleared before the control is passed >> > to the migrated application code leading to RSEQ invariants not being >> > preserved. C/R code will use RSEQ ABI address to find the abort handler >> > to which the instruction pointer needs to be set. >> > >> > To achieve above goals expose the RSEQ ABI address and the signature value >> > with the new ptrace request PTRACE_GET_RSEQ_CONFIGURATION. >> > >> > This new ptrace request can also be used by debuggers so they are aware >> > of stops within restartable sequences in progress. >> > >> > Signed-off-by: Piotr Figiel >> > Reviewed-by: Michal Miroslaw >> >> Reviewed-by: Mathieu Desnoyers > > How do we route this? Do I stick this in tip/sched/core as being an rseq > patch? Sure, it's fine with me, thanks! Mathieu -- Mathieu Desnoyers EfficiOS Inc. http://www.efficios.com
Re: [PATCH v2] ptrace: add PTRACE_GET_RSEQ_CONFIGURATION request
- On Feb 26, 2021, at 8:51 AM, Piotr Figiel fig...@google.com wrote: > For userspace checkpoint and restore (C/R) a way of getting process state > containing RSEQ configuration is needed. > > There are two ways this information is going to be used: > - to re-enable RSEQ for threads which had it enabled before C/R > - to detect if a thread was in a critical section during C/R > > Since C/R preserves TLS memory and addresses RSEQ ABI will be restored > using the address registered before C/R. > > Detection whether the thread is in a critical section during C/R is needed > to enforce behavior of RSEQ abort during C/R. Attaching with ptrace() > before registers are dumped itself doesn't cause RSEQ abort. > Restoring the instruction pointer within the critical section is > problematic because rseq_cs may get cleared before the control is passed > to the migrated application code leading to RSEQ invariants not being > preserved. C/R code will use RSEQ ABI address to find the abort handler > to which the instruction pointer needs to be set. > > To achieve above goals expose the RSEQ ABI address and the signature value > with the new ptrace request PTRACE_GET_RSEQ_CONFIGURATION. > > This new ptrace request can also be used by debuggers so they are aware > of stops within restartable sequences in progress. > > Signed-off-by: Piotr Figiel > Reviewed-by: Michal Miroslaw Reviewed-by: Mathieu Desnoyers Thanks! 
Mathieu

> ---
> v2:
> Applied review comments:
>  - changed return value from the ptrace request to the size of the
>    configuration structure
>  - expanded configuration structure with the flags field and
>    the rseq abi structure size
>
> v1:
> https://lore.kernel.org/lkml/20210222100443.4155938-1-fig...@google.com/
>
> ---
>  include/uapi/linux/ptrace.h | 10 ++
>  kernel/ptrace.c             | 25 +
>  2 files changed, 35 insertions(+)
>
> diff --git a/include/uapi/linux/ptrace.h b/include/uapi/linux/ptrace.h
> index 83ee45fa634b..3747bf816f9a 100644
> --- a/include/uapi/linux/ptrace.h
> +++ b/include/uapi/linux/ptrace.h
> @@ -102,6 +102,16 @@ struct ptrace_syscall_info {
> 	};
> };
>
> +#define PTRACE_GET_RSEQ_CONFIGURATION	0x420f
> +
> +struct ptrace_rseq_configuration {
> +	__u64 rseq_abi_pointer;
> +	__u32 rseq_abi_size;
> +	__u32 signature;
> +	__u32 flags;
> +	__u32 pad;
> +};
> +
> /*
>  * These values are stored in task->ptrace_message
>  * by tracehook_report_syscall_* to describe the current syscall-stop.
> diff --git a/kernel/ptrace.c b/kernel/ptrace.c
> index 61db50f7ca86..76f09456ec4b 100644
> --- a/kernel/ptrace.c
> +++ b/kernel/ptrace.c
> @@ -31,6 +31,7 @@
> #include
> #include
> #include
> +#include
>
> #include	/* for syscall_get_* */
>
> @@ -779,6 +780,24 @@ static int ptrace_peek_siginfo(struct task_struct *child,
> 	return ret;
> }
>
> +#ifdef CONFIG_RSEQ
> +static long ptrace_get_rseq_configuration(struct task_struct *task,
> +					  unsigned long size, void __user *data)
> +{
> +	struct ptrace_rseq_configuration conf = {
> +		.rseq_abi_pointer = (u64)(uintptr_t)task->rseq,
> +		.rseq_abi_size = sizeof(*task->rseq),
> +		.signature = task->rseq_sig,
> +		.flags = 0,
> +	};
> +
> +	size = min_t(unsigned long, size, sizeof(conf));
> +	if (copy_to_user(data, &conf, size))
> +		return -EFAULT;
> +	return sizeof(conf);
> +}
> +#endif
> +
> #ifdef PTRACE_SINGLESTEP
> #define is_singlestep(request)	((request) == PTRACE_SINGLESTEP)
> #else
>
> @@ -1222,6 +1241,12 @@ int ptrace_request(struct task_struct *child, long request,
> 	ret = seccomp_get_metadata(child, addr, datavp);
> 	break;
>
> +#ifdef CONFIG_RSEQ
> +	case PTRACE_GET_RSEQ_CONFIGURATION:
> +		ret = ptrace_get_rseq_configuration(child, addr, datavp);
> +		break;
> +#endif
> +
> 	default:
> 		break;
> 	}
> --
> 2.30.1.766.gb4fecdf3b7-goog

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
[tip: sched/core] sched/membarrier: fix missing local execution of ipi_sync_rq_state()
The following commit has been merged into the sched/core branch of tip: Commit-ID: ce29ddc47b91f97e7f69a0fb7cbb5845f52a9825 Gitweb: https://git.kernel.org/tip/ce29ddc47b91f97e7f69a0fb7cbb5845f52a9825 Author:Mathieu Desnoyers AuthorDate:Wed, 17 Feb 2021 11:56:51 -05:00 Committer: Ingo Molnar CommitterDate: Sat, 06 Mar 2021 12:40:21 +01:00 sched/membarrier: fix missing local execution of ipi_sync_rq_state() The function sync_runqueues_membarrier_state() should copy the membarrier state from the @mm received as parameter to each runqueue currently running tasks using that mm. However, the use of smp_call_function_many() skips the current runqueue, which is unintended. Replace by a call to on_each_cpu_mask(). Fixes: 227a4aadc75b ("sched/membarrier: Fix p->mm->membarrier_state racy load") Reported-by: Nadav Amit Signed-off-by: Mathieu Desnoyers Signed-off-by: Peter Zijlstra (Intel) Signed-off-by: Ingo Molnar Cc: sta...@vger.kernel.org # 5.4.x+ Link: https://lore.kernel.org/r/74f1e842-4a84-47bf-b6c2-5407dfdd4...@gmail.com --- kernel/sched/membarrier.c | 4 +--- 1 file changed, 1 insertion(+), 3 deletions(-) diff --git a/kernel/sched/membarrier.c b/kernel/sched/membarrier.c index acdae62..b5add64 100644 --- a/kernel/sched/membarrier.c +++ b/kernel/sched/membarrier.c @@ -471,9 +471,7 @@ static int sync_runqueues_membarrier_state(struct mm_struct *mm) } rcu_read_unlock(); - preempt_disable(); - smp_call_function_many(tmpmask, ipi_sync_rq_state, mm, 1); - preempt_enable(); + on_each_cpu_mask(tmpmask, ipi_sync_rq_state, mm, true); free_cpumask_var(tmpmask); cpus_read_unlock();
Re: [PATCH] ptrace: add PTRACE_GET_RSEQ_CONFIGURATION request
- On Feb 26, 2021, at 9:11 AM, Piotr Figiel fig...@google.com wrote: > Hi, > > On Mon, Feb 22, 2021 at 09:53:17AM -0500, Mathieu Desnoyers wrote: > >> I notice that other structures defined in this UAPI header are not >> packed as well. Should we add an attribute packed on new structures ? >> It seems like it is generally a safer course of action, even though >> each field is naturally aligned here (there is no padding/hole in the >> structure). > > I considered this for quite a while. There are some gains for this > approach, i.e. it's safer towards the ISO C, as theoretically compiler > can generate arbitrary offsets as long as struct elements have correct > order in memory. > Also with packed attribute it would be harder to make it incorrect in > future modifications. > User code also could theoretically put the structure on any misaligned > address. > > But the drawback is that all accesses to the structure contents are > inefficient and some compilers may generate large chunks of code > whenever the structure elements are accessed (I recall at least one ARM > compiler which generates series of single-byte accesses for those). For > kernel it doesn't matter much because the structure type is used in one > place, but it may be different for the application code. > > The change would be also inconsistent with the rest of the file and IMO > the gains are only theoretical. > > If there are more opinions on this or you have some argument I'm missing > please let me know I can send v3 with packed and explicit padding > removed. I think this is rather borderline trade off. I personally don't have a strong opinion on this and completely agree with your analysis. Maybe for pre-existing system calls adding more non-packed structures might be kind-of OK if some were already exposed, even though it seems rather fragile wrt ISO C. Thanks, Mathieu > > Best regards and thanks for looking at this, > Piotr. -- Mathieu Desnoyers EfficiOS Inc. http://www.efficios.com
Re: [PATCH v2] ptrace: add PTRACE_GET_RSEQ_CONFIGURATION request
- On Feb 26, 2021, at 11:06 AM, Piotr Figiel fig...@google.com wrote:

> Hi,
>
> On Fri, Feb 26, 2021 at 10:32:35AM -0500, Mathieu Desnoyers wrote:
>> > +static long ptrace_get_rseq_configuration(struct task_struct *task,
>> > +					  unsigned long size, void __user *data)
>> > +{
>> > +	struct ptrace_rseq_configuration conf = {
>> > +		.rseq_abi_pointer = (u64)(uintptr_t)task->rseq,
>> > +		.rseq_abi_size = sizeof(*task->rseq),
>> > +		.signature = task->rseq_sig,
>> > +		.flags = 0,
>> > +	};
>> > +
>> > +	size = min_t(unsigned long, size, sizeof(conf));
>> > +	if (copy_to_user(data, &conf, size))
>> > +		return -EFAULT;
>> > +	return sizeof(conf);
>> > +}
>>
>> I think what Florian was after would be:
>>
>> struct ptrace_rseq_configuration {
>> 	__u32 size;	/* size of struct ptrace_rseq_configuration */
>> 	__u32 flags;
>> 	__u64 rseq_abi_pointer;
>> 	__u32 signature;
>> 	__u32 pad;
>> };
>>
>> where:
>>
>> 	.size = sizeof(struct ptrace_rseq_configuration),
>>
>> This way, the configuration structure can be expanded in the future. The
>> rseq ABI structure is by definition fixed-size, so there is no point in
>> having its size here.
>
> Still, the rseq syscall accepts the rseq ABI structure size as a parameter.
> I think this way the information returned from ptrace is consistent with
> the userspace view of the rseq state and allows expansion in case the
> ABI structure would have to be extended (in spite of its current
> definition).
>
> The configuration structure still can be expanded, as its size is
> reported to userspace as the return value from the request (in line with
> Dmitry's comments).

Fair enough. And now with the reply from Florian I see that I misunderstood
his point.

Thanks,

Mathieu

> Best regards, Piotr.

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
[tip: sched/urgent] sched/membarrier: fix missing local execution of ipi_sync_rq_state()
The following commit has been merged into the sched/urgent branch of tip: Commit-ID: fba111913e51a934eaad85734254eab801343836 Gitweb: https://git.kernel.org/tip/fba111913e51a934eaad85734254eab801343836 Author:Mathieu Desnoyers AuthorDate:Wed, 17 Feb 2021 11:56:51 -05:00 Committer: Peter Zijlstra CommitterDate: Mon, 01 Mar 2021 11:02:15 +01:00 sched/membarrier: fix missing local execution of ipi_sync_rq_state() The function sync_runqueues_membarrier_state() should copy the membarrier state from the @mm received as parameter to each runqueue currently running tasks using that mm. However, the use of smp_call_function_many() skips the current runqueue, which is unintended. Replace by a call to on_each_cpu_mask(). Fixes: 227a4aadc75b ("sched/membarrier: Fix p->mm->membarrier_state racy load") Reported-by: Nadav Amit Signed-off-by: Mathieu Desnoyers Signed-off-by: Peter Zijlstra (Intel) Cc: sta...@vger.kernel.org # 5.4.x+ Link: https://lore.kernel.org/r/74f1e842-4a84-47bf-b6c2-5407dfdd4...@gmail.com --- kernel/sched/membarrier.c | 4 +--- 1 file changed, 1 insertion(+), 3 deletions(-) diff --git a/kernel/sched/membarrier.c b/kernel/sched/membarrier.c index acdae62..b5add64 100644 --- a/kernel/sched/membarrier.c +++ b/kernel/sched/membarrier.c @@ -471,9 +471,7 @@ static int sync_runqueues_membarrier_state(struct mm_struct *mm) } rcu_read_unlock(); - preempt_disable(); - smp_call_function_many(tmpmask, ipi_sync_rq_state, mm, 1); - preempt_enable(); + on_each_cpu_mask(tmpmask, ipi_sync_rq_state, mm, true); free_cpumask_var(tmpmask); cpus_read_unlock();
Re: [PATCH v2] ptrace: add PTRACE_GET_RSEQ_CONFIGURATION request
- On Feb 26, 2021, at 11:04 AM, emmir em...@google.com wrote:

> On Fri, 26 Feb 2021 at 16:32, Mathieu Desnoyers wrote:
>>
>> - On Feb 26, 2021, at 8:51 AM, Piotr Figiel fig...@google.com wrote:
>> [...]
>> > ---
>> > v2:
>> > Applied review comments:
>> > - changed return value from the ptrace request to the size of the
>> >   configuration structure
>> > - expanded configuration structure with the flags field and
>> >   the rseq abi structure size
>> >
>> [...]
>> > +#define PTRACE_GET_RSEQ_CONFIGURATION	0x420f
>> > +
>> > +struct ptrace_rseq_configuration {
>> > +	__u64 rseq_abi_pointer;
>> > +	__u32 rseq_abi_size;
>> > +	__u32 signature;
>> > +	__u32 flags;
>> > +	__u32 pad;
>> > +};
>> > +
>> [...]
>> > +#ifdef CONFIG_RSEQ
>> > +static long ptrace_get_rseq_configuration(struct task_struct *task,
>> > +					  unsigned long size, void __user *data)
>> > +{
>> > +	struct ptrace_rseq_configuration conf = {
>> > +		.rseq_abi_pointer = (u64)(uintptr_t)task->rseq,
>> > +		.rseq_abi_size = sizeof(*task->rseq),
>> > +		.signature = task->rseq_sig,
>> > +		.flags = 0,
>> > +	};
>> > +
>> > +	size = min_t(unsigned long, size, sizeof(conf));
>> > +	if (copy_to_user(data, &conf, size))
>> > +		return -EFAULT;
>> > +	return sizeof(conf);
>> > +}
>>
>> I think what Florian was after would be:
>>
>> struct ptrace_rseq_configuration {
>> 	__u32 size;	/* size of struct ptrace_rseq_configuration */
>> 	__u32 flags;
>> 	__u64 rseq_abi_pointer;
>> 	__u32 signature;
>> 	__u32 pad;
>> };
>>
>> where:
>>
>> 	.size = sizeof(struct ptrace_rseq_configuration),
>>
>> This way, the configuration structure can be expanded in the future. The
>> rseq ABI structure is by definition fixed-size, so there is no point in
>> having its size here.
>>
>> Florian, did I understand your request correctly, or am I missing your
>> point?
>
> In this case returning sizeof(conf) would serve the same purpose, wouldn't it?
If the size is received as input from user-space as well, it can be used to
make sure the kernel detects what size is expected by user-space and acts
accordingly.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
Re: [PATCH v2] ptrace: add PTRACE_GET_RSEQ_CONFIGURATION request
- On Feb 26, 2021, at 8:51 AM, Piotr Figiel fig...@google.com wrote:

[...]
> ---
> v2:
> Applied review comments:
> - changed return value from the ptrace request to the size of the
>   configuration structure
> - expanded configuration structure with the flags field and
>   the rseq abi structure size
>
[...]
> +#define PTRACE_GET_RSEQ_CONFIGURATION	0x420f
> +
> +struct ptrace_rseq_configuration {
> +	__u64 rseq_abi_pointer;
> +	__u32 rseq_abi_size;
> +	__u32 signature;
> +	__u32 flags;
> +	__u32 pad;
> +};
> +
[...]
> +#ifdef CONFIG_RSEQ
> +static long ptrace_get_rseq_configuration(struct task_struct *task,
> +					  unsigned long size, void __user *data)
> +{
> +	struct ptrace_rseq_configuration conf = {
> +		.rseq_abi_pointer = (u64)(uintptr_t)task->rseq,
> +		.rseq_abi_size = sizeof(*task->rseq),
> +		.signature = task->rseq_sig,
> +		.flags = 0,
> +	};
> +
> +	size = min_t(unsigned long, size, sizeof(conf));
> +	if (copy_to_user(data, &conf, size))
> +		return -EFAULT;
> +	return sizeof(conf);
> +}

I think what Florian was after would be:

struct ptrace_rseq_configuration {
	__u32 size;	/* size of struct ptrace_rseq_configuration */
	__u32 flags;
	__u64 rseq_abi_pointer;
	__u32 signature;
	__u32 pad;
};

where:

	.size = sizeof(struct ptrace_rseq_configuration),

This way, the configuration structure can be expanded in the future. The
rseq ABI structure is by definition fixed-size, so there is no point in
having its size here.

Florian, did I understand your request correctly, or am I missing your point?

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
Re: [RFC PATCH 0/6] [RFC] Faultable tracepoints (v2)
- On Feb 24, 2021, at 1:14 PM, rostedt rost...@goodmis.org wrote: > On Wed, 24 Feb 2021 11:59:35 -0500 (EST) > Mathieu Desnoyers wrote: >> >> As a prototype solution, what I've done currently is to copy the user-space >> data into a kmalloc'd buffer in a preparation step before disabling >> preemption >> and copying data over into the per-cpu buffers. It works, but I think we >> should >> be able to do it without the needless copy. >> >> What I have in mind as an efficient solution (not implemented yet) for the >> LTTng >> kernel tracer goes as follows: >> >> #define COMMIT_LOCAL 0 >> #define COMMIT_REMOTE 1 >> >> - faultable probe is called from system call tracepoint [ >> preemption/blocking/migration is allowed ] >> - probe code calculate the length which needs to be reserved to store the >> event >> (e.g. user strlen), >> >> - preempt disable -> [ preemption/blocking/migration is not allowed from >> here ] >> - reserve_cpu = smp_processor_id() >> - reserve space in the ring buffer for reserve_cpu >> [ from that point on, we have _exclusive_ access to write into the >> ring buffer >> "slot" >> from any cpu until we commit. ] >> - preempt enable -> [ preemption/blocking/migration is allowed from here ] >> > > So basically the commit position here doesn't move until this task is > scheduled back in and the commit (remote or local) is updated. Indeed. > To put it in terms of the ftrace ring buffer, where we have both a commit > page and a commit index, and it only gets moved by the first one to start a > commit stack (that is, interrupts that interrupted a write will not > increment the commit). The tricky part for ftrace is its reliance on the fact that the concurrent users of the per-cpu ring buffer are all nested contexts. LTTng does not assume that and has been designed to be used both in kernel and user-space: lttng-modules and lttng-ust share a lot of ring buffer code. Therefore, LTTng's ring buffer supports preemption/migration of concurrent contexts. 
The fact that LTTng uses local-atomic-ops on its kernel ring buffers is just an optimization on an overall ring buffer design meant to allow preemption. > Now, I'm not sure how LTTng does it, but I could see issues for ftrace to > try to move the commit pointer (the pointer to the new commit page), as the > design is currently dependent on the fact that it can't happen while > commits are taken place. Indeed, what makes it easy for LTTng is because the ring buffer has been designed to support preemption/migration from the ground up. > Are the pages of the LTTng indexed by an array of pages? Yes, they are. Handling the initial page allocation and then the tracer copy of data to/from the ring buffer pages is the responsibility of the LTTng lib ring buffer "backend". The LTTng lib ring buffer backend is somewhat similar to a page table done in software, where the top level of the page table can be dynamically updated when doing flight recorder tracing. It is however completely separate from the space reservation/commit scheme which is handled by the lib ring buffer "frontend". The algorithm I described in my prior email is specifically targeted at the frontend layer, leaving the "backend" unchanged. For some reasons I suspect Ftrace ring buffer combined those two layers into a single algorithm, which may have its advantages, but seems to strengthen its dependency on only having nested contexts sharing a given per-cpu ring buffer. Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. http://www.efficios.com
Re: tasks-trace RCU: question about grace period forward progress
- On Feb 25, 2021, at 1:33 PM, paulmck paul...@kernel.org wrote: [...] > commit 581f79546b6be406a9c7280b2d3511b60821efe0 > Author: Paul E. McKenney > Date: Thu Feb 25 10:26:00 2021 -0800 > >rcu-tasks: Add block comment laying out RCU Tasks Trace design > >This commit adds a block comment that gives a high-level overview of >how RCU tasks trace grace periods progress. It also adds a note about >how exiting tasks are handles, plus it gives an overview of the memory handles -> handled >ordering. > >Reported-by: Peter Zijlstra >Reported-by: Mathieu Desnoyers >Signed-off-by: Paul E. McKenney > > diff --git a/kernel/rcu/tasks.h b/kernel/rcu/tasks.h > index 17c8ebe..f818357 100644 > --- a/kernel/rcu/tasks.h > +++ b/kernel/rcu/tasks.h > @@ -726,6 +726,42 @@ EXPORT_SYMBOL_GPL(show_rcu_tasks_rude_gp_kthread); > // flavors, rcu_preempt and rcu_sched. The fact that RCU Tasks Trace > // readers can operate from idle, offline, and exception entry/exit in no > // way allows rcu_preempt and rcu_sched readers to also do so. > +// > +// The implementation uses rcu_tasks_wait_gp(), which relies on function > +// pointers in the rcu_tasks structure. The rcu_spawn_tasks_trace_kthread() > +// function sets these function pointers up so that rcu_tasks_wait_gp() > +// invokes these functions in this order: > +// > +// rcu_tasks_trace_pregp_step(): > +// Initialize the count of readers and block CPU-hotplug operations. > +// rcu_tasks_trace_pertask(), invoked on every non-idle task: > +// Initialize per-task state and attempt to identify an immediate > +// quiescent state for that task, or, failing that, attempt to set > +// that task's .need_qs flag so that that task's next outermost > +// rcu_read_unlock_trace() will report the quiescent state (in which > +// case the count of readers is incremented). If both attempts fail, > +// the task is added to a "holdout" list. 
> +// rcu_tasks_trace_postscan(): > +// Initialize state and attempt to identify an immediate quiescent > +// state as above (but only for idle tasks), unblock CPU-hotplug > +// operations, and wait for an RCU grace period to avoid races with > +// tasks that are in the process of exiting. > +// check_all_holdout_tasks_trace(), repeatedly until holdout list is empty: > +// Scans the holdout list, attempting to identify a quiescent state > +// for each task on the list. If there is a quiescent state, the > +// corresponding task is removed from the holdout list. > +// rcu_tasks_trace_postgp(): > +// Wait for the count of readers do drop to zero, reporting any stalls. > +// Also execute full memory barriers to maintain ordering with code > +// executing after the grace period. > +// > +// The exit_tasks_rcu_finish_trace() synchronizes with exiting tasks. > +// > +// Pre-grace-period update-side code is ordered before the grace > +// period via the ->cbs_lock and barriers in rcu_tasks_kthread(). > +// Pre-grace-period read-side code is ordered before the grace period by > +// atomic_dec_and_test() of the count of readers (for IPIed readers) and by > +// scheduler context-switch ordering (for locked-down non-running readers). The rest looks good, thanks! Mathieu > > // The lockdep state must be outside of #ifdef to be useful. > #ifdef CONFIG_DEBUG_LOCK_ALLOC -- Mathieu Desnoyers EfficiOS Inc. http://www.efficios.com
Re: tasks-trace RCU: question about grace period forward progress
- On Feb 25, 2021, at 10:36 AM, paulmck paul...@kernel.org wrote: > On Thu, Feb 25, 2021 at 09:22:48AM -0500, Mathieu Desnoyers wrote: >> Hi Paul, >> >> Answering a question from Peter on IRC got me to look at >> rcu_read_lock_trace(), >> and I see this: >> >> static inline void rcu_read_lock_trace(void) >> { >> struct task_struct *t = current; >> >> WRITE_ONCE(t->trc_reader_nesting, READ_ONCE(t->trc_reader_nesting) + >> 1); >> barrier(); >> if (IS_ENABLED(CONFIG_TASKS_TRACE_RCU_READ_MB) && >> t->trc_reader_special.b.need_mb) >> smp_mb(); // Pairs with update-side barriers >> rcu_lock_acquire(_trace_lock_map); >> } >> >> static inline void rcu_read_unlock_trace(void) >> { >> int nesting; >> struct task_struct *t = current; >> >> rcu_lock_release(_trace_lock_map); >> nesting = READ_ONCE(t->trc_reader_nesting) - 1; >> barrier(); // Critical section before disabling. >> // Disable IPI-based setting of .need_qs. >> WRITE_ONCE(t->trc_reader_nesting, INT_MIN); >> if (likely(!READ_ONCE(t->trc_reader_special.s)) || nesting) { >> WRITE_ONCE(t->trc_reader_nesting, nesting); >> return; // We assume shallow reader nesting. >> } >> rcu_read_unlock_trace_special(t, nesting); >> } >> >> AFAIU, each thread keeps track of whether it is nested within a RCU read-side >> critical >> section with a counter, and grace periods iterate over all threads to make >> sure >> they >> are not within a read-side critical section before they can complete: >> >> # define rcu_tasks_trace_qs(t) \ >> do {\ >> if (!likely(READ_ONCE((t)->trc_reader_checked)) && \ >> !unlikely(READ_ONCE((t)->trc_reader_nesting))) {\ >> smp_store_release(&(t)->trc_reader_checked, true); \ >> smp_mb(); /* Readers partitioned by store. 
*/ \ >> } \ >> } while (0) >> >> It reminds me of the liburcu urcu-mb flavor which also deals with per-thread >> state to track whether threads are nested within a critical section: >> >> https://github.com/urcu/userspace-rcu/blob/master/include/urcu/static/urcu-mb.h#L90 >> https://github.com/urcu/userspace-rcu/blob/master/include/urcu/static/urcu-mb.h#L125 >> >> static inline void _urcu_mb_read_lock_update(unsigned long tmp) >> { >> if (caa_likely(!(tmp & URCU_GP_CTR_NEST_MASK))) { >> _CMM_STORE_SHARED(URCU_TLS(urcu_mb_reader).ctr, >> _CMM_LOAD_SHARED(urcu_mb_gp.ctr)); >> cmm_smp_mb(); >> } else >> _CMM_STORE_SHARED(URCU_TLS(urcu_mb_reader).ctr, tmp + >> URCU_GP_COUNT); >> } >> >> static inline void _urcu_mb_read_lock(void) >> { >> unsigned long tmp; >> >> urcu_assert(URCU_TLS(urcu_mb_reader).registered); >> cmm_barrier(); >> tmp = URCU_TLS(urcu_mb_reader).ctr; >> urcu_assert((tmp & URCU_GP_CTR_NEST_MASK) != URCU_GP_CTR_NEST_MASK); >> _urcu_mb_read_lock_update(tmp); >> } >> >> The main difference between the two algorithm is that task-trace within the >> kernel lacks the global "urcu_mb_gp.ctr" state snapshot, which is either >> incremented or flipped between 0 and 1 by the grace period. This allow RCU >> readers >> outermost nesting starting after the beginning of the grace period not to >> prevent >> progress of the grace period. >> >> Without this, a steady flow of incoming tasks-trace-RCU readers can prevent >> the >> grace period from ever completing. >> >> Or is this handled in a clever way that I am missing here ? > > There are several mechanisms designed to handle this. The following > paragraphs describe these at a high level. > > The trc_wait_for_one_reader() is invoked on each task. It uses the > try_invoke_on_locked_down_task(), which, if the task is currently not > running, keeps it that way and invokes trc_inspect_reader(). 
If the > locked-down task is in a read-side critical section, the need_qs field > is set, which will cause the task's next rcu_read_lock_trace() to report > the quiescent st
tasks-trace RCU: question about grace period forward progress
Hi Paul,

Answering a question from Peter on IRC got me to look at
rcu_read_lock_trace(), and I see this:

static inline void rcu_read_lock_trace(void)
{
	struct task_struct *t = current;

	WRITE_ONCE(t->trc_reader_nesting, READ_ONCE(t->trc_reader_nesting) + 1);
	barrier();
	if (IS_ENABLED(CONFIG_TASKS_TRACE_RCU_READ_MB) &&
	    t->trc_reader_special.b.need_mb)
		smp_mb(); // Pairs with update-side barriers
	rcu_lock_acquire(&rcu_trace_lock_map);
}

static inline void rcu_read_unlock_trace(void)
{
	int nesting;
	struct task_struct *t = current;

	rcu_lock_release(&rcu_trace_lock_map);
	nesting = READ_ONCE(t->trc_reader_nesting) - 1;
	barrier(); // Critical section before disabling.
	// Disable IPI-based setting of .need_qs.
	WRITE_ONCE(t->trc_reader_nesting, INT_MIN);
	if (likely(!READ_ONCE(t->trc_reader_special.s)) || nesting) {
		WRITE_ONCE(t->trc_reader_nesting, nesting);
		return;  // We assume shallow reader nesting.
	}
	rcu_read_unlock_trace_special(t, nesting);
}

AFAIU, each thread keeps track of whether it is nested within an RCU
read-side critical section with a counter, and grace periods iterate over
all threads to make sure they are not within a read-side critical section
before they can complete:

# define rcu_tasks_trace_qs(t)						\
	do {								\
		if (!likely(READ_ONCE((t)->trc_reader_checked)) &&	\
		    !unlikely(READ_ONCE((t)->trc_reader_nesting))) {	\
			smp_store_release(&(t)->trc_reader_checked, true); \
			smp_mb(); /* Readers partitioned by store. 
*/ \
		}							\
	} while (0)

It reminds me of the liburcu urcu-mb flavor, which also deals with
per-thread state to track whether threads are nested within a critical
section:

https://github.com/urcu/userspace-rcu/blob/master/include/urcu/static/urcu-mb.h#L90
https://github.com/urcu/userspace-rcu/blob/master/include/urcu/static/urcu-mb.h#L125

static inline void _urcu_mb_read_lock_update(unsigned long tmp)
{
	if (caa_likely(!(tmp & URCU_GP_CTR_NEST_MASK))) {
		_CMM_STORE_SHARED(URCU_TLS(urcu_mb_reader).ctr,
				  _CMM_LOAD_SHARED(urcu_mb_gp.ctr));
		cmm_smp_mb();
	} else
		_CMM_STORE_SHARED(URCU_TLS(urcu_mb_reader).ctr,
				  tmp + URCU_GP_COUNT);
}

static inline void _urcu_mb_read_lock(void)
{
	unsigned long tmp;

	urcu_assert(URCU_TLS(urcu_mb_reader).registered);
	cmm_barrier();
	tmp = URCU_TLS(urcu_mb_reader).ctr;
	urcu_assert((tmp & URCU_GP_CTR_NEST_MASK) != URCU_GP_CTR_NEST_MASK);
	_urcu_mb_read_lock_update(tmp);
}

The main difference between the two algorithms is that tasks-trace within
the kernel lacks the global "urcu_mb_gp.ctr" state snapshot, which is
either incremented or flipped between 0 and 1 by the grace period. This
allows RCU readers whose outermost nesting starts after the beginning of
the grace period not to prevent progress of the grace period.

Without this, a steady flow of incoming tasks-trace-RCU readers can prevent
the grace period from ever completing.

Or is this handled in a clever way that I am missing here?

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
Re: [RFC PATCH 0/6] [RFC] Faultable tracepoints (v2)
- Mathieu Desnoyers wrote:

> - On Feb 24, 2021, at 11:22 AM, Michael Jeanson mjean...@efficios.com wrote:
>
>> [ Adding Mathieu Desnoyers in CC ]
>>
>> On 2021-02-23 21 h 16, Steven Rostedt wrote:
>>> On Thu, 18 Feb 2021 17:21:19 -0500
>>> Michael Jeanson wrote:
>>>
>>>> This series only implements the tracepoint infrastructure required to
>>>> allow tracers to handle page faults. Modifying each tracer to handle
>>>> those page faults would be a next step after we all agree on this piece
>>>> of instrumentation infrastructure.
>>>
>>> I started taking a quick look at this, and came up with the question: how
>>> do you allow preemption when dealing with per-cpu buffers or storage to
>>> record the data?
>>>
>>> That is, perf, bpf and ftrace are all using some kind of per-cpu data, and
>>> this is the reason for the need to disable preemption. What's the solution
>>> that LTTng is using for this? I know it has per-cpu buffers too, but does
>>> it have some kind of "per task" buffer that is being used to extract the
>>> data that can fault?
>
> As a prototype solution, what I've done currently is to copy the user-space
> data into a kmalloc'd buffer in a preparation step before disabling
> preemption and copying data over into the per-cpu buffers. It works, but I
> think we should be able to do it without the needless copy.
>
> What I have in mind as an efficient solution (not implemented yet) for the
> LTTng kernel tracer goes as follows:
>
> #define COMMIT_LOCAL 0
> #define COMMIT_REMOTE 1
>
> - faultable probe is called from system call tracepoint
>   [ preemption/blocking/migration is allowed ]
> - probe code calculates the length which needs to be reserved to store the
>   event (e.g. user strlen),
>
> - preempt disable -> [ preemption/blocking/migration is not allowed from here ]
>   - reserve_cpu = smp_processor_id()
>   - reserve space in the ring buffer for reserve_cpu
>     [ from that point on, we have _exclusive_ access to write into the ring
>       buffer "slot" from any cpu until we commit. ]
> - preempt enable -> [ preemption/blocking/migration is allowed from here ]
>
> - copy data from user-space to the ring buffer "slot",
>
> - preempt disable -> [ preemption/blocking/migration is not allowed from here ]
>   commit_cpu = smp_processor_id()
>   if (commit_cpu == reserve_cpu)
>     use local_add to increment the
>       buf[commit_cpu].subbuffer[current].commit_count[COMMIT_LOCAL]
>   else
>     use atomic_add to increment the
>       buf[commit_cpu].subbuffer[current].commit_count[COMMIT_REMOTE]

The line above should increment reserve_cpu's buffer commit count, of course.

Thanks,

Mathieu

> - preempt enable -> [ preemption/blocking/migration is allowed from here ]
>
> Given that lttng uses per-buffer/per-sub-buffer commit counters as simple
> free-running accumulators, the trick here is to use two commit counters
> rather than a single one for each sub-buffer. Whenever we need to read a
> commit count value, we always sum the total of the LOCAL and REMOTE counters.
>
> This allows dealing with migration between reserve and commit without
> requiring the overhead of an atomic operation on the fast-path (LOCAL case).
>
> I had to design this kind of dual-counter trick in the context of user-space
> use of restartable sequences. It looks like it may have a role to play in
> the kernel as well. :)
>
> Or am I missing something important that would not survive real-life ?
>
> Thanks,
>
> Mathieu
>
> --
> Mathieu Desnoyers
> EfficiOS Inc.
> http://www.efficios.com

--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
Re: [RFC PATCH 0/6] [RFC] Faultable tracepoints (v2)
- On Feb 24, 2021, at 11:22 AM, Michael Jeanson mjean...@efficios.com wrote:

> [ Adding Mathieu Desnoyers in CC ]
>
> On 2021-02-23 21 h 16, Steven Rostedt wrote:
>> On Thu, 18 Feb 2021 17:21:19 -0500
>> Michael Jeanson wrote:
>>
>>> This series only implements the tracepoint infrastructure required to
>>> allow tracers to handle page faults. Modifying each tracer to handle
>>> those page faults would be a next step after we all agree on this piece
>>> of instrumentation infrastructure.
>>
>> I started taking a quick look at this, and came up with the question: how
>> do you allow preemption when dealing with per-cpu buffers or storage to
>> record the data?
>>
>> That is, perf, bpf and ftrace are all using some kind of per-cpu data, and
>> this is the reason for the need to disable preemption. What's the solution
>> that LTTng is using for this? I know it has per-cpu buffers too, but does
>> it have some kind of "per task" buffer that is being used to extract the
>> data that can fault?

As a prototype solution, what I've done currently is to copy the user-space
data into a kmalloc'd buffer in a preparation step before disabling preemption
and copying data over into the per-cpu buffers. It works, but I think we
should be able to do it without the needless copy.

What I have in mind as an efficient solution (not implemented yet) for the
LTTng kernel tracer goes as follows:

#define COMMIT_LOCAL 0
#define COMMIT_REMOTE 1

- faultable probe is called from system call tracepoint
  [ preemption/blocking/migration is allowed ]
- probe code calculates the length which needs to be reserved to store the
  event (e.g. user strlen),

- preempt disable -> [ preemption/blocking/migration is not allowed from here ]
  - reserve_cpu = smp_processor_id()
  - reserve space in the ring buffer for reserve_cpu
    [ from that point on, we have _exclusive_ access to write into the ring
      buffer "slot" from any cpu until we commit. ]
- preempt enable -> [ preemption/blocking/migration is allowed from here ]

- copy data from user-space to the ring buffer "slot",

- preempt disable -> [ preemption/blocking/migration is not allowed from here ]
  commit_cpu = smp_processor_id()
  if (commit_cpu == reserve_cpu)
    use local_add to increment the
      buf[commit_cpu].subbuffer[current].commit_count[COMMIT_LOCAL]
  else
    use atomic_add to increment the
      buf[commit_cpu].subbuffer[current].commit_count[COMMIT_REMOTE]
- preempt enable -> [ preemption/blocking/migration is allowed from here ]

Given that lttng uses per-buffer/per-sub-buffer commit counters as simple
free-running accumulators, the trick here is to use two commit counters
rather than a single one for each sub-buffer. Whenever we need to read a
commit count value, we always sum the total of the LOCAL and REMOTE counters.

This allows dealing with migration between reserve and commit without
requiring the overhead of an atomic operation on the fast-path (LOCAL case).

I had to design this kind of dual-counter trick in the context of user-space
use of restartable sequences. It looks like it may have a role to play in the
kernel as well. :)

Or am I missing something important that would not survive real-life ?

Thanks,

Mathieu

--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
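[Editor's note] The dual-counter scheme above can be sketched in a few lines of userspace C. The names `subbuffer`, `commit_record` and `commit_count` are made up for this sketch; the kernel version would use `local_add()` on per-CPU buffers rather than a plain add, and the atomic op would target the *reserving* CPU's buffer, per the correction in the follow-up mail:

```c
#include <stdatomic.h>

/* Two free-running commit accumulators per sub-buffer, as in the email:
 * LOCAL is bumped with a cheap plain add when the committing CPU is also
 * the reserving CPU (the fast path), REMOTE with an atomic add when a
 * migration happened between reserve and commit. */
struct subbuffer {
	unsigned long commit_local;   /* local_add() stand-in: owner CPU only */
	atomic_ulong  commit_remote;  /* atomic_add() stand-in: any CPU */
};

static void commit_record(struct subbuffer *sb, int reserve_cpu,
			  int commit_cpu, unsigned long len)
{
	if (commit_cpu == reserve_cpu)
		sb->commit_local += len;                    /* COMMIT_LOCAL */
	else
		atomic_fetch_add(&sb->commit_remote, len);  /* COMMIT_REMOTE */
}

/* Readers never look at one counter alone: the committed position is
 * always the sum of the LOCAL and REMOTE accumulators. */
static unsigned long commit_count(struct subbuffer *sb)
{
	return sb->commit_local + atomic_load(&sb->commit_remote);
}
```

Because both accumulators only ever grow and readers always sum them, it does not matter which of the two a given commit landed in; migration merely moves some increments from the cheap counter to the atomic one.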
Re: [PATCH] ptrace: add PTRACE_GET_RSEQ_CONFIGURATION request
- On Feb 22, 2021, at 5:04 AM, Piotr Figiel fig...@google.com wrote:

> For userspace checkpoint and restore (C/R) a way of getting process state
> containing RSEQ configuration is needed.
>
> There are two ways this information is going to be used:
> - to re-enable RSEQ for threads which had it enabled before C/R
> - to detect if a thread was in a critical section during C/R
>
> Since C/R preserves TLS memory and addresses, RSEQ ABI will be restored
> using the address registered before C/R.
>
> Detection whether the thread is in a critical section during C/R is needed
> to enforce behavior of RSEQ abort during C/R. Attaching with ptrace()
> before registers are dumped itself doesn't cause RSEQ abort.
> Restoring the instruction pointer within the critical section is
> problematic because rseq_cs may get cleared before the control is passed
> to the migrated application code leading to RSEQ invariants not being
> preserved. C/R code will use RSEQ ABI address to find the abort handler
> to which the instruction pointer needs to be set.
>
> To achieve above goals expose the RSEQ ABI address and the signature value
> with the new ptrace request PTRACE_GET_RSEQ_CONFIGURATION.
>
> This new ptrace request can also be used by debuggers so they are aware
> of stops within restartable sequences in progress.
>
> Signed-off-by: Piotr Figiel
> Reviewed-by: Michal Miroslaw
>
> ---
>  include/uapi/linux/ptrace.h | 8
>  kernel/ptrace.c             | 23 +++
>  2 files changed, 31 insertions(+)
>
> diff --git a/include/uapi/linux/ptrace.h b/include/uapi/linux/ptrace.h
> index 83ee45fa634b..d54cf6b6ce7c 100644
> --- a/include/uapi/linux/ptrace.h
> +++ b/include/uapi/linux/ptrace.h
> @@ -102,6 +102,14 @@ struct ptrace_syscall_info {
>  	};
>  };
>
> +#define PTRACE_GET_RSEQ_CONFIGURATION	0x420f
> +
> +struct ptrace_rseq_configuration {
> +	__u64 rseq_abi_pointer;
> +	__u32 signature;
> +	__u32 pad;
> +};

I notice that other structures defined in this UAPI header are not packed as
well. Should we add an attribute packed on new structures ? It seems like it
is generally a safer course of action, even though each field is naturally
aligned here (there is no padding/hole in the structure).

> +
>  /*
>   * These values are stored in task->ptrace_message
>   * by tracehook_report_syscall_* to describe the current syscall-stop.
> diff --git a/kernel/ptrace.c b/kernel/ptrace.c
> index 61db50f7ca86..a936af66cf6f 100644
> --- a/kernel/ptrace.c
> +++ b/kernel/ptrace.c
> @@ -31,6 +31,7 @@
>  #include
>  #include
>  #include
> +#include
>
>  #include	/* for syscall_get_* */
>
> @@ -779,6 +780,22 @@ static int ptrace_peek_siginfo(struct task_struct *child,
>  	return ret;
>  }
>
> +#ifdef CONFIG_RSEQ
> +static long ptrace_get_rseq_configuration(struct task_struct *task,
> +					  unsigned long size, void __user *data)
> +{
> +	struct ptrace_rseq_configuration conf = {
> +		.rseq_abi_pointer = (u64)(uintptr_t)task->rseq,
> +		.signature = task->rseq_sig,
> +	};
> +
> +	size = min_t(unsigned long, size, sizeof(conf));
> +	if (copy_to_user(data, &conf, size))
> +		return -EFAULT;
> +	return size;

See other email about returning 0 here.

Thanks,

Mathieu

> +
>  	default:
>  		break;
>  	}
> --
> 2.30.0.617.g56c4b15f3c-goog

--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
Re: [PATCH] ptrace: add PTRACE_GET_RSEQ_CONFIGURATION request
- On Feb 22, 2021, at 6:57 AM, Dmitry V. Levin l...@altlinux.org wrote:

> On Mon, Feb 22, 2021 at 11:04:43AM +0100, Piotr Figiel wrote:
> [...]
>> --- a/include/uapi/linux/ptrace.h
>> +++ b/include/uapi/linux/ptrace.h
>> @@ -102,6 +102,14 @@ struct ptrace_syscall_info {
>>  	};
>>  };
>>
>> +#define PTRACE_GET_RSEQ_CONFIGURATION	0x420f
>> +
>> +struct ptrace_rseq_configuration {
>> +	__u64 rseq_abi_pointer;
>> +	__u32 signature;
>> +	__u32 pad;
>> +};
>> +
>>  /*
>>   * These values are stored in task->ptrace_message
>>   * by tracehook_report_syscall_* to describe the current syscall-stop.
>> diff --git a/kernel/ptrace.c b/kernel/ptrace.c
>> index 61db50f7ca86..a936af66cf6f 100644
>> --- a/kernel/ptrace.c
>> +++ b/kernel/ptrace.c
>> @@ -31,6 +31,7 @@
>>  #include
>>  #include
>>  #include
>> +#include
>>
>>  #include	/* for syscall_get_* */
>>
>> @@ -779,6 +780,22 @@ static int ptrace_peek_siginfo(struct task_struct *child,
>>  	return ret;
>>  }
>>
>> +#ifdef CONFIG_RSEQ
>> +static long ptrace_get_rseq_configuration(struct task_struct *task,
>> +					  unsigned long size, void __user *data)
>> +{
>> +	struct ptrace_rseq_configuration conf = {
>> +		.rseq_abi_pointer = (u64)(uintptr_t)task->rseq,
>> +		.signature = task->rseq_sig,
>> +	};
>> +
>> +	size = min_t(unsigned long, size, sizeof(conf));
>> +	if (copy_to_user(data, &conf, size))
>> +		return -EFAULT;
>> +	return size;
>> +}
>> +#endif
>
> From an API perspective I suggest for such interfaces to return the amount
> of data that could have been written if there was enough room specified,
> e.g. in this case it's sizeof(conf) instead of size.

Looking at the ptrace(2) man page:

RETURN VALUE
       On success, the PTRACE_PEEK* requests return the requested data (but
       see NOTES), the PTRACE_SECCOMP_GET_FILTER request returns the number
       of instructions in the BPF program, and other requests return zero.

       On error, all requests return -1, and errno is set appropriately.
       Since the value returned by a successful PTRACE_PEEK* request may be
       -1, the caller must clear errno before the call, and then check it
       afterward to determine whether or not an error occurred.

It looks like the usual behavior for ptrace requests would be to return 0
when everything is OK. Unless there is a strong motivation for doing
differently for this new request, I would be tempted to use the same expected
behavior as other requests on success: return 0.

Unless there is a strong motivation for returning either size or
sizeof(conf) ? If we return sizeof(conf) to user-space, it means it should
check it and deal with the size mismatch. Is that size ever expected to
change ?

Thanks,

Mathieu

> --
> ldv

--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
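[Editor's note] The trade-off Dmitry is suggesting can be illustrated generically: if the kernel copies min(user size, kernel size) bytes but returns its *full* struct size, an old caller with a short buffer can detect that newer fields exist. Everything below (`conf_v2`, `fill_conf`, the field values) is invented to show the pattern; it is not the ptrace code:

```c
#include <string.h>

/* A v2 of some hypothetical configuration struct: a newer kernel knows
 * about more fields than an old userspace caller does. */
struct conf_v2 {
	unsigned long abi_pointer;
	unsigned long flags;   /* field an old caller does not know about */
};

/* Copy what fits, but return the full kernel-side size (the suggested
 * convention), rather than the truncated byte count or plain 0. */
static long fill_conf(void *dst, unsigned long user_size)
{
	struct conf_v2 conf = { .abi_pointer = 0x1000, .flags = 1 };
	unsigned long n = user_size < sizeof(conf) ? user_size : sizeof(conf);

	memcpy(dst, &conf, n);
	return (long)sizeof(conf);
}
```

The cost, as Mathieu notes, is that every caller now has to compare the return value against its own notion of the struct size; returning 0 on success keeps the ABI simpler if the size is never expected to change.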
lttng-trace: a new strace-alike LTTng command
Hi there!

I just pushed a new command line tool named `lttng-trace`, which allows
tracing a specific command and its sub-processes using the LTTng system call
tracer. It focuses on simplicity. It can be found here:

Git website: http://git.lttng.org/?p=lttng-trace.git
Clone with:  git clone https://git.lttng.org/lttng-trace.git

For instance, tracing the command "date" is as simple as:

  lttng-trace date

The tracing session can be optionally configured by using the usual lttng
control interfaces after this message appears:

  [lttng-trace] Tracing session `date-20210218-170545` created. It can be
  customized using the `lttng` command.
  [lttng-trace] Press key when ready to run the child process.

After execution of the command, the following message details how to view the
trace:

  [lttng-trace] Sub-process hierarchy traced successfully. View trace with
  `babeltrace2 /tmp/lttng-trace/date-20210218-170906`

See `lttng-trace --help` for more options.

This new command is in early alpha stage. Feedback is welcome! For instance,
we are wondering whether the default behavior should just be tracing system
calls, or if some level of user-space tracing should be enabled by default as
well. Or if some new options should enable "typical" usage scenarios.

Thanks,

Mathieu

--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com