Re: [PATCH 2/4] modules: Ensure 64-bit alignment on __ksymtab_* sections
Hi Luis,

On 12/22/23 21:10, Luis Chamberlain wrote:
> On Fri, Dec 22, 2023 at 01:13:26PM +0100, Helge Deller wrote:
>> On 12/22/23 06:59, Luis Chamberlain wrote:
>>> On Wed, Nov 22, 2023 at 11:18:12PM +0100, del...@kernel.org wrote:
>>>> On 64-bit architectures without CONFIG_HAVE_ARCH_PREL32_RELOCATIONS
>>>> (e.g. ppc64, ppc64le, parisc, s390x, ...) the __KSYM_REF() macro
>>>> stores 64-bit pointers into the __ksymtab* sections.
>>>> Make sure that those sections will be correctly aligned at module
>>>> link time, otherwise unaligned memory accesses may happen at runtime.
...
>> So, honestly I don't see a real reason why it shouldn't be applied...
>
> Like I said, you Cc'd stable as a fix,

I added "Cc: sta...@vger.kernel.org" on the patch itself, so *if* the
patch had been applied by you, it would later have ended up in the
stable kernel series too. But I did not Cc the stable mailing list
directly, so my patch was never sent to that mailing list.

> as a maintainer it is my job to verify how critical this is and ask
> for more details about how you found it and evaluate the real impact.
> Even if it was not a stable fix I tend to ask this for patches, even
> if they are trivial.
...
> OK, can you extend the patch below with something like:
>
>   perf stat --repeat 100 --pre 'modprobe -r b a b c' -- ./tools/testing/selftests/module/find_symbol.sh
>
> And test before and after? I ran a simple test as-is and the data I
> get is within noise, and so I think we need the --repeat 100 thing.

Your selftest code is based on perf. AFAICS we don't have perf on
parisc/hppa, so I can't test your selftest code on that architecture.

I assume you tested on x86, where the CPU transparently takes care of
unaligned accesses. This is probably why the results are within the
noise. But on some platforms the CPU raises an exception on unaligned
accesses and jumps into special exception-handler assembler code inside
the kernel. This is much more expensive than on x86, which is why on
parisc we keep counters of how often this exception handler is called:

  IRQ:        CPU0        CPU1
    3:        1332           0  SuperIO  ttyS0
    7:     1270013           0  SuperIO  pata_ns87415
   64:   320023012   320021431  CPU timer
   65:    17080507    20624423  CPU IPI
  UAH:    10948640       58104  Unaligned access handler traps

This "UAH" field could theoretically be used to extend your selftest.
But is it really worth it? The outcome is very much architecture and
CPU specific, maybe it's just within the noise as you measured.

IMHO we should always try to natively align structures, and if we see
we got it wrong in kernel code, we should fix it. My patches just fix
those memory sections where we use inline assembly (instead of C) and
thus failed to provide the correct alignments.

Helge

---
before:

sudo ./tools/testing/selftests/module/find_symbol.sh

 Performance counter stats for '/sbin/modprobe test_kallsyms_b':

        81,956,206 ns   duration_time
        81,883,000 ns   system_time
               210      page-faults

       0.081956206 seconds time elapsed
       0.0         seconds user
       0.081883000 seconds sys

 Performance counter stats for '/sbin/modprobe test_kallsyms_b':

        85,960,863 ns   duration_time
        84,679,000 ns   system_time
               212      page-faults

       0.085960863 seconds time elapsed
       0.0         seconds user
       0.084679000 seconds sys

 Performance counter stats for '/sbin/modprobe test_kallsyms_b':

        86,484,868 ns   duration_time
        86,541,000 ns   system_time
               213      page-faults

       0.086484868 seconds time elapsed
       0.0         seconds user
       0.086541000 seconds sys

---
After your modules alignment fix:

sudo ./tools/testing/selftests/module/find_symbol.sh

 Performance counter stats for '/sbin/modprobe test_kallsyms_b':

        83,579,980 ns   duration_time
        83,530,000 ns   system_time
               212      page-faults

       0.083579980 seconds time elapsed
       0.0         seconds user
       0.083530000 seconds sys

 Performance counter stats for '/sbin/modprobe test_kallsyms_b':

        70,721,786 ns   duration_time
        69,289,000 ns   system_time
               211      page-faults

       0.070721786 seconds time elapsed
       0.0         seconds user
       0.069289000 seconds sys

 Performance counter stats for '/sbin/modprobe test_kallsyms_b':

        76,513,219 ns   duration_time
        76,381,000 ns   system_time
               214      page-faults

       0.076513219 seconds time elapsed
       0.0         seconds user
       0.076381000 seconds sys
Re: [PATCH next 2/5] locking/osq_lock: Avoid dirtying the local cpu's 'node' in the osq_lock() fast path.
On 12/29/23 17:11, David Laight wrote:
> osq_lock() starts by setting node->next to NULL and node->locked to 0.
> Careful analysis shows that node->next is always NULL on entry.
>
> node->locked is set non-zero by another cpu to force a wakeup.
> This can only happen after the 'prev->next = node' assignment, so
> locked can be set to zero just before that (along with the assignment
> to node->prev).
>
> Only initialise node->cpu once, after that use its value instead of
> smp_processor_id() - which is probably a real function call.
>
> Should reduce cache-line bouncing a little.
>
> Signed-off-by: David Laight
> ---
> Re-send without the 'RE:' on the subject line.
>
>  kernel/locking/osq_lock.c | 13 ++++++-------
>  1 file changed, 6 insertions(+), 7 deletions(-)
>
> diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
> index d414eef4bec6..55f5db896c02 100644
> --- a/kernel/locking/osq_lock.c
> +++ b/kernel/locking/osq_lock.c
> @@ -51,7 +51,7 @@ osq_wait_next(struct optimistic_spin_queue *lock,
>                 struct optimistic_spin_node *prev)
>  {
>          struct optimistic_spin_node *next = NULL;
> -        int curr = encode_cpu(smp_processor_id());
> +        int curr = node->cpu;
>          int old;
>
>          /*
> @@ -98,12 +98,10 @@ bool osq_lock(struct optimistic_spin_queue *lock)
>  {
>          struct optimistic_spin_node *node = this_cpu_ptr(&osq_node);
>          struct optimistic_spin_node *prev, *next;
> -        int curr = encode_cpu(smp_processor_id());
>          int old;
>
> -        node->locked = 0;
> -        node->next = NULL;

I have some concern about not clearing node->next at the beginning. I
know that next is supposed to be cleared at the end. I do have some
worry that there may exist a race condition that leaves next not
cleared before it is used again. So you either have to prove that such
a race does not exist, or keep initializing it to NULL at the
beginning like it is today.

Cheers,
Longman

> -        node->cpu = curr;
> +        if (unlikely(node->cpu == OSQ_UNLOCKED_VAL))
> +                node->cpu = encode_cpu(smp_processor_id());
>
>          /*
>           * We need both ACQUIRE (pairs with corresponding RELEASE in
> @@ -111,12 +109,13 @@ bool osq_lock(struct optimistic_spin_queue *lock)
>           * the node fields we just initialised) semantics when updating
>           * the lock tail.
>           */
> -        old = atomic_xchg(&lock->tail, curr);
> +        old = atomic_xchg(&lock->tail, node->cpu);
>          if (old == OSQ_UNLOCKED_VAL)
>                  return true;
>
>          prev = decode_cpu(old);
>          node->prev = prev;
> +        node->locked = 0;
>
>          /*
>           * osq_lock() unqueue
> @@ -214,7 +213,7 @@ bool osq_lock(struct optimistic_spin_queue *lock)
>  void osq_unlock(struct optimistic_spin_queue *lock)
>  {
>          struct optimistic_spin_node *node, *next;
> -        int curr = encode_cpu(smp_processor_id());
> +        int curr = raw_cpu_read(osq_node.cpu);
>
>          /*
>           * Fast path for the uncontended case.
Re: [PATCH next 5/5] locking/osq_lock: Optimise vcpu_is_preempted() check.
On 12/29/23 15:58, David Laight wrote:
> The vcpu_is_preempted() test stops osq_lock() spinning if a virtual
> cpu is no longer running. Although patched out for bare-metal the code
> still needs the cpu number. Reading this from 'prev->cpu' is pretty
> much guaranteed to cause a cache miss when osq_unlock() is waking up
> the next cpu.
>
> Instead save 'prev->cpu' in 'node->prev_cpu' and use that value
> instead. Update in the osq_lock() 'unqueue' path when 'node->prev' is
> changed. This is simpler than checking for 'node->prev' changing and
> caching 'prev->cpu'.
>
> Signed-off-by: David Laight
> ---
>  kernel/locking/osq_lock.c | 14 ++++++--------
>  1 file changed, 6 insertions(+), 8 deletions(-)
>
> diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
> index b60b0add0161..89be63627434 100644
> --- a/kernel/locking/osq_lock.c
> +++ b/kernel/locking/osq_lock.c
> @@ -14,8 +14,9 @@
>  struct optimistic_spin_node {
>          struct optimistic_spin_node *self, *next, *prev;
> -        int locked; /* 1 if lock acquired */
> -        int cpu; /* encoded CPU # + 1 value */
> +        int locked;   /* 1 if lock acquired */
> +        int cpu;      /* encoded CPU # + 1 value */
> +        int prev_cpu; /* actual CPU # for vpcu_is_preempted() */
>  };
>
>  static DEFINE_PER_CPU_SHARED_ALIGNED(struct optimistic_spin_node, osq_node);
> @@ -29,11 +30,6 @@ static inline int encode_cpu(int cpu_nr)
>          return cpu_nr + 1;
>  }
>
> -static inline int node_cpu(struct optimistic_spin_node *node)
> -{
> -        return node->cpu - 1;
> -}
> -
>  static inline struct optimistic_spin_node *decode_cpu(int encoded_cpu_val)
>  {
>          int cpu_nr = encoded_cpu_val - 1;
> @@ -114,6 +110,7 @@ bool osq_lock(struct optimistic_spin_queue *lock)
>          if (old == OSQ_UNLOCKED_VAL)
>                  return true;
>
> +        node->prev_cpu = old - 1;
>          prev = decode_cpu(old);
>          node->prev = prev;
>          node->locked = 0;
> @@ -148,7 +145,7 @@ bool osq_lock(struct optimistic_spin_queue *lock)
>           * polling, be careful.
>           */
>          if (smp_cond_load_relaxed(&node->locked, VAL || need_resched() ||
> -                                  vcpu_is_preempted(node_cpu(node->prev))))
> +                                  vcpu_is_preempted(node->prev_cpu)))
>                  return true;
>
>          /* unqueue */
> @@ -205,6 +202,7 @@ bool osq_lock(struct optimistic_spin_queue *lock)
>           * it will wait in Step-A.
>           */
>
> +        WRITE_ONCE(next->prev_cpu, prev->cpu - 1);
>          WRITE_ONCE(next->prev, prev);
>          WRITE_ONCE(prev->next, next);

Reviewed-by: Waiman Long
Re: [PATCH next 4/5] locking/osq_lock: Optimise per-cpu data accesses.
On 12/29/23 15:57, David Laight wrote:
> this_cpu_ptr() is rather more expensive than raw_cpu_read() since the
> latter can use an 'offset from register' (%gs for x86-64).
>
> Add a 'self' field to 'struct optimistic_spin_node' that can be read
> with raw_cpu_read(), initialise on first call.
>
> Signed-off-by: David Laight
> ---
>  kernel/locking/osq_lock.c | 14 +++++++++-----
>  1 file changed, 9 insertions(+), 5 deletions(-)
>
> diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
> index 9bb3a077ba92..b60b0add0161 100644
> --- a/kernel/locking/osq_lock.c
> +++ b/kernel/locking/osq_lock.c
> @@ -13,7 +13,7 @@
>   */
>  struct optimistic_spin_node {
> -        struct optimistic_spin_node *next, *prev;
> +        struct optimistic_spin_node *self, *next, *prev;
>          int locked; /* 1 if lock acquired */
>          int cpu; /* encoded CPU # + 1 value */
>  };
> @@ -93,12 +93,16 @@ osq_wait_next(struct optimistic_spin_queue *lock,
>  bool osq_lock(struct optimistic_spin_queue *lock)
>  {
> -        struct optimistic_spin_node *node = this_cpu_ptr(&osq_node);
> +        struct optimistic_spin_node *node = raw_cpu_read(osq_node.self);

My gcc 11 compiler produces the following x86-64 code:

  92              struct optimistic_spin_node *node = this_cpu_ptr(&osq_node);
     0x0029 <+25>:        mov    %rcx,%rdx
     0x002c <+28>:        add    %gs:0x0(%rip),%rdx        # 0x34

which looks pretty optimized to me. Maybe an older compiler may
generate more complex code. However, I do have some doubt as to the
benefit of this patch at the expense of making the code a bit more
complex.

Cheers,
Longman
Re: [PATCH next 3/5] locking/osq_lock: Clarify osq_wait_next()
On 12/29/23 15:56, David Laight wrote:
> osq_wait_next() is passed 'prev' from osq_lock() and NULL from
> osq_unlock() but only needs the 'cpu' value to write to lock->tail.
> Just pass prev->cpu or OSQ_UNLOCKED_VAL instead.
>
> Also directly return NULL or 'next' instead of breaking the loop.
>
> Should have no effect on the generated code since gcc manages to
> assume that 'prev != NULL' due to an earlier dereference.
>
> Signed-off-by: David Laight
> ---
>  kernel/locking/osq_lock.c | 23 +++++++++++-------------
>  1 file changed, 10 insertions(+), 13 deletions(-)
>
> diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
> index 55f5db896c02..9bb3a077ba92 100644
> --- a/kernel/locking/osq_lock.c
> +++ b/kernel/locking/osq_lock.c
> @@ -48,18 +48,17 @@ static inline struct optimistic_spin_node *decode_cpu(int encoded_cpu_val)
>  static inline struct optimistic_spin_node *
>  osq_wait_next(struct optimistic_spin_queue *lock,
>                struct optimistic_spin_node *node,
> -              struct optimistic_spin_node *prev)
> +              int old)

Make the last argument name more descriptive, like "old_cpu", as the
"int" type does not provide enough context to allow people to guess
what "old" may be.

Cheers,
Longman
Re: [PATCH next 1/5] locking/osq_lock: Move the definition of optimistic_spin_node into osf_lock.c
On 12/29/23 15:53, David Laight wrote:
> struct optimistic_spin_node is private to the implementation.
> Move it into the C file to ensure nothing is accessing it.
>
> Signed-off-by: David Laight
> ---
>  include/linux/osq_lock.h  | 5 -----
>  kernel/locking/osq_lock.c | 7 +++++++
>  2 files changed, 7 insertions(+), 5 deletions(-)
>
> diff --git a/include/linux/osq_lock.h b/include/linux/osq_lock.h
> index 5581dbd3bd34..ea8fb31379e3 100644
> --- a/include/linux/osq_lock.h
> +++ b/include/linux/osq_lock.h
> @@ -6,11 +6,6 @@
>   * An MCS like lock especially tailored for optimistic spinning for sleeping
>   * lock implementations (mutex, rwsem, etc).
>   */
> -struct optimistic_spin_node {
> -        struct optimistic_spin_node *next, *prev;
> -        int locked; /* 1 if lock acquired */
> -        int cpu; /* encoded CPU # + 1 value */
> -};
>
>  struct optimistic_spin_queue {
>          /*
> diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
> index d5610ad52b92..d414eef4bec6 100644
> --- a/kernel/locking/osq_lock.c
> +++ b/kernel/locking/osq_lock.c
> @@ -11,6 +11,13 @@
>   * called from interrupt context and we have preemption disabled while
>   * spinning.
>   */
> +
> +struct optimistic_spin_node {
> +        struct optimistic_spin_node *next, *prev;
> +        int locked; /* 1 if lock acquired */
> +        int cpu; /* encoded CPU # + 1 value */
> +};
> +
>  static DEFINE_PER_CPU_SHARED_ALIGNED(struct optimistic_spin_node, osq_node);
>
>  /*

Please correct the patch title: "osf_lock.c" => "osq_lock.c".

After the fix, you can add

Acked-by: Waiman Long
Re: [PATCH next 3/5] locking/osq_lock: Clarify osq_wait_next()
On Fri, 29 Dec 2023 at 12:56, David Laight wrote:
>
> osq_wait_next() is passed 'prev' from osq_lock() and NULL from osq_unlock()
> but only needs the 'cpu' value to write to lock->tail.
> Just pass prev->cpu or OSQ_UNLOCKED_VAL instead.
>
> Also directly return NULL or 'next' instead of breaking the loop.

Please split these two totally independent things out of the patch,
just to make things much more obvious. I like the new calling
convention, but I don't like how the patch isn't obviously just that.

In fact, I'd take your patch #1 and just the calling convention change
from #3 as "these are obviously not changing anything at all, only
moving things to more local places".

I'd also take the other part of #3 as a "clearly doesn't change
anything", but it should be a separate patch, and it should be done
differently: make 'next' be local to just *inside* the for-loop (in
fact, make it local to the if-statement that sets it), to clarify the
whole thing that it can never be non-NULL at the top of the loop, and
can never have any long-term semantics.

The other parts actually change some logic, and would need the OSQ
people to take a more serious look.

Linus
[PATCH next 2/5] locking/osq_lock: Avoid dirtying the local cpu's 'node' in the osq_lock() fast path.
osq_lock() starts by setting node->next to NULL and node->locked to 0.
Careful analysis shows that node->next is always NULL on entry.

node->locked is set non-zero by another cpu to force a wakeup.
This can only happen after the 'prev->next = node' assignment, so
locked can be set to zero just before that (along with the assignment
to node->prev).

Only initialise node->cpu once, after that use its value instead of
smp_processor_id() - which is probably a real function call.

Should reduce cache-line bouncing a little.

Signed-off-by: David Laight
---
Re-send without the 'RE:' on the subject line.

 kernel/locking/osq_lock.c | 13 ++++++-------
 1 file changed, 6 insertions(+), 7 deletions(-)

diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
index d414eef4bec6..55f5db896c02 100644
--- a/kernel/locking/osq_lock.c
+++ b/kernel/locking/osq_lock.c
@@ -51,7 +51,7 @@ osq_wait_next(struct optimistic_spin_queue *lock,
               struct optimistic_spin_node *prev)
 {
        struct optimistic_spin_node *next = NULL;
-       int curr = encode_cpu(smp_processor_id());
+       int curr = node->cpu;
        int old;
 
        /*
@@ -98,12 +98,10 @@ bool osq_lock(struct optimistic_spin_queue *lock)
 {
        struct optimistic_spin_node *node = this_cpu_ptr(&osq_node);
        struct optimistic_spin_node *prev, *next;
-       int curr = encode_cpu(smp_processor_id());
        int old;
 
-       node->locked = 0;
-       node->next = NULL;
-       node->cpu = curr;
+       if (unlikely(node->cpu == OSQ_UNLOCKED_VAL))
+               node->cpu = encode_cpu(smp_processor_id());
 
        /*
         * We need both ACQUIRE (pairs with corresponding RELEASE in
@@ -111,12 +109,13 @@ bool osq_lock(struct optimistic_spin_queue *lock)
         * the node fields we just initialised) semantics when updating
         * the lock tail.
         */
-       old = atomic_xchg(&lock->tail, curr);
+       old = atomic_xchg(&lock->tail, node->cpu);
        if (old == OSQ_UNLOCKED_VAL)
                return true;
 
        prev = decode_cpu(old);
        node->prev = prev;
+       node->locked = 0;
 
        /*
         * osq_lock() unqueue
@@ -214,7 +213,7 @@ bool osq_lock(struct optimistic_spin_queue *lock)
 void osq_unlock(struct optimistic_spin_queue *lock)
 {
        struct optimistic_spin_node *node, *next;
-       int curr = encode_cpu(smp_processor_id());
+       int curr = raw_cpu_read(osq_node.cpu);
 
        /*
         * Fast path for the uncontended case.
-- 
2.17.1

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
Re: [PATCH] ftrace: Fix modification of direct_function hash while in use
Masami and Jiri,

This patch made it through all my tests. If I can get an Acked-by by
Sunday, I'll include it in my push to Linus (I have a couple of other
fixes to send him).

-- Steve

On Fri, 29 Dec 2023 11:51:34 -0500 Steven Rostedt wrote:

> From: "Steven Rostedt (Google)"
>
> Masami Hiramatsu reported a memory leak in register_ftrace_direct()
> where if the number of new entries added is large enough to cause two
> allocations in the loop:
>
>         for (i = 0; i < size; i++) {
>                 hlist_for_each_entry(entry, &hash->buckets[i], hlist) {
>                         new = ftrace_add_rec_direct(entry->ip, addr, &free_hash);
>                         if (!new)
>                                 goto out_remove;
>                         entry->direct = addr;
>                 }
>         }
>
> Where ftrace_add_rec_direct() has:
>
>         if (ftrace_hash_empty(direct_functions) ||
>             direct_functions->count > 2 * (1 << direct_functions->size_bits)) {
>                 struct ftrace_hash *new_hash;
>                 int size = ftrace_hash_empty(direct_functions) ? 0 :
>                         direct_functions->count + 1;
>
>                 if (size < 32)
>                         size = 32;
>
>                 new_hash = dup_hash(direct_functions, size);
>                 if (!new_hash)
>                         return NULL;
>
>                 *free_hash = direct_functions;
>                 direct_functions = new_hash;
>         }
>
> The "*free_hash = direct_functions;" can happen twice, losing the
> previous allocation of direct_functions.
>
> But this also exposed a more serious bug.
>
> The modification of direct_functions above is not safe. As
> direct_functions can be referenced at any time to find what direct
> caller it should call, the time between:
>
>         new_hash = dup_hash(direct_functions, size);
> and
>         direct_functions = new_hash;
>
> can have a race with another CPU (or even this one if it gets
> interrupted), and the entries being moved to the new hash are not
> referenced.
>
> That's because the "dup_hash()" is really misnamed and is really a
> "move_hash()". It moves the entries from the old hash to the new one.
>
> Now even if that was changed, this code is not proper as
> direct_functions should not be updated until the end. That is the
> best way to handle function reference changes, and is the way other
> parts of ftrace handle this.
>
> The following is done:
>
> 1. Change add_hash_entry() to return the entry it created and
>    inserted into the hash, and not just return success or not.
>
> 2. Replace ftrace_add_rec_direct() with add_hash_entry(), and remove
>    the former.
>
> 3. Allocate a "new_hash" at the start that is made for holding both
>    the new hash entries as well as the existing entries in
>    direct_functions.
>
> 4. Copy (not move) the direct_function entries over to the new_hash.
>
> 5. Copy the entries of the added hash to the new_hash.
>
> 6. If everything succeeds, then use rcu_assign_pointer() to update
>    the direct_functions with the new_hash.
>
> This simplifies the code and fixes both the memory leak as well as
> the race condition mentioned above.
>
> Link: https://lore.kernel.org/all/170368070504.42064.8960569647118388081.stgit@devnote2/
>
> Cc: sta...@vger.kernel.org
> Fixes: 763e34e74bb7d ("ftrace: Add register_ftrace_direct()")
> Signed-off-by: Steven Rostedt (Google)
> ---
>  kernel/trace/ftrace.c | 100 --
>  1 file changed, 47 insertions(+), 53 deletions(-)
>
> diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c
> index 8de8bec5f366..b01ae7d36021 100644
> --- a/kernel/trace/ftrace.c
> +++ b/kernel/trace/ftrace.c
> @@ -1183,18 +1183,19 @@ static void __add_hash_entry(struct ftrace_hash *hash,
>          hash->count++;
>  }
>
> -static int add_hash_entry(struct ftrace_hash *hash, unsigned long ip)
> +static struct ftrace_func_entry *
> +add_hash_entry(struct ftrace_hash *hash, unsigned long ip)
>  {
>          struct ftrace_func_entry *entry;
>
>          entry = kmalloc(sizeof(*entry), GFP_KERNEL);
>          if (!entry)
> -                return -ENOMEM;
> +                return NULL;
>
>          entry->ip = ip;
>          __add_hash_entry(hash, entry);
>
> -        return 0;
> +        return entry;
>  }
>
>  static void
> @@ -1349,7 +1350,6 @@ alloc_and_copy_ftrace_hash(int size_bits, struct ftrace_hash *hash)
>          struct ftrace_func_entry *entry;
>          struct ftrace_hash *new_hash;
>          int size;
> -        int ret;
>          int i;
>
>          new_hash = alloc_ftrace_hash(size_bits);
> @@ -1366,8 +1366,7 @@ alloc_and_copy_ftrace_hash(int size_bits, struct ftrace_hash *hash)
>          size = 1 << hash->size_bits;
>          for (i = 0; i < size; i++) {
>                  hlist_for_each_entry(entry, &hash->buckets[i], hlist) {
> -                        ret = add_hash_entry(new_hash, entry->ip);
> -
[PATCH next 1/5] locking/osq_lock: Move the definition of optimistic_spin_node into osf_lock.c
struct optimistic_spin_node is private to the implementation.
Move it into the C file to ensure nothing is accessing it.

Signed-off-by: David Laight
---
 include/linux/osq_lock.h  | 5 -----
 kernel/locking/osq_lock.c | 7 +++++++
 2 files changed, 7 insertions(+), 5 deletions(-)

diff --git a/include/linux/osq_lock.h b/include/linux/osq_lock.h
index 5581dbd3bd34..ea8fb31379e3 100644
--- a/include/linux/osq_lock.h
+++ b/include/linux/osq_lock.h
@@ -6,11 +6,6 @@
  * An MCS like lock especially tailored for optimistic spinning for sleeping
  * lock implementations (mutex, rwsem, etc).
  */
-struct optimistic_spin_node {
-       struct optimistic_spin_node *next, *prev;
-       int locked; /* 1 if lock acquired */
-       int cpu; /* encoded CPU # + 1 value */
-};
 
 struct optimistic_spin_queue {
        /*
diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
index d5610ad52b92..d414eef4bec6 100644
--- a/kernel/locking/osq_lock.c
+++ b/kernel/locking/osq_lock.c
@@ -11,6 +11,13 @@
 * called from interrupt context and we have preemption disabled while
 * spinning.
 */
+
+struct optimistic_spin_node {
+       struct optimistic_spin_node *next, *prev;
+       int locked; /* 1 if lock acquired */
+       int cpu; /* encoded CPU # + 1 value */
+};
+
 static DEFINE_PER_CPU_SHARED_ALIGNED(struct optimistic_spin_node, osq_node);
 
 /*
-- 
2.17.1
[PATCH next 0/5] locking/osq_lock: Optimisations to osq_lock code
Zeng Heng noted that heavy use of the osq (optimistic spin queue) code
used rather more cpu than might be expected. See:
https://lore.kernel.org/lkml/202312210155.wc2huk8c-...@intel.com/T/#mcc46eedd1ef22a0d668828b1d088508c9b1875b8

Part of the problem is that there is a pretty much guaranteed cache
line reload reading node->prev->cpu for the vcpu_is_preempted() check
(even on bare metal) in the wakeup path, which slows it down.
(On bare metal the hypervisor call is patched out, but the argument is
still read.)

Careful analysis shows that it isn't necessary to dirty the per-cpu
data on the fast-path osq_lock() path. This may be slightly beneficial.

The code also uses this_cpu_ptr() to get the address of the per-cpu
data. On x86-64 (at least) this is implemented as:

        &per_cpu_data[smp_processor_id()]->member

ie there is a real function call, an array index and an add. However,
if raw_cpu_read() can be used (which is typically just an offset from
a register - %gs for x86-64) the code will be faster. Putting the
address of the per-cpu node into itself means that only one cache line
need be loaded.

I can't see a list of per-cpu data initialisation functions, so the
fields are initialised on the first osq_lock() call.

The last patch avoids the cache line reload calling
vcpu_is_preempted() by simply saving node->prev->cpu as node->prev_cpu
and updating it when node->prev changes. This is simpler than the
patch proposed by Waiman.

David Laight (5):
  Move the definition of optimistic_spin_node into osf_lock.c
  Avoid dirtying the local cpu's 'node' in the osq_lock() fast path.
  Clarify osq_wait_next()
  Optimise per-cpu data accesses.
  Optimise vcpu_is_preempted() check.

 include/linux/osq_lock.h  |  5 ----
 kernel/locking/osq_lock.c | 61 +++++++++++++++++++++++----------------
 2 files changed, 33 insertions(+), 33 deletions(-)

-- 
2.17.1
[PATCH next 5/5] locking/osq_lock: Optimise vcpu_is_preempted() check.
The vcpu_is_preempted() test stops osq_lock() spinning if a virtual
cpu is no longer running. Although patched out for bare-metal the code
still needs the cpu number. Reading this from 'prev->cpu' is pretty
much guaranteed to cause a cache miss when osq_unlock() is waking up
the next cpu.

Instead save 'prev->cpu' in 'node->prev_cpu' and use that value
instead. Update in the osq_lock() 'unqueue' path when 'node->prev' is
changed. This is simpler than checking for 'node->prev' changing and
caching 'prev->cpu'.

Signed-off-by: David Laight
---
 kernel/locking/osq_lock.c | 14 ++++++--------
 1 file changed, 6 insertions(+), 8 deletions(-)

diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
index b60b0add0161..89be63627434 100644
--- a/kernel/locking/osq_lock.c
+++ b/kernel/locking/osq_lock.c
@@ -14,8 +14,9 @@
 struct optimistic_spin_node {
        struct optimistic_spin_node *self, *next, *prev;
-       int locked; /* 1 if lock acquired */
-       int cpu; /* encoded CPU # + 1 value */
+       int locked;   /* 1 if lock acquired */
+       int cpu;      /* encoded CPU # + 1 value */
+       int prev_cpu; /* actual CPU # for vpcu_is_preempted() */
 };
 
 static DEFINE_PER_CPU_SHARED_ALIGNED(struct optimistic_spin_node, osq_node);
@@ -29,11 +30,6 @@ static inline int encode_cpu(int cpu_nr)
        return cpu_nr + 1;
 }
 
-static inline int node_cpu(struct optimistic_spin_node *node)
-{
-       return node->cpu - 1;
-}
-
 static inline struct optimistic_spin_node *decode_cpu(int encoded_cpu_val)
 {
        int cpu_nr = encoded_cpu_val - 1;
@@ -114,6 +110,7 @@ bool osq_lock(struct optimistic_spin_queue *lock)
        if (old == OSQ_UNLOCKED_VAL)
                return true;
 
+       node->prev_cpu = old - 1;
        prev = decode_cpu(old);
        node->prev = prev;
        node->locked = 0;
@@ -148,7 +145,7 @@ bool osq_lock(struct optimistic_spin_queue *lock)
         * polling, be careful.
         */
        if (smp_cond_load_relaxed(&node->locked, VAL || need_resched() ||
-                                 vcpu_is_preempted(node_cpu(node->prev))))
+                                 vcpu_is_preempted(node->prev_cpu)))
                return true;
 
        /* unqueue */
@@ -205,6 +202,7 @@ bool osq_lock(struct optimistic_spin_queue *lock)
         * it will wait in Step-A.
         */
 
+       WRITE_ONCE(next->prev_cpu, prev->cpu - 1);
        WRITE_ONCE(next->prev, prev);
        WRITE_ONCE(prev->next, next);
-- 
2.17.1
[PATCH next 3/5] locking/osq_lock: Clarify osq_wait_next()
osq_wait_next() is passed 'prev' from osq_lock() and NULL from
osq_unlock() but only needs the 'cpu' value to write to lock->tail.
Just pass prev->cpu or OSQ_UNLOCKED_VAL instead.

Also directly return NULL or 'next' instead of breaking the loop.

Should have no effect on the generated code since gcc manages to
assume that 'prev != NULL' due to an earlier dereference.

Signed-off-by: David Laight
---
 kernel/locking/osq_lock.c | 23 +++++++++++-------------
 1 file changed, 10 insertions(+), 13 deletions(-)

diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
index 55f5db896c02..9bb3a077ba92 100644
--- a/kernel/locking/osq_lock.c
+++ b/kernel/locking/osq_lock.c
@@ -48,18 +48,17 @@ static inline struct optimistic_spin_node *decode_cpu(int encoded_cpu_val)
 static inline struct optimistic_spin_node *
 osq_wait_next(struct optimistic_spin_queue *lock,
              struct optimistic_spin_node *node,
-             struct optimistic_spin_node *prev)
+             int old)
 {
-       struct optimistic_spin_node *next = NULL;
+       struct optimistic_spin_node *next;
        int curr = node->cpu;
-       int old;
 
        /*
-        * If there is a prev node in queue, then the 'old' value will be
-        * the prev node's CPU #, else it's set to OSQ_UNLOCKED_VAL since if
-        * we're currently last in queue, then the queue will then become empty.
+        * If osq_lock() is being cancelled there must be a previous node
+        * and 'old' is its CPU #.
+        * For osq_unlock() there is never a previous node and old is set
+        * to OSQ_UNLOCKED_VAL.
         */
-       old = prev ? prev->cpu : OSQ_UNLOCKED_VAL;
 
        for (;;) {
                if (atomic_read(&lock->tail) == curr &&
@@ -69,7 +68,7 @@ osq_wait_next(struct optimistic_spin_queue *lock,
                         * will now observe @lock and will complete its
                         * unlock()/unqueue().
                         */
-                       break;
+                       return NULL;
                }
 
                /*
@@ -85,13 +84,11 @@ osq_wait_next(struct optimistic_spin_queue *lock,
                if (node->next) {
                        next = xchg(&node->next, NULL);
                        if (next)
-                               break;
+                               return next;
                }
 
                cpu_relax();
        }
-
-       return next;
 }
 
 bool osq_lock(struct optimistic_spin_queue *lock)
@@ -192,7 +189,7 @@ bool osq_lock(struct optimistic_spin_queue *lock)
         * back to @prev.
         */
 
-       next = osq_wait_next(lock, node, prev);
+       next = osq_wait_next(lock, node, prev->cpu);
        if (!next)
                return false;
 
@@ -232,7 +229,7 @@ void osq_unlock(struct optimistic_spin_queue *lock)
                return;
        }
 
-       next = osq_wait_next(lock, node, NULL);
+       next = osq_wait_next(lock, node, OSQ_UNLOCKED_VAL);
        if (next)
                WRITE_ONCE(next->locked, 1);
 }
-- 
2.17.1
[PATCH next 4/5] locking/osq_lock: Optimise per-cpu data accesses.
this_cpu_ptr() is rather more expensive than raw_cpu_read() since the latter can use an 'offset from register' (%gs for x86-64). Add a 'self' field to 'struct optimistic_spin_node' that can be read with raw_cpu_read(), and initialise it on the first call. Signed-off-by: David Laight --- kernel/locking/osq_lock.c | 14 +- 1 file changed, 9 insertions(+), 5 deletions(-) diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c index 9bb3a077ba92..b60b0add0161 100644 --- a/kernel/locking/osq_lock.c +++ b/kernel/locking/osq_lock.c @@ -13,7 +13,7 @@ */ struct optimistic_spin_node { - struct optimistic_spin_node *next, *prev; + struct optimistic_spin_node *self, *next, *prev; int locked; /* 1 if lock acquired */ int cpu; /* encoded CPU # + 1 value */ }; @@ -93,12 +93,16 @@ osq_wait_next(struct optimistic_spin_queue *lock, bool osq_lock(struct optimistic_spin_queue *lock) { - struct optimistic_spin_node *node = this_cpu_ptr(&osq_node); + struct optimistic_spin_node *node = raw_cpu_read(osq_node.self); struct optimistic_spin_node *prev, *next; int old; - if (unlikely(node->cpu == OSQ_UNLOCKED_VAL)) - node->cpu = encode_cpu(smp_processor_id()); + if (unlikely(!node)) { + int cpu = encode_cpu(smp_processor_id()); + node = decode_cpu(cpu); + node->self = node; + node->cpu = cpu; + } /* * We need both ACQUIRE (pairs with corresponding RELEASE in @@ -222,7 +226,7 @@ void osq_unlock(struct optimistic_spin_queue *lock) /* * Second most likely case. */ - node = this_cpu_ptr(&osq_node); + node = raw_cpu_read(osq_node.self); next = xchg(&node->next, NULL); if (next) { WRITE_ONCE(next->locked, 1); -- 2.17.1
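A small user-space sketch of the encoding this patch relies on: osq stores "CPU # + 1" in lock->tail so the value 0 (OSQ_UNLOCKED_VAL) stays free to mean "queue empty / no owner". The helper names below only mirror the kernel's; the kernel's decode_cpu() additionally turns the number back into a per-CPU node pointer, so this is just an illustration of the arithmetic:

```c
#include <assert.h>

#define OSQ_UNLOCKED_VAL 0

/* Encode a CPU number as "cpu + 1" so 0 can mean "unlocked". */
static int osq_encode_cpu(int cpu_nr)
{
	return cpu_nr + 1;
}

/* Inverse of the above; the kernel's decode_cpu() also resolves the
 * number to a per-CPU optimistic_spin_node pointer. */
static int osq_decode_cpu_nr(int encoded_cpu_val)
{
	return encoded_cpu_val - 1;
}
```

This is also why the patch can use "node == NULL" as the not-yet-initialised test: a valid encoded value is never 0.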
Re: [PATCH 1/5] PM: domains: Add helper functions to attach/detach multiple PM domains
On 12/28/2023 3:41 AM, Ulf Hansson wrote: Attaching/detaching of a device to multiple PM domains has started to become a common operation for many drivers, typically during ->probe() and ->remove(). In most cases, this has led to lots of boilerplate code in the drivers. To fix up the situation, let's introduce a pair of helper functions, dev_pm_domain_attach|detach_list(), that drivers can use instead of the open-coding. Note that it seems reasonable to limit the support for these helpers to DT based platforms, as it's the only valid use case for now. Signed-off-by: Ulf Hansson --- drivers/base/power/common.c | 133 include/linux/pm_domain.h | 38 +++ 2 files changed, 171 insertions(+) diff --git a/drivers/base/power/common.c b/drivers/base/power/common.c index 44ec20918a4d..1ef51889fc6f 100644 --- a/drivers/base/power/common.c +++ b/drivers/base/power/common.c @@ -167,6 +167,114 @@ struct device *dev_pm_domain_attach_by_name(struct device *dev, } EXPORT_SYMBOL_GPL(dev_pm_domain_attach_by_name); +/** + * dev_pm_domain_attach_list - Associate a device with its PM domains. + * @dev: The device to look up the PM domains for. + * @data: The data used for attaching to the PM domains. + * @list: An out-parameter with an allocated list of attached PM domains. + * + * This function helps to attach a device to its multiple PM domains. The + * caller, which is typically a driver's probe function, may provide a list of + * names for the PM domains that we should try to attach the device to, but it + * may also provide an empty list, in case the attach should be done for all of + * the available PM domains. + * + * Callers must ensure proper synchronization of this function with power + * management callbacks. + * + * Returns the number of attached PM domains or a negative error code in case of + * a failure. Note that, to detach the list of PM domains, the driver shall call + * dev_pm_domain_detach_list(), typically during the remove phase. 
+ */ +int dev_pm_domain_attach_list(struct device *dev, + const struct dev_pm_domain_attach_data *data, + struct dev_pm_domain_list **list) +{ + struct device_node *np = dev->of_node; + struct dev_pm_domain_list *pds; + struct device *pd_dev = NULL; + int ret, i, num_pds = 0; + bool by_id = true; + u32 link_flags = data && data->pd_flags & PD_FLAG_NO_DEV_LINK ? 0 : + DL_FLAG_STATELESS | DL_FLAG_PM_RUNTIME; + + if (dev->pm_domain) + return -EEXIST; + + /* For now this is limited to OF based platforms. */ + if (!np) + return 0; + + if (data && data->pd_names) { + num_pds = data->num_pd_names; + by_id = false; + } else { + num_pds = of_count_phandle_with_args(np, "power-domains", +"#power-domain-cells"); + } + + if (num_pds <= 0) + return 0; + + pds = devm_kzalloc(dev, sizeof(*pds), GFP_KERNEL); + if (!pds) + return -ENOMEM; + + pds->pd_devs = devm_kcalloc(dev, num_pds, sizeof(*pds->pd_devs), + GFP_KERNEL); + if (!pds->pd_devs) + return -ENOMEM; + + pds->pd_links = devm_kcalloc(dev, num_pds, sizeof(*pds->pd_links), +GFP_KERNEL); + if (!pds->pd_links) + return -ENOMEM; + + if (link_flags && data->pd_flags & PD_FLAG_DEV_LINK_ON) Since data is optional, this check results in crash if data is NULL. Thanks + link_flags |= DL_FLAG_RPM_ACTIVE; + + for (i = 0; i < num_pds; i++) { + if (by_id) + pd_dev = dev_pm_domain_attach_by_id(dev, i); + else + pd_dev = dev_pm_domain_attach_by_name(dev, + data->pd_names[i]); + if (IS_ERR_OR_NULL(pd_dev)) { + ret = pd_dev ? PTR_ERR(pd_dev) : -ENODEV; + goto err_attach; + } + + if (link_flags) { + struct device_link *link; + + link = device_link_add(dev, pd_dev, link_flags); + if (!link) { + ret = -ENODEV; + goto err_link; + } + + pds->pd_links[i] = link; + } + + pds->pd_devs[i] = pd_dev; + } + + pds->num_pds = num_pds; + *list = pds; + return num_pds; + +err_link: + dev_pm_domain_detach(pd_dev, true); +err_attach: + while (--i >= 0) { + if (pds->pd_links[i]) + device_link_del(pds->pd_links[i]); + dev_pm_domain_detach(pds->pd_devs[i
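The crash the reviewer points out comes from dereferencing the optional 'data' argument without a NULL test: link_flags is non-zero whenever data is NULL, so "data->pd_flags" is then a NULL dereference. A user-space sketch of the guarded form (the flag value here is made up for illustration; only the added NULL test matters):

```c
#include <assert.h>
#include <stddef.h>

#define PD_FLAG_DEV_LINK_ON (1u << 1)	/* illustrative value, not necessarily the kernel's */

struct dev_pm_domain_attach_data {
	unsigned int pd_flags;
};

/* The check the reviewer flagged, with the missing NULL test added:
 * "data" is optional, so it must be tested before dereferencing
 * data->pd_flags. */
static unsigned int want_rpm_active(const struct dev_pm_domain_attach_data *data,
				    unsigned int link_flags)
{
	if (link_flags && data && (data->pd_flags & PD_FLAG_DEV_LINK_ON))
		return 1;
	return 0;
}

static struct dev_pm_domain_attach_data example_data = {
	.pd_flags = PD_FLAG_DEV_LINK_ON,
};
```

With the guard, a NULL data simply means "no DL_FLAG_RPM_ACTIVE", matching the function's documented behaviour for callers that pass no attach data.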
Re: [RFC][PATCH 0/2] ring-buffer: Allow user space memory mapping
On Fri, 29 Dec 2023 13:40:50 -0500 Steven Rostedt wrote: > I'm sending this to a wider audience, as I want to hear more > feedback on this before I accept it. > I forgot to mention that this can be applied on top of: git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace.git for-next -- Steve
[RFC][PATCH 2/2] tracing: Allow user-space mapping of the ring-buffer
From: Vincent Donnefort Currently, user-space extracts data from the ring-buffer via splice, which is handy for storage or network sharing. However, due to splice limitations, it is impossible to do real-time analysis without a copy. A solution for that problem is to let the user-space map the ring-buffer directly. The mapping is exposed via the per-CPU file trace_pipe_raw. The first element of the mapping is the meta-page. It is followed by each subbuffer constituting the ring-buffer, ordered by their unique page ID: * Meta-page -- include/uapi/linux/trace_mmap.h for a description * Subbuf ID 0 * Subbuf ID 1 ... It is therefore easy to translate a subbuf ID into an offset in the mapping: reader_id = meta->reader->id; reader_offset = meta->meta_page_size + reader_id * meta->subbuf_size; When new data is available, the mapper must call a newly introduced ioctl: TRACE_MMAP_IOCTL_GET_READER. This will update the Meta-page reader ID to point to the next reader containing unread data. Link: https://lore.kernel.org/linux-trace-kernel/20231221173523.3015715-3-vdonnef...@google.com Signed-off-by: Vincent Donnefort Signed-off-by: Steven Rostedt (Google) --- include/uapi/linux/trace_mmap.h | 2 + kernel/trace/trace.c | 79 +++-- 2 files changed, 78 insertions(+), 3 deletions(-) diff --git a/include/uapi/linux/trace_mmap.h b/include/uapi/linux/trace_mmap.h index f950648b0ba9..8c49489c5867 100644 --- a/include/uapi/linux/trace_mmap.h +++ b/include/uapi/linux/trace_mmap.h @@ -26,4 +26,6 @@ struct trace_buffer_meta { __u32 meta_struct_len; /* Len of this struct */ }; +#define TRACE_MMAP_IOCTL_GET_READER _IO('T', 0x1) + #endif /* _UAPI_TRACE_MMAP_H_ */ diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c index 46dbe22121e9..cfeaf2cd204e 100644 --- a/kernel/trace/trace.c +++ b/kernel/trace/trace.c @@ -8583,15 +8583,31 @@ tracing_buffers_splice_read(struct file *file, loff_t *ppos, return ret; } -/* An ioctl call with cmd 0 to the ring buffer file will wake up all waiters */ static 
long tracing_buffers_ioctl(struct file *file, unsigned int cmd, unsigned long arg) { struct ftrace_buffer_info *info = file->private_data; struct trace_iterator *iter = &info->iter; + int err; - if (cmd) - return -ENOIOCTLCMD; + if (cmd == TRACE_MMAP_IOCTL_GET_READER) { + if (!(file->f_flags & O_NONBLOCK)) { + err = ring_buffer_wait(iter->array_buffer->buffer, + iter->cpu_file, + iter->tr->buffer_percent); + if (err) + return err; + } + return ring_buffer_map_get_reader(iter->array_buffer->buffer, + iter->cpu_file); + } else if (cmd) { + return -ENOTTY; + } + + /* +* An ioctl call with cmd 0 to the ring buffer file will wake up all +* waiters +*/ mutex_lock(&trace_types_lock); iter->wait_index++; @@ -8604,6 +8620,62 @@ static long tracing_buffers_ioctl(struct file *file, unsigned int cmd, unsigned return 0; } +static vm_fault_t tracing_buffers_mmap_fault(struct vm_fault *vmf) +{ + struct ftrace_buffer_info *info = vmf->vma->vm_file->private_data; + struct trace_iterator *iter = &info->iter; + vm_fault_t ret = VM_FAULT_SIGBUS; + struct page *page; + + page = ring_buffer_map_fault(iter->array_buffer->buffer, iter->cpu_file, +vmf->pgoff); + if (!page) + return ret; + + get_page(page); + vmf->page = page; + vmf->page->mapping = vmf->vma->vm_file->f_mapping; + vmf->page->index = vmf->pgoff; + + return 0; +} + +static void tracing_buffers_mmap_close(struct vm_area_struct *vma) +{ + struct ftrace_buffer_info *info = vma->vm_file->private_data; + struct trace_iterator *iter = &info->iter; + + ring_buffer_unmap(iter->array_buffer->buffer, iter->cpu_file); +} + +static void tracing_buffers_mmap_open(struct vm_area_struct *vma) +{ + struct ftrace_buffer_info *info = vma->vm_file->private_data; + struct trace_iterator *iter = &info->iter; + + WARN_ON(ring_buffer_map(iter->array_buffer->buffer, iter->cpu_file)); +} + +static const struct vm_operations_struct tracing_buffers_vmops = { + .open = tracing_buffers_mmap_open, + .close = tracing_buffers_mmap_close, + .fault = 
tracing_buffers_mmap_fault, +}; + +static int tracing_buffers_mmap(struct file *filp, struct vm_area_struct *vma) +{ + struct ftrace_buffer_info *info = filp->private_data; + struct trace_iterator *iter = &info->iter; + + if (vma->vm_flags & VM_WRITE) + retu
[RFC][PATCH 0/2] ring-buffer: Allow user space memory mapping
I'm sending this to a wider audience, as I want to hear more feedback on this before I accept it. Vincent has been working on allowing the ftrace ring buffer to be memory mapped into user space. This has been going on since last year, where we talked at the 2022 Tracing Summit in London. Vincent's last series can be found here: https://lore.kernel.org/linux-trace-kernel/20231221173523.3015715-1-vdonnef...@google.com/ But I'm posting these as these are what I now have in my queue. I've tested these patches pretty thoroughly and they look good. I even have a libtracefs API patch ready to implement this. For testing, you can install: git://git.kernel.org/pub/scm/libs/libtrace/libtraceevent.git (version 1.8.1) git://git.kernel.org/pub/scm/libs/libtrace/libtracefs.git (latest branch of libtracefs) Then apply: https://lore.kernel.org/all/20231228201100.78aae...@rorschach.local.home/ to libtracefs. And then build the samples: make samples Which will create: bin/cpu-map That you can use with: # trace-cmd start -e sched # bin/cpu-map 0 Which will output all the events in CPU 0 via the memory mapping if it is supported by the kernel or else it will print at the start: "Was not able to map, falling back to buffered read" If you want to see the source code for cpu-map, you have to look at the man page ;-) The "make samples" will extract the example code from the man pages and build them. Documentation/libtracefs-cpu-map.txt The library API is rather simple, it just has: tracefs_cpu_open_mapped() tracefs_cpu_is_mapped() tracefs_cpu_map() tracefs_cpu_unmap() Which will create a tracefs_cpu handle for reading. If it fails to map, it just does a normal read. Anyway, this email is not about the library interface, but more of the kernel interface. 
And that is this: First there's a "meta page" that can be mapped via mmap() on the trace_pipe_raw file (there's one trace_pipe_raw file per CPU): page_size = getpagesize(); meta = mmap(NULL, page_size, PROT_READ, MAP_SHARED, fd, 0); The meta will have a layout of: struct trace_buffer_meta { unsigned long entries; unsigned long overrun; unsigned long read; unsigned long subbufs_touched; unsigned long subbufs_lost; unsigned long subbufs_read; struct { unsigned long lost_events; /* Events lost at the time of the reader swap */ __u32 id; /* Reader subbuf ID from 0 to nr_subbufs - 1 */ __u32 read; /* Number of bytes read on the reader subbuf */ } reader; __u32 subbuf_size; /* Size of each subbuf including the header */ __u32 nr_subbufs; /* Number of subbufs in the ring-buffer */ __u32 meta_page_size; /* Size of the meta-page */ __u32 meta_struct_len; /* Len of this struct */ }; The meta_page_size can grow if we need to (and then we can extend this API if we need to). If the meta_page_size is greater than a page, it should remap it: if (meta->meta_page_size > page_size) { int new_size = meta->meta_page_size; munmap(meta, page_size); meta = mmap(NULL, new_size, PROT_READ, MAP_SHARED, fd, 0); } Now the sub buffers of the ring buffer are mapped right after the meta page. They can be mapped with: data_len = meta->subbuf_size * meta->nr_subbufs; data = mmap(NULL, data_len, PROT_READ, MAP_SHARED, fd, meta->meta_page_size); This maps all the ring buffer sub-buffers as well as the reader page. The way this works is that the reader page is free to read from user space and the writer will only write to the other pages. To get the reader page: subbuf = data + meta->subbuf_size * meta->reader.id; Then you can load that into the libtraceevent kbuffer API: struct tep_handle *tep = tracefs_local_events(NULL); kbuf = tep_kbuffer(tep); kbuffer_load_subbuffer(kbuf, subbuf); And use the kbuf descriptor to iterate the events. 
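The reader-offset arithmetic above can be sanity-checked in user space with a cut-down stand-in for struct trace_buffer_meta (field values below are illustrative):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Subset of struct trace_buffer_meta needed for the offset math. */
struct meta_sketch {
	struct {
		uint32_t id;	/* reader subbuf ID, 0 .. nr_subbufs - 1 */
	} reader;
	uint32_t subbuf_size;
	uint32_t meta_page_size;
};

/* Offset of the current reader sub-buffer inside the mapping: the
 * meta-page comes first, then the subbufs ordered by their ID. */
static size_t reader_offset(const struct meta_sketch *m)
{
	return (size_t)m->meta_page_size + (size_t)m->reader.id * m->subbuf_size;
}

static struct meta_sketch example_meta = {
	.reader		= { .id = 3 },
	.subbuf_size	= 4096,
	.meta_page_size	= 4096,
};
```

With 4 KiB sub-buffers and a one-page meta-page, reader ID 3 lands at byte offset 16384 of the mapping.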
When done with the reader page, the application needs to make an ioctl() call: ioctl(fd, TRACE_MMAP_IOCTL_GET_READER); This will swap the reader page with the head of the other pages, and the old reader page is now going to be in the writable portion of the ring buffer where the writer can write to it, but the page that was swapped out becomes the new reader page. If there is no data, then user space can check the meta->reader.id to see if it changed or not. If it did not change, then there's no new data. If the writer is still on that page, it acts the same as we do in the kernel. It can still update that page and the beginning of the sub-buffer has the index of where the writer currently is on that page. All the mapped pages are read-only. When the ring buffer is mapped, it
[RFC][PATCH 1/2] ring-buffer: Introducing ring-buffer mapping functions
From: Vincent Donnefort In preparation for allowing the user-space to map a ring-buffer, add a set of mapping functions: ring_buffer_{map,unmap}() ring_buffer_map_fault() And controls on the ring-buffer: ring_buffer_map_get_reader() /* swap reader and head */ Mapping the ring-buffer also involves: A unique ID for each subbuf of the ring-buffer, currently they are only identified through their in-kernel VA. A meta-page, where are stored ring-buffer statistics and a description for the current reader The linear mapping exposes the meta-page, and each subbuf of the ring-buffer, ordered following their unique ID, assigned during the first mapping. Once mapped, no subbuf can get in or out of the ring-buffer: the buffer size will remain unmodified and the splice enabling functions will in reality simply memcpy the data instead of swapping subbufs. Link: https://lore.kernel.org/linux-trace-kernel/20231221173523.3015715-2-vdonnef...@google.com Signed-off-by: Vincent Donnefort Signed-off-by: Steven Rostedt (Google) --- include/linux/ring_buffer.h | 7 + include/uapi/linux/trace_mmap.h | 29 +++ kernel/trace/ring_buffer.c | 382 +++- 3 files changed, 417 insertions(+), 1 deletion(-) create mode 100644 include/uapi/linux/trace_mmap.h diff --git a/include/linux/ring_buffer.h b/include/linux/ring_buffer.h index fa802db216f9..0841ba8bab14 100644 --- a/include/linux/ring_buffer.h +++ b/include/linux/ring_buffer.h @@ -6,6 +6,8 @@ #include #include +#include + struct trace_buffer; struct ring_buffer_iter; @@ -221,4 +223,9 @@ int trace_rb_cpu_prepare(unsigned int cpu, struct hlist_node *node); #define trace_rb_cpu_prepare NULL #endif +int ring_buffer_map(struct trace_buffer *buffer, int cpu); +int ring_buffer_unmap(struct trace_buffer *buffer, int cpu); +struct page *ring_buffer_map_fault(struct trace_buffer *buffer, int cpu, + unsigned long pgoff); +int ring_buffer_map_get_reader(struct trace_buffer *buffer, int cpu); #endif /* _LINUX_RING_BUFFER_H */ diff --git 
a/include/uapi/linux/trace_mmap.h b/include/uapi/linux/trace_mmap.h new file mode 100644 index ..f950648b0ba9 --- /dev/null +++ b/include/uapi/linux/trace_mmap.h @@ -0,0 +1,29 @@ +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */ +#ifndef _UAPI_TRACE_MMAP_H_ +#define _UAPI_TRACE_MMAP_H_ + +#include + +struct trace_buffer_meta { + unsigned long entries; + unsigned long overrun; + unsigned long read; + + unsigned long subbufs_touched; + unsigned long subbufs_lost; + unsigned long subbufs_read; + + struct { + unsigned long lost_events;/* Events lost at the time of the reader swap */ + __u32 id; /* Reader subbuf ID from 0 to nr_subbufs - 1 */ + __u32 read; /* Number of bytes read on the reader subbuf */ + } reader; + + __u32 subbuf_size;/* Size of each subbuf including the header */ + __u32 nr_subbufs; /* Number of subbufs in the ring-buffer */ + + __u32 meta_page_size; /* Size of the meta-page */ + __u32 meta_struct_len;/* Len of this struct */ +}; + +#endif /* _UAPI_TRACE_MMAP_H_ */ diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c index 173d2595ce2d..2f3e0260db88 100644 --- a/kernel/trace/ring_buffer.c +++ b/kernel/trace/ring_buffer.c @@ -338,6 +338,7 @@ struct buffer_page { local_t entries; /* entries on this page */ unsigned longreal_end; /* real end of data */ unsigned order; /* order of the page */ + u32 id;/* ID for external mapping */ struct buffer_data_page *page; /* Actual data page */ }; @@ -388,6 +389,7 @@ struct rb_irq_work { boolwaiters_pending; boolfull_waiters_pending; boolwakeup_full; + boolis_cpu_buffer; }; /* @@ -484,6 +486,12 @@ struct ring_buffer_per_cpu { u64 read_stamp; /* pages removed since last reset */ unsigned long pages_removed; + + int mapped; + struct mutexmapping_lock; + unsigned long *subbuf_ids;/* ID to addr */ + struct trace_buffer_meta*meta_page; + /* ring buffer pages to update, > 0 to add, < 0 to remove */ longnr_pages_to_update; struct list_headnew_pages; /* new pages to add */ @@ -739,6 +747,22 @@ 
static __always_inline bool full_hit(struct trace_buffer *buffer, int cpu, int f return (dirty * 100) > (full * nr_pages
[PATCH] ftrace: Fix modification of direct_function hash while in use
From: "Steven Rostedt (Google)" Masami Hiramatsu reported a memory leak in register_ftrace_direct() where, if the number of new entries added is large enough to cause two allocations in the loop: for (i = 0; i < size; i++) { hlist_for_each_entry(entry, &hash->buckets[i], hlist) { new = ftrace_add_rec_direct(entry->ip, addr, &free_hash); if (!new) goto out_remove; entry->direct = addr; } } Where ftrace_add_rec_direct() has: if (ftrace_hash_empty(direct_functions) || direct_functions->count > 2 * (1 << direct_functions->size_bits)) { struct ftrace_hash *new_hash; int size = ftrace_hash_empty(direct_functions) ? 0 : direct_functions->count + 1; if (size < 32) size = 32; new_hash = dup_hash(direct_functions, size); if (!new_hash) return NULL; *free_hash = direct_functions; direct_functions = new_hash; } The "*free_hash = direct_functions;" can happen twice, losing the previous allocation of direct_functions. But this also exposed a more serious bug. The modification of direct_functions above is not safe. As direct_functions can be referenced at any time to find what direct caller it should call, the time between: new_hash = dup_hash(direct_functions, size); and direct_functions = new_hash; can have a race with another CPU (or even this one if it gets interrupted), and the entries being moved to the new hash are not referenced. That's because "dup_hash()" is misnamed and is really a "move_hash()": it moves the entries from the old hash to the new one. Even if that were changed, this code is still not proper, as direct_functions should not be updated until the end. That is the best way to handle function reference changes, and is the way other parts of ftrace handle this. The following is done: 1. Change add_hash_entry() to return the entry it created and inserted into the hash, and not just return success or not. 2. Replace ftrace_add_rec_direct() with add_hash_entry(), and remove the former. 3. 
Allocate a "new_hash" at the start that is made for holding both the new hash entries as well as the existing entries in direct_functions. 4. Copy (not move) the direct_function entries over to the new_hash. 5. Copy the entries of the added hash to the new_hash. 6. If everything succeeds, then use rcu_assign_pointer() to update the direct_functions with the new_hash. This simplifies the code and fixes both the memory leak and the race condition mentioned above. Link: https://lore.kernel.org/all/170368070504.42064.8960569647118388081.stgit@devnote2/ Cc: sta...@vger.kernel.org Fixes: 763e34e74bb7d ("ftrace: Add register_ftrace_direct()") Signed-off-by: Steven Rostedt (Google) --- kernel/trace/ftrace.c | 100 -- 1 file changed, 47 insertions(+), 53 deletions(-) diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c index 8de8bec5f366..b01ae7d36021 100644 --- a/kernel/trace/ftrace.c +++ b/kernel/trace/ftrace.c @@ -1183,18 +1183,19 @@ static void __add_hash_entry(struct ftrace_hash *hash, hash->count++; } -static int add_hash_entry(struct ftrace_hash *hash, unsigned long ip) +static struct ftrace_func_entry * +add_hash_entry(struct ftrace_hash *hash, unsigned long ip) { struct ftrace_func_entry *entry; entry = kmalloc(sizeof(*entry), GFP_KERNEL); if (!entry) - return -ENOMEM; + return NULL; entry->ip = ip; __add_hash_entry(hash, entry); - return 0; + return entry; } static void @@ -1349,7 +1350,6 @@ alloc_and_copy_ftrace_hash(int size_bits, struct ftrace_hash *hash) struct ftrace_func_entry *entry; struct ftrace_hash *new_hash; int size; - int ret; int i; new_hash = alloc_ftrace_hash(size_bits); @@ -1366,8 +1366,7 @@ alloc_and_copy_ftrace_hash(int size_bits, struct ftrace_hash *hash) size = 1 << hash->size_bits; for (i = 0; i < size; i++) { hlist_for_each_entry(entry, &hash->buckets[i], hlist) { - ret = add_hash_entry(new_hash, entry->ip); - if (ret < 0) + if (add_hash_entry(new_hash, entry->ip) == NULL) goto free_hash; } } @@ -2536,7 +2535,7 @@ 
ftrace_find_unique_ops(struct dyn_ftrace *rec) #ifdef CONFIG_DYNAMIC_FTRACE_WITH_DIRECT_CALLS /* Protected by rcu_tasks for reading, and direct_mutex for writing */ -static struct ftrace_hash *direct_functions = EMPTY_HASH; +static struct ftrace_hash __rcu *direct_functions = EM
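The copy-then-publish scheme in steps 3-6 above can be sketched in user space. A plain pointer store stands in for rcu_assign_pointer(), and where the kernel would free the old table after an RCU grace period this sketch simply abandons it (noted in the comments):

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Readers always dereference this pointer; writers never modify the
 * table it points at in place. */
static unsigned long *live_table;
static size_t live_count;

static const unsigned long first_batch[2] = { 10, 20 };
static const unsigned long second_batch[1] = { 30 };

/* Build a complete replacement table (copy, not move!), then publish it
 * with a single store -- rcu_assign_pointer() in the kernel.  Readers
 * therefore always see either the full old table or the full new one. */
static int publish_expanded(const unsigned long *extra, size_t n_extra)
{
	size_t n = live_count + n_extra;
	unsigned long *new_table = malloc(n * sizeof(*new_table));

	if (!new_table)
		return -1;
	if (live_count)		/* copy the existing entries... */
		memcpy(new_table, live_table, live_count * sizeof(*new_table));
	/* ...then append the new ones... */
	memcpy(new_table + live_count, extra, n_extra * sizeof(*new_table));

	/* ...and only now make the result visible.  The kernel would use
	 * rcu_assign_pointer() here and free the old table via RCU; this
	 * sketch leaks it for brevity. */
	live_table = new_table;
	live_count = n;
	return 0;
}
```

This is exactly the property the buggy dup_hash()-based code lacked: entries were moved out of the live hash before the new one was published, leaving a window where lookups miss them.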
Re: [PATCH] tracing: Fix possible memory leak in ftrace_regsiter_direct()
On Wed, 27 Dec 2023 21:38:25 +0900 "Masami Hiramatsu (Google)" wrote: > From: Masami Hiramatsu (Google) > > If ftrace_register_direct() called with a large number of target There's no function called "ftrace_register_direct()", I guess you meant register_ftrace_direct()? > functions (e.g. 65), the free_hash can be updated twice or more > in the ftrace_add_rec_direct() without freeing the previous hash > memory. Thus this can cause a memory leak. > > Fix this issue by expanding the direct_hash at once before > adding the new entries. > > Signed-off-by: Masami Hiramatsu (Google) > Fixes: f64dd4627ec6 ("ftrace: Add multi direct register/unregister interface") > Cc: sta...@vger.kernel.org > --- > kernel/trace/ftrace.c | 49 > +++-- > 1 file changed, 31 insertions(+), 18 deletions(-) > > diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c > index 8de8bec5f366..9269c2c3e595 100644 > --- a/kernel/trace/ftrace.c > +++ b/kernel/trace/ftrace.c > @@ -2555,28 +2555,33 @@ unsigned long ftrace_find_rec_direct(unsigned long ip) > return entry->direct; > } > > -static struct ftrace_func_entry* > -ftrace_add_rec_direct(unsigned long ip, unsigned long addr, > - struct ftrace_hash **free_hash) > +static struct ftrace_hash *ftrace_expand_direct(int inc_count) > { > - struct ftrace_func_entry *entry; > + struct ftrace_hash *new_hash, *free_hash; > + int size = ftrace_hash_empty(direct_functions) ? 0 : > + direct_functions->count + inc_count; > > - if (ftrace_hash_empty(direct_functions) || > - direct_functions->count > 2 * (1 << direct_functions->size_bits)) { > - struct ftrace_hash *new_hash; > - int size = ftrace_hash_empty(direct_functions) ? 0 : > - direct_functions->count + 1; > + if (!ftrace_hash_empty(direct_functions) && > + size <= 2 * (1 << direct_functions->size_bits)) > + return NULL; > > - if (size < 32) > - size = 32; > + if (size < 32) > + size = 32; Hmm, why the limit of 32? I know this was there before this patch, but this patch made me notice it. 
In dup_hash() we have: bits = fls(size / 2); /* Don't allocate too much */ if (bits > FTRACE_HASH_MAX_BITS) bits = FTRACE_HASH_MAX_BITS; Where bits will determine the number of buckets. If size = 32, then bits = fls(32/2) = fls(16) = 5 So the buckets will be 2^5 = 32. Thus, you will get 32 buckets even with 64 entries, which will cause a minimum of two loops to find a bucket. Is this a concern? > > - new_hash = dup_hash(direct_functions, size); > - if (!new_hash) > - return NULL; > + new_hash = dup_hash(direct_functions, size); > + if (!new_hash) > + return ERR_PTR(-ENOMEM); > > - *free_hash = direct_functions; > - direct_functions = new_hash; > - } > + free_hash = direct_functions; > + direct_functions = new_hash; > + > + return free_hash; > +} > + > +static struct ftrace_func_entry* > +ftrace_add_rec_direct(unsigned long ip, unsigned long addr) > +{ > + struct ftrace_func_entry *entry; > > entry = kmalloc(sizeof(*entry), GFP_KERNEL); > if (!entry) > @@ -5436,11 +5441,19 @@ int register_ftrace_direct(struct ftrace_ops *ops, > unsigned long addr) > } > } > > + /* ... and prepare the insertion */ > + free_hash = ftrace_expand_direct(hash->count); Why does the direct_functions need to be expanded before new items are entered? Can't we do it the way ftrace does it, and that is just to fill the hash, and then expand if necessary? Hmm, also, I think there's a bug here too. That is, dup_hash() does not do what it says. It doesn't copy, it actually moves. So when you use dup_hash() with the source being direct_functions, it's removing the entries from the direct functions, and not copying them :-/ That is, during the resize, the check against direct_functions will not find them and things will be missed! What I think we need to do, is what ftrace does, and that is to add everything to a temp hash first and then do an RCU assign to the direct_functions. > + if (IS_ERR(free_hash)) { > + err = PTR_ERR(free_hash); > + free_hash = NULL; > + goto out_unlock; > + } > + > /* ... 
and insert them to direct_functions hash. */ > err = -ENOMEM; > for (i = 0; i < size; i++) { > hlist_for_each_entry(entry, &hash->buckets[i], hlist) { > - new = ftrace_add_rec_direct(entry->ip, addr, > &free_hash); > + new = ftrace_add_rec_direct(entry->ip, addr); > if (!new) > goto out_remove; > entry->direct = addr; This should fix both the leak and the fact that di
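The bucket-count arithmetic Steve walks through above ("why the limit of 32?") is easy to check outside the kernel. fls() is "find last set", returning the 1-based index of the most significant set bit; FTRACE_HASH_MAX_BITS is 12 in current kernels, but treat both the cap and the helper below as an illustrative sketch rather than the kernel's code:

```c
#include <assert.h>

#define FTRACE_HASH_MAX_BITS 12	/* assumed from kernel/trace/ftrace.c */

/* Portable stand-in for the kernel's fls(): 1-based index of the most
 * significant set bit, 0 when x == 0. */
static int fls_sketch(unsigned int x)
{
	int r = 0;

	while (x) {
		r++;
		x >>= 1;
	}
	return r;
}

/* The sizing logic quoted from dup_hash(). */
static int hash_bits_for(unsigned int size)
{
	int bits = fls_sketch(size / 2);	/* Don't allocate too much */

	if (bits > FTRACE_HASH_MAX_BITS)
		bits = FTRACE_HASH_MAX_BITS;
	return bits;
}
```

So a requested size of 32 yields fls(16) = 5 bits, i.e. 2^5 = 32 buckets, which is the point above: 64 entries then average two per bucket.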
Re: [PATCH 2/2] virtiofs: Improve error handling in virtio_fs_get_tree()
On Fri, Dec 29, 2023 at 10:10:08AM +0100, Markus Elfring wrote: > >> The kfree() function was called in two cases by > >> the virtio_fs_get_tree() function during error handling > >> even if the passed variable contained a null pointer. > > > > So what? kfree(NULL) is perfectly acceptable. > > I suggest to reconsider the usefulness of such a special function call. Can you be more explicit in your suggestion?
Re: [PATCH] tracing: Fix blocked reader of snapshot buffer
On Thu, 28 Dec 2023 09:51:49 -0500 Steven Rostedt wrote: > From: "Steven Rostedt (Google)" > > If an application blocks on the snapshot or snapshot_raw files, expecting > to be woken up when a snapshot occurs, it will not happen. Or it may > happen with an unexpected result. > > That result is that the application will be reading the main buffer > instead of the snapshot buffer. That is because when the snapshot occurs, > the main and snapshot buffers are swapped. But the reader has a descriptor > still pointing to the buffer that it originally connected to. > > This is fine for the main buffer readers, as they may be blocked waiting > for a watermark to be hit, and when a snapshot occurs, the data that the > main readers want is now on the snapshot buffer. > > But for waiters of the snapshot buffer, they are waiting for an event to > occur that will trigger the snapshot and they can then consume it quickly > to save the snapshot before the next snapshot occurs. But to do this, they > need to read the new snapshot buffer, not the old one that is now > receiving new data. > > Also, it does not make sense to have a watermark "buffer_percent" on the > snapshot buffer, as the snapshot buffer is static and does not receive new > data except all at once. This looks good to me. Acked-by: Masami Hiramatsu (Google) Thank you! 
> > Cc: sta...@vger.kernel.org > Fixes: debdd57f5145f ("tracing: Make a snapshot feature available from > userspace") > Signed-off-by: Steven Rostedt (Google) > --- > kernel/trace/ring_buffer.c | 3 ++- > kernel/trace/trace.c | 20 +--- > 2 files changed, 19 insertions(+), 4 deletions(-) > > diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c > index e476d42c84c5..07dae67424a9 100644 > --- a/kernel/trace/ring_buffer.c > +++ b/kernel/trace/ring_buffer.c > @@ -838,7 +838,8 @@ void ring_buffer_wake_waiters(struct trace_buffer > *buffer, int cpu) > /* make sure the waiters see the new index */ > smp_wmb(); > > - rb_wake_up_waiters(&rbwork->work); > + /* This can be called in any context */ > + irq_work_queue(&rbwork->work); > } > > /** > diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c > index cfeaf2cd204e..606787edae8c 100644 > --- a/kernel/trace/trace.c > +++ b/kernel/trace/trace.c > @@ -1902,6 +1902,9 @@ update_max_tr(struct trace_array *tr, struct > task_struct *tsk, int cpu, > __update_max_tr(tr, tsk, cpu); > > arch_spin_unlock(&tr->max_lock); > + > + /* Any waiters on the old snapshot buffer need to wake up */ > + ring_buffer_wake_waiters(tr->array_buffer.buffer, RING_BUFFER_ALL_CPUS); > } > > /** > @@ -1953,12 +1956,23 @@ update_max_tr_single(struct trace_array *tr, struct > task_struct *tsk, int cpu) > > static int wait_on_pipe(struct trace_iterator *iter, int full) > { > + int ret; > + > /* Iterators are static, they should be filled or empty */ > if (trace_buffer_iter(iter, iter->cpu_file)) > return 0; > > - return ring_buffer_wait(iter->array_buffer->buffer, iter->cpu_file, > - full); > + ret = ring_buffer_wait(iter->array_buffer->buffer, iter->cpu_file, > full); > + > +#ifdef CONFIG_TRACER_MAX_TRACE > + /* > + * Make sure this is still the snapshot buffer, as if a snapshot were > + * to happen, this would now be the main buffer. 
> + */ > + if (iter->snapshot) > + iter->array_buffer = &iter->tr->max_buffer; > +#endif > + return ret; > } > > #ifdef CONFIG_FTRACE_STARTUP_TEST > @@ -8560,7 +8574,7 @@ tracing_buffers_splice_read(struct file *file, loff_t > *ppos, > > wait_index = READ_ONCE(iter->wait_index); > > - ret = wait_on_pipe(iter, iter->tr->buffer_percent); > + ret = wait_on_pipe(iter, iter->snapshot ? 0 : > iter->tr->buffer_percent); > if (ret) > goto out; > > -- > 2.42.0 > -- Masami Hiramatsu (Google)
Re: [PATCH] arm64: dts: qcom: qcm6490-fairphone-fp5: Add missing reserved-memory
On 29.12.2023 13:53, Luca Weiss wrote:
> It seems we also need to reserve a region of 81 MiB called "removed_mem"
> otherwise we can easily hit the following error with higher RAM usage:
>
> [ 1467.809274] Internal error: synchronous external abort: 9610 [#2] SMP
>
> Fixes: eee9602ad649 ("arm64: dts: qcom: qcm6490: Add device-tree for Fairphone 5")
> Signed-off-by: Luca Weiss
> ---

Reviewed-by: Konrad Dybcio

Konrad
[PATCH] arm64: dts: qcom: qcm6490-fairphone-fp5: Add missing reserved-memory
It seems we also need to reserve a region of 81 MiB called "removed_mem"
otherwise we can easily hit the following error with higher RAM usage:

[ 1467.809274] Internal error: synchronous external abort: 9610 [#2] SMP

Fixes: eee9602ad649 ("arm64: dts: qcom: qcm6490: Add device-tree for Fairphone 5")
Signed-off-by: Luca Weiss
---
 arch/arm64/boot/dts/qcom/qcm6490-fairphone-fp5.dts | 5 +
 1 file changed, 5 insertions(+)

diff --git a/arch/arm64/boot/dts/qcom/qcm6490-fairphone-fp5.dts b/arch/arm64/boot/dts/qcom/qcm6490-fairphone-fp5.dts
index 176898c9dbbd..1e85c43a6fd1 100644
--- a/arch/arm64/boot/dts/qcom/qcm6490-fairphone-fp5.dts
+++ b/arch/arm64/boot/dts/qcom/qcm6490-fairphone-fp5.dts
@@ -82,6 +82,11 @@ cdsp_mem: cdsp@88f0 {
 		no-map;
 	};
 
+	removed_mem: removed@c000 {
+		reg = <0x0 0xc000 0x0 0x510>;
+		no-map;
+	};
+
 	rmtfs_mem: memory@f850 {
 		compatible = "qcom,rmtfs-mem";
 		reg = <0x0 0xf850 0x0 0x60>;

---
base-commit: 39676dfe52331dba909c617f213fdb21015c8d10
change-id: 20231229-fp5-reserved-mem-b88e822b1127

Best regards,
-- 
Luca Weiss
Re: [PATCH] ring-buffer: Fix wake ups when buffer_percent is set to 100
On Thu, 28 Dec 2023 23:05:21 -0500
Steven Rostedt wrote:

> On Wed, 27 Dec 2023 07:57:08 +0900
> Masami Hiramatsu (Google) wrote:
> 
> > On Tue, 26 Dec 2023 12:59:02 -0500
> > Steven Rostedt wrote:
> > 
> > > From: "Steven Rostedt (Google)"
> > > 
> > > The tracefs file "buffer_percent" is to allow user space to set a
> > > water-mark on how much of the tracing ring buffer needs to be filled in
> > > order to wake up a blocked reader.
> > > 
> > >  0   - is to wait until any data is in the buffer
> > >  1   - is to wait for 1% of the sub buffers to be filled
> > >  50  - would be half of the sub buffers are filled with data
> > >  100 - is not to wake the waiter until the ring buffer is completely full
> > > 
> > > Unfortunately the test for being full was:
> > > 
> > > 	dirty = ring_buffer_nr_dirty_pages(buffer, cpu);
> > > 	return (dirty * 100) > (full * nr_pages);
> > > 
> > > Where "full" is the value for "buffer_percent".
> > > 
> > > There are two issues with the above when full == 100.
> > > 
> > > 1. dirty * 100 > 100 * nr_pages will never be true
> > >    That is, the above is basically saying that if the user sets
> > >    buffer_percent to 100, more pages need to be dirty than exist in the
> > >    ring buffer!
> > > 
> > > 2. The page that the writer is on is never considered dirty, as dirty
> > >    pages are only those that are full. When the writer goes to a new
> > >    sub-buffer, it clears the contents of that sub-buffer.
> > > 
> > > That is, even if the check was ">=" it would still not be equal as the
> > > most pages that can be considered "dirty" is nr_pages - 1.
> > > 
> > > To fix this, add one to dirty and use ">=" in the compare.
> > > 
> > > Cc: sta...@vger.kernel.org
> > > Fixes: 03329f9939781 ("tracing: Add tracefs file buffer_percentage")
> > > Signed-off-by: Steven Rostedt (Google)
> > > ---
> > >  kernel/trace/ring_buffer.c | 9 +++--
> > >  1 file changed, 7 insertions(+), 2 deletions(-)
> > > 
> > > diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
> > > index 83eab547f1d1..32c0dd2fd1c3 100644
> > > --- a/kernel/trace/ring_buffer.c
> > > +++ b/kernel/trace/ring_buffer.c
> > > @@ -881,9 +881,14 @@ static __always_inline bool full_hit(struct trace_buffer *buffer, int cpu, int f
> > >  	if (!nr_pages || !full)
> > >  		return true;
> > > 
> > > -	dirty = ring_buffer_nr_dirty_pages(buffer, cpu);
> > > +	/*
> > > +	 * Add one as dirty will never equal nr_pages, as the sub-buffer
> > > +	 * that the writer is on is not counted as dirty.
> > > +	 * This is needed if "buffer_percent" is set to 100.
> > > +	 */
> > > +	dirty = ring_buffer_nr_dirty_pages(buffer, cpu) + 1;
> > 
> > Is this "+ 1" required? If we have 200 pages and 1 buffer is dirty,
> > it is 0.5% dirty. Consider @full = 1%.
> 
> Yes it is required, as the comment above it states. dirty will never
> equal nr_pages. Without it, buffer_percent == 100 will never wake up.
> 
> The +1 is to add the page the writer is on, which is never considered
> "dirty".

Ah, got it. I thought the page under writing was dirty too (similar to a
memory page). That was my misunderstanding.

> > 
> > @dirty = 1 + 1 = 2 and @dirty * 100 == 200, but
> > @full * @nr_pages = 1 * 200 = 200.
> > Thus it hits (200 >= 200 is true) even if dirty pages are 0.5%.
> > Do we care?
> 
> What's the difference if it wakes up on 2 dirty pages or 1? It would be
> very hard to measure the difference.
> 
> But if you say 100, which means "I want to wake up when full", it will
> never wake up, because it will always be nr_pages - 1.

OK, because we count only the complete page as dirty. In that case, the
number of readable pages should be dirty + 1.

> 
> We could also say the +1 is the reader page too, because that's not
> counted as well.
> 
> In other words, we can bike-shed this to make 1% accurate (which,
> honestly, I have no idea what the use case for that would be) or we can
> fix the bug with 100%, which just means: wake me up if the buffer is
> full, and when the writer is on the last page, it is considered full.

Got it.

Acked-by: Masami Hiramatsu (Google)

Thank you,

> 
> -- Steve
> 

-- 
Masami Hiramatsu (Google)
Re: [PATCH 5/5] media: venus: Convert to dev_pm_domain_attach|detach_list() for vcodec
On 28/12/2023 11:41, Ulf Hansson wrote:
> Let's avoid some of the boilerplate code to manage the vcodec PM
> domains, by converting into using dev_pm_domain_attach|detach_list().
>
> Cc: Mauro Carvalho Chehab
> Cc: Stanimir Varbanov
> Cc: Vikash Garodia
> Cc: "Bryan O'Donoghue"
> Cc: Bjorn Andersson
> Cc: Konrad Dybcio
> Cc:
> Signed-off-by: Ulf Hansson

On top of 39676dfe52331 - (tag: next-20231222, linux-next/master) Add
linux-next specific files for 20231222 (7 days ago)

Tested-by: Bryan O'Donoghue
Reviewed-by: Bryan O'Donoghue

---
bod
Re: [PATCH 2/2] virtiofs: Improve error handling in virtio_fs_get_tree()
>> The kfree() function was called in two cases by
>> the virtio_fs_get_tree() function during error handling
>> even if the passed variable contained a null pointer.
>
> So what? kfree(NULL) is perfectly acceptable.

I suggest reconsidering the usefulness of such a special function call.

> Are you trying to optimise an error path?

I would appreciate it if further improvements can be achieved.
https://wiki.sei.cmu.edu/confluence/display/c/MEM12-C.+Consider+using+a+goto+chain+when+leaving+a+function+on+error+when+using+and+releasing+resources

Regards,
Markus
[PATCH] arm64: dts: qcom: sc7280: Add static properties to cryptobam
When the properties num-channels & qcom,num-ees are not specified, the
driver tries to read the values from registers, but this read fails and
resets the device if the interconnect from the qcom,qce node is not
already active when that happens.

Add the static properties so that no registers are touched during
probe; the rest of the time, when the BAM is used by QCE, the
interconnect will already be active.

Fixes: d488f903a860 ("arm64: dts: qcom: sc7280: add QCrypto nodes")
Signed-off-by: Luca Weiss
---
See also: https://lore.kernel.org/linux-arm-msm/cy01ekqvwe36.b9x5tdxar...@fairphone.com/

Alternatively you could add an interconnect property (copied from
&crypto) and add interconnect support to the BAM driver; then during
probe we can read the registers without crashing the device.
---
 arch/arm64/boot/dts/qcom/sc7280.dtsi | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/arm64/boot/dts/qcom/sc7280.dtsi b/arch/arm64/boot/dts/qcom/sc7280.dtsi
index 83b5b76ba179..ce0d24ee7eed 100644
--- a/arch/arm64/boot/dts/qcom/sc7280.dtsi
+++ b/arch/arm64/boot/dts/qcom/sc7280.dtsi
@@ -2345,6 +2345,8 @@ cryptobam: dma-controller@1dc4000 {
 			<&apps_smmu 0x4e6 0x0011>;
 		qcom,ee = <0>;
 		qcom,controlled-remotely;
+		num-channels = <16>;
+		qcom,num-ees = <4>;
 	};
 
 	crypto: crypto@1dfa000 {

---
base-commit: 39676dfe52331dba909c617f213fdb21015c8d10
change-id: 20231229-sc7280-cryptobam-fixup-fb5f94a5572f

Best regards,
-- 
Luca Weiss
Re: [PATCH 2/2] virtiofs: Improve error handling in virtio_fs_get_tree()
On Fri, Dec 29, 2023 at 09:38:47AM +0100, Markus Elfring wrote:
> From: Markus Elfring
> Date: Fri, 29 Dec 2023 09:15:07 +0100
>
> The kfree() function was called in two cases by
> the virtio_fs_get_tree() function during error handling
> even if the passed variable contained a null pointer.

So what? kfree(NULL) is perfectly acceptable.

Are you trying to optimise an error path?
[PATCH 2/2] virtiofs: Improve error handling in virtio_fs_get_tree()
From: Markus Elfring
Date: Fri, 29 Dec 2023 09:15:07 +0100

The kfree() function was called in two cases by the virtio_fs_get_tree()
function during error handling even if the passed variable contained a
null pointer.

This issue was detected by using the Coccinelle software.

* Thus use another label.

* Move an error code assignment into an if branch.

* Delete an initialisation (for the variable “fc”) which became
  unnecessary with this refactoring.

Signed-off-by: Markus Elfring
---
 fs/fuse/virtio_fs.c | 13 -
 1 file changed, 8 insertions(+), 5 deletions(-)

diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c
index 2f8ba9254c1e..0746f54ec743 100644
--- a/fs/fuse/virtio_fs.c
+++ b/fs/fuse/virtio_fs.c
@@ -1415,10 +1415,10 @@ static int virtio_fs_get_tree(struct fs_context *fsc)
 {
 	struct virtio_fs *fs;
 	struct super_block *sb;
-	struct fuse_conn *fc = NULL;
+	struct fuse_conn *fc;
 	struct fuse_mount *fm;
 	unsigned int virtqueue_size;
-	int err = -EIO;
+	int err;
 
 	/* This gets a reference on virtio_fs object. This ptr gets installed
 	 * in fc->iq->priv. Once fuse_conn is going away, it calls ->put()
@@ -1431,13 +1431,15 @@ static int virtio_fs_get_tree(struct fs_context *fsc)
 	}
 
 	virtqueue_size = virtqueue_get_vring_size(fs->vqs[VQ_REQUEST].vq);
-	if (WARN_ON(virtqueue_size <= FUSE_HEADER_OVERHEAD))
-		goto out_err;
+	if (WARN_ON(virtqueue_size <= FUSE_HEADER_OVERHEAD)) {
+		err = -EIO;
+		goto lock_mutex;
+	}
 
 	err = -ENOMEM;
 	fc = kzalloc(sizeof(*fc), GFP_KERNEL);
 	if (!fc)
-		goto out_err;
+		goto lock_mutex;
 
 	fm = kzalloc(sizeof(*fm), GFP_KERNEL);
 	if (!fm)
@@ -1476,6 +1478,7 @@ static int virtio_fs_get_tree(struct fs_context *fsc)
 
 out_err:
 	kfree(fc);
+lock_mutex:
 	mutex_lock(&virtio_fs_mutex);
 	virtio_fs_put(fs);
 	mutex_unlock(&virtio_fs_mutex);
-- 
2.43.0
[PATCH 1/2] virtiofs: Improve three size determinations
From: Markus Elfring
Date: Fri, 29 Dec 2023 08:42:04 +0100

Replace the specification of data structures by pointer dereferences as
the parameter for the operator “sizeof” to make the corresponding size
determination a bit safer according to the Linux coding style
convention.

This issue was detected by using the Coccinelle software.

Signed-off-by: Markus Elfring
---
 fs/fuse/virtio_fs.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c
index 5f1be1da92ce..2f8ba9254c1e 100644
--- a/fs/fuse/virtio_fs.c
+++ b/fs/fuse/virtio_fs.c
@@ -1435,11 +1435,11 @@ static int virtio_fs_get_tree(struct fs_context *fsc)
 		goto out_err;
 
 	err = -ENOMEM;
-	fc = kzalloc(sizeof(struct fuse_conn), GFP_KERNEL);
+	fc = kzalloc(sizeof(*fc), GFP_KERNEL);
 	if (!fc)
 		goto out_err;
 
-	fm = kzalloc(sizeof(struct fuse_mount), GFP_KERNEL);
+	fm = kzalloc(sizeof(*fm), GFP_KERNEL);
 	if (!fm)
 		goto out_err;
 
@@ -1495,7 +1495,7 @@ static int virtio_fs_init_fs_context(struct fs_context *fsc)
 	if (fsc->purpose == FS_CONTEXT_FOR_SUBMOUNT)
 		return fuse_init_fs_context_submount(fsc);
 
-	ctx = kzalloc(sizeof(struct fuse_fs_context), GFP_KERNEL);
+	ctx = kzalloc(sizeof(*ctx), GFP_KERNEL);
 	if (!ctx)
 		return -ENOMEM;
 	fsc->fs_private = ctx;
-- 
2.43.0
[PATCH 0/2] virtiofs: Adjustments for two function implementations
From: Markus Elfring
Date: Fri, 29 Dec 2023 09:28:09 +0100

A few update suggestions were taken into account from static source
code analysis.

Markus Elfring (2):
  Improve three size determinations
  Improve error handling in virtio_fs_get_tree()

 fs/fuse/virtio_fs.c | 19 +++
 1 file changed, 11 insertions(+), 8 deletions(-)

-- 
2.43.0