Re: [PATCH 2/4] modules: Ensure 64-bit alignment on __ksymtab_* sections

2023-12-29 Thread Helge Deller

Hi Luis,

On 12/22/23 21:10, Luis Chamberlain wrote:

On Fri, Dec 22, 2023 at 01:13:26PM +0100, Helge Deller wrote:

On 12/22/23 06:59, Luis Chamberlain wrote:

On Wed, Nov 22, 2023 at 11:18:12PM +0100, del...@kernel.org wrote:

On 64-bit architectures without CONFIG_HAVE_ARCH_PREL32_RELOCATIONS
(e.g. ppc64, ppc64le, parisc, s390x,...) the __KSYM_REF() macro stores
64-bit pointers into the __ksymtab* sections.
Make sure that those sections will be correctly aligned at module link time,
otherwise unaligned memory accesses may happen at runtime.

...

...

So, honestly I don't see a real reason why it shouldn't be applied...


Like I said, you Cc'd stable as a fix,


I added "Cc: sta...@vger.kernel.org" on the patch itself, so *if* the patch
would have been applied by you, it would later end up in stable kernel series 
too.
But I did not CC'ed the stable mailing list directly, so my patch was never
sent to that mailing list.


as a maintainer it is my job to
verify how critical this is, ask for more details about how you found
it, and evaluate the real impact. Even if it was not a stable fix I tend
to ask this for patches, even if they are trivial.
...
OK, can you extend the patch below with something like:

perf stat --repeat 100 --pre 'modprobe -r b a b c' -- 
./tools/testing/selftests/module/find_symbol.sh

And test before and after?

I ran a simple test as-is and the data I get is within the noise, and so
I think we need the --repeat 100 thing.


Your selftest code is based on perf.
AFAICS we don't have perf on parisc/hppa, so I can't test your selftest code
on that architecture.
I assume you tested on x86, where the CPU will transparently take care of
unaligned accesses. This is probably why the results are within
the noise.
But on some platforms the CPU raises an exception on unaligned accesses
and jumps into special exception handler assembler code inside the kernel.
This is much more expensive than on x86, which is why on parisc we keep
counters in /proc/cpuinfo for how often this exception handler is called:
IRQ:   CPU0   CPU1
  3:   1332  0 SuperIO  ttyS0
  7:1270013  0 SuperIO  pata_ns87415
 64:  320023012  320021431 CPU  timer
 65:   17080507   20624423 CPU  IPI
UAH:   10948640  58104   Unaligned access handler traps

This "UAH" field could theoretically be used to extend your selftest.
But is it really worth it? The outcome is very much architecture and CPU
specific, maybe it's just within the noise as you measured.

IMHO we should always try to natively align structures, and if we see
we got it wrong in kernel code, we should fix it.
My patches just fix those memory sections where we use inline
assembly (instead of C) and thus failed to provide the correct alignments.
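
To illustrate the point (a standalone, hedged sketch, not the actual patch):
an entry emitted from inline assembly is only aligned if we ask for it
explicitly, e.g.:

    /* a variable whose address we want to record in a custom section */
    int example_symbol;

    /*
     * Without the .balign the 8-byte pointer can land at whatever offset
     * the section has reached so far, which is what triggers the
     * unaligned accesses described above.
     */
    asm(".section \"__example_tab\", \"a\"\n"
        ".balign 8\n"
        ".quad example_symbol\n"
        ".previous");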

Helge


---
before:
sudo ./tools/testing/selftests/module/find_symbol.sh

  Performance counter stats for '/sbin/modprobe test_kallsyms_b':

 81,956,206 ns   duration_time
 81,883,000 ns   system_time
210  page-faults

0.081956206 seconds time elapsed

0.0 seconds user
0.081883000 seconds sys



  Performance counter stats for '/sbin/modprobe test_kallsyms_b':

 85,960,863 ns   duration_time
 84,679,000 ns   system_time
212  page-faults

0.085960863 seconds time elapsed

0.0 seconds user
0.084679000 seconds sys



  Performance counter stats for '/sbin/modprobe test_kallsyms_b':

 86,484,868 ns   duration_time
 86,541,000 ns   system_time
213  page-faults

0.086484868 seconds time elapsed

0.0 seconds user
0.086541000 seconds sys

---
After your modules alignment fix:
sudo ./tools/testing/selftests/module/find_symbol.sh
  Performance counter stats for '/sbin/modprobe test_kallsyms_b':

 83,579,980 ns   duration_time
 83,530,000 ns   system_time
212  page-faults

0.083579980 seconds time elapsed

0.0 seconds user
0.08353 seconds sys



  Performance counter stats for '/sbin/modprobe test_kallsyms_b':

 70,721,786 ns   duration_time
 69,289,000 ns   system_time
211  page-faults

0.070721786 seconds time elapsed

0.0 seconds user
0.069289000 seconds sys



  Performance counter stats for '/sbin/modprobe test_kallsyms_b':

 76,513,219 ns   duration_time
 76,381,000 ns   system_time
214  page-faults

0.076513219 seconds time elapsed

0.0 seconds user
0.076381000 seconds sys


Re: [PATCH next 2/5] locking/osq_lock: Avoid dirtying the local cpu's 'node' in the osq_lock() fast path.

2023-12-29 Thread Waiman Long



On 12/29/23 17:11, David Laight wrote:

osq_lock() starts by setting node->next to NULL and node->locked to 0.
Careful analysis shows that node->next is always NULL on entry.

node->locked is set non-zero by another cpu to force a wakeup.
This can only happen after the 'prev->next = node' assignment,
so locked can be set to zero just before that (along with the assignment
to node->prev).

Only initialise node->cpu once, after that use its value instead
of smp_processor_id() - which is probably a real function call.

Should reduce cache-line bouncing a little.

Signed-off-by: David Laight 
---

Re-send without the 'RE:' on the subject line.
  kernel/locking/osq_lock.c | 13 ++---
  1 file changed, 6 insertions(+), 7 deletions(-)

diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
index d414eef4bec6..55f5db896c02 100644
--- a/kernel/locking/osq_lock.c
+++ b/kernel/locking/osq_lock.c
@@ -51,7 +51,7 @@ osq_wait_next(struct optimistic_spin_queue *lock,
  struct optimistic_spin_node *prev)
  {
struct optimistic_spin_node *next = NULL;
-   int curr = encode_cpu(smp_processor_id());
+   int curr = node->cpu;
int old;
  
  	/*

@@ -98,12 +98,10 @@ bool osq_lock(struct optimistic_spin_queue *lock)
  {
struct optimistic_spin_node *node = this_cpu_ptr(&osq_node);
struct optimistic_spin_node *prev, *next;
-   int curr = encode_cpu(smp_processor_id());
int old;
  
-	node->locked = 0;

-   node->next = NULL;


I have some concern about not clearing node->next at the beginning. I 
know that next is supposed to be cleared at the end. I do have some 
worry that there may exist a race condition that leaves next not cleared 
before it is used again. So you either have to prove that such a race 
does not exist, or keep initializing it to NULL at the beginning as is
done today.


Cheers,
Longman


-   node->cpu = curr;
+   if (unlikely(node->cpu == OSQ_UNLOCKED_VAL))
+   node->cpu = encode_cpu(smp_processor_id());
  
  	/*

 * We need both ACQUIRE (pairs with corresponding RELEASE in
@@ -111,12 +109,13 @@ bool osq_lock(struct optimistic_spin_queue *lock)
 * the node fields we just initialised) semantics when updating
 * the lock tail.
 */
-   old = atomic_xchg(&lock->tail, curr);
+   old = atomic_xchg(&lock->tail, node->cpu);
if (old == OSQ_UNLOCKED_VAL)
return true;
  
  	prev = decode_cpu(old);

node->prev = prev;
+   node->locked = 0;
  
  	/*

 * osq_lock()   unqueue
@@ -214,7 +213,7 @@ bool osq_lock(struct optimistic_spin_queue *lock)
  void osq_unlock(struct optimistic_spin_queue *lock)
  {
struct optimistic_spin_node *node, *next;
-   int curr = encode_cpu(smp_processor_id());
+   int curr = raw_cpu_read(osq_node.cpu);
  
  	/*

 * Fast path for the uncontended case.





Re: [PATCH next 5/5] locking/osq_lock: Optimise vcpu_is_preempted() check.

2023-12-29 Thread Waiman Long



On 12/29/23 15:58, David Laight wrote:

The vcpu_is_preempted() test stops osq_lock() spinning if a virtual
   cpu is no longer running.
Although patched out for bare-metal the code still needs the cpu number.
Reading this from 'prev->cpu' is pretty much guaranteed to cause a cache miss
when osq_unlock() is waking up the next cpu.

Instead save 'prev->cpu' in 'node->prev_cpu' and use that value instead.
Update in the osq_lock() 'unqueue' path when 'node->prev' is changed.

This is simpler than checking for 'node->prev' changing and caching
'prev->cpu'.

Signed-off-by: David Laight 
---
  kernel/locking/osq_lock.c | 14 ++
  1 file changed, 6 insertions(+), 8 deletions(-)

diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
index b60b0add0161..89be63627434 100644
--- a/kernel/locking/osq_lock.c
+++ b/kernel/locking/osq_lock.c
@@ -14,8 +14,9 @@
  
  struct optimistic_spin_node {

struct optimistic_spin_node *self, *next, *prev;
-   int locked; /* 1 if lock acquired */
-   int cpu; /* encoded CPU # + 1 value */
+   int locked;/* 1 if lock acquired */
+   int cpu;   /* encoded CPU # + 1 value */
+   int prev_cpu;  /* actual CPU # for vcpu_is_preempted() */
  };
  
  static DEFINE_PER_CPU_SHARED_ALIGNED(struct optimistic_spin_node, osq_node);

@@ -29,11 +30,6 @@ static inline int encode_cpu(int cpu_nr)
return cpu_nr + 1;
  }
  
-static inline int node_cpu(struct optimistic_spin_node *node)

-{
-   return node->cpu - 1;
-}
-
  static inline struct optimistic_spin_node *decode_cpu(int encoded_cpu_val)
  {
int cpu_nr = encoded_cpu_val - 1;
@@ -114,6 +110,7 @@ bool osq_lock(struct optimistic_spin_queue *lock)
if (old == OSQ_UNLOCKED_VAL)
return true;
  
+	node->prev_cpu = old - 1;

prev = decode_cpu(old);
node->prev = prev;
node->locked = 0;
@@ -148,7 +145,7 @@ bool osq_lock(struct optimistic_spin_queue *lock)
 * polling, be careful.
 */
if (smp_cond_load_relaxed(&node->locked, VAL || need_resched() ||
- vcpu_is_preempted(node_cpu(node->prev
+ vcpu_is_preempted(node->prev_cpu)))
return true;
  
  	/* unqueue */

@@ -205,6 +202,7 @@ bool osq_lock(struct optimistic_spin_queue *lock)
 * it will wait in Step-A.
 */
  
+	WRITE_ONCE(next->prev_cpu, prev->cpu - 1);

WRITE_ONCE(next->prev, prev);
WRITE_ONCE(prev->next, next);

Reviewed-by: Waiman Long 

  





Re: [PATCH next 4/5] locking/osq_lock: Optimise per-cpu data accesses.

2023-12-29 Thread Waiman Long

On 12/29/23 15:57, David Laight wrote:

this_cpu_ptr() is rather more expensive than raw_cpu_read() since
the latter can use an 'offset from register' (%gs for x86-64).

Add a 'self' field to 'struct optimistic_spin_node' that can be
read with raw_cpu_read(), initialise on first call.

Signed-off-by: David Laight 
---
  kernel/locking/osq_lock.c | 14 +-
  1 file changed, 9 insertions(+), 5 deletions(-)

diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
index 9bb3a077ba92..b60b0add0161 100644
--- a/kernel/locking/osq_lock.c
+++ b/kernel/locking/osq_lock.c
@@ -13,7 +13,7 @@
   */
  
  struct optimistic_spin_node {

-   struct optimistic_spin_node *next, *prev;
+   struct optimistic_spin_node *self, *next, *prev;
int locked; /* 1 if lock acquired */
int cpu; /* encoded CPU # + 1 value */
  };
@@ -93,12 +93,16 @@ osq_wait_next(struct optimistic_spin_queue *lock,
  
  bool osq_lock(struct optimistic_spin_queue *lock)

  {
-   struct optimistic_spin_node *node = this_cpu_ptr(&osq_node);
+   struct optimistic_spin_node *node = raw_cpu_read(osq_node.self);


My gcc 11 compiler produces the following x86-64 code:

92        struct optimistic_spin_node *node = this_cpu_ptr(&osq_node);
   0x0029 <+25>:    mov    %rcx,%rdx
   0x002c <+28>:    add %gs:0x0(%rip),%rdx    # 0x34 



Which looks pretty optimized to me. Maybe an older compiler may generate 
more complex code. However, I do have some doubt as to the benefit of 
this patch at the expense of making the code a bit more complex.


Cheers,
Longman




Re: [PATCH next 3/5] locking/osq_lock: Clarify osq_wait_next()

2023-12-29 Thread Waiman Long



On 12/29/23 15:56, David Laight wrote:

osq_wait_next() is passed 'prev' from osq_lock() and NULL from osq_unlock()
but only needs the 'cpu' value to write to lock->tail.
Just pass prev->cpu or OSQ_UNLOCKED_VAL instead.

Also directly return NULL or 'next' instead of breaking the loop.

Should have no effect on the generated code since gcc manages to
assume that 'prev != NULL' due to an earlier dereference.

Signed-off-by: David Laight 
---
  kernel/locking/osq_lock.c | 23 ++-
  1 file changed, 10 insertions(+), 13 deletions(-)

diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
index 55f5db896c02..9bb3a077ba92 100644
--- a/kernel/locking/osq_lock.c
+++ b/kernel/locking/osq_lock.c
@@ -48,18 +48,17 @@ static inline struct optimistic_spin_node *decode_cpu(int 
encoded_cpu_val)
  static inline struct optimistic_spin_node *
  osq_wait_next(struct optimistic_spin_queue *lock,
  struct optimistic_spin_node *node,
- struct optimistic_spin_node *prev)
+ int old)


Make the last argument name more descriptive, like "old_cpu" as the 
"int" type does not provide enough context to allow people to guess what 
"old" may be.


Cheers,
Longman





Re: [PATCH next 1/5] locking/osq_lock: Move the definition of optimistic_spin_node into osf_lock.c

2023-12-29 Thread Waiman Long

On 12/29/23 15:53, David Laight wrote:

struct optimistic_spin_node is private to the implementation.
Move it into the C file to ensure nothing is accessing it.

Signed-off-by: David Laight 
---
  include/linux/osq_lock.h  | 5 -
  kernel/locking/osq_lock.c | 7 +++
  2 files changed, 7 insertions(+), 5 deletions(-)

diff --git a/include/linux/osq_lock.h b/include/linux/osq_lock.h
index 5581dbd3bd34..ea8fb31379e3 100644
--- a/include/linux/osq_lock.h
+++ b/include/linux/osq_lock.h
@@ -6,11 +6,6 @@
   * An MCS like lock especially tailored for optimistic spinning for sleeping
   * lock implementations (mutex, rwsem, etc).
   */
-struct optimistic_spin_node {
-   struct optimistic_spin_node *next, *prev;
-   int locked; /* 1 if lock acquired */
-   int cpu; /* encoded CPU # + 1 value */
-};
  
  struct optimistic_spin_queue {

/*
diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
index d5610ad52b92..d414eef4bec6 100644
--- a/kernel/locking/osq_lock.c
+++ b/kernel/locking/osq_lock.c
@@ -11,6 +11,13 @@
   * called from interrupt context and we have preemption disabled while
   * spinning.
   */
+
+struct optimistic_spin_node {
+   struct optimistic_spin_node *next, *prev;
+   int locked; /* 1 if lock acquired */
+   int cpu; /* encoded CPU # + 1 value */
+};
+
  static DEFINE_PER_CPU_SHARED_ALIGNED(struct optimistic_spin_node, osq_node);
  
  /*


Please correct the patch title "osf_lock.c" => "osq_lock.c".

After the fix, you can add

Acked-by: Waiman Long 




Re: [PATCH next 3/5] locking/osq_lock: Clarify osq_wait_next()

2023-12-29 Thread Linus Torvalds
On Fri, 29 Dec 2023 at 12:56, David Laight  wrote:
>
> osq_wait_next() is passed 'prev' from osq_lock() and NULL from osq_unlock()
> but only needs the 'cpu' value to write to lock->tail.
> Just pass prev->cpu or OSQ_UNLOCKED_VAL instead.
>
> Also directly return NULL or 'next' instead of breaking the loop.

Please split these two totally independent things out of the patch,
just to make things much more obvious.

I like the new calling convention, but I don't like how the patch
isn't obviously just that.

In fact, I'd take your patch #1 and just the calling convention change
from #3 as "these are obviously not changing anything at all, only
moving things to more local places".

I'd also take the other part of #3 as a "clearly doesn't change
anything" but it should be a separate patch, and it should be done
differently: make 'next' be local to just *inside* the for-loop (in
fact, make it local to the if-statement that sets it), to clarify the
whole thing that it can never be non-NULL at the top of the loop, and
can never have any long-term semantics.
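
A rough sketch of that shape (based on the osq_wait_next() loop from patch
3/5; purely illustrative, not code from this series):

    for (;;) {
            if (atomic_read(&lock->tail) == curr &&
                atomic_cmpxchg_acquire(&lock->tail, curr, old) == curr)
                    return NULL;

            if (node->next) {
                    struct optimistic_spin_node *next = xchg(&node->next, NULL);

                    if (next)
                            return next;
            }

            cpu_relax();
    }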

The other parts actually change some logic, and would need the OSQ
people to take a more serious look.

Linus



[PATCH next 2/5] locking/osq_lock: Avoid dirtying the local cpu's 'node' in the osq_lock() fast path.

2023-12-29 Thread David Laight
osq_lock() starts by setting node->next to NULL and node->locked to 0.
Careful analysis shows that node->next is always NULL on entry.

node->locked is set non-zero by another cpu to force a wakeup.
This can only happen after the 'prev->next = node' assignment,
so locked can be set to zero just before that (along with the assignment
to node->prev).

Only initialise node->cpu once, after that use its value instead
of smp_processor_id() - which is probably a real function call.

Should reduce cache-line bouncing a little.

Signed-off-by: David Laight 
---

Re-send without the 'RE:' on the subject line.
 kernel/locking/osq_lock.c | 13 ++---
 1 file changed, 6 insertions(+), 7 deletions(-)

diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
index d414eef4bec6..55f5db896c02 100644
--- a/kernel/locking/osq_lock.c
+++ b/kernel/locking/osq_lock.c
@@ -51,7 +51,7 @@ osq_wait_next(struct optimistic_spin_queue *lock,
  struct optimistic_spin_node *prev)
 {
struct optimistic_spin_node *next = NULL;
-   int curr = encode_cpu(smp_processor_id());
+   int curr = node->cpu;
int old;
 
/*
@@ -98,12 +98,10 @@ bool osq_lock(struct optimistic_spin_queue *lock)
 {
struct optimistic_spin_node *node = this_cpu_ptr(&osq_node);
struct optimistic_spin_node *prev, *next;
-   int curr = encode_cpu(smp_processor_id());
int old;
 
-   node->locked = 0;
-   node->next = NULL;
-   node->cpu = curr;
+   if (unlikely(node->cpu == OSQ_UNLOCKED_VAL))
+   node->cpu = encode_cpu(smp_processor_id());
 
/*
 * We need both ACQUIRE (pairs with corresponding RELEASE in
@@ -111,12 +109,13 @@ bool osq_lock(struct optimistic_spin_queue *lock)
 * the node fields we just initialised) semantics when updating
 * the lock tail.
 */
-   old = atomic_xchg(&lock->tail, curr);
+   old = atomic_xchg(&lock->tail, node->cpu);
if (old == OSQ_UNLOCKED_VAL)
return true;
 
prev = decode_cpu(old);
node->prev = prev;
+   node->locked = 0;
 
/*
 * osq_lock()   unqueue
@@ -214,7 +213,7 @@ bool osq_lock(struct optimistic_spin_queue *lock)
 void osq_unlock(struct optimistic_spin_queue *lock)
 {
struct optimistic_spin_node *node, *next;
-   int curr = encode_cpu(smp_processor_id());
+   int curr = raw_cpu_read(osq_node.cpu);
 
/*
 * Fast path for the uncontended case.
-- 
2.17.1





Re: [PATCH] ftrace: Fix modification of direct_function hash while in use

2023-12-29 Thread Steven Rostedt


Masami and Jiri,

This patch made it through all my tests. If I can get an Acked-by by
Sunday, I'll include it in my push to Linus (I have a couple of other fixes
to send him).

-- Steve


On Fri, 29 Dec 2023 11:51:34 -0500
Steven Rostedt  wrote:

> From: "Steven Rostedt (Google)" 
> 
> Masami Hiramatsu reported a memory leak in register_ftrace_direct() where
> if the number of new entries added is large enough to cause two
> allocations in the loop:
> 
> for (i = 0; i < size; i++) {
> hlist_for_each_entry(entry, &hash->buckets[i], hlist) {
> new = ftrace_add_rec_direct(entry->ip, addr, 
> &free_hash);
> if (!new)
> goto out_remove;
> entry->direct = addr;
> }
> }
> 
> Where ftrace_add_rec_direct() has:
> 
> if (ftrace_hash_empty(direct_functions) ||
> direct_functions->count > 2 * (1 << direct_functions->size_bits)) 
> {
> struct ftrace_hash *new_hash;
> int size = ftrace_hash_empty(direct_functions) ? 0 :
> direct_functions->count + 1;
> 
> if (size < 32)
> size = 32;
> 
> new_hash = dup_hash(direct_functions, size);
> if (!new_hash)
> return NULL;
> 
> *free_hash = direct_functions;
> direct_functions = new_hash;
> }
> 
> The "*free_hash = direct_functions;" can happen twice, losing the previous
> allocation of direct_functions.
> 
> But this also exposed a more serious bug.
> 
> The modification of direct_functions above is not safe. As
> direct_functions can be referenced at any time to find what direct caller
> it should call, the time between:
> 
> new_hash = dup_hash(direct_functions, size);
>  and
> direct_functions = new_hash;
> 
> can have a race with another CPU (or even this one if it gets interrupted),
> and the entries being moved to the new hash are not referenced.
> 
> That's because the "dup_hash()" is really misnamed and is really a
> "move_hash()". It moves the entries from the old hash to the new one.
> 
> Now even if that was changed, this code is not proper as direct_functions
> should not be updated until the end. That is the best way to handle
> function reference changes, and is the way other parts of ftrace handles
> this.
> 
> The following is done:
> 
>  1. Change add_hash_entry() to return the entry it created and inserted
> into the hash, and not just return success or not.
> 
>  2. Replace ftrace_add_rec_direct() with add_hash_entry(), and remove
> the former.
> 
>  3. Allocate a "new_hash" at the start that is made for holding both the
> new hash entries as well as the existing entries in direct_functions.
> 
>  4. Copy (not move) the direct_function entries over to the new_hash.
> 
>  5. Copy the entries of the added hash to the new_hash.
> 
>  6. If everything succeeds, then use rcu_assign_pointer() to update the
> direct_functions with the new_hash.
> 
> This simplifies the code and fixes both the memory leak as well as the
> race condition mentioned above.
> 
> Link: https://lore.kernel.org/all/170368070504.42064.8960569647118388081.stgit@devnote2/
> 
> Cc: sta...@vger.kernel.org
> Fixes: 763e34e74bb7d ("ftrace: Add register_ftrace_direct()")
> Signed-off-by: Steven Rostedt (Google) 
> ---
>  kernel/trace/ftrace.c | 100 --
>  1 file changed, 47 insertions(+), 53 deletions(-)
> 
> diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c
> index 8de8bec5f366..b01ae7d36021 100644
> --- a/kernel/trace/ftrace.c
> +++ b/kernel/trace/ftrace.c
> @@ -1183,18 +1183,19 @@ static void __add_hash_entry(struct ftrace_hash *hash,
>   hash->count++;
>  }
>  
> -static int add_hash_entry(struct ftrace_hash *hash, unsigned long ip)
> +static struct ftrace_func_entry *
> +add_hash_entry(struct ftrace_hash *hash, unsigned long ip)
>  {
>   struct ftrace_func_entry *entry;
>  
>   entry = kmalloc(sizeof(*entry), GFP_KERNEL);
>   if (!entry)
> - return -ENOMEM;
> + return NULL;
>  
>   entry->ip = ip;
>   __add_hash_entry(hash, entry);
>  
> - return 0;
> + return entry;
>  }
>  
>  static void
> @@ -1349,7 +1350,6 @@ alloc_and_copy_ftrace_hash(int size_bits, struct 
> ftrace_hash *hash)
>   struct ftrace_func_entry *entry;
>   struct ftrace_hash *new_hash;
>   int size;
> - int ret;
>   int i;
>  
>   new_hash = alloc_ftrace_hash(size_bits);
> @@ -1366,8 +1366,7 @@ alloc_and_copy_ftrace_hash(int size_bits, struct 
> ftrace_hash *hash)
>   size = 1 << hash->size_bits;
>   for (i = 0; i < size; i++) {
>   hlist_for_each_entry(entry, &hash->buckets[i], hlist) {
> - ret = add_hash_entry(new_hash, entry->ip);
> -   

RE: [PATCH next 2/5] locking/osq_lock: Avoid dirtying the local cpu's 'node' in the osq_lock() fast path.

2023-12-29 Thread David Laight
osq_lock() starts by setting node->next to NULL and node->locked to 0.
Careful analysis shows that node->next is always NULL on entry.

node->locked is set non-zero by another cpu to force a wakeup.
This can only happen after the 'prev->next = node' assignment,
so locked can be set to zero just before that (along with the assignment
to node->prev).

Only initialise node->cpu once, after that use its value instead
of smp_processor_id() - which is probably a real function call.

Should reduce cache-line bouncing a little.

Signed-off-by: David Laight 
---
 kernel/locking/osq_lock.c | 13 ++---
 1 file changed, 6 insertions(+), 7 deletions(-)

diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
index d414eef4bec6..55f5db896c02 100644
--- a/kernel/locking/osq_lock.c
+++ b/kernel/locking/osq_lock.c
@@ -51,7 +51,7 @@ osq_wait_next(struct optimistic_spin_queue *lock,
  struct optimistic_spin_node *prev)
 {
struct optimistic_spin_node *next = NULL;
-   int curr = encode_cpu(smp_processor_id());
+   int curr = node->cpu;
int old;
 
/*
@@ -98,12 +98,10 @@ bool osq_lock(struct optimistic_spin_queue *lock)
 {
struct optimistic_spin_node *node = this_cpu_ptr(&osq_node);
struct optimistic_spin_node *prev, *next;
-   int curr = encode_cpu(smp_processor_id());
int old;
 
-   node->locked = 0;
-   node->next = NULL;
-   node->cpu = curr;
+   if (unlikely(node->cpu == OSQ_UNLOCKED_VAL))
+   node->cpu = encode_cpu(smp_processor_id());
 
/*
 * We need both ACQUIRE (pairs with corresponding RELEASE in
@@ -111,12 +109,13 @@ bool osq_lock(struct optimistic_spin_queue *lock)
 * the node fields we just initialised) semantics when updating
 * the lock tail.
 */
-   old = atomic_xchg(&lock->tail, curr);
+   old = atomic_xchg(&lock->tail, node->cpu);
if (old == OSQ_UNLOCKED_VAL)
return true;
 
prev = decode_cpu(old);
node->prev = prev;
+   node->locked = 0;
 
/*
 * osq_lock()   unqueue
@@ -214,7 +213,7 @@ bool osq_lock(struct optimistic_spin_queue *lock)
 void osq_unlock(struct optimistic_spin_queue *lock)
 {
struct optimistic_spin_node *node, *next;
-   int curr = encode_cpu(smp_processor_id());
+   int curr = raw_cpu_read(osq_node.cpu);
 
/*
 * Fast path for the uncontended case.
-- 
2.17.1





[PATCH next 1/5] locking/osq_lock: Move the definition of optimistic_spin_node into osf_lock.c

2023-12-29 Thread David Laight
struct optimistic_spin_node is private to the implementation.
Move it into the C file to ensure nothing is accessing it.

Signed-off-by: David Laight 
---
 include/linux/osq_lock.h  | 5 -
 kernel/locking/osq_lock.c | 7 +++
 2 files changed, 7 insertions(+), 5 deletions(-)

diff --git a/include/linux/osq_lock.h b/include/linux/osq_lock.h
index 5581dbd3bd34..ea8fb31379e3 100644
--- a/include/linux/osq_lock.h
+++ b/include/linux/osq_lock.h
@@ -6,11 +6,6 @@
  * An MCS like lock especially tailored for optimistic spinning for sleeping
  * lock implementations (mutex, rwsem, etc).
  */
-struct optimistic_spin_node {
-   struct optimistic_spin_node *next, *prev;
-   int locked; /* 1 if lock acquired */
-   int cpu; /* encoded CPU # + 1 value */
-};
 
 struct optimistic_spin_queue {
/*
diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
index d5610ad52b92..d414eef4bec6 100644
--- a/kernel/locking/osq_lock.c
+++ b/kernel/locking/osq_lock.c
@@ -11,6 +11,13 @@
  * called from interrupt context and we have preemption disabled while
  * spinning.
  */
+
+struct optimistic_spin_node {
+   struct optimistic_spin_node *next, *prev;
+   int locked; /* 1 if lock acquired */
+   int cpu; /* encoded CPU # + 1 value */
+};
+
 static DEFINE_PER_CPU_SHARED_ALIGNED(struct optimistic_spin_node, osq_node);
 
 /*
-- 
2.17.1





[PATCH next 0/5] locking/osq_lock: Optimisations to osq_lock code

2023-12-29 Thread David Laight
Zeng Heng noted that heavy use of the osq (optimistic spin queue) code
used rather more cpu than might be expected. See:
https://lore.kernel.org/lkml/202312210155.wc2huk8c-...@intel.com/T/#mcc46eedd1ef22a0d668828b1d088508c9b1875b8

Part of the problem is that there is a pretty much guaranteed cache line reload
when reading node->prev->cpu for the vcpu_is_preempted() check (even on bare
metal) in the wakeup path, which slows it down.
(On bare metal the hypervisor call is patched out, but the argument is still 
read.)

Careful analysis shows that it isn't necessary to dirty the per-cpu data
on the fast-path osq_lock() path. This may be slightly beneficial.

The code also uses this_cpu_ptr() to get the address of the per-cpu data.
On x86-64 (at least) this is implemented as:
 &per_cpu_data[smp_processor_id()]->member
ie there is a real function call, an array index and an add.
However if raw_cpu_read() can be used (which is typically just an offset
from a register - %gs for x86-64) the code will be faster.
Putting the address of the per-cpu node into itself means that only one
cache line needs to be loaded.
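
As a hedged sketch of that idea (the helper name get_local_node() is made up
for illustration; the field names follow patch 4/5):

    struct optimistic_spin_node {
            struct optimistic_spin_node *self, *next, *prev;
            int locked;
            int cpu;
    };

    static DEFINE_PER_CPU_SHARED_ALIGNED(struct optimistic_spin_node, osq_node);

    static struct optimistic_spin_node *get_local_node(void)
    {
            /* one %gs-relative load instead of this_cpu_ptr()'s address maths */
            struct optimistic_spin_node *node = raw_cpu_read(osq_node.self);

            if (unlikely(!node)) {
                    /* first osq_lock() call on this cpu */
                    node = this_cpu_ptr(&osq_node);
                    node->self = node;
            }
            return node;
    }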

I can't see a list of per-cpu data initialisation functions, so the fields
are initialised on the first osq_lock() call.

The last patch avoids the cache line reload when calling vcpu_is_preempted()
by simply saving node->prev->cpu as node->prev_cpu and updating it when
node->prev changes.
This is simpler than the patch proposed by Waiman.

David Laight (5):
  Move the definition of optimistic_spin_node into osf_lock.c
  Avoid dirtying the local cpu's 'node' in the osq_lock() fast path.
  Clarify osq_wait_next()
  Optimise per-cpu data accesses.
  Optimise vcpu_is_preempted() check.

 include/linux/osq_lock.h  |  5 
 kernel/locking/osq_lock.c | 61 +--
 2 files changed, 33 insertions(+), 33 deletions(-)

-- 
2.17.1





[PATCH next 5/5] locking/osq_lock: Optimise vcpu_is_preempted() check.

2023-12-29 Thread David Laight
The vcpu_is_preempted() test stops osq_lock() spinning if a virtual
  cpu is no longer running.
Although patched out for bare-metal the code still needs the cpu number.
Reading this from 'prev->cpu' is pretty much guaranteed to cause a cache miss
when osq_unlock() is waking up the next cpu.

Instead save 'prev->cpu' in 'node->prev_cpu' and use that value instead.
Update in the osq_lock() 'unqueue' path when 'node->prev' is changed.

This is simpler than checking for 'node->prev' changing and caching
'prev->cpu'.

Signed-off-by: David Laight 
---
 kernel/locking/osq_lock.c | 14 ++
 1 file changed, 6 insertions(+), 8 deletions(-)

diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
index b60b0add0161..89be63627434 100644
--- a/kernel/locking/osq_lock.c
+++ b/kernel/locking/osq_lock.c
@@ -14,8 +14,9 @@
 
 struct optimistic_spin_node {
struct optimistic_spin_node *self, *next, *prev;
-   int locked; /* 1 if lock acquired */
-   int cpu; /* encoded CPU # + 1 value */
+   int locked;/* 1 if lock acquired */
+   int cpu;   /* encoded CPU # + 1 value */
+   int prev_cpu;  /* actual CPU # for vcpu_is_preempted() */
 };
 
 static DEFINE_PER_CPU_SHARED_ALIGNED(struct optimistic_spin_node, osq_node);
@@ -29,11 +30,6 @@ static inline int encode_cpu(int cpu_nr)
return cpu_nr + 1;
 }
 
-static inline int node_cpu(struct optimistic_spin_node *node)
-{
-   return node->cpu - 1;
-}
-
 static inline struct optimistic_spin_node *decode_cpu(int encoded_cpu_val)
 {
int cpu_nr = encoded_cpu_val - 1;
@@ -114,6 +110,7 @@ bool osq_lock(struct optimistic_spin_queue *lock)
if (old == OSQ_UNLOCKED_VAL)
return true;
 
+   node->prev_cpu = old - 1;
prev = decode_cpu(old);
node->prev = prev;
node->locked = 0;
@@ -148,7 +145,7 @@ bool osq_lock(struct optimistic_spin_queue *lock)
 * polling, be careful.
 */
if (smp_cond_load_relaxed(&node->locked, VAL || need_resched() ||
- vcpu_is_preempted(node_cpu(node->prev
+ vcpu_is_preempted(node->prev_cpu)))
return true;
 
/* unqueue */
@@ -205,6 +202,7 @@ bool osq_lock(struct optimistic_spin_queue *lock)
 * it will wait in Step-A.
 */
 
+   WRITE_ONCE(next->prev_cpu, prev->cpu - 1);
WRITE_ONCE(next->prev, prev);
WRITE_ONCE(prev->next, next);
 
-- 
2.17.1





[PATCH next 3/5] locking/osq_lock: Clarify osq_wait_next()

2023-12-29 Thread David Laight
osq_wait_next() is passed 'prev' from osq_lock() and NULL from osq_unlock()
but only needs the 'cpu' value to write to lock->tail.
Just pass prev->cpu or OSQ_UNLOCKED_VAL instead.

Also directly return NULL or 'next' instead of breaking the loop.

Should have no effect on the generated code since gcc manages to
assume that 'prev != NULL' due to an earlier dereference.

Signed-off-by: David Laight 
---
 kernel/locking/osq_lock.c | 23 ++-
 1 file changed, 10 insertions(+), 13 deletions(-)

diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
index 55f5db896c02..9bb3a077ba92 100644
--- a/kernel/locking/osq_lock.c
+++ b/kernel/locking/osq_lock.c
@@ -48,18 +48,17 @@ static inline struct optimistic_spin_node *decode_cpu(int 
encoded_cpu_val)
 static inline struct optimistic_spin_node *
 osq_wait_next(struct optimistic_spin_queue *lock,
  struct optimistic_spin_node *node,
- struct optimistic_spin_node *prev)
+ int old)
 {
-   struct optimistic_spin_node *next = NULL;
+   struct optimistic_spin_node *next;
int curr = node->cpu;
-   int old;
 
/*
-* If there is a prev node in queue, then the 'old' value will be
-* the prev node's CPU #, else it's set to OSQ_UNLOCKED_VAL since if
-* we're currently last in queue, then the queue will then become empty.
+* If osq_lock() is being cancelled there must be a previous node
+* and 'old' is its CPU #.
+* For osq_unlock() there is never a previous node and old is set
+* to OSQ_UNLOCKED_VAL.
 */
-   old = prev ? prev->cpu : OSQ_UNLOCKED_VAL;
 
for (;;) {
if (atomic_read(&lock->tail) == curr &&
@@ -69,7 +68,7 @@ osq_wait_next(struct optimistic_spin_queue *lock,
 * will now observe @lock and will complete its
 * unlock()/unqueue().
 */
-   break;
+   return NULL;
}
 
/*
@@ -85,13 +84,11 @@ osq_wait_next(struct optimistic_spin_queue *lock,
if (node->next) {
next = xchg(&node->next, NULL);
if (next)
-   break;
+   return next;
}
 
cpu_relax();
}
-
-   return next;
 }
 
 bool osq_lock(struct optimistic_spin_queue *lock)
@@ -192,7 +189,7 @@ bool osq_lock(struct optimistic_spin_queue *lock)
 * back to @prev.
 */
 
-   next = osq_wait_next(lock, node, prev);
+   next = osq_wait_next(lock, node, prev->cpu);
if (!next)
return false;
 
@@ -232,7 +229,7 @@ void osq_unlock(struct optimistic_spin_queue *lock)
return;
}
 
-   next = osq_wait_next(lock, node, NULL);
+   next = osq_wait_next(lock, node, OSQ_UNLOCKED_VAL);
if (next)
WRITE_ONCE(next->locked, 1);
 }
-- 
2.17.1





[PATCH next 4/5] locking/osq_lock: Optimise per-cpu data accesses.

2023-12-29 Thread David Laight
this_cpu_ptr() is rather more expensive than raw_cpu_read() since
the latter can use an 'offset from register' (%gs for x86-64).

Add a 'self' field to 'struct optimistic_spin_node' that can be
read with raw_cpu_read(), initialise on first call.

Signed-off-by: David Laight 
---
 kernel/locking/osq_lock.c | 14 +-
 1 file changed, 9 insertions(+), 5 deletions(-)

diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
index 9bb3a077ba92..b60b0add0161 100644
--- a/kernel/locking/osq_lock.c
+++ b/kernel/locking/osq_lock.c
@@ -13,7 +13,7 @@
  */
 
 struct optimistic_spin_node {
-   struct optimistic_spin_node *next, *prev;
+   struct optimistic_spin_node *self, *next, *prev;
int locked; /* 1 if lock acquired */
int cpu; /* encoded CPU # + 1 value */
 };
@@ -93,12 +93,16 @@ osq_wait_next(struct optimistic_spin_queue *lock,
 
 bool osq_lock(struct optimistic_spin_queue *lock)
 {
-   struct optimistic_spin_node *node = this_cpu_ptr(&osq_node);
+   struct optimistic_spin_node *node = raw_cpu_read(osq_node.self);
struct optimistic_spin_node *prev, *next;
int old;
 
-   if (unlikely(node->cpu == OSQ_UNLOCKED_VAL))
-   node->cpu = encode_cpu(smp_processor_id());
+   if (unlikely(!node)) {
+   int cpu = encode_cpu(smp_processor_id());
+   node = decode_cpu(cpu);
+   node->self = node;
+   node->cpu = cpu;
+   }
 
/*
 * We need both ACQUIRE (pairs with corresponding RELEASE in
@@ -222,7 +226,7 @@ void osq_unlock(struct optimistic_spin_queue *lock)
/*
 * Second most likely case.
 */
-   node = this_cpu_ptr(&osq_node);
+   node = raw_cpu_read(osq_node.self);
next = xchg(&node->next, NULL);
if (next) {
WRITE_ONCE(next->locked, 1);
-- 
2.17.1





Re: [PATCH 1/5] PM: domains: Add helper functions to attach/detach multiple PM domains

2023-12-29 Thread Nikunj Kela



On 12/28/2023 3:41 AM, Ulf Hansson wrote:

Attaching/detaching of a device to multiple PM domains has started to
become a common operation for many drivers, typically during ->probe() and
->remove(). In most cases, this has led to lots of boilerplate code in the
drivers.

To fix up the situation, let's introduce a pair of helper functions,
dev_pm_domain_attach|detach_list(), that drivers can use instead of the
open-coding. Note that it seems reasonable to limit the support for these
helpers to DT based platforms, as it's the only valid use case for now.

Signed-off-by: Ulf Hansson 
---
  drivers/base/power/common.c | 133 
  include/linux/pm_domain.h   |  38 +++
  2 files changed, 171 insertions(+)

diff --git a/drivers/base/power/common.c b/drivers/base/power/common.c
index 44ec20918a4d..1ef51889fc6f 100644
--- a/drivers/base/power/common.c
+++ b/drivers/base/power/common.c
@@ -167,6 +167,114 @@ struct device *dev_pm_domain_attach_by_name(struct device 
*dev,
  }
  EXPORT_SYMBOL_GPL(dev_pm_domain_attach_by_name);
  
+/**

+ * dev_pm_domain_attach_list - Associate a device with its PM domains.
+ * @dev: The device used to lookup the PM domains for.
+ * @data: The data used for attaching to the PM domains.
+ * @list: An out-parameter with an allocated list of attached PM domains.
+ *
+ * This function helps to attach a device to its multiple PM domains. The
+ * caller, which is typically a driver's probe function, may provide a list of
+ * names for the PM domains that we should try to attach the device to, but it
+ * may also provide an empty list, in case the attach should be done for all of
+ * the available PM domains.
+ *
+ * Callers must ensure proper synchronization of this function with power
+ * management callbacks.
+ *
+ * Returns the number of attached PM domains or a negative error code in case of
+ * a failure. Note that, to detach the list of PM domains, the driver shall call
+ * dev_pm_domain_detach_list(), typically during the remove phase.
+ */
+int dev_pm_domain_attach_list(struct device *dev,
+ const struct dev_pm_domain_attach_data *data,
+ struct dev_pm_domain_list **list)
+{
+   struct device_node *np = dev->of_node;
+   struct dev_pm_domain_list *pds;
+   struct device *pd_dev = NULL;
+   int ret, i, num_pds = 0;
+   bool by_id = true;
+   u32 link_flags = data && data->pd_flags & PD_FLAG_NO_DEV_LINK ? 0 :
+   DL_FLAG_STATELESS | DL_FLAG_PM_RUNTIME;
+
+   if (dev->pm_domain)
+   return -EEXIST;
+
+   /* For now this is limited to OF based platforms. */
+   if (!np)
+   return 0;
+
+   if (data && data->pd_names) {
+   num_pds = data->num_pd_names;
+   by_id = false;
+   } else {
+   num_pds = of_count_phandle_with_args(np, "power-domains",
+"#power-domain-cells");
+   }
+
+   if (num_pds <= 0)
+   return 0;
+
+   pds = devm_kzalloc(dev, sizeof(*pds), GFP_KERNEL);
+   if (!pds)
+   return -ENOMEM;
+
+   pds->pd_devs = devm_kcalloc(dev, num_pds, sizeof(*pds->pd_devs),
+   GFP_KERNEL);
+   if (!pds->pd_devs)
+   return -ENOMEM;
+
+   pds->pd_links = devm_kcalloc(dev, num_pds, sizeof(*pds->pd_links),
+GFP_KERNEL);
+   if (!pds->pd_links)
+   return -ENOMEM;
+
+   if (link_flags && data->pd_flags & PD_FLAG_DEV_LINK_ON)


Since data is optional, this check results in a crash if data is NULL. Thanks



+   link_flags |= DL_FLAG_RPM_ACTIVE;
+
+   for (i = 0; i < num_pds; i++) {
+   if (by_id)
+   pd_dev = dev_pm_domain_attach_by_id(dev, i);
+   else
+   pd_dev = dev_pm_domain_attach_by_name(dev,
+   data->pd_names[i]);
+   if (IS_ERR_OR_NULL(pd_dev)) {
+   ret = pd_dev ? PTR_ERR(pd_dev) : -ENODEV;
+   goto err_attach;
+   }
+
+   if (link_flags) {
+   struct device_link *link;
+
+   link = device_link_add(dev, pd_dev, link_flags);
+   if (!link) {
+   ret = -ENODEV;
+   goto err_link;
+   }
+
+   pds->pd_links[i] = link;
+   }
+
+   pds->pd_devs[i] = pd_dev;
+   }
+
+   pds->num_pds = num_pds;
+   *list = pds;
+   return num_pds;
+
+err_link:
+   dev_pm_domain_detach(pd_dev, true);
+err_attach:
+   while (--i >= 0) {
+   if (pds->pd_links[i])
+   device_link_del(pds->pd_links[i]);
+   dev_pm_domain_detach(pds->pd_devs[i

Re: [RFC][PATCH 0/2] ring-buffer: Allow user space memory mapping

2023-12-29 Thread Steven Rostedt
On Fri, 29 Dec 2023 13:40:50 -0500
Steven Rostedt  wrote:

> I'm sending this to a wider audience, as I want to hear more
> feedback on this before I accept it.
>

I forgot to mention that this can be applied on top of:


   git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace.git  for-next

-- Steve



[RFC][PATCH 2/2] tracing: Allow user-space mapping of the ring-buffer

2023-12-29 Thread Steven Rostedt
From: Vincent Donnefort 

Currently, user-space extracts data from the ring-buffer via splice,
which is handy for storage or network sharing. However, due to splice
limitations, it is impossible to do real-time analysis without a copy.

A solution for that problem is to let the user-space map the ring-buffer
directly.

The mapping is exposed via the per-CPU file trace_pipe_raw. The first
element of the mapping is the meta-page. It is followed by each
subbuffer constituting the ring-buffer, ordered by their unique page ID:

  * Meta-page -- include/uapi/linux/trace_mmap.h for a description
  * Subbuf ID 0
  * Subbuf ID 1
 ...

It is therefore easy to translate a subbuf ID into an offset in the
mapping:

  reader_id = meta->reader->id;
  reader_offset = meta->meta_page_size + reader_id * meta->subbuf_size;

When new data is available, the mapper must call a newly introduced ioctl:
TRACE_MMAP_IOCTL_GET_READER. This will update the Meta-page reader ID to
point to the next reader containing unread data.
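
A hedged user-space sketch of that flow (it assumes the uapi header below is
installed as <linux/trace_mmap.h> and uses the usual per-CPU trace_pipe_raw
path; error handling omitted):

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>
    #include <sys/ioctl.h>
    #include <sys/mman.h>
    #include <linux/trace_mmap.h>

    int main(void)
    {
            int fd = open("/sys/kernel/tracing/per_cpu/cpu0/trace_pipe_raw", O_RDONLY);
            size_t psize = getpagesize();
            struct trace_buffer_meta *meta;
            char *data, *subbuf;

            /* meta-page first, then the sub-buffers right behind it */
            meta = mmap(NULL, psize, PROT_READ, MAP_SHARED, fd, 0);
            data = mmap(NULL, (size_t)meta->subbuf_size * meta->nr_subbufs,
                        PROT_READ, MAP_SHARED, fd, meta->meta_page_size);

            /* swap in a reader sub-buffer containing unread data */
            ioctl(fd, TRACE_MMAP_IOCTL_GET_READER);

            /* translate the reader ID into an offset in the data mapping */
            subbuf = data + (size_t)meta->subbuf_size * meta->reader.id;
            printf("reader id %u, subbuf at %p\n", meta->reader.id, subbuf);
            return 0;
    }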

Link: https://lore.kernel.org/linux-trace-kernel/20231221173523.3015715-3-vdonnef...@google.com

Signed-off-by: Vincent Donnefort 
Signed-off-by: Steven Rostedt (Google) 
---
 include/uapi/linux/trace_mmap.h |  2 +
 kernel/trace/trace.c| 79 +++--
 2 files changed, 78 insertions(+), 3 deletions(-)

diff --git a/include/uapi/linux/trace_mmap.h b/include/uapi/linux/trace_mmap.h
index f950648b0ba9..8c49489c5867 100644
--- a/include/uapi/linux/trace_mmap.h
+++ b/include/uapi/linux/trace_mmap.h
@@ -26,4 +26,6 @@ struct trace_buffer_meta {
__u32   meta_struct_len;/* Len of this struct */
 };
 
+#define TRACE_MMAP_IOCTL_GET_READER_IO('T', 0x1)
+
 #endif /* _UAPI_TRACE_MMAP_H_ */
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 46dbe22121e9..cfeaf2cd204e 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -8583,15 +8583,31 @@ tracing_buffers_splice_read(struct file *file, loff_t 
*ppos,
return ret;
 }
 
-/* An ioctl call with cmd 0 to the ring buffer file will wake up all waiters */
 static long tracing_buffers_ioctl(struct file *file, unsigned int cmd, 
unsigned long arg)
 {
struct ftrace_buffer_info *info = file->private_data;
struct trace_iterator *iter = &info->iter;
+   int err;
 
-   if (cmd)
-   return -ENOIOCTLCMD;
+   if (cmd == TRACE_MMAP_IOCTL_GET_READER) {
+   if (!(file->f_flags & O_NONBLOCK)) {
+   err = ring_buffer_wait(iter->array_buffer->buffer,
+  iter->cpu_file,
+  iter->tr->buffer_percent);
+   if (err)
+   return err;
+   }
 
+   return ring_buffer_map_get_reader(iter->array_buffer->buffer,
+ iter->cpu_file);
+   } else if (cmd) {
+   return -ENOTTY;
+   }
+
+   /*
+* An ioctl call with cmd 0 to the ring buffer file will wake up all
+* waiters
+*/
mutex_lock(&trace_types_lock);
 
iter->wait_index++;
@@ -8604,6 +8620,62 @@ static long tracing_buffers_ioctl(struct file *file, 
unsigned int cmd, unsigned
return 0;
 }
 
+static vm_fault_t tracing_buffers_mmap_fault(struct vm_fault *vmf)
+{
+   struct ftrace_buffer_info *info = vmf->vma->vm_file->private_data;
+   struct trace_iterator *iter = &info->iter;
+   vm_fault_t ret = VM_FAULT_SIGBUS;
+   struct page *page;
+
+   page = ring_buffer_map_fault(iter->array_buffer->buffer, iter->cpu_file,
+vmf->pgoff);
+   if (!page)
+   return ret;
+
+   get_page(page);
+   vmf->page = page;
+   vmf->page->mapping = vmf->vma->vm_file->f_mapping;
+   vmf->page->index = vmf->pgoff;
+
+   return 0;
+}
+
+static void tracing_buffers_mmap_close(struct vm_area_struct *vma)
+{
+   struct ftrace_buffer_info *info = vma->vm_file->private_data;
+   struct trace_iterator *iter = &info->iter;
+
+   ring_buffer_unmap(iter->array_buffer->buffer, iter->cpu_file);
+}
+
+static void tracing_buffers_mmap_open(struct vm_area_struct *vma)
+{
+   struct ftrace_buffer_info *info = vma->vm_file->private_data;
+   struct trace_iterator *iter = &info->iter;
+
+   WARN_ON(ring_buffer_map(iter->array_buffer->buffer, iter->cpu_file));
+}
+
+static const struct vm_operations_struct tracing_buffers_vmops = {
+   .open   = tracing_buffers_mmap_open,
+   .close  = tracing_buffers_mmap_close,
+   .fault  = tracing_buffers_mmap_fault,
+};
+
+static int tracing_buffers_mmap(struct file *filp, struct vm_area_struct *vma)
+{
+   struct ftrace_buffer_info *info = filp->private_data;
+   struct trace_iterator *iter = &info->iter;
+
+   if (vma->vm_flags & VM_WRITE)
+   retu

[RFC][PATCH 0/2] ring-buffer: Allow user space memory mapping

2023-12-29 Thread Steven Rostedt


I'm sending this to a wider audience, as I want to hear more
feedback on this before I accept it.

Vincent has been working on allowing the ftrace ring buffer to be
memory mapped into user space. This has been going on since
last year, where we talked at the 2022 Tracing Summit in London.

Vincent's last series can be found here:

   
https://lore.kernel.org/linux-trace-kernel/20231221173523.3015715-1-vdonnef...@google.com/

But I'm posting these as these are what I now have in my queue.

I've tested these patches pretty thoroughly and they look good.
I even have a libtracefs API patch ready to implement this.

For testing, you can install:

 git://git.kernel.org/pub/scm/libs/libtrace/libtraceevent.git (version 1.8.1)

 git://git.kernel.org/pub/scm/libs/libtrace/libtracefs.git
  (latest branch of libtracefs)

Then apply:

  https://lore.kernel.org/all/20231228201100.78aae...@rorschach.local.home/

to libtracefs. And then build the samples:

  make samples

Which will create: bin/cpu-map

That you can use with:

  # trace-cmd start -e sched
  # bin/cpu-map 0

Which will output all the events in CPU 0 via the memory mapping if it is
supported by the kernel or else it will print at the start:

  "Was not able to map, falling back to buffered read"

If you want to see the source code for cpu-map, you have to look
at the man page ;-) The "make samples" will extract the example code
from the man pages and build them.

  Documentation/libtracefs-cpu-map.txt

The library API is rather simple, it just has:

 tracefs_cpu_open_mapped()
 tracefs_cpu_is_mapped()
 tracefs_cpu_map()
 tracefs_cpu_unmap()

Which will create a tracefs_cpu handle for reading. If it fails to
map, it just does a normal read.

Anyway, this email is not about the library interface, but more of
the kernel interface. And that is this:

First there's a "meta page" that can mapped via mmap() on the
trace_pipe_raw file (there's one trace_pipe_raw file per CPU):

page_size = getpagesize();
meta = mmap(NULL, page_size, PROT_READ, MAP_SHARED, fd, 0);

The meta will have a layout of:

struct trace_buffer_meta {
unsigned long   entries;
unsigned long   overrun;
unsigned long   read;

unsigned long   subbufs_touched;
unsigned long   subbufs_lost;
unsigned long   subbufs_read;

struct {
unsigned long   lost_events;/* Events lost at the time of the reader swap */
__u32   id; /* Reader subbuf ID from 0 to nr_subbufs - 1 */
__u32   read;   /* Number of bytes read on the reader subbuf */
} reader;

__u32   subbuf_size;/* Size of each subbuf including the header */
__u32   nr_subbufs; /* Number of subbufs in the ring-buffer */

__u32   meta_page_size; /* Size of the meta-page */
__u32   meta_struct_len;/* Len of this struct */
};

The meta_page_size can grow if we need to (and then we can extend this
API if we need to). If the meta_page_size is greater than a page, the
application should remap it:

if (meta->meta_page_size > page_size) {
int new_size = meta->meta_page_size;
munmap(meta, page_size);
meta = mmap(NULL, new_size, PROT_READ, MAP_SHARED, fd, 0);  
}

Now the sub buffers of the ring buffer are mapped right after
the meta page. They can be mapped with:


data_len = meta->subbuf_size * meta->nr_subbufs;
data = mmap(NULL, data_len, PROT_READ, MAP_SHARED,
  fd, meta->meta_page_size);

This maps all the ring buffer sub-buffers as well as the reader page.
The way this works is that the reader page is free to be read from
user space and the writer will only write to the other pages.

To get the reader page:

subbuf = data + meta->subbuf_size * meta->reader.id;

Then you can load that into the libtraceevent kbuffer API:

struct tep_handle *tep = tracefs_local_events(NULL);
kbuf = tep_kbuffer(tep);
kbuffer_load_subbuffer(kbuf, subbuf);

And use the kbuf descriptor to iterate the events.
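
(For completeness, a hedged sketch of that iteration, assuming libtraceevent's
kbuffer_read_event()/kbuffer_next_event() helpers and the kbuf loaded above:)

    unsigned long long ts;
    void *record;

    /* walk every event in the loaded sub-buffer */
    for (record = kbuffer_read_event(kbuf, &ts); record;
         record = kbuffer_next_event(kbuf, &ts)) {
            printf("event at %llu\n", ts);
            /* hand 'record' to libtraceevent for full parsing */
    }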

When done with the reader page, the application needs to make an
ioctl() call:

ioctl(fd, TRACE_MMAP_IOCTL_GET_READER);


This will swap the reader page with the head of the other pages:
the old reader page goes back into the writable portion of the ring
buffer where the writer can write to it, and the page that was swapped
out of the writer portion becomes the new reader page. If there is
no data, then user space can check the meta->reader.id to see if
it changed or not. If it did not change, then there's no new data.

If the writer is still on that page, it acts the same as we do in
the kernel. It can still update that page and the beginning of the
sub-buffer has the index of where the writer currently is on that
page.

All the mapped pages are read-only.

When the ring buffer is mapped, it 

[RFC][PATCH 1/2] ring-buffer: Introducing ring-buffer mapping functions

2023-12-29 Thread Steven Rostedt
From: Vincent Donnefort 

In preparation for allowing the user-space to map a ring-buffer, add
a set of mapping functions:

  ring_buffer_{map,unmap}()
  ring_buffer_map_fault()

And controls on the ring-buffer:

  ring_buffer_map_get_reader()  /* swap reader and head */

Mapping the ring-buffer also involves:

  A unique ID for each subbuf of the ring-buffer, currently they are
  only identified through their in-kernel VA.

  A meta-page, where are stored ring-buffer statistics and a
  description for the current reader

The linear mapping exposes the meta-page, and each subbuf of the
ring-buffer, ordered following their unique ID, assigned during the
first mapping.

Once mapped, no subbuf can get in or out of the ring-buffer: the buffer
size will remain unmodified and the splice enabling functions will in
reality simply memcpy the data instead of swapping subbufs.

Link: https://lore.kernel.org/linux-trace-kernel/20231221173523.3015715-2-vdonnef...@google.com

Signed-off-by: Vincent Donnefort 
Signed-off-by: Steven Rostedt (Google) 
---
 include/linux/ring_buffer.h |   7 +
 include/uapi/linux/trace_mmap.h |  29 +++
 kernel/trace/ring_buffer.c  | 382 +++-
 3 files changed, 417 insertions(+), 1 deletion(-)
 create mode 100644 include/uapi/linux/trace_mmap.h

diff --git a/include/linux/ring_buffer.h b/include/linux/ring_buffer.h
index fa802db216f9..0841ba8bab14 100644
--- a/include/linux/ring_buffer.h
+++ b/include/linux/ring_buffer.h
@@ -6,6 +6,8 @@
 #include 
 #include 
 
+#include 
+
 struct trace_buffer;
 struct ring_buffer_iter;
 
@@ -221,4 +223,9 @@ int trace_rb_cpu_prepare(unsigned int cpu, struct 
hlist_node *node);
 #define trace_rb_cpu_prepare   NULL
 #endif
 
+int ring_buffer_map(struct trace_buffer *buffer, int cpu);
+int ring_buffer_unmap(struct trace_buffer *buffer, int cpu);
+struct page *ring_buffer_map_fault(struct trace_buffer *buffer, int cpu,
+  unsigned long pgoff);
+int ring_buffer_map_get_reader(struct trace_buffer *buffer, int cpu);
 #endif /* _LINUX_RING_BUFFER_H */
diff --git a/include/uapi/linux/trace_mmap.h b/include/uapi/linux/trace_mmap.h
new file mode 100644
index ..f950648b0ba9
--- /dev/null
+++ b/include/uapi/linux/trace_mmap.h
@@ -0,0 +1,29 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+#ifndef _UAPI_TRACE_MMAP_H_
+#define _UAPI_TRACE_MMAP_H_
+
+#include 
+
+struct trace_buffer_meta {
+   unsigned long   entries;
+   unsigned long   overrun;
+   unsigned long   read;
+
+   unsigned long   subbufs_touched;
+   unsigned long   subbufs_lost;
+   unsigned long   subbufs_read;
+
+   struct {
+   unsigned long   lost_events;/* Events lost at the time of the reader swap */
+   __u32   id; /* Reader subbuf ID from 0 to nr_subbufs - 1 */
+   __u32   read;   /* Number of bytes read on the reader subbuf */
+   } reader;
+
+   __u32   subbuf_size;/* Size of each subbuf including the header */
+   __u32   nr_subbufs; /* Number of subbufs in the ring-buffer */
+
+   __u32   meta_page_size; /* Size of the meta-page */
+   __u32   meta_struct_len;/* Len of this struct */
+};
+
+#endif /* _UAPI_TRACE_MMAP_H_ */
diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 173d2595ce2d..2f3e0260db88 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -338,6 +338,7 @@ struct buffer_page {
local_t  entries;   /* entries on this page */
unsigned longreal_end;  /* real end of data */
unsigned order; /* order of the page */
+   u32  id;/* ID for external mapping */
struct buffer_data_page *page;  /* Actual data page */
 };
 
@@ -388,6 +389,7 @@ struct rb_irq_work {
boolwaiters_pending;
boolfull_waiters_pending;
boolwakeup_full;
+   boolis_cpu_buffer;
 };
 
 /*
@@ -484,6 +486,12 @@ struct ring_buffer_per_cpu {
u64 read_stamp;
/* pages removed since last reset */
unsigned long   pages_removed;
+
+   int mapped;
+   struct mutexmapping_lock;
+   unsigned long   *subbuf_ids;/* ID to addr */
+   struct trace_buffer_meta*meta_page;
+
/* ring buffer pages to update, > 0 to add, < 0 to remove */
longnr_pages_to_update;
struct list_headnew_pages; /* new pages to add */
@@ -739,6 +747,22 @@ static __always_inline bool full_hit(struct trace_buffer 
*buffer, int cpu, int f
return (dirty * 100) > (full * nr_pages

[PATCH] ftrace: Fix modification of direct_function hash while in use

2023-12-29 Thread Steven Rostedt
From: "Steven Rostedt (Google)" 

Masami Hiramatsu reported a memory leak in register_ftrace_direct() where
if the number of new entries added is large enough to cause two
allocations in the loop:

for (i = 0; i < size; i++) {
hlist_for_each_entry(entry, &hash->buckets[i], hlist) {
new = ftrace_add_rec_direct(entry->ip, addr, 
&free_hash);
if (!new)
goto out_remove;
entry->direct = addr;
}
}

Where ftrace_add_rec_direct() has:

if (ftrace_hash_empty(direct_functions) ||
direct_functions->count > 2 * (1 << direct_functions->size_bits)) {
struct ftrace_hash *new_hash;
int size = ftrace_hash_empty(direct_functions) ? 0 :
direct_functions->count + 1;

if (size < 32)
size = 32;

new_hash = dup_hash(direct_functions, size);
if (!new_hash)
return NULL;

*free_hash = direct_functions;
direct_functions = new_hash;
}

The "*free_hash = direct_functions;" can happen twice, losing the previous
allocation of direct_functions.

But this also exposed a more serious bug.

The modification of direct_functions above is not safe. As
direct_functions can be referenced at any time to find what direct caller
it should call, the time between:

new_hash = dup_hash(direct_functions, size);
 and
direct_functions = new_hash;

can have a race with another CPU (or even this one if it gets interrupted),
and the entries being moved to the new hash are not referenced.

That's because the "dup_hash()" is really misnamed and is really a
"move_hash()". It moves the entries from the old hash to the new one.

Now even if that was changed, this code is not proper as direct_functions
should not be updated until the end. That is the best way to handle
function reference changes, and is the way other parts of ftrace handles
this.

The following is done:

 1. Change add_hash_entry() to return the entry it created and inserted
into the hash, and not just return success or not.

 2. Replace ftrace_add_rec_direct() with add_hash_entry(), and remove
the former.

 3. Allocate a "new_hash" at the start that is made for holding both the
new hash entries as well as the existing entries in direct_functions.

 4. Copy (not move) the direct_function entries over to the new_hash.

 5. Copy the entries of the added hash to the new_hash.

 6. If everything succeeds, then use rcu_assign_pointer() to update the
direct_functions with the new_hash.

This simplifies the code and fixes both the memory leak as well as the
race condition mentioned above.
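
A hedged sketch of the resulting shape of the update (helper names are taken
from the diff below; the sizing and the error unwinding are simplified, and
the example function is not the actual patch):

    static int example_update_direct(struct ftrace_hash *hash, unsigned long addr)
    {
            struct ftrace_func_entry *entry, *new;
            struct ftrace_hash *new_hash;
            int size, i;

            /* 3. one hash sized for the old plus the new entries */
            new_hash = alloc_ftrace_hash(fls(direct_functions->count + hash->count));
            if (!new_hash)
                    return -ENOMEM;

            /* 4. copy (not move) the current direct_functions entries */
            size = 1 << direct_functions->size_bits;
            for (i = 0; i < size; i++) {
                    hlist_for_each_entry(entry, &direct_functions->buckets[i], hlist) {
                            new = add_hash_entry(new_hash, entry->ip);
                            if (!new)
                                    return -ENOMEM; /* real code frees new_hash here */
                            new->direct = entry->direct;
                    }
            }

            /* 5. add the caller's entries the same way, setting new->direct = addr */

            /* 6. publish only once everything has succeeded */
            rcu_assign_pointer(direct_functions, new_hash);
            return 0;
    }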

Link: https://lore.kernel.org/all/170368070504.42064.8960569647118388081.stgit@devnote2/

Cc: sta...@vger.kernel.org
Fixes: 763e34e74bb7d ("ftrace: Add register_ftrace_direct()")
Signed-off-by: Steven Rostedt (Google) 
---
 kernel/trace/ftrace.c | 100 --
 1 file changed, 47 insertions(+), 53 deletions(-)

diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c
index 8de8bec5f366..b01ae7d36021 100644
--- a/kernel/trace/ftrace.c
+++ b/kernel/trace/ftrace.c
@@ -1183,18 +1183,19 @@ static void __add_hash_entry(struct ftrace_hash *hash,
hash->count++;
 }
 
-static int add_hash_entry(struct ftrace_hash *hash, unsigned long ip)
+static struct ftrace_func_entry *
+add_hash_entry(struct ftrace_hash *hash, unsigned long ip)
 {
struct ftrace_func_entry *entry;
 
entry = kmalloc(sizeof(*entry), GFP_KERNEL);
if (!entry)
-   return -ENOMEM;
+   return NULL;
 
entry->ip = ip;
__add_hash_entry(hash, entry);
 
-   return 0;
+   return entry;
 }
 
 static void
@@ -1349,7 +1350,6 @@ alloc_and_copy_ftrace_hash(int size_bits, struct ftrace_hash *hash)
struct ftrace_func_entry *entry;
struct ftrace_hash *new_hash;
int size;
-   int ret;
int i;
 
new_hash = alloc_ftrace_hash(size_bits);
@@ -1366,8 +1366,7 @@ alloc_and_copy_ftrace_hash(int size_bits, struct ftrace_hash *hash)
size = 1 << hash->size_bits;
for (i = 0; i < size; i++) {
hlist_for_each_entry(entry, &hash->buckets[i], hlist) {
-   ret = add_hash_entry(new_hash, entry->ip);
-   if (ret < 0)
+   if (add_hash_entry(new_hash, entry->ip) == NULL)
goto free_hash;
}
}
@@ -2536,7 +2535,7 @@ ftrace_find_unique_ops(struct dyn_ftrace *rec)
 
 #ifdef CONFIG_DYNAMIC_FTRACE_WITH_DIRECT_CALLS
 /* Protected by rcu_tasks for reading, and direct_mutex for writing */
-static struct ftrace_hash *direct_functions = EMPTY_HASH;
+static struct ftrace_hash __rcu *direct_functions = EMPTY_HASH;

Re: [PATCH] tracing: Fix possible memory leak in ftrace_regsiter_direct()

2023-12-29 Thread Steven Rostedt
On Wed, 27 Dec 2023 21:38:25 +0900
"Masami Hiramatsu (Google)"  wrote:

> From: Masami Hiramatsu (Google) 
> 
> If ftrace_register_direct() is called with a large number of target

There's no function called "ftrace_register_direct()", I guess you meant
register_ftrace_direct()?

> functions (e.g. 65), the free_hash can be updated twice or more
> in the ftrace_add_rec_direct() without freeing the previous hash
> memory. Thus this can cause a memory leak.
> 
> Fix this issue by expanding the direct_hash at once before
> adding the new entries.
> 
> Signed-off-by: Masami Hiramatsu (Google) 
> Fixes: f64dd4627ec6 ("ftrace: Add multi direct register/unregister interface")
> Cc: sta...@vger.kernel.org
> ---
>  kernel/trace/ftrace.c |   49 
> +++--
>  1 file changed, 31 insertions(+), 18 deletions(-)
> 
> diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c
> index 8de8bec5f366..9269c2c3e595 100644
> --- a/kernel/trace/ftrace.c
> +++ b/kernel/trace/ftrace.c
> @@ -2555,28 +2555,33 @@ unsigned long ftrace_find_rec_direct(unsigned long ip)
>   return entry->direct;
>  }
>  
> -static struct ftrace_func_entry*
> -ftrace_add_rec_direct(unsigned long ip, unsigned long addr,
> -   struct ftrace_hash **free_hash)
> +static struct ftrace_hash *ftrace_expand_direct(int inc_count)
>  {
> - struct ftrace_func_entry *entry;
> + struct ftrace_hash *new_hash, *free_hash;
> + int size = ftrace_hash_empty(direct_functions) ? 0 :
> + direct_functions->count + inc_count;
>  
> - if (ftrace_hash_empty(direct_functions) ||
> - direct_functions->count > 2 * (1 << direct_functions->size_bits)) {
> - struct ftrace_hash *new_hash;
> - int size = ftrace_hash_empty(direct_functions) ? 0 :
> - direct_functions->count + 1;
> + if (!ftrace_hash_empty(direct_functions) &&
> + size <= 2 * (1 << direct_functions->size_bits))
> + return NULL;
>  
> - if (size < 32)
> - size = 32;
> + if (size < 32)
> + size = 32;

Hmm, why the limit of 32? I know this was there before this patch, but this
patch made me notice it.

In dup_hash() we have:

bits = fls(size / 2);

/* Don't allocate too much */
if (bits > FTRACE_HASH_MAX_BITS)
bits = FTRACE_HASH_MAX_BITS;

Where bits will determine the number of buckets.

If size = 32, then bits = fls(32/2) = fls(16) = 5

So the buckets will be 2^5 = 32. Thus, you will get 32 buckets even with 64
entries, which will cause a minimum of two loop iterations to find an entry.

Is this a concern?
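
For reference, a quick standalone check of that arithmetic; fls_() here is a
stand-in for the kernel's fls() (position of the most significant set bit,
1-based, fls(0) == 0), and the numbers are the made-up sizes above:

#include <stdio.h>

static int fls_(unsigned int x)
{
	int bit = 0;

	while (x) {
		bit++;
		x >>= 1;
	}
	return bit;
}

int main(void)
{
	unsigned int size;

	for (size = 32; size <= 128; size *= 2) {
		int bits = fls_(size / 2);

		printf("size=%3u -> bits=%d -> buckets=%d\n", size, bits, 1 << bits);
	}
	return 0;	/* prints size= 32 -> bits=5 -> buckets=32, and so on */
}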


>  
> - new_hash = dup_hash(direct_functions, size);
> - if (!new_hash)
> - return NULL;
> + new_hash = dup_hash(direct_functions, size);
> + if (!new_hash)
> + return ERR_PTR(-ENOMEM);
>  
> - *free_hash = direct_functions;
> - direct_functions = new_hash;
> - }
> + free_hash = direct_functions;
> + direct_functions = new_hash;
> +
> + return free_hash;
> +}
> +
> +static struct ftrace_func_entry*
> +ftrace_add_rec_direct(unsigned long ip, unsigned long addr)
> +{
> + struct ftrace_func_entry *entry;
>  
>   entry = kmalloc(sizeof(*entry), GFP_KERNEL);
>   if (!entry)
> @@ -5436,11 +5441,19 @@ int register_ftrace_direct(struct ftrace_ops *ops, unsigned long addr)
>   }
>   }
>  
> + /* ... and prepare the insertion */
> + free_hash = ftrace_expand_direct(hash->count);

Why does the direct_functions need to be expanded before new items are
entered? Can't we do it the way ftrace does it, and that is just to fill
the hash, and then expand if necessary.

Hmm, also, I think there's a bug here too. That is, dup_hash() does not do
what it says. It doesn't copy, it actually moves. So when you use dup_hash()
with the source being direct_functions, it's removing the entries from
direct_functions, and not copying them :-/

That is, during the resize, the check against direct_functions will not
find them and things will be missed!

What I think we need to do, is what ftrace does, and that is to add
everything to a temp hash first and then do an RCU assign to the
direct_functions.

> + if (IS_ERR(free_hash)) {
> + err = PTR_ERR(free_hash);
> + free_hash = NULL;
> + goto out_unlock;
> + }
> +
>   /* ... and insert them to direct_functions hash. */
>   err = -ENOMEM;
>   for (i = 0; i < size; i++) {
>   hlist_for_each_entry(entry, &hash->buckets[i], hlist) {
> - new = ftrace_add_rec_direct(entry->ip, addr, 
> &free_hash);
> + new = ftrace_add_rec_direct(entry->ip, addr);
>   if (!new)
>   goto out_remove;
>   entry->direct = addr;

This should fix both the leak and the fact that di

Re: [PATCH 2/2] virtiofs: Improve error handling in virtio_fs_get_tree()

2023-12-29 Thread Matthew Wilcox
On Fri, Dec 29, 2023 at 10:10:08AM +0100, Markus Elfring wrote:
> >> The kfree() function was called in two cases by
> >> the virtio_fs_get_tree() function during error handling
> >> even if the passed variable contained a null pointer.
> >
> > So what?  kfree(NULL) is perfectly acceptable.
> 
> I suggest reconsidering the usefulness of such a special function call.

Can you be more explicit in your suggestion?



Re: [PATCH] tracing: Fix blocked reader of snapshot buffer

2023-12-29 Thread Google
On Thu, 28 Dec 2023 09:51:49 -0500
Steven Rostedt  wrote:

> From: "Steven Rostedt (Google)" 
> 
> If an application blocks on the snapshot or snapshot_raw files, expecting
> to be woken up when a snapshot occurs, it will not happen. Or it may
> happen with an unexpected result.
> 
> That result is that the application will be reading the main buffer
> instead of the snapshot buffer. That is because when the snapshot occurs,
> the main and snapshot buffers are swapped. But the reader has a descriptor
> still pointing to the buffer that it originally connected to.
> 
> This is fine for the main buffer readers, as they may be blocked waiting
> for a watermark to be hit, and when a snapshot occurs, the data that the
> main readers want is now on the snapshot buffer.
> 
> But for waiters of the snapshot buffer, they are waiting for an event to
> occur that will trigger the snapshot and they can then consume it quickly
> to save the snapshot before the next snapshot occurs. But to do this, they
> need to read the new snapshot buffer, not the old one that is now
> receiving new data.
> 
> Also, it does not make sense to have a watermark "buffer_percent" on the
> snapshot buffer, as the snapshot buffer is static and does not receive new
> data except all at once.

This looks good to me.

Acked-by: Masami Hiramatsu (Google) 

Thank you!

> 
> Cc: sta...@vger.kernel.org
> Fixes: debdd57f5145f ("tracing: Make a snapshot feature available from userspace")
> Signed-off-by: Steven Rostedt (Google) 
> ---
>  kernel/trace/ring_buffer.c |  3 ++-
>  kernel/trace/trace.c   | 20 +---
>  2 files changed, 19 insertions(+), 4 deletions(-)
> 
> diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
> index e476d42c84c5..07dae67424a9 100644
> --- a/kernel/trace/ring_buffer.c
> +++ b/kernel/trace/ring_buffer.c
> @@ -838,7 +838,8 @@ void ring_buffer_wake_waiters(struct trace_buffer *buffer, int cpu)
>   /* make sure the waiters see the new index */
>   smp_wmb();
>  
> - rb_wake_up_waiters(&rbwork->work);
> + /* This can be called in any context */
> + irq_work_queue(&rbwork->work);
>  }
>  
>  /**
> diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
> index cfeaf2cd204e..606787edae8c 100644
> --- a/kernel/trace/trace.c
> +++ b/kernel/trace/trace.c
> @@ -1902,6 +1902,9 @@ update_max_tr(struct trace_array *tr, struct task_struct *tsk, int cpu,
>   __update_max_tr(tr, tsk, cpu);
>  
>   arch_spin_unlock(&tr->max_lock);
> +
> + /* Any waiters on the old snapshot buffer need to wake up */
> + ring_buffer_wake_waiters(tr->array_buffer.buffer, RING_BUFFER_ALL_CPUS);
>  }
>  
>  /**
> @@ -1953,12 +1956,23 @@ update_max_tr_single(struct trace_array *tr, struct task_struct *tsk, int cpu)
>  
>  static int wait_on_pipe(struct trace_iterator *iter, int full)
>  {
> + int ret;
> +
>   /* Iterators are static, they should be filled or empty */
>   if (trace_buffer_iter(iter, iter->cpu_file))
>   return 0;
>  
> - return ring_buffer_wait(iter->array_buffer->buffer, iter->cpu_file,
> - full);
> + ret = ring_buffer_wait(iter->array_buffer->buffer, iter->cpu_file, full);
> +
> +#ifdef CONFIG_TRACER_MAX_TRACE
> + /*
> +  * Make sure this is still the snapshot buffer, as if a snapshot were
> +  * to happen, this would now be the main buffer.
> +  */
> + if (iter->snapshot)
> + iter->array_buffer = &iter->tr->max_buffer;
> +#endif
> + return ret;
>  }
>  
>  #ifdef CONFIG_FTRACE_STARTUP_TEST
> @@ -8560,7 +8574,7 @@ tracing_buffers_splice_read(struct file *file, loff_t *ppos,
>  
>   wait_index = READ_ONCE(iter->wait_index);
>  
> - ret = wait_on_pipe(iter, iter->tr->buffer_percent);
> + ret = wait_on_pipe(iter, iter->snapshot ? 0 : iter->tr->buffer_percent);
>   if (ret)
>   goto out;
>  
> -- 
> 2.42.0
> 


-- 
Masami Hiramatsu (Google) 



Re: [PATCH] arm64: dts: qcom: qcm6490-fairphone-fp5: Add missing reserved-memory

2023-12-29 Thread Konrad Dybcio
On 29.12.2023 13:53, Luca Weiss wrote:
> It seems we also need to reserve a region of 81 MiB called "removed_mem"
> otherwise we can easily hit the following error with higher RAM usage:
> 
>   [ 1467.809274] Internal error: synchronous external abort: 9610 [#2] SMP
> 
> Fixes: eee9602ad649 ("arm64: dts: qcom: qcm6490: Add device-tree for Fairphone 5")
> Signed-off-by: Luca Weiss 
> ---
Reviewed-by: Konrad Dybcio 

Konrad



[PATCH] arm64: dts: qcom: qcm6490-fairphone-fp5: Add missing reserved-memory

2023-12-29 Thread Luca Weiss
It seems we also need to reserve a region of 81 MiB called "removed_mem"
otherwise we can easily hit the following error with higher RAM usage:

  [ 1467.809274] Internal error: synchronous external abort: 9610 [#2] SMP

Fixes: eee9602ad649 ("arm64: dts: qcom: qcm6490: Add device-tree for Fairphone 5")
Signed-off-by: Luca Weiss 
---
 arch/arm64/boot/dts/qcom/qcm6490-fairphone-fp5.dts | 5 +
 1 file changed, 5 insertions(+)

diff --git a/arch/arm64/boot/dts/qcom/qcm6490-fairphone-fp5.dts b/arch/arm64/boot/dts/qcom/qcm6490-fairphone-fp5.dts
index 176898c9dbbd..1e85c43a6fd1 100644
--- a/arch/arm64/boot/dts/qcom/qcm6490-fairphone-fp5.dts
+++ b/arch/arm64/boot/dts/qcom/qcm6490-fairphone-fp5.dts
@@ -82,6 +82,11 @@ cdsp_mem: cdsp@88f0 {
no-map;
};
 
+   removed_mem: removed@c000 {
+   reg = <0x0 0xc000 0x0 0x510>;
+   no-map;
+   };
+
rmtfs_mem: memory@f850 {
compatible = "qcom,rmtfs-mem";
reg = <0x0 0xf850 0x0 0x60>;

---
base-commit: 39676dfe52331dba909c617f213fdb21015c8d10
change-id: 20231229-fp5-reserved-mem-b88e822b1127

Best regards,
-- 
Luca Weiss 




Re: [PATCH] ring-buffer: Fix wake ups when buffer_percent is set to 100

2023-12-29 Thread Google
On Thu, 28 Dec 2023 23:05:21 -0500
Steven Rostedt  wrote:

> On Wed, 27 Dec 2023 07:57:08 +0900
> Masami Hiramatsu (Google)  wrote:
> 
> > On Tue, 26 Dec 2023 12:59:02 -0500
> > Steven Rostedt  wrote:
> > 
> > > From: "Steven Rostedt (Google)" 
> > > 
> > > The tracefs file "buffer_percent" is to allow user space to set a
> > > water-mark on how much of the tracing ring buffer needs to be filled in
> > > order to wake up a blocked reader.
> > > 
> > >  0 - is to wait until any data is in the buffer
> > >  1 - is to wait for 1% of the sub buffers to be filled
> > >  50 - would be half of the sub buffers are filled with data
> > >  100 - is not to wake the waiter until the ring buffer is completely full
> > > 
> > > Unfortunately the test for being full was:
> > > 
> > >   dirty = ring_buffer_nr_dirty_pages(buffer, cpu);
> > >   return (dirty * 100) > (full * nr_pages);
> > > 
> > > Where "full" is the value for "buffer_percent".
> > > 
> > > There are two issues with the above when full == 100.
> > > 
> > > 1. dirty * 100 > 100 * nr_pages will never be true
> > >That is, the above is basically saying that if the user sets
> > >buffer_percent to 100, more pages need to be dirty than exist in the
> > >ring buffer!
> > > 
> > > 2. The page that the writer is on is never considered dirty, as dirty
> > >pages are only those that are full. When the writer goes to a new
> > >sub-buffer, it clears the contents of that sub-buffer.
> > > 
> > > That is, even if the check was ">=" it would still not be equal as the
> > > most pages that can be considered "dirty" is nr_pages - 1.
> > > 
> > > To fix this, add one to dirty and use ">=" in the compare.
> > > 
> > > Cc: sta...@vger.kernel.org
> > > Fixes: 03329f9939781 ("tracing: Add tracefs file buffer_percentage")
> > > Signed-off-by: Steven Rostedt (Google) 
> > > ---
> > >  kernel/trace/ring_buffer.c | 9 +++--
> > >  1 file changed, 7 insertions(+), 2 deletions(-)
> > > 
> > > diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
> > > index 83eab547f1d1..32c0dd2fd1c3 100644
> > > --- a/kernel/trace/ring_buffer.c
> > > +++ b/kernel/trace/ring_buffer.c
> > > @@ -881,9 +881,14 @@ static __always_inline bool full_hit(struct trace_buffer *buffer, int cpu, int full)
> > >   if (!nr_pages || !full)
> > >   return true;
> > >  
> > > - dirty = ring_buffer_nr_dirty_pages(buffer, cpu);
> > > + /*
> > > +  * Add one as dirty will never equal nr_pages, as the sub-buffer
> > > +  * that the writer is on is not counted as dirty.
> > > +  * This is needed if "buffer_percent" is set to 100.
> > > +  */
> > > + dirty = ring_buffer_nr_dirty_pages(buffer, cpu) + 1;  
> > 
> > Is this "+ 1" required? If we have 200 pages and 1 buffer is dirty,
> > it is 0.5% dirty. Consider @full = 1%.
> 
> Yes it is required, as the comment above it states. dirty will never
> equal nr_pages. Without it, buffer_percent == 100 will never wake up.
> 
> The +1 is to add the page the writer is on, which is never considered
> "dirty".

Ah, got it. I thought the page being written to was dirty too (similar to
a dirty memory page). That's my misunderstanding.

> 
> > 
> > @dirty = 1 + 1 = 2 and @dirty * 100 == 200, but @full * @nr_pages = 1 * 200 = 200.
> > Thus it hits (200 >= 200 is true) even if dirty pages are 0.5%.
> 
> Do we care?
> 
> What's the difference if it wakes up on 2 dirty pages or 1? It would be
> very hard to measure the difference.
> 
> But if you say 100, which means "I want to wake up when full" it will
> never wake up. Because it will always be nr_pages - 1.

OK, because we count only the complete pages as dirty. In that case,
the number of readable pages should be dirty + 1.
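
For reference, a quick standalone check of both comparisons, using the
made-up nr_pages = 200 from the example above (plain C, not the kernel code):

#include <stdbool.h>
#include <stdio.h>

/* old_check() is the comparison before the patch, new_check() is the
 * proposed "+ 1 and >=" variant; dirty tops out at nr_pages - 1 because
 * the page the writer is on is never counted as dirty. */
static bool old_check(unsigned long dirty, unsigned long nr_pages, unsigned long full)
{
	return (dirty * 100) > (full * nr_pages);
}

static bool new_check(unsigned long dirty, unsigned long nr_pages, unsigned long full)
{
	return ((dirty + 1) * 100) >= (full * nr_pages);
}

int main(void)
{
	unsigned long nr_pages = 200;

	/* buffer_percent == 100, writer on the last page: the old check never fires */
	printf("full=100: old=%d new=%d\n",
	       old_check(nr_pages - 1, nr_pages, 100),
	       new_check(nr_pages - 1, nr_pages, 100));

	/* buffer_percent == 1 with a single dirty page: the new check fires at 0.5% */
	printf("full=1:   old=%d new=%d\n",
	       old_check(1, nr_pages, 1),
	       new_check(1, nr_pages, 1));
	return 0;
}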

> 
> We could also say the +1 is the reader page too, because that's not
> counted as well.
> 
> In other words, we can bike shed this to make 1% accurate (which
> honestly, I have no idea what the use case for that would be) or we can
> fix the bug that has 100% which just means, wake me up if the buffer is
> full, and when the writer is on the last page, it is considered full.

Got it.

Acked-by: Masami Hiramatsu (Google) 

Thank you,

> 
> -- Steve
> 


-- 
Masami Hiramatsu (Google) 



Re: [PATCH 5/5] media: venus: Convert to dev_pm_domain_attach|detach_list() for vcodec

2023-12-29 Thread Bryan O'Donoghue

On 28/12/2023 11:41, Ulf Hansson wrote:

Let's avoid some of the boilerplate code to manage the vcodec PM domains
by converting to dev_pm_domain_attach|detach_list().

Cc: Mauro Carvalho Chehab 
Cc: Stanimir Varbanov 
Cc: Vikash Garodia 
Cc: "Bryan O'Donoghue" 
Cc: Bjorn Andersson 
Cc: Konrad Dybcio 
Cc: 
Signed-off-by: Ulf Hansson 


On top of 39676dfe52331 - (tag: next-20231222, linux-next/master) Add linux-next specific files for 20231222 (7 days ago)


Tested-by: Bryan O'Donoghue 
Reviewed-by: Bryan O'Donoghue 

---
bod



Re: [PATCH 2/2] virtiofs: Improve error handling in virtio_fs_get_tree()

2023-12-29 Thread Markus Elfring
>> The kfree() function was called in two cases by
>> the virtio_fs_get_tree() function during error handling
>> even if the passed variable contained a null pointer.
>
> So what?  kfree(NULL) is perfectly acceptable.

I suggest reconsidering the usefulness of such a special function call.


> Are you trying to optimise an error path?

I would appreciate it if further improvements could be achieved.
https://wiki.sei.cmu.edu/confluence/display/c/MEM12-C.+Consider+using+a+goto+chain+when+leaving+a+function+on+error+when+using+and+releasing+resources

Regards,
Markus



[PATCH] arm64: dts: qcom: sc7280: Add static properties to cryptobam

2023-12-29 Thread Luca Weiss
When the properties num-channels & qcom,num-ees are not specified, the
driver tries to read the values from registers, but this read fails and
resets the device if the interconnect from the qcom,qce node is not
already active when that happens.

Add the static properties so that no registers are touched during probe;
the rest of the time, when the BAM is used by QCE, the interconnect will
already be active.

Fixes: d488f903a860 ("arm64: dts: qcom: sc7280: add QCrypto nodes")
Signed-off-by: Luca Weiss 
---
See also:
https://lore.kernel.org/linux-arm-msm/cy01ekqvwe36.b9x5tdxar...@fairphone.com/

Alternatively you could add an interconnect property (copied from &crypto)
and add interconnect support to the BAM driver; then during probe we could
read the registers without crashing the device.
---
 arch/arm64/boot/dts/qcom/sc7280.dtsi | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/arm64/boot/dts/qcom/sc7280.dtsi b/arch/arm64/boot/dts/qcom/sc7280.dtsi
index 83b5b76ba179..ce0d24ee7eed 100644
--- a/arch/arm64/boot/dts/qcom/sc7280.dtsi
+++ b/arch/arm64/boot/dts/qcom/sc7280.dtsi
@@ -2345,6 +2345,8 @@ cryptobam: dma-controller@1dc4000 {
 <&apps_smmu 0x4e6 0x0011>;
qcom,ee = <0>;
qcom,controlled-remotely;
+   num-channels = <16>;
+   qcom,num-ees = <4>;
};
 
crypto: crypto@1dfa000 {

---
base-commit: 39676dfe52331dba909c617f213fdb21015c8d10
change-id: 20231229-sc7280-cryptobam-fixup-fb5f94a5572f

Best regards,
-- 
Luca Weiss 




Re: [PATCH 2/2] virtiofs: Improve error handling in virtio_fs_get_tree()

2023-12-29 Thread Matthew Wilcox
On Fri, Dec 29, 2023 at 09:38:47AM +0100, Markus Elfring wrote:
> From: Markus Elfring 
> Date: Fri, 29 Dec 2023 09:15:07 +0100
> 
> The kfree() function was called in two cases by
> the virtio_fs_get_tree() function during error handling
> even if the passed variable contained a null pointer.

So what?  kfree(NULL) is perfectly acceptable.  Are you trying to
optimise an error path?




[PATCH 2/2] virtiofs: Improve error handling in virtio_fs_get_tree()

2023-12-29 Thread Markus Elfring
From: Markus Elfring 
Date: Fri, 29 Dec 2023 09:15:07 +0100

The kfree() function was called in two cases by
the virtio_fs_get_tree() function during error handling
even if the passed variable contained a null pointer.
This issue was detected by using the Coccinelle software.

* Thus use another label.

* Move an error code assignment into an if branch.

* Delete an initialisation (for the variable “fc”)
  which became unnecessary with this refactoring.

Signed-off-by: Markus Elfring 
---
 fs/fuse/virtio_fs.c | 13 -
 1 file changed, 8 insertions(+), 5 deletions(-)

diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c
index 2f8ba9254c1e..0746f54ec743 100644
--- a/fs/fuse/virtio_fs.c
+++ b/fs/fuse/virtio_fs.c
@@ -1415,10 +1415,10 @@ static int virtio_fs_get_tree(struct fs_context *fsc)
 {
struct virtio_fs *fs;
struct super_block *sb;
-   struct fuse_conn *fc = NULL;
+   struct fuse_conn *fc;
struct fuse_mount *fm;
unsigned int virtqueue_size;
-   int err = -EIO;
+   int err;

/* This gets a reference on virtio_fs object. This ptr gets installed
 * in fc->iq->priv. Once fuse_conn is going away, it calls ->put()
@@ -1431,13 +1431,15 @@ static int virtio_fs_get_tree(struct fs_context *fsc)
}

virtqueue_size = virtqueue_get_vring_size(fs->vqs[VQ_REQUEST].vq);
-   if (WARN_ON(virtqueue_size <= FUSE_HEADER_OVERHEAD))
-   goto out_err;
+   if (WARN_ON(virtqueue_size <= FUSE_HEADER_OVERHEAD)) {
+   err = -EIO;
+   goto lock_mutex;
+   }

err = -ENOMEM;
fc = kzalloc(sizeof(*fc), GFP_KERNEL);
if (!fc)
-   goto out_err;
+   goto lock_mutex;

fm = kzalloc(sizeof(*fm), GFP_KERNEL);
if (!fm)
@@ -1476,6 +1478,7 @@ static int virtio_fs_get_tree(struct fs_context *fsc)

 out_err:
kfree(fc);
+lock_mutex:
mutex_lock(&virtio_fs_mutex);
virtio_fs_put(fs);
mutex_unlock(&virtio_fs_mutex);
--
2.43.0




[PATCH 1/2] virtiofs: Improve three size determinations

2023-12-29 Thread Markus Elfring
From: Markus Elfring 
Date: Fri, 29 Dec 2023 08:42:04 +0100

Pass a dereference of the destination pointer to the “sizeof” operator
instead of spelling out the structure type, which makes the corresponding
size determination a bit safer and follows the Linux coding style convention.
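
As a generic illustration (not taken from this patch) of why the
pointer-dereference form is considered a bit safer: if the pointer's type
ever changes, the allocation size follows it automatically, while a
spelled-out struct type can silently go stale. The struct and function names
below are hypothetical.

#include <stdlib.h>

struct foo {
	int a;
	long b;
};

struct foo *alloc_foo_spelled_out(void)
{
	/* must be kept in sync by hand if the pointer's type ever changes */
	return calloc(1, sizeof(struct foo));
}

struct foo *alloc_foo_deref(void)
{
	struct foo *p = calloc(1, sizeof(*p));	/* always matches what p points to */

	return p;
}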

This issue was detected by using the Coccinelle software.

Signed-off-by: Markus Elfring 
---
 fs/fuse/virtio_fs.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c
index 5f1be1da92ce..2f8ba9254c1e 100644
--- a/fs/fuse/virtio_fs.c
+++ b/fs/fuse/virtio_fs.c
@@ -1435,11 +1435,11 @@ static int virtio_fs_get_tree(struct fs_context *fsc)
goto out_err;

err = -ENOMEM;
-   fc = kzalloc(sizeof(struct fuse_conn), GFP_KERNEL);
+   fc = kzalloc(sizeof(*fc), GFP_KERNEL);
if (!fc)
goto out_err;

-   fm = kzalloc(sizeof(struct fuse_mount), GFP_KERNEL);
+   fm = kzalloc(sizeof(*fm), GFP_KERNEL);
if (!fm)
goto out_err;

@@ -1495,7 +1495,7 @@ static int virtio_fs_init_fs_context(struct fs_context *fsc)
if (fsc->purpose == FS_CONTEXT_FOR_SUBMOUNT)
return fuse_init_fs_context_submount(fsc);

-   ctx = kzalloc(sizeof(struct fuse_fs_context), GFP_KERNEL);
+   ctx = kzalloc(sizeof(*ctx), GFP_KERNEL);
if (!ctx)
return -ENOMEM;
fsc->fs_private = ctx;
--
2.43.0




[PATCH 0/2] virtiofs: Adjustments for two function implementations

2023-12-29 Thread Markus Elfring
From: Markus Elfring 
Date: Fri, 29 Dec 2023 09:28:09 +0100

A few update suggestions from static source code analysis were taken into account.

Markus Elfring (2):
  Improve three size determinations
  Improve error handling in virtio_fs_get_tree()

 fs/fuse/virtio_fs.c | 19 +++
 1 file changed, 11 insertions(+), 8 deletions(-)

--
2.43.0