Re: [PATCH net-next 0/3] trace: use TP_STORE_ADDRS macro

2024-03-25 Thread Jason Xing
On Mon, Mar 25, 2024 at 11:43 AM Jason Xing  wrote:
>
> From: Jason Xing 
>
> Use the macro in other tracepoints to make them more concise.
> No functional change.
>
> Jason Xing (3):
>   trace: move to TP_STORE_ADDRS related macro to net_probe_common.h
>   trace: use TP_STORE_ADDRS() macro in inet_sk_error_report()
>   trace: use TP_STORE_ADDRS() macro in inet_sock_set_state()
>
>  include/trace/events/net_probe_common.h | 29 
>  include/trace/events/sock.h | 35 -

I just noticed that some trace files in the include/trace directory (like
net_probe_common.h, sock.h, skb.h, net.h, udp.h, sctp.h,
qdisc.h, neigh.h, napi.h, icmp.h, ...) are not owned by networking
folks, while some files (like tcp.h) have been maintained by specific
maintainers/experts (like Eric) because they belong to one specific
area. I wonder if we can get more networking folks involved in net
tracing.

I'm not sure if 1) we can put those files into the "NETWORKING
[GENERAL]" category, or 2) we can create a new category to include
them all.

I know people have started using BPF to trace instead, but I can see
some good advantages of having these hooks implemented in the kernel, say:
1) they help on machines where it is not easy to use BPF tools.
2) a tracepoint can be inserted in the middle of a function, which cannot
be replaced by a bpf kprobe.
3) if we have enough tracepoints, we can generate a timeline to
know/detect which flow/skb spends unexpected time at which point.
...
We can do many things in this area, I think :)
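
For context, here is a simplified sketch (not the exact kernel macro) of what a
TP_STORE_ADDRS()-style helper does inside a tracepoint's TP_fast_assign(): it
copies a socket's IPv4 or IPv6 source/destination addresses into the fixed-size
fields of the trace event entry, so each tracepoint does not have to open-code
the same assignments. The `sk` pointer is assumed to be in scope, as it is in
the tracepoints that use such a macro.

/* Simplified sketch only; the real macro lives in the trace event headers. */
#define TP_STORE_ADDRS_SKETCH(__entry, saddr, daddr, saddr6, daddr6)      \
        do {                                                               \
                if (sk->sk_family == AF_INET6) {                           \
                        *(struct in6_addr *)__entry->saddr_v6 = saddr6;    \
                        *(struct in6_addr *)__entry->daddr_v6 = daddr6;    \
                } else {                                                   \
                        ipv6_addr_set_v4mapped(saddr,                      \
                                (struct in6_addr *)__entry->saddr_v6);     \
                        ipv6_addr_set_v4mapped(daddr,                      \
                                (struct in6_addr *)__entry->daddr_v6);     \
                }                                                          \
        } while (0)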

What do you think about this, Jakub, Paolo, Eric?

Thanks,
Jason

>  include/trace/events/tcp.h  | 29 
>  3 files changed, 34 insertions(+), 59 deletions(-)
>
> --
> 2.37.3
>



Re: [PATCH v4 02/14] mm: Switch mm->get_unmapped_area() to a flag

2024-03-25 Thread Alexei Starovoitov
On Mon, Mar 25, 2024 at 7:17 PM Rick Edgecombe
 wrote:
>
>
> diff --git a/mm/util.c b/mm/util.c
> index 669397235787..8619d353a1aa 100644
> --- a/mm/util.c
> +++ b/mm/util.c
> @@ -469,17 +469,17 @@ void arch_pick_mmap_layout(struct mm_struct *mm, struct 
> rlimit *rlim_stack)
>
> if (mmap_is_legacy(rlim_stack)) {
> mm->mmap_base = TASK_UNMAPPED_BASE + random_factor;
> -   mm->get_unmapped_area = arch_get_unmapped_area;
> +   clear_bit(MMF_TOPDOWN, &mm->flags);
> } else {
> mm->mmap_base = mmap_base(random_factor, rlim_stack);
> -   mm->get_unmapped_area = arch_get_unmapped_area_topdown;
> +   set_bit(MMF_TOPDOWN, &mm->flags);
> }
>  }
>  #elif defined(CONFIG_MMU) && !defined(HAVE_ARCH_PICK_MMAP_LAYOUT)
>  void arch_pick_mmap_layout(struct mm_struct *mm, struct rlimit *rlim_stack)
>  {
> mm->mmap_base = TASK_UNMAPPED_BASE;
> -   mm->get_unmapped_area = arch_get_unmapped_area;
> +   clear_bit(MMF_TOPDOWN, &mm->flags);
>  }
>  #endif

Makes sense to me.
Acked-by: Alexei Starovoitov 
for the idea and for bpf bits.



Re: [External] Re: [PATCH v4 2/2] memory tier: create CPUless memory tiers after obtaining HMAT info

2024-03-25 Thread Huang, Ying
"Ho-Ren (Jack) Chuang"  writes:

> On Fri, Mar 22, 2024 at 1:41 AM Huang, Ying  wrote:
>>
>> "Ho-Ren (Jack) Chuang"  writes:
>>
>> > The current implementation treats emulated memory devices, such as
>> > CXL1.1 type3 memory, as normal DRAM when they are emulated as normal memory
>> > (E820_TYPE_RAM). However, these emulated devices have different
>> > characteristics than traditional DRAM, making it important to
>> > distinguish them. Thus, we modify the tiered memory initialization process
>> > to introduce a delay specifically for CPUless NUMA nodes. This delay
>> > ensures that the memory tier initialization for these nodes is deferred
>> > until HMAT information is obtained during the boot process. Finally,
>> > demotion tables are recalculated at the end.
>> >
>> > * late_initcall(memory_tier_late_init);
>> > Some device drivers may have initialized memory tiers between
>> > `memory_tier_init()` and `memory_tier_late_init()`, potentially bringing
>> > online memory nodes and configuring memory tiers. They should be excluded
>> > in the late init.
>> >
>> > * Handle cases where there is no HMAT when creating memory tiers
>> > There is a scenario where a CPUless node does not provide HMAT information.
>> > If no HMAT is specified, it falls back to using the default DRAM tier.
>> >
>> > * Introduce another new lock `default_dram_perf_lock` for adist calculation
>> > In the current implementation, iterating through CPUlist nodes requires
>> > holding the `memory_tier_lock`. However, `mt_calc_adistance()` will end up
>> > trying to acquire the same lock, leading to a potential deadlock.
>> > Therefore, we propose introducing a standalone `default_dram_perf_lock` to
>> > protect `default_dram_perf_*`. This approach not only avoids deadlock
>> > but also prevents holding a large lock simultaneously.
>> >
>> > * Upgrade `set_node_memory_tier` to support additional cases, including
>> >   default DRAM, late CPUless, and hot-plugged initializations.
>> > To cover hot-plugged memory nodes, `mt_calc_adistance()` and
>> > `mt_find_alloc_memory_type()` are moved into `set_node_memory_tier()` to
>> > handle cases where memtype is not initialized and where HMAT information is
>> > available.
>> >
>> > * Introduce `default_memory_types` for those memory types that are not
>> >   initialized by device drivers.
>> > Because late initialized memory and default DRAM memory need to be managed,
>> > a default memory type is created for storing all memory types that are
>> > not initialized by device drivers and as a fallback.
>> >
>> > Signed-off-by: Ho-Ren (Jack) Chuang 
>> > Signed-off-by: Hao Xiang 
>> > ---
>> >  mm/memory-tiers.c | 73 ---
>> >  1 file changed, 63 insertions(+), 10 deletions(-)
>> >
>> > diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
>> > index 974af10cfdd8..9396330fa162 100644
>> > --- a/mm/memory-tiers.c
>> > +++ b/mm/memory-tiers.c
>> > @@ -36,6 +36,11 @@ struct node_memory_type_map {
>> >
>> >  static DEFINE_MUTEX(memory_tier_lock);
>> >  static LIST_HEAD(memory_tiers);
>> > +/*
>> > + * The list is used to store all memory types that are not created
>> > + * by a device driver.
>> > + */
>> > +static LIST_HEAD(default_memory_types);
>> >  static struct node_memory_type_map node_memory_types[MAX_NUMNODES];
>> >  struct memory_dev_type *default_dram_type;
>> >
>> > @@ -108,6 +113,7 @@ static struct demotion_nodes *node_demotion 
>> > __read_mostly;
>> >
>> >  static BLOCKING_NOTIFIER_HEAD(mt_adistance_algorithms);
>> >
>> > +static DEFINE_MUTEX(default_dram_perf_lock);
>>
>> Better to add comments about what is protected by this lock.
>>
>
> Thank you. I will add a comment like this:
> + /* The lock is used to protect `default_dram_perf*` info and nid. */
> +static DEFINE_MUTEX(default_dram_perf_lock);
>
> I also found an error path that was not handled and
> found that the lock could be put closer to what it protects.
> I will have them fixed in V5.
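>
> A minimal sketch of the standalone-lock pattern under discussion (purely
> illustrative with assumed names, not the actual V5 code): the new mutex
> only guards the default_dram_perf_* fields, so the adistance calculation
> no longer needs to nest under memory_tier_lock.
>
> static int mt_set_default_dram_perf_sketch(int nid,
>                                            struct access_coordinate *perf)
> {
>         int rc = 0;
>
>         mutex_lock(&default_dram_perf_lock);
>         if (default_dram_perf_error) {
>                 rc = -EIO;
>         } else if (default_dram_perf_ref_nid == NUMA_NO_NODE) {
>                 /* First reporter establishes the reference DRAM performance. */
>                 default_dram_perf = *perf;
>                 default_dram_perf_ref_nid = nid;
>         }
>         mutex_unlock(&default_dram_perf_lock);
>
>         return rc;
> }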
>
>> >  static bool default_dram_perf_error;
>> >  static struct access_coordinate default_dram_perf;
>> >  static int default_dram_perf_ref_nid = NUMA_NO_NODE;
>> > @@ -505,7 +511,8 @@ static inline void __init_node_memory_type(int node, 
>> > struct memory_dev_type *mem
>> >  static struct memory_tier *set_node_memory_tier(int node)
>> >  {
>> >   struct memory_tier *memtier;
>> > - struct memory_dev_type *memtype;
>> > + struct memory_dev_type *mtype;
>>
>> mtype may be referenced without initialization now below.
>>
>
> Good catch! Thank you.
>
> Please check below.
> I may have found a potential NULL pointer dereference.
>
>> > + int adist = MEMTIER_ADISTANCE_DRAM;
>> >   pg_data_t *pgdat = NODE_DATA(node);
>> >
>> >
>> > @@ -514,11 +521,20 @@ static struct memory_tier *set_node_memory_tier(int 
>> > node)
>> >   if (!node_state(node, N_MEMORY))
>> >   return ERR_PTR(-EINVAL);
>> >
>> > - __init_node_memory_type(node, default_dram_type);
>> > + mt_calc_adistance(node, &adist);

Re: [PATCH net-next v2 0/3] tcp: make trace of reset logic complete

2024-03-25 Thread Jason Xing
On Tue, Mar 26, 2024 at 10:23 AM Jakub Kicinski  wrote:
>
> On Tue, 26 Mar 2024 10:13:55 +0800 Jason Xing wrote:
> > Yesterday, I posted two series to do two kinds of things. They are not
> > the same. Maybe you get me wrong :S
>
> Ah, my bad, sorry about that. I see that they are different now.

That's all right :)

> One is v1 the other v2, both targeting tcp tracing... Easy to miss
> in the post merge window rush :(

Yes, and thanks for the check :)



Re: [PATCH net-next v3 2/2] net: udp: add IP/port data to the tracepoint udp/udp_fail_queue_rcv_skb

2024-03-25 Thread Jakub Kicinski
On Mon, 25 Mar 2024 11:29:18 +0100 Balazs Scheidler wrote:
> +memset(__entry->saddr, 0, sizeof(struct sockaddr_in6));
> +memset(__entry->daddr, 0, sizeof(struct sockaddr_in6));

Indent with tabs please, checkpatch says:

ERROR: code indent should use tabs where possible
#59: FILE: include/trace/events/udp.h:38:
+memset(__entry->saddr, 0, sizeof(struct sockaddr_in6));$

WARNING: please, no spaces at the start of a line
#59: FILE: include/trace/events/udp.h:38:
+memset(__entry->saddr, 0, sizeof(struct sockaddr_in6));$

ERROR: code indent should use tabs where possible
#60: FILE: include/trace/events/udp.h:39:
+memset(__entry->daddr, 0, sizeof(struct sockaddr_in6));$

WARNING: please, no spaces at the start of a line
#60: FILE: include/trace/events/udp.h:39:
+memset(__entry->daddr, 0, sizeof(struct sockaddr_in6));$
-- 
pw-bot: cr



Re: [PATCH net-next v2 0/3] tcp: make trace of reset logic complete

2024-03-25 Thread Jakub Kicinski
On Tue, 26 Mar 2024 10:13:55 +0800 Jason Xing wrote:
> Yesterday, I posted two series to do two kinds of things. They are not
> the same. Maybe you get me wrong :S

Ah, my bad, sorry about that. I see that they are different now.
One is v1 the other v2, both targeting tcp tracing... Easy to miss
in the post merge window rush :(



[PATCH v4 02/14] mm: Switch mm->get_unmapped_area() to a flag

2024-03-25 Thread Rick Edgecombe
The mm_struct contains a function pointer *get_unmapped_area(), which
is set to either arch_get_unmapped_area() or
arch_get_unmapped_area_topdown() during the initialization of the mm.

Since the function pointer only ever points to two functions that are named
the same across all arch's, a function pointer is not really required. In
addition future changes will want to add versions of the functions that
take additional arguments. So to save a pointers worth of bytes in
mm_struct, and prevent adding additional function pointers to mm_struct in
future changes, remove it and keep the information about which
get_unmapped_area() to use in a flag.

Add the new flag to MMF_INIT_MASK so it doesn't get clobbered on fork by
mmf_init_flags(). Most MM flags get clobbered on fork. In the pre-existing
behavior mm->get_unmapped_area() would get copied to the new mm in
dup_mm(), so not clobbering the flag preserves the existing behavior
around inheriting the topdown-ness.

Introduce a helper, mm_get_unmapped_area(), to easily convert code that
refers to the old function pointer to instead select and call either
arch_get_unmapped_area() or arch_get_unmapped_area_topdown() based on the
flag. Then drop the mm->get_unmapped_area() function pointer. Leave the
get_unmapped_area() pointer in struct file_operations alone. The main
purpose of this change is to reorganize in preparation for future changes,
but it also converts the calls of mm->get_unmapped_area() from indirect
branches into direct ones.
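
For reference, a hedged sketch of the helper described above (simplified; the
actual implementation may differ in details):

unsigned long
mm_get_unmapped_area(struct mm_struct *mm, struct file *file,
                     unsigned long addr, unsigned long len,
                     unsigned long pgoff, unsigned long flags)
{
        /* MMF_TOPDOWN, set in arch_pick_mmap_layout(), selects the arch helper. */
        if (test_bit(MMF_TOPDOWN, &mm->flags))
                return arch_get_unmapped_area_topdown(file, addr, len,
                                                      pgoff, flags);
        return arch_get_unmapped_area(file, addr, len, pgoff, flags);
}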

The stress-ng bigheap benchmark calls realloc a lot, which calls through
get_unmapped_area() in the kernel. On x86, the change yielded a ~1%
improvement there on a retpoline config.

In testing a few x86 configs, removing the pointer unfortunately didn't
result in any actual size reductions in the compiled layout of mm_struct.
But depending on compiler or arch alignment requirements, the change could
shrink the size of mm_struct.

Signed-off-by: Rick Edgecombe 
Acked-by: Dave Hansen 
Acked-by: Liam R. Howlett 
Reviewed-by: Kirill A. Shutemov 
Cc: linux-s...@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: sparcli...@vger.kernel.org
Cc: linux-...@vger.kernel.org
Cc: nvd...@lists.linux.dev
Cc: linux-...@vger.kernel.org
Cc: linux...@kvack.org
Cc: linux-fsde...@vger.kernel.org
Cc: io-ur...@vger.kernel.org
Cc: b...@vger.kernel.org
---
v4:
 - Split out pde_get_unmapped_area() refactor into separate patch
   (Christophe Leroy)

v3:
 - Fix comment that still referred to mm->get_unmapped_area()
 - Resolve trivial rebase conflicts with "mm: thp_get_unmapped_area must
   honour topdown preference"
 - Spelling fix in log

v2:
 - Fix comment on MMF_TOPDOWN (Kirill, rppt)
 - Move MMF_TOPDOWN to actually unused bit
 - Add MMF_TOPDOWN to MMF_INIT_MASK so it doesn't get clobbered on fork,
   and result in the children using the search up path.
 - New lower performance results after above bug fix
 - Add Reviews and Acks
---
 arch/s390/mm/hugetlbpage.c   |  2 +-
 arch/s390/mm/mmap.c  |  4 ++--
 arch/sparc/kernel/sys_sparc_64.c | 15 ++-
 arch/sparc/mm/hugetlbpage.c  |  2 +-
 arch/x86/kernel/cpu/sgx/driver.c |  2 +-
 arch/x86/mm/hugetlbpage.c|  2 +-
 arch/x86/mm/mmap.c   |  4 ++--
 drivers/char/mem.c   |  2 +-
 drivers/dax/device.c |  6 +++---
 fs/hugetlbfs/inode.c |  4 ++--
 fs/proc/inode.c  |  3 ++-
 fs/ramfs/file-mmu.c  |  2 +-
 include/linux/mm_types.h |  6 +-
 include/linux/sched/coredump.h   |  5 -
 include/linux/sched/mm.h |  5 +
 io_uring/io_uring.c  |  2 +-
 kernel/bpf/arena.c   |  2 +-
 kernel/bpf/syscall.c |  2 +-
 mm/debug.c   |  6 --
 mm/huge_memory.c |  9 -
 mm/mmap.c| 21 ++---
 mm/shmem.c   | 11 +--
 mm/util.c|  6 +++---
 23 files changed, 66 insertions(+), 57 deletions(-)

diff --git a/arch/s390/mm/hugetlbpage.c b/arch/s390/mm/hugetlbpage.c
index c2e8242bd15d..219d906fe830 100644
--- a/arch/s390/mm/hugetlbpage.c
+++ b/arch/s390/mm/hugetlbpage.c
@@ -328,7 +328,7 @@ unsigned long hugetlb_get_unmapped_area(struct file *file, 
unsigned long addr,
goto check_asce_limit;
}
 
-   if (mm->get_unmapped_area == arch_get_unmapped_area)
+   if (!test_bit(MMF_TOPDOWN, >flags))
addr = hugetlb_get_unmapped_area_bottomup(file, addr, len,
pgoff, flags);
else
diff --git a/arch/s390/mm/mmap.c b/arch/s390/mm/mmap.c
index b14fc0887654..6b2e4436ad4a 100644
--- a/arch/s390/mm/mmap.c
+++ b/arch/s390/mm/mmap.c
@@ -185,10 +185,10 @@ void arch_pick_mmap_layout(struct mm_struct *mm, struct 
rlimit *rlim_stack)
 */
if (mmap_is_legacy(rlim_stack)) {
mm->mmap_base = mmap_base_legacy(random_factor);
-   

Re: [PATCH net-next v2 0/3] tcp: make trace of reset logic complete

2024-03-25 Thread Jason Xing
On Tue, Mar 26, 2024 at 9:30 AM Jakub Kicinski  wrote:
>
> On Mon, 25 Mar 2024 14:28:28 +0800 Jason Xing wrote:
> > Before this, we miss some cases where the TCP layer could send rst but
> > we cannot trace it. So I decided to complete it :)
> >
> > v2
> > 1. fix spelling mistakes
>
> Not only do you post it before we "officially" open net-next but
> also ignoring the 24h wait period.
>
> https://www.kernel.org/doc/html/next/process/maintainer-netdev.html#tl-dr
>
> The main goal of the 24h rule is to stop people from bombarding us with
> new versions for silly reasons.

Sorry, I don't understand what you mean. I definitely know the rules.
But the first version of this patch series was sent about two weeks
ago (see link [1])

Yesterday, I posted two series to do two kinds of things. They are not
the same. Maybe you get me wrong :S

link [1]: 
https://patchwork.kernel.org/project/netdevbpf/list/?series=834178=*

Thanks,
Jason

>
> You should know better than this, it's hardly your first contribution :(
> --
> pw-bot: 24h



Re: [PATCH v5 1/2] kprobes: textmem API

2024-03-25 Thread Jarkko Sakkinen
On Tue Mar 26, 2024 at 3:31 AM EET, Jarkko Sakkinen wrote:
> > > +#endif /* _LINUX_EXECMEM_H */
> > > diff --git a/kernel/kprobes.c b/kernel/kprobes.c
> > > index 9d9095e81792..87fd8c14a938 100644
> > > --- a/kernel/kprobes.c
> > > +++ b/kernel/kprobes.c
> > > @@ -44,6 +44,7 @@
> > >  #include 
> > >  #include 
> > >  #include 
> > > +#include 
> > >  
> > >  #define KPROBE_HASH_BITS 6
> > >  #define KPROBE_TABLE_SIZE (1 << KPROBE_HASH_BITS)
> > > @@ -113,17 +114,17 @@ enum kprobe_slot_state {
> > >  void __weak *alloc_insn_page(void)
> > >  {
> > >   /*
> > > -  * Use module_alloc() so this page is within +/- 2GB of where the
> > > +  * Use alloc_execmem() so this page is within +/- 2GB of where the
> > >* kernel image and loaded module images reside. This is required
> > >* for most of the architectures.
> > >* (e.g. x86-64 needs this to handle the %rip-relative fixups.)
> > >*/
> > > - return module_alloc(PAGE_SIZE);
> > > + return alloc_execmem(PAGE_SIZE, GFP_KERNEL);
> > >  }
> > >  
> > >  static void free_insn_page(void *page)
> > >  {
> > > - module_memfree(page);
> > > + free_execmem(page);
> > >  }
> > >  
> > >  struct kprobe_insn_cache kprobe_insn_slots = {
> > > @@ -1580,6 +1581,7 @@ static int check_kprobe_address_safe(struct kprobe 
> > > *p,
> > >   goto out;
> > >   }
> > >  
> > > +#ifdef CONFIG_MODULES
> >
> > You don't need this block, because these APIs have dummy functions.
>
> Hmm... I'll verify this tomorrow.

It depends on having struct module available given "(*probed_mod)->state".

It is non-existent unless CONFIG_MODULES is set given how things are
flagged in include/linux/module.h.

BR, Jarkko



Re: [PATCH v5 1/2] kprobes: textmem API

2024-03-25 Thread Jarkko Sakkinen
On Tue Mar 26, 2024 at 2:58 AM EET, Masami Hiramatsu (Google) wrote:
> On Mon, 25 Mar 2024 23:55:01 +0200
> Jarkko Sakkinen  wrote:
>
> > Tracing with kprobes while running a monolithic kernel is currently
> > impossible because CONFIG_KPROBES depends on CONFIG_MODULES because it uses
> > the kernel module allocator.
> > 
> > Introduce alloc_textmem() and free_textmem() for allocating executable
> > memory. If an arch implements these functions, it can mark this up with
> > the HAVE_ALLOC_EXECMEM kconfig flag.
> > 
> > At first this feature will be used for enabling kprobes without
> > modules support for arch/riscv.
> > 
> > Link: 
> > https://lore.kernel.org/all/20240325115632.04e37297491cadfbbf382...@kernel.org/
> > Suggested-by: Masami Hiramatsu 
> > Signed-off-by: Jarkko Sakkinen 
> > ---
> > v5:
> > - alloc_execmem() was missing GFP_KERNEL parameter. The patch set did
> >   compile because 2/2 had the fixup (leaked there when rebasing the
> >   patch set).
> > v4:
> > - Squashed a couple of unrequired CONFIG_MODULES checks.
> > - See https://lore.kernel.org/all/d034m18d63ec.2y11d954ys...@kernel.org/
> > v3:
> > - A new patch added.
> > - For IS_DEFINED() I need advice as I could not really find that many
> >   locations where it would be applicable.
> > ---
> >  arch/Kconfig| 16 +++-
> >  include/linux/execmem.h | 13 +
> >  kernel/kprobes.c| 17 ++---
> >  kernel/trace/trace_kprobe.c |  8 
> >  4 files changed, 50 insertions(+), 4 deletions(-)
> >  create mode 100644 include/linux/execmem.h
> > 
> > diff --git a/arch/Kconfig b/arch/Kconfig
> > index a5af0edd3eb8..33ba68b7168f 100644
> > --- a/arch/Kconfig
> > +++ b/arch/Kconfig
> > @@ -52,8 +52,8 @@ config GENERIC_ENTRY
> >  
> >  config KPROBES
> > bool "Kprobes"
> > -   depends on MODULES
> > depends on HAVE_KPROBES
> > +   select ALLOC_EXECMEM
> > select KALLSYMS
> > select TASKS_RCU if PREEMPTION
> > help
> > @@ -215,6 +215,20 @@ config HAVE_OPTPROBES
> >  config HAVE_KPROBES_ON_FTRACE
> > bool
> >  
> > +config HAVE_ALLOC_EXECMEM
> > +   bool
> > +   help
> > + Architectures that select this option are capable of allocating 
> > executable
> > + memory, which can be used by subsystems but is not dependent of any 
> > of its
> > + clients.
> > +
> > +config ALLOC_EXECMEM
> > +   bool "Executable (trampoline) memory allocation"
> > +   depends on MODULES || HAVE_ALLOC_EXECMEM
> > +   help
> > + Select this for executable (trampoline) memory. Can be enabled when 
> > either
> > + module allocator or arch-specific allocator is available.
> > +
> >  config ARCH_CORRECT_STACKTRACE_ON_KRETPROBE
> > bool
> > help
> > diff --git a/include/linux/execmem.h b/include/linux/execmem.h
> > new file mode 100644
> > index ..ae2ff151523a
> > --- /dev/null
> > +++ b/include/linux/execmem.h
> > @@ -0,0 +1,13 @@
> > +/* SPDX-License-Identifier: GPL-2.0 */
> > +#ifndef _LINUX_EXECMEM_H
> > +#define _LINUX_EXECMEM_H
> > +
> > +#ifdef CONFIG_HAVE_ALLOC_EXECMEM
> > +void *alloc_execmem(unsigned long size, gfp_t gfp);
> > +void free_execmem(void *region);
> > +#else
> > +#define alloc_execmem(size, gfp)   module_alloc(size)
> > +#define free_execmem(region)   module_memfree(region)
> > +#endif
> > +
> > +#endif /* _LINUX_EXECMEM_H */
> > diff --git a/kernel/kprobes.c b/kernel/kprobes.c
> > index 9d9095e81792..87fd8c14a938 100644
> > --- a/kernel/kprobes.c
> > +++ b/kernel/kprobes.c
> > @@ -44,6 +44,7 @@
> >  #include 
> >  #include 
> >  #include 
> > +#include 
> >  
> >  #define KPROBE_HASH_BITS 6
> >  #define KPROBE_TABLE_SIZE (1 << KPROBE_HASH_BITS)
> > @@ -113,17 +114,17 @@ enum kprobe_slot_state {
> >  void __weak *alloc_insn_page(void)
> >  {
> > /*
> > -* Use module_alloc() so this page is within +/- 2GB of where the
> > +* Use alloc_execmem() so this page is within +/- 2GB of where the
> >  * kernel image and loaded module images reside. This is required
> >  * for most of the architectures.
> >  * (e.g. x86-64 needs this to handle the %rip-relative fixups.)
> >  */
> > -   return module_alloc(PAGE_SIZE);
> > +   return alloc_execmem(PAGE_SIZE, GFP_KERNEL);
> >  }
> >  
> >  static void free_insn_page(void *page)
> >  {
> > -   module_memfree(page);
> > +   free_execmem(page);
> >  }
> >  
> >  struct kprobe_insn_cache kprobe_insn_slots = {
> > @@ -1580,6 +1581,7 @@ static int check_kprobe_address_safe(struct kprobe *p,
> > goto out;
> > }
> >  
> > +#ifdef CONFIG_MODULES
>
> You don't need this block, because these APIs have dummy functions.

Hmm... I'll verify this tomorrow.

>
> > /* Check if 'p' is probing a module. */
> > *probed_mod = __module_text_address((unsigned long) p->addr);
> > if (*probed_mod) {
>
> So this block will never be true if !CONFIG_MODULES automatically, and it should be
> optimized out by the compiler.

Yeah sure, was not done for saving 

Re: [PATCH net-next v2 0/3] tcp: make trace of reset logic complete

2024-03-25 Thread Jakub Kicinski
On Mon, 25 Mar 2024 14:28:28 +0800 Jason Xing wrote:
> Before this, we miss some cases where the TCP layer could send rst but
> we cannot trace it. So I decided to complete it :)
> 
> v2
> 1. fix spelling mistakes

Not only do you post it before we "officially" open net-next but
also ignoring the 24h wait period.

https://www.kernel.org/doc/html/next/process/maintainer-netdev.html#tl-dr

The main goal of the 24h rule is to stop people from bombarding us with
new versions for silly reasons.

You should know better than this, it's hardly your first contribution :(
-- 
pw-bot: 24h



[PATCH v6 2/2] arch/riscv: Enable kprobes when CONFIG_MODULES=n

2024-03-25 Thread Jarkko Sakkinen
Tracing with kprobes while running a monolithic kernel is currently
impossible due to the kernel module allocator dependency.

Address the issue by implementing textmem API for RISC-V.

Link: https://www.sochub.fi # for power on testing new SoC's with a minimal 
stack
Link: https://lore.kernel.org/all/2022060814.3054333-1-jar...@profian.com/ 
# continuation
Signed-off-by: Jarkko Sakkinen 
---
v4:
- Include linux/execmem.h.
v3:
- Architecture independent parts have been split to separate patches.
- Do not change arch/riscv/kernel/module.c as it is out of scope for
  this patch set now.
v2:
- Better late than never right? :-)
- Focus only to RISC-V for now to make the patch more digestable. This
  is the arch where I use the patch on a daily basis to help with QA.
- Introduce HAVE_KPROBES_ALLOC flag to help with more gradual migration.
---
 arch/riscv/Kconfig  |  1 +
 arch/riscv/kernel/Makefile  |  3 +++
 arch/riscv/kernel/execmem.c | 22 ++
 3 files changed, 26 insertions(+)
 create mode 100644 arch/riscv/kernel/execmem.c

diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
index e3142ce531a0..499512fb17ff 100644
--- a/arch/riscv/Kconfig
+++ b/arch/riscv/Kconfig
@@ -132,6 +132,7 @@ config RISCV
select HAVE_KPROBES if !XIP_KERNEL
select HAVE_KPROBES_ON_FTRACE if !XIP_KERNEL
select HAVE_KRETPROBES if !XIP_KERNEL
+   select HAVE_ALLOC_EXECMEM if !XIP_KERNEL
# https://github.com/ClangBuiltLinux/linux/issues/1881
select HAVE_LD_DEAD_CODE_DATA_ELIMINATION if !LD_IS_LLD
select HAVE_MOVE_PMD
diff --git a/arch/riscv/kernel/Makefile b/arch/riscv/kernel/Makefile
index 604d6bf7e476..337797f10d3e 100644
--- a/arch/riscv/kernel/Makefile
+++ b/arch/riscv/kernel/Makefile
@@ -73,6 +73,9 @@ obj-$(CONFIG_SMP) += cpu_ops.o
 
 obj-$(CONFIG_RISCV_BOOT_SPINWAIT) += cpu_ops_spinwait.o
 obj-$(CONFIG_MODULES)  += module.o
+ifeq ($(CONFIG_ALLOC_EXECMEM),y)
+obj-y  += execmem.o
+endif
 obj-$(CONFIG_MODULE_SECTIONS)  += module-sections.o
 
 obj-$(CONFIG_CPU_PM)   += suspend_entry.o suspend.o
diff --git a/arch/riscv/kernel/execmem.c b/arch/riscv/kernel/execmem.c
new file mode 100644
index ..3e52522ead32
--- /dev/null
+++ b/arch/riscv/kernel/execmem.c
@@ -0,0 +1,22 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+
+#include 
+#include 
+#include 
+#include 
+
+void *alloc_execmem(unsigned long size, gfp_t /* gfp */)
+{
+   return __vmalloc_node_range(size, 1, MODULES_VADDR,
+   MODULES_END, GFP_KERNEL,
+   PAGE_KERNEL, 0, NUMA_NO_NODE,
+   __builtin_return_address(0));
+}
+
+void free_execmem(void *region)
+{
+   if (in_interrupt())
+   pr_warn("In interrupt context: vmalloc may not work.\n");
+
+   vfree(region);
+}
-- 
2.44.0




[PATCH v6 1/2] kprobes: implement trampoline memory allocator

2024-03-25 Thread Jarkko Sakkinen
Tracing with kprobes while running a monolithic kernel is currently
impossible because CONFIG_KPROBES depends on CONFIG_MODULES.

Introduce alloc_execmem() and free_execmem() for allocating executable
memory. If an arch implements these functions, it can mark this up with
the HAVE_ALLOC_EXECMEM kconfig flag.

The second new kconfig flag is ALLOC_EXECMEM, which can be selected if
either MODULES is selected or HAVE_ALLOC_EXECMEM is supported by the arch. If
HAVE_ALLOC_EXECMEM is not supported by an arch, module_alloc() and
module_memfree() are used as a fallback, thus retaining backwards
compatibility to earlier kernel versions.

This will allow an architecture to enable kprobes tracing without requiring
module support to be enabled.

The support can be implemented with four easy steps:

1. Implement alloc_execmem().
2. Implement free_execmem().
3. Edit arch/<arch>/Makefile.
4. Set HAVE_ALLOC_EXECMEM in arch/<arch>/Kconfig.

Link: 
https://lore.kernel.org/all/20240325115632.04e37297491cadfbbf382...@kernel.org/
Suggested-by: Masami Hiramatsu 
Signed-off-by: Jarkko Sakkinen 
---
v6:
- Use null pointer for notifiers and register the module notifier only if
  IS_ENABLED(CONFIG_MODULES) is set.
- Fixed typo in the commit message and wrote more verbose description
  of the feature.
v5:
- alloc_execmem() was missing GFP_KERNEL parameter. The patch set did
  compile because 2/2 had the fixup (leaked there when rebasing the
  patch set).
v4:
- Squashed a couple of unrequired CONFIG_MODULES checks.
- See https://lore.kernel.org/all/d034m18d63ec.2y11d954ys...@kernel.org/
v3:
- A new patch added.
- For IS_DEFINED() I need advice as I could not really find that many
  locations where it would be applicable.
---
 arch/Kconfig| 16 +++-
 include/linux/execmem.h | 13 +
 kernel/kprobes.c| 19 +++
 kernel/trace/trace_kprobe.c | 15 +--
 4 files changed, 56 insertions(+), 7 deletions(-)
 create mode 100644 include/linux/execmem.h

diff --git a/arch/Kconfig b/arch/Kconfig
index a5af0edd3eb8..33ba68b7168f 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -52,8 +52,8 @@ config GENERIC_ENTRY
 
 config KPROBES
bool "Kprobes"
-   depends on MODULES
depends on HAVE_KPROBES
+   select ALLOC_EXECMEM
select KALLSYMS
select TASKS_RCU if PREEMPTION
help
@@ -215,6 +215,20 @@ config HAVE_OPTPROBES
 config HAVE_KPROBES_ON_FTRACE
bool
 
+config HAVE_ALLOC_EXECMEM
+   bool
+   help
+ Architectures that select this option are capable of allocating 
executable
+ memory, which can be used by subsystems but is not dependent of any 
of its
+ clients.
+
+config ALLOC_EXECMEM
+   bool "Executable (trampoline) memory allocation"
+   depends on MODULES || HAVE_ALLOC_EXECMEM
+   help
+ Select this for executable (trampoline) memory. Can be enabled when 
either
+ module allocator or arch-specific allocator is available.
+
 config ARCH_CORRECT_STACKTRACE_ON_KRETPROBE
bool
help
diff --git a/include/linux/execmem.h b/include/linux/execmem.h
new file mode 100644
index ..ae2ff151523a
--- /dev/null
+++ b/include/linux/execmem.h
@@ -0,0 +1,13 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_EXECMEM_H
+#define _LINUX_EXECMEM_H
+
+#ifdef CONFIG_HAVE_ALLOC_EXECMEM
+void *alloc_execmem(unsigned long size, gfp_t gfp);
+void free_execmem(void *region);
+#else
+#define alloc_execmem(size, gfp)   module_alloc(size)
+#define free_execmem(region)   module_memfree(region)
+#endif
+
+#endif /* _LINUX_EXECMEM_H */
diff --git a/kernel/kprobes.c b/kernel/kprobes.c
index 9d9095e81792..9ed07a4bf9e3 100644
--- a/kernel/kprobes.c
+++ b/kernel/kprobes.c
@@ -44,6 +44,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #define KPROBE_HASH_BITS 6
 #define KPROBE_TABLE_SIZE (1 << KPROBE_HASH_BITS)
@@ -113,17 +114,17 @@ enum kprobe_slot_state {
 void __weak *alloc_insn_page(void)
 {
/*
-* Use module_alloc() so this page is within +/- 2GB of where the
+* Use alloc_execmem() so this page is within +/- 2GB of where the
 * kernel image and loaded module images reside. This is required
 * for most of the architectures.
 * (e.g. x86-64 needs this to handle the %rip-relative fixups.)
 */
-   return module_alloc(PAGE_SIZE);
+   return alloc_execmem(PAGE_SIZE, GFP_KERNEL);
 }
 
 static void free_insn_page(void *page)
 {
-   module_memfree(page);
+   free_execmem(page);
 }
 
 struct kprobe_insn_cache kprobe_insn_slots = {
@@ -1580,6 +1581,7 @@ static int check_kprobe_address_safe(struct kprobe *p,
goto out;
}
 
+#ifdef CONFIG_MODULES
/* Check if 'p' is probing a module. */
*probed_mod = __module_text_address((unsigned long) p->addr);
if (*probed_mod) {
@@ -1603,6 +1605,8 @@ static int check_kprobe_address_safe(struct kprobe *p,
ret = -ENOENT;

Re: [PATCH v5 1/2] kprobes: textmem API

2024-03-25 Thread Google
On Mon, 25 Mar 2024 23:55:01 +0200
Jarkko Sakkinen  wrote:

> Tracing with kprobes while running a monolithic kernel is currently
> impossible because CONFIG_KPROBES depends on CONFIG_MODULES because it uses
> the kernel module allocator.
> 
> Introduce alloc_textmem() and free_textmem() for allocating executable
> memory. If an arch implements these functions, it can mark this up with
> the HAVE_ALLOC_EXECMEM kconfig flag.
> 
> At first this feature will be used for enabling kprobes without
> modules support for arch/riscv.
> 
> Link: 
> https://lore.kernel.org/all/20240325115632.04e37297491cadfbbf382...@kernel.org/
> Suggested-by: Masami Hiramatsu 
> Signed-off-by: Jarkko Sakkinen 
> ---
> v5:
> - alloc_execmem() was missing GFP_KERNEL parameter. The patch set did
>   compile because 2/2 had the fixup (leaked there when rebasing the
>   patch set).
> v4:
> - Squashed a couple of unrequired CONFIG_MODULES checks.
> - See https://lore.kernel.org/all/d034m18d63ec.2y11d954ys...@kernel.org/
> v3:
> - A new patch added.
> - For IS_DEFINED() I need advice as I could not really find that many
>   locations where it would be applicable.
> ---
>  arch/Kconfig| 16 +++-
>  include/linux/execmem.h | 13 +
>  kernel/kprobes.c| 17 ++---
>  kernel/trace/trace_kprobe.c |  8 
>  4 files changed, 50 insertions(+), 4 deletions(-)
>  create mode 100644 include/linux/execmem.h
> 
> diff --git a/arch/Kconfig b/arch/Kconfig
> index a5af0edd3eb8..33ba68b7168f 100644
> --- a/arch/Kconfig
> +++ b/arch/Kconfig
> @@ -52,8 +52,8 @@ config GENERIC_ENTRY
>  
>  config KPROBES
>   bool "Kprobes"
> - depends on MODULES
>   depends on HAVE_KPROBES
> + select ALLOC_EXECMEM
>   select KALLSYMS
>   select TASKS_RCU if PREEMPTION
>   help
> @@ -215,6 +215,20 @@ config HAVE_OPTPROBES
>  config HAVE_KPROBES_ON_FTRACE
>   bool
>  
> +config HAVE_ALLOC_EXECMEM
> + bool
> + help
> +   Architectures that select this option are capable of allocating 
> executable
> +   memory, which can be used by subsystems but is not dependent of any 
> of its
> +   clients.
> +
> +config ALLOC_EXECMEM
> + bool "Executable (trampoline) memory allocation"
> + depends on MODULES || HAVE_ALLOC_EXECMEM
> + help
> +   Select this for executable (trampoline) memory. Can be enabled when 
> either
> +   module allocator or arch-specific allocator is available.
> +
>  config ARCH_CORRECT_STACKTRACE_ON_KRETPROBE
>   bool
>   help
> diff --git a/include/linux/execmem.h b/include/linux/execmem.h
> new file mode 100644
> index ..ae2ff151523a
> --- /dev/null
> +++ b/include/linux/execmem.h
> @@ -0,0 +1,13 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef _LINUX_EXECMEM_H
> +#define _LINUX_EXECMEM_H
> +
> +#ifdef CONFIG_HAVE_ALLOC_EXECMEM
> +void *alloc_execmem(unsigned long size, gfp_t gfp);
> +void free_execmem(void *region);
> +#else
> +#define alloc_execmem(size, gfp) module_alloc(size)
> +#define free_execmem(region) module_memfree(region)
> +#endif
> +
> +#endif /* _LINUX_EXECMEM_H */
> diff --git a/kernel/kprobes.c b/kernel/kprobes.c
> index 9d9095e81792..87fd8c14a938 100644
> --- a/kernel/kprobes.c
> +++ b/kernel/kprobes.c
> @@ -44,6 +44,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  #define KPROBE_HASH_BITS 6
>  #define KPROBE_TABLE_SIZE (1 << KPROBE_HASH_BITS)
> @@ -113,17 +114,17 @@ enum kprobe_slot_state {
>  void __weak *alloc_insn_page(void)
>  {
>   /*
> -  * Use module_alloc() so this page is within +/- 2GB of where the
> +  * Use alloc_execmem() so this page is within +/- 2GB of where the
>* kernel image and loaded module images reside. This is required
>* for most of the architectures.
>* (e.g. x86-64 needs this to handle the %rip-relative fixups.)
>*/
> - return module_alloc(PAGE_SIZE);
> + return alloc_execmem(PAGE_SIZE, GFP_KERNEL);
>  }
>  
>  static void free_insn_page(void *page)
>  {
> - module_memfree(page);
> + free_execmem(page);
>  }
>  
>  struct kprobe_insn_cache kprobe_insn_slots = {
> @@ -1580,6 +1581,7 @@ static int check_kprobe_address_safe(struct kprobe *p,
>   goto out;
>   }
>  
> +#ifdef CONFIG_MODULES

You don't need this block, because these APIs have dummy functions.

>   /* Check if 'p' is probing a module. */
>   *probed_mod = __module_text_address((unsigned long) p->addr);
>   if (*probed_mod) {

So this block will never be true if !CONFIG_MODULES automatically, and it should be
optimized out by the compiler.

> @@ -1603,6 +1605,8 @@ static int check_kprobe_address_safe(struct kprobe *p,
>   ret = -ENOENT;
>   }
>   }
> +#endif
> +
>  out:
>   preempt_enable();
>   jump_label_unlock();
> @@ -2482,6 +2486,7 @@ int kprobe_add_area_blacklist(unsigned long start, 
> unsigned long end)
>   return 0;
>  }
>  
> 

Re: [PATCH] ftrace: make extra rcu_is_watching() validation check optional

2024-03-25 Thread Steven Rostedt
On Mon, 25 Mar 2024 11:38:48 +0900
Masami Hiramatsu (Google)  wrote:

> On Fri, 22 Mar 2024 09:03:23 -0700
> Andrii Nakryiko  wrote:
> 
> > Introduce CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING config option to
> > control whether ftrace low-level code performs additional
> > rcu_is_watching()-based validation logic in an attempt to catch noinstr
> > violations.
> > 
> > This check is expected to never be true in practice and would be best
> > controlled with extra config to let users decide if they are willing to
> > pay the price.  
> 
> Hmm, for me, it sounds like "WARN_ON(something) will never be true in practice
> so disable it by default". I think CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING
> is OK, but that should be set to Y by default. If you have already verified
> that your system never makes it true and you want to optimize your ftrace
> path, you can manually set it to N at your own risk.
> 

Really, it's for debugging. I would argue that it should *not* be default y.
Peter added this to find all the locations that could be called where RCU
is not watching. But the issue I have is that this is that it *does cause
overhead* with function tracing.

I believe we found pretty much all locations that were an issue, and we
should now just make it an option for developers.

It's no different than lockdep. Test boxes should have it enabled, but
there's no reason to have this enabled in a production system.

-- Steve


> > 
> > Cc: Steven Rostedt 
> > Cc: Masami Hiramatsu 
> > Cc: Paul E. McKenney 
> > Signed-off-by: Andrii Nakryiko 
> > ---
> >  include/linux/trace_recursion.h |  2 +-
> >  kernel/trace/Kconfig| 13 +
> >  2 files changed, 14 insertions(+), 1 deletion(-)
> > 
> > diff --git a/include/linux/trace_recursion.h 
> > b/include/linux/trace_recursion.h
> > index d48cd92d2364..24ea8ac049b4 100644
> > --- a/include/linux/trace_recursion.h
> > +++ b/include/linux/trace_recursion.h
> > @@ -135,7 +135,7 @@ extern void ftrace_record_recursion(unsigned long ip, 
> > unsigned long parent_ip);
> >  # define do_ftrace_record_recursion(ip, pip)   do { } while (0)
> >  #endif
> >  
> > -#ifdef CONFIG_ARCH_WANTS_NO_INSTR
> > +#ifdef CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING
> >  # define trace_warn_on_no_rcu(ip)  \
> > ({  \
> > bool __ret = !rcu_is_watching();\  
> 
> BTW, maybe we can add "unlikely" in the next "if" line?
> 
> > diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig
> > index 61c541c36596..19bce4e217d6 100644
> > --- a/kernel/trace/Kconfig
> > +++ b/kernel/trace/Kconfig
> > @@ -974,6 +974,19 @@ config FTRACE_RECORD_RECURSION_SIZE
> >   This file can be reset, but the limit can not change in
> >   size at runtime.
> >  
> > +config FTRACE_VALIDATE_RCU_IS_WATCHING
> > +   bool "Validate RCU is on during ftrace recursion check"
> > +   depends on FUNCTION_TRACER
> > +   depends on ARCH_WANTS_NO_INSTR  
> 
>   default y
> 
> > +   help
> > + All callbacks that attach to the function tracing have some sort
> > + of protection against recursion. This option performs additional
> > + checks to make sure RCU is on when ftrace callbacks recurse.
> > +
> > + This will add more overhead to all ftrace-based invocations.  
> 
>   ... invocations, but keep it safe.
> 
> > +
> > + If unsure, say N  
> 
>   If unsure, say Y
> 
> Thank you,
> 
> > +
> >  config RING_BUFFER_RECORD_RECURSION
> > bool "Record functions that recurse in the ring buffer"
> > depends on FTRACE_RECORD_RECURSION
> > -- 
> > 2.43.0
> >   
> 
> 




Re: [PATCH v3 1/2] kprobes: textmem API

2024-03-25 Thread Jarkko Sakkinen
On Mon Mar 25, 2024 at 10:37 PM EET, Jarkko Sakkinen wrote:
> - if (ret == -ENOENT && !trace_kprobe_module_exist(tk)) {
> +#ifdef CONFIG_MODULES
> + if (ret == -ENOENT && trace_kprobe_module_exist(tk))
> + ret = 0;
> +#endif /* CONFIG_MODULES */

For this we could have

#ifndef CONFIG_MODULES
#define trace_kprobe_module_exist(tk) false
#endif

That would clean up at least two locations requiring no changes. Should
I go forward with this or not?

BR, Jarkko



Re: [PATCH 00/64] i2c: reword i2c_algorithm according to newest specification

2024-03-25 Thread Andi Shyti
Hi Wolfram,

> > @Andi: are you okay with this approach? It means you'd need to merge
> > -rc2 into your for-next branch. Or rebase if all fails.
> 
> I think it's a good plan, I'll try to support you with it.

Do you feel more comfortable if I take the patches as soon as
they are reviewed?

So far I have tagged patches 1-4 and I can already merge 2, 3 and 4 as
long as you merge patch 1.

Andi



Re: [PATCH v5 1/2] kprobes: textmem API

2024-03-25 Thread Jarkko Sakkinen
On Tue Mar 26, 2024 at 1:50 AM EET, Masami Hiramatsu (Google) wrote:
> On Tue, 26 Mar 2024 00:09:42 +0200
> "Jarkko Sakkinen"  wrote:
>
> > On Mon Mar 25, 2024 at 11:55 PM EET, Jarkko Sakkinen wrote:
> > > +#ifdef CONFIG_MODULES
> > >   if (register_module_notifier(&trace_kprobe_module_nb))
> > >   return -EINVAL;
> > > +#endif /* CONFIG_MODULES */
> > 
> > register_module_notifier() does have "dummy" version but what
> > would I pass to it. It makes more mess than it cleans to declare
> > also a "dummy" version of trace_kprobe_module_nb.
>
> That is better than having #ifdef in the function.
>
> > 
> > The callback itself has too tight module subsystem bindings for
> > it to be simply flagged with IS_DEFINED() (or correct me
> > if I'm mistaken, this is the conclusion I've ended up with).
>
> Please try this.
>
> -
> diff --git a/kernel/kprobes.c b/kernel/kprobes.c
> index 70dc6179086e..bc98db14927f 100644
> --- a/kernel/kprobes.c
> +++ b/kernel/kprobes.c
> @@ -2625,6 +2625,7 @@ static void remove_module_kprobe_blacklist(struct 
> module *mod)
>   }
>  }
>  
> +#ifdef CONFIG_MODULES
>  /* Module notifier call back, checking kprobes on the module */
>  static int kprobes_module_callback(struct notifier_block *nb,
>  unsigned long val, void *data)
> @@ -2675,6 +2676,9 @@ static int kprobes_module_callback(struct 
> notifier_block *nb,
>   mutex_unlock(_mutex);
>   return NOTIFY_DONE;
>  }
> +#else
> +#define kprobes_module_callback  (NULL)
> +#endif
>  
>  static struct notifier_block kprobe_module_nb = {
>   .notifier_call = kprobes_module_callback,
> @@ -2739,7 +2743,7 @@ static int __init init_kprobes(void)
>   err = arch_init_kprobes();
>   if (!err)
>   err = register_die_notifier(&kprobe_exceptions_nb);
> - if (!err)
> + if (!err && IS_ENABLED(CONFIG_MODULES))
>   err = register_module_notifier(&kprobe_module_nb);
>  
>   kprobes_initialized = (err == 0);

OK, thanks for the suggestion WFM.

I'll give this also a spin with VisionFive2 RISC-V SBC before sending
v6.

BR, Jarkko



Re: [PATCH v5 1/2] kprobes: textmem API

2024-03-25 Thread Google
On Tue, 26 Mar 2024 00:09:42 +0200
"Jarkko Sakkinen"  wrote:

> On Mon Mar 25, 2024 at 11:55 PM EET, Jarkko Sakkinen wrote:
> > +#ifdef CONFIG_MODULES
> > if (register_module_notifier(&trace_kprobe_module_nb))
> > return -EINVAL;
> > +#endif /* CONFIG_MODULES */
> 
> register_module_notifier() does have "dummy" version but what
> would I pass to it. It makes more mess than it cleans to declare
> also a "dummy" version of trace_kprobe_module_nb.

That is better than having #ifdef in the function.

> 
> The callback itself has too tight module subsystem bindings for
> it to be simply flagged with IS_DEFINED() (or correct me
> if I'm mistaken, this is the conclusion I've ended up with).

Please try this.

-
diff --git a/kernel/kprobes.c b/kernel/kprobes.c
index 70dc6179086e..bc98db14927f 100644
--- a/kernel/kprobes.c
+++ b/kernel/kprobes.c
@@ -2625,6 +2625,7 @@ static void remove_module_kprobe_blacklist(struct module 
*mod)
}
 }
 
+#ifdef CONFIG_MODULES
 /* Module notifier call back, checking kprobes on the module */
 static int kprobes_module_callback(struct notifier_block *nb,
   unsigned long val, void *data)
@@ -2675,6 +2676,9 @@ static int kprobes_module_callback(struct notifier_block 
*nb,
mutex_unlock(&kprobe_mutex);
return NOTIFY_DONE;
 }
+#else
+#define kprobes_module_callback  (NULL)
+#endif
 
 static struct notifier_block kprobe_module_nb = {
.notifier_call = kprobes_module_callback,
@@ -2739,7 +2743,7 @@ static int __init init_kprobes(void)
err = arch_init_kprobes();
if (!err)
err = register_die_notifier(&kprobe_exceptions_nb);
-   if (!err)
+   if (!err && IS_ENABLED(CONFIG_MODULES))
err = register_module_notifier(&kprobe_module_nb);
 
kprobes_initialized = (err == 0);

-

Thank you,

-- 
Masami Hiramatsu (Google) 



Re: raw_tp+cookie is buggy. Was: [syzbot] [bpf?] [trace?] KASAN: slab-use-after-free Read in bpf_trace_run1

2024-03-25 Thread Andrii Nakryiko
On Mon, Mar 25, 2024 at 10:27 AM Andrii Nakryiko
 wrote:
>
> On Sun, Mar 24, 2024 at 5:07 PM Alexei Starovoitov
>  wrote:
> >
> > Hi Andrii,
> >
> > syzbot found UAF in raw_tp cookie series in bpf-next.
> > Reverting the whole merge
> > 2e244a72cd48 ("Merge branch 'bpf-raw-tracepoint-support-for-bpf-cookie'")
> >
> > fixes the issue.
> >
> > Pls take a look.
> > See C reproducer below. It splats consistently with CONFIG_KASAN=y
> >
> > Thanks.
>
> Will do, traveling today, so will be offline for a bit, but will check
> first thing afterwards.
>

Ok, so I don't think it's bpf_raw_tp_link specific, it should affect a
bunch of other links (unless I missed something). Basically, when the last
link refcnt drops, we detach, do bpf_prog_put() and then proceed to
kfree the link itself synchronously. But that link can still be referenced
from a running BPF program (I think multi-kprobe/multi-uprobe use it for
cookies, raw_tp with my changes started using the link at runtime, there
are probably more types), and so if we free this memory synchronously,
we can have a UAF.

We should do what we do for bpf_maps and delay freeing, the only
question is how tunable that freeing can be? Always do call_rcu()?
Always call_rcu_tasks_trace() (relevant for sleepable multi-uprobes)?
Should we allow synchronous free if link is not directly accessible
from program during its run?

Anyway, I sent a fix as an RFC so we can discuss.
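
For illustration, here is a minimal sketch of the deferred-free idea (assumed
shape and field names, not the actual RFC fix): instead of kfree()ing the link
synchronously on the last refcount drop, queue the final free behind an RCU
grace period so a BPF program still running under RCU protection cannot see
freed link memory. Sleepable programs would need call_rcu_tasks_trace()
instead, as mentioned above.

/* Sketch only: assumes struct bpf_link gains a struct rcu_head member. */
static void bpf_link_free_rcu(struct rcu_head *rcu)
{
        struct bpf_link *link = container_of(rcu, struct bpf_link, rcu);

        kfree(link);
}

/* ...in the link teardown path, replacing the synchronous kfree(link): */
call_rcu(&link->rcu, bpf_link_free_rcu);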

> >
> > On Sun, Mar 24, 2024 at 4:28 PM syzbot
> >  wrote:
> > >
> > > Hello,
> > >
> > > syzbot found the following issue on:
> > >
> > > HEAD commit:520fad2e3206 selftests/bpf: scale benchmark counting by 
> > > us..
> > > git tree:   bpf-next
> > > console+strace: https://syzkaller.appspot.com/x/log.txt?x=105af94618
> > > kernel config:  https://syzkaller.appspot.com/x/.config?x=6fb1be60a193d440
> > > dashboard link: 
> > > https://syzkaller.appspot.com/bug?extid=981935d9485a560bfbcb
> > > compiler:   Debian clang version 15.0.6, GNU ld (GNU Binutils for 
> > > Debian) 2.40
> > > syz repro:  https://syzkaller.appspot.com/x/repro.syz?x=114f17a518
> > > C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=162bb7a518
> > >
> > > Downloadable assets:
> > > disk image: 
> > > https://storage.googleapis.com/syzbot-assets/4eef3506c5ce/disk-520fad2e.raw.xz
> > > vmlinux: 
> > > https://storage.googleapis.com/syzbot-assets/24d60ebe76cc/vmlinux-520fad2e.xz
> > > kernel image: 
> > > https://storage.googleapis.com/syzbot-assets/8f883e706550/bzImage-520fad2e.xz
> > >
> > > IMPORTANT: if you fix the issue, please add the following tag to the 
> > > commit:
> > > Reported-by: syzbot+981935d9485a560bf...@syzkaller.appspotmail.com
> > >
> > > ==
> > > BUG: KASAN: slab-use-after-free in __bpf_trace_run 
> > > kernel/trace/bpf_trace.c:2376 [inline]
> > > BUG: KASAN: slab-use-after-free in bpf_trace_run1+0xcb/0x510 
> > > kernel/trace/bpf_trace.c:2430
> > > Read of size 8 at addr 8880290d9918 by task migration/0/19
> > >
> > > CPU: 0 PID: 19 Comm: migration/0 Not tainted 
> > > 6.8.0-syzkaller-05233-g520fad2e3206 #0
> > > Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS 
> > > Google 02/29/2024
> > > Stopper: 0x0 <- 0x0
> > > Call Trace:
> > >  
> > >  __dump_stack lib/dump_stack.c:88 [inline]
> > >  dump_stack_lvl+0x1e7/0x2e0 lib/dump_stack.c:106
> > >  print_address_description mm/kasan/report.c:377 [inline]
> > >  print_report+0x169/0x550 mm/kasan/report.c:488
> > >  kasan_report+0x143/0x180 mm/kasan/report.c:601
> > >  __bpf_trace_run kernel/trace/bpf_trace.c:2376 [inline]
> > >  bpf_trace_run1+0xcb/0x510 kernel/trace/bpf_trace.c:2430
> > >  __traceiter_rcu_utilization+0x74/0xb0 include/trace/events/rcu.h:27
> > >  trace_rcu_utilization+0x194/0x1c0 include/trace/events/rcu.h:27
> > >  rcu_note_context_switch+0xc7c/0xff0 kernel/rcu/tree_plugin.h:360
> > >  __schedule+0x345/0x4a20 kernel/sched/core.c:6635
> > >  __schedule_loop kernel/sched/core.c:6813 [inline]
> > >  schedule+0x14b/0x320 kernel/sched/core.c:6828
> > >  smpboot_thread_fn+0x61e/0xa30 kernel/smpboot.c:160
> > >  kthread+0x2f0/0x390 kernel/kthread.c:388
> > >  ret_from_fork+0x4b/0x80 arch/x86/kernel/process.c:147
> > >  ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:243
> > >  
> > >
> > > Allocated by task 5075:
> > >  kasan_save_stack mm/kasan/common.c:47 [inline]
> > >  kasan_save_track+0x3f/0x80 mm/kasan/common.c:68
> > >  poison_kmalloc_redzone mm/kasan/common.c:370 [inline]
> > >  __kasan_kmalloc+0x98/0xb0 mm/kasan/common.c:387
> > >  kasan_kmalloc include/linux/kasan.h:211 [inline]
> > >  kmalloc_trace+0x1d9/0x360 mm/slub.c:4012
> > >  kmalloc include/linux/slab.h:590 [inline]
> > >  kzalloc include/linux/slab.h:711 [inline]
> > >  bpf_raw_tp_link_attach+0x2a0/0x6e0 kernel/bpf/syscall.c:3816
> > >  bpf_raw_tracepoint_open+0x1c2/0x240 kernel/bpf/syscall.c:3863
> > >  __sys_bpf+0x3c0/0x810 kernel/bpf/syscall.c:5673
> > >  __do_sys_bpf 

Re: [PATCH v3 0/2] Update mce_record tracepoint

2024-03-25 Thread Naik, Avadhut



On 3/25/2024 15:31, Borislav Petkov wrote:
> On Mon, Mar 25, 2024 at 03:12:14PM -0500, Naik, Avadhut wrote:
>> Can this patchset be merged in? Or would you prefer me sending out
>> another revision with Steven's "Reviewed-by:" tag?
> 
> First of all, please do not top-post.
>
Apologies for that!
 
> Then, you were on Cc on the previous thread. Please summarize from it
> and put in the commit message *why* it is good to have each field added.
> 
> And then, above the tracepoint, I'd like you to add a rule which
> states what information can and should be added to the tracepoint. And
> no, "just because" is not good enough. The previous thread has hints.
> 

Thanks for the clarification! Will update accordingly.

> Thx.
> 

-- 
Thanks,
Avadhut Naik



Re: [PATCH v2] arch/riscv: Enable kprobes when CONFIG_MODULES=n

2024-03-25 Thread Jarkko Sakkinen
On Mon Mar 25, 2024 at 4:56 AM EET, Masami Hiramatsu (Google) wrote:
> Hi Jarkko,
>
> On Sun, 24 Mar 2024 01:29:08 +0200
> Jarkko Sakkinen  wrote:
>
> > Tracing with kprobes while running a monolithic kernel is currently
> > impossible due the kernel module allocator dependency.
> > 
> > Address the issue by allowing architectures to implement module_alloc()
> > and module_memfree() independent of the module subsystem. An arch tree
> > can signal this by setting HAVE_KPROBES_ALLOC in its Kconfig file.
> > 
> > Realize the feature on RISC-V by separating allocator to module_alloc.c
> > and implementing module_memfree().
>
> Even though, this involves changes in arch-independent part. So it should
> be solved by generic way. Did you checked Calvin's thread?
>
> https://lore.kernel.org/all/cover.1709676663.git.jcalvinow...@gmail.com/

Nope, it has not been on my radar but I will check it for sure.

I don't mind making this more generic. The point of this version was to
focus on a single architecture and change as little as possible about how the
code works right now, so that it is easier to give feedback on the direction.

> I think, we'd better to introduce `alloc_execmem()`,
> CONFIG_HAVE_ALLOC_EXECMEM and CONFIG_ALLOC_EXECMEM at first
>
>   config HAVE_ALLOC_EXECMEM
>   bool
>
>   config ALLOC_EXECMEM
>   bool "Executable trampoline memory allocation"
>   depends on MODULES || HAVE_ALLOC_EXECMEM

Right, so this is logically the same as what I have, just with ALLOC_EXECMEM
added to cover both MODULES and HAVE_ALLOC_EXECMEM (which is essentially
the same as HAVE_ALLOC_KPROBES just with a different name).

Not at all against this. I think this makes for more understandable
structuring, just "peer checking" that I understand what I'm reading :-)

> And define fallback macro to module_alloc() like this.
>
> #ifndef CONFIG_HAVE_ALLOC_EXECMEM
> #define alloc_execmem(size, gfp)  module_alloc(size)
> #endif
>
> Then, introduce a new dependency to kprobes
>
>   config KPROBES
>   bool "Kprobes"
>   select ALLOC_EXECMEM


OK, I think this is good, but now I actually see two logical chunks of
work because this select changes how the KPROBES kconfig option works. It
previously required manually selecting MODULES first.

So I'll do "select MODULES" as a separate patch to keep all logical changes
transparent...

>
> and update kprobes to use alloc_execmem and remove module related
> code from it.
>
> You also should consider using IS_ENABLED(CONFIG_MODULES) in the code to
> avoid using #ifdefs.
>
> Finally, you can add RISCV implementation patch of HAVE_ALLOC_EXECMEM in the
> next patch.

OK, I think the suggestions are sane and not that much of a drift from what
I have now, so it works for me.

> Thank you,

BR, Jarkko



Re: [PATCH 2/2] ARM: dts: qcom: Add support for Motorola Moto G (2013)

2024-03-25 Thread Stanislav Jakubek
On Mon, Mar 25, 2024 at 08:28:27PM +0100, Konrad Dybcio wrote:
> On 24.03.2024 3:04 PM, Stanislav Jakubek wrote:
> > Add a device tree for the Motorola Moto G (2013) smartphone based
> > on the Qualcomm MSM8226 SoC.
> > 
> > Initially supported features:
> >   - Buttons (Volume Down/Up, Power)
> >   - eMMC
> >   - Hall Effect Sensor
> >   - SimpleFB display
> >   - TMP108 temperature sensor
> >   - Vibrator
> > 
> > Signed-off-by: Stanislav Jakubek 
> > ---
> 
> [...]
> 
> > +   hob-ram@f50 {
> > +   reg = <0x0f50 0x4>,
> > + <0x0f54 0x2000>;
> > +   no-map;
> > +   };
> 
> Any reason it's in two parts? Should it be one contiguous region, or
> two separate nodes?
> 
> lgtm otherwise

Hi Konrad, I copied this from downstream as-is.
According to the downstream docs [1]:

HOB RAM MMAP Device provides ability for userspace to access the
hand over block memory to read out modem related parameters.

And the two regs are the "DHOB partition" and "SHOB partition".

I suppose this is something Motorola (firmware?) specific (since the
downstream compatible is mmi,hob_ram [2]).
Should I split this into 2 nodes - dhob@f50 and shob@f54?

Stanislav

[1] 
https://github.com/LineageOS/android_kernel_motorola_msm8226/blob/cm-14.1/Documentation/devicetree/bindings/misc/hob_ram.txt
[2] 
https://github.com/LineageOS/android_kernel_motorola_msm8226/blob/cm-14.1/arch/arm/boot/dts/msm8226-moto-common.dtsi#L258



Re: [PATCH] uprobes: reduce contention on uprobes_tree access

2024-03-25 Thread Andrii Nakryiko
On Mon, Mar 25, 2024 at 12:12 PM Jonthan Haslam
 wrote:
>
> Hi Ingo,
>
> > > This change has been tested against production workloads that exhibit
> > > significant contention on the spinlock and an almost order of magnitude
> > > reduction for mean uprobe execution time is observed (28 -> 3.5 
> > > microsecs).
> >
> > Have you considered/measured per-CPU RW semaphores?
>
> No I hadn't but thanks hugely for suggesting it! In initial measurements
> it seems to be between 20-100% faster than the RW spinlocks! Apologies for
> all the exclamation marks but I'm very excited. I'll do some more testing
> tomorrow but so far it's looking very good.
>

Documentation ([0]) says that locking for writing calls
synchronize_rcu(), is that right? If that's true, attaching multiple
uprobes (including just attaching a single BPF multi-uprobe) will take
a really long time. We need to confirm we are not significantly
regressing this. And if we do, we need to take measures in the BPF
multi-uprobe attachment code path to make sure that a single
multi-uprobe attachment is still fast.

If my worries above turn out to be true, it still feels like a first
good step should be landing this patch as is (and get it backported to
older kernels), and then have percpu rw-semaphore as a final (and a
bit more invasive) solution (it's RCU-based, so feels like a good
primitive to settle on), making sure to not regress multi-uprobes
(we'll probably will need some batched API for multiple uprobes).

Thoughts?

  [0] https://docs.kernel.org/locking/percpu-rw-semaphore.html
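
As a point of reference, a minimal sketch of the per-CPU rw-semaphore pattern
being discussed (names are assumed for illustration, not taken from the actual
uprobes patch): readers in the hot breakpoint-hit path stay cheap, while
writers (register/unregister) take the slow path, which is where the
synchronize_rcu()-style attach latency concern above comes in.

#include <linux/percpu-rwsem.h>

static DEFINE_STATIC_PERCPU_RWSEM(uprobes_treelock_sketch);

/* Read side: uprobe lookup on breakpoint hit (hot path). */
static struct uprobe *find_uprobe_sketch(struct inode *inode, loff_t offset)
{
        struct uprobe *uprobe;

        percpu_down_read(&uprobes_treelock_sketch);
        uprobe = __find_uprobe(inode, offset);  /* rb-tree walk */
        percpu_up_read(&uprobes_treelock_sketch);

        return uprobe;
}

/* Write side (register/unregister): */
/* percpu_down_write(&uprobes_treelock_sketch); ... percpu_up_write(&uprobes_treelock_sketch); */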

> Thanks again for the input.
>
> Jon.



Re: [PATCH v4 3/4] remoteproc: stm32: Create sub-functions to request shutdown and release

2024-03-25 Thread Mathieu Poirier
On Fri, Mar 08, 2024 at 03:47:07PM +0100, Arnaud Pouliquen wrote:
> To prepare for the support of TEE remoteproc, create sub-functions
> that can be used in both cases, with and without TEE support.
> 
> Signed-off-by: Arnaud Pouliquen 
> ---
>  drivers/remoteproc/stm32_rproc.c | 84 +++-
>  1 file changed, 51 insertions(+), 33 deletions(-)
> 
> diff --git a/drivers/remoteproc/stm32_rproc.c 
> b/drivers/remoteproc/stm32_rproc.c
> index 88623df7d0c3..8cd838df4e92 100644
> --- a/drivers/remoteproc/stm32_rproc.c
> +++ b/drivers/remoteproc/stm32_rproc.c
> @@ -209,6 +209,54 @@ static int stm32_rproc_mbox_idx(struct rproc *rproc, 
> const unsigned char *name)
>   return -EINVAL;
>  }
>  
> +static void stm32_rproc_request_shutdown(struct rproc *rproc)
> +{
> + struct stm32_rproc *ddata = rproc->priv;
> + int err, dummy_data, idx;
> +
> + /* Request shutdown of the remote processor */
> + if (rproc->state != RPROC_OFFLINE && rproc->state != RPROC_CRASHED) {
> + idx = stm32_rproc_mbox_idx(rproc, STM32_MBX_SHUTDOWN);
> + if (idx >= 0 && ddata->mb[idx].chan) {
> + /* A dummy data is sent to allow to block on transmit. 
> */
> + err = mbox_send_message(ddata->mb[idx].chan,
> + &dummy_data);

Why is this changed from the original implementation?

> + if (err < 0)
> + dev_warn(>dev, "warning: remote FW 
> shutdown without ack\n");
> + }
> + }
> +}
> +
> +static int stm32_rproc_release(struct rproc *rproc)
> +{
> + struct stm32_rproc *ddata = rproc->priv;
> + unsigned int err = 0;
> +
> + /* To allow platform Standby power mode, set remote proc Deep Sleep. */
> + if (ddata->pdds.map) {
> + err = regmap_update_bits(ddata->pdds.map, ddata->pdds.reg,
> +  ddata->pdds.mask, 1);
> + if (err) {
> + dev_err(>dev, "failed to set pdds\n");
> + return err;
> + }
> + }
> +
> + /* Update coprocessor state to OFF if available. */
> + if (ddata->m4_state.map) {
> + err = regmap_update_bits(ddata->m4_state.map,
> +  ddata->m4_state.reg,
> +  ddata->m4_state.mask,
> +  M4_STATE_OFF);
> + if (err) {
> + dev_err(>dev, "failed to set copro state\n");
> + return err;
> + }
> + }
> +
> + return 0;
> +}
> +
>  static int stm32_rproc_prepare(struct rproc *rproc)
>  {
>   struct device *dev = rproc->dev.parent;
> @@ -519,17 +567,9 @@ static int stm32_rproc_detach(struct rproc *rproc)
>  static int stm32_rproc_stop(struct rproc *rproc)
>  {
>   struct stm32_rproc *ddata = rproc->priv;
> - int err, idx;
> + int err;
>  
> - /* request shutdown of the remote processor */
> - if (rproc->state != RPROC_OFFLINE && rproc->state != RPROC_CRASHED) {
> - idx = stm32_rproc_mbox_idx(rproc, STM32_MBX_SHUTDOWN);
> - if (idx >= 0 && ddata->mb[idx].chan) {
> - err = mbox_send_message(ddata->mb[idx].chan, "detach");
> - if (err < 0)
> - dev_warn(>dev, "warning: remote FW 
> shutdown without ack\n");
> - }
> - }
> + stm32_rproc_request_shutdown(rproc);
>  
>   err = stm32_rproc_set_hold_boot(rproc, true);
>   if (err)
> @@ -541,29 +581,7 @@ static int stm32_rproc_stop(struct rproc *rproc)
>   return err;
>   }
>  
> - /* to allow platform Standby power mode, set remote proc Deep Sleep */
> - if (ddata->pdds.map) {
> - err = regmap_update_bits(ddata->pdds.map, ddata->pdds.reg,
> -  ddata->pdds.mask, 1);
> - if (err) {
> - dev_err(>dev, "failed to set pdds\n");
> - return err;
> - }
> - }
> -
> - /* update coprocessor state to OFF if available */
> - if (ddata->m4_state.map) {
> - err = regmap_update_bits(ddata->m4_state.map,
> -  ddata->m4_state.reg,
> -  ddata->m4_state.mask,
> -  M4_STATE_OFF);
> - if (err) {
> - dev_err(>dev, "failed to set copro state\n");
> - return err;
> - }
> - }
> -
> - return 0;
> + return stm32_rproc_release(rproc);
>  }
>  
>  static void stm32_rproc_kick(struct rproc *rproc, int vqid)
> -- 
> 2.25.1
> 



Re: [RFC PATCH v2 0/7] DAMON based 2-tier memory management for CXL memory

2024-03-25 Thread SeongJae Park
On Mon, 25 Mar 2024 21:01:04 +0900 Honggyu Kim  wrote:

> Hi SeongJae,
> 
> On Fri, 22 Mar 2024 09:32:23 -0700 SeongJae Park  wrote:
> > On Fri, 22 Mar 2024 18:02:23 +0900 Honggyu Kim  wrote:
[...]
> > > > Honggyu joined DAMON Beer/Coffee/Tea Chat[1] yesterday, and we 
> > > > discussed about
> > > > this patchset in high level.  Sharing the summary here for open 
> > > > discussion.  As
> > > > also discussed on the first version of this patchset[2], we want to 
> > > > make single
> > > > action for general page migration with minimum changes, but would like 
> > > > to keep
> > > > page level access re-check.  We also agreed the previously proposed 
> > > > DAMOS
> > > > filter-based approach could make sense for the purpose.
> > > 
> > > Thanks very much for the summary.  I have been trying to merge promote
> > > and demote actions into a single migrate action, but I found an issue
> > > regarding damon_pa_scheme_score.  It currently calls damon_cold_score()
> > > for demote action and damon_hot_score() for promote action, but what
> > > should we call when we use a single migrate action?
> > 
> > Good point!  This is what I didn't think about when suggesting that.  Thank 
> > you
> > for letting me know this gap!  I think there could be two approach, off the 
> > top
> > of my head.
> > 
> > The first one would be extending the interface so that the user can select 
> > the
> > score function.  This would let flexible usage, but I'm bit concerned if 
> > this
> > could make things unnecessarily complex, and would really useful in many
> > general use case.
> 
> I also think this looks complicated and may not be useful for general
> users.
> 
> > The second approach would be letting DAMON infer the intention.  In this 
> > case,
> > I think we could know the intention is the demotion if the scheme has a youg
> > pages exclusion filter.  Then, we could use the cold_score().  And vice 
> > versa.
> > To cover a case that there is no filter at all, I think we could have one
> > assumption.  My humble intuition says the new action (migrate) may be used 
> > more
> > for promotion use case.  So, in damon_pa_scheme_score(), if the action of 
> > the
> > given scheme is the new one (say, MIGRATE), the function will further check 
> > if
> > the scheme has a filter for excluding young pages.  If so, the function will
> > use cold_score().  Otherwise, the function will use hot_score().
> 
> Thanks for suggesting many ideas but I'm afraid that I feel this doesn't
> look good.  Thinking it again, I think we can think about keep using
> DAMOS_PROMOTE and DAMOS_DEMOTE,

In other words, keep having a dedicated DAMOS action for each intuitive
prioritization score function, or, coupling the prioritization with each
action, right?  I think this makes sense, and fits well with the
documentation.

The prioritization mechanism should be different for each action.  For 
example,
rarely accessed (colder) memory regions would be prioritized for page-out
scheme action.  In contrast, the colder regions would be deprioritized for 
huge
page collapse scheme action.  Hence, the prioritization mechanisms for each
action are implemented in each DAMON operations set, together with the 
actions.

In other words, each DAMOS action should allow users to intuitively understand
what types of regions will be prioritized.  We already have such couples of
DAMOS actions, such as DAMOS_[NO]HUGEPAGE and DAMOS_LRU_[DE]PRIO.  So adding a
couple of actions for this case sounds reasonable to me.  And I think this is
better and simpler than having the inference-based behavior.

That said, I'm concerned that 'PROMOTE' and 'DEMOTE' may still sound a bit
ambiguous to people who don't know 'demote_folio_list()' and its friends.
Meanwhile, the names might sound too detailed about what they do to people
who know the functions, making them a bit inflexible.  They might also get
confused since we don't have 'promote_folio_list()'.

To my humble understanding, what you really want to do is migrating pages to a
specific address range (or node), prioritizing the pages based on their hotness.
What about, say, MIGRATE_{HOT,COLD}?

> but I can make them directly call
> damon_folio_young() for access check instead of using young filter.
> 
> And we can internally handle the complicated combination such as demote
> action sets "young" filter with "matching" true and promote action sets
> "young" filter with "matching" false.  IMHO, this will make the usage
> simpler.

I think whether to exclude young/non-young (maybe idle is better than
non-young?) pages from the action is better kept decoupled, for the following
reasons.

Firstly, we want to check the page granularity youngness mainly because we
found DAMON's monitoring result is not accurate enough for this use case.  Or,
we could say that's because you cannot wait until DAMON's monitoring result
becomes accurate enough.  For more detail, you could increase minimum age of
your scheme's target access pattern.  I show you set 

Re: [RFC][PATCH 0/4] Make bpf_jit and kprobes work with CONFIG_MODULES=n

2024-03-25 Thread Jarkko Sakkinen
On Wed Mar 6, 2024 at 10:05 PM EET, Calvin Owens wrote:
> Hello all,
>
> This patchset makes it possible to use bpftrace with kprobes on kernels
> built without loadable module support.
>
> On a Raspberry Pi 4b, this saves about 700KB of memory where BPF is
> needed but loadable module support is not. These two kernels had
> identical configurations, except CONFIG_MODULE was off in the second:
>
>- Linux version 6.8.0-rc7
>- Memory: 3330672K/4050944K available (16576K kernel code, 2390K rwdata,
>- 12364K rodata, 5632K init, 675K bss, 195984K reserved, 524288K 
> cma-reserved)
>+ Linux version 6.8.0-rc7-3-g2af01251ca21
>+ Memory: 3331400K/4050944K available (16512K kernel code, 2384K rwdata,
>+ 11728K rodata, 5632K init, 673K bss, 195256K reserved, 524288K 
> cma-reserved)
>
> I don't intend to present an exhaustive list of !MODULES usecases, since
> I'm sure there are many I'm not aware of. Performance is a common one,
> the primary justification being that static text is mapped on hugepages
> and module text is not. Security is another, since rootkits are much
> harder to implement without modules.
>
> The first patch is the interesting one: it moves module_alloc() into its
> own file with its own Kconfig option, so it can be utilized even when
> loadable module support is disabled. I got the idea from an unmerged
> patch from a few years ago I found on lkml (see [1/4] for details). I
> think this also has value in its own right, since I suspect there are
> potential users beyond bpf, hopefully we will hear from some.
>
> Patches 2-3 are proofs of concept to demonstrate the first patch is
> sufficient to achieve my goal (full ebpf functionality without modules).
>
> Patch 4 adds a new "-n" argument to vmtest.sh to run the BPF selftests
> without modules, so the prior three patches can be rigorously tested.
>
> If something like the first patch were to eventually be merged, the rest
> could go through the normal bpf-next process as I clean them up: I've
> only based them on Linus' tree and combined them into a series here to
> introduce the idea.
>
> If you prefer to fetch the patches via git:
>
>   [1/4] https://github.com/jcalvinowens/linux.git work/module-alloc
>  +[2/4]+[3/4] https://github.com/jcalvinowens/linux.git work/nomodule-bpf
>  +[4/4] https://github.com/jcalvinowens/linux.git testing/nomodule-bpf-ci
>
> In addition to the automated BPF selftests, I've lightly tested this on
> my laptop (x86_64), a Raspberry Pi 4b (arm64), and a Raspberry Pi Zero W
> (arm). The other architectures have only been compile tested.
>
> I didn't want to spam all the arch maintainers with what I expect will
> be a discussion mostly about modules and bpf, so I've left them off this
> first submission. I will be sure to add them on future submissions of
> the first patch. Of course, feedback on the arch bits is welcome here.
>
> In addition to feedback on the patches themselves, I'm interested in
> hearing from anybody else who might find this functionality useful.
>
> Thanks,
> Calvin
>
>
> Calvin Owens (4):
>   module: mm: Make module_alloc() generally available
>   bpf: Allow BPF_JIT with CONFIG_MODULES=n
>   kprobes: Allow kprobes with CONFIG_MODULES=n
>   selftests/bpf: Support testing the !MODULES case
>
>  arch/Kconfig  |   4 +-
>  arch/arm/kernel/module.c  |  35 -
>  arch/arm/mm/Makefile  |   2 +
>  arch/arm/mm/module_alloc.c|  40 ++
>  arch/arm64/kernel/module.c| 127 -
>  arch/arm64/mm/Makefile|   1 +
>  arch/arm64/mm/module_alloc.c  | 130 ++
>  arch/loongarch/kernel/module.c|   6 -
>  arch/loongarch/mm/Makefile|   2 +
>  arch/loongarch/mm/module_alloc.c  |  10 ++
>  arch/mips/kernel/module.c |  10 --
>  arch/mips/mm/Makefile |   2 +
>  arch/mips/mm/module_alloc.c   |  13 ++
>  arch/nios2/kernel/module.c|  20 ---
>  arch/nios2/mm/Makefile|   2 +
>  arch/nios2/mm/module_alloc.c  |  22 +++
>  arch/parisc/kernel/module.c   |  12 --
>  arch/parisc/mm/Makefile   |   1 +
>  arch/parisc/mm/module_alloc.c |  15 ++
>  arch/powerpc/kernel/module.c  |  36 -
>  arch/powerpc/mm/Makefile  |   1 +
>  arch/powerpc/mm/module_alloc.c|  41 ++
>  arch/riscv/kernel/module.c|  11 --
>  arch/riscv/mm/Makefile|   1 +
>  arch/riscv/mm/module_alloc.c  |  17 +++
>  arch/s390/kernel/module.c |  37 -
>  arch/s390/mm/Makefile |   1 +
>  arch/s390/mm/module_alloc.c   |  42 ++
>  arch/sparc/kernel/module.c   

Re: [PATCH v5 1/2] kprobes: textmem API

2024-03-25 Thread Jarkko Sakkinen
On Tue Mar 26, 2024 at 12:09 AM EET, Jarkko Sakkinen wrote:
> On Mon Mar 25, 2024 at 11:55 PM EET, Jarkko Sakkinen wrote:
> > +#ifdef CONFIG_MODULES
> > if (register_module_notifier(&trace_kprobe_module_nb))
> > return -EINVAL;
> > +#endif /* CONFIG_MODULES */
>
> register_module_notifier() does have "dummy" version but what
> would I pass to it. It makes more mess than it cleans to declare
> also a "dummy" version of trace_kprobe_module_nb.
>
> The callback itself has too tight module subsystem bindings so
> that they could be simply flagged with IS_DEFINED() (or correct
> if I'm mistaken, this the conclusion I've ended up with).

One way to clean that up would be to create trace_kprobe_module.c and
move kernel module specific code over there and then change
kernel/trace/Makefile as follows:

ifeq ($(CONFIG_PERF_EVENTS),y)
obj-y += trace_kprobe.o
obj-$(CONFIG_MODULES) += trace_kprobe_module.o
endif

and define trace_kprobe_module_init() or similar to do all the dance
with notifiers etc.
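
A rough sketch of what that could look like (illustrative only;
trace_kprobe_module_init() is a made-up name here, and trace_kprobe_module_nb
would move to the new file together with its callback):

/* kernel/trace/trace_kprobe_module.c (sketch) */
int __init trace_kprobe_module_init(void)
{
	return register_module_notifier(&trace_kprobe_module_nb);
}

/* kernel/trace/trace_kprobe.c (sketch): keep the call site unconditional */
#ifdef CONFIG_MODULES
int trace_kprobe_module_init(void);
#else
static inline int trace_kprobe_module_init(void) { return 0; }
#endif

That way the existing init path can call trace_kprobe_module_init()
unconditionally and the #ifdefs stay out of the generic code.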

This crossed my mind but I did not want to do it without feedback.

BR, Jarkko



Re: [PATCH v5 1/2] kprobes: textmem API

2024-03-25 Thread Jarkko Sakkinen
On Mon Mar 25, 2024 at 11:55 PM EET, Jarkko Sakkinen wrote:
> +#ifdef CONFIG_MODULES
>   if (register_module_notifier(&trace_kprobe_module_nb))
>   return -EINVAL;
> +#endif /* CONFIG_MODULES */

register_module_notifier() does have "dummy" version but what
would I pass to it. It makes more mess than it cleans to declare
also a "dummy" version of trace_kprobe_module_nb.

The callback itself has too tight module subsystem bindings so
that they could be simply flagged with IS_DEFINED() (or correct me
if I'm mistaken; this is the conclusion I've ended up with).

BR, Jarkko



Re: [PATCH v5 1/2] kprobes: textmem API

2024-03-25 Thread Jarkko Sakkinen
s/textmem/execmem/ (also in long description)

will hold off on sending a new version as it's not a functional issue; will fix
after review.

BR, Jarkko



[PATCH v5 2/2] arch/riscv: Enable kprobes when CONFIG_MODULES=n

2024-03-25 Thread Jarkko Sakkinen
Tracing with kprobes while running a monolithic kernel is currently
impossible due to the kernel module allocator dependency.

Address the issue by implementing textmem API for RISC-V.

Link: https://www.sochub.fi # for power on testing new SoC's with a minimal 
stack
Link: https://lore.kernel.org/all/2022060814.3054333-1-jar...@profian.com/ 
# continuation
Signed-off-by: Jarkko Sakkinen 
---
v5:
- No changes, except removing the alloc_execmem() call which should have
  been part of the previous patch.
v4:
- Include linux/execmem.h.
v3:
- Architecture independent parts have been split to separate patches.
- Do not change arch/riscv/kernel/module.c as it is out of scope for
  this patch set now.
v2:
- Better late than never right? :-)
- Focus only on RISC-V for now to make the patch more digestible. This
  is the arch where I use the patch on a daily basis to help with QA.
- Introduce HAVE_KPROBES_ALLOC flag to help with more gradual migration.
---
 arch/riscv/Kconfig  |  1 +
 arch/riscv/kernel/Makefile  |  3 +++
 arch/riscv/kernel/execmem.c | 22 ++
 3 files changed, 26 insertions(+)
 create mode 100644 arch/riscv/kernel/execmem.c

diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
index e3142ce531a0..499512fb17ff 100644
--- a/arch/riscv/Kconfig
+++ b/arch/riscv/Kconfig
@@ -132,6 +132,7 @@ config RISCV
select HAVE_KPROBES if !XIP_KERNEL
select HAVE_KPROBES_ON_FTRACE if !XIP_KERNEL
select HAVE_KRETPROBES if !XIP_KERNEL
+   select HAVE_ALLOC_EXECMEM if !XIP_KERNEL
# https://github.com/ClangBuiltLinux/linux/issues/1881
select HAVE_LD_DEAD_CODE_DATA_ELIMINATION if !LD_IS_LLD
select HAVE_MOVE_PMD
diff --git a/arch/riscv/kernel/Makefile b/arch/riscv/kernel/Makefile
index 604d6bf7e476..337797f10d3e 100644
--- a/arch/riscv/kernel/Makefile
+++ b/arch/riscv/kernel/Makefile
@@ -73,6 +73,9 @@ obj-$(CONFIG_SMP) += cpu_ops.o
 
 obj-$(CONFIG_RISCV_BOOT_SPINWAIT) += cpu_ops_spinwait.o
 obj-$(CONFIG_MODULES)  += module.o
+ifeq ($(CONFIG_ALLOC_EXECMEM),y)
+obj-y  += execmem.o
+endif
 obj-$(CONFIG_MODULE_SECTIONS)  += module-sections.o
 
 obj-$(CONFIG_CPU_PM)   += suspend_entry.o suspend.o
diff --git a/arch/riscv/kernel/execmem.c b/arch/riscv/kernel/execmem.c
new file mode 100644
index ..3e52522ead32
--- /dev/null
+++ b/arch/riscv/kernel/execmem.c
@@ -0,0 +1,22 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+
+#include 
+#include 
+#include 
+#include 
+
+void *alloc_execmem(unsigned long size, gfp_t /* gfp */)
+{
+   return __vmalloc_node_range(size, 1, MODULES_VADDR,
+   MODULES_END, GFP_KERNEL,
+   PAGE_KERNEL, 0, NUMA_NO_NODE,
+   __builtin_return_address(0));
+}
+
+void free_execmem(void *region)
+{
+   if (in_interrupt())
+   pr_warn("In interrupt context: vmalloc may not work.\n");
+
+   vfree(region);
+}
-- 
2.44.0




[PATCH v5 1/2] kprobes: textmem API

2024-03-25 Thread Jarkko Sakkinen
Tracing with kprobes while running a monolithic kernel is currently
impossible because CONFIG_KPROBES depends on CONFIG_MODULES, as it uses
the kernel module allocator.

Introduce alloc_textmem() and free_textmem() for allocating executable
memory. If an arch implements these functions, it can mark this up with
the HAVE_ALLOC_EXECMEM kconfig flag.

At first this feature will be used for enabling kprobes without
modules support for arch/riscv.

Link: 
https://lore.kernel.org/all/20240325115632.04e37297491cadfbbf382...@kernel.org/
Suggested-by: Masami Hiramatsu 
Signed-off-by: Jarkko Sakkinen 
---
v5:
- alloc_execmem() was missing GFP_KERNEL parameter. The patch set did
  compile because 2/2 had the fixup (leaked there when rebasing the
  patch set).
v4:
- Squashed a couple of unrequired CONFIG_MODULES checks.
- See https://lore.kernel.org/all/d034m18d63ec.2y11d954ys...@kernel.org/
v3:
- A new patch added.
- For IS_DEFINED() I need advice as I could not really find that many
  locations where it would be applicable.
---
 arch/Kconfig| 16 +++-
 include/linux/execmem.h | 13 +
 kernel/kprobes.c| 17 ++---
 kernel/trace/trace_kprobe.c |  8 
 4 files changed, 50 insertions(+), 4 deletions(-)
 create mode 100644 include/linux/execmem.h

diff --git a/arch/Kconfig b/arch/Kconfig
index a5af0edd3eb8..33ba68b7168f 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -52,8 +52,8 @@ config GENERIC_ENTRY
 
 config KPROBES
bool "Kprobes"
-   depends on MODULES
depends on HAVE_KPROBES
+   select ALLOC_EXECMEM
select KALLSYMS
select TASKS_RCU if PREEMPTION
help
@@ -215,6 +215,20 @@ config HAVE_OPTPROBES
 config HAVE_KPROBES_ON_FTRACE
bool
 
+config HAVE_ALLOC_EXECMEM
+   bool
+   help
+ Architectures that select this option are capable of allocating 
executable
+ memory, which can be used by subsystems but is not dependent of any 
of its
+ clients.
+
+config ALLOC_EXECMEM
+   bool "Executable (trampoline) memory allocation"
+   depends on MODULES || HAVE_ALLOC_EXECMEM
+   help
+ Select this for executable (trampoline) memory. Can be enabled when 
either
+ module allocator or arch-specific allocator is available.
+
 config ARCH_CORRECT_STACKTRACE_ON_KRETPROBE
bool
help
diff --git a/include/linux/execmem.h b/include/linux/execmem.h
new file mode 100644
index ..ae2ff151523a
--- /dev/null
+++ b/include/linux/execmem.h
@@ -0,0 +1,13 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_EXECMEM_H
+#define _LINUX_EXECMEM_H
+
+#ifdef CONFIG_HAVE_ALLOC_EXECMEM
+void *alloc_execmem(unsigned long size, gfp_t gfp);
+void free_execmem(void *region);
+#else
+#define alloc_execmem(size, gfp)   module_alloc(size)
+#define free_execmem(region)   module_memfree(region)
+#endif
+
+#endif /* _LINUX_EXECMEM_H */
diff --git a/kernel/kprobes.c b/kernel/kprobes.c
index 9d9095e81792..87fd8c14a938 100644
--- a/kernel/kprobes.c
+++ b/kernel/kprobes.c
@@ -44,6 +44,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #define KPROBE_HASH_BITS 6
 #define KPROBE_TABLE_SIZE (1 << KPROBE_HASH_BITS)
@@ -113,17 +114,17 @@ enum kprobe_slot_state {
 void __weak *alloc_insn_page(void)
 {
/*
-* Use module_alloc() so this page is within +/- 2GB of where the
+* Use alloc_execmem() so this page is within +/- 2GB of where the
 * kernel image and loaded module images reside. This is required
 * for most of the architectures.
 * (e.g. x86-64 needs this to handle the %rip-relative fixups.)
 */
-   return module_alloc(PAGE_SIZE);
+   return alloc_execmem(PAGE_SIZE, GFP_KERNEL);
 }
 
 static void free_insn_page(void *page)
 {
-   module_memfree(page);
+   free_execmem(page);
 }
 
 struct kprobe_insn_cache kprobe_insn_slots = {
@@ -1580,6 +1581,7 @@ static int check_kprobe_address_safe(struct kprobe *p,
goto out;
}
 
+#ifdef CONFIG_MODULES
/* Check if 'p' is probing a module. */
*probed_mod = __module_text_address((unsigned long) p->addr);
if (*probed_mod) {
@@ -1603,6 +1605,8 @@ static int check_kprobe_address_safe(struct kprobe *p,
ret = -ENOENT;
}
}
+#endif
+
 out:
preempt_enable();
jump_label_unlock();
@@ -2482,6 +2486,7 @@ int kprobe_add_area_blacklist(unsigned long start, 
unsigned long end)
return 0;
 }
 
+#ifdef CONFIG_MODULES
 /* Remove all symbols in given area from kprobe blacklist */
 static void kprobe_remove_area_blacklist(unsigned long start, unsigned long 
end)
 {
@@ -2499,6 +2504,7 @@ static void kprobe_remove_ksym_blacklist(unsigned long 
entry)
 {
kprobe_remove_area_blacklist(entry, entry + 1);
 }
+#endif /* CONFIG_MODULES */
 
 int __weak arch_kprobe_get_kallsym(unsigned int *symnum, unsigned long *value,
   

[PATCH v3 1/2] kprobes: textmem API

2024-03-25 Thread Jarkko Sakkinen
Tracing with kprobes while running a monolithic kernel is currently
impossible because CONFIG_KPROBES depends on CONFIG_MODULES, as it uses
the kernel module allocator.

Introduce alloc_textmem() and free_textmem() for allocating executable
memory. If an arch implements these functions, it can mark this up with
the HAVE_ALLOC_EXECMEM kconfig flag.

At first this feature will be used for enabling kprobes without
modules support for arch/riscv.

Link: 
https://lore.kernel.org/all/20240325115632.04e37297491cadfbbf382...@kernel.org/
Suggested-by: Masami Hiramatsu 
Signed-off-by: Jarkko Sakkinen 
---
v3:
- A new patch added.
- For IS_DEFINED() I need advice as I could not really find that many
  locations where it would be applicable.
---
 arch/Kconfig| 16 +++-
 include/linux/execmem.h | 13 +
 kernel/kprobes.c| 17 ++---
 kernel/trace/trace_kprobe.c | 18 --
 4 files changed, 58 insertions(+), 6 deletions(-)
 create mode 100644 include/linux/execmem.h

diff --git a/arch/Kconfig b/arch/Kconfig
index a5af0edd3eb8..33ba68b7168f 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -52,8 +52,8 @@ config GENERIC_ENTRY
 
 config KPROBES
bool "Kprobes"
-   depends on MODULES
depends on HAVE_KPROBES
+   select ALLOC_EXECMEM
select KALLSYMS
select TASKS_RCU if PREEMPTION
help
@@ -215,6 +215,20 @@ config HAVE_OPTPROBES
 config HAVE_KPROBES_ON_FTRACE
bool
 
+config HAVE_ALLOC_EXECMEM
+   bool
+   help
+ Architectures that select this option are capable of allocating 
executable
+ memory, which can be used by subsystems but is not dependent of any 
of its
+ clients.
+
+config ALLOC_EXECMEM
+   bool "Executable (trampoline) memory allocation"
+   depends on MODULES || HAVE_ALLOC_EXECMEM
+   help
+ Select this for executable (trampoline) memory. Can be enabled when 
either
+ module allocator or arch-specific allocator is available.
+
 config ARCH_CORRECT_STACKTRACE_ON_KRETPROBE
bool
help
diff --git a/include/linux/execmem.h b/include/linux/execmem.h
new file mode 100644
index ..ae2ff151523a
--- /dev/null
+++ b/include/linux/execmem.h
@@ -0,0 +1,13 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_EXECMEM_H
+#define _LINUX_EXECMEM_H
+
+#ifdef CONFIG_HAVE_ALLOC_EXECMEM
+void *alloc_execmem(unsigned long size, gfp_t gfp);
+void free_execmem(void *region);
+#else
+#define alloc_execmem(size, gfp)   module_alloc(size)
+#define free_execmem(region)   module_memfree(region)
+#endif
+
+#endif /* _LINUX_EXECMEM_H */
diff --git a/kernel/kprobes.c b/kernel/kprobes.c
index 9d9095e81792..a1a547723c3c 100644
--- a/kernel/kprobes.c
+++ b/kernel/kprobes.c
@@ -44,6 +44,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #define KPROBE_HASH_BITS 6
 #define KPROBE_TABLE_SIZE (1 << KPROBE_HASH_BITS)
@@ -113,17 +114,17 @@ enum kprobe_slot_state {
 void __weak *alloc_insn_page(void)
 {
/*
-* Use module_alloc() so this page is within +/- 2GB of where the
+* Use alloc_execmem() so this page is within +/- 2GB of where the
 * kernel image and loaded module images reside. This is required
 * for most of the architectures.
 * (e.g. x86-64 needs this to handle the %rip-relative fixups.)
 */
-   return module_alloc(PAGE_SIZE);
+   return alloc_execmem(PAGE_SIZE);
 }
 
 static void free_insn_page(void *page)
 {
-   module_memfree(page);
+   free_execmem(page);
 }
 
 struct kprobe_insn_cache kprobe_insn_slots = {
@@ -1580,6 +1581,7 @@ static int check_kprobe_address_safe(struct kprobe *p,
goto out;
}
 
+#ifdef CONFIG_MODULES
/* Check if 'p' is probing a module. */
*probed_mod = __module_text_address((unsigned long) p->addr);
if (*probed_mod) {
@@ -1603,6 +1605,8 @@ static int check_kprobe_address_safe(struct kprobe *p,
ret = -ENOENT;
}
}
+#endif
+
 out:
preempt_enable();
jump_label_unlock();
@@ -2482,6 +2486,7 @@ int kprobe_add_area_blacklist(unsigned long start, 
unsigned long end)
return 0;
 }
 
+#ifdef CONFIG_MODULES
 /* Remove all symbols in given area from kprobe blacklist */
 static void kprobe_remove_area_blacklist(unsigned long start, unsigned long 
end)
 {
@@ -2499,6 +2504,7 @@ static void kprobe_remove_ksym_blacklist(unsigned long 
entry)
 {
kprobe_remove_area_blacklist(entry, entry + 1);
 }
+#endif /* CONFIG_MODULES */
 
 int __weak arch_kprobe_get_kallsym(unsigned int *symnum, unsigned long *value,
   char *type, char *sym)
@@ -2564,6 +2570,7 @@ static int __init populate_kprobe_blacklist(unsigned long 
*start,
return ret ? : arch_populate_kprobe_blacklist();
 }
 
+#ifdef CONFIG_MODULES
 static void add_module_kprobe_blacklist(struct module *mod)
 {
unsigned 

[PATCH v4 2/2] arch/riscv: Enable kprobes when CONFIG_MODULES=n

2024-03-25 Thread Jarkko Sakkinen
Tracing with kprobes while running a monolithic kernel is currently
impossible due to the kernel module allocator dependency.

Address the issue by implementing textmem API for RISC-V.

Link: https://www.sochub.fi # for power on testing new SoC's with a minimal 
stack
Link: https://lore.kernel.org/all/2022060814.3054333-1-jar...@profian.com/ 
# continuation
Signed-off-by: Jarkko Sakkinen 
---
v4:
- Include linux/execmem.h.
v3:
- Architecture independent parts have been split to separate patches.
- Do not change arch/riscv/kernel/module.c as it is out of scope for
  this patch set now.
v2:
- Better late than never right? :-)
- Focus only on RISC-V for now to make the patch more digestible. This
  is the arch where I use the patch on a daily basis to help with QA.
- Introduce HAVE_KPROBES_ALLOC flag to help with more gradual migration.
---
 arch/riscv/Kconfig  |  1 +
 arch/riscv/kernel/Makefile  |  3 +++
 arch/riscv/kernel/execmem.c | 22 ++
 kernel/kprobes.c|  2 +-
 4 files changed, 27 insertions(+), 1 deletion(-)
 create mode 100644 arch/riscv/kernel/execmem.c

diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
index e3142ce531a0..499512fb17ff 100644
--- a/arch/riscv/Kconfig
+++ b/arch/riscv/Kconfig
@@ -132,6 +132,7 @@ config RISCV
select HAVE_KPROBES if !XIP_KERNEL
select HAVE_KPROBES_ON_FTRACE if !XIP_KERNEL
select HAVE_KRETPROBES if !XIP_KERNEL
+   select HAVE_ALLOC_EXECMEM if !XIP_KERNEL
# https://github.com/ClangBuiltLinux/linux/issues/1881
select HAVE_LD_DEAD_CODE_DATA_ELIMINATION if !LD_IS_LLD
select HAVE_MOVE_PMD
diff --git a/arch/riscv/kernel/Makefile b/arch/riscv/kernel/Makefile
index 604d6bf7e476..337797f10d3e 100644
--- a/arch/riscv/kernel/Makefile
+++ b/arch/riscv/kernel/Makefile
@@ -73,6 +73,9 @@ obj-$(CONFIG_SMP) += cpu_ops.o
 
 obj-$(CONFIG_RISCV_BOOT_SPINWAIT) += cpu_ops_spinwait.o
 obj-$(CONFIG_MODULES)  += module.o
+ifeq ($(CONFIG_ALLOC_EXECMEM),y)
+obj-y  += execmem.o
+endif
 obj-$(CONFIG_MODULE_SECTIONS)  += module-sections.o
 
 obj-$(CONFIG_CPU_PM)   += suspend_entry.o suspend.o
diff --git a/arch/riscv/kernel/execmem.c b/arch/riscv/kernel/execmem.c
new file mode 100644
index ..3e52522ead32
--- /dev/null
+++ b/arch/riscv/kernel/execmem.c
@@ -0,0 +1,22 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+
+#include 
+#include 
+#include 
+#include 
+
+void *alloc_execmem(unsigned long size, gfp_t /* gfp */)
+{
+   return __vmalloc_node_range(size, 1, MODULES_VADDR,
+   MODULES_END, GFP_KERNEL,
+   PAGE_KERNEL, 0, NUMA_NO_NODE,
+   __builtin_return_address(0));
+}
+
+void free_execmem(void *region)
+{
+   if (in_interrupt())
+   pr_warn("In interrupt context: vmalloc may not work.\n");
+
+   vfree(region);
+}
diff --git a/kernel/kprobes.c b/kernel/kprobes.c
index a1a547723c3c..87fd8c14a938 100644
--- a/kernel/kprobes.c
+++ b/kernel/kprobes.c
@@ -119,7 +119,7 @@ void __weak *alloc_insn_page(void)
 * for most of the architectures.
 * (e.g. x86-64 needs this to handle the %rip-relative fixups.)
 */
-   return alloc_execmem(PAGE_SIZE);
+   return alloc_execmem(PAGE_SIZE, GFP_KERNEL);
 }
 
 static void free_insn_page(void *page)
-- 
2.44.0




[PATCH v4 1/2] kprobes: textmem API

2024-03-25 Thread Jarkko Sakkinen
Tracing with kprobes while running a monolithic kernel is currently
impossible because CONFIG_KPROBES depends on CONFIG_MODULES, as it uses
the kernel module allocator.

Introduce alloc_textmem() and free_textmem() for allocating executable
memory. If an arch implements these functions, it can mark this up with
the HAVE_ALLOC_EXECMEM kconfig flag.

At first this feature will be used for enabling kprobes without
modules support for arch/riscv.

Link: 
https://lore.kernel.org/all/20240325115632.04e37297491cadfbbf382...@kernel.org/
Suggested-by: Masami Hiramatsu 
Signed-off-by: Jarkko Sakkinen 
---
v4:
- Squashed a couple of unrequired CONFIG_MODULES checks.
- See https://lore.kernel.org/all/d034m18d63ec.2y11d954ys...@kernel.org/
v3:
- A new patch added.
- For IS_DEFINED() I need advice as I could not really find that many
  locations where it would be applicable.
---
 arch/Kconfig| 16 +++-
 include/linux/execmem.h | 13 +
 kernel/kprobes.c| 17 ++---
 kernel/trace/trace_kprobe.c |  8 
 4 files changed, 50 insertions(+), 4 deletions(-)
 create mode 100644 include/linux/execmem.h

diff --git a/arch/Kconfig b/arch/Kconfig
index a5af0edd3eb8..33ba68b7168f 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -52,8 +52,8 @@ config GENERIC_ENTRY
 
 config KPROBES
bool "Kprobes"
-   depends on MODULES
depends on HAVE_KPROBES
+   select ALLOC_EXECMEM
select KALLSYMS
select TASKS_RCU if PREEMPTION
help
@@ -215,6 +215,20 @@ config HAVE_OPTPROBES
 config HAVE_KPROBES_ON_FTRACE
bool
 
+config HAVE_ALLOC_EXECMEM
+   bool
+   help
+ Architectures that select this option are capable of allocating 
executable
+ memory, which can be used by subsystems but is not dependent of any 
of its
+ clients.
+
+config ALLOC_EXECMEM
+   bool "Executable (trampoline) memory allocation"
+   depends on MODULES || HAVE_ALLOC_EXECMEM
+   help
+ Select this for executable (trampoline) memory. Can be enabled when 
either
+ module allocator or arch-specific allocator is available.
+
 config ARCH_CORRECT_STACKTRACE_ON_KRETPROBE
bool
help
diff --git a/include/linux/execmem.h b/include/linux/execmem.h
new file mode 100644
index ..ae2ff151523a
--- /dev/null
+++ b/include/linux/execmem.h
@@ -0,0 +1,13 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_EXECMEM_H
+#define _LINUX_EXECMEM_H
+
+#ifdef CONFIG_HAVE_ALLOC_EXECMEM
+void *alloc_execmem(unsigned long size, gfp_t gfp);
+void free_execmem(void *region);
+#else
+#define alloc_execmem(size, gfp)   module_alloc(size)
+#define free_execmem(region)   module_memfree(region)
+#endif
+
+#endif /* _LINUX_EXECMEM_H */
diff --git a/kernel/kprobes.c b/kernel/kprobes.c
index 9d9095e81792..a1a547723c3c 100644
--- a/kernel/kprobes.c
+++ b/kernel/kprobes.c
@@ -44,6 +44,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #define KPROBE_HASH_BITS 6
 #define KPROBE_TABLE_SIZE (1 << KPROBE_HASH_BITS)
@@ -113,17 +114,17 @@ enum kprobe_slot_state {
 void __weak *alloc_insn_page(void)
 {
/*
-* Use module_alloc() so this page is within +/- 2GB of where the
+* Use alloc_execmem() so this page is within +/- 2GB of where the
 * kernel image and loaded module images reside. This is required
 * for most of the architectures.
 * (e.g. x86-64 needs this to handle the %rip-relative fixups.)
 */
-   return module_alloc(PAGE_SIZE);
+   return alloc_execmem(PAGE_SIZE);
 }
 
 static void free_insn_page(void *page)
 {
-   module_memfree(page);
+   free_execmem(page);
 }
 
 struct kprobe_insn_cache kprobe_insn_slots = {
@@ -1580,6 +1581,7 @@ static int check_kprobe_address_safe(struct kprobe *p,
goto out;
}
 
+#ifdef CONFIG_MODULES
/* Check if 'p' is probing a module. */
*probed_mod = __module_text_address((unsigned long) p->addr);
if (*probed_mod) {
@@ -1603,6 +1605,8 @@ static int check_kprobe_address_safe(struct kprobe *p,
ret = -ENOENT;
}
}
+#endif
+
 out:
preempt_enable();
jump_label_unlock();
@@ -2482,6 +2486,7 @@ int kprobe_add_area_blacklist(unsigned long start, 
unsigned long end)
return 0;
 }
 
+#ifdef CONFIG_MODULES
 /* Remove all symbols in given area from kprobe blacklist */
 static void kprobe_remove_area_blacklist(unsigned long start, unsigned long 
end)
 {
@@ -2499,6 +2504,7 @@ static void kprobe_remove_ksym_blacklist(unsigned long 
entry)
 {
kprobe_remove_area_blacklist(entry, entry + 1);
 }
+#endif /* CONFIG_MODULES */
 
 int __weak arch_kprobe_get_kallsym(unsigned int *symnum, unsigned long *value,
   char *type, char *sym)
@@ -2564,6 +2570,7 @@ static int __init populate_kprobe_blacklist(unsigned long 
*start,
return ret ? : 

Re: [PATCH v3 2/2] arch/riscv: Enable kprobes when CONFIG_MODULES=n

2024-03-25 Thread Jarkko Sakkinen
On Mon Mar 25, 2024 at 10:37 PM EET, Jarkko Sakkinen wrote:
> Tracing with kprobes while running a monolithic kernel is currently
> impossible due to the kernel module allocator dependency.
>
> Address the issue by implementing textmem API for RISC-V.
>
> Link: https://www.sochub.fi # for power on testing new SoC's with a minimal 
> stack
> Link: 
> https://lore.kernel.org/all/2022060814.3054333-1-jar...@profian.com/ # 
> continuation
> Signed-off-by: Jarkko Sakkinen 

I think that for any use case it is best for the overall good to do it
like this, i.e. only create patch sets related to the topic that change
behavior for the arches that are in your heavy use. For me that means x86
and RISC-V.

That is why I shrank this down to focus on a narrower scope.

For microarchitectures more alien to oneself, it is just too easy to make
sloppy mistakes, which could cause unwanted harm. E.g. it is best for
arch/sh that someone involved with that microarchitecture does the
shenanigans later on.

BR, Jarkko



[PATCH v3 2/2] arch/riscv: Enable kprobes when CONFIG_MODULES=n

2024-03-25 Thread Jarkko Sakkinen
Tracing with kprobes while running a monolithic kernel is currently
impossible due to the kernel module allocator dependency.

Address the issue by implementing textmem API for RISC-V.

Link: https://www.sochub.fi # for power on testing new SoC's with a minimal 
stack
Link: https://lore.kernel.org/all/2022060814.3054333-1-jar...@profian.com/ 
# continuation
Signed-off-by: Jarkko Sakkinen 
---
v3:
- Architecture independent parts have been split to separate patches.
- Do not change arch/riscv/kernel/module.c as it is out of scope for
  this patch set now.
v2:
- Better late than never right? :-)
- Focus only on RISC-V for now to make the patch more digestible. This
  is the arch where I use the patch on a daily basis to help with QA.
- Introduce HAVE_KPROBES_ALLOC flag to help with more gradual migration.
---
 arch/riscv/Kconfig  |  1 +
 arch/riscv/kernel/Makefile  |  3 +++
 arch/riscv/kernel/execmem.c | 22 ++
 kernel/kprobes.c|  2 +-
 4 files changed, 27 insertions(+), 1 deletion(-)
 create mode 100644 arch/riscv/kernel/execmem.c

diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
index e3142ce531a0..499512fb17ff 100644
--- a/arch/riscv/Kconfig
+++ b/arch/riscv/Kconfig
@@ -132,6 +132,7 @@ config RISCV
select HAVE_KPROBES if !XIP_KERNEL
select HAVE_KPROBES_ON_FTRACE if !XIP_KERNEL
select HAVE_KRETPROBES if !XIP_KERNEL
+   select HAVE_ALLOC_EXECMEM if !XIP_KERNEL
# https://github.com/ClangBuiltLinux/linux/issues/1881
select HAVE_LD_DEAD_CODE_DATA_ELIMINATION if !LD_IS_LLD
select HAVE_MOVE_PMD
diff --git a/arch/riscv/kernel/Makefile b/arch/riscv/kernel/Makefile
index 604d6bf7e476..337797f10d3e 100644
--- a/arch/riscv/kernel/Makefile
+++ b/arch/riscv/kernel/Makefile
@@ -73,6 +73,9 @@ obj-$(CONFIG_SMP) += cpu_ops.o
 
 obj-$(CONFIG_RISCV_BOOT_SPINWAIT) += cpu_ops_spinwait.o
 obj-$(CONFIG_MODULES)  += module.o
+ifeq ($(CONFIG_ALLOC_EXECMEM),y)
+obj-y  += execmem.o
+endif
 obj-$(CONFIG_MODULE_SECTIONS)  += module-sections.o
 
 obj-$(CONFIG_CPU_PM)   += suspend_entry.o suspend.o
diff --git a/arch/riscv/kernel/execmem.c b/arch/riscv/kernel/execmem.c
new file mode 100644
index ..4191251476d0
--- /dev/null
+++ b/arch/riscv/kernel/execmem.c
@@ -0,0 +1,22 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+
+#include 
+#include 
+#include 
+#include 
+
+void *alloc_execmem(unsigned long size, gfp_t /* gfp */)
+{
+   return __vmalloc_node_range(size, 1, MODULES_VADDR,
+   MODULES_END, GFP_KERNEL,
+   PAGE_KERNEL, 0, NUMA_NO_NODE,
+   __builtin_return_address(0));
+}
+
+void free_execmem(void *region)
+{
+   if (in_interrupt())
+   pr_warn("In interrupt context: vmalloc may not work.\n");
+
+   vfree(region);
+}
diff --git a/kernel/kprobes.c b/kernel/kprobes.c
index a1a547723c3c..87fd8c14a938 100644
--- a/kernel/kprobes.c
+++ b/kernel/kprobes.c
@@ -119,7 +119,7 @@ void __weak *alloc_insn_page(void)
 * for most of the architectures.
 * (e.g. x86-64 needs this to handle the %rip-relative fixups.)
 */
-   return alloc_execmem(PAGE_SIZE);
+   return alloc_execmem(PAGE_SIZE, GFP_KERNEL);
 }
 
 static void free_insn_page(void *page)
-- 
2.44.0




Re: [RFC PATCH v2 0/7] DAMON based 2-tier memory management for CXL memory

2024-03-25 Thread Honggyu Kim
Hi SeongJae,

On Fri, 22 Mar 2024 09:32:23 -0700 SeongJae Park  wrote:
> On Fri, 22 Mar 2024 18:02:23 +0900 Honggyu Kim  wrote:
> 
> > Hi SeongJae,
> > 
> > On Tue, 27 Feb 2024 15:51:20 -0800 SeongJae Park  wrote:
> > > On Mon, 26 Feb 2024 23:05:46 +0900 Honggyu Kim  wrote:
> > > 
> > > > There was an RFC IDEA "DAMOS-based Tiered-Memory Management" previously
> > > > posted at [1].
> > > > 
> > > > It says there is no implementation of the demote/promote DAMOS action
> > > > are made.  This RFC is about its implementation for physical address
> > > > space.
> > > > 
> > > > 
> [...]
> > > Thank you for running the tests again with the new version of the patches 
> > > and
> > > sharing the results!
> > 
> > It's a bit late answer, but the result was from the previous evaluation.
> > I ran it again with RFC v2, but didn't see much difference so just
> > pasted the same result here.
> 
> No problem, thank you for clarifying :)
> 
> [...]
> > > > Honggyu Kim (3):
> > > >   mm/damon: refactor DAMOS_PAGEOUT with migration_mode
> > > >   mm: make alloc_demote_folio externally invokable for migration
> > > >   mm/damon: introduce DAMOS_DEMOTE action for demotion
> > > > 
> > > > Hyeongtak Ji (4):
> > > >   mm/memory-tiers: add next_promotion_node to find promotion target
> > > >   mm/damon: introduce DAMOS_PROMOTE action for promotion
> > > >   mm/damon/sysfs-schemes: add target_nid on sysfs-schemes
> > > >   mm/damon/sysfs-schemes: apply target_nid for promote and demote
> > > > actions
> > > 
> > > Honggyu joined DAMON Beer/Coffee/Tea Chat[1] yesterday, and we discussed 
> > > about
> > > this patchset in high level.  Sharing the summary here for open 
> > > discussion.  As
> > > also discussed on the first version of this patchset[2], we want to make 
> > > single
> > > action for general page migration with minimum changes, but would like to 
> > > keep
> > > page level access re-check.  We also agreed the previously proposed DAMOS
> > > filter-based approach could make sense for the purpose.
> > 
> > Thanks very much for the summary.  I have been trying to merge promote
> > and demote actions into a single migrate action, but I found an issue
> > regarding damon_pa_scheme_score.  It currently calls damon_cold_score()
> > for demote action and damon_hot_score() for promote action, but what
> > should we call when we use a single migrate action?
> 
> Good point!  This is what I didn't think about when suggesting that.  Thank 
> you
> for letting me know this gap!  I think there could be two approach, off the 
> top
> of my head.
> 
> The first one would be extending the interface so that the user can select the
> score function.  This would let flexible usage, but I'm bit concerned if this
> could make things unnecessarily complex, and would really useful in many
> general use case.

I also think this looks complicated and may not be useful for general
users.

> The second approach would be letting DAMON infer the intention.  In this case,
> I think we could know the intention is the demotion if the scheme has a youg
> pages exclusion filter.  Then, we could use the cold_score().  And vice versa.
> To cover a case that there is no filter at all, I think we could have one
> assumption.  My humble intuition says the new action (migrate) may be used 
> more
> for promotion use case.  So, in damon_pa_scheme_score(), if the action of the
> given scheme is the new one (say, MIGRATE), the function will further check if
> the scheme has a filter for excluding young pages.  If so, the function will
> use cold_score().  Otherwise, the function will use hot_score().

Thanks for suggesting many ideas, but I'm afraid this doesn't look good
to me.  Thinking about it again, I think we can keep using DAMOS_PROMOTE
and DAMOS_DEMOTE, but I can make them directly call damon_folio_young()
for the access check instead of using the young filter.

And we can internally handle the complicated combination, such as the
demote action setting the "young" filter with "matching" true and the
promote action setting the "young" filter with "matching" false.  IMHO,
this will make the usage simpler.
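
For illustration only, a rough sketch of that internal handling (not actual
DAMON code; DAMOS_FILTER_TYPE_YOUNG is a made-up name for the proposed
young-page filter type, while damos_new_filter()/damos_add_filter() are the
existing filter helpers and DAMOS_DEMOTE/DAMOS_PROMOTE are the actions from
this RFC):

/*
 * Attach the young filter when the scheme is installed, so users never
 * have to configure it themselves: matching == true filters out young
 * pages (demote case), matching == false filters out non-young pages
 * (promote case), as described above.
 */
static int damos_add_internal_young_filter(struct damos *s)
{
	struct damos_filter *filter;
	bool matching = (s->action == DAMOS_DEMOTE);

	filter = damos_new_filter(DAMOS_FILTER_TYPE_YOUNG, matching);
	if (!filter)
		return -ENOMEM;
	damos_add_filter(s, filter);
	return 0;
}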

I would like to hear how you think about this.

> So I'd more prefer the second approach.  I think it would be not too late to
> consider the first approach after waiting for it turns out more actions have
> such ambiguity and need more general interface for explicitly set the score
> function.

I will join the DAMON Beer/Coffee/Tea Chat tomorrow as scheduled so I
can talk more about this issue.

Thanks,
Honggyu

> 
> Thanks,
> SJ
> 
> [...]



Re: [PATCH v3 0/2] Update mce_record tracepoint

2024-03-25 Thread Borislav Petkov
On Mon, Mar 25, 2024 at 03:12:14PM -0500, Naik, Avadhut wrote:
> Can this patchset be merged in? Or would you prefer me sending out
> another revision with Steven's "Reviewed-by:" tag?

First of all, please do not top-post.

Then, you were on Cc on the previous thread. Please summarize from it
and put in the commit message *why* it is good to have each field added.

And then, above the tracepoint, I'd like you to add a rule which
states what information can and should be added to the tracepoint. And
no, "just because" is not good enough. The previous thread has hints.

Thx.

-- 
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette



[PATCH v3 0/2] Update mce_record tracepoint

2024-03-25 Thread Naik, Avadhut
Hi Boris,

Can this patchset be merged in? Or would you prefer me sending out another
revision with Steven's "Reviewed-by:" tag?

On 2/8/2024 11:10, Steven Rostedt wrote:
> On Fri, 26 Jan 2024 01:57:58 -0600
> Avadhut Naik  wrote:
> 
>> This patchset updates the mce_record tracepoint so that the recently added
>> fields of struct mce are exported through it to userspace.
>>
>> The first patch adds PPIN (Protected Processor Inventory Number) field to
>> the tracepoint.
>>
>> The second patch adds the microcode field (Microcode Revision) to the
>> tracepoint.
> 
> From a tracing POV only:
> 
> Reviewed-by: Steven Rostedt (Google) 
> 
> -- Steve

-- 
Thanks,
Avadhut Naik



Re: [PATCH 1/3] remoteproc: Add Arm remoteproc driver

2024-03-25 Thread Abdellatif El Khlifi
Hi Mathieu,

> > > > > > > > This is an initial patchset for allowing to turn on and off the 
> > > > > > > > remote processor.
> > > > > > > > The FW is already loaded before the Corstone-1000 SoC is 
> > > > > > > > powered on and this
> > > > > > > > is done through the FPGA board bootloader in case of the FPGA 
> > > > > > > > target. Or by the Corstone-1000 FVP model
> > > > > > > > (emulator).
> > > > > > > >
> > > > > > > >From the above I take it that booting with a preloaded firmware 
> > > > > > > >is a
> > > > > > > scenario that needs to be supported and not just a temporary 
> > > > > > > stage.
> > > > > >
> > > > > > The current status of the Corstone-1000 SoC requires that there is
> > > > > > a preloaded firmware for the external core. Preloading is done 
> > > > > > externally
> > > > > > either through the FPGA bootloader or the emulator (FVP) before 
> > > > > > powering
> > > > > > on the SoC.
> > > > > >
> > > > >
> > > > > Ok
> > > > >
> > > > > > Corstone-1000 will be upgraded in a way that the A core running 
> > > > > > Linux is able
> > > > > > to share memory with the remote core and also being able to access 
> > > > > > the remote
> > > > > > core memory so Linux can copy the firmware to. This HW changes are 
> > > > > > still
> > > > > > This is why this patchset is relying on a preloaded firmware. And 
> > > > > > it's the step 1
> > > > > > of adding remoteproc support for Corstone.
> > > > > >
> > > > >
> > > > > Ok, so there is a HW problem where A core and M core can't see each 
> > > > > other's
> > > > > memory, preventing the A core from copying the firmware image to the 
> > > > > proper
> > > > > location.
> > > > >
> > > > > When the HW is fixed, will there be a need to support scenarios where 
> > > > > the
> > > > > firmware image has been preloaded into memory?
> > > >
> > > > No, this scenario won't apply when we get the HW upgrade. No need for an
> > > > external entity anymore. The firmware(s) will all be files in the linux 
> > > > filesystem.
> > > >
> > >
> > > Very well.  I am willing to continue with this driver but it does so 
> > > little that
> > > I wonder if it wouldn't simply be better to move forward with upstreaming 
> > > when
> > > the HW is fixed.  The choice is yours.
> > >
> >
> > I think Robin has raised few points that need clarification. I think it was
> > done as part of DT binding patch. I share those concerns and I wanted to
> > reaching to the same concerns by starting the questions I asked on corstone
> > device tree changes.
> >
> 
> I also agree with Robin's point of view.  Proceeding with an initial
> driver with minimal functionality doesn't preclude having complete
> bindings.  But that said and as I pointed out, it might be better to
> wait for the HW to be fixed before moving forward.

We checked with the HW teams. The missing features will be implemented but
this will take time.

The foundation driver as it is right now is still valuable for people wanting to
know how to power control Corstone external systems in a future-proof manner
(even in its incomplete state). We prefer to address all the review comments
made so it can be merged. This includes making the DT binding as complete as
possible, as you advised. Then, once the HW is ready, I'll implement the comms
and the FW reload part. Is that OK, please?

Cheers,
Abdellatif



[PATCH 05/13] mailbox: omap: Remove unneeded header omap-mailbox.h

2024-03-25 Thread Andrew Davis
The type of message sent using omap-mailbox is always u32. The definition
of mbox_msg_t is uintptr_t which is wrong as that type changes based on
the architecture (32bit vs 64bit). This type should have been defined as
u32. Instead of making that change here, simply remove the header usage
and fix the last couple users of the same in this driver.

Signed-off-by: Andrew Davis 
---
 drivers/mailbox/omap-mailbox.c | 7 ++-
 1 file changed, 2 insertions(+), 5 deletions(-)

diff --git a/drivers/mailbox/omap-mailbox.c b/drivers/mailbox/omap-mailbox.c
index 167348fb1b33b..4c673cb732ed1 100644
--- a/drivers/mailbox/omap-mailbox.c
+++ b/drivers/mailbox/omap-mailbox.c
@@ -19,7 +19,6 @@
 #include 
 #include 
 #include 
-#include 
 #include 
 #include 
 
@@ -239,16 +238,14 @@ static void mbox_rx_work(struct work_struct *work)
 {
struct omap_mbox_queue *mq =
container_of(work, struct omap_mbox_queue, work);
-   mbox_msg_t data;
u32 msg;
int len;
 
while (kfifo_len(&mq->fifo) >= sizeof(msg)) {
len = kfifo_out(&mq->fifo, (unsigned char *)&msg, sizeof(msg));
WARN_ON(len != sizeof(msg));
-   data = msg;
 
-   mbox_chan_received_data(mq->mbox->chan, (void *)data);
+   mbox_chan_received_data(mq->mbox->chan, (void *)(uintptr_t)msg);
spin_lock_irq(&mq->lock);
if (mq->full) {
mq->full = false;
@@ -515,7 +512,7 @@ static int omap_mbox_chan_send_data(struct mbox_chan *chan, 
void *data)
 {
struct omap_mbox *mbox = mbox_chan_to_omap_mbox(chan);
int ret;
-   u32 msg = omap_mbox_message(data);
+   u32 msg = (u32)(uintptr_t)(data);
 
if (!mbox)
return -EINVAL;
-- 
2.39.2




Re: [PATCH v2] workqueue: add function in event of workqueue_activate_work

2024-03-25 Thread Tejun Heo
On Fri, Mar 08, 2024 at 10:18:18AM +0800, Kassey Li wrote:
> The trace event "workqueue_activate_work" only print work struct.
> However, function is the region of interest in a full sequence of work.
> Current workqueue_activate_work trace event output:
> 
> workqueue_activate_work: work struct ff88b4a0f450
> 
> With this change, workqueue_activate_work will print the function name,
> align with workqueue_queue_work/execute_start/execute_end event.
> 
> workqueue_activate_work: work struct ff80413a78b8 
> function=vmstat_update
> 
> Signed-off-by: Kassey Li 

Applied to wq/for-6.10.

Thanks.

-- 
tejun



Re: [PATCH 2/2] ARM: dts: qcom: Add support for Motorola Moto G (2013)

2024-03-25 Thread Konrad Dybcio
On 24.03.2024 3:04 PM, Stanislav Jakubek wrote:
> Add a device tree for the Motorola Moto G (2013) smartphone based
> on the Qualcomm MSM8226 SoC.
> 
> Initially supported features:
>   - Buttons (Volume Down/Up, Power)
>   - eMMC
>   - Hall Effect Sensor
>   - SimpleFB display
>   - TMP108 temperature sensor
>   - Vibrator
> 
> Signed-off-by: Stanislav Jakubek 
> ---

[...]

> + hob-ram@f50 {
> + reg = <0x0f50 0x4>,
> +   <0x0f54 0x2000>;
> + no-map;
> + };

Any reason it's in two parts? Should it be one contiguous region, or
two separate nodes?

lgtm otherwise



Re: [PATCH v2] arch/riscv: Enable kprobes when CONFIG_MODULES=n

2024-03-25 Thread Jarkko Sakkinen
On Mon Mar 25, 2024 at 9:11 PM EET, Jarkko Sakkinen wrote:
> On Mon Mar 25, 2024 at 8:37 PM EET, Jarkko Sakkinen wrote:
> > > You also should consider using IS_ENABLED(CONFIG_MODULE) in the code to
> > > avoid using #ifdefs.
>
> Hmm... I need make a couple of remarks but open for feedback ofc.
>
> First, trace_kprobe_module_exist depends on find_module()
>
> Second, there is a notifier callback that heavily binds to the module
> subsystem.
>
> In both cases using IS_ENABLED would emit a lot of compilation errors.

Also, I think adding 'gfp' makes sense exactly at the point where it has
a use case, i.e. two call sites with differing flags. It makes sense
but should IMHO be added exactly at that time.

Leaving it out of my patch set does not do any measurable harm, but please
correct me if I'm missing something.

BR, Jarkko



Re: [PATCH] uprobes: reduce contention on uprobes_tree access

2024-03-25 Thread Jonthan Haslam
Hi Ingo,

> > This change has been tested against production workloads that exhibit
> > significant contention on the spinlock and an almost order of magnitude
> > reduction for mean uprobe execution time is observed (28 -> 3.5 microsecs).
> 
> Have you considered/measured per-CPU RW semaphores?

No I hadn't but thanks hugely for suggesting it! In initial measurements
it seems to be between 20-100% faster than the RW spinlocks! Apologies for
all the exclamation marks but I'm very excited. I'll do some more testing
tomorrow but so far it's looking very good.

Thanks again for the input.

Jon.



Re: [PATCH v2] arch/riscv: Enable kprobes when CONFIG_MODULES=n

2024-03-25 Thread Jarkko Sakkinen
On Mon Mar 25, 2024 at 8:37 PM EET, Jarkko Sakkinen wrote:
> > You also should consider using IS_ENABLED(CONFIG_MODULE) in the code to
> > avoid using #ifdefs.

Hmm... I need to make a couple of remarks but I'm open for feedback ofc.

First, trace_kprobe_module_exist depends on find_module()

Second, there is a notifier callback that heavily binds to the module
subsystem.

In both cases using IS_ENABLED would emit a lot of compilation errors.
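
To make that concrete, a minimal sketch of the find_module() case (the helper
name is made up, and this is not from the patch): even though the IS_ENABLED()
branch is discarded at compile time, find_module() has no declaration at all
when CONFIG_MODULES=n, as far as I can tell, so the file still fails to build:

static bool symbol_in_module(const char *name)
{
	bool ret = false;

	if (IS_ENABLED(CONFIG_MODULES)) {
		rcu_read_lock_sched();
		/* implicit-declaration error with CONFIG_MODULES=n */
		ret = !!find_module(name);
		rcu_read_unlock_sched();
	}

	return ret;
}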

BR, Jarkko



Re: [PATCH] uprobes: reduce contention on uprobes_tree access

2024-03-25 Thread Jonthan Haslam
Hi Masami,

> > This change has been tested against production workloads that exhibit
> > significant contention on the spinlock and an almost order of magnitude
> > reduction for mean uprobe execution time is observed (28 -> 3.5 microsecs).
> 
> Looks good to me.
> 
> Acked-by: Masami Hiramatsu (Google) 
> 
> BTW, how did you measure the overhead? I think spinlock overhead
> will depend on how much lock contention happens.

Absolutely. I have the original production workload to test this with and
a derived one that mimics this test case. The production case has ~24
threads running on a 192 core system which access 14 USDTs around 1.5
million times per second in total (across all USDTs). My test case is
similar but can drive a higher rate of USDT access across more threads and
therefore generate higher contention.

All measurements are done using bpftrace scripts around relevant parts of
code in uprobes.c and application code.

Jon.

> 
> Thank you,
> 
> > 
> > [0] https://docs.kernel.org/locking/spinlocks.html
> > 
> > Signed-off-by: Jonathan Haslam 
> > ---
> >  kernel/events/uprobes.c | 22 +++---
> >  1 file changed, 11 insertions(+), 11 deletions(-)
> > 
> > diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
> > index 929e98c62965..42bf9b6e8bc0 100644
> > --- a/kernel/events/uprobes.c
> > +++ b/kernel/events/uprobes.c
> > @@ -39,7 +39,7 @@ static struct rb_root uprobes_tree = RB_ROOT;
> >   */
> >  #define no_uprobe_events() RB_EMPTY_ROOT(&uprobes_tree)
> >  
> > -static DEFINE_SPINLOCK(uprobes_treelock);  /* serialize rbtree access */
> > +static DEFINE_RWLOCK(uprobes_treelock);/* serialize rbtree access */
> >  
> >  #define UPROBES_HASH_SZ13
> >  /* serialize uprobe->pending_list */
> > @@ -669,9 +669,9 @@ static struct uprobe *find_uprobe(struct inode *inode, 
> > loff_t offset)
> >  {
> > struct uprobe *uprobe;
> >  
> > -   spin_lock(&uprobes_treelock);
> > +   read_lock(&uprobes_treelock);
> > uprobe = __find_uprobe(inode, offset);
> > -   spin_unlock(&uprobes_treelock);
> > +   read_unlock(&uprobes_treelock);
> >  
> > return uprobe;
> >  }
> > @@ -701,9 +701,9 @@ static struct uprobe *insert_uprobe(struct uprobe 
> > *uprobe)
> >  {
> > struct uprobe *u;
> >  
> > -   spin_lock(&uprobes_treelock);
> > +   write_lock(&uprobes_treelock);
> > u = __insert_uprobe(uprobe);
> > -   spin_unlock(&uprobes_treelock);
> > +   write_unlock(&uprobes_treelock);
> >  
> > return u;
> >  }
> > @@ -935,9 +935,9 @@ static void delete_uprobe(struct uprobe *uprobe)
> > if (WARN_ON(!uprobe_is_active(uprobe)))
> > return;
> >  
> > -   spin_lock(&uprobes_treelock);
> > +   write_lock(&uprobes_treelock);
> > rb_erase(&uprobe->rb_node, &uprobes_tree);
> > -   spin_unlock(&uprobes_treelock);
> > +   write_unlock(&uprobes_treelock);
> > RB_CLEAR_NODE(&uprobe->rb_node); /* for uprobe_is_active() */
> > put_uprobe(uprobe);
> >  }
> > @@ -1298,7 +1298,7 @@ static void build_probe_list(struct inode *inode,
> > min = vaddr_to_offset(vma, start);
> > max = min + (end - start) - 1;
> >  
> > -   spin_lock(&uprobes_treelock);
> > +   read_lock(&uprobes_treelock);
> > n = find_node_in_range(inode, min, max);
> > if (n) {
> > for (t = n; t; t = rb_prev(t)) {
> > @@ -1316,7 +1316,7 @@ static void build_probe_list(struct inode *inode,
> > get_uprobe(u);
> > }
> > }
> > -   spin_unlock(&uprobes_treelock);
> > +   read_unlock(&uprobes_treelock);
> >  }
> >  
> >  /* @vma contains reference counter, not the probed instruction. */
> > @@ -1407,9 +1407,9 @@ vma_has_uprobes(struct vm_area_struct *vma, unsigned 
> > long start, unsigned long e
> > min = vaddr_to_offset(vma, start);
> > max = min + (end - start) - 1;
> >  
> > -   spin_lock(&uprobes_treelock);
> > +   read_lock(&uprobes_treelock);
> > n = find_node_in_range(inode, min, max);
> > -   spin_unlock(&uprobes_treelock);
> > +   read_unlock(&uprobes_treelock);
> >  
> > return !!n;
> >  }
> > -- 
> > 2.43.0
> > 
> 
> 
> -- 
> Masami Hiramatsu (Google) 



Re: [PATCH] vsock/virtio: fix packet delivery to tap device

2024-03-25 Thread Stefano Garzarella

On Mon, Mar 25, 2024 at 06:12:38PM +0100, Marco Pinna wrote:

Commit 82dfb540aeb2 ("VSOCK: Add virtio vsock vsockmon hooks") added
virtio_transport_deliver_tap_pkt() for handing packets to the
vsockmon device. However, in virtio_transport_send_pkt_work(),
the function is called before actually sending the packet (i.e.
before placing it in the virtqueue with virtqueue_add_sgs() and checking
whether it returned successfully).


From here..


This may cause timing issues since
the sending of the packet may fail, causing it to be re-queued
(possibly multiple times), while the tap device would show the
packet being sent correctly.


to here...

This is a bit unclear; I would rephrase it with something like this:

Queuing the packet in the virtqueue can fail even multiple times.
However, in virtio_transport_deliver_tap_pkt() we deliver the packet
to the monitoring tap interface only the first time we call it.
This certainly avoids seeing the same packet replicated multiple
times in the monitoring interface, but it can show the packet
sent with the wrong timestamp or even before we succeed in queuing
it in the virtqueue.



Move virtio_transport_deliver_tap_pkt() after calling virtqueue_add_sgs()
and making sure it returned successfully.

Fixes: 82dfb540aeb2 ("VSOCK: Add virtio vsock vsockmon hooks")
Signed-off-by: Marco Pinna 
---
net/vmw_vsock/virtio_transport.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/net/vmw_vsock/virtio_transport.c b/net/vmw_vsock/virtio_transport.c
index 1748268e0694..ee5d306a96d0 100644
--- a/net/vmw_vsock/virtio_transport.c
+++ b/net/vmw_vsock/virtio_transport.c
@@ -120,7 +120,6 @@ virtio_transport_send_pkt_work(struct work_struct *work)
if (!skb)
break;

-   virtio_transport_deliver_tap_pkt(skb);
reply = virtio_vsock_skb_reply(skb);
sgs = vsock->out_sgs;
sg_init_one(sgs[out_sg], virtio_vsock_hdr(skb),
@@ -170,6 +169,8 @@ virtio_transport_send_pkt_work(struct work_struct *work)
break;
}

+   virtio_transport_deliver_tap_pkt(skb);
+


I was just worried that consume_skb(), called in
virtio_transport_tx_work() when the host sends an interrupt to the guest
after it has consumed the packet, might be called before this point,
but both run with `vsock->tx_lock` held, so we are protected from
this case.

So, the patch LGTM, I would just clarify the commit message.

Thanks,
Stefano


if (reply) {
struct virtqueue *rx_vq = vsock->vqs[VSOCK_VQ_RX];
int val;
--
2.44.0






Re: (subset) [PATCH 0/5] Add TCPM support for PM7250B and Fairphone 4

2024-03-25 Thread Mark Brown
On Fri, 22 Mar 2024 09:01:31 +0100, Luca Weiss wrote:
> This series adds support for Type-C Port Management on the Fairphone 4
> which enables USB role switching and orientation switching.
> 
> This enables a user for example to plug in a USB stick or a USB keyboard
> to the Type-C port.
> 
> 
> [...]

Applied to

   https://git.kernel.org/pub/scm/linux/kernel/git/broonie/regulator.git 
for-next

Thanks!

[1/5] dt-bindings: regulator: qcom,usb-vbus-regulator: Add PM7250B compatible
  commit: 0c5f77f4eaef8ed9fe752d21f40ac471dd511cfc

All being well this means that it will be integrated into the linux-next
tree (usually sometime in the next 24 hours) and sent to Linus during
the next merge window (or sooner if it is a bug fix), however if
problems are discovered then the patch may be dropped or reverted.

You may get further e-mails resulting from automated or manual testing
and review of the tree, please engage with people reporting problems and
send followup patches addressing any issues that are reported if needed.

If any updates are required or you are submitting further changes they
should be sent as incremental updates against current git, existing
patches will not be replaced.

Please add any relevant lists and maintainers to the CCs when replying
to this mail.

Thanks,
Mark




[PATCH 03/13] mailbox: omap: Move omap_mbox_irq_t into driver

2024-03-25 Thread Andrew Davis
This is only used internally to the driver, so move it out of the
public header and into the driver file. While we are here, note that
it is not used as a bitwise type, so drop that and make it a
simple enum.

Signed-off-by: Andrew Davis 
---
 drivers/mailbox/omap-mailbox.c | 5 +
 include/linux/omap-mailbox.h   | 4 
 2 files changed, 5 insertions(+), 4 deletions(-)

diff --git a/drivers/mailbox/omap-mailbox.c b/drivers/mailbox/omap-mailbox.c
index 8151722eef383..c083734b6954c 100644
--- a/drivers/mailbox/omap-mailbox.c
+++ b/drivers/mailbox/omap-mailbox.c
@@ -51,6 +51,11 @@
 #define MBOX_INTR_CFG_TYPE10
 #define MBOX_INTR_CFG_TYPE21
 
+typedef enum {
+   IRQ_TX = 1,
+   IRQ_RX = 2,
+} omap_mbox_irq_t;
+
 struct omap_mbox_fifo {
unsigned long msg;
unsigned long fifo_stat;
diff --git a/include/linux/omap-mailbox.h b/include/linux/omap-mailbox.h
index f8ddf8e814167..3cc5c4ed7f5a6 100644
--- a/include/linux/omap-mailbox.h
+++ b/include/linux/omap-mailbox.h
@@ -10,8 +10,4 @@ typedef uintptr_t mbox_msg_t;
 
 #define omap_mbox_message(data) (u32)(mbox_msg_t)(data)
 
-typedef int __bitwise omap_mbox_irq_t;
-#define IRQ_TX ((__force omap_mbox_irq_t) 1)
-#define IRQ_RX ((__force omap_mbox_irq_t) 2)
-
 #endif /* OMAP_MAILBOX_H */
-- 
2.39.2




Re: raw_tp+cookie is buggy. Was: [syzbot] [bpf?] [trace?] KASAN: slab-use-after-free Read in bpf_trace_run1

2024-03-25 Thread Andrii Nakryiko
On Sun, Mar 24, 2024 at 5:07 PM Alexei Starovoitov
 wrote:
>
> Hi Andrii,
>
> syzbot found UAF in raw_tp cookie series in bpf-next.
> Reverting the whole merge
> 2e244a72cd48 ("Merge branch 'bpf-raw-tracepoint-support-for-bpf-cookie'")
>
> fixes the issue.
>
> Pls take a look.
> See C reproducer below. It splats consistently with CONFIG_KASAN=y
>
> Thanks.

Will do, traveling today, so will be offline for a bit, but will check
first thing afterwards.

>
> On Sun, Mar 24, 2024 at 4:28 PM syzbot
>  wrote:
> >
> > Hello,
> >
> > syzbot found the following issue on:
> >
> > HEAD commit:520fad2e3206 selftests/bpf: scale benchmark counting by us..
> > git tree:   bpf-next
> > console+strace: https://syzkaller.appspot.com/x/log.txt?x=105af94618
> > kernel config:  https://syzkaller.appspot.com/x/.config?x=6fb1be60a193d440
> > dashboard link: https://syzkaller.appspot.com/bug?extid=981935d9485a560bfbcb
> > compiler:   Debian clang version 15.0.6, GNU ld (GNU Binutils for 
> > Debian) 2.40
> > syz repro:  https://syzkaller.appspot.com/x/repro.syz?x=114f17a518
> > C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=162bb7a518
> >
> > Downloadable assets:
> > disk image: 
> > https://storage.googleapis.com/syzbot-assets/4eef3506c5ce/disk-520fad2e.raw.xz
> > vmlinux: 
> > https://storage.googleapis.com/syzbot-assets/24d60ebe76cc/vmlinux-520fad2e.xz
> > kernel image: 
> > https://storage.googleapis.com/syzbot-assets/8f883e706550/bzImage-520fad2e.xz
> >
> > IMPORTANT: if you fix the issue, please add the following tag to the commit:
> > Reported-by: syzbot+981935d9485a560bf...@syzkaller.appspotmail.com
> >
> > ==
> > BUG: KASAN: slab-use-after-free in __bpf_trace_run 
> > kernel/trace/bpf_trace.c:2376 [inline]
> > BUG: KASAN: slab-use-after-free in bpf_trace_run1+0xcb/0x510 
> > kernel/trace/bpf_trace.c:2430
> > Read of size 8 at addr 8880290d9918 by task migration/0/19
> >
> > CPU: 0 PID: 19 Comm: migration/0 Not tainted 
> > 6.8.0-syzkaller-05233-g520fad2e3206 #0
> > Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS 
> > Google 02/29/2024
> > Stopper: 0x0 <- 0x0
> > Call Trace:
> >  
> >  __dump_stack lib/dump_stack.c:88 [inline]
> >  dump_stack_lvl+0x1e7/0x2e0 lib/dump_stack.c:106
> >  print_address_description mm/kasan/report.c:377 [inline]
> >  print_report+0x169/0x550 mm/kasan/report.c:488
> >  kasan_report+0x143/0x180 mm/kasan/report.c:601
> >  __bpf_trace_run kernel/trace/bpf_trace.c:2376 [inline]
> >  bpf_trace_run1+0xcb/0x510 kernel/trace/bpf_trace.c:2430
> >  __traceiter_rcu_utilization+0x74/0xb0 include/trace/events/rcu.h:27
> >  trace_rcu_utilization+0x194/0x1c0 include/trace/events/rcu.h:27
> >  rcu_note_context_switch+0xc7c/0xff0 kernel/rcu/tree_plugin.h:360
> >  __schedule+0x345/0x4a20 kernel/sched/core.c:6635
> >  __schedule_loop kernel/sched/core.c:6813 [inline]
> >  schedule+0x14b/0x320 kernel/sched/core.c:6828
> >  smpboot_thread_fn+0x61e/0xa30 kernel/smpboot.c:160
> >  kthread+0x2f0/0x390 kernel/kthread.c:388
> >  ret_from_fork+0x4b/0x80 arch/x86/kernel/process.c:147
> >  ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:243
> >  
> >
> > Allocated by task 5075:
> >  kasan_save_stack mm/kasan/common.c:47 [inline]
> >  kasan_save_track+0x3f/0x80 mm/kasan/common.c:68
> >  poison_kmalloc_redzone mm/kasan/common.c:370 [inline]
> >  __kasan_kmalloc+0x98/0xb0 mm/kasan/common.c:387
> >  kasan_kmalloc include/linux/kasan.h:211 [inline]
> >  kmalloc_trace+0x1d9/0x360 mm/slub.c:4012
> >  kmalloc include/linux/slab.h:590 [inline]
> >  kzalloc include/linux/slab.h:711 [inline]
> >  bpf_raw_tp_link_attach+0x2a0/0x6e0 kernel/bpf/syscall.c:3816
> >  bpf_raw_tracepoint_open+0x1c2/0x240 kernel/bpf/syscall.c:3863
> >  __sys_bpf+0x3c0/0x810 kernel/bpf/syscall.c:5673
> >  __do_sys_bpf kernel/bpf/syscall.c:5738 [inline]
> >  __se_sys_bpf kernel/bpf/syscall.c:5736 [inline]
> >  __x64_sys_bpf+0x7c/0x90 kernel/bpf/syscall.c:5736
> >  do_syscall_64+0xfb/0x240
> >  entry_SYSCALL_64_after_hwframe+0x6d/0x75
> >
> > Freed by task 5075:
> >  kasan_save_stack mm/kasan/common.c:47 [inline]
> >  kasan_save_track+0x3f/0x80 mm/kasan/common.c:68
> >  kasan_save_free_info+0x40/0x50 mm/kasan/generic.c:589
> >  poison_slab_object+0xa6/0xe0 mm/kasan/common.c:240
> >  __kasan_slab_free+0x37/0x60 mm/kasan/common.c:256
> >  kasan_slab_free include/linux/kasan.h:184 [inline]
> >  slab_free_hook mm/slub.c:2121 [inline]
> >  slab_free mm/slub.c:4299 [inline]
> >  kfree+0x14a/0x380 mm/slub.c:4409
> >  bpf_link_release+0x3b/0x50 kernel/bpf/syscall.c:3071
> >  __fput+0x429/0x8a0 fs/file_table.c:423
> >  task_work_run+0x24f/0x310 kernel/task_work.c:180
> >  exit_task_work include/linux/task_work.h:38 [inline]
> >  do_exit+0xa1b/0x27e0 kernel/exit.c:878
> >  do_group_exit+0x207/0x2c0 kernel/exit.c:1027
> >  __do_sys_exit_group kernel/exit.c:1038 [inline]
> >  __se_sys_exit_group kernel/exit.c:1036 [inline]
> >  

[PATCH 09/13] mailbox: omap: Use function local struct mbox_controller

2024-03-25 Thread Andrew Davis
The mbox_controller struct is only needed in the probe function. Make
it a local variable instead of storing a copy in omap_mbox_device
to simplify that struct.

Signed-off-by: Andrew Davis 
---
 drivers/mailbox/omap-mailbox.c | 21 -
 1 file changed, 12 insertions(+), 9 deletions(-)

diff --git a/drivers/mailbox/omap-mailbox.c b/drivers/mailbox/omap-mailbox.c
index 17c9b9df78b1d..97f59d9f9f319 100644
--- a/drivers/mailbox/omap-mailbox.c
+++ b/drivers/mailbox/omap-mailbox.c
@@ -86,7 +86,6 @@ struct omap_mbox_device {
u32 num_fifos;
u32 intr_type;
struct omap_mbox **mboxes;
-   struct mbox_controller controller;
 };
 
 struct omap_mbox {
@@ -541,7 +540,7 @@ static struct mbox_chan *omap_mbox_of_xlate(struct 
mbox_controller *controller,
struct omap_mbox_device *mdev;
struct omap_mbox *mbox;
 
-   mdev = container_of(controller, struct omap_mbox_device, controller);
+   mdev = dev_get_drvdata(controller->dev);
if (WARN_ON(!mdev))
return ERR_PTR(-EINVAL);
 
@@ -567,6 +566,7 @@ static int omap_mbox_probe(struct platform_device *pdev)
struct device_node *node = pdev->dev.of_node;
struct device_node *child;
const struct omap_mbox_match_data *match_data;
+   struct mbox_controller *controller;
u32 intr_type, info_count;
u32 num_users, num_fifos;
u32 tmp[3];
@@ -685,17 +685,20 @@ static int omap_mbox_probe(struct platform_device *pdev)
mdev->intr_type = intr_type;
mdev->mboxes = list;
 
+   controller = devm_kzalloc(>dev, sizeof(*controller), GFP_KERNEL);
+   if (!controller)
+   return -ENOMEM;
/*
 * OMAP/K3 Mailbox IP does not have a Tx-Done IRQ, but rather a Tx-Ready
 * IRQ and is needed to run the Tx state machine
 */
-   mdev->controller.txdone_irq = true;
-   mdev->controller.dev = mdev->dev;
-   mdev->controller.ops = _mbox_chan_ops;
-   mdev->controller.chans = chnls;
-   mdev->controller.num_chans = info_count;
-   mdev->controller.of_xlate = omap_mbox_of_xlate;
-   ret = devm_mbox_controller_register(mdev->dev, >controller);
+   controller->txdone_irq = true;
+   controller->dev = mdev->dev;
+   controller->ops = _mbox_chan_ops;
+   controller->chans = chnls;
+   controller->num_chans = info_count;
+   controller->of_xlate = omap_mbox_of_xlate;
+   ret = devm_mbox_controller_register(mdev->dev, controller);
if (ret)
return ret;
 
-- 
2.39.2




[PATCH 10/13] mailbox: omap: Use mbox_controller channel list directly

2024-03-25 Thread Andrew Davis
The driver stores a list of omap_mbox structs so it can later use it to
look up the mailbox names in of_xlate. This same information is already
available in the mbox_controller passed into of_xlate. Simply use that
data and remove the extra allocation and storage of the omap_mbox list.

Signed-off-by: Andrew Davis 
---
 drivers/mailbox/omap-mailbox.c | 42 +-
 1 file changed, 11 insertions(+), 31 deletions(-)

diff --git a/drivers/mailbox/omap-mailbox.c b/drivers/mailbox/omap-mailbox.c
index 97f59d9f9f319..8e42266cb31a5 100644
--- a/drivers/mailbox/omap-mailbox.c
+++ b/drivers/mailbox/omap-mailbox.c
@@ -85,7 +85,6 @@ struct omap_mbox_device {
u32 num_users;
u32 num_fifos;
u32 intr_type;
-   struct omap_mbox **mboxes;
 };
 
 struct omap_mbox {
@@ -356,25 +355,6 @@ static void omap_mbox_fini(struct omap_mbox *mbox)
mbox_queue_free(mbox->rxq);
 }
 
-static struct omap_mbox *omap_mbox_device_find(struct omap_mbox_device *mdev,
-  const char *mbox_name)
-{
-   struct omap_mbox *_mbox, *mbox = NULL;
-   struct omap_mbox **mboxes = mdev->mboxes;
-   int i;
-
-   if (!mboxes)
-   return NULL;
-
-   for (i = 0; (_mbox = mboxes[i]); i++) {
-   if (!strcmp(_mbox->name, mbox_name)) {
-   mbox = _mbox;
-   break;
-   }
-   }
-   return mbox;
-}
-
 static int omap_mbox_chan_startup(struct mbox_chan *chan)
 {
struct omap_mbox *mbox = mbox_chan_to_omap_mbox(chan);
@@ -539,6 +519,7 @@ static struct mbox_chan *omap_mbox_of_xlate(struct 
mbox_controller *controller,
struct device_node *node;
struct omap_mbox_device *mdev;
struct omap_mbox *mbox;
+   int i;
 
mdev = dev_get_drvdata(controller->dev);
if (WARN_ON(!mdev))
@@ -551,16 +532,23 @@ static struct mbox_chan *omap_mbox_of_xlate(struct 
mbox_controller *controller,
return ERR_PTR(-ENODEV);
}
 
-   mbox = omap_mbox_device_find(mdev, node->name);
+   for (i = 0; i < controller->num_chans; i++) {
+   mbox = controller->chans[i].con_priv;
+   if (!strcmp(mbox->name, node->name)) {
+   of_node_put(node);
+   return >chans[i];
+   }
+   }
+
of_node_put(node);
-   return mbox ? mbox->chan : ERR_PTR(-ENOENT);
+   return ERR_PTR(-ENOENT);
 }
 
 static int omap_mbox_probe(struct platform_device *pdev)
 {
int ret;
struct mbox_chan *chnls;
-   struct omap_mbox **list, *mbox;
+   struct omap_mbox *mbox;
struct omap_mbox_device *mdev;
struct omap_mbox_fifo *fifo;
struct device_node *node = pdev->dev.of_node;
@@ -608,12 +596,6 @@ static int omap_mbox_probe(struct platform_device *pdev)
if (!mdev->irq_ctx)
return -ENOMEM;
 
-   /* allocate one extra for marking end of list */
-   list = devm_kcalloc(>dev, info_count + 1, sizeof(*list),
-   GFP_KERNEL);
-   if (!list)
-   return -ENOMEM;
-
chnls = devm_kcalloc(>dev, info_count + 1, sizeof(*chnls),
 GFP_KERNEL);
if (!chnls)
@@ -675,7 +657,6 @@ static int omap_mbox_probe(struct platform_device *pdev)
return mbox->irq;
mbox->chan = [i];
chnls[i].con_priv = mbox;
-   list[i] = mbox++;
}
 
mutex_init(>cfg_lock);
@@ -683,7 +664,6 @@ static int omap_mbox_probe(struct platform_device *pdev)
mdev->num_users = num_users;
mdev->num_fifos = num_fifos;
mdev->intr_type = intr_type;
-   mdev->mboxes = list;
 
controller = devm_kzalloc(>dev, sizeof(*controller), GFP_KERNEL);
if (!controller)
-- 
2.39.2




[PATCH 12/13] mailbox: omap: Reverse FIFO busy check logic

2024-03-25 Thread Andrew Davis
It is much clearer to check whether the hardware FIFO is full and return
EBUSY if so. This also allows us to remove one level of indentation
from the core of this function. It also makes the similarities between
omap_mbox_chan_send_noirq() and omap_mbox_chan_send() more obvious.

Signed-off-by: Andrew Davis 
---
 drivers/mailbox/omap-mailbox.c | 33 -
 1 file changed, 16 insertions(+), 17 deletions(-)

diff --git a/drivers/mailbox/omap-mailbox.c b/drivers/mailbox/omap-mailbox.c
index 8e2760d2c5b0c..c5d4083125856 100644
--- a/drivers/mailbox/omap-mailbox.c
+++ b/drivers/mailbox/omap-mailbox.c
@@ -375,34 +375,33 @@ static void omap_mbox_chan_shutdown(struct mbox_chan 
*chan)
 
 static int omap_mbox_chan_send_noirq(struct omap_mbox *mbox, u32 msg)
 {
-   int ret = -EBUSY;
+   if (mbox_fifo_full(mbox))
+   return -EBUSY;
 
-   if (!mbox_fifo_full(mbox)) {
-   omap_mbox_enable_irq(mbox, IRQ_RX);
-   mbox_fifo_write(mbox, msg);
-   ret = 0;
-   omap_mbox_disable_irq(mbox, IRQ_RX);
+   omap_mbox_enable_irq(mbox, IRQ_RX);
+   mbox_fifo_write(mbox, msg);
+   omap_mbox_disable_irq(mbox, IRQ_RX);
 
-   /* we must read and ack the interrupt directly from here */
-   mbox_fifo_read(mbox);
-   ack_mbox_irq(mbox, IRQ_RX);
-   }
+   /* we must read and ack the interrupt directly from here */
+   mbox_fifo_read(mbox);
+   ack_mbox_irq(mbox, IRQ_RX);
 
-   return ret;
+   return 0;
 }
 
 static int omap_mbox_chan_send(struct omap_mbox *mbox, u32 msg)
 {
-   int ret = -EBUSY;
-
-   if (!mbox_fifo_full(mbox)) {
-   mbox_fifo_write(mbox, msg);
-   ret = 0;
+   if (mbox_fifo_full(mbox)) {
+   /* always enable the interrupt */
+   omap_mbox_enable_irq(mbox, IRQ_TX);
+   return -EBUSY;
}
 
+   mbox_fifo_write(mbox, msg);
+
/* always enable the interrupt */
omap_mbox_enable_irq(mbox, IRQ_TX);
-   return ret;
+   return 0;
 }
 
 static int omap_mbox_chan_send_data(struct mbox_chan *chan, void *data)
-- 
2.39.2




[PATCH 00/13] OMAP mailbox FIFO removal

2024-03-25 Thread Andrew Davis
Hello all,

The core of this series is the last patch, which removes the message FIFO
from the OMAP mailbox driver. That FIFO hurts our real-time performance and
was a legacy leftover from before the common mailbox framework anyway.

The rest of the patches are cleanups found along the way.

Thanks,
Andrew

Andrew Davis (13):
  mailbox: omap: Remove unused omap_mbox_{enable,disable}_irq()
functions
  mailbox: omap: Remove unused omap_mbox_request_channel() function
  mailbox: omap: Move omap_mbox_irq_t into driver
  mailbox: omap: Move fifo size check to point of use
  mailbox: omap: Remove unneeded header omap-mailbox.h
  mailbox: omap: Remove device class
  mailbox: omap: Use devm_pm_runtime_enable() helper
  mailbox: omap: Merge mailbox child node setup loops
  mailbox: omap: Use function local struct mbox_controller
  mailbox: omap: Use mbox_controller channel list directly
  mailbox: omap: Remove mbox_chan_to_omap_mbox()
  mailbox: omap: Reverse FIFO busy check logic
  mailbox: omap: Remove kernel FIFO message queuing

 drivers/mailbox/Kconfig|   9 -
 drivers/mailbox/omap-mailbox.c | 515 +++--
 include/linux/omap-mailbox.h   |  13 -
 3 files changed, 106 insertions(+), 431 deletions(-)

-- 
2.39.2




[PATCH 13/13] mailbox: omap: Remove kernel FIFO message queuing

2024-03-25 Thread Andrew Davis
The kernel FIFO queue has a couple of issues. The biggest issue is that
it causes extra latency in a path that can be used in real-time tasks,
such as communication with real-time remote processors.

The whole FIFO idea itself looks to be a leftover from before the
unified mailbox framework. The current mailbox framework expects
mbox_chan_received_data() to be called with data immediately as it
arrives. Remove the FIFO and pass the messages to the mailbox
framework directly.

Signed-off-by: Andrew Davis 
---
 drivers/mailbox/Kconfig|   9 ---
 drivers/mailbox/omap-mailbox.c | 103 +
 2 files changed, 3 insertions(+), 109 deletions(-)

diff --git a/drivers/mailbox/Kconfig b/drivers/mailbox/Kconfig
index 42940108a1874..78e4c74fbe5c2 100644
--- a/drivers/mailbox/Kconfig
+++ b/drivers/mailbox/Kconfig
@@ -68,15 +68,6 @@ config OMAP2PLUS_MBOX
  OMAP2/3; or IPU, IVA HD and DSP in OMAP4/5. Say Y here if you
  want to use OMAP2+ Mailbox framework support.
 
-config OMAP_MBOX_KFIFO_SIZE
-   int "Mailbox kfifo default buffer size (bytes)"
-   depends on OMAP2PLUS_MBOX
-   default 256
-   help
- Specify the default size of mailbox's kfifo buffers (bytes).
- This can also be changed at runtime (via the mbox_kfifo_size
- module parameter).
-
 config ROCKCHIP_MBOX
bool "Rockchip Soc Integrated Mailbox Support"
depends on ARCH_ROCKCHIP || COMPILE_TEST
diff --git a/drivers/mailbox/omap-mailbox.c b/drivers/mailbox/omap-mailbox.c
index c5d4083125856..4e7e0e2f537b0 100644
--- a/drivers/mailbox/omap-mailbox.c
+++ b/drivers/mailbox/omap-mailbox.c
@@ -65,14 +65,6 @@ struct omap_mbox_fifo {
u32 intr_bit;
 };
 
-struct omap_mbox_queue {
-   spinlock_t  lock;
-   struct kfifofifo;
-   struct work_struct  work;
-   struct omap_mbox*mbox;
-   bool full;
-};
-
 struct omap_mbox_match_data {
u32 intr_type;
 };
@@ -90,7 +82,6 @@ struct omap_mbox_device {
 struct omap_mbox {
const char  *name;
int irq;
-   struct omap_mbox_queue  *rxq;
struct omap_mbox_device *parent;
struct omap_mbox_fifo   tx_fifo;
struct omap_mbox_fifo   rx_fifo;
@@ -99,10 +90,6 @@ struct omap_mbox {
boolsend_no_irq;
 };
 
-static unsigned int mbox_kfifo_size = CONFIG_OMAP_MBOX_KFIFO_SIZE;
-module_param(mbox_kfifo_size, uint, S_IRUGO);
-MODULE_PARM_DESC(mbox_kfifo_size, "Size of omap's mailbox kfifo (bytes)");
-
 static inline
 unsigned int mbox_read_reg(struct omap_mbox_device *mdev, size_t ofs)
 {
@@ -202,30 +189,6 @@ static void omap_mbox_disable_irq(struct omap_mbox *mbox, 
omap_mbox_irq_t irq)
mbox_write_reg(mbox->parent, bit, irqdisable);
 }
 
-/*
- * Message receiver(workqueue)
- */
-static void mbox_rx_work(struct work_struct *work)
-{
-   struct omap_mbox_queue *mq =
-   container_of(work, struct omap_mbox_queue, work);
-   u32 msg;
-   int len;
-
-   while (kfifo_len(>fifo) >= sizeof(msg)) {
-   len = kfifo_out(>fifo, (unsigned char *), sizeof(msg));
-   WARN_ON(len != sizeof(msg));
-
-   mbox_chan_received_data(mq->mbox->chan, (void *)(uintptr_t)msg);
-   spin_lock_irq(>lock);
-   if (mq->full) {
-   mq->full = false;
-   omap_mbox_enable_irq(mq->mbox, IRQ_RX);
-   }
-   spin_unlock_irq(>lock);
-   }
-}
-
 /*
  * Mailbox interrupt handler
  */
@@ -238,27 +201,15 @@ static void __mbox_tx_interrupt(struct omap_mbox *mbox)
 
 static void __mbox_rx_interrupt(struct omap_mbox *mbox)
 {
-   struct omap_mbox_queue *mq = mbox->rxq;
u32 msg;
-   int len;
 
while (!mbox_fifo_empty(mbox)) {
-   if (unlikely(kfifo_avail(>fifo) < sizeof(msg))) {
-   omap_mbox_disable_irq(mbox, IRQ_RX);
-   mq->full = true;
-   goto nomem;
-   }
-
msg = mbox_fifo_read(mbox);
-
-   len = kfifo_in(>fifo, (unsigned char *), sizeof(msg));
-   WARN_ON(len != sizeof(msg));
+   mbox_chan_received_data(mbox->chan, (void *)(uintptr_t)msg);
}
 
-   /* no more messages in the fifo. clear IRQ source. */
+   /* clear IRQ source. */
ack_mbox_irq(mbox, IRQ_RX);
-nomem:
-   schedule_work(>rxq->work);
 }
 
 static irqreturn_t mbox_interrupt(int irq, void *p)
@@ -274,57 +225,15 @@ static irqreturn_t mbox_interrupt(int irq, void *p)
return IRQ_HANDLED;
 }
 
-static struct omap_mbox_queue *mbox_queue_alloc(struct omap_mbox *mbox,
-   void (*work)(struct work_struct *))
-{
-   struct omap_mbox_queue *mq;
-   unsigned int size;
-
-   if (!work)
-   return NULL;
-
-   mq = kzalloc(sizeof(*mq), GFP_KERNEL);
- 

[PATCH 11/13] mailbox: omap: Remove mbox_chan_to_omap_mbox()

2024-03-25 Thread Andrew Davis
This function only checks that mbox_chan *chan is not NULL, but that can
never be the case, and even if it were, returning NULL (which is never
checked by the callers) would not save us. The second check, for
chan->con_priv, is completely redundant: if it were NULL we would return
NULL just the same. Simply dereference con_priv directly and remove this
function.

Signed-off-by: Andrew Davis 
---
 drivers/mailbox/omap-mailbox.c | 14 +++---
 1 file changed, 3 insertions(+), 11 deletions(-)

diff --git a/drivers/mailbox/omap-mailbox.c b/drivers/mailbox/omap-mailbox.c
index 8e42266cb31a5..8e2760d2c5b0c 100644
--- a/drivers/mailbox/omap-mailbox.c
+++ b/drivers/mailbox/omap-mailbox.c
@@ -103,14 +103,6 @@ static unsigned int mbox_kfifo_size = 
CONFIG_OMAP_MBOX_KFIFO_SIZE;
 module_param(mbox_kfifo_size, uint, S_IRUGO);
 MODULE_PARM_DESC(mbox_kfifo_size, "Size of omap's mailbox kfifo (bytes)");
 
-static struct omap_mbox *mbox_chan_to_omap_mbox(struct mbox_chan *chan)
-{
-   if (!chan || !chan->con_priv)
-   return NULL;
-
-   return (struct omap_mbox *)chan->con_priv;
-}
-
 static inline
 unsigned int mbox_read_reg(struct omap_mbox_device *mdev, size_t ofs)
 {
@@ -357,7 +349,7 @@ static void omap_mbox_fini(struct omap_mbox *mbox)
 
 static int omap_mbox_chan_startup(struct mbox_chan *chan)
 {
-   struct omap_mbox *mbox = mbox_chan_to_omap_mbox(chan);
+   struct omap_mbox *mbox = chan->con_priv;
struct omap_mbox_device *mdev = mbox->parent;
int ret = 0;
 
@@ -372,7 +364,7 @@ static int omap_mbox_chan_startup(struct mbox_chan *chan)
 
 static void omap_mbox_chan_shutdown(struct mbox_chan *chan)
 {
-   struct omap_mbox *mbox = mbox_chan_to_omap_mbox(chan);
+   struct omap_mbox *mbox = chan->con_priv;
struct omap_mbox_device *mdev = mbox->parent;
 
mutex_lock(>cfg_lock);
@@ -415,7 +407,7 @@ static int omap_mbox_chan_send(struct omap_mbox *mbox, u32 
msg)
 
 static int omap_mbox_chan_send_data(struct mbox_chan *chan, void *data)
 {
-   struct omap_mbox *mbox = mbox_chan_to_omap_mbox(chan);
+   struct omap_mbox *mbox = chan->con_priv;
int ret;
u32 msg = (u32)(uintptr_t)(data);
 
-- 
2.39.2




[PATCH 01/13] mailbox: omap: Remove unused omap_mbox_{enable,disable}_irq() functions

2024-03-25 Thread Andrew Davis
These functions are not used; remove them here.

While here, remove the leading _ from the driver-internal functions that
do the same thing as the removed functions.

Signed-off-by: Andrew Davis 
---
 drivers/mailbox/omap-mailbox.c | 42 --
 include/linux/omap-mailbox.h   |  3 ---
 2 files changed, 10 insertions(+), 35 deletions(-)

diff --git a/drivers/mailbox/omap-mailbox.c b/drivers/mailbox/omap-mailbox.c
index c961706fe61d5..624a7ccc27285 100644
--- a/drivers/mailbox/omap-mailbox.c
+++ b/drivers/mailbox/omap-mailbox.c
@@ -197,7 +197,7 @@ static int is_mbox_irq(struct omap_mbox *mbox, 
omap_mbox_irq_t irq)
return (int)(enable & status & bit);
 }
 
-static void _omap_mbox_enable_irq(struct omap_mbox *mbox, omap_mbox_irq_t irq)
+static void omap_mbox_enable_irq(struct omap_mbox *mbox, omap_mbox_irq_t irq)
 {
u32 l;
struct omap_mbox_fifo *fifo = (irq == IRQ_TX) ?
@@ -210,7 +210,7 @@ static void _omap_mbox_enable_irq(struct omap_mbox *mbox, 
omap_mbox_irq_t irq)
mbox_write_reg(mbox->parent, l, irqenable);
 }
 
-static void _omap_mbox_disable_irq(struct omap_mbox *mbox, omap_mbox_irq_t irq)
+static void omap_mbox_disable_irq(struct omap_mbox *mbox, omap_mbox_irq_t irq)
 {
struct omap_mbox_fifo *fifo = (irq == IRQ_TX) ?
>tx_fifo : >rx_fifo;
@@ -227,28 +227,6 @@ static void _omap_mbox_disable_irq(struct omap_mbox *mbox, 
omap_mbox_irq_t irq)
mbox_write_reg(mbox->parent, bit, irqdisable);
 }
 
-void omap_mbox_enable_irq(struct mbox_chan *chan, omap_mbox_irq_t irq)
-{
-   struct omap_mbox *mbox = mbox_chan_to_omap_mbox(chan);
-
-   if (WARN_ON(!mbox))
-   return;
-
-   _omap_mbox_enable_irq(mbox, irq);
-}
-EXPORT_SYMBOL(omap_mbox_enable_irq);
-
-void omap_mbox_disable_irq(struct mbox_chan *chan, omap_mbox_irq_t irq)
-{
-   struct omap_mbox *mbox = mbox_chan_to_omap_mbox(chan);
-
-   if (WARN_ON(!mbox))
-   return;
-
-   _omap_mbox_disable_irq(mbox, irq);
-}
-EXPORT_SYMBOL(omap_mbox_disable_irq);
-
 /*
  * Message receiver(workqueue)
  */
@@ -269,7 +247,7 @@ static void mbox_rx_work(struct work_struct *work)
spin_lock_irq(>lock);
if (mq->full) {
mq->full = false;
-   _omap_mbox_enable_irq(mq->mbox, IRQ_RX);
+   omap_mbox_enable_irq(mq->mbox, IRQ_RX);
}
spin_unlock_irq(>lock);
}
@@ -280,7 +258,7 @@ static void mbox_rx_work(struct work_struct *work)
  */
 static void __mbox_tx_interrupt(struct omap_mbox *mbox)
 {
-   _omap_mbox_disable_irq(mbox, IRQ_TX);
+   omap_mbox_disable_irq(mbox, IRQ_TX);
ack_mbox_irq(mbox, IRQ_TX);
mbox_chan_txdone(mbox->chan, 0);
 }
@@ -293,7 +271,7 @@ static void __mbox_rx_interrupt(struct omap_mbox *mbox)
 
while (!mbox_fifo_empty(mbox)) {
if (unlikely(kfifo_avail(>fifo) < sizeof(msg))) {
-   _omap_mbox_disable_irq(mbox, IRQ_RX);
+   omap_mbox_disable_irq(mbox, IRQ_RX);
mq->full = true;
goto nomem;
}
@@ -375,7 +353,7 @@ static int omap_mbox_startup(struct omap_mbox *mbox)
if (mbox->send_no_irq)
mbox->chan->txdone_method = TXDONE_BY_ACK;
 
-   _omap_mbox_enable_irq(mbox, IRQ_RX);
+   omap_mbox_enable_irq(mbox, IRQ_RX);
 
return 0;
 
@@ -386,7 +364,7 @@ static int omap_mbox_startup(struct omap_mbox *mbox)
 
 static void omap_mbox_fini(struct omap_mbox *mbox)
 {
-   _omap_mbox_disable_irq(mbox, IRQ_RX);
+   omap_mbox_disable_irq(mbox, IRQ_RX);
free_irq(mbox->irq, mbox);
flush_work(>rxq->work);
mbox_queue_free(mbox->rxq);
@@ -533,10 +511,10 @@ static int omap_mbox_chan_send_noirq(struct omap_mbox 
*mbox, u32 msg)
int ret = -EBUSY;
 
if (!mbox_fifo_full(mbox)) {
-   _omap_mbox_enable_irq(mbox, IRQ_RX);
+   omap_mbox_enable_irq(mbox, IRQ_RX);
mbox_fifo_write(mbox, msg);
ret = 0;
-   _omap_mbox_disable_irq(mbox, IRQ_RX);
+   omap_mbox_disable_irq(mbox, IRQ_RX);
 
/* we must read and ack the interrupt directly from here */
mbox_fifo_read(mbox);
@@ -556,7 +534,7 @@ static int omap_mbox_chan_send(struct omap_mbox *mbox, u32 
msg)
}
 
/* always enable the interrupt */
-   _omap_mbox_enable_irq(mbox, IRQ_TX);
+   omap_mbox_enable_irq(mbox, IRQ_TX);
return ret;
 }
 
diff --git a/include/linux/omap-mailbox.h b/include/linux/omap-mailbox.h
index 8aa984ec1f38b..426a80fb32b5c 100644
--- a/include/linux/omap-mailbox.h
+++ b/include/linux/omap-mailbox.h
@@ -20,7 +20,4 @@ struct mbox_client;
 struct mbox_chan *omap_mbox_request_channel(struct mbox_client *cl,
const char *chan_name);
 
-void 

[PATCH 08/13] mailbox: omap: Merge mailbox child node setup loops

2024-03-25 Thread Andrew Davis
Currently the driver loops through all mailbox child nodes twice, once
to read in data from each node, and again to make use of this data.
Instead, read the data and make use of it in a single pass. This removes
the need for several temporary data structures and reduces the
complexity of the main loop in probe.

Signed-off-by: Andrew Davis 
---
 drivers/mailbox/omap-mailbox.c | 119 +
 1 file changed, 46 insertions(+), 73 deletions(-)

diff --git a/drivers/mailbox/omap-mailbox.c b/drivers/mailbox/omap-mailbox.c
index 4f956c7b4072c..17c9b9df78b1d 100644
--- a/drivers/mailbox/omap-mailbox.c
+++ b/drivers/mailbox/omap-mailbox.c
@@ -89,19 +89,6 @@ struct omap_mbox_device {
struct mbox_controller controller;
 };
 
-struct omap_mbox_fifo_info {
-   int tx_id;
-   int tx_usr;
-   int tx_irq;
-
-   int rx_id;
-   int rx_usr;
-   int rx_irq;
-
-   const char *name;
-   bool send_no_irq;
-};
-
 struct omap_mbox {
const char  *name;
int irq;
@@ -574,8 +561,7 @@ static int omap_mbox_probe(struct platform_device *pdev)
 {
int ret;
struct mbox_chan *chnls;
-   struct omap_mbox **list, *mbox, *mboxblk;
-   struct omap_mbox_fifo_info *finfo, *finfoblk;
+   struct omap_mbox **list, *mbox;
struct omap_mbox_device *mdev;
struct omap_mbox_fifo *fifo;
struct device_node *node = pdev->dev.of_node;
@@ -609,40 +595,6 @@ static int omap_mbox_probe(struct platform_device *pdev)
return -ENODEV;
}
 
-   finfoblk = devm_kcalloc(>dev, info_count, sizeof(*finfoblk),
-   GFP_KERNEL);
-   if (!finfoblk)
-   return -ENOMEM;
-
-   finfo = finfoblk;
-   child = NULL;
-   for (i = 0; i < info_count; i++, finfo++) {
-   child = of_get_next_available_child(node, child);
-   ret = of_property_read_u32_array(child, "ti,mbox-tx", tmp,
-ARRAY_SIZE(tmp));
-   if (ret)
-   return ret;
-   finfo->tx_id = tmp[0];
-   finfo->tx_irq = tmp[1];
-   finfo->tx_usr = tmp[2];
-
-   ret = of_property_read_u32_array(child, "ti,mbox-rx", tmp,
-ARRAY_SIZE(tmp));
-   if (ret)
-   return ret;
-   finfo->rx_id = tmp[0];
-   finfo->rx_irq = tmp[1];
-   finfo->rx_usr = tmp[2];
-
-   finfo->name = child->name;
-
-   finfo->send_no_irq = of_property_read_bool(child, 
"ti,mbox-send-noirq");
-
-   if (finfo->tx_id >= num_fifos || finfo->rx_id >= num_fifos ||
-   finfo->tx_usr >= num_users || finfo->rx_usr >= num_users)
-   return -EINVAL;
-   }
-
mdev = devm_kzalloc(>dev, sizeof(*mdev), GFP_KERNEL);
if (!mdev)
return -ENOMEM;
@@ -667,36 +619,58 @@ static int omap_mbox_probe(struct platform_device *pdev)
if (!chnls)
return -ENOMEM;
 
-   mboxblk = devm_kcalloc(>dev, info_count, sizeof(*mbox),
-  GFP_KERNEL);
-   if (!mboxblk)
-   return -ENOMEM;
+   child = NULL;
+   for (i = 0; i < info_count; i++) {
+   int tx_id, tx_irq, tx_usr;
+   int rx_id, rx_usr;
+
+   mbox = devm_kzalloc(>dev, sizeof(*mbox), GFP_KERNEL);
+   if (!mbox)
+   return -ENOMEM;
+
+   child = of_get_next_available_child(node, child);
+   ret = of_property_read_u32_array(child, "ti,mbox-tx", tmp,
+ARRAY_SIZE(tmp));
+   if (ret)
+   return ret;
+   tx_id = tmp[0];
+   tx_irq = tmp[1];
+   tx_usr = tmp[2];
+
+   ret = of_property_read_u32_array(child, "ti,mbox-rx", tmp,
+ARRAY_SIZE(tmp));
+   if (ret)
+   return ret;
+   rx_id = tmp[0];
+   /* rx_irq = tmp[1]; */
+   rx_usr = tmp[2];
+
+   if (tx_id >= num_fifos || rx_id >= num_fifos ||
+   tx_usr >= num_users || rx_usr >= num_users)
+   return -EINVAL;
 
-   mbox = mboxblk;
-   finfo = finfoblk;
-   for (i = 0; i < info_count; i++, finfo++) {
fifo = >tx_fifo;
-   fifo->msg = MAILBOX_MESSAGE(finfo->tx_id);
-   fifo->fifo_stat = MAILBOX_FIFOSTATUS(finfo->tx_id);
-   fifo->intr_bit = MAILBOX_IRQ_NOTFULL(finfo->tx_id);
-   fifo->irqenable = MAILBOX_IRQENABLE(intr_type, finfo->tx_usr);
-   fifo->irqstatus = MAILBOX_IRQSTATUS(intr_type, finfo->tx_usr);
-   fifo->irqdisable = 

[PATCH 06/13] mailbox: omap: Remove device class

2024-03-25 Thread Andrew Davis
The driver currently creates a new device class "mbox" and then, for each
mailbox, adds a device to that class. This class provides no file
operations for any userspace users of these devices.
It may have been extended to be functional in our vendor tree at
some point, but that is no longer the case, nor does it matter
for the upstream tree.

Remove this device class and related functions and variables.
This also allows us to switch to module_platform_driver() as
there is nothing left to do in module_init().
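
For illustration only, a sketch of the resulting end state (this is not a
hunk from the patch; the field values are the ones already present in the
driver struct below):

static struct platform_driver omap_mbox_driver = {
	.probe	= omap_mbox_probe,
	.remove_new = omap_mbox_remove,
	.driver	= {
		.name = "omap-mailbox",
		.pm = &omap_mbox_pm_ops,
		.of_match_table = of_match_ptr(omap_mailbox_of_match),
	},
};
module_platform_driver(omap_mbox_driver);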

Signed-off-by: Andrew Davis 
---
 drivers/mailbox/omap-mailbox.c | 89 +-
 1 file changed, 2 insertions(+), 87 deletions(-)

diff --git a/drivers/mailbox/omap-mailbox.c b/drivers/mailbox/omap-mailbox.c
index 4c673cb732ed1..ea467931faf46 100644
--- a/drivers/mailbox/omap-mailbox.c
+++ b/drivers/mailbox/omap-mailbox.c
@@ -87,7 +87,6 @@ struct omap_mbox_device {
u32 intr_type;
struct omap_mbox **mboxes;
struct mbox_controller controller;
-   struct list_head elem;
 };
 
 struct omap_mbox_fifo_info {
@@ -107,7 +106,6 @@ struct omap_mbox {
const char  *name;
int irq;
struct omap_mbox_queue  *rxq;
-   struct device   *dev;
struct omap_mbox_device *parent;
struct omap_mbox_fifo   tx_fifo;
struct omap_mbox_fifo   rx_fifo;
@@ -116,10 +114,6 @@ struct omap_mbox {
boolsend_no_irq;
 };
 
-/* global variables for the mailbox devices */
-static DEFINE_MUTEX(omap_mbox_devices_lock);
-static LIST_HEAD(omap_mbox_devices);
-
 static unsigned int mbox_kfifo_size = CONFIG_OMAP_MBOX_KFIFO_SIZE;
 module_param(mbox_kfifo_size, uint, S_IRUGO);
 MODULE_PARM_DESC(mbox_kfifo_size, "Size of omap's mailbox kfifo (bytes)");
@@ -395,61 +389,6 @@ static struct omap_mbox *omap_mbox_device_find(struct 
omap_mbox_device *mdev,
return mbox;
 }
 
-static struct class omap_mbox_class = { .name = "mbox", };
-
-static int omap_mbox_register(struct omap_mbox_device *mdev)
-{
-   int ret;
-   int i;
-   struct omap_mbox **mboxes;
-
-   if (!mdev || !mdev->mboxes)
-   return -EINVAL;
-
-   mboxes = mdev->mboxes;
-   for (i = 0; mboxes[i]; i++) {
-   struct omap_mbox *mbox = mboxes[i];
-
-   mbox->dev = device_create(_mbox_class, mdev->dev,
-   0, mbox, "%s", mbox->name);
-   if (IS_ERR(mbox->dev)) {
-   ret = PTR_ERR(mbox->dev);
-   goto err_out;
-   }
-   }
-
-   mutex_lock(_mbox_devices_lock);
-   list_add(>elem, _mbox_devices);
-   mutex_unlock(_mbox_devices_lock);
-
-   ret = devm_mbox_controller_register(mdev->dev, >controller);
-
-err_out:
-   if (ret) {
-   while (i--)
-   device_unregister(mboxes[i]->dev);
-   }
-   return ret;
-}
-
-static int omap_mbox_unregister(struct omap_mbox_device *mdev)
-{
-   int i;
-   struct omap_mbox **mboxes;
-
-   if (!mdev || !mdev->mboxes)
-   return -EINVAL;
-
-   mutex_lock(_mbox_devices_lock);
-   list_del(>elem);
-   mutex_unlock(_mbox_devices_lock);
-
-   mboxes = mdev->mboxes;
-   for (i = 0; mboxes[i]; i++)
-   device_unregister(mboxes[i]->dev);
-   return 0;
-}
-
 static int omap_mbox_chan_startup(struct mbox_chan *chan)
 {
struct omap_mbox *mbox = mbox_chan_to_omap_mbox(chan);
@@ -782,7 +721,7 @@ static int omap_mbox_probe(struct platform_device *pdev)
mdev->controller.chans = chnls;
mdev->controller.num_chans = info_count;
mdev->controller.of_xlate = omap_mbox_of_xlate;
-   ret = omap_mbox_register(mdev);
+   ret = devm_mbox_controller_register(mdev->dev, >controller);
if (ret)
return ret;
 
@@ -809,7 +748,6 @@ static int omap_mbox_probe(struct platform_device *pdev)
 
 unregister:
pm_runtime_disable(mdev->dev);
-   omap_mbox_unregister(mdev);
return ret;
 }
 
@@ -818,7 +756,6 @@ static void omap_mbox_remove(struct platform_device *pdev)
struct omap_mbox_device *mdev = platform_get_drvdata(pdev);
 
pm_runtime_disable(mdev->dev);
-   omap_mbox_unregister(mdev);
 }
 
 static struct platform_driver omap_mbox_driver = {
@@ -830,29 +767,7 @@ static struct platform_driver omap_mbox_driver = {
.of_match_table = of_match_ptr(omap_mailbox_of_match),
},
 };
-
-static int __init omap_mbox_init(void)
-{
-   int err;
-
-   err = class_register(_mbox_class);
-   if (err)
-   return err;
-
-   err = platform_driver_register(_mbox_driver);
-   if (err)
-   class_unregister(_mbox_class);
-
-   return err;
-}
-subsys_initcall(omap_mbox_init);
-
-static void __exit omap_mbox_exit(void)
-{
-   platform_driver_unregister(_mbox_driver);
-   

[PATCH 07/13] mailbox: omap: Use devm_pm_runtime_enable() helper

2024-03-25 Thread Andrew Davis
Use device life-cycle managed runtime enable function to simplify probe
and exit paths.

Signed-off-by: Andrew Davis 
---
 drivers/mailbox/omap-mailbox.c | 18 +++---
 1 file changed, 3 insertions(+), 15 deletions(-)

diff --git a/drivers/mailbox/omap-mailbox.c b/drivers/mailbox/omap-mailbox.c
index ea467931faf46..4f956c7b4072c 100644
--- a/drivers/mailbox/omap-mailbox.c
+++ b/drivers/mailbox/omap-mailbox.c
@@ -726,11 +726,11 @@ static int omap_mbox_probe(struct platform_device *pdev)
return ret;
 
platform_set_drvdata(pdev, mdev);
-   pm_runtime_enable(mdev->dev);
+   devm_pm_runtime_enable(mdev->dev);
 
ret = pm_runtime_resume_and_get(mdev->dev);
if (ret < 0)
-   goto unregister;
+   return ret;
 
/*
 * just print the raw revision register, the format is not
@@ -741,26 +741,14 @@ static int omap_mbox_probe(struct platform_device *pdev)
 
ret = pm_runtime_put_sync(mdev->dev);
if (ret < 0 && ret != -ENOSYS)
-   goto unregister;
+   return ret;
 
devm_kfree(>dev, finfoblk);
return 0;
-
-unregister:
-   pm_runtime_disable(mdev->dev);
-   return ret;
-}
-
-static void omap_mbox_remove(struct platform_device *pdev)
-{
-   struct omap_mbox_device *mdev = platform_get_drvdata(pdev);
-
-   pm_runtime_disable(mdev->dev);
 }
 
 static struct platform_driver omap_mbox_driver = {
.probe  = omap_mbox_probe,
-   .remove_new = omap_mbox_remove,
.driver = {
.name = "omap-mailbox",
.pm = _mbox_pm_ops,
-- 
2.39.2




[PATCH 04/13] mailbox: omap: Move fifo size check to point of use

2024-03-25 Thread Andrew Davis
mbox_kfifo_size can be changed at runtime, so the sanity
check on its value should be done when it is used, not
only once at init time.

Signed-off-by: Andrew Davis 
---
 drivers/mailbox/omap-mailbox.c | 10 +-
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/drivers/mailbox/omap-mailbox.c b/drivers/mailbox/omap-mailbox.c
index c083734b6954c..167348fb1b33b 100644
--- a/drivers/mailbox/omap-mailbox.c
+++ b/drivers/mailbox/omap-mailbox.c
@@ -310,6 +310,7 @@ static struct omap_mbox_queue *mbox_queue_alloc(struct 
omap_mbox *mbox,
void (*work)(struct work_struct *))
 {
struct omap_mbox_queue *mq;
+   unsigned int size;
 
if (!work)
return NULL;
@@ -320,7 +321,10 @@ static struct omap_mbox_queue *mbox_queue_alloc(struct 
omap_mbox *mbox,
 
spin_lock_init(>lock);
 
-   if (kfifo_alloc(>fifo, mbox_kfifo_size, GFP_KERNEL))
+   /* kfifo size sanity check: alignment and minimal size */
+   size = ALIGN(mbox_kfifo_size, sizeof(u32));
+   size = max_t(unsigned int, size, sizeof(u32));
+   if (kfifo_alloc(>fifo, size, GFP_KERNEL))
goto error;
 
INIT_WORK(>work, work);
@@ -838,10 +842,6 @@ static int __init omap_mbox_init(void)
if (err)
return err;
 
-   /* kfifo size sanity check: alignment and minimal size */
-   mbox_kfifo_size = ALIGN(mbox_kfifo_size, sizeof(u32));
-   mbox_kfifo_size = max_t(unsigned int, mbox_kfifo_size, sizeof(u32));
-
err = platform_driver_register(_mbox_driver);
if (err)
class_unregister(_mbox_class);
-- 
2.39.2




[PATCH 02/13] mailbox: omap: Remove unused omap_mbox_request_channel() function

2024-03-25 Thread Andrew Davis
This function is not used; remove it.

Signed-off-by: Andrew Davis 
---
 drivers/mailbox/omap-mailbox.c | 36 --
 include/linux/omap-mailbox.h   |  6 --
 2 files changed, 42 deletions(-)

diff --git a/drivers/mailbox/omap-mailbox.c b/drivers/mailbox/omap-mailbox.c
index 624a7ccc27285..8151722eef383 100644
--- a/drivers/mailbox/omap-mailbox.c
+++ b/drivers/mailbox/omap-mailbox.c
@@ -389,42 +389,6 @@ static struct omap_mbox *omap_mbox_device_find(struct 
omap_mbox_device *mdev,
return mbox;
 }
 
-struct mbox_chan *omap_mbox_request_channel(struct mbox_client *cl,
-   const char *chan_name)
-{
-   struct device *dev = cl->dev;
-   struct omap_mbox *mbox = NULL;
-   struct omap_mbox_device *mdev;
-   int ret;
-
-   if (!dev)
-   return ERR_PTR(-ENODEV);
-
-   if (dev->of_node) {
-   pr_err("%s: please use mbox_request_channel(), this API is 
supported only for OMAP non-DT usage\n",
-  __func__);
-   return ERR_PTR(-ENODEV);
-   }
-
-   mutex_lock(_mbox_devices_lock);
-   list_for_each_entry(mdev, _mbox_devices, elem) {
-   mbox = omap_mbox_device_find(mdev, chan_name);
-   if (mbox)
-   break;
-   }
-   mutex_unlock(_mbox_devices_lock);
-
-   if (!mbox || !mbox->chan)
-   return ERR_PTR(-ENOENT);
-
-   ret = mbox_bind_client(mbox->chan, cl);
-   if (ret)
-   return ERR_PTR(ret);
-
-   return mbox->chan;
-}
-EXPORT_SYMBOL(omap_mbox_request_channel);
-
 static struct class omap_mbox_class = { .name = "mbox", };
 
 static int omap_mbox_register(struct omap_mbox_device *mdev)
diff --git a/include/linux/omap-mailbox.h b/include/linux/omap-mailbox.h
index 426a80fb32b5c..f8ddf8e814167 100644
--- a/include/linux/omap-mailbox.h
+++ b/include/linux/omap-mailbox.h
@@ -14,10 +14,4 @@ typedef int __bitwise omap_mbox_irq_t;
 #define IRQ_TX ((__force omap_mbox_irq_t) 1)
 #define IRQ_RX ((__force omap_mbox_irq_t) 2)
 
-struct mbox_chan;
-struct mbox_client;
-
-struct mbox_chan *omap_mbox_request_channel(struct mbox_client *cl,
-   const char *chan_name);
-
 #endif /* OMAP_MAILBOX_H */
-- 
2.39.2




[PATCH] vsock/virtio: fix packet delivery to tap device

2024-03-25 Thread Marco Pinna
Commit 82dfb540aeb2 ("VSOCK: Add virtio vsock vsockmon hooks") added
virtio_transport_deliver_tap_pkt() for handing packets to the
vsockmon device. However, in virtio_transport_send_pkt_work(),
the function is called before actually sending the packet (i.e.
before placing it in the virtqueue with virtqueue_add_sgs() and checking
whether it returned successfully). This may cause timing issues since
the sending of the packet may fail, causing it to be re-queued
(possibly multiple times), while the tap device would show the
packet being sent correctly.

Move virtio_transport_deliver_tap_pkt() after calling virtqueue_add_sgs()
and making sure it returned successfully.

Fixes: 82dfb540aeb2 ("VSOCK: Add virtio vsock vsockmon hooks")
Signed-off-by: Marco Pinna 
---
 net/vmw_vsock/virtio_transport.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/net/vmw_vsock/virtio_transport.c b/net/vmw_vsock/virtio_transport.c
index 1748268e0694..ee5d306a96d0 100644
--- a/net/vmw_vsock/virtio_transport.c
+++ b/net/vmw_vsock/virtio_transport.c
@@ -120,7 +120,6 @@ virtio_transport_send_pkt_work(struct work_struct *work)
if (!skb)
break;
 
-   virtio_transport_deliver_tap_pkt(skb);
reply = virtio_vsock_skb_reply(skb);
sgs = vsock->out_sgs;
sg_init_one(sgs[out_sg], virtio_vsock_hdr(skb),
@@ -170,6 +169,8 @@ virtio_transport_send_pkt_work(struct work_struct *work)
break;
}
 
+   virtio_transport_deliver_tap_pkt(skb);
+
if (reply) {
struct virtqueue *rx_vq = vsock->vqs[VSOCK_VQ_RX];
int val;
-- 
2.44.0




[PATCH 2/3] remoteproc: k3-r5: Fix usage of omap_mbox_message and mbox_msg_t

2024-03-25 Thread Andrew Davis
The type of message sent using omap-mailbox is always u32. The definition
of mbox_msg_t is uintptr_t, which is wrong as that type changes size with
the architecture (32-bit vs 64-bit). Use u32 unconditionally and remove
the now unneeded omap-mailbox.h include.
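
For context, an illustrative sketch (the helper names here are made up, they
are not in the patch): the value is widened through uintptr_t so the
u32 <-> void * conversion is well defined on both 32-bit and 64-bit kernels
and does not trigger cast-size warnings.

	static inline void *msg_to_mbox_ptr(u32 msg)
	{
		/* widen first, then convert to the opaque mailbox payload */
		return (void *)(uintptr_t)msg;
	}

	static inline u32 mbox_ptr_to_msg(void *data)
	{
		/* narrow back to the 32-bit payload on the receive side */
		return (u32)(uintptr_t)data;
	}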

Signed-off-by: Andrew Davis 
---
 drivers/remoteproc/ti_k3_r5_remoteproc.c | 7 +++
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/drivers/remoteproc/ti_k3_r5_remoteproc.c 
b/drivers/remoteproc/ti_k3_r5_remoteproc.c
index ad3415a3851b2..3bcde6d00b56a 100644
--- a/drivers/remoteproc/ti_k3_r5_remoteproc.c
+++ b/drivers/remoteproc/ti_k3_r5_remoteproc.c
@@ -16,7 +16,6 @@
 #include 
 #include 
 #include 
-#include 
 #include 
 #include 
 #include 
@@ -188,7 +187,7 @@ static void k3_r5_rproc_mbox_callback(struct mbox_client 
*client, void *data)
client);
struct device *dev = kproc->rproc->dev.parent;
const char *name = kproc->rproc->name;
-   u32 msg = omap_mbox_message(data);
+   u32 msg = (u32)(uintptr_t)(data);
 
dev_dbg(dev, "mbox msg: 0x%x\n", msg);
 
@@ -222,11 +221,11 @@ static void k3_r5_rproc_kick(struct rproc *rproc, int 
vqid)
 {
struct k3_r5_rproc *kproc = rproc->priv;
struct device *dev = rproc->dev.parent;
-   mbox_msg_t msg = (mbox_msg_t)vqid;
+   u32 msg = vqid;
int ret;
 
/* send the index of the triggered virtqueue in the mailbox payload */
-   ret = mbox_send_message(kproc->mbox, (void *)msg);
+   ret = mbox_send_message(kproc->mbox, (void *)(uintptr_t)msg);
if (ret < 0)
dev_err(dev, "failed to send mailbox message, status = %d\n",
ret);
-- 
2.39.2




[PATCH 3/3] remoteproc: omap: Remove unused header omap-mailbox.h

2024-03-25 Thread Andrew Davis
This header is no longer used; remove this include.

Signed-off-by: Andrew Davis 
---
 drivers/remoteproc/omap_remoteproc.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/drivers/remoteproc/omap_remoteproc.c 
b/drivers/remoteproc/omap_remoteproc.c
index 8f50ab80e56f4..bde04e3e6d966 100644
--- a/drivers/remoteproc/omap_remoteproc.c
+++ b/drivers/remoteproc/omap_remoteproc.c
@@ -29,7 +29,6 @@
 #include 
 #include 
 #include 
-#include 
 #include 
 #include 
 #include 
-- 
2.39.2




[PATCH 1/3] remoteproc: k3-dsp: Fix usage of omap_mbox_message and mbox_msg_t

2024-03-25 Thread Andrew Davis
The type of message sent using omap-mailbox is always u32. The definition
of mbox_msg_t is uintptr_t, which is wrong as that type changes size with
the architecture (32-bit vs 64-bit). Use u32 unconditionally and remove
the now unneeded omap-mailbox.h include.

Signed-off-by: Andrew Davis 
---
 drivers/remoteproc/ti_k3_dsp_remoteproc.c | 7 +++
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/drivers/remoteproc/ti_k3_dsp_remoteproc.c 
b/drivers/remoteproc/ti_k3_dsp_remoteproc.c
index 3555b535b1683..33b30cfb86c9d 100644
--- a/drivers/remoteproc/ti_k3_dsp_remoteproc.c
+++ b/drivers/remoteproc/ti_k3_dsp_remoteproc.c
@@ -11,7 +11,6 @@
 #include 
 #include 
 #include 
-#include 
 #include 
 #include 
 #include 
@@ -113,7 +112,7 @@ static void k3_dsp_rproc_mbox_callback(struct mbox_client 
*client, void *data)
  client);
struct device *dev = kproc->rproc->dev.parent;
const char *name = kproc->rproc->name;
-   u32 msg = omap_mbox_message(data);
+   u32 msg = (u32)(uintptr_t)(data);
 
dev_dbg(dev, "mbox msg: 0x%x\n", msg);
 
@@ -152,11 +151,11 @@ static void k3_dsp_rproc_kick(struct rproc *rproc, int 
vqid)
 {
struct k3_dsp_rproc *kproc = rproc->priv;
struct device *dev = rproc->dev.parent;
-   mbox_msg_t msg = (mbox_msg_t)vqid;
+   u32 msg = vqid;
int ret;
 
/* send the index of the triggered virtqueue in the mailbox payload */
-   ret = mbox_send_message(kproc->mbox, (void *)msg);
+   ret = mbox_send_message(kproc->mbox, (void *)(uintptr_t)msg);
if (ret < 0)
dev_err(dev, "failed to send mailbox message (%pe)\n",
ERR_PTR(ret));
-- 
2.39.2




Re: [PATCH] ftrace: make extra rcu_is_watching() validation check optional

2024-03-25 Thread Andrii Nakryiko
On Sun, Mar 24, 2024 at 7:38 PM Masami Hiramatsu  wrote:
>
> On Fri, 22 Mar 2024 09:03:23 -0700
> Andrii Nakryiko  wrote:
>
> > Introduce CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING config option to
> > control whether ftrace low-level code performs additional
> > rcu_is_watching()-based validation logic in an attempt to catch noinstr
> > violations.
> >
> > This check is expected to never be true in practice and would be best
> > controlled with extra config to let users decide if they are willing to
> > pay the price.
>
> Hmm, to me it sounds like "WARN_ON(something) will never be true in practice,
> so disable it by default". I think CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING
> is OK, but it should be set to Y by default. If you have already verified
> that your system never makes it true and you want to optimize your ftrace
> path, you can manually set it to N at your own risk.

Yeah, I don't think we ever see this warning across our machines. And
sure, I can default it to Y, no problem.

>
> >
> > Cc: Steven Rostedt 
> > Cc: Masami Hiramatsu 
> > Cc: Paul E. McKenney 
> > Signed-off-by: Andrii Nakryiko 
> > ---
> >  include/linux/trace_recursion.h |  2 +-
> >  kernel/trace/Kconfig| 13 +
> >  2 files changed, 14 insertions(+), 1 deletion(-)
> >
> > diff --git a/include/linux/trace_recursion.h 
> > b/include/linux/trace_recursion.h
> > index d48cd92d2364..24ea8ac049b4 100644
> > --- a/include/linux/trace_recursion.h
> > +++ b/include/linux/trace_recursion.h
> > @@ -135,7 +135,7 @@ extern void ftrace_record_recursion(unsigned long ip, 
> > unsigned long parent_ip);
> >  # define do_ftrace_record_recursion(ip, pip) do { } while (0)
> >  #endif
> >
> > -#ifdef CONFIG_ARCH_WANTS_NO_INSTR
> > +#ifdef CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING
> >  # define trace_warn_on_no_rcu(ip)\
> >   ({  \
> >   bool __ret = !rcu_is_watching();\
>
> BTW, maybe we can add "unlikely" in the next "if" line?

sure, can add that as well
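
For reference, a simplified sketch of the shape of that change (the real
macro in trace_recursion.h also guards with the recursion-record bit, so
this is illustrative rather than the exact body):

# define trace_warn_on_no_rcu(ip)					\
	({								\
		bool __ret = !rcu_is_watching();			\
		if (unlikely(__ret))					\
			WARN_ONCE(true, "RCU not on for: %pS\n", (void *)ip); \
		__ret;							\
	})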

>
> > diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig
> > index 61c541c36596..19bce4e217d6 100644
> > --- a/kernel/trace/Kconfig
> > +++ b/kernel/trace/Kconfig
> > @@ -974,6 +974,19 @@ config FTRACE_RECORD_RECURSION_SIZE
> > This file can be reset, but the limit can not change in
> > size at runtime.
> >
> > +config FTRACE_VALIDATE_RCU_IS_WATCHING
> > + bool "Validate RCU is on during ftrace recursion check"
> > + depends on FUNCTION_TRACER
> > + depends on ARCH_WANTS_NO_INSTR
>
> default y
>

ok

> > + help
> > +   All callbacks that attach to the function tracing have some sort
> > +   of protection against recursion. This option performs additional
> > +   checks to make sure RCU is on when ftrace callbacks recurse.
> > +
> > +   This will add more overhead to all ftrace-based invocations.
>
> ... invocations, but keep it safe.
>
> > +
> > +   If unsure, say N
>
> If unsure, say Y
>

yep, will do, thanks!

> Thank you,
>
> > +
> >  config RING_BUFFER_RECORD_RECURSION
> >   bool "Record functions that recurse in the ring buffer"
> >   depends on FTRACE_RECORD_RECURSION
> > --
> > 2.43.0
> >
>
>
> --
> Masami Hiramatsu (Google) 



Re: [PATCH v4 4/4] remoteproc: stm32: Add support of an OP-TEE TA to load the firmware

2024-03-25 Thread Mathieu Poirier
On Fri, Mar 08, 2024 at 03:47:08PM +0100, Arnaud Pouliquen wrote:
> The new TEE remoteproc device is used to manage remote firmware in a
> secure, trusted context. The 'st,stm32mp1-m4-tee' compatibility is
> introduced to delegate the loading of the firmware to the trusted
> execution context. In such cases, the firmware should be signed and
> adhere to the image format defined by the TEE.
> 
> Signed-off-by: Arnaud Pouliquen 
> ---
> Updates from V3:
> - remove support of the attach use case. Will be addressed in a separate
>   thread,
> - add st_rproc_tee_ops::parse_fw ops,
> - inverse call of devm_rproc_alloc()and tee_rproc_register() to manage cross
>   reference between the rproc struct and the tee_rproc struct in tee_rproc.c.
> ---
>  drivers/remoteproc/stm32_rproc.c | 60 +---
>  1 file changed, 56 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/remoteproc/stm32_rproc.c 
> b/drivers/remoteproc/stm32_rproc.c
> index 8cd838df4e92..13df33c78aa2 100644
> --- a/drivers/remoteproc/stm32_rproc.c
> +++ b/drivers/remoteproc/stm32_rproc.c
> @@ -20,6 +20,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  
>  #include "remoteproc_internal.h"
> @@ -49,6 +50,9 @@
>  #define M4_STATE_STANDBY 4
>  #define M4_STATE_CRASH   5
>  
> +/* Remote processor unique identifier aligned with the Trusted Execution 
> Environment definitions */

Why is this the case?  At least from the kernel side it is possible to call
tee_rproc_register() with any kind of value, so why does there need to be any
kind of alignment with the TEE?

> +#define STM32_MP1_M4_PROC_ID0
> +
>  struct stm32_syscon {
>   struct regmap *map;
>   u32 reg;
> @@ -257,6 +261,19 @@ static int stm32_rproc_release(struct rproc *rproc)
>   return 0;
>  }
>  
> +static int stm32_rproc_tee_stop(struct rproc *rproc)
> +{
> + int err;
> +
> + stm32_rproc_request_shutdown(rproc);
> +
> + err = tee_rproc_stop(rproc);
> + if (err)
> + return err;
> +
> + return stm32_rproc_release(rproc);
> +}
> +
>  static int stm32_rproc_prepare(struct rproc *rproc)
>  {
>   struct device *dev = rproc->dev.parent;
> @@ -693,8 +710,19 @@ static const struct rproc_ops st_rproc_ops = {
>   .get_boot_addr  = rproc_elf_get_boot_addr,
>  };
>  
> +static const struct rproc_ops st_rproc_tee_ops = {
> + .prepare= stm32_rproc_prepare,
> + .start  = tee_rproc_start,
> + .stop   = stm32_rproc_tee_stop,
> + .kick   = stm32_rproc_kick,
> + .load   = tee_rproc_load_fw,
> + .parse_fw   = tee_rproc_parse_fw,
> + .find_loaded_rsc_table = tee_rproc_find_loaded_rsc_table,
> +};
> +
>  static const struct of_device_id stm32_rproc_match[] = {
> - { .compatible = "st,stm32mp1-m4" },
> + {.compatible = "st,stm32mp1-m4",},
> + {.compatible = "st,stm32mp1-m4-tee",},
>   {},
>  };
>  MODULE_DEVICE_TABLE(of, stm32_rproc_match);
> @@ -853,6 +881,7 @@ static int stm32_rproc_probe(struct platform_device *pdev)
>   struct device *dev = >dev;
>   struct stm32_rproc *ddata;
>   struct device_node *np = dev->of_node;
> + struct tee_rproc *trproc = NULL;
>   struct rproc *rproc;
>   unsigned int state;
>   int ret;
> @@ -861,9 +890,26 @@ static int stm32_rproc_probe(struct platform_device 
> *pdev)
>   if (ret)
>   return ret;
>  
> - rproc = devm_rproc_alloc(dev, np->name, _rproc_ops, NULL, 
> sizeof(*ddata));
> - if (!rproc)
> - return -ENOMEM;
> + if (of_device_is_compatible(np, "st,stm32mp1-m4-tee")) {
> + /*
> +  * Delegate the firmware management to the secure context.
> +  * The firmware loaded has to be signed.
> +  */
> + rproc = devm_rproc_alloc(dev, np->name, _rproc_tee_ops, 
> NULL, sizeof(*ddata));
> + if (!rproc)
> + return -ENOMEM;
> +
> + trproc = tee_rproc_register(dev, rproc, STM32_MP1_M4_PROC_ID);
> + if (IS_ERR(trproc)) {
> + dev_err_probe(dev, PTR_ERR(trproc),
> +   "signed firmware not supported by TEE\n");
> + return PTR_ERR(trproc);
> + }
> + } else {
> + rproc = devm_rproc_alloc(dev, np->name, _rproc_ops, NULL, 
> sizeof(*ddata));
> + if (!rproc)
> + return -ENOMEM;
> + }
>  
>   ddata = rproc->priv;
>  
> @@ -915,6 +961,9 @@ static int stm32_rproc_probe(struct platform_device *pdev)
>   dev_pm_clear_wake_irq(dev);
>   device_init_wakeup(dev, false);
>   }
> + if (trproc)

if (rproc->tee_interface)


I am done reviewing this set.

Thanks,
Mathieu

> + tee_rproc_unregister(trproc);
> +
>   return ret;
>  }
>  
> @@ -935,6 +984,9 @@ static void stm32_rproc_remove(struct platform_device 
> *pdev)
>   

Re: [PATCH v4 1/4] remoteproc: Add TEE support

2024-03-25 Thread Mathieu Poirier
On Fri, Mar 08, 2024 at 03:47:05PM +0100, Arnaud Pouliquen wrote:
> Add a remoteproc TEE (Trusted Execution Environment) driver
> that will be probed by the TEE bus. If the associated Trusted
> application is supported on secure part this device offers a client

Device or driver?  I thought I touched on that before.

> interface to load a firmware in the secure part.
> This firmware could be authenticated by the secure trusted application.
> 
> Signed-off-by: Arnaud Pouliquen 
> ---
> Updates from V3:
> - rework TEE_REMOTEPROC description in Kconfig
> - fix some namings
> - add tee_rproc_parse_fw  to support rproc_ops::parse_fw
> - add proc::tee_interface;
> - add rproc struct as parameter of the tee_rproc_register() function
> ---
>  drivers/remoteproc/Kconfig  |  10 +
>  drivers/remoteproc/Makefile |   1 +
>  drivers/remoteproc/tee_remoteproc.c | 434 
>  include/linux/remoteproc.h  |   4 +
>  include/linux/tee_remoteproc.h  | 112 +++
>  5 files changed, 561 insertions(+)
>  create mode 100644 drivers/remoteproc/tee_remoteproc.c
>  create mode 100644 include/linux/tee_remoteproc.h
> 
> diff --git a/drivers/remoteproc/Kconfig b/drivers/remoteproc/Kconfig
> index 48845dc8fa85..2cf1431b2b59 100644
> --- a/drivers/remoteproc/Kconfig
> +++ b/drivers/remoteproc/Kconfig
> @@ -365,6 +365,16 @@ config XLNX_R5_REMOTEPROC
>  
> It's safe to say N if not interested in using RPU r5f cores.
>  
> +
> +config TEE_REMOTEPROC
> + tristate "remoteproc support by a TEE application"

s/remoteproc/Remoteproc

> + depends on OPTEE
> + help
> +   Support a remote processor with a TEE application. The Trusted
> +   Execution Context is responsible for loading the trusted firmware
> +   image and managing the remote processor's lifecycle.
> +   This can be either built-in or a loadable module.
> +
>  endif # REMOTEPROC
>  
>  endmenu
> diff --git a/drivers/remoteproc/Makefile b/drivers/remoteproc/Makefile
> index 91314a9b43ce..fa8daebce277 100644
> --- a/drivers/remoteproc/Makefile
> +++ b/drivers/remoteproc/Makefile
> @@ -36,6 +36,7 @@ obj-$(CONFIG_RCAR_REMOTEPROC)   += rcar_rproc.o
>  obj-$(CONFIG_ST_REMOTEPROC)  += st_remoteproc.o
>  obj-$(CONFIG_ST_SLIM_REMOTEPROC) += st_slim_rproc.o
>  obj-$(CONFIG_STM32_RPROC)+= stm32_rproc.o
> +obj-$(CONFIG_TEE_REMOTEPROC) += tee_remoteproc.o
>  obj-$(CONFIG_TI_K3_DSP_REMOTEPROC)   += ti_k3_dsp_remoteproc.o
>  obj-$(CONFIG_TI_K3_R5_REMOTEPROC)+= ti_k3_r5_remoteproc.o
>  obj-$(CONFIG_XLNX_R5_REMOTEPROC) += xlnx_r5_remoteproc.o
> diff --git a/drivers/remoteproc/tee_remoteproc.c 
> b/drivers/remoteproc/tee_remoteproc.c
> new file mode 100644
> index ..c855210e52e3
> --- /dev/null
> +++ b/drivers/remoteproc/tee_remoteproc.c
> @@ -0,0 +1,434 @@
> +// SPDX-License-Identifier: GPL-2.0-or-later
> +/*
> + * Copyright (C) STMicroelectronics 2024 - All Rights Reserved
> + * Author: Arnaud Pouliquen 
> + */
> +
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +
> +#include "remoteproc_internal.h"
> +
> +#define MAX_TEE_PARAM_ARRY_MEMBER4
> +
> +/*
> + * Authentication of the firmware and load in the remote processor memory
> + *
> + * [in]  params[0].value.a:  unique 32bit identifier of the remote processor
> + * [in]   params[1].memref:  buffer containing the image of the 
> buffer
> + */
> +#define TA_RPROC_FW_CMD_LOAD_FW  1
> +
> +/*
> + * Start the remote processor
> + *
> + * [in]  params[0].value.a:  unique 32bit identifier of the remote processor
> + */
> +#define TA_RPROC_FW_CMD_START_FW 2
> +
> +/*
> + * Stop the remote processor
> + *
> + * [in]  params[0].value.a:  unique 32bit identifier of the remote processor
> + */
> +#define TA_RPROC_FW_CMD_STOP_FW  3
> +
> +/*
> + * Return the address of the resource table, or 0 if not found
> + * No check is done to verify that the address returned is accessible by
> + * the non secure context. If the resource table is loaded in a protected
> + * memory the access by the non secure context will lead to a data abort.
> + *
> + * [in]  params[0].value.a:  unique 32bit identifier of the remote processor
> + * [out]  params[1].value.a: 32bit LSB resource table memory address
> + * [out]  params[1].value.b: 32bit MSB resource table memory address
> + * [out]  params[2].value.a: 32bit LSB resource table memory size
> + * [out]  params[2].value.b: 32bit MSB resource table memory size
> + */
> +#define TA_RPROC_FW_CMD_GET_RSC_TABLE4
> +
> +/*
> + * Return the address of the core dump
> + *
> + * [in]  params[0].value.a:  unique 32bit identifier of the remote processor
> + * [out] params[1].memref:   address of the core dump image if exist,
> + *   else return Null
> + */
> +#define TA_RPROC_FW_CMD_GET_COREDUMP 5
> +
> +struct tee_rproc_context {
> + struct list_head sessions;
> +

[PATCH net-next v3 1/2] net: port TP_STORE_ADDR_PORTS_SKB macro to be tcp/udp independent

2024-03-25 Thread Balazs Scheidler
This patch moves TP_STORE_ADDR_PORTS_SKB() to a common header and removes
the TCP specific implementation details.

Previously the macro assumed the skb passed as an argument was a
TCP packet. The implementation now takes the L4 header as an argument
and uses it to extract the source/destination ports, which happen
to be named the same in "struct tcphdr" and "struct udphdr".
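
For illustration only (not part of this patch), a UDP tracepoint whose
entry carries the usual saddr/daddr sockaddr storage can then fill the
4-tuple from the header it already looked up:

	TP_fast_assign(
		const struct udphdr *uh = udp_hdr(skb);

		TP_STORE_ADDR_PORTS_SKB(__entry, skb, uh);
	),

Patch 2/2 adds exactly this kind of caller for udp_fail_queue_rcv_skb().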

Signed-off-by: Balazs Scheidler 
---
 include/trace/events/net_probe_common.h | 41 ++
 include/trace/events/tcp.h  | 45 ++---
 2 files changed, 43 insertions(+), 43 deletions(-)

diff --git a/include/trace/events/net_probe_common.h 
b/include/trace/events/net_probe_common.h
index 3930119cab08..50c083b5687d 100644
--- a/include/trace/events/net_probe_common.h
+++ b/include/trace/events/net_probe_common.h
@@ -41,4 +41,45 @@
 
 #endif
 
+#define TP_STORE_ADDR_PORTS_SKB_V4(__entry, skb, protoh)   \
+   do {\
+   struct sockaddr_in *v4 = (void *)__entry->saddr;\
+   \
+   v4->sin_family = AF_INET;   \
+   v4->sin_port = protoh->source;  \
+   v4->sin_addr.s_addr = ip_hdr(skb)->saddr;   \
+   v4 = (void *)__entry->daddr;\
+   v4->sin_family = AF_INET;   \
+   v4->sin_port = protoh->dest;\
+   v4->sin_addr.s_addr = ip_hdr(skb)->daddr;   \
+   } while (0)
+
+#if IS_ENABLED(CONFIG_IPV6)
+
+#define TP_STORE_ADDR_PORTS_SKB(__entry, skb, protoh)  \
+   do {\
+   const struct iphdr *iph = ip_hdr(skb);  \
+   \
+   if (iph->version == 6) {\
+   struct sockaddr_in6 *v6 = (void *)__entry->saddr; \
+   \
+   v6->sin6_family = AF_INET6; \
+   v6->sin6_port = protoh->source; \
+   v6->sin6_addr = ipv6_hdr(skb)->saddr;   \
+   v6 = (void *)__entry->daddr;\
+   v6->sin6_family = AF_INET6; \
+   v6->sin6_port = protoh->dest;   \
+   v6->sin6_addr = ipv6_hdr(skb)->daddr;   \
+   } else  \
+   TP_STORE_ADDR_PORTS_SKB_V4(__entry, skb, protoh); \
+   } while (0)
+
+#else
+
+#define TP_STORE_ADDR_PORTS_SKB(__entry, skb, protoh)  \
+   TP_STORE_ADDR_PORTS_SKB_V4(__entry, skb, protoh)
+
+#endif
+
+
 #endif
diff --git a/include/trace/events/tcp.h b/include/trace/events/tcp.h
index 699dafd204ea..5f19c5c6cda8 100644
--- a/include/trace/events/tcp.h
+++ b/include/trace/events/tcp.h
@@ -302,48 +302,6 @@ TRACE_EVENT(tcp_probe,
  __entry->skbaddr, __entry->skaddr)
 );
 
-#define TP_STORE_ADDR_PORTS_SKB_V4(__entry, skb)   \
-   do {\
-   const struct tcphdr *th = (const struct tcphdr *)skb->data; \
-   struct sockaddr_in *v4 = (void *)__entry->saddr;\
-   \
-   v4->sin_family = AF_INET;   \
-   v4->sin_port = th->source;  \
-   v4->sin_addr.s_addr = ip_hdr(skb)->saddr;   \
-   v4 = (void *)__entry->daddr;\
-   v4->sin_family = AF_INET;   \
-   v4->sin_port = th->dest;\
-   v4->sin_addr.s_addr = ip_hdr(skb)->daddr;   \
-   } while (0)
-
-#if IS_ENABLED(CONFIG_IPV6)
-
-#define TP_STORE_ADDR_PORTS_SKB(__entry, skb)  \
-   do {\
-   const struct iphdr *iph = ip_hdr(skb);  \
-   \
-   if (iph->version == 6) {\
-   const struct tcphdr *th = (const struct tcphdr 
*)skb->data; \
-   struct sockaddr_in6 *v6 = (void *)__entry->saddr; \
-   \
-   v6->sin6_family = AF_INET6;   

Re: [PATCH 1/2] dt-bindings: arm: qcom: Add Motorola Moto G (2013)

2024-03-25 Thread Rob Herring


On Sun, 24 Mar 2024 15:03:59 +0100, Stanislav Jakubek wrote:
> Document the Motorola Moto G (2013), which is a smartphone based
> on the Qualcomm MSM8226 SoC.
> 
> Signed-off-by: Stanislav Jakubek 
> ---
>  Documentation/devicetree/bindings/arm/qcom.yaml | 1 +
>  1 file changed, 1 insertion(+)
> 


My bot found new DTB warnings on the .dts files added or changed in this
series.

Some warnings may be from an existing SoC .dtsi. Or perhaps the warnings
are fixed by another series. Ultimately, it is up to the platform
maintainer whether these warnings are acceptable or not. No need to reply
unless the platform maintainer has comments.

If you already ran DT checks and didn't see these error(s), then
make sure dt-schema is up to date:

  pip3 install dtschema --upgrade


New warnings running 'make CHECK_DTBS=y qcom/msm8226-motorola-falcon.dtb' for 
f5d4d71cd59f25b80889ef88fa044aa3a4268d46.1711288736.git.stano.jaku...@gmail.com:

arch/arm/boot/dts/qcom/msm8226-motorola-falcon.dtb: syscon@f9011000: 
compatible: 'anyOf' conditional failed, one must be fixed:
['syscon'] is too short
'syscon' is not one of ['allwinner,sun8i-a83t-system-controller', 
'allwinner,sun8i-h3-system-controller', 
'allwinner,sun8i-v3s-system-controller', 
'allwinner,sun50i-a64-system-controller', 'amd,pensando-elba-syscon', 
'brcm,cru-clkset', 'freecom,fsg-cs2-system-controller', 
'fsl,imx93-aonmix-ns-syscfg', 'fsl,imx93-wakeupmix-syscfg', 
'hisilicon,dsa-subctrl', 'hisilicon,hi6220-sramctrl', 
'hisilicon,pcie-sas-subctrl', 'hisilicon,peri-subctrl', 'hpe,gxp-sysreg', 
'intel,lgm-syscon', 'loongson,ls1b-syscon', 'loongson,ls1c-syscon', 
'marvell,armada-3700-usb2-host-misc', 'mediatek,mt8135-pctl-a-syscfg', 
'mediatek,mt8135-pctl-b-syscfg', 'mediatek,mt8365-syscfg', 
'microchip,lan966x-cpu-syscon', 'microchip,sparx5-cpu-syscon', 
'mstar,msc313-pmsleep', 'nuvoton,ma35d1-sys', 'nuvoton,wpcm450-shm', 
'rockchip,px30-qos', 'rockchip,rk3036-qos', 'rockchip,rk3066-qos', 
'rockchip,rk3128-qos', 'rockchip,rk3228-qos', 'rockchip,rk3288-qos', 
'rockchip,rk3368-qos', 'rockchip,rk3399-qos', 'rockchip,rk3568-qos', '
 rockchip,rk3588-qos', 'rockchip,rv1126-qos', 'starfive,jh7100-sysmain', 
'ti,am62-usb-phy-ctrl', 'ti,am654-dss-oldi-io-ctrl', 'ti,am654-serdes-ctrl', 
'ti,j784s4-pcie-ctrl']
from schema $id: http://devicetree.org/schemas/mfd/syscon.yaml#








[PATCH v2] vp_vdpa: Fix return value check vp_vdpa_request_irq

2024-03-25 Thread gavin.liu
From: Yuxue Liu 

In the vp_vdpa_set_status function, when setting the device status to
VIRTIO_CONFIG_S_DRIVER_OK, the vp_vdpa_request_irq function may fail.
In such cases, the device status should not be set to DRIVER_OK. Add
exception printing to remind the user.

Signed-off-by: Yuxue Liu 
---

V1 -> V2: Remove redundant printouts
V1: https://lore.kernel.org/all/20240315102857.1803-1-gavin@jaguarmicro.com/

---
 drivers/vdpa/virtio_pci/vp_vdpa.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/vdpa/virtio_pci/vp_vdpa.c 
b/drivers/vdpa/virtio_pci/vp_vdpa.c
index df5f4a3bccb5..4caca0517cad 100644
--- a/drivers/vdpa/virtio_pci/vp_vdpa.c
+++ b/drivers/vdpa/virtio_pci/vp_vdpa.c
@@ -216,7 +216,10 @@ static void vp_vdpa_set_status(struct vdpa_device *vdpa, 
u8 status)
 
if (status & VIRTIO_CONFIG_S_DRIVER_OK &&
!(s & VIRTIO_CONFIG_S_DRIVER_OK)) {
-   vp_vdpa_request_irq(vp_vdpa);
+   if (vp_vdpa_request_irq(vp_vdpa)) {
+   WARN_ON(1);
+   return;
+   }
}
 
vp_modern_set_status(mdev, status);
-- 
2.43.0




Re: [RFC PATCH] riscv: Implement HAVE_DYNAMIC_FTRACE_WITH_CALL_OPS

2024-03-25 Thread Robbin Ehn
Hey,

> 
> func_symbol:
> auipc t0, common_dispatch:high <=> j actual_func:
> jalr t0, common_dispatch:low(t0)
>

If you are patching in a jump, I don't see why you wouldn't jump over
ld+jalr? (no need for common dispatch)
Patching jalr with nop, and keeping auipc, addresses the issue with
having to jump in the disabled case.
But needs either common dispatch or per func dispatch.

Thanks, Robbin

> common_dispatch:
> load t1, index + dispatch-list
> ld t1, 0(t1)
> jr t1
>
>
> >
> > > However, one thing I am not very sure is: do we need a destination
> > > address in a "per-function" manner? It seems like most of the time the
> > > destination address can only be ftrace_call, or ftrace_regs_call. If
> > > the number of destination addresses is very few, then we could
> > > potentially reduce the size of
> > > .
> >
> > Yes, we do need a per-function manner. BPF, e.g., uses
> > dynamically/JIT:ed trampolines/targets.
> >
> >
> >
> > Björn
>
> Cheers,
> Andy



[ANNOUNCE] 5.10.213-rt105

2024-03-25 Thread Luis Claudio R. Goncalves
Hello RT-list!

I'm pleased to announce the 5.10.213-rt105 stable release.

This release is an update to the new stable 5.10.213 version and no extra
changes have been performed.

You can get this release via the git tree at:

  git://git.kernel.org/pub/scm/linux/kernel/git/rt/linux-stable-rt.git

  branch: v5.10-rt
  Head SHA1: 200d9bf140d7c6c9fde6f7d1ab6d8d973fd47910

Or to build 5.10.213-rt105 directly, the following patches should be applied:

  https://www.kernel.org/pub/linux/kernel/v5.x/linux-5.10.tar.xz

  https://www.kernel.org/pub/linux/kernel/v5.x/patch-5.10.213.xz

  
https://www.kernel.org/pub/linux/kernel/projects/rt/5.10/older/patch-5.10.213-rt105.patch.xz

Signing key fingerprint:

  9354 0649 9972 8D31 D464  D140 F394 A423 F8E6 7C26

All keys used for the above files and repositories can be found on the
following git repository:

   git://git.kernel.org/pub/scm/docs/kernel/pgpkeys.git

Enjoy!
Luis




Re: [PATCH] virtio_net: Do not send RSS key if it is not supported

2024-03-25 Thread Heng Qi




在 2024/3/25 下午7:35, Xuan Zhuo 写道:

On Mon, 25 Mar 2024 04:26:08 -0700, Breno Leitao  wrote:

Hello Xuan,

On Mon, Mar 25, 2024 at 01:57:53PM +0800, Xuan Zhuo wrote:

On Fri, 22 Mar 2024 03:21:21 -0700, Breno Leitao  wrote:

Hello Xuan,

On Fri, Mar 22, 2024 at 10:00:22AM +0800, Xuan Zhuo wrote:

On Thu, 21 Mar 2024 09:54:30 -0700, Breno Leitao  wrote:

4) Since the command above does not have a key, then the last
scatter-gatter entry will be zeroed, since rss_key_size == 0.
 sg_buf_size = vi->rss_key_size;



if (vi->has_rss || vi->has_rss_hash_report) {
vi->rss_indir_table_size =
virtio_cread16(vdev, offsetof(struct virtio_net_config,
rss_max_indirection_table_length));
vi->rss_key_size =
virtio_cread8(vdev, offsetof(struct virtio_net_config, 
rss_max_key_size));

vi->rss_hash_types_supported =
virtio_cread32(vdev, offsetof(struct virtio_net_config, 
supported_hash_types));
vi->rss_hash_types_supported &=
~(VIRTIO_NET_RSS_HASH_TYPE_IP_EX |
  VIRTIO_NET_RSS_HASH_TYPE_TCP_EX |
  VIRTIO_NET_RSS_HASH_TYPE_UDP_EX);

dev->hw_features |= NETIF_F_RXHASH;
}


vi->rss_key_size is initiated here, I wonder if there is something wrong?

Not really, the code above is never executed (in my machines). This is
because `vi->has_rss` and `vi->has_rss_hash_report` are both unset.

Looking further, vdev does not have the VIRTIO_NET_F_RSS and
VIRTIO_NET_F_HASH_REPORT features.

Also, when I run `ethtool -x`, I got:

# ethtool  -x eth0
RX flow hash indirection table for eth0 with 1 RX ring(s):
Operation not supported
RSS hash key:
Operation not supported
RSS hash function:
toeplitz: on
xor: off
crc32: off


The spec saies:
Note that if the device offers VIRTIO_NET_F_HASH_REPORT, even if it
supports only one pair of virtqueues, it MUST support at least one of
commands of VIRTIO_NET_CTRL_MQ class to configure reported hash
parameters:

If the device offers VIRTIO_NET_F_RSS, it MUST support
VIRTIO_NET_CTRL_MQ_RSS_CONFIG command per 5.1.6.5.7.1.

Otherwise the device MUST support VIRTIO_NET_CTRL_MQ_HASH_CONFIG command
per 5.1.6.5.6.4.


So if we have not anyone of `vi->has_rss` and `vi->has_rss_hash_report`,
we should return from virtnet_set_rxfh directly.

Makes sense. Although it is not clear to me how vi->has_rss_hash_report
is related here, but, I am convinced that we shouldn't do any RSS
operation if the device doesn't have the RSS feature, i.e, vi->has_rss
is false.

That said, I am thinking about something like this. How does it sound?

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 5a7700b103f8..8c1ad7361cf2 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -3780,6 +3780,9 @@ static int virtnet_set_rxfh(struct net_device 
*dev,
struct virtnet_info *vi = netdev_priv(dev);
int i;

+   if (!vi->has_rss)
+   return -EOPNOTSUPP;
+

Should we check has_rss_hash_report?


Hi, Breno.

You can refer to the following modification. It is worth noting
that \field{rss_max_indirection_table_length} should only be
accessed if VIRTIO_NET_F_RSS is negotiated, which I have
modified below:

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 727c874..fb4c438 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -3836,10 +3836,16 @@ static int virtnet_set_rxfh(struct net_device *dev,
    struct virtnet_info *vi = netdev_priv(dev);
    int i;

+   if (!vi->has_rss && !vi->has_rss_hash_report)
+   return -EOPNOTSUPP;
+
    if (rxfh->hfunc != ETH_RSS_HASH_NO_CHANGE &&
    rxfh->hfunc != ETH_RSS_HASH_TOP)
    return -EOPNOTSUPP;

+   if (rxfh->indir && !vi->has_rss)
+   return -EINVAL;
+
    if (rxfh->indir) {
    for (i = 0; i < vi->rss_indir_table_size; ++i)
    vi->ctrl->rss.indirection_table[i] = 
rxfh->indir[i];

@@ -4757,13 +4763,14 @@ static int virtnet_probe(struct virtio_device *vdev)
    if (virtio_has_feature(vdev, VIRTIO_NET_F_HASH_REPORT))
    vi->has_rss_hash_report = true;

-   if (virtio_has_feature(vdev, VIRTIO_NET_F_RSS))
+   if (virtio_has_feature(vdev, VIRTIO_NET_F_RSS)) {
    vi->has_rss = true;
-
-   if (vi->has_rss || vi->has_rss_hash_report) {
    vi->rss_indir_table_size =
    virtio_cread16(vdev, offsetof(struct 
virtio_net_config,

    rss_max_indirection_table_length));
+   }
+
+  

Re: Re: [PATCH v3 resend] net/ipv4: add tracepoint for icmp_send

2024-03-25 Thread Peilin He
>>
>> Introduce a tracepoint for icmp_send, which can help users to get more
>> detail information conveniently when icmp abnormal events happen.
>>
>> 1. Giving an usecase example:
>> ==============================
>> When an application experiences packet loss due to an unreachable UDP
>> destination port, the kernel will send an exception message through the
>> icmp_send function. By adding a trace point for icmp_send, developers or
>> system administrators can obtain detailed information about the UDP
>> packet loss, including the type, code, source address, destination address,
>> source port, and destination port. This facilitates the trouble-shooting
>> of UDP packet loss issues especially for those network-service
>> applications.
>>
>> 2. Operation Instructions:
>> ==========================
>> Switch to the tracing directory.
>> cd /sys/kernel/tracing
>> Filter for destination port unreachable.
>> echo "type=3D=3D3 && code=3D=3D3" > events/icmp/icmp_send/filter
>> Enable trace event.
>> echo 1 > events/icmp/icmp_send/enable
>>
>> 3. Result View:
>> ================
>>  udp_client_erro-11370   [002] ...s.12   124.728002:
>>  icmp_send: icmp_send: type=3, code=3.
>>  From 127.0.0.1:41895 to 127.0.0.1: ulen=23
>>  skbaddr=589b167a
>>
>> Changelog
>> -
>> v2->v3:
>> Some fixes according to
>> https://lore.kernel.org/all/20240319102549.7f7f6...@gandalf.local.home/
>> 1. Change the tracking directory to/sys/kernel/tracking.
>> 2. Adjust the layout of the TP-STRUCT_entry parameter structure.
>>
>> v1->v2:
>> Some fixes according to
>> https://lore.kernel.org/all/CANn89iL-y9e_VFpdw=sZtRnKRu_tnUwqHuFQTJvJsv-nz1x...@mail.gmail.com/
>> 1. adjust the trace_icmp_send() to more protocols than UDP.
>> 2. move the calling of trace_icmp_send after sanity checks
>> in __icmp_send().
>>
>> Signed-off-by: Peilin He
>> Reviewed-by: xu xin 
>> Reviewed-by: Yunkai Zhang 
>> Cc: Yang Yang 
>> Cc: Liu Chun 
>> Cc: Xuexin Jiang 
>> ---
>>  include/trace/events/icmp.h | 64 +
>>  net/ipv4/icmp.c |  4 +++
>>  2 files changed, 68 insertions(+)
>>  create mode 100644 include/trace/events/icmp.h
>>
>> diff --git a/include/trace/events/icmp.h b/include/trace/events/icmp.h
>> new file mode 100644
>> index ..2098d4b1b12e
>> --- /dev/null
>> +++ b/include/trace/events/icmp.h
>> @@ -0,0 +1,64 @@
>> +/* SPDX-License-Identifier: GPL-2.0 */
>> +#undef TRACE_SYSTEM
>> +#define TRACE_SYSTEM icmp
>> +
>> +#if !defined(_TRACE_ICMP_H) || defined(TRACE_HEADER_MULTI_READ)
>> +#define _TRACE_ICMP_H
>> +
>> +#include 
>> +#include 
>> +
>> +TRACE_EVENT(icmp_send,
>> +
>> +   TP_PROTO(const struct sk_buff *skb, int type, int code),
>> +
>> +   TP_ARGS(skb, type, code),
>> +
>> +   TP_STRUCT__entry(
>> +   __field(const void *, skbaddr)
>> +   __field(int, type)
>> +   __field(int, code)
>> +   __array(__u8, saddr, 4)
>> +   __array(__u8, daddr, 4)
>> +   __field(__u16, sport)
>> +   __field(__u16, dport)
>> +   __field(unsigned short, ulen)
>> +   ),
>> +
>> +   TP_fast_assign(
>> +   struct iphdr *iph = ip_hdr(skb);
>> +   int proto_4 = iph->protocol;
>> +   __be32 *p32;
>> +
>> +   __entry->skbaddr = skb;
>> +   __entry->type = type;
>> +   __entry->code = code;
>> +
>> +   if (proto_4 == IPPROTO_UDP) {
>> +   struct udphdr *uh = udp_hdr(skb);
>> +   __entry->sport = ntohs(uh->source);
>> +   __entry->dport = ntohs(uh->dest);
>> +   __entry->ulen = ntohs(uh->len);
>
>This is completely bogus.
>
>Adding tracepoints is ok if there are no side effects like bugs :/
>
>At this point there is no guarantee the UDP header is complete/present
>in skb->head
>
>Look at the existing checks between lines 619 and 623
>
>Then audit all icmp_send() callers, and ask yourself if UDP packets
>can not be malicious (like with a truncated UDP header)
Yeah, you are correct. Directly parsing the udphdr from the skb may
conceal bugs, such as an illegal skb. To handle such exceptional scenarios,
we can determine the legitimacy of the skb by checking whether the
position of the uh pointer is out of bounds.

Perhaps it could be modified like this: 
struct udphdr *uh = udp_hdr(skb);

if (proto_4 != IPPROTO_UDP || (u8 *)uh < skb->head ||
(u8 *)uh + sizeof(struct udphdr) > skb_tail_pointer(skb)) 
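
A possible shape for that guard, purely as a sketch of the idea (the
exact wording is not in this mail), would be to fall back to zeroed
fields when the UDP header is not fully within the linear data:

	/* illustrative sketch only */
	if (proto_4 != IPPROTO_UDP || (u8 *)uh < skb->head ||
	    (u8 *)uh + sizeof(struct udphdr) > skb_tail_pointer(skb)) {
		__entry->sport = 0;
		__entry->dport = 0;
		__entry->ulen = 0;
	} else {
		__entry->sport = ntohs(uh->source);
		__entry->dport = ntohs(uh->dest);
		__entry->ulen = ntohs(uh->len);
	}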

Re: [PATCH 1/2] dt-bindings: arm: qcom: Add Motorola Moto G (2013)

2024-03-25 Thread Krzysztof Kozlowski
On 24/03/2024 15:03, Stanislav Jakubek wrote:
> Document the Motorola Moto G (2013), which is a smartphone based
> on the Qualcomm MSM8226 SoC.
> 
> Signed-off-by: Stanislav Jakubek 
> ---
>  Documentation/devicetree/bindings/arm/qcom.yaml | 1 +

Acked-by: Krzysztof Kozlowski 

Best regards,
Krzysztof




Re: [PATCH] virtio_net: Do not send RSS key if it is not supported

2024-03-25 Thread Xuan Zhuo
On Mon, 25 Mar 2024 04:26:08 -0700, Breno Leitao  wrote:
> Hello Xuan,
>
> On Mon, Mar 25, 2024 at 01:57:53PM +0800, Xuan Zhuo wrote:
> > On Fri, 22 Mar 2024 03:21:21 -0700, Breno Leitao  wrote:
> > > Hello Xuan,
> > >
> > > On Fri, Mar 22, 2024 at 10:00:22AM +0800, Xuan Zhuo wrote:
> > > > On Thu, 21 Mar 2024 09:54:30 -0700, Breno Leitao  
> > > > wrote:
> > >
> > > > > 4) Since the command above does not have a key, then the last
> > > > >scatter-gatter entry will be zeroed, since rss_key_size == 0.
> > > > > sg_buf_size = vi->rss_key_size;
> > > >
> > > >
> > > >
> > > > if (vi->has_rss || vi->has_rss_hash_report) {
> > > > vi->rss_indir_table_size =
> > > > virtio_cread16(vdev, offsetof(struct 
> > > > virtio_net_config,
> > > > rss_max_indirection_table_length));
> > > > vi->rss_key_size =
> > > > virtio_cread8(vdev, offsetof(struct 
> > > > virtio_net_config, rss_max_key_size));
> > > >
> > > > vi->rss_hash_types_supported =
> > > > virtio_cread32(vdev, offsetof(struct 
> > > > virtio_net_config, supported_hash_types));
> > > > vi->rss_hash_types_supported &=
> > > > ~(VIRTIO_NET_RSS_HASH_TYPE_IP_EX |
> > > >   VIRTIO_NET_RSS_HASH_TYPE_TCP_EX |
> > > >   VIRTIO_NET_RSS_HASH_TYPE_UDP_EX);
> > > >
> > > > dev->hw_features |= NETIF_F_RXHASH;
> > > > }
> > > >
> > > >
> > > > vi->rss_key_size is initiated here, I wonder if there is something 
> > > > wrong?
> > >
> > > Not really, the code above is never executed (in my machines). This is
> > > because `vi->has_rss` and `vi->has_rss_hash_report` are both unset.
> > >
> > > Looking further, vdev does not have the VIRTIO_NET_F_RSS and
> > > VIRTIO_NET_F_HASH_REPORT features.
> > >
> > > Also, when I run `ethtool -x`, I got:
> > >
> > >   # ethtool  -x eth0
> > >   RX flow hash indirection table for eth0 with 1 RX ring(s):
> > >   Operation not supported
> > >   RSS hash key:
> > >   Operation not supported
> > >   RSS hash function:
> > >   toeplitz: on
> > >   xor: off
> > >   crc32: off
> >
> >
> > The spec saies:
> > Note that if the device offers VIRTIO_NET_F_HASH_REPORT, even if it
> > supports only one pair of virtqueues, it MUST support at least one of
> > commands of VIRTIO_NET_CTRL_MQ class to configure reported hash
> > parameters:
> >
> > If the device offers VIRTIO_NET_F_RSS, it MUST support
> > VIRTIO_NET_CTRL_MQ_RSS_CONFIG command per 5.1.6.5.7.1.
> >
> > Otherwise the device MUST support VIRTIO_NET_CTRL_MQ_HASH_CONFIG command
> > per 5.1.6.5.6.4.
> >
> >
> > So if we have not anyone of `vi->has_rss` and `vi->has_rss_hash_report`,
> > we should return from virtnet_set_rxfh directly.
>
> Makes sense. Although it is not clear to me how vi->has_rss_hash_report
> is related here, but, I am convinced that we shouldn't do any RSS
> operation if the device doesn't have the RSS feature, i.e, vi->has_rss
> is false.
>
> That said, I am thinking about something like this. How does it sound?
>
>   diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
>   index 5a7700b103f8..8c1ad7361cf2 100644
>   --- a/drivers/net/virtio_net.c
>   +++ b/drivers/net/virtio_net.c
>   @@ -3780,6 +3780,9 @@ static int virtnet_set_rxfh(struct net_device 
> *dev,
>   struct virtnet_info *vi = netdev_priv(dev);
>   int i;
>
>   +   if (!vi->has_rss)
>   +   return -EOPNOTSUPP;
>   +

Should we check has_rss_hash_report?

@Heng Qi

Could you help us?

Thanks.


>   if (rxfh->hfunc != ETH_RSS_HASH_NO_CHANGE &&
>   rxfh->hfunc != ETH_RSS_HASH_TOP)
>   return -EOPNOTSUPP;
>
> Thanks!



[PATCH net-next v2 3/3] tcp: add location into reset trace process

2024-03-25 Thread Jason Xing
From: Jason Xing 

In addition to knowing the 4-tuple of the flow which generates the RST,
the reason why it is generated is also important: there are cases where
an RST is sent and we have no clue which code path emitted it exactly.

Adding the location of the reset to the tracepoint helps here, much like
what trace_kfree_skb does.

Signed-off-by: Jason Xing 
---
 include/trace/events/tcp.h | 14 ++
 net/ipv4/tcp_ipv4.c|  2 +-
 net/ipv4/tcp_output.c  |  2 +-
 net/ipv6/tcp_ipv6.c|  2 +-
 4 files changed, 13 insertions(+), 7 deletions(-)

diff --git a/include/trace/events/tcp.h b/include/trace/events/tcp.h
index a13eb2147a02..8f6c1a07503c 100644
--- a/include/trace/events/tcp.h
+++ b/include/trace/events/tcp.h
@@ -109,13 +109,17 @@ DEFINE_EVENT(tcp_event_sk_skb, tcp_retransmit_skb,
  */
 TRACE_EVENT(tcp_send_reset,
 
-   TP_PROTO(const struct sock *sk, const struct sk_buff *skb),
+   TP_PROTO(
+   const struct sock *sk,
+   const struct sk_buff *skb,
+   void *location),
 
-   TP_ARGS(sk, skb),
+   TP_ARGS(sk, skb, location),
 
TP_STRUCT__entry(
__field(const void *, skbaddr)
__field(const void *, skaddr)
+   __field(void *, location)
__field(int, state)
__array(__u8, saddr, sizeof(struct sockaddr_in6))
__array(__u8, daddr, sizeof(struct sockaddr_in6))
@@ -141,12 +145,14 @@ TRACE_EVENT(tcp_send_reset,
 */
TP_STORE_ADDR_PORTS_SKB(skb, entry->daddr, 
entry->saddr);
}
+   __entry->location = location;
),
 
-   TP_printk("skbaddr=%p skaddr=%p src=%pISpc dest=%pISpc state=%s",
+   TP_printk("skbaddr=%p skaddr=%p src=%pISpc dest=%pISpc state=%s 
location=%pS",
  __entry->skbaddr, __entry->skaddr,
  __entry->saddr, __entry->daddr,
- __entry->state ? show_tcp_state_name(__entry->state) : 
"UNKNOWN")
+ __entry->state ? show_tcp_state_name(__entry->state) : 
"UNKNOWN",
+ __entry->location)
 );
 
 /*
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index d5c4a969c066..fec54cfc4fb3 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -870,7 +870,7 @@ static void tcp_v4_send_reset(const struct sock *sk, struct 
sk_buff *skb)
arg.bound_dev_if = sk->sk_bound_dev_if;
}
 
-   trace_tcp_send_reset(sk, skb);
+   trace_tcp_send_reset(sk, skb,  __builtin_return_address(0));
 
BUILD_BUG_ON(offsetof(struct sock, sk_bound_dev_if) !=
 offsetof(struct inet_timewait_sock, tw_bound_dev_if));
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index e3167ad96567..fb613582817e 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -3608,7 +3608,7 @@ void tcp_send_active_reset(struct sock *sk, gfp_t 
priority)
/* skb of trace_tcp_send_reset() keeps the skb that caused RST,
 * skb here is different to the troublesome skb, so use NULL
 */
-   trace_tcp_send_reset(sk, NULL);
+   trace_tcp_send_reset(sk, NULL,  __builtin_return_address(0));
 }
 
 /* Send a crossed SYN-ACK during socket establishment.
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index 8e9c59b6c00c..7eba9c3d69f1 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -1128,7 +1128,7 @@ static void tcp_v6_send_reset(const struct sock *sk, 
struct sk_buff *skb)
label = ip6_flowlabel(ipv6h);
}
 
-   trace_tcp_send_reset(sk, skb);
+   trace_tcp_send_reset(sk, skb,  __builtin_return_address(0));
 
tcp_v6_send_response(sk, skb, seq, ack_seq, 0, 0, 0, oif, 1,
 ipv6_get_dsfield(ipv6h), label, priority, txhash,
-- 
2.37.3




Re: [PATCH] virtio_net: Do not send RSS key if it is not supported

2024-03-25 Thread Breno Leitao
Hello Xuan,

On Mon, Mar 25, 2024 at 01:57:53PM +0800, Xuan Zhuo wrote:
> On Fri, 22 Mar 2024 03:21:21 -0700, Breno Leitao  wrote:
> > Hello Xuan,
> >
> > On Fri, Mar 22, 2024 at 10:00:22AM +0800, Xuan Zhuo wrote:
> > > On Thu, 21 Mar 2024 09:54:30 -0700, Breno Leitao  
> > > wrote:
> >
> > > > 4) Since the command above does not have a key, then the last
> > > >scatter-gatter entry will be zeroed, since rss_key_size == 0.
> > > > sg_buf_size = vi->rss_key_size;
> > >
> > >
> > >
> > >   if (vi->has_rss || vi->has_rss_hash_report) {
> > >   vi->rss_indir_table_size =
> > >   virtio_cread16(vdev, offsetof(struct virtio_net_config,
> > >   rss_max_indirection_table_length));
> > >   vi->rss_key_size =
> > >   virtio_cread8(vdev, offsetof(struct virtio_net_config, 
> > > rss_max_key_size));
> > >
> > >   vi->rss_hash_types_supported =
> > >   virtio_cread32(vdev, offsetof(struct virtio_net_config, 
> > > supported_hash_types));
> > >   vi->rss_hash_types_supported &=
> > >   ~(VIRTIO_NET_RSS_HASH_TYPE_IP_EX |
> > > VIRTIO_NET_RSS_HASH_TYPE_TCP_EX |
> > > VIRTIO_NET_RSS_HASH_TYPE_UDP_EX);
> > >
> > >   dev->hw_features |= NETIF_F_RXHASH;
> > >   }
> > >
> > >
> > > vi->rss_key_size is initiated here, I wonder if there is something wrong?
> >
> > Not really, the code above is never executed (in my machines). This is
> > because `vi->has_rss` and `vi->has_rss_hash_report` are both unset.
> >
> > Looking further, vdev does not have the VIRTIO_NET_F_RSS and
> > VIRTIO_NET_F_HASH_REPORT features.
> >
> > Also, when I run `ethtool -x`, I got:
> >
> > # ethtool  -x eth0
> > RX flow hash indirection table for eth0 with 1 RX ring(s):
> > Operation not supported
> > RSS hash key:
> > Operation not supported
> > RSS hash function:
> > toeplitz: on
> > xor: off
> > crc32: off
> 
> 
> The spec saies:
>   Note that if the device offers VIRTIO_NET_F_HASH_REPORT, even if it
>   supports only one pair of virtqueues, it MUST support at least one of
>   commands of VIRTIO_NET_CTRL_MQ class to configure reported hash
>   parameters:
> 
>   If the device offers VIRTIO_NET_F_RSS, it MUST support
>   VIRTIO_NET_CTRL_MQ_RSS_CONFIG command per 5.1.6.5.7.1.
> 
>   Otherwise the device MUST support VIRTIO_NET_CTRL_MQ_HASH_CONFIG command
>   per 5.1.6.5.6.4.
> 
> 
> So if we have not anyone of `vi->has_rss` and `vi->has_rss_hash_report`,
> we should return from virtnet_set_rxfh directly.

Makes sense. Although it is not clear to me how vi->has_rss_hash_report
is related here, but, I am convinced that we shouldn't do any RSS
operation if the device doesn't have the RSS feature, i.e, vi->has_rss
is false.

That said, I am thinking about something like this. How does it sound?

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 5a7700b103f8..8c1ad7361cf2 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -3780,6 +3780,9 @@ static int virtnet_set_rxfh(struct net_device 
*dev,
struct virtnet_info *vi = netdev_priv(dev);
int i;
 
+   if (!vi->has_rss)
+   return -EOPNOTSUPP;
+
if (rxfh->hfunc != ETH_RSS_HASH_NO_CHANGE &&
rxfh->hfunc != ETH_RSS_HASH_TOP)
return -EOPNOTSUPP;

Thanks!



Re: [PATCH] uprobes: reduce contention on uprobes_tree access

2024-03-25 Thread Google
On Thu, 21 Mar 2024 07:57:35 -0700
Jonathan Haslam  wrote:

> Active uprobes are stored in an RB tree and accesses to this tree are
> dominated by read operations. Currently these accesses are serialized by
> a spinlock but this leads to enormous contention when large numbers of
> threads are executing active probes.
> 
> This patch converts the spinlock used to serialize access to the
> uprobes_tree RB tree into a reader-writer spinlock. This lock type
> aligns naturally with the overwhelmingly read-only nature of the tree
> usage here. Although the addition of reader-writer spinlocks are
> discouraged [0], this fix is proposed as an interim solution while an
> RCU based approach is implemented (that work is in a nascent form). This
> fix also has the benefit of being trivial, self contained and therefore
> simple to backport.
> 
> This change has been tested against production workloads that exhibit
> significant contention on the spinlock and an almost order of magnitude
> reduction for mean uprobe execution time is observed (28 -> 3.5 microsecs).

Looks good to me.

Acked-by: Masami Hiramatsu (Google) 

BTW, how did you measure the overhead? I think spinlock overhead
will depend on how much lock contention happens.

Thank you,

> 
> [0] https://docs.kernel.org/locking/spinlocks.html
> 
> Signed-off-by: Jonathan Haslam 
> ---
>  kernel/events/uprobes.c | 22 +++---
>  1 file changed, 11 insertions(+), 11 deletions(-)
> 
> diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
> index 929e98c62965..42bf9b6e8bc0 100644
> --- a/kernel/events/uprobes.c
> +++ b/kernel/events/uprobes.c
> @@ -39,7 +39,7 @@ static struct rb_root uprobes_tree = RB_ROOT;
>   */
>  #define no_uprobe_events()   RB_EMPTY_ROOT(&uprobes_tree)
>  
> -static DEFINE_SPINLOCK(uprobes_treelock);/* serialize rbtree access */
> +static DEFINE_RWLOCK(uprobes_treelock);  /* serialize rbtree access */
>  
>  #define UPROBES_HASH_SZ  13
>  /* serialize uprobe->pending_list */
> @@ -669,9 +669,9 @@ static struct uprobe *find_uprobe(struct inode *inode, 
> loff_t offset)
>  {
>   struct uprobe *uprobe;
>  
> - spin_lock(&uprobes_treelock);
> + read_lock(&uprobes_treelock);
>   uprobe = __find_uprobe(inode, offset);
> - spin_unlock(&uprobes_treelock);
> + read_unlock(&uprobes_treelock);
>  
>   return uprobe;
>  }
> @@ -701,9 +701,9 @@ static struct uprobe *insert_uprobe(struct uprobe *uprobe)
>  {
>   struct uprobe *u;
>  
> - spin_lock(&uprobes_treelock);
> + write_lock(&uprobes_treelock);
>   u = __insert_uprobe(uprobe);
> - spin_unlock(&uprobes_treelock);
> + write_unlock(&uprobes_treelock);
>  
>   return u;
>  }
> @@ -935,9 +935,9 @@ static void delete_uprobe(struct uprobe *uprobe)
>   if (WARN_ON(!uprobe_is_active(uprobe)))
>   return;
>  
> - spin_lock(&uprobes_treelock);
> + write_lock(&uprobes_treelock);
>   rb_erase(&uprobe->rb_node, &uprobes_tree);
> - spin_unlock(&uprobes_treelock);
> + write_unlock(&uprobes_treelock);
>   RB_CLEAR_NODE(&uprobe->rb_node); /* for uprobe_is_active() */
>   put_uprobe(uprobe);
>  }
> @@ -1298,7 +1298,7 @@ static void build_probe_list(struct inode *inode,
>   min = vaddr_to_offset(vma, start);
>   max = min + (end - start) - 1;
>  
> - spin_lock(&uprobes_treelock);
> + read_lock(&uprobes_treelock);
>   n = find_node_in_range(inode, min, max);
>   if (n) {
>   for (t = n; t; t = rb_prev(t)) {
> @@ -1316,7 +1316,7 @@ static void build_probe_list(struct inode *inode,
>   get_uprobe(u);
>   }
>   }
> - spin_unlock(&uprobes_treelock);
> + read_unlock(&uprobes_treelock);
>  }
>  
>  /* @vma contains reference counter, not the probed instruction. */
> @@ -1407,9 +1407,9 @@ vma_has_uprobes(struct vm_area_struct *vma, unsigned 
> long start, unsigned long e
>   min = vaddr_to_offset(vma, start);
>   max = min + (end - start) - 1;
>  
> - spin_lock(&uprobes_treelock);
> + read_lock(&uprobes_treelock);
>   n = find_node_in_range(inode, min, max);
> - spin_unlock(&uprobes_treelock);
> + read_unlock(&uprobes_treelock);
>  
>   return !!n;
>  }
> -- 
> 2.43.0
> 


-- 
Masami Hiramatsu (Google) 



[PATCH net-next v3 2/2] net: udp: add IP/port data to the tracepoint udp/udp_fail_queue_rcv_skb

2024-03-25 Thread Balazs Scheidler
The udp_fail_queue_rcv_skb() tracepoint lacks any details on the source
and destination IP/port whereas this information can be critical in case
of UDP/syslog.
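
For illustration (not part of this patch), with the new fields in place
the drops hitting a syslog socket can be isolated from tracefs, assuming
the usual mount point:

	cd /sys/kernel/tracing
	echo "dport==514" > events/udp/udp_fail_queue_rcv_skb/filter
	echo 1 > events/udp/udp_fail_queue_rcv_skb/enable
	cat trace_pipe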

Signed-off-by: Balazs Scheidler 
---
 include/trace/events/udp.h | 29 -
 net/ipv4/udp.c |  2 +-
 net/ipv6/udp.c |  3 ++-
 3 files changed, 27 insertions(+), 7 deletions(-)

diff --git a/include/trace/events/udp.h b/include/trace/events/udp.h
index 336fe272889f..9c5abe23d0f5 100644
--- a/include/trace/events/udp.h
+++ b/include/trace/events/udp.h
@@ -7,24 +7,43 @@
 
 #include 
 #include 
+#include 
 
 TRACE_EVENT(udp_fail_queue_rcv_skb,
 
-   TP_PROTO(int rc, struct sock *sk),
+   TP_PROTO(int rc, struct sock *sk, struct sk_buff *skb),
 
-   TP_ARGS(rc, sk),
+   TP_ARGS(rc, sk, skb),
 
TP_STRUCT__entry(
__field(int, rc)
-   __field(__u16, lport)
+
+   __field(__u16, sport)
+   __field(__u16, dport)
+   __field(__u16, family)
+   __array(__u8, saddr, sizeof(struct sockaddr_in6))
+   __array(__u8, daddr, sizeof(struct sockaddr_in6))
),
 
TP_fast_assign(
+   const struct udphdr *uh = (const struct udphdr *)udp_hdr(skb);
+
__entry->rc = rc;
-   __entry->lport = inet_sk(sk)->inet_num;
+   
+   /* for filtering use */
+   __entry->sport = ntohs(uh->source);
+   __entry->dport = ntohs(uh->dest);
+   __entry->family = sk->sk_family;
+
+memset(__entry->saddr, 0, sizeof(struct sockaddr_in6));
+memset(__entry->daddr, 0, sizeof(struct sockaddr_in6));
+
+   TP_STORE_ADDR_PORTS_SKB(__entry, skb, uh);
),
 
-   TP_printk("rc=%d port=%hu", __entry->rc, __entry->lport)
+   TP_printk("rc=%d family=%s src=%pISpc dest=%pISpc", __entry->rc,
+ show_family_name(__entry->family),
+ __entry->saddr, __entry->daddr)
 );
 
 #endif /* _TRACE_UDP_H */
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 661d0e0d273f..531882f321f2 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -2049,8 +2049,8 @@ static int __udp_queue_rcv_skb(struct sock *sk, struct 
sk_buff *skb)
drop_reason = SKB_DROP_REASON_PROTO_MEM;
}
UDP_INC_STATS(sock_net(sk), UDP_MIB_INERRORS, is_udplite);
+   trace_udp_fail_queue_rcv_skb(rc, sk, skb);
kfree_skb_reason(skb, drop_reason);
-   trace_udp_fail_queue_rcv_skb(rc, sk);
return -1;
}
 
diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index 7c1e6469d091..2e4dc5e6137b 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -34,6 +34,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -658,8 +659,8 @@ static int __udpv6_queue_rcv_skb(struct sock *sk, struct 
sk_buff *skb)
drop_reason = SKB_DROP_REASON_PROTO_MEM;
}
UDP6_INC_STATS(sock_net(sk), UDP_MIB_INERRORS, is_udplite);
+   trace_udp_fail_queue_rcv_skb(rc, sk, skb);
kfree_skb_reason(skb, drop_reason);
-   trace_udp_fail_queue_rcv_skb(rc, sk);
return -1;
}
 
-- 
2.40.1




[PATCH net-next v2 1/3] trace: adjust TP_STORE_ADDR_PORTS_SKB() parameters

2024-03-25 Thread Jason Xing
From: Jason Xing 

Introduce entry_saddr and entry_daddr parameters in this macro so that
later callers can record the reverse 4-tuple by analyzing the 4-tuple
of the incoming skb on the receive path.
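
For illustration (a sketch of the intended use, not part of this patch),
a tracepoint that wants the reverse tuple of an incoming skb can simply
swap the two destination buffers:

	/* skb's source lands in __entry->daddr, skb's dest in __entry->saddr */
	TP_STORE_ADDR_PORTS_SKB(skb, __entry->daddr, __entry->saddr);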

Signed-off-by: Jason Xing 
---
 include/trace/events/tcp.h | 21 +++--
 1 file changed, 11 insertions(+), 10 deletions(-)

diff --git a/include/trace/events/tcp.h b/include/trace/events/tcp.h
index 699dafd204ea..2495a1d579be 100644
--- a/include/trace/events/tcp.h
+++ b/include/trace/events/tcp.h
@@ -302,15 +302,15 @@ TRACE_EVENT(tcp_probe,
  __entry->skbaddr, __entry->skaddr)
 );
 
-#define TP_STORE_ADDR_PORTS_SKB_V4(__entry, skb)   \
+#define TP_STORE_ADDR_PORTS_SKB_V4(skb, entry_saddr, entry_daddr)  \
do {\
const struct tcphdr *th = (const struct tcphdr *)skb->data; \
-   struct sockaddr_in *v4 = (void *)__entry->saddr;\
+   struct sockaddr_in *v4 = (void *)entry_saddr;   \
\
v4->sin_family = AF_INET;   \
v4->sin_port = th->source;  \
v4->sin_addr.s_addr = ip_hdr(skb)->saddr;   \
-   v4 = (void *)__entry->daddr;\
+   v4 = (void *)entry_daddr;   \
v4->sin_family = AF_INET;   \
v4->sin_port = th->dest;\
v4->sin_addr.s_addr = ip_hdr(skb)->daddr;   \
@@ -318,29 +318,30 @@ TRACE_EVENT(tcp_probe,
 
 #if IS_ENABLED(CONFIG_IPV6)
 
-#define TP_STORE_ADDR_PORTS_SKB(__entry, skb)  \
+#define TP_STORE_ADDR_PORTS_SKB(skb, entry_saddr, entry_daddr) \
do {\
const struct iphdr *iph = ip_hdr(skb);  \
\
if (iph->version == 6) {\
const struct tcphdr *th = (const struct tcphdr 
*)skb->data; \
-   struct sockaddr_in6 *v6 = (void *)__entry->saddr; \
+   struct sockaddr_in6 *v6 = (void *)entry_saddr;  \
\
v6->sin6_family = AF_INET6; \
v6->sin6_port = th->source; \
v6->sin6_addr = ipv6_hdr(skb)->saddr;   \
-   v6 = (void *)__entry->daddr;\
+   v6 = (void *)entry_daddr;   \
v6->sin6_family = AF_INET6; \
v6->sin6_port = th->dest;   \
v6->sin6_addr = ipv6_hdr(skb)->daddr;   \
} else  \
-   TP_STORE_ADDR_PORTS_SKB_V4(__entry, skb);   \
+   TP_STORE_ADDR_PORTS_SKB_V4(skb, entry_saddr,\
+  entry_daddr); \
} while (0)
 
 #else
 
-#define TP_STORE_ADDR_PORTS_SKB(__entry, skb)  \
-   TP_STORE_ADDR_PORTS_SKB_V4(__entry, skb)
+#define TP_STORE_ADDR_PORTS_SKB(skb, entry_saddr, entry_daddr) \
+   TP_STORE_ADDR_PORTS_SKB_V4(skb, entry_saddr, entry_daddr)
 
 #endif
 
@@ -365,7 +366,7 @@ DECLARE_EVENT_CLASS(tcp_event_skb,
memset(__entry->saddr, 0, sizeof(struct sockaddr_in6));
memset(__entry->daddr, 0, sizeof(struct sockaddr_in6));
 
-   TP_STORE_ADDR_PORTS_SKB(__entry, skb);
+   TP_STORE_ADDR_PORTS_SKB(skb, __entry->saddr, __entry->daddr);
),
 
TP_printk("skbaddr=%p src=%pISpc dest=%pISpc",
-- 
2.37.3




Re: Re: Re: [PATCH v3 resend] net/ipv4: add tracepoint for icmp_send

2024-03-25 Thread Peilin He
>> >> -
>> >> v2->v3:
>> >> Some fixes according to
>> >> https://lore.kernel.org/all/20240319102549.7f7f6...@gandalf.local.home/
>> >> 1. Change the tracking directory to/sys/kernel/tracking.
>> >> 2. Adjust the layout of the TP-STRUCT_entry parameter structure.
>> >>
>> >> v1->v2:
>> >> Some fixes according to
>> >> https://lore.kernel.org/all/CANn89iL-y9e_VFpdw=sZtRnKRu_tnUwqHuFQTJvJsv-nz1x...@mail.gmail.com/
>> >> 1. adjust the trace_icmp_send() to more protocols than UDP.
>> >> 2. move the calling of trace_icmp_send after sanity checks
>> >> in __icmp_send().
>> >>
>> >> Signed-off-by: Peilin He
>> >> Reviewed-by: xu xin 
>> >> Reviewed-by: Yunkai Zhang 
>> >> Cc: Yang Yang 
>> >> Cc: Liu Chun 
>> >> Cc: Xuexin Jiang 
>> >
>> >I think it would be better to target net-next tree since it's not a
>> >fix or something else important.
>> >
>> OK. I would target it for net-next.
>> >> ---
>> >>  include/trace/events/icmp.h | 64 +
>> >>  net/ipv4/icmp.c |  4 +++
>> >>  2 files changed, 68 insertions(+)
>> >>  create mode 100644 include/trace/events/icmp.h
>> >>
>> >> diff --git a/include/trace/events/icmp.h b/include/trace/events/icmp.h
>> >> new file mode 100644
>> >> index ..2098d4b1b12e
>> >> --- /dev/null
>> >> +++ b/include/trace/events/icmp.h
>> >> @@ -0,0 +1,64 @@
>> >> +/* SPDX-License-Identifier: GPL-2.0 */
>> >> +#undef TRACE_SYSTEM
>> >> +#define TRACE_SYSTEM icmp
>> >> +
>> >> +#if !defined(_TRACE_ICMP_H) || defined(TRACE_HEADER_MULTI_READ)
>> >> +#define _TRACE_ICMP_H
>> >> +
>> >> +#include 
>> >> +#include 
>> >> +
>> >> +TRACE_EVENT(icmp_send,
>> >> +
>> >> +   TP_PROTO(const struct sk_buff *skb, int type, int code),
>> >> +
>> >> +   TP_ARGS(skb, type, code),
>> >> +
>> >> +   TP_STRUCT__entry(
>> >> +   __field(const void *, skbaddr)
>> >> +   __field(int, type)
>> >> +   __field(int, code)
>> >> +   __array(__u8, saddr, 4)
>> >> +   __array(__u8, daddr, 4)
>> >> +   __field(__u16, sport)
>> >> +   __field(__u16, dport)
>> >> +   __field(unsigned short, ulen)
>> >> +   ),
>> >> +
>> >> +   TP_fast_assign(
>> >> +   struct iphdr *iph = ip_hdr(skb);
>> >> +   int proto_4 = iph->protocol;
>> >> +   __be32 *p32;
>> >> +
>> >> +   __entry->skbaddr = skb;
>> >> +   __entry->type = type;
>> >> +   __entry->code = code;
>> >> +
>> >> +   if (proto_4 == IPPROTO_UDP) {
>> >> +   struct udphdr *uh = udp_hdr(skb);
>> >> +   __entry->sport = ntohs(uh->source);
>> >> +   __entry->dport = ntohs(uh->dest);
>> >> +   __entry->ulen = ntohs(uh->len);
>> >> +   } else {
>> >> +   __entry->sport = 0;
>> >> +   __entry->dport = 0;
>> >> +   __entry->ulen = 0;
>> >> +   }
>> >
>> >What about using the TP_STORE_ADDR_PORTS_SKB macro to record the sport
>> >and dport like the patch[1] did through extending the use of header
>> >for TCP and UDP?
>> >
>> I believe patch[1] is a good idea as it moves the TCP protocol parsing
>> previously done inside the TP_STORE_ADDR_PORTS_SKB macro to TP_fast_assign,
>> and extracts the TP_STORE_ADDR_PORTS_SKB macro into a common file,
>> enabling support for both UDP and TCP protocol parsing simultaneously.
>>
>> However, patch[1] only extracts the source and destination addresses of
>> the packet, but does not extract the source port and destination port,
>> which limits the significance of my submitted patch.
>
>No, please take a look at TP_STORE_ADDR_PORTS_SKB() macro again. It
>records 4-tuples of the flow.
>
>Thanks,
>Jason
>
Okay, after patch [1] is merged, we will propose an optimization patch based on 
it.
>>
>> Perhaps the patch[1] could be referenced for integration after it is merged.
>> >And, I wonder what the use of tracing ulen of that skb?
>> >
>> The tracking of ulen is primarily aimed at ensuring the legality of received
>> UDP packets and providing developers with more detailed information
>> on exceptions. See net/ipv4/udp.c:2494-2501.
>> >[1]: https://lore.kernel.org/all/1c7156a3f164eb33ef3a25b8432e359f0bb60a8e.1710866188.git.balazs.scheid...@axoflow.com/
>> >
>> >Thanks,
>> >Jason
>> >
>> >> +
>> >> +   p32 = (__be32 *) __entry->saddr;
>> >> +   *p32 = iph->saddr;
>> >> +
>> >> +   p32 = (__be32 *) __entry->daddr;
>> >> +  

Re: [PATCH 64/64] i2c: reword i2c_algorithm in drivers according to newest specification

2024-03-25 Thread Oleksij Rempel
On Fri, Mar 22, 2024 at 02:25:57PM +0100, Wolfram Sang wrote:
> Match the wording in i2c_algorithm in I2C drivers wrt. the newest I2C
> v7, SMBus 3.2, I3C specifications and replace "master/slave" with more
> appropriate terms. For some drivers, this means no more conversions are
> needed. For the others more work needs to be done but this will be
> performed incrementally along with API changes/improvements. All these
> changes here are simple search/replace results.
> 
> Signed-off-by: Wolfram Sang 

Acked-by: Oleksij Rempel  # for i2c-imx.c 

-- 
Pengutronix e.K.   | |
Steuerwalder Str. 21   | http://www.pengutronix.de/  |
31137 Hildesheim, Germany  | Phone: +49-5121-206917-0|
Amtsgericht Hildesheim, HRA 2686   | Fax:   +49-5121-206917- |



Re: [PATCH] virtio_ring: Fix the stale index in available ring

2024-03-25 Thread Gavin Shan



On 3/20/24 17:14, Michael S. Tsirkin wrote:

On Wed, Mar 20, 2024 at 03:24:16PM +1000, Gavin Shan wrote:

On 3/20/24 10:49, Michael S. Tsirkin wrote:>

diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
index 6f7e5010a673..79456706d0bd 100644
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -685,7 +685,8 @@ static inline int virtqueue_add_split(struct virtqueue *_vq,
/* Put entry in available array (but don't update avail->idx until they
 * do sync). */
avail = vq->split.avail_idx_shadow & (vq->split.vring.num - 1);
-   vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, head);
+   u16 headwithflag = head | (vq->split.avail_idx_shadow & ~(vq->split.vring.num - 1));
+   vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, headwithflag);
/* Descriptors and available array need to be set before we expose the
 * new available array entries. */



Ok, Michael. I continued with my debugging code. It still looks like a
hardware bug on NVidia's grace-hopper. I really think NVidia needs to be
involved for the discussion, as suggested by you.

Firstly, I bind the vhost process and vCPU thread to CPU#71 and CPU#70.
Note that I have only one vCPU in my configuration.

Secondly, the debugging code is enhanced so that the available head for
(last_avail_idx - 1) is read twice and recorded. It means the available
head for one specific available index is read twice. I do see the
available heads differ between the two consecutive reads. More details
are shared below.

From the guest side
===

virtio_net virtio0: output.0:id 86 is not a head!
head to be released: 047 062 112

avail_idx:
000  49665
001  49666  <--
 :
015  49664

avail_head:
000  062
001  047  <--
 :
015  112

From the host side
==

avail_idx
000  49663
001  49666  <---
 :

avail_head
000  062  (062)
001  047  (047)  <---
 :
015  086  (112)  // head 086 is returned from the first read,
 // but head 112 is returned from the second read

vhost_get_vq_desc: Inconsistent head in two read (86 -> 112) for avail_idx 49664
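
A minimal sketch of the double-read instrumentation described above
(illustrative only; read_avail_ring() is a placeholder, not the real
vhost helper):

	u16 idx = last_avail_idx & (vq->num - 1);
	u16 head1, head2;

	head1 = read_avail_ring(vq, idx);	/* first read of the slot */
	smp_rmb();				/* order the two reads */
	head2 = read_avail_ring(vq, idx);	/* read the same slot again */
	if (head1 != head2)
		vq_err(vq, "Inconsistent head in two read (%u -> %u) for avail_idx %u\n",
		       head1, head2, last_avail_idx);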

Thanks,
Gavin





Re: Re: [PATCH v3 resend] net/ipv4: add tracepoint for icmp_send

2024-03-25 Thread Jason Xing
On Mon, Mar 25, 2024 at 12:05 PM Peilin He  wrote:
>
> >> -
> >> v2->v3:
> >> Some fixes according to
> >> https://lore.kernel.org/all/20240319102549.7f7f6...@gandalf.local.home/
> >> 1. Change the tracking directory to/sys/kernel/tracking.
> >> 2. Adjust the layout of the TP-STRUCT_entry parameter structure.
> >>
> >> v1->v2:
> >> Some fixes according to
> >> https://lore.kernel.org/all/CANn89iL-y9e_VFpdw=sZtRnKRu_tnUwqHuFQTJvJsv-nz1x...@mail.gmail.com/
> >> 1. adjust the trace_icmp_send() to more protocols than UDP.
> >> 2. move the calling of trace_icmp_send after sanity checks
> >> in __icmp_send().
> >>
> >> Signed-off-by: Peilin He
> >> Reviewed-by: xu xin 
> >> Reviewed-by: Yunkai Zhang 
> >> Cc: Yang Yang 
> >> Cc: Liu Chun 
> >> Cc: Xuexin Jiang 
> >
> >I think it would be better to target net-next tree since it's not a
> >fix or something else important.
> >
> OK. I would target it for net-next.
> >> ---
> >>  include/trace/events/icmp.h | 64 +
> >>  net/ipv4/icmp.c |  4 +++
> >>  2 files changed, 68 insertions(+)
> >>  create mode 100644 include/trace/events/icmp.h
> >>
> >> diff --git a/include/trace/events/icmp.h b/include/trace/events/icmp.h
> >> new file mode 100644
> >> index ..2098d4b1b12e
> >> --- /dev/null
> >> +++ b/include/trace/events/icmp.h
> >> @@ -0,0 +1,64 @@
> >> +/* SPDX-License-Identifier: GPL-2.0 */
> >> +#undef TRACE_SYSTEM
> >> +#define TRACE_SYSTEM icmp
> >> +
> >> +#if !defined(_TRACE_ICMP_H) || defined(TRACE_HEADER_MULTI_READ)
> >> +#define _TRACE_ICMP_H
> >> +
> >> +#include 
> >> +#include 
> >> +
> >> +TRACE_EVENT(icmp_send,
> >> +
> >> +   TP_PROTO(const struct sk_buff *skb, int type, int code),
> >> +
> >> +   TP_ARGS(skb, type, code),
> >> +
> >> +   TP_STRUCT__entry(
> >> +   __field(const void *, skbaddr)
> >> +   __field(int, type)
> >> +   __field(int, code)
> >> +   __array(__u8, saddr, 4)
> >> +   __array(__u8, daddr, 4)
> >> +   __field(__u16, sport)
> >> +   __field(__u16, dport)
> >> +   __field(unsigned short, ulen)
> >> +   ),
> >> +
> >> +   TP_fast_assign(
> >> +   struct iphdr *iph = ip_hdr(skb);
> >> +   int proto_4 = iph->protocol;
> >> +   __be32 *p32;
> >> +
> >> +   __entry->skbaddr = skb;
> >> +   __entry->type = type;
> >> +   __entry->code = code;
> >> +
> >> +   if (proto_4 == IPPROTO_UDP) {
> >> +   struct udphdr *uh = udp_hdr(skb);
> >> +   __entry->sport = ntohs(uh->source);
> >> +   __entry->dport = ntohs(uh->dest);
> >> +   __entry->ulen = ntohs(uh->len);
> >> +   } else {
> >> +   __entry->sport = 0;
> >> +   __entry->dport = 0;
> >> +   __entry->ulen = 0;
> >> +   }
> >
> >What about using the TP_STORE_ADDR_PORTS_SKB macro to record the sport
> >and dport like the patch[1] did through extending the use of header
> >for TCP and UDP?
> >
> I believe patch[1] is a good idea as it moves the TCP protocol parsing
> previously done inside the TP_STORE_ADDR_PORTS_SKB macro to TP_fast_assign,
> and extracts the TP_STORE_ADDR_PORTS_SKB macro into a common file,
> enabling support for both UDP and TCP protocol parsing simultaneously.
>
> However, patch[1] only extracts the source and destination addresses of
> the packet, but does not extract the source port and destination port,
> which limits the significance of my submitted patch.

No, please take a look at TP_STORE_ADDR_PORTS_SKB() macro again. It
records 4-tuples of the flow.

Thanks,
Jason

>
> Perhaps the patch[1] could be referenced for integration after it is merged.
> >And, I wonder what the use of tracing ulen of that skb?
> >
> The tracking of ulen is primarily aimed at ensuring the legality of received
> UDP packets and providing developers with more detailed information
> on exceptions. See net/ipv4/udp.c:2494-2501.
> >[1]: https://lore.kernel.org/all/1c7156a3f164eb33ef3a25b8432e359f0bb60a8e.1710866188.git.balazs.scheid...@axoflow.com/
> >
> >Thanks,
> >Jason
> >
> >> +
> >> +   p32 = (__be32 *) __entry->saddr;
> >> +   *p32 = iph->saddr;
> >> +
> >> +   p32 = (__be32 *) __entry->daddr;
> >> +   *p32 = iph->daddr;
> >> +   ),
> >> +
> >> +   TP_printk("icmp_send: type=%d, code=%d. From %pI4:%u to %pI4:%u ulen=%d skbaddr=%p",
> >> +   __entry->type, 

[PATCH net-next 1/3] trace: move to TP_STORE_ADDRS related macro to net_probe_common.h

2024-03-25 Thread Jason Xing
From: Jason Xing 

Move the macro into a standalone header so it is easier to extend.
Other tracepoints can use this common part in the future.
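
As an illustration of the intended reuse (not part of this patch), a
sock-based tracepoint can fill its saddr_v6/daddr_v6 fields with one
call, exactly as the existing TCP events in tcp.h do (sk being the
tracepoint argument):

	TP_fast_assign(
		struct inet_sock *inet = inet_sk(sk);

		TP_STORE_ADDRS(__entry, inet->inet_saddr, inet->inet_daddr,
			       sk->sk_v6_rcv_saddr, sk->sk_v6_daddr);
	),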

Signed-off-by: Jason Xing 
---
 include/trace/events/net_probe_common.h | 29 +
 include/trace/events/tcp.h  | 29 -
 2 files changed, 29 insertions(+), 29 deletions(-)

diff --git a/include/trace/events/net_probe_common.h 
b/include/trace/events/net_probe_common.h
index 3930119cab08..b1f9a4d3ee13 100644
--- a/include/trace/events/net_probe_common.h
+++ b/include/trace/events/net_probe_common.h
@@ -41,4 +41,33 @@
 
 #endif
 
+#define TP_STORE_V4MAPPED(__entry, saddr, daddr)   \
+   do {\
+   struct in6_addr *pin6;  \
+   \
+   pin6 = (struct in6_addr *)__entry->saddr_v6;\
+   ipv6_addr_set_v4mapped(saddr, pin6);\
+   pin6 = (struct in6_addr *)__entry->daddr_v6;\
+   ipv6_addr_set_v4mapped(daddr, pin6);\
+   } while (0)
+
+#if IS_ENABLED(CONFIG_IPV6)
+#define TP_STORE_ADDRS(__entry, saddr, daddr, saddr6, daddr6)  \
+   do {\
+   if (sk->sk_family == AF_INET6) {\
+   struct in6_addr *pin6;  \
+   \
+   pin6 = (struct in6_addr *)__entry->saddr_v6;\
+   *pin6 = saddr6; \
+   pin6 = (struct in6_addr *)__entry->daddr_v6;\
+   *pin6 = daddr6; \
+   } else {\
+   TP_STORE_V4MAPPED(__entry, saddr, daddr);   \
+   }   \
+   } while (0)
+#else
+#define TP_STORE_ADDRS(__entry, saddr, daddr, saddr6, daddr6)  \
+   TP_STORE_V4MAPPED(__entry, saddr, daddr)
+#endif
+
 #endif
diff --git a/include/trace/events/tcp.h b/include/trace/events/tcp.h
index 699dafd204ea..3c08a0846c47 100644
--- a/include/trace/events/tcp.h
+++ b/include/trace/events/tcp.h
@@ -12,35 +12,6 @@
 #include 
 #include 
 
-#define TP_STORE_V4MAPPED(__entry, saddr, daddr)   \
-   do {\
-   struct in6_addr *pin6;  \
-   \
-   pin6 = (struct in6_addr *)__entry->saddr_v6;\
-   ipv6_addr_set_v4mapped(saddr, pin6);\
-   pin6 = (struct in6_addr *)__entry->daddr_v6;\
-   ipv6_addr_set_v4mapped(daddr, pin6);\
-   } while (0)
-
-#if IS_ENABLED(CONFIG_IPV6)
-#define TP_STORE_ADDRS(__entry, saddr, daddr, saddr6, daddr6)  \
-   do {\
-   if (sk->sk_family == AF_INET6) {\
-   struct in6_addr *pin6;  \
-   \
-   pin6 = (struct in6_addr *)__entry->saddr_v6;\
-   *pin6 = saddr6; \
-   pin6 = (struct in6_addr *)__entry->daddr_v6;\
-   *pin6 = daddr6; \
-   } else {\
-   TP_STORE_V4MAPPED(__entry, saddr, daddr);   \
-   }   \
-   } while (0)
-#else
-#define TP_STORE_ADDRS(__entry, saddr, daddr, saddr6, daddr6)  \
-   TP_STORE_V4MAPPED(__entry, saddr, daddr)
-#endif
-
 /*
  * tcp event with arguments sk and skb
  *
-- 
2.37.3
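
For context, after this move a converted tracepoint keeps calling the
macro from its TP_fast_assign(), as the sock.h events touched later in
this series do; a minimal sketch (entry field names assumed to match
inet_sock_set_state):

	TP_fast_assign(
		const struct inet_sock *inet = inet_sk(sk);
		__be32 *p32;

		p32 = (__be32 *)__entry->saddr;
		*p32 = inet->inet_saddr;

		p32 = (__be32 *)__entry->daddr;
		*p32 = inet->inet_daddr;

		/* expands to the native v6 copy or the v4-mapped fallback */
		TP_STORE_ADDRS(__entry, inet->inet_saddr, inet->inet_daddr,
			       sk->sk_v6_rcv_saddr, sk->sk_v6_daddr);
	),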




Re: [PATCH 64/64] i2c: reword i2c_algorithm in drivers according to newest specification

2024-03-25 Thread Jarkko Nikula

On 3/22/24 3:25 PM, Wolfram Sang wrote:

Match the wording in i2c_algorithm in I2C drivers wrt. the newest I2C
v7, SMBus 3.2, I3C specifications and replace "master/slave" with more
appropriate terms. For some drivers, this means no more conversions are
needed. For the others more work needs to be done but this will be
performed incrementally along with API changes/improvements. All these
changes here are simple search/replace results.

Signed-off-by: Wolfram Sang 
---
  drivers/i2c/busses/i2c-amd-mp2-plat.c  |  2 +-
  drivers/i2c/busses/i2c-at91-master.c   |  2 +-
  drivers/i2c/busses/i2c-at91-slave.c|  8 
  drivers/i2c/busses/i2c-axxia.c | 10 +-
  drivers/i2c/busses/i2c-cros-ec-tunnel.c|  2 +-
  drivers/i2c/busses/i2c-designware-master.c |  2 +-
  drivers/i2c/busses/i2c-designware-slave.c  |  8 
  drivers/i2c/busses/i2c-diolan-u2c.c|  2 +-
  drivers/i2c/busses/i2c-exynos5.c   |  4 ++--
  drivers/i2c/busses/i2c-gxp.c   | 12 ++--
  drivers/i2c/busses/i2c-hisi.c  |  4 ++--
  drivers/i2c/busses/i2c-img-scb.c   |  2 +-
  drivers/i2c/busses/i2c-imx.c   | 12 ++--
  drivers/i2c/busses/i2c-jz4780.c|  2 +-
  drivers/i2c/busses/i2c-kempld.c|  2 +-
  drivers/i2c/busses/i2c-meson.c |  4 ++--
  drivers/i2c/busses/i2c-mlxbf.c |  8 
  drivers/i2c/busses/i2c-mt65xx.c|  2 +-
  drivers/i2c/busses/i2c-mxs.c   |  2 +-
  drivers/i2c/busses/i2c-nomadik.c   |  2 +-
  drivers/i2c/busses/i2c-npcm7xx.c   | 12 ++--
  drivers/i2c/busses/i2c-nvidia-gpu.c|  4 ++--
  drivers/i2c/busses/i2c-ocores.c|  8 
  drivers/i2c/busses/i2c-octeon-platdrv.c|  2 +-
  drivers/i2c/busses/i2c-omap.c  |  4 ++--
  drivers/i2c/busses/i2c-opal.c  |  4 ++--
  drivers/i2c/busses/i2c-pasemi-core.c   |  2 +-
  drivers/i2c/busses/i2c-pnx.c   |  2 +-
  drivers/i2c/busses/i2c-pxa.c   | 12 ++--
  drivers/i2c/busses/i2c-qcom-cci.c  |  2 +-
  drivers/i2c/busses/i2c-qcom-geni.c |  2 +-
  drivers/i2c/busses/i2c-robotfuzz-osif.c|  2 +-
  drivers/i2c/busses/i2c-rzv2m.c |  8 
  drivers/i2c/busses/i2c-s3c2410.c   |  4 ++--
  drivers/i2c/busses/i2c-stm32f7.c   | 14 +++---
  drivers/i2c/busses/i2c-tegra-bpmp.c|  4 ++--
  drivers/i2c/busses/i2c-tegra.c |  4 ++--
  drivers/i2c/busses/i2c-thunderx-pcidrv.c   |  2 +-
  drivers/i2c/busses/i2c-virtio.c|  2 +-
  drivers/i2c/busses/i2c-wmt.c   |  2 +-
  drivers/i2c/busses/i2c-xiic.c  |  2 +-
  41 files changed, 95 insertions(+), 95 deletions(-)




diff --git a/drivers/i2c/busses/i2c-designware-master.c b/drivers/i2c/busses/i2c-designware-master.c
index c7e56002809a..14c61b31f877 100644
--- a/drivers/i2c/busses/i2c-designware-master.c
+++ b/drivers/i2c/busses/i2c-designware-master.c
@@ -832,7 +832,7 @@ i2c_dw_xfer(struct i2c_adapter *adap, struct i2c_msg msgs[], int num)
  }
  
  static const struct i2c_algorithm i2c_dw_algo = {

-   .master_xfer = i2c_dw_xfer,
+   .xfer = i2c_dw_xfer,
.functionality = i2c_dw_func,
  };
  
diff --git a/drivers/i2c/busses/i2c-designware-slave.c b/drivers/i2c/busses/i2c-designware-slave.c

index 2e079cf20bb5..b47ad6b16814 100644
--- a/drivers/i2c/busses/i2c-designware-slave.c
+++ b/drivers/i2c/busses/i2c-designware-slave.c
@@ -58,7 +58,7 @@ static int i2c_dw_init_slave(struct dw_i2c_dev *dev)
return 0;
  }
  
-static int i2c_dw_reg_slave(struct i2c_client *slave)

+static int i2c_dw_reg_target(struct i2c_client *slave)
  {
struct dw_i2c_dev *dev = i2c_get_adapdata(slave->adapter);
  
@@ -83,7 +83,7 @@ static int i2c_dw_reg_slave(struct i2c_client *slave)

return 0;
  }
  
-static int i2c_dw_unreg_slave(struct i2c_client *slave)

+static int i2c_dw_unreg_target(struct i2c_client *slave)
  {
struct dw_i2c_dev *dev = i2c_get_adapdata(slave->adapter);
  
@@ -214,8 +214,8 @@ static irqreturn_t i2c_dw_isr_slave(int this_irq, void *dev_id)
  
  static const struct i2c_algorithm i2c_dw_algo = {

.functionality = i2c_dw_func,
-   .reg_slave = i2c_dw_reg_slave,
-   .unreg_slave = i2c_dw_unreg_slave,
+   .reg_target = i2c_dw_reg_target,
+   .unreg_target = i2c_dw_unreg_target,
  };


Acked-by: Jarkko Nikula 



[PATCH net-next v2 2/3] trace: tcp: fully support trace_tcp_send_reset

2024-03-25 Thread Jason Xing
From: Jason Xing 

Prior to this patch, what we can see by enabling the tcp_send_reset
tracepoint only happens under two circumstances:
1) active rst mode
2) non-active rst mode and based on the full socket

That means the inconsistency occurs if we use tcpdump and trace
simultaneously to see how rst happens.

It's necessary that we take other cases into consideration as well,
say:
1) time-wait socket
2) no socket
...

By parsing the incoming skb and reversing its 4-tuple, we can know the
exact 'flow', for which a socket might not exist.

Samples after applying this patch:
1. tcp_send_reset: skbaddr=XXX skaddr=XXX src=ip:port dest=ip:port
state=TCP_ESTABLISHED
2. tcp_send_reset: skbaddr=000...000 skaddr=XXX src=ip:port dest=ip:port
state=UNKNOWN
Note:
1) UNKNOWN means we cannot extract the right information from skb.
2) skbaddr/skaddr could be 0

Signed-off-by: Jason Xing 
---
 include/trace/events/tcp.h | 39 --
 net/ipv4/tcp_ipv4.c|  4 ++--
 net/ipv6/tcp_ipv6.c|  3 ++-
 3 files changed, 41 insertions(+), 5 deletions(-)

diff --git a/include/trace/events/tcp.h b/include/trace/events/tcp.h
index 2495a1d579be..a13eb2147a02 100644
--- a/include/trace/events/tcp.h
+++ b/include/trace/events/tcp.h
@@ -107,11 +107,46 @@ DEFINE_EVENT(tcp_event_sk_skb, tcp_retransmit_skb,
  * skb of trace_tcp_send_reset is the skb that caused RST. In case of
  * active reset, skb should be NULL
  */
-DEFINE_EVENT(tcp_event_sk_skb, tcp_send_reset,
+TRACE_EVENT(tcp_send_reset,
 
TP_PROTO(const struct sock *sk, const struct sk_buff *skb),
 
-   TP_ARGS(sk, skb)
+   TP_ARGS(sk, skb),
+
+   TP_STRUCT__entry(
+   __field(const void *, skbaddr)
+   __field(const void *, skaddr)
+   __field(int, state)
+   __array(__u8, saddr, sizeof(struct sockaddr_in6))
+   __array(__u8, daddr, sizeof(struct sockaddr_in6))
+   ),
+
+   TP_fast_assign(
+   __entry->skbaddr = skb;
+   __entry->skaddr = sk;
+   /* Zero means unknown state. */
+   __entry->state = sk ? sk->sk_state : 0;
+
+   memset(__entry->saddr, 0, sizeof(struct sockaddr_in6));
+   memset(__entry->daddr, 0, sizeof(struct sockaddr_in6));
+
+   if (sk && sk_fullsock(sk)) {
+   const struct inet_sock *inet = inet_sk(sk);
+
+   TP_STORE_ADDR_PORTS(__entry, inet, sk);
+   } else {
+   /*
+* We should reverse the 4-tuple of skb, so later
+* it can print the right flow direction of rst.
+*/
+   TP_STORE_ADDR_PORTS_SKB(skb, entry->daddr, entry->saddr);
+   }
+   ),
+
+   TP_printk("skbaddr=%p skaddr=%p src=%pISpc dest=%pISpc state=%s",
+ __entry->skbaddr, __entry->skaddr,
+ __entry->saddr, __entry->daddr,
+ __entry->state ? show_tcp_state_name(__entry->state) : "UNKNOWN")
 );
 
 /*
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index a22ee5838751..d5c4a969c066 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -868,10 +868,10 @@ static void tcp_v4_send_reset(const struct sock *sk, struct sk_buff *skb)
 */
if (sk) {
arg.bound_dev_if = sk->sk_bound_dev_if;
-   if (sk_fullsock(sk))
-   trace_tcp_send_reset(sk, skb);
}
 
+   trace_tcp_send_reset(sk, skb);
+
BUILD_BUG_ON(offsetof(struct sock, sk_bound_dev_if) !=
 offsetof(struct inet_timewait_sock, tw_bound_dev_if));
 
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index 3f4cba49e9ee..8e9c59b6c00c 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -1113,7 +1113,6 @@ static void tcp_v6_send_reset(const struct sock *sk, struct sk_buff *skb)
if (sk) {
oif = sk->sk_bound_dev_if;
if (sk_fullsock(sk)) {
-   trace_tcp_send_reset(sk, skb);
if (inet6_test_bit(REPFLOW, sk))
label = ip6_flowlabel(ipv6h);
priority = READ_ONCE(sk->sk_priority);
@@ -1129,6 +1128,8 @@ static void tcp_v6_send_reset(const struct sock *sk, struct sk_buff *skb)
label = ip6_flowlabel(ipv6h);
}
 
+   trace_tcp_send_reset(sk, skb);
+
tcp_v6_send_response(sk, skb, seq, ack_seq, 0, 0, 0, oif, 1,
 ipv6_get_dsfield(ipv6h), label, priority, txhash,
 );
-- 
2.37.3




[PATCH net-next v2 0/3] tcp: make trace of reset logic complete

2024-03-25 Thread Jason Xing
From: Jason Xing 

Before this, we miss some cases where the TCP layer could send rst but
we cannot trace it. So I decided to complete it :)

v2
1. fix spelling mistakes

Jason Xing (3):
  trace: adjust TP_STORE_ADDR_PORTS_SKB() parameters
  trace: tcp: fully support trace_tcp_send_reset
  tcp: add location into reset trace process

 include/trace/events/tcp.h | 68 ++
 net/ipv4/tcp_ipv4.c|  4 +--
 net/ipv4/tcp_output.c  |  2 +-
 net/ipv6/tcp_ipv6.c|  3 +-
 4 files changed, 60 insertions(+), 17 deletions(-)

-- 
2.37.3




Re: [PATCH] virtio_net: Do not send RSS key if it is not supported

2024-03-25 Thread Xuan Zhuo
On Fri, 22 Mar 2024 03:21:21 -0700, Breno Leitao  wrote:
> Hello Xuan,
>
> On Fri, Mar 22, 2024 at 10:00:22AM +0800, Xuan Zhuo wrote:
> > On Thu, 21 Mar 2024 09:54:30 -0700, Breno Leitao  wrote:
>
> > > 4) Since the command above does not have a key, then the last
> > >scatter-gatter entry will be zeroed, since rss_key_size == 0.
> > > sg_buf_size = vi->rss_key_size;
> >
> >
> >
> > if (vi->has_rss || vi->has_rss_hash_report) {
> > vi->rss_indir_table_size =
> > virtio_cread16(vdev, offsetof(struct virtio_net_config,
> > rss_max_indirection_table_length));
> > vi->rss_key_size =
> > virtio_cread8(vdev, offsetof(struct virtio_net_config, 
> > rss_max_key_size));
> >
> > vi->rss_hash_types_supported =
> > virtio_cread32(vdev, offsetof(struct virtio_net_config, 
> > supported_hash_types));
> > vi->rss_hash_types_supported &=
> > ~(VIRTIO_NET_RSS_HASH_TYPE_IP_EX |
> >   VIRTIO_NET_RSS_HASH_TYPE_TCP_EX |
> >   VIRTIO_NET_RSS_HASH_TYPE_UDP_EX);
> >
> > dev->hw_features |= NETIF_F_RXHASH;
> > }
> >
> >
> > vi->rss_key_size is initiated here, I wonder if there is something wrong?
>
> Not really, the code above is never executed (in my machines). This is
> because `vi->has_rss` and `vi->has_rss_hash_report` are both unset.
>
> Looking further, vdev does not have the VIRTIO_NET_F_RSS and
> VIRTIO_NET_F_HASH_REPORT features.
>
> Also, when I run `ethtool -x`, I got:
>
>   # ethtool  -x eth0
>   RX flow hash indirection table for eth0 with 1 RX ring(s):
>   Operation not supported
>   RSS hash key:
>   Operation not supported
>   RSS hash function:
>   toeplitz: on
>   xor: off
>   crc32: off


The spec says:
Note that if the device offers VIRTIO_NET_F_HASH_REPORT, even if it
supports only one pair of virtqueues, it MUST support at least one of
commands of VIRTIO_NET_CTRL_MQ class to configure reported hash
parameters:

If the device offers VIRTIO_NET_F_RSS, it MUST support
VIRTIO_NET_CTRL_MQ_RSS_CONFIG command per 5.1.6.5.7.1.

Otherwise the device MUST support VIRTIO_NET_CTRL_MQ_HASH_CONFIG command
per 5.1.6.5.6.4.


So if neither `vi->has_rss` nor `vi->has_rss_hash_report` is set,
we should return from virtnet_set_rxfh directly.
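
A minimal sketch of that early return (assuming the current net-next
prototype of virtnet_set_rxfh(); untested, for illustration only):

static int virtnet_set_rxfh(struct net_device *dev,
			    struct ethtool_rxfh_param *rxfh,
			    struct netlink_ext_ack *extack)
{
	struct virtnet_info *vi = netdev_priv(dev);

	/*
	 * Without VIRTIO_NET_F_RSS or VIRTIO_NET_F_HASH_REPORT the
	 * rss_key_size/rss_indir_table_size fields were never read from
	 * config space, so refuse the request instead of sending a
	 * truncated command to the device.
	 */
	if (!vi->has_rss && !vi->has_rss_hash_report)
		return -EOPNOTSUPP;

	/* ... existing key / indirection table handling ... */

	return 0;
}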

Thanks.



Re: Re: [PATCH v3 resend] net/ipv4: add tracepoint for icmp_send

2024-03-25 Thread Peilin He
>>
>> Introduce a tracepoint for icmp_send, which can help users to get more
>> detail information conveniently when icmp abnormal events happen.
>>
>> 1. Giving an usecase example:
>> =============================
>> When an application experiences packet loss due to an unreachable UDP
>> destination port, the kernel will send an exception message through the
>> icmp_send function. By adding a trace point for icmp_send, developers or
>> system administrators can obtain detailed information about the UDP
>> packet loss, including the type, code, source address, destination address,
>> source port, and destination port. This facilitates the trouble-shooting
>> of UDP packet loss issues especially for those network-service
>> applications.
>>
>> 2. Operation Instructions:
>> ==========================
>> Switch to the tracing directory.
>> cd /sys/kernel/tracing
>> Filter for destination port unreachable.
>> echo "type=3D=3D3 && code=3D=3D3" > events/icmp/icmp_send/filter
>> Enable trace event.
>> echo 1 > events/icmp/icmp_send/enable
>>
>> 3. Result View:
>> ================
>>  udp_client_erro-11370   [002] ...s.12   124.728002:
>>  icmp_send: icmp_send: type=3, code=3.
>>  From 127.0.0.1:41895 to 127.0.0.1: ulen=23
>>  skbaddr=589b167a
>>
>> Changelog
>> -
>> v2->v3:
>> Some fixes according to
>> https://lore.kernel.org/all/20240319102549.7f7f6...@gandalf.local.home/
>> 1. Change the tracing directory to /sys/kernel/tracing.
>> 2. Adjust the layout of the TP_STRUCT__entry parameter structure.
>>
>> v1->v2:
>> Some fixes according to
>> https://lore.kernel.org/all/CANn89iL-y9e_VFpdw=sZtRnKRu_tnUwqHuFQTJvJsv-nz1x...@mail.gmail.com/
>> 1. adjust the trace_icmp_send() to more protocols than UDP.
>> 2. move the calling of trace_icmp_send after sanity checks
>> in __icmp_send().
>>
>> Signed-off-by: Peilin He
>> Reviewed-by: xu xin 
>> Reviewed-by: Yunkai Zhang 
>> Cc: Yang Yang 
>> Cc: Liu Chun 
>> Cc: Xuexin Jiang 
>> ---
>>  include/trace/events/icmp.h | 64 +
>>  net/ipv4/icmp.c |  4 +++
>>  2 files changed, 68 insertions(+)
>>  create mode 100644 include/trace/events/icmp.h
>>
>> diff --git a/include/trace/events/icmp.h b/include/trace/events/icmp.h
>> new file mode 100644
>> index ..2098d4b1b12e
>> --- /dev/null
>> +++ b/include/trace/events/icmp.h
>> @@ -0,0 +1,64 @@
>> +/* SPDX-License-Identifier: GPL-2.0 */
>> +#undef TRACE_SYSTEM
>> +#define TRACE_SYSTEM icmp
>> +
>> +#if !defined(_TRACE_ICMP_H) || defined(TRACE_HEADER_MULTI_READ)
>> +#define _TRACE_ICMP_H
>> +
>> +#include 
>> +#include 
>> +
>> +TRACE_EVENT(icmp_send,
>> +
>> +   TP_PROTO(const struct sk_buff *skb, int type, int code),
>> +
>> +   TP_ARGS(skb, type, code),
>> +
>> +   TP_STRUCT__entry(
>> +   __field(const void *, skbaddr)
>> +   __field(int, type)
>> +   __field(int, code)
>> +   __array(__u8, saddr, 4)
>> +   __array(__u8, daddr, 4)
>> +   __field(__u16, sport)
>> +   __field(__u16, dport)
>> +   __field(unsigned short, ulen)
>> +   ),
>> +
>> +   TP_fast_assign(
>> +   struct iphdr *iph = ip_hdr(skb);
>> +   int proto_4 = iph->protocol;
>> +   __be32 *p32;
>> +
>> +   __entry->skbaddr = skb;
>> +   __entry->type = type;
>> +   __entry->code = code;
>> +
>> +   if (proto_4 == IPPROTO_UDP) {
>> +   struct udphdr *uh = udp_hdr(skb);
>> +   __entry->sport = ntohs(uh->source);
>> +   __entry->dport = ntohs(uh->dest);
>> +   __entry->ulen = ntohs(uh->len);
>
>This is completely bogus.
>
>Adding tracepoints is ok if there are no side effects like bugs :/
>
>At this point there is no guarantee the UDP header is complete/present
>in skb->head
>
>Look at the existing checks between lines 619 and 623
>
>Then audit all icmp_send() callers, and ask yourself if UDP packets
>can not be malicious (like with a truncated UDP header)
Yeah, you are correct. Directly parsing the udphdr from the skb may
conceal bugs, such as an illegal skb. To handle such exceptional scenarios,
we can determine the legitimacy of skb by checking whether the position
of the uh pointer is out of bounds. The modifications in the patch are
as follows: 
struct udphdr *uh = udp_hdr(skb);

if (proto_4 != IPPROTO_UDP || (u8 *)uh < skb->head ||
(u8 *)uh + sizeof(struct udphdr) > 
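
(An aside with a hedged alternative: instead of open-coding pointer
comparisons against skb->head, the tracepoint could copy the header out
with skb_header_pointer(), which fails cleanly when the transport header
is not fully present. This is only an illustrative sketch, not the
submitted change, and it assumes the transport offset is valid where
icmp_send() is reached.)

	TP_fast_assign(
		struct udphdr uh_buf;
		const struct udphdr *uh;

		/* NULL when fewer than sizeof(uh_buf) bytes are available
		 * at the transport offset, covering truncated headers.
		 */
		uh = skb_header_pointer(skb, skb_transport_offset(skb),
					sizeof(uh_buf), &uh_buf);
		if (proto_4 == IPPROTO_UDP && uh) {
			__entry->sport = ntohs(uh->source);
			__entry->dport = ntohs(uh->dest);
			__entry->ulen  = ntohs(uh->len);
		} else {
			__entry->sport = 0;
			__entry->dport = 0;
			__entry->ulen  = 0;
		}
	),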

Re: Re: [PATCH v3 resend] net/ipv4: add tracepoint for icmp_send

2024-03-25 Thread Peilin He
>> -
>> v2->v3:
>> Some fixes according to
>> https://lore.kernel.org/all/20240319102549.7f7f6...@gandalf.local.home/
>> 1. Change the tracing directory to /sys/kernel/tracing.
>> 2. Adjust the layout of the TP_STRUCT__entry parameter structure.
>>
>> v1->v2:
>> Some fixes according to
>> https://lore.kernel.org/all/CANn89iL-y9e_VFpdw=sZtRnKRu_tnUwqHuFQTJvJsv-nz1x...@mail.gmail.com/
>> 1. adjust the trace_icmp_send() to more protocols than UDP.
>> 2. move the calling of trace_icmp_send after sanity checks
>> in __icmp_send().
>>
>> Signed-off-by: Peilin He
>> Reviewed-by: xu xin 
>> Reviewed-by: Yunkai Zhang 
>> Cc: Yang Yang 
>> Cc: Liu Chun 
>> Cc: Xuexin Jiang 
>
>I think it would be better to target net-next tree since it's not a
>fix or something else important.
>
OK. I would target it for net-next.
>> ---
>>  include/trace/events/icmp.h | 64 +
>>  net/ipv4/icmp.c |  4 +++
>>  2 files changed, 68 insertions(+)
>>  create mode 100644 include/trace/events/icmp.h
>>
>> diff --git a/include/trace/events/icmp.h b/include/trace/events/icmp.h
>> new file mode 100644
>> index ..2098d4b1b12e
>> --- /dev/null
>> +++ b/include/trace/events/icmp.h
>> @@ -0,0 +1,64 @@
>> +/* SPDX-License-Identifier: GPL-2.0 */
>> +#undef TRACE_SYSTEM
>> +#define TRACE_SYSTEM icmp
>> +
>> +#if !defined(_TRACE_ICMP_H) || defined(TRACE_HEADER_MULTI_READ)
>> +#define _TRACE_ICMP_H
>> +
>> +#include 
>> +#include 
>> +
>> +TRACE_EVENT(icmp_send,
>> +
>> +   TP_PROTO(const struct sk_buff *skb, int type, int code),
>> +
>> +   TP_ARGS(skb, type, code),
>> +
>> +   TP_STRUCT__entry(
>> +   __field(const void *, skbaddr)
>> +   __field(int, type)
>> +   __field(int, code)
>> +   __array(__u8, saddr, 4)
>> +   __array(__u8, daddr, 4)
>> +   __field(__u16, sport)
>> +   __field(__u16, dport)
>> +   __field(unsigned short, ulen)
>> +   ),
>> +
>> +   TP_fast_assign(
>> +   struct iphdr *iph = ip_hdr(skb);
>> +   int proto_4 = iph->protocol;
>> +   __be32 *p32;
>> +
>> +   __entry->skbaddr = skb;
>> +   __entry->type = type;
>> +   __entry->code = code;
>> +
>> +   if (proto_4 == IPPROTO_UDP) {
>> +   struct udphdr *uh = udp_hdr(skb);
>> +   __entry->sport = ntohs(uh->source);
>> +   __entry->dport = ntohs(uh->dest);
>> +   __entry->ulen = ntohs(uh->len);
>> +   } else {
>> +   __entry->sport = 0;
>> +   __entry->dport = 0;
>> +   __entry->ulen = 0;
>> +   }
>
>What about using the TP_STORE_ADDR_PORTS_SKB macro to record the sport
>and dport like the patch[1] did through extending the use of header
>for TCP and UDP?
>
I believe patch[1] is a good idea as it moves the TCP protocol parsing
previously done inside the TP_STORE_ADDR_PORTS_SKB macro to TP_fast_assign,
and extracts the TP_STORE_ADDR_PORTS_SKB macro into a common file,
enabling support for both UDP and TCP protocol parsing simultaneously.

However, patch[1] only extracts the source and destination addresses of
the packet, but does not extract the source port and destination port,
which limits the significance of my submitted patch.

Perhaps the patch[1] could be referenced for integration after it is merged.
>And, I wonder what the use of tracing ulen of that skb?
>
The tracking of ulen is primarily aimed at ensuring the legality of received
UDP packets and providing developers with more detailed information
on exceptions. See net/ipv4/udp.c:2494-2501.
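
For reference, the checks being pointed at in __udp4_lib_rcv() look
roughly like this (paraphrased; exact line numbers differ between kernel
versions):

	/* net/ipv4/udp.c, __udp4_lib_rcv(): validate the packet */
	if (!pskb_may_pull(skb, sizeof(struct udphdr)))
		goto drop;		/* No space for header. */

	uh   = udp_hdr(skb);
	ulen = ntohs(uh->len);

	if (ulen > skb->len)
		goto short_packet;
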
>[1]: https://lore.kernel.org/all/1c7156a3f164eb33ef3a25b8432e359f0bb60a8e.1710866188.git.balazs.scheid...@axoflow.com/
>
>Thanks,
>Jason
>
>> +
>> +   p32 = (__be32 *) __entry->saddr;
>> +   *p32 = iph->saddr;
>> +
>> +   p32 = (__be32 *) __entry->daddr;
>> +   *p32 = iph->daddr;
>> +   ),
>> +
>> +   TP_printk("icmp_send: type=%d, code=%d. From %pI4:%u to %pI4:%u ulen=%d skbaddr=%p",
>> +   __entry->type, __entry->code,
>> +   __entry->saddr, __entry->sport, __entry->daddr,
>> +   __entry->dport, __entry->ulen, __entry->skbaddr)
>> +);
>> +
>> +#endif /* _TRACE_ICMP_H */
>> +
>> +/* This part must be outside protection */
>> +#include 
>> \ No newline at end of file
>> diff --git a/net/ipv4/icmp.c b/net/ipv4/icmp.c
>> index e63a3bf99617..21fb41257fe9 100644
>> --- 

[PATCH net-next 2/3] trace: use TP_STORE_ADDRS() macro in inet_sk_error_report()

2024-03-25 Thread Jason Xing
From: Jason Xing 

As the title says, use the macro directly, like patch [1] did,
to avoid those duplications. No functional change.

[1]
commit 6a6b0b9914e7 ("tcp: Avoid preprocessor directives in tracepoint macro args")

Signed-off-by: Jason Xing 
---
 include/trace/events/sock.h | 18 +++---
 1 file changed, 3 insertions(+), 15 deletions(-)

diff --git a/include/trace/events/sock.h b/include/trace/events/sock.h
index fd206a6ab5b8..4397f7bfa406 100644
--- a/include/trace/events/sock.h
+++ b/include/trace/events/sock.h
@@ -10,6 +10,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #define family_names   \
EM(AF_INET) \
@@ -223,7 +224,6 @@ TRACE_EVENT(inet_sk_error_report,
 
TP_fast_assign(
const struct inet_sock *inet = inet_sk(sk);
-   struct in6_addr *pin6;
__be32 *p32;
 
__entry->error = sk->sk_err;
@@ -238,20 +238,8 @@ TRACE_EVENT(inet_sk_error_report,
p32 = (__be32 *) __entry->daddr;
*p32 =  inet->inet_daddr;
 
-#if IS_ENABLED(CONFIG_IPV6)
-   if (sk->sk_family == AF_INET6) {
-   pin6 = (struct in6_addr *)__entry->saddr_v6;
-   *pin6 = sk->sk_v6_rcv_saddr;
-   pin6 = (struct in6_addr *)__entry->daddr_v6;
-   *pin6 = sk->sk_v6_daddr;
-   } else
-#endif
-   {
-   pin6 = (struct in6_addr *)__entry->saddr_v6;
-   ipv6_addr_set_v4mapped(inet->inet_saddr, pin6);
-   pin6 = (struct in6_addr *)__entry->daddr_v6;
-   ipv6_addr_set_v4mapped(inet->inet_daddr, pin6);
-   }
+   TP_STORE_ADDRS(__entry, inet->inet_saddr, inet->inet_daddr,
+  sk->sk_v6_rcv_saddr, sk->sk_v6_daddr);
),
 
TP_printk("family=%s protocol=%s sport=%hu dport=%hu saddr=%pI4 
daddr=%pI4 saddrv6=%pI6c daddrv6=%pI6c error=%d",
-- 
2.37.3




[PATCH net-next 3/3] trace: use TP_STORE_ADDRS() macro in inet_sock_set_state()

2024-03-25 Thread Jason Xing
From: Jason Xing 

As the title says, use the macro directly, like patch [1] did,
to avoid those duplications. No functional change.

[1]
commit 6a6b0b9914e7 ("tcp: Avoid preprocessor directives in tracepoint macro args")

Signed-off-by: Jason Xing 
---
 include/trace/events/sock.h | 17 ++---
 1 file changed, 2 insertions(+), 15 deletions(-)

diff --git a/include/trace/events/sock.h b/include/trace/events/sock.h
index 4397f7bfa406..0d1c5ce4e6a6 100644
--- a/include/trace/events/sock.h
+++ b/include/trace/events/sock.h
@@ -160,7 +160,6 @@ TRACE_EVENT(inet_sock_set_state,
 
TP_fast_assign(
const struct inet_sock *inet = inet_sk(sk);
-   struct in6_addr *pin6;
__be32 *p32;
 
__entry->skaddr = sk;
@@ -178,20 +177,8 @@ TRACE_EVENT(inet_sock_set_state,
p32 = (__be32 *) __entry->daddr;
*p32 =  inet->inet_daddr;
 
-#if IS_ENABLED(CONFIG_IPV6)
-   if (sk->sk_family == AF_INET6) {
-   pin6 = (struct in6_addr *)__entry->saddr_v6;
-   *pin6 = sk->sk_v6_rcv_saddr;
-   pin6 = (struct in6_addr *)__entry->daddr_v6;
-   *pin6 = sk->sk_v6_daddr;
-   } else
-#endif
-   {
-   pin6 = (struct in6_addr *)__entry->saddr_v6;
-   ipv6_addr_set_v4mapped(inet->inet_saddr, pin6);
-   pin6 = (struct in6_addr *)__entry->daddr_v6;
-   ipv6_addr_set_v4mapped(inet->inet_daddr, pin6);
-   }
+   TP_STORE_ADDRS(__entry, inet->inet_saddr, inet->inet_daddr,
+  sk->sk_v6_rcv_saddr, sk->sk_v6_daddr);
),
 
TP_printk("family=%s protocol=%s sport=%hu dport=%hu saddr=%pI4 
daddr=%pI4 saddrv6=%pI6c daddrv6=%pI6c oldstate=%s newstate=%s",
-- 
2.37.3




[PATCH net-next 0/3] trace: use TP_STORE_ADDRS macro

2024-03-25 Thread Jason Xing
From: Jason Xing 

Use the macro for other tracepoints so that they are more concise.
No functional change.

Jason Xing (3):
  trace: move to TP_STORE_ADDRS related macro to net_probe_common.h
  trace: use TP_STORE_ADDRS() macro in inet_sk_error_report()
  trace: use TP_STORE_ADDRS() macro in inet_sock_set_state()

 include/trace/events/net_probe_common.h | 29 
 include/trace/events/sock.h | 35 -
 include/trace/events/tcp.h  | 29 
 3 files changed, 34 insertions(+), 59 deletions(-)

-- 
2.37.3



