Re: [PATCH v2] init/main.c: Remove redundant space from saved_command_line

2024-04-11 Thread Google
On Fri, 12 Apr 2024 11:29:50 +0800
Yuntao Wang  wrote:

> There is a space at the end of extra_init_args. With the current logic,
> copying extra_init_args into saved_command_line leaves stray extra spaces
> in saved_command_line. Remove the trailing space from extra_init_args so
> the resulting saved_command_line string is clean.
> 
> Signed-off-by: Yuntao Wang 

OK, this looks good to me.

Acked-by: Masami Hiramatsu (Google) 

Let me pick this up for bootconfig/for-next.

Thank you,

> ---
> v1 -> v2: Fix the issue using the method suggested by Masami
> 
>  init/main.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/init/main.c b/init/main.c
> index 881f6230ee59..0f03dd15e0e2 100644
> --- a/init/main.c
> +++ b/init/main.c
> @@ -627,8 +627,10 @@ static void __init setup_command_line(char *command_line)
>  
>   if (extra_command_line)
>   xlen = strlen(extra_command_line);
> - if (extra_init_args)
> + if (extra_init_args) {
> + extra_init_args = strim(extra_init_args); /* remove trailing space */
>   ilen = strlen(extra_init_args) + 4; /* for " -- " */
> + }
>  
>   len = xlen + strlen(boot_command_line) + 1;
>  
> -- 
> 2.44.0
> 

-- 
Masami Hiramatsu (Google) 



Re: [RFC PATCH 0/4] perf: Correlating user process data to samples

2024-04-11 Thread Ian Rogers
On Thu, Apr 11, 2024 at 5:17 PM Beau Belgrave  wrote:
>
> In the Open Telemetry profiling SIG [1], we are trying to find a way to
> grab a tracing association quickly on a per-sample basis. The team at
> Elastic has a bespoke way to do this [2], however, I'd like to see a
> more general way to achieve this. The folks I've been talking with seem
> open to the idea of just having a TLS value for this we could capture

Presumably TLS == Thread Local Storage.

> upon each sample. We could then just state, Open Telemetry SDKs should
> have a TLS value for span correlation. However, we need a way to sample
> the TLS or other value(s) when a sampling event is generated. This is
> supported today on Windows via EventActivityIdControl() [3]. Since
> Open Telemetry works on both Windows and Linux, ideally we can do
> something as efficient for Linux based workloads.
>
> This series is to explore how it would be best possible to collect
> supporting data from a user process when a profile sample is collected.
> Having a value stored in TLS makes a lot of sense for this however
> there are other ways to explore. Whatever is chosen, kernel samples
> taken in process context should be able to get this supporting data.
> In these patches on X64 the fsbase and gsbase are used for this.
>
> An option to explore suggested by Mathieu Desnoyers is to utilize rseq
> for processes to register a value location that can be included when
> profiling if desired. This would allow a tighter contract between user
> processes and a profiler.  It would allow better labeling/categorizing
> the correlation values.

It is hard to understand this idea. Are you saying stash a cookie in
TLS for samples to capture to indicate an activity? Restartable
sequences are about preemption on a CPU, not of a thread, so at least
my intuition is that they feel different. You could stash information
like this today by changing the thread name, which generates comm
events. I've wondered about having similar information in some form of
profiling-reserved stack slot, for example, to stash a pointer to
the name of a function being interpreted. Snapshotting all of a stack
is bad both performance-wise and for security. A stack slot would be able
to deal with nesting.
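
For readers unfamiliar with the comm-event workaround mentioned above, here
is a minimal userspace sketch (the helper name is hypothetical; perf reports
the rename as a PERF_RECORD_COMM event):

#include <stdio.h>
#include <sys/prctl.h>

/* Hypothetical helper: encode a small activity cookie in the thread
 * name. TASK_COMM_LEN limits names to 15 visible characters, so only
 * a tiny cookie fits, and it overwrites the real thread name. */
static void set_activity_via_comm(unsigned int activity)
{
        char name[16];

        snprintf(name, sizeof(name), "act-%u", activity);
        prctl(PR_SET_NAME, name);       /* perf emits PERF_RECORD_COMM */
}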

> An ideal flow would look like this:
> User Task   Profile
> do_work();  sample() -> IP + No activity
> ...
> set_activity(123);
> ...
> do_work();  sample() -> IP + activity (123)
> ...
> set_activity(124);
> ...
> do_work();  sample() -> IP + activity (124)
>
> Ideally, the set_activity() method would not be a syscall. It needs to
> be very cheap as this should not bottleneck work. Ideally this is just
> a memcpy of 16-20 bytes as it is on Windows via EventActivityIdControl()
> using EVENT_ACTIVITY_CTRL_SET_ID.
>
> For those not aware, Open Telemetry allows collecting data from multiple
> machines and show where time was spent. The tracing context is already
> available for logs, but not for profiling samples. The idea is to show
> where slowdowns occur and have profile samples to explain why they
> slowed down. This must be possible without having to track context
> switches to do this correlation. This is because the profiling rates
> are typically 20 Hz - 1 kHz, while the context switching rates are much
> higher. We do not want to have to consume high context switch rates
> just to know a correlation for a 20 Hz signal. Often these 20 Hz signals
> are always enabled in some environments.
>
> Regardless of whether TLS, rseq, or another source is used, I believe we will need
> a way for perf_events to include it within a sample. The changes in this
> series show how it could be done with TLS. There is some factoring work
> under perf to make it easier to add more dump types using the existing
> ABI. This is mostly to make the patches clearer, certainly the refactor
> parts could get dropped and we could have duplicated/specialized paths.

fs and gs may be used for more than just the C runtime's TLS. For
example, they may be used by emulators or managed runtimes. I'm not
clear why this specific case couldn't be handled through BPF.
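
For what it's worth, a rough sketch of what the BPF route could look like,
assuming BTF/CO-RE, libbpf conventions, and that the cookie lives at the
thread's fsbase (x86-64). All names are illustrative; this is not a tested
program:

#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_core_read.h>

char LICENSE[] SEC("license") = "GPL";

SEC("perf_event")
int on_sample(void *ctx)
{
        struct task_struct *t = (void *)bpf_get_current_task();
        unsigned long fsbase = 0;
        __u64 cookie = 0;

        /* TLS base of the sampled thread, then the cookie it points at. */
        BPF_CORE_READ_INTO(&fsbase, t, thread.fsbase);
        bpf_probe_read_user(&cookie, sizeof(cookie), (void *)fsbase);
        bpf_printk("sample cookie: %llu", cookie);
        return 0;
}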

Thanks,
Ian

> 1. https://opentelemetry.io/blog/2024/profiling/
> 2. https://www.elastic.co/blog/continuous-profiling-distributed-tracing-correlation
> 3. https://learn.microsoft.com/en-us/windows/win32/api/evntprov/nf-evntprov-eventactivityidcontrol
>
> Beau Belgrave (4):
>   perf/core: Introduce perf_prepare_dump_data()
>   perf: Introduce PERF_SAMPLE_TLS_USER sample type
>   perf/core: Factor perf_output_sample_udump()
>   perf/x86/core: Add tls dump support
>
>  arch/Kconfig  |   7 ++
>  arch/x86/Kconfig  |   1 +
>  arch/x86/events/core.c|  14 +++
>  arch/x86/include/asm/perf_event.h |   5 +
>  include/linux/perf_event.h|   7 ++
>  include/uapi/linux/perf_event.h   |   5 +-
>  kernel/events/core.c  | 166 +++---
>  

[PATCH v2] init/main.c: Remove redundant space from saved_command_line

2024-04-11 Thread Yuntao Wang
There is a space at the end of extra_init_args. With the current logic,
copying extra_init_args into saved_command_line leaves stray extra spaces
in saved_command_line. Remove the trailing space from extra_init_args so
the resulting saved_command_line string is clean.

Signed-off-by: Yuntao Wang 
---
v1 -> v2: Fix the issue using the method suggested by Masami

 init/main.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/init/main.c b/init/main.c
index 881f6230ee59..0f03dd15e0e2 100644
--- a/init/main.c
+++ b/init/main.c
@@ -627,8 +627,10 @@ static void __init setup_command_line(char *command_line)
 
if (extra_command_line)
xlen = strlen(extra_command_line);
-   if (extra_init_args)
+   if (extra_init_args) {
+   extra_init_args = strim(extra_init_args); /* remove trailing space */
ilen = strlen(extra_init_args) + 4; /* for " -- " */
+   }
 
len = xlen + strlen(boot_command_line) + 1;
 
-- 
2.44.0
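
For reference, the kernel's strim() helper trims trailing whitespace in place
and returns the string with leading whitespace skipped, so the single
reassignment in the hunk above is enough. A tiny illustration (the helper
function here is hypothetical):

#include <linux/printk.h>
#include <linux/string.h>

static void __init strim_example(void)
{
        char buf[] = " bootconfig_arg1 ";
        char *s = strim(buf);

        /* The trailing space is replaced with '\0' in place, and s points
         * past the leading space, so this prints 'bootconfig_arg1'. */
        pr_info("'%s'\n", s);
}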




Re: [PATCH v9 5/9] clk: mmp: Add Marvell PXA1908 clock driver

2024-04-11 Thread Stephen Boyd
Quoting Duje Mihanović (2024-04-11 03:15:34)
> On 4/11/2024 10:00 AM, Stephen Boyd wrote:
> > 
> > Is there a reason this file can't be a platform driver?
> 
> Not that I know of, I did it like this only because the other in-tree 
> MMP clk drivers do so. I guess the initialization should look like any 
> of the qcom GCC drivers then?

Yes.

> 
> While at it, do you think the other MMP clk drivers could use a conversion?
> 

Sure, go for it. I'm a little wary of conversions that cannot be tested,
though. It doesn't hurt that the other drivers haven't been converted yet.
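
For readers unfamiliar with the suggested shape, a bare-bones sketch of a
qcom-GCC-style platform driver is below. The driver name, compatible string,
and probe contents are all illustrative, not taken from the actual series:

#include <linux/module.h>
#include <linux/of.h>
#include <linux/platform_device.h>

static int pxa1908_clk_probe(struct platform_device *pdev)
{
        /* ioremap registers and register clocks with devm_ helpers here */
        return 0;
}

static const struct of_device_id pxa1908_clk_match[] = {
        { .compatible = "marvell,pxa1908-apbc" },       /* illustrative */
        { }
};

static struct platform_driver pxa1908_clk_driver = {
        .probe = pxa1908_clk_probe,
        .driver = {
                .name = "pxa1908-clk",
                .of_match_table = pxa1908_clk_match,
        },
};
module_platform_driver(pxa1908_clk_driver);

MODULE_LICENSE("GPL");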



[PATCH RESEND] bootconfig: use memblock_free_late to free xbc memory to buddy

2024-04-11 Thread qiang4 . zhang
From: Qiang Zhang 

By the time xbc memory is freed, memblock has already handed memory over to
the buddy allocator, so it makes no sense to free memory back to memblock.
memblock_free(), called from xbc_exit(), even causes UAF bugs on architectures
with CONFIG_ARCH_KEEP_MEMBLOCK disabled, such as x86. The following KASAN
log shows this case.

[9.410890] ==================================================================
[9.418962] BUG: KASAN: use-after-free in memblock_isolate_range+0x12d/0x260
[9.426850] Read of size 8 at addr ffff88845dd30000 by task swapper/0/1

[9.435901] CPU: 9 PID: 1 Comm: swapper/0 Tainted: G U 6.9.0-rc3-00208-g586b5dfb51b9 #5
[9.446403] Hardware name: Intel Corporation RPLP LP5 (CPU:RaptorLake)/RPLP LP5 (ID:13), BIOS IRPPN02.01.01.00.00.19.015.D- Dec 28 2023
[9.460789] Call Trace:
[9.463518]  <TASK>
[9.465859]  dump_stack_lvl+0x53/0x70
[9.469949]  print_report+0xce/0x610
[9.473944]  ? __virt_addr_valid+0xf5/0x1b0
[9.478619]  ? memblock_isolate_range+0x12d/0x260
[9.483877]  kasan_report+0xc6/0x100
[9.487870]  ? memblock_isolate_range+0x12d/0x260
[9.493125]  memblock_isolate_range+0x12d/0x260
[9.498187]  memblock_phys_free+0xb4/0x160
[9.502762]  ? __pfx_memblock_phys_free+0x10/0x10
[9.508021]  ? mutex_unlock+0x7e/0xd0
[9.512111]  ? __pfx_mutex_unlock+0x10/0x10
[9.516786]  ? kernel_init_freeable+0x2d4/0x430
[9.521850]  ? __pfx_kernel_init+0x10/0x10
[9.526426]  xbc_exit+0x17/0x70
[9.529935]  kernel_init+0x38/0x1e0
[9.533829]  ? _raw_spin_unlock_irq+0xd/0x30
[9.538601]  ret_from_fork+0x2c/0x50
[9.542596]  ? __pfx_kernel_init+0x10/0x10
[9.547170]  ret_from_fork_asm+0x1a/0x30
[9.551552]  </TASK>

[9.555649] The buggy address belongs to the physical page:
[9.561875] page: refcount:0 mapcount:0 mapping: index:0x1 pfn:0x45dd30
[9.570821] flags: 0x200(node=0|zone=2)
[9.576271] page_type: 0x()
[9.580167] raw: 0200 ea0011774c48 ea0012ba1848
[9.588823] raw: 0001
[9.597476] page dumped because: kasan: bad access detected

[9.605362] Memory state around the buggy address:
[9.610714]  ffff88845dd2ff00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[9.618786]  ffff88845dd2ff80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[9.626857] >ffff88845dd30000: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
[9.634930]                    ^
[9.638534]  ffff88845dd30080: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
[9.646605]  ffff88845dd30100: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
[9.654675] ==================================================================

Cc: sta...@vger.kernel.org
Signed-off-by: Qiang Zhang 
---
 lib/bootconfig.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/lib/bootconfig.c b/lib/bootconfig.c
index c59d26068a64..4524ee944df0 100644
--- a/lib/bootconfig.c
+++ b/lib/bootconfig.c
@@ -63,7 +63,7 @@ static inline void * __init xbc_alloc_mem(size_t size)
 
 static inline void __init xbc_free_mem(void *addr, size_t size)
 {
-   memblock_free(addr, size);
+   memblock_free_late(__pa(addr), size);
 }
 
 #else /* !__KERNEL__ */
-- 
2.39.2
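
For context on the __pa() conversion above: memblock_free() takes a virtual
address, while memblock_free_late() takes a physical address and releases the
pages directly to the buddy allocator, which is exactly what is wanted once
memblock has retired. The relevant declarations (from
include/linux/memblock.h):

void memblock_free(void *ptr, size_t size);
void __init memblock_free_late(phys_addr_t base, phys_addr_t size);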




Re: [PATCH] bootconfig: use memblock_free_late to free xbc memory to buddy

2024-04-11 Thread Qiang Zhang
On Fri, Apr 12, 2024 at 10:03:26AM +0800, qiang4.zh...@linux.intel.com wrote:
>From: Qiang Zhang 
>
>By the time xbc memory is freed, memblock has already handed memory over to
>the buddy allocator, so it makes no sense to free memory back to memblock.
>memblock_free(), called from xbc_exit(), even causes UAF bugs on architectures
>with CONFIG_ARCH_KEEP_MEMBLOCK disabled, such as x86. The following KASAN
>log shows this case.
>
>[9.410890] ==================================================================
>[9.418962] BUG: KASAN: use-after-free in memblock_isolate_range+0x12d/0x260
>[9.426850] Read of size 8 at addr ffff88845dd30000 by task swapper/0/1
>
>[9.435901] CPU: 9 PID: 1 Comm: swapper/0 Tainted: G U 6.9.0-rc3-00208-g586b5dfb51b9 #5
>[9.446403] Hardware name: Intel Corporation RPLP LP5 (CPU:RaptorLake)/RPLP LP5 (ID:13), BIOS IRPPN02.01.01.00.00.19.015.D- Dec 28 2023
>[9.460789] Call Trace:
>[9.463518]  <TASK>
>[9.465859]  dump_stack_lvl+0x53/0x70
>[9.469949]  print_report+0xce/0x610
>[9.473944]  ? __virt_addr_valid+0xf5/0x1b0
>[9.478619]  ? memblock_isolate_range+0x12d/0x260
>[9.483877]  kasan_report+0xc6/0x100
>[9.487870]  ? memblock_isolate_range+0x12d/0x260
>[9.493125]  memblock_isolate_range+0x12d/0x260
>[9.498187]  memblock_phys_free+0xb4/0x160
>[9.502762]  ? __pfx_memblock_phys_free+0x10/0x10
>[9.508021]  ? mutex_unlock+0x7e/0xd0
>[9.512111]  ? __pfx_mutex_unlock+0x10/0x10
>[9.516786]  ? kernel_init_freeable+0x2d4/0x430
>[9.521850]  ? __pfx_kernel_init+0x10/0x10
>[9.526426]  xbc_exit+0x17/0x70
>[9.529935]  kernel_init+0x38/0x1e0
>[9.533829]  ? _raw_spin_unlock_irq+0xd/0x30
>[9.538601]  ret_from_fork+0x2c/0x50
>[9.542596]  ? __pfx_kernel_init+0x10/0x10
>[9.547170]  ret_from_fork_asm+0x1a/0x30
>[9.551552]  </TASK>
>
>[9.555649] The buggy address belongs to the physical page:
>[9.561875] page: refcount:0 mapcount:0 mapping: index:0x1 pfn:0x45dd30
>[9.570821] flags: 0x200(node=0|zone=2)
>[9.576271] page_type: 0x()
>[9.580167] raw: 0200 ea0011774c48 ea0012ba1848
>[9.588823] raw: 0001
>[9.597476] page dumped because: kasan: bad access detected
>
>[9.605362] Memory state around the buggy address:
>[9.610714]  ffff88845dd2ff00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>[9.618786]  ffff88845dd2ff80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>[9.626857] >ffff88845dd30000: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
>[9.634930]                    ^
>[9.638534]  ffff88845dd30080: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
>[9.646605]  ffff88845dd30100: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
>[9.654675] ==================================================================

Sorry, I forgot to Cc stable. Will send a new one.

>
>Signed-off-by: Qiang Zhang 
>---
> lib/bootconfig.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
>diff --git a/lib/bootconfig.c b/lib/bootconfig.c
>index c59d26068a64..4524ee944df0 100644
>--- a/lib/bootconfig.c
>+++ b/lib/bootconfig.c
>@@ -63,7 +63,7 @@ static inline void * __init xbc_alloc_mem(size_t size)
> 
> static inline void __init xbc_free_mem(void *addr, size_t size)
> {
>-  memblock_free(addr, size);
>+  memblock_free_late(__pa(addr), size);
> }
> 
> #else /* !__KERNEL__ */
>-- 
>2.39.2
>



[PATCH] bootconfig: use memblock_free_late to free xbc memory to buddy

2024-04-11 Thread qiang4 . zhang
From: Qiang Zhang 

By the time xbc memory is freed, memblock has already handed memory over to
the buddy allocator, so it makes no sense to free memory back to memblock.
memblock_free(), called from xbc_exit(), even causes UAF bugs on architectures
with CONFIG_ARCH_KEEP_MEMBLOCK disabled, such as x86. The following KASAN
log shows this case.

[9.410890] ==================================================================
[9.418962] BUG: KASAN: use-after-free in memblock_isolate_range+0x12d/0x260
[9.426850] Read of size 8 at addr ffff88845dd30000 by task swapper/0/1

[9.435901] CPU: 9 PID: 1 Comm: swapper/0 Tainted: G U 6.9.0-rc3-00208-g586b5dfb51b9 #5
[9.446403] Hardware name: Intel Corporation RPLP LP5 (CPU:RaptorLake)/RPLP LP5 (ID:13), BIOS IRPPN02.01.01.00.00.19.015.D- Dec 28 2023
[9.460789] Call Trace:
[9.463518]  <TASK>
[9.465859]  dump_stack_lvl+0x53/0x70
[9.469949]  print_report+0xce/0x610
[9.473944]  ? __virt_addr_valid+0xf5/0x1b0
[9.478619]  ? memblock_isolate_range+0x12d/0x260
[9.483877]  kasan_report+0xc6/0x100
[9.487870]  ? memblock_isolate_range+0x12d/0x260
[9.493125]  memblock_isolate_range+0x12d/0x260
[9.498187]  memblock_phys_free+0xb4/0x160
[9.502762]  ? __pfx_memblock_phys_free+0x10/0x10
[9.508021]  ? mutex_unlock+0x7e/0xd0
[9.512111]  ? __pfx_mutex_unlock+0x10/0x10
[9.516786]  ? kernel_init_freeable+0x2d4/0x430
[9.521850]  ? __pfx_kernel_init+0x10/0x10
[9.526426]  xbc_exit+0x17/0x70
[9.529935]  kernel_init+0x38/0x1e0
[9.533829]  ? _raw_spin_unlock_irq+0xd/0x30
[9.538601]  ret_from_fork+0x2c/0x50
[9.542596]  ? __pfx_kernel_init+0x10/0x10
[9.547170]  ret_from_fork_asm+0x1a/0x30
[9.551552]  </TASK>

[9.555649] The buggy address belongs to the physical page:
[9.561875] page: refcount:0 mapcount:0 mapping: index:0x1 pfn:0x45dd30
[9.570821] flags: 0x200(node=0|zone=2)
[9.576271] page_type: 0x()
[9.580167] raw: 0200 ea0011774c48 ea0012ba1848
[9.588823] raw: 0001
[9.597476] page dumped because: kasan: bad access detected

[9.605362] Memory state around the buggy address:
[9.610714]  ffff88845dd2ff00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[9.618786]  ffff88845dd2ff80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[9.626857] >ffff88845dd30000: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
[9.634930]                    ^
[9.638534]  ffff88845dd30080: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
[9.646605]  ffff88845dd30100: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
[9.654675] ==================================================================

Signed-off-by: Qiang Zhang 
---
 lib/bootconfig.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/lib/bootconfig.c b/lib/bootconfig.c
index c59d26068a64..4524ee944df0 100644
--- a/lib/bootconfig.c
+++ b/lib/bootconfig.c
@@ -63,7 +63,7 @@ static inline void * __init xbc_alloc_mem(size_t size)
 
 static inline void __init xbc_free_mem(void *addr, size_t size)
 {
-   memblock_free(addr, size);
+   memblock_free_late(__pa(addr), size);
 }
 
 #else /* !__KERNEL__ */
-- 
2.39.2




Re: [PATCH] init/main.c: Remove redundant space from saved_command_line

2024-04-11 Thread Yuntao Wang
On Fri, 12 Apr 2024 08:08:39 +0900, Masami Hiramatsu (Google) wrote:

> On Thu, 11 Apr 2024 23:29:40 +0800
> Yuntao Wang  wrote:
> 
> > On Thu, 11 Apr 2024 23:07:45 +0900, Masami Hiramatsu (Google) wrote:
> > 
> > > On Thu, 11 Apr 2024 09:19:32 +0200
> > > Geert Uytterhoeven  wrote:
> > > 
> > > > CC Hiramatsu-san (now for real :-)
> > > 
> > > Thanks!
> > > 
> > > > 
> > > > On Thu, Apr 11, 2024 at 6:13 AM Yuntao Wang  wrote:
> > > > > extra_init_args ends with a space, so when concatenating extra_init_args
> > > > > to saved_command_line, be sure to remove the extra space.
> > > 
> > > Hi Yuntao,
> > > 
> > > Hmm, if you want to trim the end space, you should trim extra_init_args
> > > itself instead of this adjustment. Also, can you share the example?
> > > 
> > > Thank you,
> > 
> > At first, I also intended to fix this issue as you suggested. However,
> > because both extra_command_line and extra_init_args end with a space,
> > making such a change would require modifications in many places.
> 
> You may just need:
> 
> if (extra_init_args)
>   strim(extra_init_args);

Okay, I'll post another patch, making the changes as you suggested.

> > That's why I chose this approach instead.
> > 
> > Here are some examples before and after modification:
> > 
> > Before: [0.829179] Kernel command line: 'console=ttyS0 debug -- bootconfig_arg1 '
> > After:  [0.032648] Kernel command line: 'console=ttyS0 debug -- bootconfig_arg1'
> > 
> > Before: [0.757217] Kernel command line: 'console=ttyS0 debug -- bootconfig_arg1  arg1'
> > After:  [0.068184] Kernel command line: 'console=ttyS0 debug -- bootconfig_arg1 arg1'
> > 
> > In order to make it easier to observe spaces, I added quotes when outputting saved_command_line.
> 
BTW, does this trailing space harm anything? I don't like a cosmetic change.
> 
> Thank you,

I think this modification is necessary.

If saved_command_line were only used internally in the kernel, having extra
spaces, while not perfect, would be acceptable to me. However, since
saved_command_line can be accessed by users through the /proc/cmdline file,
having these extra spaces here and there makes it look too casual.

> > 
> > Note that the first 'before' ends with a space, and there are two spaces between
> > 'bootconfig_arg1' and 'arg1' in the second 'before'.
> > 
> > > > >
> > > > > Signed-off-by: Yuntao Wang 
> > > > > ---
> > > > >  init/main.c | 4 +++-
> > > > >  1 file changed, 3 insertions(+), 1 deletion(-)
> > > > >
> > > > > diff --git a/init/main.c b/init/main.c
> > > > > index 2ca52474d0c3..cf2c22aa0e8c 100644
> > > > > --- a/init/main.c
> > > > > +++ b/init/main.c
> > > > > @@ -660,12 +660,14 @@ static void __init setup_command_line(char *command_line)
> > > > > strcpy(saved_command_line + len, extra_init_args);
> > > > > len += ilen - 4;/* strlen(extra_init_args) */
> > > > > strcpy(saved_command_line + len,
> > > > > -   boot_command_line + initargs_offs - 1);
> > > > > +   boot_command_line + initargs_offs);
> > > > > } else {
> > > > > len = strlen(saved_command_line);
> > > > > strcpy(saved_command_line + len, " -- ");
> > > > > len += 4;
> > > > > strcpy(saved_command_line + len, extra_init_args);
> > > > > +   len += ilen - 4; /* strlen(extra_init_args) */
> > > > > +   saved_command_line[len-1] = '\0'; /* remove trailing space */
> > > > > }
> > > > > }
> > > > 
> > > > Gr{oetje,eeting}s,
> > > > 
> > > > Geert
> > > > 
> > > > -- 
> > > > Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org
> > > > 
> > > > In personal conversations with technical people, I call myself a hacker. But
> > > > when I'm talking to journalists I just say "programmer" or something like that.
> > > > -- Linus Torvalds
> > > > 
> > > 
> > > 
> > > -- 
> > > Masami Hiramatsu (Google) 
> 
> 
> -- 
> Masami Hiramatsu (Google) 



Re: [PATCH] ARM: dts: qcom: msm8974-sony-shinano: Enable vibrator

2024-04-11 Thread Bjorn Andersson


On Sat, 06 Apr 2024 17:27:20 +0200, Luca Weiss wrote:
> Enable the vibrator connected to PM8941 found on the Sony shinano
> platform.
> 
> 

Applied, thanks!

[1/1] ARM: dts: qcom: msm8974-sony-shinano: Enable vibrator
  commit: 5c94b0b906436aad74e559195007afdd328211f4

Best regards,
-- 
Bjorn Andersson 



[RFC PATCH 2/4] perf: Introduce PERF_SAMPLE_TLS_USER sample type

2024-04-11 Thread Beau Belgrave
When samples are generated, there is no way via the perf_event ABI to
fetch per-thread data. This data is very useful in tracing scenarios
that involve correlation IDs, such as OpenTelemetry. They are also
useful for tracking per-thread performance details directly within a
cooperating user process.

The newly established OpenTelemetry profiling group requires a way to get
tracing correlations on both Linux and Windows. On Windows this
correlation is available on a per-thread basis directly via ETW. On Linux
we need a fast mechanism to store these details, and TLS seems like the
best option; see the links below for more details.

Add a new sample type (PERF_SAMPLE_TLS_USER) that fetches TLS data, up to
X bytes per sample. Use the existing PERF_SAMPLE_STACK_USER ABI for
outputting data to consumers. Store the data size requested by the user
in the previously reserved u16 (__reserved_2) within perf_event_attr.

Add tls_addr and tls_user_size to perf_sample_data and calculate them
during sample preparation. This lets the output side know whether
truncation is going to occur and avoids re-fetching the TLS value
from the user process a second time.

Add CONFIG_HAVE_PERF_USER_TLS_DUMP so that architectures can specify whether
they have a TLS-specific register (or other logic) that can be used for
dumping. This does not yet enable any architecture to do TLS dumps; it
simply makes it possible by allowing an arch-defined method named
arch_perf_user_tls_pointer().

Add a perf_tls struct that arch_perf_user_tls_pointer() uses to report the
TLS base address and size (to handle 32-bit-on-64-bit compat cases).
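
To make the proposed ABI concrete, a hypothetical userspace sketch of
requesting an 8-byte TLS dump per sample. Note that PERF_SAMPLE_TLS_USER and
sample_tls_user exist only in this RFC series, not in released kernel
headers:

#include <linux/perf_event.h>
#include <string.h>

static void setup_attr_sketch(struct perf_event_attr *attr)
{
        memset(attr, 0, sizeof(*attr));
        attr->type = PERF_TYPE_SOFTWARE;
        attr->config = PERF_COUNT_SW_CPU_CLOCK;
        attr->sample_period = 1000003;
        attr->sample_type = PERF_SAMPLE_IP | PERF_SAMPLE_TID |
                            PERF_SAMPLE_TLS_USER;       /* proposed bit */
        attr->sample_tls_user = 8;      /* bytes of TLS to dump (was __reserved_2) */
}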

Link: https://opentelemetry.io/blog/2024/profiling/
Link: https://www.elastic.co/blog/continuous-profiling-distributed-tracing-correlation
Signed-off-by: Beau Belgrave 
---
 arch/Kconfig|   7 +++
 include/linux/perf_event.h  |   7 +++
 include/uapi/linux/perf_event.h |   5 +-
 kernel/events/core.c| 105 +++-
 kernel/events/internal.h|  16 +
 5 files changed, 137 insertions(+), 3 deletions(-)

diff --git a/arch/Kconfig b/arch/Kconfig
index 9f066785bb71..6afaf5f46e2f 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -430,6 +430,13 @@ config HAVE_PERF_USER_STACK_DUMP
  access to the user stack pointer which is not unified across
  architectures.
 
+config HAVE_PERF_USER_TLS_DUMP
+   bool
+   help
+ Support user tls dumps for perf event samples. This needs
+ access to the user tls pointer which is not unified across
+ architectures.
+
 config HAVE_ARCH_JUMP_LABEL
bool
 
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index d2a15c0c6f8a..7fac81929eed 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -1202,8 +1202,15 @@ struct perf_sample_data {
u64 data_page_size;
u64 code_page_size;
u64 aux_size;
+   u64 tls_addr;
+   u64 tls_user_size;
 } cacheline_aligned;
 
+struct perf_tls {
+   unsigned long base; /* Base address for TLS */
+   unsigned long size; /* Size of base address */
+};
+
 /* default value for data source */
 #define PERF_MEM_NA (PERF_MEM_S(OP, NA)   |\
PERF_MEM_S(LVL, NA)   |\
diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index 3a64499b0f5d..b62669cfe581 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -162,8 +162,9 @@ enum perf_event_sample_format {
PERF_SAMPLE_DATA_PAGE_SIZE  = 1U << 22,
PERF_SAMPLE_CODE_PAGE_SIZE  = 1U << 23,
PERF_SAMPLE_WEIGHT_STRUCT   = 1U << 24,
+   PERF_SAMPLE_TLS_USER= 1U << 25,
 
-   PERF_SAMPLE_MAX = 1U << 25, /* non-ABI */
+   PERF_SAMPLE_MAX = 1U << 26, /* non-ABI */
 };
 
 #define PERF_SAMPLE_WEIGHT_TYPE(PERF_SAMPLE_WEIGHT | 
PERF_SAMPLE_WEIGHT_STRUCT)
@@ -509,7 +510,7 @@ struct perf_event_attr {
 */
__u32   aux_watermark;
__u16   sample_max_stack;
-   __u16   __reserved_2;
+   __u16   sample_tls_user; /* Size of TLS data to dump on samples */
__u32   aux_sample_size;
__u32   __reserved_3;
 
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 07de5cc2aa25..f848bf4be9bd 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -6926,6 +6926,45 @@ static u64 perf_ustack_task_size(struct pt_regs *regs)
return TASK_SIZE - addr;
 }
 
+/*
+ * Get remaining task size from user tls pointer.
+ *
+ * Outputs the address to use for the dump to avoid doing
+ * this twice (prepare and output).
+ */
+static u64
+perf_utls_task_size(struct pt_regs *regs, u64 dump_size, u64 *tls_addr)
+{
+   struct perf_tls tls;
+   unsigned long addr;
+
+   *tls_addr = 0;
+
+   /* 

[RFC PATCH 0/4] perf: Correlating user process data to samples

2024-04-11 Thread Beau Belgrave
In the Open Telemetry profiling SIG [1], we are trying to find a way to
grab a tracing association quickly on a per-sample basis. The team at
Elastic has a bespoke way to do this [2], however, I'd like to see a
more general way to achieve this. The folks I've been talking with seem
open to the idea of just having a TLS value for this we could capture
upon each sample. We could then just state, Open Telemetry SDKs should
have a TLS value for span correlation. However, we need a way to sample
the TLS or other value(s) when a sampling event is generated. This is
supported today on Windows via EventActivityIdControl() [3]. Since
Open Telemetry works on both Windows and Linux, ideally we can do
something as efficient for Linux based workloads.

This series is to explore how it would be best possible to collect
supporting data from a user process when a profile sample is collected.
Having a value stored in TLS makes a lot of sense for this however
there are other ways to explore. Whatever is chosen, kernel samples
taken in process context should be able to get this supporting data.
In these patches on X64 the fsbase and gsbase are used for this.

An option to explore suggested by Mathieu Desnoyers is to utilize rseq
for processes to register a value location that can be included when
profiling if desired. This would allow a tighter contract between user
processes and a profiler.  It would allow better labeling/categorizing
the correlation values.

An ideal flow would look like this:
User Task   Profile
do_work();  sample() -> IP + No activity
...
set_activity(123);
...
do_work();  sample() -> IP + activity (123)
...
set_activity(124);
...
do_work();  sample() -> IP + activity (124)

Ideally, the set_activity() method would not be a syscall. It needs to
be very cheap as this should not bottleneck work. Ideally this is just
a memcpy of 16-20 bytes as it is on Windows via EventActivityIdControl()
using EVENT_ACTIVITY_CTRL_SET_ID.
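
As a concrete (hypothetical) sketch of that contract, the user-side piece
could be as small as a thread-local slot plus a plain copy; none of these
names are part of any ABI:

#include <string.h>

struct activity_id {
        unsigned char bytes[16];        /* e.g. an OpenTelemetry trace/span ID */
};

/* One slot per thread; a profiler would read this at sample time. */
static __thread struct activity_id current_activity;

static inline void set_activity(const struct activity_id *id)
{
        memcpy(&current_activity, id, sizeof(current_activity)); /* no syscall */
}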

For those not aware, Open Telemetry allows collecting data from multiple
machines and show where time was spent. The tracing context is already
available for logs, but not for profiling samples. The idea is to show
where slowdowns occur and have profile samples to explain why they
slowed down. This must be possible without having to track context
switches to do this correlation. This is because the profiling rates
are typically 20 Hz - 1 kHz, while the context switching rates are much
higher. We do not want to have to consume high context switch rates
just to know a correlation for a 20 Hz signal. Often these 20 Hz signals
are always enabled in some environments.

Regardless of whether TLS, rseq, or another source is used, I believe we will need
a way for perf_events to include it within a sample. The changes in this
series show how it could be done with TLS. There is some factoring work
under perf to make it easier to add more dump types using the existing
ABI. This is mostly to make the patches clearer, certainly the refactor
parts could get dropped and we could have duplicated/specialized paths.

1. https://opentelemetry.io/blog/2024/profiling/
2. https://www.elastic.co/blog/continuous-profiling-distributed-tracing-correlation
3. https://learn.microsoft.com/en-us/windows/win32/api/evntprov/nf-evntprov-eventactivityidcontrol

Beau Belgrave (4):
  perf/core: Introduce perf_prepare_dump_data()
  perf: Introduce PERF_SAMPLE_TLS_USER sample type
  perf/core: Factor perf_output_sample_udump()
  perf/x86/core: Add tls dump support

 arch/Kconfig  |   7 ++
 arch/x86/Kconfig  |   1 +
 arch/x86/events/core.c|  14 +++
 arch/x86/include/asm/perf_event.h |   5 +
 include/linux/perf_event.h|   7 ++
 include/uapi/linux/perf_event.h   |   5 +-
 kernel/events/core.c  | 166 +++---
 kernel/events/internal.h  |  16 +++
 8 files changed, 180 insertions(+), 41 deletions(-)


base-commit: fec50db7033ea478773b159e0e2efb135270e3b7
-- 
2.34.1




[RFC PATCH 4/4] perf/x86/core: Add tls dump support

2024-04-11 Thread Beau Belgrave
Now that perf supports TLS dumps, x86-64 can provide the details for how
to get TLS data for user threads.

Enable HAVE_PERF_USER_TLS_DUMP Kconfig only for x86-64. I do not have
access to x86 to validate 32-bit.

Utilize mmap_is_ia32() to determine 32/64 bit threads. Use fsbase for
64-bit and gsbase for 32-bit with appropriate size.

Signed-off-by: Beau Belgrave 
---
 arch/x86/Kconfig  |  1 +
 arch/x86/events/core.c| 14 ++
 arch/x86/include/asm/perf_event.h |  5 +
 3 files changed, 20 insertions(+)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 4fff6ed46e90..8d46ec8ded0c 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -263,6 +263,7 @@ config X86
select HAVE_PCI
select HAVE_PERF_REGS
select HAVE_PERF_USER_STACK_DUMP
+   select HAVE_PERF_USER_TLS_DUMP  if X86_64
select MMU_GATHER_RCU_TABLE_FREEif PARAVIRT
select MMU_GATHER_MERGE_VMAS
select HAVE_POSIX_CPU_TIMERS_TASK_WORK
diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index 09050641ce5d..3f851db4c591 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -41,6 +41,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "perf_event.h"
 
@@ -3002,3 +3003,16 @@ u64 perf_get_hw_event_config(int hw_event)
return 0;
 }
 EXPORT_SYMBOL_GPL(perf_get_hw_event_config);
+
+#ifdef CONFIG_X86_64
+void arch_perf_user_tls_pointer(struct perf_tls *tls)
+{
+   if (!mmap_is_ia32()) {
+   tls->base = current->thread.fsbase;
+   tls->size = sizeof(u64);
+   } else {
+   tls->base = current->thread.gsbase;
+   tls->size = sizeof(u32);
+   }
+}
+#endif
diff --git a/arch/x86/include/asm/perf_event.h b/arch/x86/include/asm/perf_event.h
index 3736b8a46c04..d0f65e572c20 100644
--- a/arch/x86/include/asm/perf_event.h
+++ b/arch/x86/include/asm/perf_event.h
@@ -628,4 +628,9 @@ static __always_inline void perf_lopwr_cb(bool lopwr_in)
 
 #define arch_perf_out_copy_user copy_from_user_nmi
 
+#ifdef CONFIG_HAVE_PERF_USER_TLS_DUMP
+struct perf_tls;
+extern void arch_perf_user_tls_pointer(struct perf_tls *tls);
+#endif
+
 #endif /* _ASM_X86_PERF_EVENT_H */
-- 
2.34.1




[RFC PATCH 3/4] perf/core: Factor perf_output_sample_udump()

2024-04-11 Thread Beau Belgrave
We now have two user dump sources (stack and TLS). Both use the same
logic to ensure the user dump ABI output is properly handled. The only
difference is that one computes the address within the method, while the
other is passed the address.

Add perf_output_sample_udump() and utilize it for both stack and tls
sample dumps. The sp register is now fetched outside of this method and
passed to it. This allows both stack and tls to utilize the same code.

Signed-off-by: Beau Belgrave 
---
 kernel/events/core.c | 68 +---
 1 file changed, 19 insertions(+), 49 deletions(-)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index f848bf4be9bd..6b3cf5afdd32 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -6998,47 +6998,10 @@ perf_sample_dump_size(u16 dump_size, u16 header_size, u64 task_size)
 }
 
 static void
-perf_output_sample_ustack(struct perf_output_handle *handle, u64 dump_size,
- struct pt_regs *regs)
-{
-   /* Case of a kernel thread, nothing to dump */
-   if (!regs) {
-   u64 size = 0;
-   perf_output_put(handle, size);
-   } else {
-   unsigned long sp;
-   unsigned int rem;
-   u64 dyn_size;
-
-   /*
-* We dump:
-* static size
-*   - the size requested by user or the best one we can fit
-* in to the sample max size
-* data
-*   - user stack dump data
-* dynamic size
-*   - the actual dumped size
-*/
-
-   /* Static size. */
-   perf_output_put(handle, dump_size);
-
-   /* Data. */
-   sp = perf_user_stack_pointer(regs);
-   rem = __output_copy_user(handle, (void *) sp, dump_size);
-   dyn_size = dump_size - rem;
-
-   perf_output_skip(handle, rem);
-
-   /* Dynamic size. */
-   perf_output_put(handle, dyn_size);
-   }
-}
-
-static void
-perf_output_sample_utls(struct perf_output_handle *handle, u64 addr,
-   u64 dump_size, struct pt_regs *regs)
+perf_output_sample_udump(struct perf_output_handle *handle,
+unsigned long addr,
+u64 dump_size,
+struct pt_regs *regs)
 {
/* Case of a kernel thread, nothing to dump */
if (!regs) {
@@ -7054,7 +7017,7 @@ perf_output_sample_utls(struct perf_output_handle *handle, u64 addr,
 *   - the size requested by user or the best one we can fit
 * in to the sample max size
 * data
-*   - user tls dump data
+*   - user dump data
 * dynamic size
 *   - the actual dumped size
 */
@@ -7507,9 +7470,16 @@ void perf_output_sample(struct perf_output_handle *handle,
}
 
if (sample_type & PERF_SAMPLE_STACK_USER) {
-   perf_output_sample_ustack(handle,
- data->stack_user_size,
- data->regs_user.regs);
+   struct pt_regs *regs = data->regs_user.regs;
+   unsigned long sp = 0;
+
+   if (regs)
+   sp = perf_user_stack_pointer(regs);
+
+   perf_output_sample_udump(handle,
+sp,
+data->stack_user_size,
+regs);
}
 
if (sample_type & PERF_SAMPLE_WEIGHT_TYPE)
@@ -7551,10 +7521,10 @@ void perf_output_sample(struct perf_output_handle *handle,
perf_output_put(handle, data->code_page_size);
 
if (sample_type & PERF_SAMPLE_TLS_USER) {
-   perf_output_sample_utls(handle,
-   data->tls_addr,
-   data->tls_user_size,
-   data->regs_user.regs);
+   perf_output_sample_udump(handle,
+data->tls_addr,
+data->tls_user_size,
+data->regs_user.regs);
}
 
if (sample_type & PERF_SAMPLE_AUX) {
-- 
2.34.1




[RFC PATCH 1/4] perf/core: Introduce perf_prepare_dump_data()

2024-04-11 Thread Beau Belgrave
Factor out perf_prepare_dump_data() so that the same logic is used for
dumping stack data as for other types, such as TLS.

Rename perf_sample_ustack_size() to perf_sample_dump_size() and generalize
it slightly. Move the regs check up into perf_ustack_task_size(), since the
task size must now be calculated before preparing dump data.

Signed-off-by: Beau Belgrave 
---
 kernel/events/core.c | 79 ++--
 1 file changed, 47 insertions(+), 32 deletions(-)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 724e6d7e128f..07de5cc2aa25 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -6912,7 +6912,13 @@ static void perf_sample_regs_intr(struct perf_regs *regs_intr,
  */
 static u64 perf_ustack_task_size(struct pt_regs *regs)
 {
-   unsigned long addr = perf_user_stack_pointer(regs);
+   unsigned long addr;
+
+   /* No regs, no stack pointer, no dump. */
+   if (!regs)
+   return 0;
+
+   addr = perf_user_stack_pointer(regs);
 
if (!addr || addr >= TASK_SIZE)
return 0;
@@ -6921,42 +6927,35 @@ static u64 perf_ustack_task_size(struct pt_regs *regs)
 }
 
 static u16
-perf_sample_ustack_size(u16 stack_size, u16 header_size,
-   struct pt_regs *regs)
+perf_sample_dump_size(u16 dump_size, u16 header_size, u64 task_size)
 {
-   u64 task_size;
-
-   /* No regs, no stack pointer, no dump. */
-   if (!regs)
-   return 0;
-
/*
-* Check if we fit in with the requested stack size into the:
+* Check if we fit in with the requested dump size into the:
 * - TASK_SIZE
 *   If we don't, we limit the size to the TASK_SIZE.
 *
 * - remaining sample size
-*   If we don't, we customize the stack size to
+*   If we don't, we customize the dump size to
 *   fit in to the remaining sample size.
 */
 
-   task_size  = min((u64) USHRT_MAX, perf_ustack_task_size(regs));
-   stack_size = min(stack_size, (u16) task_size);
+   task_size  = min((u64) USHRT_MAX, task_size);
+   dump_size = min(dump_size, (u16) task_size);
 
/* Current header size plus static size and dynamic size. */
header_size += 2 * sizeof(u64);
 
-   /* Do we fit in with the current stack dump size? */
-   if ((u16) (header_size + stack_size) < header_size) {
+   /* Do we fit in with the current dump size? */
+   if ((u16) (header_size + dump_size) < header_size) {
/*
 * If we overflow the maximum size for the sample,
-* we customize the stack dump size to fit in.
+* we customize the dump size to fit in.
 */
-   stack_size = USHRT_MAX - header_size - sizeof(u64);
-   stack_size = round_up(stack_size, sizeof(u64));
+   dump_size = USHRT_MAX - header_size - sizeof(u64);
+   dump_size = round_up(dump_size, sizeof(u64));
}
 
-   return stack_size;
+   return dump_size;
 }
 
 static void
@@ -7648,6 +7647,32 @@ static __always_inline u64 __cond_set(u64 flags, u64 s, u64 d)
return d * !!(flags & s);
 }
 
+static inline u16
+perf_prepare_dump_data(struct perf_sample_data *data,
+  struct perf_event *event,
+  struct pt_regs *regs,
+  u16 dump_size,
+  u64 task_size)
+{
+   u16 header_size = perf_sample_data_size(data, event);
+   u16 size = sizeof(u64);
+
+   dump_size = perf_sample_dump_size(dump_size, header_size,
+ task_size);
+
+   /*
+* If there is something to dump, add space for the dump
+* itself and for the field that tells the dynamic size,
+* which is how many have been actually dumped.
+*/
+   if (dump_size)
+   size += sizeof(u64) + dump_size;
+
+   data->dyn_size += size;
+
+   return dump_size;
+}
+
 void perf_prepare_sample(struct perf_sample_data *data,
 struct perf_event *event,
 struct pt_regs *regs)
@@ -7725,22 +7750,12 @@ void perf_prepare_sample(struct perf_sample_data *data,
 * up the rest of the sample size.
 */
u16 stack_size = event->attr.sample_stack_user;
-   u16 header_size = perf_sample_data_size(data, event);
-   u16 size = sizeof(u64);
-
-   stack_size = perf_sample_ustack_size(stack_size, header_size,
-data->regs_user.regs);
+   u64 task_size = perf_ustack_task_size(regs);
 
-   /*
-* If there is something to dump, add space for the dump
-* itself and for the field that tells the dynamic size,
-* which is how many have been actually dumped.
-*/
-   if 

Re: [PATCH] init/main.c: Remove redundant space from saved_command_line

2024-04-11 Thread Google
On Thu, 11 Apr 2024 23:29:40 +0800
Yuntao Wang  wrote:

> On Thu, 11 Apr 2024 23:07:45 +0900, Masami Hiramatsu (Google) wrote:
> 
> > On Thu, 11 Apr 2024 09:19:32 +0200
> > Geert Uytterhoeven  wrote:
> > 
> > > CC Hiramatsu-san (now for real :-)
> > 
> > Thanks!
> > 
> > > 
> > > On Thu, Apr 11, 2024 at 6:13 AM Yuntao Wang  wrote:
> > > > extra_init_args ends with a space, so when concatenating extra_init_args
> > > > to saved_command_line, be sure to remove the extra space.
> > 
> > Hi Yuntao,
> > 
> > Hmm, if you want to trim the end space, you should trim extra_init_args
> > itself instead of this adjustment. Also, can you share the example?
> > 
> > Thank you,
> 
> At first, I also intended to fix this issue as you suggested. However,
> because both extra_command_line and extra_init_args end with a space,
> making such a change would require modifications in many places.

You may just need:

if (extra_init_args)
strim(extra_init_args);

> That's why I chose this approach instead.
> 
> Here are some examples before and after modification:
> 
> Before: [0.829179] Kernel command line: 'console=ttyS0 debug -- bootconfig_arg1 '
> After:  [0.032648] Kernel command line: 'console=ttyS0 debug -- bootconfig_arg1'
> 
> Before: [0.757217] Kernel command line: 'console=ttyS0 debug -- bootconfig_arg1  arg1'
> After:  [0.068184] Kernel command line: 'console=ttyS0 debug -- bootconfig_arg1 arg1'
> 
> In order to make it easier to observe spaces, I added quotes when outputting saved_command_line.

BTW, does this trailing space harm anything? I don't like a cosmetic change.

Thank you,

> 
> Note that the first 'before' ends with a space, and there are two spaces between
> 'bootconfig_arg1' and 'arg1' in the second 'before'.
> 
> > > >
> > > > Signed-off-by: Yuntao Wang 
> > > > ---
> > > >  init/main.c | 4 +++-
> > > >  1 file changed, 3 insertions(+), 1 deletion(-)
> > > >
> > > > diff --git a/init/main.c b/init/main.c
> > > > index 2ca52474d0c3..cf2c22aa0e8c 100644
> > > > --- a/init/main.c
> > > > +++ b/init/main.c
> > > > @@ -660,12 +660,14 @@ static void __init setup_command_line(char *command_line)
> > > > strcpy(saved_command_line + len, extra_init_args);
> > > > len += ilen - 4;/* strlen(extra_init_args) */
> > > > strcpy(saved_command_line + len,
> > > > -   boot_command_line + initargs_offs - 1);
> > > > +   boot_command_line + initargs_offs);
> > > > } else {
> > > > len = strlen(saved_command_line);
> > > > strcpy(saved_command_line + len, " -- ");
> > > > len += 4;
> > > > strcpy(saved_command_line + len, extra_init_args);
> > > > +   len += ilen - 4; /* strlen(extra_init_args) */
> > > > +   saved_command_line[len-1] = '\0'; /* remove trailing space */
> > > > }
> > > > }
> > > 
> > > Gr{oetje,eeting}s,
> > > 
> > > Geert
> > > 
> > > -- 
> > > Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org
> > > 
> > > In personal conversations with technical people, I call myself a hacker. But
> > > when I'm talking to journalists I just say "programmer" or something like that.
> > > -- Linus Torvalds
> > > 
> > 
> > 
> > -- 
> > Masami Hiramatsu (Google) 


-- 
Masami Hiramatsu (Google) 



Re: [PATCH 2/3] kernel/pid: Remove default pid_max value

2024-04-11 Thread Andrew Morton
On Thu, 11 Apr 2024 17:40:02 +0200 Michal Koutný  wrote:

> Hello.
> 
> On Mon, Apr 08, 2024 at 01:29:55PM -0700, Andrew Morton wrote:
> > That seems like a large change.
> 
> In what sense is it large?

A large increase in the maximum number of processes.  Or did I misinterpret?





Re: [PATCH v4 06/15] mm/execmem, arch: convert simple overrides of module_alloc to execmem

2024-04-11 Thread Sam Ravnborg
Hi Mike.

On Thu, Apr 11, 2024 at 07:00:42PM +0300, Mike Rapoport wrote:
> From: "Mike Rapoport (IBM)" 
> 
> Several architectures override module_alloc() only to define an address
> range for code allocations different from the VMALLOC address space.
> 
> Provide a generic implementation in execmem that uses the parameters for
> address space ranges, required alignment and page protections provided
> by architectures.
> 
> The architectures must fill execmem_info structure and implement
> execmem_arch_setup() that returns a pointer to that structure. This way the
> execmem initialization won't be called from every architecture, but rather
> from a central place, namely a core_initcall() in execmem.
> 
> execmem provides an execmem_alloc() API that wraps __vmalloc_node_range()
> with the parameters defined by the architectures.  If an architecture does
> not implement execmem_arch_setup(), execmem_alloc() will fall back to
> module_alloc().
> 
> Signed-off-by: Mike Rapoport (IBM) 
> ---

This code snippet could be more readable ...
> diff --git a/arch/sparc/kernel/module.c b/arch/sparc/kernel/module.c
> index 66c45a2764bc..b70047f944cc 100644
> --- a/arch/sparc/kernel/module.c
> +++ b/arch/sparc/kernel/module.c
> @@ -14,6 +14,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  #include 
>  #include 
> @@ -21,34 +22,26 @@
>  
>  #include "entry.h"
>  
> +static struct execmem_info execmem_info __ro_after_init = {
> + .ranges = {
> + [EXECMEM_DEFAULT] = {
>  #ifdef CONFIG_SPARC64
> -
> -#include 
> -
> -static void *module_map(unsigned long size)
> -{
> - if (PAGE_ALIGN(size) > MODULES_LEN)
> - return NULL;
> - return __vmalloc_node_range(size, 1, MODULES_VADDR, MODULES_END,
> - GFP_KERNEL, PAGE_KERNEL, 0, NUMA_NO_NODE,
> - __builtin_return_address(0));
> -}
> + .start = MODULES_VADDR,
> + .end = MODULES_END,
>  #else
> -static void *module_map(unsigned long size)
> + .start = VMALLOC_START,
> + .end = VMALLOC_END,
> +#endif
> + .alignment = 1,
> + },
> + },
> +};
> +
> +struct execmem_info __init *execmem_arch_setup(void)
>  {
> - return vmalloc(size);
> -}
> -#endif /* CONFIG_SPARC64 */
> -
> -void *module_alloc(unsigned long size)
> -{
> - void *ret;
> -
> - ret = module_map(size);
> - if (ret)
> - memset(ret, 0, size);
> + execmem_info.ranges[EXECMEM_DEFAULT].pgprot = PAGE_KERNEL;
>  
> - return ret;
> + return &execmem_info;
>  }
>  
>  /* Make generic code ignore STT_REGISTER dummy undefined symbols.  */

... if the following was added:

diff --git a/arch/sparc/include/asm/pgtable_32.h b/arch/sparc/include/asm/pgtable_32.h
index 9e85d57ac3f2..62bcafe38b1f 100644
--- a/arch/sparc/include/asm/pgtable_32.h
+++ b/arch/sparc/include/asm/pgtable_32.h
@@ -432,6 +432,8 @@ static inline int io_remap_pfn_range(struct vm_area_struct *vma,

 #define VMALLOC_START   _AC(0xfe60,UL)
 #define VMALLOC_END _AC(0xffc0,UL)
+#define MODULES_VADDR   VMALLOC_START
+#define MODULES_END VMALLOC_END


Then the #ifdef CONFIG_SPARC64 could be dropped and the code would be
the same for 32 and 64 bits.

Just a drive-by comment.

Sam



Re: [PATCHv3 1/2] dt-bindings: usb: typec: anx7688: start a binding document

2024-04-11 Thread Dmitry Baryshkov
On Thu, Apr 11, 2024 at 09:59:35PM +0200, Krzysztof Kozlowski wrote:
> On 10/04/2024 04:20, Ondřej Jirman wrote:
> > On Mon, Apr 08, 2024 at 10:12:30PM GMT, Krzysztof Kozlowski wrote:
> >> On 08/04/2024 17:17, Ondřej Jirman wrote:
> >>>
> >>> Now for things to not fail during suspend/resume based on PM callbacks
> >>> invocation order, anx7688 driver needs to enable this regulator too, as long
> >>> as it needs it.
> >>
> >> No, the I2C bus driver needs to manage it. Not one individual I2C
> >> device. Again, why anx7688 is specific? If you next phone has anx8867,
> >> using different driver, you also add there i2c-supply? And if it is
> >> nxp,ptn5100 as well?
> > 
> > Yes, that could work, if I2C core would manage this.
> 
> Either I don't understand about which I2C regulator you speak or this is
> not I2C core regulator. This is a regulator to be managed by the I2C
> controller, not by I2C core.

If it is a supply that pulls up the SDA/SCL lines, then it is generic
enough to be handled by the core. For example, on Qualcomm platforms CCI
lines also usually have external supply as a pull-up.


-- 
With best wishes
Dmitry



Re: [PATCHv3 1/2] dt-bindings: usb: typec: anx7688: start a binding document

2024-04-11 Thread Krzysztof Kozlowski
On 10/04/2024 04:20, Ondřej Jirman wrote:
> On Mon, Apr 08, 2024 at 10:12:30PM GMT, Krzysztof Kozlowski wrote:
>> On 08/04/2024 17:17, Ondřej Jirman wrote:
>>>
>>> Now for things to not fail during suspend/resume based on PM callbacks
>>> invocation order, anx7688 driver needs to enable this regulator too, as long
>>> as it needs it.
>>
>> No, the I2C bus driver needs to manage it. Not one individual I2C
>> device. Again, why anx7688 is specific? If you next phone has anx8867,
>> using different driver, you also add there i2c-supply? And if it is
>> nxp,ptn5100 as well?
> 
> Yes, that could work, if I2C core would manage this.

Either I don't understand about which I2C regulator you speak or this is
not I2C core regulator. This is a regulator to be managed by the I2C
controller, not by I2C core.


> 
>>>
> >>> I can put bus-supply to I2C controller node, and read it from the ANX7688 driver
> >>> I guess, by going up a DT node. Whether that's going to be acceptable, I don't
> >>> know.
>>>
>>>
> >>> VCONN regulator I don't know where else to put either. It doesn't seem to belong
>>> anywhere. It's not something directly connected to Type-C connector, so
>>> not part of connector bindings, and there's nothing else I can see, other
>>> than anx7688 device which needs it for core functionality.
>>
>> That sounds like a GPIO, not regulator. anx7688 has GPIOs, right? On
>> Pinephone they go to regulator, but on FooPhone also using anx7688 they
>> go somewhere else, so why this anx7688 assumes this is a regulator?
> 
> CC1/CC2_VCONN control pins are "GPIO" of anx7688, sort of. They have fixed
> purpose of switching external 5V regulator output to one of the CC pins
> on type-c port. I don't care what other purpose with some other firmware
> someone puts to those pins. It's irrelevant to the use case of anx7688
> as a type-c controller/HDMI bridge, which we're describing here.
> 
> VCONN regulator is an actual GPIO controlled regulator on the board, and
> needs to be controlled by the anx7688 driver. So that CC1/CC2_VCONN control
> pins driven by the firmware actually do what they're supposed to do.
> 
> Not sure why it would be a business of anything else but anx7688 driver
> enabling this regulator, because only this driver knows and cares about this.
> If some other board doesn't have the need to manually enable the regulator, or
> doesn't have the regulator, it can simply be optional.
> 
> There are also some other funky supplies in the bindings, that are not connected
> to the chip in any way, but need to be controlled by the driver:
> 
> +  vbus-supply: true
> +  vbus-in-supply: true

Yeah, the vconn looks reasonable. Just provide description of the
supply, so it will be obvious.

> 



Best regards,
Krzysztof




Re: [PATCH v4 00/15] mm: jit/text allocator

2024-04-11 Thread Luis Chamberlain
On Thu, Apr 11, 2024 at 07:00:36PM +0300, Mike Rapoport wrote:
> From: "Mike Rapoport (IBM)" 
> 
> Hi,
> 
> Since v3 I looked into making execmem more of a utility toolbox, as we
> discussed at LPC with Mark Rutland, but it was getting hairier than
> having a struct describing architecture constraints and a type identifying
> the consumer of execmem.
> 
> And I do think that having the description of architecture constraints for
> allocations of executable memory in a single place is better than having it
> spread all over the place.
> 
> The patches available via git:
> https://git.kernel.org/pub/scm/linux/kernel/git/rppt/linux.git/log/?h=execmem/v4

I've taken the first 5 patches through modules-next for now to get early
exposure to testing. Of those I just had minor nit feedback on the 5th,
but the rest look good.

Let's wait for review for the rest of the patches 6-15.

  Luis



Re: [PATCH v4 05/15] mm: introduce execmem_alloc() and execmem_free()

2024-04-11 Thread Luis Chamberlain
On Thu, Apr 11, 2024 at 07:00:41PM +0300, Mike Rapoport wrote:
> From: "Mike Rapoport (IBM)" 
> 
> module_alloc() is used everywhere as a means to allocate memory for code.
> 
> Besides being semantically wrong, this unnecessarily ties all subsystems
> that need to allocate code, such as ftrace, kprobes and BPF, to modules and
> puts the burden of code allocation on the modules code.
> 
> Several architectures override module_alloc() because of various
> constraints where the executable memory can be located and this causes
> additional obstacles for improvements of code allocation.
> 
> Start splitting code allocation from modules by introducing execmem_alloc()
> and execmem_free() APIs.
> 
> Initially, execmem_alloc() is a wrapper for module_alloc() and
> execmem_free() is a replacement of module_memfree() to allow updating all
> call sites to use the new APIs.
> 
> Since architectures define different restrictions on placement,
> permissions, alignment and other parameters for memory that can be used by
> different subsystems that allocate executable memory, execmem_alloc() takes
> a type argument, that will be used to identify the calling subsystem and to
> allow architectures define parameters for ranges suitable for that
> subsystem.
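
For context, a minimal sketch of the API shape being introduced here, with
the type and signature as described in this series (an illustrative caller,
not a complete consumer):

#include <linux/execmem.h>

/* Hypothetical caller: the type tells the architecture which address
 * range, alignment and protections to apply for this consumer. */
static void *alloc_insn_page_sketch(void)
{
        return execmem_alloc(EXECMEM_KPROBES, PAGE_SIZE);
}

static void free_insn_page_sketch(void *page)
{
        execmem_free(page);
}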

It would be good to describe that this is a non-functional change.

> Signed-off-by: Mike Rapoport (IBM) 
> ---

> diff --git a/mm/execmem.c b/mm/execmem.c
> new file mode 100644
> index ..ed2ea41a2543
> --- /dev/null
> +++ b/mm/execmem.c
> @@ -0,0 +1,26 @@
> +// SPDX-License-Identifier: GPL-2.0

And this just needs to copy over the copyright notices from the main.c file.

  Luis



Re: [PATCH v3 5/7] KVM: x86: Participate in bitmap-based PTE aging

2024-04-11 Thread David Matlack
On Thu, Apr 11, 2024 at 11:00 AM David Matlack  wrote:
>
> On Thu, Apr 11, 2024 at 10:28 AM David Matlack  wrote:
> >
> > On 2024-04-11 10:08 AM, David Matlack wrote:
> > > On 2024-04-01 11:29 PM, James Houghton wrote:
> > > > Only handle the TDP MMU case for now. In other cases, if a bitmap was
> > > > not provided, fallback to the slowpath that takes mmu_lock, or, if a
> > > > bitmap was provided, inform the caller that the bitmap is unreliable.
> > > >
> > > > Suggested-by: Yu Zhao 
> > > > Signed-off-by: James Houghton 
> > > > ---
> > > >  arch/x86/include/asm/kvm_host.h | 14 ++
> > > >  arch/x86/kvm/mmu/mmu.c  | 16 ++--
> > > >  arch/x86/kvm/mmu/tdp_mmu.c  | 10 +-
> > > >  3 files changed, 37 insertions(+), 3 deletions(-)
> > > >
> > > > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> > > > index 3b58e2306621..c30918d0887e 100644
> > > > --- a/arch/x86/include/asm/kvm_host.h
> > > > +++ b/arch/x86/include/asm/kvm_host.h
> > > > @@ -2324,4 +2324,18 @@ int memslot_rmap_alloc(struct kvm_memory_slot *slot, unsigned long npages);
> > > >   */
> > > >  #define KVM_EXIT_HYPERCALL_MBZ GENMASK_ULL(31, 1)
> > > >
> > > > +#define kvm_arch_prepare_bitmap_age kvm_arch_prepare_bitmap_age
> > > > +static inline bool kvm_arch_prepare_bitmap_age(struct mmu_notifier *mn)
> > > > +{
> > > > +   /*
> > > > +* Indicate that we support bitmap-based aging when using the TDP MMU
> > > > +* and the accessed bit is available in the TDP page tables.
> > > > +*
> > > > +* We have no other preparatory work to do here, so we do not need to
> > > > +* redefine kvm_arch_finish_bitmap_age().
> > > > +*/
> > > > +   return IS_ENABLED(CONFIG_X86_64) && tdp_mmu_enabled
> > > > +&& shadow_accessed_mask;
> > > > +}
> > > > +
> > > >  #endif /* _ASM_X86_KVM_HOST_H */
> > > > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > > > index 992e651540e8..fae1a75750bb 100644
> > > > --- a/arch/x86/kvm/mmu/mmu.c
> > > > +++ b/arch/x86/kvm/mmu/mmu.c
> > > > @@ -1674,8 +1674,14 @@ bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
> > > >  {
> > > > bool young = false;
> > > >
> > > > -   if (kvm_memslots_have_rmaps(kvm))
> > > > +   if (kvm_memslots_have_rmaps(kvm)) {
> > > > +   if (range->lockless) {
> > > > +   kvm_age_set_unreliable(range);
> > > > +   return false;
> > > > +   }
> > >
> > > If a VM has TDP MMU enabled, supports A/D bits, and is using nested
> > > virtualization, MGLRU will effectively be blind to all accesses made by
> > > the VM.
> > >
> > > kvm_arch_prepare_bitmap_age() will return true indicating that the
> > > bitmap is supported. But then kvm_age_gfn() and kvm_test_age_gfn() will
> > > return false immediately and indicate the bitmap is unreliable because a
> > > shadow root is allocated. The notifier will then return
> > > MMU_NOTIFIER_YOUNG_BITMAP_UNRELIABLE.
>
> Ah no, I'm wrong here. Setting args.unreliable causes the notifier to
> return 0 instead of MMU_NOTIFIER_YOUNG_FAST.
> MMU_NOTIFIER_YOUNG_BITMAP_UNRELIABLE is used for something else.

Nope, wrong again. Just ignore me while I try to figure out how this
actually works :)



Re: [PATCH v14 4/4] remoteproc: zynqmp: parse TCM from device tree

2024-04-11 Thread Tanmay Shah



On 4/11/24 11:12 AM, Mathieu Poirier wrote:
> On Wed, Apr 10, 2024 at 05:36:30PM -0500, Tanmay Shah wrote:
>> 
>> 
>> On 4/10/24 11:59 AM, Mathieu Poirier wrote:
>> > On Mon, Apr 08, 2024 at 01:53:14PM -0700, Tanmay Shah wrote:
>> >> ZynqMP TCM information was fixed in driver. Now ZynqMP TCM information
>> >> is available in device-tree. Parse TCM information in driver
>> >> as per new bindings.
>> >> 
>> >> Signed-off-by: Tanmay Shah 
>> >> ---
>> >> 
>> >> Changes in v14:
>> >>   - Add Versal platform support
>> >>   - Add Versal-NET platform support
>> >>   - Maintain backward compatibility for ZynqMP platform and use hardcode
>> >> TCM addresses
>> >>   - Configure TCM based on xlnx,tcm-mode property for Versal
>> >>   - Avoid TCM configuration if that property isn't available in DT 
>> >> 
>> >>  drivers/remoteproc/xlnx_r5_remoteproc.c | 173 ++--
>> >>  1 file changed, 132 insertions(+), 41 deletions(-)
>> >> 
>> >> diff --git a/drivers/remoteproc/xlnx_r5_remoteproc.c 
>> >> b/drivers/remoteproc/xlnx_r5_remoteproc.c
>> >> index 0f942440b4e2..504492f930ac 100644
>> >> --- a/drivers/remoteproc/xlnx_r5_remoteproc.c
>> >> +++ b/drivers/remoteproc/xlnx_r5_remoteproc.c
>> >> @@ -74,8 +74,8 @@ struct mbox_info {
>> >>  };
>> >>  
>> >>  /*
>> >> - * Hardcoded TCM bank values. This will be removed once TCM bindings are
>> >> - * accepted for system-dt specifications and upstreamed in linux kernel
>> >> + * Hardcoded TCM bank values. This will stay in driver to maintain 
>> >> backward
>> >> + * compatibility with device-tree that does not have TCM information.
>> >>   */
>> >>  static const struct mem_bank_data zynqmp_tcm_banks_split[] = {
>> >>   {0xffe0UL, 0x0, 0x1UL, PD_R5_0_ATCM, "atcm0"}, /* TCM 64KB each 
>> >> */
>> >> @@ -300,36 +300,6 @@ static void zynqmp_r5_rproc_kick(struct rproc 
>> >> *rproc, int vqid)
>> >>   dev_warn(dev, "failed to send message\n");
>> >>  }
>> >>  
>> >> -/*
>> >> - * zynqmp_r5_set_mode()
>> >> - *
>> >> - * set RPU cluster and TCM operation mode
>> >> - *
>> >> - * @r5_core: pointer to zynqmp_r5_core type object
>> >> - * @fw_reg_val: value expected by firmware to configure RPU cluster mode
>> >> - * @tcm_mode: value expected by fw to configure TCM mode (lockstep or 
>> >> split)
>> >> - *
>> >> - * Return: 0 for success and < 0 for failure
>> >> - */
>> >> -static int zynqmp_r5_set_mode(struct zynqmp_r5_core *r5_core,
>> >> -   enum rpu_oper_mode fw_reg_val,
>> >> -   enum rpu_tcm_comb tcm_mode)
>> >> -{
>> >> - int ret;
>> >> -
>> >> - ret = zynqmp_pm_set_rpu_mode(r5_core->pm_domain_id, fw_reg_val);
>> >> - if (ret < 0) {
>> >> - dev_err(r5_core->dev, "failed to set RPU mode\n");
>> >> - return ret;
>> >> - }
>> >> -
>> >> - ret = zynqmp_pm_set_tcm_config(r5_core->pm_domain_id, tcm_mode);
>> >> - if (ret < 0)
>> >> - dev_err(r5_core->dev, "failed to configure TCM\n");
>> >> -
>> >> - return ret;
>> >> -}
>> >> -
>> >>  /*
>> >>   * zynqmp_r5_rproc_start()
>> >>   * @rproc: single R5 core's corresponding rproc instance
>> >> @@ -761,6 +731,103 @@ static struct zynqmp_r5_core 
>> >> *zynqmp_r5_add_rproc_core(struct device *cdev)
>> >>   return ERR_PTR(ret);
>> >>  }
>> >>  
>> >> +static int zynqmp_r5_get_tcm_node_from_dt(struct zynqmp_r5_cluster 
>> >> *cluster)
>> >> +{
>> >> + int i, j, tcm_bank_count, ret, tcm_pd_idx, pd_count;
>> >> + struct of_phandle_args out_args;
>> >> + struct zynqmp_r5_core *r5_core;
>> >> + struct platform_device *cpdev;
>> >> + struct mem_bank_data *tcm;
>> >> + struct device_node *np;
>> >> + struct resource *res;
>> >> + u64 abs_addr, size;
>> >> + struct device *dev;
>> >> +
>> >> + for (i = 0; i < cluster->core_count; i++) {
>> >> + r5_core = cluster->r5_cores[i];
>> >> + dev = r5_core->dev;
>> >> + np = r5_core->np;
>> >> +
>> >> + pd_count = of_count_phandle_with_args(np, "power-domains",
>> >> +   "#power-domain-cells");
>> >> +
>> >> + if (pd_count <= 0) {
>> >> + dev_err(dev, "invalid power-domains property, %d\n", 
>> >> pd_count);
>> >> + return -EINVAL;
>> >> + }
>> >> +
>> >> + /* First entry in power-domains list is for r5 core, rest for 
>> >> TCM. */
>> >> + tcm_bank_count = pd_count - 1;
>> >> +
>> >> + if (tcm_bank_count <= 0) {
>> >> + dev_err(dev, "invalid TCM count %d\n", tcm_bank_count);
>> >> + return -EINVAL;
>> >> + }
>> >> +
>> >> + r5_core->tcm_banks = devm_kcalloc(dev, tcm_bank_count,
>> >> +   sizeof(struct mem_bank_data 
>> >> *),
>> >> +   GFP_KERNEL);
>> >> + if (!r5_core->tcm_banks)
>> >> + return -ENOMEM;
>> >> +
>> >> + r5_core->tcm_bank_count = tcm_bank_count;
>> >> + for (j = 0, tcm_pd_idx = 

Re: [PATCH 1/4] KVM: delete .change_pte MMU notifier callback

2024-04-11 Thread Peter Xu
On Thu, Apr 11, 2024 at 06:55:44PM +0200, Paolo Bonzini wrote:
> On Mon, Apr 8, 2024 at 3:56 PM Peter Xu  wrote:
> > Paolo,
> >
> > I may miss a bunch of details here (as I still remember some change_pte
> > patches previously on the list..), however not sure whether we considered
> > enable it?  Asked because I remember Andrea used to have a custom tree
> > maintaining that part:
> >
> > https://github.com/aagit/aa/commit/c761078df7a77d13ddfaeebe56a0f4bc128b1968
> 
> The patch enables it only for KSM, so it would still require a bunch
> of cleanups, for example I also would still use set_pte_at() in all
> the places that are not KSM. This would at least fix the issue with
> the poor documentation of where to use set_pte_at_notify() vs
> set_pte_at().
> 
> With regard to the implementation, I like the idea of disabling the
> invalidation on the MMU notifier side, but I would rather have
> MMU_NOTIFIER_CHANGE_PTE as a separate field in the range instead of
> overloading the event field.
> 
> > Maybe it can't be enabled for some reason that I overlooked in the current
> > tree, or we just decided to not to?
> 
> I have just learnt about the patch, nobody had ever mentioned it even
> though it's almost 2 years old... It's a lot of code though and no one
> has ever reported an issue for over 10 years, so I think it's easiest
> to just rip the code out.

Right, it was pretty old and I have no idea if that was discussed or
published before..  It would be better to have discussed this earlier.

As long as we make the decision with that in mind, it looks fine to me
to go either way, and I also agree either way is better than keeping
the status quo.

I also copied Andrea when I replied, so I guess he should be
aware of this and he can chime in anytime.

Thanks!

-- 
Peter Xu




[PATCH 5/5] openrisc: Move FPU state out of pt_regs

2024-04-11 Thread Stafford Horne
My original, naive, FPU support patch had the FPCSR register stored
during both the *mode switch* and *context switch*.  This is wasteful.

Also, the original patches did not save the FPU state when handling
signals during the system call fast path.

We fix this by moving the FPCSR state to thread_struct in task_struct.
We also introduce new helper functions save_fpu and restore_fpu which
can be used to sync the FPU with thread_struct.  These functions are now
called when needed:

 - Setting up and restoring sigcontext when handling signals
 - Before and after __switch_to during context switches
 - When handling FPU exceptions
 - When reading and writing FPU register sets

In the future we can further optimize this by doing lazy FPU save and
restore.  For example, FPU sync is not needed when switching to and from
kernel threads (x86 does this).  FPU save and restore does not need to
be done two times if we have both rescheduling and signal work to do.
However, since OpenRISC FPU state is a single register, I leave these
optimizations for future consideration.
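
As a rough illustration of the call pattern described above (a sketch
only, not part of the patch; the sigcontext field name is assumed),
the signal path keeps thread_struct and the hardware register in sync
roughly like this:

	/* Saving FPU state into a signal frame: */
	save_fpu(current);	/* SPR_FPCSR -> current->thread.fpcsr */
	err |= __put_user(current->thread.fpcsr, &sc->fpcsr);

	/* Restoring it on return from the signal handler: */
	err |= __get_user(current->thread.fpcsr, &sc->fpcsr);
	restore_fpu(current);	/* current->thread.fpcsr -> SPR_FPCSR */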

Signed-off-by: Stafford Horne 
---
 arch/openrisc/include/asm/fpu.h   | 22 
 arch/openrisc/include/asm/processor.h |  1 +
 arch/openrisc/include/asm/ptrace.h|  3 +--
 arch/openrisc/kernel/entry.S  | 15 +--
 arch/openrisc/kernel/process.c|  5 
 arch/openrisc/kernel/ptrace.c | 12 +++--
 arch/openrisc/kernel/signal.c | 36 +--
 arch/openrisc/kernel/traps.c  | 14 +++
 8 files changed, 76 insertions(+), 32 deletions(-)
 create mode 100644 arch/openrisc/include/asm/fpu.h

diff --git a/arch/openrisc/include/asm/fpu.h b/arch/openrisc/include/asm/fpu.h
new file mode 100644
index ..57bc44d80d53
--- /dev/null
+++ b/arch/openrisc/include/asm/fpu.h
@@ -0,0 +1,22 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef __ASM_OPENRISC_FPU_H
+#define __ASM_OPENRISC_FPU_H
+
+struct task_struct;
+
+#ifdef CONFIG_FPU
+static inline void save_fpu(struct task_struct *task)
+{
+   task->thread.fpcsr = mfspr(SPR_FPCSR);
+}
+
+static inline void restore_fpu(struct task_struct *task)
+{
+   mtspr(SPR_FPCSR, task->thread.fpcsr);
+}
+#else
+#define save_fpu(tsk)  do { } while (0)
+#define restore_fpu(tsk)   do { } while (0)
+#endif
+
+#endif /* __ASM_OPENRISC_FPU_H */
diff --git a/arch/openrisc/include/asm/processor.h 
b/arch/openrisc/include/asm/processor.h
index 3b736e74e6ed..e05d1b59e24e 100644
--- a/arch/openrisc/include/asm/processor.h
+++ b/arch/openrisc/include/asm/processor.h
@@ -44,6 +44,7 @@
 struct task_struct;
 
 struct thread_struct {
+   long fpcsr; /* Floating point control status register. */
 };
 
 /*
diff --git a/arch/openrisc/include/asm/ptrace.h 
b/arch/openrisc/include/asm/ptrace.h
index 375147ff71fc..1da3e66292e2 100644
--- a/arch/openrisc/include/asm/ptrace.h
+++ b/arch/openrisc/include/asm/ptrace.h
@@ -59,7 +59,7 @@ struct pt_regs {
 * -1 for all other exceptions.
 */
long  orig_gpr11;   /* For restarting system calls */
-   long fpcsr; /* Floating point control status register. */
+   long dummy; /* Cheap alignment fix */
long dummy2;/* Cheap alignment fix */
 };
 
@@ -115,6 +115,5 @@ static inline long regs_return_value(struct pt_regs *regs)
 #define PT_GPR31  124
 #define PT_PC128
 #define PT_ORIG_GPR11 132
-#define PT_FPCSR  136
 
 #endif /* __ASM_OPENRISC_PTRACE_H */
diff --git a/arch/openrisc/kernel/entry.S b/arch/openrisc/kernel/entry.S
index c9f48e750b72..440711d7bf40 100644
--- a/arch/openrisc/kernel/entry.S
+++ b/arch/openrisc/kernel/entry.S
@@ -106,8 +106,6 @@
l.mtspr r0,r3,SPR_EPCR_BASE ;\
l.lwz   r3,PT_SR(r1);\
l.mtspr r0,r3,SPR_ESR_BASE  ;\
-   l.lwz   r3,PT_FPCSR(r1) ;\
-   l.mtspr r0,r3,SPR_FPCSR ;\
l.lwz   r2,PT_GPR2(r1)  ;\
l.lwz   r3,PT_GPR3(r1)  ;\
l.lwz   r4,PT_GPR4(r1)  ;\
@@ -177,8 +175,6 @@ handler:
;\
/* r30 already save */  ;\
l.swPT_GPR31(r1),r31;\
TRACE_IRQS_OFF_ENTRY;\
-   l.mfspr r30,r0,SPR_FPCSR;\
-   l.swPT_FPCSR(r1),r30;\
/* Store -1 in orig_gpr11 for non-syscall exceptions */ ;\
l.addi  r30,r0,-1   ;\
l.swPT_ORIG_GPR11(r1),r30
@@ -219,8 +215,6 @@ handler:
;\
/* Store 

[PATCH 4/5] openrisc: Add FPU config

2024-04-11 Thread Stafford Horne
Allow disabling FPU related code sequences to save space.

Signed-off-by: Stafford Horne 
---
 arch/openrisc/Kconfig | 9 +
 arch/openrisc/kernel/ptrace.c | 6 ++
 arch/openrisc/kernel/traps.c  | 3 ++-
 3 files changed, 17 insertions(+), 1 deletion(-)

diff --git a/arch/openrisc/Kconfig b/arch/openrisc/Kconfig
index 3586cda55bde..69c0258700b2 100644
--- a/arch/openrisc/Kconfig
+++ b/arch/openrisc/Kconfig
@@ -188,6 +188,15 @@ config SMP
 
  If you don't know what to do here, say N.
 
+config FPU
+   bool "FPU support"
+   default y
+   help
+ Say N here if you want to disable all floating-point related 
procedures
+ in the kernel and reduce binary size.
+
+ If you don't know what to do here, say Y.
+
 source "kernel/Kconfig.hz"
 
 config OPENRISC_NO_SPR_SR_DSX
diff --git a/arch/openrisc/kernel/ptrace.c b/arch/openrisc/kernel/ptrace.c
index 1eeac3b62e9d..cf410193095f 100644
--- a/arch/openrisc/kernel/ptrace.c
+++ b/arch/openrisc/kernel/ptrace.c
@@ -88,6 +88,7 @@ static int genregs_set(struct task_struct *target,
return ret;
 }
 
+#ifdef CONFIG_FPU
 /*
  * As OpenRISC shares GPRs and floating point registers we don't need to export
  * the floating point registers again.  So here we only export the fpcsr 
special
@@ -115,13 +116,16 @@ static int fpregs_set(struct task_struct *target,
				 &regs->fpcsr, 0, 4);
return ret;
 }
+#endif
 
 /*
  * Define the register sets available on OpenRISC under Linux
  */
 enum or1k_regset {
REGSET_GENERAL,
+#ifdef CONFIG_FPU
REGSET_FPU,
+#endif
 };
 
 static const struct user_regset or1k_regsets[] = {
@@ -133,6 +137,7 @@ static const struct user_regset or1k_regsets[] = {
.regset_get = genregs_get,
.set = genregs_set,
},
+#ifdef CONFIG_FPU
[REGSET_FPU] = {
.core_note_type = NT_PRFPREG,
.n = sizeof(struct __or1k_fpu_state) / sizeof(long),
@@ -141,6 +146,7 @@ static const struct user_regset or1k_regsets[] = {
.regset_get = fpregs_get,
.set = fpregs_set,
},
+#endif
 };
 
 static const struct user_regset_view user_or1k_native_view = {
diff --git a/arch/openrisc/kernel/traps.c b/arch/openrisc/kernel/traps.c
index 211ddaa0c5fa..57e0d674eb04 100644
--- a/arch/openrisc/kernel/traps.c
+++ b/arch/openrisc/kernel/traps.c
@@ -182,6 +182,7 @@ asmlinkage void do_fpe_trap(struct pt_regs *regs, unsigned 
long address)
 {
if (user_mode(regs)) {
int code = FPE_FLTUNK;
+#ifdef CONFIG_FPU
unsigned long fpcsr = regs->fpcsr;
 
if (fpcsr & SPR_FPCSR_IVF)
@@ -197,7 +198,7 @@ asmlinkage void do_fpe_trap(struct pt_regs *regs, unsigned 
long address)
 
/* Clear all flags */
regs->fpcsr &= ~SPR_FPCSR_ALLF;
-
+#endif
force_sig_fault(SIGFPE, code, (void __user *)regs->pc);
} else {
pr_emerg("KERNEL: Illegal fpe exception 0x%.8lx\n", regs->pc);
-- 
2.44.0




[PATCH 3/5] openrisc: traps: Don't send signals to kernel mode threads

2024-04-11 Thread Stafford Horne
OpenRISC exception handling sends signals to user processes on floating
point exceptions and trap instructions (for debugging) among others.
There is a bug where the trap handling logic may send signals to kernel
threads; we should not send these signals to kernel threads, and if that
happens we treat it as an error.

This patch adds conditions to die if the kernel receives these
exceptions in kernel mode code.

Fixes: 27267655c531 ("openrisc: Support floating point user api")
Signed-off-by: Stafford Horne 
---
 arch/openrisc/kernel/traps.c | 48 ++--
 1 file changed, 29 insertions(+), 19 deletions(-)

diff --git a/arch/openrisc/kernel/traps.c b/arch/openrisc/kernel/traps.c
index 88fe27e4c10c..211ddaa0c5fa 100644
--- a/arch/openrisc/kernel/traps.c
+++ b/arch/openrisc/kernel/traps.c
@@ -180,29 +180,39 @@ asmlinkage void unhandled_exception(struct pt_regs *regs, 
int ea, int vector)
 
 asmlinkage void do_fpe_trap(struct pt_regs *regs, unsigned long address)
 {
-   int code = FPE_FLTUNK;
-   unsigned long fpcsr = regs->fpcsr;
-
-   if (fpcsr & SPR_FPCSR_IVF)
-   code = FPE_FLTINV;
-   else if (fpcsr & SPR_FPCSR_OVF)
-   code = FPE_FLTOVF;
-   else if (fpcsr & SPR_FPCSR_UNF)
-   code = FPE_FLTUND;
-   else if (fpcsr & SPR_FPCSR_DZF)
-   code = FPE_FLTDIV;
-   else if (fpcsr & SPR_FPCSR_IXF)
-   code = FPE_FLTRES;
-
-   /* Clear all flags */
-   regs->fpcsr &= ~SPR_FPCSR_ALLF;
-
-   force_sig_fault(SIGFPE, code, (void __user *)regs->pc);
+   if (user_mode(regs)) {
+   int code = FPE_FLTUNK;
+   unsigned long fpcsr = regs->fpcsr;
+
+   if (fpcsr & SPR_FPCSR_IVF)
+   code = FPE_FLTINV;
+   else if (fpcsr & SPR_FPCSR_OVF)
+   code = FPE_FLTOVF;
+   else if (fpcsr & SPR_FPCSR_UNF)
+   code = FPE_FLTUND;
+   else if (fpcsr & SPR_FPCSR_DZF)
+   code = FPE_FLTDIV;
+   else if (fpcsr & SPR_FPCSR_IXF)
+   code = FPE_FLTRES;
+
+   /* Clear all flags */
+   regs->fpcsr &= ~SPR_FPCSR_ALLF;
+
+   force_sig_fault(SIGFPE, code, (void __user *)regs->pc);
+   } else {
+   pr_emerg("KERNEL: Illegal fpe exception 0x%.8lx\n", regs->pc);
+   die("Die:", regs, SIGFPE);
+   }
 }
 
 asmlinkage void do_trap(struct pt_regs *regs, unsigned long address)
 {
-   force_sig_fault(SIGTRAP, TRAP_BRKPT, (void __user *)regs->pc);
+   if (user_mode(regs)) {
+   force_sig_fault(SIGTRAP, TRAP_BRKPT, (void __user *)regs->pc);
+   } else {
+   pr_emerg("KERNEL: Illegal trap exception 0x%.8lx\n", regs->pc);
+   die("Die:", regs, SIGILL);
+   }
 }
 
 asmlinkage void do_unaligned_access(struct pt_regs *regs, unsigned long 
address)
-- 
2.44.0




[PATCH 2/5] openrisc: traps: Remove calls to show_registers before die

2024-04-11 Thread Stafford Horne
The die function calls show_registers unconditionally.  Remove calls to
show_registers before calling die to avoid printing all registers and
stack status two times during a crash.

This was found when testing kernel trap and floating point exception
handling.

Signed-off-by: Stafford Horne 
---
 arch/openrisc/kernel/traps.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/arch/openrisc/kernel/traps.c b/arch/openrisc/kernel/traps.c
index 6d0fee912747..88fe27e4c10c 100644
--- a/arch/openrisc/kernel/traps.c
+++ b/arch/openrisc/kernel/traps.c
@@ -212,7 +212,6 @@ asmlinkage void do_unaligned_access(struct pt_regs *regs, 
unsigned long address)
force_sig_fault(SIGBUS, BUS_ADRALN, (void __user *)address);
} else {
pr_emerg("KERNEL: Unaligned Access 0x%.8lx\n", address);
-   show_registers(regs);
die("Die:", regs, address);
}
 
@@ -225,7 +224,6 @@ asmlinkage void do_bus_fault(struct pt_regs *regs, unsigned 
long address)
force_sig_fault(SIGBUS, BUS_ADRERR, (void __user *)address);
} else {/* Kernel mode */
pr_emerg("KERNEL: Bus error (SIGBUS) 0x%.8lx\n", address);
-   show_registers(regs);
die("Die:", regs, address);
}
 }
@@ -421,7 +419,6 @@ asmlinkage void do_illegal_instruction(struct pt_regs *regs,
} else {/* Kernel mode */
pr_emerg("KERNEL: Illegal instruction (SIGILL) 0x%.8lx\n",
 address);
-   show_registers(regs);
die("Die:", regs, address);
}
 }
-- 
2.44.0




[PATCH 1/5] openrisc: traps: Convert printks to pr_ macros

2024-04-11 Thread Stafford Horne
The pr_* macros are the convention and my upcoming patches add even more
printk's.  Use this opportunity to convert the printks in this file to
the pr_* macros to avoid patch check warnings.

Signed-off-by: Stafford Horne 
---
 arch/openrisc/kernel/traps.c | 88 ++--
 1 file changed, 44 insertions(+), 44 deletions(-)

diff --git a/arch/openrisc/kernel/traps.c b/arch/openrisc/kernel/traps.c
index 9370888c9a7e..6d0fee912747 100644
--- a/arch/openrisc/kernel/traps.c
+++ b/arch/openrisc/kernel/traps.c
@@ -51,16 +51,16 @@ static void print_trace(void *data, unsigned long addr, int 
reliable)
 {
const char *loglvl = data;
 
-   printk("%s[<%p>] %s%pS\n", loglvl, (void *) addr, reliable ? "" : "? ",
-  (void *) addr);
+   pr_info("%s[<%p>] %s%pS\n", loglvl, (void *) addr, reliable ? "" : "? ",
+   (void *) addr);
 }
 
 static void print_data(unsigned long base_addr, unsigned long word, int i)
 {
if (i == 0)
-   printk("(%08lx:)\t%08lx", base_addr + (i * 4), word);
+   pr_info("(%08lx:)\t%08lx", base_addr + (i * 4), word);
else
-   printk(" %08lx:\t%08lx", base_addr + (i * 4), word);
+   pr_info(" %08lx:\t%08lx", base_addr + (i * 4), word);
 }
 
 /* displays a short stack trace */
@@ -69,7 +69,7 @@ void show_stack(struct task_struct *task, unsigned long *esp, 
const char *loglvl
if (esp == NULL)
esp = (unsigned long *)&esp;
 
-   printk("%sCall trace:\n", loglvl);
+   pr_info("%sCall trace:\n", loglvl);
unwind_stack((void *)loglvl, esp, print_trace);
 }
 
@@ -83,57 +83,57 @@ void show_registers(struct pt_regs *regs)
if (user_mode(regs))
in_kernel = 0;
 
-   printk("CPU #: %d\n"
-  "   PC: %08lxSR: %08lxSP: %08lx FPCSR: %08lx\n",
-  smp_processor_id(), regs->pc, regs->sr, regs->sp,
-  regs->fpcsr);
-   printk("GPR00: %08lx GPR01: %08lx GPR02: %08lx GPR03: %08lx\n",
-  0L, regs->gpr[1], regs->gpr[2], regs->gpr[3]);
-   printk("GPR04: %08lx GPR05: %08lx GPR06: %08lx GPR07: %08lx\n",
-  regs->gpr[4], regs->gpr[5], regs->gpr[6], regs->gpr[7]);
-   printk("GPR08: %08lx GPR09: %08lx GPR10: %08lx GPR11: %08lx\n",
-  regs->gpr[8], regs->gpr[9], regs->gpr[10], regs->gpr[11]);
-   printk("GPR12: %08lx GPR13: %08lx GPR14: %08lx GPR15: %08lx\n",
-  regs->gpr[12], regs->gpr[13], regs->gpr[14], regs->gpr[15]);
-   printk("GPR16: %08lx GPR17: %08lx GPR18: %08lx GPR19: %08lx\n",
-  regs->gpr[16], regs->gpr[17], regs->gpr[18], regs->gpr[19]);
-   printk("GPR20: %08lx GPR21: %08lx GPR22: %08lx GPR23: %08lx\n",
-  regs->gpr[20], regs->gpr[21], regs->gpr[22], regs->gpr[23]);
-   printk("GPR24: %08lx GPR25: %08lx GPR26: %08lx GPR27: %08lx\n",
-  regs->gpr[24], regs->gpr[25], regs->gpr[26], regs->gpr[27]);
-   printk("GPR28: %08lx GPR29: %08lx GPR30: %08lx GPR31: %08lx\n",
-  regs->gpr[28], regs->gpr[29], regs->gpr[30], regs->gpr[31]);
-   printk("  RES: %08lx oGPR11: %08lx\n",
-  regs->gpr[11], regs->orig_gpr11);
-
-   printk("Process %s (pid: %d, stackpage=%08lx)\n",
-  current->comm, current->pid, (unsigned long)current);
+   pr_info("CPU #: %d\n"
+   "   PC: %08lxSR: %08lxSP: %08lx FPCSR: %08lx\n",
+   smp_processor_id(), regs->pc, regs->sr, regs->sp,
+   regs->fpcsr);
+   pr_info("GPR00: %08lx GPR01: %08lx GPR02: %08lx GPR03: %08lx\n",
+   0L, regs->gpr[1], regs->gpr[2], regs->gpr[3]);
+   pr_info("GPR04: %08lx GPR05: %08lx GPR06: %08lx GPR07: %08lx\n",
+   regs->gpr[4], regs->gpr[5], regs->gpr[6], regs->gpr[7]);
+   pr_info("GPR08: %08lx GPR09: %08lx GPR10: %08lx GPR11: %08lx\n",
+   regs->gpr[8], regs->gpr[9], regs->gpr[10], regs->gpr[11]);
+   pr_info("GPR12: %08lx GPR13: %08lx GPR14: %08lx GPR15: %08lx\n",
+   regs->gpr[12], regs->gpr[13], regs->gpr[14], regs->gpr[15]);
+   pr_info("GPR16: %08lx GPR17: %08lx GPR18: %08lx GPR19: %08lx\n",
+   regs->gpr[16], regs->gpr[17], regs->gpr[18], regs->gpr[19]);
+   pr_info("GPR20: %08lx GPR21: %08lx GPR22: %08lx GPR23: %08lx\n",
+   regs->gpr[20], regs->gpr[21], regs->gpr[22], regs->gpr[23]);
+   pr_info("GPR24: %08lx GPR25: %08lx GPR26: %08lx GPR27: %08lx\n",
+   regs->gpr[24], regs->gpr[25], regs->gpr[26], regs->gpr[27]);
+   pr_info("GPR28: %08lx GPR29: %08lx GPR30: %08lx GPR31: %08lx\n",
+   regs->gpr[28], regs->gpr[29], regs->gpr[30], regs->gpr[31]);
+   pr_info("  RES: %08lx oGPR11: %08lx\n",
+   regs->gpr[11], regs->orig_gpr11);
+
+   pr_info("Process %s (pid: %d, stackpage=%08lx)\n",
+   current->comm, current->pid, (unsigned long)current);
/*
 * When in-kernel, we 

[PATCH 0/5] OpenRISC FPU and Signal handling fixups

2024-04-11 Thread Stafford Horne
This series has some fixups found when I was doing a deep dive
documentation of the OpenRISC FPU support which was added in 2023.

  
http://stffrdhrn.github.io/hardware/embedded/openrisc/2023/08/24/or1k-fpu-linux-and-compilers.html

The FPU handling is inefficient and also does not provide the proper
state if signals are received while handling syscalls.

The series is small, so it should be easy to follow from the commit list,
but in summary it does:

 - Fix some issues with exception handling.
 - Add CONFIG_FPU to allow disabling FPU support.
 - Fix the FPU handling logic, moving the FPCSR state out of the kernel
   stack pt_regs and into the task_struct.

Stafford Horne (5):
  openrisc: traps: Convert printks to pr_ macros
  openrisc: traps: Remove calls to show_registers before die
  openrisc: traps: Don't send signals to kernel mode threads
  openrisc: Add FPU config
  openrisc: Move FPU state out of pt_regs

 arch/openrisc/Kconfig |   9 ++
 arch/openrisc/include/asm/fpu.h   |  22 
 arch/openrisc/include/asm/processor.h |   1 +
 arch/openrisc/include/asm/ptrace.h|   3 +-
 arch/openrisc/kernel/entry.S  |  15 +--
 arch/openrisc/kernel/process.c|   5 +
 arch/openrisc/kernel/ptrace.c |  18 ++--
 arch/openrisc/kernel/signal.c |  36 ++-
 arch/openrisc/kernel/traps.c  | 144 ++
 9 files changed, 160 insertions(+), 93 deletions(-)
 create mode 100644 arch/openrisc/include/asm/fpu.h

-- 
2.44.0




Re: [PATCH v3 5/7] KVM: x86: Participate in bitmap-based PTE aging

2024-04-11 Thread David Matlack
On Thu, Apr 11, 2024 at 10:28 AM David Matlack  wrote:
>
> On 2024-04-11 10:08 AM, David Matlack wrote:
> > On 2024-04-01 11:29 PM, James Houghton wrote:
> > > Only handle the TDP MMU case for now. In other cases, if a bitmap was
> > > not provided, fallback to the slowpath that takes mmu_lock, or, if a
> > > bitmap was provided, inform the caller that the bitmap is unreliable.
> > >
> > > Suggested-by: Yu Zhao 
> > > Signed-off-by: James Houghton 
> > > ---
> > >  arch/x86/include/asm/kvm_host.h | 14 ++
> > >  arch/x86/kvm/mmu/mmu.c  | 16 ++--
> > >  arch/x86/kvm/mmu/tdp_mmu.c  | 10 +-
> > >  3 files changed, 37 insertions(+), 3 deletions(-)
> > >
> > > diff --git a/arch/x86/include/asm/kvm_host.h 
> > > b/arch/x86/include/asm/kvm_host.h
> > > index 3b58e2306621..c30918d0887e 100644
> > > --- a/arch/x86/include/asm/kvm_host.h
> > > +++ b/arch/x86/include/asm/kvm_host.h
> > > @@ -2324,4 +2324,18 @@ int memslot_rmap_alloc(struct kvm_memory_slot 
> > > *slot, unsigned long npages);
> > >   */
> > >  #define KVM_EXIT_HYPERCALL_MBZ GENMASK_ULL(31, 1)
> > >
> > > +#define kvm_arch_prepare_bitmap_age kvm_arch_prepare_bitmap_age
> > > +static inline bool kvm_arch_prepare_bitmap_age(struct mmu_notifier *mn)
> > > +{
> > > +   /*
> > > +* Indicate that we support bitmap-based aging when using the TDP MMU
> > > +* and the accessed bit is available in the TDP page tables.
> > > +*
> > > +* We have no other preparatory work to do here, so we do not need to
> > > +* redefine kvm_arch_finish_bitmap_age().
> > > +*/
> > > +   return IS_ENABLED(CONFIG_X86_64) && tdp_mmu_enabled
> > > +&& shadow_accessed_mask;
> > > +}
> > > +
> > >  #endif /* _ASM_X86_KVM_HOST_H */
> > > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > > index 992e651540e8..fae1a75750bb 100644
> > > --- a/arch/x86/kvm/mmu/mmu.c
> > > +++ b/arch/x86/kvm/mmu/mmu.c
> > > @@ -1674,8 +1674,14 @@ bool kvm_age_gfn(struct kvm *kvm, struct 
> > > kvm_gfn_range *range)
> > >  {
> > > bool young = false;
> > >
> > > -   if (kvm_memslots_have_rmaps(kvm))
> > > +   if (kvm_memslots_have_rmaps(kvm)) {
> > > +   if (range->lockless) {
> > > +   kvm_age_set_unreliable(range);
> > > +   return false;
> > > +   }
> >
> > If a VM has TDP MMU enabled, supports A/D bits, and is using nested
> > virtualization, MGLRU will effectively be blind to all accesses made by
> > the VM.
> >
> > kvm_arch_prepare_bitmap_age() will return true indicating that the
> > bitmap is supported. But then kvm_age_gfn() and kvm_test_age_gfn() will
> > return false immediately and indicate the bitmap is unreliable because a
> > shadow root is allocated. The notifier will then return
> > MMU_NOTIFIER_YOUNG_BITMAP_UNRELIABLE.

Ah no, I'm wrong here. Setting args.unreliable causes the notifier to
return 0 instead of MMU_NOTIFIER_YOUNG_FAST.
MMU_NOTIFIER_YOUNG_BITMAP_UNRELIABLE is used for something else.

The control flow of all this and naming of functions and macros is
overall confusing. args.unreliable and
MMU_NOTIFIER_YOUNG_BITMAP_UNRELIABLE for one. Also I now realize
kvm_arch_prepare/finish_bitmap_age() are used even when the bitmap is
_not_ provided, so those names are also misleading.



Re: [PATCH v4 00/15] mm: jit/text allocator

2024-04-11 Thread Kent Overstreet
On Thu, Apr 11, 2024 at 07:00:36PM +0300, Mike Rapoport wrote:
> From: "Mike Rapoport (IBM)" 
> 
> Hi,
> 
> Since v3 I looked into making execmem more of a utility toolbox, as we
> discussed at LPC with Mark Rutland, but it was getting hairier than
> having a struct describing architecture constraints and a type identifying
> the consumer of execmem.
> 
> And I do think that having the description of architecture constraints for
> allocations of executable memory in a single place is better than having it
> spread all over the place.
> 
> The patches available via git:
> https://git.kernel.org/pub/scm/linux/kernel/git/rppt/linux.git/log/?h=execmem/v4
> 
> v4 changes:
> * rebase on v6.9-rc2
> * rename execmem_params to execmem_info and execmem_arch_params() to
>   execmem_arch_setup()
> * use single execmem_alloc() API instead of execmem_{text,data}_alloc() (Song)
> * avoid extra copy of execmem parameters (Rick)
> * run execmem_init() as core_initcall() except for the architectures that
>   may allocate text really early (currently only x86) (Will)
> * add acks for some of arm64 and riscv changes, thanks Will and Alexandre
> * new commits:
>   - drop call to kasan_alloc_module_shadow() on arm64 because it's not
> needed anymore
>   - rename MODULE_START to MODULES_VADDR on MIPS
>   - use CONFIG_EXECMEM instead of CONFIG_MODULES on powerpc as per Christophe:
> 
> https://lore.kernel.org/all/79062fa3-3402-47b3-8920-9231ad05e...@csgroup.eu/
> 
> v3: https://lore.kernel.org/all/20230918072955.2507221-1-r...@kernel.org
> * add type parameter to execmem allocation APIs
> * remove BPF dependency on modules
> 
> v2: https://lore.kernel.org/all/20230616085038.4121892-1-r...@kernel.org
> * Separate "module" and "others" allocations with execmem_text_alloc()
> and jit_text_alloc()
> * Drop ROX entailment on x86
> * Add ack for nios2 changes, thanks Dinh Nguyen
> 
> v1: https://lore.kernel.org/all/20230601101257.530867-1-r...@kernel.org
> 
> = Cover letter from v1 (slightly updated) =
> 
> module_alloc() is used everywhere as a means to allocate memory for code.
> 
> Besides being semantically wrong, this unnecessarily ties all subsystems
> that need to allocate code, such as ftrace, kprobes and BPF, to modules and
> puts the burden of code allocation on the modules code.
> 
> Several architectures override module_alloc() because of various
> constraints where the executable memory can be located and this causes
> additional obstacles for improvements of code allocation.
> 
> A centralized infrastructure for code allocation allows allocations of
> executable memory as ROX, and future optimizations such as caching large
> pages for better iTLB performance and providing sub-page allocations for
> users that only need small jit code snippets.
> 
> Rick Edgecombe proposed perm_alloc extension to vmalloc [1] and Song Liu
> proposed execmem_alloc [2], but both these approaches were targeting BPF
> allocations and lacked the ground work to abstract executable allocations
> and split them from the modules core.
> 
> Thomas Gleixner suggested to express module allocation restrictions and
> requirements as struct mod_alloc_type_params [3] that would define ranges,
> protections and other parameters for different types of allocations used by
> modules and following that suggestion Song separated allocations of
> different types in modules (commit ac3b43283923 ("module: replace
> module_layout with module_memory")) and posted "Type aware module
> allocator" set [4].
> 
> I liked the idea of parametrising code allocation requirements as a
> structure, but I believe the original proposal and Song's module allocator
> was too module centric, so I came up with these patches.
> 
> This set splits code allocation from modules by introducing execmem_alloc()
> and execmem_free() APIs, replaces call sites of module_alloc() and
> module_memfree() with the new APIs and implements core text and related
> allocations in a central place.
> 
> Instead of architecture specific overrides for module_alloc(), the
> architectures that require non-default behaviour for text allocation must
> fill execmem_info structure and implement execmem_arch_setup() that returns
> a pointer to that structure. If an architecture does not implement
> execmem_arch_setup(), the defaults compatible with the current
> modules::module_alloc() are used.
> 
> Since architectures define different restrictions on placement,
> permissions, alignment and other parameters for memory that can be used by
> different subsystems that allocate executable memory, execmem APIs
> take a type argument, that will be used to identify the calling subsystem
> and to allow architectures to define parameters for ranges suitable for that
> subsystem.
> 
> The new infrastructure allows decoupling of BPF, kprobes and ftrace from
> modules, and most importantly it paves the way for ROX allocations for
> executable memory.

It looks like you're just doing API cleanup first, then 
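
For reference, a minimal sketch of the arch hook contract described in
the cover letter, using the names visible in this series (execmem_info,
execmem_arch_setup, EXECMEM_DEFAULT); the pgprot and alignment values
below are assumptions for a hypothetical architecture:

	static struct execmem_info execmem_info __ro_after_init;

	struct execmem_info __init *execmem_arch_setup(void)
	{
		struct execmem_range *r = &execmem_info.ranges[EXECMEM_DEFAULT];

		r->start     = MODULES_VADDR;
		r->end       = MODULES_END;
		r->pgprot    = PAGE_KERNEL_EXEC;	/* assumed */
		r->alignment = 1;			/* assumed */

		return &execmem_info;
	}

An architecture that does not implement this hook simply gets the
module_alloc()-compatible defaults mentioned above.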

Re: [PATCH v3 5/7] KVM: x86: Participate in bitmap-based PTE aging

2024-04-11 Thread David Matlack
On 2024-04-11 10:08 AM, David Matlack wrote:
> On 2024-04-01 11:29 PM, James Houghton wrote:
> > Only handle the TDP MMU case for now. In other cases, if a bitmap was
> > not provided, fallback to the slowpath that takes mmu_lock, or, if a
> > bitmap was provided, inform the caller that the bitmap is unreliable.
> > 
> > Suggested-by: Yu Zhao 
> > Signed-off-by: James Houghton 
> > ---
> >  arch/x86/include/asm/kvm_host.h | 14 ++
> >  arch/x86/kvm/mmu/mmu.c  | 16 ++--
> >  arch/x86/kvm/mmu/tdp_mmu.c  | 10 +-
> >  3 files changed, 37 insertions(+), 3 deletions(-)
> > 
> > diff --git a/arch/x86/include/asm/kvm_host.h 
> > b/arch/x86/include/asm/kvm_host.h
> > index 3b58e2306621..c30918d0887e 100644
> > --- a/arch/x86/include/asm/kvm_host.h
> > +++ b/arch/x86/include/asm/kvm_host.h
> > @@ -2324,4 +2324,18 @@ int memslot_rmap_alloc(struct kvm_memory_slot *slot, 
> > unsigned long npages);
> >   */
> >  #define KVM_EXIT_HYPERCALL_MBZ GENMASK_ULL(31, 1)
> >  
> > +#define kvm_arch_prepare_bitmap_age kvm_arch_prepare_bitmap_age
> > +static inline bool kvm_arch_prepare_bitmap_age(struct mmu_notifier *mn)
> > +{
> > +   /*
> > +* Indicate that we support bitmap-based aging when using the TDP MMU
> > +* and the accessed bit is available in the TDP page tables.
> > +*
> > +* We have no other preparatory work to do here, so we do not need to
> > +* redefine kvm_arch_finish_bitmap_age().
> > +*/
> > +   return IS_ENABLED(CONFIG_X86_64) && tdp_mmu_enabled
> > +&& shadow_accessed_mask;
> > +}
> > +
> >  #endif /* _ASM_X86_KVM_HOST_H */
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index 992e651540e8..fae1a75750bb 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -1674,8 +1674,14 @@ bool kvm_age_gfn(struct kvm *kvm, struct 
> > kvm_gfn_range *range)
> >  {
> > bool young = false;
> >  
> > -   if (kvm_memslots_have_rmaps(kvm))
> > +   if (kvm_memslots_have_rmaps(kvm)) {
> > +   if (range->lockless) {
> > +   kvm_age_set_unreliable(range);
> > +   return false;
> > +   }
> 
> If a VM has TDP MMU enabled, supports A/D bits, and is using nested
> virtualization, MGLRU will effectively be blind to all accesses made by
> the VM.
> 
> kvm_arch_prepare_bitmap_age() will return true indicating that the
> bitmap is supported. But then kvm_age_gfn() and kvm_test_age_gfn() will
> return false immediately and indicate the bitmap is unreliable because a
> shadow root is allocated. The notifier will then return
> MMU_NOTIFIER_YOUNG_BITMAP_UNRELIABLE.
> 
> Looking at the callers, MMU_NOTIFIER_YOUNG_BITMAP_UNRELIABLE is never
> consumed or used. So I think MGLRU will assume all memory is
> unaccessed?
> 
> One way to improve the situation would be to re-order the TDP MMU
> function first and return young instead of false, so that way MGLRU at
> least has visibility into accesses made by L1 (and L2 if EPT is disabled
> in L2). But that still means MGLRU is blind to accesses made by L2.
> 
> What about grabbing the mmu_lock if there's a shadow root allocated and
> get rid of MMU_NOTIFIER_YOUNG_BITMAP_UNRELIABLE altogether?
> 
>   if (kvm_memslots_have_rmaps(kvm)) {
>   write_lock(&kvm->mmu_lock);
>   young |= kvm_handle_gfn_range(kvm, range, kvm_age_rmap);
>   write_unlock(&kvm->mmu_lock);
>   }
> 
> The TDP MMU walk would still be lockless. KVM only has to take the
> mmu_lock to collect accesses made by L2.
> 
> kvm_age_rmap() and kvm_test_age_rmap() will need to become bitmap-aware
> as well, but that seems relatively simple with the helper functions.

Wait, even simpler, just check kvm_memslots_have_rmaps() in
kvm_arch_prepare_bitmap_age() and skip the shadow MMU when processing a
bitmap request.

i.e.

static inline bool kvm_arch_prepare_bitmap_age(struct kvm *kvm, struct 
mmu_notifier *mn)
{
/*
 * Indicate that we support bitmap-based aging when using the TDP MMU
 * and the accessed bit is available in the TDP page tables.
 *
 * We have no other preparatory work to do here, so we do not need to
 * redefine kvm_arch_finish_bitmap_age().
 */
return IS_ENABLED(CONFIG_X86_64)
&& tdp_mmu_enabled
&& shadow_accessed_mask
&& !kvm_memslots_have_rmaps(kvm);
}

bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
{
bool young = false;

if (!range->arg.metadata->bitmap && kvm_memslots_have_rmaps(kvm))
young = kvm_handle_gfn_range(kvm, range, kvm_age_rmap);

if (tdp_mmu_enabled)
young |= kvm_tdp_mmu_age_gfn_range(kvm, range);

return young;
}

bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
{
bool young = false;

if (!range->arg.metadata->bitmap && kvm_memslots_have_rmaps(kvm))

Re: [PATCH v3 5/7] KVM: x86: Participate in bitmap-based PTE aging

2024-04-11 Thread David Matlack
On 2024-04-01 11:29 PM, James Houghton wrote:
> Only handle the TDP MMU case for now. In other cases, if a bitmap was
> not provided, fallback to the slowpath that takes mmu_lock, or, if a
> bitmap was provided, inform the caller that the bitmap is unreliable.
> 
> Suggested-by: Yu Zhao 
> Signed-off-by: James Houghton 
> ---
>  arch/x86/include/asm/kvm_host.h | 14 ++
>  arch/x86/kvm/mmu/mmu.c  | 16 ++--
>  arch/x86/kvm/mmu/tdp_mmu.c  | 10 +-
>  3 files changed, 37 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 3b58e2306621..c30918d0887e 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -2324,4 +2324,18 @@ int memslot_rmap_alloc(struct kvm_memory_slot *slot, 
> unsigned long npages);
>   */
>  #define KVM_EXIT_HYPERCALL_MBZ   GENMASK_ULL(31, 1)
>  
> +#define kvm_arch_prepare_bitmap_age kvm_arch_prepare_bitmap_age
> +static inline bool kvm_arch_prepare_bitmap_age(struct mmu_notifier *mn)
> +{
> + /*
> +  * Indicate that we support bitmap-based aging when using the TDP MMU
> +  * and the accessed bit is available in the TDP page tables.
> +  *
> +  * We have no other preparatory work to do here, so we do not need to
> +  * redefine kvm_arch_finish_bitmap_age().
> +  */
> + return IS_ENABLED(CONFIG_X86_64) && tdp_mmu_enabled
> +  && shadow_accessed_mask;
> +}
> +
>  #endif /* _ASM_X86_KVM_HOST_H */
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 992e651540e8..fae1a75750bb 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -1674,8 +1674,14 @@ bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range 
> *range)
>  {
>   bool young = false;
>  
> - if (kvm_memslots_have_rmaps(kvm))
> + if (kvm_memslots_have_rmaps(kvm)) {
> + if (range->lockless) {
> + kvm_age_set_unreliable(range);
> + return false;
> + }

If a VM has TDP MMU enabled, supports A/D bits, and is using nested
virtualization, MGLRU will effectively be blind to all accesses made by
the VM.

kvm_arch_prepare_bitmap_age() will return true indicating that the
bitmap is supported. But then kvm_age_gfn() and kvm_test_age_gfn() will
return false immediately and indicate the bitmap is unreliable because a
shadow root is allocated. The notifier will then return
MMU_NOTIFIER_YOUNG_BITMAP_UNRELIABLE.

Looking at the callers, MMU_NOTIFIER_YOUNG_BITMAP_UNRELIABLE is never
consumed or used. So I think MGLRU will assume all memory is
unaccessed?

One way to improve the situation would be to re-order the TDP MMU
function first and return young instead of false, so that way MGLRU at
least has visibility into accesses made by L1 (and L2 if EPT is disabled
in L2). But that still means MGLRU is blind to accesses made by L2.

What about grabbing the mmu_lock if there's a shadow root allocated and
get rid of MMU_NOTIFIER_YOUNG_BITMAP_UNRELIABLE altogether?

if (kvm_memslots_have_rmaps(kvm)) {
	write_lock(&kvm->mmu_lock);
young |= kvm_handle_gfn_range(kvm, range, kvm_age_rmap);
	write_unlock(&kvm->mmu_lock);
}

The TDP MMU walk would still be lockless. KVM only has to take the
mmu_lock to collect accesses made by L2.

kvm_age_rmap() and kvm_test_age_rmap() will need to become bitmap-aware
as well, but that seems relatively simple with the helper functions.



[PATCH v2 0/2] OpenRISC module fixups

2024-04-11 Thread Stafford Horne
This series implements some missing OpenRISC relocations to allow
module loading to work.  These relocations were enough for the several
modules I tested with.

In the series we:
  1. Update the relocations to add all of the relocation types currently
 defined in gnu binutils.
  2. Implement two of the missing relocations needed for module loading.

Since v1:
 - Added patch suggested by Geert to rename all relocation types to the
   R_OR1K_* form.

Stafford Horne (2):
  openrisc: Define openrisc relocation types
  openrisc: Add support for more module relocations

 arch/openrisc/include/uapi/asm/elf.h | 75 
 arch/openrisc/kernel/module.c| 18 +--
 2 files changed, 80 insertions(+), 13 deletions(-)

-- 
2.44.0




Re: [PATCH 0/4] KVM, mm: remove the .change_pte() MMU notifier and set_pte_at_notify()

2024-04-11 Thread Paolo Bonzini
On Wed, Apr 10, 2024 at 11:30 PM Andrew Morton
 wrote:
> On Fri,  5 Apr 2024 07:58:11 -0400 Paolo Bonzini  wrote:
> > Please review!  Also feel free to take the KVM patches through the mm
> > tree, as I don't expect any conflicts.
>
> It's mainly a KVM thing and the MM changes are small and simple.
> I'd say that the KVM tree would be a better home?

Sure! I'll queue them on my side then.

Paolo




Re: [PATCH 1/4] KVM: delete .change_pte MMU notifier callback

2024-04-11 Thread Paolo Bonzini
On Mon, Apr 8, 2024 at 3:56 PM Peter Xu  wrote:
> Paolo,
>
> I may miss a bunch of details here (as I still remember some change_pte
> patches previously on the list..), however not sure whether we considered
> enable it?  Asked because I remember Andrea used to have a custom tree
> maintaining that part:
>
> https://github.com/aagit/aa/commit/c761078df7a77d13ddfaeebe56a0f4bc128b1968

The patch enables it only for KSM, so it would still require a bunch
of cleanups, for example I also would still use set_pte_at() in all
the places that are not KSM. This would at least fix the issue with
the poor documentation of where to use set_pte_at_notify() vs
set_pte_at().

With regard to the implementation, I like the idea of disabling the
invalidation on the MMU notifier side, but I would rather have
MMU_NOTIFIER_CHANGE_PTE as a separate field in the range instead of
overloading the event field.

> Maybe it can't be enabled for some reason that I overlooked in the current
> tree, or we just decided to not to?

I have just learnt about the patch, nobody had ever mentioned it even
though it's almost 2 years old... It's a lot of code though and no one
has ever reported an issue for over 10 years, so I think it's easiest
to just rip the code out.

Paolo

> Thanks,
>
> --
> Peter Xu
>




[PATCH v2 2/2] openrisc: Add support for more module relocations

2024-04-11 Thread Stafford Horne
When testing modules in OpenRISC I found R_OR1K_AHI16 (signed adjusted
high 16-bit) and R_OR1K_SLO16 (split low 16-bit) relocations are used in
modules but not implemented yet.

This patch implements the relocations. I have tested with a few modules.
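
For illustration, the arithmetic works out like this (a standalone
sketch, not from the patch; the example value is arbitrary):

	#include <stdio.h>
	#include <stdint.h>

	int main(void)
	{
		uint32_t value = 0x12348abc;	/* example relocation target */

		/* R_OR1K_AHI16: pre-add 0x8000 so that a later sign-extended
		 * low half reconstructs the full 32-bit value. */
		uint16_t hi = (value + 0x8000) >> 16;		/* 0x1235 */
		int16_t  lo = (int16_t)(value & 0xffff);	/* sign-extends */
		uint32_t check = ((uint32_t)hi << 16) + (int32_t)lo;
		printf("hi=0x%04x lo=%d check=0x%08x\n", hi, lo, (unsigned)check);

		/* R_OR1K_SLO16: a 16-bit immediate is split so that bits 15:11
		 * land in instruction bits 25:21 and bits 10:0 stay in place. */
		uint32_t imm = value & 0xffff;
		uint32_t split = ((imm & 0xf800) << 10) | (imm & 0x7ff);
		printf("split field = 0x%08x\n", (unsigned)split);
		return 0;
	}

Running it prints check=0x12348abc, confirming that the AHI16/LO16 pair
round-trips the original value.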

Signed-off-by: Stafford Horne 
---
 arch/openrisc/kernel/module.c | 10 ++
 1 file changed, 10 insertions(+)

diff --git a/arch/openrisc/kernel/module.c b/arch/openrisc/kernel/module.c
index 292f0afe27b9..c9ff4c4a0b29 100644
--- a/arch/openrisc/kernel/module.c
+++ b/arch/openrisc/kernel/module.c
@@ -55,6 +55,16 @@ int apply_relocate_add(Elf32_Shdr *sechdrs,
		value |= *location & 0xfc000000;
*location = value;
break;
+   case R_OR1K_AHI16:
+   /* Adjust the operand to match with a signed LO16.  */
+   value += 0x8000;
+   *((uint16_t *)location + 1) = value >> 16;
+   break;
+   case R_OR1K_SLO16:
+   /* Split value lower 16-bits.  */
+   value = ((value & 0xf800) << 10) | (value & 0x7ff);
+   *location = (*location & ~0x3e007ff) | value;
+   break;
default:
pr_err("module %s: Unknown relocation: %u\n",
   me->name, ELF32_R_TYPE(rel[i].r_info));
-- 
2.44.0




[PATCH v2 1/2] openrisc: Define openrisc relocation types

2024-04-11 Thread Stafford Horne
This defines the OpenRISC relocation types using the current
R_OR1K_* naming conventions.

The old R_OR32_* definitions are left for backwards compatibility.
Note, the R_OR32_VTENTRY and R_OR32_VTINHERIT macros were defined with
the wrong values; they have always been 7 and 8 respectively, not 8 and 7.
They are not used for module loading and I have updated them to use the
correct values.

Signed-off-by: Stafford Horne 
---
 arch/openrisc/include/uapi/asm/elf.h | 75 
 arch/openrisc/kernel/module.c|  8 +--
 2 files changed, 70 insertions(+), 13 deletions(-)

diff --git a/arch/openrisc/include/uapi/asm/elf.h 
b/arch/openrisc/include/uapi/asm/elf.h
index 6868f81c281e..441e343f8268 100644
--- a/arch/openrisc/include/uapi/asm/elf.h
+++ b/arch/openrisc/include/uapi/asm/elf.h
@@ -34,15 +34,72 @@
 #include 
 
 /* The OR1K relocation types... not all relevant for module loader */
-#define R_OR32_NONE      0
-#define R_OR32_32        1
-#define R_OR32_16        2
-#define R_OR32_8         3
-#define R_OR32_CONST     4
-#define R_OR32_CONSTH    5
-#define R_OR32_JUMPTARG  6
-#define R_OR32_VTINHERIT 7
-#define R_OR32_VTENTRY   8
+#define R_OR1K_NONE             0
+#define R_OR1K_32               1
+#define R_OR1K_16               2
+#define R_OR1K_8                3
+#define R_OR1K_LO_16_IN_INSN    4
+#define R_OR1K_HI_16_IN_INSN    5
+#define R_OR1K_INSN_REL_26      6
+#define R_OR1K_GNU_VTENTRY      7
+#define R_OR1K_GNU_VTINHERIT    8
+#define R_OR1K_32_PCREL         9
+#define R_OR1K_16_PCREL         10
+#define R_OR1K_8_PCREL          11
+#define R_OR1K_GOTPC_HI16       12
+#define R_OR1K_GOTPC_LO16       13
+#define R_OR1K_GOT16            14
+#define R_OR1K_PLT26            15
+#define R_OR1K_GOTOFF_HI16      16
+#define R_OR1K_GOTOFF_LO16      17
+#define R_OR1K_COPY             18
+#define R_OR1K_GLOB_DAT         19
+#define R_OR1K_JMP_SLOT         20
+#define R_OR1K_RELATIVE         21
+#define R_OR1K_TLS_GD_HI16      22
+#define R_OR1K_TLS_GD_LO16      23
+#define R_OR1K_TLS_LDM_HI16     24
+#define R_OR1K_TLS_LDM_LO16     25
+#define R_OR1K_TLS_LDO_HI16     26
+#define R_OR1K_TLS_LDO_LO16     27
+#define R_OR1K_TLS_IE_HI16      28
+#define R_OR1K_TLS_IE_LO16      29
+#define R_OR1K_TLS_LE_HI16      30
+#define R_OR1K_TLS_LE_LO16      31
+#define R_OR1K_TLS_TPOFF        32
+#define R_OR1K_TLS_DTPOFF       33
+#define R_OR1K_TLS_DTPMOD       34
+#define R_OR1K_AHI16            35
+#define R_OR1K_GOTOFF_AHI16     36
+#define R_OR1K_TLS_IE_AHI16     37
+#define R_OR1K_TLS_LE_AHI16     38
+#define R_OR1K_SLO16            39
+#define R_OR1K_GOTOFF_SLO16     40
+#define R_OR1K_TLS_LE_SLO16     41
+#define R_OR1K_PCREL_PG21       42
+#define R_OR1K_GOT_PG21         43
+#define R_OR1K_TLS_GD_PG21      44
+#define R_OR1K_TLS_LDM_PG21     45
+#define R_OR1K_TLS_IE_PG21      46
+#define R_OR1K_LO13             47
+#define R_OR1K_GOT_LO13         48
+#define R_OR1K_TLS_GD_LO13      49
+#define R_OR1K_TLS_LDM_LO13     50
+#define R_OR1K_TLS_IE_LO13      51
+#define R_OR1K_SLO13            52
+#define R_OR1K_PLTA26           53
+#define R_OR1K_GOT_AHI16        54
+
+/* Old relocation names */
+#define R_OR32_NONE      R_OR1K_NONE
+#define R_OR32_32        R_OR1K_32
+#define R_OR32_16        R_OR1K_16
+#define R_OR32_8         R_OR1K_8
+#define R_OR32_CONST     R_OR1K_LO_16_IN_INSN
+#define R_OR32_CONSTH    R_OR1K_HI_16_IN_INSN
+#define R_OR32_JUMPTARG  R_OR1K_INSN_REL_26
+#define R_OR32_VTENTRY   R_OR1K_GNU_VTENTRY
+#define R_OR32_VTINHERIT R_OR1K_GNU_VTINHERIT
 
 typedef unsigned long elf_greg_t;
 
diff --git a/arch/openrisc/kernel/module.c b/arch/openrisc/kernel/module.c
index 532013f523ac..292f0afe27b9 100644
--- a/arch/openrisc/kernel/module.c
+++ b/arch/openrisc/kernel/module.c
@@ -39,16 +39,16 @@ int apply_relocate_add(Elf32_Shdr *sechdrs,
value = sym->st_value + rel[i].r_addend;
 
switch (ELF32_R_TYPE(rel[i].r_info)) {
-   case R_OR32_32:
+   case R_OR1K_32:
*location = value;
break;
-   case R_OR32_CONST:
+   case R_OR1K_LO_16_IN_INSN:
*((uint16_t *)location + 1) = value;
break;
-   case R_OR32_CONSTH:
+   case R_OR1K_HI_16_IN_INSN:
*((uint16_t *)location + 1) = value >> 16;
break;
-   case R_OR32_JUMPTARG:
+   case R_OR1K_INSN_REL_26:
value -= (uint32_t)location;
value >>= 2;
		value &= 0x03ffffff;
-- 
2.44.0




Re: [PATCH 2/5] mfd: add driver for Marvell 88PM886 PMIC

2024-04-11 Thread Lee Jones
On Thu, 11 Apr 2024, Karel Balej wrote:

> Lee Jones, 2024-04-11T12:37:26+01:00:
> [...]
> > > diff --git a/drivers/mfd/88pm886.c b/drivers/mfd/88pm886.c
> > > new file mode 100644
> > > index ..e06d418a5da9
> > > --- /dev/null
> > > +++ b/drivers/mfd/88pm886.c
> > > @@ -0,0 +1,157 @@
> > > +// SPDX-License-Identifier: GPL-2.0-only
> > > +#include 
> > > +#include 
> > > +#include 
> > > +#include 
> > > +#include 
> > > +#include 
> > > +#include 
> > > +
> > > +#include 
> > > +
> > > +#define PM886_REG_INT_STATUS1   0x05
> > > +
> > > +#define PM886_REG_INT_ENA_1  0x0a
> > > +#define PM886_INT_ENA1_ONKEY BIT(0)
> > > +
> > > +#define PM886_IRQ_ONKEY  0
> > > +
> > > +#define PM886_REGMAP_CONF_MAX_REG   0xef
> >
> > Why have you split the defines up between here and the header?
> 
> I tried to keep defines tied to the code which uses them and only put
> defines needed in multiple places in the header. With the exception of
> closely related things, such as register bits which I am keeping
> together with the respective register definitions for clarity. Does that
> not make sense?

It makes sense and it's a nice thought, but I think it's nicer to keep
them all together, rather than have to worry about which ones are and
which ones are not used here or there.  Also, there will be holes in the
definitions, etc.

> > Please place them all in the header.
> 
> Would you then also have me move all the definitions from the regulators
> driver there?

I think it would be nice to have them all nice and contiguous.

So, yes.

> [...]
> 
> > > + err = devm_mfd_add_devices(dev, 0, pm886_devs, ARRAY_SIZE(pm886_devs),
> >
> > Why 0?
> 
> PLATFORM_DEVID_AUTO then? Or will PLATFORM_DEVID_NONE suffice since the
> cells all have different names now (it would probably cause problems
> though if the driver was used multiple times for some reason, wouldn't
> it?)?

You tell me.  Please try and understand the code you author. :)

-- 
Lee Jones [李琼斯]



Re: [PATCH v14 4/4] remoteproc: zynqmp: parse TCM from device tree

2024-04-11 Thread Mathieu Poirier
On Wed, Apr 10, 2024 at 05:36:30PM -0500, Tanmay Shah wrote:
> 
> 
> On 4/10/24 11:59 AM, Mathieu Poirier wrote:
> > On Mon, Apr 08, 2024 at 01:53:14PM -0700, Tanmay Shah wrote:
> >> ZynqMP TCM information was fixed in driver. Now ZynqMP TCM information
> >> is available in device-tree. Parse TCM information in driver
> >> as per new bindings.
> >> 
> >> Signed-off-by: Tanmay Shah 
> >> ---
> >> 
> >> Changes in v14:
> >>   - Add Versal platform support
> >>   - Add Versal-NET platform support
> >>   - Maintain backward compatibility for ZynqMP platform and use hardcode
> >> TCM addresses
> >>   - Configure TCM based on xlnx,tcm-mode property for Versal
> >>   - Avoid TCM configuration if that property isn't available in DT 
> >> 
> >>  drivers/remoteproc/xlnx_r5_remoteproc.c | 173 ++--
> >>  1 file changed, 132 insertions(+), 41 deletions(-)
> >> 
> >> diff --git a/drivers/remoteproc/xlnx_r5_remoteproc.c 
> >> b/drivers/remoteproc/xlnx_r5_remoteproc.c
> >> index 0f942440b4e2..504492f930ac 100644
> >> --- a/drivers/remoteproc/xlnx_r5_remoteproc.c
> >> +++ b/drivers/remoteproc/xlnx_r5_remoteproc.c
> >> @@ -74,8 +74,8 @@ struct mbox_info {
> >>  };
> >>  
> >>  /*
> >> - * Hardcoded TCM bank values. This will be removed once TCM bindings are
> >> - * accepted for system-dt specifications and upstreamed in linux kernel
> >> + * Hardcoded TCM bank values. This will stay in driver to maintain 
> >> backward
> >> + * compatibility with device-tree that does not have TCM information.
> >>   */
> >>  static const struct mem_bank_data zynqmp_tcm_banks_split[] = {
> >>{0xffe0UL, 0x0, 0x1UL, PD_R5_0_ATCM, "atcm0"}, /* TCM 64KB each 
> >> */
> >> @@ -300,36 +300,6 @@ static void zynqmp_r5_rproc_kick(struct rproc *rproc, 
> >> int vqid)
> >>dev_warn(dev, "failed to send message\n");
> >>  }
> >>  
> >> -/*
> >> - * zynqmp_r5_set_mode()
> >> - *
> >> - * set RPU cluster and TCM operation mode
> >> - *
> >> - * @r5_core: pointer to zynqmp_r5_core type object
> >> - * @fw_reg_val: value expected by firmware to configure RPU cluster mode
> >> - * @tcm_mode: value expected by fw to configure TCM mode (lockstep or 
> >> split)
> >> - *
> >> - * Return: 0 for success and < 0 for failure
> >> - */
> >> -static int zynqmp_r5_set_mode(struct zynqmp_r5_core *r5_core,
> >> -enum rpu_oper_mode fw_reg_val,
> >> -enum rpu_tcm_comb tcm_mode)
> >> -{
> >> -  int ret;
> >> -
> >> -  ret = zynqmp_pm_set_rpu_mode(r5_core->pm_domain_id, fw_reg_val);
> >> -  if (ret < 0) {
> >> -  dev_err(r5_core->dev, "failed to set RPU mode\n");
> >> -  return ret;
> >> -  }
> >> -
> >> -  ret = zynqmp_pm_set_tcm_config(r5_core->pm_domain_id, tcm_mode);
> >> -  if (ret < 0)
> >> -  dev_err(r5_core->dev, "failed to configure TCM\n");
> >> -
> >> -  return ret;
> >> -}
> >> -
> >>  /*
> >>   * zynqmp_r5_rproc_start()
> >>   * @rproc: single R5 core's corresponding rproc instance
> >> @@ -761,6 +731,103 @@ static struct zynqmp_r5_core 
> >> *zynqmp_r5_add_rproc_core(struct device *cdev)
> >>return ERR_PTR(ret);
> >>  }
> >>  
> >> +static int zynqmp_r5_get_tcm_node_from_dt(struct zynqmp_r5_cluster 
> >> *cluster)
> >> +{
> >> +  int i, j, tcm_bank_count, ret, tcm_pd_idx, pd_count;
> >> +  struct of_phandle_args out_args;
> >> +  struct zynqmp_r5_core *r5_core;
> >> +  struct platform_device *cpdev;
> >> +  struct mem_bank_data *tcm;
> >> +  struct device_node *np;
> >> +  struct resource *res;
> >> +  u64 abs_addr, size;
> >> +  struct device *dev;
> >> +
> >> +  for (i = 0; i < cluster->core_count; i++) {
> >> +  r5_core = cluster->r5_cores[i];
> >> +  dev = r5_core->dev;
> >> +  np = r5_core->np;
> >> +
> >> +  pd_count = of_count_phandle_with_args(np, "power-domains",
> >> +"#power-domain-cells");
> >> +
> >> +  if (pd_count <= 0) {
> >> +  dev_err(dev, "invalid power-domains property, %d\n", 
> >> pd_count);
> >> +  return -EINVAL;
> >> +  }
> >> +
> >> +  /* First entry in power-domains list is for r5 core, rest for 
> >> TCM. */
> >> +  tcm_bank_count = pd_count - 1;
> >> +
> >> +  if (tcm_bank_count <= 0) {
> >> +  dev_err(dev, "invalid TCM count %d\n", tcm_bank_count);
> >> +  return -EINVAL;
> >> +  }
> >> +
> >> +  r5_core->tcm_banks = devm_kcalloc(dev, tcm_bank_count,
> >> +sizeof(struct mem_bank_data 
> >> *),
> >> +GFP_KERNEL);
> >> +  if (!r5_core->tcm_banks)
> >> +  return -ENOMEM;
> >> +
> >> +  r5_core->tcm_bank_count = tcm_bank_count;
> >> +  for (j = 0, tcm_pd_idx = 1; j < tcm_bank_count; j++, 
> >> tcm_pd_idx++) {
> >> +  tcm = devm_kzalloc(dev, sizeof(struct mem_bank_data),
> 

[RFC PATCH 7/7] x86/module: enable ROX caches for module text

2024-04-11 Thread Mike Rapoport
From: "Mike Rapoport (IBM)" 

Enable execmem's cache of PMD_SIZE'ed pages mapped as ROX for module
text allocations.

Signed-off-by: Mike Rapoport (IBM) 
---
 arch/x86/mm/init.c | 29 +
 1 file changed, 25 insertions(+), 4 deletions(-)

diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
index 8e8cd0de3af6..049a8b4c64e2 100644
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -1102,9 +1102,23 @@ unsigned long arch_max_swapfile_size(void)
 #endif
 
 #ifdef CONFIG_EXECMEM
+static void execmem_invalidate(void *ptr, size_t size, bool writeable)
+{
+   /* fill memory with INT3 instructions */
+   if (writeable)
+   memset(ptr, 0xcc, size);
+   else
+   text_poke_set(ptr, 0xcc, size);
+}
+
 static struct execmem_info execmem_info __ro_after_init = {
+   .invalidate = execmem_invalidate,
.ranges = {
-   [EXECMEM_DEFAULT] = {
+   [EXECMEM_MODULE_TEXT] = {
+   .flags = EXECMEM_KASAN_SHADOW | EXECMEM_ROX_CACHE,
+   .alignment = MODULE_ALIGN,
+   },
+   [EXECMEM_KPROBES...EXECMEM_MODULE_DATA] = {
.flags = EXECMEM_KASAN_SHADOW,
.alignment = MODULE_ALIGN,
},
@@ -1119,9 +1133,16 @@ struct execmem_info __init *execmem_arch_setup(void)
offset = get_random_u32_inclusive(1, 1024) * PAGE_SIZE;
 
start = MODULES_VADDR + offset;
-   execmem_info.ranges[EXECMEM_DEFAULT].start = start;
-   execmem_info.ranges[EXECMEM_DEFAULT].end = MODULES_END;
-   execmem_info.ranges[EXECMEM_DEFAULT].pgprot = PAGE_KERNEL;
+
+   for (int i = EXECMEM_MODULE_TEXT; i < EXECMEM_TYPE_MAX; i++) {
+   struct execmem_range *r = &execmem_info.ranges[i];
+
+   r->start = start;
+   r->end = MODULES_END;
+   r->pgprot = PAGE_KERNEL;
+   }
+
+   execmem_info.ranges[EXECMEM_MODULE_TEXT].pgprot = PAGE_KERNEL_ROX;
 
	return &execmem_info;
 }
-- 
2.43.0




[RFC PATCH 6/7] execmem: add support for cache of large ROX pages

2024-04-11 Thread Mike Rapoport
From: "Mike Rapoport (IBM)" 

Using large pages to map text areas reduces iTLB pressure and improves
performance.

Extend execmem_alloc() with an ability to use PMD_SIZE'ed pages with ROX
permissions as a cache for smaller allocations.

To populate the cache, a writable large page is allocated from vmalloc with
VM_ALLOW_HUGE_VMAP, filled with invalid instructions and then remapped as
ROX.

Portions of that large page are handed out to execmem_alloc() callers
without any changes to the permissions.

When the memory is freed with execmem_free() it is invalidated again so
that it won't contain stale instructions.

The cache is enabled when an architecture sets the EXECMEM_ROX_CACHE flag
in the definition of an execmem_range.
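
To make the population step concrete, here is a minimal sketch of how a
PMD-sized page could be added to the cache. execmem_invalidate() is the
helper from this patch; add_to_free_areas() is a hypothetical stand-in for
the maple-tree bookkeeping done under execmem_cache.mutex:

	static int execmem_cache_populate(struct execmem_range *range)
	{
		void *p;

		/* writable huge page from vmalloc */
		p = __vmalloc_node_range(PMD_SIZE, PMD_SIZE, range->start,
					 range->end, GFP_KERNEL, PAGE_KERNEL,
					 VM_ALLOW_HUGE_VMAP, NUMA_NO_NODE,
					 __builtin_return_address(0));
		if (!p)
			return -ENOMEM;

		/* fill with invalid instructions while still writable */
		execmem_invalidate(p, PMD_SIZE, true);
		/* remap ROX; portions are later handed out without changes */
		set_memory_rox((unsigned long)p, PMD_SIZE >> PAGE_SHIFT);

		return add_to_free_areas(p, PMD_SIZE);
	}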

Signed-off-by: Mike Rapoport (IBM) 
---
 include/linux/execmem.h |   2 +
 mm/execmem.c| 267 ++--
 2 files changed, 262 insertions(+), 7 deletions(-)

diff --git a/include/linux/execmem.h b/include/linux/execmem.h
index 9d22999dbd7d..06f678e6fe55 100644
--- a/include/linux/execmem.h
+++ b/include/linux/execmem.h
@@ -77,12 +77,14 @@ struct execmem_range {
 
 /**
  * struct execmem_info - architecture parameters for code allocations
+ * @invalidate: set memory to contain invalid instructions
  * @ranges: array of parameter sets defining architecture specific
  * parameters for executable memory allocations. The ranges that are not
  * explicitly initialized by an architecture use parameters defined for
  * @EXECMEM_DEFAULT.
  */
 struct execmem_info {
+   void (*invalidate)(void *ptr, size_t size, bool writable);
struct execmem_rangeranges[EXECMEM_TYPE_MAX];
 };
 
diff --git a/mm/execmem.c b/mm/execmem.c
index c920d2b5a721..716fba68ab0e 100644
--- a/mm/execmem.c
+++ b/mm/execmem.c
@@ -1,30 +1,88 @@
 // SPDX-License-Identifier: GPL-2.0
 
 #include 
+#include 
 #include 
 #include 
+#include 
 #include 
 #include 
 
+#include 
+
+#include "internal.h"
+
 static struct execmem_info *execmem_info __ro_after_init;
 static struct execmem_info default_execmem_info __ro_after_init;
 
-static void *__execmem_alloc(struct execmem_range *range, size_t size)
+struct execmem_cache {
+   struct mutex mutex;
+   struct maple_tree busy_areas;
+   struct maple_tree free_areas;
+};
+
+static struct execmem_cache execmem_cache = {
+   .mutex = __MUTEX_INITIALIZER(execmem_cache.mutex),
+   .busy_areas = MTREE_INIT_EXT(busy_areas, MT_FLAGS_LOCK_EXTERN,
+execmem_cache.mutex),
+   .free_areas = MTREE_INIT_EXT(free_areas, MT_FLAGS_LOCK_EXTERN,
+execmem_cache.mutex),
+};
+
+static void execmem_cache_clean(struct work_struct *work)
+{
+	struct maple_tree *free_areas = &execmem_cache.free_areas;
+	struct mutex *mutex = &execmem_cache.mutex;
+   MA_STATE(mas, free_areas, 0, ULONG_MAX);
+   void *area;
+
+   mutex_lock(mutex);
+	mas_for_each(&mas, area, ULONG_MAX) {
+   size_t size;
+
+   if (!xa_is_value(area))
+   continue;
+
+   size = xa_to_value(area);
+
+   if (IS_ALIGNED(size, PMD_SIZE) && IS_ALIGNED(mas.index, 
PMD_SIZE)) {
+   void *ptr = (void *)mas.index;
+
+			mas_erase(&mas);
+   vfree(ptr);
+   }
+   }
+   mutex_unlock(mutex);
+}
+
+static DECLARE_WORK(execmem_cache_clean_work, execmem_cache_clean);
+
+static void execmem_invalidate(void *ptr, size_t size, bool writable)
+{
+   if (execmem_info->invalidate)
+   execmem_info->invalidate(ptr, size, writable);
+   else
+   memset(ptr, 0, size);
+}
+
+static void *execmem_vmalloc(struct execmem_range *range, size_t size,
+pgprot_t pgprot, unsigned long vm_flags)
 {
bool kasan = range->flags & EXECMEM_KASAN_SHADOW;
-   unsigned long vm_flags  = VM_FLUSH_RESET_PERMS;
gfp_t gfp_flags = GFP_KERNEL | __GFP_NOWARN;
+   unsigned int align = range->alignment;
unsigned long start = range->start;
unsigned long end = range->end;
-   unsigned int align = range->alignment;
-   pgprot_t pgprot = range->pgprot;
void *p;
 
if (kasan)
vm_flags |= VM_DEFER_KMEMLEAK;
 
-   p = __vmalloc_node_range(size, align, start, end, gfp_flags,
-pgprot, vm_flags, NUMA_NO_NODE,
+   if (vm_flags & VM_ALLOW_HUGE_VMAP)
+   align = PMD_SIZE;
+
+   p = __vmalloc_node_range(size, align, start, end, gfp_flags, pgprot,
+vm_flags, NUMA_NO_NODE,
 __builtin_return_address(0));
if (!p && range->fallback_start) {
start = range->fallback_start;
@@ -44,6 +102,199 @@ static void *__execmem_alloc(struct execmem_range *range, 
size_t size)
return NULL;
}
 
+   return p;
+}
+
+static int 

[RFC PATCH 5/7] x86/module: prepare module loading for ROX allocations of text

2024-04-11 Thread Mike Rapoport
From: "Mike Rapoport (IBM)" 

When module text memory is allocated with ROX permissions, the memory at
the actual address where the module will live will contain invalid
instructions, and there will be a writable copy that contains the actual
module code.

Update relocations and alternatives patching to deal with it.

Signed-off-by: Mike Rapoport (IBM) 
---
 arch/um/kernel/um_arch.c   |  11 ++-
 arch/x86/entry/vdso/vma.c  |   3 +-
 arch/x86/include/asm/alternative.h |  14 +--
 arch/x86/kernel/alternative.c  | 152 +
 arch/x86/kernel/ftrace.c   |  41 +---
 arch/x86/kernel/module.c   |  17 ++--
 6 files changed, 140 insertions(+), 98 deletions(-)

diff --git a/arch/um/kernel/um_arch.c b/arch/um/kernel/um_arch.c
index c5ca89e62552..5183c955974e 100644
--- a/arch/um/kernel/um_arch.c
+++ b/arch/um/kernel/um_arch.c
@@ -437,24 +437,25 @@ void __init arch_cpu_finalize_init(void)
os_check_bugs();
 }
 
-void apply_seal_endbr(s32 *start, s32 *end)
+void apply_seal_endbr(s32 *start, s32 *end, struct module *mod)
 {
 }
 
-void apply_retpolines(s32 *start, s32 *end)
+void apply_retpolines(s32 *start, s32 *end, struct module *mod)
 {
 }
 
-void apply_returns(s32 *start, s32 *end)
+void apply_returns(s32 *start, s32 *end, struct module *mod)
 {
 }
 
 void apply_fineibt(s32 *start_retpoline, s32 *end_retpoline,
-  s32 *start_cfi, s32 *end_cfi)
+  s32 *start_cfi, s32 *end_cfi, struct module *mod)
 {
 }
 
-void apply_alternatives(struct alt_instr *start, struct alt_instr *end)
+void apply_alternatives(struct alt_instr *start, struct alt_instr *end,
+   struct module *mod)
 {
 }
 
diff --git a/arch/x86/entry/vdso/vma.c b/arch/x86/entry/vdso/vma.c
index 6d83ceb7f1ba..31412adef5d2 100644
--- a/arch/x86/entry/vdso/vma.c
+++ b/arch/x86/entry/vdso/vma.c
@@ -51,7 +51,8 @@ int __init init_vdso_image(const struct vdso_image *image)
 
apply_alternatives((struct alt_instr *)(image->data + image->alt),
   (struct alt_instr *)(image->data + image->alt +
-   image->alt_len));
+   image->alt_len),
+  NULL);
 
return 0;
 }
diff --git a/arch/x86/include/asm/alternative.h 
b/arch/x86/include/asm/alternative.h
index fcd20c6dc7f9..6f4b0776fc89 100644
--- a/arch/x86/include/asm/alternative.h
+++ b/arch/x86/include/asm/alternative.h
@@ -96,16 +96,16 @@ extern struct alt_instr __alt_instructions[], 
__alt_instructions_end[];
  * instructions were patched in already:
  */
 extern int alternatives_patched;
+struct module;
 
 extern void alternative_instructions(void);
-extern void apply_alternatives(struct alt_instr *start, struct alt_instr *end);
-extern void apply_retpolines(s32 *start, s32 *end);
-extern void apply_returns(s32 *start, s32 *end);
-extern void apply_seal_endbr(s32 *start, s32 *end);
+extern void apply_alternatives(struct alt_instr *start, struct alt_instr *end,
+  struct module *mod);
+extern void apply_retpolines(s32 *start, s32 *end, struct module *mod);
+extern void apply_returns(s32 *start, s32 *end, struct module *mod);
+extern void apply_seal_endbr(s32 *start, s32 *end, struct module *mod);
 extern void apply_fineibt(s32 *start_retpoline, s32 *end_retpoine,
- s32 *start_cfi, s32 *end_cfi);
-
-struct module;
+ s32 *start_cfi, s32 *end_cfi, struct module *mod);
 
 struct callthunk_sites {
s32 *call_start, *call_end;
diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index 45a280f2161c..b4d6868df573 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -462,7 +462,8 @@ static int alt_replace_call(u8 *instr, u8 *insn_buff, 
struct alt_instr *a)
  * to refetch changed I$ lines.
  */
 void __init_or_module noinline apply_alternatives(struct alt_instr *start,
- struct alt_instr *end)
+ struct alt_instr *end,
+ struct module *mod)
 {
struct alt_instr *a;
u8 *instr, *replacement;
@@ -490,10 +491,18 @@ void __init_or_module noinline apply_alternatives(struct 
alt_instr *start,
 * order.
 */
for (a = start; a < end; a++) {
+   unsigned long ins_offs, repl_offs;
int insn_buff_sz = 0;
+   u8 *wr_instr, *wr_replacement;
 
		instr = (u8 *)&a->instr_offset + a->instr_offset;
+		ins_offs = module_writable_offset(mod, instr);
+		wr_instr = instr + ins_offs;
+
		replacement = (u8 *)&a->repl_offset + a->repl_offset;
+   repl_offs = module_writable_offset(mod, replacement);
+		wr_replacement = replacement + repl_offs;

[RFC PATCH 4/7] ftrace: Add swap_func to ftrace_process_locs()

2024-04-11 Thread Mike Rapoport
From: Song Liu 

ftrace_process_locs() sorts the module mcount locations, which may reside
in read-only memory. Add a weak ftrace_swap_func() so that architectures
can use an RO-memory-poke function to do the sorting.
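
As an illustration, an architecture whose mcount table sits in read-only
memory could override the weak default roughly like this (hypothetical
sketch, not part of this patch; text_poke_copy() is the x86 text-poking
primitive):

	void ftrace_swap_func(void *a, void *b, int n)
	{
		unsigned long t;

		WARN_ON_ONCE(n != sizeof(t));

		/* reading RO memory is fine; writes go through text poking */
		t = *(unsigned long *)a;
		text_poke_copy(a, b, sizeof(t));
		text_poke_copy(b, &t, sizeof(t));
	}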

Signed-off-by: Song Liu 
Signed-off-by: Mike Rapoport (IBM) 
---
 include/linux/ftrace.h |  2 ++
 kernel/trace/ftrace.c  | 13 -
 2 files changed, 14 insertions(+), 1 deletion(-)

diff --git a/include/linux/ftrace.h b/include/linux/ftrace.h
index 54d53f345d14..54393ce57f08 100644
--- a/include/linux/ftrace.h
+++ b/include/linux/ftrace.h
@@ -1172,4 +1172,6 @@ unsigned long arch_syscall_addr(int nr);
 
 #endif /* CONFIG_FTRACE_SYSCALLS */
 
+void ftrace_swap_func(void *a, void *b, int n);
+
 #endif /* _LINUX_FTRACE_H */
diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c
index da1710499698..95f11c8cdc5e 100644
--- a/kernel/trace/ftrace.c
+++ b/kernel/trace/ftrace.c
@@ -6480,6 +6480,17 @@ static void test_is_sorted(unsigned long *start, 
unsigned long count)
 }
 #endif
 
+void __weak ftrace_swap_func(void *a, void *b, int n)
+{
+   unsigned long t;
+
+   WARN_ON_ONCE(n != sizeof(t));
+
+   t = *((unsigned long *)a);
+   *(unsigned long *)a = *(unsigned long *)b;
+   *(unsigned long *)b = t;
+}
+
 static int ftrace_process_locs(struct module *mod,
   unsigned long *start,
   unsigned long *end)
@@ -6507,7 +6518,7 @@ static int ftrace_process_locs(struct module *mod,
 */
if (!IS_ENABLED(CONFIG_BUILDTIME_MCOUNT_SORT) || mod) {
sort(start, count, sizeof(*start),
-ftrace_cmp_ips, NULL);
+ftrace_cmp_ips, ftrace_swap_func);
} else {
test_is_sorted(start, count);
}
-- 
2.43.0




[RFC PATCH 3/7] module: prepare to handle ROX allocations for text

2024-04-11 Thread Mike Rapoport
From: "Mike Rapoport (IBM)" 

In order to support ROX allocations for module text, it is necessary to
handle modifications to the code, such as relocations and alternatives
patching, without write access to that memory.

One option is to use text patching, but this would make module loading
extremely slow and would expose executable code that is not yet in its
final form.

A better way is to have memory allocated with ROX permissions contain
invalid instructions and keep a writable, but not executable copy of the
module text. The relocations and alternative patches would be done on the
writable copy using the addresses of the ROX memory.
Once the module is completely ready, the updated text will be copied to ROX
memory using text patching in one go and the writable copy will be freed.

Add support for that to module initialization code and provide necessary
interfaces in execmem.
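
For illustration, patching a location in module text under this scheme
boils down to redirecting the store through the writable copy. The
apply_reloc() helper and the value below are made up for the example; the
interfaces are the ones added by this patch:

	static void apply_reloc(struct module *mod, u32 *loc, u32 value)
	{
		u32 *wr_loc = (void *)loc + module_writable_offset(mod, loc);

		*wr_loc = value;	/* the ROX mapping at @loc is untouched */
	}

Once the module is fully formed, each text region is flushed out with
execmem_update_copy(mem->base, mem->rw_copy, mem->size) and the writable
copy is freed.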
---
 include/linux/execmem.h| 23 +
 include/linux/module.h | 11 ++
 kernel/module/main.c   | 70 ++
 kernel/module/strict_rwx.c |  3 ++
 mm/execmem.c   | 11 ++
 5 files changed, 111 insertions(+), 7 deletions(-)

diff --git a/include/linux/execmem.h b/include/linux/execmem.h
index ffd0d12feef5..9d22999dbd7d 100644
--- a/include/linux/execmem.h
+++ b/include/linux/execmem.h
@@ -46,9 +46,11 @@ enum execmem_type {
 /**
  * enum execmem_range_flags - options for executable memory allocations
  * @EXECMEM_KASAN_SHADOW:  allocate kasan shadow
+ * @EXECMEM_ROX_CACHE: allocate memory from a cache of large ROX pages
  */
 enum execmem_range_flags {
EXECMEM_KASAN_SHADOW= (1 << 0),
+   EXECMEM_ROX_CACHE   = (1 << 1),
 };
 
 /**
@@ -123,6 +125,27 @@ void *execmem_alloc(enum execmem_type type, size_t size);
  */
 void execmem_free(void *ptr);
 
+/**
+ * execmem_update_copy - copy an update to executable memory
+ * @dst:  destination address to update
+ * @src:  source address containing the data
+ * @size: how many bytes of memory should be copied
+ *
+ * Copy @size bytes from @src to @dst using text poking if the memory at
+ * @dst is read-only.
+ *
+ * Return: a pointer to @dst or NULL on error
+ */
+void *execmem_update_copy(void *dst, const void *src, size_t size);
+
+/**
+ * execmem_is_rox - check if execmem is read-only
+ * @type: the execmem type to check
+ *
+ * Return: %true if the @type is read-only, %false if it's writable
+ */
+bool execmem_is_rox(enum execmem_type type);
+
 #ifdef CONFIG_ARCH_WANTS_EXECMEM_EARLY
 void execmem_early_init(void);
 #else
diff --git a/include/linux/module.h b/include/linux/module.h
index 1153b0d99a80..3df3202680a2 100644
--- a/include/linux/module.h
+++ b/include/linux/module.h
@@ -361,6 +361,8 @@ enum mod_mem_type {
 
 struct module_memory {
void *base;
+   void *rw_copy;
+   bool is_rox;
unsigned int size;
 
 #ifdef CONFIG_MODULES_TREE_LOOKUP
@@ -368,6 +370,15 @@ struct module_memory {
 #endif
 };
 
+#ifdef CONFIG_MODULES
+unsigned long module_writable_offset(struct module *mod, void *loc);
+#else
+static inline unsigned long module_writable_offset(struct module *mod, void 
*loc)
+{
+   return 0;
+}
+#endif
+
 #ifdef CONFIG_MODULES_TREE_LOOKUP
 /* Only touch one cacheline for common rbtree-for-core-layout case. */
 #define __module_memory_align cacheline_aligned
diff --git a/kernel/module/main.c b/kernel/module/main.c
index 91e185607d4b..f83fbb9c95ee 100644
--- a/kernel/module/main.c
+++ b/kernel/module/main.c
@@ -1188,6 +1188,21 @@ void __weak module_arch_freeing_init(struct module *mod)
 {
 }
 
+unsigned long module_writable_offset(struct module *mod, void *loc)
+{
+   if (!mod)
+   return 0;
+
+   for_class_mod_mem_type(type, text) {
+		struct module_memory *mem = &mod->mem[type];
+
+   if (loc >= mem->base && loc < mem->base + mem->size)
+   return (unsigned long)(mem->rw_copy - mem->base);
+   }
+
+   return 0;
+}
+
 static int module_memory_alloc(struct module *mod, enum mod_mem_type type)
 {
unsigned int size = PAGE_ALIGN(mod->mem[type].size);
@@ -1205,6 +1220,23 @@ static int module_memory_alloc(struct module *mod, enum 
mod_mem_type type)
if (!ptr)
return -ENOMEM;
 
+   mod->mem[type].base = ptr;
+
+   if (execmem_is_rox(execmem_type)) {
+   ptr = vzalloc(size);
+
+   if (!ptr) {
+   execmem_free(mod->mem[type].base);
+   return -ENOMEM;
+   }
+
+   mod->mem[type].rw_copy = ptr;
+   mod->mem[type].is_rox = true;
+   } else {
+   mod->mem[type].rw_copy = mod->mem[type].base;
+   memset(mod->mem[type].base, 0, size);
+   }
+
/*
 * The pointer to these blocks of memory are stored on the module
 * structure and we keep that around so long as the module is
@@ -1218,15 +1250,16 @@ static int 

[RFC PATCH 2/7] mm: vmalloc: don't account for number of nodes for HUGE_VMAP allocations

2024-04-11 Thread Mike Rapoport
From: "Mike Rapoport (IBM)" 

vmalloc allocations with VM_ALLOW_HUGE_VMAP that do not explicitly
specify a node ID will use huge pages only if size_per_node is larger than
PMD_SIZE.
Still, the actual allocated memory is not distributed between nodes and
there is no advantage in such an approach. For example, on a machine with
four online nodes, a PMD_SIZE request computes size_per_node of a quarter
of PMD_SIZE and falls back to base pages even though the allocation itself
is exactly huge-page sized.
On the contrary, BPF allocates PMD_SIZE * num_possible_nodes() for each
new bpf_prog_pack, while it could do with PMD_SIZE'ed packs.

Don't account for number of nodes for VM_ALLOW_HUGE_VMAP with
NUMA_NO_NODE and use huge pages whenever the requested allocation size
is larger than PMD_SIZE.

Signed-off-by: Mike Rapoport (IBM) 
---
 mm/vmalloc.c | 9 ++---
 1 file changed, 2 insertions(+), 7 deletions(-)

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 22aa63f4ef63..5fc8b514e457 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -3737,8 +3737,6 @@ void *__vmalloc_node_range(unsigned long size, unsigned 
long align,
}
 
if (vmap_allow_huge && (vm_flags & VM_ALLOW_HUGE_VMAP)) {
-   unsigned long size_per_node;
-
/*
 * Try huge pages. Only try for PAGE_KERNEL allocations,
 * others like modules don't yet expect huge pages in
@@ -3746,13 +3744,10 @@ void *__vmalloc_node_range(unsigned long size, unsigned 
long align,
 * supporting them.
 */
 
-   size_per_node = size;
-   if (node == NUMA_NO_NODE)
-   size_per_node /= num_online_nodes();
-   if (arch_vmap_pmd_supported(prot) && size_per_node >= PMD_SIZE)
+   if (arch_vmap_pmd_supported(prot) && size >= PMD_SIZE)
shift = PMD_SHIFT;
else
-   shift = arch_vmap_pte_supported_shift(size_per_node);
+   shift = arch_vmap_pte_supported_shift(size);
 
align = max(real_align, 1UL << shift);
size = ALIGN(real_size, 1UL << shift);
-- 
2.43.0




[RFC PATCH 1/7] asm-generic: introduce text-patching.h

2024-04-11 Thread Mike Rapoport
From: "Mike Rapoport (IBM)" 

Several architectures support text patching, but they name the header
files that declare patching functions differently.

Make all such headers consistently named text-patching.h and add an empty
header in asm-generic for architectures that do not support text patching.

Signed-off-by: Mike Rapoport (IBM) 
---
 arch/alpha/include/asm/Kbuild |  1 +
 arch/arc/include/asm/Kbuild   |  1 +
 arch/arm/include/asm/{patch.h => text-patching.h} |  0
 arch/arm/kernel/ftrace.c  |  2 +-
 arch/arm/kernel/jump_label.c  |  2 +-
 arch/arm/kernel/kgdb.c|  2 +-
 arch/arm/kernel/patch.c   |  2 +-
 arch/arm/probes/kprobes/core.c|  2 +-
 arch/arm/probes/kprobes/opt-arm.c |  2 +-
 .../include/asm/{patching.h => text-patching.h}   |  0
 arch/arm64/kernel/ftrace.c|  2 +-
 arch/arm64/kernel/jump_label.c|  2 +-
 arch/arm64/kernel/kgdb.c  |  2 +-
 arch/arm64/kernel/patching.c  |  2 +-
 arch/arm64/kernel/probes/kprobes.c|  2 +-
 arch/arm64/kernel/traps.c |  2 +-
 arch/arm64/net/bpf_jit_comp.c |  2 +-
 arch/csky/include/asm/Kbuild  |  1 +
 arch/hexagon/include/asm/Kbuild   |  1 +
 arch/loongarch/include/asm/Kbuild |  1 +
 arch/m68k/include/asm/Kbuild  |  1 +
 arch/microblaze/include/asm/Kbuild|  1 +
 arch/mips/include/asm/Kbuild  |  1 +
 arch/nios2/include/asm/Kbuild |  1 +
 arch/openrisc/include/asm/Kbuild  |  1 +
 .../include/asm/{patch.h => text-patching.h}  |  0
 arch/parisc/kernel/ftrace.c   |  2 +-
 arch/parisc/kernel/jump_label.c   |  2 +-
 arch/parisc/kernel/kgdb.c |  2 +-
 arch/parisc/kernel/kprobes.c  |  2 +-
 arch/parisc/kernel/patch.c|  2 +-
 arch/powerpc/include/asm/kprobes.h|  2 +-
 .../asm/{code-patching.h => text-patching.h}  |  0
 arch/powerpc/kernel/crash_dump.c  |  2 +-
 arch/powerpc/kernel/epapr_paravirt.c  |  2 +-
 arch/powerpc/kernel/jump_label.c  |  2 +-
 arch/powerpc/kernel/kgdb.c|  2 +-
 arch/powerpc/kernel/kprobes.c |  2 +-
 arch/powerpc/kernel/module_32.c   |  2 +-
 arch/powerpc/kernel/module_64.c   |  2 +-
 arch/powerpc/kernel/optprobes.c   |  2 +-
 arch/powerpc/kernel/process.c |  2 +-
 arch/powerpc/kernel/security.c|  2 +-
 arch/powerpc/kernel/setup_32.c|  2 +-
 arch/powerpc/kernel/setup_64.c|  2 +-
 arch/powerpc/kernel/static_call.c |  2 +-
 arch/powerpc/kernel/trace/ftrace.c|  2 +-
 arch/powerpc/kernel/trace/ftrace_64_pg.c  |  2 +-
 arch/powerpc/lib/code-patching.c  |  2 +-
 arch/powerpc/lib/feature-fixups.c |  2 +-
 arch/powerpc/lib/test-code-patching.c |  2 +-
 arch/powerpc/lib/test_emulate_step.c  |  2 +-
 arch/powerpc/mm/book3s32/mmu.c|  2 +-
 arch/powerpc/mm/book3s64/hash_utils.c |  2 +-
 arch/powerpc/mm/book3s64/slb.c|  2 +-
 arch/powerpc/mm/kasan/init_32.c   |  2 +-
 arch/powerpc/mm/mem.c |  2 +-
 arch/powerpc/mm/nohash/44x.c  |  2 +-
 arch/powerpc/mm/nohash/book3e_pgtable.c   |  2 +-
 arch/powerpc/mm/nohash/tlb.c  |  2 +-
 arch/powerpc/net/bpf_jit_comp.c   |  2 +-
 arch/powerpc/perf/8xx-pmu.c   |  2 +-
 arch/powerpc/perf/core-book3s.c   |  2 +-
 arch/powerpc/platforms/85xx/smp.c |  2 +-
 arch/powerpc/platforms/86xx/mpc86xx_smp.c |  2 +-
 arch/powerpc/platforms/cell/smp.c |  2 +-
 arch/powerpc/platforms/powermac/smp.c |  2 +-
 arch/powerpc/platforms/powernv/idle.c |  2 +-
 arch/powerpc/platforms/powernv/smp.c  |  2 +-
 arch/powerpc/platforms/pseries/smp.c  |  2 +-
 arch/powerpc/xmon/xmon.c  |  2 +-
 arch/riscv/errata/andes/errata.c  |  2 +-
 arch/riscv/errata/sifive/errata.c |  2 +-
 arch/riscv/errata/thead/errata.c  |  2 +-
 .../include/asm/{patch.h => text-patching.h}  |  0
 arch/riscv/include/asm/uprobes.h  |  2 +-
 arch/riscv/kernel/alternative.c   |  2 +-
 arch/riscv/kernel/cpufeature.c|  3 ++-
 arch/riscv/kernel/ftrace.c|  2 +-
 

[RFC PATCH 0/7] x86/module: use large ROX pages for text allocations

2024-04-11 Thread Mike Rapoport
From: "Mike Rapoport (IBM)" 

Hi,

These patches add support for using large ROX pages for allocations of
executable memory on x86.

They address Andy's comments [1] about having executable mappings for code
that was not completely formed.

The approach taken is to allocate ROX memory along with writable but not
executable memory and use the writable copy to perform relocations and
alternatives patching. After the module text gets into its final shape, the
contents of the writable memory are copied into the actual ROX location
using text poking.

The allocations of the ROX memory use vmalloc(VM_ALLOW_HUGE_VMAP) to
allocate PMD aligned memory, fill that memory with invalid instructions and
in the end remap it as ROX. Portions of these large pages are handed out to
execmem_alloc() callers without any changes to the permissions. When the
memory is freed with execmem_free() it is invalidated again so that it
won't contain stale instructions.

The module memory allocation, x86 code dealing with relocations and
alternatives patching takes into account the existence of the two copies,
the writable memory and the ROX memory at the actual allocated virtual
address.

This is an early RFC; it is not well tested and there is a lot of room for
improvement. For example, the locking of execmem_cache can be made more
fine grained, freeing of PMD-sized pages from execmem_cache can be
implemented with a shrinker, the large pages can be removed from the direct
map when they are added to the cache and restored there when they are freed
from the cache.

Still, I'd like to hear feedback on the approach in general before moving
forward with polishing the details.

The series applies on top of v4 of "jit/text allocator" [2] and also
available at git:

https://git.kernel.org/pub/scm/linux/kernel/git/rppt/linux.git/log/?h=execmem/v4%2bx86-rox

[1] 
https://lore.kernel.org/all/a17c65c6-863f-4026-9c6f-a04b659e9...@app.fastmail.com
[2] https://lore.kernel.org/linux-mm/20240411160051.2093261-1-r...@kernel.org 

Mike Rapoport (IBM) (6):
  asm-generic: introduce text-patching.h
  mm: vmalloc: don't account for number of nodes for HUGE_VMAP allocations
  module: prepare to handle ROX allocations for text
  x86/module: prepare module loading for ROX allocations of text
  execmem: add support for cache of large ROX pages
  x86/module: enable ROX caches for module text

Song Liu (1):
  ftrace: Add swap_func to ftrace_process_locs()

 arch/alpha/include/asm/Kbuild |   1 +
 arch/arc/include/asm/Kbuild   |   1 +
 .../include/asm/{patch.h => text-patching.h}  |   0
 arch/arm/kernel/ftrace.c  |   2 +-
 arch/arm/kernel/jump_label.c  |   2 +-
 arch/arm/kernel/kgdb.c|   2 +-
 arch/arm/kernel/patch.c   |   2 +-
 arch/arm/probes/kprobes/core.c|   2 +-
 arch/arm/probes/kprobes/opt-arm.c |   2 +-
 .../asm/{patching.h => text-patching.h}   |   0
 arch/arm64/kernel/ftrace.c|   2 +-
 arch/arm64/kernel/jump_label.c|   2 +-
 arch/arm64/kernel/kgdb.c  |   2 +-
 arch/arm64/kernel/patching.c  |   2 +-
 arch/arm64/kernel/probes/kprobes.c|   2 +-
 arch/arm64/kernel/traps.c |   2 +-
 arch/arm64/net/bpf_jit_comp.c |   2 +-
 arch/csky/include/asm/Kbuild  |   1 +
 arch/hexagon/include/asm/Kbuild   |   1 +
 arch/loongarch/include/asm/Kbuild |   1 +
 arch/m68k/include/asm/Kbuild  |   1 +
 arch/microblaze/include/asm/Kbuild|   1 +
 arch/mips/include/asm/Kbuild  |   1 +
 arch/nios2/include/asm/Kbuild |   1 +
 arch/openrisc/include/asm/Kbuild  |   1 +
 .../include/asm/{patch.h => text-patching.h}  |   0
 arch/parisc/kernel/ftrace.c   |   2 +-
 arch/parisc/kernel/jump_label.c   |   2 +-
 arch/parisc/kernel/kgdb.c |   2 +-
 arch/parisc/kernel/kprobes.c  |   2 +-
 arch/parisc/kernel/patch.c|   2 +-
 arch/powerpc/include/asm/kprobes.h|   2 +-
 .../asm/{code-patching.h => text-patching.h}  |   0
 arch/powerpc/kernel/crash_dump.c  |   2 +-
 arch/powerpc/kernel/epapr_paravirt.c  |   2 +-
 arch/powerpc/kernel/jump_label.c  |   2 +-
 arch/powerpc/kernel/kgdb.c|   2 +-
 arch/powerpc/kernel/kprobes.c |   2 +-
 arch/powerpc/kernel/module_32.c   |   2 +-
 arch/powerpc/kernel/module_64.c   |   2 +-
 arch/powerpc/kernel/optprobes.c   |   2 +-
 arch/powerpc/kernel/process.c |   2 +-
 arch/powerpc/kernel/security.c|   2 +-
 arch/powerpc/kernel/setup_32.c|   2 +-
 arch/powerpc/kernel/setup_64.c|   2 +-
 arch/powerpc/kernel/static_call.c |   2 +-
 

[PATCH v4 15/15] bpf: remove CONFIG_BPF_JIT dependency on CONFIG_MODULES

2024-04-11 Thread Mike Rapoport
From: "Mike Rapoport (IBM)" 

The BPF just-in-time compiler depended on CONFIG_MODULES because it used
module_alloc() to allocate memory for the generated code.

Since code allocations are now implemented with execmem, drop dependency of
CONFIG_BPF_JIT on CONFIG_MODULES and make it select CONFIG_EXECMEM.

Suggested-by: Björn Töpel 
Signed-off-by: Mike Rapoport (IBM) 
---
 kernel/bpf/Kconfig | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/bpf/Kconfig b/kernel/bpf/Kconfig
index bc25f5098a25..f999e4e0b344 100644
--- a/kernel/bpf/Kconfig
+++ b/kernel/bpf/Kconfig
@@ -43,7 +43,7 @@ config BPF_JIT
bool "Enable BPF Just In Time compiler"
depends on BPF
depends on HAVE_CBPF_JIT || HAVE_EBPF_JIT
-   depends on MODULES
+   select EXECMEM
help
  BPF programs are normally handled by a BPF interpreter. This option
  allows the kernel to generate native code when a program is loaded
-- 
2.43.0




[PATCH v4 14/15] kprobes: remove dependency on CONFIG_MODULES

2024-04-11 Thread Mike Rapoport
From: "Mike Rapoport (IBM)" 

kprobes depended on CONFIG_MODULES because it had to allocate memory for
code.

Since code allocations are now implemented with execmem, kprobes can be
enabled in non-modular kernels.

Add #ifdef CONFIG_MODULES guards for the code dealing with kprobes inside
modules, make CONFIG_KPROBES select CONFIG_EXECMEM and drop the
dependency of CONFIG_KPROBES on CONFIG_MODULES.

Signed-off-by: Mike Rapoport (IBM) 
---
 arch/Kconfig|  2 +-
 kernel/kprobes.c| 43 +
 kernel/trace/trace_kprobe.c | 11 ++
 3 files changed, 37 insertions(+), 19 deletions(-)

diff --git a/arch/Kconfig b/arch/Kconfig
index bc9e8e5dccd5..68177adf61a0 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -52,9 +52,9 @@ config GENERIC_ENTRY
 
 config KPROBES
bool "Kprobes"
-   depends on MODULES
depends on HAVE_KPROBES
select KALLSYMS
+   select EXECMEM
select TASKS_RCU if PREEMPTION
help
  Kprobes allows you to trap at almost any kernel address and
diff --git a/kernel/kprobes.c b/kernel/kprobes.c
index 047ca629ce49..90c056853e6f 100644
--- a/kernel/kprobes.c
+++ b/kernel/kprobes.c
@@ -1580,6 +1580,7 @@ static int check_kprobe_address_safe(struct kprobe *p,
goto out;
}
 
+#ifdef CONFIG_MODULES
/* Check if 'p' is probing a module. */
*probed_mod = __module_text_address((unsigned long) p->addr);
if (*probed_mod) {
@@ -1603,6 +1604,8 @@ static int check_kprobe_address_safe(struct kprobe *p,
ret = -ENOENT;
}
}
+#endif
+
 out:
preempt_enable();
jump_label_unlock();
@@ -2482,24 +2485,6 @@ int kprobe_add_area_blacklist(unsigned long start, 
unsigned long end)
return 0;
 }
 
-/* Remove all symbols in given area from kprobe blacklist */
-static void kprobe_remove_area_blacklist(unsigned long start, unsigned long 
end)
-{
-   struct kprobe_blacklist_entry *ent, *n;
-
-	list_for_each_entry_safe(ent, n, &kprobe_blacklist, list) {
-   if (ent->start_addr < start || ent->start_addr >= end)
-   continue;
-		list_del(&ent->list);
-   kfree(ent);
-   }
-}
-
-static void kprobe_remove_ksym_blacklist(unsigned long entry)
-{
-   kprobe_remove_area_blacklist(entry, entry + 1);
-}
-
 int __weak arch_kprobe_get_kallsym(unsigned int *symnum, unsigned long *value,
   char *type, char *sym)
 {
@@ -2564,6 +2549,25 @@ static int __init populate_kprobe_blacklist(unsigned 
long *start,
return ret ? : arch_populate_kprobe_blacklist();
 }
 
+#ifdef CONFIG_MODULES
+/* Remove all symbols in given area from kprobe blacklist */
+static void kprobe_remove_area_blacklist(unsigned long start, unsigned long 
end)
+{
+   struct kprobe_blacklist_entry *ent, *n;
+
+	list_for_each_entry_safe(ent, n, &kprobe_blacklist, list) {
+   if (ent->start_addr < start || ent->start_addr >= end)
+   continue;
+		list_del(&ent->list);
+   kfree(ent);
+   }
+}
+
+static void kprobe_remove_ksym_blacklist(unsigned long entry)
+{
+   kprobe_remove_area_blacklist(entry, entry + 1);
+}
+
 static void add_module_kprobe_blacklist(struct module *mod)
 {
unsigned long start, end;
@@ -2665,6 +2669,7 @@ static struct notifier_block kprobe_module_nb = {
.notifier_call = kprobes_module_callback,
.priority = 0
 };
+#endif
 
 void kprobe_free_init_mem(void)
 {
@@ -2724,8 +2729,10 @@ static int __init init_kprobes(void)
err = arch_init_kprobes();
if (!err)
		err = register_die_notifier(&kprobe_exceptions_nb);
+#ifdef CONFIG_MODULES
if (!err)
		err = register_module_notifier(&kprobe_module_nb);
+#endif
 
kprobes_initialized = (err == 0);
kprobe_sysctls_init();
diff --git a/kernel/trace/trace_kprobe.c b/kernel/trace/trace_kprobe.c
index 14099cc17fc9..f0610137d6a3 100644
--- a/kernel/trace/trace_kprobe.c
+++ b/kernel/trace/trace_kprobe.c
@@ -111,6 +111,7 @@ static nokprobe_inline bool 
trace_kprobe_within_module(struct trace_kprobe *tk,
return strncmp(module_name(mod), name, len) == 0 && name[len] == ':';
 }
 
+#ifdef CONFIG_MODULES
 static nokprobe_inline bool trace_kprobe_module_exist(struct trace_kprobe *tk)
 {
char *p;
@@ -129,6 +130,12 @@ static nokprobe_inline bool 
trace_kprobe_module_exist(struct trace_kprobe *tk)
 
return ret;
 }
+#else
+static inline bool trace_kprobe_module_exist(struct trace_kprobe *tk)
+{
+   return false;
+}
+#endif
 
 static bool trace_kprobe_is_busy(struct dyn_event *ev)
 {
@@ -670,6 +677,7 @@ static int register_trace_kprobe(struct trace_kprobe *tk)
return ret;
 }
 
+#ifdef CONFIG_MODULES
 /* Module notifier call back, checking event on the module */
 static int trace_kprobe_module_callback(struct notifier_block *nb,
  

[PATCH v4 13/15] powerpc: use CONFIG_EXECMEM instead of CONFIG_MODULES where appropriate

2024-04-11 Thread Mike Rapoport
From: "Mike Rapoport (IBM)" 

There are places where CONFIG_MODULES guards the code that depends on
memory allocation being done with module_alloc().

Replace CONFIG_MODULES with CONFIG_EXECMEM in such places.

Signed-off-by: Mike Rapoport (IBM) 
---
 arch/powerpc/Kconfig | 2 +-
 arch/powerpc/include/asm/kasan.h | 2 +-
 arch/powerpc/kernel/head_8xx.S   | 4 ++--
 arch/powerpc/kernel/head_book3s_32.S | 6 +++---
 arch/powerpc/lib/code-patching.c | 2 +-
 arch/powerpc/mm/book3s32/mmu.c   | 2 +-
 6 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 1c4be3373686..2e586733a464 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -285,7 +285,7 @@ config PPC
select IOMMU_HELPER if PPC64
select IRQ_DOMAIN
select IRQ_FORCED_THREADING
-   select KASAN_VMALLOCif KASAN && MODULES
+   select KASAN_VMALLOCif KASAN && EXECMEM
select LOCK_MM_AND_FIND_VMA
select MMU_GATHER_PAGE_SIZE
select MMU_GATHER_RCU_TABLE_FREE
diff --git a/arch/powerpc/include/asm/kasan.h b/arch/powerpc/include/asm/kasan.h
index 365d2720097c..b5bbb94c51f6 100644
--- a/arch/powerpc/include/asm/kasan.h
+++ b/arch/powerpc/include/asm/kasan.h
@@ -19,7 +19,7 @@
 
 #define KASAN_SHADOW_SCALE_SHIFT   3
 
-#if defined(CONFIG_MODULES) && defined(CONFIG_PPC32)
+#if defined(CONFIG_EXECMEM) && defined(CONFIG_PPC32)
 #define KASAN_KERN_START   ALIGN_DOWN(PAGE_OFFSET - SZ_256M, SZ_256M)
 #else
 #define KASAN_KERN_START   PAGE_OFFSET
diff --git a/arch/powerpc/kernel/head_8xx.S b/arch/powerpc/kernel/head_8xx.S
index 647b0b445e89..edc479a7c2bc 100644
--- a/arch/powerpc/kernel/head_8xx.S
+++ b/arch/powerpc/kernel/head_8xx.S
@@ -199,12 +199,12 @@ instruction_counter:
mfspr   r10, SPRN_SRR0  /* Get effective address of fault */
INVALIDATE_ADJACENT_PAGES_CPU15(r10, r11)
mtspr   SPRN_MD_EPN, r10
-#ifdef CONFIG_MODULES
+#ifdef CONFIG_EXECMEM
mfcrr11
compare_to_kernel_boundary r10, r10
 #endif
mfspr   r10, SPRN_M_TWB /* Get level 1 table */
-#ifdef CONFIG_MODULES
+#ifdef CONFIG_EXECMEM
blt+3f
rlwinm  r10, r10, 0, 20, 31
orisr10, r10, (swapper_pg_dir - PAGE_OFFSET)@ha
diff --git a/arch/powerpc/kernel/head_book3s_32.S 
b/arch/powerpc/kernel/head_book3s_32.S
index c1d89764dd22..57196883a00e 100644
--- a/arch/powerpc/kernel/head_book3s_32.S
+++ b/arch/powerpc/kernel/head_book3s_32.S
@@ -419,14 +419,14 @@ InstructionTLBMiss:
  */
/* Get PTE (linux-style) and check access */
mfspr   r3,SPRN_IMISS
-#ifdef CONFIG_MODULES
+#ifdef CONFIG_EXECMEM
lis r1, TASK_SIZE@h /* check if kernel address */
cmplw   0,r1,r3
 #endif
mfspr   r2, SPRN_SDR1
li  r1,_PAGE_PRESENT | _PAGE_ACCESSED | _PAGE_EXEC
rlwinm  r2, r2, 28, 0xf000
-#ifdef CONFIG_MODULES
+#ifdef CONFIG_EXECMEM
li  r0, 3
bgt-112f
lis r2, (swapper_pg_dir - PAGE_OFFSET)@ha   /* if kernel address, 
use */
@@ -442,7 +442,7 @@ InstructionTLBMiss:
andc.   r1,r1,r2/* check access & ~permission */
bne-InstructionAddressInvalid /* return if access not permitted */
/* Convert linux-style PTE to low word of PPC-style PTE */
-#ifdef CONFIG_MODULES
+#ifdef CONFIG_EXECMEM
rlwimi  r2, r0, 0, 31, 31   /* userspace ? -> PP lsb */
 #endif
ori r1, r1, 0xe06   /* clear out reserved bits */
diff --git a/arch/powerpc/lib/code-patching.c b/arch/powerpc/lib/code-patching.c
index c6ab46156cda..7af791446ddf 100644
--- a/arch/powerpc/lib/code-patching.c
+++ b/arch/powerpc/lib/code-patching.c
@@ -225,7 +225,7 @@ void __init poking_init(void)
 
 static unsigned long get_patch_pfn(void *addr)
 {
-   if (IS_ENABLED(CONFIG_MODULES) && is_vmalloc_or_module_addr(addr))
+   if (IS_ENABLED(CONFIG_EXECMEM) && is_vmalloc_or_module_addr(addr))
return vmalloc_to_pfn(addr);
else
return __pa_symbol(addr) >> PAGE_SHIFT;
diff --git a/arch/powerpc/mm/book3s32/mmu.c b/arch/powerpc/mm/book3s32/mmu.c
index 100f999871bc..625fe7d08e06 100644
--- a/arch/powerpc/mm/book3s32/mmu.c
+++ b/arch/powerpc/mm/book3s32/mmu.c
@@ -184,7 +184,7 @@ unsigned long __init mmu_mapin_ram(unsigned long base, 
unsigned long top)
 
 static bool is_module_segment(unsigned long addr)
 {
-   if (!IS_ENABLED(CONFIG_MODULES))
+   if (!IS_ENABLED(CONFIG_EXECMEM))
return false;
if (addr < ALIGN_DOWN(MODULES_VADDR, SZ_256M))
return false;
-- 
2.43.0




[PATCH v4 12/15] x86/ftrace: enable dynamic ftrace without CONFIG_MODULES

2024-04-11 Thread Mike Rapoport
From: "Mike Rapoport (IBM)" 

Dynamic ftrace must allocate memory for code and this was impossible
without CONFIG_MODULES.

With execmem separated from the modules code, execmem_text_alloc() is
available regardless of CONFIG_MODULES.

Remove dependency of dynamic ftrace on CONFIG_MODULES and make
CONFIG_DYNAMIC_FTRACE select CONFIG_EXECMEM in Kconfig.

Signed-off-by: Mike Rapoport (IBM) 
---
 arch/x86/Kconfig |  1 +
 arch/x86/kernel/ftrace.c | 10 --
 2 files changed, 1 insertion(+), 10 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index e87ddbdaaeb2..5100a769ffda 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -34,6 +34,7 @@ config X86_64
select SWIOTLB
select ARCH_HAS_ELFCORE_COMPAT
select ZONE_DMA32
+   select EXECMEM if DYNAMIC_FTRACE
 
 config FORCE_DYNAMIC_FTRACE
def_bool y
diff --git a/arch/x86/kernel/ftrace.c b/arch/x86/kernel/ftrace.c
index c8ddb7abda7c..8da0e66ca22d 100644
--- a/arch/x86/kernel/ftrace.c
+++ b/arch/x86/kernel/ftrace.c
@@ -261,8 +261,6 @@ void arch_ftrace_update_code(int command)
 /* Currently only x86_64 supports dynamic trampolines */
 #ifdef CONFIG_X86_64
 
-#ifdef CONFIG_MODULES
-/* Module allocation simplifies allocating memory for code */
 static inline void *alloc_tramp(unsigned long size)
 {
return execmem_alloc(EXECMEM_FTRACE, size);
@@ -271,14 +269,6 @@ static inline void tramp_free(void *tramp)
 {
execmem_free(tramp);
 }
-#else
-/* Trampolines can only be created if modules are supported */
-static inline void *alloc_tramp(unsigned long size)
-{
-   return NULL;
-}
-static inline void tramp_free(void *tramp) { }
-#endif
 
 /* Defined as markers to the end of the ftrace default trampolines */
 extern void ftrace_regs_caller_end(void);
-- 
2.43.0




[PATCH v4 11/15] arch: make execmem setup available regardless of CONFIG_MODULES

2024-04-11 Thread Mike Rapoport
From: "Mike Rapoport (IBM)" 

execmem does not depend on modules; on the contrary, modules use
execmem.

To make execmem available when CONFIG_MODULES=n, for instance for
kprobes, split execmem_params initialization out from
arch/*/kernel/module.c and compile it when CONFIG_EXECMEM=y.

Signed-off-by: Mike Rapoport (IBM) 
---
 arch/arm/kernel/module.c   |  40 --
 arch/arm/mm/init.c |  40 ++
 arch/arm64/kernel/module.c | 136 -
 arch/arm64/mm/init.c   | 136 +
 arch/loongarch/kernel/module.c |  18 -
 arch/loongarch/mm/init.c   |  20 +
 arch/mips/kernel/module.c  |  21 -
 arch/mips/mm/init.c|  22 ++
 arch/nios2/kernel/module.c |  18 -
 arch/nios2/mm/init.c   |  19 +
 arch/parisc/kernel/module.c|  19 -
 arch/parisc/mm/init.c  |  22 +-
 arch/powerpc/kernel/module.c   |  63 ---
 arch/powerpc/mm/mem.c  |  64 
 arch/riscv/kernel/module.c |  40 --
 arch/riscv/mm/init.c   |  41 ++
 arch/s390/kernel/module.c  |  25 --
 arch/s390/mm/init.c|  28 +++
 arch/sparc/kernel/module.c |  23 --
 arch/sparc/mm/Makefile |   2 +
 arch/sparc/mm/execmem.c|  25 ++
 arch/x86/kernel/module.c   |  25 --
 arch/x86/mm/init.c |  27 +++
 23 files changed, 445 insertions(+), 429 deletions(-)
 create mode 100644 arch/sparc/mm/execmem.c

diff --git a/arch/arm/kernel/module.c b/arch/arm/kernel/module.c
index 32974758c73b..677f218f7e84 100644
--- a/arch/arm/kernel/module.c
+++ b/arch/arm/kernel/module.c
@@ -12,54 +12,14 @@
 #include 
 #include 
 #include 
-#include 
 #include 
 #include 
-#include 
-#include 
 
 #include 
 #include 
 #include 
 #include 
 
-#ifdef CONFIG_XIP_KERNEL
-/*
- * The XIP kernel text is mapped in the module area for modules and
- * some other stuff to work without any indirect relocations.
- * MODULES_VADDR is redefined here and not in asm/memory.h to avoid
- * recompiling the whole kernel when CONFIG_XIP_KERNEL is turned on/off.
- */
-#undef MODULES_VADDR
-#define MODULES_VADDR  (((unsigned long)_exiprom + ~PMD_MASK) & PMD_MASK)
-#endif
-
-#ifdef CONFIG_MMU
-static struct execmem_info execmem_info __ro_after_init = {
-   .ranges = {
-   [EXECMEM_DEFAULT] = {
-   .start = MODULES_VADDR,
-   .end = MODULES_END,
-   .alignment = 1,
-   },
-   },
-};
-
-struct execmem_info __init *execmem_arch_setup(void)
-{
-	struct execmem_range *r = &execmem_info.ranges[EXECMEM_DEFAULT];
-
-   r->pgprot = PAGE_KERNEL_EXEC;
-
-   if (IS_ENABLED(CONFIG_ARM_MODULE_PLTS)) {
-   r->fallback_start = VMALLOC_START;
-   r->fallback_end = VMALLOC_END;
-   }
-
-	return &execmem_info;
-}
-#endif
-
 bool module_init_section(const char *name)
 {
return strstarts(name, ".init") ||
diff --git a/arch/arm/mm/init.c b/arch/arm/mm/init.c
index e8c6f4be0ce1..e54338825156 100644
--- a/arch/arm/mm/init.c
+++ b/arch/arm/mm/init.c
@@ -22,6 +22,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -486,3 +487,42 @@ void free_initrd_mem(unsigned long start, unsigned long 
end)
free_reserved_area((void *)start, (void *)end, -1, "initrd");
 }
 #endif
+
+#ifdef CONFIG_EXECMEM
+#ifdef CONFIG_XIP_KERNEL
+/*
+ * The XIP kernel text is mapped in the module area for modules and
+ * some other stuff to work without any indirect relocations.
+ * MODULES_VADDR is redefined here and not in asm/memory.h to avoid
+ * recompiling the whole kernel when CONFIG_XIP_KERNEL is turned on/off.
+ */
+#undef MODULES_VADDR
+#define MODULES_VADDR  (((unsigned long)_exiprom + ~PMD_MASK) & PMD_MASK)
+#endif
+
+#ifdef CONFIG_MMU
+static struct execmem_info execmem_info __ro_after_init = {
+   .ranges = {
+   [EXECMEM_DEFAULT] = {
+   .start = MODULES_VADDR,
+   .end = MODULES_END,
+   .alignment = 1,
+   },
+   },
+};
+
+struct execmem_info __init *execmem_arch_setup(void)
+{
+	struct execmem_range *r = &execmem_info.ranges[EXECMEM_DEFAULT];
+
+   r->pgprot = PAGE_KERNEL_EXEC;
+
+   if (IS_ENABLED(CONFIG_ARM_MODULE_PLTS)) {
+   r->fallback_start = VMALLOC_START;
+   r->fallback_end = VMALLOC_END;
+   }
+
+	return &execmem_info;
+}
+#endif
+#endif /* CONFIG_EXECMEM */
diff --git a/arch/arm64/kernel/module.c b/arch/arm64/kernel/module.c
index aa9e2b3d7459..36b25af56324 100644
--- a/arch/arm64/kernel/module.c
+++ b/arch/arm64/kernel/module.c
@@ -12,154 +12,18 @@
 #include 
 #include 
 #include 
-#include 
 #include 
 #include 
 #include 
 #include 
 #include 
 #include 
-#include 
-#include 
 
 #include 
 #include 
 #include 
 #include 
 
-static u64 module_direct_base __ro_after_init = 0;
-static 

[PATCH v4 10/15] powerpc: extend execmem_params for kprobes allocations

2024-04-11 Thread Mike Rapoport
From: "Mike Rapoport (IBM)" 

powerpc overrides kprobes::alloc_insn_page() to remove writable
permissions when STRICT_MODULE_RWX is on.

Add a definition of EXECMEM_KPROBES to execmem_params to allow using the
generic kprobes::alloc_insn_page() with the desired permissions.

As powerpc uses breakpoint instructions to inject kprobes, it does not
need to constrain kprobe allocations to the modules area and can use the
entire vmalloc address space.

Signed-off-by: Mike Rapoport (IBM) 
---
 arch/powerpc/kernel/kprobes.c | 20 
 arch/powerpc/kernel/module.c  | 11 +++
 2 files changed, 11 insertions(+), 20 deletions(-)

diff --git a/arch/powerpc/kernel/kprobes.c b/arch/powerpc/kernel/kprobes.c
index 9fcd01bb2ce6..14c5ddec3056 100644
--- a/arch/powerpc/kernel/kprobes.c
+++ b/arch/powerpc/kernel/kprobes.c
@@ -126,26 +126,6 @@ kprobe_opcode_t *arch_adjust_kprobe_addr(unsigned long 
addr, unsigned long offse
return (kprobe_opcode_t *)(addr + offset);
 }
 
-void *alloc_insn_page(void)
-{
-   void *page;
-
-   page = execmem_alloc(EXECMEM_KPROBES, PAGE_SIZE);
-   if (!page)
-   return NULL;
-
-   if (strict_module_rwx_enabled()) {
-   int err = set_memory_rox((unsigned long)page, 1);
-
-   if (err)
-   goto error;
-   }
-   return page;
-error:
-   execmem_free(page);
-   return NULL;
-}
-
 int arch_prepare_kprobe(struct kprobe *p)
 {
int ret = 0;
diff --git a/arch/powerpc/kernel/module.c b/arch/powerpc/kernel/module.c
index 5a1d0490c831..a1eaa74f2d41 100644
--- a/arch/powerpc/kernel/module.c
+++ b/arch/powerpc/kernel/module.c
@@ -95,6 +95,9 @@ static struct execmem_info execmem_info __ro_after_init = {
[EXECMEM_DEFAULT] = {
.alignment = 1,
},
+   [EXECMEM_KPROBES] = {
+   .alignment = 1,
+   },
[EXECMEM_MODULE_DATA] = {
.alignment = 1,
},
@@ -137,5 +140,13 @@ struct execmem_info __init *execmem_arch_setup(void)
 
text->pgprot = prot;
 
+	execmem_info.ranges[EXECMEM_KPROBES].start = VMALLOC_START;
+	execmem_info.ranges[EXECMEM_KPROBES].end = VMALLOC_END;
+
+   if (strict_module_rwx_enabled())
+   execmem_info.ranges[EXECMEM_KPROBES].pgprot = PAGE_KERNEL_ROX;
+   else
+   execmem_info.ranges[EXECMEM_KPROBES].pgprot = PAGE_KERNEL_EXEC;
+
	return &execmem_info;
 }
-- 
2.43.0




[PATCH v4 09/15] riscv: extend execmem_params for generated code allocations

2024-04-11 Thread Mike Rapoport
From: "Mike Rapoport (IBM)" 

The memory allocations for kprobes and BPF on RISC-V are not placed in
the modules area and these custom allocations are implemented with
overrides of alloc_insn_page() and bpf_jit_alloc_exec().

Slightly reorder execmem_params initialization to support both 32 and 64
bit variants, define EXECMEM_KPROBES and EXECMEM_BPF ranges in
riscv::execmem_params and drop overrides of alloc_insn_page() and
bpf_jit_alloc_exec().

Signed-off-by: Mike Rapoport (IBM) 
Reviewed-by: Alexandre Ghiti 
---
 arch/riscv/kernel/module.c | 21 -
 arch/riscv/kernel/probes/kprobes.c | 10 --
 arch/riscv/net/bpf_jit_core.c  | 13 -
 3 files changed, 20 insertions(+), 24 deletions(-)

diff --git a/arch/riscv/kernel/module.c b/arch/riscv/kernel/module.c
index ad32e2a8621a..aad158bb2022 100644
--- a/arch/riscv/kernel/module.c
+++ b/arch/riscv/kernel/module.c
@@ -906,20 +906,39 @@ int apply_relocate_add(Elf_Shdr *sechdrs, const char 
*strtab,
return 0;
 }
 
-#if defined(CONFIG_MMU) && defined(CONFIG_64BIT)
+#ifdef CONFIG_MMU
 static struct execmem_info execmem_info __ro_after_init = {
.ranges = {
[EXECMEM_DEFAULT] = {
.pgprot = PAGE_KERNEL,
.alignment = 1,
},
+   [EXECMEM_KPROBES] = {
+   .pgprot = PAGE_KERNEL_READ_EXEC,
+   .alignment = 1,
+   },
+   [EXECMEM_BPF] = {
+   .pgprot = PAGE_KERNEL,
+   .alignment = PAGE_SIZE,
+   },
},
 };
 
 struct execmem_info __init *execmem_arch_setup(void)
 {
+#ifdef CONFIG_64BIT
execmem_info.ranges[EXECMEM_DEFAULT].start = MODULES_VADDR;
execmem_info.ranges[EXECMEM_DEFAULT].end = MODULES_END;
+#else
+   execmem_info.ranges[EXECMEM_DEFAULT].start = VMALLOC_START;
+   execmem_info.ranges[EXECMEM_DEFAULT].end = VMALLOC_END;
+#endif
+
+   execmem_info.ranges[EXECMEM_KPROBES].start = VMALLOC_START;
+   execmem_info.ranges[EXECMEM_KPROBES].end = VMALLOC_END;
+
+   execmem_info.ranges[EXECMEM_BPF].start = BPF_JIT_REGION_START;
+   execmem_info.ranges[EXECMEM_BPF].end = BPF_JIT_REGION_END;
 
	return &execmem_info;
 }
diff --git a/arch/riscv/kernel/probes/kprobes.c 
b/arch/riscv/kernel/probes/kprobes.c
index 2f08c14a933d..e64f2f3064eb 100644
--- a/arch/riscv/kernel/probes/kprobes.c
+++ b/arch/riscv/kernel/probes/kprobes.c
@@ -104,16 +104,6 @@ int __kprobes arch_prepare_kprobe(struct kprobe *p)
return 0;
 }
 
-#ifdef CONFIG_MMU
-void *alloc_insn_page(void)
-{
-   return  __vmalloc_node_range(PAGE_SIZE, 1, VMALLOC_START, VMALLOC_END,
-GFP_KERNEL, PAGE_KERNEL_READ_EXEC,
-VM_FLUSH_RESET_PERMS, NUMA_NO_NODE,
-__builtin_return_address(0));
-}
-#endif
-
 /* install breakpoint in text */
 void __kprobes arch_arm_kprobe(struct kprobe *p)
 {
diff --git a/arch/riscv/net/bpf_jit_core.c b/arch/riscv/net/bpf_jit_core.c
index 6b3acac30c06..e238fdbd5dbc 100644
--- a/arch/riscv/net/bpf_jit_core.c
+++ b/arch/riscv/net/bpf_jit_core.c
@@ -219,19 +219,6 @@ u64 bpf_jit_alloc_exec_limit(void)
return BPF_JIT_REGION_SIZE;
 }
 
-void *bpf_jit_alloc_exec(unsigned long size)
-{
-   return __vmalloc_node_range(size, PAGE_SIZE, BPF_JIT_REGION_START,
-   BPF_JIT_REGION_END, GFP_KERNEL,
-   PAGE_KERNEL, 0, NUMA_NO_NODE,
-   __builtin_return_address(0));
-}
-
-void bpf_jit_free_exec(void *addr)
-{
-   return vfree(addr);
-}
-
 void *bpf_arch_text_copy(void *dst, void *src, size_t len)
 {
int ret;
-- 
2.43.0




[PATCH v4 08/15] arm64: extend execmem_info for generated code allocations

2024-04-11 Thread Mike Rapoport
From: "Mike Rapoport (IBM)" 

The memory allocations for kprobes and BPF on arm64 can be placed
anywhere in vmalloc address space and currently this is implemented with
overrides of alloc_insn_page() and bpf_jit_alloc_exec() in arm64.

Define EXECMEM_KPROBES and EXECMEM_BPF ranges in arm64::execmem_info and
drop overrides of alloc_insn_page() and bpf_jit_alloc_exec().

Signed-off-by: Mike Rapoport (IBM) 
Acked-by: Will Deacon 
---
 arch/arm64/kernel/module.c | 14 ++
 arch/arm64/kernel/probes/kprobes.c |  7 ---
 arch/arm64/net/bpf_jit_comp.c  | 11 ---
 3 files changed, 14 insertions(+), 18 deletions(-)

diff --git a/arch/arm64/kernel/module.c b/arch/arm64/kernel/module.c
index a377a3217cf2..aa9e2b3d7459 100644
--- a/arch/arm64/kernel/module.c
+++ b/arch/arm64/kernel/module.c
@@ -115,6 +115,12 @@ static struct execmem_info execmem_info __ro_after_init = {
[EXECMEM_DEFAULT] = {
.alignment = MODULE_ALIGN,
},
+   [EXECMEM_KPROBES] = {
+   .alignment = 1,
+   },
+   [EXECMEM_BPF] = {
+   .alignment = 1,
+   },
},
 };
 
@@ -143,6 +149,14 @@ struct execmem_info __init *execmem_arch_setup(void)
r->end = module_plt_base + SZ_2G;
}
 
+   execmem_info.ranges[EXECMEM_KPROBES].pgprot = PAGE_KERNEL_ROX;
+   execmem_info.ranges[EXECMEM_KPROBES].start = VMALLOC_START;
+   execmem_info.ranges[EXECMEM_KPROBES].end = VMALLOC_END;
+
+   execmem_info.ranges[EXECMEM_BPF].pgprot = PAGE_KERNEL;
+   execmem_info.ranges[EXECMEM_BPF].start = VMALLOC_START;
+   execmem_info.ranges[EXECMEM_BPF].end = VMALLOC_END;
+
	return &execmem_info;
 }
 
diff --git a/arch/arm64/kernel/probes/kprobes.c 
b/arch/arm64/kernel/probes/kprobes.c
index 327855a11df2..4268678d0e86 100644
--- a/arch/arm64/kernel/probes/kprobes.c
+++ b/arch/arm64/kernel/probes/kprobes.c
@@ -129,13 +129,6 @@ int __kprobes arch_prepare_kprobe(struct kprobe *p)
return 0;
 }
 
-void *alloc_insn_page(void)
-{
-   return __vmalloc_node_range(PAGE_SIZE, 1, VMALLOC_START, VMALLOC_END,
-   GFP_KERNEL, PAGE_KERNEL_ROX, VM_FLUSH_RESET_PERMS,
-   NUMA_NO_NODE, __builtin_return_address(0));
-}
-
 /* arm kprobe: install breakpoint in text */
 void __kprobes arch_arm_kprobe(struct kprobe *p)
 {
diff --git a/arch/arm64/net/bpf_jit_comp.c b/arch/arm64/net/bpf_jit_comp.c
index 122021f9bdfc..456f5af239fc 100644
--- a/arch/arm64/net/bpf_jit_comp.c
+++ b/arch/arm64/net/bpf_jit_comp.c
@@ -1793,17 +1793,6 @@ u64 bpf_jit_alloc_exec_limit(void)
return VMALLOC_END - VMALLOC_START;
 }
 
-void *bpf_jit_alloc_exec(unsigned long size)
-{
-   /* Memory is intended to be executable, reset the pointer tag. */
-   return kasan_reset_tag(vmalloc(size));
-}
-
-void bpf_jit_free_exec(void *addr)
-{
-   return vfree(addr);
-}
-
 /* Indicate the JIT backend supports mixing bpf2bpf and tailcalls. */
 bool bpf_jit_supports_subprog_tailcalls(void)
 {
-- 
2.43.0




[PATCH v4 07/15] mm/execmem, arch: convert remaining overrides of module_alloc to execmem

2024-04-11 Thread Mike Rapoport
From: "Mike Rapoport (IBM)" 

Extend execmem parameters to accommodate more complex overrides of
module_alloc() by architectures.

This includes specification of a fallback range required by arm, arm64
and powerpc, EXECMEM_MODULE_DATA type required by powerpc, support for
allocation of KASAN shadow required by s390 and x86 and support for
early initialization of execmem required by x86.

The core implementation of execmem_alloc() takes care of suppressing
warnings when the initial allocation fails but there is a fallback range
defined.
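
Sketched out, that logic looks roughly like this (a simplified sketch of
the behaviour described above, not the verbatim patch; KASAN handling and
the free path are omitted):

	static void *execmem_alloc_range(struct execmem_range *range, size_t size)
	{
		gfp_t gfp_flags = GFP_KERNEL;
		void *p;

		/* stay quiet on the first try if a fallback range exists */
		if (range->fallback_start)
			gfp_flags |= __GFP_NOWARN;

		p = __vmalloc_node_range(size, range->alignment, range->start,
					 range->end, gfp_flags, range->pgprot,
					 VM_FLUSH_RESET_PERMS, NUMA_NO_NODE,
					 __builtin_return_address(0));
		if (!p && range->fallback_start)
			p = __vmalloc_node_range(size, range->alignment,
						 range->fallback_start,
						 range->fallback_end,
						 GFP_KERNEL, range->pgprot,
						 VM_FLUSH_RESET_PERMS,
						 NUMA_NO_NODE,
						 __builtin_return_address(0));
		return p;
	}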

Signed-off-by: Mike Rapoport (IBM) 
---
 arch/Kconfig |  6 
 arch/arm/kernel/module.c | 38 +++-
 arch/arm64/kernel/module.c   | 49 -
 arch/powerpc/kernel/module.c | 58 ++
 arch/s390/kernel/module.c| 52 +++
 arch/x86/Kconfig |  1 +
 arch/x86/kernel/module.c | 62 ++--
 include/linux/execmem.h  | 34 ++
 include/linux/moduleloader.h | 12 ---
 kernel/module/main.c | 26 --
 mm/execmem.c | 70 +---
 mm/mm_init.c |  2 ++
 12 files changed, 228 insertions(+), 182 deletions(-)

diff --git a/arch/Kconfig b/arch/Kconfig
index 9f066785bb71..bc9e8e5dccd5 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -960,6 +960,12 @@ config ARCH_WANTS_MODULES_DATA_IN_VMALLOC
  For architectures like powerpc/32 which have constraints on module
  allocation and need to allocate module data outside of module area.
 
+config ARCH_WANTS_EXECMEM_EARLY
+   bool
+   help
+ For architectures that might allocate executable memory early on
+ boot, for instance ftrace on x86.
+
 config HAVE_IRQ_EXIT_ON_IRQ_STACK
bool
help
diff --git a/arch/arm/kernel/module.c b/arch/arm/kernel/module.c
index e74d84f58b77..32974758c73b 100644
--- a/arch/arm/kernel/module.c
+++ b/arch/arm/kernel/module.c
@@ -16,6 +16,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -34,23 +35,28 @@
 #endif
 
 #ifdef CONFIG_MMU
-void *module_alloc(unsigned long size)
+static struct execmem_info execmem_info __ro_after_init = {
+   .ranges = {
+   [EXECMEM_DEFAULT] = {
+   .start = MODULES_VADDR,
+   .end = MODULES_END,
+   .alignment = 1,
+   },
+   },
+};
+
+struct execmem_info __init *execmem_arch_setup(void)
 {
-   gfp_t gfp_mask = GFP_KERNEL;
-   void *p;
-
-   /* Silence the initial allocation */
-   if (IS_ENABLED(CONFIG_ARM_MODULE_PLTS))
-   gfp_mask |= __GFP_NOWARN;
-
-   p = __vmalloc_node_range(size, 1, MODULES_VADDR, MODULES_END,
-   gfp_mask, PAGE_KERNEL_EXEC, 0, NUMA_NO_NODE,
-   __builtin_return_address(0));
-   if (!IS_ENABLED(CONFIG_ARM_MODULE_PLTS) || p)
-   return p;
-   return __vmalloc_node_range(size, 1,  VMALLOC_START, VMALLOC_END,
-   GFP_KERNEL, PAGE_KERNEL_EXEC, 0, NUMA_NO_NODE,
-   __builtin_return_address(0));
+	struct execmem_range *r = &execmem_info.ranges[EXECMEM_DEFAULT];
+
+   r->pgprot = PAGE_KERNEL_EXEC;
+
+   if (IS_ENABLED(CONFIG_ARM_MODULE_PLTS)) {
+   r->fallback_start = VMALLOC_START;
+   r->fallback_end = VMALLOC_END;
+   }
+
+	return &execmem_info;
 }
 #endif
 
diff --git a/arch/arm64/kernel/module.c b/arch/arm64/kernel/module.c
index e92da4da1b2a..a377a3217cf2 100644
--- a/arch/arm64/kernel/module.c
+++ b/arch/arm64/kernel/module.c
@@ -20,6 +20,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -108,41 +109,41 @@ static int __init module_init_limits(void)
 
return 0;
 }
-subsys_initcall(module_init_limits);
 
-void *module_alloc(unsigned long size)
+static struct execmem_info execmem_info __ro_after_init = {
+   .ranges = {
+   [EXECMEM_DEFAULT] = {
+   .alignment = MODULE_ALIGN,
+   },
+   },
+};
+
+struct execmem_info __init *execmem_arch_setup(void)
 {
-   void *p = NULL;
+	struct execmem_range *r = &execmem_info.ranges[EXECMEM_DEFAULT];
+
+   module_init_limits();
+
+   r->pgprot = PAGE_KERNEL;
 
/*
 * Where possible, prefer to allocate within direct branch range of the
 * kernel such that no PLTs are necessary.
 */
if (module_direct_base) {
-   p = __vmalloc_node_range(size, MODULE_ALIGN,
-module_direct_base,
-module_direct_base + SZ_128M,
-GFP_KERNEL | __GFP_NOWARN,
-PAGE_KERNEL, 0, NUMA_NO_NODE,
-__builtin_return_address(0));
- 

[PATCH v4 06/15] mm/execmem, arch: convert simple overrides of module_alloc to execmem

2024-04-11 Thread Mike Rapoport
From: "Mike Rapoport (IBM)" 

Several architectures override module_alloc() only to define an address
range for code allocations different from the VMALLOC address space.

Provide a generic implementation in execmem that uses the parameters for
address space ranges, required alignment and page protections provided
by architectures.

The architectures must fill execmem_info structure and implement
execmem_arch_setup() that returns a pointer to that structure. This way the
execmem initialization won't be called from every architecture, but rather
from a central place, namely a core_initcall() in execmem.

execmem provides an execmem_alloc() API that wraps __vmalloc_node_range()
with the parameters defined by the architectures.  If an architecture does
not implement execmem_arch_setup(), execmem_alloc() will fall back to
module_alloc().
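
In outline, the generic path reads like this (a simplified sketch of the
behaviour described above; the module_alloc() fallback and range
validation are omitted):

	void *execmem_alloc(enum execmem_type type, size_t size)
	{
		struct execmem_range *range = &execmem_info->ranges[type];

		return __vmalloc_node_range(size, range->alignment, range->start,
					    range->end, GFP_KERNEL, range->pgprot,
					    VM_FLUSH_RESET_PERMS, NUMA_NO_NODE,
					    __builtin_return_address(0));
	}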

Signed-off-by: Mike Rapoport (IBM) 
---
 arch/loongarch/kernel/module.c | 18 +++--
 arch/mips/kernel/module.c  | 19 +++--
 arch/nios2/kernel/module.c | 19 ++---
 arch/parisc/kernel/module.c| 23 +++
 arch/riscv/kernel/module.c | 21 +++---
 arch/sparc/kernel/module.c | 41 ---
 include/linux/execmem.h| 41 +++
 mm/execmem.c   | 73 --
 8 files changed, 202 insertions(+), 53 deletions(-)

diff --git a/arch/loongarch/kernel/module.c b/arch/loongarch/kernel/module.c
index c7d0338d12c1..78c6a68f6c3c 100644
--- a/arch/loongarch/kernel/module.c
+++ b/arch/loongarch/kernel/module.c
@@ -18,6 +18,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -490,10 +491,21 @@ int apply_relocate_add(Elf_Shdr *sechdrs, const char 
*strtab,
return 0;
 }
 
-void *module_alloc(unsigned long size)
+static struct execmem_info execmem_info __ro_after_init = {
+   .ranges = {
+   [EXECMEM_DEFAULT] = {
+   .pgprot = PAGE_KERNEL,
+   .alignment = 1,
+   },
+   },
+};
+
+struct execmem_info __init *execmem_arch_setup(void)
 {
-   return __vmalloc_node_range(size, 1, MODULES_VADDR, MODULES_END,
-   GFP_KERNEL, PAGE_KERNEL, 0, NUMA_NO_NODE, 
__builtin_return_address(0));
+   execmem_info.ranges[EXECMEM_DEFAULT].start = MODULES_VADDR;
+   execmem_info.ranges[EXECMEM_DEFAULT].end = MODULES_END;
+
+	return &execmem_info;
 }
 
 static void module_init_ftrace_plt(const Elf_Ehdr *hdr,
diff --git a/arch/mips/kernel/module.c b/arch/mips/kernel/module.c
index 9a6c96014904..50505e910763 100644
--- a/arch/mips/kernel/module.c
+++ b/arch/mips/kernel/module.c
@@ -20,6 +20,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 
 struct mips_hi16 {
@@ -32,11 +33,21 @@ static LIST_HEAD(dbe_list);
 static DEFINE_SPINLOCK(dbe_lock);
 
 #ifdef MODULES_VADDR
-void *module_alloc(unsigned long size)
+static struct execmem_info execmem_info __ro_after_init = {
+   .ranges = {
+   [EXECMEM_DEFAULT] = {
+   .start = MODULES_VADDR,
+   .end = MODULES_END,
+   .alignment = 1,
+   },
+   },
+};
+
+struct execmem_info __init *execmem_arch_setup(void)
 {
-   return __vmalloc_node_range(size, 1, MODULES_VADDR, MODULES_END,
-   GFP_KERNEL, PAGE_KERNEL, 0, NUMA_NO_NODE,
-   __builtin_return_address(0));
+   execmem_info.ranges[EXECMEM_DEFAULT].pgprot = PAGE_KERNEL;
+
+	return &execmem_info;
 }
 #endif
 
diff --git a/arch/nios2/kernel/module.c b/arch/nios2/kernel/module.c
index 9c97b7513853..2b68ef8aad42 100644
--- a/arch/nios2/kernel/module.c
+++ b/arch/nios2/kernel/module.c
@@ -18,15 +18,24 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 
-void *module_alloc(unsigned long size)
+static struct execmem_info execmem_info __ro_after_init = {
+   .ranges = {
+   [EXECMEM_DEFAULT] = {
+   .start = MODULES_VADDR,
+   .end = MODULES_END,
+   .pgprot = PAGE_KERNEL_EXEC,
+   .alignment = 1,
+   },
+   },
+};
+
+struct execmem_info __init *execmem_arch_setup(void)
 {
-   return __vmalloc_node_range(size, 1, MODULES_VADDR, MODULES_END,
-   GFP_KERNEL, PAGE_KERNEL_EXEC,
-   VM_FLUSH_RESET_PERMS, NUMA_NO_NODE,
-   __builtin_return_address(0));
+   return &execmem_info;
 }
 
 int apply_relocate_add(Elf32_Shdr *sechdrs, const char *strtab,
diff --git a/arch/parisc/kernel/module.c b/arch/parisc/kernel/module.c
index d214bbe3c2af..721324c42b7d 100644
--- a/arch/parisc/kernel/module.c
+++ b/arch/parisc/kernel/module.c
@@ -49,6 +49,7 @@
 #include 
 #include 
 #include 
+#include <linux/execmem.h>
 
 #include 
 #include 
@@ -173,15 +174,21 @@ static inline int reassemble_22(int as22)
((as22 & 0x0003ff) << 3));
 }
 
-void 

[PATCH v4 05/15] mm: introduce execmem_alloc() and execmem_free()

2024-04-11 Thread Mike Rapoport
From: "Mike Rapoport (IBM)" 

module_alloc() is used everywhere as a means to allocate memory for code.

Besides being semantically wrong, this unnecessarily ties all subsystems
that need to allocate code, such as ftrace, kprobes and BPF, to modules and
puts the burden of code allocation on the modules code.

Several architectures override module_alloc() because of various
constraints where the executable memory can be located and this causes
additional obstacles for improvements of code allocation.

Start splitting code allocation from modules by introducing execmem_alloc()
and execmem_free() APIs.

Initially, execmem_alloc() is a wrapper for module_alloc() and
execmem_free() is a replacement of module_memfree() to allow updating all
call sites to use the new APIs.

Since architectures define different restrictions on placement,
permissions, alignment and other parameters for memory that can be used by
different subsystems that allocate executable memory, execmem_alloc() takes
a type argument that will be used to identify the calling subsystem and to
allow architectures to define parameters for ranges suitable for that
subsystem.
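
Since the new mm/execmem.c only shows up in the diffstat of this excerpt,
a sketch of what this initial stage amounts to: the type argument is
accepted but not yet acted upon, as a later patch wires it up.

    void *execmem_alloc(enum execmem_type type, size_t size)
    {
            /* Initial stage: a thin wrapper, type is ignored for now. */
            return module_alloc(size);
    }

    void execmem_free(void *ptr)
    {
            /* Freeing possibly read-only memory from interrupt context
             * is not supported by vmalloc.
             */
            WARN_ON(in_interrupt());
            vfree(ptr);
    }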

Signed-off-by: Mike Rapoport (IBM) 
---
 arch/powerpc/kernel/kprobes.c|  6 ++--
 arch/s390/kernel/ftrace.c|  4 +--
 arch/s390/kernel/kprobes.c   |  4 +--
 arch/s390/kernel/module.c|  5 +--
 arch/sparc/net/bpf_jit_comp_32.c |  8 ++---
 arch/x86/kernel/ftrace.c |  6 ++--
 arch/x86/kernel/kprobes/core.c   |  4 +--
 include/linux/execmem.h  | 57 
 include/linux/moduleloader.h |  3 --
 kernel/bpf/core.c|  6 ++--
 kernel/kprobes.c |  8 ++---
 kernel/module/Kconfig|  1 +
 kernel/module/main.c | 25 +-
 mm/Kconfig   |  3 ++
 mm/Makefile  |  1 +
 mm/execmem.c | 26 +++
 16 files changed, 122 insertions(+), 45 deletions(-)
 create mode 100644 include/linux/execmem.h
 create mode 100644 mm/execmem.c

diff --git a/arch/powerpc/kernel/kprobes.c b/arch/powerpc/kernel/kprobes.c
index bbca90a5e2ec..9fcd01bb2ce6 100644
--- a/arch/powerpc/kernel/kprobes.c
+++ b/arch/powerpc/kernel/kprobes.c
@@ -19,8 +19,8 @@
 #include 
 #include 
 #include 
-#include <linux/moduleloader.h>
 #include 
+#include <linux/execmem.h>
 #include 
 #include 
 #include 
@@ -130,7 +130,7 @@ void *alloc_insn_page(void)
 {
void *page;
 
-   page = module_alloc(PAGE_SIZE);
+   page = execmem_alloc(EXECMEM_KPROBES, PAGE_SIZE);
if (!page)
return NULL;
 
@@ -142,7 +142,7 @@ void *alloc_insn_page(void)
}
return page;
 error:
-   module_memfree(page);
+   execmem_free(page);
return NULL;
 }
 
diff --git a/arch/s390/kernel/ftrace.c b/arch/s390/kernel/ftrace.c
index c46381ea04ec..798249ef5646 100644
--- a/arch/s390/kernel/ftrace.c
+++ b/arch/s390/kernel/ftrace.c
@@ -7,13 +7,13 @@
  *   Author(s): Martin Schwidefsky 
  */
 
-#include <linux/moduleloader.h>
 #include 
 #include 
 #include 
 #include 
 #include 
 #include 
+#include <linux/execmem.h>
 #include 
 #include 
 #include 
@@ -220,7 +220,7 @@ static int __init ftrace_plt_init(void)
 {
const char *start, *end;
 
-   ftrace_plt = module_alloc(PAGE_SIZE);
+   ftrace_plt = execmem_alloc(EXECMEM_FTRACE, PAGE_SIZE);
if (!ftrace_plt)
panic("cannot allocate ftrace plt\n");
 
diff --git a/arch/s390/kernel/kprobes.c b/arch/s390/kernel/kprobes.c
index f0cf20d4b3c5..3c1b1be744de 100644
--- a/arch/s390/kernel/kprobes.c
+++ b/arch/s390/kernel/kprobes.c
@@ -9,7 +9,6 @@
 
 #define pr_fmt(fmt) "kprobes: " fmt
 
-#include <linux/moduleloader.h>
 #include 
 #include 
 #include 
@@ -21,6 +20,7 @@
 #include 
 #include 
 #include 
+#include <linux/execmem.h>
 #include 
 #include 
 #include 
@@ -38,7 +38,7 @@ void *alloc_insn_page(void)
 {
void *page;
 
-   page = module_alloc(PAGE_SIZE);
+   page = execmem_alloc(EXECMEM_KPROBES, PAGE_SIZE);
if (!page)
return NULL;
set_memory_rox((unsigned long)page, 1);
diff --git a/arch/s390/kernel/module.c b/arch/s390/kernel/module.c
index 42215f9404af..ac97a905e8cd 100644
--- a/arch/s390/kernel/module.c
+++ b/arch/s390/kernel/module.c
@@ -21,6 +21,7 @@
 #include 
 #include 
 #include 
+#include <linux/execmem.h>
 #include 
 #include 
 #include 
@@ -76,7 +77,7 @@ void *module_alloc(unsigned long size)
 #ifdef CONFIG_FUNCTION_TRACER
 void module_arch_cleanup(struct module *mod)
 {
-   module_memfree(mod->arch.trampolines_start);
+   execmem_free(mod->arch.trampolines_start);
 }
 #endif
 
@@ -510,7 +511,7 @@ static int module_alloc_ftrace_hotpatch_trampolines(struct 
module *me,
 
size = FTRACE_HOTPATCH_TRAMPOLINES_SIZE(s->sh_size);
numpages = DIV_ROUND_UP(size, PAGE_SIZE);
-   start = module_alloc(numpages * PAGE_SIZE);
+   start = execmem_alloc(EXECMEM_FTRACE, numpages * PAGE_SIZE);
if (!start)
return -ENOMEM;
set_memory_rox((unsigned long)start, numpages);
diff --git 

[PATCH v4 04/15] module: make module_memory_{alloc,free} more self-contained

2024-04-11 Thread Mike Rapoport
From: "Mike Rapoport (IBM)" 

Move the logic related to the memory allocation and freeing into
module_memory_alloc() and module_memory_free().

Signed-off-by: Mike Rapoport (IBM) 
---
 kernel/module/main.c | 64 +++-
 1 file changed, 39 insertions(+), 25 deletions(-)

diff --git a/kernel/module/main.c b/kernel/module/main.c
index e1e8a7a9d6c1..5b82b069e0d3 100644
--- a/kernel/module/main.c
+++ b/kernel/module/main.c
@@ -1203,15 +1203,44 @@ static bool mod_mem_use_vmalloc(enum mod_mem_type type)
mod_mem_type_is_core_data(type);
 }
 
-static void *module_memory_alloc(unsigned int size, enum mod_mem_type type)
+static int module_memory_alloc(struct module *mod, enum mod_mem_type type)
 {
+   unsigned int size = PAGE_ALIGN(mod->mem[type].size);
+   void *ptr;
+
+   mod->mem[type].size = size;
+
if (mod_mem_use_vmalloc(type))
-   return vzalloc(size);
-   return module_alloc(size);
+   ptr = vmalloc(size);
+   else
+   ptr = module_alloc(size);
+
+   if (!ptr)
+   return -ENOMEM;
+
+   /*
+* The pointer to these blocks of memory are stored on the module
+* structure and we keep that around so long as the module is
+* around. We only free that memory when we unload the module.
+* Just mark them as not being a leak then. The .init* ELF
+* sections *do* get freed after boot so we *could* treat them
+* slightly differently with kmemleak_ignore() and only grey
+* them out as they work as typical memory allocations which
+* *do* eventually get freed, but let's just keep things simple
+* and avoid *any* false positives.
+*/
+   kmemleak_not_leak(ptr);
+
+   memset(ptr, 0, size);
+   mod->mem[type].base = ptr;
+
+   return 0;
 }
 
-static void module_memory_free(void *ptr, enum mod_mem_type type)
+static void module_memory_free(struct module *mod, enum mod_mem_type type)
 {
+   void *ptr = mod->mem[type].base;
+
if (mod_mem_use_vmalloc(type))
vfree(ptr);
else
@@ -1229,12 +1258,12 @@ static void free_mod_mem(struct module *mod)
/* Free lock-classes; relies on the preceding sync_rcu(). */
lockdep_free_key_range(mod_mem->base, mod_mem->size);
if (mod_mem->size)
-   module_memory_free(mod_mem->base, type);
+   module_memory_free(mod, type);
}
 
/* MOD_DATA hosts mod, so free it at last */
lockdep_free_key_range(mod->mem[MOD_DATA].base, 
mod->mem[MOD_DATA].size);
-   module_memory_free(mod->mem[MOD_DATA].base, MOD_DATA);
+   module_memory_free(mod, MOD_DATA);
 }
 
 /* Free a module, remove from lists, etc. */
@@ -2225,7 +2254,6 @@ static int find_module_sections(struct module *mod, 
struct load_info *info)
 static int move_module(struct module *mod, struct load_info *info)
 {
int i;
-   void *ptr;
enum mod_mem_type t = 0;
int ret = -ENOMEM;
 
@@ -2234,26 +2262,12 @@ static int move_module(struct module *mod, struct 
load_info *info)
mod->mem[type].base = NULL;
continue;
}
-   mod->mem[type].size = PAGE_ALIGN(mod->mem[type].size);
-   ptr = module_memory_alloc(mod->mem[type].size, type);
-   /*
- * The pointer to these blocks of memory are stored on the 
module
- * structure and we keep that around so long as the module is
- * around. We only free that memory when we unload the module.
- * Just mark them as not being a leak then. The .init* ELF
- * sections *do* get freed after boot so we *could* treat them
- * slightly differently with kmemleak_ignore() and only grey
- * them out as they work as typical memory allocations which
- * *do* eventually get freed, but let's just keep things simple
- * and avoid *any* false positives.
-*/
-   kmemleak_not_leak(ptr);
-   if (!ptr) {
+
+   ret = module_memory_alloc(mod, type);
+   if (ret) {
t = type;
goto out_enomem;
}
-   memset(ptr, 0, mod->mem[type].size);
-   mod->mem[type].base = ptr;
}
 
/* Transfer each section which specifies SHF_ALLOC */
@@ -2296,7 +2310,7 @@ static int move_module(struct module *mod, struct 
load_info *info)
return 0;
 out_enomem:
for (t--; t >= 0; t--)
-   module_memory_free(mod->mem[t].base, t);
+   module_memory_free(mod, t);
return ret;
 }
 
-- 
2.43.0




[PATCH v4 03/15] nios2: define virtual address space for modules

2024-04-11 Thread Mike Rapoport
From: "Mike Rapoport (IBM)" 

nios2 uses kmalloc() to implement module_alloc() because CALL26/PCREL26
cannot reach all of vmalloc address space.

Define module space as 32MiB below the kernel base and switch nios2 to
use vmalloc for module allocations.
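
As a worked example, assuming the common default of
CONFIG_NIOS2_KERNEL_REGION_BASE = 0xc0000000 (the value is
board-configurable, so this is an assumption):

    MODULES_VADDR = 0xc0000000 - SZ_32M     = 0xbe000000
    MODULES_END   = 0xc0000000 - 1          = 0xbfffffff
    VMALLOC_END   = 0xc0000000 - SZ_32M - 1 = 0xbdffffff

That is, the top 32MiB of the old vmalloc window is carved out for
modules, directly adjacent to the kernel region.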

Suggested-by: Thomas Gleixner 
Acked-by: Dinh Nguyen 
Acked-by: Song Liu 
Signed-off-by: Mike Rapoport (IBM) 
---
 arch/nios2/include/asm/pgtable.h |  5 -
 arch/nios2/kernel/module.c   | 19 ---
 2 files changed, 8 insertions(+), 16 deletions(-)

diff --git a/arch/nios2/include/asm/pgtable.h b/arch/nios2/include/asm/pgtable.h
index d052dfcbe8d3..eab87c6beacb 100644
--- a/arch/nios2/include/asm/pgtable.h
+++ b/arch/nios2/include/asm/pgtable.h
@@ -25,7 +25,10 @@
 #include 
 
 #define VMALLOC_START  CONFIG_NIOS2_KERNEL_MMU_REGION_BASE
-#define VMALLOC_END(CONFIG_NIOS2_KERNEL_REGION_BASE - 1)
+#define VMALLOC_END(CONFIG_NIOS2_KERNEL_REGION_BASE - SZ_32M - 1)
+
+#define MODULES_VADDR  (CONFIG_NIOS2_KERNEL_REGION_BASE - SZ_32M)
+#define MODULES_END(CONFIG_NIOS2_KERNEL_REGION_BASE - 1)
 
 struct mm_struct;
 
diff --git a/arch/nios2/kernel/module.c b/arch/nios2/kernel/module.c
index 76e0a42d6e36..9c97b7513853 100644
--- a/arch/nios2/kernel/module.c
+++ b/arch/nios2/kernel/module.c
@@ -21,23 +21,12 @@
 
 #include 
 
-/*
- * Modules should NOT be allocated with kmalloc for (obvious) reasons.
- * But we do it for now to avoid relocation issues. CALL26/PCREL26 cannot reach
- * from 0x80000000 (vmalloc area) to 0xc0000000 (kernel) (kmalloc returns
- * addresses in 0xc0000000)
- */
 void *module_alloc(unsigned long size)
 {
-   if (size == 0)
-   return NULL;
-   return kmalloc(size, GFP_KERNEL);
-}
-
-/* Free memory returned from module_alloc */
-void module_memfree(void *module_region)
-{
-   kfree(module_region);
+   return __vmalloc_node_range(size, 1, MODULES_VADDR, MODULES_END,
+   GFP_KERNEL, PAGE_KERNEL_EXEC,
+   VM_FLUSH_RESET_PERMS, NUMA_NO_NODE,
+   __builtin_return_address(0));
 }
 
 int apply_relocate_add(Elf32_Shdr *sechdrs, const char *strtab,
-- 
2.43.0




[PATCH v4 02/15] mips: module: rename MODULE_START to MODULES_VADDR

2024-04-11 Thread Mike Rapoport
From: "Mike Rapoport (IBM)" 

and MODULE_END to MODULES_END to match other architectures that define
custom address space for modules.

Signed-off-by: Mike Rapoport (IBM) 
---
 arch/mips/include/asm/pgtable-64.h | 4 ++--
 arch/mips/kernel/module.c  | 4 ++--
 arch/mips/mm/fault.c   | 4 ++--
 3 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/arch/mips/include/asm/pgtable-64.h 
b/arch/mips/include/asm/pgtable-64.h
index 20ca48c1b606..c0109aff223b 100644
--- a/arch/mips/include/asm/pgtable-64.h
+++ b/arch/mips/include/asm/pgtable-64.h
@@ -147,8 +147,8 @@
 #if defined(CONFIG_MODULES) && defined(KBUILD_64BIT_SYM32) && \
VMALLOC_START != CKSSEG
 /* Load modules into 32bit-compatible segment. */
-#define MODULE_START   CKSSEG
-#define MODULE_END (FIXADDR_START-2*PAGE_SIZE)
+#define MODULES_VADDR  CKSSEG
+#define MODULES_END(FIXADDR_START-2*PAGE_SIZE)
 #endif
 
 #define pte_ERROR(e) \
diff --git a/arch/mips/kernel/module.c b/arch/mips/kernel/module.c
index 7b2fbaa9cac5..9a6c96014904 100644
--- a/arch/mips/kernel/module.c
+++ b/arch/mips/kernel/module.c
@@ -31,10 +31,10 @@ struct mips_hi16 {
 static LIST_HEAD(dbe_list);
 static DEFINE_SPINLOCK(dbe_lock);
 
-#ifdef MODULE_START
+#ifdef MODULES_VADDR
 void *module_alloc(unsigned long size)
 {
-   return __vmalloc_node_range(size, 1, MODULE_START, MODULE_END,
+   return __vmalloc_node_range(size, 1, MODULES_VADDR, MODULES_END,
GFP_KERNEL, PAGE_KERNEL, 0, NUMA_NO_NODE,
__builtin_return_address(0));
 }
diff --git a/arch/mips/mm/fault.c b/arch/mips/mm/fault.c
index aaa9a242ebba..37fedeaca2e9 100644
--- a/arch/mips/mm/fault.c
+++ b/arch/mips/mm/fault.c
@@ -83,8 +83,8 @@ static void __do_page_fault(struct pt_regs *regs, unsigned 
long write,
 
if (unlikely(address >= VMALLOC_START && address <= VMALLOC_END))
goto VMALLOC_FAULT_TARGET;
-#ifdef MODULE_START
-   if (unlikely(address >= MODULE_START && address < MODULE_END))
+#ifdef MODULES_VADDR
+   if (unlikely(address >= MODULES_VADDR && address < MODULES_END))
goto VMALLOC_FAULT_TARGET;
 #endif
 
-- 
2.43.0




[PATCH v4 01/15] arm64: module: remove uneeded call to kasan_alloc_module_shadow()

2024-04-11 Thread Mike Rapoport
From: "Mike Rapoport (IBM)" 

Since commit f6f37d9320a1 ("arm64: select KASAN_VMALLOC for SW/HW_TAGS
modes") KASAN_VMALLOC is always enabled when KASAN is on. This means
that allocations in module_alloc() will be tracked by KASAN protection
for vmalloc() and that kasan_alloc_module_shadow() will be always an
empty inline and there is no point in calling it.

Drop meaningless call to kasan_alloc_module_shadow() from
module_alloc().

Signed-off-by: Mike Rapoport (IBM) 
---
 arch/arm64/kernel/module.c | 5 -
 1 file changed, 5 deletions(-)

diff --git a/arch/arm64/kernel/module.c b/arch/arm64/kernel/module.c
index 47e0be610bb6..e92da4da1b2a 100644
--- a/arch/arm64/kernel/module.c
+++ b/arch/arm64/kernel/module.c
@@ -141,11 +141,6 @@ void *module_alloc(unsigned long size)
__func__);
}
 
-   if (p && (kasan_alloc_module_shadow(p, size, GFP_KERNEL) < 0)) {
-   vfree(p);
-   return NULL;
-   }
-
/* Memory is intended to be executable, reset the pointer tag. */
return kasan_reset_tag(p);
 }
-- 
2.43.0




[PATCH v4 00/15] mm: jit/text allocator

2024-04-11 Thread Mike Rapoport
From: "Mike Rapoport (IBM)" 

Hi,

Since v3 I looked into making execmem more of a utility toolbox, as we
discussed at LPC with Mark Rutland, but it was getting hairier than
having a struct describing architecture constraints and a type identifying
the consumer of execmem.

And I do think that having the description of architecture constraints for
allocations of executable memory in a single place is better than having it
spread all over the place.

The patches available via git:
https://git.kernel.org/pub/scm/linux/kernel/git/rppt/linux.git/log/?h=execmem/v4

v4 changes:
* rebase on v6.9-rc2
* rename execmem_params to execmem_info and execmem_arch_params() to
  execmem_arch_setup()
* use single execmem_alloc() API instead of execmem_{text,data}_alloc() (Song)
* avoid extra copy of execmem parameters (Rick)
* run execmem_init() as core_initcall() except for the architectures that
  may allocate text really early (currently only x86) (Will)
* add acks for some of arm64 and riscv changes, thanks Will and Alexandre
* new commits:
  - drop call to kasan_alloc_module_shadow() on arm64 because it's not
needed anymore
  - rename MODULE_START to MODULES_VADDR on MIPS
  - use CONFIG_EXECMEM instead of CONFIG_MODULES on powerpc as per Christophe:
https://lore.kernel.org/all/79062fa3-3402-47b3-8920-9231ad05e...@csgroup.eu/

v3: https://lore.kernel.org/all/20230918072955.2507221-1-r...@kernel.org
* add type parameter to execmem allocation APIs
* remove BPF dependency on modules

v2: https://lore.kernel.org/all/20230616085038.4121892-1-r...@kernel.org
* Separate "module" and "others" allocations with execmem_text_alloc()
and jit_text_alloc()
* Drop ROX entailment on x86
* Add ack for nios2 changes, thanks Dinh Nguyen

v1: https://lore.kernel.org/all/20230601101257.530867-1-r...@kernel.org

= Cover letter from v1 (slightly updated) =

module_alloc() is used everywhere as a means to allocate memory for code.

Besides being semantically wrong, this unnecessarily ties all subsystems
that need to allocate code, such as ftrace, kprobes and BPF, to modules and
puts the burden of code allocation on the modules code.

Several architectures override module_alloc() because of various
constraints where the executable memory can be located and this causes
additional obstacles for improvements of code allocation.

A centralized infrastructure for code allocation allows allocations of
executable memory as ROX, and future optimizations such as caching large
pages for better iTLB performance and providing sub-page allocations for
users that only need small jit code snippets.

Rick Edgecombe proposed perm_alloc extension to vmalloc [1] and Song Liu
proposed execmem_alloc [2], but both these approaches were targeting BPF
allocations and lacked the ground work to abstract executable allocations
and split them from the modules core.

Thomas Gleixner suggested to express module allocation restrictions and
requirements as struct mod_alloc_type_params [3] that would define ranges,
protections and other parameters for different types of allocations used by
modules and following that suggestion Song separated allocations of
different types in modules (commit ac3b43283923 ("module: replace
module_layout with module_memory")) and posted "Type aware module
allocator" set [4].

I liked the idea of parametrising code allocation requirements as a
structure, but I believe the original proposal and Song's module allocator
was too module centric, so I came up with these patches.

This set splits code allocation from modules by introducing execmem_alloc()
and execmem_free() APIs, replaces call sites of module_alloc() and
module_memfree() with the new APIs and implements core text and related
allocations in a central place.

Instead of architecture specific overrides for module_alloc(), the
architectures that require non-default behaviour for text allocation must
fill execmem_info structure and implement execmem_arch_setup() that returns
a pointer to that structure. If an architecture does not implement
execmem_arch_setup(), the defaults compatible with the current
modules::module_alloc() are used.

Since architectures define different restrictions on placement,
permissions, alignment and other parameters for memory that can be used by
different subsystems that allocate executable memory, execmem APIs
take a type argument that will be used to identify the calling subsystem
and to allow architectures to define parameters for ranges suitable for that
subsystem.

The new infrastructure allows decoupling of BPF, kprobes and ftrace from
modules, and most importantly it paves the way for ROX allocations for
executable memory.

[1] 
https://lore.kernel.org/lkml/20201120202426.18009-1-rick.p.edgeco...@intel.com/
[2] https://lore.kernel.org/all/20221107223921.3451913-1-s...@kernel.org/
[3] https://lore.kernel.org/all/87v8mndy3y.ffs@tglx/
[4] https://lore.kernel.org/all/20230526051529.3387103-1-s...@kernel.org


Mike Rapoport (IBM) (15):
  arm64: 

Re: Copying TLS/user register data per perf-sample?

2024-04-11 Thread Beau Belgrave
On Fri, Apr 12, 2024 at 12:55:19AM +0900, Masami Hiramatsu wrote:
> On Wed, 10 Apr 2024 08:35:42 -0700
> Beau Belgrave  wrote:
> 
> > On Wed, Apr 10, 2024 at 10:06:28PM +0900, Masami Hiramatsu wrote:
> > > On Thu, 4 Apr 2024 12:26:41 -0700
> > > Beau Belgrave  wrote:
> > > 
> > > > Hello,
> > > > 
> > > > I'm looking into the possibility of capturing user data that is pointed
> > > > to by a user register (IE: fs/gs for TLS on x86/64) for each sample via
> > > > perf_events.
> > > > 
> > > > I was hoping to find a way to do this similar to PERF_SAMPLE_STACK_USER.
> > > > I think it could even use roughly the same ABI in the perf ring buffer.
> > > > Or it may be possible by some kprobe linked to the perf sample function.
> > > > 
> > > > This would allow a profiler to collect TLS (or other values) on x64. In
> > > > the Open Telemetry profiling SIG [1], we are trying to find a fast way
> > > > to grab a tracing association quickly on a per-thread basis. The team
> > > > at Elastic has a bespoke way to do this [2], however, I'd like to see a
> > > > more general way to achieve this. The folks I've been talking with seem
> > > > open to the idea of just having a TLS value for this we could capture
> > > > upon each sample. We could then just state, Open Telemetry SDKs should
> > > > have a TLS value for span correlation. However, we need a way to sample
> > > > the TLS value(s) when a sampling event is generated.
> > > > 
> > > > Is this already possible via some other means? It'd be great to be able
> > > > to do this directly at the perf_event sample via the ABI or a probe.
> > > > 
> > > 
> > > Have you tried to use uprobes? It should be able to access user-space
> > > registers including fs/gs.
> > > 
> > 
> > We need to get fs/gs during a sample interrupt from perf. If the sample
> > interrupt lands during kernel code (IE: syscall) we would also like to
> > get these TLS values when in process context.
> 
> OK, those are not directly accessible from pt_regs.
> 

Yeah, it's a per-arch thread attribute.
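
To make that concrete, the userspace half being discussed could be as
simple as an initial-exec TLS slot that the SDK updates and a sampler
reads at a fixed offset from the fs/gs base; all names below are
hypothetical illustration, not an existing SDK API:

    #include <stdint.h>

    /* One slot per thread, reachable at a fixed offset from the TLS base. */
    static __thread volatile uint64_t otel_span_id;

    void otel_span_enter(uint64_t span_id)
    {
            otel_span_id = span_id; /* profiler samples this via fsbase */
    }

    void otel_span_exit(void)
    {
            otel_span_id = 0;
    }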

> > 
> > I have some patches into the kernel to make this possible via
> > perf_events that works well, however, I don't want to reinvent the wheel
> > if there is some way to get these via perf samples already.
> 
> I would like to see it. I think it is possible to introduce a helper
> to get a base address of user TLS for probe events, and start supporting
> from x86.
> 

For sure, I'm hoping the patches start the right conversations.

> > 
> > In OTel, we are trying to attribute samples to transactions that are
> > occurring. So the TLS fetch has to be aligned exactly with the sample.
> > You can do this via eBPF when it's available, however, we have
> > environments where eBPF is not available.
> > 
> > It's sounding like to do this properly without eBPF a new feature would
> > be required. If so, I do have some patches I can share in a bit as an
> > RFC.
> 
> It is better to be shared in RFC stage, so that we can discuss it from
> the direction level.
> 

Agree, it could be that having the ability to run a probe on sample may
be a better option. Not sure.

Thanks,
-Beau

> Thank you,
> 
> > 
> > Thanks,
> > -Beau
> > 
> > > Thank you,
> > > 
> > > -- 
> > > Masami Hiramatsu (Google) 
> 
> 
> -- 
> Masami Hiramatsu (Google) 



Re: [PATCH v2] tracing: Add sched_prepare_exec tracepoint

2024-04-11 Thread Google
On Thu, 11 Apr 2024 12:20:57 +0200
Marco Elver  wrote:

> Add "sched_prepare_exec" tracepoint, which is run right after the point
> of no return but before the current task assumes its new exec identity.
> 
> Unlike the tracepoint "sched_process_exec", the "sched_prepare_exec"
> tracepoint runs before flushing the old exec, i.e. while the task still
> has the original state (such as original MM), but when the new exec
> either succeeds or crashes (but never returns to the original exec).
> 
> Being able to trace this event can be helpful in a number of use cases:
> 
>   * allowing tracing eBPF programs access to the original MM on exec,
> before current->mm is replaced;
>   * counting exec in the original task (via perf event);
>   * profiling flush time ("sched_prepare_exec" to "sched_process_exec").
> 
> Example of tracing output:
> 
>  $ cat /sys/kernel/debug/tracing/trace_pipe
> <...>-379  [003] .  179.626921: sched_prepare_exec: 
> interp=/usr/bin/sshd filename=/usr/bin/sshd pid=379 comm=sshd
> <...>-381  [002] .  180.048580: sched_prepare_exec: interp=/bin/bash 
> filename=/bin/bash pid=381 comm=sshd
> <...>-385  [001] .  180.068277: sched_prepare_exec: 
> interp=/usr/bin/tty filename=/usr/bin/tty pid=385 comm=bash
> <...>-389  [006] .  192.020147: sched_prepare_exec: 
> interp=/usr/bin/dmesg filename=/usr/bin/dmesg pid=389 comm=bash
> 
> Signed-off-by: Marco Elver 

Looks good to me.

Reviewed-by: Masami Hiramatsu (Google) 

Thanks,

> ---
> v2:
> * Add more documentation.
> * Also show bprm->interp in trace.
> * Rename to sched_prepare_exec.
> ---
>  fs/exec.c|  8 
>  include/trace/events/sched.h | 35 +++
>  2 files changed, 43 insertions(+)
> 
> diff --git a/fs/exec.c b/fs/exec.c
> index 38bf71cbdf5e..57fee729dd92 100644
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -1268,6 +1268,14 @@ int begin_new_exec(struct linux_binprm * bprm)
>   if (retval)
>   return retval;
>  
> + /*
> +  * This tracepoint marks the point before flushing the old exec where
> +  * the current task is still unchanged, but errors are fatal (point of
> +  * no return). The later "sched_process_exec" tracepoint is called after
> +  * the current task has successfully switched to the new exec.
> +  */
> + trace_sched_prepare_exec(current, bprm);
> +
>   /*
>* Ensure all future errors are fatal.
>*/
> diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
> index dbb01b4b7451..226f47c6939c 100644
> --- a/include/trace/events/sched.h
> +++ b/include/trace/events/sched.h
> @@ -420,6 +420,41 @@ TRACE_EVENT(sched_process_exec,
> __entry->pid, __entry->old_pid)
>  );
>  
> +/**
> + * sched_prepare_exec - called before setting up new exec
> + * @task: pointer to the current task
> + * @bprm: pointer to linux_binprm used for new exec
> + *
> + * Called before flushing the old exec, where @task is still unchanged, but 
> at
> + * the point of no return during switching to the new exec. At the point it 
> is
> + * called the exec will either succeed, or on failure terminate the task. 
> Also
> + * see the "sched_process_exec" tracepoint, which is called right after @task
> + * has successfully switched to the new exec.
> + */
> +TRACE_EVENT(sched_prepare_exec,
> +
> + TP_PROTO(struct task_struct *task, struct linux_binprm *bprm),
> +
> + TP_ARGS(task, bprm),
> +
> + TP_STRUCT__entry(
> + __string(   interp, bprm->interp)
> + __string(   filename,   bprm->filename  )
> + __field(pid_t,  pid )
> + __string(   comm,   task->comm  )
> + ),
> +
> + TP_fast_assign(
> + __assign_str(interp, bprm->interp);
> + __assign_str(filename, bprm->filename);
> + __entry->pid = task->pid;
> + __assign_str(comm, task->comm);
> + ),
> +
> + TP_printk("interp=%s filename=%s pid=%d comm=%s",
> +   __get_str(interp), __get_str(filename),
> +   __entry->pid, __get_str(comm))
> +);
>  
>  #ifdef CONFIG_SCHEDSTATS
>  #define DEFINE_EVENT_SCHEDSTAT DEFINE_EVENT
> -- 
> 2.44.0.478.gd926399ef9-goog
> 


-- 
Masami Hiramatsu (Google) 



Re: Copying TLS/user register data per perf-sample?

2024-04-11 Thread Google
On Wed, 10 Apr 2024 08:35:42 -0700
Beau Belgrave  wrote:

> On Wed, Apr 10, 2024 at 10:06:28PM +0900, Masami Hiramatsu wrote:
> > On Thu, 4 Apr 2024 12:26:41 -0700
> > Beau Belgrave  wrote:
> > 
> > > Hello,
> > > 
> > > I'm looking into the possibility of capturing user data that is pointed
> > > to by a user register (IE: fs/gs for TLS on x86/64) for each sample via
> > > perf_events.
> > > 
> > > I was hoping to find a way to do this similar to PERF_SAMPLE_STACK_USER.
> > > I think it could even use roughly the same ABI in the perf ring buffer.
> > > Or it may be possible by some kprobe linked to the perf sample function.
> > > 
> > > This would allow a profiler to collect TLS (or other values) on x64. In
> > > the Open Telemetry profiling SIG [1], we are trying to find a fast way
> > > to grab a tracing association quickly on a per-thread basis. The team
> > > at Elastic has a bespoke way to do this [2], however, I'd like to see a
> > > more general way to achieve this. The folks I've been talking with seem
> > > open to the idea of just having a TLS value for this we could capture
> > > upon each sample. We could then just state, Open Telemetry SDKs should
> > > have a TLS value for span correlation. However, we need a way to sample
> > > the TLS value(s) when a sampling event is generated.
> > > 
> > > Is this already possible via some other means? It'd be great to be able
> > > to do this directly at the perf_event sample via the ABI or a probe.
> > > 
> > 
> > Have you tried to use uprobes? It should be able to access user-space
> > registers including fs/gs.
> > 
> 
> We need to get fs/gs during a sample interrupt from perf. If the sample
> interrupt lands during kernel code (IE: syscall) we would also like to
> get these TLS values when in process context.

OK, those are not directly accessible from pt_regs.

> 
> I have some patches into the kernel to make this possible via
> perf_events that works well, however, I don't want to reinvent the wheel
> if there is some way to get these via perf samples already.

I would like to see it. I think it is possible to introduce a helper
to get a base address of user TLS for probe events, and start supporting
from x86.

> 
> In OTel, we are trying to attribute samples to transactions that are
> occurring. So the TLS fetch has to be aligned exactly with the sample.
> You can do this via eBPF when it's available, however, we have
> environments where eBPF is not available.
> 
> It's sounding like to do this properly without eBPF a new feature would
> be required. If so, I do have some patches I can share in a bit as an
> RFC.

It is better to be shared in RFC stage, so that we can discuss it from
the direction level.

Thank you,

> 
> Thanks,
> -Beau
> 
> > Thank you,
> > 
> > -- 
> > Masami Hiramatsu (Google) 


-- 
Masami Hiramatsu (Google) 



Re: [PATCH] openrisc: Add support for more module relocations

2024-04-11 Thread Stafford Horne
On Thu, Apr 11, 2024 at 02:12:59PM +0200, Geert Uytterhoeven wrote:
> Hi Stafford,
> 
> On Wed, Apr 10, 2024 at 10:52 PM Stafford Horne  wrote:
> > This patch adds the relocations. Note, we use the old naming R_OR32_*
> > instead of the new naming R_OR1K_* to avoid change, as this header is
> > exported as a user api.
> 
> > --- a/arch/openrisc/include/uapi/asm/elf.h
> > +++ b/arch/openrisc/include/uapi/asm/elf.h
> > @@ -43,6 +43,8 @@
>  #define R_OR32_JUMPTARG 6
> >  #define R_OR32_VTINHERIT 7
> >  #define R_OR32_VTENTRY 8
> > +#define R_OR32_AHI16   35
> > +#define R_OR32_SLO16   39
> 
> Would it make sense to switch to the new names, e.g.
> 
>   #define R_OR1K_NONE 0
> 
> and add definitions for backwards compatibility?
> 
> #define R_OR32_NONE R_OR1K_NONE
> 

Hi Geert,

Actually I had a patch doing this and added all 38 or so relocation definitions.
But I dropped it at the last moment in favor of simplicity.

Let me rework it and add it back.

-Stafford



Re: [PATCH v2] tracing: Add sched_prepare_exec tracepoint

2024-04-11 Thread Kees Cook
On Thu, 11 Apr 2024 12:20:57 +0200, Marco Elver wrote:
> Add "sched_prepare_exec" tracepoint, which is run right after the point
> of no return but before the current task assumes its new exec identity.
> 
> Unlike the tracepoint "sched_process_exec", the "sched_prepare_exec"
> tracepoint runs before flushing the old exec, i.e. while the task still
> has the original state (such as original MM), but when the new exec
> either succeeds or crashes (but never returns to the original exec).
> 
> [...]

Applied to for-next/execve, thanks!

[1/1] tracing: Add sched_prepare_exec tracepoint
  https://git.kernel.org/kees/c/5c5fad46e48c

Take care,

-- 
Kees Cook




Re: Re: [PATCH 2/3] kernel/pid: Remove default pid_max value

2024-04-11 Thread Michal Koutný
Hello.

On Mon, Apr 08, 2024 at 01:29:55PM -0700, Andrew Morton 
 wrote:
> That seems like a large change.

In what sense is it large?

I tried to lookup the code parts that depend on this default and either
add the other patches or mention the impact (that part could be more
thorough) in the commit message.

> It isn't clear why we'd want to merge this patchset.  Does it improve
> anyone's life and if so, how?

- kernel devs who don't care about policy
  - policy should be decided by distros/users, not in kernel

- users who need many threads
  - current default is too low
  - this is one more place to look at when configuring

- users who want to prevent fork-bombs
  - current default is ineffective (too high), false feeling of safety
  - i.e. they should configure appropriate mechanism appropriately


I thought that the first point alone would be convincing and that only
scaling impact might need clarification.

Regards,
Michal



Re: [PATCH v2] tracing: Add sched_prepare_exec tracepoint

2024-04-11 Thread Steven Rostedt
On Thu, 11 Apr 2024 08:15:05 -0700
Kees Cook  wrote:

> This looks good to me. If tracing wants to take it:
> 
> Acked-by: Kees Cook 
> 
> If not, I can take it in my tree if I get a tracing Ack. :)

You can take it.

Acked-by: Steven Rostedt (Google) 

-- Steve



Re: [PATCH] init/main.c: Remove redundant space from saved_command_line

2024-04-11 Thread Yuntao Wang
On Thu, 11 Apr 2024 23:07:45 +0900, Masami Hiramatsu (Google) 
 wrote:

> On Thu, 11 Apr 2024 09:19:32 +0200
> Geert Uytterhoeven  wrote:
> 
> > CC Hiramatsu-san (now for real :-)
> 
> Thanks!
> 
> > 
> > On Thu, Apr 11, 2024 at 6:13 AM Yuntao Wang  wrote:
> > > extra_init_args ends with a space, so when concatenating extra_init_args
> > > to saved_command_line, be sure to remove the extra space.
> 
> Hi Yuntao,
> 
> Hmm, if you want to trim the end space, you should trim extra_init_args
> itself instead of this adjustment. Also, can you share the example?
> 
> Thank you,

At first, I also intended to fix this issue as you suggested. However,
because both extra_command_line and extra_init_args end with a space,
making such a change would require modifications in many places.
That's why I chose this approach instead.

Here are some examples before and after modification:

Before: [0.829179] Kernel command line: 'console=ttyS0 debug -- 
bootconfig_arg1 '
After:  [0.032648] Kernel command line: 'console=ttyS0 debug -- 
bootconfig_arg1'

Before: [0.757217] Kernel command line: 'console=ttyS0 debug -- 
bootconfig_arg1  arg1'
After:  [0.068184] Kernel command line: 'console=ttyS0 debug -- 
bootconfig_arg1 arg1'

In order to make it easier to observe spaces, I added quotes when outputting 
saved_command_line.

Note that the first 'before' ends with a space, and there are two spaces between
'bootconfig_arg1' and 'arg1' in the second 'before'.

> > >
> > > Signed-off-by: Yuntao Wang 
> > > ---
> > >  init/main.c | 4 +++-
> > >  1 file changed, 3 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/init/main.c b/init/main.c
> > > index 2ca52474d0c3..cf2c22aa0e8c 100644
> > > --- a/init/main.c
> > > +++ b/init/main.c
> > > @@ -660,12 +660,14 @@ static void __init setup_command_line(char 
> > > *command_line)
> > > strcpy(saved_command_line + len, extra_init_args);
> > > len += ilen - 4;/* 
> > > strlen(extra_init_args) */
> > > strcpy(saved_command_line + len,
> > > -   boot_command_line + initargs_offs - 1);
> > > +   boot_command_line + initargs_offs);
> > > } else {
> > > len = strlen(saved_command_line);
> > > strcpy(saved_command_line + len, " -- ");
> > > len += 4;
> > > strcpy(saved_command_line + len, extra_init_args);
> > > +   len += ilen - 4; /* strlen(extra_init_args) */
> > > +   saved_command_line[len-1] = '\0'; /* remove 
> > > trailing space */
> > > }
> > > }
> > 
> > Gr{oetje,eeting}s,
> > 
> > Geert
> > 
> > -- 
> > Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- 
> > geert@linux-m68korg
> > 
> > In personal conversations with technical people, I call myself a hacker. But
> > when I'm talking to journalists I just say "programmer" or something like 
> > that.
> > -- Linus Torvalds
> > 
> 
> 
> -- 
> Masami Hiramatsu (Google) 



Re: [PATCH v2] tracing: Add sched_prepare_exec tracepoint

2024-04-11 Thread Kees Cook
On Thu, Apr 11, 2024 at 12:20:57PM +0200, Marco Elver wrote:
> Add "sched_prepare_exec" tracepoint, which is run right after the point
> of no return but before the current task assumes its new exec identity.
> 
> Unlike the tracepoint "sched_process_exec", the "sched_prepare_exec"
> tracepoint runs before flushing the old exec, i.e. while the task still
> has the original state (such as original MM), but when the new exec
> either succeeds or crashes (but never returns to the original exec).
> 
> Being able to trace this event can be helpful in a number of use cases:
> 
>   * allowing tracing eBPF programs access to the original MM on exec,
> before current->mm is replaced;
>   * counting exec in the original task (via perf event);
>   * profiling flush time ("sched_prepare_exec" to "sched_process_exec").
> 
> Example of tracing output:
> 
>  $ cat /sys/kernel/debug/tracing/trace_pipe
> <...>-379  [003] .  179.626921: sched_prepare_exec: 
> interp=/usr/bin/sshd filename=/usr/bin/sshd pid=379 comm=sshd
> <...>-381  [002] .  180.048580: sched_prepare_exec: interp=/bin/bash 
> filename=/bin/bash pid=381 comm=sshd
> <...>-385  [001] .  180.068277: sched_prepare_exec: 
> interp=/usr/bin/tty filename=/usr/bin/tty pid=385 comm=bash
> <...>-389  [006] .  192.020147: sched_prepare_exec: 
> interp=/usr/bin/dmesg filename=/usr/bin/dmesg pid=389 comm=bash
> 
> Signed-off-by: Marco Elver 

This looks good to me. If tracing wants to take it:

Acked-by: Kees Cook 

If not, I can take it in my tree if I get a tracing Ack. :)

-Kees

-- 
Kees Cook



Re: [PATCH 2/5] mfd: add driver for Marvell 88PM886 PMIC

2024-04-11 Thread Karel Balej
Lee Jones, 2024-04-11T12:37:26+01:00:
[...]
> > diff --git a/drivers/mfd/88pm886.c b/drivers/mfd/88pm886.c
> > new file mode 100644
> > index ..e06d418a5da9
> > --- /dev/null
> > +++ b/drivers/mfd/88pm886.c
> > @@ -0,0 +1,157 @@
> > +// SPDX-License-Identifier: GPL-2.0-only
> > +#include 
> > +#include 
> > +#include 
> > +#include 
> > +#include 
> > +#include 
> > +#include 
> > +
> > +#include 
> > +
> > +#define PM886_REG_INT_STATUS1  0x05
> > +
> > +#define PM886_REG_INT_ENA_1 0x0a
> > +#define PM886_INT_ENA1_ONKEY   BIT(0)
> > +
> > +#define PM886_IRQ_ONKEY 0
> > +
> > +#define PM886_REGMAP_CONF_MAX_REG  0xef
>
> Why have you split the defines up between here and the header?

I tried to keep defines tied to the code which uses them and only put
defines needed in multiple places in the header. With the exception of
closely related things, such as register bits which I am keeping
together with the respective register definitions for clarity. Does that
not make sense?

> Please place them all in the header.

Would you then also have me move all the definitions from the regulators
driver there?

[...]

> > +   err = devm_mfd_add_devices(dev, 0, pm886_devs, ARRAY_SIZE(pm886_devs),
>
> Why 0?

PLATFORM_DEVID_AUTO then? Or will PLATFORM_DEVID_NONE suffice since the
cells all have different names now (it would probably cause problems
though if the driver was used multiple times for some reason, wouldn't
it?)?

Thank you,
K. B.



Re: [PATCH] init/main.c: Remove redundant space from saved_command_line

2024-04-11 Thread Google
On Thu, 11 Apr 2024 09:19:32 +0200
Geert Uytterhoeven  wrote:

> CC Hiramatsu-san (now for real :-)

Thanks!

> 
> On Thu, Apr 11, 2024 at 6:13 AM Yuntao Wang  wrote:
> > extra_init_args ends with a space, so when concatenating extra_init_args
> > to saved_command_line, be sure to remove the extra space.

Hi Yuntao,

Hmm, if you want to trim the end space, you should trim extra_init_args
itself instead of this adjustment. Also, can you share the example?

Thank you,

> >
> > Signed-off-by: Yuntao Wang 
> > ---
> >  init/main.c | 4 +++-
> >  1 file changed, 3 insertions(+), 1 deletion(-)
> >
> > diff --git a/init/main.c b/init/main.c
> > index 2ca52474d0c3..cf2c22aa0e8c 100644
> > --- a/init/main.c
> > +++ b/init/main.c
> > @@ -660,12 +660,14 @@ static void __init setup_command_line(char 
> > *command_line)
> > strcpy(saved_command_line + len, extra_init_args);
> > len += ilen - 4;/* strlen(extra_init_args) 
> > */
> > strcpy(saved_command_line + len,
> > -   boot_command_line + initargs_offs - 1);
> > +   boot_command_line + initargs_offs);
> > } else {
> > len = strlen(saved_command_line);
> > strcpy(saved_command_line + len, " -- ");
> > len += 4;
> > strcpy(saved_command_line + len, extra_init_args);
> > +   len += ilen - 4; /* strlen(extra_init_args) */
> > +   saved_command_line[len-1] = '\0'; /* remove 
> > trailing space */
> > }
> > }
> 
> Gr{oetje,eeting}s,
> 
> Geert
> 
> -- 
> Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68korg
> 
> In personal conversations with technical people, I call myself a hacker. But
> when I'm talking to journalists I just say "programmer" or something like 
> that.
> -- Linus Torvalds
> 


-- 
Masami Hiramatsu (Google) 



Re: [PATCH v1 0/5] livepatch: klp-convert tool - Minimal version

2024-04-11 Thread Lukas Hruska
Hi,

> > Do we have anything that blocks klp-convert-mini to be merged, or something 
> > that
> > needs to be fixed?
> 
> there is feedback from Petr and I agree with him that a selftest would be 
> appropriate.
> 
> Lukas, are you planning to send v2 with everything addressed?

Yes, I definitely want to send v2 soon. I am starting to work on it
today and hopefully I will be able to send it next week.

Lukas



Re: [PATCH] [v4] module: don't ignore sysfs_create_link() failures

2024-04-11 Thread Greg Kroah-Hartman
On Mon, Apr 08, 2024 at 09:00:06AM -0700, Luis Chamberlain wrote:
> On Mon, Apr 08, 2024 at 10:05:58AM +0200, Arnd Bergmann wrote:
> > From: Arnd Bergmann 
> > 
> > The sysfs_create_link() return code is marked as __must_check, but the
> > module_add_driver() function tries hard to not care, by assigning the
> > return code to a variable. When building with 'make W=1', gcc still
> > warns because this variable is only assigned but not used:
> > 
> > drivers/base/module.c: In function 'module_add_driver':
> > drivers/base/module.c:36:6: warning: variable 'no_warn' set but not used 
> > [-Wunused-but-set-variable]
> > 
> > Rework the code to properly unwind and return the error code to the
> > caller. My reading of the original code was that it tries to
> > not fail when the links already exist, so keep ignoring -EEXIST
> > errors.
> > 
> > Fixes: e17e0f51aeea ("Driver core: show drivers in /sys/module/")
> > Reviewed-by: Greg Kroah-Hartman 
> 
> Reviewed-by: Luis Chamberlain 

Oh right, I should apply this, sorry about that, will go do that now...



Re: [PATCH] openrisc: Add support for more module relocations

2024-04-11 Thread Geert Uytterhoeven
Hi Stafford,

On Wed, Apr 10, 2024 at 10:52 PM Stafford Horne  wrote:
> This patch adds the relocations. Note, we use the old naming R_OR32_*
> instead of the new naming R_OR1K_* to avoid change, as this header is
> exported as a user api.

> --- a/arch/openrisc/include/uapi/asm/elf.h
> +++ b/arch/openrisc/include/uapi/asm/elf.h
> @@ -43,6 +43,8 @@
>  #define R_OR32_JUMPTARG 6
>  #define R_OR32_VTINHERIT 7
>  #define R_OR32_VTENTRY 8
> +#define R_OR32_AHI16   35
> +#define R_OR32_SLO16   39

Would it make sense to switch to the new names, e.g.

  #define R_OR1K_NONE 0

and add definitions for backwards compatibility?

#define R_OR32_NONE R_OR1K_NONE

Gr{oetje,eeting}s,

Geert

-- 
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- ge...@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds



Re: [PATCH 1/5] dt-bindings: mfd: add entry for Marvell 88PM886 PMIC

2024-04-11 Thread Krzysztof Kozlowski
On 31/03/2024 12:46, Karel Balej wrote:
> Marvell 88PM886 is a PMIC with several subdevices such as onkey,
> regulators or battery and charger. It comes in at least two revisions,
> A0 and A1 -- only A1 is described here at the moment.
> 
> Reviewed-by: Krzysztof Kozlowski 
> Signed-off-by: Karel Balej 
> ---
> 
> Notes:
> RFC v4:
> - Address Krzysztof's comments:
>   - Fix regulators indentation.
>   - Add Krzysztof's trailer.

So you have four versions and suddenly it became v1? No, keep proper
versioning. This is v5. RFC is not a version, but the state of a patchset
that is not yet ready.

Best regards,
Krzysztof




Re: [PATCH 2/5] mfd: add driver for Marvell 88PM886 PMIC

2024-04-11 Thread Lee Jones
On Sun, 31 Mar 2024, Karel Balej wrote:

> Marvell 88PM886 is a PMIC which provides various functions such as
> onkey, battery, charger and regulators. It is found for instance in the
> samsung,coreprimevelte smartphone with which this was tested. Implement
> basic support to allow for the use of regulators and onkey.
> 
> Signed-off-by: Karel Balej 
> ---
> 
> Notes:
> v1:
> - Address Mark's feedback:
>   - Move regmap config back out of the header and rename it. Also lower
> its maximum register based on what's actually used in the downstream
> code.
> RFC v4:
> - Use MFD_CELL_* macros.
> - Address Lee's feedback:
>   - Do not define regmap_config.val_bits and .reg_bits.
>   - Drop everything regulator related except mfd_cell (regmap
> initialization, IDs enum etc.). Drop pm886_initialize_subregmaps.
>   - Do not store regmap pointers as an array as there is now only one
> regmap. Also drop the corresponding enum.
>   - Move regmap_config to the header as it is needed in the regulators
> driver.
>   - pm886_chip.whoami -> chip_id
>   - Reword chip ID mismatch error message and print the ID as
> hexadecimal.
>   - Fix includes in include/linux/88pm886.h.
>   - Drop the pm886_irq_number enum and define the (for the moment) only
> IRQ explicitly.
> - Have only one MFD cell for all regulators as they are now registered
>   all at once in the regulators driver.
> - Reword commit message.
> - Make device table static and remove comma after the sentinel to signal
>   that nothing should come after it.
> RFC v3:
> - Drop onkey cell .of_compatible.
> - Rename LDO page offset and regmap to REGULATORS.
> RFC v2:
> - Remove some abstraction.
> - Sort includes alphabetically and add linux/of.h.
> - Depend on OF, remove of_match_ptr and add MODULE_DEVICE_TABLE.
> - Use more temporaries and break long lines.
> - Do not initialize ret in probe.
> - Use the wakeup-source DT property.
> - Rename ret to err.
> - Address Lee's comments:
>   - Drop patched in presets for base regmap and related defines.
>   - Use full sentences in comments.
>   - Remove IRQ comment.
>   - Define regmap_config member values.
>   - Rename data to sys_off_data.
>   - Add _PMIC suffix to Kconfig.
>   - Use dev_err_probe.
>   - Do not store irq_data.
>   - s/WHOAMI/CHIP_ID
>   - Drop LINUX part of include guard name.
>   - Merge in the regulator series modifications in order to have more
> devices and modify the commit message accordingly. Changes with
> respect to the original regulator series patches:
> - ret -> err
> - Add temporary for dev in pm88x_initialize_subregmaps.
> - Drop of_compatible for the regulators.
> - Do not duplicate LDO regmap for bucks.
> - Rewrite commit message.
> 
>  drivers/mfd/88pm886.c   | 157 
>  drivers/mfd/Kconfig |  12 +++
>  drivers/mfd/Makefile|   1 +
>  include/linux/mfd/88pm886.h |  30 +++
>  4 files changed, 200 insertions(+)
>  create mode 100644 drivers/mfd/88pm886.c
>  create mode 100644 include/linux/mfd/88pm886.h
> 
> diff --git a/drivers/mfd/88pm886.c b/drivers/mfd/88pm886.c
> new file mode 100644
> index ..e06d418a5da9
> --- /dev/null
> +++ b/drivers/mfd/88pm886.c
> @@ -0,0 +1,157 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +
> +#include 
> +
> +#define PM886_REG_INT_STATUS1 0x05
> +
> +#define PM886_REG_INT_ENA_1  0x0a
> +#define PM886_INT_ENA1_ONKEY BIT(0)
> +
> +#define PM886_IRQ_ONKEY  0
> +
> +#define PM886_REGMAP_CONF_MAX_REG 0xef

Why have you split the defines up between here and the header?

Please place them all in the header.

> +static const struct regmap_config pm886_regmap_config = {
> + .reg_bits = 8,
> + .val_bits = 8,
> + .max_register = PM886_REGMAP_CONF_MAX_REG,
> +};
> +
> +static struct regmap_irq pm886_regmap_irqs[] = {
> + REGMAP_IRQ_REG(PM886_IRQ_ONKEY, 0, PM886_INT_ENA1_ONKEY),
> +};
> +
> +static struct regmap_irq_chip pm886_regmap_irq_chip = {
> + .name = "88pm886",
> + .irqs = pm886_regmap_irqs,
> + .num_irqs = ARRAY_SIZE(pm886_regmap_irqs),
> + .num_regs = 4,
> + .status_base = PM886_REG_INT_STATUS1,
> + .ack_base = PM886_REG_INT_STATUS1,
> + .unmask_base = PM886_REG_INT_ENA_1,
> +};
> +
> +static struct resource pm886_onkey_resources[] = {
> + DEFINE_RES_IRQ_NAMED(PM886_IRQ_ONKEY, "88pm886-onkey"),
> +};
> +
> +static struct mfd_cell pm886_devs[] = {
> + MFD_CELL_RES("88pm886-onkey", pm886_onkey_resources),
> + MFD_CELL_NAME("88pm886-regulator"),
> +};
> +
> +static int 

Re: [PATCH v9 5/9] clk: mmp: Add Marvell PXA1908 clock driver

2024-04-11 Thread Duje Mihanović

On 4/11/2024 10:00 AM, Stephen Boyd wrote:

Quoting Duje Mihanović (2024-04-02 13:55:41)

diff --git a/drivers/clk/mmp/clk-of-pxa1908.c b/drivers/clk/mmp/clk-of-pxa1908.c
new file mode 100644
index ..6f1f6e25a718
--- /dev/null
+++ b/drivers/clk/mmp/clk-of-pxa1908.c
@@ -0,0 +1,328 @@
+// SPDX-License-Identifier: GPL-2.0-only

[...]

+static void __init pxa1908_apbc_clk_init(struct device_node *np)
+{
+   struct pxa1908_clk_unit *pxa_unit;
+
+   pxa_unit = kzalloc(sizeof(*pxa_unit), GFP_KERNEL);
+   if (!pxa_unit)
+   return;
+
+   pxa_unit->apbc_base = of_iomap(np, 0);
+   if (!pxa_unit->apbc_base) {
+   pr_err("failed to map apbc registers\n");
+   kfree(pxa_unit);
+   return;
+   }
+
+   mmp_clk_init(np, &pxa_unit->unit, APBC_NR_CLKS);
+
+   pxa1908_apb_periph_clk_init(pxa_unit);
+}
+CLK_OF_DECLARE(pxa1908_apbc, "marvell,pxa1908-apbc", pxa1908_apbc_clk_init);


Is there a reason this file can't be a platform driver?


Not that I know of, I did it like this only because the other in-tree 
MMP clk drivers do so. I guess the initialization should look like any 
of the qcom GCC drivers then?


While at it, do you think the other MMP clk drivers could use a conversion?

Regards,
--
Duje





[PATCH v2] tracing: Add sched_prepare_exec tracepoint

2024-04-11 Thread Marco Elver
Add "sched_prepare_exec" tracepoint, which is run right after the point
of no return but before the current task assumes its new exec identity.

Unlike the tracepoint "sched_process_exec", the "sched_prepare_exec"
tracepoint runs before flushing the old exec, i.e. while the task still
has the original state (such as original MM), but when the new exec
either succeeds or crashes (but never returns to the original exec).

Being able to trace this event can be helpful in a number of use cases:

  * allowing tracing eBPF programs access to the original MM on exec,
before current->mm is replaced;
  * counting exec in the original task (via perf event);
  * profiling flush time ("sched_prepare_exec" to "sched_process_exec").

Example of tracing output:

 $ cat /sys/kernel/debug/tracing/trace_pipe
<...>-379  [003] .  179.626921: sched_prepare_exec: 
interp=/usr/bin/sshd filename=/usr/bin/sshd pid=379 comm=sshd
<...>-381  [002] .  180.048580: sched_prepare_exec: interp=/bin/bash 
filename=/bin/bash pid=381 comm=sshd
<...>-385  [001] .  180.068277: sched_prepare_exec: interp=/usr/bin/tty 
filename=/usr/bin/tty pid=385 comm=bash
<...>-389  [006] .  192.020147: sched_prepare_exec: 
interp=/usr/bin/dmesg filename=/usr/bin/dmesg pid=389 comm=bash
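
For a sense of how built-in kernel consumers could hook this, a sketch of
a probe; register_trace_sched_prepare_exec() is generated by the
TRACE_EVENT() definition below, but the probe body here is purely
illustrative:

    static void probe_prepare_exec(void *data, struct task_struct *task,
                                   struct linux_binprm *bprm)
    {
            pr_info("about to exec %s in pid %d (%s)\n",
                    bprm->filename, task->pid, task->comm);
    }

    static int __init prepare_exec_probe_init(void)
    {
            return register_trace_sched_prepare_exec(probe_prepare_exec, NULL);
    }
    late_initcall(prepare_exec_probe_init);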

Signed-off-by: Marco Elver 
---
v2:
* Add more documentation.
* Also show bprm->interp in trace.
* Rename to sched_prepare_exec.
---
 fs/exec.c|  8 
 include/trace/events/sched.h | 35 +++
 2 files changed, 43 insertions(+)

diff --git a/fs/exec.c b/fs/exec.c
index 38bf71cbdf5e..57fee729dd92 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1268,6 +1268,14 @@ int begin_new_exec(struct linux_binprm * bprm)
if (retval)
return retval;
 
+   /*
+* This tracepoint marks the point before flushing the old exec where
+* the current task is still unchanged, but errors are fatal (point of
+* no return). The later "sched_process_exec" tracepoint is called after
+* the current task has successfully switched to the new exec.
+*/
+   trace_sched_prepare_exec(current, bprm);
+
/*
 * Ensure all future errors are fatal.
 */
diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
index dbb01b4b7451..226f47c6939c 100644
--- a/include/trace/events/sched.h
+++ b/include/trace/events/sched.h
@@ -420,6 +420,41 @@ TRACE_EVENT(sched_process_exec,
  __entry->pid, __entry->old_pid)
 );
 
+/**
+ * sched_prepare_exec - called before setting up new exec
+ * @task:  pointer to the current task
+ * @bprm:  pointer to linux_binprm used for new exec
+ *
+ * Called before flushing the old exec, where @task is still unchanged, but at
+ * the point of no return during switching to the new exec. At the point it is
+ * called the exec will either succeed, or on failure terminate the task. Also
+ * see the "sched_process_exec" tracepoint, which is called right after @task
+ * has successfully switched to the new exec.
+ */
+TRACE_EVENT(sched_prepare_exec,
+
+   TP_PROTO(struct task_struct *task, struct linux_binprm *bprm),
+
+   TP_ARGS(task, bprm),
+
+   TP_STRUCT__entry(
+   __string(   interp, bprm->interp)
+   __string(   filename,   bprm->filename  )
+   __field(pid_t,  pid )
+   __string(   comm,   task->comm  )
+   ),
+
+   TP_fast_assign(
+   __assign_str(interp, bprm->interp);
+   __assign_str(filename, bprm->filename);
+   __entry->pid = task->pid;
+   __assign_str(comm, task->comm);
+   ),
+
+   TP_printk("interp=%s filename=%s pid=%d comm=%s",
+ __get_str(interp), __get_str(filename),
+ __entry->pid, __get_str(comm))
+);
 
 #ifdef CONFIG_SCHEDSTATS
 #define DEFINE_EVENT_SCHEDSTAT DEFINE_EVENT
-- 
2.44.0.478.gd926399ef9-goog




[PATCH net-next v5] net/ipv4: add tracepoint for icmp_send

2024-04-11 Thread xu.xin16
From: hepeilin 
Introduce a tracepoint for icmp_send, which can help users to get more
detail information conveniently when icmp abnormal events happen.

1. Giving an usecase example:
=
When an application experiences packet loss due to an unreachable UDP
destination port, the kernel will send an exception message through the
icmp_send function. By adding a trace point for icmp_send, developers or
system administrators can obtain detailed information about the UDP
packet loss, including the type, code, source address, destination address,
source port, and destination port. This facilitates the troubleshooting
of UDP packet loss issues, especially for network-service
applications.

2. Operation Instructions:
==
Switch to the tracing directory.
cd /sys/kernel/tracing
Filter for destination port unreachable.
echo "type==3 && code==3" > events/icmp/icmp_send/filter
Enable trace event.
echo 1 > events/icmp/icmp_send/enable

3. Result View:

 udp_client_erro-11370   [002] ...s.12   124.728002:
 icmp_send: icmp_send: type=3, code=3.
 From 127.0.0.1:41895 to 127.0.0.1: ulen=23
 skbaddr=589b167a

v4->v5:
Some fixes according to
https://lore.kernel.org/all/CAL+tcoDeXXh+zcRk4PHnUk8ELnx=ce2pccqs7sfm0y9ak-e...@mail.gmail.com/
1.Adjust the position of trace_icmp_send() to before icmp_push_reply().

v3->v4:
Some fixes according to
https://lore.kernel.org/all/CANn89i+EFEr7VHXNdOi59Ba_R1nFKSBJzBzkJFVgCTdXBx=y...@mail.gmail.com/
1.Add legality check for UDP header in SKB.
2.Target this patch for net-next.

Changelog

v2->v3:
Some fixes according to
https://lore.kernel.org/all/20240319102549.7f7f6...@gandalf.local.home/
1. Change the tracing directory to /sys/kernel/tracing.
2. Adjust the layout of the TP-STRUCT_entry parameter structure.

v1->v2:
Some fixes according to
https://lore.kernel.org/all/CANn89iL-y9e_VFpdw=sztrnkru_tnuwqhufqtjvjsv-nz1x...@mail.gmail.com/
1. adjust the trace_icmp_send() to more protocols than UDP.
2. move the calling of trace_icmp_send after sanity checks
in __icmp_send().

Signed-off-by: Peilin He
Reviewed-by: xu xin 
Reviewed-by: Yunkai Zhang 
Cc: Yang Yang 
Cc: Liu Chun 
Cc: Xuexin Jiang 
---
 include/trace/events/icmp.h | 65 +
 net/ipv4/icmp.c |  4 +++
 2 files changed, 69 insertions(+)
 create mode 100644 include/trace/events/icmp.h

diff --git a/include/trace/events/icmp.h b/include/trace/events/icmp.h
new file mode 100644
index 000000000..7d5190f48
--- /dev/null
+++ b/include/trace/events/icmp.h
@@ -0,0 +1,65 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM icmp
+
+#if !defined(_TRACE_ICMP_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_ICMP_H
+
+#include <linux/icmp.h>
+#include <linux/tracepoint.h>
+
+TRACE_EVENT(icmp_send,
+
+   TP_PROTO(const struct sk_buff *skb, int type, int code),
+
+   TP_ARGS(skb, type, code),
+
+   TP_STRUCT__entry(
+   __field(const void *, skbaddr)
+   __field(int, type)
+   __field(int, code)
+   __array(__u8, saddr, 4)
+   __array(__u8, daddr, 4)
+   __field(__u16, sport)
+   __field(__u16, dport)
+   __field(unsigned short, ulen)
+   ),
+
+   TP_fast_assign(
+   struct iphdr *iph = ip_hdr(skb);
+   int proto_4 = iph->protocol;
+   __be32 *p32;
+
+   __entry->skbaddr = skb;
+   __entry->type = type;
+   __entry->code = code;
+
+   struct udphdr *uh = udp_hdr(skb);
+   if (proto_4 != IPPROTO_UDP || (u8 *)uh < skb->head ||
+   (u8 *)uh + sizeof(struct udphdr) > skb_tail_pointer(skb)) {
+   __entry->sport = 0;
+   __entry->dport = 0;
+   __entry->ulen = 0;
+   } else {
+   __entry->sport = ntohs(uh->source);
+   __entry->dport = ntohs(uh->dest);
+   __entry->ulen = ntohs(uh->len);
+   }
+
+   p32 = (__be32 *) __entry->saddr;
+   *p32 = iph->saddr;
+
+   p32 = (__be32 *) __entry->daddr;
+   *p32 = iph->daddr;
+   ),
+
+   TP_printk("icmp_send: type=%d, code=%d. From %pI4:%u to %pI4:%u ulen=%d skbaddr=%p",
+   __entry->type, __entry->code,
+   __entry->saddr, __entry->sport, __entry->daddr,
+   __entry->dport, __entry->ulen, __entry->skbaddr)
+);
+
+#endif /* _TRACE_ICMP_H */
+
+/* This part must be outside protection */
+#include <trace/define_trace.h>

[PATCH net-next v6] virtio_net: Support RX hash XDP hint

2024-04-11 Thread Liang Chen
The RSS hash report is a feature that's part of the virtio specification.
Currently, virtio backends like qemu, vdpa (mlx5), and potentially vhost
(still a work in progress as per [1]) support this feature. While the
capability to obtain the RSS hash has been enabled in the normal path,
it's currently missing in the XDP path. Therefore, we are introducing
XDP hints through kfuncs to allow XDP programs to access the RSS hash.

1.
https://lore.kernel.org/all/20231015141644.260646-1-akihiko.od...@daynix.com/#r
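
For illustration, a minimal XDP consumer of this hint could look like the
sketch below. This is not part of the patch: it assumes a vmlinux.h
generated with bpftool, relies on the existing bpf_xdp_metadata_rx_hash()
kfunc, and requires the program to be loaded dev-bound to a virtio-net
device with NETIF_F_RXHASH enabled; otherwise the kfunc returns an error.

/* read_rx_hash.bpf.c: sketch of reading the RSS hash from an XDP prog */
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

extern int bpf_xdp_metadata_rx_hash(const struct xdp_md *ctx, __u32 *hash,
				    enum xdp_rss_hash_type *rss_type) __ksym;

SEC("xdp")
int read_rx_hash(struct xdp_md *ctx)
{
	enum xdp_rss_hash_type rss_type;
	__u32 hash;

	/* 0 on success; -ENODATA when the device has no hash to report */
	if (!bpf_xdp_metadata_rx_hash(ctx, &hash, &rss_type))
		bpf_printk("rx hash 0x%x type %d", hash, rss_type);

	return XDP_PASS;
}

char _license[] SEC("license") = "GPL";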

Signed-off-by: Liang Chen 
---
  Changes from v5:
- Preservation of the hash value has been dropped, following the conclusion
  from the discussions in the v3 review. The virtio_net driver doesn't
  access or use the virtio_net_hdr after XDP program execution, so nothing
  tragic should happen. As for the XDP program, if it smashes the entry in
  the virtio header, it is likely buggy anyway. Additionally, the Intel IGC
  driver does not bother with this particular aspect either.
---
 drivers/net/virtio_net.c | 55 
 1 file changed, 55 insertions(+)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index c22d1118a133..abd07d479508 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -4621,6 +4621,60 @@ static void virtnet_set_big_packets(struct virtnet_info *vi, const int mtu)
}
 }
 
+static int virtnet_xdp_rx_hash(const struct xdp_md *_ctx, u32 *hash,
+  enum xdp_rss_hash_type *rss_type)
+{
+   const struct xdp_buff *xdp = (void *)_ctx;
+   struct virtio_net_hdr_v1_hash *hdr_hash;
+   struct virtnet_info *vi;
+
+   if (!(xdp->rxq->dev->features & NETIF_F_RXHASH))
+   return -ENODATA;
+
+   vi = netdev_priv(xdp->rxq->dev);
+   hdr_hash = (struct virtio_net_hdr_v1_hash *)(xdp->data - vi->hdr_len);
+
+   switch (__le16_to_cpu(hdr_hash->hash_report)) {
+   case VIRTIO_NET_HASH_REPORT_TCPv4:
+   *rss_type = XDP_RSS_TYPE_L4_IPV4_TCP;
+   break;
+   case VIRTIO_NET_HASH_REPORT_UDPv4:
+   *rss_type = XDP_RSS_TYPE_L4_IPV4_UDP;
+   break;
+   case VIRTIO_NET_HASH_REPORT_TCPv6:
+   *rss_type = XDP_RSS_TYPE_L4_IPV6_TCP;
+   break;
+   case VIRTIO_NET_HASH_REPORT_UDPv6:
+   *rss_type = XDP_RSS_TYPE_L4_IPV6_UDP;
+   break;
+   case VIRTIO_NET_HASH_REPORT_TCPv6_EX:
+   *rss_type = XDP_RSS_TYPE_L4_IPV6_TCP_EX;
+   break;
+   case VIRTIO_NET_HASH_REPORT_UDPv6_EX:
+   *rss_type = XDP_RSS_TYPE_L4_IPV6_UDP_EX;
+   break;
+   case VIRTIO_NET_HASH_REPORT_IPv4:
+   *rss_type = XDP_RSS_TYPE_L3_IPV4;
+   break;
+   case VIRTIO_NET_HASH_REPORT_IPv6:
+   *rss_type = XDP_RSS_TYPE_L3_IPV6;
+   break;
+   case VIRTIO_NET_HASH_REPORT_IPv6_EX:
+   *rss_type = XDP_RSS_TYPE_L3_IPV6_EX;
+   break;
+   case VIRTIO_NET_HASH_REPORT_NONE:
+   default:
+   *rss_type = XDP_RSS_TYPE_NONE;
+   }
+
+   *hash = __le32_to_cpu(hdr_hash->hash_value);
+   return 0;
+}
+
+static const struct xdp_metadata_ops virtnet_xdp_metadata_ops = {
+   .xmo_rx_hash= virtnet_xdp_rx_hash,
+};
+
 static int virtnet_probe(struct virtio_device *vdev)
 {
int i, err = -ENOMEM;
@@ -4747,6 +4801,7 @@ static int virtnet_probe(struct virtio_device *vdev)
  VIRTIO_NET_RSS_HASH_TYPE_UDP_EX);
 
dev->hw_features |= NETIF_F_RXHASH;
+   dev->xdp_metadata_ops = &virtnet_xdp_metadata_ops;
}
 
if (vi->has_rss_hash_report)
-- 
2.40.1




Re: [PATCH] uprobes: reduce contention on uprobes_tree access

2024-04-11 Thread Jonathan Haslam
> > > OK, then I'll push this to for-next at this moment.
> > > Please share if you have a good idea for the batch interface which can be
> > > backported. I guess it should involve updating userspace changes too.
> > 
> > Did you (or anyone else) need anything more from me on this one so that it
> > can be pushed? I provided some benchmark numbers but happy to provide
> > anything else that may be required.
> 
> Yeah, if you can update with the result, it looks better to me.
> Or, can I update the description?

Sure, please feel free to update the description yourself.

Jon.

> 
> Thank you,
> 
> > 
> > Thanks!
> > 
> > Jon.
> > 
> > > 
> > > Thank you!
> > > 
> > > > >
> > > > > So I hope you can reconsider and accept improvements in this patch,
> > > > > while Jonathan will keep working on even better final solution.
> > > > > Thanks!
> > > > >
> > > > > > I look forward to your formalized results :)
> > > > > >
> > > > 
> > > > BTW, as part of BPF selftests, we have a multi-attach test for uprobes
> > > > and USDTs, reporting attach/detach timings:
> > > > $ sudo ./test_progs -v -t uprobe_multi_test/bench
> > > > bpf_testmod.ko is already unloaded.
> > > > Loading bpf_testmod.ko...
> > > > Successfully loaded bpf_testmod.ko.
> > > > test_bench_attach_uprobe:PASS:uprobe_multi_bench__open_and_load 0 nsec
> > > > test_bench_attach_uprobe:PASS:uprobe_multi_bench__attach 0 nsec
> > > > test_bench_attach_uprobe:PASS:uprobes_count 0 nsec
> > > > test_bench_attach_uprobe: attached in   0.120s
> > > > test_bench_attach_uprobe: detached in   0.092s
> > > > #400/5   uprobe_multi_test/bench_uprobe:OK
> > > > test_bench_attach_usdt:PASS:uprobe_multi__open 0 nsec
> > > > test_bench_attach_usdt:PASS:bpf_program__attach_usdt 0 nsec
> > > > test_bench_attach_usdt:PASS:usdt_count 0 nsec
> > > > test_bench_attach_usdt: attached in   0.124s
> > > > test_bench_attach_usdt: detached in   0.064s
> > > > #400/6   uprobe_multi_test/bench_usdt:OK
> > > > #400 uprobe_multi_test:OK
> > > > Summary: 1/2 PASSED, 0 SKIPPED, 0 FAILED
> > > > Successfully unloaded bpf_testmod.ko.
> > > > 
> > > > So it should be easy for Jonathan to validate his changes with this.
> > > > 
> > > > > > Thank you,
> > > > > >
> > > > > > >
> > > > > > > Jon.
> > > > > > >
> > > > > > > >
> > > > > > > > Thank you,
> > > > > > > >
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > BTW, how did you measure the overhead? I think spinlock 
> > > > > > > > > > overhead
> > > > > > > > > > will depend on how much lock contention happens.
> > > > > > > > > >
> > > > > > > > > > Thank you,
> > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > [0] https://docs.kernel.org/locking/spinlocks.html
> > > > > > > > > > >
> > > > > > > > > > > Signed-off-by: Jonathan Haslam 
> > > > > > > > > > > ---
> > > > > > > > > > >  kernel/events/uprobes.c | 22 +++---
> > > > > > > > > > >  1 file changed, 11 insertions(+), 11 deletions(-)
> > > > > > > > > > >
> > > > > > > > > > > diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
> > > > > > > > > > > index 929e98c62965..42bf9b6e8bc0 100644
> > > > > > > > > > > --- a/kernel/events/uprobes.c
> > > > > > > > > > > +++ b/kernel/events/uprobes.c
> > > > > > > > > > > @@ -39,7 +39,7 @@ static struct rb_root uprobes_tree = RB_ROOT;
> > > > > > > > > > >   */
> > > > > > > > > > >  #define no_uprobe_events()   RB_EMPTY_ROOT(&uprobes_tree)
> > > > > > > > > > >
> > > > > > > > > > > -static DEFINE_SPINLOCK(uprobes_treelock);  /* serialize rbtree access */
> > > > > > > > > > > +static DEFINE_RWLOCK(uprobes_treelock);  /* serialize rbtree access */
> > > > > > > > > > >
> > > > > > > > > > >  #define UPROBES_HASH_SZ  13
> > > > > > > > > > >  /* serialize uprobe->pending_list */
> > > > > > > > > > > @@ -669,9 +669,9 @@ static struct uprobe *find_uprobe(struct inode *inode, loff_t offset)
> > > > > > > > > > >  {
> > > > > > > > > > >   struct uprobe *uprobe;
> > > > > > > > > > >
> > > > > > > > > > > - spin_lock(&uprobes_treelock);
> > > > > > > > > > > + read_lock(&uprobes_treelock);
> > > > > > > > > > >   uprobe = __find_uprobe(inode, offset);
> > > > > > > > > > > - spin_unlock(&uprobes_treelock);
> > > > > > > > > > > + read_unlock(&uprobes_treelock);
> > > > > > > > > > >
> > > > > > > > > > >   return uprobe;
> > > > > > > > > > >  }
> > > > > > > > > > > @@ -701,9 +701,9 @@ static struct uprobe *insert_uprobe(struct uprobe *uprobe)
> > > > > > > > > > >  {
> > > > > > > > > > >   struct uprobe *u;
> > > > > > > > > > >
> > > > > > > > > > > - spin_lock(&uprobes_treelock);
> > > > > > > > > > > + write_lock(&uprobes_treelock);
> > > > > > > > > > >   u = __insert_uprobe(uprobe);
> > > > > > > > > > > - spin_unlock(&uprobes_treelock);
> > > > > > > > > > > + write_unlock(&uprobes_treelock);
> > > > > > > > > > >
> > > > > > > > > > >   return u;
> 

Re: [PATCH v9 5/9] clk: mmp: Add Marvell PXA1908 clock driver

2024-04-11 Thread Stephen Boyd
Quoting Duje Mihanović (2024-04-02 13:55:41)
> diff --git a/drivers/clk/mmp/clk-of-pxa1908.c b/drivers/clk/mmp/clk-of-pxa1908.c
> new file mode 100644
> index ..6f1f6e25a718
> --- /dev/null
> +++ b/drivers/clk/mmp/clk-of-pxa1908.c
> @@ -0,0 +1,328 @@
> +// SPDX-License-Identifier: GPL-2.0-only
[...]
> +static void __init pxa1908_apbc_clk_init(struct device_node *np)
> +{
> +   struct pxa1908_clk_unit *pxa_unit;
> +
> +   pxa_unit = kzalloc(sizeof(*pxa_unit), GFP_KERNEL);
> +   if (!pxa_unit)
> +   return;
> +
> +   pxa_unit->apbc_base = of_iomap(np, 0);
> +   if (!pxa_unit->apbc_base) {
> +   pr_err("failed to map apbc registers\n");
> +   kfree(pxa_unit);
> +   return;
> +   }
> +
> +   mmp_clk_init(np, &pxa_unit->unit, APBC_NR_CLKS);
> +
> +   pxa1908_apb_periph_clk_init(pxa_unit);
> +}
> +CLK_OF_DECLARE(pxa1908_apbc, "marvell,pxa1908-apbc", pxa1908_apbc_clk_init);

Is there a reason this file can't be a platform driver?
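
For what it's worth, the platform-driver shape would be roughly the sketch
below: illustrative only, error handling abbreviated, and assuming nothing
needs these clocks earlier than device probe time.

static int pxa1908_apbc_probe(struct platform_device *pdev)
{
	struct pxa1908_clk_unit *pxa_unit;

	pxa_unit = devm_kzalloc(&pdev->dev, sizeof(*pxa_unit), GFP_KERNEL);
	if (!pxa_unit)
		return -ENOMEM;

	pxa_unit->apbc_base = devm_platform_ioremap_resource(pdev, 0);
	if (IS_ERR(pxa_unit->apbc_base))
		return PTR_ERR(pxa_unit->apbc_base);

	mmp_clk_init(pdev->dev.of_node, &pxa_unit->unit, APBC_NR_CLKS);
	pxa1908_apb_periph_clk_init(pxa_unit);

	return 0;
}

static const struct of_device_id pxa1908_apbc_match_table[] = {
	{ .compatible = "marvell,pxa1908-apbc" },
	{ }
};

static struct platform_driver pxa1908_apbc_driver = {
	.probe = pxa1908_apbc_probe,
	.driver = {
		.name = "pxa1908-apbc",
		.of_match_table = pxa1908_apbc_match_table,
	},
};
builtin_platform_driver(pxa1908_apbc_driver);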



Re: [PATCH v9 4/9] dt-bindings: clock: Add Marvell PXA1908 clock bindings

2024-04-11 Thread Stephen Boyd
Quoting Duje Mihanović (2024-04-02 13:55:40)
> Add dt bindings and documentation for the Marvell PXA1908 clock
> controller.
> 
> Reviewed-by: Conor Dooley 
> Signed-off-by: Duje Mihanović 
> ---

Reviewed-by: Stephen Boyd 



Re: [PATCH v9 1/9] clk: mmp: Switch to use struct u32_fract instead of custom one

2024-04-11 Thread Stephen Boyd
Quoting Duje Mihanović (2024-04-02 13:55:37)
> From: Andy Shevchenko 
> 
> The struct mmp_clk_factor_tbl repeats the generic struct u32_fract.
> Kill the custom one and use the generic one instead.
> 
> Signed-off-by: Andy Shevchenko 
> Tested-by: Duje Mihanović 
> Reviewed-by: Linus Walleij 
> Signed-off-by: Duje Mihanović 
> ---

Reviewed-by: Stephen Boyd 

> 
> diff --git a/drivers/clk/mmp/clk.h b/drivers/clk/mmp/clk.h
> index 55ac05379781..c83cec169ddc 100644
> --- a/drivers/clk/mmp/clk.h
> +++ b/drivers/clk/mmp/clk.h
> @@ -3,6 +3,7 @@
>  #define __MACH_MMP_CLK_H
>  
>  #include <linux/clkdev.h>
> +#include <linux/math.h>
>  #include <linux/pm_domain.h>
>  #include <linux/spinlock.h>
> 

This clkdev include should be dropped in another patch.
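
i.e. a follow-up along these lines (a sketch, assuming nothing in clk.h
still relies on it):

--- a/drivers/clk/mmp/clk.h
+++ b/drivers/clk/mmp/clk.h
@@
 #define __MACH_MMP_CLK_H
 
-#include <linux/clkdev.h>
 #include <linux/math.h>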



Re: [PATCH 1/2] dt-bindings: remoteproc: mediatek: Support MT8188 dual-core SCP

2024-04-11 Thread AngeloGioacchino Del Regno

On 11/04/24 09:34, AngeloGioacchino Del Regno wrote:

On 11/04/24 05:37, olivia.wen wrote:

Depending on the application, the MT8188 SCP can be used as single-core
or dual-core.

Signed-off-by: olivia.wen 
---
  Documentation/devicetree/bindings/remoteproc/mtk,scp.yaml | 3 ++-
  1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/Documentation/devicetree/bindings/remoteproc/mtk,scp.yaml 
b/Documentation/devicetree/bindings/remoteproc/mtk,scp.yaml

index 507f98f..7e7b567 100644
--- a/Documentation/devicetree/bindings/remoteproc/mtk,scp.yaml
+++ b/Documentation/devicetree/bindings/remoteproc/mtk,scp.yaml
@@ -22,7 +22,7 @@ properties:
    - mediatek,mt8192-scp
    - mediatek,mt8195-scp
    - mediatek,mt8195-scp-dual
-


Don't remove the blank line, it's there for readability.


+  - mediatek,mt8188-scp-dual


Ah, sorry, one more comment. Please, keep the entries ordered by name.
8188 goes before 8195.



After addressing that comment,

Reviewed-by: AngeloGioacchino Del Regno 



    reg:
  description:
    Should contain the address ranges for memory regions SRAM, CFG, and,
@@ -195,6 +195,7 @@ allOf:
  compatible:
    enum:
  - mediatek,mt8195-scp-dual
+    - mediatek,mt8188-scp-dual


same here.


  then:
    properties:
  reg:








Re: [PATCH 1/2] dt-bindings: remoteproc: mediatek: Support MT8188 dual-core SCP

2024-04-11 Thread AngeloGioacchino Del Regno

On 11/04/24 05:37, olivia.wen wrote:

Depending on the application, the MT8188 SCP can be used as single-core
or dual-core.

Signed-off-by: olivia.wen 
---
  Documentation/devicetree/bindings/remoteproc/mtk,scp.yaml | 3 ++-
  1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/Documentation/devicetree/bindings/remoteproc/mtk,scp.yaml 
b/Documentation/devicetree/bindings/remoteproc/mtk,scp.yaml
index 507f98f..7e7b567 100644
--- a/Documentation/devicetree/bindings/remoteproc/mtk,scp.yaml
+++ b/Documentation/devicetree/bindings/remoteproc/mtk,scp.yaml
@@ -22,7 +22,7 @@ properties:
- mediatek,mt8192-scp
- mediatek,mt8195-scp
- mediatek,mt8195-scp-dual
-


Don't remove the blank line, it's there for readability.


+  - mediatek,mt8188-scp-dual


After addressing that comment,

Reviewed-by: AngeloGioacchino Del Regno 



reg:
  description:
Should contain the address ranges for memory regions SRAM, CFG, and,
@@ -195,6 +195,7 @@ allOf:
  compatible:
enum:
  - mediatek,mt8195-scp-dual
+- mediatek,mt8188-scp-dual
  then:
properties:
  reg:






Re: [PATCH 2/2] remoteproc: mediatek: Support MT8188 SCP core 1

2024-04-11 Thread AngeloGioacchino Del Regno

On 11/04/24 05:37, olivia.wen wrote:

Support MT8188 SCP core 1 for the ISP driver.
The SCP on different chips will require different code sizes
and IPI buffer sizes based on varying requirements.
Signed-off-by: olivia.wen 
---
  drivers/remoteproc/mtk_common.h|  5 +--
  drivers/remoteproc/mtk_scp.c   | 62 +++---
  drivers/remoteproc/mtk_scp_ipi.c   |  9 --
  include/linux/remoteproc/mtk_scp.h |  1 +
  4 files changed, 62 insertions(+), 15 deletions(-)

diff --git a/drivers/remoteproc/mtk_common.h b/drivers/remoteproc/mtk_common.h
index 6d7736a..8f37f65 100644
--- a/drivers/remoteproc/mtk_common.h
+++ b/drivers/remoteproc/mtk_common.h
@@ -78,7 +78,6 @@
  #define MT8195_L2TCM_OFFSET   0x850d0
  
  #define SCP_FW_VER_LEN			32

-#define SCP_SHARE_BUFFER_SIZE  288
  
  struct scp_run {

u32 signaled;
@@ -110,6 +109,8 @@ struct mtk_scp_of_data {
u32 host_to_scp_int_bit;
  
  	size_t ipi_buf_offset;

+   u32 ipi_buffer_size;


this should be `ipi_share_buf_size`


+   u32 max_code_size;


max_code_size should probably be dram_code_size or max_dram_size or dram_size.

Also, both should be size_t, not u32.


  };
  
  struct mtk_scp_of_cluster {

@@ -162,7 +163,7 @@ struct mtk_scp {
  struct mtk_share_obj {
u32 id;
u32 len;
-   u8 share_buf[SCP_SHARE_BUFFER_SIZE];
+   u8 *share_buf;
  };
  
  void scp_memcpy_aligned(void __iomem *dst, const void *src, unsigned int len);

diff --git a/drivers/remoteproc/mtk_scp.c b/drivers/remoteproc/mtk_scp.c
index 6751829..270718d 100644
--- a/drivers/remoteproc/mtk_scp.c
+++ b/drivers/remoteproc/mtk_scp.c
@@ -20,7 +20,6 @@
  #include "mtk_common.h"
  #include "remoteproc_internal.h"
  
-#define MAX_CODE_SIZE 0x50

  #define SECTION_NAME_IPI_BUFFER ".ipi_buffer"
  
  /**

@@ -94,14 +93,14 @@ static void scp_ipi_handler(struct mtk_scp *scp)
  {
struct mtk_share_obj __iomem *rcv_obj = scp->recv_buf;
struct scp_ipi_desc *ipi_desc = scp->ipi_desc;
-   u8 tmp_data[SCP_SHARE_BUFFER_SIZE];
+   u8 *tmp_data;
scp_ipi_handler_t handler;
u32 id = readl(_obj->id);
u32 len = readl(_obj->len);
  
-	if (len > SCP_SHARE_BUFFER_SIZE) {

+   if (len > scp->data->ipi_buffer_size) {
dev_err(scp->dev, "ipi message too long (len %d, max %d)", len,
-   SCP_SHARE_BUFFER_SIZE);
+   scp->data->ipi_buffer_size);
return;
}
if (id >= SCP_IPI_MAX) {
@@ -109,6 +108,10 @@ static void scp_ipi_handler(struct mtk_scp *scp)
return;
}
  
+	tmp_data = kzalloc(len, GFP_KERNEL);


I think this will impact performance a bit, especially if
scp_ipi_handler() gets called frequently (and also remember that this
is in interrupt context).

For best performance, you should allocate this at probe time (in struct mtk_scp
or somewhere else), then:

len = ipi message length
memset zero the tmp_data from len to ipi_buffer_size

memcpy_fromio() etc
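
Concretely, something like this sketch (the field names here are
assumptions, not a tested patch):

/* at probe time, once scp->data is known: allocate the bounce buffer once */
scp->ipi_tmp_buf = devm_kzalloc(scp->dev, scp->data->ipi_share_buf_size,
				GFP_KERNEL);
if (!scp->ipi_tmp_buf)
	return -ENOMEM;

/* in scp_ipi_handler(), instead of a kzalloc()/kfree() per interrupt: */
memcpy_fromio(scp->ipi_tmp_buf, rcv_obj->share_buf, len);
/* zero only the unused tail so stale data never reaches the handler */
memset(scp->ipi_tmp_buf + len, 0, scp->data->ipi_share_buf_size - len);
handler(scp->ipi_tmp_buf, len, ipi_desc[id].priv);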


+   if (!tmp_data)
+   return;
+
scp_ipi_lock(scp, id);
handler = ipi_desc[id].handler;
if (!handler) {
@@ -123,6 +126,7 @@ static void scp_ipi_handler(struct mtk_scp *scp)
  
  	scp->ipi_id_ack[id] = true;

wake_up(&scp->ack_wq);
+   kfree(tmp_data);


There's a possible memory leak. You forgot to kfree in the NULL handler path.


  }
  
  static int scp_elf_read_ipi_buf_addr(struct mtk_scp *scp,

@@ -133,6 +137,7 @@ static int scp_ipi_init(struct mtk_scp *scp, const struct firmware *fw)
  {
int ret;
size_t buf_sz, offset;
+   size_t share_buf_offset;
  
  	/* read the ipi buf addr from FW itself first */

ret = scp_elf_read_ipi_buf_addr(scp, fw, &offset);
@@ -154,10 +159,12 @@ static int scp_ipi_init(struct mtk_scp *scp, const struct firmware *fw)
  
  	scp->recv_buf = (struct mtk_share_obj __iomem *)

(scp->sram_base + offset);
+   share_buf_offset = sizeof(scp->recv_buf->id)
+   + sizeof(scp->recv_buf->len) + scp->data->ipi_buffer_size;
scp->send_buf = (struct mtk_share_obj __iomem *)
-   (scp->sram_base + offset + sizeof(*scp->recv_buf));
-   memset_io(scp->recv_buf, 0, sizeof(*scp->recv_buf));
-   memset_io(scp->send_buf, 0, sizeof(*scp->send_buf));
+   (scp->sram_base + offset + share_buf_offset);
+   memset_io(scp->recv_buf, 0, share_buf_offset);
+   memset_io(scp->send_buf, 0, share_buf_offset);
  
  	return 0;

  }
@@ -891,7 +898,7 @@ static int scp_map_memory_region(struct mtk_scp *scp)
}
  
  	/* Reserved SCP code size */

-   scp->dram_size = MAX_CODE_SIZE;
+   scp->dram_size = scp->data->max_code_size;


Remove the dram_size member from struct mtk_scp and use max_code_size directly.


scp->cpu_addr = dma_alloc_coherent(scp->dev, scp->dram_size,

Re: Re: Re: Subject: [PATCH net-next v4] net/ipv4: add tracepoint for icmp_send

2024-04-11 Thread Peilin He
>> >> >[...]
>> >> >> >I think my understanding based on what Eric depicted differs from you:
>> >> >> >we're supposed to filter out those many invalid cases and only trace
>> >> >> >the valid action of sending a icmp, so where to add a new tracepoint
>> >> >> >is important instead of adding more checks in the tracepoint itself.
>> >> >> >Please refer to what trace_tcp_retransmit_skb() does :)
>> >> >> >
>> >> >> >Thanks,
>> >> >> >Jason
>> >> >> Okay, thank you for your suggestion. In order to avoid filtering out
>> >> >> those many invalid cases and only tracing the valid action of sending
>> >> >> a icmp, the next patch will add udd_fail_no_port trancepoint to the
>> >> >> include/trace/events/udp.h. This will solve the problem you mentioned
>> >> >> very well. At this point, only UDP protocol exceptions will be tracked,
>> >> >> without the need to track them in icmp_send.
>> >> >
>> >> >I'm not against what you did (tracing all the icmp_send() for UDP) in
>> >> >your original patch. I was suggesting that you could put
>> >> >trace_icmp_send() in the right place, then you don't have to check the
>> >> >possible error condition (like if the skb->head is valid or not, ...)
>> >> >in your trace function.
>> >> >
>> >> >One example that can avoid various checks existing in the
>> >> >__icmp_send() function:
>> >> >diff --git a/net/ipv4/icmp.c b/net/ipv4/icmp.c
>> >> >index e63a3bf99617..2c9f7364de45 100644
>> >> >--- a/net/ipv4/icmp.c
>> >> >+++ b/net/ipv4/icmp.c
>> >> >@@ -767,6 +767,7 @@ void __icmp_send(struct sk_buff *skb_in, int type, int code, __be32 info,
>> >> >        if (!fl4.saddr)
>> >> >                fl4.saddr = htonl(INADDR_DUMMY);
>> >> >
>> >> >+       trace_icmp_send(skb_in, type, code);
>> >> >        icmp_push_reply(sk, &icmp_param, &fl4, &ipc, &rt);
>> >> >ende:
>> >> >        ip_rt_put(rt);
>> >> >
>> >> >If we go here, it means we are ready to send the ICMP skb because
>> >> >we're done extracting the right information in the 'struct sk_buff
>> >> >skb_in'. Simpler and easier, right?
>> >> >
>> >> >Thanks,
>> >> >Jason
>> >>
>> >> I may not fully agree with this viewpoint. When trace_icmp_send is placed
>> >> in this position, it cannot guarantee that all skbs in icmp are UDP protocols
>> >> (UDP needs to be distinguished based on the proto_4 != IPPROTO_UDP condition),
>> >> nor can it guarantee the legitimacy of udphdr (*uh legitimacy check is required).
>> >
>> >Of course, the UDP test statement is absolutely needed! Eric
>> >previously pointed this out in the V1 patch thread. I'm not referring
>> >to this one but like skb->head check something like this which exists
>> >in __icmp_send() function. You can see there are so many checks in it
>> >before sending.
>> >
>> >So only keeping the UDP check is enough, I think.
>>
>> The __icmp_send function only checks the IP header, but does not check
>> the UDP header, as shown in the following code snippet:
>>
>> if ((u8 *)iph < skb_in->head ||
>>     (skb_network_header(skb_in) + sizeof(*iph)) >
>>     skb_tail_pointer(skb_in))
>>         goto out;
>>
>> There is no problem with the IP header check, which does not mean that
>> the UDP header is correct. Therefore, I believe that it is essential to
>> include a legitimacy judgment for the UDP header.
>>
>> Here is an explanation of this code:
>> Firstly, the UDP header (*uh) is extracted from the skb.
>> Then, if the current protocol of the skb is not UDP, or if the address of
>> uh is outside the range of the skb, the source port and destination port
>> will not be resolved, and 0 will be filled in directly. Otherwise,
>> the source port and destination port of the UDP header will be resolved.
>>
>> +       struct udphdr *uh = udp_hdr(skb);
>> +       if (proto_4 != IPPROTO_UDP || (u8 *)uh < skb->head ||
>> +           (u8 *)uh + sizeof(struct udphdr) > skb_tail_pointer(skb)) {
>
>From the beginning, I always agree with the UDP check. I was saying if
>you can put the trace_icmp_send() just before icmp_push_reply()[1],
>you could avoid those kinds of checks.
>As I said in the previous email, "only keeping the UDP check is
>enough". So you are right.
>
>[1]
>diff --git a/net/ipv4/icmp.c b/net/ipv4/icmp.c
>index e63a3bf99617..2c9f7364de45 100644
>--- a/net/ipv4/icmp.c
>+++ b/net/ipv4/icmp.c
>@@ -767,6 +767,7 @@ void __icmp_send(struct sk_buff *skb_in, int type, int code, __be32 info,
>        if (!fl4.saddr)
>                fl4.saddr = htonl(INADDR_DUMMY);
>
>+       trace_icmp_send(skb_in, type, code);
>        icmp_push_reply(sk, &icmp_param, &fl4, &ipc, &rt);
>ende:
>        ip_rt_put(rt);
>
>If we're doing this, trace_icmp_send() can reflect the real action of
>sending an ICMP like trace_tcp_retransmit_skb(). Or else, the trace
>could print some messages but no real ICMP is sent (see those error
>checks). WDYT?
>
>Thanks,
>Jason

Yeah, placing trace_icmp_send() before icmp_push_reply() will ensure
that tracking starts when ICMP 

[PATCH] arm64: dts: qcom: qcm6490-fairphone-fp5: Add USB-C orientation GPIO

2024-04-11 Thread Luca Weiss
Define the USB-C orientation GPIOs so that the USB-C ports' orientation
is known without having to resort to the altmode notifications.

On the PCB level this is the signal from PM7250B (pin CC_OUT), which is
called USB_PHY_PS.

Signed-off-by: Luca Weiss 
---
Depends on (for bindings): 
https://lore.kernel.org/linux-arm-msm/20240409-hdk-orientation-gpios-v2-0-658efd993...@linaro.org/
---
 arch/arm64/boot/dts/qcom/qcm6490-fairphone-fp5.dts | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/arm64/boot/dts/qcom/qcm6490-fairphone-fp5.dts 
b/arch/arm64/boot/dts/qcom/qcm6490-fairphone-fp5.dts
index 4ff9fc24e50e..f3432701945f 100644
--- a/arch/arm64/boot/dts/qcom/qcm6490-fairphone-fp5.dts
+++ b/arch/arm64/boot/dts/qcom/qcm6490-fairphone-fp5.dts
@@ -77,6 +77,8 @@ pmic-glink {
#address-cells = <1>;
#size-cells = <0>;
 
+   orientation-gpios = <&tlmm 140 GPIO_ACTIVE_HIGH>;
+
connector@0 {
compatible = "usb-c-connector";
reg = <0>;

---
base-commit: 65b0418f6e86eef0f62fc053fb3622fbaa3e506e
change-id: 20240411-fp5-usb-c-gpio-afd22741adcd

Best regards,
-- 
Luca Weiss 




Re: Re: Re: Subject: [PATCH net-next v4] net/ipv4: add tracepoint for icmp_send

2024-04-11 Thread Jason Xing
On Thu, Apr 11, 2024 at 12:57 PM Peilin He  wrote:
>
> >> >[...]
> >> >> >I think my understanding based on what Eric depicted differs from you:
> >> >> >we're supposed to filter out those many invalid cases and only trace
> >> >> >the valid action of sending a icmp, so where to add a new tracepoint
> >> >> >is important instead of adding more checks in the tracepoint itself.
> >> >> >Please refer to what trace_tcp_retransmit_skb() does :)
> >> >> >
> >> >> >Thanks,
> >> >> >Jason
> >> >> Okay, thank you for your suggestion. In order to avoid filtering out
> >> >> those many invalid cases and only tracing the valid action of sending
> >> >> a icmp, the next patch will add udd_fail_no_port trancepoint to the
> >> >> include/trace/events/udp.h. This will solve the problem you mentioned
> >> >> very well. At this point, only UDP protocol exceptions will be tracked,
> >> >> without the need to track them in icmp_send.
> >> >
> >> >I'm not against what you did (tracing all the icmp_send() for UDP) in
> >> >your original patch. I was suggesting that you could put
> >> >trace_icmp_send() in the right place, then you don't have to check the
> >> >possible error condition (like if the skb->head is valid or not, ...)
> >> >in your trace function.
> >> >
> >> >One example that can avoid various checks existing in the
> >> >__icmp_send() function:
> >> >diff --git a/net/ipv4/icmp.c b/net/ipv4/icmp.c
> >> >index e63a3bf99617..2c9f7364de45 100644
> >> >--- a/net/ipv4/icmp.c
> >> >+++ b/net/ipv4/icmp.c
> >> >@@ -767,6 +767,7 @@ void __icmp_send(struct sk_buff *skb_in, int type, int code, __be32 info,
> >> >        if (!fl4.saddr)
> >> >                fl4.saddr = htonl(INADDR_DUMMY);
> >> >
> >> >+       trace_icmp_send(skb_in, type, code);
> >> >        icmp_push_reply(sk, &icmp_param, &fl4, &ipc, &rt);
> >> >ende:
> >> >        ip_rt_put(rt);
> >> >
> >> >If we go here, it means we are ready to send the ICMP skb because
> >> >we're done extracting the right information in the 'struct sk_buff
> >> >skb_in'. Simpler and easier, right?
> >> >
> >> >Thanks,
> >> >Jason
> >>
> >> I may not fully agree with this viewpoint. When trace_icmp_send is placed
> >> in this position, it cannot guarantee that all skbs in icmp are UDP 
> >> protocols
> >> (UDP needs to be distinguished based on the proto_4!=IPPROTO_UDP 
> >> condition),
> >> nor can it guarantee the legitimacy of udphdr (*uh legitimacy check is 
> >> required).
> >
> >Of course, the UDP test statement is absolutely needed! Eric
> >previously pointed this out in the V1 patch thread. I'm not referring
> >to this one but like skb->head check something like this which exists
> >in __icmp_send() function. You can see there are so many checks in it
> >before sending.
> >
> >So only keeping the UDP check is enough, I think.
>
> The __icmp_send function only checks the IP header, but does not check
> the UDP header, as shown in the following code snippet:
>
> if ((u8 *)iph < skb_in->head ||
>     (skb_network_header(skb_in) + sizeof(*iph)) >
>     skb_tail_pointer(skb_in))
>         goto out;
>
> There is no problem with the IP header check, which does not mean that
> the UDP header is correct. Therefore, I believe that it is essential to
> include a legitimacy judgment for the UDP header.
>
> Here is an explanation of this code:
> Firstly, the UDP header (*uh) is extracted from the skb.
> Then, if the current protocol of the skb is not UDP, or if the address of
> uh is outside the range of the skb, the source port and destination port
> will not be resolved, and 0 will be filled in directly.Otherwise,
> the source port and destination port of the UDP header will be resolved.
>
> +   struct udphdr *uh = udp_hdr(skb);
> +   if (proto_4 != IPPROTO_UDP || (u8 *)uh < skb->head ||
> +   (u8 *)uh + sizeof(struct udphdr) > skb_tail_pointer(skb)) {

From the beginning, I always agree with the UDP check. I was saying if
you can put the trace_icmp_send() just before icmp_push_reply()[1],
you could avoid those kinds of checks.
As I said in the previous email, "only keeping the UDP check is
enough". So you are right.

[1]
diff --git a/net/ipv4/icmp.c b/net/ipv4/icmp.c
index e63a3bf99617..2c9f7364de45 100644
--- a/net/ipv4/icmp.c
+++ b/net/ipv4/icmp.c
@@ -767,6 +767,7 @@ void __icmp_send(struct sk_buff *skb_in, int type, int code, __be32 info,
        if (!fl4.saddr)
                fl4.saddr = htonl(INADDR_DUMMY);

+       trace_icmp_send(skb_in, type, code);
        icmp_push_reply(sk, &icmp_param, &fl4, &ipc, &rt);
ende:
        ip_rt_put(rt);

If we're doing this, trace_icmp_send() can reflect the real action of
sending an ICMP like trace_tcp_retransmit_skb(). Or else, the trace
could print some messages but no real ICMP is sent (see those error
checks). WDYT?

Thanks,
Jason

>
> With best wishes
> Peilin He
>
> >Thanks,
> >Jason
> >
> >>
> >> With best wishes
> >> Peilin He
> >>
> >> >>
> >> >> >> 2.Target this patch for net-next.
> >> >> >>
> >> >> >> v2->v3:
> >> 

Re: [PATCH v1 0/5] livepatch: klp-convert tool - Minimal version

2024-04-11 Thread Miroslav Benes
Hi,

> > Summary of changes in this minimal version
> > 
> > 
> > - rebase for v6.5
> > - cleaned-up SoB chains (suggested by pmladek)
> > - klp-convert: remove the symbol map auto-resolving solution
> > - klp-convert: add macro for flagging variables inside a LP src to be 
> > resolved by this tool
> > - klp-convert: code simplification
> 
> Do we have anything that blocks klp-convert-mini from being merged, or
> something that needs to be fixed?

there is feedback from Petr and I agree with him that a selftest would be 
appropriate.

Lukas, are you planning to send v2 with everything addressed?

Miroslav



Re: [PATCH net-next v5] virtio_net: Support RX hash XDP hint

2024-04-11 Thread Liang Chen
On Mon, Apr 8, 2024 at 2:41 PM Jason Wang  wrote:
>
> On Mon, Apr 1, 2024 at 11:38 AM Liang Chen  wrote:
> >
> > On Thu, Feb 29, 2024 at 4:37 PM Liang Chen  wrote:
> > >
> > > On Tue, Feb 27, 2024 at 4:42 AM John Fastabend  wrote:
> > > >
> > > > Jason Wang wrote:
> > > > > On Fri, Feb 23, 2024 at 9:42 AM Xuan Zhuo  wrote:
> > > > > >
> > > > > > On Fri, 09 Feb 2024 13:57:25 +0100, Paolo Abeni  wrote:
> > > > > > > On Fri, 2024-02-09 at 18:39 +0800, Liang Chen wrote:
> > > > > > > > On Wed, Feb 7, 2024 at 10:27 PM Paolo Abeni  wrote:
> > > > > > > > >
> > > > > > > > > On Wed, 2024-02-07 at 10:54 +0800, Liang Chen wrote:
> > > > > > > > > > On Tue, Feb 6, 2024 at 6:44 PM Paolo Abeni  wrote:
> > > > > > > > > > >
> > > > > > > > > > > On Sat, 2024-02-03 at 10:56 +0800, Liang Chen wrote:
> > > > > > > > > > > > On Sat, Feb 3, 2024 at 12:20 AM Jesper Dangaard Brouer  wrote:
> > > > > > > > > > > > > On 02/02/2024 13.11, Liang Chen wrote:
> > > > > > > > > > > [...]
> > > > > > > > > > > > > > @@ -1033,6 +1039,16 @@ static void put_xdp_frags(struct xdp_buff *xdp)
> > > > > > > > > > > > > >   }
> > > > > > > > > > > > > >   }
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > +static void virtnet_xdp_save_rx_hash(struct virtnet_xdp_buff *virtnet_xdp,
> > > > > > > > > > > > > > +  struct net_device *dev,
> > > > > > > > > > > > > > +  struct virtio_net_hdr_v1_hash *hdr_hash)
> > > > > > > > > > > > > > +{
> > > > > > > > > > > > > > + if (dev->features & NETIF_F_RXHASH) {
> > > > > > > > > > > > > > + virtnet_xdp->hash_value = hdr_hash->hash_value;
> > > > > > > > > > > > > > + virtnet_xdp->hash_report = hdr_hash->hash_report;
> > > > > > > > > > > > > > + }
> > > > > > > > > > > > > > +}
> > > > > > > > > > > > > > +
> > > > > > > > > > > > >
> > > > > > > > > > > > > Would it be possible to store a pointer to hdr_hash in virtnet_xdp_buff,
> > > > > > > > > > > > > with the purpose of delaying extracting this, until and only if XDP
> > > > > > > > > > > > > bpf_prog calls the kfunc?
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > That seems to be the way v1 works,
> > > > > > > > > > > > https://lore.kernel.org/all/20240122102256.261374-1-liangchen.li...@gmail.com/
> > > > > > > > > > > > . But it was pointed out that the inline header may be overwritten by
> > > > > > > > > > > > the xdp prog, so the hash is copied out to maintain its integrity.
> > > > > > > > > > >
> > > > > > > > > > > Why? isn't XDP supposed to get write access only to the pkt
> > > > > > > > > > > contents/buffer?
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > Normally, an XDP program accesses only the packet data. However,
> > > > > > > > > > there's also an XDP RX Metadata area, referenced by the data_meta
> > > > > > > > > > pointer. This pointer can be adjusted with bpf_xdp_adjust_meta to
> > > > > > > > > > point somewhere ahead of the data buffer, thereby granting the XDP
> > > > > > > > > > program access to the virtio header located immediately before the
> > > > > > > > >
> > > > > > > > > AFAICS bpf_xdp_adjust_meta() does not allow moving the meta_data before
> > > > > > > > > xdp->data_hard_start:
> > > > > > > > >
> > > > > > > > > https://elixir.bootlin.com/linux/latest/source/net/core/filter.c#L4210
> > > > > > > > >
> > > > > > > > > and virtio net set such field after the virtio_net_hdr:
> > > > > > > > >
> > > > > > > > > https://elixir.bootlin.com/linux/latest/source/drivers/net/virtio_net.c#L1218
> > > > > > > > > https://elixir.bootlin.com/linux/latest/source/drivers/net/virtio_net.c#L1420
> > > > > > > > >
> > > > > > > > > I don't see how the virtio hdr could be touched? Possibly even more
> > > > > > > > > important: if such a thing is possible, I think it should be somewhat
> > > > > > > > > denied (for the same reason an H/W nic should prevent XDP from
> > > > > > > > > modifying its own buffer descriptor).
> > > > > > > >
> > > > > > > > Thank you for highlighting this concern. The header layout differs
> > > > > > > > slightly between small and mergeable mode. Taking 'mergeable mode' as
> > > > > > > > an example, after calling xdp_prepare_buff the layout of xdp_buff
> > > > > > > > would be as depicted in the diagram below,
> > > > > > > >
> > > > > > > >

Re: [PATCH 2/2] remoteproc: mediatek: Support MT8188 SCP core 1

2024-04-11 Thread Krzysztof Kozlowski
On 11/04/2024 05:37, olivia.wen wrote:
> +};
> +
>  static const struct of_device_id mtk_scp_of_match[] = {
>   { .compatible = "mediatek,mt8183-scp", .data = &mt8183_of_data },
>   { .compatible = "mediatek,mt8186-scp", .data = &mt8186_of_data },
> @@ -1323,6 +1362,7 @@ static const struct of_device_id mtk_scp_of_match[] = {
>   { .compatible = "mediatek,mt8192-scp", .data = &mt8192_of_data },
>   { .compatible = "mediatek,mt8195-scp", .data = &mt8195_of_data },
>   { .compatible = "mediatek,mt8195-scp-dual", .data = &mt8195_of_data_cores },
> + { .compatible = "mediatek,mt8188-scp-dual", .data = &mt8188_of_data_cores },

Why do you add new entries to the end? Look at the list first.

Best regards,
Krzysztof




Re: [PATCH 1/2] dt-bindings: remoteproc: mediatek: Support MT8188 dual-core SCP

2024-04-11 Thread Krzysztof Kozlowski
On 11/04/2024 05:37, olivia.wen wrote:
> Depending on the application, the MT8188 SCP can be used as single-core
> or dual-core.
> 
> Signed-off-by: olivia.wen 

Are you sure you use full name, not email login as name?

> ---
>  Documentation/devicetree/bindings/remoteproc/mtk,scp.yaml | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/Documentation/devicetree/bindings/remoteproc/mtk,scp.yaml 
> b/Documentation/devicetree/bindings/remoteproc/mtk,scp.yaml
> index 507f98f..7e7b567 100644
> --- a/Documentation/devicetree/bindings/remoteproc/mtk,scp.yaml
> +++ b/Documentation/devicetree/bindings/remoteproc/mtk,scp.yaml
> @@ -22,7 +22,7 @@ properties:
>- mediatek,mt8192-scp
>- mediatek,mt8195-scp
>- mediatek,mt8195-scp-dual
> -
> +  - mediatek,mt8188-scp-dual

Missing blank line, misordered.


>reg:
>  description:
>Should contain the address ranges for memory regions SRAM, CFG, and,
> @@ -195,6 +195,7 @@ allOf:
>  compatible:
>enum:
>  - mediatek,mt8195-scp-dual
> +- mediatek,mt8188-scp-dual

Again, keep the order.



Best regards,
Krzysztof