date:20240403

[PATCH v3] Documentation: Add reconnect process for VDUSE

2024-04-03 Thread Cindy Lu

Add a document explaining the reconnect process, including what the
Userspace App needs to do and how it works with the kernel.

Signed-off-by: Cindy Lu 
---
 Documentation/userspace-api/vduse.rst | 41 +++
 1 file changed, 41 insertions(+)

diff --git a/Documentation/userspace-api/vduse.rst 
b/Documentation/userspace-api/vduse.rst
index bdb880e01132..7faa83462e78 100644
--- a/Documentation/userspace-api/vduse.rst
+++ b/Documentation/userspace-api/vduse.rst
@@ -231,3 +231,44 @@ able to start the dataplane processing as follows:
after the used ring is filled.
 
 For more details on the uAPI, please see include/uapi/linux/vduse.h.
+
+HOW VDUSE devices reconnection works
+
+1. What is reconnection?
+
+   When the userspace application loads, it should establish a connection
+   to the vduse kernel device. Sometimes,the userspace application exists,
+   and we want to support its restart and connect to the kernel device again
+
+2. How can I support reconnection in a userspace application?
+
+2.1 During initialization, the userspace application should first verify the
+existence of the device "/dev/vduse/vduse_name".
+If it doesn't exist, it means this is the first-time for connection. goto 
step 2.2
+If it exists, it means this is a reconnection, and we should goto step 2.3
+
+2.2 Create a new VDUSE instance with ioctl(VDUSE_CREATE_DEV) on
+/dev/vduse/control.
+When ioctl(VDUSE_CREATE_DEV) is called, kernel allocates memory for
+the reconnect information. The total memory size is PAGE_SIZE*vq_mumber.
+
+2.3 Check if the information is suitable for reconnect
+If this is reconnection :
+Before attempting to reconnect, The userspace application needs to use the
+ioctl(VDUSE_DEV_GET_CONFIG, VDUSE_DEV_GET_STATUS, 
VDUSE_DEV_GET_FEATURES...)
+to get the information from kernel.
+Please review the information and confirm if it is suitable to reconnect.
+
+2.4 Userspace application needs to mmap the memory to userspace
+The userspace application requires mapping one page for every vq. These 
pages
+should be used to save vq-related information during system running. 
Additionally,
+the application must define its own structure to store information for 
reconnection.
+
+2.5 Completed the initialization and running the application.
+While the application is running, it is important to store relevant 
information
+about reconnections in mapped pages. When calling the ioctl 
VDUSE_VQ_GET_INFO to
+get vq information, it's necessary to check whether it's a reconnection. 
If it is
+a reconnection, the vq-related information must be get from the mapped 
pages.
+
+2.6 When the Userspace application exits, it is necessary to unmap all the
+pages for reconnection
-- 
2.43.0

Re: [PATCH v2] Documentation: Add reconnect process for VDUSE

2024-04-03 Thread Cindy Lu

On Fri, Mar 29, 2024 at 5:52 PM Michael S. Tsirkin  wrote:
>
> On Fri, Mar 29, 2024 at 05:38:25PM +0800, Cindy Lu wrote:
> > Add a document explaining the reconnect process, including what the
> > Userspace App needs to do and how it works with the kernel.
> >
> > Signed-off-by: Cindy Lu 
> > ---
> >  Documentation/userspace-api/vduse.rst | 41 +++
> >  1 file changed, 41 insertions(+)
> >
> > diff --git a/Documentation/userspace-api/vduse.rst 
> > b/Documentation/userspace-api/vduse.rst
> > index bdb880e01132..f903aed714d1 100644
> > --- a/Documentation/userspace-api/vduse.rst
> > +++ b/Documentation/userspace-api/vduse.rst
> > @@ -231,3 +231,44 @@ able to start the dataplane processing as follows:
> > after the used ring is filled.
> >
> >  For more details on the uAPI, please see include/uapi/linux/vduse.h.
> > +
> > +HOW VDUSE devices reconnectoin works
>
> typo
>
Really sorry for this, I will send a new version

Thanks
Cindy
> > +
> > +1. What is reconnection?
> > +
> > +   When the userspace application loads, it should establish a connection
> > +   to the vduse kernel device. Sometimes,the userspace application exists,
> > +   and we want to support its restart and connect to the kernel device 
> > again
> > +
> > +2. How can I support reconnection in a userspace application?
> > +
> > +2.1 During initialization, the userspace application should first verify 
> > the
> > +existence of the device "/dev/vduse/vduse_name".
> > +If it doesn't exist, it means this is the first-time for connection. 
> > goto step 2.2
> > +If it exists, it means this is a reconnection, and we should goto step 
> > 2.3
> > +
> > +2.2 Create a new VDUSE instance with ioctl(VDUSE_CREATE_DEV) on
> > +/dev/vduse/control.
> > +When ioctl(VDUSE_CREATE_DEV) is called, kernel allocates memory for
> > +the reconnect information. The total memory size is 
> > PAGE_SIZE*vq_mumber.
> > +
> > +2.3 Check if the information is suitable for reconnect
> > +If this is reconnection :
> > +Before attempting to reconnect, The userspace application needs to use 
> > the
> > +ioctl(VDUSE_DEV_GET_CONFIG, VDUSE_DEV_GET_STATUS, 
> > VDUSE_DEV_GET_FEATURES...)
> > +to get the information from kernel.
> > +Please review the information and confirm if it is suitable to 
> > reconnect.
> > +
> > +2.4 Userspace application needs to mmap the memory to userspace
> > +The userspace application requires mapping one page for every vq. 
> > These pages
> > +should be used to save vq-related information during system running. 
> > Additionally,
> > +the application must define its own structure to store information for 
> > reconnection.
> > +
> > +2.5 Completed the initialization and running the application.
> > +While the application is running, it is important to store relevant 
> > information
> > +about reconnections in mapped pages. When calling the ioctl 
> > VDUSE_VQ_GET_INFO to
> > +get vq information, it's necessary to check whether it's a 
> > reconnection. If it is
> > +a reconnection, the vq-related information must be get from the mapped 
> > pages.
> > +
>
>
> I don't get it. So this is just a way for the application to allocate
> memory? Why do we need this new way to do it?
> Why not just mmap a file anywhere at all?
>
We used to use tmpfs to save this reconnect information,
but this will make the API not self contained, so we changed to using
the kernel memory
Thanks
cindy



>
> > +2.6 When the Userspace application exits, it is necessary to unmap all the
> > +pages for reconnection
> > --
> > 2.43.0
>

Re: [PATCH] dax/bus.c: replace WARN_ON_ONCE() with lockdep asserts

2024-04-03 Thread Dan Williams

Vishal Verma wrote:
> In [1], Dan points out that all of the WARN_ON_ONCE() usage in the
> referenced patch should be replaced with lockdep_assert_held(_write)().
> 
> Replace those, and additionally, replace a couple of other
> WARN_ON_ONCE() introduced in the same patch for actual failure
> cases (i.e. when acquiring a semaphore fails in a remove / unregister
> path) with dev_WARN_ONCE() as is the precedent here.

/me goes to look why failures are warned vs bubbling up the error...

> 
> Recall that previously, unregistration paths was implicitly protected by
> overloading the device lock, which the patch in [1] sought to remove.
> This meant adding a semaphore acquisition in these unregistration paths.
> Since that can fail, and it doesn't make sense to return errors from
> these paths, retain the two instances of (now) dev_WARN_ONCE().
> 
> Link: 
> https://lore.kernel.org/r/65f0b5ef41817_aa2229...@dwillia2-mobl3.amr.corp.intel.com.notmuch
>  [1]
> Fixes: c05ae9d85b47 ("dax/bus.c: replace driver-core lock usage by a local 
> rwsem")
> Cc: Andrew Morton 
> Reported-by: Dan Williams 
> Signed-off-by: Vishal Verma 
> ---
>  drivers/dax/bus.c | 23 +--
>  1 file changed, 13 insertions(+), 10 deletions(-)
> 
> diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
> index 797e1ebff299..d89c8c3b62f7 100644
> --- a/drivers/dax/bus.c
> +++ b/drivers/dax/bus.c
[..]
> @@ -726,7 +728,8 @@ static void unregister_dax_mapping(void *data)

...and realizes that is hunk is confused.

>   if (rwsem_is_locked(_region_rwsem))
>   return __unregister_dax_mapping(data);

This is bug. Either it is safe to call this without the lock held, or it
must be safe to always acquire the lock. Anything else means the caller
needs to be refactored. Conditional locking is an indicator of a deeper
problem with code organization.

Yes, this was not part of the fixup, but the changelog led me here
because no WARNs should remain at the end of this effort.

I think the confusion results from replace *all* device_lock() when some
device_lock() is still needed.

> - if (WARN_ON_ONCE(down_write_killable(_region_rwsem) != 0))
> + if (dev_WARN_ONCE(data, down_write_killable(_region_rwsem) != 0,
> +   "unable to acquire region rwsem\n"))

In a context like this that cannot return an error directly to the user
process making the request, like in a sysfs operation handler, then the
locking cannot worry about signals. This would become an uncoditional
down_write() lock. So down_write_killable() is typically for request
contexts where there is a user process waiting for a result. For
contexts like async driver probing where we might be in a kernel thread,
and definitely in functions that return 'void', it is a bug to use
fallible locks.

Re: [PATCH net-next v4 0/2] tcp: make trace of reset logic complete

2024-04-03 Thread patchwork-bot+netdevbpf

Hello:

This series was applied to netdev/net-next.git (main)
by Jakub Kicinski :

On Mon,  1 Apr 2024 15:36:03 +0800 you wrote:
> From: Jason Xing 
> 
> Before this, we miss some cases where the TCP layer could send RST but
> we cannot trace it. So I decided to complete it :)
> 
> v4
> Link: 
> https://lore.kernel.org/all/20240329034243.7929-1-kerneljasonx...@gmail.com/
> 1. rebased against latest net-next
> 2. remove {} and add skb test statement (Eric)
> 3. drop v3 patch [3/3] temporarily because 1) location is not that useful
> since we can use perf or something else to trace, 2) Eric said we could
> use drop_reason to show why we have to RST, which is good, but this seems
> not work well for those ->send_reset() logic. I need more time to
> investigate this part.
> 
> [...]

Here is the summary with links:
  - [net-next,v4,1/2] trace: adjust TP_STORE_ADDR_PORTS_SKB() parameters
https://git.kernel.org/netdev/net-next/c/9807080e2170
  - [net-next,v4,2/2] trace: tcp: fully support trace_tcp_send_reset
https://git.kernel.org/netdev/net-next/c/19822a980e19

You are awesome, thank you!
-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html

Re: [PATCHv2 1/3] uprobe: Add uretprobe syscall to speed up return probe

2024-04-03 Thread Andrii Nakryiko

On Wed, Apr 3, 2024 at 5:58 PM Masami Hiramatsu  wrote:
>
> On Wed, 3 Apr 2024 09:58:12 -0700
> Andrii Nakryiko  wrote:
>
> > On Wed, Apr 3, 2024 at 7:09 AM Masami Hiramatsu  wrote:
> > >
> > > On Wed, 3 Apr 2024 11:47:41 +0200
> > > Jiri Olsa  wrote:
> > >
> > > > On Wed, Apr 03, 2024 at 10:07:08AM +0900, Masami Hiramatsu wrote:
> > > > > Hi Jiri,
> > > > >
> > > > > On Tue,  2 Apr 2024 11:33:00 +0200
> > > > > Jiri Olsa  wrote:
> > > > >
> > > > > > Adding uretprobe syscall instead of trap to speed up return probe.
> > > > >
> > > > > This is interesting approach. But I doubt we need to add additional
> > > > > syscall just for this purpose. Can't we use another syscall or ioctl?
> > > >
> > > > so the plan is to optimize entry uprobe in a similar way and given
> > > > the syscall is not a scarce resource I wanted to add another syscall
> > > > for that one as well
> > > >
> > > > tbh I'm not sure sure which syscall or ioctl to reuse for this, it's
> > > > possible to do that, the trampoline will just have to save one or
> > > > more additional registers, but adding new syscall seems cleaner to me
> > >
> > > Hmm, I think a similar syscall is ptrace? prctl may also be a candidate.
> >
> > I think both ptrace and prctl are for completely different use cases
> > and it would be an abuse of existing API to reuse them for uretprobe
> > tracing. Also, keep in mind, that any extra argument that has to be
> > passed into this syscall means that we need to complicate and slow
> > generated assembly code that is injected into user process (to
> > save/restore registers) and also kernel-side (again, to deal with all
> > the extra registers that would be stored/restored on stack).
> >
> > Given syscalls are not some kind of scarce resources, what's the
> > downside to have a dedicated and simple syscall?
>
> Syscalls are explicitly exposed to user space, thus, even if it is used
> ONLY for a very specific situation, it is an official kernel interface,
> and need to care about the compatibility. (If it causes SIGILL unless
> a specific use case, I don't know there is a "compatibility".)

Check rt_sigreturn syscall (manpage at [0], for example).

   sigreturn() exists only to allow the implementation of signal
   handlers.  It should never be called directly.  (Indeed, a simple
   sigreturn() wrapper in the GNU C library simply returns -1, with
   errno set to ENOSYS.)  Details of the arguments (if any) passed
   to sigreturn() vary depending on the architecture.  (On some
   architectures, such as x86-64, sigreturn() takes no arguments,
   since all of the information that it requires is available in the
   stack frame that was previously created by the kernel on the
   user-space stack.)

This is a very similar use case. Also, check its source code in
arch/x86/kernel/signal_64.c. It sends SIGSEGV to the calling process
on any sign of something not being right. It's exactly the same with
sys_uretprobe.

  [0] https://man7.org/linux/man-pages/man2/sigreturn.2.html

> And the number of syscalls are limited resource.

We have almost 500 of them, it didn't seems like adding 1-2 for good
reasons would be a problem. Can you please point to where the limits
on syscalls as a resource are described? I'm curious to learn.

>
> I'm actually not sure how much we need to care of it, but adding a new
> syscall is worth to be discussed carefully because all of them are
> user-space compatibility.

Absolutely, it's a good discussion to have.

>
> > > > > Also, we should run syzkaller on this syscall. And if uretprobe is
> > > >
> > > > right, I'll check on syzkaller
> > > >
> > > > > set in the user function, what happen if the user function directly
> > > > > calls this syscall? (maybe it consumes shadow stack?)
> > > >
> > > > the process should receive SIGILL if there's no pending uretprobe for
> > > > the current task, or it will trigger uretprobe if there's one pending
> > >
> > > No, that is too aggressive and not safe. Since the syscall is exposed to
> > > user program, it should return appropriate error code instead of SIGILL.
> > >
> >
> > This is the way it is today with uretprobes even through interrupt.
>
> I doubt that the interrupt (exception) and syscall should be handled
> differently. Especially, this exception is injected by uprobes but
> syscall will be caused by itself. But syscall can be called from user
> program (of couse this works as sys_kill(self, SIGILL)).

Yep, I'd keep the behavior the same between uretprobes implemented
through int3 and sys_uretprobe.

>
> > E.g., it could happen that user process is using fibers and is
> > replacing stack pointer without kernel realizing this, which will
> > trigger some defensive checks in uretprobe handling code and kernel
> > will send SIGILL because it can't support such cases. This is
> > happening today already, and it works fine in practice (except for
> > applications that manually change stack pointer, too bad, you can't
>

Re: [PATCH net-next 0/6] Implement reset reason mechanism to detect

2024-04-03 Thread Jason Xing

On Thu, Apr 4, 2024 at 9:50 AM Jakub Kicinski  wrote:
>
> On Wed,  3 Apr 2024 15:31:38 +0800 Jason Xing wrote:
> > It's based on top of 
> > https://patchwork.kernel.org/project/netdevbpf/list/?series=840182
>
> Please post as RFC if there's a dependency.
> We don't maintain patch queues for people.

Got it. Thanks.

I'll wait for that patch series to get merged. I believe it will not
take too long:)

> --
> pw-bot: cr

Re: [PATCH net-next 0/6] Implement reset reason mechanism to detect

2024-04-03 Thread Jakub Kicinski

On Wed,  3 Apr 2024 15:31:38 +0800 Jason Xing wrote:
> It's based on top of 
> https://patchwork.kernel.org/project/netdevbpf/list/?series=840182

Please post as RFC if there's a dependency.
We don't maintain patch queues for people.
-- 
pw-bot: cr

[PATCH] tools/latency-collector: fix -Wformat-security compile warns

2024-04-03 Thread Shuah Khan

Fix the following -Wformat-security compile warnings adding missing
format arguments:

latency-collector.c: In function ‘show_available’:
latency-collector.c:938:17: warning: format not a string literal and
no format arguments [-Wformat-security]
  938 | warnx(no_tracer_msg);
  | ^

latency-collector.c:943:17: warning: format not a string literal and
no format arguments [-Wformat-security]
  943 | warnx(no_latency_tr_msg);
  | ^

latency-collector.c: In function ‘find_default_tracer’:
latency-collector.c:986:25: warning: format not a string literal and
no format arguments [-Wformat-security]
  986 | errx(EXIT_FAILURE, no_tracer_msg);
  |
 ^~~~
latency-collector.c: In function ‘scan_arguments’:
latency-collector.c:1881:33: warning: format not a string literal and
no format arguments [-Wformat-security]
 1881 | errx(EXIT_FAILURE, no_tracer_msg);
  | ^~~~

Signed-off-by: Shuah Khan 
---
 tools/tracing/latency/latency-collector.c | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/tools/tracing/latency/latency-collector.c 
b/tools/tracing/latency/latency-collector.c
index 0fd9c747d396..cf263fe9deaf 100644
--- a/tools/tracing/latency/latency-collector.c
+++ b/tools/tracing/latency/latency-collector.c
@@ -935,12 +935,12 @@ static void show_available(void)
}
 
if (!tracers) {
-   warnx(no_tracer_msg);
+   warnx("%s", no_tracer_msg);
return;
}
 
if (!found) {
-   warnx(no_latency_tr_msg);
+   warnx("%s", no_latency_tr_msg);
tracefs_list_free(tracers);
return;
}
@@ -983,7 +983,7 @@ static const char *find_default_tracer(void)
for (i = 0; relevant_tracers[i]; i++) {
valid = tracer_valid(relevant_tracers[i], );
if (notracer)
-   errx(EXIT_FAILURE, no_tracer_msg);
+   errx(EXIT_FAILURE, "%s", no_tracer_msg);
if (valid)
return relevant_tracers[i];
}
@@ -1878,7 +1878,7 @@ static void scan_arguments(int argc, char *argv[])
}
valid = tracer_valid(current_tracer, );
if (notracer)
-   errx(EXIT_FAILURE, no_tracer_msg);
+   errx(EXIT_FAILURE, "%s", no_tracer_msg);
if (!valid)
errx(EXIT_FAILURE,
 "The tracer %s is not supported by your kernel!\n", current_tracer);
-- 
2.40.1

Re: [PATCHv2 1/3] uprobe: Add uretprobe syscall to speed up return probe

2024-04-03 Thread Google

On Wed, 3 Apr 2024 09:58:12 -0700
Andrii Nakryiko  wrote:

> On Wed, Apr 3, 2024 at 7:09 AM Masami Hiramatsu  wrote:
> >
> > On Wed, 3 Apr 2024 11:47:41 +0200
> > Jiri Olsa  wrote:
> >
> > > On Wed, Apr 03, 2024 at 10:07:08AM +0900, Masami Hiramatsu wrote:
> > > > Hi Jiri,
> > > >
> > > > On Tue,  2 Apr 2024 11:33:00 +0200
> > > > Jiri Olsa  wrote:
> > > >
> > > > > Adding uretprobe syscall instead of trap to speed up return probe.
> > > >
> > > > This is interesting approach. But I doubt we need to add additional
> > > > syscall just for this purpose. Can't we use another syscall or ioctl?
> > >
> > > so the plan is to optimize entry uprobe in a similar way and given
> > > the syscall is not a scarce resource I wanted to add another syscall
> > > for that one as well
> > >
> > > tbh I'm not sure sure which syscall or ioctl to reuse for this, it's
> > > possible to do that, the trampoline will just have to save one or
> > > more additional registers, but adding new syscall seems cleaner to me
> >
> > Hmm, I think a similar syscall is ptrace? prctl may also be a candidate.
> 
> I think both ptrace and prctl are for completely different use cases
> and it would be an abuse of existing API to reuse them for uretprobe
> tracing. Also, keep in mind, that any extra argument that has to be
> passed into this syscall means that we need to complicate and slow
> generated assembly code that is injected into user process (to
> save/restore registers) and also kernel-side (again, to deal with all
> the extra registers that would be stored/restored on stack).
> 
> Given syscalls are not some kind of scarce resources, what's the
> downside to have a dedicated and simple syscall?

Syscalls are explicitly exposed to user space, thus, even if it is used
ONLY for a very specific situation, it is an official kernel interface,
and need to care about the compatibility. (If it causes SIGILL unless
a specific use case, I don't know there is a "compatibility".)
And the number of syscalls are limited resource.

I'm actually not sure how much we need to care of it, but adding a new
syscall is worth to be discussed carefully because all of them are
user-space compatibility.

> > > > Also, we should run syzkaller on this syscall. And if uretprobe is
> > >
> > > right, I'll check on syzkaller
> > >
> > > > set in the user function, what happen if the user function directly
> > > > calls this syscall? (maybe it consumes shadow stack?)
> > >
> > > the process should receive SIGILL if there's no pending uretprobe for
> > > the current task, or it will trigger uretprobe if there's one pending
> >
> > No, that is too aggressive and not safe. Since the syscall is exposed to
> > user program, it should return appropriate error code instead of SIGILL.
> >
> 
> This is the way it is today with uretprobes even through interrupt.

I doubt that the interrupt (exception) and syscall should be handled
differently. Especially, this exception is injected by uprobes but
syscall will be caused by itself. But syscall can be called from user
program (of couse this works as sys_kill(self, SIGILL)).

> E.g., it could happen that user process is using fibers and is
> replacing stack pointer without kernel realizing this, which will
> trigger some defensive checks in uretprobe handling code and kernel
> will send SIGILL because it can't support such cases. This is
> happening today already, and it works fine in practice (except for
> applications that manually change stack pointer, too bad, you can't
> trace them with uretprobes, unfortunately).

OK, we at least need to document it.

> 
> So I think it's absolutely adequate to have this behavior if the user
> process is *intentionally* abusing this API.

Of course user expected that it is abusing. So at least we need to
add a document that this syscall number is reserved to uprobes and
user program must not use it.

> 
> > >
> > > but we could limit the syscall to be executed just from the trampoline,
> > > that should prevent all the user space use cases, I'll do that in next
> > > version and add more tests for that
> >
> > Why not limit? :) The uprobe_handle_trampoline() expects it is called
> > only from the trampoline, so it is natural to check the caller address.
> > (and uprobe should know where is the trampoline)
> >
> > Since the syscall is always exposed to the user program, it should
> > - Do nothing and return an error unless it is properly called.
> > - check the prerequisites for operation strictly.
> > I concern that new system calls introduce vulnerabilities.
> >
> 
> As Oleg and Jiri mentioned, this syscall can't harm kernel or other
> processes, only the process that is abusing the API. So any extra
> checks that would slow down this approach is an unnecessary overhead
> and complication that will never be useful in practice.

I think at least it should check the caller address to ensure the
address is in the trampoline.
But anyway, uprobes itself can break the target process, so no one

Re: [PATCH v10 2/2] memory tier: create CPUless memory tiers after obtaining HMAT info

2024-04-03 Thread Ho-Ren (Jack) Chuang

Hi Jonathan,

Thank you for your feedback. I will fix them (inlined) in the next V11.

On Wed, Apr 3, 2024 at 10:04 AM Jonathan Cameron
 wrote:
>
> A few minor comments inline.
>
> > diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
> > index a44c03c2ba3a..16769552a338 100644
> > --- a/include/linux/memory-tiers.h
> > +++ b/include/linux/memory-tiers.h
> > @@ -140,12 +140,13 @@ static inline int mt_perf_to_adistance(struct 
> > access_coordinate *perf, int *adis
> >   return -EIO;
> >  }
> >
> > -struct memory_dev_type *mt_find_alloc_memory_type(int adist, struct 
> > list_head *memory_types)
> > +static inline struct memory_dev_type *mt_find_alloc_memory_type(int adist,
> > + struct list_head *memory_types)
> >  {
> >   return NULL;
> >  }
> >
> > -void mt_put_memory_types(struct list_head *memory_types)
> > +static inline void mt_put_memory_types(struct list_head *memory_types)
> >  {
> Why in this patch and not previous one?

I've also noticed this issue. I will fix it in the next V11.

> >
> >  }
> > diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
> > index 974af10cfdd8..44fa10980d37 100644
> > --- a/mm/memory-tiers.c
> > +++ b/mm/memory-tiers.c
> > @@ -36,6 +36,11 @@ struct node_memory_type_map {
> >
> >  static DEFINE_MUTEX(memory_tier_lock);
> >  static LIST_HEAD(memory_tiers);
> > +/*
> > + * The list is used to store all memory types that are not created
> > + * by a device driver.
> > + */
> > +static LIST_HEAD(default_memory_types);
> >  static struct node_memory_type_map node_memory_types[MAX_NUMNODES];
> >  struct memory_dev_type *default_dram_type;
> >
> > @@ -108,6 +113,8 @@ static struct demotion_nodes *node_demotion 
> > __read_mostly;
> >
> >  static BLOCKING_NOTIFIER_HEAD(mt_adistance_algorithms);
> >
> > +/* The lock is used to protect `default_dram_perf*` info and nid. */
> > +static DEFINE_MUTEX(default_dram_perf_lock);
> >  static bool default_dram_perf_error;
> >  static struct access_coordinate default_dram_perf;
> >  static int default_dram_perf_ref_nid = NUMA_NO_NODE;
> > @@ -505,7 +512,8 @@ static inline void __init_node_memory_type(int node, 
> > struct memory_dev_type *mem
> >  static struct memory_tier *set_node_memory_tier(int node)
> >  {
> >   struct memory_tier *memtier;
> > - struct memory_dev_type *memtype;
> > + struct memory_dev_type *mtype = default_dram_type;
>
> Does the rename add anything major to the patch?
> If not I'd leave it alone to reduce the churn and give
> a more readable patch.  If it is worth doing perhaps
> a precursor patch?
>

Either name works. Keeping it the same name will make the code
easier to follow. I agree! Thanks.

> > + int adist = MEMTIER_ADISTANCE_DRAM;
> >   pg_data_t *pgdat = NODE_DATA(node);
> >
> >
> > @@ -514,11 +522,20 @@ static struct memory_tier *set_node_memory_tier(int 
> > node)
> >   if (!node_state(node, N_MEMORY))
> >   return ERR_PTR(-EINVAL);
> >
> > - __init_node_memory_type(node, default_dram_type);
> > + mt_calc_adistance(node, );
> > + if (node_memory_types[node].memtype == NULL) {
> > + mtype = mt_find_alloc_memory_type(adist, 
> > _memory_types);
> > + if (IS_ERR(mtype)) {
> > + mtype = default_dram_type;
> > + pr_info("Failed to allocate a memory type. Fall 
> > back.\n");
> > + }
> > + }
> > +
> > + __init_node_memory_type(node, mtype);
> >
> > - memtype = node_memory_types[node].memtype;
> > - node_set(node, memtype->nodes);
> > - memtier = find_create_memory_tier(memtype);
> > + mtype = node_memory_types[node].memtype;
> > + node_set(node, mtype->nodes);
> > + memtier = find_create_memory_tier(mtype);
> >   if (!IS_ERR(memtier))
> >   rcu_assign_pointer(pgdat->memtier, memtier);
> >   return memtier;
> > @@ -655,6 +672,33 @@ void mt_put_memory_types(struct list_head 
> > *memory_types)
> >  }
> >  EXPORT_SYMBOL_GPL(mt_put_memory_types);
> >
> > +/*
> > + * This is invoked via `late_initcall()` to initialize memory tiers for
> > + * CPU-less memory nodes after driver initialization, which is
> > + * expected to provide `adistance` algorithms.
> > + */
> > +static int __init memory_tier_late_init(void)
> > +{
> > + int nid;
> > +
> > + mutex_lock(_tier_lock);
> > + for_each_node_state(nid, N_MEMORY)
> > + if (node_memory_types[nid].memtype == NULL)
> > + /*
> > +  * Some device drivers may have initialized memory 
> > tiers
> > +  * between `memory_tier_init()` and 
> > `memory_tier_late_init()`,
> > +  * potentially bringing online memory nodes and
> > +  * configuring memory tiers. Exclude them here.
> > +  */
>
> Does the comment refer to this path, or to ones where memtype is set?
>

Yes, the comment is for explaining why the

Re: [PATCH fs/proc/bootconfig] remove redundant comments from /proc/bootconfig

2024-04-03 Thread Google

On Wed, 3 Apr 2024 12:16:28 -0700
"Paul E. McKenney"  wrote:

> commit 717c7c894d4b ("fs/proc: Add boot loader arguments as comment to
> /proc/bootconfig") adds bootloader argument comments into /proc/bootconfig.
> 
> /proc/bootconfig shows boot_command_line[] multiple times following
> every xbc key value pair, that's duplicated and not necessary.
> Remove redundant ones.
> 
> Output before and after the fix is like:
> key1 = value1
> *bootloader argument comments*
> key2 = value2
> *bootloader argument comments*
> key3 = value3
> *bootloader argument comments*
> ...
> 
> key1 = value1
> key2 = value2
> key3 = value3
> *bootloader argument comments*
> ...
> 
> Fixes: 717c7c894d4b ("fs/proc: Add boot loader arguments as comment to 
> /proc/bootconfig")
> Signed-off-by: Zhenhua Huang 
> Signed-off-by: Paul E. McKenney 
> Cc: Masami Hiramatsu 
> Cc: 
> Cc: 

OOps, good catch! Let me pick it.

Acked-by: Masami Hiramatsu (Google) 

Thank you!

> 
> diff --git a/fs/proc/bootconfig.c b/fs/proc/bootconfig.c
> index 902b326e1e560..e5635a6b127b0 100644
> --- a/fs/proc/bootconfig.c
> +++ b/fs/proc/bootconfig.c
> @@ -62,12 +62,12 @@ static int __init copy_xbc_key_value_list(char *dst, 
> size_t size)
>   break;
>   dst += ret;
>   }
> - if (ret >= 0 && boot_command_line[0]) {
> - ret = snprintf(dst, rest(dst, end), "# Parameters from 
> bootloader:\n# %s\n",
> -boot_command_line);
> - if (ret > 0)
> - dst += ret;
> - }
> + }
> + if (ret >= 0 && boot_command_line[0]) {
> + ret = snprintf(dst, rest(dst, end), "# Parameters from 
> bootloader:\n# %s\n",
> +boot_command_line);
> + if (ret > 0)
> + dst += ret;
>   }
>  out:
>   kfree(key);


-- 
Masami Hiramatsu (Google)

Re: [PATCH v10 1/2] memory tier: dax/kmem: introduce an abstract layer for finding, allocating, and putting memory types

2024-04-03 Thread Ho-Ren (Jack) Chuang

Hi Jonathan,

Thanks for your feedback. I will fix them (inlined) in the next V11.
No worries, it's never too late!

On Wed, Apr 3, 2024 at 9:52 AM Jonathan Cameron
 wrote:
>
> On Tue,  2 Apr 2024 00:17:37 +
> "Ho-Ren (Jack) Chuang"  wrote:
>
> > Since different memory devices require finding, allocating, and putting
> > memory types, these common steps are abstracted in this patch,
> > enhancing the scalability and conciseness of the code.
> >
> > Signed-off-by: Ho-Ren (Jack) Chuang 
> > Reviewed-by: "Huang, Ying" 
>
> Hi,
>
> I know this is a late entry to the discussion but a few comments inline.
> (sorry I didn't look earlier!)
>
> All opportunities to improve code complexity and readability as a result
> of your factoring out.
>
> Jonathan
>
>
> > ---
> >  drivers/dax/kmem.c   | 20 ++--
> >  include/linux/memory-tiers.h | 13 +
> >  mm/memory-tiers.c| 32 
> >  3 files changed, 47 insertions(+), 18 deletions(-)
> >
> > diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c
> > index 42ee360cf4e3..01399e5b53b2 100644
> > --- a/drivers/dax/kmem.c
> > +++ b/drivers/dax/kmem.c
> > @@ -55,21 +55,10 @@ static LIST_HEAD(kmem_memory_types);
> >
> >  static struct memory_dev_type *kmem_find_alloc_memory_type(int adist)
> >  {
> > - bool found = false;
> >   struct memory_dev_type *mtype;
> >
> >   mutex_lock(_memory_type_lock);
> could use
>
> guard(mutex)(_memory_type_lock);
> return mt_find_alloc_memory_type(adist, _memory_types);
>

I will change it accordingly.

> I'm fine if you ignore this comment though as may be other functions in
> here that could take advantage of the cleanup.h stuff in a future patch.
>
> > - list_for_each_entry(mtype, _memory_types, list) {
> > - if (mtype->adistance == adist) {
> > - found = true;
> > - break;
> > - }
> > - }
> > - if (!found) {
> > - mtype = alloc_memory_type(adist);
> > - if (!IS_ERR(mtype))
> > - list_add(>list, _memory_types);
> > - }
> > + mtype = mt_find_alloc_memory_type(adist, _memory_types);
> >   mutex_unlock(_memory_type_lock);
> >
> >   return mtype;
>
> > diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
> > index 69e781900082..a44c03c2ba3a 100644
> > --- a/include/linux/memory-tiers.h
> > +++ b/include/linux/memory-tiers.h
> > @@ -48,6 +48,9 @@ int mt_calc_adistance(int node, int *adist);
> >  int mt_set_default_dram_perf(int nid, struct access_coordinate *perf,
> >const char *source);
> >  int mt_perf_to_adistance(struct access_coordinate *perf, int *adist);
> > +struct memory_dev_type *mt_find_alloc_memory_type(int adist,
> > + struct list_head 
> > *memory_types);
>
> That indent looks unusual.  Align the start of struct with start of int.
>

I can make this aligned but it will show another warning:
"WARNING: line length of 131 exceeds 100 columns"
Is this ok?

> > +void mt_put_memory_types(struct list_head *memory_types);
> >  #ifdef CONFIG_MIGRATION
> >  int next_demotion_node(int node);
> >  void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *targets);
> > @@ -136,5 +139,15 @@ static inline int mt_perf_to_adistance(struct 
> > access_coordinate *perf, int *adis
> >  {
> >   return -EIO;
> >  }
> > +
> > +struct memory_dev_type *mt_find_alloc_memory_type(int adist, struct 
> > list_head *memory_types)
> > +{
> > + return NULL;
> > +}
> > +
> > +void mt_put_memory_types(struct list_head *memory_types)
> > +{
> > +
> No blank line needed here.

Will fix.

> > +}
> >  #endif   /* CONFIG_NUMA */
> >  #endif  /* _LINUX_MEMORY_TIERS_H */
> > diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
> > index 0537664620e5..974af10cfdd8 100644
> > --- a/mm/memory-tiers.c
> > +++ b/mm/memory-tiers.c
> > @@ -623,6 +623,38 @@ void clear_node_memory_type(int node, struct 
> > memory_dev_type *memtype)
> >  }
> >  EXPORT_SYMBOL_GPL(clear_node_memory_type);
> >
> > +struct memory_dev_type *mt_find_alloc_memory_type(int adist, struct 
> > list_head *memory_types)
>
> Breaking this out as a separate function provides opportunity to improve it.
> Maybe a follow up patch makes sense given it would no longer be a straight
> forward code move.  However in my view it would be simple enough to be obvious
> even within this patch.
>

I will just keep this as is for now to minimize the changes aka mistakes.

> > +{
> > + bool found = false;
> > + struct memory_dev_type *mtype;
> > +
> > + list_for_each_entry(mtype, memory_types, list) {
> > + if (mtype->adistance == adist) {
> > + found = true;
>
> Why not return here?
> return mtype;
>

Yes, I can return here. I will do that and take care of the ptr
returning at this point.

> > +

Re: 回复：回复：general protection fault in refill_obj_stock

2024-04-03 Thread Roman Gushchin

On Tue, Apr 02, 2024 at 02:14:58PM +0800, Ubisectech Sirius wrote:
> > On Tue, Apr 02, 2024 at 09:50:54AM +0800, Ubisectech Sirius wrote:
> >>> On Mon, Apr 01, 2024 at 03:04:46PM +0800, Ubisectech Sirius wrote:
> >>> Hello.
> >>> We are Ubisectech Sirius Team, the vulnerability lab of China ValiantSec. 
> >>> Recently, our team has discovered a issue in Linux kernel 6.7. Attached 
> >>> to the email were a PoC file of the issue.
> >>
> >>> Thank you for the report!
> >>
> >>> I tried to compile and run your test program for about half an hour
> >>> on a virtual machine running 6.7 with enabled KASAN, but wasn't able
> >>> to reproduce the problem.
> >> 
> >>> Can you, please, share a bit more information? How long does it take
> >>> to reproduce? Do you mind sharing your kernel config? Is there anything 
> >>> special
> >>> about your setup? What are exact steps to reproduce the problem?
> >>> Is this problem reproducible on 6.6?
> >> 
> >> Hi. 
> >> The .config of linux kernel 6.7 has send to you as attachment.
> > Thanks!
> > How long it takes to reproduce a problem? Do you just start your reproducer 
> > and wait?
> I just start the reproducer and wait without any other operation. The speed 
> of reproducing this problem is vary fast(Less than 5 seconds). 
> >> And The problem is reproducible on 6.6.
> > Hm, it rules out my recent changes.
> > Did you try any older kernels? 6.5? 6.0? Did you try to bisect the problem?
> > if it's fast to reproduce, it might be the best option.
> I have try the 6.0, 6.3, 6.4, 6.5 kernel. The Linux kernel 6.5 will get same 
> error output. But other version will get different output like below:
> [ 55.306672][ T7950] KASAN: null-ptr-deref in range 
> [0x0018-0x001f]
> [ 55.307259][ T7950] CPU: 1 PID: 7950 Comm: poc Not tainted 6.3.0 #1
> [ 55.307714][ T7950] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), 
> BIOS 1.15.0-1 04/01/2014
> [ 55.308363][ T7950] RIP: 0010:tomoyo_check_acl (security/tomoyo/domain.c:173)
> [ 55.316475][ T7950] Call Trace:
> [ 55.316713][ T7950] 
> [ 55.317353][ T7950] tomoyo_path_permission (security/tomoyo/file.c:170 
> security/tomoyo/file.c:587 security/tomoyo/file.c:573)
> [ 55.317744][ T7950] tomoyo_check_open_permission (security/tomoyo/file.c:779)
> [ 55.320152][ T7950] tomoyo_file_open (security/tomoyo/tomoyo.c:332 
> security/tomoyo/tomoyo.c:327)
> [ 55.320495][ T7950] security_file_open (security/security.c:1719 
> (discriminator 13))
> [ 55.320850][ T7950] do_dentry_open (fs/open.c:908)
> [ 55.321526][ T7950] path_openat (fs/namei.c:3561 fs/namei.c:3715)
> [ 55.322614][ T7950] do_filp_open (fs/namei.c:3743)
> [ 55.325086][ T7950] do_sys_openat2 (fs/open.c:1349)
> [ 55.326249][ T7950] __x64_sys_openat (fs/open.c:1375)
> [ 55.327428][ T7950] do_syscall_64 (arch/x86/entry/common.c:50 
> arch/x86/entry/common.c:80)
> [ 55.327756][ T7950] entry_SYSCALL_64_after_hwframe 
> (arch/x86/entry/entry_64.S:120)
> [ 55.328185][ T7950] RIP: 0033:0x7f1c4a484f29
> [ 55.328504][ T7950] Code: 00 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 
> 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 
> <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 37 8f 0d 00 f7 d8 64 89 01 48
> [ 55.329864][ T7950] RSP: 002b:7ffd7bfe8398 EFLAGS: 0246 ORIG_RAX: 
> 0101
> [ 55.330464][ T7950] RAX: ffda RBX:  RCX: 
> 7f1c4a484f29
> [ 55.331024][ T7950] RDX: 00141842 RSI: 2380 RDI: 
> ff9c
> [ 55.331585][ T7950] RBP: 7ffd7bfe83a0 R08:  R09: 
> 7ffd7bfe83f0
> [ 55.332148][ T7950] R10:  R11: 0246 R12: 
> 55c5e36482d0
> [ 55.332707][ T7950] R13:  R14:  R15: 
> 
> [ 55.333268][ T7950] 
> [ 55.333488][ T7950] Modules linked in:
> [ 55.340525][ T7950] ---[ end trace  ]---
> [ 55.340936][ T7950] RIP: 0010:tomoyo_check_acl (security/tomoyo/domain.c:173)
> It look like other problem?

It does look differently.

I can't reproduce any of those. I run into some build time issues when trying to
build the kernel with your config (I have a fairly old toolchain, maybe it's 
the reason),
but when running a more minimalistic config I do not see any issues on 6.1, 6.6 
and 6.7.
Is this some sort of all-yes config or it's somehow specially crafted? Did you 
try
to reproduce the problem with other kernel configs?

It all smells a memory corruption, but who knows.

Thanks!

RE: [PATCH bpf-next] rethook: Remove warning messages printed for finding return address of a frame.

2024-04-03 Thread John Fastabend

Kui-Feng Lee wrote:
> rethook_find_ret_addr() prints a warning message and returns 0 when the
> target task is running and not the "current" task to prevent returning an
> incorrect return address. However, this check is incomplete as the target
> task can still transition to the running state when finding the return
> address, although it is safe with RCU.
> 
> The issue we encounter is that the kernel frequently prints warning
> messages when BPF profiling programs call to bpf_get_task_stack() on
> running tasks.
> 
> The callers should be aware and willing to take the risk of receiving an
> incorrect return address from a task that is currently running other than
> the "current" one. A warning is not needed here as the callers are intent
> on it.
> 
> Signed-off-by: Kui-Feng Lee 
> ---
>  kernel/trace/rethook.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/kernel/trace/rethook.c b/kernel/trace/rethook.c
> index fa03094e9e69..4297a132a7ae 100644
> --- a/kernel/trace/rethook.c
> +++ b/kernel/trace/rethook.c
> @@ -248,7 +248,7 @@ unsigned long rethook_find_ret_addr(struct task_struct 
> *tsk, unsigned long frame
>   if (WARN_ON_ONCE(!cur))
>   return 0;
>  
> - if (WARN_ON_ONCE(tsk != current && task_is_running(tsk)))
> + if (tsk != current && task_is_running(tsk))
>   return 0;
>  
>   do {
> -- 
> 2.34.1
> 
> 

Acked-by: John Fastabend

[PATCH v3 2/2] rethook: honor CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING in rethook_try_get()

2024-04-03 Thread Andrii Nakryiko

Take into account CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING when validating
that RCU is watching when trying to setup rethooko on a function entry.

This further (in addition to improvements in the previous patch)
improves BPF multi-kretprobe (which rely on rethook) runtime throughput
by 2.3%, according to BPF benchmarks ([0]).

  [0] 
https://lore.kernel.org/bpf/caef4bzauq2wkmjzdc9s0rbwa01bybgwhn6andxqshyia47p...@mail.gmail.com/

Signed-off-by: Andrii Nakryiko 
---
 kernel/trace/rethook.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/kernel/trace/rethook.c b/kernel/trace/rethook.c
index fa03094e9e69..15b8aa4048d9 100644
--- a/kernel/trace/rethook.c
+++ b/kernel/trace/rethook.c
@@ -166,6 +166,7 @@ struct rethook_node *rethook_try_get(struct rethook *rh)
if (unlikely(!handler))
return NULL;
 
+#ifdef CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING
/*
 * This expects the caller will set up a rethook on a function entry.
 * When the function returns, the rethook will eventually be reclaimed
@@ -174,6 +175,7 @@ struct rethook_node *rethook_try_get(struct rethook *rh)
 */
if (unlikely(!rcu_is_watching()))
return NULL;
+#endif
 
return (struct rethook_node *)objpool_pop(>pool);
 }
-- 
2.43.0

[PATCH v3 1/2] ftrace: make extra rcu_is_watching() validation check optional

2024-04-03 Thread Andrii Nakryiko

Introduce CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING config option to
control whether ftrace low-level code performs additional
rcu_is_watching()-based validation logic in an attempt to catch noinstr
violations.

This check is expected to never be true and is mostly useful for
low-level validation of ftrace subsystem invariants. For most users it
should probably be kept disabled to eliminate unnecessary runtime
overhead.

This improves BPF multi-kretprobe (relying on ftrace and rethook
infrastructure) runtime throughput by 2%, according to BPF benchmarks ([0]).

  [0] 
https://lore.kernel.org/bpf/caef4bzauq2wkmjzdc9s0rbwa01bybgwhn6andxqshyia47p...@mail.gmail.com/

Cc: Steven Rostedt 
Cc: Masami Hiramatsu 
Cc: Paul E. McKenney 
Signed-off-by: Andrii Nakryiko 
---
 include/linux/trace_recursion.h |  2 +-
 kernel/trace/Kconfig| 13 +
 2 files changed, 14 insertions(+), 1 deletion(-)

diff --git a/include/linux/trace_recursion.h b/include/linux/trace_recursion.h
index d48cd92d2364..24ea8ac049b4 100644
--- a/include/linux/trace_recursion.h
+++ b/include/linux/trace_recursion.h
@@ -135,7 +135,7 @@ extern void ftrace_record_recursion(unsigned long ip, 
unsigned long parent_ip);
 # define do_ftrace_record_recursion(ip, pip)   do { } while (0)
 #endif
 
-#ifdef CONFIG_ARCH_WANTS_NO_INSTR
+#ifdef CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING
 # define trace_warn_on_no_rcu(ip)  \
({  \
bool __ret = !rcu_is_watching();\
diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig
index 61c541c36596..7aebd1b8f93e 100644
--- a/kernel/trace/Kconfig
+++ b/kernel/trace/Kconfig
@@ -974,6 +974,19 @@ config FTRACE_RECORD_RECURSION_SIZE
  This file can be reset, but the limit can not change in
  size at runtime.
 
+config FTRACE_VALIDATE_RCU_IS_WATCHING
+   bool "Validate RCU is on during ftrace execution"
+   depends on FUNCTION_TRACER
+   depends on ARCH_WANTS_NO_INSTR
+   help
+ All callbacks that attach to the function tracing have some sort of
+ protection against recursion. This option is only to verify that
+ ftrace (and other users of ftrace_test_recursion_trylock()) are not
+ called outside of RCU, as if they are, it can cause a race. But it
+ also has a noticeable overhead when enabled.
+
+ If unsure, say N
+
 config RING_BUFFER_RECORD_RECURSION
bool "Record functions that recurse in the ring buffer"
depends on FTRACE_RECORD_RECURSION
-- 
2.43.0

Re: [PATCH RFC ftrace] Asynchronous grace period for register_ftrace_direct()

2024-04-03 Thread Paul E. McKenney

On Wed, Apr 03, 2024 at 03:29:12PM -0400, Steven Rostedt wrote:
> On Wed, 3 Apr 2024 11:53:14 -0700
> "Paul E. McKenney"  wrote:
> 
> > @@ -5366,6 +5366,13 @@ static void remove_direct_functions_hash(struct 
> > ftrace_hash *hash, unsigned long
> > }
> >  }
> >  
> > +static void register_ftrace_direct_cb(struct rcu_head *rhp)
> > +{
> > +   struct ftrace_hash *fhp = container_of(rhp, struct ftrace_hash, rcu);
> > +
> > +   free_ftrace_hash(fhp);
> > +}
> > +
> >  /**
> >   * register_ftrace_direct - Call a custom trampoline directly
> >   * for multiple functions registered in @ops
> > @@ -5464,10 +5471,8 @@ int register_ftrace_direct(struct ftrace_ops *ops, 
> > unsigned long addr)
> >   out_unlock:
> > mutex_unlock(_mutex);
> >  
> > -   if (free_hash && free_hash != EMPTY_HASH) {
> > -   synchronize_rcu_tasks();
> > -   free_ftrace_hash(free_hash);
> > -   }
> > +   if (free_hash && free_hash != EMPTY_HASH)
> > +   call_rcu_tasks(_hash->rcu, register_ftrace_direct_cb);
> >  
> > if (new_hash)
> > free_ftrace_hash(new_hash);
> 
> I'm getting ready to go on PTO, but a quick glance doesn't look like this
> would cause any harm.

It is not urgent, as in v6.10 merge window at the earliest.  So if I
don't hear from you in a few weeks, I will re-send.  ;-)

Thanx, Paul

Re: [PATCH v3 24/25] drivers: media: i2c: imx258: Add support for reset gpio

2024-04-03 Thread Luigi311

On 4/3/24 11:03, Ondřej Jirman wrote:
> Hi,
> 
> On Wed, Apr 03, 2024 at 04:28:59PM GMT, Sakari Ailus wrote:
>> Hi Luis,
>>
>> Could you unify the subject prefix for the driver patches, please? E.g.
>> "media: imx258: " would be fine.
>>
>> On Wed, Apr 03, 2024 at 09:03:53AM -0600, g...@luigi311.com wrote:
>>> From: Luis Garcia 
>>>
>>> It was documented in DT, but not implemented.
>>>
>>> Signed-off-by: Ondrej Jirman 
>>> Signed-off-by: Luis Garcia 
>>> ---
>>>  drivers/media/i2c/imx258.c | 14 +-
>>>  1 file changed, 13 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/media/i2c/imx258.c b/drivers/media/i2c/imx258.c
>>> index 163f04f6f954..4c117c4829f1 100644
>>> --- a/drivers/media/i2c/imx258.c
>>> +++ b/drivers/media/i2c/imx258.c
>>> @@ -680,6 +680,7 @@ struct imx258 {
>>> unsigned int csi2_flags;
>>>  
>>> struct gpio_desc *powerdown_gpio;
>>> +   struct gpio_desc *reset_gpio;
>>>  
>>> /*
>>>  * Mutex for serialized access:
>>> @@ -1232,7 +1233,11 @@ static int imx258_power_on(struct device *dev)
>>> regulator_bulk_disable(IMX258_NUM_SUPPLIES, imx258->supplies);
>>> }
>>>  
>>> -   return ret;
>>> +   gpiod_set_value_cansleep(imx258->reset_gpio, 0);
>>> +
>>> +   usleep_range(400, 500);
>>
>> You could mention this at least in the commit message.
> 
> This is T6 in the datasheet: https://megous.com/dl/tmp/92c9223ce877216e.png
> 
> 
>>> +
>>> +   return 0;
>>>  }
>>>  
>>>  static int imx258_power_off(struct device *dev)
>>> @@ -1243,6 +1248,7 @@ static int imx258_power_off(struct device *dev)
>>> clk_disable_unprepare(imx258->clk);
>>> regulator_bulk_disable(IMX258_NUM_SUPPLIES, imx258->supplies);
>>>  
>>> +   gpiod_set_value_cansleep(imx258->reset_gpio, 1);
>>
>> Same question than on the other GPIO: does this belong here?
> 
> No, this should be before the regulator_bulk_disable.
> 
> See: https://megous.com/dl/tmp/c96180b23d7ce63a.png
> 
> kind regards,
>   o.
> 

Since I'm supposed to move the reset up should I also
move the power up with it to match your downstream
driver?

>>> gpiod_set_value_cansleep(imx258->powerdown_gpio, 1);
>>>  
>>> return 0;
>>> @@ -1554,6 +1560,12 @@ static int imx258_probe(struct i2c_client *client)
>>> if (IS_ERR(imx258->powerdown_gpio))
>>> return PTR_ERR(imx258->powerdown_gpio);
>>>  
>>> +   /* request optional reset pin */
>>> +   imx258->reset_gpio = devm_gpiod_get_optional(>dev, "reset",
>>> +   GPIOD_OUT_HIGH);
>>> +   if (IS_ERR(imx258->reset_gpio))
>>> +   return PTR_ERR(imx258->reset_gpio);
>>> +
>>> /* Initialize subdev */
>>> v4l2_i2c_subdev_init(>sd, client, _subdev_ops);
>>>  
>>
>> -- 
>> Regards,
>>
>> Sakari Ailus

Re: [PATCH v3 23/25] drivers: media: i2c: imx258: Add support for powerdown gpio

2024-04-03 Thread Luigi311

On 4/3/24 10:57, Ondřej Jirman wrote:
> Hi Sakari and Luis,
> 
> On Wed, Apr 03, 2024 at 04:25:41PM GMT, Sakari Ailus wrote:
>> Hi Luis, Ondrej,
>>
>> On Wed, Apr 03, 2024 at 09:03:52AM -0600, g...@luigi311.com wrote:
>>> From: Luis Garcia 
>>>
>>> On some boards powerdown signal needs to be deasserted for this
>>> sensor to be enabled.
>>>
>>> Signed-off-by: Ondrej Jirman 
>>> Signed-off-by: Luis Garcia 
>>> ---
>>>  drivers/media/i2c/imx258.c | 13 +
>>>  1 file changed, 13 insertions(+)
>>>
>>> diff --git a/drivers/media/i2c/imx258.c b/drivers/media/i2c/imx258.c
>>> index 30352c33f63c..163f04f6f954 100644
>>> --- a/drivers/media/i2c/imx258.c
>>> +++ b/drivers/media/i2c/imx258.c
>>> @@ -679,6 +679,8 @@ struct imx258 {
>>> unsigned int lane_mode_idx;
>>> unsigned int csi2_flags;
>>>  
>>> +   struct gpio_desc *powerdown_gpio;
>>> +
>>> /*
>>>  * Mutex for serialized access:
>>>  * Protect sensor module set pad format and start/stop streaming safely.
>>> @@ -1213,6 +1215,8 @@ static int imx258_power_on(struct device *dev)
>>> struct imx258 *imx258 = to_imx258(sd);
>>> int ret;
>>>  
>>> +   gpiod_set_value_cansleep(imx258->powerdown_gpio, 0);
>>
>> What does the spec say? Should this really happen before switching on the
>> supplies below?
> 
> There's no powerdown input in the IMX258 manual. The manual only mentions
> that XCLR (reset) should be held low during power on.
> 
> https://megous.com/dl/tmp/15b0992a720ab82d.png
> 
> https://megous.com/dl/tmp/f2cc991046d97641.png
> 
>This sensor doesn’t have a built-in “Power ON Reset” function. The XCLR pin
>is set to “LOW” and the power supplies are brought up. Then the XCLR pin
>should be set to “High” after INCK supplied.
> 
> So this input is some feature on camera module itself outside of the
> IMX258 chip, which I think is used to gate power to the module. Eg. on 
> Pinephone
> Pro, there are two modules with shared power rails, so enabling supply to
> one module enables it to the other one, too. So this input becomes the only 
> way
> to really enable/disable power to the chip when both are used at once at some
> point, because regulator_bulk_enable/disable becomes ineffective at that 
> point.
> 
> Luis, maybe you saw some other datasheet that mentions this input? IMO,
> it just gates the power rails via some mosfets on the module itself, since
> there's not power down input to the chip itself.
> 
> kind regards,
>   o.
> 

Ondrej, I did not see anything else in the datasheet since I'm pretty sure
I'm looking at the same datasheet as it was supplied to me by Pine64. I'm
not sure what datasheet Dave has access to since he got his for a
completely different module than what we are testing with though.

>>> +
>>> ret = regulator_bulk_enable(IMX258_NUM_SUPPLIES,
>>> imx258->supplies);
>>> if (ret) {
>>> @@ -1224,6 +1228,7 @@ static int imx258_power_on(struct device *dev)
>>> ret = clk_prepare_enable(imx258->clk);
>>> if (ret) {
>>> dev_err(dev, "failed to enable clock\n");
>>> +   gpiod_set_value_cansleep(imx258->powerdown_gpio, 1);
>>> regulator_bulk_disable(IMX258_NUM_SUPPLIES, imx258->supplies);
>>> }
>>>  
>>> @@ -1238,6 +1243,8 @@ static int imx258_power_off(struct device *dev)
>>> clk_disable_unprepare(imx258->clk);
>>> regulator_bulk_disable(IMX258_NUM_SUPPLIES, imx258->supplies);
>>>  
>>> +   gpiod_set_value_cansleep(imx258->powerdown_gpio, 1);
>>> +
>>> return 0;
>>>  }
>>>  
>>> @@ -1541,6 +1548,12 @@ static int imx258_probe(struct i2c_client *client)
>>> if (!imx258->variant_cfg)
>>> imx258->variant_cfg = _cfg;
>>>  
>>> +   /* request optional power down pin */
>>> +   imx258->powerdown_gpio = devm_gpiod_get_optional(>dev, 
>>> "powerdown",
>>> +   GPIOD_OUT_HIGH);
>>> +   if (IS_ERR(imx258->powerdown_gpio))
>>> +   return PTR_ERR(imx258->powerdown_gpio);
>>> +
>>> /* Initialize subdev */
>>> v4l2_i2c_subdev_init(>sd, client, _subdev_ops);
>>>  
>>
>> -- 
>> Regards,
>>
>> Sakari Ailus

Re: [PATCH v3 22/25] dt-bindings: media: imx258: Add binding for powerdown-gpio

2024-04-03 Thread Luigi311

On 4/3/24 12:49, Pavel Machek wrote:
> Hi!
> 
>> From: Luis Garcia 
>>
>> Add powerdown-gpio binding as it is required for some boards
> 
> "." at end of sentence.
> 
>> Signed-off-by: Ondrej Jirman 
>> Signed-off-by: Luis Garcia 
> 
> If the patch is from Ondrej, he should be in From:, I believe.
> 
> Best regards,
>   Pavel

Ohh your right, this was broken out from another patch and didn't
set it to be his actual commit same with the other ones. They
were based on his downstream changes and just modified to match
up with what Dave had set the values too. I'll set it to him
for the next revision. Thanks!

Re: [PATCH RFC ftrace] Asynchronous grace period for register_ftrace_direct()

2024-04-03 Thread Steven Rostedt

On Wed, 3 Apr 2024 11:53:14 -0700
"Paul E. McKenney"  wrote:

> @@ -5366,6 +5366,13 @@ static void remove_direct_functions_hash(struct 
> ftrace_hash *hash, unsigned long
>   }
>  }
>  
> +static void register_ftrace_direct_cb(struct rcu_head *rhp)
> +{
> + struct ftrace_hash *fhp = container_of(rhp, struct ftrace_hash, rcu);
> +
> + free_ftrace_hash(fhp);
> +}
> +
>  /**
>   * register_ftrace_direct - Call a custom trampoline directly
>   * for multiple functions registered in @ops
> @@ -5464,10 +5471,8 @@ int register_ftrace_direct(struct ftrace_ops *ops, 
> unsigned long addr)
>   out_unlock:
>   mutex_unlock(_mutex);
>  
> - if (free_hash && free_hash != EMPTY_HASH) {
> - synchronize_rcu_tasks();
> - free_ftrace_hash(free_hash);
> - }
> + if (free_hash && free_hash != EMPTY_HASH)
> + call_rcu_tasks(_hash->rcu, register_ftrace_direct_cb);
>  
>   if (new_hash)
>   free_ftrace_hash(new_hash);

I'm getting ready to go on PTO, but a quick glance doesn't look like this
would cause any harm.

-- Steve

Re: [PATCH v3 21/25] drivers: media: i2c: imx258: Use macros

2024-04-03 Thread Luigi311

On 4/3/24 10:23, Sakari Ailus wrote:
> Hi Luis,
> 
> On Wed, Apr 03, 2024 at 09:03:50AM -0600, g...@luigi311.com wrote:
>> From: Luis Garcia 
>>
>> Use understandable macros instead of raw values.
>>
>> Signed-off-by: Ondrej Jirman 
>> Signed-off-by: Luis Garcia 
>> ---
>>  drivers/media/i2c/imx258.c | 434 ++---
>>  1 file changed, 207 insertions(+), 227 deletions(-)
>>
>> diff --git a/drivers/media/i2c/imx258.c b/drivers/media/i2c/imx258.c
>> index e2ecf6109516..30352c33f63c 100644
>> --- a/drivers/media/i2c/imx258.c
>> +++ b/drivers/media/i2c/imx258.c
>> @@ -33,8 +33,6 @@
>>  #define IMX258_VTS_30FPS_VGA0x034c
>>  #define IMX258_VTS_MAX  65525
>>  
>> -#define IMX258_REG_VTS  0x0340
>> -
>>  /* HBLANK control - read only */
>>  #define IMX258_PPL_DEFAULT  5352
>>  
>> @@ -90,6 +88,53 @@
>>  #define IMX258_PIXEL_ARRAY_WIDTH4208U
>>  #define IMX258_PIXEL_ARRAY_HEIGHT   3120U
>>  
>> +/* regs */
>> +#define IMX258_REG_PLL_MULT_DRIV  0x0310
>> +#define IMX258_REG_IVTPXCK_DIV0x0301
>> +#define IMX258_REG_IVTSYCK_DIV0x0303
>> +#define IMX258_REG_PREPLLCK_VT_DIV0x0305
>> +#define IMX258_REG_IOPPXCK_DIV0x0309
>> +#define IMX258_REG_IOPSYCK_DIV0x030b
>> +#define IMX258_REG_PREPLLCK_OP_DIV0x030d
>> +#define IMX258_REG_PHASE_PIX_OUTEN0x3030
>> +#define IMX258_REG_PDPIX_DATA_RATE0x3032
>> +#define IMX258_REG_SCALE_MODE 0x0401
>> +#define IMX258_REG_SCALE_MODE_EXT 0x3038
>> +#define IMX258_REG_AF_WINDOW_MODE 0x7bcd
>> +#define IMX258_REG_FRM_LENGTH_CTL 0x0350
>> +#define IMX258_REG_CSI_LANE_MODE  0x0114
>> +#define IMX258_REG_X_EVN_INC  0x0381
>> +#define IMX258_REG_X_ODD_INC  0x0383
>> +#define IMX258_REG_Y_EVN_INC  0x0385
>> +#define IMX258_REG_Y_ODD_INC  0x0387
>> +#define IMX258_REG_BINNING_MODE   0x0900
>> +#define IMX258_REG_BINNING_TYPE_V 0x0901
>> +#define IMX258_REG_FORCE_FD_SUM   0x300d
>> +#define IMX258_REG_DIG_CROP_X_OFFSET  0x0408
>> +#define IMX258_REG_DIG_CROP_Y_OFFSET  0x040a
>> +#define IMX258_REG_DIG_CROP_IMAGE_WIDTH   0x040c
>> +#define IMX258_REG_DIG_CROP_IMAGE_HEIGHT  0x040e
>> +#define IMX258_REG_SCALE_M0x0404
>> +#define IMX258_REG_X_OUT_SIZE 0x034c
>> +#define IMX258_REG_Y_OUT_SIZE 0x034e
>> +#define IMX258_REG_X_ADD_STA  0x0344
>> +#define IMX258_REG_Y_ADD_STA  0x0346
>> +#define IMX258_REG_X_ADD_END  0x0348
>> +#define IMX258_REG_Y_ADD_END  0x034a
>> +#define IMX258_REG_EXCK_FREQ  0x0136
>> +#define IMX258_REG_CSI_DT_FMT 0x0112
>> +#define IMX258_REG_LINE_LENGTH_PCK0x0342
>> +#define IMX258_REG_SCALE_M_EXT0x303a
>> +#define IMX258_REG_FRM_LENGTH_LINES   0x0340
>> +#define IMX258_REG_FINE_INTEG_TIME0x0200
>> +#define IMX258_REG_PLL_IVT_MPY0x0306
>> +#define IMX258_REG_PLL_IOP_MPY0x030e
>> +#define IMX258_REG_REQ_LINK_BIT_RATE_MBPS_H   0x0820
>> +#define IMX258_REG_REQ_LINK_BIT_RATE_MBPS_L   0x0822
>> +
>> +#define REG8(a, v) { a, v }
>> +#define REG16(a, v) { a, ((v) >> 8) & 0xff }, { (a) + 1, (v) & 0xff }
> 
> The patch is nice but these macros are better replaced by the V4L2 CCI
> helper that also offers register access functions. Could you add a patch to
> convert the driver to use it (maybe after this one)?
> 

Ohh perfect, using something else would be great. Ill go ahead and see
if I can get that working.

>> +
>>  struct imx258_reg {
>>  u16 address;
>>  u8 val;
>> @@ -145,179 +190,139 @@ struct imx258_mode {
>>   * lane data rate when using 2 lanes, thus allowing a maximum of 15fps.
>>   */
>>  static const struct imx258_reg mipi_1267mbps_19_2mhz_2l[] = {
>> -{ 0x0136, 0x13 },
>> -{ 0x0137, 0x33 },
>> -{ 0x0301, 0x0A },
>> -{ 0x0303, 0x02 },
>> -{ 0x0305, 0x03 },
>> -{ 0x0306, 0x00 },
>> -{ 0x0307, 0xC6 },
>> -{ 0x0309, 0x0A },
>> -{ 0x030B, 0x01 },
>> -{ 0x030D, 0x02 },
>> -{ 0x030E, 0x00 },
>> -{ 0x030F, 0xD8 },
>> -{ 0x0310, 0x00 },
>> -
>> -{ 0x0114, 0x01 },
>> -{ 0x0820, 0x09 },
>> -{ 0x0821, 0xa6 },
>> -{ 0x0822, 0x66 },
>> -{ 0x0823, 0x66 },
>> +REG16(IMX258_REG_EXCK_FREQ, 0x1333),
>> +REG8(IMX258_REG_IVTPXCK_DIV, 10),
>> +REG8(IMX258_REG_IVTSYCK_DIV, 2),
>> +REG8(IMX258_REG_PREPLLCK_VT_DIV, 3),
>> +REG16(IMX258_REG_PLL_IVT_MPY, 0x00C6),
>> +REG8(IMX258_REG_IOPPXCK_DIV, 10),
>> +

[PATCH fs/proc/bootconfig] remove redundant comments from /proc/bootconfig

2024-04-03 Thread Paul E. McKenney

commit 717c7c894d4b ("fs/proc: Add boot loader arguments as comment to
/proc/bootconfig") adds bootloader argument comments into /proc/bootconfig.

/proc/bootconfig shows boot_command_line[] multiple times following
every xbc key value pair, that's duplicated and not necessary.
Remove redundant ones.

Output before and after the fix is like:
key1 = value1
*bootloader argument comments*
key2 = value2
*bootloader argument comments*
key3 = value3
*bootloader argument comments*
...

key1 = value1
key2 = value2
key3 = value3
*bootloader argument comments*
...

Fixes: 717c7c894d4b ("fs/proc: Add boot loader arguments as comment to 
/proc/bootconfig")
Signed-off-by: Zhenhua Huang 
Signed-off-by: Paul E. McKenney 
Cc: Masami Hiramatsu 
Cc: 
Cc: 

diff --git a/fs/proc/bootconfig.c b/fs/proc/bootconfig.c
index 902b326e1e560..e5635a6b127b0 100644
--- a/fs/proc/bootconfig.c
+++ b/fs/proc/bootconfig.c
@@ -62,12 +62,12 @@ static int __init copy_xbc_key_value_list(char *dst, size_t 
size)
break;
dst += ret;
}
-   if (ret >= 0 && boot_command_line[0]) {
-   ret = snprintf(dst, rest(dst, end), "# Parameters from 
bootloader:\n# %s\n",
-  boot_command_line);
-   if (ret > 0)
-   dst += ret;
-   }
+   }
+   if (ret >= 0 && boot_command_line[0]) {
+   ret = snprintf(dst, rest(dst, end), "# Parameters from 
bootloader:\n# %s\n",
+  boot_command_line);
+   if (ret > 0)
+   dst += ret;
}
 out:
kfree(key);

[PATCH RFC ftrace] Asynchronous grace period for register_ftrace_direct()

2024-04-03 Thread Paul E. McKenney

Note that the immediate pressure for this patch should be relieved by the
NAPI patch series [1], but this sort of problem could easily arise again.
So would this change make sense?

When running heavy test workloads with KASAN enabled, RCU Tasks grace
periods can extend for many tens of seconds, significantly slowing
trace registration.  Therefore, make the registration-side RCU Tasks
grace period be asynchronous via call_rcu_tasks().

Reported-by: Jakub Kicinski 
Reported-by: Alexei Starovoitov 
Reported-by: Chris Mason 
Signed-off-by: Paul E. McKenney 
Cc: Steven Rostedt 
Cc: Masami Hiramatsu 
Cc: Mark Rutland 
Cc: Mathieu Desnoyers 
Cc: 

diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c
index da1710499698b..6d21e4e97ed48 100644
--- a/kernel/trace/ftrace.c
+++ b/kernel/trace/ftrace.c
@@ -5366,6 +5366,13 @@ static void remove_direct_functions_hash(struct 
ftrace_hash *hash, unsigned long
}
 }
 
+static void register_ftrace_direct_cb(struct rcu_head *rhp)
+{
+   struct ftrace_hash *fhp = container_of(rhp, struct ftrace_hash, rcu);
+
+   free_ftrace_hash(fhp);
+}
+
 /**
  * register_ftrace_direct - Call a custom trampoline directly
  * for multiple functions registered in @ops
@@ -5464,10 +5471,8 @@ int register_ftrace_direct(struct ftrace_ops *ops, 
unsigned long addr)
  out_unlock:
mutex_unlock(_mutex);
 
-   if (free_hash && free_hash != EMPTY_HASH) {
-   synchronize_rcu_tasks();
-   free_ftrace_hash(free_hash);
-   }
+   if (free_hash && free_hash != EMPTY_HASH)
+   call_rcu_tasks(_hash->rcu, register_ftrace_direct_cb);
 
if (new_hash)
free_ftrace_hash(new_hash);

Re: [PATCH] uprobes: reduce contention on uprobes_tree access

2024-04-03 Thread Andrii Nakryiko

On Wed, Apr 3, 2024 at 4:05 AM Jonthan Haslam  wrote:
>
> > > > > Given the discussion around per-cpu rw semaphore and need for
> > > > > (internal) batched attachment API for uprobes, do you think you can
> > > > > apply this patch as is for now? We can then gain initial improvements
> > > > > in scalability that are also easy to backport, and Jonathan will work
> > > > > on a more complete solution based on per-cpu RW semaphore, as
> > > > > suggested by Ingo.
> > > >
> > > > Yeah, it is interesting to use per-cpu rw semaphore on uprobe.
> > > > I would like to wait for the next version.
> > >
> > > My initial tests show a nice improvement on the over RW spinlocks but
> > > significant regression in acquiring a write lock. I've got a few days
> > > vacation over Easter but I'll aim to get some more formalised results out
> > > to the thread toward the end of next week.
> >
> > As far as the write lock is only on the cold path, I think you can choose
> > per-cpu RW semaphore. Since it does not do busy wait, the total system
> > performance impact will be small.
> > I look forward to your formalized results :)
>
> Sorry for the delay in getting back to you on this Masami.
>
> I have used one of the bpf selftest benchmarks to provide some form of
> comparison of the 3 different approaches (spinlock, RW spinlock and
> per-cpu RW semaphore). The benchmark used here is the 'trig-uprobe-nop'
> benchmark which just executes a single uprobe with a minimal bpf program
> attached. The tests were done on a 32 core qemu/kvm instance.
>

Thanks a lot for running benchmarks and providing results!

> Things to note about the results:
>
> - The results are slightly variable so don't get too caught up on
>   individual thread count - it's the trend that is important.
> - In terms of throughput with this specific benchmark a *very* macro view
>   is that the RW spinlock provides 40-60% more throughput than the
>   spinlock.  The per-CPU RW semaphore provides in the order of 50-100%
>   more throughput then the spinlock.
> - This doesn't fully reflect the large reduction in latency that we have
>   seen in application based measurements. However, it does demonstrate
>   that even the trivial change of going to a RW spinlock provides
>   significant benefits.

This is probably because trig-uprobe-nop creates a single uprobe that
is triggered on many CPUs. While in production we have also *many*
uprobes running on many CPUs. In this benchmark, besides contention on
uprobes_treelock, we are also hammering on other per-uprobe locks
(register_rwsem, also if you don't have [0] patch locally, there will
be another filter lock taken each time, filter->rwlock). There is also
atomic refcounting going on, which when you have the same uprobe
across all CPUs at the same time will cause a bunch of cache line
bouncing.

So yes, it's understandable that in practice in production you see an
even larger effect of optimizing uprobe_treelock than in this
micro-benchmark.

  [0] 
https://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace.git/commit/?h=probes/for-next=366f7afd3de31d3ce2f4cbff97c6c23b6aa6bcdf

>
> I haven't included the measurements on per-CPU RW semaphore write
> performance as they are completely in line with those that Paul McKenney
> posted on his journal [0]. On a 32 core system I see semaphore writes to
> take in the order of 25-28 millisecs - the cost of the synchronize_rcu().
>
> Each block of results below show 1 line per execution of the benchmark (the
> "Summary" line) and each line is a run with one more thread added - a
> thread is a "producer". The lines are edited to remove extraneous output
> that adds no value here.
>
> The tests were executed with this driver script:
>
> for num_threads in {1..20}
> do
> sudo ./bench -p $num_threads trig-uprobe-nop | grep Summary

just want to mention -a (affinity) option that you can pass a bench
tool, it will pin each thread on its own CPU. It generally makes tests
more uniform, eliminating CPU migrations variability.

> done
>
>
> spinlock
>
> Summary: hits1.453 ± 0.005M/s (  1.453M/prod)
> Summary: hits2.087 ± 0.005M/s (  1.043M/prod)
> Summary: hits2.701 ± 0.012M/s (  0.900M/prod)

I also wanted to point out that the first measurement (1.453M/s in
this row) is total throughput across all threads, while value in
parenthesis (0.900M/prod) is averaged throughput per each thread. So
this M/prod value is the most interesting in this benchmark where we
assess the effect of reducing contention.

> Summary: hits1.917 ± 0.011M/s (  0.479M/prod)
> Summary: hits2.105 ± 0.003M/s (  0.421M/prod)
> Summary: hits1.615 ± 0.006M/s (  0.269M/prod)

[...]

Re: [PATCH v10 2/2] memory tier: create CPUless memory tiers after obtaining HMAT info

2024-04-03 Thread Jonathan Cameron

A few minor comments inline.

> diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
> index a44c03c2ba3a..16769552a338 100644
> --- a/include/linux/memory-tiers.h
> +++ b/include/linux/memory-tiers.h
> @@ -140,12 +140,13 @@ static inline int mt_perf_to_adistance(struct 
> access_coordinate *perf, int *adis
>   return -EIO;
>  }
>  
> -struct memory_dev_type *mt_find_alloc_memory_type(int adist, struct 
> list_head *memory_types)
> +static inline struct memory_dev_type *mt_find_alloc_memory_type(int adist,
> + struct list_head *memory_types)
>  {
>   return NULL;
>  }
>  
> -void mt_put_memory_types(struct list_head *memory_types)
> +static inline void mt_put_memory_types(struct list_head *memory_types)
>  {
Why in this patch and not previous one?
>  
>  }
> diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
> index 974af10cfdd8..44fa10980d37 100644
> --- a/mm/memory-tiers.c
> +++ b/mm/memory-tiers.c
> @@ -36,6 +36,11 @@ struct node_memory_type_map {
>  
>  static DEFINE_MUTEX(memory_tier_lock);
>  static LIST_HEAD(memory_tiers);
> +/*
> + * The list is used to store all memory types that are not created
> + * by a device driver.
> + */
> +static LIST_HEAD(default_memory_types);
>  static struct node_memory_type_map node_memory_types[MAX_NUMNODES];
>  struct memory_dev_type *default_dram_type;
>  
> @@ -108,6 +113,8 @@ static struct demotion_nodes *node_demotion __read_mostly;
>  
>  static BLOCKING_NOTIFIER_HEAD(mt_adistance_algorithms);
>  
> +/* The lock is used to protect `default_dram_perf*` info and nid. */
> +static DEFINE_MUTEX(default_dram_perf_lock);
>  static bool default_dram_perf_error;
>  static struct access_coordinate default_dram_perf;
>  static int default_dram_perf_ref_nid = NUMA_NO_NODE;
> @@ -505,7 +512,8 @@ static inline void __init_node_memory_type(int node, 
> struct memory_dev_type *mem
>  static struct memory_tier *set_node_memory_tier(int node)
>  {
>   struct memory_tier *memtier;
> - struct memory_dev_type *memtype;
> + struct memory_dev_type *mtype = default_dram_type;

Does the rename add anything major to the patch?
If not I'd leave it alone to reduce the churn and give
a more readable patch.  If it is worth doing perhaps
a precursor patch?

> + int adist = MEMTIER_ADISTANCE_DRAM;
>   pg_data_t *pgdat = NODE_DATA(node);
>  
>  
> @@ -514,11 +522,20 @@ static struct memory_tier *set_node_memory_tier(int 
> node)
>   if (!node_state(node, N_MEMORY))
>   return ERR_PTR(-EINVAL);
>  
> - __init_node_memory_type(node, default_dram_type);
> + mt_calc_adistance(node, );
> + if (node_memory_types[node].memtype == NULL) {
> + mtype = mt_find_alloc_memory_type(adist, _memory_types);
> + if (IS_ERR(mtype)) {
> + mtype = default_dram_type;
> + pr_info("Failed to allocate a memory type. Fall 
> back.\n");
> + }
> + }
> +
> + __init_node_memory_type(node, mtype);
>  
> - memtype = node_memory_types[node].memtype;
> - node_set(node, memtype->nodes);
> - memtier = find_create_memory_tier(memtype);
> + mtype = node_memory_types[node].memtype;
> + node_set(node, mtype->nodes);
> + memtier = find_create_memory_tier(mtype);
>   if (!IS_ERR(memtier))
>   rcu_assign_pointer(pgdat->memtier, memtier);
>   return memtier;
> @@ -655,6 +672,33 @@ void mt_put_memory_types(struct list_head *memory_types)
>  }
>  EXPORT_SYMBOL_GPL(mt_put_memory_types);
>  
> +/*
> + * This is invoked via `late_initcall()` to initialize memory tiers for
> + * CPU-less memory nodes after driver initialization, which is
> + * expected to provide `adistance` algorithms.
> + */
> +static int __init memory_tier_late_init(void)
> +{
> + int nid;
> +
> + mutex_lock(_tier_lock);
> + for_each_node_state(nid, N_MEMORY)
> + if (node_memory_types[nid].memtype == NULL)
> + /*
> +  * Some device drivers may have initialized memory tiers
> +  * between `memory_tier_init()` and 
> `memory_tier_late_init()`,
> +  * potentially bringing online memory nodes and
> +  * configuring memory tiers. Exclude them here.
> +  */

Does the comment refer to this path, or to ones where memtype is set?

> + set_node_memory_tier(nid);

Given the large comment I would add {} to help with readability.
You could flip the logic to reduce indent
for_each_node_state(nid, N_MEMORY) {
if (node_memory_types[nid].memtype)
continue;
/*
 * Some device drivers may have initialized memory tiers
 * between `memory_tier_init()` and `memory_tier_late_init()`,
 * potentially bringing online memory nodes and
 * configuring memory tiers. Exclude them

Re: [PATCHv2 1/3] uprobe: Add uretprobe syscall to speed up return probe

2024-04-03 Thread Andrii Nakryiko

On Wed, Apr 3, 2024 at 7:09 AM Masami Hiramatsu  wrote:
>
> On Wed, 3 Apr 2024 11:47:41 +0200
> Jiri Olsa  wrote:
>
> > On Wed, Apr 03, 2024 at 10:07:08AM +0900, Masami Hiramatsu wrote:
> > > Hi Jiri,
> > >
> > > On Tue,  2 Apr 2024 11:33:00 +0200
> > > Jiri Olsa  wrote:
> > >
> > > > Adding uretprobe syscall instead of trap to speed up return probe.
> > >
> > > This is interesting approach. But I doubt we need to add additional
> > > syscall just for this purpose. Can't we use another syscall or ioctl?
> >
> > so the plan is to optimize entry uprobe in a similar way and given
> > the syscall is not a scarce resource I wanted to add another syscall
> > for that one as well
> >
> > tbh I'm not sure sure which syscall or ioctl to reuse for this, it's
> > possible to do that, the trampoline will just have to save one or
> > more additional registers, but adding new syscall seems cleaner to me
>
> Hmm, I think a similar syscall is ptrace? prctl may also be a candidate.

I think both ptrace and prctl are for completely different use cases
and it would be an abuse of existing API to reuse them for uretprobe
tracing. Also, keep in mind, that any extra argument that has to be
passed into this syscall means that we need to complicate and slow
generated assembly code that is injected into user process (to
save/restore registers) and also kernel-side (again, to deal with all
the extra registers that would be stored/restored on stack).

Given syscalls are not some kind of scarce resources, what's the
downside to have a dedicated and simple syscall?

>
> >
> > >
> > > Also, we should run syzkaller on this syscall. And if uretprobe is
> >
> > right, I'll check on syzkaller
> >
> > > set in the user function, what happen if the user function directly
> > > calls this syscall? (maybe it consumes shadow stack?)
> >
> > the process should receive SIGILL if there's no pending uretprobe for
> > the current task, or it will trigger uretprobe if there's one pending
>
> No, that is too aggressive and not safe. Since the syscall is exposed to
> user program, it should return appropriate error code instead of SIGILL.
>

This is the way it is today with uretprobes even through interrupt.
E.g., it could happen that user process is using fibers and is
replacing stack pointer without kernel realizing this, which will
trigger some defensive checks in uretprobe handling code and kernel
will send SIGILL because it can't support such cases. This is
happening today already, and it works fine in practice (except for
applications that manually change stack pointer, too bad, you can't
trace them with uretprobes, unfortunately).

So I think it's absolutely adequate to have this behavior if the user
process is *intentionally* abusing this API.

> >
> > but we could limit the syscall to be executed just from the trampoline,
> > that should prevent all the user space use cases, I'll do that in next
> > version and add more tests for that
>
> Why not limit? :) The uprobe_handle_trampoline() expects it is called
> only from the trampoline, so it is natural to check the caller address.
> (and uprobe should know where is the trampoline)
>
> Since the syscall is always exposed to the user program, it should
> - Do nothing and return an error unless it is properly called.
> - check the prerequisites for operation strictly.
> I concern that new system calls introduce vulnerabilities.
>

As Oleg and Jiri mentioned, this syscall can't harm kernel or other
processes, only the process that is abusing the API. So any extra
checks that would slow down this approach is an unnecessary overhead
and complication that will never be useful in practice.

Also note that sys_uretprobe is a kind of internal and unstable API
and it is explicitly called out that its contract can change at any
time and user space shouldn't rely on it. It's purely for the kernel's
own usage.

So let's please keep it fast and simple.

> Thank you,
>
>
> >
> > thanks,
> > jirka
> >
> >
> > >

[...]

Re: [PATCH v10 1/2] memory tier: dax/kmem: introduce an abstract layer for finding, allocating, and putting memory types

2024-04-03 Thread Jonathan Cameron

On Tue,  2 Apr 2024 00:17:37 +
"Ho-Ren (Jack) Chuang"  wrote:

> Since different memory devices require finding, allocating, and putting
> memory types, these common steps are abstracted in this patch,
> enhancing the scalability and conciseness of the code.
> 
> Signed-off-by: Ho-Ren (Jack) Chuang 
> Reviewed-by: "Huang, Ying" 

Hi,

I know this is a late entry to the discussion but a few comments inline.
(sorry I didn't look earlier!)

All opportunities to improve code complexity and readability as a result
of your factoring out.

Jonathan


> ---
>  drivers/dax/kmem.c   | 20 ++--
>  include/linux/memory-tiers.h | 13 +
>  mm/memory-tiers.c| 32 
>  3 files changed, 47 insertions(+), 18 deletions(-)
> 
> diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c
> index 42ee360cf4e3..01399e5b53b2 100644
> --- a/drivers/dax/kmem.c
> +++ b/drivers/dax/kmem.c
> @@ -55,21 +55,10 @@ static LIST_HEAD(kmem_memory_types);
>  
>  static struct memory_dev_type *kmem_find_alloc_memory_type(int adist)
>  {
> - bool found = false;
>   struct memory_dev_type *mtype;
>  
>   mutex_lock(_memory_type_lock);
could use

guard(mutex)(_memory_type_lock);
return mt_find_alloc_memory_type(adist, _memory_types);

I'm fine if you ignore this comment though as may be other functions in
here that could take advantage of the cleanup.h stuff in a future patch.

> - list_for_each_entry(mtype, _memory_types, list) {
> - if (mtype->adistance == adist) {
> - found = true;
> - break;
> - }
> - }
> - if (!found) {
> - mtype = alloc_memory_type(adist);
> - if (!IS_ERR(mtype))
> - list_add(>list, _memory_types);
> - }
> + mtype = mt_find_alloc_memory_type(adist, _memory_types);
>   mutex_unlock(_memory_type_lock);
>  
>   return mtype;
 
> diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
> index 69e781900082..a44c03c2ba3a 100644
> --- a/include/linux/memory-tiers.h
> +++ b/include/linux/memory-tiers.h
> @@ -48,6 +48,9 @@ int mt_calc_adistance(int node, int *adist);
>  int mt_set_default_dram_perf(int nid, struct access_coordinate *perf,
>const char *source);
>  int mt_perf_to_adistance(struct access_coordinate *perf, int *adist);
> +struct memory_dev_type *mt_find_alloc_memory_type(int adist,
> + struct list_head 
> *memory_types);

That indent looks unusual.  Align the start of struct with start of int.

> +void mt_put_memory_types(struct list_head *memory_types);
>  #ifdef CONFIG_MIGRATION
>  int next_demotion_node(int node);
>  void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *targets);
> @@ -136,5 +139,15 @@ static inline int mt_perf_to_adistance(struct 
> access_coordinate *perf, int *adis
>  {
>   return -EIO;
>  }
> +
> +struct memory_dev_type *mt_find_alloc_memory_type(int adist, struct 
> list_head *memory_types)
> +{
> + return NULL;
> +}
> +
> +void mt_put_memory_types(struct list_head *memory_types)
> +{
> +
No blank line needed here. 
> +}
>  #endif   /* CONFIG_NUMA */
>  #endif  /* _LINUX_MEMORY_TIERS_H */
> diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
> index 0537664620e5..974af10cfdd8 100644
> --- a/mm/memory-tiers.c
> +++ b/mm/memory-tiers.c
> @@ -623,6 +623,38 @@ void clear_node_memory_type(int node, struct 
> memory_dev_type *memtype)
>  }
>  EXPORT_SYMBOL_GPL(clear_node_memory_type);
>  
> +struct memory_dev_type *mt_find_alloc_memory_type(int adist, struct 
> list_head *memory_types)

Breaking this out as a separate function provides opportunity to improve it.
Maybe a follow up patch makes sense given it would no longer be a straight
forward code move.  However in my view it would be simple enough to be obvious
even within this patch.

> +{
> + bool found = false;
> + struct memory_dev_type *mtype;
> +
> + list_for_each_entry(mtype, memory_types, list) {
> + if (mtype->adistance == adist) {
> + found = true;

Why not return here?
return mtype;

> + break;
> + }
> + }
> + if (!found) {

If returning above, no need for found variable - just do this unconditionally.
+ I suggest you flip logic for simpler to follow code flow.
It's more code but I think a bit easier to read as error handling is
out of the main simple flow.

mtype = alloc_memory_type(adist);
if (IS_ERR(mtype))
return mtype;

list_add(>list, memory_types);

return mtype;

> + mtype = alloc_memory_type(adist);
> + if (!IS_ERR(mtype))
> + list_add(>list, memory_types);
> + }
> +
> + return mtype;
> +}
> +EXPORT_SYMBOL_GPL(mt_find_alloc_memory_type);
> +
> +void mt_put_memory_types(struct

Re: [PATCH v9 15/15] selftests/sgx: Add scripts for EPC cgroup testing

2024-04-03 Thread Jarkko Sakkinen

On Tue Apr 2, 2024 at 8:40 PM EEST, Michal Koutný wrote:
> On Tue, Apr 02, 2024 at 11:20:21AM -0500, Haitao Huang 
>  wrote:
> > Do we really want to have it implemented in c?
>
> I only pointed to the available C boilerplate.
>
> > There are much fewer lines of
> > code in shell scripts. Note we are not really testing basic cgroup stuff.
> > All we needed were creating/deleting cgroups and set limits which I think
> > have been demonstrated feasible in the ash scripts now.
>
> I assume you refer to
> Message-Id: <20240331174442.51019-1-haitao.hu...@linux.intel.com>
> right?
>
> Could it be even simpler if you didn't stick to cgtools APIs and v1
> compatibility?
>
> Reducing ash_cgexec.sh to something like
>   echo 0 >$R/$1/cgroup.procs
>   shift
>   exec "$@"
> (with some small builerplate for $R and previous mkdirs)

I already asked about necessity of v1 in some response, and fully
support this idea. Then cgexec can simply be a function wrapping
along the lines what you proposed.

BR, Jarkko

Re: [PATCH v19 RESEND 3/5] tracing: Allow user-space mapping of the ring-buffer

2024-04-03 Thread Steven Rostedt

On Wed, 3 Apr 2024 15:39:44 +0100
Vincent Donnefort  wrote:

> > Do you plan on sending out a v20 series?  
> 
> Of course, let me spin that this week! Got also few typos to fix in the doc 
> and
> I believe an include missing for riscv.

No rush, I'll be on PTO until next Tuesday, and will not get to it before then.

-- Steve

[PATCH net v4] virtio_net: Do not send RSS key if it is not supported

2024-04-03 Thread Breno Leitao

There is a bug when setting the RSS options in virtio_net that can break
the whole machine, getting the kernel into an infinite loop.

Running the following command in any QEMU virtual machine with virtionet
will reproduce this problem:

# ethtool -X eth0  hfunc toeplitz

This is how the problem happens:

1) ethtool_set_rxfh() calls virtnet_set_rxfh()

2) virtnet_set_rxfh() calls virtnet_commit_rss_command()

3) virtnet_commit_rss_command() populates 4 entries for the rss
scatter-gather

4) Since the command above does not have a key, then the last
scatter-gatter entry will be zeroed, since rss_key_size == 0.
sg_buf_size = vi->rss_key_size;

5) This buffer is passed to qemu, but qemu is not happy with a buffer
with zero length, and do the following in virtqueue_map_desc() (QEMU
function):

  if (!sz) {
  virtio_error(vdev, "virtio: zero sized buffers are not allowed");

6) virtio_error() (also QEMU function) set the device as broken

vdev->broken = true;

7) Qemu bails out, and do not repond this crazy kernel.

8) The kernel is waiting for the response to come back (function
virtnet_send_command())

9) The kernel is waiting doing the following :

  while (!virtqueue_get_buf(vi->cvq, ) &&
 !virtqueue_is_broken(vi->cvq))
  cpu_relax();

10) None of the following functions above is true, thus, the kernel
loops here forever. Keeping in mind that virtqueue_is_broken() does
not look at the qemu `vdev->broken`, so, it never realizes that the
vitio is broken at QEMU side.

Fix it by not sending RSS commands if the feature is not available in
the device.

Fixes: c7114b1249fa ("drivers/net/virtio_net: Added basic RSS support.")
Cc: sta...@vger.kernel.org
Cc: qemu-de...@nongnu.org
Signed-off-by: Breno Leitao 
Reviewed-by: Heng Qi 
---
Changelog:

V2:
  * Moved from creating a valid packet, by rejecting the request
completely.
V3:
  * Got some good feedback from and Xuan Zhuo and Heng Qi, and reworked
the rejection path.
V4:
  * Added a comment in an "if" clause, as suggested by Michael S. Tsirkin.

---
 drivers/net/virtio_net.c | 26 ++
 1 file changed, 22 insertions(+), 4 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index c22d1118a133..115c3c5414f2 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -3807,6 +3807,7 @@ static int virtnet_set_rxfh(struct net_device *dev,
struct netlink_ext_ack *extack)
 {
struct virtnet_info *vi = netdev_priv(dev);
+   bool update = false;
int i;
 
if (rxfh->hfunc != ETH_RSS_HASH_NO_CHANGE &&
@@ -3814,13 +3815,28 @@ static int virtnet_set_rxfh(struct net_device *dev,
return -EOPNOTSUPP;
 
if (rxfh->indir) {
+   if (!vi->has_rss)
+   return -EOPNOTSUPP;
+
for (i = 0; i < vi->rss_indir_table_size; ++i)
vi->ctrl->rss.indirection_table[i] = rxfh->indir[i];
+   update = true;
}
-   if (rxfh->key)
+
+   if (rxfh->key) {
+   /* If either _F_HASH_REPORT or _F_RSS are negotiated, the
+* device provides hash calculation capabilities, that is,
+* hash_key is configured.
+*/
+   if (!vi->has_rss && !vi->has_rss_hash_report)
+   return -EOPNOTSUPP;
+
memcpy(vi->ctrl->rss.key, rxfh->key, vi->rss_key_size);
+   update = true;
+   }
 
-   virtnet_commit_rss_command(vi);
+   if (update)
+   virtnet_commit_rss_command(vi);
 
return 0;
 }
@@ -4729,13 +4745,15 @@ static int virtnet_probe(struct virtio_device *vdev)
if (virtio_has_feature(vdev, VIRTIO_NET_F_HASH_REPORT))
vi->has_rss_hash_report = true;
 
-   if (virtio_has_feature(vdev, VIRTIO_NET_F_RSS))
+   if (virtio_has_feature(vdev, VIRTIO_NET_F_RSS)) {
vi->has_rss = true;
 
-   if (vi->has_rss || vi->has_rss_hash_report) {
vi->rss_indir_table_size =
virtio_cread16(vdev, offsetof(struct virtio_net_config,
rss_max_indirection_table_length));
+   }
+
+   if (vi->has_rss || vi->has_rss_hash_report) {
vi->rss_key_size =
virtio_cread8(vdev, offsetof(struct virtio_net_config, 
rss_max_key_size));
 
-- 
2.43.0

Re: [PATCH v2] selftests/sgx: Improve cgroup test scripts

2024-04-03 Thread Jarkko Sakkinen

On Tue Apr 2, 2024 at 8:31 PM EEST, Haitao Huang wrote:
> On Tue, 02 Apr 2024 02:43:25 -0500, Jarkko Sakkinen   
> wrote:
>
> > On Tue Apr 2, 2024 at 4:42 AM EEST, Haitao Huang wrote:
> >> Make cgroup test scripts ash compatible.
> >> Remove cg-tools dependency.
> >> Add documentation for functions.
> >>
> >> Tested with busybox on Ubuntu.
> >>
> >> Signed-off-by: Haitao Huang 
> >> ---
> >> v2:
> >> - Fixes for v2 cgroup
> >> - Turn off swapping before memcontrol tests and back on after
> >> - Add comments and reformat
> >> ---
> >>  tools/testing/selftests/sgx/ash_cgexec.sh |  57 ++
> >>  .../selftests/sgx/run_epc_cg_selftests.sh | 187 +++---
> >>  .../selftests/sgx/watch_misc_for_tests.sh |  13 +-
> >>  3 files changed, 179 insertions(+), 78 deletions(-)
> >>  create mode 100755 tools/testing/selftests/sgx/ash_cgexec.sh
> >>
> >> diff --git a/tools/testing/selftests/sgx/ash_cgexec.sh  
> >> b/tools/testing/selftests/sgx/ash_cgexec.sh
> >> new file mode 100755
> >> index ..9607784378df
> >> --- /dev/null
> >> +++ b/tools/testing/selftests/sgx/ash_cgexec.sh
> >> @@ -0,0 +1,57 @@
> >> +#!/usr/bin/env sh
> >> +# SPDX-License-Identifier: GPL-2.0
> >> +# Copyright(c) 2024 Intel Corporation.
> >> +
> >> +# Move the current shell process to the specified cgroup
> >> +# Arguments:
> >> +# $1 - The cgroup controller name, e.g., misc, memory.
> >> +# $2 - The path of the cgroup,
> >> +# relative to /sys/fs/cgroup for cgroup v2,
> >> +# relative to /sys/fs/cgroup/$1 for v1.
> >> +move_to_cgroup() {
> >> +controllers="$1"
> >> +path="$2"
> >> +
> >> +# Check if cgroup v2 is in use
> >> +if [ ! -d "/sys/fs/cgroup/misc" ]; then
> >> +# Cgroup v2 logic
> >> +cgroup_full_path="/sys/fs/cgroup/${path}"
> >> +echo $$ > "${cgroup_full_path}/cgroup.procs"
> >> +else
> >> +# Cgroup v1 logic
> >> +OLD_IFS="$IFS"
> >> +IFS=','
> >> +for controller in $controllers; do
> >> +cgroup_full_path="/sys/fs/cgroup/${controller}/${path}"
> >> +echo $$ > "${cgroup_full_path}/tasks"
> >> +done
> >> +IFS="$OLD_IFS"
> >> +fi
> >
> > I think that if you could point me to git v10 and all this I could
> > then quite easily create test image and see what I get from that.
> >
> > I will code review the whole thing but this is definitely good
> > enough to start testing this series properly! Thanks for the
> > effort with this. The payback from this comes after the feature
> > is mainline. We have now sort of reference of the usage patterns
> > and less layers when we need to debug any possible (likely) bugs
> > in the future.
> >
> > This is definitely to the right direction. I'm just wondering do
> > we want to support v1 cgroups or would it make sense support only
> > v2?
> > BR, Jarkko
> >
> I can drop v1. I think most distro now support v2.
> Created this branch to host these changes so far:  
> https://github.com/haitaohuang/linux/tree/sgx_cg_upstream_v10_plus

Thanks! 

I'll point my build to that, make a test image and report the
results.

BR, Jarkko

Re: [PATCH v9 15/15] selftests/sgx: Add scripts for EPC cgroup testing

2024-04-03 Thread Jarkko Sakkinen

On Tue Apr 2, 2024 at 7:20 PM EEST, Haitao Huang wrote:
> On Tue, 02 Apr 2024 06:58:40 -0500, Jarkko Sakkinen   
> wrote:
>
> > On Tue Apr 2, 2024 at 2:23 PM EEST, Michal Koutný wrote:
> >> Hello.
> >>
> >> On Sat, Mar 30, 2024 at 01:26:08PM +0200, Jarkko Sakkinen  
> >>  wrote:
> >> > > > It'd be more complicated and less readable to do all the stuff  
> >> without the
> >> > > > cgroup-tools, esp cgexec. I checked dependency, cgroup-tools only  
> >> depends
> >> > > > on libc so I hope this would not cause too much inconvenience.
> >> > >
> >> > > As per cgroup-tools, please prove this. It makes the job for more
> >> > > complicated *for you* and you are making the job more  complicated
> >> > > to every possible person in the planet running any kernel QA.
> >> > >
> >> > > I weight the latter more than the former. And it is exactly the
> >> > > reason why we did custom user space kselftest in the first place.
> >> > > Let's keep the tradition. All I can say is that kselftest is
> >> > > unfinished in its current form.
> >> > >
> >> > > What is "esp cgexec"?
> >> >
> >> > Also in kselftest we don't drive ultimate simplicity, we drive
> >> > efficient CI/QA. By open coding something like subset of
> >> > cgroup-tools needed to run the test you also help us later
> >> > on to backtrack the kernel changes. With cgroups-tools you
> >> > would have to use strace to get the same info.
> >>
> >> FWIW, see also functions in
> >> tools/testing/selftests/cgroup/cgroup_util.{h,c}.
> >> They likely cover what you need already -- if the tests are in C.
> >>
> >> (I admit that stuff in tools/testing/selftests/cgroup/ is best
> >> understood with strace.)
> >
> > Thanks!
> >
> > My conclusions are that:
> >
> > 1. We probably cannot move the test part of cgroup test itself
> >given the enclave payload dependency.
> > 2. I think it makes sense to still follow the same pattern as
> >other cgroups test and re-use cgroup_util.[ch] functionaltiy.
> >
> > So yeah I guess we need two test programs instead of one.
> >
> > Something along the lines:
> >
> > 1. main.[ch] -> test_sgx.[ch]
> > 2. introduce test_sgx_cgroup.c
> >
> > And test_sgx_cgroup.c would be implement similar test as the shell
> > script and would follow the structure of existing cgroups tests.
> >
> >>
> >> HTH,
> >> Michal
> >
> > BR, Jarkko
> >
> Do we really want to have it implemented in c? There are much fewer lines  
> of code in shell scripts. Note we are not really testing basic cgroup  
> stuff. All we needed were creating/deleting cgroups and set limits which I  
> think have been demonstrated feasible in the ash scripts now.
>
> Given Dave's comments, and test scripts being working and cover the cases  
> needed IMHO, I don't see much need to move to c code. I can add more cases  
> if needed and fall back a c implementation later  if any case can't be  
> implemented in scripts. How about that?

We can settle to: ash + no dependencies. I guess you have for that
all the work done already.

BR, Jarkko

Re: [PATCH v9 15/15] selftests/sgx: Add scripts for EPC cgroup testing

2024-04-03 Thread Jarkko Sakkinen

On Tue Apr 2, 2024 at 6:42 PM EEST, Dave Hansen wrote:
> On 3/30/24 04:23, Jarkko Sakkinen wrote:
> >>> I also wonder is cgroup-tools dependency absolutely required or could
> >>> you just have a function that would interact with sysfs?
> >> I should have checked email before hit the send button for v10 .
> >>
> >> It'd be more complicated and less readable to do all the stuff without the 
> >>  
> >> cgroup-tools, esp cgexec. I checked dependency, cgroup-tools only depends  
> >> on libc so I hope this would not cause too much inconvenience.
> > As per cgroup-tools, please prove this. It makes the job for more
> > complicated *for you* and you are making the job more  complicated
> > to every possible person in the planet running any kernel QA.
>
> I don't see any other use of cgroup-tools in testing/selftests.
>
> I *DO* see a ton of /bin/bash use though.  I wouldn't go to much trouble
> to make the thing ash-compatible.
>
> That said, the most important thing is to get some selftests in place.
> If using cgroup-tools means we get actual, runnable tests in place,
> that's a heck of a lot more important than making them perfect.
> Remember, almost nobody uses SGX.  It's available on *VERY* few systems
> from one CPU vendor and only in very specific hardware configurations.

Ash-compatible is good enough for me, so let's draw the line there.

Ash-compatibility does not cause any major hurdle as can we seen from
Haitao's patch. Earlier version was not even POSIX-compatible, given
that it used hard-coded path.

Most of the added stuff come open coding the tools but in the test
code that is not the big deal, and helps with debugging in the future.
Even right now it helps reviewing kernel patches because it documents
exactly how the feature is seen from user space.

BR, Jarkko

Re: [PATCHv2 1/3] uprobe: Add uretprobe syscall to speed up return probe

2024-04-03 Thread Oleg Nesterov

Again, I leave this to you and Jiri, but

On 04/03, Masami Hiramatsu wrote:
>
> On Wed, 3 Apr 2024 11:47:41 +0200
> > > set in the user function, what happen if the user function directly
> > > calls this syscall? (maybe it consumes shadow stack?)
> >
> > the process should receive SIGILL if there's no pending uretprobe for
> > the current task, or it will trigger uretprobe if there's one pending
>
> No, that is too aggressive and not safe. Since the syscall is exposed to
> user program, it should return appropriate error code instead of SIGILL.

...

> Since the syscall is always exposed to the user program, it should
> - Do nothing and return an error unless it is properly called.
> - check the prerequisites for operation strictly.

We have sys_munmap(). should it check if the caller is going to unmap
the code region which contains regs->ip and do nothing?

I don't think it should. Userspace should blame itself, SIGSEGV is not
"too aggressive" in this case.

> I concern that new system calls introduce vulnerabilities.

Yes, we need to ensure that sys_uretprobe() can only damage the malicious
caller and nothing else.

Oleg.

Re: [PATCH v19 RESEND 3/5] tracing: Allow user-space mapping of the ring-buffer

2024-04-03 Thread Vincent Donnefort

On Wed, Apr 03, 2024 at 10:13:52AM -0400, Steven Rostedt wrote:
> On Fri, 29 Mar 2024 14:40:55 -0400
> Steven Rostedt  wrote:
> 
> > > +static vm_fault_t tracing_buffers_mmap_fault(struct vm_fault *vmf)
> > > +{
> > > + return VM_FAULT_SIGBUS;
> > > +}  
> > 
> > If this is all it does, I don't believe it's needed.
> > 
> > > +
> > > +#ifdef CONFIG_TRACER_MAX_TRACE
> > > +static int get_snapshot_map(struct trace_array *tr)
> > > +{
> > > + int err = 0;
> > > +
> > > + /*
> > > +  * Called with mmap_lock held. lockdep would be unhappy if we would now
> > > +  * take trace_types_lock. Instead use the specific
> > > +  * snapshot_trigger_lock.
> > > +  */
> > > + spin_lock(>snapshot_trigger_lock);
> > > +
> > > + if (tr->snapshot || tr->mapped == UINT_MAX)
> > > + err = -EBUSY;
> > > + else
> > > + tr->mapped++;
> > > +
> > > + spin_unlock(>snapshot_trigger_lock);
> > > +
> > > + /* Wait for update_max_tr() to observe iter->tr->mapped */
> > > + if (tr->mapped == 1)
> > > + synchronize_rcu();
> > > +
> > > + return err;
> > > +
> > > +}
> > > +static void put_snapshot_map(struct trace_array *tr)
> > > +{
> > > + spin_lock(>snapshot_trigger_lock);
> > > + if (!WARN_ON(!tr->mapped))
> > > + tr->mapped--;
> > > + spin_unlock(>snapshot_trigger_lock);
> > > +}
> > > +#else
> > > +static inline int get_snapshot_map(struct trace_array *tr) { return 0; }
> > > +static inline void put_snapshot_map(struct trace_array *tr) { }
> > > +#endif
> > > +
> > > +static void tracing_buffers_mmap_close(struct vm_area_struct *vma)
> > > +{
> > > + struct ftrace_buffer_info *info = vma->vm_file->private_data;
> > > + struct trace_iterator *iter = >iter;
> > > +
> > > + WARN_ON(ring_buffer_unmap(iter->array_buffer->buffer, iter->cpu_file));
> > > + put_snapshot_map(iter->tr);
> > > +}
> > > +
> > > +static void tracing_buffers_mmap_open(struct vm_area_struct *vma) { }  
> > 
> > Same for the open.
> > 
> > 
> > > +
> > > +static const struct vm_operations_struct tracing_buffers_vmops = {
> > > + .open   = tracing_buffers_mmap_open,
> > > + .close  = tracing_buffers_mmap_close,
> > > + .fault  = tracing_buffers_mmap_fault,
> > > +};  
> > 
> > I replaced this with:
> > 
> > static const struct vm_operations_struct tracing_buffers_vmops = {
> > .close  = tracing_buffers_mmap_close,
> > };
> > 
> > And it appears to work just fine. The mm code handles the NULL cases for
> > .open and .fault.
> > 
> > Is there any reason to do something different than the mm defaults?

No other reason here than my own ignorance. I will remove.

> 
> Hi Vincent,
> 
> Do you plan on sending out a v20 series?

Of course, let me spin that this week! Got also few typos to fix in the doc and
I believe an include missing for riscv.

> 
> -- Steve

Re: [PATCH bpf-next] rethook: Remove warning messages printed for finding return address of a frame.

2024-04-03 Thread Daniel Borkmann


On 4/2/24 6:58 PM, Andrii Nakryiko wrote:

On Mon, Apr 1, 2024 at 12:16 PM Kui-Feng Lee  wrote:


rethook_find_ret_addr() prints a warning message and returns 0 when the
target task is running and not the "current" task to prevent returning an
incorrect return address. However, this check is incomplete as the target
task can still transition to the running state when finding the return
address, although it is safe with RCU.

The issue we encounter is that the kernel frequently prints warning
messages when BPF profiling programs call to bpf_get_task_stack() on
running tasks.

The callers should be aware and willing to take the risk of receiving an
incorrect return address from a task that is currently running other than
the "current" one. A warning is not needed here as the callers are intent
on it.

Signed-off-by: Kui-Feng Lee 
---
  kernel/trace/rethook.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/trace/rethook.c b/kernel/trace/rethook.c
index fa03094e9e69..4297a132a7ae 100644
--- a/kernel/trace/rethook.c
+++ b/kernel/trace/rethook.c
@@ -248,7 +248,7 @@ unsigned long rethook_find_ret_addr(struct task_struct 
*tsk, unsigned long frame
 if (WARN_ON_ONCE(!cur))
 return 0;

-   if (WARN_ON_ONCE(tsk != current && task_is_running(tsk)))
+   if (tsk != current && task_is_running(tsk))
 return 0;



This should probably go through Masami's tree, but the change makes
sense to me, given this is an expected condition.

Acked-by: Andrii Nakryiko 


Masami, I assume you'll pick this up?

Thanks,
Daniel

Re: [PATCH v9 0/9] Initial Marvell PXA1908 support

2024-04-03 Thread Rob Herring



On Tue, 02 Apr 2024 22:55:36 +0200, Duje Mihanović wrote:
> Hello,
> 
> This series adds initial support for the Marvell PXA1908 SoC and
> "samsung,coreprimevelte", a smartphone using the SoC.
> 
> USB works and the phone can boot a rootfs from an SD card, but there are
> some warnings in the dmesg:
> 
> During SMP initialization:
> [0.006519] CPU features: SANITY CHECK: Unexpected variation in 
> SYS_CNTFRQ_EL0. Boot CPU: 0x00018cba80, CPU1: 0x00
> [0.006542] CPU features: Unsupported CPU feature variation detected.
> [0.006589] CPU1: Booted secondary processor 0x01 [0x410fd032]
> [0.010710] Detected VIPT I-cache on CPU2
> [0.010716] CPU features: SANITY CHECK: Unexpected variation in 
> SYS_CNTFRQ_EL0. Boot CPU: 0x00018cba80, CPU2: 0x00
> [0.010758] CPU2: Booted secondary processor 0x02 [0x410fd032]
> [0.014849] Detected VIPT I-cache on CPU3
> [0.014855] CPU features: SANITY CHECK: Unexpected variation in 
> SYS_CNTFRQ_EL0. Boot CPU: 0x00018cba80, CPU3: 0x00
> [0.014895] CPU3: Booted secondary processor 0x03 [0x410fd032]
> 
> SMMU probing fails:
> [0.101798] arm-smmu c001.iommu: probing hardware configuration...
> [0.101809] arm-smmu c001.iommu: SMMUv1 with:
> [0.101816] arm-smmu c001.iommu: no translation support!
> 
> A 3.14 based Marvell tree is available on GitHub
> acorn-marvell/brillo_pxa_kernel, and a Samsung one on GitHub
> CoderCharmander/g361f-kernel.
> 
> Andreas Färber attempted to upstream support for this SoC in 2017:
> https://lore.kernel.org/lkml/20170222022929.10540-1-afaer...@suse.de/
> 
> Signed-off-by: Duje Mihanović 
> 
> Changes in v9:
> - Update trailers and rebase on v6.9-rc2, no changes
> - Link to v8: 
> https://lore.kernel.org/20240110-pxa1908-lkml-v8-0-fea768a59...@skole.hr
> 
> Changes in v8:
> - Drop SSPA patch
> - Drop broken-cd from eMMC node
> - Specify S-Boot hardcoded initramfs location in device tree
> - Add ARM PMU node
> - Correct inverted modem memory base and size
> - Update trailers
> - Rebase on next-20240110
> - Link to v7: 
> https://lore.kernel.org/20231102-pxa1908-lkml-v7-0-cabb1a0cb...@skole.hr
>   and https://lore.kernel.org/20231102152033.5511-1-duje.mihano...@skole.hr
> 
> Changes in v7:
> - Suppress SND_MMP_SOC_SSPA on ARM64
> - Update trailers
> - Rebase on v6.6-rc7
> - Link to v6: 
> https://lore.kernel.org/r/20231010-pxa1908-lkml-v6-0-b2fe09240...@skole.hr
> 
> Changes in v6:
> - Address maintainer comments:
>   - Add "marvell,pxa1908-padconf" binding to pinctrl-single driver
> - Drop GPIO patch as it's been pulled
> - Update trailers
> - Rebase on v6.6-rc5
> - Link to v5: 
> https://lore.kernel.org/r/20230812-pxa1908-lkml-v5-0-a5d51937e...@skole.hr
> 
> Changes in v5:
> - Address maintainer comments:
>   - Move *_NR_CLKS to clock driver from dt binding file
> - Allocate correct number of clocks for each block instead of blindly
>   allocating 50 for each
> - Link to v4: 
> https://lore.kernel.org/r/20230807-pxa1908-lkml-v4-0-cb387d73b...@skole.hr
> 
> Changes in v4:
> - Address maintainer comments:
>   - Relicense clock binding file to BSD-2
> - Add pinctrl-names to SD card node
> - Add vgic registers to GIC node
> - Rebase on v6.5-rc5
> - Link to v3: 
> https://lore.kernel.org/r/20230804-pxa1908-lkml-v3-0-8e48fca37...@skole.hr
> 
> Changes in v3:
> - Address maintainer comments:
>   - Drop GPIO dynamic allocation patch
>   - Move clock register offsets into driver (instead of bindings file)
>   - Add missing Tested-by trailer to u32_fract patch
>   - Move SoC binding to arm/mrvl/mrvl.yaml
> - Add serial0 alias and stdout-path to board dts to enable UART
>   debugging
> - Rebase on v6.5-rc4
> - Link to v2: 
> https://lore.kernel.org/r/20230727162909.6031-1-duje.mihano...@skole.hr
> 
> Changes in v2:
> - Remove earlycon patch as it's been merged into tty-next
> - Address maintainer comments:
>   - Clarify GPIO regressions on older PXA platforms
>   - Add Fixes tag to commit disabling GPIO pinctrl calls for this SoC
>   - Add missing includes to clock driver
>   - Clock driver uses HZ_PER_MHZ, u32_fract and GENMASK
>   - Dual license clock bindings
>   - Change clock IDs to decimal
>   - Fix underscores in dt node names
>   - Move chosen node to top of board dts
>   - Clean up documentation
>   - Reorder commits
>   - Drop pxa,rev-id
> - Rename muic-i2c to i2c-muic
> - Reword some commits
> - Move framebuffer node to chosen
> - Add aliases for mmc nodes
> - Rebase on v6.5-rc3
> - Link to v1: 
> https://lore.kernel.org/r/20230721210042.21535-1-duje.mihano...@skole.hr
> 
> ---
> Andy Shevchenko (1):
>   clk: mmp: Switch to use struct u32_fract instead of custom one
> 
> Duje Mihanović (8):
>   dt-bindings: pinctrl: pinctrl-single: add marvell,pxa1908-padconf 
> compatible
>   pinctrl: single: add marvell,pxa1908-padconf compatible
>   dt-bindings: clock: Add Marvell PXA1908 clock bindings
>

Re: [PATCH v19 RESEND 3/5] tracing: Allow user-space mapping of the ring-buffer

2024-04-03 Thread Steven Rostedt

On Fri, 29 Mar 2024 14:40:55 -0400
Steven Rostedt  wrote:

> > +static vm_fault_t tracing_buffers_mmap_fault(struct vm_fault *vmf)
> > +{
> > +   return VM_FAULT_SIGBUS;
> > +}  
> 
> If this is all it does, I don't believe it's needed.
> 
> > +
> > +#ifdef CONFIG_TRACER_MAX_TRACE
> > +static int get_snapshot_map(struct trace_array *tr)
> > +{
> > +   int err = 0;
> > +
> > +   /*
> > +* Called with mmap_lock held. lockdep would be unhappy if we would now
> > +* take trace_types_lock. Instead use the specific
> > +* snapshot_trigger_lock.
> > +*/
> > +   spin_lock(>snapshot_trigger_lock);
> > +
> > +   if (tr->snapshot || tr->mapped == UINT_MAX)
> > +   err = -EBUSY;
> > +   else
> > +   tr->mapped++;
> > +
> > +   spin_unlock(>snapshot_trigger_lock);
> > +
> > +   /* Wait for update_max_tr() to observe iter->tr->mapped */
> > +   if (tr->mapped == 1)
> > +   synchronize_rcu();
> > +
> > +   return err;
> > +
> > +}
> > +static void put_snapshot_map(struct trace_array *tr)
> > +{
> > +   spin_lock(>snapshot_trigger_lock);
> > +   if (!WARN_ON(!tr->mapped))
> > +   tr->mapped--;
> > +   spin_unlock(>snapshot_trigger_lock);
> > +}
> > +#else
> > +static inline int get_snapshot_map(struct trace_array *tr) { return 0; }
> > +static inline void put_snapshot_map(struct trace_array *tr) { }
> > +#endif
> > +
> > +static void tracing_buffers_mmap_close(struct vm_area_struct *vma)
> > +{
> > +   struct ftrace_buffer_info *info = vma->vm_file->private_data;
> > +   struct trace_iterator *iter = >iter;
> > +
> > +   WARN_ON(ring_buffer_unmap(iter->array_buffer->buffer, iter->cpu_file));
> > +   put_snapshot_map(iter->tr);
> > +}
> > +
> > +static void tracing_buffers_mmap_open(struct vm_area_struct *vma) { }  
> 
> Same for the open.
> 
> 
> > +
> > +static const struct vm_operations_struct tracing_buffers_vmops = {
> > +   .open   = tracing_buffers_mmap_open,
> > +   .close  = tracing_buffers_mmap_close,
> > +   .fault  = tracing_buffers_mmap_fault,
> > +};  
> 
> I replaced this with:
> 
> static const struct vm_operations_struct tracing_buffers_vmops = {
>   .close  = tracing_buffers_mmap_close,
> };
> 
> And it appears to work just fine. The mm code handles the NULL cases for
> .open and .fault.
> 
> Is there any reason to do something different than the mm defaults?

Hi Vincent,

Do you plan on sending out a v20 series?

-- Steve

Re: [PATCHv2 1/3] uprobe: Add uretprobe syscall to speed up return probe

2024-04-03 Thread Google

On Wed, 3 Apr 2024 11:47:41 +0200
Jiri Olsa  wrote:

> On Wed, Apr 03, 2024 at 10:07:08AM +0900, Masami Hiramatsu wrote:
> > Hi Jiri,
> > 
> > On Tue,  2 Apr 2024 11:33:00 +0200
> > Jiri Olsa  wrote:
> > 
> > > Adding uretprobe syscall instead of trap to speed up return probe.
> > 
> > This is interesting approach. But I doubt we need to add additional
> > syscall just for this purpose. Can't we use another syscall or ioctl?
> 
> so the plan is to optimize entry uprobe in a similar way and given
> the syscall is not a scarce resource I wanted to add another syscall
> for that one as well
> 
> tbh I'm not sure sure which syscall or ioctl to reuse for this, it's
> possible to do that, the trampoline will just have to save one or
> more additional registers, but adding new syscall seems cleaner to me

Hmm, I think a similar syscall is ptrace? prctl may also be a candidate.

> 
> > 
> > Also, we should run syzkaller on this syscall. And if uretprobe is
> 
> right, I'll check on syzkaller
> 
> > set in the user function, what happen if the user function directly
> > calls this syscall? (maybe it consumes shadow stack?)
> 
> the process should receive SIGILL if there's no pending uretprobe for
> the current task, or it will trigger uretprobe if there's one pending

No, that is too aggressive and not safe. Since the syscall is exposed to
user program, it should return appropriate error code instead of SIGILL.

> 
> but we could limit the syscall to be executed just from the trampoline,
> that should prevent all the user space use cases, I'll do that in next
> version and add more tests for that

Why not limit? :) The uprobe_handle_trampoline() expects it is called
only from the trampoline, so it is natural to check the caller address.
(and uprobe should know where is the trampoline)

Since the syscall is always exposed to the user program, it should
- Do nothing and return an error unless it is properly called.
- check the prerequisites for operation strictly.
I concern that new system calls introduce vulnerabilities.

Thank you,


> 
> thanks,
> jirka
> 
> 
> > 
> > Thank you,
> > 
> > > 
> > > At the moment the uretprobe setup/path is:
> > > 
> > >   - install entry uprobe
> > > 
> > >   - when the uprobe is hit, it overwrites probed function's return address
> > > on stack with address of the trampoline that contains breakpoint
> > > instruction
> > > 
> > >   - the breakpoint trap code handles the uretprobe consumers execution and
> > > jumps back to original return address
> > > 
> > > This patch replaces the above trampoline's breakpoint instruction with new
> > > ureprobe syscall call. This syscall does exactly the same job as the trap
> > > with some more extra work:
> > > 
> > >   - syscall trampoline must save original value for rax/r11/rcx registers
> > > on stack - rax is set to syscall number and r11/rcx are changed and
> > > used by syscall instruction
> > > 
> > >   - the syscall code reads the original values of those registers and
> > > restore those values in task's pt_regs area
> > > 
> > > Even with the extra registers handling code the having uretprobes handled
> > > by syscalls shows speed improvement.
> > > 
> > >   On Intel (11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz)
> > > 
> > >   current:
> > > 
> > > base   :   15.888 ± 0.033M/s
> > > uprobe-nop :3.016 ± 0.000M/s
> > > uprobe-push:2.832 ± 0.005M/s
> > > uprobe-ret :1.104 ± 0.000M/s
> > > uretprobe-nop  :1.487 ± 0.000M/s
> > > uretprobe-push :1.456 ± 0.000M/s
> > > uretprobe-ret  :0.816 ± 0.001M/s
> > > 
> > >   with the fix:
> > > 
> > > base   :   15.116 ± 0.045M/s
> > > uprobe-nop :3.001 ± 0.045M/s
> > > uprobe-push:2.831 ± 0.004M/s
> > > uprobe-ret :1.102 ± 0.001M/s
> > > uretprobe-nop  :1.969 ± 0.001M/s  < 32% speedup
> > > uretprobe-push :1.905 ± 0.004M/s  < 30% speedup
> > > uretprobe-ret  :0.933 ± 0.002M/s  < 14% speedup
> > > 
> > >   On Amd (AMD Ryzen 7 5700U)
> > > 
> > >   current:
> > > 
> > > base   :5.105 ± 0.003M/s
> > > uprobe-nop :1.552 ± 0.002M/s
> > > uprobe-push:1.408 ± 0.003M/s
> > > uprobe-ret :0.827 ± 0.001M/s
> > > uretprobe-nop  :0.779 ± 0.001M/s
> > > uretprobe-push :0.750 ± 0.001M/s
> > > uretprobe-ret  :0.539 ± 0.001M/s
> > > 
> > >   with the fix:
> > > 
> > > base   :5.119 ± 0.002M/s
> > > uprobe-nop :1.523 ± 0.003M/s
> > > uprobe-push:1.384 ± 0.003M/s
> > > uprobe-ret :0.826 ± 0.002M/s
> > > uretprobe-nop  :0.866 ± 0.002M/s  < 11% speedup
> > > uretprobe-push :0.826 ± 0.002M/s  < 10% speedup
> > > uretprobe-ret  :0.581 ± 0.001M/s  <  7% speedup
> > > 
> > > Reviewed-by: Oleg Nesterov 
> > > Suggested-by: Andrii Nakryiko 
> > > Acked-by: Andrii Nakryiko 
> > > Signed-off-by: Oleg

Re: [PATCHv2 1/3] uprobe: Add uretprobe syscall to speed up return probe

2024-04-03 Thread Oleg Nesterov

I leave this to you and Masami, but...

On 04/03, Jiri Olsa wrote:
>
> On Wed, Apr 03, 2024 at 10:07:08AM +0900, Masami Hiramatsu wrote:
> >
> > This is interesting approach. But I doubt we need to add additional
> > syscall just for this purpose. Can't we use another syscall or ioctl?
>
> so the plan is to optimize entry uprobe in a similar way and given
> the syscall is not a scarce resource I wanted to add another syscall
> for that one as well
>
> tbh I'm not sure sure which syscall or ioctl to reuse for this, it's
> possible to do that, the trampoline will just have to save one or
> more additional registers, but adding new syscall seems cleaner to me

Agreed.

> > Also, we should run syzkaller on this syscall. And if uretprobe is
>
> right, I'll check on syzkaller

I don't understand this concern...

> > set in the user function, what happen if the user function directly
> > calls this syscall? (maybe it consumes shadow stack?)
>
> the process should receive SIGILL if there's no pending uretprobe for
> the current task,

Yes,

> or it will trigger uretprobe if there's one pending

... and corrupt the caller. So what?

> but we could limit the syscall to be executed just from the trampoline,
> that should prevent all the user space use cases, I'll do that in next
> version and add more tests for that

Yes, we can... well, ignoring the race with mremap() from another thread.

But why should we care?

Userspace should not call sys_uretprobe(). Likewise, it should not call
sys_restart_syscall(). Likewise, it should not jump to xol_area.

Of course, userspace (especially syzkaller) _can_ do this. So what?

I think the only thing we need to ensure is that the "malicious" task
which calls sys_uretprobe() can only harm itself, nothing more.

No?

Oleg.

Re: [PATCH] ftrace: make extra rcu_is_watching() validation check optional

2024-04-03 Thread Steven Rostedt

On Tue, 2 Apr 2024 22:21:00 -0700
Andrii Nakryiko  wrote:

> > I just checked our fleet-wide production data for the last 24 hours.
> > Within the kprobe/kretprobe code path (ftrace_trampoline and
> > everything called from it), rcu_is_watching (both calls, see below)
> > cause 0.484% CPU cycles usage, which isn't nothing. So definitely we'd
> > prefer to be able to avoid that in production use cases.
> >  
> 
> I just ran synthetic microbenchmark testing multi-kretprobe
> throughput. We get (in millions of BPF kretprobe-multi program
> invocations per second):
>   - 5.568M/s as baseline;
>   - 5.679M/s with changes in this patch (+2% throughput improvement);
>   - 5.808M/s with disabling rcu_is_watching in rethook_try_get()
> (+2.3% more vs just one of rcu_is_watching, and +4.3% vs baseline).
> 
> It's definitely noticeable.

Ah, thanks for verifying (I should have read the thread before replying to
the previous email).

-- Steve

Re: [PATCH] ftrace: make extra rcu_is_watching() validation check optional

2024-04-03 Thread Steven Rostedt

On Tue, 2 Apr 2024 21:00:19 -0700
Andrii Nakryiko  wrote:

> I just noticed another rcu_is_watching() call, in rethook_try_get(),
> which seems to be a similar and complementary validation check to the
> one we are putting under CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING option
> in this patch. It feels like both of them should be controlled by the
> same settings. WDYT? Can I add CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING
> guard around rcu_is_watching() check in rethook_try_get() as well?

That is totally up to Masami. It may have even less overhead as I'm not
sure how many times that gets called, and there may be more work to do than
with function tracing.

-- Steve

Re: [PATCH v10 08/14] x86/sgx: Add basic EPC reclamation flow for cgroup

2024-04-03 Thread Huang, Kai

On Wed, 2024-03-27 at 17:22 -0700, Haitao Huang wrote:
> From: Kristen Carlson Accardi 
> 
> When a cgroup usage reaches its limit, and it is to be charged, i.e.,
> sgx_cgroup_try_charge() called for new allocations, the cgroup needs to
> reclaim pages from its LRU or LRUs of its descendants to make room for
> any new allocations. This patch adds the basic building block for the
> per-cgroup reclamation flow and use it for synchronous reclamation in
> sgx_cgroup_try_charge().

It's better to firstly mention _why_ we need this first:

Currently in the EPC page allocation, the kernel simply fails the allocation
when the current EPC cgroup fails to charge due to its usage reaching limit. 
This is not ideal.  When that happens, a better way is to reclaim EPC page(s)
from the current EPC cgroup (and/or its descendants) to reduce its usage so the
new allocation can succeed.

Add the basic building blocks to support the per-cgroup reclamation flow ...

> 
> First, modify sgx_reclaim_pages() to let callers to pass in the LRU from
> which pages are reclaimed, so it can be reused by both the global and
> cgroup reclaimers. Also return the number of pages attempted, so a
> cgroup reclaimer can use it to track reclamation progress from its
> descendants.

IMHO you are jumping too fast to the implementation details.  Better to have
some more background:

"
Currently the kernel only has one place to reclaim EPC pages: the global EPC LRU
list.  To support the "per-cgroup" EPC reclaim, maintain an LRU list for each
EPC cgroup, and introduce a "cgroup" variant function to reclaim EPC page(s)
from a given EPC cgroup (and its descendants).
"

> 
> For the global reclaimer, replace all call sites of sgx_reclaim_pages()
> with calls to a newly created wrapper, sgx_reclaim_pages_global(), which
> just calls sgx_reclaim_pages() with the global LRU passed in.
> 
> For cgroup reclamation, implement a basic reclamation flow, encapsulated
> in the top-level function, sgx_cgroup_reclaim_pages(). It performs a
> pre-order walk on a given cgroup subtree, and calls sgx_reclaim_pages()
> at each node passing in the LRU of that node. It keeps track of total
> attempted pages and stops the walk if desired number of pages are
> attempted.

Then it's time to jump to implementation details:

"
Currently the kernel does the global EPC reclaim in sgx_reclaim_page().  It
always tries to reclaim EPC pages in batch of SGX_NR_TO_SCAN (16) pages. 
Specifically, it always "scans", or "isolates" SGX_NR_TO_SCAN pages from the
global LRU, and then tries to reclaim these pages at once for better
performance.

Use similar way to implement the "cgroup" variant EPC reclaim, but keep the
implementation simple: 1) change sgx_reclaim_pages() to take an LRU as input,
and return the pages that are "scanned" (but not actually reclaimed); 2) loop
the given EPC cgroup and its descendants and do the new sgx_reclaim_pages()
until SGX_NR_TO_SCAN pages are "scanned".

This implementation always tries to reclaim SGX_NR_TO_SCAN pages from the LRU of
the given EPC cgroup, and only moves to its descendants when there's no enough
reclaimable EPC pages to "scan" in its LRU.  It should be enough for most cases.
"

Then I think it's better to explain why "alternatives" are not chosen:

"
Note, this simple implementation doesn't _exactly_ mimic the current global EPC
reclaim (which always tries to do the actual reclaim in batch of SGX_NR_TO_SCAN
pages): when LRUs have less than SGX_NR_TO_SCAN reclaimable pages, the actual
reclaim of EPC pages will be split into smaller batches _across_ multiple LRUs
with each being smaller than SGX_NR_TO_SCAN pages.

A more precise way to mimic the current global EPC reclaim would be to have a
new function to only "scan" (or "isolate") SGX_NR_TO_SCAN pages _across_ the
given EPC cgroup _AND_ its descendants, and then do the actual reclaim in one
batch.  But this is unnecessarily complicated at this stage.

Alternatively, the current sgx_reclaim_pages() could be changed to return the
actual "reclaimed" pages, but not "scanned" pages.  However this solution also
has cons: 
"

:

I recall you mentioned "unable to control latency of each reclaim" etc, but IIUC
one could be:

This approach may result in higher chance of "reclaiming EPC pages from
descendants but not the root/given EPC cgorup", e.g., when all EPC pages in the
root EPC cgroup are all young while these in its descendants are not.  This may
not be desired.

Makes sense?

> 
> Finally, pass a parameter to sgx_cgroup_try_charge() to indicate whether
> a synchronous reclamation is allowed. If the caller allows and cgroup
> usage is at its limit, trigger the synchronous reclamation by calling
> sgx_cgroup_reclaim_pages() in a loop with cond_resched() in between
> iterations.

This isn't needed IMHO as you can easily see in the code, and there's no "design
choices" here.

General rule: focus on explaining "why", and "design choices", but not
implementation details, which can be seen in the

Re: [PATCH net v3] virtio_net: Do not send RSS key if it is not supported

2024-04-03 Thread Breno Leitao

On Sun, Mar 31, 2024 at 04:20:30PM -0400, Michael S. Tsirkin wrote:
> On Fri, Mar 29, 2024 at 10:16:41AM -0700, Breno Leitao wrote:
> > @@ -3814,13 +3815,24 @@ static int virtnet_set_rxfh(struct net_device *dev,
> > return -EOPNOTSUPP;
> >  
> > if (rxfh->indir) {
> > +   if (!vi->has_rss)
> > +   return -EOPNOTSUPP;
> > +
> > for (i = 0; i < vi->rss_indir_table_size; ++i)
> > vi->ctrl->rss.indirection_table[i] = rxfh->indir[i];
> > +   update = true;
> > }
> > -   if (rxfh->key)
> > +
> > +   if (rxfh->key) {
> > +   if (!vi->has_rss && !vi->has_rss_hash_report)
> > +   return -EOPNOTSUPP;
> 
> 
> What's the logic here? Is it || or &&? A comment can't hurt.

If txfh carries a key, then the device needs to has either has_rss or
has_rss_hash_report "features".

These are basically virtio features VIRTIO_NET_F_HASH_REPORT and
VIRTIO_NET_F_RSS that are set at virtio_probe.

I will add the comment and respin the series.

Re: [PATCH] uprobes: reduce contention on uprobes_tree access

2024-04-03 Thread Jonthan Haslam

> > > > Given the discussion around per-cpu rw semaphore and need for
> > > > (internal) batched attachment API for uprobes, do you think you can
> > > > apply this patch as is for now? We can then gain initial improvements
> > > > in scalability that are also easy to backport, and Jonathan will work
> > > > on a more complete solution based on per-cpu RW semaphore, as
> > > > suggested by Ingo.
> > > 
> > > Yeah, it is interesting to use per-cpu rw semaphore on uprobe.
> > > I would like to wait for the next version.
> > 
> > My initial tests show a nice improvement on the over RW spinlocks but
> > significant regression in acquiring a write lock. I've got a few days
> > vacation over Easter but I'll aim to get some more formalised results out
> > to the thread toward the end of next week.
> 
> As far as the write lock is only on the cold path, I think you can choose
> per-cpu RW semaphore. Since it does not do busy wait, the total system
> performance impact will be small.
> I look forward to your formalized results :)

Sorry for the delay in getting back to you on this Masami. 

I have used one of the bpf selftest benchmarks to provide some form of
comparison of the 3 different approaches (spinlock, RW spinlock and
per-cpu RW semaphore). The benchmark used here is the 'trig-uprobe-nop'
benchmark which just executes a single uprobe with a minimal bpf program
attached. The tests were done on a 32 core qemu/kvm instance.

Things to note about the results:

- The results are slightly variable so don't get too caught up on
  individual thread count - it's the trend that is important.
- In terms of throughput with this specific benchmark a *very* macro view
  is that the RW spinlock provides 40-60% more throughput than the
  spinlock.  The per-CPU RW semaphore provides in the order of 50-100%
  more throughput then the spinlock.
- This doesn't fully reflect the large reduction in latency that we have
  seen in application based measurements. However, it does demonstrate
  that even the trivial change of going to a RW spinlock provides
  significant benefits.

I haven't included the measurements on per-CPU RW semaphore write
performance as they are completely in line with those that Paul McKenney
posted on his journal [0]. On a 32 core system I see semaphore writes to
take in the order of 25-28 millisecs - the cost of the synchronize_rcu().

Each block of results below show 1 line per execution of the benchmark (the
"Summary" line) and each line is a run with one more thread added - a
thread is a "producer". The lines are edited to remove extraneous output
that adds no value here.

The tests were executed with this driver script:

for num_threads in {1..20}
do
sudo ./bench -p $num_threads trig-uprobe-nop | grep Summary
done


spinlock

Summary: hits1.453 ± 0.005M/s (  1.453M/prod)
Summary: hits2.087 ± 0.005M/s (  1.043M/prod)
Summary: hits2.701 ± 0.012M/s (  0.900M/prod)
Summary: hits1.917 ± 0.011M/s (  0.479M/prod)
Summary: hits2.105 ± 0.003M/s (  0.421M/prod)
Summary: hits1.615 ± 0.006M/s (  0.269M/prod)
Summary: hits2.046 ± 0.004M/s (  0.292M/prod)
Summary: hits1.818 ± 0.002M/s (  0.227M/prod)
Summary: hits1.867 ± 0.024M/s (  0.207M/prod)
Summary: hits1.692 ± 0.003M/s (  0.169M/prod)
Summary: hits1.778 ± 0.004M/s (  0.162M/prod)
Summary: hits1.710 ± 0.011M/s (  0.142M/prod)
Summary: hits1.687 ± 0.022M/s (  0.130M/prod)
Summary: hits1.697 ± 0.004M/s (  0.121M/prod)
Summary: hits1.645 ± 0.011M/s (  0.110M/prod)
Summary: hits1.671 ± 0.002M/s (  0.104M/prod)
Summary: hits1.692 ± 0.014M/s (  0.100M/prod)
Summary: hits1.700 ± 0.015M/s (  0.094M/prod)
Summary: hits1.668 ± 0.005M/s (  0.088M/prod)
Summary: hits1.644 ± 0.004M/s (  0.082M/prod)


RW spinlock

Summary: hits1.465 ± 0.008M/s (  1.465M/prod)
Summary: hits1.750 ± 0.035M/s (  0.875M/prod)
Summary: hits2.164 ± 0.008M/s (  0.721M/prod)
Summary: hits2.235 ± 0.014M/s (  0.559M/prod)
Summary: hits2.202 ± 0.005M/s (  0.440M/prod)
Summary: hits2.816 ± 0.003M/s (  0.469M/prod)
Summary: hits2.364 ± 0.002M/s (  0.338M/prod)
Summary: hits2.327 ± 0.008M/s (  0.291M/prod)
Summary: hits2.147 ± 0.005M/s (  0.239M/prod)
Summary: hits2.266 ± 0.011M/s (  0.227M/prod)
Summary: hits2.483 ± 0.003M/s (  0.226M/prod)
Summary: hits2.573 ± 0.008M/s (  0.214M/prod)
Summary: hits2.569 ± 0.004M/s (  0.198M/prod)
Summary: hits2.507 ± 0.013M/s (  0.179M/prod)
Summary: hits2.165 ± 0.008M/s (  0.144M/prod)
Summary: hits2.524 ± 0.004M/s (  0.158M/prod)
Summary: hits2.059 ± 0.020M/s (  0.121M/prod)
Summary: hits2.332 ± 0.007M/s (  0.130M/prod)
Summary: hits2.404 ± 0.017M/s (  0.127M/prod)
Summary: hits2.187 ± 0.002M/s (  0.109M/prod)


per-CPU RW semaphore

Summary: hits1.341 ± 0.017M/s (  1.341M/prod)
Summary: hits2.397 ± 0.011M/s (  1.198M/prod)
Summary: hits3.547 ± 0.041M/s (  1.182M/prod)
Summary: hits4.108

Re: [PATCH] vhost-vdpa: change ioctl # for VDPA_GET_VRING_SIZE

2024-04-03 Thread Stefano Garzarella


On Tue, Apr 02, 2024 at 05:21:39PM -0400, Michael S. Tsirkin wrote:

VDPA_GET_VRING_SIZE by mistake uses the already occupied
ioctl # 0x80 and we never noticed - it happens to work
because the direction and size are different, but confuses
tools such as perf which like to look at just the number,
and breaks the extra robustness of the ioctl numbering macros.

To fix, sort the entries and renumber the ioctl - not too late
since it wasn't in any released kernels yet.

Cc: Arnaldo Carvalho de Melo 
Reported-by: Namhyung Kim 
Fixes: 1496c47065f9 ("vhost-vdpa: uapi to support reporting per vq size")
Cc: "Zhu Lingshan" 
Signed-off-by: Michael S. Tsirkin 
---

Build tested only - userspace patches using this will have to adjust.
I will merge this in a week or so unless I hear otherwise,
and afterwards perf can update there header.


Fortunately, we haven't released any kernels with this yet, right?
(other than v6.9-rc*)

LGTM:

Reviewed-by: Stefano Garzarella 



include/uapi/linux/vhost.h | 15 ---
1 file changed, 8 insertions(+), 7 deletions(-)

diff --git a/include/uapi/linux/vhost.h b/include/uapi/linux/vhost.h
index bea697390613..b95dd84eef2d 100644
--- a/include/uapi/linux/vhost.h
+++ b/include/uapi/linux/vhost.h
@@ -179,12 +179,6 @@
/* Get the config size */
#define VHOST_VDPA_GET_CONFIG_SIZE  _IOR(VHOST_VIRTIO, 0x79, __u32)

-/* Get the count of all virtqueues */
-#define VHOST_VDPA_GET_VQS_COUNT   _IOR(VHOST_VIRTIO, 0x80, __u32)
-
-/* Get the number of virtqueue groups. */
-#define VHOST_VDPA_GET_GROUP_NUM   _IOR(VHOST_VIRTIO, 0x81, __u32)
-
/* Get the number of address spaces. */
#define VHOST_VDPA_GET_AS_NUM   _IOR(VHOST_VIRTIO, 0x7A, unsigned int)

@@ -228,10 +222,17 @@
#define VHOST_VDPA_GET_VRING_DESC_GROUP _IOWR(VHOST_VIRTIO, 0x7F,   \
  struct vhost_vring_state)

+
+/* Get the count of all virtqueues */
+#define VHOST_VDPA_GET_VQS_COUNT   _IOR(VHOST_VIRTIO, 0x80, __u32)
+
+/* Get the number of virtqueue groups. */
+#define VHOST_VDPA_GET_GROUP_NUM   _IOR(VHOST_VIRTIO, 0x81, __u32)
+
/* Get the queue size of a specific virtqueue.
 * userspace set the vring index in vhost_vring_state.index
 * kernel set the queue size in vhost_vring_state.num
 */
-#define VHOST_VDPA_GET_VRING_SIZE  _IOWR(VHOST_VIRTIO, 0x80,   \
+#define VHOST_VDPA_GET_VRING_SIZE  _IOWR(VHOST_VIRTIO, 0x82,   \
  struct vhost_vring_state)
#endif
--
MST

Re: [PATCHv2 1/3] uprobe: Add uretprobe syscall to speed up return probe

2024-04-03 Thread Jiri Olsa

On Wed, Apr 03, 2024 at 10:07:08AM +0900, Masami Hiramatsu wrote:
> Hi Jiri,
> 
> On Tue,  2 Apr 2024 11:33:00 +0200
> Jiri Olsa  wrote:
> 
> > Adding uretprobe syscall instead of trap to speed up return probe.
> 
> This is interesting approach. But I doubt we need to add additional
> syscall just for this purpose. Can't we use another syscall or ioctl?

so the plan is to optimize entry uprobe in a similar way and given
the syscall is not a scarce resource I wanted to add another syscall
for that one as well

tbh I'm not sure sure which syscall or ioctl to reuse for this, it's
possible to do that, the trampoline will just have to save one or
more additional registers, but adding new syscall seems cleaner to me

> 
> Also, we should run syzkaller on this syscall. And if uretprobe is

right, I'll check on syzkaller

> set in the user function, what happen if the user function directly
> calls this syscall? (maybe it consumes shadow stack?)

the process should receive SIGILL if there's no pending uretprobe for
the current task, or it will trigger uretprobe if there's one pending

but we could limit the syscall to be executed just from the trampoline,
that should prevent all the user space use cases, I'll do that in next
version and add more tests for that

thanks,
jirka


> 
> Thank you,
> 
> > 
> > At the moment the uretprobe setup/path is:
> > 
> >   - install entry uprobe
> > 
> >   - when the uprobe is hit, it overwrites probed function's return address
> > on stack with address of the trampoline that contains breakpoint
> > instruction
> > 
> >   - the breakpoint trap code handles the uretprobe consumers execution and
> > jumps back to original return address
> > 
> > This patch replaces the above trampoline's breakpoint instruction with new
> > ureprobe syscall call. This syscall does exactly the same job as the trap
> > with some more extra work:
> > 
> >   - syscall trampoline must save original value for rax/r11/rcx registers
> > on stack - rax is set to syscall number and r11/rcx are changed and
> > used by syscall instruction
> > 
> >   - the syscall code reads the original values of those registers and
> > restore those values in task's pt_regs area
> > 
> > Even with the extra registers handling code the having uretprobes handled
> > by syscalls shows speed improvement.
> > 
> >   On Intel (11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz)
> > 
> >   current:
> > 
> > base   :   15.888 ± 0.033M/s
> > uprobe-nop :3.016 ± 0.000M/s
> > uprobe-push:2.832 ± 0.005M/s
> > uprobe-ret :1.104 ± 0.000M/s
> > uretprobe-nop  :1.487 ± 0.000M/s
> > uretprobe-push :1.456 ± 0.000M/s
> > uretprobe-ret  :0.816 ± 0.001M/s
> > 
> >   with the fix:
> > 
> > base   :   15.116 ± 0.045M/s
> > uprobe-nop :3.001 ± 0.045M/s
> > uprobe-push:2.831 ± 0.004M/s
> > uprobe-ret :1.102 ± 0.001M/s
> > uretprobe-nop  :1.969 ± 0.001M/s  < 32% speedup
> > uretprobe-push :1.905 ± 0.004M/s  < 30% speedup
> > uretprobe-ret  :0.933 ± 0.002M/s  < 14% speedup
> > 
> >   On Amd (AMD Ryzen 7 5700U)
> > 
> >   current:
> > 
> > base   :5.105 ± 0.003M/s
> > uprobe-nop :1.552 ± 0.002M/s
> > uprobe-push:1.408 ± 0.003M/s
> > uprobe-ret :0.827 ± 0.001M/s
> > uretprobe-nop  :0.779 ± 0.001M/s
> > uretprobe-push :0.750 ± 0.001M/s
> > uretprobe-ret  :0.539 ± 0.001M/s
> > 
> >   with the fix:
> > 
> > base   :5.119 ± 0.002M/s
> > uprobe-nop :1.523 ± 0.003M/s
> > uprobe-push:1.384 ± 0.003M/s
> > uprobe-ret :0.826 ± 0.002M/s
> > uretprobe-nop  :0.866 ± 0.002M/s  < 11% speedup
> > uretprobe-push :0.826 ± 0.002M/s  < 10% speedup
> > uretprobe-ret  :0.581 ± 0.001M/s  <  7% speedup
> > 
> > Reviewed-by: Oleg Nesterov 
> > Suggested-by: Andrii Nakryiko 
> > Acked-by: Andrii Nakryiko 
> > Signed-off-by: Oleg Nesterov 
> > Signed-off-by: Jiri Olsa 
> > ---
> >  arch/x86/entry/syscalls/syscall_64.tbl |  1 +
> >  arch/x86/kernel/uprobes.c  | 83 ++
> >  include/linux/syscalls.h   |  2 +
> >  include/linux/uprobes.h|  2 +
> >  include/uapi/asm-generic/unistd.h  |  5 +-
> >  kernel/events/uprobes.c| 18 --
> >  kernel/sys_ni.c|  2 +
> >  7 files changed, 108 insertions(+), 5 deletions(-)
> > 
> > diff --git a/arch/x86/entry/syscalls/syscall_64.tbl 
> > b/arch/x86/entry/syscalls/syscall_64.tbl
> > index 7e8d46f4147f..af0a33ab06ee 100644
> > --- a/arch/x86/entry/syscalls/syscall_64.tbl
> > +++ b/arch/x86/entry/syscalls/syscall_64.tbl
> > @@ -383,6 +383,7 @@
> >  459common  lsm_get_self_attr   sys_lsm_get_self_attr
> >  460common  lsm_set_self_attr   sys_lsm_set_self_attr
> >  461common  lsm_list_modules

[PATCH 06/34] tracing: hide unused ftrace_event_id_fops

2024-04-03 Thread Arnd Bergmann

From: Arnd Bergmann 

When CONFIG_PERF_EVENTS, a 'make W=1' build produces a warning about the
unused ftrace_event_id_fops variable:

kernel/trace/trace_events.c:2155:37: error: 'ftrace_event_id_fops' defined but 
not used [-Werror=unused-const-variable=]
 2155 | static const struct file_operations ftrace_event_id_fops = {

Hide this in the same #ifdef as the reference to it.

Fixes: 620a30e97feb ("tracing: Don't pass file_operations array to 
event_create_dir()")
Signed-off-by: Arnd Bergmann 
---
 kernel/trace/trace_events.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/kernel/trace/trace_events.c b/kernel/trace/trace_events.c
index 7c364b87352e..52f75c36bbca 100644
--- a/kernel/trace/trace_events.c
+++ b/kernel/trace/trace_events.c
@@ -1670,6 +1670,7 @@ static int trace_format_open(struct inode *inode, struct 
file *file)
return 0;
 }
 
+#ifdef CONFIG_PERF_EVENTS
 static ssize_t
 event_id_read(struct file *filp, char __user *ubuf, size_t cnt, loff_t *ppos)
 {
@@ -1684,6 +1685,7 @@ event_id_read(struct file *filp, char __user *ubuf, 
size_t cnt, loff_t *ppos)
 
return simple_read_from_buffer(ubuf, cnt, ppos, buf, len);
 }
+#endif
 
 static ssize_t
 event_filter_read(struct file *filp, char __user *ubuf, size_t cnt,
@@ -2152,10 +2154,12 @@ static const struct file_operations 
ftrace_event_format_fops = {
.release = seq_release,
 };
 
+#ifdef CONFIG_PERF_EVENTS
 static const struct file_operations ftrace_event_id_fops = {
.read = event_id_read,
.llseek = default_llseek,
 };
+#endif
 
 static const struct file_operations ftrace_event_filter_fops = {
.open = tracing_open_file_tr,
-- 
2.39.2

Re: [PATCH net v3] virtio_net: Do not send RSS key if it is not supported

2024-04-03 Thread Heng Qi





在 2024/4/1 上午4:20, Michael S. Tsirkin 写道:

On Fri, Mar 29, 2024 at 10:16:41AM -0700, Breno Leitao wrote:

There is a bug when setting the RSS options in virtio_net that can break
the whole machine, getting the kernel into an infinite loop.

Running the following command in any QEMU virtual machine with virtionet
will reproduce this problem:

 # ethtool -X eth0  hfunc toeplitz

This is how the problem happens:

1) ethtool_set_rxfh() calls virtnet_set_rxfh()

2) virtnet_set_rxfh() calls virtnet_commit_rss_command()

3) virtnet_commit_rss_command() populates 4 entries for the rss
scatter-gather

4) Since the command above does not have a key, then the last
scatter-gatter entry will be zeroed, since rss_key_size == 0.
sg_buf_size = vi->rss_key_size;

5) This buffer is passed to qemu, but qemu is not happy with a buffer
with zero length, and do the following in virtqueue_map_desc() (QEMU
function):

   if (!sz) {
   virtio_error(vdev, "virtio: zero sized buffers are not allowed");

6) virtio_error() (also QEMU function) set the device as broken

 vdev->broken = true;

7) Qemu bails out, and do not repond this crazy kernel.

8) The kernel is waiting for the response to come back (function
virtnet_send_command())

9) The kernel is waiting doing the following :

   while (!virtqueue_get_buf(vi->cvq, ) &&
 !virtqueue_is_broken(vi->cvq))
  cpu_relax();

10) None of the following functions above is true, thus, the kernel
loops here forever. Keeping in mind that virtqueue_is_broken() does
not look at the qemu `vdev->broken`, so, it never realizes that the
vitio is broken at QEMU side.

Fix it by not sending RSS commands if the feature is not available in
the device.

Fixes: c7114b1249fa ("drivers/net/virtio_net: Added basic RSS support.")
Cc: sta...@vger.kernel.org

net has its own stable process, don't CC stable on net patches.



Cc: qemu-de...@nongnu.org
Signed-off-by: Breno Leitao 
---
Changelog:

V2:
   * Moved from creating a valid packet, by rejecting the request
 completely
V3:
   * Got some good feedback from and Xuan Zhuo and Heng Qi, and reworked
 the rejection path.

---
  drivers/net/virtio_net.c | 22 ++
  1 file changed, 18 insertions(+), 4 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index c22d1118a133..c4a21ec51adf 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -3807,6 +3807,7 @@ static int virtnet_set_rxfh(struct net_device *dev,
struct netlink_ext_ack *extack)
  {
struct virtnet_info *vi = netdev_priv(dev);
+   bool update = false;
int i;
  
  	if (rxfh->hfunc != ETH_RSS_HASH_NO_CHANGE &&

@@ -3814,13 +3815,24 @@ static int virtnet_set_rxfh(struct net_device *dev,
return -EOPNOTSUPP;
  
  	if (rxfh->indir) {

+   if (!vi->has_rss)
+   return -EOPNOTSUPP;
+
for (i = 0; i < vi->rss_indir_table_size; ++i)
vi->ctrl->rss.indirection_table[i] = rxfh->indir[i];
+   update = true;
}
-   if (rxfh->key)
+
+   if (rxfh->key) {
+   if (!vi->has_rss && !vi->has_rss_hash_report)
+   return -EOPNOTSUPP;


What's the logic here? Is it || or &&? A comment can't hurt.


&&

Hi Breno,

You can add a comment like the following:

If either _F_HASH_REPORT or _F_RSS are negotiated, the device provides
hash calculation capabilities, that is, hash_key can be configured.

Regards,
Heng




+
memcpy(vi->ctrl->rss.key, rxfh->key, vi->rss_key_size);
+   update = true;
+   }
  
-	virtnet_commit_rss_command(vi);

+   if (update)
+   virtnet_commit_rss_command(vi);
  
  	return 0;

  }
@@ -4729,13 +4741,15 @@ static int virtnet_probe(struct virtio_device *vdev)
if (virtio_has_feature(vdev, VIRTIO_NET_F_HASH_REPORT))
vi->has_rss_hash_report = true;
  
-	if (virtio_has_feature(vdev, VIRTIO_NET_F_RSS))

+   if (virtio_has_feature(vdev, VIRTIO_NET_F_RSS)) {
vi->has_rss = true;
  
-	if (vi->has_rss || vi->has_rss_hash_report) {

vi->rss_indir_table_size =
virtio_cread16(vdev, offsetof(struct virtio_net_config,
rss_max_indirection_table_length));
+   }
+
+   if (vi->has_rss || vi->has_rss_hash_report) {
vi->rss_key_size =
virtio_cread8(vdev, offsetof(struct virtio_net_config, 
rss_max_key_size));
  
--

2.43.0

[PATCH net-next 6/6] rstreason: make it work in trace world

2024-04-03 Thread Jason Xing

From: Jason Xing 

At last, we should let it work by introducing this reset reason in
trace world.

One of the possible expected outputs is:
... tcp_send_reset: skbaddr=xxx skaddr=xxx src=xxx dest=xxx
state=TCP_ESTABLISHED reason=NOT_SPECIFIED

Signed-off-by: Jason Xing 
---
 include/trace/events/tcp.h | 37 +
 net/ipv4/tcp_ipv4.c|  2 +-
 net/ipv4/tcp_output.c  |  2 +-
 net/ipv6/tcp_ipv6.c|  2 +-
 4 files changed, 36 insertions(+), 7 deletions(-)

diff --git a/include/trace/events/tcp.h b/include/trace/events/tcp.h
index 5c04a61a11c2..9bed9e63c9c5 100644
--- a/include/trace/events/tcp.h
+++ b/include/trace/events/tcp.h
@@ -11,6 +11,7 @@
 #include 
 #include 
 #include 
+#include 
 
 /*
  * tcp event with arguments sk and skb
@@ -74,20 +75,38 @@ DEFINE_EVENT(tcp_event_sk_skb, tcp_retransmit_skb,
TP_ARGS(sk, skb)
 );
 
+#undef FN1
+#define FN1(reason)TRACE_DEFINE_ENUM(SK_RST_REASON_##reason);
+#undef FN2
+#define FN2(reason)TRACE_DEFINE_ENUM(SKB_DROP_REASON_##reason);
+DEFINE_RST_REASON(FN1, FN1)
+
+#undef FN1
+#undef FNe1
+#define FN1(reason){ SK_RST_REASON_##reason, #reason },
+#define FNe1(reason)   { SK_RST_REASON_##reason, #reason }
+
+#undef FN2
+#undef FNe2
+#define FN2(reason){ SKB_DROP_REASON_##reason, #reason },
+#define FNe2(reason)   { SKB_DROP_REASON_##reason, #reason }
 /*
  * skb of trace_tcp_send_reset is the skb that caused RST. In case of
  * active reset, skb should be NULL
  */
 TRACE_EVENT(tcp_send_reset,
 
-   TP_PROTO(const struct sock *sk, const struct sk_buff *skb),
+   TP_PROTO(const struct sock *sk,
+const struct sk_buff *skb,
+const int reason),
 
-   TP_ARGS(sk, skb),
+   TP_ARGS(sk, skb, reason),
 
TP_STRUCT__entry(
__field(const void *, skbaddr)
__field(const void *, skaddr)
__field(int, state)
+   __field(int, reason)
__array(__u8, saddr, sizeof(struct sockaddr_in6))
__array(__u8, daddr, sizeof(struct sockaddr_in6))
),
@@ -113,14 +132,24 @@ TRACE_EVENT(tcp_send_reset,
 */
TP_STORE_ADDR_PORTS_SKB(skb, th, entry->daddr, 
entry->saddr);
}
+   __entry->reason = reason;
),
 
-   TP_printk("skbaddr=%p skaddr=%p src=%pISpc dest=%pISpc state=%s",
+   TP_printk("skbaddr=%p skaddr=%p src=%pISpc dest=%pISpc state=%s 
reason=%s",
  __entry->skbaddr, __entry->skaddr,
  __entry->saddr, __entry->daddr,
- __entry->state ? show_tcp_state_name(__entry->state) : 
"UNKNOWN")
+ __entry->state ? show_tcp_state_name(__entry->state) : 
"UNKNOWN",
+ __entry->reason < RST_REASON_START ?
+   __print_symbolic(__entry->reason, 
DEFINE_DROP_REASON(FN2, FNe2)) :
+   __print_symbolic(__entry->reason, 
DEFINE_RST_REASON(FN1, FNe1)))
 );
 
+#undef FN1
+#undef FNe1
+
+#undef FN2
+#undef FNe2
+
 /*
  * tcp event with arguments sk
  *
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 35b0f3bbf596..3aee7cb35ee4 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -871,7 +871,7 @@ static void tcp_v4_send_reset(const struct sock *sk, struct 
sk_buff *skb,
if (sk)
arg.bound_dev_if = sk->sk_bound_dev_if;
 
-   trace_tcp_send_reset(sk, skb);
+   trace_tcp_send_reset(sk, skb, reason);
 
BUILD_BUG_ON(offsetof(struct sock, sk_bound_dev_if) !=
 offsetof(struct inet_timewait_sock, tw_bound_dev_if));
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 18fbbad2028a..d5a7ecfcc1b3 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -3608,7 +3608,7 @@ void tcp_send_active_reset(struct sock *sk, gfp_t 
priority, int reason)
/* skb of trace_tcp_send_reset() keeps the skb that caused RST,
 * skb here is different to the troublesome skb, so use NULL
 */
-   trace_tcp_send_reset(sk, NULL);
+   trace_tcp_send_reset(sk, NULL, SK_RST_REASON_NOT_SPECIFIED);
 }
 
 /* Send a crossed SYN-ACK during socket establishment.
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index cfcfa2626899..da2f70ad89b5 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -1131,7 +1131,7 @@ static void tcp_v6_send_reset(const struct sock *sk, 
struct sk_buff *skb,
label = ip6_flowlabel(ipv6h);
}
 
-   trace_tcp_send_reset(sk, skb);
+   trace_tcp_send_reset(sk, skb, reason);
 
tcp_v6_send_response(sk, skb, seq, ack_seq, 0, 0, 0, oif, 1,
 ipv6_get_dsfield(ipv6h), label, priority, txhash,
-- 
2.37.3

[PATCH net-next 5/6] mptcp: support rstreason for passive reset

2024-04-03 Thread Jason Xing

From: Jason Xing 

It relys on what reset options in MPTCP does as rfc8684 says. Reusing
this logic can save us much energy. This patch replaces all the prior
NOT_SPECIFIED reasons.

Signed-off-by: Jason Xing 
---
 net/mptcp/subflow.c | 26 --
 1 file changed, 20 insertions(+), 6 deletions(-)

diff --git a/net/mptcp/subflow.c b/net/mptcp/subflow.c
index a68d5d0f3e2a..24668d3020aa 100644
--- a/net/mptcp/subflow.c
+++ b/net/mptcp/subflow.c
@@ -304,7 +304,10 @@ static struct dst_entry *subflow_v4_route_req(const struct 
sock *sk,
 
dst_release(dst);
if (!req->syncookie)
-   tcp_request_sock_ops.send_reset(sk, skb, 
SK_RST_REASON_NOT_SPECIFIED);
+   /* According to RFC 8684, 3.2. Starting a New Subflow,
+* we should use an "MPTCP specific error" reason code.
+*/
+   tcp_request_sock_ops.send_reset(sk, skb, 
SK_RST_REASON_MPTCP_RST_EMPTCP);
return NULL;
 }
 
@@ -371,7 +374,10 @@ static struct dst_entry *subflow_v6_route_req(const struct 
sock *sk,
 
dst_release(dst);
if (!req->syncookie)
-   tcp6_request_sock_ops.send_reset(sk, skb, 
SK_RST_REASON_NOT_SPECIFIED);
+   /* According to RFC 8684, 3.2. Starting a New Subflow,
+* we should use an "MPTCP specific error" reason code.
+*/
+   tcp6_request_sock_ops.send_reset(sk, skb, 
SK_RST_REASON_MPTCP_RST_EMPTCP);
return NULL;
 }
 #endif
@@ -778,6 +784,7 @@ static struct sock *subflow_syn_recv_sock(const struct sock 
*sk,
bool fallback, fallback_is_fatal;
struct mptcp_sock *owner;
struct sock *child;
+   int reason;
 
pr_debug("listener=%p, req=%p, conn=%p", listener, req, listener->conn);
 
@@ -833,7 +840,8 @@ static struct sock *subflow_syn_recv_sock(const struct sock 
*sk,
 */
if (!ctx || fallback) {
if (fallback_is_fatal) {
-   subflow_add_reset_reason(skb, MPTCP_RST_EMPTCP);
+   reason = MPTCP_RST_EMPTCP;
+   subflow_add_reset_reason(skb, reason);
goto dispose_child;
}
goto fallback;
@@ -861,7 +869,8 @@ static struct sock *subflow_syn_recv_sock(const struct sock 
*sk,
} else if (ctx->mp_join) {
owner = subflow_req->msk;
if (!owner) {
-   subflow_add_reset_reason(skb, 
MPTCP_RST_EPROHIBIT);
+   reason = MPTCP_RST_EPROHIBIT;
+   subflow_add_reset_reason(skb, reason);
goto dispose_child;
}
 
@@ -875,13 +884,18 @@ static struct sock *subflow_syn_recv_sock(const struct 
sock *sk,
 ntohs(inet_sk((struct sock 
*)owner)->inet_sport));
if (!mptcp_pm_sport_in_anno_list(owner, sk)) {
SUBFLOW_REQ_INC_STATS(req, 
MPTCP_MIB_MISMATCHPORTACKRX);
+   reason = MPTCP_RST_EUNSPEC;
goto dispose_child;
}
SUBFLOW_REQ_INC_STATS(req, 
MPTCP_MIB_JOINPORTACKRX);
}
 
-   if (!mptcp_finish_join(child))
+   if (!mptcp_finish_join(child)) {
+   struct mptcp_subflow_context *subflow = 
mptcp_subflow_ctx(sk);
+
+   reason = subflow->reset_reason;
goto dispose_child;
+   }
 
SUBFLOW_REQ_INC_STATS(req, MPTCP_MIB_JOINACKRX);
tcp_rsk(req)->drop_req = true;
@@ -901,7 +915,7 @@ static struct sock *subflow_syn_recv_sock(const struct sock 
*sk,
tcp_rsk(req)->drop_req = true;
inet_csk_prepare_for_destroy_sock(child);
tcp_done(child);
-   req->rsk_ops->send_reset(sk, skb, SK_RST_REASON_NOT_SPECIFIED);
+   req->rsk_ops->send_reset(sk, skb, convert_mptcp_reason(reason));
 
/* The last child reference will be released by the caller */
return child;
-- 
2.37.3

[PATCH net-next 4/6] tcp: support rstreason for passive reset

2024-04-03 Thread Jason Xing

From: Jason Xing 

Reuse the dropreason logic to show the exact reason of tcp reset,
so we don't need to implement those duplicated reset reasons.
This patch replaces all the prior NOT_SPECIFIED reasons.

Signed-off-by: Jason Xing 
---
 net/ipv4/tcp_ipv4.c | 8 
 net/ipv6/tcp_ipv6.c | 8 
 2 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 1c8248abe37a..35b0f3bbf596 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -1935,7 +1935,7 @@ int tcp_v4_do_rcv(struct sock *sk, struct sk_buff *skb)
return 0;
 
 reset:
-   tcp_v4_send_reset(rsk, skb, SK_RST_REASON_NOT_SPECIFIED);
+   tcp_v4_send_reset(rsk, skb, reason);
 discard:
kfree_skb_reason(skb, reason);
/* Be careful here. If this function gets more complicated and
@@ -2280,7 +2280,7 @@ int tcp_v4_rcv(struct sk_buff *skb)
} else {
drop_reason = tcp_child_process(sk, nsk, skb);
if (drop_reason) {
-   tcp_v4_send_reset(nsk, skb, 
SK_RST_REASON_NOT_SPECIFIED);
+   tcp_v4_send_reset(nsk, skb, drop_reason);
goto discard_and_relse;
}
sock_put(sk);
@@ -2358,7 +2358,7 @@ int tcp_v4_rcv(struct sk_buff *skb)
 bad_packet:
__TCP_INC_STATS(net, TCP_MIB_INERRS);
} else {
-   tcp_v4_send_reset(NULL, skb, SK_RST_REASON_NOT_SPECIFIED);
+   tcp_v4_send_reset(NULL, skb, drop_reason);
}
 
 discard_it:
@@ -2409,7 +2409,7 @@ int tcp_v4_rcv(struct sk_buff *skb)
tcp_v4_timewait_ack(sk, skb);
break;
case TCP_TW_RST:
-   tcp_v4_send_reset(sk, skb, SK_RST_REASON_NOT_SPECIFIED);
+   tcp_v4_send_reset(sk, skb, drop_reason);
inet_twsk_deschedule_put(inet_twsk(sk));
goto discard_it;
case TCP_TW_SUCCESS:;
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index f143b658fb71..cfcfa2626899 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -1678,7 +1678,7 @@ int tcp_v6_do_rcv(struct sock *sk, struct sk_buff *skb)
return 0;
 
 reset:
-   tcp_v6_send_reset(sk, skb, SK_RST_REASON_NOT_SPECIFIED);
+   tcp_v6_send_reset(sk, skb, reason);
 discard:
if (opt_skb)
__kfree_skb(opt_skb);
@@ -1864,7 +1864,7 @@ INDIRECT_CALLABLE_SCOPE int tcp_v6_rcv(struct sk_buff 
*skb)
} else {
drop_reason = tcp_child_process(sk, nsk, skb);
if (drop_reason) {
-   tcp_v6_send_reset(nsk, skb, 
SK_RST_REASON_NOT_SPECIFIED);
+   tcp_v6_send_reset(nsk, skb, drop_reason);
goto discard_and_relse;
}
sock_put(sk);
@@ -1940,7 +1940,7 @@ INDIRECT_CALLABLE_SCOPE int tcp_v6_rcv(struct sk_buff 
*skb)
 bad_packet:
__TCP_INC_STATS(net, TCP_MIB_INERRS);
} else {
-   tcp_v6_send_reset(NULL, skb, SK_RST_REASON_NOT_SPECIFIED);
+   tcp_v6_send_reset(NULL, skb, drop_reason);
}
 
 discard_it:
@@ -1995,7 +1995,7 @@ INDIRECT_CALLABLE_SCOPE int tcp_v6_rcv(struct sk_buff 
*skb)
tcp_v6_timewait_ack(sk, skb);
break;
case TCP_TW_RST:
-   tcp_v6_send_reset(sk, skb, SK_RST_REASON_NOT_SPECIFIED);
+   tcp_v6_send_reset(sk, skb, drop_reason);
inet_twsk_deschedule_put(inet_twsk(sk));
goto discard_it;
case TCP_TW_SUCCESS:
-- 
2.37.3

[PATCH net-next 3/6] rstreason: prepare for active reset

2024-04-03 Thread Jason Xing

From: Jason Xing 

Like what we did to passive reset:
only passing possible reset reason in each active reset path.

No functional changes.

Signed-off-by: Jason Xing 
---
 include/net/tcp.h |  2 +-
 net/ipv4/tcp.c| 15 ++-
 net/ipv4/tcp_output.c |  2 +-
 net/ipv4/tcp_timer.c  |  9 ++---
 net/mptcp/protocol.c  |  4 +++-
 net/mptcp/subflow.c   |  5 +++--
 6 files changed, 24 insertions(+), 13 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index 6ae35199d3b3..2b9b9d3d8065 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -667,7 +667,7 @@ int tcp_fragment(struct sock *sk, enum tcp_queue tcp_queue,
 void tcp_send_probe0(struct sock *);
 int tcp_write_wakeup(struct sock *, int mib);
 void tcp_send_fin(struct sock *sk);
-void tcp_send_active_reset(struct sock *sk, gfp_t priority);
+void tcp_send_active_reset(struct sock *sk, gfp_t priority, int reason);
 int tcp_send_synack(struct sock *);
 void tcp_push_one(struct sock *, unsigned int mss_now);
 void __tcp_send_ack(struct sock *sk, u32 rcv_nxt);
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index e767721b3a58..eacfe0012977 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -275,6 +275,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -2805,7 +2806,8 @@ void __tcp_close(struct sock *sk, long timeout)
/* Unread data was tossed, zap the connection. */
NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPABORTONCLOSE);
tcp_set_state(sk, TCP_CLOSE);
-   tcp_send_active_reset(sk, sk->sk_allocation);
+   tcp_send_active_reset(sk, sk->sk_allocation,
+ SK_RST_REASON_NOT_SPECIFIED);
} else if (sock_flag(sk, SOCK_LINGER) && !sk->sk_lingertime) {
/* Check zero linger _after_ checking for unread data. */
sk->sk_prot->disconnect(sk, 0);
@@ -2879,7 +2881,8 @@ void __tcp_close(struct sock *sk, long timeout)
struct tcp_sock *tp = tcp_sk(sk);
if (READ_ONCE(tp->linger2) < 0) {
tcp_set_state(sk, TCP_CLOSE);
-   tcp_send_active_reset(sk, GFP_ATOMIC);
+   tcp_send_active_reset(sk, GFP_ATOMIC,
+ SK_RST_REASON_NOT_SPECIFIED);
__NET_INC_STATS(sock_net(sk),
LINUX_MIB_TCPABORTONLINGER);
} else {
@@ -2897,7 +2900,8 @@ void __tcp_close(struct sock *sk, long timeout)
if (sk->sk_state != TCP_CLOSE) {
if (tcp_check_oom(sk, 0)) {
tcp_set_state(sk, TCP_CLOSE);
-   tcp_send_active_reset(sk, GFP_ATOMIC);
+   tcp_send_active_reset(sk, GFP_ATOMIC,
+ SK_RST_REASON_NOT_SPECIFIED);
__NET_INC_STATS(sock_net(sk),
LINUX_MIB_TCPABORTONMEMORY);
} else if (!check_net(sock_net(sk))) {
@@ -3001,7 +3005,7 @@ int tcp_disconnect(struct sock *sk, int flags)
/* The last check adjusts for discrepancy of Linux wrt. RFC
 * states
 */
-   tcp_send_active_reset(sk, gfp_any());
+   tcp_send_active_reset(sk, gfp_any(), 
SK_RST_REASON_NOT_SPECIFIED);
WRITE_ONCE(sk->sk_err, ECONNRESET);
} else if (old_state == TCP_SYN_SENT)
WRITE_ONCE(sk->sk_err, ECONNRESET);
@@ -4557,7 +4561,8 @@ int tcp_abort(struct sock *sk, int err)
smp_wmb();
sk_error_report(sk);
if (tcp_need_reset(sk->sk_state))
-   tcp_send_active_reset(sk, GFP_ATOMIC);
+   tcp_send_active_reset(sk, GFP_ATOMIC,
+ SK_RST_REASON_NOT_SPECIFIED);
tcp_done(sk);
}
 
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index e3167ad96567..18fbbad2028a 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -3583,7 +3583,7 @@ void tcp_send_fin(struct sock *sk)
  * was unread data in the receive queue.  This behavior is recommended
  * by RFC 2525, section 2.17.  -DaveM
  */
-void tcp_send_active_reset(struct sock *sk, gfp_t priority)
+void tcp_send_active_reset(struct sock *sk, gfp_t priority, int reason)
 {
struct sk_buff *skb;
 
diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
index d1ad20ce1c8c..7e7110bf3ea2 100644
--- a/net/ipv4/tcp_timer.c
+++ b/net/ipv4/tcp_timer.c
@@ -22,6 +22,7 @@
 #include 
 #include 
 #include 
+#include 
 
 static u32 tcp_clamp_rto_to_user_timeout(const struct sock *sk)
 {
@@ -127,7 +128,8 @@ static int tcp_out_of_resources(struct sock *sk, bool 
do_reset)
(!tp->snd_wnd && !tp->packets_out))
do_reset = true;
if (do_reset)
-

[PATCH net-next 2/6] rstreason: prepare for passive reset

2024-04-03 Thread Jason Xing

From: Jason Xing 

Adjust the paramenter and support passing reason of reset which
is for now NOT_SPECIFIED. No functional changes.

Signed-off-by: Jason Xing 
---
 include/net/request_sock.h |  3 ++-
 net/dccp/ipv4.c| 10 ++
 net/dccp/ipv6.c| 10 ++
 net/dccp/minisocks.c   |  3 ++-
 net/ipv4/tcp_ipv4.c| 12 +++-
 net/ipv4/tcp_minisocks.c   |  3 ++-
 net/ipv6/tcp_ipv6.c| 15 +--
 net/mptcp/subflow.c|  8 +---
 8 files changed, 39 insertions(+), 25 deletions(-)

diff --git a/include/net/request_sock.h b/include/net/request_sock.h
index 004e651e6067..93f9fee7e52f 100644
--- a/include/net/request_sock.h
+++ b/include/net/request_sock.h
@@ -34,7 +34,8 @@ struct request_sock_ops {
void(*send_ack)(const struct sock *sk, struct sk_buff *skb,
struct request_sock *req);
void(*send_reset)(const struct sock *sk,
- struct sk_buff *skb);
+ struct sk_buff *skb,
+ int reason);
void(*destructor)(struct request_sock *req);
void(*syn_ack_timeout)(const struct request_sock *req);
 };
diff --git a/net/dccp/ipv4.c b/net/dccp/ipv4.c
index 44b033fe1ef6..628dd783e8f3 100644
--- a/net/dccp/ipv4.c
+++ b/net/dccp/ipv4.c
@@ -24,6 +24,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "ackvec.h"
 #include "ccid.h"
@@ -521,7 +522,8 @@ static int dccp_v4_send_response(const struct sock *sk, 
struct request_sock *req
return err;
 }
 
-static void dccp_v4_ctl_send_reset(const struct sock *sk, struct sk_buff 
*rxskb)
+static void dccp_v4_ctl_send_reset(const struct sock *sk, struct sk_buff 
*rxskb,
+  int reason)
 {
int err;
const struct iphdr *rxiph;
@@ -706,7 +708,7 @@ int dccp_v4_do_rcv(struct sock *sk, struct sk_buff *skb)
return 0;
 
 reset:
-   dccp_v4_ctl_send_reset(sk, skb);
+   dccp_v4_ctl_send_reset(sk, skb, SK_RST_REASON_NOT_SPECIFIED);
kfree_skb(skb);
return 0;
 }
@@ -869,7 +871,7 @@ static int dccp_v4_rcv(struct sk_buff *skb)
if (nsk == sk) {
reqsk_put(req);
} else if (dccp_child_process(sk, nsk, skb)) {
-   dccp_v4_ctl_send_reset(sk, skb);
+   dccp_v4_ctl_send_reset(sk, skb, 
SK_RST_REASON_NOT_SPECIFIED);
goto discard_and_relse;
} else {
sock_put(sk);
@@ -909,7 +911,7 @@ static int dccp_v4_rcv(struct sk_buff *skb)
if (dh->dccph_type != DCCP_PKT_RESET) {
DCCP_SKB_CB(skb)->dccpd_reset_code =
DCCP_RESET_CODE_NO_CONNECTION;
-   dccp_v4_ctl_send_reset(sk, skb);
+   dccp_v4_ctl_send_reset(sk, skb, SK_RST_REASON_NOT_SPECIFIED);
}
 
 discard_it:
diff --git a/net/dccp/ipv6.c b/net/dccp/ipv6.c
index ded07e09f813..d64f39e26e87 100644
--- a/net/dccp/ipv6.c
+++ b/net/dccp/ipv6.c
@@ -29,6 +29,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "dccp.h"
 #include "ipv6.h"
@@ -256,7 +257,8 @@ static void dccp_v6_reqsk_destructor(struct request_sock 
*req)
kfree_skb(inet_rsk(req)->pktopts);
 }
 
-static void dccp_v6_ctl_send_reset(const struct sock *sk, struct sk_buff 
*rxskb)
+static void dccp_v6_ctl_send_reset(const struct sock *sk, struct sk_buff 
*rxskb,
+  int reason)
 {
const struct ipv6hdr *rxip6h;
struct sk_buff *skb;
@@ -656,7 +658,7 @@ static int dccp_v6_do_rcv(struct sock *sk, struct sk_buff 
*skb)
return 0;
 
 reset:
-   dccp_v6_ctl_send_reset(sk, skb);
+   dccp_v6_ctl_send_reset(sk, skb, SK_RST_REASON_NOT_SPECIFIED);
 discard:
if (opt_skb != NULL)
__kfree_skb(opt_skb);
@@ -762,7 +764,7 @@ static int dccp_v6_rcv(struct sk_buff *skb)
if (nsk == sk) {
reqsk_put(req);
} else if (dccp_child_process(sk, nsk, skb)) {
-   dccp_v6_ctl_send_reset(sk, skb);
+   dccp_v6_ctl_send_reset(sk, skb, 
SK_RST_REASON_NOT_SPECIFIED);
goto discard_and_relse;
} else {
sock_put(sk);
@@ -801,7 +803,7 @@ static int dccp_v6_rcv(struct sk_buff *skb)
if (dh->dccph_type != DCCP_PKT_RESET) {
DCCP_SKB_CB(skb)->dccpd_reset_code =
DCCP_RESET_CODE_NO_CONNECTION;
-   dccp_v6_ctl_send_reset(sk, skb);
+   dccp_v6_ctl_send_reset(sk, skb, SK_RST_REASON_NOT_SPECIFIED);
}
 
 discard_it:
diff --git a/net/dccp/minisocks.c b/net/dccp/minisocks.c
index 64d805b27add..251a57cf5822 100644
--- a/net/dccp/minisocks.c
+++ b/net/dccp/minisocks.c
@@ -15,6 +15,7 @@
 #include

[PATCH net-next 1/6] net: introduce rstreason to detect why the RST is sent

2024-04-03 Thread Jason Xing

From: Jason Xing 

Add a new standalone file for the easy future extension to support
both active reset and passive reset in the TCP/DCCP/MPTCP protocols.

This patch only does the preparations for reset reason mechanism,
nothing else changes.

The reset reasons are divided into three parts:
1) reuse drop reasons for passive reset in TCP
2) reuse MP_TCPRST option for MPTCP
3) our own reasons

I will implement the basic codes of active/passive reset reason in
those three protocols, which is not complete for this moment. But
it provides a new chance to let other people add more reasons into
it:)

Signed-off-by: Jason Xing 
---
 include/net/rstreason.h | 93 +
 1 file changed, 93 insertions(+)
 create mode 100644 include/net/rstreason.h

diff --git a/include/net/rstreason.h b/include/net/rstreason.h
new file mode 100644
index ..24d098a78a60
--- /dev/null
+++ b/include/net/rstreason.h
@@ -0,0 +1,93 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+
+#ifndef _LINUX_RSTREASON_H
+#define _LINUX_RSTREASON_H
+#include 
+
+#define DEFINE_RST_REASON(FN, FNe) \
+   FN(MPTCP_RST_EUNSPEC)   \
+   FN(MPTCP_RST_EMPTCP)\
+   FN(MPTCP_RST_ERESOURCE) \
+   FN(MPTCP_RST_EPROHIBIT) \
+   FN(MPTCP_RST_EWQ2BIG)   \
+   FN(MPTCP_RST_EBADPERF)  \
+   FN(MPTCP_RST_EMIDDLEBOX)\
+   FN(NOT_SPECIFIED)   \
+   FNe(MAX)
+
+#define RST_REASON_START (SKB_DROP_REASON_MAX + 1)
+
+/* There are three parts in order:
+ * 1) 0 - SKB_DROP_REASON_MAX: rely on drop reasons for passive reset in TCP
+ * 2) SKB_DROP_REASON_MAX + 1 - MPTCP_RST_EMIDDLEBOX: for MPTCP use
+ * 3) MPTCP_RST_EMIDDLEBOX - SK_RST_REASON_MAX: independent reset reason
+ */
+enum sk_rst_reason {
+   /* Leave this 'blank' part (0-SKB_DROP_REASON_MAX) for the reuse
+* of skb drop reason because rst reason relies on what drop reason
+* indicates exactly why it could happen.
+*/
+
+   /* Copy from include/uapi/linux/mptcp.h.
+* These reset fields will not be changed since they adhere to
+* RFC 8684. So do not touch them. I'm going to list each definition
+* of them respectively.
+*/
+   /* Unspecified error.
+* This is the default error; it implies that the subflow is no
+* longer available. The presence of this option shows that the
+* RST was generated by an MPTCP-aware device.
+*/
+   SK_RST_REASON_MPTCP_RST_EUNSPEC = RST_REASON_START,
+   /* MPTCP-specific error.
+* An error has been detected in the processing of MPTCP options.
+* This is the usual reason code to return in the cases where a RST
+* is being sent to close a subflow because of an invalid response.
+*/
+   SK_RST_REASON_MPTCP_RST_EMPTCP,
+   /* Lack of resources.
+* This code indicates that the sending host does not have enough
+* resources to support the terminated subflow.
+*/
+   SK_RST_REASON_MPTCP_RST_ERESOURCE,
+   /* Administratively prohibited.
+* This code indicates that the requested subflow is prohibited by
+* the policies of the sending host.
+*/
+   SK_RST_REASON_MPTCP_RST_EPROHIBIT,
+   /* Too much outstanding data.
+* This code indicates that there is an excessive amount of data
+* that needs to be transmitted over the terminated subflow while
+* having already been acknowledged over one or more other subflows.
+* This may occur if a path has been unavailable for a short period
+* and it is more efficient to reset and start again than it is to
+* retransmit the queued data.
+*/
+   SK_RST_REASON_MPTCP_RST_EWQ2BIG,
+   /* Unacceptable performance.
+* This code indicates that the performance of this subflow was
+* too low compared to the other subflows of this Multipath TCP
+* connection.
+*/
+   SK_RST_REASON_MPTCP_RST_EBADPERF,
+   /* Middlebox interference.
+* Middlebox interference has been detected over this subflow,
+* making MPTCP signaling invalid. For example, this may be sent
+* if the checksum does not validate.
+*/
+   SK_RST_REASON_MPTCP_RST_EMIDDLEBOX,
+
+   /* For the real standalone socket reset reason, we start from here */
+   SK_RST_REASON_NOT_SPECIFIED,
+
+   /* Maximum of socket reset reasons.
+* It shouldn't be used as a real 'reason'.
+*/
+   SK_RST_REASON_MAX,
+};
+
+static inline int convert_mptcp_reason(int reason)
+{
+   return reason += RST_REASON_START;
+}
+#endif
-- 
2.37.3

[PATCH net-next 0/6] Implement reset reason mechanism to detect

2024-04-03 Thread Jason Xing

From: Jason Xing 

In production, there are so many cases about why the RST skb is sent but
we don't have a very convenient/fast method to detect the exact underlying
reasons.

RST is implemented in two kinds: passive kind (like tcp_v4_send_reset())
and active kind (like tcp_send_active_reset()). The former can be traced
carefully 1) in TCP, with the help of drop reasons, which is based on
Eric's idea[1], 2) in MPTCP, with the help of reset options defined in
RFC 8684. The latter is relatively independent, which should be
implemented on our own.

In this series, I focus on the fundamental implement mostly about how
the rstreason mechnism and the detailed passive part works as an
example, not including the active reset part. In future, we can go
further and refine those NOT_SPECIFIED reasons.

Here are some examples when tracing:
-0   [002] ..s1.  1830.262425: tcp_send_reset: skbaddr=x
skaddr=x src=x dest=x state=x reason=NOT_SPECIFIED
-0   [002] ..s1.  1830.262425: tcp_send_reset: skbaddr=x
skaddr=x src=x dest=x state=x reason=NO_SOCKET

[1]
Link: 
https://lore.kernel.org/all/CANn89iJw8x-LqgsWOeJQQvgVg6DnL5aBRLi10QN2WBdr+X4k=w...@mail.gmail.com/

Note:
It's based on top of 
https://patchwork.kernel.org/project/netdevbpf/list/?series=840182


Jason Xing (6):
  net: introduce rstreason to detect why the RST is sent
  rstreason: prepare for passive reset
  rstreason: prepare for active reset
  tcp: support rstreason for passive reset
  mptcp: support rstreason for passive reset
  rstreason: make it work in trace world

 include/net/request_sock.h |  3 +-
 include/net/rstreason.h| 93 ++
 include/net/tcp.h  |  2 +-
 include/trace/events/tcp.h | 37 +--
 net/dccp/ipv4.c| 10 ++--
 net/dccp/ipv6.c| 10 ++--
 net/dccp/minisocks.c   |  3 +-
 net/ipv4/tcp.c | 15 --
 net/ipv4/tcp_ipv4.c| 14 +++---
 net/ipv4/tcp_minisocks.c   |  3 +-
 net/ipv4/tcp_output.c  |  4 +-
 net/ipv4/tcp_timer.c   |  9 ++--
 net/ipv6/tcp_ipv6.c| 17 ---
 net/mptcp/protocol.c   |  4 +-
 net/mptcp/subflow.c| 33 ++
 15 files changed, 209 insertions(+), 48 deletions(-)
 create mode 100644 include/net/rstreason.h

-- 
2.37.3

Re: [PATCH v4 4/4] remoteproc: stm32: Add support of an OP-TEE TA to load the firmware

2024-04-03 Thread Arnaud POULIQUEN




On 4/1/24 17:46, Mathieu Poirier wrote:
> On Fri, Mar 29, 2024 at 11:57:43AM +0100, Arnaud POULIQUEN wrote:
>>
>>
>> On 3/27/24 18:14, Mathieu Poirier wrote:
>>> On Tue, Mar 26, 2024 at 08:31:33PM +0100, Arnaud POULIQUEN wrote:


 On 3/25/24 17:51, Mathieu Poirier wrote:
> On Fri, Mar 08, 2024 at 03:47:08PM +0100, Arnaud Pouliquen wrote:
>> The new TEE remoteproc device is used to manage remote firmware in a
>> secure, trusted context. The 'st,stm32mp1-m4-tee' compatibility is
>> introduced to delegate the loading of the firmware to the trusted
>> execution context. In such cases, the firmware should be signed and
>> adhere to the image format defined by the TEE.
>>
>> Signed-off-by: Arnaud Pouliquen 
>> ---
>> Updates from V3:
>> - remove support of the attach use case. Will be addressed in a separate
>>   thread,
>> - add st_rproc_tee_ops::parse_fw ops,
>> - inverse call of devm_rproc_alloc()and tee_rproc_register() to manage 
>> cross
>>   reference between the rproc struct and the tee_rproc struct in 
>> tee_rproc.c.
>> ---
>>  drivers/remoteproc/stm32_rproc.c | 60 +---
>>  1 file changed, 56 insertions(+), 4 deletions(-)
>>
>> diff --git a/drivers/remoteproc/stm32_rproc.c 
>> b/drivers/remoteproc/stm32_rproc.c
>> index 8cd838df4e92..13df33c78aa2 100644
>> --- a/drivers/remoteproc/stm32_rproc.c
>> +++ b/drivers/remoteproc/stm32_rproc.c
>> @@ -20,6 +20,7 @@
>>  #include 
>>  #include 
>>  #include 
>> +#include 
>>  #include 
>>  
>>  #include "remoteproc_internal.h"
>> @@ -49,6 +50,9 @@
>>  #define M4_STATE_STANDBY4
>>  #define M4_STATE_CRASH  5
>>  
>> +/* Remote processor unique identifier aligned with the Trusted 
>> Execution Environment definitions */
>
> Why is this the case?  At least from the kernel side it is possible to 
> call
> tee_rproc_register() with any kind of value, why is there a need to be any
> kind of alignment with the TEE?


 The use of the proc_id is to identify a processor in case of multi 
 co-processors.

>>>
>>> That is well understood.
>>>
 For instance we can have a system with A DSP and a modem. We would use the 
 same
 TEE service, but
>>>
>>> That too.
>>>
 the TEE driver will probably be different, same for the signature key.
>>>
>>> What TEE driver are we talking about here?
>>
>> In OP-TEE remoteproc frameork is divided in 2 or  3 layers:
>>
>> - the remoteproc Trusted Application (TA) [1] which is platform agnostic
>> - The remoteproc Pseudo Trusted Application (PTA) [2] which is platform
>> dependent and can rely on the proc ID to retrieve the context.
>> - the remoteproc driver (optional for some platforms) [3], which is in charge
>>  of DT parsing and platform configuration.
>>
> 
> That part makes sense.
> 
>> Here TEE driver can be interpreted by remote PTA and/or platform driver.
>>
> 
> I have to guess PTA means "Platform Trusted Application" but I have no
> guarantee, adding to the level of (already high) confusion brought on by this
> patchset.

As mentioned above, PTA is Pseudo Trusted Application. It is an interface
exposed by OP-TEE core to the OP-TEE application.
In this case PTA is used to implement the platform part.
If you need more details about the remoteproc framework in OP-TEE, there is a
link in the to a presentation in the cover letter.

> 
>> [1]
>> https://elixir.bootlin.com/op-tee/latest/source/ta/remoteproc/src/remoteproc_core.c
>> [2]
>> https://elixir.bootlin.com/op-tee/latest/source/core/pta/stm32mp/remoteproc_pta.c
>> [3]
>> https://elixir.bootlin.com/op-tee/latest/source/core/drivers/remoteproc/stm32_remoteproc.c
>>
>>>
 In such case the proc ID allows to identify the the processor you want to 
 address.

>>>
>>> That too is well understood, but there is no alignment needed with the TEE, 
>>> i.e
>>> the TEE application is not expecting a value of '0'.  We could set
>>> STM32_MP1_M4_PROC_ID to 0xDEADBEEF and things would work.  This driver 
>>> won't go
>>> anywhere for as long as it is not the case.
>>
>>
>> Here I suppose that you do not challenge the rproc_ID use in general, but for
>> the stm32mp1 platform as we have only one remote processor. I'm right?
> 
> That is correct - I understand the need for an rproc_ID.  The problem is with
> the comment that states that '0' is aligned with the TEE definitions, which in
> my head means hard coded value and a big red flag.  What it should say is
> "aligned with the TEE device tree definition". 
> 
>>
>> In OP-TEE the check is done here:
>> https://elixir.bootlin.com/op-tee/latest/source/core/drivers/remoteproc/stm32_remoteproc.c#L64
>>
>> If driver does not register the proc ID an error is returned indicating that 
>> the
>> feature is not supported.
>>
>> In case of stm32mp1 yes we could consider it as

Re: [PATCH] vhost-vdpa: change ioctl # for VDPA_GET_VRING_SIZE

2024-04-03 Thread Zhu, Lingshan





On 4/3/2024 5:21 AM, Michael S. Tsirkin wrote:

VDPA_GET_VRING_SIZE by mistake uses the already occupied
ioctl # 0x80 and we never noticed - it happens to work
because the direction and size are different, but confuses
tools such as perf which like to look at just the number,
and breaks the extra robustness of the ioctl numbering macros.

To fix, sort the entries and renumber the ioctl - not too late
since it wasn't in any released kernels yet.

Cc: Arnaldo Carvalho de Melo 
Reported-by: Namhyung Kim 
Fixes: 1496c47065f9 ("vhost-vdpa: uapi to support reporting per vq size")
Cc: "Zhu Lingshan" 
Signed-off-by: Michael S. Tsirkin 


Reviewed-by: Zhu Lingshan 

Thanks for the fix and sorry for the mess, I should read the whole header file 
to check whether
the number is available.


---

Build tested only - userspace patches using this will have to adjust.
I will merge this in a week or so unless I hear otherwise,
and afterwards perf can update there header.

  include/uapi/linux/vhost.h | 15 ---
  1 file changed, 8 insertions(+), 7 deletions(-)

diff --git a/include/uapi/linux/vhost.h b/include/uapi/linux/vhost.h
index bea697390613..b95dd84eef2d 100644
--- a/include/uapi/linux/vhost.h
+++ b/include/uapi/linux/vhost.h
@@ -179,12 +179,6 @@
  /* Get the config size */
  #define VHOST_VDPA_GET_CONFIG_SIZE_IOR(VHOST_VIRTIO, 0x79, __u32)
  
-/* Get the count of all virtqueues */

-#define VHOST_VDPA_GET_VQS_COUNT   _IOR(VHOST_VIRTIO, 0x80, __u32)
-
-/* Get the number of virtqueue groups. */
-#define VHOST_VDPA_GET_GROUP_NUM   _IOR(VHOST_VIRTIO, 0x81, __u32)
-
  /* Get the number of address spaces. */
  #define VHOST_VDPA_GET_AS_NUM _IOR(VHOST_VIRTIO, 0x7A, unsigned int)
  
@@ -228,10 +222,17 @@

  #define VHOST_VDPA_GET_VRING_DESC_GROUP   _IOWR(VHOST_VIRTIO, 0x7F,   
\
  struct vhost_vring_state)
  
+

+/* Get the count of all virtqueues */
+#define VHOST_VDPA_GET_VQS_COUNT   _IOR(VHOST_VIRTIO, 0x80, __u32)
+
+/* Get the number of virtqueue groups. */
+#define VHOST_VDPA_GET_GROUP_NUM   _IOR(VHOST_VIRTIO, 0x81, __u32)
+
  /* Get the queue size of a specific virtqueue.
   * userspace set the vring index in vhost_vring_state.index
   * kernel set the queue size in vhost_vring_state.num
   */
-#define VHOST_VDPA_GET_VRING_SIZE  _IOWR(VHOST_VIRTIO, 0x80,   \
+#define VHOST_VDPA_GET_VRING_SIZE  _IOWR(VHOST_VIRTIO, 0x82,   \
  struct vhost_vring_state)
  #endif

Re: [PATCH] vhost-vdpa: change ioctl # for VDPA_GET_VRING_SIZE

2024-04-03 Thread Eugenio Perez Martin

On Tue, Apr 2, 2024 at 11:21 PM Michael S. Tsirkin  wrote:
>
> VDPA_GET_VRING_SIZE by mistake uses the already occupied
> ioctl # 0x80 and we never noticed - it happens to work
> because the direction and size are different, but confuses
> tools such as perf which like to look at just the number,
> and breaks the extra robustness of the ioctl numbering macros.
>
> To fix, sort the entries and renumber the ioctl - not too late
> since it wasn't in any released kernels yet.
>
> Cc: Arnaldo Carvalho de Melo 
> Reported-by: Namhyung Kim 
> Fixes: 1496c47065f9 ("vhost-vdpa: uapi to support reporting per vq size")
> Cc: "Zhu Lingshan" 
> Signed-off-by: Michael S. Tsirkin 

Reviewed-by: Eugenio Pérez 

> ---
>
> Build tested only - userspace patches using this will have to adjust.
> I will merge this in a week or so unless I hear otherwise,
> and afterwards perf can update there header.
>
>  include/uapi/linux/vhost.h | 15 ---
>  1 file changed, 8 insertions(+), 7 deletions(-)
>
> diff --git a/include/uapi/linux/vhost.h b/include/uapi/linux/vhost.h
> index bea697390613..b95dd84eef2d 100644
> --- a/include/uapi/linux/vhost.h
> +++ b/include/uapi/linux/vhost.h
> @@ -179,12 +179,6 @@
>  /* Get the config size */
>  #define VHOST_VDPA_GET_CONFIG_SIZE _IOR(VHOST_VIRTIO, 0x79, __u32)
>
> -/* Get the count of all virtqueues */
> -#define VHOST_VDPA_GET_VQS_COUNT   _IOR(VHOST_VIRTIO, 0x80, __u32)
> -
> -/* Get the number of virtqueue groups. */
> -#define VHOST_VDPA_GET_GROUP_NUM   _IOR(VHOST_VIRTIO, 0x81, __u32)
> -
>  /* Get the number of address spaces. */
>  #define VHOST_VDPA_GET_AS_NUM  _IOR(VHOST_VIRTIO, 0x7A, unsigned int)
>
> @@ -228,10 +222,17 @@
>  #define VHOST_VDPA_GET_VRING_DESC_GROUP_IOWR(VHOST_VIRTIO, 0x7F, 
>   \
>   struct vhost_vring_state)
>
> +
> +/* Get the count of all virtqueues */
> +#define VHOST_VDPA_GET_VQS_COUNT   _IOR(VHOST_VIRTIO, 0x80, __u32)
> +
> +/* Get the number of virtqueue groups. */
> +#define VHOST_VDPA_GET_GROUP_NUM   _IOR(VHOST_VIRTIO, 0x81, __u32)
> +
>  /* Get the queue size of a specific virtqueue.
>   * userspace set the vring index in vhost_vring_state.index
>   * kernel set the queue size in vhost_vring_state.num
>   */
> -#define VHOST_VDPA_GET_VRING_SIZE  _IOWR(VHOST_VIRTIO, 0x80,   \
> +#define VHOST_VDPA_GET_VRING_SIZE  _IOWR(VHOST_VIRTIO, 0x82,   \
>   struct vhost_vring_state)
>  #endif
> --
> MST
>

60 matches

Mail list logo