[PATCH v3] Documentation: Add reconnect process for VDUSE
Add a document explaining the reconnect process, including what the
userspace application needs to do and how it works with the kernel.

Signed-off-by: Cindy Lu
---
 Documentation/userspace-api/vduse.rst | 41 +++
 1 file changed, 41 insertions(+)

diff --git a/Documentation/userspace-api/vduse.rst b/Documentation/userspace-api/vduse.rst
index bdb880e01132..7faa83462e78 100644
--- a/Documentation/userspace-api/vduse.rst
+++ b/Documentation/userspace-api/vduse.rst
@@ -231,3 +231,44 @@ able to start the dataplane processing as follows:
   after the used ring is filled.
 
 For more details on the uAPI, please see include/uapi/linux/vduse.h.
+
+HOW VDUSE device reconnection works
+
+1. What is reconnection?
+
+   When the userspace application loads, it establishes a connection to
+   the VDUSE kernel device. Sometimes the userspace application exits,
+   and we want to support restarting it and connecting to the kernel
+   device again.
+
+2. How can I support reconnection in a userspace application?
+
+2.1 During initialization, the userspace application should first check
+for the existence of the device "/dev/vduse/vduse_name".
+If it does not exist, this is a first-time connection; go to step 2.2.
+If it exists, this is a reconnection; go to step 2.3.
+
+2.2 Create a new VDUSE instance with ioctl(VDUSE_CREATE_DEV) on
+/dev/vduse/control.
+When ioctl(VDUSE_CREATE_DEV) is called, the kernel allocates memory for
+the reconnect information. The total memory size is PAGE_SIZE * vq_number.
+
+2.3 Check whether the information is suitable for reconnection.
+If this is a reconnection: before attempting to reconnect, the userspace
+application needs to use ioctl(VDUSE_DEV_GET_CONFIG, VDUSE_DEV_GET_STATUS,
+VDUSE_DEV_GET_FEATURES, ...) to get the information from the kernel.
+Please review the information and confirm that it is suitable for
+reconnection.
+
+2.4 The userspace application needs to mmap the memory to userspace.
+The application must map one page for every vq. These pages should be
+used to save vq-related information while the system is running.
+Additionally, the application must define its own structure to store
+information for reconnection.
+
+2.5 Complete the initialization and run the application.
+While the application is running, it is important to store the relevant
+reconnection information in the mapped pages. When calling the ioctl
+VDUSE_VQ_GET_INFO to get vq information, check whether this is a
+reconnection. If it is, the vq-related information must be read from the
+mapped pages.
+
+2.6 When the userspace application exits, it must unmap all the pages
+used for reconnection.
-- 
2.43.0
Re: [PATCH v2] Documentation: Add reconnect process for VDUSE
On Fri, Mar 29, 2024 at 5:52 PM Michael S. Tsirkin wrote: > > On Fri, Mar 29, 2024 at 05:38:25PM +0800, Cindy Lu wrote: > > Add a document explaining the reconnect process, including what the > > Userspace App needs to do and how it works with the kernel. > > > > Signed-off-by: Cindy Lu > > --- > > Documentation/userspace-api/vduse.rst | 41 +++ > > 1 file changed, 41 insertions(+) > > > > diff --git a/Documentation/userspace-api/vduse.rst > > b/Documentation/userspace-api/vduse.rst > > index bdb880e01132..f903aed714d1 100644 > > --- a/Documentation/userspace-api/vduse.rst > > +++ b/Documentation/userspace-api/vduse.rst > > @@ -231,3 +231,44 @@ able to start the dataplane processing as follows: > > after the used ring is filled. > > > > For more details on the uAPI, please see include/uapi/linux/vduse.h. > > + > > +HOW VDUSE devices reconnectoin works > > typo > Really sorry for this, I will send a new version Thanks Cindy > > + > > +1. What is reconnection? > > + > > + When the userspace application loads, it should establish a connection > > + to the vduse kernel device. Sometimes,the userspace application exists, > > + and we want to support its restart and connect to the kernel device > > again > > + > > +2. How can I support reconnection in a userspace application? > > + > > +2.1 During initialization, the userspace application should first verify > > the > > +existence of the device "/dev/vduse/vduse_name". > > +If it doesn't exist, it means this is the first-time for connection. > > goto step 2.2 > > +If it exists, it means this is a reconnection, and we should goto step > > 2.3 > > + > > +2.2 Create a new VDUSE instance with ioctl(VDUSE_CREATE_DEV) on > > +/dev/vduse/control. > > +When ioctl(VDUSE_CREATE_DEV) is called, kernel allocates memory for > > +the reconnect information. The total memory size is > > PAGE_SIZE*vq_mumber. 
> > + > > +2.3 Check if the information is suitable for reconnect > > +If this is reconnection : > > +Before attempting to reconnect, The userspace application needs to use > > the > > +ioctl(VDUSE_DEV_GET_CONFIG, VDUSE_DEV_GET_STATUS, > > VDUSE_DEV_GET_FEATURES...) > > +to get the information from kernel. > > +Please review the information and confirm if it is suitable to > > reconnect. > > + > > +2.4 Userspace application needs to mmap the memory to userspace > > +The userspace application requires mapping one page for every vq. > > These pages > > +should be used to save vq-related information during system running. > > Additionally, > > +the application must define its own structure to store information for > > reconnection. > > + > > +2.5 Completed the initialization and running the application. > > +While the application is running, it is important to store relevant > > information > > +about reconnections in mapped pages. When calling the ioctl > > VDUSE_VQ_GET_INFO to > > +get vq information, it's necessary to check whether it's a > > reconnection. If it is > > +a reconnection, the vq-related information must be get from the mapped > > pages. > > + > > > I don't get it. So this is just a way for the application to allocate > memory? Why do we need this new way to do it? > Why not just mmap a file anywhere at all? > We used to use tmpfs to save this reconnect information, but this will make the API not self contained, so we changed to using the kernel memory Thanks cindy > > > +2.6 When the Userspace application exits, it is necessary to unmap all the > > +pages for reconnection > > -- > > 2.43.0 >
Re: [PATCH] dax/bus.c: replace WARN_ON_ONCE() with lockdep asserts
Vishal Verma wrote: > In [1], Dan points out that all of the WARN_ON_ONCE() usage in the > referenced patch should be replaced with lockdep_assert_held(_write)(). > > Replace those, and additionally, replace a couple of other > WARN_ON_ONCE() introduced in the same patch for actual failure > cases (i.e. when acquiring a semaphore fails in a remove / unregister > path) with dev_WARN_ONCE() as is the precedent here. /me goes to look why failures are warned vs bubbling up the error... > > Recall that previously, unregistration paths was implicitly protected by > overloading the device lock, which the patch in [1] sought to remove. > This meant adding a semaphore acquisition in these unregistration paths. > Since that can fail, and it doesn't make sense to return errors from > these paths, retain the two instances of (now) dev_WARN_ONCE(). > > Link: > https://lore.kernel.org/r/65f0b5ef41817_aa2229...@dwillia2-mobl3.amr.corp.intel.com.notmuch > [1] > Fixes: c05ae9d85b47 ("dax/bus.c: replace driver-core lock usage by a local > rwsem") > Cc: Andrew Morton > Reported-by: Dan Williams > Signed-off-by: Vishal Verma > --- > drivers/dax/bus.c | 23 +-- > 1 file changed, 13 insertions(+), 10 deletions(-) > > diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c > index 797e1ebff299..d89c8c3b62f7 100644 > --- a/drivers/dax/bus.c > +++ b/drivers/dax/bus.c [..] > @@ -726,7 +728,8 @@ static void unregister_dax_mapping(void *data) ...and realizes that this hunk is confused. > if (rwsem_is_locked(&dax_region_rwsem)) > return __unregister_dax_mapping(data); This is a bug. Either it is safe to call this without the lock held, or it must be safe to always acquire the lock. Anything else means the caller needs to be refactored. Conditional locking is an indicator of a deeper problem with code organization. Yes, this was not part of the fixup, but the changelog led me here because no WARNs should remain at the end of this effort. 
I think the confusion results from replacing *all* device_lock() usage when some device_lock() is still needed. > - if (WARN_ON_ONCE(down_write_killable(&dax_region_rwsem) != 0)) > + if (dev_WARN_ONCE(data, down_write_killable(&dax_region_rwsem) != 0, > + "unable to acquire region rwsem\n")) In a context like this, which cannot return an error directly to the user process making the request (unlike a sysfs operation handler), the locking cannot worry about signals. This would become an unconditional down_write() lock. So down_write_killable() is typically for request contexts where there is a user process waiting for a result. For contexts like async driver probing where we might be in a kernel thread, and definitely in functions that return 'void', it is a bug to use fallible locks.
Re: [PATCH net-next v4 0/2] tcp: make trace of reset logic complete
Hello: This series was applied to netdev/net-next.git (main) by Jakub Kicinski : On Mon, 1 Apr 2024 15:36:03 +0800 you wrote: > From: Jason Xing > > Before this, we miss some cases where the TCP layer could send RST but > we cannot trace it. So I decided to complete it :) > > v4 > Link: > https://lore.kernel.org/all/20240329034243.7929-1-kerneljasonx...@gmail.com/ > 1. rebased against latest net-next > 2. remove {} and add skb test statement (Eric) > 3. drop v3 patch [3/3] temporarily because 1) location is not that useful > since we can use perf or something else to trace, 2) Eric said we could > use drop_reason to show why we have to RST, which is good, but this seems > not work well for those ->send_reset() logic. I need more time to > investigate this part. > > [...] Here is the summary with links: - [net-next,v4,1/2] trace: adjust TP_STORE_ADDR_PORTS_SKB() parameters https://git.kernel.org/netdev/net-next/c/9807080e2170 - [net-next,v4,2/2] trace: tcp: fully support trace_tcp_send_reset https://git.kernel.org/netdev/net-next/c/19822a980e19 You are awesome, thank you! -- Deet-doot-dot, I am a bot. https://korg.docs.kernel.org/patchwork/pwbot.html
Re: [PATCHv2 1/3] uprobe: Add uretprobe syscall to speed up return probe
On Wed, Apr 3, 2024 at 5:58 PM Masami Hiramatsu wrote: > > On Wed, 3 Apr 2024 09:58:12 -0700 > Andrii Nakryiko wrote: > > > On Wed, Apr 3, 2024 at 7:09 AM Masami Hiramatsu wrote: > > > > > > On Wed, 3 Apr 2024 11:47:41 +0200 > > > Jiri Olsa wrote: > > > > > > > On Wed, Apr 03, 2024 at 10:07:08AM +0900, Masami Hiramatsu wrote: > > > > > Hi Jiri, > > > > > > > > > > On Tue, 2 Apr 2024 11:33:00 +0200 > > > > > Jiri Olsa wrote: > > > > > > > > > > > Adding uretprobe syscall instead of trap to speed up return probe. > > > > > > > > > > This is interesting approach. But I doubt we need to add additional > > > > > syscall just for this purpose. Can't we use another syscall or ioctl? > > > > > > > > so the plan is to optimize entry uprobe in a similar way and given > > > > the syscall is not a scarce resource I wanted to add another syscall > > > > for that one as well > > > > > > > > tbh I'm not sure sure which syscall or ioctl to reuse for this, it's > > > > possible to do that, the trampoline will just have to save one or > > > > more additional registers, but adding new syscall seems cleaner to me > > > > > > Hmm, I think a similar syscall is ptrace? prctl may also be a candidate. > > > > I think both ptrace and prctl are for completely different use cases > > and it would be an abuse of existing API to reuse them for uretprobe > > tracing. Also, keep in mind, that any extra argument that has to be > > passed into this syscall means that we need to complicate and slow > > generated assembly code that is injected into user process (to > > save/restore registers) and also kernel-side (again, to deal with all > > the extra registers that would be stored/restored on stack). > > > > Given syscalls are not some kind of scarce resources, what's the > > downside to have a dedicated and simple syscall? 
> > Syscalls are explicitly exposed to user space, thus, even if it is used > ONLY for a very specific situation, it is an official kernel interface, > and need to care about the compatibility. (If it causes SIGILL unless > a specific use case, I don't know there is a "compatibility".) Check rt_sigreturn syscall (manpage at [0], for example). sigreturn() exists only to allow the implementation of signal handlers. It should never be called directly. (Indeed, a simple sigreturn() wrapper in the GNU C library simply returns -1, with errno set to ENOSYS.) Details of the arguments (if any) passed to sigreturn() vary depending on the architecture. (On some architectures, such as x86-64, sigreturn() takes no arguments, since all of the information that it requires is available in the stack frame that was previously created by the kernel on the user-space stack.) This is a very similar use case. Also, check its source code in arch/x86/kernel/signal_64.c. It sends SIGSEGV to the calling process on any sign of something not being right. It's exactly the same with sys_uretprobe. [0] https://man7.org/linux/man-pages/man2/sigreturn.2.html > And the number of syscalls are limited resource. We have almost 500 of them, it doesn't seem like adding 1-2 for good reasons would be a problem. Can you please point to where the limits on syscalls as a resource are described? I'm curious to learn. > > I'm actually not sure how much we need to care of it, but adding a new > syscall is worth to be discussed carefully because all of them are > user-space compatibility. Absolutely, it's a good discussion to have. > > > > Also, we should run syzkaller on this syscall. And if uretprobe is > > > > > > right, I'll check on syzkaller > > > > > > > set in the user function, what happen if the user function directly > > > > > calls this syscall? (maybe it consumes shadow stack?)
> > > > > > > > the process should receive SIGILL if there's no pending uretprobe for > > > > the current task, or it will trigger uretprobe if there's one pending > > > > > > No, that is too aggressive and not safe. Since the syscall is exposed to > > > user program, it should return appropriate error code instead of SIGILL. > > > > > > > This is the way it is today with uretprobes even through interrupt. > > I doubt that the interrupt (exception) and syscall should be handled > differently. Especially, this exception is injected by uprobes but > syscall will be caused by itself. But syscall can be called from user > program (of couse this works as sys_kill(self, SIGILL)). Yep, I'd keep the behavior the same between uretprobes implemented through int3 and sys_uretprobe. > > > E.g., it could happen that user process is using fibers and is > > replacing stack pointer without kernel realizing this, which will > > trigger some defensive checks in uretprobe handling code and kernel > > will send SIGILL because it can't support such cases. This is > > happening today already, and it works fine in practice (except for > > applications that manually change stack pointer, too bad, you can't >
Re: [PATCH net-next 0/6] Implement reset reason mechanism to detect
On Thu, Apr 4, 2024 at 9:50 AM Jakub Kicinski wrote: > > On Wed, 3 Apr 2024 15:31:38 +0800 Jason Xing wrote: > > It's based on top of > > https://patchwork.kernel.org/project/netdevbpf/list/?series=840182 > > Please post as RFC if there's a dependency. > We don't maintain patch queues for people. Got it. Thanks. I'll wait for that patch series to get merged. I believe it will not take too long:) > -- > pw-bot: cr
Re: [PATCH net-next 0/6] Implement reset reason mechanism to detect
On Wed, 3 Apr 2024 15:31:38 +0800 Jason Xing wrote: > It's based on top of > https://patchwork.kernel.org/project/netdevbpf/list/?series=840182 Please post as RFC if there's a dependency. We don't maintain patch queues for people. -- pw-bot: cr
[PATCH] tools/latency-collector: fix -Wformat-security compile warns
Fix the following -Wformat-security compile warnings adding missing format arguments: latency-collector.c: In function ‘show_available’: latency-collector.c:938:17: warning: format not a string literal and no format arguments [-Wformat-security] 938 | warnx(no_tracer_msg); | ^ latency-collector.c:943:17: warning: format not a string literal and no format arguments [-Wformat-security] 943 | warnx(no_latency_tr_msg); | ^ latency-collector.c: In function ‘find_default_tracer’: latency-collector.c:986:25: warning: format not a string literal and no format arguments [-Wformat-security] 986 | errx(EXIT_FAILURE, no_tracer_msg); | ^~~~ latency-collector.c: In function ‘scan_arguments’: latency-collector.c:1881:33: warning: format not a string literal and no format arguments [-Wformat-security] 1881 | errx(EXIT_FAILURE, no_tracer_msg); | ^~~~ Signed-off-by: Shuah Khan --- tools/tracing/latency/latency-collector.c | 8 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/tools/tracing/latency/latency-collector.c b/tools/tracing/latency/latency-collector.c index 0fd9c747d396..cf263fe9deaf 100644 --- a/tools/tracing/latency/latency-collector.c +++ b/tools/tracing/latency/latency-collector.c @@ -935,12 +935,12 @@ static void show_available(void) } if (!tracers) { - warnx(no_tracer_msg); + warnx("%s", no_tracer_msg); return; } if (!found) { - warnx(no_latency_tr_msg); + warnx("%s", no_latency_tr_msg); tracefs_list_free(tracers); return; } @@ -983,7 +983,7 @@ static const char *find_default_tracer(void) for (i = 0; relevant_tracers[i]; i++) { valid = tracer_valid(relevant_tracers[i], ); if (notracer) - errx(EXIT_FAILURE, no_tracer_msg); + errx(EXIT_FAILURE, "%s", no_tracer_msg); if (valid) return relevant_tracers[i]; } @@ -1878,7 +1878,7 @@ static void scan_arguments(int argc, char *argv[]) } valid = tracer_valid(current_tracer, ); if (notracer) - errx(EXIT_FAILURE, no_tracer_msg); + errx(EXIT_FAILURE, "%s", no_tracer_msg); if (!valid) errx(EXIT_FAILURE, "The tracer %s is 
not supported by your kernel!\n", current_tracer); -- 2.40.1
Re: [PATCHv2 1/3] uprobe: Add uretprobe syscall to speed up return probe
On Wed, 3 Apr 2024 09:58:12 -0700 Andrii Nakryiko wrote: > On Wed, Apr 3, 2024 at 7:09 AM Masami Hiramatsu wrote: > > > > On Wed, 3 Apr 2024 11:47:41 +0200 > > Jiri Olsa wrote: > > > > > On Wed, Apr 03, 2024 at 10:07:08AM +0900, Masami Hiramatsu wrote: > > > > Hi Jiri, > > > > > > > > On Tue, 2 Apr 2024 11:33:00 +0200 > > > > Jiri Olsa wrote: > > > > > > > > > Adding uretprobe syscall instead of trap to speed up return probe. > > > > > > > > This is interesting approach. But I doubt we need to add additional > > > > syscall just for this purpose. Can't we use another syscall or ioctl? > > > > > > so the plan is to optimize entry uprobe in a similar way and given > > > the syscall is not a scarce resource I wanted to add another syscall > > > for that one as well > > > > > > tbh I'm not sure sure which syscall or ioctl to reuse for this, it's > > > possible to do that, the trampoline will just have to save one or > > > more additional registers, but adding new syscall seems cleaner to me > > > > Hmm, I think a similar syscall is ptrace? prctl may also be a candidate. > > I think both ptrace and prctl are for completely different use cases > and it would be an abuse of existing API to reuse them for uretprobe > tracing. Also, keep in mind, that any extra argument that has to be > passed into this syscall means that we need to complicate and slow > generated assembly code that is injected into user process (to > save/restore registers) and also kernel-side (again, to deal with all > the extra registers that would be stored/restored on stack). > > Given syscalls are not some kind of scarce resources, what's the > downside to have a dedicated and simple syscall? Syscalls are explicitly exposed to user space, thus, even if it is used ONLY for a very specific situation, it is an official kernel interface, and need to care about the compatibility. (If it causes SIGILL unless a specific use case, I don't know there is a "compatibility".) 
And the number of syscalls is a limited resource. I'm actually not sure how much we need to care about it, but adding a new syscall is worth discussing carefully because all of them are user-space compatibility commitments. > > > > Also, we should run syzkaller on this syscall. And if uretprobe is > > > > > > right, I'll check on syzkaller > > > > > > > set in the user function, what happen if the user function directly > > > > calls this syscall? (maybe it consumes shadow stack?) > > > > > > the process should receive SIGILL if there's no pending uretprobe for > > > the current task, or it will trigger uretprobe if there's one pending > > > > No, that is too aggressive and not safe. Since the syscall is exposed to > > user program, it should return appropriate error code instead of SIGILL. > > > > This is the way it is today with uretprobes even through interrupt. I doubt that the interrupt (exception) and syscall should be handled differently. Especially since this exception is injected by uprobes, but the syscall is issued by the program itself. But a syscall can be called from a user program (of course, this works like sys_kill(self, SIGILL)). > E.g., it could happen that user process is using fibers and is > replacing stack pointer without kernel realizing this, which will > trigger some defensive checks in uretprobe handling code and kernel > will send SIGILL because it can't support such cases. This is > happening today already, and it works fine in practice (except for > applications that manually change stack pointer, too bad, you can't > trace them with uretprobes, unfortunately). OK, we at least need to document it. > > So I think it's absolutely adequate to have this behavior if the user > process is *intentionally* abusing this API. Of course, the user is expected to know that this is abuse. So at least we need to document that this syscall number is reserved for uprobes and that user programs must not use it. 
> > > > > > > but we could limit the syscall to be executed just from the trampoline, > > > that should prevent all the user space use cases, I'll do that in next > > > version and add more tests for that > > > > Why not limit? :) The uprobe_handle_trampoline() expects it is called > > only from the trampoline, so it is natural to check the caller address. > > (and uprobe should know where is the trampoline) > > > > Since the syscall is always exposed to the user program, it should > > - Do nothing and return an error unless it is properly called. > > - check the prerequisites for operation strictly. > > I concern that new system calls introduce vulnerabilities. > > > > As Oleg and Jiri mentioned, this syscall can't harm kernel or other > processes, only the process that is abusing the API. So any extra > checks that would slow down this approach is an unnecessary overhead > and complication that will never be useful in practice. I think at least it should check the caller address to ensure the address is in the trampoline. But anyway, uprobes itself can break the target process, so no one
Re: [PATCH v10 2/2] memory tier: create CPUless memory tiers after obtaining HMAT info
Hi Jonathan, Thank you for your feedback. I will fix them (inlined) in the next V11. On Wed, Apr 3, 2024 at 10:04 AM Jonathan Cameron wrote: > > A few minor comments inline. > > > diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h > > index a44c03c2ba3a..16769552a338 100644 > > --- a/include/linux/memory-tiers.h > > +++ b/include/linux/memory-tiers.h > > @@ -140,12 +140,13 @@ static inline int mt_perf_to_adistance(struct > > access_coordinate *perf, int *adis > > return -EIO; > > } > > > > -struct memory_dev_type *mt_find_alloc_memory_type(int adist, struct > > list_head *memory_types) > > +static inline struct memory_dev_type *mt_find_alloc_memory_type(int adist, > > + struct list_head *memory_types) > > { > > return NULL; > > } > > > > -void mt_put_memory_types(struct list_head *memory_types) > > +static inline void mt_put_memory_types(struct list_head *memory_types) > > { > Why in this patch and not previous one? I've also noticed this issue. I will fix it in the next V11. > > > > } > > diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c > > index 974af10cfdd8..44fa10980d37 100644 > > --- a/mm/memory-tiers.c > > +++ b/mm/memory-tiers.c > > @@ -36,6 +36,11 @@ struct node_memory_type_map { > > > > static DEFINE_MUTEX(memory_tier_lock); > > static LIST_HEAD(memory_tiers); > > +/* > > + * The list is used to store all memory types that are not created > > + * by a device driver. > > + */ > > +static LIST_HEAD(default_memory_types); > > static struct node_memory_type_map node_memory_types[MAX_NUMNODES]; > > struct memory_dev_type *default_dram_type; > > > > @@ -108,6 +113,8 @@ static struct demotion_nodes *node_demotion > > __read_mostly; > > > > static BLOCKING_NOTIFIER_HEAD(mt_adistance_algorithms); > > > > +/* The lock is used to protect `default_dram_perf*` info and nid. 
*/ > > +static DEFINE_MUTEX(default_dram_perf_lock); > > static bool default_dram_perf_error; > > static struct access_coordinate default_dram_perf; > > static int default_dram_perf_ref_nid = NUMA_NO_NODE; > > @@ -505,7 +512,8 @@ static inline void __init_node_memory_type(int node, > > struct memory_dev_type *mem > > static struct memory_tier *set_node_memory_tier(int node) > > { > > struct memory_tier *memtier; > > - struct memory_dev_type *memtype; > > + struct memory_dev_type *mtype = default_dram_type; > > Does the rename add anything major to the patch? > If not I'd leave it alone to reduce the churn and give > a more readable patch. If it is worth doing perhaps > a precursor patch? > Either name works. Keeping it the same name will make the code easier to follow. I agree! Thanks. > > + int adist = MEMTIER_ADISTANCE_DRAM; > > pg_data_t *pgdat = NODE_DATA(node); > > > > > > @@ -514,11 +522,20 @@ static struct memory_tier *set_node_memory_tier(int > > node) > > if (!node_state(node, N_MEMORY)) > > return ERR_PTR(-EINVAL); > > > > - __init_node_memory_type(node, default_dram_type); > > + mt_calc_adistance(node, ); > > + if (node_memory_types[node].memtype == NULL) { > > + mtype = mt_find_alloc_memory_type(adist, > > _memory_types); > > + if (IS_ERR(mtype)) { > > + mtype = default_dram_type; > > + pr_info("Failed to allocate a memory type. 
Fall > > back.\n"); > > + } > > + } > > + > > + __init_node_memory_type(node, mtype); > > > > - memtype = node_memory_types[node].memtype; > > - node_set(node, memtype->nodes); > > - memtier = find_create_memory_tier(memtype); > > + mtype = node_memory_types[node].memtype; > > + node_set(node, mtype->nodes); > > + memtier = find_create_memory_tier(mtype); > > if (!IS_ERR(memtier)) > > rcu_assign_pointer(pgdat->memtier, memtier); > > return memtier; > > @@ -655,6 +672,33 @@ void mt_put_memory_types(struct list_head > > *memory_types) > > } > > EXPORT_SYMBOL_GPL(mt_put_memory_types); > > > > +/* > > + * This is invoked via `late_initcall()` to initialize memory tiers for > > + * CPU-less memory nodes after driver initialization, which is > > + * expected to provide `adistance` algorithms. > > + */ > > +static int __init memory_tier_late_init(void) > > +{ > > + int nid; > > + > > + mutex_lock(_tier_lock); > > + for_each_node_state(nid, N_MEMORY) > > + if (node_memory_types[nid].memtype == NULL) > > + /* > > + * Some device drivers may have initialized memory > > tiers > > + * between `memory_tier_init()` and > > `memory_tier_late_init()`, > > + * potentially bringing online memory nodes and > > + * configuring memory tiers. Exclude them here. > > + */ > > Does the comment refer to this path, or to ones where memtype is set? > Yes, the comment is for explaining why the
Re: [PATCH fs/proc/bootconfig] remove redundant comments from /proc/bootconfig
On Wed, 3 Apr 2024 12:16:28 -0700 "Paul E. McKenney" wrote: > commit 717c7c894d4b ("fs/proc: Add boot loader arguments as comment to > /proc/bootconfig") adds bootloader argument comments into /proc/bootconfig. > > /proc/bootconfig shows boot_command_line[] multiple times following > every xbc key value pair, that's duplicated and not necessary. > Remove redundant ones. > > Output before and after the fix is like: > key1 = value1 > *bootloader argument comments* > key2 = value2 > *bootloader argument comments* > key3 = value3 > *bootloader argument comments* > ... > > key1 = value1 > key2 = value2 > key3 = value3 > *bootloader argument comments* > ... > > Fixes: 717c7c894d4b ("fs/proc: Add boot loader arguments as comment to > /proc/bootconfig") > Signed-off-by: Zhenhua Huang > Signed-off-by: Paul E. McKenney > Cc: Masami Hiramatsu > Cc: > Cc: OOps, good catch! Let me pick it. Acked-by: Masami Hiramatsu (Google) Thank you! > > diff --git a/fs/proc/bootconfig.c b/fs/proc/bootconfig.c > index 902b326e1e560..e5635a6b127b0 100644 > --- a/fs/proc/bootconfig.c > +++ b/fs/proc/bootconfig.c > @@ -62,12 +62,12 @@ static int __init copy_xbc_key_value_list(char *dst, > size_t size) > break; > dst += ret; > } > - if (ret >= 0 && boot_command_line[0]) { > - ret = snprintf(dst, rest(dst, end), "# Parameters from > bootloader:\n# %s\n", > -boot_command_line); > - if (ret > 0) > - dst += ret; > - } > + } > + if (ret >= 0 && boot_command_line[0]) { > + ret = snprintf(dst, rest(dst, end), "# Parameters from > bootloader:\n# %s\n", > +boot_command_line); > + if (ret > 0) > + dst += ret; > } > out: > kfree(key); -- Masami Hiramatsu (Google)
Re: [PATCH v10 1/2] memory tier: dax/kmem: introduce an abstract layer for finding, allocating, and putting memory types
Hi Jonathan, Thanks for your feedback. I will fix them (inlined) in the next V11. No worries, it's never too late! On Wed, Apr 3, 2024 at 9:52 AM Jonathan Cameron wrote: > > On Tue, 2 Apr 2024 00:17:37 + > "Ho-Ren (Jack) Chuang" wrote: > > > Since different memory devices require finding, allocating, and putting > > memory types, these common steps are abstracted in this patch, > > enhancing the scalability and conciseness of the code. > > > > Signed-off-by: Ho-Ren (Jack) Chuang > > Reviewed-by: "Huang, Ying" > > Hi, > > I know this is a late entry to the discussion but a few comments inline. > (sorry I didn't look earlier!) > > All opportunities to improve code complexity and readability as a result > of your factoring out. > > Jonathan > > > > --- > > drivers/dax/kmem.c | 20 ++-- > > include/linux/memory-tiers.h | 13 + > > mm/memory-tiers.c| 32 > > 3 files changed, 47 insertions(+), 18 deletions(-) > > > > diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c > > index 42ee360cf4e3..01399e5b53b2 100644 > > --- a/drivers/dax/kmem.c > > +++ b/drivers/dax/kmem.c > > @@ -55,21 +55,10 @@ static LIST_HEAD(kmem_memory_types); > > > > static struct memory_dev_type *kmem_find_alloc_memory_type(int adist) > > { > > - bool found = false; > > struct memory_dev_type *mtype; > > > > mutex_lock(_memory_type_lock); > could use > > guard(mutex)(_memory_type_lock); > return mt_find_alloc_memory_type(adist, _memory_types); > I will change it accordingly. > I'm fine if you ignore this comment though as may be other functions in > here that could take advantage of the cleanup.h stuff in a future patch. 
> > > - list_for_each_entry(mtype, _memory_types, list) { > > - if (mtype->adistance == adist) { > > - found = true; > > - break; > > - } > > - } > > - if (!found) { > > - mtype = alloc_memory_type(adist); > > - if (!IS_ERR(mtype)) > > - list_add(>list, _memory_types); > > - } > > + mtype = mt_find_alloc_memory_type(adist, _memory_types); > > mutex_unlock(_memory_type_lock); > > > > return mtype; > > > diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h > > index 69e781900082..a44c03c2ba3a 100644 > > --- a/include/linux/memory-tiers.h > > +++ b/include/linux/memory-tiers.h > > @@ -48,6 +48,9 @@ int mt_calc_adistance(int node, int *adist); > > int mt_set_default_dram_perf(int nid, struct access_coordinate *perf, > >const char *source); > > int mt_perf_to_adistance(struct access_coordinate *perf, int *adist); > > +struct memory_dev_type *mt_find_alloc_memory_type(int adist, > > + struct list_head > > *memory_types); > > That indent looks unusual. Align the start of struct with start of int. > I can make this aligned but it will show another warning: "WARNING: line length of 131 exceeds 100 columns" Is this ok? > > +void mt_put_memory_types(struct list_head *memory_types); > > #ifdef CONFIG_MIGRATION > > int next_demotion_node(int node); > > void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *targets); > > @@ -136,5 +139,15 @@ static inline int mt_perf_to_adistance(struct > > access_coordinate *perf, int *adis > > { > > return -EIO; > > } > > + > > +struct memory_dev_type *mt_find_alloc_memory_type(int adist, struct > > list_head *memory_types) > > +{ > > + return NULL; > > +} > > + > > +void mt_put_memory_types(struct list_head *memory_types) > > +{ > > + > No blank line needed here. Will fix. 
> > +} > > #endif /* CONFIG_NUMA */ > > #endif /* _LINUX_MEMORY_TIERS_H */ > > diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c > > index 0537664620e5..974af10cfdd8 100644 > > --- a/mm/memory-tiers.c > > +++ b/mm/memory-tiers.c > > @@ -623,6 +623,38 @@ void clear_node_memory_type(int node, struct > > memory_dev_type *memtype) > > } > > EXPORT_SYMBOL_GPL(clear_node_memory_type); > > > > +struct memory_dev_type *mt_find_alloc_memory_type(int adist, struct > > list_head *memory_types) > > Breaking this out as a separate function provides opportunity to improve it. > Maybe a follow-up patch makes sense given it would no longer be a straightforward code move. However in my view it would be simple enough to be obvious > even within this patch. > I will just keep this as is for now to minimize the changes aka mistakes. > > +{ > > + bool found = false; > > + struct memory_dev_type *mtype; > > + > > + list_for_each_entry(mtype, memory_types, list) { > > + if (mtype->adistance == adist) { > > + found = true; > > Why not return here? > return mtype; > Yes, I can return here. I will do that and take care of the ptr returning at this point. > > +
Re: Re: Re: general protection fault in refill_obj_stock
On Tue, Apr 02, 2024 at 02:14:58PM +0800, Ubisectech Sirius wrote: > > On Tue, Apr 02, 2024 at 09:50:54AM +0800, Ubisectech Sirius wrote: > >>> On Mon, Apr 01, 2024 at 03:04:46PM +0800, Ubisectech Sirius wrote: > >>> Hello. > >>> We are Ubisectech Sirius Team, the vulnerability lab of China ValiantSec. > >>> Recently, our team has discovered an issue in Linux kernel 6.7. Attached > >>> to the email was a PoC file of the issue. > >> > >>> Thank you for the report! > >> > >>> I tried to compile and run your test program for about half an hour > >>> on a virtual machine running 6.7 with enabled KASAN, but wasn't able > >>> to reproduce the problem. > >> > >>> Can you, please, share a bit more information? How long does it take > >>> to reproduce? Do you mind sharing your kernel config? Is there anything > >>> special > >>> about your setup? What are exact steps to reproduce the problem? > >>> Is this problem reproducible on 6.6? > >> > >> Hi. > >> The .config of Linux kernel 6.7 has been sent to you as an attachment. > > Thanks! > > How long does it take to reproduce the problem? Do you just start your reproducer > > and wait? > I just start the reproducer and wait without any other operation. The speed > of reproducing this problem is very fast (less than 5 seconds). > >> And the problem is reproducible on 6.6. > > Hm, it rules out my recent changes. > > Did you try any older kernels? 6.5? 6.0? Did you try to bisect the problem? > > If it's fast to reproduce, it might be the best option. > I have tried the 6.0, 6.3, 6.4, and 6.5 kernels. The Linux kernel 6.5 will get the same > error output. 
But other version will get different output like below: > [ 55.306672][ T7950] KASAN: null-ptr-deref in range > [0x0018-0x001f] > [ 55.307259][ T7950] CPU: 1 PID: 7950 Comm: poc Not tainted 6.3.0 #1 > [ 55.307714][ T7950] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), > BIOS 1.15.0-1 04/01/2014 > [ 55.308363][ T7950] RIP: 0010:tomoyo_check_acl (security/tomoyo/domain.c:173) > [ 55.316475][ T7950] Call Trace: > [ 55.316713][ T7950] > [ 55.317353][ T7950] tomoyo_path_permission (security/tomoyo/file.c:170 > security/tomoyo/file.c:587 security/tomoyo/file.c:573) > [ 55.317744][ T7950] tomoyo_check_open_permission (security/tomoyo/file.c:779) > [ 55.320152][ T7950] tomoyo_file_open (security/tomoyo/tomoyo.c:332 > security/tomoyo/tomoyo.c:327) > [ 55.320495][ T7950] security_file_open (security/security.c:1719 > (discriminator 13)) > [ 55.320850][ T7950] do_dentry_open (fs/open.c:908) > [ 55.321526][ T7950] path_openat (fs/namei.c:3561 fs/namei.c:3715) > [ 55.322614][ T7950] do_filp_open (fs/namei.c:3743) > [ 55.325086][ T7950] do_sys_openat2 (fs/open.c:1349) > [ 55.326249][ T7950] __x64_sys_openat (fs/open.c:1375) > [ 55.327428][ T7950] do_syscall_64 (arch/x86/entry/common.c:50 > arch/x86/entry/common.c:80) > [ 55.327756][ T7950] entry_SYSCALL_64_after_hwframe > (arch/x86/entry/entry_64.S:120) > [ 55.328185][ T7950] RIP: 0033:0x7f1c4a484f29 > [ 55.328504][ T7950] Code: 00 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 > 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 > <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 37 8f 0d 00 f7 d8 64 89 01 48 > [ 55.329864][ T7950] RSP: 002b:7ffd7bfe8398 EFLAGS: 0246 ORIG_RAX: > 0101 > [ 55.330464][ T7950] RAX: ffda RBX: RCX: > 7f1c4a484f29 > [ 55.331024][ T7950] RDX: 00141842 RSI: 2380 RDI: > ff9c > [ 55.331585][ T7950] RBP: 7ffd7bfe83a0 R08: R09: > 7ffd7bfe83f0 > [ 55.332148][ T7950] R10: R11: 0246 R12: > 55c5e36482d0 > [ 55.332707][ T7950] R13: R14: R15: > > [ 55.333268][ T7950] > [ 55.333488][ T7950] 
Modules linked in: > [ 55.340525][ T7950] ---[ end trace ]--- > [ 55.340936][ T7950] RIP: 0010:tomoyo_check_acl (security/tomoyo/domain.c:173) > It looks like another problem? It does look different. I can't reproduce any of those. I ran into some build-time issues when trying to build the kernel with your config (I have a fairly old toolchain; maybe that's the reason), but when running a more minimalistic config I do not see any issues on 6.1, 6.6 and 6.7. Is this some sort of all-yes config, or is it somehow specially crafted? Did you try to reproduce the problem with other kernel configs? It all smells like memory corruption, but who knows. Thanks!
RE: [PATCH bpf-next] rethook: Remove warning messages printed for finding return address of a frame.
Kui-Feng Lee wrote: > rethook_find_ret_addr() prints a warning message and returns 0 when the > target task is running and not the "current" task to prevent returning an > incorrect return address. However, this check is incomplete as the target > task can still transition to the running state when finding the return > address, although it is safe with RCU. > > The issue we encounter is that the kernel frequently prints warning > messages when BPF profiling programs call bpf_get_task_stack() on > running tasks. > > The callers should be aware and willing to take the risk of receiving an > incorrect return address from a task that is currently running other than > the "current" one. A warning is not needed here as this is what the > callers intend. > > Signed-off-by: Kui-Feng Lee > --- > kernel/trace/rethook.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/kernel/trace/rethook.c b/kernel/trace/rethook.c > index fa03094e9e69..4297a132a7ae 100644 > --- a/kernel/trace/rethook.c > +++ b/kernel/trace/rethook.c > @@ -248,7 +248,7 @@ unsigned long rethook_find_ret_addr(struct task_struct > *tsk, unsigned long frame > if (WARN_ON_ONCE(!cur)) > return 0; > > - if (WARN_ON_ONCE(tsk != current && task_is_running(tsk))) > + if (tsk != current && task_is_running(tsk)) > return 0; > > do { > -- > 2.34.1 > > Acked-by: John Fastabend
[PATCH v3 2/2] rethook: honor CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING in rethook_try_get()
Take into account CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING when validating that RCU is watching when trying to set up a rethook on a function entry. This further (in addition to improvements in the previous patch) improves BPF multi-kretprobe (which relies on rethook) runtime throughput by 2.3%, according to BPF benchmarks ([0]). [0] https://lore.kernel.org/bpf/caef4bzauq2wkmjzdc9s0rbwa01bybgwhn6andxqshyia47p...@mail.gmail.com/ Signed-off-by: Andrii Nakryiko --- kernel/trace/rethook.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/kernel/trace/rethook.c b/kernel/trace/rethook.c index fa03094e9e69..15b8aa4048d9 100644 --- a/kernel/trace/rethook.c +++ b/kernel/trace/rethook.c @@ -166,6 +166,7 @@ struct rethook_node *rethook_try_get(struct rethook *rh) if (unlikely(!handler)) return NULL; +#ifdef CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING /* * This expects the caller will set up a rethook on a function entry. * When the function returns, the rethook will eventually be reclaimed @@ -174,6 +175,7 @@ struct rethook_node *rethook_try_get(struct rethook *rh) */ if (unlikely(!rcu_is_watching())) return NULL; +#endif return (struct rethook_node *)objpool_pop(&rh->pool); } -- 2.43.0
[PATCH v3 1/2] ftrace: make extra rcu_is_watching() validation check optional
Introduce CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING config option to control whether ftrace low-level code performs additional rcu_is_watching()-based validation logic in an attempt to catch noinstr violations. This check is expected to never be true and is mostly useful for low-level validation of ftrace subsystem invariants. For most users it should probably be kept disabled to eliminate unnecessary runtime overhead. This improves BPF multi-kretprobe (relying on ftrace and rethook infrastructure) runtime throughput by 2%, according to BPF benchmarks ([0]). [0] https://lore.kernel.org/bpf/caef4bzauq2wkmjzdc9s0rbwa01bybgwhn6andxqshyia47p...@mail.gmail.com/ Cc: Steven Rostedt Cc: Masami Hiramatsu Cc: Paul E. McKenney Signed-off-by: Andrii Nakryiko --- include/linux/trace_recursion.h | 2 +- kernel/trace/Kconfig| 13 + 2 files changed, 14 insertions(+), 1 deletion(-) diff --git a/include/linux/trace_recursion.h b/include/linux/trace_recursion.h index d48cd92d2364..24ea8ac049b4 100644 --- a/include/linux/trace_recursion.h +++ b/include/linux/trace_recursion.h @@ -135,7 +135,7 @@ extern void ftrace_record_recursion(unsigned long ip, unsigned long parent_ip); # define do_ftrace_record_recursion(ip, pip) do { } while (0) #endif -#ifdef CONFIG_ARCH_WANTS_NO_INSTR +#ifdef CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING # define trace_warn_on_no_rcu(ip) \ ({ \ bool __ret = !rcu_is_watching();\ diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig index 61c541c36596..7aebd1b8f93e 100644 --- a/kernel/trace/Kconfig +++ b/kernel/trace/Kconfig @@ -974,6 +974,19 @@ config FTRACE_RECORD_RECURSION_SIZE This file can be reset, but the limit can not change in size at runtime. +config FTRACE_VALIDATE_RCU_IS_WATCHING + bool "Validate RCU is on during ftrace execution" + depends on FUNCTION_TRACER + depends on ARCH_WANTS_NO_INSTR + help + All callbacks that attach to the function tracing have some sort of + protection against recursion. 
This option is only to verify that + ftrace (and other users of ftrace_test_recursion_trylock()) are not + called outside of RCU, as if they are, it can cause a race. But it + also has a noticeable overhead when enabled. + + If unsure, say N + config RING_BUFFER_RECORD_RECURSION bool "Record functions that recurse in the ring buffer" depends on FTRACE_RECORD_RECURSION -- 2.43.0
Re: [PATCH RFC ftrace] Asynchronous grace period for register_ftrace_direct()
On Wed, Apr 03, 2024 at 03:29:12PM -0400, Steven Rostedt wrote: > On Wed, 3 Apr 2024 11:53:14 -0700 > "Paul E. McKenney" wrote: > > > @@ -5366,6 +5366,13 @@ static void remove_direct_functions_hash(struct > > ftrace_hash *hash, unsigned long > > } > > } > > > > +static void register_ftrace_direct_cb(struct rcu_head *rhp) > > +{ > > + struct ftrace_hash *fhp = container_of(rhp, struct ftrace_hash, rcu); > > + > > + free_ftrace_hash(fhp); > > +} > > + > > /** > > * register_ftrace_direct - Call a custom trampoline directly > > * for multiple functions registered in @ops > > @@ -5464,10 +5471,8 @@ int register_ftrace_direct(struct ftrace_ops *ops, > > unsigned long addr) > > out_unlock: > > mutex_unlock(&direct_mutex); > > > > - if (free_hash && free_hash != EMPTY_HASH) { > > - synchronize_rcu_tasks(); > > - free_ftrace_hash(free_hash); > > - } > > + if (free_hash && free_hash != EMPTY_HASH) > > + call_rcu_tasks(&free_hash->rcu, register_ftrace_direct_cb); > > > > if (new_hash) > > free_ftrace_hash(new_hash); > > I'm getting ready to go on PTO, but a quick glance doesn't look like this > would cause any harm. It is not urgent, as in v6.10 merge window at the earliest. So if I don't hear from you in a few weeks, I will re-send. ;-) Thanx, Paul
Re: [PATCH v3 24/25] drivers: media: i2c: imx258: Add support for reset gpio
On 4/3/24 11:03, Ondřej Jirman wrote: > Hi, > > On Wed, Apr 03, 2024 at 04:28:59PM GMT, Sakari Ailus wrote: >> Hi Luis, >> >> Could you unify the subject prefix for the driver patches, please? E.g. >> "media: imx258: " would be fine. >> >> On Wed, Apr 03, 2024 at 09:03:53AM -0600, g...@luigi311.com wrote: >>> From: Luis Garcia >>> >>> It was documented in DT, but not implemented. >>> >>> Signed-off-by: Ondrej Jirman >>> Signed-off-by: Luis Garcia >>> --- >>> drivers/media/i2c/imx258.c | 14 +- >>> 1 file changed, 13 insertions(+), 1 deletion(-) >>> >>> diff --git a/drivers/media/i2c/imx258.c b/drivers/media/i2c/imx258.c >>> index 163f04f6f954..4c117c4829f1 100644 >>> --- a/drivers/media/i2c/imx258.c >>> +++ b/drivers/media/i2c/imx258.c >>> @@ -680,6 +680,7 @@ struct imx258 { >>> unsigned int csi2_flags; >>> >>> struct gpio_desc *powerdown_gpio; >>> + struct gpio_desc *reset_gpio; >>> >>> /* >>> * Mutex for serialized access: >>> @@ -1232,7 +1233,11 @@ static int imx258_power_on(struct device *dev) >>> regulator_bulk_disable(IMX258_NUM_SUPPLIES, imx258->supplies); >>> } >>> >>> - return ret; >>> + gpiod_set_value_cansleep(imx258->reset_gpio, 0); >>> + >>> + usleep_range(400, 500); >> >> You could mention this at least in the commit message. > > This is T6 in the datasheet: https://megous.com/dl/tmp/92c9223ce877216e.png > > >>> + >>> + return 0; >>> } >>> >>> static int imx258_power_off(struct device *dev) >>> @@ -1243,6 +1248,7 @@ static int imx258_power_off(struct device *dev) >>> clk_disable_unprepare(imx258->clk); >>> regulator_bulk_disable(IMX258_NUM_SUPPLIES, imx258->supplies); >>> >>> + gpiod_set_value_cansleep(imx258->reset_gpio, 1); >> >> Same question than on the other GPIO: does this belong here? > > No, this should be before the regulator_bulk_disable. > > See: https://megous.com/dl/tmp/c96180b23d7ce63a.png > > kind regards, > o. > Since I'm supposed to move the reset up should I also move the power up with it to match your downstream driver? 
>>> gpiod_set_value_cansleep(imx258->powerdown_gpio, 1); >>> >>> return 0; >>> @@ -1554,6 +1560,12 @@ static int imx258_probe(struct i2c_client *client) >>> if (IS_ERR(imx258->powerdown_gpio)) >>> return PTR_ERR(imx258->powerdown_gpio); >>> >>> + /* request optional reset pin */ >>> + imx258->reset_gpio = devm_gpiod_get_optional(&client->dev, "reset", >>> + GPIOD_OUT_HIGH); >>> + if (IS_ERR(imx258->reset_gpio)) >>> + return PTR_ERR(imx258->reset_gpio); >>> + >>> /* Initialize subdev */ >>> v4l2_i2c_subdev_init(&imx258->sd, client, &imx258_subdev_ops); >>> >> >> -- >> Regards, >> >> Sakari Ailus
Re: [PATCH v3 23/25] drivers: media: i2c: imx258: Add support for powerdown gpio
On 4/3/24 10:57, Ondřej Jirman wrote: > Hi Sakari and Luis, > > On Wed, Apr 03, 2024 at 04:25:41PM GMT, Sakari Ailus wrote: >> Hi Luis, Ondrej, >> >> On Wed, Apr 03, 2024 at 09:03:52AM -0600, g...@luigi311.com wrote: >>> From: Luis Garcia >>> >>> On some boards powerdown signal needs to be deasserted for this >>> sensor to be enabled. >>> >>> Signed-off-by: Ondrej Jirman >>> Signed-off-by: Luis Garcia >>> --- >>> drivers/media/i2c/imx258.c | 13 + >>> 1 file changed, 13 insertions(+) >>> >>> diff --git a/drivers/media/i2c/imx258.c b/drivers/media/i2c/imx258.c >>> index 30352c33f63c..163f04f6f954 100644 >>> --- a/drivers/media/i2c/imx258.c >>> +++ b/drivers/media/i2c/imx258.c >>> @@ -679,6 +679,8 @@ struct imx258 { >>> unsigned int lane_mode_idx; >>> unsigned int csi2_flags; >>> >>> + struct gpio_desc *powerdown_gpio; >>> + >>> /* >>> * Mutex for serialized access: >>> * Protect sensor module set pad format and start/stop streaming safely. >>> @@ -1213,6 +1215,8 @@ static int imx258_power_on(struct device *dev) >>> struct imx258 *imx258 = to_imx258(sd); >>> int ret; >>> >>> + gpiod_set_value_cansleep(imx258->powerdown_gpio, 0); >> >> What does the spec say? Should this really happen before switching on the >> supplies below? > > There's no powerdown input in the IMX258 manual. The manual only mentions > that XCLR (reset) should be held low during power on. > > https://megous.com/dl/tmp/15b0992a720ab82d.png > > https://megous.com/dl/tmp/f2cc991046d97641.png > >This sensor doesn’t have a built-in “Power ON Reset” function. The XCLR pin >is set to “LOW” and the power supplies are brought up. Then the XCLR pin >should be set to “High” after INCK supplied. > > So this input is some feature on camera module itself outside of the > IMX258 chip, which I think is used to gate power to the module. Eg. on > Pinephone > Pro, there are two modules with shared power rails, so enabling supply to > one module enables it to the other one, too. 
So this input becomes the only > way > to really enable/disable power to the chip when both are used at once at some > point, because regulator_bulk_enable/disable becomes ineffective at that > point. > > Luis, maybe you saw some other datasheet that mentions this input? IMO, > it just gates the power rails via some mosfets on the module itself, since > there's no power down input to the chip itself. > > kind regards, > o. > Ondrej, I did not see anything else in the datasheet since I'm pretty sure I'm looking at the same datasheet as it was supplied to me by Pine64. I'm not sure what datasheet Dave has access to since he got his for a completely different module than what we are testing with though. >>> + >>> ret = regulator_bulk_enable(IMX258_NUM_SUPPLIES, >>> imx258->supplies); >>> if (ret) { >>> @@ -1224,6 +1228,7 @@ static int imx258_power_on(struct device *dev) >>> ret = clk_prepare_enable(imx258->clk); >>> if (ret) { >>> dev_err(dev, "failed to enable clock\n"); >>> + gpiod_set_value_cansleep(imx258->powerdown_gpio, 1); >>> regulator_bulk_disable(IMX258_NUM_SUPPLIES, imx258->supplies); >>> } >>> >>> @@ -1238,6 +1243,8 @@ static int imx258_power_off(struct device *dev) >>> clk_disable_unprepare(imx258->clk); >>> regulator_bulk_disable(IMX258_NUM_SUPPLIES, imx258->supplies); >>> >>> + gpiod_set_value_cansleep(imx258->powerdown_gpio, 1); >>> + >>> return 0; >>> } >>> >>> @@ -1541,6 +1548,12 @@ static int imx258_probe(struct i2c_client *client) >>> if (!imx258->variant_cfg) >>> imx258->variant_cfg = &imx258_cfg; >>> + /* request optional power down pin */ >>> + imx258->powerdown_gpio = devm_gpiod_get_optional(&client->dev, >>> "powerdown", >>> + GPIOD_OUT_HIGH); >>> + if (IS_ERR(imx258->powerdown_gpio)) >>> + return PTR_ERR(imx258->powerdown_gpio); >>> + >>> /* Initialize subdev */ >>> v4l2_i2c_subdev_init(&imx258->sd, client, &imx258_subdev_ops); >>> >> >> -- >> Regards, >> >> Sakari Ailus
Re: [PATCH v3 22/25] dt-bindings: media: imx258: Add binding for powerdown-gpio
On 4/3/24 12:49, Pavel Machek wrote: > Hi! > >> From: Luis Garcia >> >> Add powerdown-gpio binding as it is required for some boards > > "." at end of sentence. > >> Signed-off-by: Ondrej Jirman >> Signed-off-by: Luis Garcia > > If the patch is from Ondrej, he should be in From:, I believe. > > Best regards, > Pavel Ohh, you're right. This was broken out from another patch and I didn't set it to be his actual commit, same with the other ones. They were based on his downstream changes and just modified to match up with what Dave had set the values to. I'll set it to him for the next revision. Thanks!
Re: [PATCH RFC ftrace] Asynchronous grace period for register_ftrace_direct()
On Wed, 3 Apr 2024 11:53:14 -0700 "Paul E. McKenney" wrote: > @@ -5366,6 +5366,13 @@ static void remove_direct_functions_hash(struct > ftrace_hash *hash, unsigned long > } > } > > +static void register_ftrace_direct_cb(struct rcu_head *rhp) > +{ > + struct ftrace_hash *fhp = container_of(rhp, struct ftrace_hash, rcu); > + > + free_ftrace_hash(fhp); > +} > + > /** > * register_ftrace_direct - Call a custom trampoline directly > * for multiple functions registered in @ops > @@ -5464,10 +5471,8 @@ int register_ftrace_direct(struct ftrace_ops *ops, > unsigned long addr) > out_unlock: > mutex_unlock(&direct_mutex); > > - if (free_hash && free_hash != EMPTY_HASH) { > - synchronize_rcu_tasks(); > - free_ftrace_hash(free_hash); > - } > + if (free_hash && free_hash != EMPTY_HASH) > + call_rcu_tasks(&free_hash->rcu, register_ftrace_direct_cb); > > if (new_hash) > free_ftrace_hash(new_hash); I'm getting ready to go on PTO, but a quick glance doesn't look like this would cause any harm. -- Steve
Re: [PATCH v3 21/25] drivers: media: i2c: imx258: Use macros
On 4/3/24 10:23, Sakari Ailus wrote: > Hi Luis, > > On Wed, Apr 03, 2024 at 09:03:50AM -0600, g...@luigi311.com wrote: >> From: Luis Garcia >> >> Use understandable macros instead of raw values. >> >> Signed-off-by: Ondrej Jirman >> Signed-off-by: Luis Garcia >> --- >> drivers/media/i2c/imx258.c | 434 ++--- >> 1 file changed, 207 insertions(+), 227 deletions(-) >> >> diff --git a/drivers/media/i2c/imx258.c b/drivers/media/i2c/imx258.c >> index e2ecf6109516..30352c33f63c 100644 >> --- a/drivers/media/i2c/imx258.c >> +++ b/drivers/media/i2c/imx258.c >> @@ -33,8 +33,6 @@ >> #define IMX258_VTS_30FPS_VGA 0x034c >> #define IMX258_VTS_MAX 65525 >> >> -#define IMX258_REG_VTS 0x0340 >> >> /* HBLANK control - read only */ >> #define IMX258_PPL_DEFAULT 5352 >> >> @@ -90,6 +88,53 @@ >> #define IMX258_PIXEL_ARRAY_WIDTH 4208U >> #define IMX258_PIXEL_ARRAY_HEIGHT 3120U >> >> +/* regs */ >> +#define IMX258_REG_PLL_MULT_DRIV 0x0310 >> +#define IMX258_REG_IVTPXCK_DIV 0x0301 >> +#define IMX258_REG_IVTSYCK_DIV 0x0303 >> +#define IMX258_REG_PREPLLCK_VT_DIV 0x0305 >> +#define IMX258_REG_IOPPXCK_DIV 0x0309 >> +#define IMX258_REG_IOPSYCK_DIV 0x030b >> +#define IMX258_REG_PREPLLCK_OP_DIV 0x030d >> +#define IMX258_REG_PHASE_PIX_OUTEN 0x3030 >> +#define IMX258_REG_PDPIX_DATA_RATE 0x3032 >> +#define IMX258_REG_SCALE_MODE 0x0401 >> +#define IMX258_REG_SCALE_MODE_EXT 0x3038 >> +#define IMX258_REG_AF_WINDOW_MODE 0x7bcd >> +#define IMX258_REG_FRM_LENGTH_CTL 0x0350 >> +#define IMX258_REG_CSI_LANE_MODE 0x0114 >> +#define IMX258_REG_X_EVN_INC 0x0381 >> +#define IMX258_REG_X_ODD_INC 0x0383 >> +#define IMX258_REG_Y_EVN_INC 0x0385 >> +#define IMX258_REG_Y_ODD_INC 0x0387 >> +#define IMX258_REG_BINNING_MODE 0x0900 >> +#define IMX258_REG_BINNING_TYPE_V 0x0901 >> +#define IMX258_REG_FORCE_FD_SUM 0x300d >> +#define IMX258_REG_DIG_CROP_X_OFFSET 0x0408 >> +#define IMX258_REG_DIG_CROP_Y_OFFSET 0x040a >> +#define IMX258_REG_DIG_CROP_IMAGE_WIDTH 0x040c >> +#define IMX258_REG_DIG_CROP_IMAGE_HEIGHT 0x040e >> +#define 
IMX258_REG_SCALE_M 0x0404 >> +#define IMX258_REG_X_OUT_SIZE 0x034c >> +#define IMX258_REG_Y_OUT_SIZE 0x034e >> +#define IMX258_REG_X_ADD_STA 0x0344 >> +#define IMX258_REG_Y_ADD_STA 0x0346 >> +#define IMX258_REG_X_ADD_END 0x0348 >> +#define IMX258_REG_Y_ADD_END 0x034a >> +#define IMX258_REG_EXCK_FREQ 0x0136 >> +#define IMX258_REG_CSI_DT_FMT 0x0112 >> +#define IMX258_REG_LINE_LENGTH_PCK 0x0342 >> +#define IMX258_REG_SCALE_M_EXT 0x303a >> +#define IMX258_REG_FRM_LENGTH_LINES 0x0340 >> +#define IMX258_REG_FINE_INTEG_TIME 0x0200 >> +#define IMX258_REG_PLL_IVT_MPY 0x0306 >> +#define IMX258_REG_PLL_IOP_MPY 0x030e >> +#define IMX258_REG_REQ_LINK_BIT_RATE_MBPS_H 0x0820 >> +#define IMX258_REG_REQ_LINK_BIT_RATE_MBPS_L 0x0822 >> + >> +#define REG8(a, v) { a, v } >> +#define REG16(a, v) { a, ((v) >> 8) & 0xff }, { (a) + 1, (v) & 0xff } > > The patch is nice but these macros are better replaced by the V4L2 CCI > helper that also offers register access functions. Could you add a patch to > convert the driver to use it (maybe after this one)? > Ohh perfect, using something else would be great. I'll go ahead and see if I can get that working. >> + >> struct imx258_reg { >> u16 address; >> u8 val; >> @@ -145,179 +190,139 @@ struct imx258_mode { >> * lane data rate when using 2 lanes, thus allowing a maximum of 15fps. 
>> */ >> static const struct imx258_reg mipi_1267mbps_19_2mhz_2l[] = { >> -{ 0x0136, 0x13 }, >> -{ 0x0137, 0x33 }, >> -{ 0x0301, 0x0A }, >> -{ 0x0303, 0x02 }, >> -{ 0x0305, 0x03 }, >> -{ 0x0306, 0x00 }, >> -{ 0x0307, 0xC6 }, >> -{ 0x0309, 0x0A }, >> -{ 0x030B, 0x01 }, >> -{ 0x030D, 0x02 }, >> -{ 0x030E, 0x00 }, >> -{ 0x030F, 0xD8 }, >> -{ 0x0310, 0x00 }, >> - >> -{ 0x0114, 0x01 }, >> -{ 0x0820, 0x09 }, >> -{ 0x0821, 0xa6 }, >> -{ 0x0822, 0x66 }, >> -{ 0x0823, 0x66 }, >> +REG16(IMX258_REG_EXCK_FREQ, 0x1333), >> +REG8(IMX258_REG_IVTPXCK_DIV, 10), >> +REG8(IMX258_REG_IVTSYCK_DIV, 2), >> +REG8(IMX258_REG_PREPLLCK_VT_DIV, 3), >> +REG16(IMX258_REG_PLL_IVT_MPY, 0x00C6), >> +REG8(IMX258_REG_IOPPXCK_DIV, 10), >> +
[PATCH fs/proc/bootconfig] remove redundant comments from /proc/bootconfig
commit 717c7c894d4b ("fs/proc: Add boot loader arguments as comment to /proc/bootconfig") adds bootloader argument comments into /proc/bootconfig. /proc/bootconfig shows boot_command_line[] multiple times following every xbc key/value pair, which is duplicated and unnecessary. Remove the redundant ones. Output before and after the fix is like: key1 = value1 *bootloader argument comments* key2 = value2 *bootloader argument comments* key3 = value3 *bootloader argument comments* ... key1 = value1 key2 = value2 key3 = value3 *bootloader argument comments* ... Fixes: 717c7c894d4b ("fs/proc: Add boot loader arguments as comment to /proc/bootconfig") Signed-off-by: Zhenhua Huang Signed-off-by: Paul E. McKenney Cc: Masami Hiramatsu Cc: Cc: diff --git a/fs/proc/bootconfig.c b/fs/proc/bootconfig.c index 902b326e1e560..e5635a6b127b0 100644 --- a/fs/proc/bootconfig.c +++ b/fs/proc/bootconfig.c @@ -62,12 +62,12 @@ static int __init copy_xbc_key_value_list(char *dst, size_t size) break; dst += ret; } - if (ret >= 0 && boot_command_line[0]) { - ret = snprintf(dst, rest(dst, end), "# Parameters from bootloader:\n# %s\n", - boot_command_line); - if (ret > 0) - dst += ret; - } + } + if (ret >= 0 && boot_command_line[0]) { + ret = snprintf(dst, rest(dst, end), "# Parameters from bootloader:\n# %s\n", + boot_command_line); + if (ret > 0) + dst += ret; } out: kfree(key);
[PATCH RFC ftrace] Asynchronous grace period for register_ftrace_direct()
Note that the immediate pressure for this patch should be relieved by the NAPI patch series [1], but this sort of problem could easily arise again. So would this change make sense? When running heavy test workloads with KASAN enabled, RCU Tasks grace periods can extend for many tens of seconds, significantly slowing trace registration. Therefore, make the registration-side RCU Tasks grace period be asynchronous via call_rcu_tasks(). Reported-by: Jakub Kicinski Reported-by: Alexei Starovoitov Reported-by: Chris Mason Signed-off-by: Paul E. McKenney Cc: Steven Rostedt Cc: Masami Hiramatsu Cc: Mark Rutland Cc: Mathieu Desnoyers Cc: diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c index da1710499698b..6d21e4e97ed48 100644 --- a/kernel/trace/ftrace.c +++ b/kernel/trace/ftrace.c @@ -5366,6 +5366,13 @@ static void remove_direct_functions_hash(struct ftrace_hash *hash, unsigned long } } +static void register_ftrace_direct_cb(struct rcu_head *rhp) +{ + struct ftrace_hash *fhp = container_of(rhp, struct ftrace_hash, rcu); + + free_ftrace_hash(fhp); +} + /** * register_ftrace_direct - Call a custom trampoline directly * for multiple functions registered in @ops @@ -5464,10 +5471,8 @@ int register_ftrace_direct(struct ftrace_ops *ops, unsigned long addr) out_unlock: mutex_unlock(&direct_mutex); - if (free_hash && free_hash != EMPTY_HASH) { - synchronize_rcu_tasks(); - free_ftrace_hash(free_hash); - } + if (free_hash && free_hash != EMPTY_HASH) + call_rcu_tasks(&free_hash->rcu, register_ftrace_direct_cb); if (new_hash) free_ftrace_hash(new_hash);
Re: [PATCH] uprobes: reduce contention on uprobes_tree access
On Wed, Apr 3, 2024 at 4:05 AM Jonthan Haslam wrote: > > > > > > Given the discussion around per-cpu rw semaphore and need for > > > > > (internal) batched attachment API for uprobes, do you think you can > > > > > apply this patch as is for now? We can then gain initial improvements > > > > > in scalability that are also easy to backport, and Jonathan will work > > > > > on a more complete solution based on per-cpu RW semaphore, as > > > > > suggested by Ingo. > > > > > > > > Yeah, it is interesting to use per-cpu rw semaphore on uprobe. > > > > I would like to wait for the next version. > > > > > > My initial tests show a nice improvement over the RW spinlocks but > > > significant regression in acquiring a write lock. I've got a few days > > > vacation over Easter but I'll aim to get some more formalised results out > > > to the thread toward the end of next week. > > > > As long as the write lock is only on the cold path, I think you can choose > > per-cpu RW semaphore. Since it does not do busy wait, the total system > > performance impact will be small. > > I look forward to your formalized results :) > > Sorry for the delay in getting back to you on this, Masami. > > I have used one of the bpf selftest benchmarks to provide some form of > comparison of the 3 different approaches (spinlock, RW spinlock and > per-cpu RW semaphore). The benchmark used here is the 'trig-uprobe-nop' > benchmark which just executes a single uprobe with a minimal bpf program > attached. The tests were done on a 32 core qemu/kvm instance. > Thanks a lot for running benchmarks and providing results! 
The per-CPU RW semaphore provides in the order of 50-100% > more throughput than the spinlock. > - This doesn't fully reflect the large reduction in latency that we have > seen in application based measurements. However, it does demonstrate > that even the trivial change of going to a RW spinlock provides > significant benefits. This is probably because trig-uprobe-nop creates a single uprobe that is triggered on many CPUs, while in production we also have *many* uprobes running on many CPUs. In this benchmark, besides contention on uprobes_treelock, we are also hammering on other per-uprobe locks (register_rwsem, also if you don't have [0] patch locally, there will be another filter lock taken each time, filter->rwlock). There is also atomic refcounting going on, which when you have the same uprobe across all CPUs at the same time will cause a bunch of cache line bouncing. So yes, it's understandable that in practice in production you see an even larger effect of optimizing uprobe_treelock than in this micro-benchmark. [0] https://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace.git/commit/?h=probes/for-next=366f7afd3de31d3ce2f4cbff97c6c23b6aa6bcdf > > I haven't included the measurements on per-CPU RW semaphore write > performance as they are completely in line with those that Paul McKenney > posted on his journal [0]. On a 32 core system I see semaphore writes > take in the order of 25-28 millisecs - the cost of the synchronize_rcu(). > > Each block of results below shows 1 line per execution of the benchmark (the > "Summary" line) and each line is a run with one more thread added - a > thread is a "producer". The lines are edited to remove extraneous output > that adds no value here. > > The tests were executed with this driver script: > > for num_threads in {1..20} > do > sudo ./bench -p $num_threads trig-uprobe-nop | grep Summary Just want to mention the -a (affinity) option that you can pass to the bench tool; it will pin each thread on its own CPU. 
It generally makes tests more uniform, eliminating CPU migrations variability. > done > > > spinlock > > Summary: hits1.453 ± 0.005M/s ( 1.453M/prod) > Summary: hits2.087 ± 0.005M/s ( 1.043M/prod) > Summary: hits2.701 ± 0.012M/s ( 0.900M/prod) I also wanted to point out that the first measurement (1.453M/s in this row) is total throughput across all threads, while value in parenthesis (0.900M/prod) is averaged throughput per each thread. So this M/prod value is the most interesting in this benchmark where we assess the effect of reducing contention. > Summary: hits1.917 ± 0.011M/s ( 0.479M/prod) > Summary: hits2.105 ± 0.003M/s ( 0.421M/prod) > Summary: hits1.615 ± 0.006M/s ( 0.269M/prod) [...]
Re: [PATCH v10 2/2] memory tier: create CPUless memory tiers after obtaining HMAT info
A few minor comments inline. > diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h > index a44c03c2ba3a..16769552a338 100644 > --- a/include/linux/memory-tiers.h > +++ b/include/linux/memory-tiers.h > @@ -140,12 +140,13 @@ static inline int mt_perf_to_adistance(struct > access_coordinate *perf, int *adis > return -EIO; > } > > -struct memory_dev_type *mt_find_alloc_memory_type(int adist, struct > list_head *memory_types) > +static inline struct memory_dev_type *mt_find_alloc_memory_type(int adist, > + struct list_head *memory_types) > { > return NULL; > } > > -void mt_put_memory_types(struct list_head *memory_types) > +static inline void mt_put_memory_types(struct list_head *memory_types) > { Why in this patch and not previous one? > > } > diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c > index 974af10cfdd8..44fa10980d37 100644 > --- a/mm/memory-tiers.c > +++ b/mm/memory-tiers.c > @@ -36,6 +36,11 @@ struct node_memory_type_map { > > static DEFINE_MUTEX(memory_tier_lock); > static LIST_HEAD(memory_tiers); > +/* > + * The list is used to store all memory types that are not created > + * by a device driver. > + */ > +static LIST_HEAD(default_memory_types); > static struct node_memory_type_map node_memory_types[MAX_NUMNODES]; > struct memory_dev_type *default_dram_type; > > @@ -108,6 +113,8 @@ static struct demotion_nodes *node_demotion __read_mostly; > > static BLOCKING_NOTIFIER_HEAD(mt_adistance_algorithms); > > +/* The lock is used to protect `default_dram_perf*` info and nid. 
> + */
> +static DEFINE_MUTEX(default_dram_perf_lock);
>  static bool default_dram_perf_error;
>  static struct access_coordinate default_dram_perf;
>  static int default_dram_perf_ref_nid = NUMA_NO_NODE;
> @@ -505,7 +512,8 @@ static inline void __init_node_memory_type(int node, struct memory_dev_type *mem
>  static struct memory_tier *set_node_memory_tier(int node)
>  {
>  	struct memory_tier *memtier;
> -	struct memory_dev_type *memtype;
> +	struct memory_dev_type *mtype = default_dram_type;

Does the rename add anything major to the patch? If not I'd leave it alone
to reduce the churn and give a more readable patch. If it is worth doing,
perhaps a precursor patch?

> +	int adist = MEMTIER_ADISTANCE_DRAM;
>  	pg_data_t *pgdat = NODE_DATA(node);
>
>
> @@ -514,11 +522,20 @@ static struct memory_tier *set_node_memory_tier(int node)
>  	if (!node_state(node, N_MEMORY))
>  		return ERR_PTR(-EINVAL);
>
> -	__init_node_memory_type(node, default_dram_type);
> +	mt_calc_adistance(node, &adist);
> +	if (node_memory_types[node].memtype == NULL) {
> +		mtype = mt_find_alloc_memory_type(adist, &default_memory_types);
> +		if (IS_ERR(mtype)) {
> +			mtype = default_dram_type;
> +			pr_info("Failed to allocate a memory type. Fall back.\n");
> +		}
> +	}
> +
> +	__init_node_memory_type(node, mtype);
>
> -	memtype = node_memory_types[node].memtype;
> -	node_set(node, memtype->nodes);
> -	memtier = find_create_memory_tier(memtype);
> +	mtype = node_memory_types[node].memtype;
> +	node_set(node, mtype->nodes);
> +	memtier = find_create_memory_tier(mtype);
>  	if (!IS_ERR(memtier))
>  		rcu_assign_pointer(pgdat->memtier, memtier);
>  	return memtier;
> @@ -655,6 +672,33 @@ void mt_put_memory_types(struct list_head *memory_types)
>  }
>  EXPORT_SYMBOL_GPL(mt_put_memory_types);
>
> +/*
> + * This is invoked via `late_initcall()` to initialize memory tiers for
> + * CPU-less memory nodes after driver initialization, which is
> + * expected to provide `adistance` algorithms.
> + */
> +static int __init memory_tier_late_init(void)
> +{
> +	int nid;
> +
> +	mutex_lock(&memory_tier_lock);
> +	for_each_node_state(nid, N_MEMORY)
> +		if (node_memory_types[nid].memtype == NULL)
> +			/*
> +			 * Some device drivers may have initialized memory tiers
> +			 * between `memory_tier_init()` and `memory_tier_late_init()`,
> +			 * potentially bringing online memory nodes and
> +			 * configuring memory tiers. Exclude them here.
> +			 */

Does the comment refer to this path, or to ones where memtype is set?

> +			set_node_memory_tier(nid);

Given the large comment I would add {} to help with readability. You could
also flip the logic to reduce the indent:

	for_each_node_state(nid, N_MEMORY) {
		if (node_memory_types[nid].memtype)
			continue;
		/*
		 * Some device drivers may have initialized memory tiers
		 * between `memory_tier_init()` and `memory_tier_late_init()`,
		 * potentially bringing online memory nodes and
		 * configuring memory tiers. Exclude them
Re: [PATCHv2 1/3] uprobe: Add uretprobe syscall to speed up return probe
On Wed, Apr 3, 2024 at 7:09 AM Masami Hiramatsu wrote:
>
> On Wed, 3 Apr 2024 11:47:41 +0200
> Jiri Olsa wrote:
>
> > On Wed, Apr 03, 2024 at 10:07:08AM +0900, Masami Hiramatsu wrote:
> > > Hi Jiri,
> > >
> > > On Tue, 2 Apr 2024 11:33:00 +0200
> > > Jiri Olsa wrote:
> > >
> > > > Adding uretprobe syscall instead of trap to speed up return probe.
> > >
> > > This is interesting approach. But I doubt we need to add additional
> > > syscall just for this purpose. Can't we use another syscall or ioctl?
> >
> > so the plan is to optimize entry uprobe in a similar way and given
> > the syscall is not a scarce resource I wanted to add another syscall
> > for that one as well
> >
> > tbh I'm not sure which syscall or ioctl to reuse for this, it's
> > possible to do that, the trampoline will just have to save one or
> > more additional registers, but adding new syscall seems cleaner to me
>
> Hmm, I think a similar syscall is ptrace? prctl may also be a candidate.

I think both ptrace and prctl are for completely different use cases and
it would be an abuse of existing APIs to reuse them for uretprobe tracing.
Also, keep in mind that any extra argument that has to be passed into
this syscall means that we need to complicate and slow down the generated
assembly code that is injected into the user process (to save/restore
registers) and also the kernel side (again, to deal with all the extra
registers that would be stored/restored on the stack).

Given syscalls are not some kind of scarce resource, what's the downside
to having a dedicated and simple syscall?

> > > Also, we should run syzkaller on this syscall. And if uretprobe is

> > right, I'll check on syzkaller

> > > set in the user function, what happen if the user function directly
> > > calls this syscall? (maybe it consumes shadow stack?)
> > the process should receive SIGILL if there's no pending uretprobe for
> > the current task, or it will trigger uretprobe if there's one pending
>
> No, that is too aggressive and not safe. Since the syscall is exposed to
> user program, it should return appropriate error code instead of SIGILL.

This is the way it is today with uretprobes, even through the interrupt.
E.g., it could happen that a user process is using fibers and is replacing
the stack pointer without the kernel realizing this, which will trigger
some defensive checks in the uretprobe handling code, and the kernel will
send SIGILL because it can't support such cases. This is happening today
already, and it works fine in practice (except for applications that
manually change the stack pointer; too bad, you can't trace them with
uretprobes, unfortunately). So I think it's absolutely adequate to have
this behavior if the user process is *intentionally* abusing this API.

> > but we could limit the syscall to be executed just from the trampoline,
> > that should prevent all the user space use cases, I'll do that in next
> > version and add more tests for that
>
> Why not limit? :) The uprobe_handle_trampoline() expects it is called
> only from the trampoline, so it is natural to check the caller address.
> (and uprobe should know where is the trampoline)
>
> Since the syscall is always exposed to the user program, it should
> - Do nothing and return an error unless it is properly called.
> - check the prerequisites for operation strictly.
> I concern that new system calls introduce vulnerabilities.

As Oleg and Jiri mentioned, this syscall can't harm the kernel or other
processes, only the process that is abusing the API. So any extra checks
that would slow down this approach are unnecessary overhead and
complication that will never be useful in practice.
Also note that sys_uretprobe is a kind of internal and unstable API and it is explicitly called out that its contract can change at any time and user space shouldn't rely on it. It's purely for the kernel's own usage. So let's please keep it fast and simple. > Thank you, > > > > > > thanks, > > jirka > > > > > > > [...]
Re: [PATCH v10 1/2] memory tier: dax/kmem: introduce an abstract layer for finding, allocating, and putting memory types
On Tue, 2 Apr 2024 00:17:37 +0000 "Ho-Ren (Jack) Chuang" wrote:

> Since different memory devices require finding, allocating, and putting
> memory types, these common steps are abstracted in this patch,
> enhancing the scalability and conciseness of the code.
>
> Signed-off-by: Ho-Ren (Jack) Chuang
> Reviewed-by: "Huang, Ying"

Hi,

I know this is a late entry to the discussion, but a few comments inline.
(Sorry I didn't look earlier!) All are opportunities to improve code
complexity and readability as a result of your factoring out.

Jonathan

> ---
>  drivers/dax/kmem.c           | 20 ++--
>  include/linux/memory-tiers.h | 13 +
>  mm/memory-tiers.c            | 32
>  3 files changed, 47 insertions(+), 18 deletions(-)
>
> diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c
> index 42ee360cf4e3..01399e5b53b2 100644
> --- a/drivers/dax/kmem.c
> +++ b/drivers/dax/kmem.c
> @@ -55,21 +55,10 @@ static LIST_HEAD(kmem_memory_types);
>
>  static struct memory_dev_type *kmem_find_alloc_memory_type(int adist)
>  {
> -	bool found = false;
>  	struct memory_dev_type *mtype;
>
>  	mutex_lock(&kmem_memory_type_lock);

Could use:

	guard(mutex)(&kmem_memory_type_lock);

	return mt_find_alloc_memory_type(adist, &kmem_memory_types);

I'm fine if you ignore this comment though, as there may be other functions
in here that could take advantage of the cleanup.h stuff in a future patch.
> -	list_for_each_entry(mtype, &kmem_memory_types, list) {
> -		if (mtype->adistance == adist) {
> -			found = true;
> -			break;
> -		}
> -	}
> -	if (!found) {
> -		mtype = alloc_memory_type(adist);
> -		if (!IS_ERR(mtype))
> -			list_add(&mtype->list, &kmem_memory_types);
> -	}
> +	mtype = mt_find_alloc_memory_type(adist, &kmem_memory_types);
>  	mutex_unlock(&kmem_memory_type_lock);
>
>  	return mtype;
> diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
> index 69e781900082..a44c03c2ba3a 100644
> --- a/include/linux/memory-tiers.h
> +++ b/include/linux/memory-tiers.h
> @@ -48,6 +48,9 @@ int mt_calc_adistance(int node, int *adist);
>  int mt_set_default_dram_perf(int nid, struct access_coordinate *perf,
>  			     const char *source);
>  int mt_perf_to_adistance(struct access_coordinate *perf, int *adist);
> +struct memory_dev_type *mt_find_alloc_memory_type(int adist,
> +					struct list_head *memory_types);

That indent looks unusual. Align the start of "struct" with the start of
"int".

> +void mt_put_memory_types(struct list_head *memory_types);
>  #ifdef CONFIG_MIGRATION
>  int next_demotion_node(int node);
>  void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *targets);
> @@ -136,5 +139,15 @@ static inline int mt_perf_to_adistance(struct access_coordinate *perf, int *adis
>  {
>  	return -EIO;
>  }
> +
> +struct memory_dev_type *mt_find_alloc_memory_type(int adist, struct list_head *memory_types)
> +{
> +	return NULL;
> +}
> +
> +void mt_put_memory_types(struct list_head *memory_types)
> +{
> +

No blank line needed here.
> +}
>  #endif	/* CONFIG_NUMA */
>  #endif	/* _LINUX_MEMORY_TIERS_H */
> diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
> index 0537664620e5..974af10cfdd8 100644
> --- a/mm/memory-tiers.c
> +++ b/mm/memory-tiers.c
> @@ -623,6 +623,38 @@ void clear_node_memory_type(int node, struct memory_dev_type *memtype)
>  }
>  EXPORT_SYMBOL_GPL(clear_node_memory_type);
>
> +struct memory_dev_type *mt_find_alloc_memory_type(int adist, struct list_head *memory_types)

Breaking this out as a separate function provides an opportunity to
improve it. Maybe a follow-up patch makes sense, given it would no longer
be a straightforward code move. However, in my view it would be simple
enough to be obvious even within this patch.

> +{
> +	bool found = false;
> +	struct memory_dev_type *mtype;
> +
> +	list_for_each_entry(mtype, memory_types, list) {
> +		if (mtype->adistance == adist) {
> +			found = true;

Why not return here?

			return mtype;

> +			break;
> +		}
> +	}
> +	if (!found) {

If returning above, there is no need for the found variable - just do this
unconditionally. I also suggest you flip the logic for a simpler to follow
code flow. It's more code, but I think a bit easier to read, as the error
handling is out of the main simple flow:

	mtype = alloc_memory_type(adist);
	if (IS_ERR(mtype))
		return mtype;

	list_add(&mtype->list, memory_types);

	return mtype;

> +		mtype = alloc_memory_type(adist);
> +		if (!IS_ERR(mtype))
> +			list_add(&mtype->list, memory_types);
> +	}
> +
> +	return mtype;
> +}
> +EXPORT_SYMBOL_GPL(mt_find_alloc_memory_type);
> +
> +void mt_put_memory_types(struct
Re: [PATCH v9 15/15] selftests/sgx: Add scripts for EPC cgroup testing
On Tue Apr 2, 2024 at 8:40 PM EEST, Michal Koutný wrote:
> On Tue, Apr 02, 2024 at 11:20:21AM -0500, Haitao Huang wrote:
> > Do we really want to have it implemented in C?
>
> I only pointed to the available C boilerplate.
>
> > There are much fewer lines of
> > code in shell scripts. Note we are not really testing basic cgroup stuff.
> > All we needed were creating/deleting cgroups and set limits which I think
> > have been demonstrated feasible in the ash scripts now.
>
> I assume you refer to
> Message-Id: <20240331174442.51019-1-haitao.hu...@linux.intel.com>
> right?
>
> Could it be even simpler if you didn't stick to cgtools APIs and v1
> compatibility?
>
> Reducing ash_cgexec.sh to something like
>
>     echo 0 >$R/$1/cgroup.procs
>     shift
>     exec "$@"
>
> (with some small boilerplate for $R and previous mkdirs)

I already asked about the necessity of v1 in some response, and fully
support this idea. Then cgexec can simply be a function wrapping along the
lines of what you proposed.

BR, Jarkko
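The reduction Michal sketches above could be fleshed out roughly as
follows. This is only a sketch, not the patch's actual script: the
`cgexec` function name comes from the thread, `R` defaults to
/sys/fs/cgroup as in Michal's snippet, and making `R` overridable is my
own addition so the function can be exercised outside a real cgroup
hierarchy.

```shell
#!/bin/sh
# Minimal cgroup-v2-only cgexec: move the calling shell into the given
# cgroup (a path relative to the v2 root), then exec the requested command.
cgexec() {
    cg="$1"
    shift
    # Writing our PID to cgroup.procs migrates this process into $cg.
    echo $$ > "${R:-/sys/fs/cgroup}/$cg/cgroup.procs"
    exec "$@"
}
```

Usage would be e.g. `cgexec test_cg ./some_test` from a shell with write
access to the cgroup; writing 0 instead of $$ also works, since the kernel
treats PID 0 in cgroup.procs as the writing process.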
Re: [PATCH v19 RESEND 3/5] tracing: Allow user-space mapping of the ring-buffer
On Wed, 3 Apr 2024 15:39:44 +0100 Vincent Donnefort wrote: > > Do you plan on sending out a v20 series? > > Of course, let me spin that this week! Got also few typos to fix in the doc > and > I believe an include missing for riscv. No rush, I'll be on PTO until next Tuesday, and will not get to it before then. -- Steve
[PATCH net v4] virtio_net: Do not send RSS key if it is not supported
There is a bug when setting the RSS options in virtio_net that can break
the whole machine, getting the kernel into an infinite loop.

Running the following command in any QEMU virtual machine with virtio_net
will reproduce this problem:

    # ethtool -X eth0 hfunc toeplitz

This is how the problem happens:

1) ethtool_set_rxfh() calls virtnet_set_rxfh()

2) virtnet_set_rxfh() calls virtnet_commit_rss_command()

3) virtnet_commit_rss_command() populates 4 entries for the rss
scatter-gather

4) Since the command above does not have a key, the last scatter-gather
entry will be zeroed, since rss_key_size == 0:

    sg_buf_size = vi->rss_key_size;

5) This buffer is passed to QEMU, but QEMU is not happy with a zero-length
buffer, and does the following in virtqueue_map_desc() (a QEMU function):

    if (!sz) {
        virtio_error(vdev, "virtio: zero sized buffers are not allowed");

6) virtio_error() (also a QEMU function) sets the device as broken:

    vdev->broken = true;

7) QEMU bails out and does not respond to the kernel.

8) The kernel is waiting for the response to come back (in
virtnet_send_command())

9) The kernel waits, doing the following:

    while (!virtqueue_get_buf(vi->cvq, &tmp) &&
           !virtqueue_is_broken(vi->cvq))
        cpu_relax();

10) Neither of the conditions above ever becomes true, so the kernel loops
here forever. Keep in mind that virtqueue_is_broken() does not look at
QEMU's `vdev->broken`, so it never realizes that the virtio device is
broken on the QEMU side.

Fix it by not sending RSS commands if the feature is not available in the
device.

Fixes: c7114b1249fa ("drivers/net/virtio_net: Added basic RSS support.")
Cc: sta...@vger.kernel.org
Cc: qemu-de...@nongnu.org
Signed-off-by: Breno Leitao
Reviewed-by: Heng Qi
---
Changelog:
V2:
	* Moved from creating a valid packet to rejecting the request
	  completely.
V3:
	* Got some good feedback from Xuan Zhuo and Heng Qi, and reworked
	  the rejection path.
V4:
	* Added a comment in an "if" clause, as suggested by Michael S.
	  Tsirkin.
--- drivers/net/virtio_net.c | 26 ++ 1 file changed, 22 insertions(+), 4 deletions(-) diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c index c22d1118a133..115c3c5414f2 100644 --- a/drivers/net/virtio_net.c +++ b/drivers/net/virtio_net.c @@ -3807,6 +3807,7 @@ static int virtnet_set_rxfh(struct net_device *dev, struct netlink_ext_ack *extack) { struct virtnet_info *vi = netdev_priv(dev); + bool update = false; int i; if (rxfh->hfunc != ETH_RSS_HASH_NO_CHANGE && @@ -3814,13 +3815,28 @@ static int virtnet_set_rxfh(struct net_device *dev, return -EOPNOTSUPP; if (rxfh->indir) { + if (!vi->has_rss) + return -EOPNOTSUPP; + for (i = 0; i < vi->rss_indir_table_size; ++i) vi->ctrl->rss.indirection_table[i] = rxfh->indir[i]; + update = true; } - if (rxfh->key) + + if (rxfh->key) { + /* If either _F_HASH_REPORT or _F_RSS are negotiated, the +* device provides hash calculation capabilities, that is, +* hash_key is configured. +*/ + if (!vi->has_rss && !vi->has_rss_hash_report) + return -EOPNOTSUPP; + memcpy(vi->ctrl->rss.key, rxfh->key, vi->rss_key_size); + update = true; + } - virtnet_commit_rss_command(vi); + if (update) + virtnet_commit_rss_command(vi); return 0; } @@ -4729,13 +4745,15 @@ static int virtnet_probe(struct virtio_device *vdev) if (virtio_has_feature(vdev, VIRTIO_NET_F_HASH_REPORT)) vi->has_rss_hash_report = true; - if (virtio_has_feature(vdev, VIRTIO_NET_F_RSS)) + if (virtio_has_feature(vdev, VIRTIO_NET_F_RSS)) { vi->has_rss = true; - if (vi->has_rss || vi->has_rss_hash_report) { vi->rss_indir_table_size = virtio_cread16(vdev, offsetof(struct virtio_net_config, rss_max_indirection_table_length)); + } + + if (vi->has_rss || vi->has_rss_hash_report) { vi->rss_key_size = virtio_cread8(vdev, offsetof(struct virtio_net_config, rss_max_key_size)); -- 2.43.0
Re: [PATCH v2] selftests/sgx: Improve cgroup test scripts
On Tue Apr 2, 2024 at 8:31 PM EEST, Haitao Huang wrote: > On Tue, 02 Apr 2024 02:43:25 -0500, Jarkko Sakkinen > wrote: > > > On Tue Apr 2, 2024 at 4:42 AM EEST, Haitao Huang wrote: > >> Make cgroup test scripts ash compatible. > >> Remove cg-tools dependency. > >> Add documentation for functions. > >> > >> Tested with busybox on Ubuntu. > >> > >> Signed-off-by: Haitao Huang > >> --- > >> v2: > >> - Fixes for v2 cgroup > >> - Turn off swapping before memcontrol tests and back on after > >> - Add comments and reformat > >> --- > >> tools/testing/selftests/sgx/ash_cgexec.sh | 57 ++ > >> .../selftests/sgx/run_epc_cg_selftests.sh | 187 +++--- > >> .../selftests/sgx/watch_misc_for_tests.sh | 13 +- > >> 3 files changed, 179 insertions(+), 78 deletions(-) > >> create mode 100755 tools/testing/selftests/sgx/ash_cgexec.sh > >> > >> diff --git a/tools/testing/selftests/sgx/ash_cgexec.sh > >> b/tools/testing/selftests/sgx/ash_cgexec.sh > >> new file mode 100755 > >> index ..9607784378df > >> --- /dev/null > >> +++ b/tools/testing/selftests/sgx/ash_cgexec.sh > >> @@ -0,0 +1,57 @@ > >> +#!/usr/bin/env sh > >> +# SPDX-License-Identifier: GPL-2.0 > >> +# Copyright(c) 2024 Intel Corporation. > >> + > >> +# Move the current shell process to the specified cgroup > >> +# Arguments: > >> +# $1 - The cgroup controller name, e.g., misc, memory. > >> +# $2 - The path of the cgroup, > >> +# relative to /sys/fs/cgroup for cgroup v2, > >> +# relative to /sys/fs/cgroup/$1 for v1. > >> +move_to_cgroup() { > >> +controllers="$1" > >> +path="$2" > >> + > >> +# Check if cgroup v2 is in use > >> +if [ ! 
-d "/sys/fs/cgroup/misc" ]; then > >> +# Cgroup v2 logic > >> +cgroup_full_path="/sys/fs/cgroup/${path}" > >> +echo $$ > "${cgroup_full_path}/cgroup.procs" > >> +else > >> +# Cgroup v1 logic > >> +OLD_IFS="$IFS" > >> +IFS=',' > >> +for controller in $controllers; do > >> +cgroup_full_path="/sys/fs/cgroup/${controller}/${path}" > >> +echo $$ > "${cgroup_full_path}/tasks" > >> +done > >> +IFS="$OLD_IFS" > >> +fi > > > > I think that if you could point me to git v10 and all this I could > > then quite easily create test image and see what I get from that. > > > > I will code review the whole thing but this is definitely good > > enough to start testing this series properly! Thanks for the > > effort with this. The payback from this comes after the feature > > is mainline. We have now sort of reference of the usage patterns > > and less layers when we need to debug any possible (likely) bugs > > in the future. > > > > This is definitely to the right direction. I'm just wondering do > > we want to support v1 cgroups or would it make sense support only > > v2? > > BR, Jarkko > > > I can drop v1. I think most distro now support v2. > Created this branch to host these changes so far: > https://github.com/haitaohuang/linux/tree/sgx_cg_upstream_v10_plus Thanks! I'll point my build to that, make a test image and report the results. BR, Jarkko
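An aside on the v1/v2 probe in the script above: it keys off the existence
of /sys/fs/cgroup/misc. An alternative sketch (my own suggestion, not
something the patch does) is to look at the mounted filesystem type, which
stays correct even when the misc controller happens to be absent. The
mounts file is parameterized purely so the helper can be exercised against
canned data.

```shell
#!/bin/sh
# Report which cgroup hierarchy is in use by checking the filesystem type
# column (field 3) of a mounts file (default: /proc/self/mounts).
cgroup_version() {
    mounts="${1:-/proc/self/mounts}"
    if awk '$3 == "cgroup2" { found = 1 } END { exit !found }' "$mounts"; then
        echo v2
    else
        echo v1
    fi
}
```

The script's directory check is shorter; this variant trades a few lines
for not depending on any particular controller being enabled.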
Re: [PATCH v9 15/15] selftests/sgx: Add scripts for EPC cgroup testing
On Tue Apr 2, 2024 at 7:20 PM EEST, Haitao Huang wrote:
> On Tue, 02 Apr 2024 06:58:40 -0500, Jarkko Sakkinen wrote:
>
> > On Tue Apr 2, 2024 at 2:23 PM EEST, Michal Koutný wrote:
> >> Hello.
> >>
> >> On Sat, Mar 30, 2024 at 01:26:08PM +0200, Jarkko Sakkinen wrote:
> >> > > > It'd be more complicated and less readable to do all the stuff
> >> > > > without the cgroup-tools, esp cgexec. I checked dependency,
> >> > > > cgroup-tools only depends on libc so I hope this would not
> >> > > > cause too much inconvenience.
> >> > >
> >> > > As per cgroup-tools, please prove this. It makes the job more
> >> > > complicated *for you* and you are making the job more complicated
> >> > > for every possible person on the planet running any kernel QA.
> >> > >
> >> > > I weigh the latter more than the former. And it is exactly the
> >> > > reason why we did a custom user space kselftest in the first
> >> > > place. Let's keep the tradition. All I can say is that kselftest
> >> > > is unfinished in its current form.
> >> > >
> >> > > What is "esp cgexec"?
> >> >
> >> > Also in kselftest we don't drive ultimate simplicity, we drive
> >> > efficient CI/QA. By open coding something like a subset of
> >> > cgroup-tools needed to run the test you also help us later
> >> > on to backtrack the kernel changes. With cgroup-tools you
> >> > would have to use strace to get the same info.
> >>
> >> FWIW, see also functions in
> >> tools/testing/selftests/cgroup/cgroup_util.{h,c}.
> >> They likely cover what you need already -- if the tests are in C.
> >>
> >> (I admit that stuff in tools/testing/selftests/cgroup/ is best
> >> understood with strace.)
> >
> > Thanks!
> >
> > My conclusions are that:
> >
> > 1. We probably cannot move the test part of the cgroup test itself
> >    given the enclave payload dependency.
> > 2. I think it makes sense to still follow the same pattern as
> >    other cgroups tests and re-use cgroup_util.[ch] functionality.
> > So yeah I guess we need two test programs instead of one.
> >
> > Something along these lines:
> >
> > 1. main.[ch] -> test_sgx.[ch]
> > 2. introduce test_sgx_cgroup.c
> >
> > And test_sgx_cgroup.c would implement a similar test to the shell
> > script and would follow the structure of the existing cgroups tests.
> >
> >> HTH,
> >> Michal
> >
> > BR, Jarkko
>
> Do we really want to have it implemented in C? There are much fewer
> lines of code in shell scripts. Note we are not really testing basic
> cgroup stuff. All we needed were creating/deleting cgroups and setting
> limits, which I think have been demonstrated feasible in the ash
> scripts now.
>
> Given Dave's comments, and the test scripts working and covering the
> cases needed IMHO, I don't see much need to move to C code. I can add
> more cases if needed and fall back to a C implementation later if any
> case can't be implemented in scripts. How about that?

We can settle on: ash + no dependencies. I guess you already have all the
work done for that.

BR, Jarkko
Re: [PATCH v9 15/15] selftests/sgx: Add scripts for EPC cgroup testing
On Tue Apr 2, 2024 at 6:42 PM EEST, Dave Hansen wrote:
> On 3/30/24 04:23, Jarkko Sakkinen wrote:
> >>> I also wonder is the cgroup-tools dependency absolutely required, or
> >>> could you just have a function that would interact with sysfs?
> >> I should have checked email before hitting the send button for v10.
> >>
> >> It'd be more complicated and less readable to do all the stuff
> >> without the cgroup-tools, esp cgexec. I checked dependency,
> >> cgroup-tools only depends on libc so I hope this would not cause too
> >> much inconvenience.
> > As per cgroup-tools, please prove this. It makes the job more
> > complicated *for you* and you are making the job more complicated
> > for every possible person on the planet running any kernel QA.
>
> I don't see any other use of cgroup-tools in testing/selftests.
>
> I *DO* see a ton of /bin/bash use though. I wouldn't go to too much
> trouble to make the thing ash-compatible.
>
> That said, the most important thing is to get some selftests in place.
> If using cgroup-tools means we get actual, runnable tests in place,
> that's a heck of a lot more important than making them perfect.
> Remember, almost nobody uses SGX. It's available on *VERY* few systems
> from one CPU vendor and only in very specific hardware configurations.

Ash-compatible is good enough for me, so let's draw the line there.
Ash-compatibility does not cause any major hurdle, as can be seen from
Haitao's patch. The earlier version was not even POSIX-compatible, given
that it used a hard-coded path.

Most of the added stuff comes from open-coding the tools, but in test code
that is not a big deal, and it helps with debugging in the future. Even
right now it helps with reviewing kernel patches because it documents
exactly how the feature is seen from user space.

BR, Jarkko
Re: [PATCHv2 1/3] uprobe: Add uretprobe syscall to speed up return probe
Again, I leave this to you and Jiri, but On 04/03, Masami Hiramatsu wrote: > > On Wed, 3 Apr 2024 11:47:41 +0200 > > > set in the user function, what happen if the user function directly > > > calls this syscall? (maybe it consumes shadow stack?) > > > > the process should receive SIGILL if there's no pending uretprobe for > > the current task, or it will trigger uretprobe if there's one pending > > No, that is too aggressive and not safe. Since the syscall is exposed to > user program, it should return appropriate error code instead of SIGILL. ... > Since the syscall is always exposed to the user program, it should > - Do nothing and return an error unless it is properly called. > - check the prerequisites for operation strictly. We have sys_munmap(). should it check if the caller is going to unmap the code region which contains regs->ip and do nothing? I don't think it should. Userspace should blame itself, SIGSEGV is not "too aggressive" in this case. > I concern that new system calls introduce vulnerabilities. Yes, we need to ensure that sys_uretprobe() can only damage the malicious caller and nothing else. Oleg.
Re: [PATCH v19 RESEND 3/5] tracing: Allow user-space mapping of the ring-buffer
On Wed, Apr 03, 2024 at 10:13:52AM -0400, Steven Rostedt wrote:
> On Fri, 29 Mar 2024 14:40:55 -0400
> Steven Rostedt wrote:
>
> > > +static vm_fault_t tracing_buffers_mmap_fault(struct vm_fault *vmf)
> > > +{
> > > +	return VM_FAULT_SIGBUS;
> > > +}
> >
> > If this is all it does, I don't believe it's needed.
> >
> > > +
> > > +#ifdef CONFIG_TRACER_MAX_TRACE
> > > +static int get_snapshot_map(struct trace_array *tr)
> > > +{
> > > +	int err = 0;
> > > +
> > > +	/*
> > > +	 * Called with mmap_lock held. lockdep would be unhappy if we would now
> > > +	 * take trace_types_lock. Instead use the specific
> > > +	 * snapshot_trigger_lock.
> > > +	 */
> > > +	spin_lock(&tr->snapshot_trigger_lock);
> > > +
> > > +	if (tr->snapshot || tr->mapped == UINT_MAX)
> > > +		err = -EBUSY;
> > > +	else
> > > +		tr->mapped++;
> > > +
> > > +	spin_unlock(&tr->snapshot_trigger_lock);
> > > +
> > > +	/* Wait for update_max_tr() to observe iter->tr->mapped */
> > > +	if (tr->mapped == 1)
> > > +		synchronize_rcu();
> > > +
> > > +	return err;
> > > +
> > > +}
> > > +static void put_snapshot_map(struct trace_array *tr)
> > > +{
> > > +	spin_lock(&tr->snapshot_trigger_lock);
> > > +	if (!WARN_ON(!tr->mapped))
> > > +		tr->mapped--;
> > > +	spin_unlock(&tr->snapshot_trigger_lock);
> > > +}
> > > +#else
> > > +static inline int get_snapshot_map(struct trace_array *tr) { return 0; }
> > > +static inline void put_snapshot_map(struct trace_array *tr) { }
> > > +#endif
> > > +
> > > +static void tracing_buffers_mmap_close(struct vm_area_struct *vma)
> > > +{
> > > +	struct ftrace_buffer_info *info = vma->vm_file->private_data;
> > > +	struct trace_iterator *iter = &info->iter;
> > > +
> > > +	WARN_ON(ring_buffer_unmap(iter->array_buffer->buffer, iter->cpu_file));
> > > +	put_snapshot_map(iter->tr);
> > > +}
> > > +
> > > +static void tracing_buffers_mmap_open(struct vm_area_struct *vma) { }
> >
> > Same for the open.
> > > > > > > + > > > +static const struct vm_operations_struct tracing_buffers_vmops = { > > > + .open = tracing_buffers_mmap_open, > > > + .close = tracing_buffers_mmap_close, > > > + .fault = tracing_buffers_mmap_fault, > > > +}; > > > > I replaced this with: > > > > static const struct vm_operations_struct tracing_buffers_vmops = { > > .close = tracing_buffers_mmap_close, > > }; > > > > And it appears to work just fine. The mm code handles the NULL cases for > > .open and .fault. > > > > Is there any reason to do something different than the mm defaults? No other reason here than my own ignorance. I will remove. > > Hi Vincent, > > Do you plan on sending out a v20 series? Of course, let me spin that this week! Got also few typos to fix in the doc and I believe an include missing for riscv. > > -- Steve
Re: [PATCH bpf-next] rethook: Remove warning messages printed for finding return address of a frame.
On 4/2/24 6:58 PM, Andrii Nakryiko wrote: On Mon, Apr 1, 2024 at 12:16 PM Kui-Feng Lee wrote: rethook_find_ret_addr() prints a warning message and returns 0 when the target task is running and not the "current" task to prevent returning an incorrect return address. However, this check is incomplete as the target task can still transition to the running state when finding the return address, although it is safe with RCU. The issue we encounter is that the kernel frequently prints warning messages when BPF profiling programs call to bpf_get_task_stack() on running tasks. The callers should be aware and willing to take the risk of receiving an incorrect return address from a task that is currently running other than the "current" one. A warning is not needed here as the callers are intent on it. Signed-off-by: Kui-Feng Lee --- kernel/trace/rethook.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/kernel/trace/rethook.c b/kernel/trace/rethook.c index fa03094e9e69..4297a132a7ae 100644 --- a/kernel/trace/rethook.c +++ b/kernel/trace/rethook.c @@ -248,7 +248,7 @@ unsigned long rethook_find_ret_addr(struct task_struct *tsk, unsigned long frame if (WARN_ON_ONCE(!cur)) return 0; - if (WARN_ON_ONCE(tsk != current && task_is_running(tsk))) + if (tsk != current && task_is_running(tsk)) return 0; This should probably go through Masami's tree, but the change makes sense to me, given this is an expected condition. Acked-by: Andrii Nakryiko Masami, I assume you'll pick this up? Thanks, Daniel
Re: [PATCH v9 0/9] Initial Marvell PXA1908 support
On Tue, 02 Apr 2024 22:55:36 +0200, Duje Mihanović wrote: > Hello, > > This series adds initial support for the Marvell PXA1908 SoC and > "samsung,coreprimevelte", a smartphone using the SoC. > > USB works and the phone can boot a rootfs from an SD card, but there are > some warnings in the dmesg: > > During SMP initialization: > [0.006519] CPU features: SANITY CHECK: Unexpected variation in > SYS_CNTFRQ_EL0. Boot CPU: 0x00018cba80, CPU1: 0x00 > [0.006542] CPU features: Unsupported CPU feature variation detected. > [0.006589] CPU1: Booted secondary processor 0x01 [0x410fd032] > [0.010710] Detected VIPT I-cache on CPU2 > [0.010716] CPU features: SANITY CHECK: Unexpected variation in > SYS_CNTFRQ_EL0. Boot CPU: 0x00018cba80, CPU2: 0x00 > [0.010758] CPU2: Booted secondary processor 0x02 [0x410fd032] > [0.014849] Detected VIPT I-cache on CPU3 > [0.014855] CPU features: SANITY CHECK: Unexpected variation in > SYS_CNTFRQ_EL0. Boot CPU: 0x00018cba80, CPU3: 0x00 > [0.014895] CPU3: Booted secondary processor 0x03 [0x410fd032] > > SMMU probing fails: > [0.101798] arm-smmu c001.iommu: probing hardware configuration... > [0.101809] arm-smmu c001.iommu: SMMUv1 with: > [0.101816] arm-smmu c001.iommu: no translation support! > > A 3.14 based Marvell tree is available on GitHub > acorn-marvell/brillo_pxa_kernel, and a Samsung one on GitHub > CoderCharmander/g361f-kernel. 
>
> Andreas Färber attempted to upstream support for this SoC in 2017:
> https://lore.kernel.org/lkml/20170222022929.10540-1-afaer...@suse.de/
>
> Signed-off-by: Duje Mihanović
>
> Changes in v9:
> - Update trailers and rebase on v6.9-rc2, no changes
> - Link to v8: https://lore.kernel.org/20240110-pxa1908-lkml-v8-0-fea768a59...@skole.hr
>
> Changes in v8:
> - Drop SSPA patch
> - Drop broken-cd from eMMC node
> - Specify S-Boot hardcoded initramfs location in device tree
> - Add ARM PMU node
> - Correct inverted modem memory base and size
> - Update trailers
> - Rebase on next-20240110
> - Link to v7: https://lore.kernel.org/20231102-pxa1908-lkml-v7-0-cabb1a0cb...@skole.hr
>   and https://lore.kernel.org/20231102152033.5511-1-duje.mihano...@skole.hr
>
> Changes in v7:
> - Suppress SND_MMP_SOC_SSPA on ARM64
> - Update trailers
> - Rebase on v6.6-rc7
> - Link to v6: https://lore.kernel.org/r/20231010-pxa1908-lkml-v6-0-b2fe09240...@skole.hr
>
> Changes in v6:
> - Address maintainer comments:
>   - Add "marvell,pxa1908-padconf" binding to pinctrl-single driver
> - Drop GPIO patch as it's been pulled
> - Update trailers
> - Rebase on v6.6-rc5
> - Link to v5: https://lore.kernel.org/r/20230812-pxa1908-lkml-v5-0-a5d51937e...@skole.hr
>
> Changes in v5:
> - Address maintainer comments:
>   - Move *_NR_CLKS to clock driver from dt binding file
>   - Allocate correct number of clocks for each block instead of blindly
>     allocating 50 for each
> - Link to v4: https://lore.kernel.org/r/20230807-pxa1908-lkml-v4-0-cb387d73b...@skole.hr
>
> Changes in v4:
> - Address maintainer comments:
>   - Relicense clock binding file to BSD-2
>   - Add pinctrl-names to SD card node
>   - Add vgic registers to GIC node
> - Rebase on v6.5-rc5
> - Link to v3: https://lore.kernel.org/r/20230804-pxa1908-lkml-v3-0-8e48fca37...@skole.hr
>
> Changes in v3:
> - Address maintainer comments:
>   - Drop GPIO dynamic allocation patch
>   - Move clock register offsets into driver (instead of bindings file)
> - Add missing Tested-by trailer to u32_fract patch
> - Move SoC binding to arm/mrvl/mrvl.yaml
> - Add serial0 alias and stdout-path to board dts to enable UART
>   debugging
> - Rebase on v6.5-rc4
> - Link to v2: https://lore.kernel.org/r/20230727162909.6031-1-duje.mihano...@skole.hr
>
> Changes in v2:
> - Remove earlycon patch as it's been merged into tty-next
> - Address maintainer comments:
>   - Clarify GPIO regressions on older PXA platforms
>   - Add Fixes tag to commit disabling GPIO pinctrl calls for this SoC
>   - Add missing includes to clock driver
>   - Clock driver uses HZ_PER_MHZ, u32_fract and GENMASK
>   - Dual license clock bindings
>   - Change clock IDs to decimal
>   - Fix underscores in dt node names
>   - Move chosen node to top of board dts
>   - Clean up documentation
>   - Reorder commits
>   - Drop pxa,rev-id
>   - Rename muic-i2c to i2c-muic
>   - Reword some commits
>   - Move framebuffer node to chosen
>   - Add aliases for mmc nodes
> - Rebase on v6.5-rc3
> - Link to v1: https://lore.kernel.org/r/20230721210042.21535-1-duje.mihano...@skole.hr
>
> ---
> Andy Shevchenko (1):
>       clk: mmp: Switch to use struct u32_fract instead of custom one
>
> Duje Mihanović (8):
>       dt-bindings: pinctrl: pinctrl-single: add marvell,pxa1908-padconf compatible
>       pinctrl: single: add marvell,pxa1908-padconf compatible
>       dt-bindings: clock: Add Marvell PXA1908 clock bindings
>
Re: [PATCH v19 RESEND 3/5] tracing: Allow user-space mapping of the ring-buffer
On Fri, 29 Mar 2024 14:40:55 -0400 Steven Rostedt wrote:
> > +static vm_fault_t tracing_buffers_mmap_fault(struct vm_fault *vmf)
> > +{
> > +	return VM_FAULT_SIGBUS;
> > +}
>
> If this is all it does, I don't believe it's needed.
>
> > +
> > +#ifdef CONFIG_TRACER_MAX_TRACE
> > +static int get_snapshot_map(struct trace_array *tr)
> > +{
> > +	int err = 0;
> > +
> > +	/*
> > +	 * Called with mmap_lock held. lockdep would be unhappy if we would now
> > +	 * take trace_types_lock. Instead use the specific
> > +	 * snapshot_trigger_lock.
> > +	 */
> > +	spin_lock(&tr->snapshot_trigger_lock);
> > +
> > +	if (tr->snapshot || tr->mapped == UINT_MAX)
> > +		err = -EBUSY;
> > +	else
> > +		tr->mapped++;
> > +
> > +	spin_unlock(&tr->snapshot_trigger_lock);
> > +
> > +	/* Wait for update_max_tr() to observe iter->tr->mapped */
> > +	if (tr->mapped == 1)
> > +		synchronize_rcu();
> > +
> > +	return err;
> > +
> > +}
> > +static void put_snapshot_map(struct trace_array *tr)
> > +{
> > +	spin_lock(&tr->snapshot_trigger_lock);
> > +	if (!WARN_ON(!tr->mapped))
> > +		tr->mapped--;
> > +	spin_unlock(&tr->snapshot_trigger_lock);
> > +}
> > +#else
> > +static inline int get_snapshot_map(struct trace_array *tr) { return 0; }
> > +static inline void put_snapshot_map(struct trace_array *tr) { }
> > +#endif
> > +
> > +static void tracing_buffers_mmap_close(struct vm_area_struct *vma)
> > +{
> > +	struct ftrace_buffer_info *info = vma->vm_file->private_data;
> > +	struct trace_iterator *iter = &info->iter;
> > +
> > +	WARN_ON(ring_buffer_unmap(iter->array_buffer->buffer, iter->cpu_file));
> > +	put_snapshot_map(iter->tr);
> > +}
> > +
> > +static void tracing_buffers_mmap_open(struct vm_area_struct *vma) { }
>
> Same for the open.
> >
> > +
> > +static const struct vm_operations_struct tracing_buffers_vmops = {
> > +	.open		= tracing_buffers_mmap_open,
> > +	.close		= tracing_buffers_mmap_close,
> > +	.fault		= tracing_buffers_mmap_fault,
> > +};
>
> I replaced this with:
>
> static const struct vm_operations_struct tracing_buffers_vmops = {
> 	.close		= tracing_buffers_mmap_close,
> };
>
> And it appears to work just fine. The mm code handles the NULL cases for
> .open and .fault.
>
> Is there any reason to do something different than the mm defaults?

Hi Vincent,

Do you plan on sending out a v20 series?

-- Steve
Re: [PATCHv2 1/3] uprobe: Add uretprobe syscall to speed up return probe
On Wed, 3 Apr 2024 11:47:41 +0200 Jiri Olsa wrote: > On Wed, Apr 03, 2024 at 10:07:08AM +0900, Masami Hiramatsu wrote: > > Hi Jiri, > > > > On Tue, 2 Apr 2024 11:33:00 +0200 > > Jiri Olsa wrote: > > > > > Adding uretprobe syscall instead of trap to speed up return probe. > > > > This is interesting approach. But I doubt we need to add additional > > syscall just for this purpose. Can't we use another syscall or ioctl? > > so the plan is to optimize entry uprobe in a similar way and given > the syscall is not a scarce resource I wanted to add another syscall > for that one as well > > tbh I'm not sure sure which syscall or ioctl to reuse for this, it's > possible to do that, the trampoline will just have to save one or > more additional registers, but adding new syscall seems cleaner to me Hmm, I think a similar syscall is ptrace? prctl may also be a candidate. > > > > > Also, we should run syzkaller on this syscall. And if uretprobe is > > right, I'll check on syzkaller > > > set in the user function, what happen if the user function directly > > calls this syscall? (maybe it consumes shadow stack?) > > the process should receive SIGILL if there's no pending uretprobe for > the current task, or it will trigger uretprobe if there's one pending No, that is too aggressive and not safe. Since the syscall is exposed to user program, it should return appropriate error code instead of SIGILL. > > but we could limit the syscall to be executed just from the trampoline, > that should prevent all the user space use cases, I'll do that in next > version and add more tests for that Why not limit? :) The uprobe_handle_trampoline() expects it is called only from the trampoline, so it is natural to check the caller address. (and uprobe should know where is the trampoline) Since the syscall is always exposed to the user program, it should - Do nothing and return an error unless it is properly called. - check the prerequisites for operation strictly. 
I concern that new system calls introduce vulnerabilities. Thank you, > > thanks, > jirka > > > > > > Thank you, > > > > > > > > At the moment the uretprobe setup/path is: > > > > > > - install entry uprobe > > > > > > - when the uprobe is hit, it overwrites probed function's return address > > > on stack with address of the trampoline that contains breakpoint > > > instruction > > > > > > - the breakpoint trap code handles the uretprobe consumers execution and > > > jumps back to original return address > > > > > > This patch replaces the above trampoline's breakpoint instruction with new > > > ureprobe syscall call. This syscall does exactly the same job as the trap > > > with some more extra work: > > > > > > - syscall trampoline must save original value for rax/r11/rcx registers > > > on stack - rax is set to syscall number and r11/rcx are changed and > > > used by syscall instruction > > > > > > - the syscall code reads the original values of those registers and > > > restore those values in task's pt_regs area > > > > > > Even with the extra registers handling code the having uretprobes handled > > > by syscalls shows speed improvement. 
> > > > > > On Intel (11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz) > > > > > > current: > > > > > > base : 15.888 ± 0.033M/s > > > uprobe-nop :3.016 ± 0.000M/s > > > uprobe-push:2.832 ± 0.005M/s > > > uprobe-ret :1.104 ± 0.000M/s > > > uretprobe-nop :1.487 ± 0.000M/s > > > uretprobe-push :1.456 ± 0.000M/s > > > uretprobe-ret :0.816 ± 0.001M/s > > > > > > with the fix: > > > > > > base : 15.116 ± 0.045M/s > > > uprobe-nop :3.001 ± 0.045M/s > > > uprobe-push:2.831 ± 0.004M/s > > > uprobe-ret :1.102 ± 0.001M/s > > > uretprobe-nop :1.969 ± 0.001M/s < 32% speedup > > > uretprobe-push :1.905 ± 0.004M/s < 30% speedup > > > uretprobe-ret :0.933 ± 0.002M/s < 14% speedup > > > > > > On Amd (AMD Ryzen 7 5700U) > > > > > > current: > > > > > > base :5.105 ± 0.003M/s > > > uprobe-nop :1.552 ± 0.002M/s > > > uprobe-push:1.408 ± 0.003M/s > > > uprobe-ret :0.827 ± 0.001M/s > > > uretprobe-nop :0.779 ± 0.001M/s > > > uretprobe-push :0.750 ± 0.001M/s > > > uretprobe-ret :0.539 ± 0.001M/s > > > > > > with the fix: > > > > > > base :5.119 ± 0.002M/s > > > uprobe-nop :1.523 ± 0.003M/s > > > uprobe-push:1.384 ± 0.003M/s > > > uprobe-ret :0.826 ± 0.002M/s > > > uretprobe-nop :0.866 ± 0.002M/s < 11% speedup > > > uretprobe-push :0.826 ± 0.002M/s < 10% speedup > > > uretprobe-ret :0.581 ± 0.001M/s < 7% speedup > > > > > > Reviewed-by: Oleg Nesterov > > > Suggested-by: Andrii Nakryiko > > > Acked-by: Andrii Nakryiko > > > Signed-off-by: Oleg
Re: [PATCHv2 1/3] uprobe: Add uretprobe syscall to speed up return probe
I leave this to you and Masami, but... On 04/03, Jiri Olsa wrote: > > On Wed, Apr 03, 2024 at 10:07:08AM +0900, Masami Hiramatsu wrote: > > > > This is interesting approach. But I doubt we need to add additional > > syscall just for this purpose. Can't we use another syscall or ioctl? > > so the plan is to optimize entry uprobe in a similar way and given > the syscall is not a scarce resource I wanted to add another syscall > for that one as well > > tbh I'm not sure sure which syscall or ioctl to reuse for this, it's > possible to do that, the trampoline will just have to save one or > more additional registers, but adding new syscall seems cleaner to me Agreed. > > Also, we should run syzkaller on this syscall. And if uretprobe is > > right, I'll check on syzkaller I don't understand this concern... > > set in the user function, what happen if the user function directly > > calls this syscall? (maybe it consumes shadow stack?) > > the process should receive SIGILL if there's no pending uretprobe for > the current task, Yes, > or it will trigger uretprobe if there's one pending ... and corrupt the caller. So what? > but we could limit the syscall to be executed just from the trampoline, > that should prevent all the user space use cases, I'll do that in next > version and add more tests for that Yes, we can... well, ignoring the race with mremap() from another thread. But why should we care? Userspace should not call sys_uretprobe(). Likewise, it should not call sys_restart_syscall(). Likewise, it should not jump to xol_area. Of course, userspace (especially syzkaller) _can_ do this. So what? I think the only thing we need to ensure is that the "malicious" task which calls sys_uretprobe() can only harm itself, nothing more. No? Oleg.
Re: [PATCH] ftrace: make extra rcu_is_watching() validation check optional
On Tue, 2 Apr 2024 22:21:00 -0700 Andrii Nakryiko wrote: > > I just checked our fleet-wide production data for the last 24 hours. > > Within the kprobe/kretprobe code path (ftrace_trampoline and > > everything called from it), rcu_is_watching (both calls, see below) > > cause 0.484% CPU cycles usage, which isn't nothing. So definitely we'd > > prefer to be able to avoid that in production use cases. > > > > I just ran synthetic microbenchmark testing multi-kretprobe > throughput. We get (in millions of BPF kretprobe-multi program > invocations per second): > - 5.568M/s as baseline; > - 5.679M/s with changes in this patch (+2% throughput improvement); > - 5.808M/s with disabling rcu_is_watching in rethook_try_get() > (+2.3% more vs just one of rcu_is_watching, and +4.3% vs baseline). > > It's definitely noticeable. Ah, thanks for verifying (I should have read the thread before replying to the previous email). -- Steve
Re: [PATCH] ftrace: make extra rcu_is_watching() validation check optional
On Tue, 2 Apr 2024 21:00:19 -0700 Andrii Nakryiko wrote: > I just noticed another rcu_is_watching() call, in rethook_try_get(), > which seems to be a similar and complementary validation check to the > one we are putting under CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING option > in this patch. It feels like both of them should be controlled by the > same settings. WDYT? Can I add CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING > guard around rcu_is_watching() check in rethook_try_get() as well? That is totally up to Masami. It may have even less overhead as I'm not sure how many times that gets called, and there may be more work to do than with function tracing. -- Steve
Re: [PATCH v10 08/14] x86/sgx: Add basic EPC reclamation flow for cgroup
On Wed, 2024-03-27 at 17:22 -0700, Haitao Huang wrote: > From: Kristen Carlson Accardi > > When a cgroup usage reaches its limit, and it is to be charged, i.e., > sgx_cgroup_try_charge() called for new allocations, the cgroup needs to > reclaim pages from its LRU or LRUs of its descendants to make room for > any new allocations. This patch adds the basic building block for the > per-cgroup reclamation flow and use it for synchronous reclamation in > sgx_cgroup_try_charge(). It's better to firstly mention _why_ we need this first: Currently in the EPC page allocation, the kernel simply fails the allocation when the current EPC cgroup fails to charge due to its usage reaching limit. This is not ideal. When that happens, a better way is to reclaim EPC page(s) from the current EPC cgroup (and/or its descendants) to reduce its usage so the new allocation can succeed. Add the basic building blocks to support the per-cgroup reclamation flow ... > > First, modify sgx_reclaim_pages() to let callers to pass in the LRU from > which pages are reclaimed, so it can be reused by both the global and > cgroup reclaimers. Also return the number of pages attempted, so a > cgroup reclaimer can use it to track reclamation progress from its > descendants. IMHO you are jumping too fast to the implementation details. Better to have some more background: " Currently the kernel only has one place to reclaim EPC pages: the global EPC LRU list. To support the "per-cgroup" EPC reclaim, maintain an LRU list for each EPC cgroup, and introduce a "cgroup" variant function to reclaim EPC page(s) from a given EPC cgroup (and its descendants). " > > For the global reclaimer, replace all call sites of sgx_reclaim_pages() > with calls to a newly created wrapper, sgx_reclaim_pages_global(), which > just calls sgx_reclaim_pages() with the global LRU passed in. > > For cgroup reclamation, implement a basic reclamation flow, encapsulated > in the top-level function, sgx_cgroup_reclaim_pages(). 
It performs a > pre-order walk on a given cgroup subtree, and calls sgx_reclaim_pages() > at each node passing in the LRU of that node. It keeps track of total > attempted pages and stops the walk if desired number of pages are > attempted. Then it's time to jump to implementation details: " Currently the kernel does the global EPC reclaim in sgx_reclaim_page(). It always tries to reclaim EPC pages in batch of SGX_NR_TO_SCAN (16) pages. Specifically, it always "scans", or "isolates" SGX_NR_TO_SCAN pages from the global LRU, and then tries to reclaim these pages at once for better performance. Use similar way to implement the "cgroup" variant EPC reclaim, but keep the implementation simple: 1) change sgx_reclaim_pages() to take an LRU as input, and return the pages that are "scanned" (but not actually reclaimed); 2) loop the given EPC cgroup and its descendants and do the new sgx_reclaim_pages() until SGX_NR_TO_SCAN pages are "scanned". This implementation always tries to reclaim SGX_NR_TO_SCAN pages from the LRU of the given EPC cgroup, and only moves to its descendants when there's no enough reclaimable EPC pages to "scan" in its LRU. It should be enough for most cases. " Then I think it's better to explain why "alternatives" are not chosen: " Note, this simple implementation doesn't _exactly_ mimic the current global EPC reclaim (which always tries to do the actual reclaim in batch of SGX_NR_TO_SCAN pages): when LRUs have less than SGX_NR_TO_SCAN reclaimable pages, the actual reclaim of EPC pages will be split into smaller batches _across_ multiple LRUs with each being smaller than SGX_NR_TO_SCAN pages. A more precise way to mimic the current global EPC reclaim would be to have a new function to only "scan" (or "isolate") SGX_NR_TO_SCAN pages _across_ the given EPC cgroup _AND_ its descendants, and then do the actual reclaim in one batch. But this is unnecessarily complicated at this stage. 
Alternatively, the current sgx_reclaim_pages() could be changed to return the actual "reclaimed" pages, but not "scanned" pages. However this solution also has cons: ..."

I recall you mentioned "unable to control latency of each reclaim" etc., but IIUC one could be:

	This approach may result in a higher chance of "reclaiming EPC
	pages from descendants but not the root/given EPC cgroup", e.g.,
	when all EPC pages in the root EPC cgroup are young while those
	in its descendants are not. This may not be desired.

Makes sense?

> > Finally, pass a parameter to sgx_cgroup_try_charge() to indicate whether
> > a synchronous reclamation is allowed. If the caller allows and cgroup
> > usage is at its limit, trigger the synchronous reclamation by calling
> > sgx_cgroup_reclaim_pages() in a loop with cond_resched() in between
> > iterations.

This isn't needed IMHO as you can easily see in the code, and there's no "design choices" here.

General rule: focus on explaining "why", and "design choices", but not implementation details, which can be seen in the
Re: [PATCH net v3] virtio_net: Do not send RSS key if it is not supported
On Sun, Mar 31, 2024 at 04:20:30PM -0400, Michael S. Tsirkin wrote:
> On Fri, Mar 29, 2024 at 10:16:41AM -0700, Breno Leitao wrote:
> > @@ -3814,13 +3815,24 @@ static int virtnet_set_rxfh(struct net_device *dev,
> >  		return -EOPNOTSUPP;
> >
> >  	if (rxfh->indir) {
> > +		if (!vi->has_rss)
> > +			return -EOPNOTSUPP;
> > +
> >  		for (i = 0; i < vi->rss_indir_table_size; ++i)
> >  			vi->ctrl->rss.indirection_table[i] = rxfh->indir[i];
> > +		update = true;
> >  	}
> > -	if (rxfh->key)
> > +
> > +	if (rxfh->key) {
> > +		if (!vi->has_rss && !vi->has_rss_hash_report)
> > +			return -EOPNOTSUPP;
>
> What's the logic here? Is it || or &&? A comment can't hurt.

If rxfh carries a key, then the device needs to have either the has_rss
or the has_rss_hash_report "feature". These are basically the virtio
features VIRTIO_NET_F_HASH_REPORT and VIRTIO_NET_F_RSS that are set at
virtio_probe.

I will add the comment and respin the series.
Re: [PATCH] uprobes: reduce contention on uprobes_tree access
> > > > Given the discussion around per-cpu rw semaphore and need for > > > > (internal) batched attachment API for uprobes, do you think you can > > > > apply this patch as is for now? We can then gain initial improvements > > > > in scalability that are also easy to backport, and Jonathan will work > > > > on a more complete solution based on per-cpu RW semaphore, as > > > > suggested by Ingo. > > > > > > Yeah, it is interesting to use per-cpu rw semaphore on uprobe. > > > I would like to wait for the next version. > > > > My initial tests show a nice improvement on the over RW spinlocks but > > significant regression in acquiring a write lock. I've got a few days > > vacation over Easter but I'll aim to get some more formalised results out > > to the thread toward the end of next week. > > As far as the write lock is only on the cold path, I think you can choose > per-cpu RW semaphore. Since it does not do busy wait, the total system > performance impact will be small. > I look forward to your formalized results :) Sorry for the delay in getting back to you on this Masami. I have used one of the bpf selftest benchmarks to provide some form of comparison of the 3 different approaches (spinlock, RW spinlock and per-cpu RW semaphore). The benchmark used here is the 'trig-uprobe-nop' benchmark which just executes a single uprobe with a minimal bpf program attached. The tests were done on a 32 core qemu/kvm instance. Things to note about the results: - The results are slightly variable so don't get too caught up on individual thread count - it's the trend that is important. - In terms of throughput with this specific benchmark a *very* macro view is that the RW spinlock provides 40-60% more throughput than the spinlock. The per-CPU RW semaphore provides in the order of 50-100% more throughput then the spinlock. - This doesn't fully reflect the large reduction in latency that we have seen in application based measurements. 
However, it does demonstrate that even the trivial change of going to a
RW spinlock provides significant benefits.

I haven't included the measurements on per-CPU RW semaphore write
performance as they are completely in line with those that Paul McKenney
posted on his journal [0]. On a 32 core system I see semaphore writes
take in the order of 25-28 millisecs - the cost of the
synchronize_rcu().

Each block of results below shows 1 line per execution of the benchmark
(the "Summary" line) and each line is a run with one more thread added -
a thread is a "producer". The lines are edited to remove extraneous
output that adds no value here.

The tests were executed with this driver script:

  for num_threads in {1..20}
  do
    sudo ./bench -p $num_threads trig-uprobe-nop | grep Summary
  done

spinlock

  Summary: hits 1.453 ± 0.005M/s ( 1.453M/prod)
  Summary: hits 2.087 ± 0.005M/s ( 1.043M/prod)
  Summary: hits 2.701 ± 0.012M/s ( 0.900M/prod)
  Summary: hits 1.917 ± 0.011M/s ( 0.479M/prod)
  Summary: hits 2.105 ± 0.003M/s ( 0.421M/prod)
  Summary: hits 1.615 ± 0.006M/s ( 0.269M/prod)
  Summary: hits 2.046 ± 0.004M/s ( 0.292M/prod)
  Summary: hits 1.818 ± 0.002M/s ( 0.227M/prod)
  Summary: hits 1.867 ± 0.024M/s ( 0.207M/prod)
  Summary: hits 1.692 ± 0.003M/s ( 0.169M/prod)
  Summary: hits 1.778 ± 0.004M/s ( 0.162M/prod)
  Summary: hits 1.710 ± 0.011M/s ( 0.142M/prod)
  Summary: hits 1.687 ± 0.022M/s ( 0.130M/prod)
  Summary: hits 1.697 ± 0.004M/s ( 0.121M/prod)
  Summary: hits 1.645 ± 0.011M/s ( 0.110M/prod)
  Summary: hits 1.671 ± 0.002M/s ( 0.104M/prod)
  Summary: hits 1.692 ± 0.014M/s ( 0.100M/prod)
  Summary: hits 1.700 ± 0.015M/s ( 0.094M/prod)
  Summary: hits 1.668 ± 0.005M/s ( 0.088M/prod)
  Summary: hits 1.644 ± 0.004M/s ( 0.082M/prod)

RW spinlock

  Summary: hits 1.465 ± 0.008M/s ( 1.465M/prod)
  Summary: hits 1.750 ± 0.035M/s ( 0.875M/prod)
  Summary: hits 2.164 ± 0.008M/s ( 0.721M/prod)
  Summary: hits 2.235 ± 0.014M/s ( 0.559M/prod)
  Summary: hits 2.202 ± 0.005M/s ( 0.440M/prod)
  Summary: hits 2.816 ± 0.003M/s ( 0.469M/prod)
  Summary: hits 2.364 ± 0.002M/s ( 0.338M/prod)
  Summary: hits 2.327 ± 0.008M/s ( 0.291M/prod)
  Summary: hits 2.147 ± 0.005M/s ( 0.239M/prod)
  Summary: hits 2.266 ± 0.011M/s ( 0.227M/prod)
  Summary: hits 2.483 ± 0.003M/s ( 0.226M/prod)
  Summary: hits 2.573 ± 0.008M/s ( 0.214M/prod)
  Summary: hits 2.569 ± 0.004M/s ( 0.198M/prod)
  Summary: hits 2.507 ± 0.013M/s ( 0.179M/prod)
  Summary: hits 2.165 ± 0.008M/s ( 0.144M/prod)
  Summary: hits 2.524 ± 0.004M/s ( 0.158M/prod)
  Summary: hits 2.059 ± 0.020M/s ( 0.121M/prod)
  Summary: hits 2.332 ± 0.007M/s ( 0.130M/prod)
  Summary: hits 2.404 ± 0.017M/s ( 0.127M/prod)
  Summary: hits 2.187 ± 0.002M/s ( 0.109M/prod)

per-CPU RW semaphore

  Summary: hits 1.341 ± 0.017M/s ( 1.341M/prod)
  Summary: hits 2.397 ± 0.011M/s ( 1.198M/prod)
  Summary: hits 3.547 ± 0.041M/s ( 1.182M/prod)
  Summary: hits 4.108
Re: [PATCH] vhost-vdpa: change ioctl # for VDPA_GET_VRING_SIZE
On Tue, Apr 02, 2024 at 05:21:39PM -0400, Michael S. Tsirkin wrote:
> VDPA_GET_VRING_SIZE by mistake uses the already occupied ioctl # 0x80
> and we never noticed - it happens to work because the direction and
> size are different, but confuses tools such as perf which like to look
> at just the number, and breaks the extra robustness of the ioctl
> numbering macros.
>
> To fix, sort the entries and renumber the ioctl - not too late since it
> wasn't in any released kernels yet.
>
> Cc: Arnaldo Carvalho de Melo
> Reported-by: Namhyung Kim
> Fixes: 1496c47065f9 ("vhost-vdpa: uapi to support reporting per vq size")
> Cc: "Zhu Lingshan"
> Signed-off-by: Michael S. Tsirkin
> ---
>
> Build tested only - userspace patches using this will have to adjust.
> I will merge this in a week or so unless I hear otherwise, and
> afterwards perf can update their header.

Fortunately, we haven't released any kernels with this yet, right?
(other than v6.9-rc*)

LGTM:

Reviewed-by: Stefano Garzarella

>  include/uapi/linux/vhost.h | 15 ++++++++-------
>  1 file changed, 8 insertions(+), 7 deletions(-)
>
> diff --git a/include/uapi/linux/vhost.h b/include/uapi/linux/vhost.h
> index bea697390613..b95dd84eef2d 100644
> --- a/include/uapi/linux/vhost.h
> +++ b/include/uapi/linux/vhost.h
> @@ -179,12 +179,6 @@
>  /* Get the config size */
>  #define VHOST_VDPA_GET_CONFIG_SIZE	_IOR(VHOST_VIRTIO, 0x79, __u32)
>
> -/* Get the count of all virtqueues */
> -#define VHOST_VDPA_GET_VQS_COUNT	_IOR(VHOST_VIRTIO, 0x80, __u32)
> -
> -/* Get the number of virtqueue groups. */
> -#define VHOST_VDPA_GET_GROUP_NUM	_IOR(VHOST_VIRTIO, 0x81, __u32)
> -
>  /* Get the number of address spaces. */
>  #define VHOST_VDPA_GET_AS_NUM		_IOR(VHOST_VIRTIO, 0x7A, unsigned int)
>
> @@ -228,10 +222,17 @@
>  #define VHOST_VDPA_GET_VRING_DESC_GROUP	_IOWR(VHOST_VIRTIO, 0x7F,	\
>  					      struct vhost_vring_state)
>
> +/* Get the count of all virtqueues */
> +#define VHOST_VDPA_GET_VQS_COUNT	_IOR(VHOST_VIRTIO, 0x80, __u32)
> +
> +/* Get the number of virtqueue groups. */
> +#define VHOST_VDPA_GET_GROUP_NUM	_IOR(VHOST_VIRTIO, 0x81, __u32)
> +
>  /* Get the queue size of a specific virtqueue.
>   * userspace set the vring index in vhost_vring_state.index
>   * kernel set the queue size in vhost_vring_state.num
>   */
> -#define VHOST_VDPA_GET_VRING_SIZE	_IOWR(VHOST_VIRTIO, 0x80,	\
> +#define VHOST_VDPA_GET_VRING_SIZE	_IOWR(VHOST_VIRTIO, 0x82,	\
>  					      struct vhost_vring_state)
>
>  #endif

-- MST
Re: [PATCHv2 1/3] uprobe: Add uretprobe syscall to speed up return probe
On Wed, Apr 03, 2024 at 10:07:08AM +0900, Masami Hiramatsu wrote: > Hi Jiri, > > On Tue, 2 Apr 2024 11:33:00 +0200 > Jiri Olsa wrote: > > > Adding uretprobe syscall instead of trap to speed up return probe. > > This is interesting approach. But I doubt we need to add additional > syscall just for this purpose. Can't we use another syscall or ioctl? so the plan is to optimize entry uprobe in a similar way and given the syscall is not a scarce resource I wanted to add another syscall for that one as well tbh I'm not sure sure which syscall or ioctl to reuse for this, it's possible to do that, the trampoline will just have to save one or more additional registers, but adding new syscall seems cleaner to me > > Also, we should run syzkaller on this syscall. And if uretprobe is right, I'll check on syzkaller > set in the user function, what happen if the user function directly > calls this syscall? (maybe it consumes shadow stack?) the process should receive SIGILL if there's no pending uretprobe for the current task, or it will trigger uretprobe if there's one pending but we could limit the syscall to be executed just from the trampoline, that should prevent all the user space use cases, I'll do that in next version and add more tests for that thanks, jirka > > Thank you, > > > > > At the moment the uretprobe setup/path is: > > > > - install entry uprobe > > > > - when the uprobe is hit, it overwrites probed function's return address > > on stack with address of the trampoline that contains breakpoint > > instruction > > > > - the breakpoint trap code handles the uretprobe consumers execution and > > jumps back to original return address > > > > This patch replaces the above trampoline's breakpoint instruction with new > > ureprobe syscall call. 
This syscall does exactly the same job as the trap > > with some more extra work: > > > > - syscall trampoline must save original value for rax/r11/rcx registers > > on stack - rax is set to syscall number and r11/rcx are changed and > > used by syscall instruction > > > > - the syscall code reads the original values of those registers and > > restore those values in task's pt_regs area > > > > Even with the extra registers handling code the having uretprobes handled > > by syscalls shows speed improvement. > > > > On Intel (11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz) > > > > current: > > > > base : 15.888 ± 0.033M/s > > uprobe-nop :3.016 ± 0.000M/s > > uprobe-push:2.832 ± 0.005M/s > > uprobe-ret :1.104 ± 0.000M/s > > uretprobe-nop :1.487 ± 0.000M/s > > uretprobe-push :1.456 ± 0.000M/s > > uretprobe-ret :0.816 ± 0.001M/s > > > > with the fix: > > > > base : 15.116 ± 0.045M/s > > uprobe-nop :3.001 ± 0.045M/s > > uprobe-push:2.831 ± 0.004M/s > > uprobe-ret :1.102 ± 0.001M/s > > uretprobe-nop :1.969 ± 0.001M/s < 32% speedup > > uretprobe-push :1.905 ± 0.004M/s < 30% speedup > > uretprobe-ret :0.933 ± 0.002M/s < 14% speedup > > > > On Amd (AMD Ryzen 7 5700U) > > > > current: > > > > base :5.105 ± 0.003M/s > > uprobe-nop :1.552 ± 0.002M/s > > uprobe-push:1.408 ± 0.003M/s > > uprobe-ret :0.827 ± 0.001M/s > > uretprobe-nop :0.779 ± 0.001M/s > > uretprobe-push :0.750 ± 0.001M/s > > uretprobe-ret :0.539 ± 0.001M/s > > > > with the fix: > > > > base :5.119 ± 0.002M/s > > uprobe-nop :1.523 ± 0.003M/s > > uprobe-push:1.384 ± 0.003M/s > > uprobe-ret :0.826 ± 0.002M/s > > uretprobe-nop :0.866 ± 0.002M/s < 11% speedup > > uretprobe-push :0.826 ± 0.002M/s < 10% speedup > > uretprobe-ret :0.581 ± 0.001M/s < 7% speedup > > > > Reviewed-by: Oleg Nesterov > > Suggested-by: Andrii Nakryiko > > Acked-by: Andrii Nakryiko > > Signed-off-by: Oleg Nesterov > > Signed-off-by: Jiri Olsa > > --- > > arch/x86/entry/syscalls/syscall_64.tbl | 1 + > > arch/x86/kernel/uprobes.c | 83 ++ > > 
include/linux/syscalls.h | 2 + > > include/linux/uprobes.h| 2 + > > include/uapi/asm-generic/unistd.h | 5 +- > > kernel/events/uprobes.c| 18 -- > > kernel/sys_ni.c| 2 + > > 7 files changed, 108 insertions(+), 5 deletions(-) > > > > diff --git a/arch/x86/entry/syscalls/syscall_64.tbl > > b/arch/x86/entry/syscalls/syscall_64.tbl > > index 7e8d46f4147f..af0a33ab06ee 100644 > > --- a/arch/x86/entry/syscalls/syscall_64.tbl > > +++ b/arch/x86/entry/syscalls/syscall_64.tbl > > @@ -383,6 +383,7 @@ > > 459common lsm_get_self_attr sys_lsm_get_self_attr > > 460common lsm_set_self_attr sys_lsm_set_self_attr > > 461common lsm_list_modules
[PATCH 06/34] tracing: hide unused ftrace_event_id_fops
From: Arnd Bergmann

When CONFIG_PERF_EVENTS is disabled, a 'make W=1' build produces a
warning about the unused ftrace_event_id_fops variable:

kernel/trace/trace_events.c:2155:37: error: 'ftrace_event_id_fops' defined but not used [-Werror=unused-const-variable=]
 2155 | static const struct file_operations ftrace_event_id_fops = {

Hide this in the same #ifdef as the reference to it.

Fixes: 620a30e97feb ("tracing: Don't pass file_operations array to event_create_dir()")
Signed-off-by: Arnd Bergmann
---
 kernel/trace/trace_events.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/kernel/trace/trace_events.c b/kernel/trace/trace_events.c
index 7c364b87352e..52f75c36bbca 100644
--- a/kernel/trace/trace_events.c
+++ b/kernel/trace/trace_events.c
@@ -1670,6 +1670,7 @@ static int trace_format_open(struct inode *inode, struct file *file)
 	return 0;
 }
 
+#ifdef CONFIG_PERF_EVENTS
 static ssize_t
 event_id_read(struct file *filp, char __user *ubuf, size_t cnt, loff_t *ppos)
 {
@@ -1684,6 +1685,7 @@ event_id_read(struct file *filp, char __user *ubuf, size_t cnt, loff_t *ppos)
 	return simple_read_from_buffer(ubuf, cnt, ppos, buf, len);
 }
+#endif
 
 static ssize_t
 event_filter_read(struct file *filp, char __user *ubuf, size_t cnt,
@@ -2152,10 +2154,12 @@ static const struct file_operations ftrace_event_format_fops = {
 	.release = seq_release,
 };
 
+#ifdef CONFIG_PERF_EVENTS
 static const struct file_operations ftrace_event_id_fops = {
 	.read = event_id_read,
 	.llseek = default_llseek,
 };
+#endif
 
 static const struct file_operations ftrace_event_filter_fops = {
 	.open = tracing_open_file_tr,
-- 
2.39.2
Re: [PATCH net v3] virtio_net: Do not send RSS key if it is not supported
On 2024/4/1 4:20 AM, Michael S. Tsirkin wrote: On Fri, Mar 29, 2024 at 10:16:41AM -0700, Breno Leitao wrote: There is a bug when setting the RSS options in virtio_net that can break the whole machine, getting the kernel into an infinite loop. Running the following command in any QEMU virtual machine with virtio-net will reproduce this problem: # ethtool -X eth0 hfunc toeplitz This is how the problem happens: 1) ethtool_set_rxfh() calls virtnet_set_rxfh() 2) virtnet_set_rxfh() calls virtnet_commit_rss_command() 3) virtnet_commit_rss_command() populates 4 entries for the rss scatter-gather 4) Since the command above does not have a key, the last scatter-gather entry will be zeroed, since rss_key_size == 0. sg_buf_size = vi->rss_key_size; 5) This buffer is passed to QEMU, but QEMU is not happy with a zero-length buffer, and does the following in virtqueue_map_desc() (QEMU function): if (!sz) { virtio_error(vdev, "virtio: zero sized buffers are not allowed"); 6) virtio_error() (also a QEMU function) sets the device as broken vdev->broken = true; 7) QEMU bails out and does not respond to the kernel anymore. 8) The kernel is waiting for the response to come back (function virtnet_send_command()) 9) The kernel waits, doing the following: while (!virtqueue_get_buf(vi->cvq, ) && !virtqueue_is_broken(vi->cvq)) cpu_relax(); 10) Neither of the functions above returns true, thus the kernel loops here forever. Keep in mind that virtqueue_is_broken() does not look at QEMU's `vdev->broken`, so it never realizes that virtio is broken on the QEMU side. Fix it by not sending RSS commands if the feature is not available in the device. Fixes: c7114b1249fa ("drivers/net/virtio_net: Added basic RSS support.") Cc: sta...@vger.kernel.org net has its own stable process, don't CC stable on net patches.
Cc: qemu-de...@nongnu.org Signed-off-by: Breno Leitao --- Changelog: V2: * Moved from creating a valid packet, by rejecting the request completely V3: * Got some good feedback from and Xuan Zhuo and Heng Qi, and reworked the rejection path. --- drivers/net/virtio_net.c | 22 ++ 1 file changed, 18 insertions(+), 4 deletions(-) diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c index c22d1118a133..c4a21ec51adf 100644 --- a/drivers/net/virtio_net.c +++ b/drivers/net/virtio_net.c @@ -3807,6 +3807,7 @@ static int virtnet_set_rxfh(struct net_device *dev, struct netlink_ext_ack *extack) { struct virtnet_info *vi = netdev_priv(dev); + bool update = false; int i; if (rxfh->hfunc != ETH_RSS_HASH_NO_CHANGE && @@ -3814,13 +3815,24 @@ static int virtnet_set_rxfh(struct net_device *dev, return -EOPNOTSUPP; if (rxfh->indir) { + if (!vi->has_rss) + return -EOPNOTSUPP; + for (i = 0; i < vi->rss_indir_table_size; ++i) vi->ctrl->rss.indirection_table[i] = rxfh->indir[i]; + update = true; } - if (rxfh->key) + + if (rxfh->key) { + if (!vi->has_rss && !vi->has_rss_hash_report) + return -EOPNOTSUPP; What's the logic here? Is it || or &&? A comment can't hurt. && Hi Breno, You can add a comment like the following: If either _F_HASH_REPORT or _F_RSS are negotiated, the device provides hash calculation capabilities, that is, hash_key can be configured. 
Regards, Heng + memcpy(vi->ctrl->rss.key, rxfh->key, vi->rss_key_size); + update = true; + } - virtnet_commit_rss_command(vi); + if (update) + virtnet_commit_rss_command(vi); return 0; } @@ -4729,13 +4741,15 @@ static int virtnet_probe(struct virtio_device *vdev) if (virtio_has_feature(vdev, VIRTIO_NET_F_HASH_REPORT)) vi->has_rss_hash_report = true; - if (virtio_has_feature(vdev, VIRTIO_NET_F_RSS)) + if (virtio_has_feature(vdev, VIRTIO_NET_F_RSS)) { vi->has_rss = true; - if (vi->has_rss || vi->has_rss_hash_report) { vi->rss_indir_table_size = virtio_cread16(vdev, offsetof(struct virtio_net_config, rss_max_indirection_table_length)); + } + + if (vi->has_rss || vi->has_rss_hash_report) { vi->rss_key_size = virtio_cread8(vdev, offsetof(struct virtio_net_config, rss_max_key_size)); -- 2.43.0
[PATCH net-next 6/6] rstreason: make it work in trace world
From: Jason Xing At last, we should let it work by introducing this reset reason in trace world. One of the possible expected outputs is: ... tcp_send_reset: skbaddr=xxx skaddr=xxx src=xxx dest=xxx state=TCP_ESTABLISHED reason=NOT_SPECIFIED Signed-off-by: Jason Xing --- include/trace/events/tcp.h | 37 + net/ipv4/tcp_ipv4.c| 2 +- net/ipv4/tcp_output.c | 2 +- net/ipv6/tcp_ipv6.c| 2 +- 4 files changed, 36 insertions(+), 7 deletions(-) diff --git a/include/trace/events/tcp.h b/include/trace/events/tcp.h index 5c04a61a11c2..9bed9e63c9c5 100644 --- a/include/trace/events/tcp.h +++ b/include/trace/events/tcp.h @@ -11,6 +11,7 @@ #include #include #include +#include /* * tcp event with arguments sk and skb @@ -74,20 +75,38 @@ DEFINE_EVENT(tcp_event_sk_skb, tcp_retransmit_skb, TP_ARGS(sk, skb) ); +#undef FN1 +#define FN1(reason)TRACE_DEFINE_ENUM(SK_RST_REASON_##reason); +#undef FN2 +#define FN2(reason)TRACE_DEFINE_ENUM(SKB_DROP_REASON_##reason); +DEFINE_RST_REASON(FN1, FN1) + +#undef FN1 +#undef FNe1 +#define FN1(reason){ SK_RST_REASON_##reason, #reason }, +#define FNe1(reason) { SK_RST_REASON_##reason, #reason } + +#undef FN2 +#undef FNe2 +#define FN2(reason){ SKB_DROP_REASON_##reason, #reason }, +#define FNe2(reason) { SKB_DROP_REASON_##reason, #reason } /* * skb of trace_tcp_send_reset is the skb that caused RST. 
In case of * active reset, skb should be NULL */ TRACE_EVENT(tcp_send_reset, - TP_PROTO(const struct sock *sk, const struct sk_buff *skb), + TP_PROTO(const struct sock *sk, +const struct sk_buff *skb, +const int reason), - TP_ARGS(sk, skb), + TP_ARGS(sk, skb, reason), TP_STRUCT__entry( __field(const void *, skbaddr) __field(const void *, skaddr) __field(int, state) + __field(int, reason) __array(__u8, saddr, sizeof(struct sockaddr_in6)) __array(__u8, daddr, sizeof(struct sockaddr_in6)) ), @@ -113,14 +132,24 @@ TRACE_EVENT(tcp_send_reset, */ TP_STORE_ADDR_PORTS_SKB(skb, th, entry->daddr, entry->saddr); } + __entry->reason = reason; ), - TP_printk("skbaddr=%p skaddr=%p src=%pISpc dest=%pISpc state=%s", + TP_printk("skbaddr=%p skaddr=%p src=%pISpc dest=%pISpc state=%s reason=%s", __entry->skbaddr, __entry->skaddr, __entry->saddr, __entry->daddr, - __entry->state ? show_tcp_state_name(__entry->state) : "UNKNOWN") + __entry->state ? show_tcp_state_name(__entry->state) : "UNKNOWN", + __entry->reason < RST_REASON_START ? 
+ __print_symbolic(__entry->reason, DEFINE_DROP_REASON(FN2, FNe2)) : + __print_symbolic(__entry->reason, DEFINE_RST_REASON(FN1, FNe1))) ); +#undef FN1 +#undef FNe1 + +#undef FN2 +#undef FNe2 + /* * tcp event with arguments sk * diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c index 35b0f3bbf596..3aee7cb35ee4 100644 --- a/net/ipv4/tcp_ipv4.c +++ b/net/ipv4/tcp_ipv4.c @@ -871,7 +871,7 @@ static void tcp_v4_send_reset(const struct sock *sk, struct sk_buff *skb, if (sk) arg.bound_dev_if = sk->sk_bound_dev_if; - trace_tcp_send_reset(sk, skb); + trace_tcp_send_reset(sk, skb, reason); BUILD_BUG_ON(offsetof(struct sock, sk_bound_dev_if) != offsetof(struct inet_timewait_sock, tw_bound_dev_if)); diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c index 18fbbad2028a..d5a7ecfcc1b3 100644 --- a/net/ipv4/tcp_output.c +++ b/net/ipv4/tcp_output.c @@ -3608,7 +3608,7 @@ void tcp_send_active_reset(struct sock *sk, gfp_t priority, int reason) /* skb of trace_tcp_send_reset() keeps the skb that caused RST, * skb here is different to the troublesome skb, so use NULL */ - trace_tcp_send_reset(sk, NULL); + trace_tcp_send_reset(sk, NULL, SK_RST_REASON_NOT_SPECIFIED); } /* Send a crossed SYN-ACK during socket establishment. diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c index cfcfa2626899..da2f70ad89b5 100644 --- a/net/ipv6/tcp_ipv6.c +++ b/net/ipv6/tcp_ipv6.c @@ -1131,7 +1131,7 @@ static void tcp_v6_send_reset(const struct sock *sk, struct sk_buff *skb, label = ip6_flowlabel(ipv6h); } - trace_tcp_send_reset(sk, skb); + trace_tcp_send_reset(sk, skb, reason); tcp_v6_send_response(sk, skb, seq, ack_seq, 0, 0, 0, oif, 1, ipv6_get_dsfield(ipv6h), label, priority, txhash, -- 2.37.3
[PATCH net-next 5/6] mptcp: support rstreason for passive reset
From: Jason Xing It relies on the reset options in MPTCP, as RFC 8684 defines them. Reusing this logic can save us much effort. This patch replaces all the prior NOT_SPECIFIED reasons. Signed-off-by: Jason Xing --- net/mptcp/subflow.c | 26 -- 1 file changed, 20 insertions(+), 6 deletions(-) diff --git a/net/mptcp/subflow.c b/net/mptcp/subflow.c index a68d5d0f3e2a..24668d3020aa 100644 --- a/net/mptcp/subflow.c +++ b/net/mptcp/subflow.c @@ -304,7 +304,10 @@ static struct dst_entry *subflow_v4_route_req(const struct sock *sk, dst_release(dst); if (!req->syncookie) - tcp_request_sock_ops.send_reset(sk, skb, SK_RST_REASON_NOT_SPECIFIED); + /* According to RFC 8684, 3.2. Starting a New Subflow, +* we should use an "MPTCP specific error" reason code. +*/ + tcp_request_sock_ops.send_reset(sk, skb, SK_RST_REASON_MPTCP_RST_EMPTCP); return NULL; } @@ -371,7 +374,10 @@ static struct dst_entry *subflow_v6_route_req(const struct sock *sk, dst_release(dst); if (!req->syncookie) - tcp6_request_sock_ops.send_reset(sk, skb, SK_RST_REASON_NOT_SPECIFIED); + /* According to RFC 8684, 3.2. Starting a New Subflow, +* we should use an "MPTCP specific error" reason code.
+*/ + tcp6_request_sock_ops.send_reset(sk, skb, SK_RST_REASON_MPTCP_RST_EMPTCP); return NULL; } #endif @@ -778,6 +784,7 @@ static struct sock *subflow_syn_recv_sock(const struct sock *sk, bool fallback, fallback_is_fatal; struct mptcp_sock *owner; struct sock *child; + int reason; pr_debug("listener=%p, req=%p, conn=%p", listener, req, listener->conn); @@ -833,7 +840,8 @@ static struct sock *subflow_syn_recv_sock(const struct sock *sk, */ if (!ctx || fallback) { if (fallback_is_fatal) { - subflow_add_reset_reason(skb, MPTCP_RST_EMPTCP); + reason = MPTCP_RST_EMPTCP; + subflow_add_reset_reason(skb, reason); goto dispose_child; } goto fallback; @@ -861,7 +869,8 @@ static struct sock *subflow_syn_recv_sock(const struct sock *sk, } else if (ctx->mp_join) { owner = subflow_req->msk; if (!owner) { - subflow_add_reset_reason(skb, MPTCP_RST_EPROHIBIT); + reason = MPTCP_RST_EPROHIBIT; + subflow_add_reset_reason(skb, reason); goto dispose_child; } @@ -875,13 +884,18 @@ static struct sock *subflow_syn_recv_sock(const struct sock *sk, ntohs(inet_sk((struct sock *)owner)->inet_sport)); if (!mptcp_pm_sport_in_anno_list(owner, sk)) { SUBFLOW_REQ_INC_STATS(req, MPTCP_MIB_MISMATCHPORTACKRX); + reason = MPTCP_RST_EUNSPEC; goto dispose_child; } SUBFLOW_REQ_INC_STATS(req, MPTCP_MIB_JOINPORTACKRX); } - if (!mptcp_finish_join(child)) + if (!mptcp_finish_join(child)) { + struct mptcp_subflow_context *subflow = mptcp_subflow_ctx(sk); + + reason = subflow->reset_reason; goto dispose_child; + } SUBFLOW_REQ_INC_STATS(req, MPTCP_MIB_JOINACKRX); tcp_rsk(req)->drop_req = true; @@ -901,7 +915,7 @@ static struct sock *subflow_syn_recv_sock(const struct sock *sk, tcp_rsk(req)->drop_req = true; inet_csk_prepare_for_destroy_sock(child); tcp_done(child); - req->rsk_ops->send_reset(sk, skb, SK_RST_REASON_NOT_SPECIFIED); + req->rsk_ops->send_reset(sk, skb, convert_mptcp_reason(reason)); /* The last child reference will be released by the caller */ return child; -- 2.37.3
[PATCH net-next 4/6] tcp: support rstreason for passive reset
From: Jason Xing Reuse the dropreason logic to show the exact reason of tcp reset, so we don't need to implement those duplicated reset reasons. This patch replaces all the prior NOT_SPECIFIED reasons. Signed-off-by: Jason Xing --- net/ipv4/tcp_ipv4.c | 8 net/ipv6/tcp_ipv6.c | 8 2 files changed, 8 insertions(+), 8 deletions(-) diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c index 1c8248abe37a..35b0f3bbf596 100644 --- a/net/ipv4/tcp_ipv4.c +++ b/net/ipv4/tcp_ipv4.c @@ -1935,7 +1935,7 @@ int tcp_v4_do_rcv(struct sock *sk, struct sk_buff *skb) return 0; reset: - tcp_v4_send_reset(rsk, skb, SK_RST_REASON_NOT_SPECIFIED); + tcp_v4_send_reset(rsk, skb, reason); discard: kfree_skb_reason(skb, reason); /* Be careful here. If this function gets more complicated and @@ -2280,7 +2280,7 @@ int tcp_v4_rcv(struct sk_buff *skb) } else { drop_reason = tcp_child_process(sk, nsk, skb); if (drop_reason) { - tcp_v4_send_reset(nsk, skb, SK_RST_REASON_NOT_SPECIFIED); + tcp_v4_send_reset(nsk, skb, drop_reason); goto discard_and_relse; } sock_put(sk); @@ -2358,7 +2358,7 @@ int tcp_v4_rcv(struct sk_buff *skb) bad_packet: __TCP_INC_STATS(net, TCP_MIB_INERRS); } else { - tcp_v4_send_reset(NULL, skb, SK_RST_REASON_NOT_SPECIFIED); + tcp_v4_send_reset(NULL, skb, drop_reason); } discard_it: @@ -2409,7 +2409,7 @@ int tcp_v4_rcv(struct sk_buff *skb) tcp_v4_timewait_ack(sk, skb); break; case TCP_TW_RST: - tcp_v4_send_reset(sk, skb, SK_RST_REASON_NOT_SPECIFIED); + tcp_v4_send_reset(sk, skb, drop_reason); inet_twsk_deschedule_put(inet_twsk(sk)); goto discard_it; case TCP_TW_SUCCESS:; diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c index f143b658fb71..cfcfa2626899 100644 --- a/net/ipv6/tcp_ipv6.c +++ b/net/ipv6/tcp_ipv6.c @@ -1678,7 +1678,7 @@ int tcp_v6_do_rcv(struct sock *sk, struct sk_buff *skb) return 0; reset: - tcp_v6_send_reset(sk, skb, SK_RST_REASON_NOT_SPECIFIED); + tcp_v6_send_reset(sk, skb, reason); discard: if (opt_skb) __kfree_skb(opt_skb); @@ -1864,7 +1864,7 @@ 
INDIRECT_CALLABLE_SCOPE int tcp_v6_rcv(struct sk_buff *skb) } else { drop_reason = tcp_child_process(sk, nsk, skb); if (drop_reason) { - tcp_v6_send_reset(nsk, skb, SK_RST_REASON_NOT_SPECIFIED); + tcp_v6_send_reset(nsk, skb, drop_reason); goto discard_and_relse; } sock_put(sk); @@ -1940,7 +1940,7 @@ INDIRECT_CALLABLE_SCOPE int tcp_v6_rcv(struct sk_buff *skb) bad_packet: __TCP_INC_STATS(net, TCP_MIB_INERRS); } else { - tcp_v6_send_reset(NULL, skb, SK_RST_REASON_NOT_SPECIFIED); + tcp_v6_send_reset(NULL, skb, drop_reason); } discard_it: @@ -1995,7 +1995,7 @@ INDIRECT_CALLABLE_SCOPE int tcp_v6_rcv(struct sk_buff *skb) tcp_v6_timewait_ack(sk, skb); break; case TCP_TW_RST: - tcp_v6_send_reset(sk, skb, SK_RST_REASON_NOT_SPECIFIED); + tcp_v6_send_reset(sk, skb, drop_reason); inet_twsk_deschedule_put(inet_twsk(sk)); goto discard_it; case TCP_TW_SUCCESS: -- 2.37.3
[PATCH net-next 3/6] rstreason: prepare for active reset
From: Jason Xing Like what we did to passive reset: only passing possible reset reason in each active reset path. No functional changes. Signed-off-by: Jason Xing --- include/net/tcp.h | 2 +- net/ipv4/tcp.c| 15 ++- net/ipv4/tcp_output.c | 2 +- net/ipv4/tcp_timer.c | 9 ++--- net/mptcp/protocol.c | 4 +++- net/mptcp/subflow.c | 5 +++-- 6 files changed, 24 insertions(+), 13 deletions(-) diff --git a/include/net/tcp.h b/include/net/tcp.h index 6ae35199d3b3..2b9b9d3d8065 100644 --- a/include/net/tcp.h +++ b/include/net/tcp.h @@ -667,7 +667,7 @@ int tcp_fragment(struct sock *sk, enum tcp_queue tcp_queue, void tcp_send_probe0(struct sock *); int tcp_write_wakeup(struct sock *, int mib); void tcp_send_fin(struct sock *sk); -void tcp_send_active_reset(struct sock *sk, gfp_t priority); +void tcp_send_active_reset(struct sock *sk, gfp_t priority, int reason); int tcp_send_synack(struct sock *); void tcp_push_one(struct sock *, unsigned int mss_now); void __tcp_send_ack(struct sock *sk, u32 rcv_nxt); diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c index e767721b3a58..eacfe0012977 100644 --- a/net/ipv4/tcp.c +++ b/net/ipv4/tcp.c @@ -275,6 +275,7 @@ #include #include #include +#include #include #include @@ -2805,7 +2806,8 @@ void __tcp_close(struct sock *sk, long timeout) /* Unread data was tossed, zap the connection. */ NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPABORTONCLOSE); tcp_set_state(sk, TCP_CLOSE); - tcp_send_active_reset(sk, sk->sk_allocation); + tcp_send_active_reset(sk, sk->sk_allocation, + SK_RST_REASON_NOT_SPECIFIED); } else if (sock_flag(sk, SOCK_LINGER) && !sk->sk_lingertime) { /* Check zero linger _after_ checking for unread data. 
*/ sk->sk_prot->disconnect(sk, 0); @@ -2879,7 +2881,8 @@ void __tcp_close(struct sock *sk, long timeout) struct tcp_sock *tp = tcp_sk(sk); if (READ_ONCE(tp->linger2) < 0) { tcp_set_state(sk, TCP_CLOSE); - tcp_send_active_reset(sk, GFP_ATOMIC); + tcp_send_active_reset(sk, GFP_ATOMIC, + SK_RST_REASON_NOT_SPECIFIED); __NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPABORTONLINGER); } else { @@ -2897,7 +2900,8 @@ void __tcp_close(struct sock *sk, long timeout) if (sk->sk_state != TCP_CLOSE) { if (tcp_check_oom(sk, 0)) { tcp_set_state(sk, TCP_CLOSE); - tcp_send_active_reset(sk, GFP_ATOMIC); + tcp_send_active_reset(sk, GFP_ATOMIC, + SK_RST_REASON_NOT_SPECIFIED); __NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPABORTONMEMORY); } else if (!check_net(sock_net(sk))) { @@ -3001,7 +3005,7 @@ int tcp_disconnect(struct sock *sk, int flags) /* The last check adjusts for discrepancy of Linux wrt. RFC * states */ - tcp_send_active_reset(sk, gfp_any()); + tcp_send_active_reset(sk, gfp_any(), SK_RST_REASON_NOT_SPECIFIED); WRITE_ONCE(sk->sk_err, ECONNRESET); } else if (old_state == TCP_SYN_SENT) WRITE_ONCE(sk->sk_err, ECONNRESET); @@ -4557,7 +4561,8 @@ int tcp_abort(struct sock *sk, int err) smp_wmb(); sk_error_report(sk); if (tcp_need_reset(sk->sk_state)) - tcp_send_active_reset(sk, GFP_ATOMIC); + tcp_send_active_reset(sk, GFP_ATOMIC, + SK_RST_REASON_NOT_SPECIFIED); tcp_done(sk); } diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c index e3167ad96567..18fbbad2028a 100644 --- a/net/ipv4/tcp_output.c +++ b/net/ipv4/tcp_output.c @@ -3583,7 +3583,7 @@ void tcp_send_fin(struct sock *sk) * was unread data in the receive queue. This behavior is recommended * by RFC 2525, section 2.17. 
-DaveM */ -void tcp_send_active_reset(struct sock *sk, gfp_t priority) +void tcp_send_active_reset(struct sock *sk, gfp_t priority, int reason) { struct sk_buff *skb; diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c index d1ad20ce1c8c..7e7110bf3ea2 100644 --- a/net/ipv4/tcp_timer.c +++ b/net/ipv4/tcp_timer.c @@ -22,6 +22,7 @@ #include #include #include +#include static u32 tcp_clamp_rto_to_user_timeout(const struct sock *sk) { @@ -127,7 +128,8 @@ static int tcp_out_of_resources(struct sock *sk, bool do_reset) (!tp->snd_wnd && !tp->packets_out)) do_reset = true; if (do_reset) -
[PATCH net-next 2/6] rstreason: prepare for passive reset
From: Jason Xing Adjust the parameter and support passing the reason for the reset, which is for now NOT_SPECIFIED. No functional changes. Signed-off-by: Jason Xing --- include/net/request_sock.h | 3 ++- net/dccp/ipv4.c| 10 ++ net/dccp/ipv6.c| 10 ++ net/dccp/minisocks.c | 3 ++- net/ipv4/tcp_ipv4.c| 12 +++- net/ipv4/tcp_minisocks.c | 3 ++- net/ipv6/tcp_ipv6.c| 15 +-- net/mptcp/subflow.c| 8 +--- 8 files changed, 39 insertions(+), 25 deletions(-) diff --git a/include/net/request_sock.h b/include/net/request_sock.h index 004e651e6067..93f9fee7e52f 100644 --- a/include/net/request_sock.h +++ b/include/net/request_sock.h @@ -34,7 +34,8 @@ struct request_sock_ops { void(*send_ack)(const struct sock *sk, struct sk_buff *skb, struct request_sock *req); void(*send_reset)(const struct sock *sk, - struct sk_buff *skb); + struct sk_buff *skb, + int reason); void(*destructor)(struct request_sock *req); void(*syn_ack_timeout)(const struct request_sock *req); }; diff --git a/net/dccp/ipv4.c b/net/dccp/ipv4.c index 44b033fe1ef6..628dd783e8f3 100644 --- a/net/dccp/ipv4.c +++ b/net/dccp/ipv4.c @@ -24,6 +24,7 @@ #include #include #include +#include #include "ackvec.h" #include "ccid.h" @@ -521,7 +522,8 @@ static int dccp_v4_send_response(const struct sock *sk, struct request_sock *req return err; } -static void dccp_v4_ctl_send_reset(const struct sock *sk, struct sk_buff *rxskb) +static void dccp_v4_ctl_send_reset(const struct sock *sk, struct sk_buff *rxskb, + int reason) { int err; const struct iphdr *rxiph; @@ -706,7 +708,7 @@ int dccp_v4_do_rcv(struct sock *sk, struct sk_buff *skb) return 0; reset: - dccp_v4_ctl_send_reset(sk, skb); + dccp_v4_ctl_send_reset(sk, skb, SK_RST_REASON_NOT_SPECIFIED); kfree_skb(skb); return 0; } @@ -869,7 +871,7 @@ static int dccp_v4_rcv(struct sk_buff *skb) if (nsk == sk) { reqsk_put(req); } else if (dccp_child_process(sk, nsk, skb)) { - dccp_v4_ctl_send_reset(sk, skb); + dccp_v4_ctl_send_reset(sk, skb, SK_RST_REASON_NOT_SPECIFIED); goto discard_and_relse; }
else { sock_put(sk); @@ -909,7 +911,7 @@ static int dccp_v4_rcv(struct sk_buff *skb) if (dh->dccph_type != DCCP_PKT_RESET) { DCCP_SKB_CB(skb)->dccpd_reset_code = DCCP_RESET_CODE_NO_CONNECTION; - dccp_v4_ctl_send_reset(sk, skb); + dccp_v4_ctl_send_reset(sk, skb, SK_RST_REASON_NOT_SPECIFIED); } discard_it: diff --git a/net/dccp/ipv6.c b/net/dccp/ipv6.c index ded07e09f813..d64f39e26e87 100644 --- a/net/dccp/ipv6.c +++ b/net/dccp/ipv6.c @@ -29,6 +29,7 @@ #include #include #include +#include #include "dccp.h" #include "ipv6.h" @@ -256,7 +257,8 @@ static void dccp_v6_reqsk_destructor(struct request_sock *req) kfree_skb(inet_rsk(req)->pktopts); } -static void dccp_v6_ctl_send_reset(const struct sock *sk, struct sk_buff *rxskb) +static void dccp_v6_ctl_send_reset(const struct sock *sk, struct sk_buff *rxskb, + int reason) { const struct ipv6hdr *rxip6h; struct sk_buff *skb; @@ -656,7 +658,7 @@ static int dccp_v6_do_rcv(struct sock *sk, struct sk_buff *skb) return 0; reset: - dccp_v6_ctl_send_reset(sk, skb); + dccp_v6_ctl_send_reset(sk, skb, SK_RST_REASON_NOT_SPECIFIED); discard: if (opt_skb != NULL) __kfree_skb(opt_skb); @@ -762,7 +764,7 @@ static int dccp_v6_rcv(struct sk_buff *skb) if (nsk == sk) { reqsk_put(req); } else if (dccp_child_process(sk, nsk, skb)) { - dccp_v6_ctl_send_reset(sk, skb); + dccp_v6_ctl_send_reset(sk, skb, SK_RST_REASON_NOT_SPECIFIED); goto discard_and_relse; } else { sock_put(sk); @@ -801,7 +803,7 @@ static int dccp_v6_rcv(struct sk_buff *skb) if (dh->dccph_type != DCCP_PKT_RESET) { DCCP_SKB_CB(skb)->dccpd_reset_code = DCCP_RESET_CODE_NO_CONNECTION; - dccp_v6_ctl_send_reset(sk, skb); + dccp_v6_ctl_send_reset(sk, skb, SK_RST_REASON_NOT_SPECIFIED); } discard_it: diff --git a/net/dccp/minisocks.c b/net/dccp/minisocks.c index 64d805b27add..251a57cf5822 100644 --- a/net/dccp/minisocks.c +++ b/net/dccp/minisocks.c @@ -15,6 +15,7 @@ #include
[PATCH net-next 1/6] net: introduce rstreason to detect why the RST is sent
From: Jason Xing Add a new standalone file for the easy future extension to support both active reset and passive reset in the TCP/DCCP/MPTCP protocols. This patch only does the preparations for reset reason mechanism, nothing else changes. The reset reasons are divided into three parts: 1) reuse drop reasons for passive reset in TCP 2) reuse MP_TCPRST option for MPTCP 3) our own reasons I will implement the basic codes of active/passive reset reason in those three protocols, which is not complete for this moment. But it provides a new chance to let other people add more reasons into it:) Signed-off-by: Jason Xing --- include/net/rstreason.h | 93 + 1 file changed, 93 insertions(+) create mode 100644 include/net/rstreason.h diff --git a/include/net/rstreason.h b/include/net/rstreason.h new file mode 100644 index ..24d098a78a60 --- /dev/null +++ b/include/net/rstreason.h @@ -0,0 +1,93 @@ +/* SPDX-License-Identifier: GPL-2.0-or-later */ + +#ifndef _LINUX_RSTREASON_H +#define _LINUX_RSTREASON_H +#include + +#define DEFINE_RST_REASON(FN, FNe) \ + FN(MPTCP_RST_EUNSPEC) \ + FN(MPTCP_RST_EMPTCP)\ + FN(MPTCP_RST_ERESOURCE) \ + FN(MPTCP_RST_EPROHIBIT) \ + FN(MPTCP_RST_EWQ2BIG) \ + FN(MPTCP_RST_EBADPERF) \ + FN(MPTCP_RST_EMIDDLEBOX)\ + FN(NOT_SPECIFIED) \ + FNe(MAX) + +#define RST_REASON_START (SKB_DROP_REASON_MAX + 1) + +/* There are three parts in order: + * 1) 0 - SKB_DROP_REASON_MAX: rely on drop reasons for passive reset in TCP + * 2) SKB_DROP_REASON_MAX + 1 - MPTCP_RST_EMIDDLEBOX: for MPTCP use + * 3) MPTCP_RST_EMIDDLEBOX - SK_RST_REASON_MAX: independent reset reason + */ +enum sk_rst_reason { + /* Leave this 'blank' part (0-SKB_DROP_REASON_MAX) for the reuse +* of skb drop reason because rst reason relies on what drop reason +* indicates exactly why it could happen. +*/ + + /* Copy from include/uapi/linux/mptcp.h. +* These reset fields will not be changed since they adhere to +* RFC 8684. So do not touch them. I'm going to list each definition +* of them respectively. 
+*/ + /* Unspecified error. +* This is the default error; it implies that the subflow is no +* longer available. The presence of this option shows that the +* RST was generated by an MPTCP-aware device. +*/ + SK_RST_REASON_MPTCP_RST_EUNSPEC = RST_REASON_START, + /* MPTCP-specific error. +* An error has been detected in the processing of MPTCP options. +* This is the usual reason code to return in the cases where a RST +* is being sent to close a subflow because of an invalid response. +*/ + SK_RST_REASON_MPTCP_RST_EMPTCP, + /* Lack of resources. +* This code indicates that the sending host does not have enough +* resources to support the terminated subflow. +*/ + SK_RST_REASON_MPTCP_RST_ERESOURCE, + /* Administratively prohibited. +* This code indicates that the requested subflow is prohibited by +* the policies of the sending host. +*/ + SK_RST_REASON_MPTCP_RST_EPROHIBIT, + /* Too much outstanding data. +* This code indicates that there is an excessive amount of data +* that needs to be transmitted over the terminated subflow while +* having already been acknowledged over one or more other subflows. +* This may occur if a path has been unavailable for a short period +* and it is more efficient to reset and start again than it is to +* retransmit the queued data. +*/ + SK_RST_REASON_MPTCP_RST_EWQ2BIG, + /* Unacceptable performance. +* This code indicates that the performance of this subflow was +* too low compared to the other subflows of this Multipath TCP +* connection. +*/ + SK_RST_REASON_MPTCP_RST_EBADPERF, + /* Middlebox interference. +* Middlebox interference has been detected over this subflow, +* making MPTCP signaling invalid. For example, this may be sent +* if the checksum does not validate. +*/ + SK_RST_REASON_MPTCP_RST_EMIDDLEBOX, + + /* For the real standalone socket reset reason, we start from here */ + SK_RST_REASON_NOT_SPECIFIED, + + /* Maximum of socket reset reasons. +* It shouldn't be used as a real 'reason'. 
+*/ + SK_RST_REASON_MAX, +}; + +static inline int convert_mptcp_reason(int reason) +{ + return reason += RST_REASON_START; +} +#endif -- 2.37.3
[PATCH net-next 0/6] Implement reset reason mechanism to detect
From: Jason Xing In production, there are many cases where an RST skb is sent, but we don't have a very convenient/fast method to detect the exact underlying reasons. RST is implemented in two kinds: the passive kind (like tcp_v4_send_reset()) and the active kind (like tcp_send_active_reset()). The former can be traced carefully 1) in TCP, with the help of drop reasons, which is based on Eric's idea[1], 2) in MPTCP, with the help of reset options defined in RFC 8684. The latter is relatively independent and should be implemented on our own. In this series, I focus on the fundamental implementation, mostly on how the rstreason mechanism works, with the detailed passive part as an example, not including the active reset part. In the future, we can go further and refine those NOT_SPECIFIED reasons. Here are some examples when tracing: -0 [002] ..s1. 1830.262425: tcp_send_reset: skbaddr=x skaddr=x src=x dest=x state=x reason=NOT_SPECIFIED -0 [002] ..s1. 1830.262425: tcp_send_reset: skbaddr=x skaddr=x src=x dest=x state=x reason=NO_SOCKET [1] Link: https://lore.kernel.org/all/CANn89iJw8x-LqgsWOeJQQvgVg6DnL5aBRLi10QN2WBdr+X4k=w...@mail.gmail.com/ Note: It's based on top of https://patchwork.kernel.org/project/netdevbpf/list/?series=840182 Jason Xing (6): net: introduce rstreason to detect why the RST is sent rstreason: prepare for passive reset rstreason: prepare for active reset tcp: support rstreason for passive reset mptcp: support rstreason for passive reset rstreason: make it work in trace world include/net/request_sock.h | 3 +- include/net/rstreason.h| 93 ++ include/net/tcp.h | 2 +- include/trace/events/tcp.h | 37 +-- net/dccp/ipv4.c| 10 ++-- net/dccp/ipv6.c| 10 ++-- net/dccp/minisocks.c | 3 +- net/ipv4/tcp.c | 15 -- net/ipv4/tcp_ipv4.c| 14 +++--- net/ipv4/tcp_minisocks.c | 3 +- net/ipv4/tcp_output.c | 4 +- net/ipv4/tcp_timer.c | 9 ++-- net/ipv6/tcp_ipv6.c| 17 --- net/mptcp/protocol.c | 4 +- net/mptcp/subflow.c| 33 ++ 15 files changed, 209 insertions(+), 48 deletions(-)
create mode 100644 include/net/rstreason.h -- 2.37.3
Re: [PATCH v4 4/4] remoteproc: stm32: Add support of an OP-TEE TA to load the firmware
On 4/1/24 17:46, Mathieu Poirier wrote: > On Fri, Mar 29, 2024 at 11:57:43AM +0100, Arnaud POULIQUEN wrote: >> >> >> On 3/27/24 18:14, Mathieu Poirier wrote: >>> On Tue, Mar 26, 2024 at 08:31:33PM +0100, Arnaud POULIQUEN wrote: On 3/25/24 17:51, Mathieu Poirier wrote: > On Fri, Mar 08, 2024 at 03:47:08PM +0100, Arnaud Pouliquen wrote: >> The new TEE remoteproc device is used to manage remote firmware in a >> secure, trusted context. The 'st,stm32mp1-m4-tee' compatibility is >> introduced to delegate the loading of the firmware to the trusted >> execution context. In such cases, the firmware should be signed and >> adhere to the image format defined by the TEE. >> >> Signed-off-by: Arnaud Pouliquen >> --- >> Updates from V3: >> - remove support of the attach use case. Will be addressed in a separate >> thread, >> - add st_rproc_tee_ops::parse_fw ops, >> - inverse call of devm_rproc_alloc()and tee_rproc_register() to manage >> cross >> reference between the rproc struct and the tee_rproc struct in >> tee_rproc.c. >> --- >> drivers/remoteproc/stm32_rproc.c | 60 +--- >> 1 file changed, 56 insertions(+), 4 deletions(-) >> >> diff --git a/drivers/remoteproc/stm32_rproc.c >> b/drivers/remoteproc/stm32_rproc.c >> index 8cd838df4e92..13df33c78aa2 100644 >> --- a/drivers/remoteproc/stm32_rproc.c >> +++ b/drivers/remoteproc/stm32_rproc.c >> @@ -20,6 +20,7 @@ >> #include >> #include >> #include >> +#include >> #include >> >> #include "remoteproc_internal.h" >> @@ -49,6 +50,9 @@ >> #define M4_STATE_STANDBY4 >> #define M4_STATE_CRASH 5 >> >> +/* Remote processor unique identifier aligned with the Trusted >> Execution Environment definitions */ > > Why is this the case? At least from the kernel side it is possible to > call > tee_rproc_register() with any kind of value, why is there a need to be any > kind of alignment with the TEE? The use of the proc_id is to identify a processor in case of multi co-processors. >>> >>> That is well understood. 
>>> For instance we can have a system with a DSP and a modem. We would use the same TEE service, but >>> >>> That too. >>> the TEE driver will probably be different, same for the signature key. >>> >>> What TEE driver are we talking about here? >> >> In OP-TEE the remoteproc framework is divided into 2 or 3 layers: >> >> - the remoteproc Trusted Application (TA) [1] which is platform agnostic >> - The remoteproc Pseudo Trusted Application (PTA) [2] which is platform >> dependent and can rely on the proc ID to retrieve the context. >> - the remoteproc driver (optional for some platforms) [3], which is in charge >> of DT parsing and platform configuration. >> > > That part makes sense. > >> Here TEE driver can be interpreted as the remote PTA and/or the platform driver. >> > > I have to guess PTA means "Platform Trusted Application" but I have no > guarantee, adding to the level of (already high) confusion brought on by this > patchset. As mentioned above, PTA is Pseudo Trusted Application. It is an interface exposed by the OP-TEE core to the OP-TEE application. In this case the PTA is used to implement the platform part. If you need more details about the remoteproc framework in OP-TEE, there is a link to a presentation in the cover letter. > >> [1] >> https://elixir.bootlin.com/op-tee/latest/source/ta/remoteproc/src/remoteproc_core.c >> [2] >> https://elixir.bootlin.com/op-tee/latest/source/core/pta/stm32mp/remoteproc_pta.c >> [3] >> https://elixir.bootlin.com/op-tee/latest/source/core/drivers/remoteproc/stm32_remoteproc.c >> >>> In such case the proc ID allows identifying the processor you want to address. >>> >>> That too is well understood, but there is no alignment needed with the TEE, >>> i.e >>> the TEE application is not expecting a value of '0'. We could set >>> STM32_MP1_M4_PROC_ID to 0xDEADBEEF and things would work. This driver >>> won't go >>> anywhere for as long as it is not the case.
>> >> >> Here I suppose that you do not challenge the rproc_ID use in general, but for >> the stm32mp1 platform as we have only one remote processor. I'm right? > > That is correct - I understand the need for an rproc_ID. The problem is with > the comment that states that '0' is aligned with the TEE definitions, which in > my head means hard coded value and a big red flag. What it should say is > "aligned with the TEE device tree definition". > >> >> In OP-TEE the check is done here: >> https://elixir.bootlin.com/op-tee/latest/source/core/drivers/remoteproc/stm32_remoteproc.c#L64 >> >> If driver does not register the proc ID an error is returned indicating that >> the >> feature is not supported. >> >> In case of stm32mp1 yes we could consider it as
Re: [PATCH] vhost-vdpa: change ioctl # for VDPA_GET_VRING_SIZE
On 4/3/2024 5:21 AM, Michael S. Tsirkin wrote:
> VDPA_GET_VRING_SIZE by mistake uses the already occupied
> ioctl # 0x80 and we never noticed - it happens to work
> because the direction and size are different, but confuses
> tools such as perf which like to look at just the number,
> and breaks the extra robustness of the ioctl numbering macros.
>
> To fix, sort the entries and renumber the ioctl - not too late
> since it wasn't in any released kernels yet.
>
> Cc: Arnaldo Carvalho de Melo
> Reported-by: Namhyung Kim
> Fixes: 1496c47065f9 ("vhost-vdpa: uapi to support reporting per vq size")
> Cc: "Zhu Lingshan"
> Signed-off-by: Michael S. Tsirkin

Reviewed-by: Zhu Lingshan

Thanks for the fix and sorry for the mess, I should have read the whole
header file to check whether the number was available.

> ---
>
> Build tested only - userspace patches using this will have to adjust.
> I will merge this in a week or so unless I hear otherwise,
> and afterwards perf can update their header.
>
>  include/uapi/linux/vhost.h | 15 ---
>  1 file changed, 8 insertions(+), 7 deletions(-)
>
> diff --git a/include/uapi/linux/vhost.h b/include/uapi/linux/vhost.h
> index bea697390613..b95dd84eef2d 100644
> --- a/include/uapi/linux/vhost.h
> +++ b/include/uapi/linux/vhost.h
> @@ -179,12 +179,6 @@
>  /* Get the config size */
>  #define VHOST_VDPA_GET_CONFIG_SIZE	_IOR(VHOST_VIRTIO, 0x79, __u32)
>
> -/* Get the count of all virtqueues */
> -#define VHOST_VDPA_GET_VQS_COUNT	_IOR(VHOST_VIRTIO, 0x80, __u32)
> -
> -/* Get the number of virtqueue groups. */
> -#define VHOST_VDPA_GET_GROUP_NUM	_IOR(VHOST_VIRTIO, 0x81, __u32)
> -
>  /* Get the number of address spaces. */
>  #define VHOST_VDPA_GET_AS_NUM	_IOR(VHOST_VIRTIO, 0x7A, unsigned int)
>
> @@ -228,10 +222,17 @@
>  #define VHOST_VDPA_GET_VRING_DESC_GROUP	_IOWR(VHOST_VIRTIO, 0x7F,	\
>  					      struct vhost_vring_state)
>
> +
> +/* Get the count of all virtqueues */
> +#define VHOST_VDPA_GET_VQS_COUNT	_IOR(VHOST_VIRTIO, 0x80, __u32)
> +
> +/* Get the number of virtqueue groups. */
> +#define VHOST_VDPA_GET_GROUP_NUM	_IOR(VHOST_VIRTIO, 0x81, __u32)
> +
>  /* Get the queue size of a specific virtqueue.
>   * userspace set the vring index in vhost_vring_state.index
>   * kernel set the queue size in vhost_vring_state.num
>   */
> -#define VHOST_VDPA_GET_VRING_SIZE	_IOWR(VHOST_VIRTIO, 0x80,	\
> +#define VHOST_VDPA_GET_VRING_SIZE	_IOWR(VHOST_VIRTIO, 0x82,	\
>  					      struct vhost_vring_state)
>  #endif
Re: [PATCH] vhost-vdpa: change ioctl # for VDPA_GET_VRING_SIZE
On Tue, Apr 2, 2024 at 11:21 PM Michael S. Tsirkin wrote:
>
> VDPA_GET_VRING_SIZE by mistake uses the already occupied
> ioctl # 0x80 and we never noticed - it happens to work
> because the direction and size are different, but confuses
> tools such as perf which like to look at just the number,
> and breaks the extra robustness of the ioctl numbering macros.
>
> To fix, sort the entries and renumber the ioctl - not too late
> since it wasn't in any released kernels yet.
>
> Cc: Arnaldo Carvalho de Melo
> Reported-by: Namhyung Kim
> Fixes: 1496c47065f9 ("vhost-vdpa: uapi to support reporting per vq size")
> Cc: "Zhu Lingshan"
> Signed-off-by: Michael S. Tsirkin

Reviewed-by: Eugenio Pérez

> ---
>
> Build tested only - userspace patches using this will have to adjust.
> I will merge this in a week or so unless I hear otherwise,
> and afterwards perf can update their header.
>
>  include/uapi/linux/vhost.h | 15 ---
>  1 file changed, 8 insertions(+), 7 deletions(-)
>
> diff --git a/include/uapi/linux/vhost.h b/include/uapi/linux/vhost.h
> index bea697390613..b95dd84eef2d 100644
> --- a/include/uapi/linux/vhost.h
> +++ b/include/uapi/linux/vhost.h
> @@ -179,12 +179,6 @@
>  /* Get the config size */
>  #define VHOST_VDPA_GET_CONFIG_SIZE	_IOR(VHOST_VIRTIO, 0x79, __u32)
>
> -/* Get the count of all virtqueues */
> -#define VHOST_VDPA_GET_VQS_COUNT	_IOR(VHOST_VIRTIO, 0x80, __u32)
> -
> -/* Get the number of virtqueue groups. */
> -#define VHOST_VDPA_GET_GROUP_NUM	_IOR(VHOST_VIRTIO, 0x81, __u32)
> -
>  /* Get the number of address spaces. */
>  #define VHOST_VDPA_GET_AS_NUM	_IOR(VHOST_VIRTIO, 0x7A, unsigned int)
>
> @@ -228,10 +222,17 @@
>  #define VHOST_VDPA_GET_VRING_DESC_GROUP	_IOWR(VHOST_VIRTIO, 0x7F,	\
>  					      struct vhost_vring_state)
>
> +
> +/* Get the count of all virtqueues */
> +#define VHOST_VDPA_GET_VQS_COUNT	_IOR(VHOST_VIRTIO, 0x80, __u32)
> +
> +/* Get the number of virtqueue groups. */
> +#define VHOST_VDPA_GET_GROUP_NUM	_IOR(VHOST_VIRTIO, 0x81, __u32)
> +
>  /* Get the queue size of a specific virtqueue.
>   * userspace set the vring index in vhost_vring_state.index
>   * kernel set the queue size in vhost_vring_state.num
>   */
> -#define VHOST_VDPA_GET_VRING_SIZE	_IOWR(VHOST_VIRTIO, 0x80,	\
> +#define VHOST_VDPA_GET_VRING_SIZE	_IOWR(VHOST_VIRTIO, 0x82,	\
>  					      struct vhost_vring_state)
>  #endif
> --
> MST
>