[PATCH -next] VDUSE: fix another doc underline warning
Extend the underline for a heading to prevent a documentation build
warning. Also spell "reconnection" correctly.

Documentation/userspace-api/vduse.rst:236: WARNING: Title underline too short.

HOW VDUSE devices reconnectoin works

Fixes: 2b3fd606c662 ("Documentation: Add reconnect process for VDUSE")
Signed-off-by: Randy Dunlap
Cc: Cindy Lu
Cc: Michael S. Tsirkin
Cc: Jason Wang
Cc: Xuan Zhuo
Cc: virtualizat...@lists.linux.dev
Cc: Jonathan Corbet
---
 Documentation/userspace-api/vduse.rst | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff -- a/Documentation/userspace-api/vduse.rst b/Documentation/userspace-api/vduse.rst
--- a/Documentation/userspace-api/vduse.rst
+++ b/Documentation/userspace-api/vduse.rst
@@ -232,8 +232,8 @@ able to start the dataplane processing a
 For more details on the uAPI, please see include/uapi/linux/vduse.h.
 
-HOW VDUSE devices reconnectoin works
-------------------------------------
+HOW VDUSE devices reconnection works
+------------------------------------
 0. Userspace APP checks if the device /dev/vduse/vduse_name exists.
    If it does not exist, need to create the instance.goto step 1
    If it does exist, it means this is a reconnect and goto step 3.
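For readers unfamiliar with the warning: Sphinx emits "Title underline too short" whenever a reStructuredText section underline has fewer characters than the title above it, and the fix is simply to extend the underline to at least the title's length. A minimal sketch of the corrected heading (the adornment character is assumed to be the '-' style the file already uses; the exact dash count in the real file may differ):

```rst
HOW VDUSE devices reconnection works
------------------------------------
```

Any adornment character works as long as it is used consistently and the underline is at least as long as the title text.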
[PATCH] ftrace: fix most kernel-doc warnings
Reduce the number of kernel-doc warnings from 52 down to 10, i.e.,
fix 42 kernel-doc warnings by (a) using the Returns: format for
function return values or (b) using "@var:" instead of "@var -" for
function parameter descriptions.

Fix one return values list so that it is formatted correctly when
rendered for output.

Spell "non-zero" with a hyphen in several places.

Signed-off-by: Randy Dunlap
Reported-by: kernel test robot
Link: https://lore.kernel.org/oe-kbuild-all/202312180518.x6frydsn-...@intel.com/
Cc: Steven Rostedt
Cc: Masami Hiramatsu
Cc: Mathieu Desnoyers
Cc: Mark Rutland
Cc: linux-trace-ker...@vger.kernel.org
---
This patch addresses most of the reported kernel-doc warnings but does
not fix all of them, so I did not use "Closes:" for the Link: tag.

 kernel/trace/ftrace.c | 90 +++++++++++++++++++++----------------------
 1 file changed, 46 insertions(+), 44 deletions(-)

diff -- a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c
--- a/kernel/trace/ftrace.c
+++ b/kernel/trace/ftrace.c
@@ -1160,7 +1160,7 @@ __ftrace_lookup_ip(struct ftrace_hash *h
  * Search a given @hash to see if a given instruction pointer (@ip)
  * exists in it.
  *
- * Returns the entry that holds the @ip if found. NULL otherwise.
+ * Returns: the entry that holds the @ip if found. NULL otherwise.
  */
 struct ftrace_func_entry *
 ftrace_lookup_ip(struct ftrace_hash *hash, unsigned long ip)
@@ -1282,7 +1282,7 @@ static void free_ftrace_hash_rcu(struct
 /**
  * ftrace_free_filter - remove all filters for an ftrace_ops
- * @ops - the ops to remove the filters from
+ * @ops: the ops to remove the filters from
  */
 void ftrace_free_filter(struct ftrace_ops *ops)
 {
@@ -1587,7 +1587,7 @@ static struct dyn_ftrace *lookup_rec(uns
  * @end: end of range to search (inclusive). @end points to the last byte
  * to check.
  *
- * Returns rec->ip if the related ftrace location is a least partly within
+ * Returns: rec->ip if the related ftrace location is a least partly within
  * the given address range. That is, the first address of the instruction
  * that is either a NOP or call to the function tracer. It checks the ftrace
  * internal tables to determine if the address belongs or not.
@@ -1607,9 +1607,10 @@ unsigned long ftrace_location_range(unsi
  * ftrace_location - return the ftrace location
  * @ip: the instruction pointer to check
  *
- * If @ip matches the ftrace location, return @ip.
- * If @ip matches sym+0, return sym's ftrace location.
- * Otherwise, return 0.
+ * Returns:
+ * * If @ip matches the ftrace location, return @ip.
+ * * If @ip matches sym+0, return sym's ftrace location.
+ * * Otherwise, return 0.
  */
 unsigned long ftrace_location(unsigned long ip)
 {
@@ -1639,7 +1640,7 @@ out:
  * @start: start of range to search
  * @end: end of range to search (inclusive). @end points to the last byte to check.
  *
- * Returns 1 if @start and @end contains a ftrace location.
+ * Returns: 1 if @start and @end contains a ftrace location.
  * That is, the instruction that is either a NOP or call to
  * the function tracer. It checks the ftrace internal tables to
  * determine if the address belongs or not.
@@ -2574,7 +2575,7 @@ static void call_direct_funcs(unsigned l
  * wants to convert to a callback that saves all regs. If FTRACE_FL_REGS
  * is not set, then it wants to convert to the normal callback.
  *
- * Returns the address of the trampoline to set to
+ * Returns: the address of the trampoline to set to
  */
 unsigned long ftrace_get_addr_new(struct dyn_ftrace *rec)
 {
@@ -2615,7 +2616,7 @@ unsigned long ftrace_get_addr_new(struct
  * a function that saves all the regs. Basically the '_EN' version
  * represents the current state of the function.
  *
- * Returns the address of the trampoline that is currently being called
+ * Returns: the address of the trampoline that is currently being called
  */
 unsigned long ftrace_get_addr_curr(struct dyn_ftrace *rec)
 {
@@ -2719,7 +2720,7 @@ struct ftrace_rec_iter {
 /**
  * ftrace_rec_iter_start - start up iterating over traced functions
  *
- * Returns an iterator handle that is used to iterate over all
+ * Returns: an iterator handle that is used to iterate over all
  * the records that represent address locations where functions
  * are traced.
  *
@@ -2751,7 +2752,7 @@ struct ftrace_rec_iter *ftrace_rec_iter_
  * ftrace_rec_iter_next - get the next record to process.
  * @iter: The handle to the iterator.
  *
- * Returns the next iterator after the given iterator @iter.
+ * Returns: the next iterator after the given iterator @iter.
  */
 struct ftrace_rec_iter *ftrace_rec_iter_next(struct ftrace_rec_iter *iter)
 {
@@ -2776,7 +2777,7 @@ struct ftrace_rec_iter *ftrace_rec_iter_
  * ftrace_rec_iter_record - get the record at the iterator location
  * @iter: The current iterator location
  *
- * Returns the record that the current @iter is at.
+ * Returns: the record that the current @iter is at.
  */
 struct dyn_ftrace *ftrace_rec_iter_record(struct
Re: [RFC PATCH 11/14] csky/thread_info: Introduce TIF_NOTIFY_IPI flag
On Wed, Feb 21, 2024 at 1:20 AM K Prateek Nayak wrote:
>
> Add support for TIF_NOTIFY_IPI on C-SKY. With TIF_NOTIFY_IPI, a sender
> sending an IPI to an idle CPU in TIF_POLLING mode will set the
> TIF_NOTIFY_IPI flag in the target's idle tasks's thread_info to pull the
> CPU out of idle, as opposed to setting TIF_NEED_RESCHED previously. This
> avoids spurious calls to schedule_idle() in cases where an IPI does not
> necessarily wake up a task on the idle CPU.
>
> Cc: Guo Ren
> Cc: "Rafael J. Wysocki"
> Cc: Daniel Lezcano
> Cc: Ingo Molnar
> Cc: Peter Zijlstra
> Cc: Juri Lelli
> Cc: Vincent Guittot
> Cc: Dietmar Eggemann
> Cc: Steven Rostedt
> Cc: Ben Segall
> Cc: Mel Gorman
> Cc: Daniel Bristot de Oliveira
> Cc: Valentin Schneider
> Cc: K Prateek Nayak
> Cc: linux-c...@vger.kernel.org
> Cc: linux-kernel@vger.kernel.org
> Cc: linux...@vger.kernel.org
> Signed-off-by: K Prateek Nayak
> ---
>  arch/csky/include/asm/thread_info.h | 2 ++
>  1 file changed, 2 insertions(+)
>
> diff --git a/arch/csky/include/asm/thread_info.h b/arch/csky/include/asm/thread_info.h
> index b5ed788f0c68..9bc7a037c476 100644
> --- a/arch/csky/include/asm/thread_info.h
> +++ b/arch/csky/include/asm/thread_info.h
> @@ -61,6 +61,7 @@ static inline struct thread_info *current_thread_info(void)
>  #define TIF_SYSCALL_TRACEPOINT 5       /* syscall tracepoint instrumentation */
>  #define TIF_SYSCALL_AUDIT      6       /* syscall auditing */
>  #define TIF_NOTIFY_SIGNAL      7       /* signal notifications exist */
> +#define TIF_NOTIFY_IPI         8       /* Pending IPI on TIF_POLLLING idle CPU */
>  #define TIF_POLLING_NRFLAG     16      /* poll_idle() is TIF_NEED_RESCHED */
>  #define TIF_MEMDIE             18      /* is terminating due to OOM killer */
>  #define TIF_RESTORE_SIGMASK    20      /* restore signal mask in do_signal() */
> @@ -73,6 +74,7 @@ static inline struct thread_info *current_thread_info(void)
>  #define _TIF_SYSCALL_TRACEPOINT        (1 << TIF_SYSCALL_TRACEPOINT)
>  #define _TIF_SYSCALL_AUDIT     (1 << TIF_SYSCALL_AUDIT)
>  #define _TIF_NOTIFY_SIGNAL     (1 << TIF_NOTIFY_SIGNAL)
> +#define _TIF_NOTIFY_IPI        (1 << TIF_NOTIFY_IPI)
>  #define _TIF_UPROBE            (1 << TIF_UPROBE)
>  #define _TIF_POLLING_NRFLAG    (1 << TIF_POLLING_NRFLAG)
>  #define _TIF_MEMDIE            (1 << TIF_MEMDIE)
> --
> 2.34.1
>

LGTM

Acked-by: Guo Ren

--
Best Regards
 Guo Ren
Re: [PATCH v19 0/3] vfio/nvgrace-gpu: Add vfio pci variant module for grace hopper
>> Ankit Agrawal (3):
>>   vfio/pci: rename and export do_io_rw()
>>   vfio/pci: rename and export range_intersect_range
>>   vfio/nvgrace-gpu: Add vfio pci variant module for grace hopper
>>
>>  MAINTAINERS                           |  16 +-
>>  drivers/vfio/pci/Kconfig              |   2 +
>>  drivers/vfio/pci/Makefile             |   2 +
>>  drivers/vfio/pci/nvgrace-gpu/Kconfig  |  10 +
>>  drivers/vfio/pci/nvgrace-gpu/Makefile |   3 +
>>  drivers/vfio/pci/nvgrace-gpu/main.c   | 879 ++++++++++++++++++++++++++
>>  drivers/vfio/pci/vfio_pci_config.c    |  42 ++
>>  drivers/vfio/pci/vfio_pci_rdwr.c      |  16 +-
>>  drivers/vfio/pci/virtio/main.c        |  72 +--
>>  include/linux/vfio_pci_core.h         |  10 +-
>>  10 files changed, 993 insertions(+), 59 deletions(-)
>>  create mode 100644 drivers/vfio/pci/nvgrace-gpu/Kconfig
>>  create mode 100644 drivers/vfio/pci/nvgrace-gpu/Makefile
>>  create mode 100644 drivers/vfio/pci/nvgrace-gpu/main.c
>>
>
> Applied to vfio next branch for v6.9. Thanks,
>
> Alex

Thanks Alex! Appreciate this along with your guidance and help in the
reviews.
Re: [PATCH net-next v5] virtio_net: Support RX hash XDP hint
On Fri, 09 Feb 2024 13:57:25 +0100, Paolo Abeni wrote:
> On Fri, 2024-02-09 at 18:39 +0800, Liang Chen wrote:
> > On Wed, Feb 7, 2024 at 10:27 PM Paolo Abeni wrote:
> > >
> > > On Wed, 2024-02-07 at 10:54 +0800, Liang Chen wrote:
> > > > On Tue, Feb 6, 2024 at 6:44 PM Paolo Abeni wrote:
> > > > >
> > > > > On Sat, 2024-02-03 at 10:56 +0800, Liang Chen wrote:
> > > > > > On Sat, Feb 3, 2024 at 12:20 AM Jesper Dangaard Brouer wrote:
> > > > > > > On 02/02/2024 13.11, Liang Chen wrote:
> > > > > [...]
> > > > > > > > @@ -1033,6 +1039,16 @@ static void put_xdp_frags(struct xdp_buff *xdp)
> > > > > > > >  }
> > > > > > > >  }
> > > > > > > >
> > > > > > > > +static void virtnet_xdp_save_rx_hash(struct virtnet_xdp_buff *virtnet_xdp,
> > > > > > > > +                                     struct net_device *dev,
> > > > > > > > +                                     struct virtio_net_hdr_v1_hash *hdr_hash)
> > > > > > > > +{
> > > > > > > > +       if (dev->features & NETIF_F_RXHASH) {
> > > > > > > > +               virtnet_xdp->hash_value = hdr_hash->hash_value;
> > > > > > > > +               virtnet_xdp->hash_report = hdr_hash->hash_report;
> > > > > > > > +       }
> > > > > > > > +}
> > > > > > > > +
> > > > > > >
> > > > > > > Would it be possible to store a pointer to hdr_hash in virtnet_xdp_buff,
> > > > > > > with the purpose of delaying extracting this, until and only if XDP
> > > > > > > bpf_prog calls the kfunc?
> > > > > > >
> > > > > >
> > > > > > That seems to be the way v1 works,
> > > > > > https://lore.kernel.org/all/20240122102256.261374-1-liangchen.li...@gmail.com/
> > > > > > . But it was pointed out that the inline header may be overwritten by
> > > > > > the xdp prog, so the hash is copied out to maintain its integrity.
> > > > >
> > > > > Why? isn't XDP supposed to get write access only to the pkt
> > > > > contents/buffer?
> > > > >
> > > >
> > > > Normally, an XDP program accesses only the packet data. However,
> > > > there's also an XDP RX Metadata area, referenced by the data_meta
> > > > pointer. This pointer can be adjusted with bpf_xdp_adjust_meta to
> > > > point somewhere ahead of the data buffer, thereby granting the XDP
> > > > program access to the virtio header located immediately before the
> > >
> > > AFAICS bpf_xdp_adjust_meta() does not allow moving the meta_data before
> > > xdp->data_hard_start:
> > >
> > > https://elixir.bootlin.com/linux/latest/source/net/core/filter.c#L4210
> > >
> > > and virtio net set such field after the virtio_net_hdr:
> > >
> > > https://elixir.bootlin.com/linux/latest/source/drivers/net/virtio_net.c#L1218
> > > https://elixir.bootlin.com/linux/latest/source/drivers/net/virtio_net.c#L1420
> > >
> > > I don't see how the virtio hdr could be touched? Possibly even more
> > > important: if such thing is possible, I think is should be somewhat
> > > denied (for the same reason an H/W nic should prevent XDP from
> > > modifying its own buffer descriptor).
> >
> > Thank you for highlighting this concern. The header layout differs
> > slightly between small and mergeable mode. Taking 'mergeable mode' as
> > an example, after calling xdp_prepare_buff the layout of xdp_buff
> > would be as depicted in the diagram below,
> >
> >   buf
> >    |
> >    v
> > +--------------+---------------+---------+
> > | xdp headroom | virtio header | packet  |
> > | (256 bytes)  | (20 bytes)    | content |
> > +--------------+---------------+---------+
> > ^              ^
> > |              |
> > data_hard_start data
> > data_meta
> >
> > If 'bpf_xdp_adjust_meta' repositions the 'data_meta' pointer a little
> > towards 'data_hard_start', it would point to the inline header, thus
> > potentially allowing the XDP program to access the inline header.
>
> I see. That layout was completely unexpected to me.
>
> AFAICS the virtio_net driver tries to avoid accessing/using the
> virtio_net_hdr after the XDP program execution, so nothing tragic
> should happen.
>
> @Michael, @Jason, I guess the above is like that by design? Isn't it a
> bit fragile?

YES. We process it carefully. That brings some troubles, we hope to put
the virtio-net header to the vring desc like other NICs. But that is a
big project.

I think this patch is ok, this can be merged to net-next firstly.

Thanks.

> > Thanks!
> >
> > Paolo
>
[PATCH 1/2] tracing: Fix snapshot counter going between two tracers that use it
From: "Steven Rostedt (Google)"

Running the ftrace selftests caused the ring buffer mapping test to fail.
Investigating, I found that the snapshot counter would be incremented
every time a tracer that uses the snapshot is enabled even if the
snapshot was used by the previous tracer.

That is:

 # cd /sys/kernel/tracing
 # echo wakeup_rt > current_tracer
 # echo wakeup_dl > current_tracer
 # echo nop > current_tracer

would leave the snapshot counter at 1 and not zero. That's because the
enabling of wakeup_dl would increment the counter again but setting the
tracer to nop would only decrement it once.

Do not arm the snapshot for a tracer if the previous tracer already had
it armed.

Fixes: 16f7e48ffc53a ("tracing: Add snapshot refcount")
Signed-off-by: Steven Rostedt (Google)
---
 kernel/trace/trace.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index b7a870c8ae2a..480201c3b36e 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -6164,7 +6164,7 @@ int tracing_set_tracer(struct trace_array *tr, const char *buf)
 		tracing_disarm_snapshot(tr);
 	}
 
-	if (t->use_max_tr) {
+	if (!had_max_tr && t->use_max_tr) {
 		ret = tracing_arm_snapshot_locked(tr);
 		if (ret)
 			goto out;
-- 
2.43.0
[PATCH 0/2] tracing: Fix snapshot accounting
The ring buffer mapping test failed after running the ftrace tests. This
was due to some mismatched snapshot accounting that left the snapshot
counter elevated when the snapshot was not in use, which prevents the
ring buffer from being mapped.

Steven Rostedt (Google) (2):
      tracing: Fix snapshot counter going between two tracers that use it
      tracing: Decrement the snapshot if the snapshot trigger fails to register

 kernel/trace/trace.c                | 2 +-
 kernel/trace/trace_events_trigger.c | 5 ++++-
 2 files changed, 5 insertions(+), 2 deletions(-)
[PATCH 2/2] tracing: Decrement the snapshot if the snapshot trigger fails to register
From: "Steven Rostedt (Google)"

Running the ftrace selftests caused the ring buffer mapping test to fail.
Investigating, I found that the snapshot counter would be incremented
every time a snapshot trigger was added, even if that snapshot trigger
failed.

 # cd /sys/kernel/tracing
 # echo "snapshot" > events/sched/sched_process_fork/trigger
 # echo "snapshot" > events/sched/sched_process_fork/trigger
 -bash: echo: write error: File exists

That second one that fails increments the snapshot counter but doesn't
decrement it. It needs to be decremented when the registration fails.

Fixes: 16f7e48ffc53a ("tracing: Add snapshot refcount")
Signed-off-by: Steven Rostedt (Google)
---
 kernel/trace/trace_events_trigger.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/kernel/trace/trace_events_trigger.c b/kernel/trace/trace_events_trigger.c
index 62e4f58b8671..4bec043c8690 100644
--- a/kernel/trace/trace_events_trigger.c
+++ b/kernel/trace/trace_events_trigger.c
@@ -1491,7 +1491,10 @@ register_snapshot_trigger(char *glob,
 	if (ret < 0)
 		return ret;
 
-	return register_trigger(glob, data, file);
+	ret = register_trigger(glob, data, file);
+	if (ret < 0)
+		tracing_disarm_snapshot(file->tr);
+	return ret;
 }
 
 static void unregister_snapshot_trigger(char *glob,
-- 
2.43.0
Re: [PATCH] virtiofs: limit the length of ITER_KVEC dio by max_nopage_rw
Hi,

On 2/23/2024 3:49 AM, Michael S. Tsirkin wrote:
> On Wed, Jan 03, 2024 at 06:59:29PM +0800, Hou Tao wrote:
>> From: Hou Tao
>>
>> When trying to insert a 10MB kernel module kept in a virtiofs with cache
>> disabled, the following warning was reported:
>>
>> SNIP
>>
>> A feasible solution is to limit the value of max_read for virtiofs, so
>> the length passed to kmalloc() will be limited. However it will affects
>> the max read size for ITER_IOVEC io and the value of max_write also needs
>> limitation. So instead of limiting the values of max_read and max_write,
>> introducing max_nopage_rw to cap both the values of max_read and
>> max_write when the fuse dio read/write request is initiated from kernel.
>>
>> Considering that fuse read/write request from kernel is uncommon and to
>> decrease the demand for large contiguous pages, set max_nopage_rw as
>> 256KB instead of KMALLOC_MAX_SIZE - 4096 or similar.
>>
>> Fixes: a62a8ef9d97d ("virtio-fs: add virtiofs filesystem")
>> Signed-off-by: Hou Tao
>
> So what should I do with this patch? It includes fuse changes
> but of course I can merge too if no one wants to bother either way...

The patch had got some feedback from Bernd Schubert. And I will post v2
before next Thursday.
Re: [PATCH v9 10/15] x86/sgx: Add EPC reclamation in cgroup try_charge()
On Thu, 22 Feb 2024 15:26:05 -0600, Huang, Kai wrote:

> On 23/02/2024 6:09 am, Haitao Huang wrote:
>> On Wed, 21 Feb 2024 05:06:02 -0600, Huang, Kai wrote:
>>>
>>>> -int sgx_epc_cgroup_try_charge(struct sgx_epc_cgroup *epc_cg)
>>>> +int sgx_epc_cgroup_try_charge(struct sgx_epc_cgroup *epc_cg, bool reclaim)
>>>>  {
>>>> -	return misc_cg_try_charge(MISC_CG_RES_SGX_EPC, epc_cg->cg, PAGE_SIZE);
>>>> +	for (;;) {
>>>> +		if (!misc_cg_try_charge(MISC_CG_RES_SGX_EPC, epc_cg->cg,
>>>> +					PAGE_SIZE))
>>>> +			break;
>>>> +
>>>> +		if (sgx_epc_cgroup_lru_empty(epc_cg->cg))
>>>> +			return -ENOMEM;
>>>> +
>>>> +		if (signal_pending(current))
>>>> +			return -ERESTARTSYS;
>>>> +
>>>> +		if (!reclaim) {
>>>> +			queue_work(sgx_epc_cg_wq, &epc_cg->reclaim_work);
>>>> +			return -EBUSY;
>>>> +		}
>>>> +
>>>> +		if (!sgx_epc_cgroup_reclaim_pages(epc_cg->cg, false))
>>>> +			/* All pages were too young to reclaim, try again a little later */
>>>> +			schedule();
>>>> +	}
>>>> +
>>>> +	return 0;
>>>>  }
>>>
>>> Seems this code change is 90% similar to the existing code in the
>>> sgx_alloc_epc_page():
>>>
>>> 	...
>>> 	for ( ; ; ) {
>>> 		page = __sgx_alloc_epc_page();
>>> 		if (!IS_ERR(page)) {
>>> 			page->owner = owner;
>>> 			break;
>>> 		}
>>>
>>> 		if (list_empty(&sgx_active_page_list))
>>> 			return ERR_PTR(-ENOMEM);
>>>
>>> 		if (!reclaim) {
>>> 			page = ERR_PTR(-EBUSY);
>>> 			break;
>>> 		}
>>>
>>> 		if (signal_pending(current)) {
>>> 			page = ERR_PTR(-ERESTARTSYS);
>>> 			break;
>>> 		}
>>>
>>> 		sgx_reclaim_pages();
>>> 		cond_resched();
>>> 	}
>>> 	...
>>>
>>> Is it better to move the logic/code change in try_charge() out to
>>> sgx_alloc_epc_page() to unify them? IIUC, the logic is quite similar:
>>> When you either failed to allocate one page, or failed to charge one
>>> page, you try to reclaim EPC page(s) from the current EPC cgroup,
>>> either directly or indirectly.
>>>
>>> No?
>>
>> Only these lines are the same:
>>
>> 		if (!reclaim) {
>> 			page = ERR_PTR(-EBUSY);
>> 			break;
>> 		}
>>
>> 		if (signal_pending(current)) {
>> 			page = ERR_PTR(-ERESTARTSYS);
>> 			break;
>> 		}
>>
>> In sgx_alloc_epc_page() we do global reclamation but here we do
>> per-cgroup reclamation.
>
> But why? If we failed to allocate, shouldn't we try to reclaim from the
> _current_ EPC cgroup instead of global? E.g., I thought one enclave in
> one EPC cgroup requesting insane amount of EPC shouldn't impact
> enclaves inside other cgroups?

Right. When code reaches to here, we already passed reclaim per cgroup.
The cgroup may not be at or have reached its limit, but the system has
run out of physical EPC.

Thanks
Haitao
Re: [PATCH v9 13/15] x86/sgx: Turn on per-cgroup EPC reclamation
On 23/02/2024 5:36 am, Haitao Huang wrote:
> On Wed, 21 Feb 2024 05:23:00 -0600, Huang, Kai wrote:
>> On Mon, 2024-02-05 at 13:06 -0800, Haitao Huang wrote:
>>> From: Kristen Carlson Accardi
>>>
>>> Previous patches have implemented all infrastructure needed for
>>> per-cgroup EPC page tracking and reclaiming. But all reclaimable EPC
>>> pages are still tracked in the global LRU as sgx_lru_list() returns
>>> hard coded reference to the global LRU.
>>>
>>> Change sgx_lru_list() to return the LRU of the cgroup in which the
>>> given EPC page is allocated. This makes all EPC pages tracked in
>>> per-cgroup LRUs and the global reclaimer (ksgxd) will not be able to
>>> reclaim any pages from the global LRU. However, in cases of
>>> over-committing, i.e., sum of cgroup limits greater than the total
>>> capacity, cgroups may never reclaim but the total usage can still be
>>> near the capacity. Therefore global reclamation is still needed in
>>> those cases and it should reclaim from the root cgroup.
>>>
>>> Modify sgx_reclaim_pages_global(), to reclaim from the root EPC
>>> cgroup when cgroup is enabled, otherwise from the global LRU.
>>> Similarly, modify sgx_can_reclaim(), to check emptiness of LRUs of
>>> all cgroups when EPC cgroup is enabled, otherwise only check the
>>> global LRU.
>>>
>>> With these changes, the global reclamation and per-cgroup reclamation
>>> both work properly with all pages tracked in per-cgroup LRUs.
>>>
>>> Co-developed-by: Sean Christopherson
>>> Signed-off-by: Sean Christopherson
>>> Signed-off-by: Kristen Carlson Accardi
>>> Co-developed-by: Haitao Huang
>>> Signed-off-by: Haitao Huang
>>> ---
>>> V7:
>>> - Split this out from the big patch, #10 in V6. (Dave, Kai)
>>> ---
>>>  arch/x86/kernel/cpu/sgx/main.c | 16 +++++++++++++++-
>>>  1 file changed, 15 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
>>> index 6b0c26cac621..d4265a390ba9 100644
>>> --- a/arch/x86/kernel/cpu/sgx/main.c
>>> +++ b/arch/x86/kernel/cpu/sgx/main.c
>>> @@ -34,12 +34,23 @@ static struct sgx_epc_lru_list sgx_global_lru;
>>>
>>>  static inline struct sgx_epc_lru_list *sgx_lru_list(struct sgx_epc_page *epc_page)
>>>  {
>>> +#ifdef CONFIG_CGROUP_SGX_EPC
>>> +	if (epc_page->epc_cg)
>>> +		return &epc_page->epc_cg->lru;
>>> +
>>> +	/* This should not happen if kernel is configured correctly */
>>> +	WARN_ON_ONCE(1);
>>> +#endif
>>> 	return &sgx_global_lru;
>>>  }
>>
>> How about when EPC cgroup is enabled, but one enclave doesn't belong
>> to any EPC cgroup? Is it OK to track EPC pages for these enclaves to
>> the root EPC cgroup's LRU list together with other enclaves belongs
>> to the root cgroup?
>>
>> This should be a valid case, right?
>
> There is no such case. Each page is in the root by default.

Is it guaranteed by the (misc) cgroup design/implementation? If so
please add this information to the changelog and/or comments? It helps
non-cgroup expert like me to understand.
Re: [PATCH v9 08/15] x86/sgx: Implement EPC reclamation flows for cgroup
On 23/02/2024 6:20 am, Haitao Huang wrote:
> On Wed, 21 Feb 2024 05:00:27 -0600, Huang, Kai wrote:
>> On Wed, 2024-02-21 at 00:44 -0600, Haitao Huang wrote:
>>
>> [...]
>>
>> Here the @nr_to_scan is reduced by the number of pages that are
>> isolated, but not actually reclaimed (which is reflected by @cnt).
>>
>> IIUC, looks you want to make this function do "each cycle" as what you
>> mentioned in the v8 [1]:
>>
>>> I tested with that approach and found we can only target number of
>>> pages attempted to reclaim not pages actually reclaimed due to the
>>> uncertainty of how long it takes to reclaim pages. Besides targeting
>>> number of scanned pages for each cycle is also what the ksgxd does.
>>>
>>> If we target actual number of pages, sometimes it just takes too
>>> long. I saw more timeouts with the default time limit when running
>>> parallel selftests.
>>
>> I am not sure what does "sometimes it just takes too long" mean, but
>> what I am thinking is you are trying to do some perfect but yet
>> complicated code here.
>
> I think what I observed was that the try_charge() would block too long
> before getting chance of schedule() to yield, causing more timeouts
> than necessary. I'll do some re-test to be sure.

Looks this is a valid information that can be used to justify whatever
you are implementing in the EPC cgroup reclaiming function(s).

> I'll add some comments. Was assuming this is just following the old
> design as ksgxd.
>
> There were some comments at the beginning of
> sgx_epc_cgrooup_reclaim_page().
>
> 	/*
> 	 * Attempting to reclaim only a few pages will often fail and is
> 	 * inefficient, while reclaiming a huge number of pages can result in
> 	 * soft lockups due to holding various locks for an extended duration.
> 	 */
> 	unsigned int nr_to_scan = SGX_NR_TO_SCAN;
>
> I think it can be improved to emphasize we only "attempt" to finish
> scanning fixed number of pages for reclamation, not enforce number of
> pages successfully reclaimed.

Not sure need to be this comment, but at somewhere just state you are
trying to follow the ksgxd() (the current sgx_reclaim_pages()), but
trying to do it "_across_ given cgroup and all the descendants".

That's the reason you made @nr_to_scan as a pointer.

And also some text to explain why to follow ksgxd() -- not wanting to
block longer due to loop over descendants etc -- so we can focus on
discussing whether such justification is reasonable.
Re: [PATCH v9 08/15] x86/sgx: Implement EPC reclamation flows for cgroup
On 23/02/2024 9:12 am, Haitao Huang wrote:
> On Wed, 21 Feb 2024 04:48:58 -0600, Huang, Kai wrote:
>> On Wed, 2024-02-21 at 00:23 -0600, Haitao Huang wrote:
>>> Hi Kai
>>>
>>> On Tue, 20 Feb 2024 03:52:39 -0600, Huang, Kai wrote:
>>>
>>> [...]
>>>
>>>> So you introduced the work/workqueue here but there's no place which
>>>> actually queues the work. IMHO you can either:
>>>>
>>>> 1) move relevant code change here; or
>>>> 2) focus on introducing core functions to reclaim certain pages from
>>>> a given EPC cgroup w/o workqueue and introduce the work/workqueue in
>>>> later patch.
>>>>
>>>> Makes sense?
>>>>
>>> Starting in v7, I was trying to split the big patch, #10 in v6 as you
>>> and others suggested. My thought process was to put infrastructure
>>> needed for per-cgroup reclaim in the front, then turn on per-cgroup
>>> reclaim in [v9 13/15] in the end.
>>
>> That's reasonable for sure.
>
> Thanks for the confirmation :-)
>
>>> Before that, all reclaimables are tracked in the global LRU so really
>>> there is no "reclaim certain pages from a given EPC cgroup w/o
>>> workqueue" or reclaim through workqueue before that point, as
>>> suggested in #2. This patch puts down the implementation for both
>>> flows but neither used yet, as stated in the commit message.
>>
>> I know it's not used yet. The point is how to split patches to make
>> them more self-contain and easy to review.
>
> I would think this patch already self-contained in that all are
> implementation of cgroup reclamation building blocks utilized later.
> But I'll try to follow your suggestions below to split further (would
> prefer not to merge in general unless there is strong reasons).
>
>> For #2, sorry for not being explicit -- I meant it seems it's more
>> reasonable to split in this way:
>>
>> Patch 1)
>>   a). change to sgx_reclaim_pages();
>
> I'll still prefer this to be a separate patch. It is self-contained
> IMHO. We were splitting the original patch because it was too big. I
> don't want to merge back unless there is a strong reason.
>
>>   b). introduce sgx_epc_cgroup_reclaim_pages();
>
> Ok.

If I got you right, I believe you want to have a cgroup variant function
following the same behaviour of the one for global reclaim, i.e., the
_current_ sgx_reclaim_pages(), which always tries to scan and reclaim
SGX_NR_TO_SCAN pages each time.

And this cgroup variant function, sgx_epc_cgroup_reclaim_pages(), tries
to scan and reclaim SGX_NR_TO_SCAN pages each time "_across_ the cgroup
and all the descendants".

And you want to implement sgx_epc_cgroup_reclaim_pages() in this way due
to WHATEVER reasons.

In that case, the change to sgx_reclaim_pages() and the introduce of
sgx_epc_cgroup_reclaim_pages() should really be together because they
are completely tied together in terms of implementation.

In this way you can just explain clearly in _ONE_ patch why you choose
this implementation, and for reviewer it's also easier to review because
we can just discuss in one patch.

Makes sense?

>>   c). introduce sgx_epc_cgroup_reclaim_work_func() (use a better
>>   name), which just takes an EPC cgroup as input w/o involving any
>>   work/workqueue.
>
> This is for the workqueue use only. So I think it'd be better be with
> patch #2 below?

There are multiple levels of logic here IMHO:

1. a) and b) above focus on "each reclaim" a given EPC cgroup
2. c) is about a loop of above to bring given cgroup's usage to limit
3. workqueue is one (probably best) way to do c) in async way
4. the logic where 1) (direct reclaim) and 3) (indirect) are triggered

To me, it's clear 1) should be in one patch as stated above.

Also, to me 3) and 4) are better to be together since they give you a
clear view on how the direct/indirect reclaim are triggered.

2) could be flexible depending on how you see it. If you prefer viewing
it from low-level implementation of reclaiming pages from cgroup, then
it's also OK to be together with 1). If you want to treat it as a part
of _async_ way of bring down usage to limit, then _MAYBE_ it's also OK
to be with 3) and 4).

But to me 2) can be together with 1) or even a separate patch because
it's still kinda of low-level reclaiming details. 3) and 4) shouldn't
contain such detail but should focus on how direct/indirect reclaim is
done.

[...]

> To be honest, the part I'm feeling most confusing is this
> self-contained-ness. It seems depend on how you look at things.

Completely understand. But I think our discussion should be helpful to
both of us and others.
Re: [PATCH v4 2/4] tracing/user_events: Introduce multi-format events
On Thu, 22 Feb 2024 00:18:05 +0000
Beau Belgrave wrote:

> Currently user_events supports 1 event with the same name and must have
> the exact same format when referenced by multiple programs. This opens
> an opportunity for malicous or poorly thought through programs to

malicious? ;-)

-- Steve

> create events that others use with different formats. Another scenario
> is user programs wishing to use the same event name but add more fields
> later when the software updates. Various versions of a program may be
> running side-by-side, which is prevented by the current single format
> requirement.
>
> Add a new register flag (USER_EVENT_REG_MULTI_FORMAT) which indicates
> the user program wishes to use the same user_event name, but may have
> several different formats of the event. When this flag is used, create
> the underlying tracepoint backing the user_event with a unique name
> per-version of the format. It's important that existing ABI users do
> not get this logic automatically, even if one of the multi format
> events matches the format. This ensures existing programs that create
> events and assume the tracepoint name will match exactly continue to
> work as expected. Add logic to only check multi-format events with
> other multi-format events and single-format events to only check
> single-format events during find.
>
> Change system name of the multi-format event tracepoint to ensure that
> multi-format events are isolated completely from single-format events.
> This prevents single-format names from conflicting with multi-format
> events if they end with the same suffix as the multi-format events.
>
> Add a register_name (reg_name) to the user_event struct which allows for
> split naming of events. We now have the name that was used to register
> within user_events as well as the unique name for the tracepoint. Upon
> registering events ensure matches based on first the reg_name, followed
> by the fields and format of the event. This allows for multiple events
> with the same registered name to have different formats. The underlying
> tracepoint will have a unique name in the format of {reg_name}.{unique_id}.
>
> For example, if both "test u32 value" and "test u64 value" are used with
> the USER_EVENT_REG_MULTI_FORMAT the system would have 2 unique
> tracepoints. The dynamic_events file would then show the following:
>
>   u:test u64 count
>   u:test u32 count
>
> The actual tracepoint names look like this:
>
>   test.0
>   test.1
>
> Both would be under the new user_events_multi system name to prevent the
> older ABI from being used to squat on multi-formatted events and block
> their use.
>
> Deleting events via "!u:test u64 count" would only delete the first
> tracepoint that matched that format. When the delete ABI is used all
> events with the same name will be attempted to be deleted. If
> per-version deletion is required, user programs should either not use
> persistent events or delete them via dynamic_events.
>
> Signed-off-by: Beau Belgrave
Re: [PATCH v4 1/4] tracing/user_events: Prepare find/delete for same name events
On Thu, 22 Feb 2024 00:18:04 + Beau Belgrave wrote: > The current code for finding and deleting events assumes that there will > never be cases when user_events are registered with the same name, but > different formats. Scenarios exist where programs want to use the same > name but have different formats. An example is multiple versions of a > program running side-by-side using the same event name, but with updated > formats in each version. > > This change does not yet allow for multi-format events. If user_events > are registered with the same name but different arguments the programs > see the same return values as before. This change simply makes it > possible to easily accomodate for this. accommodate? ;-) > > Update find_user_event() to take in argument parameters and register > flags to accomodate future multi-format event scenarios. Have find accommodate > validate argument matching and return error pointers to cover when > an existing event has the same name but different format. Update > callers to handle error pointer logic. > > Move delete_user_event() to use hash walking directly now that > find_user_event() has changed. Delete all events found that match the > register name, stop if an error occurs and report back to the user. > > Update user_fields_match() to cover list_empty() scenarios now that > find_user_event() uses it directly. This makes the logic consistent > accross several callsites. across ? I'll update this. -- Steve > > Signed-off-by: Beau Belgrave
Re: [PATCH v19 0/3] vfio/nvgrace-gpu: Add vfio pci variant module for grace hopper
On Tue, 20 Feb 2024 17:20:52 +0530 wrote: > From: Ankit Agrawal > > NVIDIA's upcoming Grace Hopper Superchip provides a PCI-like device > for the on-chip GPU that is the logical OS representation of the > internal proprietary chip-to-chip cache coherent interconnect. > > The device is peculiar compared to a real PCI device in that whilst > there is a real 64b PCI BAR1 (comprising region 2 & region 3) on the > device, it is not used to access device memory once the faster > chip-to-chip interconnect is initialized (occurs at the time of host > system boot). The device memory is accessed instead using the > chip-to-chip interconnect that is exposed as a contiguous physically > addressable region on the host. Since the device memory is cache > coherent with the CPU, it can be mmaped into the user VMA with a > cacheable mapping and used like regular RAM. The device memory is > not added to the host kernel, but mapped directly as this reduces > memory wastage due to struct pages. > > There is also a requirement of a minimum reserved 1G uncached region > (termed as resmem) to support the Multi-Instance GPU (MIG) feature [1]. > This is to work around a HW defect. Based on [2], the requisite properties > (uncached, unaligned access) can be achieved through a VM mapping (S1) > of NORMAL_NC and host (S2) mapping with MemAttr[2:0]=0b101. To provide > a different non-cached property to the reserved 1G region, it needs to > be carved out from the device memory and mapped as a separate region > in Qemu VMA with pgprot_writecombine(). pgprot_writecombine() sets > the Qemu VMA page properties (pgprot) as NORMAL_NC. > > Provide a VFIO PCI variant driver that adapts the unique device memory > representation into a more standard PCI representation facing userspace. > > The variant driver exposes these two regions - the non-cached reserved > (resmem) and the cached rest of the device memory (termed as usemem) as > separate VFIO 64b BAR regions.
This is divergent from the baremetal > approach, where the device memory is exposed as a device memory region. > The decision for a different approach was taken in view of the fact that > it would necessitate additional code in Qemu to discover and insert those > regions in the VM IPA, along with the additional VM ACPI DSDT changes to > communicate the device memory region IPA to the VM workloads. Moreover, > this behavior would have to be added to a variety of emulators (beyond > top of tree Qemu) out there desiring grace hopper support. > > Since the device implements 64-bit BAR0, the VFIO PCI variant driver > maps the uncached carved out region to the next available PCI BAR (i.e. > comprising regions 2 and 3). The cached device memory aperture is > assigned BAR regions 4 and 5. Qemu will then naturally generate a PCI > device in the VM with the uncached aperture reported as the BAR2 region, > the cacheable as BAR4. The variant driver provides emulation for these > fake BARs' PCI config space offset registers. > > The hardware ensures that the system does not crash when the memory > is accessed with the memory enable turned off. It synthesizes ~0 reads > and dropped writes on such access. So there is no need to support the > disablement/enablement of BAR through the PCI_COMMAND config space register. > > The memory layout on the host looks like the following:
>
>                devmem (memlength)
> |--------------------------------------------------|
> |-------------cached------------------------|--NC--|
> |                                            |
> usemem.memphys                         resmem.memphys
>
> PCI BARs need to be aligned to a power of 2, but the actual memory on the > device may not be. A read or write access to the physical address from the > last device PFN up to the next power-of-2 aligned physical address > results in reading ~0 and dropped writes. Note that the GPU device > driver [6] is capable of knowing the exact device memory size through > separate means. The device memory size is primarily kept in the system > ACPI tables for use by the VFIO PCI variant module.
> > Note that the usemem memory is added by the VM Nvidia device driver [5] > to the VM kernel as memblocks. Hence make the usable memory size memblock > (MEMBLK_SIZE) aligned. This is a hardwired ABI value between the GPU FW and > VFIO driver. The VM device driver makes use of the same value for its > calculation to determine USEMEM size. > > Currently there is no provision in KVM for an S2 mapping with > MemAttr[2:0]=0b101, but there is an ongoing effort to provide the same [3]. > As previously mentioned, resmem is mapped with pgprot_writecombine(), which > sets the Qemu VMA page properties (pgprot) as NORMAL_NC. Using the > proposed changes in [3] and [4], KVM marks the region with > MemAttr[2:0]=0b101 in S2. > > If the device memory properties are not present, the driver registers the > vfio-pci-core function pointers. Since there are no ACPI memory properties > generated for the VM, the
Re: [PATCH v9 10/15] x86/sgx: Add EPC reclamation in cgroup try_charge()
On 23/02/2024 6:09 am, Haitao Huang wrote: On Wed, 21 Feb 2024 05:06:02 -0600, Huang, Kai wrote: -int sgx_epc_cgroup_try_charge(struct sgx_epc_cgroup *epc_cg) +int sgx_epc_cgroup_try_charge(struct sgx_epc_cgroup *epc_cg, bool reclaim) { - return misc_cg_try_charge(MISC_CG_RES_SGX_EPC, epc_cg->cg, PAGE_SIZE); + for (;;) { + if (!misc_cg_try_charge(MISC_CG_RES_SGX_EPC, epc_cg->cg, + PAGE_SIZE)) + break; + + if (sgx_epc_cgroup_lru_empty(epc_cg->cg)) + return -ENOMEM; + + if (signal_pending(current)) + return -ERESTARTSYS; + + if (!reclaim) { + queue_work(sgx_epc_cg_wq, &epc_cg->reclaim_work); + return -EBUSY; + } + + if (!sgx_epc_cgroup_reclaim_pages(epc_cg->cg, false)) + /* All pages were too young to reclaim, try again a little later */ + schedule(); + } + + return 0; } Seems this code change is 90% similar to the existing code in the sgx_alloc_epc_page(): ... for ( ; ; ) { page = __sgx_alloc_epc_page(); if (!IS_ERR(page)) { page->owner = owner; break; } if (list_empty(&sgx_active_page_list)) return ERR_PTR(-ENOMEM); if (!reclaim) { page = ERR_PTR(-EBUSY); break; } if (signal_pending(current)) { page = ERR_PTR(-ERESTARTSYS); break; } sgx_reclaim_pages(); cond_resched(); } ... Is it better to move the logic/code change in try_charge() out to sgx_alloc_epc_page() to unify them? IIUC, the logic is quite similar: When you either failed to allocate one page, or failed to charge one page, you try to reclaim EPC page(s) from the current EPC cgroup, either directly or indirectly. No? Only these lines are the same: if (!reclaim) { page = ERR_PTR(-EBUSY); break; } if (signal_pending(current)) { page = ERR_PTR(-ERESTARTSYS); break; } In sgx_alloc_epc_page() we do global reclamation but here we do per-cgroup reclamation. But why? If we failed to allocate, shouldn't we try to reclaim from the _current_ EPC cgroup instead of global? E.g., I thought one enclave in one EPC cgroup requesting an insane amount of EPC shouldn't impact enclaves inside other cgroups?
That's why the logic of the other lines is different though they look similar due to similar function names. For the global reclamation we need to consider the case in which the cgroup is not enabled. Similarly, list_empty(&sgx_active_page_list) would have to be changed to check the root cgroup if cgroups are enabled, otherwise check the global LRU. The (!reclaim) case is also different. W/o getting clear on my above question, so far I am not convinced why such differences cannot be hidden inside wrapper function(s). So I don't see an obvious good way to abstract those to get meaningful savings. Thanks Haitao
[PATCH v2 4/4] tracing: Use EVENT_NULL_STR macro instead of open coding "(null)"
From: "Steven Rostedt (Google)" The TRACE_EVENT macros have a dependency when a __string() field is NULL: they save "(null)" as the string. This string is also used by __assign_str(). It's better to create a single macro instead of having something that will not be caught by the compiler if there is an unfortunate typo. Suggested-by: Mathieu Desnoyers Signed-off-by: Steven Rostedt (Google) --- include/linux/trace_events.h | 3 +++ include/trace/events/sunrpc.h| 12 ++-- include/trace/stages/stage5_get_offsets.h| 4 ++-- include/trace/stages/stage6_event_callback.h | 8 4 files changed, 15 insertions(+), 12 deletions(-) diff --git a/include/linux/trace_events.h b/include/linux/trace_events.h index d68ff9b1247f..b9d88aced7e1 100644 --- a/include/linux/trace_events.h +++ b/include/linux/trace_events.h @@ -17,6 +17,9 @@ struct dentry; struct bpf_prog; union bpf_attr; +/* Used for event string fields when they are NULL */ +#define EVENT_NULL_STR "(null)" + const char *trace_print_flags_seq(struct trace_seq *p, const char *delim, unsigned long flags, const struct trace_print_flags *flag_array); diff --git a/include/trace/events/sunrpc.h b/include/trace/events/sunrpc.h index cdd3a45e6003..ce6a85b82afa 100644 --- a/include/trace/events/sunrpc.h +++ b/include/trace/events/sunrpc.h @@ -1327,18 +1327,18 @@ TRACE_EVENT(xs_stream_read_data, __field(ssize_t, err) __field(size_t, total) __string(addr, xprt ? xprt->address_strings[RPC_DISPLAY_ADDR] : - "(null)") + EVENT_NULL_STR) __string(port, xprt ? xprt->address_strings[RPC_DISPLAY_PORT] : - "(null)") + EVENT_NULL_STR) ), TP_fast_assign( __entry->err = err; __entry->total = total; __assign_str(addr, xprt ? - xprt->address_strings[RPC_DISPLAY_ADDR] : "(null)"); + xprt->address_strings[RPC_DISPLAY_ADDR] : EVENT_NULL_STR); __assign_str(port, xprt ?
- xprt->address_strings[RPC_DISPLAY_PORT] : "(null)"); + xprt->address_strings[RPC_DISPLAY_PORT] : EVENT_NULL_STR); ), TP_printk("peer=[%s]:%s err=%zd total=%zu", __get_str(addr), @@ -1783,7 +1783,7 @@ TRACE_EVENT(svc_process, __string(service, name) __string(procedure, svc_proc_name(rqst)) __string(addr, rqst->rq_xprt ? -rqst->rq_xprt->xpt_remotebuf : "(null)") +rqst->rq_xprt->xpt_remotebuf : EVENT_NULL_STR) ), TP_fast_assign( @@ -1793,7 +1793,7 @@ TRACE_EVENT(svc_process, __assign_str(service, name); __assign_str(procedure, svc_proc_name(rqst)); __assign_str(addr, rqst->rq_xprt ? -rqst->rq_xprt->xpt_remotebuf : "(null)"); +rqst->rq_xprt->xpt_remotebuf : EVENT_NULL_STR); ), TP_printk("addr=%s xid=0x%08x service=%s vers=%u proc=%s", diff --git a/include/trace/stages/stage5_get_offsets.h b/include/trace/stages/stage5_get_offsets.h index 20b801ed3fd4..e6b96608f452 100644 --- a/include/trace/stages/stage5_get_offsets.h +++ b/include/trace/stages/stage5_get_offsets.h @@ -47,7 +47,7 @@ #undef __string #define __string(item, src) __dynamic_array(char, item, \ - strlen((const char *)(src) ? : "(null)") + 1) \ + strlen((const char *)(src) ? : EVENT_NULL_STR) + 1) \ __data_offsets->item##_ptr_ = src; #undef __string_len @@ -70,7 +70,7 @@ #undef __rel_string #define __rel_string(item, src) __rel_dynamic_array(char, item, \ - strlen((const char *)(src) ? : "(null)") + 1) \ + strlen((const char *)(src) ? : EVENT_NULL_STR) + 1) \ __data_offsets->item##_ptr_ = src; #undef __rel_string_len diff --git a/include/trace/stages/stage6_event_callback.h b/include/trace/stages/stage6_event_callback.h index 38732855eadb..2bfd49713b42 100644 --- a/include/trace/stages/stage6_event_callback.h +++ b/include/trace/stages/stage6_event_callback.h @@ -32,14 +32,14 @@ #undef __assign_str #define __assign_str(dst, src) \ - memcpy(__get_str(dst), __data_offsets.dst##_ptr_ ? : "(null)", \ + memcpy(__get_str(dst), __data_offsets.dst##_ptr_ ? 
: EVENT_NULL_STR, \ __get_dynamic_array_len(dst)) #undef __assign_str_len #define __assign_str_len(dst, src, len) \ do {\
[PATCH v2 3/4] tracing: Use ? : shortcut in trace macros
From: "Steven Rostedt (Google)" Instead of having: #define __assign_str(dst, src)\ memcpy(__get_str(dst), __data_offsets.dst##_ptr_ ? \ __data_offsets.dst##_ptr_ : "(null)", \ __get_dynamic_array_len(dst)) Use the ? : shortcut and compact it down to: #define __assign_str(dst, src)\ memcpy(__get_str(dst), __data_offsets.dst##_ptr_ ? : "(null)", \ __get_dynamic_array_len(dst)) Suggested-by: Mathieu Desnoyers Signed-off-by: Steven Rostedt (Google) --- include/trace/stages/stage5_get_offsets.h| 4 ++-- include/trace/stages/stage6_event_callback.h | 14 ++ 2 files changed, 8 insertions(+), 10 deletions(-) diff --git a/include/trace/stages/stage5_get_offsets.h b/include/trace/stages/stage5_get_offsets.h index 45f8151cf622..20b801ed3fd4 100644 --- a/include/trace/stages/stage5_get_offsets.h +++ b/include/trace/stages/stage5_get_offsets.h @@ -47,7 +47,7 @@ #undef __string #define __string(item, src) __dynamic_array(char, item, \ - strlen((src) ? (const char *)(src) : "(null)") + 1) \ + strlen((const char *)(src) ? : "(null)") + 1) \ __data_offsets->item##_ptr_ = src; #undef __string_len @@ -70,7 +70,7 @@ #undef __rel_string #define __rel_string(item, src) __rel_dynamic_array(char, item, \ - strlen((src) ? (const char *)(src) : "(null)") + 1) \ + strlen((const char *)(src) ? : "(null)") + 1) \ __data_offsets->item##_ptr_ = src; #undef __rel_string_len diff --git a/include/trace/stages/stage6_event_callback.h b/include/trace/stages/stage6_event_callback.h index c0e5d097324e..38732855eadb 100644 --- a/include/trace/stages/stage6_event_callback.h +++ b/include/trace/stages/stage6_event_callback.h @@ -32,15 +32,14 @@ #undef __assign_str #define __assign_str(dst, src) \ - memcpy(__get_str(dst), __data_offsets.dst##_ptr_ ? \ - __data_offsets.dst##_ptr_ : "(null)",\ + memcpy(__get_str(dst), __data_offsets.dst##_ptr_ ? 
: "(null)", \ __get_dynamic_array_len(dst)) #undef __assign_str_len #define __assign_str_len(dst, src, len) \ do {\ - memcpy(__get_str(dst), __data_offsets.dst##_ptr_ ? \ - __data_offsets.dst##_ptr_ : "(null)", len); \ + memcpy(__get_str(dst), \ + __data_offsets.dst##_ptr_ ? : "(null)", len);\ __get_str(dst)[len] = '\0'; \ } while(0) @@ -95,15 +94,14 @@ #undef __assign_rel_str #define __assign_rel_str(dst, src) \ - memcpy(__get_rel_str(dst), __data_offsets.dst##_ptr_ ? \ - __data_offsets.dst##_ptr_ : "(null)",\ + memcpy(__get_rel_str(dst), __data_offsets.dst##_ptr_ ? : "(null)", \ __get_rel_dynamic_array_len(dst)) #undef __assign_rel_str_len #define __assign_rel_str_len(dst, src, len)\ do {\ - memcpy(__get_rel_str(dst), __data_offsets.dst##_ptr_ ? \ - __data_offsets.dst##_ptr_ : "(null)", len); \ + memcpy(__get_rel_str(dst), \ + __data_offsets.dst##_ptr_ ? : "(null)", len);\ __get_rel_str(dst)[len] = '\0'; \ } while (0) -- 2.43.0
[PATCH v2 2/4] tracing: Do not calculate strlen() twice for __string() fields
From: "Steven Rostedt (Google)" The TRACE_EVENT() macro handles dynamic strings by having: TP_PROTO(struct some_struct *s), TP_ARGS(s), TP_STRUCT__entry( __string(my_string, s->string) ), TP_fast_assign( __assign_str(my_string, s->string); ) TP_printk("%s", __get_str(my_string)) There's even some code that may call a function helper to find the s->string value. The problem with the above is that the work to get the s->string is done twice. Once at the __string() and again in the __assign_str(). The length of the string is calculated via a strlen(), not once, but twice. Once during the __string() macro and again in __assign_str(). But the length is actually already recorded in the data location and there's no reason to call strlen() again. Just use the length that was saved in the __string() code for the __assign_str() code. Signed-off-by: Steven Rostedt (Google) --- include/trace/stages/stage6_event_callback.h | 10 ++ 1 file changed, 6 insertions(+), 4 deletions(-) diff --git a/include/trace/stages/stage6_event_callback.h b/include/trace/stages/stage6_event_callback.h index b3e2f321e787..c0e5d097324e 100644 --- a/include/trace/stages/stage6_event_callback.h +++ b/include/trace/stages/stage6_event_callback.h @@ -32,8 +32,9 @@ #undef __assign_str #define __assign_str(dst, src) \ - strcpy(__get_str(dst), __data_offsets.dst##_ptr_ ? \ - __data_offsets.dst##_ptr_ : "(null)") + memcpy(__get_str(dst), __data_offsets.dst##_ptr_ ? \ + __data_offsets.dst##_ptr_ : "(null)",\ + __get_dynamic_array_len(dst)) #undef __assign_str_len #define __assign_str_len(dst, src, len) \ @@ -94,8 +95,9 @@ #undef __assign_rel_str #define __assign_rel_str(dst, src) \ - strcpy(__get_rel_str(dst), __data_offsets.dst##_ptr_ ? \ - __data_offsets.dst##_ptr_ : "(null)") + memcpy(__get_rel_str(dst), __data_offsets.dst##_ptr_ ? \ + __data_offsets.dst##_ptr_ : "(null)",\ + __get_rel_dynamic_array_len(dst)) #undef __assign_rel_str_len #define __assign_rel_str_len(dst, src, len)\ -- 2.43.0
[PATCH v2 1/4] tracing: Rework __assign_str() and __string() to not duplicate getting the string
From: "Steven Rostedt (Google)" The TRACE_EVENT() macro handles dynamic strings by having: TP_PROTO(struct some_struct *s), TP_ARGS(s), TP_STRUCT__entry( __string(my_string, s->string) ), TP_fast_assign( __assign_str(my_string, s->string); ) TP_printk("%s", __get_str(my_string)) There's even some code that may call a function helper to find the s->string value. The problem with the above is that the work to get the s->string is done twice. Once at the __string() and again in the __assign_str(). But the __string() uses dynamic_array() which has a helper structure that is created holding the offsets and length of the string fields. Instead of finding the string twice, just save it off in another field from that helper structure, and have __assign_str() use that instead. Note, this also means that the second parameter of __assign_str() isn't even used anymore, and may be removed in the future. Signed-off-by: Steven Rostedt (Google) --- include/trace/stages/stage2_data_offsets.h | 4 ++-- include/trace/stages/stage5_get_offsets.h| 15 ++- include/trace/stages/stage6_event_callback.h | 12 3 files changed, 20 insertions(+), 11 deletions(-) diff --git a/include/trace/stages/stage2_data_offsets.h b/include/trace/stages/stage2_data_offsets.h index 469b6a64293d..8b0cff06d346 100644 --- a/include/trace/stages/stage2_data_offsets.h +++ b/include/trace/stages/stage2_data_offsets.h @@ -24,7 +24,7 @@ #define __array(type, item, len) #undef __dynamic_array -#define __dynamic_array(type, item, len) u32 item; +#define __dynamic_array(type, item, len) u32 item; const void *item##_ptr_; #undef __string #define __string(item, src) __dynamic_array(char, item, -1) @@ -45,7 +45,7 @@ #define __sockaddr(field, len) __dynamic_array(u8, field, len) #undef __rel_dynamic_array -#define __rel_dynamic_array(type, item, len) u32 item; +#define __rel_dynamic_array(type, item, len) u32 item; const void *item##_ptr_; #undef __rel_string #define __rel_string(item, src) __rel_dynamic_array(char, item, 
-1) diff --git a/include/trace/stages/stage5_get_offsets.h b/include/trace/stages/stage5_get_offsets.h index e30a13be46ba..45f8151cf622 100644 --- a/include/trace/stages/stage5_get_offsets.h +++ b/include/trace/stages/stage5_get_offsets.h @@ -47,10 +47,12 @@ #undef __string #define __string(item, src) __dynamic_array(char, item, \ - strlen((src) ? (const char *)(src) : "(null)") + 1) + strlen((src) ? (const char *)(src) : "(null)") + 1) \ + __data_offsets->item##_ptr_ = src; #undef __string_len -#define __string_len(item, src, len) __dynamic_array(char, item, (len) + 1) +#define __string_len(item, src, len) __dynamic_array(char, item, (len) + 1)\ + __data_offsets->item##_ptr_ = src; #undef __vstring #define __vstring(item, fmt, ap) __dynamic_array(char, item, \ @@ -67,11 +69,14 @@ __data_size += __item_length; #undef __rel_string -#define __rel_string(item, src) __rel_dynamic_array(char, item, \ - strlen((src) ? (const char *)(src) : "(null)") + 1) +#define __rel_string(item, src) __rel_dynamic_array(char, item, \ + strlen((src) ? (const char *)(src) : "(null)") + 1) \ + __data_offsets->item##_ptr_ = src; #undef __rel_string_len -#define __rel_string_len(item, src, len) __rel_dynamic_array(char, item, (len) + 1) +#define __rel_string_len(item, src, len) __rel_dynamic_array(char, item, (len) + 1)\ + __data_offsets->item##_ptr_ = src; + /* * __bitmask_size_in_bytes_raw is the number of bytes needed to hold * num_possible_cpus(). diff --git a/include/trace/stages/stage6_event_callback.h b/include/trace/stages/stage6_event_callback.h index 919b1a4da980..b3e2f321e787 100644 --- a/include/trace/stages/stage6_event_callback.h +++ b/include/trace/stages/stage6_event_callback.h @@ -32,12 +32,14 @@ #undef __assign_str #define __assign_str(dst, src) \ - strcpy(__get_str(dst), (src) ? (const char *)(src) : "(null)"); + strcpy(__get_str(dst), __data_offsets.dst##_ptr_ ? 
\ + __data_offsets.dst##_ptr_ : "(null)") #undef __assign_str_len #define __assign_str_len(dst, src, len) \ do {\ - memcpy(__get_str(dst), (src), (len)); \ + memcpy(__get_str(dst), __data_offsets.dst##_ptr_ ? \ + __data_offsets.dst##_ptr_ : "(null)", len); \ __get_str(dst)[len] = '\0'; \ } while(0) @@ -92,12 +94,14 @@ #undef __assign_rel_str #define __assign_rel_str(dst, src) \ - strcpy(__get_rel_str(dst), (src) ? (const char
[PATCH v2 0/4] tracing: Optimize __string()/__assign_str() processing
The TRACE_EVENT() macro handles dynamic strings by having: TP_PROTO(struct some_struct *s), TP_ARGS(s), TP_STRUCT__entry( __string(my_string, s->string) ), TP_fast_assign( __assign_str(my_string, s->string); ) TP_printk("%s", __get_str(my_string)) There's even some code that may call a function helper to find the s->string value. The problem with the above is that the work to get the s->string is done twice. Once at the __string() and again in the __assign_str(). The length of the string is calculated via a strlen(), not once, but twice (the second time implicitly, via strcpy()). The length is actually already recorded in the data location from __string() and there's no reason to call strcpy() in __assign_str() as the length is already known. The __string() macro uses dynamic_array() which has a helper structure that is created holding the offsets and length of the string fields. Instead of finding the string twice, just save it off in another field in that helper structure, and have __assign_str() use that instead. Changes since v1: https://lore.kernel.org/linux-trace-kernel/20240222195111.139824...@goodmis.org/ - Both the __dynamic_array() and __rel_dynamic_array() macros end with a semicolon and are used by __string() and __rel_string() respectively. But __string() doesn't have a semicolon after __dynamic_array() but __rel_string does have a semicolon after __rel_dynamic_array() which is unneeded. Remove the unnecessary semicolon to keep the two consistent. - Fixed __rel_string_len() that was incorrectly using __get_str() and not __get_rel_str(). - Added two cleanup patches. One to use the ?: shortcut and the other to use the new macro EVENT_NULL_STR instead of open coding "(null)" as the string must be consistent between __string() and __assign_str(). Steven Rostedt (Google) (4): tracing: Rework __assign_str() and __string() to not duplicate getting the string tracing: Do not calculate strlen() twice for __string() fields tracing: Use ?
: shortcut in trace macros tracing: Use EVENT_NULL_STR macro instead of open coding "(null)" include/linux/trace_events.h | 3 +++ include/trace/events/sunrpc.h| 12 ++-- include/trace/stages/stage2_data_offsets.h | 4 ++-- include/trace/stages/stage5_get_offsets.h| 15 ++- include/trace/stages/stage6_event_callback.h | 12 5 files changed, 29 insertions(+), 17 deletions(-)
Re: [PATCH] vduse: implement DMA sync callbacks
On Tue, Feb 20, 2024 at 01:01:36AM -0800, Christoph Hellwig wrote: > On Mon, Feb 19, 2024 at 06:06:06PM +0100, Maxime Coquelin wrote: > > Since commit 295525e29a5b ("virtio_net: merge dma > > operations when filling mergeable buffers"), VDUSE devices > > require support for DMA's .sync_single_for_cpu() operation > > as the memory is non-coherent between the device and CPU > > because of the use of a bounce buffer. > > > > This patch implements both .sync_single_for_cpu() and > > sync_single_for_device() callbacks, and also skips bounce > > buffer copies during DMA map and unmap operations if the > > DMA_ATTR_SKIP_CPU_SYNC attribute is set to avoid extra > > copies of the same buffer. > > vduse really needs to get out of implementing fake DMA operations for > something that is not DMA. In a sense ... but on the other hand, the "fake DMA" metaphor seems to work surprisingly well, like in this instance - internal bounce buffer looks a bit like non-coherent DMA. A way to make this all prettier would I guess be to actually wrap all of DMA with virtio wrappers which would all go if () dma_... else vduse_...; or something to this end. A lot of work for sure, and is it really worth it? if the only crazy driver is vduse I'd maybe rather keep the crazy hacks local there ... -- MST
Re: [PATCH v9 08/15] x86/sgx: Implement EPC reclamation flows for cgroup
On Wed, 21 Feb 2024 04:48:58 -0600, Huang, Kai wrote: On Wed, 2024-02-21 at 00:23 -0600, Haitao Huang wrote: Hi Kai On Tue, 20 Feb 2024 03:52:39 -0600, Huang, Kai wrote: [...] > > So you introduced the work/workqueue here but there's no place which > actually > queues the work. IMHO you can either: > > 1) move relevant code change here; or > 2) focus on introducing core functions to reclaim certain pages from a > given EPC > cgroup w/o workqueue and introduce the work/workqueue in later patch. > > Makes sense? > Starting in v7, I was trying to split the big patch, #10 in v6 as you and others suggested. My thought process was to put infrastructure needed for per-cgroup reclaim in the front, then turn on per-cgroup reclaim in [v9 13/15] in the end. That's reasonable for sure. Thanks for the confirmation :-) Before that, all reclaimables are tracked in the global LRU so really there is no "reclaim certain pages from a given EPC cgroup w/o workqueue" or reclaim through workqueue before that point, as suggested in #2. This patch puts down the implementation for both flows but neither is used yet, as stated in the commit message. I know it's not used yet. The point is how to split patches to make them more self-contained and easy to review. I would think this patch is already self-contained in that all are implementation of cgroup reclamation building blocks utilized later. But I'll try to follow your suggestions below to split further (would prefer not to merge in general unless there are strong reasons). For #2, sorry for not being explicit -- I meant it seems it's more reasonable to split in this way: Patch 1) a). change to sgx_reclaim_pages(); I'll still prefer this to be a separate patch. It is self-contained IMHO. We were splitting the original patch because it was too big. I don't want to merge back unless there is a strong reason. b). introduce sgx_epc_cgroup_reclaim_pages(); Ok. c).
introduce sgx_epc_cgroup_reclaim_work_func() (use a better name), which just takes an EPC cgroup as input w/o involving any work/workqueue. This is for the workqueue use only. So I think it'd be better placed with patch #2 below? These functions are all related to how to implement reclaiming pages from a given EPC cgroup, and they are logically related in terms of implementation, thus easier to review together. This is pretty much the current patch + sgx_reclaim_pages() - workqueue. Then you just need to justify the design/implementation in changelog/comments. How about just doing b) in patch #1, and stating the new function is the building block and will be used for both direct and indirect reclamation? Patch 2) - Introduce work/workqueue, and implement the logic to queue the work. Now we all know there's a function to reclaim pages for a given EPC cgroup, then we can focus on when that is called, either directly or indirectly. The try_charge() function will do both actually. For indirect, it queues the work to the wq. For direct it just calls sgx_epc_cgroup_reclaim_pages(). That part is already in a separate (I think self-contained) patch [v9, 10/15]. So for this patch, I'll add sgx_epc_cgroup_reclaim_work_func() and introduce work/workqueue so later work can be queued? #1 would force me to go back and merge the patches again. I don't think so. I am not asking to put all things together, but only asking to split in a better way (that I see). Okay. You mentioned some function is "Scheduled by sgx_epc_cgroup_try_charge() to reclaim pages", but I am not seeing any code doing that in this patch. This needs fixing, either by moving relevant code here, or removing these not-done-yet comments. Yes. The comments will be fixed. For instance (I am just giving an example), if after review we found the queue_work() shouldn't be done in try_charge(), you will need to go back to this patch and remove these comments. That's not the best way. Each patch needs to be self-contained.
Sorry, I feel kind of lost on this whole thing by now. It seems so random to me. Are there hard rules on this? See above. I am only offering my opinion on how to split patches in a better way. To be honest, the part I find most confusing is this self-contained-ness. It seems to depend on how you look at things. I was hoping these statements would help reviewers on the flow of the patches. At the end of [v9 04/15]: For now, the EPC cgroup simply blocks additional EPC allocation in sgx_alloc_epc_page() when the limit is reached. Reclaimable pages are still tracked in the global active list, only reclaimed by the global reclaimer when the total free page count is lower than a threshold. Later patches will reorganize the tracking and reclamation code in the global reclaimer and implement per-cgroup tracking and reclaiming. At the end of [v9 06/15]: Next patches will first get the cgroup
Re: [PATCH net-next v5] virtio_net: Support RX hash XDP hint
On Fri, Feb 09, 2024 at 01:57:25PM +0100, Paolo Abeni wrote: > On Fri, 2024-02-09 at 18:39 +0800, Liang Chen wrote: > > On Wed, Feb 7, 2024 at 10:27 PM Paolo Abeni wrote: > > > > > > On Wed, 2024-02-07 at 10:54 +0800, Liang Chen wrote: > > > > On Tue, Feb 6, 2024 at 6:44 PM Paolo Abeni wrote: > > > > > > > > > > On Sat, 2024-02-03 at 10:56 +0800, Liang Chen wrote: > > > > > > On Sat, Feb 3, 2024 at 12:20 AM Jesper Dangaard Brouer > > > > > > wrote: > > > > > > > On 02/02/2024 13.11, Liang Chen wrote: > > > > > [...] > > > > > > > > @@ -1033,6 +1039,16 @@ static void put_xdp_frags(struct > > > > > > > > xdp_buff *xdp) > > > > > > > > } > > > > > > > > } > > > > > > > > > > > > > > > > +static void virtnet_xdp_save_rx_hash(struct virtnet_xdp_buff > > > > > > > > *virtnet_xdp, > > > > > > > > + struct net_device *dev, > > > > > > > > + struct > > > > > > > > virtio_net_hdr_v1_hash *hdr_hash) > > > > > > > > +{ > > > > > > > > + if (dev->features & NETIF_F_RXHASH) { > > > > > > > > + virtnet_xdp->hash_value = hdr_hash->hash_value; > > > > > > > > + virtnet_xdp->hash_report = hdr_hash->hash_report; > > > > > > > > + } > > > > > > > > +} > > > > > > > > + > > > > > > > > > > > > > > Would it be possible to store a pointer to hdr_hash in > > > > > > > virtnet_xdp_buff, > > > > > > > with the purpose of delaying extracting this, until and only if > > > > > > > XDP > > > > > > > bpf_prog calls the kfunc? > > > > > > > > > > > > > > > > > > > That seems to be the way v1 works, > > > > > > https://lore.kernel.org/all/20240122102256.261374-1-liangchen.li...@gmail.com/ > > > > > > . But it was pointed out that the inline header may be overwritten > > > > > > by > > > > > > the xdp prog, so the hash is copied out to maintain its integrity. > > > > > > > > > > Why? isn't XDP supposed to get write access only to the pkt > > > > > contents/buffer? > > > > > > > > > > > > > Normally, an XDP program accesses only the packet data. 
However, > > > > there's also an XDP RX Metadata area, referenced by the data_meta > > > > pointer. This pointer can be adjusted with bpf_xdp_adjust_meta to > > > > point somewhere ahead of the data buffer, thereby granting the XDP > > > > program access to the virtio header located immediately before the > > > > > > AFAICS bpf_xdp_adjust_meta() does not allow moving the meta_data before > > > xdp->data_hard_start: > > > > > > https://elixir.bootlin.com/linux/latest/source/net/core/filter.c#L4210 > > > > > > and virtio net set such field after the virtio_net_hdr: > > > > > > https://elixir.bootlin.com/linux/latest/source/drivers/net/virtio_net.c#L1218 > > > https://elixir.bootlin.com/linux/latest/source/drivers/net/virtio_net.c#L1420 > > > > > > I don't see how the virtio hdr could be touched? Possibly even more > > > important: if such thing is possible, I think is should be somewhat > > > denied (for the same reason an H/W nic should prevent XDP from > > > modifying its own buffer descriptor). > > > > Thank you for highlighting this concern. The header layout differs > > slightly between small and mergeable mode. Taking 'mergeable mode' as > > an example, after calling xdp_prepare_buff the layout of xdp_buff > > would be as depicted in the diagram below, > > > > buf > >| > >v > > +--+--+-+ > > | xdp headroom | virtio header| packet | > > | (256 bytes) | (20 bytes) | content | > > +--+--+-+ > > ^ ^ > > | | > > data_hard_startdata > > data_meta > > > > If 'bpf_xdp_adjust_meta' repositions the 'data_meta' pointer a little > > towards 'data_hard_start', it would point to the inline header, thus > > potentially allowing the XDP program to access the inline header. > > I see. That layout was completely unexpected to me. > > AFAICS the virtio_net driver tries to avoid accessing/using the > virtio_net_hdr after the XDP program execution, so nothing tragic > should happen. > > @Michael, @Jason, I guess the above is like that by design? Isn't it a > bit fragile? 
> > Thanks! > > Paolo I agree it is all a bit fragile, not sure how to do better without extra copies though ... -- MST
Re: [PATCH 1/1] vhost: Added pad cleanup if vnet_hdr is not present.
On Mon, Jan 15, 2024 at 05:32:25PM -0500, Michael S. Tsirkin wrote: > On Mon, Jan 15, 2024 at 09:48:40PM +0200, Andrew Melnychenko wrote: > > When Qemu is launched with vhost but without tap vnet_hdr, > > vhost tries to copy vnet_hdr from the socket iter with size 0 > > to a page that may contain some trash. > > That trash can be interpreted as unpredictable values for > > vnet_hdr. > > That leads to dropping some packets and in some cases to > > stalling the vhost routine when vhost_net tries to process > > packets and fails in a loop. > > > > Qemu options: > > -netdev tap,vhost=on,vnet_hdr=off,... > > > > Signed-off-by: Andrew Melnychenko > > --- > > drivers/vhost/net.c | 3 +++ > > 1 file changed, 3 insertions(+) > > > > diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c > > index f2ed7167c848..57411ac2d08b 100644 > > --- a/drivers/vhost/net.c > > +++ b/drivers/vhost/net.c > > @@ -735,6 +735,9 @@ static int vhost_net_build_xdp(struct > > vhost_net_virtqueue *nvq, > > hdr = buf; > > gso = &hdr->gso; > > > > + if (!sock_hlen) > > + memset(buf, 0, pad); > > + > > if ((gso->flags & VIRTIO_NET_HDR_F_NEEDS_CSUM) && > > vhost16_to_cpu(vq, gso->csum_start) + > > vhost16_to_cpu(vq, gso->csum_offset) + 2 > > > > Hmm, need to analyse it to make sure there are no cases where we leak > some data to the guest here in the case where sock_hlen is set ... Could you post this analysis pls? > > -- > > 2.43.0
Re: Re: Re: EEVDF/vhost regression (bisected to 86bfbb7ce4f6 sched/fair: Add lag based placement)
On Thu, Feb 01, 2024 at 12:47:39PM +0100, Tobias Huschle wrote: > I'll do some more testing with the cond_resched->schedule fix, check the > cgroup thing and wait for Peter then. > Will get back if any of the above yields some results. As I predicted, if you want attention from sched guys you need to send a patch in their area. -- MST
Re: [PATCH] virtiofs: limit the length of ITER_KVEC dio by max_nopage_rw
On Wed, Jan 03, 2024 at 06:59:29PM +0800, Hou Tao wrote: > From: Hou Tao > > When trying to insert a 10MB kernel module kept in a virtiofs with cache > disabled, the following warning was reported: > > [ cut here ] > WARNING: CPU: 2 PID: 439 at mm/page_alloc.c:4544 .. > Modules linked in: > CPU: 2 PID: 439 Comm: insmod Not tainted 6.7.0-rc7+ #33 > Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), .. > RIP: 0010:__alloc_pages+0x2c4/0x360 > .. > Call Trace: > >? __warn+0x8f/0x150 >? __alloc_pages+0x2c4/0x360 >__kmalloc_large_node+0x86/0x160 >__kmalloc+0xcd/0x140 >virtio_fs_enqueue_req+0x240/0x6d0 >virtio_fs_wake_pending_and_unlock+0x7f/0x190 >queue_request_and_unlock+0x58/0x70 >fuse_simple_request+0x18b/0x2e0 >fuse_direct_io+0x58a/0x850 >fuse_file_read_iter+0xdb/0x130 >__kernel_read+0xf3/0x260 >kernel_read+0x45/0x60 >kernel_read_file+0x1ad/0x2b0 >init_module_from_file+0x6a/0xe0 >idempotent_init_module+0x179/0x230 >__x64_sys_finit_module+0x5d/0xb0 >do_syscall_64+0x36/0xb0 >entry_SYSCALL_64_after_hwframe+0x6e/0x76 >.. > > ---[ end trace ]--- > > The warning happened as follow. In copy_args_to_argbuf(), virtiofs uses > kmalloc-ed memory as bound buffer for fuse args, but > fuse_get_user_pages() only limits the length of fuse arg by max_read or > max_write for IOV_KVEC io (e.g., kernel_read_file from finit_module()). > For virtiofs, max_read is UINT_MAX, so a big read request which is about > 10MB is passed to copy_args_to_argbuf(), kmalloc() is called in turn > with len=10MB, and triggers the warning in __alloc_pages(): > WARN_ON_ONCE_GFP(order > MAX_ORDER, gfp)). > > A feasible solution is to limit the value of max_read for virtiofs, so > the length passed to kmalloc() will be limited. However it will affects > the max read size for ITER_IOVEC io and the value of max_write also needs > limitation. 
So instead of limiting the values of max_read and max_write, > introducing max_nopage_rw to cap both the values of max_read and > max_write when the fuse dio read/write request is initiated from kernel. > > Considering that fuse read/write request from kernel is uncommon and to > decrease the demand for large contiguous pages, set max_nopage_rw as > 256KB instead of KMALLOC_MAX_SIZE - 4096 or similar. > > Fixes: a62a8ef9d97d ("virtio-fs: add virtiofs filesystem") > Signed-off-by: Hou Tao So what should I do with this patch? It includes fuse changes but of course I can merge too if no one wants to bother either way... > --- > fs/fuse/file.c | 12 +++- > fs/fuse/fuse_i.h| 3 +++ > fs/fuse/inode.c | 1 + > fs/fuse/virtio_fs.c | 6 ++ > 4 files changed, 21 insertions(+), 1 deletion(-) > > diff --git a/fs/fuse/file.c b/fs/fuse/file.c > index a660f1f21540..f1beb7c0b782 100644 > --- a/fs/fuse/file.c > +++ b/fs/fuse/file.c > @@ -1422,6 +1422,16 @@ static int fuse_get_user_pages(struct fuse_args_pages > *ap, struct iov_iter *ii, > return ret < 0 ? ret : 0; > } > > +static size_t fuse_max_dio_rw_size(const struct fuse_conn *fc, > +const struct iov_iter *iter, int write) > +{ > + unsigned int nmax = write ? fc->max_write : fc->max_read; > + > + if (iov_iter_is_kvec(iter)) > + nmax = min(nmax, fc->max_nopage_rw); > + return nmax; > +} > + > ssize_t fuse_direct_io(struct fuse_io_priv *io, struct iov_iter *iter, > loff_t *ppos, int flags) > { > @@ -1432,7 +1442,7 @@ ssize_t fuse_direct_io(struct fuse_io_priv *io, struct > iov_iter *iter, > struct inode *inode = mapping->host; > struct fuse_file *ff = file->private_data; > struct fuse_conn *fc = ff->fm->fc; > - size_t nmax = write ? 
fc->max_write : fc->max_read; > + size_t nmax = fuse_max_dio_rw_size(fc, iter, write); > loff_t pos = *ppos; > size_t count = iov_iter_count(iter); > pgoff_t idx_from = pos >> PAGE_SHIFT; > diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h > index 1df83eebda92..fc753cd34211 100644 > --- a/fs/fuse/fuse_i.h > +++ b/fs/fuse/fuse_i.h > @@ -594,6 +594,9 @@ struct fuse_conn { > /** Constrain ->max_pages to this value during feature negotiation */ > unsigned int max_pages_limit; > > + /** Maximum read/write size when there is no page in request */ > + unsigned int max_nopage_rw; > + > /** Input queue */ > struct fuse_iqueue iq; > > diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c > index 2a6d44f91729..4cbbcb4a4b71 100644 > --- a/fs/fuse/inode.c > +++ b/fs/fuse/inode.c > @@ -923,6 +923,7 @@ void fuse_conn_init(struct fuse_conn *fc, struct > fuse_mount *fm, > fc->user_ns = get_user_ns(user_ns); > fc->max_pages = FUSE_DEFAULT_MAX_PAGES_PER_REQ; > fc->max_pages_limit = FUSE_MAX_MAX_PAGES; > + fc->max_nopage_rw = UINT_MAX; > > INIT_LIST_HEAD(&fc->mounts); >
[PATCH 2/2] tracing: Do not calculate strlen() twice for __string() fields
From: "Steven Rostedt (Google)" The TRACE_EVENT() macro handles dynamic strings by having: TP_PROTO(struct some_struct *s), TP_ARGS(s), TP_STRUCT__entry( __string(my_string, s->string) ), TP_fast_assign( __assign_str(my_string, s->string); ) TP_printk("%s", __get_str(my_string)) There's even some code that may call a function helper to find the s->string value. The problem with the above is that the work to get the length of s->string is done twice: once during the __string() macro and again in __assign_str(), where the strcpy() walks the string a second time. But the length is actually already recorded in the data location, so there's no reason to compute it again. Just use the length that was saved by the __string() code in the __assign_str() code. Signed-off-by: Steven Rostedt (Google) --- include/trace/stages/stage6_event_callback.h | 10 ++ 1 file changed, 6 insertions(+), 4 deletions(-) diff --git a/include/trace/stages/stage6_event_callback.h b/include/trace/stages/stage6_event_callback.h index e09e3d019dcc..d389725937e9 100644 --- a/include/trace/stages/stage6_event_callback.h +++ b/include/trace/stages/stage6_event_callback.h @@ -32,8 +32,9 @@ #undef __assign_str #define __assign_str(dst, src) \ - strcpy(__get_str(dst), __data_offsets.dst##_ptr_ ? \ - __data_offsets.dst##_ptr_ : "(null)") + memcpy(__get_str(dst), __data_offsets.dst##_ptr_ ? \ + __data_offsets.dst##_ptr_ : "(null)",\ + __get_dynamic_array_len(dst)) #undef __assign_str_len #define __assign_str_len(dst, src, len) \ @@ -94,8 +95,9 @@ #undef __assign_rel_str #define __assign_rel_str(dst, src) \ - strcpy(__get_rel_str(dst), __data_offsets.dst##_ptr_ ?
\ + __data_offsets.dst##_ptr_ : "(null)",\ + __get_rel_dynamic_array_len(dst)) #undef __assign_rel_str_len #define __assign_rel_str_len(dst, src, len)\ -- 2.43.0
[PATCH 0/2] tracing: Optimize __string()/__assign_str() processing
The TRACE_EVENT() macro handles dynamic strings by having: TP_PROTO(struct some_struct *s), TP_ARGS(s), TP_STRUCT__entry( __string(my_string, s->string) ), TP_fast_assign( __assign_str(my_string, s->string); ) TP_printk("%s", __get_str(my_string)) There's even some code that may call a function helper to find the s->string value. The problem with the above is that the work to get the s->string is done twice. Once at the __string() and again in the __assign_str(). The length of the string is calculated via a strlen(), not once, but twice (the second time inside strcpy()). The length is actually already recorded in the data location from __string() and there's no reason to call strcpy() in __assign_str() as the length is already known. The __string() macro uses dynamic_array() which has a helper structure that is created holding the offsets and length of the string fields. Instead of finding the string twice, just save it off in another field in that helper structure, and have __assign_str() use that instead. Steven Rostedt (Google) (2): tracing: Rework __assign_str() and __string() to not duplicate getting the string tracing: Do not calculate strlen() twice for __string() fields include/trace/stages/stage2_data_offsets.h | 4 ++-- include/trace/stages/stage5_get_offsets.h| 15 ++- include/trace/stages/stage6_event_callback.h | 14 ++ 3 files changed, 22 insertions(+), 11 deletions(-)
[PATCH 1/2] tracing: Rework __assign_str() and __string() to not duplicate getting the string
From: "Steven Rostedt (Google)" The TRACE_EVENT() macro handles dynamic strings by having: TP_PROTO(struct some_struct *s), TP_ARGS(s), TP_STRUCT__entry( __string(my_string, s->string) ), TP_fast_assign( __assign_str(my_string, s->string); ) TP_printk("%s", __get_str(my_string)) There's even some code that may call a function helper to find the s->string value. The problem with the above is that the work to get the s->string is done twice. Once at the __string() and again in the __assign_str(). The __string() uses dynamic_array() which has a helper structure that is created holding the offsets and length of the string fields. Instead of finding the string twice, just save it off in another field in that helper structure, and have __assign_str() use that instead. Note, this also means that the second parameter of __assign_str() isn't even used anymore, and may be removed in the future. Signed-off-by: Steven Rostedt (Google) --- include/trace/stages/stage2_data_offsets.h | 4 ++-- include/trace/stages/stage5_get_offsets.h| 15 ++- include/trace/stages/stage6_event_callback.h | 12 3 files changed, 20 insertions(+), 11 deletions(-) diff --git a/include/trace/stages/stage2_data_offsets.h b/include/trace/stages/stage2_data_offsets.h index 469b6a64293d..8b0cff06d346 100644 --- a/include/trace/stages/stage2_data_offsets.h +++ b/include/trace/stages/stage2_data_offsets.h @@ -24,7 +24,7 @@ #define __array(type, item, len) #undef __dynamic_array -#define __dynamic_array(type, item, len) u32 item; +#define __dynamic_array(type, item, len) u32 item; const void *item##_ptr_; #undef __string #define __string(item, src) __dynamic_array(char, item, -1) @@ -45,7 +45,7 @@ #define __sockaddr(field, len) __dynamic_array(u8, field, len) #undef __rel_dynamic_array -#define __rel_dynamic_array(type, item, len) u32 item; +#define __rel_dynamic_array(type, item, len) u32 item; const void *item##_ptr_; #undef __rel_string #define __rel_string(item, src) __rel_dynamic_array(char, item, -1) diff 
--git a/include/trace/stages/stage5_get_offsets.h b/include/trace/stages/stage5_get_offsets.h index e30a13be46ba..c71f38e10419 100644 --- a/include/trace/stages/stage5_get_offsets.h +++ b/include/trace/stages/stage5_get_offsets.h @@ -47,10 +47,12 @@ #undef __string #define __string(item, src) __dynamic_array(char, item, \ - strlen((src) ? (const char *)(src) : "(null)") + 1) + strlen((src) ? (const char *)(src) : "(null)") + 1) \ + __data_offsets->item##_ptr_ = src; #undef __string_len -#define __string_len(item, src, len) __dynamic_array(char, item, (len) + 1) +#define __string_len(item, src, len) __dynamic_array(char, item, (len) + 1)\ + __data_offsets->item##_ptr_ = src; #undef __vstring #define __vstring(item, fmt, ap) __dynamic_array(char, item, \ @@ -67,11 +69,14 @@ __data_size += __item_length; #undef __rel_string -#define __rel_string(item, src) __rel_dynamic_array(char, item, \ - strlen((src) ? (const char *)(src) : "(null)") + 1) +#define __rel_string(item, src) __rel_dynamic_array(char, item, \ + strlen((src) ? (const char *)(src) : "(null)") + 1); \ + __data_offsets->item##_ptr_ = src; #undef __rel_string_len -#define __rel_string_len(item, src, len) __rel_dynamic_array(char, item, (len) + 1) +#define __rel_string_len(item, src, len) __rel_dynamic_array(char, item, (len) + 1)\ + __data_offsets->item##_ptr_ = src; + /* * __bitmask_size_in_bytes_raw is the number of bytes needed to hold * num_possible_cpus(). diff --git a/include/trace/stages/stage6_event_callback.h b/include/trace/stages/stage6_event_callback.h index 919b1a4da980..e09e3d019dcc 100644 --- a/include/trace/stages/stage6_event_callback.h +++ b/include/trace/stages/stage6_event_callback.h @@ -32,12 +32,14 @@ #undef __assign_str #define __assign_str(dst, src) \ - strcpy(__get_str(dst), (src) ? (const char *)(src) : "(null)"); + strcpy(__get_str(dst), __data_offsets.dst##_ptr_ ? 
\ + __data_offsets.dst##_ptr_ : "(null)") #undef __assign_str_len #define __assign_str_len(dst, src, len) \ do {\ - memcpy(__get_str(dst), (src), (len)); \ + memcpy(__get_str(dst), __data_offsets.dst##_ptr_ ? \ + __data_offsets.dst##_ptr_ : "(null)", len); \ __get_str(dst)[len] = '\0'; \ } while(0) @@ -92,12 +94,14 @@ #undef __assign_rel_str #define __assign_rel_str(dst, src) \ - strcpy(__get_rel_str(dst), (src) ? (const char
Re: [PATCH v2] PM: runtime: add tracepoint for runtime_status changes
On Wed, Feb 21, 2024 at 8:49 PM Vilas Bhat wrote: > > Existing runtime PM ftrace events (`rpm_suspend`, `rpm_resume`, > `rpm_return_int`) offer limited visibility into the exact timing of device > runtime power state transitions, particularly when asynchronous operations > are involved. When the `rpm_suspend` or `rpm_resume` functions are invoked > with the `RPM_ASYNC` flag, a return value of 0 i.e., success merely > indicates that the device power state request has been queued, not that > the device has yet transitioned. > > A new ftrace event, `rpm_status`, is introduced. This event directly logs > the `power.runtime_status` value of a device whenever it changes providing > granular tracking of runtime power state transitions regardless of > synchronous or asynchronous `rpm_suspend` / `rpm_resume` usage. > > Signed-off-by: Vilas Bhat > --- > V1 -> V2: Modified enum value definition as per reviewer comments. > --- > drivers/base/power/runtime.c | 1 + > include/trace/events/rpm.h | 42 > 2 files changed, 43 insertions(+) > > diff --git a/drivers/base/power/runtime.c b/drivers/base/power/runtime.c > index 05793c9fbb84..d10354847878 100644 > --- a/drivers/base/power/runtime.c > +++ b/drivers/base/power/runtime.c > @@ -94,6 +94,7 @@ static void update_pm_runtime_accounting(struct device *dev) > static void __update_runtime_status(struct device *dev, enum rpm_status > status) > { > update_pm_runtime_accounting(dev); > + trace_rpm_status(dev, status); > dev->power.runtime_status = status; > } > > diff --git a/include/trace/events/rpm.h b/include/trace/events/rpm.h > index 3c716214dab1..bd120e23ce12 100644 > --- a/include/trace/events/rpm.h > +++ b/include/trace/events/rpm.h > @@ -101,6 +101,48 @@ TRACE_EVENT(rpm_return_int, > __entry->ret) > ); > > +#define RPM_STATUS_STRINGS \ > + EM(RPM_INVALID, "RPM_INVALID") \ > + EM(RPM_ACTIVE, "RPM_ACTIVE") \ > + EM(RPM_RESUMING, "RPM_RESUMING") \ > + EM(RPM_SUSPENDED, "RPM_SUSPENDED") \ > + EMe(RPM_SUSPENDING, "RPM_SUSPENDING") > 
+ > +/* Enums require being exported to userspace, for user tool parsing. */ > +#undef EM > +#undef EMe > +#define EM(a, b) TRACE_DEFINE_ENUM(a); > +#define EMe(a, b) TRACE_DEFINE_ENUM(a); > + > +RPM_STATUS_STRINGS > + > +/* > + * Now redefine the EM() and EMe() macros to map the enums to the strings > that > + * will be printed in the output. > + */ > +#undef EM > +#undef EMe > +#define EM(a, b) { a, b }, > +#define EMe(a, b) { a, b } > + > +TRACE_EVENT(rpm_status, > + TP_PROTO(struct device *dev, enum rpm_status status), > + TP_ARGS(dev, status), > + > + TP_STRUCT__entry( > + __string(name, dev_name(dev)) > + __field(int,status) > + ), > + > + TP_fast_assign( > + __assign_str(name, dev_name(dev)); > + __entry->status = status; > + ), > + > + TP_printk("%s status=%s", __get_str(name), > + __print_symbolic(__entry->status, RPM_STATUS_STRINGS)) > +); > + > #endif /* _TRACE_RUNTIME_POWER_H */ > > /* This part must be outside protection */ > -- Applied as 6.9 material, thanks!
Re: [PATCH net-next v4 2/2] virtio-net: add cond_resched() to the command waiting loop
On Tue, Jul 25, 2023 at 11:03:11AM +0800, Jason Wang wrote: > On Mon, Jul 24, 2023 at 3:18 PM Michael S. Tsirkin wrote: > > > > On Mon, Jul 24, 2023 at 02:52:49PM +0800, Jason Wang wrote: > > > On Mon, Jul 24, 2023 at 2:46 PM Michael S. Tsirkin > > > wrote: > > > > > > > > On Fri, Jul 21, 2023 at 10:18:03PM +0200, Maxime Coquelin wrote: > > > > > > > > > > > > > > > On 7/21/23 17:10, Michael S. Tsirkin wrote: > > > > > > On Fri, Jul 21, 2023 at 04:58:04PM +0200, Maxime Coquelin wrote: > > > > > > > > > > > > > > > > > > > > > On 7/21/23 16:45, Michael S. Tsirkin wrote: > > > > > > > > On Fri, Jul 21, 2023 at 04:37:00PM +0200, Maxime Coquelin wrote: > > > > > > > > > > > > > > > > > > > > > > > > > > > On 7/20/23 23:02, Michael S. Tsirkin wrote: > > > > > > > > > > On Thu, Jul 20, 2023 at 01:26:20PM -0700, Shannon Nelson > > > > > > > > > > wrote: > > > > > > > > > > > On 7/20/23 1:38 AM, Jason Wang wrote: > > > > > > > > > > > > > > > > > > > > > > > > Adding cond_resched() to the command waiting loop for a > > > > > > > > > > > > better > > > > > > > > > > > > co-operation with the scheduler. This allows to give > > > > > > > > > > > > CPU a breath to > > > > > > > > > > > > run other task(workqueue) instead of busy looping when > > > > > > > > > > > > preemption is > > > > > > > > > > > > not allowed on a device whose CVQ might be slow. > > > > > > > > > > > > > > > > > > > > > > > > Signed-off-by: Jason Wang > > > > > > > > > > > > > > > > > > > > > > This still leaves hung processes, but at least it doesn't > > > > > > > > > > > pin the CPU any > > > > > > > > > > > more. Thanks. > > > > > > > > > > > Reviewed-by: Shannon Nelson > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > I'd like to see a full solution > > > > > > > > > > 1- block until interrupt > > > > > > > > > > > > > > > > > > Would it make sense to also have a timeout? > > > > > > > > > And when timeout expires, set FAILED bit in device status? 
> > > > > > > > > > > > > > > > virtio spec does not set any limits on the timing of vq > > > > > > > > processing. > > > > > > > > > > > > > > Indeed, but I thought the driver could decide it is too long for > > > > > > > it. > > > > > > > > > > > > > > The issue is we keep waiting with rtnl locked, it can quickly > > > > > > > make the > > > > > > > system unusable. > > > > > > > > > > > > if this is a problem we should find a way not to keep rtnl > > > > > > locked indefinitely. > > > > > > > > > > From the tests I have done, I think it is. With OVS, a > > > > > reconfiguration is > > > > > performed when the VDUSE device is added, and when a MLX5 device is > > > > > in the same bridge, it ends up doing an ioctl() that tries to take the > > > > > rtnl lock. In this configuration, it is not possible to kill OVS > > > > > because > > > > > it is stuck trying to acquire rtnl lock for mlx5 that is held by > > > > > virtio- > > > > > net. > > > > > > > > So for sure, we can queue up the work and process it later. > > > > The somewhat tricky part is limiting the memory consumption. > > > > > > And it needs to sync with rtnl somehow, e.g device unregistering which > > > seems not easy. > > > > > > Thanks > > > > since when does device unregister need to send cvq commands? > > It doesn't do this now. But if we don't process the work under rtnl, > we need to synchronize with device unregistering. > > Thanks But what's not easy about it? 
> > > > > > > > > > > > > > > > > > > > > > > > > > 2- still handle surprise removal correctly by waking in > > > > > > > > > > that case > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > --- > > > > > > > > > > > > drivers/net/virtio_net.c | 4 +++- > > > > > > > > > > > > 1 file changed, 3 insertions(+), 1 deletion(-) > > > > > > > > > > > > > > > > > > > > > > > > diff --git a/drivers/net/virtio_net.c > > > > > > > > > > > > b/drivers/net/virtio_net.c > > > > > > > > > > > > index 9f3b1d6ac33d..e7533f29b219 100644 > > > > > > > > > > > > --- a/drivers/net/virtio_net.c > > > > > > > > > > > > +++ b/drivers/net/virtio_net.c > > > > > > > > > > > > @@ -2314,8 +2314,10 @@ static bool > > > > > > > > > > > > virtnet_send_command(struct virtnet_info *vi, u8 class, > > > > > > > > > > > > u8 cmd, > > > > > > > > > > > > * into the hypervisor, so the request > > > > > > > > > > > > should be handled immediately. > > > > > > > > > > > > */ > > > > > > > > > > > > while (!virtqueue_get_buf(vi->cvq, ) && > > > > > > > > > > > > - !virtqueue_is_broken(vi->cvq)) > > > > > > > > > > > > + !virtqueue_is_broken(vi->cvq)) { > > > > > > > > > > > > + cond_resched(); > > > > > > > > > > > > cpu_relax(); > > > > > > > > > > > > + } > > > > > > > > > > > > > > > > > > > > > > > > return vi->ctrl->status == VIRTIO_NET_OK; > > > > > > > > > > > > } > > > > > > > > > > > > -- > > > > > > > > > >
Re: [PATCH v3 7/7] remoteproc: stm32: Add support of an OP-TEE TA to load the firmware
Hi, On Wed, Feb 14, 2024 at 06:21:27PM +0100, Arnaud Pouliquen wrote: > The new TEE remoteproc device is used to manage remote firmware in a > secure, trusted context. The 'st,stm32mp1-m4-tee' compatible is > introduced to delegate the loading of the firmware to the trusted > execution context. In such cases, the firmware should be signed and > adhere to the image format defined by the TEE. > > A new "to_attach" field is introduced to differentiate the use cases > "firmware loaded by the boot stage" and "firmware loaded by the TEE". > > Signed-off-by: Arnaud Pouliquen > --- > V2 to V3 update: > - remove stm32_rproc_tee_elf_sanity_check(), stm32_rproc_tee_elf_load() > stm32_rproc_tee_elf_find_loaded_rsc_table() and stm32_rproc_tee_start() > that are now unused > - use new rproc::alt_boot field to specify that the alternate boot method is > used > - use stm32_rproc::to_attach field to differentiate attach mode from > remoteproc tee boot mode. > - remove the use of stm32_rproc::fw_loaded > --- > drivers/remoteproc/stm32_rproc.c | 85 +--- > 1 file changed, 79 insertions(+), 6 deletions(-) > > diff --git a/drivers/remoteproc/stm32_rproc.c > b/drivers/remoteproc/stm32_rproc.c > index fcc0001e2657..9cfcf66462e0 100644 > --- a/drivers/remoteproc/stm32_rproc.c > +++ b/drivers/remoteproc/stm32_rproc.c > @@ -20,6 +20,7 @@ > #include > #include > #include > +#include > #include > > #include "remoteproc_internal.h" > @@ -49,6 +50,9 @@ > #define M4_STATE_STANDBY 4 > #define M4_STATE_CRASH 5 > > +/* Remote processor unique identifier aligned with the Trusted Execution > Environment definitions */ > +#define STM32_MP1_M4_PROC_ID 0 > + > struct stm32_syscon { > struct regmap *map; > u32 reg; > @@ -90,6 +94,8 @@ struct stm32_rproc { > struct stm32_mbox mb[MBOX_NB_MBX]; > struct workqueue_struct *workqueue; > bool hold_boot_smc; > + bool to_attach; > + struct tee_rproc *trproc; > void __iomem *rsc_va; > }; > > @@ -253,10 +259,30 @@ static int stm32_rproc_release(struct rproc
*rproc) > return err; > } > } > + ddata->to_attach = false; > > return err; > } > > +static int stm32_rproc_tee_attach(struct rproc *rproc) > +{ > + /* Nothing to do, remote proc already started by the secured context. */ > + return 0; > +} > + > +static int stm32_rproc_tee_stop(struct rproc *rproc) > +{ > + int err; > + > + stm32_rproc_request_shutdown(rproc); > + > + err = tee_rproc_stop(rproc); > + if (err) > + return err; > + > + return stm32_rproc_release(rproc); > +} > + > static int stm32_rproc_prepare(struct rproc *rproc) > { > struct device *dev = rproc->dev.parent; > @@ -637,10 +663,14 @@ stm32_rproc_get_loaded_rsc_table(struct rproc *rproc, > size_t *table_sz) > { > struct stm32_rproc *ddata = rproc->priv; > struct device *dev = rproc->dev.parent; > + struct tee_rproc *trproc = ddata->trproc; > phys_addr_t rsc_pa; > u32 rsc_da; > int err; > > + if (trproc && !ddata->to_attach) > + return tee_rproc_get_loaded_rsc_table(rproc, table_sz); > + Why do we need a flag at all? Why can't st_rproc_tee_ops::get_loaded_rsc_table be set to tee_rproc_get_loaded_rsc_table()? 
> /* The resource table has already been mapped, nothing to do */ > if (ddata->rsc_va) > goto done; > @@ -693,8 +723,20 @@ static const struct rproc_ops st_rproc_ops = { > .get_boot_addr = rproc_elf_get_boot_addr, > }; > > +static const struct rproc_ops st_rproc_tee_ops = { > + .prepare= stm32_rproc_prepare, > + .start = tee_rproc_start, > + .stop = stm32_rproc_tee_stop, > + .attach = stm32_rproc_tee_attach, > + .kick = stm32_rproc_kick, > + .get_loaded_rsc_table = stm32_rproc_get_loaded_rsc_table, > + .find_loaded_rsc_table = tee_rproc_find_loaded_rsc_table, > + .load = tee_rproc_load_fw, > +}; > + > static const struct of_device_id stm32_rproc_match[] = { > - { .compatible = "st,stm32mp1-m4" }, > + {.compatible = "st,stm32mp1-m4",}, > + {.compatible = "st,stm32mp1-m4-tee",}, > {}, > }; > MODULE_DEVICE_TABLE(of, stm32_rproc_match); > @@ -853,6 +895,7 @@ static int stm32_rproc_probe(struct platform_device *pdev) > struct device *dev = >dev; > struct stm32_rproc *ddata; > struct device_node *np = dev->of_node; > + struct tee_rproc *trproc = NULL; > struct rproc *rproc; > unsigned int state; > int ret; > @@ -861,12 +904,33 @@ static int stm32_rproc_probe(struct platform_device > *pdev) > if (ret) > return ret; > > - rproc = rproc_alloc(dev, np->name, _rproc_ops, NULL, sizeof(*ddata)); > - if (!rproc) > - return -ENOMEM; This patch doesn't apply to rproc-next - please rebase.
Re: [PATCH] NFSD: Fix nfsd_clid_class use of __string_len() macro
On Thu, 22 Feb 2024 13:25:34 -0500 Chuck Lever wrote: > Do you want me to take this through the nfsd tree, or would you like > an Ack from me so you can handle it as part of your clean up? Just > in case: > > Acked-by: Chuck Lever > As my patches depend on this, I can take it with your ack. Thanks! -- Steve
Re: [PATCH] NFSD: Fix nfsd_clid_class use of __string_len() macro
On Thu, Feb 22, 2024 at 12:28:28PM -0500, Steven Rostedt wrote: > From: "Steven Rostedt (Google)" > > I'm working on restructuring the __string* macros so that it doesn't need > to recalculate the string twice. That is, it will save it off when > processing __string() and the __assign_str() will not need to do the work > again as it currently does. > > Currently __string_len(item, src, len) doesn't actually use "src", but my > changes will require src to be correct as that is where the __assign_str() > will get its value from. > > The event class nfsd_clid_class has: > > __string_len(name, name, clp->cl_name.len) > > But the second "name" does not exist and causes my changes to fail to > build. That second parameter should be: clp->cl_name.data. > > Fixes: d27b74a8675ca ("NFSD: Use new __string_len C macros for > nfsd_clid_class") > Signed-off-by: Steven Rostedt (Google) > --- > fs/nfsd/trace.h | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/fs/nfsd/trace.h b/fs/nfsd/trace.h > index d1e8cf079b0f..2cd57033791f 100644 > --- a/fs/nfsd/trace.h > +++ b/fs/nfsd/trace.h > @@ -843,7 +843,7 @@ DECLARE_EVENT_CLASS(nfsd_clid_class, > __array(unsigned char, addr, sizeof(struct sockaddr_in6)) > __field(unsigned long, flavor) > __array(unsigned char, verifier, NFS4_VERIFIER_SIZE) > - __string_len(name, name, clp->cl_name.len) > + __string_len(name, clp->cl_name.data, clp->cl_name.len) > ), > TP_fast_assign( > __entry->cl_boot = clp->cl_clientid.cl_boot; > -- > 2.43.0 > Do you want me to take this through the nfsd tree, or would you like an Ack from me so you can handle it as part of your clean up? Just in case: Acked-by: Chuck Lever -- Chuck Lever
Re: [PATCH v3 5/7] dt-bindings: remoteproc: Add compatibility for TEE support
On Wed, 14 Feb 2024 18:21:25 +0100, Arnaud Pouliquen wrote: > The "st,stm32mp1-m4-tee" compatible is utilized in a system configuration > where the Cortex-M4 firmware is loaded by the Trusted Execution Environment > (TEE). > For instance, this compatible is used in both the Linux and OP-TEE > device-tree: > - In OP-TEE, a node is defined in the device tree with the > st,stm32mp1-m4-tee to support signed remoteproc firmware. > Based on DT properties, OP-TEE authenticates, loads, starts, and stops > the firmware. > - On Linux, when this compatible is set, the Cortex-M resets should not > be declared in the device tree. > > Signed-off-by: Arnaud Pouliquen > --- > V1 to V2 updates > - update "st,stm32mp1-m4" compatible description to generalize > - remove the 'reset-names' requirement in one conditional branch, as the > property is already part of the condition test. > --- > .../bindings/remoteproc/st,stm32-rproc.yaml | 51 --- > 1 file changed, 43 insertions(+), 8 deletions(-) > Reviewed-by: Rob Herring
Re: [PATCH v9 08/15] x86/sgx: Implement EPC reclamation flows for cgroup
On Tue, 20 Feb 2024 03:52:39 -0600, Huang, Kai wrote: +/** + * sgx_epc_cgroup_reclaim_pages() - walk a cgroup tree and scan LRUs to reclaim pages + * @root: Root of the tree to start walking from. + * Return: Number of pages reclaimed. Just wondering, do you need to return @cnt given this function is called w/o checking the return value? Yes. Will add explicit commenting that we need to scan a fixed number of pages for attempted reclamation. + */ +unsigned int sgx_epc_cgroup_reclaim_pages(struct misc_cg *root) +{ + /* +* Attempting to reclaim only a few pages will often fail and is +* inefficient, while reclaiming a huge number of pages can result in +* soft lockups due to holding various locks for an extended duration. +*/ Not sure we need this comment, given it's already implied in sgx_reclaim_pages(). You cannot pass a value > SGX_NR_TO_SCAN anyway. Will rework these comments to make them more meaningful. [other comments/questions addressed in separate email threads] [...] + +/* + * Scheduled by sgx_epc_cgroup_try_charge() to reclaim pages from the cgroup + * when the cgroup is at/near its maximum capacity + */ I don't see this being "scheduled by sgx_epc_cgroup_try_charge()" here. Does it make more sense to move that code change to this patch for better review? Right. This comment was left over when I split the old patch. +static void sgx_epc_cgroup_reclaim_work_func(struct work_struct *work) +{ + struct sgx_epc_cgroup *epc_cg; + u64 cur, max; + + epc_cg = container_of(work, struct sgx_epc_cgroup, reclaim_work); + + for (;;) { + max = sgx_epc_cgroup_max_pages_to_root(epc_cg); + + /* +* Adjust the limit down by one page, the goal is to free up +* pages for fault allocations, not to simply obey the limit. +* Conditionally decrementing max also means the cur vs. max +* check will correctly handle the case where both are zero. +*/ + if (max) + max--; With the below max -= SGX_NR_TO_SCAN/2 stuff, do you still need this one?
Logically it is still needed for the case max <= SGX_NR_TO_SCAN * 2 + + /* +* Unless the limit is extremely low, in which case forcing +* reclaim will likely cause thrashing, force the cgroup to +* reclaim at least once if it's operating *near* its maximum +* limit by adjusting @max down by half the min reclaim size. OK. But why choose "SGX_NR_TO_SCAN * 2" as "extremely low"? E.g., could we choose SGX_NR_TO_SCAN instead? IMHO we should at least put a comment to mention this. And maybe you can have a dedicated macro for that, in which way I believe the code would be easier to understand? Good point. I think the value is kind of arbitrary. We consider enclaves/cgroups of 64K size to be very small. If such a cgroup ever reaches the limit, then we don't aggressively reclaim to optimize #PF handling. User might as well just raise the limit if it is not performant. +* This work func is scheduled by sgx_epc_cgroup_try_charge This has been mentioned in the function comment already. +* when it cannot directly reclaim due to being in an atomic +* context, e.g. EPC allocation in a fault handler. Why is a fault handler an "atomic context"? Just say when it cannot directly reclaim. Sure. Waiting +* to reclaim until the cgroup is actually at its limit is less +* performant as it means the faulting task is effectively +* blocked until a worker makes its way through the global work +* queue. +*/ + if (max > SGX_NR_TO_SCAN * 2) + max -= (SGX_NR_TO_SCAN / 2); + + cur = sgx_epc_cgroup_page_counter_read(epc_cg); + + if (cur <= max || sgx_epc_cgroup_lru_empty(epc_cg->cg)) + break; + + /* Keep reclaiming until above condition is met. */ + sgx_epc_cgroup_reclaim_pages(epc_cg->cg); Also, each loop here calls sgx_epc_cgroup_max_pages_to_root() and sgx_epc_cgroup_lru_empty(), both of which loop over the given EPC cgroup and descendants. If we still make sgx_reclaim_pages() always scan SGX_NR_TO_SCAN pages, it seems we can reduce the number of loops here?
[We already scan SGX_NR_TO_SCAN pages for the cgroup at the level of sgx_epc_cgroup_reclaim_pages().] I think you mean that we keep scanning and reclaiming until at least SGX_NR_TO_SCAN pages are reclaimed as your code suggested above. We probably can make that a version for this background thread for optimization. But sgx_epc_cgroup_max_pages_to_root() and sgx_epc_cgroup_lru_empty() are not that bad unless we had very deep and wide cgroup trees. So
Re: [PATCH v2] mm: Update mark_victim tracepoints fields
On Thu, Feb 22, 2024 at 6:16 AM Michal Hocko wrote: > > On Wed 21-02-24 13:30:51, Carlos Galo wrote: > > On Tue, Feb 20, 2024 at 11:55 PM Michal Hocko wrote: > > > > > > Hi, > > > sorry I have missed this before. > > > > > > On Thu 11-01-24 21:05:30, Carlos Galo wrote: > > > > The current implementation of the mark_victim tracepoint provides only > > > > the process ID (pid) of the victim process. This limitation poses > > > > challenges for userspace tools that need additional information > > > > about the OOM victim. The association between pid and the additional > > > > data may be lost after the kill, making it difficult for userspace to > > > > correlate the OOM event with the specific process. > > > > > > You are correct that post OOM all per-process information is lost. On > > > the other hand we do dump all this information to the kernel log. Could > > > you explain why that is not suitable for your purpose? > > > > Userspace tools often need real-time visibility into OOM situations > > for userspace intervention. Our use case involves utilizing BPF > > programs, along with BPF ring buffers, to provide OOM notification to > > userspace. Parsing kernel logs would be significant overhead as > > opposed to the event based BPF approach. > > Please put that into the changelog. Will do. > > > > In order to mitigate this limitation, add the following fields: > > > > > > > > - UID > > > >In Android each installed application has a unique UID. Including > > > >the `uid` assists in correlating OOM events with specific apps. > > > > > > > > - Process Name (comm) > > > >Enables identification of the affected process. > > > > > > > > - OOM Score > > > >Allows userspace to get additional insights of the relative kill > > > >priority of the OOM victim. > > > > > > What is the oom score useful for? > > > > > The OOM score provides us a measure of the victim's importance. 
On the > > android side, it allows us to identify if top or foreground apps are > > killed, which have user perceptible impact. > > But the value on its own (wihtout knowing scores of other tasks) doesn't > really tell you anything, does it? Android uses the OOM adj_score ranges to categorize app state (foreground, background, ...). I'll resend a v3 with the commit text updated to include details about this. > > > Is there any reason to provide a different information from the one > > > reported to the kernel log? > > > __oom_kill_process: > > > pr_err("%s: Killed process %d (%s) total-vm:%lukB, anon-rss:%lukB, > > > file-rss:%lukB, shmem-rss:%lukB, UID:%u pgtables:%lukB > > > oom_score_adj:%hd\n", > > > message, task_pid_nr(victim), victim->comm, > > > K(mm->total_vm), > > > K(get_mm_counter(mm, MM_ANONPAGES)), > > > K(get_mm_counter(mm, MM_FILEPAGES)), > > > K(get_mm_counter(mm, MM_SHMEMPAGES)), > > > from_kuid(_user_ns, task_uid(victim)), > > > mm_pgtables_bytes(mm) >> 10, > > > victim->signal->oom_score_adj); > > > > > > > We added these fields we need (UID, process name, and OOM score), but > > we're open to adding the others if you prefer that for consistency > > with the kernel log. > > yes, I think the consistency would be better here. For one it reports > numbers which can tell quite a lot about the killed victim. It is a > superset of what you already asking for. With a notable exception of the > oom_score which is really dubious without a wider context. Sounds good, I'll resend a v3 that includes these fields as well. Thanks, Carlos > -- > Michal Hocko > SUSE Labs
[PATCH] NFSD: Fix nfsd_clid_class use of __string_len() macro
From: "Steven Rostedt (Google)" I'm working on restructuring the __string* macros so that it doesn't need to recalculate the string twice. That is, it will save it off when processing __string() and the __assign_str() will not need to do the work again as it currently does. Currently __string_len(item, src, len) doesn't actually use "src", but my changes will require src to be correct as that is where the __assign_str() will get its value from. The event class nfsd_clid_class has: __string_len(name, name, clp->cl_name.len) But the second "name" does not exist and causes my changes to fail to build. That second parameter should be: clp->cl_name.data. Fixes: d27b74a8675ca ("NFSD: Use new __string_len C macros for nfsd_clid_class") Signed-off-by: Steven Rostedt (Google) --- fs/nfsd/trace.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/fs/nfsd/trace.h b/fs/nfsd/trace.h index d1e8cf079b0f..2cd57033791f 100644 --- a/fs/nfsd/trace.h +++ b/fs/nfsd/trace.h @@ -843,7 +843,7 @@ DECLARE_EVENT_CLASS(nfsd_clid_class, __array(unsigned char, addr, sizeof(struct sockaddr_in6)) __field(unsigned long, flavor) __array(unsigned char, verifier, NFS4_VERIFIER_SIZE) - __string_len(name, name, clp->cl_name.len) + __string_len(name, clp->cl_name.data, clp->cl_name.len) ), TP_fast_assign( __entry->cl_boot = clp->cl_clientid.cl_boot; -- 2.43.0
Re: [PATCH v9 08/15] x86/sgx: Implement EPC reclamation flows for cgroup
On Wed, 21 Feb 2024 05:00:27 -0600, Huang, Kai wrote: On Wed, 2024-02-21 at 00:44 -0600, Haitao Huang wrote: [...] > > Here the @nr_to_scan is reduced by the number of pages that are > isolated, but > not actually reclaimed (which is reflected by @cnt). > > IIUC, it looks like you want to make this function do "each cycle" as what you > mentioned > in the v8 [1]: > >I tested with that approach and found we can only target number of > pages >attempted to reclaim not pages actually reclaimed due to the > uncertainty >of how long it takes to reclaim pages. Besides targeting number of >scanned pages for each cycle is also what the ksgxd does. > > If we target actual number of pages, sometimes it just takes too long. > I >saw more timeouts with the default time limit when running parallel >selftests. > > I am not sure what "sometimes it just takes too long" means, but > what I am > thinking is you are trying to do some perfect but yet complicated code > here. I think what I observed was that the try_charge() would block too long before getting a chance to schedule() and yield, causing more timeouts than necessary. I'll do some re-test to be sure. Looks like this is valid information that can be used to justify whatever you are implementing in the EPC cgroup reclaiming function(s). I'll add some comments. I was assuming this just follows the old design of ksgxd. There were some comments at the beginning of sgx_epc_cgroup_reclaim_pages(). /* * Attempting to reclaim only a few pages will often fail and is * inefficient, while reclaiming a huge number of pages can result in * soft lockups due to holding various locks for an extended duration. */ unsigned int nr_to_scan = SGX_NR_TO_SCAN; I think it can be improved to emphasize that we only "attempt" to finish scanning a fixed number of pages for reclamation, not enforce a number of pages successfully reclaimed.
> > For instance, I don't think selftest reflect the real workload, and I > believe > adjusting the limit of a given EPC cgroup shouldn't be a frequent > operation, > thus it is acceptable to use some easy-maintain code but less perfect > code. > > Here I still think having @nr_to_scan as a pointer is over-complicated. > For > example, we can still let sgx_reclaim_pages() to always scan > SGX_NR_TO_SCAN > pages, but give up when there's enough pages reclaimed or when the EPC > cgroup > and its descendants have been looped: > > unsigned int sgx_epc_cgroup_reclaim_pages(struct misc_cg *root) > { >unsigned int cnt = 0; >... > >css_for_each_descendant_pre(pos, css_root) { >... >epc_cg = sgx_epc_cgroup_from_misc_cg(css_misc(pos)); >cnt += sgx_reclaim_pages(_cg->lru); > >if (cnt >= SGX_NR_TO_SCAN) >break; >} > >... >return cnt; > } > > Yeah it may reclaim more than SGX_NR_TO_SCAN when the loop actually > reaches any > descendants, but that should be rare and we don't care that much, do we? > I assume you meant @cnt here to be number of pages actually reclaimed. Yes. This could cause sgx_epc_cgroup_reclaim_pages() block too long as @cnt may always be zero (all pages are too young) and you have to loop all descendants. I am not sure whether this is a valid point. For example, your change in patch 10 "x86/sgx: Add EPC reclamation in cgroup try_charge()" already loops all descendants in below code: + if (sgx_epc_cgroup_lru_empty(epc_cg->cg)) + return -ENOMEM; I meant looping all descendants for reclamation which is expensive and we want to avoid. Not just checking emptiness of LRUs. Anyway, I can see all these can be justification to your design/implementation. My point is please put these justification in changelog/comments so that we can actually understand. Yes, will add clarifying comments. Thanks
Re: [PATCH v9 10/15] x86/sgx: Add EPC reclamation in cgroup try_charge()
On Wed, 21 Feb 2024 05:06:02 -0600, Huang, Kai wrote: -int sgx_epc_cgroup_try_charge(struct sgx_epc_cgroup *epc_cg) +int sgx_epc_cgroup_try_charge(struct sgx_epc_cgroup *epc_cg, bool reclaim) { - return misc_cg_try_charge(MISC_CG_RES_SGX_EPC, epc_cg->cg, PAGE_SIZE); + for (;;) { + if (!misc_cg_try_charge(MISC_CG_RES_SGX_EPC, epc_cg->cg, + PAGE_SIZE)) + break; + + if (sgx_epc_cgroup_lru_empty(epc_cg->cg)) + return -ENOMEM; + + if (signal_pending(current)) + return -ERESTARTSYS; + + if (!reclaim) { + queue_work(sgx_epc_cg_wq, &epc_cg->reclaim_work); + return -EBUSY; + } + + if (!sgx_epc_cgroup_reclaim_pages(epc_cg->cg, false)) + /* All pages were too young to reclaim, try again a little later */ + schedule(); + } + + return 0; } Seems this code change is 90% similar to the existing code in the sgx_alloc_epc_page(): ... for ( ; ; ) { page = __sgx_alloc_epc_page(); if (!IS_ERR(page)) { page->owner = owner; break; } if (list_empty(&sgx_active_page_list)) return ERR_PTR(-ENOMEM); if (!reclaim) { page = ERR_PTR(-EBUSY); break; } if (signal_pending(current)) { page = ERR_PTR(-ERESTARTSYS); break; } sgx_reclaim_pages(); cond_resched(); } ... Is it better to move the logic/code change in try_charge() out to sgx_alloc_epc_page() to unify them? IIUC, the logic is quite similar: When you either failed to allocate one page, or failed to charge one page, you try to reclaim EPC page(s) from the current EPC cgroup, either directly or indirectly. No? Only these lines are the same: if (!reclaim) { page = ERR_PTR(-EBUSY); break; } if (signal_pending(current)) { page = ERR_PTR(-ERESTARTSYS); break; } In sgx_alloc_epc_page() we do global reclamation but here we do per-cgroup reclamation. That's why the logic of the other lines is different though they look similar due to similar function names. For global reclamation we need to consider the case where the cgroup is not enabled.
Similarly, list_empty(&sgx_active_page_list) would have to be changed to check the root cgroup if cgroups are enabled, otherwise check the global LRU. The (!reclaim) case is also different. So I don't see an obvious good way to abstract those to get meaningful savings. Thanks Haitao
Re: [PATCH v9 13/15] x86/sgx: Turn on per-cgroup EPC reclamation
On Wed, 21 Feb 2024 05:23:00 -0600, Huang, Kai wrote: On Mon, 2024-02-05 at 13:06 -0800, Haitao Huang wrote: From: Kristen Carlson Accardi Previous patches have implemented all infrastructure needed for per-cgroup EPC page tracking and reclaiming. But all reclaimable EPC pages are still tracked in the global LRU as sgx_lru_list() returns hard coded reference to the global LRU. Change sgx_lru_list() to return the LRU of the cgroup in which the given EPC page is allocated. This makes all EPC pages tracked in per-cgroup LRUs and the global reclaimer (ksgxd) will not be able to reclaim any pages from the global LRU. However, in cases of over-committing, i.e., sum of cgroup limits greater than the total capacity, cgroups may never reclaim but the total usage can still be near the capacity. Therefore global reclamation is still needed in those cases and it should reclaim from the root cgroup. Modify sgx_reclaim_pages_global(), to reclaim from the root EPC cgroup when cgroup is enabled, otherwise from the global LRU. Similarly, modify sgx_can_reclaim(), to check emptiness of LRUs of all cgroups when EPC cgroup is enabled, otherwise only check the global LRU. With these changes, the global reclamation and per-cgroup reclamation both work properly with all pages tracked in per-cgroup LRUs. Co-developed-by: Sean Christopherson Signed-off-by: Sean Christopherson Signed-off-by: Kristen Carlson Accardi Co-developed-by: Haitao Huang Signed-off-by: Haitao Huang --- V7: - Split this out from the big patch, #10 in V6. 
(Dave, Kai) --- arch/x86/kernel/cpu/sgx/main.c | 16 +++- 1 file changed, 15 insertions(+), 1 deletion(-) diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c index 6b0c26cac621..d4265a390ba9 100644 --- a/arch/x86/kernel/cpu/sgx/main.c +++ b/arch/x86/kernel/cpu/sgx/main.c @@ -34,12 +34,23 @@ static struct sgx_epc_lru_list sgx_global_lru; static inline struct sgx_epc_lru_list *sgx_lru_list(struct sgx_epc_page *epc_page) { +#ifdef CONFIG_CGROUP_SGX_EPC + if (epc_page->epc_cg) + return &epc_page->epc_cg->lru; + + /* This should not happen if kernel is configured correctly */ + WARN_ON_ONCE(1); +#endif return &sgx_global_lru; } How about when EPC cgroup is enabled, but one enclave doesn't belong to any EPC cgroup? Is it OK to track EPC pages for these enclaves in the root EPC cgroup's LRU list together with other enclaves belonging to the root cgroup? This should be a valid case, right? There is no such case. Each page is in the root by default. Thanks Haitao
Re: [PATCH v9 12/15] x86/sgx: Expose sgx_epc_cgroup_reclaim_pages() for global reclaimer
On Wed, 21 Feb 2024 05:10:36 -0600, Huang, Kai wrote: On Mon, 2024-02-05 at 13:06 -0800, Haitao Huang wrote: From: Kristen Carlson Accardi When cgroup is enabled, all reclaimable pages will be tracked in cgroup LRUs. The global reclaimer needs to start reclamation from the root cgroup. Expose the top level cgroup reclamation function so the global reclaimer can reuse it. Co-developed-by: Sean Christopherson Signed-off-by: Sean Christopherson Signed-off-by: Kristen Carlson Accardi Co-developed-by: Haitao Huang Signed-off-by: Haitao Huang --- V8: - Remove unneeded breaks in function declarations. (Jarkko) V7: - Split this out from the big patch, #10 in V6. (Dave, Kai) --- arch/x86/kernel/cpu/sgx/epc_cgroup.c | 2 +- arch/x86/kernel/cpu/sgx/epc_cgroup.h | 7 +++ 2 files changed, 8 insertions(+), 1 deletion(-) diff --git a/arch/x86/kernel/cpu/sgx/epc_cgroup.c b/arch/x86/kernel/cpu/sgx/epc_cgroup.c index abf74fdb12b4..6e31f8727b8a 100644 --- a/arch/x86/kernel/cpu/sgx/epc_cgroup.c +++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.c @@ -96,7 +96,7 @@ bool sgx_epc_cgroup_lru_empty(struct misc_cg *root) * @indirect: In ksgxd or EPC cgroup work queue context. * Return: Number of pages reclaimed. 
*/ -static unsigned int sgx_epc_cgroup_reclaim_pages(struct misc_cg *root, bool indirect) +unsigned int sgx_epc_cgroup_reclaim_pages(struct misc_cg *root, bool indirect) { /* * Attempting to reclaim only a few pages will often fail and is diff --git a/arch/x86/kernel/cpu/sgx/epc_cgroup.h b/arch/x86/kernel/cpu/sgx/epc_cgroup.h index d061cd807b45..5b3e8e1b8630 100644 --- a/arch/x86/kernel/cpu/sgx/epc_cgroup.h +++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.h @@ -31,6 +31,11 @@ static inline int sgx_epc_cgroup_try_charge(struct sgx_epc_cgroup *epc_cg, bool static inline void sgx_epc_cgroup_uncharge(struct sgx_epc_cgroup *epc_cg) { } static inline void sgx_epc_cgroup_init(void) { } + +static inline unsigned int sgx_epc_cgroup_reclaim_pages(struct misc_cg *root, bool indirect) +{ + return 0; +} #else struct sgx_epc_cgroup { struct misc_cg *cg; @@ -69,6 +74,8 @@ static inline void sgx_put_epc_cg(struct sgx_epc_cgroup *epc_cg) int sgx_epc_cgroup_try_charge(struct sgx_epc_cgroup *epc_cg, bool reclaim); void sgx_epc_cgroup_uncharge(struct sgx_epc_cgroup *epc_cg); bool sgx_epc_cgroup_lru_empty(struct misc_cg *root); +unsigned int sgx_epc_cgroup_reclaim_pages(struct misc_cg *root, bool indirect); + void sgx_epc_cgroup_init(void); #endif I'd just prefer to merge patch such like this to the one that actually uses the exposed function. It's just couple of LOC and we don't deserve to read these repeated changelog and move back and forth between patches during review. IIRC, Jarkko prefers exposing functions first in separate patch. Jarkko, right? Also I found your definition/expectation of self-contained patches is just confusing or too constrained at least. I usually review each patch separately without back and forth and then review them together to see if they all make sense in terms of breakdown. My minimal expectation is that a patch should not depend on future changes and should not break current state of function. 
For this one, I thought the idea was that you verify whether the exposed API makes sense without looking at how it is used in the future. Then when you review the usage patch, you see if the usage is reasonable. I would really hesitate to merge patches at this point unless we all have an agreement and good/strong reasons, or there is a hard rule about this. Thanks Haitao
Re: [syzbot] [virtualization?] linux-next boot error: WARNING: refcount bug in __free_pages_ok
On Thu, Feb 22, 2024 at 11:06:55AM +0800, Lei Yang wrote: > Hi All > > I hit a similar issue when doing a regression testing from my side. > For the error messages please help review the attachment. > > The latest commit: > commit c02197fc9076e7d991c8f6adc11759c5ba52ddc6 (HEAD -> master, > origin/master, origin/HEAD) > Merge: f2667e0c3240 0846dd77c834 > Author: Linus Torvalds > Date: Sat Feb 17 16:59:31 2024 -0800 > > Merge tag 'powerpc-6.8-3' of > git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux > > Pull powerpc fixes from Michael Ellerman: > "This is a bit of a big batch for rc4, but just due to holiday hangover > and because I didn't send any fixes last week due to a late revert > request. I think next week should be back to normal. > > Regards > Lei It all looks like a generic bug dealing with some refcounting in the allocator. So, a chance of a bisect there? -- MST
Re: [PATCH v2] mm: Update mark_victim tracepoints fields
On Wed 21-02-24 13:30:51, Carlos Galo wrote: > On Tue, Feb 20, 2024 at 11:55 PM Michal Hocko wrote: > > > > Hi, > > sorry I have missed this before. > > > > On Thu 11-01-24 21:05:30, Carlos Galo wrote: > > > The current implementation of the mark_victim tracepoint provides only > > > the process ID (pid) of the victim process. This limitation poses > > > challenges for userspace tools that need additional information > > > about the OOM victim. The association between pid and the additional > > > data may be lost after the kill, making it difficult for userspace to > > > correlate the OOM event with the specific process. > > > > You are correct that post OOM all per-process information is lost. On > > the other hand we do dump all this information to the kernel log. Could > > you explain why that is not suitable for your purpose? > > Userspace tools often need real-time visibility into OOM situations > for userspace intervention. Our use case involves utilizing BPF > programs, along with BPF ring buffers, to provide OOM notification to > userspace. Parsing kernel logs would be significant overhead as > opposed to the event based BPF approach. Please put that into the changelog. > > > In order to mitigate this limitation, add the following fields: > > > > > > - UID > > >In Android each installed application has a unique UID. Including > > >the `uid` assists in correlating OOM events with specific apps. > > > > > > - Process Name (comm) > > >Enables identification of the affected process. > > > > > > - OOM Score > > >Allows userspace to get additional insights of the relative kill > > >priority of the OOM victim. > > > > What is the oom score useful for? > > > The OOM score provides us a measure of the victim's importance. On the > android side, it allows us to identify if top or foreground apps are > killed, which have user perceptible impact. But the value on its own (without knowing scores of other tasks) doesn't really tell you anything, does it?
> > Is there any reason to provide different information from the one > > reported to the kernel log? > > __oom_kill_process: > > pr_err("%s: Killed process %d (%s) total-vm:%lukB, anon-rss:%lukB, > > file-rss:%lukB, shmem-rss:%lukB, UID:%u pgtables:%lukB oom_score_adj:%hd\n", > > message, task_pid_nr(victim), victim->comm, K(mm->total_vm), > > K(get_mm_counter(mm, MM_ANONPAGES)), > > K(get_mm_counter(mm, MM_FILEPAGES)), > > K(get_mm_counter(mm, MM_SHMEMPAGES)), > > from_kuid(&init_user_ns, task_uid(victim)), > > mm_pgtables_bytes(mm) >> 10, victim->signal->oom_score_adj); > > > > We added these fields we need (UID, process name, and OOM score), but > we're open to adding the others if you prefer that for consistency > with the kernel log. yes, I think the consistency would be better here. For one it reports numbers which can tell quite a lot about the killed victim. It is a superset of what you are already asking for. With a notable exception of the oom_score which is really dubious without a wider context. -- Michal Hocko SUSE Labs
[PATCH] tracing: Use init_utsname()->release
Instead of using UTS_RELEASE, use init_utsname()->release, which means that we don't need to rebuild the code just for the git head commit changing. Signed-off-by: John Garry --- I originally sent an RFC using new string uts_release, but that string is not needed as we can use init_utsname()->release instead. diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c index 8198bfc54b58..1416bf72f3f4 100644 --- a/kernel/trace/trace.c +++ b/kernel/trace/trace.c @@ -13,7 +13,7 @@ * Copyright (C) 2004 Nadia Yvette Chambers */ #include -#include <generated/utsname.h> +#include <linux/utsname.h> #include #include #include @@ -4368,7 +4368,7 @@ print_trace_header(struct seq_file *m, struct trace_iterator *iter) get_total_entries(buf, &total, &entries); seq_printf(m, "# %s latency trace v1.1.5 on %s\n", - name, UTS_RELEASE); + name, init_utsname()->release); seq_puts(m, "# ---" "-\n"); seq_printf(m, "# latency: %lu us, #%lu/%lu, CPU#%d |" -- 2.31.1
Re: [PATCH v4 0/6] LoongArch: Add pv ipi support on LoongArch VM
On 2/22/24 18:06, maobibo wrote: On 2024/2/22 下午5:34, WANG Xuerui wrote: On 2/17/24 11:15, maobibo wrote: On 2024/2/15 下午6:25, WANG Xuerui wrote: On 2/15/24 18:11, WANG Xuerui wrote: Sorry for the late reply (and Happy Chinese New Year), and thanks for providing microbenchmark numbers! But it seems the more comprehensive CoreMark results were omitted (that's also absent in v3)? While the Of course the benchmark suite should be UnixBench instead of CoreMark. Lesson: don't multi-task code reviews, especially not after consuming beer -- a cup of coffee won't fully cancel the influence. ;-) Where is rule about benchmark choices like UnixBench/Coremark for ipi improvement? Sorry for the late reply. The rules are mostly unwritten, but in general you can think of the preference of benchmark suites as a matter of "effectiveness" -- the closer it's to some real workload in the wild, the better. Micro-benchmarks is okay for illustrating the points, but without demonstrating the impact on realistic workloads, a change could be "useless" in practice or even decrease various performance metrics (be that throughput or latency or anything that matters in the certain case), but get accepted without notice. yes, micro-benchmark cannot represent the real world, however it does not mean that UnixBench/Coremark should be run. You need to point out what is the negative effective from code, or what is the possible real scenario which may benefit. And points out the reasonable benchmark sensitive for IPIs rather than blindly saying UnixBench/Coremark. I was not meaning to argue with you, nor was I implying that your changes "must be regressing things even though I didn't check myself" -- my point is, *any* comparison with realistic workload that shows the performance mostly unaffected inside/outside KVM, would give reviewers (and yourself too) much more confidence in accepting the change. 
For me, personally I think a microbenchmark could be enough, because the only externally-visible change is the IPI mechanism overhead, but please consider other reviewers that may or may not be familiar enough with LoongArch to be able to notice the "triviality". Also, given the 6-patch size of the series, it could hardly be considered "trivial". -- WANG "xen0n" Xuerui Linux/LoongArch mailing list: https://lore.kernel.org/loongarch/
Re: [PATCH v4 0/6] LoongArch: Add pv ipi support on LoongArch VM
On 2024/2/22 5:34 PM, WANG Xuerui wrote: On 2/17/24 11:15, maobibo wrote: On 2024/2/15 6:25 PM, WANG Xuerui wrote: On 2/15/24 18:11, WANG Xuerui wrote: Sorry for the late reply (and Happy Chinese New Year), and thanks for providing microbenchmark numbers! But it seems the more comprehensive CoreMark results were omitted (that's also absent in v3)? While the Of course the benchmark suite should be UnixBench instead of CoreMark. Lesson: don't multi-task code reviews, especially not after consuming beer -- a cup of coffee won't fully cancel the influence. ;-) Where is the rule about benchmark choices like UnixBench/CoreMark for IPI improvement? Sorry for the late reply. The rules are mostly unwritten, but in general you can think of the preference of benchmark suites as a matter of "effectiveness" -- the closer it is to some real workload in the wild, the better. Micro-benchmarks are okay for illustrating the points, but without demonstrating the impact on realistic workloads, a change could be "useless" in practice or even decrease various performance metrics (be that throughput or latency or anything that matters in the certain case), but get accepted without notice. Yes, a micro-benchmark cannot represent the real world, but that does not mean UnixBench/CoreMark should be run. You need to point out the negative effect from the code, or the possible real scenario which may benefit, and point out a reasonable benchmark sensitive to IPIs rather than blindly saying UnixBench/CoreMark. Regards Bibo Mao
Re: [PATCH v3 0/7] Introduction of a remoteproc tee to load signed firmware
On 2/22/2024 2:17 PM, Arnaud POULIQUEN wrote:
> Hello Naman,
>
> On 2/22/24 06:43, Naman Jain wrote:
>> On 2/14/2024 10:51 PM, Arnaud Pouliquen wrote:
>>> Updates from the previous version [1]:
>>>
>>> This version proposes another approach based on an alternate load and
>>> boot of the coprocessor. Therefore, the constraint introduced by
>>> tee_remoteproc is that the firmware has to be authenticated and loaded
>>> before the resource table can be obtained.
>>>
>>> The existing boot sequence is:
>>> 1) Get the resource table and store it in a cache,
>>>    calling rproc->ops->parse_fw().
>>> 2) Parse the resource table and handle resources,
>>>    calling rproc_handle_resources().
>>> 3) Load the firmware, calling rproc->ops->load().
>>> 4) Start the firmware, calling rproc->ops->start().
>>> => Steps 1 and 2 are executed in rproc_fw_boot(), while steps 3 and 4
>>>    are executed in rproc_start().
>>> => The use of the rproc->ops->load() ops is mandatory.
>>>
>>> The boot sequence needed for TEE boot is:
>>> 1) Load the firmware.
>>> 2) Get the loaded resource table, no cache.
>>> 3) Parse the resource table and handle resources.
>>> 4) Start the firmware.
>>
>> Hi,
>> What problem are we really addressing here by reordering the load and
>> the parsing of FW resources?
>
> The feature introduced in TEE is the signature of the firmware images.
> That means that before getting the resource table, we need to first
> authenticate the firmware images. Authenticating a firmware image means
> that we have to copy the firmware into protected memory that cannot be
> corrupted by the non-secure world, and then verify the signature.
>
> The strategy implemented in OP-TEE is to load the firmware into the
> destination memory and then authenticate it. This strategy avoids having
> a temporary copy of the whole image in secure memory, and it imposes
> loading the firmware images before retrieving the resource table.
>
>> Basically, what are the limitations of the current design you are
>> referring to?
>> I understood that TEE is designed that way.
>
> The limitation of the current design is that we obtain the resource
> table before loading the firmware. Following the current design would
> impose constraints in TEE that are not straightforward. Step 1 (getting
> the resource table and storing it in a cache) would require having a
> copy of the resource table in TEE after authenticating the images.
> However, authenticating the firmware, as explained before, depends on
> the strategy implemented. In the TEE implementation, we load the
> firmware into the destination memory to authenticate it.
>
> Regards,
> Arnaud

Hello Arnaud,

I think now I got your point. In TEE, you don't want to do anything (read
the resource table) with the FW images until they are loaded and
authenticated. Since the current design was not allowing you to do that,
you had to reorganize the code so that this could be achieved.

Generally speaking, in the current design, if authentication fails for
some reason later, one can handle it, but it depends on the implementation
of the parse_fw op whether the damage is already done. Please correct me
if this is a wrong assumption.

Patch looks good to me.

Regards,
Naman Jain
Re: [PATCH v4 0/6] LoongArch: Add pv ipi support on LoongArch VM
On 2/17/24 11:15, maobibo wrote:
> On 2024/2/15 6:25 PM, WANG Xuerui wrote:
>> On 2/15/24 18:11, WANG Xuerui wrote:
>>> Sorry for the late reply (and Happy Chinese New Year), and thanks for
>>> providing microbenchmark numbers! But it seems the more comprehensive
>>> CoreMark results were omitted (they are also absent in v3)? While the
>>
>> Of course the benchmark suite should be UnixBench instead of CoreMark.
>> Lesson: don't multi-task code reviews, especially not after consuming
>> beer -- a cup of coffee won't fully cancel the influence. ;-)
>
> Where is the rule about benchmark choices like UnixBench/CoreMark for an
> IPI improvement?

Sorry for the late reply. The rules are mostly unwritten, but in general
you can think of the preference for benchmark suites as a matter of
"effectiveness": the closer a benchmark is to some real workload in the
wild, the better. Micro-benchmarks are okay for illustrating a point, but
without a demonstration of the impact on realistic workloads, a change
could turn out to be "useless" in practice, or even degrade various
performance metrics (be that throughput, latency, or anything that
matters in the given case), yet get accepted without notice.

-- 
WANG "xen0n" Xuerui

Linux/LoongArch mailing list: https://lore.kernel.org/loongarch/
Re: [PATCH v3] modules: wait do_free_init correctly
On Wed, Feb 21, 2024 at 09:40:48AM -0800, Luis Chamberlain wrote:
> + live-patching folks,
>
> Finally, things are starting to be much clearer. Thanks for the time
> spent working on this; some more comments below and a question which
> I think deserves some attention.
>
> On Sat, Feb 17, 2024 at 04:18:10PM +0800, Changbin Du wrote:
> > The synchronization here is just to ensure the module init's been freed
> > before doing W+X checking.
>
> Some nits, this should read instead:
>
>   Fix the ordering of freeing of a module init so that it happens before
>   W+X checking.
>
> > But the commit 1a7b7d922081 ("modules: Use
> > vmalloc special flag") moves do_free_init() into a global workqueue
> > instead of call_rcu(). So now rcu_barrier() cannot ensure that
> > do_free_init() has completed. We should wait for it via flush_work().
>
> Remove "But" and adjust as:
>
>   Commit 1a7b7d922081 ("modules: Use vmalloc special flag") moved
>   calling do_free_init() into a global workqueue instead of relying on it
>   being called through call_rcu(..., do_free_init), which used to allow
>   us to call do_free_init() asynchronously after the end of a subsequent
>   grace period. The move to a global workqueue broke the guarantees for
>   code which needed to be sure the do_free_init() would complete with
>   rcu_barrier(). To fix this, callers which used to rely on rcu_barrier()
>   must now instead use flush_work(&init_free_wq).

Sure, thanks!

> > Without this fix, we still could encounter false positive reports in
> > W+X checking,
>
> This is good, thanks for the clarification.
>
> I think it would be useful for the commit log then to describe also that
> it is not that the freeing was not happening, it is just that our sanity
> checkers raced against the permission checkers which assume init memory
> is already gone.

Okay, I'll append this detailed explanation.

> > and the rcu synchronization is unnecessary which can
> > introduce significant delay.
>
> While this can be true, I am not sure if we can remove it. See below.
>
> > Eric Chanudet reports that the rcu_barrier introduces ~0.1s delay on a
> > PREEMPT_RT kernel.
>
> That's a separate issue.
>
> >   [0.291444] Freeing unused kernel memory: 5568K
> >   [0.402442] Run /sbin/init as init process
> >
> > With this fix, the above delay can be eliminated.
> >
> > Fixes: 1a7b7d922081 ("modules: Use vmalloc special flag")
> > Signed-off-by: Changbin Du
> > Cc: Xiaoyi Su
> > Cc: Eric Chanudet
> >
> > ---
> > v3:
> >   - amend comment in do_init_module() and update commit msg.
> > v2:
> >   - fix compilation issue for no CONFIG_MODULES found by 0-DAY.
> > ---
> >  include/linux/moduleloader.h | 8
> >  init/main.c                  | 5 +++--
> >  kernel/module/main.c         | 9 +++--
> >  3 files changed, 18 insertions(+), 4 deletions(-)
> >
> > diff --git a/include/linux/moduleloader.h b/include/linux/moduleloader.h
> > index 001b2ce83832..89b1e0ed9811 100644
> > --- a/include/linux/moduleloader.h
> > +++ b/include/linux/moduleloader.h
> > @@ -115,6 +115,14 @@ int module_finalize(const Elf_Ehdr *hdr,
> >  		    const Elf_Shdr *sechdrs,
> >  		    struct module *mod);
> >
> > +#ifdef CONFIG_MODULES
> > +void flush_module_init_free_work(void);
> > +#else
> > +static inline void flush_module_init_free_work(void)
> > +{
> > +}
> > +#endif
> > +
> >  /* Any cleanup needed when module leaves. */
> >  void module_arch_cleanup(struct module *mod);
> >
> > diff --git a/init/main.c b/init/main.c
> > index e24b0780fdff..f0b7e21ac67f 100644
> > --- a/init/main.c
> > +++ b/init/main.c
> > @@ -99,6 +99,7 @@
> >  #include
> >  #include
> >  #include
> > +#include
> >  #include
> >
> >  #include
> >
> > @@ -1402,11 +1403,11 @@ static void mark_readonly(void)
> >  	if (rodata_enabled) {
> >  		/*
> >  		 * load_module() results in W+X mappings, which are cleaned
> > -		 * up with call_rcu(). Let's make sure that queued work is
> > +		 * up with init_free_wq. Let's make sure that queued work is
> >  		 * flushed so that we don't hit false positives looking for
> >  		 * insecure pages which are W+X.
> >  		 */
> > -		rcu_barrier();
>
> Was this the only source of waiters that used rcu_barrier() to sync?
> What about kallsyms, live-patching?
>
> The original source of the addition of this rcu_barrier() (in a slightly
> older modified form with rcu_barrier_sched()) was commit
> ae646f0b9ca13 ("init: fix false positives in W+X checking") since
> v4.17 in 2018, 6 years ago. So I'm hoping we don't have any other
> side-by new users which have grown dependent on this rcu_barrier() for
> other call_rcu()'s they may have used, but it is hard to tell.

Per the condition 'rodata_enabled' and the comments, I think the
rcu_barrier() is only used to synchronize with freeing the module init
memory.

> So while I agree that flush work is the right solution, removing the
Re: [PATCH v3 0/7] Introduction of a remoteproc tee to load signed firmware
Hello Naman,

On 2/22/24 06:43, Naman Jain wrote:
> On 2/14/2024 10:51 PM, Arnaud Pouliquen wrote:
>> Updates from the previous version [1]:
>>
>> This version proposes another approach based on an alternate load and
>> boot of the coprocessor. Therefore, the constraint introduced by
>> tee_remoteproc is that the firmware has to be authenticated and loaded
>> before the resource table can be obtained.
>>
>> The existing boot sequence is:
>> 1) Get the resource table and store it in a cache,
>>    calling rproc->ops->parse_fw().
>> 2) Parse the resource table and handle resources,
>>    calling rproc_handle_resources().
>> 3) Load the firmware, calling rproc->ops->load().
>> 4) Start the firmware, calling rproc->ops->start().
>> => Steps 1 and 2 are executed in rproc_fw_boot(), while steps 3 and 4
>>    are executed in rproc_start().
>> => The use of the rproc->ops->load() ops is mandatory.
>>
>> The boot sequence needed for TEE boot is:
>>
>> 1) Load the firmware.
>> 2) Get the loaded resource table, no cache.
>> 3) Parse the resource table and handle resources.
>> 4) Start the firmware.
>
> Hi,
> What problem are we really addressing here by reordering the load and
> the parsing of FW resources?

The feature introduced in TEE is the signature of the firmware images.
That means that before getting the resource table, we need to first
authenticate the firmware images. Authenticating a firmware image means
that we have to copy the firmware into protected memory that cannot be
corrupted by the non-secure world, and then verify the signature.

The strategy implemented in OP-TEE is to load the firmware into the
destination memory and then authenticate it. This strategy avoids having
a temporary copy of the whole image in secure memory, and it imposes
loading the firmware images before retrieving the resource table.

> Basically, what are the limitations of the current design you are
> referring to?
> I understood that TEE is designed that way.

The limitation of the current design is that we obtain the resource
table before loading the firmware. Following the current design would
impose constraints in TEE that are not straightforward. Step 1 (getting
the resource table and storing it in a cache) would require having a
copy of the resource table in TEE after authenticating the images.
However, authenticating the firmware, as explained before, depends on
the strategy implemented. In the TEE implementation, we load the
firmware into the destination memory to authenticate it.

Regards,
Arnaud

>>
>> Then the crash recovery also has to be managed. For recovery, the
>> cache is used to temporarily save the resource table and then reapply
>> it on restart:
>> 1) Stop the remote processor, calling rproc->ops->stop().
>> 2) Load the firmware, calling rproc->ops->load().
>> 3) Copy the cached resource table.
>> 4) Start the remote processor, calling rproc->ops->start().
>>
>> => This sequence is also needed when TEE manages the boot of the
>>    remote processor.
>> => The rproc->ops->load() is also used in the recovery sequence.
>>
>> Based on the sequences described above, the proposal is to:
>>
>> - Rework the tee_rproc API to better match the rproc_ops structure.
>>   This allows simply mapping the function to implement the load ops,
>>   which is not optional. tee_rproc_load_fw() is updated in
>>   consequence.
>> - Remove the call of rproc_load_segments() from rproc_start() to
>>   dissociate the load and the start. This is necessary to implement
>>   the boot sequence requested for the TEE remote proc support.
>> - Introduce an rproc_alt_fw_boot() function that is an alternative
>>   boot sequence, which implements the sequence requested for the TEE
>>   remoteproc support.
>>
>> [1]
>> https://lore.kernel.org/lkml/20240118100433.3984196-1-arnaud.pouliq...@foss.st.com/T/
>>
>> Description of the feature:
>>
>> This series proposes the implementation of a remoteproc tee driver to
>> communicate with a TEE trusted application responsible for
>> authenticating and loading the remoteproc firmware image in an Arm
>> secure context.
>>
>> 1) Principle:
>>
>> The remoteproc tee driver provides services to communicate with the
>> OP-TEE trusted application running on the Trusted Execution Context
>> (TEE).
>
> s/Context/Environment?
>
>> The trusted application in TEE manages the remote processor lifecycle:
>>
>> - authenticating and loading firmware images,
>> - isolating and securing the remote processor memories,
>> - supporting multi-firmware (e.g., TF-M + Zephyr on a Cortex-M33),
>> - managing the start and stop of the firmware by the TEE.
>
> Regards,
> Naman Jain
Re: [PATCH] dt-bindings: usb: typec: anx7688: start a binding document
On 21/02/2024 23:45, Pavel Machek wrote:
>>> +properties:
>>> +  compatible:
>>> +    const: usb-c-connector
>>> +
>>> +  power-role: true
>>> +
>>> +  data-role: true
>>> +
>>> +  try-power-role: true
>>
>> I don't understand why you have these properties here. Do you see any
>> binding like this?
>
> Well, it looks like I copied these mistakes from analogix,anx7411.yaml .
>
>>> +
>>> +required:
>>> +  - compatible
>>
>> Drop, why is it needed?
>
> Again, copy from analogix,anx7411.yaml .
>
> Should I try to fix up analogix,anx7411.yaml , and submit that, first,
> or is it easier if you do that?

I'll fix it up, thanks.

Best regards,
Krzysztof