Re: [PATCH v2 1/1] vhost: Added pad cleanup if vnet_hdr is not present.
On Thu, Mar 28, 2024 at 7:44 AM Andrew Melnychenko wrote:
>
> When Qemu is launched with vhost but without tap vnet_hdr, vhost tries
> to copy a vnet_hdr of size 0 from the socket iter to a page that may
> contain some trash. That trash can be interpreted as unpredictable
> values for the vnet_hdr, which leads to dropping some packets and, in
> some cases, to stalling the vhost routine when vhost_net tries to
> process packets and fails in a loop.
>
> Qemu options:
>   -netdev tap,vhost=on,vnet_hdr=off,...
>
> From a security point of view, wrong values in the fields are used
> later by tap's tap_get_user_xdp() and will affect skb gso and options.
> Later, the header (and data in headroom) should not be used by the
> stack. Using a custom socket as a backend to vhost_net can reveal some
> data in the vnet_hdr, although it would require kernel access to
> implement.
>
> The issue happens because the value of sock_len in the virtqueue is 0.
> That value is set at vhost_net_set_features() with
> VHOST_NET_F_VIRTIO_NET_HDR, and it is also set to zero in the device
> open() and reset() routines. So, currently, to trigger the issue, we
> need to set up Qemu with vhost=on,vnet_hdr=off, or not configure vhost
> in the custom program.
>
> Signed-off-by: Andrew Melnychenko

Acked-by: Jason Wang

It seems it has been merged by Michael.

Thanks

> ---
>  drivers/vhost/net.c | 3 +++
>  1 file changed, 3 insertions(+)
>
> diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
> index f2ed7167c848..57411ac2d08b 100644
> --- a/drivers/vhost/net.c
> +++ b/drivers/vhost/net.c
> @@ -735,6 +735,9 @@ static int vhost_net_build_xdp(struct vhost_net_virtqueue *nvq,
>  	hdr = buf;
>  	gso = &hdr->gso;
>
> +	if (!sock_hlen)
> +		memset(buf, 0, pad);
> +
>  	if ((gso->flags & VIRTIO_NET_HDR_F_NEEDS_CSUM) &&
>  	    vhost16_to_cpu(vq, gso->csum_start) +
>  	    vhost16_to_cpu(vq, gso->csum_offset) + 2 >
> --
> 2.43.0
Re: [PATCH v3 3/3] vhost: Improve vhost_get_avail_idx() with smp_rmb()
On Thu, Mar 28, 2024 at 8:22 AM Gavin Shan wrote:
>
> All the callers of vhost_get_avail_idx() are concerned with the memory
> barrier, imposed by smp_rmb(), to ensure the ordering of the available
> ring entry read and the avail_idx read.
>
> Improve vhost_get_avail_idx() so that smp_rmb() is executed when the
> avail_idx is advanced. With it, the callers needn't worry about the
> memory barrier.
>
> Suggested-by: Michael S. Tsirkin
> Signed-off-by: Gavin Shan

Acked-by: Jason Wang

Thanks
Re: [PATCH v9 15/15] selftests/sgx: Add scripts for EPC cgroup testing
On Wed, 27 Mar 2024 07:55:34 -0500, Jarkko Sakkinen wrote:
On Mon, 2024-02-05 at 13:06 -0800, Haitao Huang wrote:

The scripts rely on the cgroup-tools package from libcgroup [1].

To run the selftests for the EPC cgroup:

sudo ./run_epc_cg_selftests.sh

To watch misc cgroup 'current' changes during testing, run this in a separate terminal:

./watch_misc_for_tests.sh current

With different cgroups, the script starts one or multiple concurrent SGX selftests, each running one unclobbered_vdso_oversubscribed test. Each such test tries to load an enclave of EPC size equal to the EPC capacity available on the platform. The script checks results against the expectation set for each cgroup and reports success or failure.

The script creates 3 different cgroups at the beginning with the following expectations:

1) SMALL - intentionally small enough to fail the test loading an enclave of size equal to the capacity.
2) LARGE - large enough to run up to 4 concurrent tests but fail some if more than 4 concurrent tests are run. The script starts 4 expecting at least one test to pass, and then starts 5 expecting at least one test to fail.
3) LARGER - the limit is the same as the capacity, large enough to run lots of concurrent tests. The script starts 8 of them and expects all to pass. Then it reruns the same test with one process randomly killed and checks that usage is zero after all processes exit.

The script also includes a test with a low mem_cg limit and a LARGE sgx_epc limit to verify that the RAM used for per-cgroup reclamation is charged to a proper mem_cg.

[1] https://github.com/libcgroup/libcgroup/blob/main/README

Signed-off-by: Haitao Huang
---
V7:
- Added memcontrol test.

V5:
- Added script with automatic results checking, removed the interactive script.
- The script can run independently from the series below.
--- .../selftests/sgx/run_epc_cg_selftests.sh | 246 ++ .../selftests/sgx/watch_misc_for_tests.sh | 13 + 2 files changed, 259 insertions(+) create mode 100755 tools/testing/selftests/sgx/run_epc_cg_selftests.sh create mode 100755 tools/testing/selftests/sgx/watch_misc_for_tests.sh diff --git a/tools/testing/selftests/sgx/run_epc_cg_selftests.sh b/tools/testing/selftests/sgx/run_epc_cg_selftests.sh new file mode 100755 index ..e027bf39f005 --- /dev/null +++ b/tools/testing/selftests/sgx/run_epc_cg_selftests.sh @@ -0,0 +1,246 @@ +#!/bin/bash This is not portable and neither does hold in the wild. It does not even often hold as it is not uncommon to place bash to the path /usr/bin/bash. If I recall correctly, e.g. NixOS has a path that is neither of those two. Should be #!/usr/bin/env bash That is POSIX compatible form. Sure Just got around trying to test this in NUC7 so looking into this in more detail. Thanks. Could you please check if this version works for you? https://github.com/haitaohuang/linux/commit/3c424b841cf3cf66b085a424f4b537fbc3bbff6f That said can you make the script work with just "#!/usr/bin/env sh" and make sure that it is busybox ash compatible? Yes. I don't see any necessity to make this bash only and it adds to the compilation time of the image. Otherwise lot of this could be tested just with qemu+bzImage+busybox(inside initramfs). will still need cgroup-tools as you pointed out later. Compiling from its upstream code OK? Now you are adding fully glibc shenanigans for the sake of syntax sugar. +# SPDX-License-Identifier: GPL-2.0 +# Copyright(c) 2023 Intel Corporation. + +TEST_ROOT_CG=selftest +cgcreate -g misc:$TEST_ROOT_CG How do you know that cgcreate exists? It is used a lot in the script with no check for the existence. Please fix e.g. with "command -v cgreate". +if [ $? -ne 0 ]; then +echo "# Please make sure cgroup-tools is installed, and misc cgroup is mounted." +exit 1 +fi And please do not do it this way. 
Also, please remove the advice for "cgroups-tool". This is not meant to be debian only. Better would be to e.g. point out the URL of the upstream project. And yeah the whole message should be based on "command -v", not like this. OK +TEST_CG_SUB1=$TEST_ROOT_CG/test1 +TEST_CG_SUB2=$TEST_ROOT_CG/test2 +# We will only set limit in test1 and run tests in test3 +TEST_CG_SUB3=$TEST_ROOT_CG/test1/test3 +TEST_CG_SUB4=$TEST_ROOT_CG/test4 + +cgcreate -g misc:$TEST_CG_SUB1 +cgcreate -g misc:$TEST_CG_SUB2 +cgcreate -g misc:$TEST_CG_SUB3 +cgcreate -g misc:$TEST_CG_SUB4 + +# Default to V2 +CG_MISC_ROOT=/sys/fs/cgroup +CG_MEM_ROOT=/sys/fs/cgroup +CG_V1=0 +if [ ! -d "/sys/fs/cgroup/misc" ]; then +echo "# cgroup V2 is in use." +else +echo "# cgroup V1 is in use." Is "#" prefix a standard for kselftest? I don't know this, thus asking. +CG_MISC_ROOT=/sys/fs/cgroup/misc +CG_MEM_ROOT=/sys/fs/cgroup/memory +CG_V1=1 Have you checked what is the indentation policy for bash scripts inside kernel tree. I don't know what it is. That's why I'm asking. Right. I looked around and found scripts
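Jarkko's two asks — a portable shebang and probing for the tools with `command -v` instead of testing `$?` after the first `cgcreate` fails — can be sketched as below. This is a hedged example, not the selftest itself: `require_tools` and the tool list are illustrative, and the upstream URL is the one the review suggests pointing at.

```shell
#!/usr/bin/env sh
# POSIX-sh probe for required binaries, done up front and
# distro-neutrally, instead of letting the first cgcreate fail.
require_tools() {
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "# '$tool' not found; see https://github.com/libcgroup/libcgroup" >&2
            return 1
        fi
    done
    return 0
}

# In the selftest this would be: require_tools cgcreate cgexec || exit 1
# Demonstrated here with a tool that always exists and one that never does:
require_tools sh && echo "ok"
require_tools definitely-not-a-real-tool-xyz 2>/dev/null || echo "missing"
```

The function returns non-zero on the first missing tool, so the script can exit early with a single, distribution-agnostic message.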
Re: [PATCH v4 02/14] mm: Switch mm->get_unmapped_area() to a flag
On Wed, 2024-03-27 at 15:15 +0200, Jarkko Sakkinen wrote: > I mean I believe the change itself makes sense, it is just not > fully documented in the commit message. Ah, I see. Yes, there could be more background on arch_pick_mmap_layout().
Re: [PATCH v4 02/14] mm: Switch mm->get_unmapped_area() to a flag
On Tue, 2024-03-26 at 23:38 -0700, Dan Williams wrote:
> > +unsigned long
> > +mm_get_unmapped_area(struct mm_struct *mm, struct file *file,
> > +		     unsigned long addr, unsigned long len,
> > +		     unsigned long pgoff, unsigned long flags)
> > +{
>
> Seems like a small waste to have all call sites now need to push an
> additional @mm argument onto the stack just to figure out what function
> to call.
>
> > +	if (test_bit(MMF_TOPDOWN, &mm->flags))
> > +		return arch_get_unmapped_area_topdown(file, addr, len, pgoff, flags);
> > +	return arch_get_unmapped_area(file, addr, len, pgoff, flags);
>
> This seems small enough to be amenable to a static inline, but that
> would require exporting the arch calls which might be painful.
>
> Would it not be possible to drop the @mm argument and just reference
> current->mm internal to this function? Just call this function
> current_get_unmapped_area()?

Hmm, you are right. The callers all pass current->mm. The mm argument could be removed.
Re: [PATCH v9 15/15] selftests/sgx: Add scripts for EPC cgroup testing
On Wed, 27 Mar 2024 19:57:26 -0500, Haitao Huang wrote: On Wed, 27 Mar 2024 11:56:35 -0500, Jarkko Sakkinen wrote: On Wed Mar 27, 2024 at 2:55 PM EET, Jarkko Sakkinen wrote: On Mon, 2024-02-05 at 13:06 -0800, Haitao Huang wrote: > The scripts rely on cgroup-tools package from libcgroup [1]. > > To run selftests for epc cgroup: > > sudo ./run_epc_cg_selftests.sh > > To watch misc cgroup 'current' changes during testing, run this in a > separate terminal: ... > wait_and_detect_for_any() { what is "any"? Maybe for some key functions could have short documentation what they are and for what test uses them. I cannot possibly remember all of this just by hints such as "this waits for Any" ;-) I don't think there is actual kernel guideline to engineer the script to work with just ash but at least for me that would inevitably increase my motivation to test this patch set more rather than less. I also wonder is cgroup-tools dependency absolutely required or could you just have a function that would interact with sysfs? I should have checked email before hit the send button for v10 :-). It'd be more complicated and less readable to do all the stuff without the cgroup-tools, esp cgexec. I checked dependency, cgroup-tools only depends on libc so I hope this would not cause too much inconvenience. I saw bash was also used in cgroup test scripts so at least that's consistent :-) I can look into ash if that's required. Let me know. Sorry I missed you earlier comments. I actually tried to make it compatible with busybox on Ubuntu. I'll address your other comments and update. But meanwhile could you tryout this version in your env? https://github.com/haitaohuang/linux/commit/3c424b841cf3cf66b085a424f4b537fbc3bbff6f Thanks Haitao
Re: [PATCH v4 2/2] tracing: Include Microcode Revision in mce_record tracepoint
> > You *definitely* want to do that - good catch.
> >
> > And TBH, all the screaming words aren't helping either... :)

:) I thought the same as well. But I felt inconsistently screaming words might be worse. Maybe just update all the words that are not acronyms (such as Processor, Time, Socket, etc.).
Re: [PATCH net v2 1/2] virtio_net: Do not set rss_indir if RSS is not supported
On Wed, 27 Mar 2024 06:51:25 -0700, Breno Leitao wrote: > Hello Xuan, > > On Wed, Mar 27, 2024 at 09:37:43AM +0800, Xuan Zhuo wrote: > > On Tue, 26 Mar 2024 08:19:08 -0700, Breno Leitao wrote: > > > Do not set virtnet_info->rss_indir_table_size if RSS is not available > > > for the device. > > > > > > Currently, rss_indir_table_size is set if either has_rss or > > > has_rss_hash_report is available, but, it should only be set if has_rss > > > is set. > > > > > > On the virtnet_set_rxfh(), return an invalid command if the request has > > > indirection table set, but virtnet does not support RSS. > > > > > > Suggested-by: Heng Qi > > > Signed-off-by: Breno Leitao > > > --- > > > drivers/net/virtio_net.c | 9 +++-- > > > 1 file changed, 7 insertions(+), 2 deletions(-) > > > > > > diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c > > > index c22d1118a133..c640fdf28fc5 100644 > > > --- a/drivers/net/virtio_net.c > > > +++ b/drivers/net/virtio_net.c > > > @@ -3813,6 +3813,9 @@ static int virtnet_set_rxfh(struct net_device *dev, > > > rxfh->hfunc != ETH_RSS_HASH_TOP) > > > return -EOPNOTSUPP; > > > > > > + if (rxfh->indir && !vi->has_rss) > > > + return -EINVAL; > > > + > > > if (rxfh->indir) { > > > > Put !vi->has_rss here? > > I am not sure I understand the suggestion. Where do you suggest we have > !vi->has_rss? > > If we got this far, we either have: > > a) rxfh->indir set and vi->has_rss is also set > b) rxfh->indir not set. (vi->has_rss could be set or not). This function does two tasks. 1. update indir table 2. 
update rss key.

#1 is only for has_rss; #2 is for has_rss or has_rss_hash_report. So I would code:

	bool update = false;

	if (rxfh->indir) {
		if (!vi->has_rss)
			return -EINVAL;

		for (i = 0; i < vi->rss_indir_table_size; ++i)
			vi->ctrl->rss.indirection_table[i] = rxfh->indir[i];

		update = true;
	}

	if (rxfh->key) {
		if (!vi->has_rss && !vi->has_rss_hash_report)
			return -EINVAL;

		memcpy(vi->ctrl->rss.key, rxfh->key, vi->rss_key_size);
		update = true;
	}

	if (update)
		virtnet_commit_rss_command(vi);

Thanks.

> Thanks for the review,
> Breno
Re: [PATCH v6 2/2] memory tier: create CPUless memory tiers after obtaining HMAT info
"Ho-Ren (Jack) Chuang" writes: [snip] > @@ -655,6 +672,34 @@ void mt_put_memory_types(struct list_head *memory_types) > } > EXPORT_SYMBOL_GPL(mt_put_memory_types); > > +/* > + * This is invoked via `late_initcall()` to initialize memory tiers for > + * CPU-less memory nodes after driver initialization, which is > + * expected to provide `adistance` algorithms. > + */ > +static int __init memory_tier_late_init(void) > +{ > + int nid; > + > + mutex_lock(_tier_lock); > + for_each_node_state(nid, N_MEMORY) > + if (!node_state(nid, N_CPU) && > + node_memory_types[nid].memtype == NULL) Think about this again. It seems that it is better to check "node_memory_types[nid].memtype == NULL" only here. Because for all node with N_CPU in memory_tier_init(), "node_memory_types[nid].memtype" will be !NULL. And it's possible (in theory) that some nodes becomes "node_state(nid, N_CPU) == true" between memory_tier_init() and memory_tier_late_init(). Otherwise, Looks good to me. Feel free to add Reviewed-by: "Huang, Ying" in the future version. > + /* > + * Some device drivers may have initialized memory tiers > + * between `memory_tier_init()` and > `memory_tier_late_init()`, > + * potentially bringing online memory nodes and > + * configuring memory tiers. Exclude them here. > + */ > + set_node_memory_tier(nid); > + > + establish_demotion_targets(); > + mutex_unlock(_tier_lock); > + > + return 0; > +} > +late_initcall(memory_tier_late_init); > + [snip] -- Best Regards, Huang, Ying
Re: [PATCH v9 15/15] selftests/sgx: Add scripts for EPC cgroup testing
On Wed, 27 Mar 2024 11:56:35 -0500, Jarkko Sakkinen wrote: On Wed Mar 27, 2024 at 2:55 PM EET, Jarkko Sakkinen wrote: On Mon, 2024-02-05 at 13:06 -0800, Haitao Huang wrote: > The scripts rely on cgroup-tools package from libcgroup [1]. > > To run selftests for epc cgroup: > > sudo ./run_epc_cg_selftests.sh > > To watch misc cgroup 'current' changes during testing, run this in a > separate terminal: > > ./watch_misc_for_tests.sh current > > With different cgroups, the script starts one or multiple concurrent > SGX > selftests, each to run one unclobbered_vdso_oversubscribed test.> Each > of such test tries to load an enclave of EPC size equal to the EPC > capacity available on the platform. The script checks results against > the expectation set for each cgroup and reports success or failure. > > The script creates 3 different cgroups at the beginning with > following > expectations: > > 1) SMALL - intentionally small enough to fail the test loading an > enclave of size equal to the capacity. > 2) LARGE - large enough to run up to 4 concurrent tests but fail some > if > more than 4 concurrent tests are run. The script starts 4 expecting > at > least one test to pass, and then starts 5 expecting at least one test > to fail. > 3) LARGER - limit is the same as the capacity, large enough to run > lots of > concurrent tests. The script starts 8 of them and expects all pass. > Then it reruns the same test with one process randomly killed and > usage checked to be zero after all process exit. > > The script also includes a test with low mem_cg limit and LARGE > sgx_epc > limit to verify that the RAM used for per-cgroup reclamation is > charged > to a proper mem_cg. > > [1] https://github.com/libcgroup/libcgroup/blob/main/README > > Signed-off-by: Haitao Huang > --- > V7: > - Added memcontrol test. > > V5: > - Added script with automatic results checking, remove the > interactive > script. > - The script can run independent from the series below. 
> --- > .../selftests/sgx/run_epc_cg_selftests.sh | 246 > ++ > .../selftests/sgx/watch_misc_for_tests.sh | 13 + > 2 files changed, 259 insertions(+) > create mode 100755 > tools/testing/selftests/sgx/run_epc_cg_selftests.sh > create mode 100755 > tools/testing/selftests/sgx/watch_misc_for_tests.sh > > diff --git a/tools/testing/selftests/sgx/run_epc_cg_selftests.sh > b/tools/testing/selftests/sgx/run_epc_cg_selftests.sh > new file mode 100755 > index ..e027bf39f005 > --- /dev/null > +++ b/tools/testing/selftests/sgx/run_epc_cg_selftests.sh > @@ -0,0 +1,246 @@ > +#!/bin/bash This is not portable and neither does hold in the wild. It does not even often hold as it is not uncommon to place bash to the path /usr/bin/bash. If I recall correctly, e.g. NixOS has a path that is neither of those two. Should be #!/usr/bin/env bash That is POSIX compatible form. Just got around trying to test this in NUC7 so looking into this in more detail. That said can you make the script work with just "#!/usr/bin/env sh" and make sure that it is busybox ash compatible? I don't see any necessity to make this bash only and it adds to the compilation time of the image. Otherwise lot of this could be tested just with qemu+bzImage+busybox(inside initramfs). Now you are adding fully glibc shenanigans for the sake of syntax sugar. > +# SPDX-License-Identifier: GPL-2.0 > +# Copyright(c) 2023 Intel Corporation. > + > +TEST_ROOT_CG=selftest > +cgcreate -g misc:$TEST_ROOT_CG How do you know that cgcreate exists? It is used a lot in the script with no check for the existence. Please fix e.g. with "command -v cgreate". > +if [ $? -ne 0 ]; then > +echo "# Please make sure cgroup-tools is installed, and misc > cgroup is mounted." > +exit 1 > +fi And please do not do it this way. Also, please remove the advice for "cgroups-tool". This is not meant to be debian only. Better would be to e.g. point out the URL of the upstream project. 
And yeah the whole message should be based on "command -v", not like this. > +TEST_CG_SUB1=$TEST_ROOT_CG/test1 > +TEST_CG_SUB2=$TEST_ROOT_CG/test2 > +# We will only set limit in test1 and run tests in test3 > +TEST_CG_SUB3=$TEST_ROOT_CG/test1/test3 > +TEST_CG_SUB4=$TEST_ROOT_CG/test4 > + > +cgcreate -g misc:$TEST_CG_SUB1 > +cgcreate -g misc:$TEST_CG_SUB2 > +cgcreate -g misc:$TEST_CG_SUB3 > +cgcreate -g misc:$TEST_CG_SUB4 > + > +# Default to V2 > +CG_MISC_ROOT=/sys/fs/cgroup > +CG_MEM_ROOT=/sys/fs/cgroup > +CG_V1=0 > +if [ ! -d "/sys/fs/cgroup/misc" ]; then > +echo "# cgroup V2 is in use." > +else > +echo "# cgroup V1 is in use." Is "#" prefix a standard for kselftest? I don't know this, thus asking. > +CG_MISC_ROOT=/sys/fs/cgroup/misc > +CG_MEM_ROOT=/sys/fs/cgroup/memory > +CG_V1=1 Have you checked what is the indentation policy for bash scripts inside kernel tree. I don't know what it is. That's why I'm asking. > +fi > + > +CAPACITY=$(grep "sgx_epc" "$CG_MISC_ROOT/misc.capacity" | awk
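The v1/v2 probe the quoted script performs — checking whether a per-controller directory such as `misc` exists under the cgroup mount — can be sketched as a small POSIX-sh function. This is a hedged example: `detect_cg_root` and the parameterized base path are illustrative (the real script hard-codes `/sys/fs/cgroup`); parameterizing it just makes the logic testable anywhere.

```shell
#!/usr/bin/env sh
# If a controller directory exists directly under the cgroup mount
# point, the legacy v1 hierarchy is in use (per-controller mounts);
# otherwise assume the unified v2 hierarchy.
detect_cg_root() {
    base="$1"
    if [ -d "$base/misc" ]; then
        echo "$base/misc"   # v1: misc controller has its own mount
    else
        echo "$base"        # v2: unified hierarchy
    fi
}

tmp=$(mktemp -d)
mkdir "$tmp/misc"
detect_cg_root "$tmp"       # v1-style layout: prints "$tmp/misc"
rm -r "$tmp/misc"
detect_cg_root "$tmp"       # v2-style layout: prints "$tmp"
rmdir "$tmp"
```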
Re: [PATCH] uprobes: reduce contention on uprobes_tree access
On Wed, Mar 27, 2024 at 5:18 PM Masami Hiramatsu wrote:
>
> On Wed, 27 Mar 2024 17:06:01 +0000
> Jonathan Haslam wrote:
>
> > > > Masami,
> > > >
> > > > Given the discussion around per-cpu rw semaphore and the need for an
> > > > (internal) batched attachment API for uprobes, do you think you can
> > > > apply this patch as is for now? We can then gain initial improvements
> > > > in scalability that are also easy to backport, and Jonathan will work
> > > > on a more complete solution based on per-cpu RW semaphore, as
> > > > suggested by Ingo.
> > >
> > > Yeah, it is interesting to use per-cpu rw semaphore on uprobe.
> > > I would like to wait for the next version.
> >
> > My initial tests show a nice improvement with the RW spinlocks but a
> > significant regression in acquiring a write lock. I've got a few days
> > vacation over Easter but I'll aim to get some more formalised results out
> > to the thread toward the end of next week.
>
> As far as the write lock is only on the cold path, I think you can choose
> per-cpu RW semaphore. Since it does not do busy wait, the total system
> performance impact will be small.

No, Masami, unfortunately it's not as simple. In BPF we have BPF multi-uprobe, which can be used to attach to thousands of user functions. It currently creates one uprobe at a time, as we don't really have a batched API. If each such uprobe registration now takes a (relatively) long time, when multiplied by the number of attach-to user functions, it will be a horrible regression in terms of attachment/detachment performance.

So when we switch to a per-CPU rw semaphore, we'll need to provide an internal batch uprobe attach/detach API to make sure that attaching to multiple uprobes is still fast.

Which is why I was asking to land this patch as is, as it relieves the scalability pains in production and is easy to backport to old kernels. And then we can work on batched APIs and switch to a per-CPU rw semaphore.
So I hope you can reconsider and accept improvements in this patch, while Jonathan will keep working on an even better final solution. Thanks!

> I look forward to your formalized results :)
>
> Thank you,
>
> > Jon.
> >
> > > Thank you,
> > >
> > > > BTW, how did you measure the overhead? I think spinlock overhead
> > > > will depend on how much lock contention happens.
> > > >
> > > > Thank you,
> > > >
> > > > > [0] https://docs.kernel.org/locking/spinlocks.html
> > > > >
> > > > > Signed-off-by: Jonathan Haslam
> > > > > ---
> > > > >  kernel/events/uprobes.c | 22 +++---
> > > > >  1 file changed, 11 insertions(+), 11 deletions(-)
> > > > >
> > > > > diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
> > > > > index 929e98c62965..42bf9b6e8bc0 100644
> > > > > --- a/kernel/events/uprobes.c
> > > > > +++ b/kernel/events/uprobes.c
> > > > > @@ -39,7 +39,7 @@ static struct rb_root uprobes_tree = RB_ROOT;
> > > > >   */
> > > > >  #define no_uprobe_events()	RB_EMPTY_ROOT(&uprobes_tree)
> > > > >
> > > > > -static DEFINE_SPINLOCK(uprobes_treelock);	/* serialize rbtree access */
> > > > > +static DEFINE_RWLOCK(uprobes_treelock);	/* serialize rbtree access */
> > > > >
> > > > >  #define UPROBES_HASH_SZ	13
> > > > >  /* serialize uprobe->pending_list */
> > > > > @@ -669,9 +669,9 @@ static struct uprobe *find_uprobe(struct inode *inode, loff_t offset)
> > > > >  {
> > > > >  	struct uprobe *uprobe;
> > > > >
> > > > > -	spin_lock(&uprobes_treelock);
> > > > > +	read_lock(&uprobes_treelock);
> > > > >  	uprobe = __find_uprobe(inode, offset);
> > > > > -	spin_unlock(&uprobes_treelock);
> > > > > +	read_unlock(&uprobes_treelock);
> > > > >
> > > > >  	return uprobe;
> > > > >  }
> > > > > @@ -701,9 +701,9 @@ static struct uprobe *insert_uprobe(struct uprobe *uprobe)
> > > > >  {
> > > > >  	struct uprobe *u;
> > > > >
> > > > > -	spin_lock(&uprobes_treelock);
> > > > > +	write_lock(&uprobes_treelock);
> > > > >  	u = __insert_uprobe(uprobe);
> > > > > -	spin_unlock(&uprobes_treelock);
> > > > > +	write_unlock(&uprobes_treelock);
> > > > >
> > > > >  	return u;
> > > > >  }
> > > > > @@ -935,9 +935,9 @@ static void delete_uprobe(struct uprobe *uprobe)
> > > > >  	if (WARN_ON(!uprobe_is_active(uprobe)))
> > > > >  		return;
> > > > >
> > > > > -	spin_lock(&uprobes_treelock);
> > > > > +	write_lock(&uprobes_treelock);
> > > > >  	rb_erase(&uprobe->rb_node, &uprobes_tree);
> > > > > -	spin_unlock(&uprobes_treelock);
> > > > > +	write_unlock(&uprobes_treelock);
> > > > >  	RB_CLEAR_NODE(&uprobe->rb_node); /* for uprobe_is_active() */
> > > > >  	put_uprobe(uprobe);
> > > > >  }
> > > > > @@ -1298,7 +1298,7 @@ static void build_probe_list(struct inode *inode,
> > > > >  	min = vaddr_to_offset(vma, start);
> > > > >  	max = min + (end - start) - 1;
Re: [PATCH v19 RESEND 4/5] Documentation: tracing: Add ring-buffer mapping
Hi,

On 3/26/24 03:08, Vincent Donnefort wrote:
> It is now possible to mmap() a ring-buffer to stream its content. Add
> some documentation and a code example.
>
> Signed-off-by: Vincent Donnefort
>
> diff --git a/Documentation/trace/ring-buffer-map.rst b/Documentation/trace/ring-buffer-map.rst
> new file mode 100644
> index ..0426ab4bcf3d
> --- /dev/null
> +++ b/Documentation/trace/ring-buffer-map.rst
> @@ -0,0 +1,106 @@
> +.. SPDX-License-Identifier: GPL-2.0
> +
> +==================================
> +Tracefs ring-buffer memory mapping
> +==================================
> +
> +:Author: Vincent Donnefort
> +
> +Overview
> +========
> +
> +Tracefs ring-buffer memory map provides an efficient method to stream data
> +as no memory copy is necessary. The application mapping the ring-buffer becomes
> +then a consumer for that ring-buffer, in a similar fashion to trace_pipe.
> +
> +Memory mapping setup
> +====================
> +
> +The mapping works with a mmap() of the trace_pipe_raw interface.
> +
> +The first system page of the mapping contains ring-buffer statistics and
> +description. It is referred as the meta-page. One of the most important field of

referred to as
fields

> +the meta-page is the reader. It contains the sub-buffer ID which can be safely
> +read by the mapper (see ring-buffer-design.rst).
> +
> +The meta-page is followed by all the sub-buffers, ordered by ascendant ID. It is

ascending might be better IMO.

> +therefore effortless to know where the reader starts in the mapping:
> +
> +.. code-block:: c
> +
> +    reader_id = meta->reader->id;
> +    reader_offset = meta->meta_page_size + reader_id * meta->subbuf_size;
> +
> +When the application is done with the current reader, it can get a new one using
> +the trace_pipe_raw ioctl() TRACE_MMAP_IOCTL_GET_READER. This ioctl also updates
> +the meta-page fields.

[snip]

Thanks.

--
#Randy
[PATCH v10 07/14] x86/sgx: Abstract tracking reclaimable pages in LRU
From: Kristen Carlson Accardi

The functions sgx_{mark,unmark}_page_reclaimable() manage the tracking of reclaimable EPC pages: sgx_mark_page_reclaimable() adds a newly allocated page into the global LRU list while sgx_unmark_page_reclaimable() does the opposite.

Abstract the hard-coded global LRU references in these functions to make them reusable when pages are tracked in per-cgroup LRUs.

Create a helper, sgx_lru_list(), that returns the LRU that tracks a given EPC page. It simply returns the global LRU now, and will later return the LRU of the cgroup within which the EPC page was allocated. Replace the hard-coded global LRU with a call to this helper.

Next patches will first get the cgroup reclamation flow ready while keeping pages tracked in the global LRU and reclaimed by ksgxd, before we make the switch at the end for sgx_lru_list() to return the per-cgroup LRU.

Co-developed-by: Sean Christopherson
Signed-off-by: Sean Christopherson
Signed-off-by: Kristen Carlson Accardi
Co-developed-by: Haitao Huang
Signed-off-by: Haitao Huang
Reviewed-by: Jarkko Sakkinen
---
V7:
- Split this out from the big patch, #10 in V6. (Dave, Kai)
---
 arch/x86/kernel/cpu/sgx/main.c | 30 ++
 1 file changed, 18 insertions(+), 12 deletions(-)

diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
index 4991eb0af748..8f83f7ac386e 100644
--- a/arch/x86/kernel/cpu/sgx/main.c
+++ b/arch/x86/kernel/cpu/sgx/main.c
@@ -32,6 +32,11 @@ static DEFINE_XARRAY(sgx_epc_address_space);
  */
 static struct sgx_epc_lru_list sgx_global_lru;
 
+static inline struct sgx_epc_lru_list *sgx_lru_list(struct sgx_epc_page *epc_page)
+{
+	return &sgx_global_lru;
+}
+
 static atomic_long_t sgx_nr_free_pages = ATOMIC_LONG_INIT(0);
 
 /* Nodes with one or more EPC sections. */
@@ -500,25 +505,24 @@ static struct sgx_epc_page *__sgx_alloc_epc_page(void)
 }
 
 /**
- * sgx_mark_page_reclaimable() - Mark a page as reclaimable
+ * sgx_mark_page_reclaimable() - Mark a page as reclaimable and track it in a LRU.
  * @page:	EPC page
- *
- * Mark a page as reclaimable and add it to the active page list. Pages
- * are automatically removed from the active list when freed.
  */
 void sgx_mark_page_reclaimable(struct sgx_epc_page *page)
 {
-	spin_lock(&sgx_global_lru.lock);
+	struct sgx_epc_lru_list *lru = sgx_lru_list(page);
+
+	spin_lock(&lru->lock);
 	page->flags |= SGX_EPC_PAGE_RECLAIMER_TRACKED;
-	list_add_tail(&page->list, &sgx_global_lru.reclaimable);
-	spin_unlock(&sgx_global_lru.lock);
+	list_add_tail(&page->list, &lru->reclaimable);
+	spin_unlock(&lru->lock);
 }
 
 /**
- * sgx_unmark_page_reclaimable() - Remove a page from the reclaim list
+ * sgx_unmark_page_reclaimable() - Remove a page from its tracking LRU
  * @page:	EPC page
  *
- * Clear the reclaimable flag and remove the page from the active page list.
+ * Clear the reclaimable flag if set and remove the page from its LRU.
  *
  * Return:
  * 0 on success,
@@ -526,18 +530,20 @@ void sgx_mark_page_reclaimable(struct sgx_epc_page *page)
  */
 int sgx_unmark_page_reclaimable(struct sgx_epc_page *page)
 {
-	spin_lock(&sgx_global_lru.lock);
+	struct sgx_epc_lru_list *lru = sgx_lru_list(page);
+
+	spin_lock(&lru->lock);
 	if (page->flags & SGX_EPC_PAGE_RECLAIMER_TRACKED) {
 		/* The page is being reclaimed. */
 		if (list_empty(&page->list)) {
-			spin_unlock(&sgx_global_lru.lock);
+			spin_unlock(&lru->lock);
 			return -EBUSY;
 		}
 
 		list_del(&page->list);
 		page->flags &= ~SGX_EPC_PAGE_RECLAIMER_TRACKED;
 	}
-	spin_unlock(&sgx_global_lru.lock);
+	spin_unlock(&lru->lock);
 
 	return 0;
 }
-- 
2.25.1
Re: [PATCH v2 1/2] vhost: Add smp_rmb() in vhost_vq_avail_empty()
On 3/27/24 17:42, Jason Wang wrote: On Wed, Mar 27, 2024 at 3:35 PM Gavin Shan wrote: On 3/27/24 14:08, Gavin Shan wrote: On 3/27/24 12:44, Jason Wang wrote: On Wed, Mar 27, 2024 at 10:34 AM Jason Wang wrote: On Wed, Mar 27, 2024 at 7:39 AM Gavin Shan wrote: A smp_rmb() has been missed in vhost_vq_avail_empty(), spotted by Will Deacon . Otherwise, it's not ensured the available ring entries pushed by guest can be observed by vhost in time, leading to stale available ring entries fetched by vhost in vhost_get_vq_desc(), as reported by Yihuang Yu on NVidia's grace-hopper (ARM64) platform. /home/gavin/sandbox/qemu.main/build/qemu-system-aarch64 \ -accel kvm -machine virt,gic-version=host -cpu host \ -smp maxcpus=1,cpus=1,sockets=1,clusters=1,cores=1,threads=1 \ -m 4096M,slots=16,maxmem=64G \ -object memory-backend-ram,id=mem0,size=4096M\ : \ -netdev tap,id=vnet0,vhost=true \ -device virtio-net-pci,bus=pcie.8,netdev=vnet0,mac=52:54:00:f1:26:b0 : guest# netperf -H 10.26.1.81 -l 60 -C -c -t UDP_STREAM virtio_net virtio0: output.0:id 100 is not a head! Add the missed smp_rmb() in vhost_vq_avail_empty(). Note that it should be safe until vq->avail_idx is changed by commit 275bf960ac697 ("vhost: better detection of available buffers"). 
Fixes: 275bf960ac697 ("vhost: better detection of available buffers")
Cc: <stable@vger.kernel.org> # v4.11+
Reported-by: Yihuang Yu
Signed-off-by: Gavin Shan
---
 drivers/vhost/vhost.c | 11 ++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 045f666b4f12..00445ab172b3 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -2799,9 +2799,18 @@ bool vhost_vq_avail_empty(struct vhost_dev *dev, struct vhost_virtqueue *vq)
 	r = vhost_get_avail_idx(vq, &avail_idx);
 	if (unlikely(r))
 		return false;
+	vq->avail_idx = vhost16_to_cpu(vq, avail_idx);
+	if (vq->avail_idx != vq->last_avail_idx) {
+		/* Similar to what's done in vhost_get_vq_desc(), we need
+		 * to ensure the available ring entries have been exposed
+		 * by guest.
+		 */

We need to be more verbose here. For example, which load needs to be ordered with which load. The rmb in vhost_get_vq_desc() is used to order the load of avail idx and the load of head. It is paired with e.g. virtio_wmb() in virtqueue_add_split().

vhost_vq_avail_empty() is mostly used as a hint in vhost_net_busy_poll(), which is under the protection of the vq mutex. An exception is tx_can_batch(), but in that case it doesn't even want to read the head.

Ok, if it is needed only in that path, maybe we can move the barriers there.

[cc Will Deacon]

Jason, appreciate your review and comments. I think PATCH[1/2] is the fix for the hypothesis, meaning PATCH[2/2] is the real fix. However, it would be nice to fix all of them in one shot. I will try with PATCH[2/2] only to see if our issue will disappear or not. However, the issue still exists if PATCH[2/2] is missed.

Jason, PATCH[2/2] is sufficient to fix our current issue. I tried with PATCH[2/2] only and was unable to hit the issue. However, PATCH[1/2] may be needed by other scenarios. So it would be nice to fix them in one shot.

Yes, see below.

Firstly, we were failing on the transmit queue and {tvq, rvq}->busyloop_timeout == false, if I remember correctly.
So the added smp_rmb() in vhost_vq_avail_empty() is only a concern for tx_can_batch(). A mutex isn't enough to ensure the ordering of the available index and the available ring entry (head). For example, vhost_vq_avail_empty() called by tx_can_batch() can see the next available index, but its corresponding available ring entry (head) may not be visible to vhost yet if the smp_rmb() is missing. The next call to get_tx_bufs() then runs before the available ring entry (head) has arrived, leading to a stale available ring entry (head) being fetched.

handle_tx_copy
  get_tx_bufs            // smp_rmb() won't be executed when vq->avail_idx != vq->last_avail_idx
  tx_can_batch
    vhost_vq_avail_empty // vq->avail_idx is updated from vq->avail->idx

The reason why I added smp_rmb() to vhost_vq_avail_empty() is that the function is an exposed API, even though it's only used by drivers/vhost/net.c at present. It means the API has been broken internally. So it seems more appropriate to fix it up in vhost_vq_avail_empty() so that the API's users needn't worry about the memory access order. When tx_can_batch() returns true it means there are still pending tx buffers. Since it might read indices, it can still bypass the smp_rmb() in vhost_get_vq_desc(). I'd suggest adding the above to the change log. With this, Acked-by: Jason Wang Thanks, Jason. The change log has
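The load-load hazard discussed above can be modeled outside the kernel. Below is a hypothetical userspace sketch (all `toy_*` names are invented for illustration): a C11 acquire load of the index stands in for the kernel's plain load followed by smp_rmb(), and a release store stands in for the guest's virtio_wmb() in virtqueue_add_split().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Toy model (not kernel code): a producer publishes a ring entry (head),
 * then bumps avail_idx; the consumer must order its read of the ring
 * entry after its read of avail_idx, mirroring the pairing above. */
struct toy_vq {
    _Atomic uint16_t avail_idx;   /* published by the producer */
    uint16_t ring[256];           /* heads, written before avail_idx */
    uint16_t last_avail_idx;      /* consumer-private progress marker */
};

/* Producer: write the head first, then publish the new index with
 * release semantics (the guest uses virtio_wmb() for this). */
static void toy_publish(struct toy_vq *vq, uint16_t head)
{
    uint16_t idx = atomic_load_explicit(&vq->avail_idx, memory_order_relaxed);

    vq->ring[idx % 256] = head;
    atomic_store_explicit(&vq->avail_idx, idx + 1, memory_order_release);
}

/* Consumer: acquire-load the index so the later ring[] read cannot be
 * hoisted before it (vhost uses a plain load plus smp_rmb() instead). */
static int toy_fetch(struct toy_vq *vq, uint16_t *head)
{
    uint16_t idx = atomic_load_explicit(&vq->avail_idx, memory_order_acquire);

    if (idx == vq->last_avail_idx)
        return 0;                 /* ring empty */

    *head = vq->ring[vq->last_avail_idx++ % 256];
    return 1;
}
```

Without the acquire/rmb ordering, a consumer on a weakly ordered CPU (such as the ARM64 Grace Hopper machine in the report) may observe the new index yet read a stale head, which is exactly the "id 100 is not a head!" symptom.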
Re: [PATCH v2 1/2] vhost: Add smp_rmb() in vhost_vq_avail_empty()
On 3/27/24 22:07, Michael S. Tsirkin wrote: On Wed, Mar 27, 2024 at 09:38:45AM +1000, Gavin Shan wrote: A smp_rmb() has been missed in vhost_vq_avail_empty(), spotted by Will Deacon . Otherwise, it's not ensured the available ring entries pushed by guest can be observed by vhost in time, leading to stale available ring entries fetched by vhost in vhost_get_vq_desc(), as reported by Yihuang Yu on NVidia's grace-hopper (ARM64) platform. /home/gavin/sandbox/qemu.main/build/qemu-system-aarch64 \ -accel kvm -machine virt,gic-version=host -cpu host \ -smp maxcpus=1,cpus=1,sockets=1,clusters=1,cores=1,threads=1 \ -m 4096M,slots=16,maxmem=64G \ -object memory-backend-ram,id=mem0,size=4096M\ : \ -netdev tap,id=vnet0,vhost=true \ -device virtio-net-pci,bus=pcie.8,netdev=vnet0,mac=52:54:00:f1:26:b0 : guest# netperf -H 10.26.1.81 -l 60 -C -c -t UDP_STREAM virtio_net virtio0: output.0:id 100 is not a head! Add the missed smp_rmb() in vhost_vq_avail_empty(). Note that it should be safe until vq->avail_idx is changed by commit 275bf960ac697 ("vhost: better detection of available buffers"). Fixes: 275bf960ac697 ("vhost: better detection of available buffers") Cc: # v4.11+ Reported-by: Yihuang Yu Signed-off-by: Gavin Shan --- drivers/vhost/vhost.c | 11 ++- 1 file changed, 10 insertions(+), 1 deletion(-) diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c index 045f666b4f12..00445ab172b3 100644 --- a/drivers/vhost/vhost.c +++ b/drivers/vhost/vhost.c @@ -2799,9 +2799,18 @@ bool vhost_vq_avail_empty(struct vhost_dev *dev, struct vhost_virtqueue *vq) r = vhost_get_avail_idx(vq, _idx); if (unlikely(r)) return false; + vq->avail_idx = vhost16_to_cpu(vq, avail_idx); + if (vq->avail_idx != vq->last_avail_idx) { + /* Similar to what's done in vhost_get_vq_desc(), we need +* to ensure the available ring entries have been exposed +* by guest. +*/ A slightly clearer comment: /* Since we have updated avail_idx, the following call to * vhost_get_vq_desc will read available ring entries. 
* Make sure that read happens after the avail_idx read. */ Pls repost with that, and I will apply. Also add Suggested-by for Will. Sure, the suggested comments have been included in v3. + smp_rmb(); + return false; + } - return vq->avail_idx == vq->last_avail_idx; + return true; } EXPORT_SYMBOL_GPL(vhost_vq_avail_empty); As a follow-up patch, we should clean out the code duplication that accumulated with 3 places reading avail idx in essentially the same way - this duplication is what caused the mess in the 1st place. Yes, nice idea. I've added PATCH[v3 3/3] to improve vhost_get_avail_idx() to handle the memory barrier since all the callers have the concern. v3: https://lore.kernel.org/virtualization/20240328002149.1141302-1-gs...@redhat.com/ Thanks, Gavin
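For reference, the consolidation discussed above could take roughly the following shape. This is a hedged sketch only, not the merged v3 patch: the idea is that vhost_get_avail_idx() itself updates vq->avail_idx and issues the barrier, so none of its three callers has to repeat the load-then-smp_rmb() pattern.

```
/* Sketch of a consolidated helper (not the exact merged code): returns
 * 0 if the ring is empty, 1 if new entries are available, negative on
 * error.  The smp_rmb() lives here, so vhost_vq_avail_empty(),
 * vhost_get_vq_desc() and the busy-poll path all inherit the ordering.
 */
static int vhost_get_avail_idx(struct vhost_virtqueue *vq)
{
	__virtio16 avail_idx;
	int r;

	r = vhost_get_avail(vq, avail_idx, &vq->avail->idx);
	if (unlikely(r)) {
		vq_err(vq, "Failed to access available index at %p\n",
		       &vq->avail->idx);
		return r;
	}

	vq->avail_idx = vhost16_to_cpu(vq, avail_idx);
	if (vq->avail_idx != vq->last_avail_idx) {
		/* Order the subsequent loads of ring entries (heads)
		 * after the load of avail_idx; pairs with the guest's
		 * virtio_wmb().
		 */
		smp_rmb();
		return 1;
	}

	return 0;
}
```

With such a helper, callers collapse to checking the return value, which removes the three duplicated read sites Michael points out.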
Re: [PATCH v9 08/15] x86/sgx: Implement EPC reclamation flows for cgroup
On Thu, 22 Feb 2024 16:24:47 -0600, Huang, Kai wrote: On 23/02/2024 9:12 am, Haitao Huang wrote: On Wed, 21 Feb 2024 04:48:58 -0600, Huang, Kai wrote: On Wed, 2024-02-21 at 00:23 -0600, Haitao Huang wrote: Hi Kai On Tue, 20 Feb 2024 03:52:39 -0600, Huang, Kai wrote: [...] > > So you introduced the work/workqueue here but there's no place which > actually > queues the work. IMHO you can either: > > 1) move relevant code change here; or > 2) focus on introducing core functions to reclaim certain pages from a > given EPC > cgroup w/o workqueue and introduce the work/workqueue in a later patch. > > Makes sense? > Starting in v7, I was trying to split the big patch, #10 in v6, as you and others suggested. My thought process was to put infrastructure needed for per-cgroup reclaim in the front, then turn on per-cgroup reclaim in [v9 13/15] in the end. That's reasonable for sure. Thanks for the confirmation :-) Before that, all reclaimables are tracked in the global LRU so really there is no "reclaim certain pages from a given EPC cgroup w/o workqueue" or reclaim through workqueue before that point, as suggested in #2. This patch puts down the implementation for both flows but neither is used yet, as stated in the commit message. I know it's not used yet. The point is how to split patches to make them more self-contained and easy to review. I would think this patch is already self-contained in that it is all implementation of cgroup reclamation building blocks utilized later. But I'll try to follow your suggestions below to split further (I would prefer not to merge in general unless there are strong reasons). For #2, sorry for not being explicit -- I meant it seems more reasonable to split in this way: Patch 1) a). change to sgx_reclaim_pages(); I'll still prefer this to be a separate patch. It is self-contained IMHO. We were splitting the original patch because it was too big. I don't want to merge back unless there is a strong reason. b).
introduce sgx_epc_cgroup_reclaim_pages(); Ok. If I got you right, I believe you want to have a cgroup variant function following the same behaviour of the one for global reclaim, i.e., the _current_ sgx_reclaim_pages(), which always tries to scan and reclaim SGX_NR_TO_SCAN pages each time. And this cgroup variant function, sgx_epc_cgroup_reclaim_pages(), tries to scan and reclaim SGX_NR_TO_SCAN pages each time "_across_ the cgroup and all the descendants". And you want to implement sgx_epc_cgroup_reclaim_pages() in this way due to WHATEVER reasons. In that case, the change to sgx_reclaim_pages() and the introduce of sgx_epc_cgroup_reclaim_pages() should really be together because they are completely tied together in terms of implementation. In this way you can just explain clearly in _ONE_ patch why you choose this implementation, and for reviewer it's also easier to review because we can just discuss in one patch. Makes sense? c). introduce sgx_epc_cgroup_reclaim_work_func() (use a better name), which just takes an EPC cgroup as input w/o involving any work/workqueue. This is for the workqueue use only. So I think it'd be better be with patch #2 below? There are multiple levels of logic here IMHO: 1. a) and b) above focus on "each reclaim" a given EPC cgroup 2. c) is about a loop of above to bring given cgroup's usage to limit 3. workqueue is one (probably best) way to do c) in async way 4. the logic where 1) (direct reclaim) and 3) (indirect) are triggered To me, it's clear 1) should be in one patch as stated above. Also, to me 3) and 4) are better to be together since they give you a clear view on how the direct/indirect reclaim are triggered. 2) could be flexible depending on how you see it. If you prefer viewing it from low-level implementation of reclaiming pages from cgroup, then it's also OK to be together with 1). If you want to treat it as a part of _async_ way of bring down usage to limit, then _MAYBE_ it's also OK to be with 3) and 4). 
But to me 2) can be together with 1) or even a separate patch because it's still kind of low-level reclaiming detail. 3) and 4) shouldn't contain such detail but should focus on how direct/indirect reclaim is done. I incorporated most of your suggestions, and think it'd be better to discuss this with actual code. So I'm sending out v10, and will just quickly summarize what I did to address this particular issue here. I pretty much followed the above suggestions and ended up with two patches: 1) a) and b) above, plus direct reclaim triggered in try_charge(), so reviewers can see at least one use of sgx_cgroup_reclaim_pages(), which is the basic building block. 2) All async related: c) above, the workqueue, and indirect reclaim triggered in try_charge(), which queues the work. Please review v10 and if you think the triggering parts need to be separated then I'll separate them.
[PATCH v10 14/14] selftests/sgx: Add scripts for EPC cgroup testing
The scripts rely on the cgroup-tools package from libcgroup [1]. To run selftests for the EPC cgroup: sudo ./run_epc_cg_selftests.sh To watch misc cgroup 'current' changes during testing, run this in a separate terminal: ./watch_misc_for_tests.sh current With different cgroups, the script starts one or multiple concurrent SGX selftests, each running one unclobbered_vdso_oversubscribed test. Each such test tries to load an enclave of EPC size equal to the EPC capacity available on the platform. The script checks results against the expectation set for each cgroup and reports success or failure. The script creates 3 different cgroups at the beginning with the following expectations: 1) SMALL - intentionally small enough to fail the test loading an enclave of size equal to the capacity. 2) LARGE - large enough to run up to 4 concurrent tests but fail some if more than 4 concurrent tests are run. The script starts 4 expecting at least one test to pass, and then starts 5 expecting at least one test to fail. 3) LARGER - the limit is the same as the capacity, large enough to run lots of concurrent tests. The script starts 8 of them and expects all to pass. Then it reruns the same test with one process randomly killed and checks that usage is zero after all processes exit. The script also includes a test with a low mem_cg limit and a LARGE sgx_epc limit to verify that the RAM used for per-cgroup reclamation is charged to a proper mem_cg. [1] https://github.com/libcgroup/libcgroup/blob/main/README Signed-off-by: Haitao Huang --- V7: - Added memcontrol test. V5: - Added script with automatic results checking, removed the interactive script. - The script can run independently from the series below.
--- .../selftests/sgx/run_epc_cg_selftests.sh | 246 ++ .../selftests/sgx/watch_misc_for_tests.sh | 13 + 2 files changed, 259 insertions(+) create mode 100755 tools/testing/selftests/sgx/run_epc_cg_selftests.sh create mode 100755 tools/testing/selftests/sgx/watch_misc_for_tests.sh diff --git a/tools/testing/selftests/sgx/run_epc_cg_selftests.sh b/tools/testing/selftests/sgx/run_epc_cg_selftests.sh new file mode 100755 index ..e027bf39f005 --- /dev/null +++ b/tools/testing/selftests/sgx/run_epc_cg_selftests.sh @@ -0,0 +1,246 @@ +#!/bin/bash +# SPDX-License-Identifier: GPL-2.0 +# Copyright(c) 2023 Intel Corporation. + +TEST_ROOT_CG=selftest +cgcreate -g misc:$TEST_ROOT_CG +if [ $? -ne 0 ]; then +echo "# Please make sure cgroup-tools is installed, and misc cgroup is mounted." +exit 1 +fi +TEST_CG_SUB1=$TEST_ROOT_CG/test1 +TEST_CG_SUB2=$TEST_ROOT_CG/test2 +# We will only set limit in test1 and run tests in test3 +TEST_CG_SUB3=$TEST_ROOT_CG/test1/test3 +TEST_CG_SUB4=$TEST_ROOT_CG/test4 + +cgcreate -g misc:$TEST_CG_SUB1 +cgcreate -g misc:$TEST_CG_SUB2 +cgcreate -g misc:$TEST_CG_SUB3 +cgcreate -g misc:$TEST_CG_SUB4 + +# Default to V2 +CG_MISC_ROOT=/sys/fs/cgroup +CG_MEM_ROOT=/sys/fs/cgroup +CG_V1=0 +if [ ! -d "/sys/fs/cgroup/misc" ]; then +echo "# cgroup V2 is in use." +else +echo "# cgroup V1 is in use." +CG_MISC_ROOT=/sys/fs/cgroup/misc +CG_MEM_ROOT=/sys/fs/cgroup/memory +CG_V1=1 +fi + +CAPACITY=$(grep "sgx_epc" "$CG_MISC_ROOT/misc.capacity" | awk '{print $2}') +# This is below number of VA pages needed for enclave of capacity size. So +# should fail oversubscribed cases +SMALL=$(( CAPACITY / 512 )) + +# At least load one enclave of capacity size successfully, maybe up to 4. +# But some may fail if we run more than 4 concurrent enclaves of capacity size. +LARGE=$(( SMALL * 4 )) + +# Load lots of enclaves +LARGER=$CAPACITY +echo "# Setting up limits." 
+echo "sgx_epc $SMALL" > $CG_MISC_ROOT/$TEST_CG_SUB1/misc.max +echo "sgx_epc $LARGE" > $CG_MISC_ROOT/$TEST_CG_SUB2/misc.max +echo "sgx_epc $LARGER" > $CG_MISC_ROOT/$TEST_CG_SUB4/misc.max + +timestamp=$(date +%Y%m%d_%H%M%S) + +test_cmd="./test_sgx -t unclobbered_vdso_oversubscribed" + +wait_check_process_status() { +local pid=$1 +local check_for_success=$2 # If 1, check for success; +# If 0, check for failure +wait "$pid" +local status=$? + +if [[ $check_for_success -eq 1 && $status -eq 0 ]]; then +echo "# Process $pid succeeded." +return 0 +elif [[ $check_for_success -eq 0 && $status -ne 0 ]]; then +echo "# Process $pid returned failure." +return 0 +fi +return 1 +} + +wait_and_detect_for_any() { +local pids=("$@") +local check_for_success=$1 # If 1, check for success; +# If 0, check for failure +local detected=1 # 0 for success detection + +for pid in "${pids[@]:1}"; do +if wait_check_process_status "$pid" "$check_for_success"; then +detected=0 +# Wait for other processes to exit +fi +done + +return $detected +} + +echo "# Start unclobbered_vdso_oversubscribed with SMALL limit, expecting failure..." +# Always use leaf node of misc
[PATCH v10 13/14] Docs/x86/sgx: Add description for cgroup support
From: Sean Christopherson Add initial documentation of how to regulate the distribution of SGX Enclave Page Cache (EPC) memory via the Miscellaneous cgroup controller. Signed-off-by: Sean Christopherson Co-developed-by: Kristen Carlson Accardi Signed-off-by: Kristen Carlson Accardi Co-developed-by: Haitao Huang Signed-off-by: Haitao Huang Cc: Sean Christopherson --- V8: - Limit text width to 80 characters to be consistent. V6: - Remove mentioning of VMM specific behavior on handling SIGBUS - Remove statement of forced reclamation, add statement to specify ENOMEM returned when no reclamation possible. - Added statements on the non-preemptive nature for the max limit - Dropped Reviewed-by tag because of changes V4: - Fix indentation (Randy) - Change misc.events file to be read-only - Fix a typo for 'subsystem' - Add behavior when VMM overcommit EPC with a cgroup (Mikko) --- Documentation/arch/x86/sgx.rst | 83 ++ 1 file changed, 83 insertions(+) diff --git a/Documentation/arch/x86/sgx.rst b/Documentation/arch/x86/sgx.rst index d90796adc2ec..c537e6a9aa65 100644 --- a/Documentation/arch/x86/sgx.rst +++ b/Documentation/arch/x86/sgx.rst @@ -300,3 +300,86 @@ to expected failures and handle them as follows: first call. It indicates a bug in the kernel or the userspace client if any of the second round of ``SGX_IOC_VEPC_REMOVE_ALL`` calls has a return code other than 0. + + +Cgroup Support +== + +The "sgx_epc" resource within the Miscellaneous cgroup controller regulates +distribution of SGX EPC memory, which is a subset of system RAM that is used to +provide SGX-enabled applications with protected memory, and is otherwise +inaccessible, i.e. shows up as reserved in /proc/iomem and cannot be +read/written outside of an SGX enclave. + +Although current systems implement EPC by stealing memory from RAM, for all +intents and purposes the EPC is independent from normal system memory, e.g. 
must +be reserved at boot from RAM and cannot be converted between EPC and normal +memory while the system is running. The EPC is managed by the SGX subsystem and +is not accounted by the memory controller. Note that this is true only for EPC +memory itself, i.e. normal memory allocations related to SGX and EPC memory, +e.g. the backing memory for evicted EPC pages, are accounted, limited and +protected by the memory controller. + +Much like normal system memory, EPC memory can be overcommitted via virtual +memory techniques and pages can be swapped out of the EPC to their backing store +(normal system memory allocated via shmem). The SGX EPC subsystem is analogous +to the memory subsystem, and it implements limit and protection models for EPC +memory. + +SGX EPC Interface Files +--- + +For a generic description of the Miscellaneous controller interface files, +please see Documentation/admin-guide/cgroup-v2.rst + +All SGX EPC memory amounts are in bytes unless explicitly stated otherwise. If +a value which is not PAGE_SIZE aligned is written, the actual value used by the +controller will be rounded down to the closest PAGE_SIZE multiple. + + misc.capacity +A read-only flat-keyed file shown only in the root cgroup. The sgx_epc +resource will show the total amount of EPC memory available on the +platform. + + misc.current +A read-only flat-keyed file shown in the non-root cgroups. The sgx_epc +resource will show the current active EPC memory usage of the cgroup and +its descendants. EPC pages that are swapped out to backing RAM are not +included in the current count. + + misc.max +A read-write single value file which exists on non-root cgroups. The +sgx_epc resource will show the EPC usage hard limit. The default is +"max". + +If a cgroup's EPC usage reaches this limit, EPC allocations, e.g., for +page fault handling, will be blocked until EPC can be reclaimed from the +cgroup. 
If there are no pages left that are reclaimable within the same +group, the kernel returns ENOMEM. + +The EPC pages allocated for a guest VM by the virtual EPC driver are not +reclaimable by the host kernel. In case the guest cgroup's limit is +reached and no reclaimable pages are left in the same cgroup, the virtual +EPC driver returns SIGBUS to the user space process to indicate failure +on new EPC allocation requests. + +The misc.max limit is non-preemptive. If a user writes a limit lower +than the current usage to this file, the cgroup will not preemptively +deallocate pages currently in use, and will only start blocking the next +allocation and reclaiming EPC at that time. + + misc.events +A read-only flat-keyed file which exists on non-root cgroups. +A value change in this file generates a file modified event. + + max +The number of times
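The rounding rule for values written to the interface files can be illustrated with a small sketch. This is hypothetical userspace code with invented `toy_*` names, assuming a 4 KiB page size (the real controller uses the kernel's PAGE_SIZE):

```c
#include <assert.h>
#include <stdint.h>

#define TOY_PAGE_SIZE 4096ULL  /* stand-in for the kernel's PAGE_SIZE */

/* Round a value written to misc.max down to the closest PAGE_SIZE
 * multiple, as described for the sgx_epc resource above. */
static uint64_t toy_round_limit(uint64_t bytes)
{
	return bytes & ~(TOY_PAGE_SIZE - 1);
}
```

So writing, say, one byte more than a page still yields a one-page limit, and writing less than one page yields a limit of zero.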
[PATCH v10 12/14] x86/sgx: Turn on per-cgroup EPC reclamation
From: Kristen Carlson Accardi Previous patches have implemented all the infrastructure needed for per-cgroup EPC page tracking and reclaiming. But all reclaimable EPC pages are still tracked in the global LRU as sgx_lru_list() returns a hard-coded reference to the global LRU. Change sgx_lru_list() to return the LRU of the cgroup in which the given EPC page is allocated. This makes all EPC pages tracked in per-cgroup LRUs and the global reclaimer (ksgxd) will not be able to reclaim any pages from the global LRU. However, in cases of over-committing, i.e., the sum of cgroup limits being greater than the total capacity, cgroups may never reclaim but the total usage can still be near the capacity. Therefore a global reclamation is still needed in those cases and it should be performed from the root cgroup. Modify sgx_reclaim_pages_global() to reclaim from the root EPC cgroup when the cgroup is enabled, otherwise from the global LRU. Export sgx_cgroup_reclaim_pages() in the header file so it can be reused for this purpose. Similarly, modify sgx_can_reclaim() to check emptiness of the LRUs of all cgroups when the EPC cgroup is enabled, otherwise only check the global LRU. Export sgx_cgroup_lru_empty() so it can be reused for this purpose. Finally, change sgx_reclaim_direct() to check and ensure there are free pages at the cgroup level so forward progress can be made by the caller. Export sgx_cgroup_should_reclaim() for reuse. With these changes, the global reclamation and per-cgroup reclamation both work properly with all pages tracked in per-cgroup LRUs. Co-developed-by: Sean Christopherson Signed-off-by: Sean Christopherson Signed-off-by: Kristen Carlson Accardi Co-developed-by: Haitao Huang Signed-off-by: Haitao Huang --- V10: - Add comment to clarify each page belongs to one cgroup, or the root by default. (Kai) - Merge the changes that expose sgx_cgroup_* functions to this patch. - Add changes for sgx_reclaim_direct() that were missed previously.
V7: - Split this out from the big patch, #10 in V6. (Dave, Kai) --- arch/x86/kernel/cpu/sgx/epc_cgroup.c | 6 +++--- arch/x86/kernel/cpu/sgx/epc_cgroup.h | 7 +++ arch/x86/kernel/cpu/sgx/main.c | 29 +++- 3 files changed, 38 insertions(+), 4 deletions(-) diff --git a/arch/x86/kernel/cpu/sgx/epc_cgroup.c b/arch/x86/kernel/cpu/sgx/epc_cgroup.c index 1defbf213e8d..cacd9e93344e 100644 --- a/arch/x86/kernel/cpu/sgx/epc_cgroup.c +++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.c @@ -72,7 +72,7 @@ static inline u64 sgx_cgroup_max_pages_to_root(struct sgx_cgroup *sgx_cg) * * Return: %true if all cgroups under the specified root have empty LRU lists. */ -static bool sgx_cgroup_lru_empty(struct misc_cg *root) +bool sgx_cgroup_lru_empty(struct misc_cg *root) { struct cgroup_subsys_state *css_root; struct cgroup_subsys_state *pos; @@ -125,7 +125,7 @@ static bool sgx_cgroup_lru_empty(struct misc_cg *root) * triggering reclamation, and call cond_resched() in between iterations to * avoid indefinite blocking. */ -static void sgx_cgroup_reclaim_pages(struct misc_cg *root, struct mm_struct *charge_mm) +void sgx_cgroup_reclaim_pages(struct misc_cg *root, struct mm_struct *charge_mm) { struct cgroup_subsys_state *css_root; struct cgroup_subsys_state *pos; @@ -166,7 +166,7 @@ static void sgx_cgroup_reclaim_pages(struct misc_cg *root, struct mm_struct *cha * threshold (%SGX_CG_MIN_FREE_PAGE) and there are reclaimable pages within the * cgroup. 
*/ -static bool sgx_cgroup_should_reclaim(struct sgx_cgroup *sgx_cg) +bool sgx_cgroup_should_reclaim(struct sgx_cgroup *sgx_cg) { u64 cur, max; diff --git a/arch/x86/kernel/cpu/sgx/epc_cgroup.h b/arch/x86/kernel/cpu/sgx/epc_cgroup.h index f66570d3ef42..8f55b38157da 100644 --- a/arch/x86/kernel/cpu/sgx/epc_cgroup.h +++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.h @@ -28,6 +28,10 @@ static inline int sgx_cgroup_try_charge(struct sgx_cgroup *sgx_cg, enum sgx_recl static inline void sgx_cgroup_uncharge(struct sgx_cgroup *sgx_cg) { } static inline void sgx_cgroup_init(void) { } + +static inline void sgx_cgroup_reclaim_pages(struct misc_cg *root, struct mm_struct *charge_mm) +{ +} #else struct sgx_cgroup { struct misc_cg *cg; @@ -65,6 +69,9 @@ static inline void sgx_put_cg(struct sgx_cgroup *sgx_cg) int sgx_cgroup_try_charge(struct sgx_cgroup *sgx_cg, enum sgx_reclaim reclaim); void sgx_cgroup_uncharge(struct sgx_cgroup *sgx_cg); +bool sgx_cgroup_lru_empty(struct misc_cg *root); +bool sgx_cgroup_should_reclaim(struct sgx_cgroup *sgx_cg); +void sgx_cgroup_reclaim_pages(struct misc_cg *root, struct mm_struct *charge_mm); void sgx_cgroup_init(void); #endif diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c index 7f92455d957d..68f28ff2d5ef 100644 --- a/arch/x86/kernel/cpu/sgx/main.c +++ b/arch/x86/kernel/cpu/sgx/main.c @@ -34,6 +34,16 @@ static struct sgx_epc_lru_list sgx_global_lru; static inline struct sgx_epc_lru_list
[PATCH v10 11/14] x86/sgx: Abstract check for global reclaimable pages
From: Kristen Carlson Accardi To determine if any page is available for reclamation at the global level, only checking for emptiness of the global LRU is not adequate when pages are tracked in multiple LRUs, one per cgroup. For this purpose, create a new helper, sgx_can_reclaim(), which currently only checks the global LRU and will later check emptiness of the LRUs of all cgroups when per-cgroup tracking is turned on. Replace all the checks of the global LRU, list_empty(&sgx_global_lru.reclaimable), with calls to sgx_can_reclaim(). Co-developed-by: Sean Christopherson Signed-off-by: Sean Christopherson Signed-off-by: Kristen Carlson Accardi Co-developed-by: Haitao Huang Signed-off-by: Haitao Huang --- V10: - Add comments for the new function. (Jarkko) V7: - Split this out from the big patch, #10 in V6. (Dave, Kai) --- arch/x86/kernel/cpu/sgx/main.c | 12 ++-- 1 file changed, 10 insertions(+), 2 deletions(-) diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c index c94f8b49e6f2..7f92455d957d 100644 --- a/arch/x86/kernel/cpu/sgx/main.c +++ b/arch/x86/kernel/cpu/sgx/main.c @@ -37,6 +37,14 @@ static inline struct sgx_epc_lru_list *sgx_lru_list(struct sgx_epc_page *epc_page) return &sgx_global_lru; } +/* + * Check if there is any reclaimable page at global level. + */ +static inline bool sgx_can_reclaim(void) +{ + return !list_empty(&sgx_global_lru.reclaimable); +} + static atomic_long_t sgx_nr_free_pages = ATOMIC_LONG_INIT(0); /* Nodes with one or more EPC sections.
*/ @@ -391,7 +399,7 @@ unsigned int sgx_reclaim_pages(struct sgx_epc_lru_list *lru, struct mm_struct *charge_mm) static bool sgx_should_reclaim(unsigned long watermark) { return atomic_long_read(&sgx_nr_free_pages) < watermark && - !list_empty(&sgx_global_lru.reclaimable); + sgx_can_reclaim(); } static void sgx_reclaim_pages_global(struct mm_struct *charge_mm) @@ -593,7 +601,7 @@ struct sgx_epc_page *sgx_alloc_epc_page(void *owner, enum sgx_reclaim reclaim) break; } - if (list_empty(&sgx_global_lru.reclaimable)) { + if (!sgx_can_reclaim()) { page = ERR_PTR(-ENOMEM); break; } -- 2.25.1
[PATCH v10 10/14] x86/sgx: Charge mem_cgroup for per-cgroup reclamation
Enclave Page Cache(EPC) memory can be swapped out to regular system memory, and the consumed memory should be charged to a proper mem_cgroup. Currently the selection of mem_cgroup to charge is done in sgx_encl_get_mem_cgroup(). But it considers all contexts other than the ksgxd thread are user processes. With the new EPC cgroup implementation, the swapping can also happen in EPC cgroup work-queue threads. In those cases, it improperly selects the root mem_cgroup to charge for the RAM usage. Remove current_is_ksgxd() and change sgx_encl_get_mem_cgroup() to take an additional argument to explicitly specify the mm struct to charge for allocations. Callers from background kthreads not associated with a charging mm struct would set it to NULL, while callers in user process contexts set it to current->mm. Internally, it handles the case when the charging mm given is NULL, by searching for an mm struct from enclave's mm_list. Signed-off-by: Haitao Huang Reported-by: Mikko Ylinen --- V10: - Pass mm struct instead of a boolean 'indirect'. (Dave, Jarkko) V9: - Reduce number of if statements. (Tim) V8: - Limit text paragraphs to 80 characters wide. (Jarkko) --- arch/x86/kernel/cpu/sgx/encl.c | 29 ++-- arch/x86/kernel/cpu/sgx/encl.h | 3 +-- arch/x86/kernel/cpu/sgx/epc_cgroup.c | 10 ++ arch/x86/kernel/cpu/sgx/main.c | 29 +--- arch/x86/kernel/cpu/sgx/sgx.h| 2 +- 5 files changed, 36 insertions(+), 37 deletions(-) diff --git a/arch/x86/kernel/cpu/sgx/encl.c b/arch/x86/kernel/cpu/sgx/encl.c index f474179b6f77..7b77dad41daf 100644 --- a/arch/x86/kernel/cpu/sgx/encl.c +++ b/arch/x86/kernel/cpu/sgx/encl.c @@ -993,23 +993,23 @@ static int __sgx_encl_get_backing(struct sgx_encl *encl, unsigned long page_inde } /* - * When called from ksgxd, returns the mem_cgroup of a struct mm stored - * in the enclave's mm_list. When not called from ksgxd, just returns - * the mem_cgroup of the current task. + * Find the mem_cgroup to charge for memory allocated on behalf of an enclave. 
+ * + * Used in sgx_encl_alloc_backing() for backing store allocation. + * + * Return the mem_cgroup of the given charge_mm. Otherwise return the mem_cgroup + * of a struct mm stored in the enclave's mm_list. */ -static struct mem_cgroup *sgx_encl_get_mem_cgroup(struct sgx_encl *encl) +static struct mem_cgroup *sgx_encl_get_mem_cgroup(struct sgx_encl *encl, + struct mm_struct *charge_mm) { struct mem_cgroup *memcg = NULL; struct sgx_encl_mm *encl_mm; int idx; - /* -* If called from normal task context, return the mem_cgroup -* of the current task's mm. The remainder of the handling is for -* ksgxd. -*/ - if (!current_is_ksgxd()) - return get_mem_cgroup_from_mm(current->mm); +/* Use the charge_mm if given. */ + if (charge_mm) + return get_mem_cgroup_from_mm(charge_mm); /* * Search the enclave's mm_list to find an mm associated with @@ -1047,8 +1047,9 @@ static struct mem_cgroup *sgx_encl_get_mem_cgroup(struct sgx_encl *encl) * @encl: an enclave pointer * @page_index:enclave page index * @backing: data for accessing backing storage for the page + * @charge_mm: the mm to charge for the allocation * - * When called from ksgxd, sets the active memcg from one of the + * When charge_mm is NULL, sets the active memcg from one of the * mms in the enclave's mm_list prior to any backing page allocation, * in order to ensure that shmem page allocations are charged to the * enclave. Create a backing page for loading data back into an EPC page with @@ -1060,9 +1061,9 @@ static struct mem_cgroup *sgx_encl_get_mem_cgroup(struct sgx_encl *encl) * -errno otherwise. 
*/ int sgx_encl_alloc_backing(struct sgx_encl *encl, unsigned long page_index, - struct sgx_backing *backing) + struct sgx_backing *backing, struct mm_struct *charge_mm) { - struct mem_cgroup *encl_memcg = sgx_encl_get_mem_cgroup(encl); + struct mem_cgroup *encl_memcg = sgx_encl_get_mem_cgroup(encl, charge_mm); struct mem_cgroup *memcg = set_active_memcg(encl_memcg); int ret; diff --git a/arch/x86/kernel/cpu/sgx/encl.h b/arch/x86/kernel/cpu/sgx/encl.h index fe15ade02ca1..5ce9d108290f 100644 --- a/arch/x86/kernel/cpu/sgx/encl.h +++ b/arch/x86/kernel/cpu/sgx/encl.h @@ -103,12 +103,11 @@ static inline int sgx_encl_find(struct mm_struct *mm, unsigned long addr, int sgx_encl_may_map(struct sgx_encl *encl, unsigned long start, unsigned long end, unsigned long vm_flags); -bool current_is_ksgxd(void); void sgx_encl_release(struct kref *ref); int sgx_encl_mm_add(struct sgx_encl *encl, struct mm_struct *mm); const cpumask_t *sgx_encl_cpumask(struct sgx_encl *encl); int
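The selection rule described above (an explicit charge_mm wins; otherwise fall back to an mm associated with the enclave) can be modeled with a tiny sketch. All `toy_*` names are invented; the real sgx_encl_get_mem_cgroup() walks encl->mm_list under SRCU, takes proper mm references, and returns a mem_cgroup rather than an mm:

```c
#include <assert.h>
#include <stddef.h>

struct toy_mm { int id; };

struct toy_encl {
	struct toy_mm *mm_list[4];  /* mms that have mapped the enclave */
	int nr_mm;
};

/* Prefer the explicitly given charge_mm; otherwise pick an mm tracked
 * by the enclave, or NULL when no candidate exists (the kernel then
 * ends up charging the root mem_cgroup). */
static struct toy_mm *toy_pick_charge_mm(struct toy_encl *encl,
					 struct toy_mm *charge_mm)
{
	if (charge_mm)
		return charge_mm;

	return encl->nr_mm ? encl->mm_list[0] : NULL;
}
```

User-context callers would pass current->mm as charge_mm, while background reclaimer threads pass NULL and rely on the fallback, which is the behavioral split the patch introduces.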
[PATCH v10 08/14] x86/sgx: Add basic EPC reclamation flow for cgroup
From: Kristen Carlson Accardi When a cgroup's usage reaches its limit, and it is to be charged, i.e., sgx_cgroup_try_charge() is called for new allocations, the cgroup needs to reclaim pages from its LRU or the LRUs of its descendants to make room for any new allocations. This patch adds the basic building block for the per-cgroup reclamation flow and uses it for synchronous reclamation in sgx_cgroup_try_charge(). First, modify sgx_reclaim_pages() to let callers pass in the LRU from which pages are reclaimed, so it can be reused by both the global and cgroup reclaimers. Also return the number of pages attempted, so a cgroup reclaimer can use it to track reclamation progress from its descendants. For the global reclaimer, replace all call sites of sgx_reclaim_pages() with calls to a newly created wrapper, sgx_reclaim_pages_global(), which just calls sgx_reclaim_pages() with the global LRU passed in. For cgroup reclamation, implement a basic reclamation flow, encapsulated in the top-level function, sgx_cgroup_reclaim_pages(). It performs a pre-order walk on a given cgroup subtree, and calls sgx_reclaim_pages() at each node, passing in the LRU of that node. It keeps track of the total pages attempted and stops the walk if the desired number of pages has been attempted. Finally, pass a parameter to sgx_cgroup_try_charge() to indicate whether synchronous reclamation is allowed. If the caller allows it and the cgroup usage is at its limit, trigger the synchronous reclamation by calling sgx_cgroup_reclaim_pages() in a loop with cond_resched() in between iterations. A later patch will add support for asynchronous reclamation reusing sgx_cgroup_reclaim_pages().
Co-developed-by: Sean Christopherson Signed-off-by: Sean Christopherson Signed-off-by: Kristen Carlson Accardi Co-developed-by: Haitao Huang Signed-off-by: Haitao Huang --- V10: - Simplify the signature by removing a pointer to nr_to_scan (Kai) - Return pages attempted instead of reclaimed as it is really what the cgroup caller needs to track progress. This further simplifies the design. - Merge patch for exposing sgx_reclaim_pages() with basic synchronous reclamation. (Kai) - Shorten names for EPC cgroup functions. (Jarkko) - Fix/add comments to justify the design (Kai) - Separate out a helper for for addressing single iteration of the loop in sgx_cgroup_try_charge(). (Jarkko) V9: - Add comments for static variables. (Jarkko) V8: - Use width of 80 characters in text paragraphs. (Jarkko) - Remove alignment for substructure variables. (Jarkko) V7: - Reworked from patch 9 of V6, "x86/sgx: Restructure top-level EPC reclaim function". Do not split the top level function (Kai) - Dropped patches 7 and 8 of V6. - Split this out from the big patch, #10 in V6. (Dave, Kai) --- arch/x86/kernel/cpu/sgx/epc_cgroup.c | 127 ++- arch/x86/kernel/cpu/sgx/epc_cgroup.h | 5 +- arch/x86/kernel/cpu/sgx/main.c | 45 ++ arch/x86/kernel/cpu/sgx/sgx.h| 1 + 4 files changed, 156 insertions(+), 22 deletions(-) diff --git a/arch/x86/kernel/cpu/sgx/epc_cgroup.c b/arch/x86/kernel/cpu/sgx/epc_cgroup.c index a1dd43c195b2..f7a487a29ed1 100644 --- a/arch/x86/kernel/cpu/sgx/epc_cgroup.c +++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.c @@ -9,16 +9,136 @@ static struct sgx_cgroup sgx_cg_root; /** - * sgx_cgroup_try_charge() - try to charge cgroup for a single EPC page + * sgx_cgroup_lru_empty() - check if a cgroup tree has no pages on its LRUs + * @root: Root of the tree to check + * + * Used to avoid livelocks due to a cgroup having a non-zero charge count but + * no pages on its LRUs, e.g. due to a dead enclave waiting to be released or + * because all pages in the cgroup are unreclaimable. 
+ *
+ * Return: %true if all cgroups under the specified root have empty LRU lists.
+ */
+static bool sgx_cgroup_lru_empty(struct misc_cg *root)
+{
+	struct cgroup_subsys_state *css_root;
+	struct cgroup_subsys_state *pos;
+	struct sgx_cgroup *sgx_cg;
+	bool ret = true;
+
+	/*
+	 * Caller ensures css_root ref acquired
+	 */
+	css_root = &root->css;
+
+	rcu_read_lock();
+	css_for_each_descendant_pre(pos, css_root) {
+		if (!css_tryget(pos))
+			break;
+
+		rcu_read_unlock();
+
+		sgx_cg = sgx_cgroup_from_misc_cg(css_misc(pos));
+
+		spin_lock(&sgx_cg->lru.lock);
+		ret = list_empty(&sgx_cg->lru.reclaimable);
+		spin_unlock(&sgx_cg->lru.lock);
+
+		rcu_read_lock();
+		css_put(pos);
+		if (!ret)
+			break;
+	}
+
+	rcu_read_unlock();
+
+	return ret;
+}
+
+/**
+ * sgx_cgroup_reclaim_pages() - reclaim EPC from a cgroup tree
+ * @root:	The root of cgroup tree to reclaim from.
+ *
+ * This function performs a pre-order walk in the cgroup tree under the given
+ * root, attempting to reclaim pages at each node until a fixed number of pages
[PATCH v10 09/14] x86/sgx: Implement async reclamation for cgroup
From: Kristen Carlson Accardi

In cases where EPC pages need to be allocated during a page fault and the cgroup usage is near its limit, an asynchronous reclamation needs to be triggered to avoid blocking the page fault handling.

Create a workqueue, a corresponding work item, and function definitions for the EPC cgroup to support asynchronous reclamation. If the workqueue allocation fails during init, disable the cgroup support.

In sgx_cgroup_try_charge(), if the caller does not allow synchronous reclamation, queue asynchronous work into the workqueue.

Co-developed-by: Sean Christopherson
Signed-off-by: Sean Christopherson
Signed-off-by: Kristen Carlson Accardi
Co-developed-by: Haitao Huang
Signed-off-by: Haitao Huang
---
V10:
- Split asynchronous flow in separate patch. (Kai)
- Consider cgroup disabled when the workqueue allocation fails during init. (Kai)
- Abstract out sgx_cgroup_should_reclaim().

V9:
- Add comments for static variables. (Jarkko)

V8:
- Remove alignment for substructure variables. (Jarkko)

V7:
- Split this out from the big patch, #10 in V6. (Dave, Kai)
---
 arch/x86/kernel/cpu/sgx/epc_cgroup.c | 134 ++-
 arch/x86/kernel/cpu/sgx/epc_cgroup.h |   1 +
 arch/x86/kernel/cpu/sgx/main.c       |   8 +-
 3 files changed, 135 insertions(+), 8 deletions(-)

diff --git a/arch/x86/kernel/cpu/sgx/epc_cgroup.c b/arch/x86/kernel/cpu/sgx/epc_cgroup.c
index f7a487a29ed1..3ca89b1fb7e2 100644
--- a/arch/x86/kernel/cpu/sgx/epc_cgroup.c
+++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.c
@@ -5,9 +5,63 @@
 #include
 #include "epc_cgroup.h"
 
+/*
+ * The minimal number of free pages maintained by the per-cgroup reclaimer.
+ * Set this to the low threshold used by the global reclaimer, ksgxd.
+ */
+#define SGX_CG_MIN_FREE_PAGE	(SGX_NR_LOW_PAGES)
+
+/*
+ * If the cgroup limit is close to SGX_CG_MIN_FREE_PAGE, maintaining the minimal
+ * free pages would barely leave any page for use, causing excessive reclamation
+ * and thrashing.
+ *
+ * Define the following limit, below which cgroup does not maintain the minimal
+ * free page threshold. Set this to quadruple of the minimal so that at least
+ * 75% of pages can be used without being reclaimed.
+ */
+#define SGX_CG_LOW_LIMIT	(SGX_CG_MIN_FREE_PAGE * 4)
+
 /* The root SGX EPC cgroup */
 static struct sgx_cgroup sgx_cg_root;
 
+/*
+ * The work queue that reclaims EPC pages in the background for cgroups.
+ *
+ * A cgroup schedules a work item into this queue to reclaim pages within the
+ * same cgroup when its usage limit is reached and synchronous reclamation is
+ * not an option, i.e., in a page fault handler.
+ */
+static struct workqueue_struct *sgx_cg_wq;
+
+static inline u64 sgx_cgroup_page_counter_read(struct sgx_cgroup *sgx_cg)
+{
+	return atomic64_read(&sgx_cg->cg->res[MISC_CG_RES_SGX_EPC].usage) / PAGE_SIZE;
+}
+
+static inline u64 sgx_cgroup_max_pages(struct sgx_cgroup *sgx_cg)
+{
+	return READ_ONCE(sgx_cg->cg->res[MISC_CG_RES_SGX_EPC].max) / PAGE_SIZE;
+}
+
+/*
+ * Get the lower bound of limits of a cgroup and its ancestors. Used in
+ * sgx_cgroup_should_reclaim() to determine if EPC usage of a cgroup is
+ * close to its limit or its ancestors', hence reclamation is needed.
+ */
+static inline u64 sgx_cgroup_max_pages_to_root(struct sgx_cgroup *sgx_cg)
+{
+	struct misc_cg *i = sgx_cg->cg;
+	u64 m = U64_MAX;
+
+	while (i) {
+		m = min(m, READ_ONCE(i->res[MISC_CG_RES_SGX_EPC].max));
+		i = misc_cg_parent(i);
+	}
+
+	return m / PAGE_SIZE;
+}
+
 /**
  * sgx_cgroup_lru_empty() - check if a cgroup tree has no pages on its LRUs
  * @root:	Root of the tree to check
@@ -99,6 +153,61 @@ static void sgx_cgroup_reclaim_pages(struct misc_cg *root)
 	rcu_read_unlock();
 }
 
+/**
+ * sgx_cgroup_should_reclaim() - check if EPC reclamation is needed for a cgroup
+ * @sgx_cg:	The cgroup to be checked.
+ *
+ * This function can be used to guard a call to sgx_cgroup_reclaim_pages() where
+ * a minimal number of free pages needs to be maintained for the cgroup to make
+ * good forward progress.
+ *
+ * Return: %true if the number of free pages available for the cgroup is below a
+ * threshold (%SGX_CG_MIN_FREE_PAGE) and there are reclaimable pages within the
+ * cgroup.
+ */
+static bool sgx_cgroup_should_reclaim(struct sgx_cgroup *sgx_cg)
+{
+	u64 cur, max;
+
+	if (sgx_cgroup_lru_empty(sgx_cg->cg))
+		return false;
+
+	max = sgx_cgroup_max_pages_to_root(sgx_cg);
+
+	/*
+	 * Unless the limit is very low, maintain a minimal number of free pages
+	 * so there are always a few pages available to serve new allocation
+	 * requests quickly.
+	 */
+	if (max > SGX_CG_LOW_LIMIT)
+		max -= SGX_CG_MIN_FREE_PAGE;
+
+	cur = sgx_cgroup_page_counter_read(sgx_cg);
+
+	return (cur >= max);
+}
+
+/*
+ * Asynchronous work flow to reclaim pages from the cgroup when the cgroup is
+ * at/near its maximum capacity.
+ */
+static void
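The watermark logic above — take the minimum limit over the cgroup and its ancestors, then keep `SGX_CG_MIN_FREE_PAGE` of headroom unless the limit itself is very low — can be checked with a small user-space model. This is a hedged sketch: `struct cg` and the constants are hypothetical stand-ins mirroring the patch's `sgx_cgroup_max_pages_to_root()` and `sgx_cgroup_should_reclaim()`.

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

#define CG_MIN_FREE_PAGE 32U                    /* stand-in for SGX_CG_MIN_FREE_PAGE */
#define CG_LOW_LIMIT     (CG_MIN_FREE_PAGE * 4) /* stand-in for SGX_CG_LOW_LIMIT */

struct cg {
        uint64_t max;         /* page limit of this cgroup */
        uint64_t usage;       /* pages currently charged */
        int has_reclaimable;  /* analogue of !sgx_cgroup_lru_empty() */
        struct cg *parent;
};

/* Lower bound of the limits of a cgroup and all of its ancestors. */
static uint64_t max_pages_to_root(const struct cg *c)
{
        uint64_t m = UINT64_MAX;

        for (; c; c = c->parent)
                m = m < c->max ? m : c->max;
        return m;
}

static int should_reclaim(const struct cg *c)
{
        uint64_t max;

        if (!c->has_reclaimable)
                return 0;

        max = max_pages_to_root(c);
        /* Keep a few free pages unless the limit itself is very low. */
        if (max > CG_LOW_LIMIT)
                max -= CG_MIN_FREE_PAGE;

        return c->usage >= max;
}
```

With the quadruple relationship between the two constants, a cgroup above the low-limit threshold starts reclaiming once free pages drop below the minimum, while a tiny cgroup only reclaims at its actual limit.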
[PATCH v10 06/14] x86/sgx: Add sgx_epc_lru_list to encapsulate LRU list
From: Sean Christopherson Introduce a data structure to wrap the existing reclaimable list and its spinlock. Each cgroup later will have one instance of this structure to track EPC pages allocated for processes associated with the same cgroup. Just like the global SGX reclaimer (ksgxd), an EPC cgroup reclaims pages from the reclaimable list in this structure when its usage reaches near its limit. Use this structure to encapsulate the LRU list and its lock used by the global reclaimer. Signed-off-by: Sean Christopherson Co-developed-by: Kristen Carlson Accardi Signed-off-by: Kristen Carlson Accardi Co-developed-by: Haitao Huang Signed-off-by: Haitao Huang Cc: Sean Christopherson Reviewed-by: Jarkko Sakkinen --- V6: - removed introduction to unreclaimables in commit message. V4: - Removed unneeded comments for the spinlock and the non-reclaimables. (Kai, Jarkko) - Revised the commit to add introduction comments for unreclaimables and multiple LRU lists.(Kai) - Reordered the patches: delay all changes for unreclaimables to later, and this one becomes the first change in the SGX subsystem. V3: - Removed the helper functions and revised commit messages. --- arch/x86/kernel/cpu/sgx/main.c | 39 +- arch/x86/kernel/cpu/sgx/sgx.h | 15 + 2 files changed, 35 insertions(+), 19 deletions(-) diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c index 023af54c1beb..4991eb0af748 100644 --- a/arch/x86/kernel/cpu/sgx/main.c +++ b/arch/x86/kernel/cpu/sgx/main.c @@ -28,10 +28,9 @@ static DEFINE_XARRAY(sgx_epc_address_space); /* * These variables are part of the state of the reclaimer, and must be accessed - * with sgx_reclaimer_lock acquired. + * with sgx_global_lru.lock acquired. 
 */
-static LIST_HEAD(sgx_active_page_list);
-static DEFINE_SPINLOCK(sgx_reclaimer_lock);
+static struct sgx_epc_lru_list sgx_global_lru;
 
 static atomic_long_t sgx_nr_free_pages = ATOMIC_LONG_INIT(0);
 
@@ -306,13 +305,13 @@ static void sgx_reclaim_pages(void)
 	int ret;
 	int i;
 
-	spin_lock(&sgx_reclaimer_lock);
+	spin_lock(&sgx_global_lru.lock);
 	for (i = 0; i < SGX_NR_TO_SCAN; i++) {
-		if (list_empty(&sgx_active_page_list))
+		epc_page = list_first_entry_or_null(&sgx_global_lru.reclaimable,
+						    struct sgx_epc_page, list);
+		if (!epc_page)
 			break;
 
-		epc_page = list_first_entry(&sgx_active_page_list,
-					    struct sgx_epc_page, list);
 		list_del_init(&epc_page->list);
 		encl_page = epc_page->owner;
 
@@ -324,7 +323,7 @@ static void sgx_reclaim_pages(void)
 		 */
 		epc_page->flags &= ~SGX_EPC_PAGE_RECLAIMER_TRACKED;
 	}
-	spin_unlock(&sgx_reclaimer_lock);
+	spin_unlock(&sgx_global_lru.lock);
 
 	for (i = 0; i < cnt; i++) {
 		epc_page = chunk[i];
@@ -347,9 +346,9 @@ static void sgx_reclaim_pages(void)
 		continue;
 
 skip:
-		spin_lock(&sgx_reclaimer_lock);
-		list_add_tail(&epc_page->list, &sgx_active_page_list);
-		spin_unlock(&sgx_reclaimer_lock);
+		spin_lock(&sgx_global_lru.lock);
+		list_add_tail(&epc_page->list, &sgx_global_lru.reclaimable);
+		spin_unlock(&sgx_global_lru.lock);
 
 		kref_put(&encl_page->encl->refcount, sgx_encl_release);
 
@@ -380,7 +379,7 @@ static bool sgx_should_reclaim(unsigned long watermark)
 {
 	return atomic_long_read(&sgx_nr_free_pages) < watermark &&
-	       !list_empty(&sgx_active_page_list);
+	       !list_empty(&sgx_global_lru.reclaimable);
 }
 
 /*
@@ -432,6 +431,8 @@ static bool __init sgx_page_reclaimer_init(void)
 
 	ksgxd_tsk = tsk;
 
+	sgx_lru_init(&sgx_global_lru);
+
 	return true;
 }
 
@@ -507,10 +508,10 @@ static struct sgx_epc_page *__sgx_alloc_epc_page(void)
  */
 void sgx_mark_page_reclaimable(struct sgx_epc_page *page)
 {
-	spin_lock(&sgx_reclaimer_lock);
+	spin_lock(&sgx_global_lru.lock);
 	page->flags |= SGX_EPC_PAGE_RECLAIMER_TRACKED;
-	list_add_tail(&page->list, &sgx_active_page_list);
-	spin_unlock(&sgx_reclaimer_lock);
+	list_add_tail(&page->list, &sgx_global_lru.reclaimable);
+	spin_unlock(&sgx_global_lru.lock);
 }
 
 /**
@@ -525,18 +526,18 @@ void sgx_mark_page_reclaimable(struct sgx_epc_page *page)
  */
 int sgx_unmark_page_reclaimable(struct sgx_epc_page *page)
 {
-	spin_lock(&sgx_reclaimer_lock);
+	spin_lock(&sgx_global_lru.lock);
 	if (page->flags & SGX_EPC_PAGE_RECLAIMER_TRACKED) {
 		/* The page is being reclaimed. */
 		if (list_empty(&page->list)) {
-			spin_unlock(&sgx_reclaimer_lock);
+			spin_unlock(&sgx_global_lru.lock);
 			return -EBUSY;
 		}
 
 		list_del(&page->list);
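The encapsulation introduced by this patch — one structure bundling a lock with the list it guards, so each cgroup can later own its own instance — can be sketched in user space. This is an illustrative analogue, not the kernel code: the intrusive list and `pthread_mutex_t` (standing in for the kernel spinlock) are simplified stand-ins for `list_head` and `struct sgx_epc_lru_list`.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Minimal intrusive list, standing in for the kernel's list_head. */
struct list_node { struct list_node *prev, *next; };

static void list_init(struct list_node *h) { h->prev = h->next = h; }
static int  list_is_empty(const struct list_node *h) { return h->next == h; }
static void list_add_last(struct list_node *n, struct list_node *h)
{
        n->prev = h->prev; n->next = h;
        h->prev->next = n; h->prev = n;
}
static void list_del_reinit(struct list_node *n)
{
        n->prev->next = n->next; n->next->prev = n->prev;
        list_init(n);
}

/* Analogue of struct sgx_epc_lru_list: one lock guarding one LRU list. */
struct epc_lru_list {
        pthread_mutex_t lock;          /* a spinlock in the kernel version */
        struct list_node reclaimable;
};

static void lru_init(struct epc_lru_list *lru)
{
        pthread_mutex_init(&lru->lock, NULL);
        list_init(&lru->reclaimable);
}

struct epc_page { struct list_node list; };

/* Analogue of sgx_mark_page_reclaimable(). */
static void mark_reclaimable(struct epc_lru_list *lru, struct epc_page *p)
{
        pthread_mutex_lock(&lru->lock);
        list_add_last(&p->list, &lru->reclaimable);
        pthread_mutex_unlock(&lru->lock);
}
```

Because the lock travels with the list, later patches can instantiate one `sgx_epc_lru_list` per cgroup without any change to the add/remove call sites.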
[PATCH v10 05/14] x86/sgx: Implement basic EPC misc cgroup functionality
From: Kristen Carlson Accardi

SGX Enclave Page Cache (EPC) memory allocations are separate from normal RAM allocations, and are managed solely by the SGX subsystem. The existing cgroup memory controller cannot be used to limit or account for SGX EPC memory, which is a desirable feature in some environments. For instance, within a Kubernetes environment, while a user may specify a particular EPC quota for a pod, the orchestrator requires a mechanism to enforce that the pod's actual runtime EPC usage does not exceed the allocated quota.

Utilize the misc controller [admin-guide/cgroup-v2.rst, 5-9. Misc] to limit and track EPC allocations per cgroup. Earlier patches have added the "sgx_epc" resource type in the misc cgroup subsystem. Add basic support in the SGX driver as the "sgx_epc" resource provider:

- Set "capacity" of EPC by calling misc_cg_set_capacity()
- Update EPC usage counter, "current", by calling charge and uncharge APIs for EPC allocation and deallocation, respectively.
- Set up sgx_epc resource type specific callbacks, which perform initialization and cleanup during cgroup allocation and deallocation, respectively.

With these changes, the misc cgroup controller enables users to set a hard limit for EPC usage in the "misc.max" interface file. It reports current usage in "misc.current", the total EPC memory available in "misc.capacity", and the number of times EPC usage reached the max limit in "misc.events".

For now, the EPC cgroup simply blocks additional EPC allocation in sgx_alloc_epc_page() when the limit is reached. Reclaimable pages are still tracked in the global active list and are only reclaimed by the global reclaimer when the total free page count is lower than a threshold.

Later patches will reorganize the tracking and reclamation code in the global reclaimer and implement per-cgroup tracking and reclaiming.
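The interface semantics described above — a capacity, a per-cgroup hard limit, a usage counter, and an events counter bumped when the limit blocks an allocation — can be modeled with a toy single-level accounting structure. This is a hedged sketch only: the real controller is hierarchical and atomic, and `struct epc_acct`, `epc_try_charge()`, and `epc_uncharge()` are hypothetical names, not the kernel API.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SIZE 4096ULL

/*
 * Toy single-level model of the misc controller's sgx_epc accounting:
 * "capacity" maps to misc.capacity, "max" to misc.max, "current" to
 * misc.current, and "events" counts how often the limit blocked a charge.
 * All values are in bytes; one EPC page charges PAGE_SIZE bytes.
 */
struct epc_acct {
        uint64_t capacity;
        uint64_t max;
        uint64_t current;
        uint64_t events;
};

static int epc_try_charge(struct epc_acct *a, uint64_t bytes)
{
        if (a->current + bytes > a->max) {
                a->events++;    /* surfaced to userspace via misc.events */
                return -1;      /* allocation blocked at the limit */
        }
        a->current += bytes;
        return 0;
}

static void epc_uncharge(struct epc_acct *a, uint64_t bytes)
{
        a->current -= bytes;
}
```

This mirrors the "block at the limit, no reclaim yet" behavior this patch installs in sgx_alloc_epc_page(); the reclaim-on-limit behavior arrives in the later patches.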
Co-developed-by: Sean Christopherson Signed-off-by: Sean Christopherson Signed-off-by: Kristen Carlson Accardi Co-developed-by: Haitao Huang Signed-off-by: Haitao Huang Reviewed-by: Jarkko Sakkinen Reviewed-by: Tejun Heo --- V10: - Shorten function, variable, struct names, s/sgx_epc_cgroup/sgx_cgroup. (Jarkko) - Use enums instead of booleans for the parameters. (Dave, Jarkko) V8: - Remove null checks for epc_cg in try_charge()/uncharge(). (Jarkko) - Remove extra space, '_INTEL'. (Jarkko) V7: - Use a static for root cgroup (Kai) - Wrap epc_cg field in sgx_epc_page struct with #ifdef (Kai) - Correct check for charge API return (Kai) - Start initialization in SGX device driver init (Kai) - Remove unneeded BUG_ON (Kai) - Split sgx_get_current_epc_cg() out of sgx_epc_cg_try_charge() (Kai) V6: - Split the original large patch"Limit process EPC usage with misc cgroup controller" and restructure it (Kai) --- arch/x86/Kconfig | 13 + arch/x86/kernel/cpu/sgx/Makefile | 1 + arch/x86/kernel/cpu/sgx/epc_cgroup.c | 74 arch/x86/kernel/cpu/sgx/epc_cgroup.h | 70 ++ arch/x86/kernel/cpu/sgx/main.c | 51 ++- arch/x86/kernel/cpu/sgx/sgx.h| 5 ++ include/linux/misc_cgroup.h | 2 + 7 files changed, 214 insertions(+), 2 deletions(-) create mode 100644 arch/x86/kernel/cpu/sgx/epc_cgroup.c create mode 100644 arch/x86/kernel/cpu/sgx/epc_cgroup.h diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index 39886bab943a..bda78255a7ab 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -1941,6 +1941,19 @@ config X86_SGX If unsure, say N. +config CGROUP_SGX_EPC + bool "Miscellaneous Cgroup Controller for Enclave Page Cache (EPC) for Intel SGX" + depends on X86_SGX && CGROUP_MISC + help + Provides control over the EPC footprint of tasks in a cgroup via + the Miscellaneous cgroup controller. + + EPC is a subset of regular memory that is usable only by SGX + enclaves and is very limited in quantity, e.g. less than 1% + of total DRAM. + + Say N if unsure. 
+ config X86_USER_SHADOW_STACK bool "X86 userspace shadow stack" depends on AS_WRUSS diff --git a/arch/x86/kernel/cpu/sgx/Makefile b/arch/x86/kernel/cpu/sgx/Makefile index 9c1656779b2a..12901a488da7 100644 --- a/arch/x86/kernel/cpu/sgx/Makefile +++ b/arch/x86/kernel/cpu/sgx/Makefile @@ -4,3 +4,4 @@ obj-y += \ ioctl.o \ main.o obj-$(CONFIG_X86_SGX_KVM) += virt.o +obj-$(CONFIG_CGROUP_SGX_EPC) += epc_cgroup.o diff --git a/arch/x86/kernel/cpu/sgx/epc_cgroup.c b/arch/x86/kernel/cpu/sgx/epc_cgroup.c new file mode 100644 index ..a1dd43c195b2 --- /dev/null +++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.c @@ -0,0 +1,74 @@ +// SPDX-License-Identifier: GPL-2.0 +// Copyright(c) 2022 Intel Corporation. + +#include +#include +#include "epc_cgroup.h" + +/* The root SGX EPC cgroup */ +static struct sgx_cgroup sgx_cg_root; + +/** + * sgx_cgroup_try_charge() - try to charge cgroup for a single EPC page + * + *
[PATCH v10 04/14] cgroup/misc: Add SGX EPC resource type
From: Kristen Carlson Accardi Add SGX EPC memory, MISC_CG_RES_SGX_EPC, to be a valid resource type for the misc controller. Signed-off-by: Kristen Carlson Accardi Co-developed-by: Haitao Huang Signed-off-by: Haitao Huang Reviewed-by: Jarkko Sakkinen --- V6: - Split the original patch into this and the preceding one (Kai) --- include/linux/misc_cgroup.h | 4 kernel/cgroup/misc.c| 4 2 files changed, 8 insertions(+) diff --git a/include/linux/misc_cgroup.h b/include/linux/misc_cgroup.h index 541a5611c597..2f6cc3a0ad23 100644 --- a/include/linux/misc_cgroup.h +++ b/include/linux/misc_cgroup.h @@ -17,6 +17,10 @@ enum misc_res_type { MISC_CG_RES_SEV, /* AMD SEV-ES ASIDs resource */ MISC_CG_RES_SEV_ES, +#endif +#ifdef CONFIG_CGROUP_SGX_EPC + /* SGX EPC memory resource */ + MISC_CG_RES_SGX_EPC, #endif MISC_CG_RES_TYPES }; diff --git a/kernel/cgroup/misc.c b/kernel/cgroup/misc.c index 1f0d8e05b36c..e51d6a45007f 100644 --- a/kernel/cgroup/misc.c +++ b/kernel/cgroup/misc.c @@ -24,6 +24,10 @@ static const char *const misc_res_name[] = { /* AMD SEV-ES ASIDs resource */ "sev_es", #endif +#ifdef CONFIG_CGROUP_SGX_EPC + /* Intel SGX EPC memory bytes */ + "sgx_epc", +#endif }; /* Root misc cgroup */ -- 2.25.1
[PATCH v10 03/14] cgroup/misc: Export APIs for SGX driver
From: Kristen Carlson Accardi

The SGX EPC cgroup will reclaim EPC pages when usage in a cgroup reaches its or an ancestor's limit. This requires a walk from the current cgroup up to the root, similar to misc_cg_try_charge(). Export misc_cg_parent() to enable this walk.

The SGX driver also needs to start a global-level reclamation from the root. Export misc_cg_root() for the SGX driver to access.

Signed-off-by: Kristen Carlson Accardi
Co-developed-by: Haitao Huang
Signed-off-by: Haitao Huang
Reviewed-by: Jarkko Sakkinen
Reviewed-by: Tejun Heo
---
V6:
- Make commit messages more concise and split the original patch into two (Kai)
---
 include/linux/misc_cgroup.h | 24
 kernel/cgroup/misc.c        | 21 -
 2 files changed, 32 insertions(+), 13 deletions(-)

diff --git a/include/linux/misc_cgroup.h b/include/linux/misc_cgroup.h
index 0806d4436208..541a5611c597 100644
--- a/include/linux/misc_cgroup.h
+++ b/include/linux/misc_cgroup.h
@@ -64,6 +64,7 @@ struct misc_cg {
 	struct misc_res res[MISC_CG_RES_TYPES];
 };
 
+struct misc_cg *misc_cg_root(void);
 u64 misc_cg_res_total_usage(enum misc_res_type type);
 int misc_cg_set_capacity(enum misc_res_type type, u64 capacity);
 int misc_cg_set_ops(enum misc_res_type type, const struct misc_res_ops *ops);
@@ -84,6 +85,20 @@ static inline struct misc_cg *css_misc(struct cgroup_subsys_state *css)
 	return css ? container_of(css, struct misc_cg, css) : NULL;
 }
 
+/**
+ * misc_cg_parent() - Get the parent of the passed misc cgroup.
+ * @cgroup: cgroup whose parent needs to be fetched.
+ *
+ * Context: Any context.
+ * Return:
+ * * struct misc_cg* - Parent of the @cgroup.
+ * * %NULL - If @cgroup is null or the passed cgroup does not have a parent.
+ */
+static inline struct misc_cg *misc_cg_parent(struct misc_cg *cgroup)
+{
+	return cgroup ? css_misc(cgroup->css.parent) : NULL;
+}
+
 /*
  * get_current_misc_cg() - Find and get the misc cgroup of the current task.
 *
@@ -108,6 +123,15 @@ static inline void put_misc_cg(struct misc_cg *cg)
 }
 
 #else /* !CONFIG_CGROUP_MISC */
 
+static inline struct misc_cg *misc_cg_root(void)
+{
+	return NULL;
+}
+
+static inline struct misc_cg *misc_cg_parent(struct misc_cg *cg)
+{
+	return NULL;
+}
 
 static inline u64 misc_cg_res_total_usage(enum misc_res_type type)
 {
diff --git a/kernel/cgroup/misc.c b/kernel/cgroup/misc.c
index 14ab13ef3bc7..1f0d8e05b36c 100644
--- a/kernel/cgroup/misc.c
+++ b/kernel/cgroup/misc.c
@@ -43,18 +43,13 @@ static u64 misc_res_capacity[MISC_CG_RES_TYPES];
 static const struct misc_res_ops *misc_res_ops[MISC_CG_RES_TYPES];
 
 /**
- * parent_misc() - Get the parent of the passed misc cgroup.
- * @cgroup: cgroup whose parent needs to be fetched.
- *
- * Context: Any context.
- * Return:
- * * struct misc_cg* - Parent of the @cgroup.
- * * %NULL - If @cgroup is null or the passed cgroup does not have a parent.
+ * misc_cg_root() - Return the root misc cgroup.
  */
-static struct misc_cg *parent_misc(struct misc_cg *cgroup)
+struct misc_cg *misc_cg_root(void)
 {
-	return cgroup ? css_misc(cgroup->css.parent) : NULL;
+	return &root_cg;
 }
+EXPORT_SYMBOL_GPL(misc_cg_root);
 
 /**
  * valid_type() - Check if @type is valid or not.
@@ -183,7 +178,7 @@ int misc_cg_try_charge(enum misc_res_type type, struct misc_cg *cg, u64 amount)
 	if (!amount)
 		return 0;
 
-	for (i = cg; i; i = parent_misc(i)) {
+	for (i = cg; i; i = misc_cg_parent(i)) {
 		res = &i->res[type];
 
 		new_usage = atomic64_add_return(amount, &res->usage);
@@ -196,12 +191,12 @@ int misc_cg_try_charge(enum misc_res_type type, struct misc_cg *cg, u64 amount)
 	return 0;
 
 err_charge:
-	for (j = i; j; j = parent_misc(j)) {
+	for (j = i; j; j = misc_cg_parent(j)) {
 		atomic64_inc(&j->res[type].events);
 		cgroup_file_notify(&j->events_file);
 	}
 
-	for (j = cg; j != i; j = parent_misc(j))
+	for (j = cg; j != i; j = misc_cg_parent(j))
 		misc_cg_cancel_charge(type, j, amount);
 	misc_cg_cancel_charge(type, i, amount);
 	return ret;
@@ -223,7 +218,7 @@ void misc_cg_uncharge(enum misc_res_type type, struct misc_cg *cg, u64 amount)
 	if (!(amount && valid_type(type) && cg))
 		return;
 
-	for (i = cg; i; i = parent_misc(i))
+	for (i = cg; i; i = misc_cg_parent(i))
 		misc_cg_cancel_charge(type, i, amount);
 }
 EXPORT_SYMBOL_GPL(misc_cg_uncharge);
-- 
2.25.1
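The ancestor walk in misc_cg_try_charge() — charge every level from the cgroup up to the root, and on failure notify the failing level and its ancestors, then roll back the levels already charged — can be modeled in user space. This is a simplified, non-atomic sketch; `struct cg` and `try_charge()` here are hypothetical stand-ins, not the kernel's locking-correct implementation.

```c
#include <assert.h>
#include <stdint.h>

struct cg {
        uint64_t usage;
        uint64_t max;
        uint64_t events;   /* limit-hit notifications, like misc.events */
        struct cg *parent;
};

/*
 * Walk from cg to the root charging each level; on failure, bump the
 * events counter of the failing level and its ancestors, then roll back
 * the levels that were already charged.
 */
static int try_charge(struct cg *cg, uint64_t amount)
{
        struct cg *i, *j;

        if (!amount)
                return 0;

        for (i = cg; i; i = i->parent) {
                i->usage += amount;     /* atomic64_add_return() in the kernel */
                if (i->usage > i->max)
                        goto err;
        }
        return 0;

err:
        for (j = i; j; j = j->parent)
                j->events++;

        for (j = cg; j != i; j = j->parent)
                j->usage -= amount;     /* roll back fully-charged levels */
        i->usage -= amount;             /* and the level that tripped */
        return -1;
}
```

The two rollback loops mirror the kernel's split: ancestors below the failing level had a complete charge to cancel, while the failing level itself only undoes the add that overshot.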
[PATCH v10 02/14] cgroup/misc: Add per resource callbacks for CSS events
From: Kristen Carlson Accardi The misc cgroup controller (subsystem) currently does not perform resource type specific action for Cgroups Subsystem State (CSS) events: the 'css_alloc' event when a cgroup is created and the 'css_free' event when a cgroup is destroyed. Define callbacks for those events and allow resource providers to register the callbacks per resource type as needed. This will be utilized later by the EPC misc cgroup support implemented in the SGX driver. Signed-off-by: Kristen Carlson Accardi Co-developed-by: Haitao Huang Signed-off-by: Haitao Huang Reviewed-by: Jarkko Sakkinen Reviewed-by: Tejun Heo --- V8: - Abstract out _misc_cg_res_free() and _misc_cg_res_alloc() (Jarkko) V7: - Make ops one per resource type and store them in array (Michal) - Rename the ops struct to misc_res_ops, and enforce the constraints of required callback functions (Jarkko) - Moved addition of priv field to patch 4 where it was used first. (Jarkko) V6: - Create ops struct for per resource callbacks (Jarkko) - Drop max_write callback (Dave, Michal) - Style fixes (Kai) --- include/linux/misc_cgroup.h | 11 + kernel/cgroup/misc.c| 84 + 2 files changed, 87 insertions(+), 8 deletions(-) diff --git a/include/linux/misc_cgroup.h b/include/linux/misc_cgroup.h index e799b1f8d05b..0806d4436208 100644 --- a/include/linux/misc_cgroup.h +++ b/include/linux/misc_cgroup.h @@ -27,6 +27,16 @@ struct misc_cg; #include +/** + * struct misc_res_ops: per resource type callback ops. + * @alloc: invoked for resource specific initialization when cgroup is allocated. + * @free: invoked for resource specific cleanup when cgroup is deallocated. + */ +struct misc_res_ops { + int (*alloc)(struct misc_cg *cg); + void (*free)(struct misc_cg *cg); +}; + /** * struct misc_res: Per cgroup per misc type resource * @max: Maximum limit on the resource. 
@@ -56,6 +66,7 @@ struct misc_cg { u64 misc_cg_res_total_usage(enum misc_res_type type); int misc_cg_set_capacity(enum misc_res_type type, u64 capacity); +int misc_cg_set_ops(enum misc_res_type type, const struct misc_res_ops *ops); int misc_cg_try_charge(enum misc_res_type type, struct misc_cg *cg, u64 amount); void misc_cg_uncharge(enum misc_res_type type, struct misc_cg *cg, u64 amount); diff --git a/kernel/cgroup/misc.c b/kernel/cgroup/misc.c index 79a3717a5803..14ab13ef3bc7 100644 --- a/kernel/cgroup/misc.c +++ b/kernel/cgroup/misc.c @@ -39,6 +39,9 @@ static struct misc_cg root_cg; */ static u64 misc_res_capacity[MISC_CG_RES_TYPES]; +/* Resource type specific operations */ +static const struct misc_res_ops *misc_res_ops[MISC_CG_RES_TYPES]; + /** * parent_misc() - Get the parent of the passed misc cgroup. * @cgroup: cgroup whose parent needs to be fetched. @@ -105,6 +108,36 @@ int misc_cg_set_capacity(enum misc_res_type type, u64 capacity) } EXPORT_SYMBOL_GPL(misc_cg_set_capacity); +/** + * misc_cg_set_ops() - set resource specific operations. + * @type: Type of the misc res. + * @ops: Operations for the given type. + * + * Context: Any context. + * Return: + * * %0 - Successfully registered the operations. + * * %-EINVAL - If @type is invalid, or the operations missing any required callbacks. + */ +int misc_cg_set_ops(enum misc_res_type type, const struct misc_res_ops *ops) +{ + if (!valid_type(type)) + return -EINVAL; + + if (!ops->alloc) { + pr_err("%s: alloc missing\n", __func__); + return -EINVAL; + } + + if (!ops->free) { + pr_err("%s: free missing\n", __func__); + return -EINVAL; + } + + misc_res_ops[type] = ops; + return 0; +} +EXPORT_SYMBOL_GPL(misc_cg_set_ops); + /** * misc_cg_cancel_charge() - Cancel the charge from the misc cgroup. * @type: Misc res type in misc cg to cancel the charge from. 
@@ -371,6 +404,33 @@ static struct cftype misc_cg_files[] = {
 	{}
 };
 
+static inline int _misc_cg_res_alloc(struct misc_cg *cg)
+{
+	enum misc_res_type i;
+	int ret;
+
+	for (i = 0; i < MISC_CG_RES_TYPES; i++) {
+		WRITE_ONCE(cg->res[i].max, MAX_NUM);
+		atomic64_set(&cg->res[i].usage, 0);
+		if (misc_res_ops[i]) {
+			ret = misc_res_ops[i]->alloc(cg);
+			if (ret)
+				return ret;
+		}
+	}
+
+	return 0;
+}
+
+static inline void _misc_cg_res_free(struct misc_cg *cg)
+{
+	enum misc_res_type i;
+
+	for (i = 0; i < MISC_CG_RES_TYPES; i++)
+		if (misc_res_ops[i])
+			misc_res_ops[i]->free(cg);
+}
+
 /**
  * misc_cg_alloc() - Allocate misc cgroup.
  * @parent_css: Parent cgroup.
@@ -383,20 +443,25 @@ static struct cftype misc_cg_files[] = {
 static struct cgroup_subsys_state *
 misc_cg_alloc(struct cgroup_subsys_state *parent_css)
 {
-	enum misc_res_type i;
-	struct
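The registration pattern in this patch — a per-resource ops table, validated at registration time and consulted for every resource when a cgroup is allocated — can be sketched in user space. This is an illustrative model with hypothetical names (`set_ops()`, `cg_res_alloc()`), mirroring `misc_cg_set_ops()` and `_misc_cg_res_alloc()`.

```c
#include <assert.h>
#include <stddef.h>

enum res_type { RES_A, RES_B, RES_TYPES };

struct res_ops {
        int  (*alloc)(void *cg);  /* resource-specific init on css_alloc */
        void (*free)(void *cg);   /* resource-specific cleanup on css_free */
};

static const struct res_ops *res_ops_tbl[RES_TYPES];

/* Mirror of misc_cg_set_ops(): reject unknown types and incomplete ops. */
static int set_ops(int t, const struct res_ops *ops)
{
        if (t < 0 || t >= RES_TYPES)
                return -1;
        if (!ops->alloc || !ops->free)
                return -1;      /* both callbacks are required */
        res_ops_tbl[t] = ops;
        return 0;
}

/* Called for every resource type when a cgroup is allocated. */
static int cg_res_alloc(void *cg)
{
        int i;

        for (i = 0; i < RES_TYPES; i++)
                if (res_ops_tbl[i] && res_ops_tbl[i]->alloc(cg))
                        return -1;
        return 0;
}

static int n_alloc;
static int  dummy_alloc(void *cg) { (void)cg; n_alloc++; return 0; }
static void dummy_free(void *cg)  { (void)cg; }
```

Validating the ops at registration time keeps the per-cgroup alloc/free paths free of NULL-callback checks beyond the "is this resource registered at all" test.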
[PATCH v10 00/14] Add Cgroup support for SGX EPC memory
SGX Enclave Page Cache (EPC) memory allocations are separate from normal RAM allocations, and are managed solely by the SGX subsystem. The existing cgroup memory controller cannot be used to limit or account for SGX EPC memory, which is a desirable feature in some environments, e.g., support for pod-level control in a Kubernetes cluster on a VM or bare-metal host [1,2].

This patchset implements the support for sgx_epc memory within the misc cgroup controller. A user can use the misc cgroup controller to set and enforce a max limit on total EPC usage per cgroup. The implementation reports current usage and events of reaching the limit per cgroup, as well as the total system capacity.

Much like normal system memory, EPC memory can be overcommitted via virtual memory techniques and pages can be swapped out of the EPC to their backing store, which is normal system memory allocated via shmem and accounted for by the memory controller. Similar to per-cgroup reclamation done by the memory controller, the EPC misc controller needs to implement a per-cgroup EPC reclaiming process: when the EPC usage of a cgroup reaches its hard limit ('sgx_epc' entry in the 'misc.max' file), the cgroup starts swapping out some EPC pages within the same cgroup to make room for new allocations.

For that, this implementation tracks reclaimable EPC pages in a separate LRU list in each cgroup, and below are more details and justification of this design.

Track EPC pages in per-cgroup LRUs (from Dave)
--

tl;dr: A cgroup hitting its limit should be as similar as possible to the system running out of EPC memory. The only two choices to implement that are nasty changes to the existing LRU scanning algorithm, or to add new LRUs. The result: Add a new LRU for each cgroup and scan those instead. Replace the existing global LRU with the root cgroup's LRU (only when this new support is compiled in, obviously).
The existing EPC memory management aims to be a miniature version of the core VM where EPC memory can be overcommitted and reclaimed. EPC allocations can wait for reclaim. The alternative to waiting would have been to send a signal and let the enclave die.

This series attempts to implement that same logic for cgroups, for the same reasons: it's preferable to wait for memory to become available and let reclaim happen than to do things that are fatal to enclaves.

There is currently a global reclaimable page SGX LRU list. That list (and the existing scanning algorithm) is essentially useless for doing reclaim when a cgroup hits its limit because the cgroup's pages are scattered around that LRU. It is unspeakably inefficient to scan a linked list with millions of entries for what could be dozens of pages from a cgroup that needs reclaim.

Even if unspeakably slow reclaim was accepted, the existing scanning algorithm only picks a few pages off the head of the global LRU. It would either need to hold the list locks for unreasonable amounts of time, or be taught to scan the list in pieces, which has its own challenges.

Unreclaimable Enclave Pages
---

There are a variety of page types for enclaves, each serving different purposes [5]. Although the SGX architecture supports swapping for all types, some special pages, e.g., Version Array (VA) and Secure Enclave Control Structure (SECS) [5], hold metadata of reclaimed pages and enclaves. That makes reclamation of such pages more intricate to manage. The SGX driver's global reclaimer currently does not swap out VA pages. It only swaps the SECS page of an enclave when all other associated pages have been swapped out. The cgroup reclaimer follows the same approach: it does not track those pages in per-cgroup LRUs and considers them unreclaimable. The allocation of these pages is counted towards the usage of a specific cgroup and is subject to the cgroup's set EPC limits.
Earlier versions of this series implemented forced enclave-killing to reclaim VA and SECS pages. That was designed to enforce the 'max' limit, particularly in scenarios where a user or administrator reduces this limit after enclaves have launched. However, subsequent discussions [3, 4] indicated that such preemptive enforcement is not necessary for the misc controllers. Therefore, reclaiming SECS/VA pages by force-killing enclaves was removed, and the limit is only enforced at the time of a new EPC allocation request. When a cgroup hits its limit but nothing is left in the LRUs of the subtree, i.e., there is nothing to reclaim in the cgroup, any new attempt to allocate EPC within that cgroup will result in ENOMEM.

Unreclaimable Guest VM EPC Pages

The EPC pages allocated for guest VMs by the virtual EPC driver are not reclaimable by the host kernel [6]. Therefore an EPC cgroup also treats those as unreclaimable and returns ENOMEM when its limit is hit and nothing reclaimable is left within the cgroup. The virtual EPC driver
[PATCH v10 01/14] x86/sgx: Replace boolean parameters with enums
Replace boolean parameters for 'reclaim' in the function sgx_alloc_epc_page() and its callers with an enum. Also opportunistically remove non-static declaration of __sgx_alloc_epc_page() and a typo Signed-off-by: Haitao Huang Suggested-by: Jarkko Sakkinen Suggested-by: Dave Hansen --- arch/x86/kernel/cpu/sgx/encl.c | 12 ++-- arch/x86/kernel/cpu/sgx/encl.h | 4 ++-- arch/x86/kernel/cpu/sgx/ioctl.c | 10 +- arch/x86/kernel/cpu/sgx/main.c | 14 +++--- arch/x86/kernel/cpu/sgx/sgx.h | 13 +++-- arch/x86/kernel/cpu/sgx/virt.c | 2 +- 6 files changed, 32 insertions(+), 23 deletions(-) diff --git a/arch/x86/kernel/cpu/sgx/encl.c b/arch/x86/kernel/cpu/sgx/encl.c index 279148e72459..f474179b6f77 100644 --- a/arch/x86/kernel/cpu/sgx/encl.c +++ b/arch/x86/kernel/cpu/sgx/encl.c @@ -217,7 +217,7 @@ static struct sgx_epc_page *sgx_encl_eldu(struct sgx_encl_page *encl_page, struct sgx_epc_page *epc_page; int ret; - epc_page = sgx_alloc_epc_page(encl_page, false); + epc_page = sgx_alloc_epc_page(encl_page, SGX_NO_RECLAIM); if (IS_ERR(epc_page)) return epc_page; @@ -359,14 +359,14 @@ static vm_fault_t sgx_encl_eaug_page(struct vm_area_struct *vma, goto err_out_unlock; } - epc_page = sgx_alloc_epc_page(encl_page, false); + epc_page = sgx_alloc_epc_page(encl_page, SGX_NO_RECLAIM); if (IS_ERR(epc_page)) { if (PTR_ERR(epc_page) == -EBUSY) vmret = VM_FAULT_NOPAGE; goto err_out_unlock; } - va_page = sgx_encl_grow(encl, false); + va_page = sgx_encl_grow(encl, SGX_NO_RECLAIM); if (IS_ERR(va_page)) { if (PTR_ERR(va_page) == -EBUSY) vmret = VM_FAULT_NOPAGE; @@ -1232,8 +1232,8 @@ void sgx_zap_enclave_ptes(struct sgx_encl *encl, unsigned long addr) /** * sgx_alloc_va_page() - Allocate a Version Array (VA) page - * @reclaim: Reclaim EPC pages directly if none available. Enclave - * mutex should not be held if this is set. + * @reclaim: Whether reclaim EPC pages directly if none available. Enclave + * mutex should not be held for SGX_DO_RECLAIM. 
* * Allocate a free EPC page and convert it to a Version Array (VA) page. * @@ -1241,7 +1241,7 @@ void sgx_zap_enclave_ptes(struct sgx_encl *encl, unsigned long addr) * a VA page, * -errno otherwise */ -struct sgx_epc_page *sgx_alloc_va_page(bool reclaim) +struct sgx_epc_page *sgx_alloc_va_page(enum sgx_reclaim reclaim) { struct sgx_epc_page *epc_page; int ret; diff --git a/arch/x86/kernel/cpu/sgx/encl.h b/arch/x86/kernel/cpu/sgx/encl.h index f94ff14c9486..fe15ade02ca1 100644 --- a/arch/x86/kernel/cpu/sgx/encl.h +++ b/arch/x86/kernel/cpu/sgx/encl.h @@ -116,14 +116,14 @@ struct sgx_encl_page *sgx_encl_page_alloc(struct sgx_encl *encl, unsigned long offset, u64 secinfo_flags); void sgx_zap_enclave_ptes(struct sgx_encl *encl, unsigned long addr); -struct sgx_epc_page *sgx_alloc_va_page(bool reclaim); +struct sgx_epc_page *sgx_alloc_va_page(enum sgx_reclaim reclaim); unsigned int sgx_alloc_va_slot(struct sgx_va_page *va_page); void sgx_free_va_slot(struct sgx_va_page *va_page, unsigned int offset); bool sgx_va_page_full(struct sgx_va_page *va_page); void sgx_encl_free_epc_page(struct sgx_epc_page *page); struct sgx_encl_page *sgx_encl_load_page(struct sgx_encl *encl, unsigned long addr); -struct sgx_va_page *sgx_encl_grow(struct sgx_encl *encl, bool reclaim); +struct sgx_va_page *sgx_encl_grow(struct sgx_encl *encl, enum sgx_reclaim reclaim); void sgx_encl_shrink(struct sgx_encl *encl, struct sgx_va_page *va_page); #endif /* _X86_ENCL_H */ diff --git a/arch/x86/kernel/cpu/sgx/ioctl.c b/arch/x86/kernel/cpu/sgx/ioctl.c index b65ab214bdf5..793a0ba2cb16 100644 --- a/arch/x86/kernel/cpu/sgx/ioctl.c +++ b/arch/x86/kernel/cpu/sgx/ioctl.c @@ -17,7 +17,7 @@ #include "encl.h" #include "encls.h" -struct sgx_va_page *sgx_encl_grow(struct sgx_encl *encl, bool reclaim) +struct sgx_va_page *sgx_encl_grow(struct sgx_encl *encl, enum sgx_reclaim reclaim) { struct sgx_va_page *va_page = NULL; void *err; @@ -64,7 +64,7 @@ static int sgx_encl_create(struct sgx_encl *encl, struct sgx_secs *secs) struct file *backing; long ret; - va_page = sgx_encl_grow(encl, true); + va_page = sgx_encl_grow(encl, SGX_DO_RECLAIM); if (IS_ERR(va_page)) return PTR_ERR(va_page); else if (va_page) @@ -83,7 +83,7 @@ static int sgx_encl_create(struct sgx_encl *encl, struct sgx_secs *secs) encl->backing = backing; - secs_epc = sgx_alloc_epc_page(&encl->secs, true); + secs_epc = sgx_alloc_epc_page(&encl->secs, SGX_DO_RECLAIM); if (IS_ERR(secs_epc)) { ret =
[PATCH v3 3/3] vhost: Improve vhost_get_avail_idx() with smp_rmb()
All the callers of vhost_get_avail_idx() are concerned with the memory barrier imposed by smp_rmb() to ensure the order of the available ring entry read and the avail_idx read. Improve vhost_get_avail_idx() so that smp_rmb() is executed when the avail_idx is advanced. With it, the callers needn't worry about the memory barrier. Suggested-by: Michael S. Tsirkin Signed-off-by: Gavin Shan --- drivers/vhost/vhost.c | 75 +++ 1 file changed, 26 insertions(+), 49 deletions(-) diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c index 32686c79c41d..e6882f4f6ce2 100644 --- a/drivers/vhost/vhost.c +++ b/drivers/vhost/vhost.c @@ -1290,10 +1290,28 @@ static void vhost_dev_unlock_vqs(struct vhost_dev *d) mutex_unlock(&d->vqs[i]->mutex); } -static inline int vhost_get_avail_idx(struct vhost_virtqueue *vq, - __virtio16 *idx) +static inline int vhost_get_avail_idx(struct vhost_virtqueue *vq) { - return vhost_get_avail(vq, *idx, &vq->avail->idx); + __virtio16 avail_idx; + int r; + + r = vhost_get_avail(vq, avail_idx, &vq->avail->idx); + if (unlikely(r)) { + vq_err(vq, "Failed to access avail idx at %p\n", + &vq->avail->idx); + return r; + } + + vq->avail_idx = vhost16_to_cpu(vq, avail_idx); + if (vq->avail_idx != vq->last_avail_idx) { + /* Ensure the available ring entry read happens +* before the avail_idx read when the avail_idx +* is advanced.
+*/ + smp_rmb(); + } + + return 0; } static inline int vhost_get_avail_head(struct vhost_virtqueue *vq, @@ -2499,7 +2517,6 @@ int vhost_get_vq_desc(struct vhost_virtqueue *vq, struct vring_desc desc; unsigned int i, head, found = 0; u16 last_avail_idx; - __virtio16 avail_idx; __virtio16 ring_head; int ret, access; @@ -2507,12 +2524,8 @@ int vhost_get_vq_desc(struct vhost_virtqueue *vq, last_avail_idx = vq->last_avail_idx; if (vq->avail_idx == vq->last_avail_idx) { - if (unlikely(vhost_get_avail_idx(vq, &avail_idx))) { - vq_err(vq, "Failed to access avail idx at %p\n", - &vq->avail->idx); + if (unlikely(vhost_get_avail_idx(vq))) return -EFAULT; - } - vq->avail_idx = vhost16_to_cpu(vq, avail_idx); if (unlikely((u16)(vq->avail_idx - last_avail_idx) > vq->num)) { vq_err(vq, "Guest moved used index from %u to %u", @@ -2525,11 +2538,6 @@ */ if (vq->avail_idx == last_avail_idx) return vq->num; - - /* Only get avail ring entries after they have been -* exposed by guest. -*/ - smp_rmb(); } /* Grab the next descriptor number they're advertising, and increment @@ -2790,35 +2798,19 @@ EXPORT_SYMBOL_GPL(vhost_add_used_and_signal_n); /* return true if we're sure that avaiable ring is empty */ bool vhost_vq_avail_empty(struct vhost_dev *dev, struct vhost_virtqueue *vq) { - __virtio16 avail_idx; - int r; - if (vq->avail_idx != vq->last_avail_idx) return false; - r = vhost_get_avail_idx(vq, &avail_idx); - if (unlikely(r)) - return false; - - vq->avail_idx = vhost16_to_cpu(vq, avail_idx); - if (vq->avail_idx != vq->last_avail_idx) { - /* Since we have updated avail_idx, the following -* call to vhost_get_vq_desc() will read available -* ring entries. Make sure that read happens after -* the avail_idx read. -*/ - smp_rmb(); + if (unlikely(vhost_get_avail_idx(vq))) return false; - } - return true; + return vq->avail_idx == vq->last_avail_idx; } EXPORT_SYMBOL_GPL(vhost_vq_avail_empty); /* OK, now we need to know about added descriptors.
*/ bool vhost_enable_notify(struct vhost_dev *dev, struct vhost_virtqueue *vq) { - __virtio16 avail_idx; int r; if (!(vq->used_flags & VRING_USED_F_NO_NOTIFY)) @@ -2842,25 +2834,10 @@ bool vhost_enable_notify(struct vhost_dev *dev, struct vhost_virtqueue *vq) /* They could have slipped one in as we were doing that: make * sure it's written, then check again. */ smp_mb(); - r = vhost_get_avail_idx(vq, &avail_idx); - if (r) { - vq_err(vq, "Failed to check avail idx at %p: %d\n", - &vq->avail->idx, r); + if (unlikely(vhost_get_avail_idx(vq))) return false; - } - - vq->avail_idx = vhost16_to_cpu(vq, avail_idx); - if (vq->avail_idx != vq->last_avail_idx) {
[PATCH v3 2/3] vhost: Add smp_rmb() in vhost_enable_notify()
A smp_rmb() has been missed in vhost_enable_notify(), inspired by Will. Otherwise, it's not ensured that the available ring entries pushed by the guest can be observed by vhost in time, leading to stale available ring entries fetched by vhost in vhost_get_vq_desc(), as reported by Yihuang Yu on NVidia's grace-hopper (ARM64) platform. /home/gavin/sandbox/qemu.main/build/qemu-system-aarch64 \ -accel kvm -machine virt,gic-version=host -cpu host \ -smp maxcpus=1,cpus=1,sockets=1,clusters=1,cores=1,threads=1 \ -m 4096M,slots=16,maxmem=64G \ -object memory-backend-ram,id=mem0,size=4096M\ : \ -netdev tap,id=vnet0,vhost=true \ -device virtio-net-pci,bus=pcie.8,netdev=vnet0,mac=52:54:00:f1:26:b0 : guest# netperf -H 10.26.1.81 -l 60 -C -c -t UDP_STREAM virtio_net virtio0: output.0:id 100 is not a head! Add the missed smp_rmb() in vhost_enable_notify(). When it returns true, it means there are still pending tx buffers. Since it might read indices, it can still bypass the smp_rmb() in vhost_get_vq_desc(). Note that it should be safe until vq->avail_idx is changed by commit d3bb267bbdcb ("vhost: cache avail index in vhost_enable_notify()"). Fixes: d3bb267bbdcb ("vhost: cache avail index in vhost_enable_notify()") Cc: # v5.18+ Reported-by: Yihuang Yu Suggested-by: Will Deacon Signed-off-by: Gavin Shan Acked-by: Jason Wang --- drivers/vhost/vhost.c | 12 +++- 1 file changed, 11 insertions(+), 1 deletion(-) diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c index 29df65b2ebf2..32686c79c41d 100644 --- a/drivers/vhost/vhost.c +++ b/drivers/vhost/vhost.c @@ -2848,9 +2848,19 @@ bool vhost_enable_notify(struct vhost_dev *dev, struct vhost_virtqueue *vq) &vq->avail->idx, r); return false; } + vq->avail_idx = vhost16_to_cpu(vq, avail_idx); + if (vq->avail_idx != vq->last_avail_idx) { + /* Since we have updated avail_idx, the following +* call to vhost_get_vq_desc() will read available +* ring entries. Make sure that read happens after +* the avail_idx read.
+*/ + smp_rmb(); + return true; + } - return vq->avail_idx != vq->last_avail_idx; + return false; } EXPORT_SYMBOL_GPL(vhost_enable_notify); -- 2.44.0
[PATCH v3 1/3] vhost: Add smp_rmb() in vhost_vq_avail_empty()
A smp_rmb() has been missed in vhost_vq_avail_empty(), spotted by Will. Otherwise, it's not ensured that the available ring entries pushed by the guest can be observed by vhost in time, leading to stale available ring entries fetched by vhost in vhost_get_vq_desc(), as reported by Yihuang Yu on NVidia's grace-hopper (ARM64) platform. /home/gavin/sandbox/qemu.main/build/qemu-system-aarch64 \ -accel kvm -machine virt,gic-version=host -cpu host \ -smp maxcpus=1,cpus=1,sockets=1,clusters=1,cores=1,threads=1 \ -m 4096M,slots=16,maxmem=64G \ -object memory-backend-ram,id=mem0,size=4096M\ : \ -netdev tap,id=vnet0,vhost=true \ -device virtio-net-pci,bus=pcie.8,netdev=vnet0,mac=52:54:00:f1:26:b0 : guest# netperf -H 10.26.1.81 -l 60 -C -c -t UDP_STREAM virtio_net virtio0: output.0:id 100 is not a head! Add the missed smp_rmb() in vhost_vq_avail_empty(). When tx_can_batch() returns true, it means there are still pending tx buffers. Since it might read indices, it can still bypass the smp_rmb() in vhost_get_vq_desc(). Note that it should be safe until vq->avail_idx is changed by commit 275bf960ac697 ("vhost: better detection of available buffers"). Fixes: 275bf960ac69 ("vhost: better detection of available buffers") Cc: # v4.11+ Reported-by: Yihuang Yu Suggested-by: Will Deacon Signed-off-by: Gavin Shan Acked-by: Jason Wang --- drivers/vhost/vhost.c | 12 +++- 1 file changed, 11 insertions(+), 1 deletion(-) diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c index 045f666b4f12..29df65b2ebf2 100644 --- a/drivers/vhost/vhost.c +++ b/drivers/vhost/vhost.c @@ -2799,9 +2799,19 @@ bool vhost_vq_avail_empty(struct vhost_dev *dev, struct vhost_virtqueue *vq) r = vhost_get_avail_idx(vq, &avail_idx); if (unlikely(r)) return false; + vq->avail_idx = vhost16_to_cpu(vq, avail_idx); + if (vq->avail_idx != vq->last_avail_idx) { + /* Since we have updated avail_idx, the following +* call to vhost_get_vq_desc() will read available +* ring entries.
Make sure that read happens after +* the avail_idx read. +*/ + smp_rmb(); + return false; + } - return vq->avail_idx == vq->last_avail_idx; + return true; } EXPORT_SYMBOL_GPL(vhost_vq_avail_empty); -- 2.44.0
[PATCH v3 0/3] vhost: Fix stale available ring entries
The issue was reported by Yihuang Yu on NVidia's grace-hopper (ARM64) platform. The wrong head (available ring entry) is seen by the guest when running 'netperf' on the guest and running 'netserver' on another NVidia grace-grace machine. /home/gavin/sandbox/qemu.main/build/qemu-system-aarch64 \ -accel kvm -machine virt,gic-version=host -cpu host \ -smp maxcpus=1,cpus=1,sockets=1,clusters=1,cores=1,threads=1 \ -m 4096M,slots=16,maxmem=64G \ -object memory-backend-ram,id=mem0,size=4096M\ : \ -netdev tap,id=tap0,vhost=true \ -device virtio-net-pci,bus=pcie.8,netdev=tap0,mac=52:54:00:f1:26:b0 : guest# ifconfig eth0 | grep 'inet addr' inet addr:10.26.1.220 guest# netperf -H 10.26.1.81 -l 60 -C -c -t UDP_STREAM virtio_net virtio0: output.0:id 100 is not a head! There is a missed smp_rmb() in vhost_{vq_avail_empty, enable_notify}(). Without smp_rmb(), vq->avail_idx is advanced but the available ring entries haven't arrived at the vhost side yet. So a stale available ring entry can be fetched in vhost_get_vq_desc(). Fix it by adding smp_rmb() in those two functions. Note that I need two patches so that they can be easily picked up by the stable kernel. With the changes, I'm unable to hit the issue again. Besides, the function vhost_get_avail_idx() is improved to tackle the memory barrier so that the callers needn't worry about it. v2: https://lore.kernel.org/virtualization/46c6a9aa-821c-4013-afe7-61ec05fc9...@redhat.com v1: https://lore.kernel.org/virtualization/66e12633-b2d6-4b9a-9103-bb79770fc...@redhat.com Changelog = v3: Improved change log (Jason) Improved comments and added PATCH[v3 3/3] to execute smp_rmb() in vhost_get_avail_idx() (Michael) Gavin Shan (3): vhost: Add smp_rmb() in vhost_vq_avail_empty() vhost: Add smp_rmb() in vhost_enable_notify() vhost: Improve vhost_get_avail_idx() with smp_rmb() drivers/vhost/vhost.c | 51 --- 1 file changed, 24 insertions(+), 27 deletions(-) -- 2.44.0
Re: [PATCH] uprobes: reduce contention on uprobes_tree access
On Wed, 27 Mar 2024 17:06:01 + Jonathan Haslam wrote: > > > Masami, > > > > > > Given the discussion around per-cpu rw semaphore and need for > > > (internal) batched attachment API for uprobes, do you think you can > > > apply this patch as is for now? We can then gain initial improvements > > > in scalability that are also easy to backport, and Jonathan will work > > > on a more complete solution based on per-cpu RW semaphore, as > > > suggested by Ingo. > > > > Yeah, it is interesting to use per-cpu rw semaphore on uprobe. > > I would like to wait for the next version. > > My initial tests show a nice improvement with the RW spinlocks but a > significant regression in acquiring a write lock. I've got a few days > vacation over Easter but I'll aim to get some more formalised results out > to the thread toward the end of next week. As long as the write lock is only on the cold path, I think you can choose the per-cpu RW semaphore. Since it does not do busy wait, the total system performance impact will be small. I look forward to your formalized results :) Thank you, > > Jon. > > > > > Thank you, > > > > > > > > > > > > > BTW, how did you measure the overhead? I think spinlock overhead > > > > will depend on how much lock contention happens.
> > > > > > > > Thank you, > > > > > > > > > > [0] https://docs.kernel.org/locking/spinlocks.html > > > > > > > > > > Signed-off-by: Jonathan Haslam > > > > > --- > > > > > kernel/events/uprobes.c | 22 +++--- > > > > > 1 file changed, 11 insertions(+), 11 deletions(-) > > > > > > > > > > diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c > > > > > index 929e98c62965..42bf9b6e8bc0 100644 > > > > > --- a/kernel/events/uprobes.c > > > > > +++ b/kernel/events/uprobes.c > > > > > @@ -39,7 +39,7 @@ static struct rb_root uprobes_tree = RB_ROOT; > > > > > */ > > > > > #define no_uprobe_events() RB_EMPTY_ROOT(&uprobes_tree) > > > > > > > > > > -static DEFINE_SPINLOCK(uprobes_treelock);/* serialize rbtree > > > > > access */ > > > > > +static DEFINE_RWLOCK(uprobes_treelock); /* serialize rbtree > > > > > access */ > > > > > > > > > > #define UPROBES_HASH_SZ 13 > > > > > /* serialize uprobe->pending_list */ > > > > > @@ -669,9 +669,9 @@ static struct uprobe *find_uprobe(struct inode > > > > > *inode, loff_t offset) > > > > > { > > > > > struct uprobe *uprobe; > > > > > > > > > > - spin_lock(&uprobes_treelock); > > > > > + read_lock(&uprobes_treelock); > > > > > uprobe = __find_uprobe(inode, offset); > > > > > - spin_unlock(&uprobes_treelock); > > > > > + read_unlock(&uprobes_treelock); > > > > > > > > > > return uprobe; > > > > > } > > > > > @@ -701,9 +701,9 @@ static struct uprobe *insert_uprobe(struct uprobe > > > > > *uprobe) > > > > > { > > > > > struct uprobe *u; > > > > > > > > > > - spin_lock(&uprobes_treelock); > > > > > + write_lock(&uprobes_treelock); > > > > > u = __insert_uprobe(uprobe); > > > > > - spin_unlock(&uprobes_treelock); > > > > > + write_unlock(&uprobes_treelock); > > > > > > > > > > return u; > > > > > } > > > > > @@ -935,9 +935,9 @@ static void delete_uprobe(struct uprobe *uprobe) > > > > > if (WARN_ON(!uprobe_is_active(uprobe))) > > > > > return; > > > > > > > > > > - spin_lock(&uprobes_treelock); > > > > > + write_lock(&uprobes_treelock); > > > > > rb_erase(&uprobe->rb_node, &uprobes_tree); > > > > > - spin_unlock(&uprobes_treelock); > > >
> > + write_unlock(&uprobes_treelock); > > > > > RB_CLEAR_NODE(&uprobe->rb_node); /* for uprobe_is_active() */ > > > > > put_uprobe(uprobe); > > > > > } > > > > > @@ -1298,7 +1298,7 @@ static void build_probe_list(struct inode > > > > > *inode, > > > > > min = vaddr_to_offset(vma, start); > > > > > max = min + (end - start) - 1; > > > > > > > > > > - spin_lock(&uprobes_treelock); > > > > > + read_lock(&uprobes_treelock); > > > > > n = find_node_in_range(inode, min, max); > > > > > if (n) { > > > > > for (t = n; t; t = rb_prev(t)) { > > > > > @@ -1316,7 +1316,7 @@ static void build_probe_list(struct inode > > > > > *inode, > > > > > get_uprobe(u); > > > > > } > > > > > } > > > > > - spin_unlock(&uprobes_treelock); > > > > > + read_unlock(&uprobes_treelock); > > > > > } > > > > > > > > > > /* @vma contains reference counter, not the probed instruction. */ > > > > > @@ -1407,9 +1407,9 @@ vma_has_uprobes(struct vm_area_struct *vma, > > > > > unsigned long start, unsigned long e > > > > > min = vaddr_to_offset(vma, start); > > > > > max = min + (end - start) - 1; > > > > > > > > > > - spin_lock(&uprobes_treelock); > > > > > + read_lock(&uprobes_treelock); > > > > > n = find_node_in_range(inode, min, max); > > > > > - spin_unlock(&uprobes_treelock); > > > > > + read_unlock(&uprobes_treelock); > > > > > > > > > > return !!n; > > > > > } > > > > > -- > > > > > 2.43.0 > > > > > > > > > > > > > > > > > -- > > > > Masami Hiramatsu (Google) > > > > > > -- > > Masami
[PATCH v2 1/1] vhost: Added pad cleanup if vnet_hdr is not present.
When Qemu is launched with vhost but without tap vnet_hdr, vhost tries to copy vnet_hdr from the socket iter with size 0 to the page that may contain some trash. That trash can be interpreted as unpredictable values for vnet_hdr. That leads to dropping some packets and in some cases to stalling the vhost routine when the vhost_net tries to process packets and fails in a loop. Qemu options: -netdev tap,vhost=on,vnet_hdr=off,... From a security point of view, wrong values in the field are used later by tap's tap_get_user_xdp() and will affect skb gso and options. Later the header (and data in headroom) should not be used by the stack. Using a custom socket as a backend to vhost_net can reveal some data in the vnet_hdr, although it would require kernel access to implement. The issue happens because the value of sock_hlen in the virtqueue is 0. That value is set at vhost_net_set_features() with VHOST_NET_F_VIRTIO_NET_HDR, also it's set to zero at the device open() and reset() routine. So, currently, to trigger the issue, we need to set up qemu with vhost=on,vnet_hdr=off, or not configure vhost in the custom program. Signed-off-by: Andrew Melnychenko --- drivers/vhost/net.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c index f2ed7167c848..57411ac2d08b 100644 --- a/drivers/vhost/net.c +++ b/drivers/vhost/net.c @@ -735,6 +735,9 @@ static int vhost_net_build_xdp(struct vhost_net_virtqueue *nvq, hdr = buf; gso = &hdr->gso; + if (!sock_hlen) + memset(buf, 0, pad); + if ((gso->flags & VIRTIO_NET_HDR_F_NEEDS_CSUM) && vhost16_to_cpu(vq, gso->csum_start) + vhost16_to_cpu(vq, gso->csum_offset) + 2 > -- 2.43.0
Re: [PATCH v4 2/2] tracing: Include Microcode Revision in mce_record tracepoint
On Wed, Mar 27, 2024 at 03:31:01PM -0700, Sohil Mehta wrote: > On 3/27/2024 1:54 PM, Avadhut Naik wrote: > > > - TP_printk("CPU: %d, MCGc/s: %llx/%llx, MC%d: %016Lx, IPID: %016Lx, > > ADDR/MISC/SYND: %016Lx/%016Lx/%016Lx, RIP: %02x:<%016Lx>, TSC: %llx, PPIN: > > %llx, PROCESSOR: %u:%x, TIME: %llu, SOCKET: %u, APIC: %x", > > + TP_printk("CPU: %d, MCGc/s: %llx/%llx, MC%d: %016Lx, IPID: %016Lx, > > ADDR/MISC/SYND: %016Lx/%016Lx/%016Lx, RIP: %02x:<%016Lx>, TSC: %llx, PPIN: > > %llx, PROCESSOR: %u:%x, TIME: %llu, SOCKET: %u, APIC: %x, MICROCODE > > REVISION: %x", > > Nit: s/MICROCODE REVISION/MICROCODE/g > > You could probably get rid of the word REVISION in the interest of > brevity similar to __print_mce(). You *definitely* want to do that - good catch. And TBH, all the screaming words aren't helping either... :) -- Regards/Gruss, Boris. https://people.kernel.org/tglx/notes-about-netiquette
Re: [PATCH v4 2/2] tracing: Include Microcode Revision in mce_record tracepoint
On 3/27/2024 1:54 PM, Avadhut Naik wrote: > - TP_printk("CPU: %d, MCGc/s: %llx/%llx, MC%d: %016Lx, IPID: %016Lx, > ADDR/MISC/SYND: %016Lx/%016Lx/%016Lx, RIP: %02x:<%016Lx>, TSC: %llx, PPIN: > %llx, PROCESSOR: %u:%x, TIME: %llu, SOCKET: %u, APIC: %x", > + TP_printk("CPU: %d, MCGc/s: %llx/%llx, MC%d: %016Lx, IPID: %016Lx, > ADDR/MISC/SYND: %016Lx/%016Lx/%016Lx, RIP: %02x:<%016Lx>, TSC: %llx, PPIN: > %llx, PROCESSOR: %u:%x, TIME: %llu, SOCKET: %u, APIC: %x, MICROCODE REVISION: > %x", Nit: s/MICROCODE REVISION/MICROCODE/g You could probably get rid of the word REVISION in the interest of brevity similar to __print_mce(). pr_emerg(HW_ERR "PROCESSOR %u:%x TIME %llu SOCKET %u APIC %x microcode %x\n", m->cpuvendor, m->cpuid, m->time, m->socketid, m->apicid, m->microcode); -Sohil
Re: [PATCH RFT] arm64: dts: qcom: sm8350: Reenable crypto & cryptobam
On 16.02.2024 11:36 AM, Luca Weiss wrote: > On Mon Jan 8, 2024 at 11:45 PM CET, Dmitry Baryshkov wrote: >> On Mon, 8 Jan 2024 at 16:23, Luca Weiss wrote: >>> >>> On Mon Jan 8, 2024 at 3:18 PM CET, Konrad Dybcio wrote: On 8.01.2024 14:49, Luca Weiss wrote: > When num-channels and qcom,num-ees is not provided in devicetree, the > driver will try to read these values from the registers during probe but > this fails if the interconnect is not on and then crashes the system. > > So we can provide these properties in devicetree (queried after patching > BAM driver to enable the necessary interconnect) so we can probe > cryptobam without reading registers and then also use the QCE as > expected. This really feels a bit backwards.. Enable the resource to query the hardware for numbers, so that said resource can be enabled, but slightly later :/ >>> >>> If you think adding interconnect support to driver and dtsi is better, >>> let me know. >> >> I'd say, adding the proper interconnect is a better option. Otherwise >> we just depend on the QCE itself to set up the vote for us. > > Yes, currently we depend on that. > >> >>> >>> Stephan (+CC) mentioned it should be okay like this *shrug* >>> >>> For the record, this is the same way I got the values for sc7280[0] and >>> sm6350[1]. 
>>> >>> [0] >>> https://lore.kernel.org/linux-arm-msm/20231229-sc7280-cryptobam-fixup-v1-1-bd8f68589...@fairphone.com/ >>> [1] >>> https://lore.kernel.org/linux-arm-msm/20240105-sm6350-qce-v1-0-416e5c731...@fairphone.com/ >>> >>> diff --git a/arch/arm64/boot/dts/qcom/sm8350.dtsi >>> b/arch/arm64/boot/dts/qcom/sm8350.dtsi >>> index b46236235b7f..cd4dd9852d9e 100644 >>> --- a/arch/arm64/boot/dts/qcom/sm8350.dtsi >>> +++ b/arch/arm64/boot/dts/qcom/sm8350.dtsi >>> @@ -1756,8 +1756,8 @@ cryptobam: dma-controller@1dc4000 { >>> qcom,controlled-remotely; >>> iommus = <&apps_smmu 0x594 0x0011>, >>> <&apps_smmu 0x596 0x0011>; >>> - /* FIXME: Probing BAM DMA causes some abort and >>> system hang */ >>> - status = "fail"; >>> + interconnects = <&aggre2_noc MASTER_CRYPTO 0 >>> &mc_virt SLAVE_EBI1 0>; >>> + interconnect-names = "memory"; >>> }; >>> >>> crypto: crypto@1dfa000 { >>> diff --git a/drivers/dma/qcom/bam_dma.c b/drivers/dma/qcom/bam_dma.c >>> index 5e7d332731e0..9de28f615639 100644 >>> --- a/drivers/dma/qcom/bam_dma.c >>> +++ b/drivers/dma/qcom/bam_dma.c >>> @@ -40,6 +40,7 @@ >>> #include <linux/circ_buf.h> >>> #include <linux/clk.h> >>> #include <linux/dmaengine.h> >>> +#include <linux/interconnect.h> >>> #include <linux/pm_runtime.h> >>> >>> #include "../dmaengine.h" >>> @@ -394,6 +395,7 @@ struct bam_device { >>> const struct reg_offset_data *layout; >>> >>> struct clk *bamclk; >>> + struct icc_path *mem_path; >>> int irq; >>> >>> /* dma start transaction tasklet */ >>> @@ -1206,6 +1208,7 @@ static int bam_init(struct bam_device *bdev) >>> bdev->num_channels = val & BAM_NUM_PIPES_MASK; >>> } >>> >>> + printk(KERN_ERR "%s:%d DBG num_ees=%u num_channels=%u\n", __func__, >>> __LINE__, bdev->num_ees, bdev->num_channels); >>> /* Reset BAM now if fully controlled locally */ >>> if (!bdev->controlled_remotely && !bdev->powered_remotely) >>> bam_reset(bdev); >>> @@ -1298,6 +1301,14 @@ static int bam_dma_probe(struct platform_device >>> *pdev) >>> return ret; >>> } >>> >>> + bdev->mem_path = devm_of_icc_get(bdev->dev, "memory"); >>> + if (IS_ERR(bdev->mem_path)) >>> + return
PTR_ERR(bdev->mem_path); >>> + >>> + ret = icc_set_bw(bdev->mem_path, 1, 1); >> >> Probably this needs some more sensible value. > > So the downstream qcedev driver uses 384 for the interconnect. But this is > crypto-specific and probably different BAMs have different minimum > requirements? > > #define CRYPTO_AVG_BW 384 > #define CRYPTO_PEAK_BW 384 > https://github.com/xiaomi-sm8450-kernel/android_kernel_platform_msm-kernel/blob/lineage-20/drivers/crypto/msm/qce.h#L57 > > Do you have any suggestion what to use here? I'd've expected this to mean anything, but apparently not. My immediate guess is that the 384 was the lowest magic value that didn't result in the bus getting kicked offline.. 1 should be fine upstream due to commit 91e045b93db7 ("interconnect: qcom: Fix small BW votes being truncated to zero"). > > Also I'd assume that with pm_runtime suspended we'd need to clear the > votes in the driver so we don't keep the interconnect alive > unnecessarily? My naive understanding is that the power should only be necessary when the thing is in use, so early probe and pm-active sounds about sane.. Konrad
Re: [RFC PATCH v2 1/4] tracing: add __print_sym() to replace __print_symbolic()
On Wed, 2024-03-27 at 21:11 +, Simon Horman wrote: > > I'm seeing some allmodconfig build problems with this applied on top of > net-next. > ./include/trace/stages/init.h:30: warning: "TRACE_DEFINE_SYM_FNS" redefined > ./include/trace/stages/init.h:54: warning: "TRACE_DEFINE_SYM_LIST" redefined Yeah, the 0-day bot reported that too, sorry about that. It needs two lines to #undef these in init.h before their definition, just like all other macros there. Not sure why my builds didn't show that, maybe it doesn't affect all users. johannes
Re: [RFC PATCH v2 1/4] tracing: add __print_sym() to replace __print_symbolic()
On Tue, Mar 26, 2024 at 08:15:56PM +0100, Johannes Berg wrote: > From: Johannes Berg > > The way __print_symbolic() works is limited and inefficient > in multiple ways: > - you can only use it with a static list of symbols, but >e.g. the SKB dropreasons are now a dynamic list > > - it builds the list in memory _three_ times, so it takes >a lot of memory: >- The print_fmt contains the list (since it's passed to > the macro there). This actually contains the names > _twice_, which is fixed up at runtime. >- TRACE_DEFINE_ENUM() puts a 24-byte struct trace_eval_map > for every entry, plus the string pointed to by it, which > cannot be deduplicated with the strings in the print_fmt >- The in-kernel symbolic printing creates yet another list > of struct trace_print_flags for trace_print_symbols_seq() > > - it also requires runtime fixup during init, which is a lot >of string parsing due to the print_fmt fixup > > Introduce __print_sym() to - over time - replace the old one. > We can easily extend this also to __print_flags later, but I > cared only about the SKB dropreasons for now, which has only > __print_symbolic(). > > This new __print_sym() requires only a single list of items, > created by TRACE_DEFINE_SYM_LIST(), or can even use another > already existing list by using TRACE_DEFINE_SYM_FNS() with > lookup and show methods. > > Then, instead of doing an init-time fixup, just do this at the > time when userspace reads the print_fmt. This way, dynamically > updated lists are possible. > > For userspace, nothing actually changes, because the print_fmt > is shown exactly the same way the old __print_symbolic() was. > > This adds about 4k .text in my test builds, but that'll be > more than paid for by the actual conversions. > > Signed-off-by: Johannes Berg Hi Johannes, I'm seeing some allmodconfig build problems with this applied on top of net-next. 
In file included from ./include/trace/trace_events.h:27, from ./include/trace/define_trace.h:102, from ./include/trace/events/module.h:134, from kernel/module/main.c:64: ./include/trace/stages/init.h:30: warning: "TRACE_DEFINE_SYM_FNS" redefined 30 | #define TRACE_DEFINE_SYM_FNS(_symbol_id, _lookup, _show) \ | In file included from ./include/linux/trace_events.h:11, from kernel/module/main.c:14: ./include/linux/tracepoint.h:130: note: this is the location of the previous definition 130 | #define TRACE_DEFINE_SYM_FNS(...) | ./include/trace/stages/init.h:54: warning: "TRACE_DEFINE_SYM_LIST" redefined 54 | #define TRACE_DEFINE_SYM_LIST(_symbol_id, ...) \ | ./include/linux/tracepoint.h:131: note: this is the location of the previous definition 131 | #define TRACE_DEFINE_SYM_LIST(...) |
RE: [PATCH v4 2/2] tracing: Include Microcode Revision in mce_record tracepoint
> Export microcode version through the tracepoint to prevent ambiguity over > the active version on the system when the MCE was received. > > Signed-off-by: Avadhut Naik > Reviewed-by: Sohil Mehta > Reviewed-by: Steven Rostedt (Google) Reviewed-by: Tony Luck
RE: [PATCH v4 1/2] tracing: Include PPIN in mce_record tracepoint
> Export PPIN through the tracepoint as it provides a unique identifier for > the system (or socket in case of multi-socket systems) on which the MCE > has been received. > > Signed-off-by: Avadhut Naik > Reviewed-by: Sohil Mehta > Reviewed-by: Steven Rostedt (Google) Reviewed-by: Tony Luck
[PATCH v4 2/2] tracing: Include Microcode Revision in mce_record tracepoint
Currently, the microcode field (Microcode Revision) of struct mce is not exported to userspace through the mce_record tracepoint. Knowing the microcode version on which the MCE was received is critical information for debugging. If the version is not recorded, later attempts to acquire the version might result in discrepancies since it can be changed at runtime. Export microcode version through the tracepoint to prevent ambiguity over the active version on the system when the MCE was received. Signed-off-by: Avadhut Naik Reviewed-by: Sohil Mehta Reviewed-by: Steven Rostedt (Google) --- include/trace/events/mce.h | 7 +-- 1 file changed, 5 insertions(+), 2 deletions(-) diff --git a/include/trace/events/mce.h b/include/trace/events/mce.h index 959ba7b775b1..69438550252c 100644 --- a/include/trace/events/mce.h +++ b/include/trace/events/mce.h @@ -42,6 +42,7 @@ TRACE_EVENT(mce_record, __field(u8, cs ) __field(u8, bank) __field(u8, cpuvendor ) + __field(u32,microcode ) ), TP_fast_assign( @@ -63,9 +64,10 @@ TRACE_EVENT(mce_record, __entry->cs = m->cs; __entry->bank = m->bank; __entry->cpuvendor = m->cpuvendor; + __entry->microcode = m->microcode; ), - TP_printk("CPU: %d, MCGc/s: %llx/%llx, MC%d: %016Lx, IPID: %016Lx, ADDR/MISC/SYND: %016Lx/%016Lx/%016Lx, RIP: %02x:<%016Lx>, TSC: %llx, PPIN: %llx, PROCESSOR: %u:%x, TIME: %llu, SOCKET: %u, APIC: %x", + TP_printk("CPU: %d, MCGc/s: %llx/%llx, MC%d: %016Lx, IPID: %016Lx, ADDR/MISC/SYND: %016Lx/%016Lx/%016Lx, RIP: %02x:<%016Lx>, TSC: %llx, PPIN: %llx, PROCESSOR: %u:%x, TIME: %llu, SOCKET: %u, APIC: %x, MICROCODE REVISION: %x", __entry->cpu, __entry->mcgcap, __entry->mcgstatus, __entry->bank, __entry->status, @@ -77,7 +79,8 @@ TRACE_EVENT(mce_record, __entry->cpuvendor, __entry->cpuid, __entry->walltime, __entry->socketid, - __entry->apicid) + __entry->apicid, + __entry->microcode) ); #endif /* _TRACE_MCE_H */ -- 2.34.1
[PATCH v4 1/2] tracing: Include PPIN in mce_record tracepoint
Machine Check Error information from struct mce is exported to userspace through the mce_record tracepoint. Currently, however, the PPIN (Protected Processor Inventory Number) field of struct mce is not exported through the tracepoint. Export PPIN through the tracepoint as it provides a unique identifier for the system (or socket in case of multi-socket systems) on which the MCE has been received. Signed-off-by: Avadhut Naik Reviewed-by: Sohil Mehta Reviewed-by: Steven Rostedt (Google) --- include/trace/events/mce.h | 13 - 1 file changed, 12 insertions(+), 1 deletion(-) diff --git a/include/trace/events/mce.h b/include/trace/events/mce.h index 1391ada0da3b..959ba7b775b1 100644 --- a/include/trace/events/mce.h +++ b/include/trace/events/mce.h @@ -9,6 +9,14 @@ #include #include +/* + * MCE Event Record. + * + * Only very relevant and transient information which cannot be + * gathered from a system by any other means or which can only be + * acquired arduously should be added to this record. 
+ */ + TRACE_EVENT(mce_record, TP_PROTO(struct mce *m), @@ -25,6 +33,7 @@ TRACE_EVENT(mce_record, __field(u64,ipid) __field(u64,ip ) __field(u64,tsc ) + __field(u64,ppin) __field(u64,walltime) __field(u32,cpu ) __field(u32,cpuid ) @@ -45,6 +54,7 @@ TRACE_EVENT(mce_record, __entry->ipid = m->ipid; __entry->ip = m->ip; __entry->tsc= m->tsc; + __entry->ppin = m->ppin; __entry->walltime = m->time; __entry->cpu= m->extcpu; __entry->cpuid = m->cpuid; @@ -55,7 +65,7 @@ TRACE_EVENT(mce_record, __entry->cpuvendor = m->cpuvendor; ), - TP_printk("CPU: %d, MCGc/s: %llx/%llx, MC%d: %016Lx, IPID: %016Lx, ADDR/MISC/SYND: %016Lx/%016Lx/%016Lx, RIP: %02x:<%016Lx>, TSC: %llx, PROCESSOR: %u:%x, TIME: %llu, SOCKET: %u, APIC: %x", + TP_printk("CPU: %d, MCGc/s: %llx/%llx, MC%d: %016Lx, IPID: %016Lx, ADDR/MISC/SYND: %016Lx/%016Lx/%016Lx, RIP: %02x:<%016Lx>, TSC: %llx, PPIN: %llx, PROCESSOR: %u:%x, TIME: %llu, SOCKET: %u, APIC: %x", __entry->cpu, __entry->mcgcap, __entry->mcgstatus, __entry->bank, __entry->status, @@ -63,6 +73,7 @@ TRACE_EVENT(mce_record, __entry->addr, __entry->misc, __entry->synd, __entry->cs, __entry->ip, __entry->tsc, + __entry->ppin, __entry->cpuvendor, __entry->cpuid, __entry->walltime, __entry->socketid, -- 2.34.1
[PATCH v4 0/2] Update mce_record tracepoint
This patchset updates the mce_record tracepoint so that the recently added fields of struct mce are exported through it to userspace. The first patch adds the PPIN (Protected Processor Inventory Number) field to the tracepoint. The second patch adds the microcode field (Microcode Revision) to the tracepoint. Changes in v2: - Export microcode field (Microcode Revision) through the tracepoint in addition to PPIN. Changes in v3: - Change format specifier for microcode revision from %u to %x - Fix tab alignments - Add Reviewed-by: Sohil Mehta Changes in v4: - Update commit messages to reflect the reason for the fields being added to the tracepoint. - Add comment to explicitly state the type of information that should be added to the tracepoint. - Add Reviewed-by: Steven Rostedt (Google) [NOTE: - Since only a comment has been added and only the commit messages have been reworked i.e. no code changes have been undertaken for this version, have retained the below tags from v3: Reviewed-by: Sohil Mehta Reviewed-by: Steven Rostedt (Google) ] Avadhut Naik (2): tracing: Include PPIN in mce_record tracepoint tracing: Include Microcode Revision in mce_record tracepoint include/trace/events/mce.h | 18 -- 1 file changed, 16 insertions(+), 2 deletions(-) base-commit: 818ea9b4c8237f96ac99dc0b2f02dd6d3f2adb97 -- 2.34.1
Re: [PATCH untested] vhost: order avail ring reads after index updates
On Wed, Mar 27, 2024 at 07:52:02PM +, Will Deacon wrote: > On Wed, Mar 27, 2024 at 01:26:23PM -0400, Michael S. Tsirkin wrote: > > vhost_get_vq_desc (correctly) uses smp_rmb to order > > avail ring reads after index reads. > > However, over time we added two more places that read the > > index and do not bother with barriers. > > Since vhost_get_vq_desc when it was written assumed it is the > > only reader when it sees a new index value is cached > > it does not bother with a barrier either, as a result, > > on the nvidia-gracehopper platform (arm64) available ring > > entry reads have been observed bypassing ring reads, causing > > a ring corruption. > > > > To fix, factor out the correct index access code from vhost_get_vq_desc. > > As a side benefit, we also validate the index on all paths now, which > > will hopefully help catch future errors earlier. > > > > Note: current code is inconsistent in how it handles errors: > > some places treat it as an empty ring, others - non empty. > > This patch does not attempt to change the existing behaviour. > > > > Cc: sta...@vger.kernel.org > > Reported-by: Gavin Shan > > Reported-by: Will Deacon > > Suggested-by: Will Deacon > > Fixes: 275bf960ac69 ("vhost: better detection of available buffers") > > Cc: "Jason Wang" > > Fixes: d3bb267bbdcb ("vhost: cache avail index in vhost_enable_notify()") > > Cc: "Stefano Garzarella" > > Signed-off-by: Michael S. Tsirkin > > --- > > > > I think it's better to bite the bullet and clean up the code. > > Note: this is still only built, not tested. > > Gavin could you help test please? > > Especially on the arm platform you have? > > > > Will thanks so much for finding this race! > > No problem, and I was also hoping that the smp_rmb() could be > consolidated into a single helper like you've done here. 
> > One minor comment below: > > > drivers/vhost/vhost.c | 80 +++ > > 1 file changed, 42 insertions(+), 38 deletions(-) > > > > diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c > > index 045f666b4f12..26b70b1fd9ff 100644 > > --- a/drivers/vhost/vhost.c > > +++ b/drivers/vhost/vhost.c > > @@ -1290,10 +1290,38 @@ static void vhost_dev_unlock_vqs(struct vhost_dev > > *d) > > mutex_unlock(>vqs[i]->mutex); > > } > > > > -static inline int vhost_get_avail_idx(struct vhost_virtqueue *vq, > > - __virtio16 *idx) > > +static inline int vhost_get_avail_idx(struct vhost_virtqueue *vq) > > { > > - return vhost_get_avail(vq, *idx, >avail->idx); > > + __virtio16 idx; > > + u16 avail_idx; > > + int r = vhost_get_avail(vq, idx, >avail->idx); > > + > > + if (unlikely(r < 0)) { > > + vq_err(vq, "Failed to access avail idx at %p: %d\n", > > + >avail->idx, r); > > + return -EFAULT; > > + } > > + > > + avail_idx = vhost16_to_cpu(vq, idx); > > + > > + /* Check it isn't doing very strange things with descriptor numbers. */ > > + if (unlikely((u16)(avail_idx - vq->last_avail_idx) > vq->num)) { > > + vq_err(vq, "Guest moved used index from %u to %u", > > + vq->last_avail_idx, vq->avail_idx); > > + return -EFAULT; > > + } > > + > > + /* Nothing new? We are done. */ > > + if (avail_idx == vq->avail_idx) > > + return 0; > > + > > + vq->avail_idx = avail_idx; > > + > > + /* We updated vq->avail_idx so we need a memory barrier between > > +* the index read above and the caller reading avail ring entries. > > +*/ > > + smp_rmb(); > > I think you could use smp_acquire__after_ctrl_dep() if you're feeling > brave, but to be honest I'd prefer we went in the opposite direction > and used READ/WRITE_ONCE + smp_load_acquire()/smp_store_release() across > the board. It's just a thankless, error-prone task to get there :( Let's just say that's a separate patch, I tried hard to make this one a bugfix only, no other functional changes at all. 
> So, for the patch as-is: > > Acked-by: Will Deacon > > (I've not tested it either though, so definitely wait for Gavin on that!) > > Cheers, > > Will
Re: [PATCH untested] vhost: order avail ring reads after index updates
On Wed, Mar 27, 2024 at 01:26:23PM -0400, Michael S. Tsirkin wrote: > vhost_get_vq_desc (correctly) uses smp_rmb to order > avail ring reads after index reads. > However, over time we added two more places that read the > index and do not bother with barriers. > Since vhost_get_vq_desc when it was written assumed it is the > only reader when it sees a new index value is cached > it does not bother with a barrier either, as a result, > on the nvidia-gracehopper platform (arm64) available ring > entry reads have been observed bypassing ring reads, causing > a ring corruption. > > To fix, factor out the correct index access code from vhost_get_vq_desc. > As a side benefit, we also validate the index on all paths now, which > will hopefully help catch future errors earlier. > > Note: current code is inconsistent in how it handles errors: > some places treat it as an empty ring, others - non empty. > This patch does not attempt to change the existing behaviour. > > Cc: sta...@vger.kernel.org > Reported-by: Gavin Shan > Reported-by: Will Deacon > Suggested-by: Will Deacon > Fixes: 275bf960ac69 ("vhost: better detection of available buffers") > Cc: "Jason Wang" > Fixes: d3bb267bbdcb ("vhost: cache avail index in vhost_enable_notify()") > Cc: "Stefano Garzarella" > Signed-off-by: Michael S. Tsirkin > --- > > I think it's better to bite the bullet and clean up the code. > Note: this is still only built, not tested. > Gavin could you help test please? > Especially on the arm platform you have? > > Will thanks so much for finding this race! No problem, and I was also hoping that the smp_rmb() could be consolidated into a single helper like you've done here. 
One minor comment below: > drivers/vhost/vhost.c | 80 +++ > 1 file changed, 42 insertions(+), 38 deletions(-) > > diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c > index 045f666b4f12..26b70b1fd9ff 100644 > --- a/drivers/vhost/vhost.c > +++ b/drivers/vhost/vhost.c > @@ -1290,10 +1290,38 @@ static void vhost_dev_unlock_vqs(struct vhost_dev *d) > mutex_unlock(>vqs[i]->mutex); > } > > -static inline int vhost_get_avail_idx(struct vhost_virtqueue *vq, > - __virtio16 *idx) > +static inline int vhost_get_avail_idx(struct vhost_virtqueue *vq) > { > - return vhost_get_avail(vq, *idx, >avail->idx); > + __virtio16 idx; > + u16 avail_idx; > + int r = vhost_get_avail(vq, idx, >avail->idx); > + > + if (unlikely(r < 0)) { > + vq_err(vq, "Failed to access avail idx at %p: %d\n", > +>avail->idx, r); > + return -EFAULT; > + } > + > + avail_idx = vhost16_to_cpu(vq, idx); > + > + /* Check it isn't doing very strange things with descriptor numbers. */ > + if (unlikely((u16)(avail_idx - vq->last_avail_idx) > vq->num)) { > + vq_err(vq, "Guest moved used index from %u to %u", > +vq->last_avail_idx, vq->avail_idx); > + return -EFAULT; > + } > + > + /* Nothing new? We are done. */ > + if (avail_idx == vq->avail_idx) > + return 0; > + > + vq->avail_idx = avail_idx; > + > + /* We updated vq->avail_idx so we need a memory barrier between > + * the index read above and the caller reading avail ring entries. > + */ > + smp_rmb(); I think you could use smp_acquire__after_ctrl_dep() if you're feeling brave, but to be honest I'd prefer we went in the opposite direction and used READ/WRITE_ONCE + smp_load_acquire()/smp_store_release() across the board. It's just a thankless, error-prone task to get there :( So, for the patch as-is: Acked-by: Will Deacon (I've not tested it either though, so definitely wait for Gavin on that!) Cheers, Will
Re: [PATCH v2 2/2] ARM: dts: qcom: Add support for Motorola Moto G (2013)
On 26.03.2024 1:11 PM, Stanislav Jakubek wrote: > Add a device tree for the Motorola Moto G (2013) smartphone based > on the Qualcomm MSM8226 SoC. > > Initially supported features: > - Buttons (Volume Down/Up, Power) > - eMMC > - Hall Effect Sensor > - SimpleFB display > - TMP108 temperature sensor > - Vibrator > > Note: the dhob and shob reserved-memory regions are seemingly a part of some > Motorola specific (firmware?) mechanism, see [1]. > > [1] > https://github.com/LineageOS/android_kernel_motorola_msm8226/blob/cm-14.1/Documentation/devicetree/bindings/misc/hob_ram.txt > > Signed-off-by: Stanislav Jakubek > --- Reviewed-by: Konrad Dybcio Konrad
[PATCH untested] vhost: order avail ring reads after index updates
vhost_get_vq_desc (correctly) uses smp_rmb to order avail ring reads after index reads. However, over time we added two more places that read the index and do not bother with barriers. Since vhost_get_vq_desc when it was written assumed it is the only reader when it sees a new index value is cached it does not bother with a barrier either, as a result, on the nvidia-gracehopper platform (arm64) available ring entry reads have been observed bypassing ring reads, causing a ring corruption. To fix, factor out the correct index access code from vhost_get_vq_desc. As a side benefit, we also validate the index on all paths now, which will hopefully help catch future errors earlier. Note: current code is inconsistent in how it handles errors: some places treat it as an empty ring, others - non empty. This patch does not attempt to change the existing behaviour. Cc: sta...@vger.kernel.org Reported-by: Gavin Shan Reported-by: Will Deacon Suggested-by: Will Deacon Fixes: 275bf960ac69 ("vhost: better detection of available buffers") Cc: "Jason Wang" Fixes: d3bb267bbdcb ("vhost: cache avail index in vhost_enable_notify()") Cc: "Stefano Garzarella" Signed-off-by: Michael S. Tsirkin --- I think it's better to bite the bullet and clean up the code. Note: this is still only built, not tested. Gavin could you help test please? Especially on the arm platform you have? Will thanks so much for finding this race! 
 drivers/vhost/vhost.c | 80 +++++++++++++++++++++++++------------------
 1 file changed, 42 insertions(+), 38 deletions(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 045f666b4f12..26b70b1fd9ff 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -1290,10 +1290,38 @@ static void vhost_dev_unlock_vqs(struct vhost_dev *d)
 		mutex_unlock(&d->vqs[i]->mutex);
 }
 
-static inline int vhost_get_avail_idx(struct vhost_virtqueue *vq,
-				      __virtio16 *idx)
+static inline int vhost_get_avail_idx(struct vhost_virtqueue *vq)
 {
-	return vhost_get_avail(vq, *idx, &vq->avail->idx);
+	__virtio16 idx;
+	u16 avail_idx;
+	int r = vhost_get_avail(vq, idx, &vq->avail->idx);
+
+	if (unlikely(r < 0)) {
+		vq_err(vq, "Failed to access avail idx at %p: %d\n",
+		       &vq->avail->idx, r);
+		return -EFAULT;
+	}
+
+	avail_idx = vhost16_to_cpu(vq, idx);
+
+	/* Check it isn't doing very strange things with descriptor numbers. */
+	if (unlikely((u16)(avail_idx - vq->last_avail_idx) > vq->num)) {
+		vq_err(vq, "Guest moved used index from %u to %u",
+		       vq->last_avail_idx, vq->avail_idx);
+		return -EFAULT;
+	}
+
+	/* Nothing new? We are done. */
+	if (avail_idx == vq->avail_idx)
+		return 0;
+
+	vq->avail_idx = avail_idx;
+
+	/* We updated vq->avail_idx so we need a memory barrier between
+	 * the index read above and the caller reading avail ring entries.
+	 */
+	smp_rmb();
+	return 1;
 }
 
 static inline int vhost_get_avail_head(struct vhost_virtqueue *vq,
@@ -2498,38 +2526,21 @@ int vhost_get_vq_desc(struct vhost_virtqueue *vq,
 {
 	struct vring_desc desc;
 	unsigned int i, head, found = 0;
-	u16 last_avail_idx;
-	__virtio16 avail_idx;
+	u16 last_avail_idx = vq->last_avail_idx;
 	__virtio16 ring_head;
 	int ret, access;
 
-	/* Check it isn't doing very strange things with descriptor numbers. */
-	last_avail_idx = vq->last_avail_idx;
-
 	if (vq->avail_idx == vq->last_avail_idx) {
-		if (unlikely(vhost_get_avail_idx(vq, &avail_idx))) {
-			vq_err(vq, "Failed to access avail idx at %p\n",
-			       &vq->avail->idx);
-			return -EFAULT;
-		}
-		vq->avail_idx = vhost16_to_cpu(vq, avail_idx);
-
-		if (unlikely((u16)(vq->avail_idx - last_avail_idx) > vq->num)) {
-			vq_err(vq, "Guest moved used index from %u to %u",
-			       last_avail_idx, vq->avail_idx);
-			return -EFAULT;
-		}
+		ret = vhost_get_avail_idx(vq);
+		if (unlikely(ret < 0))
+			return ret;
 
 		/* If there's nothing new since last we looked, return
 		 * invalid.
 		 */
-		if (vq->avail_idx == last_avail_idx)
+		if (!ret)
 			return vq->num;
-
-		/* Only get avail ring entries after they have been
-		 * exposed by guest.
-		 */
-		smp_rmb();
 	}
 
 	/* Grab the next descriptor number they're advertising, and increment
@@ -2790,25 +2801,21 @@ EXPORT_SYMBOL_GPL(vhost_add_used_and_signal_n);
 
 /* return true if we're sure that avaiable ring is empty */
 bool
Re: [PATCH v4 4/4] remoteproc: stm32: Add support of an OP-TEE TA to load the firmware
On Tue, Mar 26, 2024 at 08:31:33PM +0100, Arnaud POULIQUEN wrote: > > > On 3/25/24 17:51, Mathieu Poirier wrote: > > On Fri, Mar 08, 2024 at 03:47:08PM +0100, Arnaud Pouliquen wrote: > >> The new TEE remoteproc device is used to manage remote firmware in a > >> secure, trusted context. The 'st,stm32mp1-m4-tee' compatibility is > >> introduced to delegate the loading of the firmware to the trusted > >> execution context. In such cases, the firmware should be signed and > >> adhere to the image format defined by the TEE. > >> > >> Signed-off-by: Arnaud Pouliquen > >> --- > >> Updates from V3: > >> - remove support of the attach use case. Will be addressed in a separate > >> thread, > >> - add st_rproc_tee_ops::parse_fw ops, > >> - inverse call of devm_rproc_alloc()and tee_rproc_register() to manage > >> cross > >> reference between the rproc struct and the tee_rproc struct in > >> tee_rproc.c. > >> --- > >> drivers/remoteproc/stm32_rproc.c | 60 +--- > >> 1 file changed, 56 insertions(+), 4 deletions(-) > >> > >> diff --git a/drivers/remoteproc/stm32_rproc.c > >> b/drivers/remoteproc/stm32_rproc.c > >> index 8cd838df4e92..13df33c78aa2 100644 > >> --- a/drivers/remoteproc/stm32_rproc.c > >> +++ b/drivers/remoteproc/stm32_rproc.c > >> @@ -20,6 +20,7 @@ > >> #include > >> #include > >> #include > >> +#include > >> #include > >> > >> #include "remoteproc_internal.h" > >> @@ -49,6 +50,9 @@ > >> #define M4_STATE_STANDBY 4 > >> #define M4_STATE_CRASH5 > >> > >> +/* Remote processor unique identifier aligned with the Trusted Execution > >> Environment definitions */ > > > > Why is this the case? At least from the kernel side it is possible to call > > tee_rproc_register() with any kind of value, why is there a need to be any > > kind of alignment with the TEE? > > > The use of the proc_id is to identify a processor in case of multi > co-processors. > That is well understood. > For instance we can have a system with A DSP and a modem. 
> We would use the
> same
> TEE service, but

That too.

> the TEE driver will probably be different, same for the signature key.

What TEE driver are we talking about here?

> In such case the proc ID allows to identify the processor you want to
> address.

That too is well understood, but there is no alignment needed with the TEE, i.e. the TEE application is not expecting a value of '0'. We could set STM32_MP1_M4_PROC_ID to 0xDEADBEEF and things would work. This driver won't go anywhere for as long as it is not the case.

> >
> >> +#define STM32_MP1_M4_PROC_ID	0
> >> +
> >>  struct stm32_syscon {
> >>  	struct regmap *map;
> >>  	u32 reg;
> >> @@ -257,6 +261,19 @@ static int stm32_rproc_release(struct rproc *rproc)
> >>  	return 0;
> >>  }
> >>
> >> +static int stm32_rproc_tee_stop(struct rproc *rproc)
> >> +{
> >> +	int err;
> >> +
> >> +	stm32_rproc_request_shutdown(rproc);
> >> +
> >> +	err = tee_rproc_stop(rproc);
> >> +	if (err)
> >> +		return err;
> >> +
> >> +	return stm32_rproc_release(rproc);
> >> +}
> >> +
> >>  static int stm32_rproc_prepare(struct rproc *rproc)
> >>  {
> >>  	struct device *dev = rproc->dev.parent;
> >> @@ -693,8 +710,19 @@ static const struct rproc_ops st_rproc_ops = {
> >>  	.get_boot_addr = rproc_elf_get_boot_addr,
> >>  };
> >>
> >> +static const struct rproc_ops st_rproc_tee_ops = {
> >> +	.prepare = stm32_rproc_prepare,
> >> +	.start = tee_rproc_start,
> >> +	.stop = stm32_rproc_tee_stop,
> >> +	.kick = stm32_rproc_kick,
> >> +	.load = tee_rproc_load_fw,
> >> +	.parse_fw = tee_rproc_parse_fw,
> >> +	.find_loaded_rsc_table = tee_rproc_find_loaded_rsc_table,
> >> +};
> >> +
> >>  static const struct of_device_id stm32_rproc_match[] = {
> >> -	{ .compatible = "st,stm32mp1-m4" },
> >> +	{.compatible = "st,stm32mp1-m4",},
> >> +	{.compatible = "st,stm32mp1-m4-tee",},
> >>  	{},
> >>  };
> >>  MODULE_DEVICE_TABLE(of, stm32_rproc_match);
> >> @@ -853,6 +881,7 @@ static int stm32_rproc_probe(struct platform_device *pdev)
> >>  	struct device *dev = &pdev->dev;
> >>  	struct stm32_rproc *ddata;
> >>  	struct device_node *np = dev->of_node;
> >> +	struct tee_rproc *trproc = NULL;
> >>  	struct rproc *rproc;
> >>  	unsigned int state;
> >>  	int ret;
> >> @@ -861,9 +890,26 @@ static int stm32_rproc_probe(struct platform_device *pdev)
> >>  	if (ret)
> >>  		return ret;
> >>
> >> -	rproc = devm_rproc_alloc(dev, np->name, &st_rproc_ops, NULL, sizeof(*ddata));
> >> -	if (!rproc)
> >> -		return -ENOMEM;
> >> +	if (of_device_is_compatible(np, "st,stm32mp1-m4-tee")) {
> >> +		/*
> >> +		 * Delegate the firmware management to the secure context.
> >> +		 * The firmware loaded has to be signed.
> >> +		 */
> >> +		rproc = devm_rproc_alloc(dev, np->name, &st_rproc_tee_ops, NULL, sizeof(*ddata));
> >> +		if
Re: [PATCH v4 1/4] remoteproc: Add TEE support
On Tue, Mar 26, 2024 at 08:18:23PM +0100, Arnaud POULIQUEN wrote: > Hello Mathieu, > > On 3/25/24 17:46, Mathieu Poirier wrote: > > On Fri, Mar 08, 2024 at 03:47:05PM +0100, Arnaud Pouliquen wrote: > >> Add a remoteproc TEE (Trusted Execution Environment) driver > >> that will be probed by the TEE bus. If the associated Trusted > >> application is supported on secure part this device offers a client > > > > Device or driver? I thought I touched on that before. > > Right, I changed the first instance and missed this one > > > > >> interface to load a firmware in the secure part. > >> This firmware could be authenticated by the secure trusted application. > >> > >> Signed-off-by: Arnaud Pouliquen > >> --- > >> Updates from V3: > >> - rework TEE_REMOTEPROC description in Kconfig > >> - fix some namings > >> - add tee_rproc_parse_fw to support rproc_ops::parse_fw > >> - add proc::tee_interface; > >> - add rproc struct as parameter of the tee_rproc_register() function > >> --- > >> drivers/remoteproc/Kconfig | 10 + > >> drivers/remoteproc/Makefile | 1 + > >> drivers/remoteproc/tee_remoteproc.c | 434 > >> include/linux/remoteproc.h | 4 + > >> include/linux/tee_remoteproc.h | 112 +++ > >> 5 files changed, 561 insertions(+) > >> create mode 100644 drivers/remoteproc/tee_remoteproc.c > >> create mode 100644 include/linux/tee_remoteproc.h > >> > >> diff --git a/drivers/remoteproc/Kconfig b/drivers/remoteproc/Kconfig > >> index 48845dc8fa85..2cf1431b2b59 100644 > >> --- a/drivers/remoteproc/Kconfig > >> +++ b/drivers/remoteproc/Kconfig > >> @@ -365,6 +365,16 @@ config XLNX_R5_REMOTEPROC > >> > >> It's safe to say N if not interested in using RPU r5f cores. > >> > >> + > >> +config TEE_REMOTEPROC > >> + tristate "remoteproc support by a TEE application" > > > > s/remoteproc/Remoteproc > > > >> + depends on OPTEE > >> + help > >> +Support a remote processor with a TEE application. 
The Trusted > >> +Execution Context is responsible for loading the trusted firmware > >> +image and managing the remote processor's lifecycle. > >> +This can be either built-in or a loadable module. > >> + > >> endif # REMOTEPROC > >> > >> endmenu > >> diff --git a/drivers/remoteproc/Makefile b/drivers/remoteproc/Makefile > >> index 91314a9b43ce..fa8daebce277 100644 > >> --- a/drivers/remoteproc/Makefile > >> +++ b/drivers/remoteproc/Makefile > >> @@ -36,6 +36,7 @@ obj-$(CONFIG_RCAR_REMOTEPROC)+= rcar_rproc.o > >> obj-$(CONFIG_ST_REMOTEPROC) += st_remoteproc.o > >> obj-$(CONFIG_ST_SLIM_REMOTEPROC) += st_slim_rproc.o > >> obj-$(CONFIG_STM32_RPROC) += stm32_rproc.o > >> +obj-$(CONFIG_TEE_REMOTEPROC) += tee_remoteproc.o > >> obj-$(CONFIG_TI_K3_DSP_REMOTEPROC)+= ti_k3_dsp_remoteproc.o > >> obj-$(CONFIG_TI_K3_R5_REMOTEPROC) += ti_k3_r5_remoteproc.o > >> obj-$(CONFIG_XLNX_R5_REMOTEPROC) += xlnx_r5_remoteproc.o > >> diff --git a/drivers/remoteproc/tee_remoteproc.c > >> b/drivers/remoteproc/tee_remoteproc.c > >> new file mode 100644 > >> index ..c855210e52e3 > >> --- /dev/null > >> +++ b/drivers/remoteproc/tee_remoteproc.c > >> @@ -0,0 +1,434 @@ > >> +// SPDX-License-Identifier: GPL-2.0-or-later > >> +/* > >> + * Copyright (C) STMicroelectronics 2024 - All Rights Reserved > >> + * Author: Arnaud Pouliquen > >> + */ > >> + > >> +#include > >> +#include > >> +#include > >> +#include > >> +#include > >> +#include > >> +#include > >> + > >> +#include "remoteproc_internal.h" > >> + > >> +#define MAX_TEE_PARAM_ARRY_MEMBER 4 > >> + > >> +/* > >> + * Authentication of the firmware and load in the remote processor memory > >> + * > >> + * [in] params[0].value.a: unique 32bit identifier of the remote > >> processor > >> + * [in]params[1].memref: buffer containing the image of the > >> buffer > >> + */ > >> +#define TA_RPROC_FW_CMD_LOAD_FW 1 > >> + > >> +/* > >> + * Start the remote processor > >> + * > >> + * [in] params[0].value.a: unique 32bit identifier of the remote > >> 
processor > >> + */ > >> +#define TA_RPROC_FW_CMD_START_FW 2 > >> + > >> +/* > >> + * Stop the remote processor > >> + * > >> + * [in] params[0].value.a: unique 32bit identifier of the remote > >> processor > >> + */ > >> +#define TA_RPROC_FW_CMD_STOP_FW 3 > >> + > >> +/* > >> + * Return the address of the resource table, or 0 if not found > >> + * No check is done to verify that the address returned is accessible by > >> + * the non secure context. If the resource table is loaded in a protected > >> + * memory the access by the non secure context will lead to a data abort. > >> + * > >> + * [in] params[0].value.a: unique 32bit identifier of the remote > >> processor > >> + * [out] params[1].value.a: 32bit LSB resource table memory address > >> + * [out] params[1].value.b: 32bit MSB resource table memory address >
Re: [PATCH] uprobes: reduce contention on uprobes_tree access
> > Masami, > > > > Given the discussion around per-cpu rw semaphore and need for > > (internal) batched attachment API for uprobes, do you think you can > > apply this patch as is for now? We can then gain initial improvements > > in scalability that are also easy to backport, and Jonathan will work > > on a more complete solution based on per-cpu RW semaphore, as > > suggested by Ingo. > > Yeah, it is interesting to use per-cpu rw semaphore on uprobe. > I would like to wait for the next version. My initial tests show a nice improvement on the over RW spinlocks but significant regression in acquiring a write lock. I've got a few days vacation over Easter but I'll aim to get some more formalised results out to the thread toward the end of next week. Jon. > > Thank you, > > > > > > > > > BTW, how did you measure the overhead? I think spinlock overhead > > > will depend on how much lock contention happens. > > > > > > Thank you, > > > > > > > > > > > [0] https://docs.kernel.org/locking/spinlocks.html > > > > > > > > Signed-off-by: Jonathan Haslam > > > > --- > > > > kernel/events/uprobes.c | 22 +++--- > > > > 1 file changed, 11 insertions(+), 11 deletions(-) > > > > > > > > diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c > > > > index 929e98c62965..42bf9b6e8bc0 100644 > > > > --- a/kernel/events/uprobes.c > > > > +++ b/kernel/events/uprobes.c > > > > @@ -39,7 +39,7 @@ static struct rb_root uprobes_tree = RB_ROOT; > > > > */ > > > > #define no_uprobe_events() RB_EMPTY_ROOT(_tree) > > > > > > > > -static DEFINE_SPINLOCK(uprobes_treelock);/* serialize rbtree > > > > access */ > > > > +static DEFINE_RWLOCK(uprobes_treelock); /* serialize rbtree > > > > access */ > > > > > > > > #define UPROBES_HASH_SZ 13 > > > > /* serialize uprobe->pending_list */ > > > > @@ -669,9 +669,9 @@ static struct uprobe *find_uprobe(struct inode > > > > *inode, loff_t offset) > > > > { > > > > struct uprobe *uprobe; > > > > > > > > - spin_lock(_treelock); > > > > + 
read_lock(_treelock); > > > > uprobe = __find_uprobe(inode, offset); > > > > - spin_unlock(_treelock); > > > > + read_unlock(_treelock); > > > > > > > > return uprobe; > > > > } > > > > @@ -701,9 +701,9 @@ static struct uprobe *insert_uprobe(struct uprobe > > > > *uprobe) > > > > { > > > > struct uprobe *u; > > > > > > > > - spin_lock(_treelock); > > > > + write_lock(_treelock); > > > > u = __insert_uprobe(uprobe); > > > > - spin_unlock(_treelock); > > > > + write_unlock(_treelock); > > > > > > > > return u; > > > > } > > > > @@ -935,9 +935,9 @@ static void delete_uprobe(struct uprobe *uprobe) > > > > if (WARN_ON(!uprobe_is_active(uprobe))) > > > > return; > > > > > > > > - spin_lock(_treelock); > > > > + write_lock(_treelock); > > > > rb_erase(>rb_node, _tree); > > > > - spin_unlock(_treelock); > > > > + write_unlock(_treelock); > > > > RB_CLEAR_NODE(>rb_node); /* for uprobe_is_active() */ > > > > put_uprobe(uprobe); > > > > } > > > > @@ -1298,7 +1298,7 @@ static void build_probe_list(struct inode *inode, > > > > min = vaddr_to_offset(vma, start); > > > > max = min + (end - start) - 1; > > > > > > > > - spin_lock(_treelock); > > > > + read_lock(_treelock); > > > > n = find_node_in_range(inode, min, max); > > > > if (n) { > > > > for (t = n; t; t = rb_prev(t)) { > > > > @@ -1316,7 +1316,7 @@ static void build_probe_list(struct inode *inode, > > > > get_uprobe(u); > > > > } > > > > } > > > > - spin_unlock(_treelock); > > > > + read_unlock(_treelock); > > > > } > > > > > > > > /* @vma contains reference counter, not the probed instruction. 
*/ > > > > @@ -1407,9 +1407,9 @@ vma_has_uprobes(struct vm_area_struct *vma, > > > > unsigned long start, unsigned long e > > > > min = vaddr_to_offset(vma, start); > > > > max = min + (end - start) - 1; > > > > > > > > - spin_lock(_treelock); > > > > + read_lock(_treelock); > > > > n = find_node_in_range(inode, min, max); > > > > - spin_unlock(_treelock); > > > > + read_unlock(_treelock); > > > > > > > > return !!n; > > > > } > > > > -- > > > > 2.43.0 > > > > > > > > > > > > > -- > > > Masami Hiramatsu (Google) > > > -- > Masami Hiramatsu (Google)
Re: [PATCH v9 15/15] selftests/sgx: Add scripts for EPC cgroup testing
On Wed Mar 27, 2024 at 2:55 PM EET, Jarkko Sakkinen wrote: > On Mon, 2024-02-05 at 13:06 -0800, Haitao Huang wrote: > > The scripts rely on cgroup-tools package from libcgroup [1]. > > > > To run selftests for epc cgroup: > > > > sudo ./run_epc_cg_selftests.sh > > > > To watch misc cgroup 'current' changes during testing, run this in a > > separate terminal: > > > > ./watch_misc_for_tests.sh current > > > > With different cgroups, the script starts one or multiple concurrent > > SGX > > selftests, each to run one unclobbered_vdso_oversubscribed test. > > Each > > of such test tries to load an enclave of EPC size equal to the EPC > > capacity available on the platform. The script checks results against > > the expectation set for each cgroup and reports success or failure. > > > > The script creates 3 different cgroups at the beginning with > > following > > expectations: > > > > 1) SMALL - intentionally small enough to fail the test loading an > > enclave of size equal to the capacity. > > 2) LARGE - large enough to run up to 4 concurrent tests but fail some > > if > > more than 4 concurrent tests are run. The script starts 4 expecting > > at > > least one test to pass, and then starts 5 expecting at least one test > > to fail. > > 3) LARGER - limit is the same as the capacity, large enough to run > > lots of > > concurrent tests. The script starts 8 of them and expects all pass. > > Then it reruns the same test with one process randomly killed and > > usage checked to be zero after all process exit. > > > > The script also includes a test with low mem_cg limit and LARGE > > sgx_epc > > limit to verify that the RAM used for per-cgroup reclamation is > > charged > > to a proper mem_cg. > > > > [1] https://github.com/libcgroup/libcgroup/blob/main/README > > > > Signed-off-by: Haitao Huang > > --- > > V7: > > - Added memcontrol test. > > > > V5: > > - Added script with automatic results checking, remove the > > interactive > > script. 
> > - The script can run independent from the series below. > > --- > > .../selftests/sgx/run_epc_cg_selftests.sh | 246 > > ++ > > .../selftests/sgx/watch_misc_for_tests.sh | 13 + > > 2 files changed, 259 insertions(+) > > create mode 100755 > > tools/testing/selftests/sgx/run_epc_cg_selftests.sh > > create mode 100755 > > tools/testing/selftests/sgx/watch_misc_for_tests.sh > > > > diff --git a/tools/testing/selftests/sgx/run_epc_cg_selftests.sh > > b/tools/testing/selftests/sgx/run_epc_cg_selftests.sh > > new file mode 100755 > > index ..e027bf39f005 > > --- /dev/null > > +++ b/tools/testing/selftests/sgx/run_epc_cg_selftests.sh > > @@ -0,0 +1,246 @@ > > +#!/bin/bash > This is not portable and neither does hold in the wild. > > It does not even often hold as it is not uncommon to place bash > to the path /usr/bin/bash. If I recall correctly, e.g. NixOS has > a path that is neither of those two. > > Should be #!/usr/bin/env bash > > That is POSIX compatible form. > > Just got around trying to test this in NUC7 so looking into this in > more detail. > > That said can you make the script work with just "#!/usr/bin/env sh" > and make sure that it is busybox ash compatible? > > I don't see any necessity to make this bash only and it adds to the > compilation time of the image. Otherwise lot of this could be tested > just with qemu+bzImage+busybox(inside initramfs). > > Now you are adding fully glibc shenanigans for the sake of syntax > sugar. > > > +# SPDX-License-Identifier: GPL-2.0 > > +# Copyright(c) 2023 Intel Corporation. > > + > > +TEST_ROOT_CG=selftest > > +cgcreate -g misc:$TEST_ROOT_CG > How do you know that cgcreate exists? It is used a lot in the script > with no check for the existence. Please fix e.g. with "command -v > cgcreate". > > > +if [ $? -ne 0 ]; then > > + echo "# Please make sure cgroup-tools is installed, and misc > > cgroup is mounted." > > + exit 1 > > +fi > > And please do not do it this way.
Also, please remove the advice for > "cgroups-tool". This is not meant to be debian only. Better would be > to e.g. point out the URL of the upstream project. > > And yeah the whole message should be based on "command -v", not like > this. > > > +TEST_CG_SUB1=$TEST_ROOT_CG/test1 > > +TEST_CG_SUB2=$TEST_ROOT_CG/test2 > > +# We will only set limit in test1 and run tests in test3 > > +TEST_CG_SUB3=$TEST_ROOT_CG/test1/test3 > > +TEST_CG_SUB4=$TEST_ROOT_CG/test4 > > + > > +cgcreate -g misc:$TEST_CG_SUB1 > > > > > +cgcreate -g misc:$TEST_CG_SUB2 > > +cgcreate -g misc:$TEST_CG_SUB3 > > +cgcreate -g misc:$TEST_CG_SUB4 > > + > > +# Default to V2 > > +CG_MISC_ROOT=/sys/fs/cgroup > > +CG_MEM_ROOT=/sys/fs/cgroup > > +CG_V1=0 > > +if [ ! -d "/sys/fs/cgroup/misc" ]; then > > + echo "# cgroup V2 is in use." > > +else > > + echo "# cgroup V1 is in use." > > Is "#" prefix a standard for kselftest? I don't know this, thus asking. > > > + CG_MISC_ROOT=/sys/fs/cgroup/misc > > + CG_MEM_ROOT=/sys/fs/cgroup/memory
Re: [PATCH v7 2/2] arch/riscv: Enable kprobes when CONFIG_MODULES=n
On Tue Mar 26, 2024 at 8:42 PM EET, Conor Dooley wrote:
> On Tue, Mar 26, 2024 at 03:46:16PM +0200, Jarkko Sakkinen wrote:
> > Tracing with kprobes while running a monolithic kernel is currently
> > impossible due to the kernel module allocator dependency.
> >
> > Address the issue by implementing textmem API for RISC-V.
>
> This doesn't compile for nommu:
> /build/tmp.3xucsBhqDV/arch/riscv/kernel/execmem.c:10:46: error: 'MODULES_VADDR' undeclared (first use in this function)
> /build/tmp.3xucsBhqDV/arch/riscv/kernel/execmem.c:11:37: error: 'MODULES_END' undeclared (first use in this function)
> /build/tmp.3xucsBhqDV/arch/riscv/kernel/execmem.c:14:1: error: control reaches end of non-void function [-Werror=return-type]
>
> Clang builds also report:
> ../arch/riscv/kernel/execmem.c:8:56: warning: omitting the parameter name in a function definition is a C2x extension [-Wc2x-extensions]
>
> > Link: https://www.sochub.fi # for power on testing new SoC's with a minimal stack
> > Link: https://lore.kernel.org/all/2022060814.3054333-1-jar...@profian.com/ # continuation
> > Signed-off-by: Jarkko Sakkinen
> > ---
> > v5-v7:
> > - No changes.
> > v4:
> > - Include linux/execmem.h.
> > v3:
> > - Architecture independent parts have been split to separate patches.
> > - Do not change arch/riscv/kernel/module.c as it is out of scope for
> >   this patch set now.
>
> Meta comment. I dunno when v1 was sent, but can you please relax with
> submitting new versions of your patches? There's conversations ongoing
> on v5 at the moment, while this is a more recent version. v2 seems to
> have been sent on the 23rd and there's been 5 versions in the last day:
> https://patchwork.kernel.org/project/linux-riscv/list/?submitter=195059
>
> Could you please also try and use a cover letter for patchsets, ideally
> with a consistent subject? Otherwise I have to manually mark stuff as
> superseded.
Point taken but the work has been taken over by Mark and relevant changes are now sucked into that patch set. > Thanks, > Conor. BR, Jarkko
Re: [PATCH] trace/sched: add tgid for sched_wakeup_template
On Wed, 27 Mar 2024 16:50:57 +0800 Tio Zhang wrote:
> By doing this, we are able to filter tasks by tgid while we are
> tracing wakeup events by ebpf or other methods.
>
> For example, when we care about tracing a user space process (which has
> an uncertain number of LWPs, i.e., pids) to monitor its wakeup latency,
> without tgid available in sched_wakeup tracepoints, we would struggle
> finding out all pids to trace, or we could use kprobe to achieve tgid
> tracing, which is less accurate and much less efficient than using
> tracepoint.

This is a very common trace event, and I really do not want to add more data than necessary to it, as it increases the size of the event, which means fewer events can be recorded on a fixed size trace ring buffer.

Note, you are not modifying the "tracepoint", but you are actually modifying a "trace event". "tracepoint" is the hook in the kernel code:

	trace_sched_wakeup()

"trace event" is defined by the TRACE_EVENT() macro (and friends) that defines what is exposed in the tracefs file system.

I thought ebpf could hook directly to the tracepoint which is:

	trace_sched_wakeup(p);

I believe you can have direct access to the 'p' before it is processed from ebpf.

There's also "trace probes" (I think we are lacking documentation on this, as well as event probes :-p):

 $ gdb vmlinux
 (gdb) p &((struct task_struct *)0)->tgid
 $1 = (pid_t *) 0x56c
 (gdb) p &((struct task_struct *)0)->pid
 $2 = (pid_t *) 0x568

 # echo 't:wakeup sched_waking pid=+0x568($arg1):u32 tgid=+0x56c($arg1):u32' > /sys/kernel/tracing/dynamic_events
 # trace-cmd start -e wakeup
 # trace-cmd show
       trace-cmd-7307    [003] d..6. 599486.485762: wakeup: (__probestub_sched_waking+0x4/0x10) pid=845 tgid=845
            bash-845     [001] d.s4. 599486.486136: wakeup: (__probestub_sched_waking+0x4/0x10) pid=17 tgid=17
            bash-845     [001] d..4. 599486.486336: wakeup: (__probestub_sched_waking+0x4/0x10) pid=5516 tgid=5516
   kworker/u18:2-5516    [001] d..4. 599486.486445: wakeup: (__probestub_sched_waking+0x4/0x10) pid=818 tgid=818
          <idle>-0       [001] d.s4. 599486.491206: wakeup: (__probestub_sched_waking+0x4/0x10) pid=17 tgid=17
          <idle>-0       [001] d.s5. 599486.493218: wakeup: (__probestub_sched_waking+0x4/0x10) pid=17 tgid=17
          <idle>-0       [001] d.s4. 599486.497200: wakeup: (__probestub_sched_waking+0x4/0x10) pid=17 tgid=17
          <idle>-0       [003] d.s4. 599486.829209: wakeup: (__probestub_sched_waking+0x4/0x10) pid=70 tgid=70

The above attaches to the tracepoint and $arg1 is the 'struct task_struct *p'.

-- Steve

>
> Signed-off-by: Tio Zhang
> Signed-off-by: Dylane Chen
> ---
>  include/trace/events/sched.h | 6 ++++--
>  1 file changed, 4 insertions(+), 2 deletions(-)
>
> diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
> index dbb01b4b7451..ea7e525649e5 100644
> --- a/include/trace/events/sched.h
> +++ b/include/trace/events/sched.h
> @@ -149,6 +149,7 @@ DECLARE_EVENT_CLASS(sched_wakeup_template,
>  		__field(pid_t,	pid		)
>  		__field(int,	prio		)
>  		__field(int,	target_cpu	)
> +		__field(pid_t,	tgid		)
>  	),
>
>  	TP_fast_assign(
> @@ -156,11 +157,12 @@ DECLARE_EVENT_CLASS(sched_wakeup_template,
>  		__entry->pid		= p->pid;
>  		__entry->prio		= p->prio; /* XXX SCHED_DEADLINE */
>  		__entry->target_cpu	= task_cpu(p);
> +		__entry->tgid		= p->tgid;
>  	),
>
> -	TP_printk("comm=%s pid=%d prio=%d target_cpu=%03d",
> +	TP_printk("comm=%s pid=%d prio=%d target_cpu=%03d tgid=%d",
>  		  __entry->comm, __entry->pid, __entry->prio,
> -		  __entry->target_cpu)
> +		  __entry->target_cpu, __entry->tgid)
>  );
>
> /*
[PATCH] module: create weak dependencies
It has been seen that for some network mac drivers (i.e. lan78xx) the related module for the phy is loaded dynamically depending on the current hardware. In this case, the associated phy is read using the mdio bus and then the associated phy module is loaded during runtime (kernel function phy_request_driver_module). However, no software dependency is defined, so the user tools will not be able to get this dependency. For example, if dracut is used and the hardware is present, lan78xx will be included but no phy module will be added, and on the next restart the device will not work from boot because no related phy will be found during the initramfs stage.

In order to solve this, we could define a normal 'pre' software dependency in the lan78xx module with all the possible phy modules (there may be several), but proceeding that way, all the possible phy modules would be loaded while only one is necessary.

The idea is to create a new type of dependency, which we are going to call 'weak', to be used only by the user tools that need to detect this situation. In that way, for example, dracut could check the 'weak' dependency of the modules involved in order to install these dependencies in initramfs too. That is, for the commented lan78xx module, defining the 'weak' dependency with the possible phy modules list, only the necessary phy would be loaded on demand keeping the same behavior, but all the possible phy modules would be available from initramfs.

Signed-off-by: Jose Ignacio Tornos Martinez
---
 include/linux/module.h | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/include/linux/module.h b/include/linux/module.h
index 1153b0d99a80..231e710d8736 100644
--- a/include/linux/module.h
+++ b/include/linux/module.h
@@ -173,6 +173,11 @@ extern void cleanup_module(void);
  */
 #define MODULE_SOFTDEP(_softdep) MODULE_INFO(softdep, _softdep)
 
+/* Weak module dependencies. See man modprobe.d for details.
+ * Example: MODULE_WEAKDEP("module-foo")
+ */
+#define MODULE_WEAKDEP(_weakdep) MODULE_INFO(weakdep, _weakdep)
+
 /*
  * MODULE_FILE is used for generating modules.builtin
  * So, make it no-op when this is being built as a module
-- 
2.44.0
Re: [PATCH net v2 1/2] virtio_net: Do not set rss_indir if RSS is not supported
Hello Xuan, On Wed, Mar 27, 2024 at 09:37:43AM +0800, Xuan Zhuo wrote: > On Tue, 26 Mar 2024 08:19:08 -0700, Breno Leitao wrote: > > Do not set virtnet_info->rss_indir_table_size if RSS is not available > > for the device. > > > > Currently, rss_indir_table_size is set if either has_rss or > > has_rss_hash_report is available, but, it should only be set if has_rss > > is set. > > > > On the virtnet_set_rxfh(), return an invalid command if the request has > > indirection table set, but virtnet does not support RSS. > > > > Suggested-by: Heng Qi > > Signed-off-by: Breno Leitao > > --- > > drivers/net/virtio_net.c | 9 +++-- > > 1 file changed, 7 insertions(+), 2 deletions(-) > > > > diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c > > index c22d1118a133..c640fdf28fc5 100644 > > --- a/drivers/net/virtio_net.c > > +++ b/drivers/net/virtio_net.c > > @@ -3813,6 +3813,9 @@ static int virtnet_set_rxfh(struct net_device *dev, > > rxfh->hfunc != ETH_RSS_HASH_TOP) > > return -EOPNOTSUPP; > > > > + if (rxfh->indir && !vi->has_rss) > > + return -EINVAL; > > + > > if (rxfh->indir) { > > Put !vi->has_rss here? I am not sure I understand the suggestion. Where do you suggest we have !vi->has_rss? If we got this far, we either have: a) rxfh->indir set and vi->has_rss is also set b) rxfh->indir not set. (vi->has_rss could be set or not). Thanks for the review, Breno
Re: [PATCH net v2 2/2] virtio_net: Do not send RSS key if it is not supported
On Wed, Mar 27, 2024 at 10:27:58AM +0800, Heng Qi wrote:
>
> On 2024/3/26 at 11:19 PM, Breno Leitao wrote:
> > There is a bug when setting the RSS options in virtio_net that can break
> > the whole machine, getting the kernel into an infinite loop.
> >
> > Running the following command in any QEMU virtual machine with virtio_net
> > will reproduce this problem:
> >
> >     # ethtool -X eth0 hfunc toeplitz
> >
> > This is how the problem happens:
> >
> > 1) ethtool_set_rxfh() calls virtnet_set_rxfh()
> >
> > 2) virtnet_set_rxfh() calls virtnet_commit_rss_command()
> >
> > 3) virtnet_commit_rss_command() populates 4 entries for the rss scatter-gather
> >
> > 4) Since the command above does not have a key, the last
> > scatter-gather entry will be zeroed, since rss_key_size == 0.
> > sg_buf_size = vi->rss_key_size;
> >
> > 5) This buffer is passed to qemu, but qemu is not happy with a buffer
> > with zero length, and does the following in virtqueue_map_desc() (QEMU
> > function):
> >
> >    if (!sz) {
> >        virtio_error(vdev, "virtio: zero sized buffers are not allowed");
> >
> > 6) virtio_error() (also a QEMU function) sets the device as broken
> >
> >     vdev->broken = true;
> >
> > 7) Qemu bails out and does not respond to this crazy kernel.
> >
> > 8) The kernel is waiting for the response to come back (function
> > virtnet_send_command())
> >
> > 9) The kernel is waiting doing the following:
> >
> >    while (!virtqueue_get_buf(vi->cvq, ) &&
> >           !virtqueue_is_broken(vi->cvq))
> >        cpu_relax();
> >
> > 10) Neither of the conditions above becomes true, thus, the kernel
> > loops here forever. Keeping in mind that virtqueue_is_broken() does
> > not look at the qemu `vdev->broken`, so, it never realizes that the
> > virtio is broken at QEMU side.
> >
> > Fix it by not sending RSS commands if the feature is not available in
> > the device.
> > > > Fixes: c7114b1249fa ("drivers/net/virtio_net: Added basic RSS support.") > > Cc: sta...@vger.kernel.org > > Cc: qemu-de...@nongnu.org > > Signed-off-by: Breno Leitao > > --- > > drivers/net/virtio_net.c | 3 +++ > > 1 file changed, 3 insertions(+) > > > > diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c > > index c640fdf28fc5..e6b0eaf08ac2 100644 > > --- a/drivers/net/virtio_net.c > > +++ b/drivers/net/virtio_net.c > > @@ -3809,6 +3809,9 @@ static int virtnet_set_rxfh(struct net_device *dev, > > struct virtnet_info *vi = netdev_priv(dev); > > int i; > > + if (!vi->has_rss && !vi->has_rss_hash_report) > > + return -EOPNOTSUPP; > > + > > Why not make the second patch as the first, this seems to work better. Sure, that works for me. Let me update it in v2.
Re: [PATCH v4 02/14] mm: Switch mm->get_unmapped_area() to a flag
On Wed, 2024-03-27 at 02:42 +, Edgecombe, Rick P wrote: > On Tue, 2024-03-26 at 13:57 +0200, Jarkko Sakkinen wrote: > > In which conditions which path is used during the initialization of > > mm > > and why is this the case? It is an open claim in the current form. > > There is an arch_pick_mmap_layout() that arch's can have their own > rules for. There is also a > generic one. It gets called during exec. > > > > > That would be nice to have documented for the sake of being > > complete > > description. I have zero doubts of the claim being untrue. > > ...being untrue? > I mean I believe the change itself makes sense, it is just not fully documented in the commit message. BR, Jarkko
Re: [PATCH v9 15/15] selftests/sgx: Add scripts for EPC cgroup testing
On Mon, 2024-02-05 at 13:06 -0800, Haitao Huang wrote: > The scripts rely on cgroup-tools package from libcgroup [1]. > > To run selftests for epc cgroup: > > sudo ./run_epc_cg_selftests.sh > > To watch misc cgroup 'current' changes during testing, run this in a > separate terminal: > > ./watch_misc_for_tests.sh current > > With different cgroups, the script starts one or multiple concurrent > SGX > selftests, each to run one unclobbered_vdso_oversubscribed test. > Each > of such test tries to load an enclave of EPC size equal to the EPC > capacity available on the platform. The script checks results against > the expectation set for each cgroup and reports success or failure. > > The script creates 3 different cgroups at the beginning with > following > expectations: > > 1) SMALL - intentionally small enough to fail the test loading an > enclave of size equal to the capacity. > 2) LARGE - large enough to run up to 4 concurrent tests but fail some > if > more than 4 concurrent tests are run. The script starts 4 expecting > at > least one test to pass, and then starts 5 expecting at least one test > to fail. > 3) LARGER - limit is the same as the capacity, large enough to run > lots of > concurrent tests. The script starts 8 of them and expects all pass. > Then it reruns the same test with one process randomly killed and > usage checked to be zero after all process exit. > > The script also includes a test with low mem_cg limit and LARGE > sgx_epc > limit to verify that the RAM used for per-cgroup reclamation is > charged > to a proper mem_cg. > > [1] https://github.com/libcgroup/libcgroup/blob/main/README > > Signed-off-by: Haitao Huang > --- > V7: > - Added memcontrol test. > > V5: > - Added script with automatic results checking, remove the > interactive > script. > - The script can run independent from the series below. 
> ---
>  .../selftests/sgx/run_epc_cg_selftests.sh | 246 ++
>  .../selftests/sgx/watch_misc_for_tests.sh |  13 +
>  2 files changed, 259 insertions(+)
>  create mode 100755 tools/testing/selftests/sgx/run_epc_cg_selftests.sh
>  create mode 100755 tools/testing/selftests/sgx/watch_misc_for_tests.sh
>
> diff --git a/tools/testing/selftests/sgx/run_epc_cg_selftests.sh b/tools/testing/selftests/sgx/run_epc_cg_selftests.sh
> new file mode 100755
> index ..e027bf39f005
> --- /dev/null
> +++ b/tools/testing/selftests/sgx/run_epc_cg_selftests.sh
> @@ -0,0 +1,246 @@
> +#!/bin/bash

This is not portable, and it does not even hold in the wild, as it is not uncommon to place bash at the path /usr/bin/bash. If I recall correctly, e.g. NixOS has a path that is neither of those two.

Should be:

	#!/usr/bin/env bash

That is the POSIX-compatible form. I just got around to trying to test this on a NUC7, so I'm looking into this in more detail.

That said, can you make the script work with just "#!/usr/bin/env sh" and make sure that it is busybox ash compatible? I don't see any necessity to make this bash-only, and it adds to the compilation time of the image. Otherwise a lot of this could be tested just with qemu+bzImage+busybox (inside initramfs). Now you are adding full glibc shenanigans for the sake of syntax sugar.

> +# SPDX-License-Identifier: GPL-2.0
> +# Copyright(c) 2023 Intel Corporation.
> +
> +TEST_ROOT_CG=selftest
> +cgcreate -g misc:$TEST_ROOT_CG

How do you know that cgcreate exists? It is used a lot in the script with no check for its existence. Please fix e.g. with "command -v cgcreate".

> +if [ $? -ne 0 ]; then
> +	echo "# Please make sure cgroup-tools is installed, and misc cgroup is mounted."
> +	exit 1
> +fi

And please do not do it this way. Also, please remove the advice for "cgroups-tool". This is not meant to be debian only. Better would be to e.g. point out the URL of the upstream project.
And yeah the whole message should be based on "command -v", not like this.

> +TEST_CG_SUB1=$TEST_ROOT_CG/test1
> +TEST_CG_SUB2=$TEST_ROOT_CG/test2
> +# We will only set limit in test1 and run tests in test3
> +TEST_CG_SUB3=$TEST_ROOT_CG/test1/test3
> +TEST_CG_SUB4=$TEST_ROOT_CG/test4
> +
> +cgcreate -g misc:$TEST_CG_SUB1
> +cgcreate -g misc:$TEST_CG_SUB2
> +cgcreate -g misc:$TEST_CG_SUB3
> +cgcreate -g misc:$TEST_CG_SUB4
> +
> +# Default to V2
> +CG_MISC_ROOT=/sys/fs/cgroup
> +CG_MEM_ROOT=/sys/fs/cgroup
> +CG_V1=0
> +if [ ! -d "/sys/fs/cgroup/misc" ]; then
> +	echo "# cgroup V2 is in use."
> +else
> +	echo "# cgroup V1 is in use."

Is the "#" prefix a standard for kselftest? I don't know this, thus asking.

> +	CG_MISC_ROOT=/sys/fs/cgroup/misc
> +	CG_MEM_ROOT=/sys/fs/cgroup/memory
> +	CG_V1=1

Have you checked what the indentation policy for bash scripts inside the kernel tree is? I don't know what it is; that's why I'm asking.

> +fi
> +
> +CAPACITY=$(grep "sgx_epc" "$CG_MISC_ROOT/misc.capacity" | awk '{print $2}')
> +# This is below number of VA pages needed for enclave of capacity size. So
> +should
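A minimal sketch of the probing being suggested, assuming only POSIX sh; the tool name and upstream URL are the ones mentioned in the thread:

```shell
#!/usr/bin/env sh
# Probe for a tool with "command -v" before using it, and point at the
# upstream project instead of a distro package name.
require_tool() {
	if ! command -v "$1" >/dev/null 2>&1; then
		echo "# $1 not found; install it from https://github.com/libcgroup/libcgroup" >&2
		return 1
	fi
	return 0
}

require_tool sh && echo "sh found"
require_tool definitely_missing_tool_xyz || echo "missing tool reported"
```

Because `command -v` is specified by POSIX, this works unchanged under bash, dash, and busybox ash, which addresses the portability concern as well.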
Re: [PATCH 2/2] remoteproc: mediatek: Don't parse extraneous subnodes for multi-core
On 21/03/24 16:27, Mathieu Poirier wrote:

On Thu, Mar 21, 2024 at 09:46:14AM +0100, AngeloGioacchino Del Regno wrote:

When probing multi-core SCP, this driver is parsing all sub-nodes of the scp-cluster node, but one of those could be not an actual SCP core, and that would make the entire SCP cluster fail probing for no good reason. To fix that, in scp_add_multi_core() treat a subnode as an SCP core by parsing only available subnodes having compatible "mediatek,scp-core".

Fixes: 1fdbf0cdde98 ("remoteproc: mediatek: Probe SCP cluster on multi-core SCP")
Signed-off-by: AngeloGioacchino Del Regno
---
 drivers/remoteproc/mtk_scp.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/remoteproc/mtk_scp.c b/drivers/remoteproc/mtk_scp.c
index 67518291a8ad..fbe1c232dae7 100644
--- a/drivers/remoteproc/mtk_scp.c
+++ b/drivers/remoteproc/mtk_scp.c
@@ -1096,6 +1096,9 @@ static int scp_add_multi_core(struct platform_device *pdev,
 	cluster_of_data = (const struct mtk_scp_of_data **)of_device_get_match_data(dev);
 
 	for_each_available_child_of_node(np, child) {
+		if (!of_device_is_compatible(child, "mediatek,scp-core"))
+			continue;
+

Interesting - what else gets stashed under the remote processor node? I don't see anything specified in the bindings.

Sorry for the late reply - well, in this precise moment in time, upstream, nothing yet.

I have noticed this while debugging some lockups and wanted to move the scp_adsp clock controller node as a child of the SCP node (as some of those clocks are located *into the SCP's CFG register space*, and it's correct for that to be a child as one of those does depend on the SCP being up - and I'll spare you the rest) and noticed the unexpected behavior, as the SCP driver was treating those as an SCP core. There was no kernel panic, but the SCP would fail probing.

This is anyway a missed requirement ...
for platforms that want *both* two SCP cores *and* the AudioDSP, as that'd at least be two nodes with the same iostart (scp@1072000, clock-controller@1072000), other than the reasons I explained some lines back. ...and that's why this commit was sent :-) P.S.: The reason why platforms don't crash without the scp_adsp clock controller as a child of SCP is that the bootloader is actually doing basic init for the SCP, hence the block is powered on when booting ;-) Cheers, Angelo Thanks, Mathieu if (!cluster_of_data[core_id]) { ret = -EINVAL; dev_err(dev, "Not support core %d\n", core_id); -- 2.44.0
Re: [PATCH 1/2] remoteproc: mediatek: Make sure IPI buffer fits in L2TCM
On 21/03/24 16:25, Mathieu Poirier wrote: Good day, On Thu, Mar 21, 2024 at 09:46:13AM +0100, AngeloGioacchino Del Regno wrote: The IPI buffer location is read from the firmware that we load to the System Companion Processor, and it's not granted that both the SRAM (L2TCM) size that is defined in the devicetree node is large enough for that, and while this is especially true for multi-core SCP, it's still useful to check on single-core variants as well. Failing to perform this check may make this driver perform R/W oeprations out of the L2TCM boundary, resulting (at best) in a s/oeprations/operations I will fix that when I apply the patch. Thanks for that. Cheers, Angelo
FAILED: Patch "virtio: reenable config if freezing device failed" failed to apply to 4.19-stable tree
The patch below does not apply to the 4.19-stable tree. If someone wants it applied there, or to any other stable or longterm tree, then please email the backport, including the original git commit id to .

Thanks,
Sasha

-- original commit in Linus's tree --

From 310227f42882c52356b523e2f4e11690eebcd2ab Mon Sep 17 00:00:00 2001
From: David Hildenbrand
Date: Tue, 13 Feb 2024 14:54:25 +0100
Subject: [PATCH] virtio: reenable config if freezing device failed

Currently, we don't reenable the config if freezing the device failed.

For example, virtio-mem currently doesn't support suspend+resume, and trying to freeze the device will always fail. Afterwards, the device will no longer respond to resize requests, because it won't get notified about config changes.

Let's fix this by re-enabling the config if freezing fails.

Fixes: 22b7050a024d ("virtio: defer config changed notifications")
Cc:
Cc: "Michael S. Tsirkin"
Cc: Jason Wang
Cc: Xuan Zhuo
Signed-off-by: David Hildenbrand
Message-Id: <20240213135425.795001-1-da...@redhat.com>
Signed-off-by: Michael S. Tsirkin
---
 drivers/virtio/virtio.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/virtio/virtio.c b/drivers/virtio/virtio.c
index f4080692b3513..f513ee21b1c18 100644
--- a/drivers/virtio/virtio.c
+++ b/drivers/virtio/virtio.c
@@ -510,8 +510,10 @@ int virtio_device_freeze(struct virtio_device *dev)
 
 	if (drv && drv->freeze) {
 		ret = drv->freeze(dev);
-		if (ret)
+		if (ret) {
+			virtio_config_enable(dev);
 			return ret;
+		}
 	}
 
 	if (dev->config->destroy_avq)
-- 
2.43.0
FAILED: Patch "ring-buffer: Do not set shortest_full when full target is hit" failed to apply to 5.4-stable tree
The patch below does not apply to the 5.4-stable tree. If someone wants it applied there, or to any other stable or longterm tree, then please email the backport, including the original git commit id to .

Thanks,
Sasha

-- original commit in Linus's tree --

From 761d9473e27f0c8782895013a3e7b52a37c8bcfc Mon Sep 17 00:00:00 2001
From: "Steven Rostedt (Google)"
Date: Tue, 12 Mar 2024 11:56:41 -0400
Subject: [PATCH] ring-buffer: Do not set shortest_full when full target is hit

The rb_watermark_hit() checks if the amount of data in the ring buffer is above the percentage level passed in by the "full" variable. If it is, it returns true.

But it also sets the "shortest_full" field of the cpu_buffer that informs writers that it needs to call the irq_work if the amount of data on the ring buffer is above the requested amount.

The rb_watermark_hit() always sets the shortest_full even if the amount in the ring buffer is what it wants. As it is not going to wait, because it has what it wants, there's no reason to set shortest_full.

Link: https://lore.kernel.org/linux-trace-kernel/20240312115641.6aa8b...@gandalf.local.home

Cc: sta...@vger.kernel.org
Cc: Mathieu Desnoyers
Fixes: 42fb0a1e84ff5 ("tracing/ring-buffer: Have polling block on watermark")
Reviewed-by: Masami Hiramatsu (Google)
Signed-off-by: Steven Rostedt (Google)
---
 kernel/trace/ring_buffer.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index aa332ace108b1..6ffbccb9bcf00 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -834,9 +834,10 @@ static bool rb_watermark_hit(struct trace_buffer *buffer, int cpu, int full)
 		pagebusy = cpu_buffer->reader_page == cpu_buffer->commit_page;
 		ret = !pagebusy && full_hit(buffer, cpu, full);
 
-		if (!cpu_buffer->shortest_full ||
-		    cpu_buffer->shortest_full > full)
-			cpu_buffer->shortest_full = full;
+		if (!ret && (!cpu_buffer->shortest_full ||
+			     cpu_buffer->shortest_full > full)) {
+			cpu_buffer->shortest_full = full;
+		}
 		raw_spin_unlock_irqrestore(&cpu_buffer->reader_lock, flags);
 	}
 	return ret;
-- 
2.43.0
FAILED: Patch "virtio: reenable config if freezing device failed" failed to apply to 5.4-stable tree
The patch below does not apply to the 5.4-stable tree. If someone wants it applied there, or to any other stable or longterm tree, then please email the backport, including the original git commit id to .

Thanks,
Sasha

-- original commit in Linus's tree --

From 310227f42882c52356b523e2f4e11690eebcd2ab Mon Sep 17 00:00:00 2001
From: David Hildenbrand
Date: Tue, 13 Feb 2024 14:54:25 +0100
Subject: [PATCH] virtio: reenable config if freezing device failed

Currently, we don't reenable the config if freezing the device failed.

For example, virtio-mem currently doesn't support suspend+resume, and trying to freeze the device will always fail. Afterwards, the device will no longer respond to resize requests, because it won't get notified about config changes.

Let's fix this by re-enabling the config if freezing fails.

Fixes: 22b7050a024d ("virtio: defer config changed notifications")
Cc:
Cc: "Michael S. Tsirkin"
Cc: Jason Wang
Cc: Xuan Zhuo
Signed-off-by: David Hildenbrand
Message-Id: <20240213135425.795001-1-da...@redhat.com>
Signed-off-by: Michael S. Tsirkin
---
 drivers/virtio/virtio.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/virtio/virtio.c b/drivers/virtio/virtio.c
index f4080692b3513..f513ee21b1c18 100644
--- a/drivers/virtio/virtio.c
+++ b/drivers/virtio/virtio.c
@@ -510,8 +510,10 @@ int virtio_device_freeze(struct virtio_device *dev)
 
 	if (drv && drv->freeze) {
 		ret = drv->freeze(dev);
-		if (ret)
+		if (ret) {
+			virtio_config_enable(dev);
 			return ret;
+		}
 	}
 
 	if (dev->config->destroy_avq)
-- 
2.43.0
FAILED: Patch "tracing/ring-buffer: Fix wait_on_pipe() race" failed to apply to 5.15-stable tree
The patch below does not apply to the 5.15-stable tree. If someone wants it applied there, or to any other stable or longterm tree, then please email the backport, including the original git commit id to .

Thanks,
Sasha

-- original commit in Linus's tree --

From 2aa043a55b9a764c9cbde5a8c654eeaaffe224cf Mon Sep 17 00:00:00 2001
From: "Steven Rostedt (Google)"
Date: Tue, 12 Mar 2024 08:15:08 -0400
Subject: [PATCH] tracing/ring-buffer: Fix wait_on_pipe() race

When the trace_pipe_raw file is closed, there should be no new readers on the file descriptor. This is mostly handled with the waking and wait_index fields of the iterator. But there's still a slight race.

	CPU 0				CPU 1
	-----				-----
					wait_index++;
	index = wait_index;
					ring_buffer_wake_waiters();
	wait_on_pipe()
		ring_buffer_wait();

The ring_buffer_wait() will miss the wakeup from CPU 1. The problem is that the ring_buffer_wait() needs the logic of:

	prepare_to_wait();
	if (!condition)
		schedule();

Where the missing condition check is the iter->wait_index update.

Have the ring_buffer_wait() take a conditional callback function and a data parameter that can be used within the wait_event_interruptible() of the ring_buffer_wait() function.

In wait_on_pipe(), pass a condition function that will check if the wait_index has been updated, if it has, it will return true to break out of the wait_event_interruptible() loop.

Create a new field "closed" in the trace_iterator and set it in the .flush() callback before calling ring_buffer_wake_waiters(). This will keep any new readers from waiting on a closed file descriptor.

Have the wait_on_pipe() condition callback also check the closed field.

Change the wait_index field of the trace_iterator to atomic_t. There's no reason it needs to be 'long' and making it atomic and using atomic_read_acquire() and atomic_fetch_inc_release() will provide the necessary memory barriers.

Add a "woken" flag to tracing_buffers_splice_read() to exit the loop after one more try to fetch data.
That is, if it waited for data and something woke it up, it should try to collect any new data and then exit back to user space.

Link: https://lore.kernel.org/linux-trace-kernel/CAHk-=wgsngewhfxzajiaqznwpmqetqmi1waes2o1v6l4c_u...@mail.gmail.com/
Link: https://lore.kernel.org/linux-trace-kernel/20240312121703.557950...@goodmis.org

Cc: sta...@vger.kernel.org
Cc: Masami Hiramatsu
Cc: Mark Rutland
Cc: Mathieu Desnoyers
Cc: Andrew Morton
Cc: Linus Torvalds
Cc: linke li
Cc: Rabin Vincent
Fixes: f3ddb74ad0790 ("tracing: Wake up ring buffer waiters on closing of the file")
Signed-off-by: Steven Rostedt (Google)
---
 include/linux/ring_buffer.h  |  3 ++-
 include/linux/trace_events.h |  5 -
 kernel/trace/ring_buffer.c   | 13 ++-
 kernel/trace/trace.c         | 43 ++--
 4 files changed, 45 insertions(+), 19 deletions(-)

diff --git a/include/linux/ring_buffer.h b/include/linux/ring_buffer.h
index 338a33db1577e..dc5ae4e96aee0 100644
--- a/include/linux/ring_buffer.h
+++ b/include/linux/ring_buffer.h
@@ -99,7 +99,8 @@ __ring_buffer_alloc(unsigned long size, unsigned flags, struct lock_class_key *k
 })
 
 typedef bool (*ring_buffer_cond_fn)(void *data);
-int ring_buffer_wait(struct trace_buffer *buffer, int cpu, int full);
+int ring_buffer_wait(struct trace_buffer *buffer, int cpu, int full,
+		     ring_buffer_cond_fn cond, void *data);
 __poll_t ring_buffer_poll_wait(struct trace_buffer *buffer, int cpu,
			       struct file *filp, poll_table *poll_table, int full);
 void ring_buffer_wake_waiters(struct trace_buffer *buffer, int cpu);

diff --git a/include/linux/trace_events.h b/include/linux/trace_events.h
index d68ff9b1247f9..fc6d0af56bb17 100644
--- a/include/linux/trace_events.h
+++ b/include/linux/trace_events.h
@@ -103,13 +103,16 @@ struct trace_iterator {
 	unsigned int	temp_size;
 	char		*fmt;	/* modified format holder */
 	unsigned int	fmt_size;
-	long		wait_index;
+	atomic_t	wait_index;
 
 	/* trace_seq for __print_flags() and __print_symbolic() etc. */
 	struct trace_seq	tmp_seq;
 
 	cpumask_var_t	started;
 
+	/* Set when the file is closed to prevent new waiters */
+	bool		closed;
+
 	/* it's true when current open file is snapshot */
 	bool		snapshot;

diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index f4c34b7c7e1e7..350607cce8694 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -902,23 +902,26 @@ static bool rb_wait_once(void *data)
  * @buffer: buffer to wait on
  * @cpu: the cpu buffer to wait on
FAILED: Patch "virtio: reenable config if freezing device failed" failed to apply to 5.10-stable tree
The patch below does not apply to the 5.10-stable tree. If someone wants it applied there, or to any other stable or longterm tree, then please email the backport, including the original git commit id to .

Thanks,
Sasha

-- original commit in Linus's tree --

From 310227f42882c52356b523e2f4e11690eebcd2ab Mon Sep 17 00:00:00 2001
From: David Hildenbrand
Date: Tue, 13 Feb 2024 14:54:25 +0100
Subject: [PATCH] virtio: reenable config if freezing device failed

Currently, we don't reenable the config if freezing the device failed.

For example, virtio-mem currently doesn't support suspend+resume, and trying to freeze the device will always fail. Afterwards, the device will no longer respond to resize requests, because it won't get notified about config changes.

Let's fix this by re-enabling the config if freezing fails.

Fixes: 22b7050a024d ("virtio: defer config changed notifications")
Cc:
Cc: "Michael S. Tsirkin"
Cc: Jason Wang
Cc: Xuan Zhuo
Signed-off-by: David Hildenbrand
Message-Id: <20240213135425.795001-1-da...@redhat.com>
Signed-off-by: Michael S. Tsirkin
---
 drivers/virtio/virtio.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/virtio/virtio.c b/drivers/virtio/virtio.c
index f4080692b3513..f513ee21b1c18 100644
--- a/drivers/virtio/virtio.c
+++ b/drivers/virtio/virtio.c
@@ -510,8 +510,10 @@ int virtio_device_freeze(struct virtio_device *dev)
 
 	if (drv && drv->freeze) {
 		ret = drv->freeze(dev);
-		if (ret)
+		if (ret) {
+			virtio_config_enable(dev);
 			return ret;
+		}
 	}
 
 	if (dev->config->destroy_avq)
-- 
2.43.0
FAILED: Patch "virtio: reenable config if freezing device failed" failed to apply to 5.15-stable tree
The patch below does not apply to the 5.15-stable tree. If someone wants it applied there, or to any other stable or longterm tree, then please email the backport, including the original git commit id to .

Thanks,
Sasha

-- original commit in Linus's tree --

From 310227f42882c52356b523e2f4e11690eebcd2ab Mon Sep 17 00:00:00 2001
From: David Hildenbrand
Date: Tue, 13 Feb 2024 14:54:25 +0100
Subject: [PATCH] virtio: reenable config if freezing device failed

Currently, we don't reenable the config if freezing the device failed.

For example, virtio-mem currently doesn't support suspend+resume, and trying to freeze the device will always fail. Afterwards, the device will no longer respond to resize requests, because it won't get notified about config changes.

Let's fix this by re-enabling the config if freezing fails.

Fixes: 22b7050a024d ("virtio: defer config changed notifications")
Cc:
Cc: "Michael S. Tsirkin"
Cc: Jason Wang
Cc: Xuan Zhuo
Signed-off-by: David Hildenbrand
Message-Id: <20240213135425.795001-1-da...@redhat.com>
Signed-off-by: Michael S. Tsirkin
---
 drivers/virtio/virtio.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/virtio/virtio.c b/drivers/virtio/virtio.c
index f4080692b3513..f513ee21b1c18 100644
--- a/drivers/virtio/virtio.c
+++ b/drivers/virtio/virtio.c
@@ -510,8 +510,10 @@ int virtio_device_freeze(struct virtio_device *dev)
 
 	if (drv && drv->freeze) {
 		ret = drv->freeze(dev);
-		if (ret)
+		if (ret) {
+			virtio_config_enable(dev);
 			return ret;
+		}
 	}
 
 	if (dev->config->destroy_avq)
-- 
2.43.0
FAILED: Patch "tracing/ring-buffer: Fix wait_on_pipe() race" failed to apply to 6.1-stable tree
The patch below does not apply to the 6.1-stable tree. If someone wants it
applied there, or to any other stable or longterm tree, then please email
the backport, including the original git commit id to .

Thanks,
Sasha

-- original commit in Linus's tree --

From 2aa043a55b9a764c9cbde5a8c654eeaaffe224cf Mon Sep 17 00:00:00 2001
From: "Steven Rostedt (Google)"
Date: Tue, 12 Mar 2024 08:15:08 -0400
Subject: [PATCH] tracing/ring-buffer: Fix wait_on_pipe() race

When the trace_pipe_raw file is closed, there should be no new readers
on the file descriptor. This is mostly handled with the waking and
wait_index fields of the iterator. But there's still a slight race.

     CPU 0                           CPU 1
     -----                           -----
                                     wait_index++;
     index = wait_index;
     ring_buffer_wake_waiters();
     wait_on_pipe()
          ring_buffer_wait();

The ring_buffer_wait() will miss the wakeup from CPU 1. The problem is
that the ring_buffer_wait() needs the logic of:

	prepare_to_wait();
	if (!condition)
		schedule();

Where the missing condition check is the iter->wait_index update.

Have the ring_buffer_wait() take a conditional callback function and a
data parameter that can be used within the wait_event_interruptible() of
the ring_buffer_wait() function.

In wait_on_pipe(), pass a condition function that will check if the
wait_index has been updated, if it has, it will return true to break out
of the wait_event_interruptible() loop.

Create a new field "closed" in the trace_iterator and set it in the
.flush() callback before calling ring_buffer_wake_waiters(). This will
keep any new readers from waiting on a closed file descriptor.

Have the wait_on_pipe() condition callback also check the closed field.

Change the wait_index field of the trace_iterator to atomic_t. There's
no reason it needs to be 'long' and making it atomic and using
atomic_read_acquire() and atomic_fetch_inc_release() will provide the
necessary memory barriers.

Add a "woken" flag to tracing_buffers_splice_read() to exit the loop
after one more try to fetch data. That is, if it waited for data and
something woke it up, it should try to collect any new data and then
exit back to user space.

Link: https://lore.kernel.org/linux-trace-kernel/CAHk-=wgsngewhfxzajiaqznwpmqetqmi1waes2o1v6l4c_u...@mail.gmail.com/
Link: https://lore.kernel.org/linux-trace-kernel/20240312121703.557950...@goodmis.org

Cc: sta...@vger.kernel.org
Cc: Masami Hiramatsu
Cc: Mark Rutland
Cc: Mathieu Desnoyers
Cc: Andrew Morton
Cc: Linus Torvalds
Cc: linke li
Cc: Rabin Vincent
Fixes: f3ddb74ad0790 ("tracing: Wake up ring buffer waiters on closing of the file")
Signed-off-by: Steven Rostedt (Google)
---
 include/linux/ring_buffer.h  |  3 ++-
 include/linux/trace_events.h |  5 -
 kernel/trace/ring_buffer.c   | 13 ++-
 kernel/trace/trace.c         | 43 ++--
 4 files changed, 45 insertions(+), 19 deletions(-)

diff --git a/include/linux/ring_buffer.h b/include/linux/ring_buffer.h
index 338a33db1577e..dc5ae4e96aee0 100644
--- a/include/linux/ring_buffer.h
+++ b/include/linux/ring_buffer.h
@@ -99,7 +99,8 @@ __ring_buffer_alloc(unsigned long size, unsigned flags, struct lock_class_key *k
 })
 
 typedef bool (*ring_buffer_cond_fn)(void *data);
-int ring_buffer_wait(struct trace_buffer *buffer, int cpu, int full);
+int ring_buffer_wait(struct trace_buffer *buffer, int cpu, int full,
+		     ring_buffer_cond_fn cond, void *data);
 __poll_t ring_buffer_poll_wait(struct trace_buffer *buffer, int cpu,
 			       struct file *filp, poll_table *poll_table, int full);
 void ring_buffer_wake_waiters(struct trace_buffer *buffer, int cpu);
diff --git a/include/linux/trace_events.h b/include/linux/trace_events.h
index d68ff9b1247f9..fc6d0af56bb17 100644
--- a/include/linux/trace_events.h
+++ b/include/linux/trace_events.h
@@ -103,13 +103,16 @@ struct trace_iterator {
 	unsigned int		temp_size;
 	char			*fmt;	/* modified format holder */
 	unsigned int		fmt_size;
-	long			wait_index;
+	atomic_t		wait_index;
 
 	/* trace_seq for __print_flags() and __print_symbolic() etc. */
 	struct trace_seq	tmp_seq;
 
 	cpumask_var_t		started;
 
+	/* Set when the file is closed to prevent new waiters */
+	bool			closed;
+
 	/* it's true when current open file is snapshot */
 	bool			snapshot;
 
diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index f4c34b7c7e1e7..350607cce8694 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -902,23 +902,26 @@ static bool rb_wait_once(void *data)
  * @buffer: buffer to wait on
  * @cpu: the cpu buffer to wait on
FAILED: Patch "virtio: reenable config if freezing device failed" failed to apply to 6.1-stable tree
The patch below does not apply to the 6.1-stable tree. If someone wants it
applied there, or to any other stable or longterm tree, then please email
the backport, including the original git commit id to .

Thanks,
Sasha
FAILED: Patch "tracing/ring-buffer: Fix wait_on_pipe() race" failed to apply to 6.6-stable tree
The patch below does not apply to the 6.6-stable tree. If someone wants it
applied there, or to any other stable or longterm tree, then please email
the backport, including the original git commit id to .

Thanks,
Sasha
FAILED: Patch "tracing/ring-buffer: Fix wait_on_pipe() race" failed to apply to 6.7-stable tree
The patch below does not apply to the 6.7-stable tree. If someone wants it
applied there, or to any other stable or longterm tree, then please email
the backport, including the original git commit id to .

Thanks,
Sasha
Re: [PATCH v2 1/2] vhost: Add smp_rmb() in vhost_vq_avail_empty()
On Wed, Mar 27, 2024 at 09:38:45AM +1000, Gavin Shan wrote:
> A smp_rmb() has been missed in vhost_vq_avail_empty(), spotted by
> Will Deacon. Otherwise, it's not ensured the available ring entries
> pushed by guest can be observed by vhost in time, leading to stale
> available ring entries fetched by vhost in vhost_get_vq_desc(), as
> reported by Yihuang Yu on NVidia's grace-hopper (ARM64) platform.
>
>   /home/gavin/sandbox/qemu.main/build/qemu-system-aarch64 \
>   -accel kvm -machine virt,gic-version=host -cpu host \
>   -smp maxcpus=1,cpus=1,sockets=1,clusters=1,cores=1,threads=1 \
>   -m 4096M,slots=16,maxmem=64G \
>   -object memory-backend-ram,id=mem0,size=4096M \
>    : \
>   -netdev tap,id=vnet0,vhost=true \
>   -device virtio-net-pci,bus=pcie.8,netdev=vnet0,mac=52:54:00:f1:26:b0
>    :
>
>   guest# netperf -H 10.26.1.81 -l 60 -C -c -t UDP_STREAM
>   virtio_net virtio0: output.0:id 100 is not a head!
>
> Add the missed smp_rmb() in vhost_vq_avail_empty(). Note that it
> should be safe until vq->avail_idx is changed by commit 275bf960ac697
> ("vhost: better detection of available buffers").
>
> Fixes: 275bf960ac697 ("vhost: better detection of available buffers")
> Cc:  # v4.11+
> Reported-by: Yihuang Yu
> Signed-off-by: Gavin Shan
> ---
>  drivers/vhost/vhost.c | 11 ++-
>  1 file changed, 10 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> index 045f666b4f12..00445ab172b3 100644
> --- a/drivers/vhost/vhost.c
> +++ b/drivers/vhost/vhost.c
> @@ -2799,9 +2799,18 @@ bool vhost_vq_avail_empty(struct vhost_dev *dev, struct vhost_virtqueue *vq)
>  	r = vhost_get_avail_idx(vq, &avail_idx);
>  	if (unlikely(r))
>  		return false;
> +
>  	vq->avail_idx = vhost16_to_cpu(vq, avail_idx);
> +	if (vq->avail_idx != vq->last_avail_idx) {
> +		/* Similar to what's done in vhost_get_vq_desc(), we need
> +		 * to ensure the available ring entries have been exposed
> +		 * by guest.
> +		 */

A slightly clearer comment:

	/* Since we have updated avail_idx, the following call to
	 * vhost_get_vq_desc will read available ring entries.
	 * Make sure that read happens after the avail_idx read.
	 */

Pls repost with that, and I will apply. Also add Suggested-by for Will.

> +		smp_rmb();
> +		return false;
> +	}
>
> -	return vq->avail_idx == vq->last_avail_idx;
> +	return true;
>  }
>  EXPORT_SYMBOL_GPL(vhost_vq_avail_empty);

As a follow-up patch, we should clean out the code duplication that
accumulated, with 3 places reading avail idx in essentially the same
way - this duplication is what caused the mess in the 1st place.

> --
> 2.44.0
Re: [PATCH] virtio_ring: Fix the stale index in available ring
On Tue, Mar 26, 2024 at 03:46:29PM +, Will Deacon wrote:
> On Tue, Mar 26, 2024 at 11:43:13AM +, Will Deacon wrote:
> > On Tue, Mar 26, 2024 at 09:38:55AM +, Keir Fraser wrote:
> > > On Tue, Mar 26, 2024 at 03:49:02AM -0400, Michael S. Tsirkin wrote:
> > > > > Secondly, the debugging code is enhanced so that the available
> > > > > head for (last_avail_idx - 1) is read twice and recorded. It
> > > > > means the available head for one specific available index is
> > > > > read twice, and I do see the available heads are different
> > > > > between the consecutive reads. More details are shared as below.
> > > > >
> > > > > From the guest side
> > > > > ===================
> > > > >
> > > > > virtio_net virtio0: output.0:id 86 is not a head!
> > > > > head to be released: 047 062 112
> > > > >
> > > > > avail_idx:
> > > > > 000  49665
> > > > > 001  49666  <--
> > > > >  :
> > > > > 015  49664
> > > >
> > > > what are these #s 49665 and so on?
> > > > and how large is the ring?
> > > > I am guessing 49664 is the index, ring size is 16 and
> > > > 49664 % 16 == 0
> > >
> > > More than that, 49664 % 256 == 0
> > >
> > > So again there seems to be an error in the vicinity of roll-over of
> > > the idx low byte, as I observed in the earlier log. Surely this is
> > > more than coincidence?
> >
> > Yeah, I'd still really like to see the disassembly for both sides of
> > the protocol here. Gavin, is that something you're able to provide?
> > Worst case, the host and guest vmlinux objects would be a starting
> > point.
> >
> > Personally, I'd be fairly surprised if this was a hardware issue.
>
> Ok, long shot after eyeballing the vhost code, but does the diff below
> help at all? It looks like vhost_vq_avail_empty() can advance the value
> saved in 'vq->avail_idx' but without the read barrier, possibly
> confusing vhost_get_vq_desc() in polling mode.
>
> Will
>
> --->8
>
> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> index 045f666b4f12..87bff710331a 100644
> --- a/drivers/vhost/vhost.c
> +++ b/drivers/vhost/vhost.c
> @@ -2801,6 +2801,7 @@ bool vhost_vq_avail_empty(struct vhost_dev *dev, struct vhost_virtqueue *vq)
>  		return false;
>  	vq->avail_idx = vhost16_to_cpu(vq, avail_idx);
>
> +	smp_rmb();
>  	return vq->avail_idx == vq->last_avail_idx;
>  }
>  EXPORT_SYMBOL_GPL(vhost_vq_avail_empty);

Oh wow you are right. We have:

	if (vq->avail_idx == vq->last_avail_idx) {
		if (unlikely(vhost_get_avail_idx(vq, &avail_idx))) {
			vq_err(vq, "Failed to access avail idx at %p\n",
			       &vq->avail->idx);
			return -EFAULT;
		}
		vq->avail_idx = vhost16_to_cpu(vq, avail_idx);

		if (unlikely((u16)(vq->avail_idx - last_avail_idx) > vq->num)) {
			vq_err(vq, "Guest moved used index from %u to %u",
			       last_avail_idx, vq->avail_idx);
			return -EFAULT;
		}

		/* If there's nothing new since last we looked, return
		 * invalid.
		 */
		if (vq->avail_idx == last_avail_idx)
			return vq->num;

		/* Only get avail ring entries after they have been
		 * exposed by guest.
		 */
		smp_rmb();
	}

and so the rmb only happens if avail_idx is not advanced. Actually there
is a bunch of code duplication where we assign to avail_idx, too.

Will, thanks a lot for looking into this! I kept looking into the virtio
side for some reason; the fact that it did not trigger with qemu should
have been a big hint!

-- 
MST
Re: [PATCH v3] vhost/vdpa: Add MSI translation tables to iommu for software-managed MSI
On Thu, Mar 21, 2024 at 3:00 PM Michael S. Tsirkin wrote:
>
> On Wed, Mar 20, 2024 at 06:19:12PM +0800, Wang Rong wrote:
> > From: Rong Wang
> >
> > Once an iommu domain is enabled for a device, the MSI translation
> > tables have to be there for software-managed MSI. Otherwise, a
> > platform with software-managed MSI but no irq bypass function cannot
> > get a correct memory write event from PCIe and will not get irqs.
> > The solution is to obtain the MSI phy base address from the iommu
> > reserved region, and set it as the iommu MSI cookie; the translation
> > tables will then be created while requesting the irq.
> >
> > Change log
> > ----------
> > v1->v2:
> > - add resv iotlb to avoid overlap mapping.
> > v2->v3:
> > - there is no need to export the iommu symbol anymore.
> >
> > Signed-off-by: Rong Wang
>
> There's no interest to keep extending vhost iotlb -
> we should just switch over to iommufd which supports
> this already.

IOMMUFD is good, but VFIO supported this before IOMMUFD. This patch
makes vDPA run without a backport of full IOMMUFD in the production
environment. I think it's worth it.

If you worry about the extension, we can just use the existing vhost
iotlb facility to do this.

Thanks

> > ---
> >  drivers/vhost/vdpa.c | 59 +---
> >  1 file changed, 56 insertions(+), 3 deletions(-)
> >
> > diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
> > index ba52d128aeb7..28b56b10372b 100644
> > --- a/drivers/vhost/vdpa.c
> > +++ b/drivers/vhost/vdpa.c
> > @@ -49,6 +49,7 @@ struct vhost_vdpa {
> >  	struct completion completion;
> >  	struct vdpa_device *vdpa;
> >  	struct hlist_head as[VHOST_VDPA_IOTLB_BUCKETS];
> > +	struct vhost_iotlb resv_iotlb;
> >  	struct device dev;
> >  	struct cdev cdev;
> >  	atomic_t opened;
> > @@ -247,6 +248,7 @@ static int _compat_vdpa_reset(struct vhost_vdpa *v)
> >  static int vhost_vdpa_reset(struct vhost_vdpa *v)
> >  {
> >  	v->in_batch = 0;
> > +	vhost_iotlb_reset(&v->resv_iotlb);
> >  	return _compat_vdpa_reset(v);
> >  }
> >
> > @@ -1219,10 +1221,15 @@ static int vhost_vdpa_process_iotlb_update(struct vhost_vdpa *v,
> >  	    msg->iova + msg->size - 1 > v->range.last)
> >  		return -EINVAL;
> >
> > +	if (vhost_iotlb_itree_first(&v->resv_iotlb, msg->iova,
> > +				    msg->iova + msg->size - 1))
> > +		return -EINVAL;
> > +
> >  	if (vhost_iotlb_itree_first(iotlb, msg->iova,
> >  				    msg->iova + msg->size - 1))
> >  		return -EEXIST;
> >
> > +
> >  	if (vdpa->use_va)
> >  		return vhost_vdpa_va_map(v, iotlb, msg->iova, msg->size,
> >  					 msg->uaddr, msg->perm);
> > @@ -1307,6 +1314,45 @@ static ssize_t vhost_vdpa_chr_write_iter(struct kiocb *iocb,
> >  	return vhost_chr_write_iter(dev, from);
> >  }
> >
> > +static int vhost_vdpa_resv_iommu_region(struct iommu_domain *domain, struct device *dma_dev,
> > +					struct vhost_iotlb *resv_iotlb)
> > +{
> > +	struct list_head dev_resv_regions;
> > +	phys_addr_t resv_msi_base = 0;
> > +	struct iommu_resv_region *region;
> > +	int ret = 0;
> > +	bool with_sw_msi = false;
> > +	bool with_hw_msi = false;
> > +
> > +	INIT_LIST_HEAD(&dev_resv_regions);
> > +	iommu_get_resv_regions(dma_dev, &dev_resv_regions);
> > +
> > +	list_for_each_entry(region, &dev_resv_regions, list) {
> > +		ret = vhost_iotlb_add_range_ctx(resv_iotlb, region->start,
> > +				region->start + region->length - 1,
> > +				0, 0, NULL);
> > +		if (ret) {
> > +			vhost_iotlb_reset(resv_iotlb);
> > +			break;
> > +		}
> > +
> > +		if (region->type == IOMMU_RESV_MSI)
> > +			with_hw_msi = true;
> > +
> > +		if (region->type == IOMMU_RESV_SW_MSI) {
> > +			resv_msi_base = region->start;
> > +			with_sw_msi = true;
> > +		}
> > +	}
> > +
> > +	if (!ret && !with_hw_msi && with_sw_msi)
> > +		ret = iommu_get_msi_cookie(domain, resv_msi_base);
> > +
> > +	iommu_put_resv_regions(dma_dev, &dev_resv_regions);
> > +
> > +	return ret;
> > +}
> > +
> >  static int vhost_vdpa_alloc_domain(struct vhost_vdpa *v)
> >  {
> >  	struct vdpa_device *vdpa = v->vdpa;
> > @@ -1335,11 +1381,16 @@ static int vhost_vdpa_alloc_domain(struct vhost_vdpa *v)
> >
> >  	ret = iommu_attach_device(v->domain, dma_dev);
> >  	if (ret)
> > -		goto err_attach;
> > +		goto err_alloc_domain;
> >
> > -	return 0;
> > +	ret = vhost_vdpa_resv_iommu_region(v->domain, dma_dev, &v->resv_iotlb);
> > +	if (ret)
> > +		goto err_attach_device;
> >
> > +	return 0;
> > +err_attach_device:
> > +	iommu_detach_device(v->domain,
[PATCH] trace/sched: add tgid for sched_wakeup_template
By doing this, we are able to filter tasks by tgid while we are tracing
wakeup events by ebpf or other methods.

For example, when we care about tracing a user space process (which has
an uncertain number of LWPs, i.e. pids) to monitor its wakeup latency,
without tgid available in the sched_wakeup tracepoints we would struggle
to find out all the pids to trace. Alternatively we could use kprobes to
achieve tgid tracing, but that is less accurate and much less efficient
than using a tracepoint.

Signed-off-by: Tio Zhang
Signed-off-by: Dylane Chen
---
 include/trace/events/sched.h | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
index dbb01b4b7451..ea7e525649e5 100644
--- a/include/trace/events/sched.h
+++ b/include/trace/events/sched.h
@@ -149,6 +149,7 @@ DECLARE_EVENT_CLASS(sched_wakeup_template,
 		__field(pid_t,	pid			)
 		__field(int,	prio			)
 		__field(int,	target_cpu		)
+		__field(pid_t,	tgid			)
 	),
 
 	TP_fast_assign(
@@ -156,11 +157,12 @@ DECLARE_EVENT_CLASS(sched_wakeup_template,
 		__entry->pid		= p->pid;
 		__entry->prio		= p->prio; /* XXX SCHED_DEADLINE */
 		__entry->target_cpu	= task_cpu(p);
+		__entry->tgid		= p->tgid;
 	),
 
-	TP_printk("comm=%s pid=%d prio=%d target_cpu=%03d",
+	TP_printk("comm=%s pid=%d prio=%d target_cpu=%03d tgid=%d",
 		  __entry->comm, __entry->pid, __entry->prio,
-		  __entry->target_cpu)
+		  __entry->target_cpu, __entry->tgid)
 );
 
 /*
-- 
2.17.1
Re: [PATCH v2 1/2] vhost: Add smp_rmb() in vhost_vq_avail_empty()
On Wed, Mar 27, 2024 at 3:35 PM Gavin Shan wrote:
>
> On 3/27/24 14:08, Gavin Shan wrote:
> > On 3/27/24 12:44, Jason Wang wrote:
> >> On Wed, Mar 27, 2024 at 10:34 AM Jason Wang wrote:
> >>> On Wed, Mar 27, 2024 at 7:39 AM Gavin Shan wrote:
> >>>> A smp_rmb() has been missed in vhost_vq_avail_empty(), spotted by
> >>>> Will Deacon. Otherwise, it's not ensured the available ring
> >>>> entries pushed by guest can be observed by vhost in time, leading
> >>>> to stale available ring entries fetched by vhost in
> >>>> vhost_get_vq_desc(), as reported by Yihuang Yu on NVidia's
> >>>> grace-hopper (ARM64) platform.
> >>>>
> >>>>   /home/gavin/sandbox/qemu.main/build/qemu-system-aarch64 \
> >>>>   -accel kvm -machine virt,gic-version=host -cpu host \
> >>>>   -smp maxcpus=1,cpus=1,sockets=1,clusters=1,cores=1,threads=1 \
> >>>>   -m 4096M,slots=16,maxmem=64G \
> >>>>   -object memory-backend-ram,id=mem0,size=4096M \
> >>>>    : \
> >>>>   -netdev tap,id=vnet0,vhost=true \
> >>>>   -device virtio-net-pci,bus=pcie.8,netdev=vnet0,mac=52:54:00:f1:26:b0
> >>>>    :
> >>>>   guest# netperf -H 10.26.1.81 -l 60 -C -c -t UDP_STREAM
> >>>>   virtio_net virtio0: output.0:id 100 is not a head!
> >>>>
> >>>> Add the missed smp_rmb() in vhost_vq_avail_empty(). Note that it
> >>>> should be safe until vq->avail_idx is changed by commit
> >>>> 275bf960ac697 ("vhost: better detection of available buffers").
> >>>>
> >>>> Fixes: 275bf960ac697 ("vhost: better detection of available buffers")
> >>>> Cc:  # v4.11+
> >>>> Reported-by: Yihuang Yu
> >>>> Signed-off-by: Gavin Shan
> >>>> ---
> >>>>  drivers/vhost/vhost.c | 11 ++-
> >>>>  1 file changed, 10 insertions(+), 1 deletion(-)
> >>>>
> >>>> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> >>>> index 045f666b4f12..00445ab172b3 100644
> >>>> --- a/drivers/vhost/vhost.c
> >>>> +++ b/drivers/vhost/vhost.c
> >>>> @@ -2799,9 +2799,18 @@ bool vhost_vq_avail_empty(struct vhost_dev *dev, struct vhost_virtqueue *vq)
> >>>>  	r = vhost_get_avail_idx(vq, &avail_idx);
> >>>>  	if (unlikely(r))
> >>>>  		return false;
> >>>> +
> >>>>  	vq->avail_idx = vhost16_to_cpu(vq, avail_idx);
> >>>> +	if (vq->avail_idx != vq->last_avail_idx) {
> >>>> +		/* Similar to what's done in vhost_get_vq_desc(), we need
> >>>> +		 * to ensure the available ring entries have been exposed
> >>>> +		 * by guest.
> >>>> +		 */
> >>>
> >>> We need to be more verbose here. For example, which load needs to
> >>> be ordered with which load.
> >>>
> >>> The rmb in vhost_get_vq_desc() is used to order the load of avail
> >>> idx and the load of head. It is paired with e.g virtio_wmb() in
> >>> virtqueue_add_split().
> >>>
> >>> vhost_vq_avail_empty() is mostly used as a hint in
> >>> vhost_net_busy_poll(), which is under the protection of the vq
> >>> mutex.
> >>>
> >>> An exception is tx_can_batch(), but in that case it doesn't even
> >>> want to read the head.
> >>
> >> Ok, if it is needed only in that path, maybe we can move the
> >> barriers there.
> >>
> >
> > [cc Will Deacon]
> >
> > Jason, appreciate your review and comments. I think PATCH[1/2] is
> > the fix for the hypothesis, meaning PATCH[2/2] is the real fix.
> > However, it would be nice to fix all of them in one shot. I will try
> > with PATCH[2/2] only to see if our issue will disappear or not.
> > However, the issue still exists if PATCH[2/2] is missed.
>
> Jason, PATCH[2/2] is sufficient to fix our current issue. I tried with
> PATCH[2/2] only and was unable to hit the issue. However, PATCH[1/2]
> may be needed by other scenarios. So it would be nice to fix them in
> one shot.

Yes, see below.

> > Firstly, we were failing on the transmit queue and
> > {tvq, rvq}->busyloop_timeout == false, if I remember correctly. So
> > the added smp_rmb() in vhost_vq_avail_empty() is only a concern to
> > tx_can_batch(). A mutex isn't enough to ensure the order for the
> > available index and available ring entry (head). For example,
> > vhost_vq_avail_empty() called by tx_can_batch() can see the next
> > available index, but its corresponding available ring entry (head)
> > may not be seen by vhost yet if smp_rmb() is missed. The next call
> > to get_tx_bufs(), where the available ring entry (head) hasn't
> > arrived yet, then fetches a stale available ring entry (head).
> >
> >   handle_tx_copy
> >     get_tx_bufs             // smp_rmb() won't be executed when
> >                             // vq->avail_idx != vq->last_avail_idx
> >     tx_can_batch
> >       vhost_vq_avail_empty  // vq->avail_idx is updated from vq->avail->idx
> >
> > The reason why I added smp_rmb() to vhost_vq_avail_empty() is
> > because the function is an exposed API, even if it's
Re: [PATCH v2 1/2] vhost: Add smp_rmb() in vhost_vq_avail_empty()
On 3/27/24 14:08, Gavin Shan wrote: On 3/27/24 12:44, Jason Wang wrote: On Wed, Mar 27, 2024 at 10:34 AM Jason Wang wrote: On Wed, Mar 27, 2024 at 7:39 AM Gavin Shan wrote: A smp_rmb() has been missed in vhost_vq_avail_empty(), spotted by Will Deacon . Otherwise, it's not ensured the available ring entries pushed by guest can be observed by vhost in time, leading to stale available ring entries fetched by vhost in vhost_get_vq_desc(), as reported by Yihuang Yu on NVidia's grace-hopper (ARM64) platform. /home/gavin/sandbox/qemu.main/build/qemu-system-aarch64 \ -accel kvm -machine virt,gic-version=host -cpu host \ -smp maxcpus=1,cpus=1,sockets=1,clusters=1,cores=1,threads=1 \ -m 4096M,slots=16,maxmem=64G \ -object memory-backend-ram,id=mem0,size=4096M \ : \ -netdev tap,id=vnet0,vhost=true \ -device virtio-net-pci,bus=pcie.8,netdev=vnet0,mac=52:54:00:f1:26:b0 : guest# netperf -H 10.26.1.81 -l 60 -C -c -t UDP_STREAM virtio_net virtio0: output.0:id 100 is not a head! Add the missed smp_rmb() in vhost_vq_avail_empty(). Note that it should be safe until vq->avail_idx is changed by commit 275bf960ac697 ("vhost: better detection of available buffers"). Fixes: 275bf960ac697 ("vhost: better detection of available buffers") Cc: # v4.11+ Reported-by: Yihuang Yu Signed-off-by: Gavin Shan --- drivers/vhost/vhost.c | 11 ++- 1 file changed, 10 insertions(+), 1 deletion(-) diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c index 045f666b4f12..00445ab172b3 100644 --- a/drivers/vhost/vhost.c +++ b/drivers/vhost/vhost.c @@ -2799,9 +2799,18 @@ bool vhost_vq_avail_empty(struct vhost_dev *dev, struct vhost_virtqueue *vq) r = vhost_get_avail_idx(vq, _idx); if (unlikely(r)) return false; + vq->avail_idx = vhost16_to_cpu(vq, avail_idx); + if (vq->avail_idx != vq->last_avail_idx) { + /* Similar to what's done in vhost_get_vq_desc(), we need + * to ensure the available ring entries have been exposed + * by guest. + */ We need to be more verbose here. 
For example, which load needs to be ordered with which load. The rmb in vhost_get_vq_desc() is used to order the load of avail idx and the load of head. It is paired with e.g. virtio_wmb() in virtqueue_add_split(). vhost_vq_avail_empty() is mostly used as a hint in vhost_net_busy_poll(), which is under the protection of the vq mutex. An exception is tx_can_batch(), but in that case it doesn't even want to read the head.

Ok, if it is needed only in that path, maybe we can move the barriers there. [cc Will Deacon]

Jason, I appreciate your review and comments. I think PATCH[1/2] is the fix for the hypothesis, meaning PATCH[2/2] is the real fix. However, it would be nice to fix all of them in one shot. I will try with PATCH[2/2] only to see if our issue disappears or not. However, the issue still exists if PATCH[2/2] is missed.

Jason, PATCH[2/2] is sufficient to fix our current issue. I tried with PATCH[2/2] only and was unable to hit the issue. However, PATCH[1/2] may be needed by other scenarios, so it would be nice to fix them in one shot.

Firstly, we were failing on the transmit queue and {tvq, rvq}->busyloop_timeout == false, if I remember correctly. So the added smp_rmb() in vhost_vq_avail_empty() is only a concern for tx_can_batch(). A mutex isn't enough to ensure the order for the available index and the available ring entry (head). For example, vhost_vq_avail_empty() called by tx_can_batch() can see the next available index, but its corresponding available ring entry (head) may not be seen by vhost yet if smp_rmb() is missed. The next call to get_tx_bufs(), where the available ring entry (head) hasn't arrived yet, leads to a stale available ring entry (head) being fetched.
handle_tx_copy
  get_tx_bufs            // smp_rmb() won't be executed when vq->avail_idx != vq->last_avail_idx
  tx_can_batch
    vhost_vq_avail_empty // vq->avail_idx is updated from vq->avail->idx

The reason why I added smp_rmb() to vhost_vq_avail_empty() is that the function is an exposed API, even though it's only used by drivers/vhost/net.c at present. It means the API has been broken internally. So it seems more appropriate to fix it up in vhost_vq_avail_empty() so that the API's users needn't worry about the memory access order.

+		smp_rmb();
+		return false;
+	}

-	return vq->avail_idx == vq->last_avail_idx;
+	return true;
 }
 EXPORT_SYMBOL_GPL(vhost_vq_avail_empty);

Thanks,
Gavin
[PATCH v6 2/2] memory tier: create CPUless memory tiers after obtaining HMAT info
The current implementation treats emulated memory devices, such as CXL 1.1 type 3 memory, as normal DRAM when they are emulated as normal memory (E820_TYPE_RAM). However, these emulated devices have different characteristics than traditional DRAM, making it important to distinguish them. Thus, we modify the tiered memory initialization process to introduce a delay specifically for CPUless NUMA nodes. This delay ensures that the memory tier initialization for these nodes is deferred until HMAT information is obtained during the boot process. Finally, demotion tables are recalculated at the end.

* late_initcall(memory_tier_late_init);
Some device drivers may have initialized memory tiers between `memory_tier_init()` and `memory_tier_late_init()`, potentially bringing online memory nodes and configuring memory tiers. They should be excluded in the late init.

* Handle cases where there is no HMAT when creating memory tiers
There is a scenario where a CPUless node does not provide HMAT information. If no HMAT is specified, it falls back to using the default DRAM tier.

* Introduce another new lock `default_dram_perf_lock` for adist calculation
In the current implementation, iterating through CPUless nodes requires holding the `memory_tier_lock`. However, `mt_calc_adistance()` will end up trying to acquire the same lock, leading to a potential deadlock. Therefore, we propose introducing a standalone `default_dram_perf_lock` to protect `default_dram_perf_*`. This approach not only avoids deadlock but also prevents holding a large lock simultaneously.

* Upgrade `set_node_memory_tier` to support additional cases, including default DRAM, late CPUless, and hot-plugged initializations.
To cover hot-plugged memory nodes, `mt_calc_adistance()` and `mt_find_alloc_memory_type()` are moved into `set_node_memory_tier()` to handle cases where memtype is not initialized and where HMAT information is available.

* Introduce `default_memory_types` for those memory types that are not initialized by device drivers.
Because late initialized memory and default DRAM memory need to be managed, a default memory type is created for storing all memory types that are not initialized by device drivers and as a fallback.

Signed-off-by: Ho-Ren (Jack) Chuang
Signed-off-by: Hao Xiang
---
 mm/memory-tiers.c | 94 ++++++++++++++++++++++++++++++++++++++++----------------
 1 file changed, 78 insertions(+), 16 deletions(-)

diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
index 974af10cfdd8..e24fc3bebae4 100644
--- a/mm/memory-tiers.c
+++ b/mm/memory-tiers.c
@@ -36,6 +36,11 @@ struct node_memory_type_map {

 static DEFINE_MUTEX(memory_tier_lock);
 static LIST_HEAD(memory_tiers);
+/*
+ * The list is used to store all memory types that are not created
+ * by a device driver.
+ */
+static LIST_HEAD(default_memory_types);
 static struct node_memory_type_map node_memory_types[MAX_NUMNODES];
 struct memory_dev_type *default_dram_type;

@@ -108,6 +113,8 @@ static struct demotion_nodes *node_demotion __read_mostly;

 static BLOCKING_NOTIFIER_HEAD(mt_adistance_algorithms);

+/* The lock is used to protect `default_dram_perf*` info and nid. */
+static DEFINE_MUTEX(default_dram_perf_lock);
 static bool default_dram_perf_error;
 static struct access_coordinate default_dram_perf;
 static int default_dram_perf_ref_nid = NUMA_NO_NODE;
@@ -505,7 +512,8 @@ static inline void __init_node_memory_type(int node, struct memory_dev_type *mem
 static struct memory_tier *set_node_memory_tier(int node)
 {
 	struct memory_tier *memtier;
-	struct memory_dev_type *memtype;
+	struct memory_dev_type *mtype = default_dram_type;
+	int adist = MEMTIER_ADISTANCE_DRAM;
 	pg_data_t *pgdat = NODE_DATA(node);

@@ -514,11 +522,20 @@ static struct memory_tier *set_node_memory_tier(int node)
 	if (!node_state(node, N_MEMORY))
 		return ERR_PTR(-EINVAL);

-	__init_node_memory_type(node, default_dram_type);
+	mt_calc_adistance(node, &adist);
+	if (node_memory_types[node].memtype == NULL) {
+		mtype = mt_find_alloc_memory_type(adist, &default_memory_types);
+		if (IS_ERR(mtype)) {
+			mtype = default_dram_type;
+			pr_info("Failed to allocate a memory type. Fall back.\n");
+		}
+	}
+
+	__init_node_memory_type(node, mtype);

-	memtype = node_memory_types[node].memtype;
-	node_set(node, memtype->nodes);
-	memtier = find_create_memory_tier(memtype);
+	mtype = node_memory_types[node].memtype;
+	node_set(node, mtype->nodes);
+	memtier = find_create_memory_tier(mtype);
 	if (!IS_ERR(memtier))
 		rcu_assign_pointer(pgdat->memtier, memtier);
 	return memtier;
@@ -655,6 +672,34 @@ void mt_put_memory_types(struct list_head *memory_types)
 }
 EXPORT_SYMBOL_GPL(mt_put_memory_types);

+/*
+ * This is invoked via `late_initcall()` to initialize memory tiers for
+ * CPU-less memory nodes
[PATCH v6 1/2] memory tier: dax/kmem: introduce an abstract layer for finding, allocating, and putting memory types
Since different memory devices require finding, allocating, and putting memory types, these common steps are abstracted in this patch, enhancing the scalability and conciseness of the code.

Signed-off-by: Ho-Ren (Jack) Chuang
---
 drivers/dax/kmem.c           | 20 ++------------------
 include/linux/memory-tiers.h | 13 +++++++++++++
 mm/memory-tiers.c            | 32 ++++++++++++++++++++++++++++++++
 3 files changed, 47 insertions(+), 18 deletions(-)

diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c
index 42ee360cf4e3..01399e5b53b2 100644
--- a/drivers/dax/kmem.c
+++ b/drivers/dax/kmem.c
@@ -55,21 +55,10 @@ static LIST_HEAD(kmem_memory_types);

 static struct memory_dev_type *kmem_find_alloc_memory_type(int adist)
 {
-	bool found = false;
 	struct memory_dev_type *mtype;

 	mutex_lock(&kmem_memory_type_lock);
-	list_for_each_entry(mtype, &kmem_memory_types, list) {
-		if (mtype->adistance == adist) {
-			found = true;
-			break;
-		}
-	}
-	if (!found) {
-		mtype = alloc_memory_type(adist);
-		if (!IS_ERR(mtype))
-			list_add(&mtype->list, &kmem_memory_types);
-	}
+	mtype = mt_find_alloc_memory_type(adist, &kmem_memory_types);
 	mutex_unlock(&kmem_memory_type_lock);

 	return mtype;
@@ -77,13 +66,8 @@ static struct memory_dev_type *kmem_find_alloc_memory_type(int adist)

 static void kmem_put_memory_types(void)
 {
-	struct memory_dev_type *mtype, *mtn;
-
 	mutex_lock(&kmem_memory_type_lock);
-	list_for_each_entry_safe(mtype, mtn, &kmem_memory_types, list) {
-		list_del(&mtype->list);
-		put_memory_type(mtype);
-	}
+	mt_put_memory_types(&kmem_memory_types);
 	mutex_unlock(&kmem_memory_type_lock);
 }

diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
index 69e781900082..a44c03c2ba3a 100644
--- a/include/linux/memory-tiers.h
+++ b/include/linux/memory-tiers.h
@@ -48,6 +48,9 @@ int mt_calc_adistance(int node, int *adist);
 int mt_set_default_dram_perf(int nid, struct access_coordinate *perf,
 			     const char *source);
 int mt_perf_to_adistance(struct access_coordinate *perf, int *adist);
+struct memory_dev_type *mt_find_alloc_memory_type(int adist,
+						  struct list_head *memory_types);
+void mt_put_memory_types(struct list_head *memory_types);
 #ifdef CONFIG_MIGRATION
 int next_demotion_node(int node);
 void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *targets);
@@ -136,5 +139,15 @@ static inline int mt_perf_to_adistance(struct access_coordinate *perf, int *adis
 {
 	return -EIO;
 }
+
+struct memory_dev_type *mt_find_alloc_memory_type(int adist, struct list_head *memory_types)
+{
+	return NULL;
+}
+
+void mt_put_memory_types(struct list_head *memory_types)
+{
+
+}
 #endif	/* CONFIG_NUMA */
 #endif  /* _LINUX_MEMORY_TIERS_H */

diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
index 0537664620e5..974af10cfdd8 100644
--- a/mm/memory-tiers.c
+++ b/mm/memory-tiers.c
@@ -623,6 +623,38 @@ void clear_node_memory_type(int node, struct memory_dev_type *memtype)
 }
 EXPORT_SYMBOL_GPL(clear_node_memory_type);

+struct memory_dev_type *mt_find_alloc_memory_type(int adist, struct list_head *memory_types)
+{
+	bool found = false;
+	struct memory_dev_type *mtype;
+
+	list_for_each_entry(mtype, memory_types, list) {
+		if (mtype->adistance == adist) {
+			found = true;
+			break;
+		}
+	}
+	if (!found) {
+		mtype = alloc_memory_type(adist);
+		if (!IS_ERR(mtype))
+			list_add(&mtype->list, memory_types);
+	}
+
+	return mtype;
+}
+EXPORT_SYMBOL_GPL(mt_find_alloc_memory_type);
+
+void mt_put_memory_types(struct list_head *memory_types)
+{
+	struct memory_dev_type *mtype, *mtn;
+
+	list_for_each_entry_safe(mtype, mtn, memory_types, list) {
+		list_del(&mtype->list);
+		put_memory_type(mtype);
+	}
+}
+EXPORT_SYMBOL_GPL(mt_put_memory_types);
+
 static void dump_hmem_attrs(struct access_coordinate *coord, const char *prefix)
 {
 	pr_info(
--
Ho-Ren (Jack) Chuang
[PATCH v6 0/2] Improved Memory Tier Creation for CPUless NUMA Nodes
When a memory device, such as CXL 1.1 type 3 memory, is emulated as normal memory (E820_TYPE_RAM), the memory device is indistinguishable from normal DRAM in terms of memory tiering with the current implementation. The current memory tiering assigns all detected normal memory nodes to the same DRAM tier. This results in normal memory devices with different attributes being unable to be assigned to the correct memory tier, leading to the inability to migrate pages between different types of memory. https://lore.kernel.org/linux-mm/ph0pr08mb7955e9f08ccb64f23963b5c3a8...@ph0pr08mb7955.namprd08.prod.outlook.com/T/

This patchset automatically resolves the issues. It delays the initialization of memory tiers for CPUless NUMA nodes until they obtain HMAT information and after all devices are initialized at boot time, eliminating the need for user intervention. If no HMAT is specified, it falls back to using `default_dram_type`.

Example use case: We have CXL memory on the host, and we create VMs with a new system memory device backed by host CXL memory. We inject CXL memory performance attributes through QEMU, and the guest now sees memory nodes with performance attributes in HMAT. With this change, we enable the guest kernel to construct the correct memory tiering for the memory nodes.
-v6: Thanks to Ying's comments, * Move `default_dram_perf_lock` to the function's beginning for clarity * Fix double unlocking at v5 -v5: Thanks to Ying's comments, * Add comments about what is protected by `default_dram_perf_lock` * Fix an uninitialized pointer mtype * Slightly shorten the time holding `default_dram_perf_lock` * Fix a deadlock bug in `mt_perf_to_adistance` * https://lore.kernel.org/lkml/20240327041646.3258110-1-horenchu...@bytedance.com/T/#u -v4: Thanks to Ying's comments, * Remove redundant code * Reorganize patches accordingly * https://lore.kernel.org/lkml/20240322070356.315922-1-horenchu...@bytedance.com/T/#u -v3: Thanks to Ying's comments, * Make the newly added code independent of HMAT * Upgrade set_node_memory_tier to support more cases * Put all non-driver-initialized memory types into default_memory_types instead of using hmat_memory_types * find_alloc_memory_type -> mt_find_alloc_memory_type * https://lore.kernel.org/lkml/20240320061041.3246828-1-horenchu...@bytedance.com/T/#u -v2: Thanks to Ying's comments, * Rewrite cover letter & patch description * Rename functions, don't use _hmat * Abstract common functions into find_alloc_memory_type() * Use the expected way to use set_node_memory_tier instead of modifying it * https://lore.kernel.org/lkml/20240312061729.1997111-1-horenchu...@bytedance.com/T/#u -v1: * https://lore.kernel.org/lkml/20240301082248.3456086-1-horenchu...@bytedance.com/T/#u Ho-Ren (Jack) Chuang (2): memory tier: dax/kmem: introduce an abstract layer for finding, allocating, and putting memory types memory tier: create CPUless memory tiers after obtaining HMAT info drivers/dax/kmem.c | 20 +- include/linux/memory-tiers.h | 13 mm/memory-tiers.c| 126 ++- 3 files changed, 125 insertions(+), 34 deletions(-) -- Ho-Ren (Jack) Chuang
Re: [PATCH net-next v4 2/2] net: udp: add IP/port data to the tracepoint udp/udp_fail_queue_rcv_skb
On Wed, Mar 27, 2024 at 2:05 AM Balazs Scheidler wrote: > > The udp_fail_queue_rcv_skb() tracepoint lacks any details on the source > and destination IP/port whereas this information can be critical in case > of UDP/syslog. > > Signed-off-by: Balazs Scheidler Looks good to me, thanks! Reviewed-by: Jason Xing
Re: [External] Re: [PATCH v5 2/2] memory tier: create CPUless memory tiers after obtaining HMAT info
On Tue, Mar 26, 2024 at 10:52 PM Huang, Ying wrote: > > "Ho-Ren (Jack) Chuang" writes: > > > The current implementation treats emulated memory devices, such as > > CXL1.1 type3 memory, as normal DRAM when they are emulated as normal memory > > (E820_TYPE_RAM). However, these emulated devices have different > > characteristics than traditional DRAM, making it important to > > distinguish them. Thus, we modify the tiered memory initialization process > > to introduce a delay specifically for CPUless NUMA nodes. This delay > > ensures that the memory tier initialization for these nodes is deferred > > until HMAT information is obtained during the boot process. Finally, > > demotion tables are recalculated at the end. > > > > * late_initcall(memory_tier_late_init); > > Some device drivers may have initialized memory tiers between > > `memory_tier_init()` and `memory_tier_late_init()`, potentially bringing > > online memory nodes and configuring memory tiers. They should be excluded > > in the late init. > > > > * Handle cases where there is no HMAT when creating memory tiers > > There is a scenario where a CPUless node does not provide HMAT information. > > If no HMAT is specified, it falls back to using the default DRAM tier. > > > > * Introduce another new lock `default_dram_perf_lock` for adist calculation > > In the current implementation, iterating through CPUlist nodes requires > > holding the `memory_tier_lock`. However, `mt_calc_adistance()` will end up > > trying to acquire the same lock, leading to a potential deadlock. > > Therefore, we propose introducing a standalone `default_dram_perf_lock` to > > protect `default_dram_perf_*`. This approach not only avoids deadlock > > but also prevents holding a large lock simultaneously. Besides, this patch > > slightly shortens the time holding the lock by putting the lock closer to > > what it protects as well. 
> > > > * Upgrade `set_node_memory_tier` to support additional cases, including > > default DRAM, late CPUless, and hot-plugged initializations. > > To cover hot-plugged memory nodes, `mt_calc_adistance()` and > > `mt_find_alloc_memory_type()` are moved into `set_node_memory_tier()` to > > handle cases where memtype is not initialized and where HMAT information is > > available. > > > > * Introduce `default_memory_types` for those memory types that are not > > initialized by device drivers. > > Because late initialized memory and default DRAM memory need to be managed, > > a default memory type is created for storing all memory types that are > > not initialized by device drivers and as a fallback. > > > > * Fix a deadlock bug in `mt_perf_to_adistance` > > Because an error path was not handled properly in `mt_perf_to_adistance`, > > unlock before returning the error. > > > > Signed-off-by: Ho-Ren (Jack) Chuang > > Signed-off-by: Hao Xiang > > --- > > mm/memory-tiers.c | 85 +++ > > 1 file changed, 72 insertions(+), 13 deletions(-) > > > > diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c > > index 974af10cfdd8..610db9581ba4 100644 > > --- a/mm/memory-tiers.c > > +++ b/mm/memory-tiers.c > > @@ -36,6 +36,11 @@ struct node_memory_type_map { > > > > static DEFINE_MUTEX(memory_tier_lock); > > static LIST_HEAD(memory_tiers); > > +/* > > + * The list is used to store all memory types that are not created > > + * by a device driver. > > + */ > > +static LIST_HEAD(default_memory_types); > > static struct node_memory_type_map node_memory_types[MAX_NUMNODES]; > > struct memory_dev_type *default_dram_type; > > > > @@ -108,6 +113,8 @@ static struct demotion_nodes *node_demotion > > __read_mostly; > > > > static BLOCKING_NOTIFIER_HEAD(mt_adistance_algorithms); > > > > +/* The lock is used to protect `default_dram_perf*` info and nid. 
*/ > > +static DEFINE_MUTEX(default_dram_perf_lock); > > static bool default_dram_perf_error; > > static struct access_coordinate default_dram_perf; > > static int default_dram_perf_ref_nid = NUMA_NO_NODE; > > @@ -505,7 +512,8 @@ static inline void __init_node_memory_type(int node, > > struct memory_dev_type *mem > > static struct memory_tier *set_node_memory_tier(int node) > > { > > struct memory_tier *memtier; > > - struct memory_dev_type *memtype; > > + struct memory_dev_type *mtype = default_dram_type; > > + int adist = MEMTIER_ADISTANCE_DRAM; > > pg_data_t *pgdat = NODE_DATA(node); > > > > > > @@ -514,11 +522,20 @@ static struct memory_tier *set_node_memory_tier(int > > node) > > if (!node_state(node, N_MEMORY)) > > return ERR_PTR(-EINVAL); > > > > - __init_node_memory_type(node, default_dram_type); > > + mt_calc_adistance(node, &adist); > > + if (node_memory_types[node].memtype == NULL) { > > + mtype = mt_find_alloc_memory_type(adist, > > &default_memory_types); > > + if (IS_ERR(mtype)) { > > + mtype = default_dram_type; > > + pr_info("Failed to allocate a
Re: [PATCH v4 02/14] mm: Switch mm->get_unmapped_area() to a flag
Rick Edgecombe wrote: > The mm_struct contains a function pointer *get_unmapped_area(), which > is set to either arch_get_unmapped_area() or > arch_get_unmapped_area_topdown() during the initialization of the mm. > > Since the function pointer only ever points to two functions that are named > the same across all arches, a function pointer is not really required. In > addition, future changes will want to add versions of the functions that > take additional arguments. So to save a pointer's worth of bytes in > mm_struct, and prevent adding additional function pointers to mm_struct in > future changes, remove it and keep the information about which > get_unmapped_area() to use in a flag. > > Add the new flag to MMF_INIT_MASK so it doesn't get clobbered on fork by > mmf_init_flags(). Most MM flags get clobbered on fork. In the pre-existing > behavior mm->get_unmapped_area() would get copied to the new mm in > dup_mm(), so not clobbering the flag preserves the existing behavior > around inheriting the topdown-ness. > > Introduce a helper, mm_get_unmapped_area(), to easily convert code that > refers to the old function pointer to instead select and call either > arch_get_unmapped_area() or arch_get_unmapped_area_topdown() based on the > flag. Then drop the mm->get_unmapped_area() function pointer. Leave the > get_unmapped_area() pointer in struct file_operations alone. The main > purpose of this change is to reorganize in preparation for future changes, > but it also converts the calls of mm->get_unmapped_area() from indirect > branches into direct ones. > > The stress-ng bigheap benchmark calls realloc a lot, which calls through > get_unmapped_area() in the kernel. On x86, the change yielded a ~1% > improvement there on a retpoline config. > > In testing a few x86 configs, removing the pointer unfortunately didn't > result in any actual size reductions in the compiled layout of mm_struct.
> But depending on compiler or arch alignment requirements, the change could > shrink the size of mm_struct. > > Signed-off-by: Rick Edgecombe > Acked-by: Dave Hansen > Acked-by: Liam R. Howlett > Reviewed-by: Kirill A. Shutemov > Cc: linux-s...@vger.kernel.org > Cc: linux-kernel@vger.kernel.org > Cc: sparcli...@vger.kernel.org > Cc: linux-...@vger.kernel.org > Cc: nvd...@lists.linux.dev > Cc: linux-...@vger.kernel.org > Cc: linux...@kvack.org > Cc: linux-fsde...@vger.kernel.org > Cc: io-ur...@vger.kernel.org > Cc: b...@vger.kernel.org > --- > v4: > - Split out pde_get_unmapped_area() refactor into separate patch >(Christophe Leroy) > > v3: > - Fix comment that still referred to mm->get_unmapped_area() > - Resolve trivial rebase conflicts with "mm: thp_get_unmapped_area must >honour topdown preference" > - Spelling fix in log > > v2: > - Fix comment on MMF_TOPDOWN (Kirill, rppt) > - Move MMF_TOPDOWN to actually unused bit > - Add MMF_TOPDOWN to MMF_INIT_MASK so it doesn't get clobbered on fork, >and result in the children using the search up path. 
> - New lower performance results after above bug fix
> - Add Reviews and Acks
> ---
>  arch/s390/mm/hugetlbpage.c       |  2 +-
>  arch/s390/mm/mmap.c              |  4 ++--
>  arch/sparc/kernel/sys_sparc_64.c | 15 ++-
>  arch/sparc/mm/hugetlbpage.c      |  2 +-
>  arch/x86/kernel/cpu/sgx/driver.c |  2 +-
>  arch/x86/mm/hugetlbpage.c        |  2 +-
>  arch/x86/mm/mmap.c               |  4 ++--
>  drivers/char/mem.c               |  2 +-
>  drivers/dax/device.c             |  6 +++---
>  fs/hugetlbfs/inode.c             |  4 ++--
>  fs/proc/inode.c                  |  3 ++-
>  fs/ramfs/file-mmu.c              |  2 +-
>  include/linux/mm_types.h         |  6 +-
>  include/linux/sched/coredump.h   |  5 -
>  include/linux/sched/mm.h         |  5 +
>  io_uring/io_uring.c              |  2 +-
>  kernel/bpf/arena.c               |  2 +-
>  kernel/bpf/syscall.c             |  2 +-
>  mm/debug.c                       |  6 --
>  mm/huge_memory.c                 |  9 -
>  mm/mmap.c                        | 21 ++---
>  mm/shmem.c                       | 11 +--
>  mm/util.c                        |  6 +++---
>  23 files changed, 66 insertions(+), 57 deletions(-)
>
> diff --git a/arch/s390/mm/hugetlbpage.c b/arch/s390/mm/hugetlbpage.c
> index c2e8242bd15d..219d906fe830 100644
> --- a/arch/s390/mm/hugetlbpage.c
> +++ b/arch/s390/mm/hugetlbpage.c
> @@ -328,7 +328,7 @@ unsigned long hugetlb_get_unmapped_area(struct file *file, unsigned long addr,
> 		goto check_asce_limit;
> 	}
>
> -	if (mm->get_unmapped_area == arch_get_unmapped_area)
> +	if (!test_bit(MMF_TOPDOWN, &mm->flags))
> 		addr = hugetlb_get_unmapped_area_bottomup(file, addr, len,
> 				pgoff, flags);
> 	else
> diff --git a/arch/s390/mm/mmap.c b/arch/s390/mm/mmap.c
> index b14fc0887654..6b2e4436ad4a 100644
> --- a/arch/s390/mm/mmap.c
> +++ b/arch/s390/mm/mmap.c
> @@