Re: [PATCH v2 1/1] vhost: Added pad cleanup if vnet_hdr is not present.

2024-03-27 Thread Jason Wang
On Thu, Mar 28, 2024 at 7:44 AM Andrew Melnychenko  wrote:
>
> When Qemu is launched with vhost but without tap vnet_hdr,
> vhost tries to copy a vnet_hdr of size 0 from the socket iter
> to a page that may contain some trash.
> That trash can be interpreted as unpredictable values for
> vnet_hdr.
> That leads to dropping some packets and, in some cases, to
> stalling the vhost routine when vhost_net tries to process
> packets and fails in a loop.
>
> Qemu options:
>   -netdev tap,vhost=on,vnet_hdr=off,...
>
> From a security point of view, the wrong values in these fields are
> used later in tap's tap_get_user_xdp() and will affect skb GSO and
> options. Later, the header (and data in the headroom) should not be
> used by the stack. Using a custom socket as a backend to vhost_net can
> reveal some data in the vnet_hdr, although it would require kernel
> access to implement.
>
> The issue happens because the value of sock_len in the virtqueue is 0.
> That value is set at vhost_net_set_features() with
> VHOST_NET_F_VIRTIO_NET_HDR, and it is also set to zero in the device
> open() and reset() routines.
> So, currently, to trigger the issue, we need to set up Qemu with
> vhost=on,vnet_hdr=off, or not configure vhost in the custom program.
>
> Signed-off-by: Andrew Melnychenko 

Acked-by: Jason Wang 

It seems it has been merged by Michael.

Thanks

> ---
>  drivers/vhost/net.c | 3 +++
>  1 file changed, 3 insertions(+)
>
> diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
> index f2ed7167c848..57411ac2d08b 100644
> --- a/drivers/vhost/net.c
> +++ b/drivers/vhost/net.c
> @@ -735,6 +735,9 @@ static int vhost_net_build_xdp(struct vhost_net_virtqueue *nvq,
> hdr = buf;
> gso = &hdr->gso;
>
> +   if (!sock_hlen)
> +   memset(buf, 0, pad);
> +
> if ((gso->flags & VIRTIO_NET_HDR_F_NEEDS_CSUM) &&
> vhost16_to_cpu(vq, gso->csum_start) +
> vhost16_to_cpu(vq, gso->csum_offset) + 2 >
> --
> 2.43.0
>




Re: [PATCH v3 3/3] vhost: Improve vhost_get_avail_idx() with smp_rmb()

2024-03-27 Thread Jason Wang
On Thu, Mar 28, 2024 at 8:22 AM Gavin Shan  wrote:
>
> All the callers of vhost_get_avail_idx() are concerned with the memory
> barrier, imposed by smp_rmb(), to ensure the order of the available
> ring entry read and the avail_idx read.
>
> Improve vhost_get_avail_idx() so that smp_rmb() is executed when
> the avail_idx is advanced. With it, the callers needn't worry
> about the memory barrier.
>
> Suggested-by: Michael S. Tsirkin 
> Signed-off-by: Gavin Shan 

Acked-by: Jason Wang 

Thanks
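For context, folding the barrier into the helper would look roughly like
this (a sketch based on the commit message, not the exact merged code;
vhost_get_avail() is the accessor already used in drivers/vhost/vhost.c):

    static int vhost_get_avail_idx(struct vhost_virtqueue *vq, __virtio16 *idx)
    {
            int r = vhost_get_avail(vq, *idx, &vq->avail->idx);

            if (unlikely(r))
                    return r;

            vq->avail_idx = vhost16_to_cpu(vq, *idx);
            if (vq->avail_idx != vq->last_avail_idx) {
                    /* Pair with the guest-side virtio_wmb(): the available
                     * ring entry reads that follow must happen after this
                     * avail_idx read.
                     */
                    smp_rmb();
            }

            return 0;
    }

With the barrier inside the helper, callers such as vhost_vq_avail_empty()
and vhost_get_vq_desc() no longer need their own smp_rmb() before touching
the ring entries.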




Re: [PATCH v9 15/15] selftests/sgx: Add scripts for EPC cgroup testing

2024-03-27 Thread Haitao Huang
On Wed, 27 Mar 2024 07:55:34 -0500, Jarkko Sakkinen   
wrote:



On Mon, 2024-02-05 at 13:06 -0800, Haitao Huang wrote:

The scripts rely on cgroup-tools package from libcgroup [1].

To run selftests for epc cgroup:

sudo ./run_epc_cg_selftests.sh

To watch misc cgroup 'current' changes during testing, run this in a
separate terminal:

./watch_misc_for_tests.sh current

With different cgroups, the script starts one or multiple concurrent SGX
selftests, each to run one unclobbered_vdso_oversubscribed test. Each
such test tries to load an enclave of EPC size equal to the EPC
capacity available on the platform. The script checks results against
the expectation set for each cgroup and reports success or failure.

The script creates 3 different cgroups at the beginning with the
following expectations:

1) SMALL - intentionally small enough to fail the test loading an
enclave of size equal to the capacity.
2) LARGE - large enough to run up to 4 concurrent tests but fail some
if more than 4 concurrent tests are run. The script starts 4 expecting
at least one test to pass, and then starts 5 expecting at least one
test to fail.
3) LARGER - limit is the same as the capacity, large enough to run
lots of concurrent tests. The script starts 8 of them and expects all
to pass. Then it reruns the same test with one process randomly killed
and usage checked to be zero after all processes exit.

The script also includes a test with low mem_cg limit and LARGE sgx_epc
limit to verify that the RAM used for per-cgroup reclamation is charged
to a proper mem_cg.

[1] https://github.com/libcgroup/libcgroup/blob/main/README

Signed-off-by: Haitao Huang 
---
V7:
- Added memcontrol test.

V5:
- Added script with automatic results checking, removed the
interactive script.
- The script can run independently from the series below.
---
 .../selftests/sgx/run_epc_cg_selftests.sh | 246 ++++++++++++++++++++++
 .../selftests/sgx/watch_misc_for_tests.sh |  13 +
 2 files changed, 259 insertions(+)
 create mode 100755 tools/testing/selftests/sgx/run_epc_cg_selftests.sh
 create mode 100755 tools/testing/selftests/sgx/watch_misc_for_tests.sh

diff --git a/tools/testing/selftests/sgx/run_epc_cg_selftests.sh
b/tools/testing/selftests/sgx/run_epc_cg_selftests.sh
new file mode 100755
index ..e027bf39f005
--- /dev/null
+++ b/tools/testing/selftests/sgx/run_epc_cg_selftests.sh
@@ -0,0 +1,246 @@
+#!/bin/bash


This is not portable and does not always hold in the wild.

It often does not hold, as it is not uncommon for bash to be placed
at /usr/bin/bash. If I recall correctly, e.g. NixOS has
a path that is neither of those two.

Should be #!/usr/bin/env bash

That is the POSIX-compatible form.



Sure


Just got around trying to test this in NUC7 so looking into this in
more detail.


Thanks. Could you please check if this version works for you?

https://github.com/haitaohuang/linux/commit/3c424b841cf3cf66b085a424f4b537fbc3bbff6f



That said can you make the script work with just "#!/usr/bin/env sh"
and make sure that it is busybox ash compatible?


Yes.



I don't see any necessity to make this bash only and it adds to the
compilation time of the image. Otherwise lot of this could be tested
just with qemu+bzImage+busybox(inside initramfs).



We will still need cgroup-tools, as you pointed out later. Is compiling
from its upstream code OK?




Now you are adding fully glibc shenanigans for the sake of syntax
sugar.


+# SPDX-License-Identifier: GPL-2.0
+# Copyright(c) 2023 Intel Corporation.
+
+TEST_ROOT_CG=selftest
+cgcreate -g misc:$TEST_ROOT_CG


How do you know that cgcreate exists? It is used a lot in the script
with no check for the existence. Please fix e.g. with "command -v
cgcreate".


+if [ $? -ne 0 ]; then
+echo "# Please make sure cgroup-tools is installed, and misc
cgroup is mounted."
+exit 1
+fi


And please do not do it this way. Also, please remove the advice for
"cgroup-tools". This is not meant to be Debian-only. Better would be
to e.g. point out the URL of the upstream project.

And yeah the whole message should be based on "command -v", not like
this.



OK


+TEST_CG_SUB1=$TEST_ROOT_CG/test1
+TEST_CG_SUB2=$TEST_ROOT_CG/test2
+# We will only set limit in test1 and run tests in test3
+TEST_CG_SUB3=$TEST_ROOT_CG/test1/test3
+TEST_CG_SUB4=$TEST_ROOT_CG/test4
+
+cgcreate -g misc:$TEST_CG_SUB1





+cgcreate -g misc:$TEST_CG_SUB2
+cgcreate -g misc:$TEST_CG_SUB3
+cgcreate -g misc:$TEST_CG_SUB4
+
+# Default to V2
+CG_MISC_ROOT=/sys/fs/cgroup
+CG_MEM_ROOT=/sys/fs/cgroup
+CG_V1=0
+if [ ! -d "/sys/fs/cgroup/misc" ]; then
+echo "# cgroup V2 is in use."
+else
+echo "# cgroup V1 is in use."


Is "#" prefix a standard for kselftest? I don't know this, thus asking.


+CG_MISC_ROOT=/sys/fs/cgroup/misc
+CG_MEM_ROOT=/sys/fs/cgroup/memory
+CG_V1=1


Have you checked what is the indentation policy for bash scripts inside
kernel tree. I don't know what it is. That's why I'm asking.

Right. I looked around and found scripts 

Re: [PATCH v4 02/14] mm: Switch mm->get_unmapped_area() to a flag

2024-03-27 Thread Edgecombe, Rick P
On Wed, 2024-03-27 at 15:15 +0200, Jarkko Sakkinen wrote:
> I mean I believe the change itself makes sense, it is just not
> fully documented in the commit message.

Ah, I see. Yes, there could be more background on arch_pick_mmap_layout().


Re: [PATCH v4 02/14] mm: Switch mm->get_unmapped_area() to a flag

2024-03-27 Thread Edgecombe, Rick P
On Tue, 2024-03-26 at 23:38 -0700, Dan Williams wrote:
> > +unsigned long
> > +mm_get_unmapped_area(struct mm_struct *mm, struct file *file,
> > +    unsigned long addr, unsigned long len,
> > +    unsigned long pgoff, unsigned long flags)
> > +{
> 
> Seems like a small waste to have all call sites now need to push an
> additional @mm argument onto the stack just to figure out what function
> to call.
> 
> > +   if (test_bit(MMF_TOPDOWN, &mm->flags))
> > +   return arch_get_unmapped_area_topdown(file, addr, len, 
> > pgoff, flags);
> > +   return arch_get_unmapped_area(file, addr, len, pgoff, flags);
> 
> This seems small enough to be amenable to a static inline, but that
> would require exporting the arch calls which might be painful.
> 
> Would it not be possible to drop the @mm argument and just reference
> current->mm internal to this function? Just call this funcion
> current_get_unmapped_area()?

Hmm, you are right. The callers all pass current->mm. The mm argument could be 
removed.
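For illustration, the suggestion would amount to something like the
following (a sketch; the name current_get_unmapped_area() and its exact
placement are hypothetical):

    unsigned long
    current_get_unmapped_area(struct file *file, unsigned long addr,
                              unsigned long len, unsigned long pgoff,
                              unsigned long flags)
    {
            struct mm_struct *mm = current->mm;

            /* Same flag-based dispatch as above, minus the @mm argument. */
            if (test_bit(MMF_TOPDOWN, &mm->flags))
                    return arch_get_unmapped_area_topdown(file, addr, len,
                                                          pgoff, flags);
            return arch_get_unmapped_area(file, addr, len, pgoff, flags);
    }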


Re: [PATCH v9 15/15] selftests/sgx: Add scripts for EPC cgroup testing

2024-03-27 Thread Haitao Huang
On Wed, 27 Mar 2024 19:57:26 -0500, Haitao Huang  
 wrote:


On Wed, 27 Mar 2024 11:56:35 -0500, Jarkko Sakkinen   
wrote:



On Wed Mar 27, 2024 at 2:55 PM EET, Jarkko Sakkinen wrote:

On Mon, 2024-02-05 at 13:06 -0800, Haitao Huang wrote:
> The scripts rely on cgroup-tools package from libcgroup [1].
>
> To run selftests for epc cgroup:
>
> sudo ./run_epc_cg_selftests.sh
>
> To watch misc cgroup 'current' changes during testing, run this in a
> separate terminal:

...

> wait_and_detect_for_any() {

what is "any"?

Maybe some key functions could have short documentation on what they
are and which tests use them. I cannot possibly remember all of this
just by hints such as "this waits for Any" ;-)

I don't think there is actual kernel guideline to engineer the script
to work with just ash but at least for me that would inevitably
increase my motivation to test this patch set more rather than less.


I also wonder is cgroup-tools dependency absolutely required or could
you just have a function that would interact with sysfs?


I should have checked email before hit the send button for v10 :-).

It'd be more complicated and less readable to do all the stuff without
cgroup-tools, especially cgexec. I checked the dependencies: cgroup-tools
only depends on libc, so I hope this would not cause too much inconvenience.


I saw bash was also used in cgroup test scripts so at least that's  
consistent :-)

I can look into ash if that's required. Let me know.



Sorry I missed your earlier comments. I actually tried to make it
compatible with busybox on Ubuntu.


I'll address your other comments and update. But meanwhile could you
try out this version in your env?


https://github.com/haitaohuang/linux/commit/3c424b841cf3cf66b085a424f4b537fbc3bbff6f

Thanks
Haitao



Re: [PATCH v4 2/2] tracing: Include Microcode Revision in mce_record tracepoint

2024-03-27 Thread Sohil Mehta


> 
> You *definitely* want to do that - good catch.
> 
> And TBH, all the screaming words aren't helping either... :)
> 

:) I thought the same as well. But I felt inconsistently screaming
words might be worse. Maybe just update all the words that are not
acronyms (such as Processor, Time, Socket, etc.).



Re: [PATCH net v2 1/2] virtio_net: Do not set rss_indir if RSS is not supported

2024-03-27 Thread Xuan Zhuo
On Wed, 27 Mar 2024 06:51:25 -0700, Breno Leitao  wrote:
> Hello Xuan,
>
> On Wed, Mar 27, 2024 at 09:37:43AM +0800, Xuan Zhuo wrote:
> > On Tue, 26 Mar 2024 08:19:08 -0700, Breno Leitao  wrote:
> > > Do not set virtnet_info->rss_indir_table_size if RSS is not available
> > > for the device.
> > >
> > > Currently, rss_indir_table_size is set if either has_rss or
> > > has_rss_hash_report is available, but, it should only be set if has_rss
> > > is set.
> > >
> > > On the virtnet_set_rxfh(), return an invalid command if the request has
> > > indirection table set, but virtnet does not support RSS.
> > >
> > > Suggested-by: Heng Qi 
> > > Signed-off-by: Breno Leitao 
> > > ---
> > >  drivers/net/virtio_net.c | 9 +++--
> > >  1 file changed, 7 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> > > index c22d1118a133..c640fdf28fc5 100644
> > > --- a/drivers/net/virtio_net.c
> > > +++ b/drivers/net/virtio_net.c
> > > @@ -3813,6 +3813,9 @@ static int virtnet_set_rxfh(struct net_device *dev,
> > >   rxfh->hfunc != ETH_RSS_HASH_TOP)
> > >   return -EOPNOTSUPP;
> > >
> > > + if (rxfh->indir && !vi->has_rss)
> > > + return -EINVAL;
> > > +
> > >   if (rxfh->indir) {
> >
> > Put !vi->has_rss here?
>
> I am not sure I understand the suggestion. Where do you suggest we have
> !vi->has_rss?
>
> If we got this far, we either have:
>
> a) rxfh->indir set and vi->has_rss is also set
> b) rxfh->indir not set. (vi->has_rss could be set or not).


This function does two tasks.
1. update indir table
2. update rss key

#1 only for has_rss
#2 for has_rss or has_rss_hash_report


So I would code:

bool update = false;

if (rxfh->indir) {
	if (!vi->has_rss)
		return -EINVAL;

	for (i = 0; i < vi->rss_indir_table_size; ++i)
		vi->ctrl->rss.indirection_table[i] = rxfh->indir[i];

	update = true;
}

if (rxfh->key) {
	if (!vi->has_rss && !vi->has_rss_hash_report)
		return -EINVAL;

	memcpy(vi->ctrl->rss.key, rxfh->key, vi->rss_key_size);
	update = true;
}

if (update)
	virtnet_commit_rss_command(vi);

Thanks.


>
> Thanks for the review,
> Breno



Re: [PATCH v6 2/2] memory tier: create CPUless memory tiers after obtaining HMAT info

2024-03-27 Thread Huang, Ying
"Ho-Ren (Jack) Chuang"  writes:

[snip]

> @@ -655,6 +672,34 @@ void mt_put_memory_types(struct list_head *memory_types)
>  }
>  EXPORT_SYMBOL_GPL(mt_put_memory_types);
>  
> +/*
> + * This is invoked via `late_initcall()` to initialize memory tiers for
> + * CPU-less memory nodes after driver initialization, which is
> + * expected to provide `adistance` algorithms.
> + */
> +static int __init memory_tier_late_init(void)
> +{
> + int nid;
> +
> + mutex_lock(&memory_tier_lock);
> + for_each_node_state(nid, N_MEMORY)
> + if (!node_state(nid, N_CPU) &&
> + node_memory_types[nid].memtype == NULL)

Think about this again.  It seems that it is better to check
"node_memory_types[nid].memtype == NULL" only here.  Because for all
nodes with N_CPU in memory_tier_init(), "node_memory_types[nid].memtype"
will be !NULL.  And it's possible (in theory) that some nodes become
"node_state(nid, N_CPU) == true" between memory_tier_init() and
memory_tier_late_init().

Otherwise, looks good to me.  Feel free to add

Reviewed-by: "Huang, Ying" 

in the future version.

> + /*
> +  * Some device drivers may have initialized memory tiers
> +  * between `memory_tier_init()` and 
> `memory_tier_late_init()`,
> +  * potentially bringing online memory nodes and
> +  * configuring memory tiers. Exclude them here.
> +  */
> + set_node_memory_tier(nid);
> +
> + establish_demotion_targets();
> + mutex_unlock(&memory_tier_lock);
> +
> + return 0;
> +}
> +late_initcall(memory_tier_late_init);
> +

[snip]
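For reference, Ying's suggestion boils down to dropping the N_CPU test
from the loop (a sketch, not the posted patch):

    mutex_lock(&memory_tier_lock);
    for_each_node_state(nid, N_MEMORY)
            /* Nodes that had a CPU at memory_tier_init() time already
             * have a non-NULL memtype, and a node may gain N_CPU between
             * the two initcalls, so the memtype check alone suffices.
             */
            if (node_memory_types[nid].memtype == NULL)
                    set_node_memory_tier(nid);

    establish_demotion_targets();
    mutex_unlock(&memory_tier_lock);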

--
Best Regards,
Huang, Ying



Re: [PATCH v9 15/15] selftests/sgx: Add scripts for EPC cgroup testing

2024-03-27 Thread Haitao Huang
On Wed, 27 Mar 2024 11:56:35 -0500, Jarkko Sakkinen   
wrote:



On Wed Mar 27, 2024 at 2:55 PM EET, Jarkko Sakkinen wrote:

On Mon, 2024-02-05 at 13:06 -0800, Haitao Huang wrote:
> The scripts rely on cgroup-tools package from libcgroup [1].
>
> To run selftests for epc cgroup:
>
> sudo ./run_epc_cg_selftests.sh
>
> To watch misc cgroup 'current' changes during testing, run this in a
> separate terminal:
>
> ./watch_misc_for_tests.sh current
>
> With different cgroups, the script starts one or multiple concurrent SGX
> selftests, each to run one unclobbered_vdso_oversubscribed test. Each
> such test tries to load an enclave of EPC size equal to the EPC
> capacity available on the platform. The script checks results against
> the expectation set for each cgroup and reports success or failure.
>
> The script creates 3 different cgroups at the beginning with the
> following expectations:
>
> 1) SMALL - intentionally small enough to fail the test loading an
> enclave of size equal to the capacity.
> 2) LARGE - large enough to run up to 4 concurrent tests but fail some
> if more than 4 concurrent tests are run. The script starts 4 expecting
> at least one test to pass, and then starts 5 expecting at least one
> test to fail.
> 3) LARGER - limit is the same as the capacity, large enough to run
> lots of concurrent tests. The script starts 8 of them and expects all
> to pass. Then it reruns the same test with one process randomly killed
> and usage checked to be zero after all processes exit.
>
> The script also includes a test with low mem_cg limit and LARGE sgx_epc
> limit to verify that the RAM used for per-cgroup reclamation is charged
> to a proper mem_cg.
>
> [1] https://github.com/libcgroup/libcgroup/blob/main/README
>
> Signed-off-by: Haitao Huang 
> ---
> V7:
> - Added memcontrol test.
>
> V5:
> - Added script with automatic results checking, removed the
> interactive script.
> - The script can run independently from the series below.
> ---
>  .../selftests/sgx/run_epc_cg_selftests.sh | 246 ++++++++++++++++++++++
>  .../selftests/sgx/watch_misc_for_tests.sh |  13 +
>  2 files changed, 259 insertions(+)
>  create mode 100755 tools/testing/selftests/sgx/run_epc_cg_selftests.sh
>  create mode 100755 tools/testing/selftests/sgx/watch_misc_for_tests.sh
>
> diff --git a/tools/testing/selftests/sgx/run_epc_cg_selftests.sh
> b/tools/testing/selftests/sgx/run_epc_cg_selftests.sh
> new file mode 100755
> index ..e027bf39f005
> --- /dev/null
> +++ b/tools/testing/selftests/sgx/run_epc_cg_selftests.sh
> @@ -0,0 +1,246 @@
> +#!/bin/bash

This is not portable and does not always hold in the wild.

It often does not hold, as it is not uncommon for bash to be placed
at /usr/bin/bash. If I recall correctly, e.g. NixOS has
a path that is neither of those two.

Should be #!/usr/bin/env bash

That is the POSIX-compatible form.

Just got around trying to test this in NUC7 so looking into this in
more detail.

That said can you make the script work with just "#!/usr/bin/env sh"
and make sure that it is busybox ash compatible?

I don't see any necessity to make this bash only and it adds to the
compilation time of the image. Otherwise lot of this could be tested
just with qemu+bzImage+busybox(inside initramfs).

Now you are adding fully glibc shenanigans for the sake of syntax
sugar.

> +# SPDX-License-Identifier: GPL-2.0
> +# Copyright(c) 2023 Intel Corporation.
> +
> +TEST_ROOT_CG=selftest
> +cgcreate -g misc:$TEST_ROOT_CG

How do you know that cgcreate exists? It is used a lot in the script
with no check for the existence. Please fix e.g. with "command -v
cgcreate".

> +if [ $? -ne 0 ]; then
> +echo "# Please make sure cgroup-tools is installed, and misc
> cgroup is mounted."
> +exit 1
> +fi

And please do not do it this way. Also, please remove the advice for
"cgroup-tools". This is not meant to be Debian-only. Better would be
to e.g. point out the URL of the upstream project.

And yeah the whole message should be based on "command -v", not like
this.

> +TEST_CG_SUB1=$TEST_ROOT_CG/test1
> +TEST_CG_SUB2=$TEST_ROOT_CG/test2
> +# We will only set limit in test1 and run tests in test3
> +TEST_CG_SUB3=$TEST_ROOT_CG/test1/test3
> +TEST_CG_SUB4=$TEST_ROOT_CG/test4
> +
> +cgcreate -g misc:$TEST_CG_SUB1



> +cgcreate -g misc:$TEST_CG_SUB2
> +cgcreate -g misc:$TEST_CG_SUB3
> +cgcreate -g misc:$TEST_CG_SUB4
> +
> +# Default to V2
> +CG_MISC_ROOT=/sys/fs/cgroup
> +CG_MEM_ROOT=/sys/fs/cgroup
> +CG_V1=0
> +if [ ! -d "/sys/fs/cgroup/misc" ]; then
> +echo "# cgroup V2 is in use."
> +else
> +echo "# cgroup V1 is in use."

Is "#" prefix a standard for kselftest? I don't know this, thus asking.

> +CG_MISC_ROOT=/sys/fs/cgroup/misc
> +CG_MEM_ROOT=/sys/fs/cgroup/memory
> +CG_V1=1

Have you checked what is the indentation policy for bash scripts inside
kernel tree. I don't know what it is. That's why I'm asking.

> +fi
> +
> +CAPACITY=$(grep "sgx_epc" "$CG_MISC_ROOT/misc.capacity" | awk

Re: [PATCH] uprobes: reduce contention on uprobes_tree access

2024-03-27 Thread Andrii Nakryiko
On Wed, Mar 27, 2024 at 5:18 PM Masami Hiramatsu  wrote:
>
> On Wed, 27 Mar 2024 17:06:01 +
> Jonathan Haslam  wrote:
>
> > > > Masami,
> > > >
> > > > Given the discussion around per-cpu rw semaphore and need for
> > > > (internal) batched attachment API for uprobes, do you think you can
> > > > apply this patch as is for now? We can then gain initial improvements
> > > > in scalability that are also easy to backport, and Jonathan will work
> > > > on a more complete solution based on per-cpu RW semaphore, as
> > > > suggested by Ingo.
> > >
> > > Yeah, it is interesting to use per-cpu rw semaphore on uprobe.
> > > I would like to wait for the next version.
> >
> > My initial tests show a nice improvement with the RW spinlocks but a
> > significant regression in acquiring a write lock. I've got a few days
> > vacation over Easter but I'll aim to get some more formalised results out
> > to the thread toward the end of next week.
>
> As long as the write lock is only on the cold path, I think you can choose
> the per-cpu RW semaphore. Since it does not do busy waiting, the total system
> performance impact will be small.

No, Masami, unfortunately it's not as simple. In BPF we have BPF
multi-uprobe, which can be used to attach to thousands of user
functions. It currently creates one uprobe at a time, as we don't
really have a batched API. If each such uprobe registration will now
take a (relatively) long time, when multiplied by number of attach-to
user functions, it will be a horrible regression in terms of
attachment/detachment performance.

So when we switch to per-CPU rw semaphore, we'll need to provide an
internal batch uprobe attach/detach API to make sure that attaching to
multiple uprobes is still fast.

Which is why I was asking to land this patch as is, as it relieves the
scalability pains in production and is easy to backport to old
kernels. And then we can work on batched APIs and switch to per-CPU rw
semaphore.

So I hope you can reconsider and accept improvements in this patch,
while Jonathan will keep working on even better final solution.
Thanks!
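For reference, the per-CPU rw semaphore variant under discussion would
look roughly like this (a sketch, not an actual posted patch): readers
become nearly free, while writers, i.e. the registration path, pay for
a grace period, which is exactly the attach-time cost a batched API
would amortize.

    static DEFINE_STATIC_PERCPU_RWSEM(uprobes_treelock);

    static struct uprobe *find_uprobe(struct inode *inode, loff_t offset)
    {
            struct uprobe *uprobe;

            /* Hot path: cheap, scalable read-side lock. */
            percpu_down_read(&uprobes_treelock);
            uprobe = __find_uprobe(inode, offset);
            percpu_up_read(&uprobes_treelock);

            return uprobe;
    }

    static struct uprobe *insert_uprobe(struct uprobe *uprobe)
    {
            struct uprobe *u;

            /* Cold path, but hit once per uprobe at attach time:
             * expensive write-side lock.
             */
            percpu_down_write(&uprobes_treelock);
            u = __insert_uprobe(uprobe);
            percpu_up_write(&uprobes_treelock);

            return u;
    }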

> I look forward to your formalized results :)
>
> Thank you,
>
> >
> > Jon.
> >
> > >
> > > Thank you,
> > >
> > > >
> > > > >
> > > > > BTW, how did you measure the overhead? I think spinlock overhead
> > > > > will depend on how much lock contention happens.
> > > > >
> > > > > Thank you,
> > > > >
> > > > > >
> > > > > > [0] https://docs.kernel.org/locking/spinlocks.html
> > > > > >
> > > > > > Signed-off-by: Jonathan Haslam 
> > > > > > ---
> > > > > >  kernel/events/uprobes.c | 22 +++---
> > > > > >  1 file changed, 11 insertions(+), 11 deletions(-)
> > > > > >
> > > > > > diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
> > > > > > index 929e98c62965..42bf9b6e8bc0 100644
> > > > > > --- a/kernel/events/uprobes.c
> > > > > > +++ b/kernel/events/uprobes.c
> > > > > > @@ -39,7 +39,7 @@ static struct rb_root uprobes_tree = RB_ROOT;
> > > > > >   */
> > > > > >  #define no_uprobe_events()   RB_EMPTY_ROOT(&uprobes_tree)
> > > > > >
> > > > > > -static DEFINE_SPINLOCK(uprobes_treelock);  /* serialize rbtree access */
> > > > > > +static DEFINE_RWLOCK(uprobes_treelock);    /* serialize rbtree access */
> > > > > >
> > > > > >  #define UPROBES_HASH_SZ  13
> > > > > >  /* serialize uprobe->pending_list */
> > > > > > @@ -669,9 +669,9 @@ static struct uprobe *find_uprobe(struct inode 
> > > > > > *inode, loff_t offset)
> > > > > >  {
> > > > > >   struct uprobe *uprobe;
> > > > > >
> > > > > > - spin_lock(&uprobes_treelock);
> > > > > > + read_lock(&uprobes_treelock);
> > > > > >   uprobe = __find_uprobe(inode, offset);
> > > > > > - spin_unlock(&uprobes_treelock);
> > > > > > + read_unlock(&uprobes_treelock);
> > > > > >
> > > > > >   return uprobe;
> > > > > >  }
> > > > > > @@ -701,9 +701,9 @@ static struct uprobe *insert_uprobe(struct 
> > > > > > uprobe *uprobe)
> > > > > >  {
> > > > > >   struct uprobe *u;
> > > > > >
> > > > > > - spin_lock(&uprobes_treelock);
> > > > > > + write_lock(&uprobes_treelock);
> > > > > >   u = __insert_uprobe(uprobe);
> > > > > > - spin_unlock(&uprobes_treelock);
> > > > > > + write_unlock(&uprobes_treelock);
> > > > > >
> > > > > >   return u;
> > > > > >  }
> > > > > > @@ -935,9 +935,9 @@ static void delete_uprobe(struct uprobe *uprobe)
> > > > > >   if (WARN_ON(!uprobe_is_active(uprobe)))
> > > > > >   return;
> > > > > >
> > > > > > - spin_lock(&uprobes_treelock);
> > > > > > + write_lock(&uprobes_treelock);
> > > > > >   rb_erase(&uprobe->rb_node, &uprobes_tree);
> > > > > > - spin_unlock(&uprobes_treelock);
> > > > > > + write_unlock(&uprobes_treelock);
> > > > > >   RB_CLEAR_NODE(&uprobe->rb_node); /* for uprobe_is_active() */
> > > > > >   put_uprobe(uprobe);
> > > > > >  }
> > > > > > @@ -1298,7 +1298,7 @@ static void build_probe_list(struct inode 
> > > > > > *inode,
> > > > > >   min = vaddr_to_offset(vma, start);
> > > > > >   max = min + (end - start) - 1;
> > > 

Re: [PATCH v19 RESEND 4/5] Documentation: tracing: Add ring-buffer mapping

2024-03-27 Thread Randy Dunlap
Hi,

On 3/26/24 03:08, Vincent Donnefort wrote:
> It is now possible to mmap() a ring-buffer to stream its content. Add
> some documentation and a code example.
> 
> Signed-off-by: Vincent Donnefort 
> 

> diff --git a/Documentation/trace/ring-buffer-map.rst 
> b/Documentation/trace/ring-buffer-map.rst
> new file mode 100644
> index ..0426ab4bcf3d
> --- /dev/null
> +++ b/Documentation/trace/ring-buffer-map.rst
> @@ -0,0 +1,106 @@
> +.. SPDX-License-Identifier: GPL-2.0
> +
> +==================================
> +Tracefs ring-buffer memory mapping
> +==================================
> +
> +:Author: Vincent Donnefort 
> +
> +Overview
> +========
> +
> +Tracefs ring-buffer memory map provides an efficient method to stream data
> +as no memory copy is necessary. The application mapping the ring-buffer
> +becomes then a consumer for that ring-buffer, in a similar fashion to
> +trace_pipe.
> +
> +Memory mapping setup
> +====================
> +
> +The mapping works with a mmap() of the trace_pipe_raw interface.
> +
> +The first system page of the mapping contains ring-buffer statistics and
> +description. It is referred as the meta-page. One of the most important field of

  referred to as   
fields

> +the meta-page is the reader. It contains the sub-buffer ID which can be safely
> +read by the mapper (see ring-buffer-design.rst).
> +
> +The meta-page is followed by all the sub-buffers, ordered by ascendant ID. It is

ascending
might be better IMO.

> +therefore effortless to know where the reader starts in the mapping:
> +
> +.. code-block:: c
> +
> +reader_id = meta->reader->id;
> +reader_offset = meta->meta_page_size + reader_id * meta->subbuf_size;
> +
> +When the application is done with the current reader, it can get a new one using
> +the trace_pipe_raw ioctl() TRACE_MMAP_IOCTL_GET_READER. This ioctl also updates
> +the meta-page fields.


[snip]

Thanks.
-- 
#Randy
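Pulling the documented pieces together, a consumer loop would look
roughly like the sketch below (the meta_page struct name, map_len, done
and consume() are illustrative only; the quoted doc defines the reader
fields and the TRACE_MMAP_IOCTL_GET_READER ioctl):

    int fd = open("/sys/kernel/tracing/per_cpu/cpu0/trace_pipe_raw", O_RDONLY);
    char *map = mmap(NULL, map_len, PROT_READ, MAP_SHARED, fd, 0);
    struct meta_page *meta = (struct meta_page *)map; /* first system page */

    while (!done) {
            /* Locate the sub-buffer currently owned by the reader. */
            int reader_id = meta->reader->id;
            size_t reader_offset = meta->meta_page_size +
                                   reader_id * meta->subbuf_size;

            consume(map + reader_offset, meta->subbuf_size);

            /* Done with this reader: swap in a new sub-buffer and
             * refresh the meta-page fields.
             */
            ioctl(fd, TRACE_MMAP_IOCTL_GET_READER);
    }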



[PATCH v10 07/14] x86/sgx: Abstract tracking reclaimable pages in LRU

2024-03-27 Thread Haitao Huang
From: Kristen Carlson Accardi 

The functions, sgx_{mark,unmark}_page_reclaimable(), manage the tracking
of reclaimable EPC pages: sgx_mark_page_reclaimable() adds a newly
allocated page into the global LRU list while
sgx_unmark_page_reclaimable() does the opposite. Abstract the hard coded
global LRU references in these functions to make them reusable when
pages are tracked in per-cgroup LRUs.

Create a helper, sgx_lru_list(), that returns the LRU that tracks a given
EPC page. It simply returns the global LRU now, and will later return
the LRU of the cgroup within which the EPC page was allocated. Replace
the hard coded global LRU with a call to this helper.

The next patches will first get the cgroup reclamation flow ready while
keeping pages tracked in the global LRU and reclaimed by ksgxd, before
finally switching sgx_lru_list() to return the per-cgroup LRU.

Co-developed-by: Sean Christopherson 
Signed-off-by: Sean Christopherson 
Signed-off-by: Kristen Carlson Accardi 
Co-developed-by: Haitao Huang 
Signed-off-by: Haitao Huang 
Reviewed-by: Jarkko Sakkinen 
---
V7:
- Split this out from the big patch, #10 in V6. (Dave, Kai)
---
 arch/x86/kernel/cpu/sgx/main.c | 30 ++
 1 file changed, 18 insertions(+), 12 deletions(-)

diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
index 4991eb0af748..8f83f7ac386e 100644
--- a/arch/x86/kernel/cpu/sgx/main.c
+++ b/arch/x86/kernel/cpu/sgx/main.c
@@ -32,6 +32,11 @@ static DEFINE_XARRAY(sgx_epc_address_space);
  */
 static struct sgx_epc_lru_list sgx_global_lru;
 
+static inline struct sgx_epc_lru_list *sgx_lru_list(struct sgx_epc_page *epc_page)
+{
+	return &sgx_global_lru;
+}
+
 static atomic_long_t sgx_nr_free_pages = ATOMIC_LONG_INIT(0);
 
 /* Nodes with one or more EPC sections. */
@@ -500,25 +505,24 @@ static struct sgx_epc_page *__sgx_alloc_epc_page(void)
 }
 
 /**
- * sgx_mark_page_reclaimable() - Mark a page as reclaimable
+ * sgx_mark_page_reclaimable() - Mark a page as reclaimable and track it in a LRU.
  * @page:  EPC page
- *
- * Mark a page as reclaimable and add it to the active page list. Pages
- * are automatically removed from the active list when freed.
  */
 void sgx_mark_page_reclaimable(struct sgx_epc_page *page)
 {
-	spin_lock(&sgx_global_lru.lock);
+	struct sgx_epc_lru_list *lru = sgx_lru_list(page);
+
+	spin_lock(&lru->lock);
 	page->flags |= SGX_EPC_PAGE_RECLAIMER_TRACKED;
-	list_add_tail(&page->list, &sgx_global_lru.reclaimable);
-	spin_unlock(&sgx_global_lru.lock);
+	list_add_tail(&page->list, &lru->reclaimable);
+	spin_unlock(&lru->lock);
 }
 
 /**
- * sgx_unmark_page_reclaimable() - Remove a page from the reclaim list
+ * sgx_unmark_page_reclaimable() - Remove a page from its tracking LRU
  * @page:  EPC page
  *
- * Clear the reclaimable flag and remove the page from the active page list.
+ * Clear the reclaimable flag if set and remove the page from its LRU.
  *
  * Return:
  *   0 on success,
@@ -526,18 +530,20 @@ void sgx_mark_page_reclaimable(struct sgx_epc_page *page)
  */
 int sgx_unmark_page_reclaimable(struct sgx_epc_page *page)
 {
-	spin_lock(&sgx_global_lru.lock);
+	struct sgx_epc_lru_list *lru = sgx_lru_list(page);
+
+	spin_lock(&lru->lock);
 	if (page->flags & SGX_EPC_PAGE_RECLAIMER_TRACKED) {
 		/* The page is being reclaimed. */
 		if (list_empty(&page->list)) {
-			spin_unlock(&sgx_global_lru.lock);
+			spin_unlock(&lru->lock);
 			return -EBUSY;
 		}

 		list_del(&page->list);
 		page->flags &= ~SGX_EPC_PAGE_RECLAIMER_TRACKED;
 	}
-	spin_unlock(&sgx_global_lru.lock);
+	spin_unlock(&lru->lock);
 
return 0;
 }
-- 
2.25.1
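For a sense of where this abstraction is headed: once patch 12/14 turns
on per-cgroup tracking, the helper conceptually becomes something like
the sketch below (field names are illustrative, not the actual patch;
each EPC page belongs to exactly one cgroup, or the root by default).

    static inline struct sgx_epc_lru_list *sgx_lru_list(struct sgx_epc_page *epc_page)
    {
    #ifdef CONFIG_CGROUP_MISC
            if (epc_page->sgx_cg)
                    return &epc_page->sgx_cg->lru;
    #endif
            return &sgx_global_lru;
    }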




Re: [PATCH v2 1/2] vhost: Add smp_rmb() in vhost_vq_avail_empty()

2024-03-27 Thread Gavin Shan

On 3/27/24 17:42, Jason Wang wrote:

On Wed, Mar 27, 2024 at 3:35 PM Gavin Shan  wrote:


On 3/27/24 14:08, Gavin Shan wrote:

On 3/27/24 12:44, Jason Wang wrote:

On Wed, Mar 27, 2024 at 10:34 AM Jason Wang  wrote:

On Wed, Mar 27, 2024 at 7:39 AM Gavin Shan  wrote:


A smp_rmb() has been missed in vhost_vq_avail_empty(), spotted by
Will Deacon . Otherwise, it's not ensured the
available ring entries pushed by guest can be observed by vhost
in time, leading to stale available ring entries fetched by vhost
in vhost_get_vq_desc(), as reported by Yihuang Yu on NVidia's
grace-hopper (ARM64) platform.

/home/gavin/sandbox/qemu.main/build/qemu-system-aarch64  \
-accel kvm -machine virt,gic-version=host -cpu host  \
-smp maxcpus=1,cpus=1,sockets=1,clusters=1,cores=1,threads=1 \
-m 4096M,slots=16,maxmem=64G \
-object memory-backend-ram,id=mem0,size=4096M\
 :   \
-netdev tap,id=vnet0,vhost=true  \
-device virtio-net-pci,bus=pcie.8,netdev=vnet0,mac=52:54:00:f1:26:b0
 :
guest# netperf -H 10.26.1.81 -l 60 -C -c -t UDP_STREAM
virtio_net virtio0: output.0:id 100 is not a head!

Add the missed smp_rmb() in vhost_vq_avail_empty(). Note that it
should be safe until vq->avail_idx is changed by commit 275bf960ac697
("vhost: better detection of available buffers").

Fixes: 275bf960ac697 ("vhost: better detection of available buffers")
Cc:  # v4.11+
Reported-by: Yihuang Yu 
Signed-off-by: Gavin Shan 
---
   drivers/vhost/vhost.c | 11 ++-
   1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 045f666b4f12..00445ab172b3 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -2799,9 +2799,18 @@ bool vhost_vq_avail_empty(struct vhost_dev *dev, struct 
vhost_virtqueue *vq)
  r = vhost_get_avail_idx(vq, &avail_idx);
  if (unlikely(r))
  return false;
+
  vq->avail_idx = vhost16_to_cpu(vq, avail_idx);
+   if (vq->avail_idx != vq->last_avail_idx) {
+   /* Similar to what's done in vhost_get_vq_desc(), we need
+* to ensure the available ring entries have been exposed
+* by guest.
+*/


We need to be more verbose here. For example, which load needs to be
ordered with which load.

The rmb in vhost_get_vq_desc() is used to order the load of avail idx
and the load of head. It is paired with e.g virtio_wmb() in
virtqueue_add_split().

vhost_vq_avail_empty() are mostly used as a hint in
vhost_net_busy_poll() which is under the protection of the vq mutex.

An exception is the tx_can_batch(), but in that case it doesn't even
want to read the head.


Ok, if it is needed only in that path, maybe we can move the barriers there.



[cc Will Deacon]

Jason, I appreciate your review and comments. I think PATCH[1/2] is
the fix for the hypothesis, meaning PATCH[2/2] is the real fix. However,
it would be nice to fix all of them in one shot. I will try with PATCH[2/2]
only to see if our issue will disappear or not. However, the issue still
exists if PATCH[2/2] is missed.



Jason, PATCH[2/2] is sufficient to fix our current issue. I tried with
PATCH[2/2] only and was unable to hit the issue. However, PATCH[1/2] may
be needed by other scenarios, so it would be nice to fix them in one shot.


Yes, see below.





Firstly, we were failing on the transmit queue and {tvq, rvq}->busyloop_timeout
== false if I remember correctly. So the added smp_rmb() in vhost_vq_avail_empty()
is only a concern to tx_can_batch(). A mutex isn't enough to ensure the order
for the available index and the available ring entry (head). For example,
vhost_vq_avail_empty() called by tx_can_batch() can see the next available
index, but its corresponding available ring entry (head) may not be seen by
vhost yet if smp_rmb() is missed. The next call to get_tx_bufs(), where the
available ring entry (head) hasn't arrived yet, leads to a stale available
ring entry (head) being fetched.

handle_tx_copy
  get_tx_bufs // smp_rmb() won't be executed when 
vq->avail_idx != vq->last_avail_idx
  tx_can_batch
vhost_vq_avail_empty  // vq->avail_idx is updated from 
vq->avail->idx

The reason why I added smp_rmb() to vhost_vq_avail_empty() is because the
function is an exposed API, even though it's only used by drivers/vhost/net.c
at present. It means the API has been broken internally. So it seems more
appropriate to fix it up in vhost_vq_avail_empty() so that the API's users
needn't worry about the memory access order.


When tx_can_batch returns true it means there are still pending tx
buffers. Since it might read indices, it can still bypass the
smp_rmb() in vhost_get_vq_desc().

I'd suggest adding the above to the change log.

With this,

Acked-by: Jason Wang 



Thanks, Jason. The change log has 

Re: [PATCH v2 1/2] vhost: Add smp_rmb() in vhost_vq_avail_empty()

2024-03-27 Thread Gavin Shan

On 3/27/24 22:07, Michael S. Tsirkin wrote:

On Wed, Mar 27, 2024 at 09:38:45AM +1000, Gavin Shan wrote:

A smp_rmb() has been missed in vhost_vq_avail_empty(), spotted by
Will Deacon . Otherwise, it's not ensured the
available ring entries pushed by guest can be observed by vhost
in time, leading to stale available ring entries fetched by vhost
in vhost_get_vq_desc(), as reported by Yihuang Yu on NVidia's
grace-hopper (ARM64) platform.

   /home/gavin/sandbox/qemu.main/build/qemu-system-aarch64  \
   -accel kvm -machine virt,gic-version=host -cpu host  \
   -smp maxcpus=1,cpus=1,sockets=1,clusters=1,cores=1,threads=1 \
   -m 4096M,slots=16,maxmem=64G \
   -object memory-backend-ram,id=mem0,size=4096M\
:   \
   -netdev tap,id=vnet0,vhost=true  \
   -device virtio-net-pci,bus=pcie.8,netdev=vnet0,mac=52:54:00:f1:26:b0
:
   guest# netperf -H 10.26.1.81 -l 60 -C -c -t UDP_STREAM
   virtio_net virtio0: output.0:id 100 is not a head!

Add the missed smp_rmb() in vhost_vq_avail_empty(). Note that it
should be safe until vq->avail_idx is changed by commit 275bf960ac697
("vhost: better detection of available buffers").

Fixes: 275bf960ac697 ("vhost: better detection of available buffers")
Cc:  # v4.11+
Reported-by: Yihuang Yu 
Signed-off-by: Gavin Shan 
---
  drivers/vhost/vhost.c | 11 ++-
  1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 045f666b4f12..00445ab172b3 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -2799,9 +2799,18 @@ bool vhost_vq_avail_empty(struct vhost_dev *dev, struct 
vhost_virtqueue *vq)
 r = vhost_get_avail_idx(vq, &avail_idx);
if (unlikely(r))
return false;
+
vq->avail_idx = vhost16_to_cpu(vq, avail_idx);
+   if (vq->avail_idx != vq->last_avail_idx) {
+   /* Similar to what's done in vhost_get_vq_desc(), we need
+* to ensure the available ring entries have been exposed
+* by guest.
+*/


A slightly clearer comment:

/* Since we have updated avail_idx, the following call to
  * vhost_get_vq_desc will read available ring entries.
  * Make sure that read happens after the avail_idx read.
  */

Pls repost with that, and I will apply.

Also add suggested-by for will.



Sure, the suggested comments have been included to v3.




+   smp_rmb();
+   return false;
+   }
  
-	return vq->avail_idx == vq->last_avail_idx;

+   return true;
  }
  EXPORT_SYMBOL_GPL(vhost_vq_avail_empty);


As a follow-up patch, we should clean out code duplication that
accumulated with 3 places reading avail idx in essentially
the same way - this duplication is what causes the mess in
the 1st place.



Yes, nice idea. I've added PATCH[v3 3/3] to improve vhost_get_avail_idx()
to handle the memory barrier since all the callers have the concern.

v3: 
https://lore.kernel.org/virtualization/20240328002149.1141302-1-gs...@redhat.com/

Thanks,
Gavin
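For reference, with both hunks applied the function reads as follows (a
sketch assembled from the diff above plus the unchanged early-exit check
upstream; comment wording per Michael's suggestion):

    bool vhost_vq_avail_empty(struct vhost_dev *dev, struct vhost_virtqueue *vq)
    {
            __virtio16 avail_idx;
            int r;

            if (vq->avail_idx != vq->last_avail_idx)
                    return false;

            r = vhost_get_avail_idx(vq, &avail_idx);
            if (unlikely(r))
                    return false;

            vq->avail_idx = vhost16_to_cpu(vq, avail_idx);
            if (vq->avail_idx != vq->last_avail_idx) {
                    /* Since we have updated avail_idx, the following call to
                     * vhost_get_vq_desc will read available ring entries.
                     * Make sure that read happens after the avail_idx read.
                     */
                    smp_rmb();
                    return false;
            }

            return true;
    }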




Re: [PATCH v9 08/15] x86/sgx: Implement EPC reclamation flows for cgroup

2024-03-27 Thread Haitao Huang

On Thu, 22 Feb 2024 16:24:47 -0600, Huang, Kai  wrote:




On 23/02/2024 9:12 am, Haitao Huang wrote:
On Wed, 21 Feb 2024 04:48:58 -0600, Huang, Kai   
wrote:



On Wed, 2024-02-21 at 00:23 -0600, Haitao Huang wrote:

Hi Kai,
On Tue, 20 Feb 2024 03:52:39 -0600, Huang, Kai   
wrote:

[...]
>
> So you introduced the work/workqueue here but there's no place which
> actually queues the work.  IMHO you can either:
>
> 1) move relevant code change here; or
> 2) focus on introducing core functions to reclaim certain pages from a
> given EPC cgroup w/o workqueue and introduce the work/workqueue in a
> later patch.

>
> Makes sense?
>

Starting in v7, I was trying to split the big patch, #10 in v6, as you
and others suggested. My thought process was to put the infrastructure
needed for per-cgroup reclaim in the front, then turn on per-cgroup
reclaim in [v9 13/15] in the end.


That's reasonable for sure.


 Thanks for the confirmation :-)



Before that, all reclaimables are tracked in the global LRU so really
there is no "reclaim certain pages from a given EPC cgroup w/o
workqueue" or reclaim through the workqueue before that point, as
suggested in #2. This patch puts down the implementation for both flows
but neither is used yet, as stated in the commit message.


I know it's not used yet.  The point is how to split patches to make
them more self-contained and easy to review.
I would think this patch is already self-contained in that all of it is
implementation of cgroup reclamation building blocks utilized later.
But I'll try to follow your suggestions below to split further (I would
prefer not to merge in general unless there is a strong reason).




For #2, sorry for not being explicit -- I meant it seems it's more
reasonable to split in this way:

Patch 1)
  a). change to sgx_reclaim_pages();
I'll still prefer this to be a separate patch. It is self-contained IMHO.
We were splitting the original patch because it was too big. I don't
want to merge back unless there is a strong reason.



  b). introduce sgx_epc_cgroup_reclaim_pages();

 Ok.


If I got you right, I believe you want to have a cgroup variant function  
following the same behaviour of the one for global reclaim, i.e., the  
_current_ sgx_reclaim_pages(), which always tries to scan and reclaim  
SGX_NR_TO_SCAN pages each time.


And this cgroup variant function, sgx_epc_cgroup_reclaim_pages(), tries  
to scan and reclaim SGX_NR_TO_SCAN pages each time "_across_ the cgroup  
and all the descendants".


And you want to implement sgx_epc_cgroup_reclaim_pages() in this way due  
to WHATEVER reasons.


In that case, the change to sgx_reclaim_pages() and the introduce of  
sgx_epc_cgroup_reclaim_pages() should really be together because they  
are completely tied together in terms of implementation.


In this way you can just explain clearly in _ONE_ patch why you choose  
this implementation, and for reviewer it's also easier to review because  
we can just discuss in one patch.


Makes sense?



  c). introduce sgx_epc_cgroup_reclaim_work_func() (use a better  
name), which just takes an EPC cgroup as input w/o involving any  
work/workqueue.
 This is for the workqueue use only. So I think it'd be better be with  
patch #2 below?


There are multiple levels of logic here IMHO:

   1. a) and b) above focus on "each reclaim" a given EPC cgroup
   2. c) is about a loop of above to bring given cgroup's usage to limit
   3. workqueue is one (probably best) way to do c) in async way
   4. the logic where 1) (direct reclaim) and 3) (indirect) are triggered

To me, it's clear 1) should be in one patch as stated above.

Also, to me 3) and 4) are better to be together since they give you a  
clear view on how the direct/indirect reclaim are triggered.


2) could be flexible depending on how you see it.  If you prefer viewing  
it from low-level implementation of reclaiming pages from cgroup, then  
it's also OK to be together with 1).  If you want to treat it as a part  
of _async_ way of bring down usage to limit, then _MAYBE_ it's also OK  
to be with 3) and 4).


But to me 2) can be together with 1) or even a separate patch because  
it's still kinda of low-level reclaiming details.  3) and 4) shouldn't  
contain such detail but should focus on how direct/indirect reclaim is  
done.


I incorporated most of your suggestions, and think it'd be better to
discuss this with actual code.


So I'm sending out v10, and just quickly summarize what I did to address  
this particular issue here.


I pretty much follow above suggestions and end up with two patches:

1) a) and b) above plus direct reclaim triggered in try_charge(), so
reviewers can see at least one use of sgx_cgroup_reclaim_pages(),
which is the basic building block.


2) All async related: c) above, the workqueue, and indirect reclaim
triggered in try_charge(), which queues the work.


Please review v10 and if you think the triggering parts need to be
separated then I'll separate them.



[PATCH v10 14/14] selftests/sgx: Add scripts for EPC cgroup testing

2024-03-27 Thread Haitao Huang
The scripts rely on cgroup-tools package from libcgroup [1].

To run selftests for epc cgroup:

sudo ./run_epc_cg_selftests.sh

To watch misc cgroup 'current' changes during testing, run this in a
separate terminal:

./watch_misc_for_tests.sh current

With different cgroups, the script starts one or multiple concurrent SGX
selftests, each to run one unclobbered_vdso_oversubscribed test.  Each
such test tries to load an enclave of EPC size equal to the EPC
capacity available on the platform. The script checks results against
the expectation set for each cgroup and reports success or failure.

The script creates 3 different cgroups at the beginning with following
expectations:

1) SMALL - intentionally small enough to fail the test loading an
enclave of size equal to the capacity.
2) LARGE - large enough to run up to 4 concurrent tests but fail some if
more than 4 concurrent tests are run. The script starts 4 expecting at
least one test to pass, and then starts 5 expecting at least one test
to fail.
3) LARGER - limit is the same as the capacity, large enough to run lots of
concurrent tests. The script starts 8 of them and expects all to pass.
Then it reruns the same test with one process randomly killed and
usage checked to be zero after all processes exit.

The script also includes a test with low mem_cg limit and LARGE sgx_epc
limit to verify that the RAM used for per-cgroup reclamation is charged
to a proper mem_cg.

[1] https://github.com/libcgroup/libcgroup/blob/main/README

Signed-off-by: Haitao Huang 
---
V7:
- Added memcontrol test.

V5:
- Added script with automatic results checking, removed the interactive
script.
- The script can run independently from the series below.
---
 .../selftests/sgx/run_epc_cg_selftests.sh | 246 ++++++++++++++++++++++
 .../selftests/sgx/watch_misc_for_tests.sh |  13 +
 2 files changed, 259 insertions(+)
 create mode 100755 tools/testing/selftests/sgx/run_epc_cg_selftests.sh
 create mode 100755 tools/testing/selftests/sgx/watch_misc_for_tests.sh

diff --git a/tools/testing/selftests/sgx/run_epc_cg_selftests.sh 
b/tools/testing/selftests/sgx/run_epc_cg_selftests.sh
new file mode 100755
index ..e027bf39f005
--- /dev/null
+++ b/tools/testing/selftests/sgx/run_epc_cg_selftests.sh
@@ -0,0 +1,246 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+# Copyright(c) 2023 Intel Corporation.
+
+TEST_ROOT_CG=selftest
+cgcreate -g misc:$TEST_ROOT_CG
+if [ $? -ne 0 ]; then
+echo "# Please make sure cgroup-tools is installed, and misc cgroup is 
mounted."
+exit 1
+fi
+TEST_CG_SUB1=$TEST_ROOT_CG/test1
+TEST_CG_SUB2=$TEST_ROOT_CG/test2
+# We will only set limit in test1 and run tests in test3
+TEST_CG_SUB3=$TEST_ROOT_CG/test1/test3
+TEST_CG_SUB4=$TEST_ROOT_CG/test4
+
+cgcreate -g misc:$TEST_CG_SUB1
+cgcreate -g misc:$TEST_CG_SUB2
+cgcreate -g misc:$TEST_CG_SUB3
+cgcreate -g misc:$TEST_CG_SUB4
+
+# Default to V2
+CG_MISC_ROOT=/sys/fs/cgroup
+CG_MEM_ROOT=/sys/fs/cgroup
+CG_V1=0
+if [ ! -d "/sys/fs/cgroup/misc" ]; then
+echo "# cgroup V2 is in use."
+else
+echo "# cgroup V1 is in use."
+CG_MISC_ROOT=/sys/fs/cgroup/misc
+CG_MEM_ROOT=/sys/fs/cgroup/memory
+CG_V1=1
+fi
+
+CAPACITY=$(grep "sgx_epc" "$CG_MISC_ROOT/misc.capacity" | awk '{print $2}')
+# This is below number of VA pages needed for enclave of capacity size. So
+# should fail oversubscribed cases
+SMALL=$(( CAPACITY / 512 ))
+
+# At least load one enclave of capacity size successfully, maybe up to 4.
+# But some may fail if we run more than 4 concurrent enclaves of capacity size.
+LARGE=$(( SMALL * 4 ))
+
+# Load lots of enclaves
+LARGER=$CAPACITY
+echo "# Setting up limits."
+echo "sgx_epc $SMALL" > $CG_MISC_ROOT/$TEST_CG_SUB1/misc.max
+echo "sgx_epc $LARGE" >  $CG_MISC_ROOT/$TEST_CG_SUB2/misc.max
+echo "sgx_epc $LARGER" > $CG_MISC_ROOT/$TEST_CG_SUB4/misc.max
+
+timestamp=$(date +%Y%m%d_%H%M%S)
+
+test_cmd="./test_sgx -t unclobbered_vdso_oversubscribed"
+
+wait_check_process_status() {
+local pid=$1
+local check_for_success=$2  # If 1, check for success;
+# If 0, check for failure
+wait "$pid"
+local status=$?
+
+if [[ $check_for_success -eq 1 && $status -eq 0 ]]; then
+echo "# Process $pid succeeded."
+return 0
+elif [[ $check_for_success -eq 0 && $status -ne 0 ]]; then
+echo "# Process $pid returned failure."
+return 0
+fi
+return 1
+}
+
+wait_and_detect_for_any() {
+local pids=("$@")
+local check_for_success=$1  # If 1, check for success;
+# If 0, check for failure
+local detected=1 # 0 for success detection
+
+for pid in "${pids[@]:1}"; do
+if wait_check_process_status "$pid" "$check_for_success"; then
+detected=0
+# Wait for other processes to exit
+fi
+done
+
+return $detected
+}
+
+echo "# Start unclobbered_vdso_oversubscribed with SMALL limit, expecting 
failure..."
+# Always use leaf node of misc 

[PATCH v10 13/14] Docs/x86/sgx: Add description for cgroup support

2024-03-27 Thread Haitao Huang
From: Sean Christopherson 

Add initial documentation of how to regulate the distribution of
SGX Enclave Page Cache (EPC) memory via the Miscellaneous cgroup
controller.

Signed-off-by: Sean Christopherson 
Co-developed-by: Kristen Carlson Accardi 
Signed-off-by: Kristen Carlson Accardi 
Co-developed-by: Haitao Huang
Signed-off-by: Haitao Huang
Cc: Sean Christopherson 
---
V8:
- Limit text width to 80 characters to be consistent.

V6:
- Remove mentioning of VMM specific behavior on handling SIGBUS
- Remove statement of forced reclamation, add statement to specify
ENOMEM returned when no reclamation possible.
- Added statements on the non-preemptive nature for the max limit
- Dropped Reviewed-by tag because of changes

V4:
- Fix indentation (Randy)
- Change misc.events file to be read-only
- Fix a typo for 'subsystem'
- Add behavior when VMM overcommit EPC with a cgroup (Mikko)
---
 Documentation/arch/x86/sgx.rst | 83 ++
 1 file changed, 83 insertions(+)

diff --git a/Documentation/arch/x86/sgx.rst b/Documentation/arch/x86/sgx.rst
index d90796adc2ec..c537e6a9aa65 100644
--- a/Documentation/arch/x86/sgx.rst
+++ b/Documentation/arch/x86/sgx.rst
@@ -300,3 +300,86 @@ to expected failures and handle them as follows:
first call.  It indicates a bug in the kernel or the userspace client
if any of the second round of ``SGX_IOC_VEPC_REMOVE_ALL`` calls has
a return code other than 0.
+
+
+Cgroup Support
+==
+
+The "sgx_epc" resource within the Miscellaneous cgroup controller regulates
+distribution of SGX EPC memory, which is a subset of system RAM that is used to
+provide SGX-enabled applications with protected memory, and is otherwise
+inaccessible, i.e. shows up as reserved in /proc/iomem and cannot be
+read/written outside of an SGX enclave.
+
+Although current systems implement EPC by stealing memory from RAM, for all
+intents and purposes the EPC is independent from normal system memory, e.g. 
must
+be reserved at boot from RAM and cannot be converted between EPC and normal
+memory while the system is running.  The EPC is managed by the SGX subsystem 
and
+is not accounted by the memory controller.  Note that this is true only for EPC
+memory itself, i.e.  normal memory allocations related to SGX and EPC memory,
+e.g. the backing memory for evicted EPC pages, are accounted, limited and
+protected by the memory controller.
+
+Much like normal system memory, EPC memory can be overcommitted via virtual
+memory techniques and pages can be swapped out of the EPC to their backing 
store
+(normal system memory allocated via shmem).  The SGX EPC subsystem is analogous
+to the memory subsystem, and it implements limit and protection models for EPC
+memory.
+
+SGX EPC Interface Files
+---
+
+For a generic description of the Miscellaneous controller interface files,
+please see Documentation/admin-guide/cgroup-v2.rst
+
+All SGX EPC memory amounts are in bytes unless explicitly stated otherwise. If
+a value which is not PAGE_SIZE aligned is written, the actual value used by the
+controller will be rounded down to the closest PAGE_SIZE multiple.
+
+  misc.capacity
+A read-only flat-keyed file shown only in the root cgroup. The sgx_epc
+resource will show the total amount of EPC memory available on the
+platform.
+
+  misc.current
+A read-only flat-keyed file shown in the non-root cgroups. The sgx_epc
+resource will show the current active EPC memory usage of the cgroup 
and
+its descendants. EPC pages that are swapped out to backing RAM are not
+included in the current count.
+
+  misc.max
+A read-write single value file which exists on non-root cgroups. The
+sgx_epc resource will show the EPC usage hard limit. The default is
+"max".
+
+If a cgroup's EPC usage reaches this limit, EPC allocations, e.g., for
+page fault handling, will be blocked until EPC can be reclaimed from 
the
+cgroup. If there are no pages left that are reclaimable within the same
+group, the kernel returns ENOMEM.
+
+The EPC pages allocated for a guest VM by the virtual EPC driver are 
not
+reclaimable by the host kernel. In case the guest cgroup's limit is
+reached and no reclaimable pages left in the same cgroup, the virtual
+EPC driver returns SIGBUS to the user space process to indicate failure
+on new EPC allocation requests.
+
+The misc.max limit is non-preemptive. If a user writes a limit lower
+than the current usage to this file, the cgroup will not preemptively
+deallocate pages currently in use, and will only start blocking the 
next
+allocation and reclaiming EPC at that time.
+
+  misc.events
+A read-only flat-keyed file which exists on non-root cgroups.
+A value change in this file generates a file modified event.
+
+  max
+The number of times 

[PATCH v10 12/14] x86/sgx: Turn on per-cgroup EPC reclamation

2024-03-27 Thread Haitao Huang
From: Kristen Carlson Accardi 

Previous patches have implemented all infrastructure needed for
per-cgroup EPC page tracking and reclaiming. But all reclaimable EPC
pages are still tracked in the global LRU as sgx_lru_list() returns hard
coded reference to the global LRU.

Change sgx_lru_list() to return the LRU of the cgroup in which the given
EPC page is allocated.

This makes all EPC pages tracked in per-cgroup LRUs and the global
reclaimer (ksgxd) will not be able to reclaim any pages from the global
LRU. However, in cases of over-committing, i.e., the sum of cgroup
limits greater than the total capacity, cgroups may never reclaim but
the total usage can still be near the capacity. Therefore a global
reclamation is still needed in those cases and it should be performed
from the root cgroup.

Modify sgx_reclaim_pages_global(), to reclaim from the root EPC cgroup
when cgroup is enabled, otherwise from the global LRU. Export
sgx_cgroup_reclaim_pages() in the header file so it can be reused for
this purpose.

Similarly, modify sgx_can_reclaim(), to check emptiness of LRUs of all
cgroups when EPC cgroup is enabled, otherwise only check the global LRU.
Export sgx_cgroup_lru_empty() so it can be reused for this purpose.

Finally, change sgx_reclaim_direct(), to check and ensure there are free
pages at cgroup level so forward progress can be made by the caller.
Export sgx_cgroup_should_reclaim() for reuse.

With these changes, the global reclamation and per-cgroup reclamation
both work properly with all pages tracked in per-cgroup LRUs.

Co-developed-by: Sean Christopherson 
Signed-off-by: Sean Christopherson 
Signed-off-by: Kristen Carlson Accardi 
Co-developed-by: Haitao Huang 
Signed-off-by: Haitao Huang 
---
V10:
- Add comment to clarify each page belongs to one cgroup, or the root by
default. (Kai)
- Merge the changes that expose sgx_cgroup_* functions to this patch.
- Add changes for sgx_reclaim_direct() that was missed previously.

V7:
- Split this out from the big patch, #10 in V6. (Dave, Kai)
---
 arch/x86/kernel/cpu/sgx/epc_cgroup.c |  6 +++---
 arch/x86/kernel/cpu/sgx/epc_cgroup.h |  7 +++
 arch/x86/kernel/cpu/sgx/main.c   | 29 +++-
 3 files changed, 38 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kernel/cpu/sgx/epc_cgroup.c 
b/arch/x86/kernel/cpu/sgx/epc_cgroup.c
index 1defbf213e8d..cacd9e93344e 100644
--- a/arch/x86/kernel/cpu/sgx/epc_cgroup.c
+++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.c
@@ -72,7 +72,7 @@ static inline u64 sgx_cgroup_max_pages_to_root(struct 
sgx_cgroup *sgx_cg)
  *
  * Return: %true if all cgroups under the specified root have empty LRU lists.
  */
-static bool sgx_cgroup_lru_empty(struct misc_cg *root)
+bool sgx_cgroup_lru_empty(struct misc_cg *root)
 {
struct cgroup_subsys_state *css_root;
struct cgroup_subsys_state *pos;
@@ -125,7 +125,7 @@ static bool sgx_cgroup_lru_empty(struct misc_cg *root)
  * triggering reclamation, and call cond_resched() in between iterations to
  * avoid indefinite blocking.
  */
-static void sgx_cgroup_reclaim_pages(struct misc_cg *root, struct mm_struct 
*charge_mm)
+void sgx_cgroup_reclaim_pages(struct misc_cg *root, struct mm_struct 
*charge_mm)
 {
struct cgroup_subsys_state *css_root;
struct cgroup_subsys_state *pos;
@@ -166,7 +166,7 @@ static void sgx_cgroup_reclaim_pages(struct misc_cg *root, 
struct mm_struct *cha
  * threshold (%SGX_CG_MIN_FREE_PAGE) and there are reclaimable pages within the
  * cgroup.
  */
-static bool sgx_cgroup_should_reclaim(struct sgx_cgroup *sgx_cg)
+bool sgx_cgroup_should_reclaim(struct sgx_cgroup *sgx_cg)
 {
u64 cur, max;
 
diff --git a/arch/x86/kernel/cpu/sgx/epc_cgroup.h 
b/arch/x86/kernel/cpu/sgx/epc_cgroup.h
index f66570d3ef42..8f55b38157da 100644
--- a/arch/x86/kernel/cpu/sgx/epc_cgroup.h
+++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.h
@@ -28,6 +28,10 @@ static inline int sgx_cgroup_try_charge(struct sgx_cgroup 
*sgx_cg, enum sgx_recl
 static inline void sgx_cgroup_uncharge(struct sgx_cgroup *sgx_cg) { }
 
 static inline void sgx_cgroup_init(void) { }
+
+static inline void sgx_cgroup_reclaim_pages(struct misc_cg *root, struct 
mm_struct *charge_mm)
+{
+}
 #else
 struct sgx_cgroup {
struct misc_cg *cg;
@@ -65,6 +69,9 @@ static inline void sgx_put_cg(struct sgx_cgroup *sgx_cg)
 
 int sgx_cgroup_try_charge(struct sgx_cgroup *sgx_cg, enum sgx_reclaim reclaim);
 void sgx_cgroup_uncharge(struct sgx_cgroup *sgx_cg);
+bool sgx_cgroup_lru_empty(struct misc_cg *root);
+bool sgx_cgroup_should_reclaim(struct sgx_cgroup *sgx_cg);
+void sgx_cgroup_reclaim_pages(struct misc_cg *root,  struct mm_struct 
*charge_mm);
 void sgx_cgroup_init(void);
 
 #endif
diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
index 7f92455d957d..68f28ff2d5ef 100644
--- a/arch/x86/kernel/cpu/sgx/main.c
+++ b/arch/x86/kernel/cpu/sgx/main.c
@@ -34,6 +34,16 @@ static struct sgx_epc_lru_list sgx_global_lru;
 
 static inline struct sgx_epc_lru_list 

[PATCH v10 11/14] x86/sgx: Abstract check for global reclaimable pages

2024-03-27 Thread Haitao Huang
From: Kristen Carlson Accardi 

To determine if any page is available for reclamation at the global
level, only checking for emptiness of the global LRU is not adequate
when pages are tracked in multiple LRUs, one per cgroup. For this
purpose, create a new helper, sgx_can_reclaim(), which currently only
checks the global LRU and later will check the emptiness of the LRUs of
all cgroups when per-cgroup tracking is turned on. Replace all the
checks of the global LRU, list_empty(&sgx_global_lru.reclaimable), with
calls to sgx_can_reclaim().
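
A later patch in this series extends the helper roughly as follows (a
sketch; the cgroup-aware branch only exists with CONFIG_CGROUP_SGX_EPC):

static bool sgx_can_reclaim(void)
{
#ifdef CONFIG_CGROUP_SGX_EPC
	/* Reclaimable iff any cgroup in the hierarchy has a non-empty LRU. */
	return !sgx_cgroup_lru_empty(misc_cg_root());
#else
	return !list_empty(&sgx_global_lru.reclaimable);
#endif
}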

Co-developed-by: Sean Christopherson 
Signed-off-by: Sean Christopherson 
Signed-off-by: Kristen Carlson Accardi 
Co-developed-by: Haitao Huang 
Signed-off-by: Haitao Huang 
---
V10:
- Add comments for the new function. (Jarkko)

V7:
- Split this out from the big patch, #10 in V6. (Dave, Kai)
---
 arch/x86/kernel/cpu/sgx/main.c | 12 ++--
 1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
index c94f8b49e6f2..7f92455d957d 100644
--- a/arch/x86/kernel/cpu/sgx/main.c
+++ b/arch/x86/kernel/cpu/sgx/main.c
@@ -37,6 +37,14 @@ static inline struct sgx_epc_lru_list *sgx_lru_list(struct sgx_epc_page *epc_page)
	return &sgx_global_lru;
 }
 
+/*
+ * Check if there is any reclaimable page at global level.
+ */
+static inline bool sgx_can_reclaim(void)
+{
+	return !list_empty(&sgx_global_lru.reclaimable);
+}
+
 static atomic_long_t sgx_nr_free_pages = ATOMIC_LONG_INIT(0);
 
 /* Nodes with one or more EPC sections. */
@@ -391,7 +399,7 @@ unsigned int sgx_reclaim_pages(struct sgx_epc_lru_list 
*lru, struct mm_struct *c
 static bool sgx_should_reclaim(unsigned long watermark)
 {
	return atomic_long_read(&sgx_nr_free_pages) < watermark &&
-	       !list_empty(&sgx_global_lru.reclaimable);
+   sgx_can_reclaim();
 }
 
 static void sgx_reclaim_pages_global(struct mm_struct *charge_mm)
@@ -593,7 +601,7 @@ struct sgx_epc_page *sgx_alloc_epc_page(void *owner, enum 
sgx_reclaim reclaim)
break;
}
 
-	if (list_empty(&sgx_global_lru.reclaimable)) {
+   if (!sgx_can_reclaim()) {
page = ERR_PTR(-ENOMEM);
break;
}
-- 
2.25.1




[PATCH v10 10/14] x86/sgx: Charge mem_cgroup for per-cgroup reclamation

2024-03-27 Thread Haitao Huang
Enclave Page Cache (EPC) memory can be swapped out to regular system
memory, and the consumed memory should be charged to a proper
mem_cgroup. Currently the selection of the mem_cgroup to charge is done
in sgx_encl_get_mem_cgroup(), but it considers all contexts other than
the ksgxd thread to be user processes. With the new EPC cgroup
implementation, the swapping can also happen in EPC cgroup work-queue
threads. In those cases, it improperly selects the root mem_cgroup to
charge for the RAM usage.

Remove current_is_ksgxd() and change sgx_encl_get_mem_cgroup() to take
an additional argument to explicitly specify the mm struct to charge for
allocations. Callers from background kthreads not associated with a
charging mm struct would set it to NULL, while callers in user process
contexts set it to current->mm.

Internally, it handles the case when the given charging mm is NULL by
searching for an mm struct from the enclave's mm_list.
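
Illustrative call sites after this change (a sketch, not the exact
hunks):

	/* Page fault path: charge the faulting process. */
	ret = sgx_encl_alloc_backing(encl, page_index, &backing, current->mm);

	/* ksgxd or a cgroup work-queue thread: no associated mm; the callee
	 * picks one from the enclave's mm_list.
	 */
	ret = sgx_encl_alloc_backing(encl, page_index, &backing, NULL);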

Signed-off-by: Haitao Huang 
Reported-by: Mikko Ylinen 
---
V10:
- Pass mm struct instead of a boolean 'indirect'. (Dave, Jarkko)

V9:
- Reduce number of if statements. (Tim)

V8:
- Limit text paragraphs to 80 characters wide. (Jarkko)
---
 arch/x86/kernel/cpu/sgx/encl.c   | 29 ++--
 arch/x86/kernel/cpu/sgx/encl.h   |  3 +--
 arch/x86/kernel/cpu/sgx/epc_cgroup.c | 10 ++
 arch/x86/kernel/cpu/sgx/main.c   | 29 +---
 arch/x86/kernel/cpu/sgx/sgx.h|  2 +-
 5 files changed, 36 insertions(+), 37 deletions(-)

diff --git a/arch/x86/kernel/cpu/sgx/encl.c b/arch/x86/kernel/cpu/sgx/encl.c
index f474179b6f77..7b77dad41daf 100644
--- a/arch/x86/kernel/cpu/sgx/encl.c
+++ b/arch/x86/kernel/cpu/sgx/encl.c
@@ -993,23 +993,23 @@ static int __sgx_encl_get_backing(struct sgx_encl *encl, 
unsigned long page_inde
 }
 
 /*
- * When called from ksgxd, returns the mem_cgroup of a struct mm stored
- * in the enclave's mm_list. When not called from ksgxd, just returns
- * the mem_cgroup of the current task.
+ * Find the mem_cgroup to charge for memory allocated on behalf of an enclave.
+ *
+ * Used in sgx_encl_alloc_backing() for backing store allocation.
+ *
+ * Return the mem_cgroup of the given charge_mm. Otherwise return the 
mem_cgroup
+ * of a struct mm stored in the enclave's mm_list.
  */
-static struct mem_cgroup *sgx_encl_get_mem_cgroup(struct sgx_encl *encl)
+static struct mem_cgroup *sgx_encl_get_mem_cgroup(struct sgx_encl *encl,
+ struct mm_struct *charge_mm)
 {
struct mem_cgroup *memcg = NULL;
struct sgx_encl_mm *encl_mm;
int idx;
 
-   /*
-* If called from normal task context, return the mem_cgroup
-* of the current task's mm. The remainder of the handling is for
-* ksgxd.
-*/
-   if (!current_is_ksgxd())
-   return get_mem_cgroup_from_mm(current->mm);
+	/* Use the charge_mm if given. */
+   if (charge_mm)
+   return get_mem_cgroup_from_mm(charge_mm);
 
/*
 * Search the enclave's mm_list to find an mm associated with
@@ -1047,8 +1047,9 @@ static struct mem_cgroup *sgx_encl_get_mem_cgroup(struct 
sgx_encl *encl)
  * @encl:  an enclave pointer
  * @page_index:enclave page index
  * @backing:   data for accessing backing storage for the page
+ * @charge_mm: the mm to charge for the allocation
  *
- * When called from ksgxd, sets the active memcg from one of the
+ * When charge_mm is NULL, sets the active memcg from one of the
  * mms in the enclave's mm_list prior to any backing page allocation,
  * in order to ensure that shmem page allocations are charged to the
  * enclave.  Create a backing page for loading data back into an EPC page with
@@ -1060,9 +1061,9 @@ static struct mem_cgroup *sgx_encl_get_mem_cgroup(struct 
sgx_encl *encl)
  *   -errno otherwise.
  */
 int sgx_encl_alloc_backing(struct sgx_encl *encl, unsigned long page_index,
-  struct sgx_backing *backing)
+  struct sgx_backing *backing, struct mm_struct 
*charge_mm)
 {
-   struct mem_cgroup *encl_memcg = sgx_encl_get_mem_cgroup(encl);
+   struct mem_cgroup *encl_memcg = sgx_encl_get_mem_cgroup(encl, 
charge_mm);
struct mem_cgroup *memcg = set_active_memcg(encl_memcg);
int ret;
 
diff --git a/arch/x86/kernel/cpu/sgx/encl.h b/arch/x86/kernel/cpu/sgx/encl.h
index fe15ade02ca1..5ce9d108290f 100644
--- a/arch/x86/kernel/cpu/sgx/encl.h
+++ b/arch/x86/kernel/cpu/sgx/encl.h
@@ -103,12 +103,11 @@ static inline int sgx_encl_find(struct mm_struct *mm, 
unsigned long addr,
 int sgx_encl_may_map(struct sgx_encl *encl, unsigned long start,
 unsigned long end, unsigned long vm_flags);
 
-bool current_is_ksgxd(void);
 void sgx_encl_release(struct kref *ref);
 int sgx_encl_mm_add(struct sgx_encl *encl, struct mm_struct *mm);
 const cpumask_t *sgx_encl_cpumask(struct sgx_encl *encl);
 int 

[PATCH v10 08/14] x86/sgx: Add basic EPC reclamation flow for cgroup

2024-03-27 Thread Haitao Huang
From: Kristen Carlson Accardi 

When a cgroup's usage reaches its limit and it is to be charged, i.e.,
sgx_cgroup_try_charge() is called for new allocations, the cgroup needs
to reclaim pages from its LRU or the LRUs of its descendants to make
room for any new allocations. This patch adds the basic building block
for the per-cgroup reclamation flow and uses it for synchronous
reclamation in sgx_cgroup_try_charge().

First, modify sgx_reclaim_pages() to let callers pass in the LRU from
which pages are reclaimed, so it can be reused by both the global and
cgroup reclaimers. Also return the number of pages attempted, so a
cgroup reclaimer can use it to track reclamation progress from its
descendants.

For the global reclaimer, replace all call sites of sgx_reclaim_pages()
with calls to a newly created wrapper, sgx_reclaim_pages_global(), which
just calls sgx_reclaim_pages() with the global LRU passed in.

For cgroup reclamation, implement a basic reclamation flow, encapsulated
in the top-level function, sgx_cgroup_reclaim_pages(). It performs a
pre-order walk on a given cgroup subtree, and calls sgx_reclaim_pages()
at each node passing in the LRU of that node. It keeps track of total
attempted pages and stops the walk if desired number of pages are
attempted.

Finally, pass a parameter to sgx_cgroup_try_charge() to indicate
whether a synchronous reclamation is allowed. If the caller allows it
and the cgroup usage is at its limit, trigger the synchronous
reclamation by calling sgx_cgroup_reclaim_pages() in a loop with
cond_resched() in between iterations.

A later patch will add support for asynchronous reclamation reusing
sgx_cgroup_reclaim_pages().
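
A condensed sketch of the resulting charge loop (simplified from the
diff below; exact error codes and the SGX_NO_RECLAIM handling are
refined by the next patch):

int sgx_cgroup_try_charge(struct sgx_cgroup *sgx_cg, enum sgx_reclaim reclaim)
{
	for (;;) {
		/* misc_cg_try_charge() returns 0 while usage stays below the limit. */
		if (!misc_cg_try_charge(MISC_CG_RES_SGX_EPC, sgx_cg->cg, PAGE_SIZE))
			return 0;

		/* Nothing reclaimable anywhere in this subtree: give up. */
		if (sgx_cgroup_lru_empty(sgx_cg->cg))
			return -ENOMEM;

		/* Caller cannot block, e.g. in a page fault handler. */
		if (reclaim == SGX_NO_RECLAIM)
			return -EBUSY;

		sgx_cgroup_reclaim_pages(sgx_cg->cg);
		cond_resched();
	}
}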

Co-developed-by: Sean Christopherson 
Signed-off-by: Sean Christopherson 
Signed-off-by: Kristen Carlson Accardi 
Co-developed-by: Haitao Huang 
Signed-off-by: Haitao Huang 
---
V10:
- Simplify the signature by removing a pointer to nr_to_scan (Kai)
- Return pages attempted instead of reclaimed as it is really what the
cgroup caller needs to track progress. This further simplifies the design.
- Merge patch for exposing sgx_reclaim_pages() with basic synchronous
reclamation. (Kai)
- Shorten names for EPC cgroup functions. (Jarkko)
- Fix/add comments to justify the design (Kai)
- Separate out a helper for for addressing single iteration of the loop
in sgx_cgroup_try_charge(). (Jarkko)

V9:
- Add comments for static variables. (Jarkko)

V8:
- Use width of 80 characters in text paragraphs. (Jarkko)
- Remove alignment for substructure variables. (Jarkko)

V7:
- Reworked from patch 9 of V6, "x86/sgx: Restructure top-level EPC reclaim
function". Do not split the top level function (Kai)
- Dropped patches 7 and 8 of V6.
- Split this out from the big patch, #10 in V6. (Dave, Kai)
---
 arch/x86/kernel/cpu/sgx/epc_cgroup.c | 127 ++-
 arch/x86/kernel/cpu/sgx/epc_cgroup.h |   5 +-
 arch/x86/kernel/cpu/sgx/main.c   |  45 ++
 arch/x86/kernel/cpu/sgx/sgx.h|   1 +
 4 files changed, 156 insertions(+), 22 deletions(-)

diff --git a/arch/x86/kernel/cpu/sgx/epc_cgroup.c 
b/arch/x86/kernel/cpu/sgx/epc_cgroup.c
index a1dd43c195b2..f7a487a29ed1 100644
--- a/arch/x86/kernel/cpu/sgx/epc_cgroup.c
+++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.c
@@ -9,16 +9,136 @@
 static struct sgx_cgroup sgx_cg_root;
 
 /**
- * sgx_cgroup_try_charge() - try to charge cgroup for a single EPC page
+ * sgx_cgroup_lru_empty() - check if a cgroup tree has no pages on its LRUs
+ * @root:  Root of the tree to check
+ *
+ * Used to avoid livelocks due to a cgroup having a non-zero charge count but
+ * no pages on its LRUs, e.g. due to a dead enclave waiting to be released or
+ * because all pages in the cgroup are unreclaimable.
+ *
+ * Return: %true if all cgroups under the specified root have empty LRU lists.
+ */
+static bool sgx_cgroup_lru_empty(struct misc_cg *root)
+{
+   struct cgroup_subsys_state *css_root;
+   struct cgroup_subsys_state *pos;
+   struct sgx_cgroup *sgx_cg;
+   bool ret = true;
+
+   /*
+* Caller ensure css_root ref acquired
+*/
+	css_root = &root->css;
+
+   rcu_read_lock();
+   css_for_each_descendant_pre(pos, css_root) {
+   if (!css_tryget(pos))
+   break;
+
+   rcu_read_unlock();
+
+   sgx_cg = sgx_cgroup_from_misc_cg(css_misc(pos));
+
+	spin_lock(&sgx_cg->lru.lock);
+	ret = list_empty(&sgx_cg->lru.reclaimable);
+	spin_unlock(&sgx_cg->lru.lock);
+
+   rcu_read_lock();
+   css_put(pos);
+   if (!ret)
+   break;
+   }
+
+   rcu_read_unlock();
+
+   return ret;
+}
+
+/**
+ * sgx_cgroup_reclaim_pages() - reclaim EPC from a cgroup tree
+ * @root:  The root of cgroup tree to reclaim from.
  *
+ * This function performs a pre-order walk in the cgroup tree under the given
+ * root, attempting to reclaim pages at each node until a fixed number of pages
+ * 

[PATCH v10 09/14] x86/sgx: Implement async reclamation for cgroup

2024-03-27 Thread Haitao Huang
From: Kristen Carlson Accardi 

In cases where EPC pages need to be allocated during a page fault and
the cgroup usage is near its limit, an asynchronous reclamation needs
to be triggered to avoid blocking the page fault handling.

Create a workqueue, a corresponding work item and function definitions
for the EPC cgroup to support the asynchronous reclamation.

In case the workqueue allocation fails during init, disable the cgroup
support.

In sgx_cgroup_try_charge(), if the caller does not allow synchronous
reclamation, queue an asynchronous work item into the workqueue.
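
A sketch of the asynchronous path (simplified; the work item field name
reclaim_work is an assumption based on this description):

static void sgx_cgroup_reclaim_work_func(struct work_struct *work)
{
	struct sgx_cgroup *sgx_cg = container_of(work, struct sgx_cgroup,
						 reclaim_work);

	/* Keep reclaiming until the cgroup is back under its threshold. */
	while (sgx_cgroup_should_reclaim(sgx_cg)) {
		sgx_cgroup_reclaim_pages(sgx_cg->cg);
		cond_resched();
	}
}

	/* In sgx_cgroup_try_charge(), when the caller cannot block: */
	if (reclaim == SGX_NO_RECLAIM) {
		queue_work(sgx_cg_wq, &sgx_cg->reclaim_work);
		return -EBUSY;
	}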

Co-developed-by: Sean Christopherson 
Signed-off-by: Sean Christopherson 
Signed-off-by: Kristen Carlson Accardi 
Co-developed-by: Haitao Huang 
Signed-off-by: Haitao Huang 
---
V10:
- Split asynchronous flow in separate patch. (Kai)
- Consider cgroup disabled when the workqueue allocation fail during
init. (Kai)
- Abstract out sgx_cgroup_should_reclaim().

V9:
- Add comments for static variables. (Jarkko)

V8:
- Remove alignment for substructure variables. (Jarkko)

V7:
- Split this out from the big patch, #10 in V6. (Dave, Kai)
---
 arch/x86/kernel/cpu/sgx/epc_cgroup.c | 134 ++-
 arch/x86/kernel/cpu/sgx/epc_cgroup.h |   1 +
 arch/x86/kernel/cpu/sgx/main.c   |   8 +-
 3 files changed, 135 insertions(+), 8 deletions(-)

diff --git a/arch/x86/kernel/cpu/sgx/epc_cgroup.c 
b/arch/x86/kernel/cpu/sgx/epc_cgroup.c
index f7a487a29ed1..3ca89b1fb7e2 100644
--- a/arch/x86/kernel/cpu/sgx/epc_cgroup.c
+++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.c
@@ -5,9 +5,63 @@
 #include 
 #include "epc_cgroup.h"
 
+/*
+ * The minimal free pages maintained by per-cgroup reclaimer
+ * Set this to the low threshold used by the global reclaimer, ksgxd.
+ */
+#define SGX_CG_MIN_FREE_PAGE   (SGX_NR_LOW_PAGES)
+
+/*
+ * If the cgroup limit is close to SGX_CG_MIN_FREE_PAGE, maintaining the 
minimal
+ * free pages would barely leave any page for use, causing excessive 
reclamation
+ * and thrashing.
+ *
+ * Define the following limit, below which cgroup does not maintain the minimal
+ * free page threshold. Set this to quadruple of the minimal so at least 75%
+ * pages used without being reclaimed.
+ */
+#define SGX_CG_LOW_LIMIT   (SGX_CG_MIN_FREE_PAGE * 4)
+
 /* The root SGX EPC cgroup */
 static struct sgx_cgroup sgx_cg_root;
 
+/*
+ * The work queue that reclaims EPC pages in the background for cgroups.
+ *
+ * A cgroup schedules a work item into this queue to reclaim pages within the
+ * same cgroup when its usage limit is reached and synchronous reclamation is 
not
+ * an option, i.e., in a page fault handler.
+ */
+static struct workqueue_struct *sgx_cg_wq;
+
+static inline u64 sgx_cgroup_page_counter_read(struct sgx_cgroup *sgx_cg)
+{
+	return atomic64_read(&sgx_cg->cg->res[MISC_CG_RES_SGX_EPC].usage) / 
PAGE_SIZE;
+}
+
+static inline u64 sgx_cgroup_max_pages(struct sgx_cgroup *sgx_cg)
+{
+   return READ_ONCE(sgx_cg->cg->res[MISC_CG_RES_SGX_EPC].max) / PAGE_SIZE;
+}
+
+/*
+ * Get the lower bound of limits of a cgroup and its ancestors. Used in
+ * sgx_cgroup_should_reclaim() to determine if EPC usage of a cgroup is
+ * close to its limit or its ancestors' hence reclamation is needed.
+ */
+static inline u64 sgx_cgroup_max_pages_to_root(struct sgx_cgroup *sgx_cg)
+{
+   struct misc_cg *i = sgx_cg->cg;
+   u64 m = U64_MAX;
+
+   while (i) {
+   m = min(m, READ_ONCE(i->res[MISC_CG_RES_SGX_EPC].max));
+   i = misc_cg_parent(i);
+   }
+
+   return m / PAGE_SIZE;
+}
+
 /**
  * sgx_cgroup_lru_empty() - check if a cgroup tree has no pages on its LRUs
  * @root:  Root of the tree to check
@@ -99,6 +153,61 @@ static void sgx_cgroup_reclaim_pages(struct misc_cg *root)
rcu_read_unlock();
 }
 
+/**
+ * sgx_cgroup_should_reclaim() - check if EPC reclamation is needed for a 
cgroup
+ * @sgx_cg: The cgroup to be checked.
+ *
+ * This function can be used to guard a call to sgx_cgroup_reclaim_pages() 
where
+ * the minimal number of free page needs be maintained for the cgroup to make
+ * good forward progress.
+ *
+ * Return: %true if number of free pages available for the cgroup below a
+ * threshold (%SGX_CG_MIN_FREE_PAGE) and there are reclaimable pages within the
+ * cgroup.
+ */
+static bool sgx_cgroup_should_reclaim(struct sgx_cgroup *sgx_cg)
+{
+   u64 cur, max;
+
+   if (sgx_cgroup_lru_empty(sgx_cg->cg))
+   return false;
+
+   max = sgx_cgroup_max_pages_to_root(sgx_cg);
+
+   /*
+* Unless the limit is very low, maintain a minimal number of free pages
+* so there is always a few pages available to serve new allocation
+* requests quickly.
+*/
+   if (max > SGX_CG_LOW_LIMIT)
+   max -= SGX_CG_MIN_FREE_PAGE;
+
+   cur = sgx_cgroup_page_counter_read(sgx_cg);
+
+   return (cur >= max);
+}
+
+/*
+ * Asynchronous work flow to reclaim pages from the cgroup when the cgroup is
+ * at/near its maximum capacity.
+ */
+static void 

[PATCH v10 06/14] x86/sgx: Add sgx_epc_lru_list to encapsulate LRU list

2024-03-27 Thread Haitao Huang
From: Sean Christopherson 

Introduce a data structure to wrap the existing reclaimable list and
its spinlock. Each cgroup will later have one instance of this
structure to track EPC pages allocated for processes associated with
the same cgroup. Just like the global SGX reclaimer (ksgxd), an EPC
cgroup reclaims pages from the reclaimable list in this structure when
its usage gets near its limit.

Use this structure to encapsulate the LRU list and its lock used by the
global reclaimer.
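
The sgx.h hunk is not shown in this excerpt; reconstructed from the
usage in main.c, the wrapper amounts to:

struct sgx_epc_lru_list {
	spinlock_t lock;
	struct list_head reclaimable;
};

static inline void sgx_lru_init(struct sgx_epc_lru_list *lru)
{
	spin_lock_init(&lru->lock);
	INIT_LIST_HEAD(&lru->reclaimable);
}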

Signed-off-by: Sean Christopherson 
Co-developed-by: Kristen Carlson Accardi 
Signed-off-by: Kristen Carlson Accardi 
Co-developed-by: Haitao Huang 
Signed-off-by: Haitao Huang 
Cc: Sean Christopherson 
Reviewed-by: Jarkko Sakkinen 
---
V6:
- removed introduction to unreclaimables in commit message.

V4:
- Removed unneeded comments for the spinlock and the non-reclaimables.
(Kai, Jarkko)
- Revised the commit to add introduction comments for unreclaimables and
multiple LRU lists.(Kai)
- Reordered the patches: delay all changes for unreclaimables to
later, and this one becomes the first change in the SGX subsystem.

V3:
- Removed the helper functions and revised commit messages.
---
 arch/x86/kernel/cpu/sgx/main.c | 39 +-
 arch/x86/kernel/cpu/sgx/sgx.h  | 15 +
 2 files changed, 35 insertions(+), 19 deletions(-)

diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
index 023af54c1beb..4991eb0af748 100644
--- a/arch/x86/kernel/cpu/sgx/main.c
+++ b/arch/x86/kernel/cpu/sgx/main.c
@@ -28,10 +28,9 @@ static DEFINE_XARRAY(sgx_epc_address_space);
 
 /*
  * These variables are part of the state of the reclaimer, and must be accessed
- * with sgx_reclaimer_lock acquired.
+ * with sgx_global_lru.lock acquired.
  */
-static LIST_HEAD(sgx_active_page_list);
-static DEFINE_SPINLOCK(sgx_reclaimer_lock);
+static struct sgx_epc_lru_list sgx_global_lru;
 
 static atomic_long_t sgx_nr_free_pages = ATOMIC_LONG_INIT(0);
 
@@ -306,13 +305,13 @@ static void sgx_reclaim_pages(void)
int ret;
int i;
 
-	spin_lock(&sgx_reclaimer_lock);
+	spin_lock(&sgx_global_lru.lock);
 	for (i = 0; i < SGX_NR_TO_SCAN; i++) {
-		if (list_empty(&sgx_active_page_list))
+		epc_page = list_first_entry_or_null(&sgx_global_lru.reclaimable,
+						    struct sgx_epc_page, list);
+		if (!epc_page)
 			break;
 
-		epc_page = list_first_entry(&sgx_active_page_list,
-					    struct sgx_epc_page, list);
 		list_del_init(&epc_page->list);
encl_page = epc_page->owner;
 
@@ -324,7 +323,7 @@ static void sgx_reclaim_pages(void)
 */
epc_page->flags &= ~SGX_EPC_PAGE_RECLAIMER_TRACKED;
}
-	spin_unlock(&sgx_reclaimer_lock);
+	spin_unlock(&sgx_global_lru.lock);
 
for (i = 0; i < cnt; i++) {
epc_page = chunk[i];
@@ -347,9 +346,9 @@ static void sgx_reclaim_pages(void)
continue;
 
 skip:
-		spin_lock(&sgx_reclaimer_lock);
-		list_add_tail(&epc_page->list, &sgx_active_page_list);
-		spin_unlock(&sgx_reclaimer_lock);
+		spin_lock(&sgx_global_lru.lock);
+		list_add_tail(&epc_page->list, &sgx_global_lru.reclaimable);
+		spin_unlock(&sgx_global_lru.lock);
 
 		kref_put(&encl_page->encl->refcount, sgx_encl_release);
 
@@ -380,7 +379,7 @@ static void sgx_reclaim_pages(void)
 static bool sgx_should_reclaim(unsigned long watermark)
 {
	return atomic_long_read(&sgx_nr_free_pages) < watermark &&
-	       !list_empty(&sgx_active_page_list);
+	       !list_empty(&sgx_global_lru.reclaimable);
 }
 
 /*
@@ -432,6 +431,8 @@ static bool __init sgx_page_reclaimer_init(void)
 
ksgxd_tsk = tsk;
 
+	sgx_lru_init(&sgx_global_lru);
+
return true;
 }
 
@@ -507,10 +508,10 @@ static struct sgx_epc_page *__sgx_alloc_epc_page(void)
  */
 void sgx_mark_page_reclaimable(struct sgx_epc_page *page)
 {
-	spin_lock(&sgx_reclaimer_lock);
+	spin_lock(&sgx_global_lru.lock);
 	page->flags |= SGX_EPC_PAGE_RECLAIMER_TRACKED;
-	list_add_tail(&page->list, &sgx_active_page_list);
-	spin_unlock(&sgx_reclaimer_lock);
+	list_add_tail(&page->list, &sgx_global_lru.reclaimable);
+	spin_unlock(&sgx_global_lru.lock);
 }
 
 /**
@@ -525,18 +526,18 @@ void sgx_mark_page_reclaimable(struct sgx_epc_page *page)
  */
 int sgx_unmark_page_reclaimable(struct sgx_epc_page *page)
 {
-	spin_lock(&sgx_reclaimer_lock);
+	spin_lock(&sgx_global_lru.lock);
 	if (page->flags & SGX_EPC_PAGE_RECLAIMER_TRACKED) {
 		/* The page is being reclaimed. */
 		if (list_empty(&page->list)) {
-			spin_unlock(&sgx_reclaimer_lock);
+			spin_unlock(&sgx_global_lru.lock);
 			return -EBUSY;
 		}
 
 		list_del(&page->list);

[PATCH v10 05/14] x86/sgx: Implement basic EPC misc cgroup functionality

2024-03-27 Thread Haitao Huang
From: Kristen Carlson Accardi 

SGX Enclave Page Cache (EPC) memory allocations are separate from normal
RAM allocations, and are managed solely by the SGX subsystem. The
existing cgroup memory controller cannot be used to limit or account for
SGX EPC memory, which is a desirable feature in some environments. For
instance, within a Kubernetes environment, while a user may specify a
particular EPC quota for a pod, the orchestrator requires a mechanism to
enforce that the pod's actual runtime EPC usage does not exceed the
allocated quota.

Utilize the misc controller [admin-guide/cgroup-v2.rst, 5-9. Misc] to
limit and track EPC allocations per cgroup. Earlier patches have added
the "sgx_epc" resource type in the misc cgroup subsystem. Add basic
support in SGX driver as the "sgx_epc" resource provider:

- Set "capacity" of EPC by calling misc_cg_set_capacity()
- Update EPC usage counter, "current", by calling charge and uncharge
APIs for EPC allocation and deallocation, respectively.
- Setup sgx_epc resource type specific callbacks, which perform
initialization and cleanup during cgroup allocation and deallocation,
respectively.

With these changes, the misc cgroup controller enables users to set a
hard limit for EPC usage in the "misc.max" interface file. It reports
the current usage in "misc.current", the total EPC memory available in
"misc.capacity", and the number of times EPC usage reached the max
limit in "misc.events".

For now, the EPC cgroup simply blocks additional EPC allocation in
sgx_alloc_epc_page() when the limit is reached. Reclaimable pages are
still tracked in the global active list, only reclaimed by the global
reclaimer when the total free page count is lower than a threshold.

Later patches will reorganize the tracking and reclamation code in the
global reclaimer and implement per-cgroup tracking and reclaiming.
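
A sketch of the driver-side wiring this patch adds (simplified; see the
diff below for the exact code):

	/* Once, at SGX init: advertise total EPC to the misc controller. */
	misc_cg_set_capacity(MISC_CG_RES_SGX_EPC, capacity);

	/* On each EPC page allocation: charge the allocating task's cgroup;
	 * the allocation fails when the hard limit is reached.
	 */
	ret = misc_cg_try_charge(MISC_CG_RES_SGX_EPC, sgx_cg->cg, PAGE_SIZE);
	if (ret)
		return ERR_PTR(ret);

	/* On each free: return the charge. */
	misc_cg_uncharge(MISC_CG_RES_SGX_EPC, sgx_cg->cg, PAGE_SIZE);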

Co-developed-by: Sean Christopherson 
Signed-off-by: Sean Christopherson 
Signed-off-by: Kristen Carlson Accardi 
Co-developed-by: Haitao Huang 
Signed-off-by: Haitao Huang 
Reviewed-by: Jarkko Sakkinen 
Reviewed-by: Tejun Heo 
---
V10:
- Shorten function, variable, struct names, s/sgx_epc_cgroup/sgx_cgroup. 
(Jarkko)
- Use enums instead of booleans for the parameters. (Dave, Jarkko)

V8:
- Remove null checks for epc_cg in try_charge()/uncharge(). (Jarkko)
- Remove extra space, '_INTEL'. (Jarkko)

V7:
- Use a static for root cgroup (Kai)
- Wrap epc_cg field in sgx_epc_page struct with #ifdef (Kai)
- Correct check for charge API return (Kai)
- Start initialization in SGX device driver init (Kai)
- Remove unneeded BUG_ON (Kai)
- Split  sgx_get_current_epc_cg() out of sgx_epc_cg_try_charge() (Kai)

V6:
- Split the original large patch"Limit process EPC usage with misc
cgroup controller"  and restructure it (Kai)
---
 arch/x86/Kconfig | 13 +
 arch/x86/kernel/cpu/sgx/Makefile |  1 +
 arch/x86/kernel/cpu/sgx/epc_cgroup.c | 74 
 arch/x86/kernel/cpu/sgx/epc_cgroup.h | 70 ++
 arch/x86/kernel/cpu/sgx/main.c   | 51 ++-
 arch/x86/kernel/cpu/sgx/sgx.h|  5 ++
 include/linux/misc_cgroup.h  |  2 +
 7 files changed, 214 insertions(+), 2 deletions(-)
 create mode 100644 arch/x86/kernel/cpu/sgx/epc_cgroup.c
 create mode 100644 arch/x86/kernel/cpu/sgx/epc_cgroup.h

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 39886bab943a..bda78255a7ab 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1941,6 +1941,19 @@ config X86_SGX
 
  If unsure, say N.
 
+config CGROUP_SGX_EPC
+   bool "Miscellaneous Cgroup Controller for Enclave Page Cache (EPC) for 
Intel SGX"
+   depends on X86_SGX && CGROUP_MISC
+   help
+ Provides control over the EPC footprint of tasks in a cgroup via
+ the Miscellaneous cgroup controller.
+
+ EPC is a subset of regular memory that is usable only by SGX
+ enclaves and is very limited in quantity, e.g. less than 1%
+ of total DRAM.
+
+ Say N if unsure.
+
 config X86_USER_SHADOW_STACK
bool "X86 userspace shadow stack"
depends on AS_WRUSS
diff --git a/arch/x86/kernel/cpu/sgx/Makefile b/arch/x86/kernel/cpu/sgx/Makefile
index 9c1656779b2a..12901a488da7 100644
--- a/arch/x86/kernel/cpu/sgx/Makefile
+++ b/arch/x86/kernel/cpu/sgx/Makefile
@@ -4,3 +4,4 @@ obj-y += \
ioctl.o \
main.o
 obj-$(CONFIG_X86_SGX_KVM)  += virt.o
+obj-$(CONFIG_CGROUP_SGX_EPC)  += epc_cgroup.o
diff --git a/arch/x86/kernel/cpu/sgx/epc_cgroup.c 
b/arch/x86/kernel/cpu/sgx/epc_cgroup.c
new file mode 100644
index ..a1dd43c195b2
--- /dev/null
+++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.c
@@ -0,0 +1,74 @@
+// SPDX-License-Identifier: GPL-2.0
+// Copyright(c) 2022 Intel Corporation.
+
+#include 
+#include 
+#include "epc_cgroup.h"
+
+/* The root SGX EPC cgroup */
+static struct sgx_cgroup sgx_cg_root;
+
+/**
+ * sgx_cgroup_try_charge() - try to charge cgroup for a single EPC page
+ *
+ * 

[PATCH v10 04/14] cgroup/misc: Add SGX EPC resource type

2024-03-27 Thread Haitao Huang
From: Kristen Carlson Accardi 

Add SGX EPC memory, MISC_CG_RES_SGX_EPC, to be a valid resource type
for the misc controller.

Signed-off-by: Kristen Carlson Accardi 
Co-developed-by: Haitao Huang 
Signed-off-by: Haitao Huang 
Reviewed-by: Jarkko Sakkinen 
---
V6:
- Split the original patch into this and the preceding one (Kai)
---
 include/linux/misc_cgroup.h | 4 
 kernel/cgroup/misc.c| 4 
 2 files changed, 8 insertions(+)

diff --git a/include/linux/misc_cgroup.h b/include/linux/misc_cgroup.h
index 541a5611c597..2f6cc3a0ad23 100644
--- a/include/linux/misc_cgroup.h
+++ b/include/linux/misc_cgroup.h
@@ -17,6 +17,10 @@ enum misc_res_type {
MISC_CG_RES_SEV,
/* AMD SEV-ES ASIDs resource */
MISC_CG_RES_SEV_ES,
+#endif
+#ifdef CONFIG_CGROUP_SGX_EPC
+   /* SGX EPC memory resource */
+   MISC_CG_RES_SGX_EPC,
 #endif
MISC_CG_RES_TYPES
 };
diff --git a/kernel/cgroup/misc.c b/kernel/cgroup/misc.c
index 1f0d8e05b36c..e51d6a45007f 100644
--- a/kernel/cgroup/misc.c
+++ b/kernel/cgroup/misc.c
@@ -24,6 +24,10 @@ static const char *const misc_res_name[] = {
/* AMD SEV-ES ASIDs resource */
"sev_es",
 #endif
+#ifdef CONFIG_CGROUP_SGX_EPC
+   /* Intel SGX EPC memory bytes */
+   "sgx_epc",
+#endif
 };
 
 /* Root misc cgroup */
-- 
2.25.1




[PATCH v10 03/14] cgroup/misc: Export APIs for SGX driver

2024-03-27 Thread Haitao Huang
From: Kristen Carlson Accardi 

The SGX EPC cgroup will reclaim EPC pages when usage in a cgroup
reaches its own or an ancestor's limit. This requires a walk from the
current cgroup up to the root, similar to misc_cg_try_charge(). Export
misc_cg_parent() to enable this walk.

The SGX driver also needs to start a global-level reclamation from the
root. Export misc_cg_root() for the SGX driver to access.
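
For example, a later patch in this series uses these exports to find
the tightest limit on the path from a cgroup up to the root:

static inline u64 sgx_cgroup_max_pages_to_root(struct sgx_cgroup *sgx_cg)
{
	struct misc_cg *i = sgx_cg->cg;
	u64 m = U64_MAX;

	/* The smallest max along the ancestor chain bounds the cgroup. */
	while (i) {
		m = min(m, READ_ONCE(i->res[MISC_CG_RES_SGX_EPC].max));
		i = misc_cg_parent(i);
	}

	return m / PAGE_SIZE;
}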

Signed-off-by: Kristen Carlson Accardi 
Co-developed-by: Haitao Huang 
Signed-off-by: Haitao Huang 
Reviewed-by: Jarkko Sakkinen 
Reviewed-by: Tejun Heo 
---
V6:
- Make commit messages more concise and split the original patch into two(Kai)
---
 include/linux/misc_cgroup.h | 24 
 kernel/cgroup/misc.c| 21 -
 2 files changed, 32 insertions(+), 13 deletions(-)

diff --git a/include/linux/misc_cgroup.h b/include/linux/misc_cgroup.h
index 0806d4436208..541a5611c597 100644
--- a/include/linux/misc_cgroup.h
+++ b/include/linux/misc_cgroup.h
@@ -64,6 +64,7 @@ struct misc_cg {
struct misc_res res[MISC_CG_RES_TYPES];
 };
 
+struct misc_cg *misc_cg_root(void);
 u64 misc_cg_res_total_usage(enum misc_res_type type);
 int misc_cg_set_capacity(enum misc_res_type type, u64 capacity);
 int misc_cg_set_ops(enum misc_res_type type, const struct misc_res_ops *ops);
@@ -84,6 +85,20 @@ static inline struct misc_cg *css_misc(struct 
cgroup_subsys_state *css)
return css ? container_of(css, struct misc_cg, css) : NULL;
 }
 
+/**
+ * misc_cg_parent() - Get the parent of the passed misc cgroup.
+ * @cgroup: cgroup whose parent needs to be fetched.
+ *
+ * Context: Any context.
+ * Return:
+ * * struct misc_cg* - Parent of the @cgroup.
+ * * %NULL - If @cgroup is null or the passed cgroup does not have a parent.
+ */
+static inline struct misc_cg *misc_cg_parent(struct misc_cg *cgroup)
+{
+   return cgroup ? css_misc(cgroup->css.parent) : NULL;
+}
+
 /*
  * get_current_misc_cg() - Find and get the misc cgroup of the current task.
  *
@@ -108,6 +123,15 @@ static inline void put_misc_cg(struct misc_cg *cg)
 }
 
 #else /* !CONFIG_CGROUP_MISC */
+static inline struct misc_cg *misc_cg_root(void)
+{
+   return NULL;
+}
+
+static inline struct misc_cg *misc_cg_parent(struct misc_cg *cg)
+{
+   return NULL;
+}
 
 static inline u64 misc_cg_res_total_usage(enum misc_res_type type)
 {
diff --git a/kernel/cgroup/misc.c b/kernel/cgroup/misc.c
index 14ab13ef3bc7..1f0d8e05b36c 100644
--- a/kernel/cgroup/misc.c
+++ b/kernel/cgroup/misc.c
@@ -43,18 +43,13 @@ static u64 misc_res_capacity[MISC_CG_RES_TYPES];
 static const struct misc_res_ops *misc_res_ops[MISC_CG_RES_TYPES];
 
 /**
- * parent_misc() - Get the parent of the passed misc cgroup.
- * @cgroup: cgroup whose parent needs to be fetched.
- *
- * Context: Any context.
- * Return:
- * * struct misc_cg* - Parent of the @cgroup.
- * * %NULL - If @cgroup is null or the passed cgroup does not have a parent.
+ * misc_cg_root() - Return the root misc cgroup.
  */
-static struct misc_cg *parent_misc(struct misc_cg *cgroup)
+struct misc_cg *misc_cg_root(void)
 {
-   return cgroup ? css_misc(cgroup->css.parent) : NULL;
+	return &root_cg;
 }
+EXPORT_SYMBOL_GPL(misc_cg_root);
 
 /**
  * valid_type() - Check if @type is valid or not.
@@ -183,7 +178,7 @@ int misc_cg_try_charge(enum misc_res_type type, struct 
misc_cg *cg, u64 amount)
if (!amount)
return 0;
 
-   for (i = cg; i; i = parent_misc(i)) {
+   for (i = cg; i; i = misc_cg_parent(i)) {
		res = &i->res[type];
 
new_usage = atomic64_add_return(amount, >usage);
@@ -196,12 +191,12 @@ int misc_cg_try_charge(enum misc_res_type type, struct 
misc_cg *cg, u64 amount)
return 0;
 
 err_charge:
-   for (j = i; j; j = parent_misc(j)) {
+   for (j = i; j; j = misc_cg_parent(j)) {
		atomic64_inc(&j->res[type].events);
		cgroup_file_notify(&j->events_file);
}
 
-   for (j = cg; j != i; j = parent_misc(j))
+   for (j = cg; j != i; j = misc_cg_parent(j))
misc_cg_cancel_charge(type, j, amount);
misc_cg_cancel_charge(type, i, amount);
return ret;
@@ -223,7 +218,7 @@ void misc_cg_uncharge(enum misc_res_type type, struct 
misc_cg *cg, u64 amount)
if (!(amount && valid_type(type) && cg))
return;
 
-   for (i = cg; i; i = parent_misc(i))
+   for (i = cg; i; i = misc_cg_parent(i))
misc_cg_cancel_charge(type, i, amount);
 }
 EXPORT_SYMBOL_GPL(misc_cg_uncharge);
-- 
2.25.1




[PATCH v10 02/14] cgroup/misc: Add per resource callbacks for CSS events

2024-03-27 Thread Haitao Huang
From: Kristen Carlson Accardi 

The misc cgroup controller (subsystem) currently does not perform
resource type specific action for Cgroups Subsystem State (CSS) events:
the 'css_alloc' event when a cgroup is created and the 'css_free' event
when a cgroup is destroyed.

Define callbacks for those events and allow resource providers to
register the callbacks per resource type as needed. This will be
utilized later by the EPC misc cgroup support implemented in the SGX
driver.
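
A sketch of how a resource provider registers its callbacks (the SGX
EPC provider does this in a later patch of this series; the sgx_* names
here follow that patch):

static const struct misc_res_ops sgx_cgroup_ops = {
	.alloc = sgx_cgroup_alloc,
	.free = sgx_cgroup_free,
};

	/* At driver init time: */
	misc_cg_set_ops(MISC_CG_RES_SGX_EPC, &sgx_cgroup_ops);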

Signed-off-by: Kristen Carlson Accardi 
Co-developed-by: Haitao Huang 
Signed-off-by: Haitao Huang 
Reviewed-by: Jarkko Sakkinen 
Reviewed-by: Tejun Heo 
---
V8:
- Abstract out _misc_cg_res_free() and _misc_cg_res_alloc() (Jarkko)

V7:
- Make ops one per resource type and store them in array (Michal)
- Rename the ops struct to misc_res_ops, and enforce the constraints of 
required callback
functions (Jarkko)
- Moved addition of priv field to patch 4 where it was used first. (Jarkko)

V6:
- Create ops struct for per resource callbacks (Jarkko)
- Drop max_write callback (Dave, Michal)
- Style fixes (Kai)
---
 include/linux/misc_cgroup.h | 11 +
 kernel/cgroup/misc.c| 84 +
 2 files changed, 87 insertions(+), 8 deletions(-)

diff --git a/include/linux/misc_cgroup.h b/include/linux/misc_cgroup.h
index e799b1f8d05b..0806d4436208 100644
--- a/include/linux/misc_cgroup.h
+++ b/include/linux/misc_cgroup.h
@@ -27,6 +27,16 @@ struct misc_cg;
 
 #include 
 
+/**
+ * struct misc_res_ops: per resource type callback ops.
+ * @alloc: invoked for resource specific initialization when cgroup is 
allocated.
+ * @free: invoked for resource specific cleanup when cgroup is deallocated.
+ */
+struct misc_res_ops {
+   int (*alloc)(struct misc_cg *cg);
+   void (*free)(struct misc_cg *cg);
+};
+
 /**
  * struct misc_res: Per cgroup per misc type resource
  * @max: Maximum limit on the resource.
@@ -56,6 +66,7 @@ struct misc_cg {
 
 u64 misc_cg_res_total_usage(enum misc_res_type type);
 int misc_cg_set_capacity(enum misc_res_type type, u64 capacity);
+int misc_cg_set_ops(enum misc_res_type type, const struct misc_res_ops *ops);
 int misc_cg_try_charge(enum misc_res_type type, struct misc_cg *cg, u64 
amount);
 void misc_cg_uncharge(enum misc_res_type type, struct misc_cg *cg, u64 amount);
 
diff --git a/kernel/cgroup/misc.c b/kernel/cgroup/misc.c
index 79a3717a5803..14ab13ef3bc7 100644
--- a/kernel/cgroup/misc.c
+++ b/kernel/cgroup/misc.c
@@ -39,6 +39,9 @@ static struct misc_cg root_cg;
  */
 static u64 misc_res_capacity[MISC_CG_RES_TYPES];
 
+/* Resource type specific operations */
+static const struct misc_res_ops *misc_res_ops[MISC_CG_RES_TYPES];
+
 /**
  * parent_misc() - Get the parent of the passed misc cgroup.
  * @cgroup: cgroup whose parent needs to be fetched.
@@ -105,6 +108,36 @@ int misc_cg_set_capacity(enum misc_res_type type, u64 
capacity)
 }
 EXPORT_SYMBOL_GPL(misc_cg_set_capacity);
 
+/**
+ * misc_cg_set_ops() - set resource specific operations.
+ * @type: Type of the misc res.
+ * @ops: Operations for the given type.
+ *
+ * Context: Any context.
+ * Return:
+ * * %0 - Successfully registered the operations.
+ * * %-EINVAL - If @type is invalid, or the operations missing any required 
callbacks.
+ */
+int misc_cg_set_ops(enum misc_res_type type, const struct misc_res_ops *ops)
+{
+   if (!valid_type(type))
+   return -EINVAL;
+
+   if (!ops->alloc) {
+   pr_err("%s: alloc missing\n", __func__);
+   return -EINVAL;
+   }
+
+   if (!ops->free) {
+   pr_err("%s: free missing\n", __func__);
+   return -EINVAL;
+   }
+
+   misc_res_ops[type] = ops;
+   return 0;
+}
+EXPORT_SYMBOL_GPL(misc_cg_set_ops);
+
 /**
  * misc_cg_cancel_charge() - Cancel the charge from the misc cgroup.
  * @type: Misc res type in misc cg to cancel the charge from.
@@ -371,6 +404,33 @@ static struct cftype misc_cg_files[] = {
{}
 };
 
+static inline int _misc_cg_res_alloc(struct misc_cg *cg)
+{
+   enum misc_res_type i;
+   int ret;
+
+   for (i = 0; i < MISC_CG_RES_TYPES; i++) {
+   WRITE_ONCE(cg->res[i].max, MAX_NUM);
		atomic64_set(&cg->res[i].usage, 0);
+   if (misc_res_ops[i]) {
+   ret = misc_res_ops[i]->alloc(cg);
+   if (ret)
+   return ret;
+   }
+   }
+
+   return 0;
+}
+
+static inline void _misc_cg_res_free(struct misc_cg *cg)
+{
+   enum misc_res_type i;
+
+   for (i = 0; i < MISC_CG_RES_TYPES; i++)
+   if (misc_res_ops[i])
+   misc_res_ops[i]->free(cg);
+}
+
 /**
  * misc_cg_alloc() - Allocate misc cgroup.
  * @parent_css: Parent cgroup.
@@ -383,20 +443,25 @@ static struct cftype misc_cg_files[] = {
 static struct cgroup_subsys_state *
 misc_cg_alloc(struct cgroup_subsys_state *parent_css)
 {
-   enum misc_res_type i;
-   struct 

[PATCH v10 00/14] Add Cgroup support for SGX EPC memory

2024-03-27 Thread Haitao Huang
SGX Enclave Page Cache (EPC) memory allocations are separate from normal
RAM allocations, and are managed solely by the SGX subsystem. The existing
cgroup memory controller cannot be used to limit or account for SGX EPC
memory, which is a desirable feature in some environments, e.g., support
for pod level control in a Kubernetes cluster on a VM or bare-metal host
[1,2].
 
This patchset implements the support for sgx_epc memory within the misc
cgroup controller. A user can use the misc cgroup controller to set and
enforce a max limit on total EPC usage per cgroup. The implementation
reports current usage and events of reaching the limit per cgroup as well
as the total system capacity.
 
Much like normal system memory, EPC memory can be overcommitted via virtual
memory techniques and pages can be swapped out of the EPC to their backing
store, which are normal system memory allocated via shmem and accounted by
the memory controller. Similar to per-cgroup reclamation done by the memory
controller, the EPC misc controller needs to implement a per-cgroup EPC
reclaiming process: when the EPC usage of a cgroup reaches its hard limit
('sgx_epc' entry in the 'misc.max' file), the cgroup starts swapping out
some EPC pages within the same cgroup to make room for new allocations.
 
For that, this implementation tracks reclaimable EPC pages in a separate
LRU list in each cgroup, and below are more details and justification of
this design. 
 
Track EPC pages in per-cgroup LRUs (from Dave)
--
 
tl;dr: A cgroup hitting its limit should be as similar as possible to
the system running out of EPC memory. The only two choices to implement
that are nasty changes to the existing LRU scanning algorithm, or to add
new LRUs. The result: add a new LRU for each cgroup and scan those
instead. Replace the existing global LRU with the root cgroup's LRU
(only when this new support is compiled in, obviously).
 
The existing EPC memory management aims to be a miniature version of the
core VM where EPC memory can be overcommitted and reclaimed. EPC
allocations can wait for reclaim. The alternative to waiting would have
been to send a signal and let the enclave die.
 
This series attempts to implement that same logic for cgroups, for the same
reasons: it's preferable to wait for memory to become available and let
reclaim happen than to do things that are fatal to enclaves.
 
There is currently a global reclaimable page SGX LRU list. That list (and
the existing scanning algorithm) is essentially useless for doing reclaim
when a cgroup hits its limit because the cgroup's pages are scattered
around that LRU. It is unspeakably inefficient to scan a linked list with
millions of entries for what could be dozens of pages from a cgroup that
needs reclaim.
 
Even if unspeakably slow reclaim was accepted, the existing scanning
algorithm only picks a few pages off the head of the global LRU. It would
either need to hold the list locks for unreasonable amounts of time, or be
taught to scan the list in pieces, which has its own challenges.
 
Unreclaimable Enclave Pages
---
 
There are a variety of page types for enclaves, each serving different
purposes [5]. Although the SGX architecture supports swapping for all
types, some special pages, e.g., Version Array (VA) and Secure Enclave
Control Structure (SECS) [5], hold metadata of reclaimed pages and
enclaves. That makes reclamation of such pages more intricate to manage.
The SGX driver's global reclaimer currently does not swap out VA pages.
It only swaps the SECS page of an enclave when all other associated
pages have been swapped out. The cgroup reclaimer follows the same
approach, does not track those in per-cgroup LRUs, and considers them
unreclaimable pages. The allocation of these pages is counted towards
the usage of a specific cgroup and is subject to the cgroup's set EPC
limits.
 
Earlier versions of this series implemented forced enclave-killing to
reclaim VA and SECS pages. That was designed to enforce the 'max' limit,
particularly in scenarios where a user or administrator reduces this
limit after enclaves have launched. However, subsequent discussions
[3, 4] indicated that such preemptive enforcement is not necessary for
the misc controllers. Therefore, reclaiming SECS/VA pages by
force-killing enclaves was removed, and the limit is only enforced at
the time of a new EPC allocation request. When a cgroup hits its limit
but nothing is left in the LRUs of the subtree, i.e., nothing to reclaim
in the cgroup, any new attempt to allocate EPC within that cgroup will
result in an 'ENOMEM'.
 
Unreclaimable Guest VM EPC Pages

 
The EPC pages allocated for guest VMs by the virtual EPC driver are not
reclaimable by the host kernel [6]. Therefore an EPC cgroup also treats
those as unreclaimable and returns ENOMEM when its limit is hit and nothing
reclaimable left within the cgroup. The virtual EPC driver 

[PATCH v10 01/14] x86/sgx: Replace boolean parameters with enums

2024-03-27 Thread Haitao Huang
Replace the boolean 'reclaim' parameters in the function
sgx_alloc_epc_page() and its callers with an enum.

Also opportunistically remove the non-static declaration of
__sgx_alloc_epc_page() and fix a typo.
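
The sgx.h hunk defining the enum is not included in this excerpt; from
the call sites below, it amounts to:

enum sgx_reclaim {
	SGX_NO_RECLAIM,
	SGX_DO_RECLAIM,
};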

Signed-off-by: Haitao Huang 
Suggested-by: Jarkko Sakkinen 
Suggested-by: Dave Hansen 
---
 arch/x86/kernel/cpu/sgx/encl.c  | 12 ++--
 arch/x86/kernel/cpu/sgx/encl.h  |  4 ++--
 arch/x86/kernel/cpu/sgx/ioctl.c | 10 +-
 arch/x86/kernel/cpu/sgx/main.c  | 14 +++---
 arch/x86/kernel/cpu/sgx/sgx.h   | 13 +++--
 arch/x86/kernel/cpu/sgx/virt.c  |  2 +-
 6 files changed, 32 insertions(+), 23 deletions(-)

diff --git a/arch/x86/kernel/cpu/sgx/encl.c b/arch/x86/kernel/cpu/sgx/encl.c
index 279148e72459..f474179b6f77 100644
--- a/arch/x86/kernel/cpu/sgx/encl.c
+++ b/arch/x86/kernel/cpu/sgx/encl.c
@@ -217,7 +217,7 @@ static struct sgx_epc_page *sgx_encl_eldu(struct 
sgx_encl_page *encl_page,
struct sgx_epc_page *epc_page;
int ret;
 
-   epc_page = sgx_alloc_epc_page(encl_page, false);
+   epc_page = sgx_alloc_epc_page(encl_page, SGX_NO_RECLAIM);
if (IS_ERR(epc_page))
return epc_page;
 
@@ -359,14 +359,14 @@ static vm_fault_t sgx_encl_eaug_page(struct 
vm_area_struct *vma,
goto err_out_unlock;
}
 
-   epc_page = sgx_alloc_epc_page(encl_page, false);
+   epc_page = sgx_alloc_epc_page(encl_page, SGX_NO_RECLAIM);
if (IS_ERR(epc_page)) {
if (PTR_ERR(epc_page) == -EBUSY)
vmret =  VM_FAULT_NOPAGE;
goto err_out_unlock;
}
 
-   va_page = sgx_encl_grow(encl, false);
+   va_page = sgx_encl_grow(encl, SGX_NO_RECLAIM);
if (IS_ERR(va_page)) {
if (PTR_ERR(va_page) == -EBUSY)
vmret = VM_FAULT_NOPAGE;
@@ -1232,8 +1232,8 @@ void sgx_zap_enclave_ptes(struct sgx_encl *encl, unsigned 
long addr)
 
 /**
  * sgx_alloc_va_page() - Allocate a Version Array (VA) page
- * @reclaim: Reclaim EPC pages directly if none available. Enclave
- *   mutex should not be held if this is set.
+ * @reclaim: Whether to reclaim EPC pages directly if none available. Enclave
+ *   mutex should not be held for SGX_DO_RECLAIM.
  *
  * Allocate a free EPC page and convert it to a Version Array (VA) page.
  *
@@ -1241,7 +1241,7 @@ void sgx_zap_enclave_ptes(struct sgx_encl *encl, unsigned 
long addr)
  *   a VA page,
  *   -errno otherwise
  */
-struct sgx_epc_page *sgx_alloc_va_page(bool reclaim)
+struct sgx_epc_page *sgx_alloc_va_page(enum sgx_reclaim reclaim)
 {
struct sgx_epc_page *epc_page;
int ret;
diff --git a/arch/x86/kernel/cpu/sgx/encl.h b/arch/x86/kernel/cpu/sgx/encl.h
index f94ff14c9486..fe15ade02ca1 100644
--- a/arch/x86/kernel/cpu/sgx/encl.h
+++ b/arch/x86/kernel/cpu/sgx/encl.h
@@ -116,14 +116,14 @@ struct sgx_encl_page *sgx_encl_page_alloc(struct sgx_encl 
*encl,
  unsigned long offset,
  u64 secinfo_flags);
 void sgx_zap_enclave_ptes(struct sgx_encl *encl, unsigned long addr);
-struct sgx_epc_page *sgx_alloc_va_page(bool reclaim);
+struct sgx_epc_page *sgx_alloc_va_page(enum sgx_reclaim reclaim);
 unsigned int sgx_alloc_va_slot(struct sgx_va_page *va_page);
 void sgx_free_va_slot(struct sgx_va_page *va_page, unsigned int offset);
 bool sgx_va_page_full(struct sgx_va_page *va_page);
 void sgx_encl_free_epc_page(struct sgx_epc_page *page);
 struct sgx_encl_page *sgx_encl_load_page(struct sgx_encl *encl,
 unsigned long addr);
-struct sgx_va_page *sgx_encl_grow(struct sgx_encl *encl, bool reclaim);
+struct sgx_va_page *sgx_encl_grow(struct sgx_encl *encl, enum sgx_reclaim 
reclaim);
 void sgx_encl_shrink(struct sgx_encl *encl, struct sgx_va_page *va_page);
 
 #endif /* _X86_ENCL_H */
diff --git a/arch/x86/kernel/cpu/sgx/ioctl.c b/arch/x86/kernel/cpu/sgx/ioctl.c
index b65ab214bdf5..793a0ba2cb16 100644
--- a/arch/x86/kernel/cpu/sgx/ioctl.c
+++ b/arch/x86/kernel/cpu/sgx/ioctl.c
@@ -17,7 +17,7 @@
 #include "encl.h"
 #include "encls.h"
 
-struct sgx_va_page *sgx_encl_grow(struct sgx_encl *encl, bool reclaim)
+struct sgx_va_page *sgx_encl_grow(struct sgx_encl *encl, enum sgx_reclaim 
reclaim)
 {
struct sgx_va_page *va_page = NULL;
void *err;
@@ -64,7 +64,7 @@ static int sgx_encl_create(struct sgx_encl *encl, struct 
sgx_secs *secs)
struct file *backing;
long ret;
 
-   va_page = sgx_encl_grow(encl, true);
+   va_page = sgx_encl_grow(encl, SGX_DO_RECLAIM);
if (IS_ERR(va_page))
return PTR_ERR(va_page);
else if (va_page)
@@ -83,7 +83,7 @@ static int sgx_encl_create(struct sgx_encl *encl, struct 
sgx_secs *secs)
 
encl->backing = backing;
 
-	secs_epc = sgx_alloc_epc_page(&encl->secs, true);
+	secs_epc = sgx_alloc_epc_page(&encl->secs, SGX_DO_RECLAIM);
if (IS_ERR(secs_epc)) {
ret = 

[PATCH v3 3/3] vhost: Improve vhost_get_avail_idx() with smp_rmb()

2024-03-27 Thread Gavin Shan
All the callers of vhost_get_avail_idx() are concerned with the memory
barrier, imposed by smp_rmb() to ensure the order of the available
ring entry read and avail_idx read.

Improve vhost_get_avail_idx() so that smp_rmb() is executed when
the avail_idx is advanced. With it, the callers needn't worry
about the memory barrier.

Suggested-by: Michael S. Tsirkin 
Signed-off-by: Gavin Shan 
---
 drivers/vhost/vhost.c | 75 +++
 1 file changed, 26 insertions(+), 49 deletions(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 32686c79c41d..e6882f4f6ce2 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -1290,10 +1290,28 @@ static void vhost_dev_unlock_vqs(struct vhost_dev *d)
		mutex_unlock(&d->vqs[i]->mutex);
 }
 
-static inline int vhost_get_avail_idx(struct vhost_virtqueue *vq,
- __virtio16 *idx)
+static inline int vhost_get_avail_idx(struct vhost_virtqueue *vq)
 {
-	return vhost_get_avail(vq, *idx, &vq->avail->idx);
+   __virtio16 avail_idx;
+   int r;
+
+	r = vhost_get_avail(vq, avail_idx, &vq->avail->idx);
+	if (unlikely(r)) {
+		vq_err(vq, "Failed to access avail idx at %p\n",
+		       &vq->avail->idx);
+   return r;
+   }
+
+   vq->avail_idx = vhost16_to_cpu(vq, avail_idx);
+   if (vq->avail_idx != vq->last_avail_idx) {
+   /* Ensure the available ring entry read happens
+* before the avail_idx read when the avail_idx
+* is advanced.
+*/
+   smp_rmb();
+   }
+
+   return 0;
 }
 
 static inline int vhost_get_avail_head(struct vhost_virtqueue *vq,
@@ -2499,7 +2517,6 @@ int vhost_get_vq_desc(struct vhost_virtqueue *vq,
struct vring_desc desc;
unsigned int i, head, found = 0;
u16 last_avail_idx;
-   __virtio16 avail_idx;
__virtio16 ring_head;
int ret, access;
 
@@ -2507,12 +2524,8 @@ int vhost_get_vq_desc(struct vhost_virtqueue *vq,
last_avail_idx = vq->last_avail_idx;
 
if (vq->avail_idx == vq->last_avail_idx) {
-	if (unlikely(vhost_get_avail_idx(vq, &avail_idx))) {
-		vq_err(vq, "Failed to access avail idx at %p\n",
-			&vq->avail->idx);
+   if (unlikely(vhost_get_avail_idx(vq)))
return -EFAULT;
-   }
-   vq->avail_idx = vhost16_to_cpu(vq, avail_idx);
 
if (unlikely((u16)(vq->avail_idx - last_avail_idx) > vq->num)) {
vq_err(vq, "Guest moved used index from %u to %u",
@@ -2525,11 +2538,6 @@ int vhost_get_vq_desc(struct vhost_virtqueue *vq,
 */
if (vq->avail_idx == last_avail_idx)
return vq->num;
-
-   /* Only get avail ring entries after they have been
-* exposed by guest.
-*/
-   smp_rmb();
}
 
/* Grab the next descriptor number they're advertising, and increment
@@ -2790,35 +2798,19 @@ EXPORT_SYMBOL_GPL(vhost_add_used_and_signal_n);
 /* return true if we're sure that avaiable ring is empty */
 bool vhost_vq_avail_empty(struct vhost_dev *dev, struct vhost_virtqueue *vq)
 {
-   __virtio16 avail_idx;
-   int r;
-
if (vq->avail_idx != vq->last_avail_idx)
return false;
 
-	r = vhost_get_avail_idx(vq, &avail_idx);
-   if (unlikely(r))
-   return false;
-
-   vq->avail_idx = vhost16_to_cpu(vq, avail_idx);
-   if (vq->avail_idx != vq->last_avail_idx) {
-   /* Since we have updated avail_idx, the following
-* call to vhost_get_vq_desc() will read available
-* ring entries. Make sure that read happens after
-* the avail_idx read.
-*/
-   smp_rmb();
+   if (unlikely(vhost_get_avail_idx(vq)))
return false;
-   }
 
-   return true;
+   return vq->avail_idx == vq->last_avail_idx;
 }
 EXPORT_SYMBOL_GPL(vhost_vq_avail_empty);
 
 /* OK, now we need to know about added descriptors. */
 bool vhost_enable_notify(struct vhost_dev *dev, struct vhost_virtqueue *vq)
 {
-   __virtio16 avail_idx;
int r;
 
if (!(vq->used_flags & VRING_USED_F_NO_NOTIFY))
@@ -2842,25 +2834,10 @@ bool vhost_enable_notify(struct vhost_dev *dev, struct 
vhost_virtqueue *vq)
/* They could have slipped one in as we were doing that: make
 * sure it's written, then check again. */
smp_mb();
-	r = vhost_get_avail_idx(vq, &avail_idx);
-	if (r) {
-		vq_err(vq, "Failed to check avail idx at %p: %d\n",
-		       &vq->avail->idx, r);
+   if (unlikely(vhost_get_avail_idx(vq)))
return false;
-   }
-
-   vq->avail_idx = vhost16_to_cpu(vq, avail_idx);
-   if (vq->avail_idx != vq->last_avail_idx) {

[PATCH v3 2/3] vhost: Add smp_rmb() in vhost_enable_notify()

2024-03-27 Thread Gavin Shan
An smp_rmb() has been missed in vhost_enable_notify(), as pointed out
by Will. Otherwise, it's not ensured that the available ring entries
pushed by the guest can be observed by vhost in time, leading to stale
available ring entries being fetched by vhost in vhost_get_vq_desc(),
as reported by Yihuang Yu on NVidia's grace-hopper (ARM64) platform.

  /home/gavin/sandbox/qemu.main/build/qemu-system-aarch64  \
  -accel kvm -machine virt,gic-version=host -cpu host  \
  -smp maxcpus=1,cpus=1,sockets=1,clusters=1,cores=1,threads=1 \
  -m 4096M,slots=16,maxmem=64G \
  -object memory-backend-ram,id=mem0,size=4096M\
   :   \
  -netdev tap,id=vnet0,vhost=true  \
  -device virtio-net-pci,bus=pcie.8,netdev=vnet0,mac=52:54:00:f1:26:b0
   :
  guest# netperf -H 10.26.1.81 -l 60 -C -c -t UDP_STREAM
  virtio_net virtio0: output.0:id 100 is not a head!

Add the missed smp_rmb() in vhost_enable_notify(). When it returns
true, it means there are still pending tx buffers. Since it might read
indices, it can still bypass the smp_rmb() in vhost_get_vq_desc(). Note
that it should be safe until vq->avail_idx is changed by commit
d3bb267bbdcb ("vhost: cache avail index in vhost_enable_notify()").

Fixes: d3bb267bbdcb ("vhost: cache avail index in vhost_enable_notify()")
Cc:  # v5.18+
Reported-by: Yihuang Yu 
Suggested-by: Will Deacon 
Signed-off-by: Gavin Shan 
Acked-by: Jason Wang 
---
 drivers/vhost/vhost.c | 12 +++-
 1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 29df65b2ebf2..32686c79c41d 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -2848,9 +2848,19 @@ bool vhost_enable_notify(struct vhost_dev *dev, struct 
vhost_virtqueue *vq)
		       &vq->avail->idx, r);
return false;
}
+
vq->avail_idx = vhost16_to_cpu(vq, avail_idx);
+   if (vq->avail_idx != vq->last_avail_idx) {
+   /* Since we have updated avail_idx, the following
+* call to vhost_get_vq_desc() will read available
+* ring entries. Make sure that read happens after
+* the avail_idx read.
+*/
+   smp_rmb();
+   return true;
+   }
 
-   return vq->avail_idx != vq->last_avail_idx;
+   return false;
 }
 EXPORT_SYMBOL_GPL(vhost_enable_notify);
 
-- 
2.44.0




[PATCH v3 1/3] vhost: Add smp_rmb() in vhost_vq_avail_empty()

2024-03-27 Thread Gavin Shan
An smp_rmb() has been missed in vhost_vq_avail_empty(), spotted by
Will. Otherwise, it's not ensured that the available ring entries
pushed by the guest can be observed by vhost in time, leading to stale
available ring entries being fetched by vhost in vhost_get_vq_desc(),
as reported by Yihuang Yu on NVidia's grace-hopper (ARM64) platform.

  /home/gavin/sandbox/qemu.main/build/qemu-system-aarch64  \
  -accel kvm -machine virt,gic-version=host -cpu host  \
  -smp maxcpus=1,cpus=1,sockets=1,clusters=1,cores=1,threads=1 \
  -m 4096M,slots=16,maxmem=64G \
  -object memory-backend-ram,id=mem0,size=4096M\
   :   \
  -netdev tap,id=vnet0,vhost=true  \
  -device virtio-net-pci,bus=pcie.8,netdev=vnet0,mac=52:54:00:f1:26:b0
   :
  guest# netperf -H 10.26.1.81 -l 60 -C -c -t UDP_STREAM
  virtio_net virtio0: output.0:id 100 is not a head!

Add the missed smp_rmb() in vhost_vq_avail_empty(). When tx_can_batch()
returns true, it means there are still pending tx buffers. Since it
might read indices, it can still bypass the smp_rmb() in
vhost_get_vq_desc(). Note that it should be safe until vq->avail_idx is
changed by commit 275bf960ac697 ("vhost: better detection of available
buffers").

Fixes: 275bf960ac69 ("vhost: better detection of available buffers")
Cc:  # v4.11+
Reported-by: Yihuang Yu 
Suggested-by: Will Deacon 
Signed-off-by: Gavin Shan 
Acked-by: Jason Wang 
---
 drivers/vhost/vhost.c | 12 +++-
 1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 045f666b4f12..29df65b2ebf2 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -2799,9 +2799,19 @@ bool vhost_vq_avail_empty(struct vhost_dev *dev, struct 
vhost_virtqueue *vq)
	r = vhost_get_avail_idx(vq, &avail_idx);
if (unlikely(r))
return false;
+
vq->avail_idx = vhost16_to_cpu(vq, avail_idx);
+   if (vq->avail_idx != vq->last_avail_idx) {
+   /* Since we have updated avail_idx, the following
+* call to vhost_get_vq_desc() will read available
+* ring entries. Make sure that read happens after
+* the avail_idx read.
+*/
+   smp_rmb();
+   return false;
+   }
 
-   return vq->avail_idx == vq->last_avail_idx;
+   return true;
 }
 EXPORT_SYMBOL_GPL(vhost_vq_avail_empty);
 
-- 
2.44.0




[PATCH v3 0/3] vhost: Fix stale available ring entries

2024-03-27 Thread Gavin Shan
The issue was reported by Yihuang Yu on NVidia's grace-hopper (ARM64)
platform. The wrong head (available ring entry) is seen by the guest
when running 'netperf' on the guest and running 'netserver' on another
NVidia's grace-grace machine.

  /home/gavin/sandbox/qemu.main/build/qemu-system-aarch64  \
  -accel kvm -machine virt,gic-version=host -cpu host  \
  -smp maxcpus=1,cpus=1,sockets=1,clusters=1,cores=1,threads=1 \
  -m 4096M,slots=16,maxmem=64G \
  -object memory-backend-ram,id=mem0,size=4096M\
   :   \
  -netdev tap,id=tap0,vhost=true   \
  -device virtio-net-pci,bus=pcie.8,netdev=tap0,mac=52:54:00:f1:26:b0
   :
  guest# ifconfig eth0 | grep 'inet addr'
  inet addr:10.26.1.220
  guest# netperf -H 10.26.1.81 -l 60 -C -c -t UDP_STREAM
  virtio_net virtio0: output.0:id 100 is not a head!

There is a missed smp_rmb() in vhost_{vq_avail_empty, enable_notify}().
Without smp_rmb(), vq->avail_idx is advanced but the available ring
entries haven't arrived on the vhost side yet. So a stale available ring
entry can be fetched in vhost_get_vq_desc().

Fix it by adding smp_rmb() in those two functions. Note that I need
two patches so that they can be easily picked up by the stable kernel.
With the changes, I'm unable to hit the issue again. Besides, the
function vhost_get_avail_idx() is improved to handle the memory barrier
so that the callers needn't worry about it.

v2: 
https://lore.kernel.org/virtualization/46c6a9aa-821c-4013-afe7-61ec05fc9...@redhat.com
v1: 
https://lore.kernel.org/virtualization/66e12633-b2d6-4b9a-9103-bb79770fc...@redhat.com

Changelog
=
v3:
  Improved change log (Jason)
  Improved comments and added PATCH[v3 3/3] to execute
  smp_rmb() in vhost_get_avail_idx()  (Michael)


Gavin Shan (3):
  vhost: Add smp_rmb() in vhost_vq_avail_empty()
  vhost: Add smp_rmb() in vhost_enable_notify()
  vhost: Improve vhost_get_avail_idx() with smp_rmb()

 drivers/vhost/vhost.c | 51 ---
 1 file changed, 24 insertions(+), 27 deletions(-)

-- 
2.44.0
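
For reference, a condensed sketch of the reordering this series guards
against. This is an illustration rather than code from the patches; the
host-side names follow the diffs above and the guest side is paraphrased
from the virtio ring protocol:

/*
 * Guest (producer)                     vhost (consumer)
 * ----------------                     ----------------
 * avail->ring[idx % num] = head;       avail_idx = avail->idx;     (1)
 * smp_wmb();                           vq->avail_idx = avail_idx;
 * avail->idx = idx + 1;                smp_rmb();          /* the fix */
 *                                      head = avail->ring[...];    (2)
 *
 * Without the smp_rmb(), a weakly ordered CPU such as the arm64
 * grace-hopper above may satisfy (2) before (1): vhost then observes
 * the new index together with a stale ring entry, which is the
 * "id 100 is not a head!" splat in the report.
 */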




Re: [PATCH] uprobes: reduce contention on uprobes_tree access

2024-03-27 Thread Google
On Wed, 27 Mar 2024 17:06:01 +
Jonthan Haslam  wrote:

> > > Masami,
> > > 
> > > Given the discussion around per-cpu rw semaphore and need for
> > > (internal) batched attachment API for uprobes, do you think you can
> > > apply this patch as is for now? We can then gain initial improvements
> > > in scalability that are also easy to backport, and Jonathan will work
> > > on a more complete solution based on per-cpu RW semaphore, as
> > > suggested by Ingo.
> > 
> > Yeah, it is interesting to use per-cpu rw semaphore on uprobe.
> > I would like to wait for the next version.
> 
> My initial tests show a nice improvement over the RW spinlocks but a
> significant regression in acquiring a write lock. I've got a few days
> vacation over Easter but I'll aim to get some more formalised results out
> to the thread toward the end of next week.

As long as the write lock is only taken on the cold path, I think you can
choose the per-cpu RW semaphore. Since it does not busy-wait, the total
system performance impact will be small.
I look forward to your formalized results :)

Thank you,

> 
> Jon.
> 
> > 
> > Thank you,
> > 
> > > 
> > > >
> > > > BTW, how did you measure the overhead? I think spinlock overhead
> > > > will depend on how much lock contention happens.
> > > >
> > > > Thank you,
> > > >
> > > > >
> > > > > [0] https://docs.kernel.org/locking/spinlocks.html
> > > > >
> > > > > Signed-off-by: Jonathan Haslam 
> > > > > ---
> > > > >  kernel/events/uprobes.c | 22 +++---
> > > > >  1 file changed, 11 insertions(+), 11 deletions(-)
> > > > >
> > > > > diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
> > > > > index 929e98c62965..42bf9b6e8bc0 100644
> > > > > --- a/kernel/events/uprobes.c
> > > > > +++ b/kernel/events/uprobes.c
> > > > > @@ -39,7 +39,7 @@ static struct rb_root uprobes_tree = RB_ROOT;
> > > > >   */
> > > > >  #define no_uprobe_events()   RB_EMPTY_ROOT(&uprobes_tree)
> > > > >
> > > > > -static DEFINE_SPINLOCK(uprobes_treelock);/* serialize rbtree 
> > > > > access */
> > > > > +static DEFINE_RWLOCK(uprobes_treelock);  /* serialize rbtree 
> > > > > access */
> > > > >
> > > > >  #define UPROBES_HASH_SZ  13
> > > > >  /* serialize uprobe->pending_list */
> > > > > @@ -669,9 +669,9 @@ static struct uprobe *find_uprobe(struct inode 
> > > > > *inode, loff_t offset)
> > > > >  {
> > > > >   struct uprobe *uprobe;
> > > > >
> > > > > - spin_lock(&uprobes_treelock);
> > > > > + read_lock(&uprobes_treelock);
> > > > >   uprobe = __find_uprobe(inode, offset);
> > > > > - spin_unlock(&uprobes_treelock);
> > > > > + read_unlock(&uprobes_treelock);
> > > > >
> > > > >   return uprobe;
> > > > >  }
> > > > > @@ -701,9 +701,9 @@ static struct uprobe *insert_uprobe(struct uprobe 
> > > > > *uprobe)
> > > > >  {
> > > > >   struct uprobe *u;
> > > > >
> > > > > - spin_lock(&uprobes_treelock);
> > > > > + write_lock(&uprobes_treelock);
> > > > >   u = __insert_uprobe(uprobe);
> > > > > - spin_unlock(&uprobes_treelock);
> > > > > + write_unlock(&uprobes_treelock);
> > > > >
> > > > >   return u;
> > > > >  }
> > > > > @@ -935,9 +935,9 @@ static void delete_uprobe(struct uprobe *uprobe)
> > > > >   if (WARN_ON(!uprobe_is_active(uprobe)))
> > > > >   return;
> > > > >
> > > > > - spin_lock(&uprobes_treelock);
> > > > > + write_lock(&uprobes_treelock);
> > > > >   rb_erase(&uprobe->rb_node, &uprobes_tree);
> > > > > - spin_unlock(&uprobes_treelock);
> > > > > + write_unlock(&uprobes_treelock);
> > > > >   RB_CLEAR_NODE(&uprobe->rb_node); /* for uprobe_is_active() */
> > > > >   put_uprobe(uprobe);
> > > > >  }
> > > > > @@ -1298,7 +1298,7 @@ static void build_probe_list(struct inode 
> > > > > *inode,
> > > > >   min = vaddr_to_offset(vma, start);
> > > > >   max = min + (end - start) - 1;
> > > > >
> > > > > - spin_lock(&uprobes_treelock);
> > > > > + read_lock(&uprobes_treelock);
> > > > >   n = find_node_in_range(inode, min, max);
> > > > >   if (n) {
> > > > >   for (t = n; t; t = rb_prev(t)) {
> > > > > @@ -1316,7 +1316,7 @@ static void build_probe_list(struct inode 
> > > > > *inode,
> > > > >   get_uprobe(u);
> > > > >   }
> > > > >   }
> > > > > - spin_unlock(&uprobes_treelock);
> > > > > + read_unlock(&uprobes_treelock);
> > > > >  }
> > > > >
> > > > >  /* @vma contains reference counter, not the probed instruction. */
> > > > > @@ -1407,9 +1407,9 @@ vma_has_uprobes(struct vm_area_struct *vma, 
> > > > > unsigned long start, unsigned long e
> > > > >   min = vaddr_to_offset(vma, start);
> > > > >   max = min + (end - start) - 1;
> > > > >
> > > > > - spin_lock(&uprobes_treelock);
> > > > > + read_lock(&uprobes_treelock);
> > > > >   n = find_node_in_range(inode, min, max);
> > > > > - spin_unlock(&uprobes_treelock);
> > > > > + read_unlock(&uprobes_treelock);
> > > > >
> > > > >   return !!n;
> > > > >  }
> > > > > --
> > > > > 2.43.0
> > > > >
> > > >
> > > >
> > > > --
> > > > Masami Hiramatsu (Google) 
> > 
> > 
> > -- 
> > Masami 
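
For context, a rough sketch of the per-cpu RW semaphore variant under
discussion. This is an assumed shape rather than Jonathan's actual patch;
the name uprobes_treesem and the converted callers are illustrative:

#include <linux/percpu-rwsem.h>

static DEFINE_STATIC_PERCPU_RWSEM(uprobes_treesem);

/* Hot path: readers take the cheap per-cpu side. */
static struct uprobe *find_uprobe(struct inode *inode, loff_t offset)
{
	struct uprobe *uprobe;

	percpu_down_read(&uprobes_treesem);
	uprobe = __find_uprobe(inode, offset);
	percpu_up_read(&uprobes_treesem);

	return uprobe;
}

/* Cold path: register/unregister pays the slow writer side. */
static struct uprobe *insert_uprobe(struct uprobe *uprobe)
{
	struct uprobe *u;

	percpu_down_write(&uprobes_treesem);
	u = __insert_uprobe(uprobe);
	percpu_up_write(&uprobes_treesem);

	return u;
}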

[PATCH v2 1/1] vhost: Added pad cleanup if vnet_hdr is not present.

2024-03-27 Thread Andrew Melnychenko
When Qemu is launched with vhost but without tap vnet_hdr,
vhost tries to copy a vnet_hdr of size 0 from the socket iter
to a page that may contain some trash.
That trash can be interpreted as unpredictable values for the
vnet_hdr.
That leads to dropping some packets and, in some cases, to
stalling the vhost routine when vhost_net tries to process
packets and fails in a loop.

Qemu options:
  -netdev tap,vhost=on,vnet_hdr=off,...

From a security point of view, wrong values in fields used later by
tap's tap_get_user_xdp() will affect skb gso and options.
Later, the header (and the data in the headroom) should not be used
by the stack. Using a custom socket as a backend to vhost_net can
reveal some data in the vnet_hdr, although it would require kernel
access to implement.

The issue happens because the value of sock_len in the virtqueue is 0.
That value is set in vhost_net_set_features() with
VHOST_NET_F_VIRTIO_NET_HDR; it is also set to zero in the device open()
and reset() routines.
So, currently, to trigger the issue, we need to set up qemu with
vhost=on,vnet_hdr=off, or not configure vhost in the custom program.

Signed-off-by: Andrew Melnychenko 
---
 drivers/vhost/net.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index f2ed7167c848..57411ac2d08b 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -735,6 +735,9 @@ static int vhost_net_build_xdp(struct vhost_net_virtqueue 
*nvq,
hdr = buf;
gso = &hdr->gso;
 
+   if (!sock_hlen)
+   memset(buf, 0, pad);
+
if ((gso->flags & VIRTIO_NET_HDR_F_NEEDS_CSUM) &&
vhost16_to_cpu(vq, gso->csum_start) +
vhost16_to_cpu(vq, gso->csum_offset) + 2 >
-- 
2.43.0
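
As a note on why zeroing the pad is a safe default: an all-zero header is
the virtio-net "no offloads" case. A sketch of what tap_get_user_xdp()
effectively sees after the memset; the struct layout comes from the uapi
header, and this block is illustrative, not part of the patch:

#include <linux/virtio_net.h>

/* memset(buf, 0, pad) is equivalent to presenting this header: */
struct virtio_net_hdr hdr = {
	.flags       = 0,                       /* no VIRTIO_NET_HDR_F_NEEDS_CSUM */
	.gso_type    = VIRTIO_NET_HDR_GSO_NONE, /* GSO_NONE == 0 */
	.hdr_len     = 0,
	.gso_size    = 0,
	.csum_start  = 0,
	.csum_offset = 0,
};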




Re: [PATCH v4 2/2] tracing: Include Microcode Revision in mce_record tracepoint

2024-03-27 Thread Borislav Petkov
On Wed, Mar 27, 2024 at 03:31:01PM -0700, Sohil Mehta wrote:
> On 3/27/2024 1:54 PM, Avadhut Naik wrote:
> 
> > -   TP_printk("CPU: %d, MCGc/s: %llx/%llx, MC%d: %016Lx, IPID: %016Lx, 
> > ADDR/MISC/SYND: %016Lx/%016Lx/%016Lx, RIP: %02x:<%016Lx>, TSC: %llx, PPIN: 
> > %llx, PROCESSOR: %u:%x, TIME: %llu, SOCKET: %u, APIC: %x",
> > +   TP_printk("CPU: %d, MCGc/s: %llx/%llx, MC%d: %016Lx, IPID: %016Lx, 
> > ADDR/MISC/SYND: %016Lx/%016Lx/%016Lx, RIP: %02x:<%016Lx>, TSC: %llx, PPIN: 
> > %llx, PROCESSOR: %u:%x, TIME: %llu, SOCKET: %u, APIC: %x, MICROCODE 
> > REVISION: %x",
> 
> Nit: s/MICROCODE REVISION/MICROCODE/g
> 
> You could probably get rid of the word REVISION in the interest of
> brevity similar to __print_mce().

You *definitely* want to do that - good catch.

And TBH, all the screaming words aren't helping either... :)

-- 
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette



Re: [PATCH v4 2/2] tracing: Include Microcode Revision in mce_record tracepoint

2024-03-27 Thread Sohil Mehta
On 3/27/2024 1:54 PM, Avadhut Naik wrote:

> - TP_printk("CPU: %d, MCGc/s: %llx/%llx, MC%d: %016Lx, IPID: %016Lx, 
> ADDR/MISC/SYND: %016Lx/%016Lx/%016Lx, RIP: %02x:<%016Lx>, TSC: %llx, PPIN: 
> %llx, PROCESSOR: %u:%x, TIME: %llu, SOCKET: %u, APIC: %x",
> + TP_printk("CPU: %d, MCGc/s: %llx/%llx, MC%d: %016Lx, IPID: %016Lx, 
> ADDR/MISC/SYND: %016Lx/%016Lx/%016Lx, RIP: %02x:<%016Lx>, TSC: %llx, PPIN: 
> %llx, PROCESSOR: %u:%x, TIME: %llu, SOCKET: %u, APIC: %x, MICROCODE REVISION: 
> %x",

Nit: s/MICROCODE REVISION/MICROCODE/g

You could probably get rid of the word REVISION in the interest of
brevity similar to __print_mce().

pr_emerg(HW_ERR "PROCESSOR %u:%x TIME %llu SOCKET %u APIC %x microcode
%x\n",
m->cpuvendor, m->cpuid, m->time, m->socketid, m->apicid,
m->microcode);


-Sohil



Re: [PATCH RFT] arm64: dts: qcom: sm8350: Reenable crypto & cryptobam

2024-03-27 Thread Konrad Dybcio
On 16.02.2024 11:36 AM, Luca Weiss wrote:
> On Mon Jan 8, 2024 at 11:45 PM CET, Dmitry Baryshkov wrote:
>> On Mon, 8 Jan 2024 at 16:23, Luca Weiss  wrote:
>>>
>>> On Mon Jan 8, 2024 at 3:18 PM CET, Konrad Dybcio wrote:
 On 8.01.2024 14:49, Luca Weiss wrote:
> When num-channels and qcom,num-ees is not provided in devicetree, the
> driver will try to read these values from the registers during probe but
> this fails if the interconnect is not on and then crashes the system.
>
> So we can provide these properties in devicetree (queried after patching
> BAM driver to enable the necessary interconnect) so we can probe
> cryptobam without reading registers and then also use the QCE as
> expected.

 This really feels a bit backwards.. Enable the resource to query the
 hardware for numbers, so that said resource can be enabled, but
 slightly later :/
>>>
>>> If you think adding interconnect support to driver and dtsi is better,
>>> let me know.
>>
>> I'd say, adding the proper interconnect is a better option. Otherwise
>> we just depend on the QCE itself to set up the vote for us.
> 
> Yes, currently we depend on that.
> 
>>
>>>
>>> Stephan (+CC) mentioned it should be okay like this *shrug*
>>>
>>> For the record, this is the same way I got the values for sc7280[0] and
>>> sm6350[1].
>>>
>>> [0] 
>>> https://lore.kernel.org/linux-arm-msm/20231229-sc7280-cryptobam-fixup-v1-1-bd8f68589...@fairphone.com/
>>> [1] 
>>> https://lore.kernel.org/linux-arm-msm/20240105-sm6350-qce-v1-0-416e5c731...@fairphone.com/
>>>
>>> diff --git a/arch/arm64/boot/dts/qcom/sm8350.dtsi 
>>> b/arch/arm64/boot/dts/qcom/sm8350.dtsi
>>> index b46236235b7f..cd4dd9852d9e 100644
>>> --- a/arch/arm64/boot/dts/qcom/sm8350.dtsi
>>> +++ b/arch/arm64/boot/dts/qcom/sm8350.dtsi
>>> @@ -1756,8 +1756,8 @@ cryptobam: dma-controller@1dc4000 {
>>> qcom,controlled-remotely;
>>> iommus = <&apps_smmu 0x594 0x0011>,
>>>  <&apps_smmu 0x596 0x0011>;
>>> -   /* FIXME: Probing BAM DMA causes some abort and 
>>> system hang */
>>> -   status = "fail";
>>> +   interconnects = <&aggre2_noc MASTER_CRYPTO 0 
>>> &mc_virt SLAVE_EBI1 0>;
>>> +   interconnect-names = "memory";
>>> };
>>>
>>> crypto: crypto@1dfa000 {
>>> diff --git a/drivers/dma/qcom/bam_dma.c b/drivers/dma/qcom/bam_dma.c
>>> index 5e7d332731e0..9de28f615639 100644
>>> --- a/drivers/dma/qcom/bam_dma.c
>>> +++ b/drivers/dma/qcom/bam_dma.c
>>> @@ -40,6 +40,7 @@
>>>  #include 
>>>  #include 
>>>  #include 
>>> +#include <linux/interconnect.h>
>>>  #include 
>>>
>>>  #include "../dmaengine.h"
>>> @@ -394,6 +395,7 @@ struct bam_device {
>>> const struct reg_offset_data *layout;
>>>
>>> struct clk *bamclk;
>>> +   struct icc_path *mem_path;
>>> int irq;
>>>
>>> /* dma start transaction tasklet */
>>> @@ -1206,6 +1208,7 @@ static int bam_init(struct bam_device *bdev)
>>> bdev->num_channels = val & BAM_NUM_PIPES_MASK;
>>> }
>>>
>>> +   printk(KERN_ERR "%s:%d DBG num_ees=%u num_channels=%u\n", __func__, 
>>> __LINE__, bdev->num_ees, bdev->num_channels);
>>> /* Reset BAM now if fully controlled locally */
>>> if (!bdev->controlled_remotely && !bdev->powered_remotely)
>>> bam_reset(bdev);
>>> @@ -1298,6 +1301,14 @@ static int bam_dma_probe(struct platform_device 
>>> *pdev)
>>> return ret;
>>> }
>>>
>>> +   bdev->mem_path = devm_of_icc_get(bdev->dev, "memory");
>>> +   if (IS_ERR(bdev->mem_path))
>>> +   return PTR_ERR(bdev->mem_path);
>>> +
>>> +   ret = icc_set_bw(bdev->mem_path, 1, 1);
>>
>> Probably this needs some more sensible value.
> 
> So downstream qcedev driver uses 384 for the interconnect. But this is
> crypto-specific and probably different BAMs have different minimum
> requirements?
> 
> #define CRYPTO_AVG_BW 384
> #define CRYPTO_PEAK_BW 384
> https://github.com/xiaomi-sm8450-kernel/android_kernel_platform_msm-kernel/blob/lineage-20/drivers/crypto/msm/qce.h#L57
> 
> Do you have any suggestion what to use here?

I'dve expected this to mean anything, but apparently not.

My immediate guess is that the 384 was the lowest magic value that didn't
result in the bus getting kicked offline.. 1 should be fine upstream due
to commit 91e045b93db7 ("interconnect: qcom: Fix small BW votes being
truncated to zero").

> 
> Also I'd assume that with pm_runtime suspended we'd need to clear the
> votes in the driver so we don't keep the interconnect alive
> unnecessarily?

My naive understanding is that the power should only be necessary when
the thing is in use, so early probe and pm-active sounds about sane..

Konrad
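
A rough sketch of what clearing the vote on runtime suspend could look
like; this is illustrative and untested, and bam_dma's real runtime PM
callbacks also manage the clock, which is elided here:

static int bam_dma_runtime_suspend(struct device *dev)
{
	struct bam_device *bdev = dev_get_drvdata(dev);

	/* Drop the memory path vote while the BAM is idle. */
	icc_set_bw(bdev->mem_path, 0, 0);
	return 0;
}

static int bam_dma_runtime_resume(struct device *dev)
{
	struct bam_device *bdev = dev_get_drvdata(dev);

	/* Restore the minimal vote before the BAM is used again. */
	return icc_set_bw(bdev->mem_path, 1, 1);
}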



Re: [RFC PATCH v2 1/4] tracing: add __print_sym() to replace __print_symbolic()

2024-03-27 Thread Johannes Berg
On Wed, 2024-03-27 at 21:11 +, Simon Horman wrote:
> 
> I'm seeing some allmodconfig build problems with this applied on top of
> net-next.

> ./include/trace/stages/init.h:30: warning: "TRACE_DEFINE_SYM_FNS" redefined

> ./include/trace/stages/init.h:54: warning: "TRACE_DEFINE_SYM_LIST" redefined

Yeah, the 0-day bot reported that too, sorry about that. It needs two
lines to #undef these in init.h before their definition, just like all
other macros there. Not sure why my builds didn't show that, maybe it
doesn't affect all users.

johannes
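
A sketch of the fix being described, mirroring how init.h undefines its
other macros before redefining them; the placement is assumed, and the
definitions themselves are the ones from the patch:

/* include/trace/stages/init.h */
#undef TRACE_DEFINE_SYM_FNS
#define TRACE_DEFINE_SYM_FNS(_symbol_id, _lookup, _show)	\
	/* ... definition from the patch ... */

#undef TRACE_DEFINE_SYM_LIST
#define TRACE_DEFINE_SYM_LIST(_symbol_id, ...)			\
	/* ... definition from the patch ... */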



Re: [RFC PATCH v2 1/4] tracing: add __print_sym() to replace __print_symbolic()

2024-03-27 Thread Simon Horman
On Tue, Mar 26, 2024 at 08:15:56PM +0100, Johannes Berg wrote:
> From: Johannes Berg 
> 
> The way __print_symbolic() works is limited and inefficient
> in multiple ways:
>  - you can only use it with a static list of symbols, but
>e.g. the SKB dropreasons are now a dynamic list
> 
>  - it builds the list in memory _three_ times, so it takes
>a lot of memory:
>- The print_fmt contains the list (since it's passed to
>  the macro there). This actually contains the names
>  _twice_, which is fixed up at runtime.
>- TRACE_DEFINE_ENUM() puts a 24-byte struct trace_eval_map
>  for every entry, plus the string pointed to by it, which
>  cannot be deduplicated with the strings in the print_fmt
>- The in-kernel symbolic printing creates yet another list
>  of struct trace_print_flags for trace_print_symbols_seq()
> 
>  - it also requires runtime fixup during init, which is a lot
>of string parsing due to the print_fmt fixup
> 
> Introduce __print_sym() to - over time - replace the old one.
> We can easily extend this also to __print_flags later, but I
> cared only about the SKB dropreasons for now, which has only
> __print_symbolic().
> 
> This new __print_sym() requires only a single list of items,
> created by TRACE_DEFINE_SYM_LIST(), or can even use another
> already existing list by using TRACE_DEFINE_SYM_FNS() with
> lookup and show methods.
> 
> Then, instead of doing an init-time fixup, just do this at the
> time when userspace reads the print_fmt. This way, dynamically
> updated lists are possible.
> 
> For userspace, nothing actually changes, because the print_fmt
> is shown exactly the same way the old __print_symbolic() was.
> 
> This adds about 4k .text in my test builds, but that'll be
> more than paid for by the actual conversions.
> 
> Signed-off-by: Johannes Berg 

Hi Johannes,

I'm seeing some allmodconfig build problems with this applied on top of
net-next.

In file included from ./include/trace/trace_events.h:27,
 from ./include/trace/define_trace.h:102,
 from ./include/trace/events/module.h:134,
 from kernel/module/main.c:64:
./include/trace/stages/init.h:30: warning: "TRACE_DEFINE_SYM_FNS" redefined
   30 | #define TRACE_DEFINE_SYM_FNS(_symbol_id, _lookup, _show)
\
  |
In file included from ./include/linux/trace_events.h:11,
 from kernel/module/main.c:14:
./include/linux/tracepoint.h:130: note: this is the location of the previous 
definition
  130 | #define TRACE_DEFINE_SYM_FNS(...)
  |
./include/trace/stages/init.h:54: warning: "TRACE_DEFINE_SYM_LIST" redefined
   54 | #define TRACE_DEFINE_SYM_LIST(_symbol_id, ...)  
\
  |
./include/linux/tracepoint.h:131: note: this is the location of the previous 
definition
  131 | #define TRACE_DEFINE_SYM_LIST(...)
  |




RE: [PATCH v4 2/2] tracing: Include Microcode Revision in mce_record tracepoint

2024-03-27 Thread Luck, Tony
> Export microcode version through the tracepoint to prevent ambiguity over
> the active version on the system when the MCE was received.
>
> Signed-off-by: Avadhut Naik 
> Reviewed-by: Sohil Mehta 
> Reviewed-by: Steven Rostedt (Google) 

Reviewed-by: Tony Luck 



RE: [PATCH v4 1/2] tracing: Include PPIN in mce_record tracepoint

2024-03-27 Thread Luck, Tony
> Export PPIN through the tracepoint as it provides a unique identifier for
> the system (or socket in case of multi-socket systems) on which the MCE
> has been received.
>
> Signed-off-by: Avadhut Naik 
> Reviewed-by: Sohil Mehta 
> Reviewed-by: Steven Rostedt (Google) 

Reviewed-by: Tony Luck 



[PATCH v4 2/2] tracing: Include Microcode Revision in mce_record tracepoint

2024-03-27 Thread Avadhut Naik
Currently, the microcode field (Microcode Revision) of struct mce is not
exported to userspace through the mce_record tracepoint.

Knowing the microcode version on which the MCE was received is critical
information for debugging. If the version is not recorded, later attempts
to acquire the version might result in discrepancies since it can be
changed at runtime.

Export microcode version through the tracepoint to prevent ambiguity over
the active version on the system when the MCE was received.

Signed-off-by: Avadhut Naik 
Reviewed-by: Sohil Mehta 
Reviewed-by: Steven Rostedt (Google) 
---
 include/trace/events/mce.h | 7 +--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/include/trace/events/mce.h b/include/trace/events/mce.h
index 959ba7b775b1..69438550252c 100644
--- a/include/trace/events/mce.h
+++ b/include/trace/events/mce.h
@@ -42,6 +42,7 @@ TRACE_EVENT(mce_record,
__field(u8, cs  )
__field(u8, bank)
__field(u8, cpuvendor   )
+   __field(u32,microcode   )
),
 
TP_fast_assign(
@@ -63,9 +64,10 @@ TRACE_EVENT(mce_record,
__entry->cs = m->cs;
__entry->bank   = m->bank;
__entry->cpuvendor  = m->cpuvendor;
+   __entry->microcode  = m->microcode;
),
 
-   TP_printk("CPU: %d, MCGc/s: %llx/%llx, MC%d: %016Lx, IPID: %016Lx, 
ADDR/MISC/SYND: %016Lx/%016Lx/%016Lx, RIP: %02x:<%016Lx>, TSC: %llx, PPIN: 
%llx, PROCESSOR: %u:%x, TIME: %llu, SOCKET: %u, APIC: %x",
+   TP_printk("CPU: %d, MCGc/s: %llx/%llx, MC%d: %016Lx, IPID: %016Lx, 
ADDR/MISC/SYND: %016Lx/%016Lx/%016Lx, RIP: %02x:<%016Lx>, TSC: %llx, PPIN: 
%llx, PROCESSOR: %u:%x, TIME: %llu, SOCKET: %u, APIC: %x, MICROCODE REVISION: 
%x",
__entry->cpu,
__entry->mcgcap, __entry->mcgstatus,
__entry->bank, __entry->status,
@@ -77,7 +79,8 @@ TRACE_EVENT(mce_record,
__entry->cpuvendor, __entry->cpuid,
__entry->walltime,
__entry->socketid,
-   __entry->apicid)
+   __entry->apicid,
+   __entry->microcode)
 );
 
 #endif /* _TRACE_MCE_H */
-- 
2.34.1
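
As a usage sketch (editorial, not part of the series): with the fields
exported, a kernel module can attach a probe to the tracepoint and read
them straight from struct mce. The probe and init names are illustrative:

#include <linux/module.h>
#include <trace/events/mce.h>

static void my_mce_probe(void *ignore, struct mce *m)
{
	/* ppin and microcode are the fields added by this series */
	pr_info("mce: cpu=%u ppin=%llx microcode=%x\n",
		m->extcpu, m->ppin, m->microcode);
}

static int __init my_mce_init(void)
{
	return register_trace_mce_record(my_mce_probe, NULL);
}
module_init(my_mce_init);
MODULE_LICENSE("GPL");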




[PATCH v4 1/2] tracing: Include PPIN in mce_record tracepoint

2024-03-27 Thread Avadhut Naik
Machine Check Error information from struct mce is exported to userspace
through the mce_record tracepoint.

Currently, however, the PPIN (Protected Processor Inventory Number) field
of struct mce is not exported through the tracepoint.

Export PPIN through the tracepoint as it provides a unique identifier for
the system (or socket in case of multi-socket systems) on which the MCE
has been received.

Signed-off-by: Avadhut Naik 
Reviewed-by: Sohil Mehta 
Reviewed-by: Steven Rostedt (Google) 
---
 include/trace/events/mce.h | 13 -
 1 file changed, 12 insertions(+), 1 deletion(-)

diff --git a/include/trace/events/mce.h b/include/trace/events/mce.h
index 1391ada0da3b..959ba7b775b1 100644
--- a/include/trace/events/mce.h
+++ b/include/trace/events/mce.h
@@ -9,6 +9,14 @@
 #include <linux/tracepoint.h>
 #include <asm/mce.h>
 
+/*
+ * MCE Event Record.
+ *
+ * Only very relevant and transient information which cannot be
+ * gathered from a system by any other means or which can only be
+ * acquired arduously should be added to this record.
+ */
+
 TRACE_EVENT(mce_record,
 
TP_PROTO(struct mce *m),
@@ -25,6 +33,7 @@ TRACE_EVENT(mce_record,
__field(u64,ipid)
__field(u64,ip  )
__field(u64,tsc )
+   __field(u64,ppin)
__field(u64,walltime)
__field(u32,cpu )
__field(u32,cpuid   )
@@ -45,6 +54,7 @@ TRACE_EVENT(mce_record,
__entry->ipid   = m->ipid;
__entry->ip = m->ip;
__entry->tsc= m->tsc;
+   __entry->ppin   = m->ppin;
__entry->walltime   = m->time;
__entry->cpu= m->extcpu;
__entry->cpuid  = m->cpuid;
@@ -55,7 +65,7 @@ TRACE_EVENT(mce_record,
__entry->cpuvendor  = m->cpuvendor;
),
 
-   TP_printk("CPU: %d, MCGc/s: %llx/%llx, MC%d: %016Lx, IPID: %016Lx, 
ADDR/MISC/SYND: %016Lx/%016Lx/%016Lx, RIP: %02x:<%016Lx>, TSC: %llx, PROCESSOR: 
%u:%x, TIME: %llu, SOCKET: %u, APIC: %x",
+   TP_printk("CPU: %d, MCGc/s: %llx/%llx, MC%d: %016Lx, IPID: %016Lx, 
ADDR/MISC/SYND: %016Lx/%016Lx/%016Lx, RIP: %02x:<%016Lx>, TSC: %llx, PPIN: 
%llx, PROCESSOR: %u:%x, TIME: %llu, SOCKET: %u, APIC: %x",
__entry->cpu,
__entry->mcgcap, __entry->mcgstatus,
__entry->bank, __entry->status,
@@ -63,6 +73,7 @@ TRACE_EVENT(mce_record,
__entry->addr, __entry->misc, __entry->synd,
__entry->cs, __entry->ip,
__entry->tsc,
+   __entry->ppin,
__entry->cpuvendor, __entry->cpuid,
__entry->walltime,
__entry->socketid,
-- 
2.34.1




[PATCH v4 0/2] Update mce_record tracepoint

2024-03-27 Thread Avadhut Naik
This patchset updates the mce_record tracepoint so that the recently added
fields of struct mce are exported through it to userspace.

The first patch adds PPIN (Protected Processor Inventory Number) field to
the tracepoint.

The second patch adds the microcode field (Microcode Revision) to the
tracepoint.

Changes in v2:
 - Export microcode field (Microcode Revision) through the tracepoiont in
   addition to PPIN.

Changes in v3:
 - Change format specifier for microcode revision from %u to %x
 - Fix tab alignments
 - Add Reviewed-by: Sohil Mehta 

Changes in v4:
 - Update commit messages to reflect the reason for the fields being
   added to the tracepoint.
 - Add comment to explicitly state the type of information that should
   be added to the tracepoint.
 - Add Reviewed-by: Steven Rostedt (Google) 

[NOTE:

 - Since only a comment has been added and only the commit messages have
   been reworked, i.e. no code changes have been undertaken for this
   version, I have retained the below tags from v3:
Reviewed-by: Sohil Mehta 
Reviewed-by: Steven Rostedt (Google) ]

Avadhut Naik (2):
  tracing: Include PPIN in mce_record tracepoint
  tracing: Include Microcode Revision in mce_record tracepoint

 include/trace/events/mce.h | 18 --
 1 file changed, 16 insertions(+), 2 deletions(-)


base-commit: 818ea9b4c8237f96ac99dc0b2f02dd6d3f2adb97
-- 
2.34.1




Re: [PATCH untested] vhost: order avail ring reads after index updates

2024-03-27 Thread Michael S. Tsirkin
On Wed, Mar 27, 2024 at 07:52:02PM +, Will Deacon wrote:
> On Wed, Mar 27, 2024 at 01:26:23PM -0400, Michael S. Tsirkin wrote:
> > vhost_get_vq_desc (correctly) uses smp_rmb to order
> > avail ring reads after index reads.
> > However, over time we added two more places that read the
> > index and do not bother with barriers.
> > Since vhost_get_vq_desc, when it was written, assumed it was the
> > only reader, it does not bother with a barrier either when it sees
> > that a new index value is already cached. As a result, on the
> > nvidia-gracehopper platform (arm64), available ring entry reads
> > have been observed bypassing the index reads, causing a ring
> > corruption.
> > 
> > To fix, factor out the correct index access code from vhost_get_vq_desc.
> > As a side benefit, we also validate the index on all paths now, which
> > will hopefully help catch future errors earlier.
> > 
> > Note: current code is inconsistent in how it handles errors:
> > some places treat it as an empty ring, others - non empty.
> > This patch does not attempt to change the existing behaviour.
> > 
> > Cc: sta...@vger.kernel.org
> > Reported-by: Gavin Shan 
> > Reported-by: Will Deacon 
> > Suggested-by: Will Deacon 
> > Fixes: 275bf960ac69 ("vhost: better detection of available buffers")
> > Cc: "Jason Wang" 
> > Fixes: d3bb267bbdcb ("vhost: cache avail index in vhost_enable_notify()")
> > Cc: "Stefano Garzarella" 
> > Signed-off-by: Michael S. Tsirkin 
> > ---
> > 
> > I think it's better to bite the bullet and clean up the code.
> > Note: this is still only built, not tested.
> > Gavin could you help test please?
> > Especially on the arm platform you have?
> > 
> > Will thanks so much for finding this race!
> 
> No problem, and I was also hoping that the smp_rmb() could be
> consolidated into a single helper like you've done here.
> 
> One minor comment below:
> 
> >  drivers/vhost/vhost.c | 80 +++
> >  1 file changed, 42 insertions(+), 38 deletions(-)
> > 
> > diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> > index 045f666b4f12..26b70b1fd9ff 100644
> > --- a/drivers/vhost/vhost.c
> > +++ b/drivers/vhost/vhost.c
> > @@ -1290,10 +1290,38 @@ static void vhost_dev_unlock_vqs(struct vhost_dev 
> > *d)
> > mutex_unlock(&d->vqs[i]->mutex);
> >  }
> >  
> > -static inline int vhost_get_avail_idx(struct vhost_virtqueue *vq,
> > - __virtio16 *idx)
> > +static inline int vhost_get_avail_idx(struct vhost_virtqueue *vq)
> >  {
> > -   return vhost_get_avail(vq, *idx, &vq->avail->idx);
> > +   __virtio16 idx;
> > +   u16 avail_idx;
> > +   int r = vhost_get_avail(vq, idx, &vq->avail->idx);
> > +
> > +   if (unlikely(r < 0)) {
> > +   vq_err(vq, "Failed to access avail idx at %p: %d\n",
> > +  &vq->avail->idx, r);
> > +   return -EFAULT;
> > +   }
> > +
> > +   avail_idx = vhost16_to_cpu(vq, idx);
> > +
> > +   /* Check it isn't doing very strange things with descriptor numbers. */
> > +   if (unlikely((u16)(avail_idx - vq->last_avail_idx) > vq->num)) {
> > +   vq_err(vq, "Guest moved used index from %u to %u",
> > +  vq->last_avail_idx, vq->avail_idx);
> > +   return -EFAULT;
> > +   }
> > +
> > +   /* Nothing new? We are done. */
> > +   if (avail_idx == vq->avail_idx)
> > +   return 0;
> > +
> > +   vq->avail_idx = avail_idx;
> > +
> > +   /* We updated vq->avail_idx so we need a memory barrier between
> > +* the index read above and the caller reading avail ring entries.
> > +*/
> > +   smp_rmb();
> 
> I think you could use smp_acquire__after_ctrl_dep() if you're feeling
> brave, but to be honest I'd prefer we went in the opposite direction
> and used READ/WRITE_ONCE + smp_load_acquire()/smp_store_release() across
> the board. It's just a thankless, error-prone task to get there :(

Let's just say that's a separate patch, I tried hard to make this one
a bugfix only, no other functional changes at all.

> So, for the patch as-is:
> 
> Acked-by: Will Deacon 
> 
> (I've not tested it either though, so definitely wait for Gavin on that!)
> 
> Cheers,
> 
> Will




Re: [PATCH untested] vhost: order avail ring reads after index updates

2024-03-27 Thread Will Deacon
On Wed, Mar 27, 2024 at 01:26:23PM -0400, Michael S. Tsirkin wrote:
> vhost_get_vq_desc (correctly) uses smp_rmb to order
> avail ring reads after index reads.
> However, over time we added two more places that read the
> index and do not bother with barriers.
> Since vhost_get_vq_desc, when it was written, assumed it was the
> only reader, it does not bother with a barrier either when it sees
> that a new index value is already cached. As a result, on the
> nvidia-gracehopper platform (arm64), available ring entry reads
> have been observed bypassing the index reads, causing a ring
> corruption.
> 
> To fix, factor out the correct index access code from vhost_get_vq_desc.
> As a side benefit, we also validate the index on all paths now, which
> will hopefully help catch future errors earlier.
> 
> Note: current code is inconsistent in how it handles errors:
> some places treat it as an empty ring, others - non empty.
> This patch does not attempt to change the existing behaviour.
> 
> Cc: sta...@vger.kernel.org
> Reported-by: Gavin Shan 
> Reported-by: Will Deacon 
> Suggested-by: Will Deacon 
> Fixes: 275bf960ac69 ("vhost: better detection of available buffers")
> Cc: "Jason Wang" 
> Fixes: d3bb267bbdcb ("vhost: cache avail index in vhost_enable_notify()")
> Cc: "Stefano Garzarella" 
> Signed-off-by: Michael S. Tsirkin 
> ---
> 
> I think it's better to bite the bullet and clean up the code.
> Note: this is still only built, not tested.
> Gavin could you help test please?
> Especially on the arm platform you have?
> 
> Will thanks so much for finding this race!

No problem, and I was also hoping that the smp_rmb() could be
consolidated into a single helper like you've done here.

One minor comment below:

>  drivers/vhost/vhost.c | 80 +++
>  1 file changed, 42 insertions(+), 38 deletions(-)
> 
> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> index 045f666b4f12..26b70b1fd9ff 100644
> --- a/drivers/vhost/vhost.c
> +++ b/drivers/vhost/vhost.c
> @@ -1290,10 +1290,38 @@ static void vhost_dev_unlock_vqs(struct vhost_dev *d)
>   mutex_unlock(&d->vqs[i]->mutex);
>  }
>  
> -static inline int vhost_get_avail_idx(struct vhost_virtqueue *vq,
> -   __virtio16 *idx)
> +static inline int vhost_get_avail_idx(struct vhost_virtqueue *vq)
>  {
> - return vhost_get_avail(vq, *idx, &vq->avail->idx);
> + __virtio16 idx;
> + u16 avail_idx;
> + int r = vhost_get_avail(vq, idx, &vq->avail->idx);
> +
> + if (unlikely(r < 0)) {
> + vq_err(vq, "Failed to access avail idx at %p: %d\n",
> +    &vq->avail->idx, r);
> + return -EFAULT;
> + }
> +
> + avail_idx = vhost16_to_cpu(vq, idx);
> +
> + /* Check it isn't doing very strange things with descriptor numbers. */
> + if (unlikely((u16)(avail_idx - vq->last_avail_idx) > vq->num)) {
> + vq_err(vq, "Guest moved used index from %u to %u",
> +vq->last_avail_idx, vq->avail_idx);
> + return -EFAULT;
> + }
> +
> + /* Nothing new? We are done. */
> + if (avail_idx == vq->avail_idx)
> + return 0;
> +
> + vq->avail_idx = avail_idx;
> +
> + /* We updated vq->avail_idx so we need a memory barrier between
> +  * the index read above and the caller reading avail ring entries.
> +  */
> + smp_rmb();

I think you could use smp_acquire__after_ctrl_dep() if you're feeling
brave, but to be honest I'd prefer we went in the opposite direction
and used READ/WRITE_ONCE + smp_load_acquire()/smp_store_release() across
the board. It's just a thankless, error-prone task to get there :(

So, for the patch as-is:

Acked-by: Will Deacon 

(I've not tested it either though, so definitely wait for Gavin on that!)

Cheers,

Will



Re: [PATCH v2 2/2] ARM: dts: qcom: Add support for Motorola Moto G (2013)

2024-03-27 Thread Konrad Dybcio
On 26.03.2024 1:11 PM, Stanislav Jakubek wrote:
> Add a device tree for the Motorola Moto G (2013) smartphone based
> on the Qualcomm MSM8226 SoC.
> 
> Initially supported features:
>   - Buttons (Volume Down/Up, Power)
>   - eMMC
>   - Hall Effect Sensor
>   - SimpleFB display
>   - TMP108 temperature sensor
>   - Vibrator
> 
> Note: the dhob and shob reserved-memory regions are seemingly a part of some
> Motorola specific (firmware?) mechanism, see [1].
> 
> [1] 
> https://github.com/LineageOS/android_kernel_motorola_msm8226/blob/cm-14.1/Documentation/devicetree/bindings/misc/hob_ram.txt
> 
> Signed-off-by: Stanislav Jakubek 
> ---

Reviewed-by: Konrad Dybcio 

Konrad



[PATCH untested] vhost: order avail ring reads after index updates

2024-03-27 Thread Michael S. Tsirkin
vhost_get_vq_desc (correctly) uses smp_rmb to order
avail ring reads after index reads.
However, over time we added two more places that read the
index and do not bother with barriers.
Since vhost_get_vq_desc, when it was written, assumed it was the
only reader, it does not bother with a barrier either when it sees
that a new index value is already cached. As a result, on the
nvidia-gracehopper platform (arm64), available ring entry reads
have been observed bypassing the index reads, causing a ring
corruption.

To fix, factor out the correct index access code from vhost_get_vq_desc.
As a side benefit, we also validate the index on all paths now, which
will hopefully help catch future errors earlier.

Note: current code is inconsistent in how it handles errors:
some places treat it as an empty ring, others - non empty.
This patch does not attempt to change the existing behaviour.

Cc: sta...@vger.kernel.org
Reported-by: Gavin Shan 
Reported-by: Will Deacon 
Suggested-by: Will Deacon 
Fixes: 275bf960ac69 ("vhost: better detection of available buffers")
Cc: "Jason Wang" 
Fixes: d3bb267bbdcb ("vhost: cache avail index in vhost_enable_notify()")
Cc: "Stefano Garzarella" 
Signed-off-by: Michael S. Tsirkin 
---

I think it's better to bite the bullet and clean up the code.
Note: this is still only built, not tested.
Gavin could you help test please?
Especially on the arm platform you have?

Will thanks so much for finding this race!


 drivers/vhost/vhost.c | 80 +++
 1 file changed, 42 insertions(+), 38 deletions(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 045f666b4f12..26b70b1fd9ff 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -1290,10 +1290,38 @@ static void vhost_dev_unlock_vqs(struct vhost_dev *d)
mutex_unlock(&d->vqs[i]->mutex);
 }
 
-static inline int vhost_get_avail_idx(struct vhost_virtqueue *vq,
- __virtio16 *idx)
+static inline int vhost_get_avail_idx(struct vhost_virtqueue *vq)
 {
-   return vhost_get_avail(vq, *idx, &vq->avail->idx);
+   __virtio16 idx;
+   u16 avail_idx;
+   int r = vhost_get_avail(vq, idx, &vq->avail->idx);
+
+   if (unlikely(r < 0)) {
+   vq_err(vq, "Failed to access avail idx at %p: %d\n",
+  &vq->avail->idx, r);
+   return -EFAULT;
+   }
+
+   avail_idx = vhost16_to_cpu(vq, idx);
+
+   /* Check it isn't doing very strange things with descriptor numbers. */
+   if (unlikely((u16)(avail_idx - vq->last_avail_idx) > vq->num)) {
+   vq_err(vq, "Guest moved used index from %u to %u",
+  vq->last_avail_idx, vq->avail_idx);
+   return -EFAULT;
+   }
+
+   /* Nothing new? We are done. */
+   if (avail_idx == vq->avail_idx)
+   return 0;
+
+   vq->avail_idx = avail_idx;
+
+   /* We updated vq->avail_idx so we need a memory barrier between
+* the index read above and the caller reading avail ring entries.
+*/
+   smp_rmb();
+   return 1;
 }
 
 static inline int vhost_get_avail_head(struct vhost_virtqueue *vq,
@@ -2498,38 +2526,21 @@ int vhost_get_vq_desc(struct vhost_virtqueue *vq,
 {
struct vring_desc desc;
unsigned int i, head, found = 0;
-   u16 last_avail_idx;
-   __virtio16 avail_idx;
+   u16 last_avail_idx = vq->last_avail_idx;
__virtio16 ring_head;
int ret, access;
 
-   /* Check it isn't doing very strange things with descriptor numbers. */
-   last_avail_idx = vq->last_avail_idx;
 
if (vq->avail_idx == vq->last_avail_idx) {
-   if (unlikely(vhost_get_avail_idx(vq, &avail_idx))) {
-   vq_err(vq, "Failed to access avail idx at %p\n",
-   &vq->avail->idx);
-   return -EFAULT;
-   }
-   vq->avail_idx = vhost16_to_cpu(vq, avail_idx);
-
-   if (unlikely((u16)(vq->avail_idx - last_avail_idx) > vq->num)) {
-   vq_err(vq, "Guest moved used index from %u to %u",
-   last_avail_idx, vq->avail_idx);
-   return -EFAULT;
-   }
+   ret = vhost_get_avail_idx(vq);
+   if (unlikely(ret < 0))
+   return ret;
 
/* If there's nothing new since last we looked, return
 * invalid.
 */
-   if (vq->avail_idx == last_avail_idx)
+   if (!ret)
return vq->num;
-
-   /* Only get avail ring entries after they have been
-* exposed by guest.
-*/
-   smp_rmb();
}
 
/* Grab the next descriptor number they're advertising, and increment
@@ -2790,25 +2801,21 @@ EXPORT_SYMBOL_GPL(vhost_add_used_and_signal_n);
 /* return true if we're sure that avaiable ring is empty */
 bool 

Re: [PATCH v4 4/4] remoteproc: stm32: Add support of an OP-TEE TA to load the firmware

2024-03-27 Thread Mathieu Poirier
On Tue, Mar 26, 2024 at 08:31:33PM +0100, Arnaud POULIQUEN wrote:
> 
> 
> On 3/25/24 17:51, Mathieu Poirier wrote:
> > On Fri, Mar 08, 2024 at 03:47:08PM +0100, Arnaud Pouliquen wrote:
> >> The new TEE remoteproc device is used to manage remote firmware in a
> >> secure, trusted context. The 'st,stm32mp1-m4-tee' compatibility is
> >> introduced to delegate the loading of the firmware to the trusted
> >> execution context. In such cases, the firmware should be signed and
> >> adhere to the image format defined by the TEE.
> >>
> >> Signed-off-by: Arnaud Pouliquen 
> >> ---
> >> Updates from V3:
> >> - remove support of the attach use case. Will be addressed in a separate
> >>   thread,
> >> - add st_rproc_tee_ops::parse_fw ops,
> >> - inverse call of devm_rproc_alloc() and tee_rproc_register() to manage 
> >> cross
> >>   reference between the rproc struct and the tee_rproc struct in 
> >> tee_rproc.c.
> >> ---
> >>  drivers/remoteproc/stm32_rproc.c | 60 +---
> >>  1 file changed, 56 insertions(+), 4 deletions(-)
> >>
> >> diff --git a/drivers/remoteproc/stm32_rproc.c 
> >> b/drivers/remoteproc/stm32_rproc.c
> >> index 8cd838df4e92..13df33c78aa2 100644
> >> --- a/drivers/remoteproc/stm32_rproc.c
> >> +++ b/drivers/remoteproc/stm32_rproc.c
> >> @@ -20,6 +20,7 @@
> >>  #include 
> >>  #include 
> >>  #include 
> >> +#include <linux/tee_remoteproc.h>
> >>  #include 
> >>  
> >>  #include "remoteproc_internal.h"
> >> @@ -49,6 +50,9 @@
> >>  #define M4_STATE_STANDBY  4
> >>  #define M4_STATE_CRASH5
> >>  
> >> +/* Remote processor unique identifier aligned with the Trusted Execution 
> >> Environment definitions */
> > 
> > Why is this the case?  At least from the kernel side it is possible to call
> > > tee_rproc_register() with any kind of value, why is there a need for any
> > > kind of alignment with the TEE?
> 
> 
> The use of the proc_id is to identify a processor in the case of multiple
> co-processors.
>

That is well understood.

> For instance we can have a system with a DSP and a modem. We would use the 
> same
> TEE service, but

That too.

> the TEE driver will probably be different, same for the signature key.

What TEE driver are we talking about here?

> In such a case the proc ID allows identifying the processor you want to 
> address.
> 

That too is well understood, but there is no alignment needed with the TEE, i.e.
the TEE application is not expecting a value of '0'.  We could set
STM32_MP1_M4_PROC_ID to 0xDEADBEEF and things would work.  This driver won't go
anywhere for as long as it is not the case.

> 
> > 
> >> +#define STM32_MP1_M4_PROC_ID0
> >> +
> >>  struct stm32_syscon {
> >>struct regmap *map;
> >>u32 reg;
> >> @@ -257,6 +261,19 @@ static int stm32_rproc_release(struct rproc *rproc)
> >>return 0;
> >>  }
> >>  
> >> +static int stm32_rproc_tee_stop(struct rproc *rproc)
> >> +{
> >> +  int err;
> >> +
> >> +  stm32_rproc_request_shutdown(rproc);
> >> +
> >> +  err = tee_rproc_stop(rproc);
> >> +  if (err)
> >> +  return err;
> >> +
> >> +  return stm32_rproc_release(rproc);
> >> +}
> >> +
> >>  static int stm32_rproc_prepare(struct rproc *rproc)
> >>  {
> >>struct device *dev = rproc->dev.parent;
> >> @@ -693,8 +710,19 @@ static const struct rproc_ops st_rproc_ops = {
> >>.get_boot_addr  = rproc_elf_get_boot_addr,
> >>  };
> >>  
> >> +static const struct rproc_ops st_rproc_tee_ops = {
> >> +  .prepare= stm32_rproc_prepare,
> >> +  .start  = tee_rproc_start,
> >> +  .stop   = stm32_rproc_tee_stop,
> >> +  .kick   = stm32_rproc_kick,
> >> +  .load   = tee_rproc_load_fw,
> >> +  .parse_fw   = tee_rproc_parse_fw,
> >> +  .find_loaded_rsc_table = tee_rproc_find_loaded_rsc_table,
> >> +};
> >> +
> >>  static const struct of_device_id stm32_rproc_match[] = {
> >> -  { .compatible = "st,stm32mp1-m4" },
> >> +  {.compatible = "st,stm32mp1-m4",},
> >> +  {.compatible = "st,stm32mp1-m4-tee",},
> >>{},
> >>  };
> >>  MODULE_DEVICE_TABLE(of, stm32_rproc_match);
> >> @@ -853,6 +881,7 @@ static int stm32_rproc_probe(struct platform_device 
> >> *pdev)
> >>struct device *dev = >dev;
> >>struct stm32_rproc *ddata;
> >>struct device_node *np = dev->of_node;
> >> +  struct tee_rproc *trproc = NULL;
> >>struct rproc *rproc;
> >>unsigned int state;
> >>int ret;
> >> @@ -861,9 +890,26 @@ static int stm32_rproc_probe(struct platform_device 
> >> *pdev)
> >>if (ret)
> >>return ret;
> >>  
> >> -  rproc = devm_rproc_alloc(dev, np->name, &st_rproc_ops, NULL, 
> >> sizeof(*ddata));
> >> -  if (!rproc)
> >> -  return -ENOMEM;
> >> +  if (of_device_is_compatible(np, "st,stm32mp1-m4-tee")) {
> >> +  /*
> >> +   * Delegate the firmware management to the secure context.
> >> +   * The firmware loaded has to be signed.
> >> +   */
> >> +  rproc = devm_rproc_alloc(dev, np->name, &st_rproc_tee_ops, 
> >> NULL, sizeof(*ddata));
> >> +  if 

Re: [PATCH v4 1/4] remoteproc: Add TEE support

2024-03-27 Thread Mathieu Poirier
On Tue, Mar 26, 2024 at 08:18:23PM +0100, Arnaud POULIQUEN wrote:
> Hello Mathieu,
> 
> On 3/25/24 17:46, Mathieu Poirier wrote:
> > On Fri, Mar 08, 2024 at 03:47:05PM +0100, Arnaud Pouliquen wrote:
> >> Add a remoteproc TEE (Trusted Execution Environment) driver
> >> that will be probed by the TEE bus. If the associated Trusted
> >> application is supported on secure part this device offers a client
> > 
> > Device or driver?  I thought I touched on that before.
> 
> Right, I changed the first instance and missed this one
> 
> > 
> >> interface to load a firmware in the secure part.
> >> This firmware could be authenticated by the secure trusted application.
> >>
> >> Signed-off-by: Arnaud Pouliquen 
> >> ---
> >> Updates from V3:
> >> - rework TEE_REMOTEPROC description in Kconfig
> >> - fix some namings
> >> - add tee_rproc_parse_fw  to support rproc_ops::parse_fw
> >> - add proc::tee_interface;
> >> - add rproc struct as parameter of the tee_rproc_register() function
> >> ---
> >>  drivers/remoteproc/Kconfig  |  10 +
> >>  drivers/remoteproc/Makefile |   1 +
> >>  drivers/remoteproc/tee_remoteproc.c | 434 
> >>  include/linux/remoteproc.h  |   4 +
> >>  include/linux/tee_remoteproc.h  | 112 +++
> >>  5 files changed, 561 insertions(+)
> >>  create mode 100644 drivers/remoteproc/tee_remoteproc.c
> >>  create mode 100644 include/linux/tee_remoteproc.h
> >>
> >> diff --git a/drivers/remoteproc/Kconfig b/drivers/remoteproc/Kconfig
> >> index 48845dc8fa85..2cf1431b2b59 100644
> >> --- a/drivers/remoteproc/Kconfig
> >> +++ b/drivers/remoteproc/Kconfig
> >> @@ -365,6 +365,16 @@ config XLNX_R5_REMOTEPROC
> >>  
> >>  It's safe to say N if not interested in using RPU r5f cores.
> >>  
> >> +
> >> +config TEE_REMOTEPROC
> >> +  tristate "remoteproc support by a TEE application"
> > 
> > s/remoteproc/Remoteproc
> > 
> >> +  depends on OPTEE
> >> +  help
> >> +Support a remote processor with a TEE application. The Trusted
> >> +Execution Context is responsible for loading the trusted firmware
> >> +image and managing the remote processor's lifecycle.
> >> +This can be either built-in or a loadable module.
> >> +
> >>  endif # REMOTEPROC
> >>  
> >>  endmenu
> >> diff --git a/drivers/remoteproc/Makefile b/drivers/remoteproc/Makefile
> >> index 91314a9b43ce..fa8daebce277 100644
> >> --- a/drivers/remoteproc/Makefile
> >> +++ b/drivers/remoteproc/Makefile
> >> @@ -36,6 +36,7 @@ obj-$(CONFIG_RCAR_REMOTEPROC)+= rcar_rproc.o
> >>  obj-$(CONFIG_ST_REMOTEPROC)   += st_remoteproc.o
> >>  obj-$(CONFIG_ST_SLIM_REMOTEPROC)  += st_slim_rproc.o
> >>  obj-$(CONFIG_STM32_RPROC) += stm32_rproc.o
> >> +obj-$(CONFIG_TEE_REMOTEPROC)  += tee_remoteproc.o
> >>  obj-$(CONFIG_TI_K3_DSP_REMOTEPROC)+= ti_k3_dsp_remoteproc.o
> >>  obj-$(CONFIG_TI_K3_R5_REMOTEPROC) += ti_k3_r5_remoteproc.o
> >>  obj-$(CONFIG_XLNX_R5_REMOTEPROC)  += xlnx_r5_remoteproc.o
> >> diff --git a/drivers/remoteproc/tee_remoteproc.c 
> >> b/drivers/remoteproc/tee_remoteproc.c
> >> new file mode 100644
> >> index ..c855210e52e3
> >> --- /dev/null
> >> +++ b/drivers/remoteproc/tee_remoteproc.c
> >> @@ -0,0 +1,434 @@
> >> +// SPDX-License-Identifier: GPL-2.0-or-later
> >> +/*
> >> + * Copyright (C) STMicroelectronics 2024 - All Rights Reserved
> >> + * Author: Arnaud Pouliquen 
> >> + */
> >> +
> >> +#include 
> >> +#include 
> >> +#include 
> >> +#include 
> >> +#include 
> >> +#include 
> >> +#include 
> >> +
> >> +#include "remoteproc_internal.h"
> >> +
> >> +#define MAX_TEE_PARAM_ARRY_MEMBER 4
> >> +
> >> +/*
> >> + * Authentication of the firmware and load in the remote processor memory
> >> + *
> >> + * [in]  params[0].value.a:   unique 32bit identifier of the remote 
> >> processor
> >> + * [in]params[1].memref:  buffer containing the image of the 
> >> buffer
> >> + */
> >> +#define TA_RPROC_FW_CMD_LOAD_FW   1
> >> +
> >> +/*
> >> + * Start the remote processor
> >> + *
> >> + * [in]  params[0].value.a:   unique 32bit identifier of the remote 
> >> processor
> >> + */
> >> +#define TA_RPROC_FW_CMD_START_FW  2
> >> +
> >> +/*
> >> + * Stop the remote processor
> >> + *
> >> + * [in]  params[0].value.a:   unique 32bit identifier of the remote 
> >> processor
> >> + */
> >> +#define TA_RPROC_FW_CMD_STOP_FW   3
> >> +
> >> +/*
> >> + * Return the address of the resource table, or 0 if not found
> >> + * No check is done to verify that the address returned is accessible by
> >> + * the non secure context. If the resource table is loaded in a protected
> >> + * memory the access by the non secure context will lead to a data abort.
> >> + *
> >> + * [in]  params[0].value.a:   unique 32bit identifier of the remote 
> >> processor
> >> + * [out]  params[1].value.a:  32bit LSB resource table memory address
> >> + * [out]  params[1].value.b:  32bit MSB resource table memory address
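
To make the parameter layout above concrete, a sketch of invoking the
start command through the in-kernel TEE client API. The function name is
illustrative and this is not the driver's actual code:

#include <linux/tee_drv.h>

static int tee_rproc_start_sketch(struct tee_context *ctx, u32 session,
				  u32 rproc_id)
{
	struct tee_ioctl_invoke_arg arg = {
		.func = TA_RPROC_FW_CMD_START_FW,
		.session = session,
		.num_params = 1,
	};
	struct tee_param param = {
		/* params[0].value.a: unique identifier of the remote processor */
		.attr = TEE_IOCTL_PARAM_ATTR_TYPE_VALUE_INPUT,
		.u.value.a = rproc_id,
	};

	return tee_client_invoke_func(ctx, &arg, &param);
}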
> 

Re: [PATCH] uprobes: reduce contention on uprobes_tree access

2024-03-27 Thread Jonthan Haslam
> > Masami,
> > 
> > Given the discussion around per-cpu rw semaphore and need for
> > (internal) batched attachment API for uprobes, do you think you can
> > apply this patch as is for now? We can then gain initial improvements
> > in scalability that are also easy to backport, and Jonathan will work
> > on a more complete solution based on per-cpu RW semaphore, as
> > suggested by Ingo.
> 
> Yeah, it is interesting to use per-cpu rw semaphore on uprobe.
> I would like to wait for the next version.

My initial tests show a nice improvement over the RW spinlocks but a
significant regression in acquiring a write lock. I've got a few days
vacation over Easter but I'll aim to get some more formalised results out
to the thread toward the end of next week.

Jon.

> 
> Thank you,
> 
> > 
> > >
> > > BTW, how did you measure the overhead? I think spinlock overhead
> > > will depend on how much lock contention happens.
> > >
> > > Thank you,
> > >
> > > >
> > > > [0] https://docs.kernel.org/locking/spinlocks.html
> > > >
> > > > Signed-off-by: Jonathan Haslam 
> > > > ---
> > > >  kernel/events/uprobes.c | 22 +++---
> > > >  1 file changed, 11 insertions(+), 11 deletions(-)
> > > >
> > > > diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
> > > > index 929e98c62965..42bf9b6e8bc0 100644
> > > > --- a/kernel/events/uprobes.c
> > > > +++ b/kernel/events/uprobes.c
> > > > @@ -39,7 +39,7 @@ static struct rb_root uprobes_tree = RB_ROOT;
> > > >   */
> > > >  #define no_uprobe_events()   RB_EMPTY_ROOT(&uprobes_tree)
> > > >
> > > > -static DEFINE_SPINLOCK(uprobes_treelock);/* serialize rbtree 
> > > > access */
> > > > +static DEFINE_RWLOCK(uprobes_treelock);  /* serialize rbtree 
> > > > access */
> > > >
> > > >  #define UPROBES_HASH_SZ  13
> > > >  /* serialize uprobe->pending_list */
> > > > @@ -669,9 +669,9 @@ static struct uprobe *find_uprobe(struct inode 
> > > > *inode, loff_t offset)
> > > >  {
> > > >   struct uprobe *uprobe;
> > > >
> > > > - spin_lock(&uprobes_treelock);
> > > > + read_lock(&uprobes_treelock);
> > > >   uprobe = __find_uprobe(inode, offset);
> > > > - spin_unlock(&uprobes_treelock);
> > > > + read_unlock(&uprobes_treelock);
> > > >
> > > >   return uprobe;
> > > >  }
> > > > @@ -701,9 +701,9 @@ static struct uprobe *insert_uprobe(struct uprobe 
> > > > *uprobe)
> > > >  {
> > > >   struct uprobe *u;
> > > >
> > > > - spin_lock(&uprobes_treelock);
> > > > + write_lock(&uprobes_treelock);
> > > >   u = __insert_uprobe(uprobe);
> > > > - spin_unlock(&uprobes_treelock);
> > > > + write_unlock(&uprobes_treelock);
> > > >
> > > >   return u;
> > > >  }
> > > > @@ -935,9 +935,9 @@ static void delete_uprobe(struct uprobe *uprobe)
> > > >   if (WARN_ON(!uprobe_is_active(uprobe)))
> > > >   return;
> > > >
> > > > - spin_lock(&uprobes_treelock);
> > > > + write_lock(&uprobes_treelock);
> > > >   rb_erase(&uprobe->rb_node, &uprobes_tree);
> > > > - spin_unlock(&uprobes_treelock);
> > > > + write_unlock(&uprobes_treelock);
> > > >   RB_CLEAR_NODE(&uprobe->rb_node); /* for uprobe_is_active() */
> > > >   put_uprobe(uprobe);
> > > >  }
> > > > @@ -1298,7 +1298,7 @@ static void build_probe_list(struct inode *inode,
> > > >   min = vaddr_to_offset(vma, start);
> > > >   max = min + (end - start) - 1;
> > > >
> > > > - spin_lock(&uprobes_treelock);
> > > > + read_lock(&uprobes_treelock);
> > > >   n = find_node_in_range(inode, min, max);
> > > >   if (n) {
> > > >   for (t = n; t; t = rb_prev(t)) {
> > > > @@ -1316,7 +1316,7 @@ static void build_probe_list(struct inode *inode,
> > > >   get_uprobe(u);
> > > >   }
> > > >   }
> > > > - spin_unlock(&uprobes_treelock);
> > > > + read_unlock(&uprobes_treelock);
> > > >  }
> > > >
> > > >  /* @vma contains reference counter, not the probed instruction. */
> > > > @@ -1407,9 +1407,9 @@ vma_has_uprobes(struct vm_area_struct *vma, 
> > > > unsigned long start, unsigned long e
> > > >   min = vaddr_to_offset(vma, start);
> > > >   max = min + (end - start) - 1;
> > > >
> > > > - spin_lock(&uprobes_treelock);
> > > > + read_lock(&uprobes_treelock);
> > > >   n = find_node_in_range(inode, min, max);
> > > > - spin_unlock(&uprobes_treelock);
> > > > + read_unlock(&uprobes_treelock);
> > > >
> > > >   return !!n;
> > > >  }
> > > > --
> > > > 2.43.0
> > > >
> > >
> > >
> > > --
> > > Masami Hiramatsu (Google) 
> 
> 
> -- 
> Masami Hiramatsu (Google) 



Re: [PATCH v9 15/15] selftests/sgx: Add scripts for EPC cgroup testing

2024-03-27 Thread Jarkko Sakkinen
On Wed Mar 27, 2024 at 2:55 PM EET, Jarkko Sakkinen wrote:
> On Mon, 2024-02-05 at 13:06 -0800, Haitao Huang wrote:
> > The scripts rely on cgroup-tools package from libcgroup [1].
> > 
> > To run selftests for epc cgroup:
> > 
> > sudo ./run_epc_cg_selftests.sh
> > 
> > To watch misc cgroup 'current' changes during testing, run this in a
> > separate terminal:
> > 
> > ./watch_misc_for_tests.sh current
> > 
> > With different cgroups, the script starts one or multiple concurrent
> > SGX
> > selftests, each to run one unclobbered_vdso_oversubscribed test. 
> > Each
> > of such test tries to load an enclave of EPC size equal to the EPC
> > capacity available on the platform. The script checks results against
> > the expectation set for each cgroup and reports success or failure.
> > 
> > The script creates 3 different cgroups at the beginning with
> > following
> > expectations:
> > 
> > 1) SMALL - intentionally small enough to fail the test loading an
> > enclave of size equal to the capacity.
> > 2) LARGE - large enough to run up to 4 concurrent tests but fail some
> > if
> > more than 4 concurrent tests are run. The script starts 4 expecting
> > at
> > least one test to pass, and then starts 5 expecting at least one test
> > to fail.
> > 3) LARGER - limit is the same as the capacity, large enough to run
> > lots of
> > concurrent tests. The script starts 8 of them and expects all pass.
> > Then it reruns the same test with one process randomly killed and
> > usage checked to be zero after all process exit.
> > 
> > The script also includes a test with low mem_cg limit and LARGE
> > sgx_epc
> > limit to verify that the RAM used for per-cgroup reclamation is
> > charged
> > to a proper mem_cg.
> > 
> > [1] https://github.com/libcgroup/libcgroup/blob/main/README
> > 
> > Signed-off-by: Haitao Huang 
> > ---
> > V7:
> > - Added memcontrol test.
> > 
> > V5:
> > - Added script with automatic results checking, remove the
> > interactive
> > script.
> > - The script can run independent from the series below.
> > ---
> >  .../selftests/sgx/run_epc_cg_selftests.sh | 246
> > ++
> >  .../selftests/sgx/watch_misc_for_tests.sh |  13 +
> >  2 files changed, 259 insertions(+)
> >  create mode 100755
> > tools/testing/selftests/sgx/run_epc_cg_selftests.sh
> >  create mode 100755
> > tools/testing/selftests/sgx/watch_misc_for_tests.sh
> > 
> > diff --git a/tools/testing/selftests/sgx/run_epc_cg_selftests.sh
> > b/tools/testing/selftests/sgx/run_epc_cg_selftests.sh
> > new file mode 100755
> > index ..e027bf39f005
> > --- /dev/null
> > +++ b/tools/testing/selftests/sgx/run_epc_cg_selftests.sh
> > @@ -0,0 +1,246 @@
> > +#!/bin/bash
>
> This is not portable, and it does not hold in the wild either.
>
> It often does not hold, as it is not uncommon for bash to be placed
> at the path /usr/bin/bash. If I recall correctly, e.g. NixOS has
> a path that is neither of those two.
>
> Should be #!/usr/bin/env bash
>
> That is POSIX compatible form.
>
> Just got around trying to test this in NUC7 so looking into this in
> more detail.
>
> That said can you make the script work with just "#!/usr/bin/env sh"
> and make sure that it is busybox ash compatible?
>
> I don't see any necessity to make this bash only and it adds to the
> compilation time of the image. Otherwise lot of this could be tested
> just with qemu+bzImage+busybox(inside initramfs).
>
> Now you are adding fully glibc shenanigans for the sake of syntax
> sugar.
>
> > +# SPDX-License-Identifier: GPL-2.0
> > +# Copyright(c) 2023 Intel Corporation.
> > +
> > +TEST_ROOT_CG=selftest
> > +cgcreate -g misc:$TEST_ROOT_CG
>
> How do you know that cgcreate exists? It is used a lot in the script
> with no check for its existence. Please fix e.g. with "command -v
> cgcreate".
>
> > +if [ $? -ne 0 ]; then
> > +    echo "# Please make sure cgroup-tools is installed, and misc
> > cgroup is mounted."
> > +    exit 1
> > +fi
>
> And please do not do it this way. Also, please remove the advice for
> "cgroups-tool". This is not meant to be debian only. Better would be
> to e.g. point out the URL of the upstream project.
>
> And yeah the whole message should be based on "command -v", not like
> this.
>
> > +TEST_CG_SUB1=$TEST_ROOT_CG/test1
> > +TEST_CG_SUB2=$TEST_ROOT_CG/test2
> > +# We will only set limit in test1 and run tests in test3
> > +TEST_CG_SUB3=$TEST_ROOT_CG/test1/test3
> > +TEST_CG_SUB4=$TEST_ROOT_CG/test4
> > +
> > +cgcreate -g misc:$TEST_CG_SUB1
>
>
>
> > +cgcreate -g misc:$TEST_CG_SUB2
> > +cgcreate -g misc:$TEST_CG_SUB3
> > +cgcreate -g misc:$TEST_CG_SUB4
> > +
> > +# Default to V2
> > +CG_MISC_ROOT=/sys/fs/cgroup
> > +CG_MEM_ROOT=/sys/fs/cgroup
> > +CG_V1=0
> > +if [ ! -d "/sys/fs/cgroup/misc" ]; then
> > +    echo "# cgroup V2 is in use."
> > +else
> > +    echo "# cgroup V1 is in use."
>
> Is "#" prefix a standard for kselftest? I don't know this, thus asking.
>
> > +    CG_MISC_ROOT=/sys/fs/cgroup/misc
> > +    CG_MEM_ROOT=/sys/fs/cgroup/memory

Re: [PATCH v7 2/2] arch/riscv: Enable kprobes when CONFIG_MODULES=n

2024-03-27 Thread Jarkko Sakkinen
On Tue Mar 26, 2024 at 8:42 PM EET, Conor Dooley wrote:
> On Tue, Mar 26, 2024 at 03:46:16PM +0200, Jarkko Sakkinen wrote:
> > Tracing with kprobes while running a monolithic kernel is currently
> > impossible due to the kernel module allocator dependency.
> > 
> > Address the issue by implementing textmem API for RISC-V.
>
> This doesn't compile for nommu:
>   /build/tmp.3xucsBhqDV/arch/riscv/kernel/execmem.c:10:46: error: 
> 'MODULES_VADDR' undeclared (first use in this function)
>   /build/tmp.3xucsBhqDV/arch/riscv/kernel/execmem.c:11:37: error: 
> 'MODULES_END' undeclared (first use in this function)
>   /build/tmp.3xucsBhqDV/arch/riscv/kernel/execmem.c:14:1: error: control 
> reaches end of non-void function [-Werror=return-type]
> Clang builds also report:
> ../arch/riscv/kernel/execmem.c:8:56: warning: omitting the parameter name in 
> a function definition is a C2x extension [-Wc2x-extensions]
>
> > 
> > Link: https://www.sochub.fi # for power on testing new SoC's with a minimal 
> > stack
> > Link: 
> > https://lore.kernel.org/all/2022060814.3054333-1-jar...@profian.com/ # 
> > continuation
> > Signed-off-by: Jarkko Sakkinen 
> > ---
> > v5-v7:
> > - No changes.
> > v4:
> > - Include linux/execmem.h.
> > v3:
> > - Architecture independent parts have been split to separate patches.
> > - Do not change arch/riscv/kernel/module.c as it is out of scope for
> >   this patch set now.
>
> Meta comment. I dunno when v1 was sent, but can you please relax with
> submitting new versions of your patches? There are conversations
> ongoing on v5 at the moment, while this is a more recent version. v2
> seems to have been sent on the 23rd and there's been 5 versions in the
> last day:
> https://patchwork.kernel.org/project/linux-riscv/list/?submitter=195059=*
>
> Could you please also try and use a cover letter for patchsets, ideally
> with a consistent subject? Otherwise I have to manually mark stuff as
> superseded.

Point taken but the work has been taken over by Mark and relevant
changes are now sucked into that patch set.

> Thanks,
> Conor.

BR, Jarkko



Re: [PATCH] trace/sched: add tgid for sched_wakeup_template

2024-03-27 Thread Steven Rostedt
On Wed, 27 Mar 2024 16:50:57 +0800
Tio Zhang  wrote:

> By doing this, we are able to filter tasks by tgid while we are
> tracing wakeup events by ebpf or other methods.
> 
> For example, when we care about tracing a user space process (which has
> uncertain number of LWPs, i.e, pids) to monitor its wakeup latency,
> without tgid available in sched_wakeup tracepoints, we would struggle
> finding out all pids to trace, or we could use kprobe to achieve tgid
> tracing, which is less accurate and much less efficient than using
> tracepoint.

This is a very common trace event, and I really do not want to add more
data than necessary to it, as it increases the size of the event, which
means fewer events can be recorded in a fixed-size trace ring buffer.

Note, you are not modifying the "tracepoint", but you are actually
modifying a "trace event".

 "tracepoint" is the hook in the kernel code:

   trace_sched_wakeup()

 "trace event" is defined by TRACE_EVENT() macro (and friends) that defines
 what is exposed in the tracefs file system.

I thought ebpf could hook directly to the tracepoint which is:

trace_sched_wakeup(p);

I believe you can have direct access to the 'p' before it is processed from 
ebpf.

There's also "trace probes" (I think we are lacking documentation on this,
as well as event probes :-p):

 $ gdb vmlinux
(gdb) p &((struct task_struct *)0)->tgid
$1 = (pid_t *) 0x56c
(gdb) p &((struct task_struct *)0)->pid
$2 = (pid_t *) 0x568

 # echo 't:wakeup sched_waking pid=+0x568($arg1):u32 tgid=+0x56c($arg1):u32' > 
/sys/kernel/tracing/dynamic_events

 # trace-cmd start -e wakeup
 # trace-cmd show
      trace-cmd-7307  [003] d..6. 599486.485762: wakeup: (__probestub_sched_waking+0x4/0x10) pid=845 tgid=845
            bash-845  [001] d.s4. 599486.486136: wakeup: (__probestub_sched_waking+0x4/0x10) pid=17 tgid=17
            bash-845  [001] d..4. 599486.486336: wakeup: (__probestub_sched_waking+0x4/0x10) pid=5516 tgid=5516
  kworker/u18:2-5516  [001] d..4. 599486.486445: wakeup: (__probestub_sched_waking+0x4/0x10) pid=818 tgid=818
          <idle>-0    [001] d.s4. 599486.491206: wakeup: (__probestub_sched_waking+0x4/0x10) pid=17 tgid=17
          <idle>-0    [001] d.s5. 599486.493218: wakeup: (__probestub_sched_waking+0x4/0x10) pid=17 tgid=17
          <idle>-0    [001] d.s4. 599486.497200: wakeup: (__probestub_sched_waking+0x4/0x10) pid=17 tgid=17
          <idle>-0    [003] d.s4. 599486.829209: wakeup: (__probestub_sched_waking+0x4/0x10) pid=70 tgid=70

The above attaches to the tracepoint and $arg1 is the 'struct task_struct *p'.
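
For in-kernel consumers, the same idea works by attaching a probe to the
tracepoint itself and reading p->tgid directly. A rough sketch follows
(assuming the sched_waking tracepoint symbol is reachable from the
caller - built-in code, or a module if the tracepoint is exported); it
is an illustration, not a tested patch:

	#include <linux/module.h>
	#include <trace/events/sched.h>

	/* Probe signature: the registration cookie, then the
	 * tracepoint's own arguments ('struct task_struct *p'). */
	static void waking_probe(void *data, struct task_struct *p)
	{
		pr_debug("waking pid=%d tgid=%d\n", p->pid, p->tgid);
	}

	static int __init waking_probe_init(void)
	{
		return register_trace_sched_waking(waking_probe, NULL);
	}

	static void __exit waking_probe_exit(void)
	{
		unregister_trace_sched_waking(waking_probe, NULL);
		/* Wait for in-flight probes before the module text goes away. */
		tracepoint_synchronize_unregister();
	}

	module_init(waking_probe_init);
	module_exit(waking_probe_exit);
	MODULE_LICENSE("GPL");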

-- Steve


> 
> Signed-off-by: Tio Zhang 
> Signed-off-by: Dylane Chen 
> ---
>  include/trace/events/sched.h | 6 --
>  1 file changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
> index dbb01b4b7451..ea7e525649e5 100644
> --- a/include/trace/events/sched.h
> +++ b/include/trace/events/sched.h
> @@ -149,6 +149,7 @@ DECLARE_EVENT_CLASS(sched_wakeup_template,
>   __field(pid_t,  pid )
>   __field(int,prio)
>   __field(int,target_cpu  )
> + __field(pid_t,  tgid)
>   ),
>  
>   TP_fast_assign(
> @@ -156,11 +157,12 @@ DECLARE_EVENT_CLASS(sched_wakeup_template,
>   __entry->pid= p->pid;
>   __entry->prio   = p->prio; /* XXX SCHED_DEADLINE */
>   __entry->target_cpu = task_cpu(p);
> + __entry->tgid   = p->tgid;
>   ),
>  
> - TP_printk("comm=%s pid=%d prio=%d target_cpu=%03d",
> + TP_printk("comm=%s pid=%d prio=%d target_cpu=%03d tgid=%d",
> __entry->comm, __entry->pid, __entry->prio,
> -   __entry->target_cpu)
> +   __entry->target_cpu, __entry->tgid)
>  );
>  
>  /*




[PATCH] module: create weak dependecies

2024-03-27 Thread Jose Ignacio Tornos Martinez
It has been seen that for some network mac drivers (e.g. lan78xx) the
related module for the phy is loaded dynamically depending on the current
hardware. In this case, the associated phy is read using mdio bus and then
the associated phy module is loaded during runtime (kernel function
phy_request_driver_module). However, no software dependency is defined, so
the user tools will not be able to get this dependency. For example, if
dracut is used and the hardware is present, lan78xx will be included but no
phy module will be added, and in the next restart the device will not work
from boot because no related phy will be found during initramfs stage.

In order to solve this, we could define a normal 'pre' software dependency
in the lan78xx module with all the possible phy modules (there may be several),
but proceeding in that way, all the possible phy modules would be loaded
while only one is necessary.

The idea is to create a new type of dependency, that we are going to call
'weak' to be used only by the user tools that need to detect this situation.
In that way, for example, dracut could check the 'weak' dependency of the
modules involved in order to install these dependencies in initramfs too.
That is, for the commented lan78xx module, defining the 'weak' dependency
with the possible phy modules list, only the necessary phy would be loaded
on demand keeping the same behavior, but all the possible phy modules would
be available from initramfs.
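
For illustration, a hypothetical use in a MAC driver could look like the
following (the module names are made up, only the macro is real):

	/* Hint for userspace tools: these PHY drivers may be requested
	 * at runtime via phy_request_driver_module(), so initramfs
	 * generators should ship them; nothing is loaded automatically. */
	MODULE_WEAKDEP("foo-phy");
	MODULE_WEAKDEP("bar-phy");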

Signed-off-by: Jose Ignacio Tornos Martinez 
---
 include/linux/module.h | 5 +
 1 file changed, 5 insertions(+)

diff --git a/include/linux/module.h b/include/linux/module.h
index 1153b0d99a80..231e710d8736 100644
--- a/include/linux/module.h
+++ b/include/linux/module.h
@@ -173,6 +173,11 @@ extern void cleanup_module(void);
  */
 #define MODULE_SOFTDEP(_softdep) MODULE_INFO(softdep, _softdep)
 
+/* Weak module dependencies. See man modprobe.d for details.
+ * Example: MODULE_WEAKDEP("module-foo")
+ */
+#define MODULE_WEAKDEP(_weakdep) MODULE_INFO(weakdep, _weakdep)
+
 /*
  * MODULE_FILE is used for generating modules.builtin
  * So, make it no-op when this is being built as a module
-- 
2.44.0




Re: [PATCH net v2 1/2] virtio_net: Do not set rss_indir if RSS is not supported

2024-03-27 Thread Breno Leitao
Hello Xuan,

On Wed, Mar 27, 2024 at 09:37:43AM +0800, Xuan Zhuo wrote:
> On Tue, 26 Mar 2024 08:19:08 -0700, Breno Leitao  wrote:
> > Do not set virtnet_info->rss_indir_table_size if RSS is not available
> > for the device.
> >
> > Currently, rss_indir_table_size is set if either has_rss or
> > has_rss_hash_report is available, but it should only be set if has_rss
> > is set.
> >
> > On the virtnet_set_rxfh(), return an invalid command if the request has
> > indirection table set, but virtnet does not support RSS.
> >
> > Suggested-by: Heng Qi 
> > Signed-off-by: Breno Leitao 
> > ---
> >  drivers/net/virtio_net.c | 9 +++--
> >  1 file changed, 7 insertions(+), 2 deletions(-)
> >
> > diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> > index c22d1118a133..c640fdf28fc5 100644
> > --- a/drivers/net/virtio_net.c
> > +++ b/drivers/net/virtio_net.c
> > @@ -3813,6 +3813,9 @@ static int virtnet_set_rxfh(struct net_device *dev,
> > rxfh->hfunc != ETH_RSS_HASH_TOP)
> > return -EOPNOTSUPP;
> >
> > +   if (rxfh->indir && !vi->has_rss)
> > +   return -EINVAL;
> > +
> > if (rxfh->indir) {
> 
> Put !vi->has_rss here?

I am not sure I understand the suggestion. Where do you suggest we have
!vi->has_rss?

If we got this far, we either have:

a) rxfh->indir set and vi->has_rss is also set
b) rxfh->indir not set. (vi->has_rss could be set or not).

Thanks for the review,
Breno



Re: [PATCH net v2 2/2] virtio_net: Do not send RSS key if it is not supported

2024-03-27 Thread Breno Leitao
On Wed, Mar 27, 2024 at 10:27:58AM +0800, Heng Qi wrote:
> 
> 
> > On 2024/3/26 at 11:19 PM, Breno Leitao wrote:
> > There is a bug when setting the RSS options in virtio_net that can break
> > the whole machine, getting the kernel into an infinite loop.
> > 
> > Running the following command in any QEMU virtual machine with virtionet
> > will reproduce this problem:
> > 
> >  # ethtool -X eth0  hfunc toeplitz
> > 
> > This is how the problem happens:
> > 
> > 1) ethtool_set_rxfh() calls virtnet_set_rxfh()
> > 
> > 2) virtnet_set_rxfh() calls virtnet_commit_rss_command()
> > 
> > 3) virtnet_commit_rss_command() populates 4 entries for the rss
> > scatter-gather
> > 
> > 4) Since the command above does not have a key, then the last
> > scatter-gatter entry will be zeroed, since rss_key_size == 0.
> > sg_buf_size = vi->rss_key_size;
> > 
> > 5) This buffer is passed to qemu, but qemu is not happy with a buffer
> > with zero length, and does the following in virtqueue_map_desc() (QEMU
> > function):
> > 
> >if (!sz) {
> >virtio_error(vdev, "virtio: zero sized buffers are not allowed");
> > 
> > 6) virtio_error() (also QEMU function) set the device as broken
> > 
> >  vdev->broken = true;
> > 
> > 7) QEMU bails out, and does not respond to this crazy kernel.
> > 
> > 8) The kernel is waiting for the response to come back (function
> > virtnet_send_command())
> > 
> > 9) The kernel is waiting doing the following :
> > 
> >while (!virtqueue_get_buf(vi->cvq, ) &&
> >  !virtqueue_is_broken(vi->cvq))
> >   cpu_relax();
> > 
> > 10) Neither of the functions above returns true, thus the kernel
> > loops here forever. Keep in mind that virtqueue_is_broken() does
> > not look at the qemu `vdev->broken`, so it never realizes that the
> > virtio is broken on the QEMU side.
> > 
> > Fix it by not sending RSS commands if the feature is not available in
> > the device.
> > 
> > Fixes: c7114b1249fa ("drivers/net/virtio_net: Added basic RSS support.")
> > Cc: sta...@vger.kernel.org
> > Cc: qemu-de...@nongnu.org
> > Signed-off-by: Breno Leitao 
> > ---
> >   drivers/net/virtio_net.c | 3 +++
> >   1 file changed, 3 insertions(+)
> > 
> > diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> > index c640fdf28fc5..e6b0eaf08ac2 100644
> > --- a/drivers/net/virtio_net.c
> > +++ b/drivers/net/virtio_net.c
> > @@ -3809,6 +3809,9 @@ static int virtnet_set_rxfh(struct net_device *dev,
> > struct virtnet_info *vi = netdev_priv(dev);
> > int i;
> > +   if (!vi->has_rss && !vi->has_rss_hash_report)
> > +   return -EOPNOTSUPP;
> > +
> 
> Why not make the second patch the first? That ordering seems to work better.

Sure, that works for me. Let me update it in v2.
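
For reference, a minimal sketch of how the two guards from this series
could sit together (hypothetical helper, not the committed code):

	/* Reject RSS ethtool requests that the device cannot honour. */
	static int virtnet_check_rxfh(struct virtnet_info *vi,
				      const struct ethtool_rxfh_param *rxfh)
	{
		/* Neither RSS nor hash report negotiated: no RSS commands. */
		if (!vi->has_rss && !vi->has_rss_hash_report)
			return -EOPNOTSUPP;

		/* An indirection table needs full RSS, not just hash report. */
		if (rxfh->indir && !vi->has_rss)
			return -EINVAL;

		return 0;
	}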



Re: [PATCH v4 02/14] mm: Switch mm->get_unmapped_area() to a flag

2024-03-27 Thread Jarkko Sakkinen
On Wed, 2024-03-27 at 02:42 +, Edgecombe, Rick P wrote:
> On Tue, 2024-03-26 at 13:57 +0200, Jarkko Sakkinen wrote:
> > Under which conditions is which path used during the initialization of
> > mm, and why is this the case? It is an open claim in the current form.
> 
> There is an arch_pick_mmap_layout() that arches can have their own
> rules for. There is also a generic one. It gets called during exec.
> 
> > 
> > That would be nice to have documented for the sake of being
> > complete
> > description. I have zero doubts of the claim being untrue.
> 
> ...being untrue?
> 

I mean I believe the change itself makes sense, it is just not
fully documented in the commit message.

BR, Jarkko



Re: [PATCH v9 15/15] selftests/sgx: Add scripts for EPC cgroup testing

2024-03-27 Thread Jarkko Sakkinen
On Mon, 2024-02-05 at 13:06 -0800, Haitao Huang wrote:
> The scripts rely on cgroup-tools package from libcgroup [1].
> 
> To run selftests for epc cgroup:
> 
> sudo ./run_epc_cg_selftests.sh
> 
> To watch misc cgroup 'current' changes during testing, run this in a
> separate terminal:
> 
> ./watch_misc_for_tests.sh current
> 
> With different cgroups, the script starts one or multiple concurrent SGX
> selftests, each to run one unclobbered_vdso_oversubscribed test. Each
> such test tries to load an enclave of EPC size equal to the EPC
> capacity available on the platform. The script checks results against
> the expectation set for each cgroup and reports success or failure.
> 
> The script creates 3 different cgroups at the beginning with the
> following expectations:
> 
> 1) SMALL - intentionally small enough to fail the test loading an
> enclave of size equal to the capacity.
> 2) LARGE - large enough to run up to 4 concurrent tests but fail some if
> more than 4 concurrent tests are run. The script starts 4 expecting at
> least one test to pass, and then starts 5 expecting at least one test
> to fail.
> 3) LARGER - limit is the same as the capacity, large enough to run lots
> of concurrent tests. The script starts 8 of them and expects all pass.
> Then it reruns the same test with one process randomly killed and
> usage checked to be zero after all processes exit.
> 
> The script also includes a test with low mem_cg limit and LARGE sgx_epc
> limit to verify that the RAM used for per-cgroup reclamation is charged
> to a proper mem_cg.
> 
> [1] https://github.com/libcgroup/libcgroup/blob/main/README
> 
> Signed-off-by: Haitao Huang 
> ---
> V7:
> - Added memcontrol test.
> 
> V5:
> - Added script with automatic results checking, remove the
> interactive
> script.
> - The script can run independent from the series below.
> ---
>  .../selftests/sgx/run_epc_cg_selftests.sh | 246
> ++
>  .../selftests/sgx/watch_misc_for_tests.sh |  13 +
>  2 files changed, 259 insertions(+)
>  create mode 100755
> tools/testing/selftests/sgx/run_epc_cg_selftests.sh
>  create mode 100755
> tools/testing/selftests/sgx/watch_misc_for_tests.sh
> 
> diff --git a/tools/testing/selftests/sgx/run_epc_cg_selftests.sh
> b/tools/testing/selftests/sgx/run_epc_cg_selftests.sh
> new file mode 100755
> index ..e027bf39f005
> --- /dev/null
> +++ b/tools/testing/selftests/sgx/run_epc_cg_selftests.sh
> @@ -0,0 +1,246 @@
> +#!/bin/bash

This is not portable, and it does not hold in the wild either.

It often does not hold, as it is not uncommon to place bash at
the path /usr/bin/bash. If I recall correctly, e.g. NixOS has
a path that is neither of those two.

Should be #!/usr/bin/env bash

That is POSIX compatible form.

Just got around trying to test this in NUC7 so looking into this in
more detail.

That said, can you make the script work with just "#!/usr/bin/env sh"
and make sure that it is busybox ash compatible?

I don't see any necessity to make this bash only and it adds to the
compilation time of the image. Otherwise a lot of this could be tested
just with qemu+bzImage+busybox(inside initramfs).

Now you are adding fully glibc shenanigans for the sake of syntax
sugar.

> +# SPDX-License-Identifier: GPL-2.0
> +# Copyright(c) 2023 Intel Corporation.
> +
> +TEST_ROOT_CG=selftest
> +cgcreate -g misc:$TEST_ROOT_CG

How do you know that cgcreate exists? It is used a lot in the script
with no check for the existence. Please fix e.g. with "command -v
cgcreate".

> +if [ $? -ne 0 ]; then
> +    echo "# Please make sure cgroup-tools is installed, and misc
> cgroup is mounted."
> +    exit 1
> +fi

And please do not do it this way. Also, please remove the advice for
"cgroups-tool". This is not meant to be debian only. Better would be
to e.g. point out the URL of the upstream project.

And yeah the whole message should be based on "command -v", not like
this.
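
Something along these lines would address both points (untested sketch,
POSIX sh, pointing at the upstream project instead of a distro package):

	#!/usr/bin/env sh
	# Fail early if the cgroup tools are missing.
	if ! command -v cgcreate > /dev/null 2>&1; then
		echo "# cgcreate not found, please install libcgroup:" \
		     "https://github.com/libcgroup/libcgroup" >&2
		exit 1
	fi

	TEST_ROOT_CG=selftest
	if ! cgcreate -g "misc:$TEST_ROOT_CG"; then
		echo "# failed to create misc cgroup, is misc mounted?" >&2
		exit 1
	fi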

> +TEST_CG_SUB1=$TEST_ROOT_CG/test1
> +TEST_CG_SUB2=$TEST_ROOT_CG/test2
> +# We will only set limit in test1 and run tests in test3
> +TEST_CG_SUB3=$TEST_ROOT_CG/test1/test3
> +TEST_CG_SUB4=$TEST_ROOT_CG/test4
> +
> +cgcreate -g misc:$TEST_CG_SUB1



> +cgcreate -g misc:$TEST_CG_SUB2
> +cgcreate -g misc:$TEST_CG_SUB3
> +cgcreate -g misc:$TEST_CG_SUB4
> +
> +# Default to V2
> +CG_MISC_ROOT=/sys/fs/cgroup
> +CG_MEM_ROOT=/sys/fs/cgroup
> +CG_V1=0
> +if [ ! -d "/sys/fs/cgroup/misc" ]; then
> +    echo "# cgroup V2 is in use."
> +else
> +    echo "# cgroup V1 is in use."

Is "#" prefix a standard for kselftest? I don't know this, thus asking.

> +    CG_MISC_ROOT=/sys/fs/cgroup/misc
> +    CG_MEM_ROOT=/sys/fs/cgroup/memory
> +    CG_V1=1

Have you checked what the indentation policy is for bash scripts inside
the kernel tree? I don't know what it is. That's why I'm asking.

> +fi
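
FWIW, instead of keying off the directory layout, the version probe
could check the mounted filesystem type, e.g. (sketch, assuming a
stat(1) that reports the fs type and that /sys/fs/cgroup is mounted):

	if [ "$(stat -fc %T /sys/fs/cgroup)" = "cgroup2fs" ]; then
		echo "# cgroup V2 is in use."
	else
		echo "# cgroup V1 is in use."
		CG_MISC_ROOT=/sys/fs/cgroup/misc
		CG_MEM_ROOT=/sys/fs/cgroup/memory
		CG_V1=1
	fi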
> +
> +CAPACITY=$(grep "sgx_epc" "$CG_MISC_ROOT/misc.capacity" | awk
> '{print $2}')
> +# This is below number of VA pages needed for enclave of capacity
> size. So
> +# should 

Re: [PATCH 2/2] remoteproc: mediatek: Don't parse extraneous subnodes for multi-core

2024-03-27 Thread AngeloGioacchino Del Regno

On 21/03/24 16:27, Mathieu Poirier wrote:

On Thu, Mar 21, 2024 at 09:46:14AM +0100, AngeloGioacchino Del Regno wrote:

When probing multi-core SCP, this driver is parsing all sub-nodes of
the scp-cluster node, but one of those might not be an actual SCP core,
and that would make the entire SCP cluster fail probing for no good
reason.

To fix that, in scp_add_multi_core() treat a subnode as a SCP Core by
parsing only available subnodes having compatible "mediatek,scp-core".

Fixes: 1fdbf0cdde98 ("remoteproc: mediatek: Probe SCP cluster on multi-core 
SCP")
Signed-off-by: AngeloGioacchino Del Regno 

---
  drivers/remoteproc/mtk_scp.c | 3 +++
  1 file changed, 3 insertions(+)

diff --git a/drivers/remoteproc/mtk_scp.c b/drivers/remoteproc/mtk_scp.c
index 67518291a8ad..fbe1c232dae7 100644
--- a/drivers/remoteproc/mtk_scp.c
+++ b/drivers/remoteproc/mtk_scp.c
@@ -1096,6 +1096,9 @@ static int scp_add_multi_core(struct platform_device 
*pdev,
cluster_of_data = (const struct mtk_scp_of_data 
**)of_device_get_match_data(dev);
  
  	for_each_available_child_of_node(np, child) {

+   if (!of_device_is_compatible(child, "mediatek,scp-core"))
+   continue;
+


Interesting - what else gets stashed under the remote processor node?  I don't
see anything specified in the bindings.



Sorry for the late reply - well, in this precise moment in time, upstream,
nothing yet.

I have noticed this while debugging some lockups and wanted to move the
scp_adsp clock controller node as a child of the SCP node (as some of
those clocks are located *in the SCP's CFG register space*, and it's
correct for that to be a child, as one of those does depend on the SCP
being up - and I'll spare you the rest), and noticed the unexpected
behavior, as the SCP driver was treating those as an SCP core.

There was no kernel panic, but the SCP would fail probing.

This is anyway a missed requirement ... for platforms that want *both* two SCP
cores *and* the AudioDSP, as that'd at least be two nodes with the same iostart
(scp@1072000, clock-controller@1072000), other than the reasons I explained some
lines back.

...and that's why this commit was sent :-)

P.S.: The reason why platforms don't crash without the scp_adsp clock controller
  as a child of SCP is that the bootloader is actually doing basic init for
  the SCP, hence the block is powered on when booting ;-)

Cheers,
Angelo


Thanks,
Mathieu


if (!cluster_of_data[core_id]) {
ret = -EINVAL;
dev_err(dev, "Not support core %d\n", core_id);
--
2.44.0







Re: [PATCH 1/2] remoteproc: mediatek: Make sure IPI buffer fits in L2TCM

2024-03-27 Thread AngeloGioacchino Del Regno

On 21/03/24 16:25, Mathieu Poirier wrote:

Good day,

On Thu, Mar 21, 2024 at 09:46:13AM +0100, AngeloGioacchino Del Regno wrote:

The IPI buffer location is read from the firmware that we load to the
System Companion Processor, and it's not guaranteed that the SRAM
(L2TCM) size that is defined in the devicetree node is large enough
for that; while this is especially true for multi-core SCP, it's
still useful to check on single-core variants as well.

Failing to perform this check may make this driver perform R/W
oeprations out of the L2TCM boundary, resulting (at best) in a


s/oeprations/operations

I will fix that when I apply the patch.



Thanks for that.

Cheers,
Angelo
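
For context, the bounds check being discussed boils down to something
like this (field names assumed for illustration, not the exact patch):

	/* Reject firmware whose IPI buffers would overrun the L2TCM. */
	if (offset + sizeof(*scp->recv_buf) + sizeof(*scp->send_buf) >
	    scp->sram_size) {
		dev_err(scp->dev, "IPI buffer does not fit in SRAM\n");
		return -EOVERFLOW;
	}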





FAILED: Patch "virtio: reenable config if freezing device failed" failed to apply to 4.19-stable tree

2024-03-27 Thread Sasha Levin
The patch below does not apply to the 4.19-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to .

Thanks,
Sasha

-- original commit in Linus's tree --

From 310227f42882c52356b523e2f4e11690eebcd2ab Mon Sep 17 00:00:00 2001
From: David Hildenbrand 
Date: Tue, 13 Feb 2024 14:54:25 +0100
Subject: [PATCH] virtio: reenable config if freezing device failed

Currently, we don't reenable the config if freezing the device failed.

For example, virtio-mem currently doesn't support suspend+resume, and
trying to freeze the device will always fail. Afterwards, the device
will no longer respond to resize requests, because it won't get notified
about config changes.

Let's fix this by re-enabling the config if freezing fails.

Fixes: 22b7050a024d ("virtio: defer config changed notifications")
Cc: 
Cc: "Michael S. Tsirkin" 
Cc: Jason Wang 
Cc: Xuan Zhuo 
Signed-off-by: David Hildenbrand 
Message-Id: <20240213135425.795001-1-da...@redhat.com>
Signed-off-by: Michael S. Tsirkin 
---
 drivers/virtio/virtio.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/virtio/virtio.c b/drivers/virtio/virtio.c
index f4080692b3513..f513ee21b1c18 100644
--- a/drivers/virtio/virtio.c
+++ b/drivers/virtio/virtio.c
@@ -510,8 +510,10 @@ int virtio_device_freeze(struct virtio_device *dev)
 
if (drv && drv->freeze) {
ret = drv->freeze(dev);
-   if (ret)
+   if (ret) {
+   virtio_config_enable(dev);
return ret;
+   }
}
 
if (dev->config->destroy_avq)
-- 
2.43.0







FAILED: Patch "ring-buffer: Do not set shortest_full when full target is hit" failed to apply to 5.4-stable tree

2024-03-27 Thread Sasha Levin
The patch below does not apply to the 5.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to .

Thanks,
Sasha

-- original commit in Linus's tree --

From 761d9473e27f0c8782895013a3e7b52a37c8bcfc Mon Sep 17 00:00:00 2001
From: "Steven Rostedt (Google)" 
Date: Tue, 12 Mar 2024 11:56:41 -0400
Subject: [PATCH] ring-buffer: Do not set shortest_full when full target is hit

The rb_watermark_hit() checks if the amount of data in the ring buffer is
above the percentage level passed in by the "full" variable. If it is, it
returns true.

But it also sets the "shortest_full" field of the cpu_buffer that informs
writers that it needs to call the irq_work if the amount of data on the
ring buffer is above the requested amount.

The rb_watermark_hit() always sets the shortest_full even if the amount in
the ring buffer is what it wants. As it is not going to wait, because it
has what it wants, there's no reason to set shortest_full.

Link: 
https://lore.kernel.org/linux-trace-kernel/20240312115641.6aa8b...@gandalf.local.home

Cc: sta...@vger.kernel.org
Cc: Mathieu Desnoyers 
Fixes: 42fb0a1e84ff5 ("tracing/ring-buffer: Have polling block on watermark")
Reviewed-by: Masami Hiramatsu (Google) 
Signed-off-by: Steven Rostedt (Google) 
---
 kernel/trace/ring_buffer.c | 7 ---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index aa332ace108b1..6ffbccb9bcf00 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -834,9 +834,10 @@ static bool rb_watermark_hit(struct trace_buffer *buffer, 
int cpu, int full)
 		pagebusy = cpu_buffer->reader_page == cpu_buffer->commit_page;
 		ret = !pagebusy && full_hit(buffer, cpu, full);
 
-		if (!cpu_buffer->shortest_full ||
-		    cpu_buffer->shortest_full > full)
-			cpu_buffer->shortest_full = full;
+		if (!ret && (!cpu_buffer->shortest_full ||
+			     cpu_buffer->shortest_full > full)) {
+			cpu_buffer->shortest_full = full;
+		}
 		raw_spin_unlock_irqrestore(&cpu_buffer->reader_lock, flags);
 	}
 	return ret;
-- 
2.43.0







FAILED: Patch "virtio: reenable config if freezing device failed" failed to apply to 5.4-stable tree

2024-03-27 Thread Sasha Levin
The patch below does not apply to the 5.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to .

Thanks,
Sasha

-- original commit in Linus's tree --

From 310227f42882c52356b523e2f4e11690eebcd2ab Mon Sep 17 00:00:00 2001
From: David Hildenbrand 
Date: Tue, 13 Feb 2024 14:54:25 +0100
Subject: [PATCH] virtio: reenable config if freezing device failed

Currently, we don't reenable the config if freezing the device failed.

For example, virtio-mem currently doesn't support suspend+resume, and
trying to freeze the device will always fail. Afterwards, the device
will no longer respond to resize requests, because it won't get notified
about config changes.

Let's fix this by re-enabling the config if freezing fails.

Fixes: 22b7050a024d ("virtio: defer config changed notifications")
Cc: 
Cc: "Michael S. Tsirkin" 
Cc: Jason Wang 
Cc: Xuan Zhuo 
Signed-off-by: David Hildenbrand 
Message-Id: <20240213135425.795001-1-da...@redhat.com>
Signed-off-by: Michael S. Tsirkin 
---
 drivers/virtio/virtio.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/virtio/virtio.c b/drivers/virtio/virtio.c
index f4080692b3513..f513ee21b1c18 100644
--- a/drivers/virtio/virtio.c
+++ b/drivers/virtio/virtio.c
@@ -510,8 +510,10 @@ int virtio_device_freeze(struct virtio_device *dev)
 
if (drv && drv->freeze) {
ret = drv->freeze(dev);
-   if (ret)
+   if (ret) {
+   virtio_config_enable(dev);
return ret;
+   }
}
 
if (dev->config->destroy_avq)
-- 
2.43.0







FAILED: Patch "tracing/ring-buffer: Fix wait_on_pipe() race" failed to apply to 5.15-stable tree

2024-03-27 Thread Sasha Levin
The patch below does not apply to the 5.15-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to .

Thanks,
Sasha

-- original commit in Linus's tree --

From 2aa043a55b9a764c9cbde5a8c654eeaaffe224cf Mon Sep 17 00:00:00 2001
From: "Steven Rostedt (Google)" 
Date: Tue, 12 Mar 2024 08:15:08 -0400
Subject: [PATCH] tracing/ring-buffer: Fix wait_on_pipe() race

When the trace_pipe_raw file is closed, there should be no new readers on
the file descriptor. This is mostly handled with the waking and wait_index
fields of the iterator. But there's still a slight race.

 CPU 0                                 CPU 1
 -----                                 -----
                                       wait_index++;
 index = wait_index;
                                       ring_buffer_wake_waiters();
 wait_on_pipe()
   ring_buffer_wait();

The ring_buffer_wait() will miss the wakeup from CPU 1. The problem is
that the ring_buffer_wait() needs the logic of:

prepare_to_wait();
if (!condition)
schedule();

Where the missing condition check is the iter->wait_index update.

Have the ring_buffer_wait() take a conditional callback function and a
data parameter that can be used within the wait_event_interruptible() of
the ring_buffer_wait() function.

In wait_on_pipe(), pass a condition function that will check if the
wait_index has been updated, if it has, it will return true to break out
of the wait_event_interruptible() loop.

Create a new field "closed" in the trace_iterator and set it in the
.flush() callback before calling ring_buffer_wake_waiters().
This will keep any new readers from waiting on a closed file descriptor.

Have the wait_on_pipe() condition callback also check the closed field.

Change the wait_index field of the trace_iterator to atomic_t. There's no
reason it needs to be 'long' and making it atomic and using
atomic_read_acquire() and atomic_fetch_inc_release() will provide the
necessary memory barriers.

Add a "woken" flag to tracing_buffers_splice_read() to exit the loop after
one more try to fetch data. That is, if it waited for data and something
woke it up, it should try to collect any new data and then exit back to
user space.

Link: 
https://lore.kernel.org/linux-trace-kernel/CAHk-=wgsngewhfxzajiaqznwpmqetqmi1waes2o1v6l4c_u...@mail.gmail.com/
Link: 
https://lore.kernel.org/linux-trace-kernel/20240312121703.557950...@goodmis.org

Cc: sta...@vger.kernel.org
Cc: Masami Hiramatsu 
Cc: Mark Rutland 
Cc: Mathieu Desnoyers 
Cc: Andrew Morton 
Cc: Linus Torvalds 
Cc: linke li 
Cc: Rabin Vincent 
Fixes: f3ddb74ad0790 ("tracing: Wake up ring buffer waiters on closing of the 
file")
Signed-off-by: Steven Rostedt (Google) 
---
 include/linux/ring_buffer.h  |  3 ++-
 include/linux/trace_events.h |  5 -
 kernel/trace/ring_buffer.c   | 13 ++-
 kernel/trace/trace.c | 43 ++--
 4 files changed, 45 insertions(+), 19 deletions(-)

diff --git a/include/linux/ring_buffer.h b/include/linux/ring_buffer.h
index 338a33db1577e..dc5ae4e96aee0 100644
--- a/include/linux/ring_buffer.h
+++ b/include/linux/ring_buffer.h
@@ -99,7 +99,8 @@ __ring_buffer_alloc(unsigned long size, unsigned flags, 
struct lock_class_key *k
 })
 
 typedef bool (*ring_buffer_cond_fn)(void *data);
-int ring_buffer_wait(struct trace_buffer *buffer, int cpu, int full);
+int ring_buffer_wait(struct trace_buffer *buffer, int cpu, int full,
+ring_buffer_cond_fn cond, void *data);
 __poll_t ring_buffer_poll_wait(struct trace_buffer *buffer, int cpu,
  struct file *filp, poll_table *poll_table, int full);
 void ring_buffer_wake_waiters(struct trace_buffer *buffer, int cpu);
diff --git a/include/linux/trace_events.h b/include/linux/trace_events.h
index d68ff9b1247f9..fc6d0af56bb17 100644
--- a/include/linux/trace_events.h
+++ b/include/linux/trace_events.h
@@ -103,13 +103,16 @@ struct trace_iterator {
 	unsigned int		temp_size;
 	char			*fmt;	/* modified format holder */
 	unsigned int		fmt_size;
-	long			wait_index;
+	atomic_t		wait_index;
 
 	/* trace_seq for __print_flags() and __print_symbolic() etc. */
 	struct trace_seq	tmp_seq;
 
 	cpumask_var_t		started;
 
+	/* Set when the file is closed to prevent new waiters */
+	bool			closed;
+
 	/* it's true when current open file is snapshot */
 	bool			snapshot;
 
diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index f4c34b7c7e1e7..350607cce8694 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -902,23 +902,26 @@ static bool rb_wait_once(void *data)
  * @buffer: buffer to wait on
  * @cpu: the cpu buffer to wait on

FAILED: Patch "virtio: reenable config if freezing device failed" failed to apply to 5.10-stable tree

2024-03-27 Thread Sasha Levin
The patch below does not apply to the 5.10-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to .

Thanks,
Sasha

-- original commit in Linus's tree --

From 310227f42882c52356b523e2f4e11690eebcd2ab Mon Sep 17 00:00:00 2001
From: David Hildenbrand 
Date: Tue, 13 Feb 2024 14:54:25 +0100
Subject: [PATCH] virtio: reenable config if freezing device failed

Currently, we don't reenable the config if freezing the device failed.

For example, virtio-mem currently doesn't support suspend+resume, and
trying to freeze the device will always fail. Afterwards, the device
will no longer respond to resize requests, because it won't get notified
about config changes.

Let's fix this by re-enabling the config if freezing fails.

Fixes: 22b7050a024d ("virtio: defer config changed notifications")
Cc: 
Cc: "Michael S. Tsirkin" 
Cc: Jason Wang 
Cc: Xuan Zhuo 
Signed-off-by: David Hildenbrand 
Message-Id: <20240213135425.795001-1-da...@redhat.com>
Signed-off-by: Michael S. Tsirkin 
---
 drivers/virtio/virtio.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/virtio/virtio.c b/drivers/virtio/virtio.c
index f4080692b3513..f513ee21b1c18 100644
--- a/drivers/virtio/virtio.c
+++ b/drivers/virtio/virtio.c
@@ -510,8 +510,10 @@ int virtio_device_freeze(struct virtio_device *dev)
 
if (drv && drv->freeze) {
ret = drv->freeze(dev);
-   if (ret)
+   if (ret) {
+   virtio_config_enable(dev);
return ret;
+   }
}
 
if (dev->config->destroy_avq)
-- 
2.43.0







FAILED: Patch "virtio: reenable config if freezing device failed" failed to apply to 5.15-stable tree

2024-03-27 Thread Sasha Levin
The patch below does not apply to the 5.15-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to .

Thanks,
Sasha

-- original commit in Linus's tree --

From 310227f42882c52356b523e2f4e11690eebcd2ab Mon Sep 17 00:00:00 2001
From: David Hildenbrand 
Date: Tue, 13 Feb 2024 14:54:25 +0100
Subject: [PATCH] virtio: reenable config if freezing device failed

Currently, we don't reenable the config if freezing the device failed.

For example, virtio-mem currently doesn't support suspend+resume, and
trying to freeze the device will always fail. Afterwards, the device
will no longer respond to resize requests, because it won't get notified
about config changes.

Let's fix this by re-enabling the config if freezing fails.

Fixes: 22b7050a024d ("virtio: defer config changed notifications")
Cc: 
Cc: "Michael S. Tsirkin" 
Cc: Jason Wang 
Cc: Xuan Zhuo 
Signed-off-by: David Hildenbrand 
Message-Id: <20240213135425.795001-1-da...@redhat.com>
Signed-off-by: Michael S. Tsirkin 
---
 drivers/virtio/virtio.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/virtio/virtio.c b/drivers/virtio/virtio.c
index f4080692b3513..f513ee21b1c18 100644
--- a/drivers/virtio/virtio.c
+++ b/drivers/virtio/virtio.c
@@ -510,8 +510,10 @@ int virtio_device_freeze(struct virtio_device *dev)
 
if (drv && drv->freeze) {
ret = drv->freeze(dev);
-   if (ret)
+   if (ret) {
+   virtio_config_enable(dev);
return ret;
+   }
}
 
if (dev->config->destroy_avq)
-- 
2.43.0







FAILED: Patch "tracing/ring-buffer: Fix wait_on_pipe() race" failed to apply to 6.1-stable tree

2024-03-27 Thread Sasha Levin
The patch below does not apply to the 6.1-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to .

Thanks,
Sasha

-- original commit in Linus's tree --

From 2aa043a55b9a764c9cbde5a8c654eeaaffe224cf Mon Sep 17 00:00:00 2001
From: "Steven Rostedt (Google)" 
Date: Tue, 12 Mar 2024 08:15:08 -0400
Subject: [PATCH] tracing/ring-buffer: Fix wait_on_pipe() race

When the trace_pipe_raw file is closed, there should be no new readers on
the file descriptor. This is mostly handled with the waking and wait_index
fields of the iterator. But there's still a slight race.

 CPU 0                                 CPU 1
 -----                                 -----
                                       wait_index++;
 index = wait_index;
                                       ring_buffer_wake_waiters();
 wait_on_pipe()
   ring_buffer_wait();

The ring_buffer_wait() will miss the wakeup from CPU 1. The problem is
that the ring_buffer_wait() needs the logic of:

prepare_to_wait();
if (!condition)
schedule();

Where the missing condition check is the iter->wait_index update.

Have the ring_buffer_wait() take a conditional callback function and a
data parameter that can be used within the wait_event_interruptible() of
the ring_buffer_wait() function.

In wait_on_pipe(), pass a condition function that will check if the
wait_index has been updated, if it has, it will return true to break out
of the wait_event_interruptible() loop.

Create a new field "closed" in the trace_iterator and set it in the
.flush() callback before calling ring_buffer_wake_waiters().
This will keep any new readers from waiting on a closed file descriptor.

Have the wait_on_pipe() condition callback also check the closed field.

Change the wait_index field of the trace_iterator to atomic_t. There's no
reason it needs to be 'long' and making it atomic and using
atomic_read_acquire() and atomic_fetch_inc_release() will provide the
necessary memory barriers.

Add a "woken" flag to tracing_buffers_splice_read() to exit the loop after
one more try to fetch data. That is, if it waited for data and something
woke it up, it should try to collect any new data and then exit back to
user space.

Link: 
https://lore.kernel.org/linux-trace-kernel/CAHk-=wgsngewhfxzajiaqznwpmqetqmi1waes2o1v6l4c_u...@mail.gmail.com/
Link: 
https://lore.kernel.org/linux-trace-kernel/20240312121703.557950...@goodmis.org

Cc: sta...@vger.kernel.org
Cc: Masami Hiramatsu 
Cc: Mark Rutland 
Cc: Mathieu Desnoyers 
Cc: Andrew Morton 
Cc: Linus Torvalds 
Cc: linke li 
Cc: Rabin Vincent 
Fixes: f3ddb74ad0790 ("tracing: Wake up ring buffer waiters on closing of the 
file")
Signed-off-by: Steven Rostedt (Google) 
---
 include/linux/ring_buffer.h  |  3 ++-
 include/linux/trace_events.h |  5 -
 kernel/trace/ring_buffer.c   | 13 ++-
 kernel/trace/trace.c | 43 ++--
 4 files changed, 45 insertions(+), 19 deletions(-)

diff --git a/include/linux/ring_buffer.h b/include/linux/ring_buffer.h
index 338a33db1577e..dc5ae4e96aee0 100644
--- a/include/linux/ring_buffer.h
+++ b/include/linux/ring_buffer.h
@@ -99,7 +99,8 @@ __ring_buffer_alloc(unsigned long size, unsigned flags, 
struct lock_class_key *k
 })
 
 typedef bool (*ring_buffer_cond_fn)(void *data);
-int ring_buffer_wait(struct trace_buffer *buffer, int cpu, int full);
+int ring_buffer_wait(struct trace_buffer *buffer, int cpu, int full,
+ring_buffer_cond_fn cond, void *data);
 __poll_t ring_buffer_poll_wait(struct trace_buffer *buffer, int cpu,
  struct file *filp, poll_table *poll_table, int full);
 void ring_buffer_wake_waiters(struct trace_buffer *buffer, int cpu);
diff --git a/include/linux/trace_events.h b/include/linux/trace_events.h
index d68ff9b1247f9..fc6d0af56bb17 100644
--- a/include/linux/trace_events.h
+++ b/include/linux/trace_events.h
@@ -103,13 +103,16 @@ struct trace_iterator {
 	unsigned int		temp_size;
 	char			*fmt;	/* modified format holder */
 	unsigned int		fmt_size;
-	long			wait_index;
+	atomic_t		wait_index;
 
 	/* trace_seq for __print_flags() and __print_symbolic() etc. */
 	struct trace_seq	tmp_seq;
 
 	cpumask_var_t		started;
 
+	/* Set when the file is closed to prevent new waiters */
+	bool			closed;
+
 	/* it's true when current open file is snapshot */
 	bool			snapshot;
 
diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index f4c34b7c7e1e7..350607cce8694 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -902,23 +902,26 @@ static bool rb_wait_once(void *data)
  * @buffer: buffer to wait on
  * @cpu: the cpu buffer to wait on
 

FAILED: Patch "virtio: reenable config if freezing device failed" failed to apply to 6.1-stable tree

2024-03-27 Thread Sasha Levin
The patch below does not apply to the 6.1-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to .

Thanks,
Sasha

-- original commit in Linus's tree --

From 310227f42882c52356b523e2f4e11690eebcd2ab Mon Sep 17 00:00:00 2001
From: David Hildenbrand 
Date: Tue, 13 Feb 2024 14:54:25 +0100
Subject: [PATCH] virtio: reenable config if freezing device failed

Currently, we don't reenable the config if freezing the device failed.

For example, virtio-mem currently doesn't support suspend+resume, and
trying to freeze the device will always fail. Afterwards, the device
will no longer respond to resize requests, because it won't get notified
about config changes.

Let's fix this by re-enabling the config if freezing fails.

Fixes: 22b7050a024d ("virtio: defer config changed notifications")
Cc: 
Cc: "Michael S. Tsirkin" 
Cc: Jason Wang 
Cc: Xuan Zhuo 
Signed-off-by: David Hildenbrand 
Message-Id: <20240213135425.795001-1-da...@redhat.com>
Signed-off-by: Michael S. Tsirkin 
---
 drivers/virtio/virtio.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/virtio/virtio.c b/drivers/virtio/virtio.c
index f4080692b3513..f513ee21b1c18 100644
--- a/drivers/virtio/virtio.c
+++ b/drivers/virtio/virtio.c
@@ -510,8 +510,10 @@ int virtio_device_freeze(struct virtio_device *dev)
 
if (drv && drv->freeze) {
ret = drv->freeze(dev);
-   if (ret)
+   if (ret) {
+   virtio_config_enable(dev);
return ret;
+   }
}
 
if (dev->config->destroy_avq)
-- 
2.43.0







FAILED: Patch "tracing/ring-buffer: Fix wait_on_pipe() race" failed to apply to 6.6-stable tree

2024-03-27 Thread Sasha Levin
The patch below does not apply to the 6.6-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to .

Thanks,
Sasha

-- original commit in Linus's tree --

From 2aa043a55b9a764c9cbde5a8c654eeaaffe224cf Mon Sep 17 00:00:00 2001
From: "Steven Rostedt (Google)" 
Date: Tue, 12 Mar 2024 08:15:08 -0400
Subject: [PATCH] tracing/ring-buffer: Fix wait_on_pipe() race

When the trace_pipe_raw file is closed, there should be no new readers on
the file descriptor. This is mostly handled with the waking and wait_index
fields of the iterator. But there's still a slight race.

 CPU 0                                 CPU 1
 -----                                 -----
                                       wait_index++;
 index = wait_index;
                                       ring_buffer_wake_waiters();
 wait_on_pipe()
   ring_buffer_wait();

The ring_buffer_wait() will miss the wakeup from CPU 1. The problem is
that the ring_buffer_wait() needs the logic of:

prepare_to_wait();
if (!condition)
schedule();

Where the missing condition check is the iter->wait_index update.

Have the ring_buffer_wait() take a conditional callback function and a
data parameter that can be used within the wait_event_interruptible() of
the ring_buffer_wait() function.

In wait_on_pipe(), pass a condition function that will check if the
wait_index has been updated, if it has, it will return true to break out
of the wait_event_interruptible() loop.

Create a new field "closed" in the trace_iterator and set it in the
.flush() callback before calling ring_buffer_wake_waiters().
This will keep any new readers from waiting on a closed file descriptor.

Have the wait_on_pipe() condition callback also check the closed field.

Change the wait_index field of the trace_iterator to atomic_t. There's no
reason it needs to be 'long' and making it atomic and using
atomic_read_acquire() and atomic_fetch_inc_release() will provide the
necessary memory barriers.

Add a "woken" flag to tracing_buffers_splice_read() to exit the loop after
one more try to fetch data. That is, if it waited for data and something
woke it up, it should try to collect any new data and then exit back to
user space.

Link: 
https://lore.kernel.org/linux-trace-kernel/CAHk-=wgsngewhfxzajiaqznwpmqetqmi1waes2o1v6l4c_u...@mail.gmail.com/
Link: 
https://lore.kernel.org/linux-trace-kernel/20240312121703.557950...@goodmis.org

Cc: sta...@vger.kernel.org
Cc: Masami Hiramatsu 
Cc: Mark Rutland 
Cc: Mathieu Desnoyers 
Cc: Andrew Morton 
Cc: Linus Torvalds 
Cc: linke li 
Cc: Rabin Vincent 
Fixes: f3ddb74ad0790 ("tracing: Wake up ring buffer waiters on closing of the 
file")
Signed-off-by: Steven Rostedt (Google) 
---
 include/linux/ring_buffer.h  |  3 ++-
 include/linux/trace_events.h |  5 -
 kernel/trace/ring_buffer.c   | 13 ++-
 kernel/trace/trace.c | 43 ++--
 4 files changed, 45 insertions(+), 19 deletions(-)

diff --git a/include/linux/ring_buffer.h b/include/linux/ring_buffer.h
index 338a33db1577e..dc5ae4e96aee0 100644
--- a/include/linux/ring_buffer.h
+++ b/include/linux/ring_buffer.h
@@ -99,7 +99,8 @@ __ring_buffer_alloc(unsigned long size, unsigned flags, 
struct lock_class_key *k
 })
 
 typedef bool (*ring_buffer_cond_fn)(void *data);
-int ring_buffer_wait(struct trace_buffer *buffer, int cpu, int full);
+int ring_buffer_wait(struct trace_buffer *buffer, int cpu, int full,
+ring_buffer_cond_fn cond, void *data);
 __poll_t ring_buffer_poll_wait(struct trace_buffer *buffer, int cpu,
  struct file *filp, poll_table *poll_table, int full);
 void ring_buffer_wake_waiters(struct trace_buffer *buffer, int cpu);
diff --git a/include/linux/trace_events.h b/include/linux/trace_events.h
index d68ff9b1247f9..fc6d0af56bb17 100644
--- a/include/linux/trace_events.h
+++ b/include/linux/trace_events.h
@@ -103,13 +103,16 @@ struct trace_iterator {
 	unsigned int		temp_size;
 	char			*fmt;	/* modified format holder */
 	unsigned int		fmt_size;
-	long			wait_index;
+	atomic_t		wait_index;
 
 	/* trace_seq for __print_flags() and __print_symbolic() etc. */
 	struct trace_seq	tmp_seq;
 
 	cpumask_var_t		started;
 
+	/* Set when the file is closed to prevent new waiters */
+	bool			closed;
+
 	/* it's true when current open file is snapshot */
 	bool			snapshot;
 
diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index f4c34b7c7e1e7..350607cce8694 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -902,23 +902,26 @@ static bool rb_wait_once(void *data)
  * @buffer: buffer to wait on
  * @cpu: the cpu buffer to wait on
 

FAILED: Patch "tracing/ring-buffer: Fix wait_on_pipe() race" failed to apply to 6.7-stable tree

2024-03-27 Thread Sasha Levin
The patch below does not apply to the 6.7-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to .

Thanks,
Sasha

-- original commit in Linus's tree --

From 2aa043a55b9a764c9cbde5a8c654eeaaffe224cf Mon Sep 17 00:00:00 2001
From: "Steven Rostedt (Google)" 
Date: Tue, 12 Mar 2024 08:15:08 -0400
Subject: [PATCH] tracing/ring-buffer: Fix wait_on_pipe() race

When the trace_pipe_raw file is closed, there should be no new readers on
the file descriptor. This is mostly handled with the waking and wait_index
fields of the iterator. But there's still a slight race.

 CPU 0                                 CPU 1
 -----                                 -----
                                       wait_index++;
 index = wait_index;
                                       ring_buffer_wake_waiters();
 wait_on_pipe()
   ring_buffer_wait();

The ring_buffer_wait() will miss the wakeup from CPU 1. The problem is
that the ring_buffer_wait() needs the logic of:

prepare_to_wait();
if (!condition)
schedule();

Where the missing condition check is the iter->wait_index update.

Have the ring_buffer_wait() take a conditional callback function and a
data parameter that can be used within the wait_event_interruptible() of
the ring_buffer_wait() function.

In wait_on_pipe(), pass a condition function that will check if the
wait_index has been updated, if it has, it will return true to break out
of the wait_event_interruptible() loop.

Create a new field "closed" in the trace_iterator and set it in the
.flush() callback before calling ring_buffer_wake_waiters().
This will keep any new readers from waiting on a closed file descriptor.

Have the wait_on_pipe() condition callback also check the closed field.

Change the wait_index field of the trace_iterator to atomic_t. There's no
reason it needs to be 'long' and making it atomic and using
atomic_read_acquire() and atomic_fetch_inc_release() will provide the
necessary memory barriers.

Add a "woken" flag to tracing_buffers_splice_read() to exit the loop after
one more try to fetch data. That is, if it waited for data and something
woke it up, it should try to collect any new data and then exit back to
user space.

Link: 
https://lore.kernel.org/linux-trace-kernel/CAHk-=wgsngewhfxzajiaqznwpmqetqmi1waes2o1v6l4c_u...@mail.gmail.com/
Link: 
https://lore.kernel.org/linux-trace-kernel/20240312121703.557950...@goodmis.org

Cc: sta...@vger.kernel.org
Cc: Masami Hiramatsu 
Cc: Mark Rutland 
Cc: Mathieu Desnoyers 
Cc: Andrew Morton 
Cc: Linus Torvalds 
Cc: linke li 
Cc: Rabin Vincent 
Fixes: f3ddb74ad0790 ("tracing: Wake up ring buffer waiters on closing of the 
file")
Signed-off-by: Steven Rostedt (Google) 
---
 include/linux/ring_buffer.h  |  3 ++-
 include/linux/trace_events.h |  5 -
 kernel/trace/ring_buffer.c   | 13 ++-
 kernel/trace/trace.c | 43 ++--
 4 files changed, 45 insertions(+), 19 deletions(-)

diff --git a/include/linux/ring_buffer.h b/include/linux/ring_buffer.h
index 338a33db1577e..dc5ae4e96aee0 100644
--- a/include/linux/ring_buffer.h
+++ b/include/linux/ring_buffer.h
@@ -99,7 +99,8 @@ __ring_buffer_alloc(unsigned long size, unsigned flags, 
struct lock_class_key *k
 })
 
 typedef bool (*ring_buffer_cond_fn)(void *data);
-int ring_buffer_wait(struct trace_buffer *buffer, int cpu, int full);
+int ring_buffer_wait(struct trace_buffer *buffer, int cpu, int full,
+ring_buffer_cond_fn cond, void *data);
 __poll_t ring_buffer_poll_wait(struct trace_buffer *buffer, int cpu,
  struct file *filp, poll_table *poll_table, int full);
 void ring_buffer_wake_waiters(struct trace_buffer *buffer, int cpu);
diff --git a/include/linux/trace_events.h b/include/linux/trace_events.h
index d68ff9b1247f9..fc6d0af56bb17 100644
--- a/include/linux/trace_events.h
+++ b/include/linux/trace_events.h
@@ -103,13 +103,16 @@ struct trace_iterator {
 	unsigned int		temp_size;
 	char			*fmt;	/* modified format holder */
 	unsigned int		fmt_size;
-	long			wait_index;
+	atomic_t		wait_index;
 
 	/* trace_seq for __print_flags() and __print_symbolic() etc. */
 	struct trace_seq	tmp_seq;
 
 	cpumask_var_t		started;
 
+	/* Set when the file is closed to prevent new waiters */
+	bool			closed;
+
 	/* it's true when current open file is snapshot */
 	bool			snapshot;
 
diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index f4c34b7c7e1e7..350607cce8694 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -902,23 +902,26 @@ static bool rb_wait_once(void *data)
  * @buffer: buffer to wait on
  * @cpu: the cpu buffer to wait on
 

Re: [PATCH v2 1/2] vhost: Add smp_rmb() in vhost_vq_avail_empty()

2024-03-27 Thread Michael S. Tsirkin
On Wed, Mar 27, 2024 at 09:38:45AM +1000, Gavin Shan wrote:
> A smp_rmb() has been missed in vhost_vq_avail_empty(), spotted by
> Will Deacon . Otherwise, it's not ensured the
> available ring entries pushed by guest can be observed by vhost
> in time, leading to stale available ring entries fetched by vhost
> in vhost_get_vq_desc(), as reported by Yihuang Yu on NVidia's
> grace-hopper (ARM64) platform.
> 
>   /home/gavin/sandbox/qemu.main/build/qemu-system-aarch64  \
>   -accel kvm -machine virt,gic-version=host -cpu host  \
>   -smp maxcpus=1,cpus=1,sockets=1,clusters=1,cores=1,threads=1 \
>   -m 4096M,slots=16,maxmem=64G \
>   -object memory-backend-ram,id=mem0,size=4096M\
>:   \
>   -netdev tap,id=vnet0,vhost=true  \
>   -device virtio-net-pci,bus=pcie.8,netdev=vnet0,mac=52:54:00:f1:26:b0
>:
>   guest# netperf -H 10.26.1.81 -l 60 -C -c -t UDP_STREAM
>   virtio_net virtio0: output.0:id 100 is not a head!
> 
> Add the missed smp_rmb() in vhost_vq_avail_empty(). Note that it
> should be safe until vq->avail_idx is changed by commit 275bf960ac697
> ("vhost: better detection of available buffers").
> 
> Fixes: 275bf960ac697 ("vhost: better detection of available buffers")
> Cc:  # v4.11+
> Reported-by: Yihuang Yu 
> Signed-off-by: Gavin Shan 
> ---
>  drivers/vhost/vhost.c | 11 ++-
>  1 file changed, 10 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> index 045f666b4f12..00445ab172b3 100644
> --- a/drivers/vhost/vhost.c
> +++ b/drivers/vhost/vhost.c
> @@ -2799,9 +2799,18 @@ bool vhost_vq_avail_empty(struct vhost_dev *dev, 
> struct vhost_virtqueue *vq)
>   r = vhost_get_avail_idx(vq, &avail_idx);
>   if (unlikely(r))
>   return false;
> +
>   vq->avail_idx = vhost16_to_cpu(vq, avail_idx);
> + if (vq->avail_idx != vq->last_avail_idx) {
> + /* Similar to what's done in vhost_get_vq_desc(), we need
> +  * to ensure the available ring entries have been exposed
> +  * by guest.
> +  */

A slightly clearer comment:

/* Since we have updated avail_idx, the following call to
 * vhost_get_vq_desc will read available ring entries.
 * Make sure that read happens after the avail_idx read.
 */

Pls repost with that, and I will apply.

Also add suggested-by for will.


> + smp_rmb();
> + return false;
> + }
>  
> - return vq->avail_idx == vq->last_avail_idx;
> + return true;
>  }
>  EXPORT_SYMBOL_GPL(vhost_vq_avail_empty);

As a follow-up patch, we should clean out code duplication that
accumulated with 3 places reading avail idx in essentially
the same way - this duplication is what causes the mess in
the 1st place.
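
Something like the following could fold those readers into one helper
(rough sketch only, name assumed, not a tested patch):

	/* Read avail_idx once, update the cached copy, and order any
	 * subsequent ring reads after it when the index has moved. */
	static int vhost_update_avail_idx(struct vhost_virtqueue *vq)
	{
		__virtio16 avail_idx;
		int r;

		r = vhost_get_avail_idx(vq, &avail_idx);
		if (unlikely(r)) {
			vq_err(vq, "Failed to access avail idx at %p\n",
			       &vq->avail->idx);
			return r;
		}

		vq->avail_idx = vhost16_to_cpu(vq, avail_idx);
		if (vq->avail_idx != vq->last_avail_idx)
			smp_rmb();

		return 0;
	}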






> -- 
> 2.44.0




Re: [PATCH] virtio_ring: Fix the stale index in available ring

2024-03-27 Thread Michael S. Tsirkin
On Tue, Mar 26, 2024 at 03:46:29PM +, Will Deacon wrote:
> On Tue, Mar 26, 2024 at 11:43:13AM +, Will Deacon wrote:
> > On Tue, Mar 26, 2024 at 09:38:55AM +, Keir Fraser wrote:
> > > On Tue, Mar 26, 2024 at 03:49:02AM -0400, Michael S. Tsirkin wrote:
> > > > > Secondly, the debugging code is enhanced so that the available head 
> > > > > for
> > > > > (last_avail_idx - 1) is read for twice and recorded. It means the 
> > > > > available
> > > > > head for one specific available index is read for twice. I do see the
> > > > > available heads are different from the consecutive reads. More details
> > > > > are shared as below.
> > > > > 
> > > > > From the guest side
> > > > > ===
> > > > > 
> > > > > virtio_net virtio0: output.0:id 86 is not a head!
> > > > > head to be released: 047 062 112
> > > > > 
> > > > > avail_idx:
> > > > > 000  49665
> > > > > 001  49666  <--
> > > > >  :
> > > > > 015  49664
> > > > 
> > > > what are these #s 49665 and so on?
> > > > and how large is the ring?
> > > > I am guessing 49664 is the index, the ring size is 16, and
> > > > 49664 % 16 == 0
> > > 
> > > More than that, 49664 % 256 == 0
> > > 
> > > So again there seems to be an error in the vicinity of roll-over of
> > > the idx low byte, as I observed in the earlier log. Surely this is
> > > more than coincidence?
> > 
> > Yeah, I'd still really like to see the disassembly for both sides of the
> > protocol here. Gavin, is that something you're able to provide? Worst
> > case, the host and guest vmlinux objects would be a starting point.
> > 
> > Personally, I'd be fairly surprised if this was a hardware issue.
> 
> Ok, long shot after eyeballing the vhost code, but does the diff below
> help at all? It looks like vhost_vq_avail_empty() can advance the value
> saved in 'vq->avail_idx' but without the read barrier, possibly confusing
> vhost_get_vq_desc() in polling mode.
> 
> Will
> 
> --->8
> 
> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> index 045f666b4f12..87bff710331a 100644
> --- a/drivers/vhost/vhost.c
> +++ b/drivers/vhost/vhost.c
> @@ -2801,6 +2801,7 @@ bool vhost_vq_avail_empty(struct vhost_dev *dev, struct 
> vhost_virtqueue *vq)
> return false;
> vq->avail_idx = vhost16_to_cpu(vq, avail_idx);
>  
> +   smp_rmb();
> return vq->avail_idx == vq->last_avail_idx;
>  }
>  EXPORT_SYMBOL_GPL(vhost_vq_avail_empty);

Oh wow you are right.

We have:

if (vq->avail_idx == vq->last_avail_idx) {
if (unlikely(vhost_get_avail_idx(vq, &avail_idx))) {
vq_err(vq, "Failed to access avail idx at %p\n",
&vq->avail->idx);
return -EFAULT;
}
vq->avail_idx = vhost16_to_cpu(vq, avail_idx);

if (unlikely((u16)(vq->avail_idx - last_avail_idx) > vq->num)) {
vq_err(vq, "Guest moved used index from %u to %u",
last_avail_idx, vq->avail_idx);
return -EFAULT;
}

/* If there's nothing new since last we looked, return
 * invalid.
 */
if (vq->avail_idx == last_avail_idx)
return vq->num;

/* Only get avail ring entries after they have been
 * exposed by guest.
 */
smp_rmb();
}


and so the rmb only happens if avail_idx is not advanced.

Actually there is a bunch of code duplication where we assign to
avail_idx, too.

Will, thanks a lot for looking into this! I kept looking into
the virtio side for some reason, the fact that it did not
trigger with qemu should have been a big hint!


-- 
MST




Re: [PATCH v3] vhost/vdpa: Add MSI translation tables to iommu for software-managed MSI

2024-03-27 Thread Jason Wang
On Thu, Mar 21, 2024 at 3:00 PM Michael S. Tsirkin  wrote:
>
> On Wed, Mar 20, 2024 at 06:19:12PM +0800, Wang Rong wrote:
> > From: Rong Wang 
> >
> > Once enable iommu domain for one device, the MSI
> > translation tables have to be there for software-managed MSI.
> > Otherwise, platform with software-managed MSI without an
> > irq bypass function, can not get a correct memory write event
> > from pcie, will not get irqs.
> > The solution is to obtain the MSI phy base address from
> > iommu reserved region, and set it to iommu MSI cookie,
> > then translation tables will be created while request irq.
> >
> > Change log
> > --
> >
> > v1->v2:
> > - add resv iotlb to avoid overlap mapping.
> > v2->v3:
> > - there is no need to export the iommu symbol anymore.
> >
> > Signed-off-by: Rong Wang 
>
There's no interest in extending vhost iotlb further -
> we should just switch over to iommufd which supports
> this already.

IOMMUFD is good, but VFIO supported this before IOMMUFD. This patch
makes vDPA run without backporting the full IOMMUFD stack in production
environments. I think it's worth it.

If you worry about the extension, we can just use the vhost iotlb
existing facility to do this.

Thanks

>
> > ---
> >  drivers/vhost/vdpa.c | 59 +---
> >  1 file changed, 56 insertions(+), 3 deletions(-)
> >
> > diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
> > index ba52d128aeb7..28b56b10372b 100644
> > --- a/drivers/vhost/vdpa.c
> > +++ b/drivers/vhost/vdpa.c
> > @@ -49,6 +49,7 @@ struct vhost_vdpa {
> >   struct completion completion;
> >   struct vdpa_device *vdpa;
> >   struct hlist_head as[VHOST_VDPA_IOTLB_BUCKETS];
> > + struct vhost_iotlb resv_iotlb;
> >   struct device dev;
> >   struct cdev cdev;
> >   atomic_t opened;
> > @@ -247,6 +248,7 @@ static int _compat_vdpa_reset(struct vhost_vdpa *v)
> >  static int vhost_vdpa_reset(struct vhost_vdpa *v)
> >  {
> >   v->in_batch = 0;
> > + vhost_iotlb_reset(&v->resv_iotlb);
> >   return _compat_vdpa_reset(v);
> >  }
> >
> > @@ -1219,10 +1221,15 @@ static int vhost_vdpa_process_iotlb_update(struct vhost_vdpa *v,
> >   msg->iova + msg->size - 1 > v->range.last)
> >   return -EINVAL;
> >
> > + if (vhost_iotlb_itree_first(&v->resv_iotlb, msg->iova,
> > + msg->iova + msg->size - 1))
> > + return -EINVAL;
> > +
> >   if (vhost_iotlb_itree_first(iotlb, msg->iova,
> >   msg->iova + msg->size - 1))
> >   return -EEXIST;
> >
> > +
> >   if (vdpa->use_va)
> >   return vhost_vdpa_va_map(v, iotlb, msg->iova, msg->size,
> >msg->uaddr, msg->perm);
> > @@ -1307,6 +1314,45 @@ static ssize_t vhost_vdpa_chr_write_iter(struct kiocb *iocb,
> >   return vhost_chr_write_iter(dev, from);
> >  }
> >
> > +static int vhost_vdpa_resv_iommu_region(struct iommu_domain *domain, struct device *dma_dev,
> > + struct vhost_iotlb *resv_iotlb)
> > +{
> > + struct list_head dev_resv_regions;
> > + phys_addr_t resv_msi_base = 0;
> > + struct iommu_resv_region *region;
> > + int ret = 0;
> > + bool with_sw_msi = false;
> > + bool with_hw_msi = false;
> > +
> > + INIT_LIST_HEAD(&dev_resv_regions);
> > + iommu_get_resv_regions(dma_dev, &dev_resv_regions);
> > +
> > + list_for_each_entry(region, &dev_resv_regions, list) {
> > + ret = vhost_iotlb_add_range_ctx(resv_iotlb, region->start,
> > + region->start + region->length - 1,
> > + 0, 0, NULL);
> > + if (ret) {
> > + vhost_iotlb_reset(resv_iotlb);
> > + break;
> > + }
> > +
> > + if (region->type == IOMMU_RESV_MSI)
> > + with_hw_msi = true;
> > +
> > + if (region->type == IOMMU_RESV_SW_MSI) {
> > + resv_msi_base = region->start;
> > + with_sw_msi = true;
> > + }
> > + }
> > +
> > + if (!ret && !with_hw_msi && with_sw_msi)
> > + ret = iommu_get_msi_cookie(domain, resv_msi_base);
> > +
> > + iommu_put_resv_regions(dma_dev, &dev_resv_regions);
> > +
> > + return ret;
> > +}
> > +
> >  static int vhost_vdpa_alloc_domain(struct vhost_vdpa *v)
> >  {
> >   struct vdpa_device *vdpa = v->vdpa;
> > @@ -1335,11 +1381,16 @@ static int vhost_vdpa_alloc_domain(struct vhost_vdpa *v)
> >
> >   ret = iommu_attach_device(v->domain, dma_dev);
> >   if (ret)
> > - goto err_attach;
> > + goto err_alloc_domain;
> >
> > - return 0;
> > + ret = vhost_vdpa_resv_iommu_region(v->domain, dma_dev, &v->resv_iotlb);
> > + if (ret)
> > + goto err_attach_device;
> >
> > -err_attach:
> > + return 0;
> > +err_attach_device:
> > + iommu_detach_device(v->domain, 

[PATCH] trace/sched: add tgid for sched_wakeup_template

2024-03-27 Thread Tio Zhang
By doing this, we are able to filter tasks by tgid while tracing
wakeup events with eBPF or other methods.

For example, when we care about tracing a user space process (which has
an uncertain number of LWPs, i.e., pids) to monitor its wakeup latency,
without tgid available in the sched_wakeup tracepoints we would struggle
to find out all the pids to trace. Alternatively, we could use kprobes to
achieve tgid tracing, but that is less accurate and much less efficient
than using a tracepoint.
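
As a hedged illustration (not part of this patch): once "tgid" is a field
of sched_wakeup, user space can filter by process without enumerating its
pids. The tracefs path and the tgid value 1234 below are assumptions.

	#include <fcntl.h>
	#include <string.h>
	#include <unistd.h>

	int main(void)
	{
		/* Set an event filter keyed on the new tgid field. */
		const char *filter = "tgid == 1234";
		int fd = open("/sys/kernel/tracing/events/sched/"
			      "sched_wakeup/filter", O_WRONLY);

		if (fd < 0)
			return 1;
		if (write(fd, filter, strlen(filter)) < 0) {
			close(fd);
			return 1;
		}
		close(fd);
		return 0;
	}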

Signed-off-by: Tio Zhang 
Signed-off-by: Dylane Chen 
---
 include/trace/events/sched.h | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
index dbb01b4b7451..ea7e525649e5 100644
--- a/include/trace/events/sched.h
+++ b/include/trace/events/sched.h
@@ -149,6 +149,7 @@ DECLARE_EVENT_CLASS(sched_wakeup_template,
__field(pid_t,  pid )
__field(int,prio)
__field(int,target_cpu  )
+   __field(pid_t,  tgid)
),
 
TP_fast_assign(
@@ -156,11 +157,12 @@ DECLARE_EVENT_CLASS(sched_wakeup_template,
__entry->pid= p->pid;
__entry->prio   = p->prio; /* XXX SCHED_DEADLINE */
__entry->target_cpu = task_cpu(p);
+   __entry->tgid   = p->tgid;
),
 
-   TP_printk("comm=%s pid=%d prio=%d target_cpu=%03d",
+   TP_printk("comm=%s pid=%d prio=%d target_cpu=%03d tgid=%d",
  __entry->comm, __entry->pid, __entry->prio,
- __entry->target_cpu)
+ __entry->target_cpu, __entry->tgid)
 );
 
 /*
-- 
2.17.1




Re: [PATCH v2 1/2] vhost: Add smp_rmb() in vhost_vq_avail_empty()

2024-03-27 Thread Jason Wang
On Wed, Mar 27, 2024 at 3:35 PM Gavin Shan  wrote:
>
> On 3/27/24 14:08, Gavin Shan wrote:
> > On 3/27/24 12:44, Jason Wang wrote:
> >> On Wed, Mar 27, 2024 at 10:34 AM Jason Wang  wrote:
> >>> On Wed, Mar 27, 2024 at 7:39 AM Gavin Shan  wrote:
> 
>  A smp_rmb() has been missed in vhost_vq_avail_empty(), spotted by
>  Will Deacon . Otherwise, it's not ensured the
>  available ring entries pushed by guest can be observed by vhost
>  in time, leading to stale available ring entries fetched by vhost
>  in vhost_get_vq_desc(), as reported by Yihuang Yu on NVidia's
>  grace-hopper (ARM64) platform.
> 
> /home/gavin/sandbox/qemu.main/build/qemu-system-aarch64  \
> -accel kvm -machine virt,gic-version=host -cpu host  \
> -smp maxcpus=1,cpus=1,sockets=1,clusters=1,cores=1,threads=1 \
> -m 4096M,slots=16,maxmem=64G \
> -object memory-backend-ram,id=mem0,size=4096M\
>  :   \
> -netdev tap,id=vnet0,vhost=true  \
> -device virtio-net-pci,bus=pcie.8,netdev=vnet0,mac=52:54:00:f1:26:b0
>  :
> guest# netperf -H 10.26.1.81 -l 60 -C -c -t UDP_STREAM
> virtio_net virtio0: output.0:id 100 is not a head!
> 
>  Add the missed smp_rmb() in vhost_vq_avail_empty(). Note that it
>  should be safe until vq->avail_idx is changed by commit 275bf960ac697
>  ("vhost: better detection of available buffers").
> 
>  Fixes: 275bf960ac697 ("vhost: better detection of available buffers")
>  Cc:  # v4.11+
>  Reported-by: Yihuang Yu 
>  Signed-off-by: Gavin Shan 
>  ---
>    drivers/vhost/vhost.c | 11 ++-
>    1 file changed, 10 insertions(+), 1 deletion(-)
> 
>  diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
>  index 045f666b4f12..00445ab172b3 100644
>  --- a/drivers/vhost/vhost.c
>  +++ b/drivers/vhost/vhost.c
>  @@ -2799,9 +2799,18 @@ bool vhost_vq_avail_empty(struct vhost_dev *dev, struct vhost_virtqueue *vq)
>   r = vhost_get_avail_idx(vq, &avail_idx);
>   if (unlikely(r))
>   return false;
>  +
>   vq->avail_idx = vhost16_to_cpu(vq, avail_idx);
>  +   if (vq->avail_idx != vq->last_avail_idx) {
>  +   /* Similar to what's done in vhost_get_vq_desc(), we need
>  +* to ensure the available ring entries have been exposed
>  +* by guest.
>  +*/
> >>>
> >>> We need to be more verbose here. For example, which load needs to be
> >>> ordered with which load.
> >>>
> >>> The rmb in vhost_get_vq_desc() is used to order the load of avail idx
> >>> and the load of head. It is paired with e.g virtio_wmb() in
> >>> virtqueue_add_split().
> >>>
> >>> vhost_vq_avail_empty() are mostly used as a hint in
> >>> vhost_net_busy_poll() which is under the protection of the vq mutex.
> >>>
> >>> An exception is the tx_can_batch(), but in that case it doesn't even
> >>> want to read the head.
> >>
> >> Ok, if it is needed only in that path, maybe we can move the barriers 
> >> there.
> >>
> >
> > [cc Will Deacon]
> >
> > Jason, appreciate for your review and comments. I think PATCH[1/2] is
> > the fix for the hypothesis, meaning PATCH[2/2] is the real fix. However,
> > it would be nice to fix all of them in one shoot. I will try with PATCH[2/2]
> > only to see if our issue will disappear or not. However, the issue still
> > exists if PATCH[2/2] is missed.
> >
>
> Jason, PATCH[2/2] is sufficient to fix our current issue. I tried with 
> PATCH[2/2]
> only and unable to hit the issue. However, PATCH[1/2] may be needed by other 
> scenarios.
> So it would be nice to fix them in one shoot.

Yes, see below.

>
>
> > Firstly, We were failing on the transmit queue and {tvq, 
> > rvq}->busyloop_timeout
> > == false if I remember correctly. So the added smp_rmb() in 
> > vhost_vq_avail_empty()
> > is only a concern to tx_can_batch(). A mutex isn't enough to ensure the 
> > order
> > for the available index and available ring entry (head). For example, 
> > vhost_vq_avail_empty()
> > called by tx_can_batch() can see next available index, but its corresponding
> > available ring entry (head) may not be seen by vhost yet if smp_rmb() is 
> > missed.
> > The next call to get_tx_bufs(), where the available ring entry (head)
> > hasn't arrived yet, then fetches a stale available ring entry (head).
> >
> >handle_tx_copy
> >  get_tx_bufs // smp_rmb() won't be executed when 
> > vq->avail_idx != vq->last_avail_idx
> >  tx_can_batch
> >vhost_vq_avail_empty  // vq->avail_idx is updated from 
> > vq->avail->idx
> >
> > The reason why I added smp_rmb() to vhost_vq_avail_empty() is because the 
> > function
> > is an exposed API, even though it's 

Re: [PATCH v2 1/2] vhost: Add smp_rmb() in vhost_vq_avail_empty()

2024-03-27 Thread Gavin Shan

On 3/27/24 14:08, Gavin Shan wrote:

On 3/27/24 12:44, Jason Wang wrote:

On Wed, Mar 27, 2024 at 10:34 AM Jason Wang  wrote:

On Wed, Mar 27, 2024 at 7:39 AM Gavin Shan  wrote:


A smp_rmb() has been missed in vhost_vq_avail_empty(), spotted by
Will Deacon . Otherwise, it's not ensured the
available ring entries pushed by guest can be observed by vhost
in time, leading to stale available ring entries fetched by vhost
in vhost_get_vq_desc(), as reported by Yihuang Yu on NVidia's
grace-hopper (ARM64) platform.

   /home/gavin/sandbox/qemu.main/build/qemu-system-aarch64  \
   -accel kvm -machine virt,gic-version=host -cpu host  \
   -smp maxcpus=1,cpus=1,sockets=1,clusters=1,cores=1,threads=1 \
   -m 4096M,slots=16,maxmem=64G \
   -object memory-backend-ram,id=mem0,size=4096M    \
    :   \
   -netdev tap,id=vnet0,vhost=true  \
   -device virtio-net-pci,bus=pcie.8,netdev=vnet0,mac=52:54:00:f1:26:b0
    :
   guest# netperf -H 10.26.1.81 -l 60 -C -c -t UDP_STREAM
   virtio_net virtio0: output.0:id 100 is not a head!

Add the missed smp_rmb() in vhost_vq_avail_empty(). Note that it
should be safe until vq->avail_idx is changed by commit 275bf960ac697
("vhost: better detection of available buffers").

Fixes: 275bf960ac697 ("vhost: better detection of available buffers")
Cc:  # v4.11+
Reported-by: Yihuang Yu 
Signed-off-by: Gavin Shan 
---
  drivers/vhost/vhost.c | 11 ++-
  1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 045f666b4f12..00445ab172b3 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -2799,9 +2799,18 @@ bool vhost_vq_avail_empty(struct vhost_dev *dev, struct vhost_virtqueue *vq)
 r = vhost_get_avail_idx(vq, &avail_idx);
 if (unlikely(r))
 return false;
+
 vq->avail_idx = vhost16_to_cpu(vq, avail_idx);
+   if (vq->avail_idx != vq->last_avail_idx) {
+   /* Similar to what's done in vhost_get_vq_desc(), we need
+    * to ensure the available ring entries have been exposed
+    * by guest.
+    */


We need to be more verbose here. For example, which load needs to be
ordered with which load.

The rmb in vhost_get_vq_desc() is used to order the load of avail idx
and the load of head. It is paired with e.g virtio_wmb() in
virtqueue_add_split().

vhost_vq_avail_empty() are mostly used as a hint in
vhost_net_busy_poll() which is under the protection of the vq mutex.

An exception is the tx_can_batch(), but in that case it doesn't even
want to read the head.


Ok, if it is needed only in that path, maybe we can move the barriers there.



[cc Will Deacon]

Jason, appreciate for your review and comments. I think PATCH[1/2] is
the fix for the hypothesis, meaning PATCH[2/2] is the real fix. However,
it would be nice to fix all of them in one shoot. I will try with PATCH[2/2]
only to see if our issue will disappear or not. However, the issue still
exists if PATCH[2/2] is missed.



Jason, PATCH[2/2] is sufficient to fix our current issue. I tried with 
PATCH[2/2]
only and unable to hit the issue. However, PATCH[1/2] may be needed by other 
scenarios.
So it would be nice to fix them in one shoot.



Firstly, we were failing on the transmit queue and {tvq, rvq}->busyloop_timeout
== false, if I remember correctly. So the added smp_rmb() in
vhost_vq_avail_empty() is only a concern for tx_can_batch(). A mutex isn't
enough to ensure the ordering between the available index and the available
ring entry (head). For example, vhost_vq_avail_empty() called by
tx_can_batch() can see the next available index, but its corresponding
available ring entry (head) may not be seen by vhost yet if smp_rmb() is
missed. The next call to get_tx_bufs(), where the available ring entry (head)
hasn't arrived yet, then fetches a stale available ring entry (head).

   handle_tx_copy
     get_tx_bufs // smp_rmb() won't be executed when vq->avail_idx 
!= vq->last_avail_idx
     tx_can_batch
   vhost_vq_avail_empty  // vq->avail_idx is updated from vq->avail->idx

The reason why I added smp_rmb() to vhost_vq_avail_empty() is that the
function is an exposed API, even though it's only used by drivers/vhost/net.c
at present. It means the API has been broken internally. So it seems more
appropriate to fix it up in vhost_vq_avail_empty() so that the API's users
needn't worry about the memory access order.
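
To make the required ordering concrete, a sketch (informal names; "mask"
stands in for ring-size minus 1):

	/*
	 *   guest (producer)                 vhost (consumer)
	 *   ----------------                 ----------------
	 *   avail->ring[idx & mask] = head;  avail_idx = avail->idx;
	 *   virtio_wmb();                    smp_rmb();  <-- the barrier at issue
	 *   avail->idx = idx + 1;            head = avail->ring[last & mask];
	 *
	 * Without the smp_rmb() on the consumer side, vhost can observe the
	 * new avail->idx but still load a stale ring entry, which is exactly
	 * the "id ... is not a head!" failure reported above.
	 */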





+   smp_rmb();
+   return false;
+   }

-   return vq->avail_idx == vq->last_avail_idx;
+   return true;
  }
  EXPORT_SYMBOL_GPL(vhost_vq_avail_empty);


Thanks,
Gavin




[PATCH v6 2/2] memory tier: create CPUless memory tiers after obtaining HMAT info

2024-03-27 Thread Ho-Ren (Jack) Chuang
The current implementation treats emulated memory devices, such as
CXL1.1 type3 memory, as normal DRAM when they are emulated as normal memory
(E820_TYPE_RAM). However, these emulated devices have different
characteristics than traditional DRAM, making it important to
distinguish them. Thus, we modify the tiered memory initialization process
to introduce a delay specifically for CPUless NUMA nodes. This delay
ensures that the memory tier initialization for these nodes is deferred
until HMAT information is obtained during the boot process. Finally,
demotion tables are recalculated at the end.

* late_initcall(memory_tier_late_init);
Some device drivers may have initialized memory tiers between
`memory_tier_init()` and `memory_tier_late_init()`, potentially bringing
online memory nodes and configuring memory tiers. They should be excluded
in the late init (see the sketch after this list).

* Handle cases where there is no HMAT when creating memory tiers
There is a scenario where a CPUless node does not provide HMAT information.
If no HMAT is specified, it falls back to using the default DRAM tier.

* Introduce another new lock `default_dram_perf_lock` for adist calculation
In the current implementation, iterating through CPUless nodes requires
holding the `memory_tier_lock`. However, `mt_calc_adistance()` will end up
trying to acquire the same lock, leading to a potential deadlock.
Therefore, we propose introducing a standalone `default_dram_perf_lock` to
protect `default_dram_perf_*`. This approach not only avoids deadlock
but also prevents holding a large lock simultaneously.

* Upgrade `set_node_memory_tier` to support additional cases, including
  default DRAM, late CPUless, and hot-plugged initializations.
To cover hot-plugged memory nodes, `mt_calc_adistance()` and
`mt_find_alloc_memory_type()` are moved into `set_node_memory_tier()` to
handle cases where memtype is not initialized and where HMAT information is
available.

* Introduce `default_memory_types` for those memory types that are not
  initialized by device drivers.
Because late initialized memory and default DRAM memory need to be managed,
a default memory type is created for storing all memory types that are
not initialized by device drivers and as a fallback.
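
A rough sketch of the deferred path described in the first bullet above
(simplified; error handling is elided and the loop body is an assumption
reconstructed from this description):

	/* Defer tier creation for nodes that have no driver-provided
	 * memtype by boot time, then recompute demotion targets once.
	 */
	static int __init memory_tier_late_init(void)
	{
		int nid;

		mutex_lock(&memory_tier_lock);
		for_each_node_state(nid, N_MEMORY) {
			/* Skip nodes a device driver already initialized
			 * between memory_tier_init() and here.
			 */
			if (node_memory_types[nid].memtype)
				continue;
			set_node_memory_tier(nid);
		}
		establish_demotion_targets();
		mutex_unlock(&memory_tier_lock);

		return 0;
	}
	late_initcall(memory_tier_late_init);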

Signed-off-by: Ho-Ren (Jack) Chuang 
Signed-off-by: Hao Xiang 
---
 mm/memory-tiers.c | 94 +++
 1 file changed, 78 insertions(+), 16 deletions(-)

diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
index 974af10cfdd8..e24fc3bebae4 100644
--- a/mm/memory-tiers.c
+++ b/mm/memory-tiers.c
@@ -36,6 +36,11 @@ struct node_memory_type_map {
 
 static DEFINE_MUTEX(memory_tier_lock);
 static LIST_HEAD(memory_tiers);
+/*
+ * The list is used to store all memory types that are not created
+ * by a device driver.
+ */
+static LIST_HEAD(default_memory_types);
 static struct node_memory_type_map node_memory_types[MAX_NUMNODES];
 struct memory_dev_type *default_dram_type;
 
@@ -108,6 +113,8 @@ static struct demotion_nodes *node_demotion __read_mostly;
 
 static BLOCKING_NOTIFIER_HEAD(mt_adistance_algorithms);
 
+/* The lock is used to protect `default_dram_perf*` info and nid. */
+static DEFINE_MUTEX(default_dram_perf_lock);
 static bool default_dram_perf_error;
 static struct access_coordinate default_dram_perf;
 static int default_dram_perf_ref_nid = NUMA_NO_NODE;
@@ -505,7 +512,8 @@ static inline void __init_node_memory_type(int node, struct memory_dev_type *mem
 static struct memory_tier *set_node_memory_tier(int node)
 {
struct memory_tier *memtier;
-   struct memory_dev_type *memtype;
+   struct memory_dev_type *mtype = default_dram_type;
+   int adist = MEMTIER_ADISTANCE_DRAM;
pg_data_t *pgdat = NODE_DATA(node);
 
 
@@ -514,11 +522,20 @@ static struct memory_tier *set_node_memory_tier(int node)
if (!node_state(node, N_MEMORY))
return ERR_PTR(-EINVAL);
 
-   __init_node_memory_type(node, default_dram_type);
+   mt_calc_adistance(node, &adist);
+   if (node_memory_types[node].memtype == NULL) {
+   mtype = mt_find_alloc_memory_type(adist, &default_memory_types);
+   if (IS_ERR(mtype)) {
+   mtype = default_dram_type;
+   pr_info("Failed to allocate a memory type. Fall back.\n");
+   }
+   }
+
+   __init_node_memory_type(node, mtype);
 
-   memtype = node_memory_types[node].memtype;
-   node_set(node, memtype->nodes);
-   memtier = find_create_memory_tier(memtype);
+   mtype = node_memory_types[node].memtype;
+   node_set(node, mtype->nodes);
+   memtier = find_create_memory_tier(mtype);
if (!IS_ERR(memtier))
rcu_assign_pointer(pgdat->memtier, memtier);
return memtier;
@@ -655,6 +672,34 @@ void mt_put_memory_types(struct list_head *memory_types)
 }
 EXPORT_SYMBOL_GPL(mt_put_memory_types);
 
+/*
+ * This is invoked via `late_initcall()` to initialize memory tiers for
+ * CPU-less memory nodes 

[PATCH v6 1/2] memory tier: dax/kmem: introduce an abstract layer for finding, allocating, and putting memory types

2024-03-27 Thread Ho-Ren (Jack) Chuang
Since different memory devices require finding, allocating, and putting
memory types, these common steps are abstracted in this patch,
enhancing the scalability and conciseness of the code.
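
As a hedged illustration, a hypothetical driver adopting the new helpers
could look like the sketch below; the "foo" names and the private lock
are invented for the example, mirroring the dax/kmem conversion in this
patch:

	static DEFINE_MUTEX(foo_memory_type_lock);
	static LIST_HEAD(foo_memory_types);

	static struct memory_dev_type *foo_find_alloc_memory_type(int adist)
	{
		struct memory_dev_type *mtype;

		mutex_lock(&foo_memory_type_lock);
		mtype = mt_find_alloc_memory_type(adist, &foo_memory_types);
		mutex_unlock(&foo_memory_type_lock);

		return mtype;
	}

	static void foo_put_memory_types(void)
	{
		mutex_lock(&foo_memory_type_lock);
		mt_put_memory_types(&foo_memory_types);
		mutex_unlock(&foo_memory_type_lock);
	}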

Signed-off-by: Ho-Ren (Jack) Chuang 
---
 drivers/dax/kmem.c   | 20 ++--
 include/linux/memory-tiers.h | 13 +
 mm/memory-tiers.c| 32 
 3 files changed, 47 insertions(+), 18 deletions(-)

diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c
index 42ee360cf4e3..01399e5b53b2 100644
--- a/drivers/dax/kmem.c
+++ b/drivers/dax/kmem.c
@@ -55,21 +55,10 @@ static LIST_HEAD(kmem_memory_types);
 
 static struct memory_dev_type *kmem_find_alloc_memory_type(int adist)
 {
-   bool found = false;
struct memory_dev_type *mtype;
 
mutex_lock(&kmem_memory_type_lock);
-   list_for_each_entry(mtype, &kmem_memory_types, list) {
-   if (mtype->adistance == adist) {
-   found = true;
-   break;
-   }
-   }
-   if (!found) {
-   mtype = alloc_memory_type(adist);
-   if (!IS_ERR(mtype))
-   list_add(&mtype->list, &kmem_memory_types);
-   }
+   mtype = mt_find_alloc_memory_type(adist, &kmem_memory_types);
mutex_unlock(&kmem_memory_type_lock);
 
return mtype;
@@ -77,13 +66,8 @@ static struct memory_dev_type *kmem_find_alloc_memory_type(int adist)
 
 static void kmem_put_memory_types(void)
 {
-   struct memory_dev_type *mtype, *mtn;
-
mutex_lock(&kmem_memory_type_lock);
-   list_for_each_entry_safe(mtype, mtn, &kmem_memory_types, list) {
-   list_del(&mtype->list);
-   put_memory_type(mtype);
-   }
+   mt_put_memory_types(&kmem_memory_types);
mutex_unlock(&kmem_memory_type_lock);
 }
 
diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
index 69e781900082..a44c03c2ba3a 100644
--- a/include/linux/memory-tiers.h
+++ b/include/linux/memory-tiers.h
@@ -48,6 +48,9 @@ int mt_calc_adistance(int node, int *adist);
 int mt_set_default_dram_perf(int nid, struct access_coordinate *perf,
 const char *source);
 int mt_perf_to_adistance(struct access_coordinate *perf, int *adist);
+struct memory_dev_type *mt_find_alloc_memory_type(int adist,
+   struct list_head *memory_types);
+void mt_put_memory_types(struct list_head *memory_types);
 #ifdef CONFIG_MIGRATION
 int next_demotion_node(int node);
 void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *targets);
@@ -136,5 +139,15 @@ static inline int mt_perf_to_adistance(struct access_coordinate *perf, int *adis
 {
return -EIO;
 }
+
+struct memory_dev_type *mt_find_alloc_memory_type(int adist, struct list_head *memory_types)
+{
+   return NULL;
+}
+
+void mt_put_memory_types(struct list_head *memory_types)
+{
+
+}
 #endif /* CONFIG_NUMA */
 #endif  /* _LINUX_MEMORY_TIERS_H */
diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
index 0537664620e5..974af10cfdd8 100644
--- a/mm/memory-tiers.c
+++ b/mm/memory-tiers.c
@@ -623,6 +623,38 @@ void clear_node_memory_type(int node, struct memory_dev_type *memtype)
 }
 EXPORT_SYMBOL_GPL(clear_node_memory_type);
 
+struct memory_dev_type *mt_find_alloc_memory_type(int adist, struct list_head *memory_types)
+{
+   bool found = false;
+   struct memory_dev_type *mtype;
+
+   list_for_each_entry(mtype, memory_types, list) {
+   if (mtype->adistance == adist) {
+   found = true;
+   break;
+   }
+   }
+   if (!found) {
+   mtype = alloc_memory_type(adist);
+   if (!IS_ERR(mtype))
+   list_add(&mtype->list, memory_types);
+   }
+
+   return mtype;
+}
+EXPORT_SYMBOL_GPL(mt_find_alloc_memory_type);
+
+void mt_put_memory_types(struct list_head *memory_types)
+{
+   struct memory_dev_type *mtype, *mtn;
+
+   list_for_each_entry_safe(mtype, mtn, memory_types, list) {
+   list_del(&mtype->list);
+   put_memory_type(mtype);
+   }
+}
+EXPORT_SYMBOL_GPL(mt_put_memory_types);
+
 static void dump_hmem_attrs(struct access_coordinate *coord, const char *prefix)
 {
pr_info(
-- 
Ho-Ren (Jack) Chuang




[PATCH v6 0/2] Improved Memory Tier Creation for CPUless NUMA Nodes

2024-03-27 Thread Ho-Ren (Jack) Chuang
When a memory device, such as CXL1.1 type3 memory, is emulated as
normal memory (E820_TYPE_RAM), the memory device is indistinguishable
from normal DRAM in terms of memory tiering with the current implementation.
The current memory tiering assigns all detected normal memory nodes
to the same DRAM tier. This results in normal memory devices with
different attributions being unable to be assigned to the correct memory tier,
leading to the inability to migrate pages between different types of memory.
https://lore.kernel.org/linux-mm/ph0pr08mb7955e9f08ccb64f23963b5c3a8...@ph0pr08mb7955.namprd08.prod.outlook.com/T/

This patchset automatically resolves the issues. It delays the initialization
of memory tiers for CPUless NUMA nodes until they obtain HMAT information
and after all devices are initialized at boot time, eliminating the need
for user intervention. If no HMAT is specified, it falls back to
using `default_dram_type`.

Example usecase:
We have CXL memory on the host, and we create VMs with a new system memory
device backed by host CXL memory. We inject CXL memory performance attributes
through QEMU, and the guest now sees memory nodes with performance attributes
in HMAT. With this change, we enable the guest kernel to construct
the correct memory tiering for the memory nodes.
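
For illustration, a QEMU invocation along these lines (the values, and to
some extent the option spellings, are assumptions; see QEMU's NUMA/HMAT
documentation) presents a CPUless node 1 with HMAT attributes:

  qemu-system-x86_64 -machine hmat=on \
    -m 2G -smp 2,sockets=2,maxcpus=2 \
    -object memory-backend-ram,size=1G,id=m0 \
    -object memory-backend-ram,size=1G,id=m1 \
    -numa node,nodeid=0,memdev=m0 \
    -numa node,nodeid=1,memdev=m1,initiator=0 \
    -numa cpu,node-id=0,socket-id=0 \
    -numa cpu,node-id=0,socket-id=1 \
    -numa hmat-lb,initiator=0,target=0,hierarchy=memory,data-type=access-latency,latency=5 \
    -numa hmat-lb,initiator=0,target=0,hierarchy=memory,data-type=access-bandwidth,bandwidth=200M \
    -numa hmat-lb,initiator=0,target=1,hierarchy=memory,data-type=access-latency,latency=10 \
    -numa hmat-lb,initiator=0,target=1,hierarchy=memory,data-type=access-bandwidth,bandwidth=100M

Node 1 then has memory but no CPUs, and the guest kernel defers its tier
assignment until the HMAT values above have been parsed.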

-v6:
 Thanks to Ying's comments,
 * Move `default_dram_perf_lock` to the function's beginning for clarity
 * Fix double unlocking at v5
-v5:
 Thanks to Ying's comments,
 * Add comments about what is protected by `default_dram_perf_lock`
 * Fix an uninitialized pointer mtype
 * Slightly shorten the time holding `default_dram_perf_lock`
 * Fix a deadlock bug in `mt_perf_to_adistance`
 * 
https://lore.kernel.org/lkml/20240327041646.3258110-1-horenchu...@bytedance.com/T/#u
-v4:
 Thanks to Ying's comments,
 * Remove redundant code
 * Reorganize patches accordingly
 * 
https://lore.kernel.org/lkml/20240322070356.315922-1-horenchu...@bytedance.com/T/#u
-v3:
 Thanks to Ying's comments,
 * Make the newly added code independent of HMAT
 * Upgrade set_node_memory_tier to support more cases
 * Put all non-driver-initialized memory types into default_memory_types
   instead of using hmat_memory_types
 * find_alloc_memory_type -> mt_find_alloc_memory_type
 * 
https://lore.kernel.org/lkml/20240320061041.3246828-1-horenchu...@bytedance.com/T/#u
-v2:
 Thanks to Ying's comments,
 * Rewrite cover letter & patch description
 * Rename functions, don't use _hmat
 * Abstract common functions into find_alloc_memory_type()
 * Use the expected way to use set_node_memory_tier instead of modifying it
 * 
https://lore.kernel.org/lkml/20240312061729.1997111-1-horenchu...@bytedance.com/T/#u
-v1:
 * 
https://lore.kernel.org/lkml/20240301082248.3456086-1-horenchu...@bytedance.com/T/#u

Ho-Ren (Jack) Chuang (2):
  memory tier: dax/kmem: introduce an abstract layer for finding,
allocating, and putting memory types
  memory tier: create CPUless memory tiers after obtaining HMAT info

 drivers/dax/kmem.c   |  20 +-
 include/linux/memory-tiers.h |  13 
 mm/memory-tiers.c| 126 ++-
 3 files changed, 125 insertions(+), 34 deletions(-)

-- 
Ho-Ren (Jack) Chuang




Re: [PATCH net-next v4 2/2] net: udp: add IP/port data to the tracepoint udp/udp_fail_queue_rcv_skb

2024-03-27 Thread Jason Xing
On Wed, Mar 27, 2024 at 2:05 AM Balazs Scheidler  wrote:
>
> The udp_fail_queue_rcv_skb() tracepoint lacks any details on the source
> and destination IP/port whereas this information can be critical in case
> of UDP/syslog.
>
> Signed-off-by: Balazs Scheidler 

Looks good to me, thanks!

Reviewed-by: Jason Xing 



Re: [External] Re: [PATCH v5 2/2] memory tier: create CPUless memory tiers after obtaining HMAT info

2024-03-27 Thread Ho-Ren (Jack) Chuang
On Tue, Mar 26, 2024 at 10:52 PM Huang, Ying  wrote:
>
> "Ho-Ren (Jack) Chuang"  writes:
>
> > The current implementation treats emulated memory devices, such as
> > CXL1.1 type3 memory, as normal DRAM when they are emulated as normal memory
> > (E820_TYPE_RAM). However, these emulated devices have different
> > characteristics than traditional DRAM, making it important to
> > distinguish them. Thus, we modify the tiered memory initialization process
> > to introduce a delay specifically for CPUless NUMA nodes. This delay
> > ensures that the memory tier initialization for these nodes is deferred
> > until HMAT information is obtained during the boot process. Finally,
> > demotion tables are recalculated at the end.
> >
> > * late_initcall(memory_tier_late_init);
> > Some device drivers may have initialized memory tiers between
> > `memory_tier_init()` and `memory_tier_late_init()`, potentially bringing
> > online memory nodes and configuring memory tiers. They should be excluded
> > in the late init.
> >
> > * Handle cases where there is no HMAT when creating memory tiers
> > There is a scenario where a CPUless node does not provide HMAT information.
> > If no HMAT is specified, it falls back to using the default DRAM tier.
> >
> > * Introduce another new lock `default_dram_perf_lock` for adist calculation
> > In the current implementation, iterating through CPUless nodes requires
> > holding the `memory_tier_lock`. However, `mt_calc_adistance()` will end up
> > trying to acquire the same lock, leading to a potential deadlock.
> > Therefore, we propose introducing a standalone `default_dram_perf_lock` to
> > protect `default_dram_perf_*`. This approach not only avoids deadlock
> > but also prevents holding a large lock simultaneously. Besides, this patch
> > slightly shortens the time holding the lock by putting the lock closer to
> > what it protects as well.
> >
> > * Upgrade `set_node_memory_tier` to support additional cases, including
> >   default DRAM, late CPUless, and hot-plugged initializations.
> > To cover hot-plugged memory nodes, `mt_calc_adistance()` and
> > `mt_find_alloc_memory_type()` are moved into `set_node_memory_tier()` to
> > handle cases where memtype is not initialized and where HMAT information is
> > available.
> >
> > * Introduce `default_memory_types` for those memory types that are not
> >   initialized by device drivers.
> > Because late initialized memory and default DRAM memory need to be managed,
> > a default memory type is created for storing all memory types that are
> > not initialized by device drivers and as a fallback.
> >
> > * Fix a deadlock bug in `mt_perf_to_adistance`
> > Because an error path was not handled properly in `mt_perf_to_adistance`,
> > unlock before returning the error.
> >
> > Signed-off-by: Ho-Ren (Jack) Chuang 
> > Signed-off-by: Hao Xiang 
> > ---
> >  mm/memory-tiers.c | 85 +++
> >  1 file changed, 72 insertions(+), 13 deletions(-)
> >
> > diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
> > index 974af10cfdd8..610db9581ba4 100644
> > --- a/mm/memory-tiers.c
> > +++ b/mm/memory-tiers.c
> > @@ -36,6 +36,11 @@ struct node_memory_type_map {
> >
> >  static DEFINE_MUTEX(memory_tier_lock);
> >  static LIST_HEAD(memory_tiers);
> > +/*
> > + * The list is used to store all memory types that are not created
> > + * by a device driver.
> > + */
> > +static LIST_HEAD(default_memory_types);
> >  static struct node_memory_type_map node_memory_types[MAX_NUMNODES];
> >  struct memory_dev_type *default_dram_type;
> >
> > @@ -108,6 +113,8 @@ static struct demotion_nodes *node_demotion __read_mostly;
> >
> >  static BLOCKING_NOTIFIER_HEAD(mt_adistance_algorithms);
> >
> > +/* The lock is used to protect `default_dram_perf*` info and nid. */
> > +static DEFINE_MUTEX(default_dram_perf_lock);
> >  static bool default_dram_perf_error;
> >  static struct access_coordinate default_dram_perf;
> >  static int default_dram_perf_ref_nid = NUMA_NO_NODE;
> > @@ -505,7 +512,8 @@ static inline void __init_node_memory_type(int node, struct memory_dev_type *mem
> >  static struct memory_tier *set_node_memory_tier(int node)
> >  {
> >   struct memory_tier *memtier;
> > - struct memory_dev_type *memtype;
> > + struct memory_dev_type *mtype = default_dram_type;
> > + int adist = MEMTIER_ADISTANCE_DRAM;
> >   pg_data_t *pgdat = NODE_DATA(node);
> >
> >
> > @@ -514,11 +522,20 @@ static struct memory_tier *set_node_memory_tier(int node)
> >   if (!node_state(node, N_MEMORY))
> >   return ERR_PTR(-EINVAL);
> >
> > - __init_node_memory_type(node, default_dram_type);
> > + mt_calc_adistance(node, &adist);
> > + if (node_memory_types[node].memtype == NULL) {
> > + mtype = mt_find_alloc_memory_type(adist, &default_memory_types);
> > + if (IS_ERR(mtype)) {
> > + mtype = default_dram_type;
> > + pr_info("Failed to allocate a 

Re: [PATCH v4 02/14] mm: Switch mm->get_unmapped_area() to a flag

2024-03-27 Thread Dan Williams
Rick Edgecombe wrote:
> The mm_struct contains a function pointer *get_unmapped_area(), which
> is set to either arch_get_unmapped_area() or
> arch_get_unmapped_area_topdown() during the initialization of the mm.
> 
> Since the function pointer only ever points to two functions that are named
> the same across all arch's, a function pointer is not really required. In
> addition future changes will want to add versions of the functions that
> take additional arguments. So to save a pointers worth of bytes in
> mm_struct, and prevent adding additional function pointers to mm_struct in
> future changes, remove it and keep the information about which
> get_unmapped_area() to use in a flag.
> 
> Add the new flag to MMF_INIT_MASK so it doesn't get clobbered on fork by
> mmf_init_flags(). Most MM flags get clobbered on fork. In the pre-existing
> behavior mm->get_unmapped_area() would get copied to the new mm in
> dup_mm(), so not clobbering the flag preserves the existing behavior
> around inheriting the topdown-ness.
> 
> Introduce a helper, mm_get_unmapped_area(), to easily convert code that
> refers to the old function pointer to instead select and call either
> arch_get_unmapped_area() or arch_get_unmapped_area_topdown() based on the
> flag. Then drop the mm->get_unmapped_area() function pointer. Leave the
> get_unmapped_area() pointer in struct file_operations alone. The main
> purpose of this change is to reorganize in preparation for future changes,
> but it also converts the calls of mm->get_unmapped_area() from indirect
> branches into a direct ones.
> 
> The stress-ng bigheap benchmark calls realloc a lot, which calls through
> get_unmapped_area() in the kernel. On x86, the change yielded a ~1%
> improvement there on a retpoline config.
> 
> In testing a few x86 configs, removing the pointer unfortunately didn't
> result in any actual size reductions in the compiled layout of mm_struct.
> But depending on compiler or arch alignment requirements, the change could
> shrink the size of mm_struct.
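
A minimal sketch of the helper described above, reconstructed from the
commit description (the real hunk appears further down in the patch, so
treat this as an approximation):

	unsigned long
	mm_get_unmapped_area(struct mm_struct *mm, struct file *filp,
			     unsigned long addr, unsigned long len,
			     unsigned long pgoff, unsigned long flags)
	{
		if (test_bit(MMF_TOPDOWN, &mm->flags))
			return arch_get_unmapped_area_topdown(filp, addr,
							      len, pgoff, flags);
		return arch_get_unmapped_area(filp, addr, len, pgoff, flags);
	}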
> 
> Signed-off-by: Rick Edgecombe 
> Acked-by: Dave Hansen 
> Acked-by: Liam R. Howlett 
> Reviewed-by: Kirill A. Shutemov 
> Cc: linux-s...@vger.kernel.org
> Cc: linux-kernel@vger.kernel.org
> Cc: sparcli...@vger.kernel.org
> Cc: linux-...@vger.kernel.org
> Cc: nvd...@lists.linux.dev
> Cc: linux-...@vger.kernel.org
> Cc: linux...@kvack.org
> Cc: linux-fsde...@vger.kernel.org
> Cc: io-ur...@vger.kernel.org
> Cc: b...@vger.kernel.org
> ---
> v4:
>  - Split out pde_get_unmapped_area() refactor into separate patch
>(Christophe Leroy)
> 
> v3:
>  - Fix comment that still referred to mm->get_unmapped_area()
>  - Resolve trivial rebase conflicts with "mm: thp_get_unmapped_area must
>honour topdown preference"
>  - Spelling fix in log
> 
> v2:
>  - Fix comment on MMF_TOPDOWN (Kirill, rppt)
>  - Move MMF_TOPDOWN to actually unused bit
>  - Add MMF_TOPDOWN to MMF_INIT_MASK so it doesn't get clobbered on fork,
>and result in the children using the search up path.
>  - New lower performance results after above bug fix
>  - Add Reviews and Acks
> ---
>  arch/s390/mm/hugetlbpage.c   |  2 +-
>  arch/s390/mm/mmap.c  |  4 ++--
>  arch/sparc/kernel/sys_sparc_64.c | 15 ++-
>  arch/sparc/mm/hugetlbpage.c  |  2 +-
>  arch/x86/kernel/cpu/sgx/driver.c |  2 +-
>  arch/x86/mm/hugetlbpage.c|  2 +-
>  arch/x86/mm/mmap.c   |  4 ++--
>  drivers/char/mem.c   |  2 +-
>  drivers/dax/device.c |  6 +++---
>  fs/hugetlbfs/inode.c |  4 ++--
>  fs/proc/inode.c  |  3 ++-
>  fs/ramfs/file-mmu.c  |  2 +-
>  include/linux/mm_types.h |  6 +-
>  include/linux/sched/coredump.h   |  5 -
>  include/linux/sched/mm.h |  5 +
>  io_uring/io_uring.c  |  2 +-
>  kernel/bpf/arena.c   |  2 +-
>  kernel/bpf/syscall.c |  2 +-
>  mm/debug.c   |  6 --
>  mm/huge_memory.c |  9 -
>  mm/mmap.c| 21 ++---
>  mm/shmem.c   | 11 +--
>  mm/util.c|  6 +++---
>  23 files changed, 66 insertions(+), 57 deletions(-)
> 
> diff --git a/arch/s390/mm/hugetlbpage.c b/arch/s390/mm/hugetlbpage.c
> index c2e8242bd15d..219d906fe830 100644
> --- a/arch/s390/mm/hugetlbpage.c
> +++ b/arch/s390/mm/hugetlbpage.c
> @@ -328,7 +328,7 @@ unsigned long hugetlb_get_unmapped_area(struct file *file, unsigned long addr,
>   goto check_asce_limit;
>   }
>  
> - if (mm->get_unmapped_area == arch_get_unmapped_area)
> + if (!test_bit(MMF_TOPDOWN, &mm->flags))
>   addr = hugetlb_get_unmapped_area_bottomup(file, addr, len,
>   pgoff, flags);
>   else
> diff --git a/arch/s390/mm/mmap.c b/arch/s390/mm/mmap.c
> index b14fc0887654..6b2e4436ad4a 100644
> --- a/arch/s390/mm/mmap.c
> +++ b/arch/s390/mm/mmap.c
> @@