Re: [PATCH] selftests: bpf: add malloc failures checks in bpf_iter

2023-10-23 Thread Yonghong Song



On 10/23/23 7:59 PM, Yuran Pereira wrote:

Since some malloc calls in bpf_iter can fail,
this patch adds the appropriate failure checks, and ensures that
any previously allocated resource is appropriately destroyed
before the function returns.

Signed-off-by: Yuran Pereira 
---
  tools/testing/selftests/bpf/prog_tests/bpf_iter.c | 10 ++++++++++
  1 file changed, 10 insertions(+)

diff --git a/tools/testing/selftests/bpf/prog_tests/bpf_iter.c 
b/tools/testing/selftests/bpf/prog_tests/bpf_iter.c
index 1f02168103dd..6d47ea9211a4 100644
--- a/tools/testing/selftests/bpf/prog_tests/bpf_iter.c
+++ b/tools/testing/selftests/bpf/prog_tests/bpf_iter.c
@@ -878,6 +878,11 @@ static void test_bpf_percpu_hash_map(void)
  
  	skel->rodata->num_cpus = bpf_num_possible_cpus();

val = malloc(8 * bpf_num_possible_cpus());
+   if (CHECK(!val, "malloc", "memory allocation failed: %s",
+   strerror(errno))) {
+   bpf_iter_bpf_percpu_hash_map__destroy(skel);
+   return;
+   }



Could you use !ASSERT_OK_PTR(...) instead of CHECK? The bpf selftests
prefer the ASSERT_* series of macros over CHECK.
In the above example, printing out strerror is not required;
see some other examples in the same folder.

Also, bpf_iter.c has some other usages of the CHECK macro.
Since you are touching this file, could you convert all
CHECK macros to ASSERT_* macros?

Basically you need two patches:
  patch 1: convert existing CHECK macros to ASSERT_* macros in bpf_iter.c.
   this should not change any functionality except error messages.
  patch 2: your patch with ASSERT_* macros.

You can use the following as your subject line:
  [PATCH bpf-next] selftests/bpf: Add malloc failure checks in bpf_iter
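
A minimal sketch of the suggested conversion, reusing the allocation from
the patch above (not the final patch; ASSERT_OK_PTR() derives its message
from the name argument, so the strerror() output is dropped):

	val = malloc(8 * bpf_num_possible_cpus());
	if (!ASSERT_OK_PTR(val, "malloc")) {
		bpf_iter_bpf_percpu_hash_map__destroy(skel);
		return;
	}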

  
  	err = bpf_iter_bpf_percpu_hash_map__load(skel);

if (!ASSERT_OK_PTR(skel, "bpf_iter_bpf_percpu_hash_map__load"))
@@ -1057,6 +1062,11 @@ static void test_bpf_percpu_array_map(void)
  
  	skel->rodata->num_cpus = bpf_num_possible_cpus();

val = malloc(8 * bpf_num_possible_cpus());
+   if (CHECK(!val, "malloc", "memory allocation failed: %s",
+   strerror(errno))) {
+   bpf_iter_bpf_percpu_hash_map__destroy(skel);
+   return;
+   }
  
  	err = bpf_iter_bpf_percpu_array_map__load(skel);

if (!ASSERT_OK_PTR(skel, "bpf_iter_bpf_percpu_array_map__load"))


Re: [PATCH-cgroup 1/4] workqueue: Add workqueue_unbound_exclude_cpumask() to exclude CPUs from wq_unbound_cpumask

2023-10-23 Thread Tejun Heo
Hello,

On Wed, Oct 18, 2023 at 03:18:52PM -0400, Waiman Long wrote:
> I have a second thought after taking a further look at that. First of all,
> cpuset_allowed_mask isn't relevant here and the mask can certainly contain
> offline CPUs. So cpu_possible_mask is the proper fallback.
> 
> With the current patch, wq_user_unbound_cpumask is set up initially as the
> (HK_TYPE_WQ ∩ HK_TYPE_DOMAIN) housekeeping mask and rewritten by any
> subsequent write to the workqueue/cpumask sysfs file. So using

The current behavior is not something which is carefully planned. It's more
accidental than anything. If we can come up with a more intuitive and
consistent behavior, that should be fine.

> wq_user_unbound_cpumask has the implied precedence of the user-sysfs-written
> mask, the command-line isolcpus/nohz_full option mask, and cpu_possible_mask.
> I think just falling back to wq_user_unbound_cpumask if the operation fails
> should be enough.

But yeah, that sounds acceptable.
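
For illustration, the fallback Waiman describes could look roughly like this
(a hypothetical sketch, not the actual patch; "exclude_cpumask" is an invented
name, and cpumask_andnot() returns false when the result is empty):

	/* Exclude isolated CPUs; fall back to the user mask if nothing is left. */
	if (!cpumask_andnot(wq_unbound_cpumask, wq_user_unbound_cpumask,
			    exclude_cpumask))
		cpumask_copy(wq_unbound_cpumask, wq_user_unbound_cpumask);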

Thanks.

-- 
tejun


Re: [PATCH-cgroup 3/4] cgroup/cpuset: Keep track of CPUs in isolated partitions

2023-10-23 Thread Tejun Heo
Hello, Waiman.

On Wed, Oct 18, 2023 at 02:24:00PM -0400, Waiman Long wrote:
> If you mean saving the exclusion cpumask no matter who the caller is, we can
> add another exclusion cpumask to save it and expose it to sysfs. This should
> be done in the first workqueue patch, not as part of this patch. I expose
> this isolated cpumask for testing purposes, to be checked by the
> test_cpuset_prs.sh script for correctness. As said, I can expose it without
> cgroup_debug if you think the information is useful in general.

I don't really care where the cpumask is in the source tree. I just want all
the workqueue cpumasks presented to the userspace in a single place. Also, I
think it makes sense to publish it to userspace in an easily accessible
manner as what the eventual configuration ends up being can be confusing and
the effect it has on the system subtle.

Thanks.

-- 
tejun


[PATCH] selftests: bpf: add malloc failures checks in bpf_iter

2023-10-23 Thread Yuran Pereira
Since some malloc calls in bpf_iter can fail,
this patch adds the appropriate failure checks, and ensures that
any previously allocated resource is appropriately destroyed
before the function returns.

Signed-off-by: Yuran Pereira 
---
 tools/testing/selftests/bpf/prog_tests/bpf_iter.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/tools/testing/selftests/bpf/prog_tests/bpf_iter.c 
b/tools/testing/selftests/bpf/prog_tests/bpf_iter.c
index 1f02168103dd..6d47ea9211a4 100644
--- a/tools/testing/selftests/bpf/prog_tests/bpf_iter.c
+++ b/tools/testing/selftests/bpf/prog_tests/bpf_iter.c
@@ -878,6 +878,11 @@ static void test_bpf_percpu_hash_map(void)
 
skel->rodata->num_cpus = bpf_num_possible_cpus();
val = malloc(8 * bpf_num_possible_cpus());
+   if (CHECK(!val, "malloc", "memory allocation failed: %s",
+   strerror(errno))) {
+   bpf_iter_bpf_percpu_hash_map__destroy(skel);
+   return;
+   }
 
err = bpf_iter_bpf_percpu_hash_map__load(skel);
if (!ASSERT_OK_PTR(skel, "bpf_iter_bpf_percpu_hash_map__load"))
@@ -1057,6 +1062,11 @@ static void test_bpf_percpu_array_map(void)
 
skel->rodata->num_cpus = bpf_num_possible_cpus();
val = malloc(8 * bpf_num_possible_cpus());
+   if (CHECK(!val, "malloc", "memory allocation failed: %s",
+   strerror(errno))) {
+   bpf_iter_bpf_percpu_hash_map__destroy(skel);
+   return;
+   }
 
err = bpf_iter_bpf_percpu_array_map__load(skel);
if (!ASSERT_OK_PTR(skel, "bpf_iter_bpf_percpu_array_map__load"))
-- 
2.25.1



Re: [PATCH] selftests: add a sanity check for zswap

2023-10-23 Thread Rik van Riel
On Fri, 2023-10-20 at 15:20 -0700, Nhat Pham wrote:
> 
> This test add a sanity check that ensure zswap storing works as
> intended.
> 
> Signed-off-by: Nhat Pham 
> 

Acked-by: Rik van Riel 

-- 
All Rights Reversed.


RE: [PATCH v6 8/8] iommu/vt-d: Disallow read-only mappings to nest parent domain

2023-10-23 Thread Liu, Yi L
> From: Baolu Lu 
> Sent: Monday, October 23, 2023 8:18 PM
> 
> On 2023/10/23 19:15, Liu, Yi L wrote:
> >> I would also prefer to introduce is_nested_parent_domain to the user
> >> domain allocation patch (patch 7/8). This field should be checked when
> >> allocating a nested user domain.
> > Actually, no need. This should be a common check, so the iommufd core
> > already has the check: the parent must be a nest parent domain, otherwise
> > iommufd already returns an error.
> >
> > +   if (!parent->nest_parent)
> > +   return ERR_PTR(-EINVAL);
> 
> I know this will not cause errors in the code. But since you are
> introducing the is_parent property in the vt-d driver, the integrity of the
> property should be ensured. In this way, it will make the code more
> readable and maintainable.

Ok, if we consider it as a property, then it's fine. At first, I just wanted
to make it a special flag for this errata. But we cannot predict whether there
will be more nested-parent special cases, so this flag is also needed.

Regards,
Yi Liu


[PATCH] selftests/net: give more time to udpgro nat tests

2023-10-23 Thread Lucas Karpinski
In some conditions, background processes in udpgro don't have enough
time to set up the sockets. When foreground processes start, this
results in the bad GRO lookup test freezing or reporting that it
received 0 gro segments.

To fix this, increase the time given to background processes to complete
the startup before foreground processes start.

This is the same issue and the same fix as posted by Adrien Therry.
Link: https://lore.kernel.org/all/20221101184809.50013-1-athie...@redhat.com/

Signed-off-by: Lucas Karpinski 
---
 tools/testing/selftests/net/udpgro.sh | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/tools/testing/selftests/net/udpgro.sh 
b/tools/testing/selftests/net/udpgro.sh
index 0c743752669a..4ccbcb2390ad 100755
--- a/tools/testing/selftests/net/udpgro.sh
+++ b/tools/testing/selftests/net/udpgro.sh
@@ -97,7 +97,8 @@ run_one_nat() {
echo "ok" || \
echo "failed"&
 
-   sleep 0.1
+   # Hack: let bg programs complete the startup
+   sleep 0.2
./udpgso_bench_tx ${tx_args}
ret=$?
kill -INT $pid
-- 
2.41.0



Re: [PATCH v3 2/3] userfaultfd: UFFDIO_MOVE uABI

2023-10-23 Thread Suren Baghdasaryan
On Mon, Oct 23, 2023 at 11:37 AM Peter Xu  wrote:
>
> On Mon, Oct 23, 2023 at 10:43:49AM -0700, Suren Baghdasaryan wrote:
> > > Maybe we should follow what it does with mremap()?  Then your current code
> > > is fine.  Maybe that's the better start.
> >
> > I think that was the original intention, basically treating remapping
> > as a write operation. Maybe I should add a comment here to make it
> > more clear?
>
> Please avoid mentioning "emulate as a write" - this is not a write; e.g., we
> move a swap entry over without faulting in the page.  We also keep the page
> state, e.g. hotness.  A write would change all of that.

Understood.

>
> Now rethinking with the recently merged WP_ASYNC: we ignore uffd-wp, which
> means "dirty" from the uffd-wp async tracking POV; that matches soft-dirty
> always being set.  Looks all good.
>
> Perhaps something like "Follow mremap() behavior; ignore uffd-wp for now"
> should work?

Sounds good. Will add in the next version.
Thanks!

>
> --
> Peter Xu
>


Re: [PATCH v3 2/3] userfaultfd: UFFDIO_MOVE uABI

2023-10-23 Thread Suren Baghdasaryan
On Mon, Oct 23, 2023 at 8:53 AM David Hildenbrand  wrote:
>
> On 23.10.23 14:29, David Hildenbrand wrote:
> >> +
> >> +	/* Only allow remapping if both are mlocked or both aren't */
> >> +	if ((src_vma->vm_flags & VM_LOCKED) != (dst_vma->vm_flags & VM_LOCKED))
> >> +		return -EINVAL;
> >> +
> >> +	if (!(src_vma->vm_flags & VM_WRITE) || !(dst_vma->vm_flags & VM_WRITE))
> >> +		return -EINVAL;
> >
> > Why does one of both need VM_WRITE? If one really needs it, then the
> > destination (where we're moving stuff to).
>
> Just realized that we want both to be writable.
>
> If you have this in place, there is no need to use maybe*_mkwrite(), you
> can use the non-maybe variants.

Ack.
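
For concreteness, the simplification would look roughly like this
hypothetical hunk (a sketch only; the exact pte_mkwrite() signature varies
across kernel versions):

	/* Both VMAs are known to have VM_WRITE here, so the
	 * VM_WRITE-checking maybe_mkwrite() wrapper is redundant. */
	-	pte = maybe_mkwrite(pte_mkdirty(pte), dst_vma);
	+	pte = pte_mkwrite(pte_mkdirty(pte), dst_vma);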

>
> I recall that for UFFDIO_COPY we even support PROT_NONE VMAs, is there
> any reason why we want to have different semantics here?

I don't think so. At least not for the single-mm case.

>
> --
> Cheers,
>
> David / dhildenb
>


Re: [PATCH v3 2/3] userfaultfd: UFFDIO_MOVE uABI

2023-10-23 Thread Suren Baghdasaryan
On Mon, Oct 23, 2023 at 5:29 AM David Hildenbrand  wrote:
>
> Focusing on validate_remap_areas():
>
> > +
> > +static int validate_remap_areas(struct vm_area_struct *src_vma,
> > + struct vm_area_struct *dst_vma)
> > +{
> > + /* Only allow remapping if both have the same access and protection */
> > +	if ((src_vma->vm_flags & VM_ACCESS_FLAGS) != (dst_vma->vm_flags & VM_ACCESS_FLAGS) ||
> > +	    pgprot_val(src_vma->vm_page_prot) != pgprot_val(dst_vma->vm_page_prot))
> > +		return -EINVAL;
>
> Makes sense. I do wonder about pkey and friends and if we even have to
> so anything special.

I don't see anything special done for mremap. Do you have something in mind?

>
> > +
> > + /* Only allow remapping if both are mlocked or both aren't */
> > +	if ((src_vma->vm_flags & VM_LOCKED) != (dst_vma->vm_flags & VM_LOCKED))
> > +		return -EINVAL;
> > +
> > +	if (!(src_vma->vm_flags & VM_WRITE) || !(dst_vma->vm_flags & VM_WRITE))
> > +		return -EINVAL;
>
> Why does one of both need VM_WRITE? If one really needs it, then the
> destination (where we're moving stuff to).

As you noticed later, both should have VM_WRITE.

>
> > +
> > + /*
> > +  * Be strict and only allow remap_pages if either the src or
> > +  * dst range is registered in the userfaultfd to prevent
> > +  * userland errors going unnoticed. As far as the VM
> > +  * consistency is concerned, it would be perfectly safe to
> > +  * remove this check, but there's no useful usage for
> > +  * remap_pages outside of userfaultfd registered ranges. This
> > +  * is after all why it is an ioctl belonging to the
> > +  * userfaultfd and not a syscall.
>
> I think the last sentence is the important bit and the comment can
> likely be reduced.

Ok, I'll look into shortening it.

>
> > +  *
> > +  * Allow both vmas to be registered in the userfaultfd, just
> > +  * in case somebody finds a way to make such a case useful.
> > +  * Normally only one of the two vmas would be registered in
> > +  * the userfaultfd.
>
> Should we just check the destination? That makes most sense to me,
> because with uffd we are resolving uffd-events. And just like
> copy/zeropage we want to resolve a page fault ("userfault") of a
> non-present page on the destination.

I think that makes sense. Not sure why the original implementation
needed the check for source too. Seems unnecessary.
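
The reduced check would then be just (a sketch derived from the code quoted
above, which currently requires at least one of the two contexts):

	if (!dst_vma->vm_userfaultfd_ctx.ctx)
		return -EINVAL;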

>
>
> > +  */
> > + if (!dst_vma->vm_userfaultfd_ctx.ctx &&
> > + !src_vma->vm_userfaultfd_ctx.ctx)
> > + return -EINVAL;
>
>
>
> > +
> > + /*
> > +  * FIXME: only allow remapping across anonymous vmas,
> > +  * tmpfs should be added.
> > +  */
> > + if (!vma_is_anonymous(src_vma) || !vma_is_anonymous(dst_vma))
> > + return -EINVAL;
>
> Why a FIXME here? Just drop the comment completely or replace it with
> "We only allow to remap anonymous folios accross anonymous VMAs".

Will do. I guess Andrea had plans to cover tmpfs as well.

>
> > +
> > + /*
> > +  * Ensure the dst_vma has a anon_vma or this page
> > +  * would get a NULL anon_vma when moved in the
> > +  * dst_vma.
> > +  */
> > + if (unlikely(anon_vma_prepare(dst_vma)))
> > + return -ENOMEM;
>
> Makes sense.
>
> > +
> > + return 0;
> > +}
>
>

Thanks,
Suren.


> --
> Cheers,
>
> David / dhildenb
>


Re: [PATCH v4 08/17] iommufd: Always setup MSI and anforce cc on kernel-managed domains

2023-10-23 Thread Nicolin Chen
On Mon, Oct 23, 2023 at 10:59:35AM -0300, Jason Gunthorpe wrote:
> On Sun, Oct 22, 2023 at 05:18:03PM -0700, Nicolin Chen wrote:
> 
> > > > And where should we add this comment? Kdoc of
> > > > the alloc uAPI?
> > > 
> > > Maybe right in front of the only enforce_cc op callback?
> > 
> > OK. How about this? Might be a bit verbose though:
> > /*
> >  * enforce_cache_coherenc must be determined during the HWPT allocation.
> >  * Note that a HWPT (non-CC) created for a device (non-CC) can be later
> >  * reused by another device (either non-CC or CC). However, A HWPT (CC)
> >  * created for a device (CC) cannot be reused by another device (non-CC)
> >  * but only devices (CC). Instead user space in this case would need to
> >  * allocate a separate HWPT (non-CC).
> >  */
> 
> Ok, fix the spello in enforce_cache_coherenc

Oops.

I also found the existing piece sorta says a similar thing:
 * Set the coherency mode before we do iopt_table_add_domain() as some
 * iommus have a per-PTE bit that controls it and need to decide before
 * doing any maps. 

So, did this and sending v3:
-* enforce_cache_coherenc must be determined during the HWPT allocation.
+* The cache coherency mode must be configured here and unchanged later.


Re: [PATCH v4 08/17] iommufd: Always setup MSI and anforce cc on kernel-managed domains

2023-10-23 Thread Nicolin Chen
On Mon, Oct 23, 2023 at 02:53:20AM +, Tian, Kevin wrote:
> > From: Nicolin Chen 
> > Sent: Monday, October 23, 2023 8:18 AM
> >
> > On Sat, Oct 21, 2023 at 01:38:04PM -0300, Jason Gunthorpe wrote:
> > > On Fri, Oct 20, 2023 at 11:59:13AM -0700, Nicolin Chen wrote:
> > > > On Fri, Oct 20, 2023 at 10:55:01AM -0300, Jason Gunthorpe wrote:
> > > > > On Fri, Oct 20, 2023 at 02:43:58AM +, Tian, Kevin wrote:
> > > > >
> > > > > > But the user shouldn't assume such explicit consistency since it's 
> > > > > > not
> > > > > > defined in our uAPI. All we defined is that the attaching may
> > > > > > fail due to incompatibility for whatever reason then the user can
> > > > > > always try creating a new hwpt for the to-be-attached device. From
> > > > > > this regard I don't see providing consistency of result is
> > > > > > necessary. 
> > > > >
> > > > > Anyhow, OK, lets add a comment summarizing your points and remove
> > the
> > > > > cc upgrade at attach time (sorry Nicolin/Yi!)
> > > >
> > > > Ack. I will send a small removal series. I assume it should CC
> > > > stable tree also?
> > >
> > > No, it seems more like tidying than fixing a functional issue, do I
> > > misunderstand?
> >
> > Hmm. Maybe the misunderstanding is mine -- Kevin was asking if
> > it was already a bug and you answered yes:
> > https://lore.kernel.org/linux-iommu/20231016115736.gp3...@nvidia.com/
> >
> 
> currently intel-iommu driver already rejects 1) enforcing CC on
> a domain which is already attached to non-CC device and
> 2) attaching a non-CC device to a domain which has enforce_cc.
> 
> so there is no exploitable bug to fix in the stable tree.

I see. Thanks!


Re: [PATCH v3 2/3] userfaultfd: UFFDIO_MOVE uABI

2023-10-23 Thread Peter Xu
On Mon, Oct 23, 2023 at 10:43:49AM -0700, Suren Baghdasaryan wrote:
> > Maybe we should follow what it does with mremap()?  Then your current code
> > is fine.  Maybe that's the better start.
> 
> I think that was the original intention, basically treating remapping
> as a write operation. Maybe I should add a comment here to make it
> more clear?

Please avoid mentioning "emulate as a write" - this is not a write; e.g., we
move a swap entry over without faulting in the page.  We also keep the page
state, e.g. hotness.  A write would change all of that.

Now rethinking with the recently merged WP_ASYNC: we ignore uffd-wp, which
means "dirty" from the uffd-wp async tracking POV; that matches soft-dirty
always being set.  Looks all good.

Perhaps something like "Follow mremap() behavior; ignore uffd-wp for now"
should work?

-- 
Peter Xu



Re: [PATCH RFC RFT 2/5] fork: Add shadow stack support to clone3()

2023-10-23 Thread Mark Brown
On Mon, Oct 23, 2023 at 04:32:22PM +, Edgecombe, Rick P wrote:
> On Mon, 2023-10-23 at 14:20 +0100, Mark Brown wrote:

> +Some security folks

I *think* I captured everyone for future versions but I might've missed
some, it's a long Cc list.

> > Add parameters to clone3() specifying the address and size of a
> > shadow
> > stack for the newly created process, we validate that the range
> > specified
> > is accessible to userspace but do not validate that it has been
> > mapped
> > appropriately for use as a shadow stack (normally via
> > map_shadow_stack()).
> > If the shadow stack is specified in this way then the caller is
> > responsible
> > for freeing the memory as with the main stack. If no shadow stack is
> > specified then the existing implicit allocation and freeing behaviour
> > is
> > maintained.

> This will give userspace new powers, very close to a "set SSP" ability.
> They could start a new thread on an active shadow stack, corrupt it,
> etc.

That's true.

> One way to avoid this would be to have shstk_alloc_thread_stack()
> consume a token on the shadow stack passed in the clone args. But it's
> tricky because there is not a CMPXCHG, on x86 at least, that works with
> shadow stack accesses. So the kernel would probably have to GUP the
> page and do a normal CMPXCHG off of the direct map.

> That said, it's already possible to get two threads on the same shadow
> stack by unmapping one and mapping another shadow stack in the same
> place, while the target thread is not doing a call/ret. I don't know if
> there is anything we could do about that without serious compatibility
> restrictions. But this patch would make it a bit more trivial.

> I might lean towards the token solution, even if it becomes more heavy
> weight to use clone3 in this way. It depends on whether the above is
> worth defending.

Right.  We're already adding the cost of the extra map_shadow_stack() so
it doesn't seem that out of scope.  We could also allow clone3() to be
used for allocation, potentially removing the ability to specify the
address entirely and only specifying the size.  I did consider that
option but it felt awkward in the API, though equally the whole shadow
stack allocation thing is a bit that way.  That would avoid concerns
about placing and validating tokens entirely but gives less control to
userspace.

This also doesn't do anything to stop anyone trying to allocate sub-page
shadow stacks if they're trying to save memory, with all the lack of
overrun protection that implies. That seems to me to be much more of a
deliberate decision that people might make; a token would prevent that
too, unless write access to the shadow stack is enabled.
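
For reference, the token-consuming idea Rick sketches might look something
like the following. This is a rough, hypothetical sketch (the helper and its
name are invented for illustration): pin the shadow stack page with GUP and
swap the token out through the direct map, since x86 has no CMPXCHG that
performs shadow stack accesses:

	static int consume_shstk_token(unsigned long token_addr, u64 expected)
	{
		struct page *page;
		void *vaddr;
		u64 *slot;
		int ret;

		/* Pin the page so we can write it via a normal mapping. */
		ret = get_user_pages_fast(token_addr & PAGE_MASK, 1,
					  FOLL_WRITE, &page);
		if (ret != 1)
			return -EFAULT;

		vaddr = kmap_local_page(page);
		slot = vaddr + (token_addr & ~PAGE_MASK);
		/* Atomically clear the token iff it still matches. */
		ret = try_cmpxchg64(slot, &expected, 0) ? 0 : -EINVAL;
		kunmap_local(vaddr);
		put_page(page);
		return ret;
	}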

> > +   /*
> > +    * This doesn't validate that the addresses are
> > mapped
> > +    * VM_SHADOW_STACK, just that they're mapped at all.
> > +    */

> It just checks the range, right?

Yes, same check as for the normal stack.




Re: [RFC PATCH v1 0/8] Introduce mseal() syscall

2023-10-23 Thread Jeff Xu
On Thu, Oct 19, 2023 at 4:06 PM Linus Torvalds
 wrote:
>
> On Thu, 19 Oct 2023 at 15:47, Pedro Falcato  wrote:
> > >
> > > For mprotect()/mmap(), is Linux implementation limited by POSIX ?
> >
> > No. POSIX works merely as a baseline that UNIX systems aim towards.
> > You can (and very frequently do) extend POSIX interfaces (in fact,
> > it's how most of POSIX was written, through sheer
> > "design-by-committee" on a bunch of UNIX systems' extensions).
>
> We can in extreme circumstances actually go further than that, and not
> only extend on POSIX requirements, but actively even violate them.
>
> It does need a very good reason, though, but it has happened when
> POSIX requirements were simply actively wrong.
>
> For example, at one point POSIX required
>
>  int accept(int s, struct sockaddr *addr, size_t *addrlen);
>
> which was simply completely wrong. It's utter shite, and didn't
> actually match any reality.
>
> The 'addrlen' parameter is 'int *', and POSIX suddenly trying to make
> it "size_t" was completely unacceptable.
>
> So we ignored it, told POSIX people that they were full of sh*t, and
> they eventually fixed it in the next version (by introducing a
> "socklen_t" that had better be the same as "int").
>
> So POSIX can simply be wrong.
>
> Also, if it turns out that we had some ABI that wasn't
> POSIX-compatible, the whole "don't break user space" will take
> precedence over any POSIX concerns, and we will not "fix" our system
> call if people already use our old semantics.
>
> So in that case, we generally end up with a new system call (or new
> flags) instead.
>
> Or sometimes it just is such a small detail that nobody cares - POSIX
> also has a notion of documenting areas of non-conformance, and people
> who really care end up having notions like "conformance vs _strict_
> conformance".
>
>  Linus
>

Thanks Linus for clarifying the guidelines on POSIX in Linux.

-Jeff


Re: [PATCH v3 2/3] userfaultfd: UFFDIO_MOVE uABI

2023-10-23 Thread Suren Baghdasaryan
On Sun, Oct 22, 2023 at 10:02 AM Peter Xu  wrote:
>
> On Thu, Oct 19, 2023 at 02:24:06PM -0700, Suren Baghdasaryan wrote:
> > On Thu, Oct 12, 2023 at 3:00 PM Peter Xu  wrote:
> > >
> > > On Sun, Oct 08, 2023 at 11:42:27PM -0700, Suren Baghdasaryan wrote:
> > > > From: Andrea Arcangeli 
> > > >
> > > > Implement the uABI of UFFDIO_MOVE ioctl.
> > > > UFFDIO_COPY performs ~20% better than UFFDIO_MOVE when the application
> > > > needs pages to be allocated [1]. However, with UFFDIO_MOVE, if pages are
> > > > available (in userspace) for recycling, as is usually the case in heap
> > > > compaction algorithms, then we can avoid the page allocation and memcpy
> > > > (done by UFFDIO_COPY). Also, since the pages are recycled in the
> > > > userspace, we avoid the need to release (via madvise) the pages back to
> > > > the kernel [2].
> > > > We see over 40% reduction (on a Google pixel 6 device) in the compacting
> > > > thread’s completion time by using UFFDIO_MOVE vs. UFFDIO_COPY. This was
> > > > measured using a benchmark that emulates a heap compaction 
> > > > implementation
> > > > using userfaultfd (to allow concurrent accesses by application threads).
> > > > More details of the usecase are explained in [2].
> > > > Furthermore, UFFDIO_MOVE enables moving swapped-out pages without
> > > > touching them within the same vma. Today, it can only be done by mremap,
> > > > however it forces splitting the vma.
> > > >
> > > > [1] 
> > > > https://lore.kernel.org/all/1425575884-2574-1-git-send-email-aarca...@redhat.com/
> > > > [2] 
> > > > https://lore.kernel.org/linux-mm/ca+eeso4uo84ssnbharh4hvlnhauq5nzknkxqxrcyjninvjp...@mail.gmail.com/
> > > >
> > > > Update for the ioctl_userfaultfd(2)  manpage:
> > > >
> > > >UFFDIO_MOVE
> > > >(Since Linux xxx)  Move a continuous memory chunk into the
> > > >userfault registered range and optionally wake up the blocked
> > > >thread. The source and destination addresses and the number of
> > > >bytes to move are specified by the src, dst, and len fields of
> > > >the uffdio_move structure pointed to by argp:
> > > >
> > > >struct uffdio_move {
> > > >__u64 dst;/* Destination of move */
> > > >__u64 src;/* Source of move */
> > > >__u64 len;/* Number of bytes to move */
> > > >__u64 mode;   /* Flags controlling behavior of move */
> > > >__s64 move;   /* Number of bytes moved, or negated error 
> > > > */
> > > >};
> > > >
> > > >The following value may be bitwise ORed in mode to change the
> > > >behavior of the UFFDIO_MOVE operation:
> > > >
> > > >UFFDIO_MOVE_MODE_DONTWAKE
> > > >   Do not wake up the thread that waits for page-fault
> > > >   resolution
> > > >
> > > >UFFDIO_MOVE_MODE_ALLOW_SRC_HOLES
> > > >   Allow holes in the source virtual range that is being 
> > > > moved.
> > > >   When not specified, the holes will result in ENOENT error.
> > > >   When specified, the holes will be accounted as 
> > > > successfully
> > > >   moved memory. This is mostly useful to move hugepage 
> > > > aligned
> > > >   virtual regions without knowing if there are transparent
> > > >   hugepages in the regions or not, but preventing the risk 
> > > > of
> > > >   having to split the hugepage during the operation.
> > > >
> > > >The move field is used by the kernel to return the number of
> > > >bytes that was actually moved, or an error (a negated errno-
> > > >style value).  If the value returned in move doesn't match the
> > > >value that was specified in len, the operation fails with the
> > > >error EAGAIN.  The move field is output-only; it is not read by
> > > >the UFFDIO_MOVE operation.
> > > >
> > > >The operation may fail for various reasons. Usually, remapping of
> > > >pages that are not exclusive to the given process fail; once KSM
> > > >might deduplicate pages or fork() COW-shares pages during fork()
> > > >with child processes, they are no longer exclusive. Further, the
> > > >kernel might only perform lightweight checks for detecting 
> > > > whether
> > > >the pages are exclusive, and return -EBUSY in case that check 
> > > > fails.
> > > >To make the operation more likely to succeed, KSM should be
> > > >disabled, fork() should be avoided or MADV_DONTFORK should be
> > > >configured for the source VMA before fork().
> > > >
> > > >This ioctl(2) operation returns 0 on success.  In this case, the
> > > >entire area was moved.  On error, -1 is returned and errno is
> > > >set to indicate the error.  Possible errors include:
> > > >
> > > >EAGAIN The number of bytes moved (i.e., the value returned in
> > > >  
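
For readers following the manpage text above, the call sequence it describes
boils down to this (a sketch, assuming a userfaultfd with the destination
range registered and headers from this series defining UFFDIO_MOVE and
struct uffdio_move):

	struct uffdio_move move = {
		.dst = (__u64)(uintptr_t)dst,
		.src = (__u64)(uintptr_t)src,
		.len = len,
		.mode = 0,
	};

	if (ioctl(uffd, UFFDIO_MOVE, &move) == -1)
		err(1, "UFFDIO_MOVE");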

Re: [RFC PATCH v1 0/8] Introduce mseal() syscall

2023-10-23 Thread Jeff Xu
On Thu, Oct 19, 2023 at 3:47 PM Pedro Falcato  wrote:
>
> On Thu, Oct 19, 2023 at 6:30 PM Jeff Xu  wrote:
> >
> > Hi Pedro
> >
> > Some followup on mmap() + mprotect():
> >
> > On Wed, Oct 18, 2023 at 11:20 AM Jeff Xu  wrote:
> > >
> > > On Tue, Oct 17, 2023 at 3:35 PM Pedro Falcato  
> > > wrote:
> > > >
> > > > > >
> > > > > > I think it's worth pointing out that this suggestion (with PROT_*)
> > > > > > could easily integrate with mmap() and as such allow for one-shot
> > > > > > mmap() + mseal().
> > > > > > If we consider the common case as 'addr = mmap(...); mseal(addr);', 
> > > > > > it
> > > > > > definitely sounds like a performance win as we halve the number of
> > > > > > syscalls for a sealed mapping. And if we trivially look at e.g 
> > > > > > OpenBSD
> > > > > > ld.so code, mmap() + mimmutable() and mprotect() + mimmutable() seem
> > > > > > like common patterns.
> > > > > >
> > > > > Yes. mmap() can support sealing as well, and memory is allocated as
> > > > > immutable from begining.
> > > > > This is orthogonal to mseal() though.
> > > >
> > > > I don't see how this can be orthogonal to mseal().
> > > > In the case we opt for adding PROT_ bits, we should more or less only
> > > > need to adapt calc_vm_prot_bits(), and the rest should work without
> > > > issues.
> > > > vma merging won't merge vmas with different prots. The current
> > > > interfaces (mmap and mprotect) would work just fine.
> > > > In this case, mseal() or mimmutable() would only be needed if you need
> > > > to set immutability over a range of VMAs with different permissions.
> > > >
> > > Agreed. By orthogonal, I meant we can have two APIs:
> > > mmap() and mseal()/mprotect()
> > > i.e. we can't just rely on mmap() only without 
> > > mseal()/mprotect()/mimmutable().
> > > Sealing can be applied after initial memory creation.
> > >
> > > > Note: modifications should look kinda like this: 
> > > > https://godbolt.org/z/Tbjjd14Pe
> > > > The only annoying wrench in my plans here is that we have effectively
> > > > run out of vm_flags bits in 32-bit architectures, so this approach as
> > > > I described is not compatible with 32-bit.
> > > >
> > > > > In case of ld.so, iiuc, memory can be first allocated as W, then later
> > > > > changed to RO, for example, during symbol resolution.
> > > > > The important point is that the application can decide what type of
> > > > > sealing it wants, and when to apply it.  There needs to be an api(),
> > > > > that can be mseal() or mprotect2() or mimmutable(), the naming is not
> > > > > important to me.
> > > > >
> > > > > mprotect() in linux have the following signature:
> > > > > int mprotect(void addr[.len], size_t len, int prot);
> > > > > the prot bitmasks are all taken here.
> > > > > I have not checked the prot field in mmap(), there might be bits left,
> > > > > and even if not, we could have mmap2(), so that is not an issue.
> > > >
> > > > I don't see what you mean. We have plenty of prot bits left (32-bits,
> > > > and we seem to have around 8 different bits used).
> > > > And even if we didn't, prot is the same in mprotect and mmap and mmap2 
> > > > :)
> > > >
> > > > The only issue seems to be that 32-bit ran out of vm_flags, but that
> > > > can probably be worked around if need be.
> > > >
> > > Ah, you are right about this. vm_flags is full, and prot in mprotect() is 
> > > not.
> > > Apologies that I was wrong previously and caused confusion.
> > >
> > > There is a slight difference in the syntax of mprotect and mseal.
> > > Each time when mprotect() is called, the kernel takes all of RWX bits
> > > and updates vm_flags,
> > > In other words, the application sets/unset each RWX, and kernel takes it.
> > >
> > > In the mseal() case, the kernel will remember which seal types were
> > > applied previously, and the application doesn’t need to repeat all
> > > existing seal types in the next mseal().  Once a seal type is applied,
> > > it can’t be unsealed.
> > >
> > > So if we want to use mprotect() for sealing, developers need to think
> > > of sealing bits differently than the rest of prot bits. It is a
> > > different programming model, might or might not be an obvious concept
> > > to developers.
> > >
> > This probably doesn't matter much to developers.
> > We can enforce the sealing bit to be the same as the rest of PROT bits.
> > If mprotect() tries to unset sealing, it will fail.
>
> Yep. Erroneous or malicious mprotects would all be caught. However, if
> we add a PROT_DOWNGRADEABLE (that could let you, let's say, mprotect()
> to less permissions or even downright munmap()) you'd want some care
> to preserve that bit when setting permissions.
>
> >
> > > There is a difference in input checking and error handling as well.
> > > For mseal(), if a given address range has a gap (unallocated memory),
> > > or if one of the VMAs is sealed with the MM_SEAL_SEAL flag, none of the
> > > VMAs is updated.
> > > For mprotect(), some VMAs can be updated, until an error happens on a VMA.
> > >
> > 
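
Written out, the one-shot pattern under discussion is the following (a hedged
sketch: the mseal() prototype and the MM_SEAL_SEAL flag are taken from this
RFC and may change, and __NR_mseal is assumed to be wired up):

	void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	/* ... populate the mapping ... */
	if (mprotect(p, len, PROT_READ))
		err(1, "mprotect");
	if (syscall(__NR_mseal, p, len, MM_SEAL_SEAL, 0))
		err(1, "mseal");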

Re: [PATCH v3 2/3] userfaultfd: UFFDIO_MOVE uABI

2023-10-23 Thread Suren Baghdasaryan
On Mon, Oct 23, 2023 at 9:36 AM David Hildenbrand  wrote:
>
> On 23.10.23 14:03, David Hildenbrand wrote:
> > On 22.10.23 17:46, Peter Xu wrote:
> >> On Fri, Oct 20, 2023 at 07:16:19PM +0200, David Hildenbrand wrote:
> >>> These are rather the vibes I'm getting from Peter. "Why rename it, could
> >>> confuse people because the original patches are old", "Why exclude it if 
> >>> it
> >>> has been included in the original patches". Not the kind of reasoning I 
> >>> can
> >>> relate to when it comes to upstreaming some patches.
> >>
> >> You can't blame anyone if you misunderstood and biased the question.
> >>
> >> The first question is definitely valid, even until now.  You guys still
> >> prefer to rename it, which I'm totally fine with.
> >>
> >> The 2nd question is wrong from your interpretation.  That's not my point,
> >> at least not starting from a few replies already.  What I was asking for is
> >> why such page movement between mm is dangerous.  I don't think I get solid
> >> answers even until now.
> >>
> >> Noticing "memcg is missing" is not an argument for "cross-mm is dangerous",
> >> it's a review comment.  Suren can address that.
> >>
> >> You'll propose a new feature that may tag an mm is not an argument either,
> >> if it's not merged yet.  We can also address that depending on what it is,
> >> also on which lands earlier.
> >>
> >> It'll be good to discuss these details even in a single-mm support.  Anyone
> >> would like to add that can already refer to discussion in this thread.
> >>
> >> I hope I'm clear.
> >>
> >
> > I said everything I had to say, go read what I wrote.
>
> Re-read your message after flying over the first couple of paragraphs
> a bit too quickly previously (can easily happen when I'm told that
> I misunderstand questions and read them in a "biased" way).
>
> I'll be happy to discuss cross-mm support once we actually need it. I just
> don't see the need to spend any energy on that right now, without any
> users on the horizon.
>
> [(a) I didn't blame anybody, I said that I don't understand the
> reasoning. (b) I hope I made it clear that this is added complexity (and
> not just currently dangerous) and so far I haven't heard a compelling
> argument why we should do any of that or even spend our time discussing
> that. (c) I never used "memcg is missing" as an argument for "cross-mm
> is dangerous", all about added complexity without actual users. (d) "it
> easily shows that there are cases  where this will require extra work --
> without any current benefits" -- is IMHO a perfectly fine argument
> against complexity that currently nobody needs]

Thanks for the discussion, folks!
I think posting the single-mm first and then following up with
cross-mm and its test would help us move forward. That will provide
functionality that is needed today quickly without unnecessary
distractions and will give us more time to discuss cross-mm support.
Also we will be able to test single-mm in isolation and make it more
solid before moving onto cross-mm.
I'll try to post the next version sometime this week.
Thanks,
Suren.

>
> --
> Cheers,
>
> David / dhildenb
>


Re: [PATCH v3 2/3] userfaultfd: UFFDIO_MOVE uABI

2023-10-23 Thread David Hildenbrand

On 23.10.23 14:03, David Hildenbrand wrote:

On 22.10.23 17:46, Peter Xu wrote:

On Fri, Oct 20, 2023 at 07:16:19PM +0200, David Hildenbrand wrote:

These are rather the vibes I'm getting from Peter. "Why rename it, could
confuse people because the original patches are old", "Why exclude it if it
has been included in the original patches". Not the kind of reasoning I can
relate to when it comes to upstreaming some patches.


You can't blame anyone if you misunderstood and biased the question.

The first question is definitely valid, even until now.  You guys still
prefer to rename it, which I'm totally fine with.

The 2nd question is wrong from your interpretation.  That's not my point,
at least not starting from a few replies already.  What I was asking for is
why such page movement between mm is dangerous.  I don't think I get solid
answers even until now.

Noticing "memcg is missing" is not an argument for "cross-mm is dangerous",
it's a review comment.  Suren can address that.

You'll propose a new feature that may tag an mm is not an argument either,
if it's not merged yet.  We can also address that depending on what it is,
also on which lands earlier.

It'll be good to discuss these details even in a single-mm support.  Anyone
would like to add that can already refer to discussion in this thread.

I hope I'm clear.



I said everything I had to say, go read what I wrote.


Re-read your message after flying over the first couple of paragraphs 
a bit too quickly previously (can easily happen when I'm told that 
I misunderstand questions and read them in a "biased" way).


I'll be happy to discuss cross-mm support once we actually need it. I just 
don't see the need to spend any energy on that right now, without any 
users on the horizon.


[(a) I didn't blame anybody, I said that I don't understand the 
reasoning. (b) I hope I made it clear that this is added complexity (and 
not just currently dangerous) and so far I haven't heard a compelling 
argument why we should do any of that or even spend our time discussing 
that. (c) I never used "memcg is missing" as an argument for "cross-mm 
is dangerous", all about added complexity without actual users. (d) "it 
easily shows that there are cases  where this will require extra work -- 
without any current benefits" -- is IMHO a perfectly fine argument 
against complexity that currently nobody needs]


--
Cheers,

David / dhildenb



Re: [PATCH RFC RFT 2/5] fork: Add shadow stack support to clone3()

2023-10-23 Thread Edgecombe, Rick P
+Some security folks

On Mon, 2023-10-23 at 14:20 +0100, Mark Brown wrote:
> Unlike with the normal stack there is no API for configuring the the
> shadow
> stack for a new thread, instead the kernel will dynamically allocate
> a new
> shadow stack with the same size as the normal stack. This appears to
> be due
> to the shadow stack series having been in development since before
> the more
> extensible clone3() was added rather than anything more deliberate.
> 
> Add parameters to clone3() specifying the address and size of a
> shadow
> stack for the newly created process, we validate that the range
> specified
> is accessible to userspace but do not validate that it has been
> mapped
> appropriately for use as a shadow stack (normally via
> map_shadow_stack()).
> If the shadow stack is specified in this way then the caller is
> responsible
> for freeing the memory as with the main stack. If no shadow stack is
> specified then the existing implicit allocation and freeing behaviour
> is
> maintained.
> 
> If the architecture does not support shadow stacks the shadow stack
> parameters must be zero, architectures that do support the feature
> are
> expected to have the same requirement on individual systems that lack
> shadow stack support.
> 
> Update the existing x86 implementation to pay attention to the newly
> added
> arguments, in order to maintain compatibility we use the existing
> behaviour
> if no shadow stack is specified. Minimal validation is done of the
> supplied
> parameters, detailed enforcement is left to when the thread is
> executed.
> Since we are now using four fields from the kernel_clone_args we pass
> that
> into the shadow stack code rather than individual fields.

This will give userspace new powers, very close to a "set SSP" ability.
They could start a new thread on an active shadow stack, corrupt it,
etc.

One way to avoid this would be to have shstk_alloc_thread_stack()
consume a token on the shadow stack passed in the clone args. But it's
tricky because there is not a CMPXCHG, on x86 at least, that works with
shadow stack accesses. So the kernel would probably have to GUP the
page and do a normal CMPXCHG off of the direct map.

That said, it's already possible to get two threads on the same shadow
stack by unmapping one and mapping another shadow stack in the same
place, while the target thread is not doing a call/ret. I don't know if
there is anything we could do about that without serious compatibility
restrictions. But this patch would make it a bit more trivial.

I might lean towards the token solution, even if it becomes more heavy
weight to use clone3 in this way. It depends on whether the above is
worth defending.

> 
> Signed-off-by: Mark Brown 
> ---
>  arch/x86/include/asm/shstk.h | 11 +++
>  arch/x86/kernel/process.c    |  2 +-
>  arch/x86/kernel/shstk.c  | 36 +++-----
>  include/linux/sched/task.h   |  2 ++
>  include/uapi/linux/sched.h   | 17 ++---
>  kernel/fork.c    | 40 ++--
>  6 files changed, 93 insertions(+), 15 deletions(-)
> 
> diff --git a/arch/x86/include/asm/shstk.h
> b/arch/x86/include/asm/shstk.h
> index 42fee8959df7..8be7b0a909c3 100644
> --- a/arch/x86/include/asm/shstk.h
> +++ b/arch/x86/include/asm/shstk.h
> @@ -6,6 +6,7 @@
>  #include 
>  
>  struct task_struct;
> +struct kernel_clone_args;
>  struct ksignal;
>  
>  #ifdef CONFIG_X86_USER_SHADOW_STACK
> @@ -16,8 +17,8 @@ struct thread_shstk {
>  
>  long shstk_prctl(struct task_struct *task, int option, unsigned long arg2);
>  void reset_thread_features(void);
> -unsigned long shstk_alloc_thread_stack(struct task_struct *p, unsigned long clone_flags,
> -				       unsigned long stack_size);
> +unsigned long shstk_alloc_thread_stack(struct task_struct *p,
> +				       const struct kernel_clone_args *args);
>  void shstk_free(struct task_struct *p);
>  int setup_signal_shadow_stack(struct ksignal *ksig);
>  int restore_signal_shadow_stack(void);
> @@ -26,8 +27,10 @@ static inline long shstk_prctl(struct task_struct *task, int option,
>  				   unsigned long arg2) { return -EINVAL; }
>  static inline void reset_thread_features(void) {}
>  static inline unsigned long shstk_alloc_thread_stack(struct task_struct *p,
> -						     unsigned long clone_flags,
> -						     unsigned long stack_size) { return 0; }
> +						     const struct kernel_clone_args *args)
> +{
> +	return 0;
> +}
>  static inline void shstk_free(struct task_struct *p) {}
>  static inline int setup_signal_shadow_stack(struct ksignal *ksig) { return 0; }
>  static inline int restore_signal_shadow_stack(void) { return 0; }
> diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
> index 

Re: [PATCH v3 2/3] userfaultfd: UFFDIO_MOVE uABI

2023-10-23 Thread David Hildenbrand

On 23.10.23 14:29, David Hildenbrand wrote:

+
+   /* Only allow remapping if both are mlocked or both aren't */
+   if ((src_vma->vm_flags & VM_LOCKED) != (dst_vma->vm_flags & VM_LOCKED))
+   return -EINVAL;
+
+   if (!(src_vma->vm_flags & VM_WRITE) || !(dst_vma->vm_flags & VM_WRITE))
+   return -EINVAL;


Why does one of both need VM_WRITE? If one really needs it, then the
destination (where we're moving stuff to).


Just realized that we want both to be writable.

If you have this in place, there is no need to use maybe*_mkwrite(), you 
can use the non-maybe variants.


I recall that for UFFDIO_COPY we even support PROT_NONE VMAs, is there 
any reason why we want to have different semantics here?


--
Cheers,

David / dhildenb



Re: [PATCH 0/2] um: kunit: Add Clang support for CONFIG_GCOV

2023-10-23 Thread Arthur Grillo



On 20/10/23 06:21, Michał Winiarski wrote:
> Clang uses a different set of CLI args for coverage, and the output
> needs to be processed by a different set of tools.
> Update the Makefile and add an example of usage in kunit docs.

Great change! It's great not to rely on older versions of GCC for
that.

I was able to generate coverage reports, but I have a few things to
note:

When I ran the kunit.py it generated this warning several times:

WARNING: modpost: vmlinux (__llvm_covfun): unexpected non-allocatable section.
Did you forget to use "ax"/"aw" in a .S file?
Note that for example  contains
section definitions for use in .S files.

Maybe it would be great to know why this is happening.

Also, the linker consumed a lot of RAM to build the kernel with those
additional flags, but maybe this is expected :P.

Best Regards,
~Arthur Grillo

> 
> Michał Winiarski (2):
>   arch: um: Add Clang coverage support
>   Documentation: kunit: Add clang UML coverage example
> 
>  Documentation/dev-tools/kunit/running_tips.rst | 11 +++++++++++
>  arch/um/Makefile-skas  |  5 +++++
>  2 files changed, 16 insertions(+)
> 


Re: [PATCH] selftests:net change ifconfig with ip command

2023-10-23 Thread David Ahern
On 10/23/23 1:12 AM, Jiri Pirko wrote:
>> ip addr add ...
> 
> Why not "address" then? :)

no objection from me.

> What's wrong with "a"?

1-letter commands can be ambiguous. Test scripts should be clear and
obvious.




Re: [PATCH v2] selftests:net change ifconfig with ip command

2023-10-23 Thread David Ahern
On 10/23/23 6:34 AM, Swarup Laxman Kotiaklapudi wrote:
> Replace ifconfig with the ip command;
> on a system where ifconfig is
> not available this script will not
> work correctly.
> 
> Test result with this patchset:
> 
> sudo make TARGETS="net" kselftest
> 
> TAP version 13
> 1..1
>  timeout set to 1500
>  selftests: net: route_localnet.sh
>  run arp_announce test
>  net.ipv4.conf.veth0.route_localnet = 1
>  net.ipv4.conf.veth1.route_localnet = 1
>  net.ipv4.conf.veth0.arp_announce = 2
>  net.ipv4.conf.veth1.arp_announce = 2
>  PING 127.25.3.14 (127.25.3.14) from 127.25.3.4 veth0: 56(84)
>   bytes of data.
>  64 bytes from 127.25.3.14: icmp_seq=1 ttl=64 time=0.038 ms
>  64 bytes from 127.25.3.14: icmp_seq=2 ttl=64 time=0.068 ms
>  64 bytes from 127.25.3.14: icmp_seq=3 ttl=64 time=0.068 ms
>  64 bytes from 127.25.3.14: icmp_seq=4 ttl=64 time=0.068 ms
>  64 bytes from 127.25.3.14: icmp_seq=5 ttl=64 time=0.068 ms
> 
>  --- 127.25.3.14 ping statistics ---
>  5 packets transmitted, 5 received, 0% packet loss, time 4073ms
>  rtt min/avg/max/mdev = 0.038/0.062/0.068/0.012 ms
>  ok
>  run arp_ignore test
>  net.ipv4.conf.veth0.route_localnet = 1
>  net.ipv4.conf.veth1.route_localnet = 1
>  net.ipv4.conf.veth0.arp_ignore = 3
>  net.ipv4.conf.veth1.arp_ignore = 3
>  PING 127.25.3.14 (127.25.3.14) from 127.25.3.4 veth0: 56(84)
>   bytes of data.
>  64 bytes from 127.25.3.14: icmp_seq=1 ttl=64 time=0.032 ms
>  64 bytes from 127.25.3.14: icmp_seq=2 ttl=64 time=0.065 ms
>  64 bytes from 127.25.3.14: icmp_seq=3 ttl=64 time=0.066 ms
>  64 bytes from 127.25.3.14: icmp_seq=4 ttl=64 time=0.065 ms
>  64 bytes from 127.25.3.14: icmp_seq=5 ttl=64 time=0.065 ms
> 
>  --- 127.25.3.14 ping statistics ---
>  5 packets transmitted, 5 received, 0% packet loss, time 4092ms
>  rtt min/avg/max/mdev = 0.032/0.058/0.066/0.013 ms
>  ok
> ok 1 selftests: net: route_localnet.sh
> ...
> 
> Signed-off-by: Swarup Laxman Kotiaklapudi 
> ---
>  tools/testing/selftests/net/route_localnet.sh | 6 --
>  1 file changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/tools/testing/selftests/net/route_localnet.sh 
> b/tools/testing/selftests/net/route_localnet.sh
> index 116bfeab72fa..e08701c750e3 100755
> --- a/tools/testing/selftests/net/route_localnet.sh
> +++ b/tools/testing/selftests/net/route_localnet.sh
> @@ -18,8 +18,10 @@ setup() {
>  ip route del 127.0.0.0/8 dev lo table local
>  ip netns exec "${PEER_NS}" ip route del 127.0.0.0/8 dev lo table local
>  
> -ifconfig veth0 127.25.3.4/24 up
> -ip netns exec "${PEER_NS}" ifconfig veth1 127.25.3.14/24 up
> +ip address add 127.25.3.4/24 dev veth0
> +ip link set dev veth0 up
> +ip netns exec "${PEER_NS}" ip address add 127.25.3.14/24 dev veth1
> +ip netns exec "${PEER_NS}" ip link set dev veth1 up
>  
>  ip route flush cache
>  ip netns exec "${PEER_NS}" ip route flush cache

Reviewed-by: David Ahern 



Re: [PATCH v4 08/17] iommufd: Always setup MSI and anforce cc on kernel-managed domains

2023-10-23 Thread Jason Gunthorpe
On Sun, Oct 22, 2023 at 05:18:03PM -0700, Nicolin Chen wrote:

> > > And where should we add this comment? Kdoc of
> > > the alloc uAPI?
> > 
> > Maybe right in front of the only enforce_cc op callback?
> 
> OK. How about this? Might be a bit verbose though:
>   /*
>* enforce_cache_coherenc must be determined during the HWPT allocation.
>* Note that a HWPT (non-CC) created for a device (non-CC) can be later
>* reused by another device (either non-CC or CC). However, A HWPT (CC)
>* created for a device (CC) cannot be reused by another device (non-CC)
>* but only devices (CC). Instead user space in this case would need to
>* allocate a separate HWPT (non-CC).
>*/

Ok, fix the spello in enforce_cache_coherenc

Jason


[PATCH RFC RFT 5/5] kselftest/clone3: Test shadow stack support

2023-10-23 Thread Mark Brown
Add basic test coverage for specifying the shadow stack for a newly
created thread via clone3(), including coverage of the newly extended
argument structure. We detect support for shadow stacks on the running
system by attempting to allocate a shadow stack page during initialisation
using map_shadow_stack().

Signed-off-by: Mark Brown 
---
 tools/testing/selftests/clone3/clone3.c   | 97 +++
 tools/testing/selftests/clone3/clone3_selftests.h |  5 ++
 2 files changed, 102 insertions(+)

diff --git a/tools/testing/selftests/clone3/clone3.c 
b/tools/testing/selftests/clone3/clone3.c
index f1802db82e4e..33c35fdfcdfc 100644
--- a/tools/testing/selftests/clone3/clone3.c
+++ b/tools/testing/selftests/clone3/clone3.c
@@ -11,6 +11,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -21,6 +22,10 @@
 #include "../kselftest.h"
 #include "clone3_selftests.h"
 
+static bool shadow_stack_supported;
+static __u64 shadow_stack;
+static size_t max_supported_args_size;
+
 enum test_mode {
CLONE3_ARGS_NO_TEST,
CLONE3_ARGS_ALL_0,
@@ -28,6 +33,9 @@ enum test_mode {
CLONE3_ARGS_INVAL_EXIT_SIGNAL_NEG,
CLONE3_ARGS_INVAL_EXIT_SIGNAL_CSIG,
CLONE3_ARGS_INVAL_EXIT_SIGNAL_NSIG,
+   CLONE3_ARGS_SHADOW_STACK,
+   CLONE3_ARGS_SHADOW_STACK_SIZE_ONLY,
+   CLONE3_ARGS_SHADOW_STACK_POINTER_ONLY,
 };
 
 typedef bool (*filter_function)(void);
@@ -44,6 +52,28 @@ struct test {
filter_function filter;
 };
 
+#ifndef __NR_map_shadow_stack
+#define __NR_map_shadow_stack 453
+#endif
+
+static void test_shadow_stack_supported(void)
+{
+   shadow_stack = syscall(__NR_map_shadow_stack, 0, getpagesize(), 0);
+   if (shadow_stack == -1) {
+   ksft_print_msg("map_shadow_stack() not supported\n");
+   } else if ((void *)shadow_stack == MAP_FAILED) {
+   ksft_print_msg("Failed to map shadow stack\n");
+   } else {
+   ksft_print_msg("Shadow stack supported\n");
+   shadow_stack_supported = true;
+   }
+
+   /* Dummy stack to use for validating error checks */
+   if (!shadow_stack_supported) {
+   shadow_stack = (__u64)malloc(getpagesize());
+   }
+}
+
 static int call_clone3(uint64_t flags, size_t size, enum test_mode test_mode)
 {
struct __clone_args args = {
@@ -89,6 +119,16 @@ static int call_clone3(uint64_t flags, size_t size, enum 
test_mode test_mode)
case CLONE3_ARGS_INVAL_EXIT_SIGNAL_NSIG:
args.exit_signal = 0x00f0ULL;
break;
+   case CLONE3_ARGS_SHADOW_STACK:
+   args.shadow_stack = shadow_stack;
+   args.shadow_stack_size = getpagesize();
+   break;
+   case CLONE3_ARGS_SHADOW_STACK_SIZE_ONLY:
+   args.shadow_stack_size = getpagesize();
+   break;
+   case CLONE3_ARGS_SHADOW_STACK_POINTER_ONLY:
+   args.shadow_stack = shadow_stack;
+   break;
}
 
	memcpy(&args_ext.args, &args, sizeof(struct __clone_args));
@@ -167,6 +207,26 @@ static bool not_root(void)
return false;
 }
 
+static bool have_shadow_stack(void)
+{
+   if (shadow_stack_supported) {
+   ksft_print_msg("Shadow stack supported\n");
+   return true;
+   }
+
+   return false;
+}
+
+static bool no_shadow_stack(void)
+{
+   if (!shadow_stack_supported) {
+   ksft_print_msg("Shadow stack not supported\n");
+   return true;
+   }
+
+   return false;
+}
+
 static size_t page_size_plus_8(void)
 {
return getpagesize() + 8;
@@ -309,6 +369,42 @@ static const struct test tests[] = {
.expected = -EINVAL,
.test_mode = CLONE3_ARGS_NO_TEST,
},
+   {
+   .name = "Shadow stack on system with shadow stack",
+   .flags = 0,
+   .size = 0,
+   .expected = 0,
+   .e2big_valid = true,
+   .test_mode = CLONE3_ARGS_SHADOW_STACK,
+   .filter = no_shadow_stack,
+   },
+   {
+   .name = "Shadow stack with only size specified",
+   .flags = 0,
+   .size = 0,
+   .expected = -EINVAL,
+   .e2big_valid = true,
+   .test_mode = CLONE3_ARGS_SHADOW_STACK_SIZE_ONLY,
+   .filter = no_shadow_stack,
+   },
+   {
+   .name = "Shadow stack with only pointer specified",
+   .flags = 0,
+   .size = 0,
+   .expected = -EINVAL,
+   .e2big_valid = true,
+   .test_mode = CLONE3_ARGS_SHADOW_STACK_POINTER_ONLY,
+   .filter = no_shadow_stack,
+   },
+   {
+   .name = "Shadow stack on system without shadow stack",
+   .flags = 0,
+   .size = 0,
+   .expected = -EINVAL,
+   .e2big_valid = true,
+   

[PATCH RFC RFT 3/5] selftests/clone3: Factor more of main loop into test_clone3()

2023-10-23 Thread Mark Brown
In order to make it easier to add more configuration for the tests and
more support for runtime detection of when tests can be run pass the
structure describing the tests into test_clone3() rather than picking
the arguments out of it and have that function do all the per-test work.

No functional change.

Signed-off-by: Mark Brown 
---
 tools/testing/selftests/clone3/clone3.c | 77 -
 1 file changed, 37 insertions(+), 40 deletions(-)

diff --git a/tools/testing/selftests/clone3/clone3.c 
b/tools/testing/selftests/clone3/clone3.c
index 9429d361059e..afe383689a67 100644
--- a/tools/testing/selftests/clone3/clone3.c
+++ b/tools/testing/selftests/clone3/clone3.c
@@ -30,6 +30,19 @@ enum test_mode {
CLONE3_ARGS_INVAL_EXIT_SIGNAL_NSIG,
 };
 
+typedef bool (*filter_function)(void);
+typedef size_t (*size_function)(void);
+
+struct test {
+   const char *name;
+   uint64_t flags;
+   size_t size;
+   size_function size_function;
+   int expected;
+   enum test_mode test_mode;
+   filter_function filter;
+};
+
 static int call_clone3(uint64_t flags, size_t size, enum test_mode test_mode)
 {
struct __clone_args args = {
@@ -104,30 +117,40 @@ static int call_clone3(uint64_t flags, size_t size, enum 
test_mode test_mode)
return 0;
 }
 
-static bool test_clone3(uint64_t flags, size_t size, int expected,
-   enum test_mode test_mode)
+static void test_clone3(const struct test *test)
 {
+   size_t size;
int ret;
 
+   if (test->filter && test->filter()) {
+   ksft_test_result_skip("%s\n", test->name);
+   return;
+   }
+
+   if (test->size_function)
+   size = test->size_function();
+   else
+   size = test->size;
+
+   ksft_print_msg("Running test '%s'\n", test->name);
+
ksft_print_msg(
"[%d] Trying clone3() with flags %#" PRIx64 " (size %zu)\n",
-   getpid(), flags, size);
-   ret = call_clone3(flags, size, test_mode);
+   getpid(), test->flags, size);
+   ret = call_clone3(test->flags, size, test->test_mode);
ksft_print_msg("[%d] clone3() with flags says: %d expected %d\n",
-   getpid(), ret, expected);
-   if (ret != expected) {
+   getpid(), ret, test->expected);
+   if (ret != test->expected) {
ksft_print_msg(
"[%d] Result (%d) is different than expected (%d)\n",
-   getpid(), ret, expected);
-   return false;
+   getpid(), ret, test->expected);
+   ksft_test_result_fail("%s\n", test->name);
+   return;
}
 
-   return true;
+   ksft_test_result_pass("%s\n", test->name);
 }
 
-typedef bool (*filter_function)(void);
-typedef size_t (*size_function)(void);
-
 static bool not_root(void)
 {
if (getuid() != 0) {
@@ -143,16 +166,6 @@ static size_t page_size_plus_8(void)
return getpagesize() + 8;
 }
 
-struct test {
-   const char *name;
-   uint64_t flags;
-   size_t size;
-   size_function size_function;
-   int expected;
-   enum test_mode test_mode;
-   filter_function filter;
-};
-
 static const struct test tests[] = {
{
.name = "simple clone3()",
@@ -301,24 +314,8 @@ int main(int argc, char *argv[])
ksft_set_plan(ARRAY_SIZE(tests));
test_clone3_supported();
 
-   for (i = 0; i < ARRAY_SIZE(tests); i++) {
-   if (tests[i].filter && tests[i].filter()) {
-   ksft_test_result_skip("%s\n", tests[i].name);
-   continue;
-   }
-
-   if (tests[i].size_function)
-   size = tests[i].size_function();
-   else
-   size = tests[i].size;
-
-   ksft_print_msg("Running test '%s'\n", tests[i].name);
-
-   ksft_test_result(test_clone3(tests[i].flags, size,
-tests[i].expected,
-tests[i].test_mode),
-"%s\n", tests[i].name);
-   }
+   for (i = 0; i < ARRAY_SIZE(tests); i++)
+   test_clone3(&tests[i]);
 
ksft_finished();
 }

-- 
2.30.2



[PATCH RFC RFT 4/5] selftests/clone3: Allow tests to flag if -E2BIG is a valid error code

2023-10-23 Thread Mark Brown
The clone_args structure is extensible, with the syscall passing in the
length of the structure. Inside the kernel we use copy_struct_from_user()
to read the struct but this has the unfortunate side effect of silently
accepting some overrun in the structure size provided the extra data is
all zeros. This means that we can't discover the clone3() features that
the running kernel supports by simply probing with various struct sizes.
We need to check this for the benefit of test systems which run newer
kselftests on old kernels.

Add a flag which can be set on a test to indicate that clone3() may return
-E2BIG due to the use of newer struct versions. Currently no tests need
this, but it will become an issue for testing clone3() support for shadow
stacks; the support for shadow stacks is already present on x86.
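
As background for the above (not part of this patch), the size handling in
copy_struct_from_user() behaves roughly as in this simplified sketch, which
omits the zero-fill done for the usize < ksize case:

	/* Simplified model of copy_struct_from_user(dst, ksize, src, usize). */
	if (usize > ksize) {
		/* 1 if the tail is all zero, 0 if not, negative on fault */
		int rv = check_zeroed_user(src + ksize, usize - ksize);

		if (rv < 0)
			return rv;
		if (rv == 0)
			return -E2BIG;	/* non-zero bytes the kernel can't interpret */
	}
	return copy_from_user(dst, src, min(ksize, usize)) ? -EFAULT : 0;

This is why probing with ever larger sizes cannot detect new fields: an
older kernel still reports success as long as the unknown tail is zero.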

Signed-off-by: Mark Brown 
---
 tools/testing/selftests/clone3/clone3.c | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/tools/testing/selftests/clone3/clone3.c b/tools/testing/selftests/clone3/clone3.c
index afe383689a67..f1802db82e4e 100644
--- a/tools/testing/selftests/clone3/clone3.c
+++ b/tools/testing/selftests/clone3/clone3.c
@@ -39,6 +39,7 @@ struct test {
size_t size;
size_function size_function;
int expected;
+   bool e2big_valid;
enum test_mode test_mode;
filter_function filter;
 };
@@ -141,6 +142,11 @@ static void test_clone3(const struct test *test)
ksft_print_msg("[%d] clone3() with flags says: %d expected %d\n",
getpid(), ret, test->expected);
if (ret != test->expected) {
+   if (test->e2big_valid && ret == -E2BIG) {
+   ksft_print_msg("Test reported -E2BIG\n");
+   ksft_test_result_skip("%s\n", test->name);
+   return;
+   }
ksft_print_msg(
"[%d] Result (%d) is different than expected (%d)\n",
getpid(), ret, test->expected);

-- 
2.30.2



[PATCH RFC RFT 1/5] mm: Introduce ARCH_HAS_USER_SHADOW_STACK

2023-10-23 Thread Mark Brown
Since multiple architectures have support for shadow stacks and we need to
select support for this feature in several places in the generic code,
provide a generic config option that the architectures can select.

Suggested-by: David Hildenbrand 
Signed-off-by: Mark Brown 
---
 arch/x86/Kconfig   | 1 +
 fs/proc/task_mmu.c | 2 +-
 include/linux/mm.h | 2 +-
 mm/Kconfig | 6 ++
 4 files changed, 9 insertions(+), 2 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 66bfabae8814..2b72bae0d877 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1926,6 +1926,7 @@ config X86_USER_SHADOW_STACK
depends on AS_WRUSS
depends on X86_64
select ARCH_USES_HIGH_VMA_FLAGS
+   select ARCH_HAS_USER_SHADOW_STACK
select X86_CET
help
  Shadow stack protection is a hardware feature that detects function
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 3dd5be96691b..4101e741663a 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -697,7 +697,7 @@ static void show_smap_vma_flags(struct seq_file *m, struct vm_area_struct *vma)
 #ifdef CONFIG_HAVE_ARCH_USERFAULTFD_MINOR
[ilog2(VM_UFFD_MINOR)]  = "ui",
 #endif /* CONFIG_HAVE_ARCH_USERFAULTFD_MINOR */
-#ifdef CONFIG_X86_USER_SHADOW_STACK
+#ifdef CONFIG_ARCH_HAS_USER_SHADOW_STACK
[ilog2(VM_SHADOW_STACK)] = "ss",
 #endif
};
diff --git a/include/linux/mm.h b/include/linux/mm.h
index bf5d0b1b16f4..1728cb77540d 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -341,7 +341,7 @@ extern unsigned int kobjsize(const void *objp);
 #endif
 #endif /* CONFIG_ARCH_HAS_PKEYS */
 
-#ifdef CONFIG_X86_USER_SHADOW_STACK
+#ifdef CONFIG_ARCH_HAS_USER_SHADOW_STACK
 /*
  * VM_SHADOW_STACK should not be set with VM_SHARED because of lack of
  * support core mm.
diff --git a/mm/Kconfig b/mm/Kconfig
index 264a2df5ecf5..aaa2c353ea67 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -1258,6 +1258,12 @@ config LOCK_MM_AND_FIND_VMA
bool
depends on !STACK_GROWSUP
 
+config ARCH_HAS_USER_SHADOW_STACK
+   bool
+   help
+ The architecture has hardware support for userspace shadow call
+  stacks (eg, x86 CET, arm64 GCS, RISC-V Zisslpcfi).
+
 source "mm/damon/Kconfig"
 
 endmenu

-- 
2.30.2



[PATCH RFC RFT 2/5] fork: Add shadow stack support to clone3()

2023-10-23 Thread Mark Brown
Unlike with the normal stack there is no API for configuring the shadow
stack for a new thread, instead the kernel will dynamically allocate a new
shadow stack with the same size as the normal stack. This appears to be due
to the shadow stack series having been in development since before the more
extensible clone3() was added rather than anything more deliberate.

Add parameters to clone3() specifying the address and size of a shadow
stack for the newly created process; we validate that the range specified
is accessible to userspace but do not validate that it has been mapped
appropriately for use as a shadow stack (normally via map_shadow_stack()).
If the shadow stack is specified in this way then the caller is responsible
for freeing the memory as with the main stack. If no shadow stack is
specified then the existing implicit allocation and freeing behaviour is
maintained.

If the architecture does not support shadow stacks the shadow stack
parameters must be zero, architectures that do support the feature are
expected to have the same requirement on individual systems that lack
shadow stack support.

Update the existing x86 implementation to pay attention to the newly added
arguments, in order to maintain compatibility we use the existing behaviour
if no shadow stack is specified. Minimal validation is done of the supplied
parameters, detailed enforcement is left to when the thread is executed.
Since we are now using four fields from the kernel_clone_args we pass that
into the shadow stack code rather than individual fields.
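
For illustration only, a caller might end up using the new arguments
roughly as below; the field names shadow_stack and shadow_stack_size are
assumptions based on this description rather than a statement of the
final uAPI:

	/* Hypothetical sketch; the shadow stack field names are assumed. */
	struct clone_args args = {
		.flags             = CLONE_VM,
		.stack             = (__u64)(uintptr_t)stack,
		.stack_size        = stack_size,
		.shadow_stack      = (__u64)(uintptr_t)ssp, /* from map_shadow_stack() */
		.shadow_stack_size = ssp_size,
	};
	pid_t pid = syscall(__NR_clone3, &args, sizeof(args));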

Signed-off-by: Mark Brown 
---
 arch/x86/include/asm/shstk.h | 11 +++
 arch/x86/kernel/process.c|  2 +-
 arch/x86/kernel/shstk.c  | 36 +++-
 include/linux/sched/task.h   |  2 ++
 include/uapi/linux/sched.h   | 17 ++---
 kernel/fork.c| 40 ++--
 6 files changed, 93 insertions(+), 15 deletions(-)

diff --git a/arch/x86/include/asm/shstk.h b/arch/x86/include/asm/shstk.h
index 42fee8959df7..8be7b0a909c3 100644
--- a/arch/x86/include/asm/shstk.h
+++ b/arch/x86/include/asm/shstk.h
@@ -6,6 +6,7 @@
 #include 
 
 struct task_struct;
+struct kernel_clone_args;
 struct ksignal;
 
 #ifdef CONFIG_X86_USER_SHADOW_STACK
@@ -16,8 +17,8 @@ struct thread_shstk {
 
 long shstk_prctl(struct task_struct *task, int option, unsigned long arg2);
 void reset_thread_features(void);
-unsigned long shstk_alloc_thread_stack(struct task_struct *p, unsigned long clone_flags,
-  unsigned long stack_size);
+unsigned long shstk_alloc_thread_stack(struct task_struct *p,
+  const struct kernel_clone_args *args);
 void shstk_free(struct task_struct *p);
 int setup_signal_shadow_stack(struct ksignal *ksig);
 int restore_signal_shadow_stack(void);
@@ -26,8 +27,10 @@ static inline long shstk_prctl(struct task_struct *task, int option,
   unsigned long arg2) { return -EINVAL; }
 static inline void reset_thread_features(void) {}
 static inline unsigned long shstk_alloc_thread_stack(struct task_struct *p,
-unsigned long clone_flags,
-unsigned long stack_size) { return 0; }
+const struct kernel_clone_args *args)
+{
+   return 0;
+}
 static inline void shstk_free(struct task_struct *p) {}
 static inline int setup_signal_shadow_stack(struct ksignal *ksig) { return 0; }
 static inline int restore_signal_shadow_stack(void) { return 0; }
diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index b6f4e8399fca..a9ca80ea5056 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -207,7 +207,7 @@ int copy_thread(struct task_struct *p, const struct kernel_clone_args *args)
 * is disabled, new_ssp will remain 0, and fpu_clone() will know not to
 * update it.
 */
-   new_ssp = shstk_alloc_thread_stack(p, clone_flags, args->stack_size);
+   new_ssp = shstk_alloc_thread_stack(p, args);
if (IS_ERR_VALUE(new_ssp))
return PTR_ERR((void *)new_ssp);
 
diff --git a/arch/x86/kernel/shstk.c b/arch/x86/kernel/shstk.c
index 59e15dd8d0f8..3ae5c3d657dc 100644
--- a/arch/x86/kernel/shstk.c
+++ b/arch/x86/kernel/shstk.c
@@ -191,18 +191,44 @@ void reset_thread_features(void)
current->thread.features_locked = 0;
 }
 
-unsigned long shstk_alloc_thread_stack(struct task_struct *tsk, unsigned long clone_flags,
-  unsigned long stack_size)
+unsigned long shstk_alloc_thread_stack(struct task_struct *tsk,
+  const struct kernel_clone_args *args)
 {
	struct thread_shstk *shstk = &tsk->thread.shstk;
+   unsigned long clone_flags = args->flags;
unsigned long addr, size;
 
/*
 * If shadow 

[PATCH RFC RFT 0/5] fork: Support shadow stacks in clone3()

2023-10-23 Thread Mark Brown
The kernel has recently added support for shadow stacks, currently
only on x86 via its CET feature, but both arm64 and RISC-V have
equivalent features (GCS and Zisslpcfi respectively); I am actively
working on GCS[1].  With shadow stacks the hardware maintains an
additional stack containing only the return addresses for branch
instructions which is not generally writeable by userspace and ensures
that any returns are to the recorded addresses.  This provides some
protection against ROP attacks and makes it easier to collect call
stacks.  These shadow stacks are allocated in the address space of the
userspace process.

Our API for shadow stacks does not currently offer userspace any
flexibility for managing the allocation of shadow stacks for newly
created threads, instead the kernel allocates a new shadow stack with
the same size as the normal stack whenever a thread is created with the
feature enabled.  The stacks allocated in this way are freed by the
kernel when the thread exits or shadow stacks are disabled for the
thread.  This lack of flexibility and control isn't ideal; in the vast
majority of cases the shadow stack will be over allocated and the
implicit allocation and deallocation is not consistent with other
interfaces.  As far as I can tell the interface is done in this manner
mainly because the shadow stack patches were in development since before
clone3() was implemented.

Since clone3() is readily extensible let's add support for specifying a
shadow stack when creating a new thread or process in a similar manner
to how the normal stack is specified, keeping the current implicit
allocation behaviour if one is not specified either with clone3() or
through the use of clone().  When the shadow stack is specified
explicitly the kernel will not free it; the inconsistency with
implicitly allocated shadow stacks is a bit awkward but that's existing
ABI so we can't change it.

The memory provided must have been allocated for use as a shadow stack,
the expectation is that this will be done using the map_shadow_stack()
syscall.  I opted not to add validation for this in clone3() since it
will be enforced by hardware anyway.
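
As a sketch of the expected userspace flow (SHADOW_STACK_SET_TOKEN is the
flag the existing x86 uAPI uses; treat the details as illustrative):

	/* Allocate memory set up by the kernel for use as a shadow stack. */
	unsigned long ssp = (unsigned long)syscall(__NR_map_shadow_stack, 0,
						   ssp_size, SHADOW_STACK_SET_TOKEN);
	if (ssp == (unsigned long)-1)
		err(1, "map_shadow_stack");
	/* ssp and ssp_size would then be passed to clone3() as above. */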

Please note that the x86 portions of this code are build tested only, I
don't appear to have a system that can run CET available to me; I have
done testing with an integration into my pending work for GCS.  There is
some possibility that the arm64 implementation may require the use of
clone3() and explicit userspace allocation of shadow stacks; this is
still under discussion.

A new architecture feature Kconfig option for shadow stacks is added
here; this was suggested as part of the review comments for the arm64
GCS series and since we need to detect if shadow stacks are supported it
seemed sensible to roll it in here.

The selftest portions of this depend on 34dce23f7e40 ("selftests/clone3:
Report descriptive test names") in -next[2].

[1] https://lore.kernel.org/r/20231009-arm64-gcs-v6-0-78e55deaa...@kernel.org/
[2] 
https://lore.kernel.org/r/20231018-kselftest-clone3-output-v1-1-12b7c50ea...@kernel.org

Signed-off-by: Mark Brown 
---
Mark Brown (5):
  mm: Introduce ARCH_HAS_USER_SHADOW_STACK
  fork: Add shadow stack support to clone3()
  selftests/clone3: Factor more of main loop into test_clone3()
  selftests/clone3: Allow tests to flag if -E2BIG is a valid error code
  kselftest/clone3: Test shadow stack support

 arch/x86/Kconfig  |   1 +
 arch/x86/include/asm/shstk.h  |  11 +-
 arch/x86/kernel/process.c |   2 +-
 arch/x86/kernel/shstk.c   |  36 -
 fs/proc/task_mmu.c|   2 +-
 include/linux/mm.h|   2 +-
 include/linux/sched/task.h|   2 +
 include/uapi/linux/sched.h|  17 +-
 kernel/fork.c |  40 -
 mm/Kconfig|   6 +
 tools/testing/selftests/clone3/clone3.c   | 180 +-
 tools/testing/selftests/clone3/clone3_selftests.h |   5 +
 12 files changed, 247 insertions(+), 57 deletions(-)
---
base-commit: 80ab9b52e8d4add7735abdfb935877354b69edb6
change-id: 20231019-clone3-shadow-stack-15d40d2bf536

Best regards,
-- 
Mark Brown 



[PATCH v2] selftests:net change ifconfig with ip command

2023-10-23 Thread Swarup Laxman Kotiaklapudi
Replace ifconfig with the ip command;
on a system where ifconfig is not
available, this script will not
work correctly.

Test result with this patchset:

sudo make TARGETS="net" kselftest

TAP version 13
1..1
 timeout set to 1500
 selftests: net: route_localnet.sh
 run arp_announce test
 net.ipv4.conf.veth0.route_localnet = 1
 net.ipv4.conf.veth1.route_localnet = 1
 net.ipv4.conf.veth0.arp_announce = 2
 net.ipv4.conf.veth1.arp_announce = 2
 PING 127.25.3.14 (127.25.3.14) from 127.25.3.4 veth0: 56(84) bytes of data.
 64 bytes from 127.25.3.14: icmp_seq=1 ttl=64 time=0.038 ms
 64 bytes from 127.25.3.14: icmp_seq=2 ttl=64 time=0.068 ms
 64 bytes from 127.25.3.14: icmp_seq=3 ttl=64 time=0.068 ms
 64 bytes from 127.25.3.14: icmp_seq=4 ttl=64 time=0.068 ms
 64 bytes from 127.25.3.14: icmp_seq=5 ttl=64 time=0.068 ms

 --- 127.25.3.14 ping statistics ---
 5 packets transmitted, 5 received, 0% packet loss, time 4073ms
 rtt min/avg/max/mdev = 0.038/0.062/0.068/0.012 ms
 ok
 run arp_ignore test
 net.ipv4.conf.veth0.route_localnet = 1
 net.ipv4.conf.veth1.route_localnet = 1
 net.ipv4.conf.veth0.arp_ignore = 3
 net.ipv4.conf.veth1.arp_ignore = 3
 PING 127.25.3.14 (127.25.3.14) from 127.25.3.4 veth0: 56(84) bytes of data.
 64 bytes from 127.25.3.14: icmp_seq=1 ttl=64 time=0.032 ms
 64 bytes from 127.25.3.14: icmp_seq=2 ttl=64 time=0.065 ms
 64 bytes from 127.25.3.14: icmp_seq=3 ttl=64 time=0.066 ms
 64 bytes from 127.25.3.14: icmp_seq=4 ttl=64 time=0.065 ms
 64 bytes from 127.25.3.14: icmp_seq=5 ttl=64 time=0.065 ms

 --- 127.25.3.14 ping statistics ---
 5 packets transmitted, 5 received, 0% packet loss, time 4092ms
 rtt min/avg/max/mdev = 0.032/0.058/0.066/0.013 ms
 ok
ok 1 selftests: net: route_localnet.sh
...

Signed-off-by: Swarup Laxman Kotiaklapudi 
---
 tools/testing/selftests/net/route_localnet.sh | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/tools/testing/selftests/net/route_localnet.sh b/tools/testing/selftests/net/route_localnet.sh
index 116bfeab72fa..e08701c750e3 100755
--- a/tools/testing/selftests/net/route_localnet.sh
+++ b/tools/testing/selftests/net/route_localnet.sh
@@ -18,8 +18,10 @@ setup() {
 ip route del 127.0.0.0/8 dev lo table local
 ip netns exec "${PEER_NS}" ip route del 127.0.0.0/8 dev lo table local
 
-ifconfig veth0 127.25.3.4/24 up
-ip netns exec "${PEER_NS}" ifconfig veth1 127.25.3.14/24 up
+ip address add 127.25.3.4/24 dev veth0
+ip link set dev veth0 up
+ip netns exec "${PEER_NS}" ip address add 127.25.3.14/24 dev veth1
+ip netns exec "${PEER_NS}" ip link set dev veth1 up
 
 ip route flush cache
 ip netns exec "${PEER_NS}" ip route flush cache
-- 
2.34.1



Re: [PATCH v3 2/3] userfaultfd: UFFDIO_MOVE uABI

2023-10-23 Thread David Hildenbrand

Focusing on validate_remap_areas():


+
+static int validate_remap_areas(struct vm_area_struct *src_vma,
+   struct vm_area_struct *dst_vma)
+{
+   /* Only allow remapping if both have the same access and protection */
+   if ((src_vma->vm_flags & VM_ACCESS_FLAGS) != (dst_vma->vm_flags & VM_ACCESS_FLAGS) ||
+   pgprot_val(src_vma->vm_page_prot) != pgprot_val(dst_vma->vm_page_prot))
+   return -EINVAL;


Makes sense. I do wonder about pkey and friends and if we even have to
do anything special.



+
+   /* Only allow remapping if both are mlocked or both aren't */
+   if ((src_vma->vm_flags & VM_LOCKED) != (dst_vma->vm_flags & VM_LOCKED))
+   return -EINVAL;
+
+   if (!(src_vma->vm_flags & VM_WRITE) || !(dst_vma->vm_flags & VM_WRITE))
+   return -EINVAL;


Why does one of both need VM_WRITE? If one really needs it, then the 
destination (where we're moving stuff to).



+
+   /*
+* Be strict and only allow remap_pages if either the src or
+* dst range is registered in the userfaultfd to prevent
+* userland errors going unnoticed. As far as the VM
+* consistency is concerned, it would be perfectly safe to
+* remove this check, but there's no useful usage for
+* remap_pages outside of userfaultfd registered ranges. This
+* is after all why it is an ioctl belonging to the
+* userfaultfd and not a syscall.


I think the last sentence is the important bit and the comment can 
likely be reduced.



+*
+* Allow both vmas to be registered in the userfaultfd, just
+* in case somebody finds a way to make such a case useful.
+* Normally only one of the two vmas would be registered in
+* the userfaultfd.


Should we just check the destination? That makes most sense to me, 
because with uffd we are resolving uffd-events. And just like 
copy/zeropage we want to resolve a page fault ("userfault") of a 
non-present page on the destination.




+*/
+   if (!dst_vma->vm_userfaultfd_ctx.ctx &&
+   !src_vma->vm_userfaultfd_ctx.ctx)
+   return -EINVAL;





+
+   /*
+* FIXME: only allow remapping across anonymous vmas,
+* tmpfs should be added.
+*/
+   if (!vma_is_anonymous(src_vma) || !vma_is_anonymous(dst_vma))
+   return -EINVAL;


Why a FIXME here? Just drop the comment completely or replace it with 
"We only allow to remap anonymous folios accross anonymous VMAs".



+
+   /*
+* Ensure the dst_vma has a anon_vma or this page
+* would get a NULL anon_vma when moved in the
+* dst_vma.
+*/
+   if (unlikely(anon_vma_prepare(dst_vma)))
+   return -ENOMEM;


Makes sense.


+
+   return 0;
+}



--
Cheers,

David / dhildenb



Re: [PATCH v6 8/8] iommu/vt-d: Disallow read-only mappings to nest parent domain

2023-10-23 Thread Baolu Lu

On 2023/10/23 19:15, Liu, Yi L wrote:

I would also prefer to introduce is_nested_parent_domain to the user
domain allocation patch (patch 7/8). This field should be checked when
allocating a nested user domain.

Actually, no need. This should be a common check, so iommufd core already
has the check. So the parent should be a nest parent domain, otherwise already
returned in iommufd.

+   if (!parent->nest_parent)
+   return ERR_PTR(-EINVAL);


I know this will not cause errors in the code. But since you are
introducing the is_parent property in the vt-d driver, the integrity of
the property should be ensured. In this way, the code will be more
readable and maintainable.

Best regards,
baolu


Re: [PATCH v3 2/3] userfaultfd: UFFDIO_MOVE uABI

2023-10-23 Thread David Hildenbrand

On 22.10.23 17:46, Peter Xu wrote:

On Fri, Oct 20, 2023 at 07:16:19PM +0200, David Hildenbrand wrote:

These are rather the vibes I'm getting from Peter. "Why rename it, could
confuse people because the original patches are old", "Why exclude it if it
has been included in the original patches". Not the kind of reasoning I can
relate to when it comes to upstreaming some patches.


You can't blame anyone if you misunderstood and biased the question.

The first question is definitely valid, even until now.  You guys still
prefer to rename it, which I'm totally fine with.

The 2nd question is wrong from your interpretation.  That's not my point,
at least not starting from a few replies already.  What I was asking for is
why such page movement between mm is dangerous.  I don't think I get solid
answers even until now.

Noticing "memcg is missing" is not an argument for "cross-mm is dangerous",
it's a review comment.  Suren can address that.

You'll propose a new feature that may tag an mm is not an argument either,
if it's not merged yet.  We can also address that depending on what it is,
also on which lands earlier.

It'll be good to discuss these details even in a single-mm support.  Anyone
would like to add that can already refer to discussion in this thread.

I hope I'm clear.



I said everything I had to say, go read what I wrote.

--
Cheers,

David / dhildenb



RE: [PATCH v6 8/8] iommu/vt-d: Disallow read-only mappings to nest parent domain

2023-10-23 Thread Liu, Yi L
> From: Baolu Lu 
> Sent: Saturday, October 21, 2023 11:24 AM
> 
> On 10/20/23 5:32 PM, Yi Liu wrote:
> > From: Lu Baolu 
> >
> > When remapping hardware is configured by system software in scalable mode
> > as Nested (PGTT=011b) and with PWSNP field Set in the PASID-table-entry,
> > it may Set Accessed bit and Dirty bit (and Extended Access bit if enabled)
> > in first-stage page-table entries even when second-stage mappings indicate
> > that corresponding first-stage page-table is Read-Only.
> >
> > As the result, contents of pages designated by VMM as Read-Only can be
> > modified by IOMMU via PML5E (PML4E for 4-level tables) access as part of
> > address translation process due to DMAs issued by Guest.
> >
> > This disallows read-only mappings in the domain that is supposed to be used
> > as nested parent. Reference from Sapphire Rapids Specification Update [1],
> > errata details, SPR17. Userspace should know this limitation by checking
> > the IOMMU_HW_INFO_VTD_ERRATA_772415_SPR17 flag reported in the IOMMU_GET_HW_INFO ioctl.
> >
> > [1] https://www.intel.com/content/www/us/en/content-details/772415/content-details.html
> >
> > Reviewed-by: Kevin Tian 
> > Signed-off-by: Lu Baolu 
> > Signed-off-by: Yi Liu 
> > ---
> >   drivers/iommu/intel/iommu.c  |  9 +
> >   drivers/iommu/intel/iommu.h  |  1 +
> >   include/uapi/linux/iommufd.h | 12 +++-
> >   3 files changed, 21 insertions(+), 1 deletion(-)
> >
> > diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
> > index c7704e7efd4a..a0341a069fbf 100644
> > --- a/drivers/iommu/intel/iommu.c
> > +++ b/drivers/iommu/intel/iommu.c
> > @@ -2193,6 +2193,11 @@ __domain_mapping(struct dmar_domain *domain, unsigned long iov_pfn,
> > if ((prot & (DMA_PTE_READ|DMA_PTE_WRITE)) == 0)
> > return -EINVAL;
> >
> > +   if (!(prot & DMA_PTE_WRITE) && domain->is_nested_parent) {
> > +   pr_err_ratelimited("Read-only mapping is disallowed on the domain which serves as the parent in a nested configuration, due to HW errata (ERRATA_772415_SPR17)\n");
> > +   return -EINVAL;
> > +   }
> > +
> > attr = prot & (DMA_PTE_READ | DMA_PTE_WRITE | DMA_PTE_SNP);
> > attr |= DMA_FL_PTE_PRESENT;
> > if (domain->use_first_level) {
> > @@ -4101,6 +4106,9 @@ intel_iommu_domain_alloc_user(struct device *dev, u32 flags,
> > domain = iommu_domain_alloc(dev->bus);
> > if (!domain)
> > return ERR_PTR(-ENOMEM);
> > +   container_of(domain,
> > +struct dmar_domain,
> > +domain)->is_nested_parent = request_nest_parent;
> 
> How about
>   to_dmar_domain(domain)->is_nested_parent = ...;
> ?

Yes.

> 
> I would also prefer to introduce is_nested_parent_domain to the user
> domain allocation patch (patch 7/8). This field should be checked when
> allocating a nested user domain.

Actually, no need. This should be a common check, so iommufd core already
has the check. So the parent should be a nest parent domain, otherwise already
returned in iommufd.

+   if (!parent->nest_parent)
+   return ERR_PTR(-EINVAL);

https://lore.kernel.org/linux-iommu/20231020091946.12173-8-yi.l@intel.com/

> diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
> index 8f81a5c9fcc0..d3f6bc1f6590 100644
> --- a/drivers/iommu/intel/iommu.c
> +++ b/drivers/iommu/intel/iommu.c
> @@ -4121,6 +4121,8 @@ intel_iommu_domain_alloc_user(struct device *dev, u32 flags,
>  return ERR_PTR(-EINVAL);
>  if (request_nest_parent)
>  return ERR_PTR(-EINVAL);
> +   if (!to_dmar_domain(parent)->is_nested_parent)
> +   return ERR_PTR(-EINVAL);
> 
>  return intel_nested_domain_alloc(parent, user_data);
>   }
> 
> Best regards,
> baolu


RE: [PATCH v6 7/8] iommu/vt-d: Add nested domain allocation

2023-10-23 Thread Liu, Yi L
> From: Baolu Lu 
> Sent: Saturday, October 21, 2023 11:08 AM
> 
> On 10/20/23 5:32 PM, Yi Liu wrote:
> > From: Lu Baolu 
> >
> > This adds the support for IOMMU_HWPT_DATA_VTD_S1 type.
> >
> > Signed-off-by: Lu Baolu 
> > Signed-off-by: Yi Liu 
> > ---
> >   drivers/iommu/intel/iommu.c | 39 ++---
> >   1 file changed, 28 insertions(+), 11 deletions(-)
> >
> > diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
> > index b47025fbdea4..c7704e7efd4a 100644
> > --- a/drivers/iommu/intel/iommu.c
> > +++ b/drivers/iommu/intel/iommu.c
> > @@ -4076,7 +4076,7 @@ intel_iommu_domain_alloc_user(struct device *dev, u32 flags,
> >   struct iommu_domain *parent,
> >   const struct iommu_user_data *user_data)
> >   {
> > -   struct iommu_domain *domain;
> > +   bool request_nest_parent = flags & IOMMU_HWPT_ALLOC_NEST_PARENT;
> > struct intel_iommu *iommu;
> >
> > if (flags & (~IOMMU_HWPT_ALLOC_NEST_PARENT))
> > @@ -4086,18 +4086,35 @@ intel_iommu_domain_alloc_user(struct device *dev, u32 flags,
> > if (!iommu)
> > return ERR_PTR(-ENODEV);
> >
> > -   if ((flags & IOMMU_HWPT_ALLOC_NEST_PARENT) && !ecap_nest(iommu-
> >ecap))
> > +   if (!user_data) { /* Must be PAGING domain */
> > +   struct iommu_domain *domain;
> > +
> > +   if (request_nest_parent && !ecap_nest(iommu->ecap))
> 
> Hardware capability is not sufficient. How about adding below helper:

Yes. But sm_supported() seems a bit confusing, doesn't it? Actually, it's
a combined check of the SM capability and the software choice (use SM or
not). How about renaming it to sm_enabled()? Although the most accurate
enable-status check is to read the SM bit in the root table address
register. No hurry in this cycle, but it may be nice to have.

> diff --git a/drivers/iommu/intel/iommu.h b/drivers/iommu/intel/iommu.h
> index b5f33a7c1973..b04bbabcd696 100644
> --- a/drivers/iommu/intel/iommu.h
> +++ b/drivers/iommu/intel/iommu.h
> @@ -540,6 +540,8 @@ enum {
>   #define sm_supported(iommu)	(intel_iommu_sm && ecap_smts((iommu)->ecap))
>   #define pasid_supported(iommu) (sm_supported(iommu) && \
>   ecap_pasid((iommu)->ecap))
> +#define nested_supported(iommu)	(sm_supported(iommu) && \
> +				 ecap_nest((iommu)->ecap))
> 
>   struct pasid_entry;
>   struct pasid_state_entry;
> 
> And, use nested_supported() here and bleow
> 
> > +   return ERR_PTR(-EOPNOTSUPP);
> > +   if (parent)
> > +   return ERR_PTR(-EINVAL);
> > +   /*
> > +* domain_alloc_user op needs to fully initialize a domain
> > +* before return, so uses iommu_domain_alloc() here for
> > +* simple.
> > +*/
> > +   domain = iommu_domain_alloc(dev->bus);
> > +   if (!domain)
> > +   return ERR_PTR(-ENOMEM);
> > +   return domain;
> > +   }
> > +
> > +   /* Must be nested domain */
> > +   if (!ecap_nest(iommu->ecap))
> 
> ...here.
> 
> > +   return ERR_PTR(-EOPNOTSUPP);
> > +   if (user_data->type != IOMMU_HWPT_DATA_VTD_S1)
> > return ERR_PTR(-EOPNOTSUPP);
> > +   if (!parent || parent->ops != intel_iommu_ops.default_domain_ops)
> > +   return ERR_PTR(-EINVAL);
> > +   if (request_nest_parent)
> > +   return ERR_PTR(-EINVAL);
> >
> > -   /*
> > -* domain_alloc_user op needs to fully initialize a domain
> > -* before return, so uses iommu_domain_alloc() here for
> > -* simple.
> > -*/
> > -   domain = iommu_domain_alloc(dev->bus);
> > -   if (!domain)
> > -   domain = ERR_PTR(-ENOMEM);
> > -   return domain;
> > +   return intel_nested_domain_alloc(parent, user_data);
> >   }
> >
> >   static void intel_iommu_domain_free(struct iommu_domain *domain)
> 
> Best regards,
> baolu


RE: [PATCH v6 3/8] iommu/vt-d: Add helper for nested domain allocation

2023-10-23 Thread Liu, Yi L
> From: Baolu Lu 
> Sent: Friday, October 20, 2023 7:49 PM
> 
> On 2023/10/20 17:32, Yi Liu wrote:
> > From: Lu Baolu 
> >
> > This adds helper for accepting user parameters and allocate a nested
> > domain.
> >
> > Reviewed-by: Kevin Tian 
> > Signed-off-by: Jacob Pan 
> > Signed-off-by: Lu Baolu 
> > Signed-off-by: Yi Liu 
> > ---
> >   drivers/iommu/intel/Makefile |  2 +-
> >   drivers/iommu/intel/iommu.h  |  2 ++
> >   drivers/iommu/intel/nested.c | 55 
> >   3 files changed, 58 insertions(+), 1 deletion(-)
> >   create mode 100644 drivers/iommu/intel/nested.c
> >
> > diff --git a/drivers/iommu/intel/Makefile b/drivers/iommu/intel/Makefile
> > index 7af3b8a4f2a0..5dabf081a779 100644
> > --- a/drivers/iommu/intel/Makefile
> > +++ b/drivers/iommu/intel/Makefile
> > @@ -1,6 +1,6 @@
> >   # SPDX-License-Identifier: GPL-2.0
> >   obj-$(CONFIG_DMAR_TABLE) += dmar.o
> > -obj-$(CONFIG_INTEL_IOMMU) += iommu.o pasid.o
> > +obj-$(CONFIG_INTEL_IOMMU) += iommu.o pasid.o nested.o
> >   obj-$(CONFIG_DMAR_TABLE) += trace.o cap_audit.o
> >   obj-$(CONFIG_DMAR_PERF) += perf.o
> >   obj-$(CONFIG_INTEL_IOMMU_DEBUGFS) += debugfs.o
> > diff --git a/drivers/iommu/intel/iommu.h b/drivers/iommu/intel/iommu.h
> > index a91077a063ee..ff55184456dd 100644
> > --- a/drivers/iommu/intel/iommu.h
> > +++ b/drivers/iommu/intel/iommu.h
> > @@ -866,6 +866,8 @@ void *alloc_pgtable_page(int node, gfp_t gfp);
> >   void free_pgtable_page(void *vaddr);
> >   void iommu_flush_write_buffer(struct intel_iommu *iommu);
> >   struct intel_iommu *device_to_iommu(struct device *dev, u8 *bus, u8 *devfn);
> > +struct iommu_domain *intel_nested_domain_alloc(struct iommu_domain *s2_domain,
> > +						  const struct iommu_user_data *user_data);
> >
> >   #ifdef CONFIG_INTEL_IOMMU_SVM
> >   void intel_svm_check(struct intel_iommu *iommu);
> > diff --git a/drivers/iommu/intel/nested.c b/drivers/iommu/intel/nested.c
> > new file mode 100644
> > index ..5a2920a98e47
> > --- /dev/null
> > +++ b/drivers/iommu/intel/nested.c
> > @@ -0,0 +1,55 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +/*
> > + * nested.c - nested mode translation support
> > + *
> > + * Copyright (C) 2023 Intel Corporation
> > + *
> > + * Author: Lu Baolu 
> > + * Jacob Pan 
> > + * Yi Liu 
> > + */
> > +
> > +#define pr_fmt(fmt)"DMAR: " fmt
> > +
> > +#include 
> > +
> > +#include "iommu.h"
> > +
> > +static void intel_nested_domain_free(struct iommu_domain *domain)
> > +{
> > +   kfree(to_dmar_domain(domain));
> > +}
> > +
> > +static const struct iommu_domain_ops intel_nested_domain_ops = {
> > +   .free   = intel_nested_domain_free,
> > +};
> > +
> > +struct iommu_domain *intel_nested_domain_alloc(struct iommu_domain *s2_domain,
> > +						  const struct iommu_user_data *user_data)
> > +{
> > +   struct iommu_hwpt_vtd_s1 vtd;
> > +   struct dmar_domain *domain;
> > +   int ret;
> > +
> > +   ret = iommu_copy_struct_from_user(&vtd, user_data,
> > + IOMMU_HWPT_DATA_VTD_S1, __reserved);
> > +   if (ret)
> > +   return ERR_PTR(ret);
> > +
> > +   domain = kzalloc(sizeof(*domain), GFP_KERNEL_ACCOUNT);
> > +   if (!domain)
> > +   return NULL;
> 
>   return ERR_PTR(-ENOMEM);
> ?

Yes, good catch!

> > +
> > +   domain->use_first_level = true;
> > +   domain->s2_domain = to_dmar_domain(s2_domain);
> > +   domain->s1_pgtbl = vtd.pgtbl_addr;
> > +   domain->s1_cfg = vtd;
> > +   domain->domain.ops = &intel_nested_domain_ops;
> > +   domain->domain.type = IOMMU_DOMAIN_NESTED;
> > +   INIT_LIST_HEAD(&domain->devices);
> > +   INIT_LIST_HEAD(&domain->dev_pasids);
> > +   spin_lock_init(&domain->lock);
> > +   xa_init(&domain->iommu_array);
> > +
> > +   return &domain->domain;
> > +}
> 
> Best regards,
> baolu


Re: [PATCH] Incorrect backtracking for load/store or atomic ops

2023-10-23 Thread Tao Lyu
>    
> On Fri, Oct 20, 2023 at 9:06 AM Tao Lyu  wrote:
>>
>> Hi,
>>
>> I found the backtracking logic of the eBPF verifier is flawed
>> when meeting 1) normal load and store instruction or
>> 2) atomic memory instructions.
>>
>> # Normal load and store
>>
>> Here, I show one case about the normal load and store instructions,
>> which can be exploited to achieve arbitrary read and write with two 
>> requirements:
>> 1) The uploading program should have at least CAP_BPF, which is required for 
>> most eBPF applications.
>> 2) Disable CPU mitigations by adding "mitigations=off" in the kernel booting 
>> command line. Otherwise,
>> the Spectre mitigation in the eBPF verifier will prevent exploitation.
>>
>>                1: r3 = r10 (stack pointer)
>>                3: if cond
>>               /            \
>>              /              \
>>   4: *(u64 *)(r3 -120) = 200   6: *(u64 *)(r3 -120) = arbitrary offset to r2
>>      verification state 1         verification state 2 (prune point)
>>              \              /
>>               \            /
>>                7: r6 = *(u64 *)(r1 -120)
>>                   ...
>>               17: r7 = a map pointer
>>               18: r7 += r6
>>      // Out-of-bound access from the right side path
>>
>> Give an eBPF program (tools/testing/selftests/bpf/test_precise.c)
>> whose simplified control flow graph looks like the above.
>> When the verifier goes through the first (left-side) path and reaches insn 
>> 18,
>> it will backtrack on register 6 like below.
>>
>> 18: (0f) r7 += r6
>> mark_precise: frame0: last_idx 18 first_idx 17 subseq_idx -1
>> mark_precise: frame0: regs=r6 stack= before 17: (bf) r7 = r0
>> ...
>> mark_precise: frame0: regs=r6 stack= before 7: (79) r6 = *(u64 *)(r3 -120)
>>
>> However, the backtracking process is problematic when it reaches insn 7.
>> Insn 7 is to load a value from the stack, but the stack pointer is 
>> represented by r3 instead of r10.
>> ** In this case, the verifier (as shown below) will reset the precision on 
>> r6 and not mark the precision on the stack. **
>> Afterward, the backtracking finishes without annotating any registers in any 
>> verifier states.
>>
>> else if (class == BPF_LDX) {
>> if (!bt_is_reg_set(bt, dreg))
>> return 0;
>> bt_clear_reg(bt, dreg);
>> if (insn->src_reg != BPF_REG_FP)
>> return 0;
>> ...
>>    }
>>
>> Finally, when the second (left-side) path reaches insn 7 again,
>> it will compare the verifier states with the previous one.
>> However, it realizes these two states are equal because no precision is on 
>> r6,
>> thus the eBPF program can easily pass the verifier
>> although the second path contains an invalid access offset.
>> We have successfully exploited this bug for getting the root privilege.
>> If needed, we can share the exploitation.
>> BTW, using similar instructions in a sub_prog can also trigger an 
>> assertion in the verifier:
>> "[ 1510.165537] verifier backtracking bug
>> [ 1510.165582] WARNING: CPU: 2 PID: 382 at kernel/bpf/verifier.c:3626 
>> __mark_chain_precision+0x4568/0x4e50"
>>
>>
>>
>> IMO, to fully patch this bug, we need to know whether the insn->src_reg is 
>> an alias of BPF_REG_FP.
>
> Yes!
>
>> However, it might need too much code addition.
>
> No :) I don't think it's a lot of code. I've been meaning to tackle
> this for a while, but perhaps the time is now.
>
> The plan is to use jmp_history to record an extra flags for some
> instructions (even if they are not jumps). Like in this case, we can
> remember for LDX and STX instructions that they were doing register
> load/spill, and take that into account during backtrack_insn() without
> having to guess based on r10.
>
> I have part of this ready locally, I'll try to finish this up in a
> next week or two. Stay tuned (unless you want to tackle that
> yourself).

Great! I'll keep focus on this.

>
>> Or we just do not clear the precision on the src register.
>>
>> # Atomic memory instructions
>>
>> Then, I show that the backtracking on atomic load and store is also flawed.
>> As shown below, when the backtrack_insn() function in the verifier meets 
>> store instructions,
>> it checks if the stack slot is set with precision or not. If not, just 
>> return.
>>
>> if (!bt_is_slot_set(bt, spi))
>> return 0;
>> bt_clear_slot(bt, spi);
>> if (class == BPF_STX)
>> bt_set_reg(bt, sreg);
>>
>> Assume we have an atomic_fetch_or instruction 
>> (tools/testing/selftests/bpf/verifier/atomic_precision.c) shown below.
>>
>> 7: 

Re: [PATCH] selftests:net change ifconfig with ip command

2023-10-23 Thread Jiri Pirko
Sun, Oct 22, 2023 at 07:50:52PM CEST, dsah...@kernel.org wrote:
>On 10/22/23 5:31 AM, Swarup Laxman Kotiaklapudi wrote:
>> diff --git a/tools/testing/selftests/net/route_localnet.sh b/tools/testing/selftests/net/route_localnet.sh
>> index 116bfeab72fa..3ab9beb4462c 100755
>> --- a/tools/testing/selftests/net/route_localnet.sh
>> +++ b/tools/testing/selftests/net/route_localnet.sh
>> @@ -18,8 +18,10 @@ setup() {
>>  ip route del 127.0.0.0/8 dev lo table local
>>  ip netns exec "${PEER_NS}" ip route del 127.0.0.0/8 dev lo table local
>>  
>> -ifconfig veth0 127.25.3.4/24 up
>> -ip netns exec "${PEER_NS}" ifconfig veth1 127.25.3.14/24 up
>> +ip a add 127.25.3.4/24 dev veth0
>
>ip addr add ...

Why not "address" then? :)
What's wrong with "a"?


>
>> +ip link set dev veth0 up
>> +ip netns exec "${PEER_NS}" ip a add 127.25.3.14/24 dev veth1
>
>ip addr add ...
>
>> +ip netns exec "${PEER_NS}" ip link set dev veth1 up
>>  
>>  ip route flush cache
>>  ip netns exec "${PEER_NS}" ip route flush cache
>
>