Re: [PATCH bpf-next 0/2] bpf: add csum/ip_summed fields to __sk_buff

2024-01-02 Thread Menglong Dong
On Wed, Jan 3, 2024 at 11:55 AM Yonghong Song  wrote:
>
>
> On 1/2/24 6:54 PM, Menglong Dong wrote:
> > On Wed, Jan 3, 2024 at 8:52 AM Martin KaFai Lau  
> > wrote:
> >> On 1/2/24 10:11 AM, Stanislav Fomichev wrote:
> >>> On 12/29, Menglong Dong wrote:
>  For now, we have to call some helpers when we need to update the csum,
>  such as bpf_l4_csum_replace, bpf_l3_csum_replace, etc. These helpers are
>  not inlined, which causes poor performance.
> 
>  In fact, we can define our own csum update functions in BPF program
>  instead of bpf_l3_csum_replace; these are fully inlined and efficient.
>  However, we can't do this for bpf_l4_csum_replace for now, as we can't
>  update skb->csum, which can leave skb->csum invalid in the rx path with
>  CHECKSUM_COMPLETE mode.
> 
>  What's more, we can't use the direct data access and have to use
>  skb_store_bytes() with the BPF_F_RECOMPUTE_CSUM flag in some cases, such
>  as modifying the vni in the vxlan header when the underlay udp header has
>  no checksum.
> >> There is bpf_csum_update(), does it work?
> >> A helper call should be acceptable compared with the csum calculation
> >> itself.
> > Yeah, this helper works in this case! Now we are missing the last
> > piece for the tx path: ip_summed. We need to know whether it is
> > CHECKSUM_PARTIAL to decide if we should update the
> > csum in the packet. In the tx path, the L4 csum holds only the
> > pseudo-header checksum when skb->ip_summed is CHECKSUM_PARTIAL.
> >
> > Maybe we can introduce a lightweight kfunc to get its
> > value? Such as bpf_skb_csum_mode(). As we only need to call
> > it once, its overhead should be negligible.
>
> You don't need a kfunc, you can do a check like
>struct sk_buff *kskb = bpf_cast_to_kern_ctx(skb);
>if (kskb->ip_summed == CHECKSUM_PARTIAL) ...
>...
>

Great, this is exactly what I need! Thanks~
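
For reference, a minimal tc-BPF sketch of what the approach discussed above
could look like (not taken from the thread; it assumes a clang/libbpf build
with vmlinux.h, and restates the CHECKSUM_PARTIAL/TC_ACT_OK constants and the
checksum delta purely for illustration):

   #include "vmlinux.h"
   #include <bpf/bpf_helpers.h>

   /* Kernel macros are not part of BTF, so restate them here. */
   #define CHECKSUM_PARTIAL 3   /* include/linux/skbuff.h */
   #define TC_ACT_OK        0   /* include/uapi/linux/pkt_cls.h */

   extern void *bpf_cast_to_kern_ctx(void *obj) __ksym;

   SEC("tc")
   int csum_fixup(struct __sk_buff *skb)
   {
           struct sk_buff *kskb = bpf_cast_to_kern_ctx(skb);
           __wsum diff = 0; /* csum delta from the program's own rewrite */

           /* tx path: the L4 csum only holds the pseudo-header and the
            * hardware finishes it, so the packet csum needs no fixup here.
            */
           if (kskb->ip_summed == CHECKSUM_PARTIAL)
                   return TC_ACT_OK;

           /* rx path (e.g. CHECKSUM_COMPLETE): after rewriting packet
            * bytes, feed the same delta into skb->csum to keep it valid.
            */
           bpf_csum_update(skb, diff);
           return TC_ACT_OK;
   }

   char _license[] SEC("license") = "GPL";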



Re: [PATCH bpf-next 0/2] bpf: add csum/ip_summed fields to __sk_buff

2024-01-02 Thread Yonghong Song



On 1/2/24 6:54 PM, Menglong Dong wrote:

On Wed, Jan 3, 2024 at 8:52 AM Martin KaFai Lau  wrote:

On 1/2/24 10:11 AM, Stanislav Fomichev wrote:

On 12/29, Menglong Dong wrote:

For now, we have to call some helpers when we need to update the csum,
such as bpf_l4_csum_replace, bpf_l3_csum_replace, etc. These helpers are
not inlined, which causes poor performance.

In fact, we can define our own csum update functions in BPF program
instead of bpf_l3_csum_replace; these are fully inlined and efficient.
However, we can't do this for bpf_l4_csum_replace for now, as we can't
update skb->csum, which can leave skb->csum invalid in the rx path with
CHECKSUM_COMPLETE mode.

What's more, we can't use the direct data access and have to use
skb_store_bytes() with the BPF_F_RECOMPUTE_CSUM flag in some cases, such
as modifying the vni in the vxlan header when the underlay udp header has
no checksum.

There is bpf_csum_update(), does it work?
A helper call should be acceptable compared with the csum calculation itself.

Yeah, this helper works in this case! Now we are missing the last
piece for the tx path: ip_summed. We need to know whether it is
CHECKSUM_PARTIAL to decide if we should update the
csum in the packet. In the tx path, the L4 csum holds only the
pseudo-header checksum when skb->ip_summed is CHECKSUM_PARTIAL.

Maybe we can introduce a lightweight kfunc to get its
value? Such as bpf_skb_csum_mode(). As we only need to call
it once, its overhead should be negligible.


You don't need a kfunc, you can do a check like
  struct sk_buff *kskb = bpf_cast_to_kern_ctx(skb);
  if (kskb->ip_summed == CHECKSUM_PARTIAL) ...
  ...
  



Thanks!
Menglong Dong


In the first patch, we make skb->csum readable and writable, and we make
skb->ip_summed readable. For now, for tc only. With these 2 fields, we
don't need to call bpf helpers for csum update any more.

In the second patch, we add some testcases for the read/write testing for
skb->csum and skb->ip_summed.

If this series is acceptable, we can define the inlined functions for csum
update in libbpf in the next step.

One downside of exposing those as __sk_buff fields is that all this
skb internal csum stuff now becomes a UAPI. And I'm not sure we want

+1. Please no new __sk_buff extension and no new conversion in
bpf_convert_ctx_access().


that :-) Should we add a lightweight kfunc to reset the fields instead?
Or will it still have an unacceptable overhead?




Re: [PATCH v10 10/10] iommu/vt-d: Add iotlb flush for nested domain

2024-01-02 Thread Baolu Lu

On 2024/1/3 9:33, Yi Liu wrote:

On 2024/1/3 02:44, Jason Gunthorpe wrote:

On Tue, Jan 02, 2024 at 06:38:34AM -0800, Yi Liu wrote:

+static void intel_nested_flush_cache(struct dmar_domain *domain, u64 
addr,

+ unsigned long npages, bool ih, u32 *error)
+{
+    struct iommu_domain_info *info;
+    unsigned long i;
+    unsigned mask;
+    u32 fault;
+
+    xa_for_each(&domain->iommu_array, i, info)
+    qi_flush_piotlb(info->iommu,
+    domain_id_iommu(domain, info->iommu),
+    IOMMU_NO_PASID, addr, npages, ih, NULL);


This locking on the xarray is messed up throughout the driver. There
could be a concurrent detach at this point which will free info and
UAF this.


Hmm, xa_for_each() takes and releases the RCU read lock, and according to
domain_detach_iommu(), info is freed after xa_erase(). For an existing
info stored in the xarray, xa_erase() should only return after the RCU read
lock is released, shouldn't it? Any idea? @Baolu


I once thought the locking for xarray was self-contained. I need to think
more about this before taking further action.

Best regards,
baolu



Re: [PATCH bpf-next 0/2] bpf: add csum/ip_summed fields to __sk_buff

2024-01-02 Thread Menglong Dong
On Wed, Jan 3, 2024 at 8:52 AM Martin KaFai Lau  wrote:
>
> On 1/2/24 10:11 AM, Stanislav Fomichev wrote:
> > On 12/29, Menglong Dong wrote:
> >> For now, we have to call some helpers when we need to update the csum,
> >> such as bpf_l4_csum_replace, bpf_l3_csum_replace, etc. These helpers are
> >> not inlined, which causes poor performance.
> >>
> >> In fact, we can define our own csum update functions in BPF program
> >> instead of bpf_l3_csum_replace; these are fully inlined and efficient.
> >> However, we can't do this for bpf_l4_csum_replace for now, as we can't
> >> update skb->csum, which can leave skb->csum invalid in the rx path with
> >> CHECKSUM_COMPLETE mode.
> >>
> >> What's more, we can't use the direct data access and have to use
> >> skb_store_bytes() with the BPF_F_RECOMPUTE_CSUM flag in some cases, such
> >> as modifying the vni in the vxlan header when the underlay udp header has
> >> no checksum.
>
> There is bpf_csum_update(), does it work?
> A helper call should be acceptable compared with the csum calculation itself.

Yeah, this helper works in this case! Now we are missing the last
piece for the tx path: ip_summed. We need to know whether it is
CHECKSUM_PARTIAL to decide if we should update the
csum in the packet. In the tx path, the L4 csum holds only the
pseudo-header checksum when skb->ip_summed is CHECKSUM_PARTIAL.

Maybe we can introduce a lightweight kfunc to get its
value? Such as bpf_skb_csum_mode(). As we only need to call
it once, its overhead should be negligible.

Thanks!
Menglong Dong

>
> >>
> >> In the first patch, we make skb->csum readable and writable, and we make
> >> skb->ip_summed readable. For now, for tc only. With these 2 fields, we
> >> don't need to call bpf helpers for csum update any more.
> >>
> >> In the second patch, we add some testcases for the read/write testing for
> >> skb->csum and skb->ip_summed.
> >>
> >> If this series is acceptable, we can define the inlined functions for csum
> >> update in libbpf in the next step.
> >
> > One downside of exposing those as __sk_buff fields is that all this
> > skb internal csum stuff now becomes a UAPI. And I'm not sure we want
>
> +1. Please no new __sk_buff extension and no new conversion in
> bpf_convert_ctx_access().
>
> > that :-) Should we add a lightweight kfunc to reset the fields instead?
> > Or will it still have an unacceptable overhead?
>



Re: [PATCH v7 1/3] iommufd: Add data structure for Intel VT-d stage-1 cache invalidation

2024-01-02 Thread Yi Liu

On 2024/1/3 07:38, Jason Gunthorpe wrote:

On Fri, Dec 15, 2023 at 12:01:19PM +0800, Yi Liu wrote:

I think I misread Yi's narrative: dev_id is a working approach
for VMM to convert to a vRID, while he is asking for a better
alternative :)


In concept, dev_id works, but in reality we have a problem getting a dev_id
for a given device in the intel iommu driver, hence I'm asking for help here. :)


I think we just need to solve this one way or another.. Even if you
use a viommu object you still end up having difficult coupling to
iommufd

Some:
   iommufd_get_dev_id(struct iommufd_ctx *ictx, struct device *dev)

Callable by a driver (using the driver-callable function
infrastructure we made for dirty tracking) Is really all that is
needed here.


Yep, I noticed IOMMUFD_DRIVER is selected by the intel iommu driver when
IOMMUFD is configured. Maybe such an API could be compiled when
IOMMUFD_DRIVER is enabled as well. This does address my concern about making
the intel iommu driver depend on iommufd. But we still need a way to pass
the iommufd_ctx pointer into the intel iommu driver and store it, so that
the intel iommu driver can call the above iommufd_get_dev_id(). One possible
way is to pass it when attaching a device to a domain and clear it on detach.
However, this seems less than ideal, as the iommufd_ctx information should
be static between bind_iommufd and unbind, but we don't call into the intel
iommu driver in the bind and unbind operations. We may need to add a new
iommu op. Any suggestion here?

--
Regards,
Yi Liu



Re: [PATCH v2 1/1] userfaultfd: fix move_pages_pte() splitting folio under RCU read lock

2024-01-02 Thread Peter Xu
On Tue, Jan 02, 2024 at 03:32:56PM -0800, Suren Baghdasaryan wrote:
> While testing the split PMD path with lockdep enabled I've got an
> "Invalid wait context" error caused by split_huge_page_to_list() trying
> to lock anon_vma->rwsem while inside an RCU read section. The issue is due
> to move_pages_pte() calling split_folio() under RCU read lock. Fix this
> by unmapping the PTEs and exiting RCU read section before splitting the
> folio and then retrying. The same retry pattern is used when locking the
> folio or anon_vma in this function. After splitting the large folio we
> unlock and release it because after the split the old folio might not be
> the one that contains the src_addr.
> 
> Fixes: 94b01c885131 ("userfaultfd: UFFDIO_MOVE uABI")
> Signed-off-by: Suren Baghdasaryan 

Reviewed-by: Peter Xu 

Thanks,

-- 
Peter Xu




Re: [PATCH v10 10/10] iommu/vt-d: Add iotlb flush for nested domain

2024-01-02 Thread Yi Liu

On 2024/1/3 02:44, Jason Gunthorpe wrote:

On Tue, Jan 02, 2024 at 06:38:34AM -0800, Yi Liu wrote:


+static void intel_nested_flush_cache(struct dmar_domain *domain, u64 addr,
+unsigned long npages, bool ih, u32 *error)
+{
+   struct iommu_domain_info *info;
+   unsigned long i;
+   unsigned mask;
+   u32 fault;
+
+   xa_for_each(&domain->iommu_array, i, info)
+   qi_flush_piotlb(info->iommu,
+   domain_id_iommu(domain, info->iommu),
+   IOMMU_NO_PASID, addr, npages, ih, NULL);


This locking on the xarray is messed up throughout the driver. There
could be a concurrent detach at this point which will free info and
UAF this.


Hmm, xa_for_each() takes and releases the RCU read lock, and according to
domain_detach_iommu(), info is freed after xa_erase(). For an existing
info stored in the xarray, xa_erase() should only return after the RCU read
lock is released, shouldn't it? Any idea? @Baolu

void domain_detach_iommu(struct dmar_domain *domain, struct intel_iommu *iommu)
{
struct iommu_domain_info *info;

spin_lock(&iommu->lock);
info = xa_load(&domain->iommu_array, iommu->seq_id);
if (--info->refcnt == 0) {
clear_bit(info->did, iommu->domain_ids);
xa_erase(&domain->iommu_array, iommu->seq_id);
domain->nid = NUMA_NO_NODE;
domain_update_iommu_cap(domain);
kfree(info);
}
spin_unlock(&iommu->lock);
}


This seems to be systemic issue, so I'm going to ignore it here, but
please make a series to fix it completely.


Yeah, this code follows the same pattern as the other places that reference
the iommu_array. If there is a real problem, we may check with Baolu and Kevin.



xarray is probably a bad data structure to manage attachment, a linked
list is going to use less memory in most cases and you need a mutex
lock anyhow.


below is the commit that introduces iommu_array.

commit ba949f4cd4c39c587e9b722ac7eb7f7e8a42dace
Author: Lu Baolu 
Date:   Tue Jul 12 08:09:05 2022 +0800

iommu/vt-d: Refactor iommu information of each domain

When a DMA domain is attached to a device, it needs to allocate a domain
ID from its IOMMU. Currently, the domain ID information is stored in two
static arrays embedded in the domain structure. This can lead to memory
waste when the driver is running on a small platform.

This optimizes these static arrays by replacing them with an xarray and
consuming memory on demand.

Signed-off-by: Lu Baolu 
Reviewed-by: Kevin Tian 
Reviewed-by: Steve Wahl 
Link: 
https://lore.kernel.org/r/20220702015610.2849494-4-baolu...@linux.intel.com

Signed-off-by: Joerg Roedel 


--
Regards,
Yi Liu



Re: [PATCH bpf-next 0/2] bpf: add csum/ip_summed fields to __sk_buff

2024-01-02 Thread Martin KaFai Lau

On 1/2/24 10:11 AM, Stanislav Fomichev wrote:

On 12/29, Menglong Dong wrote:

For now, we have to call some helpers when we need to update the csum,
such as bpf_l4_csum_replace, bpf_l3_csum_replace, etc. These helpers are
not inlined, which causes poor performance.

In fact, we can define our own csum update functions in BPF program
instead of bpf_l3_csum_replace; these are fully inlined and efficient.
However, we can't do this for bpf_l4_csum_replace for now, as we can't
update skb->csum, which can leave skb->csum invalid in the rx path with
CHECKSUM_COMPLETE mode.

What's more, we can't use the direct data access and have to use
skb_store_bytes() with the BPF_F_RECOMPUTE_CSUM flag in some cases, such
as modifying the vni in the vxlan header when the underlay udp header has
no checksum.


There is bpf_csum_update(), does it work?
A helper call should be acceptable compared with the csum calculation itself.



In the first patch, we make skb->csum readable and writable, and we make
skb->ip_summed readable. For now, for tc only. With these 2 fields, we
don't need to call bpf helpers for csum update any more.

In the second patch, we add some testcases for the read/write testing for
skb->csum and skb->ip_summed.

If this series is acceptable, we can define the inlined functions for csum
update in libbpf in the next step.


One downside of exposing those as __sk_buff fields is that all this
skb internal csum stuff now becomes a UAPI. And I'm not sure we want


+1. Please no new __sk_buff extension and no new conversion in 
bpf_convert_ctx_access().



that :-) Should we add a lightweight kfunc to reset the fields instead?
Or will it still have an unacceptable overhead?





Re: [PATCH v7 1/3] iommufd: Add data structure for Intel VT-d stage-1 cache invalidation

2024-01-02 Thread Jason Gunthorpe
On Fri, Dec 15, 2023 at 12:01:19PM +0800, Yi Liu wrote:
> > I think I misread Yi's narrative: dev_id is a working approach
> > for VMM to convert to a vRID, while he is asking for a better
> > alternative :)
> 
> In concept, dev_id works, but in reality we have a problem getting a dev_id
> for a given device in the intel iommu driver, hence I'm asking for help here. :)

I think we just need to solve this one way or another.. Even if you
use a viommu object you still end up having difficult coupling to
iommufd

Some:
  iommufd_get_dev_id(struct iommufd_ctx *ictx, struct device *dev)

Callable by a driver (using the driver-callable function
infrastructure we made for dirty tracking) Is really all that is
needed here.

Jason 



Re: [PATCH 1/1] userfaultfd: fix move_pages_pte() splitting folio under RCU read lock

2024-01-02 Thread Suren Baghdasaryan
On Tue, Jan 2, 2024 at 3:16 PM Suren Baghdasaryan  wrote:
>
> On Tue, Jan 2, 2024 at 8:58 AM Suren Baghdasaryan  wrote:
> >
> > On Tue, Jan 2, 2024 at 1:00 AM Peter Xu  wrote:
> > >
> > > On Fri, Dec 29, 2023 at 06:56:07PM -0800, Suren Baghdasaryan wrote:
> > > > @@ -1078,9 +1078,14 @@ static int move_pages_pte(struct mm_struct *mm, 
> > > > pmd_t *dst_pmd, pmd_t *src_pmd,
> > > >
> > > >   /* at this point we have src_folio locked */
> > > >   if (folio_test_large(src_folio)) {
> > > > + /* split_folio() can block */
> > > > + pte_unmap(&orig_src_pte);
> > > > + pte_unmap(&orig_dst_pte);
> > > > + src_pte = dst_pte = NULL;
> > > >   err = split_folio(src_folio);
> > > >   if (err)
> > > >   goto out;
> > > > + goto retry;
> > > >   }
> > >
> > > Do we also need to clear src_folio and src_folio_pte?  If the folio is a
> > > thp, I think it means it's pte mapped here. Then after the split we may
> > > want to fetch the small folio after the split, not the head one?
> >
> > I think we need to re-fetch the src_folio only if the src_addr falls
> > into a non-head page. Looking at the __split_huge_page(), the head
> > page is skipped in the last loop, so I think it should stay valid.
> > That said, maybe it's just an implementation detail of the
> > __split_huge_page() and I should not rely on that and refetch anyway?
>
> I'll post a v2 with this fix and re-fetching the folio
> unconditionally. We also don't need to reset src_folio_pte value
> because it's used only if src_folio is not NULL.

Posted at 
https://lore.kernel.org/all/20240102233256.1077959-1-sur...@google.com/

> Thanks for catching this, Peter!
>
> >
> > >
> > > --
> > > Peter Xu
> > >



[PATCH v2 1/1] userfaultfd: fix move_pages_pte() splitting folio under RCU read lock

2024-01-02 Thread Suren Baghdasaryan
While testing the split PMD path with lockdep enabled I've got an
"Invalid wait context" error caused by split_huge_page_to_list() trying
to lock anon_vma->rwsem while inside an RCU read section. The issue is due
to move_pages_pte() calling split_folio() under RCU read lock. Fix this
by unmapping the PTEs and exiting RCU read section before splitting the
folio and then retrying. The same retry pattern is used when locking the
folio or anon_vma in this function. After splitting the large folio we
unlock and release it because after the split the old folio might not be
the one that contains the src_addr.

Fixes: 94b01c885131 ("userfaultfd: UFFDIO_MOVE uABI")
Signed-off-by: Suren Baghdasaryan 
---
Changes from v1 [1]:
1. Reset src_folio and src_folio_pte after folio is split, per Peter Xu

[1] https://lore.kernel.org/all/20231230025607.2476912-1-sur...@google.com/

 mm/userfaultfd.c | 9 +
 1 file changed, 9 insertions(+)

diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 5e718014e671..216ab4c8621f 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -1078,9 +1078,18 @@ static int move_pages_pte(struct mm_struct *mm, pmd_t 
*dst_pmd, pmd_t *src_pmd,
 
/* at this point we have src_folio locked */
if (folio_test_large(src_folio)) {
+   /* split_folio() can block */
+   pte_unmap(&orig_src_pte);
+   pte_unmap(&orig_dst_pte);
+   src_pte = dst_pte = NULL;
err = split_folio(src_folio);
if (err)
goto out;
+   /* have to reacquire the folio after it got split */
+   folio_unlock(src_folio);
+   folio_put(src_folio);
+   src_folio = NULL;
+   goto retry;
}
 
if (!src_anon_vma) {
-- 
2.43.0.472.g3155946c3a-goog




Re: [PATCH 1/1] userfaultfd: fix move_pages_pte() splitting folio under RCU read lock

2024-01-02 Thread Suren Baghdasaryan
On Tue, Jan 2, 2024 at 8:58 AM Suren Baghdasaryan  wrote:
>
> On Tue, Jan 2, 2024 at 1:00 AM Peter Xu  wrote:
> >
> > On Fri, Dec 29, 2023 at 06:56:07PM -0800, Suren Baghdasaryan wrote:
> > > @@ -1078,9 +1078,14 @@ static int move_pages_pte(struct mm_struct *mm, 
> > > pmd_t *dst_pmd, pmd_t *src_pmd,
> > >
> > >   /* at this point we have src_folio locked */
> > >   if (folio_test_large(src_folio)) {
> > > + /* split_folio() can block */
> > > + pte_unmap(&orig_src_pte);
> > > + pte_unmap(&orig_dst_pte);
> > > + src_pte = dst_pte = NULL;
> > >   err = split_folio(src_folio);
> > >   if (err)
> > >   goto out;
> > > + goto retry;
> > >   }
> >
> > Do we also need to clear src_folio and src_folio_pte?  If the folio is a
> > thp, I think it means it's pte mapped here. Then after the split we may
> > want to fetch the small folio after the split, not the head one?
>
> I think we need to re-fetch the src_folio only if the src_addr falls
> into a non-head page. Looking at the __split_huge_page(), the head
> page is skipped in the last loop, so I think it should stay valid.
> That said, maybe it's just an implementation detail of the
> __split_huge_page() and I should not rely on that and refetch anyway?

I'll post a v2 with this fix and re-fetching the folio
unconditionally. We also don't need to reset src_folio_pte value
because it's used only if src_folio is not NULL.
Thanks for catching this, Peter!

>
> >
> > --
> > Peter Xu
> >



Re: [PATCH v3 4/4] selftest/bpf: Test a perf bpf program that suppresses side effects.

2024-01-02 Thread Song Liu
On Sun, Dec 10, 2023 at 8:56 PM Kyle Huey  wrote:
>
> The test sets a hardware breakpoint and uses a bpf program to suppress the
> side effects of a perf event sample, including I/O availability signals,
> SIGTRAPs, and decrementing the event counter limit, if the ip matches the
> expected value. Then the function with the breakpoint is executed multiple
> times to test that all effects behave as expected.
>
> Signed-off-by: Kyle Huey 
> ---
>  .../selftests/bpf/prog_tests/perf_skip.c  | 140 ++
>  .../selftests/bpf/progs/test_perf_skip.c  |  15 ++
>  2 files changed, 155 insertions(+)
>  create mode 100644 tools/testing/selftests/bpf/prog_tests/perf_skip.c
>  create mode 100644 tools/testing/selftests/bpf/progs/test_perf_skip.c
>
> diff --git a/tools/testing/selftests/bpf/prog_tests/perf_skip.c 
> b/tools/testing/selftests/bpf/prog_tests/perf_skip.c
> new file mode 100644
> index ..0200736a8baf
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/prog_tests/perf_skip.c
> @@ -0,0 +1,140 @@
> +// SPDX-License-Identifier: GPL-2.0
> +#define _GNU_SOURCE
> +
> +#include 
> +#include "test_perf_skip.skel.h"
> +#include 
> +#include 
> +#include 
> +
> +#ifndef TRAP_PERF
> +#define TRAP_PERF 6
> +#endif
> +
> +int signals_unexpected = 1;
> +int sigio_count, sigtrap_count;
> +
> +static void handle_sigio(int sig __always_unused)
> +{
> +   ASSERT_OK(signals_unexpected, "perf event not skipped");

ASSERT_OK is a little confusing. Maybe do something like:

static int signals_expected;
static void handle_sigio(int sig __always_unused)
{
ASSERT_EQ(signals_expected, 1, "expected sig_io");
}
serial_test_perf_skip()
{
...
signals_expected = 1;
}

> +   ++sigio_count;
> +}
> +
> +static void handle_sigtrap(int signum __always_unused,
> +  siginfo_t *info,
> +  void *ucontext __always_unused)
> +{
> +   ASSERT_OK(signals_unexpected, "perf event not skipped");
ditto

> +   ASSERT_EQ(info->si_code, TRAP_PERF, "wrong si_code");
> +   ++sigtrap_count;
> +}
> +
> +static noinline int test_function(void)
> +{
> +   asm volatile ("");
> +   return 0;
> +}
> +
> +void serial_test_perf_skip(void)
> +{
> +   struct sigaction action = {};
> +   struct sigaction previous_sigtrap;
> +   sighandler_t previous_sigio;
> +   struct test_perf_skip *skel = NULL;
> +   struct perf_event_attr attr = {};
> +   int perf_fd = -1;
> +   int err;
> +   struct f_owner_ex owner;
> +   struct bpf_link *prog_link = NULL;
> +
> +   action.sa_flags = SA_SIGINFO | SA_NODEFER;
> +   action.sa_sigaction = handle_sigtrap;
> +   sigemptyset(&action.sa_mask);
> +   if (!ASSERT_OK(sigaction(SIGTRAP, &action, &previous_sigtrap), 
> "sigaction"))
> +   return;
> +
> +   previous_sigio = signal(SIGIO, handle_sigio);

handle signal() errors here?

> +
> +   skel = test_perf_skip__open_and_load();
> +   if (!ASSERT_OK_PTR(skel, "skel_load"))
> +   goto cleanup;
> +
> +   attr.type = PERF_TYPE_BREAKPOINT;
> +   attr.size = sizeof(attr);
> +   attr.bp_type = HW_BREAKPOINT_X;
> +   attr.bp_addr = (uintptr_t)test_function;
> +   attr.bp_len = sizeof(long);
> +   attr.sample_period = 1;
> +   attr.sample_type = PERF_SAMPLE_IP;
> +   attr.pinned = 1;
> +   attr.exclude_kernel = 1;
> +   attr.exclude_hv = 1;
> +   attr.precise_ip = 3;
> +   attr.sigtrap = 1;
> +   attr.remove_on_exec = 1;
> +
> +   perf_fd = syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
> +   if (perf_fd < 0 && (errno == ENOENT || errno == EOPNOTSUPP)) {
> +   printf("SKIP:no PERF_TYPE_BREAKPOINT/HW_BREAKPOINT_X\n");
> +   test__skip();
> +   goto cleanup;
> +   }
> +   if (!ASSERT_OK(perf_fd < 0, "perf_event_open"))
> +   goto cleanup;
> +
> +   /* Configure the perf event to signal on sample. */
> +   err = fcntl(perf_fd, F_SETFL, O_ASYNC);
> +   if (!ASSERT_OK(err, "fcntl(F_SETFL, O_ASYNC)"))
> +   goto cleanup;
> +
> +   owner.type = F_OWNER_TID;
> +   owner.pid = syscall(__NR_gettid);
> +   err = fcntl(perf_fd, F_SETOWN_EX, &owner);
> +   if (!ASSERT_OK(err, "fcntl(F_SETOWN_EX)"))
> +   goto cleanup;
> +
> +   /*
> +* Allow at most one sample. A sample rejected by bpf should
> +* not count against this.
> +*/

Multi-line comment style should be like

/* Allow at most one sample. A sample rejected by bpf should
* not count against this.
*/

> +   err = ioctl(perf_fd, PERF_EVENT_IOC_REFRESH, 1);
> +   if (!ASSERT_OK(err, "ioctl(PERF_EVENT_IOC_REFRESH)"))
> +   goto cleanup;
> +
> +   prog_link = bpf_program__attach_perf_event(skel->progs.handler, 
> perf_fd);
> +   if (!ASSERT_OK_PTR(prog_link, "bpf_program__attach_perf_event"))
> +   goto cleanup;
> +
> +   /* Configure the 

Re: [PATCH RESEND v4 1/3] kselftests: lib.mk: Add TEST_GEN_MODS_DIR variable

2024-01-02 Thread Joe Lawrence
On Wed, Dec 20, 2023 at 01:53:12PM -0300, Marcos Paulo de Souza wrote:
> Add TEST_GEN_MODS_DIR variable for kselftests. It can point to
> a directory containing kernel modules that will be used by
> selftest scripts.
> 
> The modules are built as external modules for the running kernel.
> As a result they are always binary compatible and the same tests
> can be used for older or newer kernels.
> 
> The build requires "kernel-devel" package to be installed.
> For example, in the upstream sources, the rpm devel package
> is produced by "make rpm-pkg"
> 
> The modules can be built independently by
> 
>   make -C tools/testing/selftests/livepatch/
> 
> or they will be automatically built before running the tests via
> 
>   make -C tools/testing/selftests/livepatch/ run_tests
> 
> Note that they are _not_ built when running the standalone
> tests by calling, for example, ./test-state.sh.
> 
> Signed-off-by: Marcos Paulo de Souza 
> ---
>  Documentation/dev-tools/kselftest.rst |  4 
>  tools/testing/selftests/lib.mk| 20 +++-
>  2 files changed, 19 insertions(+), 5 deletions(-)
> 
> diff --git a/Documentation/dev-tools/kselftest.rst 
> b/Documentation/dev-tools/kselftest.rst
> index ab376b316c36..7f3582a67318 100644
> --- a/Documentation/dev-tools/kselftest.rst
> +++ b/Documentation/dev-tools/kselftest.rst
> @@ -245,6 +245,10 @@ Contributing new tests (details)
> TEST_PROGS, TEST_GEN_PROGS mean it is the executable tested by
> default.
>  
> +   TEST_GEN_MODS_DIR should be used by tests that require modules to be built
> +   before the test starts. The variable will contain the name of the 
> directory
> +   containing the modules.
> +
> TEST_CUSTOM_PROGS should be used by tests that require custom build
> rules and prevent common build rule use.
>  
> diff --git a/tools/testing/selftests/lib.mk b/tools/testing/selftests/lib.mk
> index 118e0964bda9..6c7c5a0112cf 100644
> --- a/tools/testing/selftests/lib.mk
> +++ b/tools/testing/selftests/lib.mk
> @@ -70,12 +70,15 @@ KHDR_INCLUDES := -isystem $(KHDR_DIR)
>  # TEST_PROGS are for test shell scripts.
>  # TEST_CUSTOM_PROGS and TEST_PROGS will be run by common run_tests
>  # and install targets. Common clean doesn't touch them.
> +# TEST_GEN_MODS_DIR is used to specify a directory with modules to be built
> +# before the test executes. These modules are cleaned on the clean target as 
> well.
>  TEST_GEN_PROGS := $(patsubst %,$(OUTPUT)/%,$(TEST_GEN_PROGS))
>  TEST_GEN_PROGS_EXTENDED := $(patsubst 
> %,$(OUTPUT)/%,$(TEST_GEN_PROGS_EXTENDED))
>  TEST_GEN_FILES := $(patsubst %,$(OUTPUT)/%,$(TEST_GEN_FILES))
> +TEST_GEN_MODS_DIR := $(patsubst %,$(OUTPUT)/%,$(TEST_GEN_MODS_DIR))
>  
>  all: kernel_header_files $(TEST_GEN_PROGS) $(TEST_GEN_PROGS_EXTENDED) \
> - $(TEST_GEN_FILES)
> + $(TEST_GEN_FILES) $(if $(TEST_GEN_MODS_DIR),gen_mods_dir)
>  
>  kernel_header_files:
>   @ls $(KHDR_DIR)/linux/*.h >/dev/null 2>/dev/null;  \
> @@ -105,8 +108,8 @@ endef
>  
>  run_tests: all
>  ifdef building_out_of_srctree
> - @if [ "X$(TEST_PROGS)$(TEST_PROGS_EXTENDED)$(TEST_FILES)" != "X" ]; 
> then \
> - rsync -aq --copy-unsafe-links $(TEST_PROGS) 
> $(TEST_PROGS_EXTENDED) $(TEST_FILES) $(OUTPUT); \
> + @if [ 
> "X$(TEST_PROGS)$(TEST_PROGS_EXTENDED)$(TEST_FILES)$(TEST_GEN_MODS_DIR)" != 
> "X" ]; then \
> + rsync -aq --copy-unsafe-links $(TEST_PROGS) 
> $(TEST_PROGS_EXTENDED) $(TEST_FILES) $(TEST_GEN_MODS_DIR) $(OUTPUT); \
>   fi
>   @if [ "X$(TEST_PROGS)" != "X" ]; then \
>   $(call RUN_TESTS, $(TEST_GEN_PROGS) $(TEST_CUSTOM_PROGS) \
> @@ -118,6 +121,12 @@ else
>   @$(call RUN_TESTS, $(TEST_GEN_PROGS) $(TEST_CUSTOM_PROGS) $(TEST_PROGS))
>  endif
>  
> +gen_mods_dir:
> + $(Q)$(MAKE) -C $(TEST_GEN_MODS_DIR)
> +
> +clean_mods_dir:
> + $(Q)$(MAKE) -C $(TEST_GEN_MODS_DIR) clean
> +
>  define INSTALL_SINGLE_RULE
>   $(if $(INSTALL_LIST),@mkdir -p $(INSTALL_PATH))
>   $(if $(INSTALL_LIST),rsync -a --copy-unsafe-links $(INSTALL_LIST) 
> $(INSTALL_PATH)/)
> @@ -131,6 +140,7 @@ define INSTALL_RULE
>   $(eval INSTALL_LIST = $(TEST_CUSTOM_PROGS)) $(INSTALL_SINGLE_RULE)
>   $(eval INSTALL_LIST = $(TEST_GEN_PROGS_EXTENDED)) $(INSTALL_SINGLE_RULE)
>   $(eval INSTALL_LIST = $(TEST_GEN_FILES)) $(INSTALL_SINGLE_RULE)
> + $(eval INSTALL_LIST = $(TEST_GEN_MODS_DIR)) $(INSTALL_SINGLE_RULE)

Hi Marcos,

Sorry for the late reply on this, but I'm reviewing this version by
trying to retrofit it into our selftest packaging (pre-build the test
module .ko's and stash those into an rpm rather than building on the
test host).

Since $TEST_GEN_MODS_DIR is treated as a directory, I found that the
selftest install target copies a bunch of intermediate object and kbuild
files:

  $ mkdir /tmp/test-install
  $ make KDIR=$(pwd) INSTALL_PATH=/tmp/test-install TARGETS=livepatch \
   -C tools/testing/selftests/ install

  [ ... builds livepatch selftests ... ]
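
For reference, a minimal sketch of how a selftest Makefile is expected to
consume the new variable (the "test_modules" directory name is illustrative,
not taken from this mail):

  # Sketch only: build the .ko files in test_modules/ before running
  # the shell-script tests, via the common lib.mk rules added above.
  TEST_PROGS := test-state.sh
  TEST_GEN_MODS_DIR := test_modules
  include ../lib.mk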


Re: [PATCH] selftests: Move KTAP bash helpers to selftests common folder

2024-01-02 Thread Nícolas F . R . A . Prado
On Tue, Jan 02, 2024 at 03:15:28PM +0100, Laura Nao wrote:
> Move bash helpers for outputting in KTAP format to the common selftests
> folder. This allows kselftests other than the dt one to source the file
> and make use of the helper functions.
> Define pass, fail and skip codes in the same file too.
> 
> Signed-off-by: Laura Nao 

Reviewed-by: Nícolas F. R. A. Prado 
Tested-by: Nícolas F. R. A. Prado 

Thanks,
Nícolas



Re: [PATCH v10 00/10] Add iommufd nesting (part 2/2)

2024-01-02 Thread Jason Gunthorpe
On Tue, Jan 02, 2024 at 06:38:24AM -0800, Yi Liu wrote:
> Lu Baolu (4):
>   iommu: Add cache_invalidate_user op
>   iommu/vt-d: Allow qi_submit_sync() to return the QI faults
>   iommu/vt-d: Convert stage-1 cache invalidation to return QI fault
>   iommu/vt-d: Add iotlb flush for nested domain
> 
> Nicolin Chen (4):
>   iommu: Add iommu_copy_struct_from_user_array helper
>   iommufd/selftest: Add mock_domain_cache_invalidate_user support
>   iommufd/selftest: Add IOMMU_TEST_OP_MD_CHECK_IOTLB test op
>   iommufd/selftest: Add coverage for IOMMU_HWPT_INVALIDATE ioctl
> 
> Yi Liu (2):
>   iommufd: Add IOMMU_HWPT_INVALIDATE
>   iommufd: Add data structure for Intel VT-d stage-1 cache invalidation

Applied to for-next, thanks

Jason



Re: [PATCH v10 10/10] iommu/vt-d: Add iotlb flush for nested domain

2024-01-02 Thread Jason Gunthorpe
On Tue, Jan 02, 2024 at 06:38:34AM -0800, Yi Liu wrote:

> +static void intel_nested_flush_cache(struct dmar_domain *domain, u64 addr,
> +  unsigned long npages, bool ih, u32 *error)
> +{
> + struct iommu_domain_info *info;
> + unsigned long i;
> + unsigned mask;
> + u32 fault;
> +
> + xa_for_each(&domain->iommu_array, i, info)
> + qi_flush_piotlb(info->iommu,
> + domain_id_iommu(domain, info->iommu),
> + IOMMU_NO_PASID, addr, npages, ih, NULL);

This locking on the xarray is messed up throughout the driver. There
could be a concurrent detach at this point which will free info and
UAF this.

This seems to be systemic issue, so I'm going to ignore it here, but
please make a series to fix it completely.

xarray is probably a bad data structure to manage attachment, a linked
list is going to use less memory in most cases and you need a mutex
lock anyhow.

Jason



Re: [PATCH bpf-next 0/2] bpf: add csum/ip_summed fields to __sk_buff

2024-01-02 Thread Stanislav Fomichev
On 12/29, Menglong Dong wrote:
> For now, we have to call some helpers when we need to update the csum,
> such as bpf_l4_csum_replace, bpf_l3_csum_replace, etc. These helpers are
> not inlined, which causes poor performance.
> 
> In fact, we can define our own csum update functions in BPF program
> instead of bpf_l3_csum_replace; these are fully inlined and efficient.
> However, we can't do this for bpf_l4_csum_replace for now, as we can't
> update skb->csum, which can leave skb->csum invalid in the rx path with
> CHECKSUM_COMPLETE mode.
> 
> What's more, we can't use the direct data access and have to use
> skb_store_bytes() with the BPF_F_RECOMPUTE_CSUM flag in some cases, such
> as modifying the vni in the vxlan header when the underlay udp header has
> no checksum.
> 
> In the first patch, we make skb->csum readable and writable, and we make
> skb->ip_summed readable. For now, for tc only. With these 2 fields, we
> don't need to call bpf helpers for csum update any more.
> 
> In the second patch, we add some testcases for the read/write testing for
> skb->csum and skb->ip_summed.
> 
> If this series is acceptable, we can define the inlined functions for csum
> update in libbpf in the next step.

One downside of exposing those as __sk_buff fields is that all this
skb internal csum stuff now becomes a UAPI. And I'm not sure we want
that :-) Should we add a lightweight kfunc to reset the fields instead?
Or will it still have an unacceptable overhead?
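
As a rough illustration of the "inlined functions for csum update" that the
cover letter mentions, such a helper could be an RFC 1624 style incremental
update along these lines (a sketch only, not code from this series; assumes
__always_inline and the __u16/__u32 types from the usual BPF program headers,
i.e. vmlinux.h plus bpf_helpers.h):

   /* Sketch: recompute a 16-bit internet checksum after one 16-bit word
    * of the packet changed from old_val to new_val (RFC 1624, eqn. 3).
    */
   static __always_inline __u16 csum_replace16(__u16 check, __u16 old_val,
                                               __u16 new_val)
   {
           __u32 sum = (__u16)~check + (__u16)~old_val + new_val;

           sum = (sum & 0xffff) + (sum >> 16);  /* fold the carries */
           sum = (sum & 0xffff) + (sum >> 16);
           return (__u16)~sum;
   }

Being fully inlined, this compiles down to a handful of BPF ALU instructions
instead of a helper call.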



Re: [PATCH v7 1/3] iommufd: Add data structure for Intel VT-d stage-1 cache invalidation

2024-01-02 Thread Jason Gunthorpe
On Fri, Dec 15, 2023 at 01:50:07AM +, Tian, Kevin wrote:

> > - Reuse Nicolin's vRID->pRID mapping. If the vRID->pRID mapping is
> > maintained, then intel iommu can report a vRID back to user. But the intel
> > iommu driver does not have a viommu context, so there is no place to hold
> > the vRID->pRID mapping. TBH, it may require other reasons to introduce it
> > other than the error reporting need. Anyhow, this requires more thinking
> > and also has a dependency even if it is doable on the intel side.
> 
> this sounds like a cleaner way to inject the knowledge which the iommu driver
> requires to find out the user tag. But yes, it's a bit weird to introduce
> viommu awareness in the intel iommu driver when there is no such thing
> in real hardware.
> 
> and for this error reporting case what we actually require is the
> reverse map i.e. pRID->vRID. Not sure whether we can leverage the
> same RID mapping uAPI as for ARM/AMD but ignore viommu_id
> and then store vRID under device_domain_info. a bit tricky on
> life cycle management and also incompatible with SIOV...
> 
> let's see whether Jason has a better idea here.

I think v10 is OK

struct iommu_hwpt_invalidate {
__u32 size;
__u32 hwpt_id;
__aligned_u64 data_uptr;
__u32 data_type;
__u32 entry_len;
__u32 entry_num;
__u32 __reserved;
};

Sends the invalidation to the HWPT, which matches what Intel wanted:
the entire HWPT and all its associated devices are
invalidated. No separate per-device invalidation.

For error and event reporting they should be returned to userspace
with the IOMMU dev_id indicating the originating PCI function.

The VMM would have to convert dev_id into vRID according to the vIOMMU
instance that the device is hooked up to.

In iommu land we should never have a "RID" but always some kind of
device-specific "device ID" which is the index into the particular HW
table, and that ID is naturally scoped to within the IOMMU instance
that owns the table - so it is very much not a global ID that can be
used alone in any of the uAPI.

The uAPI should use the iommufd device ID to refer to specific
devices.

Jason
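
For illustration, a userspace sketch of driving this interface for the Intel
case (assumes the uAPI headers from this series, an already opened /dev/iommu
fd, and a nested hwpt_id; error handling is omitted):

   #include <stdint.h>
   #include <sys/ioctl.h>
   #include <linux/iommufd.h>

   /* Sketch: invalidate the whole stage-1 page table of a nested HWPT,
    * then report whether hardware flagged an error for the single entry.
    */
   static int flush_all_stage1(int iommufd, uint32_t hwpt_id)
   {
           struct iommu_hwpt_vtd_s1_invalidate entry = {
                   .addr = 0,
                   .npages = UINT64_MAX, /* all caches for this table */
           };
           struct iommu_hwpt_invalidate cmd = {
                   .size = sizeof(cmd),
                   .hwpt_id = hwpt_id,
                   .data_uptr = (uintptr_t)&entry,
                   .data_type = IOMMU_HWPT_INVALIDATE_DATA_VTD_S1,
                   .entry_len = sizeof(entry),
                   .entry_num = 1,
           };
           int rc = ioctl(iommufd, IOMMU_HWPT_INVALIDATE, &cmd);

           /* cmd.entry_num now holds the number of handled entries;
            * entry.hw_error carries the ICE/ITE bits, if any.
            */
           return rc ? rc : (int)entry.hw_error;
   }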



Re: [PATCH v8 21/24] evm: Move to LSM infrastructure

2024-01-02 Thread Mimi Zohar
On Tue, 2024-01-02 at 12:56 +0100, Roberto Sassu wrote:
> On 12/26/2023 11:13 PM, Mimi Zohar wrote:
> > On Thu, 2023-12-14 at 18:08 +0100, Roberto Sassu wrote:
> >> From: Roberto Sassu 
> >>
> >> As for IMA, move hardcoded EVM function calls from various places in the
> >> kernel to the LSM infrastructure, by introducing a new LSM named 'evm'
> >> (last and always enabled like 'ima'). The order in the Makefile ensures
> >> that 'evm' hooks are executed after 'ima' ones.
> >>
> >> Make the EVM functions static (except for evm_inode_init_security(), which
> >> is exported), and register them as hook implementations in init_evm_lsm().
> >>
> >> Unlike before (see commit to move IMA to the LSM infrastructure),
> >> evm_inode_post_setattr(), evm_inode_post_set_acl(),
> >> evm_inode_post_remove_acl(), and evm_inode_post_removexattr() are not
> >> executed for private inodes.
> >>
> > 
> > Missing is a comment on moving the inline function definitions -
> > evm_inode_remove_acl(), evm_inode_post_remove_acl(), and
> > evm_inode_post_set_acl() - to evm_main.c.
> 
> Ok.
> 
> >> Finally, add the LSM_ID_EVM case in lsm_list_modules_test.c
> >>
> >> Signed-off-by: Roberto Sassu 
> >> ---
> > 
> > [...]
> >> @@ -2307,9 +2299,7 @@ int security_inode_setxattr(struct mnt_idmap *idmap,
> >>   
> >>if (ret == 1)
> >>ret = cap_inode_setxattr(dentry, name, value, size, flags);
> >> -  if (ret)
> >> -  return ret;
> >> -  return evm_inode_setxattr(idmap, dentry, name, value, size, flags);
> >> +  return ret;
> >>   }
> > 
> > Even though capability will be called after EVM, it doesn't make a
> > difference in this instance.
> > 
> > [...]
> > 
> >>   /**
> >> @@ -2493,9 +2472,7 @@ int security_inode_removexattr(struct mnt_idmap 
> >> *idmap,
> >>ret = call_int_hook(inode_removexattr, 1, idmap, dentry, name);
> >>if (ret == 1)
> >>ret = cap_inode_removexattr(idmap, dentry, name);
> >> -  if (ret)
> >> -  return ret;
> >> -  return evm_inode_removexattr(idmap, dentry, name);
> >> +  return ret;
> >>   }
> > 
> > 'security.capability' is one of the EVM protected xattrs.  As
> > capability isn't an LSM, it will now be called after EVM, which is a
> > problem.
> 
> Uhm, according to this comment in security_inode_removexattr() and 
> security_inode_setxattr():
> 
>   /*
>* SELinux and Smack integrate the cap call,
>* so assume that all LSMs supplying this call do so.
>*/
> 
> We can add the call to IMA and EVM as well, to be compliant.

SELinux and Smack are the only current LSMs that register the
security_inode_removexattr hook.  Both enforce mandatory access
control,
so their calling capabilities to enforce DAC kind of makes sense.  I'm
not sure it makes sense for IMA and EVM to call capability directly,
just because of the comment.

> However, I'm missing why the two cases are different. It seems 
> cap_inode_set/removexattr() are doing just checks.

Both IMA and EVM require CAP_SYS_ADMIN to write/remove security.ima and
security.evm respectively.  In addition, EVM must recalculate
security.evm if any protected security xattrs are set or
removed.   However, security.evm is updated on
security_inode_post_setxattr, not security_inode_setxattr.

Mimi




Re: [PATCH net-next v2 2/3] net: gro: parse ipv6 ext headers without frag0 invalidation

2024-01-02 Thread Eric Dumazet
On Tue, Jan 2, 2024 at 2:25 PM Richard Gobert  wrote:
>
> The existing code always pulls the IPv6 header and sets the transport
> offset initially. Then optionally again pulls any extension headers in
> ipv6_gso_pull_exthdrs and sets the transport offset again on return from
> that call. skb->data is set at the start of the first extension header
> before calling ipv6_gso_pull_exthdrs, and must disable the frag0
> optimization because that function uses pskb_may_pull/pskb_pull instead of
> skb_gro_ helpers. It sets the GRO offset to the TCP header with
> skb_gro_pull and sets the transport header. Then returns skb->data to its
> position before this block.
>
> This commit introduces a new helper function - ipv6_gro_pull_exthdrs -
> which is used in ipv6_gro_receive to pull ipv6 ext headers instead of
> ipv6_gso_pull_exthdrs. Thus, there is no modification of skb->data, all
> operations use skb_gro_* helpers, and the frag0 fast path can be taken for
> IPv6 packets with ext headers.
>
> Signed-off-by: Richard Gobert 
> Reviewed-by: Willem de Bruijn 
> ---
>  include/net/ipv6.h |  1 +
>  net/ipv6/ip6_offload.c | 51 +-
>  2 files changed, 42 insertions(+), 10 deletions(-)
>
> diff --git a/include/net/ipv6.h b/include/net/ipv6.h
> index 78d38dd88aba..217240efa182 100644
> --- a/include/net/ipv6.h
> +++ b/include/net/ipv6.h
> @@ -26,6 +26,7 @@ struct ip_tunnel_info;
>  #define SIN6_LEN_RFC2133   24
>
>  #define IPV6_MAXPLEN   65535
> +#define IPV6_MIN_EXTHDR_LEN8

// Hmm see my following comment.

>
>  /*
>   * NextHeader field of IPv6 header
> diff --git a/net/ipv6/ip6_offload.c b/net/ipv6/ip6_offload.c
> index 0e0b5fed0995..c07111d8f56a 100644
> --- a/net/ipv6/ip6_offload.c
> +++ b/net/ipv6/ip6_offload.c
> @@ -37,6 +37,40 @@
> INDIRECT_CALL_L4(cb, f2, f1, head, skb);\
>  })
>
> +static int ipv6_gro_pull_exthdrs(struct sk_buff *skb, int off, int proto)
> +{
> +   const struct net_offload *ops = NULL;
> +   struct ipv6_opt_hdr *opth;
> +
> +   for (;;) {
> +   int len;
> +
> +   ops = rcu_dereference(inet6_offloads[proto]);
> +
> +   if (unlikely(!ops))
> +   break;
> +
> +   if (!(ops->flags & INET6_PROTO_GSO_EXTHDR))
> +   break;
> +
> +   opth = skb_gro_header(skb, off + IPV6_MIN_EXTHDR_LEN, off);

I do not see a compelling reason for adding yet another constant here.

I would stick to

   opth = skb_gro_header(skb, off + sizeof(*opth), off);

Consistency with similar helpers is desirable.

> +   if (unlikely(!opth))
> +   break;
> +
> +   len = ipv6_optlen(opth);
> +
> +   opth = skb_gro_header(skb, off + len, off);

Note this call will take care of precise pull.

> +   if (unlikely(!opth))
> +   break;
> +   proto = opth->nexthdr;
> +
> +   off += len;
> +   }
> +
> +   skb_gro_pull(skb, off - skb_network_offset(skb));
> +   return proto;
> +}
> +
>  static int ipv6_gso_pull_exthdrs(struct sk_buff *skb, int proto)
>  {
> const struct net_offload *ops = NULL;
> @@ -203,28 +237,25 @@ INDIRECT_CALLABLE_SCOPE struct sk_buff 
> *ipv6_gro_receive(struct list_head *head,
> goto out;
>
> skb_set_network_header(skb, off);
> -   skb_gro_pull(skb, sizeof(*iph));
> -   skb_set_transport_header(skb, skb_gro_offset(skb));
>
> -   flush += ntohs(iph->payload_len) != skb_gro_len(skb);
> +   flush += ntohs(iph->payload_len) != skb->len - hlen;
>
> proto = iph->nexthdr;
> ops = rcu_dereference(inet6_offloads[proto]);
> if (!ops || !ops->callbacks.gro_receive) {
> -   pskb_pull(skb, skb_gro_offset(skb));
> -   skb_gro_frag0_invalidate(skb);
> -   proto = ipv6_gso_pull_exthdrs(skb, proto);
> -   skb_gro_pull(skb, -skb_transport_offset(skb));
> -   skb_reset_transport_header(skb);
> -   __skb_push(skb, skb_gro_offset(skb));
> +   proto = ipv6_gro_pull_exthdrs(skb, hlen, proto);
>
> ops = rcu_dereference(inet6_offloads[proto]);
> if (!ops || !ops->callbacks.gro_receive)
> goto out;
>
> -   iph = ipv6_hdr(skb);
> +   iph = skb_gro_network_header(skb);
> +   } else {
> +   skb_gro_pull(skb, sizeof(*iph));
> }
>
> +   skb_set_transport_header(skb, skb_gro_offset(skb));
> +
> NAPI_GRO_CB(skb)->proto = proto;
>
> flush--;
> --
> 2.36.1
>



Re: [PATCH net-next v2 1/3] net: gso: add HBH extension header offload support

2024-01-02 Thread Eric Dumazet
On Tue, Jan 2, 2024 at 2:21 PM Richard Gobert  wrote:
>
> This commit adds net_offload to IPv6 Hop-by-Hop extension headers (as it
> is done for routing and dstopts) since it is supported in GSO and GRO.
> This allows removing specific HBH conditionals in GSO and GRO when
> pulling and parsing an incoming packet.
>
> Signed-off-by: Richard Gobert 
> Reviewed-by: Willem de Bruijn 

Reviewed-by: Eric Dumazet 



Re: [PATCH 1/1] userfaultfd: fix move_pages_pte() splitting folio under RCU read lock

2024-01-02 Thread Suren Baghdasaryan
On Tue, Jan 2, 2024 at 1:00 AM Peter Xu  wrote:
>
> On Fri, Dec 29, 2023 at 06:56:07PM -0800, Suren Baghdasaryan wrote:
> > @@ -1078,9 +1078,14 @@ static int move_pages_pte(struct mm_struct *mm, 
> > pmd_t *dst_pmd, pmd_t *src_pmd,
> >
> >   /* at this point we have src_folio locked */
> >   if (folio_test_large(src_folio)) {
> > + /* split_folio() can block */
> > + pte_unmap(&orig_src_pte);
> > + pte_unmap(&orig_dst_pte);
> > + src_pte = dst_pte = NULL;
> >   err = split_folio(src_folio);
> >   if (err)
> >   goto out;
> > + goto retry;
> >   }
>
> Do we also need to clear src_folio and src_folio_pte?  If the folio is a
> thp, I think it means it's pte mapped here. Then after the split we may
> want to fetch the small folio after the split, not the head one?

I think we need to re-fetch the src_folio only if the src_addr falls
into a non-head page. Looking at the __split_huge_page(), the head
page is skipped in the last loop, so I think it should stay valid.
That said, maybe it's just an implementation detail of the
__split_huge_page() and I should not rely on that and refetch anyway?

>
> --
> Peter Xu
>



Re: [PATCH net-next v2 2/3] net: gro: parse ipv6 ext headers without frag0 invalidation

2024-01-02 Thread David Ahern
On 1/2/24 6:24 AM, Richard Gobert wrote:
> The existing code always pulls the IPv6 header and sets the transport
> offset initially. Then optionally again pulls any extension headers in
> ipv6_gso_pull_exthdrs and sets the transport offset again on return from
> that call. skb->data is set at the start of the first extension header
> before calling ipv6_gso_pull_exthdrs, and must disable the frag0
> optimization because that function uses pskb_may_pull/pskb_pull instead of
> skb_gro_ helpers. It sets the GRO offset to the TCP header with
> skb_gro_pull and sets the transport header. Then returns skb->data to its
> position before this block.
> 
> This commit introduces a new helper function - ipv6_gro_pull_exthdrs -
> which is used in ipv6_gro_receive to pull ipv6 ext headers instead of
> ipv6_gso_pull_exthdrs. Thus, there is no modification of skb->data, all
> operations use skb_gro_* helpers, and the frag0 fast path can be taken for
> IPv6 packets with ext headers.
> 
> Signed-off-by: Richard Gobert 
> Reviewed-by: Willem de Bruijn 
> ---
>  include/net/ipv6.h |  1 +
>  net/ipv6/ip6_offload.c | 51 +-
>  2 files changed, 42 insertions(+), 10 deletions(-)
> 

Reviewed-by: David Ahern 





Re: [PATCH net-next v2 1/3] net: gso: add HBH extension header offload support

2024-01-02 Thread David Ahern
On 1/2/24 6:20 AM, Richard Gobert wrote:
> This commit adds net_offload to IPv6 Hop-by-Hop extension headers (as it
> is done for routing and dstopts) since it is supported in GSO and GRO.
> This allows removing specific HBH conditionals in GSO and GRO when
> pulling and parsing an incoming packet.
> 
> Signed-off-by: Richard Gobert 
> Reviewed-by: Willem de Bruijn 
> ---
>  net/ipv6/exthdrs_offload.c | 11 +++
>  net/ipv6/ip6_offload.c | 25 +++--
>  2 files changed, 22 insertions(+), 14 deletions(-)
> 

Reviewed-by: David Ahern 





Re: [PATCH net-next v2 3/3] selftests/net: fix GRO coalesce test and add ext header coalesce tests

2024-01-02 Thread Willem de Bruijn
Richard Gobert wrote:
> Currently there is no test which checks that IPv6 extension header packets
> successfully coalesce. This commit adds a test, which verifies two IPv6
> packets with HBH extension headers do coalesce, and another test which
> checks that packets with different extension header data do not coalesce
> in GRO.
> 
> I changed the receive socket filter to accept a packet with one extension
> header. This change exposed a bug in the fragment test -- the old BPF did
> not accept the fragment packet. I updated correct_num_packets in the
> fragment test accordingly.
> 
> Signed-off-by: Richard Gobert 

Reviewed-by: Willem de Bruijn 

Thanks for adding the second test.

> +static void add_ipv6_exthdr(void *buf, void *optpkt, __u8 exthdr_type, char 
> *ext_payload)
> +{
> + struct ipv6_opt_hdr *exthdr = (struct ipv6_opt_hdr *)(optpkt + 
> tcp_offset);
> + struct ipv6hdr *iph = (struct ipv6hdr *)(optpkt + ETH_HLEN);
> + char *exthdr_payload_start = (char *)(exthdr + 1);
> +
> + exthdr->hdrlen = 0;
> + exthdr->nexthdr = IPPROTO_TCP;
> +
> + if (ext_payload)
> + memcpy(exthdr_payload_start, ext_payload, MIN_EXTHDR_SIZE - 
> sizeof(*exthdr));

minor nit, in case this gets respun: ext_payload is always true.



[PATCH v10 10/10] iommu/vt-d: Add iotlb flush for nested domain

2024-01-02 Thread Yi Liu
From: Lu Baolu 

This implements the .cache_invalidate_user() callback to support iotlb
flush for nested domain.

Reviewed-by: Kevin Tian 
Signed-off-by: Lu Baolu 
Co-developed-by: Yi Liu 
Signed-off-by: Yi Liu 
---
 drivers/iommu/intel/nested.c | 107 +++
 1 file changed, 107 insertions(+)

diff --git a/drivers/iommu/intel/nested.c b/drivers/iommu/intel/nested.c
index b5a5563ab32c..f1f86437939c 100644
--- a/drivers/iommu/intel/nested.c
+++ b/drivers/iommu/intel/nested.c
@@ -73,9 +73,116 @@ static void intel_nested_domain_free(struct iommu_domain 
*domain)
kfree(to_dmar_domain(domain));
 }
 
+static void nested_flush_dev_iotlb(struct dmar_domain *domain, u64 addr,
+  unsigned mask, u32 *fault)
+{
+   struct device_domain_info *info;
+   unsigned long flags;
+   u16 sid, qdep;
+
+   spin_lock_irqsave(&domain->lock, flags);
+   list_for_each_entry(info, &domain->devices, link) {
+   if (!info->ats_enabled)
+   continue;
+   sid = info->bus << 8 | info->devfn;
+   qdep = info->ats_qdep;
+   qi_flush_dev_iotlb(info->iommu, sid, info->pfsid,
+  qdep, addr, mask, fault);
+   quirk_extra_dev_tlb_flush(info, addr, mask,
+ IOMMU_NO_PASID, qdep);
+   }
+   spin_unlock_irqrestore(&domain->lock, flags);
+}
+
+static void intel_nested_flush_cache(struct dmar_domain *domain, u64 addr,
+unsigned long npages, bool ih, u32 *error)
+{
+   struct iommu_domain_info *info;
+   unsigned long i;
+   unsigned mask;
+   u32 fault;
+
+   xa_for_each(&domain->iommu_array, i, info)
+   qi_flush_piotlb(info->iommu,
+   domain_id_iommu(domain, info->iommu),
+   IOMMU_NO_PASID, addr, npages, ih, NULL);
+
+   if (!domain->has_iotlb_device)
+   return;
+
+   if (npages == U64_MAX)
+   mask = 64 - VTD_PAGE_SHIFT;
+   else
+   mask = ilog2(__roundup_pow_of_two(npages));
+
+   nested_flush_dev_iotlb(domain, addr, mask, &fault);
+
+   *error = 0;
+   /*
+* Invalidation queue error (i.e. IQE) will not be reported to user
+* as it's caused only by driver internal bug.
+*/
+   if (fault & DMA_FSTS_ICE)
+   *error |= IOMMU_HWPT_INVALIDATE_VTD_S1_ICE;
+   if (fault & DMA_FSTS_ITE)
+   *error |= IOMMU_HWPT_INVALIDATE_VTD_S1_ITE;
+}
+
+static int intel_nested_cache_invalidate_user(struct iommu_domain *domain,
+ struct iommu_user_data_array 
*array)
+{
+   struct dmar_domain *dmar_domain = to_dmar_domain(domain);
+   struct iommu_hwpt_vtd_s1_invalidate inv_entry;
+   u32 processed = 0;
+   int ret = 0;
+   u32 index;
+
+   if (array->type != IOMMU_HWPT_INVALIDATE_DATA_VTD_S1) {
+   ret = -EINVAL;
+   goto out;
+   }
+
+   for (index = 0; index < array->entry_num; index++) {
+   ret = iommu_copy_struct_from_user_array(&inv_entry, array,
+   
IOMMU_HWPT_INVALIDATE_DATA_VTD_S1,
+   index, hw_error);
+   if (ret)
+   break;
+
+   if (inv_entry.flags & ~IOMMU_VTD_INV_FLAGS_LEAF) {
+   ret = -EOPNOTSUPP;
+   break;
+   }
+
+   if (!IS_ALIGNED(inv_entry.addr, VTD_PAGE_SIZE) ||
+   ((inv_entry.npages == U64_MAX) && inv_entry.addr)) {
+   ret = -EINVAL;
+   break;
+   }
+
+   intel_nested_flush_cache(dmar_domain, inv_entry.addr,
+inv_entry.npages,
+inv_entry.flags & 
IOMMU_VTD_INV_FLAGS_LEAF,
+&inv_entry.hw_error);
+
+   ret = iommu_respond_struct_to_user_array(array, index,
+(void *)&inv_entry,
+sizeof(inv_entry));
+   if (ret)
+   break;
+
+   processed++;
+   }
+
+out:
+   array->entry_num = processed;
+   return ret;
+}
+
 static const struct iommu_domain_ops intel_nested_domain_ops = {
.attach_dev = intel_nested_attach_dev,
.free   = intel_nested_domain_free,
+   .cache_invalidate_user  = intel_nested_cache_invalidate_user,
 };
 
 struct iommu_domain *intel_nested_domain_alloc(struct iommu_domain *parent,
-- 
2.34.1




[PATCH v10 09/10] iommufd: Add data structure for Intel VT-d stage-1 cache invalidation

2024-01-02 Thread Yi Liu
This adds the data structure for invalidating caches of the nested domain
allocated with the IOMMU_HWPT_DATA_VTD_S1 type.

Reviewed-by: Kevin Tian 
Signed-off-by: Lu Baolu 
Signed-off-by: Yi Liu 
---
 include/uapi/linux/iommufd.h | 58 
 1 file changed, 58 insertions(+)

diff --git a/include/uapi/linux/iommufd.h b/include/uapi/linux/iommufd.h
index 824560c50ec6..37c778055201 100644
--- a/include/uapi/linux/iommufd.h
+++ b/include/uapi/linux/iommufd.h
@@ -623,6 +623,64 @@ enum iommu_hwpt_invalidate_data_type {
IOMMU_HWPT_INVALIDATE_DATA_VTD_S1,
 };
 
+/**
+ * enum iommu_hwpt_vtd_s1_invalidate_flags - Flags for Intel VT-d
+ *   stage-1 cache invalidation
+ * @IOMMU_VTD_INV_FLAGS_LEAF: Indicates whether the invalidation applies
+ *to all-levels page structure cache or just
+ *the leaf PTE cache.
+ */
+enum iommu_hwpt_vtd_s1_invalidate_flags {
+   IOMMU_VTD_INV_FLAGS_LEAF = 1 << 0,
+};
+
+/**
+ * enum iommu_hwpt_vtd_s1_invalidate_error - Hardware error of invalidation
+ * @IOMMU_HWPT_INVALIDATE_VTD_S1_ICE: Invalidation Completion Error, details
+ *refer to 11.4.7.1 Fault Status Register
+ *of VT-d specification.
+ * @IOMMU_HWPT_INVALIDATE_VTD_S1_ITE: Invalidation Time-out Error, details
+ *refer to 11.4.7.1 Fault Status Register
+ *of VT-d specification.
+ */
+enum iommu_hwpt_vtd_s1_invalidate_error {
+   IOMMU_HWPT_INVALIDATE_VTD_S1_ICE = 1 << 0,
+   IOMMU_HWPT_INVALIDATE_VTD_S1_ITE = 1 << 1,
+};
+
+/**
+ * struct iommu_hwpt_vtd_s1_invalidate - Intel VT-d cache invalidation
+ *   (IOMMU_HWPT_INVALIDATE_DATA_VTD_S1)
+ * @addr: The start address of the range to be invalidated. It needs to
+ *be 4KB aligned.
+ * @npages: Number of contiguous 4K pages to be invalidated.
+ * @flags: Combination of enum iommu_hwpt_vtd_s1_invalidate_flags
+ * @hw_error: One of enum iommu_hwpt_vtd_s1_invalidate_error
+ *
+ * The Intel VT-d specific invalidation data for user-managed stage-1 cache
+ * invalidation in nested translation. Userspace uses this structure to
+ * tell the impacted cache scope after modifying the stage-1 page table.
+ *
+ * Invalidating all the caches related to the page table by setting @addr
+ * to be 0 and @npages to be U64_MAX.
+ *
+ * The device TLB will be invalidated automatically if ATS is enabled.
+ *
+ * An entry is considered 'handled' after it passes the audit and submitted
+ * to the IOMMU by the underlying driver. Check the @entry_num output of
+ * struct iommu_hwpt_invalidate for the number of handled entries. A 'handled'
+ * request may still fail in hardware for various reasons, e.g. due to timeout
+ * on waiting for device response upon a device TLB invalidation request. In
+ * such case the hardware error info is reported in the @hw_error field of the
+ * handled entry.
+ */
+struct iommu_hwpt_vtd_s1_invalidate {
+   __aligned_u64 addr;
+   __aligned_u64 npages;
+   __u32 flags;
+   __u32 hw_error;
+};
+
 /**
  * struct iommu_hwpt_invalidate - ioctl(IOMMU_HWPT_INVALIDATE)
  * @size: sizeof(struct iommu_hwpt_invalidate)
-- 
2.34.1




[PATCH v10 08/10] iommu/vt-d: Convert stage-1 cache invalidation to return QI fault

2024-01-02 Thread Yi Liu
From: Lu Baolu 

This makes the pasid based cache invalidation and device TLB invalidation
return QI faults to callers. This is needed when userspace invalidates
caches after modifying the stage-1 page table used in nested translation.
Hardware errors during invalidation should be reported to user.

Reviewed-by: Kevin Tian 
Signed-off-by: Lu Baolu 
Signed-off-by: Yi Liu 
---
 drivers/iommu/intel/dmar.c  | 13 +++--
 drivers/iommu/intel/iommu.c | 12 ++--
 drivers/iommu/intel/iommu.h |  6 +++---
 drivers/iommu/intel/pasid.c | 12 +++-
 drivers/iommu/intel/svm.c   |  8 
 5 files changed, 27 insertions(+), 24 deletions(-)

diff --git a/drivers/iommu/intel/dmar.c b/drivers/iommu/intel/dmar.c
index c3251c36c567..d9e38ff34437 100644
--- a/drivers/iommu/intel/dmar.c
+++ b/drivers/iommu/intel/dmar.c
@@ -1529,7 +1529,7 @@ void qi_flush_iotlb(struct intel_iommu *iommu, u16 did, 
u64 addr,
 }
 
 void qi_flush_dev_iotlb(struct intel_iommu *iommu, u16 sid, u16 pfsid,
-   u16 qdep, u64 addr, unsigned mask)
+   u16 qdep, u64 addr, unsigned mask, u32 *fault)
 {
struct qi_desc desc;
 
@@ -1556,12 +1556,12 @@ void qi_flush_dev_iotlb(struct intel_iommu *iommu, u16 
sid, u16 pfsid,
desc.qw2 = 0;
desc.qw3 = 0;
 
-   qi_submit_sync(iommu, &desc, 1, 0, NULL);
+   qi_submit_sync(iommu, &desc, 1, 0, fault);
 }
 
 /* PASID-based IOTLB invalidation */
 void qi_flush_piotlb(struct intel_iommu *iommu, u16 did, u32 pasid, u64 addr,
-unsigned long npages, bool ih)
+unsigned long npages, bool ih, u32 *fault)
 {
struct qi_desc desc = {.qw2 = 0, .qw3 = 0};
 
@@ -1597,12 +1597,13 @@ void qi_flush_piotlb(struct intel_iommu *iommu, u16 
did, u32 pasid, u64 addr,
QI_EIOTLB_AM(mask);
}
 
-   qi_submit_sync(iommu, &desc, 1, 0, NULL);
+   qi_submit_sync(iommu, &desc, 1, 0, fault);
 }
 
 /* PASID-based device IOTLB Invalidate */
 void qi_flush_dev_iotlb_pasid(struct intel_iommu *iommu, u16 sid, u16 pfsid,
- u32 pasid,  u16 qdep, u64 addr, unsigned int 
size_order)
+ u32 pasid,  u16 qdep, u64 addr,
+ unsigned int size_order, u32 *fault)
 {
unsigned long mask = 1UL << (VTD_PAGE_SHIFT + size_order - 1);
struct qi_desc desc = {.qw1 = 0, .qw2 = 0, .qw3 = 0};
@@ -1650,7 +1651,7 @@ void qi_flush_dev_iotlb_pasid(struct intel_iommu *iommu, 
u16 sid, u16 pfsid,
desc.qw1 |= QI_DEV_EIOTLB_SIZE;
}
 
-   qi_submit_sync(iommu, &desc, 1, 0, NULL);
+   qi_submit_sync(iommu, &desc, 1, 0, fault);
 }
 
 void qi_flush_pasid_cache(struct intel_iommu *iommu, u16 did,
diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index 897159dba47d..68e494f1d03a 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -1462,7 +1462,7 @@ static void __iommu_flush_dev_iotlb(struct 
device_domain_info *info,
sid = info->bus << 8 | info->devfn;
qdep = info->ats_qdep;
qi_flush_dev_iotlb(info->iommu, sid, info->pfsid,
-  qdep, addr, mask);
+  qdep, addr, mask, NULL);
quirk_extra_dev_tlb_flush(info, addr, mask, IOMMU_NO_PASID, qdep);
 }
 
@@ -1490,7 +1490,7 @@ static void iommu_flush_dev_iotlb(struct dmar_domain 
*domain,
 PCI_DEVID(info->bus, info->devfn),
 info->pfsid, dev_pasid->pasid,
 info->ats_qdep, addr,
-mask);
+mask, NULL);
}
spin_unlock_irqrestore(&domain->lock, flags);
 }
@@ -1505,10 +1505,10 @@ static void domain_flush_pasid_iotlb(struct intel_iommu 
*iommu,
 
spin_lock_irqsave(&domain->lock, flags);
list_for_each_entry(dev_pasid, &domain->dev_pasids, link_domain)
-   qi_flush_piotlb(iommu, did, dev_pasid->pasid, addr, npages, ih);
+   qi_flush_piotlb(iommu, did, dev_pasid->pasid, addr, npages, ih, NULL);
 
if (!list_empty(&domain->devices))
-   qi_flush_piotlb(iommu, did, IOMMU_NO_PASID, addr, npages, ih);
+   qi_flush_piotlb(iommu, did, IOMMU_NO_PASID, addr, npages, ih, NULL);
spin_unlock_irqrestore(&domain->lock, flags);
 }
 
@@ -5195,10 +5195,10 @@ void quirk_extra_dev_tlb_flush(struct 
device_domain_info *info,
sid = PCI_DEVID(info->bus, info->devfn);
if (pasid == IOMMU_NO_PASID) {
qi_flush_dev_iotlb(info->iommu, sid, info->pfsid,
-  qdep, address, mask);
+  qdep, address, mask, NULL);
} else {
qi_flush_dev_iotlb_pasid(info->iommu, sid, info->pfsid,
-pasid, qdep, address, mask);
+pasid, qdep, address, mask, 

[PATCH v10 07/10] iommu/vt-d: Allow qi_submit_sync() to return the QI faults

2024-01-02 Thread Yi Liu
From: Lu Baolu 

This allows qi_submit_sync() to return faults to callers.

qi_submit_sync() retries when a timeout error happens, thus it cannot
return the timeout error back to the caller. This was discussed in a separate
thread [1]. Here we keep that behavior intact and just make sure there is no
retry for the newly added user domain cache invalidation, by checking whether
the caller is interested in the fault or not.

[1] 
https://lore.kernel.org/all/20231228001646.587653-6-haifeng.z...@linux.intel.com/
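
To illustrate the resulting calling convention, below is a minimal sketch of
a caller that wants the fault bits back; the function name is made up for the
example, and existing callers simply keep passing NULL and keep the old retry
behaviour:

/* Illustrative sketch only -- my_flush_and_report() is not part of this
 * series. It shows a caller opting in to QI fault reporting. */
static int my_flush_and_report(struct intel_iommu *iommu, struct qi_desc *desc)
{
	u32 fault = 0;
	int ret;

	ret = qi_submit_sync(iommu, desc, 1, 0, &fault);

	/* -ETIMEDOUT is only returned when a fault pointer is supplied; the
	 * ICE/ITE bits accumulated in 'fault' can be forwarded to user space. */
	if (fault & (DMA_FSTS_ICE | DMA_FSTS_ITE))
		return -EIO;

	return ret;
}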

Reviewed-by: Kevin Tian 
Signed-off-by: Lu Baolu 
Signed-off-by: Yi Liu 
---
 drivers/iommu/intel/dmar.c  | 35 +++--
 drivers/iommu/intel/iommu.h |  2 +-
 drivers/iommu/intel/irq_remapping.c |  2 +-
 drivers/iommu/intel/pasid.c |  2 +-
 drivers/iommu/intel/svm.c   |  6 ++---
 5 files changed, 29 insertions(+), 18 deletions(-)

diff --git a/drivers/iommu/intel/dmar.c b/drivers/iommu/intel/dmar.c
index 23cb80d62a9a..c3251c36c567 100644
--- a/drivers/iommu/intel/dmar.c
+++ b/drivers/iommu/intel/dmar.c
@@ -1267,7 +1267,8 @@ static void qi_dump_fault(struct intel_iommu *iommu, u32 
fault)
   (unsigned long long)desc->qw1);
 }
 
-static int qi_check_fault(struct intel_iommu *iommu, int index, int wait_index)
+static int qi_check_fault(struct intel_iommu *iommu, int index,
+ int wait_index, u32 *fsts)
 {
u32 fault;
int head, tail;
@@ -1278,8 +1279,12 @@ static int qi_check_fault(struct intel_iommu *iommu, int 
index, int wait_index)
return -EAGAIN;
 
fault = readl(iommu->reg + DMAR_FSTS_REG);
-   if (fault & (DMA_FSTS_IQE | DMA_FSTS_ITE | DMA_FSTS_ICE))
+   fault &= DMA_FSTS_IQE | DMA_FSTS_ITE | DMA_FSTS_ICE;
+   if (fault) {
+   if (fsts)
+   *fsts |= fault;
qi_dump_fault(iommu, fault);
+   }
 
/*
 * If IQE happens, the head points to the descriptor associated
@@ -1324,8 +1329,9 @@ static int qi_check_fault(struct intel_iommu *iommu, int 
index, int wait_index)
head = (head - 2 + QI_LENGTH) % QI_LENGTH;
} while (head != tail);
 
+   /* No need to retry if the caller is interested in the timeout 
error */
if (qi->desc_status[wait_index] == QI_ABORT)
-   return -EAGAIN;
+   return fsts ? -ETIMEDOUT : -EAGAIN;
}
 
if (fault & DMA_FSTS_ICE) {
@@ -1342,9 +1348,11 @@ static int qi_check_fault(struct intel_iommu *iommu, int 
index, int wait_index)
  * time, a wait descriptor will be appended to each submission to ensure
  * hardware has completed the invalidation before return. Wait descriptors
  * can be part of the submission but it will not be polled for completion.
+ * If callers are interested in the QI faults that occur during the handling
+ * of requests, the QI faults are saved in @fault.
  */
 int qi_submit_sync(struct intel_iommu *iommu, struct qi_desc *desc,
-  unsigned int count, unsigned long options)
+  unsigned int count, unsigned long options, u32 *fault)
 {
struct q_inval *qi = iommu->qi;
s64 devtlb_start_ktime = 0;
@@ -1422,6 +1430,9 @@ int qi_submit_sync(struct intel_iommu *iommu, struct 
qi_desc *desc,
 */
writel(qi->free_head << shift, iommu->reg + DMAR_IQT_REG);
 
+   if (fault)
+   *fault = 0;
+
while (qi->desc_status[wait_index] != QI_DONE) {
/*
 * We will leave the interrupts disabled, to prevent interrupt
@@ -1430,7 +1441,7 @@ int qi_submit_sync(struct intel_iommu *iommu, struct 
qi_desc *desc,
 * a deadlock where the interrupt context can wait indefinitely
 * for free slots in the queue.
 */
-   rc = qi_check_fault(iommu, index, wait_index);
+   rc = qi_check_fault(iommu, index, wait_index, fault);
if (rc)
break;
 
@@ -1476,7 +1487,7 @@ void qi_global_iec(struct intel_iommu *iommu)
desc.qw3 = 0;
 
/* should never fail */
-   qi_submit_sync(iommu, &desc, 1, 0);
+   qi_submit_sync(iommu, &desc, 1, 0, NULL);
 }
 
 void qi_flush_context(struct intel_iommu *iommu, u16 did, u16 sid, u8 fm,
@@ -1490,7 +1501,7 @@ void qi_flush_context(struct intel_iommu *iommu, u16 did, 
u16 sid, u8 fm,
desc.qw2 = 0;
desc.qw3 = 0;
 
-   qi_submit_sync(iommu, &desc, 1, 0);
+   qi_submit_sync(iommu, &desc, 1, 0, NULL);
 }
 
 void qi_flush_iotlb(struct intel_iommu *iommu, u16 did, u64 addr,
@@ -1514,7 +1525,7 @@ void qi_flush_iotlb(struct intel_iommu *iommu, u16 did, 
u64 addr,
desc.qw2 = 0;
desc.qw3 = 0;
 
-   qi_submit_sync(iommu, &desc, 1, 0);
+   qi_submit_sync(iommu, &desc, 1, 0, NULL);
 }
 
 void qi_flush_dev_iotlb(struct intel_iommu *iommu, u16 sid, u16 pfsid,
@@ -1545,7 +1556,7 @@ void qi_flush_dev_iotlb(struct intel_iommu *iommu, 

[PATCH v10 06/10] iommufd/selftest: Add coverage for IOMMU_HWPT_INVALIDATE ioctl

2024-01-02 Thread Yi Liu
From: Nicolin Chen 

Add test cases for the IOMMU_HWPT_INVALIDATE ioctl and verify it by using
the new IOMMU_TEST_OP_MD_CHECK_IOTLB.

Reviewed-by: Kevin Tian 
Signed-off-by: Nicolin Chen 
Co-developed-by: Yi Liu 
Signed-off-by: Yi Liu 
---
 tools/testing/selftests/iommu/iommufd.c   | 186 ++
 tools/testing/selftests/iommu/iommufd_utils.h |  33 
 2 files changed, 219 insertions(+)

diff --git a/tools/testing/selftests/iommu/iommufd.c 
b/tools/testing/selftests/iommu/iommufd.c
index c8763b880a16..ff0aa77911f3 100644
--- a/tools/testing/selftests/iommu/iommufd.c
+++ b/tools/testing/selftests/iommu/iommufd.c
@@ -116,6 +116,7 @@ TEST_F(iommufd, cmd_length)
TEST_LENGTH(iommu_destroy, IOMMU_DESTROY, id);
TEST_LENGTH(iommu_hw_info, IOMMU_GET_HW_INFO, __reserved);
TEST_LENGTH(iommu_hwpt_alloc, IOMMU_HWPT_ALLOC, __reserved);
+   TEST_LENGTH(iommu_hwpt_invalidate, IOMMU_HWPT_INVALIDATE, __reserved);
TEST_LENGTH(iommu_ioas_alloc, IOMMU_IOAS_ALLOC, out_ioas_id);
TEST_LENGTH(iommu_ioas_iova_ranges, IOMMU_IOAS_IOVA_RANGES,
out_iova_alignment);
@@ -271,7 +272,9 @@ TEST_F(iommufd_ioas, alloc_hwpt_nested)
struct iommu_hwpt_selftest data = {
.iotlb = IOMMU_TEST_IOTLB_DEFAULT,
};
+   struct iommu_hwpt_invalidate_selftest inv_reqs[2] = {};
uint32_t nested_hwpt_id[2] = {};
+   uint32_t num_inv;
uint32_t parent_hwpt_id = 0;
uint32_t parent_hwpt_id_not_work = 0;
uint32_t test_hwpt_id = 0;
@@ -344,6 +347,189 @@ TEST_F(iommufd_ioas, alloc_hwpt_nested)
EXPECT_ERRNO(EBUSY,
 _test_ioctl_destroy(self->fd, parent_hwpt_id));
 
+   /* hwpt_invalidate only supports a user-managed hwpt (nested) */
+   num_inv = 1;
+   test_err_hwpt_invalidate(ENOENT, parent_hwpt_id, inv_reqs,
+IOMMU_HWPT_INVALIDATE_DATA_SELFTEST,
+sizeof(*inv_reqs), &num_inv);
+   assert(!num_inv);
+
+   /* Check data_type by passing zero-length array */
+   num_inv = 0;
+   test_cmd_hwpt_invalidate(nested_hwpt_id[0], inv_reqs,
+IOMMU_HWPT_INVALIDATE_DATA_SELFTEST,
+sizeof(*inv_reqs), &num_inv);
+   assert(!num_inv);
+
+   /* Negative test: Invalid data_type */
+   num_inv = 1;
+   test_err_hwpt_invalidate(EINVAL, nested_hwpt_id[0], inv_reqs,
+
IOMMU_HWPT_INVALIDATE_DATA_SELFTEST_INVALID,
+sizeof(*inv_reqs), &num_inv);
+   assert(!num_inv);
+
+   /* Negative test: structure size sanity */
+   num_inv = 1;
+   test_err_hwpt_invalidate(EINVAL, nested_hwpt_id[0], inv_reqs,
+IOMMU_HWPT_INVALIDATE_DATA_SELFTEST,
+sizeof(*inv_reqs) + 1, &num_inv);
+   assert(!num_inv);
+
+   num_inv = 1;
+   test_err_hwpt_invalidate(EINVAL, nested_hwpt_id[0], inv_reqs,
+IOMMU_HWPT_INVALIDATE_DATA_SELFTEST,
+1, &num_inv);
+   assert(!num_inv);
+
+   /* Negative test: invalid flag is passed */
+   num_inv = 1;
+   inv_reqs[0].flags = 0xffffffff;
+   test_err_hwpt_invalidate(EOPNOTSUPP, nested_hwpt_id[0], 
inv_reqs,
+IOMMU_HWPT_INVALIDATE_DATA_SELFTEST,
+sizeof(*inv_reqs), &num_inv);
+   assert(!num_inv);
+
+   /* Negative test: non-zero __reserved is passed */
+   num_inv = 1;
+   inv_reqs[0].flags = 0;
+   inv_reqs[0].__reserved = 0x1234;
+   test_err_hwpt_invalidate(EOPNOTSUPP, nested_hwpt_id[0], 
inv_reqs,
+IOMMU_HWPT_INVALIDATE_DATA_SELFTEST,
+sizeof(*inv_reqs), &num_inv);
+   assert(!num_inv);
+
+   /* Negative test: invalid data_uptr when array is not empty */
+   num_inv = 1;
+   inv_reqs[0].flags = 0;
+   test_err_hwpt_invalidate(EINVAL, nested_hwpt_id[0], NULL,
+IOMMU_HWPT_INVALIDATE_DATA_SELFTEST,
+sizeof(*inv_reqs), &num_inv);
+   assert(!num_inv);
+
+   /* Negative test: invalid entry_len when array is not empty */
+   num_inv = 1;
+   inv_reqs[0].flags = 0;
+   test_err_hwpt_invalidate(EINVAL, nested_hwpt_id[0], inv_reqs,
+IOMMU_HWPT_INVALIDATE_DATA_SELFTEST,
+

[PATCH v10 05/10] iommufd/selftest: Add IOMMU_TEST_OP_MD_CHECK_IOTLB test op

2024-01-02 Thread Yi Liu
From: Nicolin Chen 

Allow testing whether the IOTLB has been invalidated or not.

Reviewed-by: Kevin Tian 
Signed-off-by: Nicolin Chen 
Signed-off-by: Yi Liu 
---
 drivers/iommu/iommufd/iommufd_test.h  |  5 
 drivers/iommu/iommufd/selftest.c  | 26 +++
 tools/testing/selftests/iommu/iommufd.c   |  4 +++
 tools/testing/selftests/iommu/iommufd_utils.h | 24 +
 4 files changed, 59 insertions(+)

diff --git a/drivers/iommu/iommufd/iommufd_test.h 
b/drivers/iommu/iommufd/iommufd_test.h
index 2eef5afde711..1cedd6b5ba2b 100644
--- a/drivers/iommu/iommufd/iommufd_test.h
+++ b/drivers/iommu/iommufd/iommufd_test.h
@@ -21,6 +21,7 @@ enum {
IOMMU_TEST_OP_ACCESS_REPLACE_IOAS,
IOMMU_TEST_OP_MOCK_DOMAIN_FLAGS,
IOMMU_TEST_OP_DIRTY,
+   IOMMU_TEST_OP_MD_CHECK_IOTLB,
 };
 
 enum {
@@ -121,6 +122,10 @@ struct iommu_test_cmd {
__aligned_u64 uptr;
__aligned_u64 out_nr_dirty;
} dirty;
+   struct {
+   __u32 id;
+   __u32 iotlb;
+   } check_iotlb;
};
__u32 last;
 };
diff --git a/drivers/iommu/iommufd/selftest.c b/drivers/iommu/iommufd/selftest.c
index ebc6c15abf67..9528528cab27 100644
--- a/drivers/iommu/iommufd/selftest.c
+++ b/drivers/iommu/iommufd/selftest.c
@@ -853,6 +853,28 @@ static int iommufd_test_md_check_refs(struct iommufd_ucmd 
*ucmd,
return 0;
 }
 
+static int iommufd_test_md_check_iotlb(struct iommufd_ucmd *ucmd,
+  u32 mockpt_id, unsigned int iotlb_id,
+  u32 iotlb)
+{
+   struct mock_iommu_domain_nested *mock_nested;
+   struct iommufd_hw_pagetable *hwpt;
+   int rc = 0;
+
hwpt = get_md_pagetable_nested(ucmd, mockpt_id, &mock_nested);
+   if (IS_ERR(hwpt))
+   return PTR_ERR(hwpt);
+
+   mock_nested = container_of(hwpt->domain,
+  struct mock_iommu_domain_nested, domain);
+
+   if (iotlb_id > MOCK_NESTED_DOMAIN_IOTLB_ID_MAX ||
+   mock_nested->iotlb[iotlb_id] != iotlb)
+   rc = -EINVAL;
iommufd_put_object(ucmd->ictx, &hwpt->obj);
+   return rc;
+}
+
 struct selftest_access {
struct iommufd_access *access;
struct file *file;
@@ -1334,6 +1356,10 @@ int iommufd_test(struct iommufd_ucmd *ucmd)
return iommufd_test_md_check_refs(
ucmd, u64_to_user_ptr(cmd->check_refs.uptr),
cmd->check_refs.length, cmd->check_refs.refs);
+   case IOMMU_TEST_OP_MD_CHECK_IOTLB:
+   return iommufd_test_md_check_iotlb(ucmd, cmd->id,
+  cmd->check_iotlb.id,
+  cmd->check_iotlb.iotlb);
case IOMMU_TEST_OP_CREATE_ACCESS:
return iommufd_test_create_access(ucmd, cmd->id,
  cmd->create_access.flags);
diff --git a/tools/testing/selftests/iommu/iommufd.c 
b/tools/testing/selftests/iommu/iommufd.c
index 6ed328c863c4..c8763b880a16 100644
--- a/tools/testing/selftests/iommu/iommufd.c
+++ b/tools/testing/selftests/iommu/iommufd.c
@@ -330,6 +330,10 @@ TEST_F(iommufd_ioas, alloc_hwpt_nested)
   &nested_hwpt_id[1],
   IOMMU_HWPT_DATA_SELFTEST, &data,
   sizeof(data));
+   test_cmd_hwpt_check_iotlb_all(nested_hwpt_id[0],
+ IOMMU_TEST_IOTLB_DEFAULT);
+   test_cmd_hwpt_check_iotlb_all(nested_hwpt_id[1],
+ IOMMU_TEST_IOTLB_DEFAULT);
 
/* Negative test: a nested hwpt on top of a nested hwpt */
test_err_hwpt_alloc_nested(EINVAL, self->device_id,
diff --git a/tools/testing/selftests/iommu/iommufd_utils.h 
b/tools/testing/selftests/iommu/iommufd_utils.h
index ad9202335656..fe0a0f566b67 100644
--- a/tools/testing/selftests/iommu/iommufd_utils.h
+++ b/tools/testing/selftests/iommu/iommufd_utils.h
@@ -195,6 +195,30 @@ static int _test_cmd_hwpt_alloc(int fd, __u32 device_id, 
__u32 pt_id,
 _test_cmd_hwpt_alloc(self->fd, device_id, pt_id, flags, \
  hwpt_id, data_type, data, data_len))
 
+#define test_cmd_hwpt_check_iotlb(hwpt_id, iotlb_id, expected) 
\
+   ({ \
+   struct iommu_test_cmd test_cmd = { \
+   .size = sizeof(test_cmd),  \
+   .op = IOMMU_TEST_OP_MD_CHECK_IOTLB,\
+   .id = hwpt_id, \
+   .check_iotlb = {  

[PATCH v10 04/10] iommufd/selftest: Add mock_domain_cache_invalidate_user support

2024-01-02 Thread Yi Liu
From: Nicolin Chen 

Add mock_domain_cache_invalidate_user() and its data structure so that the
userspace selftest program can cover the user cache invalidation pathway.

Reviewed-by: Kevin Tian 
Signed-off-by: Nicolin Chen 
Co-developed-by: Yi Liu 
Signed-off-by: Yi Liu 
---
 drivers/iommu/iommufd/iommufd_test.h | 34 
 drivers/iommu/iommufd/selftest.c | 60 
 2 files changed, 94 insertions(+)

diff --git a/drivers/iommu/iommufd/iommufd_test.h 
b/drivers/iommu/iommufd/iommufd_test.h
index 7910fbe1962d..2eef5afde711 100644
--- a/drivers/iommu/iommufd/iommufd_test.h
+++ b/drivers/iommu/iommufd/iommufd_test.h
@@ -148,4 +148,38 @@ struct iommu_hwpt_selftest {
__u32 iotlb;
 };
 
+/* Should not be equal to any defined value in enum 
iommu_hwpt_invalidate_data_type */
+#define IOMMU_HWPT_INVALIDATE_DATA_SELFTEST 0xdeadbeef
+#define IOMMU_HWPT_INVALIDATE_DATA_SELFTEST_INVALID 0xdadbeef
+
+/**
+ * enum iommu_hwpt_invalidate_selftest_error - Hardware error of invalidation
+ * @IOMMU_TEST_INVALIDATE_FAKE_ERROR: Fake hw error per test program's request
+ */
+enum iommu_hwpt_invalidate_selftest_error {
+   IOMMU_TEST_INVALIDATE_FAKE_ERROR = (1 << 0)
+};
+
+/**
+ * struct iommu_hwpt_invalidate_selftest - Invalidation data for Mock driver
+ * 
(IOMMU_HWPT_INVALIDATE_DATA_SELFTEST)
+ * @flags: Invalidate flags
+ * @iotlb_id: Invalidate iotlb entry index
+ * @hw_error: One of enum iommu_hwpt_invalidate_selftest_error
+ * @__reserved: Must be 0
+ *
+ * If IOMMU_TEST_INVALIDATE_FLAG_ALL is set in @flags, @iotlb_id will be ignored.
+ * @hw_error is meaningful only if the request is processed successfully.
+ * If IOMMU_TEST_INVALIDATE_FLAG_TRIGGER_ERROR is set in @flags, report a
+ * hw error back; the cache would not be invalidated in this case.
+ */
+struct iommu_hwpt_invalidate_selftest {
+#define IOMMU_TEST_INVALIDATE_FLAG_ALL (1 << 0)
+#define IOMMU_TEST_INVALIDATE_FLAG_TRIGGER_ERROR   (1 << 1)
+   __u32 flags;
+   __u32 iotlb_id;
+   __u32 hw_error;
+   __u32 __reserved;
+};
+
 #endif
diff --git a/drivers/iommu/iommufd/selftest.c b/drivers/iommu/iommufd/selftest.c
index 022ef8f55088..ebc6c15abf67 100644
--- a/drivers/iommu/iommufd/selftest.c
+++ b/drivers/iommu/iommufd/selftest.c
@@ -473,9 +473,69 @@ static void mock_domain_free_nested(struct iommu_domain 
*domain)
kfree(mock_nested);
 }
 
+static int
+mock_domain_cache_invalidate_user(struct iommu_domain *domain,
+ struct iommu_user_data_array *array)
+{
+   struct mock_iommu_domain_nested *mock_nested =
+   container_of(domain, struct mock_iommu_domain_nested, domain);
+   u32 hw_error = 0, processed = 0;
+   struct iommu_hwpt_invalidate_selftest inv;
+   int i = 0, j;
+   int rc = 0;
+
+   if (array->type != IOMMU_HWPT_INVALIDATE_DATA_SELFTEST) {
+   rc = -EINVAL;
+   goto out;
+   }
+
+   for ( ; i < array->entry_num; i++) {
+   rc = iommu_copy_struct_from_user_array(&inv, array,
+  
IOMMU_HWPT_INVALIDATE_DATA_SELFTEST,
+  i, __reserved);
+   if (rc)
+   break;
+
+   if ((inv.flags & ~(IOMMU_TEST_INVALIDATE_FLAG_ALL |
+ IOMMU_TEST_INVALIDATE_FLAG_TRIGGER_ERROR)) ||
+   inv.__reserved) {
+   rc = -EOPNOTSUPP;
+   break;
+   }
+
+   if (inv.iotlb_id > MOCK_NESTED_DOMAIN_IOTLB_ID_MAX) {
+   rc = -EINVAL;
+   break;
+   }
+
+   if (inv.flags & IOMMU_TEST_INVALIDATE_FLAG_TRIGGER_ERROR) {
+   hw_error = IOMMU_TEST_INVALIDATE_FAKE_ERROR;
+   } else if (inv.flags & IOMMU_TEST_INVALIDATE_FLAG_ALL) {
+   /* Invalidate all mock iotlb entries and ignore 
iotlb_id */
+   for (j = 0; j < MOCK_NESTED_DOMAIN_IOTLB_NUM; j++)
+   mock_nested->iotlb[j] = 0;
+   } else {
+   mock_nested->iotlb[inv.iotlb_id] = 0;
+   }
+
+   inv.hw_error = hw_error;
+   rc = iommu_respond_struct_to_user_array(array, i, (void *)&inv,
+   sizeof(inv));
+   if (rc)
+   break;
+
+   processed++;
+   }
+
+out:
+   array->entry_num = processed;
+   return rc;
+}
+
 static struct iommu_domain_ops domain_nested_ops = {
.free = mock_domain_free_nested,
.attach_dev = mock_domain_nop_attach,
+   .cache_invalidate_user = mock_domain_cache_invalidate_user,
 };
 
 static inline struct iommufd_hw_pagetable *
-- 
2.34.1




[PATCH v10 03/10] iommu: Add iommu_copy_struct_from_user_array helper

2024-01-02 Thread Yi Liu
From: Nicolin Chen 

Wrap up the data pointer/num sanity checks and the __iommu_copy_struct_from_user
call into iommu_copy_struct_from_user_array(), so that iommu drivers can copy
driver-specific data from a specific location in the struct
iommu_user_data_array, and add iommu_respond_struct_to_user_array() to copy a
response back to a specific location in the struct iommu_user_data_array.

These are expected to be used in cache_invalidate_user ops, for example.
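
For illustration, a driver's cache_invalidate_user op would be expected to use
these helpers roughly as in the sketch below, modeled on the selftest mock
driver later in this series. 'struct my_inv_data', 'MY_INVALIDATE_DATA_TYPE',
'last_field' and the function name are placeholders, not real kernel symbols:

static int my_cache_invalidate_user(struct iommu_domain *domain,
				    struct iommu_user_data_array *array)
{
	struct my_inv_data inv;	/* placeholder driver-specific uapi struct */
	u32 processed = 0;
	int i, rc = 0;

	for (i = 0; i < array->entry_num; i++) {
		/* Copy entry i; 'last_field' is the struct's initial last member */
		rc = iommu_copy_struct_from_user_array(&inv, array,
						       MY_INVALIDATE_DATA_TYPE,
						       i, last_field);
		if (rc)
			break;

		/* ... perform the driver-specific cache invalidation ... */

		/* Write any per-entry result (e.g. a hw error) back to entry i */
		rc = iommu_respond_struct_to_user_array(array, i, &inv,
							sizeof(inv));
		if (rc)
			break;

		processed++;
	}

	/* Report how many entries were actually handled */
	array->entry_num = processed;
	return rc;
}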

Reviewed-by: Kevin Tian 
Signed-off-by: Nicolin Chen 
Co-developed-by: Yi Liu 
Signed-off-by: Yi Liu 
---
 include/linux/iommu.h | 74 +++
 1 file changed, 74 insertions(+)

diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index 93c0d12dd047..c3434c9eaa6d 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -341,6 +341,80 @@ static inline int __iommu_copy_struct_from_user(
  sizeof(*kdst),  \
  offsetofend(typeof(*kdst), min_last))
 
+/**
+ * __iommu_copy_struct_from_user_array - Copy iommu driver specific user space
+ *   data from an iommu_user_data_array
+ * @dst_data: Pointer to an iommu driver specific user data that is defined in
+ *include/uapi/linux/iommufd.h
+ * @src_array: Pointer to a struct iommu_user_data_array for a user space array
+ * @data_type: The data type of the @dst_data. Must match with @src_array.type
+ * @index: Index to the location in the array to copy user data from
+ * @data_len: Length of current user data structure, i.e. sizeof(struct _dst)
+ * @min_len: Initial length of user data structure for backward compatibility.
+ *   This should be offsetofend using the last member in the user data
+ *   struct that was initially added to include/uapi/linux/iommufd.h
+ */
+static inline int
+__iommu_copy_struct_from_user_array(void *dst_data,
+   const struct iommu_user_data_array 
*src_array,
+   unsigned int data_type, unsigned int index,
+   size_t data_len, size_t min_len)
+{
+   struct iommu_user_data src_data;
+
+   if (WARN_ON(!src_array || index >= src_array->entry_num))
+   return -EINVAL;
+   if (!src_array->entry_num)
+   return -EINVAL;
+   src_data.uptr = src_array->uptr + src_array->entry_len * index;
+   src_data.len = src_array->entry_len;
+   src_data.type = src_array->type;
+
+   return __iommu_copy_struct_from_user(dst_data, &src_data, data_type,
+data_len, min_len);
+}
+
+/**
+ * iommu_copy_struct_from_user_array - Copy iommu driver specific user space
+ * data from an iommu_user_data_array
+ * @kdst: Pointer to an iommu driver specific user data that is defined in
+ *include/uapi/linux/iommufd.h
+ * @user_array: Pointer to a struct iommu_user_data_array for a user space
+ *  array
+ * @data_type: The data type of the @kdst. Must match with @user_array->type
+ * @index: Index to the location in the array to copy user data from
+ * @min_last: The last member of the data structure that @kdst points to in
+ *the initial version.
+ * Return 0 for success, otherwise -error.
+ */
+#define iommu_copy_struct_from_user_array(kdst, user_array, data_type,   \
+ index, min_last)\
+   __iommu_copy_struct_from_user_array(kdst, user_array, data_type, \
+   index, sizeof(*kdst),\
+   offsetofend(typeof(*kdst),   \
+   min_last))
+
+/**
+ * iommu_respond_struct_to_user_array - Copy the response in @ksrc back to
+ *  a specific entry of user array
+ * @user_array: Pointer to a struct iommu_user_data_array for a user space
+ *  array
+ * @index: Index to the location in the array to copy response
+ * @ksrc: Pointer to kernel structure
+ * @klen: Length of @ksrc struct
+ *
+ * This only copies response of one entry (@index) in @user_array.
+ */
+static inline int
+iommu_respond_struct_to_user_array(const struct iommu_user_data_array *array,
+  unsigned int index, void *ksrc, size_t klen)
+{
+   if (copy_to_user(array->uptr + array->entry_len * index,
+ksrc, min_t(size_t, array->entry_len, klen)))
+   return -EFAULT;
+   return 0;
+}
+
 /**
  * struct iommu_ops - iommu ops and capabilities
  * @capable: check capability
-- 
2.34.1




[PATCH v10 02/10] iommufd: Add IOMMU_HWPT_INVALIDATE

2024-01-02 Thread Yi Liu
In nested translation, the stage-1 page table is user-managed but cached
by the IOMMU hardware, so an update on present page table entries in the
stage-1 page table should be followed by a cache invalidation.

Add an IOMMU_HWPT_INVALIDATE ioctl to support such a cache invalidation.
It takes hwpt_id to specify the iommu_domain, and a multi-entry array to
support multiple invalidation requests in one ioctl.

enum iommu_hwpt_invalidate_data_type is defined to tag the data type of
the entries in the multi-entry array.
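
For reference, a minimal user-space sketch of driving the new ioctl could look
like the following. It assumes an iommufd file descriptor and a nested hwpt_id
obtained earlier via IOMMU_HWPT_ALLOC, and uses the VT-d stage-1 data type from
later in this series; the wrapper function name is made up for the example:

#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/iommufd.h>

/* Hypothetical helper, not part of this series. */
static int invalidate_stage1_range(int iommufd, __u32 hwpt_id,
				   __u64 addr, __u64 npages)
{
	struct iommu_hwpt_vtd_s1_invalidate req = {
		.addr = addr,		/* 4KB-aligned start of the range */
		.npages = npages,	/* addr = 0, npages = UINT64_MAX flushes all */
	};
	struct iommu_hwpt_invalidate cmd = {
		.size = sizeof(cmd),
		.hwpt_id = hwpt_id,
		.data_uptr = (__u64)(uintptr_t)&req,
		.data_type = IOMMU_HWPT_INVALIDATE_DATA_VTD_S1,
		.entry_len = sizeof(req),
		.entry_num = 1,
	};
	int rc = ioctl(iommufd, IOMMU_HWPT_INVALIDATE, &cmd);

	/* cmd.entry_num is updated to the number of handled entries even on
	 * failure; req.hw_error carries any hardware error for handled entries. */
	return rc ? -1 : (int)cmd.entry_num;
}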

Reviewed-by: Kevin Tian 
Co-developed-by: Nicolin Chen 
Signed-off-by: Nicolin Chen 
Signed-off-by: Yi Liu 
---
 drivers/iommu/iommufd/hw_pagetable.c| 41 +++
 drivers/iommu/iommufd/iommufd_private.h | 10 ++
 drivers/iommu/iommufd/main.c|  3 ++
 include/uapi/linux/iommufd.h| 43 +
 4 files changed, 97 insertions(+)

diff --git a/drivers/iommu/iommufd/hw_pagetable.c 
b/drivers/iommu/iommufd/hw_pagetable.c
index cbb5df0a6c32..4e8711f19f72 100644
--- a/drivers/iommu/iommufd/hw_pagetable.c
+++ b/drivers/iommu/iommufd/hw_pagetable.c
@@ -371,3 +371,44 @@ int iommufd_hwpt_get_dirty_bitmap(struct iommufd_ucmd 
*ucmd)
iommufd_put_object(ucmd->ictx, &hwpt_paging->common.obj);
return rc;
 }
+
+int iommufd_hwpt_invalidate(struct iommufd_ucmd *ucmd)
+{
+   struct iommu_hwpt_invalidate *cmd = ucmd->cmd;
+   struct iommu_user_data_array data_array = {
+   .type = cmd->data_type,
+   .uptr = u64_to_user_ptr(cmd->data_uptr),
+   .entry_len = cmd->entry_len,
+   .entry_num = cmd->entry_num,
+   };
+   struct iommufd_hw_pagetable *hwpt;
+   u32 done_num = 0;
+   int rc;
+
+   if (cmd->__reserved) {
+   rc = -EOPNOTSUPP;
+   goto out;
+   }
+
+   if (cmd->entry_num && (!cmd->data_uptr || !cmd->entry_len)) {
+   rc = -EINVAL;
+   goto out;
+   }
+
+   hwpt = iommufd_get_hwpt_nested(ucmd, cmd->hwpt_id);
+   if (IS_ERR(hwpt)) {
+   rc = PTR_ERR(hwpt);
+   goto out;
+   }
+
+   rc = hwpt->domain->ops->cache_invalidate_user(hwpt->domain,
+ &data_array);
+   done_num = data_array.entry_num;
+
+   iommufd_put_object(ucmd->ictx, &hwpt->obj);
+out:
+   cmd->entry_num = done_num;
+   if (iommufd_ucmd_respond(ucmd, sizeof(*cmd)))
+   return -EFAULT;
+   return rc;
+}
diff --git a/drivers/iommu/iommufd/iommufd_private.h 
b/drivers/iommu/iommufd/iommufd_private.h
index abae041e256f..991f864d1f9b 100644
--- a/drivers/iommu/iommufd/iommufd_private.h
+++ b/drivers/iommu/iommufd/iommufd_private.h
@@ -328,6 +328,15 @@ iommufd_get_hwpt_paging(struct iommufd_ucmd *ucmd, u32 id)
   IOMMUFD_OBJ_HWPT_PAGING),
struct iommufd_hwpt_paging, common.obj);
 }
+
+static inline struct iommufd_hw_pagetable *
+iommufd_get_hwpt_nested(struct iommufd_ucmd *ucmd, u32 id)
+{
+   return container_of(iommufd_get_object(ucmd->ictx, id,
+  IOMMUFD_OBJ_HWPT_NESTED),
+   struct iommufd_hw_pagetable, obj);
+}
+
 int iommufd_hwpt_set_dirty_tracking(struct iommufd_ucmd *ucmd);
 int iommufd_hwpt_get_dirty_bitmap(struct iommufd_ucmd *ucmd);
 
@@ -345,6 +354,7 @@ void iommufd_hwpt_paging_abort(struct iommufd_object *obj);
 void iommufd_hwpt_nested_destroy(struct iommufd_object *obj);
 void iommufd_hwpt_nested_abort(struct iommufd_object *obj);
 int iommufd_hwpt_alloc(struct iommufd_ucmd *ucmd);
+int iommufd_hwpt_invalidate(struct iommufd_ucmd *ucmd);
 
 static inline void iommufd_hw_pagetable_put(struct iommufd_ctx *ictx,
struct iommufd_hw_pagetable *hwpt)
diff --git a/drivers/iommu/iommufd/main.c b/drivers/iommu/iommufd/main.c
index c9091e46d208..39b32932c61e 100644
--- a/drivers/iommu/iommufd/main.c
+++ b/drivers/iommu/iommufd/main.c
@@ -322,6 +322,7 @@ union ucmd_buffer {
struct iommu_hw_info info;
struct iommu_hwpt_alloc hwpt;
struct iommu_hwpt_get_dirty_bitmap get_dirty_bitmap;
+   struct iommu_hwpt_invalidate cache;
struct iommu_hwpt_set_dirty_tracking set_dirty_tracking;
struct iommu_ioas_alloc alloc;
struct iommu_ioas_allow_iovas allow_iovas;
@@ -360,6 +361,8 @@ static const struct iommufd_ioctl_op iommufd_ioctl_ops[] = {
 __reserved),
IOCTL_OP(IOMMU_HWPT_GET_DIRTY_BITMAP, iommufd_hwpt_get_dirty_bitmap,
 struct iommu_hwpt_get_dirty_bitmap, data),
+   IOCTL_OP(IOMMU_HWPT_INVALIDATE, iommufd_hwpt_invalidate,
+struct iommu_hwpt_invalidate, __reserved),
IOCTL_OP(IOMMU_HWPT_SET_DIRTY_TRACKING, iommufd_hwpt_set_dirty_tracking,
 struct iommu_hwpt_set_dirty_tracking, __reserved),
IOCTL_OP(IOMMU_IOAS_ALLOC, 

[PATCH v10 01/10] iommu: Add cache_invalidate_user op

2024-01-02 Thread Yi Liu
From: Lu Baolu 

The updates of the PTEs in the nested page table will be propagated to the
hardware caches.

Add a new domain op cache_invalidate_user for userspace to flush the
hardware caches for a nested domain through iommufd. No wrapper is added for
it, as it's only supposed to be used by iommufd. Invalidation requests are
passed in in the form of a user data array containing a number of invalidation
data entries.
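
As a rough sketch, a driver exposing this op for its nested domains would wire
it up as below; the names are placeholders for illustration, and the real
implementations come later in this series and in the selftest mock driver:

static int my_nested_cache_invalidate_user(struct iommu_domain *domain,
					   struct iommu_user_data_array *array)
{
	/* Walk array->entry_num entries of array->entry_len bytes each from
	 * array->uptr, and update array->entry_num to the number of requests
	 * actually handled before returning. */
	return 0;
}

static const struct iommu_domain_ops my_nested_domain_ops = {
	/* .free and .attach_dev omitted for brevity */
	.cache_invalidate_user	= my_nested_cache_invalidate_user,
};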

Signed-off-by: Lu Baolu 
Reviewed-by: Kevin Tian 
Signed-off-by: Nicolin Chen 
Signed-off-by: Yi Liu 
---
 include/linux/iommu.h | 26 ++
 1 file changed, 26 insertions(+)

diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index 6291aa7b079b..93c0d12dd047 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -284,6 +284,23 @@ struct iommu_user_data {
size_t len;
 };
 
+/**
+ * struct iommu_user_data_array - iommu driver specific user space data array
+ * @type: The data type of all the entries in the user buffer array
+ * @uptr: Pointer to the user buffer array
+ * @entry_len: The fixed-width length of an entry in the array, in bytes
+ * @entry_num: The number of total entries in the array
+ *
+ * The user buffer includes an array of requests with format defined in
+ * include/uapi/linux/iommufd.h
+ */
+struct iommu_user_data_array {
+   unsigned int type;
+   void __user *uptr;
+   size_t entry_len;
+   u32 entry_num;
+};
+
 /**
  * __iommu_copy_struct_from_user - Copy iommu driver specific user space data
  * @dst_data: Pointer to an iommu driver specific user data that is defined in
@@ -440,6 +457,13 @@ struct iommu_ops {
  * @iotlb_sync_map: Sync mappings created recently using @map to the hardware
  * @iotlb_sync: Flush all queued ranges from the hardware TLBs and empty flush
  *queue
+ * @cache_invalidate_user: Flush hardware cache for user space IO page table.
+ * The @domain must be IOMMU_DOMAIN_NESTED. The @array
+ * passes in the cache invalidation requests, in form
+ * of a driver data structure. The driver must update
+ * array->entry_num to report the number of handled
+ * invalidation requests. The driver data structure
+ * must be defined in include/uapi/linux/iommufd.h
  * @iova_to_phys: translate iova to physical address
  * @enforce_cache_coherency: Prevent any kind of DMA from bypassing 
IOMMU_CACHE,
  *   including no-snoop TLPs on PCIe or other platform
@@ -465,6 +489,8 @@ struct iommu_domain_ops {
  size_t size);
void (*iotlb_sync)(struct iommu_domain *domain,
   struct iommu_iotlb_gather *iotlb_gather);
+   int (*cache_invalidate_user)(struct iommu_domain *domain,
+struct iommu_user_data_array *array);
 
phys_addr_t (*iova_to_phys)(struct iommu_domain *domain,
dma_addr_t iova);
-- 
2.34.1




[PATCH v10 00/10] Add iommufd nesting (part 2/2)

2024-01-02 Thread Yi Liu
Nested translation is a hardware feature that is supported by many modern
IOMMUs. It uses two stages (stage-1, stage-2) of address translation
to get to the physical address. The stage-1 translation table is owned
by userspace (e.g. by a guest OS), while stage-2 is owned by the kernel. Changes
to the stage-1 translation table should be followed by an IOTLB invalidation.

Taking Intel VT-d as an example, the stage-1 translation table is the I/O page
table. As the diagram below shows, the guest I/O page table pointer in GPA
(guest physical address) is passed to the host and used to perform the stage-1
address translation. Along with it, modifications to present mappings in the
guest I/O page table should be followed by an IOTLB invalidation.

.-.  .---.
|   vIOMMU|  | Guest I/O page table  |
| |  '---'
./
| PASID Entry |--- PASID cache flush --+
'-'|
| |V
| |   I/O page table pointer in GPA
'-'
Guest
--| Shadow |---|
  vv   v
Host
.-.  ..
|   pIOMMU|  |  FS for GIOVA->GPA |
| |  ''
./  |
| PASID Entry | V (Nested xlate)
'\.--.
| |   | SS for GPA->HPA, unmanaged domain|
| |   '--'
'-'
Where:
 - FS = First stage page tables
 - SS = Second stage page tables


This series is based on the first part which was merged [1]. This series adds
the cache invalidation interface for the userspace to invalidate caches after
modifying the stage-1 page table. This includes both the iommufd changes and the
VT-d driver changes.

Complete code can be found in [2], QEMU code can be found in [3].

At last, this is a team work together with Nicolin Chen and Lu Baolu. Thanks
to them for the help. ^_^ Looking forward to your feedback.

[1] 
https://lore.kernel.org/linux-iommu/20231026044216.64964-1-yi.l@intel.com/ 
- merged
[2] https://github.com/yiliu1765/iommufd/tree/iommufd_nesting
[3] https://github.com/yiliu1765/qemu/tree/zhenzhong/wip/iommufd_nesting_rfcv1

Change log:

v10:
 - Minor tweak to patch 07 (Kevin)
 - Rebase on top of 6.7-rc8

v9: 
https://lore.kernel.org/linux-iommu/20231228150629.13149-1-yi.l@intel.com/
 - Add a test case which sets both IOMMU_TEST_INVALIDATE_FLAG_ALL and
   IOMMU_TEST_INVALIDATE_FLAG_TRIGGER_ERROR in flags, and expect to succeed
   and see an 'error'. (Kevin)
 - Return -ETIMEDOUT in qi_check_fault() if the caller is interested in the
   fault when a timeout happens. If not, qi_submit_sync() will keep retrying
   and hence be unable to report the error back to the user. For now, only the
   user cache invalidation path is interested in the timeout error, so this
   change only affects the user cache invalidation path. Other paths will still
   hang in qi_submit_sync() when a timeout happens. (Kevin)

v8: 
https://lore.kernel.org/linux-iommu/20231227161354.67701-1-yi.l@intel.com/
 - Pass invalidation hint to the cache invalidation helper in the 
cache_invalidate_user
   op path (Kevin)
 - Move the devTLB invalidation out of info->iommu loop (Kevin, Weijiang)
 - Clear *fault per restart in qi_submit_sync() to avoid error accumulation
   across submissions. (Kevin)
 - Define the vtd cache invalidation uapi structure in separate patch (Kevin)
 - Rename inv_error to be hw_error (Kevin)
 - Rename 'reqs_uptr', 'req_type', 'req_len' and 'req_num' to be 'data_uptr',
   'data_type', "entry_len' and 'entry_num" (Kevin)
 - Allow the user to set IOMMU_TEST_INVALIDATE_FLAG_ALL and
   IOMMU_TEST_INVALIDATE_FLAG_TRIGGER_ERROR at the same time (Kevin)

v7: 
https://lore.kernel.org/linux-iommu/20231221153948.119007-1-yi.l@intel.com/
 - Remove domain->ops->cache_invalidate_user check in hwpt alloc path due
   to failure in bisect (Baolu)
 - Remove out_driver_error_code from struct iommu_hwpt_invalidate after
   discussion in v6. Should expect per-entry error code.
 - Rework the selftest cache invalidation part to report a per-entry error
 - Allow user to pass in an empty array to have a try-and-fail mechanism for
   user to check if a given req_type is supported by the kernel (Jason)
 - Define a separate enum type for cache invalidation data (Jason)
 - Fix the IOMMU_HWPT_INVALIDATE to always update the req_num field before
   returning (Nicolin)
 - Merge the VT-d nesting part 2/2
   
https://lore.kernel.org/linux-iommu/20231117131816.24359-1-yi.l@intel.com/
   into this series to avoid defining empty enum in the middle of the series.
   The major difference is adding the VT-d related invalidation uapi structures
   together with the generic data structures in patch 02 of this 

Re: [PATCH] kselftest: Add basic test for probing the rust sample modules

2024-01-02 Thread Laura Nao
On 12/21/23 20:46, Miguel Ojeda wrote:
> On Fri, Dec 15, 2023 at 2:21 PM Laura Nao 
> wrote:
>>
>> Add new basic kselftest that checks if the available rust sample
>> modules
>> can be added and removed correctly.
>>
>> Signed-off-by: Laura Nao 
> 
> Thanks Laura!
> 
> Shuah: do you want that we pick this one? If so, your `Acked-by` would
> be nice -- thanks! Otherwise, please feel free to pick it up.
> 
> Cc'ing David too since it involves KTAP in case he has comments.
> 
>> diff --git a/tools/testing/selftests/rust/Makefile
>> b/tools/testing/selftests/rust/Makefile
> 
> Missing SPDX line? (it can be added when picking it up, though).
> 

Thanks for the feedback Miguel!

>> +$(OUTPUT)/ktap_helpers.sh:
>> +   cp $(top_srcdir)/tools/testing/selftests/dt/ktap_helpers.sh
>> $@
> 
> This may be something for another series, but should these helpers be
> factored out perhaps / provided by the framework? Does it work
> sourcing them from `dt` directly instead of copying meanwhile (to
> simplify)?
> 
>> +KSFT_PASS=0
>> +KSFT_FAIL=1
>> +KSFT_SKIP=4
> 
> Similarly, would it make sense for this kind of "common constants" be
> factored somehow? Or does that not make sense (I see other tests also
> define them "manually")?
> 

Sourcing the file from the `dt` folder does work when running the test
with `make -C tools/testing/selftests TARGETS=rust run_tests`, but fails
when the test is installed with `make -C tools/testing/selftests
TARGETS=rust install` and run with
`./tools/testing/selftests/kselftest_install/run_kselftest.sh` (unless
the dt kselftest is installed too).

I agree factoring out the helpers might be a better solution. I sent a
patch to move the ktap_helpers.sh file to the kselftest common
directory, so that kselftests written in bash can make use of the helper
functions more easily:

https://lore.kernel.org/linux-kselftest/20240102141528.169947-1-laura@collabora.com/T/#u

If that patch is merged first, I'll rework this one and send a v2 with
the proper adjustments.

Best,

Laura 



[PATCH] selftests: Move KTAP bash helpers to selftests common folder

2024-01-02 Thread Laura Nao
Move bash helpers for outputting in KTAP format to the common selftests
folder. This allows kselftests other than the dt one to source the file
and make use of the helper functions.
Define pass, fail and skip codes in the same file too.

Signed-off-by: Laura Nao 
---
 tools/testing/selftests/Makefile  | 1 +
 tools/testing/selftests/dt/Makefile   | 2 +-
 tools/testing/selftests/dt/test_unprobed_devices.sh   | 6 +-
 tools/testing/selftests/{dt => kselftest}/ktap_helpers.sh | 6 ++
 4 files changed, 9 insertions(+), 6 deletions(-)
 rename tools/testing/selftests/{dt => kselftest}/ktap_helpers.sh (94%)

diff --git a/tools/testing/selftests/Makefile b/tools/testing/selftests/Makefile
index 3b2061d1c1a5..976e96013c91 100644
--- a/tools/testing/selftests/Makefile
+++ b/tools/testing/selftests/Makefile
@@ -251,6 +251,7 @@ ifdef INSTALL_PATH
install -m 744 kselftest/module.sh $(INSTALL_PATH)/kselftest/
install -m 744 kselftest/runner.sh $(INSTALL_PATH)/kselftest/
install -m 744 kselftest/prefix.pl $(INSTALL_PATH)/kselftest/
+   install -m 744 kselftest/ktap_helpers.sh $(INSTALL_PATH)/kselftest/
install -m 744 run_kselftest.sh $(INSTALL_PATH)/
rm -f $(TEST_LIST)
@ret=1; \
diff --git a/tools/testing/selftests/dt/Makefile 
b/tools/testing/selftests/dt/Makefile
index 62dc00ee4978..2d33ee9e9b71 100644
--- a/tools/testing/selftests/dt/Makefile
+++ b/tools/testing/selftests/dt/Makefile
@@ -4,7 +4,7 @@ ifneq ($(PY3),)
 
 TEST_PROGS := test_unprobed_devices.sh
 TEST_GEN_FILES := compatible_list
-TEST_FILES := compatible_ignore_list ktap_helpers.sh
+TEST_FILES := compatible_ignore_list
 
 include ../lib.mk
 
diff --git a/tools/testing/selftests/dt/test_unprobed_devices.sh 
b/tools/testing/selftests/dt/test_unprobed_devices.sh
index b07af2a4c4de..f2307ee443a6 100755
--- a/tools/testing/selftests/dt/test_unprobed_devices.sh
+++ b/tools/testing/selftests/dt/test_unprobed_devices.sh
@@ -15,16 +15,12 @@
 
 DIR="$(dirname $(readlink -f "$0"))"
 
-source "${DIR}"/ktap_helpers.sh
+source "${DIR}"/../kselftest/ktap_helpers.sh
 
 PDT=/proc/device-tree/
 COMPAT_LIST="${DIR}"/compatible_list
 IGNORE_LIST="${DIR}"/compatible_ignore_list
 
-KSFT_PASS=0
-KSFT_FAIL=1
-KSFT_SKIP=4
-
 ktap_print_header
 
 if [[ ! -d "${PDT}" ]]; then
diff --git a/tools/testing/selftests/dt/ktap_helpers.sh 
b/tools/testing/selftests/kselftest/ktap_helpers.sh
similarity index 94%
rename from tools/testing/selftests/dt/ktap_helpers.sh
rename to tools/testing/selftests/kselftest/ktap_helpers.sh
index 8dfae51bb4e2..dd79d96f3b5a 100644
--- a/tools/testing/selftests/dt/ktap_helpers.sh
+++ b/tools/testing/selftests/kselftest/ktap_helpers.sh
@@ -9,6 +9,12 @@ KTAP_CNT_PASS=0
 KTAP_CNT_FAIL=0
 KTAP_CNT_SKIP=0
 
+KSFT_PASS=0
+KSFT_FAIL=1
+KSFT_XFAIL=2
+KSFT_XPASS=3
+KSFT_SKIP=4
+
 ktap_print_header() {
echo "TAP version 13"
 }
-- 
2.30.2




Re: [PATCH net-next 0/4] mptcp: add CurrEstab MIB counter

2024-01-02 Thread patchwork-bot+netdevbpf
Hello:

This series was applied to netdev/net-next.git (main)
by David S. Miller :

On Fri, 22 Dec 2023 13:47:21 +0100 you wrote:
> This MIB counter is similar to the one of TCP -- CurrEstab -- available
> in /proc/net/snmp. This is useful to quickly list the number of MPTCP
> connections without having to iterate over all of them.
> 
> Patch 1 prepares its support by adding new helper functions:
> 
>  - MPTCP_DEC_STATS(): similar to MPTCP_INC_STATS(), but this time to
>decrement a counter.
> 
> [...]

Here is the summary with links:
  - [net-next,1/4] mptcp: add CurrEstab MIB counter support
https://git.kernel.org/netdev/net-next/c/d9cd27b8cd19
  - [net-next,2/4] mptcp: use mptcp_set_state
(no matching commit)
  - [net-next,3/4] selftests: mptcp: join: check CURRESTAB counters
https://git.kernel.org/netdev/net-next/c/0bd962dd86b2
  - [net-next,4/4] selftests: mptcp: diag: check CURRESTAB counters
https://git.kernel.org/netdev/net-next/c/81ab772819da

You are awesome, thank you!
-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html





Re: [PATCH net-next 0/2] selftest/net: Some more TCP-AO selftest post-merge fixups

2024-01-02 Thread patchwork-bot+netdevbpf
Hello:

This series was applied to netdev/net-next.git (main)
by David S. Miller :

On Fri, 22 Dec 2023 01:59:05 + you wrote:
> Note that there's another post-merge fix for TCP-AO selftests, but that
> doesn't conflict with these, so I don't resend that:
> 
> https://lore.kernel.org/all/20231219-b4-tcp-ao-selftests-out-of-tree-v1-1-0fff92d26...@arista.com/T/#u
> 
> Signed-off-by: Dmitry Safonov 
> 
> [...]

Here is the summary with links:
  - [net-next,1/2] selftest/tcp-ao: Set routes in a proper VRF table id
https://git.kernel.org/netdev/net-next/c/72cd9f8d5a99
  - [net-next,2/2] selftest/tcp-ao: Work on namespace-ified sysctl_optmem_max
https://git.kernel.org/netdev/net-next/c/80057b2080a8

You are awesome, thank you!
-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html





[PATCH net-next v2 3/3] selftests/net: fix GRO coalesce test and add ext header coalesce tests

2024-01-02 Thread Richard Gobert
Currently there is no test which checks that IPv6 extension header packets
successfully coalesce. This commit adds a test, which verifies two IPv6
packets with HBH extension headers do coalesce, and another test which
checks that packets with different extension header data do not coalesce
in GRO.

I changed the receive socket filter to accept a packet with one extension
header. This change exposed a bug in the fragment test -- the old BPF did
not accept the fragment packet. I updated correct_num_packets in the
fragment test accordingly.

Signed-off-by: Richard Gobert 
---
 tools/testing/selftests/net/gro.c | 94 +--
 1 file changed, 88 insertions(+), 6 deletions(-)

diff --git a/tools/testing/selftests/net/gro.c 
b/tools/testing/selftests/net/gro.c
index 30024d0ed373..6dbba8ec53a1 100644
--- a/tools/testing/selftests/net/gro.c
+++ b/tools/testing/selftests/net/gro.c
@@ -71,6 +71,12 @@
 #define MAX_PAYLOAD (IP_MAXPACKET - sizeof(struct tcphdr) - sizeof(struct 
ipv6hdr))
 #define NUM_LARGE_PKT (MAX_PAYLOAD / MSS)
 #define MAX_HDR_LEN (ETH_HLEN + sizeof(struct ipv6hdr) + sizeof(struct tcphdr))
+#define MIN_EXTHDR_SIZE 8
+#define EXT_PAYLOAD_1 "\x00\x00\x00\x00\x00\x00"
+#define EXT_PAYLOAD_2 "\x11\x11\x11\x11\x11\x11"
+
+#define ipv6_optlen(p)  (((p)->hdrlen+1) << 3) /* calculate IPv6 extension 
header len */
+#define BUILD_BUG_ON(condition) ((void)sizeof(char[1 - 2*!!(condition)]))
 
 static const char *addr6_src = "fdaa::2";
 static const char *addr6_dst = "fdaa::1";
@@ -104,7 +110,7 @@ static void setup_sock_filter(int fd)
const int dport_off = tcp_offset + offsetof(struct tcphdr, dest);
const int ethproto_off = offsetof(struct ethhdr, h_proto);
int optlen = 0;
-   int ipproto_off;
+   int ipproto_off, opt_ipproto_off;
int next_off;
 
if (proto == PF_INET)
@@ -116,14 +122,30 @@ static void setup_sock_filter(int fd)
if (strcmp(testname, "ip") == 0) {
if (proto == PF_INET)
optlen = sizeof(struct ip_timestamp);
-   else
-   optlen = sizeof(struct ip6_frag);
+   else {
+   BUILD_BUG_ON(sizeof(struct ip6_hbh) > MIN_EXTHDR_SIZE);
+   BUILD_BUG_ON(sizeof(struct ip6_dest) > MIN_EXTHDR_SIZE);
+   BUILD_BUG_ON(sizeof(struct ip6_frag) > MIN_EXTHDR_SIZE);
+
+   /* same size for HBH and Fragment extension header 
types */
+   optlen = MIN_EXTHDR_SIZE;
+   opt_ipproto_off = ETH_HLEN + sizeof(struct ipv6hdr)
+   + offsetof(struct ip6_ext, ip6e_nxt);
+   }
}
 
+   /* this filter validates the following:
+*  - packet is IPv4/IPv6 according to the running test.
+*  - packet is TCP. Also handles the case of one extension header 
and then TCP.
+*  - checks the packet tcp dport equals to DPORT. Also handles the 
case of one
+*extension header and then TCP.
+*/
struct sock_filter filter[] = {
BPF_STMT(BPF_LD  + BPF_H   + BPF_ABS, ethproto_off),
-   BPF_JUMP(BPF_JMP + BPF_JEQ + BPF_K, 
ntohs(ethhdr_proto), 0, 7),
+   BPF_JUMP(BPF_JMP + BPF_JEQ + BPF_K, 
ntohs(ethhdr_proto), 0, 9),
BPF_STMT(BPF_LD  + BPF_B   + BPF_ABS, ipproto_off),
+   BPF_JUMP(BPF_JMP + BPF_JEQ + BPF_K, IPPROTO_TCP, 2, 0),
+   BPF_STMT(BPF_LD  + BPF_B   + BPF_ABS, opt_ipproto_off),
BPF_JUMP(BPF_JMP + BPF_JEQ + BPF_K, IPPROTO_TCP, 0, 5),
BPF_STMT(BPF_LD  + BPF_H   + BPF_ABS, dport_off),
BPF_JUMP(BPF_JMP + BPF_JEQ + BPF_K, DPORT, 2, 0),
@@ -576,6 +598,40 @@ static void add_ipv4_ts_option(void *buf, void *optpkt)
iph->check = checksum_fold(iph, sizeof(struct iphdr) + optlen, 0);
 }
 
+static void add_ipv6_exthdr(void *buf, void *optpkt, __u8 exthdr_type, char 
*ext_payload)
+{
+   struct ipv6_opt_hdr *exthdr = (struct ipv6_opt_hdr *)(optpkt + 
tcp_offset);
+   struct ipv6hdr *iph = (struct ipv6hdr *)(optpkt + ETH_HLEN);
+   char *exthdr_payload_start = (char *)(exthdr + 1);
+
+   exthdr->hdrlen = 0;
+   exthdr->nexthdr = IPPROTO_TCP;
+
+   if (ext_payload)
+   memcpy(exthdr_payload_start, ext_payload, MIN_EXTHDR_SIZE - 
sizeof(*exthdr));
+
+   memcpy(optpkt, buf, tcp_offset);
+   memcpy(optpkt + tcp_offset + MIN_EXTHDR_SIZE, buf + tcp_offset,
+   sizeof(struct tcphdr) + PAYLOAD_LEN);
+
+   iph->nexthdr = exthdr_type;
+   iph->payload_len = htons(ntohs(iph->payload_len) + MIN_EXTHDR_SIZE);
+}
+
+static void send_ipv6_exthdr(int fd, struct sockaddr_ll *daddr, char 
*ext_data1, char *ext_data2)
+{
+   static char buf[MAX_HDR_LEN + PAYLOAD_LEN];
+   static char exthdr_pck[sizeof(buf) + 

[PATCH net-next v2 2/3] net: gro: parse ipv6 ext headers without frag0 invalidation

2024-01-02 Thread Richard Gobert
The existing code always pulls the IPv6 header and sets the transport
offset initially. Then it optionally pulls any extension headers again in
ipv6_gso_pull_exthdrs and sets the transport offset again on return from
that call. skb->data is set to the start of the first extension header
before calling ipv6_gso_pull_exthdrs, and the frag0 optimization must be
disabled because that function uses pskb_may_pull/pskb_pull instead of the
skb_gro_ helpers. It sets the GRO offset to the TCP header with
skb_gro_pull and sets the transport header. Then it returns skb->data to its
position before this block.

This commit introduces a new helper function - ipv6_gro_pull_exthdrs -
which is used in ipv6_gro_receive to pull ipv6 ext headers instead of
ipv6_gso_pull_exthdrs. Thus, there is no modification of skb->data, all
operations use skb_gro_* helpers, and the frag0 fast path can be taken for
IPv6 packets with ext headers.

Signed-off-by: Richard Gobert 
Reviewed-by: Willem de Bruijn 
---
 include/net/ipv6.h |  1 +
 net/ipv6/ip6_offload.c | 51 +-
 2 files changed, 42 insertions(+), 10 deletions(-)

diff --git a/include/net/ipv6.h b/include/net/ipv6.h
index 78d38dd88aba..217240efa182 100644
--- a/include/net/ipv6.h
+++ b/include/net/ipv6.h
@@ -26,6 +26,7 @@ struct ip_tunnel_info;
 #define SIN6_LEN_RFC2133   24
 
 #define IPV6_MAXPLEN   65535
+#define IPV6_MIN_EXTHDR_LEN8
 
 /*
  * NextHeader field of IPv6 header
diff --git a/net/ipv6/ip6_offload.c b/net/ipv6/ip6_offload.c
index 0e0b5fed0995..c07111d8f56a 100644
--- a/net/ipv6/ip6_offload.c
+++ b/net/ipv6/ip6_offload.c
@@ -37,6 +37,40 @@
INDIRECT_CALL_L4(cb, f2, f1, head, skb);\
 })
 
+static int ipv6_gro_pull_exthdrs(struct sk_buff *skb, int off, int proto)
+{
+   const struct net_offload *ops = NULL;
+   struct ipv6_opt_hdr *opth;
+
+   for (;;) {
+   int len;
+
+   ops = rcu_dereference(inet6_offloads[proto]);
+
+   if (unlikely(!ops))
+   break;
+
+   if (!(ops->flags & INET6_PROTO_GSO_EXTHDR))
+   break;
+
+   opth = skb_gro_header(skb, off + IPV6_MIN_EXTHDR_LEN, off);
+   if (unlikely(!opth))
+   break;
+
+   len = ipv6_optlen(opth);
+
+   opth = skb_gro_header(skb, off + len, off);
+   if (unlikely(!opth))
+   break;
+   proto = opth->nexthdr;
+
+   off += len;
+   }
+
+   skb_gro_pull(skb, off - skb_network_offset(skb));
+   return proto;
+}
+
 static int ipv6_gso_pull_exthdrs(struct sk_buff *skb, int proto)
 {
const struct net_offload *ops = NULL;
@@ -203,28 +237,25 @@ INDIRECT_CALLABLE_SCOPE struct sk_buff 
*ipv6_gro_receive(struct list_head *head,
goto out;
 
skb_set_network_header(skb, off);
-   skb_gro_pull(skb, sizeof(*iph));
-   skb_set_transport_header(skb, skb_gro_offset(skb));
 
-   flush += ntohs(iph->payload_len) != skb_gro_len(skb);
+   flush += ntohs(iph->payload_len) != skb->len - hlen;
 
proto = iph->nexthdr;
ops = rcu_dereference(inet6_offloads[proto]);
if (!ops || !ops->callbacks.gro_receive) {
-   pskb_pull(skb, skb_gro_offset(skb));
-   skb_gro_frag0_invalidate(skb);
-   proto = ipv6_gso_pull_exthdrs(skb, proto);
-   skb_gro_pull(skb, -skb_transport_offset(skb));
-   skb_reset_transport_header(skb);
-   __skb_push(skb, skb_gro_offset(skb));
+   proto = ipv6_gro_pull_exthdrs(skb, hlen, proto);
 
ops = rcu_dereference(inet6_offloads[proto]);
if (!ops || !ops->callbacks.gro_receive)
goto out;
 
-   iph = ipv6_hdr(skb);
+   iph = skb_gro_network_header(skb);
+   } else {
+   skb_gro_pull(skb, sizeof(*iph));
}
 
+   skb_set_transport_header(skb, skb_gro_offset(skb));
+
NAPI_GRO_CB(skb)->proto = proto;
 
flush--;
-- 
2.36.1




[PATCH net-next v2 1/3] net: gso: add HBH extension header offload support

2024-01-02 Thread Richard Gobert
This commit adds a net_offload for IPv6 Hop-by-Hop extension headers (as is
done for routing and dstopts) since it is supported in GSO and GRO.
This allows removing the HBH-specific conditionals in GSO and GRO when
pulling and parsing an incoming packet.

Signed-off-by: Richard Gobert 
Reviewed-by: Willem de Bruijn 
---
 net/ipv6/exthdrs_offload.c | 11 +++
 net/ipv6/ip6_offload.c | 25 +++--
 2 files changed, 22 insertions(+), 14 deletions(-)

diff --git a/net/ipv6/exthdrs_offload.c b/net/ipv6/exthdrs_offload.c
index 06750d65d480..4c00398f4dca 100644
--- a/net/ipv6/exthdrs_offload.c
+++ b/net/ipv6/exthdrs_offload.c
@@ -16,6 +16,10 @@ static const struct net_offload dstopt_offload = {
.flags  =   INET6_PROTO_GSO_EXTHDR,
 };
 
+static const struct net_offload hbh_offload = {
+   .flags  =   INET6_PROTO_GSO_EXTHDR,
+};
+
 int __init ipv6_exthdrs_offload_init(void)
 {
int ret;
@@ -28,9 +32,16 @@ int __init ipv6_exthdrs_offload_init(void)
if (ret)
goto out_rt;
 
+   ret = inet6_add_offload(&hbh_offload, IPPROTO_HOPOPTS);
+   if (ret)
+   goto out_dstopts;
+
 out:
return ret;
 
+out_dstopts:
+   inet6_del_offload(&dstopt_offload, IPPROTO_DSTOPTS);
+
 out_rt:
inet6_del_offload(&rthdr_offload, IPPROTO_ROUTING);
goto out;
diff --git a/net/ipv6/ip6_offload.c b/net/ipv6/ip6_offload.c
index d6314287338d..0e0b5fed0995 100644
--- a/net/ipv6/ip6_offload.c
+++ b/net/ipv6/ip6_offload.c
@@ -45,15 +45,13 @@ static int ipv6_gso_pull_exthdrs(struct sk_buff *skb, int 
proto)
struct ipv6_opt_hdr *opth;
int len;
 
-   if (proto != NEXTHDR_HOP) {
-   ops = rcu_dereference(inet6_offloads[proto]);
+   ops = rcu_dereference(inet6_offloads[proto]);
 
-   if (unlikely(!ops))
-   break;
+   if (unlikely(!ops))
+   break;
 
-   if (!(ops->flags & INET6_PROTO_GSO_EXTHDR))
-   break;
-   }
+   if (!(ops->flags & INET6_PROTO_GSO_EXTHDR))
+   break;
 
if (unlikely(!pskb_may_pull(skb, 8)))
break;
@@ -171,13 +169,12 @@ static int ipv6_exthdrs_len(struct ipv6hdr *iph,
 
proto = iph->nexthdr;
for (;;) {
-   if (proto != NEXTHDR_HOP) {
-   *opps = rcu_dereference(inet6_offloads[proto]);
-   if (unlikely(!(*opps)))
-   break;
-   if (!((*opps)->flags & INET6_PROTO_GSO_EXTHDR))
-   break;
-   }
+   *opps = rcu_dereference(inet6_offloads[proto]);
+   if (unlikely(!(*opps)))
+   break;
+   if (!((*opps)->flags & INET6_PROTO_GSO_EXTHDR))
+   break;
+
opth = (void *)opth + optlen;
optlen = ipv6_optlen(opth);
len += optlen;
-- 
2.36.1




[PATCH net-next v2 0/3] net: gro: reduce extension header parsing overhead

2024-01-02 Thread Richard Gobert
This series attempts to reduce the parsing overhead of IPv6 extension
headers in GRO and GSO, by removing extension header specific code and
enabling the frag0 fast path.
 
The following changes were made:
 - Removed some unnecessary HBH conditionals by adding HBH offload
   to inet6_offloads
 - Added a utility function to support frag0 fast path in ipv6_gro_receive
 - Added selftests for IPv6 packets with extension headers in GRO

Richard

v1 -> v2:
 - Added a minimum IPv6 extension header length constant to make code self
   documenting.
 - Added new selftest which checks that packets with different extension
   header payloads do not coalesce.
 - Added more info in the second commit message regarding the code changes.
 - v1:
   
https://lore.kernel.org/netdev/f4eff69d-3917-4c42-8c6b-d09597ac4...@gmail.com/

Richard Gobert (3):
  net: gso: add HBH extension header offload support
  net: gro: parse ipv6 ext headers without frag0 invalidation
  selftests/net: fix GRO coalesce test and add ext header coalesce tests

 include/net/ipv6.h|  1 +
 net/ipv6/exthdrs_offload.c| 11 
 net/ipv6/ip6_offload.c| 76 +
 tools/testing/selftests/net/gro.c | 94 +--
 4 files changed, 152 insertions(+), 30 deletions(-)

-- 
2.36.1




Re: [RFC PATCH v3 0/3] Add test to verify probe of devices from discoverable busses

2024-01-02 Thread Nícolas F . R . A . Prado
On Tue, Jan 02, 2024 at 10:45:59AM +0300, Dan Carpenter wrote:
> Life hack: Don't put RFC in the subject.  Especially if it's a v2 or
> higher.  No one reads RFC patches.

Thanks for the tip. I've had a mixed experience with RFC series in the past,
though this time around I did get some feedback on the previous versions so I
can't complain. And I wasn't expecting swift replies in the middle of the
holidays :). In any case, this should be the last RFC version as I feel like the
approach has consolidated by now.

> 
> This patchset seems like a low risk patch to apply.

That's an interesting take on the usage of RFC I hadn't considered.

Thanks,
Nícolas



Re: [PATCH v8 21/24] evm: Move to LSM infrastructure

2024-01-02 Thread Roberto Sassu

On 12/26/2023 11:13 PM, Mimi Zohar wrote:

On Thu, 2023-12-14 at 18:08 +0100, Roberto Sassu wrote:

From: Roberto Sassu 

As for IMA, move hardcoded EVM function calls from various places in the
kernel to the LSM infrastructure, by introducing a new LSM named 'evm'
(last and always enabled like 'ima'). The order in the Makefile ensures
that 'evm' hooks are executed after 'ima' ones.

Make EVM functions as static (except for evm_inode_init_security(), which
is exported), and register them as hook implementations in init_evm_lsm().

Unlike before (see commit to move IMA to the LSM infrastructure),
evm_inode_post_setattr(), evm_inode_post_set_acl(),
evm_inode_post_remove_acl(), and evm_inode_post_removexattr() are not
executed for private inodes.



Missing is a comment on moving the inline function definitions -
evm_inode_remove_acl(), evm_inode_post_remove_acl(), and
evm_inode_post_set_acl() - to evm_main.c.


Ok.


Finally, add the LSM_ID_EVM case in lsm_list_modules_test.c

Signed-off-by: Roberto Sassu 
---


[...]

@@ -2307,9 +2299,7 @@ int security_inode_setxattr(struct mnt_idmap *idmap,
  
  	if (ret == 1)

ret = cap_inode_setxattr(dentry, name, value, size, flags);
-   if (ret)
-   return ret;
-   return evm_inode_setxattr(idmap, dentry, name, value, size, flags);
+   return ret;
  }


Even though capability will be called after EVM, it doesn't make a
difference in this instance.

[...]


 /**
@@ -2493,9 +2472,7 @@ int security_inode_removexattr(struct mnt_idmap *idmap,
 	ret = call_int_hook(inode_removexattr, 1, idmap, dentry, name);
 	if (ret == 1)
 		ret = cap_inode_removexattr(idmap, dentry, name);
-	if (ret)
-		return ret;
-	return evm_inode_removexattr(idmap, dentry, name);
+	return ret;
 }


'security.capability' is one of the EVM protected xattrs.  As
capability isn't an LSM, it will now be called after EVM, which is a
problem.


Uhm, according to this comment in security_inode_removexattr() and 
security_inode_setxattr():


/*
 * SELinux and Smack integrate the cap call,
 * so assume that all LSMs supplying this call do so.
 */

We can add the call to IMA and EVM as well, to be compliant.

However, I'm missing why the two cases are different. It seems that
cap_inode_setxattr()/cap_inode_removexattr() only perform checks.
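
For illustration, one way to do that on the EVM side would be to fold the
capability check into the hook itself, mirroring what the comment says
SELinux and Smack already do (a sketch only, not a proposed patch):

static int evm_inode_setxattr(struct mnt_idmap *idmap, struct dentry *dentry,
                              const char *name, const void *value,
                              size_t size, int flags)
{
        /* keep the capability check that used to run before EVM */
        int rc = cap_inode_setxattr(dentry, name, value, size, flags);

        if (rc)
                return rc;

        /* ... existing EVM validation of the xattr change goes here ... */
        return 0;
}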


Thanks

Roberto




Re: [PATCH v2 net-next] selftests/net: change shebang to bash to support "source"

2024-01-02 Thread Guillaume Nault
On Tue, Jan 02, 2024 at 04:20:16PM +0800, Yujie Liu wrote:
> On Sun, Dec 31, 2023 at 01:17:11PM +0100, Guillaume Nault wrote:
> > On Fri, Dec 29, 2023 at 09:19:31PM +0800, Yujie Liu wrote:
> > > The patch set [1] added a general lib.sh in net selftests, and converted
> > > several test scripts to source the lib.sh.
> > > 
> > > unicast_extensions.sh (converted in [1]) and pmtu.sh (converted in [2])
> > > have a /bin/sh shebang which may point to various shells in different
> > > distributions, but "source" is only available in some of them. For
> > > example, "source" is a built-in function in bash, but it cannot be
> > > used in dash.
> > > 
> > > Following the other scripts that were converted together, simply change
> > > the shebang to bash to fix the following issues when the default /bin/sh
> > > points to other shells.
> > 
> > Looks like it'd be simpler to just replace the "source" commands with
> > "." and leave the shebang as is (unless there are other bash-specific
> > constructs in these scripts of course).
> > 
> > Generally speaking, I think we should avoid mandating a specific shell,
> > unless that really simplifies the test script (which is not the case
> > here).
> 
> Hi Guillaume,
> 
> Thanks for the comments. Actually I also considered replacing "source"
> with "." at first, but finally decided to change the shebang in
> consideration of being consistent with other scripts. We can see that
> there are 140+ scripts in net selftests that have "source lib.sh" and
> "bash" shebang, but none of the selftests has ". lib.sh". If we replace
> "source" with "." and keep the "sh" shebang specifically for
> unicast_extensions.sh and pmtu.sh, we will get only 2 scripts using
> "sh and ." while most other scripts using "bash and source". Maybe it
> would be nice to keep the consistency by changing the shebang as well?
> What do you think? :)

The use of "source" instead of "." is clearly an overlook. Consistency
is desirable only when it brings better quality code.

And it should be easy enough to convert the remaining "source lib.sh"
in a followup patch to make the other shell scripts consistent.


> linux/tools/testing/selftests/net$ grep -rl "source lib.sh" . | xargs grep -F 
> '#!/bin/'
> ./test_vxlan_under_vrf.sh:#!/bin/bash
> ./test_vxlan_nolocalbypass.sh:#!/bin/bash
> ./xfrm_policy.sh:#!/bin/bash
> ./test_vxlan_mdb.sh:#!/bin/bash
> ./test_bridge_backup_port.sh:#!/bin/bash
> ./vrf_route_leaking.sh:#!/bin/bash
> ./l2tp.sh:#!/bin/bash
> ./netns-name.sh:#!/bin/bash
> ./rtnetlink.sh:#!/bin/bash
> ./ioam6.sh:#!/bin/bash
> ./drop_monitor_tests.sh:#!/bin/bash
> ./test_vxlan_vnifiltering.sh:#!/bin/bash
> ./icmp.sh:#!/bin/bash
> ./gre_gso.sh:#!/bin/bash
> ./fib_nexthop_multiprefix.sh:#!/bin/bash
> ./icmp_redirect.sh:#!/bin/bash
> ./vrf-xfrm-tests.sh:#!/bin/bash
> ./vrf_strict_mode_test.sh:#!/bin/bash
> ./fcnal-test.sh:#!/bin/bash
> ./stress_reuseport_listen.sh:#!/bin/bash
> ./srv6_end_dt4_l3vpn_test.sh:#!/bin/bash
> ./test_bridge_neigh_suppress.sh:#!/bin/bash
> ./cmsg_ipv6.sh:#!/bin/bash
> ./arp_ndisc_evict_nocarrier.sh:#!/bin/bash
> ./fib_rule_tests.sh:#!/bin/bash
> ./srv6_end_dt6_l3vpn_test.sh:#!/bin/bash
> ./forwarding/custom_multipath_hash.sh:#!/bin/bash
> ./forwarding/gre_inner_v4_multipath.sh:#!/bin/bash
> ./forwarding/tc_tunnel_key.sh:#!/bin/bash
> ./forwarding/tc_shblocks.sh:#!/bin/bash
> ./forwarding/router_nh.sh:#!/bin/bash
> ./forwarding/skbedit_priority.sh:#!/bin/bash
> ./forwarding/tc_mpls_l2vpn.sh:#!/bin/bash
> ./forwarding/gre_inner_v6_multipath.sh:#!/bin/bash
> ./forwarding/vxlan_symmetric.sh:#!/bin/bash
> ./forwarding/bridge_mdb.sh:#!/bin/bash
> ./forwarding/no_forwarding.sh:#!/bin/bash
> ./forwarding/router_bridge_1d.sh:#!/bin/bash
> ./forwarding/tc_flower_port_range.sh:#!/bin/bash
> ./forwarding/router_multicast.sh:#!/bin/bash
> ./forwarding/bridge_locked_port.sh:#!/bin/bash
> ./forwarding/vxlan_asymmetric_ipv6.sh:#!/bin/bash
> ./forwarding/dual_vxlan_bridge.sh:#!/bin/bash
> ./forwarding/bridge_port_isolation.sh:#!/bin/bash
> ./forwarding/local_termination.sh:#!/bin/bash
> ./forwarding/ipip_flat_gre_keys.sh:#!/bin/bash
> ./forwarding/gre_multipath_nh_res.sh:#!/bin/bash
> ./forwarding/gre_multipath.sh:#!/bin/bash
> ./forwarding/vxlan_bridge_1d_ipv6.sh:#!/bin/bash
> ./forwarding/ip6gre_flat_keys.sh:#!/bin/bash
> ./forwarding/gre_multipath_nh.sh:#!/bin/bash
> ./forwarding/bridge_mld.sh:#!/bin/bash
> ./forwarding/ip6gre_inner_v6_multipath.sh:#!/bin/bash
> ./forwarding/ip6gre_flat_key.sh:#!/bin/bash
> ./forwarding/vxlan_asymmetric.sh:#!/bin/bash
> ./forwarding/tc_flower_router.sh:#!/bin/bash
> ./forwarding/router_bridge_vlan_upper_pvid.sh:#!/bin/bash
> ./forwarding/mirror_gre_vlan_bridge_1q.sh:#!/bin/bash
> ./forwarding/q_in_vni_ipv6.sh:#!/bin/bash
> ./forwarding/mirror_gre_lag_lacp.sh:#!/bin/bash
> ./forwarding/ip6gre_custom_multipath_hash.sh:#!/bin/bash
> ./forwarding/vxlan_bridge_1d.sh:#!/bin/bash
> ./forwarding/ip6gre_hier_key.sh:#!/bin/bash
> 

Re: [PATCH v8 23/24] ima: Make it independent from 'integrity' LSM

2024-01-02 Thread Roberto Sassu

On 12/27/2023 8:21 PM, Mimi Zohar wrote:

On Wed, 2023-12-27 at 17:39 +0100, Roberto Sassu wrote:

On 12/27/2023 2:22 PM, Mimi Zohar wrote:

On Thu, 2023-12-14 at 18:08 +0100, Roberto Sassu wrote:

From: Roberto Sassu 

Make the 'ima' LSM independent from the 'integrity' LSM by introducing IMA's
own integrity metadata (the ima_iint_cache structure, with IMA-specific fields
from the integrity_iint_cache structure), and by managing it directly from
the 'ima' LSM.

Move the remaining IMA-specific flags to security/integrity/ima/ima.h,
since they are now unnecessary in the common integrity layer.

Replace integrity_iint_cache with ima_iint_cache in various places
of the IMA code.

Then, reserve space in the security blob for the entire ima_iint_cache
structure, so that it is available for all inodes having the security blob
allocated (those for which security_inode_alloc() was called).  Adjust the
IMA code accordingly, calling ima_iint_inode() to retrieve the ima_iint_cache
structure. Keep the non-NULL checks since there can be inodes without
security blob.
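
(For illustration, a sketch of the blob-based lookup described here, assuming
the usual LSM blob mechanism; structure and function names follow the commit
message, the code is not reproduced from the patch:)

/* registered through the LSM's .blobs pointer in DEFINE_LSM(ima) */
struct lsm_blob_sizes ima_blob_sizes __ro_after_init = {
        .lbs_inode = sizeof(struct ima_iint_cache),
};

static struct ima_iint_cache *ima_iint_inode(const struct inode *inode)
{
        if (!inode->i_security)
                return NULL;

        /*
         * At init time the framework turns lbs_inode into this LSM's
         * offset inside the shared inode blob.
         */
        return inode->i_security + ima_blob_sizes.lbs_inode;
}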


Previously the 'iint' memory was only allocated for regular files in
policy, and those inodes were tagged S_IMA.  This patch totally changes when and how
memory is being allocated.  Does it make sense to allocate memory at
security_inode_alloc()?  Is this change really necessary for making IMA
a full fledged LSM?


Good question. I think it wouldn't be necessary; we can reuse the same
approach as in the patch 'integrity: Switch from rbtree to LSM-managed
blob for integrity_iint_cache'.


Going forward with the v8 proposed solution would require some real
memory usage analysis for different types of policies.

To me the "integrity: Switch from rbtree to LSM-managed blob for
integrity_iint_cache" makes a lot more sense.   Looking back at the
original thread, your reasons back then for not directly allocating the
integrity_iint_cache are still valid for the ima_iint_cache structure.


Uhm, ok. It should not be too difficult to restore the old mechanism for 
ima_iint_cache. Will do it in v9.


Thanks

Roberto


Mimi





Don't include the inode pointer as a field in the ima_iint_cache structure,
since the association with the inode is clear. Since the inode field is
missing in ima_iint_cache, pass the extra inode parameter to
ima_get_verity_digest().

Finally, register ima_inode_alloc_security/ima_inode_free_security() to
initialize/deinitialize the new ima_iint_cache structure (before this task
was done by iint_init_always() and iint_free()). Also, duplicate
iint_lockdep_annotate() for the ima_iint_cache structure, and name it
ima_iint_lockdep_annotate().

Signed-off-by: Roberto Sassu 









Re: [PATCH 1/1] userfaultfd: fix move_pages_pte() splitting folio under RCU read lock

2024-01-02 Thread Peter Xu
On Fri, Dec 29, 2023 at 06:56:07PM -0800, Suren Baghdasaryan wrote:
> @@ -1078,9 +1078,14 @@ static int move_pages_pte(struct mm_struct *mm, pmd_t 
> *dst_pmd, pmd_t *src_pmd,
>  
>   /* at this point we have src_folio locked */
>   if (folio_test_large(src_folio)) {
> + /* split_folio() can block */
> + pte_unmap(&orig_src_pte);
> + pte_unmap(&orig_dst_pte);
> + src_pte = dst_pte = NULL;
>   err = split_folio(src_folio);
>   if (err)
>   goto out;
> + goto retry;
>   }

Do we also need to clear src_folio and src_folio_pte?  If the folio is a
thp, I think it means it's pte mapped here. Then after the split we may
want to fetch the small folio after the split, not the head one?
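
A sketch of the extra clean-up being suggested, as a fragment of the hunk
quoted above (illustrative; the additional resets are an assumption, not the
applied patch):

	if (folio_test_large(src_folio)) {
		/* split_folio() can block */
		pte_unmap(&orig_src_pte);
		pte_unmap(&orig_dst_pte);
		src_pte = dst_pte = NULL;
		err = split_folio(src_folio);
		if (err)
			goto out;
		/*
		 * assumption: forget the old (head) folio so that the retry
		 * path looks up the now-small folio backing src_addr and
		 * refreshes src_folio_pte along with it
		 */
		folio_unlock(src_folio);
		folio_put(src_folio);
		src_folio = NULL;
		goto retry;
	}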

-- 
Peter Xu




[PATCH] mm/selftests: hugepage-mremap: conform test to TAP format output

2024-01-02 Thread Muhammad Usama Anjum
Conform the layout, informational and status messages to TAP. No
functional change is intended other than the layout of output messages.

Signed-off-by: Muhammad Usama Anjum 
---
 tools/testing/selftests/mm/hugepage-mremap.c | 87 
 1 file changed, 35 insertions(+), 52 deletions(-)
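
For reference, a minimal standalone illustration of the kselftest TAP helpers
the patch switches to (not part of the patch; helper names are from
tools/testing/selftests/kselftest.h):

#include "../kselftest.h"

int main(void)
{
	ksft_print_header();		/* TAP version line */
	ksft_set_plan(1);		/* prints the "1..1" plan */

	ksft_print_msg("diagnostic output is prefixed with '#'\n");

	if (0 /* fatal setup error */)
		ksft_exit_fail_msg("setup failed: %s\n", "reason");

	ksft_test_result_pass("example check\n");
	ksft_finished();		/* exits pass/fail based on the counters */
}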

diff --git a/tools/testing/selftests/mm/hugepage-mremap.c 
b/tools/testing/selftests/mm/hugepage-mremap.c
index cabd0084f57b..c463d1c09c9b 100644
--- a/tools/testing/selftests/mm/hugepage-mremap.c
+++ b/tools/testing/selftests/mm/hugepage-mremap.c
@@ -24,6 +24,7 @@
 #include 
 #include 
 #include 
+#include "../kselftest.h"
 #include "vm_util.h"
 
 #define DEFAULT_LENGTH_MB 10UL
@@ -34,7 +35,7 @@
 
 static void check_bytes(char *addr)
 {
-   printf("First hex is %x\n", *((unsigned int *)addr));
+   ksft_print_msg("First hex is %x\n", *((unsigned int *)addr));
 }
 
 static void write_bytes(char *addr, size_t len)
@@ -52,7 +53,7 @@ static int read_bytes(char *addr, size_t len)
check_bytes(addr);
for (i = 0; i < len; i++)
if (*(addr + i) != (char)i) {
-   printf("Mismatch at %lu\n", i);
+   ksft_print_msg("Mismatch at %lu\n", i);
return 1;
}
return 0;
@@ -66,17 +67,13 @@ static void register_region_with_uffd(char *addr, size_t 
len)
/* Create and enable userfaultfd object. */
 
uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);
-   if (uffd == -1) {
-   perror("userfaultfd");
-   exit(1);
-   }
+   if (uffd == -1)
+   ksft_exit_fail_msg("userfaultfd: %s\n", strerror(errno));
 
uffdio_api.api = UFFD_API;
uffdio_api.features = 0;
-   if (ioctl(uffd, UFFDIO_API, &uffdio_api) == -1) {
-   perror("ioctl-UFFDIO_API");
-   exit(1);
-   }
+   if (ioctl(uffd, UFFDIO_API, &uffdio_api) == -1)
+   ksft_exit_fail_msg("ioctl-UFFDIO_API: %s\n", strerror(errno));
 
/* Create a private anonymous mapping. The memory will be
 * demand-zero paged--that is, not yet allocated. When we
@@ -86,21 +83,17 @@ static void register_region_with_uffd(char *addr, size_t 
len)
 
addr = mmap(NULL, len, PROT_READ | PROT_WRITE,
MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
-   if (addr == MAP_FAILED) {
-   perror("mmap");
-   exit(1);
-   }
+   if (addr == MAP_FAILED)
+   ksft_exit_fail_msg("mmap: %s\n", strerror(errno));
 
-   printf("Address returned by mmap() = %p\n", addr);
+   ksft_print_msg("Address returned by mmap() = %p\n", addr);
 
/* Register the memory range of the mapping we just created for
 * handling by the userfaultfd object. In mode, we request to track
 * missing pages (i.e., pages that have not yet been faulted in).
 */
-   if (uffd_register(uffd, addr, len, true, false, false)) {
-   perror("ioctl-UFFDIO_REGISTER");
-   exit(1);
-   }
+   if (uffd_register(uffd, addr, len, true, false, false))
+   ksft_exit_fail_msg("ioctl-UFFDIO_REGISTER: %s\n", 
strerror(errno));
 }
 
 int main(int argc, char *argv[])
@@ -108,10 +101,11 @@ int main(int argc, char *argv[])
size_t length = 0;
int ret = 0, fd;
 
-   if (argc >= 2 && !strcmp(argv[1], "-h")) {
-   printf("Usage: %s [length_in_MB]\n", argv[0]);
-   exit(1);
-   }
+   ksft_print_header();
+   ksft_set_plan(1);
+
+   if (argc >= 2 && !strcmp(argv[1], "-h"))
+   ksft_exit_fail_msg("Usage: %s [length_in_MB]\n", argv[0]);
 
/* Read memory length as the first arg if valid, otherwise fallback to
 * the default length.
@@ -123,50 +117,40 @@ int main(int argc, char *argv[])
 
length = MB_TO_BYTES(length);
fd = memfd_create(argv[0], MFD_HUGETLB);
-   if (fd < 0) {
-   perror("Open failed");
-   exit(1);
-   }
+   if (fd < 0)
+   ksft_exit_fail_msg("Open failed: %s\n", strerror(errno));
 
/* mmap to a PUD aligned address to hopefully trigger pmd sharing. */
unsigned long suggested_addr = 0x7eaa4000;
void *haddr = mmap((void *)suggested_addr, length, PROTECTION,
   MAP_HUGETLB | MAP_SHARED | MAP_POPULATE, fd, 0);
-   printf("Map haddr: Returned address is %p\n", haddr);
-   if (haddr == MAP_FAILED) {
-   perror("mmap1");
-   exit(1);
-   }
+   ksft_print_msg("Map haddr: Returned address is %p\n", haddr);
+   if (haddr == MAP_FAILED)
+   ksft_exit_fail_msg("mmap1: %s\n", strerror(errno));
 
/* mmap again to a dummy address to hopefully trigger pmd sharing. */
suggested_addr = 0x7daa4000;
void *daddr = mmap((void *)suggested_addr, length, PROTECTION,
   MAP_HUGETLB | 

Re: [PATCH v2 net-next] selftests/net: change shebang to bash to support "source"

2024-01-02 Thread Yujie Liu
On Sun, Dec 31, 2023 at 01:17:11PM +0100, Guillaume Nault wrote:
> On Fri, Dec 29, 2023 at 09:19:31PM +0800, Yujie Liu wrote:
> > The patch set [1] added a general lib.sh in net selftests, and converted
> > several test scripts to source the lib.sh.
> > 
> > unicast_extensions.sh (converted in [1]) and pmtu.sh (converted in [2])
> > have a /bin/sh shebang which may point to various shells in different
> > distributions, but "source" is only available in some of them. For
> > example, "source" is a built-it function in bash, but it cannot be
> > used in dash.
> > 
> > Refer to other scripts that were converted together, simply change the
> > shebang to bash to fix the following issues when the default /bin/sh
> > points to other shells.
> 
> Looks like it'd be simpler to just replace the "source" commands with
> "." and leave the shebang as is (unless there are other bash-specific
> constructs in these scripts of course).
> 
> Generally speaking, I think we should avoid mandating a specific shell,
> unless that really simplifies the test script (which is not the case
> here).

Hi Guillaume,

Thanks for the comments. Actually I also considered replacing "source"
with "." at first, but finally decided to change the shebang in
consideration of being consistent with other scripts. We can see that
there are 140+ scripts in net selftests that have "source lib.sh" and
"bash" shebang, but none of the selftests has ". lib.sh". If we replace
"source" with "." and keep the "sh" shebang specifically for
unicast_extensions.sh and pmtu.sh, we will get only 2 scripts using
"sh and ." while most other scripts using "bash and source". Maybe it
would be nice to keep the consistency by changing the shebang as well?
What do you think? :)

linux/tools/testing/selftests/net$ grep -rl "source lib.sh" . | xargs grep -F 
'#!/bin/'
./test_vxlan_under_vrf.sh:#!/bin/bash
./test_vxlan_nolocalbypass.sh:#!/bin/bash
./xfrm_policy.sh:#!/bin/bash
./test_vxlan_mdb.sh:#!/bin/bash
./test_bridge_backup_port.sh:#!/bin/bash
./vrf_route_leaking.sh:#!/bin/bash
./l2tp.sh:#!/bin/bash
./netns-name.sh:#!/bin/bash
./rtnetlink.sh:#!/bin/bash
./ioam6.sh:#!/bin/bash
./drop_monitor_tests.sh:#!/bin/bash
./test_vxlan_vnifiltering.sh:#!/bin/bash
./icmp.sh:#!/bin/bash
./gre_gso.sh:#!/bin/bash
./fib_nexthop_multiprefix.sh:#!/bin/bash
./icmp_redirect.sh:#!/bin/bash
./vrf-xfrm-tests.sh:#!/bin/bash
./vrf_strict_mode_test.sh:#!/bin/bash
./fcnal-test.sh:#!/bin/bash
./stress_reuseport_listen.sh:#!/bin/bash
./srv6_end_dt4_l3vpn_test.sh:#!/bin/bash
./test_bridge_neigh_suppress.sh:#!/bin/bash
./cmsg_ipv6.sh:#!/bin/bash
./arp_ndisc_evict_nocarrier.sh:#!/bin/bash
./fib_rule_tests.sh:#!/bin/bash
./srv6_end_dt6_l3vpn_test.sh:#!/bin/bash
./forwarding/custom_multipath_hash.sh:#!/bin/bash
./forwarding/gre_inner_v4_multipath.sh:#!/bin/bash
./forwarding/tc_tunnel_key.sh:#!/bin/bash
./forwarding/tc_shblocks.sh:#!/bin/bash
./forwarding/router_nh.sh:#!/bin/bash
./forwarding/skbedit_priority.sh:#!/bin/bash
./forwarding/tc_mpls_l2vpn.sh:#!/bin/bash
./forwarding/gre_inner_v6_multipath.sh:#!/bin/bash
./forwarding/vxlan_symmetric.sh:#!/bin/bash
./forwarding/bridge_mdb.sh:#!/bin/bash
./forwarding/no_forwarding.sh:#!/bin/bash
./forwarding/router_bridge_1d.sh:#!/bin/bash
./forwarding/tc_flower_port_range.sh:#!/bin/bash
./forwarding/router_multicast.sh:#!/bin/bash
./forwarding/bridge_locked_port.sh:#!/bin/bash
./forwarding/vxlan_asymmetric_ipv6.sh:#!/bin/bash
./forwarding/dual_vxlan_bridge.sh:#!/bin/bash
./forwarding/bridge_port_isolation.sh:#!/bin/bash
./forwarding/local_termination.sh:#!/bin/bash
./forwarding/ipip_flat_gre_keys.sh:#!/bin/bash
./forwarding/gre_multipath_nh_res.sh:#!/bin/bash
./forwarding/gre_multipath.sh:#!/bin/bash
./forwarding/vxlan_bridge_1d_ipv6.sh:#!/bin/bash
./forwarding/ip6gre_flat_keys.sh:#!/bin/bash
./forwarding/gre_multipath_nh.sh:#!/bin/bash
./forwarding/bridge_mld.sh:#!/bin/bash
./forwarding/ip6gre_inner_v6_multipath.sh:#!/bin/bash
./forwarding/ip6gre_flat_key.sh:#!/bin/bash
./forwarding/vxlan_asymmetric.sh:#!/bin/bash
./forwarding/tc_flower_router.sh:#!/bin/bash
./forwarding/router_bridge_vlan_upper_pvid.sh:#!/bin/bash
./forwarding/mirror_gre_vlan_bridge_1q.sh:#!/bin/bash
./forwarding/q_in_vni_ipv6.sh:#!/bin/bash
./forwarding/mirror_gre_lag_lacp.sh:#!/bin/bash
./forwarding/ip6gre_custom_multipath_hash.sh:#!/bin/bash
./forwarding/vxlan_bridge_1d.sh:#!/bin/bash
./forwarding/ip6gre_hier_key.sh:#!/bin/bash
./forwarding/gre_custom_multipath_hash.sh:#!/bin/bash
./forwarding/ipip_flat_gre_key.sh:#!/bin/bash
./forwarding/mirror_gre_flower.sh:#!/bin/bash
./forwarding/router_bridge.sh:#!/bin/bash
./forwarding/vxlan_symmetric_ipv6.sh:#!/bin/bash
./forwarding/mirror_gre_bridge_1q.sh:#!/bin/bash
./forwarding/router_multipath.sh:#!/bin/bash
./forwarding/tc_vlan_modify.sh:#!/bin/bash
./forwarding/vxlan_bridge_1q.sh:#!/bin/bash
./forwarding/bridge_mdb_port_down.sh:#!/bin/bash
./forwarding/tc_flower.sh:#!/bin/bash
./forwarding/tc_flower_cfm.sh:#!/bin/bash