Re: [PATCH bpf-next 0/2] bpf: add csum/ip_summed fields to __sk_buff
On Wed, Jan 3, 2024 at 11:55 AM Yonghong Song wrote:
>
> On 1/2/24 6:54 PM, Menglong Dong wrote:
> > On Wed, Jan 3, 2024 at 8:52 AM Martin KaFai Lau wrote:
> >> On 1/2/24 10:11 AM, Stanislav Fomichev wrote:
> >>> On 12/29, Menglong Dong wrote:
> >>>> For now, we have to call some helpers when we need to update the csum,
> >>>> such as bpf_l4_csum_replace, bpf_l3_csum_replace, etc. These helpers are
> >>>> not inlined, which causes poor performance.
> >>>>
> >>>> In fact, we can define our own csum update functions in the BPF program
> >>>> instead of bpf_l3_csum_replace, which is fully inlined and efficient.
> >>>> However, we can't do this for bpf_l4_csum_replace for now, as we can't
> >>>> update skb->csum, which can leave skb->csum invalid in the rx path with
> >>>> CHECKSUM_COMPLETE mode.
> >>>>
> >>>> What's more, we can't use direct data access and have to use
> >>>> skb_store_bytes() with the BPF_F_RECOMPUTE_CSUM flag in some cases, such
> >>>> as modifying the vni in the vxlan header when the underlay udp header has
> >>>> no checksum.
> >>
> >> There is bpf_csum_update(), does it work?
> >> A helper call should be acceptable compared with the csum calculation
> >> itself.
> >
> > Yeah, this helper works in this case! Now we miss the last
> > piece for the tx path: ip_summed. We need to know if it is
> > CHECKSUM_PARTIAL to decide if we should update the
> > csum in the packet. In the tx path, the csum in the L4 header is the
> > pseudo-header csum only if skb->ip_summed is CHECKSUM_PARTIAL.
> >
> > Maybe we can introduce a lightweight kfunc to get its
> > value? Such as bpf_skb_csum_mode(). As we only need to call
> > it once, there shouldn't be any overhead.
>
> You don't need a kfunc, you can do a check like
>
>	struct sk_buff *kskb = bpf_cast_to_kern_ctx(skb);
>	if (kskb->ip_summed == CHECKSUM_PARTIAL)
>		...

Great, this is exactly what I need! Thanks~
Re: [PATCH bpf-next 0/2] bpf: add csum/ip_summed fields to __sk_buff
On 1/2/24 6:54 PM, Menglong Dong wrote:
> On Wed, Jan 3, 2024 at 8:52 AM Martin KaFai Lau wrote:
>> On 1/2/24 10:11 AM, Stanislav Fomichev wrote:
>>> On 12/29, Menglong Dong wrote:
>>>> For now, we have to call some helpers when we need to update the csum,
>>>> such as bpf_l4_csum_replace, bpf_l3_csum_replace, etc. These helpers are
>>>> not inlined, which causes poor performance.
>>>>
>>>> In fact, we can define our own csum update functions in the BPF program
>>>> instead of bpf_l3_csum_replace, which is fully inlined and efficient.
>>>> However, we can't do this for bpf_l4_csum_replace for now, as we can't
>>>> update skb->csum, which can leave skb->csum invalid in the rx path with
>>>> CHECKSUM_COMPLETE mode.
>>>>
>>>> What's more, we can't use direct data access and have to use
>>>> skb_store_bytes() with the BPF_F_RECOMPUTE_CSUM flag in some cases, such
>>>> as modifying the vni in the vxlan header when the underlay udp header has
>>>> no checksum.
>>
>> There is bpf_csum_update(), does it work?
>> A helper call should be acceptable compared with the csum calculation
>> itself.
>
> Yeah, this helper works in this case! Now we miss the last
> piece for the tx path: ip_summed. We need to know if it is
> CHECKSUM_PARTIAL to decide if we should update the
> csum in the packet. In the tx path, the csum in the L4 header is the
> pseudo-header csum only if skb->ip_summed is CHECKSUM_PARTIAL.
>
> Maybe we can introduce a lightweight kfunc to get its
> value? Such as bpf_skb_csum_mode(). As we only need to call
> it once, there shouldn't be any overhead.

You don't need a kfunc, you can do a check like

	struct sk_buff *kskb = bpf_cast_to_kern_ctx(skb);
	if (kskb->ip_summed == CHECKSUM_PARTIAL)
		...

> Thanks!
> Menglong Dong
>
>>>> In the first patch, we make skb->csum readable and writable, and we make
>>>> skb->ip_summed readable. For now, for tc only. With these 2 fields, we
>>>> don't need to call bpf helpers for csum update any more.
>>>>
>>>> In the second patch, we add some testcases for the read/write testing of
>>>> skb->csum and skb->ip_summed.
>>>>
>>>> If this series is acceptable, we can define the inlined functions for csum
>>>> update in libbpf in the next step.
>>>
>>> One downside of exposing those as __sk_buff fields is that all this
>>> skb internal csum stuff now becomes a UAPI. And I'm not sure we want
>>> that :-)
>>
>> +1. Please no new __sk_buff extension and no new conversion in
>> bpf_convert_ctx_access().
>>
>>> Should we add a lightweight kfunc to reset the fields instead?
>>> Or will it still have an unacceptable overhead?
Re: [PATCH v10 10/10] iommu/vt-d: Add iotlb flush for nested domain
On 2024/1/3 9:33, Yi Liu wrote:
> On 2024/1/3 02:44, Jason Gunthorpe wrote:
>> On Tue, Jan 02, 2024 at 06:38:34AM -0800, Yi Liu wrote:
>>> +static void intel_nested_flush_cache(struct dmar_domain *domain, u64 addr,
>>> +				     unsigned long npages, bool ih, u32 *error)
>>> +{
>>> +	struct iommu_domain_info *info;
>>> +	unsigned long i;
>>> +	unsigned mask;
>>> +	u32 fault;
>>> +
>>> +	xa_for_each(&domain->iommu_array, i, info)
>>> +		qi_flush_piotlb(info->iommu,
>>> +				domain_id_iommu(domain, info->iommu),
>>> +				IOMMU_NO_PASID, addr, npages, ih, NULL);
>>
>> This locking on the xarray is messed up throughout the driver. There
>> could be a concurrent detach at this point which will free info and
>> UAF this.
>
> hmmm, xa_for_each() takes and releases the rcu lock, and according to
> domain_detach_iommu(), info is freed after xa_erase(). For an existing
> info stored in the xarray, shouldn't xa_erase() return only after the
> rcu lock is released? Any idea? @Baolu

I once thought locking for xarray was self-contained. I need more
thought on this before taking further action.

Best regards,
baolu
Re: [PATCH bpf-next 0/2] bpf: add csum/ip_summed fields to __sk_buff
On Wed, Jan 3, 2024 at 8:52 AM Martin KaFai Lau wrote:
>
> On 1/2/24 10:11 AM, Stanislav Fomichev wrote:
> > On 12/29, Menglong Dong wrote:
> >> For now, we have to call some helpers when we need to update the csum,
> >> such as bpf_l4_csum_replace, bpf_l3_csum_replace, etc. These helpers are
> >> not inlined, which causes poor performance.
> >>
> >> In fact, we can define our own csum update functions in the BPF program
> >> instead of bpf_l3_csum_replace, which is fully inlined and efficient.
> >> However, we can't do this for bpf_l4_csum_replace for now, as we can't
> >> update skb->csum, which can cause skb->csum to be invalid in the rx path
> >> with CHECKSUM_COMPLETE mode.
> >>
> >> What's more, we can't use direct data access and have to use
> >> skb_store_bytes() with the BPF_F_RECOMPUTE_CSUM flag in some cases, such
> >> as modifying the vni in the vxlan header when the underlay udp header has
> >> no checksum.
>
> There is bpf_csum_update(), does it work?
> A helper call should be acceptable compared with the csum calculation itself.

Yeah, this helper works in this case! Now we miss the last
piece for the tx path: ip_summed. We need to know if it is
CHECKSUM_PARTIAL to decide if we should update the
csum in the packet. In the tx path, the csum in the L4 header is the
pseudo-header csum only if skb->ip_summed is CHECKSUM_PARTIAL.

Maybe we can introduce a lightweight kfunc to get its
value? Such as bpf_skb_csum_mode(). As we only need to call
it once, there shouldn't be any overhead.

Thanks!
Menglong Dong

> >> In the first patch, we make skb->csum readable and writable, and we make
> >> skb->ip_summed readable. For now, for tc only. With these 2 fields, we
> >> don't need to call bpf helpers for csum update any more.
> >>
> >> In the second patch, we add some testcases for the read/write testing of
> >> skb->csum and skb->ip_summed.
> >>
> >> If this series is acceptable, we can define the inlined functions for csum
> >> update in libbpf in the next step.
> >
> > One downside of exposing those as __sk_buff fields is that all this
> > skb internal csum stuff now becomes a UAPI. And I'm not sure we want
> > that :-)
>
> +1. Please no new __sk_buff extension and no new conversion in
> bpf_convert_ctx_access().
>
> > Should we add a lightweight kfunc to reset the fields instead?
> > Or will it still have an unacceptable overhead?
Re: [PATCH v7 1/3] iommufd: Add data structure for Intel VT-d stage-1 cache invalidation
On 2024/1/3 07:38, Jason Gunthorpe wrote:
> On Fri, Dec 15, 2023 at 12:01:19PM +0800, Yi Liu wrote:
>>> I think I misread Yi's narrative: dev_id is a working approach
>>> for VMM to convert to a vRID, while he is asking for a better
>>> alternative :)
>>
>> In concept, dev_id works, but in reality we have a problem getting a
>> dev_id for a given device in the intel iommu driver, hence I'm asking
>> for help here. :)
>
> I think we just need to solve this one way or another..
>
> Even if you use a viommu object you still end up having difficult
> coupling to iommufd
>
> Something like:
>
>	iommufd_get_dev_id(struct iommufd_ctx *ictx, struct device *dev)
>
> Callable by a driver (using the driver-callable function
> infrastructure we made for dirty tracking) is really all that is
> needed here.

yep, I noticed IOMMUFD_DRIVER is selected by the intel iommu driver when
IOMMUFD is configured. Maybe such an API could be compiled when
IOMMUFD_DRIVER is enabled as well. This does address my concern about
making the intel iommu driver depend on iommufd. But we still need a way
to pass the iommufd_ctx pointer into the intel iommu driver and store
it, so that the driver can call the above iommufd_get_dev_id(). One
possible way is to pass it when attaching a device to a domain and clear
it on detach. However, this seems not ideal, as the iommufd_ctx
information should be static between bind_iommufd and unbind. But we
don't call into the intel iommu driver in the bind and unbind
operations, so we may need to add a new iommu op. Any suggestion here?

-- 
Regards,
Yi Liu
Re: [PATCH v2 1/1] userfaultfd: fix move_pages_pte() splitting folio under RCU read lock
On Tue, Jan 02, 2024 at 03:32:56PM -0800, Suren Baghdasaryan wrote:
> While testing the split PMD path with lockdep enabled I've got an
> "Invalid wait context" error caused by split_huge_page_to_list() trying
> to lock anon_vma->rwsem while inside an RCU read section. The issue is
> due to move_pages_pte() calling split_folio() under the RCU read lock.
> Fix this by unmapping the PTEs and exiting the RCU read section before
> splitting the folio and then retrying. The same retry pattern is used
> when locking the folio or anon_vma in this function. After splitting
> the large folio we unlock and release it, because after the split the
> old folio might not be the one that contains the src_addr.
>
> Fixes: 94b01c885131 ("userfaultfd: UFFDIO_MOVE uABI")
> Signed-off-by: Suren Baghdasaryan

Reviewed-by: Peter Xu

Thanks,

-- 
Peter Xu
Re: [PATCH v10 10/10] iommu/vt-d: Add iotlb flush for nested domain
On 2024/1/3 02:44, Jason Gunthorpe wrote:
> On Tue, Jan 02, 2024 at 06:38:34AM -0800, Yi Liu wrote:
>> +static void intel_nested_flush_cache(struct dmar_domain *domain, u64 addr,
>> +				     unsigned long npages, bool ih, u32 *error)
>> +{
>> +	struct iommu_domain_info *info;
>> +	unsigned long i;
>> +	unsigned mask;
>> +	u32 fault;
>> +
>> +	xa_for_each(&domain->iommu_array, i, info)
>> +		qi_flush_piotlb(info->iommu,
>> +				domain_id_iommu(domain, info->iommu),
>> +				IOMMU_NO_PASID, addr, npages, ih, NULL);
>
> This locking on the xarray is messed up throughout the driver. There
> could be a concurrent detach at this point which will free info and
> UAF this.

hmmm, xa_for_each() takes and releases the rcu lock, and according to
domain_detach_iommu(), info is freed after xa_erase(). For an existing
info stored in the xarray, shouldn't xa_erase() return only after the
rcu lock is released? Any idea? @Baolu

void domain_detach_iommu(struct dmar_domain *domain, struct intel_iommu *iommu)
{
	struct iommu_domain_info *info;

	spin_lock(&iommu->lock);
	info = xa_load(&domain->iommu_array, iommu->seq_id);
	if (--info->refcnt == 0) {
		clear_bit(info->did, iommu->domain_ids);
		xa_erase(&domain->iommu_array, iommu->seq_id);
		domain->nid = NUMA_NO_NODE;
		domain_update_iommu_cap(domain);
		kfree(info);
	}
	spin_unlock(&iommu->lock);
}

> This seems to be a systemic issue, so I'm going to ignore it here, but
> please make a series to fix it completely.

yeah, this code follows the same pattern as the other places that
reference iommu_array. If there is a real problem, we may check with
Baolu and Kevin.

> xarray is probably a bad data structure to manage attachment, a linked
> list is going to use less memory in most cases and you need a mutex
> lock anyhow.

below is the commit that introduced iommu_array.

commit ba949f4cd4c39c587e9b722ac7eb7f7e8a42dace
Author: Lu Baolu
Date:   Tue Jul 12 08:09:05 2022 +0800

    iommu/vt-d: Refactor iommu information of each domain

    When a DMA domain is attached to a device, it needs to allocate a
    domain ID from its IOMMU. Currently, the domain ID information is
    stored in two static arrays embedded in the domain structure. This can
    lead to memory waste when the driver is running on a small platform.
    This optimizes these static arrays by replacing them with an xarray
    and consuming memory on demand.

    Signed-off-by: Lu Baolu
    Reviewed-by: Kevin Tian
    Reviewed-by: Steve Wahl
    Link: https://lore.kernel.org/r/20220702015610.2849494-4-baolu...@linux.intel.com
    Signed-off-by: Joerg Roedel

-- 
Regards,
Yi Liu
Re: [PATCH bpf-next 0/2] bpf: add csum/ip_summed fields to __sk_buff
On 1/2/24 10:11 AM, Stanislav Fomichev wrote:
> On 12/29, Menglong Dong wrote:
>> For now, we have to call some helpers when we need to update the csum,
>> such as bpf_l4_csum_replace, bpf_l3_csum_replace, etc. These helpers are
>> not inlined, which causes poor performance.
>>
>> In fact, we can define our own csum update functions in the BPF program
>> instead of bpf_l3_csum_replace, which is fully inlined and efficient.
>> However, we can't do this for bpf_l4_csum_replace for now, as we can't
>> update skb->csum, which can cause skb->csum to be invalid in the rx path
>> with CHECKSUM_COMPLETE mode.
>>
>> What's more, we can't use direct data access and have to use
>> skb_store_bytes() with the BPF_F_RECOMPUTE_CSUM flag in some cases, such
>> as modifying the vni in the vxlan header when the underlay udp header has
>> no checksum.

There is bpf_csum_update(), does it work?
A helper call should be acceptable compared with the csum calculation itself.

>> In the first patch, we make skb->csum readable and writable, and we make
>> skb->ip_summed readable. For now, for tc only. With these 2 fields, we
>> don't need to call bpf helpers for csum update any more.
>>
>> In the second patch, we add some testcases for the read/write testing of
>> skb->csum and skb->ip_summed.
>>
>> If this series is acceptable, we can define the inlined functions for csum
>> update in libbpf in the next step.
>
> One downside of exposing those as __sk_buff fields is that all this
> skb internal csum stuff now becomes a UAPI. And I'm not sure we want
> that :-)

+1. Please no new __sk_buff extension and no new conversion in
bpf_convert_ctx_access().

> Should we add a lightweight kfunc to reset the fields instead?
> Or will it still have an unacceptable overhead?
Re: [PATCH v7 1/3] iommufd: Add data structure for Intel VT-d stage-1 cache invalidation
On Fri, Dec 15, 2023 at 12:01:19PM +0800, Yi Liu wrote:
> > I think I misread Yi's narrative: dev_id is a working approach
> > for VMM to convert to a vRID, while he is asking for a better
> > alternative :)
>
> In concept, dev_id works, but in reality we have a problem getting a
> dev_id for a given device in the intel iommu driver, hence I'm asking
> for help here. :)

I think we just need to solve this one way or another..

Even if you use a viommu object you still end up having difficult
coupling to iommufd

Something like:

	iommufd_get_dev_id(struct iommufd_ctx *ictx, struct device *dev)

Callable by a driver (using the driver-callable function infrastructure
we made for dirty tracking) is really all that is needed here.

Jason
Re: [PATCH 1/1] userfaultfd: fix move_pages_pte() splitting folio under RCU read lock
On Tue, Jan 2, 2024 at 3:16 PM Suren Baghdasaryan wrote:
>
> On Tue, Jan 2, 2024 at 8:58 AM Suren Baghdasaryan wrote:
> >
> > On Tue, Jan 2, 2024 at 1:00 AM Peter Xu wrote:
> > >
> > > On Fri, Dec 29, 2023 at 06:56:07PM -0800, Suren Baghdasaryan wrote:
> > > > @@ -1078,9 +1078,14 @@ static int move_pages_pte(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd,
> > > >
> > > >		/* at this point we have src_folio locked */
> > > >		if (folio_test_large(src_folio)) {
> > > > +			/* split_folio() can block */
> > > > +			pte_unmap(&orig_src_pte);
> > > > +			pte_unmap(&orig_dst_pte);
> > > > +			src_pte = dst_pte = NULL;
> > > >			err = split_folio(src_folio);
> > > >			if (err)
> > > >				goto out;
> > > > +			goto retry;
> > > >		}
> > >
> > > Do we also need to clear src_folio and src_folio_pte?  If the folio is a
> > > thp, I think it means it's pte mapped here.  Then after the split we may
> > > want to fetch the small folio after the split, not the head one?
> >
> > I think we need to re-fetch the src_folio only if the src_addr falls
> > into a non-head page. Looking at __split_huge_page(), the head
> > page is skipped in the last loop, so I think it should stay valid.
> > That said, maybe it's just an implementation detail of
> > __split_huge_page() and I should not rely on that and refetch anyway?
>
> I'll post a v2 with this fix, re-fetching the folio unconditionally.
> We also don't need to reset the src_folio_pte value because it's used
> only if src_folio is not NULL.

Posted at https://lore.kernel.org/all/20240102233256.1077959-1-sur...@google.com/

> Thanks for catching this, Peter!
[PATCH v2 1/1] userfaultfd: fix move_pages_pte() splitting folio under RCU read lock
While testing the split PMD path with lockdep enabled I've got an
"Invalid wait context" error caused by split_huge_page_to_list() trying
to lock anon_vma->rwsem while inside an RCU read section. The issue is
due to move_pages_pte() calling split_folio() under the RCU read lock.
Fix this by unmapping the PTEs and exiting the RCU read section before
splitting the folio and then retrying. The same retry pattern is used
when locking the folio or anon_vma in this function. After splitting the
large folio we unlock and release it, because after the split the old
folio might not be the one that contains the src_addr.

Fixes: 94b01c885131 ("userfaultfd: UFFDIO_MOVE uABI")
Signed-off-by: Suren Baghdasaryan
---
Changes from v1 [1]:
1. Reset src_folio and src_folio_pte after the folio is split, per Peter Xu

[1] https://lore.kernel.org/all/20231230025607.2476912-1-sur...@google.com/

 mm/userfaultfd.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 5e718014e671..216ab4c8621f 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -1078,9 +1078,18 @@ static int move_pages_pte(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd,
 
 		/* at this point we have src_folio locked */
 		if (folio_test_large(src_folio)) {
+			/* split_folio() can block */
+			pte_unmap(&orig_src_pte);
+			pte_unmap(&orig_dst_pte);
+			src_pte = dst_pte = NULL;
 			err = split_folio(src_folio);
 			if (err)
 				goto out;
+			/* have to reacquire the folio after it got split */
+			folio_unlock(src_folio);
+			folio_put(src_folio);
+			src_folio = NULL;
+			goto retry;
 		}
 
 		if (!src_anon_vma) {
-- 
2.43.0.472.g3155946c3a-goog
Re: [PATCH 1/1] userfaultfd: fix move_pages_pte() splitting folio under RCU read lock
On Tue, Jan 2, 2024 at 8:58 AM Suren Baghdasaryan wrote:
>
> On Tue, Jan 2, 2024 at 1:00 AM Peter Xu wrote:
> >
> > On Fri, Dec 29, 2023 at 06:56:07PM -0800, Suren Baghdasaryan wrote:
> > > @@ -1078,9 +1078,14 @@ static int move_pages_pte(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd,
> > >
> > >		/* at this point we have src_folio locked */
> > >		if (folio_test_large(src_folio)) {
> > > +			/* split_folio() can block */
> > > +			pte_unmap(&orig_src_pte);
> > > +			pte_unmap(&orig_dst_pte);
> > > +			src_pte = dst_pte = NULL;
> > >			err = split_folio(src_folio);
> > >			if (err)
> > >				goto out;
> > > +			goto retry;
> > >		}
> >
> > Do we also need to clear src_folio and src_folio_pte?  If the folio is a
> > thp, I think it means it's pte mapped here.  Then after the split we may
> > want to fetch the small folio after the split, not the head one?
>
> I think we need to re-fetch the src_folio only if the src_addr falls
> into a non-head page. Looking at __split_huge_page(), the head
> page is skipped in the last loop, so I think it should stay valid.
> That said, maybe it's just an implementation detail of
> __split_huge_page() and I should not rely on that and refetch anyway?

I'll post a v2 with this fix, re-fetching the folio unconditionally.
We also don't need to reset the src_folio_pte value because it's used
only if src_folio is not NULL.

Thanks for catching this, Peter!

> > --
> > Peter Xu
Re: [PATCH v3 4/4] selftest/bpf: Test a perf bpf program that suppresses side effects.
On Sun, Dec 10, 2023 at 8:56 PM Kyle Huey wrote:
>
> The test sets a hardware breakpoint and uses a bpf program to suppress the
> side effects of a perf event sample, including I/O availability signals,
> SIGTRAPs, and decrementing the event counter limit, if the ip matches the
> expected value. Then the function with the breakpoint is executed multiple
> times to test that all effects behave as expected.
>
> Signed-off-by: Kyle Huey
> ---
>  .../selftests/bpf/prog_tests/perf_skip.c | 140 ++
>  .../selftests/bpf/progs/test_perf_skip.c | 15 ++
>  2 files changed, 155 insertions(+)
>  create mode 100644 tools/testing/selftests/bpf/prog_tests/perf_skip.c
>  create mode 100644 tools/testing/selftests/bpf/progs/test_perf_skip.c
>
> diff --git a/tools/testing/selftests/bpf/prog_tests/perf_skip.c b/tools/testing/selftests/bpf/prog_tests/perf_skip.c
> new file mode 100644
> index ..0200736a8baf
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/prog_tests/perf_skip.c
> @@ -0,0 +1,140 @@
> +// SPDX-License-Identifier: GPL-2.0
> +#define _GNU_SOURCE
> +
> +#include <test_progs.h>
> +#include "test_perf_skip.skel.h"
> +#include <linux/compiler.h>
> +#include <linux/hw_breakpoint.h>
> +#include <sys/syscall.h>
> +
> +#ifndef TRAP_PERF
> +#define TRAP_PERF 6
> +#endif
> +
> +int signals_unexpected = 1;
> +int sigio_count, sigtrap_count;
> +
> +static void handle_sigio(int sig __always_unused)
> +{
> +	ASSERT_OK(signals_unexpected, "perf event not skipped");

ASSERT_OK is a little confusing. Maybe do something like:

static int signals_expected;

static void handle_sigio(int sig __always_unused)
{
	ASSERT_EQ(signals_expected, 1, "expected sig_io");
}

serial_test_perf_skip()
{
	...
	signals_expected = 1;
}

> +	++sigio_count;
> +}
> +
> +static void handle_sigtrap(int signum __always_unused,
> +			   siginfo_t *info,
> +			   void *ucontext __always_unused)
> +{
> +	ASSERT_OK(signals_unexpected, "perf event not skipped");

ditto

> +	ASSERT_EQ(info->si_code, TRAP_PERF, "wrong si_code");
> +	++sigtrap_count;
> +}
> +
> +static noinline int test_function(void)
> +{
> +	asm volatile ("");
> +	return 0;
> +}
> +
> +void serial_test_perf_skip(void)
> +{
> +	struct sigaction action = {};
> +	struct sigaction previous_sigtrap;
> +	sighandler_t previous_sigio;
> +	struct test_perf_skip *skel = NULL;
> +	struct perf_event_attr attr = {};
> +	int perf_fd = -1;
> +	int err;
> +	struct f_owner_ex owner;
> +	struct bpf_link *prog_link = NULL;
> +
> +	action.sa_flags = SA_SIGINFO | SA_NODEFER;
> +	action.sa_sigaction = handle_sigtrap;
> +	sigemptyset(&action.sa_mask);
> +	if (!ASSERT_OK(sigaction(SIGTRAP, &action, &previous_sigtrap), "sigaction"))
> +		return;
> +
> +	previous_sigio = signal(SIGIO, handle_sigio);

handle signal() errors here?

> +
> +	skel = test_perf_skip__open_and_load();
> +	if (!ASSERT_OK_PTR(skel, "skel_load"))
> +		goto cleanup;
> +
> +	attr.type = PERF_TYPE_BREAKPOINT;
> +	attr.size = sizeof(attr);
> +	attr.bp_type = HW_BREAKPOINT_X;
> +	attr.bp_addr = (uintptr_t)test_function;
> +	attr.bp_len = sizeof(long);
> +	attr.sample_period = 1;
> +	attr.sample_type = PERF_SAMPLE_IP;
> +	attr.pinned = 1;
> +	attr.exclude_kernel = 1;
> +	attr.exclude_hv = 1;
> +	attr.precise_ip = 3;
> +	attr.sigtrap = 1;
> +	attr.remove_on_exec = 1;
> +
> +	perf_fd = syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
> +	if (perf_fd < 0 && (errno == ENOENT || errno == EOPNOTSUPP)) {
> +		printf("SKIP:no PERF_TYPE_BREAKPOINT/HW_BREAKPOINT_X\n");
> +		test__skip();
> +		goto cleanup;
> +	}
> +	if (!ASSERT_OK(perf_fd < 0, "perf_event_open"))
> +		goto cleanup;
> +
> +	/* Configure the perf event to signal on sample. */
> +	err = fcntl(perf_fd, F_SETFL, O_ASYNC);
> +	if (!ASSERT_OK(err, "fcntl(F_SETFL, O_ASYNC)"))
> +		goto cleanup;
> +
> +	owner.type = F_OWNER_TID;
> +	owner.pid = syscall(__NR_gettid);
> +	err = fcntl(perf_fd, F_SETOWN_EX, &owner);
> +	if (!ASSERT_OK(err, "fcntl(F_SETOWN_EX)"))
> +		goto cleanup;
> +
> +	/*
> +	 * Allow at most one sample. A sample rejected by bpf should
> +	 * not count against this.
> +	 */

Multi-line comment style should be like

/* Allow at most one sample. A sample rejected by bpf should
 * not count against this.
 */

> +	err = ioctl(perf_fd, PERF_EVENT_IOC_REFRESH, 1);
> +	if (!ASSERT_OK(err, "ioctl(PERF_EVENT_IOC_REFRESH)"))
> +		goto cleanup;
> +
> +	prog_link = bpf_program__attach_perf_event(skel->progs.handler, perf_fd);
> +	if (!ASSERT_OK_PTR(prog_link, "bpf_program__attach_perf_event"))
> +		goto cleanup;
> +
> +	/* Configure the
Re: [PATCH RESEND v4 1/3] kselftests: lib.mk: Add TEST_GEN_MODS_DIR variable
On Wed, Dec 20, 2023 at 01:53:12PM -0300, Marcos Paulo de Souza wrote:
> Add TEST_GEN_MODS_DIR variable for kselftests. It can point to
> a directory containing kernel modules that will be used by
> selftest scripts.
>
> The modules are built as external modules for the running kernel.
> As a result they are always binary compatible and the same tests
> can be used for older or newer kernels.
>
> The build requires the "kernel-devel" package to be installed.
> For example, in the upstream sources, the rpm devel package
> is produced by "make rpm-pkg".
>
> The modules can be built independently by
>
>	make -C tools/testing/selftests/livepatch/
>
> or they will be automatically built before running the tests via
>
>	make -C tools/testing/selftests/livepatch/ run_tests
>
> Note that they are _not_ built when running the standalone
> tests by calling, for example, ./test-state.sh.
>
> Signed-off-by: Marcos Paulo de Souza
> ---
>  Documentation/dev-tools/kselftest.rst |  4 ++++
>  tools/testing/selftests/lib.mk        | 20 +++++++++++++++-----
>  2 files changed, 19 insertions(+), 5 deletions(-)
>
> diff --git a/Documentation/dev-tools/kselftest.rst b/Documentation/dev-tools/kselftest.rst
> index ab376b316c36..7f3582a67318 100644
> --- a/Documentation/dev-tools/kselftest.rst
> +++ b/Documentation/dev-tools/kselftest.rst
> @@ -245,6 +245,10 @@ Contributing new tests (details)
>     TEST_PROGS, TEST_GEN_PROGS mean it is the executable tested by
>     default.
>
> +   TEST_GEN_MODS_DIR should be used by tests that require modules to be built
> +   before the test starts. The variable will contain the name of the directory
> +   containing the modules.
> +
>     TEST_CUSTOM_PROGS should be used by tests that require custom build
>     rules and prevent common build rule use.
>
> diff --git a/tools/testing/selftests/lib.mk b/tools/testing/selftests/lib.mk
> index 118e0964bda9..6c7c5a0112cf 100644
> --- a/tools/testing/selftests/lib.mk
> +++ b/tools/testing/selftests/lib.mk
> @@ -70,12 +70,15 @@ KHDR_INCLUDES := -isystem $(KHDR_DIR)
>  # TEST_PROGS are for test shell scripts.
>  # TEST_CUSTOM_PROGS and TEST_PROGS will be run by common run_tests
>  # and install targets. Common clean doesn't touch them.
> +# TEST_GEN_MODS_DIR is used to specify a directory with modules to be built
> +# before the test executes. These modules are cleaned on the clean target as well.
>  TEST_GEN_PROGS := $(patsubst %,$(OUTPUT)/%,$(TEST_GEN_PROGS))
>  TEST_GEN_PROGS_EXTENDED := $(patsubst %,$(OUTPUT)/%,$(TEST_GEN_PROGS_EXTENDED))
>  TEST_GEN_FILES := $(patsubst %,$(OUTPUT)/%,$(TEST_GEN_FILES))
> +TEST_GEN_MODS_DIR := $(patsubst %,$(OUTPUT)/%,$(TEST_GEN_MODS_DIR))
>
>  all: kernel_header_files $(TEST_GEN_PROGS) $(TEST_GEN_PROGS_EXTENDED) \
> -	$(TEST_GEN_FILES)
> +	$(TEST_GEN_FILES) $(if $(TEST_GEN_MODS_DIR),gen_mods_dir)
>
>  kernel_header_files:
>	@ls $(KHDR_DIR)/linux/*.h >/dev/null 2>/dev/null; \
> @@ -105,8 +108,8 @@ endef
>
>  run_tests: all
>  ifdef building_out_of_srctree
> -	@if [ "X$(TEST_PROGS)$(TEST_PROGS_EXTENDED)$(TEST_FILES)" != "X" ]; then \
> -		rsync -aq --copy-unsafe-links $(TEST_PROGS) $(TEST_PROGS_EXTENDED) $(TEST_FILES) $(OUTPUT); \
> +	@if [ "X$(TEST_PROGS)$(TEST_PROGS_EXTENDED)$(TEST_FILES)$(TEST_GEN_MODS_DIR)" != "X" ]; then \
> +		rsync -aq --copy-unsafe-links $(TEST_PROGS) $(TEST_PROGS_EXTENDED) $(TEST_FILES) $(TEST_GEN_MODS_DIR) $(OUTPUT); \
>	fi
>	@if [ "X$(TEST_PROGS)" != "X" ]; then \
>		$(call RUN_TESTS, $(TEST_GEN_PROGS) $(TEST_CUSTOM_PROGS) \
> @@ -118,6 +121,12 @@ else
>	@$(call RUN_TESTS, $(TEST_GEN_PROGS) $(TEST_CUSTOM_PROGS) $(TEST_PROGS))
>  endif
>
> +gen_mods_dir:
> +	$(Q)$(MAKE) -C $(TEST_GEN_MODS_DIR)
> +
> +clean_mods_dir:
> +	$(Q)$(MAKE) -C $(TEST_GEN_MODS_DIR) clean
> +
>  define INSTALL_SINGLE_RULE
>	$(if $(INSTALL_LIST),@mkdir -p $(INSTALL_PATH))
>	$(if $(INSTALL_LIST),rsync -a --copy-unsafe-links $(INSTALL_LIST) $(INSTALL_PATH)/)
> @@ -131,6 +140,7 @@ define INSTALL_RULE
>	$(eval INSTALL_LIST = $(TEST_CUSTOM_PROGS)) $(INSTALL_SINGLE_RULE)
>	$(eval INSTALL_LIST = $(TEST_GEN_PROGS_EXTENDED)) $(INSTALL_SINGLE_RULE)
>	$(eval INSTALL_LIST = $(TEST_GEN_FILES)) $(INSTALL_SINGLE_RULE)
> +	$(eval INSTALL_LIST = $(TEST_GEN_MODS_DIR)) $(INSTALL_SINGLE_RULE)

Hi Marcos,

Sorry for the late reply on this, but I'm reviewing this version by
trying to retrofit it into our selftest packaging (pre-build the test
module .ko's and stash those into an rpm rather than building on the
test host).

Since $TEST_GEN_MODS_DIR is treated as a directory, I found that the
selftest install target copies a bunch of intermediate object and
kbuild files:

  $ mkdir /tmp/test-install
  $ make KDIR=$(pwd) INSTALL_PATH=/tmp/test-install TARGETS=livepatch \
      -C tools/testing/selftests/ install

  [ ... builds livepatch selftests ... ]
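The reviewer's complaint is that installing the whole module directory drags along kbuild intermediates. One way out is to copy only the artifacts a test run needs. A shell sketch of that filtering over a fabricated module directory — this is not the actual lib.mk fix, and all file names below are made up:

```shell
#!/bin/sh
# Sketch: install only the .ko files from a module build dir, skipping
# kbuild intermediates (.o, .mod, .mod.c, hidden .cmd files).
set -eu

src=$(mktemp -d)
dst=$(mktemp -d)

# Fake build output, roughly what kbuild leaves behind
touch "$src/test_klp_state.ko" "$src/test_klp_state.o" \
      "$src/test_klp_state.mod" "$src/test_klp_state.mod.c" \
      "$src/.test_klp_state.ko.cmd" "$src/Makefile"

# Copy only the artifacts a test run actually needs
find "$src" -maxdepth 1 -name '*.ko' -exec cp {} "$dst"/ \;

ls "$dst"
rm -rf "$src" "$dst"
```

In lib.mk terms the same effect could come from rsync exclude patterns or from an install list built with a filtered wildcard, but either way the install step has to distinguish test artifacts from build residue.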
Re: [PATCH] selftests: Move KTAP bash helpers to selftests common folder
On Tue, Jan 02, 2024 at 03:15:28PM +0100, Laura Nao wrote:
> Move bash helpers for outputting in KTAP format to the common selftests
> folder. This allows kselftests other than the dt one to source the file
> and make use of the helper functions.
> Define pass, fail and skip codes in the same file too.
>
> Signed-off-by: Laura Nao

Reviewed-by: Nícolas F. R. A. Prado
Tested-by: Nícolas F. R. A. Prado

Thanks,
Nícolas
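For readers unfamiliar with the helpers being moved, here is a minimal sketch of what bash KTAP output helpers look like. The function names and result codes are illustrative and may not match the actual kselftest file:

```shell
#!/bin/bash
# Minimal KTAP-style output helpers; names are illustrative only.

KSFT_PASS=0
KSFT_FAIL=1
KSFT_SKIP=4

ktap_test_num=0

ktap_print_header() {
	echo "TAP version 13"
}

# $1: number of planned tests
ktap_set_plan() {
	echo "1..$1"
}

ktap_test_pass() {
	ktap_test_num=$((ktap_test_num + 1))
	echo "ok $ktap_test_num $1"
}

ktap_test_fail() {
	ktap_test_num=$((ktap_test_num + 1))
	echo "not ok $ktap_test_num $1"
}

ktap_test_skip() {
	ktap_test_num=$((ktap_test_num + 1))
	echo "ok $ktap_test_num $1 # SKIP"
}

# Example run
ktap_print_header
ktap_set_plan 2
ktap_test_pass "probe works"
ktap_test_skip "hardware not present"
```

Any test script that sources such a file emits parseable KTAP without duplicating the echo boilerplate, which is exactly the motivation for hoisting the helpers out of the dt selftest.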
Re: [PATCH v10 00/10] Add iommufd nesting (part 2/2)
On Tue, Jan 02, 2024 at 06:38:24AM -0800, Yi Liu wrote:
> Lu Baolu (4):
>   iommu: Add cache_invalidate_user op
>   iommu/vt-d: Allow qi_submit_sync() to return the QI faults
>   iommu/vt-d: Convert stage-1 cache invalidation to return QI fault
>   iommu/vt-d: Add iotlb flush for nested domain
>
> Nicolin Chen (4):
>   iommu: Add iommu_copy_struct_from_user_array helper
>   iommufd/selftest: Add mock_domain_cache_invalidate_user support
>   iommufd/selftest: Add IOMMU_TEST_OP_MD_CHECK_IOTLB test op
>   iommufd/selftest: Add coverage for IOMMU_HWPT_INVALIDATE ioctl
>
> Yi Liu (2):
>   iommufd: Add IOMMU_HWPT_INVALIDATE
>   iommufd: Add data structure for Intel VT-d stage-1 cache invalidation

Applied to for-next, thanks

Jason
Re: [PATCH v10 10/10] iommu/vt-d: Add iotlb flush for nested domain
On Tue, Jan 02, 2024 at 06:38:34AM -0800, Yi Liu wrote:
> +static void intel_nested_flush_cache(struct dmar_domain *domain, u64 addr,
> +				     unsigned long npages, bool ih, u32 *error)
> +{
> +	struct iommu_domain_info *info;
> +	unsigned long i;
> +	unsigned mask;
> +	u32 fault;
> +
> +	xa_for_each(&domain->iommu_array, i, info)
> +		qi_flush_piotlb(info->iommu,
> +				domain_id_iommu(domain, info->iommu),
> +				IOMMU_NO_PASID, addr, npages, ih, NULL);

This locking on the xarray is messed up throughout the driver. There
could be a concurrent detach at this point which will free info and UAF
this.

This seems to be a systemic issue, so I'm going to ignore it here, but
please make a series to fix it completely.

xarray is probably a bad data structure to manage attachment; a linked
list is going to use less memory in most cases and you need a mutex
lock anyhow.

Jason
Re: [PATCH bpf-next 0/2] bpf: add csum/ip_summed fields to __sk_buff
On 12/29, Menglong Dong wrote:
> For now, we have to call some helpers when we need to update the csum,
> such as bpf_l4_csum_replace, bpf_l3_csum_replace, etc. These helpers are
> not inlined, which causes poor performance.
>
> In fact, we can define our own csum update functions in the BPF program
> instead of bpf_l3_csum_replace, which is fully inlined and efficient.
> However, we can't do this for bpf_l4_csum_replace for now, as we can't
> update skb->csum, which can cause skb->csum to be invalid in the rx path
> with CHECKSUM_COMPLETE mode.
>
> What's more, we can't use direct data access and have to use
> skb_store_bytes() with the BPF_F_RECOMPUTE_CSUM flag in some cases, such
> as modifying the vni in the vxlan header when the underlay udp header has
> no checksum.
>
> In the first patch, we make skb->csum readable and writable, and we make
> skb->ip_summed readable. For now, for tc only. With these 2 fields, we
> don't need to call bpf helpers for csum update any more.
>
> In the second patch, we add some testcases for the read/write testing of
> skb->csum and skb->ip_summed.
>
> If this series is acceptable, we can define the inlined functions for csum
> update in libbpf in the next step.

One downside of exposing those as __sk_buff fields is that all this
skb internal csum stuff now becomes a UAPI. And I'm not sure we want
that :-)

Should we add a lightweight kfunc to reset the fields instead? Or will
it still have an unacceptable overhead?
Re: [PATCH v7 1/3] iommufd: Add data structure for Intel VT-d stage-1 cache invalidation
On Fri, Dec 15, 2023 at 01:50:07AM +, Tian, Kevin wrote: > > - Reuse Nicolin's vRID->pRID mapping. If the vRID->pRID mapping is > > maintained, then intel iommu can report a vRID back to user. But intel > > iommu driver does not have viommu context, no place to hold the > > vRID->pRID mapping. TBH. It may require other reasons to introduce it other than the > > error reporting need. Anyhow, this requires more thinking and also has > > dependency even if it is doable in intel side. > > this sounds like a cleaner way to inject knowledge which iommu driver > requires to find out the user tag. but yes it's a bit weird to introduce > viommu awareness in intel iommu driver when there is no such thing > in real hardware. > > and for this error reporting case what we actually require is the > reverse map i.e. pRID->vRID. Not sure whether we can leverage the > same RID mapping uAPI as for ARM/AMD but ignore viommu_id > and then store vRID under device_domain_info. A bit tricky on > life cycle management and also incompatible with SIOV... > > let's see whether Jason has a better idea here. I think v10 is OK struct iommu_hwpt_invalidate { __u32 size; __u32 hwpt_id; __aligned_u64 data_uptr; __u32 data_type; __u32 entry_len; __u32 entry_num; __u32 __reserved; }; Sends the invalidation to the HWPT, which matches what Intel wanted where the entire HWPT and all its associated devices are invalidated. No separate per-device invalidation. For error and event reporting they should be returned to userspace with the IOMMU dev_id indicating the originating PCI function. The VMM would have to convert dev_id into vRID according to the vIOMMU instance that the device is hooked up to. In iommu land we should never have a "RID" but always some kind of device-specific "device ID" which is the index into the particular HW table, and that ID is naturally scoped to within the IOMMU instance that owns the table - so it is very much not a global ID that can be used alone in any of the uAPI. 
The uAPI should use the iommufd device ID to refer to specific devices. Jason
Re: [PATCH v8 21/24] evm: Move to LSM infrastructure
On Tue, 2024-01-02 at 12:56 +0100, Roberto Sassu wrote: > On 12/26/2023 11:13 PM, Mimi Zohar wrote: > > On Thu, 2023-12-14 at 18:08 +0100, Roberto Sassu wrote: > >> From: Roberto Sassu > >> > >> As for IMA, move hardcoded EVM function calls from various places in the > >> kernel to the LSM infrastructure, by introducing a new LSM named 'evm' > >> (last and always enabled like 'ima'). The order in the Makefile ensures > >> that 'evm' hooks are executed after 'ima' ones. > >> > >> Make EVM functions as static (except for evm_inode_init_security(), which > >> is exported), and register them as hook implementations in init_evm_lsm(). > >> > >> Unlike before (see commit to move IMA to the LSM infrastructure), > >> evm_inode_post_setattr(), evm_inode_post_set_acl(), > >> evm_inode_post_remove_acl(), and evm_inode_post_removexattr() are not > >> executed for private inodes. > >> > > > > Missing is a comment on moving the inline function definitions - > > evm_inode_remove_acl(), evm_inode_post_remove_acl(), and > > evm_inode_post_set_acl() - to evm_main.c. > > Ok. > > >> Finally, add the LSM_ID_EVM case in lsm_list_modules_test.c > >> > >> Signed-off-by: Roberto Sassu > >> --- > > > > [...] > >> @@ -2307,9 +2299,7 @@ int security_inode_setxattr(struct mnt_idmap *idmap, > >> > >>if (ret == 1) > >>ret = cap_inode_setxattr(dentry, name, value, size, flags); > >> - if (ret) > >> - return ret; > >> - return evm_inode_setxattr(idmap, dentry, name, value, size, flags); > >> + return ret; > >> } > > > > Even though capability will be called after EVM, it doesn't make a > > difference in this instance. > > > > [...] 
> > > >> /** > >> @@ -2493,9 +2472,7 @@ int security_inode_removexattr(struct mnt_idmap > >> *idmap, > >>ret = call_int_hook(inode_removexattr, 1, idmap, dentry, name); > >>if (ret == 1) > >>ret = cap_inode_removexattr(idmap, dentry, name); > >> - if (ret) > >> - return ret; > >> - return evm_inode_removexattr(idmap, dentry, name); > >> + return ret; > >> } > > > > 'security.capability' is one of the EVM protected xattrs. As > > capability isn't an LSM, it will now be called after EVM, which is a > > problem. > > Uhm, according to this comment in security_inode_removexattr() and > security_inode_setxattr(): > > /* >* SELinux and Smack integrate the cap call, >* so assume that all LSMs supplying this call do so. >*/ > > We can add the call to IMA and EVM as well, to be compliant. SELinux and Smack are the only current LSMs that register the security_inode_removexattr hook. Both enforce mandatory access control, so their calling capabilities to enforce DAC kind of makes sense. I'm not sure it makes sense for IMA and EVM to call capability directly, just because of the comment. > However, I'm missing why the two cases are different. It seems > cap_inode_set/removexattr() are doing just checks. Both IMA and EVM require CAP_SYS_ADMIN to write/remove security.ima and security.evm respectively. In addition, EVM must recalculate security.evm if any protected security xattrs are set or removed. However, security.evm is updated on security_inode_post_setxattr, not security_inode_setxattr. Mimi
Re: [PATCH net-next v2 2/3] net: gro: parse ipv6 ext headers without frag0 invalidation
On Tue, Jan 2, 2024 at 2:25 PM Richard Gobert wrote: > > The existing code always pulls the IPv6 header and sets the transport > offset initially. Then optionally again pulls any extension headers in > ipv6_gso_pull_exthdrs and sets the transport offset again on return from > that call. skb->data is set at the start of the first extension header > before calling ipv6_gso_pull_exthdrs, and must disable the frag0 > optimization because that function uses pskb_may_pull/pskb_pull instead of > skb_gro_ helpers. It sets the GRO offset to the TCP header with > skb_gro_pull and sets the transport header. Then returns skb->data to its > position before this block. > > This commit introduces a new helper function - ipv6_gro_pull_exthdrs - > which is used in ipv6_gro_receive to pull ipv6 ext headers instead of > ipv6_gso_pull_exthdrs. Thus, there is no modification of skb->data, all > operations use skb_gro_* helpers, and the frag0 fast path can be taken for > IPv6 packets with ext headers. > > Signed-off-by: Richard Gobert > Reviewed-by: Willem de Bruijn > --- > include/net/ipv6.h | 1 + > net/ipv6/ip6_offload.c | 51 +- > 2 files changed, 42 insertions(+), 10 deletions(-) > > diff --git a/include/net/ipv6.h b/include/net/ipv6.h > index 78d38dd88aba..217240efa182 100644 > --- a/include/net/ipv6.h > +++ b/include/net/ipv6.h > @@ -26,6 +26,7 @@ struct ip_tunnel_info; > #define SIN6_LEN_RFC2133 24 > > #define IPV6_MAXPLEN 65535 > +#define IPV6_MIN_EXTHDR_LEN 8 // Hmm see my following comment. 
> > /* > * NextHeader field of IPv6 header > diff --git a/net/ipv6/ip6_offload.c b/net/ipv6/ip6_offload.c > index 0e0b5fed0995..c07111d8f56a 100644 > --- a/net/ipv6/ip6_offload.c > +++ b/net/ipv6/ip6_offload.c > @@ -37,6 +37,40 @@ > INDIRECT_CALL_L4(cb, f2, f1, head, skb);\ > }) > > +static int ipv6_gro_pull_exthdrs(struct sk_buff *skb, int off, int proto) > +{ > + const struct net_offload *ops = NULL; > + struct ipv6_opt_hdr *opth; > + > + for (;;) { > + int len; > + > + ops = rcu_dereference(inet6_offloads[proto]); > + > + if (unlikely(!ops)) > + break; > + > + if (!(ops->flags & INET6_PROTO_GSO_EXTHDR)) > + break; > + > + opth = skb_gro_header(skb, off + IPV6_MIN_EXTHDR_LEN, off); I do not see a compelling reason for adding yet another constant here. I would stick to opth = skb_gro_header(skb, off + sizeof(*opth), off); Consistency with similar helpers is desirable. > + if (unlikely(!opth)) > + break; > + > + len = ipv6_optlen(opth); > + > + opth = skb_gro_header(skb, off + len, off); Note this call will take care of precise pull. 
> + if (unlikely(!opth)) > + break; > + proto = opth->nexthdr; > + > + off += len; > + } > + > + skb_gro_pull(skb, off - skb_network_offset(skb)); > + return proto; > +} > + > static int ipv6_gso_pull_exthdrs(struct sk_buff *skb, int proto) > { > const struct net_offload *ops = NULL; > @@ -203,28 +237,25 @@ INDIRECT_CALLABLE_SCOPE struct sk_buff > *ipv6_gro_receive(struct list_head *head, > goto out; > > skb_set_network_header(skb, off); > - skb_gro_pull(skb, sizeof(*iph)); > - skb_set_transport_header(skb, skb_gro_offset(skb)); > > - flush += ntohs(iph->payload_len) != skb_gro_len(skb); > + flush += ntohs(iph->payload_len) != skb->len - hlen; > > proto = iph->nexthdr; > ops = rcu_dereference(inet6_offloads[proto]); > if (!ops || !ops->callbacks.gro_receive) { > - pskb_pull(skb, skb_gro_offset(skb)); > - skb_gro_frag0_invalidate(skb); > - proto = ipv6_gso_pull_exthdrs(skb, proto); > - skb_gro_pull(skb, -skb_transport_offset(skb)); > - skb_reset_transport_header(skb); > - __skb_push(skb, skb_gro_offset(skb)); > + proto = ipv6_gro_pull_exthdrs(skb, hlen, proto); > > ops = rcu_dereference(inet6_offloads[proto]); > if (!ops || !ops->callbacks.gro_receive) > goto out; > > - iph = ipv6_hdr(skb); > + iph = skb_gro_network_header(skb); > + } else { > + skb_gro_pull(skb, sizeof(*iph)); > } > > + skb_set_transport_header(skb, skb_gro_offset(skb)); > + > NAPI_GRO_CB(skb)->proto = proto; > > flush--; > -- > 2.36.1 >
Re: [PATCH net-next v2 1/3] net: gso: add HBH extension header offload support
On Tue, Jan 2, 2024 at 2:21 PM Richard Gobert wrote: > > This commit adds net_offload to IPv6 Hop-by-Hop extension headers (as it > is done for routing and dstopts) since it is supported in GSO and GRO. > This allows to remove specific HBH conditionals in GSO and GRO when > pulling and parsing an incoming packet. > > Signed-off-by: Richard Gobert > Reviewed-by: Willem de Bruijn Reviewed-by: Eric Dumazet
Re: [PATCH 1/1] userfaultfd: fix move_pages_pte() splitting folio under RCU read lock
On Tue, Jan 2, 2024 at 1:00 AM Peter Xu wrote: > > On Fri, Dec 29, 2023 at 06:56:07PM -0800, Suren Baghdasaryan wrote: > > @@ -1078,9 +1078,14 @@ static int move_pages_pte(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd, > > > > /* at this point we have src_folio locked */ > > if (folio_test_large(src_folio)) { > > + /* split_folio() can block */ > > + pte_unmap(&orig_src_pte); > > + pte_unmap(&orig_dst_pte); > > + src_pte = dst_pte = NULL; > > err = split_folio(src_folio); > > if (err) > > goto out; > > + goto retry; > > } > > Do we also need to clear src_folio and src_folio_pte? If the folio is a > thp, I think it means it's pte mapped here. Then after the split we may > want to fetch the small folio after the split, not the head one? I think we need to re-fetch the src_folio only if the src_addr falls into a non-head page. Looking at __split_huge_page(), the head page is skipped in the last loop, so I think it should stay valid. That said, maybe it's just an implementation detail of __split_huge_page() and I should not rely on that and refetch anyway? > > -- > Peter Xu >
Re: [PATCH net-next v2 2/3] net: gro: parse ipv6 ext headers without frag0 invalidation
On 1/2/24 6:24 AM, Richard Gobert wrote: > The existing code always pulls the IPv6 header and sets the transport > offset initially. Then optionally again pulls any extension headers in > ipv6_gso_pull_exthdrs and sets the transport offset again on return from > that call. skb->data is set at the start of the first extension header > before calling ipv6_gso_pull_exthdrs, and must disable the frag0 > optimization because that function uses pskb_may_pull/pskb_pull instead of > skb_gro_ helpers. It sets the GRO offset to the TCP header with > skb_gro_pull and sets the transport header. Then returns skb->data to its > position before this block. > > This commit introduces a new helper function - ipv6_gro_pull_exthdrs - > which is used in ipv6_gro_receive to pull ipv6 ext headers instead of > ipv6_gso_pull_exthdrs. Thus, there is no modification of skb->data, all > operations use skb_gro_* helpers, and the frag0 fast path can be taken for > IPv6 packets with ext headers. > > Signed-off-by: Richard Gobert > Reviewed-by: Willem de Bruijn > --- > include/net/ipv6.h | 1 + > net/ipv6/ip6_offload.c | 51 +- > 2 files changed, 42 insertions(+), 10 deletions(-) > Reviewed-by: David Ahern
Re: [PATCH net-next v2 1/3] net: gso: add HBH extension header offload support
On 1/2/24 6:20 AM, Richard Gobert wrote: > This commit adds net_offload to IPv6 Hop-by-Hop extension headers (as it > is done for routing and dstopts) since it is supported in GSO and GRO. > This allows to remove specific HBH conditionals in GSO and GRO when > pulling and parsing an incoming packet. > > Signed-off-by: Richard Gobert > Reviewed-by: Willem de Bruijn > --- > net/ipv6/exthdrs_offload.c | 11 +++ > net/ipv6/ip6_offload.c | 25 +++-- > 2 files changed, 22 insertions(+), 14 deletions(-) > Reviewed-by: David Ahern
Re: [PATCH net-next v2 3/3] selftests/net: fix GRO coalesce test and add ext header coalesce tests
Richard Gobert wrote: > Currently there is no test which checks that IPv6 extension header packets > successfully coalesce. This commit adds a test, which verifies two IPv6 > packets with HBH extension headers do coalesce, and another test which > checks that packets with different extension header data do not coalesce > in GRO. > > I changed the receive socket filter to accept a packet with one extension > header. This change exposed a bug in the fragment test -- the old BPF did > not accept the fragment packet. I updated correct_num_packets in the > fragment test accordingly. > > Signed-off-by: Richard Gobert Reviewed-by: Willem de Bruijn Thanks for adding the second test. > +static void add_ipv6_exthdr(void *buf, void *optpkt, __u8 exthdr_type, char > *ext_payload) > +{ > + struct ipv6_opt_hdr *exthdr = (struct ipv6_opt_hdr *)(optpkt + > tcp_offset); > + struct ipv6hdr *iph = (struct ipv6hdr *)(optpkt + ETH_HLEN); > + char *exthdr_payload_start = (char *)(exthdr + 1); > + > + exthdr->hdrlen = 0; > + exthdr->nexthdr = IPPROTO_TCP; > + > + if (ext_payload) > + memcpy(exthdr_payload_start, ext_payload, MIN_EXTHDR_SIZE - > sizeof(*exthdr)); minor nit, in case this gets respun: ext_payload is always true.
[PATCH v10 10/10] iommu/vt-d: Add iotlb flush for nested domain
From: Lu Baolu This implements the .cache_invalidate_user() callback to support iotlb flush for nested domain. Reviewed-by: Kevin Tian Signed-off-by: Lu Baolu Co-developed-by: Yi Liu Signed-off-by: Yi Liu --- drivers/iommu/intel/nested.c | 107 +++ 1 file changed, 107 insertions(+) diff --git a/drivers/iommu/intel/nested.c b/drivers/iommu/intel/nested.c index b5a5563ab32c..f1f86437939c 100644 --- a/drivers/iommu/intel/nested.c +++ b/drivers/iommu/intel/nested.c @@ -73,9 +73,116 @@ static void intel_nested_domain_free(struct iommu_domain *domain) kfree(to_dmar_domain(domain)); } +static void nested_flush_dev_iotlb(struct dmar_domain *domain, u64 addr, + unsigned mask, u32 *fault) +{ + struct device_domain_info *info; + unsigned long flags; + u16 sid, qdep; + + spin_lock_irqsave(&domain->lock, flags); + list_for_each_entry(info, &domain->devices, link) { + if (!info->ats_enabled) + continue; + sid = info->bus << 8 | info->devfn; + qdep = info->ats_qdep; + qi_flush_dev_iotlb(info->iommu, sid, info->pfsid, + qdep, addr, mask, fault); + quirk_extra_dev_tlb_flush(info, addr, mask, + IOMMU_NO_PASID, qdep); + } + spin_unlock_irqrestore(&domain->lock, flags); +} + +static void intel_nested_flush_cache(struct dmar_domain *domain, u64 addr, +unsigned long npages, bool ih, u32 *error) +{ + struct iommu_domain_info *info; + unsigned long i; + unsigned mask; + u32 fault; + + xa_for_each(&domain->iommu_array, i, info) + qi_flush_piotlb(info->iommu, + domain_id_iommu(domain, info->iommu), + IOMMU_NO_PASID, addr, npages, ih, NULL); + + if (!domain->has_iotlb_device) + return; + + if (npages == U64_MAX) + mask = 64 - VTD_PAGE_SHIFT; + else + mask = ilog2(__roundup_pow_of_two(npages)); + + nested_flush_dev_iotlb(domain, addr, mask, &fault); + + *error = 0; + /* +* Invalidation queue error (i.e. IQE) will not be reported to user +* as it's caused only by driver internal bug. 
+*/ + if (fault & DMA_FSTS_ICE) + *error |= IOMMU_HWPT_INVALIDATE_VTD_S1_ICE; + if (fault & DMA_FSTS_ITE) + *error |= IOMMU_HWPT_INVALIDATE_VTD_S1_ITE; +} + +static int intel_nested_cache_invalidate_user(struct iommu_domain *domain, + struct iommu_user_data_array *array) +{ + struct dmar_domain *dmar_domain = to_dmar_domain(domain); + struct iommu_hwpt_vtd_s1_invalidate inv_entry; + u32 processed = 0; + int ret = 0; + u32 index; + + if (array->type != IOMMU_HWPT_INVALIDATE_DATA_VTD_S1) { + ret = -EINVAL; + goto out; + } + + for (index = 0; index < array->entry_num; index++) { + ret = iommu_copy_struct_from_user_array(&inv_entry, array, + IOMMU_HWPT_INVALIDATE_DATA_VTD_S1, + index, hw_error); + if (ret) + break; + + if (inv_entry.flags & ~IOMMU_VTD_INV_FLAGS_LEAF) { + ret = -EOPNOTSUPP; + break; + } + + if (!IS_ALIGNED(inv_entry.addr, VTD_PAGE_SIZE) || + ((inv_entry.npages == U64_MAX) && inv_entry.addr)) { + ret = -EINVAL; + break; + } + + intel_nested_flush_cache(dmar_domain, inv_entry.addr, +inv_entry.npages, +inv_entry.flags & IOMMU_VTD_INV_FLAGS_LEAF, +&inv_entry.hw_error); + + ret = iommu_respond_struct_to_user_array(array, index, +(void *)&inv_entry, +sizeof(inv_entry)); + if (ret) + break; + + processed++; + } + +out: + array->entry_num = processed; + return ret; +} + static const struct iommu_domain_ops intel_nested_domain_ops = { .attach_dev = intel_nested_attach_dev, .free = intel_nested_domain_free, + .cache_invalidate_user = intel_nested_cache_invalidate_user, }; struct iommu_domain *intel_nested_domain_alloc(struct iommu_domain *parent, -- 2.34.1
[PATCH v10 09/10] iommufd: Add data structure for Intel VT-d stage-1 cache invalidation
This adds the data structure invalidating caches for the nested domain allocated with IOMMU_HWPT_DATA_VTD_S1 type. Reviewed-by: Kevin Tian Signed-off-by: Lu Baolu Signed-off-by: Yi Liu --- include/uapi/linux/iommufd.h | 58 1 file changed, 58 insertions(+) diff --git a/include/uapi/linux/iommufd.h b/include/uapi/linux/iommufd.h index 824560c50ec6..37c778055201 100644 --- a/include/uapi/linux/iommufd.h +++ b/include/uapi/linux/iommufd.h @@ -623,6 +623,64 @@ enum iommu_hwpt_invalidate_data_type { IOMMU_HWPT_INVALIDATE_DATA_VTD_S1, }; +/** + * enum iommu_hwpt_vtd_s1_invalidate_flags - Flags for Intel VT-d + * stage-1 cache invalidation + * @IOMMU_VTD_INV_FLAGS_LEAF: Indicates whether the invalidation applies + *to all-levels page structure cache or just + *the leaf PTE cache. + */ +enum iommu_hwpt_vtd_s1_invalidate_flags { + IOMMU_VTD_INV_FLAGS_LEAF = 1 << 0, +}; + +/** + * enum iommu_hwpt_vtd_s1_invalidate_error - Hardware error of invalidation + * @IOMMU_HWPT_INVALIDATE_VTD_S1_ICE: Invalidation Completion Error, details + *refer to 11.4.7.1 Fault Status Register + *of VT-d specification. + * @IOMMU_HWPT_INVALIDATE_VTD_S1_ITE: Invalidation Time-out Error, details + *refer to 11.4.7.1 Fault Status Register + *of VT-d specification. + */ +enum iommu_hwpt_vtd_s1_invalidate_error { + IOMMU_HWPT_INVALIDATE_VTD_S1_ICE = 1 << 0, + IOMMU_HWPT_INVALIDATE_VTD_S1_ITE = 1 << 1, +}; + +/** + * struct iommu_hwpt_vtd_s1_invalidate - Intel VT-d cache invalidation + * (IOMMU_HWPT_INVALIDATE_DATA_VTD_S1) + * @addr: The start address of the range to be invalidated. It needs to + *be 4KB aligned. + * @npages: Number of contiguous 4K pages to be invalidated. + * @flags: Combination of enum iommu_hwpt_vtd_s1_invalidate_flags + * @hw_error: One of enum iommu_hwpt_vtd_s1_invalidate_error + * + * The Intel VT-d specific invalidation data for user-managed stage-1 cache + * invalidation in nested translation. 
Userspace uses this structure to + * tell the impacted cache scope after modifying the stage-1 page table. + * + * All the caches related to the page table can be invalidated by setting + * @addr to 0 and @npages to U64_MAX. + * + * The device TLB will be invalidated automatically if ATS is enabled. + * + * An entry is considered 'handled' after it passes the audit and is submitted + * to the IOMMU by the underlying driver. Check the @entry_num output of + * struct iommu_hwpt_invalidate for the number of handled entries. A 'handled' + * request may still fail in hardware for various reasons, e.g. due to timeout + * on waiting for device response upon a device TLB invalidation request. In + * such a case the hardware error info is reported in the @hw_error field of + * the handled entry. + */ +struct iommu_hwpt_vtd_s1_invalidate { + __aligned_u64 addr; + __aligned_u64 npages; + __u32 flags; + __u32 hw_error; +}; + /** * struct iommu_hwpt_invalidate - ioctl(IOMMU_HWPT_INVALIDATE) * @size: sizeof(struct iommu_hwpt_invalidate) -- 2.34.1
[PATCH v10 08/10] iommu/vt-d: Convert stage-1 cache invalidation to return QI fault
From: Lu Baolu This makes the pasid based cache invalidation and device TLB invalidation to return QI faults to callers. This is needed when usersapce invalidates cache after modifying the stage-1 page table used in nested translation. Hardware errors during invalidation should be reported to user. Reviewed-by: Kevin Tian Signed-off-by: Lu Baolu Signed-off-by: Yi Liu --- drivers/iommu/intel/dmar.c | 13 +++-- drivers/iommu/intel/iommu.c | 12 ++-- drivers/iommu/intel/iommu.h | 6 +++--- drivers/iommu/intel/pasid.c | 12 +++- drivers/iommu/intel/svm.c | 8 5 files changed, 27 insertions(+), 24 deletions(-) diff --git a/drivers/iommu/intel/dmar.c b/drivers/iommu/intel/dmar.c index c3251c36c567..d9e38ff34437 100644 --- a/drivers/iommu/intel/dmar.c +++ b/drivers/iommu/intel/dmar.c @@ -1529,7 +1529,7 @@ void qi_flush_iotlb(struct intel_iommu *iommu, u16 did, u64 addr, } void qi_flush_dev_iotlb(struct intel_iommu *iommu, u16 sid, u16 pfsid, - u16 qdep, u64 addr, unsigned mask) + u16 qdep, u64 addr, unsigned mask, u32 *fault) { struct qi_desc desc; @@ -1556,12 +1556,12 @@ void qi_flush_dev_iotlb(struct intel_iommu *iommu, u16 sid, u16 pfsid, desc.qw2 = 0; desc.qw3 = 0; - qi_submit_sync(iommu, , 1, 0, NULL); + qi_submit_sync(iommu, , 1, 0, fault); } /* PASID-based IOTLB invalidation */ void qi_flush_piotlb(struct intel_iommu *iommu, u16 did, u32 pasid, u64 addr, -unsigned long npages, bool ih) +unsigned long npages, bool ih, u32 *fault) { struct qi_desc desc = {.qw2 = 0, .qw3 = 0}; @@ -1597,12 +1597,13 @@ void qi_flush_piotlb(struct intel_iommu *iommu, u16 did, u32 pasid, u64 addr, QI_EIOTLB_AM(mask); } - qi_submit_sync(iommu, , 1, 0, NULL); + qi_submit_sync(iommu, , 1, 0, fault); } /* PASID-based device IOTLB Invalidate */ void qi_flush_dev_iotlb_pasid(struct intel_iommu *iommu, u16 sid, u16 pfsid, - u32 pasid, u16 qdep, u64 addr, unsigned int size_order) + u32 pasid, u16 qdep, u64 addr, + unsigned int size_order, u32 *fault) { unsigned long mask = 1UL << (VTD_PAGE_SHIFT + 
size_order - 1); struct qi_desc desc = {.qw1 = 0, .qw2 = 0, .qw3 = 0}; @@ -1650,7 +1651,7 @@ void qi_flush_dev_iotlb_pasid(struct intel_iommu *iommu, u16 sid, u16 pfsid, desc.qw1 |= QI_DEV_EIOTLB_SIZE; } - qi_submit_sync(iommu, , 1, 0, NULL); + qi_submit_sync(iommu, , 1, 0, fault); } void qi_flush_pasid_cache(struct intel_iommu *iommu, u16 did, diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c index 897159dba47d..68e494f1d03a 100644 --- a/drivers/iommu/intel/iommu.c +++ b/drivers/iommu/intel/iommu.c @@ -1462,7 +1462,7 @@ static void __iommu_flush_dev_iotlb(struct device_domain_info *info, sid = info->bus << 8 | info->devfn; qdep = info->ats_qdep; qi_flush_dev_iotlb(info->iommu, sid, info->pfsid, - qdep, addr, mask); + qdep, addr, mask, NULL); quirk_extra_dev_tlb_flush(info, addr, mask, IOMMU_NO_PASID, qdep); } @@ -1490,7 +1490,7 @@ static void iommu_flush_dev_iotlb(struct dmar_domain *domain, PCI_DEVID(info->bus, info->devfn), info->pfsid, dev_pasid->pasid, info->ats_qdep, addr, -mask); +mask, NULL); } spin_unlock_irqrestore(>lock, flags); } @@ -1505,10 +1505,10 @@ static void domain_flush_pasid_iotlb(struct intel_iommu *iommu, spin_lock_irqsave(>lock, flags); list_for_each_entry(dev_pasid, >dev_pasids, link_domain) - qi_flush_piotlb(iommu, did, dev_pasid->pasid, addr, npages, ih); + qi_flush_piotlb(iommu, did, dev_pasid->pasid, addr, npages, ih, NULL); if (!list_empty(>devices)) - qi_flush_piotlb(iommu, did, IOMMU_NO_PASID, addr, npages, ih); + qi_flush_piotlb(iommu, did, IOMMU_NO_PASID, addr, npages, ih, NULL); spin_unlock_irqrestore(>lock, flags); } @@ -5195,10 +5195,10 @@ void quirk_extra_dev_tlb_flush(struct device_domain_info *info, sid = PCI_DEVID(info->bus, info->devfn); if (pasid == IOMMU_NO_PASID) { qi_flush_dev_iotlb(info->iommu, sid, info->pfsid, - qdep, address, mask); + qdep, address, mask, NULL); } else { qi_flush_dev_iotlb_pasid(info->iommu, sid, info->pfsid, -pasid, qdep, address, mask); +pasid, qdep, address, mask,
[PATCH v10 07/10] iommu/vt-d: Allow qi_submit_sync() to return the QI faults
From: Lu Baolu This allows qi_submit_sync() to return back faults to callers. qi_submit_sync() has a retry when timeout error happens, thus it cannot return timeout error back to the caller. This was discussed in a separate thread [1]. Here we keep it intact and just make sure no retry for the newly added user domain cache invalidation by checking if the caller is interested in the fault or not. [1] https://lore.kernel.org/all/20231228001646.587653-6-haifeng.z...@linux.intel.com/ Reviewed-by: Kevin Tian Signed-off-by: Lu Baolu Signed-off-by: Yi Liu --- drivers/iommu/intel/dmar.c | 35 +++-- drivers/iommu/intel/iommu.h | 2 +- drivers/iommu/intel/irq_remapping.c | 2 +- drivers/iommu/intel/pasid.c | 2 +- drivers/iommu/intel/svm.c | 6 ++--- 5 files changed, 29 insertions(+), 18 deletions(-) diff --git a/drivers/iommu/intel/dmar.c b/drivers/iommu/intel/dmar.c index 23cb80d62a9a..c3251c36c567 100644 --- a/drivers/iommu/intel/dmar.c +++ b/drivers/iommu/intel/dmar.c @@ -1267,7 +1267,8 @@ static void qi_dump_fault(struct intel_iommu *iommu, u32 fault) (unsigned long long)desc->qw1); } -static int qi_check_fault(struct intel_iommu *iommu, int index, int wait_index) +static int qi_check_fault(struct intel_iommu *iommu, int index, + int wait_index, u32 *fsts) { u32 fault; int head, tail; @@ -1278,8 +1279,12 @@ static int qi_check_fault(struct intel_iommu *iommu, int index, int wait_index) return -EAGAIN; fault = readl(iommu->reg + DMAR_FSTS_REG); - if (fault & (DMA_FSTS_IQE | DMA_FSTS_ITE | DMA_FSTS_ICE)) + fault &= DMA_FSTS_IQE | DMA_FSTS_ITE | DMA_FSTS_ICE; + if (fault) { + if (fsts) + *fsts |= fault; qi_dump_fault(iommu, fault); + } /* * If IQE happens, the head points to the descriptor associated @@ -1324,8 +1329,9 @@ static int qi_check_fault(struct intel_iommu *iommu, int index, int wait_index) head = (head - 2 + QI_LENGTH) % QI_LENGTH; } while (head != tail); + /* No need to retry if the caller is interested in the timeout error */ if (qi->desc_status[wait_index] == 
QI_ABORT) - return -EAGAIN; + return fsts ? -ETIMEDOUT : -EAGAIN; } if (fault & DMA_FSTS_ICE) { @@ -1342,9 +1348,11 @@ static int qi_check_fault(struct intel_iommu *iommu, int index, int wait_index) * time, a wait descriptor will be appended to each submission to ensure * hardware has completed the invalidation before return. Wait descriptors * can be part of the submission but it will not be polled for completion. + * If callers are interested in the QI faults that occur during the handling + * of requests, the QI faults are saved in @fault. */ int qi_submit_sync(struct intel_iommu *iommu, struct qi_desc *desc, - unsigned int count, unsigned long options) + unsigned int count, unsigned long options, u32 *fault) { struct q_inval *qi = iommu->qi; s64 devtlb_start_ktime = 0; @@ -1422,6 +1430,9 @@ int qi_submit_sync(struct intel_iommu *iommu, struct qi_desc *desc, */ writel(qi->free_head << shift, iommu->reg + DMAR_IQT_REG); + if (fault) + *fault = 0; + while (qi->desc_status[wait_index] != QI_DONE) { /* * We will leave the interrupts disabled, to prevent interrupt @@ -1430,7 +1441,7 @@ int qi_submit_sync(struct intel_iommu *iommu, struct qi_desc *desc, * a deadlock where the interrupt context can wait indefinitely * for free slots in the queue. 
*/ - rc = qi_check_fault(iommu, index, wait_index); + rc = qi_check_fault(iommu, index, wait_index, fault); if (rc) break; @@ -1476,7 +1487,7 @@ void qi_global_iec(struct intel_iommu *iommu) desc.qw3 = 0; /* should never fail */ - qi_submit_sync(iommu, , 1, 0); + qi_submit_sync(iommu, , 1, 0, NULL); } void qi_flush_context(struct intel_iommu *iommu, u16 did, u16 sid, u8 fm, @@ -1490,7 +1501,7 @@ void qi_flush_context(struct intel_iommu *iommu, u16 did, u16 sid, u8 fm, desc.qw2 = 0; desc.qw3 = 0; - qi_submit_sync(iommu, , 1, 0); + qi_submit_sync(iommu, , 1, 0, NULL); } void qi_flush_iotlb(struct intel_iommu *iommu, u16 did, u64 addr, @@ -1514,7 +1525,7 @@ void qi_flush_iotlb(struct intel_iommu *iommu, u16 did, u64 addr, desc.qw2 = 0; desc.qw3 = 0; - qi_submit_sync(iommu, , 1, 0); + qi_submit_sync(iommu, , 1, 0, NULL); } void qi_flush_dev_iotlb(struct intel_iommu *iommu, u16 sid, u16 pfsid, @@ -1545,7 +1556,7 @@ void qi_flush_dev_iotlb(struct intel_iommu *iommu,
[PATCH v10 06/10] iommufd/selftest: Add coverage for IOMMU_HWPT_INVALIDATE ioctl
From: Nicolin Chen Add test cases for the IOMMU_HWPT_INVALIDATE ioctl and verify it by using the new IOMMU_TEST_OP_MD_CHECK_IOTLB. Reviewed-by: Kevin Tian Signed-off-by: Nicolin Chen Co-developed-by: Yi Liu Signed-off-by: Yi Liu --- tools/testing/selftests/iommu/iommufd.c | 186 ++ tools/testing/selftests/iommu/iommufd_utils.h | 33 2 files changed, 219 insertions(+) diff --git a/tools/testing/selftests/iommu/iommufd.c b/tools/testing/selftests/iommu/iommufd.c index c8763b880a16..ff0aa77911f3 100644 --- a/tools/testing/selftests/iommu/iommufd.c +++ b/tools/testing/selftests/iommu/iommufd.c @@ -116,6 +116,7 @@ TEST_F(iommufd, cmd_length) TEST_LENGTH(iommu_destroy, IOMMU_DESTROY, id); TEST_LENGTH(iommu_hw_info, IOMMU_GET_HW_INFO, __reserved); TEST_LENGTH(iommu_hwpt_alloc, IOMMU_HWPT_ALLOC, __reserved); + TEST_LENGTH(iommu_hwpt_invalidate, IOMMU_HWPT_INVALIDATE, __reserved); TEST_LENGTH(iommu_ioas_alloc, IOMMU_IOAS_ALLOC, out_ioas_id); TEST_LENGTH(iommu_ioas_iova_ranges, IOMMU_IOAS_IOVA_RANGES, out_iova_alignment); @@ -271,7 +272,9 @@ TEST_F(iommufd_ioas, alloc_hwpt_nested) struct iommu_hwpt_selftest data = { .iotlb = IOMMU_TEST_IOTLB_DEFAULT, }; + struct iommu_hwpt_invalidate_selftest inv_reqs[2] = {}; uint32_t nested_hwpt_id[2] = {}; + uint32_t num_inv; uint32_t parent_hwpt_id = 0; uint32_t parent_hwpt_id_not_work = 0; uint32_t test_hwpt_id = 0; @@ -344,6 +347,189 @@ TEST_F(iommufd_ioas, alloc_hwpt_nested) EXPECT_ERRNO(EBUSY, _test_ioctl_destroy(self->fd, parent_hwpt_id)); + /* hwpt_invalidate only supports a user-managed hwpt (nested) */ + num_inv = 1; + test_err_hwpt_invalidate(ENOENT, parent_hwpt_id, inv_reqs, +IOMMU_HWPT_INVALIDATE_DATA_SELFTEST, +sizeof(*inv_reqs), _inv); + assert(!num_inv); + + /* Check data_type by passing zero-length array */ + num_inv = 0; + test_cmd_hwpt_invalidate(nested_hwpt_id[0], inv_reqs, +IOMMU_HWPT_INVALIDATE_DATA_SELFTEST, +sizeof(*inv_reqs), _inv); + assert(!num_inv); + + /* Negative test: Invalid data_type */ + num_inv = 1; + 
test_err_hwpt_invalidate(EINVAL, nested_hwpt_id[0], inv_reqs, + IOMMU_HWPT_INVALIDATE_DATA_SELFTEST_INVALID, +sizeof(*inv_reqs), _inv); + assert(!num_inv); + + /* Negative test: structure size sanity */ + num_inv = 1; + test_err_hwpt_invalidate(EINVAL, nested_hwpt_id[0], inv_reqs, +IOMMU_HWPT_INVALIDATE_DATA_SELFTEST, +sizeof(*inv_reqs) + 1, _inv); + assert(!num_inv); + + num_inv = 1; + test_err_hwpt_invalidate(EINVAL, nested_hwpt_id[0], inv_reqs, +IOMMU_HWPT_INVALIDATE_DATA_SELFTEST, +1, _inv); + assert(!num_inv); + + /* Negative test: invalid flag is passed */ + num_inv = 1; + inv_reqs[0].flags = 0x; + test_err_hwpt_invalidate(EOPNOTSUPP, nested_hwpt_id[0], inv_reqs, +IOMMU_HWPT_INVALIDATE_DATA_SELFTEST, +sizeof(*inv_reqs), _inv); + assert(!num_inv); + + /* Negative test: non-zero __reserved is passed */ + num_inv = 1; + inv_reqs[0].flags = 0; + inv_reqs[0].__reserved = 0x1234; + test_err_hwpt_invalidate(EOPNOTSUPP, nested_hwpt_id[0], inv_reqs, +IOMMU_HWPT_INVALIDATE_DATA_SELFTEST, +sizeof(*inv_reqs), _inv); + assert(!num_inv); + + /* Negative test: invalid data_uptr when array is not empty */ + num_inv = 1; + inv_reqs[0].flags = 0; + test_err_hwpt_invalidate(EINVAL, nested_hwpt_id[0], NULL, +IOMMU_HWPT_INVALIDATE_DATA_SELFTEST, +sizeof(*inv_reqs), _inv); + assert(!num_inv); + + /* Negative test: invalid entry_len when array is not empty */ + num_inv = 1; + inv_reqs[0].flags = 0; + test_err_hwpt_invalidate(EINVAL, nested_hwpt_id[0], inv_reqs, +IOMMU_HWPT_INVALIDATE_DATA_SELFTEST, +
[PATCH v10 05/10] iommufd/selftest: Add IOMMU_TEST_OP_MD_CHECK_IOTLB test op
From: Nicolin Chen Allow testing whether the IOTLB has been invalidated or not. Reviewed-by: Kevin Tian Signed-off-by: Nicolin Chen Signed-off-by: Yi Liu --- drivers/iommu/iommufd/iommufd_test.h | 5 drivers/iommu/iommufd/selftest.c | 26 +++ tools/testing/selftests/iommu/iommufd.c | 4 +++ tools/testing/selftests/iommu/iommufd_utils.h | 24 + 4 files changed, 59 insertions(+) diff --git a/drivers/iommu/iommufd/iommufd_test.h b/drivers/iommu/iommufd/iommufd_test.h index 2eef5afde711..1cedd6b5ba2b 100644 --- a/drivers/iommu/iommufd/iommufd_test.h +++ b/drivers/iommu/iommufd/iommufd_test.h @@ -21,6 +21,7 @@ enum { IOMMU_TEST_OP_ACCESS_REPLACE_IOAS, IOMMU_TEST_OP_MOCK_DOMAIN_FLAGS, IOMMU_TEST_OP_DIRTY, + IOMMU_TEST_OP_MD_CHECK_IOTLB, }; enum { @@ -121,6 +122,10 @@ struct iommu_test_cmd { __aligned_u64 uptr; __aligned_u64 out_nr_dirty; } dirty; + struct { + __u32 id; + __u32 iotlb; + } check_iotlb; }; __u32 last; }; diff --git a/drivers/iommu/iommufd/selftest.c b/drivers/iommu/iommufd/selftest.c index ebc6c15abf67..9528528cab27 100644 --- a/drivers/iommu/iommufd/selftest.c +++ b/drivers/iommu/iommufd/selftest.c @@ -853,6 +853,28 @@ static int iommufd_test_md_check_refs(struct iommufd_ucmd *ucmd, return 0; } +static int iommufd_test_md_check_iotlb(struct iommufd_ucmd *ucmd, + u32 mockpt_id, unsigned int iotlb_id, + u32 iotlb) +{ + struct mock_iommu_domain_nested *mock_nested; + struct iommufd_hw_pagetable *hwpt; + int rc = 0; + + hwpt = get_md_pagetable_nested(ucmd, mockpt_id, &mock_nested); + if (IS_ERR(hwpt)) + return PTR_ERR(hwpt); + + mock_nested = container_of(hwpt->domain, + struct mock_iommu_domain_nested, domain); + + if (iotlb_id > MOCK_NESTED_DOMAIN_IOTLB_ID_MAX || + mock_nested->iotlb[iotlb_id] != iotlb) + rc = -EINVAL; + iommufd_put_object(ucmd->ictx, &hwpt->obj); + return rc; +} + struct selftest_access { struct iommufd_access *access; struct file *file; @@ -1334,6 +1356,10 @@ int iommufd_test(struct iommufd_ucmd *ucmd) return iommufd_test_md_check_refs( ucmd, 
u64_to_user_ptr(cmd->check_refs.uptr), cmd->check_refs.length, cmd->check_refs.refs); + case IOMMU_TEST_OP_MD_CHECK_IOTLB: + return iommufd_test_md_check_iotlb(ucmd, cmd->id, + cmd->check_iotlb.id, + cmd->check_iotlb.iotlb); case IOMMU_TEST_OP_CREATE_ACCESS: return iommufd_test_create_access(ucmd, cmd->id, cmd->create_access.flags); diff --git a/tools/testing/selftests/iommu/iommufd.c b/tools/testing/selftests/iommu/iommufd.c index 6ed328c863c4..c8763b880a16 100644 --- a/tools/testing/selftests/iommu/iommufd.c +++ b/tools/testing/selftests/iommu/iommufd.c @@ -330,6 +330,10 @@ TEST_F(iommufd_ioas, alloc_hwpt_nested) _hwpt_id[1], IOMMU_HWPT_DATA_SELFTEST, , sizeof(data)); + test_cmd_hwpt_check_iotlb_all(nested_hwpt_id[0], + IOMMU_TEST_IOTLB_DEFAULT); + test_cmd_hwpt_check_iotlb_all(nested_hwpt_id[1], + IOMMU_TEST_IOTLB_DEFAULT); /* Negative test: a nested hwpt on top of a nested hwpt */ test_err_hwpt_alloc_nested(EINVAL, self->device_id, diff --git a/tools/testing/selftests/iommu/iommufd_utils.h b/tools/testing/selftests/iommu/iommufd_utils.h index ad9202335656..fe0a0f566b67 100644 --- a/tools/testing/selftests/iommu/iommufd_utils.h +++ b/tools/testing/selftests/iommu/iommufd_utils.h @@ -195,6 +195,30 @@ static int _test_cmd_hwpt_alloc(int fd, __u32 device_id, __u32 pt_id, _test_cmd_hwpt_alloc(self->fd, device_id, pt_id, flags, \ hwpt_id, data_type, data, data_len)) +#define test_cmd_hwpt_check_iotlb(hwpt_id, iotlb_id, expected) \ + ({ \ + struct iommu_test_cmd test_cmd = { \ + .size = sizeof(test_cmd), \ + .op = IOMMU_TEST_OP_MD_CHECK_IOTLB,\ + .id = hwpt_id, \ + .check_iotlb = {
[PATCH v10 04/10] iommufd/selftest: Add mock_domain_cache_invalidate_user support
From: Nicolin Chen Add mock_domain_cache_invalidate_user() data structure to support user space selftest program to cover user cache invalidation pathway. Reviewed-by: Kevin Tian Signed-off-by: Nicolin Chen Co-developed-by: Yi Liu Signed-off-by: Yi Liu --- drivers/iommu/iommufd/iommufd_test.h | 34 drivers/iommu/iommufd/selftest.c | 60 2 files changed, 94 insertions(+) diff --git a/drivers/iommu/iommufd/iommufd_test.h b/drivers/iommu/iommufd/iommufd_test.h index 7910fbe1962d..2eef5afde711 100644 --- a/drivers/iommu/iommufd/iommufd_test.h +++ b/drivers/iommu/iommufd/iommufd_test.h @@ -148,4 +148,38 @@ struct iommu_hwpt_selftest { __u32 iotlb; }; +/* Should not be equal to any defined value in enum iommu_hwpt_invalidate_data_type */ +#define IOMMU_HWPT_INVALIDATE_DATA_SELFTEST 0xdeadbeef +#define IOMMU_HWPT_INVALIDATE_DATA_SELFTEST_INVALID 0xdadbeef + +/** + * enum iommu_hwpt_invalidate_selftest_error - Hardware error of invalidation + * @IOMMU_TEST_INVALIDATE_FAKE_ERROR: Fake hw error per test program's request + */ +enum iommu_hwpt_invalidate_selftest_error { + IOMMU_TEST_INVALIDATE_FAKE_ERROR = (1 << 0) +}; + +/** + * struct iommu_hwpt_invalidate_selftest - Invalidation data for Mock driver + * (IOMMU_HWPT_INVALIDATE_DATA_SELFTEST) + * @flags: Invalidate flags + * @iotlb_id: Invalidate iotlb entry index + * @hw_error: One of enum iommu_hwpt_invalidate_selftest_error + * @__reserved: Must be 0 + * + * If IOMMU_TEST_INVALIDATE_ALL is set in @flags, @iotlb_id will be ignored + * @hw_error meaningful only if the request is processed successfully. + * If IOMMU_TEST_INVALIDATE_FLAG_TRIGGER_ERROR is set in @flags, report a + * hw error back, cache would not be invalidated in this case. 
+ */ +struct iommu_hwpt_invalidate_selftest { +#define IOMMU_TEST_INVALIDATE_FLAG_ALL (1 << 0) +#define IOMMU_TEST_INVALIDATE_FLAG_TRIGGER_ERROR (1 << 1) + __u32 flags; + __u32 iotlb_id; + __u32 hw_error; + __u32 __reserved; +}; + #endif diff --git a/drivers/iommu/iommufd/selftest.c b/drivers/iommu/iommufd/selftest.c index 022ef8f55088..ebc6c15abf67 100644 --- a/drivers/iommu/iommufd/selftest.c +++ b/drivers/iommu/iommufd/selftest.c @@ -473,9 +473,69 @@ static void mock_domain_free_nested(struct iommu_domain *domain) kfree(mock_nested); } +static int +mock_domain_cache_invalidate_user(struct iommu_domain *domain, + struct iommu_user_data_array *array) +{ + struct mock_iommu_domain_nested *mock_nested = + container_of(domain, struct mock_iommu_domain_nested, domain); + u32 hw_error = 0, processed = 0; + struct iommu_hwpt_invalidate_selftest inv; + int i = 0, j; + int rc = 0; + + if (array->type != IOMMU_HWPT_INVALIDATE_DATA_SELFTEST) { + rc = -EINVAL; + goto out; + } + + for ( ; i < array->entry_num; i++) { + rc = iommu_copy_struct_from_user_array(&inv, array, + IOMMU_HWPT_INVALIDATE_DATA_SELFTEST, + i, __reserved); + if (rc) + break; + + if ((inv.flags & ~(IOMMU_TEST_INVALIDATE_FLAG_ALL | + IOMMU_TEST_INVALIDATE_FLAG_TRIGGER_ERROR)) || + inv.__reserved) { + rc = -EOPNOTSUPP; + break; + } + + if (inv.iotlb_id > MOCK_NESTED_DOMAIN_IOTLB_ID_MAX) { + rc = -EINVAL; + break; + } + + if (inv.flags & IOMMU_TEST_INVALIDATE_FLAG_TRIGGER_ERROR) { + hw_error = IOMMU_TEST_INVALIDATE_FAKE_ERROR; + } else if (inv.flags & IOMMU_TEST_INVALIDATE_FLAG_ALL) { + /* Invalidate all mock iotlb entries and ignore iotlb_id */ + for (j = 0; j < MOCK_NESTED_DOMAIN_IOTLB_NUM; j++) + mock_nested->iotlb[j] = 0; + } else { + mock_nested->iotlb[inv.iotlb_id] = 0; + } + + inv.hw_error = hw_error; + rc = iommu_respond_struct_to_user_array(array, i, (void *)&inv, + sizeof(inv)); + if (rc) + break; + + processed++; + } + +out: + array->entry_num = processed; + return rc; +} + static struct iommu_domain_ops 
domain_nested_ops = { .free = mock_domain_free_nested, .attach_dev = mock_domain_nop_attach, + .cache_invalidate_user = mock_domain_cache_invalidate_user, }; static inline struct iommufd_hw_pagetable * -- 2.34.1
[PATCH v10 03/10] iommu: Add iommu_copy_struct_from_user_array helper
From: Nicolin Chen Wrap up the data pointer/num sanity and __iommu_copy_struct_from_user call for iommu drivers to copy driver specific data at a specific location in the struct iommu_user_data_array, and iommu_respond_struct_to_user_array() to copy response to a specific location in the struct iommu_user_data_array. And expect it to be used in cache_invalidate_user ops for example. Reviewed-by: Kevin Tian Signed-off-by: Nicolin Chen Co-developed-by: Yi Liu Signed-off-by: Yi Liu --- include/linux/iommu.h | 74 +++ 1 file changed, 74 insertions(+) diff --git a/include/linux/iommu.h b/include/linux/iommu.h index 93c0d12dd047..c3434c9eaa6d 100644 --- a/include/linux/iommu.h +++ b/include/linux/iommu.h @@ -341,6 +341,80 @@ static inline int __iommu_copy_struct_from_user( sizeof(*kdst), \ offsetofend(typeof(*kdst), min_last)) +/** + * __iommu_copy_struct_from_user_array - Copy iommu driver specific user space + * data from an iommu_user_data_array + * @dst_data: Pointer to an iommu driver specific user data that is defined in + *include/uapi/linux/iommufd.h + * @src_array: Pointer to a struct iommu_user_data_array for a user space array + * @data_type: The data type of the @dst_data. Must match with @src_array.type + * @index: Index to the location in the array to copy user data from + * @data_len: Length of current user data structure, i.e. sizeof(struct _dst) + * @min_len: Initial length of user data structure for backward compatibility. 
+ * This should be offsetofend using the last member in the user data + * struct that was initially added to include/uapi/linux/iommufd.h + */ +static inline int +__iommu_copy_struct_from_user_array(void *dst_data, + const struct iommu_user_data_array *src_array, + unsigned int data_type, unsigned int index, + size_t data_len, size_t min_len) +{ + struct iommu_user_data src_data; + + if (WARN_ON(!src_array || index >= src_array->entry_num)) + return -EINVAL; + if (!src_array->entry_num) + return -EINVAL; + src_data.uptr = src_array->uptr + src_array->entry_len * index; + src_data.len = src_array->entry_len; + src_data.type = src_array->type; + + return __iommu_copy_struct_from_user(dst_data, &src_data, data_type, +data_len, min_len); +} + +/** + * iommu_copy_struct_from_user_array - Copy iommu driver specific user space + * data from an iommu_user_data_array + * @kdst: Pointer to an iommu driver specific user data that is defined in + *include/uapi/linux/iommufd.h + * @user_array: Pointer to a struct iommu_user_data_array for a user space + * array + * @data_type: The data type of the @kdst. Must match with @user_array->type + * @index: Index to the location in the array to copy user data from + * @min_last: The last member of the data structure @kdst points to in the + *initial version. + * Return 0 for success, otherwise -error. + */ +#define iommu_copy_struct_from_user_array(kdst, user_array, data_type, \ + index, min_last)\ + __iommu_copy_struct_from_user_array(kdst, user_array, data_type, \ + index, sizeof(*kdst),\ + offsetofend(typeof(*kdst), \ + min_last)) + +/** + * iommu_respond_struct_to_user_array - Copy the response in @ksrc back to + * a specific entry of user array + * @user_array: Pointer to a struct iommu_user_data_array for a user space + * array + * @index: Index to the location in the array to copy response + * @ksrc: Pointer to kernel structure + * @klen: Length of @ksrc struct + * + * This only copies the response of one entry (@index) in @user_array. 
+ */ +static inline int +iommu_respond_struct_to_user_array(const struct iommu_user_data_array *array, + unsigned int index, void *ksrc, size_t klen) +{ + if (copy_to_user(array->uptr + array->entry_len * index, +ksrc, min_t(size_t, array->entry_len, klen))) + return -EFAULT; + return 0; +} + /** * struct iommu_ops - iommu ops and capabilities * @capable: check capability -- 2.34.1
[PATCH v10 02/10] iommufd: Add IOMMU_HWPT_INVALIDATE
In nested translation, the stage-1 page table is user-managed but cached by the IOMMU hardware, so an update on present page table entries in the stage-1 page table should be followed by a cache invalidation. Add an IOMMU_HWPT_INVALIDATE ioctl to support such a cache invalidation. It takes hwpt_id to specify the iommu_domain, and a multi-entry array to support multiple invalidation data in one ioctl. enum iommu_hwpt_invalidate_data_type is defined to tag the data type of the entries in the multi-entry array. Reviewed-by: Kevin Tian Co-developed-by: Nicolin Chen Signed-off-by: Nicolin Chen Signed-off-by: Yi Liu --- drivers/iommu/iommufd/hw_pagetable.c| 41 +++ drivers/iommu/iommufd/iommufd_private.h | 10 ++ drivers/iommu/iommufd/main.c| 3 ++ include/uapi/linux/iommufd.h| 43 + 4 files changed, 97 insertions(+) diff --git a/drivers/iommu/iommufd/hw_pagetable.c b/drivers/iommu/iommufd/hw_pagetable.c index cbb5df0a6c32..4e8711f19f72 100644 --- a/drivers/iommu/iommufd/hw_pagetable.c +++ b/drivers/iommu/iommufd/hw_pagetable.c @@ -371,3 +371,44 @@ int iommufd_hwpt_get_dirty_bitmap(struct iommufd_ucmd *ucmd) iommufd_put_object(ucmd->ictx, &hwpt_paging->common.obj); return rc; } + +int iommufd_hwpt_invalidate(struct iommufd_ucmd *ucmd) +{ + struct iommu_hwpt_invalidate *cmd = ucmd->cmd; + struct iommu_user_data_array data_array = { + .type = cmd->data_type, + .uptr = u64_to_user_ptr(cmd->data_uptr), + .entry_len = cmd->entry_len, + .entry_num = cmd->entry_num, + }; + struct iommufd_hw_pagetable *hwpt; + u32 done_num = 0; + int rc; + + if (cmd->__reserved) { + rc = -EOPNOTSUPP; + goto out; + } + + if (cmd->entry_num && (!cmd->data_uptr || !cmd->entry_len)) { + rc = -EINVAL; + goto out; + } + + hwpt = iommufd_get_hwpt_nested(ucmd, cmd->hwpt_id); + if (IS_ERR(hwpt)) { + rc = PTR_ERR(hwpt); + goto out; + } + + rc = hwpt->domain->ops->cache_invalidate_user(hwpt->domain, + &data_array); + done_num = data_array.entry_num; + + iommufd_put_object(ucmd->ictx, &hwpt->obj); +out: + cmd->entry_num = 
done_num; + if (iommufd_ucmd_respond(ucmd, sizeof(*cmd))) + return -EFAULT; + return rc; +} diff --git a/drivers/iommu/iommufd/iommufd_private.h b/drivers/iommu/iommufd/iommufd_private.h index abae041e256f..991f864d1f9b 100644 --- a/drivers/iommu/iommufd/iommufd_private.h +++ b/drivers/iommu/iommufd/iommufd_private.h @@ -328,6 +328,15 @@ iommufd_get_hwpt_paging(struct iommufd_ucmd *ucmd, u32 id) IOMMUFD_OBJ_HWPT_PAGING), struct iommufd_hwpt_paging, common.obj); } + +static inline struct iommufd_hw_pagetable * +iommufd_get_hwpt_nested(struct iommufd_ucmd *ucmd, u32 id) +{ + return container_of(iommufd_get_object(ucmd->ictx, id, + IOMMUFD_OBJ_HWPT_NESTED), + struct iommufd_hw_pagetable, obj); +} + int iommufd_hwpt_set_dirty_tracking(struct iommufd_ucmd *ucmd); int iommufd_hwpt_get_dirty_bitmap(struct iommufd_ucmd *ucmd); @@ -345,6 +354,7 @@ void iommufd_hwpt_paging_abort(struct iommufd_object *obj); void iommufd_hwpt_nested_destroy(struct iommufd_object *obj); void iommufd_hwpt_nested_abort(struct iommufd_object *obj); int iommufd_hwpt_alloc(struct iommufd_ucmd *ucmd); +int iommufd_hwpt_invalidate(struct iommufd_ucmd *ucmd); static inline void iommufd_hw_pagetable_put(struct iommufd_ctx *ictx, struct iommufd_hw_pagetable *hwpt) diff --git a/drivers/iommu/iommufd/main.c b/drivers/iommu/iommufd/main.c index c9091e46d208..39b32932c61e 100644 --- a/drivers/iommu/iommufd/main.c +++ b/drivers/iommu/iommufd/main.c @@ -322,6 +322,7 @@ union ucmd_buffer { struct iommu_hw_info info; struct iommu_hwpt_alloc hwpt; struct iommu_hwpt_get_dirty_bitmap get_dirty_bitmap; + struct iommu_hwpt_invalidate cache; struct iommu_hwpt_set_dirty_tracking set_dirty_tracking; struct iommu_ioas_alloc alloc; struct iommu_ioas_allow_iovas allow_iovas; @@ -360,6 +361,8 @@ static const struct iommufd_ioctl_op iommufd_ioctl_ops[] = { __reserved), IOCTL_OP(IOMMU_HWPT_GET_DIRTY_BITMAP, iommufd_hwpt_get_dirty_bitmap, struct iommu_hwpt_get_dirty_bitmap, data), + IOCTL_OP(IOMMU_HWPT_INVALIDATE, 
iommufd_hwpt_invalidate, +struct iommu_hwpt_invalidate, __reserved), IOCTL_OP(IOMMU_HWPT_SET_DIRTY_TRACKING, iommufd_hwpt_set_dirty_tracking, struct iommu_hwpt_set_dirty_tracking, __reserved), IOCTL_OP(IOMMU_IOAS_ALLOC,
[PATCH v10 01/10] iommu: Add cache_invalidate_user op
From: Lu Baolu The updates of the PTEs in the nested page table will be propagated to the hardware caches. Add a new domain op cache_invalidate_user for the userspace to flush the hardware caches for a nested domain through iommufd. No wrapper for it, as it's only supposed to be used by iommufd. Then, pass in invalidation requests in the form of a user data array containing a number of invalidation data entries. Signed-off-by: Lu Baolu Reviewed-by: Kevin Tian Signed-off-by: Nicolin Chen Signed-off-by: Yi Liu --- include/linux/iommu.h | 26 ++ 1 file changed, 26 insertions(+) diff --git a/include/linux/iommu.h b/include/linux/iommu.h index 6291aa7b079b..93c0d12dd047 100644 --- a/include/linux/iommu.h +++ b/include/linux/iommu.h @@ -284,6 +284,23 @@ struct iommu_user_data { size_t len; }; +/** + * struct iommu_user_data_array - iommu driver specific user space data array + * @type: The data type of all the entries in the user buffer array + * @uptr: Pointer to the user buffer array + * @entry_len: The fixed-width length of an entry in the array, in bytes + * @entry_num: The number of total entries in the array + * + * The user buffer includes an array of requests with format defined in + * include/uapi/linux/iommufd.h + */ +struct iommu_user_data_array { + unsigned int type; + void __user *uptr; + size_t entry_len; + u32 entry_num; +}; + /** * __iommu_copy_struct_from_user - Copy iommu driver specific user space data * @dst_data: Pointer to an iommu driver specific user data that is defined in @@ -440,6 +457,13 @@ struct iommu_ops { * @iotlb_sync_map: Sync mappings created recently using @map to the hardware * @iotlb_sync: Flush all queued ranges from the hardware TLBs and empty flush *queue + * @cache_invalidate_user: Flush hardware cache for user space IO page table. + * The @domain must be IOMMU_DOMAIN_NESTED. The @array + * passes in the cache invalidation requests, in the form + * of a driver data structure. 
The driver must update + * array->entry_num to report the number of handled + * invalidation requests. The driver data structure + * must be defined in include/uapi/linux/iommufd.h * @iova_to_phys: translate iova to physical address * @enforce_cache_coherency: Prevent any kind of DMA from bypassing IOMMU_CACHE, * including no-snoop TLPs on PCIe or other platform @@ -465,6 +489,8 @@ struct iommu_domain_ops { size_t size); void (*iotlb_sync)(struct iommu_domain *domain, struct iommu_iotlb_gather *iotlb_gather); + int (*cache_invalidate_user)(struct iommu_domain *domain, +struct iommu_user_data_array *array); phys_addr_t (*iova_to_phys)(struct iommu_domain *domain, dma_addr_t iova); -- 2.34.1
[PATCH v10 00/10] Add iommufd nesting (part 2/2)
Nested translation is a hardware feature that is supported by many modern IOMMUs. It uses two stages (stage-1, stage-2) of address translation to get access to the physical address. The stage-1 translation table is owned by userspace (e.g. by a guest OS), while stage-2 is owned by the kernel. Changes to the stage-1 translation table should be followed by an IOTLB invalidation.

Take Intel VT-d as an example: the stage-1 translation table is the I/O page table. As the diagram below shows, the guest I/O page table pointer in GPA (guest physical address) is passed to the host and used to perform the stage-1 address translation. Along with it, modifications to present mappings in the guest I/O page table should be followed with an IOTLB invalidation.

    .-------------.  .---------------------------.
    |   vIOMMU    |  | Guest I/O page table      |
    |             |  '---------------------------'
    .----------------/
    | PASID Entry |--- PASID cache flush --+
    '-------------'                        |
    |             |                        V
    |             |       I/O page table pointer in GPA
    '-------------'
Guest
------| Shadow |--------------------------|---------
      v        v                          v
Host
    .-------------.  .------------------------.
    |   pIOMMU    |  | FS for GIOVA->GPA      |
    |             |  '------------------------'
    .----------------/  |
    | PASID Entry |     V (Nested xlate)
    '----------------\.----------------------------------.
    |             |   | SS for GPA->HPA, unmanaged domain|
    |             |   '----------------------------------'
    '-------------'

Where:
- FS = First stage page tables
- SS = Second stage page tables

This series is based on the first part, which was merged [1]; it adds the cache invalidation interface for the userspace to invalidate caches after modifying the stage-1 page table. This includes both the iommufd changes and the VT-d driver changes. Complete code can be found in [2]; QEMU code can be found in [3]. At last, this is a team work together with Nicolin Chen and Lu Baolu. Thanks to them for the help. ^_^ Looking forward to your feedback. 
[1] https://lore.kernel.org/linux-iommu/20231026044216.64964-1-yi.l@intel.com/ - merged [2] https://github.com/yiliu1765/iommufd/tree/iommufd_nesting [3] https://github.com/yiliu1765/qemu/tree/zhenzhong/wip/iommufd_nesting_rfcv1 Change log: v10: - Minor tweak to patch 07 (Kevin) - Rebase on top of 6.7-rc8 v9: https://lore.kernel.org/linux-iommu/20231228150629.13149-1-yi.l@intel.com/ - Add a test case which sets both IOMMU_TEST_INVALIDATE_FLAG_ALL and IOMMU_TEST_INVALIDATE_FLAG_TRIGGER_ERROR in flags, and expect to succeed and see an 'error'. (Kevin) - Returns -ETIMEOUT in qi_check_fault() if caller is interested with the fault when timeout happens. If not, the qi_submit_sync() will keep retry hence unable to report the error back to user. For now, only the user cache invalidation path has interest on the time out error. So this change only affects the user cache invalidation path. Other path will still hang in qi_submit_sync() when timeout happens. (Kevin) v8: https://lore.kernel.org/linux-iommu/20231227161354.67701-1-yi.l@intel.com/ - Pass invalidation hint to the cache invalidation helper in the cache_invalidate_user op path (Kevin) - Move the devTLB invalidation out of info->iommu loop (Kevin, Weijiang) - Clear *fault per restart in qi_submit_sync() to avoid acroos submission error accumulation. (Kevin) - Define the vtd cache invalidation uapi structure in separate patch (Kevin) - Rename inv_error to be hw_error (Kevin) - Rename 'reqs_uptr', 'req_type', 'req_len' and 'req_num' to be 'data_uptr', 'data_type', "entry_len' and 'entry_num" (Kevin) - Allow user to set IOMMU_TEST_INVALIDATE_FLAG_ALL and IOMMU_TEST_INVALIDATE_FLAG_TRIGGER_ERROR in the same time (Kevin) v7: https://lore.kernel.org/linux-iommu/20231221153948.119007-1-yi.l@intel.com/ - Remove domain->ops->cache_invalidate_user check in hwpt alloc path due to failure in bisect (Baolu) - Remove out_driver_error_code from struct iommu_hwpt_invalidate after discussion in v6. 
Should expect per-entry error code. - Rework the selftest cache invalidation part to report a per-entry error - Allow user to pass in an empty array to have a try-and-fail mechanism for user to check if a given req_type is supported by the kernel (Jason) - Define a separate enum type for cache invalidation data (Jason) - Fix the IOMMU_HWPT_INVALIDATE to always update the req_num field before returning (Nicolin) - Merge the VT-d nesting part 2/2 https://lore.kernel.org/linux-iommu/20231117131816.24359-1-yi.l@intel.com/ into this series to avoid defining empty enum in the middle of the series. The major difference is adding the VT-d related invalidation uapi structures together with the generic data structures in patch 02 of this
Re: [PATCH] kselftest: Add basic test for probing the rust sample modules
On 12/21/23 20:46, Miguel Ojeda wrote: > On Fri, Dec 15, 2023 at 2:21 PM Laura Nao > wrote: >> >> Add new basic kselftest that checks if the available rust sample >> modules >> can be added and removed correctly. >> >> Signed-off-by: Laura Nao > > Thanks Laura! > > Shuah: do you want that we pick this one? If so, your `Acked-by` would > be nice -- thanks! Otherwise, please feel free to pick it up. > > Cc'ing David too since it involves KTAP in case he has comments. > >> diff --git a/tools/testing/selftests/rust/Makefile >> b/tools/testing/selftests/rust/Makefile > > Missing SPDX line? (it can be added when picking it up, though). > Thanks for the feedback Miguel! >> +$(OUTPUT)/ktap_helpers.sh: >> + cp $(top_srcdir)/tools/testing/selftests/dt/ktap_helpers.sh >> $@ > > This may be something for another series, but should these helpers be > factored out perhaps / provided by the framework? Does it work > sourcing them from `dt` directly instead of copying meanwhile (to > simplify)? > >> +KSFT_PASS=0 >> +KSFT_FAIL=1 >> +KSFT_SKIP=4 > > Similarly, would it make sense for this kind of "common constants" be > factored somehow? Or does that not make sense (I see other tests also > define them "manually")? > Sourcing the file from the `dt` folder does work when running the test with `make -C tools/testing/selftests TARGETS=rust run_tests`, but fails when the test is installed with `make -C tools/testing/selftests TARGETS=rust install` and run with `./tools/testing/selftests/kselftest_install/run_kselftest.sh` (unless the dt kselftest is installed too). I agree factoring out the helpers might be a better solution. I sent a patch to move the ktap_helpers.sh file to the kselftest common directory, so that kselftests written in bash can make use of the helper functions more easily: https://lore.kernel.org/linux-kselftest/20240102141528.169947-1-laura@collabora.com/T/#u If that patch is merged first, I'll rework this one and send a v2 with the proper adjustments. Best, Laura
[PATCH] selftests: Move KTAP bash helpers to selftests common folder
Move bash helpers for outputting in KTAP format to the common selftests folder. This allows kselftests other than the dt one to source the file and make use of the helper functions. Define pass, fail and skip codes in the same file too. Signed-off-by: Laura Nao --- tools/testing/selftests/Makefile | 1 + tools/testing/selftests/dt/Makefile | 2 +- tools/testing/selftests/dt/test_unprobed_devices.sh | 6 +- tools/testing/selftests/{dt => kselftest}/ktap_helpers.sh | 6 ++ 4 files changed, 9 insertions(+), 6 deletions(-) rename tools/testing/selftests/{dt => kselftest}/ktap_helpers.sh (94%) diff --git a/tools/testing/selftests/Makefile b/tools/testing/selftests/Makefile index 3b2061d1c1a5..976e96013c91 100644 --- a/tools/testing/selftests/Makefile +++ b/tools/testing/selftests/Makefile @@ -251,6 +251,7 @@ ifdef INSTALL_PATH install -m 744 kselftest/module.sh $(INSTALL_PATH)/kselftest/ install -m 744 kselftest/runner.sh $(INSTALL_PATH)/kselftest/ install -m 744 kselftest/prefix.pl $(INSTALL_PATH)/kselftest/ + install -m 744 kselftest/ktap_helpers.sh $(INSTALL_PATH)/kselftest/ install -m 744 run_kselftest.sh $(INSTALL_PATH)/ rm -f $(TEST_LIST) @ret=1; \ diff --git a/tools/testing/selftests/dt/Makefile b/tools/testing/selftests/dt/Makefile index 62dc00ee4978..2d33ee9e9b71 100644 --- a/tools/testing/selftests/dt/Makefile +++ b/tools/testing/selftests/dt/Makefile @@ -4,7 +4,7 @@ ifneq ($(PY3),) TEST_PROGS := test_unprobed_devices.sh TEST_GEN_FILES := compatible_list -TEST_FILES := compatible_ignore_list ktap_helpers.sh +TEST_FILES := compatible_ignore_list include ../lib.mk diff --git a/tools/testing/selftests/dt/test_unprobed_devices.sh b/tools/testing/selftests/dt/test_unprobed_devices.sh index b07af2a4c4de..f2307ee443a6 100755 --- a/tools/testing/selftests/dt/test_unprobed_devices.sh +++ b/tools/testing/selftests/dt/test_unprobed_devices.sh @@ -15,16 +15,12 @@ DIR="$(dirname $(readlink -f "$0"))" -source "${DIR}"/ktap_helpers.sh +source 
"${DIR}"/../kselftest/ktap_helpers.sh PDT=/proc/device-tree/ COMPAT_LIST="${DIR}"/compatible_list IGNORE_LIST="${DIR}"/compatible_ignore_list -KSFT_PASS=0 -KSFT_FAIL=1 -KSFT_SKIP=4 - ktap_print_header if [[ ! -d "${PDT}" ]]; then diff --git a/tools/testing/selftests/dt/ktap_helpers.sh b/tools/testing/selftests/kselftest/ktap_helpers.sh similarity index 94% rename from tools/testing/selftests/dt/ktap_helpers.sh rename to tools/testing/selftests/kselftest/ktap_helpers.sh index 8dfae51bb4e2..dd79d96f3b5a 100644 --- a/tools/testing/selftests/dt/ktap_helpers.sh +++ b/tools/testing/selftests/kselftest/ktap_helpers.sh @@ -9,6 +9,12 @@ KTAP_CNT_PASS=0 KTAP_CNT_FAIL=0 KTAP_CNT_SKIP=0 +KSFT_PASS=0 +KSFT_FAIL=1 +KSFT_XFAIL=2 +KSFT_XPASS=3 +KSFT_SKIP=4 + ktap_print_header() { echo "TAP version 13" } -- 2.30.2
Re: [PATCH net-next 0/4] mptcp: add CurrEstab MIB counter
Hello: This series was applied to netdev/net-next.git (main) by David S. Miller : On Fri, 22 Dec 2023 13:47:21 +0100 you wrote: > This MIB counter is similar to the one of TCP -- CurrEstab -- available > in /proc/net/snmp. This is useful to quickly list the number of MPTCP > connections without having to iterate over all of them. > > Patch 1 prepares its support by adding new helper functions: > > - MPTCP_DEC_STATS(): similar to MPTCP_INC_STATS(), but this time to >decrement a counter. > > [...] Here is the summary with links: - [net-next,1/4] mptcp: add CurrEstab MIB counter support https://git.kernel.org/netdev/net-next/c/d9cd27b8cd19 - [net-next,2/4] mptcp: use mptcp_set_state (no matching commit) - [net-next,3/4] selftests: mptcp: join: check CURRESTAB counters https://git.kernel.org/netdev/net-next/c/0bd962dd86b2 - [net-next,4/4] selftests: mptcp: diag: check CURRESTAB counters https://git.kernel.org/netdev/net-next/c/81ab772819da You are awesome, thank you! -- Deet-doot-dot, I am a bot. https://korg.docs.kernel.org/patchwork/pwbot.html
Re: [PATCH net-next 0/2] selftest/net: Some more TCP-AO selftest post-merge fixups
Hello: This series was applied to netdev/net-next.git (main) by David S. Miller : On Fri, 22 Dec 2023 01:59:05 + you wrote: > Note that there's another post-merge fix for TCP-AO selftests, but that > doesn't conflict with these, so I don't resend that: > > https://lore.kernel.org/all/20231219-b4-tcp-ao-selftests-out-of-tree-v1-1-0fff92d26...@arista.com/T/#u > > Signed-off-by: Dmitry Safonov > > [...] Here is the summary with links: - [net-next,1/2] selftest/tcp-ao: Set routes in a proper VRF table id https://git.kernel.org/netdev/net-next/c/72cd9f8d5a99 - [net-next,2/2] selftest/tcp-ao: Work on namespace-ified sysctl_optmem_max https://git.kernel.org/netdev/net-next/c/80057b2080a8 You are awesome, thank you! -- Deet-doot-dot, I am a bot. https://korg.docs.kernel.org/patchwork/pwbot.html
[PATCH net-next v2 3/3] selftests/net: fix GRO coalesce test and add ext header coalesce tests
Currently there is no test which checks that IPv6 extension header packets successfully coalesce. This commit adds a test, which verifies two IPv6 packets with HBH extension headers do coalesce, and another test which checks that packets with different extension header data do not coalesce in GRO. I changed the receive socket filter to accept a packet with one extension header. This change exposed a bug in the fragment test -- the old BPF did not accept the fragment packet. I updated correct_num_packets in the fragment test accordingly. Signed-off-by: Richard Gobert --- tools/testing/selftests/net/gro.c | 94 +-- 1 file changed, 88 insertions(+), 6 deletions(-) diff --git a/tools/testing/selftests/net/gro.c b/tools/testing/selftests/net/gro.c index 30024d0ed373..6dbba8ec53a1 100644 --- a/tools/testing/selftests/net/gro.c +++ b/tools/testing/selftests/net/gro.c @@ -71,6 +71,12 @@ #define MAX_PAYLOAD (IP_MAXPACKET - sizeof(struct tcphdr) - sizeof(struct ipv6hdr)) #define NUM_LARGE_PKT (MAX_PAYLOAD / MSS) #define MAX_HDR_LEN (ETH_HLEN + sizeof(struct ipv6hdr) + sizeof(struct tcphdr)) +#define MIN_EXTHDR_SIZE 8 +#define EXT_PAYLOAD_1 "\x00\x00\x00\x00\x00\x00" +#define EXT_PAYLOAD_2 "\x11\x11\x11\x11\x11\x11" + +#define ipv6_optlen(p) (((p)->hdrlen+1) << 3) /* calculate IPv6 extension header len */ +#define BUILD_BUG_ON(condition) ((void)sizeof(char[1 - 2*!!(condition)])) static const char *addr6_src = "fdaa::2"; static const char *addr6_dst = "fdaa::1"; @@ -104,7 +110,7 @@ static void setup_sock_filter(int fd) const int dport_off = tcp_offset + offsetof(struct tcphdr, dest); const int ethproto_off = offsetof(struct ethhdr, h_proto); int optlen = 0; - int ipproto_off; + int ipproto_off, opt_ipproto_off; int next_off; if (proto == PF_INET) @@ -116,14 +122,30 @@ static void setup_sock_filter(int fd) if (strcmp(testname, "ip") == 0) { if (proto == PF_INET) optlen = sizeof(struct ip_timestamp); - else - optlen = sizeof(struct ip6_frag); + else { + BUILD_BUG_ON(sizeof(struct 
ip6_hbh) > MIN_EXTHDR_SIZE); + BUILD_BUG_ON(sizeof(struct ip6_dest) > MIN_EXTHDR_SIZE); + BUILD_BUG_ON(sizeof(struct ip6_frag) > MIN_EXTHDR_SIZE); + + /* same size for HBH and Fragment extension header types */ + optlen = MIN_EXTHDR_SIZE; + opt_ipproto_off = ETH_HLEN + sizeof(struct ipv6hdr) + + offsetof(struct ip6_ext, ip6e_nxt); + } } + /* this filter validates the following: +* - packet is IPv4/IPv6 according to the running test. +* - packet is TCP. Also handles the case of one extension header and then TCP. +* - checks the packet tcp dport equals to DPORT. Also handles the case of one +*extension header and then TCP. +*/ struct sock_filter filter[] = { BPF_STMT(BPF_LD + BPF_H + BPF_ABS, ethproto_off), - BPF_JUMP(BPF_JMP + BPF_JEQ + BPF_K, ntohs(ethhdr_proto), 0, 7), + BPF_JUMP(BPF_JMP + BPF_JEQ + BPF_K, ntohs(ethhdr_proto), 0, 9), BPF_STMT(BPF_LD + BPF_B + BPF_ABS, ipproto_off), + BPF_JUMP(BPF_JMP + BPF_JEQ + BPF_K, IPPROTO_TCP, 2, 0), + BPF_STMT(BPF_LD + BPF_B + BPF_ABS, opt_ipproto_off), BPF_JUMP(BPF_JMP + BPF_JEQ + BPF_K, IPPROTO_TCP, 0, 5), BPF_STMT(BPF_LD + BPF_H + BPF_ABS, dport_off), BPF_JUMP(BPF_JMP + BPF_JEQ + BPF_K, DPORT, 2, 0), @@ -576,6 +598,40 @@ static void add_ipv4_ts_option(void *buf, void *optpkt) iph->check = checksum_fold(iph, sizeof(struct iphdr) + optlen, 0); } +static void add_ipv6_exthdr(void *buf, void *optpkt, __u8 exthdr_type, char *ext_payload) +{ + struct ipv6_opt_hdr *exthdr = (struct ipv6_opt_hdr *)(optpkt + tcp_offset); + struct ipv6hdr *iph = (struct ipv6hdr *)(optpkt + ETH_HLEN); + char *exthdr_payload_start = (char *)(exthdr + 1); + + exthdr->hdrlen = 0; + exthdr->nexthdr = IPPROTO_TCP; + + if (ext_payload) + memcpy(exthdr_payload_start, ext_payload, MIN_EXTHDR_SIZE - sizeof(*exthdr)); + + memcpy(optpkt, buf, tcp_offset); + memcpy(optpkt + tcp_offset + MIN_EXTHDR_SIZE, buf + tcp_offset, + sizeof(struct tcphdr) + PAYLOAD_LEN); + + iph->nexthdr = exthdr_type; + iph->payload_len = htons(ntohs(iph->payload_len) + MIN_EXTHDR_SIZE); 
+} + +static void send_ipv6_exthdr(int fd, struct sockaddr_ll *daddr, char *ext_data1, char *ext_data2) +{ + static char buf[MAX_HDR_LEN + PAYLOAD_LEN]; + static char exthdr_pck[sizeof(buf) +
[PATCH net-next v2 2/3] net: gro: parse ipv6 ext headers without frag0 invalidation
The existing code always pulls the IPv6 header and sets the transport offset initially. Then optionally again pulls any extension headers in ipv6_gso_pull_exthdrs and sets the transport offset again on return from that call. skb->data is set at the start of the first extension header before calling ipv6_gso_pull_exthdrs, and must disable the frag0 optimization because that function uses pskb_may_pull/pskb_pull instead of skb_gro_ helpers. It sets the GRO offset to the TCP header with skb_gro_pull and sets the transport header. Then returns skb->data to its position before this block. This commit introduces a new helper function - ipv6_gro_pull_exthdrs - which is used in ipv6_gro_receive to pull ipv6 ext headers instead of ipv6_gso_pull_exthdrs. Thus, there is no modification of skb->data, all operations use skb_gro_* helpers, and the frag0 fast path can be taken for IPv6 packets with ext headers. Signed-off-by: Richard Gobert Reviewed-by: Willem de Bruijn --- include/net/ipv6.h | 1 + net/ipv6/ip6_offload.c | 51 +- 2 files changed, 42 insertions(+), 10 deletions(-) diff --git a/include/net/ipv6.h b/include/net/ipv6.h index 78d38dd88aba..217240efa182 100644 --- a/include/net/ipv6.h +++ b/include/net/ipv6.h @@ -26,6 +26,7 @@ struct ip_tunnel_info; #define SIN6_LEN_RFC2133 24 #define IPV6_MAXPLEN 65535 +#define IPV6_MIN_EXTHDR_LEN 8 /* * NextHeader field of IPv6 header diff --git a/net/ipv6/ip6_offload.c b/net/ipv6/ip6_offload.c index 0e0b5fed0995..c07111d8f56a 100644 --- a/net/ipv6/ip6_offload.c +++ b/net/ipv6/ip6_offload.c @@ -37,6 +37,40 @@ INDIRECT_CALL_L4(cb, f2, f1, head, skb);\ }) +static int ipv6_gro_pull_exthdrs(struct sk_buff *skb, int off, int proto) +{ + const struct net_offload *ops = NULL; + struct ipv6_opt_hdr *opth; + + for (;;) { + int len; + + ops = rcu_dereference(inet6_offloads[proto]); + + if (unlikely(!ops)) + break; + + if (!(ops->flags & INET6_PROTO_GSO_EXTHDR)) + break; + + opth = skb_gro_header(skb, off + IPV6_MIN_EXTHDR_LEN, off); + if
(unlikely(!opth)) + break; + + len = ipv6_optlen(opth); + + opth = skb_gro_header(skb, off + len, off); + if (unlikely(!opth)) + break; + proto = opth->nexthdr; + + off += len; + } + + skb_gro_pull(skb, off - skb_network_offset(skb)); + return proto; +} + static int ipv6_gso_pull_exthdrs(struct sk_buff *skb, int proto) { const struct net_offload *ops = NULL; @@ -203,28 +237,25 @@ INDIRECT_CALLABLE_SCOPE struct sk_buff *ipv6_gro_receive(struct list_head *head, goto out; skb_set_network_header(skb, off); - skb_gro_pull(skb, sizeof(*iph)); - skb_set_transport_header(skb, skb_gro_offset(skb)); - flush += ntohs(iph->payload_len) != skb_gro_len(skb); + flush += ntohs(iph->payload_len) != skb->len - hlen; proto = iph->nexthdr; ops = rcu_dereference(inet6_offloads[proto]); if (!ops || !ops->callbacks.gro_receive) { - pskb_pull(skb, skb_gro_offset(skb)); - skb_gro_frag0_invalidate(skb); - proto = ipv6_gso_pull_exthdrs(skb, proto); - skb_gro_pull(skb, -skb_transport_offset(skb)); - skb_reset_transport_header(skb); - __skb_push(skb, skb_gro_offset(skb)); + proto = ipv6_gro_pull_exthdrs(skb, hlen, proto); ops = rcu_dereference(inet6_offloads[proto]); if (!ops || !ops->callbacks.gro_receive) goto out; - iph = ipv6_hdr(skb); + iph = skb_gro_network_header(skb); + } else { + skb_gro_pull(skb, sizeof(*iph)); } + skb_set_transport_header(skb, skb_gro_offset(skb)); + NAPI_GRO_CB(skb)->proto = proto; flush--; -- 2.36.1
[PATCH net-next v2 1/3] net: gso: add HBH extension header offload support
This commit adds net_offload to IPv6 Hop-by-Hop extension headers (as it is done for routing and dstopts) since it is supported in GSO and GRO. This allows removing specific HBH conditionals in GSO and GRO when pulling and parsing an incoming packet. Signed-off-by: Richard Gobert Reviewed-by: Willem de Bruijn --- net/ipv6/exthdrs_offload.c | 11 +++ net/ipv6/ip6_offload.c | 25 +++-- 2 files changed, 22 insertions(+), 14 deletions(-) diff --git a/net/ipv6/exthdrs_offload.c b/net/ipv6/exthdrs_offload.c index 06750d65d480..4c00398f4dca 100644 --- a/net/ipv6/exthdrs_offload.c +++ b/net/ipv6/exthdrs_offload.c @@ -16,6 +16,10 @@ static const struct net_offload dstopt_offload = { .flags = INET6_PROTO_GSO_EXTHDR, }; +static const struct net_offload hbh_offload = { + .flags = INET6_PROTO_GSO_EXTHDR, +}; + int __init ipv6_exthdrs_offload_init(void) { int ret; @@ -28,9 +32,16 @@ int __init ipv6_exthdrs_offload_init(void) if (ret) goto out_rt; + ret = inet6_add_offload(&hbh_offload, IPPROTO_HOPOPTS); + if (ret) + goto out_dstopts; + out: return ret; +out_dstopts: + inet6_del_offload(&dstopt_offload, IPPROTO_DSTOPTS); + out_rt: inet6_del_offload(&rthdr_offload, IPPROTO_ROUTING); goto out; diff --git a/net/ipv6/ip6_offload.c b/net/ipv6/ip6_offload.c index d6314287338d..0e0b5fed0995 100644 --- a/net/ipv6/ip6_offload.c +++ b/net/ipv6/ip6_offload.c @@ -45,15 +45,13 @@ static int ipv6_gso_pull_exthdrs(struct sk_buff *skb, int proto) struct ipv6_opt_hdr *opth; int len; - if (proto != NEXTHDR_HOP) { - ops = rcu_dereference(inet6_offloads[proto]); + ops = rcu_dereference(inet6_offloads[proto]); - if (unlikely(!ops)) - break; + if (unlikely(!ops)) + break; - if (!(ops->flags & INET6_PROTO_GSO_EXTHDR)) - break; - } + if (!(ops->flags & INET6_PROTO_GSO_EXTHDR)) + break; if (unlikely(!pskb_may_pull(skb, 8))) break; @@ -171,13 +169,12 @@ static int ipv6_exthdrs_len(struct ipv6hdr *iph, proto = iph->nexthdr; for (;;) { - if (proto != NEXTHDR_HOP) { - *opps = rcu_dereference(inet6_offloads[proto]); - if
(unlikely(!(*opps))) - break; - if (!((*opps)->flags & INET6_PROTO_GSO_EXTHDR)) - break; - } + *opps = rcu_dereference(inet6_offloads[proto]); + if (unlikely(!(*opps))) + break; + if (!((*opps)->flags & INET6_PROTO_GSO_EXTHDR)) + break; + opth = (void *)opth + optlen; optlen = ipv6_optlen(opth); len += optlen; -- 2.36.1
[PATCH net-next v2 0/3] net: gro: reduce extension header parsing overhead
This series attempts to reduce the parsing overhead of IPv6 extension headers in GRO and GSO, by removing extension header specific code and enabling the frag0 fast path. The following changes were made: - Removed some unnecessary HBH conditionals by adding HBH offload to inet6_offloads - Added a utility function to support frag0 fast path in ipv6_gro_receive - Added selftests for IPv6 packets with extension headers in GRO Richard v1 -> v2: - Added a minimum IPv6 extension header length constant to make code self documenting. - Added new selftest which checks that packets with different extension header payloads do not coalesce. - Added more info in the second commit message regarding the code changes. - v1: https://lore.kernel.org/netdev/f4eff69d-3917-4c42-8c6b-d09597ac4...@gmail.com/ Richard Gobert (3): net: gso: add HBH extension header offload support net: gro: parse ipv6 ext headers without frag0 invalidation selftests/net: fix GRO coalesce test and add ext header coalesce tests include/net/ipv6.h| 1 + net/ipv6/exthdrs_offload.c| 11 net/ipv6/ip6_offload.c| 76 + tools/testing/selftests/net/gro.c | 94 +-- 4 files changed, 152 insertions(+), 30 deletions(-) -- 2.36.1
Re: [RFC PATCH v3 0/3] Add test to verify probe of devices from discoverable busses
On Tue, Jan 02, 2024 at 10:45:59AM +0300, Dan Carpenter wrote: > Life hack: Don't put RFC in the subject. Especially if it's a v2 or > higher. No one reads RFC patches. Thanks for the tip. I've had a mixed experience with RFC series in the past, though this time around I did get some feedback on the previous versions so I can't complain. And I wasn't expecting swift replies in the middle of the holidays :). In any case, this should be the last RFC version as I feel like the approach has consolidated by now. > > This patchset seems like a low risk patch to apply. That's an interesting take on the usage of RFC I hadn't considered. Thanks, Nícolas
Re: [PATCH v8 21/24] evm: Move to LSM infrastructure
On 12/26/2023 11:13 PM, Mimi Zohar wrote: On Thu, 2023-12-14 at 18:08 +0100, Roberto Sassu wrote: From: Roberto Sassu As for IMA, move hardcoded EVM function calls from various places in the kernel to the LSM infrastructure, by introducing a new LSM named 'evm' (last and always enabled like 'ima'). The order in the Makefile ensures that 'evm' hooks are executed after 'ima' ones. Make EVM functions as static (except for evm_inode_init_security(), which is exported), and register them as hook implementations in init_evm_lsm(). Unlike before (see commit to move IMA to the LSM infrastructure), evm_inode_post_setattr(), evm_inode_post_set_acl(), evm_inode_post_remove_acl(), and evm_inode_post_removexattr() are not executed for private inodes. Missing is a comment on moving the inline function definitions - evm_inode_remove_acl(), evm_inode_post_remove_acl(), and evm_inode_post_set_acl() - to evm_main.c. Ok. Finally, add the LSM_ID_EVM case in lsm_list_modules_test.c Signed-off-by: Roberto Sassu --- [...] @@ -2307,9 +2299,7 @@ int security_inode_setxattr(struct mnt_idmap *idmap, if (ret == 1) ret = cap_inode_setxattr(dentry, name, value, size, flags); - if (ret) - return ret; - return evm_inode_setxattr(idmap, dentry, name, value, size, flags); + return ret; } Even though capability will be called after EVM, it doesn't make a difference in this instance. [...] /** @@ -2493,9 +2472,7 @@ int security_inode_removexattr(struct mnt_idmap *idmap, ret = call_int_hook(inode_removexattr, 1, idmap, dentry, name); if (ret == 1) ret = cap_inode_removexattr(idmap, dentry, name); - if (ret) - return ret; - return evm_inode_removexattr(idmap, dentry, name); + return ret; } 'security.capability' is one of the EVM protected xattrs. As capability isn't an LSM, it will now be called after EVM, which is a problem. 
Uhm, according to this comment in security_inode_removexattr() and security_inode_setxattr(): /* * SELinux and Smack integrate the cap call, * so assume that all LSMs supplying this call do so. */ We can add the call to IMA and EVM as well, to be compliant. However, I'm missing why the two cases are different. It seems cap_inode_set/removexattr() are doing just checks. Thanks Roberto
Re: [PATCH v2 net-next] selftests/net: change shebang to bash to support "source"
On Tue, Jan 02, 2024 at 04:20:16PM +0800, Yujie Liu wrote: > On Sun, Dec 31, 2023 at 01:17:11PM +0100, Guillaume Nault wrote: > > On Fri, Dec 29, 2023 at 09:19:31PM +0800, Yujie Liu wrote: > > > The patch set [1] added a general lib.sh in net selftests, and converted > > > several test scripts to source the lib.sh. > > > > > > unicast_extensions.sh (converted in [1]) and pmtu.sh (converted in [2]) > > > have a /bin/sh shebang which may point to various shells in different > > > distributions, but "source" is only available in some of them. For > > > example, "source" is a built-in function in bash, but it cannot be > > > used in dash. > > > > > > Refer to other scripts that were converted together, simply change the > > > shebang to bash to fix the following issues when the default /bin/sh > > > points to other shells. > > > > Looks like it'd be simpler to just replace the "source" commands with > > "." and leave the shebang as is (unless there are other bash-specific > > constructs in these scripts of course). > > > > Generally speaking, I think we should avoid mandating a specific shell, > > unless that really simplifies the test script (which is not the case > > here). > > Hi Guillaume, > > Thanks for the comments. Actually I also considered replacing "source" > with "." at first, but finally decided to change the shebang in > consideration of being consistent with other scripts. We can see that > there are 140+ scripts in net selftests that have "source lib.sh" and > "bash" shebang, but none of the selftests has ". lib.sh". If we replace > "source" with "." and keep the "sh" shebang specifically for > unicast_extensions.sh and pmtu.sh, we will get only 2 scripts using > "sh and ." while most other scripts using "bash and source". Maybe it > would be nice to keep the consistency by changing the shebang as well? > What do you think? :) The use of "source" instead of "." is clearly an oversight. Consistency is desirable only when it brings better quality code.
And it should be easy enough to convert the remaining "source lib.sh" in a followup patch to make the other shell scripts consistent. > linux/tools/testing/selftests/net$ grep -rl "source lib.sh" . | xargs grep -F > '#!/bin/' > ./test_vxlan_under_vrf.sh:#!/bin/bash > ./test_vxlan_nolocalbypass.sh:#!/bin/bash > ./xfrm_policy.sh:#!/bin/bash > ./test_vxlan_mdb.sh:#!/bin/bash > ./test_bridge_backup_port.sh:#!/bin/bash > ./vrf_route_leaking.sh:#!/bin/bash > ./l2tp.sh:#!/bin/bash > ./netns-name.sh:#!/bin/bash > ./rtnetlink.sh:#!/bin/bash > ./ioam6.sh:#!/bin/bash > ./drop_monitor_tests.sh:#!/bin/bash > ./test_vxlan_vnifiltering.sh:#!/bin/bash > ./icmp.sh:#!/bin/bash > ./gre_gso.sh:#!/bin/bash > ./fib_nexthop_multiprefix.sh:#!/bin/bash > ./icmp_redirect.sh:#!/bin/bash > ./vrf-xfrm-tests.sh:#!/bin/bash > ./vrf_strict_mode_test.sh:#!/bin/bash > ./fcnal-test.sh:#!/bin/bash > ./stress_reuseport_listen.sh:#!/bin/bash > ./srv6_end_dt4_l3vpn_test.sh:#!/bin/bash > ./test_bridge_neigh_suppress.sh:#!/bin/bash > ./cmsg_ipv6.sh:#!/bin/bash > ./arp_ndisc_evict_nocarrier.sh:#!/bin/bash > ./fib_rule_tests.sh:#!/bin/bash > ./srv6_end_dt6_l3vpn_test.sh:#!/bin/bash > ./forwarding/custom_multipath_hash.sh:#!/bin/bash > ./forwarding/gre_inner_v4_multipath.sh:#!/bin/bash > ./forwarding/tc_tunnel_key.sh:#!/bin/bash > ./forwarding/tc_shblocks.sh:#!/bin/bash > ./forwarding/router_nh.sh:#!/bin/bash > ./forwarding/skbedit_priority.sh:#!/bin/bash > ./forwarding/tc_mpls_l2vpn.sh:#!/bin/bash > ./forwarding/gre_inner_v6_multipath.sh:#!/bin/bash > ./forwarding/vxlan_symmetric.sh:#!/bin/bash > ./forwarding/bridge_mdb.sh:#!/bin/bash > ./forwarding/no_forwarding.sh:#!/bin/bash > ./forwarding/router_bridge_1d.sh:#!/bin/bash > ./forwarding/tc_flower_port_range.sh:#!/bin/bash > ./forwarding/router_multicast.sh:#!/bin/bash > ./forwarding/bridge_locked_port.sh:#!/bin/bash > ./forwarding/vxlan_asymmetric_ipv6.sh:#!/bin/bash > ./forwarding/dual_vxlan_bridge.sh:#!/bin/bash > 
./forwarding/bridge_port_isolation.sh:#!/bin/bash > ./forwarding/local_termination.sh:#!/bin/bash > ./forwarding/ipip_flat_gre_keys.sh:#!/bin/bash > ./forwarding/gre_multipath_nh_res.sh:#!/bin/bash > ./forwarding/gre_multipath.sh:#!/bin/bash > ./forwarding/vxlan_bridge_1d_ipv6.sh:#!/bin/bash > ./forwarding/ip6gre_flat_keys.sh:#!/bin/bash > ./forwarding/gre_multipath_nh.sh:#!/bin/bash > ./forwarding/bridge_mld.sh:#!/bin/bash > ./forwarding/ip6gre_inner_v6_multipath.sh:#!/bin/bash > ./forwarding/ip6gre_flat_key.sh:#!/bin/bash > ./forwarding/vxlan_asymmetric.sh:#!/bin/bash > ./forwarding/tc_flower_router.sh:#!/bin/bash > ./forwarding/router_bridge_vlan_upper_pvid.sh:#!/bin/bash > ./forwarding/mirror_gre_vlan_bridge_1q.sh:#!/bin/bash > ./forwarding/q_in_vni_ipv6.sh:#!/bin/bash > ./forwarding/mirror_gre_lag_lacp.sh:#!/bin/bash > ./forwarding/ip6gre_custom_multipath_hash.sh:#!/bin/bash > ./forwarding/vxlan_bridge_1d.sh:#!/bin/bash > ./forwarding/ip6gre_hier_key.sh:#!/bin/bash >
Re: [PATCH v8 23/24] ima: Make it independent from 'integrity' LSM
On 12/27/2023 8:21 PM, Mimi Zohar wrote: On Wed, 2023-12-27 at 17:39 +0100, Roberto Sassu wrote: On 12/27/2023 2:22 PM, Mimi Zohar wrote: On Thu, 2023-12-14 at 18:08 +0100, Roberto Sassu wrote: From: Roberto Sassu Make the 'ima' LSM independent from the 'integrity' LSM by introducing IMA own integrity metadata (ima_iint_cache structure, with IMA-specific fields from the integrity_iint_cache structure), and by managing it directly from the 'ima' LSM. Move the remaining IMA-specific flags to security/integrity/ima/ima.h, since they are now unnecessary in the common integrity layer. Replace integrity_iint_cache with ima_iint_cache in various places of the IMA code. Then, reserve space in the security blob for the entire ima_iint_cache structure, so that it is available for all inodes having the security blob allocated (those for which security_inode_alloc() was called). Adjust the IMA code accordingly, call ima_iint_inode() to retrieve the ima_iint_cache structure. Keep the non-NULL checks since there can be inodes without security blob. Previously the 'iint' memory was only allocated for regular files in policy and were tagged S_IMA. This patch totally changes when and how memory is being allocated. Does it make sense to allocate memory at security_inode_alloc()? Is this change really necessary for making IMA a full fledged LSM? Good question. I think it wouldn't be necessary, we can reuse the same approach as in the patch 'integrity: Switch from rbtree to LSM-managed blob for integrity_iint_cache'. Going forward with the v8 proposed solution would require some real memory usage analysis for different types of policies. To me the "integrity: Switch from rbtree to LSM-managed blob for integrity_iint_cache" makes a lot more sense. Looking back at the original thread, your reasons back then for not directly allocating the integrity_iint_cache are still valid for the ima_iint_cache structure. Uhm, ok. 
It should not be too difficult to restore the old mechanism for ima_iint_cache. Will do it in v9. Thanks Roberto Mimi Don't include the inode pointer as field in the ima_iint_cache structure, since the association with the inode is clear. Since the inode field is missing in ima_iint_cache, pass the extra inode parameter to ima_get_verity_digest(). Finally, register ima_inode_alloc_security/ima_inode_free_security() to initialize/deinitialize the new ima_iint_cache structure (before this task was done by iint_init_always() and iint_free()). Also, duplicate iint_lockdep_annotate() for the ima_iint_cache structure, and name it ima_iint_lockdep_annotate(). Signed-off-by: Roberto Sassu
Re: [PATCH 1/1] userfaultfd: fix move_pages_pte() splitting folio under RCU read lock
On Fri, Dec 29, 2023 at 06:56:07PM -0800, Suren Baghdasaryan wrote: > @@ -1078,9 +1078,14 @@ static int move_pages_pte(struct mm_struct *mm, pmd_t > *dst_pmd, pmd_t *src_pmd, > > /* at this point we have src_folio locked */ > if (folio_test_large(src_folio)) { > + /* split_folio() can block */ > + pte_unmap(&orig_src_pte); > + pte_unmap(&orig_dst_pte); > + src_pte = dst_pte = NULL; > err = split_folio(src_folio); > if (err) > goto out; > + goto retry; > } Do we also need to clear src_folio and src_folio_pte? If the folio is a thp, I think it means it's pte mapped here. Then after the split we may want to fetch the small folio after the split, not the head one? -- Peter Xu
[PATCH] mm/selftests: hugepage-mremap: conform test to TAP format output
Conform the layout, informational and status messages to TAP. No functional change is intended other than the layout of output messages. Signed-off-by: Muhammad Usama Anjum --- tools/testing/selftests/mm/hugepage-mremap.c | 87 1 file changed, 35 insertions(+), 52 deletions(-) diff --git a/tools/testing/selftests/mm/hugepage-mremap.c b/tools/testing/selftests/mm/hugepage-mremap.c index cabd0084f57b..c463d1c09c9b 100644 --- a/tools/testing/selftests/mm/hugepage-mremap.c +++ b/tools/testing/selftests/mm/hugepage-mremap.c @@ -24,6 +24,7 @@ #include #include #include +#include "../kselftest.h" #include "vm_util.h" #define DEFAULT_LENGTH_MB 10UL @@ -34,7 +35,7 @@ static void check_bytes(char *addr) { - printf("First hex is %x\n", *((unsigned int *)addr)); + ksft_print_msg("First hex is %x\n", *((unsigned int *)addr)); } static void write_bytes(char *addr, size_t len) @@ -52,7 +53,7 @@ static int read_bytes(char *addr, size_t len) check_bytes(addr); for (i = 0; i < len; i++) if (*(addr + i) != (char)i) { - printf("Mismatch at %lu\n", i); + ksft_print_msg("Mismatch at %lu\n", i); return 1; } return 0; @@ -66,17 +67,13 @@ static void register_region_with_uffd(char *addr, size_t len) /* Create and enable userfaultfd object. */ uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK); - if (uffd == -1) { - perror("userfaultfd"); - exit(1); - } + if (uffd == -1) + ksft_exit_fail_msg("userfaultfd: %s\n", strerror(errno)); uffdio_api.api = UFFD_API; uffdio_api.features = 0; - if (ioctl(uffd, UFFDIO_API, &uffdio_api) == -1) { - perror("ioctl-UFFDIO_API"); - exit(1); - } + if (ioctl(uffd, UFFDIO_API, &uffdio_api) == -1) + ksft_exit_fail_msg("ioctl-UFFDIO_API: %s\n", strerror(errno)); /* Create a private anonymous mapping. The memory will be * demand-zero paged--that is, not yet allocated.
When we @@ -86,21 +83,17 @@ static void register_region_with_uffd(char *addr, size_t len) addr = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0); - if (addr == MAP_FAILED) { - perror("mmap"); - exit(1); - } + if (addr == MAP_FAILED) + ksft_exit_fail_msg("mmap: %s\n", strerror(errno)); - printf("Address returned by mmap() = %p\n", addr); + ksft_print_msg("Address returned by mmap() = %p\n", addr); /* Register the memory range of the mapping we just created for * handling by the userfaultfd object. In mode, we request to track * missing pages (i.e., pages that have not yet been faulted in). */ - if (uffd_register(uffd, addr, len, true, false, false)) { - perror("ioctl-UFFDIO_REGISTER"); - exit(1); - } + if (uffd_register(uffd, addr, len, true, false, false)) + ksft_exit_fail_msg("ioctl-UFFDIO_REGISTER: %s\n", strerror(errno)); } int main(int argc, char *argv[]) @@ -108,10 +101,11 @@ int main(int argc, char *argv[]) size_t length = 0; int ret = 0, fd; - if (argc >= 2 && !strcmp(argv[1], "-h")) { - printf("Usage: %s [length_in_MB]\n", argv[0]); - exit(1); - } + ksft_print_header(); + ksft_set_plan(1); + + if (argc >= 2 && !strcmp(argv[1], "-h")) + ksft_exit_fail_msg("Usage: %s [length_in_MB]\n", argv[0]); /* Read memory length as the first arg if valid, otherwise fallback to * the default length. @@ -123,50 +117,40 @@ int main(int argc, char *argv[]) length = MB_TO_BYTES(length); fd = memfd_create(argv[0], MFD_HUGETLB); - if (fd < 0) { - perror("Open failed"); - exit(1); - } + if (fd < 0) + ksft_exit_fail_msg("Open failed: %s\n", strerror(errno)); /* mmap to a PUD aligned address to hopefully trigger pmd sharing. 
*/ unsigned long suggested_addr = 0x7eaa4000; void *haddr = mmap((void *)suggested_addr, length, PROTECTION, MAP_HUGETLB | MAP_SHARED | MAP_POPULATE, fd, 0); - printf("Map haddr: Returned address is %p\n", haddr); - if (haddr == MAP_FAILED) { - perror("mmap1"); - exit(1); - } + ksft_print_msg("Map haddr: Returned address is %p\n", haddr); + if (haddr == MAP_FAILED) + ksft_exit_fail_msg("mmap1: %s\n", strerror(errno)); /* mmap again to a dummy address to hopefully trigger pmd sharing. */ suggested_addr = 0x7daa4000; void *daddr = mmap((void *)suggested_addr, length, PROTECTION, MAP_HUGETLB |
Re: [PATCH v2 net-next] selftests/net: change shebang to bash to support "source"
On Sun, Dec 31, 2023 at 01:17:11PM +0100, Guillaume Nault wrote: > On Fri, Dec 29, 2023 at 09:19:31PM +0800, Yujie Liu wrote: > > The patch set [1] added a general lib.sh in net selftests, and converted > > several test scripts to source the lib.sh. > > > > unicast_extensions.sh (converted in [1]) and pmtu.sh (converted in [2]) > > have a /bin/sh shebang which may point to various shells in different > > distributions, but "source" is only available in some of them. For > > example, "source" is a built-in function in bash, but it cannot be > > used in dash. > > > > Refer to other scripts that were converted together, simply change the > > shebang to bash to fix the following issues when the default /bin/sh > > points to other shells. > Looks like it'd be simpler to just replace the "source" commands with > "." and leave the shebang as is (unless there are other bash-specific > constructs in these scripts of course). > > Generally speaking, I think we should avoid mandating a specific shell, > unless that really simplifies the test script (which is not the case > here). Hi Guillaume, Thanks for the comments. Actually I also considered replacing "source" with "." at first, but finally decided to change the shebang in consideration of being consistent with other scripts. We can see that there are 140+ scripts in net selftests that have "source lib.sh" and "bash" shebang, but none of the selftests has ". lib.sh". If we replace "source" with "." and keep the "sh" shebang specifically for unicast_extensions.sh and pmtu.sh, we will get only 2 scripts using "sh and ." while most other scripts using "bash and source". Maybe it would be nice to keep the consistency by changing the shebang as well? What do you think? :) linux/tools/testing/selftests/net$ grep -rl "source lib.sh" .
| xargs grep -F '#!/bin/' ./test_vxlan_under_vrf.sh:#!/bin/bash ./test_vxlan_nolocalbypass.sh:#!/bin/bash ./xfrm_policy.sh:#!/bin/bash ./test_vxlan_mdb.sh:#!/bin/bash ./test_bridge_backup_port.sh:#!/bin/bash ./vrf_route_leaking.sh:#!/bin/bash ./l2tp.sh:#!/bin/bash ./netns-name.sh:#!/bin/bash ./rtnetlink.sh:#!/bin/bash ./ioam6.sh:#!/bin/bash ./drop_monitor_tests.sh:#!/bin/bash ./test_vxlan_vnifiltering.sh:#!/bin/bash ./icmp.sh:#!/bin/bash ./gre_gso.sh:#!/bin/bash ./fib_nexthop_multiprefix.sh:#!/bin/bash ./icmp_redirect.sh:#!/bin/bash ./vrf-xfrm-tests.sh:#!/bin/bash ./vrf_strict_mode_test.sh:#!/bin/bash ./fcnal-test.sh:#!/bin/bash ./stress_reuseport_listen.sh:#!/bin/bash ./srv6_end_dt4_l3vpn_test.sh:#!/bin/bash ./test_bridge_neigh_suppress.sh:#!/bin/bash ./cmsg_ipv6.sh:#!/bin/bash ./arp_ndisc_evict_nocarrier.sh:#!/bin/bash ./fib_rule_tests.sh:#!/bin/bash ./srv6_end_dt6_l3vpn_test.sh:#!/bin/bash ./forwarding/custom_multipath_hash.sh:#!/bin/bash ./forwarding/gre_inner_v4_multipath.sh:#!/bin/bash ./forwarding/tc_tunnel_key.sh:#!/bin/bash ./forwarding/tc_shblocks.sh:#!/bin/bash ./forwarding/router_nh.sh:#!/bin/bash ./forwarding/skbedit_priority.sh:#!/bin/bash ./forwarding/tc_mpls_l2vpn.sh:#!/bin/bash ./forwarding/gre_inner_v6_multipath.sh:#!/bin/bash ./forwarding/vxlan_symmetric.sh:#!/bin/bash ./forwarding/bridge_mdb.sh:#!/bin/bash ./forwarding/no_forwarding.sh:#!/bin/bash ./forwarding/router_bridge_1d.sh:#!/bin/bash ./forwarding/tc_flower_port_range.sh:#!/bin/bash ./forwarding/router_multicast.sh:#!/bin/bash ./forwarding/bridge_locked_port.sh:#!/bin/bash ./forwarding/vxlan_asymmetric_ipv6.sh:#!/bin/bash ./forwarding/dual_vxlan_bridge.sh:#!/bin/bash ./forwarding/bridge_port_isolation.sh:#!/bin/bash ./forwarding/local_termination.sh:#!/bin/bash ./forwarding/ipip_flat_gre_keys.sh:#!/bin/bash ./forwarding/gre_multipath_nh_res.sh:#!/bin/bash ./forwarding/gre_multipath.sh:#!/bin/bash ./forwarding/vxlan_bridge_1d_ipv6.sh:#!/bin/bash ./forwarding/ip6gre_flat_keys.sh:#!/bin/bash 
./forwarding/gre_multipath_nh.sh:#!/bin/bash ./forwarding/bridge_mld.sh:#!/bin/bash ./forwarding/ip6gre_inner_v6_multipath.sh:#!/bin/bash ./forwarding/ip6gre_flat_key.sh:#!/bin/bash ./forwarding/vxlan_asymmetric.sh:#!/bin/bash ./forwarding/tc_flower_router.sh:#!/bin/bash ./forwarding/router_bridge_vlan_upper_pvid.sh:#!/bin/bash ./forwarding/mirror_gre_vlan_bridge_1q.sh:#!/bin/bash ./forwarding/q_in_vni_ipv6.sh:#!/bin/bash ./forwarding/mirror_gre_lag_lacp.sh:#!/bin/bash ./forwarding/ip6gre_custom_multipath_hash.sh:#!/bin/bash ./forwarding/vxlan_bridge_1d.sh:#!/bin/bash ./forwarding/ip6gre_hier_key.sh:#!/bin/bash ./forwarding/gre_custom_multipath_hash.sh:#!/bin/bash ./forwarding/ipip_flat_gre_key.sh:#!/bin/bash ./forwarding/mirror_gre_flower.sh:#!/bin/bash ./forwarding/router_bridge.sh:#!/bin/bash ./forwarding/vxlan_symmetric_ipv6.sh:#!/bin/bash ./forwarding/mirror_gre_bridge_1q.sh:#!/bin/bash ./forwarding/router_multipath.sh:#!/bin/bash ./forwarding/tc_vlan_modify.sh:#!/bin/bash ./forwarding/vxlan_bridge_1q.sh:#!/bin/bash ./forwarding/bridge_mdb_port_down.sh:#!/bin/bash ./forwarding/tc_flower.sh:#!/bin/bash ./forwarding/tc_flower_cfm.sh:#!/bin/bash