Re: [RFC PATCH v9 02/13] x86: always set IF before oopsing from page fault
On Wed, Apr 3, 2019 at 6:42 PM Tycho Andersen wrote:
>
> On Wed, Apr 03, 2019 at 05:12:56PM -0700, Andy Lutomirski wrote:
> > On Wed, Apr 3, 2019 at 10:36 AM Khalid Aziz wrote:
> > >
> > > From: Tycho Andersen
> > >
> > > Oopsing might kill the task, via rewind_stack_do_exit() at the bottom,
> > > and that might sleep:
> > >
> > > diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
> > > index 9d5c75f02295..7891add0913f 100644
> > > --- a/arch/x86/mm/fault.c
> > > +++ b/arch/x86/mm/fault.c
> > > @@ -858,6 +858,12 @@ no_context(struct pt_regs *regs, unsigned long error_code,
> > >  	/* Executive summary in case the body of the oops scrolled away */
> > >  	printk(KERN_DEFAULT "CR2: %016lx\n", address);
> > >
> > > +	/*
> > > +	 * We're about to oops, which might kill the task. Make sure we're
> > > +	 * allowed to sleep.
> > > +	 */
> > > +	flags |= X86_EFLAGS_IF;
> > > +
> > >  	oops_end(flags, regs, sig);
> > >  }
> >
> > NAK.  If there's a bug in rewind_stack_do_exit(), please fix it in
> > rewind_stack_do_exit().
>
> [I trimmed the CC list since google rejected it with E2BIG :)]
>
> I guess the problem is really that do_exit() (or really
> exit_signals()) might sleep. Maybe we should put an irq_enable() at
> the beginning of do_exit() instead and fix this problem for all
> arches?

Hmm.  do_exit() isn't really meant to be "try your best to leave the
system somewhat usable without returning" -- it's a function that, other
than in OOPSes, is called from a well-defined state.  So I think
rewind_stack_do_exit() is probably a better spot.  But we need to rewind
the stack and *then* turn on IRQs, since we otherwise risk exploding
quite badly.

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [RFC PATCH v9 12/13] xpfo, mm: Defer TLB flushes for non-current CPUs (x86 only)
On Wed, Apr 3, 2019 at 10:36 AM Khalid Aziz wrote:
>
> XPFO flushes kernel space TLB entries for pages that are now mapped
> in userspace on not only the current CPU but also all other CPUs
> synchronously. Processes on each core allocating pages causes a
> flood of IPI messages to all other cores to flush TLB entries.
> Many of these messages are to flush the entire TLB on the core if
> the number of entries being flushed from local core exceeds
> tlb_single_page_flush_ceiling. The cost of TLB flush caused by
> unmapping pages from physmap goes up dramatically on machines with
> high core count.
>
> This patch flushes relevant TLB entries for current process or
> entire TLB depending upon number of entries for the current CPU
> and posts a pending TLB flush on all other CPUs when a page is
> unmapped from kernel space and mapped in userspace. Each core
> checks the pending TLB flush flag for itself on every context
> switch, flushes its TLB if the flag is set and clears it.
> This patch potentially aggregates multiple TLB flushes into one.
> This has very significant impact especially on machines with large
> core counts.

Why is this a reasonable strategy?

> +void xpfo_flush_tlb_kernel_range(unsigned long start, unsigned long end)
> +{
> +	struct cpumask tmp_mask;
> +
> +	/*
> +	 * Balance as user space task's flush, a bit conservative.
> +	 * Do a local flush immediately and post a pending flush on all
> +	 * other CPUs. Local flush can be a range flush or full flush
> +	 * depending upon the number of entries to be flushed. Remote
> +	 * flushes will be done by individual processors at the time of
> +	 * context switch and this allows multiple flush requests from
> +	 * other CPUs to be batched together.
> +	 */

I don't like this function at all.  A core function like this is a
contract of sorts between the caller and the implementation.  There is
no such thing as an "xpfo" flush, and this function's behavior isn't at
all well defined.

For flush_tlb_kernel_range(), I can tell you exactly what that function
does, and the implementation is either correct or incorrect.  With this
function, I have no idea what is actually required, and I can't
possibly tell whether it's correct.

As far as I can see, xpfo_flush_tlb_kernel_range() actually means
"flush this range on this CPU right now, and flush it on remote CPUs
eventually".  It would be valid, but probably silly, to flush locally
and to never flush at all on remote CPUs.  This makes me wonder what
the point is.
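The deferred-flush scheme under discussion can be modeled in plain C: the CPU that unmaps a page flushes its own TLB immediately and only posts a pending-flush flag to every other CPU, which is honored at that CPU's next context switch, so multiple posted requests collapse into one flush. This is an illustrative userspace sketch, not kernel code; the names `xpfo_post_flush` and `ctx_switch` are invented for the model.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_CPUS 4

static bool pending_flush[NR_CPUS]; /* posted but not yet performed */
static int flush_count[NR_CPUS];    /* how many TLB flushes each CPU ran */

/* Called by the CPU unmapping a page from the kernel: flush locally
 * right away, defer everyone else's flush to their next context switch. */
static void xpfo_post_flush(int this_cpu)
{
	flush_count[this_cpu]++;                /* local flush, immediate */
	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		if (cpu != this_cpu)
			pending_flush[cpu] = true;  /* deferred */
}

/* Called on every context switch: honor a pending flush, if any.
 * Several posted requests collapse into a single flush here. */
static void ctx_switch(int cpu)
{
	if (pending_flush[cpu]) {
		pending_flush[cpu] = false;
		flush_count[cpu]++;
	}
}
```

Two back-to-back unmaps on CPU 0 post two requests, but CPU 1 performs only one flush at its next context switch; that collapsing is the batching effect the commit message claims, and the "eventually" semantics are exactly what the review above objects to.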
Re: [RFC PATCH v9 02/13] x86: always set IF before oopsing from page fault
On Wed, Apr 03, 2019 at 05:12:56PM -0700, Andy Lutomirski wrote:
> On Wed, Apr 3, 2019 at 10:36 AM Khalid Aziz wrote:
> >
> > From: Tycho Andersen
> >
> > Oopsing might kill the task, via rewind_stack_do_exit() at the bottom,
> > and that might sleep:
> >
> > diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
> > index 9d5c75f02295..7891add0913f 100644
> > --- a/arch/x86/mm/fault.c
> > +++ b/arch/x86/mm/fault.c
> > @@ -858,6 +858,12 @@ no_context(struct pt_regs *regs, unsigned long error_code,
> >  	/* Executive summary in case the body of the oops scrolled away */
> >  	printk(KERN_DEFAULT "CR2: %016lx\n", address);
> >
> > +	/*
> > +	 * We're about to oops, which might kill the task. Make sure we're
> > +	 * allowed to sleep.
> > +	 */
> > +	flags |= X86_EFLAGS_IF;
> > +
> >  	oops_end(flags, regs, sig);
> >  }
>
> NAK.  If there's a bug in rewind_stack_do_exit(), please fix it in
> rewind_stack_do_exit().

[I trimmed the CC list since google rejected it with E2BIG :)]

I guess the problem is really that do_exit() (or really
exit_signals()) might sleep. Maybe we should put an irq_enable() at
the beginning of do_exit() instead and fix this problem for all
arches?

Tycho
Re: [PATCH AUTOSEL 5.0 008/262] swiotlb: add checks for the return value of memblock_alloc*()
On Thu, Mar 28, 2019 at 07:55:20AM +0200, Mike Rapoport wrote:
> Hi,
>
> On Wed, Mar 27, 2019 at 01:57:43PM -0400, Sasha Levin wrote:
> > From: Mike Rapoport
> >
> > [ Upstream commit a0bf842e89a3842162aa8514b9bf4611c86fee10 ]
> >
> > Add panic() calls if memblock_alloc() returns NULL.
> >
> > The panic() format duplicates the one used by memblock itself and in
> > order to avoid explosion with long parameters list replace open coded
> > allocation size calculations with a local variable.
>
> This patch is a part of a series that removes panic() calls from
> memblock allocators rather than a fix.

I've dropped it, thanks!

--
Thanks,
Sasha
Re: [RFC PATCH v9 02/13] x86: always set IF before oopsing from page fault
On Wed, Apr 3, 2019 at 10:36 AM Khalid Aziz wrote:
>
> From: Tycho Andersen
>
> Oopsing might kill the task, via rewind_stack_do_exit() at the bottom,
> and that might sleep:
>
> diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
> index 9d5c75f02295..7891add0913f 100644
> --- a/arch/x86/mm/fault.c
> +++ b/arch/x86/mm/fault.c
> @@ -858,6 +858,12 @@ no_context(struct pt_regs *regs, unsigned long error_code,
>  	/* Executive summary in case the body of the oops scrolled away */
>  	printk(KERN_DEFAULT "CR2: %016lx\n", address);
>
> +	/*
> +	 * We're about to oops, which might kill the task. Make sure we're
> +	 * allowed to sleep.
> +	 */
> +	flags |= X86_EFLAGS_IF;
> +
>  	oops_end(flags, regs, sig);
>  }

NAK.  If there's a bug in rewind_stack_do_exit(), please fix it in
rewind_stack_do_exit().
[RFC PATCH v9 04/13] xpfo, x86: Add support for XPFO for x86-64
From: Juerg Haefliger This patch adds support for XPFO for x86-64. It uses the generic infrastructure in place for XPFO and adds the architecture specific bits to enable XPFO on x86-64. CC: x...@kernel.org Suggested-by: Vasileios P. Kemerlis Signed-off-by: Juerg Haefliger Signed-off-by: Tycho Andersen Signed-off-by: Marco Benatto Signed-off-by: Julian Stecklina Signed-off-by: Khalid Aziz Cc: Khalid Aziz --- .../admin-guide/kernel-parameters.txt | 10 +- arch/x86/Kconfig | 1 + arch/x86/include/asm/pgtable.h| 26 arch/x86/mm/Makefile | 2 + arch/x86/mm/pageattr.c| 23 +--- arch/x86/mm/xpfo.c| 123 ++ include/linux/xpfo.h | 2 + 7 files changed, 162 insertions(+), 25 deletions(-) create mode 100644 arch/x86/mm/xpfo.c diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index 9b36da94760e..e65e3bc1efe0 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -2997,11 +2997,11 @@ nox2apic[X86-64,APIC] Do not enable x2APIC mode. - noxpfo [XPFO] Disable eXclusive Page Frame Ownership (XPFO) - when CONFIG_XPFO is on. Physical pages mapped into - user applications will also be mapped in the - kernel's address space as if CONFIG_XPFO was not - enabled. + noxpfo [XPFO,X86-64] Disable eXclusive Page Frame + Ownership (XPFO) when CONFIG_XPFO is on. Physical + pages mapped into user applications will also be + mapped in the kernel's address space as if + CONFIG_XPFO was not enabled. cpu0_hotplug[X86] Turn on CPU0 hotplug feature when CONFIG_BOOTPARAM_HOTPLUG_CPU0 is off. 
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index 68261430fe6e..122786604252 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -209,6 +209,7 @@ config X86 select USER_STACKTRACE_SUPPORT select VIRT_TO_BUS select X86_FEATURE_NAMESif PROC_FS + select ARCH_SUPPORTS_XPFO if X86_64 config INSTRUCTION_DECODER def_bool y diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h index 2779ace16d23..5c0e1581fa56 100644 --- a/arch/x86/include/asm/pgtable.h +++ b/arch/x86/include/asm/pgtable.h @@ -1437,6 +1437,32 @@ static inline bool arch_has_pfn_modify_check(void) return boot_cpu_has_bug(X86_BUG_L1TF); } +/* + * The current flushing context - we pass it instead of 5 arguments: + */ +struct cpa_data { + unsigned long *vaddr; + pgd_t *pgd; + pgprot_tmask_set; + pgprot_tmask_clr; + unsigned long numpages; + unsigned long curpage; + unsigned long pfn; + unsigned intflags; + unsigned intforce_split : 1, + force_static_prot : 1; + struct page **pages; +}; + + +int +should_split_large_page(pte_t *kpte, unsigned long address, + struct cpa_data *cpa); +extern spinlock_t cpa_lock; +int +__split_large_page(struct cpa_data *cpa, pte_t *kpte, unsigned long address, + struct page *base); + #include #endif /* __ASSEMBLY__ */ diff --git a/arch/x86/mm/Makefile b/arch/x86/mm/Makefile index 4b101dd6e52f..93b0fdaf4a99 100644 --- a/arch/x86/mm/Makefile +++ b/arch/x86/mm/Makefile @@ -53,3 +53,5 @@ obj-$(CONFIG_PAGE_TABLE_ISOLATION)+= pti.o obj-$(CONFIG_AMD_MEM_ENCRYPT) += mem_encrypt.o obj-$(CONFIG_AMD_MEM_ENCRYPT) += mem_encrypt_identity.o obj-$(CONFIG_AMD_MEM_ENCRYPT) += mem_encrypt_boot.o + +obj-$(CONFIG_XPFO) += xpfo.o diff --git a/arch/x86/mm/pageattr.c b/arch/x86/mm/pageattr.c index 14e6119838a6..530b5df0617e 100644 --- a/arch/x86/mm/pageattr.c +++ b/arch/x86/mm/pageattr.c @@ -28,23 +28,6 @@ #include "mm_internal.h" -/* - * The current flushing context - we pass it instead of 5 arguments: - */ -struct cpa_data { - unsigned long *vaddr; - pgd_t *pgd; - 
pgprot_tmask_set; - pgprot_tmask_clr; - unsigned long numpages; - unsigned long curpage; - unsigned long pfn; - unsigned intflags; - unsigned intforce_split : 1, - force_static_prot : 1; - struct page **pages; -}; - enum cpa_warn { CPA_CONFLICT, CPA_PROTECT, @@ -59,7 +42,7 @@ static const int cpa_warn_level = CPA_PROTECT; * entries change the page attribute in parallel to some other cpu * splitting a large page entry along with changing the attribute. */ -static
[RFC PATCH v9 05/13] mm: add a user_virt_to_phys symbol
From: Tycho Andersen We need someting like this for testing XPFO. Since it's architecture specific, putting it in the test code is slightly awkward, so let's make it an arch-specific symbol and export it for use in LKDTM. CC: linux-arm-ker...@lists.infradead.org CC: x...@kernel.org Signed-off-by: Tycho Andersen Tested-by: Marco Benatto Signed-off-by: Khalid Aziz Cc: Khalid Aziz --- v7: * make user_virt_to_phys a GPL symbol v6: * add a definition of user_virt_to_phys in the !CONFIG_XPFO case arch/x86/mm/xpfo.c | 57 include/linux/xpfo.h | 4 2 files changed, 61 insertions(+) diff --git a/arch/x86/mm/xpfo.c b/arch/x86/mm/xpfo.c index 3045bb7e4659..b42513347865 100644 --- a/arch/x86/mm/xpfo.c +++ b/arch/x86/mm/xpfo.c @@ -121,3 +121,60 @@ inline void xpfo_flush_kernel_tlb(struct page *page, int order) flush_tlb_kernel_range(kaddr, kaddr + (1 << order) * size); } EXPORT_SYMBOL_GPL(xpfo_flush_kernel_tlb); + +/* Convert a user space virtual address to a physical address. + * Shamelessly copied from slow_virt_to_phys() and lookup_address() in + * arch/x86/mm/pageattr.c + */ +phys_addr_t user_virt_to_phys(unsigned long addr) +{ + phys_addr_t phys_addr; + unsigned long offset; + pgd_t *pgd; + p4d_t *p4d; + pud_t *pud; + pmd_t *pmd; + pte_t *pte; + + pgd = pgd_offset(current->mm, addr); + if (pgd_none(*pgd)) + return 0; + + p4d = p4d_offset(pgd, addr); + if (p4d_none(*p4d)) + return 0; + + if (p4d_large(*p4d) || !p4d_present(*p4d)) { + phys_addr = (unsigned long)p4d_pfn(*p4d) << PAGE_SHIFT; + offset = addr & ~P4D_MASK; + goto out; + } + + pud = pud_offset(p4d, addr); + if (pud_none(*pud)) + return 0; + + if (pud_large(*pud) || !pud_present(*pud)) { + phys_addr = (unsigned long)pud_pfn(*pud) << PAGE_SHIFT; + offset = addr & ~PUD_MASK; + goto out; + } + + pmd = pmd_offset(pud, addr); + if (pmd_none(*pmd)) + return 0; + + if (pmd_large(*pmd) || !pmd_present(*pmd)) { + phys_addr = (unsigned long)pmd_pfn(*pmd) << PAGE_SHIFT; + offset = addr & ~PMD_MASK; + goto out; + } + + pte = 
pte_offset_kernel(pmd, addr); + phys_addr = (phys_addr_t)pte_pfn(*pte) << PAGE_SHIFT; + offset = addr & ~PAGE_MASK; + +out: + return (phys_addr_t)(phys_addr | offset); +} +EXPORT_SYMBOL_GPL(user_virt_to_phys); diff --git a/include/linux/xpfo.h b/include/linux/xpfo.h index c1d232da7ee0..5d8d06e4b796 100644 --- a/include/linux/xpfo.h +++ b/include/linux/xpfo.h @@ -89,6 +89,8 @@ static inline void xpfo_kunmap(void *kaddr, struct page *page) void xpfo_alloc_pages(struct page *page, int order, gfp_t gfp, bool will_map); void xpfo_free_pages(struct page *page, int order); +phys_addr_t user_virt_to_phys(unsigned long addr); + #else /* !CONFIG_XPFO */ static inline void xpfo_init_single_page(struct page *page) { } @@ -102,6 +104,8 @@ static inline void xpfo_free_pages(struct page *page, int order) { } static inline void set_kpte(void *kaddr, struct page *page, pgprot_t prot) { } static inline void xpfo_flush_kernel_tlb(struct page *page, int order) { } +static inline phys_addr_t user_virt_to_phys(unsigned long addr) { return 0; } + #endif /* CONFIG_XPFO */ #if (!defined(CONFIG_HIGHMEM)) && (!defined(ARCH_HAS_KMAP)) -- 2.17.1 ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
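The walk in user_virt_to_phys() follows one pattern at every level: if the entry maps a large block, compose the physical address from that level's PFN plus the offset bits below that level's mask; otherwise descend. That arithmetic can be modeled standalone with a toy two-level walk. The `toy_entry` structure and `toy_virt_to_phys` are invented for this sketch; the real code walks pgd/p4d/pud/pmd/pte with the kernel's accessors.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT 12
#define PMD_SHIFT  21
#define PAGE_MASK  (~((1UL << PAGE_SHIFT) - 1))
#define PMD_MASK   (~((1UL << PMD_SHIFT) - 1))

struct toy_entry {
	uint64_t pfn;   /* physical frame the entry points at */
	int large;      /* nonzero: this level maps a 2 MiB block */
	int present;
};

/* Model of the two interesting cases in user_virt_to_phys(): a large
 * mapping keeps the offset bits below PMD_MASK, a 4 KiB leaf keeps
 * only the bits below PAGE_MASK. Returns 0 for a hole, as the
 * original does. */
static uint64_t toy_virt_to_phys(const struct toy_entry *pmd,
				 const struct toy_entry *pte, uint64_t addr)
{
	if (!pmd->present)
		return 0;
	if (pmd->large)  /* 2 MiB mapping: stop here */
		return (pmd->pfn << PAGE_SHIFT) | (addr & ~PMD_MASK);
	if (!pte->present)
		return 0;
	return (pte->pfn << PAGE_SHIFT) | (addr & ~PAGE_MASK);
}
```

The assembled address is always `(pfn << PAGE_SHIFT) | offset`, exactly the `phys_addr | offset` at the `out:` label of the patch.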
[RFC PATCH v9 01/13] mm: add MAP_HUGETLB support to vm_mmap
From: Tycho Andersen vm_mmap is exported, which means kernel modules can use it. In particular, for testing XPFO support, we want to use it with the MAP_HUGETLB flag, so let's support it via vm_mmap. Signed-off-by: Tycho Andersen Tested-by: Marco Benatto Tested-by: Khalid Aziz Cc: Khalid Aziz --- include/linux/mm.h | 2 ++ mm/mmap.c | 19 +-- mm/util.c | 32 3 files changed, 35 insertions(+), 18 deletions(-) diff --git a/include/linux/mm.h b/include/linux/mm.h index 80bb6408fe73..3e4f6525d06b 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -2412,6 +2412,8 @@ struct vm_unmapped_area_info { extern unsigned long unmapped_area(struct vm_unmapped_area_info *info); extern unsigned long unmapped_area_topdown(struct vm_unmapped_area_info *info); +struct file *map_hugetlb_setup(unsigned long *len, unsigned long flags); + /* * Search for an unmapped address range. * diff --git a/mm/mmap.c b/mm/mmap.c index fc1809b1bed6..65382d942598 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -1582,24 +1582,7 @@ unsigned long ksys_mmap_pgoff(unsigned long addr, unsigned long len, if (unlikely(flags & MAP_HUGETLB && !is_file_hugepages(file))) goto out_fput; } else if (flags & MAP_HUGETLB) { - struct user_struct *user = NULL; - struct hstate *hs; - - hs = hstate_sizelog((flags >> MAP_HUGE_SHIFT) & MAP_HUGE_MASK); - if (!hs) - return -EINVAL; - - len = ALIGN(len, huge_page_size(hs)); - /* -* VM_NORESERVE is used because the reservations will be -* taken when vm_ops->mmap() is called -* A dummy user value is used because we are not locking -* memory so no accounting is necessary -*/ - file = hugetlb_file_setup(HUGETLB_ANON_FILE, len, - VM_NORESERVE, - , HUGETLB_ANONHUGE_INODE, - (flags >> MAP_HUGE_SHIFT) & MAP_HUGE_MASK); + file = map_hugetlb_setup(, flags); if (IS_ERR(file)) return PTR_ERR(file); } diff --git a/mm/util.c b/mm/util.c index 379319b1bcfd..86b763861828 100644 --- a/mm/util.c +++ b/mm/util.c @@ -357,6 +357,29 @@ unsigned long vm_mmap_pgoff(struct file *file, unsigned long 
addr, return ret; } +struct file *map_hugetlb_setup(unsigned long *len, unsigned long flags) +{ + struct user_struct *user = NULL; + struct hstate *hs; + + hs = hstate_sizelog((flags >> MAP_HUGE_SHIFT) & MAP_HUGE_MASK); + if (!hs) + return ERR_PTR(-EINVAL); + + *len = ALIGN(*len, huge_page_size(hs)); + + /* +* VM_NORESERVE is used because the reservations will be +* taken when vm_ops->mmap() is called +* A dummy user value is used because we are not locking +* memory so no accounting is necessary +*/ + return hugetlb_file_setup(HUGETLB_ANON_FILE, *len, + VM_NORESERVE, + , HUGETLB_ANONHUGE_INODE, + (flags >> MAP_HUGE_SHIFT) & MAP_HUGE_MASK); +} + unsigned long vm_mmap(struct file *file, unsigned long addr, unsigned long len, unsigned long prot, unsigned long flag, unsigned long offset) @@ -366,6 +389,15 @@ unsigned long vm_mmap(struct file *file, unsigned long addr, if (unlikely(offset_in_page(offset))) return -EINVAL; + if (flag & MAP_HUGETLB) { + if (file) + return -EINVAL; + + file = map_hugetlb_setup(, flag); + if (IS_ERR(file)) + return PTR_ERR(file); + } + return vm_mmap_pgoff(file, addr, len, prot, flag, offset >> PAGE_SHIFT); } EXPORT_SYMBOL(vm_mmap); -- 2.17.1 ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
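The refactored map_hugetlb_setup() rounds the requested length up with `len = ALIGN(len, huge_page_size(hs))`. The kernel's ALIGN macro is a power-of-two round-up; here is the arithmetic extracted into a standalone form, with the huge page size fixed at 2 MiB purely for illustration (the real size comes from the hstate selected by MAP_HUGE_SHIFT).

```c
#include <assert.h>

/* Power-of-two round-up, same arithmetic as the kernel's ALIGN(). */
#define ALIGN(x, a)  (((x) + (a) - 1) & ~((a) - 1))

/* Assumed 2 MiB huge page size for the example. */
#define HPAGE_SIZE   (2UL << 20)
```

A 1-byte request is charged a full huge page, an exact multiple stays unchanged, and one byte past a multiple rounds up to the next huge page; that is why vm_mmap() callers see their length grow after the MAP_HUGETLB path runs.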
[RFC PATCH v9 07/13] arm64/mm: Add support for XPFO
From: Juerg Haefliger Enable support for eXclusive Page Frame Ownership (XPFO) for arm64 and provide a hook for updating a single kernel page table entry (which is required by the generic XPFO code). XPFO doesn't support section/contiguous mappings yet, so let's disable it if XPFO is turned on. Thanks to Laura Abbot for the simplification from v5, and Mark Rutland for pointing out we need NO_CONT_MAPPINGS too. CC: linux-arm-ker...@lists.infradead.org Signed-off-by: Juerg Haefliger Signed-off-by: Tycho Andersen Signed-off-by: Khalid Aziz Cc: Khalid Aziz --- .../admin-guide/kernel-parameters.txt | 2 +- arch/arm64/Kconfig| 1 + arch/arm64/mm/Makefile| 2 + arch/arm64/mm/mmu.c | 2 +- arch/arm64/mm/xpfo.c | 66 +++ 5 files changed, 71 insertions(+), 2 deletions(-) create mode 100644 arch/arm64/mm/xpfo.c diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index e65e3bc1efe0..9fcf8c83031a 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -2997,7 +2997,7 @@ nox2apic[X86-64,APIC] Do not enable x2APIC mode. - noxpfo [XPFO,X86-64] Disable eXclusive Page Frame + noxpfo [XPFO,X86-64,ARM64] Disable eXclusive Page Frame Ownership (XPFO) when CONFIG_XPFO is on. Physical pages mapped into user applications will also be mapped in the kernel's address space as if diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig index a4168d366127..9a8d8e649cf8 100644 --- a/arch/arm64/Kconfig +++ b/arch/arm64/Kconfig @@ -174,6 +174,7 @@ config ARM64 select SWIOTLB select SYSCTL_EXCEPTION_TRACE select THREAD_INFO_IN_TASK + select ARCH_SUPPORTS_XPFO help ARM 64-bit (AArch64) Linux support. 
diff --git a/arch/arm64/mm/Makefile b/arch/arm64/mm/Makefile index 849c1df3d214..cca3808d9776 100644 --- a/arch/arm64/mm/Makefile +++ b/arch/arm64/mm/Makefile @@ -12,3 +12,5 @@ KASAN_SANITIZE_physaddr.o += n obj-$(CONFIG_KASAN)+= kasan_init.o KASAN_SANITIZE_kasan_init.o:= n + +obj-$(CONFIG_XPFO) += xpfo.o diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c index b6f5aa52ac67..1673f7443d62 100644 --- a/arch/arm64/mm/mmu.c +++ b/arch/arm64/mm/mmu.c @@ -453,7 +453,7 @@ static void __init map_mem(pgd_t *pgdp) struct memblock_region *reg; int flags = 0; - if (rodata_full || debug_pagealloc_enabled()) + if (rodata_full || debug_pagealloc_enabled() || xpfo_enabled()) flags = NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS; /* diff --git a/arch/arm64/mm/xpfo.c b/arch/arm64/mm/xpfo.c new file mode 100644 index ..7866c5acfffb --- /dev/null +++ b/arch/arm64/mm/xpfo.c @@ -0,0 +1,66 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * Copyright (C) 2017 Hewlett Packard Enterprise Development, L.P. + * Copyright (C) 2016 Brown University. All rights reserved. + * + * Authors: + * Juerg Haefliger + * Vasileios P. Kemerlis + * + * This program is free software; you can redistribute it and/or modify it + * under the terms of the GNU General Public License version 2 as published by + * the Free Software Foundation. + */ + +#include +#include + +#include + +/* + * Lookup the page table entry for a virtual address and return a pointer to + * the entry. Based on x86 tree. 
+ */ +static pte_t *lookup_address(unsigned long addr) +{ + pgd_t *pgd; + pud_t *pud; + pmd_t *pmd; + + pgd = pgd_offset_k(addr); + if (pgd_none(*pgd)) + return NULL; + + pud = pud_offset(pgd, addr); + if (pud_none(*pud)) + return NULL; + + pmd = pmd_offset(pud, addr); + if (pmd_none(*pmd)) + return NULL; + + return pte_offset_kernel(pmd, addr); +} + +/* Update a single kernel page table entry */ +inline void set_kpte(void *kaddr, struct page *page, pgprot_t prot) +{ + pte_t *pte = lookup_address((unsigned long)kaddr); + + if (unlikely(!pte)) { + WARN(1, "xpfo: invalid address %p\n", kaddr); + return; + } + + set_pte(pte, pfn_pte(page_to_pfn(page), prot)); +} +EXPORT_SYMBOL_GPL(set_kpte); + +inline void xpfo_flush_kernel_tlb(struct page *page, int order) +{ + unsigned long kaddr = (unsigned long)page_address(page); + unsigned long size = PAGE_SIZE; + + flush_tlb_kernel_range(kaddr, kaddr + (1 << order) * size); +} +EXPORT_SYMBOL_GPL(xpfo_flush_kernel_tlb); -- 2.17.1 ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
[RFC PATCH v9 13/13] xpfo, mm: Optimize XPFO TLB flushes by batching them together
When XPFO forces a TLB flush on all cores, the performance impact is
very significant. Batching as many of these TLB updates as possible can
help lower this impact. When userspace allocates a page, the kernel
tries to get that page from the per-cpu free list. This free list is
replenished in bulk when it runs low. Replenishing the free list for
future allocation to userspace is a good opportunity to update TLB
entries in batch and reduce the impact of multiple TLB flushes later.

This patch adds new tags for the page so a page can be marked as
available for userspace allocation and unmapped from the kernel address
space. All such pages are removed from the kernel address space in bulk
at the time they are added to the per-cpu free list. This patch, when
combined with deferred TLB flushes, improves performance further.

Using the same benchmark as before of building the kernel in parallel,
here are the system times on two differently sized systems:

Hardware: 96-core Intel Xeon Platinum 8160 CPU @ 2.10GHz, 768 GB RAM
make -j60 all

  5.0                                     913.862s
  5.0+XPFO+Deferred flush+Batch update   1165.259s   1.28x

Hardware: 4-core Intel Core i5-3550 CPU @ 3.30GHz, 8G RAM
make -j4 all

  5.0                                     610.642s
  5.0+XPFO+Deferred flush+Batch update    773.075s   1.27x

Signed-off-by: Khalid Aziz
Cc: Khalid Aziz
Signed-off-by: Tycho Andersen
---
v9:
 - Do not map a page freed by userspace back into the kernel. Mark it
   as unmapped instead and map it back in only when needed. This avoids
   the cost of unmap and TLB flush if the page is allocated back to
   userspace.
arch/x86/include/asm/pgtable.h | 2 +- arch/x86/mm/pageattr.c | 9 -- arch/x86/mm/xpfo.c | 11 +-- include/linux/xpfo.h | 11 +++ mm/page_alloc.c| 9 ++ mm/xpfo.c | 54 +++--- 6 files changed, 79 insertions(+), 17 deletions(-) diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h index 5c0e1581fa56..61f64c6c687c 100644 --- a/arch/x86/include/asm/pgtable.h +++ b/arch/x86/include/asm/pgtable.h @@ -1461,7 +1461,7 @@ should_split_large_page(pte_t *kpte, unsigned long address, extern spinlock_t cpa_lock; int __split_large_page(struct cpa_data *cpa, pte_t *kpte, unsigned long address, - struct page *base); + struct page *base, bool xpfo_split); #include #endif /* __ASSEMBLY__ */ diff --git a/arch/x86/mm/pageattr.c b/arch/x86/mm/pageattr.c index 530b5df0617e..8fe86ac6bff0 100644 --- a/arch/x86/mm/pageattr.c +++ b/arch/x86/mm/pageattr.c @@ -911,7 +911,7 @@ static void split_set_pte(struct cpa_data *cpa, pte_t *pte, unsigned long pfn, int __split_large_page(struct cpa_data *cpa, pte_t *kpte, unsigned long address, - struct page *base) + struct page *base, bool xpfo_split) { unsigned long lpaddr, lpinc, ref_pfn, pfn, pfninc = 1; pte_t *pbase = (pte_t *)page_address(base); @@ -1008,7 +1008,10 @@ __split_large_page(struct cpa_data *cpa, pte_t *kpte, unsigned long address, * page attribute in parallel, that also falls into the * just split large page entry. 
*/ - flush_tlb_all(); + if (xpfo_split) + xpfo_flush_tlb_all(); + else + flush_tlb_all(); spin_unlock(_lock); return 0; @@ -1027,7 +1030,7 @@ static int split_large_page(struct cpa_data *cpa, pte_t *kpte, if (!base) return -ENOMEM; - if (__split_large_page(cpa, kpte, address, base)) + if (__split_large_page(cpa, kpte, address, base, false)) __free_page(base); return 0; diff --git a/arch/x86/mm/xpfo.c b/arch/x86/mm/xpfo.c index 638eee5b1f09..8c482c7b54f5 100644 --- a/arch/x86/mm/xpfo.c +++ b/arch/x86/mm/xpfo.c @@ -47,7 +47,7 @@ inline void set_kpte(void *kaddr, struct page *page, pgprot_t prot) cpa.vaddr = kaddr; cpa.pages = - cpa.mask_set = prot; + cpa.mask_set = canon_pgprot(prot); cpa.mask_clr = msk_clr; cpa.numpages = 1; cpa.flags = 0; @@ -57,7 +57,7 @@ inline void set_kpte(void *kaddr, struct page *page, pgprot_t prot) do_split = should_split_large_page(pte, (unsigned long)kaddr, ); - if (do_split) { + if (do_split > 0) { struct page *base; base = alloc_pages(GFP_ATOMIC, 0); @@ -69,7 +69,7 @@ inline void set_kpte(void *kaddr, struct page *page, pgprot_t prot) if (!debug_pagealloc_enabled()) spin_lock(_lock); if (__split_large_page(, pte, (unsigned long)kaddr, - base) <
[RFC PATCH v9 11/13] xpfo, mm: optimize spinlock usage in xpfo_kunmap
From: Julian Stecklina Only the xpfo_kunmap call that needs to actually unmap the page needs to be serialized. We need to be careful to handle the case, where after the atomic decrement of the mapcount, a xpfo_kmap increased the mapcount again. In this case, we can safely skip modifying the page table. Model-checked with up to 4 concurrent callers with Spin. Signed-off-by: Julian Stecklina Signed-off-by: Khalid Aziz Cc: Khalid Aziz Cc: x...@kernel.org Cc: kernel-harden...@lists.openwall.com Cc: Vasileios P. Kemerlis Cc: Juerg Haefliger Cc: Tycho Andersen Cc: Marco Benatto Cc: David Woodhouse --- include/linux/xpfo.h | 24 +++- 1 file changed, 15 insertions(+), 9 deletions(-) diff --git a/include/linux/xpfo.h b/include/linux/xpfo.h index 2318c7eb5fb7..37e7f52fa6ce 100644 --- a/include/linux/xpfo.h +++ b/include/linux/xpfo.h @@ -61,6 +61,7 @@ static inline void xpfo_kmap(void *kaddr, struct page *page) static inline void xpfo_kunmap(void *kaddr, struct page *page) { unsigned long flags; + bool flush_tlb = false; if (!static_branch_unlikely(_inited)) return; @@ -72,18 +73,23 @@ static inline void xpfo_kunmap(void *kaddr, struct page *page) * The page is to be allocated back to user space, so unmap it from * the kernel, flush the TLB and tag it as a user page. */ - spin_lock_irqsave(>xpfo_lock, flags); - if (atomic_dec_return(>xpfo_mapcount) == 0) { -#ifdef CONFIG_XPFO_DEBUG - WARN_ON(PageXpfoUnmapped(page)); -#endif - SetPageXpfoUnmapped(page); - set_kpte(kaddr, page, __pgprot(0)); - xpfo_flush_kernel_tlb(page, 0); + spin_lock_irqsave(>xpfo_lock, flags); + + /* +* In the case, where we raced with kmap after the +* atomic_dec_return, we must not nuke the mapping. 
+*/ + if (atomic_read(>xpfo_mapcount) == 0) { + SetPageXpfoUnmapped(page); + set_kpte(kaddr, page, __pgprot(0)); + flush_tlb = true; + } + spin_unlock_irqrestore(>xpfo_lock, flags); } - spin_unlock_irqrestore(>xpfo_lock, flags); + if (flush_tlb) + xpfo_flush_kernel_tlb(page, 0); } void xpfo_alloc_pages(struct page *page, int order, gfp_t gfp, bool will_map); -- 2.17.1 ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
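The decision logic of the optimized xpfo_kunmap() can be modeled in userspace with C11 atomics: only the caller that drops the mapcount to zero takes the lock, and it must re-check the count under the lock because a racing xpfo_kmap() may have raised it again in the window after the decrement. This is a single-threaded sketch of that decision, with the lock elided and the names `kunmap_decision` and `mapped` invented for the model.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

static atomic_int mapcount;
static bool mapped = true;  /* stands in for a valid kernel PTE */

/* Returns true when this caller actually unmapped the page (and so
 * must flush the TLB, outside the lock, as in the patch). */
static bool kunmap_decision(void)
{
	/* atomic_dec_return() == 0 in the kernel code */
	if (atomic_fetch_sub(&mapcount, 1) - 1 != 0)
		return false;         /* other mappers remain */

	/* spin_lock_irqsave(&page->xpfo_lock, flags) would go here */
	if (atomic_load(&mapcount) == 0) {
		mapped = false;       /* set_kpte(kaddr, page, __pgprot(0)) */
		return true;
	}
	/* raced with kmap after the decrement: keep the mapping */
	return false;
	/* spin_unlock_irqrestore() would go here */
}
```

Moving the xpfo_flush_kernel_tlb() call after the unlock, keyed off the returned flag, is the other half of the patch: the flush is the expensive part and does not need to be serialized.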
[RFC PATCH v9 09/13] xpfo: add primitives for mapping underlying memory
From: Tycho Andersen In some cases (on arm64 DMA and data cache flushes) we may have unmapped the underlying pages needed for something via XPFO. Here are some primitives useful for ensuring the underlying memory is mapped/unmapped in the face of xpfo. Signed-off-by: Tycho Andersen Signed-off-by: Khalid Aziz Cc: Khalid Aziz --- include/linux/xpfo.h | 21 + mm/xpfo.c| 30 ++ 2 files changed, 51 insertions(+) diff --git a/include/linux/xpfo.h b/include/linux/xpfo.h index 5d8d06e4b796..2318c7eb5fb7 100644 --- a/include/linux/xpfo.h +++ b/include/linux/xpfo.h @@ -91,6 +91,15 @@ void xpfo_free_pages(struct page *page, int order); phys_addr_t user_virt_to_phys(unsigned long addr); +#define XPFO_NUM_PAGES(addr, size) \ + (PFN_UP((unsigned long) (addr) + (size)) - \ + PFN_DOWN((unsigned long) (addr))) + +void xpfo_temp_map(const void *addr, size_t size, void **mapping, + size_t mapping_len); +void xpfo_temp_unmap(const void *addr, size_t size, void **mapping, +size_t mapping_len); + #else /* !CONFIG_XPFO */ static inline void xpfo_init_single_page(struct page *page) { } @@ -106,6 +115,18 @@ static inline void xpfo_flush_kernel_tlb(struct page *page, int order) { } static inline phys_addr_t user_virt_to_phys(unsigned long addr) { return 0; } +#define XPFO_NUM_PAGES(addr, size) 0 + +static inline void xpfo_temp_map(const void *addr, size_t size, void **mapping, +size_t mapping_len) +{ +} + +static inline void xpfo_temp_unmap(const void *addr, size_t size, + void **mapping, size_t mapping_len) +{ +} + #endif /* CONFIG_XPFO */ #if (!defined(CONFIG_HIGHMEM)) && (!defined(ARCH_HAS_KMAP)) diff --git a/mm/xpfo.c b/mm/xpfo.c index b74fee0479e7..974f1b70ccd9 100644 --- a/mm/xpfo.c +++ b/mm/xpfo.c @@ -14,6 +14,7 @@ * the Free Software Foundation. 
*/ +#include #include #include #include @@ -104,3 +105,32 @@ void xpfo_free_pages(struct page *page, int order) } } } + +void xpfo_temp_map(const void *addr, size_t size, void **mapping, + size_t mapping_len) +{ + struct page *page = virt_to_page(addr); + int i, num_pages = mapping_len / sizeof(mapping[0]); + + memset(mapping, 0, mapping_len); + + for (i = 0; i < num_pages; i++) { + if (page_to_virt(page + i) >= addr + size) + break; + + if (PageXpfoUnmapped(page + i)) + mapping[i] = kmap_atomic(page + i); + } +} +EXPORT_SYMBOL(xpfo_temp_map); + +void xpfo_temp_unmap(const void *addr, size_t size, void **mapping, +size_t mapping_len) +{ + int i, num_pages = mapping_len / sizeof(mapping[0]); + + for (i = 0; i < num_pages; i++) + if (mapping[i]) + kunmap_atomic(mapping[i]); +} +EXPORT_SYMBOL(xpfo_temp_unmap); -- 2.17.1 ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
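The sizing of the `mapping` array callers pass to xpfo_temp_map() hinges on XPFO_NUM_PAGES(): the number of page frames the byte range [addr, addr + size) touches, which is not simply size / PAGE_SIZE when the range straddles a page boundary. The macro below repeats the patch's arithmetic in standalone form, with PAGE_SHIFT fixed at 12 for the model.

```c
#include <assert.h>

#define PAGE_SHIFT 12
#define PFN_UP(x)    (((x) + (1UL << PAGE_SHIFT) - 1) >> PAGE_SHIFT)
#define PFN_DOWN(x)  ((x) >> PAGE_SHIFT)

/* Same arithmetic as the patch's XPFO_NUM_PAGES(addr, size). */
#define XPFO_NUM_PAGES(addr, size) \
	(PFN_UP((unsigned long)(addr) + (size)) - \
	 PFN_DOWN((unsigned long)(addr)))
```

A 32-byte buffer that starts 16 bytes before a page boundary spans two frames, so the dcache-flush path in patch 10 must be prepared to kmap two pages even for a tiny range.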
[RFC PATCH v9 08/13] swiotlb: Map the buffer if it was unmapped by XPFO
From: Juerg Haefliger XPFO can unmap a bounce buffer. Check for this and map it back in if needed. Signed-off-by: Juerg Haefliger Signed-off-by: Tycho Andersen Signed-off-by: Khalid Aziz Cc: Khalid Aziz Reviewed-by: Konrad Rzeszutek Wilk --- v9: * Added a generic check for whether a page is mapped or not (suggested by Chris Hellwig) v6: * guard against lookup_xpfo() returning NULL include/linux/highmem.h | 7 +++ kernel/dma/swiotlb.c| 3 ++- 2 files changed, 9 insertions(+), 1 deletion(-) diff --git a/include/linux/highmem.h b/include/linux/highmem.h index 59a1a5fa598d..cf21f023dff4 100644 --- a/include/linux/highmem.h +++ b/include/linux/highmem.h @@ -77,6 +77,13 @@ static inline struct page *kmap_to_page(void *addr) } static inline unsigned long totalhigh_pages(void) { return 0UL; } +static inline bool page_is_unmapped(struct page *page) +{ + if (PageHighMem(page) || PageXpfoUnmapped(page)) + return true; + else + return false; +} #endif /* CONFIG_HIGHMEM */ diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c index 1fb6fd68b9c7..90a1a3709b55 100644 --- a/kernel/dma/swiotlb.c +++ b/kernel/dma/swiotlb.c @@ -392,8 +392,9 @@ static void swiotlb_bounce(phys_addr_t orig_addr, phys_addr_t tlb_addr, { unsigned long pfn = PFN_DOWN(orig_addr); unsigned char *vaddr = phys_to_virt(tlb_addr); + struct page *page = pfn_to_page(pfn); - if (PageHighMem(pfn_to_page(pfn))) { + if (page_is_unmapped(page)) { /* The buffer does not have a mapping. Map it in and copy */ unsigned int offset = orig_addr & ~PAGE_MASK; char *buffer; -- 2.17.1 ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
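The generalized check in this patch folds the two "no kernel mapping" cases into one predicate: the bounce path must map the page before copying if it is either a highmem page or was unmapped from the physmap by XPFO. A minimal model (struct page bits reduced to plain bools, names illustrative):

```c
#include <assert.h>
#include <stdbool.h>

struct page_model {
    bool highmem;        /* models PageHighMem()      */
    bool xpfo_unmapped;  /* models PageXpfoUnmapped() */
};

/* Model of the new page_is_unmapped() helper. */
static bool page_is_unmapped(const struct page_model *p)
{
    return p->highmem || p->xpfo_unmapped;
}

/* swiotlb_bounce()-style dispatch: true means the slow
 * map-then-copy path must be taken instead of a direct memcpy. */
static bool needs_map_and_copy(const struct page_model *p)
{
    return page_is_unmapped(p);
}
```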
[RFC PATCH v9 10/13] arm64/mm, xpfo: temporarily map dcache regions
From: Juerg Haefliger If the page is unmapped by XPFO, a data cache flush results in a fatal page fault, so let's temporarily map the region, flush the cache, and then unmap it. CC: linux-arm-ker...@lists.infradead.org Signed-off-by: Juerg Haefliger Signed-off-by: Tycho Andersen --- v6: actually flush in the face of xpfo, and temporarily map the underlying memory so it can be flushed correctly arch/arm64/mm/flush.c | 7 +++ 1 file changed, 7 insertions(+) diff --git a/arch/arm64/mm/flush.c b/arch/arm64/mm/flush.c index 5c9073bace83..114e8bc5a3dc 100644 --- a/arch/arm64/mm/flush.c +++ b/arch/arm64/mm/flush.c @@ -20,6 +20,7 @@ #include #include #include +#include #include #include @@ -28,9 +29,15 @@ void sync_icache_aliases(void *kaddr, unsigned long len) { unsigned long addr = (unsigned long)kaddr; + unsigned long num_pages = XPFO_NUM_PAGES(addr, len); + void *mapping[num_pages]; if (icache_is_aliasing()) { + xpfo_temp_map(kaddr, len, mapping, + sizeof(mapping[0]) * num_pages); __clean_dcache_area_pou(kaddr, len); + xpfo_temp_unmap(kaddr, len, mapping, + sizeof(mapping[0]) * num_pages); __flush_icache_all(); } else { /* -- 2.17.1 ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
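The mapping array above is sized with XPFO_NUM_PAGES(addr, len), i.e. the number of page frames the byte range [addr, addr+len) touches: PFN_UP(addr + size) - PFN_DOWN(addr). A standalone sketch of that arithmetic (PAGE_SHIFT fixed at 12 for illustration):

```c
#include <assert.h>
#include <stddef.h>

#define PAGE_SHIFT  12
#define PAGE_SIZE   (1UL << PAGE_SHIFT)
#define PFN_DOWN(x) ((x) >> PAGE_SHIFT)
#define PFN_UP(x)   (((x) + PAGE_SIZE - 1) >> PAGE_SHIFT)

/* Model of XPFO_NUM_PAGES(addr, size): frames spanned by the range. */
static unsigned long xpfo_num_pages(unsigned long addr, size_t size)
{
    return PFN_UP(addr + size) - PFN_DOWN(addr);
}
```

Note that a two-byte range straddling a page boundary needs two mapping slots, which is why a simple `size / PAGE_SIZE` would be wrong.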
[RFC PATCH v9 06/13] lkdtm: Add test for XPFO
From: Juerg Haefliger This test simply reads from userspace memory via the kernel's linear map. v6: * drop an #ifdef, just let the test fail if XPFO is not supported * add XPFO_SMP test to try and test the case when one CPU does an xpfo unmap of an address, that it can't be used accidentally by other CPUs. Signed-off-by: Juerg Haefliger Signed-off-by: Tycho Andersen Tested-by: Marco Benatto [jstec...@amazon.de: rebased from v4.13 to v4.19] Signed-off-by: Julian Stecklina Tested-by: Khalid Aziz Cc: Khalid Aziz --- drivers/misc/lkdtm/Makefile | 1 + drivers/misc/lkdtm/core.c | 3 + drivers/misc/lkdtm/lkdtm.h | 5 + drivers/misc/lkdtm/xpfo.c | 196 4 files changed, 205 insertions(+) create mode 100644 drivers/misc/lkdtm/xpfo.c diff --git a/drivers/misc/lkdtm/Makefile b/drivers/misc/lkdtm/Makefile index 951c984de61a..97c6b7818cce 100644 --- a/drivers/misc/lkdtm/Makefile +++ b/drivers/misc/lkdtm/Makefile @@ -9,6 +9,7 @@ lkdtm-$(CONFIG_LKDTM) += refcount.o lkdtm-$(CONFIG_LKDTM) += rodata_objcopy.o lkdtm-$(CONFIG_LKDTM) += usercopy.o lkdtm-$(CONFIG_LKDTM) += stackleak.o +lkdtm-$(CONFIG_LKDTM) += xpfo.o KASAN_SANITIZE_stackleak.o := n KCOV_INSTRUMENT_rodata.o := n diff --git a/drivers/misc/lkdtm/core.c b/drivers/misc/lkdtm/core.c index 2837dc77478e..25f4ab4ebf50 100644 --- a/drivers/misc/lkdtm/core.c +++ b/drivers/misc/lkdtm/core.c @@ -185,6 +185,9 @@ static const struct crashtype crashtypes[] = { CRASHTYPE(USERCOPY_KERNEL), CRASHTYPE(USERCOPY_KERNEL_DS), CRASHTYPE(STACKLEAK_ERASING), + CRASHTYPE(XPFO_READ_USER), + CRASHTYPE(XPFO_READ_USER_HUGE), + CRASHTYPE(XPFO_SMP), }; diff --git a/drivers/misc/lkdtm/lkdtm.h b/drivers/misc/lkdtm/lkdtm.h index 3c6fd327e166..6b31ff0c7f8f 100644 --- a/drivers/misc/lkdtm/lkdtm.h +++ b/drivers/misc/lkdtm/lkdtm.h @@ -87,4 +87,9 @@ void lkdtm_USERCOPY_KERNEL_DS(void); /* lkdtm_stackleak.c */ void lkdtm_STACKLEAK_ERASING(void); +/* lkdtm_xpfo.c */ +void lkdtm_XPFO_READ_USER(void); +void lkdtm_XPFO_READ_USER_HUGE(void); +void lkdtm_XPFO_SMP(void); + 
#endif diff --git a/drivers/misc/lkdtm/xpfo.c b/drivers/misc/lkdtm/xpfo.c new file mode 100644 index ..8876128f0144 --- /dev/null +++ b/drivers/misc/lkdtm/xpfo.c @@ -0,0 +1,196 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * This is for all the tests related to XPFO (eXclusive Page Frame Ownership). + */ + +#include "lkdtm.h" + +#include +#include +#include +#include +#include + +#include +#include + +#define XPFO_DATA 0xdeadbeef + +static unsigned long do_map(unsigned long flags) +{ + unsigned long user_addr, user_data = XPFO_DATA; + + user_addr = vm_mmap(NULL, 0, PAGE_SIZE, + PROT_READ | PROT_WRITE | PROT_EXEC, + flags, 0); + if (user_addr >= TASK_SIZE) { + pr_warn("Failed to allocate user memory\n"); + return 0; + } + + if (copy_to_user((void __user *)user_addr, &user_data, +sizeof(user_data))) { + pr_warn("copy_to_user failed\n"); + goto free_user; + } + + return user_addr; + +free_user: + vm_munmap(user_addr, PAGE_SIZE); + return 0; +} + +static unsigned long *user_to_kernel(unsigned long user_addr) +{ + phys_addr_t phys_addr; + void *virt_addr; + + phys_addr = user_virt_to_phys(user_addr); + if (!phys_addr) { + pr_warn("Failed to get physical address of user memory\n"); + return NULL; + } + + virt_addr = phys_to_virt(phys_addr); + if (phys_addr != virt_to_phys(virt_addr)) { + pr_warn("Physical address of user memory seems incorrect\n"); + return NULL; + } + + return virt_addr; + } + +static void read_map(unsigned long *virt_addr) +{ + pr_info("Attempting bad read from kernel address %p\n", virt_addr); + if (*(unsigned long *)virt_addr == XPFO_DATA) + pr_err("FAIL: Bad read succeeded?!\n"); + else + pr_err("FAIL: Bad read didn't fail but data is incorrect?!\n"); +} + +static void read_user_with_flags(unsigned long flags) +{ + unsigned long user_addr, *kernel; + + user_addr = do_map(flags); + if (!user_addr) { + pr_err("FAIL: map failed\n"); + return; + } + + kernel = user_to_kernel(user_addr); + if (!kernel) { + pr_err("FAIL: user to kernel conversion failed\n"); + 
goto free_user; + } + + read_map(kernel); + +free_user: + vm_munmap(user_addr, PAGE_SIZE); +} + +/* Read from userspace via the kernel's linear map. */ +void lkdtm_XPFO_READ_USER(void) +{ + read_user_with_flags(MAP_PRIVATE | MAP_ANONYMOUS); +} + +void lkdtm_XPFO_READ_USER_HUGE(void) +{ +
[RFC PATCH v9 12/13] xpfo, mm: Defer TLB flushes for non-current CPUs (x86 only)
XPFO flushes kernel space TLB entries for pages that are now mapped in userspace on not only the current CPU but also all other CPUs synchronously. Processes on each core allocating pages cause a flood of IPI messages to all other cores to flush TLB entries. Many of these messages are to flush the entire TLB on the core if the number of entries being flushed from the local core exceeds tlb_single_page_flush_ceiling. The cost of TLB flushes caused by unmapping pages from physmap goes up dramatically on machines with high core counts. This patch flushes relevant TLB entries for the current process, or the entire TLB, depending upon the number of entries for the current CPU, and posts a pending TLB flush on all other CPUs when a page is unmapped from kernel space and mapped in userspace. Each core checks its pending TLB flush flag on every context switch, flushes its TLB if the flag is set and clears it. This patch potentially aggregates multiple TLB flushes into one. This has a very significant impact, especially on machines with large core counts. To illustrate this, the kernel was compiled with "make -j" on two classes of machines - a server with a high core count and a large amount of memory, and a desktop-class machine with more modest specs. System time from "make -j" with the vanilla 4.20 kernel, 4.20 with the XPFO patches before applying this patch, and after applying this patch is below:

Hardware: 96-core Intel Xeon Platinum 8160 CPU @ 2.10GHz, 768 GB RAM
make -j60 all:
4.20                         950.966s
4.20+XPFO                  25073.169s   26.366x
4.20+XPFO+Deferred flush    1372.874s    1.44x

Hardware: 4-core Intel Core i5-3550 CPU @ 3.30GHz, 8G RAM
make -j4 all:
4.20                         607.671s
4.20+XPFO                   1588.646s    2.614x
4.20+XPFO+Deferred flush     803.989s    1.32x

This same code should be implemented for other architectures as well once finalized. 
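The deferred-flush scheme in this commit message can be modeled in plain userspace C. This is only a sketch: the pending_xpfo_flush cpumask and the context-switch hook are modeled with arrays and plain functions, nothing here is kernel API.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_CPUS 4

static bool pending_flush[NR_CPUS]; /* models the pending_xpfo_flush cpumask */
static int  flush_count[NR_CPUS];   /* how often each CPU actually flushed */

static void flush_local_tlb(int cpu)
{
    flush_count[cpu]++;
}

/* Unmap path: flush the local TLB now, only *post* a flush everywhere else. */
static void xpfo_flush_kernel_range_model(int this_cpu)
{
    flush_local_tlb(this_cpu);
    for (int cpu = 0; cpu < NR_CPUS; cpu++)
        if (cpu != this_cpu)
            pending_flush[cpu] = true;   /* no IPI is sent */
}

/* Context-switch path, run by every CPU on its own schedule. */
static void context_switch_model(int cpu)
{
    if (pending_flush[cpu]) {
        pending_flush[cpu] = false;
        flush_local_tlb(cpu);
    }
}
```

The batching effect is visible directly: three back-to-back unmaps from one CPU leave every other CPU with a single pending flag, so each pays for at most one flush at its next context switch instead of three IPI-driven flushes.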
Signed-off-by: Khalid Aziz Cc: Khalid Aziz --- arch/x86/include/asm/tlbflush.h | 1 + arch/x86/mm/tlb.c | 52 + arch/x86/mm/xpfo.c | 2 +- 3 files changed, 54 insertions(+), 1 deletion(-) diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h index f4204bf377fc..92d23629d01d 100644 --- a/arch/x86/include/asm/tlbflush.h +++ b/arch/x86/include/asm/tlbflush.h @@ -561,6 +561,7 @@ extern void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start, unsigned long end, unsigned int stride_shift, bool freed_tables); extern void flush_tlb_kernel_range(unsigned long start, unsigned long end); +extern void xpfo_flush_tlb_kernel_range(unsigned long start, unsigned long end); static inline void flush_tlb_page(struct vm_area_struct *vma, unsigned long a) { diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c index 999d6d8f0bef..cc806a01a0eb 100644 --- a/arch/x86/mm/tlb.c +++ b/arch/x86/mm/tlb.c @@ -37,6 +37,20 @@ */ #define LAST_USER_MM_IBPB 0x1UL +/* + * A TLB flush may be needed to flush stale TLB entries + * for pages that have been mapped into userspace and unmapped + * from kernel space. This TLB flush needs to be propagated to + * all CPUs. Asynchronous flush requests to all CPUs can cause + * significant performance impact. Queue a pending flush for + * a CPU instead. Multiple such requests can then be handled + * by a CPU at a less disruptive time, like context switch, in + * one go and reduce performance impact significantly. Following + * data structure is used to keep track of CPUs with pending full + * TLB flush forced by xpfo. + */ +static cpumask_t pending_xpfo_flush; + /* * We get here when we do something requiring a TLB invalidation * but could not go invalidate all of the contexts. We do the @@ -321,6 +335,16 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next, __flush_tlb_all(); } #endif + + /* +* If there is a pending TLB flush for this CPU due to XPFO +* flush, do it now. 
+*/ + if (cpumask_test_and_clear_cpu(cpu, &pending_xpfo_flush)) { + count_vm_tlb_event(NR_TLB_REMOTE_FLUSH_RECEIVED); + __flush_tlb_all(); + } + this_cpu_write(cpu_tlbstate.is_lazy, false); /* @@ -803,6 +827,34 @@ void flush_tlb_kernel_range(unsigned long start, unsigned long end) } } +void xpfo_flush_tlb_kernel_range(unsigned long start, unsigned long end) +{ + struct cpumask tmp_mask; + + /* +* Balance as user space task's flush, a bit conservative. +* Do a local flush immediately and post a pending flush on all +* other CPUs. Local flush can be a range flush or full flush +* depending upon the number of entries to be flushed. Remote +* flushes will be done by individual processors at the time of +* context switch and this allows multiple flush requests from +
[RFC PATCH v9 03/13] mm: Add support for eXclusive Page Frame Ownership (XPFO)
From: Juerg Haefliger This patch adds basic support infrastructure for XPFO which protects against 'ret2dir' kernel attacks. The basic idea is to enforce exclusive ownership of page frames by either the kernel or userspace, unless explicitly requested by the kernel. Whenever a page destined for userspace is allocated, it is unmapped from physmap (the kernel's page table). When such a page is reclaimed from userspace, it is mapped back to physmap. Individual architectures can enable full XPFO support using this infrastructure by supplying architecture specific pieces. Additional fields in the page struct are used for XPFO housekeeping, specifically: - two flags to distinguish user vs. kernel pages and to tag unmapped pages. - a reference counter to balance kmap/kunmap operations. - a lock to serialize access to the XPFO fields. This patch is based on the work of Vasileios P. Kemerlis et al. who published their work in this paper: http://www.cs.columbia.edu/~vpk/papers/ret2dir.sec14.pdf CC: x...@kernel.org Suggested-by: Vasileios P. Kemerlis Signed-off-by: Juerg Haefliger Signed-off-by: Tycho Andersen Signed-off-by: Marco Benatto [jstec...@amazon.de: encode all XPFO info in struct page] Signed-off-by: Julian Stecklina Signed-off-by: Khalid Aziz Cc: Khalid Aziz --- v9: * Do not use page extensions. Encode all xpfo information in struct page (Julian Stecklina). * Split architecture specific code into its own separate patch * Move kmap*() to include/linux/xpfo.h for cleaner code as suggested for an earlier version of this patch * Use irq versions of spin_lock to address possible deadlock around xpfo_lock caused by interrupts. * Incorporated various feedback provided on v6 patch way back. 
v6: * use flush_tlb_kernel_range() instead of __flush_tlb_one, so we flush the tlb entry on all CPUs when unmapping it in kunmap * handle lookup_page_ext()/lookup_xpfo() returning NULL * drop lots of BUG()s in favor of WARN() * don't disable irqs in xpfo_kmap/xpfo_kunmap, export __split_large_page so we can do our own alloc_pages(GFP_ATOMIC) to pass it .../admin-guide/kernel-parameters.txt | 6 + include/linux/highmem.h | 31 +--- include/linux/mm_types.h | 8 + include/linux/page-flags.h| 23 ++- include/linux/xpfo.h | 147 ++ include/trace/events/mmflags.h| 10 +- mm/Makefile | 1 + mm/compaction.c | 2 +- mm/internal.h | 2 +- mm/page_alloc.c | 10 +- mm/page_isolation.c | 2 +- mm/xpfo.c | 106 + security/Kconfig | 27 13 files changed, 337 insertions(+), 38 deletions(-) create mode 100644 include/linux/xpfo.h create mode 100644 mm/xpfo.c diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index 858b6c0b9a15..9b36da94760e 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -2997,6 +2997,12 @@ nox2apic[X86-64,APIC] Do not enable x2APIC mode. + noxpfo [XPFO] Disable eXclusive Page Frame Ownership (XPFO) + when CONFIG_XPFO is on. Physical pages mapped into + user applications will also be mapped in the + kernel's address space as if CONFIG_XPFO was not + enabled. + cpu0_hotplug[X86] Turn on CPU0 hotplug feature when CONFIG_BOOTPARAM_HOTPLUG_CPU0 is off. Some features depend on CPU0. 
Known dependencies are: diff --git a/include/linux/highmem.h b/include/linux/highmem.h index ea5cdbd8c2c3..59a1a5fa598d 100644 --- a/include/linux/highmem.h +++ b/include/linux/highmem.h @@ -8,6 +8,7 @@ #include #include #include +#include #include @@ -77,36 +78,6 @@ static inline struct page *kmap_to_page(void *addr) static inline unsigned long totalhigh_pages(void) { return 0UL; } -#ifndef ARCH_HAS_KMAP -static inline void *kmap(struct page *page) -{ - might_sleep(); - return page_address(page); -} - -static inline void kunmap(struct page *page) -{ -} - -static inline void *kmap_atomic(struct page *page) -{ - preempt_disable(); - pagefault_disable(); - return page_address(page); -} -#define kmap_atomic_prot(page, prot) kmap_atomic(page) - -static inline void __kunmap_atomic(void *addr) -{ - pagefault_enable(); - preempt_enable(); -} - -#define kmap_atomic_pfn(pfn) kmap_atomic(pfn_to_page(pfn)) - -#define kmap_flush_unused()do {} while(0) -#endif - #endif /* CONFIG_HIGHMEM */ #if defined(CONFIG_HIGHMEM) ||
[RFC PATCH v9 00/13] Add support for eXclusive Page Frame Ownership
This is another update to the work Juerg, Tycho and Julian have done on XPFO. After the last round of updates, we were seeing very significant performance penalties when stale TLB entries were flushed actively after an XPFO TLB update. The benchmark for measuring performance is a kernel build using parallel make. To get full protection from ret2dir attacks, we must flush stale TLB entries. The performance penalty from flushing stale TLB entries goes up as the number of cores goes up. On a desktop-class machine with only 4 cores, enabling TLB flush for stale entries causes system time for "make -j4" to go up by a factor of 2.61x, but on a larger machine with 96 cores, system time with "make -j60" goes up by a factor of 26.37x! I have been working on reducing this performance penalty. I implemented two solutions to reduce it, and they have had a large impact. XPFO code flushes the TLB every time a page is allocated to userspace. It does so by sending IPIs to all processors to flush the TLB. Back-to-back allocations of pages to userspace on multiple processors result in a storm of IPIs. Each one of these incoming IPIs is handled by a processor by flushing its TLB. To reduce this IPI storm, I have added a per-CPU flag that can be set to tell a processor to flush its TLB. A processor checks this flag on every context switch. If the flag is set, it flushes its TLB and clears the flag. This allows multiple TLB flush requests to a single CPU to be combined into a single request. Unlike in the previous version of this patch, a kernel TLB entry for a page that has been allocated to userspace is flushed on all processors. A processor could hold a stale kernel TLB entry that was removed on another processor until the next context switch. A local userspace page allocation by the currently running process could force the TLB flush earlier for such entries. 
The other solution reduces the number of TLB flushes required by performing a TLB flush for multiple pages at one time when pages are refilled on the per-cpu freelist. If the pages being added to the per-cpu freelist are marked for userspace allocation, TLB entries for these pages can be flushed upfront and the pages tagged as currently unmapped. When any such page is allocated to userspace, there is no need to perform a TLB flush at that time any more. This batching of TLB flushes reduces the performance impact further. Similarly, when these user pages are freed by userspace and added back to the per-cpu free list, they are left unmapped and tagged so. This further optimization reduced the performance impact from 1.32x to 1.28x for the 96-core server and from 1.31x to 1.27x for the 4-core desktop. I measured system time for parallel make with the unmodified kernel, with the XPFO patches before these patches, and then again after applying each of these patches. Here are the results:

Hardware: 96-core Intel Xeon Platinum 8160 CPU @ 2.10GHz, 768 GB RAM
make -j60 all:
5.0                       913.862s
5.0+this patch series    1165.259s   1.28x

Hardware: 4-core Intel Core i5-3550 CPU @ 3.30GHz, 8G RAM
make -j4 all:
5.0                       610.642s
5.0+this patch series     773.075s   1.27x

Performance with this patch set is good enough to use these as a starting point for further refinement before we merge it into the main kernel, hence RFC. I have restructured the patches in this version to separate out architecture-independent code. I folded much of the code improvement by Julian to not use page extensions into patch 3. What remains to be done beyond this patch series: 1. Performance improvements: Ideas to explore - (1) kernel mappings private to an mm, (2) Any others?? 2. Re-evaluate the patch "arm64/mm: Add support for XPFO to swiotlb" from Juerg. I dropped it for now since the swiotlb code for ARM has changed a lot since this patch was written. I could use help from ARM experts on this. 3. 
Extend the patch "xpfo, mm: Defer TLB flushes for non-current CPUs" to other architectures besides x86. 4. Change kmap to not map the page back to physmap, instead map it to a new va similar to what kmap_high does. Mapping page back into physmap re-opens the ret2dir security for the duration of kmap. All of the kmap_high and related code can be reused for this but that will require restructuring that code so it can be built for 64-bits as well. Any objections to that? - Juerg Haefliger (6): mm: Add support for eXclusive Page Frame Ownership (XPFO) xpfo, x86: Add support for XPFO for x86-64 lkdtm: Add test for XPFO arm64/mm: Add support for XPFO swiotlb: Map the buffer if it was unmapped by XPFO arm64/mm, xpfo: temporarily map dcache regions Julian Stecklina (1): xpfo, mm: optimize spinlock usage in xpfo_kunmap Khalid Aziz (2): xpfo, mm: Defer TLB flushes for non-current CPUs (x86 only) xpfo, mm: Optimize XPFO TLB flushes by
[RFC PATCH v9 02/13] x86: always set IF before oopsing from page fault
From: Tycho Andersen Oopsing might kill the task, via rewind_stack_do_exit() at the bottom, and that might sleep: Aug 23 19:30:27 xpfo kernel: [ 38.302714] BUG: sleeping function called from invalid context at ./include/linux/percpu-rwsem.h:33 Aug 23 19:30:27 xpfo kernel: [ 38.303837] in_atomic(): 0, irqs_disabled(): 1, pid: 1970, name: lkdtm_xpfo_test Aug 23 19:30:27 xpfo kernel: [ 38.304758] CPU: 3 PID: 1970 Comm: lkdtm_xpfo_test Tainted: G D 4.13.0-rc5+ #228 Aug 23 19:30:27 xpfo kernel: [ 38.305813] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.1-1ubuntu1 04/01/2014 Aug 23 19:30:27 xpfo kernel: [ 38.306926] Call Trace: Aug 23 19:30:27 xpfo kernel: [ 38.307243] dump_stack+0x63/0x8b Aug 23 19:30:27 xpfo kernel: [ 38.307665] ___might_sleep+0xec/0x110 Aug 23 19:30:27 xpfo kernel: [ 38.308139] __might_sleep+0x45/0x80 Aug 23 19:30:27 xpfo kernel: [ 38.308593] exit_signals+0x21/0x1c0 Aug 23 19:30:27 xpfo kernel: [ 38.309046] ? blocking_notifier_call_chain+0x11/0x20 Aug 23 19:30:27 xpfo kernel: [ 38.309677] do_exit+0x98/0xbf0 Aug 23 19:30:27 xpfo kernel: [ 38.310078] ? smp_reader+0x27/0x40 [lkdtm] Aug 23 19:30:27 xpfo kernel: [ 38.310604] ? kthread+0x10f/0x150 Aug 23 19:30:27 xpfo kernel: [ 38.311045] ? read_user_with_flags+0x60/0x60 [lkdtm] Aug 23 19:30:27 xpfo kernel: [ 38.311680] rewind_stack_do_exit+0x17/0x20 To be safe, let's just always enable irqs. The particular case I'm hitting is: Aug 23 19:30:27 xpfo kernel: [ 38.278615] __bad_area_nosemaphore+0x1a9/0x1d0 Aug 23 19:30:27 xpfo kernel: [ 38.278617] bad_area_nosemaphore+0xf/0x20 Aug 23 19:30:27 xpfo kernel: [ 38.278618] __do_page_fault+0xd1/0x540 Aug 23 19:30:27 xpfo kernel: [ 38.278620] ? irq_work_queue+0x9b/0xb0 Aug 23 19:30:27 xpfo kernel: [ 38.278623] ? 
wake_up_klogd+0x36/0x40 Aug 23 19:30:27 xpfo kernel: [ 38.278624] trace_do_page_fault+0x3c/0xf0 Aug 23 19:30:27 xpfo kernel: [ 38.278625] do_async_page_fault+0x14/0x60 Aug 23 19:30:27 xpfo kernel: [ 38.278627] async_page_fault+0x28/0x30 When a fault is in kernel space which has been triggered by XPFO. Signed-off-by: Tycho Andersen CC: x...@kernel.org Tested-by: Khalid Aziz Cc: Khalid Aziz --- arch/x86/mm/fault.c | 6 ++ 1 file changed, 6 insertions(+) diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c index 9d5c75f02295..7891add0913f 100644 --- a/arch/x86/mm/fault.c +++ b/arch/x86/mm/fault.c @@ -858,6 +858,12 @@ no_context(struct pt_regs *regs, unsigned long error_code, /* Executive summary in case the body of the oops scrolled away */ printk(KERN_DEFAULT "CR2: %016lx\n", address); + /* +* We're about to oops, which might kill the task. Make sure we're +* allowed to sleep. +*/ + flags |= X86_EFLAGS_IF; + oops_end(flags, regs, sig); } -- 2.17.1 ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
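The one-line fix is easy to reason about in isolation: whatever interrupt state was saved at fault entry, OR-ing X86_EFLAGS_IF into the saved flags guarantees the state oops_end() restores has interrupts enabled, so the do_exit() path that follows is allowed to sleep. A trivial model of that bit manipulation (the IF bit value is from the x86 architecture manuals; the function names are illustrative):

```c
#include <assert.h>
#include <stdint.h>

/* EFLAGS.IF is bit 9 on x86. */
#define X86_EFLAGS_IF 0x200UL

/* Model of the fix: force IF on in the flags oops_end() will restore,
 * leaving every other saved bit untouched. */
static uint64_t flags_for_oops(uint64_t saved_flags)
{
    return saved_flags | X86_EFLAGS_IF;
}

static int irqs_enabled(uint64_t flags)
{
    return (flags & X86_EFLAGS_IF) != 0;
}
```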
Re: remove NULL struct device support in the DMA API
On Wed, Apr 03, 2019 at 07:26:40PM +0100, Russell King - ARM Linux admin wrote: > On Thu, Mar 21, 2019 at 03:52:28PM -0700, Christoph Hellwig wrote: > > We still have a few drivers which pass a NULL struct device pointer > > to DMA API functions, which generally is a bad idea as the API > > implementations rely on the device not only for ops selection, but > > also the dma mask and various other attributes, and many implementations > > have been broken for NULL device support for a while. > > I think I must be missing something, but... > > My understanding is that ISA DMA is normally limited to 24 bits of > address Yes. > - indeed, the x86 version only programs 24 bits of DMA address. > Looking through this series, it appears that the conversions mean that > the DMA mask for ISA becomes the full all-ones DMA mask, which would > of course lead to memory corruption if only 24 bits of the address end > up being programmed into the hardware. In the generic dma mapping code no struct device has always meant a 32-bit DMA mask - take a look at the dma_get_mask() function. > Maybe you could say why you think this series is safe in regard to ISA > DMA? ISA DMA has always been rather painful in a myriad of ways, and the DMA API so far hasn't helped, given that we don't do bounce buffering for the 24-bit limit, but just the higher limits. So far even if you do use the DMA API and pass a device ISA DMA so far always meant that the higher layers had to assure things are addressable, either by using GFP_DMA allocation in the drivers, or mid-layer hacks like the unchecked_isa_dma flag in SCSI and/or BLK_BOUNCE_ISA in the block layer. This series doesn't change those facts at all. I have some half started series to clean some of this up but it isn't high up on the priority list. ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
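Christoph's point about dma_get_mask() — that in the generic code a missing device (or an unset mask) has always meant a 32-bit mask, never "all ones" — can be captured in a few lines. This is a simplified userspace model, not the actual kernel source:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Same shape as the kernel's DMA_BIT_MASK() macro. */
#define DMA_BIT_MASK(n) (((n) == 64) ? ~0ULL : ((1ULL << (n)) - 1))

struct device_model {
    uint64_t *dma_mask;
};

/* Simplified model of dma_get_mask(): with no device or no mask set,
 * fall back to a 32-bit mask rather than an unlimited one. */
static uint64_t dma_get_mask_model(const struct device_model *dev)
{
    if (dev && dev->dma_mask && *dev->dma_mask)
        return *dev->dma_mask;
    return DMA_BIT_MASK(32);
}
```

This is why removing NULL-device support does not silently widen masks: callers that previously passed NULL were already getting the 32-bit default, and ISA's 24-bit limit was never enforced by the mask in either case.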
Re: 5.1-rc1: mpt init crash in scsi_map_dma, dma_4v_map_sg on sparc64
I think this might have been this commit: commit 24132a419c68f1d69eb8ecc91b3c80d730ecbb59 Author: Christoph Hellwig Date: Fri Feb 15 09:30:28 2019 +0100 sparc64/pci_sun4v: allow large DMA masks the patch below adds a few missing checks and hopefully should fix your problem. If not can you try to revert the commit to check if my theory was correct to start with? Date: Wed Apr 3 21:34:34 2019 +0200 diff --git a/arch/sparc/kernel/pci_sun4v.c b/arch/sparc/kernel/pci_sun4v.c index a8af6023c126..14b93c5564e3 100644 --- a/arch/sparc/kernel/pci_sun4v.c +++ b/arch/sparc/kernel/pci_sun4v.c @@ -73,6 +73,11 @@ static inline void iommu_batch_start(struct device *dev, unsigned long prot, uns p->npages = 0; } +static inline bool iommu_use_atu(struct iommu *iommu, u64 mask) +{ + return iommu->atu && mask > DMA_BIT_MASK(32); +} + /* Interrupts must be disabled. */ static long iommu_batch_flush(struct iommu_batch *p, u64 mask) { @@ -92,7 +97,7 @@ static long iommu_batch_flush(struct iommu_batch *p, u64 mask) prot &= (HV_PCI_MAP_ATTR_READ | HV_PCI_MAP_ATTR_WRITE); while (npages != 0) { - if (mask <= DMA_BIT_MASK(32) || !pbm->iommu->atu) { + if (!iommu_use_atu(pbm->iommu, mask)) { num = pci_sun4v_iommu_map(devhandle, HV_PCI_TSBID(0, entry), npages, @@ -179,7 +184,6 @@ static void *dma_4v_alloc_coherent(struct device *dev, size_t size, unsigned long flags, order, first_page, npages, n; unsigned long prot = 0; struct iommu *iommu; - struct atu *atu; struct iommu_map_table *tbl; struct page *page; void *ret; @@ -205,13 +209,11 @@ static void *dma_4v_alloc_coherent(struct device *dev, size_t size, memset((char *)first_page, 0, PAGE_SIZE << order); iommu = dev->archdata.iommu; - atu = iommu->atu; - mask = dev->coherent_dma_mask; - if (mask <= DMA_BIT_MASK(32) || !atu) + if (!iommu_use_atu(iommu, mask)) tbl = >tbl; else - tbl = >tbl; + tbl = >atu->tbl; entry = iommu_tbl_range_alloc(dev, tbl, npages, NULL, (unsigned long)(-1), 0); @@ -333,7 +335,7 @@ static void dma_4v_free_coherent(struct 
device *dev, size_t size, void *cpu, atu = iommu->atu; devhandle = pbm->devhandle; - if (dvma <= DMA_BIT_MASK(32)) { + if (!iommu_use_atu(iommu, dvma)) { tbl = &iommu->tbl; iotsb_num = 0; /* we don't care for legacy iommu */ } else { @@ -374,7 +376,7 @@ static dma_addr_t dma_4v_map_page(struct device *dev, struct page *page, npages >>= IO_PAGE_SHIFT; mask = *dev->dma_mask; - if (mask <= DMA_BIT_MASK(32)) + if (!iommu_use_atu(iommu, mask)) tbl = &iommu->tbl; else tbl = &atu->tbl; @@ -510,7 +512,7 @@ static int dma_4v_map_sg(struct device *dev, struct scatterlist *sglist, IO_PAGE_SIZE) >> IO_PAGE_SHIFT; mask = *dev->dma_mask; - if (mask <= DMA_BIT_MASK(32)) + if (!iommu_use_atu(iommu, mask)) tbl = &iommu->tbl; else tbl = &atu->tbl; ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
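The essence of the proposed iommu_use_atu() helper is that the large-address ATU table is only valid to pick when the hardware has an ATU *and* the mask is wider than 32 bits; checking only the mask, as the old code did in some paths, selects the ATU table on ATU-less machines. A sketch of that selection logic (struct contents reduced to the one relevant field; names illustrative):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define DMA_BIT_MASK(n) (((n) == 64) ? ~0ULL : ((1ULL << (n)) - 1))

struct iommu_model {
    bool has_atu;   /* models iommu->atu != NULL */
};

enum which_tbl { LEGACY_TBL, ATU_TBL };

/* Model of the iommu_use_atu() helper from the patch above. */
static bool iommu_use_atu(const struct iommu_model *iommu, uint64_t mask)
{
    return iommu->has_atu && mask > DMA_BIT_MASK(32);
}

/* Table selection as done in dma_4v_map_page()/dma_4v_map_sg(). */
static enum which_tbl pick_tbl(const struct iommu_model *iommu, uint64_t mask)
{
    return iommu_use_atu(iommu, mask) ? ATU_TBL : LEGACY_TBL;
}
```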
Re: remove NULL struct device support in the DMA API
On Thu, Mar 21, 2019 at 03:52:28PM -0700, Christoph Hellwig wrote: > We still have a few drivers which pass a NULL struct device pointer > to DMA API functions, which generally is a bad idea as the API > implementations rely on the device not only for ops selection, but > also the dma mask and various other attributes, and many implementations > have been broken for NULL device support for a while. I think I must be missing something, but... My understanding is that ISA DMA is normally limited to 24 bits of address - indeed, the x86 version only programs 24 bits of DMA address. Looking through this series, it appears that the conversions mean that the DMA mask for ISA becomes the full all-ones DMA mask, which would of course lead to memory corruption if only 24 bits of the address end up being programmed into the hardware. Maybe you could say why you think this series is safe in regard to ISA DMA? > > This series removes the few remaning users that weren't picked up in > the last merge window and then removes core support for this "feature". > > A git tree is also available at: > > git://git.infradead.org/users/hch/misc.git dma-remove-NULL-dev-support > > Gitweb: > > > http://git.infradead.org/users/hch/misc.git/shortlog/refs/heads/dma-remove-NULL-dev-support > -- RMK's Patch system: https://www.armlinux.org.uk/developer/patches/ FTTC broadband for 0.8mile line in suburbia: sync at 12.1Mbps down 622kbps up According to speedtest.net: 11.9Mbps down 500kbps up ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
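Russell's concern can be stated as a predicate: an ISA DMA controller latches only 24 address bits, so a buffer is only safe when its entire range fits under DMA_BIT_MASK(24); with an all-ones mask the check passes for high buffers too, and the truncated address would then hit the wrong memory. A minimal sketch of such an addressability check (illustrative, not the kernel's dma_capable()):

```c
#include <assert.h>
#include <stdint.h>

#define DMA_BIT_MASK(n) (((n) == 64) ? ~0ULL : ((1ULL << (n)) - 1))

/* True if the whole buffer [addr, addr+size) is reachable under mask. */
static int dma_capable_model(uint64_t mask, uint64_t addr, uint64_t size)
{
    return addr + size - 1 <= mask;
}
```

The hazard is visible in the last test below: under an unlimited mask a buffer above 16 MiB is "capable", even though 24-bit hardware would wrap its address.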
[PATCH v2 4/6] iommu/ipmmu-vmsa: Move num_utlbs to SoC-specific features
The maximum number of micro-TLBs per IPMMU instance is not fixed, but depends on the SoC type. Hence move it from struct ipmmu_vmsa_device to struct ipmmu_features, and set up the correct value for both R-Car Gen2 and Gen3 SoCs. Note that currently no code uses this value. Signed-off-by: Geert Uytterhoeven Reviewed-by: Laurent Pinchart --- v2: - Add Reviewed-by. --- drivers/iommu/ipmmu-vmsa.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/drivers/iommu/ipmmu-vmsa.c b/drivers/iommu/ipmmu-vmsa.c index 87acf86f295fac0d..3fa57627b1e35562 100644 --- a/drivers/iommu/ipmmu-vmsa.c +++ b/drivers/iommu/ipmmu-vmsa.c @@ -42,6 +42,7 @@ struct ipmmu_features { bool use_ns_alias_offset; bool has_cache_leaf_nodes; unsigned int number_of_contexts; + unsigned int num_utlbs; bool setup_imbuscr; bool twobit_imttbcr_sl0; bool reserved_context; @@ -53,7 +54,6 @@ struct ipmmu_vmsa_device { struct iommu_device iommu; struct ipmmu_vmsa_device *root; const struct ipmmu_features *features; - unsigned int num_utlbs; unsigned int num_ctx; spinlock_t lock;/* Protects ctx and domains[] */ DECLARE_BITMAP(ctx, IPMMU_CTX_MAX); @@ -972,6 +972,7 @@ static const struct ipmmu_features ipmmu_features_default = { .use_ns_alias_offset = true, .has_cache_leaf_nodes = false, .number_of_contexts = 1, /* software only tested with one context */ + .num_utlbs = 32, .setup_imbuscr = true, .twobit_imttbcr_sl0 = false, .reserved_context = false, @@ -981,6 +982,7 @@ static const struct ipmmu_features ipmmu_features_rcar_gen3 = { .use_ns_alias_offset = false, .has_cache_leaf_nodes = true, .number_of_contexts = 8, + .num_utlbs = 48, .setup_imbuscr = false, .twobit_imttbcr_sl0 = true, .reserved_context = true, @@ -1033,7 +1035,6 @@ static int ipmmu_probe(struct platform_device *pdev) } mmu->dev = &pdev->dev; - mmu->num_utlbs = 48; spin_lock_init(&mmu->lock); bitmap_zero(mmu->ctx, IPMMU_CTX_MAX); mmu->features = of_device_get_match_data(&pdev->dev); -- 2.17.1 ___ iommu mailing list 
iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
[PATCH v2 3/6] iommu/ipmmu-vmsa: Make IPMMU_CTX_MAX unsigned
Make the IPMMU_CTX_MAX constant unsigned, to match the type of
ipmmu_features.number_of_contexts. This allows the use of plain min()
instead of type-casting min_t().

Signed-off-by: Geert Uytterhoeven
Reviewed-by: Laurent Pinchart
---
v2:
  - Add Reviewed-by.
---
 drivers/iommu/ipmmu-vmsa.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/drivers/iommu/ipmmu-vmsa.c b/drivers/iommu/ipmmu-vmsa.c
index f2061bd1dc7b8852..87acf86f295fac0d 100644
--- a/drivers/iommu/ipmmu-vmsa.c
+++ b/drivers/iommu/ipmmu-vmsa.c
@@ -36,7 +36,7 @@
 #define arm_iommu_detach_device(...)	do {} while (0)
 #endif

-#define IPMMU_CTX_MAX 8
+#define IPMMU_CTX_MAX 8U

 struct ipmmu_features {
 	bool use_ns_alias_offset;
@@ -1060,8 +1060,7 @@ static int ipmmu_probe(struct platform_device *pdev)
 	if (mmu->features->use_ns_alias_offset)
 		mmu->base += IM_NS_ALIAS_OFFSET;

-	mmu->num_ctx = min_t(unsigned int, IPMMU_CTX_MAX,
-			     mmu->features->number_of_contexts);
+	mmu->num_ctx = min(IPMMU_CTX_MAX, mmu->features->number_of_contexts);

 	irq = platform_get_irq(pdev, 0);
--
2.17.1
[PATCH v2 0/6] iommu/ipmmu-vmsa: Suspend/resume support and assorted cleanups
Hi Jörg, Magnus,

On R-Car Gen3 systems with PSCI, PSCI may power down the SoC during
system suspend, thus losing all IOMMU state. Hence after s2ram, devices
behind an IPMMU (e.g. SATA), and configured to use it, will fail to
complete their I/O operations.

This patch series adds suspend/resume support to the Renesas IPMMU-VMSA
IOMMU driver, and performs some smaller cleanups and fixes during the
process. Most patches are fairly independent, except for patch 6/6,
which depends on patches 4/6 and 5/6.

Changes compared to v1:
  - Dropped "iommu/ipmmu-vmsa: Call ipmmu_ctx_write_root() instead of
    open coding",
  - Add Reviewed-by,
  - Merge IMEAR/IMELAR,
  - s/ipmmu_context_init/ipmmu_domain_setup_context/,
  - Drop PSCI checks.

This has been tested on Salvator-XS with R-Car H3 ES2.0, with IPMMU
support for SATA enabled. To play safe, the resume operation has also
been tested on R-Car M2-W.

Thanks!

Geert Uytterhoeven (6):
  iommu/ipmmu-vmsa: Link IOMMUs and devices in sysfs
  iommu/ipmmu-vmsa: Prepare to handle 40-bit error addresses
  iommu/ipmmu-vmsa: Make IPMMU_CTX_MAX unsigned
  iommu/ipmmu-vmsa: Move num_utlbs to SoC-specific features
  iommu/ipmmu-vmsa: Extract hardware context initialization
  iommu/ipmmu-vmsa: Add suspend/resume support

 drivers/iommu/ipmmu-vmsa.c | 185 +
 1 file changed, 124 insertions(+), 61 deletions(-)

--
2.17.1

Gr{oetje,eeting}s,

						Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- ge...@linux-m68k.org

In personal conversations with technical people, I call myself a hacker.
But when I'm talking to journalists I just say "programmer" or something
like that.
				-- Linus Torvalds
[PATCH v2 2/6] iommu/ipmmu-vmsa: Prepare to handle 40-bit error addresses
On R-Car Gen3, the faulting virtual address is a 40-bit address, and is
split across two registers. Read the upper address part, and combine
both parts, when running on a 64-bit system.

Signed-off-by: Geert Uytterhoeven
---
Apart from this, the driver doesn't support 40-bit IOVA addresses yet.

v2:
  - Merge IMEAR/IMELAR.
---
 drivers/iommu/ipmmu-vmsa.c | 15 +--
 1 file changed, 9 insertions(+), 6 deletions(-)

diff --git a/drivers/iommu/ipmmu-vmsa.c b/drivers/iommu/ipmmu-vmsa.c
index 9f2b781e20a0eba6..f2061bd1dc7b8852 100644
--- a/drivers/iommu/ipmmu-vmsa.c
+++ b/drivers/iommu/ipmmu-vmsa.c
@@ -186,7 +186,8 @@ static struct ipmmu_vmsa_device *to_ipmmu(struct device *dev)
 #define IMMAIR_ATTR_IDX_WBRWA	1
 #define IMMAIR_ATTR_IDX_DEV	2

-#define IMEAR	0x0030
+#define IMELAR	0x0030	/* IMEAR on R-Car Gen2 */
+#define IMEUAR	0x0034	/* R-Car Gen3 only */

 #define IMPCTR	0x0200
 #define IMPSTR	0x0208
@@ -522,14 +523,16 @@ static irqreturn_t ipmmu_domain_irq(struct ipmmu_vmsa_domain *domain)
 {
 	const u32 err_mask = IMSTR_MHIT | IMSTR_ABORT | IMSTR_PF | IMSTR_TF;
 	struct ipmmu_vmsa_device *mmu = domain->mmu;
+	unsigned long iova;
 	u32 status;
-	u32 iova;

 	status = ipmmu_ctx_read_root(domain, IMSTR);
 	if (!(status & err_mask))
 		return IRQ_NONE;

-	iova = ipmmu_ctx_read_root(domain, IMEAR);
+	iova = ipmmu_ctx_read_root(domain, IMELAR);
+	if (IS_ENABLED(CONFIG_64BIT))
+		iova |= (u64)ipmmu_ctx_read_root(domain, IMEUAR) << 32;

 	/*
 	 * Clear the error status flags. Unlike traditional interrupt flag
@@ -541,10 +544,10 @@ static irqreturn_t ipmmu_domain_irq(struct ipmmu_vmsa_domain *domain)

 	/* Log fatal errors. */
 	if (status & IMSTR_MHIT)
-		dev_err_ratelimited(mmu->dev, "Multiple TLB hits @0x%08x\n",
+		dev_err_ratelimited(mmu->dev, "Multiple TLB hits @0x%lx\n",
 				    iova);
 	if (status & IMSTR_ABORT)
-		dev_err_ratelimited(mmu->dev, "Page Table Walk Abort @0x%08x\n",
+		dev_err_ratelimited(mmu->dev, "Page Table Walk Abort @0x%lx\n",
 				    iova);

 	if (!(status & (IMSTR_PF | IMSTR_TF)))
@@ -560,7 +563,7 @@ static irqreturn_t ipmmu_domain_irq(struct ipmmu_vmsa_domain *domain)
 		return IRQ_HANDLED;

 	dev_err_ratelimited(mmu->dev,
-			    "Unhandled fault: status 0x%08x iova 0x%08x\n",
+			    "Unhandled fault: status 0x%08x iova 0x%lx\n",
 			    status, iova);

 	return IRQ_HANDLED;
--
2.17.1
[PATCH v2 5/6] iommu/ipmmu-vmsa: Extract hardware context initialization
ipmmu_domain_init_context() takes care of (1) initializing the software
domain, and (2) initializing the hardware context for the domain.

Extract the code to initialize the hardware context into a new
subroutine ipmmu_domain_setup_context(), to prepare for later reuse.

Signed-off-by: Geert Uytterhoeven
---
v2:
  - s/ipmmu_context_init/ipmmu_domain_setup_context/.
---
 drivers/iommu/ipmmu-vmsa.c | 91 --
 1 file changed, 48 insertions(+), 43 deletions(-)

diff --git a/drivers/iommu/ipmmu-vmsa.c b/drivers/iommu/ipmmu-vmsa.c
index 3fa57627b1e35562..56e84bcc9532e1ce 100644
--- a/drivers/iommu/ipmmu-vmsa.c
+++ b/drivers/iommu/ipmmu-vmsa.c
@@ -404,52 +404,10 @@ static void ipmmu_domain_free_context(struct ipmmu_vmsa_device *mmu,
 	spin_unlock_irqrestore(&mmu->lock, flags);
 }

-static int ipmmu_domain_init_context(struct ipmmu_vmsa_domain *domain)
+static void ipmmu_domain_setup_context(struct ipmmu_vmsa_domain *domain)
 {
 	u64 ttbr;
 	u32 tmp;
-	int ret;
-
-	/*
-	 * Allocate the page table operations.
-	 *
-	 * VMSA states in section B3.6.3 "Control of Secure or Non-secure memory
-	 * access, Long-descriptor format" that the NStable bit being set in a
-	 * table descriptor will result in the NStable and NS bits of all child
-	 * entries being ignored and considered as being set. The IPMMU seems
-	 * not to comply with this, as it generates a secure access page fault
-	 * if any of the NStable and NS bits isn't set when running in
-	 * non-secure mode.
-	 */
-	domain->cfg.quirks = IO_PGTABLE_QUIRK_ARM_NS;
-	domain->cfg.pgsize_bitmap = SZ_1G | SZ_2M | SZ_4K;
-	domain->cfg.ias = 32;
-	domain->cfg.oas = 40;
-	domain->cfg.tlb = &ipmmu_gather_ops;
-	domain->io_domain.geometry.aperture_end = DMA_BIT_MASK(32);
-	domain->io_domain.geometry.force_aperture = true;
-	/*
-	 * TODO: Add support for coherent walk through CCI with DVM and remove
-	 * cache handling. For now, delegate it to the io-pgtable code.
-	 */
-	domain->cfg.iommu_dev = domain->mmu->root->dev;
-
-	/*
-	 * Find an unused context.
-	 */
-	ret = ipmmu_domain_allocate_context(domain->mmu->root, domain);
-	if (ret < 0)
-		return ret;
-
-	domain->context_id = ret;
-
-	domain->iop = alloc_io_pgtable_ops(ARM_32_LPAE_S1, &domain->cfg,
-					   domain);
-	if (!domain->iop) {
-		ipmmu_domain_free_context(domain->mmu->root,
-					  domain->context_id);
-		return -EINVAL;
-	}

 	/* TTBR0 */
 	ttbr = domain->cfg.arm_lpae_s1_cfg.ttbr[0];
@@ -495,7 +453,54 @@ static int ipmmu_domain_init_context(struct ipmmu_vmsa_domain *domain)
 	 */
 	ipmmu_ctx_write_all(domain, IMCTR,
 			    IMCTR_INTEN | IMCTR_FLUSH | IMCTR_MMUEN);
+}
+
+static int ipmmu_domain_init_context(struct ipmmu_vmsa_domain *domain)
+{
+	int ret;
+
+	/*
+	 * Allocate the page table operations.
+	 *
+	 * VMSA states in section B3.6.3 "Control of Secure or Non-secure memory
+	 * access, Long-descriptor format" that the NStable bit being set in a
+	 * table descriptor will result in the NStable and NS bits of all child
+	 * entries being ignored and considered as being set. The IPMMU seems
+	 * not to comply with this, as it generates a secure access page fault
+	 * if any of the NStable and NS bits isn't set when running in
+	 * non-secure mode.
+	 */
+	domain->cfg.quirks = IO_PGTABLE_QUIRK_ARM_NS;
+	domain->cfg.pgsize_bitmap = SZ_1G | SZ_2M | SZ_4K;
+	domain->cfg.ias = 32;
+	domain->cfg.oas = 40;
+	domain->cfg.tlb = &ipmmu_gather_ops;
+	domain->io_domain.geometry.aperture_end = DMA_BIT_MASK(32);
+	domain->io_domain.geometry.force_aperture = true;
+	/*
+	 * TODO: Add support for coherent walk through CCI with DVM and remove
+	 * cache handling. For now, delegate it to the io-pgtable code.
+	 */
+	domain->cfg.iommu_dev = domain->mmu->root->dev;
+
+	/*
+	 * Find an unused context.
+	 */
+	ret = ipmmu_domain_allocate_context(domain->mmu->root, domain);
+	if (ret < 0)
+		return ret;
+
+	domain->context_id = ret;
+
+	domain->iop = alloc_io_pgtable_ops(ARM_32_LPAE_S1, &domain->cfg,
+					   domain);
+	if (!domain->iop) {
+		ipmmu_domain_free_context(domain->mmu->root,
+					  domain->context_id);
+		return -EINVAL;
+	}

+	ipmmu_domain_setup_context(domain);
 	return 0;
 }
--
2.17.1
[PATCH v2 1/6] iommu/ipmmu-vmsa: Link IOMMUs and devices in sysfs
As of commit 7af9a5fdb9e0ca33 ("iommu/ipmmu-vmsa: Use
iommu_device_sysfs_add()/remove()"), IOMMU devices show up under
/sys/class/iommu/, but their "devices" subdirectories are empty.
Likewise, devices tied to an IOMMU do not have an "iommu" backlink.

Make sure all links are created, on both arm32 and arm64.

Signed-off-by: Geert Uytterhoeven
Reviewed-by: Laurent Pinchart
---
v2:
  - Add Reviewed-by.
---
 drivers/iommu/ipmmu-vmsa.c | 24 +---
 1 file changed, 17 insertions(+), 7 deletions(-)

diff --git a/drivers/iommu/ipmmu-vmsa.c b/drivers/iommu/ipmmu-vmsa.c
index 9a380c10655e182d..9f2b781e20a0eba6 100644
--- a/drivers/iommu/ipmmu-vmsa.c
+++ b/drivers/iommu/ipmmu-vmsa.c
@@ -885,27 +885,37 @@ static int ipmmu_init_arm_mapping(struct device *dev)

 static int ipmmu_add_device(struct device *dev)
 {
+	struct ipmmu_vmsa_device *mmu = to_ipmmu(dev);
 	struct iommu_group *group;
+	int ret;

 	/*
 	 * Only let through devices that have been verified in xlate()
 	 */
-	if (!to_ipmmu(dev))
+	if (!mmu)
 		return -ENODEV;

-	if (IS_ENABLED(CONFIG_ARM) && !IS_ENABLED(CONFIG_IOMMU_DMA))
-		return ipmmu_init_arm_mapping(dev);
+	if (IS_ENABLED(CONFIG_ARM) && !IS_ENABLED(CONFIG_IOMMU_DMA)) {
+		ret = ipmmu_init_arm_mapping(dev);
+		if (ret)
+			return ret;
+	} else {
+		group = iommu_group_get_for_dev(dev);
+		if (IS_ERR(group))
+			return PTR_ERR(group);

-	group = iommu_group_get_for_dev(dev);
-	if (IS_ERR(group))
-		return PTR_ERR(group);
+		iommu_group_put(group);
+	}

-	iommu_group_put(group);
+	iommu_device_link(&mmu->iommu, dev);

 	return 0;
 }

 static void ipmmu_remove_device(struct device *dev)
 {
+	struct ipmmu_vmsa_device *mmu = to_ipmmu(dev);
+
+	iommu_device_unlink(&mmu->iommu, dev);
 	arm_iommu_detach_device(dev);
 	iommu_group_remove_device(dev);
 }
--
2.17.1
[PATCH v2 6/6] iommu/ipmmu-vmsa: Add suspend/resume support
During PSCI system suspend, R-Car Gen3 SoCs are powered down, and all
IPMMU state is lost. Hence after s2ram, devices wired behind an IPMMU,
and configured to use it, will see their DMA operations hang.

To fix this, restore all IPMMU contexts, and re-enable all active
micro-TLBs during system resume.

Signed-off-by: Geert Uytterhoeven
---
This patch takes a different approach than the BSP, which implements a
bulk save/restore of all registers during system suspend/resume.

v2:
  - Drop PSCI checks.
---
 drivers/iommu/ipmmu-vmsa.c | 47 +-
 1 file changed, 46 insertions(+), 1 deletion(-)

diff --git a/drivers/iommu/ipmmu-vmsa.c b/drivers/iommu/ipmmu-vmsa.c
index 56e84bcc9532e1ce..408ad0b2591925e0 100644
--- a/drivers/iommu/ipmmu-vmsa.c
+++ b/drivers/iommu/ipmmu-vmsa.c
@@ -36,7 +36,10 @@
 #define arm_iommu_detach_device(...)	do {} while (0)
 #endif

-#define IPMMU_CTX_MAX 8U
+#define IPMMU_CTX_MAX		8U
+#define IPMMU_CTX_INVALID	-1
+
+#define IPMMU_UTLB_MAX		48U

 struct ipmmu_features {
 	bool use_ns_alias_offset;
@@ -58,6 +61,7 @@ struct ipmmu_vmsa_device {
 	spinlock_t lock;	/* Protects ctx and domains[] */
 	DECLARE_BITMAP(ctx, IPMMU_CTX_MAX);
 	struct ipmmu_vmsa_domain *domains[IPMMU_CTX_MAX];
+	s8 utlb_ctx[IPMMU_UTLB_MAX];

 	struct iommu_group *group;
 	struct dma_iommu_mapping *mapping;
@@ -335,6 +339,7 @@ static void ipmmu_utlb_enable(struct ipmmu_vmsa_domain *domain,
 	ipmmu_write(mmu, IMUCTR(utlb),
 		    IMUCTR_TTSEL_MMU(domain->context_id) | IMUCTR_FLUSH |
 		    IMUCTR_MMUEN);
+	mmu->utlb_ctx[utlb] = domain->context_id;
 }

 /*
@@ -346,6 +351,7 @@ static void ipmmu_utlb_disable(struct ipmmu_vmsa_domain *domain,
 	struct ipmmu_vmsa_device *mmu = domain->mmu;

 	ipmmu_write(mmu, IMUCTR(utlb), 0);
+	mmu->utlb_ctx[utlb] = IPMMU_CTX_INVALID;
 }

 static void ipmmu_tlb_flush_all(void *cookie)
@@ -1043,6 +1049,7 @@ static int ipmmu_probe(struct platform_device *pdev)
 	spin_lock_init(&mmu->lock);
 	bitmap_zero(mmu->ctx, IPMMU_CTX_MAX);
 	mmu->features = of_device_get_match_data(&pdev->dev);
+	memset(mmu->utlb_ctx, IPMMU_CTX_INVALID, mmu->features->num_utlbs);
 	dma_set_mask_and_coherent(&pdev->dev, DMA_BIT_MASK(40));

 	/* Map I/O memory and request IRQ. */
@@ -1158,10 +1165,48 @@ static int ipmmu_remove(struct platform_device *pdev)
 	return 0;
 }

+#ifdef CONFIG_PM_SLEEP
+static int ipmmu_resume_noirq(struct device *dev)
+{
+	struct ipmmu_vmsa_device *mmu = dev_get_drvdata(dev);
+	unsigned int i;
+
+	/* Reset root MMU and restore contexts */
+	if (ipmmu_is_root(mmu)) {
+		ipmmu_device_reset(mmu);
+
+		for (i = 0; i < mmu->num_ctx; i++) {
+			if (!mmu->domains[i])
+				continue;
+
+			ipmmu_domain_setup_context(mmu->domains[i]);
+		}
+	}
+
+	/* Re-enable active micro-TLBs */
+	for (i = 0; i < mmu->features->num_utlbs; i++) {
+		if (mmu->utlb_ctx[i] == IPMMU_CTX_INVALID)
+			continue;
+
+		ipmmu_utlb_enable(mmu->root->domains[mmu->utlb_ctx[i]], i);
+	}
+
+	return 0;
+}
+
+static const struct dev_pm_ops ipmmu_pm = {
+	SET_NOIRQ_SYSTEM_SLEEP_PM_OPS(NULL, ipmmu_resume_noirq)
+};
+#define DEV_PM_OPS	&ipmmu_pm
+#else
+#define DEV_PM_OPS	NULL
+#endif /* CONFIG_PM_SLEEP */
+
 static struct platform_driver ipmmu_driver = {
 	.driver = {
 		.name = "ipmmu-vmsa",
 		.of_match_table = of_match_ptr(ipmmu_of_ids),
+		.pm = DEV_PM_OPS,
 	},
 	.probe = ipmmu_probe,
 	.remove = ipmmu_remove,
--
2.17.1
Re: [PATCH v6 09/22] vfio: VFIO_IOMMU_BIND/UNBIND_MSI
On Wed, 3 Apr 2019 16:30:15 +0200 Auger Eric wrote: > Hi Alex, > > On 3/22/19 11:09 PM, Alex Williamson wrote: > > On Fri, 22 Mar 2019 10:30:02 +0100 > > Auger Eric wrote: > > > >> Hi Alex, > >> On 3/22/19 12:01 AM, Alex Williamson wrote: > >>> On Sun, 17 Mar 2019 18:22:19 +0100 > >>> Eric Auger wrote: > >>> > This patch adds the VFIO_IOMMU_BIND/UNBIND_MSI ioctl which aim > to pass/withdraw the guest MSI binding to/from the host. > > Signed-off-by: Eric Auger > > --- > v3 -> v4: > - add UNBIND > - unwind on BIND error > > v2 -> v3: > - adapt to new proto of bind_guest_msi > - directly use vfio_iommu_for_each_dev > > v1 -> v2: > - s/vfio_iommu_type1_guest_msi_binding/vfio_iommu_type1_bind_guest_msi > --- > drivers/vfio/vfio_iommu_type1.c | 58 + > include/uapi/linux/vfio.h | 29 + > 2 files changed, 87 insertions(+) > > diff --git a/drivers/vfio/vfio_iommu_type1.c > b/drivers/vfio/vfio_iommu_type1.c > index 12a40b9db6aa..66513679081b 100644 > --- a/drivers/vfio/vfio_iommu_type1.c > +++ b/drivers/vfio/vfio_iommu_type1.c > @@ -1710,6 +1710,25 @@ static int vfio_cache_inv_fn(struct device *dev, > void *data) > return iommu_cache_invalidate(d, dev, >info); > } > > +static int vfio_bind_msi_fn(struct device *dev, void *data) > +{ > +struct vfio_iommu_type1_bind_msi *ustruct = > +(struct vfio_iommu_type1_bind_msi *)data; > +struct iommu_domain *d = iommu_get_domain_for_dev(dev); > + > +return iommu_bind_guest_msi(d, dev, ustruct->iova, > +ustruct->gpa, ustruct->size); > +} > + > +static int vfio_unbind_msi_fn(struct device *dev, void *data) > +{ > +dma_addr_t *iova = (dma_addr_t *)data; > +struct iommu_domain *d = iommu_get_domain_for_dev(dev); > >>> > >>> Same as previous, we can encapsulate domain in our own struct to avoid > >>> a lookup. > >>> > + > +iommu_unbind_guest_msi(d, dev, *iova); > >>> > >>> Is it strange that iommu-core is exposing these interfaces at a device > >>> level if every one of them requires us to walk all the devices? 
Thanks, > >>> > >> > >> Hum this per device API was devised in response of Robin's comments on > >> > >> [RFC v2 12/20] dma-iommu: Implement NESTED_MSI cookie. > >> > >> " > >> But that then seems to reveal a somewhat bigger problem - if the callers > >> are simply registering IPAs, and relying on the ITS driver to grab an > >> entry and fill in a PA later, then how does either one know *which* PA > >> is supposed to belong to a given IPA in the case where you have multiple > >> devices with different ITS targets assigned to the same guest? (and if > >> it's possible to assume a guest will use per-device stage 1 mappings and > >> present it with a single vITS backed by multiple pITSes, I think things > >> start breaking even harder.) > >> " > >> > >> However looking back into the problem I wonder if there was an issue > >> with the iommu_domain based API. > >> > >> If my understanding is correct, when assigned devices are protected by a > >> vIOMMU then they necessarily end up in separate host iommu domains even > >> if they belong to the same iommu_domain on the guest. And there can only > >> be a single device in this iommu_domain. > > > > Don't forget that a container represents the IOMMU context in a vfio > > environment, groups are associated with containers and a group may > > contain one or more devices. When a vIOMMU comes into play, we still > > only have an IOMMU context per container. If we have multiple devices > > in a group, we run into problems with vIOMMU. We can resolve this by > > requiring that the user ignore all but one device in the group, > > or making sure that the devices in the group have the same IOMMU > > context. The latter we could do in QEMU if PCIe-to-PCI bridges there > > masked the per-device address space as it does on real hardware (ie. > > there is no requester ID on conventional PCI, all transactions appear to > > the IOMMU with the bridge requester ID). 
So I raise this question > > because vfio's minimum domain granularity is a group. > > > >> If this is confirmed, there is a non ambiguous association between 1 > >> physical iommu_domain, 1 device, 1 S1 mapping and 1 physical MSI > >> controller. > >> > >> I added the device handle handle to disambiguate those associations. The > >> gIOVA ->gDB mapping is associated with a device handle. Then when the > >> host needs a stage 1 mapping for this device, to build the nested > >> mapping towards the physical DB it can easily grab the gIOVA->gDB stage > >> 1 mapping registered for this device. > >> > >> The correctness looks more obvious to me, at
Re: 5.1-rc1: mpt init crash in scsi_map_dma, dma_4v_map_sg on sparc64
On 02/04/2019 23:39, Rob Gardner wrote:
On 4/2/19 2:30 PM, Meelis Roos wrote:

[ 17.566584] scsi host0: ioc0: LSISAS1064 A3, FwRev=010ah, Ports=1, MaxQ=511, IRQ=27
[ 17.595897] mptsas: ioc0: attaching ssp device: fw_channel 0, fw_id 0, phy 0, sas_addr 0x5000c5001799a45d
[ 17.598465] Unable to handle kernel NULL pointer dereference
[ 17.598623] tsk->{mm,active_mm}->context =
[ 17.598723] tsk->{mm,active_mm}->pgd = 88802000
[ 17.598774]               \|/ \|/
[ 17.598774]               "@'/ .. \`@"
[ 17.598774]               /_| \__/ |_\
[ 17.598774]                  \__U_/
[ 17.598894] swapper/0(1): Oops [#1]
[ 17.598937] CPU: 12 PID: 1 Comm: swapper/0 Not tainted 5.1.0-rc1 #118
[ 17.598994] TSTATE: 80e01601 TPC: 004483a8 TNPC: 004483ac Y: Not tainted
[ 17.599086] TPC: <dma_4v_map_sg+0xe8>

You may use gdb to figure out what the NULL pointer points to:

    gdb vmlinux
    l *(dma_4v_map_sg+0xe8)

gdb did not parse the file but objdump --disassemble worked, and +0xe8 seems
to be 4483a8.

Of course that was right there in the panic message, as TPC is the address of
the instruction that faulted:

    ldx  [ %i4 ], %g1

For anyone wishing to dig into this further, here is my off the cuff
analysis: I believe the fault is happening on this line:

    base_shift = tbl->table_map_base >> IO_PAGE_SHIFT;

The tbl variable is assigned to one of two values in the statement above, but
since the register dump shows the value in %i4 was 0x10, that strongly
suggests that it executed this:

    tbl = &atu->tbl;

Because the offset of the tbl field in struct atu is 0x10, and that was
computed here:

    448384: b8 07 60 10  add  %i5, 0x10, %i4

(The offset of tbl in struct iommu is 0, so we would have seen that 0 in %i4
if it had taken the iommu path.)

From the register dump, the value in %i5 was 0. And that came from this
instruction:

    4482f4: fa 58 e2 58  ldx  [ %g3 + 0x258 ], %i5

Likewise, %g3 came from here:

    4482d4: c6 5e 22 18  ldx  [ %i0 + 0x218 ], %g3

And %i0 is arg0, struct device *dev. So the code is loading some field in
struct device at offset 0x218, which is consistent with the source:

    iommu = dev->archdata.iommu;

So %g3 points to struct iommu, and the code is trying to load the value at
offset 0x258 in that structure, probably this:

    atu = iommu->atu;

And atu is the NULL pointer. Now whether this is the problem, I don't know.
It may be that mask (*dev->dma_mask) was wrong, causing the code to take the
&atu->tbl path instead of the &iommu->tbl path. We can see from the code that
mask is in %g7, and the register dump shows the value of %g7 is fff, while
DMA_BIT_MASK(32) is in %g1 and is , so this might be the result of some
confusion over 32 bit vs 64 bit stuff.

Nice deduction! If it was AArch64 asm I might have tried, but I've never even
seen SPARC asm before :)

FWIW, scripts/faddr2line is your friend when deciphering stacktrace symbols.

In terms of the crash itself, I'd note that there's also been ongoing cleanup
to fix the remaining places where the DMA API was called with NULL instead of
the appropriate device - it could be that as a result of that, the
driver/subsystem here is now taking a path that has not been properly
exercised before, and/or that it's not quite the right device pointer being
picked up.

I hope these bits of information help somebody debug further.

Thanks,
Robin.
Rob

004482c0 <dma_4v_map_sg>:
  4482c0: 9d e3 be b0  save  %sp, -336, %sp
  4482c4: 80 a6 e0 03  cmp  %i3, 3
  4482c8: 02 40 00 c1  be,pn  %icc, 4485cc
  4482cc: 92 10 21 e2  mov  0x1e2, %o1
  4482d0: 80 a0 00 1a  cmp  %g0, %i2
  4482d4: c6 5e 22 18  ldx  [ %i0 + 0x218 ], %g3
  4482d8: 82 10 20 00  clr  %g1
  4482dc: 84 60 3f ff  subc  %g0, -1, %g2
  4482e0: 83 78 e4 01  movre  %g3, 1, %g1
  4482e4: 80 90 80 01  orcc  %g2, %g1, %g0
  4482e8: 12 40 00 bd  bne,pn  %icc, 4485dc
  4482ec: 80 a6 e0 01  cmp  %i3, 1
  4482f0: 84 10 20 03  mov  3, %g2
  4482f4: fa 58 e2 58  ldx  [ %g3 + 0x258 ], %i5
  4482f8: 85 64 60 01  move  %icc, 1, %g2
  4482fc: b8 0f 20 02  and  %i4, 2, %i4
  448300: c0 77 a7 f7  clrx  [ %fp + 0x7f7 ]
  448304: 82 10 a0 04  or  %g2, 4, %g1
  448308: c0 26 60 18  clr  [ %i1 + 0x18 ]
  44830c: 85 7f 14 01  movrne  %i4, %g1, %g2
  448310: 8f 52 00 00  rdpr  %pil, %g7
  448314: 82 11 e0 0e  or  %g7, 0xe, %g1
  448318: 91 90 60 00  wrpr  %g1, 0, %pil
  44831c: ce 77 a7 bf  stx  %g7, [ %fp + 0x7bf ]
  448320: 0f 00 02 00  sethi  %hi(0x8), %g7
  448324: 27 00 00 40  sethi  %hi(0x1), %l3
  448328: ce 77 a7 df  stx  %g7, [ %fp + 0x7df ]
  44832c: 0f 00 28 21  sethi  %hi(0xa08400), %g7
  448330:
Re: [PATCH 4/7] pxa3xx-gcu: pass struct device to dma_mmap_coherent
On 03/21/2019 11:52 PM, Christoph Hellwig wrote:
> Just like we do for all other DMA operations.
>
> Signed-off-by: Christoph Hellwig

Acked-by: Bartlomiej Zolnierkiewicz

Best regards,
--
Bartlomiej Zolnierkiewicz
Samsung R&D Institute Poland
Samsung Electronics
Re: [PATCH 2/7] da8xx-fb: pass struct device to DMA API functions
On 03/21/2019 11:52 PM, Christoph Hellwig wrote:
> The DMA API generally relies on a struct device to work properly, and
> only barely works without one for legacy reasons. Pass the easily
> available struct device from the platform_device to remedy this.
>
> Signed-off-by: Christoph Hellwig

Acked-by: Bartlomiej Zolnierkiewicz

Best regards,
--
Bartlomiej Zolnierkiewicz
Samsung R&D Institute Poland
Samsung Electronics
Re: remove NULL struct device support in the DMA API
Any comments on the remaining patches? I'd like to give this series a
couple weeks of soaking in linux-next before the end of the merge window,
so reviews would be appreciated.

On Thu, Mar 21, 2019 at 03:52:28PM -0700, Christoph Hellwig wrote:
> We still have a few drivers which pass a NULL struct device pointer
> to DMA API functions, which generally is a bad idea as the API
> implementations rely on the device not only for ops selection, but
> also the dma mask and various other attributes, and many implementations
> have been broken for NULL device support for a while.
>
> This series removes the few remaining users that weren't picked up in
> the last merge window and then removes core support for this "feature".
>
> A git tree is also available at:
>
>     git://git.infradead.org/users/hch/misc.git dma-remove-NULL-dev-support
>
> Gitweb:
>
>     http://git.infradead.org/users/hch/misc.git/shortlog/refs/heads/dma-remove-NULL-dev-support
---end quoted text---
Re: [PATCH v6 09/22] vfio: VFIO_IOMMU_BIND/UNBIND_MSI
Hi Alex, On 3/22/19 11:09 PM, Alex Williamson wrote: > On Fri, 22 Mar 2019 10:30:02 +0100 > Auger Eric wrote: > >> Hi Alex, >> On 3/22/19 12:01 AM, Alex Williamson wrote: >>> On Sun, 17 Mar 2019 18:22:19 +0100 >>> Eric Auger wrote: >>> This patch adds the VFIO_IOMMU_BIND/UNBIND_MSI ioctl which aim to pass/withdraw the guest MSI binding to/from the host. Signed-off-by: Eric Auger --- v3 -> v4: - add UNBIND - unwind on BIND error v2 -> v3: - adapt to new proto of bind_guest_msi - directly use vfio_iommu_for_each_dev v1 -> v2: - s/vfio_iommu_type1_guest_msi_binding/vfio_iommu_type1_bind_guest_msi --- drivers/vfio/vfio_iommu_type1.c | 58 + include/uapi/linux/vfio.h | 29 + 2 files changed, 87 insertions(+) diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c index 12a40b9db6aa..66513679081b 100644 --- a/drivers/vfio/vfio_iommu_type1.c +++ b/drivers/vfio/vfio_iommu_type1.c @@ -1710,6 +1710,25 @@ static int vfio_cache_inv_fn(struct device *dev, void *data) return iommu_cache_invalidate(d, dev, >info); } +static int vfio_bind_msi_fn(struct device *dev, void *data) +{ + struct vfio_iommu_type1_bind_msi *ustruct = + (struct vfio_iommu_type1_bind_msi *)data; + struct iommu_domain *d = iommu_get_domain_for_dev(dev); + + return iommu_bind_guest_msi(d, dev, ustruct->iova, + ustruct->gpa, ustruct->size); +} + +static int vfio_unbind_msi_fn(struct device *dev, void *data) +{ + dma_addr_t *iova = (dma_addr_t *)data; + struct iommu_domain *d = iommu_get_domain_for_dev(dev); >>> >>> Same as previous, we can encapsulate domain in our own struct to avoid >>> a lookup. >>> + + iommu_unbind_guest_msi(d, dev, *iova); >>> >>> Is it strange that iommu-core is exposing these interfaces at a device >>> level if every one of them requires us to walk all the devices? Thanks, >> >> Hum this per device API was devised in response of Robin's comments on >> >> [RFC v2 12/20] dma-iommu: Implement NESTED_MSI cookie. 
>> >> " >> But that then seems to reveal a somewhat bigger problem - if the callers >> are simply registering IPAs, and relying on the ITS driver to grab an >> entry and fill in a PA later, then how does either one know *which* PA >> is supposed to belong to a given IPA in the case where you have multiple >> devices with different ITS targets assigned to the same guest? (and if >> it's possible to assume a guest will use per-device stage 1 mappings and >> present it with a single vITS backed by multiple pITSes, I think things >> start breaking even harder.) >> " >> >> However looking back into the problem I wonder if there was an issue >> with the iommu_domain based API. >> >> If my understanding is correct, when assigned devices are protected by a >> vIOMMU then they necessarily end up in separate host iommu domains even >> if they belong to the same iommu_domain on the guest. And there can only >> be a single device in this iommu_domain. > > Don't forget that a container represents the IOMMU context in a vfio > environment, groups are associated with containers and a group may > contain one or more devices. When a vIOMMU comes into play, we still > only have an IOMMU context per container. If we have multiple devices > in a group, we run into problems with vIOMMU. We can resolve this by > requiring that the user ignore all but one device in the group, > or making sure that the devices in the group have the same IOMMU > context. The latter we could do in QEMU if PCIe-to-PCI bridges there > masked the per-device address space as it does on real hardware (ie. > there is no requester ID on conventional PCI, all transactions appear to > the IOMMU with the bridge requester ID). So I raise this question > because vfio's minimum domain granularity is a group. > >> If this is confirmed, there is a non ambiguous association between 1 >> physical iommu_domain, 1 device, 1 S1 mapping and 1 physical MSI >> controller. 
>> >> I added the device handle handle to disambiguate those associations. The >> gIOVA ->gDB mapping is associated with a device handle. Then when the >> host needs a stage 1 mapping for this device, to build the nested >> mapping towards the physical DB it can easily grab the gIOVA->gDB stage >> 1 mapping registered for this device. >> >> The correctness looks more obvious to me, at least. > > Except all devices within all groups within the same container > necessarily share the same IOMMU context, so from that perspective, it > appears to impose non-trivial redundancy on the caller. Thanks, Taking into consideration the case where we could have several devices attached to the same host iommu group, each of them possibly using
Re: [PATCH v1 1/3] iommu/tegra-smmu: Fix invalid ASID bits on Tegra30/114
03.04.2019 11:41, Thierry Reding wrote:
> On Thu, Mar 07, 2019 at 01:50:07AM +0300, Dmitry Osipenko wrote:
>> Both Tegra30 and Tegra114 have 4 ASIDs and the corresponding bitfield of
>> the TLB_FLUSH register differs from later Tegra generations that have 128
>> ASIDs.
>>
>> As a result, the PTEs are now flushed correctly from the TLB and this fixes
>> problems with graphics (randomly failing tests) on Tegra30.
>>
>> Cc: stable
>> Signed-off-by: Dmitry Osipenko
>> ---
>>  drivers/iommu/tegra-smmu.c | 25 ++---
>>  1 file changed, 18 insertions(+), 7 deletions(-)
>>
>> diff --git a/drivers/iommu/tegra-smmu.c b/drivers/iommu/tegra-smmu.c
>> index 5182c7d6171e..8d30653cd13a 100644
>> --- a/drivers/iommu/tegra-smmu.c
>> +++ b/drivers/iommu/tegra-smmu.c
>> @@ -102,7 +102,6 @@ static inline u32 smmu_readl(struct tegra_smmu *smmu, unsigned long offset)
>>  #define SMMU_TLB_FLUSH_VA_MATCH_ALL (0 << 0)
>>  #define SMMU_TLB_FLUSH_VA_MATCH_SECTION (2 << 0)
>>  #define SMMU_TLB_FLUSH_VA_MATCH_GROUP (3 << 0)
>> -#define SMMU_TLB_FLUSH_ASID(x) (((x) & 0x7f) << 24)
>
> Given that the same operation is repeated three times below, it might
> have been worth to fold the conditional into the macro. That'd require
> the macro to take an smmu parameter, but would otherwise leave the
> individual instances shorter.

I had that variant initially and the result felt more clumsy to me.

> But either way, the fix is good, so:
>
> Acked-by: Thierry Reding

Thanks
Re: [PATCH v1 2/3] arm64: dts: qcom: msm8998: Add PCIe SMMU node
On 4/2/2019 7:24 PM, Robin Murphy wrote: On 30/03/2019 14:18, Vivek Gautam wrote: You should probably have some "bus" and "iface" clocks too, per the requirement of "qcom,smmu-v2". Maybe Vivek might know what's relevant for MSM8998? As Jeffrey rightly mentioned, these clocks are not under the control of Linux. So, we won't need to add clocks to this SMMU. OK, in that case the "clock-names" part of the binding doc probably wants refining to reflect which implementations do actually require clocks. Certainly. Marc, do you want to push a patch for the same? Or, let me know and I can prepare one. Thanks Vivek Robin.
Re: [PATCH] driver core: Postpone DMA tear-down until after devres release for probe failure
On 03/04/2019 09:14, Greg KH wrote: On Wed, Apr 03, 2019 at 09:02:36AM +0100, John Garry wrote: On 28/03/2019 10:08, John Garry wrote: In commit 376991db4b64 ("driver core: Postpone DMA tear-down until after devres release"), we changed the ordering of tearing down the device DMA ops and releasing all the device's resources; this was because the DMA ops should be maintained until we release the device's managed DMA memories. Hi all, A friendly reminder on this patch... I didn't see any update. I thought that it had some importance. Thanks, John However, we have seen another crash on an arm64 system when a device driver probe fails: hisi_sas_v3_hw :74:02.0: Adding to iommu group 2 scsi host1: hisi_sas_v3_hw BUG: Bad page state in process swapper/0 pfn:313f5 page:7ec4fd40 count:1 mapcount:0 mapping: index:0x0 flags: 0xfffe0001000(reserved) raw: 0fffe0001000 7ec4fd48 7ec4fd48 raw: 0001 page dumped because: PAGE_FLAGS_CHECK_AT_FREE flag(s) set bad because of flags: 0x1000(reserved) Modules linked in: CPU: 49 PID: 1 Comm: swapper/0 Not tainted 5.1.0-rc1-43081-g22d97fd-dirty #1433 Hardware name: Huawei D06/D06, BIOS Hisilicon D06 UEFI RC0 - V1.12.01 01/29/2019 Call trace: dump_backtrace+0x0/0x118 show_stack+0x14/0x1c dump_stack+0xa4/0xc8 bad_page+0xe4/0x13c free_pages_check_bad+0x4c/0xc0 __free_pages_ok+0x30c/0x340 __free_pages+0x30/0x44 __dma_direct_free_pages+0x30/0x38 dma_direct_free+0x24/0x38 dma_free_attrs+0x9c/0xd8 dmam_release+0x20/0x28 release_nodes+0x17c/0x220 devres_release_all+0x34/0x54 really_probe+0xc4/0x2c8 driver_probe_device+0x58/0xfc device_driver_attach+0x68/0x70 __driver_attach+0x94/0xdc bus_for_each_dev+0x5c/0xb4 driver_attach+0x20/0x28 bus_add_driver+0x14c/0x200 driver_register+0x6c/0x124 __pci_register_driver+0x48/0x50 sas_v3_pci_driver_init+0x20/0x28 do_one_initcall+0x40/0x25c kernel_init_freeable+0x2b8/0x3c0 kernel_init+0x10/0x100 ret_from_fork+0x10/0x18 Disabling lock debugging due to kernel taint BUG: Bad page state in process swapper/0 
pfn:313f6 page:7ec4fd80 count:1 mapcount:0 mapping: index:0x0 [ 89.322983] flags: 0xfffe0001000(reserved) raw: 0fffe0001000 7ec4fd88 7ec4fd88 raw: 0001 The crash occurs for the same reason. In this case, on the really_probe() failure path, we are still clearing the DMA ops prior to releasing the device's managed memories. This patch fixes this issue by reordering the DMA ops teardown and the call to devres_release_all() on the failure path. Reported-by: Xiang Chen Tested-by: Xiang Chen Signed-off-by: John Garry So does this "fix" 376991db4b64? If so, should this be added to the patch and also backported to the stable trees? Hi Greg, No, I don't think so. I'd say it supplements it. Here I'm trying to fix up another path in which we tear down the DMA ops prior to releasing the device's resources. I didn't add a fixes tag as 376991db4b64 didn't have one either. It will need to be backported to stable, I figure the same as 376991db4b64. Thanks, John thanks, greg k-h
[PATCH] iommu/iova: Separate atomic variables to improve performance
From: Jinyu Qi In struct iova_domain there are three atomic variables: the first two are TLB flush counters updated with an atomic_add operation, and the third drives the flush timer using a cmpxchg operation. These variables sit in the same cache line, which causes a performance loss when many cores call the queue_iova function. Let's isolate the two kinds of atomic variables on different cache lines to reduce cache-line contention. Cc: Joerg Roedel Signed-off-by: Jinyu Qi --- include/linux/iova.h | 16 1 file changed, 8 insertions(+), 8 deletions(-) diff --git a/include/linux/iova.h b/include/linux/iova.h index 0b93bf96693e..28a5128405f8 100644 --- a/include/linux/iova.h +++ b/include/linux/iova.h @@ -76,6 +76,14 @@ struct iova_domain { unsigned long start_pfn; /* Lower limit for this domain */ unsigned long dma_32bit_pfn; unsigned long max32_alloc_size; /* Size of last failed allocation */ + struct iova_fq __percpu *fq;/* Flush Queue */ + + atomic64_t fq_flush_start_cnt; /* Number of TLB flushes that + have been started */ + + atomic64_t fq_flush_finish_cnt;/* Number of TLB flushes that + have been finished */ + struct iova anchor; /* rbtree lookup anchor */ struct iova_rcache rcaches[IOVA_RANGE_CACHE_MAX_SIZE]; /* IOVA range caches */ @@ -85,14 +93,6 @@ struct iova_domain { iova_entry_dtor entry_dtor; /* IOMMU driver specific destructor for iova entry */ - struct iova_fq __percpu *fq;/* Flush Queue */ - - atomic64_t fq_flush_start_cnt; /* Number of TLB flushes that - have been started */ - - atomic64_t fq_flush_finish_cnt;/* Number of TLB flushes that - have been finished */ - struct timer_list fq_timer; /* Timer to regularily empty the flush-queues */ atomic_t fq_timer_on; /* 1 when timer is active, 0 -- 2.7.4
Re: [PATCH v1 3/3] iommu/tegra-smmu: Respect IOMMU API read-write protections
On Thu, Mar 07, 2019 at 01:50:09AM +0300, Dmitry Osipenko wrote: > Set PTE read/write attributes according to the protections requested > by the IOMMU API. > > Signed-off-by: Dmitry Osipenko > --- > drivers/iommu/tegra-smmu.c | 13 ++--- > 1 file changed, 10 insertions(+), 3 deletions(-) Acked-by: Thierry Reding
Re: [PATCH v1 2/3] iommu/tegra-smmu: Properly release domain resources
On Thu, Mar 07, 2019 at 01:50:08AM +0300, Dmitry Osipenko wrote: > Release all memory allocations associated with a released domain and emit a > warning if the domain is in use at the time of destruction. > > Signed-off-by: Dmitry Osipenko > --- > drivers/iommu/tegra-smmu.c | 3 +++ > 1 file changed, 3 insertions(+) Acked-by: Thierry Reding
Re: [PATCH v1 1/3] iommu/tegra-smmu: Fix invalid ASID bits on Tegra30/114
On Thu, Mar 07, 2019 at 01:50:07AM +0300, Dmitry Osipenko wrote: > Both Tegra30 and Tegra114 have 4 ASIDs and the corresponding bitfield of > the TLB_FLUSH register differs from later Tegra generations that have 128 > ASIDs. > > As a result the PTEs are now flushed correctly from the TLB and this fixes > problems with graphics (randomly failing tests) on Tegra30. > > Cc: stable > Signed-off-by: Dmitry Osipenko > --- > drivers/iommu/tegra-smmu.c | 25 ++--- > 1 file changed, 18 insertions(+), 7 deletions(-) > > diff --git a/drivers/iommu/tegra-smmu.c b/drivers/iommu/tegra-smmu.c > index 5182c7d6171e..8d30653cd13a 100644 > --- a/drivers/iommu/tegra-smmu.c > +++ b/drivers/iommu/tegra-smmu.c > @@ -102,7 +102,6 @@ static inline u32 smmu_readl(struct tegra_smmu *smmu, > unsigned long offset) > #define SMMU_TLB_FLUSH_VA_MATCH_ALL (0 << 0) > #define SMMU_TLB_FLUSH_VA_MATCH_SECTION (2 << 0) > #define SMMU_TLB_FLUSH_VA_MATCH_GROUP (3 << 0) > -#define SMMU_TLB_FLUSH_ASID(x) (((x) & 0x7f) << 24) Given that the same operation is repeated three times below, it might have been worth to fold the conditional into the macro. That'd require the macro to take an smmu parameter, but would otherwise leave the individual instances shorter. But either way, the fix is good, so: Acked-by: Thierry Reding
Re: [PATCH] driver core: Postpone DMA tear-down until after devres release for probe failure
On Wed, Apr 03, 2019 at 09:02:36AM +0100, John Garry wrote: > On 28/03/2019 10:08, John Garry wrote: > > In commit 376991db4b64 ("driver core: Postpone DMA tear-down until after > > devres release"), we changed the ordering of tearing down the device DMA > > ops and releasing all the device's resources; this was because the DMA ops > > should be maintained until we release the device's managed DMA memories. > > > > Hi all, > > A friendly reminder on this patch... I didn't see any update. > > I thought that it had some importance. > > Thanks, > John > > > However, we have seen another crash on an arm64 system when a > > device driver probe fails: > > > > hisi_sas_v3_hw :74:02.0: Adding to iommu group 2 > > scsi host1: hisi_sas_v3_hw > > BUG: Bad page state in process swapper/0 pfn:313f5 > > page:7ec4fd40 count:1 mapcount:0 > > mapping: index:0x0 > > flags: 0xfffe0001000(reserved) > > raw: 0fffe0001000 7ec4fd48 7ec4fd48 > > > > raw: 0001 > > > > page dumped because: PAGE_FLAGS_CHECK_AT_FREE flag(s) set > > bad because of flags: 0x1000(reserved) > > Modules linked in: > > CPU: 49 PID: 1 Comm: swapper/0 Not tainted > > 5.1.0-rc1-43081-g22d97fd-dirty #1433 > > Hardware name: Huawei D06/D06, BIOS Hisilicon D06 UEFI > > RC0 - V1.12.01 01/29/2019 > > Call trace: > > dump_backtrace+0x0/0x118 > > show_stack+0x14/0x1c > > dump_stack+0xa4/0xc8 > > bad_page+0xe4/0x13c > > free_pages_check_bad+0x4c/0xc0 > > __free_pages_ok+0x30c/0x340 > > __free_pages+0x30/0x44 > > __dma_direct_free_pages+0x30/0x38 > > dma_direct_free+0x24/0x38 > > dma_free_attrs+0x9c/0xd8 > > dmam_release+0x20/0x28 > > release_nodes+0x17c/0x220 > > devres_release_all+0x34/0x54 > > really_probe+0xc4/0x2c8 > > driver_probe_device+0x58/0xfc > > device_driver_attach+0x68/0x70 > > __driver_attach+0x94/0xdc > > bus_for_each_dev+0x5c/0xb4 > > driver_attach+0x20/0x28 > > bus_add_driver+0x14c/0x200 > > driver_register+0x6c/0x124 > > __pci_register_driver+0x48/0x50 > > sas_v3_pci_driver_init+0x20/0x28 > > 
do_one_initcall+0x40/0x25c > > kernel_init_freeable+0x2b8/0x3c0 > > kernel_init+0x10/0x100 > > ret_from_fork+0x10/0x18 > > Disabling lock debugging due to kernel taint > > BUG: Bad page state in process swapper/0 pfn:313f6 > > page:7ec4fd80 count:1 mapcount:0 > > mapping: index:0x0 > > [ 89.322983] flags: 0xfffe0001000(reserved) > > raw: 0fffe0001000 7ec4fd88 7ec4fd88 > > > > raw: 0001 > > > > > > The crash occurs for the same reason. > > > > In this case, on the really_probe() failure path, we are still clearing > > the DMA ops prior to releasing the device's managed memories. > > > > This patch fixes this issue by reordering the DMA ops teardown and the > > call to devres_release_all() on the failure path. > > > > Reported-by: Xiang Chen > > Tested-by: Xiang Chen > > Signed-off-by: John Garry So does this "fix" 376991db4b64? If so, should this be added to the patch and also backported to the stable trees? thanks, greg k-h
Re: [PATCH] driver core: Postpone DMA tear-down until after devres release for probe failure
On 28/03/2019 10:08, John Garry wrote: In commit 376991db4b64 ("driver core: Postpone DMA tear-down until after devres release"), we changed the ordering of tearing down the device DMA ops and releasing all the device's resources; this was because the DMA ops should be maintained until we release the device's managed DMA memories. Hi all, A friendly reminder on this patch... I didn't see any update. I thought that it had some importance. Thanks, John However, we have seen another crash on an arm64 system when a device driver probe fails: hisi_sas_v3_hw :74:02.0: Adding to iommu group 2 scsi host1: hisi_sas_v3_hw BUG: Bad page state in process swapper/0 pfn:313f5 page:7ec4fd40 count:1 mapcount:0 mapping: index:0x0 flags: 0xfffe0001000(reserved) raw: 0fffe0001000 7ec4fd48 7ec4fd48 raw: 0001 page dumped because: PAGE_FLAGS_CHECK_AT_FREE flag(s) set bad because of flags: 0x1000(reserved) Modules linked in: CPU: 49 PID: 1 Comm: swapper/0 Not tainted 5.1.0-rc1-43081-g22d97fd-dirty #1433 Hardware name: Huawei D06/D06, BIOS Hisilicon D06 UEFI RC0 - V1.12.01 01/29/2019 Call trace: dump_backtrace+0x0/0x118 show_stack+0x14/0x1c dump_stack+0xa4/0xc8 bad_page+0xe4/0x13c free_pages_check_bad+0x4c/0xc0 __free_pages_ok+0x30c/0x340 __free_pages+0x30/0x44 __dma_direct_free_pages+0x30/0x38 dma_direct_free+0x24/0x38 dma_free_attrs+0x9c/0xd8 dmam_release+0x20/0x28 release_nodes+0x17c/0x220 devres_release_all+0x34/0x54 really_probe+0xc4/0x2c8 driver_probe_device+0x58/0xfc device_driver_attach+0x68/0x70 __driver_attach+0x94/0xdc bus_for_each_dev+0x5c/0xb4 driver_attach+0x20/0x28 bus_add_driver+0x14c/0x200 driver_register+0x6c/0x124 __pci_register_driver+0x48/0x50 sas_v3_pci_driver_init+0x20/0x28 do_one_initcall+0x40/0x25c kernel_init_freeable+0x2b8/0x3c0 kernel_init+0x10/0x100 ret_from_fork+0x10/0x18 Disabling lock debugging due to kernel taint BUG: Bad page state in process swapper/0 pfn:313f6 page:7ec4fd80 count:1 mapcount:0 mapping: index:0x0 [ 89.322983] flags: 
0xfffe0001000(reserved) raw: 0fffe0001000 7ec4fd88 7ec4fd88 raw: 0001 The crash occurs for the same reason. In this case, on the really_probe() failure path, we are still clearing the DMA ops prior to releasing the device's managed memories. This patch fixes this issue by reordering the DMA ops teardown and the call to devres_release_all() on the failure path. Reported-by: Xiang Chen Tested-by: Xiang Chen Signed-off-by: John Garry --- For convenience, here is the 2nd half of really_probe() now: atomic_inc(_count); pr_debug("bus: '%s': %s: probing driver %s with device %s\n", drv->bus->name, __func__, drv->name, dev_name(dev)); WARN_ON(!list_empty(>devres_head)); re_probe: dev->driver = drv; /* If using pinctrl, bind pins now before probing */ ret = pinctrl_bind_pins(dev); if (ret) goto pinctrl_bind_failed; if (dev->bus->dma_configure) { ret = dev->bus->dma_configure(dev); if (ret) goto probe_failed; } if (driver_sysfs_add(dev)) { printk(KERN_ERR "%s: driver_sysfs_add(%s) failed\n", __func__, dev_name(dev)); goto probe_failed; } if (dev->pm_domain && dev->pm_domain->activate) { ret = dev->pm_domain->activate(dev); if (ret) goto probe_failed; } if (dev->bus->probe) { ret = dev->bus->probe(dev); if (ret) goto probe_failed; } else if (drv->probe) { ret = drv->probe(dev); if (ret) goto probe_failed; } if (test_remove) { test_remove = false; if (dev->bus->remove) dev->bus->remove(dev); else if (drv->remove) drv->remove(dev); devres_release_all(dev); driver_sysfs_remove(dev); dev->driver = NULL; dev_set_drvdata(dev, NULL); if (dev->pm_domain && dev->pm_domain->dismiss) dev->pm_domain->dismiss(dev); pm_runtime_reinit(dev); goto re_probe; } pinctrl_init_done(dev); if (dev->pm_domain && dev->pm_domain->sync) dev->pm_domain->sync(dev); driver_bound(dev); ret = 1; pr_debug("bus: '%s': %s: bound device %s to driver %s\n", drv->bus->name, __func__,