Re: [RFC PATCH v9 02/13] x86: always set IF before oopsing from page fault

2019-04-03 Thread Andy Lutomirski
On Wed, Apr 3, 2019 at 6:42 PM Tycho Andersen  wrote:
>
> On Wed, Apr 03, 2019 at 05:12:56PM -0700, Andy Lutomirski wrote:
> > On Wed, Apr 3, 2019 at 10:36 AM Khalid Aziz  wrote:
> > >
> > > From: Tycho Andersen 
> > >
> > > Oopsing might kill the task, via rewind_stack_do_exit() at the bottom, and
> > > that might sleep:
> > >
> >
> >
> > > diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
> > > index 9d5c75f02295..7891add0913f 100644
> > > --- a/arch/x86/mm/fault.c
> > > +++ b/arch/x86/mm/fault.c
> > > @@ -858,6 +858,12 @@ no_context(struct pt_regs *regs, unsigned long 
> > > error_code,
> > > /* Executive summary in case the body of the oops scrolled away */
> > > printk(KERN_DEFAULT "CR2: %016lx\n", address);
> > >
> > > +   /*
> > > +* We're about to oops, which might kill the task. Make sure we're
> > > +* allowed to sleep.
> > > +*/
> > > +   flags |= X86_EFLAGS_IF;
> > > +
> > > oops_end(flags, regs, sig);
> > >  }
> > >
> >
> >
> > NAK.  If there's a bug in rewind_stack_do_exit(), please fix it in
> > rewind_stack_do_exit().
>
> [I trimmed the CC list since google rejected it with E2BIG :)]
>
> I guess the problem is really that do_exit() (or really
> exit_signals()) might sleep. Maybe we should put an irq_enable() at
> the beginning of do_exit() instead and fix this problem for all
> arches?
>

Hmm.  do_exit() isn't really meant to be "try your best to leave the
system somewhat usable without returning" -- it's a function that,
other than in OOPSes, is called from a well-defined state.  So I think
rewind_stack_do_exit() is probably a better spot.  But we need to
rewind the stack and *then* turn on IRQs, since we otherwise risk
exploding quite badly.
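
For reference, the do_exit() variant floated above would be roughly the following minimal sketch (untested; the exact placement of the check is an assumption):

void __noreturn do_exit(long code)
{
	/*
	 * Sketch only: an oops path such as rewind_stack_do_exit() may get
	 * here with IRQs off, and exit_signals() further down may sleep.
	 */
	if (unlikely(irqs_disabled())) {
		pr_warn("do_exit() entered with IRQs disabled\n");
		local_irq_enable();
	}

	/* ... existing do_exit() body, including exit_signals(), follows ... */
}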


Re: [RFC PATCH v9 12/13] xpfo, mm: Defer TLB flushes for non-current CPUs (x86 only)

2019-04-03 Thread Andy Lutomirski
On Wed, Apr 3, 2019 at 10:36 AM Khalid Aziz  wrote:
>
> XPFO flushes kernel space TLB entries for pages that are now mapped
> in userspace, not only on the current CPU but also synchronously on
> all other CPUs. Processes on each core allocating pages cause a
> flood of IPI messages to all other cores to flush TLB entries.
> Many of these messages are to flush the entire TLB on the core if
> the number of entries being flushed from the local core exceeds
> tlb_single_page_flush_ceiling. The cost of TLB flushes caused by
> unmapping pages from physmap goes up dramatically on machines with
> high core counts.
>
> This patch flushes relevant TLB entries for the current process or
> the entire TLB, depending upon the number of entries for the current
> CPU, and posts a pending TLB flush on all other CPUs when a page is
> unmapped from kernel space and mapped in userspace. Each core
> checks its pending TLB flush flag on every context switch, flushes
> its TLB if the flag is set and clears it. This patch potentially
> aggregates multiple TLB flushes into one, which has a very
> significant impact especially on machines with large core counts.

Why is this a reasonable strategy?

> +void xpfo_flush_tlb_kernel_range(unsigned long start, unsigned long end)
> +{
> +   struct cpumask tmp_mask;
> +
> +   /*
> +* Balance as user space task's flush, a bit conservative.
> +* Do a local flush immediately and post a pending flush on all
> +* other CPUs. Local flush can be a range flush or full flush
> +* depending upon the number of entries to be flushed. Remote
> +* flushes will be done by individual processors at the time of
> +* context switch and this allows multiple flush requests from
> +* other CPUs to be batched together.
> +*/

I don't like this function at all.  A core function like this is a
contract of sorts between the caller and the implementation.  There is
no such thing as an "xpfo" flush, and this function's behavior isn't
at all well defined.  For flush_tlb_kernel_range(), I can tell you
exactly what that function does, and the implementation is either
correct or incorrect.  With this function, I have no idea what is
actually required, and I can't possibly tell whether it's correct.
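
For comparison, the existing x86 flush_tlb_kernel_range() has exactly that kind of well-defined contract - every CPU has been flushed by the time it returns.  Roughly (paraphrased from arch/x86/mm/tlb.c around v5.0):

void flush_tlb_kernel_range(unsigned long start, unsigned long end)
{
	/* Balance as user space task's flush, a bit conservative */
	if (end == TLB_FLUSH_ALL ||
	    (end - start) > tlb_single_page_flush_ceiling << PAGE_SHIFT) {
		on_each_cpu(do_flush_tlb_all, NULL, 1);
	} else {
		struct flush_tlb_info info;

		info.start = start;
		info.end = end;
		on_each_cpu(do_kernel_range_flush, &info, 1);
	}
}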

As far as I can see, xpfo_flush_tlb_kernel_range() actually means
"flush this range on this CPU right now, and flush it on remote CPUs
eventually".  It would be valid, but probably silly, to flush locally
and to never flush at all on remote CPUs.  This makes me wonder what
the point is.


Re: [RFC PATCH v9 02/13] x86: always set IF before oopsing from page fault

2019-04-03 Thread Tycho Andersen
On Wed, Apr 03, 2019 at 05:12:56PM -0700, Andy Lutomirski wrote:
> On Wed, Apr 3, 2019 at 10:36 AM Khalid Aziz  wrote:
> >
> > From: Tycho Andersen 
> >
> > Oopsing might kill the task, via rewind_stack_do_exit() at the bottom, and
> > that might sleep:
> >
> 
> 
> > diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
> > index 9d5c75f02295..7891add0913f 100644
> > --- a/arch/x86/mm/fault.c
> > +++ b/arch/x86/mm/fault.c
> > @@ -858,6 +858,12 @@ no_context(struct pt_regs *regs, unsigned long 
> > error_code,
> > /* Executive summary in case the body of the oops scrolled away */
> > printk(KERN_DEFAULT "CR2: %016lx\n", address);
> >
> > +   /*
> > +* We're about to oops, which might kill the task. Make sure we're
> > +* allowed to sleep.
> > +*/
> > +   flags |= X86_EFLAGS_IF;
> > +
> > oops_end(flags, regs, sig);
> >  }
> >
> 
> 
> NAK.  If there's a bug in rewind_stack_do_exit(), please fix it in
> rewind_stack_do_exit().

[I trimmed the CC list since google rejected it with E2BIG :)]

I guess the problem is really that do_exit() (or really
exit_signals()) might sleep. Maybe we should put an irq_enable() at
the beginning of do_exit() instead and fix this problem for all
arches?

Tycho


Re: [PATCH AUTOSEL 5.0 008/262] swiotlb: add checks for the return value of memblock_alloc*()

2019-04-03 Thread Sasha Levin

On Thu, Mar 28, 2019 at 07:55:20AM +0200, Mike Rapoport wrote:
> Hi,
>
> On Wed, Mar 27, 2019 at 01:57:43PM -0400, Sasha Levin wrote:
> > From: Mike Rapoport 
> >
> > [ Upstream commit a0bf842e89a3842162aa8514b9bf4611c86fee10 ]
> >
> > Add panic() calls if memblock_alloc() returns NULL.
> >
> > The panic() format duplicates the one used by memblock itself and in
> > order to avoid explosion with long parameters list replace open coded
> > allocation size calculations with a local variable.
>
> This patch is a part of a series that removes panic() calls from memblock
> allocators rather than a fix.

I've dropped it, thanks!

--
Thanks,
Sasha


Re: [RFC PATCH v9 02/13] x86: always set IF before oopsing from page fault

2019-04-03 Thread Andy Lutomirski
On Wed, Apr 3, 2019 at 10:36 AM Khalid Aziz  wrote:
>
> From: Tycho Andersen 
>
> Oopsing might kill the task, via rewind_stack_do_exit() at the bottom, and
> that might sleep:
>


> diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
> index 9d5c75f02295..7891add0913f 100644
> --- a/arch/x86/mm/fault.c
> +++ b/arch/x86/mm/fault.c
> @@ -858,6 +858,12 @@ no_context(struct pt_regs *regs, unsigned long 
> error_code,
> /* Executive summary in case the body of the oops scrolled away */
> printk(KERN_DEFAULT "CR2: %016lx\n", address);
>
> +   /*
> +* We're about to oops, which might kill the task. Make sure we're
> +* allowed to sleep.
> +*/
> +   flags |= X86_EFLAGS_IF;
> +
> oops_end(flags, regs, sig);
>  }
>


NAK.  If there's a bug in rewind_stack_do_exit(), please fix it in
rewind_stack_do_exit().


[RFC PATCH v9 04/13] xpfo, x86: Add support for XPFO for x86-64

2019-04-03 Thread Khalid Aziz
From: Juerg Haefliger 

This patch adds support for XPFO for x86-64. It uses the generic
infrastructure in place for XPFO and adds the architecture specific
bits to enable XPFO on x86-64.

CC: x...@kernel.org
Suggested-by: Vasileios P. Kemerlis 
Signed-off-by: Juerg Haefliger 
Signed-off-by: Tycho Andersen 
Signed-off-by: Marco Benatto 
Signed-off-by: Julian Stecklina 
Signed-off-by: Khalid Aziz 
Cc: Khalid Aziz 
---
 .../admin-guide/kernel-parameters.txt |  10 +-
 arch/x86/Kconfig  |   1 +
 arch/x86/include/asm/pgtable.h|  26 
 arch/x86/mm/Makefile  |   2 +
 arch/x86/mm/pageattr.c|  23 +---
 arch/x86/mm/xpfo.c| 123 ++
 include/linux/xpfo.h  |   2 +
 7 files changed, 162 insertions(+), 25 deletions(-)
 create mode 100644 arch/x86/mm/xpfo.c

diff --git a/Documentation/admin-guide/kernel-parameters.txt 
b/Documentation/admin-guide/kernel-parameters.txt
index 9b36da94760e..e65e3bc1efe0 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -2997,11 +2997,11 @@
 
nox2apic[X86-64,APIC] Do not enable x2APIC mode.
 
-   noxpfo  [XPFO] Disable eXclusive Page Frame Ownership (XPFO)
-   when CONFIG_XPFO is on. Physical pages mapped into
-   user applications will also be mapped in the
-   kernel's address space as if CONFIG_XPFO was not
-   enabled.
+   noxpfo  [XPFO,X86-64] Disable eXclusive Page Frame
+   Ownership (XPFO) when CONFIG_XPFO is on. Physical
+   pages mapped into user applications will also be
+   mapped in the kernel's address space as if
+   CONFIG_XPFO was not enabled.
 
cpu0_hotplug[X86] Turn on CPU0 hotplug feature when
CONFIG_BOOTPARAM_HOTPLUG_CPU0 is off.
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 68261430fe6e..122786604252 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -209,6 +209,7 @@ config X86
select USER_STACKTRACE_SUPPORT
select VIRT_TO_BUS
select X86_FEATURE_NAMESif PROC_FS
+   select ARCH_SUPPORTS_XPFO   if X86_64
 
 config INSTRUCTION_DECODER
def_bool y
diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 2779ace16d23..5c0e1581fa56 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -1437,6 +1437,32 @@ static inline bool arch_has_pfn_modify_check(void)
return boot_cpu_has_bug(X86_BUG_L1TF);
 }
 
+/*
+ * The current flushing context - we pass it instead of 5 arguments:
+ */
+struct cpa_data {
+   unsigned long   *vaddr;
+   pgd_t   *pgd;
+   pgprot_tmask_set;
+   pgprot_tmask_clr;
+   unsigned long   numpages;
+   unsigned long   curpage;
+   unsigned long   pfn;
+   unsigned intflags;
+   unsigned intforce_split : 1,
+   force_static_prot   : 1;
+   struct page **pages;
+};
+
+
+int
+should_split_large_page(pte_t *kpte, unsigned long address,
+   struct cpa_data *cpa);
+extern spinlock_t cpa_lock;
+int
+__split_large_page(struct cpa_data *cpa, pte_t *kpte, unsigned long address,
+  struct page *base);
+
 #include 
 #endif /* __ASSEMBLY__ */
 
diff --git a/arch/x86/mm/Makefile b/arch/x86/mm/Makefile
index 4b101dd6e52f..93b0fdaf4a99 100644
--- a/arch/x86/mm/Makefile
+++ b/arch/x86/mm/Makefile
@@ -53,3 +53,5 @@ obj-$(CONFIG_PAGE_TABLE_ISOLATION)+= pti.o
 obj-$(CONFIG_AMD_MEM_ENCRYPT)  += mem_encrypt.o
 obj-$(CONFIG_AMD_MEM_ENCRYPT)  += mem_encrypt_identity.o
 obj-$(CONFIG_AMD_MEM_ENCRYPT)  += mem_encrypt_boot.o
+
+obj-$(CONFIG_XPFO) += xpfo.o
diff --git a/arch/x86/mm/pageattr.c b/arch/x86/mm/pageattr.c
index 14e6119838a6..530b5df0617e 100644
--- a/arch/x86/mm/pageattr.c
+++ b/arch/x86/mm/pageattr.c
@@ -28,23 +28,6 @@
 
 #include "mm_internal.h"
 
-/*
- * The current flushing context - we pass it instead of 5 arguments:
- */
-struct cpa_data {
-   unsigned long   *vaddr;
-   pgd_t   *pgd;
-   pgprot_tmask_set;
-   pgprot_tmask_clr;
-   unsigned long   numpages;
-   unsigned long   curpage;
-   unsigned long   pfn;
-   unsigned intflags;
-   unsigned intforce_split : 1,
-   force_static_prot   : 1;
-   struct page **pages;
-};
-
 enum cpa_warn {
CPA_CONFLICT,
CPA_PROTECT,
@@ -59,7 +42,7 @@ static const int cpa_warn_level = CPA_PROTECT;
  * entries change the page attribute in parallel to some other cpu
  * splitting a large page entry along with changing the attribute.
  */
-static 

[RFC PATCH v9 05/13] mm: add a user_virt_to_phys symbol

2019-04-03 Thread Khalid Aziz
From: Tycho Andersen 

We need something like this for testing XPFO. Since it's architecture
specific, putting it in the test code is slightly awkward, so let's make it
an arch-specific symbol and export it for use in LKDTM.

CC: linux-arm-ker...@lists.infradead.org
CC: x...@kernel.org
Signed-off-by: Tycho Andersen 
Tested-by: Marco Benatto 
Signed-off-by: Khalid Aziz 
Cc: Khalid Aziz 
---
v7: * make user_virt_to_phys a GPL symbol

v6: * add a definition of user_virt_to_phys in the !CONFIG_XPFO case

 arch/x86/mm/xpfo.c   | 57 
 include/linux/xpfo.h |  4 
 2 files changed, 61 insertions(+)

diff --git a/arch/x86/mm/xpfo.c b/arch/x86/mm/xpfo.c
index 3045bb7e4659..b42513347865 100644
--- a/arch/x86/mm/xpfo.c
+++ b/arch/x86/mm/xpfo.c
@@ -121,3 +121,60 @@ inline void xpfo_flush_kernel_tlb(struct page *page, int 
order)
flush_tlb_kernel_range(kaddr, kaddr + (1 << order) * size);
 }
 EXPORT_SYMBOL_GPL(xpfo_flush_kernel_tlb);
+
+/* Convert a user space virtual address to a physical address.
+ * Shamelessly copied from slow_virt_to_phys() and lookup_address() in
+ * arch/x86/mm/pageattr.c
+ */
+phys_addr_t user_virt_to_phys(unsigned long addr)
+{
+   phys_addr_t phys_addr;
+   unsigned long offset;
+   pgd_t *pgd;
+   p4d_t *p4d;
+   pud_t *pud;
+   pmd_t *pmd;
+   pte_t *pte;
+
+   pgd = pgd_offset(current->mm, addr);
+   if (pgd_none(*pgd))
+   return 0;
+
+   p4d = p4d_offset(pgd, addr);
+   if (p4d_none(*p4d))
+   return 0;
+
+   if (p4d_large(*p4d) || !p4d_present(*p4d)) {
+   phys_addr = (unsigned long)p4d_pfn(*p4d) << PAGE_SHIFT;
+   offset = addr & ~P4D_MASK;
+   goto out;
+   }
+
+   pud = pud_offset(p4d, addr);
+   if (pud_none(*pud))
+   return 0;
+
+   if (pud_large(*pud) || !pud_present(*pud)) {
+   phys_addr = (unsigned long)pud_pfn(*pud) << PAGE_SHIFT;
+   offset = addr & ~PUD_MASK;
+   goto out;
+   }
+
+   pmd = pmd_offset(pud, addr);
+   if (pmd_none(*pmd))
+   return 0;
+
+   if (pmd_large(*pmd) || !pmd_present(*pmd)) {
+   phys_addr = (unsigned long)pmd_pfn(*pmd) << PAGE_SHIFT;
+   offset = addr & ~PMD_MASK;
+   goto out;
+   }
+
+   pte =  pte_offset_kernel(pmd, addr);
+   phys_addr = (phys_addr_t)pte_pfn(*pte) << PAGE_SHIFT;
+   offset = addr & ~PAGE_MASK;
+
+out:
+   return (phys_addr_t)(phys_addr | offset);
+}
+EXPORT_SYMBOL_GPL(user_virt_to_phys);
diff --git a/include/linux/xpfo.h b/include/linux/xpfo.h
index c1d232da7ee0..5d8d06e4b796 100644
--- a/include/linux/xpfo.h
+++ b/include/linux/xpfo.h
@@ -89,6 +89,8 @@ static inline void xpfo_kunmap(void *kaddr, struct page *page)
 void xpfo_alloc_pages(struct page *page, int order, gfp_t gfp, bool will_map);
 void xpfo_free_pages(struct page *page, int order);
 
+phys_addr_t user_virt_to_phys(unsigned long addr);
+
 #else /* !CONFIG_XPFO */
 
 static inline void xpfo_init_single_page(struct page *page) { }
@@ -102,6 +104,8 @@ static inline void xpfo_free_pages(struct page *page, int 
order) { }
 static inline void set_kpte(void *kaddr, struct page *page, pgprot_t prot) { }
 static inline void xpfo_flush_kernel_tlb(struct page *page, int order) { }
 
+static inline phys_addr_t user_virt_to_phys(unsigned long addr) { return 0; }
+
 #endif /* CONFIG_XPFO */
 
 #if (!defined(CONFIG_HIGHMEM)) && (!defined(ARCH_HAS_KMAP))
-- 
2.17.1



[RFC PATCH v9 01/13] mm: add MAP_HUGETLB support to vm_mmap

2019-04-03 Thread Khalid Aziz
From: Tycho Andersen 

vm_mmap is exported, which means kernel modules can use it. In particular,
for testing XPFO support, we want to use it with the MAP_HUGETLB flag, so
let's support it via vm_mmap.
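
A minimal sketch of the intended use from a test module (hypothetical caller; it mirrors how the LKDTM test later in this series calls vm_mmap(), and the length is rounded up to the huge page size by the new helper):

	unsigned long user_addr;

	user_addr = vm_mmap(NULL, 0, PAGE_SIZE,
			    PROT_READ | PROT_WRITE,
			    MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, 0);
	if (user_addr >= TASK_SIZE)
		pr_warn("hugetlb vm_mmap failed\n");
	else
		vm_munmap(user_addr, PAGE_SIZE);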

Signed-off-by: Tycho Andersen 
Tested-by: Marco Benatto 
Tested-by: Khalid Aziz 
Cc: Khalid Aziz 
---
 include/linux/mm.h |  2 ++
 mm/mmap.c  | 19 +--
 mm/util.c  | 32 
 3 files changed, 35 insertions(+), 18 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 80bb6408fe73..3e4f6525d06b 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2412,6 +2412,8 @@ struct vm_unmapped_area_info {
 extern unsigned long unmapped_area(struct vm_unmapped_area_info *info);
 extern unsigned long unmapped_area_topdown(struct vm_unmapped_area_info *info);
 
+struct file *map_hugetlb_setup(unsigned long *len, unsigned long flags);
+
 /*
  * Search for an unmapped address range.
  *
diff --git a/mm/mmap.c b/mm/mmap.c
index fc1809b1bed6..65382d942598 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1582,24 +1582,7 @@ unsigned long ksys_mmap_pgoff(unsigned long addr, 
unsigned long len,
if (unlikely(flags & MAP_HUGETLB && !is_file_hugepages(file)))
goto out_fput;
} else if (flags & MAP_HUGETLB) {
-   struct user_struct *user = NULL;
-   struct hstate *hs;
-
-   hs = hstate_sizelog((flags >> MAP_HUGE_SHIFT) & MAP_HUGE_MASK);
-   if (!hs)
-   return -EINVAL;
-
-   len = ALIGN(len, huge_page_size(hs));
-   /*
-* VM_NORESERVE is used because the reservations will be
-* taken when vm_ops->mmap() is called
-* A dummy user value is used because we are not locking
-* memory so no accounting is necessary
-*/
-   file = hugetlb_file_setup(HUGETLB_ANON_FILE, len,
-   VM_NORESERVE,
-   &user, HUGETLB_ANONHUGE_INODE,
-   (flags >> MAP_HUGE_SHIFT) & MAP_HUGE_MASK);
+   file = map_hugetlb_setup(&len, flags);
if (IS_ERR(file))
return PTR_ERR(file);
}
diff --git a/mm/util.c b/mm/util.c
index 379319b1bcfd..86b763861828 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -357,6 +357,29 @@ unsigned long vm_mmap_pgoff(struct file *file, unsigned 
long addr,
return ret;
 }
 
+struct file *map_hugetlb_setup(unsigned long *len, unsigned long flags)
+{
+   struct user_struct *user = NULL;
+   struct hstate *hs;
+
+   hs = hstate_sizelog((flags >> MAP_HUGE_SHIFT) & MAP_HUGE_MASK);
+   if (!hs)
+   return ERR_PTR(-EINVAL);
+
+   *len = ALIGN(*len, huge_page_size(hs));
+
+   /*
+* VM_NORESERVE is used because the reservations will be
+* taken when vm_ops->mmap() is called
+* A dummy user value is used because we are not locking
+* memory so no accounting is necessary
+*/
+   return hugetlb_file_setup(HUGETLB_ANON_FILE, *len,
+   VM_NORESERVE,
+   &user, HUGETLB_ANONHUGE_INODE,
+   (flags >> MAP_HUGE_SHIFT) & MAP_HUGE_MASK);
+}
+
 unsigned long vm_mmap(struct file *file, unsigned long addr,
unsigned long len, unsigned long prot,
unsigned long flag, unsigned long offset)
@@ -366,6 +389,15 @@ unsigned long vm_mmap(struct file *file, unsigned long 
addr,
if (unlikely(offset_in_page(offset)))
return -EINVAL;
 
+   if (flag & MAP_HUGETLB) {
+   if (file)
+   return -EINVAL;
+
+   file = map_hugetlb_setup(&len, flag);
+   if (IS_ERR(file))
+   return PTR_ERR(file);
+   }
+
return vm_mmap_pgoff(file, addr, len, prot, flag, offset >> PAGE_SHIFT);
 }
 EXPORT_SYMBOL(vm_mmap);
-- 
2.17.1



[RFC PATCH v9 07/13] arm64/mm: Add support for XPFO

2019-04-03 Thread Khalid Aziz
From: Juerg Haefliger 

Enable support for eXclusive Page Frame Ownership (XPFO) for arm64 and
provide a hook for updating a single kernel page table entry (which is
required by the generic XPFO code).

XPFO doesn't support section/contiguous mappings yet, so let's disable
it if XPFO is turned on.

Thanks to Laura Abbott for the simplification from v5, and Mark Rutland
for pointing out we need NO_CONT_MAPPINGS too.

CC: linux-arm-ker...@lists.infradead.org
Signed-off-by: Juerg Haefliger 
Signed-off-by: Tycho Andersen 
Signed-off-by: Khalid Aziz 
Cc: Khalid Aziz 
---
 .../admin-guide/kernel-parameters.txt |  2 +-
 arch/arm64/Kconfig|  1 +
 arch/arm64/mm/Makefile|  2 +
 arch/arm64/mm/mmu.c   |  2 +-
 arch/arm64/mm/xpfo.c  | 66 +++
 5 files changed, 71 insertions(+), 2 deletions(-)
 create mode 100644 arch/arm64/mm/xpfo.c

diff --git a/Documentation/admin-guide/kernel-parameters.txt 
b/Documentation/admin-guide/kernel-parameters.txt
index e65e3bc1efe0..9fcf8c83031a 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -2997,7 +2997,7 @@
 
nox2apic[X86-64,APIC] Do not enable x2APIC mode.
 
-   noxpfo  [XPFO,X86-64] Disable eXclusive Page Frame
+   noxpfo  [XPFO,X86-64,ARM64] Disable eXclusive Page Frame
Ownership (XPFO) when CONFIG_XPFO is on. Physical
pages mapped into user applications will also be
mapped in the kernel's address space as if
diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index a4168d366127..9a8d8e649cf8 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -174,6 +174,7 @@ config ARM64
select SWIOTLB
select SYSCTL_EXCEPTION_TRACE
select THREAD_INFO_IN_TASK
+   select ARCH_SUPPORTS_XPFO
help
  ARM 64-bit (AArch64) Linux support.
 
diff --git a/arch/arm64/mm/Makefile b/arch/arm64/mm/Makefile
index 849c1df3d214..cca3808d9776 100644
--- a/arch/arm64/mm/Makefile
+++ b/arch/arm64/mm/Makefile
@@ -12,3 +12,5 @@ KASAN_SANITIZE_physaddr.o += n
 
 obj-$(CONFIG_KASAN)+= kasan_init.o
 KASAN_SANITIZE_kasan_init.o:= n
+
+obj-$(CONFIG_XPFO) += xpfo.o
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index b6f5aa52ac67..1673f7443d62 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -453,7 +453,7 @@ static void __init map_mem(pgd_t *pgdp)
struct memblock_region *reg;
int flags = 0;
 
-   if (rodata_full || debug_pagealloc_enabled())
+   if (rodata_full || debug_pagealloc_enabled() || xpfo_enabled())
flags = NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS;
 
/*
diff --git a/arch/arm64/mm/xpfo.c b/arch/arm64/mm/xpfo.c
new file mode 100644
index ..7866c5acfffb
--- /dev/null
+++ b/arch/arm64/mm/xpfo.c
@@ -0,0 +1,66 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (C) 2017 Hewlett Packard Enterprise Development, L.P.
+ * Copyright (C) 2016 Brown University. All rights reserved.
+ *
+ * Authors:
+ *   Juerg Haefliger 
+ *   Vasileios P. Kemerlis 
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License version 2 as published by
+ * the Free Software Foundation.
+ */
+
+#include 
+#include 
+
+#include 
+
+/*
+ * Lookup the page table entry for a virtual address and return a pointer to
+ * the entry. Based on x86 tree.
+ */
+static pte_t *lookup_address(unsigned long addr)
+{
+   pgd_t *pgd;
+   pud_t *pud;
+   pmd_t *pmd;
+
+   pgd = pgd_offset_k(addr);
+   if (pgd_none(*pgd))
+   return NULL;
+
+   pud = pud_offset(pgd, addr);
+   if (pud_none(*pud))
+   return NULL;
+
+   pmd = pmd_offset(pud, addr);
+   if (pmd_none(*pmd))
+   return NULL;
+
+   return pte_offset_kernel(pmd, addr);
+}
+
+/* Update a single kernel page table entry */
+inline void set_kpte(void *kaddr, struct page *page, pgprot_t prot)
+{
+   pte_t *pte = lookup_address((unsigned long)kaddr);
+
+   if (unlikely(!pte)) {
+   WARN(1, "xpfo: invalid address %p\n", kaddr);
+   return;
+   }
+
+   set_pte(pte, pfn_pte(page_to_pfn(page), prot));
+}
+EXPORT_SYMBOL_GPL(set_kpte);
+
+inline void xpfo_flush_kernel_tlb(struct page *page, int order)
+{
+   unsigned long kaddr = (unsigned long)page_address(page);
+   unsigned long size = PAGE_SIZE;
+
+   flush_tlb_kernel_range(kaddr, kaddr + (1 << order) * size);
+}
+EXPORT_SYMBOL_GPL(xpfo_flush_kernel_tlb);
-- 
2.17.1



[RFC PATCH v9 13/13] xpfo, mm: Optimize XPFO TLB flushes by batching them together

2019-04-03 Thread Khalid Aziz
When XPFO forces a TLB flush on all cores, the performance impact is
very significant. Batching as many of these TLB updates as
possible can help lower this impact. When userspace allocates a
page, the kernel tries to get that page from the per-cpu free list.
This free list is replenished in bulk when it runs low. Replenishing
the free list for future userspace allocations is a good opportunity
to update TLB entries in batch and reduce the impact of multiple TLB
flushes later. This patch adds new tags for a page so it can be
marked as available for userspace allocation and unmapped from the
kernel address space. All such pages are removed from the kernel
address space in bulk at the time they are added to the per-cpu free
list. This patch, when combined with deferred TLB flushes, improves
performance further. Using the same benchmark as before (building
the kernel in parallel), here are the system times on two differently
sized systems:

Hardware: 96-core Intel Xeon Platinum 8160 CPU @ 2.10GHz, 768 GB RAM
make -j60 all

5.0                                       913.862s
5.0+XPFO+Deferred flush+Batch update     1165.259s   1.28x

Hardware: 4-core Intel Core i5-3550 CPU @ 3.30GHz, 8G RAM
make -j4 all

5.0                                       610.642s
5.0+XPFO+Deferred flush+Batch update      773.075s   1.27x

Signed-off-by: Khalid Aziz 
Cc: Khalid Aziz 
Signed-off-by: Tycho Andersen 
---
v9:
- Do not map a page freed by userspace back into kernel. Mark
 it as unmapped instead and map it back in only when needed. This
 avoids the cost of unmap and TLB flush if the page is allocated
 back to userspace.
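
The batching described above boils down to roughly the following (a sketch only, with a hypothetical helper name; the real changes are spread across mm/page_alloc.c and mm/xpfo.c in the diff below, and the page flag helpers come from patch 3):

/* sketch: called while pages are being moved onto a per-cpu free list */
static void xpfo_unmap_pcp_batch(struct list_head *pcp_list)
{
	struct page *page;
	bool flush = false;

	list_for_each_entry(page, pcp_list, lru) {
		/* only pages destined for userspace leave the physmap */
		if (PageXpfoUser(page) && !PageXpfoUnmapped(page)) {
			SetPageXpfoUnmapped(page);
			set_kpte(page_address(page), page, __pgprot(0));
			flush = true;
		}
	}

	/* one (deferrable) TLB flush for the whole batch, not one per page */
	if (flush)
		xpfo_flush_tlb_all();
}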

 arch/x86/include/asm/pgtable.h |  2 +-
 arch/x86/mm/pageattr.c |  9 --
 arch/x86/mm/xpfo.c | 11 +--
 include/linux/xpfo.h   | 11 +++
 mm/page_alloc.c|  9 ++
 mm/xpfo.c  | 54 +++---
 6 files changed, 79 insertions(+), 17 deletions(-)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 5c0e1581fa56..61f64c6c687c 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -1461,7 +1461,7 @@ should_split_large_page(pte_t *kpte, unsigned long 
address,
 extern spinlock_t cpa_lock;
 int
 __split_large_page(struct cpa_data *cpa, pte_t *kpte, unsigned long address,
-  struct page *base);
+  struct page *base, bool xpfo_split);
 
 #include 
 #endif /* __ASSEMBLY__ */
diff --git a/arch/x86/mm/pageattr.c b/arch/x86/mm/pageattr.c
index 530b5df0617e..8fe86ac6bff0 100644
--- a/arch/x86/mm/pageattr.c
+++ b/arch/x86/mm/pageattr.c
@@ -911,7 +911,7 @@ static void split_set_pte(struct cpa_data *cpa, pte_t *pte, 
unsigned long pfn,
 
 int
 __split_large_page(struct cpa_data *cpa, pte_t *kpte, unsigned long address,
-  struct page *base)
+  struct page *base, bool xpfo_split)
 {
unsigned long lpaddr, lpinc, ref_pfn, pfn, pfninc = 1;
pte_t *pbase = (pte_t *)page_address(base);
@@ -1008,7 +1008,10 @@ __split_large_page(struct cpa_data *cpa, pte_t *kpte, 
unsigned long address,
 * page attribute in parallel, that also falls into the
 * just split large page entry.
 */
-   flush_tlb_all();
+   if (xpfo_split)
+   xpfo_flush_tlb_all();
+   else
+   flush_tlb_all();
spin_unlock(&cpa_lock);
 
return 0;
@@ -1027,7 +1030,7 @@ static int split_large_page(struct cpa_data *cpa, pte_t 
*kpte,
if (!base)
return -ENOMEM;
 
-   if (__split_large_page(cpa, kpte, address, base))
+   if (__split_large_page(cpa, kpte, address, base, false))
__free_page(base);
 
return 0;
diff --git a/arch/x86/mm/xpfo.c b/arch/x86/mm/xpfo.c
index 638eee5b1f09..8c482c7b54f5 100644
--- a/arch/x86/mm/xpfo.c
+++ b/arch/x86/mm/xpfo.c
@@ -47,7 +47,7 @@ inline void set_kpte(void *kaddr, struct page *page, pgprot_t 
prot)
 
cpa.vaddr = kaddr;
cpa.pages = &page;
-   cpa.mask_set = prot;
+   cpa.mask_set = canon_pgprot(prot);
cpa.mask_clr = msk_clr;
cpa.numpages = 1;
cpa.flags = 0;
@@ -57,7 +57,7 @@ inline void set_kpte(void *kaddr, struct page *page, pgprot_t 
prot)
 
do_split = should_split_large_page(pte, (unsigned long)kaddr,
   &cpa);
-   if (do_split) {
+   if (do_split > 0) {
struct page *base;
 
base = alloc_pages(GFP_ATOMIC, 0);
@@ -69,7 +69,7 @@ inline void set_kpte(void *kaddr, struct page *page, pgprot_t 
prot)
if (!debug_pagealloc_enabled())
spin_lock(&cpa_lock);
if (__split_large_page(&cpa, pte, (unsigned long)kaddr,
-   base) < 

[RFC PATCH v9 11/13] xpfo, mm: optimize spinlock usage in xpfo_kunmap

2019-04-03 Thread Khalid Aziz
From: Julian Stecklina 

Only the xpfo_kunmap call that needs to actually unmap the page
needs to be serialized. We need to be careful to handle the case
where, after the atomic decrement of the mapcount, an xpfo_kmap
increased the mapcount again. In this case, we can safely skip
modifying the page table.

Model-checked with up to 4 concurrent callers with Spin.
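
For context, the xpfo_kmap() side (not visible in this hunk) does roughly the opposite; this is only a sketch reconstructed from how the counters and flags are used below, so details may differ from the actual code:

static inline void xpfo_kmap(void *kaddr, struct page *page)
{
	unsigned long flags;

	if (!static_branch_unlikely(&xpfo_inited))
		return;

	if (!PageXpfoUser(page))
		return;

	/*
	 * The page is about to be used through its kernel mapping: bump
	 * the mapcount and, if it was unmapped, restore the physmap entry
	 * before anyone dereferences kaddr.
	 */
	spin_lock_irqsave(&page->xpfo_lock, flags);
	atomic_inc(&page->xpfo_mapcount);
	if (TestClearPageXpfoUnmapped(page))
		set_kpte(kaddr, page, PAGE_KERNEL);
	spin_unlock_irqrestore(&page->xpfo_lock, flags);
}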

Signed-off-by: Julian Stecklina 
Signed-off-by: Khalid Aziz 
Cc: Khalid Aziz 
Cc: x...@kernel.org
Cc: kernel-harden...@lists.openwall.com
Cc: Vasileios P. Kemerlis 
Cc: Juerg Haefliger 
Cc: Tycho Andersen 
Cc: Marco Benatto 
Cc: David Woodhouse 
---
 include/linux/xpfo.h | 24 +++-
 1 file changed, 15 insertions(+), 9 deletions(-)

diff --git a/include/linux/xpfo.h b/include/linux/xpfo.h
index 2318c7eb5fb7..37e7f52fa6ce 100644
--- a/include/linux/xpfo.h
+++ b/include/linux/xpfo.h
@@ -61,6 +61,7 @@ static inline void xpfo_kmap(void *kaddr, struct page *page)
 static inline void xpfo_kunmap(void *kaddr, struct page *page)
 {
unsigned long flags;
+   bool flush_tlb = false;
 
if (!static_branch_unlikely(_inited))
return;
@@ -72,18 +73,23 @@ static inline void xpfo_kunmap(void *kaddr, struct page 
*page)
 * The page is to be allocated back to user space, so unmap it from
 * the kernel, flush the TLB and tag it as a user page.
 */
-   spin_lock_irqsave(&page->xpfo_lock, flags);
-
if (atomic_dec_return(&page->xpfo_mapcount) == 0) {
-#ifdef CONFIG_XPFO_DEBUG
-   WARN_ON(PageXpfoUnmapped(page));
-#endif
-   SetPageXpfoUnmapped(page);
-   set_kpte(kaddr, page, __pgprot(0));
-   xpfo_flush_kernel_tlb(page, 0);
+   spin_lock_irqsave(&page->xpfo_lock, flags);
+
+   /*
+* In the case, where we raced with kmap after the
+* atomic_dec_return, we must not nuke the mapping.
+*/
+   if (atomic_read(&page->xpfo_mapcount) == 0) {
+   SetPageXpfoUnmapped(page);
+   set_kpte(kaddr, page, __pgprot(0));
+   flush_tlb = true;
+   }
+   spin_unlock_irqrestore(&page->xpfo_lock, flags);
}
 
-   spin_unlock_irqrestore(&page->xpfo_lock, flags);
+   if (flush_tlb)
+   xpfo_flush_kernel_tlb(page, 0);
 }
 
 void xpfo_alloc_pages(struct page *page, int order, gfp_t gfp, bool will_map);
-- 
2.17.1



[RFC PATCH v9 09/13] xpfo: add primitives for mapping underlying memory

2019-04-03 Thread Khalid Aziz
From: Tycho Andersen 

In some cases (on arm64 DMA and data cache flushes) we may have unmapped
the underlying pages needed for something via XPFO. Here are some
primitives useful for ensuring the underlying memory is mapped/unmapped in
the face of xpfo.

Signed-off-by: Tycho Andersen 
Signed-off-by: Khalid Aziz 
Cc: Khalid Aziz 
---
 include/linux/xpfo.h | 21 +
 mm/xpfo.c| 30 ++
 2 files changed, 51 insertions(+)

diff --git a/include/linux/xpfo.h b/include/linux/xpfo.h
index 5d8d06e4b796..2318c7eb5fb7 100644
--- a/include/linux/xpfo.h
+++ b/include/linux/xpfo.h
@@ -91,6 +91,15 @@ void xpfo_free_pages(struct page *page, int order);
 
 phys_addr_t user_virt_to_phys(unsigned long addr);
 
+#define XPFO_NUM_PAGES(addr, size) \
+   (PFN_UP((unsigned long) (addr) + (size)) - \
+   PFN_DOWN((unsigned long) (addr)))
+
+void xpfo_temp_map(const void *addr, size_t size, void **mapping,
+  size_t mapping_len);
+void xpfo_temp_unmap(const void *addr, size_t size, void **mapping,
+size_t mapping_len);
+
 #else /* !CONFIG_XPFO */
 
 static inline void xpfo_init_single_page(struct page *page) { }
@@ -106,6 +115,18 @@ static inline void xpfo_flush_kernel_tlb(struct page 
*page, int order) { }
 
 static inline phys_addr_t user_virt_to_phys(unsigned long addr) { return 0; }
 
+#define XPFO_NUM_PAGES(addr, size) 0
+
+static inline void xpfo_temp_map(const void *addr, size_t size, void **mapping,
+size_t mapping_len)
+{
+}
+
+static inline void xpfo_temp_unmap(const void *addr, size_t size,
+  void **mapping, size_t mapping_len)
+{
+}
+
 #endif /* CONFIG_XPFO */
 
 #if (!defined(CONFIG_HIGHMEM)) && (!defined(ARCH_HAS_KMAP))
diff --git a/mm/xpfo.c b/mm/xpfo.c
index b74fee0479e7..974f1b70ccd9 100644
--- a/mm/xpfo.c
+++ b/mm/xpfo.c
@@ -14,6 +14,7 @@
  * the Free Software Foundation.
  */
 
+#include 
 #include 
 #include 
 #include 
@@ -104,3 +105,32 @@ void xpfo_free_pages(struct page *page, int order)
}
}
 }
+
+void xpfo_temp_map(const void *addr, size_t size, void **mapping,
+  size_t mapping_len)
+{
+   struct page *page = virt_to_page(addr);
+   int i, num_pages = mapping_len / sizeof(mapping[0]);
+
+   memset(mapping, 0, mapping_len);
+
+   for (i = 0; i < num_pages; i++) {
+   if (page_to_virt(page + i) >= addr + size)
+   break;
+
+   if (PageXpfoUnmapped(page + i))
+   mapping[i] = kmap_atomic(page + i);
+   }
+}
+EXPORT_SYMBOL(xpfo_temp_map);
+
+void xpfo_temp_unmap(const void *addr, size_t size, void **mapping,
+size_t mapping_len)
+{
+   int i, num_pages = mapping_len / sizeof(mapping[0]);
+
+   for (i = 0; i < num_pages; i++)
+   if (mapping[i])
+   kunmap_atomic(mapping[i]);
+}
+EXPORT_SYMBOL(xpfo_temp_unmap);
-- 
2.17.1



[RFC PATCH v9 08/13] swiotlb: Map the buffer if it was unmapped by XPFO

2019-04-03 Thread Khalid Aziz
From: Juerg Haefliger 

XPFO can unmap a bounce buffer. Check for this and map it back in if
needed.

Signed-off-by: Juerg Haefliger 
Signed-off-by: Tycho Andersen 
Signed-off-by: Khalid Aziz 
Cc: Khalid Aziz 
Reviewed-by: Konrad Rzeszutek Wilk 
---
v9: * Added a generic check for whether a page is mapped or not (suggested
  by Christoph Hellwig)

v6: * guard against lookup_xpfo() returning NULL

 include/linux/highmem.h | 7 +++
 kernel/dma/swiotlb.c| 3 ++-
 2 files changed, 9 insertions(+), 1 deletion(-)

diff --git a/include/linux/highmem.h b/include/linux/highmem.h
index 59a1a5fa598d..cf21f023dff4 100644
--- a/include/linux/highmem.h
+++ b/include/linux/highmem.h
@@ -77,6 +77,13 @@ static inline struct page *kmap_to_page(void *addr)
 }
 
 static inline unsigned long totalhigh_pages(void) { return 0UL; }
+static inline bool page_is_unmapped(struct page *page)
+{
+   if (PageHighMem(page) || PageXpfoUnmapped(page))
+   return true;
+   else
+   return false;
+}
 
 #endif /* CONFIG_HIGHMEM */
 
diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
index 1fb6fd68b9c7..90a1a3709b55 100644
--- a/kernel/dma/swiotlb.c
+++ b/kernel/dma/swiotlb.c
@@ -392,8 +392,9 @@ static void swiotlb_bounce(phys_addr_t orig_addr, 
phys_addr_t tlb_addr,
 {
unsigned long pfn = PFN_DOWN(orig_addr);
unsigned char *vaddr = phys_to_virt(tlb_addr);
+   struct page *page = pfn_to_page(pfn);
 
-   if (PageHighMem(pfn_to_page(pfn))) {
+   if (page_is_unmapped(page)) {
/* The buffer does not have a mapping.  Map it in and copy */
unsigned int offset = orig_addr & ~PAGE_MASK;
char *buffer;
-- 
2.17.1



[RFC PATCH v9 10/13] arm64/mm, xpfo: temporarily map dcache regions

2019-04-03 Thread Khalid Aziz
From: Juerg Haefliger 

If the page is unmapped by XPFO, a data cache flush results in a fatal
page fault, so let's temporarily map the region, flush the cache, and then
unmap it.

CC: linux-arm-ker...@lists.infradead.org
Signed-off-by: Juerg Haefliger 
Signed-off-by: Tycho Andersen 
---
v6: actually flush in the face of xpfo, and temporarily map the underlying
memory so it can be flushed correctly

 arch/arm64/mm/flush.c | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/arch/arm64/mm/flush.c b/arch/arm64/mm/flush.c
index 5c9073bace83..114e8bc5a3dc 100644
--- a/arch/arm64/mm/flush.c
+++ b/arch/arm64/mm/flush.c
@@ -20,6 +20,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -28,9 +29,15 @@
 void sync_icache_aliases(void *kaddr, unsigned long len)
 {
unsigned long addr = (unsigned long)kaddr;
+   unsigned long num_pages = XPFO_NUM_PAGES(addr, len);
+   void *mapping[num_pages];
 
if (icache_is_aliasing()) {
+   xpfo_temp_map(kaddr, len, mapping,
+ sizeof(mapping[0]) * num_pages);
__clean_dcache_area_pou(kaddr, len);
+   xpfo_temp_unmap(kaddr, len, mapping,
+   sizeof(mapping[0]) * num_pages);
__flush_icache_all();
} else {
/*
-- 
2.17.1



[RFC PATCH v9 06/13] lkdtm: Add test for XPFO

2019-04-03 Thread Khalid Aziz
From: Juerg Haefliger 

This test simply reads from userspace memory via the kernel's linear
map.

v6: * drop an #ifdef, just let the test fail if XPFO is not supported
* add an XPFO_SMP test to check that when one CPU does an xpfo unmap
  of an address, the address can't be used accidentally by other
  CPUs.

Signed-off-by: Juerg Haefliger 
Signed-off-by: Tycho Andersen 
Tested-by: Marco Benatto 
[jstec...@amazon.de: rebased from v4.13 to v4.19]
Signed-off-by: Julian Stecklina 
Tested-by: Khalid Aziz 
Cc: Khalid Aziz 
---
 drivers/misc/lkdtm/Makefile |   1 +
 drivers/misc/lkdtm/core.c   |   3 +
 drivers/misc/lkdtm/lkdtm.h  |   5 +
 drivers/misc/lkdtm/xpfo.c   | 196 
 4 files changed, 205 insertions(+)
 create mode 100644 drivers/misc/lkdtm/xpfo.c

diff --git a/drivers/misc/lkdtm/Makefile b/drivers/misc/lkdtm/Makefile
index 951c984de61a..97c6b7818cce 100644
--- a/drivers/misc/lkdtm/Makefile
+++ b/drivers/misc/lkdtm/Makefile
@@ -9,6 +9,7 @@ lkdtm-$(CONFIG_LKDTM)   += refcount.o
 lkdtm-$(CONFIG_LKDTM)  += rodata_objcopy.o
 lkdtm-$(CONFIG_LKDTM)  += usercopy.o
 lkdtm-$(CONFIG_LKDTM)  += stackleak.o
+lkdtm-$(CONFIG_LKDTM)  += xpfo.o
 
 KASAN_SANITIZE_stackleak.o := n
 KCOV_INSTRUMENT_rodata.o   := n
diff --git a/drivers/misc/lkdtm/core.c b/drivers/misc/lkdtm/core.c
index 2837dc77478e..25f4ab4ebf50 100644
--- a/drivers/misc/lkdtm/core.c
+++ b/drivers/misc/lkdtm/core.c
@@ -185,6 +185,9 @@ static const struct crashtype crashtypes[] = {
CRASHTYPE(USERCOPY_KERNEL),
CRASHTYPE(USERCOPY_KERNEL_DS),
CRASHTYPE(STACKLEAK_ERASING),
+   CRASHTYPE(XPFO_READ_USER),
+   CRASHTYPE(XPFO_READ_USER_HUGE),
+   CRASHTYPE(XPFO_SMP),
 };
 
 
diff --git a/drivers/misc/lkdtm/lkdtm.h b/drivers/misc/lkdtm/lkdtm.h
index 3c6fd327e166..6b31ff0c7f8f 100644
--- a/drivers/misc/lkdtm/lkdtm.h
+++ b/drivers/misc/lkdtm/lkdtm.h
@@ -87,4 +87,9 @@ void lkdtm_USERCOPY_KERNEL_DS(void);
 /* lkdtm_stackleak.c */
 void lkdtm_STACKLEAK_ERASING(void);
 
+/* lkdtm_xpfo.c */
+void lkdtm_XPFO_READ_USER(void);
+void lkdtm_XPFO_READ_USER_HUGE(void);
+void lkdtm_XPFO_SMP(void);
+
 #endif
diff --git a/drivers/misc/lkdtm/xpfo.c b/drivers/misc/lkdtm/xpfo.c
new file mode 100644
index ..8876128f0144
--- /dev/null
+++ b/drivers/misc/lkdtm/xpfo.c
@@ -0,0 +1,196 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * This is for all the tests related to XPFO (eXclusive Page Frame Ownership).
+ */
+
+#include "lkdtm.h"
+
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include 
+#include 
+
+#define XPFO_DATA 0xdeadbeef
+
+static unsigned long do_map(unsigned long flags)
+{
+   unsigned long user_addr, user_data = XPFO_DATA;
+
+   user_addr = vm_mmap(NULL, 0, PAGE_SIZE,
+   PROT_READ | PROT_WRITE | PROT_EXEC,
+   flags, 0);
+   if (user_addr >= TASK_SIZE) {
+   pr_warn("Failed to allocate user memory\n");
+   return 0;
+   }
+
+   if (copy_to_user((void __user *)user_addr, &user_data,
+sizeof(user_data))) {
+   pr_warn("copy_to_user failed\n");
+   goto free_user;
+   }
+
+   return user_addr;
+
+free_user:
+   vm_munmap(user_addr, PAGE_SIZE);
+   return 0;
+}
+
+static unsigned long *user_to_kernel(unsigned long user_addr)
+{
+   phys_addr_t phys_addr;
+   void *virt_addr;
+
+   phys_addr = user_virt_to_phys(user_addr);
+   if (!phys_addr) {
+   pr_warn("Failed to get physical address of user memory\n");
+   return NULL;
+   }
+
+   virt_addr = phys_to_virt(phys_addr);
+   if (phys_addr != virt_to_phys(virt_addr)) {
+   pr_warn("Physical address of user memory seems incorrect\n");
+   return NULL;
+   }
+
+   return virt_addr;
+}
+
+static void read_map(unsigned long *virt_addr)
+{
+   pr_info("Attempting bad read from kernel address %p\n", virt_addr);
+   if (*(unsigned long *)virt_addr == XPFO_DATA)
+   pr_err("FAIL: Bad read succeeded?!\n");
+   else
+   pr_err("FAIL: Bad read didn't fail but data is incorrect?!\n");
+}
+
+static void read_user_with_flags(unsigned long flags)
+{
+   unsigned long user_addr, *kernel;
+
+   user_addr = do_map(flags);
+   if (!user_addr) {
+   pr_err("FAIL: map failed\n");
+   return;
+   }
+
+   kernel = user_to_kernel(user_addr);
+   if (!kernel) {
+   pr_err("FAIL: user to kernel conversion failed\n");
+   goto free_user;
+   }
+
+   read_map(kernel);
+
+free_user:
+   vm_munmap(user_addr, PAGE_SIZE);
+}
+
+/* Read from userspace via the kernel's linear map. */
+void lkdtm_XPFO_READ_USER(void)
+{
+   read_user_with_flags(MAP_PRIVATE | MAP_ANONYMOUS);
+}
+
+void lkdtm_XPFO_READ_USER_HUGE(void)
+{
+   

[RFC PATCH v9 12/13] xpfo, mm: Defer TLB flushes for non-current CPUs (x86 only)

2019-04-03 Thread Khalid Aziz
XPFO flushes kernel space TLB entries for pages that are now mapped
in userspace, not only on the current CPU but also synchronously on
all other CPUs. Processes on each core allocating pages cause a
flood of IPI messages to all other cores to flush TLB entries.
Many of these messages are to flush the entire TLB on the core if
the number of entries being flushed from the local core exceeds
tlb_single_page_flush_ceiling. The cost of TLB flushes caused by
unmapping pages from physmap goes up dramatically on machines with
high core counts.

This patch flushes relevant TLB entries for the current process or
the entire TLB, depending upon the number of entries for the current
CPU, and posts a pending TLB flush on all other CPUs when a page is
unmapped from kernel space and mapped in userspace. Each core
checks its pending TLB flush flag on every context switch, flushes
its TLB if the flag is set and clears it. This patch potentially
aggregates multiple TLB flushes into one, which has a very
significant impact especially on machines with large core counts.
To illustrate this, the kernel was compiled with -j on two classes
of machines - a server with a high core count and a large amount of
memory, and a desktop class machine with more modest specs. System
times for "make -j" from the vanilla 4.20 kernel, 4.20 with the XPFO
patches before applying this patch, and after applying this patch
are below:

Hardware: 96-core Intel Xeon Platinum 8160 CPU @ 2.10GHz, 768 GB RAM
make -j60 all

4.20                          950.966s
4.20+XPFO                   25073.169s   26.366x
4.20+XPFO+Deferred flush     1372.874s    1.44x

Hardware: 4-core Intel Core i5-3550 CPU @ 3.30GHz, 8G RAM
make -j4 all

4.20                          607.671s
4.20+XPFO                    1588.646s    2.614x
4.20+XPFO+Deferred flush      803.989s    1.32x

This same code should be implemented for other architectures as
well once finalized.

Signed-off-by: Khalid Aziz 
Cc: Khalid Aziz 
---
 arch/x86/include/asm/tlbflush.h |  1 +
 arch/x86/mm/tlb.c   | 52 +
 arch/x86/mm/xpfo.c  |  2 +-
 3 files changed, 54 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index f4204bf377fc..92d23629d01d 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -561,6 +561,7 @@ extern void flush_tlb_mm_range(struct mm_struct *mm, 
unsigned long start,
unsigned long end, unsigned int stride_shift,
bool freed_tables);
 extern void flush_tlb_kernel_range(unsigned long start, unsigned long end);
+extern void xpfo_flush_tlb_kernel_range(unsigned long start, unsigned long 
end);
 
 static inline void flush_tlb_page(struct vm_area_struct *vma, unsigned long a)
 {
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 999d6d8f0bef..cc806a01a0eb 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -37,6 +37,20 @@
  */
 #define LAST_USER_MM_IBPB  0x1UL
 
+/*
+ * A TLB flush may be needed to flush stale TLB entries
+ * for pages that have been mapped into userspace and unmapped
+ * from kernel space. This TLB flush needs to be propagated to
+ * all CPUs. Asynchronous flush requests to all CPUs can cause
+ * significant performance impact. Queue a pending flush for
+ * a CPU instead. Multiple of these requests can then be handled
+ * by a CPU at a less disruptive time, like context switch, in
+ * one go and reduce performance impact significantly. Following
+ * data structure is used to keep track of CPUs with pending full
+ * TLB flush forced by xpfo.
+ */
+static cpumask_t pending_xpfo_flush;
+
 /*
  * We get here when we do something requiring a TLB invalidation
  * but could not go invalidate all of the contexts.  We do the
@@ -321,6 +335,16 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct 
mm_struct *next,
__flush_tlb_all();
}
 #endif
+
+   /*
+* If there is a pending TLB flush for this CPU due to XPFO
+* flush, do it now.
+*/
+   if (cpumask_test_and_clear_cpu(cpu, &pending_xpfo_flush)) {
+   count_vm_tlb_event(NR_TLB_REMOTE_FLUSH_RECEIVED);
+   __flush_tlb_all();
+   }
+
this_cpu_write(cpu_tlbstate.is_lazy, false);
 
/*
@@ -803,6 +827,34 @@ void flush_tlb_kernel_range(unsigned long start, unsigned 
long end)
}
 }
 
+void xpfo_flush_tlb_kernel_range(unsigned long start, unsigned long end)
+{
+   struct cpumask tmp_mask;
+
+   /*
+* Balance as user space task's flush, a bit conservative.
+* Do a local flush immediately and post a pending flush on all
+* other CPUs. Local flush can be a range flush or full flush
+* depending upon the number of entries to be flushed. Remote
+* flushes will be done by individual processors at the time of
+* context switch and this allows multiple flush requests from
+   

[RFC PATCH v9 03/13] mm: Add support for eXclusive Page Frame Ownership (XPFO)

2019-04-03 Thread Khalid Aziz
From: Juerg Haefliger 

This patch adds basic support infrastructure for XPFO which protects
against 'ret2dir' kernel attacks. The basic idea is to enforce exclusive
ownership of page frames by either the kernel or userspace, unless
explicitly requested by the kernel. Whenever a page destined for
userspace is allocated, it is unmapped from physmap (the kernel's page
table). When such a page is reclaimed from userspace, it is mapped back
to physmap. Individual architectures can enable full XPFO support using
this infrastructure by supplying architecture specific pieces.

Additional fields in the page struct are used for XPFO housekeeping,
specifically:
  - two flags to distinguish user vs. kernel pages and to tag unmapped
pages.
  - a reference counter to balance kmap/kunmap operations.
  - a lock to serialize access to the XPFO fields.

This patch is based on the work of Vasileios P. Kemerlis et al. who
published their work in this paper:
  http://www.cs.columbia.edu/~vpk/papers/ret2dir.sec14.pdf
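
For orientation, the housekeeping state listed above amounts to roughly the following (a sketch; the exact encoding in struct page and the page-flag names are what this patch defines, so treat the identifiers here as illustrative):

/* sketch of the extra per-page state (in struct page): */
#ifdef CONFIG_XPFO
	atomic_t	xpfo_mapcount;	/* balances kmap/kunmap */
	spinlock_t	xpfo_lock;	/* serializes the XPFO fields */
#endif

/* plus two new page flags (illustrative names), tested and set via
 * PageXpfoUser()/PageXpfoUnmapped() and friends: */
enum pageflags {
	/* ... existing flags ... */
	PG_xpfo_user,		/* page is owned by userspace */
	PG_xpfo_unmapped,	/* page is currently unmapped from physmap */
};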

CC: x...@kernel.org
Suggested-by: Vasileios P. Kemerlis 
Signed-off-by: Juerg Haefliger 
Signed-off-by: Tycho Andersen 
Signed-off-by: Marco Benatto 
[jstec...@amazon.de: encode all XPFO info in struct page]
Signed-off-by: Julian Stecklina 
Signed-off-by: Khalid Aziz 
Cc: Khalid Aziz 
---
 v9: * Do not use page extensions. Encode all xpfo information in struct
  page (Julian Stecklina).
* Split architecture specific code into its own separate patch
* Move kmap*() to include/linux/xpfo.h for cleaner code as suggested
  for an earlier version of this patch
* Use irq versions of spin_lock to address possible deadlock around
  xpfo_lock caused by interrupts.
* Incorporated various feedback provided on v6 patch way back.

v6: * use flush_tlb_kernel_range() instead of __flush_tlb_one, so we flush
  the tlb entry on all CPUs when unmapping it in kunmap
* handle lookup_page_ext()/lookup_xpfo() returning NULL
* drop lots of BUG()s in favor of WARN()
* don't disable irqs in xpfo_kmap/xpfo_kunmap, export
  __split_large_page so we can do our own alloc_pages(GFP_ATOMIC) to
  pass it

 .../admin-guide/kernel-parameters.txt |   6 +
 include/linux/highmem.h   |  31 +---
 include/linux/mm_types.h  |   8 +
 include/linux/page-flags.h|  23 ++-
 include/linux/xpfo.h  | 147 ++
 include/trace/events/mmflags.h|  10 +-
 mm/Makefile   |   1 +
 mm/compaction.c   |   2 +-
 mm/internal.h |   2 +-
 mm/page_alloc.c   |  10 +-
 mm/page_isolation.c   |   2 +-
 mm/xpfo.c | 106 +
 security/Kconfig  |  27 
 13 files changed, 337 insertions(+), 38 deletions(-)
 create mode 100644 include/linux/xpfo.h
 create mode 100644 mm/xpfo.c

diff --git a/Documentation/admin-guide/kernel-parameters.txt 
b/Documentation/admin-guide/kernel-parameters.txt
index 858b6c0b9a15..9b36da94760e 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -2997,6 +2997,12 @@
 
nox2apic[X86-64,APIC] Do not enable x2APIC mode.
 
+   noxpfo  [XPFO] Disable eXclusive Page Frame Ownership (XPFO)
+   when CONFIG_XPFO is on. Physical pages mapped into
+   user applications will also be mapped in the
+   kernel's address space as if CONFIG_XPFO was not
+   enabled.
+
cpu0_hotplug[X86] Turn on CPU0 hotplug feature when
CONFIG_BOOTPARAM_HOTPLUG_CPU0 is off.
Some features depend on CPU0. Known dependencies are:
diff --git a/include/linux/highmem.h b/include/linux/highmem.h
index ea5cdbd8c2c3..59a1a5fa598d 100644
--- a/include/linux/highmem.h
+++ b/include/linux/highmem.h
@@ -8,6 +8,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 
@@ -77,36 +78,6 @@ static inline struct page *kmap_to_page(void *addr)
 
 static inline unsigned long totalhigh_pages(void) { return 0UL; }
 
-#ifndef ARCH_HAS_KMAP
-static inline void *kmap(struct page *page)
-{
-   might_sleep();
-   return page_address(page);
-}
-
-static inline void kunmap(struct page *page)
-{
-}
-
-static inline void *kmap_atomic(struct page *page)
-{
-   preempt_disable();
-   pagefault_disable();
-   return page_address(page);
-}
-#define kmap_atomic_prot(page, prot)   kmap_atomic(page)
-
-static inline void __kunmap_atomic(void *addr)
-{
-   pagefault_enable();
-   preempt_enable();
-}
-
-#define kmap_atomic_pfn(pfn)   kmap_atomic(pfn_to_page(pfn))
-
-#define kmap_flush_unused()do {} while(0)
-#endif
-
 #endif /* CONFIG_HIGHMEM */
 
 #if defined(CONFIG_HIGHMEM) || 

[RFC PATCH v9 00/13] Add support for eXclusive Page Frame Ownership

2019-04-03 Thread Khalid Aziz
This is another update to the work Juerg, Tycho and Julian have
done on XPFO. After the last round of updates, we were seeing very
significant performance penalties when stale TLB entries were
flushed actively after an XPFO TLB update.  The benchmark for
measuring performance is a kernel build using parallel make. To get
full protection from ret2dir attacks, we must flush stale TLB
entries. The performance penalty from flushing stale TLB entries
goes up as the number of cores goes up. On a desktop class machine
with only 4 cores, enabling TLB flushes for stale entries causes
system time for "make -j4" to go up by a factor of 2.61x, but on a
larger machine with 96 cores, system time with "make -j60" goes up
by a factor of 26.37x!  I have been working on reducing this
performance penalty.

I implemented two solutions to reduce the performance penalty and
that has had a large impact. The XPFO code flushes the TLB every
time a page is allocated to userspace. It does so by sending IPIs to
all processors to flush the TLB. Back to back allocations of pages
to userspace on multiple processors result in a storm of IPIs.  Each
one of these incoming IPIs is handled by a processor by flushing its
TLB. To reduce this IPI storm, I have added a per-CPU flag that can
be set to tell a processor to flush its TLB. A processor checks this
flag on every context switch. If the flag is set, it flushes its TLB
and clears the flag. This allows multiple TLB flush requests to a
single CPU to be combined into a single request. A kernel TLB entry
for a page that has been allocated to userspace is flushed on all
processors, unlike the previous version of this patch. A processor
could hold a stale kernel TLB entry that was removed on another
processor until the next context switch. A local userspace page
allocation by the currently running process could force the TLB
flush earlier for such entries.
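
In code terms, this first change boils down to roughly the following
(condensed from patch 12 in this series; the real version chooses
between a range flush and a full flush for the local CPU):

/* CPUs that still owe themselves a TLB flush because of XPFO */
static cpumask_t pending_xpfo_flush;

void xpfo_flush_tlb_kernel_range(unsigned long start, unsigned long end)
{
	struct cpumask tmp_mask;

	/* flush the local CPU immediately ... */
	__flush_tlb_all();

	/* ... and only post a pending flush for every other CPU */
	cpumask_setall(&tmp_mask);
	cpumask_clear_cpu(smp_processor_id(), &tmp_mask);
	cpumask_or(&pending_xpfo_flush, &pending_xpfo_flush, &tmp_mask);
}

/* and in switch_mm_irqs_off(), on every context switch: */
	if (cpumask_test_and_clear_cpu(cpu, &pending_xpfo_flush)) {
		count_vm_tlb_event(NR_TLB_REMOTE_FLUSH_RECEIVED);
		__flush_tlb_all();
	}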

The other solution reduces the number of TLB flushes required, by
performing a TLB flush for multiple pages at one time when pages are
refilled on the per-cpu freelist. If the pages being added to the
per-cpu freelist are marked for userspace allocation, TLB entries
for these pages can be flushed upfront and the pages tagged as
currently unmapped. When any such page is allocated to userspace,
there is no need to perform a TLB flush at that time any more. This
batching of TLB flushes reduces the performance impact further.
Similarly, when these user pages are freed by userspace and added
back to the per-cpu free list, they are left unmapped and tagged so.
This further optimization reduced the performance impact from 1.32x
to 1.28x for the 96-core server and from 1.31x to 1.27x for the
4-core desktop.

I measured system time for parallel make with unmodified 4.20
kernel, 4.20 with XPFO patches before these patches and then again
after applying each of these patches. Here are the results:

Hardware: 96-core Intel Xeon Platinum 8160 CPU @ 2.10GHz, 768 GB RAM
make -j60 all

5.0                        913.862s
5.0+this patch series     1165.259s   1.28x


Hardware: 4-core Intel Core i5-3550 CPU @ 3.30GHz, 8G RAM
make -j4 all

5.0                        610.642s
5.0+this patch series      773.075s   1.27x

Performance with this patch set is good enough to use these as a
starting point for further refinement before we merge it into the
mainline kernel, hence RFC.

I have restructured the patches in this version to separate out
architecture independent code. I folded much of Julian's improvement
to not use page extensions into patch 3.

What remains to be done beyond this patch series:

1. Performance improvements: Ideas to explore - (1) kernel mappings
   private to an mm, (2) Any others??
2. Re-evaluate the patch "arm64/mm: Add support for XPFO to swiotlb"
   from Juerg. I dropped it for now since swiotlb code for ARM has
   changed a lot since this patch was written. I could use help
   from ARM experts on this.
3. Extend the patch "xpfo, mm: Defer TLB flushes for non-current
   CPUs" to other architectures besides x86.
4. Change kmap to not map the page back to physmap, and instead map
   it to a new va similar to what kmap_high does. Mapping the page
   back into physmap re-opens the ret2dir security hole for the
   duration of kmap. All of the kmap_high and related code can be
   reused for this, but that will require restructuring that code so
   it can be built for 64-bits as well. Any objections to that?

-

Juerg Haefliger (6):
  mm: Add support for eXclusive Page Frame Ownership (XPFO)
  xpfo, x86: Add support for XPFO for x86-64
  lkdtm: Add test for XPFO
  arm64/mm: Add support for XPFO
  swiotlb: Map the buffer if it was unmapped by XPFO
  arm64/mm, xpfo: temporarily map dcache regions

Julian Stecklina (1):
  xpfo, mm: optimize spinlock usage in xpfo_kunmap

Khalid Aziz (2):
  xpfo, mm: Defer TLB flushes for non-current CPUs (x86 only)
  xpfo, mm: Optimize XPFO TLB flushes by 

[RFC PATCH v9 02/13] x86: always set IF before oopsing from page fault

2019-04-03 Thread Khalid Aziz
From: Tycho Andersen 

Oopsing might kill the task, via rewind_stack_do_exit() at the bottom, and
that might sleep:

Aug 23 19:30:27 xpfo kernel: [   38.302714] BUG: sleeping function called from 
invalid context at ./include/linux/percpu-rwsem.h:33
Aug 23 19:30:27 xpfo kernel: [   38.303837] in_atomic(): 0, irqs_disabled(): 1, 
pid: 1970, name: lkdtm_xpfo_test
Aug 23 19:30:27 xpfo kernel: [   38.304758] CPU: 3 PID: 1970 Comm: 
lkdtm_xpfo_test Tainted: G  D 4.13.0-rc5+ #228
Aug 23 19:30:27 xpfo kernel: [   38.305813] Hardware name: QEMU Standard PC 
(i440FX + PIIX, 1996), BIOS 1.10.1-1ubuntu1 04/01/2014
Aug 23 19:30:27 xpfo kernel: [   38.306926] Call Trace:
Aug 23 19:30:27 xpfo kernel: [   38.307243]  dump_stack+0x63/0x8b
Aug 23 19:30:27 xpfo kernel: [   38.307665]  ___might_sleep+0xec/0x110
Aug 23 19:30:27 xpfo kernel: [   38.308139]  __might_sleep+0x45/0x80
Aug 23 19:30:27 xpfo kernel: [   38.308593]  exit_signals+0x21/0x1c0
Aug 23 19:30:27 xpfo kernel: [   38.309046]  ? 
blocking_notifier_call_chain+0x11/0x20
Aug 23 19:30:27 xpfo kernel: [   38.309677]  do_exit+0x98/0xbf0
Aug 23 19:30:27 xpfo kernel: [   38.310078]  ? smp_reader+0x27/0x40 [lkdtm]
Aug 23 19:30:27 xpfo kernel: [   38.310604]  ? kthread+0x10f/0x150
Aug 23 19:30:27 xpfo kernel: [   38.311045]  ? read_user_with_flags+0x60/0x60 
[lkdtm]
Aug 23 19:30:27 xpfo kernel: [   38.311680]  rewind_stack_do_exit+0x17/0x20

To be safe, let's just always enable irqs.

The particular case I'm hitting is:

Aug 23 19:30:27 xpfo kernel: [   38.278615]  __bad_area_nosemaphore+0x1a9/0x1d0
Aug 23 19:30:27 xpfo kernel: [   38.278617]  bad_area_nosemaphore+0xf/0x20
Aug 23 19:30:27 xpfo kernel: [   38.278618]  __do_page_fault+0xd1/0x540
Aug 23 19:30:27 xpfo kernel: [   38.278620]  ? irq_work_queue+0x9b/0xb0
Aug 23 19:30:27 xpfo kernel: [   38.278623]  ? wake_up_klogd+0x36/0x40
Aug 23 19:30:27 xpfo kernel: [   38.278624]  trace_do_page_fault+0x3c/0xf0
Aug 23 19:30:27 xpfo kernel: [   38.278625]  do_async_page_fault+0x14/0x60
Aug 23 19:30:27 xpfo kernel: [   38.278627]  async_page_fault+0x28/0x30

i.e. when the fault is in kernel space and has been triggered by XPFO.

Signed-off-by: Tycho Andersen 
CC: x...@kernel.org
Tested-by: Khalid Aziz 
Cc: Khalid Aziz 
---
 arch/x86/mm/fault.c | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 9d5c75f02295..7891add0913f 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -858,6 +858,12 @@ no_context(struct pt_regs *regs, unsigned long error_code,
/* Executive summary in case the body of the oops scrolled away */
printk(KERN_DEFAULT "CR2: %016lx\n", address);
 
+   /*
+* We're about to oops, which might kill the task. Make sure we're
+* allowed to sleep.
+*/
+   flags |= X86_EFLAGS_IF;
+
oops_end(flags, regs, sig);
 }
 
-- 
2.17.1

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: remove NULL struct device support in the DMA API

2019-04-03 Thread Christoph Hellwig
On Wed, Apr 03, 2019 at 07:26:40PM +0100, Russell King - ARM Linux admin wrote:
> On Thu, Mar 21, 2019 at 03:52:28PM -0700, Christoph Hellwig wrote:
> > We still have a few drivers which pass a NULL struct device pointer
> > to DMA API functions, which generally is a bad idea as the API
> > implementations rely on the device not only for ops selection, but
> > also the dma mask and various other attributes, and many implementations
> > have been broken for NULL device support for a while.
> 
> I think I must be missing something, but...
> 
> My understanding is that ISA DMA is normally limited to 24 bits of
> address

Yes.

> - indeed, the x86 version only programs 24 bits of DMA address.
> Looking through this series, it appears that the conversions mean that
> the DMA mask for ISA becomes the full all-ones DMA mask, which would
> of course lead to memory corruption if only 24 bits of the address end
> up being programmed into the hardware.

In the generic DMA mapping code, a NULL struct device has always meant a
32-bit DMA mask - take a look at the dma_get_mask() function.
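
For reference, dma_get_mask() currently reads roughly like this
(paraphrased from include/linux/dma-mapping.h), which is where the
32-bit fallback for a NULL dev comes from:

	static inline u64 dma_get_mask(struct device *dev)
	{
		if (dev && dev->dma_mask && *dev->dma_mask)
			return *dev->dma_mask;
		return DMA_BIT_MASK(32);
	}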

> Maybe you could say why you think this series is safe in regard to ISA
> DMA?

ISA DMA has always been rather painful in a myriad of ways, and the
DMA API so far hasn't helped, given that we don't do bounce buffering
for the 24-bit limit, but just the higher limits.  Even if you do use
the DMA API and pass a device, ISA DMA has always meant that the
higher layers had to ensure things are addressable, either by using
GFP_DMA allocations in the drivers, or mid-layer hacks like the
unchecked_isa_dma flag in SCSI and/or BLK_BOUNCE_ISA in the
block layer.
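
As a minimal illustration of the first option (hypothetical driver
code, not from this series) -- the buffer is kept below the 16 MB
(24-bit) boundary by the allocator flag, independent of the device's
DMA mask:

	/* hypothetical ISA driver buffer: GFP_DMA restricts the
	 * allocation to ZONE_DMA, i.e. the low 16 MB on x86 */
	void *buf = kmalloc(4096, GFP_KERNEL | GFP_DMA);
	if (!buf)
		return -ENOMEM;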

This series doesn't change those facts at all.  I have a half-started
series to clean some of this up, but it isn't high up on the priority
list.
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: 5.1-rc1: mpt init crash in scsi_map_dma, dma_4v_map_sg on sparc64

2019-04-03 Thread Christoph Hellwig
I think this might have been this commit:

commit 24132a419c68f1d69eb8ecc91b3c80d730ecbb59
Author: Christoph Hellwig 
Date:   Fri Feb 15 09:30:28 2019 +0100

sparc64/pci_sun4v: allow large DMA masks

the patch below adds a few missing checks and hopefully should fix
your problem.  If not, can you try to revert the commit to check
whether my theory was correct to start with?
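
(Something along the lines of:

	$ git revert 24132a419c68f1d69eb8ecc91b3c80d730ecbb59

assuming the revert still applies cleanly on your tree.)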


Date:   Wed Apr 3 21:34:34 2019 +0200

diff --git a/arch/sparc/kernel/pci_sun4v.c b/arch/sparc/kernel/pci_sun4v.c
index a8af6023c126..14b93c5564e3 100644
--- a/arch/sparc/kernel/pci_sun4v.c
+++ b/arch/sparc/kernel/pci_sun4v.c
@@ -73,6 +73,11 @@ static inline void iommu_batch_start(struct device *dev, 
unsigned long prot, uns
p->npages   = 0;
 }
 
+static inline bool iommu_use_atu(struct iommu *iommu, u64 mask)
+{
+   return iommu->atu && mask > DMA_BIT_MASK(32);
+}
+
 /* Interrupts must be disabled.  */
 static long iommu_batch_flush(struct iommu_batch *p, u64 mask)
 {
@@ -92,7 +97,7 @@ static long iommu_batch_flush(struct iommu_batch *p, u64 mask)
prot &= (HV_PCI_MAP_ATTR_READ | HV_PCI_MAP_ATTR_WRITE);
 
while (npages != 0) {
-   if (mask <= DMA_BIT_MASK(32) || !pbm->iommu->atu) {
+   if (!iommu_use_atu(pbm->iommu, mask)) {
num = pci_sun4v_iommu_map(devhandle,
  HV_PCI_TSBID(0, entry),
  npages,
@@ -179,7 +184,6 @@ static void *dma_4v_alloc_coherent(struct device *dev, 
size_t size,
unsigned long flags, order, first_page, npages, n;
unsigned long prot = 0;
struct iommu *iommu;
-   struct atu *atu;
struct iommu_map_table *tbl;
struct page *page;
void *ret;
@@ -205,13 +209,11 @@ static void *dma_4v_alloc_coherent(struct device *dev, 
size_t size,
memset((char *)first_page, 0, PAGE_SIZE << order);
 
iommu = dev->archdata.iommu;
-   atu = iommu->atu;
-
mask = dev->coherent_dma_mask;
-   if (mask <= DMA_BIT_MASK(32) || !atu)
+   if (!iommu_use_atu(iommu, mask))
tbl = &iommu->tbl;
else
-   tbl = &atu->tbl;
+   tbl = &iommu->atu->tbl;
 
entry = iommu_tbl_range_alloc(dev, tbl, npages, NULL,
  (unsigned long)(-1), 0);
@@ -333,7 +335,7 @@ static void dma_4v_free_coherent(struct device *dev, size_t 
size, void *cpu,
atu = iommu->atu;
devhandle = pbm->devhandle;
 
-   if (dvma <= DMA_BIT_MASK(32)) {
+   if (!iommu_use_atu(iommu, dvma)) {
tbl = &iommu->tbl;
iotsb_num = 0; /* we don't care for legacy iommu */
} else {
@@ -374,7 +376,7 @@ static dma_addr_t dma_4v_map_page(struct device *dev, 
struct page *page,
npages >>= IO_PAGE_SHIFT;
 
mask = *dev->dma_mask;
-   if (mask <= DMA_BIT_MASK(32))
+   if (!iommu_use_atu(iommu, mask))
tbl = &iommu->tbl;
else
tbl = &atu->tbl;
@@ -510,7 +512,7 @@ static int dma_4v_map_sg(struct device *dev, struct 
scatterlist *sglist,
  IO_PAGE_SIZE) >> IO_PAGE_SHIFT;
 
mask = *dev->dma_mask;
-   if (mask <= DMA_BIT_MASK(32))
+   if (!iommu_use_atu(iommu, mask))
tbl = &iommu->tbl;
else
tbl = &atu->tbl;
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: remove NULL struct device support in the DMA API

2019-04-03 Thread Russell King - ARM Linux admin
On Thu, Mar 21, 2019 at 03:52:28PM -0700, Christoph Hellwig wrote:
> We still have a few drivers which pass a NULL struct device pointer
> to DMA API functions, which generally is a bad idea as the API
> implementations rely on the device not only for ops selection, but
> also the dma mask and various other attributes, and many implementations
> have been broken for NULL device support for a while.

I think I must be missing something, but...

My understanding is that ISA DMA is normally limited to 24 bits of
address - indeed, the x86 version only programs 24 bits of DMA address.
Looking through this series, it appears that the conversions mean that
the DMA mask for ISA becomes the full all-ones DMA mask, which would
of course lead to memory corruption if only 24 bits of the address end
up being programmed into the hardware.

Maybe you could say why you think this series is safe in regard to ISA
DMA?

> 
> This series removes the few remaning users that weren't picked up in
> the last merge window and then removes core support for this "feature".
> 
> A git tree is also available at:
> 
> git://git.infradead.org/users/hch/misc.git dma-remove-NULL-dev-support
> 
> Gitweb:
> 
> 
> http://git.infradead.org/users/hch/misc.git/shortlog/refs/heads/dma-remove-NULL-dev-support
> 

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line in suburbia: sync at 12.1Mbps down 622kbps up
According to speedtest.net: 11.9Mbps down 500kbps up
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


[PATCH v2 4/6] iommu/ipmmu-vmsa: Move num_utlbs to SoC-specific features

2019-04-03 Thread Geert Uytterhoeven
The maximum number of micro-TLBs per IPMMU instance is not fixed, but
depends on the SoC type.  Hence move it from struct ipmmu_vmsa_device to
struct ipmmu_features, and set up the correct value for both R-Car Gen2
and Gen3 SoCs.

Note that currently no code uses this value.

Signed-off-by: Geert Uytterhoeven 
Reviewed-by: Laurent Pinchart 
---
v2:
  - Add Reviewed-by.
---
 drivers/iommu/ipmmu-vmsa.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/iommu/ipmmu-vmsa.c b/drivers/iommu/ipmmu-vmsa.c
index 87acf86f295fac0d..3fa57627b1e35562 100644
--- a/drivers/iommu/ipmmu-vmsa.c
+++ b/drivers/iommu/ipmmu-vmsa.c
@@ -42,6 +42,7 @@ struct ipmmu_features {
bool use_ns_alias_offset;
bool has_cache_leaf_nodes;
unsigned int number_of_contexts;
+   unsigned int num_utlbs;
bool setup_imbuscr;
bool twobit_imttbcr_sl0;
bool reserved_context;
@@ -53,7 +54,6 @@ struct ipmmu_vmsa_device {
struct iommu_device iommu;
struct ipmmu_vmsa_device *root;
const struct ipmmu_features *features;
-   unsigned int num_utlbs;
unsigned int num_ctx;
spinlock_t lock;/* Protects ctx and domains[] */
DECLARE_BITMAP(ctx, IPMMU_CTX_MAX);
@@ -972,6 +972,7 @@ static const struct ipmmu_features ipmmu_features_default = 
{
.use_ns_alias_offset = true,
.has_cache_leaf_nodes = false,
.number_of_contexts = 1, /* software only tested with one context */
+   .num_utlbs = 32,
.setup_imbuscr = true,
.twobit_imttbcr_sl0 = false,
.reserved_context = false,
@@ -981,6 +982,7 @@ static const struct ipmmu_features ipmmu_features_rcar_gen3 
= {
.use_ns_alias_offset = false,
.has_cache_leaf_nodes = true,
.number_of_contexts = 8,
+   .num_utlbs = 48,
.setup_imbuscr = false,
.twobit_imttbcr_sl0 = true,
.reserved_context = true,
@@ -1033,7 +1035,6 @@ static int ipmmu_probe(struct platform_device *pdev)
}
 
mmu->dev = &pdev->dev;
-   mmu->num_utlbs = 48;
spin_lock_init(&mmu->lock);
bitmap_zero(mmu->ctx, IPMMU_CTX_MAX);
mmu->features = of_device_get_match_data(&pdev->dev);
-- 
2.17.1

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


[PATCH v2 3/6] iommu/ipmmu-vmsa: Make IPMMU_CTX_MAX unsigned

2019-04-03 Thread Geert Uytterhoeven
Make the IPMMU_CTX_MAX constant unsigned, to match the type of
ipmmu_features.number_of_contexts.

This allows using plain min() instead of a type-casting min_t().

Signed-off-by: Geert Uytterhoeven 
Reviewed-by: Laurent Pinchart 
---
v2:
  - Add Reviewed-by.
---
 drivers/iommu/ipmmu-vmsa.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/drivers/iommu/ipmmu-vmsa.c b/drivers/iommu/ipmmu-vmsa.c
index f2061bd1dc7b8852..87acf86f295fac0d 100644
--- a/drivers/iommu/ipmmu-vmsa.c
+++ b/drivers/iommu/ipmmu-vmsa.c
@@ -36,7 +36,7 @@
 #define arm_iommu_detach_device(...)   do {} while (0)
 #endif
 
-#define IPMMU_CTX_MAX 8
+#define IPMMU_CTX_MAX 8U
 
 struct ipmmu_features {
bool use_ns_alias_offset;
@@ -1060,8 +1060,7 @@ static int ipmmu_probe(struct platform_device *pdev)
if (mmu->features->use_ns_alias_offset)
mmu->base += IM_NS_ALIAS_OFFSET;
 
-   mmu->num_ctx = min_t(unsigned int, IPMMU_CTX_MAX,
-mmu->features->number_of_contexts);
+   mmu->num_ctx = min(IPMMU_CTX_MAX, mmu->features->number_of_contexts);
 
irq = platform_get_irq(pdev, 0);
 
-- 
2.17.1

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


[PATCH v2 0/6] iommu/ipmmu-vmsa: Suspend/resume support and assorted cleanups

2019-04-03 Thread Geert Uytterhoeven
Hi Jörg, Magnus,

On R-Car Gen3 systems with PSCI, PSCI may power down the SoC during
system suspend, thus losing all IOMMU state.  Hence after s2ram, devices
behind an IPMMU (e.g. SATA), and configured to use it, will fail to
complete their I/O operations.

This patch series adds suspend/resume support to the Renesas IPMMU-VMSA
IOMMU driver, and performs some smaller cleanups and fixes during the
process.  Most patches are fairly independent, except for patch 6/6,
which depends on patches 4/6 and 5/6.

Changes compared to v1:
  - Dropped "iommu/ipmmu-vmsa: Call ipmmu_ctx_write_root() instead of
open coding",
  - Add Reviewed-by,
  - Merge IMEAR/IMELAR,
  - s/ipmmu_context_init/ipmmu_domain_setup_context/,
  - Drop PSCI checks.

This has been tested on Salvator-XS with R-Car H3 ES2.0, with IPMMU
support for SATA enabled.  To play safe, the resume operation has also
been tested on R-Car M2-W.

Thanks!

Geert Uytterhoeven (6):
  iommu/ipmmu-vmsa: Link IOMMUs and devices in sysfs
  iommu/ipmmu-vmsa: Prepare to handle 40-bit error addresses
  iommu/ipmmu-vmsa: Make IPMMU_CTX_MAX unsigned
  iommu/ipmmu-vmsa: Move num_utlbs to SoC-specific features
  iommu/ipmmu-vmsa: Extract hardware context initialization
  iommu/ipmmu-vmsa: Add suspend/resume support

 drivers/iommu/ipmmu-vmsa.c | 185 +
 1 file changed, 124 insertions(+), 61 deletions(-)

-- 
2.17.1

Gr{oetje,eeting}s,

Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- ge...@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

[PATCH v2 2/6] iommu/ipmmu-vmsa: Prepare to handle 40-bit error addresses

2019-04-03 Thread Geert Uytterhoeven
On R-Car Gen3, the faulting virtual address is a 40-bit address, and
comprised of two registers.  Read the upper address part, and combine
both parts, when running on a 64-bit system.

Signed-off-by: Geert Uytterhoeven 
---
Apart from this, the driver doesn't support 40-bit IOVA addresses yet.

v2:
  - Merge IMEAR/IMELAR.
---
 drivers/iommu/ipmmu-vmsa.c | 15 +--
 1 file changed, 9 insertions(+), 6 deletions(-)

diff --git a/drivers/iommu/ipmmu-vmsa.c b/drivers/iommu/ipmmu-vmsa.c
index 9f2b781e20a0eba6..f2061bd1dc7b8852 100644
--- a/drivers/iommu/ipmmu-vmsa.c
+++ b/drivers/iommu/ipmmu-vmsa.c
@@ -186,7 +186,8 @@ static struct ipmmu_vmsa_device *to_ipmmu(struct device 
*dev)
 #define IMMAIR_ATTR_IDX_WBRWA  1
 #define IMMAIR_ATTR_IDX_DEV2
 
-#define IMEAR  0x0030
+#define IMELAR 0x0030  /* IMEAR on R-Car Gen2 */
+#define IMEUAR 0x0034  /* R-Car Gen3 only */
 
 #define IMPCTR 0x0200
 #define IMPSTR 0x0208
@@ -522,14 +523,16 @@ static irqreturn_t ipmmu_domain_irq(struct 
ipmmu_vmsa_domain *domain)
 {
const u32 err_mask = IMSTR_MHIT | IMSTR_ABORT | IMSTR_PF | IMSTR_TF;
struct ipmmu_vmsa_device *mmu = domain->mmu;
+   unsigned long iova;
u32 status;
-   u32 iova;
 
status = ipmmu_ctx_read_root(domain, IMSTR);
if (!(status & err_mask))
return IRQ_NONE;
 
-   iova = ipmmu_ctx_read_root(domain, IMEAR);
+   iova = ipmmu_ctx_read_root(domain, IMELAR);
+   if (IS_ENABLED(CONFIG_64BIT))
+   iova |= (u64)ipmmu_ctx_read_root(domain, IMEUAR) << 32;
 
/*
 * Clear the error status flags. Unlike traditional interrupt flag
@@ -541,10 +544,10 @@ static irqreturn_t ipmmu_domain_irq(struct 
ipmmu_vmsa_domain *domain)
 
/* Log fatal errors. */
if (status & IMSTR_MHIT)
-   dev_err_ratelimited(mmu->dev, "Multiple TLB hits @0x%08x\n",
+   dev_err_ratelimited(mmu->dev, "Multiple TLB hits @0x%lx\n",
iova);
if (status & IMSTR_ABORT)
-   dev_err_ratelimited(mmu->dev, "Page Table Walk Abort @0x%08x\n",
+   dev_err_ratelimited(mmu->dev, "Page Table Walk Abort @0x%lx\n",
iova);
 
if (!(status & (IMSTR_PF | IMSTR_TF)))
@@ -560,7 +563,7 @@ static irqreturn_t ipmmu_domain_irq(struct 
ipmmu_vmsa_domain *domain)
return IRQ_HANDLED;
 
dev_err_ratelimited(mmu->dev,
-   "Unhandled fault: status 0x%08x iova 0x%08x\n",
+   "Unhandled fault: status 0x%08x iova 0x%lx\n",
status, iova);
 
return IRQ_HANDLED;
-- 
2.17.1

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


[PATCH v2 5/6] iommu/ipmmu-vmsa: Extract hardware context initialization

2019-04-03 Thread Geert Uytterhoeven
ipmmu_domain_init_context() takes care of (1) initializing the software
domain, and (2) initializing the hardware context for the domain.

Extract the code to initialize the hardware context into a new subroutine
ipmmu_domain_setup_context(), to prepare for later reuse.

Signed-off-by: Geert Uytterhoeven 
---
v2:
  - s/ipmmu_context_init/ipmmu_domain_setup_context/.
---
 drivers/iommu/ipmmu-vmsa.c | 91 --
 1 file changed, 48 insertions(+), 43 deletions(-)

diff --git a/drivers/iommu/ipmmu-vmsa.c b/drivers/iommu/ipmmu-vmsa.c
index 3fa57627b1e35562..56e84bcc9532e1ce 100644
--- a/drivers/iommu/ipmmu-vmsa.c
+++ b/drivers/iommu/ipmmu-vmsa.c
@@ -404,52 +404,10 @@ static void ipmmu_domain_free_context(struct 
ipmmu_vmsa_device *mmu,
spin_unlock_irqrestore(&mmu->lock, flags);
 }
 
-static int ipmmu_domain_init_context(struct ipmmu_vmsa_domain *domain)
+static void ipmmu_domain_setup_context(struct ipmmu_vmsa_domain *domain)
 {
u64 ttbr;
u32 tmp;
-   int ret;
-
-   /*
-* Allocate the page table operations.
-*
-* VMSA states in section B3.6.3 "Control of Secure or Non-secure memory
-* access, Long-descriptor format" that the NStable bit being set in a
-* table descriptor will result in the NStable and NS bits of all child
-* entries being ignored and considered as being set. The IPMMU seems
-* not to comply with this, as it generates a secure access page fault
-* if any of the NStable and NS bits isn't set when running in
-* non-secure mode.
-*/
-   domain->cfg.quirks = IO_PGTABLE_QUIRK_ARM_NS;
-   domain->cfg.pgsize_bitmap = SZ_1G | SZ_2M | SZ_4K;
-   domain->cfg.ias = 32;
-   domain->cfg.oas = 40;
-   domain->cfg.tlb = &ipmmu_gather_ops;
-   domain->io_domain.geometry.aperture_end = DMA_BIT_MASK(32);
-   domain->io_domain.geometry.force_aperture = true;
-   /*
-* TODO: Add support for coherent walk through CCI with DVM and remove
-* cache handling. For now, delegate it to the io-pgtable code.
-*/
-   domain->cfg.iommu_dev = domain->mmu->root->dev;
-
-   /*
-* Find an unused context.
-*/
-   ret = ipmmu_domain_allocate_context(domain->mmu->root, domain);
-   if (ret < 0)
-   return ret;
-
-   domain->context_id = ret;
-
-   domain->iop = alloc_io_pgtable_ops(ARM_32_LPAE_S1, &domain->cfg,
-  domain);
-   if (!domain->iop) {
-   ipmmu_domain_free_context(domain->mmu->root,
- domain->context_id);
-   return -EINVAL;
-   }
 
/* TTBR0 */
ttbr = domain->cfg.arm_lpae_s1_cfg.ttbr[0];
@@ -495,7 +453,54 @@ static int ipmmu_domain_init_context(struct 
ipmmu_vmsa_domain *domain)
 */
ipmmu_ctx_write_all(domain, IMCTR,
IMCTR_INTEN | IMCTR_FLUSH | IMCTR_MMUEN);
+}
+
+static int ipmmu_domain_init_context(struct ipmmu_vmsa_domain *domain)
+{
+   int ret;
+
+   /*
+* Allocate the page table operations.
+*
+* VMSA states in section B3.6.3 "Control of Secure or Non-secure memory
+* access, Long-descriptor format" that the NStable bit being set in a
+* table descriptor will result in the NStable and NS bits of all child
+* entries being ignored and considered as being set. The IPMMU seems
+* not to comply with this, as it generates a secure access page fault
+* if any of the NStable and NS bits isn't set when running in
+* non-secure mode.
+*/
+   domain->cfg.quirks = IO_PGTABLE_QUIRK_ARM_NS;
+   domain->cfg.pgsize_bitmap = SZ_1G | SZ_2M | SZ_4K;
+   domain->cfg.ias = 32;
+   domain->cfg.oas = 40;
+   domain->cfg.tlb = &ipmmu_gather_ops;
+   domain->io_domain.geometry.aperture_end = DMA_BIT_MASK(32);
+   domain->io_domain.geometry.force_aperture = true;
+   /*
+* TODO: Add support for coherent walk through CCI with DVM and remove
+* cache handling. For now, delegate it to the io-pgtable code.
+*/
+   domain->cfg.iommu_dev = domain->mmu->root->dev;
+
+   /*
+* Find an unused context.
+*/
+   ret = ipmmu_domain_allocate_context(domain->mmu->root, domain);
+   if (ret < 0)
+   return ret;
+
+   domain->context_id = ret;
+
+   domain->iop = alloc_io_pgtable_ops(ARM_32_LPAE_S1, &domain->cfg,
+  domain);
+   if (!domain->iop) {
+   ipmmu_domain_free_context(domain->mmu->root,
+ domain->context_id);
+   return -EINVAL;
+   }
 
+   ipmmu_domain_setup_context(domain);
return 0;
 }
 
-- 
2.17.1

___
iommu mailing list
iommu@lists.linux-foundation.org

[PATCH v2 1/6] iommu/ipmmu-vmsa: Link IOMMUs and devices in sysfs

2019-04-03 Thread Geert Uytterhoeven
As of commit 7af9a5fdb9e0ca33 ("iommu/ipmmu-vmsa: Use
iommu_device_sysfs_add()/remove()"), IOMMU devices show up under
/sys/class/iommus/, but their "devices" subdirectories are empty.
Likewise, devices tied to an IOMMU do not have an "iommu" backlink.

Make sure all links are created, on both arm32 and arm64.

Signed-off-by: Geert Uytterhoeven 
Reviewed-by: Laurent Pinchart 
---
v2:
  - Add Reviewed-by.
---
 drivers/iommu/ipmmu-vmsa.c | 24 +---
 1 file changed, 17 insertions(+), 7 deletions(-)

diff --git a/drivers/iommu/ipmmu-vmsa.c b/drivers/iommu/ipmmu-vmsa.c
index 9a380c10655e182d..9f2b781e20a0eba6 100644
--- a/drivers/iommu/ipmmu-vmsa.c
+++ b/drivers/iommu/ipmmu-vmsa.c
@@ -885,27 +885,37 @@ static int ipmmu_init_arm_mapping(struct device *dev)
 
 static int ipmmu_add_device(struct device *dev)
 {
+   struct ipmmu_vmsa_device *mmu = to_ipmmu(dev);
struct iommu_group *group;
+   int ret;
 
/*
 * Only let through devices that have been verified in xlate()
 */
-   if (!to_ipmmu(dev))
+   if (!mmu)
return -ENODEV;
 
-   if (IS_ENABLED(CONFIG_ARM) && !IS_ENABLED(CONFIG_IOMMU_DMA))
-   return ipmmu_init_arm_mapping(dev);
+   if (IS_ENABLED(CONFIG_ARM) && !IS_ENABLED(CONFIG_IOMMU_DMA)) {
+   ret = ipmmu_init_arm_mapping(dev);
+   if (ret)
+   return ret;
+   } else {
+   group = iommu_group_get_for_dev(dev);
+   if (IS_ERR(group))
+   return PTR_ERR(group);
 
-   group = iommu_group_get_for_dev(dev);
-   if (IS_ERR(group))
-   return PTR_ERR(group);
+   iommu_group_put(group);
+   }
 
-   iommu_group_put(group);
+   iommu_device_link(&mmu->iommu, dev);
return 0;
 }
 
 static void ipmmu_remove_device(struct device *dev)
 {
+   struct ipmmu_vmsa_device *mmu = to_ipmmu(dev);
+
+   iommu_device_unlink(&mmu->iommu, dev);
arm_iommu_detach_device(dev);
iommu_group_remove_device(dev);
 }
-- 
2.17.1

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


[PATCH v2 6/6] iommu/ipmmu-vmsa: Add suspend/resume support

2019-04-03 Thread Geert Uytterhoeven
During PSCI system suspend, R-Car Gen3 SoCs are powered down, and all
IPMMU state is lost.  Hence after s2ram, devices wired behind an IPMMU,
and configured to use it, will see their DMA operations hang.

To fix this, restore all IPMMU contexts, and re-enable all active
micro-TLBs during system resume.

Signed-off-by: Geert Uytterhoeven 
---
This patch takes a different approach than the BSP, which implements a
bulk save/restore of all registers during system suspend/resume.

v2:
  - Drop PSCI checks.
---
 drivers/iommu/ipmmu-vmsa.c | 47 +-
 1 file changed, 46 insertions(+), 1 deletion(-)

diff --git a/drivers/iommu/ipmmu-vmsa.c b/drivers/iommu/ipmmu-vmsa.c
index 56e84bcc9532e1ce..408ad0b2591925e0 100644
--- a/drivers/iommu/ipmmu-vmsa.c
+++ b/drivers/iommu/ipmmu-vmsa.c
@@ -36,7 +36,10 @@
 #define arm_iommu_detach_device(...)   do {} while (0)
 #endif
 
-#define IPMMU_CTX_MAX 8U
+#define IPMMU_CTX_MAX  8U
+#define IPMMU_CTX_INVALID  -1
+
+#define IPMMU_UTLB_MAX 48U
 
 struct ipmmu_features {
bool use_ns_alias_offset;
@@ -58,6 +61,7 @@ struct ipmmu_vmsa_device {
spinlock_t lock;/* Protects ctx and domains[] */
DECLARE_BITMAP(ctx, IPMMU_CTX_MAX);
struct ipmmu_vmsa_domain *domains[IPMMU_CTX_MAX];
+   s8 utlb_ctx[IPMMU_UTLB_MAX];
 
struct iommu_group *group;
struct dma_iommu_mapping *mapping;
@@ -335,6 +339,7 @@ static void ipmmu_utlb_enable(struct ipmmu_vmsa_domain 
*domain,
ipmmu_write(mmu, IMUCTR(utlb),
IMUCTR_TTSEL_MMU(domain->context_id) | IMUCTR_FLUSH |
IMUCTR_MMUEN);
+   mmu->utlb_ctx[utlb] = domain->context_id;
 }
 
 /*
@@ -346,6 +351,7 @@ static void ipmmu_utlb_disable(struct ipmmu_vmsa_domain 
*domain,
struct ipmmu_vmsa_device *mmu = domain->mmu;
 
ipmmu_write(mmu, IMUCTR(utlb), 0);
+   mmu->utlb_ctx[utlb] = IPMMU_CTX_INVALID;
 }
 
 static void ipmmu_tlb_flush_all(void *cookie)
@@ -1043,6 +1049,7 @@ static int ipmmu_probe(struct platform_device *pdev)
spin_lock_init(&mmu->lock);
bitmap_zero(mmu->ctx, IPMMU_CTX_MAX);
mmu->features = of_device_get_match_data(&pdev->dev);
+   memset(mmu->utlb_ctx, IPMMU_CTX_INVALID, mmu->features->num_utlbs);
dma_set_mask_and_coherent(&pdev->dev, DMA_BIT_MASK(40));
 
/* Map I/O memory and request IRQ. */
@@ -1158,10 +1165,48 @@ static int ipmmu_remove(struct platform_device *pdev)
return 0;
 }
 
+#ifdef CONFIG_PM_SLEEP
+static int ipmmu_resume_noirq(struct device *dev)
+{
+   struct ipmmu_vmsa_device *mmu = dev_get_drvdata(dev);
+   unsigned int i;
+
+   /* Reset root MMU and restore contexts */
+   if (ipmmu_is_root(mmu)) {
+   ipmmu_device_reset(mmu);
+
+   for (i = 0; i < mmu->num_ctx; i++) {
+   if (!mmu->domains[i])
+   continue;
+
+   ipmmu_domain_setup_context(mmu->domains[i]);
+   }
+   }
+
+   /* Re-enable active micro-TLBs */
+   for (i = 0; i < mmu->features->num_utlbs; i++) {
+   if (mmu->utlb_ctx[i] == IPMMU_CTX_INVALID)
+   continue;
+
+   ipmmu_utlb_enable(mmu->root->domains[mmu->utlb_ctx[i]], i);
+   }
+
+   return 0;
+}
+
+static const struct dev_pm_ops ipmmu_pm  = {
+   SET_NOIRQ_SYSTEM_SLEEP_PM_OPS(NULL, ipmmu_resume_noirq)
+};
+#define DEV_PM_OPS &ipmmu_pm
+#else
+#define DEV_PM_OPS NULL
+#endif /* CONFIG_PM_SLEEP */
+
 static struct platform_driver ipmmu_driver = {
.driver = {
.name = "ipmmu-vmsa",
.of_match_table = of_match_ptr(ipmmu_of_ids),
+   .pm = DEV_PM_OPS,
},
.probe = ipmmu_probe,
.remove = ipmmu_remove,
-- 
2.17.1

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH v6 09/22] vfio: VFIO_IOMMU_BIND/UNBIND_MSI

2019-04-03 Thread Alex Williamson
On Wed, 3 Apr 2019 16:30:15 +0200
Auger Eric  wrote:

> Hi Alex,
> 
> On 3/22/19 11:09 PM, Alex Williamson wrote:
> > On Fri, 22 Mar 2019 10:30:02 +0100
> > Auger Eric  wrote:
> >   
> >> Hi Alex,
> >> On 3/22/19 12:01 AM, Alex Williamson wrote:  
> >>> On Sun, 17 Mar 2019 18:22:19 +0100
> >>> Eric Auger  wrote:
> >>> 
>  This patch adds the VFIO_IOMMU_BIND/UNBIND_MSI ioctl which aim
>  to pass/withdraw the guest MSI binding to/from the host.
> 
>  Signed-off-by: Eric Auger 
> 
>  ---
>  v3 -> v4:
>  - add UNBIND
>  - unwind on BIND error
> 
>  v2 -> v3:
>  - adapt to new proto of bind_guest_msi
>  - directly use vfio_iommu_for_each_dev
> 
>  v1 -> v2:
>  - s/vfio_iommu_type1_guest_msi_binding/vfio_iommu_type1_bind_guest_msi
>  ---
>   drivers/vfio/vfio_iommu_type1.c | 58 +
>   include/uapi/linux/vfio.h   | 29 +
>   2 files changed, 87 insertions(+)
> 
>  diff --git a/drivers/vfio/vfio_iommu_type1.c 
>  b/drivers/vfio/vfio_iommu_type1.c
>  index 12a40b9db6aa..66513679081b 100644
>  --- a/drivers/vfio/vfio_iommu_type1.c
>  +++ b/drivers/vfio/vfio_iommu_type1.c
>  @@ -1710,6 +1710,25 @@ static int vfio_cache_inv_fn(struct device *dev, 
>  void *data)
>   return iommu_cache_invalidate(d, dev, &ustruct->info);
>   }
>   
>  +static int vfio_bind_msi_fn(struct device *dev, void *data)
>  +{
>  +struct vfio_iommu_type1_bind_msi *ustruct =
>  +(struct vfio_iommu_type1_bind_msi *)data;
>  +struct iommu_domain *d = iommu_get_domain_for_dev(dev);
>  +
>  +return iommu_bind_guest_msi(d, dev, ustruct->iova,
>  +ustruct->gpa, ustruct->size);
>  +}
>  +
>  +static int vfio_unbind_msi_fn(struct device *dev, void *data)
>  +{
>  +dma_addr_t *iova = (dma_addr_t *)data;
>  +struct iommu_domain *d = iommu_get_domain_for_dev(dev);
> >>>
> >>> Same as previous, we can encapsulate domain in our own struct to avoid
> >>> a lookup.
> >>> 
>  +
>  +iommu_unbind_guest_msi(d, dev, *iova);
> >>>
> >>> Is it strange that iommu-core is exposing these interfaces at a device
> >>> level if every one of them requires us to walk all the devices?  Thanks,  
> >>>   
> >>
> >> Hum this per device API was devised in response of Robin's comments on
> >>
> >> [RFC v2 12/20] dma-iommu: Implement NESTED_MSI cookie.
> >>
> >> "
> >> But that then seems to reveal a somewhat bigger problem - if the callers
> >> are simply registering IPAs, and relying on the ITS driver to grab an
> >> entry and fill in a PA later, then how does either one know *which* PA
> >> is supposed to belong to a given IPA in the case where you have multiple
> >> devices with different ITS targets assigned to the same guest? (and if
> >> it's possible to assume a guest will use per-device stage 1 mappings and
> >> present it with a single vITS backed by multiple pITSes, I think things
> >> start breaking even harder.)
> >> "
> >>
> >> However looking back into the problem I wonder if there was an issue
> >> with the iommu_domain based API.
> >>
> >> If my understanding is correct, when assigned devices are protected by a
> >> vIOMMU then they necessarily end up in separate host iommu domains even
> >> if they belong to the same iommu_domain on the guest. And there can only
> >> be a single device in this iommu_domain.  
> > 
> > Don't forget that a container represents the IOMMU context in a vfio
> > environment, groups are associated with containers and a group may
> > contain one or more devices.  When a vIOMMU comes into play, we still
> > only have an IOMMU context per container.  If we have multiple devices
> > in a group, we run into problems with vIOMMU.  We can resolve this by
> > requiring that the user ignore all but one device in the group,
> > or making sure that the devices in the group have the same IOMMU
> > context.  The latter we could do in QEMU if PCIe-to-PCI bridges there
> > masked the per-device address space as it does on real hardware (ie.
> > there is no requester ID on conventional PCI, all transactions appear to
> > the IOMMU with the bridge requester ID).  So I raise this question
> > because vfio's minimum domain granularity is a group.
> >   
> >> If this is confirmed, there is a non ambiguous association between 1
> >> physical iommu_domain, 1 device, 1 S1 mapping and 1 physical MSI
> >> controller.
> >>
> >> I added the device handle to disambiguate those associations. The
> >> gIOVA ->gDB mapping is associated with a device handle. Then when the
> >> host needs a stage 1 mapping for this device, to build the nested
> >> mapping towards the physical DB it can easily grab the gIOVA->gDB stage
> >> 1 mapping registered for this device.
> >>
> >> The correctness looks more obvious to me, at 

Re: 5.1-rc1: mpt init crash in scsi_map_dma, dma_4v_map_sg on sparc64

2019-04-03 Thread Robin Murphy

On 02/04/2019 23:39, Rob Gardner wrote:

On 4/2/19 2:30 PM, Meelis Roos wrote:
[   17.566584] scsi host0: ioc0: LSISAS1064 A3, FwRev=010ah, 
Ports=1, MaxQ=511, IRQ=27
[   17.595897] mptsas: ioc0: attaching ssp device: fw_channel 0, 
fw_id 0, phy 0, sas_addr 0x5000c5001799a45d

[   17.598465] Unable to handle kernel NULL pointer dereference
[   17.598623] tsk->{mm,active_mm}->context = 
[   17.598723] tsk->{mm,active_mm}->pgd = 88802000
[   17.598774]   \|/  \|/
[   17.598774]   "@'/ .. \`@"
[   17.598774]   /_| \__/ |_\
[   17.598774]  \__U_/
[   17.598894] swapper/0(1): Oops [#1]
[   17.598937] CPU: 12 PID: 1 Comm: swapper/0 Not tainted 5.1.0-rc1 
#118
[   17.598994] TSTATE: 80e01601 TPC: 004483a8 TNPC: 
004483ac Y:     Not tainted

[   17.599086] TPC: 


You may use gdb to figure out what the NULL pointer points to:

gdb vmlinux

l *(dma_4v_map_sg+0xe8)


gdb did not parse the file but objdump --disassemble worked and +0xe8 
seems to be 4483a8



Of course that was right there in the panic message, as TPC is the 
address of the instruction that faulted:


ldx  [ %i4 ], %g1

For anyone wishing to dig into this further, here is my off the cuff 
analysis:


I believe the fault is happening on this line:

     base_shift = tbl->table_map_base >> IO_PAGE_SHIFT;

The tbl variable is assigned to one of two values in the statement 
above, but since the register dump shows the value in %i4 was 0x10, that 
strongly suggests that it executed this:


     tbl = &atu->tbl;

Because the offset of the tbl field in struct atu is 0x10, and that was 
computed here:


448384:   b8 07 60 10 add  %i5, 0x10, %i4

(The offset of tbl in struct iommu is 0, so we would have seen that 0 in 
%i4 if it had taken the iommu path.)


 From the register dump, the value in %i5 was 0. And that came from this 
instruction:


4482f4:   fa 58 e2 58 ldx  [ %g3 + 0x258 ], %i5

Likewise, %g3 came from here:

4482d4:   c6 5e 22 18 ldx  [ %i0 + 0x218 ], %g3

And %i0 is arg0, struct device *dev. So the code is loading some field 
in struct device at offset 0x218, which is consistent with the source:


iommu = dev->archdata.iommu;

So %g3 points to struct iommu, and the code is trying to load the value 
at offset 0x258 in that structure, probably this:


atu = iommu->atu;

And atu is the NULL pointer.

Now whether this is the problem, I don't know. It may be that mask 
(*dev->dma_mask) was wrong, causing the code to take the &atu->tbl path 
instead of the &iommu->tbl path. We can see from the code that mask is 
in %g7, and the register dump shows the value of %g7 is fff, 
while DMA_BIT_MASK(32) is in %g1 and is , so this might 
be the result of some confusion over 32-bit vs 64-bit stuff.
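
For reference, DMA_BIT_MASK() is defined in include/linux/dma-mapping.h as:

	#define DMA_BIT_MASK(n)	(((n) == 64) ? ~0ULL : ((1ULL << (n)) - 1))

so DMA_BIT_MASK(32) is 0xffffffff.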


Nice deduction! If it was AArch64 asm I might have tried, but I've never 
even seen SPARC asm before :)


FWIW, scripts/faddr2line is your friend when deciphering stacktrace symbols.
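
For example (illustrative invocation -- it takes the func+offset/length
form printed in the oops, and needs a vmlinux built with debug info;
the length below is made up):

	$ ./scripts/faddr2line vmlinux dma_4v_map_sg+0xe8/0x188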

In terms of the crash itself, I'd note that there's also been ongoing 
cleanup to fix the remaining places where the DMA API was called with 
NULL instead of the appropriate device - it could be that as a result of 
that, the driver/subsystem here is now taking a path that has not been 
properly exercised before, and/or that it's not quite the right device 
pointer being picked up.



I hope these bits of information help somebody debug further.


Thanks,
Robin.




Rob




004482c0 :
  4482c0:   9d e3 be b0 save  %sp, -336, %sp
  4482c4:   80 a6 e0 03 cmp  %i3, 3
  4482c8:   02 40 00 c1 be,pn   %icc, 4485cc 


  4482cc:   92 10 21 e2 mov  0x1e2, %o1
  4482d0:   80 a0 00 1a cmp  %g0, %i2
  4482d4:   c6 5e 22 18 ldx  [ %i0 + 0x218 ], %g3
  4482d8:   82 10 20 00 clr  %g1
  4482dc:   84 60 3f ff subc  %g0, -1, %g2
  4482e0:   83 78 e4 01 movre  %g3, 1, %g1
  4482e4:   80 90 80 01 orcc  %g2, %g1, %g0
  4482e8:   12 40 00 bd bne,pn   %icc, 4485dc 


  4482ec:   80 a6 e0 01 cmp  %i3, 1
  4482f0:   84 10 20 03 mov  3, %g2
  4482f4:   fa 58 e2 58 ldx  [ %g3 + 0x258 ], %i5
  4482f8:   85 64 60 01 move  %icc, 1, %g2
  4482fc:   b8 0f 20 02 and  %i4, 2, %i4
  448300:   c0 77 a7 f7 clrx  [ %fp + 0x7f7 ]
  448304:   82 10 a0 04 or  %g2, 4, %g1
  448308:   c0 26 60 18 clr  [ %i1 + 0x18 ]
  44830c:   85 7f 14 01 movrne  %i4, %g1, %g2
  448310:   8f 52 00 00 rdpr  %pil, %g7
  448314:   82 11 e0 0e or  %g7, 0xe, %g1
  448318:   91 90 60 00 wrpr  %g1, 0, %pil
  44831c:   ce 77 a7 bf stx  %g7, [ %fp + 0x7bf ]
  448320:   0f 00 02 00 sethi  %hi(0x8), %g7
  448324:   27 00 00 40 sethi  %hi(0x1), %l3
  448328:   ce 77 a7 df stx  %g7, [ %fp + 0x7df ]
  44832c:   0f 00 28 21 sethi  %hi(0xa08400), %g7
  448330:   

Re: [PATCH 4/7] pxa3xx-gcu: pass struct device to dma_mmap_coherent

2019-04-03 Thread Bartlomiej Zolnierkiewicz


On 03/21/2019 11:52 PM, Christoph Hellwig wrote:
> Just like we do for all other DMA operations.
> 
> Signed-off-by: Christoph Hellwig 

Acked-by: Bartlomiej Zolnierkiewicz 

Best regards,
--
Bartlomiej Zolnierkiewicz
Samsung R&D Institute Poland
Samsung Electronics
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH 2/7] da8xx-fb: pass struct device to DMA API functions

2019-04-03 Thread Bartlomiej Zolnierkiewicz


On 03/21/2019 11:52 PM, Christoph Hellwig wrote:
> The DMA API generally relies on a struct device to work properly, and
> only barely works without one for legacy reasons.  Pass the easily
> available struct device from the platform_device to remedy this.
> 
> Signed-off-by: Christoph Hellwig 

Acked-by: Bartlomiej Zolnierkiewicz 

Best regards,
--
Bartlomiej Zolnierkiewicz
Samsung R&D Institute Poland
Samsung Electronics
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: remove NULL struct device support in the DMA API

2019-04-03 Thread Christoph Hellwig
Any comments on the remaining patches?  I'd like to give this series
a couple weeks of soaking in linux-next before the end of the merge
window, so reviews would be appreciated.

On Thu, Mar 21, 2019 at 03:52:28PM -0700, Christoph Hellwig wrote:
> We still have a few drivers which pass a NULL struct device pointer
> to DMA API functions, which generally is a bad idea as the API
> implementations rely on the device not only for ops selection, but
> also the dma mask and various other attributes, and many implementations
> have been broken for NULL device support for a while.
> 
> This series removes the few remaning users that weren't picked up in
> the last merge window and then removes core support for this "feature".
> 
> A git tree is also available at:
> 
> git://git.infradead.org/users/hch/misc.git dma-remove-NULL-dev-support
> 
> Gitweb:
> 
> 
> http://git.infradead.org/users/hch/misc.git/shortlog/refs/heads/dma-remove-NULL-dev-support
> ___
> iommu mailing list
> iommu@lists.linux-foundation.org
> https://lists.linuxfoundation.org/mailman/listinfo/iommu
---end quoted text---
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH v6 09/22] vfio: VFIO_IOMMU_BIND/UNBIND_MSI

2019-04-03 Thread Auger Eric
Hi Alex,

On 3/22/19 11:09 PM, Alex Williamson wrote:
> On Fri, 22 Mar 2019 10:30:02 +0100
> Auger Eric  wrote:
> 
>> Hi Alex,
>> On 3/22/19 12:01 AM, Alex Williamson wrote:
>>> On Sun, 17 Mar 2019 18:22:19 +0100
>>> Eric Auger  wrote:
>>>   
 This patch adds the VFIO_IOMMU_BIND/UNBIND_MSI ioctl which aim
 to pass/withdraw the guest MSI binding to/from the host.

 Signed-off-by: Eric Auger 

 ---
 v3 -> v4:
 - add UNBIND
 - unwind on BIND error

 v2 -> v3:
 - adapt to new proto of bind_guest_msi
 - directly use vfio_iommu_for_each_dev

 v1 -> v2:
 - s/vfio_iommu_type1_guest_msi_binding/vfio_iommu_type1_bind_guest_msi
 ---
  drivers/vfio/vfio_iommu_type1.c | 58 +
  include/uapi/linux/vfio.h   | 29 +
  2 files changed, 87 insertions(+)

 diff --git a/drivers/vfio/vfio_iommu_type1.c 
 b/drivers/vfio/vfio_iommu_type1.c
 index 12a40b9db6aa..66513679081b 100644
 --- a/drivers/vfio/vfio_iommu_type1.c
 +++ b/drivers/vfio/vfio_iommu_type1.c
 @@ -1710,6 +1710,25 @@ static int vfio_cache_inv_fn(struct device *dev, 
 void *data)
 	return iommu_cache_invalidate(d, dev, &ustruct->info);
  }
  
 +static int vfio_bind_msi_fn(struct device *dev, void *data)
 +{
 +  struct vfio_iommu_type1_bind_msi *ustruct =
 +  (struct vfio_iommu_type1_bind_msi *)data;
 +  struct iommu_domain *d = iommu_get_domain_for_dev(dev);
 +
 +  return iommu_bind_guest_msi(d, dev, ustruct->iova,
 +  ustruct->gpa, ustruct->size);
 +}
 +
 +static int vfio_unbind_msi_fn(struct device *dev, void *data)
 +{
 +  dma_addr_t *iova = (dma_addr_t *)data;
 +  struct iommu_domain *d = iommu_get_domain_for_dev(dev);  
>>>
>>> Same as previous, we can encapsulate domain in our own struct to avoid
>>> a lookup.
>>>   
 +
 +  iommu_unbind_guest_msi(d, dev, *iova);  
>>>
>>> Is it strange that iommu-core is exposing these interfaces at a device
>>> level if every one of them requires us to walk all the devices?  Thanks,  
>>
>> Hum this per device API was devised in response of Robin's comments on
>>
>> [RFC v2 12/20] dma-iommu: Implement NESTED_MSI cookie.
>>
>> "
>> But that then seems to reveal a somewhat bigger problem - if the callers
>> are simply registering IPAs, and relying on the ITS driver to grab an
>> entry and fill in a PA later, then how does either one know *which* PA
>> is supposed to belong to a given IPA in the case where you have multiple
>> devices with different ITS targets assigned to the same guest? (and if
>> it's possible to assume a guest will use per-device stage 1 mappings and
>> present it with a single vITS backed by multiple pITSes, I think things
>> start breaking even harder.)
>> "
>>
>> However looking back into the problem I wonder if there was an issue
>> with the iommu_domain based API.
>>
>> If my understanding is correct, when assigned devices are protected by a
>> vIOMMU then they necessarily end up in separate host iommu domains even
>> if they belong to the same iommu_domain on the guest. And there can only
>> be a single device in this iommu_domain.
> 
> Don't forget that a container represents the IOMMU context in a vfio
> environment, groups are associated with containers and a group may
> contain one or more devices.  When a vIOMMU comes into play, we still
> only have an IOMMU context per container.  If we have multiple devices
> in a group, we run into problems with vIOMMU.  We can resolve this by
> requiring that the user ignore all but one device in the group,
> or making sure that the devices in the group have the same IOMMU
> context.  The latter we could do in QEMU if PCIe-to-PCI bridges there
> masked the per-device address space as it does on real hardware (ie.
> there is no requester ID on conventional PCI, all transactions appear to
> the IOMMU with the bridge requester ID).  So I raise this question
> because vfio's minimum domain granularity is a group.
> 
>> If this is confirmed, there is a non ambiguous association between 1
>> physical iommu_domain, 1 device, 1 S1 mapping and 1 physical MSI
>> controller.
>>
>> I added the device handle to disambiguate those associations. The
>> gIOVA ->gDB mapping is associated with a device handle. Then when the
>> host needs a stage 1 mapping for this device, to build the nested
>> mapping towards the physical DB it can easily grab the gIOVA->gDB stage
>> 1 mapping registered for this device.
>>
>> The correctness looks more obvious to me, at least.
> 
> Except all devices within all groups within the same container
> necessarily share the same IOMMU context, so from that perspective, it
> appears to impose non-trivial redundancy on the caller.  Thanks,

Taking into consideration the case where we could have several devices
attached to the same host iommu group, each of them possibly using

Re: [PATCH v1 1/3] iommu/tegra-smmu: Fix invalid ASID bits on Tegra30/114

2019-04-03 Thread Dmitry Osipenko
03.04.2019 11:41, Thierry Reding wrote:
> On Thu, Mar 07, 2019 at 01:50:07AM +0300, Dmitry Osipenko wrote:
>> Both Tegra30 and Tegra114 have 4 ASIDs, and the corresponding bitfield of
>> the TLB_FLUSH register differs from later Tegra generations that have 128
>> ASIDs.
>>
>> As a result, the PTEs are now flushed correctly from the TLB, and this fixes
>> problems with graphics (randomly failing tests) on Tegra30.
>>
>> Cc: stable 
>> Signed-off-by: Dmitry Osipenko 
>> ---
>>  drivers/iommu/tegra-smmu.c | 25 ++---
>>  1 file changed, 18 insertions(+), 7 deletions(-)
>>
>> diff --git a/drivers/iommu/tegra-smmu.c b/drivers/iommu/tegra-smmu.c
>> index 5182c7d6171e..8d30653cd13a 100644
>> --- a/drivers/iommu/tegra-smmu.c
>> +++ b/drivers/iommu/tegra-smmu.c
>> @@ -102,7 +102,6 @@ static inline u32 smmu_readl(struct tegra_smmu *smmu, 
>> unsigned long offset)
>>  #define  SMMU_TLB_FLUSH_VA_MATCH_ALL (0 << 0)
>>  #define  SMMU_TLB_FLUSH_VA_MATCH_SECTION (2 << 0)
>>  #define  SMMU_TLB_FLUSH_VA_MATCH_GROUP   (3 << 0)
>> -#define  SMMU_TLB_FLUSH_ASID(x)  (((x) & 0x7f) << 24)
> 
> Given that the same operation is repeated three times below, it might
> have been worth folding the conditional into the macro. That'd require
> the macro to take an smmu parameter, but would otherwise leave the
> individual instances shorter.

I had that variant initially and the result felt more clumsy to me.

> But either way, the fix is good, so:
> 
> Acked-by: Thierry Reding 
> 

Thanks
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

Re: [PATCH v1 2/3] arm64: dts: qcom: msm8998: Add PCIe SMMU node

2019-04-03 Thread Vivek Gautam



On 4/2/2019 7:24 PM, Robin Murphy wrote:

On 30/03/2019 14:18, Vivek Gautam wrote:

You should probably have some "bus" and "iface" clocks too, per the
requirement of "qcom,smmu-v2". Maybe Vivek might know what's relevant
for MSM8998?


As Jeffrey rightly mentioned, these clocks are not under the control 
of Linux.

So, we won't need to add clocks to this SMMU.


OK, in that case the "clock-names" part of binding doc probably wants 
refining to reflect which implementations do actually require clocks.


Certainly.
Marc, do you want to push a patch for the same? Or let me know and I can
prepare one.


Thanks
Vivek



Robin.

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH] driver core: Postpone DMA tear-down until after devres release for probe failure

2019-04-03 Thread John Garry

On 03/04/2019 09:14, Greg KH wrote:

On Wed, Apr 03, 2019 at 09:02:36AM +0100, John Garry wrote:

On 28/03/2019 10:08, John Garry wrote:

In commit 376991db4b64 ("driver core: Postpone DMA tear-down until after
devres release"), we changed the ordering of tearing down the device DMA
ops and releasing all the device's resources; this was because the DMA ops
should be maintained until we release the device's managed DMA memories.



Hi all,

A friendly reminder on this patch... I didn't see any update.

I thought that it had some importance.

Thanks,
John


However, we have seen another crash on an arm64 system when a
device driver probe fails:

  hisi_sas_v3_hw :74:02.0: Adding to iommu group 2
  scsi host1: hisi_sas_v3_hw
  BUG: Bad page state in process swapper/0  pfn:313f5
  page:7ec4fd40 count:1 mapcount:0
  mapping: index:0x0
  flags: 0xfffe0001000(reserved)
  raw: 0fffe0001000 7ec4fd48 7ec4fd48

  raw:   0001

  page dumped because: PAGE_FLAGS_CHECK_AT_FREE flag(s) set
  bad because of flags: 0x1000(reserved)
  Modules linked in:
  CPU: 49 PID: 1 Comm: swapper/0 Not tainted
5.1.0-rc1-43081-g22d97fd-dirty #1433
  Hardware name: Huawei D06/D06, BIOS Hisilicon D06 UEFI
RC0 - V1.12.01 01/29/2019
  Call trace:
  dump_backtrace+0x0/0x118
  show_stack+0x14/0x1c
  dump_stack+0xa4/0xc8
  bad_page+0xe4/0x13c
  free_pages_check_bad+0x4c/0xc0
  __free_pages_ok+0x30c/0x340
  __free_pages+0x30/0x44
  __dma_direct_free_pages+0x30/0x38
  dma_direct_free+0x24/0x38
  dma_free_attrs+0x9c/0xd8
  dmam_release+0x20/0x28
  release_nodes+0x17c/0x220
  devres_release_all+0x34/0x54
  really_probe+0xc4/0x2c8
  driver_probe_device+0x58/0xfc
  device_driver_attach+0x68/0x70
  __driver_attach+0x94/0xdc
  bus_for_each_dev+0x5c/0xb4
  driver_attach+0x20/0x28
  bus_add_driver+0x14c/0x200
  driver_register+0x6c/0x124
  __pci_register_driver+0x48/0x50
  sas_v3_pci_driver_init+0x20/0x28
  do_one_initcall+0x40/0x25c
  kernel_init_freeable+0x2b8/0x3c0
  kernel_init+0x10/0x100
  ret_from_fork+0x10/0x18
  Disabling lock debugging due to kernel taint
  BUG: Bad page state in process swapper/0  pfn:313f6
  page:7ec4fd80 count:1 mapcount:0
mapping: index:0x0
[   89.322983] flags: 0xfffe0001000(reserved)
  raw: 0fffe0001000 7ec4fd88 7ec4fd88

  raw:   0001


The crash occurs for the same reason.

In this case, on the really_probe() failure path, we are still clearing
the DMA ops prior to releasing the device's managed memories.

This patch fixes this issue by reordering the DMA ops teardown and the
call to devres_release_all() on the failure path.

Reported-by: Xiang Chen 
Tested-by: Xiang Chen 
Signed-off-by: John Garry 


So does this "fix" 376991db4b64?  If so, should this be added to the
patch and also backported to the stable trees?


Hi Greg,

No, I don't think so. I'd say it supplements it. Here I'm trying to fix 
up another path in which we tear down the DMA ops prior to releasing the 
device's resources.
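
In other words, the intended failure-path ordering is roughly the
following (a sketch based on the commit message, assuming the path
still ends in arch_teardown_dma_ops(); not the literal diff):

	probe_failed:
		...
	pinctrl_bind_failed:
		device_links_no_driver(dev);
		/* free managed resources, including dmam_*() DMA memory,
		 * while the DMA ops are still in place ... */
		devres_release_all(dev);
		/* ... and only then tear down the DMA configuration */
		arch_teardown_dma_ops(dev);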


I didn't add a Fixes tag, as 376991db4b64 didn't have one either. It will
need to be backported to stable, I figure, the same as 376991db4b64.


Thanks,
John



thanks,

greg k-h

.




___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


[PATCH] iommu/iova: Separate atomic variables to improve performance

2019-04-03 Thread Shaokun Zhang
From: Jinyu Qi  

In struct iova_domain, there are three atomic variables: the first two
are TLB flush counters updated with atomic64_add, and the other one is
the flush-timer flag updated with cmpxchg.
These variables are in the same cache line, which causes performance
loss when many cores call the queue_iova function. Let's move the two
kinds of atomic variables to different cache lines to reduce cache line
contention.
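
One way to double-check the resulting layout is pahole (illustrative
command, not part of the patch):

	$ pahole -C iova_domain vmlinux

which prints the offset and cache-line boundary of every member of
struct iova_domain.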

Cc: Joerg Roedel 
Signed-off-by: Jinyu Qi 
---
 include/linux/iova.h | 16 
 1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/include/linux/iova.h b/include/linux/iova.h
index 0b93bf96693e..28a5128405f8 100644
--- a/include/linux/iova.h
+++ b/include/linux/iova.h
@@ -76,6 +76,14 @@ struct iova_domain {
unsigned long   start_pfn;  /* Lower limit for this domain */
unsigned long   dma_32bit_pfn;
unsigned long   max32_alloc_size; /* Size of last failed allocation */
+   struct iova_fq __percpu *fq;/* Flush Queue */
+
+   atomic64_t  fq_flush_start_cnt; /* Number of TLB flushes that
+  have been started */
+
+   atomic64_t  fq_flush_finish_cnt;/* Number of TLB flushes that
+  have been finished */
+
struct iova anchor; /* rbtree lookup anchor */
struct iova_rcache rcaches[IOVA_RANGE_CACHE_MAX_SIZE];  /* IOVA range 
caches */
 
@@ -85,14 +93,6 @@ struct iova_domain {
iova_entry_dtor entry_dtor; /* IOMMU driver specific destructor for
   iova entry */
 
-   struct iova_fq __percpu *fq;/* Flush Queue */
-
-   atomic64_t  fq_flush_start_cnt; /* Number of TLB flushes that
-  have been started */
-
-   atomic64_t  fq_flush_finish_cnt;/* Number of TLB flushes that
-  have been finished */
-
struct timer_list fq_timer; /* Timer to regularily empty the
   flush-queues */
atomic_t fq_timer_on;   /* 1 when timer is active, 0
-- 
2.7.4

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH v1 3/3] iommu/tegra-smmu: Respect IOMMU API read-write protections

2019-04-03 Thread Thierry Reding
On Thu, Mar 07, 2019 at 01:50:09AM +0300, Dmitry Osipenko wrote:
> Set PTE read/write attributes accordingly to the the protections requested
> by IOMMU API.
> 
> Signed-off-by: Dmitry Osipenko 
> ---
>  drivers/iommu/tegra-smmu.c | 13 ++---
>  1 file changed, 10 insertions(+), 3 deletions(-)

Acked-by: Thierry Reding 


signature.asc
Description: PGP signature
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

Re: [PATCH v1 2/3] iommu/tegra-smmu: Properly release domain resources

2019-04-03 Thread Thierry Reding
On Thu, Mar 07, 2019 at 01:50:08AM +0300, Dmitry Osipenko wrote:
> Release all memory allocations associated with a released domain and emit
> warning if domain is in-use at the time of destruction.
> 
> Signed-off-by: Dmitry Osipenko 
> ---
>  drivers/iommu/tegra-smmu.c | 3 +++
>  1 file changed, 3 insertions(+)

Acked-by: Thierry Reding 


signature.asc
Description: PGP signature
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

Re: [PATCH v1 1/3] iommu/tegra-smmu: Fix invalid ASID bits on Tegra30/114

2019-04-03 Thread Thierry Reding
On Thu, Mar 07, 2019 at 01:50:07AM +0300, Dmitry Osipenko wrote:
> Both Tegra30 and Tegra114 have 4 ASIDs, and the corresponding bitfield of
> the TLB_FLUSH register differs from later Tegra generations that have 128
> ASIDs.
> 
> As a result, the PTEs are now flushed correctly from the TLB, and this fixes
> problems with graphics (randomly failing tests) on Tegra30.
> 
> Cc: stable 
> Signed-off-by: Dmitry Osipenko 
> ---
>  drivers/iommu/tegra-smmu.c | 25 ++---
>  1 file changed, 18 insertions(+), 7 deletions(-)
> 
> diff --git a/drivers/iommu/tegra-smmu.c b/drivers/iommu/tegra-smmu.c
> index 5182c7d6171e..8d30653cd13a 100644
> --- a/drivers/iommu/tegra-smmu.c
> +++ b/drivers/iommu/tegra-smmu.c
> @@ -102,7 +102,6 @@ static inline u32 smmu_readl(struct tegra_smmu *smmu, 
> unsigned long offset)
>  #define  SMMU_TLB_FLUSH_VA_MATCH_ALL (0 << 0)
>  #define  SMMU_TLB_FLUSH_VA_MATCH_SECTION (2 << 0)
>  #define  SMMU_TLB_FLUSH_VA_MATCH_GROUP   (3 << 0)
> -#define  SMMU_TLB_FLUSH_ASID(x)  (((x) & 0x7f) << 24)

Given that the same operation is repeated three times below, it might
have been worth folding the conditional into the macro. That'd require
the macro to take an smmu parameter, but would otherwise leave the
individual instances shorter.
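
Something along these lines, purely as a sketch (the soc->num_asids
field is an assumption for illustration, not the actual driver
structure):

	#define SMMU_TLB_FLUSH_ASID(smmu, x) \
		(((x) & ((smmu)->soc->num_asids == 4 ? 0x3 : 0x7f)) << 24)

so each of the three call sites stays a one-liner.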

But either way, the fix is good, so:

Acked-by: Thierry Reding 


signature.asc
Description: PGP signature
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

Re: [PATCH] driver core: Postpone DMA tear-down until after devres release for probe failure

2019-04-03 Thread Greg KH
On Wed, Apr 03, 2019 at 09:02:36AM +0100, John Garry wrote:
> On 28/03/2019 10:08, John Garry wrote:
> > In commit 376991db4b64 ("driver core: Postpone DMA tear-down until after
> > devres release"), we changed the ordering of tearing down the device DMA
> > ops and releasing all the device's resources; this was because the DMA ops
> > should be maintained until we release the device's managed DMA memories.
> > 
> 
> Hi all,
> 
> A friendly reminder on this patch... I didn't see any update.
> 
> I thought that it had some importance.
> 
> Thanks,
> John
> 
> > However, we have seen another crash on an arm64 system when a
> > device driver probe fails:
> > 
> >   hisi_sas_v3_hw :74:02.0: Adding to iommu group 2
> >   scsi host1: hisi_sas_v3_hw
> >   BUG: Bad page state in process swapper/0  pfn:313f5
> >   page:7ec4fd40 count:1 mapcount:0
> >   mapping: index:0x0
> >   flags: 0xfffe0001000(reserved)
> >   raw: 0fffe0001000 7ec4fd48 7ec4fd48
> > 
> >   raw:   0001
> > 
> >   page dumped because: PAGE_FLAGS_CHECK_AT_FREE flag(s) set
> >   bad because of flags: 0x1000(reserved)
> >   Modules linked in:
> >   CPU: 49 PID: 1 Comm: swapper/0 Not tainted
> > 5.1.0-rc1-43081-g22d97fd-dirty #1433
> >   Hardware name: Huawei D06/D06, BIOS Hisilicon D06 UEFI
> > RC0 - V1.12.01 01/29/2019
> >   Call trace:
> >   dump_backtrace+0x0/0x118
> >   show_stack+0x14/0x1c
> >   dump_stack+0xa4/0xc8
> >   bad_page+0xe4/0x13c
> >   free_pages_check_bad+0x4c/0xc0
> >   __free_pages_ok+0x30c/0x340
> >   __free_pages+0x30/0x44
> >   __dma_direct_free_pages+0x30/0x38
> >   dma_direct_free+0x24/0x38
> >   dma_free_attrs+0x9c/0xd8
> >   dmam_release+0x20/0x28
> >   release_nodes+0x17c/0x220
> >   devres_release_all+0x34/0x54
> >   really_probe+0xc4/0x2c8
> >   driver_probe_device+0x58/0xfc
> >   device_driver_attach+0x68/0x70
> >   __driver_attach+0x94/0xdc
> >   bus_for_each_dev+0x5c/0xb4
> >   driver_attach+0x20/0x28
> >   bus_add_driver+0x14c/0x200
> >   driver_register+0x6c/0x124
> >   __pci_register_driver+0x48/0x50
> >   sas_v3_pci_driver_init+0x20/0x28
> >   do_one_initcall+0x40/0x25c
> >   kernel_init_freeable+0x2b8/0x3c0
> >   kernel_init+0x10/0x100
> >   ret_from_fork+0x10/0x18
> >   Disabling lock debugging due to kernel taint
> >   BUG: Bad page state in process swapper/0  pfn:313f6
> >   page:7ec4fd80 count:1 mapcount:0
> > mapping: index:0x0
> > [   89.322983] flags: 0xfffe0001000(reserved)
> >   raw: 0fffe0001000 7ec4fd88 7ec4fd88
> > 
> >   raw:   0001
> > 
> > 
> > The crash occurs for the same reason.
> > 
> > In this case, on the really_probe() failure path, we are still clearing
> > the DMA ops prior to releasing the device's managed memories.
> > 
> > This patch fixes this issue by reordering the DMA ops teardown and the
> > call to devres_release_all() on the failure path.
> > 
> > Reported-by: Xiang Chen 
> > Tested-by: Xiang Chen 
> > Signed-off-by: John Garry 

So does this "fix" 376991db4b64?  If so, should this be added to the
patch and also backported to the stable trees?

thanks,

greg k-h
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH] driver core: Postpone DMA tear-down until after devres release for probe failure

2019-04-03 Thread John Garry

On 28/03/2019 10:08, John Garry wrote:

In commit 376991db4b64 ("driver core: Postpone DMA tear-down until after
devres release"), we changed the ordering of tearing down the device DMA
ops and releasing all the device's resources; this was because the DMA ops
should be maintained until we release the device's managed DMA memories.



Hi all,

A friendly reminder on this patch... I didn't see any update.

I thought that it had some importance.

Thanks,
John


However, we have seen another crash on an arm64 system when a
device driver probe fails:

  hisi_sas_v3_hw :74:02.0: Adding to iommu group 2
  scsi host1: hisi_sas_v3_hw
  BUG: Bad page state in process swapper/0  pfn:313f5
  page:7ec4fd40 count:1 mapcount:0
  mapping: index:0x0
  flags: 0xfffe0001000(reserved)
  raw: 0fffe0001000 7ec4fd48 7ec4fd48

  raw:   0001

  page dumped because: PAGE_FLAGS_CHECK_AT_FREE flag(s) set
  bad because of flags: 0x1000(reserved)
  Modules linked in:
  CPU: 49 PID: 1 Comm: swapper/0 Not tainted
5.1.0-rc1-43081-g22d97fd-dirty #1433
  Hardware name: Huawei D06/D06, BIOS Hisilicon D06 UEFI
RC0 - V1.12.01 01/29/2019
  Call trace:
  dump_backtrace+0x0/0x118
  show_stack+0x14/0x1c
  dump_stack+0xa4/0xc8
  bad_page+0xe4/0x13c
  free_pages_check_bad+0x4c/0xc0
  __free_pages_ok+0x30c/0x340
  __free_pages+0x30/0x44
  __dma_direct_free_pages+0x30/0x38
  dma_direct_free+0x24/0x38
  dma_free_attrs+0x9c/0xd8
  dmam_release+0x20/0x28
  release_nodes+0x17c/0x220
  devres_release_all+0x34/0x54
  really_probe+0xc4/0x2c8
  driver_probe_device+0x58/0xfc
  device_driver_attach+0x68/0x70
  __driver_attach+0x94/0xdc
  bus_for_each_dev+0x5c/0xb4
  driver_attach+0x20/0x28
  bus_add_driver+0x14c/0x200
  driver_register+0x6c/0x124
  __pci_register_driver+0x48/0x50
  sas_v3_pci_driver_init+0x20/0x28
  do_one_initcall+0x40/0x25c
  kernel_init_freeable+0x2b8/0x3c0
  kernel_init+0x10/0x100
  ret_from_fork+0x10/0x18
  Disabling lock debugging due to kernel taint
  BUG: Bad page state in process swapper/0  pfn:313f6
  page:7ec4fd80 count:1 mapcount:0
mapping: index:0x0
[   89.322983] flags: 0xfffe0001000(reserved)
  raw: 0fffe0001000 7ec4fd88 7ec4fd88

  raw:   0001


The crash occurs for the same reason.

In this case, on the really_probe() failure path, we are still clearing
the DMA ops prior to releasing the device's managed memories.

This patch fixes this issue by reordering the DMA ops teardown and the
call to devres_release_all() on the failure path.

Reported-by: Xiang Chen 
Tested-by: Xiang Chen 
Signed-off-by: John Garry 
---

For convenience, here is the 2nd half of really_probe() now:

atomic_inc(&probe_count);
pr_debug("bus: '%s': %s: probing driver %s with device %s\n",
 drv->bus->name, __func__, drv->name, dev_name(dev));
WARN_ON(!list_empty(&dev->devres_head));

re_probe:
dev->driver = drv;

/* If using pinctrl, bind pins now before probing */
ret = pinctrl_bind_pins(dev);
if (ret)
goto pinctrl_bind_failed;

if (dev->bus->dma_configure) {
ret = dev->bus->dma_configure(dev);
if (ret)
goto probe_failed;
}

if (driver_sysfs_add(dev)) {
printk(KERN_ERR "%s: driver_sysfs_add(%s) failed\n",
__func__, dev_name(dev));
goto probe_failed;
}

if (dev->pm_domain && dev->pm_domain->activate) {
ret = dev->pm_domain->activate(dev);
if (ret)
goto probe_failed;
}

if (dev->bus->probe) {
ret = dev->bus->probe(dev);
if (ret)
goto probe_failed;
} else if (drv->probe) {
ret = drv->probe(dev);
if (ret)
goto probe_failed;
}

if (test_remove) {
test_remove = false;

if (dev->bus->remove)
dev->bus->remove(dev);
else if (drv->remove)
drv->remove(dev);

devres_release_all(dev);
driver_sysfs_remove(dev);
dev->driver = NULL;
dev_set_drvdata(dev, NULL);
if (dev->pm_domain && dev->pm_domain->dismiss)
dev->pm_domain->dismiss(dev);
pm_runtime_reinit(dev);

goto re_probe;
}

pinctrl_init_done(dev);

if (dev->pm_domain && dev->pm_domain->sync)
dev->pm_domain->sync(dev);

driver_bound(dev);
ret = 1;
pr_debug("bus: '%s': %s: bound device %s to driver %s\n",
 drv->bus->name, __func__,