[PATCH kernel v2 2/2] powerpc/mm/iommu: Put pages on process exit

2016-07-19 Thread Alexey Kardashevskiy
At the moment VFIO IOMMU SPAPR v2 driver pins all guest RAM pages when
the userspace starts using VFIO. When the userspace process finishes,
all the pinned pages need to be put; this is done as a part of
the userspace memory context (MM) destruction which happens on
the very last mmdrop().

This approach has a problem: the MM of the userspace process
may live longer than the userspace process itself, as kernel threads
borrow the MM of whatever userspace process was running on the CPU
where the kernel thread was scheduled. If this happens, the MM remains
referenced until that exact kernel thread wakes up again
and releases the very last reference to the MM; on an idle system this
can take hours.

This takes a reference to the MM and caches it once per container, and adds
tracking of how many times each preregistered area has been registered in
a specific container. This way we do not depend on @current pointing to
a valid task descriptor.
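
A minimal sketch of how the container could cache the MM at first
registration (the helper name and the exact reference counter are
illustrative assumptions, not the actual hunk):

static long tce_iommu_cache_mm(struct tce_container *container)
{
	if (container->mm)
		return container->mm == current->mm ? 0 : -EPERM;

	if (!current->mm)
		return -ESRCH;	/* process exited */

	/* Hold a reference so the MM stays valid for cleanup without @current */
	atomic_inc(&current->mm->mm_count);
	container->mm = current->mm;

	return 0;
}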

This changes the userspace interface to return EBUSY if memory is
already registered (mm_iommu_get() used to just increment the counter);
however this should not have any practical effect as the only
userspace tool available now registers a memory area only once per
container anyway.

As tce_iommu_register_pages/tce_iommu_unregister_pages are called
under container->lock, this does not need additional locking.
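
For example, with the tce_iommu_prereg descriptor added below, looking up
a preregistered area in a container could be as simple as the following
sketch (the helper name is illustrative):

static struct tce_iommu_prereg *tce_iommu_prereg_find(
		struct tce_container *container,
		struct mm_iommu_table_group_mem_t *mem)
{
	struct tce_iommu_prereg *tcemem;

	/* The caller holds container->lock, a plain list walk is enough */
	list_for_each_entry(tcemem, &container->prereg_list, next)
		if (tcemem->mem == mem)
			return tcemem;

	return NULL;
}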

Cc: David Gibson <da...@gibson.dropbear.id.au>
Cc: Benjamin Herrenschmidt <b...@kernel.crashing.org>
Cc: Paul Mackerras <pau...@samba.org>
Cc: Balbir Singh <bsinghar...@gmail.com>
Cc: Nicholas Piggin <npig...@gmail.com>
Signed-off-by: Alexey Kardashevskiy <a...@ozlabs.ru>
---
 arch/powerpc/include/asm/mmu_context.h |  1 -
 arch/powerpc/mm/mmu_context_book3s64.c |  4 ---
 arch/powerpc/mm/mmu_context_iommu.c| 10 ---
 drivers/vfio/vfio_iommu_spapr_tce.c| 52 +-
 4 files changed, 51 insertions(+), 16 deletions(-)

diff --git a/arch/powerpc/include/asm/mmu_context.h 
b/arch/powerpc/include/asm/mmu_context.h
index 745b4bd..90338fd 100644
--- a/arch/powerpc/include/asm/mmu_context.h
+++ b/arch/powerpc/include/asm/mmu_context.h
@@ -24,7 +24,6 @@ extern long mm_iommu_get(struct mm_struct *mm, unsigned long 
ua, unsigned long e
 extern long mm_iommu_put(struct mm_struct *mm,
struct mm_iommu_table_group_mem_t *mem);
 extern void mm_iommu_init(mm_context_t *ctx);
-extern void mm_iommu_cleanup(mm_context_t *ctx);
 extern struct mm_iommu_table_group_mem_t *mm_iommu_lookup(struct mm_struct *mm,
unsigned long ua, unsigned long size);
 extern struct mm_iommu_table_group_mem_t *mm_iommu_find(struct mm_struct *mm,
diff --git a/arch/powerpc/mm/mmu_context_book3s64.c 
b/arch/powerpc/mm/mmu_context_book3s64.c
index 227b2a6..5c67d1c 100644
--- a/arch/powerpc/mm/mmu_context_book3s64.c
+++ b/arch/powerpc/mm/mmu_context_book3s64.c
@@ -159,10 +159,6 @@ static inline void destroy_pagetable_page(struct mm_struct 
*mm)
 
 void destroy_context(struct mm_struct *mm)
 {
-#ifdef CONFIG_SPAPR_TCE_IOMMU
-   mm_iommu_cleanup(&mm->context);
-#endif
-
 #ifdef CONFIG_PPC_ICSWX
drop_cop(mm->context.acop, mm);
kfree(mm->context.cop_lockp);
diff --git a/arch/powerpc/mm/mmu_context_iommu.c 
b/arch/powerpc/mm/mmu_context_iommu.c
index 65086bf..901773d 100644
--- a/arch/powerpc/mm/mmu_context_iommu.c
+++ b/arch/powerpc/mm/mmu_context_iommu.c
@@ -293,13 +293,3 @@ void mm_iommu_init(mm_context_t *ctx)
 {
INIT_LIST_HEAD_RCU(&ctx->iommu_group_mem_list);
 }
-
-void mm_iommu_cleanup(mm_context_t *ctx)
-{
-   struct mm_iommu_table_group_mem_t *mem, *tmp;
-
-   list_for_each_entry_safe(mem, tmp, &ctx->iommu_group_mem_list, next) {
-   list_del_rcu(&mem->next);
-   mm_iommu_do_free(mem);
-   }
-}
diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c 
b/drivers/vfio/vfio_iommu_spapr_tce.c
index 9752e77..40e71a0 100644
--- a/drivers/vfio/vfio_iommu_spapr_tce.c
+++ b/drivers/vfio/vfio_iommu_spapr_tce.c
@@ -89,6 +89,15 @@ struct tce_iommu_group {
 };
 
 /*
+ * A container needs to remember which preregistered areas it has referenced,
+ * and how many times, in order to do proper cleanup at userspace process exit.
+ */
+struct tce_iommu_prereg {
+   struct list_head next;
+   struct mm_iommu_table_group_mem_t *mem;
+};
+
+/*
  * The container descriptor supports only a single group per container.
  * Required by the API as the container is not supplied with the IOMMU group
  * at the moment of initialization.
@@ -101,12 +110,26 @@ struct tce_container {
struct mm_struct *mm;
struct iommu_table *tables[IOMMU_TABLE_GROUP_MAX_TABLES];
struct list_head group_list;
+   struct list_head prereg_list;
 };
 
+static long tce_iommu_prereg_free(struct tce_container *container,
+   struct tce_iommu_prereg *tcemem)
+{
+   long ret;
+
+   list_del(&tcemem->next);
+   ret = mm_iommu_put(container->mm, tcemem->mem);
+   kfree(tcemem);
+
+
+   return ret;
+}
+
 static long tce_iommu_unregister_pages(struct tce_container *container,
__u64 vaddr, __u64 size)
 {
struct mm_iommu_table_gr

[PATCH kernel v2 1/2] powerpc/iommu: Stop using @current in mm_iommu_xxx

2016-07-19 Thread Alexey Kardashevskiy
In some situations the userspace memory context may live longer than
the userspace process itself, so if we want to do proper memory context
cleanup, we had better cache @mm and use it later, when the process is gone
(@current or @current->mm is NULL).

This changes mm_iommu_xxx API to receive mm_struct instead of using one
from @current.
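
For example, a caller that used to rely on @current implicitly now passes
the cached mm explicitly (an illustration, using mm_iommu_lookup() from
the header diff below; container->mm is the cached mm in the VFIO driver):

	/* before: the helper took the mm from @current */
	mem = mm_iommu_lookup(ua, size);

	/* after: the caller passes the mm_struct it has cached */
	mem = mm_iommu_lookup(container->mm, ua, size);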

This is needed by the following patch to do proper cleanup in time.
This depends on the "powerpc/powernv/ioda: Fix endianness when reading TCEs"
patch to do proper cleanup via tce_iommu_clear().

This should cause no behavioral change.

Signed-off-by: Alexey Kardashevskiy 
---
 arch/powerpc/include/asm/mmu_context.h | 15 ++--
 arch/powerpc/mm/mmu_context_iommu.c| 45 +-
 drivers/vfio/vfio_iommu_spapr_tce.c| 41 +++
 3 files changed, 51 insertions(+), 50 deletions(-)

diff --git a/arch/powerpc/include/asm/mmu_context.h 
b/arch/powerpc/include/asm/mmu_context.h
index 9d2cd0c..745b4bd 100644
--- a/arch/powerpc/include/asm/mmu_context.h
+++ b/arch/powerpc/include/asm/mmu_context.h
@@ -18,16 +18,17 @@ extern void destroy_context(struct mm_struct *mm);
 #ifdef CONFIG_SPAPR_TCE_IOMMU
 struct mm_iommu_table_group_mem_t;
 
-extern bool mm_iommu_preregistered(void);
-extern long mm_iommu_get(unsigned long ua, unsigned long entries,
+extern bool mm_iommu_preregistered(struct mm_struct *mm);
+extern long mm_iommu_get(struct mm_struct *mm, unsigned long ua, unsigned long 
entries,
struct mm_iommu_table_group_mem_t **pmem);
-extern long mm_iommu_put(struct mm_iommu_table_group_mem_t *mem);
+extern long mm_iommu_put(struct mm_struct *mm,
+   struct mm_iommu_table_group_mem_t *mem);
 extern void mm_iommu_init(mm_context_t *ctx);
 extern void mm_iommu_cleanup(mm_context_t *ctx);
-extern struct mm_iommu_table_group_mem_t *mm_iommu_lookup(unsigned long ua,
-   unsigned long size);
-extern struct mm_iommu_table_group_mem_t *mm_iommu_find(unsigned long ua,
-   unsigned long entries);
+extern struct mm_iommu_table_group_mem_t *mm_iommu_lookup(struct mm_struct *mm,
+   unsigned long ua, unsigned long size);
+extern struct mm_iommu_table_group_mem_t *mm_iommu_find(struct mm_struct *mm,
+   unsigned long ua, unsigned long entries);
 extern long mm_iommu_ua_to_hpa(struct mm_iommu_table_group_mem_t *mem,
unsigned long ua, unsigned long *hpa);
 extern long mm_iommu_mapped_inc(struct mm_iommu_table_group_mem_t *mem);
diff --git a/arch/powerpc/mm/mmu_context_iommu.c 
b/arch/powerpc/mm/mmu_context_iommu.c
index da6a216..65086bf 100644
--- a/arch/powerpc/mm/mmu_context_iommu.c
+++ b/arch/powerpc/mm/mmu_context_iommu.c
@@ -53,7 +53,7 @@ static long mm_iommu_adjust_locked_vm(struct mm_struct *mm,
}
 
pr_debug("[%d] RLIMIT_MEMLOCK HASH64 %c%ld %ld/%ld\n",
-   current->pid,
+   current ? current->pid : 0,
incr ? '+' : '-',
npages << PAGE_SHIFT,
mm->locked_vm << PAGE_SHIFT,
@@ -63,28 +63,22 @@ static long mm_iommu_adjust_locked_vm(struct mm_struct *mm,
return ret;
 }
 
-bool mm_iommu_preregistered(void)
+bool mm_iommu_preregistered(struct mm_struct *mm)
 {
-   if (!current || !current->mm)
-   return false;
-
-   return !list_empty(&current->mm->context.iommu_group_mem_list);
+   return !list_empty(&mm->context.iommu_group_mem_list);
 }
 EXPORT_SYMBOL_GPL(mm_iommu_preregistered);
 
-long mm_iommu_get(unsigned long ua, unsigned long entries,
+long mm_iommu_get(struct mm_struct *mm, unsigned long ua, unsigned long 
entries,
struct mm_iommu_table_group_mem_t **pmem)
 {
struct mm_iommu_table_group_mem_t *mem;
long i, j, ret = 0, locked_entries = 0;
struct page *page = NULL;
 
-   if (!current || !current->mm)
-   return -ESRCH; /* process exited */
-
mutex_lock(&mem_list_mutex);
 
-   list_for_each_entry_rcu(mem, &current->mm->context.iommu_group_mem_list,
+   list_for_each_entry_rcu(mem, &mm->context.iommu_group_mem_list,
next) {
if ((mem->ua == ua) && (mem->entries == entries)) {
++mem->used;
@@ -102,7 +96,7 @@ long mm_iommu_get(unsigned long ua, unsigned long entries,
 
}
 
-   ret = mm_iommu_adjust_locked_vm(current->mm, entries, true);
+   ret = mm_iommu_adjust_locked_vm(mm, entries, true);
if (ret)
goto unlock_exit;
 
@@ -142,11 +136,11 @@ long mm_iommu_get(unsigned long ua, unsigned long entries,
mem->entries = entries;
*pmem = mem;
 
-   list_add_rcu(&mem->next, &current->mm->context.iommu_group_mem_list);
+   list_add_rcu(&mem->next, &mm->context.iommu_group_mem_list);
 
 unlock_exit:
if (locked_entries && ret)
-

[PATCH kernel v2 0/2] powerpc/mm/iommu: Put pages on process exit

2016-07-19 Thread Alexey Kardashevskiy
This is a fix for a bug where guest memory stays Active
after the QEMU process has exited. This happened because the QEMU memory
context was not released within a short period of time after the process exit.
More details are in the commit logs.

Please comment. Thanks.

Alexey Kardashevskiy (2):
  powerpc/iommu: Stop using @current in mm_iommu_xxx
  powerpc/mm/iommu: Put pages on process exit

 arch/powerpc/include/asm/mmu_context.h | 16 +++---
 arch/powerpc/mm/mmu_context_book3s64.c |  4 --
 arch/powerpc/mm/mmu_context_iommu.c| 55 +++--
 drivers/vfio/vfio_iommu_spapr_tce.c| 89 --
 4 files changed, 100 insertions(+), 64 deletions(-)

-- 
2.5.0.rc3




Re: [PATCH kernel] powerpc/mm/iommu: Put pages on process exit

2016-07-14 Thread Alexey Kardashevskiy
On 14/07/16 21:52, Alexey Kardashevskiy wrote:
> On 14/07/16 20:12, Balbir Singh wrote:
>> On Thu, Jul 14, 2016 at 3:16 PM, Alexey Kardashevskiy <a...@ozlabs.ru> wrote:
>>> At the moment VFIO IOMMU SPAPR v2 driver pins all guest RAM pages when
>>> the userspace starts using VFIO. When the userspace process finishes,
>>> all the pinned pages need to be put; this is done as a part of
>>> the userspace memory context (MM) destruction which happens on
>>> the very last mmdrop().
>>>
>>> This approach has a problem that a MM of the userspace process
>>> may live longer than the userspace process itself as kernel threads
>>> usually execute on a MM of a userspace process which was running
>>> on a CPU where the kernel thread was scheduled to. If this happened,
>>> the MM remains referenced until this exact kernel thread wakes up again
>>> and releases the very last reference to the MM, on an idle system this
>>> can take even hours.
>>>
>>> This fixes the issue by moving mm_iommu_cleanup() (the helper which puts
>>> pages) from destroy_context() (called on the last mmdrop()) to
>>> the arch-specific arch_exit_mmap() hook (called on the last mmput()).
>>> mmdrop() decrements the mm->mm_count which is a total reference number;
>>> mmput() decrements the mm->mm_users which is a number of user spaces and
>>> this is actually the counter we want to watch for here.
>>>
>>> Cc: David Gibson <da...@gibson.dropbear.id.au>
>>> Cc: Benjamin Herrenschmidt <b...@kernel.crashing.org>
>>> Cc: Paul Mackerras <pau...@samba.org>
>>> Cc: Balbir Singh <bsinghar...@gmail.com>
>>> Cc: Nick Piggin <npig...@kernel.dk>
>>> Signed-off-by: Alexey Kardashevskiy <a...@ozlabs.ru>
>>> ---
>>>  arch/powerpc/include/asm/mmu_context.h | 3 +++
>>>  arch/powerpc/mm/mmu_context_book3s64.c | 4 
>>>  2 files changed, 3 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/arch/powerpc/include/asm/mmu_context.h 
>>> b/arch/powerpc/include/asm/mmu_context.h
>>> index 9d2cd0c..24b590d 100644
>>> --- a/arch/powerpc/include/asm/mmu_context.h
>>> +++ b/arch/powerpc/include/asm/mmu_context.h
>>> @@ -138,6 +138,9 @@ static inline void arch_dup_mmap(struct mm_struct 
>>> *oldmm,
>>>
>>>  static inline void arch_exit_mmap(struct mm_struct *mm)
>>>  {
>>> +#ifdef CONFIG_SPAPR_TCE_IOMMU
>>> +   mm_iommu_cleanup(&mm->context);
>>> +#endif
>>>  }
>>>
>>
>> The assumption is that all necessary tear down has happened. Earlier
>> we were doing the teardown
>> on the last mm ref drop, do we have any dependence on that? I don't
>> think so, but just checking
>>
>> Does qemu do tce_iommu_register_pages on exit()/atexit() or do we rely
>> on the exit path to do the teardown?
> 
> QEMU does call unregister when a memory region is deleted. Handling it in
> arch_exit_mmap() is for the case when a QEMU process just dies for whatever
> reason.

On second thought, the patch is not correct: before it, pages were actually
put when all references to the MM were dropped, and many of those references
are held by file descriptors. With this patch, pages are put before the files
are closed, which means the devices have not been reset yet and may still be
doing DMA to memory which is no longer pinned but still has TCEs pointing to it.

I will make another patch, with additional mmget()/mmput().
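
Roughly, the shape would be for the container itself to hold an mm_users
reference (a sketch under that assumption; names are illustrative):

	/* when the container starts using the MM */
	if (!current->mm)
		return -ESRCH;
	atomic_inc(&current->mm->mm_users);	/* the "mmget()" mentioned above */
	container->mm = current->mm;

	/* in the container release path, after TCEs are cleared and pages put */
	if (container->mm)
		mmput(container->mm);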



>> Can we please remove the #ifdef CONFIG_SPAPR_TCE_IOMMU here and have a 
>> version
>> of mm_iommu_cleanup that handles it
> 
> 
> That would be a change unrelated to the fix. And I just do not see the
> point in this - mm_iommu_cleanup() is declared in the very same header.
> 
> 
>> Basically do
>>
>> #ifdef CONFIG_SPAPR_TCE_IOMMU
>> extern void mm_iommu_cleanup(void *ptr)
>> #else
>> static inline void mm_iommu_cleanup(void *ptr) {}
>> #endif
>>
>> Otherwise looks good
>>
>> Acked-by: Balbir Singh <bsinghar...@gmail.com>
> 
> Thanks!
> 
> 


-- 
Alexey



Re: [PATCH kernel] powerpc/mm/iommu: Put pages on process exit

2016-07-14 Thread Alexey Kardashevskiy
On 14/07/16 20:12, Balbir Singh wrote:
> On Thu, Jul 14, 2016 at 3:16 PM, Alexey Kardashevskiy <a...@ozlabs.ru> wrote:
>> At the moment VFIO IOMMU SPAPR v2 driver pins all guest RAM pages when
>> the userspace starts using VFIO. When the userspace process finishes,
>> all the pinned pages need to be put; this is done as a part of
>> the userspace memory context (MM) destruction which happens on
>> the very last mmdrop().
>>
>> This approach has a problem that a MM of the userspace process
>> may live longer than the userspace process itself as kernel threads
>> usually execute on a MM of a userspace process which was running
>> on a CPU where the kernel thread was scheduled to. If this happened,
>> the MM remains referenced until this exact kernel thread wakes up again
>> and releases the very last reference to the MM, on an idle system this
>> can take even hours.
>>
>> This fixes the issue by moving mm_iommu_cleanup() (the helper which puts
>> pages) from destroy_context() (called on the last mmdrop()) to
>> the arch-specific arch_exit_mmap() hook (called on the last mmput()).
>> mmdrop() decrements the mm->mm_count which is a total reference number;
>> mmput() decrements the mm->mm_users which is a number of user spaces and
>> this is actually the counter we want to watch for here.
>>
>> Cc: David Gibson <da...@gibson.dropbear.id.au>
>> Cc: Benjamin Herrenschmidt <b...@kernel.crashing.org>
>> Cc: Paul Mackerras <pau...@samba.org>
>> Cc: Balbir Singh <bsinghar...@gmail.com>
>> Cc: Nick Piggin <npig...@kernel.dk>
>> Signed-off-by: Alexey Kardashevskiy <a...@ozlabs.ru>
>> ---
>>  arch/powerpc/include/asm/mmu_context.h | 3 +++
>>  arch/powerpc/mm/mmu_context_book3s64.c | 4 
>>  2 files changed, 3 insertions(+), 4 deletions(-)
>>
>> diff --git a/arch/powerpc/include/asm/mmu_context.h 
>> b/arch/powerpc/include/asm/mmu_context.h
>> index 9d2cd0c..24b590d 100644
>> --- a/arch/powerpc/include/asm/mmu_context.h
>> +++ b/arch/powerpc/include/asm/mmu_context.h
>> @@ -138,6 +138,9 @@ static inline void arch_dup_mmap(struct mm_struct *oldmm,
>>
>>  static inline void arch_exit_mmap(struct mm_struct *mm)
>>  {
>> +#ifdef CONFIG_SPAPR_TCE_IOMMU
>> +   mm_iommu_cleanup(&mm->context);
>> +#endif
>>  }
>>
> 
> The assumption is that all necessary tear down has happened. Earlier
> we were doing the teardown
> on the last mm ref drop, do we have any dependence on that? I don't
> think so, but just checking
> 
> Does qemu do tce_iommu_register_pages on exit()/atexit() or do we rely
> on the exit path to do the teardown?

QEMU does call unregister when a memory region is deleted. Handling it in
arch_exit_mmap() is for the case when a QEMU process just dies for whatever
reason.

> Can we please remove the #ifdef CONFIG_SPAPR_TCE_IOMMU here and have a version
> of mm_iommu_cleanup that handles it


That would be a change unrelated to the fix. And I just do not see the
point in this - mm_iommu_cleanup() is declared in the very same header.


> Basically do
> 
> #ifdef CONFIG_SPAPR_TCE_IOMMU
> extern void mm_iommu_cleanup(void *ptr)
> #else
> static inline void mm_iommu_cleanup(void *ptr) {}
> #endif
> 
> Otherwise looks good
> 
> Acked-by: Balbir Singh <bsinghar...@gmail.com>

Thanks!


-- 
Alexey



[PATCH kernel] powerpc/mm/iommu: Put pages on process exit

2016-07-13 Thread Alexey Kardashevskiy
At the moment VFIO IOMMU SPAPR v2 driver pins all guest RAM pages when
the userspace starts using VFIO. When the userspace process finishes,
all the pinned pages need to be put; this is done as a part of
the userspace memory context (MM) destruction which happens on
the very last mmdrop().

This approach has a problem: the MM of the userspace process
may live longer than the userspace process itself, as kernel threads
usually execute on the MM of a userspace process which was running
on the CPU where the kernel thread was scheduled. If this happens,
the MM remains referenced until that exact kernel thread wakes up again
and releases the very last reference to the MM; on an idle system this
can take hours.

This fixes the issue by moving mm_iommu_cleanup() (the helper which puts
pages) from destroy_context() (called on the last mmdrop()) to
the arch-specific arch_exit_mmap() hook (called on the last mmput()).
mmdrop() decrements mm->mm_count, which is the total reference count;
mmput() decrements mm->mm_users, which is the number of userspace users
of the MM, and that is actually the counter we want to watch here.
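
In other words (an illustrative sketch of the two counters, not part of
the patch):

	struct mm_struct *mm = get_task_mm(current); /* takes an mm_users ref */
	/* ... */
	mmput(mm);	/* last mmput() runs exit_mmap() and thus arch_exit_mmap() */

	atomic_inc(&mm->mm_count);	/* takes an mm_count reference */
	/* ... */
	mmdrop(mm);	/* last mmdrop() runs destroy_context() and frees the mm_struct */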

Cc: David Gibson <da...@gibson.dropbear.id.au>
Cc: Benjamin Herrenschmidt <b...@kernel.crashing.org>
Cc: Paul Mackerras <pau...@samba.org>
Cc: Balbir Singh <bsinghar...@gmail.com>
Cc: Nick Piggin <npig...@kernel.dk>
Signed-off-by: Alexey Kardashevskiy <a...@ozlabs.ru>
---
 arch/powerpc/include/asm/mmu_context.h | 3 +++
 arch/powerpc/mm/mmu_context_book3s64.c | 4 
 2 files changed, 3 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/include/asm/mmu_context.h 
b/arch/powerpc/include/asm/mmu_context.h
index 9d2cd0c..24b590d 100644
--- a/arch/powerpc/include/asm/mmu_context.h
+++ b/arch/powerpc/include/asm/mmu_context.h
@@ -138,6 +138,9 @@ static inline void arch_dup_mmap(struct mm_struct *oldmm,
 
 static inline void arch_exit_mmap(struct mm_struct *mm)
 {
+#ifdef CONFIG_SPAPR_TCE_IOMMU
+   mm_iommu_cleanup(&mm->context);
+#endif
 }
 
 static inline void arch_unmap(struct mm_struct *mm,
diff --git a/arch/powerpc/mm/mmu_context_book3s64.c 
b/arch/powerpc/mm/mmu_context_book3s64.c
index 1962..aaeba74 100644
--- a/arch/powerpc/mm/mmu_context_book3s64.c
+++ b/arch/powerpc/mm/mmu_context_book3s64.c
@@ -159,10 +159,6 @@ static inline void destroy_pagetable_page(struct mm_struct 
*mm)
 
 void destroy_context(struct mm_struct *mm)
 {
-#ifdef CONFIG_SPAPR_TCE_IOMMU
-   mm_iommu_cleanup(&mm->context);
-#endif
-
 #ifdef CONFIG_PPC_ICSWX
drop_cop(mm->context.acop, mm);
kfree(mm->context.cop_lockp);
-- 
2.5.0.rc3





Re: [PATCH 5/5] vfio-pci: Allow to mmap MSI-X table if interrupt remapping is supported

2016-05-06 Thread Alexey Kardashevskiy

On 05/06/2016 01:05 AM, Alex Williamson wrote:

On Thu, 5 May 2016 12:15:46 +
"Tian, Kevin"  wrote:


From: Yongji Xie [mailto:xyj...@linux.vnet.ibm.com]
Sent: Thursday, May 05, 2016 7:43 PM

Hi David and Kevin,

On 2016/5/5 17:54, David Laight wrote:


From: Tian, Kevin

Sent: 05 May 2016 10:37

...

Actually, we are not aiming at accessing the MSI-X table from
the guest. So I think it's safe to pass through the MSI-X table if we
can make sure the guest kernel would not touch the MSI-X table in
a normal code path, such as a para-virtualized guest kernel on PPC64.


Then how do you prevent malicious guest kernel accessing it?

Or a malicious guest driver for an ethernet card setting up
the receive buffer ring to contain a single word entry that
contains the address associated with an MSI-X interrupt and
then using a loopback mode to cause a specific packet be
received that writes the required word through that address.

Remember the PCIe cycle for an interrupt is a normal memory write
cycle.

David



If we have enough permission to load a malicious driver or
kernel, we can easily break the guest without exposed
MSI-X table.

I think it should be safe to expose MSI-X table if we can
make sure that malicious guest driver/kernel can't use
the MSI-X table to break other guest or host. The
capability of IRQ remapping could provide this
kind of protection.



With IRQ remapping it doesn't mean you can pass through MSI-X
structure to guest. I know actual IRQ remapping might be platform
specific, but at least for Intel VT-d specification, MSI-X entry must
be configured with a remappable format by host kernel which
contains an index into IRQ remapping table. The index will find a
IRQ remapping entry which controls interrupt routing for a specific
device. If you allow a malicious program random index into MSI-X
entry of assigned device, the hole is obvious...

Above might make sense only for an IRQ remapping implementation
which doesn't rely on extended MSI-X format (e.g. simply based on
BDF). If that's the case for PPC, then you should build MSI-X
passthrough based on this fact instead of general IRQ remapping
enabled or not.


I don't think anyone is expecting that we can expose the MSI-X vector
table to the guest and the guest can make direct use of it.  The end
goal here is that the guest on a power system is already
paravirtualized to not program the device MSI-X by directly writing to
the MSI-X vector table.  They have hypercalls for this since they
always run virtualized.  Therefore a) they never intend to touch the
MSI-X vector table and b) they have sufficient isolation that a guest
can only hurt itself by doing so.

On x86 we don't have a), our method of programming the MSI-X vector
table is to directly write to it. Therefore we will always require QEMU
to place a MemoryRegion over the vector table to intercept those
accesses.  However with interrupt remapping, we do have b) on x86, which
means that we don't need to be so strict in disallowing user accesses
to the MSI-X vector table.  It's not useful for configuring MSI-X on
the device, but the user should only be able to hurt themselves by
writing it directly.  x86 doesn't really get anything out of this
change, but it helps this special case on power pretty significantly
aiui.  Thanks,
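
For reference, the gate this would enable on the VFIO side is roughly the
sketch below, assuming the PCI_BUS_FLAGS_MSI_REMAP flag introduced in
patch 4/5 (the helper name is illustrative, not the actual patch):

static bool msix_table_mmap_allowed(struct pci_dev *pdev)
{
	/* Safe to let userspace map the BAR containing the MSI-X table? */
	return !!(pdev->bus->bus_flags & PCI_BUS_FLAGS_MSI_REMAP);
}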


Excellent short overview, saved :)

How do we proceed with these patches? Nobody seems to object to them, but also
nobody seems to be taking them either...





--
Alexey



Re: [PATCH 4/5] pci-ioda: Set PCI_BUS_FLAGS_MSI_REMAP for IODA host bridge

2016-05-06 Thread Alexey Kardashevskiy

On 04/27/2016 10:43 PM, Yongji Xie wrote:

Any IODA host bridge has the capability of IRQ remapping,
so we set PCI_BUS_FLAGS_MSI_REMAP when this kind of host bridge
is detected.

Signed-off-by: Yongji Xie <xyj...@linux.vnet.ibm.com>



Reviewed-by: Alexey Kardashevskiy <a...@ozlabs.ru>



---
 arch/powerpc/platforms/powernv/pci-ioda.c |8 
 1 file changed, 8 insertions(+)

diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c 
b/arch/powerpc/platforms/powernv/pci-ioda.c
index f90dc04..9557638 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -3080,6 +3080,12 @@ static void pnv_pci_ioda_fixup(void)
pnv_npu_ioda_fixup();
 }

+int pnv_pci_ioda_root_bridge_prepare(struct pci_host_bridge *bridge)
+{
+   bridge->bus->bus_flags |= PCI_BUS_FLAGS_MSI_REMAP;
+   return 0;
+}
+
 /*
  * Returns the alignment for I/O or memory windows for P2P
  * bridges. That actually depends on how PEs are segmented.
@@ -3364,6 +3370,8 @@ static void __init pnv_pci_init_ioda_phb(struct 
device_node *np,
 */
ppc_md.pcibios_fixup = pnv_pci_ioda_fixup;

+   ppc_md.pcibios_root_bridge_prepare = pnv_pci_ioda_root_bridge_prepare;
+
if (phb->type == PNV_PHB_NPU)
hose->controller_ops = pnv_npu_ioda_controller_ops;
else




--
Alexey



Re: [PATCH kernel v4 11/11] powerpc/powernv/npu: Enable NVLink pass through

2016-05-04 Thread Alexey Kardashevskiy

On 05/04/2016 12:08 AM, Alistair Popple wrote:

Hi Alexey,

On Fri, 29 Apr 2016 18:55:24 Alexey Kardashevskiy wrote:

IBM POWER8 NVlink systems come with Tesla K40-ish GPUs each of which
also has a couple of fast speed links (NVLink). The interface to links
is exposed as an emulated PCI bridge which is included into the same
IOMMU group as the corresponding GPU.

In the kernel, NPUs get a separate PHB of the PNV_PHB_NPU type and a PE
which behaves pretty much like the standard IODA2 PHB, except that the NPU PHB
has just a single TVE in the hardware, which means it can have either
a 32bit window, a 64bit window, or DMA bypass, but never two of these at once.

In order to make these links work when GPU is passed to the guest,
these bridges need to be passed as well; otherwise performance will
degrade.

This implements and exports API to manage NPU state in regard to VFIO;
it replicates iommu_table_group_ops.

This defines a new pnv_pci_ioda2_npu_ops which is assigned to
the IODA2 bridge if there are NPUs for a GPU on the bridge.
The new callbacks call the default IODA2 callbacks plus new NPU API.
This adds a gpe_table_group_to_npe() helper to find NPU PE for the IODA2
table_group, it is not expected to fail as the helper is only called
from the pnv_pci_ioda2_npu_ops.
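
A simplified sketch of that wrapping (not the full hunk): each
pnv_pci_ioda2_npu_ops callback calls the default IODA2 callback and then
the NPU helper, e.g.

static long pnv_pci_ioda2_npu_set_window(struct iommu_table_group *table_group,
		int num, struct iommu_table *tbl)
{
	struct pnv_ioda_pe *npe = gpe_table_group_to_npe(table_group);
	long ret = pnv_pci_ioda2_set_window(table_group, num, tbl);

	if (ret)
		return ret;

	/* Mirror the window on the NPU so NVLink DMA keeps working */
	return pnv_npu_set_window(npe, num, tbl);
}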

This does not define an NPU-specific .release_ownership(), so after
VFIO is finished, DMA on the NPU is disabled; this is OK as the nvidia
driver sets the DMA mask when probing, which enables 32 or 64bit DMA on the NPU.

This adds a pnv_pci_npu_setup_iommu() helper which adds NPUs to
the GPU group if any found. The helper uses helpers to look for
the "ibm,gpu" property in the device tree which is a phandle of
the corresponding GPU.

This adds an additional loop over PEs in pnv_ioda_setup_dma() as the main
loop skips NPU PEs as they do not have 32bit DMA segments.

As pnv_npu_set_window() and pnv_npu_unset_window() start being used
by the new IODA2-NPU IOMMU group, this makes the helpers public and
adds a DMA window number parameter.

Signed-off-by: Alexey Kardashevskiy <a...@ozlabs.ru>
---
Changes:
v4:
* reused pnv_npu_set_window/pnv_npu_unset_window where possible
* added comments, changed commit log

v3:
* moved NPU-to-GPU IOMMU grouping later after all PHBs are discovered
* removed hack to make iommu_add_device() work, iommu_group_add_device()
is used instead
* cleanup in gpe_table_group_to_npe_cb()

v2:
* reimplemented to support NPU + GPU in the same group
* merged "powerpc/powernv/npu: Add NPU devices to IOMMU group" and
"powerpc/powernv/npu: Enable passing through via VFIO" into this patch
---
 arch/powerpc/platforms/powernv/npu-dma.c  |  64 +--
 arch/powerpc/platforms/powernv/pci-ioda.c | 102

++

 arch/powerpc/platforms/powernv/pci.h  |   6 ++
 3 files changed, 166 insertions(+), 6 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/npu-dma.c

b/arch/powerpc/platforms/powernv/npu-dma.c

index cb2d1da..0459e10 100644
--- a/arch/powerpc/platforms/powernv/npu-dma.c
+++ b/arch/powerpc/platforms/powernv/npu-dma.c
@@ -12,6 +12,7 @@
 #include 
 #include 
 #include 
+#include 

 #include 
 #include 
@@ -154,7 +155,7 @@ static struct pnv_ioda_pe *get_gpu_pci_dev_and_pe(struct

pnv_ioda_pe *npe,

return pe;
 }

-static long pnv_npu_set_window(struct pnv_ioda_pe *npe,
+long pnv_npu_set_window(struct pnv_ioda_pe *npe, int num,
struct iommu_table *tbl)
 {
struct pnv_phb *phb = npe->phb;
@@ -182,13 +183,13 @@ static long pnv_npu_set_window(struct pnv_ioda_pe

*npe,

pnv_pci_ioda2_tce_invalidate_entire(phb, false);

/* Add the table to the list so its TCE cache will get invalidated */
-   pnv_pci_link_table_and_group(phb->hose->node, 0,
+   pnv_pci_link_table_and_group(phb->hose->node, num,
tbl, &npe->table_group);

return 0;
 }

-static long pnv_npu_unset_window(struct pnv_ioda_pe *npe)
+long pnv_npu_unset_window(struct pnv_ioda_pe *npe, int num)
 {
struct pnv_phb *phb = npe->phb;
int64_t rc;
@@ -205,7 +206,7 @@ static long pnv_npu_unset_window(struct pnv_ioda_pe

*npe)

}
pnv_pci_ioda2_tce_invalidate_entire(phb, false);

-   pnv_pci_unlink_table_and_group(npe->table_group.tables[0],
+   pnv_pci_unlink_table_and_group(npe->table_group.tables[num],
&npe->table_group);

return 0;
@@ -231,7 +232,7 @@ static void pnv_npu_dma_set_32(struct pnv_ioda_pe *npe)
if (!gpe)
return;

-   rc = pnv_npu_set_window(npe, gpe->table_group.tables[0]);
+   rc = pnv_npu_set_window(npe, 0, gpe->table_group.tables[0]);

/*
 * We don't initialise npu_pe->tce32_table as we always use
@@ -255,7 +256,7 @@ static int pnv_npu_dma_set_bypass(struct pnv_ioda_pe

*npe)

if (phb->type != PNV_PHB_NPU || !npe->pdev)
return -EINVAL;

-   rc = pnv_npu_unset_windo

Re: [PATCH kernel v4 10/11] powerpc/powernv/npu: Rework TCE Kill handling

2016-05-04 Thread Alexey Kardashevskiy

On 05/03/2016 05:37 PM, Alistair Popple wrote:

On Fri, 29 Apr 2016 18:55:23 Alexey Kardashevskiy wrote:

The pnv_ioda_pe struct keeps an array of peers. At the moment it is only
used to link GPU and NPU for 2 purposes:

1. Access NPU quickly when configuring DMA for GPU - this was addressed
in the previous patch by removing use of it, as DMA setup is not something
the kernel does constantly.


Agreed. It was used here because the peer array was added to deal with (2)
below ...


2. Invalidate TCE cache for NPU when it is invalidated for GPU.
GPU and NPU are in different PE. There is already a mechanism to
attach multiple iommu_table_group to the same iommu_table (used for VFIO),
we can reuse it here so does this patch.


... because we weren't aware iommu_table_group could be used to do this
instead.


This gets rid of peers[] array and PNV_IODA_PE_PEER flag as they are
not needed anymore.


Happy to see it go. I'm not too familiar with iommu groups but based on the
code and what you have described to me both here and offline everything looks
good to me. One pretty minor style comment below.

Reviewed-By: Alistair Popple <alist...@popple.id.au>


While we are here, add TCE cache invalidation after enabling bypass.

Signed-off-by: Alexey Kardashevskiy <a...@ozlabs.ru>
---
Changes:
v4:
* reworked as "powerpc/powernv/npu: Add set/unset window helpers" has been
added
---
 arch/powerpc/platforms/powernv/npu-dma.c  | 69

+--

 arch/powerpc/platforms/powernv/pci-ioda.c | 57 -
 arch/powerpc/platforms/powernv/pci.h  |  6 ---
 3 files changed, 26 insertions(+), 106 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/npu-dma.c

b/arch/powerpc/platforms/powernv/npu-dma.c

index 800d70f..cb2d1da 100644
--- a/arch/powerpc/platforms/powernv/npu-dma.c
+++ b/arch/powerpc/platforms/powernv/npu-dma.c
@@ -136,22 +136,17 @@ static struct pnv_ioda_pe

*get_gpu_pci_dev_and_pe(struct pnv_ioda_pe *npe,

struct pnv_ioda_pe *pe;
struct pci_dn *pdn;

-   if (npe->flags & PNV_IODA_PE_PEER) {
-   pe = npe->peers[0];
-   pdev = pe->pdev;
-   } else {
-   pdev = pnv_pci_get_gpu_dev(npe->pdev);
-   if (!pdev)
-   return NULL;
+   pdev = pnv_pci_get_gpu_dev(npe->pdev);
+   if (!pdev)
+   return NULL;

-   pdn = pci_get_pdn(pdev);
-   if (WARN_ON(!pdn || pdn->pe_number == IODA_INVALID_PE))
-   return NULL;
+   pdn = pci_get_pdn(pdev);
+   if (WARN_ON(!pdn || pdn->pe_number == IODA_INVALID_PE))
+   return NULL;

-   hose = pci_bus_to_host(pdev->bus);
-   phb = hose->private_data;
-   pe = &phb->ioda.pe_array[pdn->pe_number];
-   }
+   hose = pci_bus_to_host(pdev->bus);
+   phb = hose->private_data;
+   pe = >ioda.pe_array[pdn->pe_number];

if (gpdev)
*gpdev = pdev;
@@ -186,6 +181,10 @@ static long pnv_npu_set_window(struct pnv_ioda_pe *npe,
}
pnv_pci_ioda2_tce_invalidate_entire(phb, false);

+   /* Add the table to the list so its TCE cache will get invalidated */
+   pnv_pci_link_table_and_group(phb->hose->node, 0,
+   tbl, >table_group);


Where tbl is associated with the GPU and is what links the NPU and GPU PEs.


return 0;
 }

@@ -206,45 +205,12 @@ static long pnv_npu_unset_window(struct pnv_ioda_pe *npe)
}
pnv_pci_ioda2_tce_invalidate_entire(phb, false);

+   pnv_pci_unlink_table_and_group(npe->table_group.tables[0],
+   >table_group);
+
return 0;
 }

-void pnv_npu_init_dma_pe(struct pnv_ioda_pe *npe)
-{
-   struct pnv_ioda_pe *gpe;
-   struct pci_dev *gpdev;
-   int i, avail = -1;
-
-   if (!npe->pdev || !(npe->flags & PNV_IODA_PE_DEV))
-   return;
-
-   gpe = get_gpu_pci_dev_and_pe(npe, );
-   if (!gpe)
-   return;
-
-   for (i = 0; i < PNV_IODA_MAX_PEER_PES; i++) {
-   /* Nothing to do if the PE is already connected. */
-   if (gpe->peers[i] == npe)
-   return;
-
-   if (!gpe->peers[i])
-   avail = i;
-   }
-
-   if (WARN_ON(avail < 0))
-   return;
-
-   gpe->peers[avail] = npe;
-   gpe->flags |= PNV_IODA_PE_PEER;
-
-   /*
-* We assume that the NPU devices only have a single peer PE
-* (the GPU PCIe device PE).
-*/
-   npe->peers[0] = gpe;
-   npe->flags |= PNV_IODA_PE_PEER;
-}
-
 /*
  * Enables 32 bit DMA on NPU.
  */
@@ -302,6 +268,9 @@ static int pnv_npu_dma_set_bypass(struct pnv_ioda_pe *npe)
npe->pe_number, npe->pe_number,
0 /* bypass base */, top);

+   if (r

[PATCH kernel v4 06/11] powerpc/powernv/npu: Use the correct IOMMU page size

2016-04-29 Thread Alexey Kardashevskiy
This uses the page size from iommu_table instead of the hard-coded 4K.
This should cause no change in behavior.
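
For context, IOMMU_PAGE_SIZE() is derived from the table's own page shift, so
the hard-coded 0x1000 was only ever correct for 4K tables. A one-line sketch of
the relationship (the definition below is assumed for illustration; the real
one lives in arch/powerpc/include/asm/iommu.h):

/* Assumed for illustration: the macro is just the table's page shift.
 * it_page_shift == 12 gives the old 0x1000 (4K); 16 gives 0x10000 (64K). */
#define IOMMU_PAGE_SIZE_SKETCH(tbl)     (1UL << (tbl)->it_page_shift)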

While we are here, move bits around to prepare for further rework
which will define and use iommu_table_group_ops.

Signed-off-by: Alexey Kardashevskiy <a...@ozlabs.ru>
Reviewed-by: David Gibson <da...@gibson.dropbear.id.au>
Reviewed-by: Alistair Popple <alist...@popple.id.au>
---
 arch/powerpc/platforms/powernv/npu-dma.c | 11 +--
 1 file changed, 5 insertions(+), 6 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/npu-dma.c 
b/arch/powerpc/platforms/powernv/npu-dma.c
index 778570c..5bd5fee 100644
--- a/arch/powerpc/platforms/powernv/npu-dma.c
+++ b/arch/powerpc/platforms/powernv/npu-dma.c
@@ -204,8 +204,7 @@ static void pnv_npu_disable_bypass(struct pnv_ioda_pe *npe)
struct pnv_phb *phb = npe->phb;
struct pci_dev *gpdev;
struct pnv_ioda_pe *gpe;
-   void *addr;
-   unsigned int size;
+   struct iommu_table *tbl;
int64_t rc;
 
/*
@@ -219,11 +218,11 @@ static void pnv_npu_disable_bypass(struct pnv_ioda_pe 
*npe)
if (!gpe)
return;
 
-   addr = (void *)gpe->table_group.tables[0]->it_base;
-   size = gpe->table_group.tables[0]->it_size << 3;
+   tbl = gpe->table_group.tables[0];
rc = opal_pci_map_pe_dma_window(phb->opal_id, npe->pe_number,
-   npe->pe_number, 1, __pa(addr),
-   size, 0x1000);
+   npe->pe_number, 1, __pa(tbl->it_base),
+   tbl->it_size << 3,
+   IOMMU_PAGE_SIZE(tbl));
if (rc != OPAL_SUCCESS)
pr_warn("%s: Error %lld setting DMA window on PHB#%d-PE#%d\n",
__func__, rc, phb->hose->global_number, npe->pe_number);
-- 
2.5.0.rc3



[PATCH kernel v4 09/11] powerpc/powernv/npu: Add set/unset window helpers

2016-04-29 Thread Alexey Kardashevskiy
The upcoming NVLink passthrough support will require NPU code to cope
with two DMA windows.

This adds a pnv_npu_set_window() helper which programs a 32bit window into
the hardware. This also adds multilevel TCE support.

This adds a pnv_npu_unset_window() helper which removes the DMA window
from the hardware. This does not make a difference now as the caller -
pnv_npu_dma_set_bypass() - enables bypass in the hardware but the next
patch will use it to manage TCE table lists for TCE Kill handling.
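
A note on the arithmetic in the helper added below: OPAL is passed the TCE
table size in bytes (each TCE entry is 8 bytes, hence the "size << 3"), while
the DMA window itself spans it_size IOMMU pages. A worked example with assumed
numbers, not taken from the patch:

/* Assumed example: a 2GB window backed by 64K IOMMU pages. */
unsigned long it_page_shift = 16;                       /* 64K pages    */
unsigned long it_size = (2UL << 30) >> it_page_shift;   /* 32768 TCEs   */
unsigned long table_bytes = it_size << 3;               /* 256KB table  */
unsigned long win_size = it_size << it_page_shift;      /* 2GB window   */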

Signed-off-by: Alexey Kardashevskiy <a...@ozlabs.ru>
---
 arch/powerpc/platforms/powernv/npu-dma.c | 65 +++-
 1 file changed, 55 insertions(+), 10 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/npu-dma.c 
b/arch/powerpc/platforms/powernv/npu-dma.c
index bec9267..800d70f 100644
--- a/arch/powerpc/platforms/powernv/npu-dma.c
+++ b/arch/powerpc/platforms/powernv/npu-dma.c
@@ -159,6 +159,56 @@ static struct pnv_ioda_pe *get_gpu_pci_dev_and_pe(struct 
pnv_ioda_pe *npe,
return pe;
 }
 
+static long pnv_npu_set_window(struct pnv_ioda_pe *npe,
+   struct iommu_table *tbl)
+{
+   struct pnv_phb *phb = npe->phb;
+   int64_t rc;
+   const unsigned long size = tbl->it_indirect_levels ?
+   tbl->it_level_size : tbl->it_size;
+   const __u64 start_addr = tbl->it_offset << tbl->it_page_shift;
+   const __u64 win_size = tbl->it_size << tbl->it_page_shift;
+
+   pe_info(npe, "Setting up window %llx..%llx pg=%lx\n",
+   start_addr, start_addr + win_size - 1,
+   IOMMU_PAGE_SIZE(tbl));
+
+   rc = opal_pci_map_pe_dma_window(phb->opal_id,
+   npe->pe_number,
+   npe->pe_number,
+   tbl->it_indirect_levels + 1,
+   __pa(tbl->it_base),
+   size << 3,
+   IOMMU_PAGE_SIZE(tbl));
+   if (rc) {
+   pe_err(npe, "Failed to configure TCE table, err %lld\n", rc);
+   return rc;
+   }
+   pnv_pci_ioda2_tce_invalidate_entire(phb, false);
+
+   return 0;
+}
+
+static long pnv_npu_unset_window(struct pnv_ioda_pe *npe)
+{
+   struct pnv_phb *phb = npe->phb;
+   int64_t rc;
+
+   pe_info(npe, "Removing DMA window\n");
+
+   rc = opal_pci_map_pe_dma_window(phb->opal_id, npe->pe_number,
+   npe->pe_number,
+   0/* levels */, 0/* table address */,
+   0/* table size */, 0/* page size */);
+   if (rc) {
+   pe_err(npe, "Unmapping failed, ret = %lld\n", rc);
+   return rc;
+   }
+   pnv_pci_ioda2_tce_invalidate_entire(phb, false);
+
+   return 0;
+}
+
 void pnv_npu_init_dma_pe(struct pnv_ioda_pe *npe)
 {
struct pnv_ioda_pe *gpe;
@@ -200,10 +250,8 @@ void pnv_npu_init_dma_pe(struct pnv_ioda_pe *npe)
  */
 static void pnv_npu_dma_set_32(struct pnv_ioda_pe *npe)
 {
-   struct pnv_phb *phb = npe->phb;
struct pci_dev *gpdev;
struct pnv_ioda_pe *gpe;
-   struct iommu_table *tbl;
int64_t rc;
 
/*
@@ -217,14 +265,7 @@ static void pnv_npu_dma_set_32(struct pnv_ioda_pe *npe)
if (!gpe)
return;
 
-   tbl = gpe->table_group.tables[0];
-   rc = opal_pci_map_pe_dma_window(phb->opal_id, npe->pe_number,
-   npe->pe_number, 1, __pa(tbl->it_base),
-   tbl->it_size << 3,
-   IOMMU_PAGE_SIZE(tbl));
-   if (rc != OPAL_SUCCESS)
-   pr_warn("%s: Error %lld setting DMA window on PHB#%d-PE#%d\n",
-   __func__, rc, phb->hose->global_number, npe->pe_number);
+   rc = pnv_npu_set_window(npe, gpe->table_group.tables[0]);
 
/*
 * We don't initialise npu_pe->tce32_table as we always use
@@ -248,6 +289,10 @@ static int pnv_npu_dma_set_bypass(struct pnv_ioda_pe *npe)
if (phb->type != PNV_PHB_NPU || !npe->pdev)
return -EINVAL;
 
+   rc = pnv_npu_unset_window(npe);
+   if (rc != OPAL_SUCCESS)
+   return rc;
+
/* Enable the bypass window */
 
top = roundup_pow_of_two(top);
-- 
2.5.0.rc3



[PATCH kernel v4 10/11] powerpc/powernv/npu: Rework TCE Kill handling

2016-04-29 Thread Alexey Kardashevskiy
The pnv_ioda_pe struct keeps an array of peers. At the moment it is only
used to link GPU and NPU for 2 purposes:

1. Access NPU quickly when configuring DMA for GPU - this was addressed
in the previous patch by removing use of it, as DMA setup is not what
the kernel would constantly do.

2. Invalidate TCE cache for NPU when it is invalidated for GPU.
GPU and NPU are in different PEs. There is already a mechanism to
attach multiple iommu_table_group structs to the same iommu_table (used for
VFIO), and this patch reuses it here.

This gets rid of peers[] array and PNV_IODA_PE_PEER flag as they are
not needed anymore.

While we are here, add TCE cache invalidation after enabling bypass.

Signed-off-by: Alexey Kardashevskiy <a...@ozlabs.ru>
---
Changes:
v4:
* reworked as "powerpc/powernv/npu: Add set/unset window helpers" has been
added
---
 arch/powerpc/platforms/powernv/npu-dma.c  | 69 +--
 arch/powerpc/platforms/powernv/pci-ioda.c | 57 -
 arch/powerpc/platforms/powernv/pci.h  |  6 ---
 3 files changed, 26 insertions(+), 106 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/npu-dma.c 
b/arch/powerpc/platforms/powernv/npu-dma.c
index 800d70f..cb2d1da 100644
--- a/arch/powerpc/platforms/powernv/npu-dma.c
+++ b/arch/powerpc/platforms/powernv/npu-dma.c
@@ -136,22 +136,17 @@ static struct pnv_ioda_pe *get_gpu_pci_dev_and_pe(struct 
pnv_ioda_pe *npe,
struct pnv_ioda_pe *pe;
struct pci_dn *pdn;
 
-   if (npe->flags & PNV_IODA_PE_PEER) {
-   pe = npe->peers[0];
-   pdev = pe->pdev;
-   } else {
-   pdev = pnv_pci_get_gpu_dev(npe->pdev);
-   if (!pdev)
-   return NULL;
+   pdev = pnv_pci_get_gpu_dev(npe->pdev);
+   if (!pdev)
+   return NULL;
 
-   pdn = pci_get_pdn(pdev);
-   if (WARN_ON(!pdn || pdn->pe_number == IODA_INVALID_PE))
-   return NULL;
+   pdn = pci_get_pdn(pdev);
+   if (WARN_ON(!pdn || pdn->pe_number == IODA_INVALID_PE))
+   return NULL;
 
-   hose = pci_bus_to_host(pdev->bus);
-   phb = hose->private_data;
-   pe = >ioda.pe_array[pdn->pe_number];
-   }
+   hose = pci_bus_to_host(pdev->bus);
+   phb = hose->private_data;
+   pe = >ioda.pe_array[pdn->pe_number];
 
if (gpdev)
*gpdev = pdev;
@@ -186,6 +181,10 @@ static long pnv_npu_set_window(struct pnv_ioda_pe *npe,
}
pnv_pci_ioda2_tce_invalidate_entire(phb, false);
 
+   /* Add the table to the list so its TCE cache will get invalidated */
+   pnv_pci_link_table_and_group(phb->hose->node, 0,
+   tbl, >table_group);
+
return 0;
 }
 
@@ -206,45 +205,12 @@ static long pnv_npu_unset_window(struct pnv_ioda_pe *npe)
}
pnv_pci_ioda2_tce_invalidate_entire(phb, false);
 
+   pnv_pci_unlink_table_and_group(npe->table_group.tables[0],
+   >table_group);
+
return 0;
 }
 
-void pnv_npu_init_dma_pe(struct pnv_ioda_pe *npe)
-{
-   struct pnv_ioda_pe *gpe;
-   struct pci_dev *gpdev;
-   int i, avail = -1;
-
-   if (!npe->pdev || !(npe->flags & PNV_IODA_PE_DEV))
-   return;
-
-   gpe = get_gpu_pci_dev_and_pe(npe, );
-   if (!gpe)
-   return;
-
-   for (i = 0; i < PNV_IODA_MAX_PEER_PES; i++) {
-   /* Nothing to do if the PE is already connected. */
-   if (gpe->peers[i] == npe)
-   return;
-
-   if (!gpe->peers[i])
-   avail = i;
-   }
-
-   if (WARN_ON(avail < 0))
-   return;
-
-   gpe->peers[avail] = npe;
-   gpe->flags |= PNV_IODA_PE_PEER;
-
-   /*
-* We assume that the NPU devices only have a single peer PE
-* (the GPU PCIe device PE).
-*/
-   npe->peers[0] = gpe;
-   npe->flags |= PNV_IODA_PE_PEER;
-}
-
 /*
  * Enables 32 bit DMA on NPU.
  */
@@ -302,6 +268,9 @@ static int pnv_npu_dma_set_bypass(struct pnv_ioda_pe *npe)
npe->pe_number, npe->pe_number,
0 /* bypass base */, top);
 
+   if (rc == OPAL_SUCCESS)
+   pnv_pci_ioda2_tce_invalidate_entire(phb, false);
+
return rc;
 }
 
diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c 
b/arch/powerpc/platforms/powernv/pci-ioda.c
index db7695f..922c74c 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -1816,23 +1816,12 @@ static inline void 
pnv_pci_ioda2_tce_invalidate_pe(struct pnv_ioda_pe *pe)
/* 01xb - invalidate TCEs that match the specified PE# */
unsigned long val = TCE_KILL_INVAL_PE | (pe->pe_number & 0xFF);
struct pnv_phb *phb = pe->

[PATCH kernel v4 00/11] powerpc/powernv/npu: Enable PCI pass through for NVLink

2016-04-29 Thread Alexey Kardashevskiy
IBM POWER8 NVlink systems contain usual Tesla K40-ish GPUs but also
contain a couple of really fast links between GPU and CPU. These links
are exposed to the userspace by the OPAL firmware as bridges.
In order to make these links work when GPU is passed to the guest,
these bridges need to be passed as well; otherwise performance will
degrade. More details are in 11/11.

This reworks the existing NPU support in the powernv platform and adds
VFIO support on top of that.

v4 has a new patch "powerpc/powernv/npu: Add set/unset window" and a bunch of
cleanups.

"vfio_pci: Test for extended capabilities if config space > 256 bytes" is
included here in case anyone decides to test the patchset, which would crash
without it.

This was tested on POWER8NVL platform; pvr=0x004c0100.


Please comment. Thanks.

Alex, I guess we will need your "acked-by" for
"vfio/spapr: Relax the IOMMU compatibility check" to proceed.


Alexey Kardashevskiy (11):
  vfio_pci: Test for extended capabilities if config space > 256 bytes
  vfio/spapr: Relax the IOMMU compatibility check
  powerpc/powernv: Rename pnv_pci_ioda2_tce_invalidate_entire
  powerpc/powernv: Define TCE Kill flags
  powerpc/powernv/npu: TCE Kill helpers cleanup
  powerpc/powernv/npu: Use the correct IOMMU page size
  powerpc/powernv/npu: Simplify DMA setup
  powerpc/powernv/ioda2: Export debug helper pe_level_printk()
  powerpc/powernv/npu: Add set/unset window helpers
  powerpc/powernv/npu: Rework TCE Kill handling
  powerpc/powernv/npu: Enable NVLink pass through

 arch/powerpc/platforms/powernv/npu-dma.c  | 287 --
 arch/powerpc/platforms/powernv/pci-ioda.c | 224 +++
 arch/powerpc/platforms/powernv/pci.h  |  31 ++--
 drivers/vfio/pci/vfio_pci_config.c|  17 +-
 drivers/vfio/vfio_iommu_spapr_tce.c   |   3 +-
 5 files changed, 327 insertions(+), 235 deletions(-)

-- 
2.5.0.rc3



[PATCH kernel v4 07/11] powerpc/powernv/npu: Simplify DMA setup

2016-04-29 Thread Alexey Kardashevskiy
NPU devices are emulated in firmware and mainly used for NPU NVLink
training; there is one NPU device per hardware link. Their DMA/TCE setup
must match the GPU which is connected via PCIe and NVLink so any changes
to the DMA/TCE setup on the GPU PCIe device need to be propagated to
the NVLink device as this is what device drivers expect and it doesn't
make much sense to do anything else.

This makes NPU DMA setup explicit.
pnv_npu_ioda_controller_ops::pnv_npu_dma_set_mask is moved to pci-ioda,
made static and now prints a warning, as dma_set_mask() should never be
called on an NPU device because it will not configure the GPU in any case;
so we make this explicit.

Instead of using PNV_IODA_PE_PEER and peers[] (which the next patch will
remove), we test every PCI device for corresponding NVLink devices. If
there are any, we propagate bypass mode to the NPU devices just found by
calling the setup helper directly (which takes @bypass) and avoid guessing
(i.e. calculating from the DMA mask) whether we need bypass or not on NPU
devices. Since DMA setup happens only on rare occasions,
this will not slow down booting or VFIO start/stop much.
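
A sketch of what that propagation can look like; pnv_pci_get_npu_dev() and the
open-ended loop are assumptions here, the PE lookup mirrors
get_gpu_pci_dev_and_pe() elsewhere in the series, and the real version is in
the (truncated) diff below:

/* Sketch: push the GPU's bypass decision to every NPU link of @gpdev. */
void pnv_npu_try_dma_set_bypass(struct pci_dev *gpdev, bool bypass)
{
        int i;

        for (i = 0; ; ++i) {
                struct pci_dev *npdev = pnv_pci_get_npu_dev(gpdev, i);
                struct pci_dn *pdn;
                struct pnv_phb *phb;
                struct pnv_ioda_pe *npe;

                if (!npdev)
                        break;

                pdn = pci_get_pdn(npdev);
                if (WARN_ON(!pdn || pdn->pe_number == IODA_INVALID_PE))
                        return;

                phb = pci_bus_to_host(npdev->bus)->private_data;
                npe = &phb->ioda.pe_array[pdn->pe_number];

                if (bypass)
                        pnv_npu_dma_set_bypass(npe);
                else
                        pnv_npu_dma_set_32(npe);
        }
}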

This renames pnv_npu_disable_bypass to pnv_npu_dma_set_32 to make it
clearer what the function really does, which is programming the 32bit
table address into the TVT ("disabling bypass" means writing zeroes to
the TVT).

This removes pnv_npu_dma_set_bypass() from pnv_npu_ioda_fixup() as
the DMA configuration on NPU does not matter until dma_set_mask() is
called on GPU and that will do the NPU DMA configuration.

This removes phb->dma_dev_setup initialization for NPU as
pnv_pci_ioda_dma_dev_setup is no-op for it anyway.

This stops using npe->tce_bypass_base as it never changes and values
other than zero are not supported.

Signed-off-by: Alexey Kardashevskiy <a...@ozlabs.ru>
Reviewed-by: David Gibson <da...@gibson.dropbear.id.au>
Reviewed-by: Alistair Popple <alist...@popple.id.au>
---
Changes:
v2:
* changed first paragraph of the commit log from Alistair comment
* removed npe->tce_bypass_base
---
 arch/powerpc/platforms/powernv/npu-dma.c  | 89 ++-
 arch/powerpc/platforms/powernv/pci-ioda.c | 30 +--
 arch/powerpc/platforms/powernv/pci.h  |  3 +-
 3 files changed, 53 insertions(+), 69 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/npu-dma.c 
b/arch/powerpc/platforms/powernv/npu-dma.c
index 5bd5fee..bec9267 100644
--- a/arch/powerpc/platforms/powernv/npu-dma.c
+++ b/arch/powerpc/platforms/powernv/npu-dma.c
@@ -196,10 +196,9 @@ void pnv_npu_init_dma_pe(struct pnv_ioda_pe *npe)
 }
 
 /*
- * For the NPU we want to point the TCE table at the same table as the
- * real PCI device.
+ * Enables 32 bit DMA on NPU.
  */
-static void pnv_npu_disable_bypass(struct pnv_ioda_pe *npe)
+static void pnv_npu_dma_set_32(struct pnv_ioda_pe *npe)
 {
struct pnv_phb *phb = npe->phb;
struct pci_dev *gpdev;
@@ -235,72 +234,62 @@ static void pnv_npu_disable_bypass(struct pnv_ioda_pe 
*npe)
 }
 
 /*
- * Enable/disable bypass mode on the NPU. The NPU only supports one
+ * Enables bypass mode on the NPU. The NPU only supports one
  * window per link, so bypass needs to be explicitly enabled or
  * disabled. Unlike for a PHB3 bypass and non-bypass modes can't be
  * active at the same time.
  */
-int pnv_npu_dma_set_bypass(struct pnv_ioda_pe *npe, bool enable)
+static int pnv_npu_dma_set_bypass(struct pnv_ioda_pe *npe)
 {
struct pnv_phb *phb = npe->phb;
int64_t rc = 0;
+   phys_addr_t top = memblock_end_of_DRAM();
 
if (phb->type != PNV_PHB_NPU || !npe->pdev)
return -EINVAL;
 
-   if (enable) {
-   /* Enable the bypass window */
-   phys_addr_t top = memblock_end_of_DRAM();
+   /* Enable the bypass window */
 
-   npe->tce_bypass_base = 0;
-   top = roundup_pow_of_two(top);
-   dev_info(>pdev->dev, "Enabling bypass for PE %d\n",
-npe->pe_number);
-   rc = opal_pci_map_pe_dma_window_real(phb->opal_id,
-   npe->pe_number, npe->pe_number,
-   npe->tce_bypass_base, top);
-   } else {
-   /*
-* Disable the bypass window by replacing it with the
-* TCE32 window.
-*/
-   pnv_npu_disable_bypass(npe);
-   }
+   top = roundup_pow_of_two(top);
+   dev_info(>pdev->dev, "Enabling bypass for PE %d\n",
+   npe->pe_number);
+   rc = opal_pci_map_pe_dma_window_real(phb->opal_id,
+   npe->pe_number, npe->pe_number,
+   0 /* bypass base */, top);
 
return rc;
 }
 
-int pnv_npu_dma_set_mask(struct pci_dev *npdev, u64 dma_mask)
+void pnv_npu_try_dma_set_bypass(struct pci_dev *gpdev, bool b

[PATCH kernel v4 08/11] powerpc/powernv/ioda2: Export debug helper pe_level_printk()

2016-04-29 Thread Alexey Kardashevskiy
This exports the debugging helper pe_level_printk() and the corresponding
macros so they can be used in npu-dma.c.

Signed-off-by: Alexey Kardashevskiy <a...@ozlabs.ru>
---
 arch/powerpc/platforms/powernv/pci-ioda.c | 9 +
 arch/powerpc/platforms/powernv/pci.h  | 9 +
 2 files changed, 10 insertions(+), 8 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c 
b/arch/powerpc/platforms/powernv/pci-ioda.c
index 272521e..db7695f 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -56,7 +56,7 @@
 
 static void pnv_pci_ioda2_table_free_pages(struct iommu_table *tbl);
 
-static void pe_level_printk(const struct pnv_ioda_pe *pe, const char *level,
+void pe_level_printk(const struct pnv_ioda_pe *pe, const char *level,
const char *fmt, ...)
 {
struct va_format vaf;
@@ -87,13 +87,6 @@ static void pe_level_printk(const struct pnv_ioda_pe *pe, 
const char *level,
va_end(args);
 }
 
-#define pe_err(pe, fmt, ...)   \
-   pe_level_printk(pe, KERN_ERR, fmt, ##__VA_ARGS__)
-#define pe_warn(pe, fmt, ...)  \
-   pe_level_printk(pe, KERN_WARNING, fmt, ##__VA_ARGS__)
-#define pe_info(pe, fmt, ...)  \
-   pe_level_printk(pe, KERN_INFO, fmt, ##__VA_ARGS__)
-
 static bool pnv_iommu_bypass_disabled __read_mostly;
 
 static int __init iommu_setup(char *str)
diff --git a/arch/powerpc/platforms/powernv/pci.h 
b/arch/powerpc/platforms/powernv/pci.h
index d574a9d..485e5b1 100644
--- a/arch/powerpc/platforms/powernv/pci.h
+++ b/arch/powerpc/platforms/powernv/pci.h
@@ -236,6 +236,15 @@ extern void pnv_pci_dma_bus_setup(struct pci_bus *bus);
 extern int pnv_setup_msi_irqs(struct pci_dev *pdev, int nvec, int type);
 extern void pnv_teardown_msi_irqs(struct pci_dev *pdev);
 
+extern void pe_level_printk(const struct pnv_ioda_pe *pe, const char *level,
+   const char *fmt, ...);
+#define pe_err(pe, fmt, ...)   \
+   pe_level_printk(pe, KERN_ERR, fmt, ##__VA_ARGS__)
+#define pe_warn(pe, fmt, ...)  \
+   pe_level_printk(pe, KERN_WARNING, fmt, ##__VA_ARGS__)
+#define pe_info(pe, fmt, ...)  \
+   pe_level_printk(pe, KERN_INFO, fmt, ##__VA_ARGS__)
+
 /* Nvlink functions */
 extern void pnv_npu_init_dma_pe(struct pnv_ioda_pe *npe);
 extern void pnv_npu_setup_dma_pe(struct pnv_ioda_pe *npe);
-- 
2.5.0.rc3



[PATCH kernel v4 11/11] powerpc/powernv/npu: Enable NVLink pass through

2016-04-29 Thread Alexey Kardashevskiy
IBM POWER8 NVlink systems come with Tesla K40-ish GPUs each of which
also has a couple of high-speed links (NVLink). The interface to these links
is exposed as an emulated PCI bridge which is included in the same
IOMMU group as the corresponding GPU.

In the kernel, NPUs get a separate PHB of the PNV_PHB_NPU type and a PE
which behave pretty much like a standard IODA2 PHB, except that an NPU PHB
has just a single TVE in the hardware, which means it can have either
a 32bit window, a 64bit window or DMA bypass, but never two of these at once.

In order to make these links work when GPU is passed to the guest,
these bridges need to be passed as well; otherwise performance will
degrade.

This implements and exports API to manage NPU state in regard to VFIO;
it replicates iommu_table_group_ops.

This defines a new pnv_pci_ioda2_npu_ops which is assigned to
the IODA2 bridge if there are NPUs for a GPU on the bridge.
The new callbacks call the default IODA2 callbacks plus new NPU API.
This adds a gpe_table_group_to_npe() helper to find the NPU PE for the IODA2
table_group; it is not expected to fail as the helper is only called
from the pnv_pci_ioda2_npu_ops callbacks.
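
To illustrate the shape of these callbacks, here is a hedged sketch of the
set-window path; the default IODA2 callback names pnv_pci_ioda2_set_window()
and pnv_pci_ioda2_unset_window() are assumptions here, not something this
patch defines:

/* Sketch: program the window on the GPU PE, then mirror it on the NPU PE. */
static long pnv_pci_ioda2_npu_set_window(struct iommu_table_group *table_group,
                int num, struct iommu_table *tbl)
{
        struct pnv_ioda_pe *npe = gpe_table_group_to_npe(table_group);
        long ret = pnv_pci_ioda2_set_window(table_group, num, tbl);

        if (ret)
                return ret;

        /* Keep the NPU's single TVE in sync with the GPU window */
        ret = pnv_npu_set_window(npe, num, tbl);
        if (ret)
                pnv_pci_ioda2_unset_window(table_group, num);

        return ret;
}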

This does not define an NPU-specific .release_ownership() so after
VFIO is finished, DMA on the NPU is disabled, which is ok as the nvidia
driver sets the DMA mask when probing, which enables 32bit or 64bit DMA on
the NPU.

This adds a pnv_pci_npu_setup_iommu() helper which adds NPUs to
the GPU group if any are found. It uses existing helpers to look for
the "ibm,gpu" property in the device tree which is a phandle of
the corresponding GPU.

This adds an additional loop over PEs in pnv_ioda_setup_dma() as the main
loop skips NPU PEs as they do not have 32bit DMA segments.

As pnv_npu_set_window() and pnv_npu_unset_window() are now being used
by the new IODA2-NPU IOMMU group, this makes the helpers public and
adds the DMA window number parameter.

Signed-off-by: Alexey Kardashevskiy <a...@ozlabs.ru>
---
Changes:
v4:
* reused pnv_npu_set_window/pnv_npu_unset_window where possible
* added comments, changed commit log

v3:
* moved NPU-to-GPU IOMMU grouping later after all PHBs are discovered
* removed hack to make iommu_add_device() work, iommu_group_add_device()
is used instead
* cleanup in gpe_table_group_to_npe_cb()

v2:
* reimplemented to support NPU + GPU in the same group
* merged "powerpc/powernv/npu: Add NPU devices to IOMMU group" and
"powerpc/powernv/npu: Enable passing through via VFIO" into this patch
---
 arch/powerpc/platforms/powernv/npu-dma.c  |  64 +--
 arch/powerpc/platforms/powernv/pci-ioda.c | 102 ++
 arch/powerpc/platforms/powernv/pci.h  |   6 ++
 3 files changed, 166 insertions(+), 6 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/npu-dma.c 
b/arch/powerpc/platforms/powernv/npu-dma.c
index cb2d1da..0459e10 100644
--- a/arch/powerpc/platforms/powernv/npu-dma.c
+++ b/arch/powerpc/platforms/powernv/npu-dma.c
@@ -12,6 +12,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -154,7 +155,7 @@ static struct pnv_ioda_pe *get_gpu_pci_dev_and_pe(struct 
pnv_ioda_pe *npe,
return pe;
 }
 
-static long pnv_npu_set_window(struct pnv_ioda_pe *npe,
+long pnv_npu_set_window(struct pnv_ioda_pe *npe, int num,
struct iommu_table *tbl)
 {
struct pnv_phb *phb = npe->phb;
@@ -182,13 +183,13 @@ static long pnv_npu_set_window(struct pnv_ioda_pe *npe,
pnv_pci_ioda2_tce_invalidate_entire(phb, false);
 
/* Add the table to the list so its TCE cache will get invalidated */
-   pnv_pci_link_table_and_group(phb->hose->node, 0,
+   pnv_pci_link_table_and_group(phb->hose->node, num,
tbl, >table_group);
 
return 0;
 }
 
-static long pnv_npu_unset_window(struct pnv_ioda_pe *npe)
+long pnv_npu_unset_window(struct pnv_ioda_pe *npe, int num)
 {
struct pnv_phb *phb = npe->phb;
int64_t rc;
@@ -205,7 +206,7 @@ static long pnv_npu_unset_window(struct pnv_ioda_pe *npe)
}
pnv_pci_ioda2_tce_invalidate_entire(phb, false);
 
-   pnv_pci_unlink_table_and_group(npe->table_group.tables[0],
+   pnv_pci_unlink_table_and_group(npe->table_group.tables[num],
>table_group);
 
return 0;
@@ -231,7 +232,7 @@ static void pnv_npu_dma_set_32(struct pnv_ioda_pe *npe)
if (!gpe)
return;
 
-   rc = pnv_npu_set_window(npe, gpe->table_group.tables[0]);
+   rc = pnv_npu_set_window(npe, 0, gpe->table_group.tables[0]);
 
/*
 * We don't initialise npu_pe->tce32_table as we always use
@@ -255,7 +256,7 @@ static int pnv_npu_dma_set_bypass(struct pnv_ioda_pe *npe)
if (phb->type != PNV_PHB_NPU || !npe->pdev)
return -EINVAL;
 
-   rc = pnv_npu_unset_window(npe);
+   rc = pnv_npu_unset_window(npe, 0);
if (rc != OPAL_SUCCESS)
return rc;
 
@@ -307

[PATCH kernel v4 04/11] powerpc/powernv: Define TCE Kill flags

2016-04-29 Thread Alexey Kardashevskiy
This replaces the magic constants for the IODA2 TCE Kill register with macros.
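
For reference, the new macros encode the same values as the old constants,
assuming the usual 64-bit big-endian bit numbering where
PPC_BIT(x) == 1UL << (63 - x):

/* Illustration only; BUILD_BUG_ON would need to sit inside a function.
 *   PPC_BIT(1) == 1UL << 62 == 0x4ull << 60  ->  TCE_KILL_INVAL_PE
 *   PPC_BIT(2) == 1UL << 61 == 0x2ull << 60  ->  TCE_KILL_INVAL_TCE */
BUILD_BUG_ON(PPC_BIT(1) != (0x4ull << 60));
BUILD_BUG_ON(PPC_BIT(2) != (0x2ull << 60));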

Signed-off-by: Alexey Kardashevskiy <a...@ozlabs.ru>
Reviewed-by: David Gibson <da...@gibson.dropbear.id.au>
---
 arch/powerpc/platforms/powernv/pci-ioda.c | 7 +--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c 
b/arch/powerpc/platforms/powernv/pci-ioda.c
index ca307b6..03be25d 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -1811,10 +1811,13 @@ static struct iommu_table_ops pnv_ioda1_iommu_ops = {
.get = pnv_tce_get,
 };
 
+#define TCE_KILL_INVAL_PE   PPC_BIT(1)
+#define TCE_KILL_INVAL_TCE  PPC_BIT(2)
+
 static inline void pnv_pci_ioda2_tce_invalidate_pe(struct pnv_ioda_pe *pe)
 {
/* 01xb - invalidate TCEs that match the specified PE# */
-   unsigned long val = (0x4ull << 60) | (pe->pe_number & 0xFF);
+   unsigned long val = TCE_KILL_INVAL_PE | (pe->pe_number & 0xFF);
struct pnv_phb *phb = pe->phb;
struct pnv_ioda_pe *npe;
int i;
@@ -1842,7 +1845,7 @@ static void pnv_pci_ioda2_do_tce_invalidate(unsigned 
pe_number, bool rm,
unsigned long start, end, inc;
 
/* We'll invalidate DMA address in PE scope */
-   start = 0x2ull << 60;
+   start = TCE_KILL_INVAL_TCE;
start |= (pe_number & 0xFF);
end = start;
 
-- 
2.5.0.rc3



[PATCH kernel v4 01/11] vfio_pci: Test for extended capabilities if config space > 256 bytes

2016-04-29 Thread Alexey Kardashevskiy
The PCI Express spec says that reading 4 bytes at offset 100h should return
zero if there is no extended capability, so VFIO reads this dword to
know if there are extended capabilities.

However it is not always possible to access the extended space so
generic PCI code in pci_cfg_space_size_ext() checks if
pci_read_config_dword() can read beyond 100h and if the check fails,
it sets the config space size to 100h.

VFIO does its own extended capabilities check by reading at offset 100h,
which may produce 0xffffffff. VFIO treats this as the extended config
space being present and calls vfio_ecap_init(), which fails to parse the
capabilities (which is expected) but, right before the exit, writes
zero at offset 100h. That offset is beyond the buffer allocated for
vdev->vconfig (which is 256 bytes), which leads to random memory
corruption.

This makes VFIO only check for the extended capabilities if
the discovered config size is more than 256 bytes.

Signed-off-by: Alexey Kardashevskiy <a...@ozlabs.ru>
---
Changes:
v2:
* instead of checking for 0xffffffff, this only does the check if
device's config size is big enough
---
 drivers/vfio/pci/vfio_pci_config.c | 17 +++--
 1 file changed, 11 insertions(+), 6 deletions(-)

diff --git a/drivers/vfio/pci/vfio_pci_config.c 
b/drivers/vfio/pci/vfio_pci_config.c
index 142c533..d0c4358 100644
--- a/drivers/vfio/pci/vfio_pci_config.c
+++ b/drivers/vfio/pci/vfio_pci_config.c
@@ -1124,9 +1124,12 @@ static int vfio_cap_len(struct vfio_pci_device *vdev, u8 
cap, u8 pos)
return pcibios_err_to_errno(ret);
 
if (PCI_X_CMD_VERSION(word)) {
-   /* Test for extended capabilities */
-   pci_read_config_dword(pdev, PCI_CFG_SPACE_SIZE, &dword);
-   vdev->extended_caps = (dword != 0);
+   if (pdev->cfg_size > PCI_CFG_SPACE_SIZE) {
+   /* Test for extended capabilities */
+   pci_read_config_dword(pdev, PCI_CFG_SPACE_SIZE,
+   &dword);
+   vdev->extended_caps = (dword != 0);
+   }
return PCI_CAP_PCIX_SIZEOF_V2;
} else
return PCI_CAP_PCIX_SIZEOF_V0;
@@ -1138,9 +1141,11 @@ static int vfio_cap_len(struct vfio_pci_device *vdev, u8 
cap, u8 pos)
 
return byte;
case PCI_CAP_ID_EXP:
-   /* Test for extended capabilities */
-   pci_read_config_dword(pdev, PCI_CFG_SPACE_SIZE, &dword);
-   vdev->extended_caps = (dword != 0);
+   if (pdev->cfg_size > PCI_CFG_SPACE_SIZE) {
+   /* Test for extended capabilities */
+   pci_read_config_dword(pdev, PCI_CFG_SPACE_SIZE, &dword);
+   vdev->extended_caps = dword != 0;
+   }
 
/* length based on version */
if ((pcie_caps_reg(pdev) & PCI_EXP_FLAGS_VERS) == 1)
-- 
2.5.0.rc3
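
As an aside, the same rule can be exercised from user space: the sysfs
"config" attribute of a PCI device is 256 bytes long when the device has
no extended config space and 4096 bytes when it does. A minimal sketch
(the device address below is only an example):

#include <stdio.h>
#include <sys/stat.h>

#define PCI_CFG_SPACE_SIZE	256

static int has_extended_cfg(const char *cfg_path)
{
	struct stat st;

	if (stat(cfg_path, &st))
		return 0;	/* treat errors as "no extended space" */

	return st.st_size > PCI_CFG_SPACE_SIZE;
}

int main(void)
{
	/* Hypothetical device address, used only for illustration */
	const char *path = "/sys/bus/pci/devices/0000:00:01.0/config";

	printf("extended config space: %s\n",
	       has_extended_cfg(path) ? "present" : "absent");
	return 0;
}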



[PATCH kernel v4 02/11] vfio/spapr: Relax the IOMMU compatibility check

2016-04-29 Thread Alexey Kardashevskiy
We are going to have multiple different types of PHB on the same system
with POWER8 + NVLink and PHBs will have different IOMMU ops. However
we only really care about one callback - create_table - so we can
relax the compatibility check here.

Signed-off-by: Alexey Kardashevskiy <a...@ozlabs.ru>
Reviewed-by: David Gibson <da...@gibson.dropbear.id.au>
---
 drivers/vfio/vfio_iommu_spapr_tce.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c 
b/drivers/vfio/vfio_iommu_spapr_tce.c
index 0582b72..3054e3f 100644
--- a/drivers/vfio/vfio_iommu_spapr_tce.c
+++ b/drivers/vfio/vfio_iommu_spapr_tce.c
@@ -1188,7 +1188,8 @@ static int tce_iommu_attach_group(void *iommu_data,
goto unlock_exit;
}
table_group_tmp = iommu_group_get_iommudata(tcegrp->grp);
-   if (table_group_tmp->ops != table_group->ops) {
+   if (table_group_tmp->ops->create_table !=
+   table_group->ops->create_table) {
pr_warn("tce_vfio: Group %d is incompatible with group 
%d\n",
iommu_group_id(iommu_group),
iommu_group_id(tcegrp->grp));
-- 
2.5.0.rc3



[PATCH kernel v4 05/11] powerpc/powernv/npu: TCE Kill helpers cleanup

2016-04-29 Thread Alexey Kardashevskiy
The NPU PHB TCE Kill register is exactly the same as in the rest of POWER8,
so let's reuse the existing code for NPU. The only bit missing is
a helper to reset the entire TCE cache, so this moves such a helper
out of the NPU code and renames it.

Since pnv_npu_tce_invalidate() really does invalidate the entire cache,
this uses pnv_pci_ioda2_tce_invalidate_entire() directly for NPU.
This adds an explicit comment about the workaround for invalidating the
NPU TCE cache.

Signed-off-by: Alexey Kardashevskiy <a...@ozlabs.ru>
Reviewed-by: David Gibson <da...@gibson.dropbear.id.au>
Reviewed-by: Alistair Popple <alist...@popple.id.au>
---
 arch/powerpc/platforms/powernv/npu-dma.c  | 41 ---
 arch/powerpc/platforms/powernv/pci-ioda.c | 29 ++
 arch/powerpc/platforms/powernv/pci.h  |  7 +-
 3 files changed, 25 insertions(+), 52 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/npu-dma.c 
b/arch/powerpc/platforms/powernv/npu-dma.c
index 7229acd..778570c 100644
--- a/arch/powerpc/platforms/powernv/npu-dma.c
+++ b/arch/powerpc/platforms/powernv/npu-dma.c
@@ -25,8 +25,6 @@
  * Other types of TCE cache invalidation are not functional in the
  * hardware.
  */
-#define TCE_KILL_INVAL_ALL PPC_BIT(0)
-
 static struct pci_dev *get_pci_dev(struct device_node *dn)
 {
return PCI_DN(dn)->pcidev;
@@ -161,45 +159,6 @@ static struct pnv_ioda_pe *get_gpu_pci_dev_and_pe(struct 
pnv_ioda_pe *npe,
return pe;
 }
 
-void pnv_npu_tce_invalidate_entire(struct pnv_ioda_pe *npe)
-{
-   struct pnv_phb *phb = npe->phb;
-
-   if (WARN_ON(phb->type != PNV_PHB_NPU ||
-   !phb->ioda.tce_inval_reg ||
-   !(npe->flags & PNV_IODA_PE_DEV)))
-   return;
-
-   mb(); /* Ensure previous TCE table stores are visible */
-   __raw_writeq(cpu_to_be64(TCE_KILL_INVAL_ALL),
-   phb->ioda.tce_inval_reg);
-}
-
-void pnv_npu_tce_invalidate(struct pnv_ioda_pe *npe,
-   struct iommu_table *tbl,
-   unsigned long index,
-   unsigned long npages,
-   bool rm)
-{
-   struct pnv_phb *phb = npe->phb;
-
-   /* We can only invalidate the whole cache on NPU */
-   unsigned long val = TCE_KILL_INVAL_ALL;
-
-   if (WARN_ON(phb->type != PNV_PHB_NPU ||
-   !phb->ioda.tce_inval_reg ||
-   !(npe->flags & PNV_IODA_PE_DEV)))
-   return;
-
-   mb(); /* Ensure previous TCE table stores are visible */
-   if (rm)
-   __raw_rm_writeq(cpu_to_be64(val),
- (__be64 __iomem *) phb->ioda.tce_inval_reg_phys);
-   else
-   __raw_writeq(cpu_to_be64(val),
-   phb->ioda.tce_inval_reg);
-}
-
 void pnv_npu_init_dma_pe(struct pnv_ioda_pe *npe)
 {
struct pnv_ioda_pe *gpe;
diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c 
b/arch/powerpc/platforms/powernv/pci-ioda.c
index 03be25d..a67d51e 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -1811,9 +1811,23 @@ static struct iommu_table_ops pnv_ioda1_iommu_ops = {
.get = pnv_tce_get,
 };
 
+#define TCE_KILL_INVAL_ALL  PPC_BIT(0)
 #define TCE_KILL_INVAL_PE   PPC_BIT(1)
 #define TCE_KILL_INVAL_TCE  PPC_BIT(2)
 
+void pnv_pci_ioda2_tce_invalidate_entire(struct pnv_phb *phb, bool rm)
+{
+   const unsigned long val = TCE_KILL_INVAL_ALL;
+
+   mb(); /* Ensure previous TCE table stores are visible */
+   if (rm)
+   __raw_rm_writeq(cpu_to_be64(val),
+   (__be64 __iomem *)
+   phb->ioda.tce_inval_reg_phys);
+   else
+   __raw_writeq(cpu_to_be64(val), phb->ioda.tce_inval_reg);
+}
+
 static inline void pnv_pci_ioda2_tce_invalidate_pe(struct pnv_ioda_pe *pe)
 {
/* 01xb - invalidate TCEs that match the specified PE# */
@@ -1834,7 +1848,7 @@ static inline void pnv_pci_ioda2_tce_invalidate_pe(struct 
pnv_ioda_pe *pe)
if (!npe || npe->phb->type != PNV_PHB_NPU)
continue;
 
-   pnv_npu_tce_invalidate_entire(npe);
+   pnv_pci_ioda2_tce_invalidate_entire(npe->phb, false);
}
 }
 
@@ -1883,14 +1897,19 @@ static void pnv_pci_ioda2_tce_invalidate(struct 
iommu_table *tbl,
index, npages);
 
if (pe->flags & PNV_IODA_PE_PEER)
-   /* Invalidate PEs using the same TCE table */
+   /*
+* The NVLink hardware does not support TCE kill
+* per TCE entry so we have to invalidate
+* the entire cache for it.
+*/
for (i = 0;

[PATCH kernel v4 03/11] powerpc/powernv: Rename pnv_pci_ioda2_tce_invalidate_entire

2016-04-29 Thread Alexey Kardashevskiy
Since pnv_pci_ioda2_tce_invalidate_entire() in fact invalidates TCEs for
a specific PE rather than the entire cache, rename it to
pnv_pci_ioda2_tce_invalidate_pe(). In later patches we will add
a proper pnv_pci_ioda2_tce_invalidate_entire().

Signed-off-by: Alexey Kardashevskiy <a...@ozlabs.ru>
Reviewed-by: David Gibson <da...@gibson.dropbear.id.au>
---
 arch/powerpc/platforms/powernv/pci-ioda.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c 
b/arch/powerpc/platforms/powernv/pci-ioda.c
index c5baaf3..ca307b6 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -1811,7 +1811,7 @@ static struct iommu_table_ops pnv_ioda1_iommu_ops = {
.get = pnv_tce_get,
 };
 
-static inline void pnv_pci_ioda2_tce_invalidate_entire(struct pnv_ioda_pe *pe)
+static inline void pnv_pci_ioda2_tce_invalidate_pe(struct pnv_ioda_pe *pe)
 {
/* 01xb - invalidate TCEs that match the specified PE# */
unsigned long val = (0x4ull << 60) | (pe->pe_number & 0xFF);
@@ -2075,7 +2075,7 @@ static long pnv_pci_ioda2_set_window(struct 
iommu_table_group *table_group,
 
pnv_pci_link_table_and_group(phb->hose->node, num,
tbl, >table_group);
-   pnv_pci_ioda2_tce_invalidate_entire(pe);
+   pnv_pci_ioda2_tce_invalidate_pe(pe);
 
return 0;
 }
@@ -2219,7 +2219,7 @@ static long pnv_pci_ioda2_unset_window(struct 
iommu_table_group *table_group,
if (ret)
pe_warn(pe, "Unmapping failed, ret = %ld\n", ret);
else
-   pnv_pci_ioda2_tce_invalidate_entire(pe);
+   pnv_pci_ioda2_tce_invalidate_pe(pe);
 
pnv_pci_unlink_table_and_group(table_group->tables[num], table_group);
 
-- 
2.5.0.rc3



[PATCH kernel v2] vfio_pci: Test for extended capabilities if config space > 256 bytes

2016-04-28 Thread Alexey Kardashevskiy
The PCI Express spec says that reading 4 bytes at offset 100h should return
zero if there is no extended capability, so VFIO reads this dword to
know whether there are extended capabilities.

However it is not always possible to access the extended space, so
generic PCI code in pci_cfg_space_size_ext() checks whether
pci_read_config_dword() can read beyond 100h and, if the check fails,
it sets the config space size to 100h.

VFIO does its own extended capabilities check by reading at offset 100h,
which may produce 0xffffffff. VFIO treats this as proof of extended config
space presence and calls vfio_ecap_init(), which fails to parse
capabilities (as expected) but, right before exiting, writes zero at
offset 100h. That offset is beyond the 256-byte buffer allocated for
vdev->vconfig, which leads to random memory corruption.

This makes VFIO check for extended capabilities only if the discovered
config space size is more than 256 bytes.

Signed-off-by: Alexey Kardashevskiy <a...@ozlabs.ru>
---
Changes:
v2:
* instead of checking for 0xffffffff, this only does the check if
the device's config size is big enough
---
 drivers/vfio/pci/vfio_pci_config.c | 17 +++--
 1 file changed, 11 insertions(+), 6 deletions(-)

diff --git a/drivers/vfio/pci/vfio_pci_config.c 
b/drivers/vfio/pci/vfio_pci_config.c
index 142c533..d0c4358 100644
--- a/drivers/vfio/pci/vfio_pci_config.c
+++ b/drivers/vfio/pci/vfio_pci_config.c
@@ -1124,9 +1124,12 @@ static int vfio_cap_len(struct vfio_pci_device *vdev, u8 
cap, u8 pos)
return pcibios_err_to_errno(ret);
 
if (PCI_X_CMD_VERSION(word)) {
-   /* Test for extended capabilities */
-   pci_read_config_dword(pdev, PCI_CFG_SPACE_SIZE, &dword);
-   vdev->extended_caps = (dword != 0);
+   if (pdev->cfg_size > PCI_CFG_SPACE_SIZE) {
+   /* Test for extended capabilities */
+   pci_read_config_dword(pdev, PCI_CFG_SPACE_SIZE,
+   &dword);
+   vdev->extended_caps = (dword != 0);
+   }
return PCI_CAP_PCIX_SIZEOF_V2;
} else
return PCI_CAP_PCIX_SIZEOF_V0;
@@ -1138,9 +1141,11 @@ static int vfio_cap_len(struct vfio_pci_device *vdev, u8 
cap, u8 pos)
 
return byte;
case PCI_CAP_ID_EXP:
-   /* Test for extended capabilities */
-   pci_read_config_dword(pdev, PCI_CFG_SPACE_SIZE, &dword);
-   vdev->extended_caps = (dword != 0);
+   if (pdev->cfg_size > PCI_CFG_SPACE_SIZE) {
+   /* Test for extended capabilities */
+   pci_read_config_dword(pdev, PCI_CFG_SPACE_SIZE, &dword);
+   vdev->extended_caps = dword != 0;
+   }
 
/* length based on version */
if ((pcie_caps_reg(pdev) & PCI_EXP_FLAGS_VERS) == 1)
-- 
2.5.0.rc3



[PATCH kernel] vfio_pci: Make extended capabilities test more robust

2016-04-28 Thread Alexey Kardashevskiy
VFIO reads a dword beyond the standard PCI config space (256 bytes) to
know whether there is extended config space (4096 bytes). It relies on
the platform to return zero if there is no extended space.

However, at least on the PPC64/POWERNV platform, the system firmware (OPAL)
returns 0xffffffff in this case. VFIO treats it as proof that there is
extended config space and calls vfio_ecap_init(), which fails to parse
capabilities (as expected) but, right before exiting, writes zero at
offset 256, which is beyond the buffer allocated for vdev->vconfig
(it is 256 bytes for a device without extended config space).

This adds an additional check that the config space read returned a
non-zero and non-0xffffffff value.

Signed-off-by: Alexey Kardashevskiy <a...@ozlabs.ru>
---
 drivers/vfio/pci/vfio_pci_config.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/vfio/pci/vfio_pci_config.c 
b/drivers/vfio/pci/vfio_pci_config.c
index 142c533..8a53421 100644
--- a/drivers/vfio/pci/vfio_pci_config.c
+++ b/drivers/vfio/pci/vfio_pci_config.c
@@ -1140,7 +1140,7 @@ static int vfio_cap_len(struct vfio_pci_device *vdev, u8 
cap, u8 pos)
case PCI_CAP_ID_EXP:
/* Test for extended capabilities */
pci_read_config_dword(pdev, PCI_CFG_SPACE_SIZE, &dword);
-   vdev->extended_caps = (dword != 0);
+   vdev->extended_caps = (dword != 0) && (dword != 0xffffffff);
 
/* length based on version */
if ((pcie_caps_reg(pdev) & PCI_EXP_FLAGS_VERS) == 1)
-- 
2.5.0.rc3



Re: [RFC v6 04/10] PCI: Add support for enforcing all MMIO BARs to be page aligned

2016-04-25 Thread Alexey Kardashevskiy

On 04/18/2016 08:56 PM, Yongji Xie wrote:

When vfio passes through a PCI device whose MMIO BARs are
smaller than PAGE_SIZE, the guest will not handle the MMIO
accesses to those BARs, which leads to MMIO emulation in the host.

This is because vfio will not allow passing through one BAR's
MMIO page which may be shared with other BARs. Otherwise,
there would be a backdoor that a guest could use to access
BARs of another guest.

To solve this issue, this patch modifies resource_alignment
to support syntax where multiple devices get the same
alignment. So we can use something like
"pci=resource_alignment=*:*:*.*:noresize" to enforce the
alignment of all MMIO BARs to be at least PAGE_SIZE so that
one BAR's MMIO page would not be shared with other BARs.

And we also define a macro PCIBIOS_MIN_ALIGNMENT to enable this
automatically on the PPC64 platform, which can easily hit this issue
because its PAGE_SIZE is 64KB.

Signed-off-by: Yongji Xie 
---
  Documentation/kernel-parameters.txt |2 ++
  arch/powerpc/include/asm/pci.h  |2 ++
  drivers/pci/pci.c   |   64 +--
  3 files changed, 57 insertions(+), 11 deletions(-)

diff --git a/Documentation/kernel-parameters.txt 
b/Documentation/kernel-parameters.txt
index d8b29ab..542be4a 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -2918,6 +2918,8 @@ bytes respectively. Such letter suffixes can also be 
entirely omitted.
aligned memory resources.
If  is not specified,
PAGE_SIZE is used as alignment.
+   , ,  and  can be set to
+   "*" which means match all values.
PCI-PCI bridge can be specified, if resource
windows need to be expanded.
noresize: Don't change the resources' sizes when
diff --git a/arch/powerpc/include/asm/pci.h b/arch/powerpc/include/asm/pci.h
index 6f8065a..78f230f 100644
--- a/arch/powerpc/include/asm/pci.h
+++ b/arch/powerpc/include/asm/pci.h
@@ -30,6 +30,8 @@
  #define PCIBIOS_MIN_IO0x1000
  #define PCIBIOS_MIN_MEM   0x1000

+#define PCIBIOS_MIN_ALIGNMENT  PAGE_SIZE
+
  struct pci_dev;

  /* Values for the `which' argument to sys_pciconfig_iobase syscall.  */
diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index 7564ccc..0381c28 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -4605,7 +4605,12 @@ static resource_size_t 
pci_specified_resource_alignment(struct pci_dev *dev,
int seg, bus, slot, func, align_order, count;
resource_size_t align = 0;
char *p;
+   bool invalid = false;

+#ifdef PCIBIOS_MIN_ALIGNMENT
+   align = PCIBIOS_MIN_ALIGNMENT;
+   *resize = false;
+#endif
spin_lock(&resource_alignment_lock);
p = resource_alignment_param;
while (*p) {
@@ -4622,16 +4627,49 @@ static resource_size_t 
pci_specified_resource_alignment(struct pci_dev *dev,
} else {
align_order = -1;
}
-   if (sscanf(p, "%x:%x:%x.%x%n",
-   &seg, &bus, &slot, &func, &count) != 4) {



I'd replace the above lines with:

char segstr[5] = "*", busstr[3] = "*";
char slotstr[3] = "*", funstr[2] = "*";

if (sscanf(p, "%4[^:]:%2[^:]:%2[^.].%1s%n",
segstr, busstr, slotstr, funstr, &count) != 4) {


and add some wrapper like:

static bool glob_match_hex(char const *pat, int val)
{
char valstr[5]; /* 5 should be enough for PCI */
snprintf(valstr, sizeof(valstr) - 1, "%4x", val);
return glob_match(pat, valstr);
}

and then use glob_match_hex() (or make a wrapper like above on top of 
fnmatch()), this would enable better mask handling.


If anyone finds this useful (which I am not sure about).
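
For what it's worth, the matching idea is easy to try stand-alone with
POSIX fnmatch() in place of the kernel's glob_match(); the helper and the
patterns/values below are only an illustration of the suggestion above:

#include <fnmatch.h>
#include <stdio.h>

/* User-space stand-in for the suggested glob_match_hex() wrapper */
static int match_hex(const char *pat, unsigned int val)
{
	char valstr[5];		/* up to 4 hex digits is enough for PCI */

	snprintf(valstr, sizeof(valstr), "%x", val);
	return fnmatch(pat, valstr, 0) == 0;
}

int main(void)
{
	/* "*:*:*.*" would match every seg:bus:slot.func */
	printf("seg  0    vs \"*\":  %d\n", match_hex("*", 0x0));
	printf("bus  3    vs \"*\":  %d\n", match_hex("*", 0x3));
	printf("slot 0x10 vs \"1?\": %d\n", match_hex("1?", 0x10));
	printf("func 0    vs \"0\":  %d\n", match_hex("0", 0x0));
	return 0;
}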





+   if (p[0] == '*' && p[1] == ':') {
+   seg = -1;
+   count = 1;
+   } else if (sscanf(p, "%x%n", &seg, &count) != 1 ||
+   p[count] != ':') {
+   invalid = true;
+   break;
+   }
+   p += count + 1;
+   if (*p == '*') {
+   bus = -1;
+   count = 1;
+   } else if (sscanf(p, "%x%n", &bus, &count) != 1) {
+   invalid = true;
+   break;
+   }
+   p += count;
+   if (*p == '.') {
+   slot = bus;
+   bus = seg;
seg = 0;
-   if (sscanf(p, "%x:%x.%x%n",
-   &bus, &slot, &func, &count) != 3) {
-   /* Invalid format */
-   printk(KERN_ERR "PCI: Can't parse resource_alignment 
parameter: %s\n",
-   

Re: [RFC PATCH kernel] Revert "net/mlx4_core: Set UAR page size to 4KB regardless of system page size"

2016-03-19 Thread Alexey Kardashevskiy

On 03/16/2016 08:45 PM, Or Gerlitz wrote:

On Wed, Mar 16, 2016 at 10:34 AM, Alexey Kardashevskiy <a...@ozlabs.ru> wrote:


Oh. ok. It also looks like even with the reverted patch, mlx4 VF does not
work in a guest:


So where is the breakage point for you? does 4.4 works? if not, what?


Ah, my bad. It is unrelated to the kernel version.

I tried passing a PF to a guest while its VFs were already passed to another
guest, to see how exactly it blows up (AER/EEH were thrown but the host
recovered => good), but this left the device in a weird state where I could
not use a VF in a guest anymore even though it seemed to keep working on the host.


It seems like the actual adapter does not reset completely when the machine
is rebooted; I had to unplug/replug the power cables to fix this.



--
Alexey


Re: [RFC PATCH kernel] Revert "net/mlx4_core: Set UAR page size to 4KB regardless of system page size"

2016-03-16 Thread Alexey Kardashevskiy

On 03/16/2016 05:09 PM, Eli Cohen wrote:

On Wed, Mar 16, 2016 at 04:49:00PM +1100, Alexey Kardashevskiy wrote:

On 03/16/2016 04:10 PM, Eli Cohen wrote:

On Wed, Mar 16, 2016 at 01:07:58PM +1100, Alexey Kardashevskiy wrote:


So with v4.5 as a host, there is no actual distro available today to
use as a guest in the next 6 months (or whatever it takes to
backport this particular patch back there).

You could have added a module parameter to enforce the old behavior,
at least...

And sorry but from the original commit log I could not understand
why exactly all existing guests need to be broken. Could you please
point me to a piece of documentation describing all this UAR
business (what is UAR, why 128 UARs are required and for what, etc).
Thanks.



We are going to send a patch that fixes this using a module parameter.
The patch will be on top of Huy's patch.

Some background to the problem: mlx4 supported devices require 128 UAR


What does UAR stand for?

User Access Region. It's the way you interface with the hardware.



pages from PCI memory space defined by BAR2-3. Each UAR page can be
any power of 2 value from 4K up to 64K. Before Huy's patch the driver
chose UAR page size to be equal to system page size. Since PowerPC's
page size is 64K, this means the minimum requirement of UAR pages is not
met (default UAR BAR is 8MB and only half of it is really reserved for
UARs).
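
To put numbers on this (treating the figures quoted above as assumptions:
an 8MB UAR BAR with half of it really reserved for UARs, and a minimum of
128 UAR pages), a quick calculation shows why 64K UAR pages cannot meet
the minimum while 4K pages can:

#include <stdio.h>

int main(void)
{
	const unsigned long uar_bar   = 8UL << 20;   /* 8MB BAR2-3 window (assumed) */
	const unsigned long reserved  = uar_bar / 2; /* half really reserved for UARs */
	const unsigned long min_pages = 128;         /* minimum UAR pages required */

	/* 4MB / 64K = 64 pages: short of the 128-page minimum */
	printf("64K pages: %4lu available, %lu needed\n",
	       reserved / (64UL << 10), min_pages);

	/* 4MB / 4K = 1024 pages: easily meets the minimum */
	printf(" 4K pages: %4lu available, %lu needed\n",
	       reserved / (4UL << 10), min_pages);
	return 0;
}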


And what was the downside? afaict the performance was good...



It's not a performance issue. Defining 64KB for a UAR is not required
and wastes pci memory mapped i/o space.




More details can be found in the programmer's manual.


Can you please point me to this manual on the website? I tried,
honestly, could not find it. Thanks.


It's not publicly available. If you have an FAE that works with your
company you can ask him how to get the doc.



Oh. ok. It also looks like even with the reverted patch, mlx4 VF does not 
work in a guest:


root@le-dbg:~# dhclient eth0
mlx4_en: eth0:   frag:0 - size:1518 prefix:0 stride:1536

mlx4_core :00:00.0: Internal error detected on the communication channel
mlx4_core :00:00.0: device is going to be reset
mlx4_core :00:00.0: VF reset is not needed
mlx4_core :00:00.0: device was reset successfully
mlx4_en :00:00.0: Internal error detected, restarting device
mlx4_core :00:00.0: command 0x5 failed: fw status = 0x1
mlx4_core :00:00.0: Failed to close slave function
mlx4_core :00:00.0: Detected virtual function - running in slave mode
mlx4_core :00:00.0: Sending reset


mlx4_core :00:00.0: slave is currently in the middle of FLR - Deferring 
probe
mlx4_core :00:00.0: mlx4_restart_one: ERROR: mlx4_load_one failed, 
pci_name=:00:00.0, err=-517

mlx4_core :00:00.0: mlx4_restart_one was ended, ret=-517

root@le-dbg:~# ifconfig -a
loLink encap:Local Loopback
  inet addr:127.0.0.1  Mask:255.0.0.0
  UP LOOPBACK RUNNING  MTU:65536  Metric:1
  RX packets:0 errors:0 dropped:0 overruns:0 frame:0
  TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
  collisions:0 txqueuelen:0
  RX bytes:0 (0.0 B)  TX bytes:0 (0.0 B)

root@le-dbg:~# lspci -v
00:00.0 Ethernet controller: Mellanox Technologies MT27500/MT27520 Family 
[ConnectX-3/ConnectX-3 Pro Virtual Function]

Subsystem: IBM Device 61b0
Physical Slot: C16
Flags: bus master, fast devsel, latency 0
Memory at 1012000 (64-bit, prefetchable) [size=64M]
Capabilities: [60] Express Endpoint, MSI 00
Capabilities: [9c] MSI-X: Enable- Count=52 Masked-
Capabilities: [40] Power Management version 0
Kernel driver in use: mlx4_core



--
Alexey


Re: [RFC PATCH kernel] Revert "net/mlx4_core: Set UAR page size to 4KB regardless of system page size"

2016-03-15 Thread Alexey Kardashevskiy

On 03/16/2016 04:10 PM, Eli Cohen wrote:

On Wed, Mar 16, 2016 at 01:07:58PM +1100, Alexey Kardashevskiy wrote:


So with v4.5 as a host, there is no actual distro available today to
use as a guest in the next 6 months (or whatever it takes to
backport this particular patch back there).

You could have added a module parameter to enforce the old behavior,
at least...

And sorry but from the original commit log I could not understand
why exactly all existing guests need to be broken. Could you please
point me to a piece of documentation describing all this UAR
business (what is UAR, why 128 UARs are required and for what, etc).
Thanks.



We are going to send a patch that fixes this using a module parameter.
The patch will be on top of Huy's patch.

Some background to the problem: mlx4 supported devices require 128 UAR


What does UAR stand for?


pages from PCI memory space defined by BAR2-3. Each UAR page can be
any power of 2 value from 4K up to 64K. Before Huy's patch the driver
chose UAR page size to be equal to system page size. Since PowerPC's
page size is 64K, this means the minimum requirement of UAR pages is not
met (default UAR BAR is 8MB and only half of it is really reserved for
UARs).


And what was the downside? afaict the performance was good...



More details can be found in the programmer's manual.


Can you please point me to this manual on the website? I tried, honestly, 
could not find it. Thanks.



--
Alexey


Re: [RFC PATCH kernel] Revert "net/mlx4_core: Set UAR page size to 4KB regardless of system page size"

2016-03-15 Thread Alexey Kardashevskiy

On 03/15/2016 09:40 PM, Or Gerlitz wrote:

On Tue, Mar 15, 2016 at 12:19 PM, Alexey Kardashevskiy <a...@ozlabs.ru> wrote:

This reverts commit 85743f1eb34548ba4b056d2f184a3d107a3b8917.

Without this revert, POWER "pseries" KVM guests with a VF passed to a guest
using VFIO fail to bring the driver up:

mlx4_core: Mellanox ConnectX core driver v2.2-1 (Feb, 2014)
mlx4_core: Initializing :00:00.0
mlx4_core :00:00.0: enabling device ( -> 0002)
mlx4_core :00:00.0: Detected virtual function - running in slave mode
mlx4_core :00:00.0: Sending reset
mlx4_core :00:00.0: Sending vhcr0
mlx4_core :00:00.0: HCA minimum page size:512
mlx4_core :00:00.0: UAR size:4096 != kernel PAGE_SIZE of 65536
mlx4_core :00:00.0: Failed to obtain slave caps



Both host and guest use 64K system pages.

How to fix this properly? Thanks.


The commit message says:

"[..] Regarding backward compatibility in SR-IOV, if hypervisor has
this new code, the virtual OS must be updated. [...]"



So with v4.5 as a host, there is no actual distro available today to use as 
a guest in the next 6 months (or whatever it takes to backport this 
particular patch back there).


You could have added a module parameter to enforce the old behavior, at
least...


And sorry but from the original commit log I could not understand why 
exactly all existing guests need to be broken. Could you please point me to 
a piece of documentation describing all this UAR business (what is UAR, why
128 UARs are required and for what, etc). Thanks.



--
Alexey


Re: [RFC PATCH kernel] Revert "net/mlx4_core: Set UAR page size to 4KB regardless of system page size"

2016-03-15 Thread Alexey Kardashevskiy

On 03/16/2016 02:29 AM, Christoph Hellwig wrote:

On Tue, Mar 15, 2016 at 04:23:33PM +0200, Or Gerlitz wrote:

Let us check. I was under the (maybe wrong) impression that before this
patch both PF/VF drivers were not operative on some systems, so on those
systems it's fair to require the VF driver to be patched too.


To me it sounds like the system worked before. Alexey, can you confirm?


It worked just fine for year(s); this is definitely a regression.


--
Alexey


[RFC PATCH kernel] Revert "net/mlx4_core: Set UAR page size to 4KB regardless of system page size"

2016-03-15 Thread Alexey Kardashevskiy
This reverts commit 85743f1eb34548ba4b056d2f184a3d107a3b8917.

Without this revert, POWER "pseries" KVM guests with a VF passed to a guest
using VFIO fail to bring the driver up:

mlx4_core: Mellanox ConnectX core driver v2.2-1 (Feb, 2014)
mlx4_core: Initializing :00:00.0
mlx4_core :00:00.0: enabling device ( -> 0002)
mlx4_core :00:00.0: Detected virtual function - running in slave mode
mlx4_core :00:00.0: Sending reset
mlx4_core :00:00.0: Sending vhcr0
mlx4_core :00:00.0: HCA minimum page size:512
mlx4_core :00:00.0: UAR size:4096 != kernel PAGE_SIZE of 65536
mlx4_core :00:00.0: Failed to obtain slave caps


Both host and guest use 64K system pages.

How to fix this properly? Thanks.



---
 drivers/infiniband/hw/mlx4/qp.c   |  7 +--
 drivers/net/ethernet/mellanox/mlx4/cq.c   |  4 +-
 drivers/net/ethernet/mellanox/mlx4/en_resources.c |  3 +-
 drivers/net/ethernet/mellanox/mlx4/en_tx.c|  4 +-
 drivers/net/ethernet/mellanox/mlx4/eq.c   |  7 ++-
 drivers/net/ethernet/mellanox/mlx4/main.c | 56 +--
 drivers/net/ethernet/mellanox/mlx4/pd.c   | 12 ++---
 include/linux/mlx4/device.h   | 13 --
 8 files changed, 22 insertions(+), 84 deletions(-)

diff --git a/drivers/infiniband/hw/mlx4/qp.c b/drivers/infiniband/hw/mlx4/qp.c
index fd97534..bc5536f 100644
--- a/drivers/infiniband/hw/mlx4/qp.c
+++ b/drivers/infiniband/hw/mlx4/qp.c
@@ -1681,12 +1681,9 @@ static int __mlx4_ib_modify_qp(struct ib_qp *ibqp,
}
 
if (qp->ibqp.uobject)
-   context->usr_page = cpu_to_be32(
-   mlx4_to_hw_uar_index(dev->dev,
-
to_mucontext(ibqp->uobject->context)->uar.index));
+   context->usr_page = 
cpu_to_be32(to_mucontext(ibqp->uobject->context)->uar.index);
else
-   context->usr_page = cpu_to_be32(
-   mlx4_to_hw_uar_index(dev->dev, dev->priv_uar.index));
+   context->usr_page = cpu_to_be32(dev->priv_uar.index);
 
if (attr_mask & IB_QP_DEST_QPN)
context->remote_qpn = cpu_to_be32(attr->dest_qp_num);
diff --git a/drivers/net/ethernet/mellanox/mlx4/cq.c 
b/drivers/net/ethernet/mellanox/mlx4/cq.c
index a849da9..3348e64 100644
--- a/drivers/net/ethernet/mellanox/mlx4/cq.c
+++ b/drivers/net/ethernet/mellanox/mlx4/cq.c
@@ -318,9 +318,7 @@ int mlx4_cq_alloc(struct mlx4_dev *dev, int nent,
if (timestamp_en)
cq_context->flags  |= cpu_to_be32(1 << 19);
 
-   cq_context->logsize_usrpage =
-   cpu_to_be32((ilog2(nent) << 24) |
-   mlx4_to_hw_uar_index(dev, uar->index));
+   cq_context->logsize_usrpage = cpu_to_be32((ilog2(nent) << 24) | 
uar->index);
cq_context->comp_eqn= 
priv->eq_table.eq[MLX4_CQ_TO_EQ_VECTOR(vector)].eqn;
cq_context->log_page_size   = mtt->page_shift - MLX4_ICM_PAGE_SHIFT;
 
diff --git a/drivers/net/ethernet/mellanox/mlx4/en_resources.c 
b/drivers/net/ethernet/mellanox/mlx4/en_resources.c
index 02e925d..12aab5a 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_resources.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_resources.c
@@ -58,8 +58,7 @@ void mlx4_en_fill_qp_context(struct mlx4_en_priv *priv, int 
size, int stride,
} else {
context->sq_size_stride = ilog2(TXBB_SIZE) - 4;
}
-   context->usr_page = cpu_to_be32(mlx4_to_hw_uar_index(mdev->dev,
-   mdev->priv_uar.index));
+   context->usr_page = cpu_to_be32(mdev->priv_uar.index);
context->local_qpn = cpu_to_be32(qpn);
context->pri_path.ackto = 1 & 0x07;
context->pri_path.sched_queue = 0x83 | (priv->port - 1) << 6;
diff --git a/drivers/net/ethernet/mellanox/mlx4/en_tx.c 
b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
index e0946ab..4421bf5 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_tx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
@@ -213,9 +213,7 @@ int mlx4_en_activate_tx_ring(struct mlx4_en_priv *priv,
mlx4_en_fill_qp_context(priv, ring->size, ring->stride, 1, 0, ring->qpn,
ring->cqn, user_prio, >context);
if (ring->bf_alloced)
-   ring->context.usr_page =
-   cpu_to_be32(mlx4_to_hw_uar_index(mdev->dev,
-ring->bf.uar->index));
+   ring->context.usr_page = cpu_to_be32(ring->bf.uar->index);
 
err = mlx4_qp_to_ready(mdev->dev, >wqres.mtt, >context,
   >qp, >qp_state);
diff --git a/drivers/net/ethernet/mellanox/mlx4/eq.c 
b/drivers/net/ethernet/mellanox/mlx4/eq.c
index f613977..4696053 100644
--- a/drivers/net/ethernet/mellanox/mlx4/eq.c
+++ b/drivers/net/ethernet/mellanox/mlx4/eq.c
@@ -940,10 +940,9 @@ static void __iomem *mlx4_get_eq_uar(struct mlx4_dev *dev, 
struct mlx4_eq *eq)
 

Re: [RFC PATCH v4 1/7] PCI: Add a new option for resource_alignment to reassign alignment

2016-03-09 Thread Alexey Kardashevskiy

On 03/07/2016 06:48 PM, Yongji Xie wrote:

When using the resource_alignment kernel parameter, the current
implementation reassigns the alignment by changing the resources' size,
which can potentially break some drivers.


How can this possibly break any driver?... It rounds up, not down; what am
I missing here?




So this patch adds a new option "noresize" for the parameter
to solve this problem.

Signed-off-by: Yongji Xie 
---
  Documentation/kernel-parameters.txt |5 -
  drivers/pci/pci.c   |   36 +--
  2 files changed, 30 insertions(+), 11 deletions(-)

diff --git a/Documentation/kernel-parameters.txt 
b/Documentation/kernel-parameters.txt
index 9a53c92..d8b29ab 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -2912,13 +2912,16 @@ bytes respectively. Such letter suffixes can also be 
entirely omitted.
window. The default value is 64 megabytes.
resource_alignment=
Format:
-   [@][:]:.[; ...]
+   [@][:]:.
+   [:noresize][; ...]
Specifies alignment and device to reassign
aligned memory resources.
If  is not specified,
PAGE_SIZE is used as alignment.
PCI-PCI bridge can be specified, if resource
windows need to be expanded.
+   noresize: Don't change the resources' sizes when
+   reassigning alignment.
ecrc=   Enable/disable PCIe ECRC (transaction layer
end-to-end CRC checking).
bios: Use BIOS/firmware settings. This is the
diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index 602eb42..760cce5 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -4598,7 +4598,8 @@ static DEFINE_SPINLOCK(resource_alignment_lock);
   * RETURNS: Resource alignment if it is specified.
   *  Zero if it is not specified.
   */
-static resource_size_t pci_specified_resource_alignment(struct pci_dev *dev)
+static resource_size_t pci_specified_resource_alignment(struct pci_dev *dev,
+   bool *resize)
  {
int seg, bus, slot, func, align_order, count;
resource_size_t align = 0;
@@ -4626,6 +4627,11 @@ static resource_size_t 
pci_specified_resource_alignment(struct pci_dev *dev)
}
}
p += count;
+   if (!strncmp(p, ":noresize", 9)) {
+   *resize = false;
+   p += 9;
+   } else
+   *resize = true;
if (seg == pci_domain_nr(dev->bus) &&
bus == dev->bus->number &&
slot == PCI_SLOT(dev->devfn) &&
@@ -4658,11 +4664,12 @@ void pci_reassigndev_resource_alignment(struct pci_dev 
*dev)
  {
int i;
struct resource *r;
+   bool resize;
resource_size_t align, size;
u16 command;

/* check if specified PCI is target device to reassign */
-   align = pci_specified_resource_alignment(dev);
+   align = pci_specified_resource_alignment(dev, );


A compiler should have warned here about passing a pointer to uninitialized
@resize.
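
A minimal sketch of the fix being hinted at here (an assumption about how it
could look, not the posted code): initialize the output parameter before any
parsing or early return, so the caller never reads an uninitialized value.

static resource_size_t pci_specified_resource_alignment(struct pci_dev *dev,
							 bool *resize)
{
	resource_size_t align = 0;

	*resize = true;		/* default: old behaviour, resize resources */

	/* ... parse the resource_alignment= string as in the patch,
	 * setting @align and possibly *resize = false for ":noresize" ... */

	return align;
}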




if (!align)
return;

@@ -4684,15 +4691,24 @@ void pci_reassigndev_resource_alignment(struct pci_dev 
*dev)
if (!(r->flags & IORESOURCE_MEM))
continue;
size = resource_size(r);
-   if (size < align) {
-   size = align;
-   dev_info(>dev,
-   "Rounding up size of resource #%d to %#llx.\n",
-   i, (unsigned long long)size);
+   if (resize) {
+   if (size < align) {
+   size = align;
+   dev_info(>dev,
+   "Rounding up size of resource #%d to 
%#llx.\n",
+   i, (unsigned long long)size);
+   }
+   r->flags |= IORESOURCE_UNSET;
+   r->end = size - 1;
+   r->start = 0;
+   } else {
+   if (size > align)
+   align = size;
+   r->flags &= ~IORESOURCE_SIZEALIGN;
+   r->flags |= IORESOURCE_STARTALIGN | IORESOURCE_UNSET;
+   r->start = align;
+   r->end = r->start + size - 1;
}
-   r->flags |= IORESOURCE_UNSET;
-   r->end = size - 1;
-   r->start = 0;
}
  

Re: [kernel] powerpc: Make vmalloc_to_phys() public

2016-01-26 Thread Alexey Kardashevskiy

On 01/25/2016 09:06 PM, Paul Mackerras wrote:

On Mon, Jan 25, 2016 at 04:46:03PM +1100, Michael Ellerman wrote:

On Thu, 2016-21-01 at 07:35:08 UTC, Alexey Kardashevskiy wrote:

This makes vmalloc_to_phys() public as there will be another user
(in-kernel VFIO acceleration) for it soon.

As part of a future small optimization, this changes the helper to call
vmalloc_to_pfn() instead of vmalloc_to_page(): the size of
struct page may not be a power of two, which makes gcc use
multiply instructions instead of shifts.


You should also mention why you need to export it, presumably because whatever
new user you have in mind can be built as a module.


If I remember correctly, it's the *existing* user in hv-24x7.c that
needs the export.


No, CONFIG_HV_PERF_CTRS (which controls hv-24x7.c) is "bool" so today 
hv-24x7.c cannot compile as a module.


This exports the vmalloc_to_phys() symbol as the new user is going to be KVM,
which can be compiled as a module.



--
Alexey


Re: [PATCH] vfio/noiommu: Don't use iommu_present() to track fake groups

2016-01-24 Thread Alexey Kardashevskiy

On 01/23/2016 04:23 AM, Alex Williamson wrote:

Using iommu_present() to determine whether an IOMMU group is real or
fake has some problems.  First, apparently Power systems don't
register an IOMMU on the device bus, so the groups and containers get
marked as noiommu and then won't bind to their actual IOMMU driver.
Second, I expect we'll run into the same issue as we try to support
vGPUs through vfio, since they're likely to emulate this behavior of
creating an IOMMU group on a virtual device and then providing a vfio
IOMMU backend tailored to the sort of isolation they provide, which
won't necessarily be fully compatible with the IOMMU API.

The solution here is to use the existing iommudata interface to IOMMU
groups, which allows us to easily identify the fake groups we've
created for noiommu purposes.  The iommudata we set is purely
arbitrary since we're only comparing the address, so we use the
address of the noiommu switch itself.
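
The mechanism described above amounts to roughly this sketch, using the
existing iommu_group iommudata accessors; the "noiommu" token and the
surrounding variable names are assumptions for illustration:

	/* When creating a fake group for noiommu mode, tag it: */
	iommu_group_set_iommudata(group, &noiommu, NULL);

	/* ... later, a group is "fake" iff its iommudata is that same token: */
	if (iommu_group_get_iommudata(iommu_group) == &noiommu)
		group_is_fake = true;	/* treat as a noiommu (fake) group */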

Reported-by: Alexey Kardashevskiy <a...@ozlabs.ru>
Fixes: 03a76b60f8ba ("vfio: Include No-IOMMU mode")
Signed-off-by: Alex Williamson <alex.william...@redhat.com>




Reviewed-by: Alexey Kardashevskiy <a...@ozlabs.ru>
Tested-by: Alexey Kardashevskiy <a...@ozlabs.ru>

Thanks!


--
Alexey


Re: [PATCH kernel] vfio: Only check for bus IOMMU when NOIOMMU is selected

2016-01-21 Thread Alexey Kardashevskiy

On 01/22/2016 05:34 PM, Alexey Kardashevskiy wrote:

Recent change 03a76b60 "vfio: Include No-IOMMU mode" disabled VFIO
on systems which do not implement iommu_ops for the PCI bus even though
there is a VFIO IOMMU driver for it, such as the SPAPR TCE driver for
the PPC64/powernv platform.

This moves iommu_present() call under #ifdef CONFIG_VFIO_NOIOMMU as
it is done in the rest of the file to re-enable VFIO on powernv.

Signed-off-by: Alexey Kardashevskiy 
---
  drivers/vfio/vfio.c | 5 -
  1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
index 82f25cc..3f8060e 100644
--- a/drivers/vfio/vfio.c
+++ b/drivers/vfio/vfio.c
@@ -343,7 +343,6 @@ static struct vfio_group *vfio_create_group(struct 
iommu_group *iommu_group,
atomic_set(>opened, 0);
group->iommu_group = iommu_group;
group->noiommu = !iommu_present;
-



Agrh. Unrelated change, repost?



group->nb.notifier_call = vfio_iommu_group_notifier;

/*
@@ -767,7 +766,11 @@ int vfio_add_group_dev(struct device *dev,

group = vfio_group_get_from_iommu(iommu_group);
if (!group) {
+#ifdef CONFIG_VFIO_NOIOMMU
group = vfio_create_group(iommu_group, iommu_present(dev->bus));
+#else
+   group = vfio_create_group(iommu_group, true);
+#endif
if (IS_ERR(group)) {
iommu_group_put(iommu_group);
return PTR_ERR(group);




--
Alexey


[PATCH kernel] vfio: Only check for bus IOMMU when NOIOMMU is selected

2016-01-21 Thread Alexey Kardashevskiy
Recent change 03a76b60 "vfio: Include No-IOMMU mode" disabled VFIO
on systems which do not implement iommu_ops for the PCI bus even though
there is a VFIO IOMMU driver for it, such as the SPAPR TCE driver for
the PPC64/powernv platform.

This moves iommu_present() call under #ifdef CONFIG_VFIO_NOIOMMU as
it is done in the rest of the file to re-enable VFIO on powernv.

Signed-off-by: Alexey Kardashevskiy 
---
 drivers/vfio/vfio.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
index 82f25cc..3f8060e 100644
--- a/drivers/vfio/vfio.c
+++ b/drivers/vfio/vfio.c
@@ -343,7 +343,6 @@ static struct vfio_group *vfio_create_group(struct 
iommu_group *iommu_group,
atomic_set(>opened, 0);
group->iommu_group = iommu_group;
group->noiommu = !iommu_present;
-
group->nb.notifier_call = vfio_iommu_group_notifier;
 
/*
@@ -767,7 +766,11 @@ int vfio_add_group_dev(struct device *dev,
 
group = vfio_group_get_from_iommu(iommu_group);
if (!group) {
+#ifdef CONFIG_VFIO_NOIOMMU
group = vfio_create_group(iommu_group, iommu_present(dev->bus));
+#else
+   group = vfio_create_group(iommu_group, true);
+#endif
if (IS_ERR(group)) {
iommu_group_put(iommu_group);
return PTR_ERR(group);
-- 
2.5.0.rc3




[PATCH kernel] powerpc: Make vmalloc_to_phys() public

2016-01-20 Thread Alexey Kardashevskiy
This makes vmalloc_to_phys() public as there will be another user
(in-kernel VFIO acceleration) for it soon.

As part of a future small optimization, this changes the helper to call
vmalloc_to_pfn() instead of vmalloc_to_page(): the size of
struct page may not be a power of two, which makes gcc use
multiply instructions instead of shifts.

Signed-off-by: Alexey Kardashevskiy 
---

A couple of notes:

1. real_vmalloc_addr() will be reworked later by Paul separately;

2. the optimization note is not valid at the moment as
vmalloc_to_pfn() calls vmalloc_to_page(), which does the actual
search; these helpers' functionality will be swapped later
(also by Paul).
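
A hypothetical usage sketch of the now-public helper (the buffer and its
consumer are made up for illustration):

	/* Hypothetical caller: hand a vmalloc'ed buffer to hypervisor/firmware
	 * code that needs a physical address. */
	void *buf = vzalloc(PAGE_SIZE);

	if (buf) {
		unsigned long pa = vmalloc_to_phys(buf);

		/* pass @pa to the hcall / device here */
		vfree(buf);
	}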

---
 arch/powerpc/include/asm/pgtable.h | 3 +++
 arch/powerpc/mm/pgtable.c  | 8 
 arch/powerpc/perf/hv-24x7.c| 8 
 3 files changed, 11 insertions(+), 8 deletions(-)

diff --git a/arch/powerpc/include/asm/pgtable.h 
b/arch/powerpc/include/asm/pgtable.h
index ac9fb11..47897a3 100644
--- a/arch/powerpc/include/asm/pgtable.h
+++ b/arch/powerpc/include/asm/pgtable.h
@@ -78,6 +78,9 @@ static inline pte_t *find_linux_pte_or_hugepte(pgd_t *pgdir, 
unsigned long ea,
}
return __find_linux_pte_or_hugepte(pgdir, ea, is_thp, shift);
 }
+
+unsigned long vmalloc_to_phys(void *vmalloc_addr);
+
 #endif /* __ASSEMBLY__ */
 
 #endif /* _ASM_POWERPC_PGTABLE_H */
diff --git a/arch/powerpc/mm/pgtable.c b/arch/powerpc/mm/pgtable.c
index 83dfd79..de37ff4 100644
--- a/arch/powerpc/mm/pgtable.c
+++ b/arch/powerpc/mm/pgtable.c
@@ -243,3 +243,11 @@ void assert_pte_locked(struct mm_struct *mm, unsigned long 
addr)
 }
 #endif /* CONFIG_DEBUG_VM */
 
+unsigned long vmalloc_to_phys(void *va)
+{
+   unsigned long pfn = vmalloc_to_pfn(va);
+
+   BUG_ON(!pfn);
+   return __pa(pfn_to_kaddr(pfn)) + offset_in_page(va);
+}
+EXPORT_SYMBOL_GPL(vmalloc_to_phys);
diff --git a/arch/powerpc/perf/hv-24x7.c b/arch/powerpc/perf/hv-24x7.c
index 9f9dfda..3b09ecf 100644
--- a/arch/powerpc/perf/hv-24x7.c
+++ b/arch/powerpc/perf/hv-24x7.c
@@ -493,14 +493,6 @@ static size_t event_to_attr_ct(struct hv_24x7_event_data 
*event)
}
 }
 
-static unsigned long vmalloc_to_phys(void *v)
-{
-   struct page *p = vmalloc_to_page(v);
-
-   BUG_ON(!p);
-   return page_to_phys(p) + offset_in_page(v);
-}
-
 /* */
 struct event_uniq {
struct rb_node node;
-- 
2.5.0.rc3




Re: [PATCH kernel] rcu: Define lockless version of list_for_each_entry_rcu

2015-12-21 Thread Alexey Kardashevskiy

On 12/08/2015 04:46 PM, Paul E. McKenney wrote:

On Tue, Dec 08, 2015 at 04:20:03PM +1100, Paul Mackerras wrote:

On Sat, Dec 05, 2015 at 06:19:46PM -0800, Paul E. McKenney wrote:


As in the following?  (And yes, I was confused by the fact that the
docbook header was in front of a differently-named macro!)


That looks great!  Have you committed to one of your branches?


Indeed I have, though quite recently.  Please see branch rcu/dev of:

git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu.git

Or, more specifically, this commit: 69b907297f4e

If your testing goes well, I will push this into the upcoming merge
window.


Sorry, it took a little longer than I expected. It works just fine, thanks!


--
Alexey



Re: [RFC PATCH 0/3] VFIO: capability chains

2015-12-17 Thread Alexey Kardashevskiy

On 12/18/2015 01:38 PM, Alex Williamson wrote:

On Fri, 2015-12-18 at 13:05 +1100, Alexey Kardashevskiy wrote:

On 11/24/2015 07:43 AM, Alex Williamson wrote:

Please see the commit log and comments in patch 1 for a general
explanation of the problems that this series tries to address.  The
general problem is that we have several cases where we want to
expose
variable sized information to the user, whether it's sparse mmaps
for
a region, as implemented here, or DMA mapping ranges of an IOMMU,
or
reserved MSI mapping ranges, etc.  Extending data structures is
hard;
extending them to report variable sized data is really hard.  After
considering several options, I think the best approach is to copy
how
PCI does capabilities.  This allows the ioctl to only expose the
capabilities that are relevant for them, avoids data structures
that
are too complicated to parse, and avoids creating a new ioctl each
time we think of something else that we'd like to report.  This
method
also doesn't preclude extensions to the fixed structure since the
offset of these capabilities is entirely dynamic.

Comments welcome, I'll also follow-up to the QEMU and KVM lists
with
an RFC making use of this for mmaps skipping over the MSI-X table.
Thanks,


Out of curiosity - could this information be exposed to the userspace
via
/sys/bus/pci/devices/:xx:xx:x/vfio_? It seems not to change
after
vfio_pci driver is bound to a device.


For what purpose?  vfio doesn't have a sysfs interface, why start one?
Thanks,


Well, it could simplify debugging a bit if this information was available
from userspace without writing a test tool doing some ioctl()s.
Not a big deal though...




--
Alexey


Re: [RFC PATCH 0/3] VFIO: capability chains

2015-12-17 Thread Alexey Kardashevskiy

On 11/24/2015 07:43 AM, Alex Williamson wrote:

Please see the commit log and comments in patch 1 for a general
explanation of the problems that this series tries to address.  The
general problem is that we have several cases where we want to expose
variable sized information to the user, whether it's sparse mmaps for
a region, as implemented here, or DMA mapping ranges of an IOMMU, or
reserved MSI mapping ranges, etc.  Extending data structures is hard;
extending them to report variable sized data is really hard.  After
considering several options, I think the best approach is to copy how
PCI does capabilities.  This allows the ioctl to only expose the
capabilities that are relevant for them, avoids data structures that
are too complicated to parse, and avoids creating a new ioctl each
time we think of something else that we'd like to report.  This method
also doesn't preclude extensions to the fixed structure since the
offset of these capabilities is entirely dynamic.

Comments welcome, I'll also follow-up to the QEMU and KVM lists with
an RFC making use of this for mmaps skipping over the MSI-X table.
Thanks,


Out of curiosity - could this information be exposed to the userspace via 
/sys/bus/pci/devices/:xx:xx:x/vfio_? It seems not to change after 
vfio_pci driver is bound to a device.
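
(For context, the chained-capability layout being discussed boils down to a
header along these lines; the field names here are an illustration, not
necessarily the RFC's exact definition:)

struct vfio_info_cap_header {
	__u16	id;		/* which capability this is */
	__u16	version;	/* capability-specific version */
	__u32	next;		/* byte offset of the next header, 0 for the last one */
};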






Alex

---

Alex Williamson (3):
   vfio: Define capability chains
   vfio: Define sparse mmap capability for regions
   vfio/pci: Include sparse mmap capability for MSI-X table regions


  drivers/vfio/pci/vfio_pci.c |  101 +++
  include/uapi/linux/vfio.h   |   53 ++-
  2 files changed, 152 insertions(+), 2 deletions(-)




--
Alexey


[PATCH kernel] vfio: Add explicit alignments in vfio_iommu_spapr_tce_create

2015-12-17 Thread Alexey Kardashevskiy
The vfio_iommu_spapr_tce_create struct has 4x 32bit and 2x 64bit fields
which should have resulted in sizeof(struct vfio_iommu_spapr_tce_create)
equal to 32 bytes. However, due to gcc's default alignment, the actual
size of this struct is 40 bytes.

This fills gaps with __resv1/2 fields.

This should not cause any change in behavior.
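
As an illustration (a standalone userspace sketch, not part of the patch),
the implicit padding can be demonstrated like this on a 64-bit ABI where
64-bit members are 8-byte aligned:

#include <stdint.h>

/* Same field order as struct vfio_iommu_spapr_tce_create before the fix. */
struct tce_create_unpadded {
	uint32_t argsz;
	uint32_t flags;
	uint32_t page_shift;	/* 4 bytes of hidden padding follow here */
	uint64_t window_size;
	uint32_t levels;	/* and another 4 bytes here */
	uint64_t start_addr;
};

/* 4x 32bit + 2x 64bit = 32 bytes of payload, but 40 bytes of layout. */
_Static_assert(sizeof(struct tce_create_unpadded) == 40,
	       "u64 members are 8-byte aligned, so gcc pads the struct to 40");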

Signed-off-by: Alexey Kardashevskiy 
---
 include/uapi/linux/vfio.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index 9fd7b5d..d117233 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -568,8 +568,10 @@ struct vfio_iommu_spapr_tce_create {
__u32 flags;
/* in */
__u32 page_shift;
+   __u32 __resv1;
__u64 window_size;
__u32 levels;
+   __u32 __resv2;
/* out */
__u64 start_addr;
 };
-- 
2.5.0.rc3




Re: [PATCH kernel] rcu: Define lockless version of list_for_each_entry_rcu

2015-11-05 Thread Alexey Kardashevskiy

On 11/04/2015 01:39 AM, Steven Rostedt wrote:

On Tue,  3 Nov 2015 17:57:05 +1100
Alexey Kardashevskiy  wrote:


This defines list_for_each_entry_lockless. This allows safe list
traversing in cases when lockdep() invocation is unwanted like
real mode (MMU is off).

Signed-off-by: Alexey Kardashevskiy 
---

This is for VFIO acceleration in POWERKVM for pSeries guests.
There is a KVM instance. There also can be some VFIO (PCI passthru)
devices attached to a KVM guest.

To perform DMA, a pSeries guest registers DMA memory by calling some
hypercalls explicitly, at a rate close to one or two hcalls per
network packet, i.e. very often. When a guest does a hypercall
(which is just an assembly instruction), the host kernel receives it
in the real mode (MMU is off). When real mode fails to handle it,
it enables MMU and tries handling a hcall in virtual mode.

A logical bus ID (LIOBN) is the target id for these hypercalls.

Each VFIO device belongs to an IOMMU group. Each group has an address
translation table. It is allowed to have multiple IOMMU groups (i.e.
multiple tables) under the same LIOBN.

So effectively every DMA hcall has to update one or more TCE tables
attached to the same LIOBN. RCU is used to update/traverse this list
safely.

Using RCU as is in virtual mode is fine. Lockdep works, etc.
list_add_rcu() is used to populate the list;
list_del_rcu() + call_rcu() used to remove groups from a list.
These operations can happen at runtime as a result of PCI hotplug/unplug
in guests.

Using RCU as is in real mode is not fine as some RCU checks can lock up
the system and in real mode we won't even have a chance to see any
debug. This is why rcu_read_lock() and rcu_read_unlock() are NOT used.

The previous version of this used to define list_for_each_entry_rcu_notrace()
but it was proposed to use list_entry_lockless() instead. However,
the comment for lockless_dereference() suggests it is only a good idea
if "lifetime is managed by something other than RCU", but in my case the
lifetime is managed by RCU.

So what would be the correct approach here? Thanks.


If the only use case for this so far is in POWERKVM, perhaps it should
be defined specifically (and in arch/powerpc) and not confuse others
about using this.

Or, if you do imagine that this can be used in other scenarios, then a
much deeper comment must be made in the code in the kerneldoc section.
list_for_each_entry_rcu() should really be used in 99.99% of the time
in the kernel. This looks to be an extreme exception. I hate to add a
generic helper for something that will only be used in one location.



No, I really cannot imagine this, and I can move it somewhere in arch/powerpc.

Still, is my approach correct? What does the comment for
lockless_dereference() actually mean: that it won't work together with RCU at
all, or is it there to force people not to use it since "list_for_each_entry_rcu()
should really be used in 99.99% of the time"? :)






-- Steve



---
  include/linux/rculist.h | 16 
  1 file changed, 16 insertions(+)

diff --git a/include/linux/rculist.h b/include/linux/rculist.h
index 17c6b1f..a83a924 100644
--- a/include/linux/rculist.h
+++ b/include/linux/rculist.h
@@ -308,6 +308,22 @@ static inline void list_splice_init_rcu(struct list_head 
*list,
pos = list_entry_rcu(pos->member.next, typeof(*pos), member))

  /**
+ * list_for_each_entry_lockless - iterate over rcu list of given type
+ * @pos:   the type * to use as a loop cursor.
+ * @head:  the head for your list.
+ * @member:the name of the list_struct within the struct.
+ *
+ * This list-traversal primitive may safely run concurrently
+ */
+#define list_entry_lockless(ptr, type, member) \
+   container_of((typeof(ptr))lockless_dereference(ptr), type, member)
+
+#define list_for_each_entry_lockless(pos, head, member) \
+   for (pos = list_entry_lockless((head)->next, typeof(*pos), member); \
+   >member != (head); \
+   pos = list_entry_lockless(pos->member.next, typeof(*pos), 
member))
+
+/**
   * list_for_each_entry_continue_rcu - continue iteration over list of given 
type
   * @pos:  the type * to use as a loop cursor.
   * @head: the head for your list.





--
Alexey



[PATCH kernel] rcu: Define lockless version of list_for_each_entry_rcu

2015-11-02 Thread Alexey Kardashevskiy
This defines list_for_each_entry_lockless. This allows safe list
traversing in cases when lockdep() invocation is unwanted like
real mode (MMU is off).

Signed-off-by: Alexey Kardashevskiy 
---

This is for VFIO acceleration in POWERKVM for pSeries guests.
There is a KVM instance. There also can be some VFIO (PCI passthru)
devices attached to a KVM guest.

To perform DMA, a pSeries guest registers DMA memory by calling some
hypercalls explicitly, at a rate close to one or two hcalls per
network packet, i.e. very often. When a guest does a hypercall
(which is just an assembly instruction), the host kernel receives it
in the real mode (MMU is off). When real mode fails to handle it,
it enables MMU and tries handling a hcall in virtual mode.

A logical bus ID (LIOBN) is the target id for these hypercalls.

Each VFIO device belongs to an IOMMU group. Each group has an address
translation table. It is allowed to have multiple IOMMU groups (i.e.
multiple tables) under the same LIOBN.

So effectively every DMA hcall has to update one or more TCE tables
attached to the same LIOBN. RCU is used to update/traverse this list
safely.

Using RCU as is in virtual mode is fine. Lockdep works, etc.
list_add_rcu() is used to populate the list;
list_del_rcu() + call_rcu() used to remove groups from a list.
These operations can happen at runtime as a result of PCI hotplug/unplug
in guests.

Using RCU as is in real mode is not fine as some RCU checks can lock up
the system and in real mode we won't even have a chance to see any
debug. This is why rcu_read_lock() and rcu_read_unlock() are NOT used.

The previous version of this used to define list_for_each_entry_rcu_notrace()
but it was proposed to use list_entry_lockless() instead. However,
the comment for lockless_dereference() suggests it is only a good idea
if "lifetime is managed by something other than RCU", but in my case the
lifetime is managed by RCU.

So what would be the correct approach here? Thanks.
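
To make the intended use concrete, here is a minimal, hypothetical sketch of
walking such a list in real mode with the new helper; the struct and function
names are made up for illustration:

/* Hypothetical sketch: one TCE table per IOMMU group, chained per LIOBN. */
struct tce_iommu_group {
	struct list_head next;
	struct iommu_table *tbl;
};

/* Runs in real mode (MMU off), so rcu_read_lock()/lockdep must be avoided;
 * the list itself is still only modified with list_add_rcu()/list_del_rcu(). */
static void rm_update_tables(struct list_head *groups, unsigned long entry,
			     unsigned long tce)
{
	struct tce_iommu_group *g;

	list_for_each_entry_lockless(g, groups, next) {
		/* update g->tbl at @entry with @tce here */
	}
}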
---
 include/linux/rculist.h | 16 
 1 file changed, 16 insertions(+)

diff --git a/include/linux/rculist.h b/include/linux/rculist.h
index 17c6b1f..a83a924 100644
--- a/include/linux/rculist.h
+++ b/include/linux/rculist.h
@@ -308,6 +308,22 @@ static inline void list_splice_init_rcu(struct list_head 
*list,
pos = list_entry_rcu(pos->member.next, typeof(*pos), member))
 
 /**
+ * list_for_each_entry_lockless - iterate over rcu list of given type
+ * @pos:   the type * to use as a loop cursor.
+ * @head:  the head for your list.
+ * @member:the name of the list_struct within the struct.
+ *
+ * This list-traversal primitive may safely run concurrently
+ */
+#define list_entry_lockless(ptr, type, member) \
+   container_of((typeof(ptr))lockless_dereference(ptr), type, member)
+
+#define list_for_each_entry_lockless(pos, head, member) \
+   for (pos = list_entry_lockless((head)->next, typeof(*pos), member); \
+   >member != (head); \
+   pos = list_entry_lockless(pos->member.next, typeof(*pos), 
member))
+
+/**
  * list_for_each_entry_continue_rcu - continue iteration over list of given 
type
  * @pos:   the type * to use as a loop cursor.
  * @head:  the head for your list.
-- 
2.5.0.rc3




[PATCH kernel] rcu: Fix comment for rcu_dereference_raw_notrace

2015-11-01 Thread Alexey Kardashevskiy
rcu_dereference_raw() indirectly calls rcu_read_lock_held() while
rcu_dereference_raw_notrace() does not, so fix the comment about the latter.

Signed-off-by: Alexey Kardashevskiy 
---
 include/linux/rcupdate.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
index 581abf8..931d627 100644
--- a/include/linux/rcupdate.h
+++ b/include/linux/rcupdate.h
@@ -760,7 +760,7 @@ static inline void rcu_preempt_sleep_check(void)
  * The tracing infrastructure traces RCU (we want that), but unfortunately
  * some of the RCU checks causes tracing to lock up the system.
  *
- * The tracing version of rcu_dereference_raw() must not call
+ * The no-tracing version of rcu_dereference_raw() must not call
  * rcu_read_lock_held().
  */
 #define rcu_dereference_raw_notrace(p) __rcu_dereference_check((p), 1, __rcu)
-- 
2.5.0.rc3




Re: [PATCH 2/7 v2] powerpc/dma-mapping: override dma_get_page_shift

2015-10-27 Thread Alexey Kardashevskiy

On 10/28/2015 09:27 AM, Nishanth Aravamudan wrote:

On 27.10.2015 [17:02:16 +1100], Alexey Kardashevskiy wrote:

On 10/24/2015 07:57 AM, Nishanth Aravamudan wrote:

On Power, the kernel's page size can differ from the IOMMU's page size,
so we need to override the generic implementation, which always returns
the kernel's page size. Lookup the IOMMU's page size from struct
iommu_table, if available. Fallback to the kernel's page size,
otherwise.

Signed-off-by: Nishanth Aravamudan 
---
  arch/powerpc/include/asm/dma-mapping.h | 3 +++
  arch/powerpc/kernel/dma.c  | 9 +
  2 files changed, 12 insertions(+)

diff --git a/arch/powerpc/include/asm/dma-mapping.h 
b/arch/powerpc/include/asm/dma-mapping.h
index 7f522c0..c5638f4 100644
--- a/arch/powerpc/include/asm/dma-mapping.h
+++ b/arch/powerpc/include/asm/dma-mapping.h
@@ -125,6 +125,9 @@ static inline void set_dma_offset(struct device *dev, 
dma_addr_t off)
  #define HAVE_ARCH_DMA_SET_MASK 1
  extern int dma_set_mask(struct device *dev, u64 dma_mask);

+#define HAVE_ARCH_DMA_GET_PAGE_SHIFT 1
+extern unsigned long dma_get_page_shift(struct device *dev);
+
  #include 

  extern int __dma_set_mask(struct device *dev, u64 dma_mask);
diff --git a/arch/powerpc/kernel/dma.c b/arch/powerpc/kernel/dma.c
index 59503ed..e805af2 100644
--- a/arch/powerpc/kernel/dma.c
+++ b/arch/powerpc/kernel/dma.c
@@ -335,6 +335,15 @@ int dma_set_mask(struct device *dev, u64 dma_mask)
  }
  EXPORT_SYMBOL(dma_set_mask);

+unsigned long dma_get_page_shift(struct device *dev)
+{
+   struct iommu_table *tbl = get_iommu_table_base(dev);
+   if (tbl)
+   return tbl->it_page_shift;



All PCI devices have this initialized on POWER (at least on our, IBM's,
POWER) so 4K will always be returned here, while in the case of
(get_dma_ops(dev) == &dma_direct_ops) it could actually return
PAGE_SHIFT. Is 4K still the preferred value to return here?


Right, so the logic of my series goes like this:

a) We are currently assuming DMA_PAGE_SHIFT (a conceptual constant) is
PAGE_SHIFT everywhere, including Power.

b) After 2/7, the Power code will return either the IOMMU table's shift
value, if set, or PAGE_SHIFT (I guess this would be the case if
get_dma_ops(dev) == &dma_direct_ops, as you said). That is no different
from what we have now, except we can return the accurate IOMMU value if
available.


If it is not available, then something went wrong, and BUG_ON(!tbl ||
!tbl->it_page_shift) makes more sense here than pretending that this
function can ever return PAGE_SHIFT, imho.
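
A sketch of that stricter variant (not part of the posted patch; it just
restates the suggestion above using the helpers already shown in the diff):

unsigned long dma_get_page_shift(struct device *dev)
{
	struct iommu_table *tbl = get_iommu_table_base(dev);

	/* A device with no iommu_table or no page shift is a bug here,
	 * rather than a reason to silently fall back to PAGE_SHIFT. */
	BUG_ON(!tbl || !tbl->it_page_shift);

	return tbl->it_page_shift;
}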





3) After 3/7, the platform can override the generic Power
get_dma_page_shift().

4) After 4/7, pseries will return the DDW value, if available, then
fall back to the IOMMU table's value. I think in the case of
get_dma_ops(dev) == &dma_direct_ops, the only way that can happen is if we
are using DDW, right?


This is for pseries guests; for the powernv host it is a "bypass" mode 
which does 64bit direct DMA mapping and there is no additional window for 
that (i.e. DIRECT64_PROPNAME, etc).




--
Alexey


Re: [PATCH 2/7 v2] powerpc/dma-mapping: override dma_get_page_shift

2015-10-27 Thread Alexey Kardashevskiy

On 10/24/2015 07:57 AM, Nishanth Aravamudan wrote:

On Power, the kernel's page size can differ from the IOMMU's page size,
so we need to override the generic implementation, which always returns
the kernel's page size. Lookup the IOMMU's page size from struct
iommu_table, if available. Fallback to the kernel's page size,
otherwise.

Signed-off-by: Nishanth Aravamudan 
---
  arch/powerpc/include/asm/dma-mapping.h | 3 +++
  arch/powerpc/kernel/dma.c  | 9 +
  2 files changed, 12 insertions(+)

diff --git a/arch/powerpc/include/asm/dma-mapping.h 
b/arch/powerpc/include/asm/dma-mapping.h
index 7f522c0..c5638f4 100644
--- a/arch/powerpc/include/asm/dma-mapping.h
+++ b/arch/powerpc/include/asm/dma-mapping.h
@@ -125,6 +125,9 @@ static inline void set_dma_offset(struct device *dev, 
dma_addr_t off)
  #define HAVE_ARCH_DMA_SET_MASK 1
  extern int dma_set_mask(struct device *dev, u64 dma_mask);

+#define HAVE_ARCH_DMA_GET_PAGE_SHIFT 1
+extern unsigned long dma_get_page_shift(struct device *dev);
+
  #include 

  extern int __dma_set_mask(struct device *dev, u64 dma_mask);
diff --git a/arch/powerpc/kernel/dma.c b/arch/powerpc/kernel/dma.c
index 59503ed..e805af2 100644
--- a/arch/powerpc/kernel/dma.c
+++ b/arch/powerpc/kernel/dma.c
@@ -335,6 +335,15 @@ int dma_set_mask(struct device *dev, u64 dma_mask)
  }
  EXPORT_SYMBOL(dma_set_mask);

+unsigned long dma_get_page_shift(struct device *dev)
+{
+   struct iommu_table *tbl = get_iommu_table_base(dev);
+   if (tbl)
+   return tbl->it_page_shift;



All PCI devices have this initialized on POWER (at least on our, IBM's, POWER)
so 4K will always be returned here, while in the case of
(get_dma_ops(dev) == &dma_direct_ops) it could actually return PAGE_SHIFT. Is
4K still the preferred value to return here?





+   return PAGE_SHIFT;
+}
+EXPORT_SYMBOL(dma_get_page_shift);
+
  u64 __dma_get_required_mask(struct device *dev)
  {
struct dma_map_ops *dma_ops = get_dma_ops(dev);
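
As a side note, a sketch of how a caller might consume this helper (purely
illustrative, not taken from the series; the function below is made up):

/* Check that a buffer sits on the IOMMU mapping granularity the
 * platform reports for this device. */
static bool dma_buffer_aligned(struct device *dev, dma_addr_t addr, size_t size)
{
	unsigned long mask = (1UL << dma_get_page_shift(dev)) - 1;

	return !(addr & mask) && !(size & mask);
}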




--
Alexey


Re: [PATCH 4/7 v2] pseries/iommu: implement DDW-aware dma_get_page_shift

2015-10-26 Thread Alexey Kardashevskiy

On 10/24/2015 07:59 AM, Nishanth Aravamudan wrote:

When DDW (Dynamic DMA Windows) are present for a device, we have stored
the TCE (Translation Control Entry) size in a special device tree
property. Check if we have enabled DDW for the device and return the TCE
size from that property if present. If the property isn't present,
fallback to looking the value up in struct iommu_table. If we don't find
a iommu_table, fallback to the kernel's page size.

Signed-off-by: Nishanth Aravamudan 
---
  arch/powerpc/platforms/pseries/iommu.c | 36 ++
  1 file changed, 36 insertions(+)

diff --git a/arch/powerpc/platforms/pseries/iommu.c 
b/arch/powerpc/platforms/pseries/iommu.c
index 0946b98..1bf6471 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -1292,6 +1292,40 @@ static u64 dma_get_required_mask_pSeriesLP(struct device 
*dev)
return dma_iommu_ops.get_required_mask(dev);
  }

+static unsigned long dma_get_page_shift_pSeriesLP(struct device *dev)
+{
+   struct iommu_table *tbl;
+
+   if (!disable_ddw && dev_is_pci(dev)) {
+   struct pci_dev *pdev = to_pci_dev(dev);
+   struct device_node *dn;
+
+   dn = pci_device_to_OF_node(pdev);
+
+   /* search upwards for ibm,dma-window */
+   for (; dn && PCI_DN(dn) && !PCI_DN(dn)->table_group;
+   dn = dn->parent)
+   if (of_get_property(dn, "ibm,dma-window", NULL))
+   break;
+   /*
+* if there is a DDW configuration, the TCE shift is stored in
+* the property
+*/
+   if (dn && PCI_DN(dn)) {
+   const struct dynamic_dma_window_prop *direct64 =
+   of_get_property(dn, DIRECT64_PROPNAME, NULL);



This DIRECT64_PROPNAME property is only present under pHyp; QEMU/KVM does
not set it, as 64-bit windows are dynamic there, so something like
find_existing_ddw() needs to be used here.
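
For reference, the property payload decoded via be32_to_cpu(direct64->tce_shift)
above is described by this structure (roughly as defined in
arch/powerpc/platforms/pseries/iommu.c of this era; quoted from memory, so the
exact layout should be double-checked against the tree):

struct dynamic_dma_window_prop {
	__be32	liobn;		/* TCE table number */
	__be64	dma_base;	/* address hi,lo */
	__be32	tce_shift;	/* ilog2(tce_page_size) */
	__be32	window_shift;	/* ilog2(window_size) */
};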





+   if (direct64)
+   return be32_to_cpu(direct64->tce_shift);
+   }
+   }
+
+   tbl = get_iommu_table_base(dev);
+   if (tbl)
+   return tbl->it_page_shift;
+
+   return PAGE_SHIFT;
+}
+
  #else  /* CONFIG_PCI */
  #define pci_dma_bus_setup_pSeries NULL
  #define pci_dma_dev_setup_pSeries NULL
@@ -1299,6 +1333,7 @@ static u64 dma_get_required_mask_pSeriesLP(struct device 
*dev)
  #define pci_dma_dev_setup_pSeriesLP   NULL
  #define dma_set_mask_pSeriesLPNULL
  #define dma_get_required_mask_pSeriesLP   NULL
+#define dma_get_page_shift_pSeriesLP   NULL
  #endif /* !CONFIG_PCI */

  static int iommu_mem_notifier(struct notifier_block *nb, unsigned long action,
@@ -1395,6 +1430,7 @@ void iommu_init_early_pSeries(void)
pseries_pci_controller_ops.dma_dev_setup = 
pci_dma_dev_setup_pSeriesLP;
ppc_md.dma_set_mask = dma_set_mask_pSeriesLP;
ppc_md.dma_get_required_mask = dma_get_required_mask_pSeriesLP;
+   ppc_md.dma_get_page_shift = dma_get_page_shift_pSeriesLP;
} else {
pseries_pci_controller_ops.dma_bus_setup = 
pci_dma_bus_setup_pSeries;
pseries_pci_controller_ops.dma_dev_setup = 
pci_dma_dev_setup_pSeries;




--
Alexey


Re: [RFC PATCH kernel vfio] mm: vfio: Move pages out of CMA before pinning

2015-08-17 Thread Alexey Kardashevskiy

On 08/17/2015 05:45 PM, Vlastimil Babka wrote:

On 08/05/2015 10:08 AM, Alexey Kardashevskiy wrote:

This is about VFIO aka PCI passthrough used from QEMU.
KVM is irrelevant here.

QEMU is a machine emulator. It allocates guest RAM from anonymous memory
and these pages are movable which is ok. They may happen to be allocated
from the contiguous memory allocation zone (CMA). Which is also ok as
long they are movable.

However if the guest starts using VFIO (which can be hotplugged into
the guest), in most cases it involves DMA which requires guest RAM pages
to be pinned and not move once their addresses are programmed to
the hardware for DMA.

So we end up in a situation when quite many pages in CMA are not movable
anymore. And we get bunch of these:

[77306.513966] alloc_contig_range: [1f3800, 1f78c4) PFNs busy
[77306.514448] alloc_contig_range: [1f3800, 1f78c8) PFNs busy
[77306.514927] alloc_contig_range: [1f3800, 1f78cc) PFNs busy


IIRC CMA was for mobile devices and their camera/codec drivers and you
don't use QEMU on those? What do you need CMA for in your case?



I do not want QEMU to get memory from CMA; that is my point. It just
happens sometimes that the kernel allocates movable pages from there.






This is a very rough patch to start the conversation about how to move
pages properly. mm/page_alloc.c does this and
arch/powerpc/mm/mmu_context_iommu.c exploits it.


OK such conversation should probably start by mentioning the VM_PINNED
effort by Peter Zijlstra: https://lkml.org/lkml/2014/5/26/345

It's a more general approach to dealing with pinned pages, and moving them
out of CMA area (and compacting them in general) prior pinning is one of
the things that should be done within that framework.



And I assume these patches did not go anywhere, right?...




Then there's the effort to enable migrating pages other than LRU during
compaction (and thus CMA allocation): https://lwn.net/Articles/650864/
I don't know if that would be applicable in your use case, i.e. are the
pins for DMA short-lived and can the isolation/migration code wait a bit
for the transfer to finish so it can grab them, or something?



Pins for DMA are long-lived, pretty much as long as the guest is running. 
So this "compaction" is too late.





Please do not comment on the style and code placement,
this is just to give some context :)

Obviously, this does not work well - it manages to migrate only few pages
and crashes as it is missing locks/disabling interrupts and I probably
should not just remove pages from LRU list (normally, I guess, only these
can migrate) and a million of other things.

The questions are:

- what is the correct way of telling if the page is in CMA?
is (get_pageblock_migratetype(page) == MIGRATE_CMA) good enough?


Should be.
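
As a sketch, the check being agreed on here wrapped into a helper (the helper
name is made up; MIGRATE_CMA only exists with CONFIG_CMA enabled):

static inline bool page_is_cma(struct page *page)
{
	return get_pageblock_migratetype(page) == MIGRATE_CMA;
}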


- how to tell MM to move the page away? I am calling migrate_pages() with
a get_new_page callback which allocates a page with GFP_USER but without
GFP_MOVABLE, which should allocate the new page out of CMA; this seems ok, but
there is a little concern that we might want to add MOVABLE back when the
VFIO device is unplugged from the guest.
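
A sketch of that get_new_page callback (not from the posted patch; the function
name is made up and the signature is assumed to match new_page_t of this kernel
vintage, i.e. (struct page *, unsigned long, int **)); it would then be passed
as the get_new_page argument to migrate_pages():

/* Allocate the replacement page with GFP_USER, i.e. without __GFP_MOVABLE,
 * so the page allocator will not place it in a MIGRATE_CMA pageblock.
 * "private" and "resultp" are unused here. */
static struct page *mm_iommu_migrate_target(struct page *page,
		unsigned long private, int **resultp)
{
	return alloc_page(GFP_USER);
}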


Hmm, once the page is allocated, then the migratetype is not tracked
anywhere (except in page_owner debug data). But the unmovable allocations
might exhaust available unmovable pageblocks and lead to fragmentation. So
"add MOVABLE back" would be too late. Instead we would need to tell the
allocator somehow to give us movable page but outside of CMA.


If it is movable, why do we care if it is in CMA or not?


CMA's own
__alloc_contig_migrate_range() avoids this problem by allocating movable
pages, but the range has been already page-isolated and thus the allocator
won't see the pages there. You obviously can't take this approach and
isolate all CMA pageblocks like that.  That smells like a new __GFP_FLAG, meh.



I understood (more or less) all of it except the __GFP_FLAG - when/what 
would use it?





- do I need to isolate pages by using isolate_migratepages_range,
reclaim_clean_pages_from_list like __alloc_contig_migrate_range does?
I dropped them for now and the patch uses only @migratepages from
the compact_control struct.


You don't have to do reclaim_clean_pages_from_list(), but the isolation has
to be careful, yeah.



The isolation here means isolating the whole CMA zone, i.e. the approach I
"obviously can't take"? :)




- are there any flags in madvise() to address this (could not
locate any relevant)?


AFAIK there's no madvise(I_WILL_BE_PINNING_THIS_RANGE)


- what else is missing? disabled interrupts? locks?


See what isolate_migratepages_block() does.



Thanks for the pointers! I'll have a closer look at Peter's patchset.


--
Alexey


Re: [RFC PATCH kernel vfio] mm: vfio: Move pages out of CMA before pinning

2015-08-15 Thread Alexey Kardashevskiy

On 08/05/2015 06:08 PM, Alexey Kardashevskiy wrote:

This is about VFIO aka PCI passthrough used from QEMU.
KVM is irrelevant here.



Anyone, any idea? Or the question is way too stupid? :)




QEMU is a machine emulator. It allocates guest RAM from anonymous memory
and these pages are movable which is ok. They may happen to be allocated
from the contiguous memory allocation zone (CMA). Which is also ok as
long they are movable.

However if the guest starts using VFIO (which can be hotplugged into
the guest), in most cases it involves DMA which requires guest RAM pages
to be pinned and not move once their addresses are programmed to
the hardware for DMA.

So we end up in a situation when quite many pages in CMA are not movable
anymore. And we get bunch of these:

[77306.513966] alloc_contig_range: [1f3800, 1f78c4) PFNs busy
[77306.514448] alloc_contig_range: [1f3800, 1f78c8) PFNs busy
[77306.514927] alloc_contig_range: [1f3800, 1f78cc) PFNs busy

This is a very rough patch to start the conversation about how to move
pages properly. mm/page_alloc.c does this and
arch/powerpc/mm/mmu_context_iommu.c exploits it.

Please do not comment on the style and code placement,
this is just to give some context :)

Obviously, this does not work well - it manages to migrate only few pages
and crashes as it is missing locks/disabling interrupts and I probably
should not just remove pages from LRU list (normally, I guess, only these
can migrate) and a million of other things.

The questions are:

- what is the correct way of telling if the page is in CMA?
is (get_pageblock_migratetype(page) == MIGRATE_CMA) good enough?

- how to tell MM to move the page away? I am calling migrate_pages() with
a get_new_page callback which allocates a page with GFP_USER but without
GFP_MOVABLE, which should allocate the new page out of CMA; this seems ok, but
there is a little concern that we might want to add MOVABLE back when the
VFIO device is unplugged from the guest.

- do I need to isolate pages by using isolate_migratepages_range,
reclaim_clean_pages_from_list like __alloc_contig_migrate_range does?
I dropped them for now and the patch uses only @migratepages from
the compact_control struct.

- are there any flags in madvise() to address this (could not
locate any relevant)?

- what else is missing? disabled interrupts? locks?

Thanks!


Signed-off-by: Alexey Kardashevskiy 
---
  arch/powerpc/mm/mmu_context_iommu.c | 40 +++--
  mm/page_alloc.c | 36 +
  2 files changed, 70 insertions(+), 6 deletions(-)

diff --git a/arch/powerpc/mm/mmu_context_iommu.c 
b/arch/powerpc/mm/mmu_context_iommu.c
index da6a216..bf6850e 100644
--- a/arch/powerpc/mm/mmu_context_iommu.c
+++ b/arch/powerpc/mm/mmu_context_iommu.c
@@ -72,12 +72,15 @@ bool mm_iommu_preregistered(void)
  }
  EXPORT_SYMBOL_GPL(mm_iommu_preregistered);

+extern int mm_iommu_move_page(unsigned long pfn);
+
  long mm_iommu_get(unsigned long ua, unsigned long entries,
struct mm_iommu_table_group_mem_t **pmem)
  {
struct mm_iommu_table_group_mem_t *mem;
long i, j, ret = 0, locked_entries = 0;
struct page *page = NULL;
+   unsigned long moved = 0, tried = 0;

if (!current || !current->mm)
return -ESRCH; /* process exited */
@@ -122,15 +125,29 @@ long mm_iommu_get(unsigned long ua, unsigned long entries,
}

for (i = 0; i < entries; ++i) {
+   unsigned long pfn;
+
if (1 != get_user_pages_fast(ua + (i << PAGE_SHIFT),
1/* pages */, 1/* iswrite */, &page)) {
-   for (j = 0; j < i; ++j)
-   put_page(pfn_to_page(
-   mem->hpas[j] >> PAGE_SHIFT));
-   vfree(mem->hpas);
-   kfree(mem);
ret = -EFAULT;
-   goto unlock_exit;
+   goto put_exit;
+   }
+
+   pfn = page_to_pfn(page);
+   if (get_pageblock_migratetype(page) == MIGRATE_CMA)
+   {
+   unsigned long pfnold = pfn;
+   put_page(page);
+   page = NULL;
+   mm_iommu_move_page(pfn);
+   if (1 != get_user_pages_fast(ua + (i << PAGE_SHIFT),
+   1/* pages */, 1/* iswrite */, &page)) {
+   ret = -EFAULT;
+   goto put_exit;
+   }
+   pfn = page_to_pfn(page);
+   if (pfn != pfnold)
+   ++moved;
}

mem->hpas[i] = page_to_pfn(page) << PAGE_SHIFT;
@@ -144,6 +161,17 @@ long mm_iommu_get(unsigned long ua, unsigned long entries,

list_add_rc

[RFC PATCH kernel vfio] mm: vfio: Move pages out of CMA before pinning

2015-08-05 Thread Alexey Kardashevskiy
This is about VFIO aka PCI passthrough used from QEMU.
KVM is irrelevant here.

QEMU is a machine emulator. It allocates guest RAM from anonymous memory
and these pages are movable which is ok. They may happen to be allocated
from the contiguous memory allocation zone (CMA). Which is also ok as
long they are movable.

However if the guest starts using VFIO (which can be hotplugged into
the guest), in most cases it involves DMA which requires guest RAM pages
to be pinned and not move once their addresses are programmed to
the hardware for DMA.

So we end up in a situation when quite many pages in CMA are not movable
anymore. And we get bunch of these:

[77306.513966] alloc_contig_range: [1f3800, 1f78c4) PFNs busy
[77306.514448] alloc_contig_range: [1f3800, 1f78c8) PFNs busy
[77306.514927] alloc_contig_range: [1f3800, 1f78cc) PFNs busy

This is a very rough patch to start the conversation about how to move
pages properly. mm/page_alloc.c does this and
arch/powerpc/mm/mmu_context_iommu.c exploits it.

Please do not comment on the style and code placement,
this is just to give some context :)

Obviously, this does not work well - it manages to migrate only few pages
and crashes as it is missing locks/disabling interrupts and I probably
should not just remove pages from LRU list (normally, I guess, only these
can migrate) and a million of other things.

The questions are:

- what is the correct way of telling if the page is in CMA?
is (get_pageblock_migratetype(page) == MIGRATE_CMA) good enough?

- how to tell MM to move the page away? I am calling migrate_pages() with
a get_new_page callback which allocates a page with GFP_USER but without
GFP_MOVABLE, which should allocate the new page out of CMA; this seems ok, but
there is a little concern that we might want to add MOVABLE back when the
VFIO device is unplugged from the guest.

- do I need to isolate pages by using isolate_migratepages_range,
reclaim_clean_pages_from_list like __alloc_contig_migrate_range does?
I dropped them for now and the patch uses only @migratepages from
the compact_control struct.

- are there any flags in madvise() to address this (could not
locate any relevant)?

- what else is missing? disabled interrupts? locks?

Thanks!


Signed-off-by: Alexey Kardashevskiy 
---
 arch/powerpc/mm/mmu_context_iommu.c | 40 +++--
 mm/page_alloc.c | 36 +
 2 files changed, 70 insertions(+), 6 deletions(-)

diff --git a/arch/powerpc/mm/mmu_context_iommu.c 
b/arch/powerpc/mm/mmu_context_iommu.c
index da6a216..bf6850e 100644
--- a/arch/powerpc/mm/mmu_context_iommu.c
+++ b/arch/powerpc/mm/mmu_context_iommu.c
@@ -72,12 +72,15 @@ bool mm_iommu_preregistered(void)
 }
 EXPORT_SYMBOL_GPL(mm_iommu_preregistered);
 
+extern int mm_iommu_move_page(unsigned long pfn);
+
 long mm_iommu_get(unsigned long ua, unsigned long entries,
struct mm_iommu_table_group_mem_t **pmem)
 {
struct mm_iommu_table_group_mem_t *mem;
long i, j, ret = 0, locked_entries = 0;
struct page *page = NULL;
+   unsigned long moved = 0, tried = 0;
 
if (!current || !current->mm)
return -ESRCH; /* process exited */
@@ -122,15 +125,29 @@ long mm_iommu_get(unsigned long ua, unsigned long entries,
}
 
for (i = 0; i < entries; ++i) {
+   unsigned long pfn;
+
if (1 != get_user_pages_fast(ua + (i << PAGE_SHIFT),
1/* pages */, 1/* iswrite */, &page)) {
-   for (j = 0; j < i; ++j)
-   put_page(pfn_to_page(
-   mem->hpas[j] >> PAGE_SHIFT));
-   vfree(mem->hpas);
-   kfree(mem);
ret = -EFAULT;
-   goto unlock_exit;
+   goto put_exit;
+   }
+
+   pfn = page_to_pfn(page);
+   if (get_pageblock_migratetype(page) == MIGRATE_CMA)
+   {
+   unsigned long pfnold = pfn;
+   put_page(page);
+   page = NULL;
+   mm_iommu_move_page(pfn);
+   if (1 != get_user_pages_fast(ua + (i << PAGE_SHIFT),
+   1/* pages */, 1/* iswrite */, &page)) {
+   ret = -EFAULT;
+   goto put_exit;
+   }
+   pfn = page_to_pfn(page);
+   if (pfn != pfnold)
+   ++moved;
}
 
mem->hpas[i] = page_to_pfn(page) << PAGE_SHIFT;
@@ -144,6 +161,17 @@ long mm_iommu_get(unsigned long ua, unsigned long entries,
 
list_add_rcu(&mem->next, &current->mm->context.iommu_group_mem_list);
 
+   printk("***K*** %s %u: tried %lu, moved %lu of %l

Re: [PATCH kernel] powerpc/powernv/ioda2: Fix calculation for memory allocated for TCE table

2015-07-23 Thread Alexey Kardashevskiy

On 07/21/2015 04:24 PM, Michael Ellerman wrote:

On Mon, 2015-07-20 at 20:45 +1000, Alexey Kardashevskiy wrote:

The existing code stores the amount of memory allocated for a TCE table.
At the moment it uses @offset which is a virtual offset in the TCE table
which is only correct for one-level tables and does not include
memory allocated for intermediate levels. When a multilevel TCE table is
requested, the WARN_ON in tce_iommu_create_table() prints a warning.

This adds an additional counter to pnv_pci_ioda2_table_do_alloc_pages()
to count actually allocated memory.

Signed-off-by: Alexey Kardashevskiy 
---

I was sure I sent it already but could not find it anywhere so reposting.
Sorry if it is a duplicate.


Stable?


No, this is for the DDW patchset I recently posted, which has not made it into
any release yet.




--
Alexey

