[PATCH kernel v2 2/2] powerpc/mm/iommu: Put pages on process exit
At the moment the VFIO IOMMU SPAPR v2 driver pins all guest RAM pages when
the userspace starts using VFIO. When the userspace process finishes, all
the pinned pages need to be put; this is done as a part of the userspace
memory context (MM) destruction which happens on the very last mmdrop().

This approach has a problem that a MM of the userspace process may live
longer than the userspace process itself, as kernel threads use the MM of
whatever userspace process was running on the CPU where the kernel thread
got scheduled. If this happens, the MM remains referenced until that exact
kernel thread wakes up again and releases the very last reference to the
MM; on an idle system this can take hours.

This references and caches the MM once per container and adds tracking of
how many times each preregistered area was registered in a specific
container. This way we do not depend on @current pointing to a valid task
descriptor.

This changes the userspace interface to return EBUSY if memory is already
registered (mm_iommu_get() used to increment the counter); however it
should not have any practical effect as the only userspace tool available
now registers a memory area once per container anyway.

As tce_iommu_register_pages/tce_iommu_unregister_pages are called under
container->lock, this does not need additional locking.
Cc: David Gibson <da...@gibson.dropbear.id.au>
Cc: Benjamin Herrenschmidt <b...@kernel.crashing.org>
Cc: Paul Mackerras <pau...@samba.org>
Cc: Balbir Singh <bsinghar...@gmail.com>
Cc: Nicholas Piggin <npig...@gmail.com>
Signed-off-by: Alexey Kardashevskiy <a...@ozlabs.ru>
---
 arch/powerpc/include/asm/mmu_context.h |  1 -
 arch/powerpc/mm/mmu_context_book3s64.c |  4 ----
 arch/powerpc/mm/mmu_context_iommu.c    | 10 ----------
 drivers/vfio/vfio_iommu_spapr_tce.c    | 52 ++++++++++++++++++++++++++++++++-
 4 files changed, 51 insertions(+), 16 deletions(-)

diff --git a/arch/powerpc/include/asm/mmu_context.h b/arch/powerpc/include/asm/mmu_context.h
index 745b4bd..90338fd 100644
--- a/arch/powerpc/include/asm/mmu_context.h
+++ b/arch/powerpc/include/asm/mmu_context.h
@@ -24,7 +24,6 @@ extern long mm_iommu_get(struct mm_struct *mm, unsigned long ua, unsigned long e
 extern long mm_iommu_put(struct mm_struct *mm,
 		struct mm_iommu_table_group_mem_t *mem);
 extern void mm_iommu_init(mm_context_t *ctx);
-extern void mm_iommu_cleanup(mm_context_t *ctx);
 extern struct mm_iommu_table_group_mem_t *mm_iommu_lookup(struct mm_struct *mm,
 		unsigned long ua, unsigned long size);
 extern struct mm_iommu_table_group_mem_t *mm_iommu_find(struct mm_struct *mm,
diff --git a/arch/powerpc/mm/mmu_context_book3s64.c b/arch/powerpc/mm/mmu_context_book3s64.c
index 227b2a6..5c67d1c 100644
--- a/arch/powerpc/mm/mmu_context_book3s64.c
+++ b/arch/powerpc/mm/mmu_context_book3s64.c
@@ -159,10 +159,6 @@ static inline void destroy_pagetable_page(struct mm_struct *mm)
 void destroy_context(struct mm_struct *mm)
 {
-#ifdef CONFIG_SPAPR_TCE_IOMMU
-	mm_iommu_cleanup(&mm->context);
-#endif
-
 #ifdef CONFIG_PPC_ICSWX
 	drop_cop(mm->context.acop, mm);
 	kfree(mm->context.cop_lockp);
diff --git a/arch/powerpc/mm/mmu_context_iommu.c b/arch/powerpc/mm/mmu_context_iommu.c
index 65086bf..901773d 100644
--- a/arch/powerpc/mm/mmu_context_iommu.c
+++ b/arch/powerpc/mm/mmu_context_iommu.c
@@ -293,13 +293,3 @@ void mm_iommu_init(mm_context_t *ctx)
 {
 	INIT_LIST_HEAD_RCU(&ctx->iommu_group_mem_list);
 }
-
-void mm_iommu_cleanup(mm_context_t *ctx)
-{
-	struct mm_iommu_table_group_mem_t *mem, *tmp;
-
-	list_for_each_entry_safe(mem, tmp, &ctx->iommu_group_mem_list, next) {
-		list_del_rcu(&mem->next);
-		mm_iommu_do_free(mem);
-	}
-}
diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
index 9752e77..40e71a0 100644
--- a/drivers/vfio/vfio_iommu_spapr_tce.c
+++ b/drivers/vfio/vfio_iommu_spapr_tce.c
@@ -89,6 +89,15 @@ struct tce_iommu_group {
 };

 /*
+ * A container needs to remember which preregistered areas and how many times
+ * it has referenced to do proper cleanup at the userspace process exit.
+ */
+struct tce_iommu_prereg {
+	struct list_head next;
+	struct mm_iommu_table_group_mem_t *mem;
+};
+
+/*
  * The container descriptor supports only a single group per container.
  * Required by the API as the container is not supplied with the IOMMU group
  * at the moment of initialization.
@@ -101,12 +110,26 @@ struct tce_container {
 	struct mm_struct *mm;
 	struct iommu_table *tables[IOMMU_TABLE_GROUP_MAX_TABLES];
 	struct list_head group_list;
+	struct list_head prereg_list;
 };

+static long tce_iommu_prereg_free(struct tce_container *container,
+		struct tce_iommu_prereg *tcemem)
+{
+	long ret;
+
+	list_del(&tcemem->next);
+	ret = mm_iommu_put(container->mm, tcemem->mem);
+	kfree(tcemem);
+
[PATCH kernel v2 1/2] powerpc/iommu: Stop using @current in mm_iommu_xxx
In some situations the userspace memory context may live longer than the
userspace process itself, so if we need to do proper memory context cleanup,
we better cache @mm and use it later when the process is gone (@current or
@current->mm is NULL).

This changes the mm_iommu_xxx API to receive an mm_struct instead of using
the one from @current. This is needed by the following patch to do proper
cleanup in time.

This depends on the "powerpc/powernv/ioda: Fix endianness when reading TCEs"
patch to do proper cleanup via tce_iommu_clear().

This should cause no behavioral change.

Signed-off-by: Alexey Kardashevskiy <a...@ozlabs.ru>
---
 arch/powerpc/include/asm/mmu_context.h | 15 ++++++-----
 arch/powerpc/mm/mmu_context_iommu.c    | 45 +++++++++++++-------------
 drivers/vfio/vfio_iommu_spapr_tce.c    | 41 ++++++++++++------------
 3 files changed, 51 insertions(+), 50 deletions(-)

diff --git a/arch/powerpc/include/asm/mmu_context.h b/arch/powerpc/include/asm/mmu_context.h
index 9d2cd0c..745b4bd 100644
--- a/arch/powerpc/include/asm/mmu_context.h
+++ b/arch/powerpc/include/asm/mmu_context.h
@@ -18,16 +18,17 @@ extern void destroy_context(struct mm_struct *mm);
 #ifdef CONFIG_SPAPR_TCE_IOMMU
 struct mm_iommu_table_group_mem_t;

-extern bool mm_iommu_preregistered(void);
-extern long mm_iommu_get(unsigned long ua, unsigned long entries,
+extern bool mm_iommu_preregistered(struct mm_struct *mm);
+extern long mm_iommu_get(struct mm_struct *mm, unsigned long ua, unsigned long entries,
 		struct mm_iommu_table_group_mem_t **pmem);
-extern long mm_iommu_put(struct mm_iommu_table_group_mem_t *mem);
+extern long mm_iommu_put(struct mm_struct *mm,
+		struct mm_iommu_table_group_mem_t *mem);
 extern void mm_iommu_init(mm_context_t *ctx);
 extern void mm_iommu_cleanup(mm_context_t *ctx);
-extern struct mm_iommu_table_group_mem_t *mm_iommu_lookup(unsigned long ua,
-		unsigned long size);
-extern struct mm_iommu_table_group_mem_t *mm_iommu_find(unsigned long ua,
-		unsigned long entries);
+extern struct mm_iommu_table_group_mem_t *mm_iommu_lookup(struct mm_struct *mm,
+		unsigned long ua, unsigned long size);
+extern struct mm_iommu_table_group_mem_t *mm_iommu_find(struct mm_struct *mm,
+		unsigned long ua, unsigned long entries);
 extern long mm_iommu_ua_to_hpa(struct mm_iommu_table_group_mem_t *mem,
 		unsigned long ua, unsigned long *hpa);
 extern long mm_iommu_mapped_inc(struct mm_iommu_table_group_mem_t *mem);
diff --git a/arch/powerpc/mm/mmu_context_iommu.c b/arch/powerpc/mm/mmu_context_iommu.c
index da6a216..65086bf 100644
--- a/arch/powerpc/mm/mmu_context_iommu.c
+++ b/arch/powerpc/mm/mmu_context_iommu.c
@@ -53,7 +53,7 @@ static long mm_iommu_adjust_locked_vm(struct mm_struct *mm,
 	}

 	pr_debug("[%d] RLIMIT_MEMLOCK HASH64 %c%ld %ld/%ld\n",
-			current->pid,
+			current ? current->pid : 0,
 			incr ? '+' : '-',
 			npages << PAGE_SHIFT,
 			mm->locked_vm << PAGE_SHIFT,
@@ -63,28 +63,22 @@ static long mm_iommu_adjust_locked_vm(struct mm_struct *mm,
 	return ret;
 }

-bool mm_iommu_preregistered(void)
+bool mm_iommu_preregistered(struct mm_struct *mm)
 {
-	if (!current || !current->mm)
-		return false;
-
-	return !list_empty(&current->mm->context.iommu_group_mem_list);
+	return !list_empty(&mm->context.iommu_group_mem_list);
 }
 EXPORT_SYMBOL_GPL(mm_iommu_preregistered);

-long mm_iommu_get(unsigned long ua, unsigned long entries,
+long mm_iommu_get(struct mm_struct *mm, unsigned long ua, unsigned long entries,
 		struct mm_iommu_table_group_mem_t **pmem)
 {
 	struct mm_iommu_table_group_mem_t *mem;
 	long i, j, ret = 0, locked_entries = 0;
 	struct page *page = NULL;

-	if (!current || !current->mm)
-		return -ESRCH; /* process exited */
-
 	mutex_lock(&mem_list_mutex);

-	list_for_each_entry_rcu(mem, &current->mm->context.iommu_group_mem_list,
+	list_for_each_entry_rcu(mem, &mm->context.iommu_group_mem_list,
 			next) {
 		if ((mem->ua == ua) && (mem->entries == entries)) {
 			++mem->used;
@@ -102,7 +96,7 @@ long mm_iommu_get(unsigned long ua, unsigned long entries,
 	}

-	ret = mm_iommu_adjust_locked_vm(current->mm, entries, true);
+	ret = mm_iommu_adjust_locked_vm(mm, entries, true);
 	if (ret)
 		goto unlock_exit;

@@ -142,11 +136,11 @@ long mm_iommu_get(unsigned long ua, unsigned long entries,
 	mem->entries = entries;
 	*pmem = mem;

-	list_add_rcu(&mem->next, &current->mm->context.iommu_group_mem_list);
+	list_add_rcu(&mem->next, &mm->context.iommu_group_mem_list);

 unlock_exit:
 	if (locked_entries && ret)
-
[PATCH kernel v2 0/2] powerpc/mm/iommu: Put pages on process exit
This is a fix for a bug where guest memory stays Active after the QEMU
process has exited. This happened because the QEMU memory context was not
released within a short period of time after the QEMU process exited.

More details are in the commit logs.

Please comment. Thanks.

Alexey Kardashevskiy (2):
  powerpc/iommu: Stop using @current in mm_iommu_xxx
  powerpc/mm/iommu: Put pages on process exit

 arch/powerpc/include/asm/mmu_context.h | 16 +++---
 arch/powerpc/mm/mmu_context_book3s64.c |  4 --
 arch/powerpc/mm/mmu_context_iommu.c    | 55 +++++++--------
 drivers/vfio/vfio_iommu_spapr_tce.c    | 89 ++++++++++++----------
 4 files changed, 100 insertions(+), 64 deletions(-)

-- 
2.5.0.rc3
Re: [PATCH kernel] powerpc/mm/iommu: Put pages on process exit
On 14/07/16 21:52, Alexey Kardashevskiy wrote:
> On 14/07/16 20:12, Balbir Singh wrote:
>> On Thu, Jul 14, 2016 at 3:16 PM, Alexey Kardashevskiy <a...@ozlabs.ru> wrote:
>>> At the moment VFIO IOMMU SPAPR v2 driver pins all guest RAM pages when
>>> the userspace starts using VFIO. When the userspace process finishes,
>>> all the pinned pages need to be put; this is done as a part of
>>> the userspace memory context (MM) destruction which happens on
>>> the very last mmdrop().
>>>
>>> This approach has a problem that a MM of the userspace process
>>> may live longer than the userspace process itself as kernel threads
>>> usually execute on a MM of a userspace process which was running
>>> on a CPU where the kernel thread was scheduled to. If this happened,
>>> the MM remains referenced until this exact kernel thread wakes up again
>>> and releases the very last reference to the MM, on an idle system this
>>> can take even hours.
>>>
>>> This fixes the issue by moving mm_iommu_cleanup() (the helper which puts
>>> pages) from destroy_context() (called on the last mmdrop()) to
>>> the arch-specific arch_exit_mmap() hook (called on the last mmput()).
>>> mmdrop() decrements the mm->mm_count which is a total reference number;
>>> mmput() decrements the mm->mm_users which is a number of user spaces and
>>> this is actually the counter we want to watch for here.
>>>
>>> Cc: David Gibson <da...@gibson.dropbear.id.au>
>>> Cc: Benjamin Herrenschmidt <b...@kernel.crashing.org>
>>> Cc: Paul Mackerras <pau...@samba.org>
>>> Cc: Balbir Singh <bsinghar...@gmail.com>
>>> Cc: Nick Piggin <npig...@kernel.dk>
>>> Signed-off-by: Alexey Kardashevskiy <a...@ozlabs.ru>
>>> ---
>>>  arch/powerpc/include/asm/mmu_context.h | 3 +++
>>>  arch/powerpc/mm/mmu_context_book3s64.c | 4 ----
>>>  2 files changed, 3 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/arch/powerpc/include/asm/mmu_context.h
>>> b/arch/powerpc/include/asm/mmu_context.h
>>> index 9d2cd0c..24b590d 100644
>>> --- a/arch/powerpc/include/asm/mmu_context.h
>>> +++ b/arch/powerpc/include/asm/mmu_context.h
>>> @@ -138,6 +138,9 @@ static inline void arch_dup_mmap(struct mm_struct
>>> *oldmm,
>>>
>>>  static inline void arch_exit_mmap(struct mm_struct *mm)
>>>  {
>>> +#ifdef CONFIG_SPAPR_TCE_IOMMU
>>> +	mm_iommu_cleanup(&mm->context);
>>> +#endif
>>>  }
>>>
>>
>> The assumption is that all necessary tear down has happened. Earlier
>> we were doing the teardown on the last mm ref drop; do we have any
>> dependence on that? I don't think so, but just checking.
>>
>> Does qemu do tce_iommu_register_pages on exit()/atexit() or do we rely
>> on the exit path to do the teardown?
>
> QEMU does call unregister when a memory region is deleted. Handling it in
> arch_exit_mmap() is for the case when a QEMU process just dies for whatever
> reason.

After a second thought, the patch is not correct: before, it would actually
put pages when all references to the MM were dropped, and many of those are
held by file descriptors. With this patch, pages are put before the files
are closed, which means the devices will not have been reset yet and may be
doing DMA to memory which is no longer pinned while there are still TCEs
pointing to it.

I will make another patch, with additional mmget()/mmput().

>> Can we please remove the #ifdef CONFIG_SPAPR_TCE_IOMMU here and have a
>> version of mm_iommu_cleanup that handles it
>
> That would be a change unrelated to the fix. And I just do not see the
> point in this - mm_iommu_cleanup() is declared in the very same header.
>
>> Basically do
>>
>> #ifdef CONFIG_SPAPR_TCE_IOMMU
>> extern void mm_iommu_cleanup(void *ptr)
>> #else
>> static inline mm_iommu_cleanup(void *ptr) {}
>> #endif
>>
>> Otherwise looks good
>>
>> Acked-by: Balbir Singh <bsinghar...@gmail.com>
>
> Thanks!
>
> --
> Alexey
Re: [PATCH kernel] powerpc/mm/iommu: Put pages on process exit
On 14/07/16 20:12, Balbir Singh wrote:
> On Thu, Jul 14, 2016 at 3:16 PM, Alexey Kardashevskiy <a...@ozlabs.ru> wrote:
>> At the moment VFIO IOMMU SPAPR v2 driver pins all guest RAM pages when
>> the userspace starts using VFIO. When the userspace process finishes,
>> all the pinned pages need to be put; this is done as a part of
>> the userspace memory context (MM) destruction which happens on
>> the very last mmdrop().
>>
>> This approach has a problem that a MM of the userspace process
>> may live longer than the userspace process itself as kernel threads
>> usually execute on a MM of a userspace process which was running
>> on a CPU where the kernel thread was scheduled to. If this happened,
>> the MM remains referenced until this exact kernel thread wakes up again
>> and releases the very last reference to the MM, on an idle system this
>> can take even hours.
>>
>> This fixes the issue by moving mm_iommu_cleanup() (the helper which puts
>> pages) from destroy_context() (called on the last mmdrop()) to
>> the arch-specific arch_exit_mmap() hook (called on the last mmput()).
>> mmdrop() decrements the mm->mm_count which is a total reference number;
>> mmput() decrements the mm->mm_users which is a number of user spaces and
>> this is actually the counter we want to watch for here.
>>
>> Cc: David Gibson <da...@gibson.dropbear.id.au>
>> Cc: Benjamin Herrenschmidt <b...@kernel.crashing.org>
>> Cc: Paul Mackerras <pau...@samba.org>
>> Cc: Balbir Singh <bsinghar...@gmail.com>
>> Cc: Nick Piggin <npig...@kernel.dk>
>> Signed-off-by: Alexey Kardashevskiy <a...@ozlabs.ru>
>> ---
>>  arch/powerpc/include/asm/mmu_context.h | 3 +++
>>  arch/powerpc/mm/mmu_context_book3s64.c | 4 ----
>>  2 files changed, 3 insertions(+), 4 deletions(-)
>>
>> diff --git a/arch/powerpc/include/asm/mmu_context.h
>> b/arch/powerpc/include/asm/mmu_context.h
>> index 9d2cd0c..24b590d 100644
>> --- a/arch/powerpc/include/asm/mmu_context.h
>> +++ b/arch/powerpc/include/asm/mmu_context.h
>> @@ -138,6 +138,9 @@ static inline void arch_dup_mmap(struct mm_struct *oldmm,
>>
>>  static inline void arch_exit_mmap(struct mm_struct *mm)
>>  {
>> +#ifdef CONFIG_SPAPR_TCE_IOMMU
>> +	mm_iommu_cleanup(&mm->context);
>> +#endif
>>  }
>>
>
> The assumption is that all necessary tear down has happened. Earlier
> we were doing the teardown on the last mm ref drop; do we have any
> dependence on that? I don't think so, but just checking.
>
> Does qemu do tce_iommu_register_pages on exit()/atexit() or do we rely
> on the exit path to do the teardown?

QEMU does call unregister when a memory region is deleted. Handling it in
arch_exit_mmap() is for the case when a QEMU process just dies for whatever
reason.

> Can we please remove the #ifdef CONFIG_SPAPR_TCE_IOMMU here and have a
> version of mm_iommu_cleanup that handles it

That would be a change unrelated to the fix. And I just do not see the
point in this - mm_iommu_cleanup() is declared in the very same header.

> Basically do
>
> #ifdef CONFIG_SPAPR_TCE_IOMMU
> extern void mm_iommu_cleanup(void *ptr)
> #else
> static inline mm_iommu_cleanup(void *ptr) {}
> #endif
>
> Otherwise looks good
>
> Acked-by: Balbir Singh <bsinghar...@gmail.com>

Thanks!

-- 
Alexey
[PATCH kernel] powerpc/mm/iommu: Put pages on process exit
At the moment the VFIO IOMMU SPAPR v2 driver pins all guest RAM pages when
the userspace starts using VFIO. When the userspace process finishes, all
the pinned pages need to be put; this is done as a part of the userspace
memory context (MM) destruction which happens on the very last mmdrop().

This approach has a problem that a MM of the userspace process may live
longer than the userspace process itself, as kernel threads usually execute
on the MM of a userspace process which was running on the CPU where the
kernel thread got scheduled. If this happens, the MM remains referenced
until that exact kernel thread wakes up again and releases the very last
reference to the MM; on an idle system this can take hours.

This fixes the issue by moving mm_iommu_cleanup() (the helper which puts
pages) from destroy_context() (called on the last mmdrop()) to the
arch-specific arch_exit_mmap() hook (called on the last mmput()). mmdrop()
decrements mm->mm_count, which is the total reference count; mmput()
decrements mm->mm_users, which is the number of userspace users and is
the counter we actually want to watch here.
Cc: David Gibson <da...@gibson.dropbear.id.au>
Cc: Benjamin Herrenschmidt <b...@kernel.crashing.org>
Cc: Paul Mackerras <pau...@samba.org>
Cc: Balbir Singh <bsinghar...@gmail.com>
Cc: Nick Piggin <npig...@kernel.dk>
Signed-off-by: Alexey Kardashevskiy <a...@ozlabs.ru>
---
 arch/powerpc/include/asm/mmu_context.h | 3 +++
 arch/powerpc/mm/mmu_context_book3s64.c | 4 ----
 2 files changed, 3 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/include/asm/mmu_context.h b/arch/powerpc/include/asm/mmu_context.h
index 9d2cd0c..24b590d 100644
--- a/arch/powerpc/include/asm/mmu_context.h
+++ b/arch/powerpc/include/asm/mmu_context.h
@@ -138,6 +138,9 @@ static inline void arch_dup_mmap(struct mm_struct *oldmm,

 static inline void arch_exit_mmap(struct mm_struct *mm)
 {
+#ifdef CONFIG_SPAPR_TCE_IOMMU
+	mm_iommu_cleanup(&mm->context);
+#endif
 }

 static inline void arch_unmap(struct mm_struct *mm,
diff --git a/arch/powerpc/mm/mmu_context_book3s64.c b/arch/powerpc/mm/mmu_context_book3s64.c
index 1962..aaeba74 100644
--- a/arch/powerpc/mm/mmu_context_book3s64.c
+++ b/arch/powerpc/mm/mmu_context_book3s64.c
@@ -159,10 +159,6 @@ static inline void destroy_pagetable_page(struct mm_struct *mm)
 void destroy_context(struct mm_struct *mm)
 {
-#ifdef CONFIG_SPAPR_TCE_IOMMU
-	mm_iommu_cleanup(&mm->context);
-#endif
-
 #ifdef CONFIG_PPC_ICSWX
 	drop_cop(mm->context.acop, mm);
 	kfree(mm->context.cop_lockp);
-- 
2.5.0.rc3
[PATCH kernel] powerpc/mm/iommu: Put pages on process exit
At the moment VFIO IOMMU SPAPR v2 driver pins all guest RAM pages when the userspace starts using VFIO. When the userspace process finishes, all the pinned pages need to be put; this is done as a part of the userspace memory context (MM) destruction which happens on the very last mmdrop(). This approach has a problem that a MM of the userspace process may live longer than the userspace process itself as kernel threads usually execute on a MM of a userspace process which was runnning on a CPU where the kernel thread was scheduled to. If this happened, the MM remains referenced until this exact kernel thread wakes up again and releases the very last reference to the MM, on an idle system this can take even hours. This fixes the issue by moving mm_iommu_cleanup() (the helper which puts pages) from destroy_context() (called on the last mmdrop()) to the arch-specific arch_exit_mmap() hook (called on the last mmput()). mmdrop() decrements the mm->mm_count which is a total reference number; mmput() decrements the mm->mm_users which is a number of user spaces and this is actually the counter we want to watch for here. 
Cc: David Gibson Cc: Benjamin Herrenschmidt Cc: Paul Mackerras Cc: Balbir Singh Cc: Nick Piggin Signed-off-by: Alexey Kardashevskiy --- arch/powerpc/include/asm/mmu_context.h | 3 +++ arch/powerpc/mm/mmu_context_book3s64.c | 4 ---- 2 files changed, 3 insertions(+), 4 deletions(-) diff --git a/arch/powerpc/include/asm/mmu_context.h b/arch/powerpc/include/asm/mmu_context.h index 9d2cd0c..24b590d 100644 --- a/arch/powerpc/include/asm/mmu_context.h +++ b/arch/powerpc/include/asm/mmu_context.h @@ -138,6 +138,9 @@ static inline void arch_dup_mmap(struct mm_struct *oldmm, static inline void arch_exit_mmap(struct mm_struct *mm) { +#ifdef CONFIG_SPAPR_TCE_IOMMU + mm_iommu_cleanup(&mm->context); +#endif } static inline void arch_unmap(struct mm_struct *mm, diff --git a/arch/powerpc/mm/mmu_context_book3s64.c b/arch/powerpc/mm/mmu_context_book3s64.c index 1962..aaeba74 100644 --- a/arch/powerpc/mm/mmu_context_book3s64.c +++ b/arch/powerpc/mm/mmu_context_book3s64.c @@ -159,10 +159,6 @@ static inline void destroy_pagetable_page(struct mm_struct *mm) void destroy_context(struct mm_struct *mm) { -#ifdef CONFIG_SPAPR_TCE_IOMMU - mm_iommu_cleanup(&mm->context); -#endif - #ifdef CONFIG_PPC_ICSWX drop_cop(mm->context.acop, mm); kfree(mm->context.cop_lockp); -- 2.5.0.rc3
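The mm_users/mm_count distinction that the commit message relies on can be modelled in plain C. This is a simplified userspace sketch, not kernel code: the struct and the `model_*` hooks are invented for illustration, while `mmgrab`/`mmdrop`/`mmput` mirror the semantics of the kernel helpers of the same names.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for the two refcounts in the kernel's struct mm_struct. */
struct mm_model {
    int mm_users;   /* userspace references; last mmput() runs exit_mmap() */
    int mm_count;   /* total references; last mmdrop() runs destroy_context() */
    bool pages_put; /* set when the IOMMU cleanup hook (arch_exit_mmap) ran */
    bool destroyed; /* set when destroy_context() ran */
};

static void model_exit_mmap(struct mm_model *mm)
{
    /* With this patch, pinned pages are put here, on the last mmput(). */
    mm->pages_put = true;
}

static void model_destroy_context(struct mm_model *mm)
{
    /* Before the patch, cleanup only happened here, on the last mmdrop(). */
    mm->destroyed = true;
}

static void mmgrab(struct mm_model *mm)
{
    mm->mm_count++; /* e.g. a kernel thread caching the MM */
}

static void mmdrop(struct mm_model *mm)
{
    if (--mm->mm_count == 0)
        model_destroy_context(mm);
}

static void mmput(struct mm_model *mm)
{
    if (--mm->mm_users == 0) {
        model_exit_mmap(mm); /* pages put immediately at process exit */
        mmdrop(mm);          /* users also hold one mm_count reference */
    }
}
```

The scenario from the commit message: a kernel thread grabs the MM, the process exits, and with cleanup hooked into exit_mmap() the pinned pages are put immediately even though the MM itself lingers until the kernel thread's final mmdrop().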
Re: [PATCH 5/5] vfio-pci: Allow to mmap MSI-X table if interrupt remapping is supported
On 05/06/2016 01:05 AM, Alex Williamson wrote: On Thu, 5 May 2016 12:15:46 + "Tian, Kevin" wrote: From: Yongji Xie [mailto:xyj...@linux.vnet.ibm.com] Sent: Thursday, May 05, 2016 7:43 PM Hi David and Kevin, On 2016/5/5 17:54, David Laight wrote: From: Tian, Kevin Sent: 05 May 2016 10:37 ... Actually, we are not aiming at accessing the MSI-X table from the guest. So I think it's safe to pass through the MSI-X table if we can make sure the guest kernel would not touch the MSI-X table in a normal code path, such as a para-virtualized guest kernel on PPC64. Then how do you prevent a malicious guest kernel accessing it? Or a malicious guest driver for an ethernet card setting up the receive buffer ring to contain a single word entry that contains the address associated with an MSI-X interrupt and then using a loopback mode to cause a specific packet to be received that writes the required word through that address. Remember the PCIe cycle for an interrupt is a normal memory write cycle. David If we have enough permission to load a malicious driver or kernel, we can easily break the guest without an exposed MSI-X table. I think it should be safe to expose the MSI-X table if we can make sure that a malicious guest driver/kernel can't use the MSI-X table to break other guests or the host. The capability of IRQ remapping could provide this kind of protection. IRQ remapping alone doesn't mean you can pass the MSI-X structure through to the guest. I know actual IRQ remapping might be platform specific, but at least for the Intel VT-d specification, an MSI-X entry must be configured with a remappable format by the host kernel which contains an index into the IRQ remapping table. The index will find an IRQ remapping entry which controls interrupt routing for a specific device. If you allow a malicious program a random index into an MSI-X entry of an assigned device, the hole is obvious... The above might make sense only for an IRQ remapping implementation which doesn't rely on the extended MSI-X format (e.g. simply based on BDF). 
If that's the case for PPC, then you should build MSI-X passthrough based on this fact instead of general IRQ remapping enabled or not. I don't think anyone is expecting that we can expose the MSI-X vector table to the guest and the guest can make direct use of it. The end goal here is that the guest on a power system is already paravirtualized to not program the device MSI-X by directly writing to the MSI-X vector table. They have hypercalls for this since they always run virtualized. Therefore a) they never intend to touch the MSI-X vector table and b) they have sufficient isolation that a guest can only hurt itself by doing so. On x86 we don't have a), our method of programming the MSI-X vector table is to directly write to it. Therefore we will always require QEMU to place a MemoryRegion over the vector table to intercept those accesses. However with interrupt remapping, we do have b) on x86, which means that we don't need to be so strict in disallowing user accesses to the MSI-X vector table. It's not useful for configuring MSI-X on the device, but the user should only be able to hurt themselves by writing it directly. x86 doesn't really get anything out of this change, but it helps this special case on power pretty significantly aiui. Thanks, Excellent short overview, saved :) How do we proceed with these patches? Nobody seems objecting them but also nobody seems taking them either... -- Alexey
Re: [PATCH 4/5] pci-ioda: Set PCI_BUS_FLAGS_MSI_REMAP for IODA host bridge
On 04/27/2016 10:43 PM, Yongji Xie wrote: Any IODA host bridge has the capability of IRQ remapping. So we set PCI_BUS_FLAGS_MSI_REMAP when this kind of host bridge is detected. Signed-off-by: Yongji Xie <xyj...@linux.vnet.ibm.com> Reviewed-by: Alexey Kardashevskiy <a...@ozlabs.ru> --- arch/powerpc/platforms/powernv/pci-ioda.c | 8 ++++++++ 1 file changed, 8 insertions(+) diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c index f90dc04..9557638 100644 --- a/arch/powerpc/platforms/powernv/pci-ioda.c +++ b/arch/powerpc/platforms/powernv/pci-ioda.c @@ -3080,6 +3080,12 @@ static void pnv_pci_ioda_fixup(void) pnv_npu_ioda_fixup(); } +int pnv_pci_ioda_root_bridge_prepare(struct pci_host_bridge *bridge) +{ + bridge->bus->bus_flags |= PCI_BUS_FLAGS_MSI_REMAP; + return 0; +} + /* * Returns the alignment for I/O or memory windows for P2P * bridges. That actually depends on how PEs are segmented. @@ -3364,6 +3370,8 @@ static void __init pnv_pci_init_ioda_phb(struct device_node *np, */ ppc_md.pcibios_fixup = pnv_pci_ioda_fixup; + ppc_md.pcibios_root_bridge_prepare = pnv_pci_ioda_root_bridge_prepare; + if (phb->type == PNV_PHB_NPU) hose->controller_ops = pnv_npu_ioda_controller_ops; else -- Alexey
Re: [PATCH kernel v4 11/11] powerpc/powernv/npu: Enable NVLink pass through
On 05/04/2016 12:08 AM, Alistair Popple wrote: Hi Alexey, On Fri, 29 Apr 2016 18:55:24 Alexey Kardashevskiy wrote: IBM POWER8 NVLink systems come with Tesla K40-ish GPUs, each of which also has a couple of high-speed links (NVLink). The interface to the links is exposed as an emulated PCI bridge which is included in the same IOMMU group as the corresponding GPU. In the kernel, NPUs get a separate PHB of the PNV_PHB_NPU type and a PE which behave pretty much like the standard IODA2 PHB, except that the NPU PHB has just a single TVE in the hardware, which means it can have either a 32bit window, a 64bit window, or DMA bypass, but never two of these. In order to make these links work when the GPU is passed to the guest, these bridges need to be passed as well; otherwise performance will degrade. This implements and exports an API to manage NPU state in regard to VFIO; it replicates iommu_table_group_ops. This defines a new pnv_pci_ioda2_npu_ops which is assigned to the IODA2 bridge if there are NPUs for a GPU on the bridge. The new callbacks call the default IODA2 callbacks plus the new NPU API. This adds a gpe_table_group_to_npe() helper to find the NPU PE for the IODA2 table_group; it is not expected to fail as the helper is only called from pnv_pci_ioda2_npu_ops. This does not define an NPU-specific .release_ownership(), so after VFIO is finished, DMA on the NPU is disabled, which is ok as the nvidia driver sets the DMA mask when probing, which enables 32 or 64bit DMA on the NPU. This adds a pnv_pci_npu_setup_iommu() helper which adds NPUs to the GPU group if any are found. The helper uses helpers to look for the "ibm,gpu" property in the device tree, which is a phandle of the corresponding GPU. This adds an additional loop over PEs in pnv_ioda_setup_dma() as the main loop skips NPU PEs since they do not have 32bit DMA segments. As pnv_npu_set_window() and pnv_npu_unset_window() are now used by the new IODA2-NPU IOMMU group, this makes the helpers public and adds a DMA window number parameter. 
Signed-off-by: Alexey Kardashevskiy <a...@ozlabs.ru> --- Changes: v4: * reused pnv_npu_set_window/pnv_npu_unset_window where possible * added comments, changed commit log v3: * moved NPU-to-GPU IOMMU grouping later after all PHBs are discovered * removed hack to make iommu_add_device() work, iommu_group_add_device() is used instead * cleanup in gpe_table_group_to_npe_cb() v2: * reimplemented to support NPU + GPU in the same group * merged "powerpc/powernv/npu: Add NPU devices to IOMMU group" and "powerpc/powernv/npu: Enable passing through via VFIO" into this patch --- arch/powerpc/platforms/powernv/npu-dma.c | 64 +-- arch/powerpc/platforms/powernv/pci-ioda.c | 102 ++ arch/powerpc/platforms/powernv/pci.h | 6 ++ 3 files changed, 166 insertions(+), 6 deletions(-) diff --git a/arch/powerpc/platforms/powernv/npu-dma.c b/arch/powerpc/platforms/powernv/npu-dma.c index cb2d1da..0459e10 100644 --- a/arch/powerpc/platforms/powernv/npu-dma.c +++ b/arch/powerpc/platforms/powernv/npu-dma.c @@ -12,6 +12,7 @@ #include #include #include +#include #include #include @@ -154,7 +155,7 @@ static struct pnv_ioda_pe *get_gpu_pci_dev_and_pe(struct pnv_ioda_pe *npe, return pe; } -static long pnv_npu_set_window(struct pnv_ioda_pe *npe, +long pnv_npu_set_window(struct pnv_ioda_pe *npe, int num, struct iommu_table *tbl) { struct pnv_phb *phb = npe->phb; @@ -182,13 +183,13 @@ static long pnv_npu_set_window(struct pnv_ioda_pe *npe, pnv_pci_ioda2_tce_invalidate_entire(phb, false); /* Add the table to the list so its TCE cache will get invalidated */ - pnv_pci_link_table_and_group(phb->hose->node, 0, + pnv_pci_link_table_and_group(phb->hose->node, num, tbl, &npe->table_group); return 0; } -static long pnv_npu_unset_window(struct pnv_ioda_pe *npe) +long pnv_npu_unset_window(struct pnv_ioda_pe *npe, int num) { struct pnv_phb *phb = npe->phb; int64_t rc; @@ -205,7 +206,7 @@ static long pnv_npu_unset_window(struct pnv_ioda_pe *npe) } pnv_pci_ioda2_tce_invalidate_entire(phb, false); - 
pnv_pci_unlink_table_and_group(npe->table_group.tables[0], + pnv_pci_unlink_table_and_group(npe->table_group.tables[num], &npe->table_group); return 0; @@ -231,7 +232,7 @@ static void pnv_npu_dma_set_32(struct pnv_ioda_pe *npe) if (!gpe) return; - rc = pnv_npu_set_window(npe, gpe->table_group.tables[0]); + rc = pnv_npu_set_window(npe, 0, gpe->table_group.tables[0]); /* * We don't initialise npu_pe->tce32_table as we always use @@ -255,7 +256,7 @@ static int pnv_npu_dma_set_bypass(struct pnv_ioda_pe *npe) if (phb->type != PNV_PHB_NPU || !npe->pdev) return -EINVAL; - rc = pnv_npu_unset_windo
Re: [PATCH kernel v4 10/11] powerpc/powernv/npu: Rework TCE Kill handling
On 05/03/2016 05:37 PM, Alistair Popple wrote: On Fri, 29 Apr 2016 18:55:23 Alexey Kardashevskiy wrote: The pnv_ioda_pe struct keeps an array of peers. At the moment it is only used to link GPU and NPU for 2 purposes: 1. Access the NPU quickly when configuring DMA for the GPU - this was addressed in the previous patch by removing its use, as DMA setup is not something the kernel does constantly. Agreed. It was used here because the peer array was added to deal with (2) below ... 2. Invalidate the TCE cache for the NPU when it is invalidated for the GPU. GPU and NPU are in different PEs. There is already a mechanism to attach multiple iommu_table_group structs to the same iommu_table (used for VFIO); this patch reuses it here. ... because we weren't aware iommu_table_group could be used to do this instead. This gets rid of the peers[] array and the PNV_IODA_PE_PEER flag as they are not needed anymore. Happy to see it go. I'm not too familiar with iommu groups but based on the code and what you have described to me both here and offline everything looks good to me. One pretty minor style comment below. Reviewed-By: Alistair Popple <alist...@popple.id.au> While we are here, add TCE cache invalidation after enabling bypass. 
Signed-off-by: Alexey Kardashevskiy <a...@ozlabs.ru> --- Changes: v4: * reworked as "powerpc/powernv/npu: Add set/unset window helpers" has been added --- arch/powerpc/platforms/powernv/npu-dma.c | 69 +-- arch/powerpc/platforms/powernv/pci-ioda.c | 57 - arch/powerpc/platforms/powernv/pci.h | 6 --- 3 files changed, 26 insertions(+), 106 deletions(-) diff --git a/arch/powerpc/platforms/powernv/npu-dma.c b/arch/powerpc/platforms/powernv/npu-dma.c index 800d70f..cb2d1da 100644 --- a/arch/powerpc/platforms/powernv/npu-dma.c +++ b/arch/powerpc/platforms/powernv/npu-dma.c @@ -136,22 +136,17 @@ static struct pnv_ioda_pe *get_gpu_pci_dev_and_pe(struct pnv_ioda_pe *npe, struct pnv_ioda_pe *pe; struct pci_dn *pdn; - if (npe->flags & PNV_IODA_PE_PEER) { - pe = npe->peers[0]; - pdev = pe->pdev; - } else { - pdev = pnv_pci_get_gpu_dev(npe->pdev); - if (!pdev) - return NULL; + pdev = pnv_pci_get_gpu_dev(npe->pdev); + if (!pdev) + return NULL; - pdn = pci_get_pdn(pdev); - if (WARN_ON(!pdn || pdn->pe_number == IODA_INVALID_PE)) - return NULL; + pdn = pci_get_pdn(pdev); + if (WARN_ON(!pdn || pdn->pe_number == IODA_INVALID_PE)) + return NULL; - hose = pci_bus_to_host(pdev->bus); - phb = hose->private_data; - pe = &phb->ioda.pe_array[pdn->pe_number]; - } + hose = pci_bus_to_host(pdev->bus); + phb = hose->private_data; + pe = &phb->ioda.pe_array[pdn->pe_number]; if (gpdev) *gpdev = pdev; @@ -186,6 +181,10 @@ static long pnv_npu_set_window(struct pnv_ioda_pe *npe, } pnv_pci_ioda2_tce_invalidate_entire(phb, false); + /* Add the table to the list so its TCE cache will get invalidated */ + pnv_pci_link_table_and_group(phb->hose->node, 0, + tbl, &npe->table_group); Where tbl is associated with the GPU and is what links the NPU and GPU PEs. 
return 0; } @@ -206,45 +205,12 @@ static long pnv_npu_unset_window(struct pnv_ioda_pe *npe) } pnv_pci_ioda2_tce_invalidate_entire(phb, false); + pnv_pci_unlink_table_and_group(npe->table_group.tables[0], + &npe->table_group); + return 0; } -void pnv_npu_init_dma_pe(struct pnv_ioda_pe *npe) -{ - struct pnv_ioda_pe *gpe; - struct pci_dev *gpdev; - int i, avail = -1; - - if (!npe->pdev || !(npe->flags & PNV_IODA_PE_DEV)) - return; - - gpe = get_gpu_pci_dev_and_pe(npe, &gpdev); - if (!gpe) - return; - - for (i = 0; i < PNV_IODA_MAX_PEER_PES; i++) { - /* Nothing to do if the PE is already connected. */ - if (gpe->peers[i] == npe) - return; - - if (!gpe->peers[i]) - avail = i; - } - - if (WARN_ON(avail < 0)) - return; - - gpe->peers[avail] = npe; - gpe->flags |= PNV_IODA_PE_PEER; - - /* -* We assume that the NPU devices only have a single peer PE -* (the GPU PCIe device PE). -*/ - npe->peers[0] = gpe; - npe->flags |= PNV_IODA_PE_PEER; -} - /* * Enables 32 bit DMA on NPU. */ @@ -302,6 +268,9 @@ static int pnv_npu_dma_set_bypass(struct pnv_ioda_pe *npe) npe->pe_number, npe->pe_number, 0 /* bypass base */, top); + if (r
[PATCH kernel v4 06/11] powerpc/powernv/npu: Use the correct IOMMU page size
This uses the page size from iommu_table instead of hard-coded 4K. This should cause no change in behavior. While we are here, move bits around to prepare for further rework which will define and use iommu_table_group_ops. Signed-off-by: Alexey Kardashevskiy <a...@ozlabs.ru> Reviewed-by: David Gibson <da...@gibson.dropbear.id.au> Reviewed-by: Alistair Popple <alist...@popple.id.au> --- arch/powerpc/platforms/powernv/npu-dma.c | 11 +-- 1 file changed, 5 insertions(+), 6 deletions(-) diff --git a/arch/powerpc/platforms/powernv/npu-dma.c b/arch/powerpc/platforms/powernv/npu-dma.c index 778570c..5bd5fee 100644 --- a/arch/powerpc/platforms/powernv/npu-dma.c +++ b/arch/powerpc/platforms/powernv/npu-dma.c @@ -204,8 +204,7 @@ static void pnv_npu_disable_bypass(struct pnv_ioda_pe *npe) struct pnv_phb *phb = npe->phb; struct pci_dev *gpdev; struct pnv_ioda_pe *gpe; - void *addr; - unsigned int size; + struct iommu_table *tbl; int64_t rc; /* @@ -219,11 +218,11 @@ static void pnv_npu_disable_bypass(struct pnv_ioda_pe *npe) if (!gpe) return; - addr = (void *)gpe->table_group.tables[0]->it_base; - size = gpe->table_group.tables[0]->it_size << 3; + tbl = gpe->table_group.tables[0]; rc = opal_pci_map_pe_dma_window(phb->opal_id, npe->pe_number, - npe->pe_number, 1, __pa(addr), - size, 0x1000); + npe->pe_number, 1, __pa(tbl->it_base), + tbl->it_size << 3, + IOMMU_PAGE_SIZE(tbl)); if (rc != OPAL_SUCCESS) pr_warn("%s: Error %lld setting DMA window on PHB#%d-PE#%d\n", __func__, rc, phb->hose->global_number, npe->pe_number); -- 2.5.0.rc3
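The two size arguments the patch now passes to opal_pci_map_pe_dma_window() both derive from iommu_table fields. The sketch below models that arithmetic in plain userspace C; the struct is a cut-down stand-in for the kernel's iommu_table (only the fields used here), and the helper names are invented for illustration.

```c
#include <assert.h>

/* Minimal stand-in for the kernel's struct iommu_table fields used here. */
struct iommu_table_model {
    unsigned long it_size;       /* number of TCEs in the table */
    unsigned long it_page_shift; /* log2 of the IOMMU page size */
};

/* IOMMU_PAGE_SIZE(tbl) in the kernel is 1UL << tbl->it_page_shift. */
static unsigned long iommu_page_size(const struct iommu_table_model *tbl)
{
    return 1UL << tbl->it_page_shift;
}

/* Each TCE is 8 bytes, hence the "it_size << 3" in the patch. */
static unsigned long tce_table_bytes(const struct iommu_table_model *tbl)
{
    return tbl->it_size << 3;
}

/* Total DMA window covered: one IOMMU page per TCE. */
static unsigned long dma_window_bytes(const struct iommu_table_model *tbl)
{
    return tbl->it_size << tbl->it_page_shift;
}
```

With the hard-coded 0x1000 the firmware call claimed 4K IOMMU pages regardless of it_page_shift; deriving the value from the table keeps the two arguments consistent for, say, a 64K-page table.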
[PATCH kernel v4 09/11] powerpc/powernv/npu: Add set/unset window helpers
The upcoming NVLink passthrough support will require NPU code to cope with two DMA windows. This adds a pnv_npu_set_window() helper which programs 32bit window to the hardware. This also adds multilevel TCE support. This adds a pnv_npu_unset_window() helper which removes the DMA window from the hardware. This does not make difference now as the caller - pnv_npu_dma_set_bypass() - enables bypass in the hardware but the next patch will use it to manage TCE table lists for TCE Kill handling. Signed-off-by: Alexey Kardashevskiy <a...@ozlabs.ru> --- arch/powerpc/platforms/powernv/npu-dma.c | 65 +++- 1 file changed, 55 insertions(+), 10 deletions(-) diff --git a/arch/powerpc/platforms/powernv/npu-dma.c b/arch/powerpc/platforms/powernv/npu-dma.c index bec9267..800d70f 100644 --- a/arch/powerpc/platforms/powernv/npu-dma.c +++ b/arch/powerpc/platforms/powernv/npu-dma.c @@ -159,6 +159,56 @@ static struct pnv_ioda_pe *get_gpu_pci_dev_and_pe(struct pnv_ioda_pe *npe, return pe; } +static long pnv_npu_set_window(struct pnv_ioda_pe *npe, + struct iommu_table *tbl) +{ + struct pnv_phb *phb = npe->phb; + int64_t rc; + const unsigned long size = tbl->it_indirect_levels ? 
+ tbl->it_level_size : tbl->it_size; + const __u64 start_addr = tbl->it_offset << tbl->it_page_shift; + const __u64 win_size = tbl->it_size << tbl->it_page_shift; + + pe_info(npe, "Setting up window %llx..%llx pg=%lx\n", + start_addr, start_addr + win_size - 1, + IOMMU_PAGE_SIZE(tbl)); + + rc = opal_pci_map_pe_dma_window(phb->opal_id, + npe->pe_number, + npe->pe_number, + tbl->it_indirect_levels + 1, + __pa(tbl->it_base), + size << 3, + IOMMU_PAGE_SIZE(tbl)); + if (rc) { + pe_err(npe, "Failed to configure TCE table, err %lld\n", rc); + return rc; + } + pnv_pci_ioda2_tce_invalidate_entire(phb, false); + + return 0; +} + +static long pnv_npu_unset_window(struct pnv_ioda_pe *npe) +{ + struct pnv_phb *phb = npe->phb; + int64_t rc; + + pe_info(npe, "Removing DMA window\n"); + + rc = opal_pci_map_pe_dma_window(phb->opal_id, npe->pe_number, + npe->pe_number, + 0/* levels */, 0/* table address */, + 0/* table size */, 0/* page size */); + if (rc) { + pe_err(npe, "Unmapping failed, ret = %lld\n", rc); + return rc; + } + pnv_pci_ioda2_tce_invalidate_entire(phb, false); + + return 0; +} + void pnv_npu_init_dma_pe(struct pnv_ioda_pe *npe) { struct pnv_ioda_pe *gpe; @@ -200,10 +250,8 @@ void pnv_npu_init_dma_pe(struct pnv_ioda_pe *npe) */ static void pnv_npu_dma_set_32(struct pnv_ioda_pe *npe) { - struct pnv_phb *phb = npe->phb; struct pci_dev *gpdev; struct pnv_ioda_pe *gpe; - struct iommu_table *tbl; int64_t rc; /* @@ -217,14 +265,7 @@ static void pnv_npu_dma_set_32(struct pnv_ioda_pe *npe) if (!gpe) return; - tbl = gpe->table_group.tables[0]; - rc = opal_pci_map_pe_dma_window(phb->opal_id, npe->pe_number, - npe->pe_number, 1, __pa(tbl->it_base), - tbl->it_size << 3, - IOMMU_PAGE_SIZE(tbl)); - if (rc != OPAL_SUCCESS) - pr_warn("%s: Error %lld setting DMA window on PHB#%d-PE#%d\n", - __func__, rc, phb->hose->global_number, npe->pe_number); + rc = pnv_npu_set_window(npe, gpe->table_group.tables[0]); /* * We don't initialise npu_pe->tce32_table as we always use @@ -248,6 +289,10 
@@ static int pnv_npu_dma_set_bypass(struct pnv_ioda_pe *npe) if (phb->type != PNV_PHB_NPU || !npe->pdev) return -EINVAL; + rc = pnv_npu_unset_window(npe); + if (rc != OPAL_SUCCESS) + return rc; + /* Enable the bypass window */ top = roundup_pow_of_two(top); -- 2.5.0.rc3
[PATCH kernel v4 10/11] powerpc/powernv/npu: Rework TCE Kill handling
The pnv_ioda_pe struct keeps an array of peers. At the moment it is only used to link GPU and NPU for two purposes: 1. Access NPU quickly when configuring DMA for GPU - this was addressed in the previous patch by removing use of it as DMA setup is not what the kernel would constantly do. 2. Invalidate TCE cache for NPU when it is invalidated for GPU. GPU and NPU are in different PEs. There is already a mechanism to attach multiple iommu_table_group to the same iommu_table (used for VFIO); this patch reuses it here. This gets rid of the peers[] array and the PNV_IODA_PE_PEER flag as they are not needed anymore. While we are here, add TCE cache invalidation after enabling bypass. Signed-off-by: Alexey Kardashevskiy <a...@ozlabs.ru> --- Changes: v4: * reworked as "powerpc/powernv/npu: Add set/unset window helpers" has been added --- arch/powerpc/platforms/powernv/npu-dma.c | 69 +-- arch/powerpc/platforms/powernv/pci-ioda.c | 57 - arch/powerpc/platforms/powernv/pci.h | 6 --- 3 files changed, 26 insertions(+), 106 deletions(-) diff --git a/arch/powerpc/platforms/powernv/npu-dma.c b/arch/powerpc/platforms/powernv/npu-dma.c index 800d70f..cb2d1da 100644 --- a/arch/powerpc/platforms/powernv/npu-dma.c +++ b/arch/powerpc/platforms/powernv/npu-dma.c @@ -136,22 +136,17 @@ static struct pnv_ioda_pe *get_gpu_pci_dev_and_pe(struct pnv_ioda_pe *npe, struct pnv_ioda_pe *pe; struct pci_dn *pdn; - if (npe->flags & PNV_IODA_PE_PEER) { - pe = npe->peers[0]; - pdev = pe->pdev; - } else { - pdev = pnv_pci_get_gpu_dev(npe->pdev); - if (!pdev) - return NULL; + pdev = pnv_pci_get_gpu_dev(npe->pdev); + if (!pdev) + return NULL; - pdn = pci_get_pdn(pdev); - if (WARN_ON(!pdn || pdn->pe_number == IODA_INVALID_PE)) - return NULL; + pdn = pci_get_pdn(pdev); + if (WARN_ON(!pdn || pdn->pe_number == IODA_INVALID_PE)) + return NULL; - hose = pci_bus_to_host(pdev->bus); - phb = hose->private_data; - pe = &phb->ioda.pe_array[pdn->pe_number]; - } + hose = pci_bus_to_host(pdev->bus); + phb = hose->private_data; + pe = &phb->ioda.pe_array[pdn->pe_number]; if (gpdev) *gpdev = pdev; @@ -186,6 +181,10 @@ static long pnv_npu_set_window(struct pnv_ioda_pe *npe, } pnv_pci_ioda2_tce_invalidate_entire(phb, false); + /* Add the table to the list so its TCE cache will get invalidated */ + pnv_pci_link_table_and_group(phb->hose->node, 0, + tbl, &npe->table_group); + return 0; } @@ -206,45 +205,12 @@ static long pnv_npu_unset_window(struct pnv_ioda_pe *npe) } pnv_pci_ioda2_tce_invalidate_entire(phb, false); + pnv_pci_unlink_table_and_group(npe->table_group.tables[0], + &npe->table_group); + return 0; } -void pnv_npu_init_dma_pe(struct pnv_ioda_pe *npe) -{ - struct pnv_ioda_pe *gpe; - struct pci_dev *gpdev; - int i, avail = -1; - - if (!npe->pdev || !(npe->flags & PNV_IODA_PE_DEV)) - return; - - gpe = get_gpu_pci_dev_and_pe(npe, &gpdev); - if (!gpe) - return; - - for (i = 0; i < PNV_IODA_MAX_PEER_PES; i++) { - /* Nothing to do if the PE is already connected. */ - if (gpe->peers[i] == npe) - return; - - if (!gpe->peers[i]) - avail = i; - } - - if (WARN_ON(avail < 0)) - return; - - gpe->peers[avail] = npe; - gpe->flags |= PNV_IODA_PE_PEER; - - /* -* We assume that the NPU devices only have a single peer PE -* (the GPU PCIe device PE). -*/ - npe->peers[0] = gpe; - npe->flags |= PNV_IODA_PE_PEER; -} - /* * Enables 32 bit DMA on NPU. 
*/ @@ -302,6 +268,9 @@ static int pnv_npu_dma_set_bypass(struct pnv_ioda_pe *npe) npe->pe_number, npe->pe_number, 0 /* bypass base */, top); + if (rc == OPAL_SUCCESS) + pnv_pci_ioda2_tce_invalidate_entire(phb, false); + return rc; } diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c index db7695f..922c74c 100644 --- a/arch/powerpc/platforms/powernv/pci-ioda.c +++ b/arch/powerpc/platforms/powernv/pci-ioda.c @@ -1816,23 +1816,12 @@ static inline void pnv_pci_ioda2_tce_invalidate_pe(struct pnv_ioda_pe *pe) /* 01xb - invalidate TCEs that match the specified PE# */ unsigned long val = TCE_KILL_INVAL_PE | (pe->pe_number & 0xFF); struct pnv_phb *phb = pe->
[PATCH kernel v4 00/11] powerpc/powernv/npu: Enable PCI pass through for NVLink
IBM POWER8 NVlink systems contain usual Tesla K40-ish GPUs but also contain a couple of really fast links between GPU and CPU. These links are exposed to the userspace by the OPAL firmware as bridges. In order to make these links work when GPU is passed to the guest, these bridges need to be passed as well; otherwise performance will degrade. More details are in 11/11. This reworks the existing NPU support in the powernv platform and adds VFIO support on top of that. v4 has new patch "powerpc/powernv/npu: Add set/unset window" and bunch of cleanups. "vfio_pci: Test for extended capabilities if config space > 256 bytes" is included here if anyone decides to test the patchset which will crash without it. This was tested on POWER8NVL platform; pvr=0x004c0100. Please comment. Thanks. Alex, I guess we will need your "acked-by" for "vfio/spapr: Relax the IOMMU compatibility check" to proceed. Alexey Kardashevskiy (11): vfio_pci: Test for extended capabilities if config space > 256 bytes vfio/spapr: Relax the IOMMU compatibility check powerpc/powernv: Rename pnv_pci_ioda2_tce_invalidate_entire powerpc/powernv: Define TCE Kill flags powerpc/powernv/npu: TCE Kill helpers cleanup powerpc/powernv/npu: Use the correct IOMMU page size powerpc/powernv/npu: Simplify DMA setup powerpc/powernv/ioda2: Export debug helper pe_level_printk() powerpc/powernv/npu: Add set/unset window helpers powerpc/powernv/npu: Rework TCE Kill handling powerpc/powernv/npu: Enable NVLink pass through arch/powerpc/platforms/powernv/npu-dma.c | 287 -- arch/powerpc/platforms/powernv/pci-ioda.c | 224 +++ arch/powerpc/platforms/powernv/pci.h | 31 ++-- drivers/vfio/pci/vfio_pci_config.c| 17 +- drivers/vfio/vfio_iommu_spapr_tce.c | 3 +- 5 files changed, 327 insertions(+), 235 deletions(-) -- 2.5.0.rc3
[PATCH kernel v4 07/11] powerpc/powernv/npu: Simplify DMA setup
NPU devices are emulated in firmware and mainly used for NPU NVLink training; there is one NPU device per hardware link. Their DMA/TCE setup must match the GPU which is connected via PCIe and NVLink, so any changes to the DMA/TCE setup on the GPU PCIe device need to be propagated to the NVLink device as this is what device drivers expect and it doesn't make much sense to do anything else. This makes NPU DMA setup explicit. pnv_npu_ioda_controller_ops::pnv_npu_dma_set_mask is moved to pci-ioda, made static and prints a warning, as dma_set_mask() should never be called on it since in any case it will not configure the GPU; so we make this explicit. Instead of using PNV_IODA_PE_PEER and peers[] (which the next patch will remove), we test every PCI device for corresponding NVLink devices. If there are any, we propagate bypass mode to the NPU devices just found by calling the setup helper directly (which takes @bypass) and avoid guessing (i.e. calculating from the DMA mask) whether we need bypass or not on NPU devices. Since DMA setup happens on very rare occasions, this will not slow down booting or VFIO start/stop much. This renames pnv_npu_disable_bypass to pnv_npu_dma_set_32 to make it clearer what the function really does, which is programming the 32bit table address into the TVT ("disabling bypass" means writing zeroes to the TVT). This removes pnv_npu_dma_set_bypass() from pnv_npu_ioda_fixup() as the DMA configuration on the NPU does not matter until dma_set_mask() is called on the GPU, and that will do the NPU DMA configuration. This removes phb->dma_dev_setup initialization for NPU as pnv_pci_ioda_dma_dev_setup is a no-op for it anyway. This stops using npe->tce_bypass_base as it never changes and values other than zero are not supported. 
Signed-off-by: Alexey Kardashevskiy <a...@ozlabs.ru> Reviewed-by: David Gibson <da...@gibson.dropbear.id.au> Reviewed-by: Alistair Popple <alist...@popple.id.au> --- Changes: v2: * changed first paragraph of the commit log from Alistair comment * removed npe->tce_bypass_base --- arch/powerpc/platforms/powernv/npu-dma.c | 89 ++- arch/powerpc/platforms/powernv/pci-ioda.c | 30 +-- arch/powerpc/platforms/powernv/pci.h | 3 +- 3 files changed, 53 insertions(+), 69 deletions(-) diff --git a/arch/powerpc/platforms/powernv/npu-dma.c b/arch/powerpc/platforms/powernv/npu-dma.c index 5bd5fee..bec9267 100644 --- a/arch/powerpc/platforms/powernv/npu-dma.c +++ b/arch/powerpc/platforms/powernv/npu-dma.c @@ -196,10 +196,9 @@ void pnv_npu_init_dma_pe(struct pnv_ioda_pe *npe) } /* - * For the NPU we want to point the TCE table at the same table as the - * real PCI device. + * Enables 32 bit DMA on NPU. */ -static void pnv_npu_disable_bypass(struct pnv_ioda_pe *npe) +static void pnv_npu_dma_set_32(struct pnv_ioda_pe *npe) { struct pnv_phb *phb = npe->phb; struct pci_dev *gpdev; @@ -235,72 +234,62 @@ static void pnv_npu_disable_bypass(struct pnv_ioda_pe *npe) } /* - * Enable/disable bypass mode on the NPU. The NPU only supports one + * Enables bypass mode on the NPU. The NPU only supports one * window per link, so bypass needs to be explicitly enabled or * disabled. Unlike for a PHB3 bypass and non-bypass modes can't be * active at the same time. 
*/ -int pnv_npu_dma_set_bypass(struct pnv_ioda_pe *npe, bool enable) +static int pnv_npu_dma_set_bypass(struct pnv_ioda_pe *npe) { struct pnv_phb *phb = npe->phb; int64_t rc = 0; + phys_addr_t top = memblock_end_of_DRAM(); if (phb->type != PNV_PHB_NPU || !npe->pdev) return -EINVAL; - if (enable) { - /* Enable the bypass window */ - phys_addr_t top = memblock_end_of_DRAM(); + /* Enable the bypass window */ - npe->tce_bypass_base = 0; - top = roundup_pow_of_two(top); - dev_info(&npe->pdev->dev, "Enabling bypass for PE %d\n", -npe->pe_number); - rc = opal_pci_map_pe_dma_window_real(phb->opal_id, - npe->pe_number, npe->pe_number, - npe->tce_bypass_base, top); - } else { - /* -* Disable the bypass window by replacing it with the -* TCE32 window. -*/ - pnv_npu_disable_bypass(npe); - } + top = roundup_pow_of_two(top); + dev_info(&npe->pdev->dev, "Enabling bypass for PE %d\n", + npe->pe_number); + rc = opal_pci_map_pe_dma_window_real(phb->opal_id, + npe->pe_number, npe->pe_number, + 0 /* bypass base */, top); return rc; } -int pnv_npu_dma_set_mask(struct pci_dev *npdev, u64 dma_mask) +void pnv_npu_try_dma_set_bypass(struct pci_dev *gpdev, bool b
[PATCH kernel v4 08/11] powerpc/powernv/ioda2: Export debug helper pe_level_printk()
This exports the debugging helper pe_level_printk() and corresponding macros so they can be used in npu-dma.c. Signed-off-by: Alexey Kardashevskiy <a...@ozlabs.ru> --- arch/powerpc/platforms/powernv/pci-ioda.c | 9 + arch/powerpc/platforms/powernv/pci.h | 9 + 2 files changed, 10 insertions(+), 8 deletions(-) diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c index 272521e..db7695f 100644 --- a/arch/powerpc/platforms/powernv/pci-ioda.c +++ b/arch/powerpc/platforms/powernv/pci-ioda.c @@ -56,7 +56,7 @@ static void pnv_pci_ioda2_table_free_pages(struct iommu_table *tbl); -static void pe_level_printk(const struct pnv_ioda_pe *pe, const char *level, +void pe_level_printk(const struct pnv_ioda_pe *pe, const char *level, const char *fmt, ...) { struct va_format vaf; @@ -87,13 +87,6 @@ static void pe_level_printk(const struct pnv_ioda_pe *pe, const char *level, va_end(args); } -#define pe_err(pe, fmt, ...) \ - pe_level_printk(pe, KERN_ERR, fmt, ##__VA_ARGS__) -#define pe_warn(pe, fmt, ...) \ - pe_level_printk(pe, KERN_WARNING, fmt, ##__VA_ARGS__) -#define pe_info(pe, fmt, ...) \ - pe_level_printk(pe, KERN_INFO, fmt, ##__VA_ARGS__) - static bool pnv_iommu_bypass_disabled __read_mostly; static int __init iommu_setup(char *str) diff --git a/arch/powerpc/platforms/powernv/pci.h b/arch/powerpc/platforms/powernv/pci.h index d574a9d..485e5b1 100644 --- a/arch/powerpc/platforms/powernv/pci.h +++ b/arch/powerpc/platforms/powernv/pci.h @@ -236,6 +236,15 @@ extern void pnv_pci_dma_bus_setup(struct pci_bus *bus); extern int pnv_setup_msi_irqs(struct pci_dev *pdev, int nvec, int type); extern void pnv_teardown_msi_irqs(struct pci_dev *pdev); +extern void pe_level_printk(const struct pnv_ioda_pe *pe, const char *level, + const char *fmt, ...); +#define pe_err(pe, fmt, ...) \ + pe_level_printk(pe, KERN_ERR, fmt, ##__VA_ARGS__) +#define pe_warn(pe, fmt, ...) \ + pe_level_printk(pe, KERN_WARNING, fmt, ##__VA_ARGS__) +#define pe_info(pe, fmt, ...) 
\ + pe_level_printk(pe, KERN_INFO, fmt, ##__VA_ARGS__) + /* Nvlink functions */ extern void pnv_npu_init_dma_pe(struct pnv_ioda_pe *npe); extern void pnv_npu_setup_dma_pe(struct pnv_ioda_pe *npe); -- 2.5.0.rc3
[PATCH kernel v4 11/11] powerpc/powernv/npu: Enable NVLink pass through
IBM POWER8 NVlink systems come with Tesla K40-ish GPUs, each of which also has a couple of fast links (NVLink). The interface to these links is exposed as an emulated PCI bridge which is included in the same IOMMU group as the corresponding GPU. In the kernel, NPUs get a separate PHB of the PNV_PHB_NPU type and a PE which behave pretty much like a standard IODA2 PHB, except that the NPU PHB has just a single TVE in the hardware, which means it can have either a 32bit window, a 64bit window, or DMA bypass, but never two of these at once. In order to make these links work when the GPU is passed to the guest, these bridges need to be passed as well; otherwise performance will degrade. This implements and exports an API to manage NPU state in regard to VFIO; it replicates iommu_table_group_ops. This defines a new pnv_pci_ioda2_npu_ops which is assigned to the IODA2 bridge if there are NPUs for a GPU on the bridge. The new callbacks call the default IODA2 callbacks plus the new NPU API. This adds a gpe_table_group_to_npe() helper to find the NPU PE for an IODA2 table_group; it is not expected to fail as the helper is only called from the pnv_pci_ioda2_npu_ops. This does not define an NPU-specific .release_ownership(), so after VFIO is finished, DMA on the NPU is disabled, which is ok as the nvidia driver sets the DMA mask when probing, which enables 32bit or 64bit DMA on the NPU. This adds a pnv_pci_npu_setup_iommu() helper which adds NPUs to the GPU group if any are found. The helper uses helpers to look for the "ibm,gpu" property in the device tree which is a phandle of the corresponding GPU. This adds an additional loop over PEs in pnv_ioda_setup_dma() as the main loop skips NPU PEs as they do not have 32bit DMA segments. As pnv_npu_set_window() and pnv_npu_unset_window() start being used by the new IODA2-NPU IOMMU group, this makes the helpers public and adds the DMA window number parameter. 
Signed-off-by: Alexey Kardashevskiy <a...@ozlabs.ru> --- Changes: v4: * reused pnv_npu_set_window/pnv_npu_unset_window where possible * added comments, changed commit log v3: * moved NPU-to-GPU IOMMU grouping later after all PHBs are discovered * removed hack to make iommu_add_device() work, iommu_group_add_device() is used instead * cleanup in gpe_table_group_to_npe_cb() v2: * reimplemented to support NPU + GPU in the same group * merged "powerpc/powernv/npu: Add NPU devices to IOMMU group" and "powerpc/powernv/npu: Enable passing through via VFIO" into this patch --- arch/powerpc/platforms/powernv/npu-dma.c | 64 +-- arch/powerpc/platforms/powernv/pci-ioda.c | 102 ++ arch/powerpc/platforms/powernv/pci.h | 6 ++ 3 files changed, 166 insertions(+), 6 deletions(-) diff --git a/arch/powerpc/platforms/powernv/npu-dma.c b/arch/powerpc/platforms/powernv/npu-dma.c index cb2d1da..0459e10 100644 --- a/arch/powerpc/platforms/powernv/npu-dma.c +++ b/arch/powerpc/platforms/powernv/npu-dma.c @@ -12,6 +12,7 @@ #include #include #include +#include #include #include @@ -154,7 +155,7 @@ static struct pnv_ioda_pe *get_gpu_pci_dev_and_pe(struct pnv_ioda_pe *npe, return pe; } -static long pnv_npu_set_window(struct pnv_ioda_pe *npe, +long pnv_npu_set_window(struct pnv_ioda_pe *npe, int num, struct iommu_table *tbl) { struct pnv_phb *phb = npe->phb; @@ -182,13 +183,13 @@ static long pnv_npu_set_window(struct pnv_ioda_pe *npe, pnv_pci_ioda2_tce_invalidate_entire(phb, false); /* Add the table to the list so its TCE cache will get invalidated */ - pnv_pci_link_table_and_group(phb->hose->node, 0, + pnv_pci_link_table_and_group(phb->hose->node, num, tbl, &npe->table_group); return 0; } -static long pnv_npu_unset_window(struct pnv_ioda_pe *npe) +long pnv_npu_unset_window(struct pnv_ioda_pe *npe, int num) { struct pnv_phb *phb = npe->phb; int64_t rc; @@ -205,7 +206,7 @@ static long pnv_npu_unset_window(struct pnv_ioda_pe *npe) } pnv_pci_ioda2_tce_invalidate_entire(phb, false); - 
pnv_pci_unlink_table_and_group(npe->table_group.tables[0], + pnv_pci_unlink_table_and_group(npe->table_group.tables[num], &npe->table_group); return 0; @@ -231,7 +232,7 @@ static void pnv_npu_dma_set_32(struct pnv_ioda_pe *npe) if (!gpe) return; - rc = pnv_npu_set_window(npe, gpe->table_group.tables[0]); + rc = pnv_npu_set_window(npe, 0, gpe->table_group.tables[0]); /* * We don't initialise npu_pe->tce32_table as we always use @@ -255,7 +256,7 @@ static int pnv_npu_dma_set_bypass(struct pnv_ioda_pe *npe) if (phb->type != PNV_PHB_NPU || !npe->pdev) return -EINVAL; - rc = pnv_npu_unset_window(npe); + rc = pnv_npu_unset_window(npe, 0); if (rc != OPAL_SUCCESS) return rc; @@ -307
[PATCH kernel v4 04/11] powerpc/powernv: Define TCE Kill flags
This replaces magic constants for TCE Kill IODA2 register with macros. Signed-off-by: Alexey Kardashevskiy <a...@ozlabs.ru> Reviewed-by: David Gibson <da...@gibson.dropbear.id.au> --- arch/powerpc/platforms/powernv/pci-ioda.c | 7 +-- 1 file changed, 5 insertions(+), 2 deletions(-) diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c index ca307b6..03be25d 100644 --- a/arch/powerpc/platforms/powernv/pci-ioda.c +++ b/arch/powerpc/platforms/powernv/pci-ioda.c @@ -1811,10 +1811,13 @@ static struct iommu_table_ops pnv_ioda1_iommu_ops = { .get = pnv_tce_get, }; +#define TCE_KILL_INVAL_PE PPC_BIT(1) +#define TCE_KILL_INVAL_TCE PPC_BIT(2) + static inline void pnv_pci_ioda2_tce_invalidate_pe(struct pnv_ioda_pe *pe) { /* 01xb - invalidate TCEs that match the specified PE# */ - unsigned long val = (0x4ull << 60) | (pe->pe_number & 0xFF); + unsigned long val = TCE_KILL_INVAL_PE | (pe->pe_number & 0xFF); struct pnv_phb *phb = pe->phb; struct pnv_ioda_pe *npe; int i; @@ -1842,7 +1845,7 @@ static void pnv_pci_ioda2_do_tce_invalidate(unsigned pe_number, bool rm, unsigned long start, end, inc; /* We'll invalidate DMA address in PE scope */ - start = 0x2ull << 60; + start = TCE_KILL_INVAL_TCE; start |= (pe_number & 0xFF); end = start; -- 2.5.0.rc3
[PATCH kernel v4 01/11] vfio_pci: Test for extended capabilities if config space > 256 bytes
PCI-Express spec says that reading 4 bytes at offset 100h should return zero if there is no extended capability so VFIO reads this dword to know if there are extended capabilities. However it is not always possible to access the extended space so generic PCI code in pci_cfg_space_size_ext() checks if pci_read_config_dword() can read beyond 100h and if the check fails, it sets the config space size to 100h. VFIO does its own extended capabilities check by reading at offset 100h which may produce 0xffffffff which VFIO treats as the extended config space presence and calls vfio_ecap_init() which fails to parse capabilities (which is expected) but right before the exit, it writes zero at offset 100h which is beyond the buffer allocated for vdev->vconfig (which is 256 bytes) which leads to random memory corruption. This makes VFIO only check for the extended capabilities if the discovered config size is more than 256 bytes. Signed-off-by: Alexey Kardashevskiy <a...@ozlabs.ru> --- Changes: v2: * instead of checking for 0xffffffff, this only does the check if the device's config size is big enough --- drivers/vfio/pci/vfio_pci_config.c | 17 +++-- 1 file changed, 11 insertions(+), 6 deletions(-) diff --git a/drivers/vfio/pci/vfio_pci_config.c b/drivers/vfio/pci/vfio_pci_config.c index 142c533..d0c4358 100644 --- a/drivers/vfio/pci/vfio_pci_config.c +++ b/drivers/vfio/pci/vfio_pci_config.c @@ -1124,9 +1124,12 @@ static int vfio_cap_len(struct vfio_pci_device *vdev, u8 cap, u8 pos) return pcibios_err_to_errno(ret); if (PCI_X_CMD_VERSION(word)) { - /* Test for extended capabilities */ - pci_read_config_dword(pdev, PCI_CFG_SPACE_SIZE, &dword); - vdev->extended_caps = (dword != 0); + if (pdev->cfg_size > PCI_CFG_SPACE_SIZE) { + /* Test for extended capabilities */ + pci_read_config_dword(pdev, PCI_CFG_SPACE_SIZE, + &dword); + vdev->extended_caps = (dword != 0); + } return PCI_CAP_PCIX_SIZEOF_V2; } else return PCI_CAP_PCIX_SIZEOF_V0; @@ -1138,9 +1141,11 @@ static int vfio_cap_len(struct vfio_pci_device *vdev, 
u8 cap, u8 pos) return byte; case PCI_CAP_ID_EXP: - /* Test for extended capabilities */ - pci_read_config_dword(pdev, PCI_CFG_SPACE_SIZE, &dword); - vdev->extended_caps = (dword != 0); + if (pdev->cfg_size > PCI_CFG_SPACE_SIZE) { + /* Test for extended capabilities */ + pci_read_config_dword(pdev, PCI_CFG_SPACE_SIZE, &dword); + vdev->extended_caps = dword != 0; + } /* length based on version */ if ((pcie_caps_reg(pdev) & PCI_EXP_FLAGS_VERS) == 1) -- 2.5.0.rc3
[PATCH kernel v4 02/11] vfio/spapr: Relax the IOMMU compatibility check
We are going to have multiple different types of PHB on the same system with POWER8 + NVLink and PHBs will have different IOMMU ops. However we only really care about one callback - create_table - so we can relax the compatibility check here. Signed-off-by: Alexey Kardashevskiy <a...@ozlabs.ru> Reviewed-by: David Gibson <da...@gibson.dropbear.id.au> --- drivers/vfio/vfio_iommu_spapr_tce.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c index 0582b72..3054e3f 100644 --- a/drivers/vfio/vfio_iommu_spapr_tce.c +++ b/drivers/vfio/vfio_iommu_spapr_tce.c @@ -1188,7 +1188,8 @@ static int tce_iommu_attach_group(void *iommu_data, goto unlock_exit; } table_group_tmp = iommu_group_get_iommudata(tcegrp->grp); - if (table_group_tmp->ops != table_group->ops) { + if (table_group_tmp->ops->create_table != + table_group->ops->create_table) { pr_warn("tce_vfio: Group %d is incompatible with group %d\n", iommu_group_id(iommu_group), iommu_group_id(tcegrp->grp)); -- 2.5.0.rc3
[PATCH kernel v4 05/11] powerpc/powernv/npu: TCE Kill helpers cleanup
The NPU PHB TCE Kill register is exactly the same as on the rest of POWER8, so let's reuse the existing code for NPU. The only bit missing is a helper to reset the entire TCE cache, so this moves such a helper out of the NPU code and renames it. Since pnv_npu_tce_invalidate() really does invalidate the entire cache, this uses pnv_pci_ioda2_tce_invalidate_entire() directly for NPU. This adds an explicit comment on the workaround for invalidating the NPU TCE cache. Signed-off-by: Alexey Kardashevskiy <a...@ozlabs.ru> Reviewed-by: David Gibson <da...@gibson.dropbear.id.au> Reviewed-by: Alistair Popple <alist...@popple.id.au> --- arch/powerpc/platforms/powernv/npu-dma.c | 41 --- arch/powerpc/platforms/powernv/pci-ioda.c | 29 ++ arch/powerpc/platforms/powernv/pci.h | 7 +- 3 files changed, 25 insertions(+), 52 deletions(-) diff --git a/arch/powerpc/platforms/powernv/npu-dma.c b/arch/powerpc/platforms/powernv/npu-dma.c index 7229acd..778570c 100644 --- a/arch/powerpc/platforms/powernv/npu-dma.c +++ b/arch/powerpc/platforms/powernv/npu-dma.c @@ -25,8 +25,6 @@ * Other types of TCE cache invalidation are not functional in the * hardware. 
*/ -#define TCE_KILL_INVAL_ALL PPC_BIT(0) - static struct pci_dev *get_pci_dev(struct device_node *dn) { return PCI_DN(dn)->pcidev; @@ -161,45 +159,6 @@ static struct pnv_ioda_pe *get_gpu_pci_dev_and_pe(struct pnv_ioda_pe *npe, return pe; } -void pnv_npu_tce_invalidate_entire(struct pnv_ioda_pe *npe) -{ - struct pnv_phb *phb = npe->phb; - - if (WARN_ON(phb->type != PNV_PHB_NPU || - !phb->ioda.tce_inval_reg || - !(npe->flags & PNV_IODA_PE_DEV))) - return; - - mb(); /* Ensure previous TCE table stores are visible */ - __raw_writeq(cpu_to_be64(TCE_KILL_INVAL_ALL), - phb->ioda.tce_inval_reg); -} - -void pnv_npu_tce_invalidate(struct pnv_ioda_pe *npe, - struct iommu_table *tbl, - unsigned long index, - unsigned long npages, - bool rm) -{ - struct pnv_phb *phb = npe->phb; - - /* We can only invalidate the whole cache on NPU */ - unsigned long val = TCE_KILL_INVAL_ALL; - - if (WARN_ON(phb->type != PNV_PHB_NPU || - !phb->ioda.tce_inval_reg || - !(npe->flags & PNV_IODA_PE_DEV))) - return; - - mb(); /* Ensure previous TCE table stores are visible */ - if (rm) - __raw_rm_writeq(cpu_to_be64(val), - (__be64 __iomem *) phb->ioda.tce_inval_reg_phys); - else - __raw_writeq(cpu_to_be64(val), - phb->ioda.tce_inval_reg); -} - void pnv_npu_init_dma_pe(struct pnv_ioda_pe *npe) { struct pnv_ioda_pe *gpe; diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c index 03be25d..a67d51e 100644 --- a/arch/powerpc/platforms/powernv/pci-ioda.c +++ b/arch/powerpc/platforms/powernv/pci-ioda.c @@ -1811,9 +1811,23 @@ static struct iommu_table_ops pnv_ioda1_iommu_ops = { .get = pnv_tce_get, }; +#define TCE_KILL_INVAL_ALL PPC_BIT(0) #define TCE_KILL_INVAL_PE PPC_BIT(1) #define TCE_KILL_INVAL_TCE PPC_BIT(2) +void pnv_pci_ioda2_tce_invalidate_entire(struct pnv_phb *phb, bool rm) +{ + const unsigned long val = TCE_KILL_INVAL_ALL; + + mb(); /* Ensure previous TCE table stores are visible */ + if (rm) + __raw_rm_writeq(cpu_to_be64(val), + (__be64 __iomem *) + 
phb->ioda.tce_inval_reg_phys); + else + __raw_writeq(cpu_to_be64(val), phb->ioda.tce_inval_reg); +} + static inline void pnv_pci_ioda2_tce_invalidate_pe(struct pnv_ioda_pe *pe) { /* 01xb - invalidate TCEs that match the specified PE# */ @@ -1834,7 +1848,7 @@ static inline void pnv_pci_ioda2_tce_invalidate_pe(struct pnv_ioda_pe *pe) if (!npe || npe->phb->type != PNV_PHB_NPU) continue; - pnv_npu_tce_invalidate_entire(npe); + pnv_pci_ioda2_tce_invalidate_entire(npe->phb, false); } } @@ -1883,14 +1897,19 @@ static void pnv_pci_ioda2_tce_invalidate(struct iommu_table *tbl, index, npages); if (pe->flags & PNV_IODA_PE_PEER) - /* Invalidate PEs using the same TCE table */ + /* +* The NVLink hardware does not support TCE kill +* per TCE entry so we have to invalidate +* the entire cache for it. +*/ for (i = 0;
[PATCH kernel v4 03/11] powerpc/powernv: Rename pnv_pci_ioda2_tce_invalidate_entire
Since pnv_pci_ioda2_tce_invalidate_entire() in fact invalidates TCEs for a specific PE rather than the entire cache, rename it to pnv_pci_ioda2_tce_invalidate_pe(). A proper pnv_pci_ioda2_tce_invalidate_entire() will be added in later patches. Signed-off-by: Alexey Kardashevskiy <a...@ozlabs.ru> Reviewed-by: David Gibson <da...@gibson.dropbear.id.au> --- arch/powerpc/platforms/powernv/pci-ioda.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c index c5baaf3..ca307b6 100644 --- a/arch/powerpc/platforms/powernv/pci-ioda.c +++ b/arch/powerpc/platforms/powernv/pci-ioda.c @@ -1811,7 +1811,7 @@ static struct iommu_table_ops pnv_ioda1_iommu_ops = { .get = pnv_tce_get, }; -static inline void pnv_pci_ioda2_tce_invalidate_entire(struct pnv_ioda_pe *pe) +static inline void pnv_pci_ioda2_tce_invalidate_pe(struct pnv_ioda_pe *pe) { /* 01xb - invalidate TCEs that match the specified PE# */ unsigned long val = (0x4ull << 60) | (pe->pe_number & 0xFF); @@ -2075,7 +2075,7 @@ static long pnv_pci_ioda2_set_window(struct iommu_table_group *table_group, pnv_pci_link_table_and_group(phb->hose->node, num, tbl, >table_group); - pnv_pci_ioda2_tce_invalidate_entire(pe); + pnv_pci_ioda2_tce_invalidate_pe(pe); return 0; } @@ -2219,7 +2219,7 @@ static long pnv_pci_ioda2_unset_window(struct iommu_table_group *table_group, if (ret) pe_warn(pe, "Unmapping failed, ret = %ld\n", ret); else - pnv_pci_ioda2_tce_invalidate_entire(pe); + pnv_pci_ioda2_tce_invalidate_pe(pe); pnv_pci_unlink_table_and_group(table_group->tables[num], table_group); -- 2.5.0.rc3
[PATCH kernel v2] vfio_pci: Test for extended capabilities if config space > 256 bytes
PCI-Express spec says that reading 4 bytes at offset 100h should return zero if there is no extended capability so VFIO reads this dword to know if there are extended capabilities. However it is not always possible to access the extended space so generic PCI code in pci_cfg_space_size_ext() checks if pci_read_config_dword() can read beyond 100h and if the check fails, it sets the config space size to 100h. VFIO does its own extended capabilities check by reading at offset 100h which may produce 0xffffffff which VFIO treats as the extended config space presence and calls vfio_ecap_init() which fails to parse capabilities (which is expected) but right before the exit, it writes zero at offset 100h which is beyond the buffer allocated for vdev->vconfig (which is 256 bytes) which leads to random memory corruption. This makes VFIO only check for the extended capabilities if the discovered config size is more than 256 bytes. Signed-off-by: Alexey Kardashevskiy <a...@ozlabs.ru> --- Changes: v2: * instead of checking for 0xffffffff, this only does the check if the device's config size is big enough --- drivers/vfio/pci/vfio_pci_config.c | 17 +++-- 1 file changed, 11 insertions(+), 6 deletions(-) diff --git a/drivers/vfio/pci/vfio_pci_config.c b/drivers/vfio/pci/vfio_pci_config.c index 142c533..d0c4358 100644 --- a/drivers/vfio/pci/vfio_pci_config.c +++ b/drivers/vfio/pci/vfio_pci_config.c @@ -1124,9 +1124,12 @@ static int vfio_cap_len(struct vfio_pci_device *vdev, u8 cap, u8 pos) return pcibios_err_to_errno(ret); if (PCI_X_CMD_VERSION(word)) { - /* Test for extended capabilities */ - pci_read_config_dword(pdev, PCI_CFG_SPACE_SIZE, &dword); - vdev->extended_caps = (dword != 0); + if (pdev->cfg_size > PCI_CFG_SPACE_SIZE) { + /* Test for extended capabilities */ + pci_read_config_dword(pdev, PCI_CFG_SPACE_SIZE, + &dword); + vdev->extended_caps = (dword != 0); + } return PCI_CAP_PCIX_SIZEOF_V2; } else return PCI_CAP_PCIX_SIZEOF_V0; @@ -1138,9 +1141,11 @@ static int vfio_cap_len(struct vfio_pci_device *vdev, 
u8 cap, u8 pos) return byte; case PCI_CAP_ID_EXP: - /* Test for extended capabilities */ - pci_read_config_dword(pdev, PCI_CFG_SPACE_SIZE, &dword); - vdev->extended_caps = (dword != 0); + if (pdev->cfg_size > PCI_CFG_SPACE_SIZE) { + /* Test for extended capabilities */ + pci_read_config_dword(pdev, PCI_CFG_SPACE_SIZE, &dword); + vdev->extended_caps = dword != 0; + } /* length based on version */ if ((pcie_caps_reg(pdev) & PCI_EXP_FLAGS_VERS) == 1) -- 2.5.0.rc3
[PATCH kernel v2] vfio_pci: Test for extended capabilities if config space > 256 bytes
PCI-Express spec says that reading 4 bytes at offset 100h should return zero if there is no extended capability so VFIO reads this dword to know if there are extended capabilities. However it is not always possible to access the extended space so generic PCI code in pci_cfg_space_size_ext() checks if pci_read_config_dword() can read beyond 100h and if the check fails, it sets the config space size to 100h. VFIO does its own extended capabilities check by reading at offset 100h which may produce 0x which VFIO treats as the extended config space presense and calls vfio_ecap_init() which fails to parse capabilities (which is expected) but right before the exit, it writes zero at offset 100h which is beyond the buffer allocated for vdev->vconfig (which is 256 bytes) which leads to random memory corruption. This makes VFIO only check for the extended capabilities if the discovered config size is more than 256 bytes. Signed-off-by: Alexey Kardashevskiy --- Changes: v2: * instead of checking for 0x, this only does the check if device's config size is big enough --- drivers/vfio/pci/vfio_pci_config.c | 17 +++-- 1 file changed, 11 insertions(+), 6 deletions(-) diff --git a/drivers/vfio/pci/vfio_pci_config.c b/drivers/vfio/pci/vfio_pci_config.c index 142c533..d0c4358 100644 --- a/drivers/vfio/pci/vfio_pci_config.c +++ b/drivers/vfio/pci/vfio_pci_config.c @@ -1124,9 +1124,12 @@ static int vfio_cap_len(struct vfio_pci_device *vdev, u8 cap, u8 pos) return pcibios_err_to_errno(ret); if (PCI_X_CMD_VERSION(word)) { - /* Test for extended capabilities */ - pci_read_config_dword(pdev, PCI_CFG_SPACE_SIZE, ); - vdev->extended_caps = (dword != 0); + if (pdev->cfg_size > PCI_CFG_SPACE_SIZE) { + /* Test for extended capabilities */ + pci_read_config_dword(pdev, PCI_CFG_SPACE_SIZE, + ); + vdev->extended_caps = (dword != 0); + } return PCI_CAP_PCIX_SIZEOF_V2; } else return PCI_CAP_PCIX_SIZEOF_V0; @@ -1138,9 +1141,11 @@ static int vfio_cap_len(struct vfio_pci_device *vdev, u8 cap, u8 pos) 
return byte; case PCI_CAP_ID_EXP: - /* Test for extended capabilities */ - pci_read_config_dword(pdev, PCI_CFG_SPACE_SIZE, &dword); - vdev->extended_caps = (dword != 0); + if (pdev->cfg_size > PCI_CFG_SPACE_SIZE) { + /* Test for extended capabilities */ + pci_read_config_dword(pdev, PCI_CFG_SPACE_SIZE, &dword); + vdev->extended_caps = dword != 0; + } /* length based on version */ if ((pcie_caps_reg(pdev) & PCI_EXP_FLAGS_VERS) == 1) -- 2.5.0.rc3
[PATCH kernel] vfio_pci: Make extended capabilities test more robust
VFIO reads a dword beyond the standard PCI config space (256 bytes) to know if there is an extended config space (4096 bytes). It relies on the platform to return zero if there is no extended space. However, at least on the PPC64/POWERNV platform, the system firmware (OPAL) returns 0xffffffff in this case. VFIO treats it as proof that there is extended config space and calls vfio_ecap_init() which fails to parse capabilities (which is expected) but right before the exit it writes zero at offset 256, which is beyond the buffer allocated for vdev->vconfig - it is 256 bytes for a device without extended config space. This adds an additional check that the config space read returned a non-zero and non-0xffffffff value. Signed-off-by: Alexey Kardashevskiy <a...@ozlabs.ru> --- drivers/vfio/pci/vfio_pci_config.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/vfio/pci/vfio_pci_config.c b/drivers/vfio/pci/vfio_pci_config.c index 142c533..8a53421 100644 --- a/drivers/vfio/pci/vfio_pci_config.c +++ b/drivers/vfio/pci/vfio_pci_config.c @@ -1140,7 +1140,7 @@ static int vfio_cap_len(struct vfio_pci_device *vdev, u8 cap, u8 pos) case PCI_CAP_ID_EXP: /* Test for extended capabilities */ pci_read_config_dword(pdev, PCI_CFG_SPACE_SIZE, &dword); - vdev->extended_caps = (dword != 0); + vdev->extended_caps = (dword != 0) && (dword != 0xffffffff); /* length based on version */ if ((pcie_caps_reg(pdev) & PCI_EXP_FLAGS_VERS) == 1) -- 2.5.0.rc3
Re: [RFC v6 04/10] PCI: Add support for enforcing all MMIO BARs to be page aligned
On 04/18/2016 08:56 PM, Yongji Xie wrote: When vfio passes through a PCI device whose MMIO BARs are smaller than PAGE_SIZE, the guest will not handle the mmio accesses to the BARs, which leads to mmio emulations in the host. This is because vfio will not allow passing through one BAR's mmio page which may be shared with other BARs. Otherwise, there would be a backdoor that a guest can use to access BARs of another guest. To solve this issue, this patch modifies resource_alignment to support syntax where multiple devices get the same alignment. So we can use something like "pci=resource_alignment=*:*:*.*:noresize" to enforce the alignment of all MMIO BARs to be at least PAGE_SIZE so that one BAR's mmio page would not be shared with other BARs. And we also define a macro PCIBIOS_MIN_ALIGNMENT to enable this automatically on the PPC64 platform which can easily hit this issue because its PAGE_SIZE is 64KB. Signed-off-by: Yongji Xie --- Documentation/kernel-parameters.txt | 2 ++ arch/powerpc/include/asm/pci.h | 2 ++ drivers/pci/pci.c | 64 +-- 3 files changed, 57 insertions(+), 11 deletions(-) diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt index d8b29ab..542be4a 100644 --- a/Documentation/kernel-parameters.txt +++ b/Documentation/kernel-parameters.txt @@ -2918,6 +2918,8 @@ bytes respectively. Such letter suffixes can also be entirely omitted. aligned memory resources. If <order of align> is not specified, PAGE_SIZE is used as alignment. + <seg>, <bus>, <slot> and <func> can be set to + "*" which means match all values. PCI-PCI bridge can be specified, if resource windows need to be expanded.
noresize: Don't change the resources' sizes when diff --git a/arch/powerpc/include/asm/pci.h b/arch/powerpc/include/asm/pci.h index 6f8065a..78f230f 100644 --- a/arch/powerpc/include/asm/pci.h +++ b/arch/powerpc/include/asm/pci.h @@ -30,6 +30,8 @@ #define PCIBIOS_MIN_IO 0x1000 #define PCIBIOS_MIN_MEM 0x1000 +#define PCIBIOS_MIN_ALIGNMENT PAGE_SIZE + struct pci_dev; /* Values for the `which' argument to sys_pciconfig_iobase syscall. */ diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c index 7564ccc..0381c28 100644 --- a/drivers/pci/pci.c +++ b/drivers/pci/pci.c @@ -4605,7 +4605,12 @@ static resource_size_t pci_specified_resource_alignment(struct pci_dev *dev, int seg, bus, slot, func, align_order, count; resource_size_t align = 0; char *p; + bool invalid = false; +#ifdef PCIBIOS_MIN_ALIGNMENT + align = PCIBIOS_MIN_ALIGNMENT; + *resize = false; +#endif spin_lock(&resource_alignment_lock); p = resource_alignment_param; while (*p) { @@ -4622,16 +4627,49 @@ static resource_size_t pci_specified_resource_alignment(struct pci_dev *dev, } else { align_order = -1; } - if (sscanf(p, "%x:%x:%x.%x%n", - &seg, &bus, &slot, &func, &count) != 4) { I'd replace the above lines with: char segstr[5] = "*", busstr[3] = "*"; char slotstr[3] = "*", funstr[2] = "*"; if (sscanf(p, "%4[^:]:%2[^:]:%2[^.].%1s%n", segstr, busstr, slotstr, funstr, &count) != 4) { and add some wrapper like: static bool glob_match_hex(char const *pat, int val) { char valstr[5]; /* 5 should be enough for PCI */ snprintf(valstr, sizeof(valstr) - 1, "%4x", val); return glob_match(pat, valstr); } and then use glob_match_hex() (or make a wrapper like above on top of fnmatch()), this would enable better mask handling. If anyone finds this useful (which I am not sure about).
+ if (p[0] == '*' && p[1] == ':') { + seg = -1; + count = 1; + } else if (sscanf(p, "%x%n", &seg, &count) != 1 || + p[count] != ':') { + invalid = true; + break; + } + p += count + 1; + if (*p == '*') { + bus = -1; + count = 1; + } else if (sscanf(p, "%x%n", &bus, &count) != 1) { + invalid = true; + break; + } + p += count; + if (*p == '.') { + slot = bus; + bus = seg; seg = 0; - if (sscanf(p, "%x:%x.%x%n", - &bus, &slot, &func, &count) != 3) { - /* Invalid format */ - printk(KERN_ERR "PCI: Can't parse resource_alignment parameter: %s\n", - p); +
Re: [RFC PATCH kernel] Revert "net/mlx4_core: Set UAR page size to 4KB regardless of system page size"
On 03/16/2016 08:45 PM, Or Gerlitz wrote: On Wed, Mar 16, 2016 at 10:34 AM, Alexey Kardashevskiy <a...@ozlabs.ru> wrote: Oh. ok. It also looks like even with the reverted patch, mlx4 VF does not work in a guest: So where is the breakage point for you? does 4.4 work? if not, what? Ah, my bad. It is unrelated to the kernel version. I tried passing a PF to a guest while its VFs were already passed to another guest, to see how exactly it blows up (AER/EEH were thrown but the host recovered => good), but this left the device in a weird state where I could not use a VF in a guest anymore although it seemed to keep working on the host. It seems like the actual adapter does not reset completely when the machine is rebooted; I had to unplug/replug the power cables to fix this. -- Alexey
Re: [RFC PATCH kernel] Revert "net/mlx4_core: Set UAR page size to 4KB regardless of system page size"
On 03/16/2016 05:09 PM, Eli Cohen wrote: On Wed, Mar 16, 2016 at 04:49:00PM +1100, Alexey Kardashevskiy wrote: On 03/16/2016 04:10 PM, Eli Cohen wrote: On Wed, Mar 16, 2016 at 01:07:58PM +1100, Alexey Kardashevskiy wrote: So with v4.5 as a host, there is no actual distro available today to use as a guest in the next 6 months (or whatever it takes to backport this particular patch back there). You could have added a module parameter to enforce the old behaviour, at least... And sorry but from the original commit log I could not understand why exactly all existing guests need to be broken. Could you please point me to a piece of documentation describing all this UAR business (what is UAR, why 128 UARs are required and for what, etc). Thanks. We are going to send a patch that fixes this using a module parameter. The patch will be on top of Huy's patch. Some background to the problem: mlx4 supported devices require 128 UAR What does UAR stand for? User Access Region. It's the way you interface with the hardware. pages from PCI memory space defined by BAR2-3. Each UAR page can be any power of 2 value from 4K up to 64K. Before Huy's patch the driver chose UAR page size to be equal to system page size. Since PowerPC's page size is 64K this means the minimum requirement of UAR pages is not met (default UAR BAR is 8MB and only half of it is really reserved for UARs). And what was the downside? afaict the performance was good... It's not a performance issue. Defining 64KB for a UAR is not required and wastes pci memory mapped i/o space. More details can be found in the programmer's manual. Can you please point me to this manual on the website? I tried, honestly, could not find it. Thanks. It's not publicly available. If you have an FAE that works with your company you can ask him how to get the doc. Oh. ok.
It also looks like even with the reverted patch, mlx4 VF does not work in a guest: root@le-dbg:~# dhclient eth0 mlx4_en: eth0: frag:0 - size:1518 prefix:0 stride:1536 mlx4_core 0000:00:00.0: Internal error detected on the communication channel mlx4_core 0000:00:00.0: device is going to be reset mlx4_core 0000:00:00.0: VF reset is not needed mlx4_core 0000:00:00.0: device was reset successfully mlx4_en 0000:00:00.0: Internal error detected, restarting device mlx4_core 0000:00:00.0: command 0x5 failed: fw status = 0x1 mlx4_core 0000:00:00.0: Failed to close slave function mlx4_core 0000:00:00.0: Detected virtual function - running in slave mode mlx4_core 0000:00:00.0: Sending reset mlx4_core 0000:00:00.0: slave is currently in the middle of FLR - Deferring probe mlx4_core 0000:00:00.0: mlx4_restart_one: ERROR: mlx4_load_one failed, pci_name=0000:00:00.0, err=-517 mlx4_core 0000:00:00.0: mlx4_restart_one was ended, ret=-517 root@le-dbg:~# ifconfig -a lo Link encap:Local Loopback inet addr:127.0.0.1 Mask:255.0.0.0 UP LOOPBACK RUNNING MTU:65536 Metric:1 RX packets:0 errors:0 dropped:0 overruns:0 frame:0 TX packets:0 errors:0 dropped:0 overruns:0 carrier:0 collisions:0 txqueuelen:0 RX bytes:0 (0.0 B) TX bytes:0 (0.0 B) root@le-dbg:~# lspci -v 00:00.0 Ethernet controller: Mellanox Technologies MT27500/MT27520 Family [ConnectX-3/ConnectX-3 Pro Virtual Function] Subsystem: IBM Device 61b0 Physical Slot: C16 Flags: bus master, fast devsel, latency 0 Memory at 1012000 (64-bit, prefetchable) [size=64M] Capabilities: [60] Express Endpoint, MSI 00 Capabilities: [9c] MSI-X: Enable- Count=52 Masked- Capabilities: [40] Power Management version 0 Kernel driver in use: mlx4_core -- Alexey
Re: [RFC PATCH kernel] Revert "net/mlx4_core: Set UAR page size to 4KB regardless of system page size"
On 03/16/2016 04:10 PM, Eli Cohen wrote: On Wed, Mar 16, 2016 at 01:07:58PM +1100, Alexey Kardashevskiy wrote: So with v4.5 as a host, there is no actual distro available today to use as a guest in the next 6 months (or whatever it takes to backport this particular patch back there). You could have added a module parameter to enforce the old behaviour, at least... And sorry but from the original commit log I could not understand why exactly all existing guests need to be broken. Could you please point me to a piece of documentation describing all this UAR business (what is UAR, why 128 UARs are required and for what, etc). Thanks. We are going to send a patch that fixes this using a module parameter. The patch will be on top of Huy's patch. Some background to the problem: mlx4 supported devices require 128 UAR What does UAR stand for? pages from PCI memory space defined by BAR2-3. Each UAR page can be any power of 2 value from 4K up to 64K. Before Huy's patch the driver chose UAR page size to be equal to system page size. Since PowerPC's page size is 64K this means the minimum requirement of UAR pages is not met (default UAR BAR is 8MB and only half of it is really reserved for UARs). And what was the downside? afaict the performance was good... More details can be found in the programmer's manual. Can you please point me to this manual on the website? I tried, honestly, could not find it. Thanks. -- Alexey
Re: [RFC PATCH kernel] Revert "net/mlx4_core: Set UAR page size to 4KB regardless of system page size"
On 03/15/2016 09:40 PM, Or Gerlitz wrote: On Tue, Mar 15, 2016 at 12:19 PM, Alexey Kardashevskiy <a...@ozlabs.ru> wrote: This reverts commit 85743f1eb34548ba4b056d2f184a3d107a3b8917. Without this revert, POWER "pseries" KVM guests with a VF passed to a guest using VFIO fail to bring the driver up: mlx4_core: Mellanox ConnectX core driver v2.2-1 (Feb, 2014) mlx4_core: Initializing 0000:00:00.0 mlx4_core 0000:00:00.0: enabling device (0000 -> 0002) mlx4_core 0000:00:00.0: Detected virtual function - running in slave mode mlx4_core 0000:00:00.0: Sending reset mlx4_core 0000:00:00.0: Sending vhcr0 mlx4_core 0000:00:00.0: HCA minimum page size:512 mlx4_core 0000:00:00.0: UAR size:4096 != kernel PAGE_SIZE of 65536 mlx4_core 0000:00:00.0: Failed to obtain slave caps Both host and guest use 64K system pages. How to fix this properly? Thanks. The commit message says: "[..] Regarding backward compatibility in SR-IOV, if hypervisor has this new code, the virtual OS must be updated. [...]" So with v4.5 as a host, there is no actual distro available today to use as a guest in the next 6 months (or whatever it takes to backport this particular patch back there). You could have added a module parameter to enforce the old behaviour, at least... And sorry but from the original commit log I could not understand why exactly all existing guests need to be broken. Could you please point me to a piece of documentation describing all this UAR business (what is UAR, why 128 UARs are required and for what, etc). Thanks. -- Alexey
Re: [RFC PATCH kernel] Revert "net/mlx4_core: Set UAR page size to 4KB regardless of system page size"
On 03/16/2016 02:29 AM, Christoph Hellwig wrote: On Tue, Mar 15, 2016 at 04:23:33PM +0200, Or Gerlitz wrote: Let us check. I was under (the maybe wrong) impression that before this patch both PF/VF drivers were not operative on some systems, so on those systems it's fair to require the VF driver to be patched too. To me it sounds like the system worked before. Alexey, can you confirm? It worked just fine for year(s), this is definitely a regression. -- Alexey
[RFC PATCH kernel] Revert "net/mlx4_core: Set UAR page size to 4KB regardless of system page size"
This reverts commit 85743f1eb34548ba4b056d2f184a3d107a3b8917. Without this revert, POWER "pseries" KVM guests with a VF passed to a guest using VFIO fail to bring the driver up: mlx4_core: Mellanox ConnectX core driver v2.2-1 (Feb, 2014) mlx4_core: Initializing 0000:00:00.0 mlx4_core 0000:00:00.0: enabling device (0000 -> 0002) mlx4_core 0000:00:00.0: Detected virtual function - running in slave mode mlx4_core 0000:00:00.0: Sending reset mlx4_core 0000:00:00.0: Sending vhcr0 mlx4_core 0000:00:00.0: HCA minimum page size:512 mlx4_core 0000:00:00.0: UAR size:4096 != kernel PAGE_SIZE of 65536 mlx4_core 0000:00:00.0: Failed to obtain slave caps Both host and guest use 64K system pages. How to fix this properly? Thanks. --- drivers/infiniband/hw/mlx4/qp.c | 7 +-- drivers/net/ethernet/mellanox/mlx4/cq.c | 4 +- drivers/net/ethernet/mellanox/mlx4/en_resources.c | 3 +- drivers/net/ethernet/mellanox/mlx4/en_tx.c | 4 +- drivers/net/ethernet/mellanox/mlx4/eq.c | 7 ++- drivers/net/ethernet/mellanox/mlx4/main.c | 56 +-- drivers/net/ethernet/mellanox/mlx4/pd.c | 12 ++--- include/linux/mlx4/device.h | 13 -- 8 files changed, 22 insertions(+), 84 deletions(-) diff --git a/drivers/infiniband/hw/mlx4/qp.c b/drivers/infiniband/hw/mlx4/qp.c index fd97534..bc5536f 100644 --- a/drivers/infiniband/hw/mlx4/qp.c +++ b/drivers/infiniband/hw/mlx4/qp.c @@ -1681,12 +1681,9 @@ static int __mlx4_ib_modify_qp(struct ib_qp *ibqp, } if (qp->ibqp.uobject) - context->usr_page = cpu_to_be32( - mlx4_to_hw_uar_index(dev->dev, - to_mucontext(ibqp->uobject->context)->uar.index)); + context->usr_page = cpu_to_be32(to_mucontext(ibqp->uobject->context)->uar.index); else - context->usr_page = cpu_to_be32( - mlx4_to_hw_uar_index(dev->dev, dev->priv_uar.index)); + context->usr_page = cpu_to_be32(dev->priv_uar.index); if (attr_mask & IB_QP_DEST_QPN) context->remote_qpn = cpu_to_be32(attr->dest_qp_num); diff --git a/drivers/net/ethernet/mellanox/mlx4/cq.c b/drivers/net/ethernet/mellanox/mlx4/cq.c index a849da9..3348e64 100644 ---
a/drivers/net/ethernet/mellanox/mlx4/cq.c +++ b/drivers/net/ethernet/mellanox/mlx4/cq.c @@ -318,9 +318,7 @@ int mlx4_cq_alloc(struct mlx4_dev *dev, int nent, if (timestamp_en) cq_context->flags |= cpu_to_be32(1 << 19); - cq_context->logsize_usrpage = - cpu_to_be32((ilog2(nent) << 24) | - mlx4_to_hw_uar_index(dev, uar->index)); + cq_context->logsize_usrpage = cpu_to_be32((ilog2(nent) << 24) | uar->index); cq_context->comp_eqn = priv->eq_table.eq[MLX4_CQ_TO_EQ_VECTOR(vector)].eqn; cq_context->log_page_size = mtt->page_shift - MLX4_ICM_PAGE_SHIFT; diff --git a/drivers/net/ethernet/mellanox/mlx4/en_resources.c b/drivers/net/ethernet/mellanox/mlx4/en_resources.c index 02e925d..12aab5a 100644 --- a/drivers/net/ethernet/mellanox/mlx4/en_resources.c +++ b/drivers/net/ethernet/mellanox/mlx4/en_resources.c @@ -58,8 +58,7 @@ void mlx4_en_fill_qp_context(struct mlx4_en_priv *priv, int size, int stride, } else { context->sq_size_stride = ilog2(TXBB_SIZE) - 4; } - context->usr_page = cpu_to_be32(mlx4_to_hw_uar_index(mdev->dev, - mdev->priv_uar.index)); + context->usr_page = cpu_to_be32(mdev->priv_uar.index); context->local_qpn = cpu_to_be32(qpn); context->pri_path.ackto = 1 & 0x07; context->pri_path.sched_queue = 0x83 | (priv->port - 1) << 6; diff --git a/drivers/net/ethernet/mellanox/mlx4/en_tx.c b/drivers/net/ethernet/mellanox/mlx4/en_tx.c index e0946ab..4421bf5 100644 --- a/drivers/net/ethernet/mellanox/mlx4/en_tx.c +++ b/drivers/net/ethernet/mellanox/mlx4/en_tx.c @@ -213,9 +213,7 @@ int mlx4_en_activate_tx_ring(struct mlx4_en_priv *priv, mlx4_en_fill_qp_context(priv, ring->size, ring->stride, 1, 0, ring->qpn, ring->cqn, user_prio, &ring->context); if (ring->bf_alloced) - ring->context.usr_page = - cpu_to_be32(mlx4_to_hw_uar_index(mdev->dev, - ring->bf.uar->index)); + ring->context.usr_page = cpu_to_be32(ring->bf.uar->index); err = mlx4_qp_to_ready(mdev->dev, &ring->wqres.mtt, &ring->context, &ring->qp, &ring->qp_state); diff --git a/drivers/net/ethernet/mellanox/mlx4/eq.c
b/drivers/net/ethernet/mellanox/mlx4/eq.c index f613977..4696053 100644 --- a/drivers/net/ethernet/mellanox/mlx4/eq.c +++ b/drivers/net/ethernet/mellanox/mlx4/eq.c @@ -940,10 +940,9 @@ static void __iomem *mlx4_get_eq_uar(struct mlx4_dev *dev, struct mlx4_eq *eq)
Re: [RFC PATCH v4 1/7] PCI: Add a new option for resource_alignment to reassign alignment
On 03/07/2016 06:48 PM, Yongji Xie wrote: When using the resource_alignment kernel parameter, the current implementation reassigns the alignment by changing resources' size which can potentially break some drivers. How can this possibly break any driver?... It rounds up, not down, what am I missing here? So this patch adds a new option "noresize" for the parameter to solve this problem. Signed-off-by: Yongji Xie --- Documentation/kernel-parameters.txt |5 - drivers/pci/pci.c | 36 +-- 2 files changed, 30 insertions(+), 11 deletions(-) diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt index 9a53c92..d8b29ab 100644 --- a/Documentation/kernel-parameters.txt +++ b/Documentation/kernel-parameters.txt @@ -2912,13 +2912,16 @@ bytes respectively. Such letter suffixes can also be entirely omitted. window. The default value is 64 megabytes. resource_alignment= Format: - [@][:]:.[; ...] + [@][:]:. + [:noresize][; ...] Specifies alignment and device to reassign aligned memory resources. If is not specified, PAGE_SIZE is used as alignment. PCI-PCI bridge can be specified, if resource windows need to be expanded. + noresize: Don't change the resources' sizes when + reassigning alignment. ecrc= Enable/disable PCIe ECRC (transaction layer end-to-end CRC checking). bios: Use BIOS/firmware settings. This is the diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c index 602eb42..760cce5 100644 --- a/drivers/pci/pci.c +++ b/drivers/pci/pci.c @@ -4598,7 +4598,8 @@ static DEFINE_SPINLOCK(resource_alignment_lock); * RETURNS: Resource alignment if it is specified. * Zero if it is not specified.
*/ -static resource_size_t pci_specified_resource_alignment(struct pci_dev *dev) +static resource_size_t pci_specified_resource_alignment(struct pci_dev *dev, + bool *resize) { int seg, bus, slot, func, align_order, count; resource_size_t align = 0; @@ -4626,6 +4627,11 @@ static resource_size_t pci_specified_resource_alignment(struct pci_dev *dev) } } p += count; + if (!strncmp(p, ":noresize", 9)) { + *resize = false; + p += 9; + } else + *resize = true; if (seg == pci_domain_nr(dev->bus) && bus == dev->bus->number && slot == PCI_SLOT(dev->devfn) && @@ -4658,11 +4664,12 @@ void pci_reassigndev_resource_alignment(struct pci_dev *dev) { int i; struct resource *r; + bool resize; resource_size_t align, size; u16 command; /* check if specified PCI is target device to reassign */ - align = pci_specified_resource_alignment(dev); + align = pci_specified_resource_alignment(dev, &resize); A compiler should have warned here about passing a pointer to uninitialized @resize. if (!align) return; @@ -4684,15 +4691,24 @@ void pci_reassigndev_resource_alignment(struct pci_dev *dev) if (!(r->flags & IORESOURCE_MEM)) continue; size = resource_size(r); - if (size < align) { - size = align; - dev_info(&dev->dev, - "Rounding up size of resource #%d to %#llx.\n", - i, (unsigned long long)size); + if (resize) { + if (size < align) { + size = align; + dev_info(&dev->dev, + "Rounding up size of resource #%d to %#llx.\n", + i, (unsigned long long)size); + } + r->flags |= IORESOURCE_UNSET; + r->end = size - 1; + r->start = 0; + } else { + if (size > align) + align = size; + r->flags &= ~IORESOURCE_SIZEALIGN; + r->flags |= IORESOURCE_STARTALIGN | IORESOURCE_UNSET; + r->start = align; + r->end = r->start + size - 1; } - r->flags |= IORESOURCE_UNSET; - r->end = size - 1; - r->start = 0; }
Re: [kernel] powerpc: Make vmalloc_to_phys() public
On 01/25/2016 09:06 PM, Paul Mackerras wrote: On Mon, Jan 25, 2016 at 04:46:03PM +1100, Michael Ellerman wrote: On Thu, 2016-21-01 at 07:35:08 UTC, Alexey Kardashevskiy wrote: This makes vmalloc_to_phys() public as there will be another user (in-kernel VFIO acceleration) for it soon. As a part of future little optimization, this changes the helper to call vmalloc_to_pfn() instead of vmalloc_to_page() as the size of the struct page may not be power-of-two aligned which will make gcc use multiply instructions instead of shifts. You should also mention why you need to export it, presumably because whatever new user you have in mind can be built as a module. If I remember correctly, it's the *existing* user in hv-24x7.c that needs the export. No, CONFIG_HV_PERF_CTRS (which controls hv-24x7.c) is "bool" so today hv-24x7.c cannot compile as a module. This exports the vmalloc_to_phys() symbol as a new user is going to be KVM which can compile as a module. -- Alexey
Re: [PATCH] vfio/noiommu: Don't use iommu_present() to track fake groups
On 01/23/2016 04:23 AM, Alex Williamson wrote: Using iommu_present() to determine whether an IOMMU group is real or fake has some problems. First, apparently Power systems don't register an IOMMU on the device bus, so the groups and containers get marked as noiommu and then won't bind to their actual IOMMU driver. Second, I expect we'll run into the same issue as we try to support vGPUs through vfio, since they're likely to emulate this behavior of creating an IOMMU group on a virtual device and then providing a vfio IOMMU backend tailored to the sort of isolation they provide, which won't necessarily be fully compatible with the IOMMU API. The solution here is to use the existing iommudata interface to IOMMU groups, which allows us to easily identify the fake groups we've created for noiommu purposes. The iommudata we set is purely arbitrary since we're only comparing the address, so we use the address of the noiommu switch itself. Reported-by: Alexey Kardashevskiy Fixes: 03a76b60f8ba ("vfio: Include No-IOMMU mode") Signed-off-by: Alex Williamson Reviewed-by: Alexey Kardashevskiy Tested-by: Alexey Kardashevskiy Thanks! -- Alexey
Re: [PATCH kernel] vfio: Only check for bus IOMMU when NOIOMMU is selected
On 01/22/2016 05:34 PM, Alexey Kardashevskiy wrote: Recent change 03a76b60 "vfio: Include No-IOMMU mode" disabled VFIO on systems which do not implement iommu_ops for the PCI bus even though there is a VFIO IOMMU driver for it such as the SPAPR TCE driver for the PPC64/powernv platform. This moves the iommu_present() call under #ifdef CONFIG_VFIO_NOIOMMU as it is done in the rest of the file to re-enable VFIO on powernv. Signed-off-by: Alexey Kardashevskiy --- drivers/vfio/vfio.c | 5 - 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c index 82f25cc..3f8060e 100644 --- a/drivers/vfio/vfio.c +++ b/drivers/vfio/vfio.c @@ -343,7 +343,6 @@ static struct vfio_group *vfio_create_group(struct iommu_group *iommu_group, atomic_set(&group->opened, 0); group->iommu_group = iommu_group; group->noiommu = !iommu_present; - Agrh. Unrelated change, repost? group->nb.notifier_call = vfio_iommu_group_notifier; /* @@ -767,7 +766,11 @@ int vfio_add_group_dev(struct device *dev, group = vfio_group_get_from_iommu(iommu_group); if (!group) { +#ifdef CONFIG_VFIO_NOIOMMU group = vfio_create_group(iommu_group, iommu_present(dev->bus)); +#else + group = vfio_create_group(iommu_group, true); +#endif if (IS_ERR(group)) { iommu_group_put(iommu_group); return PTR_ERR(group); -- Alexey
[PATCH kernel] vfio: Only check for bus IOMMU when NOIOMMU is selected
Recent change 03a76b60 "vfio: Include No-IOMMU mode" disabled VFIO on systems which do not implement iommu_ops for the PCI bus even though there is a VFIO IOMMU driver for it such as the SPAPR TCE driver for the PPC64/powernv platform. This moves the iommu_present() call under #ifdef CONFIG_VFIO_NOIOMMU as it is done in the rest of the file to re-enable VFIO on powernv. Signed-off-by: Alexey Kardashevskiy --- drivers/vfio/vfio.c | 5 - 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c index 82f25cc..3f8060e 100644 --- a/drivers/vfio/vfio.c +++ b/drivers/vfio/vfio.c @@ -343,7 +343,6 @@ static struct vfio_group *vfio_create_group(struct iommu_group *iommu_group, atomic_set(&group->opened, 0); group->iommu_group = iommu_group; group->noiommu = !iommu_present; - group->nb.notifier_call = vfio_iommu_group_notifier; /* @@ -767,7 +766,11 @@ int vfio_add_group_dev(struct device *dev, group = vfio_group_get_from_iommu(iommu_group); if (!group) { +#ifdef CONFIG_VFIO_NOIOMMU group = vfio_create_group(iommu_group, iommu_present(dev->bus)); +#else + group = vfio_create_group(iommu_group, true); +#endif if (IS_ERR(group)) { iommu_group_put(iommu_group); return PTR_ERR(group); -- 2.5.0.rc3
[PATCH kernel] powerpc: Make vmalloc_to_phys() public
This makes vmalloc_to_phys() public as there will be another user (in-kernel VFIO acceleration) for it soon. As a part of a future little optimization, this changes the helper to call vmalloc_to_pfn() instead of vmalloc_to_page() as the size of the struct page may not be power-of-two aligned which will make gcc use multiply instructions instead of shifts. Signed-off-by: Alexey Kardashevskiy --- A couple of notes: 1. real_vmalloc_addr() will be reworked later by Paul separately; 2. the optimization note is not valid at the moment as vmalloc_to_pfn() calls vmalloc_to_page() which does the actual search; these helpers' functionality will be swapped later (also by Paul). --- arch/powerpc/include/asm/pgtable.h | 3 +++ arch/powerpc/mm/pgtable.c | 8 arch/powerpc/perf/hv-24x7.c| 8 3 files changed, 11 insertions(+), 8 deletions(-) diff --git a/arch/powerpc/include/asm/pgtable.h b/arch/powerpc/include/asm/pgtable.h index ac9fb11..47897a3 100644 --- a/arch/powerpc/include/asm/pgtable.h +++ b/arch/powerpc/include/asm/pgtable.h @@ -78,6 +78,9 @@ static inline pte_t *find_linux_pte_or_hugepte(pgd_t *pgdir, unsigned long ea, } return __find_linux_pte_or_hugepte(pgdir, ea, is_thp, shift); } + +unsigned long vmalloc_to_phys(void *vmalloc_addr); + #endif /* __ASSEMBLY__ */ #endif /* _ASM_POWERPC_PGTABLE_H */ diff --git a/arch/powerpc/mm/pgtable.c b/arch/powerpc/mm/pgtable.c index 83dfd79..de37ff4 100644 --- a/arch/powerpc/mm/pgtable.c +++ b/arch/powerpc/mm/pgtable.c @@ -243,3 +243,11 @@ void assert_pte_locked(struct mm_struct *mm, unsigned long addr) } #endif /* CONFIG_DEBUG_VM */ +unsigned long vmalloc_to_phys(void *va) +{ + unsigned long pfn = vmalloc_to_pfn(va); + + BUG_ON(!pfn); + return __pa(pfn_to_kaddr(pfn)) + offset_in_page(va); +} +EXPORT_SYMBOL_GPL(vmalloc_to_phys); diff --git a/arch/powerpc/perf/hv-24x7.c b/arch/powerpc/perf/hv-24x7.c index 9f9dfda..3b09ecf 100644 --- a/arch/powerpc/perf/hv-24x7.c +++ b/arch/powerpc/perf/hv-24x7.c @@ -493,14 +493,6 @@ static size_t
event_to_attr_ct(struct hv_24x7_event_data *event) } } -static unsigned long vmalloc_to_phys(void *v) -{ - struct page *p = vmalloc_to_page(v); - - BUG_ON(!p); - return page_to_phys(p) + offset_in_page(v); -} - /* */ struct event_uniq { struct rb_node node; -- 2.5.0.rc3
Re: [PATCH kernel] rcu: Define lockless version of list_for_each_entry_rcu
On 12/08/2015 04:46 PM, Paul E. McKenney wrote: On Tue, Dec 08, 2015 at 04:20:03PM +1100, Paul Mackerras wrote: On Sat, Dec 05, 2015 at 06:19:46PM -0800, Paul E. McKenney wrote: As in the following? (And yes, I was confused by the fact that the docbook header was in front of a differently-named macro!) That looks great! Have you committed to one of your branches? Indeed I have, though quite recently. Please see branch rcu/dev of: git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu.git Or, more specifically, this commit: 69b907297f4e If your testing goes well, I will push this into the upcoming merge window. Sorry, it took a little longer than I expected. It works just fine, thanks! -- Alexey -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC PATCH 0/3] VFIO: capability chains
On 12/18/2015 01:38 PM, Alex Williamson wrote: On Fri, 2015-12-18 at 13:05 +1100, Alexey Kardashevskiy wrote: On 11/24/2015 07:43 AM, Alex Williamson wrote: Please see the commit log and comments in patch 1 for a general explanation of the problems that this series tries to address. The general problem is that we have several cases where we want to expose variable sized information to the user, whether it's sparse mmaps for a region, as implemented here, or DMA mapping ranges of an IOMMU, or reserved MSI mapping ranges, etc. Extending data structures is hard; extending them to report variable sized data is really hard. After considering several options, I think the best approach is to copy how PCI does capabilities. This allows the ioctl to only expose the capabilities that are relevant for them, avoids data structures that are too complicated to parse, and avoids creating a new ioctl each time we think of something else that we'd like to report. This method also doesn't preclude extensions to the fixed structure since the offset of these capabilities is entirely dynamic. Comments welcome, I'll also follow-up to the QEMU and KVM lists with an RFC making use of this for mmaps skipping over the MSI-X table. Thanks, Out of curiosity - could this information be exposed to the userspace via /sys/bus/pci/devices/:xx:xx:x/vfio_? It seems not to change after the vfio_pci driver is bound to a device. For what purpose? vfio doesn't have a sysfs interface, why start one? Thanks, well, it could simplify debugging a bit if this information were available from userspace without programming a test tool doing some ioctl()'s. Not a big deal though... -- Alexey
Re: [RFC PATCH 0/3] VFIO: capability chains
On 11/24/2015 07:43 AM, Alex Williamson wrote: Please see the commit log and comments in patch 1 for a general explanation of the problems that this series tries to address. The general problem is that we have several cases where we want to expose variable sized information to the user, whether it's sparse mmaps for a region, as implemented here, or DMA mapping ranges of an IOMMU, or reserved MSI mapping ranges, etc. Extending data structures is hard; extending them to report variable sized data is really hard. After considering several options, I think the best approach is to copy how PCI does capabilities. This allows the ioctl to only expose the capabilities that are relevant for them, avoids data structures that are too complicated to parse, and avoids creating a new ioctl each time we think of something else that we'd like to report. This method also doesn't preclude extensions to the fixed structure since the offset of these capabilities is entirely dynamic. Comments welcome, I'll also follow-up to the QEMU and KVM lists with an RFC making use of this for mmaps skipping over the MSI-X table. Thanks, Out of curiosity - could this information be exposed to the userspace via /sys/bus/pci/devices/:xx:xx:x/vfio_? It seems not to change after vfio_pci driver is bound to a device. 
Alex --- Alex Williamson (3): vfio: Define capability chains vfio: Define sparse mmap capability for regions vfio/pci: Include sparse mmap capability for MSI-X table regions drivers/vfio/pci/vfio_pci.c | 101 +++ include/uapi/linux/vfio.h | 53 ++- 2 files changed, 152 insertions(+), 2 deletions(-) -- Alexey
[PATCH kernel] vfio: Add explicit alignments in vfio_iommu_spapr_tce_create
The vfio_iommu_spapr_tce_create struct has 4x32bit and 2x64bit fields which should have resulted in sizeof(struct vfio_iommu_spapr_tce_create) equal to 32 bytes. However, due to gcc's default alignment, the actual size of this struct is 40 bytes. This fills the gaps with __resv1/2 fields. This should not cause any change in behavior. Signed-off-by: Alexey Kardashevskiy --- include/uapi/linux/vfio.h | 2 ++ 1 file changed, 2 insertions(+) diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h index 9fd7b5d..d117233 100644 --- a/include/uapi/linux/vfio.h +++ b/include/uapi/linux/vfio.h @@ -568,8 +568,10 @@ struct vfio_iommu_spapr_tce_create { __u32 flags; /* in */ __u32 page_shift; + __u32 __resv1; __u64 window_size; __u32 levels; + __u32 __resv2; /* out */ __u64 start_addr; }; -- 2.5.0.rc3
Re: [RFC PATCH 0/3] VFIO: capability chains
On 11/24/2015 07:43 AM, Alex Williamson wrote: Please see the commit log and comments in patch 1 for a general explanation of the problems that this series tries to address. The general problem is that we have several cases where we want to expose variable sized information to the user, whether it's sparse mmaps for a region, as implemented here, or DMA mapping ranges of an IOMMU, or reserved MSI mapping ranges, etc. Extending data structures is hard; extending them to report variable sized data is really hard. After considering several options, I think the best approach is to copy how PCI does capabilities. This allows the ioctl to only expose the capabilities that are relevant for them, avoids data structures that are too complicated to parse, and avoids creating a new ioctl each time we think of something else that we'd like to report. This method also doesn't preclude extensions to the fixed structure since the offset of these capabilities is entirely dynamic. Comments welcome, I'll also follow-up to the QEMU and KVM lists with an RFC making use of this for mmaps skipping over the MSI-X table. Thanks, Out of curiosity - could this information be exposed to the userspace via /sys/bus/pci/devices/:xx:xx:x/vfio_? It seems not to change after vfio_pci driver is bound to a device. 
> Alex
>
> ---
>
> Alex Williamson (3):
>       vfio: Define capability chains
>       vfio: Define sparse mmap capability for regions
>       vfio/pci: Include sparse mmap capability for MSI-X table regions
>
>  drivers/vfio/pci/vfio_pci.c | 101 +++
>  include/uapi/linux/vfio.h   |  53 ++-
>  2 files changed, 152 insertions(+), 2 deletions(-)

--
Alexey

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
[PATCH kernel] vfio: Add explicit alignments in vfio_iommu_spapr_tce_create
The vfio_iommu_spapr_tce_create struct has 4x32bit and 2x64bit fields which should have resulted in sizeof(struct vfio_iommu_spapr_tce_create) equal to 32 bytes. However, due to gcc's default alignment, the actual size of this struct is 40 bytes. This fills the gaps with __resv1/2 fields.

This should not cause any change in behavior.

Signed-off-by: Alexey Kardashevskiy <a...@ozlabs.ru>
---
 include/uapi/linux/vfio.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index 9fd7b5d..d117233 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -568,8 +568,10 @@ struct vfio_iommu_spapr_tce_create {
 	__u32 flags;
 	/* in */
 	__u32 page_shift;
+	__u32 __resv1;
 	__u64 window_size;
 	__u32 levels;
+	__u32 __resv2;
 	/* out */
 	__u64 start_addr;
 };
--
2.5.0.rc3
Re: [RFC PATCH 0/3] VFIO: capability chains
On 12/18/2015 01:38 PM, Alex Williamson wrote:
> On Fri, 2015-12-18 at 13:05 +1100, Alexey Kardashevskiy wrote:
>> On 11/24/2015 07:43 AM, Alex Williamson wrote:
>>> Please see the commit log and comments in patch 1 for a general
>>> explanation of the problems that this series tries to address. The
>>> general problem is that we have several cases where we want to expose
>>> variable sized information to the user, whether it's sparse mmaps for
>>> a region, as implemented here, or DMA mapping ranges of an IOMMU, or
>>> reserved MSI mapping ranges, etc. Extending data structures is hard;
>>> extending them to report variable sized data is really hard. After
>>> considering several options, I think the best approach is to copy how
>>> PCI does capabilities. This allows the ioctl to only expose the
>>> capabilities that are relevant for them, avoids data structures that
>>> are too complicated to parse, and avoids creating a new ioctl each
>>> time we think of something else that we'd like to report. This method
>>> also doesn't preclude extensions to the fixed structure since the
>>> offset of these capabilities is entirely dynamic. Comments welcome,
>>> I'll also follow-up to the QEMU and KVM lists with an RFC making use
>>> of this for mmaps skipping over the MSI-X table. Thanks,
>>
>> Out of curiosity - could this information be exposed to the userspace
>> via /sys/bus/pci/devices/:xx:xx:x/vfio_? It seems not to change after
>> vfio_pci driver is bound to a device.
>
> For what purpose? vfio doesn't have a sysfs interface, why start one?
> Thanks,

Well, it could simplify debugging a bit if this information was available from the userspace without writing a test tool doing some ioctl()'s. Not a big deal though...

--
Alexey
Re: [PATCH kernel] rcu: Define lockless version of list_for_each_entry_rcu
On 11/04/2015 01:39 AM, Steven Rostedt wrote:
> On Tue, 3 Nov 2015 17:57:05 +1100
> Alexey Kardashevskiy <a...@ozlabs.ru> wrote:
>
>> This defines list_for_each_entry_lockless. This allows safe list
>> traversal in cases when lockdep() invocation is unwanted, like real
>> mode (MMU is off).
>>
>> Signed-off-by: Alexey Kardashevskiy <a...@ozlabs.ru>
>> ---
>> This is for VFIO acceleration in POWERKVM for pSeries guests.
>>
>> There is a KVM instance. There also can be some VFIO (PCI passthru)
>> devices attached to a KVM guest.
>>
>> To perform DMA, a pSeries guest registers DMA memory by calling some
>> hypercalls explicitly, at a rate close to one-two hcalls per network
>> packet, i.e. very often.
>>
>> When a guest does a hypercall (which is just an assembly instruction),
>> the host kernel receives it in real mode (MMU is off). When real mode
>> fails to handle it, it enables the MMU and tries handling the hcall in
>> virtual mode.
>>
>> A logical bus ID (LIOBN) is a target id for these hypercalls.
>>
>> Each VFIO device belongs to an IOMMU group. Each group has an address
>> translation table. It is allowed to have multiple IOMMU groups (i.e.
>> multiple tables) under the same LIOBN. So effectively every DMA hcall
>> has to update one or more TCE tables attached to the same LIOBN. RCU
>> is used to update/traverse this list safely.
>>
>> Using RCU as is in virtual mode is fine. Lockdep works, etc.
>> list_add_rcu() is used to populate the list; list_del_rcu() +
>> call_rcu() are used to remove groups from the list. These operations
>> can happen at runtime as a result of PCI hotplug/unplug in guests.
>>
>> Using RCU as is in real mode is not fine as some RCU checks can lock
>> up the system, and in real mode we won't even have a chance to see any
>> debug output. This is why rcu_read_lock() and rcu_read_unlock() are
>> NOT used.
>>
>> A previous version of this used to define
>> list_for_each_entry_rcu_notrace() but it was proposed to use
>> list_entry_lockless() instead. However, the comment for
>> lockless_dereference() suggests this is a good idea if "lifetime is
>> managed by something other than RCU", but it is managed by RCU in my
>> case.
>>
>> So what would be the correct approach here? Thanks.
>
> If the only use case for this so far is in POWERKVM, perhaps it should
> be defined specifically (and in arch/powerpc) and not confuse others
> about using this. Or, if you do imagine that this can be used in other
> scenarios, then a much deeper comment must be made in the code in the
> kerneldoc section.
>
> list_for_each_entry_rcu() should really be used in 99.99% of the time
> in the kernel. This looks to be an extreme exception. I hate to add a
> generic helper for something that will only be used in one location.

No, I cannot imagine this really and I can move it somewhere in arch/powerpc. Still, is my approach correct? What does the comment for lockless_dereference() actually mean - will it not work together with RCU at all, or is it there to force people not to use it, as "list_for_each_entry_rcu() should really be used in 99.99% of the time"? :)

> -- Steve
>
>> ---
>>  include/linux/rculist.h | 16
>>  1 file changed, 16 insertions(+)
>>
>> diff --git a/include/linux/rculist.h b/include/linux/rculist.h
>> index 17c6b1f..a83a924 100644
>> --- a/include/linux/rculist.h
>> +++ b/include/linux/rculist.h
>> @@ -308,6 +308,22 @@ static inline void list_splice_init_rcu(struct list_head *list,
>>  		pos = list_entry_rcu(pos->member.next, typeof(*pos), member))
>>
>>  /**
>> + * list_for_each_entry_lockless - iterate over rcu list of given type
>> + * @pos:	the type * to use as a loop cursor.
>> + * @head:	the head for your list.
>> + * @member:	the name of the list_struct within the struct.
>> + *
>> + * This list-traversal primitive may safely run concurrently
>> + */
>> +#define list_entry_lockless(ptr, type, member) \
>> +	container_of((typeof(ptr))lockless_dereference(ptr), type, member)
>> +
>> +#define list_for_each_entry_lockless(pos, head, member) \
>> +	for (pos = list_entry_lockless((head)->next, typeof(*pos), member); \
>> +	     &pos->member != (head); \
>> +	     pos = list_entry_lockless(pos->member.next, typeof(*pos), member))
>> +
>> +/**
>>   * list_for_each_entry_continue_rcu - continue iteration over list of given type
>>   * @pos:	the type * to use as a loop cursor.
>>   * @head:	the head for your list.

--
Alexey
[PATCH kernel] rcu: Define lockless version of list_for_each_entry_rcu
This defines list_for_each_entry_lockless. This allows safe list traversal in cases when lockdep() invocation is unwanted, like real mode (MMU is off).

Signed-off-by: Alexey Kardashevskiy <a...@ozlabs.ru>
---
This is for VFIO acceleration in POWERKVM for pSeries guests.

There is a KVM instance. There also can be some VFIO (PCI passthru) devices attached to a KVM guest.

To perform DMA, a pSeries guest registers DMA memory by calling some hypercalls explicitly, at a rate close to one-two hcalls per network packet, i.e. very often.

When a guest does a hypercall (which is just an assembly instruction), the host kernel receives it in real mode (MMU is off). When real mode fails to handle it, it enables the MMU and tries handling the hcall in virtual mode.

A logical bus ID (LIOBN) is a target id for these hypercalls.

Each VFIO device belongs to an IOMMU group. Each group has an address translation table. It is allowed to have multiple IOMMU groups (i.e. multiple tables) under the same LIOBN. So effectively every DMA hcall has to update one or more TCE tables attached to the same LIOBN. RCU is used to update/traverse this list safely.

Using RCU as is in virtual mode is fine. Lockdep works, etc. list_add_rcu() is used to populate the list; list_del_rcu() + call_rcu() are used to remove groups from the list. These operations can happen at runtime as a result of PCI hotplug/unplug in guests.

Using RCU as is in real mode is not fine as some RCU checks can lock up the system, and in real mode we won't even have a chance to see any debug output. This is why rcu_read_lock() and rcu_read_unlock() are NOT used.

A previous version of this used to define list_for_each_entry_rcu_notrace() but it was proposed to use list_entry_lockless() instead. However, the comment for lockless_dereference() suggests this is a good idea if "lifetime is managed by something other than RCU", but it is managed by RCU in my case.

So what would be the correct approach here? Thanks.
---
 include/linux/rculist.h | 16
 1 file changed, 16 insertions(+)

diff --git a/include/linux/rculist.h b/include/linux/rculist.h
index 17c6b1f..a83a924 100644
--- a/include/linux/rculist.h
+++ b/include/linux/rculist.h
@@ -308,6 +308,22 @@ static inline void list_splice_init_rcu(struct list_head *list,
 		pos = list_entry_rcu(pos->member.next, typeof(*pos), member))
 
 /**
+ * list_for_each_entry_lockless - iterate over rcu list of given type
+ * @pos:	the type * to use as a loop cursor.
+ * @head:	the head for your list.
+ * @member:	the name of the list_struct within the struct.
+ *
+ * This list-traversal primitive may safely run concurrently
+ */
+#define list_entry_lockless(ptr, type, member) \
+	container_of((typeof(ptr))lockless_dereference(ptr), type, member)
+
+#define list_for_each_entry_lockless(pos, head, member) \
+	for (pos = list_entry_lockless((head)->next, typeof(*pos), member); \
+	     &pos->member != (head); \
+	     pos = list_entry_lockless(pos->member.next, typeof(*pos), member))
+
+/**
  * list_for_each_entry_continue_rcu - continue iteration over list of given type
  * @pos:	the type * to use as a loop cursor.
  * @head:	the head for your list.
--
2.5.0.rc3
[PATCH kernel] rcu: Fix comment for rcu_dereference_raw_notrace
rcu_dereference_raw() indirectly calls rcu_read_lock_held() while rcu_dereference_raw_notrace() does not, so fix the comment about the latter.

Signed-off-by: Alexey Kardashevskiy <a...@ozlabs.ru>
---
 include/linux/rcupdate.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
index 581abf8..931d627 100644
--- a/include/linux/rcupdate.h
+++ b/include/linux/rcupdate.h
@@ -760,7 +760,7 @@ static inline void rcu_preempt_sleep_check(void)
  * The tracing infrastructure traces RCU (we want that), but unfortunately
  * some of the RCU checks causes tracing to lock up the system.
  *
- * The tracing version of rcu_dereference_raw() must not call
+ * The no-tracing version of rcu_dereference_raw() must not call
  * rcu_read_lock_held().
  */
 #define rcu_dereference_raw_notrace(p) __rcu_dereference_check((p), 1, __rcu)
--
2.5.0.rc3
Re: [PATCH 2/7 v2] powerpc/dma-mapping: override dma_get_page_shift
On 10/24/2015 07:57 AM, Nishanth Aravamudan wrote:
> On Power, the kernel's page size can differ from the IOMMU's page
> size, so we need to override the generic implementation, which always
> returns the kernel's page size. Lookup the IOMMU's page size from
> struct iommu_table, if available. Fallback to the kernel's page size,
> otherwise.
>
> Signed-off-by: Nishanth Aravamudan
> ---
>  arch/powerpc/include/asm/dma-mapping.h | 3 +++
>  arch/powerpc/kernel/dma.c | 9 +
>  2 files changed, 12 insertions(+)
>
> diff --git a/arch/powerpc/include/asm/dma-mapping.h b/arch/powerpc/include/asm/dma-mapping.h
> index 7f522c0..c5638f4 100644
> --- a/arch/powerpc/include/asm/dma-mapping.h
> +++ b/arch/powerpc/include/asm/dma-mapping.h
> @@ -125,6 +125,9 @@ static inline void set_dma_offset(struct device *dev, dma_addr_t off)
>  #define HAVE_ARCH_DMA_SET_MASK 1
>  extern int dma_set_mask(struct device *dev, u64 dma_mask);
>
> +#define HAVE_ARCH_DMA_GET_PAGE_SHIFT 1
> +extern unsigned long dma_get_page_shift(struct device *dev);
> +
>  #include
>
>  extern int __dma_set_mask(struct device *dev, u64 dma_mask);
> diff --git a/arch/powerpc/kernel/dma.c b/arch/powerpc/kernel/dma.c
> index 59503ed..e805af2 100644
> --- a/arch/powerpc/kernel/dma.c
> +++ b/arch/powerpc/kernel/dma.c
> @@ -335,6 +335,15 @@ int dma_set_mask(struct device *dev, u64 dma_mask)
>  }
>  EXPORT_SYMBOL(dma_set_mask);
>
> +unsigned long dma_get_page_shift(struct device *dev)
> +{
> +	struct iommu_table *tbl = get_iommu_table_base(dev);
> +	if (tbl)
> +		return tbl->it_page_shift;

All PCI devices have this initialized on POWER (at least, our, IBM's POWER), so 4K will always be returned here, while in the case of (get_dma_ops(dev) == &dma_direct_ops) it could actually return PAGE_SHIFT. Is 4K still the preferred value to return here?

> +	return PAGE_SHIFT;
> +}
> +EXPORT_SYMBOL(dma_get_page_shift);
> +
>  u64 __dma_get_required_mask(struct device *dev)
>  {
>  	struct dma_map_ops *dma_ops = get_dma_ops(dev);

--
Alexey
Re: [PATCH 2/7 v2] powerpc/dma-mapping: override dma_get_page_shift
On 10/28/2015 09:27 AM, Nishanth Aravamudan wrote:
> On 27.10.2015 [17:02:16 +1100], Alexey Kardashevskiy wrote:
>> On 10/24/2015 07:57 AM, Nishanth Aravamudan wrote:
>>> On Power, the kernel's page size can differ from the IOMMU's page
>>> size, so we need to override the generic implementation, which
>>> always returns the kernel's page size. Lookup the IOMMU's page size
>>> from struct iommu_table, if available. Fallback to the kernel's page
>>> size, otherwise.
>>>
>>> Signed-off-by: Nishanth Aravamudan <n...@linux.vnet.ibm.com>
>>> ---
>>>  arch/powerpc/include/asm/dma-mapping.h | 3 +++
>>>  arch/powerpc/kernel/dma.c | 9 +
>>>  2 files changed, 12 insertions(+)
>>>
>>> diff --git a/arch/powerpc/include/asm/dma-mapping.h b/arch/powerpc/include/asm/dma-mapping.h
>>> index 7f522c0..c5638f4 100644
>>> --- a/arch/powerpc/include/asm/dma-mapping.h
>>> +++ b/arch/powerpc/include/asm/dma-mapping.h
>>> @@ -125,6 +125,9 @@ static inline void set_dma_offset(struct device *dev, dma_addr_t off)
>>>  #define HAVE_ARCH_DMA_SET_MASK 1
>>>  extern int dma_set_mask(struct device *dev, u64 dma_mask);
>>>
>>> +#define HAVE_ARCH_DMA_GET_PAGE_SHIFT 1
>>> +extern unsigned long dma_get_page_shift(struct device *dev);
>>> +
>>>  #include
>>>
>>>  extern int __dma_set_mask(struct device *dev, u64 dma_mask);
>>> diff --git a/arch/powerpc/kernel/dma.c b/arch/powerpc/kernel/dma.c
>>> index 59503ed..e805af2 100644
>>> --- a/arch/powerpc/kernel/dma.c
>>> +++ b/arch/powerpc/kernel/dma.c
>>> @@ -335,6 +335,15 @@ int dma_set_mask(struct device *dev, u64 dma_mask)
>>>  }
>>>  EXPORT_SYMBOL(dma_set_mask);
>>>
>>> +unsigned long dma_get_page_shift(struct device *dev)
>>> +{
>>> +	struct iommu_table *tbl = get_iommu_table_base(dev);
>>> +	if (tbl)
>>> +		return tbl->it_page_shift;
>>
>> All PCI devices have this initialized on POWER (at least, our, IBM's
>> POWER), so 4K will always be returned here, while in the case of
>> (get_dma_ops(dev) == &dma_direct_ops) it could actually return
>> PAGE_SHIFT. Is 4K still the preferred value to return here?
>
> Right, so the logic of my series goes like this:
>
> a) We currently are assuming DMA_PAGE_SHIFT (conceptual constant) is
> PAGE_SHIFT everywhere, including Power.
>
> b) After 2/7, the Power code will return either the IOMMU table's
> shift value, if set, or PAGE_SHIFT (I guess this would be the case if
> get_dma_ops(dev) == &dma_direct_ops, as you said). That is no
> different than we have now, except we can return the accurate IOMMU
> value if available.

If it is not available, then something went wrong and BUG_ON(!tbl || !tbl->it_page_shift) makes more sense here than pretending that this function can ever return PAGE_SHIFT, imho.

> 3) After 3/7, the platform can override the generic Power
> get_dma_page_shift().
>
> 4) After 4/7, pseries will return the DDW value, if available, then
> fallback to the IOMMU table's value. I think in the case of
> get_dma_ops(dev) == &dma_direct_ops, the only way that can happen is
> if we are using DDW, right?

This is for pseries guests; for the powernv host it is a "bypass" mode which does 64bit direct DMA mapping, and there is no additional window for that (i.e. DIRECT64_PROPNAME, etc).

--
Alexey
Re: [PATCH 4/7 v2] pseries/iommu: implement DDW-aware dma_get_page_shift
On 10/24/2015 07:59 AM, Nishanth Aravamudan wrote: When DDW (Dynamic DMA Windows) are present for a device, we have stored the TCE (Translation Control Entry) size in a special device tree property. Check if we have enabled DDW for the device and return the TCE size from that property if present. If the property isn't present, fall back to looking the value up in struct iommu_table. If we don't find an iommu_table, fall back to the kernel's page size. Signed-off-by: Nishanth Aravamudan --- arch/powerpc/platforms/pseries/iommu.c | 36 ++ 1 file changed, 36 insertions(+) diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c index 0946b98..1bf6471 100644 --- a/arch/powerpc/platforms/pseries/iommu.c +++ b/arch/powerpc/platforms/pseries/iommu.c @@ -1292,6 +1292,40 @@ static u64 dma_get_required_mask_pSeriesLP(struct device *dev) return dma_iommu_ops.get_required_mask(dev); } +static unsigned long dma_get_page_shift_pSeriesLP(struct device *dev) +{ + struct iommu_table *tbl; + + if (!disable_ddw && dev_is_pci(dev)) { + struct pci_dev *pdev = to_pci_dev(dev); + struct device_node *dn; + + dn = pci_device_to_OF_node(pdev); + + /* search upwards for ibm,dma-window */ + for (; dn && PCI_DN(dn) && !PCI_DN(dn)->table_group; + dn = dn->parent) + if (of_get_property(dn, "ibm,dma-window", NULL)) + break; + /* + * if there is a DDW configuration, the TCE shift is stored in + * the property + */ + if (dn && PCI_DN(dn)) { + const struct dynamic_dma_window_prop *direct64 = + of_get_property(dn, DIRECT64_PROPNAME, NULL); This DIRECT64_PROPNAME property is only present under pHyp; QEMU/KVM does not set it, as 64bit windows are dynamic there, so something like find_existing_ddw() needs to be used here.
+ if (direct64) + return be32_to_cpu(direct64->tce_shift); + } + } + + tbl = get_iommu_table_base(dev); + if (tbl) + return tbl->it_page_shift; + + return PAGE_SHIFT; +} + #else /* CONFIG_PCI */ #define pci_dma_bus_setup_pSeries NULL #define pci_dma_dev_setup_pSeries NULL @@ -1299,6 +1333,7 @@ static u64 dma_get_required_mask_pSeriesLP(struct device *dev) #define pci_dma_dev_setup_pSeriesLP NULL #define dma_set_mask_pSeriesLP NULL #define dma_get_required_mask_pSeriesLP NULL +#define dma_get_page_shift_pSeriesLP NULL #endif /* !CONFIG_PCI */ static int iommu_mem_notifier(struct notifier_block *nb, unsigned long action, @@ -1395,6 +1430,7 @@ void iommu_init_early_pSeries(void) pseries_pci_controller_ops.dma_dev_setup = pci_dma_dev_setup_pSeriesLP; ppc_md.dma_set_mask = dma_set_mask_pSeriesLP; ppc_md.dma_get_required_mask = dma_get_required_mask_pSeriesLP; + ppc_md.dma_get_page_shift = dma_get_page_shift_pSeriesLP; } else { pseries_pci_controller_ops.dma_bus_setup = pci_dma_bus_setup_pSeries; pseries_pci_controller_ops.dma_dev_setup = pci_dma_dev_setup_pSeries; -- Alexey
Re: [RFC PATCH kernel vfio] mm: vfio: Move pages out of CMA before pinning
On 08/17/2015 05:45 PM, Vlastimil Babka wrote: On 08/05/2015 10:08 AM, Alexey Kardashevskiy wrote: This is about VFIO aka PCI passthrough used from QEMU. KVM is irrelevant here. QEMU is a machine emulator. It allocates guest RAM from anonymous memory and these pages are movable, which is ok. They may happen to be allocated from the contiguous memory allocation zone (CMA), which is also ok as long as they are movable. However, if the guest starts using VFIO (which can be hotplugged into the guest), in most cases it involves DMA, which requires guest RAM pages to be pinned and not to move once their addresses are programmed to the hardware for DMA. So we end up in a situation where many pages in CMA are not movable anymore. And we get a bunch of these: [77306.513966] alloc_contig_range: [1f3800, 1f78c4) PFNs busy [77306.514448] alloc_contig_range: [1f3800, 1f78c8) PFNs busy [77306.514927] alloc_contig_range: [1f3800, 1f78cc) PFNs busy IIRC CMA was for mobile devices and their camera/codec drivers, and you don't use QEMU on those? What do you need CMA for in your case? I do not want QEMU to get memory from CMA, this is my point. It just happens sometimes that the kernel allocates movable pages from there. This is a very rough patch to start the conversation about how to move pages properly. mm/page_alloc.c does this and arch/powerpc/mm/mmu_context_iommu.c exploits it. OK, such a conversation should probably start by mentioning the VM_PINNED effort by Peter Zijlstra: https://lkml.org/lkml/2014/5/26/345 It's a more general approach to dealing with pinned pages, and moving them out of the CMA area (and compacting them in general) prior to pinning is one of the things that should be done within that framework. And I assume these patches did not go anywhere, right?... Then there's the effort to enable migrating pages other than LRU during compaction (and thus CMA allocation): https://lwn.net/Articles/650864/ I don't know if that would be applicable in your use case, i.e.
are the pins for DMA short-lived, and can the isolation/migration code wait a bit for the transfer to finish so it can grab them, or something? Pins for DMA are long-lived, pretty much as long as the guest is running. So this "compaction" is too late. Please do not comment on the style and code placement, this is just to give some context :) Obviously, this does not work well - it manages to migrate only a few pages and crashes as it is missing locks/disabling interrupts, and I probably should not just remove pages from the LRU list (normally, I guess, only these can migrate) and a million other things. The questions are: - what is the correct way of telling if the page is in CMA? is (get_pageblock_migratetype(page) == MIGRATE_CMA) good enough? Should be. - how to tell MM to move the page away? I am calling migrate_pages() with a get_new_page callback which allocates a page with GFP_USER but without GFP_MOVABLE, which should allocate the new page outside of CMA. This seems ok, but there is a little concern that we might want to add MOVABLE back when the VFIO device is unplugged from the guest. Hmm, once the page is allocated, the migratetype is not tracked anywhere (except in page_owner debug data). But the unmovable allocations might exhaust available unmovable pageblocks and lead to fragmentation. So "add MOVABLE back" would be too late. Instead we would need to tell the allocator somehow to give us a movable page but outside of CMA. If it is movable, why do we care whether it is in CMA or not? CMA's own __alloc_contig_migrate_range() avoids this problem by allocating movable pages, but the range has been already page-isolated and thus the allocator won't see the pages there. You obviously can't take this approach and isolate all CMA pageblocks like that. That smells like a new __GFP_FLAG, meh. I understood (more or less) all of it except the __GFP_FLAG - when/what would use it?
- do I need to isolate pages by using isolate_migratepages_range, reclaim_clean_pages_from_list like __alloc_contig_migrate_range does? I dropped them for now and the patch uses only @migratepages from the compact_control struct. You don't have to do reclaim_clean_pages_from_list(), but the isolation has to be careful, yeah. The isolation here means the whole CMA zone isolation, which I "obviously can't take this approach"? :) - are there any flags in madvise() to address this (could not locate any relevant)? AFAIK there's no madvise(I_WILL_BE_PINNING_THIS_RANGE) - what else is missing? disabled interrupts? locks? See what isolate_migratepages_block() does. Thanks for the pointers! I'll have a closer look at Peter's patchset. -- Alexey
Re: [RFC PATCH kernel vfio] mm: vfio: Move pages out of CMA before pinning
On 08/05/2015 06:08 PM, Alexey Kardashevskiy wrote: This is about VFIO aka PCI passthrough used from QEMU. KVM is irrelevant here. Anyone, any idea? Or is the question way too stupid? :) QEMU is a machine emulator. It allocates guest RAM from anonymous memory and these pages are movable, which is ok. They may happen to be allocated from the contiguous memory allocation zone (CMA), which is also ok as long as they are movable. However, if the guest starts using VFIO (which can be hotplugged into the guest), in most cases it involves DMA, which requires guest RAM pages to be pinned and not to move once their addresses are programmed to the hardware for DMA. So we end up in a situation where many pages in CMA are not movable anymore. And we get a bunch of these: [77306.513966] alloc_contig_range: [1f3800, 1f78c4) PFNs busy [77306.514448] alloc_contig_range: [1f3800, 1f78c8) PFNs busy [77306.514927] alloc_contig_range: [1f3800, 1f78cc) PFNs busy This is a very rough patch to start the conversation about how to move pages properly. mm/page_alloc.c does this and arch/powerpc/mm/mmu_context_iommu.c exploits it. Please do not comment on the style and code placement, this is just to give some context :) Obviously, this does not work well - it manages to migrate only a few pages and crashes as it is missing locks/disabling interrupts, and I probably should not just remove pages from the LRU list (normally, I guess, only these can migrate) and a million other things. The questions are: - what is the correct way of telling if the page is in CMA? is (get_pageblock_migratetype(page) == MIGRATE_CMA) good enough? - how to tell MM to move the page away? I am calling migrate_pages() with a get_new_page callback which allocates a page with GFP_USER but without GFP_MOVABLE, which should allocate the new page outside of CMA. This seems ok, but there is a little concern that we might want to add MOVABLE back when the VFIO device is unplugged from the guest.
- do I need to isolate pages by using isolate_migratepages_range, reclaim_clean_pages_from_list like __alloc_contig_migrate_range does? I dropped them for now and the patch uses only @migratepages from the compact_control struct. - are there any flags in madvise() to address this (could not locate any relevant)? - what else is missing? disabled interrupts? locks? Thanks! Signed-off-by: Alexey Kardashevskiy <a...@ozlabs.ru> --- arch/powerpc/mm/mmu_context_iommu.c | 40 +++-- mm/page_alloc.c | 36 + 2 files changed, 70 insertions(+), 6 deletions(-) diff --git a/arch/powerpc/mm/mmu_context_iommu.c b/arch/powerpc/mm/mmu_context_iommu.c index da6a216..bf6850e 100644 --- a/arch/powerpc/mm/mmu_context_iommu.c +++ b/arch/powerpc/mm/mmu_context_iommu.c @@ -72,12 +72,15 @@ bool mm_iommu_preregistered(void) } EXPORT_SYMBOL_GPL(mm_iommu_preregistered); +extern int mm_iommu_move_page(unsigned long pfn); + long mm_iommu_get(unsigned long ua, unsigned long entries, struct mm_iommu_table_group_mem_t **pmem) { struct mm_iommu_table_group_mem_t *mem; long i, j, ret = 0, locked_entries = 0; struct page *page = NULL; + unsigned long moved = 0, tried = 0; if (!current || !current->mm) return -ESRCH; /* process exited */ @@ -122,15 +125,29 @@ long mm_iommu_get(unsigned long ua, unsigned long entries, } for (i = 0; i < entries; ++i) { + unsigned long pfn; + if (1 != get_user_pages_fast(ua + (i << PAGE_SHIFT), 1/* pages */, 1/* iswrite */, &page)) { - for (j = 0; j < i; ++j) - put_page(pfn_to_page( - mem->hpas[j] >> PAGE_SHIFT)); - vfree(mem->hpas); - kfree(mem); ret = -EFAULT; - goto unlock_exit; + goto put_exit; + } + + pfn = page_to_pfn(page); + if (get_pageblock_migratetype(page) == MIGRATE_CMA) + { + unsigned long pfnold = pfn; + put_page(page); + page = NULL; + mm_iommu_move_page(pfn); + if (1 != get_user_pages_fast(ua + (i << PAGE_SHIFT), + 1/* pages */, 1/* iswrite */, &page)) { + ret = -EFAULT; + goto put_exit; + } + pfn = page_to_pfn(page); + if (pfn != pfnold) + ++moved; } mem->hpas[i] = page_to_pfn(page) << PAGE_SHIFT; @@ -144,6 +161,17 @@ long mm_iommu_get(unsigned long ua, unsigned long entries, list_add_rcu(&mem->next, &current->mm
[RFC PATCH kernel vfio] mm: vfio: Move pages out of CMA before pinning
This is about VFIO aka PCI passthrough used from QEMU. KVM is irrelevant here. QEMU is a machine emulator. It allocates guest RAM from anonymous memory and these pages are movable, which is ok. They may happen to be allocated from the contiguous memory allocation zone (CMA), which is also ok as long as they are movable. However, if the guest starts using VFIO (which can be hotplugged into the guest), in most cases it involves DMA, which requires guest RAM pages to be pinned and not to move once their addresses are programmed to the hardware for DMA. So we end up in a situation where many pages in CMA are not movable anymore. And we get a bunch of these: [77306.513966] alloc_contig_range: [1f3800, 1f78c4) PFNs busy [77306.514448] alloc_contig_range: [1f3800, 1f78c8) PFNs busy [77306.514927] alloc_contig_range: [1f3800, 1f78cc) PFNs busy This is a very rough patch to start the conversation about how to move pages properly. mm/page_alloc.c does this and arch/powerpc/mm/mmu_context_iommu.c exploits it. Please do not comment on the style and code placement, this is just to give some context :) Obviously, this does not work well - it manages to migrate only a few pages and crashes as it is missing locks/disabling interrupts, and I probably should not just remove pages from the LRU list (normally, I guess, only these can migrate) and a million other things. The questions are: - what is the correct way of telling if the page is in CMA? is (get_pageblock_migratetype(page) == MIGRATE_CMA) good enough? - how to tell MM to move the page away? I am calling migrate_pages() with a get_new_page callback which allocates a page with GFP_USER but without GFP_MOVABLE, which should allocate the new page outside of CMA. This seems ok, but there is a little concern that we might want to add MOVABLE back when the VFIO device is unplugged from the guest. - do I need to isolate pages by using isolate_migratepages_range, reclaim_clean_pages_from_list like __alloc_contig_migrate_range does?
I dropped them for now and the patch uses only @migratepages from the compact_control struct. - are there any flags in madvise() to address this (could not locate any relevant)? - what else is missing? disabled interrupts? locks? Thanks! Signed-off-by: Alexey Kardashevskiy <a...@ozlabs.ru> --- arch/powerpc/mm/mmu_context_iommu.c | 40 +++-- mm/page_alloc.c | 36 + 2 files changed, 70 insertions(+), 6 deletions(-) diff --git a/arch/powerpc/mm/mmu_context_iommu.c b/arch/powerpc/mm/mmu_context_iommu.c index da6a216..bf6850e 100644 --- a/arch/powerpc/mm/mmu_context_iommu.c +++ b/arch/powerpc/mm/mmu_context_iommu.c @@ -72,12 +72,15 @@ bool mm_iommu_preregistered(void) } EXPORT_SYMBOL_GPL(mm_iommu_preregistered); +extern int mm_iommu_move_page(unsigned long pfn); + long mm_iommu_get(unsigned long ua, unsigned long entries, struct mm_iommu_table_group_mem_t **pmem) { struct mm_iommu_table_group_mem_t *mem; long i, j, ret = 0, locked_entries = 0; struct page *page = NULL; + unsigned long moved = 0, tried = 0; if (!current || !current->mm) return -ESRCH; /* process exited */ @@ -122,15 +125,29 @@ long mm_iommu_get(unsigned long ua, unsigned long entries, } for (i = 0; i < entries; ++i) { + unsigned long pfn; + if (1 != get_user_pages_fast(ua + (i << PAGE_SHIFT), 1/* pages */, 1/* iswrite */, &page)) { - for (j = 0; j < i; ++j) - put_page(pfn_to_page( - mem->hpas[j] >> PAGE_SHIFT)); - vfree(mem->hpas); - kfree(mem); ret = -EFAULT; - goto unlock_exit; + goto put_exit; + } + + pfn = page_to_pfn(page); + if (get_pageblock_migratetype(page) == MIGRATE_CMA) + { + unsigned long pfnold = pfn; + put_page(page); + page = NULL; + mm_iommu_move_page(pfn); + if (1 != get_user_pages_fast(ua + (i << PAGE_SHIFT), + 1/* pages */, 1/* iswrite */, &page)) { + ret = -EFAULT; + goto put_exit; + } + pfn = page_to_pfn(page); + if (pfn != pfnold) + ++moved; } mem->hpas[i] = page_to_pfn(page) << PAGE_SHIFT; @@ -144,6 +161,17 @@ long mm_iommu_get(unsigned long ua, unsigned long entries, list_add_rcu(&mem->next, &current->mm->context.iommu_group_mem_list); + printk("***K*** %s %u: tried %lu, moved %lu of %lu\n", __func__, __LINE__
Re: [PATCH kernel] powerpc/powernv/ioda2: Fix calculation for memory allocated for TCE table
On 07/21/2015 04:24 PM, Michael Ellerman wrote: On Mon, 2015-07-20 at 20:45 +1000, Alexey Kardashevskiy wrote: The existing code stores the amount of memory allocated for a TCE table. At the moment it uses @offset, which is a virtual offset into the TCE table; this is only correct for one-level tables and does not include memory allocated for intermediate levels. When a multilevel TCE table is requested, the WARN_ON in tce_iommu_create_table() prints a warning. This adds an additional counter to pnv_pci_ioda2_table_do_alloc_pages() to count actually allocated memory. Signed-off-by: Alexey Kardashevskiy --- I was sure I sent it already but could not find it anywhere, so reposting. Sorry if it is a duplicate. Stable? No, this is for the DDW patchset I recently posted, which has not made it into any release yet. -- Alexey