[PATCH] Revert powerpc/tm: Abort syscalls in active transactions
This reverts commit feba40362b11341bee6d8ed58d54b896abbd9f84. Although the principle of this change is good, the implementation has a few issues. Firstly we can sometimes fail to abort a syscall because r12 may have been clobbered by C code if we went down the virtual CPU accounting path, or if syscall tracing was enabled. Secondly we have decided that it is safer to abort the syscall even earlier in the syscall entry path, so that we avoid the syscall tracing path when we are transactional. So that we have time to thoroughly test those changes we have decided to revert this for this merge window and will merge the fixed version in the next window. NB. Rather than reverting the selftest we just drop tm-syscall from TEST_PROGS so that it's not run by default. Fixes: feba40362b11 ("powerpc/tm: Abort syscalls in active transactions") Signed-off-by: Michael Ellerman m...@ellerman.id.au --- Documentation/powerpc/transactional_memory.txt | 32 +- arch/powerpc/include/uapi/asm/tm.h | 2 +- arch/powerpc/kernel/entry_64.S | 19 --- tools/testing/selftests/powerpc/tm/Makefile | 2 +- 4 files changed, 18 insertions(+), 37 deletions(-) diff --git a/Documentation/powerpc/transactional_memory.txt b/Documentation/powerpc/transactional_memory.txt index ba0a2a4a54ba..ded69794a5c0 100644 --- a/Documentation/powerpc/transactional_memory.txt +++ b/Documentation/powerpc/transactional_memory.txt @@ -74,23 +74,22 @@ Causes of transaction aborts Syscalls -Syscalls made from within an active transaction will not be performed and the -transaction will be doomed by the kernel with the failure code TM_CAUSE_SYSCALL -| TM_CAUSE_PERSISTENT. +Performing syscalls from within a transaction is not recommended, and can lead +to unpredictable results. -Syscalls made from within a suspended transaction are performed as normal and -the transaction is not explicitly doomed by the kernel. However, what the -kernel does to perform the syscall may result in the transaction being doomed -by the hardware.
The syscall is performed in suspended mode so any side -effects will be persistent, independent of transaction success or failure. No -guarantees are provided by the kernel about which syscalls will affect -transaction success. +Syscalls do not by design abort transactions, but beware: The kernel code will +not be running in transactional state. The effects of syscalls will always +remain visible, but depending on the call they may abort your transaction as a +side-effect, read soon-to-be-aborted transactional data that should not remain +invisible, etc. If you constantly retry a transaction that constantly aborts +itself by calling a syscall, you'll have a livelock and make no progress. -Care must be taken when relying on syscalls to abort during active transactions -if the calls are made via a library. Libraries may cache values (which may -give the appearance of success) or perform operations that cause transaction -failure before entering the kernel (which may produce different failure codes). -Examples are glibc's getpid() and lazy symbol resolution. +Simple syscalls (e.g. sigprocmask()) could be OK. Even things like write() +from, say, printf() should be OK as long as the kernel does not access any +memory that was accessed transactionally. + +Consider any syscalls that happen to work as debug-only -- not recommended for +production use. Best to queue them up till after the transaction is over. Signals @@ -177,7 +176,8 @@ kernel aborted a transaction: TM_CAUSE_RESCHED Thread was rescheduled. TM_CAUSE_TLBI Software TLB invalid. TM_CAUSE_FAC_UNAV FP/VEC/VSX unavailable trap. - TM_CAUSE_SYSCALL Syscall from active transaction. + TM_CAUSE_SYSCALL Currently unused; future syscalls that must abort +transactions for consistency will use this. TM_CAUSE_SIGNAL Signal delivered. TM_CAUSE_MISC Currently unused. TM_CAUSE_ALIGNMENT Alignment fault.
diff --git a/arch/powerpc/include/uapi/asm/tm.h b/arch/powerpc/include/uapi/asm/tm.h index 5047659815a5..5d836b7c1176 100644 --- a/arch/powerpc/include/uapi/asm/tm.h +++ b/arch/powerpc/include/uapi/asm/tm.h @@ -11,7 +11,7 @@ #define TM_CAUSE_RESCHED 0xde #define TM_CAUSE_TLBI 0xdc #define TM_CAUSE_FAC_UNAV 0xda -#define TM_CAUSE_SYSCALL 0xd8 +#define TM_CAUSE_SYSCALL 0xd8 /* future use */ #define TM_CAUSE_MISC 0xd6 /* future use */ #define TM_CAUSE_SIGNAL 0xd4 #define TM_CAUSE_ALIGNMENT 0xd2 diff --git a/arch/powerpc/kernel/entry_64.S b/arch/powerpc/kernel/entry_64.S index 8ca9434c40e6..afbc20019c2e 100644 --- a/arch/powerpc/kernel/entry_64.S +++ b/arch/powerpc/kernel/entry_64.S @@ -34,7 +34,6 @@ #include <asm/ftrace.h> #include <asm/hw_irq.h> #include <asm/context_tracking.h> -#include <asm/tm.h> /* * System calls. */ @@ -146,24 +145,6 @@
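For userspace that inspects the abort reason the kernel writes (reported via the TM failure code, e.g. in TEXASR), a small decoder over the constants from the uapi header above can be sketched as follows. This is a hypothetical helper, not a kernel API, and the TM_CAUSE_PERSISTENT value (0x01) is an assumption about the low persistent-failure bit rather than something quoted in this patch:

```python
# Hypothetical decoder for the powerpc TM abort cause codes shown in
# arch/powerpc/include/uapi/asm/tm.h above. Not a kernel API.
TM_CAUSES = {
    0xde: "TM_CAUSE_RESCHED",
    0xdc: "TM_CAUSE_TLBI",
    0xda: "TM_CAUSE_FAC_UNAV",
    0xd8: "TM_CAUSE_SYSCALL",    # unused again after this revert
    0xd6: "TM_CAUSE_MISC",       # future use
    0xd4: "TM_CAUSE_SIGNAL",
    0xd2: "TM_CAUSE_ALIGNMENT",
}
TM_CAUSE_PERSISTENT = 0x01  # assumed value: low bit marks a persistent failure

def decode_tm_cause(code):
    """Split a kernel-written failure code into (cause name, persistent flag)."""
    persistent = bool(code & TM_CAUSE_PERSISTENT)
    name = TM_CAUSES.get(code & ~TM_CAUSE_PERSISTENT, "unknown")
    return name, persistent
```

With the revert applied, TM_CAUSE_SYSCALL should no longer be observed, which is why the table marks it as unused.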
Re: [PATCH kernel v9 31/32] vfio: powerpc/spapr: Support multiple groups in one container if possible
On Sat, Apr 25, 2015 at 10:14:55PM +1000, Alexey Kardashevskiy wrote: At the moment only one group per container is supported. POWER8 CPUs have a more flexible design and allow having 2 TCE tables per IOMMU group so we can relax this limitation and support multiple groups per container. It's not obvious why allowing multiple TCE tables per PE has any bearing on allowing multiple groups per container. This adds TCE table descriptors to a container and uses iommu_table_group_ops to create/set DMA windows on IOMMU groups so the same TCE tables will be shared between several IOMMU groups. Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru [aw: for the vfio related changes] Acked-by: Alex Williamson alex.william...@redhat.com --- Changes: v7: * updated doc --- Documentation/vfio.txt | 8 +- drivers/vfio/vfio_iommu_spapr_tce.c | 268 ++-- 2 files changed, 199 insertions(+), 77 deletions(-) diff --git a/Documentation/vfio.txt b/Documentation/vfio.txt index 94328c8..7dcf2b5 100644 --- a/Documentation/vfio.txt +++ b/Documentation/vfio.txt @@ -289,10 +289,12 @@ PPC64 sPAPR implementation note This implementation has some specifics: -1) Only one IOMMU group per container is supported as an IOMMU group -represents the minimal entity which isolation can be guaranteed for and -groups are allocated statically, one per a Partitionable Endpoint (PE) +1) On older systems (POWER7 with P5IOC2/IODA1) only one IOMMU group per +container is supported as an IOMMU table is allocated at boot time, +one table per IOMMU group, which is a Partitionable Endpoint (PE) (PE is often a PCI domain but not always). I thought the more fundamental problem was that different PEs tended to use disjoint bus address ranges, so even by duplicating put_tce across PEs you couldn't have a common address space. +Newer systems (POWER8 with IODA2) have an improved hardware design which allows +this limitation to be removed and multiple IOMMU groups per VFIO container.
2) The hardware supports so called DMA windows - the PCI address range within which DMA transfer is allowed, any attempt to access address space diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c index a7d6729..970e3a2 100644 --- a/drivers/vfio/vfio_iommu_spapr_tce.c +++ b/drivers/vfio/vfio_iommu_spapr_tce.c @@ -82,6 +82,11 @@ static void decrement_locked_vm(long npages) * into DMA'ble space using the IOMMU */ +struct tce_iommu_group { + struct list_head next; + struct iommu_group *grp; +}; + /* * The container descriptor supports only a single group per container. * Required by the API as the container is not supplied with the IOMMU group @@ -89,10 +94,11 @@ static void decrement_locked_vm(long npages) */ struct tce_container { struct mutex lock; - struct iommu_group *grp; bool enabled; unsigned long locked_pages; bool v2; + struct iommu_table tables[IOMMU_TABLE_GROUP_MAX_TABLES]; Hrm, so here we have more copies of the full iommu_table structures, which again muddies the lifetime. The table_group pointer is presumably meaningless in these copies, which seems dangerously confusing. 
+ struct list_head group_list; }; static long tce_unregister_pages(struct tce_container *container, @@ -154,20 +160,20 @@ static bool tce_page_is_contained(struct page *page, unsigned page_shift) return (PAGE_SHIFT + compound_order(compound_head(page))) >= page_shift; } +static inline bool tce_groups_attached(struct tce_container *container) +{ + return !list_empty(&container->group_list); +} + static struct iommu_table *spapr_tce_find_table( struct tce_container *container, phys_addr_t ioba) { long i; struct iommu_table *ret = NULL; - struct iommu_table_group *table_group; - - table_group = iommu_group_get_iommudata(container->grp); - if (!table_group) - return NULL; for (i = 0; i < IOMMU_TABLE_GROUP_MAX_TABLES; ++i) { - struct iommu_table *tbl = &table_group->tables[i]; + struct iommu_table *tbl = &container->tables[i]; unsigned long entry = ioba >> tbl->it_page_shift; unsigned long start = tbl->it_offset; unsigned long end = start + tbl->it_size; @@ -186,9 +192,7 @@ static int tce_iommu_enable(struct tce_container *container) int ret = 0; unsigned long locked; struct iommu_table_group *table_group; - - if (!container->grp) - return -ENXIO; + struct tce_iommu_group *tcegrp; if (!current->mm) return -ESRCH; /* process exited */ @@ -225,7 +229,12 @@ static int tce_iommu_enable(struct tce_container *container) * as there is no way to know how much we should increment * the locked_vm counter. */ -
Re: [PATCH kernel v9 28/32] powerpc/mmu: Add userspace-to-physical addresses translation cache
On Sat, Apr 25, 2015 at 10:14:52PM +1000, Alexey Kardashevskiy wrote: We are adding support for DMA memory pre-registration to be used in conjunction with VFIO. The idea is that the userspace which is going to run a guest may want to pre-register a user space memory region so it all gets pinned once and never goes away. Having this done, a hypervisor will not have to pin/unpin pages on every DMA map/unmap request. This is going to help with multiple pinning of the same memory and in-kernel acceleration of DMA requests. This adds a list of memory regions to mm_context_t. Each region consists of a header and a list of physical addresses. This adds API to: 1. register/unregister memory regions; 2. do final cleanup (which puts all pre-registered pages); 3. do userspace to physical address translation; 4. manage a mapped pages counter; when it is zero, it is safe to unregister the region. Multiple registration of the same region is allowed, kref is used to track the number of registrations. [snip] +long mm_iommu_alloc(unsigned long ua, unsigned long entries, + struct mm_iommu_table_group_mem_t **pmem) +{ + struct mm_iommu_table_group_mem_t *mem; + long i, j; + struct page *page = NULL; + + list_for_each_entry_rcu(mem, &current->mm->context.iommu_group_mem_list, + next) { + if ((mem->ua == ua) && (mem->entries == entries)) + return -EBUSY; + + /* Overlap? */ + if ((mem->ua < (ua + (entries << PAGE_SHIFT))) && + (ua < (mem->ua + (mem->entries << PAGE_SHIFT)))) + return -EINVAL; + } + + mem = kzalloc(sizeof(*mem), GFP_KERNEL); + if (!mem) + return -ENOMEM; + + mem->hpas = vzalloc(entries * sizeof(mem->hpas[0])); + if (!mem->hpas) { + kfree(mem); + return -ENOMEM; + } So, I've thought more about this and I'm really confused as to what this is supposed to be accomplishing. I see that you need to keep track of what regions are registered, so you don't double lock or unlock, but I don't see what the point of actually storing the translations in hpas is.
I had assumed it was so that you could later on get to the translations in real mode when you do in-kernel acceleration. But that doesn't make sense, because the array is vmalloc()ed, so can't be accessed in real mode anyway. I can't think of a circumstance in which you can use hpas where you couldn't just walk the page tables anyway. -- David Gibson| I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson pgpXYAi7YA0h0.pgp Description: PGP signature ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
[PATCH v2 13/15] cputime:Introduce the cputime_to_timespec64/timespec64_to_cputime function
This patch introduces some functions for converting cputime to timespec64 and back, replacing the timespec type with the timespec64 type, including for the arch/s390 and arch/powerpc architectures. These new methods will replace the old cputime_to_timespec/timespec_to_cputime functions, to be ready for the 2038 issue. The cputime_to_timespec/timespec_to_cputime functions are moved to the include/linux/cputime.h file so they can be removed conveniently. Signed-off-by: Baolin Wang baolin.w...@linaro.org --- arch/powerpc/include/asm/cputime.h|6 +++--- arch/s390/include/asm/cputime.h |8 include/asm-generic/cputime_jiffies.h | 10 +- include/asm-generic/cputime_nsecs.h |4 ++-- include/linux/cputime.h | 15 +++ 5 files changed, 29 insertions(+), 14 deletions(-) diff --git a/arch/powerpc/include/asm/cputime.h b/arch/powerpc/include/asm/cputime.h index e245255..5dda5c0 100644 --- a/arch/powerpc/include/asm/cputime.h +++ b/arch/powerpc/include/asm/cputime.h @@ -154,9 +154,9 @@ static inline cputime_t secs_to_cputime(const unsigned long sec) } /* - * Convert cputime <-> timespec + * Convert cputime <-> timespec64 */ -static inline void cputime_to_timespec(const cputime_t ct, struct timespec *p) +static inline void cputime_to_timespec64(const cputime_t ct, struct timespec64 *p) { u64 x = (__force u64) ct; unsigned int frac; @@ -168,7 +168,7 @@ static inline void cputime_to_timespec(const cputime_t ct, struct timespec *p) p->tv_nsec = x; } -static inline cputime_t timespec_to_cputime(const struct timespec *p) +static inline cputime_t timespec64_to_cputime(const struct timespec64 *p) { u64 ct; diff --git a/arch/s390/include/asm/cputime.h b/arch/s390/include/asm/cputime.h index b91e960..1266697 100644 --- a/arch/s390/include/asm/cputime.h +++ b/arch/s390/include/asm/cputime.h @@ -89,16 +89,16 @@ static inline cputime_t secs_to_cputime(const unsigned int s) } /* - * Convert cputime to timespec and back. + * Convert cputime to timespec64 and back.
*/ -static inline cputime_t timespec_to_cputime(const struct timespec *value) +static inline cputime_t timespec64_to_cputime(const struct timespec64 *value) { unsigned long long ret = value->tv_sec * CPUTIME_PER_SEC; return (__force cputime_t)(ret + __div(value->tv_nsec * CPUTIME_PER_USEC, NSEC_PER_USEC)); } -static inline void cputime_to_timespec(const cputime_t cputime, - struct timespec *value) +static inline void cputime_to_timespec64(const cputime_t cputime, + struct timespec64 *value) { unsigned long long __cputime = (__force unsigned long long) cputime; #ifndef CONFIG_64BIT diff --git a/include/asm-generic/cputime_jiffies.h b/include/asm-generic/cputime_jiffies.h index fe386fc..54e034c 100644 --- a/include/asm-generic/cputime_jiffies.h +++ b/include/asm-generic/cputime_jiffies.h @@ -44,12 +44,12 @@ typedef u64 __nocast cputime64_t; #define secs_to_cputime(sec) jiffies_to_cputime((sec) * HZ) /* - * Convert cputime to timespec and back. + * Convert cputime to timespec64 and back. */ -#define timespec_to_cputime(__val) \ - jiffies_to_cputime(timespec_to_jiffies(__val)) -#define cputime_to_timespec(__ct,__val)\ - jiffies_to_timespec(cputime_to_jiffies(__ct),__val) +#define timespec64_to_cputime(__val) \ + jiffies_to_cputime(timespec64_to_jiffies(__val)) +#define cputime_to_timespec64(__ct,__val) \ + jiffies_to_timespec64(cputime_to_jiffies(__ct),__val) /* * Convert cputime to timeval and back.
diff --git a/include/asm-generic/cputime_nsecs.h b/include/asm-generic/cputime_nsecs.h index 0419485..65c875b 100644 --- a/include/asm-generic/cputime_nsecs.h +++ b/include/asm-generic/cputime_nsecs.h @@ -73,12 +73,12 @@ typedef u64 __nocast cputime64_t; /* * Convert cputime <-> timespec (nsec) */ -static inline cputime_t timespec_to_cputime(const struct timespec *val) +static inline cputime_t timespec64_to_cputime(const struct timespec64 *val) { u64 ret = val->tv_sec * NSEC_PER_SEC + val->tv_nsec; return (__force cputime_t) ret; } -static inline void cputime_to_timespec(const cputime_t ct, struct timespec *val) +static inline void cputime_to_timespec64(const cputime_t ct, struct timespec64 *val) { u32 rem; diff --git a/include/linux/cputime.h b/include/linux/cputime.h index f2eb2ee..f01896f 100644 --- a/include/linux/cputime.h +++ b/include/linux/cputime.h @@ -13,4 +13,19 @@ usecs_to_cputime((__nsecs) / NSEC_PER_USEC) #endif +static inline cputime_t timespec_to_cputime(const struct timespec *ts) +{ + struct timespec64 ts64 = timespec_to_timespec64(*ts); + return timespec64_to_cputime(&ts64); +} + +static inline void cputime_to_timespec(const cputime_t cputime, + struct timespec *value) +{ + struct timespec64 *ts64;
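The nsec-granularity variant above is the simplest to reason about: in that flavour cputime_t is just a nanosecond count, so the conversion is a multiply one way and a divmod the other. A rough model follows — Python is used purely for illustration, and the function names mirror the kernel helpers without being kernel code:

```python
NSEC_PER_SEC = 1_000_000_000

def timespec64_to_cputime(tv_sec, tv_nsec):
    # cputime_t in the nsecs flavour is a plain nanosecond count
    return tv_sec * NSEC_PER_SEC + tv_nsec

def cputime_to_timespec64(ct):
    # inverse: split the nanosecond count back into (tv_sec, tv_nsec)
    return divmod(ct, NSEC_PER_SEC)
```

The jiffies-based flavour goes through jiffies instead and is lossy below the HZ resolution, which is why the two back-ends are defined separately.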
Re: [PATCH kernel v9 28/32] powerpc/mmu: Add userspace-to-physical addresses translation cache
On Thu, Apr 30, 2015 at 04:34:55PM +1000, David Gibson wrote: On Sat, Apr 25, 2015 at 10:14:52PM +1000, Alexey Kardashevskiy wrote: We are adding support for DMA memory pre-registration to be used in conjunction with VFIO. The idea is that the userspace which is going to run a guest may want to pre-register a user space memory region so it all gets pinned once and never goes away. Having this done, a hypervisor will not have to pin/unpin pages on every DMA map/unmap request. This is going to help with multiple pinning of the same memory and in-kernel acceleration of DMA requests. This adds a list of memory regions to mm_context_t. Each region consists of a header and a list of physical addresses. This adds API to: 1. register/unregister memory regions; 2. do final cleanup (which puts all pre-registered pages); 3. do userspace to physical address translation; 4. manage a mapped pages counter; when it is zero, it is safe to unregister the region. Multiple registration of the same region is allowed, kref is used to track the number of registrations. [snip] +long mm_iommu_alloc(unsigned long ua, unsigned long entries, + struct mm_iommu_table_group_mem_t **pmem) +{ + struct mm_iommu_table_group_mem_t *mem; + long i, j; + struct page *page = NULL; + + list_for_each_entry_rcu(mem, &current->mm->context.iommu_group_mem_list, + next) { + if ((mem->ua == ua) && (mem->entries == entries)) + return -EBUSY; + + /* Overlap? */ + if ((mem->ua < (ua + (entries << PAGE_SHIFT))) && + (ua < (mem->ua + (mem->entries << PAGE_SHIFT)))) + return -EINVAL; + } + + mem = kzalloc(sizeof(*mem), GFP_KERNEL); + if (!mem) + return -ENOMEM; + + mem->hpas = vzalloc(entries * sizeof(mem->hpas[0])); + if (!mem->hpas) { + kfree(mem); + return -ENOMEM; + } So, I've thought more about this and I'm really confused as to what this is supposed to be accomplishing.
I see that you need to keep track of what regions are registered, so you don't double lock or unlock, but I don't see what the point of actually storing the translations in hpas is. I had assumed it was so that you could later on get to the translations in real mode when you do in-kernel acceleration. But that doesn't make sense, because the array is vmalloc()ed, so can't be accessed in real mode anyway. We can access vmalloc'd arrays in real mode using real_vmalloc_addr(). I can't think of a circumstance in which you can use hpas where you couldn't just walk the page tables anyway. The problem with walking the page tables is that there is no guarantee that the page you find that way is the page that was returned by the gup_fast() we did earlier. Storing the hpas means that we know for sure that the page we're doing DMA to is one that we have an elevated page count on. Also, there are various points where a Linux PTE is made temporarily invalid for a short time. If we happened to do an H_PUT_TCE on one cpu while another cpu was doing that, we'd get a spurious failure returned by the H_PUT_TCE. Paul.
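For reference, the duplicate/overlap test quoted above is a standard half-open interval intersection over the registered range [ua, ua + (entries << PAGE_SHIFT)). A model of just that check — the PAGE_SHIFT of 16 (64K pages) is an assumption, and the helper name is illustrative:

```python
PAGE_SHIFT = 16  # assumption: 64K pages, common on ppc64

def check_registration(mem_ua, mem_entries, ua, entries):
    """Mirror the two tests in mm_iommu_alloc(): an exact duplicate yields
    'busy' (-EBUSY in the kernel), a partial overlap yields 'invalid'
    (-EINVAL), and a disjoint range is 'ok'."""
    if mem_ua == ua and mem_entries == entries:
        return "busy"
    # half-open interval intersection: A.start < B.end and B.start < A.end
    if mem_ua < ua + (entries << PAGE_SHIFT) and \
       ua < mem_ua + (mem_entries << PAGE_SHIFT):
        return "invalid"
    return "ok"
```

Note the ordering matters: the exact-match case must be tested first, since an exact duplicate also intersects and would otherwise be misreported as -EINVAL.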
Re: [PATCH kernel v9 29/32] vfio: powerpc/spapr: Register memory and define IOMMU v2
On Sat, Apr 25, 2015 at 10:14:53PM +1000, Alexey Kardashevskiy wrote: The existing implementation accounts the whole DMA window in the locked_vm counter. This is going to be worse with multiple containers and huge DMA windows. Also, real-time accounting would require additional tracking of accounted pages due to the page size difference - the IOMMU uses 4K pages and the system uses 4K or 64K pages. Another issue is that actual page pinning/unpinning happens on every DMA map/unmap request. This does not affect the performance much now as we spend way too much time on switching context between guest/userspace/host but this will start to matter when we add in-kernel DMA map/unmap acceleration. This introduces a new IOMMU type for SPAPR - VFIO_SPAPR_TCE_v2_IOMMU. The new IOMMU deprecates VFIO_IOMMU_ENABLE/VFIO_IOMMU_DISABLE and introduces 2 new ioctls to register/unregister DMA memory - VFIO_IOMMU_SPAPR_REGISTER_MEMORY and VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY - which receive a user space address and size of a memory region which needs to be pinned/unpinned and counted in locked_vm. The new IOMMU splits physical page pinning and TCE table updates into 2 different operations. It requires 1) guest pages to be registered first, 2) subsequent map/unmap requests to work only with pre-registered memory. For the default single window case this means that the entire guest (instead of 2GB) needs to be pinned before using VFIO. When a huge DMA window is added, no additional pinning will be required, otherwise it would be guest RAM + 2GB. The new memory registration ioctls are not supported by VFIO_SPAPR_TCE_IOMMU. Dynamic DMA windows and in-kernel acceleration will require memory to be preregistered in order to work. The accounting is done per user process. This advertises the v2 SPAPR TCE IOMMU and restricts what the userspace can do with v1 or v2 IOMMUs.
Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru [aw: for the vfio related changes] Acked-by: Alex Williamson alex.william...@redhat.com --- Changes: v9: * s/tce_get_hva_cached/tce_iommu_use_page_v2/ v7: * now memory is registered per mm (i.e. process) * moved memory registration code to powerpc/mmu * merged vfio: powerpc/spapr: Define v2 IOMMU into this * limited new ioctls to v2 IOMMU * updated doc * unsupported ioctls return -ENOTTY instead of -EPERM v6: * tce_get_hva_cached() returns hva via a pointer v4: * updated docs * s/kzmalloc/vzalloc/ * in tce_pin_pages()/tce_unpin_pages() removed @vaddr, @size and replaced offset with index * renamed vfio_iommu_type_register_memory to vfio_iommu_spapr_register_memory and removed duplicating vfio_iommu_spapr_register_memory --- Documentation/vfio.txt | 23 drivers/vfio/vfio_iommu_spapr_tce.c | 230 +++- include/uapi/linux/vfio.h | 27 + 3 files changed, 274 insertions(+), 6 deletions(-) diff --git a/Documentation/vfio.txt b/Documentation/vfio.txt index 96978ec..94328c8 100644 --- a/Documentation/vfio.txt +++ b/Documentation/vfio.txt @@ -427,6 +427,29 @@ The code flow from the example above should be slightly changed: +5) There is v2 of SPAPR TCE IOMMU. It deprecates VFIO_IOMMU_ENABLE/ +VFIO_IOMMU_DISABLE and implements 2 new ioctls: +VFIO_IOMMU_SPAPR_REGISTER_MEMORY and VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY +(which are unsupported in v1 IOMMU). A summary of the semantic differences between v1 and v2 would be nice. At this point it's not really clear to me if there's a case for creating v2, or if this could just be done by adding (optional) functionality to v1. +PPC64 paravirtualized guests generate a lot of map/unmap requests, +and the handling of those includes pinning/unpinning pages and updating +mm::locked_vm counter to make sure we do not exceed the rlimit.
+The v2 IOMMU splits accounting and pinning into separate operations: + +- VFIO_IOMMU_SPAPR_REGISTER_MEMORY/VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY ioctls +receive a user space address and size of the block to be pinned. +Bisecting is not supported and VFIO_IOMMU_UNREGISTER_MEMORY is expected to +be called with the exact address and size used for registering +the memory block. The userspace is not expected to call these often. +The ranges are stored in a linked list in a VFIO container. + +- VFIO_IOMMU_MAP_DMA/VFIO_IOMMU_UNMAP_DMA ioctls only update the actual +IOMMU table and do not do pinning; instead these check that the userspace +address is from pre-registered range. + +This separation helps in optimizing DMA for guests. + --- [1] VFIO was originally an acronym for Virtual Function I/O in its diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c index 892a584..4cfc2c1 100644 --- a/drivers/vfio/vfio_iommu_spapr_tce.c +++ b/drivers/vfio/vfio_iommu_spapr_tce.c So, from
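The flow the new documentation section describes can be modeled as two separate steps: registration (pinning plus accounting, done rarely) and map/unmap (done frequently, validated only against the registered ranges). A toy sketch — the class and method names are made up for illustration and are not the real VFIO ioctl interface:

```python
class SpaprIommuV2Model:
    """Toy model of the v2 split: register first, then map within registered ranges."""
    def __init__(self):
        self.regions = []  # (ua, size) pairs, kept in a simple list

    def register_memory(self, ua, size):
        # stands in for VFIO_IOMMU_SPAPR_REGISTER_MEMORY (pin + account once)
        self.regions.append((ua, size))

    def unregister_memory(self, ua, size):
        # must use the exact (ua, size) of a prior registration; no bisecting
        if (ua, size) not in self.regions:
            raise ValueError("exact registered range required")
        self.regions.remove((ua, size))

    def map_dma(self, ua, size):
        # stands in for VFIO_IOMMU_MAP_DMA: no pinning, just a range check
        return any(r_ua <= ua and ua + size <= r_ua + r_size
                   for r_ua, r_size in self.regions)
```

The point of the split is visible in map_dma(): the hot path does no pinning and no locked_vm updates, only a membership test against ranges that were accounted once at registration time.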
Re: [PATCH kernel v9 30/32] vfio: powerpc/spapr: Use 32bit DMA window properties from table_group
On Sat, Apr 25, 2015 at 10:14:54PM +1000, Alexey Kardashevskiy wrote: A table group might not have a table but it always has the default 32bit window parameters so use these. No change in behavior is expected. Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru It would be easier to review if you took this and the parts of the earlier patch which add the tce32_* fields to table_group and roll them up on their own. -- David Gibson| I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson
[PATCH v2 00/15] Convert the posix_clock_operations and k_clock structure to ready for 2038
This patch series changes the 32-bit time types (timespec/itimerspec) to the 64-bit ones (timespec64/itimerspec64), since 32-bit time types will break in the year 2038. This patch series introduces new methods with timespec64/itimerspec64 type, and removes the old ones with timespec/itimerspec type for the posix_clock_operations and k_clock structures. It also introduces some new functions with timespec64/itimerspec64 type, like current_kernel_time64(), hrtimer_get_res64(), cputime_to_timespec64() and timespec64_to_cputime(). Changes since V1: -Split some patches into smaller patches. -Change the methods for converting the syscalls and add default functions for the new 64bit syscall methods. -Introduce the new function do_sys_settimeofday64() and move the do_sys_settimeofday() function to a header file. -Fix the EXPORT_SYMBOL issue. -Add new 64bit methods in the cputime_nsecs.h file. -Modify some patch logs. Baolin Wang (15): linux/time64.h:Introduce the 'struct itimerspec64' for 64bit timekeeping:Introduce the current_kernel_time64() function with timespec64 type time/hrtimer:Introduce hrtimer_get_res64() with timespec64 type for getting the timer resolution posix timers:Introduce the 64bit methods with timespec64 type for k_clock structure posix-timers:Split out the guts of the syscall and change the implementation posix-timers:Convert to the 64bit methods for the syscall function time:Introduce the do_sys_settimeofday64() function with timespec64 type time/posix-timers:Convert to the 64bit methods for k_clock callback functions char/mmtimer:Convert to the 64bit methods for k_clock callback function time/alarmtimer:Convert to the new methods for k_clock structure time/posix-clock:Convert to the 64bit methods for k_clock and posix_clock_operations structure time/time:Introduce the timespec64_to_jiffies/jiffies_to_timespec64 function cputime:Introduce the cputime_to_timespec64/timespec64_to_cputime function time/posix-cpu-timers:Convert to the 64bit methods for k_clock
structure k_clock:Remove the 32bit methods with timespec/itimerspec type arch/powerpc/include/asm/cputime.h|6 +++--- arch/s390/include/asm/cputime.h |8 drivers/char/mmtimer.c| 36 +++-- drivers/ptp/ptp_clock.c | 26 +--- include/asm-generic/cputime_jiffies.h | 10 +- include/asm-generic/cputime_nsecs.h |4 +- include/linux/cputime.h | 15 ++ include/linux/hrtimer.h | 12 +- include/linux/jiffies.h | 21 ++- include/linux/posix-clock.h | 10 +- include/linux/posix-timers.h | 18 +-- include/linux/time64.h| 35 + include/linux/timekeeping.h | 26 +++- kernel/time/alarmtimer.c | 43 +++--- kernel/time/hrtimer.c | 10 +- kernel/time/posix-clock.c | 20 +-- kernel/time/posix-cpu-timers.c| 83 +- kernel/time/posix-timers.c| 269 ++--- kernel/time/time.c| 22 +-- kernel/time/timekeeping.c |6 +- kernel/time/timekeeping.h |2 +- 21 files changed, 428 insertions(+), 254 deletions(-) -- 1.7.9.5
Re: [PATCH kernel v9 31/32] vfio: powerpc/spapr: Support multiple groups in one container if possible
On 04/30/2015 05:22 PM, David Gibson wrote: On Sat, Apr 25, 2015 at 10:14:55PM +1000, Alexey Kardashevskiy wrote: At the moment only one group per container is supported. POWER8 CPUs have a more flexible design and allow having 2 TCE tables per IOMMU group so we can relax this limitation and support multiple groups per container. It's not obvious why allowing multiple TCE tables per PE has any bearing on allowing multiple groups per container. This patchset is a global TCE tables rework (patches 1..30, roughly) with 2 outcomes: 1. reusing the same IOMMU table for multiple groups - patch 31; 2. allowing dynamic create/remove of IOMMU tables - patch 32. I can remove this one from the patchset and post it separately later but since 1..30 aim to support both 1) and 2), I'd think I better keep them all together (might explain some of the changes I do in 1..30). This adds TCE table descriptors to a container and uses iommu_table_group_ops to create/set DMA windows on IOMMU groups so the same TCE tables will be shared between several IOMMU groups.
Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru [aw: for the vfio related changes] Acked-by: Alex Williamson alex.william...@redhat.com --- Changes: v7: * updated doc --- Documentation/vfio.txt | 8 +- drivers/vfio/vfio_iommu_spapr_tce.c | 268 ++-- 2 files changed, 199 insertions(+), 77 deletions(-) diff --git a/Documentation/vfio.txt b/Documentation/vfio.txt index 94328c8..7dcf2b5 100644 --- a/Documentation/vfio.txt +++ b/Documentation/vfio.txt @@ -289,10 +289,12 @@ PPC64 sPAPR implementation note This implementation has some specifics: -1) Only one IOMMU group per container is supported as an IOMMU group -represents the minimal entity which isolation can be guaranteed for and -groups are allocated statically, one per a Partitionable Endpoint (PE) +1) On older systems (POWER7 with P5IOC2/IODA1) only one IOMMU group per +container is supported as an IOMMU table is allocated at boot time, +one table per IOMMU group, which is a Partitionable Endpoint (PE) (PE is often a PCI domain but not always). I thought the more fundamental problem was that different PEs tended to use disjoint bus address ranges, so even by duplicating put_tce across PEs you couldn't have a common address space. Sorry, I am not following you here. By duplicating put_tce, I can have multiple IOMMU groups on the same virtual PHB in QEMU; [PATCH qemu v7 04/14] spapr_pci_vfio: Enable multiple groups per container does this, and the address ranges will be the same. What I cannot do on p5ioc2 is program the same table into multiple physical PHBs (or I could, but it is very different from IODA2, pretty ugly, and might not always be possible because I would have to allocate these pages from some common pool and face problems like fragmentation). +Newer systems (POWER8 with IODA2) have an improved hardware design which allows +this limitation to be removed and multiple IOMMU groups per VFIO container.
2) The hardware supports so called DMA windows - the PCI address range within which DMA transfer is allowed, any attempt to access address space diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c index a7d6729..970e3a2 100644 --- a/drivers/vfio/vfio_iommu_spapr_tce.c +++ b/drivers/vfio/vfio_iommu_spapr_tce.c @@ -82,6 +82,11 @@ static void decrement_locked_vm(long npages) * into DMA'ble space using the IOMMU */ +struct tce_iommu_group { + struct list_head next; + struct iommu_group *grp; +}; + /* * The container descriptor supports only a single group per container. * Required by the API as the container is not supplied with the IOMMU group @@ -89,10 +94,11 @@ static void decrement_locked_vm(long npages) */ struct tce_container { struct mutex lock; - struct iommu_group *grp; bool enabled; unsigned long locked_pages; bool v2; + struct iommu_table tables[IOMMU_TABLE_GROUP_MAX_TABLES]; Hrm, so here we have more copies of the full iommu_table structures, which again muddies the lifetime. The table_group pointer is presumably meaningless in these copies, which seems dangerously confusing. Ouch. This is bad. No, table_group is not pointless here as it is used to get to the PE number to invalidate TCE cache. I just realized although I need to update just a single table, I still have to invalidate TCE cache for every attached group/PE so I need a list of iommu_table_group's here, not a single pointer... + struct list_head group_list; }; static long tce_unregister_pages(struct tce_container *container, @@ -154,20 +160,20 @@ static bool tce_page_is_contained(struct page *page, unsigned page_shift) return (PAGE_SHIFT + compound_order(compound_head(page))) = page_shift; } +static inline bool tce_groups_attached(struct tce_container *container) +{ + return
Re: [PATCH kernel v9 23/32] powerpc/powernv/ioda: Define and implement DMA table/window management callbacks
On 04/30/2015 02:37 PM, David Gibson wrote: On Wed, Apr 29, 2015 at 07:44:20PM +1000, Alexey Kardashevskiy wrote: On 04/29/2015 03:30 PM, David Gibson wrote: On Sat, Apr 25, 2015 at 10:14:47PM +1000, Alexey Kardashevskiy wrote: This extends iommu_table_group_ops by a set of callbacks to support dynamic DMA window management.

create_table() creates a TCE table with specific parameters. It receives the iommu_table_group so it knows the node ID, in order to allocate TCE table memory closer to the PHB. The exact format of the allocated multi-level table might also be specific to the PHB model (not the case now though). This callback calculates the DMA window offset on the PCI bus from @num and stores it in the just-created table.

set_window() sets the window at the specified TVT index + @num on the PHB. unset_window() unsets the window from the specified TVT.

This adds a free() callback to iommu_table_ops to free the memory (potentially a tree of tables) allocated for the TCE table.

Doesn't the free callback belong with the previous patch introducing multi-level tables? If I did that, you would say why is it here if nothing calls it in the multilevel patch and I see the allocation but I do not see memory release ;) Yeah, fair enough ;) I need some rule of thumb here. I think it is a bit cleaner if the same patch adds a callback for memory allocation and its counterpart, no? On further consideration, yes, I think you're right.

create_table() and free() are supposed to be called once per VFIO container and set_window()/unset_window() are supposed to be called for every group in a container. This adds IOMMU capabilities to iommu_table_group such as default 32bit window parameters and others.
Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru --- arch/powerpc/include/asm/iommu.h| 19 arch/powerpc/platforms/powernv/pci-ioda.c | 75 ++--- arch/powerpc/platforms/powernv/pci-p5ioc2.c | 12 +++-- 3 files changed, 96 insertions(+), 10 deletions(-) diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h index 0f50ee2..7694546 100644 --- a/arch/powerpc/include/asm/iommu.h +++ b/arch/powerpc/include/asm/iommu.h @@ -70,6 +70,7 @@ struct iommu_table_ops { /* get() returns a physical address */ unsigned long (*get)(struct iommu_table *tbl, long index); void (*flush)(struct iommu_table *tbl); + void (*free)(struct iommu_table *tbl); }; /* These are used by VIO */ @@ -148,6 +149,17 @@ extern struct iommu_table *iommu_init_table(struct iommu_table * tbl, struct iommu_table_group; struct iommu_table_group_ops { + long (*create_table)(struct iommu_table_group *table_group, + int num, + __u32 page_shift, + __u64 window_size, + __u32 levels, + struct iommu_table *tbl); + long (*set_window)(struct iommu_table_group *table_group, + int num, + struct iommu_table *tblnew); + long (*unset_window)(struct iommu_table_group *table_group, + int num); /* * Switches ownership from the kernel itself to an external * user. While onwership is taken, the kernel cannot use IOMMU itself. @@ -160,6 +172,13 @@ struct iommu_table_group { #ifdef CONFIG_IOMMU_API struct iommu_group *group; #endif + /* Some key properties of IOMMU */ + __u32 tce32_start; + __u32 tce32_size; + __u64 pgsizes; /* Bitmap of supported page sizes */ + __u32 max_dynamic_windows_supported; + __u32 max_levels; With this information, table_group seems even more like a bad name. iommu_state maybe? Please, no. We will never come to agreement then :( And iommu_state is too general anyway, it won't pass. 
struct iommu_table tables[IOMMU_TABLE_GROUP_MAX_TABLES]; struct iommu_table_group_ops *ops; }; diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c index cc1d09c..4828837 100644 --- a/arch/powerpc/platforms/powernv/pci-ioda.c +++ b/arch/powerpc/platforms/powernv/pci-ioda.c @@ -24,6 +24,7 @@ #include <linux/msi.h> #include <linux/memblock.h> #include <linux/iommu.h> +#include <linux/sizes.h> #include <asm/sections.h> #include <asm/io.h> @@ -1846,6 +1847,7 @@ static struct iommu_table_ops pnv_ioda2_iommu_ops = { #endif .clear = pnv_ioda2_tce_free, .get = pnv_tce_get, + .free = pnv_pci_free_table, }; static void pnv_pci_ioda_setup_opal_tce_kill(struct pnv_phb *phb, @@ -1936,6 +1938,8 @@ static void pnv_pci_ioda_setup_dma_pe(struct pnv_phb *phb, TCE_PCI_SWINV_PAIR); tbl->it_ops = pnv_ioda1_iommu_ops; + pe->table_group.tce32_start = tbl->it_offset << tbl->it_page_shift; + pe->table_group.tce32_size = tbl->it_size << tbl->it_page_shift; iommu_init_table(tbl, phb->hose->node);
[v3] powerpc/mpc85xx: Add MDIO bus muxing support to the board device tree(s)
From: Igal Liberman igal.liber...@freescale.com Describe the PHY topology for all configurations supported by each board Based on prior work by Andy Fleming aflem...@freescale.com Signed-off-by: Igal Liberman igal.liber...@freescale.com Signed-off-by: Shruti Kanetkar shr...@freescale.com Signed-off-by: Emil Medve emilian.me...@freescale.com --- v3: Fixed incorrect E-Mail address (signed-off-by) v2: Remove 'Change-Id' arch/powerpc/boot/dts/b4860qds.dts| 60 - arch/powerpc/boot/dts/b4qds.dtsi | 51 - arch/powerpc/boot/dts/p1023rdb.dts| 24 +- arch/powerpc/boot/dts/p2041rdb.dts| 92 +++- arch/powerpc/boot/dts/p3041ds.dts | 112 - arch/powerpc/boot/dts/p4080ds.dts | 184 ++- arch/powerpc/boot/dts/p5020ds.dts | 112 - arch/powerpc/boot/dts/p5040ds.dts | 234 ++- arch/powerpc/boot/dts/t1040rdb.dts| 32 ++- arch/powerpc/boot/dts/t1042rdb.dts| 30 ++- arch/powerpc/boot/dts/t1042rdb_pi.dts | 18 +- arch/powerpc/boot/dts/t104xqds.dtsi | 178 ++- arch/powerpc/boot/dts/t104xrdb.dtsi | 33 ++- arch/powerpc/boot/dts/t2080qds.dts| 158 - arch/powerpc/boot/dts/t2080rdb.dts| 67 +- arch/powerpc/boot/dts/t2081qds.dts| 221 +- arch/powerpc/boot/dts/t4240qds.dts| 400 - arch/powerpc/boot/dts/t4240rdb.dts| 149 +++- 18 files changed, 2135 insertions(+), 20 deletions(-) diff --git a/arch/powerpc/boot/dts/b4860qds.dts b/arch/powerpc/boot/dts/b4860qds.dts index 6bb3707..98b1ef4 100644 --- a/arch/powerpc/boot/dts/b4860qds.dts +++ b/arch/powerpc/boot/dts/b4860qds.dts @@ -1,7 +1,7 @@ /* * B4860DS Device Tree Source * - * Copyright 2012 Freescale Semiconductor Inc. + * Copyright 2012 - 2015 Freescale Semiconductor Inc. 
* * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions are met: @@ -39,12 +39,69 @@ model = fsl,B4860QDS; compatible = fsl,B4860QDS; + aliases { + phy_sgmii_1e = phy_sgmii_1e; + phy_sgmii_1f = phy_sgmii_1f; + phy_xaui_slot1 = phy_xaui_slot1; + phy_xaui_slot2 = phy_xaui_slot2; + }; + ifc: localbus@ffe124000 { board-control@3,0 { compatible = fsl,b4860qds-fpga, fsl,fpga-qixis; }; }; + soc@ffe00 { + fman@40 { + ethernet@e8000 { + phy-handle = phy_sgmii_1e; + phy-connection-type = sgmii; + }; + + ethernet@ea000 { + phy-handle = phy_sgmii_1f; + phy-connection-type = sgmii; + }; + + ethernet@f { + phy-handle = phy_xaui_slot1; + phy-connection-type = xgmii; + }; + + ethernet@f2000 { + phy-handle = phy_xaui_slot2; + phy-connection-type = xgmii; + }; + + mdio@fc000 { + phy_sgmii_1e: ethernet-phy@1e { + reg = 0x1e; + status = disabled; + }; + + phy_sgmii_1f: ethernet-phy@1f { + reg = 0x1f; + status = disabled; + }; + }; + + mdio@fd000 { + phy_xaui_slot1: xaui-phy@slot1 { + compatible = ethernet-phy-ieee802.3-c45; + reg = 0x7; + status = disabled; + }; + + phy_xaui_slot2: xaui-phy@slot2 { + compatible = ethernet-phy-ieee802.3-c45; + reg = 0x6; + status = disabled; + }; + }; + }; + }; + rio: rapidio@ffe0c { reg = 0xf 0xfe0c 0 0x11000; @@ -55,7 +112,6 @@ ranges = 0 0 0xc 0x3000 0 0x1000; }; }; - }; /include/ fsl/b4860si-post.dtsi diff --git a/arch/powerpc/boot/dts/b4qds.dtsi b/arch/powerpc/boot/dts/b4qds.dtsi index 559d006..af49456 100644 --- a/arch/powerpc/boot/dts/b4qds.dtsi +++ b/arch/powerpc/boot/dts/b4qds.dtsi @@ -1,7 +1,7 @@ /* * B4420DS Device Tree Source * - * Copyright 2012
[PATCH 0/2] powerpc: tweak the kernel options for CRASH_DUMP and RELOCATABLE
Hi, The first patch fixes a build error when CRASH_DUMP=y && ADVANCED_OPTIONS=n for ppc32. The second does some cleanup for the RELOCATABLE option. Kevin Hao (2): powerpc: fix the dependency issue for CRASH_DUMP powerpc: merge the RELOCATABLE config entries for ppc32 and ppc64 arch/powerpc/Kconfig | 68 +++- arch/powerpc/configs/ppc64_defconfig | 1 + 2 files changed, 29 insertions(+), 40 deletions(-) -- 2.1.0 ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
[PATCH 1/2] powerpc: fix the dependency issue for CRASH_DUMP
In the current code, RELOCATABLE will be forcibly enabled when CRASH_DUMP is enabled. But for ppc32, RELOCATABLE also depends on ADVANCED_OPTIONS and selects NONSTATIC_KERNEL. This causes a build error when CRASH_DUMP=y && ADVANCED_OPTIONS=n. Even though there is no such issue for ppc64, select is only for non-visible symbols and for symbols with no dependencies. A symbol like RELOCATABLE is definitely not suitable for select, so make CRASH_DUMP depend on it instead. Also enable RELOCATABLE explicitly in the defconfigs which have CRASH_DUMP enabled.

Signed-off-by: Kevin Hao haoke...@gmail.com --- arch/powerpc/Kconfig | 3 +-- arch/powerpc/configs/ppc64_defconfig | 1 + 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig index 190cc48abc0c..d6bbf4f6f869 100644 --- a/arch/powerpc/Kconfig +++ b/arch/powerpc/Kconfig @@ -429,8 +429,7 @@ config KEXEC config CRASH_DUMP bool "Build a kdump crash kernel" - depends on PPC64 || 6xx || FSL_BOOKE || (44x && !SMP) - select RELOCATABLE if (PPC64 && !COMPILE_TEST) || 44x || FSL_BOOKE + depends on 6xx || ((PPC64 || FSL_BOOKE || (44x && !SMP)) && RELOCATABLE) help Build a kernel suitable for use as a kdump capture kernel. The same kernel binary can be used as production kernel and dump diff --git a/arch/powerpc/configs/ppc64_defconfig b/arch/powerpc/configs/ppc64_defconfig index aad501ae3834..01f7b63f2df0 100644 --- a/arch/powerpc/configs/ppc64_defconfig +++ b/arch/powerpc/configs/ppc64_defconfig @@ -46,6 +46,7 @@ CONFIG_BINFMT_MISC=m CONFIG_PPC_TRANSACTIONAL_MEM=y CONFIG_KEXEC=y CONFIG_CRASH_DUMP=y +CONFIG_RELOCATABLE=y CONFIG_IRQ_ALL_CPUS=y CONFIG_MEMORY_HOTREMOVE=y CONFIG_KSM=y -- 2.1.0
[PATCH 1/2] libfdt: add fdt type definitions
In preparation for libfdt/dtc update, add the new fdt specific types. Signed-off-by: Rob Herring r...@kernel.org Cc: Russell King li...@arm.linux.org.uk Cc: Benjamin Herrenschmidt b...@kernel.crashing.org Cc: Paul Mackerras pau...@samba.org Cc: Michael Ellerman m...@ellerman.id.au Cc: linux-arm-ker...@lists.infradead.org Cc: linuxppc-dev@lists.ozlabs.org --- arch/arm/boot/compressed/libfdt_env.h | 4 arch/powerpc/boot/libfdt_env.h| 4 arch/powerpc/boot/of.h| 2 ++ include/linux/libfdt_env.h| 4 4 files changed, 14 insertions(+) diff --git a/arch/arm/boot/compressed/libfdt_env.h b/arch/arm/boot/compressed/libfdt_env.h index 1f4e718..17ae0f3 100644 --- a/arch/arm/boot/compressed/libfdt_env.h +++ b/arch/arm/boot/compressed/libfdt_env.h @@ -5,6 +5,10 @@ #include linux/string.h #include asm/byteorder.h +typedef __be16 fdt16_t; +typedef __be32 fdt32_t; +typedef __be64 fdt64_t; + #define fdt16_to_cpu(x)be16_to_cpu(x) #define cpu_to_fdt16(x)cpu_to_be16(x) #define fdt32_to_cpu(x)be32_to_cpu(x) diff --git a/arch/powerpc/boot/libfdt_env.h b/arch/powerpc/boot/libfdt_env.h index 8dcd744..7e3789e 100644 --- a/arch/powerpc/boot/libfdt_env.h +++ b/arch/powerpc/boot/libfdt_env.h @@ -10,6 +10,10 @@ typedef u32 uint32_t; typedef u64 uint64_t; typedef unsigned long uintptr_t; +typedef __be16 fdt16_t; +typedef __be32 fdt32_t; +typedef __be64 fdt64_t; + #define fdt16_to_cpu(x)be16_to_cpu(x) #define cpu_to_fdt16(x)cpu_to_be16(x) #define fdt32_to_cpu(x)be32_to_cpu(x) diff --git a/arch/powerpc/boot/of.h b/arch/powerpc/boot/of.h index 5603320..53f8f27 100644 --- a/arch/powerpc/boot/of.h +++ b/arch/powerpc/boot/of.h @@ -21,7 +21,9 @@ int of_setprop(const void *phandle, const char *name, const void *buf, /* Console functions */ void of_console_init(void); +typedef u16__be16; typedef u32__be32; +typedef u64__be64; #ifdef __LITTLE_ENDIAN__ #define cpu_to_be16(x) swab16(x) diff --git a/include/linux/libfdt_env.h b/include/linux/libfdt_env.h index 01508c7..2a663c6 100644 --- 
a/include/linux/libfdt_env.h +++ b/include/linux/libfdt_env.h @@ -5,6 +5,10 @@ #include asm/byteorder.h +typedef __be16 fdt16_t; +typedef __be32 fdt32_t; +typedef __be64 fdt64_t; + #define fdt32_to_cpu(x) be32_to_cpu(x) #define cpu_to_fdt32(x) cpu_to_be32(x) #define fdt64_to_cpu(x) be64_to_cpu(x) -- 2.1.0 ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
[PATCH 2/2] powerpc: merge the RELOCATABLE config entries for ppc32 and ppc64
It makes no sense to keep two separate RELOCATABLE config entries for ppc32 and ppc64 respectively. Merge them into one and move it to a common place. The dependency on ADVANCED_OPTIONS for ppc32 seems unnecessary, so drop it as well.

Signed-off-by: Kevin Hao haoke...@gmail.com --- arch/powerpc/Kconfig | 65 ++-- 1 file changed, 27 insertions(+), 38 deletions(-) diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig index d6bbf4f6f869..4080a14707bb 100644 --- a/arch/powerpc/Kconfig +++ b/arch/powerpc/Kconfig @@ -427,6 +427,33 @@ config KEXEC interface is strongly in flux, so no good recommendation can be made. +config RELOCATABLE + bool "Build a relocatable kernel" + depends on (PPC64 && !COMPILE_TEST) || (FLATMEM && (44x || FSL_BOOKE)) + select NONSTATIC_KERNEL + help + This builds a kernel image that is capable of running at the + location the kernel is loaded at. For ppc32, there are no + alignment restrictions, and this feature is a superset of + DYNAMIC_MEMSTART and hence overrides it. For ppc64, we should use + a 16k-aligned base address. The kernel is linked as a + position-independent executable (PIE) and contains dynamic relocations + which are processed early in the bootup process. + + One use is for the kexec on panic case where the recovery kernel + must live at a different physical address than the primary + kernel. + + Note: If CONFIG_RELOCATABLE=y, then the kernel runs from the address + it has been loaded at and the compile time physical addresses + CONFIG_PHYSICAL_START is ignored. However CONFIG_PHYSICAL_START + setting can still be useful to bootwrappers that need to know the + load address of the kernel (eg. u-boot/mkimage).
+ +config RELOCATABLE_PPC32 + def_bool y + depends on PPC32 && RELOCATABLE + config CRASH_DUMP bool "Build a kdump crash kernel" depends on 6xx || ((PPC64 || FSL_BOOKE || (44x && !SMP)) && RELOCATABLE) @@ -926,29 +953,6 @@ config DYNAMIC_MEMSTART This option is overridden by CONFIG_RELOCATABLE -config RELOCATABLE - bool "Build a relocatable kernel" - depends on ADVANCED_OPTIONS && FLATMEM && (44x || FSL_BOOKE) - select NONSTATIC_KERNEL - help - This builds a kernel image that is capable of running at the - location the kernel is loaded at, without any alignment restrictions. - This feature is a superset of DYNAMIC_MEMSTART and hence overrides it. - - One use is for the kexec on panic case where the recovery kernel - must live at a different physical address than the primary - kernel. - - Note: If CONFIG_RELOCATABLE=y, then the kernel runs from the address - it has been loaded at and the compile time physical addresses - CONFIG_PHYSICAL_START is ignored. However CONFIG_PHYSICAL_START - setting can still be useful to bootwrappers that need to know the - load address of the kernel (eg. u-boot/mkimage). - -config RELOCATABLE_PPC32 - def_bool y - depends on PPC32 && RELOCATABLE - config PAGE_OFFSET_BOOL bool "Set custom page offset address" depends on ADVANCED_OPTIONS @@ -1034,21 +1038,6 @@ config PIN_TLB endmenu if PPC64 -config RELOCATABLE - bool "Build a relocatable kernel" - depends on !COMPILE_TEST - select NONSTATIC_KERNEL - help - This builds a kernel image that is capable of running anywhere - in the RMA (real memory area) at any 16k-aligned base address. - The kernel is linked as a position-independent executable (PIE) - and contains dynamic relocations which are processed early - in the bootup process. - - One use is for the kexec on panic case where the recovery kernel - must live at a different physical address than the primary - kernel.
- # This value must have zeroes in the bottom 60 bits otherwise lots will break config PAGE_OFFSET hex -- 2.1.0
Re: [PATCH v4 3/3] leds/powernv: Add driver for PowerNV platform
On 04/28/2015 03:48 PM, Arnd Bergmann wrote: On Tuesday 28 April 2015 15:40:35 Vasant Hegde wrote: +++ b/Documentation/devicetree/bindings/leds/leds-powernv.txt @@ -0,0 +1,29 @@ +Device Tree binding for LEDs on IBM Power Systems +- + +The 'led' node under '/ibm,opal' lists service indicators available in the +system and their capabilities. + +led { + compatible = ibm,opal-v3-led; + phandle = 0x106b; + linux,phandle = 0x106b; + led-mode = lightpath; + + U78C9.001.RST0027-P1-C1 { + led-types = identify, fault; + led-loc = descendent; + phandle = 0x106f; + linux,phandle = 0x106f; + }; + ... + ... +}; Arnd, Thanks for the review. We normally don't list the 'phandle' or 'linux,phandle' properties in the binding description. Sure. .Will fix. + +Each node under 'led' node describes location code of FRU/Enclosure. + +The properties under each node: + + led-types : Supported LED types (attention/identify/fault). + + led-loc : enclosure/descendent(FRU) location code. Could you use the standard 'label' property for this? This was discussed earlier [1] and agreed to use led-types property here.. [1] https://lists.ozlabs.org/pipermail/linuxppc-dev/2015-March/126301.html -Vasant ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH v4 3/3] leds/powernv: Add driver for PowerNV platform
On 04/30/2015 07:59 PM, Jacek Anaszewski wrote: Hi Vasant, Hi Jacek, .../... diff --git a/Documentation/devicetree/bindings/leds/leds-powernv.txt b/Documentation/devicetree/bindings/leds/leds-powernv.txt new file mode 100644 index 000..6bb0e7e --- /dev/null +++ b/Documentation/devicetree/bindings/leds/leds-powernv.txt @@ -0,0 +1,29 @@ +Device Tree binding for LEDs on IBM Power Systems +- + +The 'led' node under '/ibm,opal' lists service indicators available in the +system and their capabilities. +
+led {
+	compatible = "ibm,opal-v3-led";
+	phandle = <0x106b>;
+	linux,phandle = <0x106b>;
+	led-mode = "lightpath";
+
+	U78C9.001.RST0027-P1-C1 {
+		led-types = "identify", "fault";
+		led-loc = "descendent";
+		phandle = <0x106f>;
+		linux,phandle = <0x106f>;
+	};
+	...
+	...
+};
+ +Each node under 'led' node describes location code of FRU/Enclosure. + +The properties under each node: + + led-types : Supported LED types (attention/identify/fault). + + led-loc : enclosure/descendent(FRU) location code.

DT documentation is usually constructed so that properties are described in the beginning and the file ends with an example. Also last time I mistakenly requested to remove the description of the compatible property, but it should also be present here and the entry should describe it in detail, like: - compatible : Should be "ibm,opal-v3-led". That's fine. I will fix it in v5. Please refer to the other bindings. I will express my opinion on the LED part after the powerpc maintainer acks the DT bindings. Sure.. @Ben/Michael, Can you please review/ack this patchset? -Vasant
RE: [v3] dt/bindings: qoriq-clock: Add binding for FMan clock mux
Regards, Igal Liberman. -Original Message- From: Wood Scott-B07421 Sent: Thursday, April 30, 2015 3:31 AM To: Liberman Igal-B31950 Cc: devicet...@vger.kernel.org; linuxppc-dev@lists.ozlabs.org; Tang Yuantian-B29983 Subject: Re: [v3] dt/bindings: qoriq-clock: Add binding for FMan clock mux On Wed, 2015-04-22 at 05:47 -0500, Liberman Igal-B31950 wrote: Regards, Igal Liberman. -Original Message- From: Wood Scott-B07421 Sent: Tuesday, April 21, 2015 3:52 AM To: Liberman Igal-B31950 Cc: devicet...@vger.kernel.org; linuxppc-dev@lists.ozlabs.org; Tang Yuantian-B29983 Subject: Re: [v3] dt/bindings: qoriq-clock: Add binding for FMan clock mux On Mon, 2015-04-20 at 06:40 -0500, Liberman Igal-B31950 wrote: Regards, Igal Liberman. -Original Message- From: Liberman Igal-B31950 Sent: Monday, April 20, 2015 2:07 PM To: Wood Scott-B07421 Cc: devicet...@vger.kernel.org; linuxppc-dev@lists.ozlabs.org Subject: RE: [v3] dt/bindings: qoriq-clock: Add binding for FMan clock mux Regards, Igal Liberman. -Original Message- From: Wood Scott-B07421 Sent: Friday, April 17, 2015 8:41 AM To: Liberman Igal-B31950 Cc: devicet...@vger.kernel.org; linuxppc-dev@lists.ozlabs.org Subject: Re: [v3] dt/bindings: qoriq-clock: Add binding for FMan clock mux On Thu, 2015-04-16 at 01:11 -0500, Liberman Igal-B31950 wrote: Regards, Igal Liberman. -Original Message- From: Wood Scott-B07421 Sent: Wednesday, April 15, 2015 8:36 PM To: Liberman Igal-B31950 Cc: devicet...@vger.kernel.org; linuxppc-dev@lists.ozlabs.org Subject: Re: [v3] dt/bindings: qoriq-clock: Add binding for FMan clock mux On Tue, 2015-04-14 at 13:56 +0300, Igal.Liberman wrote: From: Igal Liberman igal.liber...@freescale.com v3: Addressed feedback from Scott: - Removed clock specifier description. v2: Addressed feedback from Scott: - Moved the fman-clk-mux clock provider details under clocks property. 
Signed-off-by: Igal Liberman igal.liber...@freescale.com --- .../devicetree/bindings/clock/qoriq-clock.txt | 17 +++-- 1 file changed, 15 insertions(+), 2 deletions(-) diff --git a/Documentation/devicetree/bindings/clock/qoriq-clock.tx t b/Documentation/devicetree/bindings/clock/qoriq-clock.tx t index b0d7b73..2bb3b38 100644 --- a/Documentation/devicetree/bindings/clock/qoriq-clock.tx t +++ b/Documentation/devicetree/bindings/clock/qoriq-cloc +++ k.tx +++ t @@ -65,9 +65,10 @@ Required properties: It takes parent's clock-frequency as its clock. * fsl,qoriq-platform-pll-1.0 for the platform PLL clock (v1.0) * fsl,qoriq-platform-pll-2.0 for the platform PLL clock (v2.0) + * fsl,fman-clk-mux for the Frame Manager clock. - #clock-cells: From common clock binding. The number of cells in a - clock-specifier. Should be 0 for fsl,qoriq-sysclk-[1,2].0 - clocks, or 1 for fsl,qoriq-core-pll-[1,2].0 clocks. + clock-specifier. Should be 0 for fsl,qoriq-sysclk- [1,2].0 and + fsl,fman-clk-mux clocks or 1 for +fsl,qoriq-core-pll- [1,2].0. For fsl,qoriq-core-pll-1.0 clocks, the single clock-specifier cell may take the following values: * 0 - equal to the PLL frequency @@ -145,6 +146,18 @@ Example for clock block and clock provider: clocks = sysclk; clock-output-names = platform-pll, platform-pll- div2; }; + + fm0clk: fm0-clk-mux { + #clock-cells = 0; + reg = 0x10 4 + compatible = fsl,fman-clk-mux; + clocks = pll0 0, pll0 1, pll0 2, pll0 3, + platform_pll 0, pll1 1, pll1 2; + clock-names = pll0, pll0-div2, pll0-div3, + pll0-div4, platform-pll, pll1- div2, + pll1-div3; + clock-output-names = fm0-clk; + }; }; }; I don't see this register in the manuals for older DPAA chips, such as p4080 or p3041. Is
Re: [PATCH v4 3/3] leds/powernv: Add driver for PowerNV platform
Hi Vasant, On 04/28/2015 12:10 PM, Vasant Hegde wrote: This patch implements LED driver for PowerNV platform using the existing generic LED class framework. PowerNV platform has below type of LEDs: - System attention Indicates there is a problem with the system that needs attention. - Identify Helps the user locate/identify a particular FRU or resource in the system. - Fault Indicates there is a problem with the FRU or resource at the location with which the indicator is associated. We register classdev structures for all individual LEDs detected on the system through LED specific device tree nodes. Device tree nodes specify what all kind of LEDs present on the same location code. It registers LED classdev structure for each of them. All the system LEDs can be found in the same regular path /sys/class/leds/. We don't use LED colors. We use LED node and led-types property to form LED classdev. Our LEDs have names in this format. location_code:attention|identify|fault Any positive brightness value would turn on the LED and a zero value would turn off the LED. The driver will return LED_FULL (255) for any turned on LED and LED_OFF (0) for any turned off LED. As per the LED class framework, the 'brightness_set' function should not sleep. Hence these functions have been implemented through global work queue tasks which might sleep on OPAL async call completion. The platform level implementation of LED get and set state has been achieved through OPAL calls. These calls are made available for the driver by exporting from architecture specific codes. Signed-off-by: Vasant Hegde hegdevas...@linux.vnet.ibm.com Signed-off-by: Anshuman Khandual khand...@linux.vnet.ibm.com Acked-by: Stewart Smith stew...@linux.vnet.ibm.com Tested-by: Stewart Smith stew...@linux.vnet.ibm.com --- Changes in v4: - s/u64/__be64/g for big endian data we get from firmware - Addressed review comments from Jacek. 
Major ones are: Removed list in powernv_led_data structure s/kzalloc/devm_kzalloc/ Removed compatible property from documentation s/powernv_led_set_queue/powernv_brightness_set/ - Removed LED specific brightness_set/get function. Instead this version uses a single function to queue all LED set/get requests. Later we use the LED name to detect LED type and value. - Removed hardcoded LED type used in previous version. Instead we use the led-types property to form the LED classdev. Changes in v3: - Addressed review comments from Jacek. Major ones are: Replaced spin lock and mutex and removed redundant structures Replaced pr_* with dev_* Moved OPAL platform specific part to separate patch Moved repeated code to common function Added device tree documentation for LEDs Changes in v2: - Added System Attention indicator support - Moved common code to powernv_led_set_queue() .../devicetree/bindings/leds/leds-powernv.txt | 29 + drivers/leds/Kconfig | 11 drivers/leds/Makefile |1 drivers/leds/leds-powernv.c| 472 4 files changed, 513 insertions(+) create mode 100644 Documentation/devicetree/bindings/leds/leds-powernv.txt create mode 100644 drivers/leds/leds-powernv.c diff --git a/Documentation/devicetree/bindings/leds/leds-powernv.txt b/Documentation/devicetree/bindings/leds/leds-powernv.txt new file mode 100644 index 000..6bb0e7e --- /dev/null +++ b/Documentation/devicetree/bindings/leds/leds-powernv.txt @@ -0,0 +1,29 @@ +Device Tree binding for LEDs on IBM Power Systems +- + +The 'led' node under '/ibm,opal' lists service indicators available in the +system and their capabilities. +
+led {
+	compatible = "ibm,opal-v3-led";
+	phandle = <0x106b>;
+	linux,phandle = <0x106b>;
+	led-mode = "lightpath";
+
+	U78C9.001.RST0027-P1-C1 {
+		led-types = "identify", "fault";
+		led-loc = "descendent";
+		phandle = <0x106f>;
+		linux,phandle = <0x106f>;
+	};
+	...
+	...
+};
+ +Each node under 'led' node describes location code of FRU/Enclosure.
+ +The properties under each node: + + led-types : Supported LED types (attention/identify/fault). + + led-loc : enclosure/descendent(FRU) location code. DT documentation it usually constructed so that properties are described in the beginning and the file ends with an example. Also last time I mistakenly requested to remove description of compatible property, but it should also be present here and the entry should described it in detail, like: - compatible : Should be ibm,opal-v3-led. Please refer to the other bindings, I will express my opinion on the LED part after powerpc maintainer will ack DT bindings. diff --git a/drivers/leds/Kconfig b/drivers/leds/Kconfig index 25b320d..2ea0849
Re: [PATCH net-next v3 2/4] ibmveth: Add support for TSO
From: Thomas Falcon tlfal...@linux.vnet.ibm.com Date: Wed, 29 Apr 2015 16:25:45 -0500 Add support for TSO. TSO is turned off by default and must be enabled and configured by the user. The driver version number is increased so that users can be sure that they are using ibmveth with TSO support. Cc: Brian King brk...@linux.vnet.ibm.com Signed-off-by: Thomas Falcon tlfal...@linux.vnet.ibm.com --- v2: Included statistics that were previously in a separate patch Applied.
Re: [PATCH net-next v3 1/4] ibmveth: change rx buffer default allocation for CMO
From: Thomas Falcon tlfal...@linux.vnet.ibm.com Date: Wed, 29 Apr 2015 16:25:44 -0500 This patch enables 64k rx buffer pools by default. If Cooperative Memory Overcommitment (CMO) is enabled, the number of 64k buffers is reduced to save memory. Cc: Brian King brk...@linux.vnet.ibm.com Signed-off-by: Thomas Falcon tlfal...@linux.vnet.ibm.com Applied.
Re: [PATCH net-next v3 3/4] ibmveth: Add GRO support
From: Thomas Falcon tlfal...@linux.vnet.ibm.com Date: Wed, 29 Apr 2015 16:25:46 -0500 Cc: Brian King brk...@linux.vnet.ibm.com Signed-off-by: Thomas Falcon tlfal...@linux.vnet.ibm.com Applied.
Re: [PATCH net-next v3 4/4] ibmveth: Add support for Large Receive Offload
From: Thomas Falcon tlfal...@linux.vnet.ibm.com Date: Wed, 29 Apr 2015 16:25:47 -0500 Enables receiving large packets from other LPARs. These packets have a -1 IP header checksum, so we must recalculate to have a valid checksum. Signed-off-by: Brian King brk...@linux.vnet.ibm.com Signed-off-by: Thomas Falcon tlfal...@linux.vnet.ibm.com --- v3: -Removed code setting network and transport headers -get IP header from skb data Thanks again to Eric Dumazet v2: -Included statistics that were previously in a separate patch -Zeroed the IP header checksum before calling ip_fast_csum Thanks to Eric Dumazet. Applied.
Re: [PATCH kernel v9 27/32] powerpc/iommu/ioda2: Add get_table_size() to calculate the size of future table
On 04/29/2015 04:40 PM, David Gibson wrote: On Sat, Apr 25, 2015 at 10:14:51PM +1000, Alexey Kardashevskiy wrote: This adds a way for the IOMMU user to know how much a new table will use so it can be accounted in the locked_vm limit before allocation happens. This stores the allocated table size in pnv_pci_create_table() so the locked_vm counter can be updated correctly when a table is being disposed. This defines an iommu_table_group_ops callback to let VFIO know how much memory will be locked if a table is created. Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru --- Changes: v9: * reimplemented the whole patch --- arch/powerpc/include/asm/iommu.h | 5 + arch/powerpc/platforms/powernv/pci-ioda.c | 14 arch/powerpc/platforms/powernv/pci.c | 36 +++ arch/powerpc/platforms/powernv/pci.h | 2 ++ 4 files changed, 57 insertions(+) diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h index 1472de3..9844c106 100644 --- a/arch/powerpc/include/asm/iommu.h +++ b/arch/powerpc/include/asm/iommu.h @@ -99,6 +99,7 @@ struct iommu_table { unsigned long it_size; /* Size of iommu table in entries */ unsigned long it_indirect_levels; unsigned long it_level_size; + unsigned long it_allocated_size; unsigned long it_offset;/* Offset into global table */ unsigned long it_base; /* mapped address of tce table */ unsigned long it_index; /* which iommu table this is */ @@ -155,6 +156,10 @@ extern struct iommu_table *iommu_init_table(struct iommu_table * tbl, struct iommu_table_group; struct iommu_table_group_ops { + unsigned long (*get_table_size)( + __u32 page_shift, + __u64 window_size, + __u32 levels); long (*create_table)(struct iommu_table_group *table_group, int num, __u32 page_shift, diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c index e0be556..7f548b4 100644 --- a/arch/powerpc/platforms/powernv/pci-ioda.c +++ b/arch/powerpc/platforms/powernv/pci-ioda.c @@ -2062,6 +2062,18 @@ static void 
pnv_pci_ioda2_setup_bypass_pe(struct pnv_phb *phb, } #ifdef CONFIG_IOMMU_API +static unsigned long pnv_pci_ioda2_get_table_size(__u32 page_shift, + __u64 window_size, __u32 levels) +{ + unsigned long ret = pnv_get_table_size(page_shift, window_size, levels); + + if (!ret) + return ret; + + /* Add size of it_userspace */ + return ret + (window_size >> page_shift) * sizeof(unsigned long); This doesn't make much sense. The userspace view can't possibly be a property of the specific low-level IOMMU model. This it_userspace thing is all about memory preregistration. I need some way to track how many actual mappings the mm_iommu_table_group_mem_t has in order to decide whether to allow unregistering or not. When I clear a TCE, I can read back the old value, but that is a host physical address, which I cannot use to find the preregistered region and adjust the mappings counter; I can only use userspace addresses for this (not even guest physical addresses, as this is VFIO and there is probably no KVM). So I have to keep userspace addresses somewhere, one per IOMMU page, and the iommu_table seems a natural place for this.
+} + static long pnv_pci_ioda2_create_table(struct iommu_table_group *table_group, int num, __u32 page_shift, __u64 window_size, __u32 levels, struct iommu_table *tbl) @@ -2086,6 +2098,7 @@ static long pnv_pci_ioda2_create_table(struct iommu_table_group *table_group, BUG_ON(tbl->it_userspace); tbl->it_userspace = uas; + tbl->it_allocated_size += uas_cb; tbl->it_ops = pnv_ioda2_iommu_ops; if (pe->tce_inval_reg) tbl->it_type |= (TCE_PCI_SWINV_CREATE | TCE_PCI_SWINV_FREE); @@ -2160,6 +2173,7 @@ static void pnv_ioda2_release_ownership(struct iommu_table_group *table_group) } static struct iommu_table_group_ops pnv_pci_ioda2_ops = { + .get_table_size = pnv_pci_ioda2_get_table_size, .create_table = pnv_pci_ioda2_create_table, .set_window = pnv_pci_ioda2_set_window, .unset_window = pnv_pci_ioda2_unset_window, diff --git a/arch/powerpc/platforms/powernv/pci.c b/arch/powerpc/platforms/powernv/pci.c index fc129c4..1b5b48a 100644 --- a/arch/powerpc/platforms/powernv/pci.c +++ b/arch/powerpc/platforms/powernv/pci.c @@ -662,6 +662,38 @@ void pnv_pci_setup_iommu_table(struct iommu_table *tbl, tbl->it_type = TCE_PCI; } +unsigned long pnv_get_table_size(__u32 page_shift, + __u64 window_size, __u32 levels) +{ + unsigned long bytes = 0; + const unsigned window_shift = ilog2(window_size); + unsigned entries_shift = window_shift - page_shift; + unsigned
Re: [PATCH kernel v9 29/32] vfio: powerpc/spapr: Register memory and define IOMMU v2
On 04/30/2015 04:55 PM, David Gibson wrote: On Sat, Apr 25, 2015 at 10:14:53PM +1000, Alexey Kardashevskiy wrote: The existing implementation accounts the whole DMA window in the locked_vm counter. This is going to be worse with multiple containers and huge DMA windows. Also, real-time accounting would requite additional tracking of accounted pages due to the page size difference - IOMMU uses 4K pages and system uses 4K or 64K pages. Another issue is that actual pages pinning/unpinning happens on every DMA map/unmap request. This does not affect the performance much now as we spend way too much time now on switching context between guest/userspace/host but this will start to matter when we add in-kernel DMA map/unmap acceleration. This introduces a new IOMMU type for SPAPR - VFIO_SPAPR_TCE_v2_IOMMU. New IOMMU deprecates VFIO_IOMMU_ENABLE/VFIO_IOMMU_DISABLE and introduces 2 new ioctls to register/unregister DMA memory - VFIO_IOMMU_SPAPR_REGISTER_MEMORY and VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY - which receive user space address and size of a memory region which needs to be pinned/unpinned and counted in locked_vm. New IOMMU splits physical pages pinning and TCE table update into 2 different operations. It requires 1) guest pages to be registered first 2) consequent map/unmap requests to work only with pre-registered memory. For the default single window case this means that the entire guest (instead of 2GB) needs to be pinned before using VFIO. When a huge DMA window is added, no additional pinning will be required, otherwise it would be guest RAM + 2GB. The new memory registration ioctls are not supported by VFIO_SPAPR_TCE_IOMMU. Dynamic DMA window and in-kernel acceleration will require memory to be preregistered in order to work. The accounting is done per the user process. This advertises v2 SPAPR TCE IOMMU and restricts what the userspace can do with v1 or v2 IOMMUs. 
Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru [aw: for the vfio related changes] Acked-by: Alex Williamson alex.william...@redhat.com --- Changes: v9: * s/tce_get_hva_cached/tce_iommu_use_page_v2/ v7: * now memory is registered per mm (i.e. process) * moved memory registration code to powerpc/mmu * merged vfio: powerpc/spapr: Define v2 IOMMU into this * limited new ioctls to v2 IOMMU * updated doc * unsupported ioclts return -ENOTTY instead of -EPERM v6: * tce_get_hva_cached() returns hva via a pointer v4: * updated docs * s/kzmalloc/vzalloc/ * in tce_pin_pages()/tce_unpin_pages() removed @vaddr, @size and replaced offset with index * renamed vfio_iommu_type_register_memory to vfio_iommu_spapr_register_memory and removed duplicating vfio_iommu_spapr_register_memory --- Documentation/vfio.txt | 23 drivers/vfio/vfio_iommu_spapr_tce.c | 230 +++- include/uapi/linux/vfio.h | 27 + 3 files changed, 274 insertions(+), 6 deletions(-) diff --git a/Documentation/vfio.txt b/Documentation/vfio.txt index 96978ec..94328c8 100644 --- a/Documentation/vfio.txt +++ b/Documentation/vfio.txt @@ -427,6 +427,29 @@ The code flow from the example above should be slightly changed: +5) There is v2 of SPAPR TCE IOMMU. It deprecates VFIO_IOMMU_ENABLE/ +VFIO_IOMMU_DISABLE and implements 2 new ioctls: +VFIO_IOMMU_SPAPR_REGISTER_MEMORY and VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY +(which are unsupported in v1 IOMMU). A summary of the semantic differeces between v1 and v2 would be nice. At this point it's not really clear to me if there's a case for creating v2, or if this could just be done by adding (optional) functionality to v1. v1: memory preregistration is not supported; explicit enable/disable ioctls are required v2: memory preregistration is required; explicit enable/disable are prohibited (as they are not needed). 
Mixing these in one IOMMU type caused a lot of problems like should I increment locked_vm by the 32bit window size on enable() or not; what do I do about pages pinning when map/map (check if it is from registered memory and do not pin?). Having 2 IOMMU models makes everything a lot simpler. +PPC64 paravirtualized guests generate a lot of map/unmap requests, +and the handling of those includes pinning/unpinning pages and updating +mm::locked_vm counter to make sure we do not exceed the rlimit. +The v2 IOMMU splits accounting and pinning into separate operations: + +- VFIO_IOMMU_SPAPR_REGISTER_MEMORY/VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY ioctls +receive a user space address and size of the block to be pinned. +Bisecting is not supported and VFIO_IOMMU_UNREGISTER_MEMORY is expected to +be called with the exact address and size used for registering +the memory block. The userspace is not expected to call these often. +The ranges are stored in a linked list in a VFIO container. + +- VFIO_IOMMU_MAP_DMA/VFIO_IOMMU_UNMAP_DMA ioctls only update the actual +IOMMU table and do not do pinning; instead these check that the userspace +address is from
[PATCH v3 3/3] Documentation: mmc: Update Arasan SDHC documentation to support 4.9a version of Arasan SDHC controller.
This patch updates Arasan SDHC documentation to support 4.9a version of Arasan SDHC controller. Signed-off-by: Suman Tripathi stripa...@apm.com --- Documentation/devicetree/bindings/mmc/arasan,sdhci.txt | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/Documentation/devicetree/bindings/mmc/arasan,sdhci.txt b/Documentation/devicetree/bindings/mmc/arasan,sdhci.txt index 98ee2ab..f01d41a 100644 --- a/Documentation/devicetree/bindings/mmc/arasan,sdhci.txt +++ b/Documentation/devicetree/bindings/mmc/arasan,sdhci.txt @@ -8,7 +8,8 @@ Device Tree Bindings for the Arasan SDHCI Controller [3] Documentation/devicetree/bindings/interrupt-controller/interrupts.txt Required Properties: - - compatible: Compatibility string. Must be 'arasan,sdhci-8.9a' + - compatible: Compatibility string. Must be 'arasan,sdhci-8.9a' or +'arasan,sdhci-4.9a' - reg: From mmc bindings: Register location and length. - clocks: From clock bindings: Handles to clock inputs. - clock-names: From clock bindings: Tuple including clk_xin and clk_ahb @@ -18,7 +19,7 @@ Required Properties: Example: sdhci@e010 { - compatible = arasan,sdhci-8.9a; + compatible = arasan,sdhci-8.9a, arasan,sdhci-4.9a; reg = 0xe010 0x1000; clock-names = clk_xin, clk_ahb; clocks = clkc 21, clkc 32; -- 1.8.2.1 ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
[PATCH v3 2/3] mmc: host: arasan: Add the support for sdhci-arasan4.9a in sdhci-of-arasan.c.
This patch adds the quirks and compatible string in sdhci-of-arasan.c to support sdhci-arasan4.9a version of controller. Signed-off-by: Suman Tripathi stripa...@apm.com --- drivers/mmc/host/sdhci-of-arasan.c | 7 +++ 1 file changed, 7 insertions(+) diff --git a/drivers/mmc/host/sdhci-of-arasan.c b/drivers/mmc/host/sdhci-of-arasan.c index 981d66e..92a4222 100644 --- a/drivers/mmc/host/sdhci-of-arasan.c +++ b/drivers/mmc/host/sdhci-of-arasan.c @@ -20,6 +20,7 @@ */ #include linux/module.h +#include linux/of_device.h #include sdhci-pltfm.h #define SDHCI_ARASAN_CLK_CTRL_OFFSET 0x2c @@ -169,6 +170,11 @@ static int sdhci_arasan_probe(struct platform_device *pdev) goto clk_disable_all; } + if (of_device_is_compatible(pdev-dev.of_node, arasan,sdhci-4.9a)) { + host-quirks |= SDHCI_QUIRK_NO_HISPD_BIT; + host-quirks2 |= SDHCI_QUIRK2_HOST_NO_CMD23; + } + sdhci_get_of_property(pdev); pltfm_host = sdhci_priv(host); pltfm_host-priv = sdhci_arasan; @@ -206,6 +212,7 @@ static int sdhci_arasan_remove(struct platform_device *pdev) static const struct of_device_id sdhci_arasan_of_match[] = { { .compatible = arasan,sdhci-8.9a }, + { .compatible = arasan,sdhci-4.9a }, { } }; MODULE_DEVICE_TABLE(of, sdhci_arasan_of_match); -- 1.8.2.1 ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
[PATCH v3 1/3] arm64: dts: Add the arasan sdhc nodes in apm-storm.dtsi.
This patch adds the arasan sdhc nodes to reuse the of-arasan driver for APM X-Gene SoC. Signed-off-by: Suman Tripathi stripa...@apm.com --- arch/arm64/boot/dts/apm-mustang.dts | 4 arch/arm64/boot/dts/apm-storm.dtsi | 44 + 2 files changed, 48 insertions(+) diff --git a/arch/arm64/boot/dts/apm-mustang.dts b/arch/arm64/boot/dts/apm-mustang.dts index 8eb6d94..d0e52a9 100644 --- a/arch/arm64/boot/dts/apm-mustang.dts +++ b/arch/arm64/boot/dts/apm-mustang.dts @@ -44,3 +44,7 @@ xgenet { status = ok; }; + +sdhc0 { + status = ok; +}; diff --git a/arch/arm64/boot/dts/apm-storm.dtsi b/arch/arm64/boot/dts/apm-storm.dtsi index 87d3205..d6c2216 100644 --- a/arch/arm64/boot/dts/apm-storm.dtsi +++ b/arch/arm64/boot/dts/apm-storm.dtsi @@ -144,6 +144,40 @@ clock-output-names = socplldiv2; }; + ahbclk: ahbclk@1f2ac000 { + compatible = apm,xgene-device-clock; + #clock-cells = 1; + clocks = socplldiv2 0; + reg = 0x0 0x1f2ac000 0x0 0x1000 + 0x0 0x1700 0x0 0x2000; + reg-names = csr-reg, div-reg; + csr-offset = 0x0; + csr-mask = 0x1; + enable-offset = 0x8; + enable-mask = 0x1; + divider-offset = 0x164; + divider-width = 0x5; + divider-shift = 0x0; + clock-output-names = ahbclk; + }; + + sdioclk: sdioclk@1f2ac000 { + compatible = apm,xgene-device-clock; + #clock-cells = 1; + clocks = socplldiv2 0; + reg = 0x0 0x1f2ac000 0x0 0x1000 + 0x0 0x1700 0x0 0x2000; + reg-names = csr-reg, div-reg; + csr-offset = 0x0; + csr-mask = 0x2; + enable-offset = 0x8; + enable-mask = 0x2; + divider-offset = 0x178; + divider-width = 0x8; + divider-shift = 0x0; + clock-output-names = sdioclk; + }; + qmlclk: qmlclk { compatible = apm,xgene-device-clock; #clock-cells = 1; @@ -503,6 +537,16 @@ interrupts = 0x0 0x4f 0x4; }; + sdhc0: sdhc@1c00 { + device_type = sdhc; + compatible = arasan,sdhci-8.9a, arasan,sdhci-4.9a; + reg = 0x0 0x1c00 0x0 0x100; + interrupts = 0x0 0x49 0x4; + dma-coherent; + clock-names = clk_xin, clk_ahb; + clocks = sdioclk 0, ahbclk 0; + }; + phy1: phy@1f21a000 { compatible = apm,xgene-phy; reg = 
0x0 0x1f21a000 0x0 0x100; -- 1.8.2.1
[PATCH v3 0/3] Add SDHCI support for APM X-Gene SoC using ARASAN SDHCI controller.
This patch adds the SDHCI support for APM X-Gene SoC using ARASAN SDHCI controller. v1 change: * Use the CONFIG_ARM64_DMA_HAS_IOMMU for dma-mapping. v2 change: * Drop the IOMMU support and switching to PIO mode for arasan. controller integrated inside APM X-Gene SoC. v3 change: * Change the sdhci-of-arasan.c to support arasan4.9a. * Add quirks for arasan4.9a. Signed-off-by: Suman Tripathi stripa...@apm.com --- Suman Tripathi (3): arm64: dts: Add the arasan sdhc nodes in apm-storm.dtsi. mmc: host: arasan: Add the support for sdhci-arasan4.9a in sdhci-of-arasan.c Documentation: mmc: Update Arasan SDHC documentation to support 4.9a version of Arasan SDHC controller. .../devicetree/bindings/mmc/arasan,sdhci.txt | 5 ++- arch/arm64/boot/dts/apm-mustang.dts| 4 ++ arch/arm64/boot/dts/apm-storm.dtsi | 44 ++ drivers/mmc/host/sdhci-of-arasan.c | 7 4 files changed, 58 insertions(+), 2 deletions(-) -- 1.8.2.1 ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
[PATCH V2 1/2] mm/thp: Use new functions to clear pmd on splitting and collapse
Some arch may require an explicit IPI before a THP PMD split or collapse. This enable us to use local_irq_disable to prevent a parallel THP PMD split or collapse. Signed-off-by: Aneesh Kumar K.V aneesh.ku...@linux.vnet.ibm.com --- include/asm-generic/pgtable.h | 32 mm/huge_memory.c | 9 + 2 files changed, 37 insertions(+), 4 deletions(-) diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h index fe617b7e4be6..e95c697bef25 100644 --- a/include/asm-generic/pgtable.h +++ b/include/asm-generic/pgtable.h @@ -184,6 +184,38 @@ static inline void pmdp_set_wrprotect(struct mm_struct *mm, #endif /* CONFIG_TRANSPARENT_HUGEPAGE */ #endif +#ifndef __HAVE_ARCH_PMDP_SPLITTING_FLUSH_NOTIFY +#ifdef CONFIG_TRANSPARENT_HUGEPAGE +#define pmdp_splitting_flush_notify pmdp_clear_flush_notify +#else +static inline void pmdp_splitting_flush_notify(struct vm_area_struct *vma, + unsigned long address, + pmd_t *pmdp) +{ + BUILD_BUG(); +} +#endif /* CONFIG_TRANSPARENT_HUGEPAGE */ +#endif + +#ifndef __HAVE_ARCH_PMDP_COLLAPSE_FLUSH +#ifdef CONFIG_TRANSPARENT_HUGEPAGE +static inline pmd_t pmdp_collapse_flush(struct vm_area_struct *vma, + unsigned long address, + pmd_t *pmdp) +{ + return pmdp_clear_flush(vma, address, pmdp); +} +#else +static inline pmd_t pmdp_collapse_flush(struct vm_area_struct *vma, + unsigned long address, + pmd_t *pmdp) +{ + BUILD_BUG(); + return __pmd(0); +} +#endif /* CONFIG_TRANSPARENT_HUGEPAGE */ +#endif + #ifndef __HAVE_ARCH_PGTABLE_DEPOSIT extern void pgtable_trans_huge_deposit(struct mm_struct *mm, pmd_t *pmdp, pgtable_t pgtable); diff --git a/mm/huge_memory.c b/mm/huge_memory.c index cce4604c192f..30c1b46fcf6d 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -2187,7 +2187,7 @@ static void collapse_huge_page(struct mm_struct *mm, * huge and small TLB entries for the same virtual address * to avoid the risk of CPU bugs in that area. 
*/ - _pmd = pmdp_clear_flush(vma, address, pmd); + _pmd = pmdp_collapse_flush(vma, address, pmd); spin_unlock(pmd_ptl); mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end); @@ -2606,9 +2606,10 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd, write = pmd_write(*pmd); young = pmd_young(*pmd); - - /* leave pmd empty until pte is filled */ - pmdp_clear_flush_notify(vma, haddr, pmd); + /* +* leave pmd empty until pte is filled. +*/ + pmdp_splitting_flush_notify(vma, haddr, pmd); pgtable = pgtable_trans_huge_withdraw(mm, pmd); pmd_populate(mm, _pmd, pgtable); -- 2.1.4 ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
[PATCH V2 2/2] powerpc/thp: Remove _PAGE_SPLITTING and related code
With the new thp refcounting we don't need to mark the PMD splitting. Drop the code to handle this. Signed-off-by: Aneesh Kumar K.V aneesh.ku...@linux.vnet.ibm.com --- arch/powerpc/include/asm/kvm_book3s_64.h | 6 -- arch/powerpc/include/asm/pgtable-ppc64.h | 29 ++-- arch/powerpc/mm/hugepage-hash64.c| 3 - arch/powerpc/mm/hugetlbpage.c| 2 +- arch/powerpc/mm/pgtable_64.c | 111 --- mm/gup.c | 2 +- 6 files changed, 52 insertions(+), 101 deletions(-) diff --git a/arch/powerpc/include/asm/kvm_book3s_64.h b/arch/powerpc/include/asm/kvm_book3s_64.h index 2d81e202bdcc..9a96fe3caa48 100644 --- a/arch/powerpc/include/asm/kvm_book3s_64.h +++ b/arch/powerpc/include/asm/kvm_book3s_64.h @@ -298,12 +298,6 @@ static inline pte_t kvmppc_read_update_linux_pte(pte_t *ptep, int writing, cpu_relax(); continue; } -#ifdef CONFIG_TRANSPARENT_HUGEPAGE - /* If hugepage and is trans splitting return None */ - if (unlikely(hugepage -pmd_trans_splitting(pte_pmd(old_pte - return __pte(0); -#endif /* If pte is not present return None */ if (unlikely(!(old_pte _PAGE_PRESENT))) return __pte(0); diff --git a/arch/powerpc/include/asm/pgtable-ppc64.h b/arch/powerpc/include/asm/pgtable-ppc64.h index 843cb35e6add..655dde8e9683 100644 --- a/arch/powerpc/include/asm/pgtable-ppc64.h +++ b/arch/powerpc/include/asm/pgtable-ppc64.h @@ -361,11 +361,6 @@ void pgtable_cache_init(void); #endif /* __ASSEMBLY__ */ /* - * THP pages can't be special. So use the _PAGE_SPECIAL - */ -#define _PAGE_SPLITTING _PAGE_SPECIAL - -/* * We need to differentiate between explicit huge page and THP huge * page, since THP huge page also need to track real subpage details */ @@ -375,8 +370,7 @@ void pgtable_cache_init(void); * set of bits not changed in pmd_modify. 
*/ #define _HPAGE_CHG_MASK (PTE_RPN_MASK | _PAGE_HPTEFLAGS | \ -_PAGE_DIRTY | _PAGE_ACCESSED | _PAGE_SPLITTING | \ -_PAGE_THP_HUGE) +_PAGE_DIRTY | _PAGE_ACCESSED | _PAGE_THP_HUGE) #ifndef __ASSEMBLY__ /* @@ -458,13 +452,6 @@ static inline int pmd_trans_huge(pmd_t pmd) return (pmd_val(pmd) 0x3) (pmd_val(pmd) _PAGE_THP_HUGE); } -static inline int pmd_trans_splitting(pmd_t pmd) -{ - if (pmd_trans_huge(pmd)) - return pmd_val(pmd) _PAGE_SPLITTING; - return 0; -} - extern int has_transparent_hugepage(void); #endif /* CONFIG_TRANSPARENT_HUGEPAGE */ @@ -517,12 +504,6 @@ static inline pmd_t pmd_mknotpresent(pmd_t pmd) return pmd; } -static inline pmd_t pmd_mksplitting(pmd_t pmd) -{ - pmd_val(pmd) |= _PAGE_SPLITTING; - return pmd; -} - #define __HAVE_ARCH_PMD_SAME static inline int pmd_same(pmd_t pmd_a, pmd_t pmd_b) { @@ -577,8 +558,12 @@ static inline void pmdp_set_wrprotect(struct mm_struct *mm, unsigned long addr, pmd_hugepage_update(mm, addr, pmdp, _PAGE_RW, 0); } -#define __HAVE_ARCH_PMDP_SPLITTING_FLUSH -extern void pmdp_splitting_flush(struct vm_area_struct *vma, +#define __HAVE_ARCH_PMDP_SPLITTING_FLUSH_NOTIFY +extern void pmdp_splitting_flush_notify(struct vm_area_struct *vma, + unsigned long address, pmd_t *pmdp); + +#define __HAVE_ARCH_PMDP_COLLAPSE_FLUSH +extern pmd_t pmdp_collapse_flush(struct vm_area_struct *vma, unsigned long address, pmd_t *pmdp); #define __HAVE_ARCH_PGTABLE_DEPOSIT diff --git a/arch/powerpc/mm/hugepage-hash64.c b/arch/powerpc/mm/hugepage-hash64.c index 86686514ae13..078f7207afd2 100644 --- a/arch/powerpc/mm/hugepage-hash64.c +++ b/arch/powerpc/mm/hugepage-hash64.c @@ -39,9 +39,6 @@ int __hash_page_thp(unsigned long ea, unsigned long access, unsigned long vsid, /* If PMD busy, retry the access */ if (unlikely(old_pmd _PAGE_BUSY)) return 0; - /* If PMD is trans splitting retry the access */ - if (unlikely(old_pmd _PAGE_SPLITTING)) - return 0; /* If PMD permissions don't match, take page fault */ if (unlikely(access ~old_pmd)) return 1; diff 
--git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c index f30ae0f7f570..dfd7db0cfbee 100644 --- a/arch/powerpc/mm/hugetlbpage.c +++ b/arch/powerpc/mm/hugetlbpage.c @@ -1008,7 +1008,7 @@ pte_t *find_linux_pte_or_hugepte(pgd_t *pgdir, unsigned long ea, unsigned *shift * hpte invalidate * */ - if (pmd_none(pmd) || pmd_trans_splitting(pmd)) + if (pmd_none(pmd))
[PATCH v2] powerpc/eeh: Delay probing EEH device during hotplug
Commit ff57b454ddb9 (powerpc/eeh: Do probe on pci_dn) probes EEH devices at an early stage, which is reasonable for the pSeries platform. However, it's wrong for the PowerNV platform because in the hotplug case the PE# isn't determined until the resources (IO and MMIO) are assigned to the PE. So we have to delay probing EEH devices for the PowerNV platform until the PE# is assigned. Fixes: ff57b454ddb9 (powerpc/eeh: Do probe on pci_dn) Signed-off-by: Gavin Shan gws...@linux.vnet.ibm.com --- v2: Corrected commit ID --- arch/powerpc/kernel/eeh.c | 6 ++ 1 file changed, 6 insertions(+) diff --git a/arch/powerpc/kernel/eeh.c b/arch/powerpc/kernel/eeh.c index b798c86..04b5d94 100644 --- a/arch/powerpc/kernel/eeh.c +++ b/arch/powerpc/kernel/eeh.c @@ -1061,6 +1061,9 @@ void eeh_add_device_early(struct pci_dn *pdn) if (!edev || !eeh_enabled()) return; + if (!eeh_has_flag(EEH_PROBE_MODE_DEVTREE)) + return; + /* USB Bus children of PCI devices will not have BUID's */ phb = edev->phb; if (NULL == phb || @@ -1115,6 +1118,9 @@ void eeh_add_device_late(struct pci_dev *dev) return; } + if (eeh_has_flag(EEH_PROBE_MODE_DEV)) + eeh_ops->probe(pdn, NULL); + /* * The EEH cache might not be removed correctly because of * unbalanced kref to the device during unplug time, which -- 2.1.0
Re: [PATCH kernel v9 31/32] vfio: powerpc/spapr: Support multiple groups in one container if possible
On Thu, Apr 30, 2015 at 07:33:09PM +1000, Alexey Kardashevskiy wrote: On 04/30/2015 05:22 PM, David Gibson wrote: On Sat, Apr 25, 2015 at 10:14:55PM +1000, Alexey Kardashevskiy wrote: At the moment only one group per container is supported. POWER8 CPUs have a more flexible design and allow having 2 TCE tables per IOMMU group, so we can relax this limitation and support multiple groups per container. It's not obvious why allowing multiple TCE tables per PE has any bearing on allowing multiple groups per container. This patchset is a global TCE tables rework (patches 1..30, roughly) with 2 outcomes: 1. reusing the same IOMMU table for multiple groups - patch 31; 2. allowing dynamic create/remove of IOMMU tables - patch 32. I can remove this one from the patchset and post it separately later, but since 1..30 aim to support both 1) and 2), I think I'd better keep them all together (it might explain some of the changes I make in 1..30). The combined patchset is fine. My comment is because your commit message says that multiple groups are possible *because* 2 TCE tables per group are allowed, and it's not at all clear why one follows from the other. This adds TCE table descriptors to a container and uses iommu_table_group_ops to create/set DMA windows on IOMMU groups so the same TCE tables will be shared between several IOMMU groups.
Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru [aw: for the vfio related changes] Acked-by: Alex Williamson alex.william...@redhat.com --- Changes: v7: * updated doc --- Documentation/vfio.txt | 8 +- drivers/vfio/vfio_iommu_spapr_tce.c | 268 ++-- 2 files changed, 199 insertions(+), 77 deletions(-) diff --git a/Documentation/vfio.txt b/Documentation/vfio.txt index 94328c8..7dcf2b5 100644 --- a/Documentation/vfio.txt +++ b/Documentation/vfio.txt @@ -289,10 +289,12 @@ PPC64 sPAPR implementation note This implementation has some specifics: -1) Only one IOMMU group per container is supported as an IOMMU group -represents the minimal entity which isolation can be guaranteed for and -groups are allocated statically, one per a Partitionable Endpoint (PE) +1) On older systems (POWER7 with P5IOC2/IODA1) only one IOMMU group per +container is supported as an IOMMU table is allocated at the boot time, +one table per a IOMMU group which is a Partitionable Endpoint (PE) (PE is often a PCI domain but not always). I thought the more fundamental problem was that different PEs tended to use disjoint bus address ranges, so even by duplicating put_tce across PEs you couldn't have a common address space. Sorry, I am not following you here. By duplicating put_tce, I can have multiple IOMMU groups on the same virtual PHB in QEMU, [PATCH qemu v7 04/14] spapr_pci_vfio: Enable multiple groups per container does this, the address ranges will the same. Oh, ok. For some reason I thought that (at least on the older machines) the different PEs used different and not easily changeable DMA windows in bus addresses space. What I cannot do on p5ioc2 is programming the same table to multiple physical PHBs (or I could but it is very different than IODA2 and pretty ugly and might not always be possible because I would have to allocate these pages from some common pool and face problems like fragmentation). 
So allowing multiple groups per container should be possible (at the kernel rather than qemu level) by writing the same value to multiple TCE tables. I guess its not worth doing for just the almost-obsolete IOMMUs though. +Newer systems (POWER8 with IODA2) have improved hardware design which allows +to remove this limitation and have multiple IOMMU groups per a VFIO container. 2) The hardware supports so called DMA windows - the PCI address range within which DMA transfer is allowed, any attempt to access address space diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c index a7d6729..970e3a2 100644 --- a/drivers/vfio/vfio_iommu_spapr_tce.c +++ b/drivers/vfio/vfio_iommu_spapr_tce.c @@ -82,6 +82,11 @@ static void decrement_locked_vm(long npages) * into DMA'ble space using the IOMMU */ +struct tce_iommu_group { + struct list_head next; + struct iommu_group *grp; +}; + /* * The container descriptor supports only a single group per container. * Required by the API as the container is not supplied with the IOMMU group @@ -89,10 +94,11 @@ static void decrement_locked_vm(long npages) */ struct tce_container { struct mutex lock; - struct iommu_group *grp; bool enabled; unsigned long locked_pages; bool v2; + struct iommu_table tables[IOMMU_TABLE_GROUP_MAX_TABLES]; Hrm, so here we have more copies of the full iommu_table structures, which again muddies the lifetime. The table_group pointer is presumably meaningless in these copies, which seems dangerously
Re: [PATCH kernel v9 26/32] powerpc/iommu: Add userspace view of TCE table
On Fri, May 01, 2015 at 02:01:17PM +1000, Alexey Kardashevskiy wrote: On 04/29/2015 04:31 PM, David Gibson wrote: On Sat, Apr 25, 2015 at 10:14:50PM +1000, Alexey Kardashevskiy wrote: In order to support memory pre-registration, we need a way to track the use of every registered memory region and only allow unregistration if a region is not in use anymore. So we need a way to tell from what region the just cleared TCE was from. This adds a userspace view of the TCE table into iommu_table struct. It contains userspace address, one per TCE entry. The table is only allocated when the ownership over an IOMMU group is taken which means it is only used from outside of the powernv code (such as VFIO). Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru --- Changes: v9: * fixed code flow in error cases added in v8 v8: * added ENOMEM on failed vzalloc() --- arch/powerpc/include/asm/iommu.h | 6 ++ arch/powerpc/kernel/iommu.c | 18 ++ arch/powerpc/platforms/powernv/pci-ioda.c | 22 -- 3 files changed, 44 insertions(+), 2 deletions(-) diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h index 7694546..1472de3 100644 --- a/arch/powerpc/include/asm/iommu.h +++ b/arch/powerpc/include/asm/iommu.h @@ -111,9 +111,15 @@ struct iommu_table { unsigned long *it_map; /* A simple allocation bitmap for now */ unsigned long it_page_shift;/* table iommu page size */ struct iommu_table_group *it_table_group; + unsigned long *it_userspace; /* userspace view of the table */ A single unsigned long doesn't seem like enough. Why single? This is an array. As in single per page. How do you know which process's address space this address refers to? It is a current task. Multiple userspaces cannot use the same container/tables. Where is that enforced? More to the point, that's a VFIO constraint, but it's here affecting the design of a structure owned by the platform code. 
[snip] static void pnv_pci_ioda_setup_opal_tce_kill(struct pnv_phb *phb, @@ -2062,12 +2071,21 @@ static long pnv_pci_ioda2_create_table(struct iommu_table_group *table_group, int nid = pe-phb-hose-node; __u64 bus_offset = num ? pe-tce_bypass_base : 0; long ret; + unsigned long *uas, uas_cb = sizeof(*uas) * (window_size page_shift); + + uas = vzalloc(uas_cb); + if (!uas) + return -ENOMEM; I don't see why this is allocated both here as well as in take_ownership. Where else? The only alternative is vfio_iommu_spapr_tce but I really do not want to touch iommu_table fields there. Well to put it another way, why isn't take_ownership calling create itself (or at least a common helper). Clearly the it_userspace table needs to have lifetime which matches the TCE table itself, so there should be a single function that marks the beginning of that joint lifetime. Isn't this function used for core-kernel users of the iommu as well, in which case it shouldn't need the it_userspace. No. This is an iommu_table_group_ops callback which calls what the platform code calls (pnv_pci_create_table()) plus allocates this it_userspace thing. The callback is only called from VFIO. Ok. As touched on above it seems more like this should be owned by VFIO code than the platform code. -- David Gibson| I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson pgpdapItpnDMX.pgp Description: PGP signature ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
[PATCH V2 0/2] Remove _PAGE_SPLITTING from ppc64
The changes are on top of what is posted at http://mid.gmane.org/1429823043-157133-1-git-send-email-kirill.shute...@linux.intel.com git://git.kernel.org/pub/scm/linux/kernel/git/kas/linux.git thp/refcounting/v5 Changes from V1: * Fold part of patch 3 to 1 and 2 * Drop patch 3. * Make generic version of pmdp_splitting_flush_notify inline. Aneesh Kumar K.V (2): mm/thp: Use new functions to clear pmd on splitting and collapse powerpc/thp: Remove _PAGE_SPLITTING and related code arch/powerpc/include/asm/kvm_book3s_64.h | 6 -- arch/powerpc/include/asm/pgtable-ppc64.h | 29 ++-- arch/powerpc/mm/hugepage-hash64.c| 3 - arch/powerpc/mm/hugetlbpage.c| 2 +- arch/powerpc/mm/pgtable_64.c | 111 --- include/asm-generic/pgtable.h| 32 + mm/gup.c | 2 +- mm/huge_memory.c | 9 +-- 8 files changed, 89 insertions(+), 105 deletions(-) -- 2.1.4 ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH kernel v9 31/32] vfio: powerpc/spapr: Support multiple groups in one container if possible
On Fri, May 01, 2015 at 10:46:08AM +1000, Benjamin Herrenschmidt wrote: On Thu, 2015-04-30 at 19:33 +1000, Alexey Kardashevskiy wrote: On 04/30/2015 05:22 PM, David Gibson wrote: On Sat, Apr 25, 2015 at 10:14:55PM +1000, Alexey Kardashevskiy wrote: At the moment only one group per container is supported. POWER8 CPUs have more flexible design and allows naving 2 TCE tables per IOMMU group so we can relax this limitation and support multiple groups per container. It's not obvious why allowing multiple TCE tables per PE has any pearing on allowing multiple groups per container. This patchset is a global TCE tables rework (patches 1..30, roughly) with 2 outcomes: 1. reusing the same IOMMU table for multiple groups - patch 31; 2. allowing dynamic create/remove of IOMMU tables - patch 32. I can remove this one from the patchset and post it separately later but since 1..30 aim to support both 1) and 2), I'd think I better keep them all together (might explain some of changes I do in 1..30). I think you are talking past each other :-) But yes, having 2 tables per group is orthogonal to the ability of having multiple groups per container. The latter is made possible on P8 in large part because each PE has its own DMA address space (unlike P5IOC2 or P7IOC where a single address space is segmented). Also, on P8 you can actually make the TVT entries point to the same table in memory, thus removing the need to duplicate the actual tables (though you still have to duplicate the invalidations). I would however recommend only sharing the table that way within a chip/node. .../.. 
-1) Only one IOMMU group per container is supported as an IOMMU group -represents the minimal entity which isolation can be guaranteed for and -groups are allocated statically, one per a Partitionable Endpoint (PE) +1) On older systems (POWER7 with P5IOC2/IODA1) only one IOMMU group per +container is supported as an IOMMU table is allocated at the boot time, +one table per a IOMMU group which is a Partitionable Endpoint (PE) (PE is often a PCI domain but not always). I thought the more fundamental problem was that different PEs tended to use disjoint bus address ranges, so even by duplicating put_tce across PEs you couldn't have a common address space. Yes. This is the problem with P7IOC and earlier. It *could* be doable on P7IOC by making them the same PE but let's not go there. Sorry, I am not following you here. By duplicating put_tce, I can have multiple IOMMU groups on the same virtual PHB in QEMU, [PATCH qemu v7 04/14] spapr_pci_vfio: Enable multiple groups per container does this, the address ranges will the same. But that is only possible on P8 because only there do we have separate address spaces between PEs. What I cannot do on p5ioc2 is programming the same table to multiple physical PHBs (or I could but it is very different than IODA2 and pretty ugly and might not always be possible because I would have to allocate these pages from some common pool and face problems like fragmentation). And P7IOC has a similar issue. The DMA address top bits indexes the window on P7IOC within a shared address space. It's possible to configure a TVT to cover multiple devices but with very serious limitations. Ok. To check my understanding does this sound reasonable: * The table_group more-or-less represents a PE, but in a way you can reference without first knowing the specific IOMMU hardware type. * When attaching multiple groups to the same container, the first PE (i.e. 
table_group) attached is used as a representative so that subsequent groups can be checked for compatibility with the first PE and therefore all PEs currently included in the container - This is why the table_group appears in some places where it doesn't seem sensible from a pure object ownership point of view -- David Gibson | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson
Re: [PATCH kernel v9 27/32] powerpc/iommu/ioda2: Add get_table_size() to calculate the size of future table
On Fri, May 01, 2015 at 02:10:58PM +1000, Alexey Kardashevskiy wrote: On 04/29/2015 04:40 PM, David Gibson wrote: On Sat, Apr 25, 2015 at 10:14:51PM +1000, Alexey Kardashevskiy wrote: This adds a way for the IOMMU user to know how much a new table will use so it can be accounted in the locked_vm limit before allocation happens. This stores the allocated table size in pnv_pci_create_table() so the locked_vm counter can be updated correctly when a table is being disposed. This defines an iommu_table_group_ops callback to let VFIO know how much memory will be locked if a table is created. Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru --- Changes: v9: * reimplemented the whole patch --- arch/powerpc/include/asm/iommu.h | 5 + arch/powerpc/platforms/powernv/pci-ioda.c | 14 arch/powerpc/platforms/powernv/pci.c | 36 +++ arch/powerpc/platforms/powernv/pci.h | 2 ++ 4 files changed, 57 insertions(+) diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h index 1472de3..9844c106 100644 --- a/arch/powerpc/include/asm/iommu.h +++ b/arch/powerpc/include/asm/iommu.h @@ -99,6 +99,7 @@ struct iommu_table { unsigned long it_size; /* Size of iommu table in entries */ unsigned long it_indirect_levels; unsigned long it_level_size; + unsigned long it_allocated_size; unsigned long it_offset;/* Offset into global table */ unsigned long it_base; /* mapped address of tce table */ unsigned long it_index; /* which iommu table this is */ @@ -155,6 +156,10 @@ extern struct iommu_table *iommu_init_table(struct iommu_table * tbl, struct iommu_table_group; struct iommu_table_group_ops { + unsigned long (*get_table_size)( + __u32 page_shift, + __u64 window_size, + __u32 levels); long (*create_table)(struct iommu_table_group *table_group, int num, __u32 page_shift, diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c index e0be556..7f548b4 100644 --- a/arch/powerpc/platforms/powernv/pci-ioda.c +++ 
b/arch/powerpc/platforms/powernv/pci-ioda.c @@ -2062,6 +2062,18 @@ static void pnv_pci_ioda2_setup_bypass_pe(struct pnv_phb *phb, } #ifdef CONFIG_IOMMU_API +static unsigned long pnv_pci_ioda2_get_table_size(__u32 page_shift, + __u64 window_size, __u32 levels) +{ + unsigned long ret = pnv_get_table_size(page_shift, window_size, levels); + + if (!ret) + return ret; + + /* Add size of it_userspace */ + return ret + (window_size page_shift) * sizeof(unsigned long); This doesn't make much sense. The userspace view can't possibly be a property of the specific low-level IOMMU model. This it_userspace thing is all about memory preregistration. I need some way to track how many actual mappings the mm_iommu_table_group_mem_t has in order to decide whether to allow unregistering or not. When I clear TCE, I can read the old value which is host physical address which I cannot use to find the preregistered region and adjust the mappings counter; I can only use userspace addresses for this (not even guest physical addresses as it is VFIO and probably no KVM). So I have to keep userspace addresses somewhere, one per IOMMU page, and the iommu_table seems a natural place for this. Well.. sort of. But as noted elsewhere this pulls VFIO specific constraints into a platform code structure. And whether you get this table depends on the platform IOMMU type rather than on what VFIO wants to do with it, which doesn't make sense. What might make more sense is an opaque pointer io iommu_table for use by the table owner (in the take_ownership sense). The pointer would be stored in iommu_table, but VFIO is responsible for populating and managing its contents. Or you could just put the userspace mappings in the container. Although you might want a different data structure in that case. The other thing to bear in mind is that registered regions are likely to be large contiguous blocks in user addresses, though obviously not contiguous in physical addr. 
So you might be able to compaticfy this information by storing it as a list of variable length blocks in userspace address space, rather than a per-page address.. But.. isn't there a bigger problem here. As Paulus was pointing out, there's nothing guaranteeing the page tables continue to contain the same page as was there at gup() time. What's going to happen if you REGISTER a memory region, then mremap() over it? Then attempt to PUT_TCE a page in the region? Or what if you mremap() it to someplace else then try to PUT_TCE a page there? Or REGISTER it again in its new location? -- David Gibson| I'll have my music baroque, and my code david AT
Re: [PATCH kernel v9 29/32] vfio: powerpc/spapr: Register memory and define IOMMU v2
On Fri, May 01, 2015 at 02:35:23PM +1000, Alexey Kardashevskiy wrote: On 04/30/2015 04:55 PM, David Gibson wrote: On Sat, Apr 25, 2015 at 10:14:53PM +1000, Alexey Kardashevskiy wrote: The existing implementation accounts the whole DMA window in the locked_vm counter. This is going to be worse with multiple containers and huge DMA windows. Also, real-time accounting would requite additional tracking of accounted pages due to the page size difference - IOMMU uses 4K pages and system uses 4K or 64K pages. Another issue is that actual pages pinning/unpinning happens on every DMA map/unmap request. This does not affect the performance much now as we spend way too much time now on switching context between guest/userspace/host but this will start to matter when we add in-kernel DMA map/unmap acceleration. This introduces a new IOMMU type for SPAPR - VFIO_SPAPR_TCE_v2_IOMMU. New IOMMU deprecates VFIO_IOMMU_ENABLE/VFIO_IOMMU_DISABLE and introduces 2 new ioctls to register/unregister DMA memory - VFIO_IOMMU_SPAPR_REGISTER_MEMORY and VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY - which receive user space address and size of a memory region which needs to be pinned/unpinned and counted in locked_vm. New IOMMU splits physical pages pinning and TCE table update into 2 different operations. It requires 1) guest pages to be registered first 2) consequent map/unmap requests to work only with pre-registered memory. For the default single window case this means that the entire guest (instead of 2GB) needs to be pinned before using VFIO. When a huge DMA window is added, no additional pinning will be required, otherwise it would be guest RAM + 2GB. The new memory registration ioctls are not supported by VFIO_SPAPR_TCE_IOMMU. Dynamic DMA window and in-kernel acceleration will require memory to be preregistered in order to work. The accounting is done per the user process. This advertises v2 SPAPR TCE IOMMU and restricts what the userspace can do with v1 or v2 IOMMUs. 
Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru [aw: for the vfio related changes] Acked-by: Alex Williamson alex.william...@redhat.com --- Changes: v9: * s/tce_get_hva_cached/tce_iommu_use_page_v2/ v7: * now memory is registered per mm (i.e. process) * moved memory registration code to powerpc/mmu * merged vfio: powerpc/spapr: Define v2 IOMMU into this * limited new ioctls to v2 IOMMU * updated doc * unsupported ioclts return -ENOTTY instead of -EPERM v6: * tce_get_hva_cached() returns hva via a pointer v4: * updated docs * s/kzmalloc/vzalloc/ * in tce_pin_pages()/tce_unpin_pages() removed @vaddr, @size and replaced offset with index * renamed vfio_iommu_type_register_memory to vfio_iommu_spapr_register_memory and removed duplicating vfio_iommu_spapr_register_memory --- Documentation/vfio.txt | 23 drivers/vfio/vfio_iommu_spapr_tce.c | 230 +++- include/uapi/linux/vfio.h | 27 + 3 files changed, 274 insertions(+), 6 deletions(-) diff --git a/Documentation/vfio.txt b/Documentation/vfio.txt index 96978ec..94328c8 100644 --- a/Documentation/vfio.txt +++ b/Documentation/vfio.txt @@ -427,6 +427,29 @@ The code flow from the example above should be slightly changed: +5) There is v2 of SPAPR TCE IOMMU. It deprecates VFIO_IOMMU_ENABLE/ +VFIO_IOMMU_DISABLE and implements 2 new ioctls: +VFIO_IOMMU_SPAPR_REGISTER_MEMORY and VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY +(which are unsupported in v1 IOMMU). A summary of the semantic differeces between v1 and v2 would be nice. At this point it's not really clear to me if there's a case for creating v2, or if this could just be done by adding (optional) functionality to v1. v1: memory preregistration is not supported; explicit enable/disable ioctls are required v2: memory preregistration is required; explicit enable/disable are prohibited (as they are not needed). 
Mixing these in one IOMMU type caused a lot of problems like should I increment locked_vm by the 32bit window size on enable() or not; what do I do about pages pinning when map/map (check if it is from registered memory and do not pin?). Having 2 IOMMU models makes everything a lot simpler. Ok. Would it simplify it further if you made v2 only usable on IODA2 hardware? +PPC64 paravirtualized guests generate a lot of map/unmap requests, +and the handling of those includes pinning/unpinning pages and updating +mm::locked_vm counter to make sure we do not exceed the rlimit. +The v2 IOMMU splits accounting and pinning into separate operations: + +- VFIO_IOMMU_SPAPR_REGISTER_MEMORY/VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY ioctls +receive a user space address and size of the block to be pinned. +Bisecting is not supported and VFIO_IOMMU_UNREGISTER_MEMORY is expected to +be called with the exact address and size used for registering +the memory block. The userspace is not expected
Re: [PATCH] powerpc/cell: Drop cbe-oss-dev mailing list from MAINTAINERS
On Fri, 2015-05-01 at 11:23 +0800, Jeremy Kerr wrote: Hi Michael, Traffic on the cbe-oss-dev list is more or less non-existent, other than CC's from linuxppc. Plus all that spam that never makes it out of the moderation queue. It seems like we may as well just send everyone to linuxppc and archive the list. Acked-by: Jeremy Kerr j...@ozlabs.org [This'll get a mention at our returning from prom_init-farewell party too, right?] You bet, or hell let's just have *another* party, the cbe-oss-dead party. cheers
Re: [PATCH kernel v9 31/32] vfio: powerpc/spapr: Support multiple groups in one container if possible
On Thu, 2015-04-30 at 19:33 +1000, Alexey Kardashevskiy wrote: On 04/30/2015 05:22 PM, David Gibson wrote: On Sat, Apr 25, 2015 at 10:14:55PM +1000, Alexey Kardashevskiy wrote: At the moment only one group per container is supported. POWER8 CPUs have more flexible design and allows naving 2 TCE tables per IOMMU group so we can relax this limitation and support multiple groups per container. It's not obvious why allowing multiple TCE tables per PE has any pearing on allowing multiple groups per container. This patchset is a global TCE tables rework (patches 1..30, roughly) with 2 outcomes: 1. reusing the same IOMMU table for multiple groups - patch 31; 2. allowing dynamic create/remove of IOMMU tables - patch 32. I can remove this one from the patchset and post it separately later but since 1..30 aim to support both 1) and 2), I'd think I better keep them all together (might explain some of changes I do in 1..30). I think you are talking past each other :-) But yes, having 2 tables per group is orthogonal to the ability of having multiple groups per container. The latter is made possible on P8 in large part because each PE has its own DMA address space (unlike P5IOC2 or P7IOC where a single address space is segmented). Also, on P8 you can actually make the TVT entries point to the same table in memory, thus removing the need to duplicate the actual tables (though you still have to duplicate the invalidations). I would however recommend only sharing the table that way within a chip/node. .../.. -1) Only one IOMMU group per container is supported as an IOMMU group -represents the minimal entity which isolation can be guaranteed for and -groups are allocated statically, one per a Partitionable Endpoint (PE) +1) On older systems (POWER7 with P5IOC2/IODA1) only one IOMMU group per +container is supported as an IOMMU table is allocated at the boot time, +one table per a IOMMU group which is a Partitionable Endpoint (PE) (PE is often a PCI domain but not always). 
I thought the more fundamental problem was that different PEs tended to use disjoint bus address ranges, so even by duplicating put_tce across PEs you couldn't have a common address space. Yes. This is the problem with P7IOC and earlier. It *could* be doable on P7IOC by making them the same PE but let's not go there. Sorry, I am not following you here. By duplicating put_tce, I can have multiple IOMMU groups on the same virtual PHB in QEMU, [PATCH qemu v7 04/14] spapr_pci_vfio: Enable multiple groups per container does this, the address ranges will the same. But that is only possible on P8 because only there do we have separate address spaces between PEs. What I cannot do on p5ioc2 is programming the same table to multiple physical PHBs (or I could but it is very different than IODA2 and pretty ugly and might not always be possible because I would have to allocate these pages from some common pool and face problems like fragmentation). And P7IOC has a similar issue. The DMA address top bits indexes the window on P7IOC within a shared address space. It's possible to configure a TVT to cover multiple devices but with very serious limitations. +Newer systems (POWER8 with IODA2) have improved hardware design which allows +to remove this limitation and have multiple IOMMU groups per a VFIO container. 2) The hardware supports so called DMA windows - the PCI address range within which DMA transfer is allowed, any attempt to access address space diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c index a7d6729..970e3a2 100644 --- a/drivers/vfio/vfio_iommu_spapr_tce.c +++ b/drivers/vfio/vfio_iommu_spapr_tce.c @@ -82,6 +82,11 @@ static void decrement_locked_vm(long npages) * into DMA'ble space using the IOMMU */ +struct tce_iommu_group { + struct list_head next; + struct iommu_group *grp; +}; + /* * The container descriptor supports only a single group per container. 
* Required by the API as the container is not supplied with the IOMMU group @@ -89,10 +94,11 @@ static void decrement_locked_vm(long npages) */ struct tce_container { struct mutex lock; - struct iommu_group *grp; bool enabled; unsigned long locked_pages; bool v2; + struct iommu_table tables[IOMMU_TABLE_GROUP_MAX_TABLES]; Hrm, so here we have more copies of the full iommu_table structures, which again muddies the lifetime. The table_group pointer is presumably meaningless in these copies, which seems dangerously confusing. Ouch. This is bad. No, table_group is not pointless here as it is used to get to the PE number to invalidate TCE cache. I just realized although I need to update just a single table, I still have to invalidate TCE cache for every attached group/PE so I need a list of iommu_table_group's here, not a single pointer... +
[PATCH] powerpc/cell: Drop cbe-oss-dev mailing list from MAINTAINERS
Traffic on the cbe-oss-dev list is more or less non-existent, other than CC's from linuxppc. It seems like we may as well just send everyone to linuxppc and archive the list. Signed-off-by: Michael Ellerman m...@ellerman.id.au --- MAINTAINERS | 7 ++- 1 file changed, 2 insertions(+), 5 deletions(-) diff --git a/MAINTAINERS b/MAINTAINERS index 2e5bbc0d68b2..b8e038ca0d26 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -2433,7 +2433,6 @@ F: Documentation/devicetree/bindings/net/ieee802154/cc2520.txt CELL BROADBAND ENGINE ARCHITECTURE M: Arnd Bergmann a...@arndb.de L: linuxppc-dev@lists.ozlabs.org -L: cbe-oss-...@lists.ozlabs.org W: http://www.ibm.com/developerworks/power/cell/ S: Supported F: arch/powerpc/include/asm/cell*.h @@ -7824,14 +7823,13 @@ F: drivers/net/wireless/prism54/ PS3 NETWORK SUPPORT M: Geoff Levand ge...@infradead.org L: net...@vger.kernel.org -L: cbe-oss-...@lists.ozlabs.org +L: linuxppc-dev@lists.ozlabs.org S: Maintained F: drivers/net/ethernet/toshiba/ps3_gelic_net.* PS3 PLATFORM SUPPORT M: Geoff Levand ge...@infradead.org L: linuxppc-dev@lists.ozlabs.org -L: cbe-oss-...@lists.ozlabs.org S: Maintained F: arch/powerpc/boot/ps3* F: arch/powerpc/include/asm/lv1call.h @@ -7845,7 +7843,7 @@ F:sound/ppc/snd_ps3* PS3VRAM DRIVER M: Jim Paris j...@jtan.com -L: cbe-oss-...@lists.ozlabs.org +L: linuxppc-dev@lists.ozlabs.org S: Maintained F: drivers/block/ps3vram.c @@ -9321,7 +9319,6 @@ F:drivers/net/ethernet/toshiba/spider_net* SPU FILE SYSTEM M: Jeremy Kerr j...@ozlabs.org L: linuxppc-dev@lists.ozlabs.org -L: cbe-oss-...@lists.ozlabs.org W: http://www.ibm.com/developerworks/power/cell/ S: Supported F: Documentation/filesystems/spufs.txt -- 2.1.0
Re: [PATCH] powerpc/powernv: Add opal-prd channel
Hi Ben, +static LIST_HEAD(opal_prd_msg_queue); +static DEFINE_SPINLOCK(opal_prd_msg_queue_lock); +static DECLARE_WAIT_QUEUE_HEAD(opal_prd_msg_wait); +static atomic_t usage; opal_prd_usage ... otherwise it's a mess in the symbols map OK, I'll change this. Also why limit the number of opens ? we might want to have tools using the opal prd for xscom :-) (in absence of debugfs). .. as long as not two people read() it should be ok. Or a tool to dump the regions etc... I don't see any reason to block multiple open's. Simplicity, really. We can do a get exclusive, but there's no (current) use-case for multiple openers on a PRD interface. Pulling this thread a little, you've hit on a key decision point of the prd design - I see there being two directions we could take with this: 1) This interface is specifically for PRD functions, or 2) This interface is a generic userspace interface to OPAL, and PRD is a subset of that. I've been aiming for (1) with the current code; and the nature of the generic read() write() operations being PRD-specific enforces that. Allowing multiple openers will help with (2), but if we want to go in that direction, I think we'd be better off doing a couple of other changes too: * move the general functions (eg xscom, range mappings, OCC control) to a separate interface that isn't tied to PRD - say just /dev/opal * using this prd code for only the prd-event handling, possibly renamed to /dev/opal-prd-events. This would still need some method of enforcing exclusive access. In this case, the actual PRD application would use both devices, dequeueing events (and updating the ipoll mask) from the latter, and using the former for helper functionality. Other tools (eg generic xscom access) would just use the generic interface, and not the PRD one, which wouldn't enforce exclusive access. Regardless of the choice here, we could also remove the single-open exclusion, and shift that responsibility to userspace (eg, flock() on the PRD device node?). 
The main reason for the exclusion is to prevent multiple prd daemons running, which may get messy when updating the ipoll mask. Should we rely exclusively on userspace setting the right permissions or should we check CAP_SYSADMIN here ? I'm okay with relying on userspace, is there any reason not to? +vma-vm_page_prot = phys_mem_access_prot(file, vma-vm_pgoff, + size, vma-vm_page_prot) +| _PAGE_SPECIAL; + +rc = remap_pfn_range(vma, vma-vm_start, vma-vm_pgoff, size, +vma-vm_page_prot); Do we still have the warnings of process exist about the map count or is that fixed ? No, not fixed at present. I'll need to chat to you about that. +case OPAL_PRD_SCOM_READ: +rc = copy_from_user(scom, (void __user *)param, sizeof(scom)); +if (rc) +return -EFAULT; + +rc = opal_xscom_read(scom.chip, scom.addr, +(__be64 *)scom.data); Are we exporting these for modules ? No, but opal-prd isn't configurable as a module at the moment. +scom.data = be64_to_cpu(scom.data); +pr_debug(ioctl SCOM_READ: chip %llx addr %016llx +data %016llx rc %d\n, +scom.chip, scom.addr, scom.data, rc); pr_devel ? This removes the possibility of CONFIG_DYNAMIC_DEBUG, is that intentional? +if (rc) +return -EIO; Should we consider returning more info about the SCOM error ? HBRT might actually need that... Maybe opal_prd_scom needs a field for the OPAL rc which is currently not very descriptive but that's fixable. Sounds good, I'll add that in. On error, we'll return -EIO and have the OPAL error code in the struct for further detail. +nr_ranges = of_property_count_strings(np, reserved-names); +ranges_prop = of_get_property(np, reserved-ranges, NULL); +if (!ranges_prop) { +of_node_put(np); +return -ENODEV; +} Didn't we say we had a problem with using those properties due to coalescing ? Shouldn't we define specific ones for the HBRT regions ? There's not a problem at the moment, but one day we will need to expand the PRD's get_reserved_mem interface to allow per-chip ranges. 
This would use a different device-tree representation. However, I think it'd be better to remove this code entirely (ie, remove the range member of struct opal_prd_info), and require userspace to do the device-tree parsing. +static int __init opal_prd_init(void) +{ +int rc; + +/* parse the code region information from the device tree */ +rc = parse_regions(); +if (rc) { +pr_err(Couldn't parse region information from DT\n); +return rc; +} Should we create a virtual device under the OPAL node in FW so we have something to attach to ?
Re: [PATCH kernel v9 26/32] powerpc/iommu: Add userspace view of TCE table
On 04/29/2015 04:31 PM, David Gibson wrote: On Sat, Apr 25, 2015 at 10:14:50PM +1000, Alexey Kardashevskiy wrote: In order to support memory pre-registration, we need a way to track the use of every registered memory region and only allow unregistration if a region is not in use anymore. So we need a way to tell from what region the just cleared TCE was from. This adds a userspace view of the TCE table into iommu_table struct. It contains userspace address, one per TCE entry. The table is only allocated when the ownership over an IOMMU group is taken which means it is only used from outside of the powernv code (such as VFIO). Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru --- Changes: v9: * fixed code flow in error cases added in v8 v8: * added ENOMEM on failed vzalloc() --- arch/powerpc/include/asm/iommu.h | 6 ++ arch/powerpc/kernel/iommu.c | 18 ++ arch/powerpc/platforms/powernv/pci-ioda.c | 22 -- 3 files changed, 44 insertions(+), 2 deletions(-) diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h index 7694546..1472de3 100644 --- a/arch/powerpc/include/asm/iommu.h +++ b/arch/powerpc/include/asm/iommu.h @@ -111,9 +111,15 @@ struct iommu_table { unsigned long *it_map; /* A simple allocation bitmap for now */ unsigned long it_page_shift;/* table iommu page size */ struct iommu_table_group *it_table_group; + unsigned long *it_userspace; /* userspace view of the table */ A single unsigned long doesn't seem like enough. Why single? This is an array. How do you know which process's address space this address refers to? It is a current task. Multiple userspaces cannot use the same container/tables. struct iommu_table_ops *it_ops; }; +#define IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry) \ + ((tbl)-it_userspace ? 
\ + ((tbl)-it_userspace[(entry) - (tbl)-it_offset]) : \ + NULL) + /* Pure 2^n version of get_order */ static inline __attribute_const__ int get_iommu_order(unsigned long size, struct iommu_table *tbl) diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c index 2eaba0c..74a3f52 100644 --- a/arch/powerpc/kernel/iommu.c +++ b/arch/powerpc/kernel/iommu.c @@ -38,6 +38,7 @@ #include linux/pci.h #include linux/iommu.h #include linux/sched.h +#include linux/vmalloc.h #include asm/io.h #include asm/prom.h #include asm/iommu.h @@ -739,6 +740,8 @@ void iommu_reset_table(struct iommu_table *tbl, const char *node_name) free_pages((unsigned long) tbl-it_map, order); } + WARN_ON(tbl-it_userspace); + memset(tbl, 0, sizeof(*tbl)); } @@ -1016,6 +1019,7 @@ int iommu_take_ownership(struct iommu_table *tbl) { unsigned long flags, i, sz = (tbl-it_size + 7) 3; int ret = 0; + unsigned long *uas; /* * VFIO does not control TCE entries allocation and the guest @@ -1027,6 +1031,10 @@ int iommu_take_ownership(struct iommu_table *tbl) if (!tbl-it_ops-exchange) return -EINVAL; + uas = vzalloc(sizeof(*uas) * tbl-it_size); + if (!uas) + return -ENOMEM; + spin_lock_irqsave(tbl-large_pool.lock, flags); for (i = 0; i tbl-nr_pools; i++) spin_lock(tbl-pools[i].lock); @@ -1044,6 +1052,13 @@ int iommu_take_ownership(struct iommu_table *tbl) memset(tbl-it_map, 0xff, sz); } + if (ret) { + vfree(uas); + } else { + BUG_ON(tbl-it_userspace); + tbl-it_userspace = uas; + } + for (i = 0; i tbl-nr_pools; i++) spin_unlock(tbl-pools[i].lock); spin_unlock_irqrestore(tbl-large_pool.lock, flags); @@ -1056,6 +1071,9 @@ void iommu_release_ownership(struct iommu_table *tbl) { unsigned long flags, i, sz = (tbl-it_size + 7) 3; + vfree(tbl-it_userspace); + tbl-it_userspace = NULL; + spin_lock_irqsave(tbl-large_pool.lock, flags); for (i = 0; i tbl-nr_pools; i++) spin_lock(tbl-pools[i].lock); diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c index 
45bc131..e0be556 100644 --- a/arch/powerpc/platforms/powernv/pci-ioda.c +++ b/arch/powerpc/platforms/powernv/pci-ioda.c @@ -25,6 +25,7 @@ #include linux/memblock.h #include linux/iommu.h #include linux/sizes.h +#include linux/vmalloc.h #include asm/sections.h #include asm/io.h @@ -1827,6 +1828,14 @@ static void pnv_ioda2_tce_free(struct iommu_table *tbl, long index, pnv_pci_ioda2_tce_invalidate(tbl, index, npages, false); } +void pnv_pci_ioda2_free_table(struct iommu_table *tbl) +{ + vfree(tbl-it_userspace); + tbl-it_userspace = NULL; + + pnv_pci_free_table(tbl); +} + static struct iommu_table_ops pnv_ioda2_iommu_ops = {
Re: [PATCH] powerpc/eeh: Delay probing EEH device during hotplug
On Fri, May 01, 2015 at 09:50:57AM +1000, Michael Ellerman wrote: On Fri, 2015-05-01 at 09:22 +1000, Gavin Shan wrote: Commit 1c509148b (powerpc/eeh: Do probe on pci_dn) probes EEH devices in early stage, which is reasonable for the pSeries platform. However, it's wrong for PowerNV platform because the PE# isn't determined until the resources (IO and MMIO) are assigned to PE in hotplug case. So we have to delay probing EEH devices for PowerNV platform until the PE# is assigned. Fixes: 1c509148b (powerpc/eeh: Do probe on pci_dn) I don't have that SHA. I have a commit with the same subject which is ff57b454ddb9. Is that the same thing? Yes, they're the same thing. I also have 89a51df5ab1d, which Fixes that commit (ff57..). There seems to be something wrong locally because git show 1c509148b leads me to the patch as well. However, I believe ff57b454ddb9 is the correct one. I'll send v2 to fix it up. --- gwshan@gwshan:~/sandbox/linux$ git show 1c509148b 1 commit 1c509148bd6b5199dc1d97e146eda496f9f22a06 2 Author: Gavin Shan gws...@linux.vnet.ibm.com 3 Date: Tue Mar 17 10:49:45 2015 +1100 4 5 powerpc/eeh: Do probe on pci_dn gwshan@gwshan:~/sandbox/linux$ git show ff57b454ddb9 1 commit ff57b454ddb938d98d48d8df356357000fedc88c 2 Author: Gavin Shan gws...@linux.vnet.ibm.com 3 Date: Tue Mar 17 16:15:06 2015 +1100 4 5 powerpc/eeh: Do probe on pci_dn Thanks, Gavin cheers
Re: [PATCH] powerpc/cell: Drop cbe-oss-dev mailing list from MAINTAINERS
Hi Michael, Traffic on the cbe-oss-dev list is more or less non-existent, other than CC's from linuxppc. Plus all that spam that never makes it out of the moderation queue. It seems like we may as well just send everyone to linuxppc and archive the list. Acked-by: Jeremy Kerr j...@ozlabs.org [This'll get a mention at our returning from prom_init-farewell party too, right?] Cheers, Jeremy
Re: [PATCH] powerpc/eeh: Delay probing EEH device during hotplug
On Fri, 2015-05-01 at 11:28 +1000, Gavin Shan wrote: On Fri, May 01, 2015 at 09:50:57AM +1000, Michael Ellerman wrote: On Fri, 2015-05-01 at 09:22 +1000, Gavin Shan wrote: Commit 1c509148b (powerpc/eeh: Do probe on pci_dn) probes EEH devices in early stage, which is reasonable for the pSeries platform. However, it's wrong for PowerNV platform because the PE# isn't determined until the resources (IO and MMIO) are assigned to PE in hotplug case. So we have to delay probing EEH devices for PowerNV platform until the PE# is assigned. Fixes: 1c509148b (powerpc/eeh: Do probe on pci_dn) I don't have that SHA. I have a commit with the same subject which is ff57b454ddb9. Is that the same thing? Yes, they're the same thing. OK. There seems to be something wrong locally because git show 1c509148b leads me to the patch as well. However, I believe ff57b454ddb9 is the correct one. I'll send v2 to fix it up. You just have it in your tree from when you were developing it I guess? If you do: $ git show --format=fuller 1c509148b It should show you the committer etc. cheers
Re: [PATCH] powerpc/eeh: Delay probing EEH device during hotplug
On Fri, May 01, 2015 at 01:51:37PM +1000, Michael Ellerman wrote: On Fri, 2015-05-01 at 11:28 +1000, Gavin Shan wrote: On Fri, May 01, 2015 at 09:50:57AM +1000, Michael Ellerman wrote: On Fri, 2015-05-01 at 09:22 +1000, Gavin Shan wrote: Commit 1c509148b (powerpc/eeh: Do probe on pci_dn) probes EEH devices in early stage, which is reasonable for the pSeries platform. However, it's wrong for PowerNV platform because the PE# isn't determined until the resources (IO and MMIO) are assigned to PE in hotplug case. So we have to delay probing EEH devices for PowerNV platform until the PE# is assigned. Fixes: 1c509148b (powerpc/eeh: Do probe on pci_dn) I don't have that SHA. I have a commit with the same subject which is ff57b454ddb9. Is that the same thing? Yes, they're the same thing. OK. There seems to be something wrong locally because git show 1c509148b leads me to the patch as well. However, I believe ff57b454ddb9 is the correct one. I'll send v2 to fix it up. You just have it in your tree from when you were developing it I guess? If you do: $ git show --format=fuller 1c509148b It should show you the committer etc. Yeah, it's true. The committer is myself for commit 1c509148b. Thanks, Gavin
Re: [PATCH kernel v9 23/32] powerpc/powernv/ioda: Define and implement DMA table/window management callbacks
On Thu, Apr 30, 2015 at 07:56:17PM +1000, Alexey Kardashevskiy wrote:
> On 04/30/2015 02:37 PM, David Gibson wrote:
> > On Wed, Apr 29, 2015 at 07:44:20PM +1000, Alexey Kardashevskiy wrote:
> > > On 04/29/2015 03:30 PM, David Gibson wrote:
> > > > On Sat, Apr 25, 2015 at 10:14:47PM +1000, Alexey Kardashevskiy wrote:
> > > > > This extends iommu_table_group_ops by a set of callbacks to support
> > > > > dynamic DMA windows management.
> > > > >
> > > > > create_table() creates a TCE table with specific parameters. It
> > > > > receives iommu_table_group to know the node ID in order to allocate
> > > > > TCE table memory closer to the PHB. The exact format of the allocated
> > > > > multi-level table might also be specific to the PHB model (not the
> > > > > case now though). This callback calculates the DMA window offset on
> > > > > a PCI bus from @num and stores it in the just-created table.
> > > > >
> > > > > set_window() sets the window at the specified TVT index + @num on
> > > > > the PHB.
> > > > >
> > > > > unset_window() unsets the window from the specified TVT.
> > > > >
> > > > > This adds a free() callback to iommu_table_ops to free the memory
> > > > > (potentially a tree of tables) allocated for the TCE table.
> > > >
> > > > Doesn't the free callback belong with the previous patch introducing
> > > > multi-level tables?
> > >
> > > If I did that, you would say "why is it here if nothing calls it in the
> > > multilevel patch" and "I see the allocation but I do not see the memory
> > > release" ;)
> >
> > Yeah, fair enough ;)
>
> I need some rule of thumb here. I think it is a bit cleaner if the same
> patch adds a callback for memory allocation and its counterpart, no?

On further consideration, yes, I think you're right.

> create_table() and free() are supposed to be called once per VFIO
> container, and set_window()/unset_window() are supposed to be called for
> every group in a container.
>
> > > > > This adds IOMMU capabilities to iommu_table_group such as the
> > > > > default 32-bit window parameters and others.
Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru
---
 arch/powerpc/include/asm/iommu.h            | 19 ++++
 arch/powerpc/platforms/powernv/pci-ioda.c   | 75 ++++++++++++---
 arch/powerpc/platforms/powernv/pci-p5ioc2.c | 12 +++--
 3 files changed, 96 insertions(+), 10 deletions(-)

diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
index 0f50ee2..7694546 100644
--- a/arch/powerpc/include/asm/iommu.h
+++ b/arch/powerpc/include/asm/iommu.h
@@ -70,6 +70,7 @@ struct iommu_table_ops {
 	/* get() returns a physical address */
 	unsigned long (*get)(struct iommu_table *tbl, long index);
 	void (*flush)(struct iommu_table *tbl);
+	void (*free)(struct iommu_table *tbl);
 };

 /* These are used by VIO */
@@ -148,6 +149,17 @@ extern struct iommu_table *iommu_init_table(struct iommu_table * tbl,
 struct iommu_table_group;

 struct iommu_table_group_ops {
+	long (*create_table)(struct iommu_table_group *table_group,
+			int num,
+			__u32 page_shift,
+			__u64 window_size,
+			__u32 levels,
+			struct iommu_table *tbl);
+	long (*set_window)(struct iommu_table_group *table_group,
+			int num,
+			struct iommu_table *tblnew);
+	long (*unset_window)(struct iommu_table_group *table_group,
+			int num);
 	/*
 	 * Switches ownership from the kernel itself to an external
 	 * user. While ownership is taken, the kernel cannot use IOMMU itself.
@@ -160,6 +172,13 @@ struct iommu_table_group {
 #ifdef CONFIG_IOMMU_API
 	struct iommu_group *group;
 #endif
+	/* Some key properties of IOMMU */
+	__u32 tce32_start;
+	__u32 tce32_size;
+	__u64 pgsizes; /* Bitmap of supported page sizes */
+	__u32 max_dynamic_windows_supported;
+	__u32 max_levels;

With this information, table_group seems even more like a bad name.
iommu_state maybe?

Please, no. We will never come to agreement then :( And iommu_state is too
general anyway, it won't pass.
 	struct iommu_table tables[IOMMU_TABLE_GROUP_MAX_TABLES];
 	struct iommu_table_group_ops *ops;
 };

diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index cc1d09c..4828837 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -24,6 +24,7 @@
 #include <linux/msi.h>
 #include <linux/memblock.h>
 #include <linux/iommu.h>
+#include <linux/sizes.h>

 #include <asm/sections.h>
 #include <asm/io.h>
@@ -1846,6 +1847,7 @@ static struct iommu_table_ops pnv_ioda2_iommu_ops = {
 #endif
 	.clear = pnv_ioda2_tce_free,
 	.get = pnv_tce_get,
+	.free = pnv_pci_free_table,
 };

 static void pnv_pci_ioda_setup_opal_tce_kill(struct pnv_phb *phb,
@@ -1936,6 +1938,8 @@ static void pnv_pci_ioda_setup_dma_pe(struct pnv_phb *phb,
 				 TCE_PCI_SWINV_PAIR);

 	tbl->it_ops = &pnv_ioda1_iommu_ops;
+	pe->table_group.tce32_start = tbl->it_offset << tbl->it_page_shift;
+
Re: [PATCH kernel v9 28/32] powerpc/mmu: Add userspace-to-physical addresses translation cache
On Thu, Apr 30, 2015 at 06:25:25PM +1000, Paul Mackerras wrote:
> On Thu, Apr 30, 2015 at 04:34:55PM +1000, David Gibson wrote:
> > On Sat, Apr 25, 2015 at 10:14:52PM +1000, Alexey Kardashevskiy wrote:
> > > We are adding support for DMA memory pre-registration to be used in
> > > conjunction with VFIO. The idea is that the userspace which is going to
> > > run a guest may want to pre-register a user space memory region so it
> > > all gets pinned once and never goes away. Having this done, a
> > > hypervisor will not have to pin/unpin pages on every DMA map/unmap
> > > request. This is going to help with multiple pinning of the same memory
> > > and in-kernel acceleration of DMA requests.
> > >
> > > This adds a list of memory regions to mm_context_t. Each region
> > > consists of a header and a list of physical addresses. This adds an API
> > > to:
> > > 1. register/unregister memory regions;
> > > 2. do final cleanup (which puts all pre-registered pages);
> > > 3. do userspace to physical address translation;
> > > 4. manage a mapped pages counter; when it is zero, it is safe to
> > >    unregister the region.
> > >
> > > Multiple registration of the same region is allowed, kref is used to
> > > track the number of registrations.
> >
> > [snip]
> >
> > > +long mm_iommu_alloc(unsigned long ua, unsigned long entries,
> > > +		struct mm_iommu_table_group_mem_t **pmem)
> > > +{
> > > +	struct mm_iommu_table_group_mem_t *mem;
> > > +	long i, j;
> > > +	struct page *page = NULL;
> > > +
> > > +	list_for_each_entry_rcu(mem, &current->mm->context.iommu_group_mem_list,
> > > +			next) {
> > > +		if ((mem->ua == ua) && (mem->entries == entries))
> > > +			return -EBUSY;
> > > +
> > > +		/* Overlap? */
> > > +		if ((mem->ua < (ua + (entries << PAGE_SHIFT))) &&
> > > +				(ua < (mem->ua + (mem->entries << PAGE_SHIFT))))
> > > +			return -EINVAL;
> > > +	}
> > > +
> > > +	mem = kzalloc(sizeof(*mem), GFP_KERNEL);
> > > +	if (!mem)
> > > +		return -ENOMEM;
> > > +
> > > +	mem->hpas = vzalloc(entries * sizeof(mem->hpas[0]));
> > > +	if (!mem->hpas) {
> > > +		kfree(mem);
> > > +		return -ENOMEM;
> > > +	}
> >
> > So, I've thought more about this and I'm really confused as to what
> > this is supposed to be accomplishing.
> > I see that you need to keep track of what regions are registered, so
> > you don't double lock or unlock, but I don't see what the point of
> > actually storing the translations in hpas is.
> >
> > I had assumed it was so that you could later on get to the
> > translations in real mode when you do in-kernel acceleration. But that
> > doesn't make sense, because the array is vmalloc()ed, so can't be
> > accessed in real mode anyway.
>
> We can access vmalloc'd arrays in real mode using real_vmalloc_addr().

Ah, ok.

> > I can't think of a circumstance in which you can use hpas where you
> > couldn't just walk the page tables anyway.
>
> The problem with walking the page tables is that there is no guarantee
> that the page you find that way is the page that was returned by the
> gup_fast() we did earlier. Storing the hpas means that we know for sure
> that the page we're doing DMA to is one that we have an elevated page
> count on.
>
> Also, there are various points where a Linux PTE is made temporarily
> invalid for a short time. If we happened to do an H_PUT_TCE on one cpu
> while another cpu was doing that, we'd get a spurious failure returned
> by the H_PUT_TCE.

I think we want this explanation in the commit message. And/or in a
comment somewhere, I'm not sure.

-- 
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_! http://www.ozlabs.org/~dgibson
Re: [PATCH V2 1/8] powerpc/powernv: Add a virtual irqchip for opal events
Hi Alistair,

With all the patches applied on top of 'v4.0-rc7', I see this issue during
the boot itself: http://pastebin.hursley.ibm.com/918

Few compile warnings and minor comments.

drivers/tty/hvc/hvc_opal.c: In function ‘hvc_opal_probe’:
drivers/tty/hvc/hvc_opal.c:174:6: warning: unused variable ‘rc’ [-Wunused-variable]
  int rc;
      ^
drivers/tty/hvc/hvc_opal.c: At top level:
drivers/tty/hvc/hvc_opal.c:65:13: warning: ‘hvc_opal_event_registered’ defined but not used [-Wunused-variable]
 static bool hvc_opal_event_registered;

On 04/10/2015 01:54 PM, Alistair Popple wrote:
> Whenever an interrupt is received for opal the linux kernel gets a
> bitfield indicating certain events that have occurred and need handling
> by the various device drivers. Currently this is handled using a notifier
> interface where we call every device driver that has registered to
> receive opal events.
>
> This approach has several drawbacks. For example each driver has to do
> its own checking to see if the event is relevant as well as event
> masking. There is also no easy method of recording the number of times we
> receive particular events.
>
> This patch solves these issues by exposing opal events via the standard
> interrupt APIs by adding a new interrupt chip and domain. Drivers can
> then register for the appropriate events using standard kernel calls such
> as irq_of_parse_and_map().
>
> Signed-off-by: Alistair Popple alist...@popple.id.au
> ---
>
> +static int __init opal_event_init(void)
> +{
> +	struct device_node *dn, *opal_node;
> +	const __be32 *irqs;
> +	int i, irqlen;
> +
> +	opal_node = of_find_node_by_path("/ibm,opal");
> +	if (!opal_node) {
> +		pr_warn("opal: Node not found\n");
> +		return -ENODEV;
> +	}
> +
> +	dn = of_find_compatible_node(NULL, NULL, "ibm,opal-event");
> +
> +	/* If dn is NULL it means the domain won't be linked to a DT
> +	 * node so therefore irq_of_parse_and_map(...) won't work. But
> +	 * that shouldn't be a problem because if we're running a
> +	 * version of skiboot that doesn't have the dn then the
> +	 * devices won't have the correct properties and will have to
> +	 * fall back to the legacy method (opal_event_request(...))
> +	 * anyway. */
> +	opal_event_irqchip.domain =
> +		irq_domain_add_linear(dn, 64, &opal_event_domain_ops,

A macro would be better, which is the maximum event bits we have.

> +			&opal_event_irqchip);
> +	if (IS_ERR(opal_event_irqchip.domain)) {
> +		pr_warn("opal: Unable to create irq domain\n");
> +		return PTR_ERR(opal_event_irqchip.domain);
> +	}
> +
> +	/* Get interrupt property */
> +	irqs = of_get_property(opal_node, "opal-interrupts", &irqlen);

of_node_put(): should decrement the refcount of the nodes 'opal_node' and
'dn' (if !NULL) before returning from the function.

> +	opal_irq_count = irqs ? (irqlen / 4) : 0;
> +	pr_debug("Found %d interrupts reserved for OPAL\n", opal_irq_count);
> +
> +	/* Install interrupt handlers */
> +	opal_irqs = kcalloc(opal_irq_count, sizeof(unsigned int), GFP_KERNEL);

Safer to use 'sizeof(*opal_irqs)'.

Neelesh.

> +	for (i = 0; irqs && i < opal_irq_count; i++, irqs++) {
> +		unsigned int irq, virq;
> +		int rc;
> +
> +		/* Get hardware and virtual IRQ */
> +		irq = be32_to_cpup(irqs);
> +		virq = irq_create_mapping(NULL, irq);
> +		if (virq == NO_IRQ) {
> +			pr_warn("Failed to map irq 0x%x\n", irq);
> +			continue;
> +		}
> +
> +		/* Install interrupt handler */
> +		rc = request_irq(virq, opal_interrupt, 0, "opal", NULL);
> +		if (rc) {
> +			irq_dispose_mapping(virq);
> +			pr_warn("Error %d requesting irq %d (0x%x)\n",
> +				 rc, virq, irq);
> +			continue;
> +		}
> +
> +		/* Cache IRQ */
> +		opal_irqs[i] = virq;
> +	}
> +
> +	return 0;
> +}
> +machine_core_initcall(powernv, opal_event_init);
[PATCH] powerpc/eeh: Fix race condition in pcibios_set_pcie_reset_state()
When asserting reset in pcibios_set_pcie_reset_state(), the PE is forced
into the (hardware) frozen state so that unexpected PCI transactions
(except PCI config read/write) are dropped automatically by hardware during
reset, which would otherwise cause recursive EEH errors. However, the
(software) frozen state EEH_PE_ISOLATED is missed. When users get 0xFF from
a PCI config or MMIO read, EEH_PE_ISOLATED is set in the PE state retrieval
backend. Unfortunately, nobody (neither the reset handler nor the EEH
recovery functionality in the host) will clear EEH_PE_ISOLATED when the PE
has been passed through to a guest.

The patch sets and clears EEH_PE_ISOLATED properly during reset in function
pcibios_set_pcie_reset_state() to fix the issue.

Fixes: 28158cd ("Enhance pcibios_set_pcie_reset_state()")
Reported-by: Carol L. Soto cls...@us.ibm.com
Signed-off-by: Gavin Shan gws...@linux.vnet.ibm.com
Tested-by: Carol L. Soto cls...@us.ibm.com
---
 arch/powerpc/kernel/eeh.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/kernel/eeh.c b/arch/powerpc/kernel/eeh.c
index 65f38d2..b798c86 100644
--- a/arch/powerpc/kernel/eeh.c
+++ b/arch/powerpc/kernel/eeh.c
@@ -749,21 +749,24 @@ int pcibios_set_pcie_reset_state(struct pci_dev *dev, enum pcie_reset_state stat
 		eeh_unfreeze_pe(pe, false);
 		eeh_pe_state_clear(pe, EEH_PE_CFG_BLOCKED);
 		eeh_pe_dev_traverse(pe, eeh_restore_dev_state, dev);
+		eeh_pe_state_clear(pe, EEH_PE_ISOLATED);
 		break;
 	case pcie_hot_reset:
+		eeh_pe_state_mark(pe, EEH_PE_ISOLATED);
 		eeh_ops->set_option(pe, EEH_OPT_FREEZE_PE);
 		eeh_pe_dev_traverse(pe, eeh_disable_and_save_dev_state, dev);
 		eeh_pe_state_mark(pe, EEH_PE_CFG_BLOCKED);
 		eeh_ops->reset(pe, EEH_RESET_HOT);
 		break;
 	case pcie_warm_reset:
+		eeh_pe_state_mark(pe, EEH_PE_ISOLATED);
 		eeh_ops->set_option(pe, EEH_OPT_FREEZE_PE);
 		eeh_pe_dev_traverse(pe, eeh_disable_and_save_dev_state, dev);
 		eeh_pe_state_mark(pe, EEH_PE_CFG_BLOCKED);
 		eeh_ops->reset(pe, EEH_RESET_FUNDAMENTAL);
 		break;
 	default:
-		eeh_pe_state_clear(pe, EEH_PE_CFG_BLOCKED);
+		eeh_pe_state_clear(pe, EEH_PE_ISOLATED | EEH_PE_CFG_BLOCKED);
 		return -EINVAL;
 	};
-- 
2.1.0
Re: [PATCH] powerpc/eeh: Delay probing EEH device during hotplug
On Fri, 2015-05-01 at 09:22 +1000, Gavin Shan wrote:
> Commit 1c509148b ("powerpc/eeh: Do probe on pci_dn") probes EEH devices in an
> early stage, which is reasonable for the pSeries platform. However, it's wrong
> for the PowerNV platform because the PE# isn't determined until the resources
> (IO and MMIO) are assigned to the PE in the hotplug case. So we have to delay
> probing EEH devices on the PowerNV platform until the PE# is assigned.
>
> Fixes: 1c509148b ("powerpc/eeh: Do probe on pci_dn")

I don't have that SHA. I have a commit with the same subject which is
ff57b454ddb9. Is that the same thing?

I also have 89a51df5ab1d, which Fixes that commit (ff57..).

cheers
[PATCH] powerpc/eeh: Delay probing EEH device during hotplug
Commit 1c509148b ("powerpc/eeh: Do probe on pci_dn") probes EEH devices in an
early stage, which is reasonable for the pSeries platform. However, it's
wrong for the PowerNV platform because the PE# isn't determined until the
resources (IO and MMIO) are assigned to the PE in the hotplug case. So we
have to delay probing EEH devices on the PowerNV platform until the PE# is
assigned.

Fixes: 1c509148b ("powerpc/eeh: Do probe on pci_dn")
Signed-off-by: Gavin Shan gws...@linux.vnet.ibm.com
---
 arch/powerpc/kernel/eeh.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/arch/powerpc/kernel/eeh.c b/arch/powerpc/kernel/eeh.c
index b798c86..04b5d94 100644
--- a/arch/powerpc/kernel/eeh.c
+++ b/arch/powerpc/kernel/eeh.c
@@ -1061,6 +1061,9 @@ void eeh_add_device_early(struct pci_dn *pdn)
 	if (!edev || !eeh_enabled())
 		return;

+	if (!eeh_has_flag(EEH_PROBE_MODE_DEVTREE))
+		return;
+
 	/* USB Bus children of PCI devices will not have BUID's */
 	phb = edev->phb;
 	if (NULL == phb ||
@@ -1115,6 +1118,9 @@ void eeh_add_device_late(struct pci_dev *dev)
 		return;
 	}

+	if (eeh_has_flag(EEH_PROBE_MODE_DEV))
+		eeh_ops->probe(pdn, NULL);
+
 	/*
 	 * The EEH cache might not be removed correctly because of
 	 * unbalanced kref to the device during unplug time, which
-- 
2.1.0
Re: [PATCH] powerpc/pseries: Fix possible leaked device node reference
On 04/29/2015 06:44 PM, Nathan Fontenot wrote:
> Failure return from dlpar_configure_connector when dlpar adding cpus
> results in leaking references to the cpus parent device node. Move the
> call to of_node_put() prior to checking the result of
> dlpar_configure_connector.
>
> Fixes: 8d5ff320766f ("powerpc/pseries: Make dlpar_configure_connector parent node aware")

That commit went into 3.12. Shouldn't this be CC to stable?

-Tyrel

> Signed-off-by: Nathan Fontenot nf...@linux.vnet.ibm.com
> ---
>  arch/powerpc/platforms/pseries/dlpar.c | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
>
> diff --git a/arch/powerpc/platforms/pseries/dlpar.c b/arch/powerpc/platforms/pseries/dlpar.c
> index 019d34a..47d9cebe 100644
> --- a/arch/powerpc/platforms/pseries/dlpar.c
> +++ b/arch/powerpc/platforms/pseries/dlpar.c
> @@ -421,11 +421,10 @@ static ssize_t dlpar_cpu_probe(const char *buf, size_t count)
>  		return -ENODEV;
>
>  	dn = dlpar_configure_connector(cpu_to_be32(drc_index), parent);
> +	of_node_put(parent);
>  	if (!dn)
>  		return -EINVAL;
>
> -	of_node_put(parent);
> -
>  	rc = dlpar_attach_node(dn);
>  	if (rc) {
>  		dlpar_release_drc(drc_index);
Re: [PATCH] powerpc/pseries: Fix possible leaked device node reference
On Thu, 2015-04-30 at 14:58 -0700, Tyrel Datwyler wrote:
> On 04/29/2015 06:44 PM, Nathan Fontenot wrote:
> > Failure return from dlpar_configure_connector when dlpar adding cpus
> > results in leaking references to the cpus parent device node. Move the
> > call to of_node_put() prior to checking the result of
> > dlpar_configure_connector.
> >
> > Fixes: 8d5ff320766f ("powerpc/pseries: Make dlpar_configure_connector parent node aware")
>
> That commit went into 3.12. Shouldn't this be CC to stable?

It could, but it fails the real bug test:

 - It must fix a real bug that bothers people (not a, "This could be a
   problem..." type thing).

(from Documentation/stable_kernel_rules.txt)

Because the node we're leaking a reference on is /cpus, in practice that's
pretty harmless.

cheers
Re: ppc64le crash in dm on 4.1+
On Fri, May 1, 2015 at 1:18 AM, Mike Snitzer snit...@redhat.com wrote:
>> I just booted 3d99e3fe13d473ac4578c37f477a59b829530764 (linus' tree as
>> of this morning) on a Tuletta and got the following:
>
> This is fixed with the following commit (which I just sent to Linus for
> 4.1-rc2 inclusion):
> https://git.kernel.org/cgit/linux/kernel/git/device-mapper/linux-dm.git/commit/?h=for-next&id=aa6df8dd28c01d9a3d2cfcfe9dd0a4a334d1cd81

Thanks!

Cheers,

Joel
Re: ppc64le crash in dm on 4.1+
On Wed, Apr 29 2015 at 11:29pm -0400, Joel Stanley j...@jms.id.au wrote:
> Hello,
>
> I just booted 3d99e3fe13d473ac4578c37f477a59b829530764 (linus' tree as
> of this morning) on a Tuletta and got the following:

This is fixed with the following commit (which I just sent to Linus for
4.1-rc2 inclusion):
https://git.kernel.org/cgit/linux/kernel/git/device-mapper/linux-dm.git/commit/?h=for-next&id=aa6df8dd28c01d9a3d2cfcfe9dd0a4a334d1cd81
[PATCH] powerpc/powernv: Add poweroff (EPOW, DPO) events support for PowerNV platform
This patch adds support for FSP EPOW (Early Power Off Warning) and DPO
(Delayed Power Off) events on the PowerNV platform. EPOW events are
generated by the SPCN/FSP due to various critical system conditions that
require a system shutdown. A few examples of these conditions are high
ambient temperature or the system running on UPS power with a low UPS
battery. A DPO event is generated in response to an admin-initiated system
shutdown request.

This patch enables the host kernel on the PowerNV platform to handle OPAL
notifications for these events and initiate a system poweroff. Since EPOW
notifications are sent in advance of the impending shutdown, this patch
also adds functionality to wait for the EPOW condition to return to normal.
If the EPOW condition doesn't return to normal in the estimated time, it
proceeds with a graceful system shutdown. System admins can also add host
userspace scripts to perform any specific actions, like graceful guest
shutdown, upon system poweroff.

Signed-off-by: Vipin K Parashar vi...@linux.vnet.ibm.com
---
 arch/powerpc/include/asm/opal-api.h                |  30 ++
 arch/powerpc/include/asm/opal.h                    |   3 +-
 arch/powerpc/platforms/powernv/Makefile            |   1 +
 .../platforms/powernv/opal-poweroff-events.c       | 358 +
 arch/powerpc/platforms/powernv/opal-wrappers.S     |   1 +
 5 files changed, 392 insertions(+), 1 deletion(-)
 create mode 100644 arch/powerpc/platforms/powernv/opal-poweroff-events.c

diff --git a/arch/powerpc/include/asm/opal-api.h b/arch/powerpc/include/asm/opal-api.h
index 0321a90..03b3cef 100644
--- a/arch/powerpc/include/asm/opal-api.h
+++ b/arch/powerpc/include/asm/opal-api.h
@@ -730,6 +730,36 @@ struct opal_i2c_request {
 	__be64 buffer_ra;		/* Buffer real address */
 };

+/*
+ * EPOW status sharing (OPAL and the host)
+ *
+ * The host will pass to OPAL a buffer of length OPAL_EPOW_MAX_CLASSES
+ * to fetch the system-wide EPOW status. Each element in the returned
+ * buffer will contain the bitwise EPOW status for each EPOW sub class.
+ */
+
+/* EPOW types */
+enum OpalEpow {
+	OPAL_EPOW_POWER		= 0,	/* Power EPOW */
+	OPAL_EPOW_TEMP		= 1,	/* Temperature EPOW */
+	OPAL_EPOW_COOLING	= 2,	/* Cooling EPOW */
+	OPAL_MAX_EPOW_CLASSES	= 3,	/* Max EPOW categories */
+};
+
+/* Power EPOW events */
+enum OpalEpowPower {
+	OPAL_EPOW_POWER_UPS	= 0x1,	/* System on UPS power */
+	OPAL_EPOW_POWER_UPS_LOW	= 0x2,	/* System on UPS power with low battery */
+};
+
+/* Temperature EPOW events */
+enum OpalEpowTemp {
+	OPAL_EPOW_TEMP_HIGH_AMB	= 0x1,	/* High ambient temperature */
+	OPAL_EPOW_TEMP_CRIT_AMB	= 0x2,	/* Critical ambient temperature */
+	OPAL_EPOW_TEMP_HIGH_INT	= 0x4,	/* High internal temperature */
+	OPAL_EPOW_TEMP_CRIT_INT	= 0x8,	/* Critical internal temperature */
+};
+
 #endif /* __ASSEMBLY__ */

 #endif /* __OPAL_API_H */
diff --git a/arch/powerpc/include/asm/opal.h b/arch/powerpc/include/asm/opal.h
index 042af1a..0777864 100644
--- a/arch/powerpc/include/asm/opal.h
+++ b/arch/powerpc/include/asm/opal.h
@@ -141,7 +141,6 @@ int64_t opal_pci_fence_phb(uint64_t phb_id);
 int64_t opal_pci_reinit(uint64_t phb_id, uint64_t reinit_scope, uint64_t data);
 int64_t opal_pci_mask_pe_error(uint64_t phb_id, uint16_t pe_number, uint8_t error_type, uint8_t mask_action);
 int64_t opal_set_slot_led_status(uint64_t phb_id, uint64_t slot_id, uint8_t led_type, uint8_t led_action);
-int64_t opal_get_epow_status(__be64 *status);
 int64_t opal_set_system_attention_led(uint8_t led_action);
 int64_t opal_pci_next_error(uint64_t phb_id, __be64 *first_frozen_pe,
 			    __be16 *pci_error_type, __be16 *severity);
@@ -200,6 +199,8 @@ int64_t opal_flash_write(uint64_t id, uint64_t offset, uint64_t buf,
 		uint64_t size, uint64_t token);
 int64_t opal_flash_erase(uint64_t id, uint64_t offset, uint64_t size,
 		uint64_t token);
+int32_t opal_get_epow_status(__be32 *status, __be32 *num_classes);
+int32_t opal_get_dpo_status(__be32 *timeout);

 /* Internal functions */
 extern int early_init_dt_scan_opal(unsigned long node, const char *uname,
diff --git a/arch/powerpc/platforms/powernv/Makefile b/arch/powerpc/platforms/powernv/Makefile
index 33e44f3..b817bdb 100644
--- a/arch/powerpc/platforms/powernv/Makefile
+++ b/arch/powerpc/platforms/powernv/Makefile
@@ -2,6 +2,7 @@ obj-y += setup.o opal-wrappers.o opal.o opal-async.o
 obj-y += opal-rtc.o opal-nvram.o opal-lpc.o opal-flash.o
 obj-y += rng.o opal-elog.o opal-dump.o opal-sysparam.o opal-sensor.o
 obj-y += opal-msglog.o opal-hmi.o opal-power.o
+obj-y += opal-poweroff-events.o

 obj-$(CONFIG_SMP) += smp.o subcore.o subcore-asm.o
 obj-$(CONFIG_PCI) += pci.o pci-p5ioc2.o pci-ioda.o
diff --git a/arch/powerpc/platforms/powernv/opal-poweroff-events.c