[PATCH] Revert powerpc/tm: Abort syscalls in active transactions

2015-04-30 Thread Michael Ellerman
This reverts commit feba40362b11341bee6d8ed58d54b896abbd9f84.

Although the principle of this change is good, the implementation has a
few issues.

Firstly we can sometimes fail to abort a syscall because r12 may have
been clobbered by C code if we went down the virtual CPU accounting
path, or if syscall tracing was enabled.

Secondly we have decided that it is safer to abort the syscall even
earlier in the syscall entry path, so that we avoid the syscall tracing
path when we are transactional.

So that we have time to thoroughly test those changes we have decided to
revert this for this merge window and will merge the fixed version in
the next window.

NB. Rather than reverting the selftest we just drop tm-syscall from
TEST_PROGS so that it's not run by default.

Fixes: feba40362b11 ("powerpc/tm: Abort syscalls in active transactions")
Signed-off-by: Michael Ellerman m...@ellerman.id.au
---
 Documentation/powerpc/transactional_memory.txt | 32 +-
 arch/powerpc/include/uapi/asm/tm.h |  2 +-
 arch/powerpc/kernel/entry_64.S | 19 ---
 tools/testing/selftests/powerpc/tm/Makefile|  2 +-
 4 files changed, 18 insertions(+), 37 deletions(-)

diff --git a/Documentation/powerpc/transactional_memory.txt 
b/Documentation/powerpc/transactional_memory.txt
index ba0a2a4a54ba..ded69794a5c0 100644
--- a/Documentation/powerpc/transactional_memory.txt
+++ b/Documentation/powerpc/transactional_memory.txt
@@ -74,23 +74,22 @@ Causes of transaction aborts
 Syscalls
 
 
-Syscalls made from within an active transaction will not be performed and the
-transaction will be doomed by the kernel with the failure code TM_CAUSE_SYSCALL
-| TM_CAUSE_PERSISTENT.
+Performing syscalls from within a transaction is not recommended, and can lead
+to unpredictable results.
 
-Syscalls made from within a suspended transaction are performed as normal and
-the transaction is not explicitly doomed by the kernel.  However, what the
-kernel does to perform the syscall may result in the transaction being doomed
-by the hardware.  The syscall is performed in suspended mode so any side
-effects will be persistent, independent of transaction success or failure.  No
-guarantees are provided by the kernel about which syscalls will affect
-transaction success.
+Syscalls do not by design abort transactions, but beware: The kernel code will
+not be running in transactional state.  The effect of syscalls will always
+remain visible, but depending on the call they may abort your transaction as a
+side-effect, read soon-to-be-aborted transactional data that should not remain
+invisible, etc.  If you constantly retry a transaction that constantly aborts
+itself by calling a syscall, you'll have a livelock & make no progress.
 
-Care must be taken when relying on syscalls to abort during active transactions
-if the calls are made via a library.  Libraries may cache values (which may
-give the appearance of success) or perform operations that cause transaction
-failure before entering the kernel (which may produce different failure codes).
-Examples are glibc's getpid() and lazy symbol resolution.
+Simple syscalls (e.g. sigprocmask()) could be OK.  Even things like write()
+from, say, printf() should be OK as long as the kernel does not access any
+memory that was accessed transactionally.
+
+Consider any syscalls that happen to work as debug-only -- not recommended for
+production use.  Best to queue them up till after the transaction is over.
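
As an illustrative aside (not part of this patch): a minimal userspace sketch of the "queue them up till after the transaction is over" advice above, assuming GCC's HTM builtins (-mhtm, <htmintrin.h>) on a TM-capable POWER CPU:

#include <htmintrin.h>
#include <unistd.h>
#include <string.h>

static int bump_and_log(long *counter)
{
	const char msg[] = "counter bumped\n";
	int logged = 0;

	if (__builtin_tbegin(0)) {
		(*counter)++;
		logged = 1;	/* remember the work; do not write() here */
		__builtin_tend(0);
	} else {
		return -1;	/* transaction failed or aborted */
	}

	if (logged)		/* committed: the syscall is now safe */
		write(STDOUT_FILENO, msg, strlen(msg));
	return 0;
}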
 
 
 Signals
@@ -177,7 +176,8 @@ kernel aborted a transaction:
  TM_CAUSE_RESCHED   Thread was rescheduled.
  TM_CAUSE_TLBI  Software TLB invalid.
  TM_CAUSE_FAC_UNAV  FP/VEC/VSX unavailable trap.
- TM_CAUSE_SYSCALL   Syscall from active transaction.
+ TM_CAUSE_SYSCALL   Currently unused; future syscalls that must abort
+transactions for consistency will use this.
  TM_CAUSE_SIGNALSignal delivered.
  TM_CAUSE_MISC  Currently unused.
  TM_CAUSE_ALIGNMENT Alignment fault.
diff --git a/arch/powerpc/include/uapi/asm/tm.h 
b/arch/powerpc/include/uapi/asm/tm.h
index 5047659815a5..5d836b7c1176 100644
--- a/arch/powerpc/include/uapi/asm/tm.h
+++ b/arch/powerpc/include/uapi/asm/tm.h
@@ -11,7 +11,7 @@
 #define TM_CAUSE_RESCHED   0xde
 #define TM_CAUSE_TLBI  0xdc
 #define TM_CAUSE_FAC_UNAV  0xda
-#define TM_CAUSE_SYSCALL   0xd8
+#define TM_CAUSE_SYSCALL   0xd8  /* future use */
 #define TM_CAUSE_MISC  0xd6  /* future use */
 #define TM_CAUSE_SIGNAL0xd4
 #define TM_CAUSE_ALIGNMENT 0xd2
diff --git a/arch/powerpc/kernel/entry_64.S b/arch/powerpc/kernel/entry_64.S
index 8ca9434c40e6..afbc20019c2e 100644
--- a/arch/powerpc/kernel/entry_64.S
+++ b/arch/powerpc/kernel/entry_64.S
@@ -34,7 +34,6 @@
 #include <asm/ftrace.h>
 #include <asm/hw_irq.h>
 #include <asm/context_tracking.h>
-#include <asm/tm.h>
 
 /*
  * System calls.
@@ -146,24 +145,6 @@ 

Re: [PATCH kernel v9 31/32] vfio: powerpc/spapr: Support multiple groups in one container if possible

2015-04-30 Thread David Gibson
On Sat, Apr 25, 2015 at 10:14:55PM +1000, Alexey Kardashevskiy wrote:
 At the moment only one group per container is supported.
 POWER8 CPUs have more flexible design and allows having 2 TCE tables per
 IOMMU group so we can relax this limitation and support multiple groups
 per container.

It's not obvious why allowing multiple TCE tables per PE has any
bearing on allowing multiple groups per container.

 This adds TCE table descriptors to a container and uses iommu_table_group_ops
 to create/set DMA windows on IOMMU groups so the same TCE tables will be
 shared between several IOMMU groups.
 
 Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru
 [aw: for the vfio related changes]
 Acked-by: Alex Williamson alex.william...@redhat.com
 ---
 Changes:
 v7:
 * updated doc
 ---
  Documentation/vfio.txt  |   8 +-
  drivers/vfio/vfio_iommu_spapr_tce.c | 268 
 ++--
  2 files changed, 199 insertions(+), 77 deletions(-)
 
 diff --git a/Documentation/vfio.txt b/Documentation/vfio.txt
 index 94328c8..7dcf2b5 100644
 --- a/Documentation/vfio.txt
 +++ b/Documentation/vfio.txt
 @@ -289,10 +289,12 @@ PPC64 sPAPR implementation note
  
  This implementation has some specifics:
  
 -1) Only one IOMMU group per container is supported as an IOMMU group
 -represents the minimal entity which isolation can be guaranteed for and
 -groups are allocated statically, one per a Partitionable Endpoint (PE)
 +1) On older systems (POWER7 with P5IOC2/IODA1) only one IOMMU group per
 +container is supported as an IOMMU table is allocated at the boot time,
 +one table per a IOMMU group which is a Partitionable Endpoint (PE)
  (PE is often a PCI domain but not always).

I thought the more fundamental problem was that different PEs tended
to use disjoint bus address ranges, so even by duplicating put_tce
across PEs you couldn't have a common address space.

 +Newer systems (POWER8 with IODA2) have improved hardware design which allows
 +to remove this limitation and have multiple IOMMU groups per a VFIO 
 container.
  
  2) The hardware supports so called DMA windows - the PCI address range
  within which DMA transfer is allowed, any attempt to access address space
 diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c 
 b/drivers/vfio/vfio_iommu_spapr_tce.c
 index a7d6729..970e3a2 100644
 --- a/drivers/vfio/vfio_iommu_spapr_tce.c
 +++ b/drivers/vfio/vfio_iommu_spapr_tce.c
 @@ -82,6 +82,11 @@ static void decrement_locked_vm(long npages)
   * into DMA'ble space using the IOMMU
   */
  
 +struct tce_iommu_group {
 + struct list_head next;
 + struct iommu_group *grp;
 +};
 +
  /*
   * The container descriptor supports only a single group per container.
   * Required by the API as the container is not supplied with the IOMMU group
 @@ -89,10 +94,11 @@ static void decrement_locked_vm(long npages)
   */
  struct tce_container {
   struct mutex lock;
 - struct iommu_group *grp;
   bool enabled;
   unsigned long locked_pages;
   bool v2;
 + struct iommu_table tables[IOMMU_TABLE_GROUP_MAX_TABLES];

Hrm,  so here we have more copies of the full iommu_table structures,
which again muddies the lifetime.  The table_group pointer is
presumably meaningless in these copies, which seems dangerously
confusing.

 + struct list_head group_list;
  };
  
  static long tce_unregister_pages(struct tce_container *container,
 @@ -154,20 +160,20 @@ static bool tce_page_is_contained(struct page *page, 
 unsigned page_shift)
   return (PAGE_SHIFT + compound_order(compound_head(page))) >= page_shift;
  }
  
 +static inline bool tce_groups_attached(struct tce_container *container)
 +{
 + return !list_empty(container->group_list);
 +}
 +
  static struct iommu_table *spapr_tce_find_table(
   struct tce_container *container,
   phys_addr_t ioba)
  {
   long i;
   struct iommu_table *ret = NULL;
 - struct iommu_table_group *table_group;
 -
 - table_group = iommu_group_get_iommudata(container->grp);
 - if (!table_group)
 - return NULL;
  
   for (i = 0; i < IOMMU_TABLE_GROUP_MAX_TABLES; ++i) {
 - struct iommu_table *tbl = table_group->tables[i];
 + struct iommu_table *tbl = container->tables[i];
   unsigned long entry = ioba >> tbl->it_page_shift;
   unsigned long start = tbl->it_offset;
   unsigned long end = start + tbl->it_size;
 @@ -186,9 +192,7 @@ static int tce_iommu_enable(struct tce_container 
 *container)
   int ret = 0;
   unsigned long locked;
   struct iommu_table_group *table_group;
 -
 - if (!container->grp)
 - return -ENXIO;
 + struct tce_iommu_group *tcegrp;
  
   if (!current->mm)
   return -ESRCH; /* process exited */
 @@ -225,7 +229,12 @@ static int tce_iommu_enable(struct tce_container 
 *container)
* as there is no way to know how much we should increment
* the locked_vm counter.
*/
 - 

Re: [PATCH kernel v9 28/32] powerpc/mmu: Add userspace-to-physical addresses translation cache

2015-04-30 Thread David Gibson
On Sat, Apr 25, 2015 at 10:14:52PM +1000, Alexey Kardashevskiy wrote:
 We are adding support for DMA memory pre-registration to be used in
 conjunction with VFIO. The idea is that the userspace which is going to
 run a guest may want to pre-register a user space memory region so
 it all gets pinned once and never goes away. Having this done,
 a hypervisor will not have to pin/unpin pages on every DMA map/unmap
 request. This is going to help with multiple pinning of the same memory
 and in-kernel acceleration of DMA requests.
 
 This adds a list of memory regions to mm_context_t. Each region consists
 of a header and a list of physical addresses. This adds API to:
 1. register/unregister memory regions;
 2. do final cleanup (which puts all pre-registered pages);
 3. do userspace to physical address translation;
 4. manage a mapped pages counter; when it is zero, it is safe to
 unregister the region.
 
 Multiple registration of the same region is allowed, kref is used to
 track the number of registrations.

[snip]
 +long mm_iommu_alloc(unsigned long ua, unsigned long entries,
 + struct mm_iommu_table_group_mem_t **pmem)
 +{
 + struct mm_iommu_table_group_mem_t *mem;
 + long i, j;
 + struct page *page = NULL;
 +
 + list_for_each_entry_rcu(mem, &current->mm->context.iommu_group_mem_list,
 + next) {
 + if ((mem->ua == ua) && (mem->entries == entries))
 + return -EBUSY;
 +
 + /* Overlap? */
 + if ((mem->ua < (ua + (entries << PAGE_SHIFT))) &&
 + (ua < (mem->ua + (mem->entries << PAGE_SHIFT))))
 + return -EINVAL;
 + }
 +
 + mem = kzalloc(sizeof(*mem), GFP_KERNEL);
 + if (!mem)
 + return -ENOMEM;
 +
 + mem->hpas = vzalloc(entries * sizeof(mem->hpas[0]));
 + if (!mem->hpas) {
 + kfree(mem);
 + return -ENOMEM;
 + }

So, I've thought more about this and I'm really confused as to what
this is supposed to be accomplishing.

I see that you need to keep track of what regions are registered, so
you don't double lock or unlock, but I don't see what the point of
actually storing the translations in hpas is.

I had assumed it was so that you could later on get to the
translations in real mode when you do in-kernel acceleration.  But
that doesn't make sense, because the array is vmalloc()ed, so can't be
accessed in real mode anyway.

I can't think of a circumstance in which you can use hpas where you
couldn't just walk the page tables anyway.

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson


pgpXYAi7YA0h0.pgp
Description: PGP signature
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

[PATCH v2 13/15] cputime:Introduce the cputime_to_timespec64/timespec64_to_cputime function

2015-04-30 Thread Baolin Wang
This patch introduces functions for converting cputime to timespec64 and back,
replacing the timespec type with the timespec64 type, including the arch/s390
and arch/powerpc implementations.

These new methods will replace the old cputime_to_timespec/timespec_to_cputime
functions in preparation for the 2038 issue. The old
cputime_to_timespec/timespec_to_cputime functions are moved to the
include/linux/cputime.h file so they can be removed conveniently later.

Signed-off-by: Baolin Wang baolin.w...@linaro.org
---
 arch/powerpc/include/asm/cputime.h|6 +++---
 arch/s390/include/asm/cputime.h   |8 
 include/asm-generic/cputime_jiffies.h |   10 +-
 include/asm-generic/cputime_nsecs.h   |4 ++--
 include/linux/cputime.h   |   15 +++
 5 files changed, 29 insertions(+), 14 deletions(-)

diff --git a/arch/powerpc/include/asm/cputime.h 
b/arch/powerpc/include/asm/cputime.h
index e245255..5dda5c0 100644
--- a/arch/powerpc/include/asm/cputime.h
+++ b/arch/powerpc/include/asm/cputime.h
@@ -154,9 +154,9 @@ static inline cputime_t secs_to_cputime(const unsigned long 
sec)
 }
 
 /*
- * Convert cputime <-> timespec
+ * Convert cputime <-> timespec64
  */
-static inline void cputime_to_timespec(const cputime_t ct, struct timespec *p)
+static inline void cputime_to_timespec64(const cputime_t ct, struct timespec64 
*p)
 {
u64 x = (__force u64) ct;
unsigned int frac;
@@ -168,7 +168,7 @@ static inline void cputime_to_timespec(const cputime_t ct, 
struct timespec *p)
	p->tv_nsec = x;
 }
 
-static inline cputime_t timespec_to_cputime(const struct timespec *p)
+static inline cputime_t timespec64_to_cputime(const struct timespec64 *p)
 {
u64 ct;
 
diff --git a/arch/s390/include/asm/cputime.h b/arch/s390/include/asm/cputime.h
index b91e960..1266697 100644
--- a/arch/s390/include/asm/cputime.h
+++ b/arch/s390/include/asm/cputime.h
@@ -89,16 +89,16 @@ static inline cputime_t secs_to_cputime(const unsigned int 
s)
 }
 
 /*
- * Convert cputime to timespec and back.
+ * Convert cputime to timespec64 and back.
  */
-static inline cputime_t timespec_to_cputime(const struct timespec *value)
+static inline cputime_t timespec64_to_cputime(const struct timespec64 *value)
 {
	unsigned long long ret = value->tv_sec * CPUTIME_PER_SEC;
	return (__force cputime_t)(ret + __div(value->tv_nsec *
			CPUTIME_PER_USEC, NSEC_PER_USEC));
 }
 
-static inline void cputime_to_timespec(const cputime_t cputime,
-  struct timespec *value)
+static inline void cputime_to_timespec64(const cputime_t cputime,
+  struct timespec64 *value)
 {
unsigned long long __cputime = (__force unsigned long long) cputime;
 #ifndef CONFIG_64BIT
diff --git a/include/asm-generic/cputime_jiffies.h 
b/include/asm-generic/cputime_jiffies.h
index fe386fc..54e034c 100644
--- a/include/asm-generic/cputime_jiffies.h
+++ b/include/asm-generic/cputime_jiffies.h
@@ -44,12 +44,12 @@ typedef u64 __nocast cputime64_t;
 #define secs_to_cputime(sec)   jiffies_to_cputime((sec) * HZ)
 
 /*
- * Convert cputime to timespec and back.
+ * Convert cputime to timespec64 and back.
  */
-#define timespec_to_cputime(__val) \
-   jiffies_to_cputime(timespec_to_jiffies(__val))
-#define cputime_to_timespec(__ct,__val)\
-   jiffies_to_timespec(cputime_to_jiffies(__ct),__val)
+#define timespec64_to_cputime(__val)   \
+   jiffies_to_cputime(timespec64_to_jiffies(__val))
+#define cputime_to_timespec64(__ct,__val)  \
+   jiffies_to_timespec64(cputime_to_jiffies(__ct),__val)
 
 /*
  * Convert cputime to timeval and back.
diff --git a/include/asm-generic/cputime_nsecs.h 
b/include/asm-generic/cputime_nsecs.h
index 0419485..65c875b 100644
--- a/include/asm-generic/cputime_nsecs.h
+++ b/include/asm-generic/cputime_nsecs.h
@@ -73,12 +73,12 @@ typedef u64 __nocast cputime64_t;
 /*
  * Convert cputime <-> timespec (nsec)
  */
-static inline cputime_t timespec_to_cputime(const struct timespec *val)
+static inline cputime_t timespec64_to_cputime(const struct timespec64 *val)
 {
	u64 ret = val->tv_sec * NSEC_PER_SEC + val->tv_nsec;
return (__force cputime_t) ret;
 }
-static inline void cputime_to_timespec(const cputime_t ct, struct timespec 
*val)
+static inline void cputime_to_timespec64(const cputime_t ct, struct timespec64 
*val)
 {
u32 rem;
 
diff --git a/include/linux/cputime.h b/include/linux/cputime.h
index f2eb2ee..f01896f 100644
--- a/include/linux/cputime.h
+++ b/include/linux/cputime.h
@@ -13,4 +13,19 @@
usecs_to_cputime((__nsecs) / NSEC_PER_USEC)
 #endif
 
+static inline cputime_t timespec_to_cputime(const struct timespec *ts)
+{
+   struct timespec64 ts64 = timespec_to_timespec64(*ts);
+   return timespec64_to_cputime(ts64);
+}
+
+static inline void cputime_to_timespec(const cputime_t cputime,
+   struct timespec *value)
+{
+   struct timespec64 *ts64;

Re: [PATCH kernel v9 28/32] powerpc/mmu: Add userspace-to-physical addresses translation cache

2015-04-30 Thread Paul Mackerras
On Thu, Apr 30, 2015 at 04:34:55PM +1000, David Gibson wrote:
 On Sat, Apr 25, 2015 at 10:14:52PM +1000, Alexey Kardashevskiy wrote:
  We are adding support for DMA memory pre-registration to be used in
  conjunction with VFIO. The idea is that the userspace which is going to
  run a guest may want to pre-register a user space memory region so
  it all gets pinned once and never goes away. Having this done,
  a hypervisor will not have to pin/unpin pages on every DMA map/unmap
  request. This is going to help with multiple pinning of the same memory
  and in-kernel acceleration of DMA requests.
  
  This adds a list of memory regions to mm_context_t. Each region consists
  of a header and a list of physical addresses. This adds API to:
  1. register/unregister memory regions;
  2. do final cleanup (which puts all pre-registered pages);
  3. do userspace to physical address translation;
  4. manage a mapped pages counter; when it is zero, it is safe to
  unregister the region.
  
  Multiple registration of the same region is allowed, kref is used to
  track the number of registrations.
 
 [snip]
  +long mm_iommu_alloc(unsigned long ua, unsigned long entries,
  +   struct mm_iommu_table_group_mem_t **pmem)
  +{
  +   struct mm_iommu_table_group_mem_t *mem;
  +   long i, j;
  +   struct page *page = NULL;
  +
  +   list_for_each_entry_rcu(mem, &current->mm->context.iommu_group_mem_list,
  +   next) {
  +   if ((mem->ua == ua) && (mem->entries == entries))
  +   return -EBUSY;
  +
  +   /* Overlap? */
  +   if ((mem->ua < (ua + (entries << PAGE_SHIFT))) &&
  +   (ua < (mem->ua + (mem->entries << PAGE_SHIFT))))
  +   return -EINVAL;
  +   }
  +
  +   mem = kzalloc(sizeof(*mem), GFP_KERNEL);
  +   if (!mem)
  +   return -ENOMEM;
  +
  +   mem->hpas = vzalloc(entries * sizeof(mem->hpas[0]));
  +   if (!mem->hpas) {
  +   kfree(mem);
  +   return -ENOMEM;
  +   }
 
 So, I've thought more about this and I'm really confused as to what
 this is supposed to be accomplishing.
 
 I see that you need to keep track of what regions are registered, so
 you don't double lock or unlock, but I don't see what the point of
 actually storing the translations in hpas is.
 
 I had assumed it was so that you could later on get to the
 translations in real mode when you do in-kernel acceleration.  But
 that doesn't make sense, because the array is vmalloc()ed, so can't be
 accessed in real mode anyway.

We can access vmalloc'd arrays in real mode using real_vmalloc_addr().

 I can't think of a circumstance in which you can use hpas where you
 couldn't just walk the page tables anyway.

The problem with walking the page tables is that there is no guarantee
that the page you find that way is the page that was returned by the
gup_fast() we did earlier.  Storing the hpas means that we know for
sure that the page we're doing DMA to is one that we have an elevated
page count on.

Also, there are various points where a Linux PTE is made temporarily
invalid for a short time.  If we happened to do a H_PUT_TCE on one cpu
while another cpu was doing that, we'd get a spurious failure returned
by the H_PUT_TCE.

Paul.
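
A self-contained sketch of the ua->hpa lookup that the cached array enables; the names and layout below are hypothetical, not the series' actual code:

#define PAGE_SHIFT	12
#define PAGE_SIZE	(1UL << PAGE_SHIFT)
#define PAGE_MASK	(~(PAGE_SIZE - 1))

struct mem_region {
	unsigned long ua;	/* userspace base address of the region */
	unsigned long entries;	/* number of pinned system pages */
	unsigned long *hpas;	/* host physical address of each page,
				   filled at registration time via gup */
};

/* Translate a userspace address to a host physical address using the
 * cached array, with no page-table walk and no risk of racing with a
 * temporarily invalidated Linux PTE. */
static long ua_to_hpa(const struct mem_region *mem, unsigned long ua,
		      unsigned long *hpa)
{
	unsigned long entry;

	if (ua < mem->ua)
		return -1;
	entry = (ua - mem->ua) >> PAGE_SHIFT;
	if (entry >= mem->entries)
		return -1;

	*hpa = (mem->hpas[entry] & PAGE_MASK) | (ua & ~PAGE_MASK);
	return 0;
}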
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [PATCH kernel v9 29/32] vfio: powerpc/spapr: Register memory and define IOMMU v2

2015-04-30 Thread David Gibson
On Sat, Apr 25, 2015 at 10:14:53PM +1000, Alexey Kardashevskiy wrote:
 The existing implementation accounts the whole DMA window in
 the locked_vm counter. This is going to be worse with multiple
 containers and huge DMA windows. Also, real-time accounting would require
 additional tracking of accounted pages due to the page size difference -
 IOMMU uses 4K pages and system uses 4K or 64K pages.
 
 Another issue is that actual pages pinning/unpinning happens on every
 DMA map/unmap request. This does not affect the performance much now as
 we spend way too much time now on switching context between
 guest/userspace/host but this will start to matter when we add in-kernel
 DMA map/unmap acceleration.
 
 This introduces a new IOMMU type for SPAPR - VFIO_SPAPR_TCE_v2_IOMMU.
 New IOMMU deprecates VFIO_IOMMU_ENABLE/VFIO_IOMMU_DISABLE and introduces
 2 new ioctls to register/unregister DMA memory -
 VFIO_IOMMU_SPAPR_REGISTER_MEMORY and VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY -
 which receive user space address and size of a memory region which
 needs to be pinned/unpinned and counted in locked_vm.
 New IOMMU splits physical pages pinning and TCE table update into 2 different
 operations. It requires 1) guest pages to be registered first 2) consequent
 map/unmap requests to work only with pre-registered memory.
 For the default single window case this means that the entire guest
 (instead of 2GB) needs to be pinned before using VFIO.
 When a huge DMA window is added, no additional pinning will be
 required, otherwise it would be guest RAM + 2GB.
 
 The new memory registration ioctls are not supported by
 VFIO_SPAPR_TCE_IOMMU. Dynamic DMA window and in-kernel acceleration
 will require memory to be preregistered in order to work.
 
 The accounting is done per the user process.
 
 This advertises v2 SPAPR TCE IOMMU and restricts what the userspace
 can do with v1 or v2 IOMMUs.
 
 Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru
 [aw: for the vfio related changes]
 Acked-by: Alex Williamson alex.william...@redhat.com
 ---
 Changes:
 v9:
 * s/tce_get_hva_cached/tce_iommu_use_page_v2/
 
 v7:
 * now memory is registered per mm (i.e. process)
 * moved memory registration code to powerpc/mmu
 * merged vfio: powerpc/spapr: Define v2 IOMMU into this
 * limited new ioctls to v2 IOMMU
 * updated doc
 * unsupported ioctls return -ENOTTY instead of -EPERM
 
 v6:
 * tce_get_hva_cached() returns hva via a pointer
 
 v4:
 * updated docs
 * s/kzmalloc/vzalloc/
 * in tce_pin_pages()/tce_unpin_pages() removed @vaddr, @size and
 replaced offset with index
 * renamed vfio_iommu_type_register_memory to vfio_iommu_spapr_register_memory
 and removed duplicating vfio_iommu_spapr_register_memory
 ---
  Documentation/vfio.txt  |  23 
  drivers/vfio/vfio_iommu_spapr_tce.c | 230 
 +++-
  include/uapi/linux/vfio.h   |  27 +
  3 files changed, 274 insertions(+), 6 deletions(-)
 
 diff --git a/Documentation/vfio.txt b/Documentation/vfio.txt
 index 96978ec..94328c8 100644
 --- a/Documentation/vfio.txt
 +++ b/Documentation/vfio.txt
 @@ -427,6 +427,29 @@ The code flow from the example above should be slightly 
 changed:
  
   
  
 +5) There is v2 of SPAPR TCE IOMMU. It deprecates VFIO_IOMMU_ENABLE/
 +VFIO_IOMMU_DISABLE and implements 2 new ioctls:
 +VFIO_IOMMU_SPAPR_REGISTER_MEMORY and VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY
 +(which are unsupported in v1 IOMMU).

A summary of the semantic differences between v1 and v2 would be nice.
At this point it's not really clear to me if there's a case for
creating v2, or if this could just be done by adding (optional)
functionality to v1.

 +PPC64 paravirtualized guests generate a lot of map/unmap requests,
 +and the handling of those includes pinning/unpinning pages and updating
 +mm::locked_vm counter to make sure we do not exceed the rlimit.
 +The v2 IOMMU splits accounting and pinning into separate operations:
 +
 +- VFIO_IOMMU_SPAPR_REGISTER_MEMORY/VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY ioctls
 +receive a user space address and size of the block to be pinned.
 +Bisecting is not supported and VFIO_IOMMU_UNREGISTER_MEMORY is expected to
 +be called with the exact address and size used for registering
 +the memory block. The userspace is not expected to call these often.
 +The ranges are stored in a linked list in a VFIO container.
 +
 +- VFIO_IOMMU_MAP_DMA/VFIO_IOMMU_UNMAP_DMA ioctls only update the actual
 +IOMMU table and do not do pinning; instead these check that the userspace
 +address is from pre-registered range.
 +
 +This separation helps in optimizing DMA for guests.
 +
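
A hypothetical userspace usage sketch of the pre-registration step described above; the struct and ioctl are taken from the uapi as it appears in later merged kernels' <linux/vfio.h> and may differ in detail from this v9 posting:

#include <string.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

static int preregister(int container_fd, void *buf, uint64_t len)
{
	struct vfio_iommu_spapr_register_memory reg;

	memset(&reg, 0, sizeof(reg));
	reg.argsz = sizeof(reg);
	reg.vaddr = (uintptr_t)buf;	/* userspace address of the block */
	reg.size  = len;		/* must match exactly on unregister */

	/* Pins the pages and charges locked_vm once; subsequent
	 * VFIO_IOMMU_MAP_DMA/UNMAP_DMA calls only update the TCE table. */
	return ioctl(container_fd, VFIO_IOMMU_SPAPR_REGISTER_MEMORY, &reg);
}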
  
 ---
  
  [1] VFIO was originally an acronym for Virtual Function I/O in its
 diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c 
 b/drivers/vfio/vfio_iommu_spapr_tce.c
 index 892a584..4cfc2c1 100644
 --- a/drivers/vfio/vfio_iommu_spapr_tce.c
 +++ b/drivers/vfio/vfio_iommu_spapr_tce.c

So, from 

Re: [PATCH kernel v9 30/32] vfio: powerpc/spapr: Use 32bit DMA window properties from table_group

2015-04-30 Thread David Gibson
On Sat, Apr 25, 2015 at 10:14:54PM +1000, Alexey Kardashevskiy wrote:
 A table group might not have a table but it always has the default 32bit
 window parameters so use these.
 
 No change in behavior is expected.
 
 Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru

It would be easier to review if you took this and the parts of the
earlier patch which add the tce32_* fields to table_group and roll
them up on their own.

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson


pgpneYXVkHI3I.pgp
Description: PGP signature
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

[PATCH v2 00/15] Convert the posix_clock_operations and k_clock structure to ready for 2038

2015-04-30 Thread Baolin Wang
This patch series changes the 32-bit time type (timespec/itimerspec) to the 
64-bit one
(timespec64/itimerspec64), since 32-bit time types will break in the year 2038.

This patch series introduces new methods with timespec64/itimerspec64 type,
and removes the old ones with timespec/itimerspec type for 
posix_clock_operations
and k_clock structure.

Also introduces some new functions with timespec64/itimerspec64 type, like 
current_kernel_time64(),
hrtimer_get_res64(), cputime_to_timespec64() and timespec64_to_cputime().

Changes since V1:
-Split some patch into small patch.
-Change the methods for converting the syscall and add some default 
function for new 64bit methods for syscall function.
-Introduce the new function do_sys_settimeofday64() and move 
do_sys_settimeofday() function to a header file.
	-Modify the EXPORT_SYMBOL issue.
-Add new 64bit methods in cputime_nsecs.h file.
-Modify some patch logs.

Baolin Wang (15):
  linux/time64.h:Introduce the 'struct itimerspec64' for 64bit
  timekeeping:Introduce the current_kernel_time64() function with
timespec64 type
  time/hrtimer:Introduce hrtimer_get_res64() with timespec64 type for
getting the timer resolution
  posix timers:Introduce the 64bit methods with timespec64 type for
k_clock structure
  posix-timers:Split out the guts of the syscall and change the
implementation
  posix-timers:Convert to the 64bit methods for the syscall function
  time:Introduce the do_sys_settimeofday64() function with timespec64
type
  time/posix-timers:Convert to the 64bit methods for k_clock callback
functions
  char/mmtimer:Convert to the 64bit methods for k_clock callback
function
  time/alarmtimer:Convert to the new methods for k_clock structure
  time/posix-clock:Convert to the 64bit methods for k_clock and
posix_clock_operations structure
  time/time:Introduce the timespec64_to_jiffies/jiffies_to_timespec64
function
  cputime:Introduce the cputime_to_timespec64/timespec64_to_cputime
function
  time/posix-cpu-timers:Convert to the 64bit methods for k_clock
structure
  k_clock:Remove the 32bit methods with timespec/itimerspec type

 arch/powerpc/include/asm/cputime.h|6 +-
 arch/s390/include/asm/cputime.h   |8 +-
 drivers/char/mmtimer.c|   36 +++--
 drivers/ptp/ptp_clock.c   |   26 +---
 include/asm-generic/cputime_jiffies.h |   10 +-
 include/asm-generic/cputime_nsecs.h   |4 +-
 include/linux/cputime.h   |   15 ++
 include/linux/hrtimer.h   |   12 +-
 include/linux/jiffies.h   |   21 ++-
 include/linux/posix-clock.h   |   10 +-
 include/linux/posix-timers.h  |   18 +--
 include/linux/time64.h|   35 +
 include/linux/timekeeping.h   |   26 +++-
 kernel/time/alarmtimer.c  |   43 +++---
 kernel/time/hrtimer.c |   10 +-
 kernel/time/posix-clock.c |   20 +--
 kernel/time/posix-cpu-timers.c|   83 +-
 kernel/time/posix-timers.c|  269 ++---
 kernel/time/time.c|   22 +--
 kernel/time/timekeeping.c |6 +-
 kernel/time/timekeeping.h |2 +-
 21 files changed, 428 insertions(+), 254 deletions(-)

-- 
1.7.9.5

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [PATCH kernel v9 31/32] vfio: powerpc/spapr: Support multiple groups in one container if possible

2015-04-30 Thread Alexey Kardashevskiy

On 04/30/2015 05:22 PM, David Gibson wrote:

On Sat, Apr 25, 2015 at 10:14:55PM +1000, Alexey Kardashevskiy wrote:

At the moment only one group per container is supported.
POWER8 CPUs have more flexible design and allows having 2 TCE tables per
IOMMU group so we can relax this limitation and support multiple groups
per container.


It's not obvious why allowing multiple TCE tables per PE has any
bearing on allowing multiple groups per container.



This patchset is a global TCE tables rework (patches 1..30, roughly) with 2 
outcomes:

1. reusing the same IOMMU table for multiple groups - patch 31;
2. allowing dynamic create/remove of IOMMU tables - patch 32.

I can remove this one from the patchset and post it separately later, but
since 1..30 aim to support both 1) and 2), I think I'd better keep them all
together (it might explain some of the changes I make in 1..30).





This adds TCE table descriptors to a container and uses iommu_table_group_ops
to create/set DMA windows on IOMMU groups so the same TCE tables will be
shared between several IOMMU groups.

Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru
[aw: for the vfio related changes]
Acked-by: Alex Williamson alex.william...@redhat.com
---
Changes:
v7:
* updated doc
---
  Documentation/vfio.txt  |   8 +-
  drivers/vfio/vfio_iommu_spapr_tce.c | 268 ++--
  2 files changed, 199 insertions(+), 77 deletions(-)

diff --git a/Documentation/vfio.txt b/Documentation/vfio.txt
index 94328c8..7dcf2b5 100644
--- a/Documentation/vfio.txt
+++ b/Documentation/vfio.txt
@@ -289,10 +289,12 @@ PPC64 sPAPR implementation note

  This implementation has some specifics:

-1) Only one IOMMU group per container is supported as an IOMMU group
-represents the minimal entity which isolation can be guaranteed for and
-groups are allocated statically, one per a Partitionable Endpoint (PE)
+1) On older systems (POWER7 with P5IOC2/IODA1) only one IOMMU group per
+container is supported as an IOMMU table is allocated at the boot time,
+one table per a IOMMU group which is a Partitionable Endpoint (PE)
  (PE is often a PCI domain but not always).


I thought the more fundamental problem was that different PEs tended
to use disjoint bus address ranges, so even by duplicating put_tce
across PEs you couldn't have a common address space.



Sorry, I am not following you here.

By duplicating put_tce, I can have multiple IOMMU groups on the same
virtual PHB in QEMU; "[PATCH qemu v7 04/14] spapr_pci_vfio: Enable multiple
groups per container" does this, and the address ranges will be the same.


What I cannot do on p5ioc2 is programming the same table to multiple 
physical PHBs (or I could but it is very different than IODA2 and pretty 
ugly and might not always be possible because I would have to allocate 
these pages from some common pool and face problems like fragmentation).





+Newer systems (POWER8 with IODA2) have improved hardware design which allows
+to remove this limitation and have multiple IOMMU groups per a VFIO container.

  2) The hardware supports so called DMA windows - the PCI address range
  within which DMA transfer is allowed, any attempt to access address space
diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c 
b/drivers/vfio/vfio_iommu_spapr_tce.c
index a7d6729..970e3a2 100644
--- a/drivers/vfio/vfio_iommu_spapr_tce.c
+++ b/drivers/vfio/vfio_iommu_spapr_tce.c
@@ -82,6 +82,11 @@ static void decrement_locked_vm(long npages)
   * into DMA'ble space using the IOMMU
   */

+struct tce_iommu_group {
+   struct list_head next;
+   struct iommu_group *grp;
+};
+
  /*
   * The container descriptor supports only a single group per container.
   * Required by the API as the container is not supplied with the IOMMU group
@@ -89,10 +94,11 @@ static void decrement_locked_vm(long npages)
   */
  struct tce_container {
struct mutex lock;
-   struct iommu_group *grp;
bool enabled;
unsigned long locked_pages;
bool v2;
+   struct iommu_table tables[IOMMU_TABLE_GROUP_MAX_TABLES];


Hrm,  so here we have more copies of the full iommu_table structures,
which again muddies the lifetime.  The table_group pointer is
presumably meaningless in these copies, which seems dangerously
confusing.



Ouch. This is bad. No, table_group is not pointless here as it is used to 
get to the PE number to invalidate TCE cache. I just realized although I 
need to update just a single table, I still have to invalidate TCE cache 
for every attached group/PE so I need a list of iommu_table_group's here, 
not a single pointer...





+   struct list_head group_list;
  };

  static long tce_unregister_pages(struct tce_container *container,
@@ -154,20 +160,20 @@ static bool tce_page_is_contained(struct page *page, 
unsigned page_shift)
	return (PAGE_SHIFT + compound_order(compound_head(page))) >= page_shift;
  }

+static inline bool tce_groups_attached(struct tce_container *container)
+{
+   return 

Re: [PATCH kernel v9 23/32] powerpc/powernv/ioda: Define and implement DMA table/window management callbacks

2015-04-30 Thread Alexey Kardashevskiy

On 04/30/2015 02:37 PM, David Gibson wrote:

On Wed, Apr 29, 2015 at 07:44:20PM +1000, Alexey Kardashevskiy wrote:

On 04/29/2015 03:30 PM, David Gibson wrote:

On Sat, Apr 25, 2015 at 10:14:47PM +1000, Alexey Kardashevskiy wrote:

This extends iommu_table_group_ops by a set of callbacks to support
dynamic DMA windows management.

create_table() creates a TCE table with specific parameters.
it receives iommu_table_group to know nodeid in order to allocate
TCE table memory closer to the PHB. The exact format of allocated
multi-level table might be also specific to the PHB model (not
the case now though).
This callback calculates the DMA window offset on a PCI bus from @num
and stores it in a just created table.

set_window() sets the window at specified TVT index + @num on PHB.

unset_window() unsets the window from specified TVT.

This adds a free() callback to iommu_table_ops to free the memory
(potentially a tree of tables) allocated for the TCE table.


Doesn't the free callback belong with the previous patch introducing
multi-level tables?




If I did that, you would say why is it here if nothing calls it on
multilevel patch and I see the allocation but I do not see memory
release ;)


Yeah, fair enough ;)


I need some rule of thumb here. I think it is a bit cleaner if the same
patch adds a callback for memory allocation and its counterpart, no?


On further consideration, yes, I think you're right.


create_table() and free() are supposed to be called once per
VFIO container and set_window()/unset_window() are supposed to be
called for every group in a container.

This adds IOMMU capabilities to iommu_table_group such as default
32bit window parameters and others.

Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru
---
  arch/powerpc/include/asm/iommu.h| 19 
  arch/powerpc/platforms/powernv/pci-ioda.c   | 75 ++---
  arch/powerpc/platforms/powernv/pci-p5ioc2.c | 12 +++--
  3 files changed, 96 insertions(+), 10 deletions(-)

diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
index 0f50ee2..7694546 100644
--- a/arch/powerpc/include/asm/iommu.h
+++ b/arch/powerpc/include/asm/iommu.h
@@ -70,6 +70,7 @@ struct iommu_table_ops {
/* get() returns a physical address */
unsigned long (*get)(struct iommu_table *tbl, long index);
void (*flush)(struct iommu_table *tbl);
+   void (*free)(struct iommu_table *tbl);
  };

  /* These are used by VIO */
@@ -148,6 +149,17 @@ extern struct iommu_table *iommu_init_table(struct 
iommu_table * tbl,
  struct iommu_table_group;

  struct iommu_table_group_ops {
+   long (*create_table)(struct iommu_table_group *table_group,
+   int num,
+   __u32 page_shift,
+   __u64 window_size,
+   __u32 levels,
+   struct iommu_table *tbl);
+   long (*set_window)(struct iommu_table_group *table_group,
+   int num,
+   struct iommu_table *tblnew);
+   long (*unset_window)(struct iommu_table_group *table_group,
+   int num);
/*
 * Switches ownership from the kernel itself to an external
	 * user. While ownership is taken, the kernel cannot use IOMMU itself.
@@ -160,6 +172,13 @@ struct iommu_table_group {
  #ifdef CONFIG_IOMMU_API
struct iommu_group *group;
  #endif
+   /* Some key properties of IOMMU */
+   __u32 tce32_start;
+   __u32 tce32_size;
+   __u64 pgsizes; /* Bitmap of supported page sizes */
+   __u32 max_dynamic_windows_supported;
+   __u32 max_levels;


With this information, table_group seems even more like a bad name.
iommu_state maybe?



Please, no. We will never come to agreement then :( And iommu_state is too
general anyway, it won't pass.



struct iommu_table tables[IOMMU_TABLE_GROUP_MAX_TABLES];
struct iommu_table_group_ops *ops;
  };
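
To connect the pieces being discussed, a rough pseudo-driver sketch of the calling pattern described above (create_table()/free() once per container window, set_window()/unset_window() once per attached group); the ops and fields come from the quoted patch, the surrounding code is invented for illustration:

static long add_window_to_container(struct iommu_table_group **groups,
				    int ngroups, int num,
				    struct iommu_table *tbl)
{
	long ret;
	int i;

	/* Allocate the TCE table once, near the first group's PHB */
	ret = groups[0]->ops->create_table(groups[0], num,
					   16 /* 64K pages */,
					   1ULL << 30 /* 1GB window */,
					   1 /* levels */, tbl);
	if (ret)
		return ret;

	/* Program the same table into every attached group's TVT entry */
	for (i = 0; i < ngroups; i++) {
		ret = groups[i]->ops->set_window(groups[i], num, tbl);
		if (ret)
			goto unwind;
	}
	return 0;

unwind:
	while (--i >= 0)
		groups[i]->ops->unset_window(groups[i], num);
	tbl->it_ops->free(tbl);		/* the free() callback added above */
	return ret;
}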
diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c 
b/arch/powerpc/platforms/powernv/pci-ioda.c
index cc1d09c..4828837 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -24,6 +24,7 @@
  #include linux/msi.h
  #include linux/memblock.h
  #include linux/iommu.h
+#include linux/sizes.h

  #include asm/sections.h
  #include asm/io.h
@@ -1846,6 +1847,7 @@ static struct iommu_table_ops pnv_ioda2_iommu_ops = {
  #endif
.clear = pnv_ioda2_tce_free,
.get = pnv_tce_get,
+   .free = pnv_pci_free_table,
  };

  static void pnv_pci_ioda_setup_opal_tce_kill(struct pnv_phb *phb,
@@ -1936,6 +1938,8 @@ static void pnv_pci_ioda_setup_dma_pe(struct pnv_phb *phb,
 TCE_PCI_SWINV_PAIR);

	tbl->it_ops = &pnv_ioda1_iommu_ops;
+   pe->table_group.tce32_start = tbl->it_offset << tbl->it_page_shift;
+   pe->table_group.tce32_size = tbl->it_size << tbl->it_page_shift;
	iommu_init_table(tbl, phb->hose->node);

[v3] powerpc/mpc85xx: Add MDIO bus muxing support to the board device tree(s)

2015-04-30 Thread Igal . Liberman
From: Igal Liberman igal.liber...@freescale.com

Describe the PHY topology for all configurations supported by each board

Based on prior work by Andy Fleming aflem...@freescale.com

Signed-off-by: Igal Liberman igal.liber...@freescale.com
Signed-off-by: Shruti Kanetkar shr...@freescale.com
Signed-off-by: Emil Medve emilian.me...@freescale.com
---

v3: Fixed incorrect E-Mail address (signed-off-by)

v2: Remove 'Change-Id'

 arch/powerpc/boot/dts/b4860qds.dts|   60 -
 arch/powerpc/boot/dts/b4qds.dtsi  |   51 -
 arch/powerpc/boot/dts/p1023rdb.dts|   24 +-
 arch/powerpc/boot/dts/p2041rdb.dts|   92 +++-
 arch/powerpc/boot/dts/p3041ds.dts |  112 -
 arch/powerpc/boot/dts/p4080ds.dts |  184 ++-
 arch/powerpc/boot/dts/p5020ds.dts |  112 -
 arch/powerpc/boot/dts/p5040ds.dts |  234 ++-
 arch/powerpc/boot/dts/t1040rdb.dts|   32 ++-
 arch/powerpc/boot/dts/t1042rdb.dts|   30 ++-
 arch/powerpc/boot/dts/t1042rdb_pi.dts |   18 +-
 arch/powerpc/boot/dts/t104xqds.dtsi   |  178 ++-
 arch/powerpc/boot/dts/t104xrdb.dtsi   |   33 ++-
 arch/powerpc/boot/dts/t2080qds.dts|  158 -
 arch/powerpc/boot/dts/t2080rdb.dts|   67 +-
 arch/powerpc/boot/dts/t2081qds.dts|  221 +-
 arch/powerpc/boot/dts/t4240qds.dts|  400 -
 arch/powerpc/boot/dts/t4240rdb.dts|  149 +++-
 18 files changed, 2135 insertions(+), 20 deletions(-)

diff --git a/arch/powerpc/boot/dts/b4860qds.dts 
b/arch/powerpc/boot/dts/b4860qds.dts
index 6bb3707..98b1ef4 100644
--- a/arch/powerpc/boot/dts/b4860qds.dts
+++ b/arch/powerpc/boot/dts/b4860qds.dts
@@ -1,7 +1,7 @@
 /*
  * B4860DS Device Tree Source
  *
- * Copyright 2012 Freescale Semiconductor Inc.
+ * Copyright 2012 - 2015 Freescale Semiconductor Inc.
  *
  * Redistribution and use in source and binary forms, with or without
  * modification, are permitted provided that the following conditions are met:
@@ -39,12 +39,69 @@
	model = "fsl,B4860QDS";
	compatible = "fsl,B4860QDS";
 
+	aliases {
+		phy_sgmii_1e = &phy_sgmii_1e;
+		phy_sgmii_1f = &phy_sgmii_1f;
+		phy_xaui_slot1 = &phy_xaui_slot1;
+		phy_xaui_slot2 = &phy_xaui_slot2;
+	};
+
ifc: localbus@ffe124000 {
board-control@3,0 {
			compatible = "fsl,b4860qds-fpga", "fsl,fpga-qixis";
};
};
 
+	soc@ffe00 {
+		fman@40 {
+			ethernet@e8000 {
+				phy-handle = <&phy_sgmii_1e>;
+				phy-connection-type = "sgmii";
+			};
+
+			ethernet@ea000 {
+				phy-handle = <&phy_sgmii_1f>;
+				phy-connection-type = "sgmii";
+			};
+
+			ethernet@f {
+				phy-handle = <&phy_xaui_slot1>;
+				phy-connection-type = "xgmii";
+			};
+
+			ethernet@f2000 {
+				phy-handle = <&phy_xaui_slot2>;
+				phy-connection-type = "xgmii";
+			};
+
+			mdio@fc000 {
+				phy_sgmii_1e: ethernet-phy@1e {
+					reg = <0x1e>;
+					status = "disabled";
+				};
+
+				phy_sgmii_1f: ethernet-phy@1f {
+					reg = <0x1f>;
+					status = "disabled";
+				};
+			};
+
+			mdio@fd000 {
+				phy_xaui_slot1: xaui-phy@slot1 {
+					compatible = "ethernet-phy-ieee802.3-c45";
+					reg = <0x7>;
+					status = "disabled";
+				};
+
+				phy_xaui_slot2: xaui-phy@slot2 {
+					compatible = "ethernet-phy-ieee802.3-c45";
+					reg = <0x6>;
+					status = "disabled";
+				};
+			};
+		};
+	};
+
rio: rapidio@ffe0c {
		reg = <0xf 0xfe0c 0 0x11000>;
 
@@ -55,7 +112,6 @@
			ranges = <0 0 0xc 0x3000 0 0x1000>;
};
};
-
 };
 
/include/ "fsl/b4860si-post.dtsi"
diff --git a/arch/powerpc/boot/dts/b4qds.dtsi b/arch/powerpc/boot/dts/b4qds.dtsi
index 559d006..af49456 100644
--- a/arch/powerpc/boot/dts/b4qds.dtsi
+++ b/arch/powerpc/boot/dts/b4qds.dtsi
@@ -1,7 +1,7 @@
 /*
  * B4420DS Device Tree Source
  *
- * Copyright 2012 

[PATCH 0/2] powerpc: tweak the kernel options for CRASH_DUMP and RELOCATABLE

2015-04-30 Thread Kevin Hao
Hi,

The first patch fixes a build error when CRASH_DUMP=y && ADVANCED_OPTIONS=n
for ppc32. The second does some cleanup for RELOCATABLE option.

Kevin Hao (2):
  powerpc: fix the dependency issue for CRASH_DUMP
  powerpc: merge the RELOCATABLE config entries for ppc32 and ppc64

 arch/powerpc/Kconfig | 68 +++-
 arch/powerpc/configs/ppc64_defconfig |  1 +
 2 files changed, 29 insertions(+), 40 deletions(-)

-- 
2.1.0

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

[PATCH 1/2] powerpc: fix the dependency issue for CRASH_DUMP

2015-04-30 Thread Kevin Hao
In the current code, RELOCATABLE is forcibly enabled when CRASH_DUMP is
enabled. But for ppc32, RELOCATABLE also depends on ADVANCED_OPTIONS and
selects NONSTATIC_KERNEL. This causes a build error when CRASH_DUMP=y &&
ADVANCED_OPTIONS=n. There is no such issue for ppc64, but select is only
meant for non-visible symbols and for symbols with no dependencies. A
symbol like RELOCATABLE is definitely not suitable to select, so choose
to depend on it instead.

Also enable the RELOCATABLE explicitly for the defconfigs which has
CRASH_DUMP enabled.

Signed-off-by: Kevin Hao haoke...@gmail.com
---
 arch/powerpc/Kconfig | 3 +--
 arch/powerpc/configs/ppc64_defconfig | 1 +
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 190cc48abc0c..d6bbf4f6f869 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -429,8 +429,7 @@ config KEXEC
 
 config CRASH_DUMP
bool Build a kdump crash kernel
-   depends on PPC64 || 6xx || FSL_BOOKE || (44x && !SMP)
-   select RELOCATABLE if (PPC64 && !COMPILE_TEST) || 44x || FSL_BOOKE
+   depends on 6xx || ((PPC64 || FSL_BOOKE || (44x && !SMP)) && RELOCATABLE)
help
  Build a kernel suitable for use as a kdump capture kernel.
  The same kernel binary can be used as production kernel and dump
diff --git a/arch/powerpc/configs/ppc64_defconfig 
b/arch/powerpc/configs/ppc64_defconfig
index aad501ae3834..01f7b63f2df0 100644
--- a/arch/powerpc/configs/ppc64_defconfig
+++ b/arch/powerpc/configs/ppc64_defconfig
@@ -46,6 +46,7 @@ CONFIG_BINFMT_MISC=m
 CONFIG_PPC_TRANSACTIONAL_MEM=y
 CONFIG_KEXEC=y
 CONFIG_CRASH_DUMP=y
+CONFIG_RELOCATABLE=y
 CONFIG_IRQ_ALL_CPUS=y
 CONFIG_MEMORY_HOTREMOVE=y
 CONFIG_KSM=y
-- 
2.1.0

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

[PATCH 1/2] libfdt: add fdt type definitions

2015-04-30 Thread Rob Herring
In preparation for libfdt/dtc update, add the new fdt specific types.

Signed-off-by: Rob Herring r...@kernel.org
Cc: Russell King li...@arm.linux.org.uk
Cc: Benjamin Herrenschmidt b...@kernel.crashing.org
Cc: Paul Mackerras pau...@samba.org
Cc: Michael Ellerman m...@ellerman.id.au
Cc: linux-arm-ker...@lists.infradead.org
Cc: linuxppc-dev@lists.ozlabs.org
---
 arch/arm/boot/compressed/libfdt_env.h | 4 
 arch/powerpc/boot/libfdt_env.h| 4 
 arch/powerpc/boot/of.h| 2 ++
 include/linux/libfdt_env.h| 4 
 4 files changed, 14 insertions(+)

diff --git a/arch/arm/boot/compressed/libfdt_env.h 
b/arch/arm/boot/compressed/libfdt_env.h
index 1f4e718..17ae0f3 100644
--- a/arch/arm/boot/compressed/libfdt_env.h
+++ b/arch/arm/boot/compressed/libfdt_env.h
@@ -5,6 +5,10 @@
 #include <linux/string.h>
 #include <asm/byteorder.h>
 
+typedef __be16 fdt16_t;
+typedef __be32 fdt32_t;
+typedef __be64 fdt64_t;
+
 #define fdt16_to_cpu(x)be16_to_cpu(x)
 #define cpu_to_fdt16(x)cpu_to_be16(x)
 #define fdt32_to_cpu(x)be32_to_cpu(x)
diff --git a/arch/powerpc/boot/libfdt_env.h b/arch/powerpc/boot/libfdt_env.h
index 8dcd744..7e3789e 100644
--- a/arch/powerpc/boot/libfdt_env.h
+++ b/arch/powerpc/boot/libfdt_env.h
@@ -10,6 +10,10 @@ typedef u32 uint32_t;
 typedef u64 uint64_t;
 typedef unsigned long uintptr_t;
 
+typedef __be16 fdt16_t;
+typedef __be32 fdt32_t;
+typedef __be64 fdt64_t;
+
 #define fdt16_to_cpu(x)be16_to_cpu(x)
 #define cpu_to_fdt16(x)cpu_to_be16(x)
 #define fdt32_to_cpu(x)be32_to_cpu(x)
diff --git a/arch/powerpc/boot/of.h b/arch/powerpc/boot/of.h
index 5603320..53f8f27 100644
--- a/arch/powerpc/boot/of.h
+++ b/arch/powerpc/boot/of.h
@@ -21,7 +21,9 @@ int of_setprop(const void *phandle, const char *name, const 
void *buf,
 /* Console functions */
 void of_console_init(void);
 
+typedef u16	__be16;
 typedef u32	__be32;
+typedef u64	__be64;
 
 #ifdef __LITTLE_ENDIAN__
 #define cpu_to_be16(x) swab16(x)
diff --git a/include/linux/libfdt_env.h b/include/linux/libfdt_env.h
index 01508c7..2a663c6 100644
--- a/include/linux/libfdt_env.h
+++ b/include/linux/libfdt_env.h
@@ -5,6 +5,10 @@
 
 #include <asm/byteorder.h>
 
+typedef __be16 fdt16_t;
+typedef __be32 fdt32_t;
+typedef __be64 fdt64_t;
+
 #define fdt32_to_cpu(x) be32_to_cpu(x)
 #define cpu_to_fdt32(x) cpu_to_be32(x)
 #define fdt64_to_cpu(x) be64_to_cpu(x)
-- 
2.1.0
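
A small usage sketch (not part of the patch) showing where the new fdt32_t typedef and the existing fdt32_to_cpu() helper meet, using libfdt's fdt_getprop():

#include <libfdt.h>
#include <stdint.h>

static uint32_t read_cell(const void *fdt, int nodeoffset, const char *prop)
{
	int len;
	const fdt32_t *val = fdt_getprop(fdt, nodeoffset, prop, &len);

	if (!val || len < (int)sizeof(*val))
		return 0;
	return fdt32_to_cpu(*val);	/* property cells are stored big-endian */
}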

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

[PATCH 2/2] powerpc: merge the RELOCATABLE config entries for ppc32 and ppc64

2015-04-30 Thread Kevin Hao
It makes no sense to keep two separate RELOCATABLE config entries for
ppc32 and ppc64 respectively. Merge them into one and move it to
a common place. The dependency on ADVANCED_OPTIONS for ppc32 seems
unnecessary, so drop it as well.

Signed-off-by: Kevin Hao haoke...@gmail.com
---
 arch/powerpc/Kconfig | 65 ++--
 1 file changed, 27 insertions(+), 38 deletions(-)

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index d6bbf4f6f869..4080a14707bb 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -427,6 +427,33 @@ config KEXEC
  interface is strongly in flux, so no good recommendation can be
  made.
 
+config RELOCATABLE
+   bool Build a relocatable kernel
+   depends on (PPC64 && !COMPILE_TEST) || (FLATMEM && (44x || FSL_BOOKE))
+   select NONSTATIC_KERNEL
+   help
+ This builds a kernel image that is capable of running at the
+ location the kernel is loaded at. For ppc32, there are no
+ alignment restrictions, and this feature is a superset of
+ DYNAMIC_MEMSTART and hence overrides it. For ppc64, we should use
+ 16k-aligned base address. The kernel is linked as a
+ position-independent executable (PIE) and contains dynamic relocations
+ which are processed early in the bootup process.
+
+ One use is for the kexec on panic case where the recovery kernel
+ must live at a different physical address than the primary
+ kernel.
+
+ Note: If CONFIG_RELOCATABLE=y, then the kernel runs from the address
+ it has been loaded at and the compile time physical addresses
+ CONFIG_PHYSICAL_START is ignored.  However CONFIG_PHYSICAL_START
+ setting can still be useful to bootwrappers that need to know the
+ load address of the kernel (eg. u-boot/mkimage).
+
+config RELOCATABLE_PPC32
+   def_bool y
+   depends on PPC32 && RELOCATABLE
+
 config CRASH_DUMP
bool Build a kdump crash kernel
	depends on 6xx || ((PPC64 || FSL_BOOKE || (44x && !SMP)) && RELOCATABLE)
@@ -926,29 +953,6 @@ config DYNAMIC_MEMSTART
 
  This option is overridden by CONFIG_RELOCATABLE
 
-config RELOCATABLE
-   bool Build a relocatable kernel
-   depends on ADVANCED_OPTIONS && FLATMEM && (44x || FSL_BOOKE)
-   select NONSTATIC_KERNEL
-   help
- This builds a kernel image that is capable of running at the
- location the kernel is loaded at, without any alignment restrictions.
- This feature is a superset of DYNAMIC_MEMSTART and hence overrides it.
-
- One use is for the kexec on panic case where the recovery kernel
- must live at a different physical address than the primary
- kernel.
-
- Note: If CONFIG_RELOCATABLE=y, then the kernel runs from the address
- it has been loaded at and the compile time physical addresses
- CONFIG_PHYSICAL_START is ignored.  However CONFIG_PHYSICAL_START
- setting can still be useful to bootwrappers that need to know the
- load address of the kernel (eg. u-boot/mkimage).
-
-config RELOCATABLE_PPC32
-   def_bool y
-   depends on PPC32 && RELOCATABLE
-
 config PAGE_OFFSET_BOOL
bool Set custom page offset address
depends on ADVANCED_OPTIONS
@@ -1034,21 +1038,6 @@ config PIN_TLB
 endmenu
 
 if PPC64
-config RELOCATABLE
-   bool Build a relocatable kernel
-   depends on !COMPILE_TEST
-   select NONSTATIC_KERNEL
-   help
- This builds a kernel image that is capable of running anywhere
- in the RMA (real memory area) at any 16k-aligned base address.
- The kernel is linked as a position-independent executable (PIE)
- and contains dynamic relocations which are processed early
- in the bootup process.
-
- One use is for the kexec on panic case where the recovery kernel
- must live at a different physical address than the primary
- kernel.
-
 # This value must have zeroes in the bottom 60 bits otherwise lots will break
 config PAGE_OFFSET
hex
-- 
2.1.0

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [PATCH v4 3/3] leds/powernv: Add driver for PowerNV platform

2015-04-30 Thread Vasant Hegde
On 04/28/2015 03:48 PM, Arnd Bergmann wrote:
 On Tuesday 28 April 2015 15:40:35 Vasant Hegde wrote:
 +++ b/Documentation/devicetree/bindings/leds/leds-powernv.txt
 @@ -0,0 +1,29 @@
 +Device Tree binding for LEDs on IBM Power Systems
 +-
 +
 +The 'led' node under '/ibm,opal' lists service indicators available in the
 +system and their capabilities.
 +
 +led {
 +   compatible = "ibm,opal-v3-led";
 +   phandle = <0x106b>;
 +   linux,phandle = <0x106b>;
 +   led-mode = "lightpath";
 +
 +   U78C9.001.RST0027-P1-C1 {
 +   led-types = "identify", "fault";
 +   led-loc = "descendent";
 +   phandle = <0x106f>;
 +   linux,phandle = <0x106f>;
 +   };
 +   ...
 +   ...
 +};

Arnd,

  Thanks for the review.

 
 We normally don't list the 'phandle' or 'linux,phandle' properties in the 
 binding
 description.
 

Sure. .Will fix.


 +
 +Each node under 'led' node describes location code of FRU/Enclosure.
 +
 +The properties under each node:
 +
 +  led-types : Supported LED types (attention/identify/fault).
 +
 +  led-loc   : enclosure/descendent(FRU) location code.

 
 Could you use the standard 'label' property for this?

This was discussed earlier [1] and it was agreed to use the led-types property here.

[1] https://lists.ozlabs.org/pipermail/linuxppc-dev/2015-March/126301.html

-Vasant

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [PATCH v4 3/3] leds/powernv: Add driver for PowerNV platform

2015-04-30 Thread Vasant Hegde
On 04/30/2015 07:59 PM, Jacek Anaszewski wrote:
 Hi Vasant,
 

Hi  Jacek,

.../...

 diff --git a/Documentation/devicetree/bindings/leds/leds-powernv.txt
 b/Documentation/devicetree/bindings/leds/leds-powernv.txt
 new file mode 100644
 index 000..6bb0e7e
 --- /dev/null
 +++ b/Documentation/devicetree/bindings/leds/leds-powernv.txt
 @@ -0,0 +1,29 @@
 +Device Tree binding for LEDs on IBM Power Systems
 +-
 +
 +The 'led' node under '/ibm,opal' lists service indicators available in the
 +system and their capabilities.
 +
 +led {
 +compatible = "ibm,opal-v3-led";
 +phandle = <0x106b>;
 +linux,phandle = <0x106b>;
 +led-mode = "lightpath";
 +
 +U78C9.001.RST0027-P1-C1 {
 +led-types = "identify", "fault";
 +led-loc = "descendent";
 +phandle = <0x106f>;
 +linux,phandle = <0x106f>;
 +};
 +...
 +...
 +};
 +
 +Each node under 'led' node describes location code of FRU/Enclosure.
 +
 +The properties under each node:
 +
 +  led-types : Supported LED types (attention/identify/fault).
 +
 +  led-loc   : enclosure/descendent(FRU) location code.
 
 DT documentation is usually constructed so that properties are
 described in the beginning and the file ends with an example.
 
 Also last time I mistakenly requested to remove description of
 compatible property, but it should also be present here and
 the entry should describe it in detail, like:
 
 - compatible : Should be ibm,opal-v3-led.

That's fine. I will fix it in v5.

 
 Please refer to the other bindings,
 
 I will express my opinion on the LED part after powerpc maintainer
 will ack DT bindings.
 

Sure..

@Ben/Michael,
  Can you please review/ack this patchset?

-Vasant

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

RE: [v3] dt/bindings: qoriq-clock: Add binding for FMan clock mux

2015-04-30 Thread igal.liber...@freescale.com


Regards,
Igal Liberman.

 -Original Message-
 From: Wood Scott-B07421
 Sent: Thursday, April 30, 2015 3:31 AM
 To: Liberman Igal-B31950
 Cc: devicet...@vger.kernel.org; linuxppc-dev@lists.ozlabs.org; Tang
 Yuantian-B29983
 Subject: Re: [v3] dt/bindings: qoriq-clock: Add binding for FMan clock mux
 
 On Wed, 2015-04-22 at 05:47 -0500, Liberman Igal-B31950 wrote:
 
 
  Regards,
  Igal Liberman.
 
   -Original Message-
   From: Wood Scott-B07421
   Sent: Tuesday, April 21, 2015 3:52 AM
   To: Liberman Igal-B31950
   Cc: devicet...@vger.kernel.org; linuxppc-dev@lists.ozlabs.org; Tang
   Yuantian-B29983
   Subject: Re: [v3] dt/bindings: qoriq-clock: Add binding for FMan
   clock mux
  
   On Mon, 2015-04-20 at 06:40 -0500, Liberman Igal-B31950 wrote:
   
   
Regards,
Igal Liberman.
   
 -Original Message-
 From: Liberman Igal-B31950
 Sent: Monday, April 20, 2015 2:07 PM
 To: Wood Scott-B07421
 Cc: devicet...@vger.kernel.org; linuxppc-dev@lists.ozlabs.org
 Subject: RE: [v3] dt/bindings: qoriq-clock: Add binding for FMan
 clock mux



 Regards,
 Igal Liberman.

  -Original Message-
  From: Wood Scott-B07421
  Sent: Friday, April 17, 2015 8:41 AM
  To: Liberman Igal-B31950
  Cc: devicet...@vger.kernel.org; linuxppc-dev@lists.ozlabs.org
  Subject: Re: [v3] dt/bindings: qoriq-clock: Add binding for
  FMan clock mux
 
  On Thu, 2015-04-16 at 01:11 -0500, Liberman Igal-B31950 wrote:
  
  
   Regards,
   Igal Liberman.
  
-Original Message-
From: Wood Scott-B07421
Sent: Wednesday, April 15, 2015 8:36 PM
To: Liberman Igal-B31950
Cc: devicet...@vger.kernel.org;
linuxppc-dev@lists.ozlabs.org
Subject: Re: [v3] dt/bindings: qoriq-clock: Add binding
for FMan clock mux
   
On Tue, 2015-04-14 at 13:56 +0300, Igal.Liberman wrote:
 From: Igal Liberman igal.liber...@freescale.com

 v3: Addressed feedback from Scott:
   - Removed clock specifier description.

 v2: Addressed feedback from Scott:
   - Moved the fman-clk-mux clock provider details
 under clocks property.

 Signed-off-by: Igal Liberman
 igal.liber...@freescale.com
 ---
  .../devicetree/bindings/clock/qoriq-clock.txt  |   17
  +++--
  1 file changed, 15 insertions(+), 2 deletions(-)

 diff --git
 a/Documentation/devicetree/bindings/clock/qoriq-clock.tx
 t
 b/Documentation/devicetree/bindings/clock/qoriq-clock.tx
 t index b0d7b73..2bb3b38 100644
 ---
 a/Documentation/devicetree/bindings/clock/qoriq-clock.tx
 t
 +++ b/Documentation/devicetree/bindings/clock/qoriq-cloc
 +++ k.tx
 +++ t
 @@ -65,9 +65,10 @@ Required properties:
   It takes parent's clock-frequency as its clock.
   * "fsl,qoriq-platform-pll-1.0" for the platform PLL clock (v1.0)
   * "fsl,qoriq-platform-pll-2.0" for the platform PLL clock (v2.0)
 + * "fsl,fman-clk-mux" for the Frame Manager clock.
   - #clock-cells: From common clock binding. The number of cells in a
 - clock-specifier. Should be 0 for "fsl,qoriq-sysclk-[1,2].0"
 - clocks, or 1 for "fsl,qoriq-core-pll-[1,2].0" clocks.
 + clock-specifier. Should be 0 for "fsl,qoriq-sysclk-[1,2].0" and
 + "fsl,fman-clk-mux" clocks or 1 for "fsl,qoriq-core-pll-[1,2].0".
   For "fsl,qoriq-core-pll-1.0" clocks, the single
   clock-specifier cell may take the following values:
   * 0 - equal to the PLL frequency
  @@ -145,6 +146,18 @@ Example for clock block and clock provider:
   clocks = <&sysclk>;
   clock-output-names = "platform-pll", "platform-pll-div2";
   };
 +
 + fm0clk: fm0-clk-mux {
 +	#clock-cells = <0>;
 +	reg = <0x10 4>;
 +	compatible = "fsl,fman-clk-mux";
 +	clocks = <&pll0 0>, <&pll0 1>, <&pll0 2>, <&pll0 3>,
 +		 <&platform_pll 0>, <&pll1 1>, <&pll1 2>;
 +	clock-names = "pll0", "pll0-div2", "pll0-div3",
 +		      "pll0-div4", "platform-pll", "pll1-div2",
 +		      "pll1-div3";
 +	clock-output-names = "fm0-clk";
 + };
   };
  };

   
I don't see this register in the manuals for older DPAA
chips, such as
p4080 or p3041.  Is 

Re: [PATCH v4 3/3] leds/powernv: Add driver for PowerNV platform

2015-04-30 Thread Jacek Anaszewski

Hi Vasant,

On 04/28/2015 12:10 PM, Vasant Hegde wrote:

This patch implements LED driver for PowerNV platform using the existing
generic LED class framework.

PowerNV platform has below type of LEDs:
   - System attention
   Indicates there is a problem with the system that needs attention.
   - Identify
   Helps the user locate/identify a particular FRU or resource in the
   system.
   - Fault
   Indicates there is a problem with the FRU or resource at the
   location with which the indicator is associated.

We register classdev structures for all individual LEDs detected on the
system through LED-specific device tree nodes. The device tree nodes
specify what kinds of LEDs are present at each location code, and the
driver registers an LED classdev structure for each of them.

All the system LEDs can be found in the same regular path /sys/class/leds/.
We don't use LED colors. We use the LED node and the led-types property
to form the LED classdev. Our LEDs have names in this format:

 location_code:attention|identify|fault

Any positive brightness value would turn on the LED and a zero value would
turn off the LED. The driver will return LED_FULL (255) for any turned on
LED and LED_OFF (0) for any turned off LED.

As per the LED class framework, the 'brightness_set' function should not
sleep. Hence these functions have been implemented through global work
queue tasks which might sleep on OPAL async call completion.
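
For reference, a minimal sketch of the deferred brightness_set pattern
described above. The struct and function names here (powernv_led,
powernv_brightness_set, ...) are illustrative, not the driver's actual
symbols, and INIT_WORK() on 'work' is assumed to be done at probe time.

#include <linux/leds.h>
#include <linux/workqueue.h>

struct powernv_led {
	struct led_classdev cdev;
	struct work_struct work;	/* runs on the global workqueue */
	enum led_brightness value;	/* last requested brightness */
};

static void powernv_led_work_fn(struct work_struct *work)
{
	struct powernv_led *led = container_of(work, struct powernv_led, work);

	/*
	 * This context may sleep, so the blocking OPAL async call that
	 * actually drives the indicator would be issued from here,
	 * using led->cdev.name and led->value.
	 */
}

/* led_classdev .brightness_set callback: must not sleep. */
static void powernv_brightness_set(struct led_classdev *cdev,
				   enum led_brightness value)
{
	struct powernv_led *led = container_of(cdev, struct powernv_led, cdev);

	led->value = value;
	schedule_work(&led->work);
}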

The platform level implementation of LED get and set state has been achieved
through OPAL calls. These calls are made available for the driver by
exporting from architecture specific codes.

Signed-off-by: Vasant Hegde hegdevas...@linux.vnet.ibm.com
Signed-off-by: Anshuman Khandual khand...@linux.vnet.ibm.com
Acked-by: Stewart Smith stew...@linux.vnet.ibm.com
Tested-by: Stewart Smith stew...@linux.vnet.ibm.com

---
Changes in v4:
   - s/u64/__be64/g for big endian data we get from firmware
   - Addressed review comments from Jacek. Major ones are:
 Removed list in powernv_led_data structure
 s/kzalloc/devm_kzalloc/
 Removed compatible property from documentation
 s/powernv_led_set_queue/powernv_brightness_set/
   - Removed LED specific brightness_set/get function. Instead this version
 uses single function to queue all LED set/get requests. Later we use
 LED name to detect LED type and value.
   - Removed hardcoded LED type used in previous version. Instead we use
 led-types property to form LED classdev.


Changes in v3:
   - Addressed review comments from Jacek. Major ones are:
 Replaced spin lock and mutex and removed redundant structures
 Replaced pr_* with dev_*
  Moved OPAL platform specific part to a separate patch
  Moved repeated code to a common function
 Added device tree documentation for LEDs


Changes in v2:
   - Added System Attention indicator support
   - Moved common code to powernv_led_set_queue()


  .../devicetree/bindings/leds/leds-powernv.txt  |   29 +
  drivers/leds/Kconfig   |   11
  drivers/leds/Makefile  |1
  drivers/leds/leds-powernv.c|  472 
  4 files changed, 513 insertions(+)
  create mode 100644 Documentation/devicetree/bindings/leds/leds-powernv.txt
  create mode 100644 drivers/leds/leds-powernv.c

diff --git a/Documentation/devicetree/bindings/leds/leds-powernv.txt 
b/Documentation/devicetree/bindings/leds/leds-powernv.txt
new file mode 100644
index 000..6bb0e7e
--- /dev/null
+++ b/Documentation/devicetree/bindings/leds/leds-powernv.txt
@@ -0,0 +1,29 @@
+Device Tree binding for LEDs on IBM Power Systems
+-
+
+The 'led' node under '/ibm,opal' lists service indicators available in the
+system and their capabilities.
+
+led {
+	compatible = "ibm,opal-v3-led";
+	phandle = <0x106b>;
+	linux,phandle = <0x106b>;
+	led-mode = "lightpath";
+
+	U78C9.001.RST0027-P1-C1 {
+		led-types = "identify", "fault";
+		led-loc = "descendent";
+		phandle = <0x106f>;
+		linux,phandle = <0x106f>;
+	};
+	...
+	...
+};
+
+Each node under 'led' node describes location code of FRU/Enclosure.
+
+The properties under each node:
+
+  led-types : Supported LED types (attention/identify/fault).
+
+  led-loc   : enclosure/descendent(FRU) location code.


DT documentation is usually constructed so that properties are
described in the beginning and the file ends with an example.

Also last time I mistakenly requested to remove description of
the compatible property, but it should also be present here and
the entry should describe it in detail, like:

- compatible : Should be "ibm,opal-v3-led".

Please refer to the other bindings,

I will express my opinion on the LED part after powerpc maintainer
will ack DT bindings.


diff --git a/drivers/leds/Kconfig b/drivers/leds/Kconfig
index 25b320d..2ea0849 

Re: [PATCH net-next v3 2/4] ibmveth: Add support for TSO

2015-04-30 Thread David Miller
From: Thomas Falcon tlfal...@linux.vnet.ibm.com
Date: Wed, 29 Apr 2015 16:25:45 -0500

 Add support for TSO.  TSO is turned off by default and
 must be enabled and configured by the user.  The driver
 version number is increased so that users can be sure
 that they are using ibmveth with TSO support.
 
 Cc: Brian King brk...@linux.vnet.ibm.com
 Signed-off-by: Thomas Falcon tlfal...@linux.vnet.ibm.com
 ---
 v2: Included statistics that were previously in a separate patch

Applied.
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [PATCH net-next v3 1/4] ibmveth: change rx buffer default allocation for CMO

2015-04-30 Thread David Miller
From: Thomas Falcon tlfal...@linux.vnet.ibm.com
Date: Wed, 29 Apr 2015 16:25:44 -0500

 This patch enables 64k rx buffer pools by default.  If Cooperative
 Memory Overcommitment (CMO) is enabled, the number of 64k buffers
 is reduced to save memory.
 
 Cc: Brian King brk...@linux.vnet.ibm.com
 Signed-off-by: Thomas Falcon tlfal...@linux.vnet.ibm.com

Applied.
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [PATCH net-next v3 3/4] ibmveth: Add GRO support

2015-04-30 Thread David Miller
From: Thomas Falcon tlfal...@linux.vnet.ibm.com
Date: Wed, 29 Apr 2015 16:25:46 -0500

 Cc: Brian King brk...@linux.vnet.ibm.com
 Signed-off-by: Thomas Falcon tlfal...@linux.vnet.ibm.com

Applied.
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [PATCH net-next v3 4/4] ibmveth: Add support for Large Receive Offload

2015-04-30 Thread David Miller
From: Thomas Falcon tlfal...@linux.vnet.ibm.com
Date: Wed, 29 Apr 2015 16:25:47 -0500

 Enables receiving large packets from other LPARs. These packets
 have a -1 IP header checksum, so we must recalculate to have
 a valid checksum.
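
 For reference, a hedged sketch of the checksum fix described here -- not
 the exact ibmveth code -- showing the IP header checksum being zeroed
 and recomputed with ip_fast_csum() for a large-receive frame whose
 checksum arrives as -1:

 #include <linux/ip.h>
 #include <linux/skbuff.h>
 #include <net/checksum.h>

 static void fixup_lro_iph_csum(struct sk_buff *skb)
 {
 	/* The IP header sits at the start of the packet data here. */
 	struct iphdr *iph = (struct iphdr *)skb->data;

 	iph->check = 0;
 	iph->check = ip_fast_csum((unsigned char *)iph, iph->ihl);
 }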
 
 Signed-off-by: Brian King brk...@linux.vnet.ibm.com
 Signed-off-by: Thomas Falcon tlfal...@linux.vnet.ibm.com
 ---
 v3:
  -Removed code setting network and transport headers
  -get IP header from skb data
   Thanks again to Eric Dumazet
 
 v2:
  -Included statistics that were previously in a separate patch
  -Zeroed the IP header checksum before calling ip_fast_csum
   Thanks to Eric Dumazet.

Applied.
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [PATCH kernel v9 27/32] powerpc/iommu/ioda2: Add get_table_size() to calculate the size of future table

2015-04-30 Thread Alexey Kardashevskiy

On 04/29/2015 04:40 PM, David Gibson wrote:

On Sat, Apr 25, 2015 at 10:14:51PM +1000, Alexey Kardashevskiy wrote:

This adds a way for the IOMMU user to know how much a new table will
use so it can be accounted in the locked_vm limit before allocation
happens.

This stores the allocated table size in pnv_pci_create_table()
so the locked_vm counter can be updated correctly when a table is
being disposed.

This defines an iommu_table_group_ops callback to let VFIO know
how much memory will be locked if a table is created.

Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru
---
Changes:
v9:
* reimplemented the whole patch
---
  arch/powerpc/include/asm/iommu.h  |  5 +
  arch/powerpc/platforms/powernv/pci-ioda.c | 14 
  arch/powerpc/platforms/powernv/pci.c  | 36 +++
  arch/powerpc/platforms/powernv/pci.h  |  2 ++
  4 files changed, 57 insertions(+)

diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
index 1472de3..9844c106 100644
--- a/arch/powerpc/include/asm/iommu.h
+++ b/arch/powerpc/include/asm/iommu.h
@@ -99,6 +99,7 @@ struct iommu_table {
unsigned long  it_size;  /* Size of iommu table in entries */
unsigned long  it_indirect_levels;
unsigned long  it_level_size;
+   unsigned long  it_allocated_size;
unsigned long  it_offset;/* Offset into global table */
unsigned long  it_base;  /* mapped address of tce table */
unsigned long  it_index; /* which iommu table this is */
@@ -155,6 +156,10 @@ extern struct iommu_table *iommu_init_table(struct 
iommu_table * tbl,
  struct iommu_table_group;

  struct iommu_table_group_ops {
+   unsigned long (*get_table_size)(
+   __u32 page_shift,
+   __u64 window_size,
+   __u32 levels);
long (*create_table)(struct iommu_table_group *table_group,
int num,
__u32 page_shift,
diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c 
b/arch/powerpc/platforms/powernv/pci-ioda.c
index e0be556..7f548b4 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -2062,6 +2062,18 @@ static void pnv_pci_ioda2_setup_bypass_pe(struct pnv_phb 
*phb,
  }

  #ifdef CONFIG_IOMMU_API
+static unsigned long pnv_pci_ioda2_get_table_size(__u32 page_shift,
+   __u64 window_size, __u32 levels)
+{
+   unsigned long ret = pnv_get_table_size(page_shift, window_size, levels);
+
+   if (!ret)
+   return ret;
+
+   /* Add size of it_userspace */
+   return ret + (window_size >> page_shift) * sizeof(unsigned long);


This doesn't make much sense.  The userspace view can't possibly be a
property of the specific low-level IOMMU model.



This it_userspace thing is all about memory preregistration.

I need some way to track how many actual mappings the 
mm_iommu_table_group_mem_t has in order to decide whether to allow 
unregistering or not.


When I clear a TCE, I can read the old value, which is a host physical
address; I cannot use that to find the preregistered region and adjust the
mappings counter. I can only use userspace addresses for this (not even
guest physical addresses, as it is VFIO and there is probably no KVM).


So I have to keep userspace addresses somewhere, one per IOMMU page, and 
the iommu_table seems a natural place for this.
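
To illustrate the accounting idea described above, a rough sketch. The
helpers read_and_clear_tce() and prereg_region_find(), and the 'mappings'
counter, are hypothetical names, not the API proposed in this series.

static void tce_clear_one(struct iommu_table *tbl, long index)
{
	/* Old TCE value is a host physical address -- not usable for lookup. */
	unsigned long hpa = read_and_clear_tce(tbl, index);
	/* Userspace address recorded at map time is usable. */
	unsigned long ua = tbl->it_userspace[index];

	if (hpa) {
		struct prereg_region *mem = prereg_region_find(ua);

		if (mem)
			atomic_dec(&mem->mappings);	/* may now allow unregistering */
	}

	tbl->it_userspace[index] = 0;
}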

+}
+
  static long pnv_pci_ioda2_create_table(struct iommu_table_group *table_group,
int num, __u32 page_shift, __u64 window_size, __u32 levels,
struct iommu_table *tbl)
@@ -2086,6 +2098,7 @@ static long pnv_pci_ioda2_create_table(struct 
iommu_table_group *table_group,

	BUG_ON(tbl->it_userspace);
	tbl->it_userspace = uas;
+	tbl->it_allocated_size += uas_cb;
	tbl->it_ops = pnv_ioda2_iommu_ops;
	if (pe->tce_inval_reg)
		tbl->it_type |= (TCE_PCI_SWINV_CREATE | TCE_PCI_SWINV_FREE);
@@ -2160,6 +2173,7 @@ static void pnv_ioda2_release_ownership(struct 
iommu_table_group *table_group)
  }

  static struct iommu_table_group_ops pnv_pci_ioda2_ops = {
+   .get_table_size = pnv_pci_ioda2_get_table_size,
.create_table = pnv_pci_ioda2_create_table,
.set_window = pnv_pci_ioda2_set_window,
.unset_window = pnv_pci_ioda2_unset_window,
diff --git a/arch/powerpc/platforms/powernv/pci.c 
b/arch/powerpc/platforms/powernv/pci.c
index fc129c4..1b5b48a 100644
--- a/arch/powerpc/platforms/powernv/pci.c
+++ b/arch/powerpc/platforms/powernv/pci.c
@@ -662,6 +662,38 @@ void pnv_pci_setup_iommu_table(struct iommu_table *tbl,
	tbl->it_type = TCE_PCI;
  }

+unsigned long pnv_get_table_size(__u32 page_shift,
+   __u64 window_size, __u32 levels)
+{
+   unsigned long bytes = 0;
+   const unsigned window_shift = ilog2(window_size);
+   unsigned entries_shift = window_shift - page_shift;
+   unsigned 

Re: [PATCH kernel v9 29/32] vfio: powerpc/spapr: Register memory and define IOMMU v2

2015-04-30 Thread Alexey Kardashevskiy

On 04/30/2015 04:55 PM, David Gibson wrote:

On Sat, Apr 25, 2015 at 10:14:53PM +1000, Alexey Kardashevskiy wrote:

The existing implementation accounts the whole DMA window in
the locked_vm counter. This is going to be worse with multiple
containers and huge DMA windows. Also, real-time accounting would require
additional tracking of accounted pages due to the page size difference -
IOMMU uses 4K pages and system uses 4K or 64K pages.

Another issue is that actual pages pinning/unpinning happens on every
DMA map/unmap request. This does not affect the performance much now as
we spend way too much time now on switching context between
guest/userspace/host but this will start to matter when we add in-kernel
DMA map/unmap acceleration.

This introduces a new IOMMU type for SPAPR - VFIO_SPAPR_TCE_v2_IOMMU.
New IOMMU deprecates VFIO_IOMMU_ENABLE/VFIO_IOMMU_DISABLE and introduces
2 new ioctls to register/unregister DMA memory -
VFIO_IOMMU_SPAPR_REGISTER_MEMORY and VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY -
which receive user space address and size of a memory region which
needs to be pinned/unpinned and counted in locked_vm.
New IOMMU splits physical pages pinning and TCE table update into 2 different
operations. It requires 1) guest pages to be registered first 2) consequent
map/unmap requests to work only with pre-registered memory.
For the default single window case this means that the entire guest
(instead of 2GB) needs to be pinned before using VFIO.
When a huge DMA window is added, no additional pinning will be
required, otherwise it would be guest RAM + 2GB.

The new memory registration ioctls are not supported by
VFIO_SPAPR_TCE_IOMMU. Dynamic DMA window and in-kernel acceleration
will require memory to be preregistered in order to work.

The accounting is done per the user process.

This advertises v2 SPAPR TCE IOMMU and restricts what the userspace
can do with v1 or v2 IOMMUs.

Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru
[aw: for the vfio related changes]
Acked-by: Alex Williamson alex.william...@redhat.com
---
Changes:
v9:
* s/tce_get_hva_cached/tce_iommu_use_page_v2/

v7:
* now memory is registered per mm (i.e. process)
* moved memory registration code to powerpc/mmu
* merged vfio: powerpc/spapr: Define v2 IOMMU into this
* limited new ioctls to v2 IOMMU
* updated doc
* unsupported ioctls return -ENOTTY instead of -EPERM

v6:
* tce_get_hva_cached() returns hva via a pointer

v4:
* updated docs
* s/kzmalloc/vzalloc/
* in tce_pin_pages()/tce_unpin_pages() removed @vaddr, @size and
replaced offset with index
* renamed vfio_iommu_type_register_memory to vfio_iommu_spapr_register_memory
and removed duplicating vfio_iommu_spapr_register_memory
---
  Documentation/vfio.txt  |  23 
  drivers/vfio/vfio_iommu_spapr_tce.c | 230 +++-
  include/uapi/linux/vfio.h   |  27 +
  3 files changed, 274 insertions(+), 6 deletions(-)

diff --git a/Documentation/vfio.txt b/Documentation/vfio.txt
index 96978ec..94328c8 100644
--- a/Documentation/vfio.txt
+++ b/Documentation/vfio.txt
@@ -427,6 +427,29 @@ The code flow from the example above should be slightly 
changed:



+5) There is v2 of SPAPR TCE IOMMU. It deprecates VFIO_IOMMU_ENABLE/
+VFIO_IOMMU_DISABLE and implements 2 new ioctls:
+VFIO_IOMMU_SPAPR_REGISTER_MEMORY and VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY
+(which are unsupported in v1 IOMMU).


A summary of the semantic differences between v1 and v2 would be nice.
At this point it's not really clear to me if there's a case for
creating v2, or if this could just be done by adding (optional)
functionality to v1.


v1: memory preregistration is not supported; explicit enable/disable ioctls 
are required


v2: memory preregistration is required; explicit enable/disable are 
prohibited (as they are not needed).


Mixing these in one IOMMU type caused a lot of problems like should I 
increment locked_vm by the 32bit window size on enable() or not; what do I 
do about pages pinning when map/map (check if it is from registered memory 
and do not pin?).


Having 2 IOMMU models makes everything a lot simpler.
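
For readers following along, a rough userspace-side sketch of the v2 flow
being discussed. The SPAPR register-memory structure follows the layout
proposed in this series and should be read as illustrative rather than
final ABI; the map structure and flags are the existing VFIO type1 ones.

#include <stddef.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

static int preregister_and_map(int container, void *guest_ram, size_t ram_size,
			       __u64 iova, __u64 dma_len)
{
	struct vfio_iommu_spapr_register_memory reg = {
		.argsz = sizeof(reg),
		.vaddr = (__u64)(uintptr_t)guest_ram,
		.size  = ram_size,
	};
	struct vfio_iommu_type1_dma_map map = {
		.argsz = sizeof(map),
		.flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE,
		.vaddr = (__u64)(uintptr_t)guest_ram,
		.iova  = iova,
		.size  = dma_len,
	};

	/* Pin and account once, up front (v2 only). */
	if (ioctl(container, VFIO_IOMMU_SPAPR_REGISTER_MEMORY, &reg))
		return -1;

	/* Subsequent map/unmap only updates the TCE table. */
	return ioctl(container, VFIO_IOMMU_MAP_DMA, &map);
}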



+PPC64 paravirtualized guests generate a lot of map/unmap requests,
+and the handling of those includes pinning/unpinning pages and updating
+mm::locked_vm counter to make sure we do not exceed the rlimit.
+The v2 IOMMU splits accounting and pinning into separate operations:
+
+- VFIO_IOMMU_SPAPR_REGISTER_MEMORY/VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY ioctls
+receive a user space address and size of the block to be pinned.
+Bisecting is not supported and VFIO_IOMMU_UNREGISTER_MEMORY is expected to
+be called with the exact address and size used for registering
+the memory block. The userspace is not expected to call these often.
+The ranges are stored in a linked list in a VFIO container.
+
+- VFIO_IOMMU_MAP_DMA/VFIO_IOMMU_UNMAP_DMA ioctls only update the actual
+IOMMU table and do not do pinning; instead these check that the userspace
+address is from 

[PATCH v3 3/3] Documentation: mmc: Update Arasan SDHC documentation to support 4.9a version of Arasan SDHC controller.

2015-04-30 Thread Suman Tripathi
This patch updates Arasan SDHC documentation to support
4.9a version of Arasan SDHC controller.

Signed-off-by: Suman Tripathi stripa...@apm.com
---
 Documentation/devicetree/bindings/mmc/arasan,sdhci.txt | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/Documentation/devicetree/bindings/mmc/arasan,sdhci.txt 
b/Documentation/devicetree/bindings/mmc/arasan,sdhci.txt
index 98ee2ab..f01d41a 100644
--- a/Documentation/devicetree/bindings/mmc/arasan,sdhci.txt
+++ b/Documentation/devicetree/bindings/mmc/arasan,sdhci.txt
@@ -8,7 +8,8 @@ Device Tree Bindings for the Arasan SDHCI Controller
   [3] Documentation/devicetree/bindings/interrupt-controller/interrupts.txt
 
 Required Properties:
-  - compatible: Compatibility string. Must be 'arasan,sdhci-8.9a'
+  - compatible: Compatibility string. Must be 'arasan,sdhci-8.9a' or
+'arasan,sdhci-4.9a'   
   - reg: From mmc bindings: Register location and length.
   - clocks: From clock bindings: Handles to clock inputs.
   - clock-names: From clock bindings: Tuple including clk_xin and clk_ahb
@@ -18,7 +19,7 @@ Required Properties:
 
 Example:
	sdhci@e010 {
-		compatible = "arasan,sdhci-8.9a";
+		compatible = "arasan,sdhci-8.9a", "arasan,sdhci-4.9a";
		reg = <0xe010 0x1000>;
		clock-names = "clk_xin", "clk_ahb";
		clocks = <&clkc 21>, <&clkc 32>;
-- 
1.8.2.1

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

[PATCH v3 2/3] mmc: host: arasan: Add the support for sdhci-arasan4.9a in sdhci-of-arasan.c.

2015-04-30 Thread Suman Tripathi
This patch adds the quirks and compatible string in sdhci-of-arasan.c
to support sdhci-arasan4.9a version of controller.

Signed-off-by: Suman Tripathi stripa...@apm.com
---
 drivers/mmc/host/sdhci-of-arasan.c | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/drivers/mmc/host/sdhci-of-arasan.c 
b/drivers/mmc/host/sdhci-of-arasan.c
index 981d66e..92a4222 100644
--- a/drivers/mmc/host/sdhci-of-arasan.c
+++ b/drivers/mmc/host/sdhci-of-arasan.c
@@ -20,6 +20,7 @@
  */

 #include <linux/module.h>
+#include <linux/of_device.h>
 #include "sdhci-pltfm.h"

 #define SDHCI_ARASAN_CLK_CTRL_OFFSET   0x2c
@@ -169,6 +170,11 @@ static int sdhci_arasan_probe(struct platform_device *pdev)
goto clk_disable_all;
}

+	if (of_device_is_compatible(pdev->dev.of_node, "arasan,sdhci-4.9a")) {
+		host->quirks |= SDHCI_QUIRK_NO_HISPD_BIT;
+		host->quirks2 |= SDHCI_QUIRK2_HOST_NO_CMD23;
+	}
+
	sdhci_get_of_property(pdev);
	pltfm_host = sdhci_priv(host);
	pltfm_host->priv = sdhci_arasan;
@@ -206,6 +212,7 @@ static int sdhci_arasan_remove(struct platform_device *pdev)

 static const struct of_device_id sdhci_arasan_of_match[] = {
	{ .compatible = "arasan,sdhci-8.9a" },
+	{ .compatible = "arasan,sdhci-4.9a" },
{ }
 };
 MODULE_DEVICE_TABLE(of, sdhci_arasan_of_match);
--
1.8.2.1

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

[PATCH v3 1/3] arm64: dts: Add the arasan sdhc nodes in apm-storm.dtsi.

2015-04-30 Thread Suman Tripathi
This patch adds the arasan sdhc nodes to reuse the of-arasan
driver for APM X-Gene SoC.

Signed-off-by: Suman Tripathi stripa...@apm.com
---
 arch/arm64/boot/dts/apm-mustang.dts |  4 
 arch/arm64/boot/dts/apm-storm.dtsi  | 44 +
 2 files changed, 48 insertions(+)

diff --git a/arch/arm64/boot/dts/apm-mustang.dts 
b/arch/arm64/boot/dts/apm-mustang.dts
index 8eb6d94..d0e52a9 100644
--- a/arch/arm64/boot/dts/apm-mustang.dts
+++ b/arch/arm64/boot/dts/apm-mustang.dts
@@ -44,3 +44,7 @@
 &xgenet {
	status = "ok";
 };
+
+&sdhc0 {
+	status = "ok";
+};
diff --git a/arch/arm64/boot/dts/apm-storm.dtsi 
b/arch/arm64/boot/dts/apm-storm.dtsi
index 87d3205..d6c2216 100644
--- a/arch/arm64/boot/dts/apm-storm.dtsi
+++ b/arch/arm64/boot/dts/apm-storm.dtsi
@@ -144,6 +144,40 @@
clock-output-names = socplldiv2;
};

+	ahbclk: ahbclk@1f2ac000 {
+		compatible = "apm,xgene-device-clock";
+		#clock-cells = <1>;
+		clocks = <&socplldiv2 0>;
+		reg = <0x0 0x1f2ac000 0x0 0x1000
+			0x0 0x1700 0x0 0x2000>;
+		reg-names = "csr-reg", "div-reg";
+		csr-offset = <0x0>;
+		csr-mask = <0x1>;
+		enable-offset = <0x8>;
+		enable-mask = <0x1>;
+		divider-offset = <0x164>;
+		divider-width = <0x5>;
+		divider-shift = <0x0>;
+		clock-output-names = "ahbclk";
+	};
+
+	sdioclk: sdioclk@1f2ac000 {
+		compatible = "apm,xgene-device-clock";
+		#clock-cells = <1>;
+		clocks = <&socplldiv2 0>;
+		reg = <0x0 0x1f2ac000 0x0 0x1000
+			0x0 0x1700 0x0 0x2000>;
+		reg-names = "csr-reg", "div-reg";
+		csr-offset = <0x0>;
+		csr-mask = <0x2>;
+		enable-offset = <0x8>;
+		enable-mask = <0x2>;
+		divider-offset = <0x178>;
+		divider-width = <0x8>;
+		divider-shift = <0x0>;
+		clock-output-names = "sdioclk";
+	};
+

	qmlclk: qmlclk {
		compatible = "apm,xgene-device-clock";
		#clock-cells = <1>;
@@ -503,6 +537,16 @@
		interrupts = <0x0 0x4f 0x4>;
	};

+	sdhc0: sdhc@1c00 {
+		device_type = "sdhc";
+		compatible = "arasan,sdhci-8.9a", "arasan,sdhci-4.9a";
+		reg = <0x0 0x1c00 0x0 0x100>;
+		interrupts = <0x0 0x49 0x4>;
+		dma-coherent;
+		clock-names = "clk_xin", "clk_ahb";
+		clocks = <&sdioclk 0>, <&ahbclk 0>;
+	};
+
	phy1: phy@1f21a000 {
		compatible = "apm,xgene-phy";
		reg = <0x0 0x1f21a000 0x0 0x100>;
--
1.8.2.1

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

[PATCH v3 0/3] Add SDHCI support for APM X-Gene SoC using ARASAN SDHCI controller.

2015-04-30 Thread Suman Tripathi
This patch adds the SDHCI support for APM X-Gene SoC using ARASAN SDHCI 
controller.

v1 change:
 * Use the CONFIG_ARM64_DMA_HAS_IOMMU for dma-mapping.

v2 change:
 * Drop the IOMMU support and switching to PIO mode for arasan.
   controller integrated inside APM X-Gene SoC.

v3 change:
 * Change the sdhci-of-arasan.c to support arasan4.9a.
 * Add quirks for arasan4.9a.

Signed-off-by: Suman Tripathi stripa...@apm.com
---

Suman Tripathi (3):
  arm64: dts: Add the arasan sdhc nodes in apm-storm.dtsi.
  mmc: host: arasan: Add the support for sdhci-arasan4.9a in
sdhci-of-arasan.c
  Documentation: mmc: Update Arasan SDHC documentation to support 4.9a
version of Arasan SDHC controller.

 .../devicetree/bindings/mmc/arasan,sdhci.txt   |  5 ++-
 arch/arm64/boot/dts/apm-mustang.dts|  4 ++
 arch/arm64/boot/dts/apm-storm.dtsi | 44 ++
 drivers/mmc/host/sdhci-of-arasan.c |  7 
 4 files changed, 58 insertions(+), 2 deletions(-)

--
1.8.2.1

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

[PATCH V2 1/2] mm/thp: Use new functions to clear pmd on splitting and collapse

2015-04-30 Thread Aneesh Kumar K.V
Some arch may require an explicit IPI before a THP PMD split or
collapse. This enables us to use local_irq_disable to prevent
a parallel THP PMD split or collapse.
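
An illustrative sketch (not the kernel's actual code) of the pairing this
relies on: lockless page-table walkers keep interrupts disabled, so an
arch pmdp_collapse_flush()/splitting flush that broadcasts an IPI cannot
return until every such walker has left its critical section.

#include <linux/mm.h>

static void lockless_pmd_read(pmd_t *pmdp, pmd_t *out)
{
	local_irq_disable();	/* holds off the split/collapse IPI */
	/*
	 * While interrupts are off, a parallel THP split or collapse that
	 * waits for its IPI acknowledgement cannot complete, so the PMD
	 * (and the page table it points to) stays stable for this read.
	 */
	*out = *pmdp;
	local_irq_enable();
}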

Signed-off-by: Aneesh Kumar K.V aneesh.ku...@linux.vnet.ibm.com
---
 include/asm-generic/pgtable.h | 32 
 mm/huge_memory.c  |  9 +
 2 files changed, 37 insertions(+), 4 deletions(-)

diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
index fe617b7e4be6..e95c697bef25 100644
--- a/include/asm-generic/pgtable.h
+++ b/include/asm-generic/pgtable.h
@@ -184,6 +184,38 @@ static inline void pmdp_set_wrprotect(struct mm_struct *mm,
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 #endif
 
+#ifndef __HAVE_ARCH_PMDP_SPLITTING_FLUSH_NOTIFY
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+#define pmdp_splitting_flush_notify pmdp_clear_flush_notify
+#else
+static inline void pmdp_splitting_flush_notify(struct vm_area_struct *vma,
+  unsigned long address,
+  pmd_t *pmdp)
+{
+   BUILD_BUG();
+}
+#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
+#endif
+
+#ifndef __HAVE_ARCH_PMDP_COLLAPSE_FLUSH
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+static inline pmd_t pmdp_collapse_flush(struct vm_area_struct *vma,
+  unsigned long address,
+  pmd_t *pmdp)
+{
+   return pmdp_clear_flush(vma, address, pmdp);
+}
+#else
+static inline pmd_t pmdp_collapse_flush(struct vm_area_struct *vma,
+  unsigned long address,
+  pmd_t *pmdp)
+{
+   BUILD_BUG();
+   return __pmd(0);
+}
+#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
+#endif
+
 #ifndef __HAVE_ARCH_PGTABLE_DEPOSIT
 extern void pgtable_trans_huge_deposit(struct mm_struct *mm, pmd_t *pmdp,
   pgtable_t pgtable);
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index cce4604c192f..30c1b46fcf6d 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2187,7 +2187,7 @@ static void collapse_huge_page(struct mm_struct *mm,
 * huge and small TLB entries for the same virtual address
 * to avoid the risk of CPU bugs in that area.
 */
-   _pmd = pmdp_clear_flush(vma, address, pmd);
+   _pmd = pmdp_collapse_flush(vma, address, pmd);
spin_unlock(pmd_ptl);
mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
 
@@ -2606,9 +2606,10 @@ static void __split_huge_pmd_locked(struct 
vm_area_struct *vma, pmd_t *pmd,
 
write = pmd_write(*pmd);
young = pmd_young(*pmd);
-
-   /* leave pmd empty until pte is filled */
-   pmdp_clear_flush_notify(vma, haddr, pmd);
+   /*
+* leave pmd empty until pte is filled.
+*/
+   pmdp_splitting_flush_notify(vma, haddr, pmd);
 
pgtable = pgtable_trans_huge_withdraw(mm, pmd);
pmd_populate(mm, _pmd, pgtable);
-- 
2.1.4

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

[PATCH V2 2/2] powerpc/thp: Remove _PAGE_SPLITTING and related code

2015-04-30 Thread Aneesh Kumar K.V
With the new thp refcounting we don't need to mark the PMD splitting.
Drop the code to handle this.

Signed-off-by: Aneesh Kumar K.V aneesh.ku...@linux.vnet.ibm.com
---
 arch/powerpc/include/asm/kvm_book3s_64.h |   6 --
 arch/powerpc/include/asm/pgtable-ppc64.h |  29 ++--
 arch/powerpc/mm/hugepage-hash64.c|   3 -
 arch/powerpc/mm/hugetlbpage.c|   2 +-
 arch/powerpc/mm/pgtable_64.c | 111 ---
 mm/gup.c |   2 +-
 6 files changed, 52 insertions(+), 101 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_book3s_64.h 
b/arch/powerpc/include/asm/kvm_book3s_64.h
index 2d81e202bdcc..9a96fe3caa48 100644
--- a/arch/powerpc/include/asm/kvm_book3s_64.h
+++ b/arch/powerpc/include/asm/kvm_book3s_64.h
@@ -298,12 +298,6 @@ static inline pte_t kvmppc_read_update_linux_pte(pte_t 
*ptep, int writing,
cpu_relax();
continue;
}
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-   /* If hugepage and is trans splitting return None */
-	if (unlikely(hugepage &&
-		     pmd_trans_splitting(pte_pmd(old_pte))))
-		return __pte(0);
-#endif
	/* If pte is not present return None */
	if (unlikely(!(old_pte & _PAGE_PRESENT)))
return __pte(0);
diff --git a/arch/powerpc/include/asm/pgtable-ppc64.h 
b/arch/powerpc/include/asm/pgtable-ppc64.h
index 843cb35e6add..655dde8e9683 100644
--- a/arch/powerpc/include/asm/pgtable-ppc64.h
+++ b/arch/powerpc/include/asm/pgtable-ppc64.h
@@ -361,11 +361,6 @@ void pgtable_cache_init(void);
 #endif /* __ASSEMBLY__ */
 
 /*
- * THP pages can't be special. So use the _PAGE_SPECIAL
- */
-#define _PAGE_SPLITTING _PAGE_SPECIAL
-
-/*
  * We need to differentiate between explicit huge page and THP huge
  * page, since THP huge page also need to track real subpage details
  */
@@ -375,8 +370,7 @@ void pgtable_cache_init(void);
  * set of bits not changed in pmd_modify.
  */
 #define _HPAGE_CHG_MASK (PTE_RPN_MASK | _PAGE_HPTEFLAGS |  \
-_PAGE_DIRTY | _PAGE_ACCESSED | _PAGE_SPLITTING | \
-_PAGE_THP_HUGE)
+_PAGE_DIRTY | _PAGE_ACCESSED | _PAGE_THP_HUGE)
 
 #ifndef __ASSEMBLY__
 /*
@@ -458,13 +452,6 @@ static inline int pmd_trans_huge(pmd_t pmd)
	return (pmd_val(pmd) & 0x3) && (pmd_val(pmd) & _PAGE_THP_HUGE);
 }
 
-static inline int pmd_trans_splitting(pmd_t pmd)
-{
-   if (pmd_trans_huge(pmd))
-		return pmd_val(pmd) & _PAGE_SPLITTING;
-   return 0;
-}
-
 extern int has_transparent_hugepage(void);
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 
@@ -517,12 +504,6 @@ static inline pmd_t pmd_mknotpresent(pmd_t pmd)
return pmd;
 }
 
-static inline pmd_t pmd_mksplitting(pmd_t pmd)
-{
-   pmd_val(pmd) |= _PAGE_SPLITTING;
-   return pmd;
-}
-
 #define __HAVE_ARCH_PMD_SAME
 static inline int pmd_same(pmd_t pmd_a, pmd_t pmd_b)
 {
@@ -577,8 +558,12 @@ static inline void pmdp_set_wrprotect(struct mm_struct 
*mm, unsigned long addr,
pmd_hugepage_update(mm, addr, pmdp, _PAGE_RW, 0);
 }
 
-#define __HAVE_ARCH_PMDP_SPLITTING_FLUSH
-extern void pmdp_splitting_flush(struct vm_area_struct *vma,
+#define __HAVE_ARCH_PMDP_SPLITTING_FLUSH_NOTIFY
+extern void pmdp_splitting_flush_notify(struct vm_area_struct *vma,
+   unsigned long address, pmd_t *pmdp);
+
+#define __HAVE_ARCH_PMDP_COLLAPSE_FLUSH
+extern pmd_t pmdp_collapse_flush(struct vm_area_struct *vma,
 unsigned long address, pmd_t *pmdp);
 
 #define __HAVE_ARCH_PGTABLE_DEPOSIT
diff --git a/arch/powerpc/mm/hugepage-hash64.c 
b/arch/powerpc/mm/hugepage-hash64.c
index 86686514ae13..078f7207afd2 100644
--- a/arch/powerpc/mm/hugepage-hash64.c
+++ b/arch/powerpc/mm/hugepage-hash64.c
@@ -39,9 +39,6 @@ int __hash_page_thp(unsigned long ea, unsigned long access, 
unsigned long vsid,
/* If PMD busy, retry the access */
	if (unlikely(old_pmd & _PAGE_BUSY))
		return 0;
-	/* If PMD is trans splitting retry the access */
-	if (unlikely(old_pmd & _PAGE_SPLITTING))
-		return 0;
	/* If PMD permissions don't match, take page fault */
	if (unlikely(access & ~old_pmd))
return 1;
diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c
index f30ae0f7f570..dfd7db0cfbee 100644
--- a/arch/powerpc/mm/hugetlbpage.c
+++ b/arch/powerpc/mm/hugetlbpage.c
@@ -1008,7 +1008,7 @@ pte_t *find_linux_pte_or_hugepte(pgd_t *pgdir, unsigned 
long ea, unsigned *shift
 * hpte invalidate
 *
 */
-   if (pmd_none(pmd) || pmd_trans_splitting(pmd))
+   if (pmd_none(pmd))
 

[PATCH v2] powerpc/eeh: Delay probing EEH device during hotplug

2015-04-30 Thread Gavin Shan
Commit ff57b454ddb9 (powerpc/eeh: Do probe on pci_dn) probes EEH
devices at an early stage, which is reasonable for the pSeries platform.
However, it is wrong for the PowerNV platform because in the hotplug
case the PE# isn't determined until the resources (IO and MMIO) are
assigned to the PE. So we have to delay probing EEH devices on the
PowerNV platform until the PE# is assigned.

Fixes: ff57b454ddb9 (powerpc/eeh: Do probe on pci_dn)
Signed-off-by: Gavin Shan gws...@linux.vnet.ibm.com
---
v2: Corrected commit ID
---
 arch/powerpc/kernel/eeh.c | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/arch/powerpc/kernel/eeh.c b/arch/powerpc/kernel/eeh.c
index b798c86..04b5d94 100644
--- a/arch/powerpc/kernel/eeh.c
+++ b/arch/powerpc/kernel/eeh.c
@@ -1061,6 +1061,9 @@ void eeh_add_device_early(struct pci_dn *pdn)
if (!edev || !eeh_enabled())
return;
 
+   if (!eeh_has_flag(EEH_PROBE_MODE_DEVTREE))
+   return;
+
/* USB Bus children of PCI devices will not have BUID's */
	phb = edev->phb;
if (NULL == phb ||
@@ -1115,6 +1118,9 @@ void eeh_add_device_late(struct pci_dev *dev)
return;
}
 
+   if (eeh_has_flag(EEH_PROBE_MODE_DEV))
+		eeh_ops->probe(pdn, NULL);
+
/*
 * The EEH cache might not be removed correctly because of
 * unbalanced kref to the device during unplug time, which
-- 
2.1.0

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [PATCH kernel v9 31/32] vfio: powerpc/spapr: Support multiple groups in one container if possible

2015-04-30 Thread David Gibson
On Thu, Apr 30, 2015 at 07:33:09PM +1000, Alexey Kardashevskiy wrote:
 On 04/30/2015 05:22 PM, David Gibson wrote:
 On Sat, Apr 25, 2015 at 10:14:55PM +1000, Alexey Kardashevskiy wrote:
 At the moment only one group per container is supported.
 POWER8 CPUs have a more flexible design and allow having 2 TCE tables per
 IOMMU group so we can relax this limitation and support multiple groups
 per container.
 
 It's not obvious why allowing multiple TCE tables per PE has any
 bearing on allowing multiple groups per container.
 
 
 This patchset is a global TCE tables rework (patches 1..30, roughly) with 2
 outcomes:
 1. reusing the same IOMMU table for multiple groups - patch 31;
 2. allowing dynamic create/remove of IOMMU tables - patch 32.
 
 I can remove this one from the patchset and post it separately later but
 since 1..30 aim to support both 1) and 2), I'd think I better keep them all
 together (might explain some of changes I do in 1..30).

The combined patchset is fine.  My comment is because your commit
message says that multiple groups are possible *because* 2 TCE tables
per group are allowed, and it's not at all clear why one follows from
the other.

 This adds TCE table descriptors to a container and uses 
 iommu_table_group_ops
 to create/set DMA windows on IOMMU groups so the same TCE tables will be
 shared between several IOMMU groups.
 
 Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru
 [aw: for the vfio related changes]
 Acked-by: Alex Williamson alex.william...@redhat.com
 ---
 Changes:
 v7:
 * updated doc
 ---
   Documentation/vfio.txt  |   8 +-
   drivers/vfio/vfio_iommu_spapr_tce.c | 268 
  ++--
   2 files changed, 199 insertions(+), 77 deletions(-)
 
 diff --git a/Documentation/vfio.txt b/Documentation/vfio.txt
 index 94328c8..7dcf2b5 100644
 --- a/Documentation/vfio.txt
 +++ b/Documentation/vfio.txt
 @@ -289,10 +289,12 @@ PPC64 sPAPR implementation note
 
   This implementation has some specifics:
 
 -1) Only one IOMMU group per container is supported as an IOMMU group
 -represents the minimal entity which isolation can be guaranteed for and
 -groups are allocated statically, one per a Partitionable Endpoint (PE)
 +1) On older systems (POWER7 with P5IOC2/IODA1) only one IOMMU group per
 +container is supported as an IOMMU table is allocated at the boot time,
 +one table per a IOMMU group which is a Partitionable Endpoint (PE)
   (PE is often a PCI domain but not always).
 
 I thought the more fundamental problem was that different PEs tended
 to use disjoint bus address ranges, so even by duplicating put_tce
 across PEs you couldn't have a common address space.
 
 
 Sorry, I am not following you here.
 
 By duplicating put_tce, I can have multiple IOMMU groups on the same virtual
 PHB in QEMU, [PATCH qemu v7 04/14] spapr_pci_vfio: Enable multiple groups
 per container does this, the address ranges will be the same.

Oh, ok.  For some reason I thought that (at least on the older
machines) the different PEs used different and not easily changeable
DMA windows in bus addresses space.

 What I cannot do on p5ioc2 is programming the same table to multiple
 physical PHBs (or I could but it is very different than IODA2 and pretty
 ugly and might not always be possible because I would have to allocate these
 pages from some common pool and face problems like fragmentation).

So allowing multiple groups per container should be possible (at the
kernel rather than qemu level) by writing the same value to multiple
TCE tables.  I guess it's not worth doing for just the almost-obsolete
IOMMUs though.

 
 
 
 +Newer systems (POWER8 with IODA2) have improved hardware design which 
 allows
 +to remove this limitation and have multiple IOMMU groups per a VFIO 
 container.
 
   2) The hardware supports so called DMA windows - the PCI address range
   within which DMA transfer is allowed, any attempt to access address space
 diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c 
 b/drivers/vfio/vfio_iommu_spapr_tce.c
 index a7d6729..970e3a2 100644
 --- a/drivers/vfio/vfio_iommu_spapr_tce.c
 +++ b/drivers/vfio/vfio_iommu_spapr_tce.c
 @@ -82,6 +82,11 @@ static void decrement_locked_vm(long npages)
* into DMA'ble space using the IOMMU
*/
 
 +struct tce_iommu_group {
 +   struct list_head next;
 +   struct iommu_group *grp;
 +};
 +
   /*
* The container descriptor supports only a single group per container.
* Required by the API as the container is not supplied with the IOMMU 
  group
 @@ -89,10 +94,11 @@ static void decrement_locked_vm(long npages)
*/
   struct tce_container {
 struct mutex lock;
 -   struct iommu_group *grp;
 bool enabled;
 unsigned long locked_pages;
 bool v2;
 +   struct iommu_table tables[IOMMU_TABLE_GROUP_MAX_TABLES];
 
 Hrm,  so here we have more copies of the full iommu_table structures,
 which again muddies the lifetime.  The table_group pointer is
 presumably meaningless in these copies, which seems dangerously
 

Re: [PATCH kernel v9 26/32] powerpc/iommu: Add userspace view of TCE table

2015-04-30 Thread David Gibson
On Fri, May 01, 2015 at 02:01:17PM +1000, Alexey Kardashevskiy wrote:
 On 04/29/2015 04:31 PM, David Gibson wrote:
 On Sat, Apr 25, 2015 at 10:14:50PM +1000, Alexey Kardashevskiy wrote:
 In order to support memory pre-registration, we need a way to track
 the use of every registered memory region and only allow unregistration
 if a region is not in use anymore. So we need a way to tell from what
 region the just cleared TCE was from.
 
 This adds a userspace view of the TCE table into iommu_table struct.
 It contains userspace address, one per TCE entry. The table is only
 allocated when the ownership over an IOMMU group is taken which means
 it is only used from outside of the powernv code (such as VFIO).
 
 Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru
 ---
 Changes:
 v9:
 * fixed code flow in error cases added in v8
 
 v8:
 * added ENOMEM on failed vzalloc()
 ---
   arch/powerpc/include/asm/iommu.h  |  6 ++
   arch/powerpc/kernel/iommu.c   | 18 ++
   arch/powerpc/platforms/powernv/pci-ioda.c | 22 --
   3 files changed, 44 insertions(+), 2 deletions(-)
 
 diff --git a/arch/powerpc/include/asm/iommu.h 
 b/arch/powerpc/include/asm/iommu.h
 index 7694546..1472de3 100644
 --- a/arch/powerpc/include/asm/iommu.h
 +++ b/arch/powerpc/include/asm/iommu.h
 @@ -111,9 +111,15 @@ struct iommu_table {
 unsigned long *it_map;   /* A simple allocation bitmap for now */
 unsigned long  it_page_shift;/* table iommu page size */
 struct iommu_table_group *it_table_group;
 +   unsigned long *it_userspace; /* userspace view of the table */
 
 A single unsigned long doesn't seem like enough.
 
 Why single? This is an array.

As in single per page.

  How do you know
 which process's address space this address refers to?
 
 It is the current task. Multiple userspaces cannot use the same
 container/tables.

Where is that enforced?

More to the point, that's a VFIO constraint, but it's here affecting
the design of a structure owned by the platform code.

[snip]
   static void pnv_pci_ioda_setup_opal_tce_kill(struct pnv_phb *phb,
 @@ -2062,12 +2071,21 @@ static long pnv_pci_ioda2_create_table(struct 
 iommu_table_group *table_group,
	int nid = pe->phb->hose->node;
	__u64 bus_offset = num ? pe->tce_bypass_base : 0;
	long ret;
+	unsigned long *uas, uas_cb = sizeof(*uas) * (window_size >> page_shift);
 +
 +   uas = vzalloc(uas_cb);
 +   if (!uas)
 +   return -ENOMEM;
 
 I don't see why this is allocated both here as well as in
 take_ownership.
 
 Where else? The only alternative is vfio_iommu_spapr_tce but I really do not
 want to touch iommu_table fields there.

Well to put it another way, why isn't take_ownership calling create
itself (or at least a common helper).

Clearly the it_userspace table needs to have lifetime which matches
the TCE table itself, so there should be a single function that marks
the beginning of that joint lifetime.

 Isn't this function used for core-kernel users of the
 iommu as well, in which case it shouldn't need the it_userspace.
 
 
 No. This is an iommu_table_group_ops callback which calls what the platform
 code calls (pnv_pci_create_table()) plus allocates this it_userspace thing.
 The callback is only called from VFIO.

Ok.

As touched on above it seems more like this should be owned by VFIO
code than the platform code.

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson


pgpdapItpnDMX.pgp
Description: PGP signature
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

[PATCH V2 0/2] Remove _PAGE_SPLITTING from ppc64

2015-04-30 Thread Aneesh Kumar K.V
The changes are on top of what is posted  at

 
http://mid.gmane.org/1429823043-157133-1-git-send-email-kirill.shute...@linux.intel.com

 git://git.kernel.org/pub/scm/linux/kernel/git/kas/linux.git thp/refcounting/v5

Changes from V1:
* Fold part of patch 3 to 1 and 2
* Drop patch 3.
* Make generic version of pmdp_splitting_flush_notify inline.

Aneesh Kumar K.V (2):
  mm/thp: Use new functions to clear pmd on splitting and collapse
  powerpc/thp: Remove _PAGE_SPLITTING and related code

 arch/powerpc/include/asm/kvm_book3s_64.h |   6 --
 arch/powerpc/include/asm/pgtable-ppc64.h |  29 ++--
 arch/powerpc/mm/hugepage-hash64.c|   3 -
 arch/powerpc/mm/hugetlbpage.c|   2 +-
 arch/powerpc/mm/pgtable_64.c | 111 ---
 include/asm-generic/pgtable.h|  32 +
 mm/gup.c |   2 +-
 mm/huge_memory.c |   9 +--
 8 files changed, 89 insertions(+), 105 deletions(-)

-- 
2.1.4

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [PATCH kernel v9 31/32] vfio: powerpc/spapr: Support multiple groups in one container if possible

2015-04-30 Thread David Gibson
On Fri, May 01, 2015 at 10:46:08AM +1000, Benjamin Herrenschmidt wrote:
 On Thu, 2015-04-30 at 19:33 +1000, Alexey Kardashevskiy wrote:
  On 04/30/2015 05:22 PM, David Gibson wrote:
   On Sat, Apr 25, 2015 at 10:14:55PM +1000, Alexey Kardashevskiy wrote:
   At the moment only one group per container is supported.
  POWER8 CPUs have a more flexible design and allow having 2 TCE tables per
   IOMMU group so we can relax this limitation and support multiple groups
   per container.
  
   It's not obvious why allowing multiple TCE tables per PE has any
  bearing on allowing multiple groups per container.
  
  
  This patchset is a global TCE tables rework (patches 1..30, roughly) with 2 
  outcomes:
  1. reusing the same IOMMU table for multiple groups - patch 31;
  2. allowing dynamic create/remove of IOMMU tables - patch 32.
  
  I can remove this one from the patchset and post it separately later but 
  since 1..30 aim to support both 1) and 2), I'd think I better keep them all 
  together (might explain some of changes I do in 1..30).
 
 I think you are talking past each other :-)
 
 But yes, having 2 tables per group is orthogonal to the ability of
 having multiple groups per container.
 
 The latter is made possible on P8 in large part because each PE has its
 own DMA address space (unlike P5IOC2 or P7IOC where a single address
 space is segmented).
 
 Also, on P8 you can actually make the TVT entries point to the same
 table in memory, thus removing the need to duplicate the actual
 tables (though you still have to duplicate the invalidations). I would
 however recommend only sharing the table that way within a chip/node.
 
  .../..
 
  
   -1) Only one IOMMU group per container is supported as an IOMMU group
   -represents the minimal entity which isolation can be guaranteed for and
   -groups are allocated statically, one per a Partitionable Endpoint (PE)
   +1) On older systems (POWER7 with P5IOC2/IODA1) only one IOMMU group per
   +container is supported as an IOMMU table is allocated at the boot time,
   +one table per a IOMMU group which is a Partitionable Endpoint (PE)
 (PE is often a PCI domain but not always).
 
   I thought the more fundamental problem was that different PEs tended
   to use disjoint bus address ranges, so even by duplicating put_tce
   across PEs you couldn't have a common address space.
 
 Yes. This is the problem with P7IOC and earlier. It *could* be doable on
 P7IOC by making them the same PE but let's not go there.
 
  Sorry, I am not following you here.
  
  By duplicating put_tce, I can have multiple IOMMU groups on the same 
  virtual PHB in QEMU, [PATCH qemu v7 04/14] spapr_pci_vfio: Enable multiple 
  groups per container does this, the address ranges will be the same.
 
 But that is only possible on P8 because only there do we have separate
 address spaces between PEs.
 
  What I cannot do on p5ioc2 is programming the same table to multiple 
  physical PHBs (or I could but it is very different than IODA2 and pretty 
  ugly and might not always be possible because I would have to allocate 
  these pages from some common pool and face problems like fragmentation).
 
 And P7IOC has a similar issue. The DMA address top bits indexes the
 window on P7IOC within a shared address space. It's possible to
 configure a TVT to cover multiple devices but with very serious
 limitations.

Ok.  To check my understanding does this sound reasonable:

  * The table_group more-or-less represents a PE, but in a way you can
reference without first knowing the specific IOMMU hardware type.

  * When attaching multiple groups to the same container, the first PE
(i.e. table_group) attached is used as a representative so that
subsequent groups can be checked for compatibility with the first
PE and therefore all PEs currently included in the container

 - This is why the table_group appears in some places where it
   doesn't seem sensible from a pure object ownership point of
   view
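
A hypothetical sketch of that "first group as representative" check.
The container structure, its first_group field and the specific fields
compared are invented for illustration, not the symbols used in this
series.

struct spapr_container_sketch {
	struct iommu_table_group *first_group;	/* representative PE */
};

static long sketch_attach_group(struct spapr_container_sketch *container,
				struct iommu_table_group *table_group)
{
	struct iommu_table_group *first = container->first_group;

	if (!first) {
		/* First PE attached: it defines the container's properties. */
		container->first_group = table_group;
		return 0;
	}

	/* Later PEs must look compatible with the representative PE. */
	if (table_group->tce32_start != first->tce32_start ||
	    table_group->tce32_size != first->tce32_size)
		return -EPERM;

	return 0;
}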

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson


pgpJLfJZ6lyD4.pgp
Description: PGP signature
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [PATCH kernel v9 27/32] powerpc/iommu/ioda2: Add get_table_size() to calculate the size of future table

2015-04-30 Thread David Gibson
On Fri, May 01, 2015 at 02:10:58PM +1000, Alexey Kardashevskiy wrote:
 On 04/29/2015 04:40 PM, David Gibson wrote:
 On Sat, Apr 25, 2015 at 10:14:51PM +1000, Alexey Kardashevskiy wrote:
 This adds a way for the IOMMU user to know how much a new table will
 use so it can be accounted in the locked_vm limit before allocation
 happens.
 
 This stores the allocated table size in pnv_pci_create_table()
 so the locked_vm counter can be updated correctly when a table is
 being disposed.
 
 This defines an iommu_table_group_ops callback to let VFIO know
 how much memory will be locked if a table is created.
 
 Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru
 ---
 Changes:
 v9:
 * reimplemented the whole patch
 ---
   arch/powerpc/include/asm/iommu.h  |  5 +
   arch/powerpc/platforms/powernv/pci-ioda.c | 14 
   arch/powerpc/platforms/powernv/pci.c  | 36 
  +++
   arch/powerpc/platforms/powernv/pci.h  |  2 ++
   4 files changed, 57 insertions(+)
 
 diff --git a/arch/powerpc/include/asm/iommu.h 
 b/arch/powerpc/include/asm/iommu.h
 index 1472de3..9844c106 100644
 --- a/arch/powerpc/include/asm/iommu.h
 +++ b/arch/powerpc/include/asm/iommu.h
 @@ -99,6 +99,7 @@ struct iommu_table {
 unsigned long  it_size;  /* Size of iommu table in entries */
 unsigned long  it_indirect_levels;
 unsigned long  it_level_size;
 +   unsigned long  it_allocated_size;
 unsigned long  it_offset;/* Offset into global table */
 unsigned long  it_base;  /* mapped address of tce table */
 unsigned long  it_index; /* which iommu table this is */
 @@ -155,6 +156,10 @@ extern struct iommu_table *iommu_init_table(struct 
 iommu_table * tbl,
   struct iommu_table_group;
 
   struct iommu_table_group_ops {
 +   unsigned long (*get_table_size)(
 +   __u32 page_shift,
 +   __u64 window_size,
 +   __u32 levels);
 long (*create_table)(struct iommu_table_group *table_group,
 int num,
 __u32 page_shift,
 diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c 
 b/arch/powerpc/platforms/powernv/pci-ioda.c
 index e0be556..7f548b4 100644
 --- a/arch/powerpc/platforms/powernv/pci-ioda.c
 +++ b/arch/powerpc/platforms/powernv/pci-ioda.c
 @@ -2062,6 +2062,18 @@ static void pnv_pci_ioda2_setup_bypass_pe(struct 
 pnv_phb *phb,
   }
 
   #ifdef CONFIG_IOMMU_API
 +static unsigned long pnv_pci_ioda2_get_table_size(__u32 page_shift,
 +   __u64 window_size, __u32 levels)
 +{
 +   unsigned long ret = pnv_get_table_size(page_shift, window_size, levels);
 +
 +   if (!ret)
 +   return ret;
 +
 +   /* Add size of it_userspace */
+	return ret + (window_size >> page_shift) * sizeof(unsigned long);
 
 This doesn't make much sense.  The userspace view can't possibly be a
 property of the specific low-level IOMMU model.
 
 
 This it_userspace thing is all about memory preregistration.
 
 I need some way to track how many actual mappings the
 mm_iommu_table_group_mem_t has in order to decide whether to allow
 unregistering or not.
 
 When I clear TCE, I can read the old value which is host physical address
 which I cannot use to find the preregistered region and adjust the mappings
 counter; I can only use userspace addresses for this (not even guest
 physical addresses as it is VFIO and probably no KVM).
 
 So I have to keep userspace addresses somewhere, one per IOMMU page, and the
 iommu_table seems a natural place for this.

Well.. sort of.  But as noted elsewhere this pulls VFIO specific
constraints into a platform code structure.  And whether you get this
table depends on the platform IOMMU type rather than on what VFIO
wants to do with it, which doesn't make sense.

What might make more sense is an opaque pointer in iommu_table for use
by the table owner (in the take_ownership sense).  The pointer would
be stored in iommu_table, but VFIO is responsible for populating and
managing its contents.

Or you could just put the userspace mappings in the container.
Although you might want a different data structure in that case.

The other thing to bear in mind is that registered regions are likely
to be large contiguous blocks in user addresses, though obviously not
contiguous in physical addr.  So you might be able to compactify this
information by storing it as a list of variable length blocks in
userspace address space, rather than a per-page address.
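
A hypothetical sketch of that compacted form: one descriptor per
registered block of userspace memory instead of a per-IOMMU-page
address. All names here are invented for illustration.

#include <linux/list.h>
#include <linux/atomic.h>
#include <linux/mm.h>

struct prereg_block {
	struct list_head next;	/* chained off the VFIO container */
	unsigned long ua;	/* userspace start address */
	unsigned long entries;	/* length in system pages */
	atomic_t mappings;	/* live TCE mappings into this block */
	unsigned long hpas[];	/* host physical addrs, one per page */
};

/* Find the registered block covering a userspace address, if any. */
static struct prereg_block *prereg_block_find(struct list_head *head,
					      unsigned long ua)
{
	struct prereg_block *blk;

	list_for_each_entry(blk, head, next)
		if (ua >= blk->ua &&
		    ua < blk->ua + (blk->entries << PAGE_SHIFT))
			return blk;

	return NULL;
}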



But... isn't there a bigger problem here?  As Paulus was pointing out,
there's nothing guaranteeing the page tables continue to contain the
same page as was there at gup() time.

What's going to happen if you REGISTER a memory region, then mremap()
over it?  Then attempt to PUT_TCE a page in the region? Or what if you
mremap() it to someplace else then try to PUT_TCE a page there? Or
REGISTER it again in its new location?

-- 
David Gibson| I'll have my music baroque, and my code
david AT 

Re: [PATCH kernel v9 29/32] vfio: powerpc/spapr: Register memory and define IOMMU v2

2015-04-30 Thread David Gibson
On Fri, May 01, 2015 at 02:35:23PM +1000, Alexey Kardashevskiy wrote:
 On 04/30/2015 04:55 PM, David Gibson wrote:
 On Sat, Apr 25, 2015 at 10:14:53PM +1000, Alexey Kardashevskiy wrote:
 The existing implementation accounts the whole DMA window in
 the locked_vm counter. This is going to be worse with multiple
 containers and huge DMA windows. Also, real-time accounting would require
 additional tracking of accounted pages due to the page size difference -
 IOMMU uses 4K pages and system uses 4K or 64K pages.
 
 Another issue is that actual pages pinning/unpinning happens on every
 DMA map/unmap request. This does not affect the performance much now as
 we spend way too much time now on switching context between
 guest/userspace/host but this will start to matter when we add in-kernel
 DMA map/unmap acceleration.
 
 This introduces a new IOMMU type for SPAPR - VFIO_SPAPR_TCE_v2_IOMMU.
 New IOMMU deprecates VFIO_IOMMU_ENABLE/VFIO_IOMMU_DISABLE and introduces
 2 new ioctls to register/unregister DMA memory -
 VFIO_IOMMU_SPAPR_REGISTER_MEMORY and VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY -
 which receive user space address and size of a memory region which
 needs to be pinned/unpinned and counted in locked_vm.
 New IOMMU splits physical pages pinning and TCE table update into 2 
 different
 operations. It requires 1) guest pages to be registered first 2) consequent
 map/unmap requests to work only with pre-registered memory.
 For the default single window case this means that the entire guest
 (instead of 2GB) needs to be pinned before using VFIO.
 When a huge DMA window is added, no additional pinning will be
 required, otherwise it would be guest RAM + 2GB.
 
 The new memory registration ioctls are not supported by
 VFIO_SPAPR_TCE_IOMMU. Dynamic DMA window and in-kernel acceleration
 will require memory to be preregistered in order to work.
 
 The accounting is done per the user process.
 
 This advertises v2 SPAPR TCE IOMMU and restricts what the userspace
 can do with v1 or v2 IOMMUs.
 
 Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru
 [aw: for the vfio related changes]
 Acked-by: Alex Williamson alex.william...@redhat.com
 ---
 Changes:
 v9:
 * s/tce_get_hva_cached/tce_iommu_use_page_v2/
 
 v7:
 * now memory is registered per mm (i.e. process)
 * moved memory registration code to powerpc/mmu
 * merged vfio: powerpc/spapr: Define v2 IOMMU into this
 * limited new ioctls to v2 IOMMU
 * updated doc
 * unsupported ioctls return -ENOTTY instead of -EPERM
 
 v6:
 * tce_get_hva_cached() returns hva via a pointer
 
 v4:
 * updated docs
 * s/kzmalloc/vzalloc/
 * in tce_pin_pages()/tce_unpin_pages() removed @vaddr, @size and
 replaced offset with index
 * renamed vfio_iommu_type_register_memory to 
 vfio_iommu_spapr_register_memory
 and removed duplicating vfio_iommu_spapr_register_memory
 ---
   Documentation/vfio.txt  |  23 
   drivers/vfio/vfio_iommu_spapr_tce.c | 230 
  +++-
   include/uapi/linux/vfio.h   |  27 +
   3 files changed, 274 insertions(+), 6 deletions(-)
 
 diff --git a/Documentation/vfio.txt b/Documentation/vfio.txt
 index 96978ec..94328c8 100644
 --- a/Documentation/vfio.txt
 +++ b/Documentation/vfio.txt
 @@ -427,6 +427,29 @@ The code flow from the example above should be 
 slightly changed:
 
 
 
 +5) There is v2 of SPAPR TCE IOMMU. It deprecates VFIO_IOMMU_ENABLE/
 +VFIO_IOMMU_DISABLE and implements 2 new ioctls:
 +VFIO_IOMMU_SPAPR_REGISTER_MEMORY and VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY
 +(which are unsupported in v1 IOMMU).
 
 A summary of the semantic differences between v1 and v2 would be nice.
 At this point it's not really clear to me if there's a case for
 creating v2, or if this could just be done by adding (optional)
 functionality to v1.
 
 v1: memory preregistration is not supported; explicit enable/disable ioctls
 are required
 
 v2: memory preregistration is required; explicit enable/disable are
 prohibited (as they are not needed).
 
 Mixing these in one IOMMU type caused a lot of problems, like: should I
 increment locked_vm by the 32bit window size on enable() or not, and what
 do I do about page pinning on map/unmap (check if it is from registered
 memory and do not pin?).
 
 Having 2 IOMMU models makes everything a lot simpler.

Ok.  Would it simplify it further if you made v2 only usable on IODA2
hardware?

 +PPC64 paravirtualized guests generate a lot of map/unmap requests,
 +and the handling of those includes pinning/unpinning pages and updating
 +mm::locked_vm counter to make sure we do not exceed the rlimit.
 +The v2 IOMMU splits accounting and pinning into separate operations:
 +
 +- VFIO_IOMMU_SPAPR_REGISTER_MEMORY/VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY 
 ioctls
 +receive a user space address and size of the block to be pinned.
 +Bisecting is not supported and VFIO_IOMMU_UNREGISTER_MEMORY is expected to
 +be called with the exact address and size used for registering
 +the memory block. The userspace is not expected 
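
A minimal userspace sketch of the registration step described above (the
ioctl name is from this patch; the struct layout shown is assumed here for
illustration and may differ in the final uAPI; error handling is trimmed):

#include <sys/ioctl.h>
#include <linux/vfio.h>

/* Sketch only: pre-register a region once so that later map/unmap
 * requests work against already-pinned, already-accounted memory. */
static int spapr_preregister(int container_fd, void *vaddr, unsigned long size)
{
	struct vfio_iommu_spapr_register_memory reg = {
		.argsz = sizeof(reg),
		.flags = 0,
		.vaddr = (unsigned long)vaddr,	/* the exact same vaddr/size must */
		.size  = size,			/* later be passed to UNREGISTER  */
	};

	return ioctl(container_fd, VFIO_IOMMU_SPAPR_REGISTER_MEMORY, &reg);
}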

Re: [PATCH] powerpc/cell: Drop cbe-oss-dev mailing list from MAINTAINERS

2015-04-30 Thread Michael Ellerman
On Fri, 2015-05-01 at 11:23 +0800, Jeremy Kerr wrote:
 Hi Michael,
 
  Traffic on the cbe-oss-dev list is more or less non-existent, other than
  CC's from linuxppc.
 
 Plus all that spam that never makes it out of the moderation queue.
 
  It seems like we may as well just send everyone to linuxppc and
  archive the list.
 
 Acked-by: Jeremy Kerr j...@ozlabs.org
 
 [This'll get a mention at our returning from prom_init-farewell party
 too, right?]

You bet, or hell let's just have *another* party, the cbe-oss-dead party.

cheers



Re: [PATCH kernel v9 31/32] vfio: powerpc/spapr: Support multiple groups in one container if possible

2015-04-30 Thread Benjamin Herrenschmidt
On Thu, 2015-04-30 at 19:33 +1000, Alexey Kardashevskiy wrote:
 On 04/30/2015 05:22 PM, David Gibson wrote:
  On Sat, Apr 25, 2015 at 10:14:55PM +1000, Alexey Kardashevskiy wrote:
  At the moment only one group per container is supported.
  POWER8 CPUs have a more flexible design and allow having 2 TCE tables per
  IOMMU group so we can relax this limitation and support multiple groups
  per container.
 
  It's not obvious why allowing multiple TCE tables per PE has any
  bearing on allowing multiple groups per container.
 
 
 This patchset is a global TCE tables rework (patches 1..30, roughly) with 2 
 outcomes:
 1. reusing the same IOMMU table for multiple groups - patch 31;
 2. allowing dynamic create/remove of IOMMU tables - patch 32.
 
 I can remove this one from the patchset and post it separately later but 
 since 1..30 aim to support both 1) and 2), I'd think I better keep them all 
 together (might explain some of changes I do in 1..30).

I think you are talking past each other :-)

But yes, having 2 tables per group is orthogonal to the ability of
having multiple groups per container.

The latter is made possible on P8 in large part because each PE has its
own DMA address space (unlike P5IOC2 or P7IOC where a single address
space is segmented).

Also, on P8 you can actually make the TVT entries point to the same
table in memory, thus removing the need to duplicate the actual
tables (though you still have to duplicate the invalidations). I would
however recommend only sharing the table that way within a chip/node.

 .../..

 
  -1) Only one IOMMU group per container is supported as an IOMMU group
  -represents the minimal entity which isolation can be guaranteed for and
  -groups are allocated statically, one per a Partitionable Endpoint (PE)
  +1) On older systems (POWER7 with P5IOC2/IODA1) only one IOMMU group per
  +container is supported as an IOMMU table is allocated at the boot time,
  +one table per a IOMMU group which is a Partitionable Endpoint (PE)
(PE is often a PCI domain but not always).

  I thought the more fundamental problem was that different PEs tended
  to use disjoint bus address ranges, so even by duplicating put_tce
  across PEs you couldn't have a common address space.

Yes. This is the problem with P7IOC and earlier. It *could* be doable on
P7IOC by making them the same PE but let's not go there.

 Sorry, I am not following you here.
 
 By duplicating put_tce, I can have multiple IOMMU groups on the same 
 virtual PHB in QEMU, [PATCH qemu v7 04/14] spapr_pci_vfio: Enable multiple 
 groups per container does this, the address ranges will be the same.

But that is only possible on P8 because only there do we have separate
address spaces between PEs.

 What I cannot do on p5ioc2 is programming the same table to multiple 
 physical PHBs (or I could but it is very different than IODA2 and pretty 
 ugly and might not always be possible because I would have to allocate 
 these pages from some common pool and face problems like fragmentation).

And P7IOC has a similar issue. The DMA address top bits indexes the
window on P7IOC within a shared address space. It's possible to
configure a TVT to cover multiple devices but with very serious
limitations.

  +Newer systems (POWER8 with IODA2) have an improved hardware design which
  +allows removing this limitation and having multiple IOMMU groups per
  +VFIO container.
 
2) The hardware supports so called DMA windows - the PCI address range
within which DMA transfer is allowed, any attempt to access address space
  diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c 
  b/drivers/vfio/vfio_iommu_spapr_tce.c
  index a7d6729..970e3a2 100644
  --- a/drivers/vfio/vfio_iommu_spapr_tce.c
  +++ b/drivers/vfio/vfio_iommu_spapr_tce.c
  @@ -82,6 +82,11 @@ static void decrement_locked_vm(long npages)
 * into DMA'ble space using the IOMMU
 */
 
  +struct tce_iommu_group {
  +  struct list_head next;
  +  struct iommu_group *grp;
  +};
  +
/*
 * The container descriptor supports only a single group per container.
 * Required by the API as the container is not supplied with the IOMMU 
  group
  @@ -89,10 +94,11 @@ static void decrement_locked_vm(long npages)
 */
struct tce_container {
 struct mutex lock;
  -  struct iommu_group *grp;
 bool enabled;
 unsigned long locked_pages;
 bool v2;
  +  struct iommu_table tables[IOMMU_TABLE_GROUP_MAX_TABLES];
 
  Hrm,  so here we have more copies of the full iommu_table structures,
  which again muddies the lifetime.  The table_group pointer is
  presumably meaningless in these copies, which seems dangerously
  confusing.
 
 
 Ouch. This is bad. No, table_group is not pointless here as it is used to 
 get to the PE number to invalidate TCE cache. I just realized although I 
 need to update just a single table, I still have to invalidate TCE cache 
 for every attached group/PE so I need a list of iommu_table_group's here, 
 not a single pointer...
 
 
 
  +  

[PATCH] powerpc/cell: Drop cbe-oss-dev mailing list from MAINTAINERS

2015-04-30 Thread Michael Ellerman
Traffic on the cbe-oss-dev list is more or less non-existent, other than
CC's from linuxppc.

It seems like we may as well just send everyone to linuxppc and
archive the list.

Signed-off-by: Michael Ellerman m...@ellerman.id.au
---
 MAINTAINERS | 7 ++-
 1 file changed, 2 insertions(+), 5 deletions(-)

diff --git a/MAINTAINERS b/MAINTAINERS
index 2e5bbc0d68b2..b8e038ca0d26 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -2433,7 +2433,6 @@ F:
Documentation/devicetree/bindings/net/ieee802154/cc2520.txt
 CELL BROADBAND ENGINE ARCHITECTURE
 M: Arnd Bergmann a...@arndb.de
 L: linuxppc-dev@lists.ozlabs.org
-L: cbe-oss-...@lists.ozlabs.org
 W: http://www.ibm.com/developerworks/power/cell/
 S: Supported
 F: arch/powerpc/include/asm/cell*.h
@@ -7824,14 +7823,13 @@ F:  drivers/net/wireless/prism54/
 PS3 NETWORK SUPPORT
 M: Geoff Levand ge...@infradead.org
 L: net...@vger.kernel.org
-L: cbe-oss-...@lists.ozlabs.org
+L: linuxppc-dev@lists.ozlabs.org
 S: Maintained
 F: drivers/net/ethernet/toshiba/ps3_gelic_net.*
 
 PS3 PLATFORM SUPPORT
 M: Geoff Levand ge...@infradead.org
 L: linuxppc-dev@lists.ozlabs.org
-L: cbe-oss-...@lists.ozlabs.org
 S: Maintained
 F: arch/powerpc/boot/ps3*
 F: arch/powerpc/include/asm/lv1call.h
@@ -7845,7 +7843,7 @@ F:sound/ppc/snd_ps3*
 
 PS3VRAM DRIVER
 M: Jim Paris j...@jtan.com
-L: cbe-oss-...@lists.ozlabs.org
+L: linuxppc-dev@lists.ozlabs.org
 S: Maintained
 F: drivers/block/ps3vram.c
 
@@ -9321,7 +9319,6 @@ F:drivers/net/ethernet/toshiba/spider_net*
 SPU FILE SYSTEM
 M: Jeremy Kerr j...@ozlabs.org
 L: linuxppc-dev@lists.ozlabs.org
-L: cbe-oss-...@lists.ozlabs.org
 W: http://www.ibm.com/developerworks/power/cell/
 S: Supported
 F: Documentation/filesystems/spufs.txt
-- 
2.1.0


Re: [PATCH] powerpc/powernv: Add opal-prd channel

2015-04-30 Thread Jeremy Kerr
Hi Ben,

 +static LIST_HEAD(opal_prd_msg_queue);
 +static DEFINE_SPINLOCK(opal_prd_msg_queue_lock);
 +static DECLARE_WAIT_QUEUE_HEAD(opal_prd_msg_wait);
 +static atomic_t usage;
 
 opal_prd_usage ... otherwise  it's a mess in the symbols map

OK, I'll change this.

 Also why limit the number of opens ? we might want to have tools using
 the opal prd for xscom :-) (in absence of debugfs). .. as long as not
 two people read() it should be ok. Or a tool to dump the regions etc...
 
 I don't see any reason to block multiple open's.

Simplicity, really. We can do a get exclusive, but there's no
(current) use-case for multiple openers on a PRD interface.

Pulling this thread a little, you've hit on a key decision point of the
prd design - I see there being two directions we could take with this:

 1) This interface is specifically for PRD functions, or

 2) This interface is a generic userspace interface to OPAL,
and PRD is a subset of that.

I've been aiming for (1) with the current code; and the nature of the
generic read()  write() operations being PRD-specific enforces that.

Allowing multiple openers will help with (2), but if we want to go in
that direction, I think we'd be better off doing a couple of other
changes too:

 * move the general functions (eg xscom, range mappings, OCC control)
   to a separate interface that isn't tied to PRD - say just /dev/opal

 * using this prd code for only the prd-event handling, possibly
   renamed to /dev/opal-prd-events. This would still need some
   method of enforcing exclusive access.

In this case, the actual PRD application would use both devices,
dequeueing events (and updating the ipoll mask) from the latter, and
using the former for helper functionality.

Other tools (eg generic xscom access) would just use the generic
interface, and not the PRD one, which wouldn't enforce exclusive access.

Regardless of the choice here, we could also remove the single-open
exclusion, and shift that responsibility to userspace (eg, flock() on
the PRD device node?). The main reason for the exclusion is to prevent
multiple prd daemons running, which may get messy when updating the
ipoll mask.
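
A minimal sketch of the userspace-enforced exclusivity mentioned above (the
device node path and the flock() policy are assumptions here, not part of
this patch):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/file.h>

/* Sketch: a prd daemon takes a non-blocking exclusive lock so a second
 * daemon fails fast instead of racing on ipoll mask updates. */
int open_prd_exclusive(void)
{
	int fd = open("/dev/opal-prd", O_RDWR);

	if (fd < 0)
		return -1;

	if (flock(fd, LOCK_EX | LOCK_NB) < 0) {
		perror("opal-prd already claimed by another daemon");
		close(fd);
		return -1;
	}

	return fd;
}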

 Should we rely exclusively on userspace setting the right permissions or
 should we check CAP_SYSADMIN here ?

I'm okay with relying on userspace, is there any reason not to?


 +vma-vm_page_prot = phys_mem_access_prot(file, vma-vm_pgoff,
 + size, vma-vm_page_prot)
 +| _PAGE_SPECIAL;
 +
 +rc = remap_pfn_range(vma, vma-vm_start, vma-vm_pgoff, size,
 +vma-vm_page_prot);
 
 Do we still have the warnings at process exit about the map count, or is
 that fixed?

No, not fixed at present. I'll need to chat to you about that.


 +case OPAL_PRD_SCOM_READ:
 +rc = copy_from_user(scom, (void __user *)param, sizeof(scom));
 +if (rc)
 +return -EFAULT;
 +
 +rc = opal_xscom_read(scom.chip, scom.addr,
 +(__be64 *)scom.data);
 
 Are we exporting these for modules ?

No, but opal-prd isn't configurable as a module at the moment.

 
 +scom.data = be64_to_cpu(scom.data);
 +pr_debug(ioctl SCOM_READ: chip %llx addr %016llx 
 +data %016llx rc %d\n,
 +scom.chip, scom.addr, scom.data, rc);
 
 pr_devel ?

This removes the possibility of CONFIG_DYNAMIC_DEBUG, is that intentional?

 
 +if (rc)
 +return -EIO;
 
 Should we consider returning more info about the SCOM error ? HBRT might
 actually need that... Maybe opal_prd_scom needs a field for the OPAL rc
 which is currently not very descriptive but that's fixable.

Sounds good, I'll add that in. On error, we'll return -EIO and have the
OPAL error code in the struct for further detail.


 +nr_ranges = of_property_count_strings(np, reserved-names);
 +ranges_prop = of_get_property(np, reserved-ranges, NULL);
 +if (!ranges_prop) {
 +of_node_put(np);
 +return -ENODEV;
 +}
 
 Didn't we say we had a problem with using those properties due to
 coalescing ? Shouldn't we define specific ones for the HBRT regions ?

There's not a problem at the moment, but one day we will need to expand
the PRD's get_reserved_mem interface to allow per-chip ranges. This
would use a different device-tree representation.

However, I think it'd be better to remove this code entirely (ie, remove
the range member of struct opal_prd_info), and require userspace to do
the device-tree parsing.
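
A rough userspace sketch of that direction (the device-tree node path and
property layout here are assumptions for illustration only):

#include <stdio.h>
#include <stdint.h>
#include <endian.h>

#define DT_RANGES "/proc/device-tree/ibm,opal/firmware/reserved-ranges"

/* Sketch: walk the reserved ranges directly from the device tree instead
 * of asking the kernel via opal_prd_info.  Each entry is assumed to be a
 * (base, size) pair of big-endian 64-bit cells. */
int main(void)
{
	uint64_t range[2];
	FILE *f = fopen(DT_RANGES, "rb");

	if (!f)
		return 1;

	while (fread(range, sizeof(range), 1, f) == 1)
		printf("reserved: 0x%llx + 0x%llx\n",
		       (unsigned long long)be64toh(range[0]),
		       (unsigned long long)be64toh(range[1]));

	fclose(f);
	return 0;
}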

 +static int __init opal_prd_init(void)
 +{
 +int rc;
 +
 +/* parse the code region information from the device tree */
 +rc = parse_regions();
 +if (rc) {
 +pr_err(Couldn't parse region information from DT\n);
 +return rc;
 +}
 
 Should we create a virtual device under the OPAL node in FW so we have
 something to attach to ? 

Re: [PATCH kernel v9 26/32] powerpc/iommu: Add userspace view of TCE table

2015-04-30 Thread Alexey Kardashevskiy

On 04/29/2015 04:31 PM, David Gibson wrote:

On Sat, Apr 25, 2015 at 10:14:50PM +1000, Alexey Kardashevskiy wrote:

In order to support memory pre-registration, we need a way to track
the use of every registered memory region and only allow unregistration
if a region is not in use anymore. So we need a way to tell which
region the just-cleared TCE came from.

This adds a userspace view of the TCE table into iommu_table struct.
It contains userspace address, one per TCE entry. The table is only
allocated when the ownership over an IOMMU group is taken which means
it is only used from outside of the powernv code (such as VFIO).

Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru
---
Changes:
v9:
* fixed code flow in error cases added in v8

v8:
* added ENOMEM on failed vzalloc()
---
  arch/powerpc/include/asm/iommu.h  |  6 ++
  arch/powerpc/kernel/iommu.c   | 18 ++
  arch/powerpc/platforms/powernv/pci-ioda.c | 22 --
  3 files changed, 44 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
index 7694546..1472de3 100644
--- a/arch/powerpc/include/asm/iommu.h
+++ b/arch/powerpc/include/asm/iommu.h
@@ -111,9 +111,15 @@ struct iommu_table {
unsigned long *it_map;   /* A simple allocation bitmap for now */
unsigned long  it_page_shift;/* table iommu page size */
struct iommu_table_group *it_table_group;
+   unsigned long *it_userspace; /* userspace view of the table */


A single unsigned long doesn't seem like enough.


Why single? This is an array.


 How do you know
which process's address space this address refers to?


It is the current task. Multiple userspaces cannot use the same container/tables.




struct iommu_table_ops *it_ops;
  };

+#define IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry) \
+   ((tbl)-it_userspace ? \
+   ((tbl)-it_userspace[(entry) - (tbl)-it_offset]) : \
+   NULL)
+
  /* Pure 2^n version of get_order */
  static inline __attribute_const__
  int get_iommu_order(unsigned long size, struct iommu_table *tbl)
diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index 2eaba0c..74a3f52 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -38,6 +38,7 @@
  #include linux/pci.h
  #include linux/iommu.h
  #include linux/sched.h
+#include linux/vmalloc.h
  #include asm/io.h
  #include asm/prom.h
  #include asm/iommu.h
@@ -739,6 +740,8 @@ void iommu_reset_table(struct iommu_table *tbl, const char 
*node_name)
free_pages((unsigned long) tbl-it_map, order);
}

+   WARN_ON(tbl-it_userspace);
+
memset(tbl, 0, sizeof(*tbl));
  }

@@ -1016,6 +1019,7 @@ int iommu_take_ownership(struct iommu_table *tbl)
  {
unsigned long flags, i, sz = (tbl-it_size + 7)  3;
int ret = 0;
+   unsigned long *uas;

/*
 * VFIO does not control TCE entries allocation and the guest
@@ -1027,6 +1031,10 @@ int iommu_take_ownership(struct iommu_table *tbl)
if (!tbl-it_ops-exchange)
return -EINVAL;

+   uas = vzalloc(sizeof(*uas) * tbl-it_size);
+   if (!uas)
+   return -ENOMEM;
+
spin_lock_irqsave(tbl-large_pool.lock, flags);
for (i = 0; i  tbl-nr_pools; i++)
spin_lock(tbl-pools[i].lock);
@@ -1044,6 +1052,13 @@ int iommu_take_ownership(struct iommu_table *tbl)
memset(tbl-it_map, 0xff, sz);
}

+   if (ret) {
+   vfree(uas);
+   } else {
+   BUG_ON(tbl-it_userspace);
+   tbl-it_userspace = uas;
+   }
+
for (i = 0; i  tbl-nr_pools; i++)
spin_unlock(tbl-pools[i].lock);
spin_unlock_irqrestore(tbl-large_pool.lock, flags);
@@ -1056,6 +1071,9 @@ void iommu_release_ownership(struct iommu_table *tbl)
  {
unsigned long flags, i, sz = (tbl-it_size + 7)  3;

+   vfree(tbl-it_userspace);
+   tbl-it_userspace = NULL;
+
spin_lock_irqsave(tbl-large_pool.lock, flags);
for (i = 0; i  tbl-nr_pools; i++)
spin_lock(tbl-pools[i].lock);
diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c 
b/arch/powerpc/platforms/powernv/pci-ioda.c
index 45bc131..e0be556 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -25,6 +25,7 @@
  #include linux/memblock.h
  #include linux/iommu.h
  #include linux/sizes.h
+#include linux/vmalloc.h

  #include asm/sections.h
  #include asm/io.h
@@ -1827,6 +1828,14 @@ static void pnv_ioda2_tce_free(struct iommu_table *tbl, 
long index,
pnv_pci_ioda2_tce_invalidate(tbl, index, npages, false);
  }

+void pnv_pci_ioda2_free_table(struct iommu_table *tbl)
+{
+   vfree(tbl-it_userspace);
+   tbl-it_userspace = NULL;
+
+   pnv_pci_free_table(tbl);
+}
+
  static struct iommu_table_ops pnv_ioda2_iommu_ops = {
   

Re: [PATCH] powerpc/eeh: Delay probing EEH device during hotplug

2015-04-30 Thread Gavin Shan
On Fri, May 01, 2015 at 09:50:57AM +1000, Michael Ellerman wrote:
On Fri, 2015-05-01 at 09:22 +1000, Gavin Shan wrote:
 Commit 1c509148b (powerpc/eeh: Do probe on pci_dn) probes EEH
 devices at an early stage, which is reasonable for the pSeries platform.
 However, it's wrong for the PowerNV platform because the PE# isn't
 determined until the resources (IO and MMIO) are assigned to
 PE in hotplug case. So we have to delay probing EEH devices
 for PowerNV platform until the PE# is assigned.
 
 Fixes: 1c509148b (powerpc/eeh: Do probe on pci_dn)

I don't have that SHA.

I have a commit with the same subject, which is ff57b454ddb9.

Is that the same thing?


Yes, they're same thing.

I also have 89a51df5ab1d, which Fixes that commit (ff57..).


There seems to be something wrong locally, because git show 1c509148b leads
me to the patch as well. However, I believe ff57b454ddb9 is the correct
one. I'll send v2 to fix it up.

---

gwshan@gwshan:~/sandbox/linux$ git show 1c509148b
1 commit 1c509148bd6b5199dc1d97e146eda496f9f22a06
2 Author: Gavin Shan gws...@linux.vnet.ibm.com
3 Date:   Tue Mar 17 10:49:45 2015 +1100
4 
5 powerpc/eeh: Do probe on pci_dn

gwshan@gwshan:~/sandbox/linux$ git show ff57b454ddb9
  1 commit ff57b454ddb938d98d48d8df356357000fedc88c
  2 Author: Gavin Shan gws...@linux.vnet.ibm.com
  3 Date:   Tue Mar 17 16:15:06 2015 +1100
  4 
  5 powerpc/eeh: Do probe on pci_dn

Thanks,
Gavin

cheers




Re: [PATCH] powerpc/cell: Drop cbe-oss-dev mailing list from MAINTAINERS

2015-04-30 Thread Jeremy Kerr
Hi Michael,

 Traffic on the cbe-oss-dev list is more or less non-existent, other than
 CC's from linuxppc.

Plus all that spam that never makes it out of the moderation queue.

 It seems like we may as well just send everyone to linuxppc and
 archive the list.

Acked-by: Jeremy Kerr j...@ozlabs.org

[This'll get a mention at our returning from prom_init-farewell party
too, right?]

Cheers,


Jeremy


Re: [PATCH] powerpc/eeh: Delay probing EEH device during hotplug

2015-04-30 Thread Michael Ellerman
On Fri, 2015-05-01 at 11:28 +1000, Gavin Shan wrote:
 On Fri, May 01, 2015 at 09:50:57AM +1000, Michael Ellerman wrote:
 On Fri, 2015-05-01 at 09:22 +1000, Gavin Shan wrote:
  Commit 1c509148b (powerpc/eeh: Do probe on pci_dn) probes EEH
  devices at an early stage, which is reasonable for the pSeries platform.
  However, it's wrong for the PowerNV platform because the PE# isn't
  determined until the resources (IO and MMIO) are assigned to
  PE in hotplug case. So we have to delay probing EEH devices
  for PowerNV platform until the PE# is assigned.
  
  Fixes: 1c509148b (powerpc/eeh: Do probe on pci_dn)
 
 I don't have that SHA.
 
 I have a commit with the same subject, which is ff57b454ddb9.
 
 Is that the same thing?
 
 Yes, they're same thing.

OK.

 There seems to be something wrong locally, because git show 1c509148b leads
 me to the patch as well. However, I believe ff57b454ddb9 is the correct
 one. I'll send v2 to fix it up.

You just have it in your tree from when you were developing it I guess?

If you do:

  $ git show --format=fuller 1c509148b

It should show you the committer etc.

cheers



Re: [PATCH] powerpc/eeh: Delay probing EEH device during hotplug

2015-04-30 Thread Gavin Shan
On Fri, May 01, 2015 at 01:51:37PM +1000, Michael Ellerman wrote:
On Fri, 2015-05-01 at 11:28 +1000, Gavin Shan wrote:
 On Fri, May 01, 2015 at 09:50:57AM +1000, Michael Ellerman wrote:
 On Fri, 2015-05-01 at 09:22 +1000, Gavin Shan wrote:
  Commit 1c509148b (powerpc/eeh: Do probe on pci_dn) probes EEH
  devices at an early stage, which is reasonable for the pSeries platform.
  However, it's wrong for the PowerNV platform because the PE# isn't
  determined until the resources (IO and MMIO) are assigned to
  PE in hotplug case. So we have to delay probing EEH devices
  for PowerNV platform until the PE# is assigned.
  
  Fixes: 1c509148b (powerpc/eeh: Do probe on pci_dn)
 
 I don't have that SHA.
 
 I have a commit with the same subject, which is ff57b454ddb9.
 
 Is that the same thing?
 
 Yes, they're same thing.

OK.

 There seems to be something wrong locally, because git show 1c509148b leads
 me to the patch as well. However, I believe ff57b454ddb9 is the correct
 one. I'll send v2 to fix it up.

You just have it in your tree from when you were developing it I guess?

If you do:

  $ git show --format=fuller 1c509148b

It should show you the committer etc.


Yeah, it's true. The committer is myself for commit 1c509148b.

Thanks,
Gavin


Re: [PATCH kernel v9 23/32] powerpc/powernv/ioda: Define and implement DMA table/window management callbacks

2015-04-30 Thread David Gibson
On Thu, Apr 30, 2015 at 07:56:17PM +1000, Alexey Kardashevskiy wrote:
 On 04/30/2015 02:37 PM, David Gibson wrote:
 On Wed, Apr 29, 2015 at 07:44:20PM +1000, Alexey Kardashevskiy wrote:
 On 04/29/2015 03:30 PM, David Gibson wrote:
 On Sat, Apr 25, 2015 at 10:14:47PM +1000, Alexey Kardashevskiy wrote:
 This extends iommu_table_group_ops by a set of callbacks to support
 dynamic DMA windows management.
 
 create_table() creates a TCE table with specific parameters.
 it receives iommu_table_group to know nodeid in order to allocate
 TCE table memory closer to the PHB. The exact format of allocated
 multi-level table might be also specific to the PHB model (not
 the case now though).
 This callback calculates the DMA window offset on a PCI bus from @num
 and stores it in a just created table.
 
 set_window() sets the window at specified TVT index + @num on PHB.
 
 unset_window() unsets the window from specified TVT.
 
 This adds a free() callback to iommu_table_ops to free the memory
 (potentially a tree of tables) allocated for the TCE table.
 
 Doesn't the free callback belong with the previous patch introducing
 multi-level tables?
 
 
 
 If I did that, you would say "why is it here if nothing calls it" on the
 multilevel patch and "I see the allocation but I do not see memory
 release" ;)
 
 Yeah, fair enough ;)
 
 I need some rule of thumb here. I think it is a bit cleaner if the same
 patch adds a callback for memory allocation and its counterpart, no?
 
 On further consideration, yes, I think you're right.
 
 create_table() and free() are supposed to be called once per
 VFIO container and set_window()/unset_window() are supposed to be
 called for every group in a container.
 
 This adds IOMMU capabilities to iommu_table_group such as default
 32bit window parameters and others.
 
 Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru
 ---
   arch/powerpc/include/asm/iommu.h| 19 
   arch/powerpc/platforms/powernv/pci-ioda.c   | 75 
  ++---
   arch/powerpc/platforms/powernv/pci-p5ioc2.c | 12 +++--
   3 files changed, 96 insertions(+), 10 deletions(-)
 
 diff --git a/arch/powerpc/include/asm/iommu.h 
 b/arch/powerpc/include/asm/iommu.h
 index 0f50ee2..7694546 100644
 --- a/arch/powerpc/include/asm/iommu.h
 +++ b/arch/powerpc/include/asm/iommu.h
 @@ -70,6 +70,7 @@ struct iommu_table_ops {
   /* get() returns a physical address */
   unsigned long (*get)(struct iommu_table *tbl, long index);
   void (*flush)(struct iommu_table *tbl);
 + void (*free)(struct iommu_table *tbl);
   };
 
   /* These are used by VIO */
 @@ -148,6 +149,17 @@ extern struct iommu_table *iommu_init_table(struct 
 iommu_table * tbl,
   struct iommu_table_group;
 
   struct iommu_table_group_ops {
 + long (*create_table)(struct iommu_table_group *table_group,
 + int num,
 + __u32 page_shift,
 + __u64 window_size,
 + __u32 levels,
 + struct iommu_table *tbl);
 + long (*set_window)(struct iommu_table_group *table_group,
 + int num,
 + struct iommu_table *tblnew);
 + long (*unset_window)(struct iommu_table_group *table_group,
 + int num);
   /*
* Switches ownership from the kernel itself to an external
* user. While onwership is taken, the kernel cannot use IOMMU 
  itself.
 @@ -160,6 +172,13 @@ struct iommu_table_group {
   #ifdef CONFIG_IOMMU_API
   struct iommu_group *group;
   #endif
 + /* Some key properties of IOMMU */
 + __u32 tce32_start;
 + __u32 tce32_size;
 + __u64 pgsizes; /* Bitmap of supported page sizes */
 + __u32 max_dynamic_windows_supported;
 + __u32 max_levels;
 
 With this information, table_group seems even more like a bad name.
 iommu_state maybe?
 
 
 Please, no. We will never come to agreement then :( And iommu_state is too
 general anyway, it won't pass.
 
 
   struct iommu_table tables[IOMMU_TABLE_GROUP_MAX_TABLES];
   struct iommu_table_group_ops *ops;
   };
 diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c 
 b/arch/powerpc/platforms/powernv/pci-ioda.c
 index cc1d09c..4828837 100644
 --- a/arch/powerpc/platforms/powernv/pci-ioda.c
 +++ b/arch/powerpc/platforms/powernv/pci-ioda.c
 @@ -24,6 +24,7 @@
   #include linux/msi.h
   #include linux/memblock.h
   #include linux/iommu.h
 +#include linux/sizes.h
 
   #include asm/sections.h
   #include asm/io.h
 @@ -1846,6 +1847,7 @@ static struct iommu_table_ops pnv_ioda2_iommu_ops = 
 {
   #endif
   .clear = pnv_ioda2_tce_free,
   .get = pnv_tce_get,
 + .free = pnv_pci_free_table,
   };
 
   static void pnv_pci_ioda_setup_opal_tce_kill(struct pnv_phb *phb,
 @@ -1936,6 +1938,8 @@ static void pnv_pci_ioda_setup_dma_pe(struct 
 pnv_phb *phb,
TCE_PCI_SWINV_PAIR);
 
   tbl-it_ops = pnv_ioda1_iommu_ops;
 + pe-table_group.tce32_start = tbl-it_offset  tbl-it_page_shift;
 + 

Re: [PATCH kernel v9 28/32] powerpc/mmu: Add userspace-to-physical addresses translation cache

2015-04-30 Thread David Gibson
On Thu, Apr 30, 2015 at 06:25:25PM +1000, Paul Mackerras wrote:
 On Thu, Apr 30, 2015 at 04:34:55PM +1000, David Gibson wrote:
  On Sat, Apr 25, 2015 at 10:14:52PM +1000, Alexey Kardashevskiy wrote:
   We are adding support for DMA memory pre-registration to be used in
   conjunction with VFIO. The idea is that the userspace which is going to
   run a guest may want to pre-register a user space memory region so
   it all gets pinned once and never goes away. Having this done,
   a hypervisor will not have to pin/unpin pages on every DMA map/unmap
   request. This is going to help with multiple pinning of the same memory
   and in-kernel acceleration of DMA requests.
   
   This adds a list of memory regions to mm_context_t. Each region consists
   of a header and a list of physical addresses. This adds API to:
   1. register/unregister memory regions;
   2. do final cleanup (which puts all pre-registered pages);
   3. do userspace to physical address translation;
   4. manage a mapped pages counter; when it is zero, it is safe to
   unregister the region.
   
   Multiple registration of the same region is allowed, kref is used to
   track the number of registrations.
  
  [snip]
   +long mm_iommu_alloc(unsigned long ua, unsigned long entries,
   + struct mm_iommu_table_group_mem_t **pmem)
   +{
   + struct mm_iommu_table_group_mem_t *mem;
   + long i, j;
   + struct page *page = NULL;
   +
   + list_for_each_entry_rcu(mem, current-mm-context.iommu_group_mem_list,
   + next) {
   + if ((mem-ua == ua)  (mem-entries == entries))
   + return -EBUSY;
   +
   + /* Overlap? */
   + if ((mem-ua  (ua + (entries  PAGE_SHIFT))) 
   + (ua  (mem-ua + (mem-entries  PAGE_SHIFT
   + return -EINVAL;
   + }
   +
   + mem = kzalloc(sizeof(*mem), GFP_KERNEL);
   + if (!mem)
   + return -ENOMEM;
   +
   + mem-hpas = vzalloc(entries * sizeof(mem-hpas[0]));
   + if (!mem-hpas) {
   + kfree(mem);
   + return -ENOMEM;
   + }
  
  So, I've thought more about this and I'm really confused as to what
  this is supposed to be accomplishing.
  
  I see that you need to keep track of what regions are registered, so
  you don't double lock or unlock, but I don't see what the point of
  actualy storing the translations in hpas is.
  
  I had assumed it was so that you could later on get to the
  translations in real mode when you do in-kernel acceleration.  But
  that doesn't make sense, because the array is vmalloc()ed, so can't be
  accessed in real mode anyway.
 
 We can access vmalloc'd arrays in real mode using real_vmalloc_addr().

Ah, ok.

  I can't think of a circumstance in which you can use hpas where you
  couldn't just walk the page tables anyway.
 
 The problem with walking the page tables is that there is no guarantee
 that the page you find that way is the page that was returned by the
 gup_fast() we did earlier.  Storing the hpas means that we know for
 sure that the page we're doing DMA to is one that we have an elevated
 page count on.
 
 Also, there are various points where a Linux PTE is made temporarily
 invalid for a short time.  If we happened to do a H_PUT_TCE on one cpu
 while another cpu was doing that, we'd get a spurious failure returned
 by the H_PUT_TCE.

I think we want this explanation in the commit message.  Anr/or in a
comment somewhere, I'm not sure.

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson



Re: [PATCH V2 1/8] powerpc/powernv: Add a virtual irqchip for opal events

2015-04-30 Thread Neelesh Gupta

Hi Alistair,

With all the patches applied on top of 'v4.0-rc7', I see this issue during
the boot itself http://pastebin.hursley.ibm.com/918

A few compile warnings and minor comments.
drivers/tty/hvc/hvc_opal.c: In function ‘hvc_opal_probe’:
drivers/tty/hvc/hvc_opal.c:174:6: warning: unused variable ‘rc’ 
[-Wunused-variable]

  int rc;
  ^
drivers/tty/hvc/hvc_opal.c: At top level:
drivers/tty/hvc/hvc_opal.c:65:13: warning: ‘hvc_opal_event_registered’ 
defined but not used [-Wunused-variable]

 static bool hvc_opal_event_registered;



On 04/10/2015 01:54 PM, Alistair Popple wrote:

Whenever an interrupt is received for opal the linux kernel gets a
bitfield indicating certain events that have occurred and need handling
by the various device drivers. Currently this is handled using a
notifier interface where we call every device driver that has
registered to receive opal events.

This approach has several drawbacks. For example each driver has to do
its own checking to see if the event is relevant as well as event
masking. There is also no easy method of recording the number of times
we receive particular events.

This patch solves these issues by exposing opal events via the
standard interrupt APIs by adding a new interrupt chip and
domain. Drivers can then register for the appropriate events using
standard kernel calls such as irq_of_parse_and_map().

Signed-off-by: Alistair Popple alist...@popple.id.au
---

+
+static int __init opal_event_init(void)
+{
+   struct device_node *dn, *opal_node;
+   const __be32 *irqs;
+   int i, irqlen;
+
+   opal_node = of_find_node_by_path(/ibm,opal);
+   if (!opal_node) {
+   pr_warn(opal: Node not found\n);
+   return -ENODEV;
+   }
+
+   dn = of_find_compatible_node(NULL, NULL, ibm,opal-event);
+
+   /* If dn is NULL it means the domain won't be linked to a DT
+* node so therefore irq_of_parse_and_map(...) wont work. But
+* that shouldn't be problem because if we're running a
+* version of skiboot that doesn't have the dn then the
+* devices won't have the correct properties and will have to
+* fall back to the legacy method (opal_event_request(...))
+* anyway. */
+   opal_event_irqchip.domain =
+   irq_domain_add_linear(dn, 64, opal_event_domain_ops,


A macro would be better here, giving the maximum number of event bits we have.


+ opal_event_irqchip);
+   if (IS_ERR(opal_event_irqchip.domain)) {
+   pr_warn(opal: Unable to create irq domain\n);
+   return PTR_ERR(opal_event_irqchip.domain);
+   }
+
+   /* Get interrupt property */
+   irqs = of_get_property(opal_node, opal-interrupts, irqlen);


of_node_put()
Should decrement the refcount of the nodes 'opal_node' and 'dn' (if !NULL)
before returning from the function.


+   opal_irq_count = irqs ? (irqlen / 4) : 0;
+   pr_debug(Found %d interrupts reserved for OPAL\n, opal_irq_count);
+
+   /* Install interrupt handlers */
+   opal_irqs = kcalloc(opal_irq_count, sizeof(unsigned int), GFP_KERNEL);


Safer to use 'sizeof(*opal_irqs)'

Neelesh.


+   for (i = 0; irqs  i  opal_irq_count; i++, irqs++) {
+   unsigned int irq, virq;
+   int rc;
+
+   /* Get hardware and virtual IRQ */
+   irq = be32_to_cpup(irqs);
+   virq = irq_create_mapping(NULL, irq);
+   if (virq == NO_IRQ) {
+   pr_warn(Failed to map irq 0x%x\n, irq);
+   continue;
+   }
+
+   /* Install interrupt handler */
+   rc = request_irq(virq, opal_interrupt, 0, opal, NULL);
+   if (rc) {
+   irq_dispose_mapping(virq);
+   pr_warn(Error %d requesting irq %d (0x%x)\n,
+rc, virq, irq);
+   continue;
+   }
+
+   /* Cache IRQ */
+   opal_irqs[i] = virq;
+   }
+
+   return 0;
+}
+machine_core_initcall(powernv, opal_event_init);
+
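
For reference, a minimal sketch of the consumer side described above; only
the irq_of_parse_and_map()/request_irq() pattern comes from the patch
description, the device and handler names are made up for illustration:

#include <linux/interrupt.h>
#include <linux/of_irq.h>
#include <linux/platform_device.h>

static irqreturn_t example_event_handler(int irq, void *data)
{
	/* event-specific work goes here */
	return IRQ_HANDLED;
}

static int example_probe(struct platform_device *pdev)
{
	unsigned int virq;

	/* The "interrupts" property of this node now resolves through the
	 * new opal event irq domain. */
	virq = irq_of_parse_and_map(pdev->dev.of_node, 0);
	if (!virq)
		return -ENODEV;

	return request_irq(virq, example_event_handler, 0,
			   "example-opal-event", pdev);
}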




[PATCH] powerpc/eeh: Fix race condition in pcibios_set_pcie_reset_state()

2015-04-30 Thread Gavin Shan
When asserting reset in pcibios_set_pcie_reset_state(), the PE
is enforced to (hardware) frozen state in order to drop unexpected
PCI transactions (except PCI config read/write) automatically by
hardware during reset, which would cause recursive EEH error.
However, the (software) frozen state EEH_PE_ISOLATED is missed.
When users get 0xFF from PCI config or MMIO read, EEH_PE_ISOLATED
is set in PE state retrival backend. Unfortunately, nobody (the
reset handler or the EEH recovery functinality in host) will clear
EEH_PE_ISOLATED when the PE has been passed through to guest.

The patch sets and clears EEH_PE_ISOLATED properly during reset
in function pcibios_set_pcie_reset_state() to fix the issue.

Fixes: 28158cd (Enhance pcibios_set_pcie_reset_state())
Reported-by: Carol L. Soto cls...@us.ibm.com
Signed-off-by: Gavin Shan gws...@linux.vnet.ibm.com
Tested-by: Carol L. Soto cls...@us.ibm.com
---
 arch/powerpc/kernel/eeh.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/kernel/eeh.c b/arch/powerpc/kernel/eeh.c
index 65f38d2..b798c86 100644
--- a/arch/powerpc/kernel/eeh.c
+++ b/arch/powerpc/kernel/eeh.c
@@ -749,21 +749,24 @@ int pcibios_set_pcie_reset_state(struct pci_dev *dev, 
enum pcie_reset_state stat
eeh_unfreeze_pe(pe, false);
eeh_pe_state_clear(pe, EEH_PE_CFG_BLOCKED);
eeh_pe_dev_traverse(pe, eeh_restore_dev_state, dev);
+   eeh_pe_state_clear(pe, EEH_PE_ISOLATED);
break;
case pcie_hot_reset:
+   eeh_pe_state_mark(pe, EEH_PE_ISOLATED);
eeh_ops-set_option(pe, EEH_OPT_FREEZE_PE);
eeh_pe_dev_traverse(pe, eeh_disable_and_save_dev_state, dev);
eeh_pe_state_mark(pe, EEH_PE_CFG_BLOCKED);
eeh_ops-reset(pe, EEH_RESET_HOT);
break;
case pcie_warm_reset:
+   eeh_pe_state_mark(pe, EEH_PE_ISOLATED);
eeh_ops-set_option(pe, EEH_OPT_FREEZE_PE);
eeh_pe_dev_traverse(pe, eeh_disable_and_save_dev_state, dev);
eeh_pe_state_mark(pe, EEH_PE_CFG_BLOCKED);
eeh_ops-reset(pe, EEH_RESET_FUNDAMENTAL);
break;
default:
-   eeh_pe_state_clear(pe, EEH_PE_CFG_BLOCKED);
+   eeh_pe_state_clear(pe, EEH_PE_ISOLATED | EEH_PE_CFG_BLOCKED);
return -EINVAL;
};
 
-- 
2.1.0


Re: [PATCH] powerpc/eeh: Delay probing EEH device during hotplug

2015-04-30 Thread Michael Ellerman
On Fri, 2015-05-01 at 09:22 +1000, Gavin Shan wrote:
 Commit 1c509148b (powerpc/eeh: Do probe on pci_dn) probes EEH
 devices at an early stage, which is reasonable for the pSeries platform.
 However, it's wrong for the PowerNV platform because the PE# isn't
 determined until the resources (IO and MMIO) are assigned to
 PE in hotplug case. So we have to delay probing EEH devices
 for PowerNV platform until the PE# is assigned.
 
 Fixes: 1c509148b (powerpc/eeh: Do probe on pci_dn)

I don't have that SHA.

I have a commit with the same subject, which is ff57b454ddb9.

Is that the same thing?

I also have 89a51df5ab1d, which Fixes that commit (ff57..).

cheers



[PATCH] powerpc/eeh: Delay probing EEH device during hotplug

2015-04-30 Thread Gavin Shan
Commit 1c509148b (powerpc/eeh: Do probe on pci_dn) probes EEH
devices at an early stage, which is reasonable for the pSeries platform.
However, it's wrong for the PowerNV platform because the PE# isn't
determined until the resources (IO and MMIO) are assigned to
PE in hotplug case. So we have to delay probing EEH devices
for PowerNV platform until the PE# is assigned.

Fixes: 1c509148b (powerpc/eeh: Do probe on pci_dn)
Signed-off-by: Gavin Shan gws...@linux.vnet.ibm.com
---
 arch/powerpc/kernel/eeh.c | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/arch/powerpc/kernel/eeh.c b/arch/powerpc/kernel/eeh.c
index b798c86..04b5d94 100644
--- a/arch/powerpc/kernel/eeh.c
+++ b/arch/powerpc/kernel/eeh.c
@@ -1061,6 +1061,9 @@ void eeh_add_device_early(struct pci_dn *pdn)
if (!edev || !eeh_enabled())
return;
 
+   if (!eeh_has_flag(EEH_PROBE_MODE_DEVTREE))
+   return;
+
/* USB Bus children of PCI devices will not have BUID's */
phb = edev-phb;
if (NULL == phb ||
@@ -1115,6 +1118,9 @@ void eeh_add_device_late(struct pci_dev *dev)
return;
}
 
+   if (eeh_has_flag(EEH_PROBE_MODE_DEV))
+   eeh_ops-probe(pdn, NULL);
+
/*
 * The EEH cache might not be removed correctly because of
 * unbalanced kref to the device during unplug time, which
-- 
2.1.0


Re: [PATCH] powerpc/pseries: Fix possible leaked device node reference

2015-04-30 Thread Tyrel Datwyler
On 04/29/2015 06:44 PM, Nathan Fontenot wrote:
 Failure return from dlpar_configure_connector when dlpar adding cpus
 results in leaking references to the cpus parent device node. Move the
 call to of_node_put() prior to checking the result of
 dlpar_configure_connector.
 
 Fixes: 8d5ff320766f (powerpc/pseries: Make dlpar_configure_connector parent 
 node aware)

That commit went into 3.12. Shouldn't this be CC to stable?

-Tyrel

 
 Signed-off-by: Nathan Fontenot nf...@linux.vnet.ibm.com
 ---
  arch/powerpc/platforms/pseries/dlpar.c |3 +--
  1 file changed, 1 insertion(+), 2 deletions(-)
 
 diff --git a/arch/powerpc/platforms/pseries/dlpar.c 
 b/arch/powerpc/platforms/pseries/dlpar.c
 index 019d34a..47d9cebe 100644
 --- a/arch/powerpc/platforms/pseries/dlpar.c
 +++ b/arch/powerpc/platforms/pseries/dlpar.c
 @@ -421,11 +421,10 @@ static ssize_t dlpar_cpu_probe(const char *buf, size_t 
 count)
   return -ENODEV;
  
   dn = dlpar_configure_connector(cpu_to_be32(drc_index), parent);
 + of_node_put(parent);
   if (!dn)
   return -EINVAL;
  
 - of_node_put(parent);
 -
   rc = dlpar_attach_node(dn);
   if (rc) {
   dlpar_release_drc(drc_index);
 


Re: [PATCH V2 1/8] powerpc/powernv: Add a virtual irqchip for opal events

2015-04-30 Thread Neelesh Gupta

Hi Alistair,

Applied all of the patches on top of 'v4.0-rc7', found this issue during
the boot itself http://pastebin.hursley.ibm.com/918.

There are a few compile warnings and minor comments.

drivers/tty/hvc/hvc_opal.c: In function ‘hvc_opal_probe’:
drivers/tty/hvc/hvc_opal.c:174:6: warning: unused variable ‘rc’ 
[-Wunused-variable]

  int rc;
  ^
drivers/tty/hvc/hvc_opal.c: At top level:
drivers/tty/hvc/hvc_opal.c:65:13: warning: ‘hvc_opal_event_registered’ 
defined but not used [-Wunused-variable]

 static bool hvc_opal_event_registered;

Regards,
Neelesh.

On 04/10/2015 01:54 PM, Alistair Popple wrote:

Whenever an interrupt is received for opal the linux kernel gets a
bitfield indicating certain events that have occurred and need handling
by the various device drivers. Currently this is handled using a
notifier interface where we call every device driver that has
registered to receive opal events.

This approach has several drawbacks. For example each driver has to do
its own checking to see if the event is relevant as well as event
masking. There is also no easy method of recording the number of times
we receive particular events.

This patch solves these issues by exposing opal events via the
standard interrupt APIs by adding a new interrupt chip and
domain. Drivers can then register for the appropriate events using
standard kernel calls such as irq_of_parse_and_map().

Signed-off-by: Alistair Popple alist...@popple.id.au
---

+static int __init opal_event_init(void)
+{
+   struct device_node *dn, *opal_node;
+   const __be32 *irqs;
+   int i, irqlen;
+
+   opal_node = of_find_node_by_path(/ibm,opal);
+   if (!opal_node) {
+   pr_warn(opal: Node not found\n);
+   return -ENODEV;
+   }
+
+   dn = of_find_compatible_node(NULL, NULL, ibm,opal-event);
+
+   /* If dn is NULL it means the domain won't be linked to a DT
+* node so therefore irq_of_parse_and_map(...) wont work. But
+* that shouldn't be problem because if we're running a
+* version of skiboot that doesn't have the dn then the
+* devices won't have the correct properties and will have to
+* fall back to the legacy method (opal_event_request(...))
+* anyway. */
+   opal_event_irqchip.domain =
+   irq_domain_add_linear(dn, 64, opal_event_domain_ops,


A macro would be better here, giving the maximum number of event bits we have.


+ opal_event_irqchip);
+   if (IS_ERR(opal_event_irqchip.domain)) {
+   pr_warn(opal: Unable to create irq domain\n);
+   return PTR_ERR(opal_event_irqchip.domain);
+   }
+
+   /* Get interrupt property */
+   irqs = of_get_property(opal_node, opal-interrupts, irqlen);
+   opal_irq_count = irqs ? (irqlen / 4) : 0;


of_node_put()
Need to decrement the refcount of these nodes, 'opal_node' and 'dn' (if !NULL).


+   pr_debug(Found %d interrupts reserved for OPAL\n, opal_irq_count);
+
+   /* Install interrupt handlers */
+   opal_irqs = kcalloc(opal_irq_count, sizeof(unsigned int), GFP_KERNEL);


Safe to use 'sizeof(*opal_irqs)'


+   for (i = 0; irqs  i  opal_irq_count; i++, irqs++) {
+   unsigned int irq, virq;
+   int rc;
+
+   /* Get hardware and virtual IRQ */
+   irq = be32_to_cpup(irqs);
+   virq = irq_create_mapping(NULL, irq);
+   if (virq == NO_IRQ) {
+   pr_warn(Failed to map irq 0x%x\n, irq);
+   continue;
+   }
+
+   /* Install interrupt handler */
+   rc = request_irq(virq, opal_interrupt, 0, opal, NULL);
+   if (rc) {
+   irq_dispose_mapping(virq);
+   pr_warn(Error %d requesting irq %d (0x%x)\n,
+rc, virq, irq);
+   continue;
+   }
+
+   /* Cache IRQ */
+   opal_irqs[i] = virq;
+   }
+
+   return 0;
+}
+machine_core_initcall(powernv, opal_event_init);
+




Re: [PATCH] powerpc/pseries: Fix possible leaked device node reference

2015-04-30 Thread Michael Ellerman
On Thu, 2015-04-30 at 14:58 -0700, Tyrel Datwyler wrote:
 On 04/29/2015 06:44 PM, Nathan Fontenot wrote:
  Failure return from dlpar_configure_connector when dlpar adding cpus
  results in leaking references to the cpus parent device node. Move the
  call to of_node_put() prior to checking the result of
  dlpar_configure_connector.
  
  Fixes: 8d5ff320766f (powerpc/pseries: Make dlpar_configure_connector 
  parent node aware)
 
 That commit went into 3.12. Shouldn't this be CC to stable?

It could, but it fails the real bug test:

 - It must fix a real bug that bothers people (not a, This could be a
   problem... type thing).

(from Documentation/stable_kernel_rules.txt)

Because the node we're leaking a reference on is /cpus, and in practice that's
pretty harmless.

cheers




Re: ppc64le crash in dm on 4.1+

2015-04-30 Thread Joel Stanley
On Fri, May 1, 2015 at 1:18 AM, Mike Snitzer snit...@redhat.com wrote:
 I just booted 3d99e3fe13d473ac4578c37f477a59b829530764 (linus' tree as
 of this morning) on a Tuletta and got the following:

 This is fixed with the following commit (which I just sent to Linus for
 4.1-rc2 inclusion):
 https://git.kernel.org/cgit/linux/kernel/git/device-mapper/linux-dm.git/commit/?h=for-nextid=aa6df8dd28c01d9a3d2cfcfe9dd0a4a334d1cd81

Thanks!

Cheers,

Joel

Re: ppc64le crash in dm on 4.1+

2015-04-30 Thread Mike Snitzer
On Wed, Apr 29 2015 at 11:29pm -0400,
Joel Stanley j...@jms.id.au wrote:

 Hello,
 
 I just booted 3d99e3fe13d473ac4578c37f477a59b829530764 (linus' tree as
 of this morning) on a Tuletta and got the following:

This is fixed with the following commit (which I just sent to Linus for
4.1-rc2 inclusion):
https://git.kernel.org/cgit/linux/kernel/git/device-mapper/linux-dm.git/commit/?h=for-nextid=aa6df8dd28c01d9a3d2cfcfe9dd0a4a334d1cd81

[PATCH] powerpc/powernv: Add poweroff (EPOW, DPO) events support for PowerNV platform

2015-04-30 Thread Vipin K Parashar
This patch adds support for FSP EPOW (Early Power Off Warning) and
DPO (Delayed Power Off) events on the PowerNV platform.  EPOW events
are generated by the SPCN/FSP due to various critical system conditions
that require a system shutdown.  A few examples of these conditions are
high ambient temperature, or the system running on UPS power with a low
UPS battery.  A DPO event is generated in response to an admin-initiated
system shutdown request.
This patch enables the host kernel on the PowerNV platform to handle OPAL
notifications for these events and initiate system poweroff. Since EPOW
notifications are sent in advance of the impending shutdown event, this
patch also adds functionality to wait for the EPOW condition to return to
normal.  If the EPOW condition doesn't return to normal in the estimated
time, the kernel proceeds with a graceful system shutdown. The system admin
can also add host userspace scripts to perform specific actions, such as
graceful guest shutdown, upon system poweroff.
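
A minimal sketch of how the host side might decode the returned status
buffer, using the classes and bits defined below (the enum names come from
this patch; the calling convention for num_classes and the poweroff policy
are assumptions for illustration):

/* Illustrative only: decode the per-class EPOW status words returned by
 * opal_get_epow_status().  Assumes asm/opal.h and asm/opal-api.h. */
static void example_check_epow(void)
{
	__be32 status[OPAL_MAX_EPOW_CLASSES];
	__be32 num_classes = cpu_to_be32(OPAL_MAX_EPOW_CLASSES);
	u32 power, temp;

	if (opal_get_epow_status(status, &num_classes) != OPAL_SUCCESS)
		return;

	power = be32_to_cpu(status[OPAL_EPOW_POWER]);
	temp  = be32_to_cpu(status[OPAL_EPOW_TEMP]);

	if (power & OPAL_EPOW_POWER_UPS_LOW)
		pr_crit("EPOW: UPS battery low, initiating poweroff\n");
	else if (temp & (OPAL_EPOW_TEMP_CRIT_AMB | OPAL_EPOW_TEMP_CRIT_INT))
		pr_crit("EPOW: critical temperature, initiating poweroff\n");
}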

Signed-off-by: Vipin K Parashar vi...@linux.vnet.ibm.com
---
 arch/powerpc/include/asm/opal-api.h|  30 ++
 arch/powerpc/include/asm/opal.h|   3 +-
 arch/powerpc/platforms/powernv/Makefile|   1 +
 .../platforms/powernv/opal-poweroff-events.c   | 358 +
 arch/powerpc/platforms/powernv/opal-wrappers.S |   1 +
 5 files changed, 392 insertions(+), 1 deletion(-)
 create mode 100644 arch/powerpc/platforms/powernv/opal-poweroff-events.c

diff --git a/arch/powerpc/include/asm/opal-api.h 
b/arch/powerpc/include/asm/opal-api.h
index 0321a90..03b3cef 100644
--- a/arch/powerpc/include/asm/opal-api.h
+++ b/arch/powerpc/include/asm/opal-api.h
@@ -730,6 +730,36 @@ struct opal_i2c_request {
__be64 buffer_ra;   /* Buffer real address */
 };
 
+/*
+ * EPOW status sharing (OPAL and the host)
+ *
+ * The host will pass on OPAL, a buffer of length OPAL_EPOW_MAX_CLASSES
+ * to fetch system wide EPOW status. Each element in the returned buffer
+ * will contain bitwise EPOW status for each EPOW sub class.
+ */
+
+/* EPOW types */
+enum OpalEpow {
+   OPAL_EPOW_POWER = 0,/* Power EPOW */
+   OPAL_EPOW_TEMP  = 1,/* Temperature EPOW */
+   OPAL_EPOW_COOLING   = 2,/* Cooling EPOW */
+   OPAL_MAX_EPOW_CLASSES   = 3,/* Max EPOW categories */
+};
+
+/* Power EPOW events */
+enum OpalEpowPower {
+   OPAL_EPOW_POWER_UPS = 0x1, /* System on UPS power */
+   OPAL_EPOW_POWER_UPS_LOW = 0x2, /* System on UPS power with low battery*/
+};
+
+/* Temperature EPOW events */
+enum OpalEpowTemp {
+   OPAL_EPOW_TEMP_HIGH_AMB = 0x1, /* High ambient temperature */
+   OPAL_EPOW_TEMP_CRIT_AMB = 0x2, /* Critical ambient temperature */
+   OPAL_EPOW_TEMP_HIGH_INT = 0x4, /* High internal temperature */
+   OPAL_EPOW_TEMP_CRIT_INT = 0x8, /* Critical internal temperature */
+};
+
 #endif /* __ASSEMBLY__ */
 
 #endif /* __OPAL_API_H */
diff --git a/arch/powerpc/include/asm/opal.h b/arch/powerpc/include/asm/opal.h
index 042af1a..0777864 100644
--- a/arch/powerpc/include/asm/opal.h
+++ b/arch/powerpc/include/asm/opal.h
@@ -141,7 +141,6 @@ int64_t opal_pci_fence_phb(uint64_t phb_id);
 int64_t opal_pci_reinit(uint64_t phb_id, uint64_t reinit_scope, uint64_t data);
 int64_t opal_pci_mask_pe_error(uint64_t phb_id, uint16_t pe_number, uint8_t 
error_type, uint8_t mask_action);
 int64_t opal_set_slot_led_status(uint64_t phb_id, uint64_t slot_id, uint8_t 
led_type, uint8_t led_action);
-int64_t opal_get_epow_status(__be64 *status);
 int64_t opal_set_system_attention_led(uint8_t led_action);
 int64_t opal_pci_next_error(uint64_t phb_id, __be64 *first_frozen_pe,
__be16 *pci_error_type, __be16 *severity);
@@ -200,6 +199,8 @@ int64_t opal_flash_write(uint64_t id, uint64_t offset, 
uint64_t buf,
uint64_t size, uint64_t token);
 int64_t opal_flash_erase(uint64_t id, uint64_t offset, uint64_t size,
uint64_t token);
+int32_t opal_get_epow_status(__be32 *status, __be32 *num_classes);
+int32_t opal_get_dpo_status(__be32 *timeout);
 
 /* Internal functions */
 extern int early_init_dt_scan_opal(unsigned long node, const char *uname,
diff --git a/arch/powerpc/platforms/powernv/Makefile 
b/arch/powerpc/platforms/powernv/Makefile
index 33e44f3..b817bdb 100644
--- a/arch/powerpc/platforms/powernv/Makefile
+++ b/arch/powerpc/platforms/powernv/Makefile
@@ -2,6 +2,7 @@ obj-y   += setup.o opal-wrappers.o opal.o 
opal-async.o
 obj-y  += opal-rtc.o opal-nvram.o opal-lpc.o opal-flash.o
 obj-y  += rng.o opal-elog.o opal-dump.o opal-sysparam.o 
opal-sensor.o
 obj-y  += opal-msglog.o opal-hmi.o opal-power.o
+obj-y  += opal-poweroff-events.o
 
 obj-$(CONFIG_SMP)  += smp.o subcore.o subcore-asm.o
 obj-$(CONFIG_PCI)  += pci.o pci-p5ioc2.o pci-ioda.o
diff --git a/arch/powerpc/platforms/powernv/opal-poweroff-events.c