[PATCH] powerpc: irq work racing with timer interrupt can result in timer interrupt hang

2014-05-09 Thread Anton Blanchard
I am seeing an issue where a CPU running perf eventually hangs.
Traces show timer interrupts happening every 4 seconds even
when a userspace task is running on the CPU. /proc/timer_list
also shows pending hrtimers have not run in over an hour,
including the scheduler.

Looking closer, decrementers_next_tb is getting set to
0xffffffffffffffff, and at that point we will never take
a timer interrupt again.

In __timer_interrupt() we set decrementers_next_tb to
0xffffffffffffffff and rely on ->event_handler to update it:

	*next_tb = ~(u64)0;
	if (evt->event_handler)
		evt->event_handler(evt);

In this case ->event_handler is hrtimer_interrupt. This will eventually
call back through the clockevents code with the next event to be
programmed:

static int decrementer_set_next_event(unsigned long evt,
				      struct clock_event_device *dev)
{
	/* Don't adjust the decrementer if some irq work is pending */
	if (test_irq_work_pending())
		return 0;
	__get_cpu_var(decrementers_next_tb) = get_tb_or_rtc() + evt;

If irq work came in between these two points, we will return
before updating decrementers_next_tb and we never process a timer
interrupt again.

This looks to have been introduced by 0215f7d8c53f (powerpc: Fix races
with irq_work). Fix it by removing the early exit and relying on
code later on in the function to force an early decrementer:

   /* We may have raced with new irq work */
   if (test_irq_work_pending())
   set_dec(1);
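
The race can be reproduced in a small user-space model. This is only a sketch under stated assumptions: struct sim_cpu, the sim_ and set_next_event_old/new names, and the way the decrementer is modelled are all hypothetical and are not kernel code. It shows that the early return leaves next_tb at ~0, while the patched ordering records the deadline and still forces an early decrementer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define NEXT_TB_NEVER (~(uint64_t)0)

/* Hypothetical per-CPU state; the field names are illustrative only. */
struct sim_cpu {
	uint64_t tb;           /* current timebase value */
	uint64_t next_tb;      /* stands in for decrementers_next_tb */
	uint64_t dec;          /* stands in for the decrementer register */
	bool irq_work_pending; /* stands in for test_irq_work_pending() */
};

typedef int (*set_next_fn)(struct sim_cpu *cpu, uint64_t evt);

/* Behaviour before the patch: return early if irq work is pending. */
static int set_next_event_old(struct sim_cpu *cpu, uint64_t evt)
{
	if (cpu->irq_work_pending)
		return 0;
	cpu->next_tb = cpu->tb + evt;
	cpu->dec = evt;
	return 0;
}

/* Behaviour after the patch: always record the deadline, then force an
 * early decrementer if irq work raced in. */
static int set_next_event_new(struct sim_cpu *cpu, uint64_t evt)
{
	cpu->next_tb = cpu->tb + evt;
	cpu->dec = evt;
	if (cpu->irq_work_pending)
		cpu->dec = 1;
	return 0;
}

/* Models the part of __timer_interrupt() quoted above. Returns true if
 * a real deadline was left in next_tb. */
static bool sim_timer_interrupt(struct sim_cpu *cpu, set_next_fn set_next,
				uint64_t evt)
{
	assert(cpu != NULL && set_next != NULL);
	cpu->next_tb = NEXT_TB_NEVER;
	set_next(cpu, evt); /* stands in for evt->event_handler(evt) */
	return cpu->next_tb != NEXT_TB_NEVER;
}
```

Calling sim_timer_interrupt() with irq_work_pending set and the old callback leaves next_tb at NEXT_TB_NEVER; with the new callback next_tb becomes tb + evt and dec is 1.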

Signed-off-by: Anton Blanchard an...@samba.org
Cc: sta...@vger.kernel.org # 3.14+
---

diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c
index 122a580..4f0b676 100644
--- a/arch/powerpc/kernel/time.c
+++ b/arch/powerpc/kernel/time.c
@@ -813,9 +888,6 @@ static void __init clocksource_init(void)
 static int decrementer_set_next_event(unsigned long evt,
 				      struct clock_event_device *dev)
 {
-	/* Don't adjust the decrementer if some irq work is pending */
-	if (test_irq_work_pending())
-		return 0;
 	__get_cpu_var(decrementers_next_tb) = get_tb_or_rtc() + evt;
 	set_dec(evt);
 
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

[PATCH RFC v2 00/10] EEH Support for VFIO PCI devices on PowerKVM guest

2014-05-09 Thread Gavin Shan
This series adds EEH support for PCI devices that are passed through
to a PowerKVM based guest via VFIO. The implementation follows directly
from the problems we have to solve to support EEH for PowerKVM based
guests:

- Emulation for EEH RTAS requests. All EEH RTAS requests go to QEMU first.
  If QEMU can't handle a request, it is sent to the host via the newly
  introduced VFIO container IOCTL command (VFIO_EEH_INFO) and handled in
  the host kernel.

- The error injection infrastructure needs to support requests from the
  userland utility errinjct and from PowerKVM based guests. errinjct works
  well on the pSeries platform with a dedicated syscall, which invokes the
  RTAS service to perform error injection in the kernel. From that
  perspective, it's reasonable to extend the syscall to support the PowerNV
  platform so that an OPAL call can be invoked in the host kernel to inject
  errors. The data transported between userland and kernel still follows
  struct rtas_args in both cases: PowerNV (OPAL) and pSeries (RTAS).

The series of patches requires corresponding firmware changes from Mike Qiu to
support error injection and QEMU changes to support EEH for guest. QEMU patchset
will be sent separately.

Change log
==========
v1 -> v2:
* EEH RTAS requests are routed to QEMU, and then possibly to the host
  kernel. The KVM in-kernel handling mechanism is dropped.
* Error injection is reimplemented based on the syscall, instead of KVM
  in-kernel handling. The logic for error injection token management is
  moved to QEMU. The error injection request is routed to QEMU and then
  possibly to the host kernel.

Testing on P7
=============

- Emulex adapter

Testing on P8
=============

- Need more testing after design is finalized.

-

Gavin Shan (10):
  drivers/vfio: Introduce CONFIG_VFIO_EEH
  powerpc/eeh: Info to trace passed devices
  powerpc/eeh: Search EEH device by guest address
  powerpc/eeh: Search EEH PE by guest address
  drivers/vfio: New IOCTL command VFIO_EEH_INFO
  powerpc/eeh: Avoid event on passed PE
  powerpc/powernv: Sync OPAL header file with firmware
  powerpc: Extend syscall ppc_rtas()
  powerpc/powernv: Implement ppc_call_opal()
  powerpc/powernv: Error injection infrastructure

arch/powerpc/include/asm/eeh.h |  52 +
arch/powerpc/include/asm/opal.h|  74 +-
arch/powerpc/include/asm/rtas.h|  10 ++-
arch/powerpc/include/asm/syscalls.h|   2 +-
arch/powerpc/include/asm/systbl.h  |   2 +-
arch/powerpc/include/uapi/asm/unistd.h |   2 +-
arch/powerpc/kernel/eeh.c  |   8 ++
arch/powerpc/kernel/eeh_pe.c   |  80 +++
arch/powerpc/kernel/rtas.c |  57 +++---
arch/powerpc/kernel/syscalls.c |  50 
arch/powerpc/platforms/powernv/Makefile|   3 +-
arch/powerpc/platforms/powernv/eeh-ioda.c  |   3 +-
arch/powerpc/platforms/powernv/eeh-vfio.c  | 584 +
arch/powerpc/platforms/powernv/errinject.c | 222 
arch/powerpc/platforms/powernv/opal-wrappers.S |   1 +
arch/powerpc/platforms/powernv/opal.c  |  93 ++
drivers/vfio/Kconfig   |   6 ++
drivers/vfio/vfio_iommu_spapr_tce.c|  12 +++
include/uapi/linux/vfio.h  |  61 +++
kernel/sys_ni.c|   2 +-
20 files changed, 1271 insertions(+), 53 deletions(-)
create mode 100644 arch/powerpc/platforms/powernv/eeh-vfio.c
create mode 100644 arch/powerpc/platforms/powernv/errinject.c

Thanks,
Gavin


[PATCH 06/10] powerpc/eeh: Avoid event on passed PE

2014-05-09 Thread Gavin Shan
If we detect a frozen state on a PE that has been passed to a guest, we
needn't handle it. Instead, we rely on the guest to detect and recover
it. The patch avoids raising an EEH event on the frozen passed PE so that
the guest has a chance to handle it.

Signed-off-by: Gavin Shan gws...@linux.vnet.ibm.com
---
 arch/powerpc/kernel/eeh.c | 8 
 arch/powerpc/platforms/powernv/eeh-ioda.c | 3 ++-
 2 files changed, 10 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/kernel/eeh.c b/arch/powerpc/kernel/eeh.c
index 9c6b899..6543f05 100644
--- a/arch/powerpc/kernel/eeh.c
+++ b/arch/powerpc/kernel/eeh.c
@@ -400,6 +400,14 @@ int eeh_dev_check_failure(struct eeh_dev *edev)
 	if (ret > 0)
 		return ret;
 
+   /*
+* If the PE has been passed to guest, we won't check the
+* state. Instead, let the guest handle it if the PE has
+* been frozen.
+*/
+   if (eeh_pe_passed(pe))
+   return 0;
+
/* If we already have a pending isolation event for this
 * slot, we know it's bad already, we don't need to check.
 * Do this checking under a lock; as multiple PCI devices
diff --git a/arch/powerpc/platforms/powernv/eeh-ioda.c b/arch/powerpc/platforms/powernv/eeh-ioda.c
index 1b5982f..03a3ed2 100644
--- a/arch/powerpc/platforms/powernv/eeh-ioda.c
+++ b/arch/powerpc/platforms/powernv/eeh-ioda.c
@@ -890,7 +890,8 @@ static int ioda_eeh_next_error(struct eeh_pe **pe)
 				opal_pci_eeh_freeze_clear(phb->opal_id, frozen_pe_no,
 					OPAL_EEH_ACTION_CLEAR_FREEZE_ALL);
 				ret = EEH_NEXT_ERR_NONE;
-			} else if ((*pe)->state & EEH_PE_ISOLATED) {
+			} else if ((*pe)->state & EEH_PE_ISOLATED ||
+				   eeh_pe_passed(*pe)) {
 				ret = EEH_NEXT_ERR_NONE;
 			} else {
 				pr_err("EEH: Frozen PHB#%x-PE#%x (%s) detected\n",
-- 
1.8.3.2


[PATCH 02/10] powerpc/eeh: Info to trace passed devices

2014-05-09 Thread Gavin Shan
The address of passed PCI devices (domain:bus:slot:func) might be
quite different from the perspective of host and guest. We have to
trace the address mapping so that we can emulate EEH RTAS requests
from guest. The patch introduces additional fields to eeh_pe and
eeh_dev for the purpose.
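
As a rough illustration of the bookkeeping this patch adds, the flag helpers can be exercised outside the kernel. This is a minimal sketch, assuming stand-in names (struct sketch_pe, PE_PASSTHROUGH, pe_passed, pe_set_passed) that only mirror the real eeh_pe fields and helpers. The point it shows is that clearing the flag must use "&= ~FLAG" so that unrelated state bits survive.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PE_KEEP        (1 << 8) /* mirrors EEH_PE_KEEP */
#define PE_PASSTHROUGH (1 << 9) /* mirrors EEH_PE_PASSTHROUGH */

/* Guest-visible address, mirroring struct eeh_vfio_pci_addr. */
struct guest_addr {
	uint64_t buid;    /* PHB BUID */
	uint16_t bdn;     /* bus/device/function number */
	uint32_t pe_addr; /* PE configuration address */
};

/* Hypothetical, cut-down stand-in for struct eeh_pe. */
struct sketch_pe {
	int state;
	struct guest_addr gaddr;
};

/* True if the PE is currently owned by a guest; NULL is treated as "no". */
static bool pe_passed(const struct sketch_pe *pe)
{
	return pe ? !!(pe->state & PE_PASSTHROUGH) : false;
}

/* Set or clear the passthrough bit without touching other state bits. */
static void pe_set_passed(struct sketch_pe *pe, bool passed)
{
	if (!pe)
		return;
	if (passed)
		pe->state |= PE_PASSTHROUGH;
	else
		pe->state &= ~PE_PASSTHROUGH;
}
```

A plain assignment of ~PE_PASSTHROUGH in the clear path would set every other bit in state, which is why the compound "&=" form matters.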

Signed-off-by: Gavin Shan gws...@linux.vnet.ibm.com
---
 arch/powerpc/include/asm/eeh.h | 46 ++
 1 file changed, 46 insertions(+)

diff --git a/arch/powerpc/include/asm/eeh.h b/arch/powerpc/include/asm/eeh.h
index 7782056..3268692 100644
--- a/arch/powerpc/include/asm/eeh.h
+++ b/arch/powerpc/include/asm/eeh.h
@@ -48,6 +48,14 @@ struct device_node;
 #define EEH_PE_RST_HOLD_TIME   250
 #define EEH_PE_RST_SETTLE_TIME 1800
 
+#ifdef CONFIG_VFIO_EEH
+struct eeh_vfio_pci_addr {
+	uint64_t	buid;		/* PHB BUID			*/
+	uint16_t	bdn;		/* Bus/Device/Function number	*/
+	uint32_t	pe_addr;	/* PE configuration address	*/
+};
+#endif /* CONFIG_VFIO_EEH */
+
 /*
  * The struct is used to trace PE related EEH functionality.
  * In theory, there will have one instance of the struct to
@@ -72,6 +80,7 @@ struct device_node;
 #define EEH_PE_RESET		(1 << 2)	/* PE reset in progress */
 
 #define EEH_PE_KEEP		(1 << 8)	/* Keep PE on hotplug	*/
+#define EEH_PE_PASSTHROUGH	(1 << 9)	/* PE owned by guest	*/
 
 struct eeh_pe {
int type;   /* PE type: PHB/Bus/Device  */
@@ -85,6 +94,9 @@ struct eeh_pe {
struct timeval tstamp;  /* Time on first-time freeze*/
int false_positives;/* Times of reported #ff's  */
struct eeh_pe *parent;  /* Parent PE*/
+#ifdef CONFIG_VFIO_EEH
+   struct eeh_vfio_pci_addr gaddr; /* Address in guest */
+#endif
struct list_head child_list;/* Link PE to the child list*/
struct list_head edevs; /* Link list of EEH devices */
struct list_head child; /* Child PEs*/
@@ -93,6 +105,21 @@ struct eeh_pe {
 #define eeh_pe_for_each_dev(pe, edev, tmp) \
 		list_for_each_entry_safe(edev, tmp, &pe->edevs, list)
 
+static inline bool eeh_pe_passed(struct eeh_pe *pe)
+{
+	return pe ? !!(pe->state & EEH_PE_PASSTHROUGH) : false;
+}
+
+static inline void eeh_pe_set_passed(struct eeh_pe *pe, bool passed)
+{
+	if (pe) {
+		if (passed)
+			pe->state |= EEH_PE_PASSTHROUGH;
+		else
+			pe->state &= ~EEH_PE_PASSTHROUGH;
+	}
+}
+
 /*
  * The struct is used to trace EEH state for the associated
  * PCI device node or PCI device. In future, it might
@@ -110,6 +137,7 @@ struct eeh_pe {
 #define EEH_DEV_SYSFS		(1 << 9)	/* Sysfs created	*/
 #define EEH_DEV_REMOVED	(1 << 10)	/* Removed permanently	*/
 #define EEH_DEV_FRESET		(1 << 11)	/* Fundamental reset	*/
+#define EEH_DEV_PASSTHROUGH	(1 << 12)	/* Owned by guest	*/
 
 struct eeh_dev {
int mode;   /* EEH mode */
@@ -126,6 +154,9 @@ struct eeh_dev {
struct device_node *dn; /* Associated device node   */
struct pci_dev *pdev;   /* Associated PCI device*/
struct pci_bus *bus;/* PCI bus for partial hotplug  */
+#ifdef CONFIG_VFIO_EEH
+   struct eeh_vfio_pci_addr gaddr; /* Address in guest */
+#endif
 };
 
 static inline struct device_node *eeh_dev_to_of_node(struct eeh_dev *edev)
@@ -138,6 +169,21 @@ static inline struct pci_dev *eeh_dev_to_pci_dev(struct eeh_dev *edev)
 	return edev ? edev->pdev : NULL;
 }
 
+static inline bool eeh_dev_passed(struct eeh_dev *dev)
+{
+	return dev ? !!(dev->mode & EEH_DEV_PASSTHROUGH) : false;
+}
+
+static inline void eeh_dev_set_passed(struct eeh_dev *dev, bool passed)
+{
+	if (dev) {
+		if (passed)
+			dev->mode |= EEH_DEV_PASSTHROUGH;
+		else
+			dev->mode &= ~EEH_DEV_PASSTHROUGH;
+	}
+}
+
 /* Return values from eeh_ops::next_error */
 enum {
EEH_NEXT_ERR_NONE = 0,
-- 
1.8.3.2


[PATCH 01/10] drivers/vfio: Introduce CONFIG_VFIO_EEH

2014-05-09 Thread Gavin Shan
The patch introduces CONFIG_VFIO_EEH, which enables additional IOCTL
commands on tce_iommu_driver_ops to support EEH functionality for PCI
devices that are passed through from host to guest.

Signed-off-by: Gavin Shan gws...@linux.vnet.ibm.com
---
 drivers/vfio/Kconfig | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/drivers/vfio/Kconfig b/drivers/vfio/Kconfig
index af7b204..4f3293b 100644
--- a/drivers/vfio/Kconfig
+++ b/drivers/vfio/Kconfig
@@ -8,11 +8,17 @@ config VFIO_IOMMU_SPAPR_TCE
 	depends on VFIO && SPAPR_TCE_IOMMU
 	default n
 
+config VFIO_EEH
+	tristate
+	depends on EEH && VFIO_IOMMU_SPAPR_TCE
+	default n
+
 menuconfig VFIO
 	tristate "VFIO Non-Privileged userspace driver framework"
depends on IOMMU_API
select VFIO_IOMMU_TYPE1 if X86
select VFIO_IOMMU_SPAPR_TCE if (PPC_POWERNV || PPC_PSERIES)
+   select VFIO_EEH if PPC_POWERNV
select ANON_INODES
help
  VFIO provides a framework for secure userspace device drivers.
-- 
1.8.3.2


[PATCH 03/10] powerpc/eeh: Search EEH device by guest address

2014-05-09 Thread Gavin Shan
The patch introduces function eeh_vfio_dev_get() to search the EEH
device according to its guest address, which is made up of PHB BUID,
bus, slot and function number. The function is useful in the backends
for EEH RTAS emulation.
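
The lookup described above can be sketched in isolation. The following is an assumption-laden illustration, not the kernel implementation: it replaces the PE tree walk with a flat array, and struct sketch_dev and dev_get_by_guest_addr() are hypothetical names. It keeps the two properties the patch relies on: devices that are not passed through are skipped, and a match requires both the guest BUID and the guest bus/device/function number.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, cut-down stand-in for struct eeh_dev. */
struct sketch_dev {
	bool passed;    /* mirrors the EEH_DEV_PASSTHROUGH flag */
	uint64_t gbuid; /* guest PHB BUID */
	uint16_t gbdn;  /* guest bus/device/function number */
};

/* Return the first passed-through device whose guest address matches,
 * or NULL if there is none. */
static struct sketch_dev *dev_get_by_guest_addr(struct sketch_dev *devs,
						size_t count,
						uint64_t buid, uint16_t bdn)
{
	size_t i;

	for (i = 0; i < count; i++) {
		if (!devs[i].passed)
			continue; /* host-owned devices never match */
		if (devs[i].gbuid == buid && devs[i].gbdn == bdn)
			return &devs[i];
	}
	return NULL;
}
```

A device whose stored guest address happens to match but which is not marked as passed is deliberately ignored, since its gaddr fields carry no meaning.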

Signed-off-by: Gavin Shan gws...@linux.vnet.ibm.com
---
 arch/powerpc/include/asm/eeh.h |  5 +
 arch/powerpc/kernel/eeh_pe.c   | 42 ++
 2 files changed, 47 insertions(+)

diff --git a/arch/powerpc/include/asm/eeh.h b/arch/powerpc/include/asm/eeh.h
index 3268692..8ffaf39 100644
--- a/arch/powerpc/include/asm/eeh.h
+++ b/arch/powerpc/include/asm/eeh.h
@@ -381,6 +381,11 @@ static inline void eeh_remove_device(struct pci_dev *dev) { }
 #define EEH_IO_ERROR_VALUE(size) (-1UL)
 #endif /* CONFIG_EEH */
 
+
+#ifdef CONFIG_VFIO_EEH
+struct eeh_dev *eeh_vfio_dev_get(struct eeh_vfio_pci_addr *addr);
+#endif /* CONFIG_VFIO_EEH */
+
 #ifdef CONFIG_PPC64
 /*
  * MMIO read/write operations with EEH support.
diff --git a/arch/powerpc/kernel/eeh_pe.c b/arch/powerpc/kernel/eeh_pe.c
index fbd01eb..d09f055 100644
--- a/arch/powerpc/kernel/eeh_pe.c
+++ b/arch/powerpc/kernel/eeh_pe.c
@@ -248,6 +248,48 @@ struct eeh_pe *eeh_pe_get(struct eeh_dev *edev)
return pe;
 }
 
+#ifdef CONFIG_VFIO_EEH
+static void *__eeh_vfio_dev_get(void *data, void *flag)
+{
+   struct eeh_pe *pe = (struct eeh_pe *)data;
+   struct eeh_vfio_pci_addr *addr = (struct eeh_vfio_pci_addr *)flag;
+   struct eeh_dev *edev, *tmp;
+
+	eeh_pe_for_each_dev(pe, edev, tmp) {
+		if (!eeh_dev_passed(edev))
+			continue;
+
+		/* Comparing the address in the guest */
+		if (addr->buid == edev->gaddr.buid &&
+		    addr->bdn  == edev->gaddr.bdn)
+			return edev;
+	}
+
+   return NULL;
+}
+
+/**
+ * eeh_vfio_dev_get - Search EEH device based on guest's address
+ * @addr: EEH device guest address
+ *
+ * Search the EEH device according to its guest's address, which
+ * is made up of PHB BUID, and PCI config address.
+ */
+struct eeh_dev *eeh_vfio_dev_get(struct eeh_vfio_pci_addr *addr)
+{
+   struct eeh_pe *root;
+   struct eeh_dev *edev;
+
+	list_for_each_entry(root, &eeh_phb_pe, child) {
+   edev = eeh_pe_traverse(root, __eeh_vfio_dev_get, addr);
+   if (edev)
+   return edev;
+   }
+
+   return NULL;
+}
+#endif /* CONFIG_VFIO_EEH */
+
 /**
  * eeh_pe_get_parent - Retrieve the parent PE
  * @edev: EEH device
-- 
1.8.3.2


[PATCH 08/10] powerpc: Extend syscall ppc_rtas()

2014-05-09 Thread Gavin Shan
Originally, the ppc_rtas() syscall could only be used to invoke RTAS
calls from user space. The errinjct utility uses it to inject various
errors into the system for testing purposes. The patch extends the
syscall to support both the pSeries and PowerNV platforms. With that,
both RTAS and OPAL calls can be invoked from user space. In turn, the
errinjct utility can be supported on pSeries and PowerNV at the same
time.

The original syscall handler ppc_rtas() is renamed to ppc_firmware(),
which calls ppc_call_rtas() or ppc_call_opal() depending on the
running platform. The data between userland and kernel is carried
by struct rtas_args. It's platform specific on how to use the data.
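
The dispatch described above can be sketched as follows. This is a hedged illustration only: how the real ppc_firmware() detects the running platform is not visible in this excerpt, so the sketch passes the platform in as a parameter, and every sketch_ name, the enum values and the handled_by field are hypothetical. The -ENXIO fallback mirrors the stubs the patch adds for configurations where a backend is not built.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

enum sketch_platform { SKETCH_PSERIES, SKETCH_POWERNV, SKETCH_OTHER };
enum sketch_backend { BACKEND_NONE, BACKEND_RTAS, BACKEND_OPAL };

/* Hypothetical stand-in for struct rtas_args. */
struct sketch_args {
	uint32_t token;
	enum sketch_backend handled_by; /* records which backend ran */
};

/* Stand-in for ppc_call_rtas(): the pSeries backend. */
static int sketch_call_rtas(struct sketch_args *args)
{
	args->handled_by = BACKEND_RTAS;
	return 0;
}

/* Stand-in for ppc_call_opal(): the PowerNV backend. */
static int sketch_call_opal(struct sketch_args *args)
{
	args->handled_by = BACKEND_OPAL;
	return 0;
}

/* Stand-in for ppc_firmware(): one entry point, platform-specific backend. */
static int sketch_firmware(enum sketch_platform plat, struct sketch_args *args)
{
	if (args == NULL)
		return -EINVAL;

	switch (plat) {
	case SKETCH_PSERIES:
		return sketch_call_rtas(args);
	case SKETCH_POWERNV:
		return sketch_call_opal(args);
	default:
		return -ENXIO; /* no firmware backend available */
	}
}
```

Keeping a single argument structure for both backends is what lets the same user-space utility run unchanged on either platform; only the interpretation of the fields differs.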

Signed-off-by: Mike Qiu qiud...@linux.vnet.ibm.com
Signed-off-by: Gavin Shan gws...@linux.vnet.ibm.com
---
 arch/powerpc/include/asm/rtas.h| 10 +-
 arch/powerpc/include/asm/syscalls.h|  2 +-
 arch/powerpc/include/asm/systbl.h  |  2 +-
 arch/powerpc/include/uapi/asm/unistd.h |  2 +-
 arch/powerpc/kernel/rtas.c | 57 +++---
 arch/powerpc/kernel/syscalls.c | 50 +
 arch/powerpc/platforms/powernv/opal.c  |  7 +
 kernel/sys_ni.c|  2 +-
 8 files changed, 82 insertions(+), 50 deletions(-)

diff --git a/arch/powerpc/include/asm/rtas.h b/arch/powerpc/include/asm/rtas.h
index b390f55..3428524 100644
--- a/arch/powerpc/include/asm/rtas.h
+++ b/arch/powerpc/include/asm/rtas.h
@@ -20,7 +20,7 @@
 #define RTAS_UNKNOWN_SERVICE (-1)
 #define RTAS_INSTANTIATE_MAX (1ULL<<30) /* Don't instantiate rtas at/above this value */
 
-/* Buffer size for ppc_rtas system call. */
+/* Buffer size for ppc_firmware system call. */
 #define RTAS_RMOBUF_MAX (64 * 1024)
 
 /* RTAS return status codes */
@@ -427,9 +427,17 @@ static inline int page_is_rtas_user_buf(unsigned long pfn)
 /* Not the best place to put pSeries_coalesce_init, will be fixed when we
  * move some of the rtas suspend-me stuff to pseries */
 extern void pSeries_coalesce_init(void);
+extern int ppc_call_rtas(struct rtas_args *args);
 #else
 static inline int page_is_rtas_user_buf(unsigned long pfn) { return 0;}
 static inline void pSeries_coalesce_init(void) { }
+static inline int ppc_call_rtas(struct rtas_args *args) { return -ENXIO; }
+#endif
+
+#ifdef CONFIG_PPC_POWERNV
+extern int ppc_call_opal(struct rtas_args *args);
+#else
+static inline int ppc_call_opal(struct rtas_args *args) { return -ENXIO; }
 #endif
 
 extern int call_rtas(const char *, int, int, unsigned long *, ...);
diff --git a/arch/powerpc/include/asm/syscalls.h b/arch/powerpc/include/asm/syscalls.h
index 23be8f1..3383e50 100644
--- a/arch/powerpc/include/asm/syscalls.h
+++ b/arch/powerpc/include/asm/syscalls.h
@@ -15,7 +15,7 @@ asmlinkage unsigned long sys_mmap2(unsigned long addr, size_t len,
unsigned long prot, unsigned long flags,
unsigned long fd, unsigned long pgoff);
 asmlinkage long ppc64_personality(unsigned long personality);
-asmlinkage int ppc_rtas(struct rtas_args __user *uargs);
+asmlinkage int ppc_firmware(struct rtas_args __user *uargs);
 
 #endif /* __KERNEL__ */
 #endif /* __ASM_POWERPC_SYSCALLS_H */
diff --git a/arch/powerpc/include/asm/systbl.h b/arch/powerpc/include/asm/systbl.h
index 3ddf702..00f8bb2 100644
--- a/arch/powerpc/include/asm/systbl.h
+++ b/arch/powerpc/include/asm/systbl.h
@@ -259,7 +259,7 @@ COMPAT_SYS_SPU(utimes)
 COMPAT_SYS_SPU(statfs64)
 COMPAT_SYS_SPU(fstatfs64)
 SYSX(sys_ni_syscall, ppc_fadvise64_64, ppc_fadvise64_64)
-PPC_SYS_SPU(rtas)
+PPC_SYS_SPU(firmware)
 OLDSYS(debug_setcontext)
 SYSCALL(ni_syscall)
 COMPAT_SYS(migrate_pages)
diff --git a/arch/powerpc/include/uapi/asm/unistd.h b/arch/powerpc/include/uapi/asm/unistd.h
index 881bf2e..3aee765 100644
--- a/arch/powerpc/include/uapi/asm/unistd.h
+++ b/arch/powerpc/include/uapi/asm/unistd.h
@@ -273,7 +273,7 @@
 #ifndef __powerpc64__
 #define __NR_fadvise64_64  254
 #endif
-#define __NR_rtas  255
+#define __NR_firmware  255
 #define __NR_sys_debug_setcontext 256
 /* Number 257 is reserved for vserver */
 #define __NR_migrate_pages 258
diff --git a/arch/powerpc/kernel/rtas.c b/arch/powerpc/kernel/rtas.c
index 8cd5ed0..5d829a72 100644
--- a/arch/powerpc/kernel/rtas.c
+++ b/arch/powerpc/kernel/rtas.c
@@ -1017,59 +1017,32 @@ struct pseries_errorlog *get_pseries_errorlog(struct rtas_error_log *log,
 }
 
 /* We assume to be passed big endian arguments */
-asmlinkage int ppc_rtas(struct rtas_args __user *uargs)
+int ppc_call_rtas(struct rtas_args *args)
 {
-   struct rtas_args args;
unsigned long flags;
char *buff_copy, *errbuf = NULL;
-   int nargs, nret, token;
int rc;
 
-   if (!capable(CAP_SYS_ADMIN))
-   return -EPERM;
-
-	if (copy_from_user(&args, uargs, 3 * sizeof(u32)) != 0)
-   return -EFAULT;
-
-   nargs = be32_to_cpu(args.nargs);
-   nret  = be32_to_cpu(args.nret);
-   token = 

[PATCH 04/10] powerpc/eeh: Search EEH PE by guest address

2014-05-09 Thread Gavin Shan
The patch introduces function eeh_vfio_pe_get() to search the EEH
PE according to its guest address, which is made up of PHB ID and
PE configuration address. The function will be useful in backends
for EEH RTAS emulation.
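
The search pattern used here, a traversal that stops at the first non-NULL callback result, can be sketched on its own. This is a minimal sketch under assumptions: struct tree_pe with child/sibling pointers, pe_traverse() and match_guest_pe() are hypothetical stand-ins, and the real eeh_pe_traverse() walks kernel list heads rather than raw pointers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, cut-down PE tree node. */
struct tree_pe {
	bool passed;             /* mirrors EEH_PE_PASSTHROUGH */
	uint64_t gbuid;          /* guest PHB BUID */
	uint32_t gpe_addr;       /* guest PE configuration address */
	struct tree_pe *child;   /* first child, or NULL */
	struct tree_pe *sibling; /* next sibling, or NULL */
};

/* Search key: the guest address being looked up. */
struct pe_key {
	uint64_t buid;
	uint32_t pe_addr;
};

typedef void *(*pe_visit_fn)(struct tree_pe *pe, void *data);

/* Depth-first walk; the first non-NULL callback result ends the walk. */
static void *pe_traverse(struct tree_pe *pe, pe_visit_fn fn, void *data)
{
	struct tree_pe *c;
	void *ret;

	if (!pe)
		return NULL;
	ret = fn(pe, data);
	if (ret)
		return ret;
	for (c = pe->child; c; c = c->sibling) {
		ret = pe_traverse(c, fn, data);
		if (ret)
			return ret;
	}
	return NULL;
}

/* Callback: match a passed-through PE by guest BUID and PE address. */
static void *match_guest_pe(struct tree_pe *pe, void *data)
{
	const struct pe_key *key = data;

	if (!pe->passed)
		return NULL;
	if (pe->gbuid == key->buid && pe->gpe_addr == key->pe_addr)
		return pe;
	return NULL;
}
```

The callback returns the node itself on a match, so the caller gets the PE back directly from pe_traverse() without a separate output parameter.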

Signed-off-by: Gavin Shan gws...@linux.vnet.ibm.com
---
 arch/powerpc/include/asm/eeh.h |  1 +
 arch/powerpc/kernel/eeh_pe.c   | 38 ++
 2 files changed, 39 insertions(+)

diff --git a/arch/powerpc/include/asm/eeh.h b/arch/powerpc/include/asm/eeh.h
index 8ffaf39..750e028 100644
--- a/arch/powerpc/include/asm/eeh.h
+++ b/arch/powerpc/include/asm/eeh.h
@@ -384,6 +384,7 @@ static inline void eeh_remove_device(struct pci_dev *dev) { }
 
 #ifdef CONFIG_VFIO_EEH
 struct eeh_dev *eeh_vfio_dev_get(struct eeh_vfio_pci_addr *addr);
+struct eeh_pe *eeh_vfio_pe_get(struct eeh_vfio_pci_addr *addr);
 #endif /* CONFIG_VFIO_EEH */
 
 #ifdef CONFIG_PPC64
diff --git a/arch/powerpc/kernel/eeh_pe.c b/arch/powerpc/kernel/eeh_pe.c
index d09f055..8dc58ac 100644
--- a/arch/powerpc/kernel/eeh_pe.c
+++ b/arch/powerpc/kernel/eeh_pe.c
@@ -288,6 +288,44 @@ struct eeh_dev *eeh_vfio_dev_get(struct eeh_vfio_pci_addr *addr)
 
return NULL;
 }
+
+static void *__eeh_vfio_pe_get(void *data, void *flag)
+{
+   struct eeh_pe *pe = (struct eeh_pe *)data;
+   struct eeh_vfio_pci_addr *addr = (struct eeh_vfio_pci_addr *)flag;
+
+   if (!eeh_pe_passed(pe))
+   return NULL;
+
+   /* Comparing the address */
+	if (addr->buid    == pe->gaddr.buid &&
+	    addr->pe_addr == pe->gaddr.pe_addr)
+   return pe;
+
+   return NULL;
+}
+
+/**
+ * eeh_vfio_pe_get - Search EEH PE based on guest's address
+ * @addr: EEH PE guest address
+ *
+ * Search the EEH PE according to the guest address, which
+ * is made up of VM indicator, PHB BUID, and PE configuration
+ * address.
+ */
+struct eeh_pe *eeh_vfio_pe_get(struct eeh_vfio_pci_addr *addr)
+{
+   struct eeh_pe *root;
+   struct eeh_pe *pe;
+
+	list_for_each_entry(root, &eeh_phb_pe, child) {
+   pe = eeh_pe_traverse(root, __eeh_vfio_pe_get, addr);
+   if (pe)
+   return pe;
+   }
+
+   return NULL;
+}
 #endif /* CONFIG_VFIO_EEH */
 
 /**
-- 
1.8.3.2


[PATCH 05/10] drivers/vfio: New IOCTL command VFIO_EEH_INFO

2014-05-09 Thread Gavin Shan
The patch adds new IOCTL command VFIO_EEH_INFO to VFIO container
to support EEH functionality for PCI devices, which have been
passed from host to guest via VFIO.
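
The map operation in this patch unpacks a 16-bit bus/device/function number and derives a PE address from the bus number. A small sketch of that arithmetic, assuming hypothetical helper names (bdn_decode, pe_addr_from_bus) and using the standard PCI devfn layout of a 5-bit slot above a 3-bit function:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Decoded PCI location. */
struct bdf {
	uint8_t bus;
	uint8_t slot;
	uint8_t func;
};

/* Split a 16-bit "bdn" into bus, slot and function numbers. */
static struct bdf bdn_decode(uint16_t bdn)
{
	struct bdf out;
	uint8_t devfn = bdn & 0xff;       /* low byte: device and function */

	out.bus = (bdn >> 8) & 0xff;      /* high byte: bus number */
	out.slot = (devfn >> 3) & 0x1f;   /* same layout as PCI_SLOT() */
	out.func = devfn & 0x07;          /* same layout as PCI_FUNC() */
	return out;
}

/* Build a PE address in the 00BBSS00 format with SS left as zero. */
static uint32_t pe_addr_from_bus(uint8_t bus)
{
	return (uint32_t)bus << 16;
}
```

For example, bdn 0x0a19 decodes to bus 0x0a, slot 3, function 1, and bus 0x0a yields the PE address 0x000a0000.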

Signed-off-by: Gavin Shan gws...@linux.vnet.ibm.com
---
 arch/powerpc/platforms/powernv/Makefile   |   1 +
 arch/powerpc/platforms/powernv/eeh-vfio.c | 584 ++
 drivers/vfio/vfio_iommu_spapr_tce.c   |  12 +
 include/uapi/linux/vfio.h |  61 
 4 files changed, 658 insertions(+)
 create mode 100644 arch/powerpc/platforms/powernv/eeh-vfio.c

diff --git a/arch/powerpc/platforms/powernv/Makefile b/arch/powerpc/platforms/powernv/Makefile
index 63cebb9..2b15a03 100644
--- a/arch/powerpc/platforms/powernv/Makefile
+++ b/arch/powerpc/platforms/powernv/Makefile
@@ -6,5 +6,6 @@ obj-y   += opal-msglog.o
 obj-$(CONFIG_SMP)  += smp.o
 obj-$(CONFIG_PCI)  += pci.o pci-p5ioc2.o pci-ioda.o
 obj-$(CONFIG_EEH)  += eeh-ioda.o eeh-powernv.o
+obj-$(CONFIG_VFIO_EEH) += eeh-vfio.o
 obj-$(CONFIG_PPC_SCOM) += opal-xscom.o
 obj-$(CONFIG_MEMORY_FAILURE)   += opal-memory-errors.o
diff --git a/arch/powerpc/platforms/powernv/eeh-vfio.c b/arch/powerpc/platforms/powernv/eeh-vfio.c
new file mode 100644
index 0000000..5766715
--- /dev/null
+++ b/arch/powerpc/platforms/powernv/eeh-vfio.c
@@ -0,0 +1,584 @@
+/*
+  * The file intends to support EEH functionality for those PCI devices,
+  * which have been passed through from host to guest via VFIO. So this
+  * file is naturally part of VFIO implementation on PowerNV platform.
+  *
+  * Copyright Benjamin Herrenschmidt & Gavin Shan, IBM Corporation 2014.
+  *
+  * This program is free software; you can redistribute it and/or modify
+  * it under the terms of the GNU General Public License as published by
+  * the Free Software Foundation; either version 2 of the License, or
+  * (at your option) any later version.
+  */
+
+#include <linux/init.h>
+#include <linux/io.h>
+#include <linux/irq.h>
+#include <linux/kernel.h>
+#include <linux/kvm_host.h>
+#include <linux/msi.h>
+#include <linux/pci.h>
+#include <linux/string.h>
+#include <linux/vfio.h>
+
+#include <asm/eeh.h>
+#include <asm/eeh_event.h>
+#include <asm/io.h>
+#include <asm/iommu.h>
+#include <asm/opal.h>
+#include <asm/msi_bitmap.h>
+#include <asm/pci-bridge.h>
+#include <asm/ppc-pci.h>
+#include <asm/tce.h>
+#include <asm/uaccess.h>
+
+#include "powernv.h"
+#include "pci.h"
+
+static int powernv_eeh_vfio_map(struct vfio_eeh_info *info)
+{
+   struct pci_bus *bus, *pe_bus;
+   struct pci_dev *pdev;
+   struct eeh_dev *edev;
+   struct eeh_pe *pe;
+   int domain, bus_no, devfn;
+
+   /* Host address */
+	domain = info->map.domain;
+	bus_no = (info->map.bdn >> 8) & 0xff;
+	devfn = info->map.bdn & 0xff;
+
+	/* Find PCI bus */
+	bus = pci_find_bus(domain, bus_no);
+	if (!bus) {
+		pr_warn("%s: PCI bus %04x:%02x not found\n",
+			__func__, domain, bus_no);
+		return -ENODEV;
+	}
+
+	/* Find PCI device */
+	pdev = pci_get_slot(bus, devfn);
+	if (!pdev) {
+		pr_warn("%s: PCI device %04x:%02x:%02x.%01x not found\n",
+			__func__, domain, bus_no,
+			PCI_SLOT(devfn), PCI_FUNC(devfn));
+		return -ENODEV;
+	}
+
+	/* No EEH device - almost impossible */
+	edev = pci_dev_to_eeh_dev(pdev);
+	if (unlikely(!edev)) {
+		pci_dev_put(pdev);
+		pr_warn("%s: No EEH dev for PCI device %s\n",
+			__func__, pci_name(pdev));
+		return -ENODEV;
+	}
+
+	/* Doesn't support PE migration between different PHBs */
+	pe = edev->pe;
+	if (!eeh_pe_passed(pe)) {
+		pe_bus = eeh_pe_bus_get(pe);
+		BUG_ON(!pe_bus);
+
+		/* PE# has format 00BBSS00 */
+		pe->gaddr.buid    = info->map.gbuid;
+		pe->gaddr.pe_addr = pe_bus->number << 16;
+		eeh_pe_set_passed(pe, true);
+	} else if (pe->gaddr.buid != info->map.gbuid) {
+		pci_dev_put(pdev);
+		pr_warn("%s: Mismatched PHB BUID (0x%llx, 0x%llx)\n",
+			__func__, pe->gaddr.buid, info->map.gbuid);
+		return -EINVAL;
+	}
+
+	edev->gaddr.buid = info->map.gbuid;
+	edev->gaddr.bdn  = info->map.gbdn;
+	eeh_dev_set_passed(edev, true);
+
+	pr_debug("EEH: Host PCI dev %s to %llx-%02x:%02x.%01x\n",
+		 pci_name(pdev), info->map.gbuid,
+		 (info->map.gbdn >> 8) & 0xFF,
+		 PCI_SLOT(info->map.gbdn & 0xFF),
+		 PCI_FUNC(info->map.gbdn & 0xFF));
+
+   pci_dev_put(pdev);
+   return 0;
+}
+
+static int powernv_eeh_vfio_unmap(struct vfio_eeh_info *info)
+{
+   struct eeh_vfio_pci_addr addr;
+   struct pci_dev *pdev;
+   struct eeh_dev *edev, *tmp;
+   struct eeh_pe *pe;
+   bool passed;
+
+   /* Get EEH device */
+   addr.buid = 

[PATCH 07/10] powerpc/powernv: Sync OPAL header file with firmware

2014-05-09 Thread Gavin Shan
The patch synchronizes OPAL header file with firmware so that the
host kernel can make OPAL call to do error injection.

Signed-off-by: Gavin Shan gws...@linux.vnet.ibm.com
---
 arch/powerpc/include/asm/opal.h| 65 ++
 arch/powerpc/platforms/powernv/opal-wrappers.S |  1 +
 2 files changed, 66 insertions(+)

diff --git a/arch/powerpc/include/asm/opal.h b/arch/powerpc/include/asm/opal.h
index 66ad7a7..ca55d9c 100644
--- a/arch/powerpc/include/asm/opal.h
+++ b/arch/powerpc/include/asm/opal.h
@@ -175,6 +175,7 @@ extern int opal_enter_rtas(struct rtas_args *args,
 #define OPAL_SET_PARAM				90
 #define OPAL_DUMP_RESEND			91
 #define OPAL_DUMP_INFO2				94
+#define OPAL_ERR_INJECT				96
 
 #ifndef __ASSEMBLY__
 
@@ -219,6 +220,69 @@ enum OpalPciErrorSeverity {
OPAL_EEH_SEV_INF= 5
 };
 
+enum OpalErrinjctType {
+   OpalErrinjctTypeFirst   = 0,
+   OpalErrinjctTypeFatal   = 1,
+   OpalErrinjctTypeRecoverRandomEvent  = 2,
+   OpalErrinjctTypeRecoverSpecialEvent = 3,
+   OpalErrinjctTypeCorruptedPage   = 4,
+   OpalErrinjctTypeCorruptedSlb= 5,
+   OpalErrinjctTypeTranslatorFailure   = 6,
+   OpalErrinjctTypeIoaBusError = 7,
+   OpalErrinjctTypeIoaBusError64   = 8,
+   OpalErrinjctTypePlatformSpecific= 9,
+   OpalErrinjctTypeDcacheStart = 10,
+   OpalErrinjctTypeDcacheEnd   = 11,
+   OpalErrinjctTypeIcacheStart = 12,
+   OpalErrinjctTypeIcacheEnd   = 13,
+   OpalErrinjctTypeTlbStart= 14,
+   OpalErrinjctTypeTlbEnd  = 15,
+   OpalErrinjctTypeUpstreamIoError = 16,
+   OpalErrinjctTypeLast= 17,
+
+	/* IoaBusError & IoaBusError64 */
+   OpalEjtIoaLoadMemAddr   = 0,
+   OpalEjtIoaLoadMemData   = 1,
+   OpalEjtIoaLoadIoAddr= 2,
+   OpalEjtIoaLoadIoData= 3,
+   OpalEjtIoaLoadConfigAddr= 4,
+   OpalEjtIoaLoadConfigData= 5,
+   OpalEjtIoaStoreMemAddr  = 6,
+   OpalEjtIoaStoreMemData  = 7,
+   OpalEjtIoaStoreIoAddr   = 8,
+   OpalEjtIoaStoreIoData   = 9,
+   OpalEjtIoaStoreConfigAddr   = 10,
+   OpalEjtIoaStoreConfigData   = 11,
+   OpalEjtIoaDmaReadMemAddr= 12,
+   OpalEjtIoaDmaReadMemData= 13,
+   OpalEjtIoaDmaReadMemMaster  = 14,
+   OpalEjtIoaDmaReadMemTarget  = 15,
+   OpalEjtIoaDmaWriteMemAddr   = 16,
+   OpalEjtIoaDmaWriteMemData   = 17,
+   OpalEjtIoaDmaWriteMemMaster = 18,
+   OpalEjtIoaDmaWriteMemTarget = 19,
+};
+
+struct OpalErrinjct {
+   int32_t type;
+   union {
+   struct {
+   uint32_t addr;
+   uint32_t mask;
+   uint64_t phb_id;
+   uint32_t pe;
+   uint32_t function;
+   }ioa;
+   struct {
+   uint64_t addr;
+   uint64_t mask;
+   uint64_t phb_id;
+   uint32_t pe;
+   uint32_t function;
+   }ioa64;
+   };
+};
+
 enum OpalShpcAction {
OPAL_SHPC_GET_LINK_STATE = 0,
OPAL_SHPC_GET_SLOT_STATE = 1
@@ -839,6 +903,7 @@ int64_t opal_pci_get_phb_diag_data(uint64_t phb_id, void *diag_buffer,
   uint64_t diag_buffer_len);
 int64_t opal_pci_get_phb_diag_data2(uint64_t phb_id, void *diag_buffer,
uint64_t diag_buffer_len);
+int64_t opal_err_injct(void *data);
 int64_t opal_pci_fence_phb(uint64_t phb_id);
 int64_t opal_pci_reinit(uint64_t phb_id, uint64_t reinit_scope, uint64_t data);
 int64_t opal_pci_mask_pe_error(uint64_t phb_id, uint16_t pe_number, uint8_t error_type, uint8_t mask_action);
diff --git a/arch/powerpc/platforms/powernv/opal-wrappers.S b/arch/powerpc/platforms/powernv/opal-wrappers.S
index f531ffe..46265de 100644
--- a/arch/powerpc/platforms/powernv/opal-wrappers.S
+++ b/arch/powerpc/platforms/powernv/opal-wrappers.S
@@ -119,6 +119,7 @@ OPAL_CALL(opal_pci_next_error,			OPAL_PCI_NEXT_ERROR);
 OPAL_CALL(opal_pci_poll,   OPAL_PCI_POLL);
 OPAL_CALL(opal_pci_msi_eoi,OPAL_PCI_MSI_EOI);
 OPAL_CALL(opal_pci_get_phb_diag_data2, OPAL_PCI_GET_PHB_DIAG_DATA2);
+OPAL_CALL(opal_err_injct,  OPAL_ERR_INJECT);
 OPAL_CALL(opal_xscom_read, OPAL_XSCOM_READ);
 OPAL_CALL(opal_xscom_write,  

[PATCH 10/10] powerpc/powernv: Error injection infrastructure

2014-05-09 Thread Gavin Shan
The patch implements the error injection infrastructure for the
PowerNV platform. Predetermined handlers are called according to
the type of injected error (e.g. OpalErrinjctTypeIoaBusError).
For now, we just support PCI error injection. We will need to support
injecting other types of errors in future.
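
Selecting a handler by error type can be sketched with a lookup table. This is only an illustration under assumptions: the table layout, errinjct_dispatch(), the handler signatures and the choice of -EINVAL and -ENXIO are hypothetical and are not taken from errinject.c, whose dispatch code is not shown in this excerpt. The two type values mirror OpalErrinjctTypeIoaBusError (7) and OpalErrinjctTypeIoaBusError64 (8) from the OPAL header patch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

enum {
	SK_TYPE_FIRST = 0,             /* mirrors OpalErrinjctTypeFirst */
	SK_TYPE_IOA_BUS_ERROR = 7,     /* mirrors OpalErrinjctTypeIoaBusError */
	SK_TYPE_IOA_BUS_ERROR64 = 8,   /* mirrors OpalErrinjctTypeIoaBusError64 */
	SK_TYPE_LAST = 17              /* mirrors OpalErrinjctTypeLast */
};

/* Hypothetical handler type; *hit records which handler ran. */
typedef int (*errinjct_fn)(int *hit);

static int inject_ioa(int *hit)
{
	*hit = SK_TYPE_IOA_BUS_ERROR;
	return 0;
}

static int inject_ioa64(int *hit)
{
	*hit = SK_TYPE_IOA_BUS_ERROR64;
	return 0;
}

/* One slot per error type; unsupported types stay NULL. */
static const errinjct_fn handlers[SK_TYPE_LAST] = {
	[SK_TYPE_IOA_BUS_ERROR]   = inject_ioa,
	[SK_TYPE_IOA_BUS_ERROR64] = inject_ioa64,
};

/* Validate the type, then call the matching handler if there is one. */
static int errinjct_dispatch(int type, int *hit)
{
	if (type <= SK_TYPE_FIRST || type >= SK_TYPE_LAST)
		return -EINVAL; /* outside the defined range */
	if (!handlers[type])
		return -ENXIO;  /* defined but not supported yet */
	return handlers[type](hit);
}
```

Adding support for another error type then only needs a new handler and one more table entry.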

Signed-off-by: Gavin Shan gws...@linux.vnet.ibm.com
---
 arch/powerpc/include/asm/opal.h|   6 +
 arch/powerpc/platforms/powernv/Makefile|   2 +-
 arch/powerpc/platforms/powernv/errinject.c | 224 +
 3 files changed, 231 insertions(+), 1 deletion(-)
 create mode 100644 arch/powerpc/platforms/powernv/errinject.c

diff --git a/arch/powerpc/include/asm/opal.h b/arch/powerpc/include/asm/opal.h
index 7c4ffd0..7bf86ba 100644
--- a/arch/powerpc/include/asm/opal.h
+++ b/arch/powerpc/include/asm/opal.h
@@ -794,6 +794,12 @@ typedef struct oppanel_line {
 	uint64_t	line_len;
 } oppanel_line_t;
 
+enum OpalCallToken{
+   OPAL_CALL_TOKEN_MIN = 0,
+   OPAL_CALL_TOKEN_ERRINJCT,
+   OPAL_CALL_TOKEN_MAX
+};
+
 /* /sys/firmware/opal */
 extern struct kobject *opal_kobj;
 
diff --git a/arch/powerpc/platforms/powernv/Makefile b/arch/powerpc/platforms/powernv/Makefile
index 2b15a03..5ae8257 100644
--- a/arch/powerpc/platforms/powernv/Makefile
+++ b/arch/powerpc/platforms/powernv/Makefile
@@ -1,7 +1,7 @@
 obj-y			+= setup.o opal-takeover.o opal-wrappers.o opal.o opal-async.o
 obj-y			+= opal-rtc.o opal-nvram.o opal-lpc.o opal-flash.o
 obj-y			+= rng.o opal-elog.o opal-dump.o opal-sysparam.o opal-sensor.o
-obj-y  += opal-msglog.o
+obj-y  += opal-msglog.o errinject.o
 
 obj-$(CONFIG_SMP)  += smp.o
 obj-$(CONFIG_PCI)  += pci.o pci-p5ioc2.o pci-ioda.o
diff --git a/arch/powerpc/platforms/powernv/errinject.c b/arch/powerpc/platforms/powernv/errinject.c
new file mode 100644
index 000..aa892d4
--- /dev/null
+++ b/arch/powerpc/platforms/powernv/errinject.c
@@ -0,0 +1,224 @@
+/*
+ * This file supports error injection requests from a host OS owned
+ * utility (e.g. errinjct) or from a VM. We need to parse the information
+ * passed from user space and call the appropriate OPAL API accordingly.
+ *
+ * Copyright Benjamin Herrenschmidt & Gavin Shan, IBM Corporation 2014.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ */
+
+#include <linux/io.h>
+#include <linux/irq.h>
+#include <linux/kernel.h>
+#include <linux/msi.h>
+#include <linux/module.h>
+#include <linux/pci.h>
+
+#include <asm/eeh.h>
+#include <asm/eeh_event.h>
+#include <asm/io.h>
+#include <asm/iommu.h>
+#include <asm/msi_bitmap.h>
+#include <asm/opal.h>
+#include <asm/pci-bridge.h>
+#include <asm/ppc-pci.h>
+#include <asm/rtas.h>
+#include <asm/tce.h>
+#include <asm/uaccess.h>
+
+#include "powernv.h"
+#include "pci.h"
+
+static int powernv_errinjct_ioa(struct rtas_args *args)
+{
+   return -ENXIO;
+}
+
+static int powernv_errinjct_ioa64(struct rtas_args *args)
+{
+   return -ENXIO;
+}
+
+#ifdef CONFIG_VFIO_EEH
+static int powernv_errinjct_ioa_virt(struct rtas_args *args)
+{
+   uint32_t addr, mask, cfg_addr;
+   uint32_t buid_hi, buid_lo, op;
+   uint64_t buf_addr = ((uint64_t)(args->args[3])) << 32 |
+   args->args[4];
+   void __user *buf = (void __user *)buf_addr;
+   struct eeh_vfio_pci_addr vfio_addr;
+   struct pnv_phb *phb;
+   struct eeh_pe *pe;
+   struct OpalErrinjct ej;
+
+   /* Extract parameters */
+   if (get_user(addr, (uint32_t __user *)buf) ||
+   get_user(mask, (uint32_t __user *)(buf + 4)) ||
+   get_user(cfg_addr, (uint32_t __user *)(buf + 8)) ||
+   get_user(buid_hi, (uint32_t __user *)(buf + 12)) ||
+   get_user(buid_lo, (uint32_t __user *)(buf + 16)) ||
+   get_user(op, (uint32_t __user *)(buf + 20)))
+   return -EFAULT;
+
+   /* Check opcode */
+   if (op < OpalEjtIoaLoadMemAddr ||
+   op > OpalEjtIoaDmaWriteMemTarget)
+   return -EINVAL;
+
+   /* Find PE */
+   vfio_addr.buid = ((((uint64_t)buid_hi) << 32) | buid_lo);
+   vfio_addr.pe_addr = cfg_addr;
+   pe = eeh_vfio_pe_get(vfio_addr);
+   if (!pe)
+   return -ENODEV;
+   phb = pe->phb->private_data;
+
+   /* OPAL call */
+   ej.type = OpalErrinjctTypeIoaBusError;
+   ej.ioa.addr = addr;
+   ej.ioa.mask = mask;
+   ej.ioa.phb_id = phb->opal_id;
+   ej.ioa.pe = pe->addr;
+   ej.ioa.function = op;
+   if (opal_err_injct(&ej) != OPAL_SUCCESS)
+   return -EIO;
+
+   return 0;
+}
+
+static int powernv_errinjct_ioa64_virt(struct rtas_args *args)
+{
+   uint32_t addr_hi, addr_lo, mask_hi, mask_lo;
+   uint32_t cfg_addr, buid_hi, buid_lo, op;
+   

[PATCH 09/10] powerpc/powernv: Implement ppc_call_opal()

2014-05-09 Thread Gavin Shan
If we're running on the PowerNV platform, ppc_firmware() is directed
to ppc_call_opal(), where the appropriate OPAL API can be called. In
ppc_call_opal(), the input arguments are parsed and the matching
handler is invoked to make the OPAL call. Each request passed to the
function is identified by a token. As the function can be reached
either from a host-owned application (e.g. errinjct) or from a VM,
the first parameter (the so-called "virtual" flag) is always used to
differentiate the two cases.

The patch implements the above logic, plus a mechanism for dynamically
registering OPAL call handlers so that the handlers can be distributed.
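
As a rough illustration of the registration and dispatch logic described
above, the following user-space C sketch models it with a plain singly
linked list. It is only a model under stated assumptions: the model_* names
and struct model_args are hypothetical stand-ins (not kernel APIs), locking
is omitted, and the duplicate check happens before allocation, which differs
in order from the patch.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct rtas_args (only the fields used here). */
struct model_args {
	int token;
	int nargs;
	int args[4];
};

/* One registered handler, keyed by (virt, token). */
struct model_handler {
	bool virt;
	int token;
	int (*fn)(struct model_args *);
	struct model_handler *next;
};

static struct model_handler *model_handlers;

/* Register a handler; reject invalid input and duplicate (virt, token) keys. */
int model_handler_register(bool virt, int token,
			   int (*fn)(struct model_args *))
{
	struct model_handler *h;

	if (!token || !fn)
		return -EINVAL;

	for (h = model_handlers; h; h = h->next)
		if (h->token == token && h->virt == virt)
			return -EEXIST;

	h = calloc(1, sizeof(*h));
	if (!h)
		return -ENOMEM;

	h->virt = virt;
	h->token = token;
	h->fn = fn;
	h->next = model_handlers;
	model_handlers = h;
	return 0;
}

/* Dispatch: args[0] carries the "virtual" flag, token selects the handler. */
int model_call(struct model_args *args)
{
	struct model_handler *h;
	bool virt;

	if (args->nargs < 1)
		return -EINVAL;
	virt = !!args->args[0];

	for (h = model_handlers; h; h = h->next)
		if (h->token == args->token && h->virt == virt)
			return h->fn(args);

	return -ERANGE;
}

/* Hypothetical handler used only to exercise the model. */
int model_errinjct_virt(struct model_args *args)
{
	(void)args;
	return 42;
}
```

Registering model_errinjct_virt for (virt = true, token = 1) and then
calling model_call() with args[0] = 1 reaches the handler; the same token
with args[0] = 0 finds no handler and returns -ERANGE, mirroring the
virt/non-virt split in the patch.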

Signed-off-by: Gavin Shan gws...@linux.vnet.ibm.com
---
 arch/powerpc/include/asm/opal.h   |  3 +-
 arch/powerpc/platforms/powernv/opal.c | 90 ++-
 2 files changed, 90 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/include/asm/opal.h b/arch/powerpc/include/asm/opal.h
index ca55d9c..7c4ffd0 100644
--- a/arch/powerpc/include/asm/opal.h
+++ b/arch/powerpc/include/asm/opal.h
@@ -997,7 +997,8 @@ extern void opal_lpc_init(void);
 struct opal_sg_list *opal_vmalloc_to_sg_list(void *vmalloc_addr,
 unsigned long vmalloc_size);
 void opal_free_sg_list(struct opal_sg_list *sg);
-
+int opal_call_handler_register(bool virt, int token,
+  int (*fn)(struct rtas_args *));
 #endif /* __ASSEMBLY__ */
 
 #endif /* __OPAL_H */
diff --git a/arch/powerpc/platforms/powernv/opal.c b/arch/powerpc/platforms/powernv/opal.c
index ad33c2b..c84823c 100644
--- a/arch/powerpc/platforms/powernv/opal.c
+++ b/arch/powerpc/platforms/powernv/opal.c
@@ -38,6 +38,13 @@ struct opal {
u64 size;
 } opal;
 
+struct opal_call_handler {
+   bool virt;
+   int token;
+   int (*fn)(struct rtas_args *args);
+   struct list_head list;
+};
+
 struct mcheck_recoverable_range {
u64 start_addr;
u64 end_addr;
@@ -47,6 +54,10 @@ struct mcheck_recoverable_range {
 static struct mcheck_recoverable_range *mc_recoverable_range;
 static int mc_recoverable_range_len;
 
+/* OPAL call handler */
+static LIST_HEAD(opal_call_handler_list);
+static DEFINE_SPINLOCK(opal_call_lock);
+
 struct device_node *opal_node;
 static DEFINE_SPINLOCK(opal_write_lock);
 extern u64 opal_mc_secondary_handler[];
@@ -703,8 +714,83 @@ void opal_free_sg_list(struct opal_sg_list *sg)
}
 }
 
-/* Extend it later */
-int ppc_call_opal(struct rtas_args *args)
+int opal_call_handler_register(bool virt, int token,
+  int (*fn)(struct rtas_args *))
 {
+   struct opal_call_handler *h, *handler;
+
+   if (!token || !fn) {
+   pr_warn("%s: Invalid parameters\n",
+   __func__);
+   return -EINVAL;
+   }
+
+   handler = kzalloc(sizeof(*handler), GFP_KERNEL);
+   if (!handler) {
+   pr_warn("%s: Out of memory\n",
+   __func__);
+   return -ENOMEM;
+   }
+   handler->token = token;
+   handler->virt = virt;
+   handler->fn = fn;
+   INIT_LIST_HEAD(&handler->list);
+
+   spin_lock(&opal_call_lock);
+   list_for_each_entry(h, &opal_call_handler_list, list) {
+   if (h->token == token &&
+   h->virt  == virt) {
+   spin_unlock(&opal_call_lock);
+   pr_warn("%s: Handler existing (%s, %x)\n",
+   __func__, virt ? "T" : "F", token);
+   kfree(handler);
+   return -EEXIST;
+   }
+   }
+
+   list_add_tail(&handler->list, &opal_call_handler_list);
+   spin_unlock(&opal_call_lock);
+
return 0;
 }
+
+/*
+ * It's usually invoked from syscall ppc_firmware() by host
+ * owned application or VM. The information carried in the
+ * input arguments is different. So we always have the first
+ * argument to differentiate it.
+ *
+ * Also, we have to extend 32-bits address to 64-bits. So
+ * for each address sensitive field, it will require 8
+ * bytes.
+ */
+int ppc_call_opal(struct rtas_args *args)
+{
+   bool virt, found;
+   int token;
+   struct opal_call_handler *h;
+
+   /* We should have virt at least */
+   if (args->nargs < 1)
+   return -EINVAL;
+   virt = !!args->args[0];
+   token = args->token;
+
+   /* Do we have handler ? */
+   found = false;
+   spin_lock(&opal_call_lock);
+   list_for_each_entry(h, &opal_call_handler_list, list) {
+   if (h->token == token &&
+   h->virt == virt) {
+   found = true;
+   break;
+   }
+   }
+   spin_unlock(&opal_call_lock);
+
+   /* Call to handler */
+   if (!found)
+   return -ERANGE;
+
+   return h->fn(args);
+}
-- 
1.8.3.2

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [PATCH RFC v2 00/10] EEH Support for VFIO PCI devices on PowerKVM guest

2014-05-09 Thread Gavin Shan
On Fri, May 09, 2014 at 05:49:32PM +1000, Gavin Shan wrote:

Sorry for having missed cc'ing Alex Graf. Amending it.

This series of patches adds EEH support for PCI devices that are passed
through to a PowerKVM-based guest via VFIO. The implementation follows
directly from the problems that have to be solved to support EEH for a
PowerKVM-based guest:

- Emulation of EEH RTAS requests. All EEH RTAS requests go to QEMU first.
  If QEMU can't handle a request, it is sent to the host via the newly
  introduced VFIO container IOCTL command (VFIO_EEH_INFO) and handled in
  the host kernel.

- The error injection infrastructure needs to support requests from both
  the userland utility errinjct and a PowerKVM-based guest. errinjct already
  works well on the pSeries platform through a dedicated syscall, which
  invokes an RTAS service in the kernel to carry out the error injection.
  Given that, it is reasonable to extend the syscall to support the PowerNV
  platform so that an OPAL call can be invoked in the host kernel to inject
  errors. The data passed between userland and the kernel still follows
  struct rtas_args for both PowerNV (OPAL) and pSeries (RTAS).

The series requires corresponding firmware changes from Mike Qiu to support
error injection, and QEMU changes to support EEH for the guest. The QEMU
patchset will be sent separately.

Change log
==========
v1 -> v2:
   * EEH RTAS requests are routed to QEMU, and then possibly to the host
     kernel. The KVM in-kernel handling mechanism is dropped.
   * Error injection is reimplemented on top of the syscall instead of KVM
     in-kernel handling. The logic for error injection token management is
     moved to QEMU. An error injection request is routed to QEMU and then
     possibly to the host kernel.

Testing on P7
=============

- Emulex adapter

Testing on P8
=============

- Need more testing after design is finalized.

-

Gavin Shan (10):
  drivers/vfio: Introduce CONFIG_VFIO_EEH
  powerpc/eeh: Info to trace passed devices
  powerpc/eeh: Search EEH device by guest address
  powerpc/eeh: Search EEH PE by guest address
  drivers/vfio: New IOCTL command VFIO_EEH_INFO
  powerpc/eeh: Avoid event on passed PE
  powerpc/powernv: Sync OPAL header file with firmware
  powerpc: Extend syscall ppc_rtas()
  powerpc/powernv: Implement ppc_call_opal()
  powerpc/powernv: Error injection infrastructure

arch/powerpc/include/asm/eeh.h |  52 +
arch/powerpc/include/asm/opal.h|  74 +-
arch/powerpc/include/asm/rtas.h|  10 ++-
arch/powerpc/include/asm/syscalls.h|   2 +-
arch/powerpc/include/asm/systbl.h  |   2 +-
arch/powerpc/include/uapi/asm/unistd.h |   2 +-
arch/powerpc/kernel/eeh.c  |   8 ++
arch/powerpc/kernel/eeh_pe.c   |  80 +++
arch/powerpc/kernel/rtas.c |  57 +++---
arch/powerpc/kernel/syscalls.c |  50 
arch/powerpc/platforms/powernv/Makefile|   3 +-
arch/powerpc/platforms/powernv/eeh-ioda.c  |   3 +-
arch/powerpc/platforms/powernv/eeh-vfio.c  | 584 +
arch/powerpc/platforms/powernv/errinject.c | 222 
arch/powerpc/platforms/powernv/opal-wrappers.S |   1 +
arch/powerpc/platforms/powernv/opal.c  |  93 ++
drivers/vfio/Kconfig   |   6 ++
drivers/vfio/vfio_iommu_spapr_tce.c|  12 +++
include/uapi/linux/vfio.h  |  61 +++
kernel/sys_ni.c|   2 +-
20 files changed, 1271 insertions(+), 53 deletions(-)
create mode 100644 arch/powerpc/platforms/powernv/eeh-vfio.c
create mode 100644 arch/powerpc/platforms/powernv/errinject.c

Thanks,
Gavin


RE: powerpc/mpc85xx: Add BSC9132 QDS Support

2014-05-09 Thread Harninder Rai
  +   };
  +
  +   nand@1,0 {
  +   #address-cells = 1;
  +   #size-cells = 1;
  +   compatible = fsl,ifc-nand;
  +   reg = 0x1 0x0 0x4000;
  +
  +   partition@0 {
  +   /* This location must not be altered  */
  +   /* 3MB for u-boot Bootloader Image */
  +   reg = 0x0 0x0030;
  +   label = NAND U-Boot Image;
  +   read-only;
  +   };
  +
  +   partition@30 {
  +   /* 1MB for DTB Image */
  +   reg = 0x0030 0x0010;
  +   label = NAND DTB Image;
  +   };
  +
  +   partition@40 {
  +   /* 8MB for Linux Kernel Image */
  +   reg = 0x0040 0x0080;
  +   label = NAND Linux Kernel Image;
  +   };
  +
  +   partition@c0 {
  +   /* Rest space for Root file System Image */
  +   reg = 0x00c0 0x0740;
  +   label = NAND RFS Image;
  +   };
  +   };
  +};
 
 Please keep partition definitions out of the dts file, as has been recently
 requested of other boards.  You can use U-Boot to create the partition nodes
 based on the mtdparts variable, or you can use the Linux mtdparts command line
 option.
Ok. Will remove these in V2 of patch
 
 -Scott

Re: [RFT PATCH -next ] [BUGFIX] kprobes: Fix Failed to find blacklist error on ia64 and ppc64

2014-05-09 Thread Masami Hiramatsu
(2014/05/08 15:16), Ananth N Mavinakayanahalli wrote:
 On Thu, May 08, 2014 at 02:40:00PM +0900, Masami Hiramatsu wrote:
 (2014/05/08 13:47), Ananth N Mavinakayanahalli wrote:
 On Wed, May 07, 2014 at 08:55:51PM +0900, Masami Hiramatsu wrote:

 ...

 +#if defined(CONFIG_PPC64) && (!defined(_CALL_ELF) || _CALL_ELF == 1)
 +/*
 + * On PPC64 ABIv1 the function pointer actually points to the
 + * function's descriptor. The first entry in the descriptor is the
 + * address of the function text.
 + */
 +#define constant_function_entry(fn)   (((func_descr_t *)(fn))->entry)
 +#else
 +#define constant_function_entry(fn)   ((unsigned long)(fn))
 +#endif
 +
  #endif /* __ASSEMBLY__ */

 Hi Masami,

 You could just use ppc_function_entry() instead.

 No, I think ppc_function_entry() has two problems (on the latest -next
 kernel).

 First, it is an inline function, which cannot be evaluated at build time.
 Since NOKPROBE_SYMBOL() is used outside of any function, just like
 EXPORT_SYMBOL(), we can only use preprocessor macros.
 Second, on PPC64 ABI*v2*, ppc_function_entry() returns the local function
 entry, which seems to be the global function entry + 2 insns. I'm not sure
 how kallsyms is implemented on PPC64 ABIv2, but I guess we need the global
 function entry for kallsyms.
 
 ABIv2 does away with function descriptors and Anton fixed up that
 routine to handle the change (the +2 is an artefact of that).

Hmm, do you mean that the address +2 is the actual entry point?
I'd like to know which address is the same as the one shown in /proc/kallsyms.

 BTW, could you test this patch on the latest -next tree on PPC64 if possible?
 
 I'll test it, but it may take a bit.

Thanks for your help!

 
 Ananth
 
 


-- 
Masami HIRAMATSU
Software Platform Research Dept. Linux Technology Research Center
Hitachi, Ltd., Yokohama Research Laboratory
E-mail: masami.hiramatsu...@hitachi.com



Re: [PATCH 1/1] booke/watchdog: refine and clean up the codes

2014-05-09 Thread Leo Li
On Thu, May 8, 2014 at 10:04 AM,  yuantian.t...@freescale.com wrote:
 From: Tang Yuantian yuantian.t...@freescale.com

 Basically, this patch does the following:
 1. Move the boot-parameter parsing code from setup-common.c to the
    driver. That way, readers of the driver can see directly that
    there are boot parameters that can change the timeout.
 2. Make the boot parameter 'booke_wdt_period' effective.
    Currently, when the driver is loaded, the default timeout is
    always used instead of booke_wdt_period.
 3. Wrap up the watchdog timeout in the device struct and clean up
    unnecessary code.

 Signed-off-by: Tang Yuantian yuantian.t...@freescale.com
 Acked-by: Scott Wood scottw...@freescale.com

Reviewed-by: Li Yang le...@freescale.com

RE: powerpc/mpc85xx: Add BSC9132 QDS Support

2014-05-09 Thread Harninder Rai


 -Original Message-
 From: Wood Scott-B07421
 Sent: Saturday, May 03, 2014 6:01 AM
 To: Rai Harninder-B01044
 Cc: linuxppc-dev@lists.ozlabs.org; Gupta Ruchika-R66431
 Subject: Re: powerpc/mpc85xx: Add BSC9132 QDS Support
 
 On Tue, Mar 18, 2014 at 01:05:02PM +0530, harninder rai wrote:
  +ifc {
  +   #address-cells = <2>;
  +   #size-cells = <1>;
  +   compatible = "fsl,ifc", "simple-bus";
  +   /* FIXME: Test whether interrupts are split */
  +   interrupts = <16 2 0 0 20 2 0 0>;
  +};
 
 Have you done this test yet?
I checked with Prabhakar, and he says that on the 9132 the IFC interrupts are
split. They were merged into a single interrupt from B4/T4 (and variants),
C29x, etc. onwards.
 
 -Scott

Re: [PATCH v2 1/2] powerpc/pm: add api to get suspend state which is STANDBY or MEM

2014-05-09 Thread Li Yang
On Wed, Apr 30, 2014 at 6:47 AM, Scott Wood scottw...@freescale.com wrote:
 On Mon, 2014-04-28 at 13:53 +0800, Leo Li wrote:
 On Sat, Apr 26, 2014 at 5:45 AM, Scott Wood scottw...@freescale.com wrote:
  On Thu, 2014-04-24 at 14:11 +0800, Dongsheng Wang wrote:
  From: Wang Dongsheng dongsheng.w...@freescale.com
 
  Add set_pm_suspend_state & pm_suspend_state functions to set/get
  suspend state. When system going to sleep or deep sleep, devices
  can get the system suspend state(STANDBY/MEM) through pm_suspend_state
  function and to handle different situations.
 
  Signed-off-by: Wang Dongsheng dongsheng.w...@freescale.com
  ---
  *v2*
  Move pm api from fsl platform to powerpc general framework.
 
  What is powerpc-specific about this?

 Generally I agree with you.  But I had a discussion about this topic
 a while ago with the PM maintainer.  He suggested going the platform
 way.

 https://lkml.org/lkml/2013/8/16/505

 If what he meant was whether you could do what this patch does, then you
 can answer him with, No, because it got nacked as not being platform or
 arch specific.  Oh, and you're still using .valid as the hook to set
 the platform state, which is awful -- I think .begin is what you want to
 use.

I'm not saying the current patch is good for upstream.  Actually I did
say that the patch needs to be updated for upstream purposes.  I only
meant that we discussed having the mem/standby state passed through the
generic kernel/power interface, as you suggested internally, and got
negative feedback.


 If we did it in powerpc code, then what would we do on ARM?  Copy the
 code?  No.

If you are saying that this shouldn't be done in arch/powerpc: yes.
We have decided to use the drivers/platform folder for code shared
with ARM.  Platform power management code will be moved there.


 Now, a more legitimate objection to putting it in generic code might be
 that standby and mem are loosely defined and the knowledge of how a
 driver should react to each is platform specific -- but your patch
 doesn't address that.  You still have the driver itself interpret what
 standby and mem mean.


Yup, we will address it in next batch.

- Leo

Re: [PATCH] powerpc: irq work racing with timer interrupt can result in timer interrupt hang

2014-05-09 Thread Preeti U Murthy
Hi Anton,

On 05/09/2014 01:17 PM, Anton Blanchard wrote:
 I am seeing an issue where a CPU running perf eventually hangs.
 Traces show timer interrupts happening every 4 seconds even
 when a userspace task is running on the CPU. /proc/timer_list
 also shows pending hrtimers have not run in over an hour,
 including the scheduler.
 
 Looking closer, decrementers_next_tb is getting set to
 0x, and at that point we will never take
 a timer interrupt again.
 
 In __timer_interrupt() we set decrementers_next_tb to
 0x and rely on ->event_handler to update it:
 
 *next_tb = ~(u64)0;
 if (evt->event_handler)
 evt->event_handler(evt);
 
 In this case ->event_handler is hrtimer_interrupt. This will eventually
 call back through the clockevents code with the next event to be
 programmed:
 
 static int decrementer_set_next_event(unsigned long evt,
   struct clock_event_device *dev)
 {
 /* Don't adjust the decrementer if some irq work is pending */
 if (test_irq_work_pending())
 return 0;
 __get_cpu_var(decrementers_next_tb) = get_tb_or_rtc() + evt;
 
 If irq work came in between these two points, we will return
 before updating decrementers_next_tb and we never process a timer
 interrupt again.
 
 This looks to have been introduced by 0215f7d8c53f (powerpc: Fix races
 with irq_work). Fix it by removing the early exit and relying on
 code later on in the function to force an early decrementer:
 
/* We may have raced with new irq work */
if (test_irq_work_pending())
set_dec(1);
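
For readers following along, the race described above can be reproduced in
a small user-space model. This is only a sketch under stated assumptions:
struct model_cpu and the model_* functions are hypothetical stand-ins for
the per-CPU state and for decrementer_set_next_event() and
__timer_interrupt(), and the irq_work_pending flag is set by hand to
represent irq work arriving between the two points.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-CPU state for the model. */
struct model_cpu {
	uint64_t tb;		/* current timebase value */
	uint64_t next_tb;	/* models decrementers_next_tb */
	uint64_t dec;		/* last value written to the decrementer */
	bool irq_work_pending;	/* models test_irq_work_pending() */
};

typedef int (*model_set_next_fn)(struct model_cpu *cpu, uint64_t evt);

/* Before the patch: return early when irq work is pending. */
int model_set_next_event_old(struct model_cpu *cpu, uint64_t evt)
{
	if (cpu->irq_work_pending)
		return 0;	/* next_tb is never updated on this path */
	cpu->next_tb = cpu->tb + evt;
	cpu->dec = evt;
	return 0;
}

/* After the patch: always update next_tb, then force an early decrementer. */
int model_set_next_event_new(struct model_cpu *cpu, uint64_t evt)
{
	cpu->next_tb = cpu->tb + evt;
	cpu->dec = evt;
	if (cpu->irq_work_pending)
		cpu->dec = 1;
	return 0;
}

/* The relevant part of the timer interrupt: reset next_tb, then reprogram. */
void model_timer_interrupt(struct model_cpu *cpu, model_set_next_fn set_next,
			   uint64_t evt)
{
	cpu->next_tb = UINT64_MAX;	/* *next_tb = ~(u64)0 */
	set_next(cpu, evt);		/* reached via the event handler */
}
```

Starting from tb = 1000 with irq_work_pending set, running
model_timer_interrupt() with the old variant leaves next_tb at UINT64_MAX,
so no further timer interrupt would be taken. The new variant sets next_tb
to tb + evt and requests an early decrementer by writing 1.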
 

There is another scenario we are missing. It is not guaranteed that on a
timer interrupt the event handler will call back through
set_next_event().
If there are no pending timers, the event handler will not bother
programming the tick device and will simply return. IOW, set_next_event()
will not be called. In that case we will miss taking care of pending irq
work altogether.

__timer_interrupt() -> event_handler -> next_time = KTIME_MAX ->
__timer_interrupt().

In __timer_interrupt() we do not check for pending irq work anywhere after
the call to the event handler, and hence we miss servicing irq work in the
above scenario.

How about you also move the check:
 if (test_irq_work_pending())
   set_dec(1)

in __timer_interrupt() outside the _else_ branch? This will ensure that, no
matter what, we check for pending irq work before exiting the timer
interrupt handler.

Regards
Preeti U Murthy

 Signed-off-by: Anton Blanchard an...@samba.org
 Cc: sta...@vger.kernel.org # 3.14+
 ---
 
 diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c
 index 122a580..4f0b676 100644
 --- a/arch/powerpc/kernel/time.c
 +++ b/arch/powerpc/kernel/time.c
 @@ -813,9 +888,6 @@ static void __init clocksource_init(void)
  static int decrementer_set_next_event(unsigned long evt,
 struct clock_event_device *dev)
  {
 - /* Don't adjust the decrementer if some irq work is pending */
 - if (test_irq_work_pending())
 - return 0;
   __get_cpu_var(decrementers_next_tb) = get_tb_or_rtc() + evt;
   set_dec(evt);

How about if you move the test_irq_work_pending
Why do we have test_irq_work_pending() later in the function
decrementer_set_next_event()?
  
 


Re: [PATCH] powerpc: irq work racing with timer interrupt can result in timer interrupt hang

2014-05-09 Thread Paul E. McKenney
On Fri, May 09, 2014 at 05:47:12PM +1000, Anton Blanchard wrote:
 I am seeing an issue where a CPU running perf eventually hangs.
 Traces show timer interrupts happening every 4 seconds even
 when a userspace task is running on the CPU.

Is this by chance every 4.2 seconds?  The reason I ask is that
Paul Clarke and I are seeing an interrupt every 4.2 seconds when
he runs NO_HZ_FULL, and are trying to get rid of it.  ;-)

Thanx, Paul

  /proc/timer_list
 also shows pending hrtimers have not run in over an hour,
 including the scheduler.
 
 Looking closer, decrementers_next_tb is getting set to
 0x, and at that point we will never take
 a timer interrupt again.
 
 In __timer_interrupt() we set decrementers_next_tb to
 0x and rely on ->event_handler to update it:
 
 *next_tb = ~(u64)0;
 if (evt->event_handler)
 evt->event_handler(evt);
 
 In this case ->event_handler is hrtimer_interrupt. This will eventually
 call back through the clockevents code with the next event to be
 programmed:
 
 static int decrementer_set_next_event(unsigned long evt,
   struct clock_event_device *dev)
 {
 /* Don't adjust the decrementer if some irq work is pending */
 if (test_irq_work_pending())
 return 0;
 __get_cpu_var(decrementers_next_tb) = get_tb_or_rtc() + evt;
 
 If irq work came in between these two points, we will return
 before updating decrementers_next_tb and we never process a timer
 interrupt again.
 
 This looks to have been introduced by 0215f7d8c53f (powerpc: Fix races
 with irq_work). Fix it by removing the early exit and relying on
 code later on in the function to force an early decrementer:
 
/* We may have raced with new irq work */
if (test_irq_work_pending())
set_dec(1);
 
 Signed-off-by: Anton Blanchard an...@samba.org
 Cc: sta...@vger.kernel.org # 3.14+
 ---
 
 diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c
 index 122a580..4f0b676 100644
 --- a/arch/powerpc/kernel/time.c
 +++ b/arch/powerpc/kernel/time.c
 @@ -813,9 +888,6 @@ static void __init clocksource_init(void)
  static int decrementer_set_next_event(unsigned long evt,
 struct clock_event_device *dev)
  {
 - /* Don't adjust the decrementer if some irq work is pending */
 - if (test_irq_work_pending())
 - return 0;
   __get_cpu_var(decrementers_next_tb) = get_tb_or_rtc() + evt;
   set_dec(evt);
 
 


Re: [PATCH v2 1/2] powerpc/pm: add api to get suspend state which is STANDBY or MEM

2014-05-09 Thread Scott Wood
On Fri, 2014-05-09 at 17:33 +0800, Li Yang wrote:
 On Wed, Apr 30, 2014 at 6:47 AM, Scott Wood scottw...@freescale.com wrote:
  On Mon, 2014-04-28 at 13:53 +0800, Leo Li wrote:
  On Sat, Apr 26, 2014 at 5:45 AM, Scott Wood scottw...@freescale.com 
  wrote:
   On Thu, 2014-04-24 at 14:11 +0800, Dongsheng Wang wrote:
   From: Wang Dongsheng dongsheng.w...@freescale.com
  
   Add set_pm_suspend_state & pm_suspend_state functions to set/get
   suspend state. When system going to sleep or deep sleep, devices
   can get the system suspend state(STANDBY/MEM) through pm_suspend_state
   function and to handle different situations.
  
   Signed-off-by: Wang Dongsheng dongsheng.w...@freescale.com
   ---
   *v2*
   Move pm api from fsl platform to powerpc general framework.
  
   What is powerpc-specific about this?
 
  Generally I agree with you.  But I had a discussion about this topic
  a while ago with the PM maintainer.  He suggested going the platform
  way.
 
  https://lkml.org/lkml/2013/8/16/505
 
  If what he meant was whether you could do what this patch does, then you
  can answer him with, No, because it got nacked as not being platform or
  arch specific.  Oh, and you're still using .valid as the hook to set
  the platform state, which is awful -- I think .begin is what you want to
  use.
 
 I'm not saying the current patch is good for upstream.  Actually I did
 say that the patch needs to be updated for upstream purposes. 

I don't follow -- this thread is an upstream submission.

  Now, a more legitimate objection to putting it in generic code might be
  that standby and mem are loosely defined and the knowledge of how a
  driver should react to each is platform specific -- but your patch
  doesn't address that.  You still have the driver itself interpret what
  standby and mem mean.
 
 
 Yup, we will address it in next batch.

Thanks.

-Scott



Re: [PATCH 1/1] booke/watchdog: refine and clean up the codes

2014-05-09 Thread Guenter Roeck
On Thu, May 08, 2014 at 10:04:26AM +0800, yuantian.t...@freescale.com wrote:
 From: Tang Yuantian yuantian.t...@freescale.com
 
 Basically, this patch does the following:
 1. Move the boot-parameter parsing code from setup-common.c to the
    driver. That way, readers of the driver can see directly that
    there are boot parameters that can change the timeout.
 2. Make the boot parameter 'booke_wdt_period' effective.
    Currently, when the driver is loaded, the default timeout is
    always used instead of booke_wdt_period.
 3. Wrap up the watchdog timeout in the device struct and clean up
    unnecessary code.
 
 Signed-off-by: Tang Yuantian yuantian.t...@freescale.com
 Acked-by: Scott Wood scottw...@freescale.com

Reviewed-by: Guenter Roeck li...@roeck-us.net

Re: [PATCHv2] powerpc/85xx: Add OCA4080 board support

2014-05-09 Thread Scott Wood
On Tue, Apr 15, 2014 at 07:51:46PM +0200, Martijn de Gouw wrote:
 diff --git a/arch/powerpc/platforms/85xx/corenet_generic.c b/arch/powerpc/platforms/85xx/corenet_generic.c
 index fbd871e..f3685047 100644
 --- a/arch/powerpc/platforms/85xx/corenet_generic.c
 +++ b/arch/powerpc/platforms/85xx/corenet_generic.c
 @@ -55,8 +55,6 @@ void __init corenet_gen_setup_arch(void)
   mpc85xx_smp_init();
  
   swiotlb_detect_4g();
 -
 - pr_info("%s board from Freescale Semiconductor\n", ppc_md.name);

Valentin's patch kept this line but removed "from Freescale
Semiconductor"; I'll leave it like that when applying.

-Scott

Re: [v6,3/5] powerpc/book3e: support kgdb for kernel space

2014-05-09 Thread Scott Wood
On Wed, Oct 23, 2013 at 05:31:23PM +0800, Tiejun Chen wrote:
 Currently we need to skip this for supporting KGDB.
 
 Signed-off-by: Tiejun Chen tiejun.c...@windriver.com
 
 ---
 arch/powerpc/kernel/exceptions-64e.S |4 +++-
  1 file changed, 3 insertions(+), 1 deletion(-)
 
 diff --git a/arch/powerpc/kernel/exceptions-64e.S b/arch/powerpc/kernel/exceptions-64e.S
 index a55cf62..0b750c6 100644
 --- a/arch/powerpc/kernel/exceptions-64e.S
 +++ b/arch/powerpc/kernel/exceptions-64e.S
 @@ -597,11 +597,13 @@ kernel_dbg_exc:
   rfdi
  
   /* Normal debug exception */
 +1:   andi.   r14,r11,MSR_PR; /* check for userspace again */
 +#ifndef CONFIG_KGDB
   /* XXX We only handle coming from userspace for now since we can't
* quite save properly an interrupted kernel state yet
*/
 -1:   andi.   r14,r11,MSR_PR; /* check for userspace again */
   beq kernel_dbg_exc; /* if from kernel mode */
 +#endif

Now that we have support for properly saving state on special level
exceptions, that should be used here.  With the above patch, what happens
if e.g. a debug exception fires during a TLB miss, and the kgdb handler
takes its own TLB miss accessing the serial port?

-Scott

linux-next: add scottwood/linux.git

2014-05-09 Thread Scott Wood
On Mon, 2014-03-24 at 20:09 -0500, Scott Wood wrote:
 On Mon, 2014-03-24 at 10:33 +1100, Benjamin Herrenschmidt wrote:
  On Mon, 2014-03-24 at 10:16 +1100, Benjamin Herrenschmidt wrote:
   On Wed, 2014-03-19 at 23:25 -0500, Scott Wood wrote:
The following changes since commit 
c7e64b9ce04aa2e3fad7396d92b5cb92056d16ac:

  powerpc/powernv Platform dump interface (2014-03-07 16:19:10 +1100)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/scottwood/linux.git next

for you to fetch changes up to 48b16180d0d91324e5d2423c6d53d97bbe3dcc14:

  fsl/pci: The new pci suspend/resume implementation (2014-03-19 
22:37:44 -0500)
   
   Stephen just informed me that your tree wasn't in -next ... Kumar's
   still is.
   
   Can you guys fix that up ? I somewhat rely on the FSL stuff to simmer
   in -next on its own.
 
 Stephen, what's the process for adding a tree?

ping

-Scott


 
 I suppose we should update MAINTAINERS while we're at it.
 
  Oh and where is my little summary to put in the merge commit ?
  
  I made one up for this time around.
 
 Oops, forgot again.  Now I've added something to the script I use to
 generate pull requests, to give me a reminder.
 
 -Scott
 



Re: [PATCH] powerpc: irq work racing with timer interrupt can result in timer interrupt hang

2014-05-09 Thread Paul E. McKenney
On Fri, May 09, 2014 at 11:50:05PM +0200, Gabriel Paubert wrote:
 On Fri, May 09, 2014 at 06:41:13AM -0700, Paul E. McKenney wrote:
  On Fri, May 09, 2014 at 05:47:12PM +1000, Anton Blanchard wrote:
   I am seeing an issue where a CPU running perf eventually hangs.
   Traces show timer interrupts happening every 4 seconds even
   when a userspace task is running on the CPU.
  
  Is this by chance every 4.2 seconds?  The reason I ask is that
  Paul Clarke and I are seeing an interrupt every 4.2 seconds when
  he runs NO_HZ_FULL, and are trying to get rid of it.  ;-)
 
 Hmmm, it's close to 2^32 nanoseconds, isn't it suspicious?

Now that you mention it...  ;-)

So you are telling me that we are not succeeding in completely turning
off the decrementer interrupt?

Thanx, Paul


Re: [PATCH] powerpc: irq work racing with timer interrupt can result in timer interrupt hang

2014-05-09 Thread Gabriel Paubert
On Fri, May 09, 2014 at 06:41:13AM -0700, Paul E. McKenney wrote:
 On Fri, May 09, 2014 at 05:47:12PM +1000, Anton Blanchard wrote:
  I am seeing an issue where a CPU running perf eventually hangs.
  Traces show timer interrupts happening every 4 seconds even
  when a userspace task is running on the CPU.
 
 Is this by chance every 4.2 seconds?  The reason I ask is that
 Paul Clarke and I are seeing an interrupt every 4.2 seconds when
 he runs NO_HZ_FULL, and are trying to get rid of it.  ;-)

Hmmm, it's close to 2^32 nanoseconds, isn't it suspicious?

Gabriel

[PATCH] powerpc: Fix attempt to move .org backwards error (again)

2014-05-09 Thread Guenter Roeck
Commit 4e243b7 (powerpc: Fix attempt to move .org backwards error) fixes the
allyesconfig build by moving machine_check_common to a different location.
While this fixes most of the errors, both allmodconfig and allyesconfig still
fail as follows.

arch/powerpc/kernel/exceptions-64s.S:1315: Error: attempt to move .org backwards

Fix by moving machine_check_common after the offending address.

Signed-off-by: Guenter Roeck li...@roeck-us.net
---
This fixes the build error, but unfortunately I don't have a system to test
the resulting image.

 arch/powerpc/kernel/exceptions-64s.S | 49 ++--
 1 file changed, 24 insertions(+), 25 deletions(-)

diff --git a/arch/powerpc/kernel/exceptions-64s.S b/arch/powerpc/kernel/exceptions-64s.S
index 3afd391..25398be 100644
--- a/arch/powerpc/kernel/exceptions-64s.S
+++ b/arch/powerpc/kernel/exceptions-64s.S
@@ -1138,31 +1138,6 @@ unrecov_user_slb:
 
 #endif /* __DISABLED__ */
 
-
-   /*
-* Machine check is different because we use a different
-* save area: PACA_EXMC instead of PACA_EXGEN.
-*/
-   .align  7
-   .globl machine_check_common
-machine_check_common:
-
-   mfspr   r10,SPRN_DAR
-   std r10,PACA_EXGEN+EX_DAR(r13)
-   mfspr   r10,SPRN_DSISR
-   stw r10,PACA_EXGEN+EX_DSISR(r13)
-   EXCEPTION_PROLOG_COMMON(0x200, PACA_EXMC)
-   FINISH_NAP
-   DISABLE_INTS
-   ld  r3,PACA_EXGEN+EX_DAR(r13)
-   lwz r4,PACA_EXGEN+EX_DSISR(r13)
-   std r3,_DAR(r1)
-   std r4,_DSISR(r1)
-   bl  .save_nvgprs
-   addi    r3,r1,STACK_FRAME_OVERHEAD
-   bl  .machine_check_exception
-   b   .ret_from_except
-
.align  7
.globl alignment_common
 alignment_common:
@@ -1328,6 +1303,30 @@ fwnmi_data_area:
 initial_stab:
.space  4096
 
+   /*
+* Machine check is different because we use a different
+* save area: PACA_EXMC instead of PACA_EXGEN.
+*/
+   .align  7
+   .globl machine_check_common
+machine_check_common:
+
+   mfspr   r10,SPRN_DAR
+   std r10,PACA_EXGEN+EX_DAR(r13)
+   mfspr   r10,SPRN_DSISR
+   stw r10,PACA_EXGEN+EX_DSISR(r13)
+   EXCEPTION_PROLOG_COMMON(0x200, PACA_EXMC)
+   FINISH_NAP
+   DISABLE_INTS
+   ld  r3,PACA_EXGEN+EX_DAR(r13)
+   lwz r4,PACA_EXGEN+EX_DSISR(r13)
+   std r3,_DAR(r1)
+   std r4,_DSISR(r1)
+   bl  .save_nvgprs
+   addi    r3,r1,STACK_FRAME_OVERHEAD
+   bl  .machine_check_exception
+   b   .ret_from_except
+
 #ifdef CONFIG_PPC_POWERNV
 _GLOBAL(opal_mc_secondary_handler)
HMT_MEDIUM_PPR_DISCARD
-- 
1.9.1


[PATCH 1/1] powerpc/perf: Adjust callchain based on DWARF debug info

2014-05-09 Thread Sukadev Bhattiprolu
When saving the callchain on Power, the kernel conservatively saves excess
entries in the callchain. A few of these entries are needed in some cases
but not others.

Eg: the value in the link register (LR) is needed only when it holds the
return address of a function. At other times it must be ignored.

If the unnecessary entries are not ignored, we end up with duplicate arcs
in the call-graphs.

Use DWARF debug information to ignore the unnecessary entries.

Callgraph before the patch:

14.67%  2234  sprintft  libc-2.18.so   [.] __random
        |
        --- __random
           |
           |--61.12%-- __random
           |          |
           |          |--97.15%-- rand
           |          |          do_my_sprintf
           |          |          main
           |          |          generic_start_main.isra.0
           |          |          __libc_start_main
           |          |          0x0
           |          |
           |           --2.85%-- do_my_sprintf
           |                     main
           |                     generic_start_main.isra.0
           |                     __libc_start_main
           |                     0x0
           |
            --38.88%-- rand
                      |
                      |--94.01%-- rand
                      |          do_my_sprintf
                      |          main
                      |          generic_start_main.isra.0
                      |          __libc_start_main
                      |          0x0
                      |
                       --5.99%-- do_my_sprintf
                                 main
                                 generic_start_main.isra.0
                                 __libc_start_main
                                 0x0

Callgraph after the patch:

14.67%  2234  sprintft  libc-2.18.so   [.] __random
        |
        --- __random
           |
           |--95.93%-- rand
           |          do_my_sprintf
           |          main
           |          generic_start_main.isra.0
           |          __libc_start_main
           |          0x0
           |
            --4.07%-- do_my_sprintf
                      main
                      generic_start_main.isra.0
                      __libc_start_main
                      0x0

TODO:   For split-debug info objects like glibc, we can determine
the call-frame-address only when both .eh_frame and .debug_info
sections are available. We should be able to determine the CFA
even without the .eh_frame section.

Thanks to Ulrich Weigand for help with DWARF debug information.

Fix suggested by Anton Blanchard.

Reported-by: Maynard Johnson mayn...@us.ibm.com
Signed-off-by: Sukadev Bhattiprolu suka...@linux.vnet.ibm.com
---
 tools/perf/arch/powerpc/Makefile|   1 +
 tools/perf/arch/powerpc/util/adjust-callchain.c | 278 
 tools/perf/config/Makefile  |   5 +
 tools/perf/util/callchain.h |  12 +
 tools/perf/util/machine.c   |  16 +-
 5 files changed, 310 insertions(+), 2 deletions(-)
 create mode 100644 tools/perf/arch/powerpc/util/adjust-callchain.c

diff --git a/tools/perf/arch/powerpc/Makefile b/tools/perf/arch/powerpc/Makefile
index 744e629..512cc8d 100644
--- a/tools/perf/arch/powerpc/Makefile
+++ b/tools/perf/arch/powerpc/Makefile
@@ -3,3 +3,4 @@ PERF_HAVE_DWARF_REGS := 1
 LIB_OBJS += $(OUTPUT)arch/$(ARCH)/util/dwarf-regs.o
 endif
 LIB_OBJS += $(OUTPUT)arch/$(ARCH)/util/header.o
+LIB_OBJS += $(OUTPUT)arch/$(ARCH)/util/adjust-callchain.o
diff --git a/tools/perf/arch/powerpc/util/adjust-callchain.c b/tools/perf/arch/powerpc/util/adjust-callchain.c
new file mode 100644
index 000..31b1f95
--- /dev/null
+++ b/tools/perf/arch/powerpc/util/adjust-callchain.c
@@ -0,0 +1,278 @@
+/*
+ * Use DWARF Debug information to skip unnecessary callchain entries.
+ *
+ * Copyright (C) 2014 Sukadev Bhattiprolu, IBM Corporation.
+ * Copyright (C) 2014 Ulrich Weigand, IBM Corporation.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version
+ * 2 of the License, or (at your option) any later version.
+ */
+#include <inttypes.h>
+#include <dwarf.h>
+#include <elfutils/libdwfl.h>
+
+#include "util/thread.h"
+#include "util/callchain.h"
+
+/*
+ * When saving the callchain on Power, the kernel conservatively saves
+ * excess entries in the callchain. A few of these entries are needed
+ * in some cases but not others. If the unnecessary entries are not
+ * ignored, we end up with duplicate arcs in the call-graphs. Use
+ * DWARF 

Re: [PATCH] powerpc: irq work racing with timer interrupt can result in timer interrupt hang

2014-05-09 Thread Benjamin Herrenschmidt
On Fri, 2014-05-09 at 15:22 +0530, Preeti U Murthy wrote:
 in __timer_interrupt() outside the _else_ loop? This will ensure that no
 matter what, before exiting timer interrupt handler we check for pending
 irq work.

We still need to make sure that set_next_event() doesn't move the
dec beyond the next tick if there is a pending timer... maybe we
can fix it like this:

static int decrementer_set_next_event(unsigned long evt,
  struct clock_event_device *dev)
{
__get_cpu_var(decrementers_next_tb) = get_tb_or_rtc() + evt;

/* Don't adjust the decrementer if some irq work is pending */
if (!test_irq_work_pending())
set_dec(evt);

return 0;
}

Along with a single occurrence of:

if (test_irq_work_pending())
set_dec(1);

At the end of __timer_interrupt(), outside of the current else {}
case, this should work, don't you think?

What about this completely untested patch ?

diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c
index 122a580..ba7e83b 100644
--- a/arch/powerpc/kernel/time.c
+++ b/arch/powerpc/kernel/time.c
@@ -503,12 +503,13 @@ void __timer_interrupt(void)
now = *next_tb - now;
if (now <= DECREMENTER_MAX)
set_dec((int)now);
-   /* We may have raced with new irq work */
-   if (test_irq_work_pending())
-   set_dec(1);
__get_cpu_var(irq_stat).timer_irqs_others++;
}
 
+   /* We may have raced with new irq work */
+   if (test_irq_work_pending())
+   set_dec(1);
+
 #ifdef CONFIG_PPC64
/* collect purr register values often, for accurate calculations */
if (firmware_has_feature(FW_FEATURE_SPLPAR)) {
@@ -813,15 +814,11 @@ static void __init clocksource_init(void)
 static int decrementer_set_next_event(unsigned long evt,
  struct clock_event_device *dev)
 {
-   /* Don't adjust the decrementer if some irq work is pending */
-   if (test_irq_work_pending())
-   return 0;
__get_cpu_var(decrementers_next_tb) = get_tb_or_rtc() + evt;
-   set_dec(evt);
 
-   /* We may have raced with new irq work */
-   if (test_irq_work_pending())
-   set_dec(1);
+   /* Don't adjust the decrementer if some irq work is pending */
+   if (!test_irq_work_pending())
+   set_dec(evt);
 
return 0;
 }





[PATCH] printk/of_serial: fix serial console cessation part way through boot.

2014-05-09 Thread Stephen Chivers
Commit 5f5c9ae56c38942623f69c3e6dc6ec78e4da2076
serial_core: Unregister console in uart_remove_one_port()
fixed a crash where a serial port was removed but
not deregistered as a console.

There is a side effect of that commit for platforms having serial consoles
and of_serial configured (CONFIG_SERIAL_OF_PLATFORM). The serial console
is disabled midway through the boot process.

This cessation of the serial console affects PowerPC computers
such as the MVME5100 and SAM440EP.

The sequence is:

bootconsole [udbg0] enabled

serial8250/16550 driver initialises and registers its UARTs;
one of these is the serial console.
console [ttyS0] enabled

of_serial probes platform devices, registering them as it goes.
One of these is the serial console.
console [ttyS0] disabled.

The disabling of the serial console is due to:

a.  unregister_console in printk not clearing the
CON_ENABLED bit in the console flags,
even though it has announced that the console is disabled; and

b.  of_platform_serial_probe in of_serial not setting the port type
before it registers with serial8250_register_8250_port.

This patch ensures that the serial console is re-enabled when of_serial
registers a serial port that corresponds to the designated console.

Signed-off-by: Stephen Chivers schiv...@csc.com
Tested-by: Stephen Chivers schiv...@csc.com

===
The above failure was identified in Linux-3.15-rc2.

Tested using MVME5100 and SAM440EP PowerPC computers with
kernels built from Linux-3.15-rc5 and tty-next.

The continued operation of the serial console is vital for computers
such as the MVME5100 as that Single Board Computer does not
have any graphical/display hardware.

---
 drivers/tty/serial/of_serial.c |1 +
 kernel/printk/printk.c |1 +
 2 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/drivers/tty/serial/of_serial.c b/drivers/tty/serial/of_serial.c
index 9924660..27981e2 100644
--- a/drivers/tty/serial/of_serial.c
+++ b/drivers/tty/serial/of_serial.c
@@ -173,6 +173,7 @@ static int of_platform_serial_probe(struct platform_device *ofdev)
{
struct uart_8250_port port8250;
memset(&port8250, 0, sizeof(port8250));
+   port.type = port_type;
port8250.port = port;
 
if (port.fifosize)
diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
index 7228258..221229c 100644
--- a/kernel/printk/printk.c
+++ b/kernel/printk/printk.c
@@ -2413,6 +2413,7 @@ int unregister_console(struct console *console)
if (console_drivers != NULL && console->flags & CON_CONSDEV)
console_drivers->flags |= CON_CONSDEV;
 
+   console->flags &= ~CON_ENABLED;
console_unlock();
console_sysfs_notify();
return res;