[PATCH] powerpc: irq work racing with timer interrupt can result in timer interrupt hang
I am seeing an issue where a CPU running perf eventually hangs. Traces
show timer interrupts happening every 4 seconds even when a userspace
task is running on the CPU. /proc/timer_list also shows pending hrtimers
have not run in over an hour, including the scheduler.

Looking closer, decrementers_next_tb is getting set to ~(u64)0 (all
ones), and at that point we will never take a timer interrupt again.

In __timer_interrupt() we set decrementers_next_tb to ~(u64)0 and rely
on ->event_handler to update it:

	*next_tb = ~(u64)0;
	if (evt->event_handler)
		evt->event_handler(evt);

In this case ->event_handler is hrtimer_interrupt. This will eventually
call back through the clockevents code with the next event to be
programmed:

static int decrementer_set_next_event(unsigned long evt,
				      struct clock_event_device *dev)
{
	/* Don't adjust the decrementer if some irq work is pending */
	if (test_irq_work_pending())
		return 0;
	__get_cpu_var(decrementers_next_tb) = get_tb_or_rtc() + evt;

If irq work came in between these two points, we will return before
updating decrementers_next_tb and we never process a timer interrupt
again.

This looks to have been introduced by 0215f7d8c53f (powerpc: Fix races
with irq_work).
Fix it by removing the early exit and relying on code later on in the
function to force an early decrementer:

	/* We may have raced with new irq work */
	if (test_irq_work_pending())
		set_dec(1);

Signed-off-by: Anton Blanchard an...@samba.org
Cc: sta...@vger.kernel.org # 3.14+
---

diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c
index 122a580..4f0b676 100644
--- a/arch/powerpc/kernel/time.c
+++ b/arch/powerpc/kernel/time.c
@@ -813,9 +813,6 @@ static void __init clocksource_init(void)
 static int decrementer_set_next_event(unsigned long evt,
 				      struct clock_event_device *dev)
 {
-	/* Don't adjust the decrementer if some irq work is pending */
-	if (test_irq_work_pending())
-		return 0;
 	__get_cpu_var(decrementers_next_tb) = get_tb_or_rtc() + evt;
 	set_dec(evt);
 
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev
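The race described in this patch can be modelled in plain userspace C. The sketch below is only an illustration: every name in it (next_tb, irq_work_pending, set_dec_model and so on) is a hypothetical stand-in for the per-CPU kernel state, not kernel API. It shows that with the early return the recorded deadline stays at ~(u64)0, while the patched ordering records the deadline and then forces an early decrementer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for per-CPU kernel state; not kernel API. */
static uint64_t next_tb;         /* models decrementers_next_tb */
static uint64_t timebase;        /* models get_tb_or_rtc() */
static bool irq_work_pending;    /* models test_irq_work_pending() */
static uint64_t dec_programmed;  /* last value handed to the decrementer */

static void set_dec_model(uint64_t val)
{
	dec_programmed = val;
}

/* Old behaviour: return early when irq work is pending. */
static int set_next_event_old(uint64_t evt)
{
	if (irq_work_pending)
		return 0;                /* next_tb keeps its stale value */
	next_tb = timebase + evt;
	set_dec_model(evt);
	return 0;
}

/* Patched behaviour: always record the deadline, then force an early tick. */
static int set_next_event_new(uint64_t evt)
{
	next_tb = timebase + evt;
	set_dec_model(evt);
	if (irq_work_pending)
		set_dec_model(1);
	return 0;
}

/* Models the timer interrupt path with irq work arriving in the window. */
static uint64_t run_interrupt(int (*set_next_event)(uint64_t))
{
	timebase = 1000;
	next_tb = ~(uint64_t)0;          /* reset before the event handler runs */
	irq_work_pending = true;         /* irq work raced in */
	set_next_event(500);
	return next_tb;
}
```

Calling run_interrupt() with the old function leaves the deadline at all ones; with the new function the deadline is timebase + evt and the decrementer is reprogrammed to 1.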
[PATCH RFC v2 00/10] EEH Support for VFIO PCI devices on PowerKVM guest
This series of patches adds EEH support for PCI devices that are passed
through to a PowerKVM based guest via VFIO. The implementation is
straightforward and is organized around the problems we have to resolve
to support EEH for a PowerKVM based guest.

- Emulation for EEH RTAS requests. All EEH RTAS requests go to QEMU
  first. If QEMU can't handle a request, it is sent to the host via the
  newly introduced VFIO container IOCTL command (VFIO_EEH_INFO) and
  handled in the host kernel.

- The error injection infrastructure needs to support requests from the
  userland utility errinjct and from the PowerKVM based guest. The
  userland utility errinjct works well on the pSeries platform with a
  dedicated syscall, which invokes the RTAS service to carry out error
  injection in the kernel. From that perspective, it's reasonable to
  extend the syscall to support the PowerNV platform so that an OPAL
  call can be invoked in the host kernel to inject errors. The data
  transported between userland and kernel still follows struct
  rtas_args in both cases, PowerNV (OPAL) and pSeries (RTAS).

The series requires corresponding firmware changes from Mike Qiu to
support error injection, and QEMU changes to support EEH for the guest.
The QEMU patchset will be sent separately.

Change log
==========

v1 -> v2:
	* EEH RTAS requests are routed to QEMU, and then possibly to the
	  host kernel. The KVM in-kernel handling mechanism is dropped.
	* Error injection is reimplemented based on the syscall, instead
	  of KVM in-kernel handling. The logic for error injection token
	  management is moved to QEMU. The error injection request is
	  routed to QEMU and then possibly to the host kernel.

Testing on P7
=============

- Emulex adapter

Testing on P8
=============

- Need more testing after design is finalized.
---

Gavin Shan (10):
  drivers/vfio: Introduce CONFIG_VFIO_EEH
  powerpc/eeh: Info to trace passed devices
  powerpc/eeh: Search EEH device by guest address
  powerpc/eeh: Search EEH PE by guest address
  drivers/vfio: New IOCTL command VFIO_EEH_INFO
  powerpc/eeh: Avoid event on passed PE
  powerpc/powernv: Sync OPAL header file with firmware
  powerpc: Extend syscall ppc_rtas()
  powerpc/powernv: Implement ppc_call_opal()
  powerpc/powernv: Error injection infrastructure

 arch/powerpc/include/asm/eeh.h                 |  52 +
 arch/powerpc/include/asm/opal.h                |  74 +-
 arch/powerpc/include/asm/rtas.h                |  10 ++-
 arch/powerpc/include/asm/syscalls.h            |   2 +-
 arch/powerpc/include/asm/systbl.h              |   2 +-
 arch/powerpc/include/uapi/asm/unistd.h         |   2 +-
 arch/powerpc/kernel/eeh.c                      |   8 ++
 arch/powerpc/kernel/eeh_pe.c                   |  80 +++
 arch/powerpc/kernel/rtas.c                     |  57 +++---
 arch/powerpc/kernel/syscalls.c                 |  50
 arch/powerpc/platforms/powernv/Makefile        |   3 +-
 arch/powerpc/platforms/powernv/eeh-ioda.c      |   3 +-
 arch/powerpc/platforms/powernv/eeh-vfio.c      | 584 +
 arch/powerpc/platforms/powernv/errinject.c     | 222
 arch/powerpc/platforms/powernv/opal-wrappers.S |   1 +
 arch/powerpc/platforms/powernv/opal.c          |  93 ++
 drivers/vfio/Kconfig                           |   6 ++
 drivers/vfio/vfio_iommu_spapr_tce.c            |  12 +++
 include/uapi/linux/vfio.h                      |  61 +++
 kernel/sys_ni.c                                |   2 +-
 20 files changed, 1271 insertions(+), 53 deletions(-)
 create mode 100644 arch/powerpc/platforms/powernv/eeh-vfio.c
 create mode 100644 arch/powerpc/platforms/powernv/errinject.c

Thanks,
Gavin
[PATCH 06/10] powerpc/eeh: Avoid event on passed PE
If we detect a frozen state on a PE that has been passed to a guest, we
needn't handle it. Instead, we rely on the guest to detect and recover
it. The patch avoids raising an EEH event on the frozen passed PE so
that the guest has a chance to handle it.

Signed-off-by: Gavin Shan gws...@linux.vnet.ibm.com
---
 arch/powerpc/kernel/eeh.c                 | 8 ++++++++
 arch/powerpc/platforms/powernv/eeh-ioda.c | 3 ++-
 2 files changed, 10 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/kernel/eeh.c b/arch/powerpc/kernel/eeh.c
index 9c6b899..6543f05 100644
--- a/arch/powerpc/kernel/eeh.c
+++ b/arch/powerpc/kernel/eeh.c
@@ -400,6 +400,14 @@ int eeh_dev_check_failure(struct eeh_dev *edev)
 	if (ret > 0)
 		return ret;
 
+	/*
+	 * If the PE has been passed to guest, we won't check the
+	 * state. Instead, let the guest handle it if the PE has
+	 * been frozen.
+	 */
+	if (eeh_pe_passed(pe))
+		return 0;
+
 	/* If we already have a pending isolation event for this
 	 * slot, we know it's bad already, we don't need to check.
 	 * Do this checking under a lock; as multiple PCI devices
diff --git a/arch/powerpc/platforms/powernv/eeh-ioda.c b/arch/powerpc/platforms/powernv/eeh-ioda.c
index 1b5982f..03a3ed2 100644
--- a/arch/powerpc/platforms/powernv/eeh-ioda.c
+++ b/arch/powerpc/platforms/powernv/eeh-ioda.c
@@ -890,7 +890,8 @@ static int ioda_eeh_next_error(struct eeh_pe **pe)
 				opal_pci_eeh_freeze_clear(phb->opal_id, frozen_pe_no,
 					OPAL_EEH_ACTION_CLEAR_FREEZE_ALL);
 				ret = EEH_NEXT_ERR_NONE;
-			} else if ((*pe)->state & EEH_PE_ISOLATED) {
+			} else if ((*pe)->state & EEH_PE_ISOLATED ||
+				   eeh_pe_passed(*pe)) {
 				ret = EEH_NEXT_ERR_NONE;
 			} else {
 				pr_err("EEH: Frozen PHB#%x-PE#%x (%s) detected\n",
-- 
1.8.3.2
[PATCH 02/10] powerpc/eeh: Info to trace passed devices
The address of passed PCI devices (domain:bus:slot:func) might be quite
different from the perspective of host and guest. We have to trace the
address mapping so that we can emulate EEH RTAS requests from the guest.
The patch introduces additional fields to eeh_pe and eeh_dev for that
purpose.

Signed-off-by: Gavin Shan gws...@linux.vnet.ibm.com
---
 arch/powerpc/include/asm/eeh.h | 46 ++
 1 file changed, 46 insertions(+)

diff --git a/arch/powerpc/include/asm/eeh.h b/arch/powerpc/include/asm/eeh.h
index 7782056..3268692 100644
--- a/arch/powerpc/include/asm/eeh.h
+++ b/arch/powerpc/include/asm/eeh.h
@@ -48,6 +48,14 @@ struct device_node;
 #define EEH_PE_RST_HOLD_TIME		250
 #define EEH_PE_RST_SETTLE_TIME		1800
 
+#ifdef CONFIG_VFIO_EEH
+struct eeh_vfio_pci_addr {
+	uint64_t	buid;		/* PHB BUID			*/
+	uint16_t	bdn;		/* Bus/Device/Function number	*/
+	uint32_t	pe_addr;	/* PE configuration address	*/
+};
+#endif /* CONFIG_VFIO_EEH */
+
 /*
  * The struct is used to trace PE related EEH functionality.
  * In theory, there will have one instance of the struct to
@@ -72,6 +80,7 @@ struct device_node;
 #define EEH_PE_RESET		(1 << 2)	/* PE reset in progress	*/
 
 #define EEH_PE_KEEP		(1 << 8)	/* Keep PE on hotplug	*/
+#define EEH_PE_PASSTHROUGH	(1 << 9)	/* PE owned by guest	*/
 
 struct eeh_pe {
 	int type;			/* PE type: PHB/Bus/Device	*/
@@ -85,6 +94,9 @@ struct eeh_pe {
 	struct timeval tstamp;		/* Time on first-time freeze	*/
 	int false_positives;		/* Times of reported #ff's	*/
 	struct eeh_pe *parent;		/* Parent PE			*/
+#ifdef CONFIG_VFIO_EEH
+	struct eeh_vfio_pci_addr gaddr;	/* Address in guest		*/
+#endif
 	struct list_head child_list;	/* Link PE to the child list	*/
 	struct list_head edevs;		/* Link list of EEH devices	*/
 	struct list_head child;		/* Child PEs			*/
@@ -93,6 +105,21 @@ struct eeh_pe {
 #define eeh_pe_for_each_dev(pe, edev, tmp) \
 		list_for_each_entry_safe(edev, tmp, &pe->edevs, list)
 
+static inline bool eeh_pe_passed(struct eeh_pe *pe)
+{
+	return pe ? !!(pe->state & EEH_PE_PASSTHROUGH) : false;
+}
+
+static inline void eeh_pe_set_passed(struct eeh_pe *pe, bool passed)
+{
+	if (pe) {
+		if (passed)
+			pe->state |= EEH_PE_PASSTHROUGH;
+		else
+			pe->state &= ~EEH_PE_PASSTHROUGH;
+	}
+}
+
 /*
  * The struct is used to trace EEH state for the associated
  * PCI device node or PCI device. In future, it might
@@ -110,6 +137,7 @@ struct eeh_pe {
 #define EEH_DEV_SYSFS		(1 << 9)	/* Sysfs created	*/
 #define EEH_DEV_REMOVED		(1 << 10)	/* Removed permanently	*/
 #define EEH_DEV_FRESET		(1 << 11)	/* Fundamental reset	*/
+#define EEH_DEV_PASSTHROUGH	(1 << 12)	/* Owned by guest	*/
 
 struct eeh_dev {
 	int mode;			/* EEH mode			*/
@@ -126,6 +154,9 @@ struct eeh_dev {
 	struct device_node *dn;		/* Associated device node	*/
 	struct pci_dev *pdev;		/* Associated PCI device	*/
 	struct pci_bus *bus;		/* PCI bus for partial hotplug	*/
+#ifdef CONFIG_VFIO_EEH
+	struct eeh_vfio_pci_addr gaddr;	/* Address in guest		*/
+#endif
 };
 
 static inline struct device_node *eeh_dev_to_of_node(struct eeh_dev *edev)
@@ -138,6 +169,21 @@ static inline struct pci_dev *eeh_dev_to_pci_dev(struct eeh_dev *edev)
 	return edev ? edev->pdev : NULL;
 }
 
+static inline bool eeh_dev_passed(struct eeh_dev *dev)
+{
+	return dev ? !!(dev->mode & EEH_DEV_PASSTHROUGH) : false;
+}
+
+static inline void eeh_dev_set_passed(struct eeh_dev *dev, bool passed)
+{
+	if (dev) {
+		if (passed)
+			dev->mode |= EEH_DEV_PASSTHROUGH;
+		else
+			dev->mode &= ~EEH_DEV_PASSTHROUGH;
+	}
+}
+
 /* Return values from eeh_ops::next_error */
 enum {
 	EEH_NEXT_ERR_NONE = 0,
-- 
1.8.3.2
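The passthrough helpers in this patch are a NULL-tolerant test/set/clear of a single state bit. A standalone sketch of the same pattern follows, assuming a simplified struct pe_model and illustrative flag names (PE_ISOLATED, PE_PASSTHROUGH) that are not taken from the kernel headers. The point it demonstrates is that clearing uses "&= ~FLAG", so other state bits survive.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Illustrative flag values and struct; assumptions, not the kernel's. */
#define PE_ISOLATED	(1 << 0)
#define PE_PASSTHROUGH	(1 << 9)

struct pe_model {
	int state;
};

/* True only when the PE exists and the passthrough bit is set. */
static bool pe_passed(const struct pe_model *pe)
{
	return pe ? !!(pe->state & PE_PASSTHROUGH) : false;
}

/* Set or clear the passthrough bit without touching other bits. */
static void pe_set_passed(struct pe_model *pe, bool passed)
{
	if (!pe)
		return;
	if (passed)
		pe->state |= PE_PASSTHROUGH;
	else
		pe->state &= ~PE_PASSTHROUGH;
}
```

A plain assignment of ~PE_PASSTHROUGH in the clear branch would instead set every other bit, which is why the compound "&=" matters here.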
[PATCH 01/10] drivers/vfio: Introduce CONFIG_VFIO_EEH
The patch introduces CONFIG_VFIO_EEH, which enables additional IOCTL
commands on tce_iommu_driver_ops to support EEH functionality for PCI
devices that are passed through from host to guest.

Signed-off-by: Gavin Shan gws...@linux.vnet.ibm.com
---
 drivers/vfio/Kconfig | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/drivers/vfio/Kconfig b/drivers/vfio/Kconfig
index af7b204..4f3293b 100644
--- a/drivers/vfio/Kconfig
+++ b/drivers/vfio/Kconfig
@@ -8,11 +8,17 @@ config VFIO_IOMMU_SPAPR_TCE
 	depends on VFIO && SPAPR_TCE_IOMMU
 	default n
 
+config VFIO_EEH
+	tristate
+	depends on EEH && VFIO_IOMMU_SPAPR_TCE
+	default n
+
 menuconfig VFIO
 	tristate "VFIO Non-Privileged userspace driver framework"
 	depends on IOMMU_API
 	select VFIO_IOMMU_TYPE1 if X86
 	select VFIO_IOMMU_SPAPR_TCE if (PPC_POWERNV || PPC_PSERIES)
+	select VFIO_EEH if PPC_POWERNV
 	select ANON_INODES
 	help
 	  VFIO provides a framework for secure userspace device drivers.
-- 
1.8.3.2
[PATCH 03/10] powerpc/eeh: Search EEH device by guest address
The patch introduces function eeh_vfio_dev_get() to search for the EEH
device according to its guest address, which is made up of PHB BUID,
bus, slot and function number. The function is useful in the backends
for EEH RTAS emulation.

Signed-off-by: Gavin Shan gws...@linux.vnet.ibm.com
---
 arch/powerpc/include/asm/eeh.h |  5 +
 arch/powerpc/kernel/eeh_pe.c   | 42 ++
 2 files changed, 47 insertions(+)

diff --git a/arch/powerpc/include/asm/eeh.h b/arch/powerpc/include/asm/eeh.h
index 3268692..8ffaf39 100644
--- a/arch/powerpc/include/asm/eeh.h
+++ b/arch/powerpc/include/asm/eeh.h
@@ -381,6 +381,11 @@ static inline void eeh_remove_device(struct pci_dev *dev) { }
 #define EEH_IO_ERROR_VALUE(size) (-1UL)
 #endif /* CONFIG_EEH */
 
+
+#ifdef CONFIG_VFIO_EEH
+struct eeh_dev *eeh_vfio_dev_get(struct eeh_vfio_pci_addr *addr);
+#endif /* CONFIG_VFIO_EEH */
+
 #ifdef CONFIG_PPC64
 /*
  * MMIO read/write operations with EEH support.
diff --git a/arch/powerpc/kernel/eeh_pe.c b/arch/powerpc/kernel/eeh_pe.c
index fbd01eb..d09f055 100644
--- a/arch/powerpc/kernel/eeh_pe.c
+++ b/arch/powerpc/kernel/eeh_pe.c
@@ -248,6 +248,48 @@ struct eeh_pe *eeh_pe_get(struct eeh_dev *edev)
 	return pe;
 }
 
+#ifdef CONFIG_VFIO_EEH
+static void *__eeh_vfio_dev_get(void *data, void *flag)
+{
+	struct eeh_pe *pe = (struct eeh_pe *)data;
+	struct eeh_vfio_pci_addr *addr = (struct eeh_vfio_pci_addr *)flag;
+	struct eeh_dev *edev, *tmp;
+
+	eeh_pe_for_each_dev(pe, edev, tmp) {
+		if (!eeh_dev_passed(edev))
+			continue;
+
+		/* Comparing the address in the guest */
+		if (addr->buid == edev->gaddr.buid &&
+		    addr->bdn == edev->gaddr.bdn)
+			return edev;
+	}
+
+	return NULL;
+}
+
+/**
+ * eeh_vfio_dev_get - Search EEH device based on guest's address
+ * @addr: EEH device guest address
+ *
+ * Search the EEH device according to its guest's address, which
+ * is made up of PHB BUID, and PCI config address.
+ */
+struct eeh_dev *eeh_vfio_dev_get(struct eeh_vfio_pci_addr *addr)
+{
+	struct eeh_pe *root;
+	struct eeh_dev *edev;
+
+	list_for_each_entry(root, &eeh_phb_pe, child) {
+		edev = eeh_pe_traverse(root, __eeh_vfio_dev_get, addr);
+		if (edev)
+			return edev;
+	}
+
+	return NULL;
+}
+#endif /* CONFIG_VFIO_EEH */
+
 /**
  * eeh_pe_get_parent - Retrieve the parent PE
  * @edev: EEH device
-- 
1.8.3.2
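The lookup in this patch skips devices that are not passed through and then matches on the guest-side (BUID, BDN) pair. A flat-array sketch of that matching rule is shown below. The struct and function names are hypothetical, and the PE tree traversal of the real patch is replaced by a simple loop purely for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified models of the guest address and EEH device. */
struct guest_addr {
	uint64_t buid;	/* PHB BUID as seen by the guest */
	uint16_t bdn;	/* bus/device/function number */
};

struct dev_model {
	bool passed;		/* device is owned by a guest */
	struct guest_addr gaddr;
};

/* Return the first passed-through device whose guest address matches. */
static struct dev_model *find_by_guest_addr(struct dev_model *devs, size_t n,
					     const struct guest_addr *addr)
{
	for (size_t i = 0; i < n; i++) {
		if (!devs[i].passed)
			continue;	/* host-owned devices never match */
		if (devs[i].gaddr.buid == addr->buid &&
		    devs[i].gaddr.bdn == addr->bdn)
			return &devs[i];
	}
	return NULL;
}
```

Note that a host-owned device with a stale but matching guest address is ignored, which mirrors the eeh_dev_passed() check at the top of the loop in the patch.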
[PATCH 08/10] powerpc: Extend syscall ppc_rtas()
Originally, the ppc_rtas() syscall can be used to invoke RTAS calls from
user space. The errinjct utility uses it to inject various errors into
the system for testing purposes. The patch extends the syscall to
support both the pSeries and PowerNV platforms. With that, RTAS and OPAL
calls can be invoked from user space and, in turn, errinjct can be
supported on pSeries and PowerNV at the same time.

The original syscall handler ppc_rtas() is renamed to ppc_firmware(),
which calls ppc_call_rtas() or ppc_call_opal() depending on the running
platform. The data is transported between userland and kernel in struct
rtas_args. How the data is used is platform specific.

Signed-off-by: Mike Qiu qiud...@linux.vnet.ibm.com
Signed-off-by: Gavin Shan gws...@linux.vnet.ibm.com
---
 arch/powerpc/include/asm/rtas.h        | 10 +-
 arch/powerpc/include/asm/syscalls.h    |  2 +-
 arch/powerpc/include/asm/systbl.h      |  2 +-
 arch/powerpc/include/uapi/asm/unistd.h |  2 +-
 arch/powerpc/kernel/rtas.c             | 57 +++---
 arch/powerpc/kernel/syscalls.c         | 50 +
 arch/powerpc/platforms/powernv/opal.c  |  7 +
 kernel/sys_ni.c                        |  2 +-
 8 files changed, 82 insertions(+), 50 deletions(-)

diff --git a/arch/powerpc/include/asm/rtas.h b/arch/powerpc/include/asm/rtas.h
index b390f55..3428524 100644
--- a/arch/powerpc/include/asm/rtas.h
+++ b/arch/powerpc/include/asm/rtas.h
@@ -20,7 +20,7 @@
 #define RTAS_UNKNOWN_SERVICE (-1)
 #define RTAS_INSTANTIATE_MAX (1ULL<<30) /* Don't instantiate rtas at/above this value */
 
-/* Buffer size for ppc_rtas system call. */
+/* Buffer size for ppc_firmware system call. */
 #define RTAS_RMOBUF_MAX (64 * 1024)
 
 /* RTAS return status codes */
@@ -427,9 +427,17 @@ static inline int page_is_rtas_user_buf(unsigned long pfn)
 /* Not the best place to put pSeries_coalesce_init, will be fixed when we
  * move some of the rtas suspend-me stuff to pseries */
 extern void pSeries_coalesce_init(void);
+extern int ppc_call_rtas(struct rtas_args *args);
 #else
 static inline int page_is_rtas_user_buf(unsigned long pfn) { return 0;}
 static inline void pSeries_coalesce_init(void) { }
+static inline int ppc_call_rtas(struct rtas_args *args) { return -ENXIO; }
+#endif
+
+#ifdef CONFIG_PPC_POWERNV
+extern int ppc_call_opal(struct rtas_args *args);
+#else
+static inline int ppc_call_opal(struct rtas_args *args) { return -ENXIO; }
 #endif
 
 extern int call_rtas(const char *, int, int, unsigned long *, ...);
diff --git a/arch/powerpc/include/asm/syscalls.h b/arch/powerpc/include/asm/syscalls.h
index 23be8f1..3383e50 100644
--- a/arch/powerpc/include/asm/syscalls.h
+++ b/arch/powerpc/include/asm/syscalls.h
@@ -15,7 +15,7 @@ asmlinkage unsigned long sys_mmap2(unsigned long addr, size_t len,
 		unsigned long prot, unsigned long flags,
 		unsigned long fd, unsigned long pgoff);
 asmlinkage long ppc64_personality(unsigned long personality);
-asmlinkage int ppc_rtas(struct rtas_args __user *uargs);
+asmlinkage int ppc_firmware(struct rtas_args __user *uargs);
 
 #endif /* __KERNEL__ */
 #endif /* __ASM_POWERPC_SYSCALLS_H */
diff --git a/arch/powerpc/include/asm/systbl.h b/arch/powerpc/include/asm/systbl.h
index 3ddf702..00f8bb2 100644
--- a/arch/powerpc/include/asm/systbl.h
+++ b/arch/powerpc/include/asm/systbl.h
@@ -259,7 +259,7 @@ COMPAT_SYS_SPU(utimes)
 COMPAT_SYS_SPU(statfs64)
 COMPAT_SYS_SPU(fstatfs64)
 SYSX(sys_ni_syscall, ppc_fadvise64_64, ppc_fadvise64_64)
-PPC_SYS_SPU(rtas)
+PPC_SYS_SPU(firmware)
 OLDSYS(debug_setcontext)
 SYSCALL(ni_syscall)
 COMPAT_SYS(migrate_pages)
diff --git a/arch/powerpc/include/uapi/asm/unistd.h b/arch/powerpc/include/uapi/asm/unistd.h
index 881bf2e..3aee765 100644
--- a/arch/powerpc/include/uapi/asm/unistd.h
+++ b/arch/powerpc/include/uapi/asm/unistd.h
@@ -273,7 +273,7 @@
 #ifndef __powerpc64__
 #define __NR_fadvise64_64	254
 #endif
-#define __NR_rtas		255
+#define __NR_firmware		255
 #define __NR_sys_debug_setcontext	256
 /* Number 257 is reserved for vserver */
 #define __NR_migrate_pages	258
diff --git a/arch/powerpc/kernel/rtas.c b/arch/powerpc/kernel/rtas.c
index 8cd5ed0..5d829a72 100644
--- a/arch/powerpc/kernel/rtas.c
+++ b/arch/powerpc/kernel/rtas.c
@@ -1017,59 +1017,32 @@ struct pseries_errorlog *get_pseries_errorlog(struct rtas_error_log *log,
 }
 
 /* We assume to be passed big endian arguments */
-asmlinkage int ppc_rtas(struct rtas_args __user *uargs)
+int ppc_call_rtas(struct rtas_args *args)
 {
-	struct rtas_args args;
 	unsigned long flags;
 	char *buff_copy, *errbuf = NULL;
-	int nargs, nret, token;
 	int rc;
 
-	if (!capable(CAP_SYS_ADMIN))
-		return -EPERM;
-
-	if (copy_from_user(&args, uargs, 3 * sizeof(u32)) != 0)
-		return -EFAULT;
-
-	nargs = be32_to_cpu(args.nargs);
-	nret  = be32_to_cpu(args.nret);
-	token =
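The commit message says ppc_firmware() calls ppc_call_rtas() or ppc_call_opal() depending on the running platform. A rough userspace sketch of that dispatch shape follows. It is built on assumptions: the enum, the struct fw_args type and the stub back ends are invented for illustration, and placing the privilege check in the common entry point is a guess, since the body of ppc_firmware() is not shown in this excerpt.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical platform tags and argument block; not kernel definitions. */
enum fw_platform { FW_PSERIES, FW_POWERNV, FW_NONE };

struct fw_args {
	uint32_t token;
	uint32_t nargs;
	uint32_t nret;
};

/* Stub back ends standing in for the RTAS and OPAL call paths. */
static int call_rtas_model(struct fw_args *args) { (void)args; return 1; }
static int call_opal_model(struct fw_args *args) { (void)args; return 2; }

/* One entry point; the platform decides which firmware interface runs. */
static int firmware_dispatch(enum fw_platform plat, bool privileged,
			     struct fw_args *args)
{
	if (!privileged)
		return -EPERM;	/* assumed: check done once, up front */

	switch (plat) {
	case FW_PSERIES:
		return call_rtas_model(args);
	case FW_POWERNV:
		return call_opal_model(args);
	default:
		return -ENXIO;	/* same error the inline stubs return */
	}
}
```

The -ENXIO fallback matches the static inline stubs the patch adds for kernels built without the corresponding platform support.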
[PATCH 04/10] powerpc/eeh: Search EEH PE by guest address
The patch introduces function eeh_vfio_pe_get() to search for the EEH PE
according to its guest address, which is made up of PHB BUID and PE
configuration address. The function will be useful in the backends for
EEH RTAS emulation.

Signed-off-by: Gavin Shan gws...@linux.vnet.ibm.com
---
 arch/powerpc/include/asm/eeh.h |  1 +
 arch/powerpc/kernel/eeh_pe.c   | 38 ++
 2 files changed, 39 insertions(+)

diff --git a/arch/powerpc/include/asm/eeh.h b/arch/powerpc/include/asm/eeh.h
index 8ffaf39..750e028 100644
--- a/arch/powerpc/include/asm/eeh.h
+++ b/arch/powerpc/include/asm/eeh.h
@@ -384,6 +384,7 @@ static inline void eeh_remove_device(struct pci_dev *dev) { }
 
 #ifdef CONFIG_VFIO_EEH
 struct eeh_dev *eeh_vfio_dev_get(struct eeh_vfio_pci_addr *addr);
+struct eeh_pe *eeh_vfio_pe_get(struct eeh_vfio_pci_addr *addr);
 #endif /* CONFIG_VFIO_EEH */
 
 #ifdef CONFIG_PPC64
diff --git a/arch/powerpc/kernel/eeh_pe.c b/arch/powerpc/kernel/eeh_pe.c
index d09f055..8dc58ac 100644
--- a/arch/powerpc/kernel/eeh_pe.c
+++ b/arch/powerpc/kernel/eeh_pe.c
@@ -288,6 +288,44 @@ struct eeh_dev *eeh_vfio_dev_get(struct eeh_vfio_pci_addr *addr)
 
 	return NULL;
 }
+
+static void *__eeh_vfio_pe_get(void *data, void *flag)
+{
+	struct eeh_pe *pe = (struct eeh_pe *)data;
+	struct eeh_vfio_pci_addr *addr = (struct eeh_vfio_pci_addr *)flag;
+
+	if (!eeh_pe_passed(pe))
+		return NULL;
+
+	/* Comparing the address */
+	if (addr->buid    == pe->gaddr.buid &&
+	    addr->pe_addr == pe->gaddr.pe_addr)
+		return pe;
+
+	return NULL;
+}
+
+/**
+ * eeh_vfio_pe_get - Search EEH PE based on guest's address
+ * @addr: EEH PE guest address
+ *
+ * Search the EEH PE according to the guest address, which
+ * is made up of VM indicator, PHB BUID, and PE configuration
+ * address.
+ */
+struct eeh_pe *eeh_vfio_pe_get(struct eeh_vfio_pci_addr *addr)
+{
+	struct eeh_pe *root;
+	struct eeh_pe *pe;
+
+	list_for_each_entry(root, &eeh_phb_pe, child) {
+		pe = eeh_pe_traverse(root, __eeh_vfio_pe_get, addr);
+		if (pe)
+			return pe;
+	}
+
+	return NULL;
+}
 #endif /* CONFIG_VFIO_EEH */
 
 /**
-- 
1.8.3.2
[PATCH 05/10] drivers/vfio: New IOCTL command VFIO_EEH_INFO
The patch adds a new IOCTL command, VFIO_EEH_INFO, to the VFIO container
to support EEH functionality for PCI devices that have been passed from
host to guest via VFIO.

Signed-off-by: Gavin Shan gws...@linux.vnet.ibm.com
---
 arch/powerpc/platforms/powernv/Makefile   |   1 +
 arch/powerpc/platforms/powernv/eeh-vfio.c | 584 ++
 drivers/vfio/vfio_iommu_spapr_tce.c       |  12 +
 include/uapi/linux/vfio.h                 |  61
 4 files changed, 658 insertions(+)
 create mode 100644 arch/powerpc/platforms/powernv/eeh-vfio.c

diff --git a/arch/powerpc/platforms/powernv/Makefile b/arch/powerpc/platforms/powernv/Makefile
index 63cebb9..2b15a03 100644
--- a/arch/powerpc/platforms/powernv/Makefile
+++ b/arch/powerpc/platforms/powernv/Makefile
@@ -6,5 +6,6 @@ obj-y			+= opal-msglog.o
 obj-$(CONFIG_SMP)	+= smp.o
 obj-$(CONFIG_PCI)	+= pci.o pci-p5ioc2.o pci-ioda.o
 obj-$(CONFIG_EEH)	+= eeh-ioda.o eeh-powernv.o
+obj-$(CONFIG_VFIO_EEH)	+= eeh-vfio.o
 obj-$(CONFIG_PPC_SCOM)	+= opal-xscom.o
 obj-$(CONFIG_MEMORY_FAILURE)	+= opal-memory-errors.o
diff --git a/arch/powerpc/platforms/powernv/eeh-vfio.c b/arch/powerpc/platforms/powernv/eeh-vfio.c
new file mode 100644
index 000..5766715
--- /dev/null
+++ b/arch/powerpc/platforms/powernv/eeh-vfio.c
@@ -0,0 +1,584 @@
+/*
+ * The file intends to support EEH functionality for those PCI devices,
+ * which have been passed through from host to guest via VFIO. So this
+ * file is naturally part of VFIO implementation on PowerNV platform.
+ *
+ * Copyright Benjamin Herrenschmidt & Gavin Shan, IBM Corporation 2014.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ */
+
+#include <linux/init.h>
+#include <linux/io.h>
+#include <linux/irq.h>
+#include <linux/kernel.h>
+#include <linux/kvm_host.h>
+#include <linux/msi.h>
+#include <linux/pci.h>
+#include <linux/string.h>
+#include <linux/vfio.h>
+
+#include <asm/eeh.h>
+#include <asm/eeh_event.h>
+#include <asm/io.h>
+#include <asm/iommu.h>
+#include <asm/opal.h>
+#include <asm/msi_bitmap.h>
+#include <asm/pci-bridge.h>
+#include <asm/ppc-pci.h>
+#include <asm/tce.h>
+#include <asm/uaccess.h>
+
+#include "powernv.h"
+#include "pci.h"
+
+static int powernv_eeh_vfio_map(struct vfio_eeh_info *info)
+{
+	struct pci_bus *bus, *pe_bus;
+	struct pci_dev *pdev;
+	struct eeh_dev *edev;
+	struct eeh_pe *pe;
+	int domain, bus_no, devfn;
+
+	/* Host address */
+	domain = info->map.domain;
+	bus_no = (info->map.bdn >> 8) & 0xff;
+	devfn = info->map.bdn & 0xff;
+
+	/* Find PCI bus */
+	bus = pci_find_bus(domain, bus_no);
+	if (!bus) {
+		pr_warn("%s: PCI bus %04x:%02x not found\n",
+			__func__, domain, bus_no);
+		return -ENODEV;
+	}
+
+	/* Find PCI device */
+	pdev = pci_get_slot(bus, devfn);
+	if (!pdev) {
+		pr_warn("%s: PCI device %04x:%02x:%02x.%01x not found\n",
+			__func__, domain, bus_no,
+			PCI_SLOT(devfn), PCI_FUNC(devfn));
+		return -ENODEV;
+	}
+
+	/* No EEH device - almost impossible */
+	edev = pci_dev_to_eeh_dev(pdev);
+	if (unlikely(!edev)) {
+		pci_dev_put(pdev);
+		pr_warn("%s: No EEH dev for PCI device %s\n",
+			__func__, pci_name(pdev));
+		return -ENODEV;
+	}
+
+	/* Doesn't support PE migration between different PHBs */
+	pe = edev->pe;
+	if (!eeh_pe_passed(pe)) {
+		pe_bus = eeh_pe_bus_get(pe);
+		BUG_ON(!pe_bus);
+
+		/* PE# has format 00BBSS00 */
+		pe->gaddr.buid    = info->map.gbuid;
+		pe->gaddr.pe_addr = pe_bus->number << 16;
+		eeh_pe_set_passed(pe, true);
+	} else if (pe->gaddr.buid != info->map.gbuid) {
+		pci_dev_put(pdev);
+		pr_warn("%s: Mismatched PHB BUID (0x%llx, 0x%llx)\n",
+			__func__, pe->gaddr.buid, info->map.gbuid);
+		return -EINVAL;
+	}
+
+	edev->gaddr.buid = info->map.gbuid;
+	edev->gaddr.bdn  = info->map.gbdn;
+	eeh_dev_set_passed(edev, true);
+
+	pr_debug("EEH: Host PCI dev %s to %llx-%02x:%02x.%01x\n",
+		 pci_name(pdev), info->map.gbuid,
+		 (info->map.gbdn >> 8) & 0xFF,
+		 PCI_SLOT(info->map.gbdn & 0xFF),
+		 PCI_FUNC(info->map.gbdn & 0xFF));
+
+	pci_dev_put(pdev);
+	return 0;
+}
+
+static int powernv_eeh_vfio_unmap(struct vfio_eeh_info *info)
+{
+	struct eeh_vfio_pci_addr addr;
+	struct pci_dev *pdev;
+	struct eeh_dev *edev, *tmp;
+	struct eeh_pe *pe;
+	bool passed;
+
+	/* Get EEH device */
+	addr.buid =
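The map path above unpacks a 16-bit BDN into bus and devfn, splits devfn into slot and function, and builds a PE address of the form 00BBSS00 from the bus number. The helpers below sketch that bit layout in standalone C. The helper names are hypothetical; the slot and function masks follow the usual PCI devfn encoding (5-bit slot, 3-bit function) that the kernel's PCI_SLOT() and PCI_FUNC() macros implement.

```c
#include <assert.h>
#include <stdint.h>

/* Bus number lives in the upper byte of the BDN. */
static unsigned int bdn_bus(uint16_t bdn)
{
	return (bdn >> 8) & 0xff;
}

/* devfn (slot and function together) lives in the lower byte. */
static unsigned int bdn_devfn(uint16_t bdn)
{
	return bdn & 0xff;
}

/* Slot is the upper 5 bits of devfn, function the lower 3 bits. */
static unsigned int devfn_slot(unsigned int devfn)
{
	return (devfn >> 3) & 0x1f;
}

static unsigned int devfn_func(unsigned int devfn)
{
	return devfn & 0x07;
}

/* PE address in 00BBSS00 form with only the bus byte filled in. */
static uint32_t pe_addr_from_bus(unsigned int bus)
{
	return (uint32_t)bus << 16;
}
```

For instance, a BDN of 0x051a decodes to bus 0x05, slot 3, function 2, and bus 0x05 yields a PE address of 0x00050000.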
[PATCH 07/10] powerpc/powernv: Sync OPAL header file with firmware
The patch synchronizes the OPAL header file with firmware so that the
host kernel can make the OPAL call to do error injection.

Signed-off-by: Gavin Shan gws...@linux.vnet.ibm.com
---
 arch/powerpc/include/asm/opal.h                | 65 ++
 arch/powerpc/platforms/powernv/opal-wrappers.S |  1 +
 2 files changed, 66 insertions(+)

diff --git a/arch/powerpc/include/asm/opal.h b/arch/powerpc/include/asm/opal.h
index 66ad7a7..ca55d9c 100644
--- a/arch/powerpc/include/asm/opal.h
+++ b/arch/powerpc/include/asm/opal.h
@@ -175,6 +175,7 @@ extern int opal_enter_rtas(struct rtas_args *args,
 #define OPAL_SET_PARAM				90
 #define OPAL_DUMP_RESEND			91
 #define OPAL_DUMP_INFO2				94
+#define OPAL_ERR_INJECT				96
 
 #ifndef __ASSEMBLY__
 
@@ -219,6 +220,69 @@ enum OpalPciErrorSeverity {
 	OPAL_EEH_SEV_INF	= 5
 };
 
+enum OpalErrinjctType {
+	OpalErrinjctTypeFirst			= 0,
+	OpalErrinjctTypeFatal			= 1,
+	OpalErrinjctTypeRecoverRandomEvent	= 2,
+	OpalErrinjctTypeRecoverSpecialEvent	= 3,
+	OpalErrinjctTypeCorruptedPage		= 4,
+	OpalErrinjctTypeCorruptedSlb		= 5,
+	OpalErrinjctTypeTranslatorFailure	= 6,
+	OpalErrinjctTypeIoaBusError		= 7,
+	OpalErrinjctTypeIoaBusError64		= 8,
+	OpalErrinjctTypePlatformSpecific	= 9,
+	OpalErrinjctTypeDcacheStart		= 10,
+	OpalErrinjctTypeDcacheEnd		= 11,
+	OpalErrinjctTypeIcacheStart		= 12,
+	OpalErrinjctTypeIcacheEnd		= 13,
+	OpalErrinjctTypeTlbStart		= 14,
+	OpalErrinjctTypeTlbEnd			= 15,
+	OpalErrinjctTypeUpstreamIoError		= 16,
+	OpalErrinjctTypeLast			= 17,
+
+	/* IoaBusError & IoaBusError64 */
+	OpalEjtIoaLoadMemAddr			= 0,
+	OpalEjtIoaLoadMemData			= 1,
+	OpalEjtIoaLoadIoAddr			= 2,
+	OpalEjtIoaLoadIoData			= 3,
+	OpalEjtIoaLoadConfigAddr		= 4,
+	OpalEjtIoaLoadConfigData		= 5,
+	OpalEjtIoaStoreMemAddr			= 6,
+	OpalEjtIoaStoreMemData			= 7,
+	OpalEjtIoaStoreIoAddr			= 8,
+	OpalEjtIoaStoreIoData			= 9,
+	OpalEjtIoaStoreConfigAddr		= 10,
+	OpalEjtIoaStoreConfigData		= 11,
+	OpalEjtIoaDmaReadMemAddr		= 12,
+	OpalEjtIoaDmaReadMemData		= 13,
+	OpalEjtIoaDmaReadMemMaster		= 14,
+	OpalEjtIoaDmaReadMemTarget		= 15,
+	OpalEjtIoaDmaWriteMemAddr		= 16,
+	OpalEjtIoaDmaWriteMemData		= 17,
+	OpalEjtIoaDmaWriteMemMaster		= 18,
+	OpalEjtIoaDmaWriteMemTarget		= 19,
+};
+
+struct OpalErrinjct {
+	int32_t type;
+	union {
+		struct {
+			uint32_t addr;
+			uint32_t mask;
+			uint64_t phb_id;
+			uint32_t pe;
+			uint32_t function;
+		}ioa;
+		struct {
+			uint64_t addr;
+			uint64_t mask;
+			uint64_t phb_id;
+			uint32_t pe;
+			uint32_t function;
+		}ioa64;
+	};
+};
+
 enum OpalShpcAction {
 	OPAL_SHPC_GET_LINK_STATE = 0,
 	OPAL_SHPC_GET_SLOT_STATE = 1
@@ -839,6 +903,7 @@ int64_t opal_pci_get_phb_diag_data(uint64_t phb_id, void *diag_buffer,
 				   uint64_t diag_buffer_len);
 int64_t opal_pci_get_phb_diag_data2(uint64_t phb_id, void *diag_buffer,
 				    uint64_t diag_buffer_len);
+int64_t opal_err_injct(void *data);
 int64_t opal_pci_fence_phb(uint64_t phb_id);
 int64_t opal_pci_reinit(uint64_t phb_id, uint64_t reinit_scope, uint64_t data);
 int64_t opal_pci_mask_pe_error(uint64_t phb_id, uint16_t pe_number, uint8_t error_type, uint8_t mask_action);
diff --git a/arch/powerpc/platforms/powernv/opal-wrappers.S b/arch/powerpc/platforms/powernv/opal-wrappers.S
index f531ffe..46265de 100644
--- a/arch/powerpc/platforms/powernv/opal-wrappers.S
+++ b/arch/powerpc/platforms/powernv/opal-wrappers.S
@@ -119,6 +119,7 @@ OPAL_CALL(opal_pci_next_error,			OPAL_PCI_NEXT_ERROR);
 OPAL_CALL(opal_pci_poll,			OPAL_PCI_POLL);
 OPAL_CALL(opal_pci_msi_eoi,			OPAL_PCI_MSI_EOI);
 OPAL_CALL(opal_pci_get_phb_diag_data2,		OPAL_PCI_GET_PHB_DIAG_DATA2);
+OPAL_CALL(opal_err_injct,			OPAL_ERR_INJECT);
 OPAL_CALL(opal_xscom_read,			OPAL_XSCOM_READ);
 OPAL_CALL(opal_xscom_write,
[PATCH 10/10] powerpc/powernv: Error injection infrastructure
The patch intends to implement the error injection infrastructure
for the PowerNV platform. The predetermined handlers will be called
according to the type of injected error (e.g. OpalErrinjctTypeIoaBusError).
For now, we just support PCI error injection. We need to support
injecting other types of errors in future.

Signed-off-by: Gavin Shan gws...@linux.vnet.ibm.com
---
 arch/powerpc/include/asm/opal.h            |   6 +
 arch/powerpc/platforms/powernv/Makefile    |   2 +-
 arch/powerpc/platforms/powernv/errinject.c | 224 +
 3 files changed, 231 insertions(+), 1 deletion(-)
 create mode 100644 arch/powerpc/platforms/powernv/errinject.c

diff --git a/arch/powerpc/include/asm/opal.h b/arch/powerpc/include/asm/opal.h
index 7c4ffd0..7bf86ba 100644
--- a/arch/powerpc/include/asm/opal.h
+++ b/arch/powerpc/include/asm/opal.h
@@ -794,6 +794,12 @@ typedef struct oppanel_line {
 	uint64_t	line_len;
 } oppanel_line_t;
 
+enum OpalCallToken{
+	OPAL_CALL_TOKEN_MIN = 0,
+	OPAL_CALL_TOKEN_ERRINJCT,
+	OPAL_CALL_TOKEN_MAX
+};
+
 /* /sys/firmware/opal */
 extern struct kobject *opal_kobj;
 
diff --git a/arch/powerpc/platforms/powernv/Makefile b/arch/powerpc/platforms/powernv/Makefile
index 2b15a03..5ae8257 100644
--- a/arch/powerpc/platforms/powernv/Makefile
+++ b/arch/powerpc/platforms/powernv/Makefile
@@ -1,7 +1,7 @@
 obj-y			+= setup.o opal-takeover.o opal-wrappers.o opal.o opal-async.o
 obj-y			+= opal-rtc.o opal-nvram.o opal-lpc.o opal-flash.o
 obj-y			+= rng.o opal-elog.o opal-dump.o opal-sysparam.o opal-sensor.o
-obj-y			+= opal-msglog.o
+obj-y			+= opal-msglog.o errinject.o
 
 obj-$(CONFIG_SMP)	+= smp.o
 obj-$(CONFIG_PCI)	+= pci.o pci-p5ioc2.o pci-ioda.o
diff --git a/arch/powerpc/platforms/powernv/errinject.c b/arch/powerpc/platforms/powernv/errinject.c
new file mode 100644
index 000..aa892d4
--- /dev/null
+++ b/arch/powerpc/platforms/powernv/errinject.c
@@ -0,0 +1,224 @@
+/*
+ * The file intends to support error injection requests from host OS
+ * owned utility (e.g. errinjct) or VM. We need parse the information
+ * passed from user space and call to appropriate OPAL API accordingly.
+ *
+ * Copyright Benjamin Herrenschmidt & Gavin Shan, IBM Corporation 2014.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ */
+
+#include <linux/io.h>
+#include <linux/irq.h>
+#include <linux/kernel.h>
+#include <linux/msi.h>
+#include <linux/module.h>
+#include <linux/pci.h>
+
+#include <asm/eeh.h>
+#include <asm/eeh_event.h>
+#include <asm/io.h>
+#include <asm/iommu.h>
+#include <asm/msi_bitmap.h>
+#include <asm/opal.h>
+#include <asm/pci-bridge.h>
+#include <asm/ppc-pci.h>
+#include <asm/rtas.h>
+#include <asm/tce.h>
+#include <asm/uaccess.h>
+
+#include "powernv.h"
+#include "pci.h"
+
+static int powernv_errinjct_ioa(struct rtas_args *args)
+{
+	return -ENXIO;
+}
+
+static int powernv_errinjct_ioa64(struct rtas_args *args)
+{
+	return -ENXIO;
+}
+
+#ifdef CONFIG_VFIO_EEH
+static int powernv_errinjct_ioa_virt(struct rtas_args *args)
+{
+	uint32_t addr, mask, cfg_addr;
+	uint32_t buid_hi, buid_lo, op;
+	uint64_t buf_addr = ((uint64_t)(args->args[3])) << 32 |
+			    args->args[4];
+	void __user *buf = (void __user *)buf_addr;
+	struct eeh_vfio_pci_addr vfio_addr;
+	struct pnv_phb *phb;
+	struct eeh_pe *pe;
+	struct OpalErrinjct ej;
+
+	/* Extract parameters */
+	if (get_user(addr, (uint32_t __user *)buf) ||
+	    get_user(mask, (uint32_t __user *)(buf + 4)) ||
+	    get_user(cfg_addr, (uint32_t __user *)(buf + 8)) ||
+	    get_user(buid_hi, (uint32_t __user *)(buf + 12)) ||
+	    get_user(buid_lo, (uint32_t __user *)(buf + 16)) ||
+	    get_user(op, (uint32_t __user *)(buf + 20)))
+		return -EFAULT;
+
+	/* Check opcode */
+	if (op < OpalEjtIoaLoadMemAddr ||
+	    op > OpalEjtIoaDmaWriteMemTarget)
+		return -EINVAL;
+
+	/* Find PE */
+	vfio_addr.buid = ((((uint64_t)buid_hi) << 32) | buid_lo);
+	vfio_addr.pe_addr = cfg_addr;
+	pe = eeh_vfio_pe_get(&vfio_addr);
+	if (!pe)
+		return -ENODEV;
+	phb = pe->phb->private_data;
+
+	/* OPAL call */
+	ej.type = OpalErrinjctTypeIoaBusError;
+	ej.ioa.addr = addr;
+	ej.ioa.mask = mask;
+	ej.ioa.phb_id = phb->opal_id;
+	ej.ioa.pe = pe->addr;
+	ej.ioa.function = op;
+	if (opal_err_injct(&ej) != OPAL_SUCCESS)
+		return -EIO;
+
+	return 0;
+}
+
+static int powernv_errinjct_ioa64_virt(struct rtas_args *args)
+{
+	uint32_t addr_hi, addr_lo, mask_hi, mask_lo;
+	uint32_t cfg_addr, buid_hi, buid_lo, op;
+
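The handler above rebuilds 64-bit values (the user buffer address, the BUID) from pairs of 32-bit arguments. A minimal standalone sketch of that idiom follows; join_hi_lo() and split_hi_lo() are illustrative names and not part of the patch.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Join two 32-bit halves into one 64-bit value. The cast to uint64_t
 * must happen before the shift, otherwise the shift is done in 32 bits.
 */
static uint64_t join_hi_lo(uint32_t hi, uint32_t lo)
{
	return ((uint64_t)hi << 32) | lo;
}

/* Split a 64-bit value back into its halves; hi and lo must be non-NULL. */
static void split_hi_lo(uint64_t v, uint32_t *hi, uint32_t *lo)
{
	assert(hi && lo);
	*hi = (uint32_t)(v >> 32);
	*lo = (uint32_t)v;
}
```

The same shape applies to any "hi, lo" argument pair carried in an array of 32-bit slots.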
[PATCH 09/10] powerpc/powernv: Implement ppc_call_opal()
If we're running on the PowerNV platform, ppc_firmware() will be
directed to ppc_call_opal(), where we can call the OPAL API accordingly.

In ppc_call_opal(), the input arguments are parsed and the appropriate
OPAL API is called to handle the request. Each request passed to the
function is identified by a token. As we get to the function either from
a host owned application (e.g. errinjct) or a VM, we always have the
first parameter (so-called virtual) to differentiate the cases.

The patch implements the above logic and a dynamic registration
mechanism for OPAL call handlers so that the handlers can be
distributed.

Signed-off-by: Gavin Shan gws...@linux.vnet.ibm.com
---
 arch/powerpc/include/asm/opal.h       |  3 +-
 arch/powerpc/platforms/powernv/opal.c | 90 ++-
 2 files changed, 90 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/include/asm/opal.h b/arch/powerpc/include/asm/opal.h
index ca55d9c..7c4ffd0 100644
--- a/arch/powerpc/include/asm/opal.h
+++ b/arch/powerpc/include/asm/opal.h
@@ -997,7 +997,8 @@ extern void opal_lpc_init(void);
 struct opal_sg_list *opal_vmalloc_to_sg_list(void *vmalloc_addr,
 					     unsigned long vmalloc_size);
 void opal_free_sg_list(struct opal_sg_list *sg);
-
+int opal_call_handler_register(bool virt, int token,
+			       int (*fn)(struct rtas_args *));
 #endif /* __ASSEMBLY__ */
 
 #endif /* __OPAL_H */
diff --git a/arch/powerpc/platforms/powernv/opal.c b/arch/powerpc/platforms/powernv/opal.c
index ad33c2b..c84823c 100644
--- a/arch/powerpc/platforms/powernv/opal.c
+++ b/arch/powerpc/platforms/powernv/opal.c
@@ -38,6 +38,13 @@ struct opal {
 	u64 size;
 } opal;
 
+struct opal_call_handler {
+	bool virt;
+	int token;
+	int (*fn)(struct rtas_args *args);
+	struct list_head list;
+};
+
 struct mcheck_recoverable_range {
 	u64 start_addr;
 	u64 end_addr;
@@ -47,6 +54,10 @@ struct mcheck_recoverable_range {
 static struct mcheck_recoverable_range *mc_recoverable_range;
 static int mc_recoverable_range_len;
 
+/* OPAL call handler */
+static LIST_HEAD(opal_call_handler_list);
+static DEFINE_SPINLOCK(opal_call_lock);
+
 struct device_node *opal_node;
 static DEFINE_SPINLOCK(opal_write_lock);
 extern u64 opal_mc_secondary_handler[];
@@ -703,8 +714,83 @@ void opal_free_sg_list(struct opal_sg_list *sg)
 	}
 }
 
-/* Extend it later */
-int ppc_call_opal(struct rtas_args *args)
+int opal_call_handler_register(bool virt, int token,
+			       int (*fn)(struct rtas_args *))
 {
+	struct opal_call_handler *h, *handler;
+
+	if (!token || !fn) {
+		pr_warn("%s: Invalid parameters\n",
+			__func__);
+		return -EINVAL;
+	}
+
+	handler = kzalloc(sizeof(*handler), GFP_KERNEL);
+	if (!handler) {
+		pr_warn("%s: Out of memory\n",
+			__func__);
+		return -ENOMEM;
+	}
+	handler->token = token;
+	handler->virt = virt;
+	handler->fn = fn;
+	INIT_LIST_HEAD(&handler->list);
+
+	spin_lock(&opal_call_lock);
+	list_for_each_entry(h, &opal_call_handler_list, list) {
+		if (h->token == token &&
+		    h->virt == virt) {
+			spin_unlock(&opal_call_lock);
+			pr_warn("%s: Handler existing (%s, %x)\n",
+				__func__, virt ? "T" : "F", token);
+			kfree(handler);
+			return -EEXIST;
+		}
+	}
+
+	list_add_tail(&handler->list, &opal_call_handler_list);
+	spin_unlock(&opal_call_lock);
+
 	return 0;
 }
+
+/*
+ * It's usually invoked from syscall ppc_firmware() by host
+ * owned application or VM. The information carried in the
+ * input arguments is different. So we always have the first
+ * argument to differentiate it.
+ *
+ * Also, we have to extend 32-bits address to 64-bits. So
+ * for each address sensitive field, it will require 8
+ * bytes.
+ */
+int ppc_call_opal(struct rtas_args *args)
+{
+	bool virt, found;
+	int token;
+	struct opal_call_handler *h;
+
+	/* We should have virt at least */
+	if (args->nargs < 1)
+		return -EINVAL;
+	virt = !!args->args[0];
+	token = args->token;
+
+	/* Do we have handler ? */
+	found = false;
+	spin_lock(&opal_call_lock);
+	list_for_each_entry(h, &opal_call_handler_list, list) {
+		if (h->token == token &&
+		    h->virt == virt) {
+			found = true;
+			break;
+		}
+	}
+	spin_unlock(&opal_call_lock);
+
+	/* Call to handler */
+	if (!found)
+		return -ERANGE;
+
+	return h->fn(args);
+}
-- 
1.8.3.2

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev
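The register-then-dispatch scheme in this patch can be tried out in userspace with a rough model like the one below. It is only a sketch under stated assumptions: struct call_args is a hypothetical stand-in for struct rtas_args, the list is a plain singly linked list instead of list_head, and there is no locking because the model is single-threaded. None of these names come from the patch.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct rtas_args: only the fields used here. */
struct call_args {
	int token;
	int nargs;
	unsigned int args[16];
};

struct call_handler {
	bool virt;
	int token;
	int (*fn)(struct call_args *args);
	struct call_handler *next;
};

static struct call_handler *handlers;	/* single-threaded model, no lock */

/* Register fn for the (virt, token) pair; a duplicate pair gets -EEXIST. */
static int call_handler_register(bool virt, int token,
				 int (*fn)(struct call_args *))
{
	struct call_handler *h, **tail = &handlers;

	if (!token || !fn)
		return -EINVAL;

	for (h = handlers; h; h = h->next) {
		if (h->token == token && h->virt == virt)
			return -EEXIST;
		tail = &h->next;
	}

	h = calloc(1, sizeof(*h));
	if (!h)
		return -ENOMEM;
	h->virt = virt;
	h->token = token;
	h->fn = fn;
	*tail = h;	/* append at the end, like list_add_tail() */
	return 0;
}

/* Look up the handler by (args[0] != 0, token); -ERANGE if none matches. */
static int call_dispatch(struct call_args *args)
{
	struct call_handler *h;
	bool virt;

	assert(args != NULL);
	if (args->nargs < 1)
		return -EINVAL;
	virt = !!args->args[0];

	for (h = handlers; h; h = h->next)
		if (h->token == args->token && h->virt == virt)
			return h->fn(args);

	return -ERANGE;
}

/* Example handler for experimenting with the dispatcher. */
static int demo_handler(struct call_args *args)
{
	return (int)args->args[0] + 41;
}
```

Keying on the (virt, token) pair lets the host utility and a VM register different handlers for the same token, which is what the first "virtual" argument is for.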
Re: [PATCH RFC v2 00/10] EEH Support for VFIO PCI devices on PowerKVM guest
On Fri, May 09, 2014 at 05:49:32PM +1000, Gavin Shan wrote: Sorry for having missed cc'ing Alex Graf. Amending it. The series of patches intends to support EEH for PCI devices, which are passed through to PowerKVM based guest via VFIO. The implementation is straightforward based on the issues or problems we have to resolve to support EEH for PowerKVM based guest. - Emulation for EEH RTAS requests. All EEH RTAS requests goes to QEMU firstly. If QEMU can't handle it, the request will be sent to host via newly introduced VFIO container IOCTL command (VFIO_EEH_INFO) and gets handled in host kernel. - The error injection infrastructure need support request from the userland utility errinjct and PowerKVM based guest. The userland utility errinjct works on pSeries platform well with dedicated syscall, which helps invoking RTAS service to fulfil error injection in kernel. From the perspective, it's reasonable to extend the syscall to support PowerNV platform so that OPAL call can be invoked in host kernel for injecting errors. The data transported between userland and kerenl is still following struct rtas_args for both cases of PowerNV (OPAL) and pSeries (RTAS). The series of patches requires corresponding firmware changes from Mike Qiu to support error injection and QEMU changes to support EEH for guest. QEMU patchset will be sent separately. Change log == v1 - v2: * EEH RTAS requests are routed to QEMU, and then possiblly to host kerenl. The mechanism KVM in-kernel handling is dropped. * Error injection is reimplemented based syscall, instead of KVM in-kerenl handling. The logic for error injection token management is moved to QEMU. The error injection request is routed to QEMU and then possiblly to host kernel. Testing on P7 = - Emulex adapter Testing on P8 = - Need more testing after design is finalized. 
- Gavin Shan (10): drivers/vfio: Introduce CONFIG_VFIO_EEH powerpc/eeh: Info to trace passed devices powerpc/eeh: Search EEH device by guest address powerpc/eeh: Search EEH PE by guest address drivers/vfio: New IOCTL command VFIO_EEH_INFO powerpc/eeh: Avoid event on passed PE powerpc/powernv: Sync OPAL header file with firmware powerpc: Extend syscall ppc_rtas() powerpc/powernv: Implement ppc_call_opal() powerpc/powernv: Error injection infrastructure arch/powerpc/include/asm/eeh.h | 52 + arch/powerpc/include/asm/opal.h| 74 +- arch/powerpc/include/asm/rtas.h| 10 ++- arch/powerpc/include/asm/syscalls.h| 2 +- arch/powerpc/include/asm/systbl.h | 2 +- arch/powerpc/include/uapi/asm/unistd.h | 2 +- arch/powerpc/kernel/eeh.c | 8 ++ arch/powerpc/kernel/eeh_pe.c | 80 +++ arch/powerpc/kernel/rtas.c | 57 +++--- arch/powerpc/kernel/syscalls.c | 50 arch/powerpc/platforms/powernv/Makefile| 3 +- arch/powerpc/platforms/powernv/eeh-ioda.c | 3 +- arch/powerpc/platforms/powernv/eeh-vfio.c | 584 + arch/powerpc/platforms/powernv/errinject.c | 222 arch/powerpc/platforms/powernv/opal-wrappers.S | 1 + arch/powerpc/platforms/powernv/opal.c | 93 ++ drivers/vfio/Kconfig | 6 ++ drivers/vfio/vfio_iommu_spapr_tce.c| 12 +++ include/uapi/linux/vfio.h | 61 +++ kernel/sys_ni.c| 2 +- 20 files changed, 1271 insertions(+), 53 deletions(-) create mode 100644 arch/powerpc/platforms/powernv/eeh-vfio.c create mode 100644 arch/powerpc/platforms/powernv/errinject.c Thanks, Gavin ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
RE: powerpc/mpc85xx: Add BSC9132 QDS Support
+ }; + + nand@1,0 { + #address-cells = 1; + #size-cells = 1; + compatible = fsl,ifc-nand; + reg = 0x1 0x0 0x4000; + + partition@0 { + /* This location must not be altered */ + /* 3MB for u-boot Bootloader Image */ + reg = 0x0 0x0030; + label = NAND U-Boot Image; + read-only; + }; + + partition@30 { + /* 1MB for DTB Image */ + reg = 0x0030 0x0010; + label = NAND DTB Image; + }; + + partition@40 { + /* 8MB for Linux Kernel Image */ + reg = 0x0040 0x0080; + label = NAND Linux Kernel Image; + }; + + partition@c0 { + /* Rest space for Root file System Image */ + reg = 0x00c0 0x0740; + label = NAND RFS Image; + }; + }; +}; Please keep partition definitions out of the dts file, as has been recently requested of other boards. You can use U-Boot to create the partition nodes based on the mtdparts variable, or you can use the Linux mtdparts command line option. Ok. Will remove these in V2 of patch -Scott ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [RFT PATCH -next ] [BUGFIX] kprobes: Fix Failed to find blacklist error on ia64 and ppc64
(2014/05/08 15:16), Ananth N Mavinakayanahalli wrote:
> On Thu, May 08, 2014 at 02:40:00PM +0900, Masami Hiramatsu wrote:
>> (2014/05/08 13:47), Ananth N Mavinakayanahalli wrote:
>>> On Wed, May 07, 2014 at 08:55:51PM +0900, Masami Hiramatsu wrote:
>>>> ...
>>>> +#if defined(CONFIG_PPC64) && (!defined(_CALL_ELF) || _CALL_ELF == 1)
>>>> +/*
>>>> + * On PPC64 ABIv1 the function pointer actually points to the
>>>> + * function's descriptor. The first entry in the descriptor is the
>>>> + * address of the function text.
>>>> + */
>>>> +#define constant_function_entry(fn) (((func_descr_t *)(fn))->entry)
>>>> +#else
>>>> +#define constant_function_entry(fn) ((unsigned long)(fn))
>>>> +#endif
>>>> +
>>>>  #endif /* __ASSEMBLY__ */
>>>
>>> Hi Masami,
>>>
>>> You could just use ppc_function_entry() instead.
>>
>> No, I think ppc_function_entry() has two problems (on the latest
>> -next kernel).
>>
>> First, it is an inlined function, which is not evaluated at build
>> time. Since NOKPROBE_SYMBOL() is used outside of any function, like
>> EXPORT_SYMBOL(), we can only use preprocessor macros.
>>
>> Next, on PPC64 ABI*v2*, ppc_function_entry() returns the local
>> function entry, which seems to be the global function entry + 2 insns.
>> I'm not sure about the implementation of kallsyms on PPC64 ABIv2,
>> but I guess we need the global function entry for kallsyms.
>
> ABIv2 does away with function descriptors and Anton fixed up that
> routine to handle the change (the +2 is an artefact of that).

Hmm, do you mean that the address +2 is the actual entry point?
I'd like to know which address is the same as the address shown in
/proc/kallsyms.

>> BTW, could you test this patch on the latest -next tree on PPC64 if
>> possible?
>
> I'll test it, but it may take a bit.

Thanks for your help!

> Ananth

-- 
Masami HIRAMATSU
Software Platform Research Dept. Linux Technology Research Center
Hitachi, Ltd., Yokohama Research Laboratory
E-mail: masami.hiramatsu...@hitachi.com
Re: [PATCH 1/1] booke/watchdog: refine and clean up the codes
On Thu, May 8, 2014 at 10:04 AM, yuantian.t...@freescale.com wrote:
> From: Tang Yuantian yuantian.t...@freescale.com
>
> Basically, this patch does the following:
> 1. Move the code that parses boot parameters from setup-common.c
>    to the driver. This way, a reader can see directly that there are
>    boot parameters that can change the timeout.
> 2. Make the boot parameter 'booke_wdt_period' effective. Currently,
>    when the driver is loaded, the default timeout is always used
>    instead of booke_wdt_period.
> 3. Wrap up the watchdog timeout in the device struct and clean up
>    unnecessary code.
>
> Signed-off-by: Tang Yuantian yuantian.t...@freescale.com
> Acked-by: Scott Wood scottw...@freescale.com

Reviewed-by: Li Yang le...@freescale.com
RE: powerpc/mpc85xx: Add BSC9132 QDS Support
-Original Message- From: Wood Scott-B07421 Sent: Saturday, May 03, 2014 6:01 AM To: Rai Harninder-B01044 Cc: linuxppc-dev@lists.ozlabs.org; Gupta Ruchika-R66431 Subject: Re: powerpc/mpc85xx: Add BSC9132 QDS Support On Tue, Mar 18, 2014 at 01:05:02PM +0530, harninder rai wrote: +ifc { + #address-cells = 2; + #size-cells = 1; + compatible = fsl,ifc, simple-bus; + /* FIXME: Test whether interrupts are split */ + interrupts = 16 2 0 0 20 2 0 0; +}; Have you done this test yet? Checked with Prabhakar and he says that on 9132, the IFC interrupts are split B4/T4 (and variants), C29x etc onwards are when the interrupts got merged into single interrupt -Scott ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH v2 1/2] powerpc/pm: add api to get suspend state which is STANDBY or MEM
On Wed, Apr 30, 2014 at 6:47 AM, Scott Wood scottw...@freescale.com wrote: On Mon, 2014-04-28 at 13:53 +0800, Leo Li wrote: On Sat, Apr 26, 2014 at 5:45 AM, Scott Wood scottw...@freescale.com wrote: On Thu, 2014-04-24 at 14:11 +0800, Dongsheng Wang wrote: From: Wang Dongsheng dongsheng.w...@freescale.com Add set_pm_suspend_state pm_suspend_state functions to set/get suspend state. When system going to sleep or deep sleep, devices can get the system suspend state(STANDBY/MEM) through pm_suspend_state function and to handle different situations. Signed-off-by: Wang Dongsheng dongsheng.w...@freescale.com --- *v2* Move pm api from fsl platform to powerpc general framework. What is powerpc-specific about this? Generally I agree with you. But I had the discussion about this topic a while ago with the PM maintainer. He suggestion to go with the platform way. https://lkml.org/lkml/2013/8/16/505 If what he meant was whether you could do what this patch does, then you can answer him with, No, because it got nacked as not being platform or arch specific. Oh, and you're still using .valid as the hook to set the platform state, which is awful -- I think .begin is what you want to use. I'm not saying the current patch is good for upstream. Actually I did say that the patch need to be updated for upstream purpose. I only meant that we discussed about having the mem/standby passed by generic kernel/power interface as you suggested internally and got an negative feedback. If we did it in powerpc code, then what would we do on ARM? Copy the code? No. If you are saying that this shouldn't be done in arch/powerpc Yes. We have determined to use drivers/platform folder for the re-used code with ARM. Platform power management code will be moved there. Now, a more legitimate objection to putting it in generic code might be that standby and mem are loosely defined and the knowledge of how a driver should react to each is platform specific -- but your patch doesn't address that. 
You still have the driver itself interpret what standby and mem mean. Yup, we will address it in next batch. - Leo ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH] powerpc: irq work racing with timer interrupt can result in timer interrupt hang
Hi Anton,

On 05/09/2014 01:17 PM, Anton Blanchard wrote:
> I am seeing an issue where a CPU running perf eventually hangs.
> Traces show timer interrupts happening every 4 seconds even
> when a userspace task is running on the CPU. /proc/timer_list
> also shows pending hrtimers have not run in over an hour,
> including the scheduler.
>
> Looking closer, decrementers_next_tb is getting set to
> 0x, and at that point we will never take a timer interrupt again.
>
> In __timer_interrupt() we set decrementers_next_tb to
> 0x and rely on ->event_handler to update it:
>
>         *next_tb = ~(u64)0;
>         if (evt->event_handler)
>                 evt->event_handler(evt);
>
> In this case ->event_handler is hrtimer_interrupt. This will eventually
> call back through the clockevents code with the next event to be
> programmed:
>
> static int decrementer_set_next_event(unsigned long evt,
>                                       struct clock_event_device *dev)
> {
>         /* Don't adjust the decrementer if some irq work is pending */
>         if (test_irq_work_pending())
>                 return 0;
>         __get_cpu_var(decrementers_next_tb) = get_tb_or_rtc() + evt;
>
> If irq work came in between these two points, we will return
> before updating decrementers_next_tb and we never process a timer
> interrupt again.
>
> This looks to have been introduced by 0215f7d8c53f (powerpc: Fix races
> with irq_work). Fix it by removing the early exit and relying on
> code later on in the function to force an early decrementer:
>
>         /* We may have raced with new irq work */
>         if (test_irq_work_pending())
>                 set_dec(1);

There is another scenario we are missing. It is not necessarily the case
that on a timer interrupt the event handler will call back through
set_next_event(). If there are no pending timers, the event handler
will not bother programming the tick device and will simply return.
IOW, set_next_event() will not be called. In that case we will miss
taking care of pending irq work altogether.

__timer_interrupt() -> event_handler -> next_time = KTIME_MAX ->
__timer_interrupt().

In __timer_interrupt() we do not check for pending irq work anywhere
after the call to the event handler, and hence we miss servicing irq
work in the above scenario.

How about you also move the check:

        if (test_irq_work_pending())
                set_dec(1);

in __timer_interrupt() outside the _else_ branch? This will ensure that,
no matter what, we check for pending irq work before exiting the timer
interrupt handler.

Regards
Preeti U Murthy

> Signed-off-by: Anton Blanchard an...@samba.org
> Cc: sta...@vger.kernel.org # 3.14+
> ---
>
> diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c
> index 122a580..4f0b676 100644
> --- a/arch/powerpc/kernel/time.c
> +++ b/arch/powerpc/kernel/time.c
> @@ -813,9 +888,6 @@ static void __init clocksource_init(void)
>  static int decrementer_set_next_event(unsigned long evt,
>                                        struct clock_event_device *dev)
>  {
> -       /* Don't adjust the decrementer if some irq work is pending */
> -       if (test_irq_work_pending())
> -               return 0;
>         __get_cpu_var(decrementers_next_tb) = get_tb_or_rtc() + evt;
>         set_dec(evt);

How about if you move the test_irq_work_pending

Why do we have test_irq_work_pending() later in the function
decrementer_set_next_event()?
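The race discussed in this thread can be illustrated with a small userspace model. This is a sketch only, assuming a single CPU and no real hardware: struct cpu_model, set_next_event_buggy() and set_next_event_fixed() are made-up names that mimic the pre-patch and post-patch shape of decrementer_set_next_event(), and timer_interrupt() always simulates irq work arriving between the reset of next_tb and the reprogramming.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Simplified single-CPU model; every name here is illustrative. */
struct cpu_model {
	uint64_t now;		/* stands in for the timebase */
	uint64_t next_tb;	/* stands in for decrementers_next_tb */
	uint64_t dec;		/* last value written to the decrementer */
	bool irq_work_pending;
};

/* Pre-patch shape: return early when irq work is pending. */
static int set_next_event_buggy(struct cpu_model *cpu, uint64_t evt)
{
	if (cpu->irq_work_pending)
		return 0;	/* next_tb keeps whatever it held before */
	cpu->next_tb = cpu->now + evt;
	cpu->dec = evt;
	return 0;
}

/* Post-patch shape: always record the deadline, then force an early tick. */
static int set_next_event_fixed(struct cpu_model *cpu, uint64_t evt)
{
	cpu->next_tb = cpu->now + evt;
	cpu->dec = evt;
	if (cpu->irq_work_pending)
		cpu->dec = 1;	/* we may have raced with new irq work */
	return 0;
}

/*
 * Model of the interrupt path: reset next_tb to "never", let irq work
 * race in, then let the event handler reprogram the next event.
 * Returns the deadline that ends up recorded.
 */
static uint64_t timer_interrupt(struct cpu_model *cpu,
				int (*set_next_event)(struct cpu_model *, uint64_t),
				uint64_t evt)
{
	assert(cpu && set_next_event);
	cpu->next_tb = ~(uint64_t)0;
	cpu->irq_work_pending = true;
	set_next_event(cpu, evt);
	return cpu->next_tb;
}
```

With the buggy variant the recorded deadline stays at all-ones, which matches the "never take a timer interrupt again" symptom; with the fixed variant the deadline is recorded and the decrementer is set to 1 so the pending work is picked up quickly.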
Re: [PATCH] powerpc: irq work racing with timer interrupt can result in timer interrupt hang
On Fri, May 09, 2014 at 05:47:12PM +1000, Anton Blanchard wrote: I am seeing an issue where a CPU running perf eventually hangs. Traces show timer interrupts happening every 4 seconds even when a userspace task is running on the CPU. Is this by chance every 4.2 seconds? The reason I ask is that Paul Clarke and I are seeing an interrupt every 4.2 seconds when he runs NO_HZ_FULL, and are trying to get rid of it. ;-) Thanx, Paul /proc/timer_list also shows pending hrtimers have not run in over an hour, including the scheduler. Looking closer, decrementers_next_tb is getting set to 0x, and at that point we will never take a timer interrupt again. In __timer_interrupt() we set decrementers_next_tb to 0x and rely on -event_handler to update it: *next_tb = ~(u64)0; if (evt-event_handler) evt-event_handler(evt); In this case -event_handler is hrtimer_interrupt. This will eventually call back through the clockevents code with the next event to be programmed: static int decrementer_set_next_event(unsigned long evt, struct clock_event_device *dev) { /* Don't adjust the decrementer if some irq work is pending */ if (test_irq_work_pending()) return 0; __get_cpu_var(decrementers_next_tb) = get_tb_or_rtc() + evt; If irq work came in between these two points, we will return before updating decrementers_next_tb and we never process a timer interrupt again. This looks to have been introduced by 0215f7d8c53f (powerpc: Fix races with irq_work). 
Fix it by removing the early exit and relying on code later on in the function to force an early decrementer: /* We may have raced with new irq work */ if (test_irq_work_pending()) set_dec(1); Signed-off-by: Anton Blanchard an...@samba.org Cc: sta...@vger.kernel.org # 3.14+ --- diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c index 122a580..4f0b676 100644 --- a/arch/powerpc/kernel/time.c +++ b/arch/powerpc/kernel/time.c @@ -813,9 +888,6 @@ static void __init clocksource_init(void) static int decrementer_set_next_event(unsigned long evt, struct clock_event_device *dev) { - /* Don't adjust the decrementer if some irq work is pending */ - if (test_irq_work_pending()) - return 0; __get_cpu_var(decrementers_next_tb) = get_tb_or_rtc() + evt; set_dec(evt); ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH v2 1/2] powerpc/pm: add api to get suspend state which is STANDBY or MEM
On Fri, 2014-05-09 at 17:33 +0800, Li Yang wrote: On Wed, Apr 30, 2014 at 6:47 AM, Scott Wood scottw...@freescale.com wrote: On Mon, 2014-04-28 at 13:53 +0800, Leo Li wrote: On Sat, Apr 26, 2014 at 5:45 AM, Scott Wood scottw...@freescale.com wrote: On Thu, 2014-04-24 at 14:11 +0800, Dongsheng Wang wrote: From: Wang Dongsheng dongsheng.w...@freescale.com Add set_pm_suspend_state pm_suspend_state functions to set/get suspend state. When system going to sleep or deep sleep, devices can get the system suspend state(STANDBY/MEM) through pm_suspend_state function and to handle different situations. Signed-off-by: Wang Dongsheng dongsheng.w...@freescale.com --- *v2* Move pm api from fsl platform to powerpc general framework. What is powerpc-specific about this? Generally I agree with you. But I had the discussion about this topic a while ago with the PM maintainer. He suggestion to go with the platform way. https://lkml.org/lkml/2013/8/16/505 If what he meant was whether you could do what this patch does, then you can answer him with, No, because it got nacked as not being platform or arch specific. Oh, and you're still using .valid as the hook to set the platform state, which is awful -- I think .begin is what you want to use. I'm not saying the current patch is good for upstream. Actually I did say that the patch need to be updated for upstream purpose. I don't follow -- this thread is an upstream submission. Now, a more legitimate objection to putting it in generic code might be that standby and mem are loosely defined and the knowledge of how a driver should react to each is platform specific -- but your patch doesn't address that. You still have the driver itself interpret what standby and mem mean. Yup, we will address it in next batch. Thanks. -Scott ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH 1/1] booke/watchdog: refine and clean up the codes
On Thu, May 08, 2014 at 10:04:26AM +0800, yuantian.t...@freescale.com wrote:
> From: Tang Yuantian yuantian.t...@freescale.com
>
> Basically, this patch does the following:
> 1. Move the code that parses boot parameters from setup-common.c
>    to the driver. This way, a reader can see directly that there are
>    boot parameters that can change the timeout.
> 2. Make the boot parameter 'booke_wdt_period' effective. Currently,
>    when the driver is loaded, the default timeout is always used
>    instead of booke_wdt_period.
> 3. Wrap up the watchdog timeout in the device struct and clean up
>    unnecessary code.
>
> Signed-off-by: Tang Yuantian yuantian.t...@freescale.com
> Acked-by: Scott Wood scottw...@freescale.com

Reviewed-by: Guenter Roeck li...@roeck-us.net
Re: [PATCHv2] powerpc/85xx: Add OCA4080 board support
On Tue, Apr 15, 2014 at 07:51:46PM +0200, Martijn de Gouw wrote: diff --git a/arch/powerpc/platforms/85xx/corenet_generic.c b/arch/powerpc/platforms/85xx/corenet_generic.c index fbd871e..f3685047 100644 --- a/arch/powerpc/platforms/85xx/corenet_generic.c +++ b/arch/powerpc/platforms/85xx/corenet_generic.c @@ -55,8 +55,6 @@ void __init corenet_gen_setup_arch(void) mpc85xx_smp_init(); swiotlb_detect_4g(); - - pr_info(%s board from Freescale Semiconductor\n, ppc_md.name); Valentin's patch kept this line but removed from Freescale Semiconductor; I'll leave it like that when applying. -Scott ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [v6,3/5] powerpc/book3e: support kgdb for kernel space
On Wed, Oct 23, 2013 at 05:31:23PM +0800, Tiejun Chen wrote: Currently we need to skip this for supporting KGDB. Signed-off-by: Tiejun Chen tiejun.c...@windriver.com --- arch/powerpc/kernel/exceptions-64e.S |4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/arch/powerpc/kernel/exceptions-64e.S b/arch/powerpc/kernel/exceptions-64e.S index a55cf62..0b750c6 100644 --- a/arch/powerpc/kernel/exceptions-64e.S +++ b/arch/powerpc/kernel/exceptions-64e.S @@ -597,11 +597,13 @@ kernel_dbg_exc: rfdi /* Normal debug exception */ +1: andi. r14,r11,MSR_PR; /* check for userspace again */ +#ifndef CONFIG_KGDB /* XXX We only handle coming from userspace for now since we can't * quite save properly an interrupted kernel state yet */ -1: andi. r14,r11,MSR_PR; /* check for userspace again */ beq kernel_dbg_exc; /* if from kernel mode */ +#endif Now that we have support for properly saving state on special level exceptions, that should be used here. With the above patch, what happens if e.g. a debug exception fires during a TLB miss, and the kgdb handler takes its own TLB miss accessing the serial port? -Scott ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
linux-next: add scottwood/linux.git
On Mon, 2014-03-24 at 20:09 -0500, Scott Wood wrote: On Mon, 2014-03-24 at 10:33 +1100, Benjamin Herrenschmidt wrote: On Mon, 2014-03-24 at 10:16 +1100, Benjamin Herrenschmidt wrote: On Wed, 2014-03-19 at 23:25 -0500, Scott Wood wrote: The following changes since commit c7e64b9ce04aa2e3fad7396d92b5cb92056d16ac: powerpc/powernv Platform dump interface (2014-03-07 16:19:10 +1100) are available in the git repository at: git://git.kernel.org/pub/scm/linux/kernel/git/scottwood/linux.git next for you to fetch changes up to 48b16180d0d91324e5d2423c6d53d97bbe3dcc14: fsl/pci: The new pci suspend/resume implementation (2014-03-19 22:37:44 -0500) Stephen just informed me that your tree wasn't in -next ... Kumar's still is. Can you guys fix that up ? I somewhat rely on the FSL stuff to simmer in -next on its own. Stephen, what's the process for adding a tree? ping -Scott I suppose we should update MAINTAINERS while we're at it. Oh and where is my little summary to put in the merge commit ? I made one up for this time around. Oops, forgot again. Now I've added something to the script I use to generate pull requests, to give me a reminder. -Scott ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH] powerpc: irq work racing with timer interrupt can result in timer interrupt hang
On Fri, May 09, 2014 at 11:50:05PM +0200, Gabriel Paubert wrote:
> On Fri, May 09, 2014 at 06:41:13AM -0700, Paul E. McKenney wrote:
> > On Fri, May 09, 2014 at 05:47:12PM +1000, Anton Blanchard wrote:
> > > I am seeing an issue where a CPU running perf eventually hangs.
> > > Traces show timer interrupts happening every 4 seconds even
> > > when a userspace task is running on the CPU.
> >
> > Is this by chance every 4.2 seconds? The reason I ask is that
> > Paul Clarke and I are seeing an interrupt every 4.2 seconds when
> > he runs NO_HZ_FULL, and are trying to get rid of it. ;-)
>
> Hmmm, it's close to 2^32 nanoseconds, isn't it suspicious?

Now that you mention it... ;-)

So you are telling me that we are not succeeding in completely turning
off the decrementer interrupt?

							Thanx, Paul
Re: [PATCH] powerpc: irq work racing with timer interrupt can result in timer interrupt hang
On Fri, May 09, 2014 at 06:41:13AM -0700, Paul E. McKenney wrote:
> On Fri, May 09, 2014 at 05:47:12PM +1000, Anton Blanchard wrote:
> > I am seeing an issue where a CPU running perf eventually hangs.
> > Traces show timer interrupts happening every 4 seconds even
> > when a userspace task is running on the CPU.
>
> Is this by chance every 4.2 seconds? The reason I ask is that
> Paul Clarke and I are seeing an interrupt every 4.2 seconds when
> he runs NO_HZ_FULL, and are trying to get rid of it. ;-)

Hmmm, it's close to 2^32 nanoseconds, isn't it suspicious?

	Gabriel
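The arithmetic behind the 2^32 nanoseconds remark can be checked with a few lines of C. This is only a worked calculation; the helper names are made up, and it says nothing about which counter in the kernel actually wraps.

```c
#include <assert.h>
#include <stdint.h>

/* Convert a nanosecond count to whole milliseconds (truncating). */
static uint64_t ns_to_ms(uint64_t ns)
{
	return ns / 1000000u;
}

/* Period, in nanoseconds, of a counter that wraps after 2^bits ticks of 1 ns. */
static uint64_t wrap_period_ns(unsigned int bits)
{
	assert(bits < 64);
	return (uint64_t)1 << bits;
}
```

For bits = 32 the period is 4294967296 ns, i.e. roughly 4.29 seconds, which is in the neighbourhood of the 4.2 second interval reported above.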
[PATCH] powerpc: Fix attempt to move .org backwards error (again)
Commit 4e243b7 (powerpc: Fix attempt to move .org backwards error) fixes the allyesconfig build by moving machine_check_common to a different location. While this fixes most of the errors, both allmodconfig and allyesconfig still fail as follows. arch/powerpc/kernel/exceptions-64s.S:1315: Error: attempt to move .org backwards Fix by moving machine_check_common after the offending address. Signed-off-by: Guenter Roeck li...@roeck-us.net --- This fixes the build error, but unfortunately I don't have a system to test the resulting image. arch/powerpc/kernel/exceptions-64s.S | 49 ++-- 1 file changed, 24 insertions(+), 25 deletions(-) diff --git a/arch/powerpc/kernel/exceptions-64s.S b/arch/powerpc/kernel/exceptions-64s.S index 3afd391..25398be 100644 --- a/arch/powerpc/kernel/exceptions-64s.S +++ b/arch/powerpc/kernel/exceptions-64s.S @@ -1138,31 +1138,6 @@ unrecov_user_slb: #endif /* __DISABLED__ */ - - /* -* Machine check is different because we use a different -* save area: PACA_EXMC instead of PACA_EXGEN. -*/ - .align 7 - .globl machine_check_common -machine_check_common: - - mfspr r10,SPRN_DAR - std r10,PACA_EXGEN+EX_DAR(r13) - mfspr r10,SPRN_DSISR - stw r10,PACA_EXGEN+EX_DSISR(r13) - EXCEPTION_PROLOG_COMMON(0x200, PACA_EXMC) - FINISH_NAP - DISABLE_INTS - ld r3,PACA_EXGEN+EX_DAR(r13) - lwz r4,PACA_EXGEN+EX_DSISR(r13) - std r3,_DAR(r1) - std r4,_DSISR(r1) - bl .save_nvgprs - addir3,r1,STACK_FRAME_OVERHEAD - bl .machine_check_exception - b .ret_from_except - .align 7 .globl alignment_common alignment_common: @@ -1328,6 +1303,30 @@ fwnmi_data_area: initial_stab: .space 4096 + /* +* Machine check is different because we use a different +* save area: PACA_EXMC instead of PACA_EXGEN. 
+*/ + .align 7 + .globl machine_check_common +machine_check_common: + + mfspr r10,SPRN_DAR + std r10,PACA_EXGEN+EX_DAR(r13) + mfspr r10,SPRN_DSISR + stw r10,PACA_EXGEN+EX_DSISR(r13) + EXCEPTION_PROLOG_COMMON(0x200, PACA_EXMC) + FINISH_NAP + DISABLE_INTS + ld r3,PACA_EXGEN+EX_DAR(r13) + lwz r4,PACA_EXGEN+EX_DSISR(r13) + std r3,_DAR(r1) + std r4,_DSISR(r1) + bl .save_nvgprs + addir3,r1,STACK_FRAME_OVERHEAD + bl .machine_check_exception + b .ret_from_except + #ifdef CONFIG_PPC_POWERNV _GLOBAL(opal_mc_secondary_handler) HMT_MEDIUM_PPR_DISCARD -- 1.9.1 ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
[PATCH 1/1] powerpc/perf: Adjust callchain based on DWARF debug info
When saving the callchain on Power, the kernel conservatively saves
excess entries in the callchain. A few of these entries are needed
in some cases but not others. E.g., the value in the link register (LR)
is needed only when it holds the return address of a function. At other
times it must be ignored.

If the unnecessary entries are not ignored, we end up with duplicate
arcs in the call-graphs.

Use DWARF debug information to ignore the unnecessary entries.

Callgraph before the patch:

    14.67%          2234  sprintft  libc-2.18.so       [.] __random
            |
            --- __random
               |
               |--61.12%-- __random
               |          |
               |          |--97.15%-- rand
               |          |          do_my_sprintf
               |          |          main
               |          |          generic_start_main.isra.0
               |          |          __libc_start_main
               |          |          0x0
               |          |
               |           --2.85%-- do_my_sprintf
               |                     main
               |                     generic_start_main.isra.0
               |                     __libc_start_main
               |                     0x0
               |
                --38.88%-- rand
                          |
                          |--94.01%-- rand
                          |          do_my_sprintf
                          |          main
                          |          generic_start_main.isra.0
                          |          __libc_start_main
                          |          0x0
                          |
                           --5.99%-- do_my_sprintf
                                     main
                                     generic_start_main.isra.0
                                     __libc_start_main
                                     0x0

Callgraph after the patch:

    14.67%          2234  sprintft  libc-2.18.so       [.] __random
            |
            --- __random
               |
               |--95.93%-- rand
               |          do_my_sprintf
               |          main
               |          generic_start_main.isra.0
               |          __libc_start_main
               |          0x0
               |
                --4.07%-- do_my_sprintf
                          main
                          generic_start_main.isra.0
                          __libc_start_main
                          0x0

TODO: For split-debug info objects like glibc, we can determine
      the call-frame-address only when both .eh_frame and .debug_info
      sections are available. We should be able to determine the CFA
      even without the .eh_frame section.

Thanks to Ulrich Weigand for help with DWARF debug information.
Fix suggested by Anton Blanchard.
Reported-by: Maynard Johnson mayn...@us.ibm.com
Signed-off-by: Sukadev Bhattiprolu suka...@linux.vnet.ibm.com
---
 tools/perf/arch/powerpc/Makefile                |   1 +
 tools/perf/arch/powerpc/util/adjust-callchain.c | 278
 tools/perf/config/Makefile                      |   5 +
 tools/perf/util/callchain.h                     |  12 +
 tools/perf/util/machine.c                       |  16 +-
 5 files changed, 310 insertions(+), 2 deletions(-)
 create mode 100644 tools/perf/arch/powerpc/util/adjust-callchain.c

diff --git a/tools/perf/arch/powerpc/Makefile b/tools/perf/arch/powerpc/Makefile
index 744e629..512cc8d 100644
--- a/tools/perf/arch/powerpc/Makefile
+++ b/tools/perf/arch/powerpc/Makefile
@@ -3,3 +3,4 @@ PERF_HAVE_DWARF_REGS := 1
 LIB_OBJS += $(OUTPUT)arch/$(ARCH)/util/dwarf-regs.o
 endif
 LIB_OBJS += $(OUTPUT)arch/$(ARCH)/util/header.o
+LIB_OBJS += $(OUTPUT)arch/$(ARCH)/util/adjust-callchain.o
diff --git a/tools/perf/arch/powerpc/util/adjust-callchain.c b/tools/perf/arch/powerpc/util/adjust-callchain.c
new file mode 100644
index 000..31b1f95
--- /dev/null
+++ b/tools/perf/arch/powerpc/util/adjust-callchain.c
@@ -0,0 +1,278 @@
+/*
+ * Use DWARF Debug information to skip unnecessary callchain entries.
+ *
+ * Copyright (C) 2014 Sukadev Bhattiprolu, IBM Corporation.
+ * Copyright (C) 2014 Ulrich Weigand, IBM Corporation.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version
+ * 2 of the License, or (at your option) any later version.
+ */
+#include <inttypes.h>
+#include <dwarf.h>
+#include <elfutils/libdwfl.h>
+
+#include "util/thread.h"
+#include "util/callchain.h"
+
+/*
+ * When saving the callchain on Power, the kernel conservatively saves
+ * excess entries in the callchain. A few of these entries are needed
+ * in some cases but not others. If the unnecessary entries are not
+ * ignored, we end up with duplicate arcs in the call-graphs. Use
+ * DWARF
Re: [PATCH] powerpc: irq work racing with timer interrupt can result in timer interrupt hang
On Fri, 2014-05-09 at 15:22 +0530, Preeti U Murthy wrote:

> in __timer_interrupt() outside the _else_ loop? This will ensure that no
> matter what, before exiting timer interrupt handler we check for pending
> irq work.

We still need to make sure that set_next_event() doesn't move the dec beyond the next tick if there is a pending timer... maybe we can fix it like this:

static int decrementer_set_next_event(unsigned long evt,
				      struct clock_event_device *dev)
{
	__get_cpu_var(decrementers_next_tb) = get_tb_or_rtc() + evt;

	/* Don't adjust the decrementer if some irq work is pending */
	if (!test_irq_work_pending())
		set_dec(evt);

	return 0;
}

Along with a single occurrence of:

	if (test_irq_work_pending())
		set_dec(1);

At the end of __timer_interrupt(), outside of the current else {} case, this should work, don't you think ?

What about this completely untested patch ?

diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c
index 122a580..ba7e83b 100644
--- a/arch/powerpc/kernel/time.c
+++ b/arch/powerpc/kernel/time.c
@@ -503,12 +503,13 @@ void __timer_interrupt(void)
 		now = *next_tb - now;
 		if (now <= DECREMENTER_MAX)
 			set_dec((int)now);
-		/* We may have raced with new irq work */
-		if (test_irq_work_pending())
-			set_dec(1);
 		__get_cpu_var(irq_stat).timer_irqs_others++;
 	}

+	/* We may have raced with new irq work */
+	if (test_irq_work_pending())
+		set_dec(1);
+
 #ifdef CONFIG_PPC64
 	/* collect purr register values often, for accurate calculations */
 	if (firmware_has_feature(FW_FEATURE_SPLPAR)) {
@@ -813,15 +814,11 @@ static void __init clocksource_init(void)
 static int decrementer_set_next_event(unsigned long evt,
 				      struct clock_event_device *dev)
 {
-	/* Don't adjust the decrementer if some irq work is pending */
-	if (test_irq_work_pending())
-		return 0;
 	__get_cpu_var(decrementers_next_tb) = get_tb_or_rtc() + evt;
-	set_dec(evt);

-	/* We may have raced with new irq work */
-	if (test_irq_work_pending())
-		set_dec(1);
+	/* Don't adjust the decrementer if some irq work is pending */
+	if (!test_irq_work_pending())
+		set_dec(evt);

 	return 0;
 }
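For anyone following the thread, here is a rough user-space model of the race and of the change proposed above. It is only a sketch: struct cpu_model and all of the *_old / *_new functions are made-up names for illustration, plain variables stand in for decrementers_next_tb, the decrementer register and test_irq_work_pending(), and nothing here is kernel code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model state; field names only mirror the kernel's. */
struct cpu_model {
	uint64_t tb;            /* current timebase value */
	uint64_t next_tb;       /* stands in for decrementers_next_tb */
	int64_t dec;            /* stands in for the decrementer register */
	bool irq_work_pending;  /* stands in for test_irq_work_pending() */
};

/* Old behaviour: return early when irq work is pending. */
static int set_next_event_old(struct cpu_model *cpu, uint64_t evt)
{
	assert(cpu);
	if (cpu->irq_work_pending)
		return 0;               /* next_tb is never updated */
	cpu->next_tb = cpu->tb + evt;
	cpu->dec = (int64_t)evt;
	if (cpu->irq_work_pending)
		cpu->dec = 1;
	return 0;
}

/* Proposed behaviour: always record next_tb, only skip set_dec(). */
static int set_next_event_new(struct cpu_model *cpu, uint64_t evt)
{
	assert(cpu);
	cpu->next_tb = cpu->tb + evt;
	if (!cpu->irq_work_pending)
		cpu->dec = (int64_t)evt;
	return 0;
}

/* Models the event-handler path of __timer_interrupt() before the change. */
static void timer_interrupt_old(struct cpu_model *cpu, uint64_t evt, bool race)
{
	cpu->next_tb = ~(uint64_t)0;    /* *next_tb = ~(u64)0 */
	cpu->irq_work_pending = race;   /* irq work raised in the window */
	set_next_event_old(cpu, evt);
}

/* Same path with the proposed change and the final pending check. */
static void timer_interrupt_new(struct cpu_model *cpu, uint64_t evt, bool race)
{
	cpu->next_tb = ~(uint64_t)0;
	cpu->irq_work_pending = race;
	set_next_event_new(cpu, evt);
	if (cpu->irq_work_pending)
		cpu->dec = 1;           /* force an early decrementer */
}
```

In this model, calling timer_interrupt_old() with race set leaves next_tb stuck at all ones, which corresponds to the hang in the original report. timer_interrupt_new() records the real deadline and still forces a short decrementer so the pending irq work gets run.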
[PATCH] printk/of_serial: fix serial console cessation part way through boot.
Commit 5f5c9ae56c38942623f69c3e6dc6ec78e4da2076 "serial_core: Unregister console in uart_remove_one_port()" fixed a crash where a serial port was removed but not deregistered as a console.

There is a side effect of that commit for platforms having serial consoles and of_serial configured (CONFIG_SERIAL_OF_PLATFORM). The serial console is disabled midway through the boot process.

This cessation of the serial console affects PowerPC computers such as the MVME5100 and SAM440EP.

The sequence is:

	bootconsole [udbg0] enabled
	serial8250/16550 driver initialises and registers its UARTS,
		one of these is the serial console.
	console [ttyS0] enabled
	of_serial probes "platform" devices, registering them as it goes.
		One of these is the serial console.
	console [ttyS0] disabled.

The disabling of the serial console is due to:

	a. unregister_console in printk not clearing the
	   CON_ENABLED bit in the console flags,
	   even though it has announced that the console is disabled; and

	b. of_platform_serial_probe in of_serial not setting the port type
	   before it registers with serial8250_register_8250_port.

This patch ensures that the serial console is re-enabled when of_serial registers a serial port that corresponds to the designated console.

Signed-off-by: Stephen Chivers schiv...@csc.com
Tested-by: Stephen Chivers schiv...@csc.com

===
The above failure was identified in Linux-3.15-rc2.

Tested using MVME5100 and SAM440EP PowerPC computers with kernels built from Linux-3.15-rc5 and tty-next.

The continued operation of the serial console is vital for computers such as the MVME5100 as that Single Board Computer does not have any graphical/display hardware.

---
 drivers/tty/serial/of_serial.c |    1 +
 kernel/printk/printk.c         |    1 +
 2 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/drivers/tty/serial/of_serial.c b/drivers/tty/serial/of_serial.c
index 9924660..27981e2 100644
--- a/drivers/tty/serial/of_serial.c
+++ b/drivers/tty/serial/of_serial.c
@@ -173,6 +173,7 @@ static int of_platform_serial_probe(struct platform_device *ofdev)
 	{
 		struct uart_8250_port port8250;
 		memset(&port8250, 0, sizeof(port8250));
+		port.type = port_type;
 		port8250.port = port;

 		if (port.fifosize)
diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
index 7228258..221229c 100644
--- a/kernel/printk/printk.c
+++ b/kernel/printk/printk.c
@@ -2413,6 +2413,7 @@ int unregister_console(struct console *console)
 	if (console_drivers != NULL && console->flags & CON_CONSDEV)
 		console_drivers->flags |= CON_CONSDEV;

+	console->flags &= ~CON_ENABLED;
 	console_unlock();
 	console_sysfs_notify();
 	return res;
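To illustrate point (a) of the description, here is a small user-space model of why a stale enabled bit can keep the console from coming back. It is a sketch under assumptions: struct console_model, CON_ENABLED_MODEL and every *_model function are hypothetical names, and the guard in port_configure_model() (register only when the enabled bit is clear) is my assumption about how the caller decides to re-register, not something stated in the patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative flag value for the model only. */
#define CON_ENABLED_MODEL 4u

/* Hypothetical stand-in for struct console. */
struct console_model {
	unsigned int flags;
	bool registered;        /* true while the console is on the active list */
};

static void register_console_model(struct console_model *con)
{
	assert(con);
	con->registered = true;
	con->flags |= CON_ENABLED_MODEL;
}

/*
 * clear_enabled == false models the behaviour described in (a);
 * clear_enabled == true models the behaviour with the patch applied.
 */
static void unregister_console_model(struct console_model *con,
				     bool clear_enabled)
{
	assert(con);
	con->registered = false;
	if (clear_enabled)
		con->flags &= ~CON_ENABLED_MODEL;
}

/* Assumed caller: only registers a console not already marked enabled. */
static void port_configure_model(struct console_model *con)
{
	assert(con);
	if (!(con->flags & CON_ENABLED_MODEL))
		register_console_model(con);
}
```

With clear_enabled false, a register / unregister / configure sequence leaves the console unregistered, because the stale bit makes the caller skip registration. With clear_enabled true the same sequence ends with the console registered again.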