Re: [PATCH] powerpc/64: pseudo-NMI/SMP watchdog
On Sat, 10 Dec 2016 16:22:13 +1100 Balbir Singh wrote:

> On 10/12/16 02:52, Nicholas Piggin wrote:
> > Rather than use perf / PMU interrupts and the generic hardlockup
> > detector, this takes the decrementer interrupt as an "NMI" when
> > interrupts are soft disabled (XXX: will this do the right thing with a
> > large decrementer?). This will work even if we start soft-disabling PMU
> > interrupts.
> >
> > This does not solve the hardlockup problem completely however, because
> > interrupts can often become hard disabled when soft disabled for long
> > periods. And they can be hard disabled for other reasons.
> >
>
> Ben/Paul suggested a way to work around this with XICS. The idea was to
> have MSR_EE set and use XICS to stash away the current
> interrupt and acknowledge it/replay it later. Decrementer interrupts would
> not trigger timers, but trigger a special NMI watchdog, like you've
> implemented.

Yeah that's a good idea, it should significantly reduce hard interrupt
disable windows.

> > @@ -718,6 +719,8 @@ static __init void kvm_free_tmp(void)
> >
> >  static int __init kvm_guest_init(void)
> >  {
> > +	/* XXX: disable hardlockup watchdog? */
> > +
>
> You mean the hypervisor watchdog? Did your testing
> catch anything here?

I meant guest. Testing didn't catch anything but I put it there to
investigate because I saw x86 does hardlockup_detector_disable() in
their guest init.

> > +static void nmi_timer_fn(unsigned long data)
> > +{
> > +	struct timer_list *t = this_cpu_ptr(_timer);
> > +	int cpu = smp_processor_id();
> > +
> > +	watchdog_timer_interrupt(cpu);
> > +
> > +	t->expires = round_jiffies(jiffies + nmi_timer_period * HZ);
> > +	add_timer_on(t, cpu);
> > +}
>
> Do we have to have this running all the time? Can we do an on-demand
> version of NMI where we do periodic decrementers without any reliance
> on timers to implement NMI watchdog

We could, but it is trivial to do this and get all the timer and
dynticks stuff taken care of for us. We could bump the period up to
30s or so and it should hardly be an issue. I didn't want to try
getting too clever: there are times when you could shut it off, but
then you still lose some lockup coverage.

But... I'm open to suggestions. I don't know the timer code well.

> > +static int nmi_cpu_notify(struct notifier_block *self,
> > +				unsigned long action, void *hcpu)
> > +{
> > +	int cpu = (unsigned long)hcpu;
> > +
> > +	switch (action & ~CPU_TASKS_FROZEN) {
> > +	case CPU_ONLINE:
> > +	case CPU_DOWN_FAILED:
> > +		start_nmi_on_cpu(cpu);
> > +		pr_info("NMI Watchdog running on cpus %*pbl\n",
> > +			cpumask_pr_args(_cpus_enabled));
> > +		break;
> > +	case CPU_DOWN_PREPARE:
> > +		stop_nmi_on_cpu(cpu);
> > +		pr_info("NMI Watchdog running on cpus %*pbl\n",
> > +			cpumask_pr_args(_cpus_enabled));
> > +		break;
> > +	}
> > +	return NOTIFY_OK;
> > +}
>
> FYI: These bits are changing in linux-next

Yeah I'll have to update them.

> > diff --git a/init/main.c b/init/main.c
> > index 2858be7..36fd7e7 100644
> > --- a/init/main.c
> > +++ b/init/main.c
> > @@ -33,6 +33,7 @@
> >  #include
> >  #include
> >  #include
> > +#include
> >  #include
> >  #include
> >  #include
> > @@ -579,6 +580,8 @@ asmlinkage __visible void __init start_kernel(void)
> >
> >  	kmem_cache_init_late();
> >
> > +	nmi_init();
>
> How did you test these?

I just tried putting soft/hard irq disable in a few places and spinning
forever. The soft disable case was getting caught by the local NMI, hard
disable gets caught by the SMP check. When we also get the NMI IPI crash
debug stuff, we should be able to get reasonable crash data with hard
disabled hangs.

Thanks,
Nick
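The "SMP check" mentioned above (other CPUs noticing a CPU that has stopped making progress) can be modelled roughly as follows. This is a userspace sketch under stated assumptions, not code from the patch: the `demo_` names, the fixed CPU count and the tick units are all hypothetical.

```c
#include <stdint.h>

#define DEMO_NR_CPUS 4

/* Hypothetical per-CPU state: time of the last heartbeat, in ticks. */
static uint64_t demo_last_heartbeat[DEMO_NR_CPUS];

/* Called from each CPU's periodic timer: record that this CPU is alive. */
static void demo_heartbeat(int cpu, uint64_t now)
{
	demo_last_heartbeat[cpu] = now;
}

/*
 * Called by any CPU: return a bitmask of the *other* CPUs whose last
 * heartbeat is more than `timeout` ticks old. A set bit means the CPU
 * looks locked up from the point of view of the caller.
 */
static unsigned int demo_find_stuck_cpus(int self, uint64_t now,
					 uint64_t timeout)
{
	unsigned int stuck = 0;

	for (int cpu = 0; cpu < DEMO_NR_CPUS; cpu++) {
		if (cpu == self)
			continue;
		if (now - demo_last_heartbeat[cpu] > timeout)
			stuck |= 1u << cpu;
	}
	return stuck;
}
```

In this model a CPU spinning with interrupts hard disabled never calls demo_heartbeat(), so its bit shows up in the mask computed by any other CPU once the timeout has passed.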
Re: [PATCH v3 00/15] livepatch: hybrid consistency model
On Thu, 2016-12-08 at 12:08 -0600, Josh Poimboeuf wrote:
> Dusting the cobwebs off the consistency model again. This is based on
> linux-next/master.
>
> v1 was posted on 2015-02-09:
>
> https://lkml.kernel.org/r/cover.1423499826.git.jpoim...@redhat.com
>
> v2 was posted on 2016-04-28:
>
> https://lkml.kernel.org/r/cover.1461875890.git.jpoim...@redhat.com
>
> The biggest issue from v2 was finding a decent way to detect preemption
> and page faults on the stack of a sleeping task.

Could you please elaborate on this? Preemption of a sleeping task and
faults as in the future (time) preemption and faults?

Balbir Singh.
Re: [PATCH] powerpc/64: pseudo-NMI/SMP watchdog
On 10/12/16 02:52, Nicholas Piggin wrote:
> Rather than use perf / PMU interrupts and the generic hardlockup
> detector, this takes the decrementer interrupt as an "NMI" when
> interrupts are soft disabled (XXX: will this do the right thing with a
> large decrementer?). This will work even if we start soft-disabling PMU
> interrupts.
>
> This does not solve the hardlockup problem completely however, because
> interrupts can often become hard disabled when soft disabled for long
> periods. And they can be hard disabled for other reasons.
>

Ben/Paul suggested a way to work around this with XICS. The idea was to
have MSR_EE set and use XICS to stash away the current interrupt and
acknowledge it/replay it later. Decrementer interrupts would not trigger
timers, but trigger a special NMI watchdog, like you've implemented.

> To make up for the lack of a periodic true NMI, this also has an SMP
> hard lockup detector where all CPUs can observe lockups on others.
>
> This still needs a bit more polishing, testing, comments, config
> options, and boot parameters, etc., so it's RFC quality only.
>
> Thanks,
> Nick
> ---
>  arch/powerpc/Kconfig                 |   2 +
>  arch/powerpc/include/asm/nmi.h       |   5 +
>  arch/powerpc/kernel/Makefile         |   1 +
>  arch/powerpc/kernel/exceptions-64s.S |  14 +-
>  arch/powerpc/kernel/kvm.c            |   3 +
>  arch/powerpc/kernel/nmi.c            | 288 +++
>  arch/powerpc/kernel/setup_64.c       |  18 ---
>  arch/powerpc/kernel/time.c           |   2 +
>  arch/sparc/kernel/nmi.c              |   2 +-
>  include/linux/nmi.h                  |  14 ++
>  init/main.c                          |   3 +
>  kernel/watchdog.c                    |  16 +-
>  12 files changed, 341 insertions(+), 27 deletions(-)
>  create mode 100644 arch/powerpc/kernel/nmi.c
>
> diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
> index 65fba4c..adb3387 100644
> --- a/arch/powerpc/Kconfig
> +++ b/arch/powerpc/Kconfig
> @@ -124,6 +124,8 @@ config PPC
>  	select HAVE_CBPF_JIT if !PPC64
>  	select HAVE_EBPF_JIT if PPC64
>  	select HAVE_ARCH_JUMP_LABEL
> +	select HAVE_NMI
> +	select HAVE_NMI_WATCHDOG if PPC64
>  	select ARCH_HAVE_NMI_SAFE_CMPXCHG
>  	select ARCH_HAS_GCOV_PROFILE_ALL
>  	select GENERIC_SMP_IDLE_THREAD
> diff --git a/arch/powerpc/include/asm/nmi.h b/arch/powerpc/include/asm/nmi.h
> index ff1ccb3..d00e29b 100644
> --- a/arch/powerpc/include/asm/nmi.h
> +++ b/arch/powerpc/include/asm/nmi.h
> @@ -1,4 +1,9 @@
>  #ifndef _ASM_NMI_H
>  #define _ASM_NMI_H
>
> +#define arch_nmi_init powerpc_nmi_init
> +void __init powerpc_nmi_init(void);
> +void touch_nmi_watchdog(void);
> +void soft_nmi_interrupt(struct pt_regs *regs);
> +
>  #endif /* _ASM_NMI_H */
> diff --git a/arch/powerpc/kernel/Makefile b/arch/powerpc/kernel/Makefile
> index 1925341..77f199f 100644
> --- a/arch/powerpc/kernel/Makefile
> +++ b/arch/powerpc/kernel/Makefile
> @@ -42,6 +42,7 @@ obj-$(CONFIG_PPC64)		+= setup_64.o sys_ppc32.o \
>  				   signal_64.o ptrace32.o \
>  				   paca.o nvram_64.o firmware.o
>  obj-$(CONFIG_VDSO32)		+= vdso32/
> +obj-$(CONFIG_HAVE_NMI_WATCHDOG)	+= nmi.o
>  obj-$(CONFIG_HAVE_HW_BREAKPOINT)	+= hw_breakpoint.o
>  obj-$(CONFIG_PPC_BOOK3S_64)	+= cpu_setup_ppc970.o cpu_setup_pa6t.o
>  obj-$(CONFIG_PPC_BOOK3S_64)	+= cpu_setup_power.o
> diff --git a/arch/powerpc/kernel/exceptions-64s.S
> b/arch/powerpc/kernel/exceptions-64s.S
> index 1ba82ea..b159d02 100644
> --- a/arch/powerpc/kernel/exceptions-64s.S
> +++ b/arch/powerpc/kernel/exceptions-64s.S
> @@ -1295,7 +1295,7 @@ masked_##_H##interrupt:				\
>  	lis	r10,0x7fff;				\
>  	ori	r10,r10,0x;				\
>  	mtspr	SPRN_DEC,r10;				\
> -	b	2f;					\
> +	b	masked_decrementer_##_H##interrupt;	\
>  1:	cmpwi	r10,PACA_IRQ_DBELL;			\
>  	beq	2f;					\
>  	cmpwi	r10,PACA_IRQ_HMI;			\
> @@ -1312,6 +1312,16 @@ masked_##_H##interrupt:				\
>  	##_H##rfid;					\
>  	b	.
>
> +#define MASKED_NMI(_H)				\
> +masked_decrementer_##_H##interrupt:		\
> +	std	r12,PACA_EXGEN+EX_R12(r13);	\
> +	GET_SCRATCH0(r10);			\
> +	std	r10,PACA_EXGEN+EX_R13(r13);	\
> +	EXCEPTION_PROLOG_PSERIES_1(soft_nmi_common, _H)
> +
> +EXC_COMMON(soft_nmi_common, 0x900, soft_nmi_interrupt)
> +
> +
>  /*
>   * Real mode exceptions actually use this too, but alternate
>   * instruction code patches (which end up in the common .text area)
> @@ -1319,7 +1329,9 @@
Re: [PATCH] ibmvscsi: add write memory barrier to CRQ processing
On Wed, 2016-12-07 at 17:31 -0600, Tyrel Datwyler wrote:
> The first byte of each CRQ entry is used to indicate whether an entry is
> a valid response or free for the VIOS to use. After processing a
> response the driver sets the valid byte to zero to indicate the entry is
> now free to be reused. Add a memory barrier after this write to ensure
> no other stores are reordered when updating the valid byte.

Which "other stores" specifically ? This smells fishy without that
precision.

It's important to always understand what exactly barriers order with.

Cheers,
Ben.

> Signed-off-by: Tyrel Datwyler
> ---
>  drivers/scsi/ibmvscsi/ibmvscsi.c | 2 ++
>  1 file changed, 2 insertions(+)
>
> diff --git a/drivers/scsi/ibmvscsi/ibmvscsi.c
> b/drivers/scsi/ibmvscsi/ibmvscsi.c
> index d9534ee..2f5b07e 100644
> --- a/drivers/scsi/ibmvscsi/ibmvscsi.c
> +++ b/drivers/scsi/ibmvscsi/ibmvscsi.c
> @@ -232,6 +232,7 @@ static void ibmvscsi_task(void *data)
>  		while ((crq = crq_queue_next_crq(>queue)) != NULL) {
>  			ibmvscsi_handle_crq(crq, hostdata);
>  			crq->valid = VIOSRP_CRQ_FREE;
> +			wmb();
>  		}
>
>  		vio_enable_interrupts(vdev);
> @@ -240,6 +241,7 @@ static void ibmvscsi_task(void *data)
>  			vio_disable_interrupts(vdev);
>  			ibmvscsi_handle_crq(crq, hostdata);
>  			crq->valid = VIOSRP_CRQ_FREE;
> +			wmb();
>  		} else {
>  			done = 1;
>  		}
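To make the question about "other stores" concrete: a write barrier is only meaningful relative to a named store on each side of it. The sketch below is a userspace illustration using C11 fences as a stand-in for the kernel's wmb() (an assumption, the two are not identical). The `demo_` names and the `demo_consumed` counter are hypothetical and are not part of the ibmvscsi driver.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical queue entry: the first byte says whether it is in use. */
struct demo_entry {
	_Atomic uint8_t valid;
	uint8_t data;
};

/* Hypothetical shared counter that the other side reads. */
static _Atomic uint32_t demo_consumed;

/*
 * Store A: publish how many entries have been consumed.
 * Store B: mark the entry free.
 *
 * The release fence sits between the two, so an observer that sees B
 * (valid == 0) and then issues an acquire fence is also guaranteed to
 * see A. A fence placed after B would instead order B against whatever
 * store comes next, which is why the two stores need to be named.
 */
static void demo_release_entry(struct demo_entry *e)
{
	uint32_t n = atomic_load_explicit(&demo_consumed,
					  memory_order_relaxed);

	atomic_store_explicit(&demo_consumed, n + 1,
			      memory_order_relaxed);		/* A */
	atomic_thread_fence(memory_order_release);
	atomic_store_explicit(&e->valid, 0,
			      memory_order_relaxed);		/* B */
}
```

The sketch does not say which placement is right for the driver; it only shows that the answer depends on which pair of stores must not be reordered.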
Re: [PATCH] powerpc/64: pseudo-NMI/SMP watchdog
On Sat, 2016-12-10 at 01:52 +1000, Nicholas Piggin wrote:
> This does not solve the hardlockup problem completely however, because
> interrupts can often become hard disabled when soft disabled for long
> periods. And they can be hard disabled for other reasons.
>
> To make up for the lack of a periodic true NMI, this also has an SMP
> hard lockup detector where all CPUs can observe lockups on others.
>
> This still needs a bit more polishing, testing, comments, config
> options, and boot parameters, etc., so it's RFC quality only.

Paulus and I discussed a plan with Balbir to also limit the cases of
hard-disable.

They typically happen as a result of an external interrupt. We could on
P8 and earlier, just fetch the interrupt from the XICS in the "masked"
path and stash it in the PACA. We already have a way to stash an
interrupt there for later processing because KVM sometimes does it.

That would cause the XICS to elevate the priority effectively masking
subsequent interrupts. We'd have to change the XICS code to use the same
priority for IPIs and externals too though.

For XIVE (P9), we can just poke at the CPU priority register in the TM
area to mask at the PIC level in that case and unmask later.

Cheers,
Ben.
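A toy model of the stash-and-replay idea described above, for readers who want to see the state involved. This is only a sketch under stated assumptions: the structures and `demo_` functions are hypothetical, the interrupt controller is reduced to three fields, and restoring the priority at replay time is a simplification of what real hardware does at end-of-interrupt.

```c
#include <stdint.h>

#define DEMO_NO_IRQ 0u

/* Hypothetical, heavily simplified interrupt presenter. */
struct demo_pic {
	uint32_t pending_irq;	/* source waiting to be taken, or DEMO_NO_IRQ */
	uint8_t pending_prio;	/* its priority */
	uint8_t cpu_prio;	/* current CPU priority, 0xff = accept all */
};

/* Hypothetical per-CPU area with room for one stashed interrupt. */
struct demo_paca {
	uint32_t stashed_irq;
	uint8_t saved_prio;
};

/*
 * "Masked" path: interrupts are soft disabled, so accept the interrupt
 * from the controller but do not handle it. Accepting it moves the CPU
 * priority to that of the interrupt, which in this model is what keeps
 * further interrupts of the same priority from being presented.
 */
static void demo_masked_external(struct demo_pic *pic,
				 struct demo_paca *paca)
{
	if (pic->pending_irq == DEMO_NO_IRQ)
		return;

	paca->stashed_irq = pic->pending_irq;
	paca->saved_prio = pic->cpu_prio;
	pic->cpu_prio = pic->pending_prio;
	pic->pending_irq = DEMO_NO_IRQ;
}

/*
 * Re-enable path: hand back the stashed interrupt for normal handling
 * and restore the priority that was in effect before it was accepted.
 */
static uint32_t demo_replay(struct demo_pic *pic, struct demo_paca *paca)
{
	uint32_t irq = paca->stashed_irq;

	paca->stashed_irq = DEMO_NO_IRQ;
	if (irq != DEMO_NO_IRQ)
		pic->cpu_prio = paca->saved_prio;
	return irq;
}
```

The model also hints at why IPIs and externals would need the same priority: if an IPI were more favoured than the stashed external, it could still be presented while the slot is occupied.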
Re: [PATCH v2] of/irq: improve error report on irq discovery process failure
On 12/09/2016 02:25 PM, Rob Herring wrote:
> On Mon, Dec 5, 2016 at 1:01 PM, Guilherme G. Piccoli
> wrote:
>> On 12/05/2016 12:28 PM, Rob Herring wrote:
>>> On Mon, Dec 5, 2016 at 7:59 AM, Guilherme G. Piccoli
>>> wrote:
>>>> On PowerPC machines some PCI slots might not have level triggered
>>>> interrupts capability (also known as level signaled interrupts),
>>>> leading of_irq_parse_pci() to complain by presenting error messages
>>>> on the kernel log - in this case, the properties "interrupt-map" and
>>>> "interrupt-map-mask" are not present on device's node in the device
>>>> tree.
>>>>
>>>> This patch introduces a different message for this specific case,
>>>> and also reduces its level from error to warning. Besides, we warn
>>>> (once) that possibly some PCI slots on the system have no level
>>>> triggered interrupts available.
>>>> We changed some error return codes too on function of_irq_parse_raw()
>>>> so that other failure cases can be presented in a more precise way.
>>>>
>>>> Before this patch, when an adapter was plugged in a slot without level
>>>> interrupts capability on PowerPC, we saw a generic error message
>>>> like this:
>>>>
>>>> [54.239] pci 002d:70:00.0: of_irq_parse_pci() failed with rc=-22
>>>>
>>>> Now, with this applied, we see the following specific message:
>>>>
>>>> [16.154] pci 0014:60:00.1: of_irq_parse_pci: no interrupt-map found,
>>>> INTx interrupts not available
>>>>
>>>> Finally, we standardize the error path in of_irq_parse_raw() by always
>>>> taking the fail path instead of returning directly from the loop.
>>>>
>>>> Signed-off-by: Guilherme G. Piccoli
>>>> ---
>>>>
>>>> v2:
>>>> * Changed function return code to always return negative values;
>>>
>>> Are you sure this is safe? This is tricky because of differing values
>>> of NO_IRQ (0 or -1).
>>
>> Thanks Rob, but this is purely bad wording from myself. I'm sorry - I
>> meant to say that I changed only my positive return code (that was
>> suggested to be removed in the prior revision) to negative return code!
>>
>> So, I changed only code I added myself in v1 =)
>>
>>
>>>
>>>> * Improved/simplified warning outputs;
>>>> * Changed some return codes and some error paths in of_irq_parse_raw()
>>>> in order to be more precise/consistent;
>>>
>>> This too could have some side effects on callers.
>>>
>>> Not saying don't do these changes, just need some assurances this has
>>> been considered.
>>
>> Thanks for your attention. I performed a quick investigation before
>> changing this, all the places that use the return values are just
>> getting "true/false" information from that, meaning they just are
>> comparing to 0 basically. So changing -EINVAL to -ENOENT wouldn't hurt
>> any user of these return values, it'll only become more informative IMHO.
>>
>> Now, regarding the only error path that was changed: for some reason,
>> this was the only place in which we didn't goto fail label in case of
>> failure - it was added by a legacy commit from Ben, dated from 2006:
>> 006b64de60 ("[POWERPC] Make OF irq map code detect more error cases").
>> Then it was carried by Grant Likely's commit 7dc2e1134a ("of/irq: merge
>> irq mapping code"), 6-year old commit.
>> I wasn't able to imagine a scenario in which changing this would break
>> something; I believe the change improves consistency, but I'd remove it
>> if you or somebody else thinks it's worth removing.
>
> Okay. It's a bit late for 4.10 now and I want this to be in -next for a
> while, so I'll queue it after the merge window.
>

OK, perfect! Thanks Rob

Cheers,

Guilherme

> Rob
>
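The argument about return codes (callers only compare the result against 0, so swapping one negative errno for another is invisible to them) can be shown with a small userspace sketch. The `demo_` names and the node structure are hypothetical stand-ins and do not reproduce of_irq_parse_raw().

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for a device node; only what the sketch needs. */
struct demo_node {
	const char *interrupt_map;	/* NULL when the property is absent */
	int valid;			/* 0 models a malformed node */
};

/*
 * Returns 0 on success or a negative errno, never a positive value.
 * -ENOENT distinguishes "no interrupt-map" from a malformed node, which
 * still reports -EINVAL.
 */
static int demo_parse_irq(const struct demo_node *np)
{
	if (np == NULL || !np->valid)
		return -EINVAL;
	if (np->interrupt_map == NULL)
		return -ENOENT;
	return 0;
}

/*
 * A caller in the style described in the thread: it only asks whether
 * the call succeeded, so it behaves the same for any nonzero code.
 */
static int demo_has_irq(const struct demo_node *np)
{
	return demo_parse_irq(np) == 0;
}
```

A caller that printed or switched on the exact code would notice the change; the claim in the thread is that no such caller was found.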
Re: [PATCH v2] cxl: prevent read/write to AFU config space while AFU not configured
diff --git a/drivers/misc/cxl/vphb.c b/drivers/misc/cxl/vphb.c
index 3519ace..639a343 100644
--- a/drivers/misc/cxl/vphb.c
+++ b/drivers/misc/cxl/vphb.c
@@ -76,23 +76,22 @@ static int cxl_pcie_cfg_record(u8 bus, u8 devfn)
 	return (bus << 8) + devfn;
 }

-static int cxl_pcie_config_info(struct pci_bus *bus, unsigned int devfn,
-				struct cxl_afu **_afu, int *_record)
+static inline struct cxl_afu *pci_bus_to_afu(struct pci_bus *bus)
 {
-	struct pci_controller *phb;
-	struct cxl_afu *afu;
-	int record;
+	struct pci_controller *phb = bus ? pci_bus_to_host(bus) : NULL;

-	phb = pci_bus_to_host(bus);
-	if (phb == NULL)
-		return PCIBIOS_DEVICE_NOT_FOUND;
+	return phb ? phb->private_data : NULL;
+}
+
+static inline int cxl_pcie_config_info(struct pci_bus *bus, unsigned int devfn,
+				       struct cxl_afu *afu, int *_record)
+{
+	int record;

-	afu = (struct cxl_afu *)phb->private_data;
 	record = cxl_pcie_cfg_record(bus->number, devfn);
 	if (record > afu->crs_num)
 		return PCIBIOS_DEVICE_NOT_FOUND;

-	*_afu = afu;
 	*_record = record;
 	return 0;
 }

There's no reason to pass the afu parameter to that function, is there?
Pushing it further, do we need cxl_pcie_config_info()? It's now a simple
wrapper around cxl_pcie_cfg_record().

  Fred
Re: [PATCH v2] of/irq: improve error report on irq discovery process failure
On Mon, Dec 5, 2016 at 1:01 PM, Guilherme G. Piccoli wrote:
> On 12/05/2016 12:28 PM, Rob Herring wrote:
>> On Mon, Dec 5, 2016 at 7:59 AM, Guilherme G. Piccoli
>> wrote:
>>> On PowerPC machines some PCI slots might not have level triggered
>>> interrupts capability (also known as level signaled interrupts),
>>> leading of_irq_parse_pci() to complain by presenting error messages
>>> on the kernel log - in this case, the properties "interrupt-map" and
>>> "interrupt-map-mask" are not present on device's node in the device
>>> tree.
>>>
>>> This patch introduces a different message for this specific case,
>>> and also reduces its level from error to warning. Besides, we warn
>>> (once) that possibly some PCI slots on the system have no level
>>> triggered interrupts available.
>>> We changed some error return codes too on function of_irq_parse_raw()
>>> so that other failure cases can be presented in a more precise way.
>>>
>>> Before this patch, when an adapter was plugged in a slot without level
>>> interrupts capability on PowerPC, we saw a generic error message
>>> like this:
>>>
>>> [54.239] pci 002d:70:00.0: of_irq_parse_pci() failed with rc=-22
>>>
>>> Now, with this applied, we see the following specific message:
>>>
>>> [16.154] pci 0014:60:00.1: of_irq_parse_pci: no interrupt-map found,
>>> INTx interrupts not available
>>>
>>> Finally, we standardize the error path in of_irq_parse_raw() by always
>>> taking the fail path instead of returning directly from the loop.
>>>
>>> Signed-off-by: Guilherme G. Piccoli
>>> ---
>>>
>>> v2:
>>> * Changed function return code to always return negative values;
>>
>> Are you sure this is safe? This is tricky because of differing values
>> of NO_IRQ (0 or -1).
>
> Thanks Rob, but this is purely bad wording from myself. I'm sorry - I
> meant to say that I changed only my positive return code (that was
> suggested to be removed in the prior revision) to negative return code!
>
> So, I changed only code I added myself in v1 =)
>
>
>>
>>> * Improved/simplified warning outputs;
>>> * Changed some return codes and some error paths in of_irq_parse_raw()
>>> in order to be more precise/consistent;
>>
>> This too could have some side effects on callers.
>>
>> Not saying don't do these changes, just need some assurances this has
>> been considered.
>
> Thanks for your attention. I performed a quick investigation before
> changing this, all the places that use the return values are just
> getting "true/false" information from that, meaning they just are
> comparing to 0 basically. So changing -EINVAL to -ENOENT wouldn't hurt
> any user of these return values, it'll only become more informative IMHO.
>
> Now, regarding the only error path that was changed: for some reason,
> this was the only place in which we didn't goto fail label in case of
> failure - it was added by a legacy commit from Ben, dated from 2006:
> 006b64de60 ("[POWERPC] Make OF irq map code detect more error cases").
> Then it was carried by Grant Likely's commit 7dc2e1134a ("of/irq: merge
> irq mapping code"), 6-year old commit.
> I wasn't able to imagine a scenario in which changing this would break
> something; I believe the change improves consistency, but I'd remove it
> if you or somebody else thinks it's worth removing.

Okay. It's a bit late for 4.10 now and I want this to be in -next for a
while, so I'll queue it after the merge window.

Rob
[PATCH] powerpc/64: pseudo-NMI/SMP watchdog
Rather than use perf / PMU interrupts and the generic hardlockup
detector, this takes the decrementer interrupt as an "NMI" when
interrupts are soft disabled (XXX: will this do the right thing with a
large decrementer?). This will work even if we start soft-disabling PMU
interrupts.

This does not solve the hardlockup problem completely however, because
interrupts can often become hard disabled when soft disabled for long
periods. And they can be hard disabled for other reasons.

To make up for the lack of a periodic true NMI, this also has an SMP
hard lockup detector where all CPUs can observe lockups on others.

This still needs a bit more polishing, testing, comments, config
options, and boot parameters, etc., so it's RFC quality only.

Thanks,
Nick
---
 arch/powerpc/Kconfig                 |   2 +
 arch/powerpc/include/asm/nmi.h       |   5 +
 arch/powerpc/kernel/Makefile         |   1 +
 arch/powerpc/kernel/exceptions-64s.S |  14 +-
 arch/powerpc/kernel/kvm.c            |   3 +
 arch/powerpc/kernel/nmi.c            | 288 +++
 arch/powerpc/kernel/setup_64.c       |  18 ---
 arch/powerpc/kernel/time.c           |   2 +
 arch/sparc/kernel/nmi.c              |   2 +-
 include/linux/nmi.h                  |  14 ++
 init/main.c                          |   3 +
 kernel/watchdog.c                    |  16 +-
 12 files changed, 341 insertions(+), 27 deletions(-)
 create mode 100644 arch/powerpc/kernel/nmi.c

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 65fba4c..adb3387 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -124,6 +124,8 @@ config PPC
 	select HAVE_CBPF_JIT if !PPC64
 	select HAVE_EBPF_JIT if PPC64
 	select HAVE_ARCH_JUMP_LABEL
+	select HAVE_NMI
+	select HAVE_NMI_WATCHDOG if PPC64
 	select ARCH_HAVE_NMI_SAFE_CMPXCHG
 	select ARCH_HAS_GCOV_PROFILE_ALL
 	select GENERIC_SMP_IDLE_THREAD
diff --git a/arch/powerpc/include/asm/nmi.h b/arch/powerpc/include/asm/nmi.h
index ff1ccb3..d00e29b 100644
--- a/arch/powerpc/include/asm/nmi.h
+++ b/arch/powerpc/include/asm/nmi.h
@@ -1,4 +1,9 @@
 #ifndef _ASM_NMI_H
 #define _ASM_NMI_H

+#define arch_nmi_init powerpc_nmi_init
+void __init powerpc_nmi_init(void);
+void touch_nmi_watchdog(void);
+void soft_nmi_interrupt(struct pt_regs *regs);
+
 #endif /* _ASM_NMI_H */
diff --git a/arch/powerpc/kernel/Makefile b/arch/powerpc/kernel/Makefile
index 1925341..77f199f 100644
--- a/arch/powerpc/kernel/Makefile
+++ b/arch/powerpc/kernel/Makefile
@@ -42,6 +42,7 @@ obj-$(CONFIG_PPC64)		+= setup_64.o sys_ppc32.o \
 				   signal_64.o ptrace32.o \
 				   paca.o nvram_64.o firmware.o
 obj-$(CONFIG_VDSO32)		+= vdso32/
+obj-$(CONFIG_HAVE_NMI_WATCHDOG)	+= nmi.o
 obj-$(CONFIG_HAVE_HW_BREAKPOINT)	+= hw_breakpoint.o
 obj-$(CONFIG_PPC_BOOK3S_64)	+= cpu_setup_ppc970.o cpu_setup_pa6t.o
 obj-$(CONFIG_PPC_BOOK3S_64)	+= cpu_setup_power.o
diff --git a/arch/powerpc/kernel/exceptions-64s.S b/arch/powerpc/kernel/exceptions-64s.S
index 1ba82ea..b159d02 100644
--- a/arch/powerpc/kernel/exceptions-64s.S
+++ b/arch/powerpc/kernel/exceptions-64s.S
@@ -1295,7 +1295,7 @@ masked_##_H##interrupt:				\
 	lis	r10,0x7fff;				\
 	ori	r10,r10,0x;				\
 	mtspr	SPRN_DEC,r10;				\
-	b	2f;					\
+	b	masked_decrementer_##_H##interrupt;	\
 1:	cmpwi	r10,PACA_IRQ_DBELL;			\
 	beq	2f;					\
 	cmpwi	r10,PACA_IRQ_HMI;			\
@@ -1312,6 +1312,16 @@ masked_##_H##interrupt:				\
 	##_H##rfid;					\
 	b	.

+#define MASKED_NMI(_H)				\
+masked_decrementer_##_H##interrupt:		\
+	std	r12,PACA_EXGEN+EX_R12(r13);	\
+	GET_SCRATCH0(r10);			\
+	std	r10,PACA_EXGEN+EX_R13(r13);	\
+	EXCEPTION_PROLOG_PSERIES_1(soft_nmi_common, _H)
+
+EXC_COMMON(soft_nmi_common, 0x900, soft_nmi_interrupt)
+
+
 /*
  * Real mode exceptions actually use this too, but alternate
  * instruction code patches (which end up in the common .text area)
@@ -1319,7 +1329,9 @@ masked_##_H##interrupt:				\
  */
 USE_FIXED_SECTION(virt_trampolines)
 	MASKED_INTERRUPT()
+	MASKED_NMI()
 	MASKED_INTERRUPT(H)
+	MASKED_NMI(H)

 #ifdef CONFIG_KVM_BOOK3S_64_HANDLER
 TRAMP_REAL_BEGIN(kvmppc_skip_interrupt)
diff --git a/arch/powerpc/kernel/kvm.c b/arch/powerpc/kernel/kvm.c
index 9ad37f8..f0d215c 100644
--- a/arch/powerpc/kernel/kvm.c
+++ b/arch/powerpc/kernel/kvm.c
@@ -25,6 +25,7 @@
 #include
 #include
 #include
+#include
Re: [PATCH kernel 9/9] KVM: PPC: Add in-kernel acceleration for VFIO
On Fri, 9 Dec 2016 18:53:43 +1100 Alexey Kardashevskiy wrote:

> On 09/12/16 04:55, Alex Williamson wrote:
> > On Thu, 8 Dec 2016 19:19:56 +1100
> > Alexey Kardashevskiy wrote:
> >
> >> This allows the host kernel to handle H_PUT_TCE, H_PUT_TCE_INDIRECT
> >> and H_STUFF_TCE requests targeting an IOMMU TCE table used for VFIO
> >> without passing them to user space which saves time on switching
> >> to user space and back.
> >>
> >> This adds H_PUT_TCE/H_PUT_TCE_INDIRECT/H_STUFF_TCE handlers to KVM.
> >> KVM tries to handle a TCE request in the real mode, if failed
> >> it passes the request to the virtual mode to complete the operation.
> >> If a virtual mode handler fails, the request is passed to
> >> the user space; this is not expected to happen though.
> >>
> >> To avoid dealing with page use counters (which is tricky in real mode),
> >> this only accelerates SPAPR TCE IOMMU v2 clients which are required
> >> to pre-register the userspace memory. The very first TCE request will
> >> be handled in the VFIO SPAPR TCE driver anyway as the userspace view
> >> of the TCE table (iommu_table::it_userspace) is not allocated till
> >> the very first mapping happens and we cannot call vmalloc in real mode.
> >>
> >> This adds new attribute - KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE - to
> >> the VFIO KVM device. It takes a VFIO group fd and SPAPR TCE table fd
> >> and associates a physical IOMMU table with the SPAPR TCE table (which
> >> is a guest view of the hardware IOMMU table). The iommu_table object
> >> is referenced so we do not have to retrieve in real mode when hypercall
> >> happens.
> >>
> >> This does not implement the UNSET counterpart as there is no use for it -
> >> once the acceleration is enabled, the existing userspace won't
> >> disable it unless a VFIO container is destroyed so this adds necessary
> >> cleanup to the KVM_DEV_VFIO_GROUP_DEL handler.
> >>
> >> This uses the kvm->lock mutex to protect against a race between
> >> the VFIO KVM device's kvm_vfio_destroy() and SPAPR TCE table fd's
> >> release() callback.
> >>
> >> This advertises the new KVM_CAP_SPAPR_TCE_VFIO capability to the user
> >> space.
> >>
> >> This finally makes use of vfio_external_user_iommu_id() which was
> >> introduced quite some time ago and was considered for removal.
> >>
> >> Tests show that this patch increases transmission speed from 220MB/s
> >> to 750..1020MB/s on 10Gb network (Chelsea CXGB3 10Gb ethernet card).
> >>
> >> Signed-off-by: Alexey Kardashevskiy
> >> ---
> >>  Documentation/virtual/kvm/devices/vfio.txt |  21 +-
> >>  arch/powerpc/include/asm/kvm_host.h        |   8 +
> >>  arch/powerpc/include/asm/kvm_ppc.h         |   5 +
> >>  include/uapi/linux/kvm.h                   |   8 +
> >>  arch/powerpc/kvm/book3s_64_vio.c           | 302 +
> >>  arch/powerpc/kvm/book3s_64_vio_hv.c        | 178 +
> >>  arch/powerpc/kvm/powerpc.c                 |   2 +
> >>  virt/kvm/vfio.c                            | 108 +++
> >>  8 files changed, 630 insertions(+), 2 deletions(-)
> >>
> >> diff --git a/Documentation/virtual/kvm/devices/vfio.txt
> >> b/Documentation/virtual/kvm/devices/vfio.txt
> >> index ef51740c67ca..ddb5a6512ab3 100644
> >> --- a/Documentation/virtual/kvm/devices/vfio.txt
> >> +++ b/Documentation/virtual/kvm/devices/vfio.txt
> >> @@ -16,7 +16,24 @@ Groups:
> >>
> >>  KVM_DEV_VFIO_GROUP attributes:
> >>    KVM_DEV_VFIO_GROUP_ADD: Add a VFIO group to VFIO-KVM device tracking
> >> +	kvm_device_attr.addr points to an int32_t file descriptor
> >> +	for the VFIO group.
> >>    KVM_DEV_VFIO_GROUP_DEL: Remove a VFIO group from VFIO-KVM device
> >> tracking
> >> +	kvm_device_attr.addr points to an int32_t file descriptor
> >> +	for the VFIO group.
> >> +  KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE: attaches a guest visible TCE table
> >> +	allocated by sPAPR KVM.
> >> +	kvm_device_attr.addr points to a struct:
> >>
> >> -For each, kvm_device_attr.addr points to an int32_t file descriptor
> >> -for the VFIO group.
> >> +	struct kvm_vfio_spapr_tce {
> >> +		__u32	argsz;
> >> +		__s32	groupfd;
> >> +		__s32	tablefd;
> >> +		__u8	pad[4];
> >> +	};
> >> +
> >> +	where
> >> +	@argsz is the size of kvm_vfio_spapr_tce_liobn;
> >> +	@groupfd is a file descriptor for a VFIO group;
> >> +	@tablefd is a file descriptor for a TCE table allocated via
> >> +	KVM_CREATE_SPAPR_TCE.
> >> diff --git a/arch/powerpc/include/asm/kvm_host.h
> >> b/arch/powerpc/include/asm/kvm_host.h
> >> index 28350a294b1e..94774503c70d 100644
> >> --- a/arch/powerpc/include/asm/kvm_host.h
> >> +++ b/arch/powerpc/include/asm/kvm_host.h
> >> @@ -191,6 +191,13 @@ struct kvmppc_pginfo {
> >>  	atomic_t refcnt;
> >>  };
> >>
> >> +struct kvmppc_spapr_tce_iommu_table {
> >> +	struct rcu_head rcu;
> >> +	struct list_head next;
> >> +	struct iommu_table *tbl;
> >> +	atomic_t refs;
Re: 4.9.0-rc8 - rcutorture test failure
On Fri, Dec 09, 2016 at 04:27:42PM +0530, Sachin Sant wrote:
> > But I am not seeing this as a failure. The last status print from the
> > log you attached is as follows:
> >
> > 07:58:25 [ 2778.876118] rcu-torture: rtc: (null) ver: 24968 tfle:
> > 0 rta: 24968 rtaf: 0 rtf: 24959 rtmbe: 0 rtbe: 0 rtbke: 0 rtbre: 0 rtbf: 0
> > rtb: 0 nt: 10218404 onoff: 0/0:0/0 -1,0:-1,0 0:0 (HZ=250) barrier: 0/0:0
> > cbflood: 22703
> > 07:58:25 [ 2778.876251] rcu-torture: Reader Pipe: 161849976604 399197 0 0
> > 0 0 0 0 0 0 0
> > 07:58:25 [ 2778.876438] rcu-torture: Reader Batch: 145090807711
> > 16759538163 0 0 0 0 0 0 0 0 0
> > 07:58:25 [ 2778.876625] rcu-torture: Free-Block Circulation: 24967 24967
> > 24966 24965 24964 24963 24962 24961 24960 24959 0
> > 07:58:25 [ 2778.876829] rcu-torture:--- End of test: SUCCESS: nreaders=79
> > nfakewriters=4 stat_interval=60 verbose=1 test_no_idle_hz=1
> > shuffle_interval=3 stutter=5 irqreader=1 fqs_duration=0 fqs_holdoff=0
> > fqs_stutter=3 test_boost=1/0 test_boost_interval=7 test_boost_duration=4
> > shutdown_secs=0 stall_cpu=0 stall_cpu_holdoff=10 n_barrier_cbs=0
> > onoff_interval=0 onoff_holdoff=0
> >
> > The "SUCCESS" indicates that rcutorture thought that it succeeded.
> > Also, in the "Reader Pipe" and "Reader Batch" lines, only the first two
> > numbers in the series at the end of each line are non-zero, which also
> > indicates a non-broken RCU.
> >
> > So could you please let me know what your scripting didn't like about
> > this log?
> >
>
> The test case has following piece of code which prints the failure
> message during result analysis.
>
> Checks for known bugs
> """
> utils.system('dmesg -c > /dev/null')
> pipe1 = [r for r in self.results if "!!! Reader Pipe:" in r]
> if len(pipe1) != 0:
>     raise error.TestError('\nBUG: grace-period failure !')
>     sys.exit(0)
>
> pipe2 = [r for r in self.results if "Reader Pipe" in r]
> for p in pipe2:
>     nmiss = p.split(" ")[7]
>     if int(nmiss):
>         raise error.TestError('\nBUG: rcutorture tests failed !')
>         sys.exit(0)
>
> I will double check on this.

I suggest using this script in the Linux kernel source as a guide:

tools/testing/selftests/rcutorture/bin/parse-console.sh

							Thanx, Paul
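The rule of thumb stated in the thread (on a "Reader Pipe" line, only the first two counts may be non-zero) can be checked without relying on the field position within the whole line. The following is a minimal sketch of that check, not a replacement for parse-console.sh; the function name `demo_reader_pipe_ok` and its return convention are hypothetical.

```c
#include <stdlib.h>
#include <string.h>

/*
 * Returns  1 if the line is a "Reader Pipe:" line and every count after
 *            the first two is zero,
 *          0 if some count after the first two is nonzero,
 *         -1 if the line is not a "Reader Pipe:" line at all.
 */
static int demo_reader_pipe_ok(const char *line)
{
	static const char tag[] = "Reader Pipe:";
	const char *p = strstr(line, tag);
	int index = 0;

	if (p == NULL)
		return -1;
	p += sizeof(tag) - 1;	/* skip past the tag itself */

	for (;;) {
		char *end;
		unsigned long long v = strtoull(p, &end, 10);

		if (end == p)	/* no further number on the line */
			break;
		if (index >= 2 && v != 0)
			return 0;
		index++;
		p = end;
	}
	return 1;
}
```

Counting from the tag rather than from the start of the line means the result does not depend on whether the console log carries a timestamp prefix in front of "rcu-torture:".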
[PATCH v4 0/4] powernv:stop: Use psscr_val,mask provided by firmware
From: "Gautham R. Shenoy"

This is the fourth iteration of the patchset to use the psscr_val and
psscr_mask provided by the firmware for each of the stop states.

The previous versions can be found here:
[v3]: https://lkml.org/lkml/2016/11/10/37
[v2]: https://lkml.org/lkml/2016/10/27/143
[v1]: https://lkml.org/lkml/2016/9/29/45

This version fixes some of the coding style issues pointed out by
Michael Ellerman in v3.

This version also documents the device-tree bindings defining the
properties under the @power-mgt node in the device tree describing the
idle states for Linux running on baremetal POWER servers.

Synopsis
==========
In the current implementation, the code for the ISA v3.0 stop
implementation has a couple of shortcomings.

a) The code hand-codes the values for the ESL, EC, TR and MTL bits of
   PSSCR and uses only the RL field from the firmware. While this is
   not incorrect, since the hand-coded values are legitimate, it is not
   a very flexible design since the firmware has the capability to
   communicate these values via the "ibm,cpu-idle-state-psscr" and
   "ibm,cpu-idle-state-psscr-mask" properties. In case the firmware
   provides values for these fields that are different from the
   hand-coded values, the current code will not work as intended.

b) Due to issue a), the current code assumes that ESL=EC=1 for all the
   stop states and hence the wakeup from the stop instruction will
   happen at 0x100, the system-reset vector. However, the ISA v3.0
   allows the ESL=EC=0 behaviour where the corresponding stop-state
   loses no state and wakes up from the subsequent instruction. The
   current code doesn't handle this case.

This patch series addresses these issues.

The first patch in the series renames the existing
IDLE_STATE_ENTER_SEQ macro to IDLE_STATE_ENTER_SEQ_NORET. It reuses
the name IDLE_STATE_ENTER_SEQ for entering into stop-states which wake
up at the subsequent instruction.

The second patch adds a helper function in cpuidle-powernv.c for
initializing entries of the powernv_states[] table that is passed to
the cpu-idle core. This eliminates some of the code duplication in the
function that discovers and initializes the stop states.

The third patch in the series fixes issues a) and b) by ensuring that
the psscr-value and the psscr-mask provided by the firmware are what
will be used to set a particular stop state. It also adds support for
handling wake-up from stop states which were entered with ESL=EC=0.

The third patch also handles the older firmware which sets only the
Requested Level (RL) field in the psscr and psscr-mask exposed in the
device tree. In the presence of such older firmware, this patch will
set sane default values for the remaining PSSCR fields (i.e. PSLL,
MTL, ESL, EC, and TR).

The fourth patch provides the documentation for the device-tree
bindings describing the idle state properties under the @power-mgt
node in the device-tree.

The skiboot patch that populates all the relevant fields in the PSSCR
values and the mask for all the stop states can be found here:
https://lists.ozlabs.org/pipermail/skiboot/2016-September/004869.html

The patches are based on top of
git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git fixes

Gautham R. Shenoy (4):
  powernv:idle: Add IDLE_STATE_ENTER_SEQ_NORET macro
  cpuidle:powernv: Add helper function to populate powernv idle states.
  powernv: Pass PSSCR value and mask to power9_idle_stop
  Documentation:powerpc: Add device-tree bindings for power-mgt

 .../devicetree/bindings/powerpc/opal/power-mgt.txt | 123 +
 arch/powerpc/include/asm/cpuidle.h                 |  46 +++-
 arch/powerpc/include/asm/processor.h               |   3 +-
 arch/powerpc/kernel/exceptions-64s.S               |   6 +-
 arch/powerpc/kernel/idle_book3s.S                  |  41 ---
 arch/powerpc/platforms/powernv/idle.c              |  81 +++---
 arch/powerpc/platforms/powernv/powernv.h           |   3 +-
 arch/powerpc/platforms/powernv/smp.c               |  14 ++-
 drivers/cpuidle/cpuidle-powernv.c                  | 113 ---
 include/linux/cpuidle.h                            |   1 +
 10 files changed, 348 insertions(+), 83 deletions(-)
 create mode 100644 Documentation/devicetree/bindings/powerpc/opal/power-mgt.txt

-- 
1.9.4
[PATCH v4 1/4] powernv:idle: Add IDLE_STATE_ENTER_SEQ_NORET macro
From: "Gautham R. Shenoy"

Currently all the low-power idle states are expected to wake up at
reset vector 0x100, which is why the macro IDLE_STATE_ENTER_SEQ puts
the CPU into an idle state and never returns.

On ISA_300, when the ESL and EC bits in the PSSCR are zero, the CPU is
expected to wake up at the instruction following the idle instruction.

This patch adds a new macro named IDLE_STATE_ENTER_SEQ_NORET for the
no-return variant and reuses the name IDLE_STATE_ENTER_SEQ for a
variant that allows resuming operation at the instruction next to the
idle-instruction.

Signed-off-by: Gautham R. Shenoy
---
 arch/powerpc/include/asm/cpuidle.h   |  5 -
 arch/powerpc/kernel/exceptions-64s.S |  6 +++---
 arch/powerpc/kernel/idle_book3s.S    | 10 +-
 3 files changed, 12 insertions(+), 9 deletions(-)

diff --git a/arch/powerpc/include/asm/cpuidle.h b/arch/powerpc/include/asm/cpuidle.h
index 3919332..0a3255b 100644
--- a/arch/powerpc/include/asm/cpuidle.h
+++ b/arch/powerpc/include/asm/cpuidle.h
@@ -21,7 +21,7 @@
 /* Idle state entry routines */
 #ifdef	CONFIG_PPC_P7_NAP
-#define	IDLE_STATE_ENTER_SEQ(IDLE_INST)				\
+#define IDLE_STATE_ENTER_SEQ(IDLE_INST)				\
 	/* Magic NAP/SLEEP/WINKLE mode enter sequence */	\
 	std	r0,0(r1);					\
 	ptesync;						\
@@ -29,6 +29,9 @@
 1:	cmpd	cr0,r0,r0;					\
 	bne	1b;						\
 	IDLE_INST;						\
+
+#define	IDLE_STATE_ENTER_SEQ_NORET(IDLE_INST)			\
+	IDLE_STATE_ENTER_SEQ(IDLE_INST)				\
 	b	.
 #endif /* CONFIG_PPC_P7_NAP */
diff --git a/arch/powerpc/kernel/exceptions-64s.S b/arch/powerpc/kernel/exceptions-64s.S
index 1ba82ea..7aa8afc 100644
--- a/arch/powerpc/kernel/exceptions-64s.S
+++ b/arch/powerpc/kernel/exceptions-64s.S
@@ -381,12 +381,12 @@ EXC_COMMON_BEGIN(machine_check_handle_early)
 	lbz	r3,PACA_THREAD_IDLE_STATE(r13)
 	cmpwi	r3,PNV_THREAD_NAP
 	bgt	10f
-	IDLE_STATE_ENTER_SEQ(PPC_NAP)
+	IDLE_STATE_ENTER_SEQ_NORET(PPC_NAP)
 	/* No return */
 10:
 	cmpwi	r3,PNV_THREAD_SLEEP
 	bgt	2f
-	IDLE_STATE_ENTER_SEQ(PPC_SLEEP)
+	IDLE_STATE_ENTER_SEQ_NORET(PPC_SLEEP)
 	/* No return */
 2:
@@ -400,7 +400,7 @@ EXC_COMMON_BEGIN(machine_check_handle_early)
 	 */
 	ori	r13,r13,1
 	SET_PACA(r13)
-	IDLE_STATE_ENTER_SEQ(PPC_WINKLE)
+	IDLE_STATE_ENTER_SEQ_NORET(PPC_WINKLE)
 	/* No return */
 4:
 #endif
diff --git a/arch/powerpc/kernel/idle_book3s.S b/arch/powerpc/kernel/idle_book3s.S
index 72dac0b..be90e2f 100644
--- a/arch/powerpc/kernel/idle_book3s.S
+++ b/arch/powerpc/kernel/idle_book3s.S
@@ -205,7 +205,7 @@ pnv_enter_arch207_idle_mode:
 	stb	r3,PACA_THREAD_IDLE_STATE(r13)
 	cmpwi	cr3,r3,PNV_THREAD_SLEEP
 	bge	cr3,2f
-	IDLE_STATE_ENTER_SEQ(PPC_NAP)
+	IDLE_STATE_ENTER_SEQ_NORET(PPC_NAP)
 	/* No return */
 2:
 	/* Sleep or winkle */
@@ -239,7 +239,7 @@ pnv_fastsleep_workaround_at_entry:
 common_enter: /* common code for all the threads entering sleep or winkle */
 	bgt	cr3,enter_winkle
-	IDLE_STATE_ENTER_SEQ(PPC_SLEEP)
+	IDLE_STATE_ENTER_SEQ_NORET(PPC_SLEEP)
 fastsleep_workaround_at_entry:
 	ori	r15,r15,PNV_CORE_IDLE_LOCK_BIT
@@ -261,7 +261,7 @@ fastsleep_workaround_at_entry:
 enter_winkle:
 	bl	save_sprs_to_stack
-	IDLE_STATE_ENTER_SEQ(PPC_WINKLE)
+	IDLE_STATE_ENTER_SEQ_NORET(PPC_WINKLE)
 /*
  * r3 - requested stop state
@@ -280,7 +280,7 @@ power_enter_stop:
 	ld	r4,ADDROFF(pnv_first_deep_stop_state)(r5)
 	cmpd	r3,r4
 	bge	2f
-	IDLE_STATE_ENTER_SEQ(PPC_STOP)
+	IDLE_STATE_ENTER_SEQ_NORET(PPC_STOP)
 2:
 /*
  * Entering deep idle state.
@@ -302,7 +302,7 @@ lwarx_loop_stop:
 	bl	save_sprs_to_stack
-	IDLE_STATE_ENTER_SEQ(PPC_STOP)
+	IDLE_STATE_ENTER_SEQ_NORET(PPC_STOP)
 _GLOBAL(power7_idle)
 	/* Now check if user or arch enabled NAP mode */
-- 
1.9.4
[PATCH v4 3/4] powernv: Pass PSSCR value and mask to power9_idle_stop
From: "Gautham R. Shenoy"

The power9_idle_stop method currently takes only the requested stop
level as a parameter and picks up the rest of the PSSCR bits from a
hand-coded macro. This is not a very flexible design, especially when
the firmware has the capability to communicate the psscr value and the
mask associated with a particular stop state via device tree.

This patch modifies the power9_idle_stop API to take as parameters the
PSSCR value and the PSSCR mask corresponding to the stop state that
needs to be set. These PSSCR value and mask are respectively obtained
by parsing the "ibm,cpu-idle-state-psscr" and
"ibm,cpu-idle-state-psscr-mask" fields from the device tree.

In addition to this, the patch adds support for handling stop states
for which ESL and EC bits in the PSSCR are zero. As per the
architecture, a wakeup from these stop states resumes execution from
the subsequent instruction as opposed to waking up at the System
Vector.

The older firmware sets only the Requested Level (RL) field in the
psscr and psscr-mask exposed in the device tree. For older firmware
where psscr-mask=0xf, this patch will set sane default values for the
remaining PSSCR fields (i.e. PSLL, MTL, ESL, EC, and TR).

The skiboot patch that exports fully populated PSSCR values and the
mask for all the stop states can be found here:
https://lists.ozlabs.org/pipermail/skiboot/2016-September/004869.html

Signed-off-by: Gautham R. Shenoy
---
 arch/powerpc/include/asm/cpuidle.h       | 41 
 arch/powerpc/include/asm/processor.h     |  3 +-
 arch/powerpc/kernel/idle_book3s.S        | 31 +++-
 arch/powerpc/platforms/powernv/idle.c    | 81 ++--
 arch/powerpc/platforms/powernv/powernv.h |  3 +-
 arch/powerpc/platforms/powernv/smp.c     | 14 +++---
 drivers/cpuidle/cpuidle-powernv.c        | 40 +++-
 7 files changed, 169 insertions(+), 44 deletions(-)

diff --git a/arch/powerpc/include/asm/cpuidle.h b/arch/powerpc/include/asm/cpuidle.h
index 0a3255b..fa0b6c0 100644
--- a/arch/powerpc/include/asm/cpuidle.h
+++ b/arch/powerpc/include/asm/cpuidle.h
@@ -10,11 +10,52 @@
 #define PNV_CORE_IDLE_LOCK_BIT		0x100
 #define PNV_CORE_IDLE_THREAD_BITS	0x0FF
+/*
+ * NOTE
+ * The older firmware populates only the RL field in the psscr_val and
+ * sets the psscr_mask to 0xf. On such a firmware, the kernel sets the
+ * remaining PSSCR fields to default values as follows:
+ *
+ * - ESL and EC bits are set to 1. So wakeup from any stop state will be
+ *   at vector 0x100.
+ *
+ * - MTL and PSLL are set to the maximum allowed value as per the ISA,
+ *   i.e. 15.
+ *
+ * - The Transition Rate, TR is set to the Maximum value 3.
+ */
+#define PSSCR_HV_DEFAULT_VAL	(PSSCR_ESL | PSSCR_EC |		\
+				PSSCR_PSLL_MASK | PSSCR_TR_MASK |	\
+				PSSCR_MTL_MASK)
+
+#define PSSCR_HV_DEFAULT_MASK	(PSSCR_ESL | PSSCR_EC |		\
+				PSSCR_PSLL_MASK | PSSCR_TR_MASK |	\
+				PSSCR_MTL_MASK | PSSCR_RL_MASK)
+
 #ifndef __ASSEMBLY__
 extern u32 pnv_fastsleep_workaround_at_entry[];
 extern u32 pnv_fastsleep_workaround_at_exit[];
 extern u64 pnv_first_deep_stop_state;
+
+static inline u64 compute_psscr_val(u64 psscr_val, u64 psscr_mask)
+{
+	/*
+	 * psscr_mask == 0xf indicates an older firmware.
+	 * Set remaining fields of psscr to the default values.
+	 * See NOTE above definition of PSSCR_HV_DEFAULT_VAL
+	 */
+	if (psscr_mask == 0xf)
+		return psscr_val | PSSCR_HV_DEFAULT_VAL;
+	return psscr_val;
+}
+
+static inline u64 compute_psscr_mask(u64 psscr_mask)
+{
+	if (psscr_mask == 0xf)
+		return PSSCR_HV_DEFAULT_MASK;
+	return psscr_mask;
+}
 #endif
 #endif
diff --git a/arch/powerpc/include/asm/processor.h b/arch/powerpc/include/asm/processor.h
index c07c31b..422becd 100644
--- a/arch/powerpc/include/asm/processor.h
+++ b/arch/powerpc/include/asm/processor.h
@@ -458,7 +458,8 @@ static inline unsigned long get_clean_sp(unsigned long sp, int is_32)
 extern unsigned long power7_nap(int check_irq);
 extern unsigned long power7_sleep(void);
 extern unsigned long power7_winkle(void);
-extern unsigned long power9_idle_stop(unsigned long stop_level);
+extern unsigned long power9_idle_stop(unsigned long stop_psscr_val,
+				      unsigned long stop_psscr_mask);
 extern void flush_instruction_cache(void);
 extern void hard_reset_now(void);
diff --git a/arch/powerpc/kernel/idle_book3s.S b/arch/powerpc/kernel/idle_book3s.S
index be90e2f..37ee533 100644
--- a/arch/powerpc/kernel/idle_book3s.S
+++ b/arch/powerpc/kernel/idle_book3s.S
@@ -40,9 +40,7 @@
 #define _WORC	GPR11
 #define _PTCR	GPR12
-#define PSSCR_HV_TEMPLATE
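The old-firmware fallback done by compute_psscr_val() and compute_psscr_mask() can be modelled outside the kernel. The following is only a rough Python sketch of that logic, not the kernel code. The PSSCR field constants are not part of the quoted hunk; they are written here as recalled from arch/powerpc/include/asm/reg.h and should be treated as assumptions.

```python
# Field layout of the PSSCR. These values are NOT in the quoted patch;
# they are assumed from arch/powerpc/include/asm/reg.h.
PSSCR_RL_MASK = 0x0000000F    # Requested Level
PSSCR_MTL_MASK = 0x000000F0   # Maximum Transition Level
PSSCR_TR_MASK = 0x00000300    # Transition Rate
PSSCR_PSLL_MASK = 0x000F0000  # Power-Saving Level Limit
PSSCR_EC = 0x00100000         # Exit Criterion
PSSCR_ESL = 0x00200000        # Enable State Loss

# Defaults as composed in the patch: ESL=EC=1, MTL/PSLL/TR at their maximum.
PSSCR_HV_DEFAULT_VAL = (PSSCR_ESL | PSSCR_EC | PSSCR_PSLL_MASK |
                        PSSCR_TR_MASK | PSSCR_MTL_MASK)
PSSCR_HV_DEFAULT_MASK = PSSCR_HV_DEFAULT_VAL | PSSCR_RL_MASK

# A mask of 0xf means the firmware filled in only the RL field.
OLD_FIRMWARE_MASK = 0xF


def compute_psscr_val(psscr_val, psscr_mask):
    """Return the PSSCR value to program for one stop state."""
    if psscr_mask == OLD_FIRMWARE_MASK:
        # Older firmware: keep RL, fill the other fields with defaults.
        return psscr_val | PSSCR_HV_DEFAULT_VAL
    return psscr_val


def compute_psscr_mask(psscr_mask):
    """Return the mask of PSSCR fields the kernel should write."""
    if psscr_mask == OLD_FIRMWARE_MASK:
        return PSSCR_HV_DEFAULT_MASK
    return psscr_mask
```

With these assumed constants, an old-firmware entry with RL=3 (value 0x3, mask 0xf) gains the default ESL/EC/PSLL/TR/MTL bits, while an entry with any other mask is passed through unchanged.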
[PATCH v4 4/4] Documentation:powerpc: Add device-tree bindings for power-mgt
From: "Gautham R. Shenoy"

Document the device-tree bindings defining the properties under the
@power-mgt node in the device tree that describe the idle states for
Linux running on baremetal POWER servers.

Signed-off-by: Gautham R. Shenoy
---
 .../devicetree/bindings/powerpc/opal/power-mgt.txt | 123 +
 1 file changed, 123 insertions(+)
 create mode 100644 Documentation/devicetree/bindings/powerpc/opal/power-mgt.txt

diff --git a/Documentation/devicetree/bindings/powerpc/opal/power-mgt.txt b/Documentation/devicetree/bindings/powerpc/opal/power-mgt.txt
new file mode 100644
index 0000000..002b59e
--- /dev/null
+++ b/Documentation/devicetree/bindings/powerpc/opal/power-mgt.txt
@@ -0,0 +1,123 @@
+IBM Power-Management Bindings
+=============================
+
+Linux running on baremetal POWER machines has access to the processor
+idle states. The description of these idle states is exposed via the
+node @power-mgt in the device-tree by the firmware.
+
+Definitions:
+
+Typically each idle state has the following associated properties:
+
+- name: The name of the idle state as defined by the firmware.
+
+- flags: indicating some aspects of this idle states such as the
+         extent of state-loss, whether timebase is stopped on this
+         idle states and so on. The flag bits are as follows:
+
+- exit-latency: The latency involved in transitioning the state of the
+		CPU from idle to running.
+
+- target-residency: The minimum time that the CPU needs to reside in
+		    this idle state in order to accrue power-savings
+		    benefit.
+
+Properties
+
+The following properties provide details about the idle states. These
+properties are optional unless mentioned otherwise below.
+
+- ibm,cpu-idle-state-names:
+	Array of strings containing the names of the idle states.
+
+- ibm,cpu-idle-state-flags:
+	Array of unsigned 32-bit values containing the values of the
+	flags associated with the aforementioned idle-states. This
+	property is required on POWER9 whenever
+	ibm,cpu-idle-state-names is defined and the length of this
+	property array should be the same as
+	ibm,cpu-idle-state-names. The flag bits are as follows:
+		0x00000001 /* Decrementer would stop */
+		0x00000002 /* Needs timebase restore */
+		0x00001000 /* Restore GPRs like nap */
+		0x00002000 /* Restore hypervisor resource from PACA pointer */
+		0x00004000 /* Program PORE to restore PACA pointer */
+		0x00010000 /* This is a nap state */
+		0x00020000 /* This is a fast-sleep state */
+		0x00040000 /* This is a winkle state */
+		0x00080000 /* This is a fast-sleep state which requires a */
+			   /* software workaround for restoring the timebase*/
+		0x00800000 /* This state uses SPR PMICR instruction */
+		0x00100000 /* This is a fast stop state */
+		0x00200000 /* This is a deep-stop state */
+
+- ibm,cpu-idle-state-latencies-ns:
+	Array of unsigned 32-bit values containing the values of the
+	exit-latencies (in ns) for the idle states in
+	ibm,cpu-idle-state-names. This property is required whenever
+	ibm,cpu-idle-state-names is defined and the length of this
+	property array should be the same as
+	ibm,cpu-idle-state-names.
+
+- ibm,cpu-idle-state-residency-ns:
+	Array of unsigned 32-bit values containing the values of the
+	target-residency (in ns) for the idle states in
+	ibm,cpu-idle-state-names. On POWER8 this is an optional
+	property. If the property is absent, the target residency for
+	the "Nap", "FastSleep" are defined to 1 and 3
+	respectively. On POWER9 this property must be defined if
+	ibm,cpu-idle-state-names is defined and the length should be
+	same as that of ibm,cpu-idle-state-names.
+
+- ibm,cpu-idle-state-psscr:
+	Array of unsigned 64-bit values containing the values for the
+	PSSCR for each of the idle states in ibm,cpu-idle-state-names.
+	This property is required on POWER9 whenever
+	ibm,cpu-idle-state-names is defined and the length of this
+	property array should be the same as
+	ibm,cpu-idle-state-names.
+
+- ibm,cpu-idle-state-psscr-mask:
+	Array of unsigned 64-bit values containing the masks
+	indicating which psscr fields are set in the corresponding
+	entries of ibm,cpu-idle-state-psscr. This property is
+	required on POWER9 whenever ibm,cpu-idle-state-names is
+	defined and the length of this property array should be the
+	same as ibm,cpu-idle-state-names.
+
+	Whenever the firmware sets an entry in
+	ibm,cpu-idle-state-psscr-mask value to 0xf, it implies that
+
[PATCH v4 2/4] cpuidle:powernv: Add helper function to populate powernv idle states.
From: "Gautham R. Shenoy"

In the current code for powernv_add_idle_states, there is a lot of code
duplication while initializing an idle state in powernv_states table.

Add an inline helper function to populate the powernv_states[] table
for a given idle state. Invoke this for populating the "Nap",
"Fastsleep" and the stop states in powernv_add_idle_states.

Signed-off-by: Gautham R. Shenoy
---
 drivers/cpuidle/cpuidle-powernv.c | 85 ++-
 include/linux/cpuidle.h           |  1 +
 2 files changed, 50 insertions(+), 36 deletions(-)

diff --git a/drivers/cpuidle/cpuidle-powernv.c b/drivers/cpuidle/cpuidle-powernv.c
index 7fe442c..db18af1 100644
--- a/drivers/cpuidle/cpuidle-powernv.c
+++ b/drivers/cpuidle/cpuidle-powernv.c
@@ -167,6 +167,24 @@ static int powernv_cpuidle_driver_init(void)
 	return 0;
 }
+static inline void add_powernv_state(int index, const char *name,
+				     unsigned int flags,
+				     int (*idle_fn)(struct cpuidle_device *,
+						    struct cpuidle_driver *,
+						    int),
+				     unsigned int target_residency,
+				     unsigned int exit_latency,
+				     u64 psscr_val)
+{
+	strlcpy(powernv_states[index].name, name, CPUIDLE_NAME_LEN);
+	strlcpy(powernv_states[index].desc, name, CPUIDLE_NAME_LEN);
+	powernv_states[index].flags = flags;
+	powernv_states[index].target_residency = target_residency;
+	powernv_states[index].exit_latency = exit_latency;
+	powernv_states[index].enter = idle_fn;
+	stop_psscr_table[index] = psscr_val;
+}
+
 static int powernv_add_idle_states(void)
 {
 	struct device_node *power_mgt;
@@ -236,6 +254,7 @@ static int powernv_add_idle_states(void)
 		"ibm,cpu-idle-state-residency-ns", residency_ns, dt_idle_states);
 	for (i = 0; i < dt_idle_states; i++) {
+		unsigned int exit_latency, target_residency;
 		/*
 		 * If an idle state has exit latency beyond
 		 * POWERNV_THRESHOLD_LATENCY_NS then don't use it
@@ -243,28 +262,33 @@ static int powernv_add_idle_states(void)
 		 */
 		if (latency_ns[i] > POWERNV_THRESHOLD_LATENCY_NS)
 			continue;
+		/*
+		 * Firmware passes residency and latency values in ns.
+		 * cpuidle expects it in us.
+		 */
+		exit_latency = ((unsigned int)latency_ns[i]) / 1000;
+		if (!rc)
+			target_residency = residency_ns[i] / 1000;
+		else
+			target_residency = 0;
 		/*
-		 * Cpuidle accepts exit_latency and target_residency in us.
-		 * Use default target_residency values if f/w does not expose it.
+		 * For nap and fastsleep, use default target_residency
+		 * values if f/w does not expose it.
 		 */
 		if (flags[i] & OPAL_PM_NAP_ENABLED) {
+			if (!rc)
+				target_residency = 100;
 			/* Add NAP state */
-			strcpy(powernv_states[nr_idle_states].name, "Nap");
-			strcpy(powernv_states[nr_idle_states].desc, "Nap");
-			powernv_states[nr_idle_states].flags = 0;
-			powernv_states[nr_idle_states].target_residency = 100;
-			powernv_states[nr_idle_states].enter = nap_loop;
+			add_powernv_state(nr_idle_states, "Nap",
+					  CPUIDLE_FLAG_NONE, nap_loop,
+					  target_residency, exit_latency, 0);
 		} else if ((flags[i] & OPAL_PM_STOP_INST_FAST) &&
 				!(flags[i] & OPAL_PM_TIMEBASE_STOP)) {
-			strncpy(powernv_states[nr_idle_states].name,
-				names[i], CPUIDLE_NAME_LEN);
-			strncpy(powernv_states[nr_idle_states].desc,
-				names[i], CPUIDLE_NAME_LEN);
-			powernv_states[nr_idle_states].flags = 0;
-
-			powernv_states[nr_idle_states].enter = stop_loop;
-			stop_psscr_table[nr_idle_states] = psscr_val[i];
+			add_powernv_state(nr_idle_states, names[i],
+					  CPUIDLE_FLAG_NONE, stop_loop,
+					  target_residency, exit_latency,
+					  psscr_val[i]);
 		}
 		/*
@@ -274,32 +298,21 @@ static int powernv_add_idle_states(void)
 #ifdef CONFIG_TICK_ONESHOT
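The unit handling described in the comments of this hunk (firmware reports nanoseconds, cpuidle wants microseconds, and a default target residency is used when the firmware does not expose one) can be illustrated with a small Python model. This is only a sketch of the intent stated in those comments; it does not mirror the hunk line by line, and the function and parameter names below are hypothetical.

```python
NS_PER_US = 1000

# Default Nap residency in microseconds, as used in the quoted hunk.
NAP_DEFAULT_RESIDENCY_US = 100


def to_cpuidle_units(latency_ns, residency_ns=None, default_residency_us=0):
    """Convert firmware idle-state timings (ns) to cpuidle units (us).

    residency_ns is None when the firmware did not expose the
    "ibm,cpu-idle-state-residency-ns" property; in that case the
    caller-supplied default (already in us) is used instead.
    """
    # Integer division matches the truncating "/ 1000" in the C code.
    exit_latency_us = latency_ns // NS_PER_US
    if residency_ns is not None:
        target_residency_us = residency_ns // NS_PER_US
    else:
        target_residency_us = default_residency_us
    return exit_latency_us, target_residency_us
```

For a Nap state the caller would pass default_residency_us=NAP_DEFAULT_RESIDENCY_US, so that a missing residency property falls back to 100 us, while a stop state with no property ends up with the generic default of 0.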
Re: [PATCH 3/3] powerpc: enable support for GCC plugins
On 9 Dec 2016 at 13:48, Andrew Donnellan wrote:

> >> as for the solutions, the general advice should enable the use of otherwise
> >> failing gcc versions instead of forcing updating to new ones (though the
> >> latter is advisable for other reasons but not everyone's in the position to
> >> do so easily). in my experience all one needs to do is manually install the
> >> missing files from the gcc sources (ideally distros would take care of it).
>
> If someone else is willing to write up that advice, then great.
>
> >> the specific problem addressed here can (and IMHO should) be solved in
> >> another way: remove the inclusion of the offending headers in gcc-common.h
> >> as neither tm.h nor c-common.h are needed by existing plugins. for
> >> background,
>
> We can't build without tm.h: http://pastebin.com/W0azfCr0

you'll need to repeat the removal of dependent headers. based on a quick
test here across gcc 4.5-6.2, if you remove rtl.h, tm_p.h, hard-reg-set.h
and emit-rtl.h in addition to tm.h, the plugins should build fine.

> And we get warnings without c-common.h: http://pastebin.com/Aw8CAj10

that's not due to c-common.h. gcc versions 4.5-4.6 are compiled as a C
program and gcc 4.7 can be compiled both as a C and a C++ program (IIRC,
distros opted for the latter, i forget what manually built versions default
to but i guess you went with the C compilation for your gcc anyway). couple
that with -Wmissing-prototypes and you get that warning regardless of
c-common.h being included. something like this should fix it:

--- a/scripts/gcc-plugins/gcc-generate-gimple-pass.h	2016-12-06 01:01:54.521724573 +0100
+++ b/scripts/gcc-plugins/gcc-generate-gimple-pass.h	2016-12-09 11:43:32.225226164 +0100
@@ -136,6 +136,7 @@
 	return new _PASS_NAME_PASS();
 }
 #else
+struct opt_pass *_MAKE_PASS_NAME_PASS(void);
 struct opt_pass *_MAKE_PASS_NAME_PASS(void)
 {
 	return &_PASS_NAME_PASS.pass;

> These were all manually built using a script running on a Debian box.
> Installing precompiled distro versions of rather old gccs would have
> been somewhat challenging. I've just rebuilt 4.6.4 to double check that
> I wasn't just seeing things, but it seems that it definitely is still
> putting c-common.h in the old location.

for reference, this is the git commit that did the move:

commit 7bedc3a05d34cd81e4835a2d3ff8c0ec7108eeb5
Author: steven
Date:   Sat Jun 5 20:33:22 2010 +0000

    gcc/ChangeLog:

            * c-common.c: Move to c-family/.
            * c-common.def: Likewise.
            * c-common.h: Likewise.
Re: 4.9.0-rc8 - rcutorture test failure
> But I am not seeing this as a failure. The last status print from the
> log you attached is as follows:
>
> 07:58:25 [ 2778.876118] rcu-torture: rtc: (null) ver: 24968 tfle: 0
> rta: 24968 rtaf: 0 rtf: 24959 rtmbe: 0 rtbe: 0 rtbke: 0 rtbre: 0 rtbf: 0 rtb:
> 0 nt: 10218404 onoff: 0/0:0/0 -1,0:-1,0 0:0 (HZ=250) barrier: 0/0:0 cbflood:
> 22703
> 07:58:25 [ 2778.876251] rcu-torture: Reader Pipe: 161849976604 399197 0 0 0
> 0 0 0 0 0 0
> 07:58:25 [ 2778.876438] rcu-torture: Reader Batch: 145090807711 16759538163
> 0 0 0 0 0 0 0 0 0
> 07:58:25 [ 2778.876625] rcu-torture: Free-Block Circulation: 24967 24967
> 24966 24965 24964 24963 24962 24961 24960 24959 0
> 07:58:25 [ 2778.876829] rcu-torture:--- End of test: SUCCESS: nreaders=79
> nfakewriters=4 stat_interval=60 verbose=1 test_no_idle_hz=1
> shuffle_interval=3 stutter=5 irqreader=1 fqs_duration=0 fqs_holdoff=0
> fqs_stutter=3 test_boost=1/0 test_boost_interval=7 test_boost_duration=4
> shutdown_secs=0 stall_cpu=0 stall_cpu_holdoff=10 n_barrier_cbs=0
> onoff_interval=0 onoff_holdoff=0
>
> The "SUCCESS" indicates that rcutorture thought that it succeeded.
> Also, in the "Reader Pipe" and "Reader Batch" lines, only the first two
> numbers in the series at the end of each line are non-zero, which also
> indicates a non-broken RCU.
>
> So could you please let me know what your scripting didn't like about
> this log?
>

The test case has the following piece of code, which prints the failure
message during result analysis.

        Checks for known bugs
        """
        utils.system('dmesg -c > /dev/null')
        pipe1 = [r for r in self.results if "!!! Reader Pipe:" in r]
        if len(pipe1) != 0:
            raise error.TestError('\nBUG: grace-period failure !')
            sys.exit(0)

        pipe2 = [r for r in self.results if "Reader Pipe" in r]
        for p in pipe2:
            nmiss = p.split(" ")[7]
            if int(nmiss):
                raise error.TestError('\nBUG: rcutorture tests failed !')
                sys.exit(0)

I will double check on this.

Thanks
-Sachin
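One possible way to make the "Reader Pipe" check independent of whatever prefix precedes the message is sketched below. It applies the criterion from the quoted reply (only the first two counters may be non-zero) instead of picking a fixed field position. This is a minimal illustration, not the test suite's actual code: the helper names are hypothetical, and whether the lines in self.results carry the "07:58:25 [ ... ]" prefix seen in the quoted log is not known from this thread.

```python
import re


def reader_pipe_counts(line):
    """Return the integer counters that follow 'Reader Pipe:' in a log line."""
    match = re.search(r"Reader Pipe:((?:\s+\d+)+)", line)
    if match is None:
        raise ValueError("not a Reader Pipe line: %r" % line)
    return [int(tok) for tok in match.group(1).split()]


def pipe_looks_healthy(line):
    """True when every counter after the first two is zero."""
    counts = reader_pipe_counts(line)
    return all(count == 0 for count in counts[2:])
```

For comparison, splitting the Reader Pipe line exactly as it appears in the quoted log on single spaces puts the second counter (399197) at index 7, which is non-zero even though the run reported SUCCESS. A positional index therefore depends on how many prefix fields the line carries, whereas the regular expression anchors on the "Reader Pipe:" label itself.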