Re: [PATCH] powerpc/64: pseudo-NMI/SMP watchdog

2016-12-09 Thread Nicholas Piggin
On Sat, 10 Dec 2016 16:22:13 +1100
Balbir Singh wrote:

> On 10/12/16 02:52, Nicholas Piggin wrote:
> > Rather than use perf / PMU interrupts and the generic hardlockup
> > detector, this takes the decrementer interrupt as an "NMI" when
> > interrupts are soft disabled (XXX: will this do the right thing with a
> > large decrementer?).  This will work even if we start soft-disabling PMU
> > interrupts.
> > 
> > This does not solve the hardlockup problem completely however, because
> > interrupts can often become hard disabled when soft disabled for long
> > periods. And they can be hard disabled for other reasons.
> >   
> 
> Ben/Paul suggested a way to work around this with XICS. The idea was to
> have MSR_EE set and use XICS to stash away the current
> interrupt and acknowledge it/replay it later. Decrementer interrupts would
> not trigger timers, but trigger a special NMI watchdog, like you've
> implemented.

Yeah, that's a good idea; it should significantly reduce hard interrupt
disable windows.

> > @@ -718,6 +719,8 @@ static __init void kvm_free_tmp(void)
> >  
> >  static int __init kvm_guest_init(void)
> >  {
> > +   /* XXX: disable hardlockup watchdog? */
> > +  
> 
> You mean the hypervisor watchdog? Did your testing
> catch anything here?

I meant guest. Testing didn't catch anything but I put it there to
investigate because I saw x86 does hardlockup_detector_disable() in
their guest init.

> > +static void nmi_timer_fn(unsigned long data)
> > +{
> > +   struct timer_list *t = this_cpu_ptr(_timer);
> > +   int cpu = smp_processor_id();
> > +
> > +   watchdog_timer_interrupt(cpu);
> > +
> > +   t->expires = round_jiffies(jiffies + nmi_timer_period * HZ);
> > +   add_timer_on(t, cpu);
> > +}  
> 
> Do we have to have this running all the time? Can we do an on-demand
> version of NMI where we do periodic decrementers without any reliance
> on timers to implement the NMI watchdog?

We could, but it is trivial to do this and get all the timer and
dynticks stuff taken care of for us. We could bump the period up
to 30s or so and it should hardly be an issue.

I didn't want to try getting too clever, there are times when you
could shut it off, but then you still lose some lockup coverage.

But... I'm open to suggestions. I don't know the timer code well.

> > +static int nmi_cpu_notify(struct notifier_block *self,
> > +unsigned long action, void *hcpu)
> > +{
> > +   int cpu = (unsigned long)hcpu;
> > +
> > +   switch (action & ~CPU_TASKS_FROZEN) {
> > +   case CPU_ONLINE:
> > +   case CPU_DOWN_FAILED:
> > +   start_nmi_on_cpu(cpu);
> > +   pr_info("NMI Watchdog running on cpus %*pbl\n",
> > +   cpumask_pr_args(_cpus_enabled));
> > +   break;
> > +   case CPU_DOWN_PREPARE:
> > +   stop_nmi_on_cpu(cpu);
> > +   pr_info("NMI Watchdog running on cpus %*pbl\n",
> > +   cpumask_pr_args(_cpus_enabled));
> > +   break;
> > +   }
> > +   return NOTIFY_OK;
> > +}  
> 
> FYI: These bits are changing in linux-next

Yeah I'll have to update them.

> > diff --git a/init/main.c b/init/main.c
> > index 2858be7..36fd7e7 100644
> > --- a/init/main.c
> > +++ b/init/main.c
> > @@ -33,6 +33,7 @@
> >  #include 
> >  #include 
> >  #include 
> > +#include 
> >  #include 
> >  #include 
> >  #include 
> > @@ -579,6 +580,8 @@ asmlinkage __visible void __init start_kernel(void)
> >  
> > kmem_cache_init_late();
> >  
> > +   nmi_init();  
> 
> How did you test these?

I just tried putting a soft/hard irq disable in a few places and spinning
forever. The soft disable case was getting caught by the local NMI; the
hard disable case gets caught by the SMP check.

When we also get the NMI IPI crash debug stuff, we should be able to get
reasonable crash data with hard disabled hangs.

Thanks,
Nick
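
[Editor's sketch] The SMP check mentioned above can be pictured with a small Python model, in which every CPU records a heartbeat and any other CPU may notice that one has gone stale. This is only an illustration of the idea and not the code in the patch; the class name, method names and the 10 second threshold are all hypothetical.

```python
# Toy model of an SMP hard-lockup check: every CPU records a heartbeat
# timestamp, and any CPU may scan the others for stale heartbeats.
# All names and the threshold are hypothetical, not taken from the patch.

PANIC_TIMEOUT = 10.0  # seconds without a heartbeat before a CPU is suspect


class SmpWatchdog:
    def __init__(self, ncpus, now=0.0):
        # Last time each CPU reported that it was still making progress.
        self.last_heartbeat = {cpu: now for cpu in range(ncpus)}

    def heartbeat(self, cpu, now):
        """Called from the periodic timer on a healthy CPU."""
        self.last_heartbeat[cpu] = now

    def stuck_cpus(self, observer, now):
        """Return the CPUs that `observer` considers locked up."""
        return sorted(
            cpu
            for cpu, seen in self.last_heartbeat.items()
            if cpu != observer and now - seen > PANIC_TIMEOUT
        )


wd = SmpWatchdog(ncpus=4)
for t in (5.0, 10.0, 15.0):
    for cpu in (0, 1, 3):      # CPU 2 spins with interrupts hard disabled
        wd.heartbeat(cpu, t)

print(wd.stuck_cpus(observer=0, now=15.0))
```

In this model a CPU that spins with interrupts hard disabled never refreshes its own heartbeat, so it can only be detected by an observer on another CPU, which is the case the local soft-NMI path cannot cover.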


Re: [PATCH v3 00/15] livepatch: hybrid consistency model

2016-12-09 Thread Balbir Singh
On Thu, 2016-12-08 at 12:08 -0600, Josh Poimboeuf wrote:
> Dusting the cobwebs off the consistency model again.  This is based on
> linux-next/master.
> 
> v1 was posted on 2015-02-09:
> 
>   https://lkml.kernel.org/r/cover.1423499826.git.jpoim...@redhat.com
> 
> v2 was posted on 2016-04-28:
> 
>   https://lkml.kernel.org/r/cover.1461875890.git.jpoim...@redhat.com
> 
> The biggest issue from v2 was finding a decent way to detect preemption
> and page faults on the stack of a sleeping task.  

Could you please elaborate on this? Preemption of a sleeping task and
faults as in the future (time) preemption and faults?

Balbir Singh.



Re: [PATCH] powerpc/64: pseudo-NMI/SMP watchdog

2016-12-09 Thread Balbir Singh


On 10/12/16 02:52, Nicholas Piggin wrote:
> Rather than use perf / PMU interrupts and the generic hardlockup
> detector, this takes the decrementer interrupt as an "NMI" when
> interrupts are soft disabled (XXX: will this do the right thing with a
> large decrementer?).  This will work even if we start soft-disabling PMU
> interrupts.
> 
> This does not solve the hardlockup problem completely however, because
> interrupts can often become hard disabled when soft disabled for long
> periods. And they can be hard disabled for other reasons.
> 

Ben/Paul suggested a way to work around this with XICS. The idea was to
have MSR_EE set and use XICS to stash away the current
interrupt and acknowledge it/replay it later. Decrementer interrupts would
not trigger timers, but trigger a special NMI watchdog, like you've
implemented.

> To make up for the lack of a periodic true NMI, this also has an SMP
> hard lockup detector where all CPUs can observe lockups on others.
> 
> This still needs a bit more polishing, testing, comments, config
> options, and boot parameters, etc., so it's RFC quality only.
> 
> Thanks,
> Nick
> ---
>  arch/powerpc/Kconfig |   2 +
>  arch/powerpc/include/asm/nmi.h   |   5 +
>  arch/powerpc/kernel/Makefile |   1 +
>  arch/powerpc/kernel/exceptions-64s.S |  14 +-
>  arch/powerpc/kernel/kvm.c|   3 +
>  arch/powerpc/kernel/nmi.c| 288 +++
>  arch/powerpc/kernel/setup_64.c   |  18 ---
>  arch/powerpc/kernel/time.c   |   2 +
>  arch/sparc/kernel/nmi.c  |   2 +-
>  include/linux/nmi.h  |  14 ++
>  init/main.c  |   3 +
>  kernel/watchdog.c|  16 +-
>  12 files changed, 341 insertions(+), 27 deletions(-)
>  create mode 100644 arch/powerpc/kernel/nmi.c
> 
> diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
> index 65fba4c..adb3387 100644
> --- a/arch/powerpc/Kconfig
> +++ b/arch/powerpc/Kconfig
> @@ -124,6 +124,8 @@ config PPC
>   select HAVE_CBPF_JIT if !PPC64
>   select HAVE_EBPF_JIT if PPC64
>   select HAVE_ARCH_JUMP_LABEL
> + select HAVE_NMI
> + select HAVE_NMI_WATCHDOG if PPC64
>   select ARCH_HAVE_NMI_SAFE_CMPXCHG
>   select ARCH_HAS_GCOV_PROFILE_ALL
>   select GENERIC_SMP_IDLE_THREAD
> diff --git a/arch/powerpc/include/asm/nmi.h b/arch/powerpc/include/asm/nmi.h
> index ff1ccb3..d00e29b 100644
> --- a/arch/powerpc/include/asm/nmi.h
> +++ b/arch/powerpc/include/asm/nmi.h
> @@ -1,4 +1,9 @@
>  #ifndef _ASM_NMI_H
>  #define _ASM_NMI_H
>  
> +#define arch_nmi_init powerpc_nmi_init
> +void __init powerpc_nmi_init(void);
> +void touch_nmi_watchdog(void);
> +void soft_nmi_interrupt(struct pt_regs *regs);
> +
>  #endif /* _ASM_NMI_H */
> diff --git a/arch/powerpc/kernel/Makefile b/arch/powerpc/kernel/Makefile
> index 1925341..77f199f 100644
> --- a/arch/powerpc/kernel/Makefile
> +++ b/arch/powerpc/kernel/Makefile
> @@ -42,6 +42,7 @@ obj-$(CONFIG_PPC64) += setup_64.o sys_ppc32.o \
>  signal_64.o ptrace32.o \
>  paca.o nvram_64.o firmware.o
>  obj-$(CONFIG_VDSO32) += vdso32/
> +obj-$(CONFIG_HAVE_NMI_WATCHDOG)  += nmi.o
>  obj-$(CONFIG_HAVE_HW_BREAKPOINT) += hw_breakpoint.o
>  obj-$(CONFIG_PPC_BOOK3S_64)  += cpu_setup_ppc970.o cpu_setup_pa6t.o
>  obj-$(CONFIG_PPC_BOOK3S_64)  += cpu_setup_power.o
> diff --git a/arch/powerpc/kernel/exceptions-64s.S b/arch/powerpc/kernel/exceptions-64s.S
> index 1ba82ea..b159d02 100644
> --- a/arch/powerpc/kernel/exceptions-64s.S
> +++ b/arch/powerpc/kernel/exceptions-64s.S
> @@ -1295,7 +1295,7 @@ masked_##_H##interrupt: \
>   lis r10,0x7fff; \
>   ori r10,r10,0xffff; \
>   mtspr   SPRN_DEC,r10;   \
> - b   2f; \
> + b   masked_decrementer_##_H##interrupt; \
>  1:   cmpwi   r10,PACA_IRQ_DBELL; \
>   beq 2f; \
>   cmpwi   r10,PACA_IRQ_HMI;   \
> @@ -1312,6 +1312,16 @@ masked_##_H##interrupt: \
>   ##_H##rfid; \
>   b   .
>  
> +#define MASKED_NMI(_H)   \
> +masked_decrementer_##_H##interrupt:  \
> + std r12,PACA_EXGEN+EX_R12(r13); \
> + GET_SCRATCH0(r10);  \
> + std r10,PACA_EXGEN+EX_R13(r13); \
> + EXCEPTION_PROLOG_PSERIES_1(soft_nmi_common, _H)
> +
> +EXC_COMMON(soft_nmi_common, 0x900, soft_nmi_interrupt)
> +
> +
>  /*
>   * Real mode exceptions actually use this too, but alternate
>   * instruction code patches (which end up in the common .text area)
> @@ -1319,7 +1329,9 @@ 

Re: [PATCH] ibmvscsi: add write memory barrier to CRQ processing

2016-12-09 Thread Benjamin Herrenschmidt
On Wed, 2016-12-07 at 17:31 -0600, Tyrel Datwyler wrote:
> The first byte of each CRQ entry is used to indicate whether an entry is
> a valid response or free for the VIOS to use. After processing a
> response the driver sets the valid byte to zero to indicate the entry is
> now free to be reused. Add a memory barrier after this write to ensure
> no other stores are reordered when updating the valid byte.

Which "other stores" specifically ? This smells fishy without that
precision. It's important to always understand what exactly barriers
order with.

Cheers,
Ben.

> Signed-off-by: Tyrel Datwyler 
> ---
>  drivers/scsi/ibmvscsi/ibmvscsi.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/drivers/scsi/ibmvscsi/ibmvscsi.c 
> b/drivers/scsi/ibmvscsi/ibmvscsi.c
> index d9534ee..2f5b07e 100644
> --- a/drivers/scsi/ibmvscsi/ibmvscsi.c
> +++ b/drivers/scsi/ibmvscsi/ibmvscsi.c
> @@ -232,6 +232,7 @@ static void ibmvscsi_task(void *data)
> >     while ((crq = crq_queue_next_crq(&hostdata->queue)) != NULL) {
> >     ibmvscsi_handle_crq(crq, hostdata);
> >     crq->valid = VIOSRP_CRQ_FREE;
> > +   wmb();
> >     }
>  
> >     vio_enable_interrupts(vdev);
> @@ -240,6 +241,7 @@ static void ibmvscsi_task(void *data)
> >     vio_disable_interrupts(vdev);
> >     ibmvscsi_handle_crq(crq, hostdata);
> >     crq->valid = VIOSRP_CRQ_FREE;
> > +   wmb();
> >     } else {
> >     done = 1;
> >     }



Re: [PATCH] powerpc/64: pseudo-NMI/SMP watchdog

2016-12-09 Thread Benjamin Herrenschmidt
On Sat, 2016-12-10 at 01:52 +1000, Nicholas Piggin wrote:
> This does not solve the hardlockup problem completely however,
> because
> interrupts can often become hard disabled when soft disabled for long
> periods. And they can be hard disabled for other reasons.
> 
> To make up for the lack of a periodic true NMI, this also has an SMP
> hard lockup detector where all CPUs can observe lockups on others.
> 
> This still needs a bit more polishing, testing, comments, config
> options, and boot parameters, etc., so it's RFC quality only.

Paulus and I discussed a plan with Balbir to also limit the cases of
hard-disable.

They typically happen as a result of an external interrupt. On P8 and
earlier, we could just fetch the interrupt from the XICS in the
"masked" path and stash it in the PACA. We already have a way to
stash an interrupt there for later processing because KVM sometimes
does it.

That would cause the XICS to elevate the priority, effectively masking
subsequent interrupts. We'd have to change the XICS code to use the
same priority for IPIs and externals too, though.

For XIVE (P9), we can just poke at the CPU priority register in the TM
area to mask at the PIC level in that case and unmask later.

Cheers,
Ben.
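
[Editor's sketch] A toy Python model may help picture the XICS idea above. It is not real XICS or kernel code: the class and function names are hypothetical, and the priority values 4 (IPI) and 5 (external) are assumptions used only to show why the two priorities would have to match.

```python
# Toy model of the idea above; not real XICS code. In XICS, a numerically
# lower priority is more favoured, and accepting an interrupt raises the
# CPU's current priority (CPPR) to that of the accepted interrupt.
# The priority values 4 (IPI) and 5 (external) are assumptions.

LEAST_FAVOURED = 0xFF


class ToyXics:
    def __init__(self):
        self.cppr = LEAST_FAVOURED   # everything may be delivered
        self.pending = []            # list of (priority, source)

    def raise_irq(self, priority, source):
        self.pending.append((priority, source))

    def deliverable(self):
        """Sources that would still interrupt the CPU right now."""
        return [src for prio, src in self.pending if prio < self.cppr]

    def fetch(self):
        """Accept the most favoured pending interrupt, as the masked path would."""
        prio, src = min(self.pending)
        self.pending.remove((prio, src))
        self.cppr = prio             # masks equal and less favoured sources
        return prio, src

    def eoi(self):
        """End of interrupt: drop the priority back so sources are unmasked."""
        self.cppr = LEAST_FAVOURED


class ToyPaca:
    stashed_irq = None               # interrupt saved for later replay


def masked_external(xics, paca):
    """Soft-disabled path: stash the interrupt instead of hard-disabling."""
    paca.stashed_irq = xics.fetch()


xics, paca = ToyXics(), ToyPaca()
xics.raise_irq(5, "eth0")
masked_external(xics, paca)
xics.raise_irq(5, "disk")            # same priority: now masked at the PIC
xics.raise_irq(4, "ipi")             # more favoured: still gets through
print(paca.stashed_irq, xics.deliverable())
```

With distinct priorities the more favoured IPI is still deliverable after the external interrupt has been stashed, which is presumably why the same priority would be needed for both.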



Re: [PATCH v2] of/irq: improve error report on irq discovery process failure

2016-12-09 Thread Guilherme G. Piccoli
On 12/09/2016 02:25 PM, Rob Herring wrote:
> On Mon, Dec 5, 2016 at 1:01 PM, Guilherme G. Piccoli
>  wrote:
>> On 12/05/2016 12:28 PM, Rob Herring wrote:
>>> On Mon, Dec 5, 2016 at 7:59 AM, Guilherme G. Piccoli
>>>  wrote:
>>>> On PowerPC machines some PCI slots might not have level triggered
>>>> interrupts capability (also known as level signaled interrupts),
>>>> leading of_irq_parse_pci() to complain by presenting error messages
>>>> on the kernel log - in this case, the properties "interrupt-map" and
>>>> "interrupt-map-mask" are not present on device's node in the device
>>>> tree.
>>>>
>>>> This patch introduces a different message for this specific case,
>>>> and also reduces its level from error to warning. Besides, we warn
>>>> (once) that possibly some PCI slots on the system have no level
>>>> triggered interrupts available.
>>>> We changed some error return codes too on function of_irq_parse_raw()
>>>> so that other failure cases can be presented in a more precise way.
>>>>
>>>> Before this patch, when an adapter was plugged in a slot without level
>>>> interrupts capability on PowerPC, we saw a generic error message
>>>> like this:
>>>>
>>>> [54.239] pci 002d:70:00.0: of_irq_parse_pci() failed with rc=-22
>>>>
>>>> Now, with this applied, we see the following specific message:
>>>>
>>>> [16.154] pci 0014:60:00.1: of_irq_parse_pci: no interrupt-map found,
>>>> INTx interrupts not available
>>>>
>>>> Finally, we standardize the error path in of_irq_parse_raw() by always
>>>> taking the fail path instead of returning directly from the loop.
>>>>
>>>> Signed-off-by: Guilherme G. Piccoli
>>>> ---
>>>>
>>>> v2:
>>>>   * Changed function return code to always return negative values;
>>>
>>> Are you sure this is safe? This is tricky because of differing values
>>> of NO_IRQ (0 or -1).
>>
>> Thanks Rob, but this is purely bad wording from myself. I'm sorry - I
>> meant to say that I changed only my positive return code (that was
>> suggested to be removed in the prior revision) to negative return code!
>>
>> So, I changed only code I added myself in v1 =)
>>
>>
>>>
>>>>   * Improved/simplified warning outputs;
>>>>   * Changed some return codes and some error paths in of_irq_parse_raw()
>>>> in order to be more precise/consistent;
>>>
>>> This too could have some side effects on callers.
>>>
>>> Not saying don't do these changes, just need some assurances this has
>>> been considered.
>>
>> Thanks for your attention. I performed a quick investigation before
>> changing this, all the places that use the return values are just
>> getting "true/false" information from that, meaning they just are
>> comparing to 0 basically. So changing -EINVAL to -ENOENT wouldn't hurt any
>> user of these return values, it'll only become more informative IMHO.
>>
>> Now, regarding the only error path that was changed: for some reason,
>> this was the only place in which we didn't goto fail label in case of
>> failure - it was added by a legacy commit from Ben, dated from 2006:
>> 006b64de60 ("[POWERPC] Make OF irq map code detect more error cases").
>> Then it was carried by Grant Likely's commit 7dc2e1134a ("of/irq: merge
>> irq mapping code"), 6-year old commit.
>> I wasn't able to imagine a scenario in which changing this would break
>> something; I believe the change improves consistency, but I'd remove it
>> if you or somebody else thinks it is worth removing.
> 
> Okay. It's a bit late for 4.10 now and I want this to be in -next for a
> while, so I'll queue it after the merge window.
> 

OK, perfect! Thanks Rob
Cheers,


Guilherme

> Rob
> 



Re: [PATCH v2] cxl: prevent read/write to AFU config space while AFU not configured

2016-12-09 Thread Frederic Barrat




diff --git a/drivers/misc/cxl/vphb.c b/drivers/misc/cxl/vphb.c
index 3519ace..639a343 100644
--- a/drivers/misc/cxl/vphb.c
+++ b/drivers/misc/cxl/vphb.c
@@ -76,23 +76,22 @@ static int cxl_pcie_cfg_record(u8 bus, u8 devfn)
return (bus << 8) + devfn;
 }

-static int cxl_pcie_config_info(struct pci_bus *bus, unsigned int devfn,
-   struct cxl_afu **_afu, int *_record)
+static inline struct cxl_afu *pci_bus_to_afu(struct pci_bus *bus)
 {
-   struct pci_controller *phb;
-   struct cxl_afu *afu;
-   int record;
+   struct pci_controller *phb = bus ? pci_bus_to_host(bus) : NULL;

-   phb = pci_bus_to_host(bus);
-   if (phb == NULL)
-   return PCIBIOS_DEVICE_NOT_FOUND;
+   return phb ? phb->private_data : NULL;
+}
+
+static inline int cxl_pcie_config_info(struct pci_bus *bus, unsigned int devfn,
+  struct cxl_afu *afu, int *_record)
+{
+   int record;

-   afu = (struct cxl_afu *)phb->private_data;
record = cxl_pcie_cfg_record(bus->number, devfn);
if (record > afu->crs_num)
return PCIBIOS_DEVICE_NOT_FOUND;

-   *_afu = afu;
*_record = record;
return 0;
 }



There's no reason to pass the afu parameter to that function, is there?
Pushing it further, do we need cxl_pcie_config_info()? It's now a simple
wrapper around cxl_pcie_cfg_record().


  Fred
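
[Editor's sketch] To make the point above concrete, here is a small Python rendering of the two helpers from the quoted diff. The names mirror the C code, but the Python signatures (passing crs_num directly and returning a tuple) are hypothetical; the PCIBIOS values are the usual ones from the kernel's PCI header.

```python
# Sketch of the record lookup discussed above; names mirror the quoted C
# code, but this is only an illustration of the arithmetic.

PCIBIOS_SUCCESSFUL = 0x00
PCIBIOS_DEVICE_NOT_FOUND = 0x86


def cxl_pcie_cfg_record(bus, devfn):
    """Combine an 8-bit bus number and an 8-bit devfn into a record index."""
    return ((bus & 0xFF) << 8) + (devfn & 0xFF)


def cxl_pcie_config_info(bus, devfn, crs_num):
    """Return (status, record); mirrors the bound check in the quoted diff."""
    record = cxl_pcie_cfg_record(bus, devfn)
    if record > crs_num:
        return PCIBIOS_DEVICE_NOT_FOUND, None
    return PCIBIOS_SUCCESSFUL, record


print(cxl_pcie_cfg_record(1, 2))
print(cxl_pcie_config_info(0, 3, crs_num=2))
```

Once the afu lookup is moved out, the only thing the wrapper adds over cxl_pcie_cfg_record() is the comparison against crs_num, so folding that check into the callers is a reasonable alternative.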



Re: [PATCH v2] of/irq: improve error report on irq discovery process failure

2016-12-09 Thread Rob Herring
On Mon, Dec 5, 2016 at 1:01 PM, Guilherme G. Piccoli
 wrote:
> On 12/05/2016 12:28 PM, Rob Herring wrote:
>> On Mon, Dec 5, 2016 at 7:59 AM, Guilherme G. Piccoli
>>  wrote:
>>> On PowerPC machines some PCI slots might not have level triggered
>>> interrupts capability (also known as level signaled interrupts),
>>> leading of_irq_parse_pci() to complain by presenting error messages
>>> on the kernel log - in this case, the properties "interrupt-map" and
>>> "interrupt-map-mask" are not present on device's node in the device
>>> tree.
>>>
>>> This patch introduces a different message for this specific case,
>>> and also reduces its level from error to warning. Besides, we warn
>>> (once) that possibly some PCI slots on the system have no level
>>> triggered interrupts available.
>>> We changed some error return codes too on function of_irq_parse_raw()
>>> so that other failure cases can be presented in a more precise way.
>>>
>>> Before this patch, when an adapter was plugged in a slot without level
>>> interrupts capability on PowerPC, we saw a generic error message
>>> like this:
>>>
>>> [54.239] pci 002d:70:00.0: of_irq_parse_pci() failed with rc=-22
>>>
>>> Now, with this applied, we see the following specific message:
>>>
>>> [16.154] pci 0014:60:00.1: of_irq_parse_pci: no interrupt-map found,
>>> INTx interrupts not available
>>>
>>> Finally, we standardize the error path in of_irq_parse_raw() by always
>>> taking the fail path instead of returning directly from the loop.
>>>
>>> Signed-off-by: Guilherme G. Piccoli 
>>> ---
>>>
>>> v2:
>>>   * Changed function return code to always return negative values;
>>
>> Are you sure this is safe? This is tricky because of differing values
>> of NO_IRQ (0 or -1).
>
> Thanks Rob, but this is purely bad wording from myself. I'm sorry - I
> meant to say that I changed only my positive return code (that was
> suggested to be removed in the prior revision) to negative return code!
>
> So, I changed only code I added myself in v1 =)
>
>
>>
>>>   * Improved/simplified warning outputs;
>>>   * Changed some return codes and some error paths in of_irq_parse_raw()
>>> in order to be more precise/consistent;
>>
>> This too could have some side effects on callers.
>>
>> Not saying don't do these changes, just need some assurances this has
>> been considered.
>
> Thanks for your attention. I performed a quick investigation before
> changing this, all the places that use the return values are just
> getting "true/false" information from that, meaning they just are
> comparing to 0 basically. So changing -EINVAL to -ENOENT wouldn't hurt any
> user of these return values, it'll only become more informative IMHO.
>
> Now, regarding the only error path that was changed: for some reason,
> this was the only place in which we didn't goto fail label in case of
> failure - it was added by a legacy commit from Ben, dated from 2006:
> 006b64de60 ("[POWERPC] Make OF irq map code detect more error cases").
> Then it was carried by Grant Likely's commit 7dc2e1134a ("of/irq: merge
> irq mapping code"), 6-year old commit.
> I wasn't able to imagine a scenario in which changing this would break
> something; I believe the change improves consistency, but I'd remove it
> if you or somebody else thinks it is worth removing.

Okay. It's a bit late for 4.10 now and I want this to be in -next for a
while, so I'll queue it after the merge window.

Rob


[PATCH] powerpc/64: pseudo-NMI/SMP watchdog

2016-12-09 Thread Nicholas Piggin
Rather than use perf / PMU interrupts and the generic hardlockup
detector, this takes the decrementer interrupt as an "NMI" when
interrupts are soft disabled (XXX: will this do the right thing with a
large decrementer?).  This will work even if we start soft-disabling PMU
interrupts.

This does not solve the hardlockup problem completely however, because
interrupts can often become hard disabled when soft disabled for long
periods. And they can be hard disabled for other reasons.

To make up for the lack of a periodic true NMI, this also has an SMP
hard lockup detector where all CPUs can observe lockups on others.

This still needs a bit more polishing, testing, comments, config
options, and boot parameters, etc., so it's RFC quality only.

Thanks,
Nick
---
 arch/powerpc/Kconfig |   2 +
 arch/powerpc/include/asm/nmi.h   |   5 +
 arch/powerpc/kernel/Makefile |   1 +
 arch/powerpc/kernel/exceptions-64s.S |  14 +-
 arch/powerpc/kernel/kvm.c|   3 +
 arch/powerpc/kernel/nmi.c| 288 +++
 arch/powerpc/kernel/setup_64.c   |  18 ---
 arch/powerpc/kernel/time.c   |   2 +
 arch/sparc/kernel/nmi.c  |   2 +-
 include/linux/nmi.h  |  14 ++
 init/main.c  |   3 +
 kernel/watchdog.c|  16 +-
 12 files changed, 341 insertions(+), 27 deletions(-)
 create mode 100644 arch/powerpc/kernel/nmi.c

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 65fba4c..adb3387 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -124,6 +124,8 @@ config PPC
select HAVE_CBPF_JIT if !PPC64
select HAVE_EBPF_JIT if PPC64
select HAVE_ARCH_JUMP_LABEL
+   select HAVE_NMI
+   select HAVE_NMI_WATCHDOG if PPC64
select ARCH_HAVE_NMI_SAFE_CMPXCHG
select ARCH_HAS_GCOV_PROFILE_ALL
select GENERIC_SMP_IDLE_THREAD
diff --git a/arch/powerpc/include/asm/nmi.h b/arch/powerpc/include/asm/nmi.h
index ff1ccb3..d00e29b 100644
--- a/arch/powerpc/include/asm/nmi.h
+++ b/arch/powerpc/include/asm/nmi.h
@@ -1,4 +1,9 @@
 #ifndef _ASM_NMI_H
 #define _ASM_NMI_H
 
+#define arch_nmi_init powerpc_nmi_init
+void __init powerpc_nmi_init(void);
+void touch_nmi_watchdog(void);
+void soft_nmi_interrupt(struct pt_regs *regs);
+
 #endif /* _ASM_NMI_H */
diff --git a/arch/powerpc/kernel/Makefile b/arch/powerpc/kernel/Makefile
index 1925341..77f199f 100644
--- a/arch/powerpc/kernel/Makefile
+++ b/arch/powerpc/kernel/Makefile
@@ -42,6 +42,7 @@ obj-$(CONFIG_PPC64)   += setup_64.o sys_ppc32.o \
   signal_64.o ptrace32.o \
   paca.o nvram_64.o firmware.o
 obj-$(CONFIG_VDSO32)   += vdso32/
+obj-$(CONFIG_HAVE_NMI_WATCHDOG)+= nmi.o
 obj-$(CONFIG_HAVE_HW_BREAKPOINT)   += hw_breakpoint.o
 obj-$(CONFIG_PPC_BOOK3S_64)+= cpu_setup_ppc970.o cpu_setup_pa6t.o
 obj-$(CONFIG_PPC_BOOK3S_64)+= cpu_setup_power.o
diff --git a/arch/powerpc/kernel/exceptions-64s.S b/arch/powerpc/kernel/exceptions-64s.S
index 1ba82ea..b159d02 100644
--- a/arch/powerpc/kernel/exceptions-64s.S
+++ b/arch/powerpc/kernel/exceptions-64s.S
@@ -1295,7 +1295,7 @@ masked_##_H##interrupt: \
lis r10,0x7fff; \
ori r10,r10,0xffff; \
mtspr   SPRN_DEC,r10;   \
-   b   2f; \
+   b   masked_decrementer_##_H##interrupt; \
 1: cmpwi   r10,PACA_IRQ_DBELL; \
beq 2f; \
cmpwi   r10,PACA_IRQ_HMI;   \
@@ -1312,6 +1312,16 @@ masked_##_H##interrupt: \
##_H##rfid; \
b   .
 
+#define MASKED_NMI(_H) \
+masked_decrementer_##_H##interrupt:\
+   std r12,PACA_EXGEN+EX_R12(r13); \
+   GET_SCRATCH0(r10);  \
+   std r10,PACA_EXGEN+EX_R13(r13); \
+   EXCEPTION_PROLOG_PSERIES_1(soft_nmi_common, _H)
+
+EXC_COMMON(soft_nmi_common, 0x900, soft_nmi_interrupt)
+
+
 /*
  * Real mode exceptions actually use this too, but alternate
  * instruction code patches (which end up in the common .text area)
@@ -1319,7 +1329,9 @@ masked_##_H##interrupt: \
  */
 USE_FIXED_SECTION(virt_trampolines)
MASKED_INTERRUPT()
+   MASKED_NMI()
MASKED_INTERRUPT(H)
+   MASKED_NMI(H)
 
 #ifdef CONFIG_KVM_BOOK3S_64_HANDLER
 TRAMP_REAL_BEGIN(kvmppc_skip_interrupt)
diff --git a/arch/powerpc/kernel/kvm.c b/arch/powerpc/kernel/kvm.c
index 9ad37f8..f0d215c 100644
--- a/arch/powerpc/kernel/kvm.c
+++ b/arch/powerpc/kernel/kvm.c
@@ -25,6 +25,7 @@
 #include 
 #include 
 #include 
+#include 
 
 

Re: [PATCH kernel 9/9] KVM: PPC: Add in-kernel acceleration for VFIO

2016-12-09 Thread Alex Williamson
On Fri, 9 Dec 2016 18:53:43 +1100
Alexey Kardashevskiy  wrote:

> On 09/12/16 04:55, Alex Williamson wrote:
> > On Thu,  8 Dec 2016 19:19:56 +1100
> > Alexey Kardashevskiy  wrote:
> >   
> >> This allows the host kernel to handle H_PUT_TCE, H_PUT_TCE_INDIRECT
> >> and H_STUFF_TCE requests targeting an IOMMU TCE table used for VFIO
> >> without passing them to user space which saves time on switching
> >> to user space and back.
> >>
> >> This adds H_PUT_TCE/H_PUT_TCE_INDIRECT/H_STUFF_TCE handlers to KVM.
> >> KVM tries to handle a TCE request in real mode; if that fails,
> >> it passes the request to virtual mode to complete the operation.
> >> If the virtual mode handler fails, the request is passed to
> >> user space; this is not expected to happen though.
> >>
> >> To avoid dealing with page use counters (which is tricky in real mode),
> >> this only accelerates SPAPR TCE IOMMU v2 clients which are required
> >> to pre-register the userspace memory. The very first TCE request will
> >> be handled in the VFIO SPAPR TCE driver anyway as the userspace view
> >> of the TCE table (iommu_table::it_userspace) is not allocated till
> >> the very first mapping happens and we cannot call vmalloc in real mode.
> >>
> >> This adds new attribute - KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE - to
> >> the VFIO KVM device. It takes a VFIO group fd and SPAPR TCE table fd
> >> and associates a physical IOMMU table with the SPAPR TCE table (which
> >> is a guest view of the hardware IOMMU table). The iommu_table object
> >> is referenced so we do not have to retrieve in real mode when hypercall
> >> happens.
> >>
> >> This does not implement the UNSET counterpart as there is no use for it -
> >> once the acceleration is enabled, the existing userspace won't
> >> disable it unless a VFIO container is destroyed so this adds necessary
> >> cleanup to the KVM_DEV_VFIO_GROUP_DEL handler.
> >>
> >> This uses the kvm->lock mutex to protect against a race between
> >> the VFIO KVM device's kvm_vfio_destroy() and SPAPR TCE table fd's
> >> release() callback.
> >>
> >> This advertises the new KVM_CAP_SPAPR_TCE_VFIO capability to the user
> >> space.
> >>
> >> This finally makes use of vfio_external_user_iommu_id() which was
> >> introduced quite some time ago and was considered for removal.
> >>
> >> Tests show that this patch increases transmission speed from 220MB/s
> >> to 750..1020MB/s on 10Gb network (Chelsea CXGB3 10Gb ethernet card).
> >>
> >> Signed-off-by: Alexey Kardashevskiy 
> >> ---
> >>  Documentation/virtual/kvm/devices/vfio.txt |  21 +-
> >>  arch/powerpc/include/asm/kvm_host.h|   8 +
> >>  arch/powerpc/include/asm/kvm_ppc.h |   5 +
> >>  include/uapi/linux/kvm.h   |   8 +
> >>  arch/powerpc/kvm/book3s_64_vio.c   | 302 +
> >>  arch/powerpc/kvm/book3s_64_vio_hv.c| 178 +
> >>  arch/powerpc/kvm/powerpc.c |   2 +
> >>  virt/kvm/vfio.c| 108 +++
> >>  8 files changed, 630 insertions(+), 2 deletions(-)
> >>
> >> diff --git a/Documentation/virtual/kvm/devices/vfio.txt b/Documentation/virtual/kvm/devices/vfio.txt
> >> index ef51740c67ca..ddb5a6512ab3 100644
> >> --- a/Documentation/virtual/kvm/devices/vfio.txt
> >> +++ b/Documentation/virtual/kvm/devices/vfio.txt
> >> @@ -16,7 +16,24 @@ Groups:
> >>  
> >>  KVM_DEV_VFIO_GROUP attributes:
> >>KVM_DEV_VFIO_GROUP_ADD: Add a VFIO group to VFIO-KVM device tracking
> >> +  kvm_device_attr.addr points to an int32_t file descriptor
> >> +  for the VFIO group.
> >>KVM_DEV_VFIO_GROUP_DEL: Remove a VFIO group from VFIO-KVM device tracking
> >> +  kvm_device_attr.addr points to an int32_t file descriptor
> >> +  for the VFIO group.
> >> +  KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE: attaches a guest visible TCE table
> >> +  allocated by sPAPR KVM.
> >> +  kvm_device_attr.addr points to a struct:
> >>  
> >> -For each, kvm_device_attr.addr points to an int32_t file descriptor
> >> -for the VFIO group.
> >> +  struct kvm_vfio_spapr_tce {
> >> +  __u32   argsz;
> >> +  __s32   groupfd;
> >> +  __s32   tablefd;
> >> +  __u8pad[4];
> >> +  };
> >> +
> >> +  where
> >> +  @argsz is the size of kvm_vfio_spapr_tce_liobn;
> >> +  @groupfd is a file descriptor for a VFIO group;
> >> +  @tablefd is a file descriptor for a TCE table allocated via
> >> +  KVM_CREATE_SPAPR_TCE.
> >> diff --git a/arch/powerpc/include/asm/kvm_host.h 
> >> b/arch/powerpc/include/asm/kvm_host.h
> >> index 28350a294b1e..94774503c70d 100644
> >> --- a/arch/powerpc/include/asm/kvm_host.h
> >> +++ b/arch/powerpc/include/asm/kvm_host.h
> >> @@ -191,6 +191,13 @@ struct kvmppc_pginfo {
> >>atomic_t refcnt;
> >>  };
> >>  
> >> +struct kvmppc_spapr_tce_iommu_table {
> >> +  struct rcu_head rcu;
> >> +  struct list_head next;
> >> +  struct iommu_table *tbl;
> >> +  atomic_t refs;

Re: 4.9.0-rc8 - rcutorture test failure

2016-12-09 Thread Paul E. McKenney
On Fri, Dec 09, 2016 at 04:27:42PM +0530, Sachin Sant wrote:
> > But I am not seeing this as a failure.  The last status print from the
> > log you attached is as follows:
> > 
> > 07:58:25 [ 2778.876118] rcu-torture: rtc:   (null) ver: 24968 tfle: 0 rta: 24968 rtaf: 0 rtf: 24959 rtmbe: 0 rtbe: 0 rtbke: 0 rtbre: 0 rtbf: 0 rtb: 0 nt: 10218404 onoff: 0/0:0/0 -1,0:-1,0 0:0 (HZ=250) barrier: 0/0:0 cbflood: 22703
> > 07:58:25 [ 2778.876251] rcu-torture: Reader Pipe:  161849976604 399197 0 0 0 0 0 0 0 0 0
> > 07:58:25 [ 2778.876438] rcu-torture: Reader Batch:  145090807711 16759538163 0 0 0 0 0 0 0 0 0
> > 07:58:25 [ 2778.876625] rcu-torture: Free-Block Circulation:  24967 24967 24966 24965 24964 24963 24962 24961 24960 24959 0
> > 07:58:25 [ 2778.876829] rcu-torture:--- End of test: SUCCESS: nreaders=79 nfakewriters=4 stat_interval=60 verbose=1 test_no_idle_hz=1 shuffle_interval=3 stutter=5 irqreader=1 fqs_duration=0 fqs_holdoff=0 fqs_stutter=3 test_boost=1/0 test_boost_interval=7 test_boost_duration=4 shutdown_secs=0 stall_cpu=0 stall_cpu_holdoff=10 n_barrier_cbs=0 onoff_interval=0 onoff_holdoff=0
> > 
> > The "SUCCESS" indicates that rcutorture thought that it succeeded.
> > Also, in the "Reader Pipe" and "Reader Batch" lines, only the first two
> > numbers in the series at the end of each line are non-zero, which also
> > indicates a non-broken RCU.
> > 
> > So could you please let me know what your scripting didn't like about
> > this log?
> > 
> 
> The test case has following piece of code which prints the failure
> message during result analysis.
> 
> Checks for known bugs
> """
> utils.system('dmesg -c  > /dev/null')
> pipe1 = [r for r in self.results if "!!! Reader Pipe:" in r]
> if len(pipe1) != 0:
>     raise error.TestError('\nBUG: grace-period failure !')
>     sys.exit(0)
> 
> pipe2 = [r for r in self.results if "Reader Pipe" in r]
> for p in pipe2:
>     nmiss = p.split(" ")[7]
>     if int(nmiss):
>         raise error.TestError('\nBUG: rcutorture tests failed !')
>         sys.exit(0)
> 
> I will double check on this.

I suggest using this script in the Linux kernel source as a guide:

tools/testing/selftests/rcutorture/bin/parse-console.sh

Thanx, Paul
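
[Editor's sketch] The criterion stated above (only the first two "Reader Pipe" counters may be non-zero) can be checked without relying on a fixed field index. The sketch below is not taken from the test suite or from parse-console.sh; the function name is hypothetical, and it assumes the result lines look like the log quoted in this thread. If the lines carry a timestamp prefix, a fixed index such as [7] may land on one of the healthy leading counters.

```python
# Sketch of a parse that does not depend on how many prefix fields
# (timestamps, "[ 2778.876251]", ...) precede the counters.
# The function name and return convention are hypothetical.

def reader_pipe_ok(line, marker="Reader Pipe:"):
    """True if every counter after the first two is zero."""
    _, _, tail = line.partition(marker)
    counts = [int(field) for field in tail.split()]
    return len(counts) >= 2 and not any(counts[2:])


line = ("07:58:25 [ 2778.876251] rcu-torture: Reader Pipe:  "
        "161849976604 399197 0 0 0 0 0 0 0 0 0")

print(reader_pipe_ok(line))          # counters parsed after the marker
print(line.split(" ")[7])            # what a fixed index picks up instead
```

Anchoring on the marker text keeps the check independent of whatever the console or dmesg prepends to each line.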



[PATCH v4 0/4] powernv:stop: Use psscr_val,mask provided by firmware

2016-12-09 Thread Gautham R. Shenoy
From: "Gautham R. Shenoy" 

This is the fourth iteration of the patchset to use the psscr_val and
psscr_mask provided by the firmware for each of the stop states.

The previous versions can be found here:

[v3]: https://lkml.org/lkml/2016/11/10/37
[v2]: https://lkml.org/lkml/2016/10/27/143
[v1]: https://lkml.org/lkml/2016/9/29/45

This version fixes some of the coding style issues pointed out by
Michael Ellerman in v3. This version also documents the device-tree
bindings defining the properties under the @power-mgt node in the
device tree describing the idle states for Linux running on baremetal
POWER servers.

Synopsis
========
In the current implementation, the code for ISA
v3.0 stop implementation has a couple of shortcomings.

a) The code hand-codes the values for ESL,EC,TR,MTL bits of PSSCR and
   uses only the RL field from the firmware. While this is not
   incorrect, since the hand-coded values are legitimate, it is not a
   very flexible design since the firmware has the capability to
   communicate these values via the "ibm,cpu-idle-state-psscr" and
   "ibm,cpu-idle-state-psscr-mask" properties. In case where the
   firmware provides values for these fields that is different from
   the hand-coded values, the current code will not work as intended.

b) Due to issue a), the current code assumes that ESL=EC=1 for all the
   stop states and hence the wakeup from the stop instruction will
   happen at 0x100, the system-reset vector. However, the ISA v3.0
   allows the ESL=EC=0 behaviour where the corresponding stop-state
   loses no state and wakes up from the subsequent instruction. The
   current code doesn't handle this case.
   
This patch series addresses these issues.

The first patch in the series renames the existing
IDLE_STATE_ENTER_SEQ macro to IDLE_STATE_ENTER_SEQ_NORET. It reuses
the name IDLE_STATE_ENTER_SEQ for entering into stop-states which wake
up at the subsequent instruction.

The second patch adds a helper function in cpuidle-powernv.c for
initializing entries of the powernv_states[] table that is passed to
the cpu-idle core. This eliminates some of the code duplication in the
function that discovers and initializes the stop states.

The third patch in the series fixes issues a) and b) by ensuring that
the psscr-value and the psscr-mask provided by the firmware are what
will be used to set a particular stop state. It also adds support for
handling wake-up from stop states which were entered with ESL=EC=0.

The third patch also handles the older firmware which sets only the
Requested Level (RL) field in the psscr and psscr-mask exposed in the
device tree. In the presence of such older firmware, this patch will
set sane default values for the remaining PSSCR fields (i.e. PSLL,
MTL, ESL, EC, and TR).
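
As a rough model of the older-firmware fallback described above, the following Python sketch mirrors the default-filling logic of compute_psscr_val() and compute_psscr_mask() from patch 3. The numeric bit masks are assumptions based on the ISA v3.0 PSSCR layout and are not stated in this posting; the kernel headers remain the reference for the real values.

```python
# Assumed PSSCR field masks (ISA v3.0 layout); not given in this posting.
PSSCR_RL_MASK = 0x0000000F    # Requested Level
PSSCR_MTL_MASK = 0x000000F0   # Maximum Transition Level
PSSCR_TR_MASK = 0x00000300    # Transition Rate
PSSCR_PSLL_MASK = 0x000F0000  # Power-Saving Level Limit
PSSCR_EC = 0x00100000         # Exit Criterion
PSSCR_ESL = 0x00200000        # Enable State Loss

# Defaults used when firmware only describes the RL field.
PSSCR_HV_DEFAULT_VAL = (PSSCR_ESL | PSSCR_EC | PSSCR_PSLL_MASK |
                        PSSCR_TR_MASK | PSSCR_MTL_MASK)
PSSCR_HV_DEFAULT_MASK = PSSCR_HV_DEFAULT_VAL | PSSCR_RL_MASK

# A mask of 0xf means "only RL is populated", i.e. older firmware.
OLD_FIRMWARE_MASK = 0xF


def compute_psscr_val(psscr_val, psscr_mask):
    """Fill in the default fields when firmware supplied only RL."""
    if psscr_mask == OLD_FIRMWARE_MASK:
        return psscr_val | PSSCR_HV_DEFAULT_VAL
    return psscr_val


def compute_psscr_mask(psscr_mask):
    """Widen the mask to cover the defaulted fields on older firmware."""
    if psscr_mask == OLD_FIRMWARE_MASK:
        return PSSCR_HV_DEFAULT_MASK
    return psscr_mask


# Older firmware: RL=2, mask=0xf -> defaults are OR-ed in.
print(hex(compute_psscr_val(0x2, 0xF)), hex(compute_psscr_mask(0xF)))
# Newer firmware: value and mask are passed through untouched.
print(hex(compute_psscr_val(0x2, 0x3F03FF)), hex(compute_psscr_mask(0x3F03FF)))
```

With these assumed masks, newer firmware that populates every field is passed through unchanged, so ESL=EC=0 states keep the bits the firmware asked for.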

The fourth patch provides the documentation for the device-tree
bindings describing the idle state properties under the @power-mgt
node in the device-tree.

The skiboot patch that populates all the relevant fields in the PSSCR
values and the mask for all the stop states can be found here:
https://lists.ozlabs.org/pipermail/skiboot/2016-September/004869.html

The patches are based on top of
git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git fixes

Gautham R. Shenoy (4):
  powernv:idle: Add IDLE_STATE_ENTER_SEQ_NORET macro
  cpuidle:powernv: Add helper function to populate powernv idle states.
  powernv: Pass PSSCR value and mask to power9_idle_stop
  Documentation:powerpc: Add device-tree bindings for power-mgt

 .../devicetree/bindings/powerpc/opal/power-mgt.txt | 123 +
 arch/powerpc/include/asm/cpuidle.h |  46 +++-
 arch/powerpc/include/asm/processor.h   |   3 +-
 arch/powerpc/kernel/exceptions-64s.S   |   6 +-
 arch/powerpc/kernel/idle_book3s.S  |  41 ---
 arch/powerpc/platforms/powernv/idle.c  |  81 +++---
 arch/powerpc/platforms/powernv/powernv.h   |   3 +-
 arch/powerpc/platforms/powernv/smp.c   |  14 ++-
 drivers/cpuidle/cpuidle-powernv.c  | 113 ---
 include/linux/cpuidle.h|   1 +
 10 files changed, 348 insertions(+), 83 deletions(-)
 create mode 100644 Documentation/devicetree/bindings/powerpc/opal/power-mgt.txt

-- 
1.9.4



[PATCH v4 1/4] powernv:idle: Add IDLE_STATE_ENTER_SEQ_NORET macro

2016-12-09 Thread Gautham R. Shenoy
From: "Gautham R. Shenoy" 

Currently all the low-power idle states are expected to wake up
at reset vector 0x100, which is why the macro IDLE_STATE_ENTER_SEQ
puts the CPU into an idle state and never returns.

On ISA_300, when the ESL and EC bits in the PSSCR are zero, the
CPU is expected to wake up at the instruction following the idle
instruction.

This patch adds a new macro named IDLE_STATE_ENTER_SEQ_NORET for the
no-return variant and reuses the name IDLE_STATE_ENTER_SEQ
for a variant that allows resuming operation at the instruction next
to the idle-instruction.

Signed-off-by: Gautham R. Shenoy 
---
 arch/powerpc/include/asm/cpuidle.h   |  5 -
 arch/powerpc/kernel/exceptions-64s.S |  6 +++---
 arch/powerpc/kernel/idle_book3s.S| 10 +-
 3 files changed, 12 insertions(+), 9 deletions(-)

diff --git a/arch/powerpc/include/asm/cpuidle.h b/arch/powerpc/include/asm/cpuidle.h
index 3919332..0a3255b 100644
--- a/arch/powerpc/include/asm/cpuidle.h
+++ b/arch/powerpc/include/asm/cpuidle.h
@@ -21,7 +21,7 @@
 
 /* Idle state entry routines */
 #ifdef CONFIG_PPC_P7_NAP
-#define	IDLE_STATE_ENTER_SEQ(IDLE_INST) \
+#define IDLE_STATE_ENTER_SEQ(IDLE_INST) \
/* Magic NAP/SLEEP/WINKLE mode enter sequence */\
std r0,0(r1);   \
ptesync;\
@@ -29,6 +29,9 @@
 1: cmpd cr0,r0,r0;  \
bne 1b; \
IDLE_INST;  \
+
+#define IDLE_STATE_ENTER_SEQ_NORET(IDLE_INST)   \
+   IDLE_STATE_ENTER_SEQ(IDLE_INST) \
b   .
 #endif /* CONFIG_PPC_P7_NAP */
 
diff --git a/arch/powerpc/kernel/exceptions-64s.S b/arch/powerpc/kernel/exceptions-64s.S
index 1ba82ea..7aa8afc 100644
--- a/arch/powerpc/kernel/exceptions-64s.S
+++ b/arch/powerpc/kernel/exceptions-64s.S
@@ -381,12 +381,12 @@ EXC_COMMON_BEGIN(machine_check_handle_early)
lbz r3,PACA_THREAD_IDLE_STATE(r13)
cmpwi   r3,PNV_THREAD_NAP
bgt 10f
-   IDLE_STATE_ENTER_SEQ(PPC_NAP)
+   IDLE_STATE_ENTER_SEQ_NORET(PPC_NAP)
/* No return */
 10:
cmpwi   r3,PNV_THREAD_SLEEP
bgt 2f
-   IDLE_STATE_ENTER_SEQ(PPC_SLEEP)
+   IDLE_STATE_ENTER_SEQ_NORET(PPC_SLEEP)
/* No return */
 
 2:
@@ -400,7 +400,7 @@ EXC_COMMON_BEGIN(machine_check_handle_early)
 */
ori r13,r13,1
SET_PACA(r13)
-   IDLE_STATE_ENTER_SEQ(PPC_WINKLE)
+   IDLE_STATE_ENTER_SEQ_NORET(PPC_WINKLE)
/* No return */
 4:
 #endif
diff --git a/arch/powerpc/kernel/idle_book3s.S b/arch/powerpc/kernel/idle_book3s.S
index 72dac0b..be90e2f 100644
--- a/arch/powerpc/kernel/idle_book3s.S
+++ b/arch/powerpc/kernel/idle_book3s.S
@@ -205,7 +205,7 @@ pnv_enter_arch207_idle_mode:
stb r3,PACA_THREAD_IDLE_STATE(r13)
cmpwi   cr3,r3,PNV_THREAD_SLEEP
bge cr3,2f
-   IDLE_STATE_ENTER_SEQ(PPC_NAP)
+   IDLE_STATE_ENTER_SEQ_NORET(PPC_NAP)
/* No return */
 2:
/* Sleep or winkle */
@@ -239,7 +239,7 @@ pnv_fastsleep_workaround_at_entry:
 
 common_enter: /* common code for all the threads entering sleep or winkle */
bgt cr3,enter_winkle
-   IDLE_STATE_ENTER_SEQ(PPC_SLEEP)
+   IDLE_STATE_ENTER_SEQ_NORET(PPC_SLEEP)
 
 fastsleep_workaround_at_entry:
ori r15,r15,PNV_CORE_IDLE_LOCK_BIT
@@ -261,7 +261,7 @@ fastsleep_workaround_at_entry:
 enter_winkle:
bl  save_sprs_to_stack
 
-   IDLE_STATE_ENTER_SEQ(PPC_WINKLE)
+   IDLE_STATE_ENTER_SEQ_NORET(PPC_WINKLE)
 
 /*
  * r3 - requested stop state
@@ -280,7 +280,7 @@ power_enter_stop:
ld  r4,ADDROFF(pnv_first_deep_stop_state)(r5)
cmpd r3,r4
bge 2f
-   IDLE_STATE_ENTER_SEQ(PPC_STOP)
+   IDLE_STATE_ENTER_SEQ_NORET(PPC_STOP)
 2:
 /*
  * Entering deep idle state.
@@ -302,7 +302,7 @@ lwarx_loop_stop:
 
bl  save_sprs_to_stack
 
-   IDLE_STATE_ENTER_SEQ(PPC_STOP)
+   IDLE_STATE_ENTER_SEQ_NORET(PPC_STOP)
 
 _GLOBAL(power7_idle)
/* Now check if user or arch enabled NAP mode */
-- 
1.9.4



[PATCH v4 3/4] powernv: Pass PSSCR value and mask to power9_idle_stop

2016-12-09 Thread Gautham R. Shenoy
From: "Gautham R. Shenoy" 

The power9_idle_stop method currently takes only the requested stop
level as a parameter and picks up the rest of the PSSCR bits from a
hand-coded macro. This is not a very flexible design, especially when
the firmware has the capability to communicate the psscr value and the
mask associated with a particular stop state via device tree.

This patch modifies the power9_idle_stop API to take as parameters the
PSSCR value and the PSSCR mask corresponding to the stop state that
needs to be set. These PSSCR value and mask are respectively obtained
by parsing the "ibm,cpu-idle-state-psscr" and
"ibm,cpu-idle-state-psscr-mask" fields from the device tree.

In addition to this, the patch adds support for handling stop states
for which ESL and EC bits in the PSSCR are zero. As per the
architecture, a wakeup from these stop states resumes execution from
the subsequent instruction as opposed to waking up at the System
Reset vector.

The older firmware sets only the Requested Level (RL) field in the
psscr and psscr-mask exposed in the device tree. For older firmware
where psscr-mask=0xf, this patch sets sane default values for the
remaining PSSCR fields (i.e. PSLL, MTL, ESL, EC, and TR).

The skiboot patch that exports fully populated PSSCR values and the
mask for all the stop states can be found here:
https://lists.ozlabs.org/pipermail/skiboot/2016-September/004869.html

Signed-off-by: Gautham R. Shenoy 
---
 arch/powerpc/include/asm/cpuidle.h   | 41 
 arch/powerpc/include/asm/processor.h |  3 +-
 arch/powerpc/kernel/idle_book3s.S| 31 +++-
 arch/powerpc/platforms/powernv/idle.c| 81 ++--
 arch/powerpc/platforms/powernv/powernv.h |  3 +-
 arch/powerpc/platforms/powernv/smp.c | 14 +++---
 drivers/cpuidle/cpuidle-powernv.c| 40 +++-
 7 files changed, 169 insertions(+), 44 deletions(-)

diff --git a/arch/powerpc/include/asm/cpuidle.h b/arch/powerpc/include/asm/cpuidle.h
index 0a3255b..fa0b6c0 100644
--- a/arch/powerpc/include/asm/cpuidle.h
+++ b/arch/powerpc/include/asm/cpuidle.h
@@ -10,11 +10,52 @@
 #define PNV_CORE_IDLE_LOCK_BIT  0x100
 #define PNV_CORE_IDLE_THREAD_BITS   0x0FF
 
+/*
+ *  NOTE =
+ * The older firmware populates only the RL field in the psscr_val and
+ * sets the psscr_mask to 0xf. On such a firmware, the kernel sets the
+ * remaining PSSCR fields to default values as follows:
+ *
+ * - ESL and EC bits are set to 1. So wakeup from any stop state will be
+ *   at vector 0x100.
+ *
+ * - MTL and PSLL are set to the maximum allowed value as per the ISA,
+ *i.e. 15.
+ *
+ * - The Transition Rate, TR is set to the Maximum value 3.
+ */
+#define PSSCR_HV_DEFAULT_VAL    (PSSCR_ESL | PSSCR_EC |\
+   PSSCR_PSLL_MASK | PSSCR_TR_MASK |   \
+   PSSCR_MTL_MASK)
+
+#define PSSCR_HV_DEFAULT_MASK   (PSSCR_ESL | PSSCR_EC |\
+   PSSCR_PSLL_MASK | PSSCR_TR_MASK |   \
+   PSSCR_MTL_MASK | PSSCR_RL_MASK)
+
 #ifndef __ASSEMBLY__
 extern u32 pnv_fastsleep_workaround_at_entry[];
 extern u32 pnv_fastsleep_workaround_at_exit[];
 
 extern u64 pnv_first_deep_stop_state;
+
+static inline u64 compute_psscr_val(u64 psscr_val, u64 psscr_mask)
+{
+   /*
+* psscr_mask == 0xf indicates an older firmware.
+* Set remaining fields of psscr to the default values.
+* See NOTE above definition of PSSCR_HV_DEFAULT_VAL
+*/
+   if (psscr_mask == 0xf)
+   return psscr_val | PSSCR_HV_DEFAULT_VAL;
+   return psscr_val;
+}
+
+static inline u64 compute_psscr_mask(u64 psscr_mask)
+{
+   if (psscr_mask == 0xf)
+   return PSSCR_HV_DEFAULT_MASK;
+   return psscr_mask;
+}
 #endif
 
 #endif
diff --git a/arch/powerpc/include/asm/processor.h b/arch/powerpc/include/asm/processor.h
index c07c31b..422becd 100644
--- a/arch/powerpc/include/asm/processor.h
+++ b/arch/powerpc/include/asm/processor.h
@@ -458,7 +458,8 @@ static inline unsigned long get_clean_sp(unsigned long sp, int is_32)
 extern unsigned long power7_nap(int check_irq);
 extern unsigned long power7_sleep(void);
 extern unsigned long power7_winkle(void);
-extern unsigned long power9_idle_stop(unsigned long stop_level);
+extern unsigned long power9_idle_stop(unsigned long stop_psscr_val,
+ unsigned long stop_psscr_mask);
 
 extern void flush_instruction_cache(void);
 extern void hard_reset_now(void);
diff --git a/arch/powerpc/kernel/idle_book3s.S b/arch/powerpc/kernel/idle_book3s.S
index be90e2f..37ee533 100644
--- a/arch/powerpc/kernel/idle_book3s.S
+++ b/arch/powerpc/kernel/idle_book3s.S
@@ -40,9 +40,7 @@
 #define _WORC  GPR11
 #define _PTCR  GPR12
 
-#define PSSCR_HV_TEMPLATE  

[PATCH v4 4/4] Documentation:powerpc: Add device-tree bindings for power-mgt

2016-12-09 Thread Gautham R. Shenoy
From: "Gautham R. Shenoy" 

Document the device-tree bindings defining the properties under
the @power-mgt node in the device tree that describe the idle states
for Linux running on baremetal POWER servers.

Signed-off-by: Gautham R. Shenoy 
---
 .../devicetree/bindings/powerpc/opal/power-mgt.txt | 123 +
 1 file changed, 123 insertions(+)
 create mode 100644 Documentation/devicetree/bindings/powerpc/opal/power-mgt.txt

diff --git a/Documentation/devicetree/bindings/powerpc/opal/power-mgt.txt b/Documentation/devicetree/bindings/powerpc/opal/power-mgt.txt
new file mode 100644
index 0000000..002b59e
--- /dev/null
+++ b/Documentation/devicetree/bindings/powerpc/opal/power-mgt.txt
@@ -0,0 +1,123 @@
+IBM Power-Management Bindings
+=============================
+
+Linux running on baremetal POWER machines has access to the processor
+idle states. The description of these idle states is exposed via the
+node @power-mgt in the device-tree by the firmware.
+
+Definitions:
+
+Typically each idle state has the following associated properties:
+
+- name: The name of the idle state as defined by the firmware.
+
+- flags: indicating some aspects of this idle state such as the
+ extent of state-loss, whether timebase is stopped in this
+ idle state and so on. The flag bits are as follows:
+
+- exit-latency: The latency involved in transitioning the state of the
+   CPU from idle to running.
+
+- target-residency: The minimum time that the CPU needs to reside in
+   this idle state in order to accrue power-savings
+   benefit.
+
+Properties
+
+The following properties provide details about the idle states. These
+properties are optional unless mentioned otherwise below.
+
+- ibm,cpu-idle-state-names:
+   Array of strings containing the names of the idle states.
+
+- ibm,cpu-idle-state-flags:
+   Array of unsigned 32-bit values containing the values of the
+   flags associated with the aforementioned idle-states. This
+   property is required on POWER9 whenever
+   ibm,cpu-idle-state-names is defined and the length of this
+   property array should be the same as
+   ibm,cpu-idle-state-names. The flag bits are as follows:
+   0x00000001 /* Decrementer would stop */
+   0x00000002 /* Needs timebase restore */
+   0x00001000 /* Restore GPRs like nap */
+   0x00002000 /* Restore hypervisor resource from PACA pointer */
+   0x00004000 /* Program PORE to restore PACA pointer */
+   0x00010000 /* This is a nap state */
+   0x00020000 /* This is a fast-sleep state */
+   0x00040000 /* This is a winkle state */
+   0x00080000 /* This is a fast-sleep state which requires a */
+              /* software workaround for restoring the timebase */
+   0x00800000 /* This state uses SPR PMICR instruction */
+   0x00100000 /* This is a fast stop state */
+   0x00200000 /* This is a deep-stop state */
+
+- ibm,cpu-idle-state-latencies-ns:
+   Array of unsigned 32-bit values containing the values of the
+   exit-latencies (in ns) for the idle states in
+   ibm,cpu-idle-state-names. This property is required whenever
+   ibm,cpu-idle-state-names is defined and the length of this
+   property array should be the same as
+   ibm,cpu-idle-state-names.
+
+- ibm,cpu-idle-state-residency-ns:
+   Array of unsigned 32-bit values containing the values of the
+   target-residency (in ns) for the idle states in
+   ibm,cpu-idle-state-names. On POWER8 this is an optional
+   property. If the property is absent, the target residency for
+   the "Nap", "FastSleep" are defined to 1 and 3
+   respectively. On POWER9 this property must be defined if
+   ibm,cpu-idle-state-names is defined and the length should be
+   same as that of ibm,cpu-idle-state-names.
+
+- ibm,cpu-idle-state-psscr:
+   Array of unsigned 64-bit values containing the values for the
+   PSSCR for each of the idle states in ibm,cpu-idle-state-names.
+   This property is required on POWER9 whenever
+   ibm,cpu-idle-state-names is defined and the length of this
+   property array should be the same as
+   ibm,cpu-idle-state-names.
+
+- ibm,cpu-idle-state-psscr-mask:
+   Array of unsigned 64-bit values containing the masks
+   indicating which psscr fields are set in the corresponding
+   entries of ibm,cpu-idle-state-psscr.  This property is
+   required on POWER9 whenever ibm,cpu-idle-state-names is
+   defined and the length of this property array should be the
+   same as ibm,cpu-idle-state-names.
+
+   Whenever the firmware sets an entry in
+   ibm,cpu-idle-state-psscr-mask value to 0xf, it implies that
+   

[PATCH v4 2/4] cpuidle:powernv: Add helper function to populate powernv idle states.

2016-12-09 Thread Gautham R. Shenoy
From: "Gautham R. Shenoy" 

In the current code for powernv_add_idle_states, there is a lot of code
duplication while initializing an idle state in powernv_states table.

Add an inline helper function to populate the powernv_states[] table for
a given idle state. Invoke this for populating the "Nap", "Fastsleep"
and the stop states in powernv_add_idle_states.

Signed-off-by: Gautham R. Shenoy 
---
 drivers/cpuidle/cpuidle-powernv.c | 85 ++-
 include/linux/cpuidle.h   |  1 +
 2 files changed, 50 insertions(+), 36 deletions(-)

diff --git a/drivers/cpuidle/cpuidle-powernv.c b/drivers/cpuidle/cpuidle-powernv.c
index 7fe442c..db18af1 100644
--- a/drivers/cpuidle/cpuidle-powernv.c
+++ b/drivers/cpuidle/cpuidle-powernv.c
@@ -167,6 +167,24 @@ static int powernv_cpuidle_driver_init(void)
return 0;
 }
 
+static inline void add_powernv_state(int index, const char *name,
+unsigned int flags,
+int (*idle_fn)(struct cpuidle_device *,
+   struct cpuidle_driver *,
+   int),
+unsigned int target_residency,
+unsigned int exit_latency,
+u64 psscr_val)
+{
+   strlcpy(powernv_states[index].name, name, CPUIDLE_NAME_LEN);
+   strlcpy(powernv_states[index].desc, name, CPUIDLE_NAME_LEN);
+   powernv_states[index].flags = flags;
+   powernv_states[index].target_residency = target_residency;
+   powernv_states[index].exit_latency = exit_latency;
+   powernv_states[index].enter = idle_fn;
+   stop_psscr_table[index] = psscr_val;
+}
+
 static int powernv_add_idle_states(void)
 {
struct device_node *power_mgt;
@@ -236,6 +254,7 @@ static int powernv_add_idle_states(void)
"ibm,cpu-idle-state-residency-ns", residency_ns, dt_idle_states);
 
for (i = 0; i < dt_idle_states; i++) {
+   unsigned int exit_latency, target_residency;
/*
 * If an idle state has exit latency beyond
 * POWERNV_THRESHOLD_LATENCY_NS then don't use it
@@ -243,28 +262,33 @@ static int powernv_add_idle_states(void)
 */
if (latency_ns[i] > POWERNV_THRESHOLD_LATENCY_NS)
continue;
+   /*
+* Firmware passes residency and latency values in ns.
+* cpuidle expects it in us.
+*/
+   exit_latency = ((unsigned int)latency_ns[i]) / 1000;
+   if (!rc)
+   target_residency = residency_ns[i] / 1000;
+   else
+   target_residency = 0;
 
/*
-* Cpuidle accepts exit_latency and target_residency in us.
-* Use default target_residency values if f/w does not expose it.
+* For nap and fastsleep, use default target_residency
+* values if f/w does not expose it.
 */
if (flags[i] & OPAL_PM_NAP_ENABLED) {
+   if (!rc)
+   target_residency = 100;
/* Add NAP state */
-   strcpy(powernv_states[nr_idle_states].name, "Nap");
-   strcpy(powernv_states[nr_idle_states].desc, "Nap");
-   powernv_states[nr_idle_states].flags = 0;
-   powernv_states[nr_idle_states].target_residency = 100;
-   powernv_states[nr_idle_states].enter = nap_loop;
+   add_powernv_state(nr_idle_states, "Nap",
+ CPUIDLE_FLAG_NONE, nap_loop,
+ target_residency, exit_latency, 0);
} else if ((flags[i] & OPAL_PM_STOP_INST_FAST) &&
!(flags[i] & OPAL_PM_TIMEBASE_STOP)) {
-   strncpy(powernv_states[nr_idle_states].name,
-   names[i], CPUIDLE_NAME_LEN);
-   strncpy(powernv_states[nr_idle_states].desc,
-   names[i], CPUIDLE_NAME_LEN);
-   powernv_states[nr_idle_states].flags = 0;
-
-   powernv_states[nr_idle_states].enter = stop_loop;
-   stop_psscr_table[nr_idle_states] = psscr_val[i];
+   add_powernv_state(nr_idle_states, names[i],
+ CPUIDLE_FLAG_NONE, stop_loop,
+ target_residency, exit_latency,
+ psscr_val[i]);
}
 
/*
@@ -274,32 +298,21 @@ static int powernv_add_idle_states(void)
 #ifdef CONFIG_TICK_ONESHOT

Re: [PATCH 3/3] powerpc: enable support for GCC plugins

2016-12-09 Thread PaX Team
On 9 Dec 2016 at 13:48, Andrew Donnellan wrote:

> >> as for the solutions, the general advice should enable the use of otherwise
> >> failing gcc versions instead of forcing updating to new ones (though the
> >> latter is advisable for other reasons but not everyone's in the position to
> >> do so easily). in my experience all one needs to do is manually install the
> >> missing files from the gcc sources (ideally distros would take care of it).
> 
> If someone else is willing to write up that advice, then great.
> 
> >> the specific problem addressed here can (and IMHO should) be solved in
> >> another way: remove the inclusion of the offending headers in gcc-common.h
> >> as neither tm.h nor c-common.h are needed by existing plugins. for
> >> background,
> 
> We can't build without tm.h: http://pastebin.com/W0azfCr0

you'll need to repeat the removal of dependent headers. based on a quick
test here across gcc 4.5-6.2, if you remove rtl.h, tm_p.h, hard-reg-set.h
and emit-rtl.h in addition to tm.h, the plugins should build fine.

> And we get warnings without c-common.h: http://pastebin.com/Aw8CAj10

that's not due to c-common.h. gcc versions 4.5-4.6 are compiled as a C program
and gcc 4.7 can be compiled both as a C and a C++ program (IIRC, distros opted
for the latter, i forget what manually built versions default to but i guess you
went with the C compilation for your gcc anyway). couple that with
-Wmissing-prototypes and you get that warning regardless of c-common.h being
included. something like this should fix it:

--- a/scripts/gcc-plugins/gcc-generate-gimple-pass.h 2016-12-06 01:01:54.521724573 +0100
+++ b/scripts/gcc-plugins/gcc-generate-gimple-pass.h  2016-12-09 11:43:32.225226164 +0100
@@ -136,6 +136,7 @@
return new _PASS_NAME_PASS();
 }
 #else
+struct opt_pass *_MAKE_PASS_NAME_PASS(void);
 struct opt_pass *_MAKE_PASS_NAME_PASS(void)
 {
return &_PASS_NAME_PASS.pass;

> These were all manually built using a script running on a Debian box. 
> Installing precompiled distro versions of rather old gccs would have 
> been somewhat challenging. I've just rebuilt 4.6.4 to double check that 
> I wasn't just seeing things, but it seems that it definitely is still 
> putting c-common.h in the old location.

for reference, this is the git commit that did the move:

commit 7bedc3a05d34cd81e4835a2d3ff8c0ec7108eeb5
Author: steven 
Date:   Sat Jun 5 20:33:22 2010 +

gcc/ChangeLog:
* c-common.c: Move to c-family/.
* c-common.def: Likewise.
* c-common.h: Likewise.




Re: 4.9.0-rc8 - rcutorture test failure

2016-12-09 Thread Sachin Sant
> But I am not seeing this as a failure.  The last status print from the
> log you attached is as follows:
> 
> 07:58:25 [ 2778.876118] rcu-torture: rtc:   (null) ver: 24968 tfle: 0 
> rta: 24968 rtaf: 0 rtf: 24959 rtmbe: 0 rtbe: 0 rtbke: 0 rtbre: 0 rtbf: 0 rtb: 
> 0 nt: 10218404 onoff: 0/0:0/0 -1,0:-1,0 0:0 (HZ=250) barrier: 0/0:0 cbflood: 
> 22703
> 07:58:25 [ 2778.876251] rcu-torture: Reader Pipe:  161849976604 399197 0 0 0 
> 0 0 0 0 0 0
> 07:58:25 [ 2778.876438] rcu-torture: Reader Batch:  145090807711 16759538163 
> 0 0 0 0 0 0 0 0 0
> 07:58:25 [ 2778.876625] rcu-torture: Free-Block Circulation:  24967 24967 
> 24966 24965 24964 24963 24962 24961 24960 24959 0
> 07:58:25 [ 2778.876829] rcu-torture:--- End of test: SUCCESS: nreaders=79 
> nfakewriters=4 stat_interval=60 verbose=1 test_no_idle_hz=1 
> shuffle_interval=3 stutter=5 irqreader=1 fqs_duration=0 fqs_holdoff=0 
> fqs_stutter=3 test_boost=1/0 test_boost_interval=7 test_boost_duration=4 
> shutdown_secs=0 stall_cpu=0 stall_cpu_holdoff=10 n_barrier_cbs=0 
> onoff_interval=0 onoff_holdoff=0
> 
> The "SUCCESS" indicates that rcutorture thought that it succeeded.
> Also, in the "Reader Pipe" and "Reader Batch" lines, only the first two
> numbers in the series at the end of each line are non-zero, which also
> indicates a non-broken RCU.
> 
> So could you please let me know what your scripting didn't like about
> this log?
> 

The test case has following piece of code which prints the failure
message during result analysis.

Checks for known bugs
"""
utils.system('dmesg -c  > /dev/null')
pipe1 = [r for r in self.results if "!!! Reader Pipe:" in r]
if len(pipe1) != 0:
 raise error.TestError('\nBUG: grace-period failure !')
 sys.exit(0)

pipe2 = [r for r in self.results if "Reader Pipe" in r]
for p in pipe2:
  nmiss = p.split(" ")[7]
  if int(nmiss):
    raise error.TestError('\nBUG: rcutorture tests failed !')
    sys.exit(0)

I will double check on this.

Thanks
-Sachin