Re: [PATCH 10/16] powerpc: vio: use dev_groups and not dev_attrs for bus_type

2017-06-06 Thread Benjamin Herrenschmidt
On Wed, 2017-06-07 at 07:45 +0200, Greg Kroah-Hartman wrote:
> On Wed, Jun 07, 2017 at 09:04:41AM +1000, Benjamin Herrenschmidt wrote:
> > On Tue, 2017-06-06 at 21:30 +0200, Greg Kroah-Hartman wrote:
> > > >   
> > > >   static struct device_attribute vio_dev_attrs[] = {
> > > >    __ATTR_RO(name),
> > > > @@ -1573,6 +1576,13 @@ static struct device_attribute vio_dev_attrs[] = {
> > > >    __ATTR_RO(modalias),
> > > >    __ATTR_NULL
> > > >   };
> > > > +static struct attribute *vio_dev_attrs[] = {
> > > 
> > > Hm, this feels wrong, odd that 0-day passed it.  I should be deleting
> > > the above vio_dev_attrs field as well.  Is powerpc really a dead
> > > platform?  :)
> > 
> > Haha, not yet no, and the above is actually still quite actively
> > in use as it's part of our hypervisor virtual IO infrastructure.
> 
> Ok, let me fix this; right after I emailed, 0-day sent me the build error :)

Thanks !

Cheers,
Ben.



Re: [PATCH v3 2/9] powerpc/kprobes: Move kprobes over to patch_instruction

2017-06-06 Thread Balbir Singh
On Wed, 2017-06-07 at 00:35 +0530, Naveen N. Rao wrote:
> Hi Balbir,
> 
> On 2017/06/06 02:29PM, Balbir Singh wrote:
> > arch_arm/disarm_probe use direct assignment for copying
> > instructions; replace them with patch_instruction.
> > 
> > Signed-off-by: Balbir Singh 
> > ---
> >  arch/powerpc/kernel/kprobes.c | 4 ++--
> >  1 file changed, 2 insertions(+), 2 deletions(-)
> > 
> > diff --git a/arch/powerpc/kernel/kprobes.c b/arch/powerpc/kernel/kprobes.c
> > index fc43435..b49f8f0 100644
> > --- a/arch/powerpc/kernel/kprobes.c
> > +++ b/arch/powerpc/kernel/kprobes.c
> > @@ -158,7 +158,7 @@ NOKPROBE_SYMBOL(arch_prepare_kprobe);
> > 
> >  void arch_arm_kprobe(struct kprobe *p)
> >  {
> > -   *p->addr = BREAKPOINT_INSTRUCTION;
> > +   patch_instruction(p->addr, BREAKPOINT_INSTRUCTION);
> > flush_icache_range((unsigned long) p->addr,
> >(unsigned long) p->addr + sizeof(kprobe_opcode_t));
> 
> Do we still need flush_icache_range() after patch_instruction()?
>

Good catch! No, we don't.

Balbir Singh. 



[PATCH 2/2] mtd: powernv_flash: Lock around concurrent access to OPAL

2017-06-06 Thread Cyril Bur
OPAL can only manage one flash access at a time and will return an
OPAL_BUSY error for each concurrent access to the flash. The simplest
way to prevent this from happening is with a mutex.

Signed-off-by: Cyril Bur 
---
This is to address https://github.com/open-power/skiboot/issues/80


 drivers/mtd/devices/powernv_flash.c | 18 +++---
 1 file changed, 15 insertions(+), 3 deletions(-)

diff --git a/drivers/mtd/devices/powernv_flash.c b/drivers/mtd/devices/powernv_flash.c
index a9a20c00687c..7b41af06f4fe 100644
--- a/drivers/mtd/devices/powernv_flash.c
+++ b/drivers/mtd/devices/powernv_flash.c
@@ -38,6 +38,7 @@
 
 struct powernv_flash {
struct mtd_info mtd;
+   struct mutex lock;
u32 id;
 };
 
@@ -59,12 +60,15 @@ static int powernv_flash_async_op(struct mtd_info *mtd, enum flash_op op,
dev_dbg(dev, "%s(op=%d, offset=0x%llx, len=%zu)\n",
__func__, op, offset, len);
 
+   mutex_lock(&info->lock);
+
token = opal_async_get_token_interruptible();
if (token < 0) {
if (token != -ERESTARTSYS)
dev_err(dev, "Failed to get an async token\n");
 
-   return token;
+   rc = token;
+   goto out;
}
 
switch (op) {
@@ -79,18 +83,21 @@ static int powernv_flash_async_op(struct mtd_info *mtd, enum flash_op op,
break;
default:
WARN_ON_ONCE(1);
-   return -EIO;
+   rc = -EIO;
+   goto out;
}
 
if (rc != OPAL_ASYNC_COMPLETION) {
dev_err(dev, "opal_flash_async_op(op=%d) failed (rc %d)\n",
op, rc);
opal_async_release_token(token);
-   return -EIO;
+   rc = -EIO;
+   goto out;
}
 
rc = opal_async_wait_response(token, &msg);
opal_async_release_token(token);
+   mutex_unlock(&info->lock);
if (rc) {
dev_err(dev, "opal async wait failed (rc %d)\n", rc);
return -EIO;
@@ -106,6 +113,9 @@ static int powernv_flash_async_op(struct mtd_info *mtd, enum flash_op op,
}
 
return rc;
+out:
+   mutex_unlock(&info->lock);
+   return rc;
 }
 
 /**
@@ -237,6 +247,8 @@ static int powernv_flash_probe(struct platform_device *pdev)
if (ret)
goto out;
 
+   mutex_init(&data->lock);
+
dev_set_drvdata(dev, data);
 
/*
-- 
2.13.0



[PATCH 1/2] mtd: powernv_flash: Use WARN_ON_ONCE() rather than BUG_ON()

2017-06-06 Thread Cyril Bur
BUG_ON() should be reserved for situations where we can no longer
guarantee the integrity of the system. In the case where
powernv_flash_async_op() receives an impossible op, we can still
guarantee the integrity of the system.

Signed-off-by: Cyril Bur 
---
I think patches to powernv_flash have gone through MPE in the past -
is this still correct?

 drivers/mtd/devices/powernv_flash.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/mtd/devices/powernv_flash.c b/drivers/mtd/devices/powernv_flash.c
index f5396f26ddb4..a9a20c00687c 100644
--- a/drivers/mtd/devices/powernv_flash.c
+++ b/drivers/mtd/devices/powernv_flash.c
@@ -78,7 +78,8 @@ static int powernv_flash_async_op(struct mtd_info *mtd, enum flash_op op,
rc = opal_flash_erase(info->id, offset, len, token);
break;
default:
-   BUG_ON(1);
+   WARN_ON_ONCE(1);
+   return -EIO;
}
 
if (rc != OPAL_ASYNC_COMPLETION) {
-- 
2.13.0



Re: [PATCH v3 3/9] powerpc/kprobes/optprobes: Move over to patch_instruction

2017-06-06 Thread Balbir Singh
On Wed, 2017-06-07 at 00:42 +0530, Naveen N. Rao wrote:
> On 2017/06/06 02:29PM, Balbir Singh wrote:
> > With text moving to read-only, migrate optprobes to using
> > the patch_instruction infrastructure. Without this, optprobes
> > will fail and complain.
> > 

> > +   /* We can optimize this via patch_instruction_window later */
> 
> This probably needs a TODO just so it's clear. I do think this would be 
> good to add since we copy many instructions while setting up the 
> optprobe, so this is quite slow as it exists today.

I made it read like a TODO, with a TODO: prefix.

> 
> > +   size = (TMPL_END_IDX * sizeof(kprobe_opcode_t)) / sizeof(int);
> 
> That's just TMPL_END_IDX.
>

The entire kprobe_opcode_t and int types thing is a bit messy. The size
calculation assumes nothing for now in terms of sizes, but I guess the
patch_instruction does. Another TODO?

Thanks for the review
Balbir Singh.



Re: [PATCH 10/16] powerpc: vio: use dev_groups and not dev_attrs for bus_type

2017-06-06 Thread Greg Kroah-Hartman
On Wed, Jun 07, 2017 at 09:04:41AM +1000, Benjamin Herrenschmidt wrote:
> On Tue, 2017-06-06 at 21:30 +0200, Greg Kroah-Hartman wrote:
> > >   
> > >   static struct device_attribute vio_dev_attrs[] = {
> > >    __ATTR_RO(name),
> > > @@ -1573,6 +1576,13 @@ static struct device_attribute vio_dev_attrs[] = {
> > >    __ATTR_RO(modalias),
> > >    __ATTR_NULL
> > >   };
> > > +static struct attribute *vio_dev_attrs[] = {
> > 
> > Hm, this feels wrong, odd that 0-day passed it.  I should be deleting
> > the above vio_dev_attrs field as well.  Is powerpc really a dead
> > platform?  :)
> 
> Haha, not yet no, and the above is actually still quite actively
> in use as it's part of our hypervisor virtual IO infrastructure.

Ok, let me fix this; right after I emailed, 0-day sent me the build error :)


Re: [PATCH v9 07/10] powerpc/perf: PMU functions for Core IMC and hotplugging

2017-06-06 Thread Anju T Sudhakar

Hi Thomas,

On Tuesday 06 June 2017 03:39 PM, Thomas Gleixner wrote:

On Mon, 5 Jun 2017, Anju T Sudhakar wrote:

+static void cleanup_all_core_imc_memory(struct imc_pmu *pmu_ptr)
+{
+   struct imc_mem_info *ptr = pmu_ptr->mem_info;
+
+   if (!ptr)
+   return;

That's pointless.


No, it is not. We may end up here from imc_mem_init() when the memory
allocation for pmu_ptr->mem_info fails. So in that case we can just
return from here, and kfree won't be called with a NULL pointer.


+   for (; ptr; ptr++) {

for (ptr = pmu_ptr->mem_info; ptr; ptr++) {

will do the right thing.


+   if (ptr->vbase[0] != 0)
+   free_pages(ptr->vbase[0], 0);
+   }

and kfree can be called with a NULL pointer.


+   kfree(pmu_ptr->mem_info);
+}
+
+/* Enabling of Core Engine needs a scom operation */
+static void core_imc_control_enable(int *cpu_opal_rc)
+{
+   int rc;
+
+   rc = opal_imc_counters_start(OPAL_IMC_COUNTERS_CORE);
+   if (rc)
+   cpu_opal_rc[smp_processor_id()] = 1;
+}
+
+/*
+ * Disabling of IMC Core Engine needs a scom operation
+ */
+static void core_imc_control_disable(int *cpu_opal_rc)
+{
+   int rc;
+
+   rc = opal_imc_counters_stop(OPAL_IMC_COUNTERS_CORE);
+   if (rc)
+   cpu_opal_rc[smp_processor_id()] = 1;
+}
+
+int core_imc_control(int operation)
+{
+   int cpu, *cpus_opal_rc;
+
+   /* Memory for OPAL call return value. */
+   cpus_opal_rc = kzalloc((sizeof(int) * nr_cpu_ids), GFP_KERNEL);
+   if (!cpus_opal_rc)
+   return -ENOMEM;
+
+   /*
+* Enable or disable the core IMC PMU on each core using the
+* core_imc_cpumask.
+*/
+   switch (operation) {
+
+   case IMC_COUNTER_DISABLE:
+   on_each_cpu_mask(&core_imc_cpumask,
+   (smp_call_func_t)core_imc_control_disable,
+   cpus_opal_rc, 1);
+   break;
+   case IMC_COUNTER_ENABLE:
+   on_each_cpu_mask(&core_imc_cpumask,
+   (smp_call_func_t)core_imc_control_enable,
+   cpus_opal_rc, 1);
+   break;
+   default:
+   goto fail;
+   }
+   /* Check return value array for any OPAL call failure */
+   for_each_cpu(cpu, &core_imc_cpumask) {
+   if (cpus_opal_rc[cpu])
+   goto fail;
+   }
+   return 0;
+fail:
+   if (cpus_opal_rc)
+   kfree(cpus_opal_rc);
+   return -EINVAL;
+}

Interesting. You copied the other function and then implemented it
differently.

But, what's worse is that this is just duplicated code for no value. Both
the nest and core control functions call

 opal_imc_counters_(WHICH_COUNTER);

So the obvious thing to do is:

static struct cpumask imc_result_mask;
static DEFINE_MUTEX(imc_control_mutex);

static void opal_imc_XXX(void *which)
{
if (opal_imc_counters_XXX((unsigned long) which))
cpumask_set_cpu(smp_processor_id(), &imc_result_mask);
}

static int imc_control(unsigned long which, bool start)
{
smp_call_func_t fn;
int res;

mutex_lock(&imc_control_mutex);
memset(&imc_result_mask, 0, sizeof(imc_result_mask));

switch (which) {
case OPAL_IMC_COUNTERS_CORE:
case OPAL_IMC_COUNTERS_NEST:
break;
default:
res = -EINVAL;
goto out;
}

fn = start ? opal_imc_start : opal_imc_stop;

on_each_cpu_mask(&core_imc_cpumask, fn, (void *) which, 1);

res = cpumask_empty(&imc_result_mask) ? 0 : -ENODEV;
out:
mutex_unlock(&imc_control_mutex);
return res;
}

That would allow sharing of too much code, right?


Yes. This looks really good. Thanks!
I will rework the code.


+/*
+ * core_imc_mem_init : Initializes memory for the current core.
+ *
+ * Uses alloc_pages_exact_nid() and uses the returned address as an argument to
+ * an opal call to configure the pdbar. The address sent as an argument is
+ * converted to physical address before the opal call is made. This is the
+ * base address at which the core imc counters are populated.
+ */
+static int  __meminit core_imc_mem_init(int cpu, int size)
+{
+   int phys_id, rc = 0, core_id = (cpu / threads_per_core);
+   struct imc_mem_info *mem_info = NULL;

What's that NULL initialization for?


+
+   phys_id = topology_physical_package_id(cpu);
+   /*
+* alloc_pages_exact_nid() will allocate memory for core in the
+* local node only.
+*/
+   mem_info = &core_imc_pmu->mem_info[core_id];
+   mem_info->id = core_id;
+   mem_info->vbase[0] = (u64) alloc_pages_exact_nid(phys_id,
+   (size_t)size, GFP_KERNEL | __GFP_ZERO);

That allocation is guaranteed not to fail?


Nice catch. We need a check here. Will fix this.

Re: [PATCH v3 1/3] EDAC: mv64x60: check driver registration success

2017-06-06 Thread Chris Packham
Hi Borislav,

On 30/05/17 09:21, Chris Packham wrote:
> Check the return status of platform_driver_register() in
> mv64x60_edac_init(). Only output messages and initialise the
> edac_op_state if the registration is successful.
> 
> Signed-off-by: Chris Packham 
> ---
> Changes in v3:
> - catch the retval of platform_register_drivers and return early on failure
>(thanks Borislav).
> 
>   drivers/edac/mv64x60_edac.c | 8 ++--
>   1 file changed, 6 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/edac/mv64x60_edac.c b/drivers/edac/mv64x60_edac.c
> index 14b7e7b71eaa..172081551a70 100644
> --- a/drivers/edac/mv64x60_edac.c
> +++ b/drivers/edac/mv64x60_edac.c
> @@ -853,7 +853,11 @@ static struct platform_driver * const drivers[] = {
>   
>   static int __init mv64x60_edac_init(void)
>   {
> - int ret = 0;
> + int ret;
> +
> + ret = platform_register_drivers(drivers, ARRAY_SIZE(drivers));
> + if (ret)
> + return ret;

I actually think this was wrong. The edac_op_state is set as a module 
param and affects the behaviour of the driver probe function which on 
success could be called before we sanity check it below.

I'm back to thinking that my original v1 of just removing the unused ret 
was appropriate. If we still consider the driver too noisy we could just 
remove the printks.

What do you think?

>   
>   printk(KERN_INFO "Marvell MV64x60 EDAC driver " MV64x60_REVISION "\n");
>   printk(KERN_INFO "\t(C) 2006-2007 MontaVista Software\n");
> @@ -867,7 +871,7 @@ static int __init mv64x60_edac_init(void)
>   break;
>   }
>   
> - return platform_register_drivers(drivers, ARRAY_SIZE(drivers));
> + return 0;
>   }
>   module_init(mv64x60_edac_init);
>   
> 



Re: [PATCH v9 05/10] powerpc/perf: IMC pmu cpumask and cpuhotplug support

2017-06-06 Thread Anju T Sudhakar

Hi Thomas,

Thank you for reviewing the patch.


On Tuesday 06 June 2017 02:09 PM, Thomas Gleixner wrote:

On Mon, 5 Jun 2017, Anju T Sudhakar wrote:

+/*
+ * nest_imc_stop : Does OPAL call to stop nest engine.
+ */
+static void nest_imc_stop(int *cpu_opal_rc)
+{
+   int rc;
+
+   rc = opal_imc_counters_stop(OPAL_IMC_COUNTERS_NEST);
+   if (rc)
+   cpu_opal_rc[smp_processor_id()] = 1;

What's wrong with just assigning the result directly?



Yes. We can assign it directly. Will do this.




+/* nest_imc_start: Does the OPAL call to enable nest counters. */
+static void nest_imc_start(int *cpu_opal_rc)
+{
+   int rc;
+
+   rc = opal_imc_counters_start(OPAL_IMC_COUNTERS_NEST);
+   if (rc)
+   cpu_opal_rc[smp_processor_id()] = 1;
+}
+
+/*
+ * nest_imc_control: Function to start and stop nest imc engine.
+ *  The function is primarily called from event init
+ *  and event destroy.

As I don't see other call sites, what's the secondary usage?



This function is called from event init and event destroy only.
Will fix the comment here.




+ */
+static int nest_imc_control(int operation)
+{
+   int *cpus_opal_rc, cpu;
+
+   /*
+* Memory for OPAL call return value.
+*/
+   cpus_opal_rc = kzalloc((sizeof(int) * nr_cpu_ids), GFP_KERNEL);
+   if (!cpus_opal_rc)
+   return -ENOMEM;
+   switch (operation) {
+   case IMC_COUNTER_ENABLE:
+   /* Initialize Nest PMUs in each node using designated cpus */
+   on_each_cpu_mask(&nest_imc_cpumask, (smp_call_func_t)nest_imc_start,
+   cpus_opal_rc, 1);

That's one indentation level too deep.


My bad. Will fix this.


+   break;
+   case IMC_COUNTER_DISABLE:
+   /* Disable the counters */
+   on_each_cpu_mask(&nest_imc_cpumask, (smp_call_func_t)nest_imc_stop,
+   cpus_opal_rc, 1);
+   break;
+   default:
+   kfree(cpus_opal_rc);
+   return -EINVAL;
+
+   }
+
+   /* Check return value array for any OPAL call failure */
+   for_each_cpu(cpu, &nest_imc_cpumask) {
+   if (cpus_opal_rc[cpu]) {
+   kfree(cpus_opal_rc);
+   return -ENODEV;
+   }
+   }
+   kfree(cpus_opal_rc);

So you have THREE places where you do kfree(cpus_opal_rc). Sigh.

You can spare that whole alloc/free dance by simply having

static struct cpumask nest_imc_result;

mutex_lock(&imc_nest_reserve);
memset(&nest_imc_result, 0, sizeof(nest_imc_result));

switch (op) {
case IMC_COUNTER_ENABLE:
on_each_cpu_mask(&nest_imc_cpumask, nest_imc_start, NULL, 1);
break;

}

res = cpumask_empty(&nest_imc_result) ? 0 : -ENODEV;
mutex_unlock(&imc_nest_reserve);
return res;

And in the start/stop functions:

if (opal_imc_counters_xxx(OPAL_IMC_COUNTERS_NEST))
cpumask_set_cpu(smp_processor_id(), &nest_imc_result);


Ok.


+static void nest_change_cpu_context(int old_cpu, int new_cpu)
+{
+   struct imc_pmu **pn = per_nest_pmu_arr;
+   int i;
+
+   if (old_cpu < 0 || new_cpu < 0)
+   return;
+
+   for (i = 0; *pn && i < IMC_MAX_PMUS; i++, pn++)
+   perf_pmu_migrate_context(&(*pn)->pmu, old_cpu, new_cpu);
+
+}
+
+static int ppc_nest_imc_cpu_offline(unsigned int cpu)
+{
+   int nid, target = -1;
+   const struct cpumask *l_cpumask;
+
+   /*
+* Check in the designated list for this cpu. Dont bother
+* if not one of them.
+*/
+   if (!cpumask_test_and_clear_cpu(cpu, &nest_imc_cpumask))
+   return 0;
+
+   /*
+* Now that this cpu is one of the designated,
+* find a next cpu a) which is online and b) in same chip.
+*/
+   nid = cpu_to_node(cpu);
+   l_cpumask = cpumask_of_node(nid);
+   target = cpumask_next(cpu, l_cpumask);

So if the outgoing CPU is the highest numbered CPU on the node, then
cpumask_next() will not find a CPU, while there still might be other lower
numbered CPUs online in the node.

target = cpumask_any_but(l_cpumask, cpu);

is what you want.


In the previous revisions we designated the smallest cpu in the mask,
and then we rewrote the code to pick the next available cpu. But this
is a nice catch. :-) Thanks!
I will fix this.


+
+   /*
+* Update the cpumask with the target cpu and
+* migrate the context if needed
+*/
+   if (target >= 0 && target < nr_cpu_ids) {
+   cpumask_set_cpu(target, &nest_imc_cpumask);
+   nest_change_cpu_context(cpu, target);
+   } else {
+   target = -1;
+   opal_imc_counters_stop(OPAL_IMC_COUNTERS_NEST);

[PATCH] selftests/powerpc: Add a test of a simple copy/paste

2017-06-06 Thread Michael Ellerman
Power9 has two new instructions, copy and paste, which copy a cacheline
and then paste it somewhere.

We already have a test that the kernel correctly aborts the copy on
context switch, but we don't have a simple test to confirm copy/paste
itself actually works. So add one.

Signed-off-by: Michael Ellerman 
---
 .../selftests/powerpc/context_switch/.gitignore|   1 +
 .../selftests/powerpc/context_switch/Makefile  |   2 +-
 .../selftests/powerpc/context_switch/copy_paste.c  | 108 +
 3 files changed, 110 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/powerpc/context_switch/copy_paste.c

diff --git a/tools/testing/selftests/powerpc/context_switch/.gitignore b/tools/testing/selftests/powerpc/context_switch/.gitignore
index c1431af7b51c..6018c777e537 100644
--- a/tools/testing/selftests/powerpc/context_switch/.gitignore
+++ b/tools/testing/selftests/powerpc/context_switch/.gitignore
@@ -1 +1,2 @@
 cp_abort
+copy_paste
diff --git a/tools/testing/selftests/powerpc/context_switch/Makefile b/tools/testing/selftests/powerpc/context_switch/Makefile
index e9351bb4285d..f5ba79c6c3d1 100644
--- a/tools/testing/selftests/powerpc/context_switch/Makefile
+++ b/tools/testing/selftests/powerpc/context_switch/Makefile
@@ -1,4 +1,4 @@
-TEST_GEN_PROGS := cp_abort
+TEST_GEN_PROGS := cp_abort copy_paste
 
 include ../../lib.mk
 
diff --git a/tools/testing/selftests/powerpc/context_switch/copy_paste.c b/tools/testing/selftests/powerpc/context_switch/copy_paste.c
new file mode 100644
index ..0695c340bc82
--- /dev/null
+++ b/tools/testing/selftests/powerpc/context_switch/copy_paste.c
@@ -0,0 +1,108 @@
+/*
+ * Copyright 2016, Mikey Neuling, Chris Smart, IBM Corporation.
+ * Copyright 2017, Michael Ellerman, IBM Corporation.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version
+ * 2 of the License, or (at your option) any later version.
+ *
+ * Check that copy/paste works on Power9.
+ */
+
+#define _GNU_SOURCE
+
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include "utils.h"
+
+
+#define NUM_LOOPS 1000
+
+
+/* This defines the "paste" instruction from Power ISA 3.0 Book II, section 4.4. */
+#define PASTE(RA, RB, L, RC) \
+   .long (0x7c00070c | (RA) << (31-15) | (RB) << (31-20) | (L) << (31-10) | (RC) << (31-31))
+
+int paste(void *i)
+{
+   int cr;
+
+   asm volatile(str(PASTE(0, %1, 1, 1))";"
+   "mfcr %0;"
+   : "=r" (cr)
+   : "b" (i)
+   : "memory"
+   );
+   return cr;
+}
+
+/* This defines the "copy" instruction from Power ISA 3.0 Book II, section 4.4. */
+#define COPY(RA, RB, L) \
+   .long (0x7c00060c | (RA) << (31-15) | (RB) << (31-20) | (L) << (31-10))
+
+void copy(void *i)
+{
+   asm volatile(str(COPY(0, %0, 1))";"
+   :
+   : "b" (i)
+   : "memory"
+   );
+}
+
+int test_copy_paste(void)
+{
+   /* 128 bytes for a full cache line */
+   char orig[128] __cacheline_aligned;
+   char src[128] __cacheline_aligned;
+   char dst[128] __cacheline_aligned;
+   int rc;
+
+   /* only run this test on a P9 or later */
+   SKIP_IF(!have_hwcap2(PPC_FEATURE2_ARCH_3_00));
+
+   memset(orig, 0x5a, sizeof(orig));
+   memset(src, 0x5a, sizeof(src));
+   memset(dst, 0x00, sizeof(dst));
+
+   /* Confirm orig and src match */
+   FAIL_IF(0 != memcmp(orig, src, sizeof(orig)));
+
+   /* Confirm src & dst are different */
+   FAIL_IF(0 == memcmp(src, dst, sizeof(src)));
+
+   /*
+* Paste can fail, eg. if we get context switched, so we do the
+* copy/paste in a loop and fail the test if it never succeeds.
+*/
+   for (int i = 0; i < NUM_LOOPS; i++) {
+   copy(src);
+   rc = paste(dst);
+
+   /* A paste succeeds if CR0 EQ bit is set */
+   if (rc & 0x2000) {
+   rc = 0;
+   break;
+   }
+   rc = EAGAIN;
+   }
+
+   FAIL_IF(rc);
+
+   /* Confirm orig and src still match */
+   FAIL_IF(0 != memcmp(orig, src, sizeof(orig)));
+
+   /* And that src and dst now match */
+   FAIL_IF(0 != memcmp(src, dst, sizeof(src)));
+
+   return 0;
+}
+
+int main(int argc, char *argv[])
+{
+   return test_harness(test_copy_paste, "copy_paste");
+}
-- 
2.7.4



[PATCH] powernv: Properly mask POWER9 DD1 PVR for different chip types

2017-06-06 Thread Michael Neuling
Bits 48:51 in the PVR for POWER9 represent different chip types (scale
up vs out and 12 vs 24 core). Current chips have 0 here, but could be
non-zero in the future.

This changes the POWER9 DD1 mask to correctly ignore these bits 48:51.

Signed-off-by: Michael Neuling 
---
 arch/powerpc/kernel/cputable.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/kernel/cputable.c b/arch/powerpc/kernel/cputable.c
index 9b3e88b1a9..89dcd94237 100644
--- a/arch/powerpc/kernel/cputable.c
+++ b/arch/powerpc/kernel/cputable.c
@@ -526,8 +526,8 @@ static struct cpu_spec __initdata cpu_specs[] = {
.machine_check_early= __machine_check_early_realmode_p8,
.platform   = "power8",
},
-   {   /* Power9 DD1*/
-   .pvr_mask   = 0xff00,
+   {   /* Power9 DD1. Bits 48:51 represent chip type so mask these */
+   .pvr_mask   = 0x0f00,
.pvr_value  = 0x004e0100,
.cpu_name   = "POWER9 (raw)",
.cpu_features   = CPU_FTRS_POWER9_DD1,
-- 
2.11.0



Re: [PATCH] driver core: remove CLASS_ATTR usage

2017-06-06 Thread Michael Ellerman
Greg KH  writes:

> From: Greg Kroah-Hartman 
>
> There was only 2 remaining users of CLASS_ATTR() so let's finally get
> rid of them and force everyone to use the correct RW/RO/WO versions
> instead.
>
> Cc: Benjamin Herrenschmidt 
> Cc: Paul Mackerras 
> Cc: Michael Ellerman 
> Signed-off-by: Greg Kroah-Hartman 
> ---
>
> PPC maintainers, can I take this in my driver core tree?  I'm doing some
> cleanups of device.h and this was one of the simpler ones at the moment.

Yeah looks good to me, I did a quick build test and it was OK.

Acked-by: Michael Ellerman 

cheers


Re: [PATCH v5 09/10] VAS: Define vas_tx_win_open()

2017-06-06 Thread Sukadev Bhattiprolu
Sukadev Bhattiprolu [suka...@linux.vnet.ibm.com] wrote:
> +static void *map_paste_region(char *name, uint64_t start, int len)
> +{
> + void *map;
> +
> + if (!request_mem_region(start, len, name)) {
> + pr_devel("%s(): request_mem_region(0x%llx, %d) failed\n",
> + __func__, start, len);
> + return NULL;
> + }
> +
> + map = __ioremap(start, len, pgprot_val(pgprot_cached(__pgprot(0))));
> + if (!map) {
> + pr_devel("%s(): ioremap(0x%llx, %d) failed\n", __func__, start,
> + len);
> + return NULL;
> + }
> +
> + return map;
> +}

Based on Ben Herrenschmidt and Michael Ellerman's comments, replaced the
__ioremap() call above with ioremap_cache(start, len). Here is the updated
patch.

---
From 6975948dba82007d15ee6003116db15a4c942297 Mon Sep 17 00:00:00 2001
From: Sukadev Bhattiprolu 
Date: Wed, 15 Mar 2017 20:44:21 -0400
Subject: [PATCH 1/1] VAS: Define vas_tx_win_open()

Define an interface to open a VAS send window. This interface is
intended to be used by the Nest Accelerator (NX) driver(s) to open
a send window and use it to submit compression/encryption requests
to a VAS receive window.

The receive window, identified by the [vasid, cop] parameters, must
already be open in VAS (i.e. connected to an NX engine).

Signed-off-by: Sukadev Bhattiprolu 

---
Changelog[v5]:
- [Ben Herrenschmidt] MMIO regions must be mapped non-cached and
  paste regions must be mapped cached. Define/use map_paste_region().
- [Michael Ellerman, Ben Herrenschmidt] use ioremap_cache() rather
  than messing with pgprot flags.

Changelog [v3]:
- Distinguish between hardware PID (SPRN_PID) and Linux pid.
- Use macros rather than enum for threshold-control mode
- Set the pid of send window from attr (needed for user space
  send windows).
- Ignore irq port setting for now. They are needed for user space
  windows and will be added later
---
 arch/powerpc/include/asm/vas.h  |  42 +++
 arch/powerpc/platforms/powernv/vas-window.c | 183 
 2 files changed, 225 insertions(+)

diff --git a/arch/powerpc/include/asm/vas.h b/arch/powerpc/include/asm/vas.h
index c923b8f..944bb4b 100644
--- a/arch/powerpc/include/asm/vas.h
+++ b/arch/powerpc/include/asm/vas.h
@@ -61,6 +61,29 @@ struct vas_rx_win_attr {
 };
 
 /*
+ * Window attributes specified by the in-kernel owner of a send window.
+ */
+struct vas_tx_win_attr {
+   enum vas_cop_type cop;
+   int wcreds_max;
+   int lpid;
+   int pidr;   /* hardware PID (from SPRN_PID) */
+   int pid;/* linux process id */
+   int pswid;
+   int rsvd_txbuf_count;
+   int tc_mode;
+
+   bool user_win;
+   bool pin_win;
+   bool rej_no_credit;
+   bool rsvd_txbuf_enable;
+   bool tx_wcred_mode;
+   bool rx_wcred_mode;
+   bool tx_win_ord_mode;
+   bool rx_win_ord_mode;
+};
+
+/*
  * Helper to initialize receive window attributes to defaults for an
  * NX window.
  */
@@ -77,6 +100,25 @@ extern struct vas_window *vas_rx_win_open(int vasid, enum vas_cop_type cop,
struct vas_rx_win_attr *attr);
 
 /*
+ * Helper to initialize send window attributes to defaults for an NX window.
+ */
+extern void vas_init_tx_win_attr(struct vas_tx_win_attr *txattr,
+   enum vas_cop_type cop);
+
+/*
+ * Open a VAS send window for the instance of VAS identified by @vasid
+ * and the co-processor type @cop. Use @attr to initialize attributes
+ * of the window.
+ *
+ * Note: The instance of VAS must already have an open receive window for
+ * the coprocessor type @cop.
+ *
+ * Return a handle to the send window or ERR_PTR() on error.
+ */
+struct vas_window *vas_tx_win_open(int vasid, enum vas_cop_type cop,
+   struct vas_tx_win_attr *attr);
+
+/*
  * Close the send or receive window identified by @win. For receive windows
  * return -EAGAIN if there are active send windows attached to this receive
  * window.
diff --git a/arch/powerpc/platforms/powernv/vas-window.c b/arch/powerpc/platforms/powernv/vas-window.c
index caacc28..fb8ccf9 100644
--- a/arch/powerpc/platforms/powernv/vas-window.c
+++ b/arch/powerpc/platforms/powernv/vas-window.c
@@ -56,6 +56,30 @@ static inline void get_uwc_mmio_bar(struct vas_window *window,
*len = VAS_UWC_SIZE;
 }
 
+/*
+ * Unlike MMIO regions (map_mmio_region() below), paste region must be mapped
+ * cache-able.
+ */
+static void *map_paste_region(char *name, uint64_t start, int len)
+{
+   void *map;
+
+   if (!request_mem_region(start, len, name)) {
+   pr_devel("%s(): request_mem_region(0x%llx, %d) failed\n",
+   __func__, start, len);
+   return NULL;
+   }
+
+   map = ioremap_cache(start, len);
+   if (!map) {
+  

EDAC: simplify total memory calculation

2017-06-06 Thread Chris Packham
This takes the approach used by cell_edac.c to obtain the total memory from
the devicetree and applies it to mv64x60_edac.c, altera_edac.c and
cpc925_edac.c which were all manually parsing the reg property. In the case
of mv64x60 this actually fixes cases where #address/size-cells != 1. For
altera and cpc925 this is just a cleanup.

Chris Packham (3):
  EDAC: mv64x60: calculate memory size correctly
  EDAC: altera: simplify calculation of total memory
  EDAC: cpc925: simplify calculation of total memory

 drivers/edac/altera_edac.c  | 24 
 drivers/edac/cpc925_edac.c  | 32 ++--
 drivers/edac/mv64x60_edac.c | 17 +++--
 3 files changed, 29 insertions(+), 44 deletions(-)

-- 
2.13.0



[PATCH v2 3/3] EDAC: cpc925: simplify calculation of total memory

2017-06-06 Thread Chris Packham
Use of_address_to_resource() and resource_size() instead of manually
parsing the "reg" property from the "memory" node(s).

Signed-off-by: Chris Packham 
---
Changes in v2:
- New

 drivers/edac/cpc925_edac.c | 32 ++--
 1 file changed, 10 insertions(+), 22 deletions(-)

diff --git a/drivers/edac/cpc925_edac.c b/drivers/edac/cpc925_edac.c
index 837b62c4993d..ea347cd7eb92 100644
--- a/drivers/edac/cpc925_edac.c
+++ b/drivers/edac/cpc925_edac.c
@@ -24,6 +24,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 
@@ -296,30 +297,17 @@ struct cpc925_dev_info {
 static void get_total_mem(struct cpc925_mc_pdata *pdata)
 {
struct device_node *np = NULL;
-   const unsigned int *reg, *reg_end;
-   int len, sw, aw;
-   unsigned long start, size;
+   struct resource res;
+   int ret;
 
-   np = of_find_node_by_type(NULL, "memory");
-   if (!np)
-   return;
+   for_each_node_by_type(np, "memory") {
+   ret = of_address_to_resource(np, 0, &res);
+   if (ret)
+   continue;
+
+   pdata->total_mem += resource_size(&res);
+   }
 
-   aw = of_n_addr_cells(np);
-   sw = of_n_size_cells(np);
-   reg = (const unsigned int *)of_get_property(np, "reg", &len);
-   reg_end = reg + len/4;
-
-   pdata->total_mem = 0;
-   do {
-   start = of_read_number(reg, aw);
-   reg += aw;
-   size = of_read_number(reg, sw);
-   reg += sw;
-   edac_dbg(1, "start 0x%lx, size 0x%lx\n", start, size);
-   pdata->total_mem += size;
-   } while (reg < reg_end);
-
-   of_node_put(np);
edac_dbg(0, "total_mem 0x%lx\n", pdata->total_mem);
 }
 
-- 
2.13.0



[PATCH v2 2/3] EDAC: altera: simplify calculation of total memory

2017-06-06 Thread Chris Packham
Use of_address_to_resource() and resource_size() instead of manually
parsing the "reg" property from the "memory" node(s).

Signed-off-by: Chris Packham 
---
Changes in v2:
- New

 drivers/edac/altera_edac.c | 24 
 1 file changed, 8 insertions(+), 16 deletions(-)

diff --git a/drivers/edac/altera_edac.c b/drivers/edac/altera_edac.c
index 7717b094fabb..f8b623352627 100644
--- a/drivers/edac/altera_edac.c
+++ b/drivers/edac/altera_edac.c
@@ -214,24 +214,16 @@ static void altr_sdr_mc_create_debugfs_nodes(struct mem_ctl_info *mci)
 static unsigned long get_total_mem(void)
 {
struct device_node *np = NULL;
-   const unsigned int *reg, *reg_end;
-   int len, sw, aw;
-   unsigned long start, size, total_mem = 0;
+   struct resource res;
+   int ret;
+   unsigned long total_mem = 0;
 
for_each_node_by_type(np, "memory") {
-   aw = of_n_addr_cells(np);
-   sw = of_n_size_cells(np);
-   reg = (const unsigned int *)of_get_property(np, "reg", &len);
-   reg_end = reg + (len / sizeof(u32));
-
-   total_mem = 0;
-   do {
-   start = of_read_number(reg, aw);
-   reg += aw;
-   size = of_read_number(reg, sw);
-   reg += sw;
-   total_mem += size;
-   } while (reg < reg_end);
+   ret = of_address_to_resource(np, 0, &res);
+   if (ret)
+   continue;
+
+   total_mem += resource_size(&res);
}
edac_dbg(0, "total_mem 0x%lx\n", total_mem);
return total_mem;
-- 
2.13.0



[PATCH v2 1/3] EDAC: mv64x60: calculate memory size correctly

2017-06-06 Thread Chris Packham
The #address-cells and #size-cells properties need to be accounted for
when dealing with the "memory" device tree node. Use
of_address_to_resource() and resource_size() to retrieve the size of the
memory node which will automatically take the #cells into account.

Signed-off-by: Chris Packham 
---
Changes in v2:
- Use of_address_to_resource() instead of manually parsing the reg property.

 drivers/edac/mv64x60_edac.c | 17 +++--
 1 file changed, 11 insertions(+), 6 deletions(-)

diff --git a/drivers/edac/mv64x60_edac.c b/drivers/edac/mv64x60_edac.c
index 3461db3723cb..93623e704623 100644
--- a/drivers/edac/mv64x60_edac.c
+++ b/drivers/edac/mv64x60_edac.c
@@ -16,6 +16,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 
@@ -644,15 +645,19 @@ static irqreturn_t mv64x60_mc_isr(int irq, void *dev_id)
 static void get_total_mem(struct mv64x60_mc_pdata *pdata)
 {
struct device_node *np = NULL;
-   const unsigned int *reg;
+   struct resource res;
+   int ret;
+   unsigned long total_mem = 0;
 
-   np = of_find_node_by_type(NULL, "memory");
-   if (!np)
-   return;
+   for_each_node_by_type(np, "memory") {
+   ret = of_address_to_resource(np, 0, &res);
+   if (ret)
+   continue;
 
-   reg = of_get_property(np, "reg", NULL);
+   total_mem += resource_size(&res);
+   }
 
-   pdata->total_mem = reg[1];
+   pdata->total_mem = total_mem;
 }
 
 static void mv64x60_init_csrows(struct mem_ctl_info *mci,
-- 
2.13.0



Re: [PATCH 10/16] powerpc: vio: use dev_groups and not dev_attrs for bus_type

2017-06-06 Thread Benjamin Herrenschmidt
On Tue, 2017-06-06 at 21:30 +0200, Greg Kroah-Hartman wrote:
> >   
> >   static struct device_attribute vio_dev_attrs[] = {
> >    __ATTR_RO(name),
> > @@ -1573,6 +1576,13 @@ static struct device_attribute vio_dev_attrs[] = {
> >    __ATTR_RO(modalias),
> >    __ATTR_NULL
> >   };
> > +static struct attribute *vio_dev_attrs[] = {
> 
> Hm, this feels wrong, odd that 0-day passed it.  I should be deleting
> the above vio_dev_attrs field as well.  Is powerpc really a dead
> platform?  :)

Haha, not yet no, and the above is actually still quite actively
in use as it's part of our hypervisor virtual IO infrastructure.

Cheers,
Ben.



Re: [PATCH 08/16] powerpc: ps3: use dev_groups and not dev_attrs for bus_type

2017-06-06 Thread Geoff Levand
On 06/06/2017 12:22 PM, Greg Kroah-Hartman wrote:
> The dev_attrs field has long been "depreciated" and is finally being
> removed, so move the driver to use the "correct" dev_groups field
> instead for struct bus_type.

>  arch/powerpc/platforms/ps3/system-bus.c | 10 ++
>  1 file changed, 6 insertions(+), 4 deletions(-)

Seems OK.

-Geoff


Re: [PATCH 10/16] powerpc: vio: use dev_groups and not dev_attrs for bus_type

2017-06-06 Thread Greg Kroah-Hartman
On Tue, Jun 06, 2017 at 09:22:15PM +0200, Greg Kroah-Hartman wrote:
> The dev_attrs field has long been "depreciated" and is finally being
> removed, so move the driver to use the "correct" dev_groups field
> instead for struct bus_type.
> 
> Cc: Benjamin Herrenschmidt 
> Cc: Paul Mackerras 
> Cc: Michael Ellerman 
> Cc: Vineet Gupta 
> Cc: Bart Van Assche 
> Cc: Robin Murphy 
> Cc: Joerg Roedel 
> Cc: Johan Hovold 
> Cc: Alexey Kardashevskiy 
> Cc: Krzysztof Kozlowski 
> Cc: 
> Signed-off-by: Greg Kroah-Hartman 
> ---
>  arch/powerpc/platforms/pseries/vio.c | 12 +++-
>  1 file changed, 11 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/powerpc/platforms/pseries/vio.c 
> b/arch/powerpc/platforms/pseries/vio.c
> index 28b09fd797ec..fd6595598662 100644
> --- a/arch/powerpc/platforms/pseries/vio.c
> +++ b/arch/powerpc/platforms/pseries/vio.c
> @@ -1537,6 +1537,7 @@ static ssize_t name_show(struct device *dev,
>  {
>   return sprintf(buf, "%s\n", to_vio_dev(dev)->name);
>  }
> +static DEVICE_ATTR_RO(name);
>  
>  static ssize_t devspec_show(struct device *dev,
>   struct device_attribute *attr, char *buf)
> @@ -1545,6 +1546,7 @@ static ssize_t devspec_show(struct device *dev,
>  
>   return sprintf(buf, "%s\n", of_node_full_name(of_node));
>  }
> +static DEVICE_ATTR_RO(devspec);
>  
>  static ssize_t modalias_show(struct device *dev, struct device_attribute 
> *attr,
>char *buf)
> @@ -1566,6 +1568,7 @@ static ssize_t modalias_show(struct device *dev, struct 
> device_attribute *attr,
>  
>   return sprintf(buf, "vio:T%sS%s\n", vio_dev->type, cp);
>  }
> +static DEVICE_ATTR_RO(modalias);
>  
>  static struct device_attribute vio_dev_attrs[] = {
>   __ATTR_RO(name),
> @@ -1573,6 +1576,13 @@ static struct device_attribute vio_dev_attrs[] = {
>   __ATTR_RO(modalias),
>   __ATTR_NULL
>  };
> +static struct attribute *vio_dev_attrs[] = {

Hm, this feels wrong, odd that 0-day passed it.  I should be deleting
the above vio_dev_attrs field as well.  Is powerpc really a dead
platform?  :)

thanks,

greg k-h


[PATCH 11/16] powerpc: vio_cmo: use dev_groups and not dev_attrs for bus_type

2017-06-06 Thread Greg Kroah-Hartman
The dev_attrs field has long been "depreciated" and is finally being
removed, so move the driver to use the "correct" dev_groups field
instead for struct bus_type.

Cc: Benjamin Herrenschmidt 
Cc: Paul Mackerras 
Cc: Michael Ellerman 
Cc: Vineet Gupta 
Cc: Bart Van Assche 
Cc: Robin Murphy 
Cc: Joerg Roedel 
Cc: Johan Hovold 
Cc: Alexey Kardashevskiy 
Cc: Krzysztof Kozlowski 
Cc: 
Signed-off-by: Greg Kroah-Hartman 
---
 arch/powerpc/platforms/pseries/vio.c | 37 +---
 1 file changed, 22 insertions(+), 15 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/vio.c 
b/arch/powerpc/platforms/pseries/vio.c
index fd6595598662..176edf422867 100644
--- a/arch/powerpc/platforms/pseries/vio.c
+++ b/arch/powerpc/platforms/pseries/vio.c
@@ -948,7 +948,7 @@ static void vio_cmo_bus_init(void)
 /* sysfs device functions and data structures for CMO */
 
 #define viodev_cmo_rd_attr(name)\
-static ssize_t viodev_cmo_##name##_show(struct device *dev, \
+static ssize_t cmo_##name##_show(struct device *dev,\
 struct device_attribute *attr,  \
  char *buf) \
 {   \
@@ -962,7 +962,7 @@ static ssize_t viodev_cmo_allocs_failed_show(struct device 
*dev,
return sprintf(buf, "%d\n", atomic_read(&viodev->cmo.allocs_failed));
 }
 
-static ssize_t viodev_cmo_allocs_failed_reset(struct device *dev,
+static ssize_t cmo_allocs_failed_store(struct device *dev,
struct device_attribute *attr, const char *buf, size_t count)
 {
struct vio_dev *viodev = to_vio_dev(dev);
@@ -970,7 +970,7 @@ static ssize_t viodev_cmo_allocs_failed_reset(struct device 
*dev,
return count;
 }
 
-static ssize_t viodev_cmo_desired_set(struct device *dev,
+static ssize_t cmo_desired_store(struct device *dev,
struct device_attribute *attr, const char *buf, size_t count)
 {
struct vio_dev *viodev = to_vio_dev(dev);
@@ -993,18 +993,25 @@ static ssize_t name_show(struct device *, struct 
device_attribute *, char *);
 static ssize_t devspec_show(struct device *, struct device_attribute *, char 
*);
 static ssize_t modalias_show(struct device *dev, struct device_attribute *attr,
 char *buf);
-static struct device_attribute vio_cmo_dev_attrs[] = {
-   __ATTR_RO(name),
-   __ATTR_RO(devspec),
-   __ATTR_RO(modalias),
-   __ATTR(cmo_desired,   S_IWUSR|S_IRUSR|S_IWGRP|S_IRGRP|S_IROTH,
-  viodev_cmo_desired_show, viodev_cmo_desired_set),
-   __ATTR(cmo_entitled,  S_IRUGO, viodev_cmo_entitled_show,  NULL),
-   __ATTR(cmo_allocated, S_IRUGO, viodev_cmo_allocated_show, NULL),
-   __ATTR(cmo_allocs_failed, S_IWUSR|S_IRUSR|S_IWGRP|S_IRGRP|S_IROTH,
-  viodev_cmo_allocs_failed_show, viodev_cmo_allocs_failed_reset),
-   __ATTR_NULL
+static DEVICE_ATTR_RO(name);
+static DEVICE_ATTR_RO(devspec);
+static DEVICE_ATTR_RO(modalias);
+static DEVICE_ATTR_RO(cmo_entitled);
+static DEVICE_ATTR_RO(cmo_allocated);
+static DEVICE_ATTR_RW(cmo_desired);
+static DEVICE_ATTR_RW(cmo_allocs_failed);
+
+static struct attribute *vio_cmo_dev_attrs[] = {
+   &dev_attr_name.attr,
+   &dev_attr_devspec.attr,
+   &dev_attr_modalias.attr,
+   &dev_attr_cmo_entitled.attr,
+   &dev_attr_cmo_allocated.attr,
+   &dev_attr_cmo_desired.attr,
+   &dev_attr_cmo_allocs_failed.attr,
+   NULL,
 };
+ATTRIBUTE_GROUPS(vio_cmo_dev);
 
 /* sysfs bus functions and data structures for CMO */
 
@@ -1066,7 +1073,7 @@ ATTRIBUTE_GROUPS(vio_bus);
 
 static void vio_cmo_sysfs_init(void)
 {
-   vio_bus_type.dev_attrs = vio_cmo_dev_attrs;
+   vio_bus_type.dev_groups = vio_cmo_dev_groups;
vio_bus_type.bus_groups = vio_bus_groups;
 }
 #else /* CONFIG_PPC_SMLPAR */
-- 
2.13.0



[PATCH 10/16] powerpc: vio: use dev_groups and not dev_attrs for bus_type

2017-06-06 Thread Greg Kroah-Hartman
The dev_attrs field has long been "depreciated" and is finally being
removed, so move the driver to use the "correct" dev_groups field
instead for struct bus_type.

Cc: Benjamin Herrenschmidt 
Cc: Paul Mackerras 
Cc: Michael Ellerman 
Cc: Vineet Gupta 
Cc: Bart Van Assche 
Cc: Robin Murphy 
Cc: Joerg Roedel 
Cc: Johan Hovold 
Cc: Alexey Kardashevskiy 
Cc: Krzysztof Kozlowski 
Cc: 
Signed-off-by: Greg Kroah-Hartman 
---
 arch/powerpc/platforms/pseries/vio.c | 12 +++-
 1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/platforms/pseries/vio.c 
b/arch/powerpc/platforms/pseries/vio.c
index 28b09fd797ec..fd6595598662 100644
--- a/arch/powerpc/platforms/pseries/vio.c
+++ b/arch/powerpc/platforms/pseries/vio.c
@@ -1537,6 +1537,7 @@ static ssize_t name_show(struct device *dev,
 {
return sprintf(buf, "%s\n", to_vio_dev(dev)->name);
 }
+static DEVICE_ATTR_RO(name);
 
 static ssize_t devspec_show(struct device *dev,
struct device_attribute *attr, char *buf)
@@ -1545,6 +1546,7 @@ static ssize_t devspec_show(struct device *dev,
 
return sprintf(buf, "%s\n", of_node_full_name(of_node));
 }
+static DEVICE_ATTR_RO(devspec);
 
 static ssize_t modalias_show(struct device *dev, struct device_attribute *attr,
 char *buf)
@@ -1566,6 +1568,7 @@ static ssize_t modalias_show(struct device *dev, struct 
device_attribute *attr,
 
return sprintf(buf, "vio:T%sS%s\n", vio_dev->type, cp);
 }
+static DEVICE_ATTR_RO(modalias);
 
 static struct device_attribute vio_dev_attrs[] = {
__ATTR_RO(name),
@@ -1573,6 +1576,13 @@ static struct device_attribute vio_dev_attrs[] = {
__ATTR_RO(modalias),
__ATTR_NULL
 };
+static struct attribute *vio_dev_attrs[] = {
+   &dev_attr_name.attr,
+   &dev_attr_devspec.attr,
+   &dev_attr_modalias.attr,
+   NULL,
+};
+ATTRIBUTE_GROUPS(vio_dev);
 
 void vio_unregister_device(struct vio_dev *viodev)
 {
@@ -1608,7 +1618,7 @@ static int vio_hotplug(struct device *dev, struct 
kobj_uevent_env *env)
 
 struct bus_type vio_bus_type = {
.name = "vio",
-   .dev_attrs = vio_dev_attrs,
+   .dev_groups = vio_dev_groups,
.uevent = vio_hotplug,
.match = vio_bus_match,
.probe = vio_bus_probe,
-- 
2.13.0



[PATCH 09/16] powerpc: ibmebus: use dev_groups and not dev_attrs for bus_type

2017-06-06 Thread Greg Kroah-Hartman
The dev_attrs field has long been "depreciated" and is finally being
removed, so move the driver to use the "correct" dev_groups field
instead for struct bus_type.

Cc: Benjamin Herrenschmidt 
Cc: Paul Mackerras 
Cc: Michael Ellerman 
Cc: Bart Van Assche 
Cc: Johan Hovold 
Cc: Robin Murphy 
Cc: Rob Herring 
Cc: Lars-Peter Clausen 
Cc: 
Signed-off-by: Greg Kroah-Hartman 
---
 arch/powerpc/platforms/pseries/ibmebus.c | 16 ++--
 1 file changed, 10 insertions(+), 6 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/ibmebus.c 
b/arch/powerpc/platforms/pseries/ibmebus.c
index b363e439ddb9..52146b1356d2 100644
--- a/arch/powerpc/platforms/pseries/ibmebus.c
+++ b/arch/powerpc/platforms/pseries/ibmebus.c
@@ -397,6 +397,7 @@ static ssize_t devspec_show(struct device *dev,
ofdev = to_platform_device(dev);
return sprintf(buf, "%s\n", ofdev->dev.of_node->full_name);
 }
+static DEVICE_ATTR_RO(devspec);
 
 static ssize_t name_show(struct device *dev,
struct device_attribute *attr, char *buf)
@@ -406,19 +407,22 @@ static ssize_t name_show(struct device *dev,
ofdev = to_platform_device(dev);
return sprintf(buf, "%s\n", ofdev->dev.of_node->name);
 }
+static DEVICE_ATTR_RO(name);
 
 static ssize_t modalias_show(struct device *dev,
struct device_attribute *attr, char *buf)
 {
return of_device_modalias(dev, buf, PAGE_SIZE);
 }
+static DEVICE_ATTR_RO(modalias);
 
-static struct device_attribute ibmebus_bus_device_attrs[] = {
-   __ATTR_RO(devspec),
-   __ATTR_RO(name),
-   __ATTR_RO(modalias),
-   __ATTR_NULL
+static struct attribute *ibmebus_bus_device_attrs[] = {
+   &dev_attr_devspec.attr,
+   &dev_attr_name.attr,
+   &dev_attr_modalias.attr,
+   NULL,
 };
+ATTRIBUTE_GROUPS(ibmebus_bus_device);
 
 struct bus_type ibmebus_bus_type = {
.name  = "ibmebus",
@@ -428,7 +432,7 @@ struct bus_type ibmebus_bus_type = {
.probe = ibmebus_bus_device_probe,
.remove= ibmebus_bus_device_remove,
.shutdown  = ibmebus_bus_device_shutdown,
-   .dev_attrs = ibmebus_bus_device_attrs,
+   .dev_groups = ibmebus_bus_device_groups,
 };
 EXPORT_SYMBOL(ibmebus_bus_type);
 
-- 
2.13.0



[PATCH 08/16] powerpc: ps3: use dev_groups and not dev_attrs for bus_type

2017-06-06 Thread Greg Kroah-Hartman
The dev_attrs field has long been "depreciated" and is finally being
removed, so move the driver to use the "correct" dev_groups field
instead for struct bus_type.

Cc: Geoff Levand 
Cc: Benjamin Herrenschmidt 
Cc: Paul Mackerras 
Cc: Michael Ellerman 
Cc: 
Signed-off-by: Greg Kroah-Hartman 
---
 arch/powerpc/platforms/ps3/system-bus.c | 10 ++
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/platforms/ps3/system-bus.c 
b/arch/powerpc/platforms/ps3/system-bus.c
index 2d2e5f80a3d3..3e98b4ea3df9 100644
--- a/arch/powerpc/platforms/ps3/system-bus.c
+++ b/arch/powerpc/platforms/ps3/system-bus.c
@@ -471,11 +471,13 @@ static ssize_t modalias_show(struct device *_dev, struct 
device_attribute *a,
 
return (len >= PAGE_SIZE) ? (PAGE_SIZE - 1) : len;
 }
+static DEVICE_ATTR_RO(modalias);
 
-static struct device_attribute ps3_system_bus_dev_attrs[] = {
-   __ATTR_RO(modalias),
-   __ATTR_NULL,
+static struct attribute *ps3_system_bus_dev_attrs[] = {
+   &dev_attr_modalias.attr,
+   NULL,
 };
+ATTRIBUTE_GROUPS(ps3_system_bus);
 
 struct bus_type ps3_system_bus_type = {
.name = "ps3_system_bus",
@@ -484,7 +486,7 @@ struct bus_type ps3_system_bus_type = {
.probe = ps3_system_bus_probe,
.remove = ps3_system_bus_remove,
.shutdown = ps3_system_bus_shutdown,
-   .dev_attrs = ps3_system_bus_dev_attrs,
+   .dev_groups = ps3_system_bus_dev_groups,
 };
 
 static int __init ps3_system_bus_init(void)
-- 
2.13.0



[PATCH 07/16] macintosh: use dev_groups and not dev_attrs for bus_type

2017-06-06 Thread Greg Kroah-Hartman
The dev_attrs field has long been "depreciated" and is finally being
removed, so move the driver to use the "correct" dev_groups field
instead for struct bus_type.

Cc: Benjamin Herrenschmidt 
Cc: 
Signed-off-by: Greg Kroah-Hartman 
---
 drivers/macintosh/macio_asic.c  |  4 ++--
 drivers/macintosh/macio_sysfs.c | 29 +
 2 files changed, 23 insertions(+), 10 deletions(-)

diff --git a/drivers/macintosh/macio_asic.c b/drivers/macintosh/macio_asic.c
index f757cef293f8..62f541f968f6 100644
--- a/drivers/macintosh/macio_asic.c
+++ b/drivers/macintosh/macio_asic.c
@@ -133,7 +133,7 @@ static int macio_device_resume(struct device * dev)
return 0;
 }
 
-extern struct device_attribute macio_dev_attrs[];
+extern const struct attribute_group *macio_dev_groups[];
 
 struct bus_type macio_bus_type = {
.name   = "macio",
@@ -144,7 +144,7 @@ struct bus_type macio_bus_type = {
.shutdown = macio_device_shutdown,
.suspend= macio_device_suspend,
.resume = macio_device_resume,
-   .dev_attrs = macio_dev_attrs,
+   .dev_groups = macio_dev_groups,
 };
 
 static int __init macio_bus_driver_init(void)
diff --git a/drivers/macintosh/macio_sysfs.c b/drivers/macintosh/macio_sysfs.c
index 0b1f9c76c68d..2445274f7e4b 100644
--- a/drivers/macintosh/macio_sysfs.c
+++ b/drivers/macintosh/macio_sysfs.c
@@ -10,7 +10,8 @@ field##_show (struct device *dev, struct device_attribute 
*attr,  \
 {  \
struct macio_dev *mdev = to_macio_device (dev); \
return sprintf (buf, format_string, mdev->ofdev.dev.of_node->field); \
-}
+}  \
+static DEVICE_ATTR_RO(field);
 
 static ssize_t
 compatible_show (struct device *dev, struct device_attribute *attr, char *buf)
@@ -37,6 +38,7 @@ compatible_show (struct device *dev, struct device_attribute 
*attr, char *buf)
 
return length;
 }
+static DEVICE_ATTR_RO(compatible);
 
 static ssize_t modalias_show (struct device *dev, struct device_attribute 
*attr,
  char *buf)
@@ -52,15 +54,26 @@ static ssize_t devspec_show(struct device *dev,
ofdev = to_platform_device(dev);
return sprintf(buf, "%s\n", ofdev->dev.of_node->full_name);
 }
+static DEVICE_ATTR_RO(modalias);
+static DEVICE_ATTR_RO(devspec);
 
 macio_config_of_attr (name, "%s\n");
 macio_config_of_attr (type, "%s\n");
 
-struct device_attribute macio_dev_attrs[] = {
-   __ATTR_RO(name),
-   __ATTR_RO(type),
-   __ATTR_RO(compatible),
-   __ATTR_RO(modalias),
-   __ATTR_RO(devspec),
-   __ATTR_NULL
+static struct attribute *macio_dev_attrs[] = {
+   &dev_attr_name.attr,
+   &dev_attr_type.attr,
+   &dev_attr_compatible.attr,
+   &dev_attr_modalias.attr,
+   &dev_attr_devspec.attr,
+   NULL,
+};
+
+static const struct attribute_group macio_dev_group = {
+   .attrs = macio_dev_attrs,
+};
+
+const struct attribute_group *macio_dev_groups[] = {
+   &macio_dev_group,
+   NULL,
 };
-- 
2.13.0



Re: [PATCH v3 3/9] powerpc/kprobes/optprobes: Move over to patch_instruction

2017-06-06 Thread Naveen N. Rao
On 2017/06/06 02:29PM, Balbir Singh wrote:
> With text moving to read-only migrate optprobes to using
> the patch_instruction infrastructure. Without this optprobes
> will fail and complain.
> 
> Signed-off-by: Balbir Singh 
> ---
>  arch/powerpc/kernel/optprobes.c | 58 
> ++---
>  1 file changed, 37 insertions(+), 21 deletions(-)
> 
> diff --git a/arch/powerpc/kernel/optprobes.c b/arch/powerpc/kernel/optprobes.c
> index ec60ed0..1c7326c 100644
> --- a/arch/powerpc/kernel/optprobes.c
> +++ b/arch/powerpc/kernel/optprobes.c
> @@ -158,12 +158,13 @@ void arch_remove_optimized_kprobe(struct 
> optimized_kprobe *op)
>  void patch_imm32_load_insns(unsigned int val, kprobe_opcode_t *addr)
>  {
>   /* addis r4,0,(insn)@h */
> - *addr++ = PPC_INST_ADDIS | ___PPC_RT(4) |
> -   ((val >> 16) & 0x);
> + patch_instruction((unsigned int *)addr, PPC_INST_ADDIS | ___PPC_RT(4) |

We can probably get rid of those casts, seeing as we're not using it in 
kprobes.c.

> + ((val >> 16) & 0x));
> + addr++;
> 
>   /* ori r4,r4,(insn)@l */
> - *addr = PPC_INST_ORI | ___PPC_RA(4) | ___PPC_RS(4) |
> - (val & 0x);
> + patch_instruction((unsigned int *)addr, PPC_INST_ORI | ___PPC_RA(4) |
> + ___PPC_RS(4) | (val & 0x));
>  }
> 
>  /*
> @@ -173,24 +174,28 @@ void patch_imm32_load_insns(unsigned int val, 
> kprobe_opcode_t *addr)
>  void patch_imm64_load_insns(unsigned long val, kprobe_opcode_t *addr)
>  {
>   /* lis r3,(op)@highest */
> - *addr++ = PPC_INST_ADDIS | ___PPC_RT(3) |
> -   ((val >> 48) & 0x);
> + patch_instruction((unsigned int *)addr, PPC_INST_ADDIS | ___PPC_RT(3) |
> + ((val >> 48) & 0x));
> + addr++;
> 
>   /* ori r3,r3,(op)@higher */
> - *addr++ = PPC_INST_ORI | ___PPC_RA(3) | ___PPC_RS(3) |
> -   ((val >> 32) & 0x);
> + patch_instruction((unsigned int *)addr, PPC_INST_ORI | ___PPC_RA(3) |
> + ___PPC_RS(3) | ((val >> 32) & 0x));
> + addr++;
> 
>   /* rldicr r3,r3,32,31 */
> - *addr++ = PPC_INST_RLDICR | ___PPC_RA(3) | ___PPC_RS(3) |
> -   __PPC_SH64(32) | __PPC_ME64(31);
> + patch_instruction((unsigned int *)addr, PPC_INST_RLDICR | ___PPC_RA(3) |
> + ___PPC_RS(3) | __PPC_SH64(32) | __PPC_ME64(31));
> + addr++;
> 
>   /* oris r3,r3,(op)@h */
> - *addr++ = PPC_INST_ORIS | ___PPC_RA(3) | ___PPC_RS(3) |
> -   ((val >> 16) & 0x);
> + patch_instruction((unsigned int *)addr, PPC_INST_ORIS | ___PPC_RA(3) |
> + ___PPC_RS(3) | ((val >> 16) & 0x));
> + addr++;
> 
>   /* ori r3,r3,(op)@l */
> - *addr = PPC_INST_ORI | ___PPC_RA(3) | ___PPC_RS(3) |
> - (val & 0x);
> + patch_instruction((unsigned int *)addr, PPC_INST_ORI | ___PPC_RA(3) |
> + ___PPC_RS(3) | (val & 0x));
>  }
> 
>  int arch_prepare_optimized_kprobe(struct optimized_kprobe *op, struct kprobe 
> *p)
> @@ -198,7 +203,8 @@ int arch_prepare_optimized_kprobe(struct optimized_kprobe 
> *op, struct kprobe *p)
>   kprobe_opcode_t *buff, branch_op_callback, branch_emulate_step;
>   kprobe_opcode_t *op_callback_addr, *emulate_step_addr;
>   long b_offset;
> - unsigned long nip;
> + unsigned long nip, size;
> + int rc, i;
> 
>   kprobe_ppc_optinsn_slots.insn_size = MAX_OPTINSN_SIZE;
> 
> @@ -231,8 +237,15 @@ int arch_prepare_optimized_kprobe(struct 
> optimized_kprobe *op, struct kprobe *p)
>   goto error;
> 
>   /* Setup template */
> - memcpy(buff, optprobe_template_entry,
> - TMPL_END_IDX * sizeof(kprobe_opcode_t));
> + /* We can optimize this via patch_instruction_window later */

This probably needs a TODO just so it's clear. I do think this would be 
good to add since we copy many instructions while setting up the 
optprobe, so this is quite slow as it exists today.

> + size = (TMPL_END_IDX * sizeof(kprobe_opcode_t)) / sizeof(int);

That's just TMPL_END_IDX.

Thanks,
Naveen

> + pr_devel("Copying template to %p, size %lu\n", buff, size);
> + for (i = 0; i < size; i++) {
> + rc = patch_instruction((unsigned int *)buff + i,
> + *((unsigned int *)(optprobe_template_entry) + i));
> + if (rc < 0)
> + goto error;
> + }
> 
>   /*
>* Fixup the template with instructions to:
> @@ -261,8 +274,10 @@ int arch_prepare_optimized_kprobe(struct 
> optimized_kprobe *op, struct kprobe *p)
>   if (!branch_op_callback || !branch_emulate_step)
>   goto error;
> 
> - buff[TMPL_CALL_HDLR_IDX] = branch_op_callback;
> - buff[TMPL_EMULATE_IDX] = branch_emulate_step;
> + patch_instruction((unsigned int *)buff + TMPL_CALL_HDL

Re: [PATCH v3 2/9] powerpc/kprobes: Move kprobes over to patch_instruction

2017-06-06 Thread Naveen N. Rao
Hi Balbir,

On 2017/06/06 02:29PM, Balbir Singh wrote:
> arch_arm/disarm_probe use direct assignment for copying
> instructions, replace them with patch_instruction
> 
> Signed-off-by: Balbir Singh 
> ---
>  arch/powerpc/kernel/kprobes.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/powerpc/kernel/kprobes.c b/arch/powerpc/kernel/kprobes.c
> index fc43435..b49f8f0 100644
> --- a/arch/powerpc/kernel/kprobes.c
> +++ b/arch/powerpc/kernel/kprobes.c
> @@ -158,7 +158,7 @@ NOKPROBE_SYMBOL(arch_prepare_kprobe);
> 
>  void arch_arm_kprobe(struct kprobe *p)
>  {
> - *p->addr = BREAKPOINT_INSTRUCTION;
> + patch_instruction(p->addr, BREAKPOINT_INSTRUCTION);
>   flush_icache_range((unsigned long) p->addr,
>  (unsigned long) p->addr + sizeof(kprobe_opcode_t));

Do we still need flush_icache_range() after patch_instruction()?

- Naveen

>  }
> @@ -166,7 +166,7 @@ NOKPROBE_SYMBOL(arch_arm_kprobe);
> 
>  void arch_disarm_kprobe(struct kprobe *p)
>  {
> - *p->addr = p->opcode;
> + patch_instruction(p->addr, p->opcode);
>   flush_icache_range((unsigned long) p->addr,
>  (unsigned long) p->addr + sizeof(kprobe_opcode_t));
>  }
> -- 
> 2.9.4
> 



Re: [Patch 2/2]: powerpc/hotplug/mm: Fix hot-add memory node assoc

2017-06-06 Thread Michael Bringmann


On 06/06/2017 04:48 AM, Michael Ellerman wrote:
> Michael Bringmann  writes:
>> On 06/01/2017 04:36 AM, Michael Ellerman wrote:
>>> Do you actually see mention of nodes 0 and 8 in the dmesg?
>>
>> When the 'numa.c' code is built with debug messages, and the system was
>> given that configuration by pHyp, yes, I did.
>>
>>> What does it say?
>>
>> The debug message for each core thread would be something like,
>>
>> removing cpu 64 from node 0
>> adding cpu 64 to node 8
>>
>> repeated for all 8 threads of the CPU, and usually with the messages
>> for all of the CPUs coming out intermixed on the console/dmesg log.
> 
> OK. I meant what do you see at boot.

Here is an example with nodes 0, 2, 6, and 7, where node 0 starts out empty:

[0.00] Initmem setup node 0
[0.00]   NODE_DATA [mem 0x3bff7d6300-0x3bff7d]
[0.00] NODE_DATA(0) on node 7
[0.00] Initmem setup node 2 [mem 0x-0x13]
[0.00]   NODE_DATA [mem 0x136300-0x13]
[0.00] Initmem setup node 6 [mem 0x14-0x34afff]
[0.00]   NODE_DATA [mem 0x34afff6300-0x34afff]
[0.00] Initmem setup node 7 [mem 0x34b000-0x3b]
[0.00]   NODE_DATA [mem 0x3bff7cc600-0x3bff7d62ff]

[0.00] Zone ranges:
[0.00]   DMA  [mem 0x-0x003b]
[0.00]   DMA32empty
[0.00]   Normal   empty
[0.00] Movable zone start for each node
[0.00] Early memory node ranges
[0.00]   node   2: [mem 0x-0x0013]
[0.00]   node   6: [mem 0x0014-0x0034afff]
[0.00]   node   7: [mem 0x0034b000-0x003b]
[0.00] Could not find start_pfn for node 0
[0.00] Initmem setup node 0 [mem 0x-0x]
[0.00] Initmem setup node 2 [mem 0x-0x0013]
[0.00] Initmem setup node 6 [mem 0x0014-0x0034afff]
[0.00] Initmem setup node 7 [mem 0x0034b000-0x003b]
[0.00] percpu: Embedded 3 pages/cpu @c03bf800 s155672 r0 d40936 
u262144
[0.00] Built 4 zonelists in Node order, mobility grouping on.  Total 
pages: 3928320

and,

[root@ltcalpine2-lp20 ~]# numactl --hardware
available: 4 nodes (0,2,6-7)
node 0 cpus:
node 0 size: 0 MB
node 0 free: 0 MB
node 2 cpus: 16 17 18 19 20 21 22 23 32 33 34 35 36 37 38 39 56 57 58 59 60 61 
62 63
node 2 size: 81792 MB
node 2 free: 81033 MB
node 6 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 24 25 26 27 28 29 30 31 40 
41 42 43 44 45 46 47
node 6 size: 133743 MB
node 6 free: 133097 MB
node 7 cpus: 48 49 50 51 52 53 54 55
node 7 size: 29877 MB
node 7 free: 29599 MB
node distances:
node   0   2   6   7
  0:  10  40  40  40
  2:  40  10  40  40
  6:  40  40  10  20
  7:  40  40  20  10
[root@ltcalpine2-lp20 ~]#

> 
> I'm curious how we're discovering node 0 and 8 at all if neither has any
> memory or CPUs assigned at boot.

Both are in the memory associativity lookup arrays.  And we are circling
back to the 

> 
>>> Right. So it's not that you're hot adding memory into a previously
>>> unseen node as you implied in earlier mails.
>>
>> In the sense that the nodes were defined in the device tree, that is correct.
> 
> Where are they defined in the device tree? That's what I'm trying to 
> understand.

The nodes for memory are defined one time in "ibm,associativity-lookup-arrays".
I wish that there was an official set of node properties in the device-tree,
instead of having them be the values of other properties.

> 
>> In the sense that those nodes are currently deleted from node_possible_map in
>> 'numa.c' by the instruction 'node_and(node_possible_map,node_possible_map,
>> node_online_map);', the nodes are no longer available to place memory or CPU.
> 
> Yeah I understand that part.
> 
>> Okay, I can try to insert code that extracts all of the nodes from the
>> ibm,associativity-lookup-arrays property and merge them with the nodes
>> put into the online map from the CPUs that were found previously during
>> boot of the powerpc code.
> 
> Hmm, will that work?

The nodes are defined in the associativity lookup array, so they have at least
been reserved for us by the pHyp.  On the other hand, if we are only to use
nodes that have resources at boot, why are there extra node values specified?

What I am not 100% clear on -- and why I preferred letting all possible nodes
originally defined, still be possible for subsequent hot-add operations -- is
whether the nodes to be used for hot-added CPUs would always be a subset of
the nodes used for hot-added memory.

* The hot-added CPUs in Shared CPU configurations may be mapped to nodes by
  the value returned to the kernel by the VPHN hcall.

* So far in my tests, this has not been a problem, but I could not be positive
  from the PAPR.

> 
> Looking at PAPR it's not clear to me that it will work for nodes that
> have no mem

Re: [PATCH v2] workqueue: Fix edge cases for calc of pool's cpumask

2017-06-06 Thread Michael Bringmann


On 06/06/2017 09:20 AM, Tejun Heo wrote:
> Hello, Michael.
> 
> It would have been better to continue debugging in the prev thread.
> This still seems incorrect for the same reason as before.
> 
> On Tue, Jun 06, 2017 at 09:09:40AM -0500, Michael Bringmann wrote:
>> On NUMA systems with dynamic processors, the content of the cpumask
>> may change over time.  As new processors are added via DLPAR operations,
>> workqueues are created for them.  Depending upon the order in which CPUs
>> are added/removed, we may run into problems with the content of the
>> cpumask used by the workqueues.  This patch deals with situations where
>> the online cpumask for a node is a proper superset of possible cpumask
>> for the node.  It also deals with edge cases where the order in which
>> CPUs are removed/added from the online cpumask may leave the set for a
>> node empty, and require execution by CPUs on another node.
>>
>> In these and other cases, the patch attempts to ensure that a valid,
>> usable cpumask is used to set up newly created pools for workqueues.
>>
>> Signed-off-by: Tejun Heo  & Michael Bringmann 
>> 
> 
> Heh, you can't add sob's for other people.  For partial attributions,
> you can just note in the description.

Sorry for the error.
> 
>> diff --git a/kernel/workqueue.c b/kernel/workqueue.c
>> index c74bf39..460de61 100644
>> --- a/kernel/workqueue.c
>> +++ b/kernel/workqueue.c
>> @@ -3366,6 +3366,9 @@ static struct worker_pool *get_unbound_pool(const 
>> struct workqueue_attrs *attrs)
>>  copy_workqueue_attrs(pool->attrs, attrs);
>>  pool->node = target_node;
>>  
>> +if (!cpumask_weight(pool->attrs->cpumask))
>> +cpumask_copy(pool->attrs->cpumask, 
>> cpumask_of(smp_processor_id()));
> 
> So, this is still wrong.

It only catches if something has gone wrong before.  The alternative in this 
case
would be,

BUG(!cpumask_weight(pool->attrs->cpumask));

> 
>>  /*
>>   * no_numa isn't a worker_pool attribute, always clear it.  See
>>   * 'struct workqueue_attrs' comments for detail.
>> @@ -3559,13 +3562,13 @@ static struct pool_workqueue 
>> *alloc_unbound_pwq(struct workqueue_struct *wq,
>>   * stable.
>>   *
>>   * Return: %true if the resulting @cpumask is different from 
>> @attrs->cpumask,
>> - * %false if equal.
>> + * %false if equal.  On %false return, the content of @cpumask is undefined.
>>   */
>>  static bool wq_calc_node_cpumask(const struct workqueue_attrs *attrs, int 
>> node,
>>   int cpu_going_down, cpumask_t *cpumask)
>>  {
>>  if (!wq_numa_enabled || attrs->no_numa)
>> -goto use_dfl;
>> +return false;
>>  
>>  /* does @node have any online CPUs @attrs wants? */
>>  cpumask_and(cpumask, cpumask_of_node(node), attrs->cpumask);
>> @@ -3573,15 +3576,13 @@ static bool wq_calc_node_cpumask(const struct 
>> workqueue_attrs *attrs, int node,
>>  cpumask_clear_cpu(cpu_going_down, cpumask);
>>  
>>  if (cpumask_empty(cpumask))
>> -goto use_dfl;
>> +return false;
>>  
>>  /* yeap, return possible CPUs in @node that @attrs wants */
>>  cpumask_and(cpumask, attrs->cpumask, wq_numa_possible_cpumask[node]);
>> -return !cpumask_equal(cpumask, attrs->cpumask);
>>  
>> -use_dfl:
>> -cpumask_copy(cpumask, attrs->cpumask);
>> -return false;
>> +return !cpumask_empty(cpumask) &&
>> +!cpumask_equal(cpumask, attrs->cpumask);
> 
> And this part doesn't really change that.
> 
> CPUs going offline or online shouldn't change their relation to
> wq_numa_possible_cpumask.  I wonder whether the arch code is changing
> CPU id <-> NUMA node mapping on CPU on/offlining.  x86 used to do that
> too and got recently modified.  Can you see whether that's the case?

The bug that I see does not appear to be related to a changing CPU/node mapping
-- the CPUs are not changing their node placement when going offline/online. Rather, new CPUs
are being hot-added to the system (i.e. they were not present at boot), and the
node to which they are being added had no CPUs at boot.

> 
> Thanks.
> 

Thanks.

-- 
Michael W. Bringmann
Linux Technology Center
IBM Corporation
Tie-Line  363-5196
External: (512) 286-5196
Cell:   (512) 466-0650
m...@linux.vnet.ibm.com



Re: [PATCH v2] workqueue: Fix edge cases for calc of pool's cpumask

2017-06-06 Thread Tejun Heo
Hello,

On Tue, Jun 06, 2017 at 09:34:05AM -0500, Michael Bringmann wrote:
> >> diff --git a/kernel/workqueue.c b/kernel/workqueue.c
> >> index c74bf39..460de61 100644
> >> --- a/kernel/workqueue.c
> >> +++ b/kernel/workqueue.c
> >> @@ -3366,6 +3366,9 @@ static struct worker_pool *get_unbound_pool(const 
> >> struct workqueue_attrs *attrs)
> >>copy_workqueue_attrs(pool->attrs, attrs);
> >>pool->node = target_node;
> >>  
> >> +  if (!cpumask_weight(pool->attrs->cpumask))
> >> +  cpumask_copy(pool->attrs->cpumask, 
> >> cpumask_of(smp_processor_id()));
> > 
> > So, this is still wrong.
> 
> It only catches if something has gone wrong before.  The alternative in this 
> case
> would be,
> 
>   BUG(!cpumask_weight(pool->attrs->cpumask));

I'm kinda confused, so the hotplug problems you see across hotplugs go
away without the above change and the above is "just in case"?

> >> @@ -3559,13 +3562,13 @@ static struct pool_workqueue 
> >> *alloc_unbound_pwq(struct workqueue_struct *wq,
> >>   * stable.
> >>   *
> >>   * Return: %true if the resulting @cpumask is different from 
> >> @attrs->cpumask,
> >> - * %false if equal.
> >> + * %false if equal.  On %false return, the content of @cpumask is 
> >> undefined.
> >>   */
> >>  static bool wq_calc_node_cpumask(const struct workqueue_attrs *attrs, int 
> >> node,
> >> int cpu_going_down, cpumask_t *cpumask)
> >>  {
> >>if (!wq_numa_enabled || attrs->no_numa)
> >> -  goto use_dfl;
> >> +  return false;
> >>  
> >>/* does @node have any online CPUs @attrs wants? */
> >>cpumask_and(cpumask, cpumask_of_node(node), attrs->cpumask);
> >> @@ -3573,15 +3576,13 @@ static bool wq_calc_node_cpumask(const struct 
> >> workqueue_attrs *attrs, int node,
> >>cpumask_clear_cpu(cpu_going_down, cpumask);
> >>  
> >>if (cpumask_empty(cpumask))
> >> -  goto use_dfl;
> >> +  return false;
> >>  
> >>/* yeap, return possible CPUs in @node that @attrs wants */
> >>cpumask_and(cpumask, attrs->cpumask, wq_numa_possible_cpumask[node]);
> >> -  return !cpumask_equal(cpumask, attrs->cpumask);
> >>  
> >> -use_dfl:
> >> -  cpumask_copy(cpumask, attrs->cpumask);
> >> -  return false;
> >> +  return !cpumask_empty(cpumask) &&
> >> +  !cpumask_equal(cpumask, attrs->cpumask);
> > 
> > And this part doesn't really change that.
> > 
> > CPUs going offline or online shouldn't change their relation to
> > wq_numa_possible_cpumask.  I wonder whether the arch code is changing
> > CPU id <-> NUMA node mapping on CPU on/offlining.  x86 used to do that
> > too and got recently modified.  Can you see whether that's the case?
> 
> The bug that I see does not appear to be related to a change in the
> CPU/node mapping -- the CPUs are not changing their node when going
> offline/online.  Rather, new CPUs are being hot-added to the system
> (i.e. they were not present at boot), and the node to which they are
> being added had no CPUs at boot.

Can you please post the messages with the debug patch from the prev
thread?  In fact, let's please continue on that thread.  I'm having a
hard time following what's going wrong with the code.

Thanks.

-- 
tejun


Re: [PATCH v2] workqueue: Fix edge cases for calc of pool's cpumask

2017-06-06 Thread Tejun Heo
Hello, Michael.

It would have been better to continue debugging in the prev thread.
This still seems incorrect for the same reason as before.

On Tue, Jun 06, 2017 at 09:09:40AM -0500, Michael Bringmann wrote:
> On NUMA systems with dynamic processors, the content of the cpumask
> may change over time.  As new processors are added via DLPAR operations,
> workqueues are created for them.  Depending upon the order in which CPUs
> are added/removed, we may run into problems with the content of the
> cpumask used by the workqueues.  This patch deals with situations where
> the online cpumask for a node is a proper superset of possible cpumask
> for the node.  It also deals with edge cases where the order in which
> CPUs are removed/added from the online cpumask may leave the set for a
> node empty, and require execution by CPUs on another node.
> 
> In these and other cases, the patch attempts to ensure that a valid,
> usable cpumask is used to set up newly created pools for workqueues.
> 
> Signed-off-by: Tejun Heo  & Michael Bringmann 
> 

Heh, you can't add sob's for other people.  For partial attributions,
you can just note in the description.

> diff --git a/kernel/workqueue.c b/kernel/workqueue.c
> index c74bf39..460de61 100644
> --- a/kernel/workqueue.c
> +++ b/kernel/workqueue.c
> @@ -3366,6 +3366,9 @@ static struct worker_pool *get_unbound_pool(const 
> struct workqueue_attrs *attrs)
>   copy_workqueue_attrs(pool->attrs, attrs);
>   pool->node = target_node;
>  
> + if (!cpumask_weight(pool->attrs->cpumask))
> + cpumask_copy(pool->attrs->cpumask, 
> cpumask_of(smp_processor_id()));

So, this is still wrong.

>   /*
>* no_numa isn't a worker_pool attribute, always clear it.  See
>* 'struct workqueue_attrs' comments for detail.
> @@ -3559,13 +3562,13 @@ static struct pool_workqueue 
> *alloc_unbound_pwq(struct workqueue_struct *wq,
>   * stable.
>   *
>   * Return: %true if the resulting @cpumask is different from @attrs->cpumask,
> - * %false if equal.
> + * %false if equal.  On %false return, the content of @cpumask is undefined.
>   */
>  static bool wq_calc_node_cpumask(const struct workqueue_attrs *attrs, int 
> node,
>int cpu_going_down, cpumask_t *cpumask)
>  {
>   if (!wq_numa_enabled || attrs->no_numa)
> - goto use_dfl;
> + return false;
>  
>   /* does @node have any online CPUs @attrs wants? */
>   cpumask_and(cpumask, cpumask_of_node(node), attrs->cpumask);
> @@ -3573,15 +3576,13 @@ static bool wq_calc_node_cpumask(const struct 
> workqueue_attrs *attrs, int node,
>   cpumask_clear_cpu(cpu_going_down, cpumask);
>  
>   if (cpumask_empty(cpumask))
> - goto use_dfl;
> + return false;
>  
>   /* yeap, return possible CPUs in @node that @attrs wants */
>   cpumask_and(cpumask, attrs->cpumask, wq_numa_possible_cpumask[node]);
> - return !cpumask_equal(cpumask, attrs->cpumask);
>  
> -use_dfl:
> - cpumask_copy(cpumask, attrs->cpumask);
> - return false;
> + return !cpumask_empty(cpumask) &&
> + !cpumask_equal(cpumask, attrs->cpumask);

And this part doesn't really change that.

CPUs going offline or online shouldn't change their relation to
wq_numa_possible_cpumask.  I wonder whether the arch code is changing
CPU id <-> NUMA node mapping on CPU on/offlining.  x86 used to do that
too and got recently modified.  Can you see whether that's the case?

Thanks.

-- 
tejun


[PATCH v2] workqueue: Fix edge cases for calc of pool's cpumask

2017-06-06 Thread Michael Bringmann

On NUMA systems with dynamic processors, the content of the cpumask
may change over time.  As new processors are added via DLPAR operations,
workqueues are created for them.  Depending upon the order in which CPUs
are added/removed, we may run into problems with the content of the
cpumask used by the workqueues.  This patch deals with situations where
the online cpumask for a node is a proper superset of possible cpumask
for the node.  It also deals with edge cases where the order in which
CPUs are removed/added from the online cpumask may leave the set for a
node empty, and require execution by CPUs on another node.

In these and other cases, the patch attempts to ensure that a valid,
usable cpumask is used to set up newly created pools for workqueues.

Signed-off-by: Tejun Heo  & Michael Bringmann 

---
Changes in V2:
  -- Merge in additional logic fixes provided by Tejun Heo.
  -- Revise safety check added to get_unbound_pool to trigger only
 in the event of a dangerously invalid cpumask attribute.
---
 kernel/workqueue.c |   15 ---
 1 file changed, 8 insertions(+), 7 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index c74bf39..460de61 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -3366,6 +3366,9 @@ static struct worker_pool *get_unbound_pool(const struct 
workqueue_attrs *attrs)
copy_workqueue_attrs(pool->attrs, attrs);
pool->node = target_node;
 
+   if (!cpumask_weight(pool->attrs->cpumask))
+   cpumask_copy(pool->attrs->cpumask, 
cpumask_of(smp_processor_id()));
+
/*
 * no_numa isn't a worker_pool attribute, always clear it.  See
 * 'struct workqueue_attrs' comments for detail.
@@ -3559,13 +3562,13 @@ static struct pool_workqueue *alloc_unbound_pwq(struct 
workqueue_struct *wq,
  * stable.
  *
  * Return: %true if the resulting @cpumask is different from @attrs->cpumask,
- * %false if equal.
+ * %false if equal.  On %false return, the content of @cpumask is undefined.
  */
 static bool wq_calc_node_cpumask(const struct workqueue_attrs *attrs, int node,
 int cpu_going_down, cpumask_t *cpumask)
 {
if (!wq_numa_enabled || attrs->no_numa)
-   goto use_dfl;
+   return false;
 
/* does @node have any online CPUs @attrs wants? */
cpumask_and(cpumask, cpumask_of_node(node), attrs->cpumask);
@@ -3573,15 +3576,13 @@ static bool wq_calc_node_cpumask(const struct 
workqueue_attrs *attrs, int node,
cpumask_clear_cpu(cpu_going_down, cpumask);
 
if (cpumask_empty(cpumask))
-   goto use_dfl;
+   return false;
 
/* yeap, return possible CPUs in @node that @attrs wants */
cpumask_and(cpumask, attrs->cpumask, wq_numa_possible_cpumask[node]);
-   return !cpumask_equal(cpumask, attrs->cpumask);
 
-use_dfl:
-   cpumask_copy(cpumask, attrs->cpumask);
-   return false;
+   return !cpumask_empty(cpumask) &&
+   !cpumask_equal(cpumask, attrs->cpumask);
 }
 
 /* install @pwq into @wq's numa_pwq_tbl[] for @node and return the old pwq */



[PATCH] driver core: remove CLASS_ATTR usage

2017-06-06 Thread Greg KH
From: Greg Kroah-Hartman 

There were only 2 remaining users of CLASS_ATTR(), so let's finally get
rid of them and force everyone to use the correct RW/RO/WO versions
instead.

Cc: Benjamin Herrenschmidt 
Cc: Paul Mackerras 
Cc: Michael Ellerman 
Signed-off-by: Greg Kroah-Hartman 
---

PPC maintainers, can I take this in my driver core tree?  I'm doing some
cleanups of device.h and this was one of the simpler ones at the moment.

thanks,

greg k-h

 arch/powerpc/platforms/pseries/dlpar.c|2 +-
 arch/powerpc/platforms/pseries/mobility.c |7 ---
 include/linux/device.h|2 --
 3 files changed, 5 insertions(+), 6 deletions(-)

--- a/arch/powerpc/platforms/pseries/dlpar.c
+++ b/arch/powerpc/platforms/pseries/dlpar.c
@@ -588,7 +588,7 @@ static ssize_t dlpar_show(struct class *
return sprintf(buf, "%s\n", "memory,cpu");
 }
 
-static CLASS_ATTR(dlpar, S_IWUSR | S_IRUSR, dlpar_show, dlpar_store);
+static CLASS_ATTR_RW(dlpar);
 
 static int __init pseries_dlpar_init(void)
 {
--- a/arch/powerpc/platforms/pseries/mobility.c
+++ b/arch/powerpc/platforms/pseries/mobility.c
@@ -349,8 +349,9 @@ void post_mobility_fixup(void)
return;
 }
 
-static ssize_t migrate_store(struct class *class, struct class_attribute *attr,
-const char *buf, size_t count)
+static ssize_t migration_store(struct class *class,
+  struct class_attribute *attr, const char *buf,
+  size_t count)
 {
u64 streamid;
int rc;
@@ -380,7 +381,7 @@ static ssize_t migrate_store(struct clas
  */
 #define MIGRATION_API_VERSION  1
 
-static CLASS_ATTR(migration, S_IWUSR, NULL, migrate_store);
+static CLASS_ATTR_WO(migration);
 static CLASS_ATTR_STRING(api_version, S_IRUGO, 
__stringify(MIGRATION_API_VERSION));
 
 static int __init mobility_sysfs_init(void)
--- a/include/linux/device.h
+++ b/include/linux/device.h
@@ -465,8 +465,6 @@ struct class_attribute {
const char *buf, size_t count);
 };
 
-#define CLASS_ATTR(_name, _mode, _show, _store) \
-   struct class_attribute class_attr_##_name = __ATTR(_name, _mode, _show, 
_store)
 #define CLASS_ATTR_RW(_name) \
struct class_attribute class_attr_##_name = __ATTR_RW(_name)
 #define CLASS_ATTR_RO(_name) \


Re: [PATCH 2/5] powerpc/mm: split store_updates_sp() in two parts in do_page_fault()

2017-06-06 Thread Christophe LEROY



On 06/06/2017 at 13:00, Michael Ellerman wrote:

christophe leroy  writes:


On 05/06/2017 at 12:45, Michael Ellerman wrote:

Christophe LEROY  writes:


On 02/06/2017 at 11:26, Michael Ellerman wrote:

Christophe Leroy  writes:


Only the get_user() in store_updates_sp() has to be done outside
the mm semaphore. All the comparison can be done within the semaphore,
so only when really needed.

As we got a DSI exception, the address pointed by regs->nip is
obviously valid, otherwise we would have had an instruction exception.
So __get_user() can be used instead of get_user()


I don't think that part is true.

You took a DSI so there *was* an instruction at NIP, but since then it
may have been unmapped by another thread.

So I don't think you can assume the get_user() will succeed.


The difference between get_user() and __get_user() is that get_user()
performs an access_ok() in addition.

Doesn't access_ok() only check whether addr is below TASK_SIZE to
ensure it is a valid user address ?


Yeah more or less, via some gross macros.

I was actually not that worried about the switch from get_user() to
__get_user(), but rather that you removed the check of the return value.
ie.

-   if (get_user(inst, (unsigned int __user *)regs->nip))
-   return 0;

Became:

if (is_write && user_mode(regs))
-   store_update_sp = store_updates_sp(regs);
+   __get_user(inst, (unsigned int __user *)regs->nip);


I think dropping the access_ok() probably is alright, because the NIP
must (should!) have been in userspace, though as Ben says it's always
good to be paranoid.

But ignoring that the address can fault at all is wrong AFAICS.


I see what you mean now.

Indeed,

-   unsigned int inst;

Became

+   unsigned int inst = 0;

Since __get_user() doesn't modify 'inst' in case of error, 'inst'
remains 0, and store_updates_sp(0) returns false. That was the idea behind it.


Ugh. OK, my bad. Though it is a little subtle.

How about:

@@ -286,10 +290,13 @@ int do_page_fault(struct pt_regs *regs, unsigned long 
address,
 /*
  * We want to do this outside mmap_sem, because reading code around nip
  * can result in fault, which will cause a deadlock when called with
-* mmap_sem held
+* mmap_sem held. We don't need to check if get_user() fails, if it does
+* it won't modify inst, and an inst of 0 will return false from
+* store_updates_sp().
  */
+   inst = 0;
 if (is_write && is_user)
-   store_update_sp = store_updates_sp(regs);
+   get_user(inst, (unsigned int __user *)regs->nip);
  
 if (is_user)

 flags |= FAULT_FLAG_USER;


Then this one can go in.



I just submitted the v4 version of the patch "powerpc/mm: Only read faulting 
instruction when necessary in do_page_fault()", skipping this step and 
going directly to the final solution.
The new approach is to keep everything inside the store_updates_sp() 
function and just move the call.


Christophe


[PATCH v4] powerpc/mm: Only read faulting instruction when necessary in do_page_fault()

2017-06-06 Thread Christophe Leroy
Commit a7a9dcd882a67 ("powerpc: Avoid taking a data miss on every
userspace instruction miss") has shown that limiting the read of
faulting instruction to likely cases improves performance.

This patch goes further in this direction by limiting the read
of the faulting instruction to only the cases where it is definitely
needed.

On an MPC885, with the same benchmark app as in the commit referred
above, we see a reduction of 4000 dTLB misses (approx 3%):

Before the patch:
 Performance counter stats for './fault 500' (10 runs):

 720495838  cpu-cycles  ( +-  0.04% )
141769  dTLB-load-misses( +-  0.02% )
 52722  iTLB-load-misses( +-  0.01% )
 19611  faults  ( +-  0.02% )

   5.750535176 seconds time elapsed ( +-  0.16% )

With the patch:
 Performance counter stats for './fault 500' (10 runs):

 717669123  cpu-cycles  ( +-  0.02% )
137344  dTLB-load-misses( +-  0.03% )
 52731  iTLB-load-misses( +-  0.01% )
 19614  faults  ( +-  0.03% )

   5.728423115 seconds time elapsed ( +-  0.14% )

Signed-off-by: Christophe Leroy 
---
 v4: Rebased on top of powerpc/next (f718d426d7e42e) and doing access_ok() 
verification before __get_user_xxx()

 v3: Do a first try with pagefault disabled before releasing the semaphore

 v2: Changes 'if (cond1) if (cond2)' by 'if (cond1 && cond2)'

 arch/powerpc/mm/fault.c | 60 +++--
 1 file changed, 43 insertions(+), 17 deletions(-)

diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c
index 4c422632047b..add2166d2459 100644
--- a/arch/powerpc/mm/fault.c
+++ b/arch/powerpc/mm/fault.c
@@ -72,26 +72,50 @@ static inline int notify_page_fault(struct pt_regs *regs)
 /*
  * Check whether the instruction at regs->nip is a store using
  * an update addressing form which will update r1.
+ * If no, returns 0 with mmap_sem released
+ * If yes, returns 1 if mmap_sem hasn't been released
+ * If yes, returns 2 if mmap_sem has been released
  */
 static int store_updates_sp(struct pt_regs *regs)
 {
unsigned int inst;
+   unsigned int __user *nip = (unsigned int __user *)regs->nip;
+   int ret;
+   bool is_mm_locked = true;
+
+   /*
+* We want to do this outside mmap_sem, because reading code around nip
+* can result in fault, which will cause a deadlock when called with
+* mmap_sem held. However, we do a first try with pagefault disabled as
+* a fault here is very unlikely.
+*/
+   if (!access_ok(VERIFY_READ, nip, sizeof(inst)))
+   goto failed;
+
+   pagefault_disable();
+   ret = __get_user_inatomic(inst, nip);
+   pagefault_enable();
+   if (ret) {
+   up_read(&current->mm->mmap_sem);
+   is_mm_locked = false;
+   if (__get_user(inst, nip))
+   goto failed;
+   }
 
-   if (get_user(inst, (unsigned int __user *)regs->nip))
-   return 0;
/* check for 1 in the rA field */
if (((inst >> 16) & 0x1f) != 1)
-   return 0;
+   goto failed;
/* check major opcode */
switch (inst >> 26) {
+   case 62:/* std or stdu */
+   if ((inst & 3) == 0)
+   break;
case 37:/* stwu */
case 39:/* stbu */
case 45:/* sthu */
case 53:/* stfsu */
case 55:/* stfdu */
-   return 1;
-   case 62:/* std or stdu */
-   return (inst & 3) == 1;
+   return is_mm_locked ? 1 : 2;
case 31:
/* check minor opcode */
switch ((inst >> 1) & 0x3ff) {
@@ -101,9 +125,13 @@ static int store_updates_sp(struct pt_regs *regs)
case 439:   /* sthux */
case 695:   /* stfsux */
case 759:   /* stfdux */
-   return 1;
+   return is_mm_locked ? 1 : 2;
}
}
+failed:
+   if (is_mm_locked)
+   up_read(&current->mm->mmap_sem);
+
return 0;
 }
 /*
@@ -283,14 +311,6 @@ int do_page_fault(struct pt_regs *regs, unsigned long 
address,
 
perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address);
 
-   /*
-* We want to do this outside mmap_sem, because reading code around nip
-* can result in fault, which will cause a deadlock when called with
-* mmap_sem held
-*/
-   if (is_write && is_user)
-   store_update_sp = store_updates_sp(regs);
-
if (is_user)
flags |= FAULT_FLAG_USER;
 
@@ -359,8 +379,14 @@ int do_page_fault(struct pt_regs *regs, unsigned long 
address,
 * between the last mapped region and the stack will

power server power consumption problem

2017-06-06 Thread ??????????
I used ipmitool to measure server power consumption and got the results
shown in the picture below, but no total power consumption is reported.
Should I add up all the individual power values to get the total power
consumption? I am eager to receive a reply. Thanks!

Re: [PATCH v4 1/2] tty: add compat_ioctl callbacks

2017-06-06 Thread Arnd Bergmann
On Sat, Jun 3, 2017 at 4:15 PM, Aleksa Sarai  wrote:
> In order to avoid future diversions between fs/compat_ioctl.c and
> drivers/tty/pty.c, define .compat_ioctl callbacks for the relevant
> tty_operations structs. Since both pty_unix98_ioctl() and
> pty_bsd_ioctl() are compatible between 32-bit and 64-bit userspace no
> special translation is required.
>
> Signed-off-by: Aleksa Sarai 

Looks good,

Reviewed-by: Arnd Bergmann 


Re: [PATCH v3 1/2] tty: add compat_ioctl callbacks

2017-06-06 Thread Arnd Bergmann
On Tue, Jun 6, 2017 at 1:05 PM, Aleksa Sarai  wrote:
>>> diff --git a/drivers/tty/pty.c b/drivers/tty/pty.c
>>> index 65799575c666..2a6bd9ae3f8b 100644
>>> --- a/drivers/tty/pty.c
>>> +++ b/drivers/tty/pty.c
>>> @@ -481,6 +481,16 @@ static int pty_bsd_ioctl(struct tty_struct *tty,
>>>  return -ENOIOCTLCMD;
>>>   }
>>>
>>> +static long pty_bsd_compat_ioctl(struct tty_struct *tty,
>>> +unsigned int cmd, unsigned long arg)
>>> +{
>>> +   /*
>>> +* PTY ioctls don't require any special translation between
>>> 32-bit and
>>> +* 64-bit userspace, they are already compatible.
>>> +*/
>>> +   return pty_bsd_ioctl(tty, cmd, arg);
>>> +}
>>> +
>>
>>
>> This looks correct but unnecessary, you can simply point both
>> function pointers to the same function:
>
>
> They have different types, since they have different return types:
>
> int  (*ioctl)(struct tty_struct *tty,
> unsigned int cmd, unsigned long arg);
> long (*compat_ioctl)(struct tty_struct *tty,
>  unsigned int cmd, unsigned long arg);
>
> If you like, I can change (*ioctl) to return longs as well, and then change
> all of the call-sites (since unlocked_ioctl also returns long).

Ah, my mistake. In most other data structures that have a compat_ioctl
callback pointer, the prototypes are the same, and I had not realized
that tty_operations is an exception.

Arnd


[PATCH] include/linux/vfio.h: Guard powerpc-specific functions with CONFIG_VFIO_SPAPR_EEH

2017-06-06 Thread Murilo Opsfelder Araujo
When CONFIG_EEH=y and CONFIG_VFIO_SPAPR_EEH=n, build fails with the
following:

drivers/vfio/pci/vfio_pci.o: In function `.vfio_pci_release':
vfio_pci.c:(.text+0xa98): undefined reference to 
`.vfio_spapr_pci_eeh_release'
drivers/vfio/pci/vfio_pci.o: In function `.vfio_pci_open':
vfio_pci.c:(.text+0x1420): undefined reference to `.vfio_spapr_pci_eeh_open'

In this case, vfio_pci.c should use the empty definitions of
vfio_spapr_pci_eeh_open and vfio_spapr_pci_eeh_release functions.

This patch fixes it by guarding these function definitions with
CONFIG_VFIO_SPAPR_EEH, the symbol that controls whether vfio_spapr_eeh.c is
built, which is where the non-empty versions of these functions are.

This issue was found during a randconfig build. Logs are here:

http://kisskb.ellerman.id.au/kisskb/buildresult/12982362/

Signed-off-by: Murilo Opsfelder Araujo 
---
 include/linux/vfio.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/include/linux/vfio.h b/include/linux/vfio.h
index edf9b2c..0a05d57 100644
--- a/include/linux/vfio.h
+++ b/include/linux/vfio.h
@@ -150,7 +150,7 @@ extern int vfio_set_irqs_validate_and_prepare(struct 
vfio_irq_set *hdr,
  size_t *data_size);
 
 struct pci_dev;
-#ifdef CONFIG_EEH
+#ifdef CONFIG_VFIO_SPAPR_EEH
 extern void vfio_spapr_pci_eeh_open(struct pci_dev *pdev);
 extern void vfio_spapr_pci_eeh_release(struct pci_dev *pdev);
 extern long vfio_spapr_iommu_eeh_ioctl(struct iommu_group *group,
@@ -171,7 +171,7 @@ static inline long vfio_spapr_iommu_eeh_ioctl(struct 
iommu_group *group,
 {
return -ENOTTY;
 }
-#endif /* CONFIG_EEH */
+#endif /* CONFIG_VFIO_SPAPR_EEH */
 
 /*
  * IRQfd - generic
-- 
2.9.4



Re: [PATCH] powerpc/numa: Fix percpu allocations to be NUMA aware

2017-06-06 Thread Balbir Singh
On Fri, Jun 2, 2017 at 3:14 PM, Michael Ellerman  wrote:
> In commit 8c272261194d ("powerpc/numa: Enable USE_PERCPU_NUMA_NODE_ID"), we
> switched to the generic implementation of cpu_to_node(), which uses a percpu
> variable to hold the NUMA node for each CPU.
>
> Unfortunately we neglected to notice that we use cpu_to_node() in the 
> allocation
> of our percpu areas, leading to a chicken and egg problem. In practice what
> happens is when we are setting up the percpu areas, cpu_to_node() reports that
> all CPUs are on node 0, so we allocate all percpu areas on node 0.
>
> This is visible in the dmesg output, as all pcpu allocs being in group 0:
>
>   pcpu-alloc: [0] 00 01 02 03 [0] 04 05 06 07
>   pcpu-alloc: [0] 08 09 10 11 [0] 12 13 14 15
>   pcpu-alloc: [0] 16 17 18 19 [0] 20 21 22 23
>   pcpu-alloc: [0] 24 25 26 27 [0] 28 29 30 31
>   pcpu-alloc: [0] 32 33 34 35 [0] 36 37 38 39
>   pcpu-alloc: [0] 40 41 42 43 [0] 44 45 46 47
>
> To fix it we need an early_cpu_to_node() which can run prior to percpu being
> setup. We already have the numa_cpu_lookup_table we can use, so just plumb it
> in. With the patch dmesg output shows two groups, 0 and 1:
>
>   pcpu-alloc: [0] 00 01 02 03 [0] 04 05 06 07
>   pcpu-alloc: [0] 08 09 10 11 [0] 12 13 14 15
>   pcpu-alloc: [0] 16 17 18 19 [0] 20 21 22 23
>   pcpu-alloc: [1] 24 25 26 27 [1] 28 29 30 31
>   pcpu-alloc: [1] 32 33 34 35 [1] 36 37 38 39
>   pcpu-alloc: [1] 40 41 42 43 [1] 44 45 46 47
>
> We can also check the data_offset in the paca of various CPUs, with the fix we
> see:
>
>   CPU 0:  data_offset = 0x0ffe8b
>   CPU 24: data_offset = 0x1ffe5b
>
> And we can see from dmesg that CPU 24 has an allocation on node 1:
>
>   node   0: [mem 0x-0x000f]
>   node   1: [mem 0x0010-0x001f]
>
> Cc: sta...@vger.kernel.org # v3.16+
> Fixes: 8c272261194d ("powerpc/numa: Enable USE_PERCPU_NUMA_NODE_ID")
> Signed-off-by: Michael Ellerman 
> ---
>  arch/powerpc/include/asm/topology.h | 14 ++
>  arch/powerpc/kernel/setup_64.c  | 19 ++-
>  2 files changed, 28 insertions(+), 5 deletions(-)
>
> diff --git a/arch/powerpc/include/asm/topology.h 
> b/arch/powerpc/include/asm/topology.h
> index 8b3b46b7b0f2..8f3b2ec09b9e 100644
> --- a/arch/powerpc/include/asm/topology.h
> +++ b/arch/powerpc/include/asm/topology.h
> @@ -44,8 +44,22 @@ extern void __init dump_numa_cpu_topology(void);
>  extern int sysfs_add_device_to_node(struct device *dev, int nid);
>  extern void sysfs_remove_device_from_node(struct device *dev, int nid);
>
> +static inline int early_cpu_to_node(int cpu)
> +{
> +   int nid;
> +
> +   nid = numa_cpu_lookup_table[cpu];
> +
> +   /*
> +* Some functions, eg. node_distance() don't cope with -1, so instead
> +* fall back to node 0 if nid is unset (it should be, except bugs).
> +*/
> +   return (nid < 0) ? 0 : nid;
> +}
>  #else

Not sure if it's entirely related, but I had tried to do

https://patchwork.ozlabs.org/patch/683556/

to setup the mapping earlier, but we would have still missed the pcpu_fc_alloc.

Balbir


Re: [PATCH v3 1/2] tty: add compat_ioctl callbacks

2017-06-06 Thread Aleksa Sarai

diff --git a/Makefile b/Makefile
index 470bd4d9513a..fb689286d83a 100644
--- a/Makefile
+++ b/Makefile
@@ -401,6 +401,7 @@ KBUILD_CFLAGS   := -Wall -Wundef -Wstrict-prototypes 
-Wno-trigraphs \
-fno-strict-aliasing -fno-common \
-Werror-implicit-function-declaration \
-Wno-format-security \
+  -Wno-error=int-in-bool-context \
-std=gnu89 $(call cc-option,-fno-PIE)


This slipped in by accident, I assume? It seems completely unrelated.


Yeah, I re-sent v4 with this removed immediately afterwards.




diff --git a/drivers/tty/pty.c b/drivers/tty/pty.c
index 65799575c666..2a6bd9ae3f8b 100644
--- a/drivers/tty/pty.c
+++ b/drivers/tty/pty.c
@@ -481,6 +481,16 @@ static int pty_bsd_ioctl(struct tty_struct *tty,
 return -ENOIOCTLCMD;
  }

+static long pty_bsd_compat_ioctl(struct tty_struct *tty,
+unsigned int cmd, unsigned long arg)
+{
+   /*
+* PTY ioctls don't require any special translation between 32-bit and
+* 64-bit userspace, they are already compatible.
+*/
+   return pty_bsd_ioctl(tty, cmd, arg);
+}
+


This looks correct but unnecessary, you can simply point both
function pointers to the same function:


They have different types, since they have different return types:

int  (*ioctl)(struct tty_struct *tty,
unsigned int cmd, unsigned long arg);
long (*compat_ioctl)(struct tty_struct *tty,
 unsigned int cmd, unsigned long arg);

If you like, I can change (*ioctl) to return longs as well, and then 
change all of the call-sites (since unlocked_ioctl also returns long).


--
Aleksa Sarai
Software Engineer (Containers)
SUSE Linux GmbH
https://www.cyphar.com/


Re: [PATCH v3 1/2] tty: add compat_ioctl callbacks

2017-06-06 Thread Arnd Bergmann
On Sat, Jun 3, 2017 at 3:51 PM, Aleksa Sarai  wrote:
> In order to avoid future diversions between fs/compat_ioctl.c and
> drivers/tty/pty.c, define .compat_ioctl callbacks for the relevant
> tty_operations structs. Since both pty_unix98_ioctl() and
> pty_bsd_ioctl() are compatible between 32-bit and 64-bit userspace no
> special translation is required.
>
> Signed-off-by: Aleksa Sarai 
> ---
>  Makefile  |  1 +
>  drivers/tty/pty.c | 22 ++
>  fs/compat_ioctl.c |  6 --
>  3 files changed, 23 insertions(+), 6 deletions(-)
>
> diff --git a/Makefile b/Makefile
> index 470bd4d9513a..fb689286d83a 100644
> --- a/Makefile
> +++ b/Makefile
> @@ -401,6 +401,7 @@ KBUILD_CFLAGS   := -Wall -Wundef -Wstrict-prototypes 
> -Wno-trigraphs \
>-fno-strict-aliasing -fno-common \
>-Werror-implicit-function-declaration \
>-Wno-format-security \
> +  -Wno-error=int-in-bool-context \
>-std=gnu89 $(call cc-option,-fno-PIE)

This slipped in by accident, I assume? It seems completely unrelated.

> diff --git a/drivers/tty/pty.c b/drivers/tty/pty.c
> index 65799575c666..2a6bd9ae3f8b 100644
> --- a/drivers/tty/pty.c
> +++ b/drivers/tty/pty.c
> @@ -481,6 +481,16 @@ static int pty_bsd_ioctl(struct tty_struct *tty,
> return -ENOIOCTLCMD;
>  }
>
> +static long pty_bsd_compat_ioctl(struct tty_struct *tty,
> +unsigned int cmd, unsigned long arg)
> +{
> +   /*
> +* PTY ioctls don't require any special translation between 32-bit and
> +* 64-bit userspace, they are already compatible.
> +*/
> +   return pty_bsd_ioctl(tty, cmd, arg);
> +}
> +

This looks correct but unnecessary, you can simply point both
function pointers to the same function:

>   * not really modular, but the easiest way to keep compat with existing
> @@ -502,6 +512,7 @@ static const struct tty_operations master_pty_ops_bsd = {
> .chars_in_buffer = pty_chars_in_buffer,
> .unthrottle = pty_unthrottle,
> .ioctl = pty_bsd_ioctl,
> +   .compat_ioctl = pty_bsd_compat_ioctl,

   .compat_ioctl = pty_bsd_ioctl,

The separate handler would only be required when you need any kind
of special handling depending on the command.

> diff --git a/fs/compat_ioctl.c b/fs/compat_ioctl.c
> index 6116d5275a3e..112b3e1e20e3 100644
> --- a/fs/compat_ioctl.c
> +++ b/fs/compat_ioctl.c
> @@ -866,8 +866,6 @@ COMPATIBLE_IOCTL(TIOCGDEV)
>  COMPATIBLE_IOCTL(TIOCCBRK)
>  COMPATIBLE_IOCTL(TIOCGSID)
>  COMPATIBLE_IOCTL(TIOCGICOUNT)
> -COMPATIBLE_IOCTL(TIOCGPKT)
> -COMPATIBLE_IOCTL(TIOCGPTLCK)
>  COMPATIBLE_IOCTL(TIOCGEXCL)
>  /* Little t */
>  COMPATIBLE_IOCTL(TIOCGETD)
> @@ -883,16 +881,12 @@ COMPATIBLE_IOCTL(TIOCMGET)
>  COMPATIBLE_IOCTL(TIOCMBIC)
>  COMPATIBLE_IOCTL(TIOCMBIS)
>  COMPATIBLE_IOCTL(TIOCMSET)
> -COMPATIBLE_IOCTL(TIOCPKT)
>  COMPATIBLE_IOCTL(TIOCNOTTY)
>  COMPATIBLE_IOCTL(TIOCSTI)
>  COMPATIBLE_IOCTL(TIOCOUTQ)
>  COMPATIBLE_IOCTL(TIOCSPGRP)
>  COMPATIBLE_IOCTL(TIOCGPGRP)
> -COMPATIBLE_IOCTL(TIOCGPTN)
> -COMPATIBLE_IOCTL(TIOCSPTLCK)
>  COMPATIBLE_IOCTL(TIOCSERGETLSR)
> -COMPATIBLE_IOCTL(TIOCSIG)

Looks good.

   Arnd


Re: [PATCH 2/5] powerpc/mm: split store_updates_sp() in two parts in do_page_fault()

2017-06-06 Thread Michael Ellerman
christophe leroy  writes:

> On 05/06/2017 at 12:45, Michael Ellerman wrote:
>> Christophe LEROY  writes:
>>
>>> On 02/06/2017 at 11:26, Michael Ellerman wrote:
 Christophe Leroy  writes:

> Only the get_user() in store_updates_sp() has to be done outside
> the mm semaphore. All the comparison can be done within the semaphore,
> so only when really needed.
>
> As we got a DSI exception, the address pointed by regs->nip is
> obviously valid, otherwise we would have had an instruction exception.
> So __get_user() can be used instead of get_user()

 I don't think that part is true.

 You took a DSI so there *was* an instruction at NIP, but since then it
 may have been unmapped by another thread.

 So I don't think you can assume the get_user() will succeed.
>>>
>>> The difference between get_user() and __get_user() is that get_user()
>>> performs an access_ok() in addition.
>>>
>>> Doesn't access_ok() only checks whether addr is below TASK_SIZE to
>>> ensure it is a valid user address ?
>>
>> Yeah more or less, via some gross macros.
>>
>> I was actually not that worried about the switch from get_user() to
>> __get_user(), but rather that you removed the check of the return value.
>> ie.
>>
>> -if (get_user(inst, (unsigned int __user *)regs->nip))
>> -return 0;
>>
>> Became:
>>
>>  if (is_write && user_mode(regs))
>> -store_update_sp = store_updates_sp(regs);
>> +__get_user(inst, (unsigned int __user *)regs->nip);
>>
>>
>> I think dropping the access_ok() probably is alright, because the NIP
>> must (should!) have been in userspace, though as Ben says it's always
>> good to be paranoid.
>>
>> But ignoring that the address can fault at all is wrong AFAICS.
>
> I see what you mean now.
>
> Indeed,
>
> - unsigned int inst;
>
> Became
>
> + unsigned int inst = 0;
>
> Since __get_user() doesn't modify 'inst' in case of error, 'inst' 
> remains 0, and store_updates_sp(0) return false. That was the idea behind.

Ugh. OK, my bad. Though it is a little subtle.

How about:

@@ -286,10 +290,13 @@ int do_page_fault(struct pt_regs *regs, unsigned long address,
/*
 * We want to do this outside mmap_sem, because reading code around nip
 * can result in fault, which will cause a deadlock when called with
-* mmap_sem held
+* mmap_sem held. We don't need to check if get_user() fails, if it does
+* it won't modify inst, and an inst of 0 will return false from
+* store_updates_sp().
 */
+   inst = 0;
if (is_write && is_user)
-   store_update_sp = store_updates_sp(regs);
+   get_user(inst, (unsigned int __user *)regs->nip);
 
if (is_user)
flags |= FAULT_FLAG_USER;


Then this one can go in.

cheers


Re: [PATCH v2] powerpc/numa: Fix percpu allocations to be NUMA aware

2017-06-06 Thread Nicholas Piggin
On Tue,  6 Jun 2017 20:23:57 +1000
Michael Ellerman  wrote:

> In commit 8c272261194d ("powerpc/numa: Enable USE_PERCPU_NUMA_NODE_ID"), we
> switched to the generic implementation of cpu_to_node(), which uses a percpu
> variable to hold the NUMA node for each CPU.
> 
> Unfortunately we neglected to notice that we use cpu_to_node() in the 
> allocation
> of our percpu areas, leading to a chicken and egg problem. In practice what
> happens is when we are setting up the percpu areas, cpu_to_node() reports that
> all CPUs are on node 0, so we allocate all percpu areas on node 0.
> 
> This is visible in the dmesg output, as all pcpu allocs being in group 0:
> 
>   pcpu-alloc: [0] 00 01 02 03 [0] 04 05 06 07
>   pcpu-alloc: [0] 08 09 10 11 [0] 12 13 14 15
>   pcpu-alloc: [0] 16 17 18 19 [0] 20 21 22 23
>   pcpu-alloc: [0] 24 25 26 27 [0] 28 29 30 31
>   pcpu-alloc: [0] 32 33 34 35 [0] 36 37 38 39
>   pcpu-alloc: [0] 40 41 42 43 [0] 44 45 46 47
> 
> To fix it we need an early_cpu_to_node() which can run prior to percpu being
> setup. We already have the numa_cpu_lookup_table we can use, so just plumb it
> in. With the patch dmesg output shows two groups, 0 and 1:
> 
>   pcpu-alloc: [0] 00 01 02 03 [0] 04 05 06 07
>   pcpu-alloc: [0] 08 09 10 11 [0] 12 13 14 15
>   pcpu-alloc: [0] 16 17 18 19 [0] 20 21 22 23
>   pcpu-alloc: [1] 24 25 26 27 [1] 28 29 30 31
>   pcpu-alloc: [1] 32 33 34 35 [1] 36 37 38 39
>   pcpu-alloc: [1] 40 41 42 43 [1] 44 45 46 47
> 
> We can also check the data_offset in the paca of various CPUs, with the fix we
> see:
> 
>   CPU 0:  data_offset = 0x0ffe8b
>   CPU 24: data_offset = 0x1ffe5b
> 
> And we can see from dmesg that CPU 24 has an allocation on node 1:
> 
>   node   0: [mem 0x-0x000f]
>   node   1: [mem 0x0010-0x001f]
> 
> Cc: sta...@vger.kernel.org # v3.16+
> Fixes: 8c272261194d ("powerpc/numa: Enable USE_PERCPU_NUMA_NODE_ID")
> Signed-off-by: Michael Ellerman 

Looks good.

Reviewed-by: Nicholas Piggin 


Re: [PATCH V3 1/2] powerpc/numa: Update CPU topology when VPHN enabled

2017-06-06 Thread Michael Ellerman
Michael Bringmann  writes:

> This build appears to be using V3 of the patch.  V4 of the patch corrected 
> the placement
> of the local variables with respect to '#ifdef CONFIG_PPC_SPLPAR'.

Yes it is, you can tell because it's a reply to "PATCH V3".

cheers


Re: [PATCH v9 10/10] powerpc/perf: Thread imc cpuhotplug support

2017-06-06 Thread Thomas Gleixner
On Mon, 5 Jun 2017, Anju T Sudhakar wrote:
>  static void thread_imc_mem_alloc(int cpu_id)
>  {
> - u64 ldbar_addr, ldbar_value;
>   int phys_id = topology_physical_package_id(cpu_id);
>  
>   per_cpu_add[cpu_id] = (u64)alloc_pages_exact_nid(phys_id,
> (size_t)IMC_THREAD_COUNTER_MEM, GFP_KERNEL | __GFP_ZERO);

It took me a while to understand that per_cpu_add is an array and not a new
fangled per cpu function which adds something to a per cpu variable. 

So why is that storing the address as u64?

And why is this a NR_CPUS sized array instead of a regular per cpu variable?

> +}
> +
> +static void thread_imc_update_ldbar(unsigned int cpu_id)
> +{
> + u64 ldbar_addr, ldbar_value;
> +
> + if (per_cpu_add[cpu_id] == 0)
> + thread_imc_mem_alloc(cpu_id);

So if that allocation fails then you happily proceed. Optmistic
programming, right?

> +
>   ldbar_addr = (u64)virt_to_phys((void *)per_cpu_add[cpu_id]);

Ah, it's stored as u64 because you have to convert it back to a void
pointer at the place where it is actually used. Interesting approach.

>   ldbar_value = (ldbar_addr & (u64)THREAD_IMC_LDBAR_MASK) |
> - (u64)THREAD_IMC_ENABLE;
> + (u64)THREAD_IMC_ENABLE;
>   mtspr(SPRN_LDBAR, ldbar_value);
>  }
>  
> @@ -442,6 +450,26 @@ static void core_imc_change_cpu_context(int old_cpu, int 
> new_cpu)
>   perf_pmu_migrate_context(&core_imc_pmu->pmu, old_cpu, new_cpu);
>  }
>  
> +static int ppc_thread_imc_cpu_online(unsigned int cpu)
> +{
> + thread_imc_update_ldbar(cpu);
> + return 0;
> +}
> +
> +static int ppc_thread_imc_cpu_offline(unsigned int cpu)
> +{
> + mtspr(SPRN_LDBAR, 0);
> + return 0;
> +}
> +
> +void thread_imc_cpu_init(void)
> +{
> + cpuhp_setup_state(CPUHP_AP_PERF_POWERPC_THREAD_IMC_ONLINE,
> +   "perf/powerpc/imc_thread:online",
> +   ppc_thread_imc_cpu_online,
> +   ppc_thread_imc_cpu_offline);
> +}
> +
>  static int ppc_core_imc_cpu_online(unsigned int cpu)
>  {
>   const struct cpumask *l_cpumask;
> @@ -953,6 +981,9 @@ int __init init_imc_pmu(struct imc_events *events, int idx,
>   if (ret)
>   return ret;
>   break;
> + case IMC_DOMAIN_THREAD:
> + thread_imc_cpu_init();
> + break;
>   default:
>   return -1;  /* Unknown domain */
>   }

Just a general observation. If anything in init_imc_pmu() fails, then all
the hotplug callbacks are staying registered, at least I haven't seen
anything undoing it.

Thanks,

tglx


[PATCH v2] powerpc/numa: Fix percpu allocations to be NUMA aware

2017-06-06 Thread Michael Ellerman
In commit 8c272261194d ("powerpc/numa: Enable USE_PERCPU_NUMA_NODE_ID"), we
switched to the generic implementation of cpu_to_node(), which uses a percpu
variable to hold the NUMA node for each CPU.

Unfortunately we neglected to notice that we use cpu_to_node() in the allocation
of our percpu areas, leading to a chicken and egg problem. In practice what
happens is when we are setting up the percpu areas, cpu_to_node() reports that
all CPUs are on node 0, so we allocate all percpu areas on node 0.

This is visible in the dmesg output, as all pcpu allocs being in group 0:

  pcpu-alloc: [0] 00 01 02 03 [0] 04 05 06 07
  pcpu-alloc: [0] 08 09 10 11 [0] 12 13 14 15
  pcpu-alloc: [0] 16 17 18 19 [0] 20 21 22 23
  pcpu-alloc: [0] 24 25 26 27 [0] 28 29 30 31
  pcpu-alloc: [0] 32 33 34 35 [0] 36 37 38 39
  pcpu-alloc: [0] 40 41 42 43 [0] 44 45 46 47

To fix it we need an early_cpu_to_node() which can run prior to percpu being
setup. We already have the numa_cpu_lookup_table we can use, so just plumb it
in. With the patch dmesg output shows two groups, 0 and 1:

  pcpu-alloc: [0] 00 01 02 03 [0] 04 05 06 07
  pcpu-alloc: [0] 08 09 10 11 [0] 12 13 14 15
  pcpu-alloc: [0] 16 17 18 19 [0] 20 21 22 23
  pcpu-alloc: [1] 24 25 26 27 [1] 28 29 30 31
  pcpu-alloc: [1] 32 33 34 35 [1] 36 37 38 39
  pcpu-alloc: [1] 40 41 42 43 [1] 44 45 46 47

We can also check the data_offset in the paca of various CPUs, with the fix we
see:

  CPU 0:  data_offset = 0x0ffe8b
  CPU 24: data_offset = 0x1ffe5b

And we can see from dmesg that CPU 24 has an allocation on node 1:

  node   0: [mem 0x-0x000f]
  node   1: [mem 0x0010-0x001f]

Cc: sta...@vger.kernel.org # v3.16+
Fixes: 8c272261194d ("powerpc/numa: Enable USE_PERCPU_NUMA_NODE_ID")
Signed-off-by: Michael Ellerman 
---
 arch/powerpc/include/asm/topology.h | 14 ++
 arch/powerpc/kernel/setup_64.c  |  4 ++--
 2 files changed, 16 insertions(+), 2 deletions(-)

v2: Don't bother with node_distance() just compare the nids in 
pcpu_cpu_distance()

diff --git a/arch/powerpc/include/asm/topology.h 
b/arch/powerpc/include/asm/topology.h
index 8b3b46b7b0f2..329771559cbb 100644
--- a/arch/powerpc/include/asm/topology.h
+++ b/arch/powerpc/include/asm/topology.h
@@ -44,8 +44,22 @@ extern void __init dump_numa_cpu_topology(void);
 extern int sysfs_add_device_to_node(struct device *dev, int nid);
 extern void sysfs_remove_device_from_node(struct device *dev, int nid);
 
+static inline int early_cpu_to_node(int cpu)
+{
+   int nid;
+
+   nid = numa_cpu_lookup_table[cpu];
+
+   /*
+* Fall back to node 0 if nid is unset (it should be, except bugs).
+* This allows callers to safely do NODE_DATA(early_cpu_to_node(cpu)).
+*/
+   return (nid < 0) ? 0 : nid;
+}
 #else
 
+static inline int early_cpu_to_node(int cpu) { return 0; }
+
 static inline void dump_numa_cpu_topology(void) {}
 
 static inline int sysfs_add_device_to_node(struct device *dev, int nid)
diff --git a/arch/powerpc/kernel/setup_64.c b/arch/powerpc/kernel/setup_64.c
index f35ff9dea4fb..a8c1f99e9607 100644
--- a/arch/powerpc/kernel/setup_64.c
+++ b/arch/powerpc/kernel/setup_64.c
@@ -661,7 +661,7 @@ void __init emergency_stack_init(void)
 
 static void * __init pcpu_fc_alloc(unsigned int cpu, size_t size, size_t align)
 {
-   return __alloc_bootmem_node(NODE_DATA(cpu_to_node(cpu)), size, align,
+   return __alloc_bootmem_node(NODE_DATA(early_cpu_to_node(cpu)), size, align,
__pa(MAX_DMA_ADDRESS));
 }
 
@@ -672,7 +672,7 @@ static void __init pcpu_fc_free(void *ptr, size_t size)
 
 static int pcpu_cpu_distance(unsigned int from, unsigned int to)
 {
-   if (cpu_to_node(from) == cpu_to_node(to))
+   if (early_cpu_to_node(from) == early_cpu_to_node(to))
return LOCAL_DISTANCE;
else
return REMOTE_DISTANCE;
-- 
2.7.4



Re: [PATCH] powerpc/mm/radix: Only add X for pages overlapping kernel text

2017-06-06 Thread Michael Ellerman
Balbir Singh  writes:
> On Tue, Jun 6, 2017 at 3:48 PM, Michael Ellerman  wrote:
>> Currently we map the whole linear mapping with PAGE_KERNEL_X. Instead we
>> should check if the page overlaps the kernel text and only then add
>> PAGE_KERNEL_X.
...
>> @@ -145,8 +147,14 @@ static int __meminit create_physical_mapping(unsigned long start,
>> start = addr;
>> }
>>
>> -   rc = radix__map_kernel_page((unsigned long)__va(addr), addr,
>> -   PAGE_KERNEL_X, mapping_size);
>> +   vaddr = (unsigned long)__va(addr);
>> +
>> +   if (overlaps_kernel_text(vaddr, vaddr + mapping_size))
>> +   prot = PAGE_KERNEL_X;
>> +   else
>> +   prot = PAGE_KERNEL;
>
> Do we need the kvm tmp/trampoline bits like hash?

Ugh, I hope not. What is that crap.

It appears to be epapr paravirt only:

  static int __init kvm_guest_init(void)
  {
if (!kvm_para_available())
goto free_tmp;
  
if (!epapr_paravirt_enabled)
goto free_tmp;


But I can't convince myself whether epapr_paravirt_enabled is ever set
on Book3S guests or not.

Looking at Qemu it looks like it *could* be.

And why isn't that code in arch/powerpc/kvm ?

cheers


Re: [PATCH v9 07/10] powerpc/perf: PMU functions for Core IMC and hotplugging

2017-06-06 Thread Thomas Gleixner
On Mon, 5 Jun 2017, Anju T Sudhakar wrote:
> +static void cleanup_all_core_imc_memory(struct imc_pmu *pmu_ptr)
> +{
> + struct imc_mem_info *ptr = pmu_ptr->mem_info;
> +
> + if (!ptr)
> + return;

That's pointless.

> + for (; ptr; ptr++) {

for (ptr = pmu_ptr->mem_info; ptr; ptr++) {

will do the right thing.

> + if (ptr->vbase[0] != 0)
> + free_pages(ptr->vbase[0], 0);
> + }

and kfree can be called with a NULL pointer.

> + kfree(pmu_ptr->mem_info);
> +}
> +
> +/* Enabling of Core Engine needs a scom operation */
> +static void core_imc_control_enable(int *cpu_opal_rc)
> +{
> + int rc;
> +
> + rc = opal_imc_counters_start(OPAL_IMC_COUNTERS_CORE);
> + if (rc)
> + cpu_opal_rc[smp_processor_id()] = 1;
> +}
> +
> +/*
> + * Disabling of IMC Core Engine needs a scom operation
> + */
> +static void core_imc_control_disable(int *cpu_opal_rc)
> +{
> + int rc;
> +
> + rc = opal_imc_counters_stop(OPAL_IMC_COUNTERS_CORE);
> + if (rc)
> + cpu_opal_rc[smp_processor_id()] = 1;
> +}
> +
> +int core_imc_control(int operation)
> +{
> + int cpu, *cpus_opal_rc;
> +
> + /* Memory for OPAL call return value. */
> + cpus_opal_rc = kzalloc((sizeof(int) * nr_cpu_ids), GFP_KERNEL);
> + if (!cpus_opal_rc)
> + return -ENOMEM;
> +
> + /*
> +  * Enable or disable the core IMC PMU on each core using the
> +  * core_imc_cpumask.
> +  */
> + switch (operation) {
> +
> + case IMC_COUNTER_DISABLE:
> + on_each_cpu_mask(&core_imc_cpumask,
> + (smp_call_func_t)core_imc_control_disable,
> + cpus_opal_rc, 1);
> + break;
> + case IMC_COUNTER_ENABLE:
> + on_each_cpu_mask(&core_imc_cpumask,
> + (smp_call_func_t)core_imc_control_enable,
> + cpus_opal_rc, 1);
> + break;
> + default:
> + goto fail;
> + }
> + /* Check return value array for any OPAL call failure */
> + for_each_cpu(cpu, &core_imc_cpumask) {
> + if (cpus_opal_rc[cpu])
> + goto fail;
> + }
> + return 0;
> +fail:
> + if (cpus_opal_rc)
> + kfree(cpus_opal_rc);
> + return -EINVAL;
> +}

Interesting. You copied the other function and then implemented it
differently.

But, what's worse is that this is just duplicated code for no value. Both
the nest and core control functions call

opal_imc_counters_(WHICH_COUNTER);

So the obvious thing to do is:

static struct cpumask imc_result_mask;
static DEFINE_MUTEX(imc_control_mutex);

static void opal_imc_XXX(void *which)
{
if (opal_imc_counters_XXX((unsigned long) which))
cpumask_set_cpu(smp_processor_id(), &imc_result_mask);
}

static int imc_control(unsigned long which, bool start)
{
smp_call_func_t fn;
int res;

mutex_lock(&imc_control_mutex);
memset(&imc_result_mask, 0, sizeof(imc_result_mask));

switch (which) {
case OPAL_IMC_COUNTERS_CORE:
case OPAL_IMC_COUNTERS_NEST:
break;
default:
res = -EINVAL;
goto out;
}

fn = start ? opal_imc_start : opal_imc_stop;

on_each_cpu_mask(&core_imc_cpumask, fn, (void *) which, 1);

res = cpumask_empty(&imc_result_mask) ? 0 : -ENODEV;
out:
mutex_unlock(&imc_control_mutex);
return res;
}

That would allow sharing of too much code, right?

> +/*
> + * core_imc_mem_init : Initializes memory for the current core.
> + *
> + * Uses alloc_pages_exact_nid() and uses the returned address as an argument 
> to
> + * an opal call to configure the pdbar. The address sent as an argument is
> + * converted to physical address before the opal call is made. This is the
> + * base address at which the core imc counters are populated.
> + */
> +static int  __meminit core_imc_mem_init(int cpu, int size)
> +{
> + int phys_id, rc = 0, core_id = (cpu / threads_per_core);
> + struct imc_mem_info *mem_info = NULL;

What's that NULL initialization for?

> +
> + phys_id = topology_physical_package_id(cpu);
> + /*
> +  * alloc_pages_exact_nid() will allocate memory for core in the
> +  * local node only.
> +  */
> + mem_info = &core_imc_pmu->mem_info[core_id];
> + mem_info->id = core_id;
> + mem_info->vbase[0] = (u64) alloc_pages_exact_nid(phys_id,
> + (size_t)size, GFP_KERNEL | __GFP_ZERO);

That allocation is guaranteed not to fail?

> + rc = opal_imc_counters_init(OPAL_IMC_COUNTERS_CORE,
> + (u64)virt_to_phys((void *)mem_info->vbase[0]),
> + get_hard_smp_processor_id(cpu));
> + if (rc) {
> + kfree(&mem_info->vbase[0]);

Freeing memory allocated with alloc_pages_exact_nid() via kfree() is broken to begin with, and you are handing it the address of the struct member rather than the stored pointer.

Re: [PATCH] powerpc/numa: Fix percpu allocations to be NUMA aware

2017-06-06 Thread Michael Ellerman
Nicholas Piggin  writes:

> On Fri, 02 Jun 2017 19:54:32 +1000
> Michael Ellerman  wrote:
>
>> >> @@ -672,10 +672,19 @@ static void __init pcpu_fc_free(void *ptr, size_t 
>> >> size)
>> >>  
>> >>  static int pcpu_cpu_distance(unsigned int from, unsigned int to)
>> >>  {
>> >> - if (cpu_to_node(from) == cpu_to_node(to))
>> >> - return LOCAL_DISTANCE;
>> >> - else
>> >> - return REMOTE_DISTANCE;
>> >> +#ifndef CONFIG_NUMA
>> >> + return LOCAL_DISTANCE;
>> >> +#else
>> >> + int from_nid, to_nid;
>> >> +
>> >> + from_nid = early_cpu_to_node(from);
>> >> + to_nid   = early_cpu_to_node(to);
>> >> +
>> >> + if (from_nid == -1 || to_nid == -1)
>> >> + return LOCAL_DISTANCE;  /* Or assume remote? */
>> >> +
>> >> + return node_distance(from_nid, to_nid);  
>> >
>> > If you made node_distance() return LOCAL_NODE for !NUMA, this
>> > should fall out and not require the ifdef?  
>> 
>> Maybe yeah. This is designed to be minimal for backporting though.
>
> Okay fair enough. Is it expected to get back -1 from this ever? I think
> all we need is local vs not local to direct whether to pack CPUs into
> the same chunks or not so I wonder if the equality test is simpler.

Yeah you're right, I'll do a v2.

I think I was worried about the fact that our node_distance() does
complicated stuff for multi-level NUMA, but that doesn't actually matter
for the per-cpu calculation as you say so it's probably better to keep
it simple.

cheers


Re: [Patch 2/2]: powerpc/hotplug/mm: Fix hot-add memory node assoc

2017-06-06 Thread Michael Ellerman
Michael Bringmann  writes:
> On 06/01/2017 04:36 AM, Michael Ellerman wrote:
>> Do you actually see mention of nodes 0 and 8 in the dmesg?
>
> When the 'numa.c' code is built with debug messages, and the system was
> given that configuration by pHyp, yes, I did.
>
>> What does it say?
>
> The debug message for each core thread would be something like,
>
> removing cpu 64 from node 0
> adding cpu 64 to node 8
>
> repeated for all 8 threads of the CPU, and usually with the messages
> for all of the CPUs coming out intermixed on the console/dmesg log.

OK. I meant what do you see at boot.

I'm curious how we're discovering node 0 and 8 at all if neither has any
memory or CPUs assigned at boot.

>> Right. So it's not that you're hot adding memory into a previously
>> unseen node as you implied in earlier mails.
>
> In the sense that the nodes were defined in the device tree, that is correct.

Where are they defined in the device tree? That's what I'm trying to understand.


> In the sense that those nodes are currently deleted from node_possible_map in
> 'numa.c' by the instruction 'node_and(node_possible_map,node_possible_map,
> node_online_map);', the nodes are no longer available to place memory or CPU.

Yeah I understand that part.

> Okay, I can try to insert code that extracts all of the nodes from the
> ibm,associativity-lookup-arrays property and merge them with the nodes
> put into the online map from the CPUs that were found previously during
> boot of the powerpc code.

Hmm, will that work?

Looking at PAPR it's not clear to me that it will work for nodes that
have no memory assigned at boot.

  This property is used to duplicate the function of the
  “ibm,associativity” property in a /memory node. Each “assigned” LMB
  represented has an index valued between 0 and M-1 which is used as in
  index into this table to select which associativity list to use for
  the LMB. “unassigned” LMBs are place holders for potential DLPAR
  additions, for which the associativity list index is meaningless and
 ^
  is given the reserved value of -1. This static property, need only
  contain values relevant for the LMBs presented in the
  “ibm,dynamicreconfiguration-memory” node; for a dynamic LPAR addition
  of a new LMB, the device tree fragment reported by the
  ibm,configure-connector RTAS function is a /memory node, with the
  inclusion of the “ibm,associativity” device tree property defined in
  Section C.6.2.2‚ “Properties of the Children of Root‚” on page 1059.

>> What does your device tree look like? Can you send us the output of:
>> 
>>  $ lsprop /proc/device-tree

Thanks. I forgot that lsprop will truncate long properties, I actually
wanted to see all of the ibm,dynamic-memory property.

But looking at the code I see the only place we set a nid online is if
there is a CPU assigned to it:

static int __init parse_numa_properties(void)
{
...
for_each_present_cpu(i) {
...
cpu = of_get_cpu_node(i, NULL);
nid = of_node_to_nid_single(cpu);
...
node_set_online(nid);
}

Or for memory nodes (same function):

for_each_node_by_type(memory, "memory") {
...
nid = of_node_to_nid_single(memory);
...
node_set_online(nid);
...
}

Or for entries in ibm,dynamic-memory that are assigned:

static void __init parse_drconf_memory(struct device_node *memory)
{
...
for (; n != 0; --n) {
...
/* skip this block if the reserved bit is set in flags (0x80)
   or if the block is not assigned to this partition (0x8) */
if ((drmem.flags & DRCONF_MEM_RESERVED)
|| !(drmem.flags & DRCONF_MEM_ASSIGNED))
continue;

...
do {
...
nid = of_drconf_to_nid_single(&drmem, &aa);
node_set_online(nid);
...
} while (--ranges);
}
}


So I don't see from that how we can even be aware that node 0 and 8
exist at boot based on that. Maybe there's another path I'm missing
though.

cheers


[PATCH v2] cxl: Fix error path on bad ioctl

2017-06-06 Thread Frederic Barrat
Fix error path if we can't copy user structure on CXL_IOCTL_START_WORK
ioctl. We shouldn't unlock the context status mutex as it was not
locked (yet).

Signed-off-by: Frederic Barrat 
Cc: sta...@vger.kernel.org
Fixes: 0712dc7e73e5 ("cxl: Fix issues when unmapping contexts")
---
Changelog:
v2: change sizeof() parameter (Vaibhav)
tweak commit message and add 'Fixes:' (mpe)

 drivers/misc/cxl/file.c | 7 ++-
 1 file changed, 2 insertions(+), 5 deletions(-)

diff --git a/drivers/misc/cxl/file.c b/drivers/misc/cxl/file.c
index 17b433f1ce23..0761271d68c5 100644
--- a/drivers/misc/cxl/file.c
+++ b/drivers/misc/cxl/file.c
@@ -159,11 +159,8 @@ static long afu_ioctl_start_work(struct cxl_context *ctx,
 
/* Do this outside the status_mutex to avoid a circular dependency with
 * the locking in cxl_mmap_fault() */
-   if (copy_from_user(&work, uwork,
-  sizeof(struct cxl_ioctl_start_work))) {
-   rc = -EFAULT;
-   goto out;
-   }
+   if (copy_from_user(&work, uwork, sizeof(work)))
+   return -EFAULT;
 
mutex_lock(&ctx->status_mutex);
if (ctx->status != OPENED) {
-- 
2.11.0



Re: [PATCH] cxl: Fix error path on bad ioctl

2017-06-06 Thread Frederic Barrat



Le 06/06/2017 à 11:20, Michael Ellerman a écrit :

Frederic Barrat  writes:


Fix error path if we can't copy user structure on
CXL_IOCTL_START_WORK ioctl.


To be clear the error is that returning via the out label will unlock
cxl->status_mutex, which has not been locked.

Please spell it out for me :)

This should be:

   Fixes: 0712dc7e73e5 ("cxl: Fix issues when unmapping contexts")

Am I right?



That's correct. I'm about to send a v2 to address Vaibhav's comment and 
I'll fix the above as well.


Thanks,

  Fred




cheers


diff --git a/drivers/misc/cxl/file.c b/drivers/misc/cxl/file.c
index 17b433f1ce23..caa44adfa60e 100644
--- a/drivers/misc/cxl/file.c
+++ b/drivers/misc/cxl/file.c
@@ -160,10 +160,8 @@ static long afu_ioctl_start_work(struct cxl_context *ctx,
/* Do this outside the status_mutex to avoid a circular dependency with
 * the locking in cxl_mmap_fault() */
if (copy_from_user(&work, uwork,
-  sizeof(struct cxl_ioctl_start_work))) {
-   rc = -EFAULT;
-   goto out;
-   }
+  sizeof(struct cxl_ioctl_start_work)))
+   return -EFAULT;
  
  	mutex_lock(&ctx->status_mutex);

if (ctx->status != OPENED) {
--
2.11.0






Re: [PATCH] cxl: Fix error path on bad ioctl

2017-06-06 Thread Michael Ellerman
Frederic Barrat  writes:

> Fix error path if we can't copy user structure on
> CXL_IOCTL_START_WORK ioctl.

To be clear the error is that returning via the out label will unlock
cxl->status_mutex, which has not been locked.

Please spell it out for me :)

This should be:

  Fixes: 0712dc7e73e5 ("cxl: Fix issues when unmapping contexts")

Am I right?

cheers

> diff --git a/drivers/misc/cxl/file.c b/drivers/misc/cxl/file.c
> index 17b433f1ce23..caa44adfa60e 100644
> --- a/drivers/misc/cxl/file.c
> +++ b/drivers/misc/cxl/file.c
> @@ -160,10 +160,8 @@ static long afu_ioctl_start_work(struct cxl_context *ctx,
>   /* Do this outside the status_mutex to avoid a circular dependency with
>* the locking in cxl_mmap_fault() */
>   if (copy_from_user(&work, uwork,
> -sizeof(struct cxl_ioctl_start_work))) {
> - rc = -EFAULT;
> - goto out;
> - }
> +sizeof(struct cxl_ioctl_start_work)))
> + return -EFAULT;
>  
>   mutex_lock(&ctx->status_mutex);
>   if (ctx->status != OPENED) {
> -- 
> 2.11.0


Re: [PATCH v2] cxl: Avoid double free_irq() for psl,slice interrupts

2017-06-06 Thread Michael Ellerman
Vaibhav Jain  writes:

> During eeh recovery, a call to cxl_remove can result in a double free_irq of
> the psl,slice interrupts. This can happen if perst_reloads_same_image == 1
> and call to cxl_configure_adapter() fails during slot_reset
> callback. In such a case we see a kernel oops with following back-trace:
>
> Oops: Kernel access of bad area, sig: 11 [#1]
> Call Trace:
>   free_irq+0x88/0xd0 (unreliable)
>   cxl_unmap_irq+0x20/0x40 [cxl]
>   cxl_native_release_psl_irq+0x78/0xd8 [cxl]
>   pci_deconfigure_afu+0xac/0x110 [cxl]
>   cxl_remove+0x104/0x210 [cxl]
>   pci_device_remove+0x6c/0x110
>   device_release_driver_internal+0x204/0x2e0
>   pci_stop_bus_device+0xa0/0xd0
>   pci_stop_and_remove_bus_device+0x28/0x40
>   pci_hp_remove_devices+0xb0/0x150
>   pci_hp_remove_devices+0x68/0x150
>   eeh_handle_normal_event+0x140/0x580
>   eeh_handle_event+0x174/0x360
>   eeh_event_handler+0x1e8/0x1f0
>
> This patch fixes the issue of double free_irq by checking that
> variables that hold the virqs (err_hwirq, serr_hwirq, psl_virq) are
> not '0' before un-mapping and resetting these variables to '0' when
> they are un-mapped.
>
> Cc: sta...@vger.kernel.org

When was the bug introduced? ie. what commit does this fix?

Was it the original submission? So:

  Fixes: f204e0b8cedd ("cxl: Driver code for powernv PCIe based cards for 
userspace access")

Or does it depend on perst_reloads_same_image, so:

  Fixes: 13e68d8bd05c ("cxl: Allow the kernel to trust that an image won't 
change on PERST.")

Or something else?

cheers


Re: [PATCH v9 05/10] powerpc/perf: IMC pmu cpumask and cpuhotplug support

2017-06-06 Thread Thomas Gleixner
On Mon, 5 Jun 2017, Anju T Sudhakar wrote:
> +/*
> + * nest_imc_stop : Does OPAL call to stop nest engine.
> + */
> +static void nest_imc_stop(int *cpu_opal_rc)
> +{
> + int rc;
> +
> + rc = opal_imc_counters_stop(OPAL_IMC_COUNTERS_NEST);
> + if (rc)
> + cpu_opal_rc[smp_processor_id()] = 1;

What's wrong with just assigning the result directly?

> +/* nest_imc_start: Does the OPAL call to enable nest counters. */
> +static void nest_imc_start(int *cpu_opal_rc)
> +{
> + int rc;
> +
> + rc = opal_imc_counters_start(OPAL_IMC_COUNTERS_NEST);
> + if (rc)
> + cpu_opal_rc[smp_processor_id()] = 1;
> +}
> +
> +/*
> + * nest_imc_control: Function to start and stop nest imc engine.
> + *The function is primarily called from event init
> + *and event destroy.

As I don't see other call sites, what's the secondary usage?

> + */
> +static int nest_imc_control(int operation)
> +{
> + int *cpus_opal_rc, cpu;
> +
> + /*
> +  * Memory for OPAL call return value.
> +  */
> + cpus_opal_rc = kzalloc((sizeof(int) * nr_cpu_ids), GFP_KERNEL);
> + if (!cpus_opal_rc)
> + return -ENOMEM;
> + switch (operation) {
> + case IMC_COUNTER_ENABLE:
> + /* Initialize Nest PMUs in each node using designated 
> cpus */
> + on_each_cpu_mask(&nest_imc_cpumask, 
> (smp_call_func_t)nest_imc_start,
> + cpus_opal_rc, 1);

That's one indentation level too deep. 

> + break;
> + case IMC_COUNTER_DISABLE:
> + /* Disable the counters */
> + on_each_cpu_mask(&nest_imc_cpumask, 
> (smp_call_func_t)nest_imc_stop,
> + cpus_opal_rc, 1);
> + break;
> + default:
> + kfree(cpus_opal_rc);
> + return -EINVAL;
> +
> + }
> +
> + /* Check return value array for any OPAL call failure */
> + for_each_cpu(cpu, &nest_imc_cpumask) {
> + if (cpus_opal_rc[cpu]) {
> + kfree(cpus_opal_rc);
> + return -ENODEV;
> + }
> + }
> + kfree(cpus_opal_rc);

So you have THREE places where you do kfree(cpus_opal_rc). Sigh.

You can spare that hole alloc/free dance by simply having

static struct cpumask nest_imc_result;

mutex_lock(&imc_nest_reserve);
memset(&nest_imc_result, 0, sizeof(nest_imc_result));

switch (op) {
case IMC_COUNTER_ENABLE:
on_each_cpu_mask(&nest_imc_cpumask, nest_imc_start, NULL, 1);
break;

}

res = cpumask_empty(&nest_imc_result) ? 0 : -ENODEV;
mutex_unlock(&imc_nest_reserve);
return res;

And in the start/stop functions:

if (opal_imc_counters_xxx(OPAL_IMC_COUNTERS_NEST))
cpumask_set_cpu(smp_processor_id(), &nest_imc_result);

> +static void nest_change_cpu_context(int old_cpu, int new_cpu)
> +{
> + struct imc_pmu **pn = per_nest_pmu_arr;
> + int i;
> +
> + if (old_cpu < 0 || new_cpu < 0)
> + return;
> +
> + for (i = 0; *pn && i < IMC_MAX_PMUS; i++, pn++)
> + perf_pmu_migrate_context(&(*pn)->pmu, old_cpu, new_cpu);
> +
> +}
> +
> +static int ppc_nest_imc_cpu_offline(unsigned int cpu)
> +{
> + int nid, target = -1;
> + const struct cpumask *l_cpumask;
> +
> + /*
> +  * Check in the designated list for this cpu. Dont bother
> +  * if not one of them.
> +  */
> + if (!cpumask_test_and_clear_cpu(cpu, &nest_imc_cpumask))
> + return 0;
> +
> + /*
> +  * Now that this cpu is one of the designated,
> +  * find a next cpu a) which is online and b) in same chip.
> +  */
> + nid = cpu_to_node(cpu);
> + l_cpumask = cpumask_of_node(nid);
> + target = cpumask_next(cpu, l_cpumask);

So if the outgoing CPU is the highest numbered CPU on the node, then
cpumask_next() will not find a CPU, while there still might be other lower
numbered CPUs online in the node.

target = cpumask_any_but(l_cpumask, cpu);

is what you want.

> +
> + /*
> +  * Update the cpumask with the target cpu and
> +  * migrate the context if needed
> +  */
> + if (target >= 0 && target < nr_cpu_ids) {
> + cpumask_set_cpu(target, &nest_imc_cpumask);
> + nest_change_cpu_context(cpu, target);
> + } else {
> + target = -1;
> + opal_imc_counters_stop(OPAL_IMC_COUNTERS_NEST);
> + nest_change_cpu_context(cpu, target);

Why are you calling nest_change_cpu_context()? You already know that it
will return immediately due to target < 0.

> + }
> + return 0;
> +}
> +
> +static int ppc_nest_imc_cpu_online(unsigned int cpu)
> +{
> + const struct cpumask *l_cpumask;
> + static struct cpumask tmp_mask;
> + int res;
> +
> + /* Get the cpuma

Re: [PATCH v2] cxl: Avoid double free_irq() for psl,slice interrupts

2017-06-06 Thread Frederic Barrat



Le 02/06/2017 à 18:56, Vaibhav Jain a écrit :

During eeh recovery, a call to cxl_remove can result in a double free_irq of
the psl,slice interrupts. This can happen if perst_reloads_same_image == 1
and call to cxl_configure_adapter() fails during slot_reset
callback. In such a case we see a kernel oops with following back-trace:

Oops: Kernel access of bad area, sig: 11 [#1]
Call Trace:
   free_irq+0x88/0xd0 (unreliable)
   cxl_unmap_irq+0x20/0x40 [cxl]
   cxl_native_release_psl_irq+0x78/0xd8 [cxl]
   pci_deconfigure_afu+0xac/0x110 [cxl]
   cxl_remove+0x104/0x210 [cxl]
   pci_device_remove+0x6c/0x110
   device_release_driver_internal+0x204/0x2e0
   pci_stop_bus_device+0xa0/0xd0
   pci_stop_and_remove_bus_device+0x28/0x40
   pci_hp_remove_devices+0xb0/0x150
   pci_hp_remove_devices+0x68/0x150
   eeh_handle_normal_event+0x140/0x580
   eeh_handle_event+0x174/0x360
   eeh_event_handler+0x1e8/0x1f0

This patch fixes the double free_irq by checking that the variables
holding the virqs (err_virq, serr_virq, psl_virq) are non-zero before
un-mapping, and by resetting them to '0' once they are un-mapped.

Cc: sta...@vger.kernel.org
Signed-off-by: Vaibhav Jain 

---


Thanks for the correction.

Acked-by: Frederic Barrat 

  Fred



Changelog:

v2:
- Use psl_hwirq instead of psl_virq to find irq mapping in
cxl_native_release_psl_irq as pointed out by Fred.

Re-send:
- Added stable to recipients
---
  drivers/misc/cxl/native.c | 14 +++---
  1 file changed, 11 insertions(+), 3 deletions(-)

diff --git a/drivers/misc/cxl/native.c b/drivers/misc/cxl/native.c
index 871a2f0..8d6ea97 100644
--- a/drivers/misc/cxl/native.c
+++ b/drivers/misc/cxl/native.c
@@ -1302,13 +1302,16 @@ int cxl_native_register_psl_err_irq(struct cxl *adapter)

  void cxl_native_release_psl_err_irq(struct cxl *adapter)
  {
-   if (adapter->native->err_virq != irq_find_mapping(NULL, adapter->native->err_hwirq))
+   if (adapter->native->err_virq == 0 ||
+   adapter->native->err_virq !=
+   irq_find_mapping(NULL, adapter->native->err_hwirq))
return;

cxl_p1_write(adapter, CXL_PSL_ErrIVTE, 0x);
cxl_unmap_irq(adapter->native->err_virq, adapter);
cxl_ops->release_one_irq(adapter, adapter->native->err_hwirq);
kfree(adapter->irq_name);
+   adapter->native->err_virq = 0;
  }

  int cxl_native_register_serr_irq(struct cxl_afu *afu)
@@ -1346,13 +1349,15 @@ int cxl_native_register_serr_irq(struct cxl_afu *afu)

  void cxl_native_release_serr_irq(struct cxl_afu *afu)
  {
-   if (afu->serr_virq != irq_find_mapping(NULL, afu->serr_hwirq))
+   if (afu->serr_virq == 0 ||
+   afu->serr_virq != irq_find_mapping(NULL, afu->serr_hwirq))
return;

cxl_p1n_write(afu, CXL_PSL_SERR_An, 0x);
cxl_unmap_irq(afu->serr_virq, afu);
cxl_ops->release_one_irq(afu->adapter, afu->serr_hwirq);
kfree(afu->err_irq_name);
+   afu->serr_virq = 0;
  }

  int cxl_native_register_psl_irq(struct cxl_afu *afu)
@@ -1375,12 +1380,15 @@ int cxl_native_register_psl_irq(struct cxl_afu *afu)

  void cxl_native_release_psl_irq(struct cxl_afu *afu)
  {
-   if (afu->native->psl_virq != irq_find_mapping(NULL, afu->native->psl_hwirq))
+   if (afu->native->psl_virq == 0 ||
+   afu->native->psl_virq !=
+   irq_find_mapping(NULL, afu->native->psl_hwirq))
return;

cxl_unmap_irq(afu->native->psl_virq, afu);
cxl_ops->release_one_irq(afu->adapter, afu->native->psl_hwirq);
kfree(afu->psl_irq_name);
+   afu->native->psl_virq = 0;
  }

  static void recover_psl_err(struct cxl_afu *afu, u64 errstat)