Re: [PATCH 2/4]: CPUIDLE: Introduce architecture independent cpuidle_pm_idle in drivers/cpuidle/cpuidle.c

2009-08-28 Thread Vaidyanathan Srinivasan
* Peter Zijlstra pet...@infradead.org [2009-08-28 09:01:12]:

 On Fri, 2009-08-28 at 08:48 +0200, Peter Zijlstra wrote:
  
   void cpuidle_install_idle_handler(void)
   {
 .
 .
 cpuidle_pm_idle = cpuidle_idle_call;
   }
  
  All I'm seeing here is a frigging mess.
  
  How on earths can something called: cpuidle_install_idle_handler() have
  a void argument, _WHAT_ handler is it going to install?
 
 Argh, now I see, it installs itself as the platform idle handler.
 
 so cpuidle_install_idle_handler() pokes at the unmanaged pm_idle pointer
 to make cpuidle take control.
 
 On module load it does:
 
  pm_idle_old = pm_idle;
 
 then in the actual idle loop it does:
 
 if (!dev || !dev->enabled) {
 if (pm_idle_old)
 pm_idle_old();
 
 who is to say that the pointer stored at module init time is still
 around at that time?
 
 So cpuidle recognised the pm_idle stuff was flaky, but instead of
 fixing it, they build a whole new layer on top of it. Brilliant.
 
 /me goes mark this whole thread read, I've got enough things to do.

Hi Peter,

I understand that you are frustrated with the mess.  We are willing
to clean up the pm_idle pointer now, before the cpuidle framework is
exploited by more archs.

At this point we need your suggestions on what interface we should
consider 'clean' and safe.

cpuidle.c and the arch-independent cpuidle subsystem are not a module,
and its cpuidle_idle_call() routine is valid and can be safely called
from the arch-dependent process.c.

The fragile part is how cpuidle_idle_call() is hooked onto the arch
specific cpu_idle() function at runtime.  x86 exports the pm_idle
pointer, while powerpc calls the ppc_md.power_save pointer.

At cpuidle init time we can override the platform idle function, but
that would mean including arch-specific code in cpuidle.c.

Do you think having an exported function in cpuidle.c that gives the
arch code the correct pointer would be better than the current situation:

in drivers/cpuidle/cpuidle.c

void (*get_cpuidle_handler(void))(void)
{
return cpuidle_pm_idle;
}
EXPORT_SYMBOL(get_cpuidle_handler);

and from pseries/processor_idle.c,

ppc_md.power_save = get_cpuidle_handler();
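
To make the proposal concrete, here is a minimal sketch of how the two
sides could fit together, assuming the names discussed above
(cpuidle_pm_idle, cpuidle_idle_call and get_cpuidle_handler); this is
only a sketch of the idea, not the final patch:

/* drivers/cpuidle/cpuidle.c -- sketch only */
static void (*cpuidle_pm_idle)(void);

void cpuidle_install_idle_handler(void)
{
	/* cpuidle takes over the platform idle entry point */
	cpuidle_pm_idle = cpuidle_idle_call;
}

void (*get_cpuidle_handler(void))(void)
{
	return cpuidle_pm_idle;
}
EXPORT_SYMBOL(get_cpuidle_handler);

/* arch/powerpc/platforms/pseries/processor_idle.c -- sketch only */
static int __init pseries_processor_idle_init(void)
{
	/* the arch asks cpuidle for the handler instead of cpuidle
	 * poking at an exported pm_idle-style pointer */
	ppc_md.power_save = get_cpuidle_handler();
	return 0;
}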

--Vaidy



Re: [PATCH v3 0/3] cpu: pseries: Cpu offline states framework

2009-09-16 Thread Vaidyanathan Srinivasan
* Peter Zijlstra a.p.zijls...@chello.nl [2009-09-16 18:35:16]:

 On Wed, 2009-09-16 at 21:54 +0530, Dipankar Sarma wrote:
 
  No, for this specific case, latency isn't an issue. The issue is -
  how do we cede unused vcpus to hypervisor for better energy management ?
  Yes, it can be done by a hypervisor manager telling the kernel to
  offline and make a bunch of vcpus inactive. It does have to choose
   offline (release vcpu) vs. inactive (cede but guaranteed if needed).
  The problem is that long ago we exported a lot of hotplug stuff to
  userspace through the sysfs interface and we cannot do something
  inside the kernel without keeping the sysfs stuff consistent.
  This seems like a sane way to do that without undoing all the
  virtual cpu hotplug infrastructure in different supporting archs.
 
 I'm still not getting it..
 
 Suppose we have some guest, it booted with 4 cpus.
 
 We then offline 2 of them.
 
 Apparently this LPAR binds guest cpus to physical cpus?
 So we use a hypervisor interface to reclaim these 2 offlined cpus and
 re-assign them to some other guest.
 
 So far so good, right?
 
 Now if you were to try and online the cpus in the guest, it'd fail
 because the cpus aren't backed anymore, and the hot-plug simply
 times-out and fails.
 
 And we're still good, right?

The requirements differ here.  If we had offlined 2 vCPUs for the
purpose of system reconfiguration, the expected behaviour of the
offline interface would be right.  However, the proposed cede interface
is needed when we want them to temporarily go away but still come back
when we do an online.  We want the online to always succeed since the
backing physical resources are not relinquished.  The proposed
interface facilitates offline without relinquishing the physical
resources assigned to the LPAR.

--Vaidy



Re: [v6 PATCH 0/7]: cpuidle/x86/POWER: Cleanup idle power management code in x86, cleanup drivers/cpuidle/cpuidle.c and introduce cpuidle to POWER.

2009-09-25 Thread Vaidyanathan Srinivasan
* Arjan van de Ven ar...@infradead.org [2009-09-24 14:22:28]:

 On Thu, 24 Sep 2009 10:42:41 +0530
 Arun R Bharadwaj a...@linux.vnet.ibm.com wrote:
 
  * Arun R Bharadwaj a...@linux.vnet.ibm.com [2009-09-22 16:55:27]:
  
  Hi Len, (or other acpi folks),
  
  I had a question regarding ACPI-cpuidle interaction in the current
  implementation.
  
  Currently, every cpu (i.e. acpi_processor) registers to cpuidle as
  a cpuidle_device. So every cpu has to go through the process of
  setting up the idle states and then registering as a cpuidle device.
  
  What exactly is the reason behind this?
  
 
 technically a BIOS can opt to give you C states via ACPI on some cpus,
 but not on others.
 
 in practice when this happens it tends to be a bug.. but it's
 technically a valid configuration

So we will need to keep the per-cpu registration for now, because we
may have such buggy BIOSes in the field and we don't want the cpuidle
framework to malfunction there.

--Vaidy




Re: [PATCH v2 0/2] cpu: pseries: Offline state framework.

2009-09-25 Thread Vaidyanathan Srinivasan
* Arjan van de Ven ar...@infradead.org [2009-09-24 13:41:23]:

 On Thu, 24 Sep 2009 13:33:07 +0200
 Peter Zijlstra a.p.zijls...@chello.nl wrote:
 
  On Thu, 2009-09-24 at 18:38 +1000, Benjamin Herrenschmidt wrote:
   On Thu, 2009-09-24 at 09:51 +0200, Peter Zijlstra wrote:
 I don't quite follow your logic here. This is useful for more
 than just hypervisors. For example, take the HV out of the
 picture for a moment and imagine that the HW has the ability to
 offline CPU in various power levels, with varying latencies to
 bring them back.

cpu-hotplug is an utter slow path, anybody saying latency and
hotplug in the same sentence doesn't seem to grasp either or both
concepts.
   
   Let's forget about latency then. Let's imagine I want to set a CPU
   offline to save power, vs. setting it offline -and- opening the back
   door of the machine to actually physically replace it :-)
  
  If the hardware is capable of physical hotplug, then surely powering
  the socket down saves most power and is the preferred mode?
 
 btw just to take away a perception that generally powering down sockets
 help; it does not help for all cpus. Some cpus are so efficient in idle
 that the incremental gain one would get by offlining a core is just
 not worth it
 (in fact, in x86, it's the same thing)
 
 I obviously can't speak for p-series cpus, just wanted to point out
 that there is no universal truth about offlining saves power.

Hi Arjan,

As you have said, on some cpus the extra effort of offlining does not
save us any extra power, and the state will be the same as idle.  The
assertion that offlining saves power is still valid: it could be the
same as idle or better, depending on the architecture and implementation.

On x86 we still need the code (Venki posted) that takes cpus to C6 on
offline to save power, or else offlining consumes more power than idle
due to the C1/hlt state.  This framework can help here as well if we
have any apprehension about making the lowest sleep state the default
on x86 and want the administrator to decide.

--Vaidy


Re: [PATCH v3 0/3] cpu: pseries: Cpu offline states framework

2009-09-28 Thread Vaidyanathan Srinivasan
* Peter Zijlstra a.p.zijls...@chello.nl [2009-09-25 16:48:40]:

 On Thu, 2009-09-24 at 10:51 +1000, Benjamin Herrenschmidt wrote:
  On Tue, 2009-09-15 at 14:11 +0200, Peter Zijlstra wrote:
   I still think its a layering violation... its the hypervisor manager
   that should be bothered in what state an off-lined cpu is in. 
   
  That's not how our hypervisor works.
 
 Then fix it?
 
  If you ask through the management interface, to remove a CPU from a
  partition, the HV will communicate with a daemon inside the partition
  that will then unplug the CPU via the right call.
  
  I don't really understand your objections to be honest. And I fail to
  see why it would be a layering violation to have the ability for the OS
   to indicate in what state it wishes to relinquish a CPU to the
  hypervisor, which more or less defines what is the expected latency for
  getting it back later on.
 
 OK, so the main objection is the abuse of CPU hotplug as resource
 management feature.
 
 CPU hotplug is terribly invasive and expensive to the kernel, doing
 hotplug on a minute basis is just plain crazy.
 
 If you want a CPU in a "keep it near and don't hand it back to the HV"
 state, why not use cpusets to isolate it and simply not run tasks on it?
 
 cpusets don't use stopmachine and are much nicer to the rest of the
 kernel over-all.

Hi Peter,

This interface is not expected to be used every minute or so in a way
that impacts the operation of the rest of the system.  Cpu hotplug is
currently used as a resource management feature in virtualised systems
using dlpar operations.  I do understand that cpu hotplug is invasive
at run time and that this amount of complexity is required to carefully
isolate the cpu from any involvement in the running kernel.

Building another interface to isolate cpus to the same extent as cpu
hotplug does today would be redundant and equally invasive.
Alternatives like cpusets only migrate tasks for isolation.

--Vaidy



Re: [PATCH v3 0/3] cpu: pseries: Cpu offline states framework

2009-09-28 Thread Vaidyanathan Srinivasan
* Benjamin Herrenschmidt b...@kernel.crashing.org [2009-09-26 07:12:48]:

 On Fri, 2009-09-25 at 16:48 +0200, Peter Zijlstra wrote:
  On Thu, 2009-09-24 at 10:51 +1000, Benjamin Herrenschmidt wrote:
   On Tue, 2009-09-15 at 14:11 +0200, Peter Zijlstra wrote:
I still think its a layering violation... its the hypervisor manager
that should be bothered in what state an off-lined cpu is in. 

   That's not how our hypervisor works.
  
  Then fix it?
 
 Are you serious ? :-)
 
  CPU hotplug is terribly invasive and expensive to the kernel, doing
  hotplug on a minute basis is just plain crazy.
  
   If you want a CPU in a "keep it near and don't hand it back to the HV"
   state, why not use cpusets to isolate it and simply not run tasks on it?
  
  cpusets don't use stopmachine and are much nicer to the rest of the
  kernel over-all.
 
 Gautham, what is the difference in terms of power saving between having
 it idle for long periods of time (which could do H_CEDE and with NO_HZ,
 probably wouldn't need to wake up that often) and having it unplugged in
 a H_CEDE loop ?

Hi Ben,

A cede latency specifier value indicating the latency expectation of
the guest OS can be established in the VPA to inform the hypervisor
during the H_CEDE call.  Currently we do call H_CEDE during NO_HZ for
efficient idle.  However, higher cede latency values may not be
suitable for idle CPUs in the kernel; instead, more energy savings may
result from exploiting this feature through the CPU hotplug interface.

--Vaidy


Re: [PATCH v5 0/4] pseries: Add cede support for cpu-offline

2009-11-23 Thread Vaidyanathan Srinivasan
* Benjamin Herrenschmidt b...@kernel.crashing.org [2009-11-24 14:35:09]:

 On Fri, 2009-10-30 at 10:52 +0530, Gautham R Shenoy wrote:
  Hi,
  
  This is version 5 of patch series that provides a framework to choose the
  state a pseries CPU must be put to when it is offlined.
  
  Previous versions can be found here:
  Version 4: http://lkml.org/lkml/2009/10/9/59
  Version 3: http://lkml.org/lkml/2009/9/15/164
  Version 2: http://lkml.org/lkml/2009/8/28/102
  Version 1: http://lkml.org/lkml/2009/8/6/236
  
  Changes from the previous version include:
  - Rebased against Nathan Fontenot's latest pseries kernel handling of
  dynamic logical partitioning v4 patches found here:
  http://lkml.org/lkml/2009/10/21/98
 
 I can't merge them right now because afaik, Nathan patches are still not
 quite ready. So we need to either get a final version of Nathan patches
 something like tomorrow or you need to rebase your series on current
 -next by the end of the week, or I'm afraid it's going to miss the next
 merge window.

Hi Ben,

I had checked with Nathan earlier and he mentioned that he is working
on an update.  We can post a rebase to -next tomorrow, but this series
depends on Nathan's patches, so we will coordinate with him.  This
feature is important for the next merge window.

Thanks,
Vaidy


[PATCH v6 0/2] pseries: Add cede support for cpu-offline

2009-11-26 Thread Vaidyanathan Srinivasan
Hi,

This is version 6 of the patch series that provides a framework to choose the
state a pseries CPU must be put into when it is offlined.

Previous versions can be found here:
Version 5: http://lkml.org/lkml/2009/10/30/6
Version 4: http://lkml.org/lkml/2009/10/9/59
Version 3: http://lkml.org/lkml/2009/9/15/164
Version 2: http://lkml.org/lkml/2009/8/28/102
Version 1: http://lkml.org/lkml/2009/8/6/236

Changes from the previous version include:

- Built on Nathan Fontenot's v3 of Kernel handling of Dynamic
  Logical Partitioning http://lkml.org/lkml/2009/11/25/21

- Rebased to powerpc.git tree and hence dropped 1st and 2nd patch in
  the stack since they are already in the powerpc.git tree:

  With reference to previous version,

  Dropped:
  1/4 pSeries: extended_cede_processor() helper function
  2/4 pSeries: Add hooks to put the CPU into an appropriate offline state

  Posting only:
  3/4 pseries: Add code to online/offline CPUs of a DLPAR node
  4/4 pseries: Serialize cpu hotplug operations during deactivate Vs deallocate

  Minor changes in the above patches due to changes in Nathan's routines.

Also,

- This approach addresses Peter Z's objections regarding layering
  violations. The user simply offlines the cpu and doesn't worry about what
  state the CPU should be put into. That part is automatically handled by the
  kernel.

  Acked-by: Peter Zijlstra a.p.zijls...@chello.nl
  http://lkml.org/lkml/2009/11/11/328

- It does not add any additional sysfs interface; instead it uses the existing
  sysfs interface to offline CPUs.

- On platforms which do not have support for ceding the vcpu with a
  latency specifier value, the offlining mechanism defaults to the current
  method of calling rtas_stop_self().

This patchset is based on powerpc.git + Nathan's patches and has been built and
tested on pseries platforms.  This series can be applied to powerpc.git after
Nathan's patches.

Thanks,
Vaidy

---

Gautham R Shenoy (2):
  pseries: Serialize cpu hotplug operations during deactivate Vs deallocate
  pseries: Add code to online/offline CPUs of a DLPAR node


 arch/powerpc/platforms/pseries/dlpar.c |  144 ++--
 drivers/base/cpu.c |2 
 include/linux/cpu.h|   13 +++
 3 files changed, 150 insertions(+), 9 deletions(-)



[PATCH v6 1/2] pseries: Add code to online/offline CPUs of a DLPAR node

2009-11-26 Thread Vaidyanathan Srinivasan
From: Gautham R Shenoy e...@in.ibm.com

Currently the cpu allocation/deallocation on pSeries is a
two-step process from userspace:

- Set the indicators and update the device tree by writing to the sysfs
  tunable "probe" during allocation and "release" during deallocation.
- Online/offline the CPUs of the allocated/to-be-deallocated node by
  writing to the sysfs tunable "online".

This patch adds kernel code to online/offline the CPUs soon after they
have been allocated / just before they are deallocated.  This way, the
userspace tool that performs DLPAR operations only has to deal with one
set of sysfs tunables, namely "probe" and "release".
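
As an illustration of the resulting flow, a small hypothetical
userspace sketch (the drc-index value is made up for illustration and
is not part of this patch):

/* hypothetical userspace sketch: with this patch, one write to "probe"
 * attaches the node and onlines its CPUs; no separate writes to each
 * cpuN/online are needed. */
#include <stdio.h>

int main(void)
{
	FILE *f = fopen("/sys/devices/system/cpu/probe", "w");

	if (!f) {
		perror("probe");
		return 1;
	}
	fprintf(f, "0x10000020");	/* example drc-index, hypothetical */
	return fclose(f) ? 1 : 0;
}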

Signed-off-by: Gautham R Shenoy e...@in.ibm.com
Signed-off-by: Nathan Fontenot nf...@austin.ibm.com
Acked-by: Vaidyanathan Srinivasan sva...@linux.vnet.ibm.com
---

 arch/powerpc/platforms/pseries/dlpar.c |  101 
 1 files changed, 101 insertions(+), 0 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/dlpar.c 
b/arch/powerpc/platforms/pseries/dlpar.c
index fe8d4b3..642e1b2 100644
--- a/arch/powerpc/platforms/pseries/dlpar.c
+++ b/arch/powerpc/platforms/pseries/dlpar.c
@@ -16,6 +16,7 @@
 #include <linux/proc_fs.h>
 #include <linux/spinlock.h>
 #include <linux/cpu.h>
+#include "offline_states.h"
 
 #include <asm/prom.h>
 #include <asm/machdep.h>
@@ -287,6 +288,98 @@ int dlpar_detach_node(struct device_node *dn)
return 0;
 }
 
+int online_node_cpus(struct device_node *dn)
+{
+   int rc = 0;
+   unsigned int cpu;
+   int len, nthreads, i;
+   const u32 *intserv;
+
+   intserv = of_get_property(dn, "ibm,ppc-interrupt-server#s", &len);
+   if (!intserv)
+   return -EINVAL;
+
+   nthreads = len / sizeof(u32);
+
+   cpu_maps_update_begin();
+   for (i = 0; i < nthreads; i++) {
+   for_each_present_cpu(cpu) {
+   if (get_hard_smp_processor_id(cpu) != intserv[i])
+   continue;
+   BUG_ON(get_cpu_current_state(cpu)
+   != CPU_STATE_OFFLINE);
+   cpu_maps_update_done();
+   rc = cpu_up(cpu);
+   if (rc)
+   goto out;
+   cpu_maps_update_begin();
+
+   break;
+   }
+   if (cpu == num_possible_cpus())
+   printk(KERN_WARNING "Could not find cpu to online "
+  "with physical id 0x%x\n", intserv[i]);
+   }
+   cpu_maps_update_done();
+
+out:
+   return rc;
+
+}
+
+int offline_node_cpus(struct device_node *dn)
+{
+   int rc = 0;
+   unsigned int cpu;
+   int len, nthreads, i;
+   const u32 *intserv;
+
+   intserv = of_get_property(dn, "ibm,ppc-interrupt-server#s", &len);
+   if (!intserv)
+   return -EINVAL;
+
+   nthreads = len / sizeof(u32);
+
+   cpu_maps_update_begin();
+   for (i = 0; i < nthreads; i++) {
+   for_each_present_cpu(cpu) {
+   if (get_hard_smp_processor_id(cpu) != intserv[i])
+   continue;
+
+   if (get_cpu_current_state(cpu) == CPU_STATE_OFFLINE)
+   break;
+
+   if (get_cpu_current_state(cpu) == CPU_STATE_ONLINE) {
+   cpu_maps_update_done();
+   rc = cpu_down(cpu);
+   if (rc)
+   goto out;
+   cpu_maps_update_begin();
+   break;
+
+   }
+
+   /*
+* The cpu is in CPU_STATE_INACTIVE.
+* Upgrade its state to CPU_STATE_OFFLINE.
+*/
+   set_preferred_offline_state(cpu, CPU_STATE_OFFLINE);
+   BUG_ON(plpar_hcall_norets(H_PROD, intserv[i])
+   != H_SUCCESS);
+   __cpu_die(cpu);
+   break;
+   }
+   if (cpu == num_possible_cpus())
+   printk(KERN_WARNING "Could not find cpu to offline "
+  "with physical id 0x%x\n", intserv[i]);
+   }
+   cpu_maps_update_done();
+
+out:
+   return rc;
+
+}
+
 #define DR_ENTITY_SENSE		9003
 #define DR_ENTITY_PRESENT	1
 #define DR_ENTITY_UNUSABLE	2
@@ -385,6 +478,8 @@ static ssize_t dlpar_cpu_probe(const char *buf, size_t 
count)
dlpar_free_cc_nodes(dn);
}
 
+   rc = online_node_cpus(dn);
+
return rc ? rc : count;
 }
 
@@ -404,6 +499,12 @@ static ssize_t dlpar_cpu_release(const char *buf, size_t 
count)
return -EINVAL;
}
 
+   rc = offline_node_cpus(dn);
+   if (rc

[PATCH v6 2/2] pseries: Serialize cpu hotplug operations during deactivate Vs deallocate

2009-11-26 Thread Vaidyanathan Srinivasan
From: Gautham R Shenoy e...@in.ibm.com

Currently the cpu allocation/deallocation process comprises two steps:
- Set the indicators and update the device tree with DLPAR node
  information.

- Online/offline the allocated/deallocated CPU.

This is achieved by writing to the sysfs tunables "probe" during
allocation and "release" during deallocation.

At the same time, userspace can independently online/offline the CPUs of
the system using the sysfs tunable "online".

It is quite possible that when a userspace tool offlines a CPU
for the purpose of deallocation and is in the process of updating the
device tree, some other userspace tool could bring the CPU back online by
writing to the "online" sysfs tunable, thereby causing the deallocation
process to fail.

The solution is to serialize writes to the "probe"/"release" sysfs
tunables with writes to the "online" sysfs tunable.

This patch employs a mutex to provide this serialization, which is a
no-op on all architectures except PPC_PSERIES.
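
The include/linux/cpu.h hunk does not appear in the truncated diff
below, so here is a rough sketch of the shape such no-op stubs usually
take; this is an assumption, not a quote of that hunk:

/* sketch only, keyed off the config symbol used in dlpar.c below */
#ifdef CONFIG_ARCH_CPU_PROBE_RELEASE
extern void cpu_hotplug_driver_lock(void);
extern void cpu_hotplug_driver_unlock(void);
#else
static inline void cpu_hotplug_driver_lock(void) { }
static inline void cpu_hotplug_driver_unlock(void) { }
#endif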

Signed-off-by: Gautham R Shenoy e...@in.ibm.com
Acked-by: Vaidyanathan Srinivasan sva...@linux.vnet.ibm.com
---

 arch/powerpc/platforms/pseries/dlpar.c |   45 +---
 drivers/base/cpu.c |2 +
 include/linux/cpu.h|   13 +
 3 files changed, 50 insertions(+), 10 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/dlpar.c 
b/arch/powerpc/platforms/pseries/dlpar.c
index 642e1b2..fd2f0af 100644
--- a/arch/powerpc/platforms/pseries/dlpar.c
+++ b/arch/powerpc/platforms/pseries/dlpar.c
@@ -436,6 +436,18 @@ int dlpar_release_drc(u32 drc_index)
 
 #ifdef CONFIG_ARCH_CPU_PROBE_RELEASE
 
+static DEFINE_MUTEX(pseries_cpu_hotplug_mutex);
+
+void cpu_hotplug_driver_lock()
+{
+   mutex_lock(&pseries_cpu_hotplug_mutex);
+}
+
+void cpu_hotplug_driver_unlock()
+{
+   mutex_unlock(&pseries_cpu_hotplug_mutex);
+}
+
 static ssize_t dlpar_cpu_probe(const char *buf, size_t count)
 {
struct device_node *dn;
@@ -443,13 +455,18 @@ static ssize_t dlpar_cpu_probe(const char *buf, size_t 
count)
char *cpu_name;
int rc;
 
+   cpu_hotplug_driver_lock();
rc = strict_strtoul(buf, 0, &drc_index);
-   if (rc)
-   return -EINVAL;
+   if (rc) {
+   rc = -EINVAL;
+   goto out;
+   }
 
dn = dlpar_configure_connector(drc_index);
-   if (!dn)
-   return -EINVAL;
+   if (!dn) {
+   rc = -EINVAL;
+   goto out;
+   }
 
/* configure-connector reports cpus as living in the base
 * directory of the device tree.  CPUs actually live in the
@@ -459,7 +476,8 @@ static ssize_t dlpar_cpu_probe(const char *buf, size_t 
count)
   GFP_KERNEL);
if (!cpu_name) {
dlpar_free_cc_nodes(dn);
-   return -ENOMEM;
+   rc = -ENOMEM;
+   goto out;
}
 
sprintf(cpu_name, "/cpus%s", dn->full_name);
@@ -469,7 +487,8 @@ static ssize_t dlpar_cpu_probe(const char *buf, size_t 
count)
rc = dlpar_acquire_drc(drc_index);
if (rc) {
dlpar_free_cc_nodes(dn);
-   return -EINVAL;
+   rc = -EINVAL;
+   goto out;
}
 
rc = dlpar_attach_node(dn);
@@ -479,6 +498,8 @@ static ssize_t dlpar_cpu_probe(const char *buf, size_t 
count)
}
 
rc = online_node_cpus(dn);
+out:
+   cpu_hotplug_driver_unlock();
 
return rc ? rc : count;
 }
@@ -499,26 +520,30 @@ static ssize_t dlpar_cpu_release(const char *buf, size_t 
count)
return -EINVAL;
}
 
+   cpu_hotplug_driver_lock();
rc = offline_node_cpus(dn);
if (rc) {
of_node_put(dn);
-   return -EINVAL;
+   rc = -EINVAL;
+   goto out;
}
 
rc = dlpar_release_drc(*drc_index);
if (rc) {
of_node_put(dn);
-   return -EINVAL;
+   goto out;
}
 
rc = dlpar_detach_node(dn);
if (rc) {
dlpar_acquire_drc(*drc_index);
-   return rc;
+   goto out;
}
 
of_node_put(dn);
-   return count;
+out:
+   cpu_hotplug_driver_unlock();
+   return rc ? rc : count;
 }
 
 static int __init pseries_dlpar_init(void)
diff --git a/drivers/base/cpu.c b/drivers/base/cpu.c
index 7c03af7..27fd775 100644
--- a/drivers/base/cpu.c
+++ b/drivers/base/cpu.c
@@ -35,6 +35,7 @@ static ssize_t __ref store_online(struct sys_device *dev, 
struct sysdev_attribut
struct cpu *cpu = container_of(dev, struct cpu, sysdev);
ssize_t ret;
 
+   cpu_hotplug_driver_lock();
switch (buf[0]) {
case '0':
ret = cpu_down(cpu->sysdev.id);
@@ -49,6 +50,7 @@ static ssize_t __ref store_online(struct sys_device *dev, 
struct sysdev_attribut
default:
ret = -EINVAL

2.6.24-rc1 on PPC64: machine check exception

2007-11-05 Thread Vaidyanathan Srinivasan
Hi,

I got the following exception on a 128-way PPC64 machine while running
the ebizzy-0.2 benchmark.

cpu 0x48: Vector: 200 (Machine Check) at [c06df1bb37c0]
pc: c007a26c: .exit_robust_list+0x78/0x228
lr: c007a240: .exit_robust_list+0x4c/0x228
sp: c06df1bb3a40
   msr: 80109032
  current = 0xc06df1725120
  paca= 0xc06f4000
pid   = 5950, comm = ebizzy
cpu 0x78: Vector: 200 (Machine Check) at [c00e21bd37c0]
pc: c007a26c: .exit_robust_list+0x78/0x228
lr: c007a240: .exit_robust_list+0x4c/0x228
sp: c00e21bd3a40
   msr: 80109032
  current = 0xc00e pa  paca= 0xc06fb800
cidpid   = 5941, comm = ebizzy
   = 5941,[c06df1bb3b10] c00578a8 .do_exit+0x260/0x7fc
[c06df1bb3bc0] c0057ef8 .sys_exit_group+0x0/0x8
[c06df1bb3c50] c00636e0 .get_signal_to_deliver+0x45c/0x4ac
[c06df1bb3d00] c0011d44 .do_signal+0x68/0x2e4
[c06df1bb3e30] c0008ae8 do_work+0x28/0x2c
--- Exception: 500 (Hardware Interrupt) at 10007698
SP (40016601610) is in userspace

Config file attached. I did not see any similar failure reported with
2.6.24-rc1.

--Vaidy
#
# Automatically generated make config: don't edit
# Linux kernel version: 2.6.24-rc1
# Mon Nov  5 02:25:46 2007
#
CONFIG_PPC64=y

#
# Processor support
#
# CONFIG_POWER4_ONLY is not set
CONFIG_POWER3=y
CONFIG_POWER4=y
# CONFIG_TUNE_CELL is not set
CONFIG_PPC_FPU=y
CONFIG_ALTIVEC=y
CONFIG_PPC_STD_MMU=y
CONFIG_PPC_MM_SLICES=y
CONFIG_VIRT_CPU_ACCOUNTING=y
CONFIG_SMP=y
CONFIG_NR_CPUS=128
CONFIG_64BIT=y
CONFIG_WORD_SIZE=64
CONFIG_PPC_MERGE=y
CONFIG_MMU=y
CONFIG_GENERIC_CMOS_UPDATE=y
CONFIG_GENERIC_TIME=y
CONFIG_GENERIC_TIME_VSYSCALL=y
CONFIG_GENERIC_CLOCKEVENTS=y
CONFIG_GENERIC_HARDIRQS=y
CONFIG_IRQ_PER_CPU=y
CONFIG_RWSEM_XCHGADD_ALGORITHM=y
CONFIG_ARCH_HAS_ILOG2_U32=y
CONFIG_ARCH_HAS_ILOG2_U64=y
CONFIG_GENERIC_HWEIGHT=y
CONFIG_GENERIC_CALIBRATE_DELAY=y
CONFIG_GENERIC_FIND_NEXT_BIT=y
CONFIG_ARCH_NO_VIRT_TO_BUS=y
CONFIG_PPC=y
CONFIG_EARLY_PRINTK=y
CONFIG_COMPAT=y
CONFIG_SYSVIPC_COMPAT=y
CONFIG_SCHED_NO_NO_OMIT_FRAME_POINTER=y
CONFIG_ARCH_MAY_HAVE_PC_FDC=y
CONFIG_PPC_OF=y
CONFIG_OF=y
CONFIG_PPC_UDBG_16550=y
# CONFIG_GENERIC_TBSYNC is not set
CONFIG_AUDIT_ARCH=y
CONFIG_GENERIC_BUG=y
# CONFIG_DEFAULT_UIMAGE is not set
# CONFIG_PPC_DCR_NATIVE is not set
# CONFIG_PPC_DCR_MMIO is not set
# CONFIG_PPC_OF_PLATFORM_PCI is not set
CONFIG_DEFCONFIG_LIST="/lib/modules/$UNAME_RELEASE/.config"

#
# General setup
#
CONFIG_EXPERIMENTAL=y
CONFIG_LOCK_KERNEL=y
CONFIG_INIT_ENV_ARG_LIMIT=32
CONFIG_LOCALVERSION="-sv"
CONFIG_LOCALVERSION_AUTO=y
CONFIG_SWAP=y
CONFIG_SYSVIPC=y
CONFIG_SYSVIPC_SYSCTL=y
CONFIG_POSIX_MQUEUE=y
# CONFIG_BSD_PROCESS_ACCT is not set
# CONFIG_TASKSTATS is not set
# CONFIG_USER_NS is not set
CONFIG_AUDIT=y
CONFIG_AUDITSYSCALL=y
CONFIG_AUDIT_TREE=y
CONFIG_IKCONFIG=y
CONFIG_IKCONFIG_PROC=y
CONFIG_LOG_BUF_SHIFT=17
CONFIG_CGROUPS=y
# CONFIG_CGROUP_DEBUG is not set
# CONFIG_CGROUP_NS is not set
# CONFIG_CGROUP_CPUACCT is not set
CONFIG_CPUSETS=y
# CONFIG_FAIR_GROUP_SCHED is not set
# CONFIG_SYSFS_DEPRECATED is not set
CONFIG_PROC_PID_CPUSET=y
CONFIG_RELAY=y
CONFIG_BLK_DEV_INITRD=y
CONFIG_INITRAMFS_SOURCE=""
CONFIG_CC_OPTIMIZE_FOR_SIZE=y
CONFIG_SYSCTL=y
# CONFIG_EMBEDDED is not set
CONFIG_SYSCTL_SYSCALL=y
CONFIG_KALLSYMS=y
CONFIG_KALLSYMS_ALL=y
# CONFIG_KALLSYMS_EXTRA_PASS is not set
CONFIG_HOTPLUG=y
CONFIG_PRINTK=y
CONFIG_BUG=y
CONFIG_ELF_CORE=y
CONFIG_BASE_FULL=y
CONFIG_FUTEX=y
CONFIG_ANON_INODES=y
CONFIG_EPOLL=y
CONFIG_SIGNALFD=y
CONFIG_EVENTFD=y
CONFIG_SHMEM=y
CONFIG_VM_EVENT_COUNTERS=y
CONFIG_SLAB=y
# CONFIG_SLUB is not set
# CONFIG_SLOB is not set
CONFIG_RT_MUTEXES=y
# CONFIG_TINY_SHMEM is not set
CONFIG_BASE_SMALL=0
CONFIG_MODULES=y
CONFIG_MODULE_UNLOAD=y
# CONFIG_MODULE_FORCE_UNLOAD is not set
CONFIG_MODVERSIONS=y
CONFIG_MODULE_SRCVERSION_ALL=y
CONFIG_KMOD=y
CONFIG_STOP_MACHINE=y
CONFIG_BLOCK=y
# CONFIG_BLK_DEV_IO_TRACE is not set
# CONFIG_BLK_DEV_BSG is not set
CONFIG_BLOCK_COMPAT=y

#
# IO Schedulers
#
CONFIG_IOSCHED_NOOP=y
CONFIG_IOSCHED_AS=y
CONFIG_IOSCHED_DEADLINE=y
CONFIG_IOSCHED_CFQ=y
CONFIG_DEFAULT_AS=y
# CONFIG_DEFAULT_DEADLINE is not set
# CONFIG_DEFAULT_CFQ is not set
# CONFIG_DEFAULT_NOOP is not set
CONFIG_DEFAULT_IOSCHED="anticipatory"

#
# Platform support
#
CONFIG_PPC_MULTIPLATFORM=y
# CONFIG_PPC_82xx is not set
# CONFIG_PPC_83xx is not set
# CONFIG_PPC_86xx is not set
CONFIG_PPC_PSERIES=y
CONFIG_PPC_SPLPAR=y
CONFIG_EEH=y
CONFIG_SCANLOG=m
CONFIG_LPARCFG=y
# CONFIG_PPC_ISERIES is not set
# CONFIG_PPC_MPC52xx is not set
# CONFIG_PPC_MPC5200 is not set
# CONFIG_PPC_PMAC is not set
# CONFIG_PPC_MAPLE is not set
# CONFIG_PPC_PASEMI is not set
# CONFIG_PPC_CELLEB is not set
# CONFIG_PPC_PS3 is not set
# CONFIG_PPC_CELL is not set
# CONFIG_PPC_CELL_NATIVE is not set
# CONFIG_PPC_IBM_CELL_BLADE is not set
# CONFIG_PQ2ADS is not set
CONFIG_PPC_NATIVE=y
# CONFIG_UDBG_RTAS_CONSOLE is not set
CONFIG_XICS=y
CONFIG_MPIC=y

Re: 2.6.24-rc1 on PPC64: machine check exception

2007-11-15 Thread Vaidyanathan Srinivasan
 
 http://patchwork.ozlabs.org/linuxppc/patch?id=14475
 
 Thanks for pointing me to this patch.  I will try out and let you know if
 this fixed the problem.

Hi Anton,

This patch fixed the problem.  I am able to run and profile ebizzy on the
128-way PPC64 machine.  However, this fix is not included in 2.6.24-rc2
either.  I will watch for its inclusion in 2.6.24.

--Vaidy


Re: [PATCH] powerpc: Quieten cede latency printk

2010-02-08 Thread Vaidyanathan Srinivasan
* Balbir Singh bal...@linux.vnet.ibm.com [2010-02-08 16:37:37]:

 On Mon, Feb 8, 2010 at 5:22 AM, Anton Blanchard an...@samba.org wrote:
 
  The cede latency stuff is relatively new and we don't need to complain about
  it not working on older firmware.
 
  Signed-off-by: Anton Blanchard an...@samba.org
 
 Acked-by: Balbir Singh bal...@linux.vnet.ibm.com
 
 Seems like a reasonable change. CC'ing Vaidy and Gautham to look even
 more closely at the patch.

Hi Anton,

Removing the printk is reasonable.  The failure case is just noise on
older firmware, while the success case may not be very informative.

Thanks,
Vaidy


[PATCH] powerpc: reset kernel stack on cpu online from cede state

2010-02-22 Thread Vaidyanathan Srinivasan
powerpc: reset kernel stack on cpu online from cede state

Cpu hotplug (offline) without a dlpar operation will place the cpu
in cede state, and the extended_cede_processor() function will
return when resumed.

The kernel stack pointer needs to be reset before
start_secondary() is called to continue the online operation.

Added a new function start_secondary_resume() to do the above
steps.

Signed-off-by: Vaidyanathan Srinivasan sva...@linux.vnet.ibm.com

diff --git a/arch/powerpc/kernel/head_64.S b/arch/powerpc/kernel/head_64.S
index 9258074..567cd57 100644
--- a/arch/powerpc/kernel/head_64.S
+++ b/arch/powerpc/kernel/head_64.S
@@ -615,6 +615,17 @@ _GLOBAL(start_secondary_prolog)
std r3,0(r1)/* Zero the stack frame pointer */
bl  .start_secondary
b   .
+/*
+ * Reset stack pointer and call start_secondary
+ * to continue with online operation when woken up
+ * from cede in cpu offline.
+ */
+_GLOBAL(start_secondary_resume)
+   ld  r1,PACAKSAVE(r13)   /* Reload kernel stack pointer */
+   li  r3,0
+   std r3,0(r1)/* Zero the stack frame pointer */
+   bl  .start_secondary
+   b   .
 #endif
 
 /*
diff --git a/arch/powerpc/platforms/pseries/hotplug-cpu.c 
b/arch/powerpc/platforms/pseries/hotplug-cpu.c
index 6ea4698..9be7af4 100644
--- a/arch/powerpc/platforms/pseries/hotplug-cpu.c
+++ b/arch/powerpc/platforms/pseries/hotplug-cpu.c
@@ -146,12 +146,11 @@ static void pseries_mach_cpu_die(void)
unregister_slb_shadow(hwcpu, __pa(get_slb_shadow()));
 
/*
-* NOTE: Calling start_secondary() here for now to
-* start new context.
-* However, need to do it cleanly by resetting the
-* stack pointer.
+* Call to start_secondary_resume() will not return.
+* Kernel stack will be reset and start_secondary()
+* will be called to continue the online operation.
 */
-   start_secondary();
+   start_secondary_resume();
 
} else if (get_preferred_offline_state(cpu) == CPU_STATE_OFFLINE) {
 
diff --git a/arch/powerpc/platforms/pseries/offline_states.h 
b/arch/powerpc/platforms/pseries/offline_states.h
index 22574e0..da07256 100644
--- a/arch/powerpc/platforms/pseries/offline_states.h
+++ b/arch/powerpc/platforms/pseries/offline_states.h
@@ -14,5 +14,5 @@ extern void set_cpu_current_state(int cpu, enum 
cpu_state_vals state);
 extern enum cpu_state_vals get_preferred_offline_state(int cpu);
 extern void set_preferred_offline_state(int cpu, enum cpu_state_vals state);
 extern void set_default_offline_state(int cpu);
-extern int start_secondary(void);
+extern void start_secondary_resume(void);
 #endif


[PATCH v2 0/3] powerpc: bug fixes in pseries_mach_cpu_die()

2010-03-01 Thread Vaidyanathan Srinivasan
Hi Ben,

The following set of patches fixes the kernel stack reset issue and also
potential race conditions.

This fix should be applied to the 2.6.33 stable tree onwards.

Problem description:

(1) Repeated offline/online operations on pseries with the extended cede
processor feature will run over the kernel stack and crash, since the
stack was not reset on resume from cede.

(2) The 'if' conditions and code have been slightly rearranged to avoid
any possible race conditions where preferred_offline_state can change
while the cpu is still executing the condition checks in
pseries_mach_cpu_die().  The new code has CPU_STATE_OFFLINE as the
default and also improves readability.  Thanks to Anton Blanchard for
pointing this out.

(3) There were too many noisy KERN_INFO printks on offline/online
operations.  Removed the printks for now.  Other debug methods or trace
events can be introduced at a later point.

The patches have been tested on a large POWER machine with both the
cede_offline=off case and the default cede_offline=on case.

Please apply in powerpc.git tree and push upstream.

Previous version of the patch can be found at:

[PATCH] powerpc: reset kernel stack on cpu online from cede state
http://lists.ozlabs.org/pipermail/linuxppc-dev/2010-February/080515.html

Thanks,
Vaidy

---

Vaidyanathan Srinivasan (3):
  powerpc: reset kernel stack on cpu online from cede state
  powerpc: move checks in pseries_mach_cpu_die()
  powerpc: reduce printk from pseries_mach_cpu_die()


 arch/powerpc/kernel/head_64.S   |   11 ++
 arch/powerpc/platforms/pseries/hotplug-cpu.c|   42 ---
 arch/powerpc/platforms/pseries/offline_states.h |2 +
 3 files changed, 27 insertions(+), 28 deletions(-)



[PATCH v2 1/3] powerpc: reset kernel stack on cpu online from cede state

2010-03-01 Thread Vaidyanathan Srinivasan
Cpu hotplug (offline) without a dlpar operation will place the cpu
in cede state, and the extended_cede_processor() function will
return when resumed.

The kernel stack pointer needs to be reset before
start_secondary() is called to continue the online operation.

Added a new function start_secondary_resume() to do the above
steps.

Signed-off-by: Vaidyanathan Srinivasan sva...@linux.vnet.ibm.com
Cc: Gautham R Shenoy e...@in.ibm.com
---
 arch/powerpc/kernel/head_64.S   |   11 +++
 arch/powerpc/platforms/pseries/hotplug-cpu.c|9 -
 arch/powerpc/platforms/pseries/offline_states.h |2 +-
 3 files changed, 16 insertions(+), 6 deletions(-)

diff --git a/arch/powerpc/kernel/head_64.S b/arch/powerpc/kernel/head_64.S
index 9258074..567cd57 100644
--- a/arch/powerpc/kernel/head_64.S
+++ b/arch/powerpc/kernel/head_64.S
@@ -615,6 +615,17 @@ _GLOBAL(start_secondary_prolog)
std r3,0(r1)/* Zero the stack frame pointer */
bl  .start_secondary
b   .
+/*
+ * Reset stack pointer and call start_secondary
+ * to continue with online operation when woken up
+ * from cede in cpu offline.
+ */
+_GLOBAL(start_secondary_resume)
+   ld  r1,PACAKSAVE(r13)   /* Reload kernel stack pointer */
+   li  r3,0
+   std r3,0(r1)/* Zero the stack frame pointer */
+   bl  .start_secondary
+   b   .
 #endif
 
 /*
diff --git a/arch/powerpc/platforms/pseries/hotplug-cpu.c 
b/arch/powerpc/platforms/pseries/hotplug-cpu.c
index 6ea4698..9be7af4 100644
--- a/arch/powerpc/platforms/pseries/hotplug-cpu.c
+++ b/arch/powerpc/platforms/pseries/hotplug-cpu.c
@@ -146,12 +146,11 @@ static void pseries_mach_cpu_die(void)
unregister_slb_shadow(hwcpu, __pa(get_slb_shadow()));
 
/*
-* NOTE: Calling start_secondary() here for now to
-* start new context.
-* However, need to do it cleanly by resetting the
-* stack pointer.
+* Call to start_secondary_resume() will not return.
+* Kernel stack will be reset and start_secondary()
+* will be called to continue the online operation.
 */
-   start_secondary();
+   start_secondary_resume();
 
} else if (get_preferred_offline_state(cpu) == CPU_STATE_OFFLINE) {
 
diff --git a/arch/powerpc/platforms/pseries/offline_states.h 
b/arch/powerpc/platforms/pseries/offline_states.h
index 22574e0..da07256 100644
--- a/arch/powerpc/platforms/pseries/offline_states.h
+++ b/arch/powerpc/platforms/pseries/offline_states.h
@@ -14,5 +14,5 @@ extern void set_cpu_current_state(int cpu, enum 
cpu_state_vals state);
 extern enum cpu_state_vals get_preferred_offline_state(int cpu);
 extern void set_preferred_offline_state(int cpu, enum cpu_state_vals state);
 extern void set_default_offline_state(int cpu);
-extern int start_secondary(void);
+extern void start_secondary_resume(void);
 #endif



[PATCH v2 2/3] powerpc: move checks in pseries_mach_cpu_die()

2010-03-01 Thread Vaidyanathan Srinivasan
Rearrange the condition checks for better code readability and to
prevent possible race conditions where preferred_offline_state can
change during the execution of pseries_mach_cpu_die().  The patch makes
pseries_mach_cpu_die() put the cpu into one of the consistent states
and not fall through to the BUG() at the end of the function.

Signed-off-by: Vaidyanathan Srinivasan sva...@linux.vnet.ibm.com
Cc: Gautham R Shenoy e...@in.ibm.com
---
 arch/powerpc/platforms/pseries/hotplug-cpu.c |   30 +-
 1 files changed, 15 insertions(+), 15 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/hotplug-cpu.c 
b/arch/powerpc/platforms/pseries/hotplug-cpu.c
index 9be7af4..39ecb40 100644
--- a/arch/powerpc/platforms/pseries/hotplug-cpu.c
+++ b/arch/powerpc/platforms/pseries/hotplug-cpu.c
@@ -140,25 +140,25 @@ static void pseries_mach_cpu_die(void)
 	if (!get_lppaca()->shared_proc)
 		get_lppaca()->donate_dedicated_cpu = 0;
 	get_lppaca()->idle = 0;
-   }
 
-   if (get_preferred_offline_state(cpu) == CPU_STATE_ONLINE) {
-   unregister_slb_shadow(hwcpu, __pa(get_slb_shadow()));
+   if (get_preferred_offline_state(cpu) == CPU_STATE_ONLINE) {
+   unregister_slb_shadow(hwcpu, __pa(get_slb_shadow()));
 
-   /*
-* Call to start_secondary_resume() will not return.
-* Kernel stack will be reset and start_secondary()
-* will be called to continue the online operation.
-*/
-   start_secondary_resume();
+   /*
+* Call to start_secondary_resume() will not return.
+* Kernel stack will be reset and start_secondary()
+* will be called to continue the online operation.
+*/
+   start_secondary_resume();
+   }
+   }
 
-   } else if (get_preferred_offline_state(cpu) == CPU_STATE_OFFLINE) {
+   /* Requested state is CPU_STATE_OFFLINE at this point */
+   WARN_ON(get_preferred_offline_state(cpu) != CPU_STATE_OFFLINE);
 
-   set_cpu_current_state(cpu, CPU_STATE_OFFLINE);
-   unregister_slb_shadow(hard_smp_processor_id(),
-   __pa(get_slb_shadow()));
-   rtas_stop_self();
-   }
+   set_cpu_current_state(cpu, CPU_STATE_OFFLINE);
+   unregister_slb_shadow(hwcpu, __pa(get_slb_shadow()));
+   rtas_stop_self();
 
/* Should never get here... */
BUG();



[PATCH v2 3/3] powerpc: reduce printk from pseries_mach_cpu_die()

2010-03-01 Thread Vaidyanathan Srinivasan
Remove the debug printks in pseries_mach_cpu_die().  These are
noisy at runtime.  Trace events can be added later to instrument this
section of code.

The following KERN_INFO printks are removed:

cpu 62 (hwid 62) returned from cede.
Decrementer value = b2802fff Timebase value = 2fa8f95035f4a
cpu 62 (hwid 62) got prodded to go online
cpu 58 (hwid 58) ceding for offline with hint 2

Signed-off-by: Vaidyanathan Srinivasan sva...@linux.vnet.ibm.com
Cc: Gautham R Shenoy e...@in.ibm.com
---
 arch/powerpc/platforms/pseries/hotplug-cpu.c |   11 ---
 1 files changed, 0 insertions(+), 11 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/hotplug-cpu.c 
b/arch/powerpc/platforms/pseries/hotplug-cpu.c
index 39ecb40..b842378 100644
--- a/arch/powerpc/platforms/pseries/hotplug-cpu.c
+++ b/arch/powerpc/platforms/pseries/hotplug-cpu.c
@@ -122,21 +122,10 @@ static void pseries_mach_cpu_die(void)
 	if (!get_lppaca()->shared_proc)
 		get_lppaca()->donate_dedicated_cpu = 1;
 
-	printk(KERN_INFO
-		"cpu %u (hwid %u) ceding for offline with hint %d\n",
-		cpu, hwcpu, cede_latency_hint);
while (get_preferred_offline_state(cpu) == CPU_STATE_INACTIVE) {
extended_cede_processor(cede_latency_hint);
-		printk(KERN_INFO "cpu %u (hwid %u) returned from cede.\n",
-			cpu, hwcpu);
-		printk(KERN_INFO
-			"Decrementer value = %x Timebase value = %llx\n",
-			get_dec(), get_tb());
}
 
-	printk(KERN_INFO "cpu %u (hwid %u) got prodded to go online\n",
-		cpu, hwcpu);
-
 	if (!get_lppaca()->shared_proc)
 		get_lppaca()->donate_dedicated_cpu = 0;
 	get_lppaca()->idle = 0;



[RFC] powerpc: add support for new hcall H_BEST_ENERGY

2010-03-03 Thread Vaidyanathan Srinivasan
Hi,

A new hypervisor call, H_BEST_ENERGY, is supported on the IBM pseries ppc64
platform in order to export platform energy management capabilities
to user space administrative tools.

This hypervisor call provides hints, i.e. a suggested list of cpus to
activate or deactivate, for optimized energy management.  The following
patch adds a new pseries-specific device driver that exports such
platform energy management information over a sysfs interface
for use by user space administrative tools.

The device driver exports the following sysfs files:

/sys/devices/system/cpu/pseries_(de)activate_hint_list and
/sys/devices/system/cpu/cpuN/pseries_(de)activate_hint.

Typical usage:

# cat /sys/devices/system/cpu/pseries_activate_hint_list
16,20,24
# cat /sys/devices/system/cpu/pseries_deactivate_hint_list
0,4,8,12

Comma separated list of recommended cpu numbers to activate or
deactivate.

cat /sys/devices/system/cpu/cpuN/pseries_(de)activate_hint
will provide a single integer value returned by the platform
for that cpu.
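
As a usage illustration, a small hypothetical userspace sketch that
reads the activate hint list described above (the file name is taken
from this description; the parsing is an assumption):

/* hypothetical userspace sketch: print the first recommended cpu */
#include <stdio.h>

int main(void)
{
	FILE *f = fopen("/sys/devices/system/cpu/pseries_activate_hint_list", "r");
	int cpu;

	if (!f) {
		perror("pseries_activate_hint_list");
		return 1;
	}
	/* the file is a comma separated list, e.g. "16,20,24" */
	if (fscanf(f, "%d", &cpu) == 1)
		printf("first cpu recommended for activation: %d\n", cpu);
	fclose(f);
	return 0;
}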

Please let me know your comments and suggestions.

Thanks,
Vaidy
---
powerpc: add support for new hcall H_BEST_ENERGY

Create sysfs interface to export data from H_BEST_ENERGY hcall
that can be used by administrative tools on supported pseries
platforms for energy management optimizations.

/sys/devices/system/cpu/pseries_(de)activate_hint_list and
/sys/devices/system/cpu/cpuN/pseries_(de)activate_hint will provide
hints for activation and deactivation of cpus respectively.

Added new driver module
arch/powerpc/platforms/pseries/pseries_energy.c
under new config option CONFIG_PSERIES_ENERGY

Signed-off-by: Vaidyanathan Srinivasan sva...@linux.vnet.ibm.com
---
 arch/powerpc/include/asm/hvcall.h   |3 
 arch/powerpc/kernel/setup-common.c  |2 
 arch/powerpc/platforms/pseries/Kconfig  |   10 +
 arch/powerpc/platforms/pseries/Makefile |1 
 arch/powerpc/platforms/pseries/pseries_energy.c |  246 +++
 5 files changed, 261 insertions(+), 1 deletions(-)
 create mode 100644 arch/powerpc/platforms/pseries/pseries_energy.c

diff --git a/arch/powerpc/include/asm/hvcall.h 
b/arch/powerpc/include/asm/hvcall.h
index f027581..396599f 100644
--- a/arch/powerpc/include/asm/hvcall.h
+++ b/arch/powerpc/include/asm/hvcall.h
@@ -230,7 +230,8 @@
 #define H_ENABLE_CRQ   0x2B0
 #define H_SET_MPP  0x2D0
 #define H_GET_MPP  0x2D4
-#define MAX_HCALL_OPCODE   H_GET_MPP
+#define H_BEST_ENERGY  0x2F4
+#define MAX_HCALL_OPCODE   H_BEST_ENERGY
 
 #ifndef __ASSEMBLY__
 
diff --git a/arch/powerpc/kernel/setup-common.c 
b/arch/powerpc/kernel/setup-common.c
index 03dd6a2..fbd93e3 100644
--- a/arch/powerpc/kernel/setup-common.c
+++ b/arch/powerpc/kernel/setup-common.c
@@ -359,6 +359,8 @@ void __init check_for_initrd(void)
 #ifdef CONFIG_SMP
 
 int threads_per_core, threads_shift;
+EXPORT_SYMBOL_GPL(threads_per_core);
+
 cpumask_t threads_core_mask;
 
 static void __init cpu_init_thread_core_maps(int tpc)
diff --git a/arch/powerpc/platforms/pseries/Kconfig 
b/arch/powerpc/platforms/pseries/Kconfig
index c667f0f..b3dd108 100644
--- a/arch/powerpc/platforms/pseries/Kconfig
+++ b/arch/powerpc/platforms/pseries/Kconfig
@@ -33,6 +33,16 @@ config PSERIES_MSI
 	depends on PCI_MSI && EEH
default y
 
+config PSERIES_ENERGY
+	tristate "pseries energy management capabilities driver"
+	depends on PPC_PSERIES
+	default y
+	help
+	  Provides interface to platform energy management capabilities
+	  on supported PSERIES platforms.
+	  Provides: /sys/devices/system/cpu/pseries_(de)activation_hint_list
+	  and /sys/devices/system/cpu/cpuN/pseries_(de)activation_hint
+
 config SCANLOG
 	tristate "Scanlog dump interface"
 	depends on RTAS_PROC && PPC_PSERIES
diff --git a/arch/powerpc/platforms/pseries/Makefile 
b/arch/powerpc/platforms/pseries/Makefile
index 0ff5174..3813977 100644
--- a/arch/powerpc/platforms/pseries/Makefile
+++ b/arch/powerpc/platforms/pseries/Makefile
@@ -16,6 +16,7 @@ obj-$(CONFIG_EEH) += eeh.o eeh_cache.o eeh_driver.o 
eeh_event.o eeh_sysfs.o
 obj-$(CONFIG_KEXEC)+= kexec.o
 obj-$(CONFIG_PCI)  += pci.o pci_dlpar.o
 obj-$(CONFIG_PSERIES_MSI)  += msi.o
+obj-$(CONFIG_PSERIES_ENERGY)   += pseries_energy.o
 
 obj-$(CONFIG_HOTPLUG_CPU)  += hotplug-cpu.o
 obj-$(CONFIG_MEMORY_HOTPLUG)   += hotplug-memory.o
diff --git a/arch/powerpc/platforms/pseries/pseries_energy.c 
b/arch/powerpc/platforms/pseries/pseries_energy.c
new file mode 100644
index 000..0390188
--- /dev/null
+++ b/arch/powerpc/platforms/pseries/pseries_energy.c
@@ -0,0 +1,246 @@
+/*
+ * POWER platform energy management driver
+ * Copyright (C) 2010 IBM Corporation
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public

Re: [RFC] powerpc: add support for new hcall H_BEST_ENERGY

2010-03-07 Thread Vaidyanathan Srinivasan
* Dipankar Sarma dipan...@in.ibm.com [2010-03-06 00:48:11]:

 On Wed, Mar 03, 2010 at 11:48:22PM +0530, Vaidyanathan Srinivasan wrote:
   static void __init cpu_init_thread_core_maps(int tpc)
  diff --git a/arch/powerpc/platforms/pseries/Kconfig 
  b/arch/powerpc/platforms/pseries/Kconfig
  index c667f0f..b3dd108 100644
  --- a/arch/powerpc/platforms/pseries/Kconfig
  +++ b/arch/powerpc/platforms/pseries/Kconfig
  @@ -33,6 +33,16 @@ config PSERIES_MSI
   depends on PCI_MSI && EEH
  default y
  
  +config PSERIES_ENERGY
  +   tristate "pseries energy management capabilities driver"
  +   depends on PPC_PSERIES
  +   default y
  +   help
  + Provides interface to platform energy management capabilities
  + on supported PSERIES platforms.
  + Provides: /sys/devices/system/cpu/pseries_(de)activation_hint_list
  + and /sys/devices/system/cpu/cpuN/pseries_(de)activation_hint
  +
   config SCANLOG
  tristate "Scanlog dump interface"
  depends on RTAS_PROC && PPC_PSERIES
 
 .
 
  +static int __init pseries_energy_init(void)
  +{
  +   int cpu, err;
  +   struct sys_device *cpu_sys_dev;
  +
  +   /* Create the sysfs files */
  +   err = sysfs_create_file(&cpu_sysdev_class.kset.kobj,
  +   &attr_cpu_activate_hint_list.attr);
  +   if (!err)
  +   err = sysfs_create_file(&cpu_sysdev_class.kset.kobj,
  +   &attr_cpu_deactivate_hint_list.attr);
  +
  +   for_each_possible_cpu(cpu) {
  +   cpu_sys_dev = get_cpu_sysdev(cpu);
  +   err = sysfs_create_file(&cpu_sys_dev->kobj,
  +   &attr_percpu_activate_hint.attr);
  +   if (err)
  +   break;
  +   err = sysfs_create_file(&cpu_sys_dev->kobj,
  +   &attr_percpu_deactivate_hint.attr);
  +   if (err)
  +   break;
  +   }
  +   return err;
  +
  +}
 
 Shouldn't we create this only for supported platforms ?

Hi Dipankar,

Yes, we will need a check like
firmware_has_feature(FW_FEATURE_BEST_ENERGY) to avoid creating the sysfs
files on unsupported platforms.  I will add that check in the next iteration.
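
A minimal sketch of that check, assuming the feature bit ends up being
called FW_FEATURE_BEST_ENERGY as suggested above (sketch only, not the
final patch):

static int __init pseries_energy_init(void)
{
	/* skip creating the sysfs files on unsupported platforms */
	if (!firmware_has_feature(FW_FEATURE_BEST_ENERGY))
		return -ENODEV;

	/* ... create the sysfs files as in the RFC ... */
	return 0;
}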

Thanks,
Vaidy



Re: [RFC] powerpc: add support for new hcall H_BEST_ENERGY

2010-03-09 Thread Vaidyanathan Srinivasan
* Dipankar Sarma dipan...@in.ibm.com [2010-03-09 00:58:26]:

 On Mon, Mar 08, 2010 at 12:20:06PM +0530, Vaidyanathan Srinivasan wrote:
  * Dipankar Sarma dipan...@in.ibm.com [2010-03-06 00:48:11]:
  
   Shouldn't we create this only for supported platforms ?
  
  Hi Dipankar,
  
  Yes we will need a check like
  firmware_has_feature(FW_FEATURE_BEST_ENERGY) to avoid sysfs files in
  unsupported platforms.  I will add that check in the next iteration.
 
 Also, given that this module isn't likely to provide anything on
 older platforms, it should get loaded only on newer platforms.

Yes, I will explore methods to achieve this.  I have not yet found
a trigger to load modules based on a firmware feature.

--Vaidy


[PATCH v2] powerpc: export data from new hcall H_EM_GET_PARMS

2010-03-10 Thread Vaidyanathan Srinivasan
Hi Ben,

I have cleaned up the code from the previous post:
http://lists.ozlabs.org/pipermail/linuxppc-dev/2010-February/080630.html

Changes from v1:
* Removed redundant return statements and rearranged code

Description:

A new hcall, H_EM_GET_PARMS, has been added to obtain power mode data
from the platform.  This data can be used by user space administrative
tools for better power management.

The following patch adds data from this new hcall to the lparcfg
driver and exports it to user space along with the other existing lpar data
in /proc/powerpc/lparcfg
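
As a usage illustration, a small hypothetical userspace sketch that
pulls the new field out of /proc/powerpc/lparcfg (the field name
power_mode_data is taken from the patch below):

/* hypothetical userspace sketch */
#include <stdio.h>
#include <string.h>

int main(void)
{
	char line[256];
	FILE *f = fopen("/proc/powerpc/lparcfg", "r");

	if (!f) {
		perror("lparcfg");
		return 1;
	}
	while (fgets(line, sizeof(line), f))
		if (!strncmp(line, "power_mode_data=", 16))
			fputs(line, stdout);
	fclose(f);
	return 0;
}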

Please review and include in powerpc -next tree.

Thanks,
Vaidy

---
powerpc: export data from new hcall H_EM_GET_PARMS

Add support for H_EM_GET_PARMS hcall that will return data
related to power modes from the platform.  Export the data
directly to user space for administrative tools to interpret
and use.

cat /proc/powerpc/lparcfg will export power mode data

Signed-off-by: Vaidyanathan Srinivasan sva...@linux.vnet.ibm.com
---
 arch/powerpc/include/asm/hvcall.h |1 +
 arch/powerpc/kernel/lparcfg.c |   12 +++-
 2 files changed, 12 insertions(+), 1 deletions(-)

diff --git a/arch/powerpc/include/asm/hvcall.h 
b/arch/powerpc/include/asm/hvcall.h
index f027581..ebe7493 100644
--- a/arch/powerpc/include/asm/hvcall.h
+++ b/arch/powerpc/include/asm/hvcall.h
@@ -228,6 +228,7 @@
 #define H_JOIN 0x298
 #define H_VASI_STATE0x2A4
 #define H_ENABLE_CRQ   0x2B0
+#define H_GET_EM_PARMS 0x2B8
 #define H_SET_MPP  0x2D0
 #define H_GET_MPP  0x2D4
 #define MAX_HCALL_OPCODE   H_GET_MPP
diff --git a/arch/powerpc/kernel/lparcfg.c b/arch/powerpc/kernel/lparcfg.c
index d09d1c6..ff698fb 100644
--- a/arch/powerpc/kernel/lparcfg.c
+++ b/arch/powerpc/kernel/lparcfg.c
@@ -37,7 +37,7 @@
 #include <asm/vio.h>
 #include <asm/mmu.h>
 
-#define MODULE_VERS "1.8"
+#define MODULE_VERS "1.9"
 #define MODULE_NAME "lparcfg"
 
 /* #define LPARCFG_DEBUG */
@@ -486,6 +486,14 @@ static void splpar_dispatch_data(struct seq_file *m)
 seq_printf(m, "dispatch_dispersions=%lu\n", dispatch_dispersions);
 }
 
+static void parse_em_data(struct seq_file *m)
+{
+   unsigned long retbuf[PLPAR_HCALL_BUFSIZE];
+
+   if (plpar_hcall(H_GET_EM_PARMS, retbuf) == H_SUCCESS)
+   seq_printf(m, "power_mode_data=%016lx\n", retbuf[0]);
+}
+
 static int pseries_lparcfg_data(struct seq_file *m, void *v)
 {
int partition_potential_processors;
@@ -540,6 +548,8 @@ static int pseries_lparcfg_data(struct seq_file *m, void *v)
 
 seq_printf(m, "slb_size=%d\n", mmu_slb_size);
 
+   parse_em_data(m);
+
return 0;
 }
 



Re: [PATCH v2] powerpc: export data from new hcall H_EM_GET_PARMS

2010-03-22 Thread Vaidyanathan Srinivasan
* Vaidyanathan Srinivasan sva...@linux.vnet.ibm.com [2010-03-10 17:49:25]:

 Hi Ben,
 
 I have cleaned up the code from the previous post:
 http://lists.ozlabs.org/pipermail/linuxppc-dev/2010-February/080630.html
 
 Changes from v1:
 * Removed redundant return statements and rearranged code
 
 Description:
 
 A new hcall, H_EM_GET_PARMS, has been added to obtain power mode data
 from the platform.  This data can be used by user space administrative
 tools for better power management.
 
 The following patch adds data from this new hcall to the lparcfg
 driver and exports it to user space along with other existing lpar data
 in /proc/powerpc/lparcfg
 
 Please review and include in powerpc -next tree.

Hi Ben,

Looks like you have not opened the -next tree after the merge window
yet.  The current git tree status is the same as -merge.  Please review
and include in the powerpc -next tree.

Thanks,
Vaidy

 ---
   powerpc: export data from new hcall H_EM_GET_PARMS
 
   Add support for H_EM_GET_PARMS hcall that will return data
   related to power modes from the platform.  Export the data
   directly to user space for administrative tools to interpret
   and use.
 
   cat /proc/powerpc/lparcfg will export power mode data
 
 Signed-off-by: Vaidyanathan Srinivasan sva...@linux.vnet.ibm.com
 ---
  arch/powerpc/include/asm/hvcall.h |1 +
  arch/powerpc/kernel/lparcfg.c |   12 +++-
  2 files changed, 12 insertions(+), 1 deletions(-)
 
 diff --git a/arch/powerpc/include/asm/hvcall.h 
 b/arch/powerpc/include/asm/hvcall.h
 index f027581..ebe7493 100644
 --- a/arch/powerpc/include/asm/hvcall.h
 +++ b/arch/powerpc/include/asm/hvcall.h
 @@ -228,6 +228,7 @@
  #define H_JOIN   0x298
  #define H_VASI_STATE0x2A4
  #define H_ENABLE_CRQ 0x2B0
 +#define H_GET_EM_PARMS   0x2B8
  #define H_SET_MPP0x2D0
  #define H_GET_MPP0x2D4
  #define MAX_HCALL_OPCODE H_GET_MPP
 diff --git a/arch/powerpc/kernel/lparcfg.c b/arch/powerpc/kernel/lparcfg.c
 index d09d1c6..ff698fb 100644
 --- a/arch/powerpc/kernel/lparcfg.c
 +++ b/arch/powerpc/kernel/lparcfg.c
 @@ -37,7 +37,7 @@
  #include <asm/vio.h>
  #include <asm/mmu.h>
 
 -#define MODULE_VERS "1.8"
 +#define MODULE_VERS "1.9"
  #define MODULE_NAME "lparcfg"
 
  /* #define LPARCFG_DEBUG */
 @@ -486,6 +486,14 @@ static void splpar_dispatch_data(struct seq_file *m)
   seq_printf(m, "dispatch_dispersions=%lu\n", dispatch_dispersions);
  }
 
 +static void parse_em_data(struct seq_file *m)
 +{
 + unsigned long retbuf[PLPAR_HCALL_BUFSIZE];
 +
 + if (plpar_hcall(H_GET_EM_PARMS, retbuf) == H_SUCCESS)
  + seq_printf(m, "power_mode_data=%016lx\n", retbuf[0]);
 +}
 +
  static int pseries_lparcfg_data(struct seq_file *m, void *v)
  {
   int partition_potential_processors;
 @@ -540,6 +548,8 @@ static int pseries_lparcfg_data(struct seq_file *m, void 
 *v)
 
   seq_printf(m, "slb_size=%d\n", mmu_slb_size);
 
 + parse_em_data(m);
 +
   return 0;
  }
 
 


[PATCH v2 repost] powerpc: export data from new hcall H_EM_GET_PARMS

2010-04-01 Thread Vaidyanathan Srinivasan
Hi Ben,

Previous post: http://patchwork.ozlabs.org/patch/47249/

I have cleaned up the code from the v1:
http://lists.ozlabs.org/pipermail/linuxppc-dev/2010-February/080630.html

Changes from v1:
* Removed redundant return statements and rearranged code

Description:

A new hcall, H_EM_GET_PARMS, has been added to obtain power mode data
from the platform.  This data can be used by user space administrative
tools for better power management.

The following patch adds data from this new hcall to the lparcfg
driver and exports it to user space along with the other existing lpar data
in /proc/powerpc/lparcfg

Please review and include in powerpc -next tree.

Thanks,
Vaidy

---
powerpc: export data from new hcall H_EM_GET_PARMS

Add support for H_EM_GET_PARMS hcall that will return data
related to power modes from the platform.  Export the data
directly to user space for administrative tools to interpret
and use.

cat /proc/powerpc/lparcfg will export power mode data

Signed-off-by: Vaidyanathan Srinivasan sva...@linux.vnet.ibm.com
---
 arch/powerpc/include/asm/hvcall.h |1 +
 arch/powerpc/kernel/lparcfg.c |   12 +++-
 2 files changed, 12 insertions(+), 1 deletions(-)

diff --git a/arch/powerpc/include/asm/hvcall.h 
b/arch/powerpc/include/asm/hvcall.h
index f027581..ebe7493 100644
--- a/arch/powerpc/include/asm/hvcall.h
+++ b/arch/powerpc/include/asm/hvcall.h
@@ -228,6 +228,7 @@
 #define H_JOIN 0x298
 #define H_VASI_STATE0x2A4
 #define H_ENABLE_CRQ   0x2B0
+#define H_GET_EM_PARMS 0x2B8
 #define H_SET_MPP  0x2D0
 #define H_GET_MPP  0x2D4
 #define MAX_HCALL_OPCODE   H_GET_MPP
diff --git a/arch/powerpc/kernel/lparcfg.c b/arch/powerpc/kernel/lparcfg.c
index d09d1c6..ff698fb 100644
--- a/arch/powerpc/kernel/lparcfg.c
+++ b/arch/powerpc/kernel/lparcfg.c
@@ -37,7 +37,7 @@
 #include <asm/vio.h>
 #include <asm/mmu.h>

-#define MODULE_VERS "1.8"
+#define MODULE_VERS "1.9"
 #define MODULE_NAME "lparcfg"

 /* #define LPARCFG_DEBUG */
@@ -486,6 +486,14 @@ static void splpar_dispatch_data(struct seq_file *m)
 seq_printf(m, "dispatch_dispersions=%lu\n", dispatch_dispersions);
 }

+static void parse_em_data(struct seq_file *m)
+{
+   unsigned long retbuf[PLPAR_HCALL_BUFSIZE];
+
+   if (plpar_hcall(H_GET_EM_PARMS, retbuf) == H_SUCCESS)
+   seq_printf(m, "power_mode_data=%016lx\n", retbuf[0]);
+}
+
 static int pseries_lparcfg_data(struct seq_file *m, void *v)
 {
int partition_potential_processors;
@@ -540,6 +548,8 @@ static int pseries_lparcfg_data(struct seq_file *m, void *v)

 seq_printf(m, "slb_size=%d\n", mmu_slb_size);

+   parse_em_data(m);
+
return 0;
 }




Re: [RFC] powerpc: add support for new hcall H_BEST_ENERGY

2010-04-06 Thread Vaidyanathan Srinivasan
* Benjamin Herrenschmidt b...@kernel.crashing.org [2010-04-07 12:04:49]:

 On Wed, 2010-03-03 at 23:48 +0530, Vaidyanathan Srinivasan wrote:
  Hi,
 
  diff --git a/arch/powerpc/kernel/setup-common.c 
  b/arch/powerpc/kernel/setup-common.c
  index 03dd6a2..fbd93e3 100644
  --- a/arch/powerpc/kernel/setup-common.c
  +++ b/arch/powerpc/kernel/setup-common.c
  @@ -359,6 +359,8 @@ void __init check_for_initrd(void)
   #ifdef CONFIG_SMP
   
   int threads_per_core, threads_shift;
  +EXPORT_SYMBOL_GPL(threads_per_core);
 
 While I agree it should be exported for the APIs in cputhread.h to be
 usable from a module, this variable shouldn't be -used- directly, but
 only via the API functions in there.

Ok agreed.  This will be the first module to use this and hence will
have to implement the API and export the function.

  .../...
 
  +
  +#define MODULE_VERS 1.0
  +#define MODULE_NAME pseries_energy
  +
  +/* Helper Routines to convert between drc_index to cpu numbers */
  +
  +static u32 cpu_to_drc_index(int cpu)
  +{
  +   struct device_node *dn = NULL;
  +   const int *indexes;
  +   int i;
  +   dn = of_find_node_by_name(dn, cpus);
 
 Check the result. Also that's not a nice way to do that, you should look
 for /cpus by path I reckon.

I will check the return code, but why do you feel getting the node by
path is better?  Is there any race, or could we end up with a duplicate
cpus node?

  +   indexes = of_get_property(dn, ibm,drc-indexes, NULL);
 
 Check the result here too.

Yes, will do.

  +   /* Convert logical cpu number to core number */
  +   i = cpu / threads_per_core;
 
 Don't use that variable as I said earlier. Use cpu_thread_to_core()

Ack

  +   /*
  +* The first element indexes[0] is the number of drc_indexes
  +* returned in the list.  Hence i+1 will get the drc_index
  +* corresponding to core number i.
  +*/
  +   WARN_ON(i  indexes[0]);
  +   return indexes[i + 1];
  +}
  +
  +static int drc_index_to_cpu(u32 drc_index)
  +{
  +   struct device_node *dn = NULL;
  +   const int *indexes;
  +   int i, cpu;
  +   dn = of_find_node_by_name(dn, cpus);
  +   indexes = of_get_property(dn, ibm,drc-indexes, NULL);
 
 Same comments, check results and use /cpus path

Sure. Will do.

  +   /*
  +* First element in the array is the number of drc_indexes
  +* returned.  Search through the list to find the matching
  +* drc_index and get the core number
  +*/
  +   for (i = 0; i  indexes[0]; i++) {
  +   if (indexes[i + 1] == drc_index)
  +   break;
  +   }
  +   /* Convert core number to logical cpu number */
  +   cpu = i * threads_per_core;
 
 Here's more annoying as we don't have an API in cputhread.h for that.
 
 In fact, we have a confusion in there since cpu_first_thread_in_core()
 doesn't actually take a core number ...
 
 I'm going to recommend a complicated approach but that's the best in the
 long run:
 
  - First do a patch that renames cpu_first_thread_in_core() to something
 clearer like cpu_first_thread_in_same_core() or cpu_leftmost_sibling()
 (I think I prefer the later). Rename the few users in
 arch/powerpc/mm/mmu_context_nohash.c and arch/powerpc/kernel/smp.c
 
  - Then do a patch that adds a cpu_first_thread_of_core() that takes a
 core number, basically does:
 
 static inline int cpu_first_thread_of_core(int core)
 {
    return core << threads_shift;
 }
 
  - Then add your modified H_BEST_ENERGY patch on top of it using the
 above two as pre-reqs :-)

Ok, I can do that change and then base this patch on the new API.

  +   return cpu;
  +}
  +
  +/*
  + * pseries hypervisor call H_BEST_ENERGY provides hints to OS on
  + * preferred logical cpus to activate or deactivate for optimized
  + * energy consumption.
  + */
  +
  +#define FLAGS_MODE10x004E20080E01
  +#define FLAGS_MODE20x004E20080401
  +#define FLAGS_ACTIVATE  0x100
  +
  +static ssize_t get_best_energy_list(char *page, int activate)
  +{
  +   int rc, cnt, i, cpu;
  +   unsigned long retbuf[PLPAR_HCALL9_BUFSIZE];
  +   unsigned long flags = 0;
  +   u32 *buf_page;
  +   char *s = page;
  +
  +   buf_page = (u32 *) get_zeroed_page(GFP_KERNEL);
  +
 Why that blank line ?

I will remove them

  +   if (!buf_page)
  +   return 0;
 
 So here you return 0 instead of -ENOMEM

Will fix.  -ENOMEM is correct.

  +   flags = FLAGS_MODE1;
  +   if (activate)
  +   flags |= FLAGS_ACTIVATE;
  +
  +   rc = plpar_hcall9(H_BEST_ENERGY, retbuf, flags, 0, __pa(buf_page),
  +   0, 0, 0, 0, 0, 0);
  +
  +   
 
 Again, no need for a blank line before xx = foo() and if (xx)

Ack.

  if (rc != H_SUCCESS) {
  +   free_page((unsigned long) buf_page);
  +   return -EINVAL;
  +   }
 
 And here you return an error code. Which one is right ?

Returning an error code to user space is correct.  The caller will
check the return value on read while getting the list.

Thanks for the detailed review.  I will fix the issues that you
pointed out.

[RFC PATCH v2 0/2] powerpc: add support for new hcall H_BEST_ENERGY

2010-05-07 Thread Vaidyanathan Srinivasan
Hi,

The following series adds a powerpc pseries platform module for
exporting platform energy management capabilities.

The module exports data from a new hypervisor call H_BEST_ENERGY.

Comments and suggestions made on the previous iteration of the patch
have been incorporated.

[RFC] powerpc: add support for new hcall H_BEST_ENERGY
http://lists.ozlabs.org/pipermail/linuxppc-dev/2010-March/080796.html

Changes from v1:

* Cleanup cpu/thread/core APIs
* Export APIs to module instead of threads_per_core
* Use of_find_node_by_path() instead of of_find_node_by_name()
* Error checking and whitespace cleanups

Please review and let me know your comments.  I will post the next
version ready for inclusion into powerpc/next tree along with test
reports.

The current patch series is on 2.6.34-rc6.

Thanks,
Vaidy
---

Vaidyanathan Srinivasan (2):
  powerpc: cleanup APIs for cpu/thread/core mappings
  powerpc: add support for new hcall H_BEST_ENERGY


 arch/powerpc/include/asm/cputhreads.h   |   15 +
 arch/powerpc/include/asm/hvcall.h   |3 
 arch/powerpc/kernel/smp.c   |   17 +-
 arch/powerpc/mm/mmu_context_nohash.c|   12 +
 arch/powerpc/platforms/pseries/Kconfig  |   10 +
 arch/powerpc/platforms/pseries/Makefile |1 
 arch/powerpc/platforms/pseries/pseries_energy.c |  258 +++
 7 files changed, 301 insertions(+), 15 deletions(-)
 create mode 100644 arch/powerpc/platforms/pseries/pseries_energy.c



[RFC PATCH v2 1/2] powerpc: cleanup APIs for cpu/thread/core mappings

2010-05-07 Thread Vaidyanathan Srinivasan
These APIs take logical cpu number as input
Change cpu_first_thread_in_core() to cpu_leftmost_thread_sibling()
Change cpu_last_thread_in_core() to cpu_rightmost_thread_sibling()

These APIs convert core number (index) to logical cpu/thread numbers
Add cpu_first_thread_of_core(int core)
Changed cpu_thread_to_core() to cpu_core_of_thread(int cpu)

Made API changes to few callers.  Exported symbols for use in modules.

Signed-off-by: Vaidyanathan Srinivasan sva...@linux.vnet.ibm.com
---
 arch/powerpc/include/asm/cputhreads.h |   15 +--
 arch/powerpc/kernel/smp.c |   17 +++--
 arch/powerpc/mm/mmu_context_nohash.c  |   12 ++--
 3 files changed, 30 insertions(+), 14 deletions(-)

diff --git a/arch/powerpc/include/asm/cputhreads.h 
b/arch/powerpc/include/asm/cputhreads.h
index a8e1844..26dc6bd 100644
--- a/arch/powerpc/include/asm/cputhreads.h
+++ b/arch/powerpc/include/asm/cputhreads.h
@@ -61,22 +61,25 @@ static inline cpumask_t cpu_online_cores_map(void)
return cpu_thread_mask_to_cores(cpu_online_map);
 }
 
-static inline int cpu_thread_to_core(int cpu)
-{
-   return cpu >> threads_shift;
-}
+#ifdef CONFIG_SMP
+int cpu_core_of_thread(int cpu);
+int cpu_first_thread_of_core(int core);
+#else
+static inline int cpu_core_of_thread(int cpu) { return cpu; }
+static inline int cpu_first_thread_of_core(int core) { return core; }
+#endif
 
 static inline int cpu_thread_in_core(int cpu)
 {
 return cpu & (threads_per_core - 1);
 }
 
-static inline int cpu_first_thread_in_core(int cpu)
+static inline int cpu_leftmost_thread_sibling(int cpu)
 {
 return cpu & ~(threads_per_core - 1);
 }
 
-static inline int cpu_last_thread_in_core(int cpu)
+static inline int cpu_rightmost_thread_sibling(int cpu)
 {
return cpu | (threads_per_core - 1);
 }
diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index c2ee144..02dedff 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -462,6 +462,19 @@ out:
return id;
 }
 
+/* Helper routines for cpu to core mapping */
+int cpu_core_of_thread(int cpu)
+{
+   return cpu >> threads_shift;
+}
+EXPORT_SYMBOL_GPL(cpu_core_of_thread);
+
+int cpu_first_thread_of_core(int core)
+{
+   return core << threads_shift;
+}
+EXPORT_SYMBOL_GPL(cpu_first_thread_of_core);
+
 /* Must be called when no change can occur to cpu_present_map,
  * i.e. during cpu online or offline.
  */
@@ -513,7 +526,7 @@ int __devinit start_secondary(void *unused)
notify_cpu_starting(cpu);
set_cpu_online(cpu, true);
/* Update sibling maps */
-   base = cpu_first_thread_in_core(cpu);
+   base = cpu_leftmost_thread_sibling(cpu);
for (i = 0; i  threads_per_core; i++) {
if (cpu_is_offline(base + i))
continue;
@@ -589,7 +602,7 @@ int __cpu_disable(void)
return err;
 
/* Update sibling maps */
-   base = cpu_first_thread_in_core(cpu);
+   base = cpu_leftmost_thread_sibling(cpu);
for (i = 0; i  threads_per_core; i++) {
cpu_clear(cpu, per_cpu(cpu_sibling_map, base + i));
cpu_clear(base + i, per_cpu(cpu_sibling_map, cpu));
diff --git a/arch/powerpc/mm/mmu_context_nohash.c 
b/arch/powerpc/mm/mmu_context_nohash.c
index 1f2d9ff..7c66e82 100644
--- a/arch/powerpc/mm/mmu_context_nohash.c
+++ b/arch/powerpc/mm/mmu_context_nohash.c
@@ -111,8 +111,8 @@ static unsigned int steal_context_smp(unsigned int id)
 * a core map instead but this will do for now.
 */
for_each_cpu(cpu, mm_cpumask(mm)) {
-   for (i = cpu_first_thread_in_core(cpu);
-        i <= cpu_last_thread_in_core(cpu); i++)
+   for (i = cpu_leftmost_thread_sibling(cpu);
+        i <= cpu_rightmost_thread_sibling(cpu); i++)
__set_bit(id, stale_map[i]);
cpu = i - 1;
}
@@ -264,14 +264,14 @@ void switch_mmu_context(struct mm_struct *prev, struct 
mm_struct *next)
 */
if (test_bit(id, stale_map[cpu])) {
 pr_hardcont(" | stale flush %d [%d..%d]",
-   id, cpu_first_thread_in_core(cpu),
-   cpu_last_thread_in_core(cpu));
+   id, cpu_leftmost_thread_sibling(cpu),
+   cpu_rightmost_thread_sibling(cpu));
 
local_flush_tlb_mm(next);
 
/* XXX This clear should ultimately be part of 
local_flush_tlb_mm */
-   for (i = cpu_first_thread_in_core(cpu);
-        i <= cpu_last_thread_in_core(cpu); i++) {
+   for (i = cpu_leftmost_thread_sibling(cpu);
+        i <= cpu_rightmost_thread_sibling(cpu); i++) {
__clear_bit(id, stale_map[i]);
}
}


[RFC PATCH v2 2/2] powerpc: add support for new hcall H_BEST_ENERGY

2010-05-07 Thread Vaidyanathan Srinivasan
Create sysfs interface to export data from H_BEST_ENERGY hcall
that can be used by administrative tools on supported pseries
platforms for energy management optimizations.

/sys/device/system/cpu/pseries_(de)activate_hint_list and
/sys/device/system/cpu/cpuN/pseries_(de)activate_hint will provide
hints for activation and deactivation of cpus respectively.

Added new driver module
arch/powerpc/platforms/pseries/pseries_energy.c
under new config option CONFIG_PSERIES_ENERGY

Signed-off-by: Vaidyanathan Srinivasan sva...@linux.vnet.ibm.com
---
 arch/powerpc/include/asm/hvcall.h   |3 
 arch/powerpc/platforms/pseries/Kconfig  |   10 +
 arch/powerpc/platforms/pseries/Makefile |1 
 arch/powerpc/platforms/pseries/pseries_energy.c |  258 +++
 4 files changed, 271 insertions(+), 1 deletions(-)
 create mode 100644 arch/powerpc/platforms/pseries/pseries_energy.c

diff --git a/arch/powerpc/include/asm/hvcall.h 
b/arch/powerpc/include/asm/hvcall.h
index f027581..396599f 100644
--- a/arch/powerpc/include/asm/hvcall.h
+++ b/arch/powerpc/include/asm/hvcall.h
@@ -230,7 +230,8 @@
 #define H_ENABLE_CRQ   0x2B0
 #define H_SET_MPP  0x2D0
 #define H_GET_MPP  0x2D4
-#define MAX_HCALL_OPCODE   H_GET_MPP
+#define H_BEST_ENERGY  0x2F4
+#define MAX_HCALL_OPCODE   H_BEST_ENERGY
 
 #ifndef __ASSEMBLY__
 
diff --git a/arch/powerpc/platforms/pseries/Kconfig 
b/arch/powerpc/platforms/pseries/Kconfig
index c667f0f..b3dd108 100644
--- a/arch/powerpc/platforms/pseries/Kconfig
+++ b/arch/powerpc/platforms/pseries/Kconfig
@@ -33,6 +33,16 @@ config PSERIES_MSI
 depends on PCI_MSI && EEH
default y
 
+config PSERIES_ENERGY
+   tristate "pseries energy management capabilities driver"
+   depends on PPC_PSERIES
+   default y
+   help
+ Provides interface to platform energy management capabilities
+ on supported PSERIES platforms.
+ Provides: /sys/devices/system/cpu/pseries_(de)activation_hint_list
+ and /sys/devices/system/cpu/cpuN/pseries_(de)activation_hint
+
 config SCANLOG
 tristate "Scanlog dump interface"
 depends on RTAS_PROC && PPC_PSERIES
diff --git a/arch/powerpc/platforms/pseries/Makefile 
b/arch/powerpc/platforms/pseries/Makefile
index 0ff5174..3813977 100644
--- a/arch/powerpc/platforms/pseries/Makefile
+++ b/arch/powerpc/platforms/pseries/Makefile
@@ -16,6 +16,7 @@ obj-$(CONFIG_EEH) += eeh.o eeh_cache.o eeh_driver.o 
eeh_event.o eeh_sysfs.o
 obj-$(CONFIG_KEXEC)+= kexec.o
 obj-$(CONFIG_PCI)  += pci.o pci_dlpar.o
 obj-$(CONFIG_PSERIES_MSI)  += msi.o
+obj-$(CONFIG_PSERIES_ENERGY)   += pseries_energy.o
 
 obj-$(CONFIG_HOTPLUG_CPU)  += hotplug-cpu.o
 obj-$(CONFIG_MEMORY_HOTPLUG)   += hotplug-memory.o
diff --git a/arch/powerpc/platforms/pseries/pseries_energy.c 
b/arch/powerpc/platforms/pseries/pseries_energy.c
new file mode 100644
index 000..9a936b1
--- /dev/null
+++ b/arch/powerpc/platforms/pseries/pseries_energy.c
@@ -0,0 +1,258 @@
+/*
+ * POWER platform energy management driver
+ * Copyright (C) 2010 IBM Corporation
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * version 2 as published by the Free Software Foundation.
+ *
+ * This pseries platform device driver provides access to
+ * platform energy management capabilities.
+ */
+
+#include <linux/module.h>
+#include <linux/types.h>
+#include <linux/errno.h>
+#include <linux/init.h>
+#include <linux/seq_file.h>
+#include <linux/sysdev.h>
+#include <linux/cpu.h>
+#include <linux/of.h>
+#include <asm/cputhreads.h>
+#include <asm/page.h>
+#include <asm/hvcall.h>
+
+
+#define MODULE_VERS "1.0"
+#define MODULE_NAME "pseries_energy"
+
+/* Helper Routines to convert between drc_index to cpu numbers */
+
+static u32 cpu_to_drc_index(int cpu)
+{
+   struct device_node *dn = NULL;
+   const int *indexes;
+   int i;
+   dn = of_find_node_by_path("/cpus");
+   if (dn == NULL)
+   goto err;
+   indexes = of_get_property(dn, "ibm,drc-indexes", NULL);
+   if (indexes == NULL)
+   goto err;
+   /* Convert logical cpu number to core number */
+   i = cpu_core_of_thread(cpu);
+   /*
+* The first element indexes[0] is the number of drc_indexes
+* returned in the list.  Hence i+1 will get the drc_index
+* corresponding to core number i.
+*/
+   WARN_ON(i > indexes[0]);
+   return indexes[i + 1];
+err:
+   printk(KERN_WARNING "cpu_to_drc_index(%d) failed", cpu);
+   return 0;
+}
+
+static int drc_index_to_cpu(u32 drc_index)
+{
+   struct device_node *dn = NULL;
+   const int *indexes;
+   int i, cpu;
+   dn = of_find_node_by_path("/cpus");
+   if (dn == NULL)
+   goto err;
+   indexes = of_get_property(dn, ibm,drc-indexes, NULL

Re: [RFC PATCH v2 1/2] powerpc: cleanup APIs for cpu/thread/core mappings

2010-05-09 Thread Vaidyanathan Srinivasan
* Paul Mackerras pau...@samba.org [2010-05-10 09:05:22]:

 On Fri, May 07, 2010 at 05:18:42PM +0530, Vaidyanathan Srinivasan wrote:
 
  These APIs take logical cpu number as input
  Change cpu_first_thread_in_core() to cpu_leftmost_thread_sibling()
  Change cpu_last_thread_in_core() to cpu_rightmost_thread_sibling()
  
  These APIs convert core number (index) to logical cpu/thread numbers
  Add cpu_first_thread_of_core(int core)
  Changed cpu_thread_to_core() to cpu_core_of_thread(int cpu)
 
 Why make all these changes?  The end result doesn't seem any cleaner
 or better than how it was before, and your patch description doesn't
 give any reason for us to think yes, we should make this change.
 I assume you think this is a good change to make, so you need to
 explain why it's a good idea and convince the rest of us.

Sure Paul.. let me explain.  The crux of the issue is to make
'threads_per_core' accessible to the pseries_energy module.  In the
first RFC, I had directly exported the variable which is not a good
design.  Ben H suggested to make an API around it and then export the
function:
http://lists.ozlabs.org/pipermail/linuxppc-dev/2010-April/081610.html

Instead of adding an API that just reads threads_per_core, as Ben
suggested, I have made a higher-level wrapper API that converts a
logical cpu number to a core number.

The current APIs cpu_first_thread_in_core() and
cpu_last_thread_in_core() return logical CPU numbers, while
cpu_thread_to_core() returns a core number (index), which is not
a logical CPU number.

Ben recommended to clearly name them to distinguish 'core number'
versus first and last 'logical cpu number' in that core.

Hence in the new scheme, I have:

  Change cpu_first_thread_in_core() to cpu_leftmost_thread_sibling()
  Change cpu_last_thread_in_core() to cpu_rightmost_thread_sibling()

which work on logical cpu numbers.  
While cpu_first_thread_of_core() and cpu_core_of_thread() work on core
index.

Example usage:  (4 threads per core system)

cpu_leftmost_thread_sibling(5) = 4
cpu_rightmost_thread_sibling(5) = 7
cpu_core_of_thread(5) = 1
cpu_first_thread_of_core(1) = 4
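To make the arithmetic behind these numbers easy to check, here is a small
stand-alone sketch that mirrors the proposed helpers for a 4-thread core
(threads_shift = 2); it is illustrative only, not the kernel code:

#include <stdio.h>

#define THREADS_SHIFT		2
#define THREADS_PER_CORE	(1 << THREADS_SHIFT)	/* 4 */

static int leftmost_thread_sibling(int cpu)  { return cpu & ~(THREADS_PER_CORE - 1); }
static int rightmost_thread_sibling(int cpu) { return cpu | (THREADS_PER_CORE - 1); }
static int core_of_thread(int cpu)           { return cpu >> THREADS_SHIFT; }
static int first_thread_of_core(int core)    { return core << THREADS_SHIFT; }

int main(void)
{
	printf("%d %d %d %d\n",
	       leftmost_thread_sibling(5),	/* 4 */
	       rightmost_thread_sibling(5),	/* 7 */
	       core_of_thread(5),		/* 1 */
	       first_thread_of_core(1));	/* 4 */
	return 0;
}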

cpu_core_of_thread() is used in cpu_to_drc_index() in the module and
cpu_first_thread_of_core() is used in drc_index_to_cpu() in the
module.  These APIs may be useful in other modules in future, and the
proposed design is a good method to export these APIs to modules.

An alternative approach could be to move both the base functions
cpu_to_drc_index() and drc_index_to_cpu() into the kernel like in
arch/powerpc/kernel/smp.c

Thanks for the review, I hope I have explained the requirements and
design for this cleanup.

--Vaidy



Re: [RFC PATCH v2 1/2] powerpc: cleanup APIs for cpu/thread/core mappings

2010-05-17 Thread Vaidyanathan Srinivasan
* Vaidyanathan Srinivasan sva...@linux.vnet.ibm.com [2010-05-10 11:18:01]:

 * Paul Mackerras pau...@samba.org [2010-05-10 09:05:22]:
 
  On Fri, May 07, 2010 at 05:18:42PM +0530, Vaidyanathan Srinivasan wrote:
  
   These APIs take logical cpu number as input
   Change cpu_first_thread_in_core() to cpu_leftmost_thread_sibling()
   Change cpu_last_thread_in_core() to cpu_rightmost_thread_sibling()
   
   These APIs convert core number (index) to logical cpu/thread numbers
   Add cpu_first_thread_of_core(int core)
   Changed cpu_thread_to_core() to cpu_core_of_thread(int cpu)
  
  Why make all these changes?  The end result doesn't seem any cleaner
  or better than how it was before, and your patch description doesn't
  give any reason for us to think yes, we should make this change.
  I assume you think this is a good change to make, so you need to
  explain why it's a good idea and convince the rest of us.
 
 Sure Paul.. let me explain.  The crux of the issue is to make
 'threads_per_core' accessible to the pseries_energy module.  In the
 first RFC, I had directly exported the variable which is not a good
 design.  Ben H suggested to make an API around it and then export the
 function:
 http://lists.ozlabs.org/pipermail/linuxppc-dev/2010-April/081610.html
 
 Instead of making an API to read threads_per_core, as Ben suggested,
 I have made a wrapper at a higher level to make an API to convert from
 logical cpu number to core number.
 
 The current APIs cpu_first_thread_in_core() and
 cpu_last_thread_in_core() returns logical CPU number while
 cpu_thread_to_core() returns core number or index which is not
 a logical CPU number.
 
 Ben recommended to clearly name them to distinguish 'core number'
 versus first and last 'logical cpu number' in that core.
 
 Hence in the new scheme, I have:
 
   Change cpu_first_thread_in_core() to cpu_leftmost_thread_sibling()
   Change cpu_last_thread_in_core() to cpu_rightmost_thread_sibling()
 
 which work on logical cpu numbers.  
 While cpu_first_thread_of_core() and cpu_core_of_thread() work on core
 index.
 
 Example usage:  (4 threads per core system)
 
 cpu_leftmost_thread_sibling(5) = 4
 cpu_rightmost_thread_sibling(5) = 7
 cpu_core_of_thread(5) = 1
 cpu_first_thread_of_core(1) = 4
 
 cpu_core_of_thread() is used in cpu_to_drc_index() in the module and
 cpu_first_thread_of_core() is used in drc_index_to_cpu() in the
 module.  These APIs may be useful in other modules in future, and the
 proposed design is a good method to export these APIs to modules.
 
 An alternative approach could be to move both the base functions
 cpu_to_drc_index() and drc_index_to_cpu() into the kernel like in
 arch/powerpc/kernel/smp.c
 
 Thanks for the review, I hope I have explained the requirements and
 design for this cleanup.

Hi Paul and Ben,

Do you have any further comments on this patch series and related
cleanup?  I will post the next iteration in a day or two.

Thanks,
Vaidy



Re: [PATCH 1/5] sched: fix capacity calculations for SMT4

2010-06-01 Thread Vaidyanathan Srinivasan
* Peter Zijlstra pet...@infradead.org [2010-05-31 10:33:16]:

 On Fri, 2010-04-16 at 15:58 +0200, Peter Zijlstra wrote:
  
  
  Hrmm, my brain seems muddled but I might have another solution, let me
  ponder this for a bit..
  
 
 Right, so the thing I was thinking about is taking the group capacity
 into account when determining the capacity for a single cpu.
 
 Say the group contains all the SMT siblings, then use the group capacity
 (usually larger than 1024) and then distribute the capacity over the
 group members, preferring CPUs with higher individual cpu_power over
 those with less.
 
 So suppose you've got 4 siblings with cpu_power=294 each, then we assign
 capacity 1 to the first member, and the remaining 153 is insufficient,
 and thus we stop and the rest lives with 0 capacity.
 
 Now take the example that the first sibling would be running a heavy RT
 load, and its cpu_power would be reduced to say, 50, then we still got
 nearly 933 left over the others, which is still sufficient for one
 capacity, but because the first sibling is low, we'll assign it 0 and
 instead assign 1 to the second, again, leaving the third and fourth 0.

Hi Peter,

Thanks for the suggestion.

 If the group were a core group, the total would be much higher and we'd
 likely end up assigning 1 to each before we'd run out of capacity.

This is a tricky case because we are depending upon DIV_ROUND_CLOSEST
to decide whether to flag the capacity as 0 or 1.  We will not have any
task movement until capacity is depleted to a quite low value due to
the RT task.  Having a threshold to flag 0/1 instead of
DIV_ROUND_CLOSEST, just as you have suggested for the power savings
case, may help here as well to move tasks to other idle cores.
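To make the rounding issue concrete, a stand-alone sketch with the numbers
from your example (SCHED_LOAD_SCALE taken as 1024; the threshold value below
is only an example):

#include <stdio.h>

#define SCHED_LOAD_SCALE	1024UL
/* simplified for positive values only */
#define DIV_ROUND_CLOSEST(x, d)	(((x) + ((d) / 2)) / (d))

int main(void)
{
	unsigned long thread_power = 294;		/* one SMT4 sibling */
	unsigned long group_power = 4 * thread_power;	/* 1176 */
	unsigned long threshold = SCHED_LOAD_SCALE / 4;	/* example cut-off */

	/* the whole group rounds to a single unit of capacity */
	printf("group capacity: %lu\n",
	       DIV_ROUND_CLOSEST(group_power, SCHED_LOAD_SCALE));	/* 1 */

	/* a single thread always rounds down to 0 ... */
	printf("per-thread capacity: %lu\n",
	       DIV_ROUND_CLOSEST(thread_power, SCHED_LOAD_SCALE));	/* 0 */

	/* ... whereas a threshold would still flag it as usable */
	printf("thresholded capacity: %d\n",
	       thread_power >= threshold ? 1 : 0);			/* 1 */
	return 0;
}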

 For power savings, we can lower the threshold and maybe use the maximal
 individual cpu_power in the group to base 1 capacity from.
 
 So, suppose the second example, where sibling0 has 50 and the others
 have 294, you'd end up with a capacity distribution of: {0,1,1,1}.

One challenge here is that if RT tasks run on more than one thread in
this group, we will have slightly different cpu powers.  Arranging
them from max to min and having a cutoff threshold should work.

Should we keep the RT scaling as a separate entity alongside
cpu_power to simplify these thresholds?  Whenever we need to scale
group load with cpu power, we can take the product of cpu_power and
scale_rt_power; but in the cases where we compute capacity, we can
mark a 0 or 1 just based on whether scale_rt_power was less than
SCHED_LOAD_SCALE or not.  Alternatively, we can keep cpu_power as
a product of all scaling factors as it is today but also save the
component scale factors, such as scale_rt_power() and
arch_scale_freq_power(), so that they can be used in load balance
decisions.

Basically, in power-save balance we would give all threads a capacity
of '1' unless the cpu_power was reduced due to an RT task.  Similarly,
in the non-power-save case, we can flag 1,0,0,0 unless the first
thread had RT scaling during the last interval.

I am suggesting that we distinguish a reduction in cpu_power due to
architectural (hardware DVFS) reasons from one due to RT tasks, so that
it is easy to decide whether moving tasks to a sibling thread or core
can help or not.
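A rough sketch of what keeping the RT factor separate could look like; the
structure, field names and the 90% cut-off are all illustrative, not
existing scheduler code:

#include <stdio.h>

#define SCHED_LOAD_SCALE	1024UL

struct sched_power {
	unsigned long cpu_power;	/* arch/freq scaling already applied */
	unsigned long scale_rt;		/* share of SCHED_LOAD_SCALE left after RT */
};

/* effective power when group load has to be scaled */
static unsigned long effective_power(const struct sched_power *sp)
{
	return sp->cpu_power * sp->scale_rt / SCHED_LOAD_SCALE;
}

/* capacity flag: 1 unless RT activity ate into this thread */
static int capacity_flag(const struct sched_power *sp)
{
	return sp->scale_rt >= (9 * SCHED_LOAD_SCALE) / 10;	/* example 90% cut-off */
}

int main(void)
{
	struct sched_power busy_rt = { 1024, 180 };
	struct sched_power idle_rt = { 1024, 1020 };

	printf("%lu %d\n", effective_power(&busy_rt), capacity_flag(&busy_rt));	/* 180 0 */
	printf("%lu %d\n", effective_power(&idle_rt), capacity_flag(&idle_rt));	/* 1020 1 */
	return 0;
}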

--Vaidy



[PATCH v3 0/2] powerpc: add support for new hcall H_BEST_ENERGY

2010-06-23 Thread Vaidyanathan Srinivasan
Hi Ben,

The following series adds a kernel module for powerpc pseries platforms in
order to export platform energy management capabilities.

The module exports data from a new hypervisor call H_BEST_ENERGY.

Comments and suggestions made on the previous iteration of the patch
have been incorporated.

Changed in v3:

* Added more documentation in the cleanup patch
* Removed RFC tag, rebased and tested on 2.6.35-rc3
* Ready for inclusion in powerpc/next tree for further testing

Changes in v2:

[1] [RFC PATCH v2 0/2] powerpc: add support for new hcall H_BEST_ENERGY
http://lists.ozlabs.org/pipermail/linuxppc-dev/2010-May/082246.html

* Cleanup cpu/thread/core APIs
* Export APIs to module instead of threads_per_core
* Use of_find_node_by_path() instead of of_find_node_by_name()
* Error checking and whitespace cleanups

First version:
[2] [RFC] powerpc: add support for new hcall H_BEST_ENERGY
http://lists.ozlabs.org/pipermail/linuxppc-dev/2010-March/080796.html

Please review and include in powerpc/next tree for further testing.  I will
add the following enhancements incrementally:

* Detect h_call support from ibm,hypertas-functions
* Limit sysfs file creation only on hcall availability

This patch series will apply on 2.6.35-rc3 as well as powerpc/next tree.

Thanks,
Vaidy
---

Vaidyanathan Srinivasan (2):
  powerpc: cleanup APIs for cpu/thread/core mappings
  powerpc: add support for new hcall H_BEST_ENERGY


 arch/powerpc/include/asm/cputhreads.h   |   15 +
 arch/powerpc/include/asm/hvcall.h   |3 
 arch/powerpc/kernel/smp.c   |   19 +-
 arch/powerpc/mm/mmu_context_nohash.c|   12 +
 arch/powerpc/platforms/pseries/Kconfig  |   10 +
 arch/powerpc/platforms/pseries/Makefile |1 
 arch/powerpc/platforms/pseries/pseries_energy.c |  258 +++
 7 files changed, 302 insertions(+), 16 deletions(-)
 create mode 100644 arch/powerpc/platforms/pseries/pseries_energy.c

-- 


[PATCH v3 1/2] powerpc: cleanup APIs for cpu/thread/core mappings

2010-06-23 Thread Vaidyanathan Srinivasan
These APIs take logical cpu number as input
Change cpu_first_thread_in_core() to cpu_leftmost_thread_sibling()
Change cpu_last_thread_in_core() to cpu_rightmost_thread_sibling()

These APIs convert core number (index) to logical cpu/thread numbers
Add cpu_first_thread_of_core(int core)
Changed cpu_thread_to_core() to cpu_core_of_thread(int cpu)

The goal is to make 'threads_per_core' accessible to the
pseries_energy module.  Instead of making an API to read
threads_per_core, this is a higher level wrapper function to
convert from logical cpu number to core number.

The current APIs cpu_first_thread_in_core() and
cpu_last_thread_in_core() return logical CPU numbers while
cpu_thread_to_core() returns a core number or index which is
not a logical CPU number.  The APIs are now clearly named to
distinguish 'core number' versus first and last 'logical cpu
number' in that core.

The new APIs cpu_{left,right}most_thread_sibling() work on
logical cpu numbers.  While cpu_first_thread_of_core() and
cpu_core_of_thread() work on core index.

Example usage:  (4 threads per core system)

cpu_leftmost_thread_sibling(5) = 4
cpu_rightmost_thread_sibling(5) = 7
cpu_core_of_thread(5) = 1
cpu_first_thread_of_core(1) = 4

cpu_core_of_thread() is used in cpu_to_drc_index() in the
module and cpu_first_thread_of_core() is used in
drc_index_to_cpu() in the module.

Made API changes to few callers.  Exported symbols for use
in modules.

Signed-off-by: Vaidyanathan Srinivasan sva...@linux.vnet.ibm.com
---
 arch/powerpc/include/asm/cputhreads.h |   15 +--
 arch/powerpc/kernel/smp.c |   19 ---
 arch/powerpc/mm/mmu_context_nohash.c  |   12 ++--
 3 files changed, 31 insertions(+), 15 deletions(-)

diff --git a/arch/powerpc/include/asm/cputhreads.h 
b/arch/powerpc/include/asm/cputhreads.h
index a8e1844..26dc6bd 100644
--- a/arch/powerpc/include/asm/cputhreads.h
+++ b/arch/powerpc/include/asm/cputhreads.h
@@ -61,22 +61,25 @@ static inline cpumask_t cpu_online_cores_map(void)
return cpu_thread_mask_to_cores(cpu_online_map);
 }

-static inline int cpu_thread_to_core(int cpu)
-{
-   return cpu >> threads_shift;
-}
+#ifdef CONFIG_SMP
+int cpu_core_of_thread(int cpu);
+int cpu_first_thread_of_core(int core);
+#else
+static inline int cpu_core_of_thread(int cpu) { return cpu; }
+static inline int cpu_first_thread_of_core(int core) { return core; }
+#endif

 static inline int cpu_thread_in_core(int cpu)
 {
 return cpu & (threads_per_core - 1);
 }

-static inline int cpu_first_thread_in_core(int cpu)
+static inline int cpu_leftmost_thread_sibling(int cpu)
 {
 return cpu & ~(threads_per_core - 1);
 }

-static inline int cpu_last_thread_in_core(int cpu)
+static inline int cpu_rightmost_thread_sibling(int cpu)
 {
return cpu | (threads_per_core - 1);
 }
diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index 5c196d1..da4c2f8 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -468,7 +468,20 @@ out:
return id;
 }

-/* Must be called when no change can occur to cpu_present_mask,
+/* Helper routines for cpu to core mapping */
+int cpu_core_of_thread(int cpu)
+{
+   return cpu >> threads_shift;
+}
+EXPORT_SYMBOL_GPL(cpu_core_of_thread);
+
+int cpu_first_thread_of_core(int core)
+{
+   return core << threads_shift;
+}
+EXPORT_SYMBOL_GPL(cpu_first_thread_of_core);
+
+/* Must be called when no change can occur to cpu_present_map,
  * i.e. during cpu online or offline.
  */
 static struct device_node *cpu_to_l2cache(int cpu)
@@ -527,7 +540,7 @@ int __devinit start_secondary(void *unused)
notify_cpu_starting(cpu);
set_cpu_online(cpu, true);
/* Update sibling maps */
-   base = cpu_first_thread_in_core(cpu);
+   base = cpu_leftmost_thread_sibling(cpu);
for (i = 0; i  threads_per_core; i++) {
if (cpu_is_offline(base + i))
continue;
@@ -606,7 +619,7 @@ int __cpu_disable(void)
return err;

/* Update sibling maps */
-   base = cpu_first_thread_in_core(cpu);
+   base = cpu_leftmost_thread_sibling(cpu);
for (i = 0; i  threads_per_core; i++) {
cpumask_clear_cpu(cpu, cpu_sibling_mask(base + i));
cpumask_clear_cpu(base + i, cpu_sibling_mask(cpu));
diff --git a/arch/powerpc/mm/mmu_context_nohash.c 
b/arch/powerpc/mm/mmu_context_nohash.c
index ddfd7ad..22f3bc5 100644
--- a/arch/powerpc/mm/mmu_context_nohash.c
+++ b/arch/powerpc/mm/mmu_context_nohash.c
@@ -111,8 +111,8 @@ static unsigned int steal_context_smp(unsigned int id)
 * a core map instead but this will do for now.
 */
for_each_cpu(cpu, mm_cpumask(mm)) {
-   for (i = cpu_first_thread_in_core(cpu);
-        i <= cpu_last_thread_in_core(cpu); i++)
+   for (i = cpu_leftmost_thread_sibling(cpu);
+i

[PATCH v3 2/2] powerpc: add support for new hcall H_BEST_ENERGY

2010-06-23 Thread Vaidyanathan Srinivasan
Create sysfs interface to export data from H_BEST_ENERGY hcall
that can be used by administrative tools on supported pseries
platforms for energy management optimizations.

/sys/device/system/cpu/pseries_(de)activate_hint_list and
/sys/device/system/cpu/cpuN/pseries_(de)activate_hint will provide
hints for activation and deactivation of cpus respectively.

Added new driver module
arch/powerpc/platforms/pseries/pseries_energy.c
under new config option CONFIG_PSERIES_ENERGY

Signed-off-by: Vaidyanathan Srinivasan sva...@linux.vnet.ibm.com
---
 arch/powerpc/include/asm/hvcall.h   |3 
 arch/powerpc/platforms/pseries/Kconfig  |   10 +
 arch/powerpc/platforms/pseries/Makefile |1 
 arch/powerpc/platforms/pseries/pseries_energy.c |  258 +++
 4 files changed, 271 insertions(+), 1 deletions(-)
 create mode 100644 arch/powerpc/platforms/pseries/pseries_energy.c

diff --git a/arch/powerpc/include/asm/hvcall.h 
b/arch/powerpc/include/asm/hvcall.h
index 5119b7d..34b66e0 100644
--- a/arch/powerpc/include/asm/hvcall.h
+++ b/arch/powerpc/include/asm/hvcall.h
@@ -231,7 +231,8 @@
 #define H_GET_EM_PARMS 0x2B8
 #define H_SET_MPP  0x2D0
 #define H_GET_MPP  0x2D4
-#define MAX_HCALL_OPCODE   H_GET_MPP
+#define H_BEST_ENERGY  0x2F4
+#define MAX_HCALL_OPCODE   H_BEST_ENERGY

 #ifndef __ASSEMBLY__

diff --git a/arch/powerpc/platforms/pseries/Kconfig 
b/arch/powerpc/platforms/pseries/Kconfig
index c667f0f..b3dd108 100644
--- a/arch/powerpc/platforms/pseries/Kconfig
+++ b/arch/powerpc/platforms/pseries/Kconfig
@@ -33,6 +33,16 @@ config PSERIES_MSI
 depends on PCI_MSI && EEH
default y

+config PSERIES_ENERGY
+   tristate "pseries energy management capabilities driver"
+   depends on PPC_PSERIES
+   default y
+   help
+ Provides interface to platform energy management capabilities
+ on supported PSERIES platforms.
+ Provides: /sys/devices/system/cpu/pseries_(de)activation_hint_list
+ and /sys/devices/system/cpu/cpuN/pseries_(de)activation_hint
+
 config SCANLOG
 tristate "Scanlog dump interface"
 depends on RTAS_PROC && PPC_PSERIES
diff --git a/arch/powerpc/platforms/pseries/Makefile 
b/arch/powerpc/platforms/pseries/Makefile
index 3dbef30..32ae72e 100644
--- a/arch/powerpc/platforms/pseries/Makefile
+++ b/arch/powerpc/platforms/pseries/Makefile
@@ -16,6 +16,7 @@ obj-$(CONFIG_EEH) += eeh.o eeh_cache.o eeh_driver.o 
eeh_event.o eeh_sysfs.o
 obj-$(CONFIG_KEXEC)+= kexec.o
 obj-$(CONFIG_PCI)  += pci.o pci_dlpar.o
 obj-$(CONFIG_PSERIES_MSI)  += msi.o
+obj-$(CONFIG_PSERIES_ENERGY)   += pseries_energy.o

 obj-$(CONFIG_HOTPLUG_CPU)  += hotplug-cpu.o
 obj-$(CONFIG_MEMORY_HOTPLUG)   += hotplug-memory.o
diff --git a/arch/powerpc/platforms/pseries/pseries_energy.c 
b/arch/powerpc/platforms/pseries/pseries_energy.c
new file mode 100644
index 000..9a936b1
--- /dev/null
+++ b/arch/powerpc/platforms/pseries/pseries_energy.c
@@ -0,0 +1,258 @@
+/*
+ * POWER platform energy management driver
+ * Copyright (C) 2010 IBM Corporation
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * version 2 as published by the Free Software Foundation.
+ *
+ * This pseries platform device driver provides access to
+ * platform energy management capabilities.
+ */
+
+#include <linux/module.h>
+#include <linux/types.h>
+#include <linux/errno.h>
+#include <linux/init.h>
+#include <linux/seq_file.h>
+#include <linux/sysdev.h>
+#include <linux/cpu.h>
+#include <linux/of.h>
+#include <asm/cputhreads.h>
+#include <asm/page.h>
+#include <asm/hvcall.h>
+
+
+#define MODULE_VERS "1.0"
+#define MODULE_NAME "pseries_energy"
+
+/* Helper Routines to convert between drc_index to cpu numbers */
+
+static u32 cpu_to_drc_index(int cpu)
+{
+   struct device_node *dn = NULL;
+   const int *indexes;
+   int i;
+   dn = of_find_node_by_path("/cpus");
+   if (dn == NULL)
+   goto err;
+   indexes = of_get_property(dn, "ibm,drc-indexes", NULL);
+   if (indexes == NULL)
+   goto err;
+   /* Convert logical cpu number to core number */
+   i = cpu_core_of_thread(cpu);
+   /*
+* The first element indexes[0] is the number of drc_indexes
+* returned in the list.  Hence i+1 will get the drc_index
+* corresponding to core number i.
+*/
+   WARN_ON(i > indexes[0]);
+   return indexes[i + 1];
+err:
+   printk(KERN_WARNING "cpu_to_drc_index(%d) failed", cpu);
+   return 0;
+}
+
+static int drc_index_to_cpu(u32 drc_index)
+{
+   struct device_node *dn = NULL;
+   const int *indexes;
+   int i, cpu;
+   dn = of_find_node_by_path("/cpus");
+   if (dn == NULL)
+   goto err;
+   indexes = of_get_property(dn, ibm,drc-indexes, NULL

Re: [PATCH v3 2/2] powerpc: add support for new hcall H_BEST_ENERGY

2010-06-27 Thread Vaidyanathan Srinivasan
* Michael Neuling mi...@neuling.org [2010-06-28 11:44:31]:

 Vaidy,
 
  Create sysfs interface to export data from H_BEST_ENERGY hcall
  that can be used by administrative tools on supported pseries
  platforms for energy management optimizations.
  
  /sys/device/system/cpu/pseries_(de)activate_hint_list and
  /sys/device/system/cpu/cpuN/pseries_(de)activate_hint will provide
  hints for activation and deactivation of cpus respectively.
  
  Added new driver module
  arch/powerpc/platforms/pseries/pseries_energy.c
  under new config option CONFIG_PSERIES_ENERGY
 
  Can you provide some documentation on how to use these hints and what
  format they are provided in from sysfs?  Looks like two separate
  interfaces to the same thing (one a comma-separated list and one per
  cpu; why do we need both?).  What is the difference between activate
  and deactivate, without me having to read PAPR :-) ??

Hi Mike,

Thanks for reviewing this patch series.  Sure, I can provide
additional information.

These hints are abstract numbers given by the hypervisor based on
the extended knowledge the hypervisor has regarding the current system
topology and resource mappings.

The activate and the deactivate parts are for the two distinct
operations that we could do for energy savings.  When we have more
capacity than required, we could deactivate a few cores to save energy.
The choice of which core to deactivate will be based on
/sys/devices/system/cpu/deactivate_hint_list.  The comma-separated
list of cpus (cores) will be the preferred choice.

Once the system has a few deactivated cores, we may have to activate
some of them again to meet workload demand.  In that case
/sys/devices/system/cpu/activate_hint_list will be used to choose,
in order of preference, among the deactivated cores.

In simple terms, activate_hint_list will be null until we deactivate
a few cores.  Then we could look at the corresponding list for
activation or deactivation.

Regarding your second point, there is a reason for both a list and
a per-cpu interface.  The list gives us a system-wide list of cores in
one shot for userspace to base its decision on.  This will be the
preferred interface for most cases.  On the other hand, the per-cpu file
/sys/device/system/cpu/cpuN/pseries_(de)activate_hint provides more
information since it exports the hint value as such.

The idea is that the list interface will be used to get a suggested
list of cores to manage, while the per-cpu value can be used to get
further fine-grained information on a per-core basis from the
hypervisor.  This allows Linux to have access to all information that
the hypervisor has offered through this hcall interface.
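To make the intended usage concrete, here is a hypothetical user-space
reader for both interfaces; the sysfs paths follow the names in the patch
description, everything else (parsing, buffer sizes, policy) is
illustrative:

#include <stdio.h>

static void dump_hint(const char *path)
{
	char buf[512];
	FILE *fp = fopen(path, "r");

	if (!fp)
		return;		/* hint not offered in the current state */
	if (fgets(buf, sizeof(buf), fp))
		printf("%s: %s", path, buf);
	fclose(fp);
}

int main(void)
{
	/* system-wide, comma separated lists of preferred cores */
	dump_hint("/sys/devices/system/cpu/pseries_deactivate_hint_list");
	dump_hint("/sys/devices/system/cpu/pseries_activate_hint_list");

	/* raw per-cpu hint value for finer-grained decisions */
	dump_hint("/sys/devices/system/cpu/cpu0/pseries_deactivate_hint");
	dump_hint("/sys/devices/system/cpu/cpu0/pseries_activate_hint");
	return 0;
}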

 Other comments below.
 
  
  Signed-off-by: Vaidyanathan Srinivasan sva...@linux.vnet.ibm.com
  ---
   arch/powerpc/include/asm/hvcall.h   |3 
   arch/powerpc/platforms/pseries/Kconfig  |   10 +
   arch/powerpc/platforms/pseries/Makefile |1 
   arch/powerpc/platforms/pseries/pseries_energy.c |  258 
  +
 ++
   4 files changed, 271 insertions(+), 1 deletions(-)
   create mode 100644 arch/powerpc/platforms/pseries/pseries_energy.c
  
  diff --git a/arch/powerpc/include/asm/hvcall.h 
  b/arch/powerpc/include/asm/hvc
 all.h
  index 5119b7d..34b66e0 100644
  --- a/arch/powerpc/include/asm/hvcall.h
  +++ b/arch/powerpc/include/asm/hvcall.h
  @@ -231,7 +231,8 @@
   #define H_GET_EM_PARMS 0x2B8
   #define H_SET_MPP  0x2D0
   #define H_GET_MPP  0x2D4
  -#define MAX_HCALL_OPCODE   H_GET_MPP
  +#define H_BEST_ENERGY  0x2F4
  +#define MAX_HCALL_OPCODE   H_BEST_ENERGY
  
   #ifndef __ASSEMBLY__
  
  diff --git a/arch/powerpc/platforms/pseries/Kconfig 
  b/arch/powerpc/platforms/
 pseries/Kconfig
  index c667f0f..b3dd108 100644
  --- a/arch/powerpc/platforms/pseries/Kconfig
  +++ b/arch/powerpc/platforms/pseries/Kconfig
  @@ -33,6 +33,16 @@ config PSERIES_MSI
  depends on PCI_MSI  EEH
  default y
  
  +config PSERIES_ENERGY
 
 Probably need a less generic name.  PSERIES_ENERGY_MANAGEMENT?
 PSERIES_ENERGY_HOTPLUG_HINTS?

PSERIES_ENERGY_MANAGEMENT may be good but too long for a config
option.

The idea is to collect all energy management functions in this module
as and when new features are introduced in the pseries platform.  This
hcall interface is the first to be included, but going forward in
future I do not propose to have different modules for other energy
management related features.

The name is specific enough for the IBM pseries platform and its energy
management functions and enablements.  Having a less generic name below
this level would make it difficult to add other varieties of energy
management functions in the future.

  +   tristate pseries energy management capabilities driver
  +   depends on PPC_PSERIES
  +   default y
  +   help
  + Provides interface to platform energy management capabilities
  + on supported PSERIES platforms.
  + Provides: /sys

[PATCH v4 0/2] powerpc: add support for new hcall H_BEST_ENERGY

2010-07-21 Thread Vaidyanathan Srinivasan
Hi Ben,

The following series adds a kernel module for powerpc pseries
platforms in order to export platform energy management capabilities.

The module exports data from a new hypervisor call H_BEST_ENERGY.

Some of the comments and suggestions made on the previous iteration of the
patch has been incorporated.

Changes in v4:

* Added more documentation
* Added check_for_h_best_energy() to look in ibm,hypertas-functions so
  that sysfs entries are not created in an unsupported platform
* Added cleaner error checks and correct of_node_put()
* Rebased and tested on 2.6.35-rc5

Changed in v3:

[3] [PATCH v3 0/2] powerpc: add support for new hcall H_BEST_ENERGY
http://lists.ozlabs.org/pipermail/linuxppc-dev/2010-June/083414.html

* Added more documentation in the cleanup patch
* Removed RFC tag, rebased and tested on 2.6.35-rc3
* Ready for inclusion in powerpc/next tree for further testing

Changes in v2:

[2] [RFC PATCH v2 0/2] powerpc: add support for new hcall H_BEST_ENERGY
http://lists.ozlabs.org/pipermail/linuxppc-dev/2010-May/082246.html

* Cleanup cpu/thread/core APIs
* Export APIs to module instead of threads_per_core
* Use of_find_node_by_path() instead of of_find_node_by_name()
* Error checking and whitespace cleanups

First version:
[1] [RFC] powerpc: add support for new hcall H_BEST_ENERGY
http://lists.ozlabs.org/pipermail/linuxppc-dev/2010-March/080796.html

This patch series will apply on 2.6.35-rc5 as well as powerpc/next
tree.  Please review and include in powerpc/next tree for further
testing.  

I could incrementally reduce some of the error checks as suggested by
Michael Neuling as next steps.

Thanks,
Vaidy
---

Vaidyanathan Srinivasan (2):
  powerpc: cleanup APIs for cpu/thread/core mappings
  powerpc: add support for new hcall H_BEST_ENERGY


 arch/powerpc/include/asm/cputhreads.h   |   15 +
 arch/powerpc/include/asm/hvcall.h   |3 
 arch/powerpc/kernel/smp.c   |   19 +
 arch/powerpc/mm/mmu_context_nohash.c|   12 -
 arch/powerpc/platforms/pseries/Kconfig  |   10 +
 arch/powerpc/platforms/pseries/Makefile |1 
 arch/powerpc/platforms/pseries/pseries_energy.c |  326 +++
 7 files changed, 370 insertions(+), 16 deletions(-)
 create mode 100644 arch/powerpc/platforms/pseries/pseries_energy.c



[PATCH v4 1/2] powerpc: cleanup APIs for cpu/thread/core mappings

2010-07-21 Thread Vaidyanathan Srinivasan
These APIs take logical cpu number as input
Change cpu_first_thread_in_core() to cpu_leftmost_thread_sibling()
Change cpu_last_thread_in_core() to cpu_rightmost_thread_sibling()

These APIs convert core number (index) to logical cpu/thread numbers
Add cpu_first_thread_of_core(int core)
Changed cpu_thread_to_core() to cpu_core_of_thread(int cpu)

The goal is to make 'threads_per_core' accessible to the
pseries_energy module.  Instead of making an API to read
threads_per_core, this is a higher level wrapper function to
convert from logical cpu number to core number.

The current APIs cpu_first_thread_in_core() and
cpu_last_thread_in_core() return logical CPU numbers while
cpu_thread_to_core() returns a core number or index which is
not a logical CPU number.  The APIs are now clearly named to
distinguish 'core number' versus first and last 'logical cpu
number' in that core.

The new APIs cpu_{left,right}most_thread_sibling() work on
logical cpu numbers.  While cpu_first_thread_of_core() and
cpu_core_of_thread() work on core index.

Example usage:  (4 threads per core system)

cpu_leftmost_thread_sibling(5) = 4
cpu_rightmost_thread_sibling(5) = 7
cpu_core_of_thread(5) = 1
cpu_first_thread_of_core(1) = 4

cpu_core_of_thread() is used in cpu_to_drc_index() in the
module and cpu_first_thread_of_core() is used in
drc_index_to_cpu() in the module.

Made API changes to few callers.  Exported symbols for use
in modules.

Signed-off-by: Vaidyanathan Srinivasan sva...@linux.vnet.ibm.com
---
 arch/powerpc/include/asm/cputhreads.h |   15 +--
 arch/powerpc/kernel/smp.c |   19 ---
 arch/powerpc/mm/mmu_context_nohash.c  |   12 ++--
 3 files changed, 31 insertions(+), 15 deletions(-)

diff --git a/arch/powerpc/include/asm/cputhreads.h 
b/arch/powerpc/include/asm/cputhreads.h
index a8e1844..26dc6bd 100644
--- a/arch/powerpc/include/asm/cputhreads.h
+++ b/arch/powerpc/include/asm/cputhreads.h
@@ -61,22 +61,25 @@ static inline cpumask_t cpu_online_cores_map(void)
return cpu_thread_mask_to_cores(cpu_online_map);
 }
 
-static inline int cpu_thread_to_core(int cpu)
-{
-   return cpu >> threads_shift;
-}
+#ifdef CONFIG_SMP
+int cpu_core_of_thread(int cpu);
+int cpu_first_thread_of_core(int core);
+#else
+static inline int cpu_core_of_thread(int cpu) { return cpu; }
+static inline int cpu_first_thread_of_core(int core) { return core; }
+#endif
 
 static inline int cpu_thread_in_core(int cpu)
 {
 return cpu & (threads_per_core - 1);
 }
 
-static inline int cpu_first_thread_in_core(int cpu)
+static inline int cpu_leftmost_thread_sibling(int cpu)
 {
 return cpu & ~(threads_per_core - 1);
 }
 
-static inline int cpu_last_thread_in_core(int cpu)
+static inline int cpu_rightmost_thread_sibling(int cpu)
 {
return cpu | (threads_per_core - 1);
 }
diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index 5c196d1..da4c2f8 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -468,7 +468,20 @@ out:
return id;
 }
 
-/* Must be called when no change can occur to cpu_present_mask,
+/* Helper routines for cpu to core mapping */
+int cpu_core_of_thread(int cpu)
+{
+   return cpu >> threads_shift;
+}
+EXPORT_SYMBOL_GPL(cpu_core_of_thread);
+
+int cpu_first_thread_of_core(int core)
+{
+   return core << threads_shift;
+}
+EXPORT_SYMBOL_GPL(cpu_first_thread_of_core);
+
+/* Must be called when no change can occur to cpu_present_map,
  * i.e. during cpu online or offline.
  */
 static struct device_node *cpu_to_l2cache(int cpu)
@@ -527,7 +540,7 @@ int __devinit start_secondary(void *unused)
notify_cpu_starting(cpu);
set_cpu_online(cpu, true);
/* Update sibling maps */
-   base = cpu_first_thread_in_core(cpu);
+   base = cpu_leftmost_thread_sibling(cpu);
for (i = 0; i  threads_per_core; i++) {
if (cpu_is_offline(base + i))
continue;
@@ -606,7 +619,7 @@ int __cpu_disable(void)
return err;
 
/* Update sibling maps */
-   base = cpu_first_thread_in_core(cpu);
+   base = cpu_leftmost_thread_sibling(cpu);
for (i = 0; i  threads_per_core; i++) {
cpumask_clear_cpu(cpu, cpu_sibling_mask(base + i));
cpumask_clear_cpu(base + i, cpu_sibling_mask(cpu));
diff --git a/arch/powerpc/mm/mmu_context_nohash.c 
b/arch/powerpc/mm/mmu_context_nohash.c
index ddfd7ad..22f3bc5 100644
--- a/arch/powerpc/mm/mmu_context_nohash.c
+++ b/arch/powerpc/mm/mmu_context_nohash.c
@@ -111,8 +111,8 @@ static unsigned int steal_context_smp(unsigned int id)
 * a core map instead but this will do for now.
 */
for_each_cpu(cpu, mm_cpumask(mm)) {
-   for (i = cpu_first_thread_in_core(cpu);
-        i <= cpu_last_thread_in_core(cpu); i++)
+   for (i = cpu_leftmost_thread_sibling(cpu);
+i

[PATCH v4 2/2] powerpc: add support for new hcall H_BEST_ENERGY

2010-07-21 Thread Vaidyanathan Srinivasan
Create sysfs interface to export data from H_BEST_ENERGY hcall
that can be used by administrative tools on supported pseries
platforms for energy management optimizations.

/sys/device/system/cpu/pseries_(de)activate_hint_list and
/sys/device/system/cpu/cpuN/pseries_(de)activate_hint will provide
hints for activation and deactivation of cpus respectively.

These hints are abstract numbers given by the hypervisor based
on the extended knowledge the hypervisor has regarding the
current system topology and resource mappings.

The activate and the deactivate sysfs entries are for the two
distinct operations that we could do for energy savings.  When
we have more capacity than required, we could deactivate a few
cores to save energy.  The choice of which core to deactivate
will be based on /sys/devices/system/cpu/deactivate_hint_list.
The comma-separated list of cpus (cores) will be the preferred
choice.  If we have to activate some of the deactivated cores,
then /sys/devices/system/cpu/activate_hint_list will be used.

The per-cpu file
/sys/device/system/cpu/cpuN/pseries_(de)activate_hint further
provides more fine-grained information by exporting the value of
the hint itself.

Added new driver module
arch/powerpc/platforms/pseries/pseries_energy.c
under new config option CONFIG_PSERIES_ENERGY

Signed-off-by: Vaidyanathan Srinivasan sva...@linux.vnet.ibm.com
---
 arch/powerpc/include/asm/hvcall.h   |3 
 arch/powerpc/platforms/pseries/Kconfig  |   10 +
 arch/powerpc/platforms/pseries/Makefile |1 
 arch/powerpc/platforms/pseries/pseries_energy.c |  326 +++
 4 files changed, 339 insertions(+), 1 deletions(-)
 create mode 100644 arch/powerpc/platforms/pseries/pseries_energy.c

diff --git a/arch/powerpc/include/asm/hvcall.h 
b/arch/powerpc/include/asm/hvcall.h
index 5119b7d..34b66e0 100644
--- a/arch/powerpc/include/asm/hvcall.h
+++ b/arch/powerpc/include/asm/hvcall.h
@@ -231,7 +231,8 @@
 #define H_GET_EM_PARMS 0x2B8
 #define H_SET_MPP  0x2D0
 #define H_GET_MPP  0x2D4
-#define MAX_HCALL_OPCODE   H_GET_MPP
+#define H_BEST_ENERGY  0x2F4
+#define MAX_HCALL_OPCODE   H_BEST_ENERGY
 
 #ifndef __ASSEMBLY__
 
diff --git a/arch/powerpc/platforms/pseries/Kconfig 
b/arch/powerpc/platforms/pseries/Kconfig
index c667f0f..8323622 100644
--- a/arch/powerpc/platforms/pseries/Kconfig
+++ b/arch/powerpc/platforms/pseries/Kconfig
@@ -33,6 +33,16 @@ config PSERIES_MSI
 depends on PCI_MSI && EEH
default y
 
+config PSERIES_ENERGY
+   tristate "pSeries energy management capabilities driver"
+   depends on PPC_PSERIES
+   default y
+   help
+ Provides interface to platform energy management capabilities
+ on supported PSERIES platforms.
+ Provides: /sys/devices/system/cpu/pseries_(de)activation_hint_list
+ and /sys/devices/system/cpu/cpuN/pseries_(de)activation_hint
+
 config SCANLOG
 tristate "Scanlog dump interface"
 depends on RTAS_PROC && PPC_PSERIES
diff --git a/arch/powerpc/platforms/pseries/Makefile 
b/arch/powerpc/platforms/pseries/Makefile
index 3dbef30..32ae72e 100644
--- a/arch/powerpc/platforms/pseries/Makefile
+++ b/arch/powerpc/platforms/pseries/Makefile
@@ -16,6 +16,7 @@ obj-$(CONFIG_EEH) += eeh.o eeh_cache.o eeh_driver.o 
eeh_event.o eeh_sysfs.o
 obj-$(CONFIG_KEXEC)+= kexec.o
 obj-$(CONFIG_PCI)  += pci.o pci_dlpar.o
 obj-$(CONFIG_PSERIES_MSI)  += msi.o
+obj-$(CONFIG_PSERIES_ENERGY)   += pseries_energy.o
 
 obj-$(CONFIG_HOTPLUG_CPU)  += hotplug-cpu.o
 obj-$(CONFIG_MEMORY_HOTPLUG)   += hotplug-memory.o
diff --git a/arch/powerpc/platforms/pseries/pseries_energy.c 
b/arch/powerpc/platforms/pseries/pseries_energy.c
new file mode 100644
index 000..e217ea3
--- /dev/null
+++ b/arch/powerpc/platforms/pseries/pseries_energy.c
@@ -0,0 +1,326 @@
+/*
+ * POWER platform energy management driver
+ * Copyright (C) 2010 IBM Corporation
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * version 2 as published by the Free Software Foundation.
+ *
+ * This pseries platform device driver provides access to
+ * platform energy management capabilities.
+ */
+
+#include <linux/module.h>
+#include <linux/types.h>
+#include <linux/errno.h>
+#include <linux/init.h>
+#include <linux/seq_file.h>
+#include <linux/sysdev.h>
+#include <linux/cpu.h>
+#include <linux/of.h>
+#include <asm/cputhreads.h>
+#include <asm/page.h>
+#include <asm/hvcall.h>
+
+
+#define MODULE_VERS "1.0"
+#define MODULE_NAME "pseries_energy"
+
+/* Driver flags */
+
+static int sysfs_entries;
+
+/* Helper routines */
+
+/*
+ * Routine to detect firmware support for hcall
+ * return 1 if H_BEST_ENERGY is supported
+ * else return 0

Re: [PATCH v3 2/2] powerpc: add support for new hcall H_BEST_ENERGY

2010-07-21 Thread Vaidyanathan Srinivasan
* Michael Neuling mi...@neuling.org [2010-06-28 16:11:06]:

[snip]

  These hints are abstract number given by the hypervisor based on
  the extended knowledge the hypervisor has regarding the current system
  topology and resource mappings.
  
  The activate and the deactivate part is for the two distinct
  operations that we could do for energy savings.  When we have more
  capacity than required, we could deactivate few core to save energy.
  The choice of the core to deactivate will be based on
  /sys/devices/system/cpu/deactivate_hint_list.  The comma separated
  list of cpus (cores) will be the preferred choice.
  
  Once the system has few deactivated cores, based on workload demand we
  may have to activate them to meet the demand.  In that case the
  /sys/devices/system/cpu/activate_hint_list will be used to prefer the
  core in-order among the deactivated cores.
  
  In simple terms, activate_hint_list will be null until we deactivate
  few cores.  Then we could look at the corresponding list for
  activation or deactivation.
 
 Can you put these details in the code and in the check-in comments.

Hi Mikey,

I have added these in the -v4 post.

  Regarding your second point, there is a reason for both a list and
  per-cpu interface.  The list gives us a system wide list of cores in
  one shot for userspace to base their decision.  This will be the
  preferred interface for most cases.  On the other hand, per-cpu file
  /sys/device/system/cpu/cpuN/pseries_(de)activate_hint provide more
  information since it exports the hint value as such.
  
  The idea is that the list interface will be used to get a suggested
  list of cores to manage, while the per-cpu value can be used to
  further get fine grain information on a per-core bases from the
  hypervisor.  This allows Linux to have access to all information that
  the hypervisor has offered through this hcall interface.
 
 OK, I didn't realise that they contained different info.  Just more
 reasons that this interface needs better documentation :-)
 
 Overall, I'm mostly happy with the interface.  It's pretty light weight.

these too.

   Other comments below.
   

[snip]

diff --git a/arch/powerpc/platforms/pseries/Kconfig 
b/arch/powerpc/platfo
 rms/
   pseries/Kconfig
index c667f0f..b3dd108 100644
--- a/arch/powerpc/platforms/pseries/Kconfig
+++ b/arch/powerpc/platforms/pseries/Kconfig
@@ -33,6 +33,16 @@ config PSERIES_MSI
depends on PCI_MSI  EEH
default y

+config PSERIES_ENERGY
   
   Probably need a less generic name.  PSERIES_ENERGY_MANAGEMENT?
   PSERIES_ENERGY_HOTPLUG_HINTS?
 
  PSERIES_ENERGY_MANAGEMENT may be good but too long for a config
  option.
  
  The idea is to collect all energy management functions in this module
  as and when new features are introduced in the pseries platform.  This
  hcall interface is the first to be included, but going forward in
  future I do not propose to have different modules for other energy
  management related features.
  
  The name is specific enough for IBM pseries platform and energy
  management functions and enablements.  Having less generic name below
  this level will make it difficult to add all varieties of energy
  management functions in future.
 
 OK, I thought this might be the case but you never said.  Please say
 something like This adds CONFIG_PSERIES_ENERGY which will be used for
 future power saving code or some such.

I already had this comment in the patch description.  Did not want to add
a comment in the Kconfig file as the CONFIG_ prefix is assumed in each
of the 

+
+/* Helper Routines to convert between drc_index to cpu numbers */
+
+static u32 cpu_to_drc_index(int cpu)
+{
+   struct device_node *dn = NULL;
+   const int *indexes;
+   int i;
+   dn = of_find_node_by_path("/cpus");
+   if (dn == NULL)
+   goto err;
   
    Hmm, I'm not sure this is really needed.  If you don't have /cpus you are
   probably not going to boot.
  
  Good suggestion.  I could add all these checks in module_init.   I was
  thinking whether any of the functions being called is allocating memory and in
  case they fail, we need to abort.
  
  I just reviewed and it looks like of_find_node_by_path() will not sleep or
  allocate any memory.  So if it succeeds once in module_init(), then it
  will never fail! 
  
  
+   indexes = of_get_property(dn, "ibm,drc-indexes", NULL);
+   if (indexes == NULL)
+   goto err;
   
    These checks should probably be moved to module init rather than sysfs
    read time.  If they fail, don't load the module and print a warning.  
    
    These HCALLs and device-tree entries aren't going to be dynamic.
  
  Agreed.  Only cause of runtime failure is OOM.  If none of these
  allocate memory, moving these checks once at module_init() will be
  a good optimization.
 
 Cool, thanks.  

Hey, I did not yet remove the failure checks in 

Re: [PATCH] powerpc: ONLINE to OFFLINE CPU state transition during removal

2010-07-22 Thread Vaidyanathan Srinivasan
* Robert Jennings r...@linux.vnet.ibm.com [2010-07-22 21:43:44]:

 If a CPU remove is attempted using the 'release' interface on hardware
 which supports extended cede, the CPU will be put in the INACTIVE state
 rather than the OFFLINE state due to the default preferred_offline_state
 in that situation.  In the INACTIVE state it will fail to be removed.
 
 This patch changes the preferred offline state to OFFLINE when an CPU is
 in the ONLINE state.  After cpu_down() is called in dlpar_offline_cpu()
 the CPU will be OFFLINE and CPU removal can continue.

Hi Robert,

Thanks for the patch.  In dlpar operation, we would offline the CPU
first using the sysfs online file and then write to the sysfs release
file to complete the sequence right?  The current code in
dlpar_offline_cpu() would work as long as the cpu is in either
inactive state or offline state (in case of unsupported platform).

Are the dlpar tools being changed to complete the operation with one
sysfs write to release file?

 Signed-off-by: Robert Jennings r...@linux.vnet.ibm.com
 
 ---
 
 diff --git a/arch/powerpc/platforms/pseries/dlpar.c 
 b/arch/powerpc/platforms/pseries/dlpar.c
 index d71e585..227c1c3 100644
 --- a/arch/powerpc/platforms/pseries/dlpar.c
 +++ b/arch/powerpc/platforms/pseries/dlpar.c
 @@ -463,6 +463,7 @@ static int dlpar_offline_cpu(struct device_node *dn)
   break;
 
   if (get_cpu_current_state(cpu) == CPU_STATE_ONLINE) {
 + set_preferred_offline_state(cpu, CPU_STATE_OFFLINE);
   cpu_maps_update_done();
   rc = cpu_down(cpu);
   if (rc)

The patch looks good.  Will need to test out the various scenarios so
that the preferred_offline_state does not get flipped before cpu_down()
is called.  This is unlikely, but still we need to validate
a concurrent sysfs online file write and sysfs release file write.

--Vaidy

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


Re: [PATCH][RFC] preempt_count corruption across H_CEDE call with CONFIG_PREEMPT on pseries

2010-07-22 Thread Vaidyanathan Srinivasan
* Darren Hart dvh...@us.ibm.com [2010-07-22 21:44:04]:

 On 07/22/2010 04:57 PM, Darren Hart wrote:
  On 07/22/2010 03:25 PM, Benjamin Herrenschmidt wrote:
  On Thu, 2010-07-22 at 11:24 -0700, Darren Hart wrote:
 
  1) How can the preempt_count() get mangled across the H_CEDE hcall?
  2) Should we call preempt_enable() in cpu_idle() prior to cpu_die() ?
 
  The preempt count is on the thread info at the bottom of the stack.
 
  Can you check the stack pointers ?
  
  Hi Ben, thanks for looking.
  
  I instrumented the area around extended_cede_processor() as follows
  (please confirm I'm getting the stack pointer correctly).
  
   while (get_preferred_offline_state(cpu) == CPU_STATE_INACTIVE) {
   asm("mr %0,1" : "=r" (sp));
   printk("before H_CEDE current->stack: %lx, pcnt: %x\n", sp,
  preempt_count());
   extended_cede_processor(cede_latency_hint);
   asm("mr %0,1" : "=r" (sp));
   printk("after H_CEDE current->stack: %lx, pcnt: %x\n", sp,
  preempt_count());
 }
  
  
  On Mainline (2.6.33.6, CONFIG_PREEMPT=y) I see this:
  Jul 22 18:37:08 igoort1 kernel: before H_CEDE current->stack: c0010e9e3ce0, pcnt: 1
  Jul 22 18:37:08 igoort1 kernel: after H_CEDE current->stack:  c0010e9e3ce0, pcnt: 1
  
  This surprised me as preempt_count is 1 before and after, so no
  corruption appears to occur on mainline. This makes the pcnt of 65 I see
  without the preempt_count()=0 hack very strange. I ran several hundred
  off/on cycles. The issue of preempt_count being 1 is still addressed by
  this patch however.
  
  On PREEMPT_RT (2.6.33.5-rt23 - tglx, sorry, rt/2.6.33 next time, promise):
  Jul 22 18:51:11 igoort1 kernel: before H_CEDE current->stack: c00089bcfcf0, pcnt: 1
  Jul 22 18:51:11 igoort1 kernel: after H_CEDE current->stack:  c00089bcfcf0, pcnt: 
  
  In both cases the stack pointer appears unchanged.
  
  Note: there is a BUG triggered in between these statements as the
  preempt_count causes the printk to trigger:
  Badness at kernel/sched.c:5572
 
 At Steven's suggestion I updated the instrumentation to display the
 local_save_flags and irqs_disabled_flags:
 
 while (get_preferred_offline_state(cpu) == CPU_STATE_INACTIVE) {
 local_save_flags(flags);
 printk("local flags: %lx, irqs disabled: %d\n", flags,
irqs_disabled_flags(flags));
 asm("mr %0,1" : "=r" (sp));
 printk("before H_CEDE current->stack: %lx, pcnt: %x\n", sp,
preempt_count());
 extended_cede_processor(cede_latency_hint);
 asm("mr %0,1" : "=r" (sp));
 printk("after H_CEDE current->stack: %lx, pcnt: %x\n", sp,
preempt_count());
 }
 
 
 Jul 22 23:36:58 igoort1 kernel: local flags: 0, irqs disabled: 1
 Jul 22 23:36:58 igoort1 kernel: before H_CEDE current->stack: c0010e9e3ce0, pcnt: 1
 Jul 22 23:36:58 igoort1 kernel: after H_CEDE current->stack: c0010e9e3ce0, pcnt: 1
 
 I'm not sure if I'm reading that right, but I believe interrupts are
 intended to be disabled here. If accomplished via the
 spin_lock_irqsave() this would behave differently on RT. However, this
 path disables the interrupts handled by xics, all but the IPIs anyway.
 On RT I disabled the decrementer as well.
 
 Is it possible for RT to be receiving other interrupts here?

Yes.  extended_cede_processor() will return with interrupts enabled in
the cpu. (This is done by the hypervisor).  Under normal cases we
cannot be interrupted because no IO interrupts are routed to us after
xics_teardown_cpu() and since the CPU is out of the map, nobody will
send us IPIs.

Though H_CEDE will return with interrupts enabled, it is unlikely that
an interrupt can be delivered in this context.

--Vaidy
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


Re: [PATCH][RFC] preempt_count corruption across H_CEDE call with CONFIG_PREEMPT on pseries

2010-07-23 Thread Vaidyanathan Srinivasan
* Benjamin Herrenschmidt b...@kernel.crashing.org [2010-07-23 15:11:00]:

 On Fri, 2010-07-23 at 10:38 +0530, Vaidyanathan Srinivasan wrote:
  Yes.  extended_cede_processor() will return with interrupts enabled in
  the cpu. (This is done by the hypervisor).  Under normal cases we
  cannot be interrupted because no IO interrupts are routed to us after
  xics_teardown_cpu() and since the CPU is out of the map, nobody will
  send us IPIs.
 
 What about decrementer ?

Decrementer expiry event handling is a bit complex.  The event as such
may not bring back the extended_cede_processor() cpu, but may be
marked pending when we get out of this state eventually.  I will find
more information on this event and update.

  Though H_CEDE will return with interrupts enabled, it is unlikely that
  an interrupt can be delivered in this context. 
 
 Well, if interrupts are soft-disabled, even if one occurs, we will just
 mask and return, so that at least should be ok.

Yes.  We will immediately return to the extended_cede_processor() in
the while loop until the preferred_offline_state is changed.

--Vaidy
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


Re: [PATCH v4 1/2] powerpc: cleanup APIs for cpu/thread/core mappings

2010-08-05 Thread Vaidyanathan Srinivasan
* Benjamin Herrenschmidt b...@kernel.crashing.org [2010-08-03 14:44:13]:

 On Thu, 2010-07-22 at 06:27 +0530, Vaidyanathan Srinivasan wrote:
  These APIs take logical cpu number as input
  Change cpu_first_thread_in_core() to cpu_leftmost_thread_sibling()
  Change cpu_last_thread_in_core() to cpu_rightmost_thread_sibling()
 
 I'm still not happy here.
 
 I don't find leftmost/rightmost to be a progress from the previous
 naming. If you really want to change to avoid in_core() at the end, then
 call them first_thread_sibling() and last_thread_sibling().

Hi Ben,

The naming is based on our discussion during the first RFC:
http://lists.ozlabs.org/pipermail/linuxppc-dev/2010-April/081610.html

I am fine with first_thread_sibling() name instead of
cpu_{left,right}most_thread_sibling().  The goal is to distinguish
between APIs returning logical cpu numbers and core indexes.

first_thread_sibling() implies that the parameter and return value are
logical cpu (thread) number.

I will change the name as per your suggestion and post an update.

 Then you change:
 
  -static inline int cpu_thread_to_core(int cpu)
  -{
  -   return cpu >> threads_shift;
  -}
 
   .../... to:
 
  -/* Must be called when no change can occur to cpu_present_mask,
  +/* Helper routines for cpu to core mapping */
  +int cpu_core_of_thread(int cpu)
  +{
  +   return cpu >> threads_shift;
  +}
  +EXPORT_SYMBOL_GPL(cpu_core_of_thread);

The cpu_core_of_thread() name implies that the return type is a core index
and not a logical cpu number.

cpu_core_of_thread(5) = 1
cpu_first_thread_of_core(1) = 4

Do you think cpu_core_number_of_thread() will explain the return type
better?

 Any reason you are making it out of line other than gratuitous
 bloat ? :-)
 
  +int cpu_first_thread_of_core(int core)
  +{
  +   return core << threads_shift;
  +}
  +EXPORT_SYMBOL_GPL(cpu_first_thread_of_core);
 
 Same.

The reason behind out of line is to avoid exporting threads_shift or
other variables like threads_per_core and have this API exported so
that modules can read them.  If we make these wrapper inline, then we
will have to export the variables themselves which is not a good idea.

Thanks for the review.  

--Vaidy

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


Re: [PATCH][RFC] preempt_count corruption across H_CEDE call with CONFIG_PREEMPT on pseries

2010-08-05 Thread Vaidyanathan Srinivasan
* Darren Hart dvh...@us.ibm.com [2010-08-04 21:45:51]:

 On 07/23/2010 12:07 AM, Vaidyanathan Srinivasan wrote:
 * Benjamin Herrenschmidtb...@kernel.crashing.org  [2010-07-23 15:11:00]:
 
 On Fri, 2010-07-23 at 10:38 +0530, Vaidyanathan Srinivasan wrote:
 Yes.  extended_cede_processor() will return with interrupts enabled in
 the cpu. (This is done by the hypervisor).  Under normal cases we
 cannot be interrupted because no IO interrupts are routed to us after
 xics_teardown_cpu() and since the CPU is out of the map, nobody will
 send us IPIs.
 
 What about decrementer ?
 
 Decrementer expiry event handling is bit complex.  The event as such
 may not bring back the extended_cede_processor() cpu, but may be
 marked pending when we get out of this state eventually.  I will find
 more information on this event and update.
 
 Hi Vaidy, have you been able to dig anything up about the
 decrementer expiry?

Hi Darren,

Yes, it is possible that the cpu in extended_cede_processor() can be
woken up by the decrementer.  But the expectation is that we will
return back to this context since we will not have any pending timers.

Our stack should not change even if we get the decrementer interrupts.

--Vaidy

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


Re: [PATCH] powerpc: ONLINE to OFFLINE CPU state transition during removal

2010-08-05 Thread Vaidyanathan Srinivasan
* Nathan Fontenot nf...@austin.ibm.com [2010-07-26 14:13:35]:

 On 07/22/2010 11:13 PM, Vaidyanathan Srinivasan wrote:
  * Robert Jennings r...@linux.vnet.ibm.com [2010-07-22 21:43:44]:
  
  If a CPU remove is attempted using the 'release' interface on hardware
  which supports extended cede, the CPU will be put in the INACTIVE state
  rather than the OFFLINE state due to the default preferred_offline_state
  in that situation.  In the INACTIVE state it will fail to be removed.
 
  This patch changes the preferred offline state to OFFLINE when an CPU is
  in the ONLINE state.  After cpu_down() is called in dlpar_offline_cpu()
  the CPU will be OFFLINE and CPU removal can continue.
  
  Hi Robert,
  
  Thanks for the patch.  In dlpar operation, we would offline the CPU
  first using the sysfs online file and then write to the sysfs release
  file to complete the sequence right?  The current code in
  dlpar_offline_cpu() would work as long as the cpu is in either
  inactive state or offline state (in case of unsupported platform).
  
  Is the dlpar tools being changed to complete the operation with one
  sysfs write to release file?
 
 The dlpar tools were updated so that a single write to the 'release' file
 would offline the cpu and remove it from the system.  Given this, I think
 Robert's patch should go forward to maintain compatibility.

Hi Nathan,

Thanks for clarifying.  One concern that I have is that this change
could race with sysfs write to cpuN/online file.

dlpar_offline_cpu() is called with cpu_hotplug_driver_lock() held so
the sysfs operation on online file should wait.  However by the time
the cpu_hotplug_driver_unlock() happens the dlpar operation should
have completed and we should not be able to online or offline the cpu
from sysfs online file.

I wanted to confirm that a concurrent sysfs write cannot cause
the dlpar operation to fail in the middle of the operation.

The previous code will work fine for platforms not supporting
cede_offline where we have only two states for the cpu.

When cede_offline is supported, we have three states and the state
transitions are controlled by online file and the release file in two
steps so that the failures can be retried in user space.

This patch will work and I do not see any problem as such, but we
still need to carefully evaluate the retry and error handling paths
before switching over to a single sysfs write based dlpar operation.
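
For illustration only, the serialization pattern being discussed would look
roughly like the sketch below (simplified and hypothetical in its details;
the real 'release' handler differs in naming and error handling):

/* Simplified sketch: how a single write to the 'release' file can be
 * serialized against concurrent writes to cpuN/online. */
static int dlpar_cpu_release_sketch(struct device_node *dn, u32 drc_index)
{
	int rc;

	/* Blocks concurrent sysfs cpuN/online writers for the duration. */
	cpu_hotplug_driver_lock();

	/* ONLINE -> OFFLINE via cpu_down(), as in the patch above. */
	rc = dlpar_offline_cpu(dn);
	if (!rc) {
		/* Hand the now-offline CPU back to the platform firmware. */
		rc = dlpar_release_drc(drc_index);
		if (!rc)
			rc = dlpar_detach_node(dn);
	}

	cpu_hotplug_driver_unlock();
	return rc;
}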

--Vaidy

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


Re: [PATCH][RFC] preempt_count corruption across H_CEDE call with CONFIG_PREEMPT on pseries

2010-08-05 Thread Vaidyanathan Srinivasan
* Darren Hart dvh...@us.ibm.com [2010-08-05 19:19:00]:

 On 07/22/2010 10:09 PM, Benjamin Herrenschmidt wrote:
  On Thu, 2010-07-22 at 21:44 -0700, Darren Hart wrote:
  
suggestion I updated the instrumentation to display the
  local_save_flags and irqs_disabled_flags:
  
  Jul 22 23:36:58 igoort1 kernel: local flags: 0, irqs disabled: 1
  Jul 22 23:36:58 igoort1 kernel: before H_CEDE current-stack: 
  c0010e9e3ce0, pcnt: 1
  Jul 22 23:36:58 igoort1 kernel: after H_CEDE current-stack: 
  c0010e9e3ce0, pcnt: 1
 
  I'm not sure if I'm reading that right, but I believe interrupts are
  intended to be disabled here. If accomplished via the
  spin_lock_irqsave() this would behave differently on RT. However, this
  path disables the interrupts handled by xics, all but the IPIs anyway.
  On RT I disabled the decrementer as well.
 
  Is it possible for RT to be receiving other interrupts here?
  
  Also you may want to call hard_irq_disable() to -really- disable
  interrupts ... since we do lazy-disable on powerpc
  
 
 A quick update and request for direction wrt the cede hcall, interrupt 
 handling, and the stack pointer.
 
 I've added patches one at a time, eliminating bugs with preempt_rt
 (tip/rt/head). With my current patchset I have no crashes with a single run of
 random_online.sh (stress testing to hit the hang Will sees is my todo for
 tomorrow).
 
 Current patchset includes:
 patches/0001-wms-fix01.patch # P7 lazy flushing thing
 
 # next four are sent to / queued for mainline
 patches/powerpc-increase-pseries_cpu_die-delay.patch
 patches/powerpc-enable-preemption-before-cpu_die.patch
 patches/powerpc-silence-__cpu_up-under-normal-operation.patch
 patches/powerpc-silence-xics_migrate_irqs_away-during-cpu-offline.patch
 
 # this one needs to be cleaned up and sent to mainline
 patches/powerpc-wait-for-cpu-to-go-inactive.patch
 
 patches/powerpc-disable-decrementer-on-offline.patch
 patches/powerpc-cpu_die-preempt-hack.patch # reset preempt_count to 0 after 
 cede
 patches/powerpc-cede-processor-inst.patch # instrumentation
 patches/powerpc-hard_irq_disable.patch # hard_irq_disable before cede
 patches/powerpc-pad-thread_info.patch
 
 I didn't include all the patches as the relevant bits are included below in
 code form.
 
 With the instrumentation, it's clear the change to preempt_count() is targeted
 (since I added padding before and after preempt_count in the thread_info
 struct) and preempt_count is still changed. It is also clear that it isn't 
 just
 a stray decrement since I set preempt_count() to 4 prior to calling cede and 
 it
 still is 0x after the hcall. Also note that the stack pointer doesn't
 change across the cede call and neither does any other part of the thread_info
 structure.
 
 Adding hard_irq_disable() did seem to affect things somewhat. Rather than a
 serialized list of before/after thread_info prints around cede, I see
 several befores then several afters. But the preempt_count is still modified.
 
 The relevant code looks like this (from pseries_mach_cpu_die())
 
 hard_irq_disable(); /* this doesn't fix things... but
  does change behavior a bit */
 preempt_count() = 0x4;
 asm("mr %0,1" : "=r" (sp));  /* stack pointer is in R1 */
 printk("before cede: sp=%lx pcnt=%x\n", sp,
preempt_count());
 print_thread_info();
 extended_cede_processor(cede_latency_hint);
 asm("mr %0,1" : "=r" (sp));
 printk("after cede: sp=%lx pcnt=%x\n", sp,
preempt_count());
 print_thread_info();
 preempt_count() = 0;
 
 
 With these patches applied, the output across cede looks like:
 
 before cede: sp=c0b57150 pcnt=4
 *** current->thread_info ***
 ti->task: c0aa1410
 ti->exec_domain: c0a59958
 ti->cpu: 0
 ti->stomp_on_me: 57005 /* 0xDEAD - forgot to print in hex */
 ti->preempt_count: 4
 ti->stomp_on_me_too: 48879 /* 0xBEEF - forgot to print in hex */
 ti->local_flags: 0
 *** end thread_info ***
 after cede: sp=c0b57150 pcnt=
 *** current->thread_info ***
 ti->task: c0aa1410
 ti->exec_domain: c0a59958
 ti->cpu: 0

Is this print sufficient to prove that we did not jump CPU?

We agree that decrementer can possibly expire and we could have gone
to the handler and come back just like we would do in an idle loop.
This will not happen always, but I could see a window of time during
which this may happen.  But that should not change the preempt_count
unconditionally to -1 with the same stack pointer.

 ti->stomp_on_me: 57005
 ti->preempt_count: 
 ti->stomp_on_me_too: 48879
 ti->local_flags: 0

Is there a method to identify whether we had executed any of the
interrupt handlers while in this code?

 *** end thread_info ***
 
 Are there any additional thoughts on what 

[PATCH v5 0/2] powerpc: add support for new hcall H_BEST_ENERGY

2010-10-06 Thread Vaidyanathan Srinivasan
Hi Ben,

The following series adds a new kernel module for powerpc pseries
platforms in order to export platform energy management capabilities.

The module exports data from a new hypervisor call H_BEST_ENERGY.

Comments and suggestions made on the previous iteration of the
patch related to function naming have been incorporated.

Changes in v5:

* Renamed cpu_{left,right}most_thread_sibling() to 
cpu_{first,last}_thread_sibling()

* Renamed cpu_core_of_thread() to cpu_core_index_of_thread()  
  (these functions work on core indexes)

* Rebased to 2.6.36-rc6 and tested on a supported platform

Changes in v4:

[4] [PATCH v4 0/2] powerpc: add support for new hcall H_BEST_ENERGY
http://lists.ozlabs.org/pipermail/linuxppc-dev/2010-July/084217.html

* Added more documentation
* Added check_for_h_best_energy() to look in ibm,hypertas-functions so
  that sysfs entries are not created in an unsupported platform
* Added cleaner error checks and correct of_node_put()
* Rebased and tested on 2.6.35-rc5

Changes in v3:

[3] [PATCH v3 0/2] powerpc: add support for new hcall H_BEST_ENERGY
http://lists.ozlabs.org/pipermail/linuxppc-dev/2010-June/083414.html

* Added more documentation in the cleanup patch
* Removed RFC tag, rebased and tested on 2.6.35-rc3
* Ready for inclusion in powerpc/next tree for further testing

Changes in v2:

[2] [RFC PATCH v2 0/2] powerpc: add support for new hcall H_BEST_ENERGY
http://lists.ozlabs.org/pipermail/linuxppc-dev/2010-May/082246.html

* Cleanup cpu/thread/core APIs
* Export APIs to module instead of threads_per_core
* Use of_find_node_by_path() instead of of_find_node_by_name()
* Error checking and whitespace cleanups

First version:
[1] [RFC] powerpc: add support for new hcall H_BEST_ENERGY
http://lists.ozlabs.org/pipermail/linuxppc-dev/2010-March/080796.html

This patch series will apply on 2.6.36-rc6 as well as powerpc/next
tree.  Please review and include in powerpc/next tree for further
testing.  

I could incrementally reduce some of the error checks as suggested by
Michael Neuling as next steps.  This patch series is conservative and
has more error checking in device tree parsing and drc index matching
code than what may be required.

Thanks,
Vaidy
---

Vaidyanathan Srinivasan (2):
  powerpc: cleanup APIs for cpu/thread/core mappings
  powerpc: add support for new hcall H_BEST_ENERGY


 arch/powerpc/include/asm/cputhreads.h   |   15 +
 arch/powerpc/include/asm/hvcall.h   |3 
 arch/powerpc/kernel/smp.c   |   19 +
 arch/powerpc/mm/mmu_context_nohash.c|   12 -
 arch/powerpc/platforms/pseries/Kconfig  |   10 +
 arch/powerpc/platforms/pseries/Makefile |1 
 arch/powerpc/platforms/pseries/pseries_energy.c |  326 +++
 7 files changed, 370 insertions(+), 16 deletions(-)
 create mode 100644 arch/powerpc/platforms/pseries/pseries_energy.c

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


[PATCH v5 1/2] powerpc: cleanup APIs for cpu/thread/core mappings

2010-10-06 Thread Vaidyanathan Srinivasan
These APIs take logical cpu number as input
Change cpu_first_thread_in_core() to cpu_first_thread_sibling()
Change cpu_last_thread_in_core() to cpu_last_thread_sibling()

These APIs convert core number (index) to logical cpu/thread numbers
Add cpu_first_thread_of_core(int core)
Changed cpu_thread_to_core() to cpu_core_index_of_thread(int cpu)

The goal is to make 'threads_per_core' accessible to the
pseries_energy module.  Instead of making an API to read
threads_per_core, this is a higher level wrapper function to
convert from logical cpu number to core number.

The current APIs cpu_first_thread_in_core() and
cpu_last_thread_in_core() returns logical CPU number while
cpu_thread_to_core() returns core number or index which is
not a logical CPU number.  The new APIs are now clearly named to
distinguish 'core number' versus first and last 'logical cpu
number' in that core.

The new APIs cpu_{first,last}_thread_sibling() work on
logical cpu numbers.  While cpu_first_thread_of_core() and
cpu_core_index_of_thread() work on core index.

Example usage:  (4 threads per core system)

cpu_first_thread_sibling(5) = 4
cpu_last_thread_sibling(5) = 7
cpu_core_index_of_thread(5) = 1
cpu_first_thread_of_core(1) = 4

cpu_core_index_of_thread() is used in cpu_to_drc_index() in the
module and cpu_first_thread_of_core() is used in
drc_index_to_cpu() in the module.
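
As a purely illustrative aside (not part of the patch), the example values
above fall straight out of the shift/mask arithmetic; with 4 threads per
core (threads_per_core = 4, threads_shift = 2) the helpers reduce to:

/* Illustration only: hard-coded for threads_shift = 2. */
static inline int ex_core_index_of_thread(int cpu)  { return cpu >> 2; }
static inline int ex_first_thread_of_core(int core) { return core << 2; }
static inline int ex_first_thread_sibling(int cpu)  { return cpu & ~3; }
static inline int ex_last_thread_sibling(int cpu)   { return cpu | 3; }

/*
 * ex_first_thread_sibling(5)  == 4
 * ex_last_thread_sibling(5)   == 7
 * ex_core_index_of_thread(5)  == 1
 * ex_first_thread_of_core(1)  == 4
 */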

Make API changes to a few callers.  Export symbols for use in modules.

Signed-off-by: Vaidyanathan Srinivasan sva...@linux.vnet.ibm.com
---
 arch/powerpc/include/asm/cputhreads.h |   15 +--
 arch/powerpc/kernel/smp.c |   19 ---
 arch/powerpc/mm/mmu_context_nohash.c  |   12 ++--
 3 files changed, 31 insertions(+), 15 deletions(-)

diff --git a/arch/powerpc/include/asm/cputhreads.h 
b/arch/powerpc/include/asm/cputhreads.h
index a8e1844..f71bb4c 100644
--- a/arch/powerpc/include/asm/cputhreads.h
+++ b/arch/powerpc/include/asm/cputhreads.h
@@ -61,22 +61,25 @@ static inline cpumask_t cpu_online_cores_map(void)
return cpu_thread_mask_to_cores(cpu_online_map);
 }
 
-static inline int cpu_thread_to_core(int cpu)
-{
-   return cpu >> threads_shift;
-}
+#ifdef CONFIG_SMP
+int cpu_core_index_of_thread(int cpu);
+int cpu_first_thread_of_core(int core);
+#else
+static inline int cpu_core_index_of_thread(int cpu) { return cpu; }
+static inline int cpu_first_thread_of_core(int core) { return core; }
+#endif
 
 static inline int cpu_thread_in_core(int cpu)
 {
 return cpu & (threads_per_core - 1);
 }
 
-static inline int cpu_first_thread_in_core(int cpu)
+static inline int cpu_first_thread_sibling(int cpu)
 {
 return cpu & ~(threads_per_core - 1);
 }
 
-static inline int cpu_last_thread_in_core(int cpu)
+static inline int cpu_last_thread_sibling(int cpu)
 {
return cpu | (threads_per_core - 1);
 }
diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index 0008bc5..012d6e2 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -466,7 +466,20 @@ out:
return id;
 }
 
-/* Must be called when no change can occur to cpu_present_mask,
+/* Helper routines for cpu to core mapping */
+int cpu_core_index_of_thread(int cpu)
+{
+   return cpu >> threads_shift;
+}
+EXPORT_SYMBOL_GPL(cpu_core_index_of_thread);
+
+int cpu_first_thread_of_core(int core)
+{
+   return core << threads_shift;
+}
+EXPORT_SYMBOL_GPL(cpu_first_thread_of_core);
+
+/* Must be called when no change can occur to cpu_present_map,
  * i.e. during cpu online or offline.
  */
 static struct device_node *cpu_to_l2cache(int cpu)
@@ -517,7 +530,7 @@ int __devinit start_secondary(void *unused)
notify_cpu_starting(cpu);
set_cpu_online(cpu, true);
/* Update sibling maps */
-   base = cpu_first_thread_in_core(cpu);
+   base = cpu_first_thread_sibling(cpu);
for (i = 0; i  threads_per_core; i++) {
if (cpu_is_offline(base + i))
continue;
@@ -596,7 +609,7 @@ int __cpu_disable(void)
return err;
 
/* Update sibling maps */
-   base = cpu_first_thread_in_core(cpu);
+   base = cpu_first_thread_sibling(cpu);
for (i = 0; i  threads_per_core; i++) {
cpumask_clear_cpu(cpu, cpu_sibling_mask(base + i));
cpumask_clear_cpu(base + i, cpu_sibling_mask(cpu));
diff --git a/arch/powerpc/mm/mmu_context_nohash.c 
b/arch/powerpc/mm/mmu_context_nohash.c
index ddfd7ad..9aa754c 100644
--- a/arch/powerpc/mm/mmu_context_nohash.c
+++ b/arch/powerpc/mm/mmu_context_nohash.c
@@ -111,8 +111,8 @@ static unsigned int steal_context_smp(unsigned int id)
 * a core map instead but this will do for now.
 */
for_each_cpu(cpu, mm_cpumask(mm)) {
-   for (i = cpu_first_thread_in_core(cpu);
-i <= cpu_last_thread_in_core(cpu); i++)
+   for (i = cpu_first_thread_sibling(cpu

[PATCH v5 2/2] powerpc: add support for new hcall H_BEST_ENERGY

2010-10-06 Thread Vaidyanathan Srinivasan
Create sysfs interface to export data from H_BEST_ENERGY hcall
that can be used by administrative tools on supported pseries
platforms for energy management optimizations.

/sys/device/system/cpu/pseries_(de)activate_hint_list and
/sys/device/system/cpu/cpuN/pseries_(de)activate_hint will provide
hints for activation and deactivation of cpus respectively.

These hints are abstract numbers given by the hypervisor based
on the extended knowledge the hypervisor has regarding the
current system topology and resource mappings.

The activate and the deactivate sysfs entries are for the two
distinct operations that we could do for energy savings.  When
we have more capacity than required, we could deactivate a few
cores to save energy.  The choice of the core to deactivate
will be based on /sys/devices/system/cpu/deactivate_hint_list.
The comma separated list of cpus (cores) will be the preferred
choice.  If we have to activate some of the deactivated cores,
then /sys/devices/system/cpu/activate_hint_list will be used.

The per-cpu file
/sys/device/system/cpu/cpuN/pseries_(de)activate_hint further
provides more fine-grained information by exporting the value of
the hint itself.
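
Purely as an illustration of how a userspace energy manager might consume
these files (not part of the patch; the helper below is hypothetical):

/* Hypothetical userspace helper: read one of the hint sysfs files. */
#include <stdio.h>
#include <string.h>

static int read_hint_file(const char *path, char *buf, int len)
{
	FILE *f = fopen(path, "r");

	if (!f)
		return -1;
	if (!fgets(buf, len, f)) {
		fclose(f);
		return -1;
	}
	fclose(f);
	buf[strcspn(buf, "\n")] = '\0';	/* strip trailing newline */
	return 0;
}

/*
 * Example:
 *   char list[256];
 *   if (!read_hint_file("/sys/devices/system/cpu/pseries_deactivate_hint_list",
 *                       list, sizeof(list)))
 *       printf("preferred cores to deactivate: %s\n", list);
 */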

Added new driver module
arch/powerpc/platforms/pseries/pseries_energy.c
under new config option CONFIG_PSERIES_ENERGY

Signed-off-by: Vaidyanathan Srinivasan sva...@linux.vnet.ibm.com
---
 arch/powerpc/include/asm/hvcall.h   |3 
 arch/powerpc/platforms/pseries/Kconfig  |   10 +
 arch/powerpc/platforms/pseries/Makefile |1 
 arch/powerpc/platforms/pseries/pseries_energy.c |  326 +++
 4 files changed, 339 insertions(+), 1 deletions(-)
 create mode 100644 arch/powerpc/platforms/pseries/pseries_energy.c

diff --git a/arch/powerpc/include/asm/hvcall.h 
b/arch/powerpc/include/asm/hvcall.h
index de03ca5..bf86b03 100644
--- a/arch/powerpc/include/asm/hvcall.h
+++ b/arch/powerpc/include/asm/hvcall.h
@@ -232,7 +232,8 @@
 #define H_GET_EM_PARMS 0x2B8
 #define H_SET_MPP  0x2D0
 #define H_GET_MPP  0x2D4
-#define MAX_HCALL_OPCODE   H_GET_MPP
+#define H_BEST_ENERGY  0x2F4
+#define MAX_HCALL_OPCODE   H_BEST_ENERGY
 
 #ifndef __ASSEMBLY__
 
diff --git a/arch/powerpc/platforms/pseries/Kconfig 
b/arch/powerpc/platforms/pseries/Kconfig
index c667f0f..8323622 100644
--- a/arch/powerpc/platforms/pseries/Kconfig
+++ b/arch/powerpc/platforms/pseries/Kconfig
@@ -33,6 +33,16 @@ config PSERIES_MSI
 depends on PCI_MSI && EEH
default y
 
+config PSERIES_ENERGY
+   tristate "pSeries energy management capabilities driver"
+   depends on PPC_PSERIES
+   default y
+   help
+ Provides interface to platform energy management capabilities
+ on supported PSERIES platforms.
+ Provides: /sys/devices/system/cpu/pseries_(de)activation_hint_list
+ and /sys/devices/system/cpu/cpuN/pseries_(de)activation_hint
+
 config SCANLOG
 tristate "Scanlog dump interface"
 depends on RTAS_PROC && PPC_PSERIES
diff --git a/arch/powerpc/platforms/pseries/Makefile 
b/arch/powerpc/platforms/pseries/Makefile
index 046ace9..50a7ea3 100644
--- a/arch/powerpc/platforms/pseries/Makefile
+++ b/arch/powerpc/platforms/pseries/Makefile
@@ -16,6 +16,7 @@ obj-$(CONFIG_EEH) += eeh.o eeh_cache.o eeh_driver.o eeh_event.o eeh_sysfs.o
 obj-$(CONFIG_KEXEC)+= kexec.o
 obj-$(CONFIG_PCI)  += pci.o pci_dlpar.o
 obj-$(CONFIG_PSERIES_MSI)  += msi.o
+obj-$(CONFIG_PSERIES_ENERGY)   += pseries_energy.o
 
 obj-$(CONFIG_HOTPLUG_CPU)  += hotplug-cpu.o
 obj-$(CONFIG_MEMORY_HOTPLUG)   += hotplug-memory.o
diff --git a/arch/powerpc/platforms/pseries/pseries_energy.c 
b/arch/powerpc/platforms/pseries/pseries_energy.c
new file mode 100644
index 000..c8b3c69
--- /dev/null
+++ b/arch/powerpc/platforms/pseries/pseries_energy.c
@@ -0,0 +1,326 @@
+/*
+ * POWER platform energy management driver
+ * Copyright (C) 2010 IBM Corporation
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * version 2 as published by the Free Software Foundation.
+ *
+ * This pseries platform device driver provides access to
+ * platform energy management capabilities.
+ */
+
+#include <linux/module.h>
+#include <linux/types.h>
+#include <linux/errno.h>
+#include <linux/init.h>
+#include <linux/seq_file.h>
+#include <linux/sysdev.h>
+#include <linux/cpu.h>
+#include <linux/of.h>
+#include <asm/cputhreads.h>
+#include <asm/page.h>
+#include <asm/hvcall.h>
+
+
+#define MODULE_VERS "1.0"
+#define MODULE_NAME "pseries_energy"
+
+/* Driver flags */
+
+static int sysfs_entries;
+
+/* Helper routines */
+
+/*
+ * Routine to detect firmware support for hcall
+ * return 1 if H_BEST_ENERGY is supported
+ * else return 0

Re: [RFC PATCH 4/5] cpuidle/ppc: CPU goes tickless if there are no arch-specific constraints

2013-07-28 Thread Vaidyanathan Srinivasan
* Benjamin Herrenschmidt b...@kernel.crashing.org [2013-07-27 16:30:05]:

 On Fri, 2013-07-26 at 08:09 +0530, Preeti U Murthy wrote:
  *The lapic of a broadcast CPU is active always*. Say CPUX, wants the
  broadcast CPU to wake it up at timeX.  Since we cannot program the lapic
  of a remote CPU, CPUX will need to send an IPI to the broadcast CPU,
  asking it to program its lapic to fire at timeX so as to wake up CPUX.
  *With multiple CPUs the overhead of sending IPI, could result in
  performance bottlenecks and may not scale well.*
  
  Hence the workaround is that the broadcast CPU on each of its timer
  interrupt checks if any of the next timer event of a CPU in deep idle
  state has expired, which can very well be found from dev-next_event of
  that CPU. For example the timeX that has been mentioned above has
  expired. If so the broadcast handler is called to send an IPI to the
  idling CPU to wake it up.
  
  *If the broadcast CPU, is in tickless idle, its timer interrupt could be
  many ticks away. It could miss waking up a CPU in deep idle*, if its
  wakeup is much before this timer interrupt of the broadcast CPU. But
  without tickless idle, atleast at each period we are assured of a timer
  interrupt. At which time broadcast handling is done as stated in the
  previous paragraph and we will not miss wakeup of CPUs in deep idle states.
 
 But that means a great loss of power saving on the broadcast CPU when the 
 machine
 is basically completely idle. We might be able to come up with some thing 
 better.

Hi Ben,

Yes, we will need to improve on this case in stages.  In the current
design, we will have to hold one of the CPUs in a shallow idle state
(nap) to wake up other deep idle cpus.  The cost of keeping the
periodic tick ON on the broadcast CPU is minimal (but not zero) since
we would not allow that CPU to enter any deep idle states even if
there were no periodic timers queued.
 
 (Note : I do no know the timer offload code if it exists already, I'm 
 describing
 how things could happen out of the blue without any knowledge of 
 pre-existing
 framework here)
 
 We can know when the broadcast CPU expects to wake up next. When a CPU goes to
 a deep sleep state, it can then
 
  - Indicate to the broadcast CPU when it intends to be woken up by queuing
 itself into an ordered queue (ordered by target wakeup time). (OPTIMISATION:
 Play with the locality of that: have one queue (and one broadcast CPU) per
 chip or per node instead of a global one to limit cache bouncing).
 
  - Check if that happens before the broadcast CPU intended wake time (we
 need statistics to see how often that happens), and in that case send an IPI
 to wake it up now. When the broadcast CPU goes to sleep, it limits its sleep
 time to the min of it's intended sleep time and the new sleeper time.
 (OPTIMISATION: Dynamically chose a broadcast CPU based on closest expiry ?)

This will be an improvement, and the idea we have is to have
a hierarchical method of finding a waking CPU within core/socket/node
in order to find a better fit, and ultimately send an IPI to wake up
a broadcast CPU only if there is no other fit.  This condition would
imply that more CPUs are in deep idle state and the cost of sending an
IPI to reprogram the broadcast cpu's local timer may well pay off.
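
To make the queue-based variant concrete, a minimal sketch (purely
illustrative; all names below are hypothetical) of a per-node ordered list
of sleepers could look like this:

#include <linux/types.h>
#include <linux/list.h>
#include <linux/spinlock.h>

/* Hypothetical: one entry per CPU that entered a deep idle state. */
struct sleeper {
	struct list_head node;
	int cpu;
	u64 wake_at;		/* absolute target wakeup time */
};

static LIST_HEAD(sleeper_list);
static DEFINE_RAW_SPINLOCK(sleeper_lock);

/* Called by a CPU just before it enters a deep idle state. */
static void sleeper_enqueue(struct sleeper *s)
{
	struct sleeper *pos;
	unsigned long flags;

	raw_spin_lock_irqsave(&sleeper_lock, flags);
	/* Keep the list sorted by wakeup time; the waker only looks at
	 * the head to decide how long it may itself sleep. */
	list_for_each_entry(pos, &sleeper_list, node)
		if (pos->wake_at > s->wake_at)
			break;
	list_add_tail(&s->node, &pos->node);
	raw_spin_unlock_irqrestore(&sleeper_lock, flags);
}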

  - We can probably limit spurrious wakeups a *LOT* by aligning that target 
 time
 to a global jiffy boundary, meaning that several CPUs going to idle are likely
 to be choosing the same. Or maybe better, an adaptative alignment by 
 essentially
 getting more coarse grained as we go further in the future
 
  - When the broadcast CPU goes to sleep, it can play the same game of 
 alignment.

CPUs entering deep idle state would need to wake up only at a jiffy
boundary or the jiffy boundary earlier than the target wakeup time.
Your point is: can the broadcast cpu wake up the sleeping CPU *around*
the designated wakeup time (earlier) so as to avoid reprogramming its
timer?
 
 I don't like the concept of a dedicated broadcast CPU however. I'd rather 
 have a
 general queue (or per node) of sleepers needing a wakeup and more/less 
 dynamically
 pick a waker to be the last man standing, but it does make things a bit more
 tricky with tickless scheduler (non-idle).
 
 Still, I wonder if we could just have some algorithm to actually pick wakers
 more dynamically based on who ever has the closest next wakeup planned,
 that sort of thing. A fixed broadcaster will create an imbalance in
 power/thermal within the chip in addition to needing to be moved around on
 hotplug etc...

Right Ben.  The hierarchical way of selecting the waker will help us
have multiple wakers in different sockets/cores.  The broadcast
framework allows us to decouple the cpu going to idle and the waker to
be selected independently.  This patch series is the start where we
pick one and allow it to move around.  The ideal goal to achieve would
be that we can have multiple wakers serving wakeup requests from
a 

Re: [RFC PATCH 4/5] cpuidle/ppc: CPU goes tickless if there are no arch-specific constraints

2013-07-28 Thread Vaidyanathan Srinivasan
* Preeti U Murthy pre...@linux.vnet.ibm.com [2013-07-27 13:20:37]:

 Hi Ben,
 
 On 07/27/2013 12:00 PM, Benjamin Herrenschmidt wrote:
  On Fri, 2013-07-26 at 08:09 +0530, Preeti U Murthy wrote:
  *The lapic of a broadcast CPU is active always*. Say CPUX, wants the
  broadcast CPU to wake it up at timeX.  Since we cannot program the lapic
  of a remote CPU, CPUX will need to send an IPI to the broadcast CPU,
  asking it to program its lapic to fire at timeX so as to wake up CPUX.
  *With multiple CPUs the overhead of sending IPI, could result in
  performance bottlenecks and may not scale well.*
 
  Hence the workaround is that the broadcast CPU on each of its timer
  interrupt checks if any of the next timer event of a CPU in deep idle
  state has expired, which can very well be found from dev-next_event of
  that CPU. For example the timeX that has been mentioned above has
  expired. If so the broadcast handler is called to send an IPI to the
  idling CPU to wake it up.
 
  *If the broadcast CPU, is in tickless idle, its timer interrupt could be
  many ticks away. It could miss waking up a CPU in deep idle*, if its
  wakeup is much before this timer interrupt of the broadcast CPU. But
  without tickless idle, atleast at each period we are assured of a timer
  interrupt. At which time broadcast handling is done as stated in the
  previous paragraph and we will not miss wakeup of CPUs in deep idle states.
  
  But that means a great loss of power saving on the broadcast CPU when the 
  machine
  is basically completely idle. We might be able to come up with some thing 
  better.
  
  (Note : I do no know the timer offload code if it exists already, I'm 
  describing
  how things could happen out of the blue without any knowledge of 
  pre-existing
  framework here)
  
  We can know when the broadcast CPU expects to wake up next. When a CPU goes 
  to
  a deep sleep state, it can then
  
   - Indicate to the broadcast CPU when it intends to be woken up by queuing
  itself into an ordered queue (ordered by target wakeup time). (OPTIMISATION:
  Play with the locality of that: have one queue (and one broadcast CPU) per
  chip or per node instead of a global one to limit cache bouncing).
  
   - Check if that happens before the broadcast CPU intended wake time (we
  need statistics to see how often that happens), and in that case send an IPI
  to wake it up now. When the broadcast CPU goes to sleep, it limits its sleep
  time to the min of it's intended sleep time and the new sleeper time.
  (OPTIMISATION: Dynamically chose a broadcast CPU based on closest expiry ?)
  
   - We can probably limit spurrious wakeups a *LOT* by aligning that target 
  time
  to a global jiffy boundary, meaning that several CPUs going to idle are 
  likely
  to be choosing the same. Or maybe better, an adaptative alignment by 
  essentially
  getting more coarse grained as we go further in the future
  
   - When the broadcast CPU goes to sleep, it can play the same game of 
  alignment.
  
  I don't like the concept of a dedicated broadcast CPU however. I'd rather 
  have a
  general queue (or per node) of sleepers needing a wakeup and more/less 
  dynamically
  pick a waker to be the last man standing, but it does make things a bit more
  tricky with tickless scheduler (non-idle).
  
  Still, I wonder if we could just have some algorithm to actually pick wakers
  more dynamically based on who ever has the closest next wakeup planned,
  that sort of thing. A fixed broadcaster will create an imbalance in
  power/thermal within the chip in addition to needing to be moved around on
  hotplug etc...
 
 Thank you for having listed out the above suggestions. Below, I will
 bring out some ideas about how the concerns that you have raised can be
 addressed in the increasing order of priority.
 
 - To begin with, I think we can have the following model to have the
 responsibility of the broadcast CPU float around certain CPUs. i.e. Not
 have a dedicated broadcast CPU. I will refer to the broadcast CPU as the
 bc_cpu henceforth for convenience.
 
 1. The first CPU that intends to enter deep sleep state will be the bc_cpu.
 
 2. Every other CPU that intends to enter deep idle state will enter
 themselves into a mask, say the bc_mask, which is already being done
 today, after they check that a bc_cpu has been assigned.
 
 3. The bc_cpu should not enter tickless idle, until step 5a holds true.
 
 4. So on every timer interrupt, which is at-least every period, it
 checks the bc_mask to see if any CPUs need to be woken up.
 
 5. The bc cpu should not enter tickless idle *until* it is de-nominated
 as the bc_cpu. The de-nomination occurs when:
   a. In one of its timer interrupts, it does broadcast handling to find
 out that there are no CPUs to be woken up.
 
 6. So if 5a holds, then there is no bc_cpu anymore until a CPU decides
 to enter deep idle state again, in which case steps 1 to 5 repeat.
 
 
 - We could optimize this further, to allow 

[PATCH] powerpc: default arch idle could cede processor on pseries

2013-09-04 Thread Vaidyanathan Srinivasan
Hi,

Idle routines on pseries were rearranged so that cpuidle can do
an optimized idle state selection.  However, until cpuidle takes
over during boot, the idle loop spins for a short while.  This
actually affected bootup time since spinning idle sibling threads
slow down the master cpu that executes bootup code.

The following patch enables pseries system to yield to hypervisor and
stop spinning by calling cede_processor() until cpuidle can take over
and do optimal idle state selection.

Bootup time can be reduced to half on a small guest where most of the
time is spent before device init.

--Vaidy


powerpc: default arch idle should cede processor on pseries

On IBM POWER platform (pseries), use cede_processor()
in arch idle to save cycles until cpuidle takes over.

This helps speedup boot by having SMT threads stop and yield to
master thread rather than spinning until cpuidle in initialized.

Reported-by: Paul Mackerras pau...@samba.org
Signed-off-by: Vaidyanathan Srinivasan sva...@linux.vnet.ibm.com
Acked-by: Deepthi Dharwar deep...@linux.vnet.ibm.com

diff --git a/arch/powerpc/platforms/pseries/setup.c 
b/arch/powerpc/platforms/pseries/setup.c
index c11c823..fd4f995 100644
--- a/arch/powerpc/platforms/pseries/setup.c
+++ b/arch/powerpc/platforms/pseries/setup.c
@@ -362,10 +362,22 @@ static void pSeries_idle(void)
if (cpuidle_idle_call()) {
/* On error, execute default handler
 * to go into low thread priority and possibly
-* low power mode.
+* low power mode by ceding processor to hypervisor
 */
-   HMT_low();
-   HMT_very_low();
+
+   /* Indicate to hypervisor that we are idle. */
+   get_lppaca()->idle = 1;
+
+   /*
+* Yield the processor to the hypervisor.  We return if
+* an external interrupt occurs (which are driven prior
+* to returning here) or if a prod occurs from another
+* processor. When returning here, external interrupts
+* are enabled.
+*/
+   cede_processor();
+
+   get_lppaca()->idle = 0;
}
 }
 

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


[PATCH v2] powerpc: Default arch idle could cede processor on pseries

2013-09-05 Thread Vaidyanathan Srinivasan
Hi,

Idle routines on pseries were rearranged so that cpuidle can do an
optimized idle state selection.  However, until cpuidle takes over
during boot, the idle loop spins for a short while.  This actually
affected bootup time since spinning idle sibling threads slow down
the master cpu that executes bootup code.

The following patch enables pseries system to yield to hypervisor and
stop spinning by calling cede_processor() until cpuidle can take over
and do optimal idle state selection.

Bootup time can be reduced to half on a small guest where most of the
time is spent before device init.

Thanks Ben for the review and suggestions.

--Vaidy

powerpc: Default arch idle could cede processor on pseries

When adding cpuidle support to pSeries, we introduced two
regressions:

  - The new cpuidle backend driver only works under hypervisors
supporting the SLPLAR option, which isn't the case of the
old POWER4 hypervisor and the HV light used on js2x blades

  - The cpuidle driver registers fairly late, meaning that for
a significant portion of the boot process, we end up having
all threads spinning. This slows down the boot process and
increases the overall resource usage if the hypervisor has
shared processors.

This fixes it by implementing a default idle that will cede
to the hypervisor when possible, in a very simple way without
all the bells and whistles of cpuidle.

Reported-by: Paul Mackerras pau...@samba.org
Signed-off-by: Vaidyanathan Srinivasan sva...@linux.vnet.ibm.com
Acked-by: Deepthi Dharwar deep...@linux.vnet.ibm.com
Signed-off-by: Benjamin Herrenschmidt b...@kernel.crashing.org

diff --git a/arch/powerpc/platforms/pseries/setup.c 
b/arch/powerpc/platforms/pseries/setup.c
index c11c823..54b998f 100644
--- a/arch/powerpc/platforms/pseries/setup.c
+++ b/arch/powerpc/platforms/pseries/setup.c
@@ -354,7 +354,7 @@ static int alloc_dispatch_log_kmem_cache(void)
 }
 early_initcall(alloc_dispatch_log_kmem_cache);
 
-static void pSeries_idle(void)
+static void pseries_lpar_idle(void)
 {
/* This would call on the cpuidle framework, and the back-end pseries
 * driver to  go to idle states
@@ -362,10 +362,22 @@ static void pSeries_idle(void)
if (cpuidle_idle_call()) {
/* On error, execute default handler
 * to go into low thread priority and possibly
-* low power mode.
+* low power mode by ceding processor to hypervisor
 */
-   HMT_low();
-   HMT_very_low();
+
+   /* Indicate to hypervisor that we are idle. */
+   get_lppaca()->idle = 1;
+
+   /*
+* Yield the processor to the hypervisor.  We return if
+* an external interrupt occurs (which are driven prior
+* to returning here) or if a prod occurs from another
+* processor. When returning here, external interrupts
+* are enabled.
+*/
+   cede_processor();
+
+   get_lppaca()->idle = 0;
}
 }
 
@@ -456,15 +468,14 @@ static void __init pSeries_setup_arch(void)
 
pSeries_nvram_init();
 
-   if (firmware_has_feature(FW_FEATURE_SPLPAR)) {
+   if (firmware_has_feature(FW_FEATURE_LPAR)) {
vpa_init(boot_cpuid);
-   ppc_md.power_save = pSeries_idle;
-   }
-
-   if (firmware_has_feature(FW_FEATURE_LPAR))
+   ppc_md.power_save = pseries_lpar_idle;
ppc_md.enable_pmcs = pseries_lpar_enable_pmcs;
-   else
+   } else {
+   /* No special idle routine */
ppc_md.enable_pmcs = power4_enable_pmcs;
+   }
 
ppc_md.pcibios_root_bridge_prepare = pseries_root_bridge_prepare;
 

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


Re: Kernel crash caused by cpufreq

2014-07-28 Thread Vaidyanathan Srinivasan
* Michael Ellerman m...@ellerman.id.au [2014-07-28 17:03:10]:

 On Fri, 2014-07-25 at 11:07 +1000, Gavin Shan wrote:
  I'm tracing one LSI interrupt issue on a P8 box, and eventually ran into the
  following kernel crash. Not sure if there is a fix for this already? :-)
 
 Vaidy wrote that I'm pretty sure (on CC).

Yes, I did :)
 
  Starting Linux PPC64 #401 SMP Fri Jul 25 10:52:28 EST 2014
  -
  ppc64_pft_size= 0x0
  physicalMemorySize= 0x8
  htab_address  = 0xc007fe00
  htab_hash_mask= 0x3
  -
   - setup_system()
  Initializing cgroup subsys cpuset
  Initializing cgroup subsys cpuacct
  Linux version 3.16.0-rc6-00076-g4226dbe-dirty (shangw@shangw) (gcc version 
  4.5.2 (crosstool-NG 1.19.0) ) #401 SMP Fri Jul 25 10:52:28 EST 2014
  :
   Unrelated log stripped 
 
 Are you sure there's nothing else in the log that might be related? See the
 messages in init_powernv_pstates() for example.

Most likely the PMSR register is showing out-of-bound values because of
a potential firmware (OPAL) issue.

  [ cut here ]
  kernel BUG at drivers/cpufreq/powernv-cpufreq.c:134!
  cpu 0x1: Vector: 700 (Program Check) at [c007f8483370]
  pc: c096f32c: .pstate_id_to_freq+0x2c/0x50
  lr: c096f37c: .powernv_read_cpu_freq+0x2c/0x50
  sp: c007f84835f0
 msr: 90029032
current = 0xc007f840
paca= 0xcff00400   softe: 0irq_happened: 0x01
  pid   = 1, comm = swapper/0
  kernel BUG at drivers/cpufreq/powernv-cpufreq.c:134!
  enter ? for help
  [link register   ] c096f37c .powernv_read_cpu_freq+0x2c/0x50
  [c007f84835f0] c1076070 key_type_dns_resolver+0xef20/0x40d20 
  (unreliable)
  [c007f8483670] c010812c .generic_exec_single+0x8c/0x270
  [c007f8483730] c01083d0 .smp_call_function_single+0x90/0xb0
  [c007f84837b0] c0108cec .smp_call_function_any+0x15c/0x200
  [c007f8483860] c096f27c .powernv_cpufreq_get+0x3c/0x60
  [c007f84838f0] c0969dc4 .__cpufreq_add_dev.clone.11+0x574/0xa20
  [c007f84839e0] c05f7a4c .subsys_interface_register+0xec/0x130
  [c007f8483a90] c0967af8 .cpufreq_register_driver+0x168/0x2d0
  [c007f8483b30] c0f774cc .powernv_cpufreq_init+0x210/0x244
  [c007f8483be0] c000bc08 .do_one_initcall+0xc8/0x240
  [c007f8483ce0] c0f44054 .kernel_init_freeable+0x268/0x33c
  [c007f8483db0] c000c4dc .kernel_init+0x1c/0x110
  [c007f8483e30] c000a428 .ret_from_kernel_thread+0x58/0xb0
  1:mon e
  cpu 0x1: Vector: 700 (Program Check) at [c007f8483370]
  pc: c096f32c: .pstate_id_to_freq+0x2c/0x50
  lr: c096f37c: .powernv_read_cpu_freq+0x2c/0x50
  sp: c007f84835f0
 msr: 90029032
current = 0xc007f840
paca= 0xcff00400   softe: 0irq_happened: 0x01
  pid   = 1, comm = swapper/0
  kernel BUG at drivers/cpufreq/powernv-cpufreq.c:134!
  1:mon r
  R00 = 0042   R16 = 
  R01 = c007f84835f0   R17 = 
  R02 = c116c430   R18 = 
  R03 = ffbe   R19 = 0001
  R04 =    R20 = c0078bd61e58
  R05 = c114ca78   R21 = c1008910
  R06 = c007f84838d0   R22 = c11c1b74
  R07 = 0001   R23 = c2200030
  R08 = c11c1910   R24 = 0001
  R09 = c1f2970c   R25 = c1008730
  R10 = 0001   R26 = c1f29478
  R11 = 0042   R27 = c007f84838d0
  R12 = 44002084   R28 = c114ca78
  R13 = cff00400   R29 = 
  R14 = c000c4c0   R30 = c10a90f0
  R15 =    R31 = c007f84838d0
  pc  = c096f32c .pstate_id_to_freq+0x2c/0x50
  cfar= c096f324 .pstate_id_to_freq+0x24/0x50
  lr  = c096f37c .powernv_read_cpu_freq+0x2c/0x50
  msr = 90029032   cr  = 44002082
  ctr = c096f350   xer = 2000   trap =  700
  1:mon
 
 Gavin, in future for a dump like this it's very helpful to see the actual code
 that hit the bug. You can get that with:
 
 1:mon di $.pstate_id_to_freq
 
 
 Vaidy, judging by r3 it looks like i became negative. That would obviously
 happen if powernv_pstate_info.max was zero?

yes, negative is ok.  Something has gone wrong with the PState
firmware/hardware.  A BUG_ON() is too severe for this error.  I will
change the code to not stop the system for this error and also investigate
what is happening at runtime.

--Vaidy

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

[PATCH] powerpc/cpuidle: Fix parsing of idle state flags from device-tree

2014-08-03 Thread Vaidyanathan Srinivasan
Flags from the device-tree need to be parsed with accessors for
interpreting the correct value on little-endian.

Signed-off-by: Vaidyanathan Srinivasan sva...@linux.vnet.ibm.com
---
 drivers/cpuidle/cpuidle-powernv.c |   16 
 1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/drivers/cpuidle/cpuidle-powernv.c 
b/drivers/cpuidle/cpuidle-powernv.c
index 74f5788..a64be57 100644
--- a/drivers/cpuidle/cpuidle-powernv.c
+++ b/drivers/cpuidle/cpuidle-powernv.c
@@ -160,10 +160,10 @@ static int powernv_cpuidle_driver_init(void)
 static int powernv_add_idle_states(void)
 {
struct device_node *power_mgt;
-   struct property *prop;
int nr_idle_states = 1; /* Snooze */
int dt_idle_states;
-   u32 *flags;
+   const __be32 *idle_state_flags;
+   u32 len_flags, flags;
int i;
 
/* Currently we have snooze statically defined */
@@ -174,18 +174,18 @@ static int powernv_add_idle_states(void)
return nr_idle_states;
}
 
-   prop = of_find_property(power_mgt, "ibm,cpu-idle-state-flags", NULL);
-   if (!prop) {
+   idle_state_flags = of_get_property(power_mgt, "ibm,cpu-idle-state-flags", &len_flags);
+   if (!idle_state_flags) {
 pr_warn("DT-PowerMgmt: missing ibm,cpu-idle-state-flags\n");
return nr_idle_states;
}
 
-   dt_idle_states = prop-length / sizeof(u32);
-   flags = (u32 *) prop-value;
+   dt_idle_states = len_flags / sizeof(u32);
 
for (i = 0; i  dt_idle_states; i++) {
 
-   if (flags[i] & IDLE_USE_INST_NAP) {
+   flags = be32_to_cpu(idle_state_flags[i]);
+   if (flags & IDLE_USE_INST_NAP) {
 /* Add NAP state */
 strcpy(powernv_states[nr_idle_states].name, "Nap");
 strcpy(powernv_states[nr_idle_states].desc, "Nap");
@@ -196,7 +196,7 @@ static int powernv_add_idle_states(void)
nr_idle_states++;
}
 
-   if (flags[i] & IDLE_USE_INST_SLEEP) {
+   if (flags & IDLE_USE_INST_SLEEP) {
 /* Add FASTSLEEP state */
 strcpy(powernv_states[nr_idle_states].name, "FastSleep");
 strcpy(powernv_states[nr_idle_states].desc, "FastSleep");

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

[PATCH] powerpc/cpufreq: Add pr_warn() on OPAL firmware failures

2014-08-03 Thread Vaidyanathan Srinivasan
Cpufreq depends on platform firmware to implement PStates.  In case of
platform firmware failure, cpufreq should not panic the host kernel with
BUG_ON().  A less severe pr_warn() will suffice.

Add firmware_has_feature(FW_FEATURE_OPALv3) check to
skip probing for device-tree on non-powernv platforms.
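
For illustration only (not part of the patch), the index clamping added to
pstate_id_to_freq() behaves as follows, assuming for the sake of the example
max = 0, nominal = -3 and nr_pstates = 8:

/* Illustration only; the values below are example assumptions. */
static int example_pstate_index(int pstate_id)
{
	int max = 0, nominal = -3, nr_pstates = 8;
	int i = max - pstate_id;	/* e.g. pstate_id = -2  ->  i = 2 */

	if (i >= nr_pstates || i < 0)	/* e.g. pstate_id = -50 ->  i = 50 */
		i = max - nominal;	/* fall back to nominal:    i = 3 */
	return i;
}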

Signed-off-by: Vaidyanathan Srinivasan sva...@linux.vnet.ibm.com
---
 Tested on powernv and pseries platforms with v3.16-rc7 kernel.

 drivers/cpufreq/powernv-cpufreq.c |   18 --
 1 file changed, 16 insertions(+), 2 deletions(-)

diff --git a/drivers/cpufreq/powernv-cpufreq.c 
b/drivers/cpufreq/powernv-cpufreq.c
index bb1d08d..379c083 100644
--- a/drivers/cpufreq/powernv-cpufreq.c
+++ b/drivers/cpufreq/powernv-cpufreq.c
@@ -28,6 +28,7 @@
 #include linux/of.h
 
 #include asm/cputhreads.h
+#include asm/firmware.h
 #include asm/reg.h
 #include asm/smp.h /* Required for cpu_sibling_mask() in UP configs */
 
@@ -98,7 +99,11 @@ static int init_powernv_pstates(void)
return -ENODEV;
}
 
-   WARN_ON(len_ids != len_freqs);
+   if (len_ids != len_freqs) {
+		pr_warn("Entries in ibm,pstate-ids and "
+			"ibm,pstate-frequencies-mhz does not match\n");
+   }
+
nr_pstates = min(len_ids, len_freqs) / sizeof(u32);
if (!nr_pstates) {
		pr_warn("No PStates found\n");
@@ -131,7 +136,12 @@ static unsigned int pstate_id_to_freq(int pstate_id)
int i;
 
i = powernv_pstate_info.max - pstate_id;
-	BUG_ON(i >= powernv_pstate_info.nr_pstates || i < 0);
+	if (i >= powernv_pstate_info.nr_pstates || i < 0) {
+		pr_warn("PState id %d outside of PState table, "
+			"reporting nominal id %d instead\n",
+   pstate_id, powernv_pstate_info.nominal);
+   i = powernv_pstate_info.max - powernv_pstate_info.nominal;
+   }
 
return powernv_freqs[i].frequency;
 }
@@ -321,6 +331,10 @@ static int __init powernv_cpufreq_init(void)
 {
int rc = 0;
 
+   /* Don't probe on pseries (guest) platforms */
+   if (!firmware_has_feature(FW_FEATURE_OPALv3))
+   return -ENODEV;
+
/* Discover pstates from device tree and init */
rc = init_powernv_pstates();
if (rc) {


[BUG][PATCH] powerpc: fix numa distance for form0 device tree

2013-03-22 Thread Vaidyanathan Srinivasan
powerpc: fix numa distance for form0 device tree

The following commit breaks numa distance setup for old powerpc
systems that use form0 encoding in device tree.

commit 41eab6f88f24124df89e38067b3766b7bef06ddb
powerpc/numa: Use form 1 affinity to setup node distance

Device tree node /rtas/ibm,associativity-reference-points would
index into /cpus/PowerPC/ibm,associativity based on form0 or
form1 encoding detected by ibm,architecture-vec-5 property.

All modern systems use form1 and current kernel code is correct.
However, on older systems with form0 encoding, the numa distance
will get hard coded as LOCAL_DISTANCE for all nodes.  This causes
task scheduling anomaly since scheduler will skip building numa
level domain (topmost domain with all cpus) if all numa distances
are same.  (value of 'level' in sched_init_numa() will remain 0)

Prior to the above commit:
#define node_distance(from,to)
((from) == (to) ? LOCAL_DISTANCE : REMOTE_DISTANCE)

Restoring compatible behavior with this patch for old powerpc systems
with device tree where numa distance are encoded as form0.

Signed-off-by: Vaidyanathan Srinivasan sva...@linux.vnet.ibm.com

diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index bba87ca..6a252c4 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -201,7 +201,7 @@ int __node_distance(int a, int b)
int distance = LOCAL_DISTANCE;
 
if (!form1_affinity)
-   return distance;
+   return ((a == b) ? LOCAL_DISTANCE : REMOTE_DISTANCE);
 
	for (i = 0; i < distance_ref_points_depth; i++) {
if (distance_lookup_table[a][i] == distance_lookup_table[b][i])



Re: [BUG][PATCH] powerpc: fix numa distance for form0 device tree

2013-03-22 Thread Vaidyanathan Srinivasan
* Vaidyanathan Srinivasan sva...@linux.vnet.ibm.com [2013-03-22 21:19:35]:

[snip]

 Prior to the above commit:
 #define node_distance(from,to)
   ((from) == (to) ? LOCAL_DISTANCE : REMOTE_DISTANCE)
 
 Restoring compatible behavior with this patch for old powerpc systems
 with device tree where numa distance are encoded as form0.

This patch on v3.9-rc3 has been tested on multi-node POWER7 with
different device tree combinations.

numactl -H would show local distance '10' for same node and remote
distance '20' for other nodes.  This ensures NUMA level sched domain
gets built and load balancing could work across such configurations.

--Vaidy



Re: [PATCH v3 1/5] powernv: cpufreq driver for powernv platform

2014-03-21 Thread Vaidyanathan Srinivasan
* Gautham R Shenoy e...@linux.vnet.ibm.com [2014-03-21 16:13:17]:

 Hi Viresh,
 
 On Fri, Mar 21, 2014 at 02:11:32PM +0530, Viresh Kumar wrote:
  On Thu, Mar 20, 2014 at 5:40 PM, Gautham R. Shenoy
  e...@linux.vnet.ibm.com wrote:
   From: Vaidyanathan Srinivasan sva...@linux.vnet.ibm.com
  
  Hi Vaidy,
  

Hi Viresh,

Thanks for the detailed review.

   diff --git a/drivers/cpufreq/Kconfig b/drivers/cpufreq/Kconfig
   index 4b029c0..4ba1632 100644
   --- a/drivers/cpufreq/Kconfig
   +++ b/drivers/cpufreq/Kconfig
   @@ -48,6 +48,7 @@ config CPU_FREQ_STAT_DETAILS
choice
   prompt Default CPUFreq governor
   default CPU_FREQ_DEFAULT_GOV_USERSPACE if ARM_SA1100_CPUFREQ || 
   ARM_SA1110_CPUFREQ
   +   default CPU_FREQ_DEFAULT_GOV_ONDEMAND if POWERNV_CPUFREQ
  
  Probably we should remove SA1100's entry as well from here. This is
  not the right way of doing it. Imagine 100 platforms having entries here.
  If you want it, then select it from your platforms Kconfig.
 
 Sure. Will move these bits and the other governor related bits to the
 Powernv Kconfig.
 
   diff --git a/drivers/cpufreq/powernv-cpufreq.c 
   b/drivers/cpufreq/powernv-cpufreq.c
   new file mode 100644
   index 000..ab1551f
   --- /dev/null
   +
   +#define pr_fmt(fmt)	"powernv-cpufreq: " fmt
   +
   +#include <linux/module.h>
   +#include <linux/cpufreq.h>
   +#include <linux/of.h>
   +#include <asm/cputhreads.h>
  
  That's it? Sure?
  
  Even if things compile for you, you must explicitly include all the
  files on which
  you depend.
 
 Ok. 
 
  
   +
   +   WARN_ON(len_ids != len_freqs);
   +   nr_pstates = min(len_ids, len_freqs) / sizeof(u32);
   +   WARN_ON(!nr_pstates);
  
  Why do you want to continue here?
 
 Good point. We might be better off exiting at this point. 

Yes, we could return here.  We do not normally get this far if the
platform has a problem discovering PStates from the device tree, but
there is no useful debug info beyond this point if nr_pstates is 0.
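
For reference, the early exit being discussed would look roughly like
this (an illustrative sketch only, not the final patch):

	nr_pstates = min(len_ids, len_freqs) / sizeof(u32);
	if (!nr_pstates) {
		/* Bail out instead of continuing with an empty table */
		pr_warn("No PStates found in device tree\n");
		return -ENODEV;
	}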

  
   +   pr_debug("NR PStates %d\n", nr_pstates);
   +   for (i = 0; i < nr_pstates; i++) {
   +   u32 id = be32_to_cpu(pstate_ids[i]);
   +   u32 freq = be32_to_cpu(pstate_freqs[i]);
   +
   +   pr_debug("PState id %d freq %d MHz\n", id, freq);
   +   powernv_freqs[i].driver_data = i;
  
  I don't think you are using this field at all and this is the field you can
  use for driver_data and so you can get rid of powernv_pstate_ids[ ].
 
 Using driver_data to record powernv_pstate_ids won't work since
 powernv_pstate_ids can be negative. So a pstate_id -3 can be confused
 with CPUFREQ_BOOST_FREQ thereby not displaying the frequency
 corresponding to pstate id -3. So for now I think we will be retaining
 powernv_pstate_ids.

Yeah, I had the driver written using driver_data to store pstates.
Gautham found the bug that we are missing one PState when we match the
ID with CPUFREQ_BOOST_FREQ!

We did not know that you have taken care of those issues.  Ideally
I did expect that driver_data should not be touched by the framework.
Thanks for fixing that and allowing the back-end driver to use
driver_data.
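
To illustrate the aliasing: CPUFREQ_BOOST_FREQ is encoded as ~2 in the
cpufreq core of this vintage (treat that encoding as an assumption
here), which is exactly what PState id -3 becomes when stored in the
unsigned driver_data field.  A small stand-alone sketch of the
collision:

	#include <stdio.h>
	#include <stdint.h>

	int main(void)
	{
		uint32_t boost_marker = ~2U;	/* assumed CPUFREQ_BOOST_FREQ encoding */
		int pstate_id = -3;		/* a valid negative PState id */

		/* Both print 0xfffffffd: the id aliases the boost marker */
		printf("boost marker   = 0x%08x\n", boost_marker);
		printf("(u32)pstate_id = 0x%08x\n", (uint32_t)pstate_id);
		return 0;
	}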
 
  
   +   powernv_freqs[i].frequency = freq * 1000; /* kHz */
   +   powernv_pstate_ids[i] = id;
   +   }
   +   /* End of list marker entry */
   +   powernv_freqs[i].driver_data = 0;
  
  Not required.
 
 Ok.
  
   +   powernv_freqs[i].frequency = CPUFREQ_TABLE_END;
   +
   +   /* Print frequency table */
   +   for (i = 0; powernv_freqs[i].frequency != CPUFREQ_TABLE_END; i++)
   +   pr_debug("%d: %d\n", i, powernv_freqs[i].frequency);
  
  You have already printed this table earlier..
 
 Fair enough.
 
  
   +
   +   return 0;
   +}
   +
   +static struct freq_attr *powernv_cpu_freq_attr[] = {
   +   &cpufreq_freq_attr_scaling_available_freqs,
   +   NULL,
   +};
  
  Can use this instead: cpufreq_generic_attr?
 
 In this patch yes. But later patch introduces an additional attribute
 for displaying the nominal frequency. Will handle that part in a clean
 way in the next version.
 
  
   +/* Helper routines */
   +
   +/* Access helpers to power mgt SPR */
   +
   +static inline unsigned long get_pmspr(unsigned long sprn)
  
  Looks big enough not be inlined?
 
 It is called from one function. It has been defined separately for
 readability. 

Let the compiler decide.  The code is straightforward :)

I wanted to keep these SPR operations in a separate function to
facilitate debugging and also to allow introducing any timing/sequence
change here if required.  Keeping this separate is helpful; if
inlining is successful, we get a bonus :)

  
   +{
   +   switch (sprn) {
   +   case SPRN_PMCR:
   +   return mfspr(SPRN_PMCR);
   +
   +   case SPRN_PMICR:
   +   return mfspr(SPRN_PMICR);
   +
   +   case SPRN_PMSR:
   +   return mfspr(SPRN_PMSR);
   +   }
   +   BUG

Re: [PATCH v4] powernv, cpufreq: cpufreq driver for powernv platform

2014-03-27 Thread Vaidyanathan Srinivasan
* Gautham R Shenoy e...@linux.vnet.ibm.com [2014-03-27 15:00:50]:

[snip]
 
   +   u32 len_ids, len_freqs;
   +
   +   power_mgt = of_find_node_by_path("/ibm,opal/power-mgt");
   +   if (!power_mgt) {
   +   pr_warn("power-mgt node not found\n");
   +   return -ENODEV;
   +   }
   +
   +   if (of_property_read_u32(power_mgt, "ibm,pstate-min", &pstate_min)) {
   +   pr_warn("ibm,pstate-min node not found\n");
   +   return -ENODEV;
   +   }
   +
   +   if (of_property_read_u32(power_mgt, "ibm,pstate-max", &pstate_max)) {
   +   pr_warn("ibm,pstate-max node not found\n");
   +   return -ENODEV;
   +   }
  
  Why do you need to get these from DT? And not find that yourself here 
  instead?
 
 
 Well, I think we can obtain a more accurate value from the DT which
 would have already computed it. But I'll let vaidy answer this.

DT provides the values that we should use.  Ideally these are the
upper and lower limits of the PState table,  however as per the
platform spec, we should use the PState IDs between these limits.  We
would like to parse it and keep it in the driver for debug purposes.

The platform exports this info in the DT to be more explicit and not
have a strong dependency on the PState numbering or PState table
itself.  In the driver we parse it for debug/validation purpose.

--Vaidy


[PATCH 0/3] sched: Fixes for task placement in SMT threads

2013-10-21 Thread Vaidyanathan Srinivasan
Hi,

The following series fixes scheduler loadbalancing issues where we are
missing opportunity to place tasks in SMT threads optimally especially
on a POWER7 system.

PATCH 1/3 fixes the scenario where load balancing fails to move tasks
away from the cpus in a domain which has SD_SHARE_PKG_RESOURCES set,
even in the presence of idle cpus in other domains.

PATCH 2/3 ensures lower-order SMT threads are used for the
SD_ASYM_PACKING load balancing case.

PATCH 3/3 tries to fix the problem that is explained in its
changelog. The following experiment exposes the scenario:

A task placement test was conducted on a POWER7 as well as multi-core
x86 system. At the beginning, tasks are pinned to all the threads
within a core and later only the tasks pinned to the secondary threads
in a core are unpinned. This leaves the primary thread with tasks that
are unmovable, while the rest are free to be moved.

Under the above setup, load balancing today does not move the unpinned
tasks away from the secondary threads of the core in the above
experiment although there are other idle cpus. The PATCH 3/3 fixes
this situation and improves task placement even if some of the tasks
in a core are pinned.
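
A rough user-space sketch of the pinning step in that experiment (CPU
numbers are arbitrary and error handling is trimmed):

	#define _GNU_SOURCE
	#include <sched.h>
	#include <stdio.h>

	/* Pin the calling task to a single CPU */
	static int pin_to_cpu(int cpu)
	{
		cpu_set_t mask;

		CPU_ZERO(&mask);
		CPU_SET(cpu, &mask);
		return sched_setaffinity(0, sizeof(mask), &mask);
	}

	int main(void)
	{
		/* Phase 1: pin this busy loop to a secondary SMT thread,
		 * e.g. CPU 1.  Phase 2 of the experiment would widen the
		 * mask again so that only the primary-thread tasks stay
		 * pinned. */
		if (pin_to_cpu(1))
			perror("sched_setaffinity");

		for (;;)
			;	/* burn cycles so the load balancer has work to move */
	}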

This series applies on v3.12-rc6 and tested on x86 and powerpc.

--Vaidy

---

Preeti U Murthy (2):
  sched: Fix nohz_kick_needed to consider the nr_busy of the parent 
domain's group
  sched: Aggressive balance in domains whose groups share package resources

Vaidyanathan Srinivasan (1):
  sched: Fix asymmetric scheduling for POWER7


 kernel/sched/fair.c |   41 +
 1 file changed, 33 insertions(+), 8 deletions(-)



[PATCH 1/3] sched: Fix nohz_kick_needed to consider the nr_busy of the parent domain's group

2013-10-21 Thread Vaidyanathan Srinivasan
From: Preeti U Murthy pre...@linux.vnet.ibm.com

In nohz_kick_needed() there are checks around the flags
SD_SHARE_PKG_RESOURCES which decide to initiate nohz_balance if the
domains with this flag set have more than one cpu busy. Therefore at
every domain, a check has to be made on nr_busy of that domain. This
means the sum of the nr_busy of each group in that domain needs to be
checked, since nr_busy is a parameter which is associated with
a group.  However in the current implementation of nohz_kick_needed(),
the nr_busy is being checked for just the group to which the cpu that
has initiated this check belongs to. This will give us the wrong count
of the number of busy cpus in that domain.

The following commit which fixed the sgp->nr_busy_cpus computation
actually exposed the bug in nohz_kick_needed() which worked when
nr_busy was incorrectly > 1

25f55d9d01ad7a7ad248fd5af1d22675ffd202c5
sched: Fix init NOHZ_IDLE flag

To illustrate the scenario,  consider a core, whose domain will have
the SD_SHARE_PKG_RESOURCES set. We want to kick nohz_idle_balance when
we find that more than one thread in the core is busy. With the
current implementation of nohz_kick_needed(), at this domain(sd), the
nr_busy will be 1 always since it returns this parameter for
sd->groups which encompasses a single thread, while we want this
parameter for sd->parent->groups which will rightly point to the
number of busy threads in the core.

This patch also ensures that the order of check for
SD_SHARE_PKG_RESOURCE comes before the check for ASYM_PACKING.
Priority is given to avoid more than one busy thread in a core as much
as possible before attempting asymmetric packing.

Signed-off-by: Preeti U Murthy pre...@linux.vnet.ibm.com
Signed-off-by: Vaidyanathan Srinivasan sva...@linux.vnet.ibm.com
---
 kernel/sched/fair.c |   19 +--
 1 file changed, 13 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7c70201..12f0eab 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5807,12 +5807,19 @@ static inline int nohz_kick_needed(struct rq *rq, int 
cpu)
 
rcu_read_lock();
for_each_domain(cpu, sd) {
-		struct sched_group *sg = sd->groups;
-		struct sched_group_power *sgp = sg->sgp;
-		int nr_busy = atomic_read(&sgp->nr_busy_cpus);
-
-		if (sd->flags & SD_SHARE_PKG_RESOURCES && nr_busy > 1)
-   goto need_kick_unlock;
+		struct sched_domain *sd_parent = sd->parent;
+   struct sched_group *sg;
+   struct sched_group_power *sgp;
+   int nr_busy;
+
+   if (sd_parent) {
+			sg = sd_parent->groups;
+			sgp = sg->sgp;
+			nr_busy = atomic_read(&sgp->nr_busy_cpus);
+
+			if (sd->flags & SD_SHARE_PKG_RESOURCES && nr_busy > 1)
+   goto need_kick_unlock;
+   }
 
		if (sd->flags & SD_ASYM_PACKING && nr_busy != sg->group_weight
		    && (cpumask_first_and(nohz.idle_cpus_mask,



[PATCH 2/3] sched: Fix asymmetric scheduling for POWER7

2013-10-21 Thread Vaidyanathan Srinivasan
Asymmetric scheduling within a core is a scheduler loadbalancing
feature that is triggered when SD_ASYM_PACKING flag is set.  The goal
for the load balancer is to move tasks to lower order idle SMT threads
within a core on a POWER7 system.

In nohz_kick_needed(), we intend to check if our sched domain (core)
is completely busy or we have idle cpu.

The following check for SD_ASYM_PACKING:

(cpumask_first_and(nohz.idle_cpus_mask, sched_domain_span(sd)) < cpu)

already covers the case of checking if the domain has an idle cpu,
because cpumask_first_and() will not yield any set bits if this domain
has no idle cpu.

Hence, nr_busy check against group weight can be removed.

Reported-by: Michael Neuling michael.neul...@au1.ibm.com
Signed-off-by: Vaidyanathan Srinivasan sva...@linux.vnet.ibm.com
Signed-off-by: Preeti U Murthy pre...@linux.vnet.ibm.com
---
 kernel/sched/fair.c |4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 12f0eab..828ed97 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5821,8 +5821,8 @@ static inline int nohz_kick_needed(struct rq *rq, int cpu)
goto need_kick_unlock;
}
 
-	if (sd->flags & SD_ASYM_PACKING && nr_busy != sg->group_weight
-	    && (cpumask_first_and(nohz.idle_cpus_mask,
+	if (sd->flags & SD_ASYM_PACKING &&
+	    (cpumask_first_and(nohz.idle_cpus_mask,
				  sched_domain_span(sd)) < cpu))
goto need_kick_unlock;
 



[PATCH 3/3] sched: Aggressive balance in domains whose groups share package resources

2013-10-21 Thread Vaidyanathan Srinivasan
From: Preeti U Murthy pre...@linux.vnet.ibm.com

The current logic in load balance is such that after picking the
busiest group, the load is attempted to be moved from the busiest cpu
in that group to the dst_cpu. If the load cannot be moved from the
busiest cpu to dst_cpu due to either tsk_cpus_allowed mask or cache
hot tasks, then the dst_cpu is changed to be another idle cpu within
the dst-grpmask. If even then, the load cannot be moved from the
busiest cpu, then the source group is changed. The next busiest group
is found and the above steps are repeated.

However if the cpus in the group share package resources, then when
a load movement from the busiest cpu in this group fails as above,
instead of finding the next busiest group to move load from, find the
next busiest cpu *within the same group* from which to move load away.
By doing so, a conscious effort is made during load balancing to keep
just one cpu busy as much as possible within domains that have
SHARED_PKG_RESOURCES flag set unless under scenarios of high load.
Having multiple cpus busy within a domain which share package resource
could lead to a performance hit.

A similar scenario arises in active load balancing as well. When the
current task on the busiest cpu cannot be moved away due to task
pinning, currently no more attempts at load balancing is made. This
patch checks if the balancing is being done on a group whose cpus
share package resources. If so, then check if the load balancing can
be done for other cpus in the same group.

Signed-off-by: Preeti U Murthy pre...@linux.vnet.ibm.com
Signed-off-by: Vaidyanathan Srinivasan sva...@linux.vnet.ibm.com
---
 kernel/sched/fair.c |   18 ++
 1 file changed, 18 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 828ed97..bbcd96b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5165,6 +5165,8 @@ static int load_balance(int this_cpu, struct rq *this_rq,
 {
int ld_moved, cur_ld_moved, active_balance = 0;
struct sched_group *group;
+   struct sched_domain *child;
+   int share_pkg_res = 0;
struct rq *busiest;
unsigned long flags;
struct cpumask *cpus = __get_cpu_var(load_balance_mask);
@@ -5190,6 +5192,10 @@ static int load_balance(int this_cpu, struct rq *this_rq,
 
schedstat_inc(sd, lb_count[idle]);
 
+	child = sd->child;
+	if (child && child->flags & SD_SHARE_PKG_RESOURCES)
+   share_pkg_res = 1;
+
 redo:
	if (!should_we_balance(&env)) {
*continue_balancing = 0;
@@ -5202,6 +5208,7 @@ redo:
goto out_balanced;
}
 
+redo_grp:
	busiest = find_busiest_queue(&env, group);
if (!busiest) {
schedstat_inc(sd, lb_nobusyq[idle]);
@@ -5292,6 +5299,11 @@ more_balance:
if (!cpumask_empty(cpus)) {
env.loop = 0;
env.loop_break = sched_nr_migrate_break;
+			if (share_pkg_res &&
+			    cpumask_intersects(cpus,
+				to_cpumask(group->cpumask)))
+   goto redo_grp;
+
goto redo;
}
goto out_balanced;
@@ -5318,9 +5330,15 @@ more_balance:
 */
if (!cpumask_test_cpu(this_cpu,
				tsk_cpus_allowed(busiest->curr))) {
+			cpumask_clear_cpu(cpu_of(busiest), cpus);
				raw_spin_unlock_irqrestore(&busiest->lock,
							    flags);
				env.flags |= LBF_ALL_PINNED;
+			if (share_pkg_res &&
+			    cpumask_intersects(cpus,
+				to_cpumask(group->cpumask)))
+   goto redo_grp;
+
goto out_one_pinned;
}
 



[PATCH v1 0/2] powernv: cpufreq support for IBM POWERNV platform

2014-02-10 Thread Vaidyanathan Srinivasan
Hi,

The following patch series implements the platform driver to support
dynamic CPU frequency scaling on IBM POWERNV platforms.

This patch series is based on Linux kernel 3.14-rc2 and tested on
OPAL v3 based IBM POWERNV platform and IBM POWER8 processor.

--Vaidy

---

Srivatsa S. Bhat (1):
  powernv, cpufreq: Add per-core locking to serialize frequency transitions

Vaidyanathan Srinivasan (1):
  powernv: cpufreq driver for powernv platform


 arch/powerpc/include/asm/reg.h|4 +
 drivers/cpufreq/Kconfig.powerpc   |9 +
 drivers/cpufreq/Makefile  |1 
 drivers/cpufreq/powernv-cpufreq.c |  286 +
 4 files changed, 300 insertions(+)
 create mode 100644 drivers/cpufreq/powernv-cpufreq.c

-- 



[PATCH v1 2/2] powernv, cpufreq: Add per-core locking to serialize frequency transitions

2014-02-10 Thread Vaidyanathan Srinivasan
From: Srivatsa S. Bhat srivatsa.b...@linux.vnet.ibm.com

On POWER systems, the CPU frequency is controlled at a core-level and
hence we need to serialize so that only one of the threads in the core
switches the core's frequency at a time.

Using a global mutex lock would needlessly serialize _all_ frequency
transitions in the system (across all cores). So introduce per-core
locking to enable finer-grained synchronization and thereby enhance
the speed and responsiveness of the cpufreq driver to varying workload
demands.

The design of per-core locking is very simple and straight-forward: we
first define a Per-CPU lock and use the ones that belongs to the first
thread sibling of the core.

cpu_first_thread_sibling() macro is used to find the *common* lock for
all thread siblings belonging to a core.

Signed-off-by: Srivatsa S. Bhat srivatsa.b...@linux.vnet.ibm.com
Signed-off-by: Vaidyanathan Srinivasan sva...@linux.vnet.ibm.com
---
 drivers/cpufreq/powernv-cpufreq.c |   21 -
 1 file changed, 16 insertions(+), 5 deletions(-)

diff --git a/drivers/cpufreq/powernv-cpufreq.c 
b/drivers/cpufreq/powernv-cpufreq.c
index ea3b630..8240e90 100644
--- a/drivers/cpufreq/powernv-cpufreq.c
+++ b/drivers/cpufreq/powernv-cpufreq.c
@@ -24,8 +24,15 @@
 #include linux/of.h
 #include asm/cputhreads.h
 
-/* FIXME: Make this per-core */
-static DEFINE_MUTEX(freq_switch_mutex);
+/* Per-Core locking for frequency transitions */
+static DEFINE_PER_CPU(struct mutex, freq_switch_lock);
+
+#define lock_core_freq(cpu)\
+	mutex_lock(&per_cpu(freq_switch_lock,\
+   cpu_first_thread_sibling(cpu)));
+#define unlock_core_freq(cpu)  \
+	mutex_unlock(&per_cpu(freq_switch_lock,\
+   cpu_first_thread_sibling(cpu)));
 
 #define POWERNV_MAX_PSTATES256
 
@@ -219,7 +226,7 @@ static int powernv_cpufreq_target(struct cpufreq_policy 
*policy,
freqs.new = powernv_freqs[new_index].frequency;
	freqs.cpu = policy->cpu;
 
-	mutex_lock(&freq_switch_mutex);
+	lock_core_freq(policy->cpu);
	cpufreq_notify_transition(policy, &freqs, CPUFREQ_PRECHANGE);
 
	pr_debug("setting frequency for cpu %d to %d kHz index %d pstate %d",
@@ -231,7 +238,7 @@ static int powernv_cpufreq_target(struct cpufreq_policy 
*policy,
	rc = powernv_set_freq(policy->cpus, new_index);
 
	cpufreq_notify_transition(policy, &freqs, CPUFREQ_POSTCHANGE);
-	mutex_unlock(&freq_switch_mutex);
+	unlock_core_freq(policy->cpu);
 
return rc;
 }
@@ -248,7 +255,7 @@ static struct cpufreq_driver powernv_cpufreq_driver = {
 
 static int __init powernv_cpufreq_init(void)
 {
-   int rc = 0;
+   int cpu, rc = 0;
 
/* Discover pstates from device tree and init */
 
@@ -258,6 +265,10 @@ static int __init powernv_cpufreq_init(void)
		pr_info("powernv-cpufreq disabled\n");
return rc;
}
+   /* Init per-core mutex */
+   for_each_possible_cpu(cpu) {
		mutex_init(&per_cpu(freq_switch_lock, cpu));
+   }
 
	rc = cpufreq_register_driver(&powernv_cpufreq_driver);
return rc;



[PATCH v1 1/2] powernv: cpufreq driver for powernv platform

2014-02-10 Thread Vaidyanathan Srinivasan
Backend driver to dynamically set voltage and frequency on
IBM POWER non-virtualized platforms.  Power management SPRs
are used to set the required PState.

This driver works in conjunction with cpufreq governors
like 'ondemand' to provide a demand based frequency and
voltage setting on IBM POWER non-virtualized platforms.

PState table is obtained from OPAL v3 firmware through device
tree.

powernv_cpufreq back-end driver would parse the relevant device-tree
nodes and initialise the cpufreq subsystem on powernv platform.

Signed-off-by: Vaidyanathan Srinivasan sva...@linux.vnet.ibm.com
Signed-off-by: Srivatsa S. Bhat srivatsa.b...@linux.vnet.ibm.com
Signed-off-by: Anton Blanchard an...@samba.org
---
 arch/powerpc/include/asm/reg.h|4 +
 drivers/cpufreq/Kconfig.powerpc   |9 +
 drivers/cpufreq/Makefile  |1 
 drivers/cpufreq/powernv-cpufreq.c |  275 +
 4 files changed, 289 insertions(+)
 create mode 100644 drivers/cpufreq/powernv-cpufreq.c

diff --git a/arch/powerpc/include/asm/reg.h b/arch/powerpc/include/asm/reg.h
index 90c06ec..84f92ca 100644
--- a/arch/powerpc/include/asm/reg.h
+++ b/arch/powerpc/include/asm/reg.h
@@ -271,6 +271,10 @@
 #define SPRN_HSRR1 0x13B   /* Hypervisor Save/Restore 1 */
 #define SPRN_IC0x350   /* Virtual Instruction Count */
 #define SPRN_VTB   0x351   /* Virtual Time Base */
+#define SPRN_PMICR 0x354   /* Power Management Idle Control Reg */
+#define SPRN_PMSR  0x355   /* Power Management Status Reg */
+#define SPRN_PMCR  0x374   /* Power Management Control Register */
+
 /* HFSCR and FSCR bit numbers are the same */
 #define FSCR_TAR_LG8   /* Enable Target Address Register */
 #define FSCR_EBB_LG7   /* Enable Event Based Branching */
diff --git a/drivers/cpufreq/Kconfig.powerpc b/drivers/cpufreq/Kconfig.powerpc
index ca0021a..4a91ab1 100644
--- a/drivers/cpufreq/Kconfig.powerpc
+++ b/drivers/cpufreq/Kconfig.powerpc
@@ -54,3 +54,12 @@ config PPC_PASEMI_CPUFREQ
help
  This adds the support for frequency switching on PA Semi
  PWRficient processors.
+
+config POWERNV_CPUFREQ
+	tristate "CPU frequency scaling for IBM POWERNV platform"
+   depends on PPC_POWERNV
+   select CPU_FREQ_TABLE
+   default y
+   help
+This adds support for CPU frequency switching on IBM POWERNV
+platform
diff --git a/drivers/cpufreq/Makefile b/drivers/cpufreq/Makefile
index 7494565..0dbb963 100644
--- a/drivers/cpufreq/Makefile
+++ b/drivers/cpufreq/Makefile
@@ -86,6 +86,7 @@ obj-$(CONFIG_PPC_CORENET_CPUFREQ)   += ppc-corenet-cpufreq.o
 obj-$(CONFIG_CPU_FREQ_PMAC)+= pmac32-cpufreq.o
 obj-$(CONFIG_CPU_FREQ_PMAC64)  += pmac64-cpufreq.o
 obj-$(CONFIG_PPC_PASEMI_CPUFREQ)   += pasemi-cpufreq.o
+obj-$(CONFIG_POWERNV_CPUFREQ)  += powernv-cpufreq.o
 
 
##
 # Other platform drivers
diff --git a/drivers/cpufreq/powernv-cpufreq.c 
b/drivers/cpufreq/powernv-cpufreq.c
new file mode 100644
index 000..ea3b630
--- /dev/null
+++ b/drivers/cpufreq/powernv-cpufreq.c
@@ -0,0 +1,275 @@
+/*
+ * POWERNV cpufreq driver for the IBM POWER processors
+ *
+ * (C) Copyright IBM 2014
+ *
+ * Author: Vaidyanathan Srinivasan sva...@linux.vnet.ibm.com
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2, or (at your option)
+ * any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ */
+
+#define pr_fmt(fmt)	"powernv-cpufreq: " fmt
+
+#include <linux/module.h>
+#include <linux/cpufreq.h>
+#include <linux/of.h>
+#include <asm/cputhreads.h>
+
+/* FIXME: Make this per-core */
+static DEFINE_MUTEX(freq_switch_mutex);
+
+#define POWERNV_MAX_PSTATES256
+
+static struct cpufreq_frequency_table powernv_freqs[POWERNV_MAX_PSTATES+1];
+
+/*
+ * Initialize the freq table based on data obtained
+ * from the firmware passed via device-tree
+ */
+
+static int init_powernv_pstates(void)
+{
+   struct device_node *power_mgt;
+   int nr_pstates = 0;
+   int pstate_min, pstate_max, pstate_nominal;
+   const __be32 *pstate_ids, *pstate_freqs;
+   int i;
+   u32 len_ids, len_freqs;
+
+	power_mgt = of_find_node_by_path("/ibm,opal/power-mgt");
+	if (!power_mgt) {
+		pr_warn("power-mgt node not found\n");
+		return -ENODEV;
+	}
+
+	if (of_property_read_u32(power_mgt, "ibm,pstate-min", &pstate_min)) {
+		pr_warn("ibm,pstate-min node not found\n");
+   return -ENODEV;
+   }
+
+   if (of_property_read_u32(power_mgt, ibm,pstate

Re: [PATCH] powerpc: fix VMX fix for memcpy case

2012-10-06 Thread Vaidyanathan Srinivasan
* Nishanth Aravamudan n...@linux.vnet.ibm.com [2012-10-01 17:59:13]:

 [urgh, sorry Anton, Ben  Paul, inadvertently hit send before adding
 linuxppc-dev to the cc!]
 
 Hi Anton,
 
 In 2fae7cdb60240e2e2d9b378afbf6d9fcce8a3890 (powerpc: Fix VMX in
 interrupt check in POWER7 copy loops), I think you inadvertently
 introduced a regression for memcpy on POWER7 machines. copyuer and
 memcpy diverge slightly in their use of cr1 (copyuser doesn't use it,
 but memcpy does) and you end up clobbering that register with your fix.
 That results in (taken from an FC18 kernel):
 
 [   18.824604] Unrecoverable VMX/Altivec Unavailable Exception f20 at 
 c0052f40
 [   18.824618] Oops: Unrecoverable VMX/Altivec Unavailable Exception, sig: 6 
 [#1]
 [   18.824623] SMP NR_CPUS=1024 NUMA pSeries
 [   18.824633] Modules linked in: tg3(+) be2net(+) cxgb4(+) ipr(+) sunrpc xts 
 lrw gf128mul dm_crypt dm_round_robin dm_multipath linear raid10 raid456 
 async_raid6_recov async_memcpy async_pq raid6_pq async_xor xor async_tx raid1 
 raid0 scsi_dh_rdac scsi_dh_hp_sw scsi_dh_emc scsi_dh_alua squashfs cramfs
 [   18.824705] NIP: c0052f40 LR: c020b874 CTR: 
 0512
 [   18.824709] REGS: c01f1fef7790 TRAP: 0f20   Not tainted  
 (3.6.0-0.rc6.git0.2.fc18.ppc64)
 [   18.824713] MSR: 80009032 SF,EE,ME,IR,DR,RI  CR: 4802802e  XER: 
 2010
 [   18.824726] SOFTE: 0
 [   18.824728] CFAR: 0f20
 [   18.824731] TASK = c00fa7128400[0] 'swapper/24' THREAD: 
 c00fa748 CPU: 24
 GPR00: ffc0 c01f1fef7a10 c164edc0 c00f9b9a8120
 GPR04: c00f9b9a8124 1438 0060 03ff064657ee
 GPR08: 8000 0010 0020 0030
 GPR12: 28028022 cff25400 0001 
 GPR16:  7fff c16b2180 c156a500
 GPR20: c00f968c7a90 c000131c31d8 c01f1fef4000 c1561d00
 GPR24: 000a  0001 0012
 GPR28: c00fa5c04f80 08bc c15c0a28 022e
 [   18.824792] NIP [c0052f40] .memcpy_power7+0x5a0/0x7c4
 [   18.824797] LR [c020b874] .pcpu_free_area+0x174/0x2d0
 [   18.824800] Call Trace:
 [   18.824803] [c01f1fef7a10] [c0052c14] 
 .memcpy_power7+0x274/0x7c4 (unreliable)
 [   18.824809] [c01f1fef7b10] [c020b874] 
 .pcpu_free_area+0x174/0x2d0
 [   18.824813] [c01f1fef7bb0] [c020ba88] .free_percpu+0xb8/0x1b0
 [   18.824819] [c01f1fef7c50] [c043d144] .throtl_pd_exit+0x94/0xd0
 [   18.824824] [c01f1fef7cf0] [c043acf8] .blkg_free+0x88/0xe0
 [   18.824829] [c01f1fef7d90] [c018c048] 
 .rcu_process_callbacks+0x2e8/0x8a0
 [   18.824835] [c01f1fef7e90] [c00a8ce8] .__do_softirq+0x158/0x4d0
 [   18.824840] [c01f1fef7f90] [c0025ecc] 
 .call_do_softirq+0x14/0x24
 [   18.824845] [c00fa7483650] [c0010e80] .do_softirq+0x160/0x1a0
 [   18.824850] [c00fa74836f0] [c00a94a4] .irq_exit+0xf4/0x120
 [   18.824854] [c00fa7483780] [c0020c44] 
 .timer_interrupt+0x154/0x4d0
 [   18.824859] [c00fa7483830] [c0003be0] 
 decrementer_common+0x160/0x180
 [   18.824866] --- Exception: 901 at .plpar_hcall_norets+0x84/0xd4
 [   18.824866] LR = .check_and_cede_processor+0x48/0x80
 [   18.824871] [c00fa7483b20] [c007f018] 
 .check_and_cede_processor+0x18/0x80 (unreliable)
 [   18.824877] [c00fa7483b90] [c007f104] 
 .dedicated_cede_loop+0x84/0x150
 [   18.824883] [c00fa7483c50] [c06bc030] .cpuidle_enter+0x30/0x50
 [   18.824887] [c00fa7483cc0] [c06bc9f4] 
 .cpuidle_idle_call+0x104/0x720
 [   18.824892] [c00fa7483d80] [c0070af8] .pSeries_idle+0x18/0x40
 [   18.824897] [c00fa7483df0] [c0019084] .cpu_idle+0x1a4/0x380
 [   18.824902] [c00fa7483ec0] [c08a4c18] 
 .start_secondary+0x520/0x528
 [   18.824907] [c00fa7483f90] [c00093f0] 
 .start_secondary_prolog+0x10/0x14
 [   18.824911] Instruction dump:
 [   18.824914] 38840008 9003 90e30004 38630008 7ca62850 7cc300d0 78c7e102 
 7cf01120
 [   18.824923] 78c60660 39200010 39400020 39600030 7e00200c 7c0020ce 
 38840010 409f001c
 [   18.824935] ---[ end trace 0bb95124affaaa45 ]---
 [   18.825046] Unrecoverable VMX/Altivec Unavailable Exception f20 at 
 c0052d08
 
 I believe the right fix is to make memcpy match usercopy and not use
 cr1.
 
 Signed-off-by: Nishanth Aravamudan n...@us.ibm.com

Tested-by: Vaidyanathan Srinivasan sva...@linux.vnet.ibm.com

 ---
 I've not tested this fix yet, but I think it's logically correct.
 Probably needs to go to 3.6-stable as well.
 
 diff --git a/arch/powerpc/lib/memcpy_power7.S 
 b/arch/powerpc/lib/memcpy_power7.S
 index 7ba6c96..0663630 100644
 --- a/arch/powerpc/lib/memcpy_power7.S
 +++ b/arch/powerpc/lib/memcpy_power7.S
 @@ -239,8 +239,8 @@ _GLOBAL

Re: [PATCH 0/3] cpu: idle state framework for offline CPUs.

2009-08-05 Thread Vaidyanathan Srinivasan
* Shaohua Li shaohua...@intel.com [2009-08-06 09:58:55]:

 Hi,
 
 On Wed, Aug 05, 2009 at 10:25:53PM +0800, Gautham R Shenoy wrote:
  In this patch-series, we propose to extend the CPU-Hotplug infrastructure
  and allow the system administrator to choose the desired state the CPU 
  should
  go to when it is offlined. We think this approach addresses the concerns 
  about
  determinism as well as transparency, since CPU-Hotplug already provides
  notification mechanism which the userspace can listen to for any change
  in the configuration and correspondingly readjust any previously set
  cpu-affinities.
 Peter dislikes any approach (including cpuhotplug) which breaks userspace 
 policy,
 even userspace can get a notification.

Hi Shaohua,

Notification to userspace and an opportunity to react to the situation
is certainly better to manage as compared to under-the-cover reduction
in capacity.

We are already doing something to this effect in virtualized guests
where VCPU entitlement and assignment can be changed by the hypervisor
based on resource constraints.  This framework with notification and
determinism will help manage cpu capacity better.

I agree with Peter that scheduler approach is better, but it is not
deterministic.  This is simpler and cleaner alternative to keep the
complexity in userspace and provide only framework to control/notify
kernel.
 
  Also, approaches such as [1] can make use of this
  extended infrastructure instead of putting the CPU to an arbitrary C-state
  when it is offlined, thereby providing the system administrator a rope to 
  hang
  himself with should he feel the need to do so.
 I didn't see the reason why administrator needs to know which state offline 
 cpu
 should stay. Don't know about powerpc side, but in x86 side, it appears 
 deepest
 C-state is already preferred.

Yes that is what we would expect, but the deepest sleep state may be
restricted by BIOS or other system level parameters.  This was the
main objection to Venki's deepest sleep state for offline cpus patch.
There could be higher level policy that restricts the deepest level
C-states for various reasons.  This framework is to provide an
opportunity to adhere to the policy with userspace inputs.  This
framework also helps to choose different states in a virtualized
system.

--Vaidy



Re: [PATCH 0/3] cpu: idle state framework for offline CPUs.

2009-08-09 Thread Vaidyanathan Srinivasan
* Rafael J. Wysocki r...@sisk.pl [2009-08-09 15:22:02]:

 On Sunday 09 August 2009, Pavel Machek wrote:
  Hi!
  
Also, approaches such as [1] can make use of this
extended infrastructure instead of putting the CPU to an arbitrary 
C-state
when it is offlined, thereby providing the system administrator a rope 
to hang
himself with should he feel the need to do so.
   I didn't see the reason why administrator needs to know which state 
   offline cpu
   should stay. Don't know about powerpc side, but in x86 side, it appears 
   deepest
   C-state is already preferred.
   
  
  Agreed, deepest c-state is always best, there's no need to make it 
  configurable.
 
 Unless it doesn't work.

Yes, this is one of the reason.  Kernel will know about the deepest
sleep state and any restrictions and should set that as default in the
preferred_offline state. End-users and sys-admins need not change it,
default will be the deepest sleep state supported and allowed by the
system which may be different than the one supported by the processor.

This framework could allow the default to be set easily by other
userspace tools.  These restrictions apply to cpuidle governor as
well, hence at some level both these subsystems should be in sync.

As described in this patch series, the meaning of offline cpu is
different in a virtualized system and this framework provides
flexibility of choice there.  Platforms today do have a choice on what
state an offline cpu should reside, this framework is one of the
possible methods to exploit that choice and not restrict it due to
lack of OS flexibility.

--Vaidy



Re: [PATCH] cpuidle: powernv/pseries: Decrease the snooze residency

2015-06-03 Thread Vaidyanathan Srinivasan
* Benjamin Herrenschmidt b...@au1.ibm.com [2015-05-30 20:38:22]:

 On Sat, 2015-05-30 at 11:31 +0530, Vaidyanathan Srinivasan wrote:
  In shared lpar case, spinning in guest context may potentially take
  away cycles from other lpars waiting to run on the same physical cpu.
  
  So the policy in shared lpar case is to let PowerVM hypervisor know
  immediately that the guest cpu is idle which will allow the hypervisor
  to use the cycles for other tasks/lpars.
 
 But that will have negative side effects under KVM no ?

Yes, you have a good point.  If one of the thread in the core goes to
cede, it can still come back quickly since the KVM guest context is
not switched yet.  But in single threaded guest, this can force an
unnecessary exit/context switch overhead.

Now that we have fixed the snooze loop to be bounded and exit
predictably, KVM guest should actually use snooze state to improve
latency.
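
For context, the bounded snooze loop referred to above is roughly of
this shape (a simplified sketch of the powernv cpuidle driver's loop;
details such as runlatch handling are omitted):

	u64 snooze_exit_time = get_tb() + snooze_timeout;

	local_irq_enable();
	set_thread_flag(TIF_POLLING_NRFLAG);

	while (!need_resched()) {
		HMT_low();
		HMT_very_low();
		/* Bounded: stop polling once the target residency of the
		 * next deeper state has elapsed, so we exit predictably. */
		if (snooze_timeout_en && get_tb() > snooze_exit_time)
			break;
	}

	HMT_medium();
	clear_thread_flag(TIF_POLLING_NRFLAG);
	smp_mb();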

I will test this scenario and enable snooze state for KVM guest.
 
 Suresh mentioned something with his new directed interrupts code that we
 had many cases where the interrupts ended up arriving shortly after we
 exited to host for NAP'ing ...
 
 Snooze might fix it...

Right.  This scenario is worth experimenting and then introduce snooze
loop for guest.

--Vaidy


[PATCH] powerpc/powernv: Fix vma page prot flags in opal-prd driver

2015-06-21 Thread Vaidyanathan Srinivasan
opal-prd driver will mmap() firmware code/data area as private
mapping to prd user space daemon.  Write to this page will
trigger COW faults.  The new COW pages are normal kernel RAM
pages accounted by the kernel and are not special.

vma->vm_page_prot value will be used at page fault time
for the new COW pages, while pgprot_t value passed in
remap_pfn_range() is used for the initial page table entry.

Hence:
* Do not add _PAGE_SPECIAL in vma, but only for remap_pfn_range()
* Also remap_pfn_range() will add the _PAGE_SPECIAL flag using
  pte_mkspecial() call, hence no need to specify in the driver

This fix resolves the page accounting warning shown below:
BUG: Bad rss-counter state mm:c007d34ac600 idx:1 val:19

The above warning is triggered since _PAGE_SPECIAL was incorrectly
being set for the normal kernel COW pages.

Signed-off-by: Vaidyanathan Srinivasan sva...@linux.vnet.ibm.com
---
 arch/powerpc/platforms/powernv/opal-prd.c |9 -
 1 file changed, 4 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/opal-prd.c 
b/arch/powerpc/platforms/powernv/opal-prd.c
index 46cb3fe..4ece8e4 100644
--- a/arch/powerpc/platforms/powernv/opal-prd.c
+++ b/arch/powerpc/platforms/powernv/opal-prd.c
@@ -112,6 +112,7 @@ static int opal_prd_open(struct inode *inode, struct file 
*file)
 static int opal_prd_mmap(struct file *file, struct vm_area_struct *vma)
 {
size_t addr, size;
+   pgprot_t page_prot;
int rc;
 
pr_devel(opal_prd_mmap(0x%016lx, 0x%016lx, 0x%lx, 0x%lx)\n,
@@ -125,13 +126,11 @@ static int opal_prd_mmap(struct file *file, struct 
vm_area_struct *vma)
if (!opal_prd_range_is_valid(addr, size))
return -EINVAL;
 
-	vma->vm_page_prot = __pgprot(pgprot_val(phys_mem_access_prot(file,
-					vma->vm_pgoff,
-					size, vma->vm_page_prot))
-					| _PAGE_SPECIAL);
+	page_prot = phys_mem_access_prot(file, vma->vm_pgoff,
+					 size, vma->vm_page_prot);
 
	rc = remap_pfn_range(vma, vma->vm_start, vma->vm_pgoff, size,
-			vma->vm_page_prot);
+   page_prot);
 
return rc;
 }


Re: [PATCH] powerpc/powernv: Fix vma page prot flags in opal-prd driver

2015-06-25 Thread Vaidyanathan Srinivasan
* Vaidyanathan Srinivasan sva...@linux.vnet.ibm.com [2015-06-25 11:45:46]:

[snip]
 
 Hi Ben,
 
 remap_pfn_range() is the correct method to map the firmware pages
 because we will not have struct page associated with this RAM area.
 
 We do a memblock_reserve() in early boot and take out this memory from
 kernel and avoid struct page allocation/init for these.

Kindly ignore the this comment.  memblock_reserve() does not
prevent/avoid struct page allocation.  We do have valid struct page
which can be used for mapping.

--Vaidy


Re: [PATCH] powerpc/powernv: Fix vma page prot flags in opal-prd driver

2015-06-25 Thread Vaidyanathan Srinivasan
* Vaidyanathan Srinivasan sva...@linux.vnet.ibm.com [2015-06-21 23:56:16]:

 opal-prd driver will mmap() firmware code/data area as private
 mapping to prd user space daemon.  Write to this page will
 trigger COW faults.  The new COW pages are normal kernel RAM
 pages accounted by the kernel and are not special.
 
 vma->vm_page_prot value will be used at page fault time
 for the new COW pages, while pgprot_t value passed in
 remap_pfn_range() is used for the initial page table entry.
 
 Hence:
 * Do not add _PAGE_SPECIAL in vma, but only for remap_pfn_range()
 * Also remap_pfn_range() will add the _PAGE_SPECIAL flag using
   pte_mkspecial() call, hence no need to specify in the driver
 
 This fix resolves the page accounting warning shown below:
 BUG: Bad rss-counter state mm:c007d34ac600 idx:1 val:19
 
 The above warning is triggered since _PAGE_SPECIAL was incorrectly
 being set for the normal kernel COW pages.
 
 Signed-off-by: Vaidyanathan Srinivasan sva...@linux.vnet.ibm.com
 ---
  arch/powerpc/platforms/powernv/opal-prd.c |9 -
  1 file changed, 4 insertions(+), 5 deletions(-)
 
 diff --git a/arch/powerpc/platforms/powernv/opal-prd.c 
 b/arch/powerpc/platforms/powernv/opal-prd.c
 index 46cb3fe..4ece8e4 100644
 --- a/arch/powerpc/platforms/powernv/opal-prd.c
 +++ b/arch/powerpc/platforms/powernv/opal-prd.c
 @@ -112,6 +112,7 @@ static int opal_prd_open(struct inode *inode, struct file 
 *file)
  static int opal_prd_mmap(struct file *file, struct vm_area_struct *vma)
  {
   size_t addr, size;
 + pgprot_t page_prot;
   int rc;
  
   pr_devel(opal_prd_mmap(0x%016lx, 0x%016lx, 0x%lx, 0x%lx)\n,
 @@ -125,13 +126,11 @@ static int opal_prd_mmap(struct file *file, struct 
 vm_area_struct *vma)
   if (!opal_prd_range_is_valid(addr, size))
   return -EINVAL;
  
 - vma->vm_page_prot = __pgprot(pgprot_val(phys_mem_access_prot(file,
 - vma->vm_pgoff,
 -  size, vma->vm_page_prot))
 - | _PAGE_SPECIAL);
 + page_prot = phys_mem_access_prot(file, vma->vm_pgoff,
 +  size, vma->vm_page_prot);
  
   rc = remap_pfn_range(vma, vma->vm_start, vma->vm_pgoff, size,
 - vma->vm_page_prot);
 + page_prot);

Hi Ben,

remap_pfn_range() is the correct method to map the firmware pages
because we will not have struct page associated with this RAM area.

We do a memblock_reserve() in early boot and take out this memory from
kernel and avoid struct page allocation/init for these.

vm_insert_page() is an alternative that would have worked if kernel
allocated the memory, in which case we can bump up the page count and
map the page to user space.  This is already done by vm_insert_page()
and we will not need to make the page special.

However, this use case fits remap_pfn_range() and the page-special
mechanism since there is no struct page associated with these physical
pages.
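
For comparison, a minimal sketch of the vm_insert_page() alternative
mentioned above, assuming the page had been allocated by the kernel
(illustrative only, not what opal-prd needs here):

	/* Hypothetical: map one kernel-allocated page to user space.
	 * vm_insert_page() takes its own reference on the page and does
	 * not need any special PTE bits. */
	struct page *page = alloc_page(GFP_KERNEL);

	if (!page)
		return -ENOMEM;

	rc = vm_insert_page(vma, vma->vm_start, page);
	if (rc) {
		__free_page(page);
		return rc;
	}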

--Vaidy


Re: [PATCH] cpuidle: powernv/pseries: Decrease the snooze residency

2015-05-30 Thread Vaidyanathan Srinivasan
* Preeti U Murthy pre...@linux.vnet.ibm.com [2015-05-29 19:17:17]:

[snip]

  +   if (max_idle_state > 1) {
  +   snooze_timeout_en = true;
  +   snooze_timeout = cpuidle_state_table[1].target_residency *
  +tb_ticks_per_usec;
  +   }
 
 Any idea why we don't have snooze defined on the shared lpar configuration ?

In shared lpar case, spinning in guest context may potentially take
away cycles from other lpars waiting to run on the same physical cpu.

So the policy in shared lpar case is to let PowerVM hypervisor know
immediately that the guest cpu is idle which will allow the hypervisor
to use the cycles for other tasks/lpars.

--Vaidy


[RESEND PATCH] powerpc/powernv: Fix vma page prot flags in opal-prd driver

2015-06-28 Thread Vaidyanathan Srinivasan
opal-prd driver will mmap() firmware code/data area as private
mapping to prd user space daemon.  Write to this page will
trigger COW faults.  The new COW pages are normal kernel RAM
pages accounted by the kernel and are not special.

vma->vm_page_prot value will be used at page fault time
for the new COW pages, while pgprot_t value passed in
remap_pfn_range() is used for the initial page table entry.

Hence:
* Do not add _PAGE_SPECIAL in vma, but only for remap_pfn_range()
* Also remap_pfn_range() will add the _PAGE_SPECIAL flag using
  pte_mkspecial() call, hence no need to specify in the driver

This fix resolves the page accounting warning shown below:
BUG: Bad rss-counter state mm:c007d34ac600 idx:1 val:19

The above warning is triggered since _PAGE_SPECIAL was incorrectly
being set for the normal kernel COW pages.

Signed-off-by: Vaidyanathan Srinivasan sva...@linux.vnet.ibm.com
---
 arch/powerpc/platforms/powernv/opal-prd.c |9 -
 1 file changed, 4 insertions(+), 5 deletions(-)

Resending the patch for inclusion until we have a working solution
without PAGE_SPECIAL.

Kindly queue this patch to fix the mmap issue in opal-prd driver.
The next version/fix with vm_insert_page() will replace this code.
There is no API change on the driver side and hence no side effects of
including this fix until a better solution is available.

diff --git a/arch/powerpc/platforms/powernv/opal-prd.c 
b/arch/powerpc/platforms/powernv/opal-prd.c
index 46cb3fe..4ece8e4 100644
--- a/arch/powerpc/platforms/powernv/opal-prd.c
+++ b/arch/powerpc/platforms/powernv/opal-prd.c
@@ -112,6 +112,7 @@ static int opal_prd_open(struct inode *inode, struct file 
*file)
 static int opal_prd_mmap(struct file *file, struct vm_area_struct *vma)
 {
size_t addr, size;
+   pgprot_t page_prot;
int rc;
 
pr_devel(opal_prd_mmap(0x%016lx, 0x%016lx, 0x%lx, 0x%lx)\n,
@@ -125,13 +126,11 @@ static int opal_prd_mmap(struct file *file, struct 
vm_area_struct *vma)
if (!opal_prd_range_is_valid(addr, size))
return -EINVAL;
 
-	vma->vm_page_prot = __pgprot(pgprot_val(phys_mem_access_prot(file,
-					vma->vm_pgoff,
-					size, vma->vm_page_prot))
-					| _PAGE_SPECIAL);
+	page_prot = phys_mem_access_prot(file, vma->vm_pgoff,
+					 size, vma->vm_page_prot);
 
	rc = remap_pfn_range(vma, vma->vm_start, vma->vm_pgoff, size,
-			vma->vm_page_prot);
+   page_prot);
 
return rc;
 }


Re: [PATCH] powernv: Load correct TOC pointer while waking up from winkle.

2016-08-08 Thread Vaidyanathan Srinivasan
* Benjamin Herrenschmidt  [2016-08-06 08:38:53]:

> On Fri, 2016-08-05 at 19:13 +0530, Mahesh J Salgaonkar wrote:
> > From: Mahesh Salgaonkar 
> > 
> > The function pnv_restore_hyp_resource() loads the TOC into r2 from
> > the invalid PACA pointer before fixing r13 value. This do not affect
> > POWER ISA 3.0 but it does have an impact on POWER ISA 2.07 or less
> > leading CPU to get stuck forever.
> 
> When was this broken ? Should this get backported to stable ?

Broken for 4.8-rc1 only after moving the SPRN_HSPRG0 checking to
idle_book3s.S. Hence no back ports needed.

--Vaidy



Re: [PATCH] powernv: Load correct TOC pointer while waking up from winkle.

2016-08-08 Thread Vaidyanathan Srinivasan
* Mahesh J Salgaonkar <mah...@linux.vnet.ibm.com> [2016-08-05 19:13:12]:

> From: Mahesh Salgaonkar <mah...@linux.vnet.ibm.com>
> 
> The function pnv_restore_hyp_resource() loads the TOC into r2 from
> the invalid PACA pointer before fixing r13 value. This do not affect
> POWER ISA 3.0 but it does have an impact on POWER ISA 2.07 or less
> leading CPU to get stuck forever.
> 
>   login: [  471.830433] Processor 120 is stuck.
> 
> 
> This can be easily reproducible using following steps:
> - Turn off SMT
>   $ ppc64_cpu --smt=off
> - offline/online any online cpu (Thread 0 of any core which is online)
>   $ echo 0 > /sys/devices/system/cpu/cpu<N>/online
>   $ echo 1 > /sys/devices/system/cpu/cpu<N>/online
> 
> For POWER ISA 2.07 or less, the last bit of HSPRG0 is set indicating
> that thread is waking up from winkle. Hence, the last bit of HSPRG0(r13)
> needs to be clear before accessing it as PACA to avoid loading invalid
> values from invalid PACA pointer.
> 
> Fix this by loading TOC after r13 register is corrected.
> 
> Signed-off-by: Mahesh Salgaonkar <mah...@linux.vnet.ibm.com>

Acked-by: Vaidyanathan Srinivasan <sva...@linux.vnet.ibm.com>


> ---
>  arch/powerpc/kernel/idle_book3s.S |5 -
>  1 file changed, 4 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/powerpc/kernel/idle_book3s.S 
> b/arch/powerpc/kernel/idle_book3s.S
> index 8a56a51..45784ec 100644
> --- a/arch/powerpc/kernel/idle_book3s.S
> +++ b/arch/powerpc/kernel/idle_book3s.S
> @@ -363,8 +363,8 @@ _GLOBAL(power9_idle_stop)
>   * cr3 - set to gt if waking up with partial/complete hypervisor state loss
>   */
>  _GLOBAL(pnv_restore_hyp_resource)
> - ld  r2,PACATOC(r13);
>  BEGIN_FTR_SECTION
> + ld  r2,PACATOC(r13);
>   /*
>* POWER ISA 3. Use PSSCR to determine if we
>* are waking up from deep idle state
> @@ -395,6 +395,9 @@ END_FTR_SECTION_IFSET(CPU_FTR_ARCH_300)
>*/
>   clrldi  r5,r13,63
>   clrrdi  r13,r13,1
> +
> + /* Now that we are sure r13 is corrected, load TOC */
> + ld  r2,PACATOC(r13);
>   cmpwi   cr4,r5,1
>   mtspr   SPRN_HSPRG0,r13
> 

Thanks Mahesh for this fix.

--Vaidy

 



powerpc: Enable cpuhotplug with ESL=1 on POWER9

2017-02-23 Thread Vaidyanathan Srinivasan
The attached patch enables ESL=1 STOP2 for cpuhotplug.  This is a debug
patch that we could carry now until STOP states are discovered from
device tree.

Test run:

[  151.670021] CPU8 going offline with request psscr 003f0332
[  151.719856] CPU 8 offline: Remove Rx thread
[  189.200410] CPU8 coming online with psscr 203f0332, srr1 
90261001

We request ESL=EC=1 and STOP2 and successfully wakeup from that state
using IPI from XIVE.

--Vaidy




[PATCH 2/2] powerpc/cpuhotplug: print psscr and srr1 value for debug

2017-02-23 Thread Vaidyanathan Srinivasan
This is a debug patch that helps trace various STOP
state transitions and look at srr1 and psscr at wakeup.

Signed-off-by: Vaidyanathan Srinivasan <sva...@linux.vnet.ibm.com>
---
 arch/powerpc/platforms/powernv/smp.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/arch/powerpc/platforms/powernv/smp.c 
b/arch/powerpc/platforms/powernv/smp.c
index e39e6c4..5b3f002 100644
--- a/arch/powerpc/platforms/powernv/smp.c
+++ b/arch/powerpc/platforms/powernv/smp.c
@@ -185,8 +185,12 @@ static void pnv_smp_cpu_kill_self(void)
ppc64_runlatch_off();
 
if (cpu_has_feature(CPU_FTR_ARCH_300)) {
+   pr_info("CPU%d going offline with request psscr 
%016llx\n",
+   cpu, pnv_deepest_stop_psscr_val);
srr1 = power9_idle_stop(pnv_deepest_stop_psscr_val,
pnv_deepest_stop_psscr_mask);
+   pr_info("CPU%d coming online with psscr %016lx, srr1 
%016lx\n",
+   cpu, mfspr(SPRN_PSSCR), srr1);
} else if (idle_states & OPAL_PM_WINKLE_ENABLED) {
srr1 = power7_winkle();
} else if ((idle_states & OPAL_PM_SLEEP_ENABLED) ||
-- 
2.9.3



[PATCH 1/2] powerpc/cpuhotplug: Force ESL=1 for offline cpus

2017-02-23 Thread Vaidyanathan Srinivasan
From: Gautham R. Shenoy <e...@linux.vnet.ibm.com>

ESL=1 loses some HYP SPR context and is not ideal for cpuidle;
however, it can be used for offline cpus.

Signed-off-by: Vaidyanathan Srinivasan <sva...@linux.vnet.ibm.com>
Signed-off-by: Gautham R. Shenoy <e...@linux.vnet.ibm.com>
---
 arch/powerpc/include/asm/cpuidle.h| 15 +++
 arch/powerpc/platforms/powernv/idle.c | 11 +++
 2 files changed, 26 insertions(+)

diff --git a/arch/powerpc/include/asm/cpuidle.h 
b/arch/powerpc/include/asm/cpuidle.h
index fd321eb4..31192d8 100644
--- a/arch/powerpc/include/asm/cpuidle.h
+++ b/arch/powerpc/include/asm/cpuidle.h
@@ -40,6 +40,21 @@
 #define ERR_EC_ESL_MISMATCH-1
 #define ERR_DEEP_STATE_ESL_MISMATCH-2
 
+/* Additional defs for debug and trace */
+
+#define RL_SHIFT  0
+#define MTL_SHIFT 4
+#define TR_SHIFT  8
+#define PSLL_SHIFT16
+#define EC_SHIFT  20
+#define ESL_SHIFT 21
+#define INIT_PSSCR(ESL, EC, PSLL, TR, MTL, RL) (((ESL) << (ESL_SHIFT)) | \
+((EC) << (EC_SHIFT)) | \
+((PSLL) << (PSLL_SHIFT)) | \
+((TR) << (TR_SHIFT)) | \
+((MTL) << (MTL_SHIFT)) | \
+((RL) << (RL_SHIFT)))
+
 #ifndef __ASSEMBLY__
 extern u32 pnv_fastsleep_workaround_at_entry[];
 extern u32 pnv_fastsleep_workaround_at_exit[];
diff --git a/arch/powerpc/platforms/powernv/idle.c 
b/arch/powerpc/platforms/powernv/idle.c
index 4ee837e..4f0663a 100644
--- a/arch/powerpc/platforms/powernv/idle.c
+++ b/arch/powerpc/platforms/powernv/idle.c
@@ -524,6 +524,17 @@ static int __init pnv_init_idle_states(void)
 
pnv_alloc_idle_core_states();
 
+   /* On POWER9 DD1, enter stop2 with ESL=EC=1 on Hotplug */
+   if (cpu_has_feature(CPU_FTR_POWER9_DD1)) {
+   pnv_deepest_stop_psscr_val =
+   /* ESL, EC, PSSL, TR,  MTL, RL */
+   INIT_PSSCR(0x1, 0x1, 0xf, 0x3, 0x3, 0x2);
+   pnv_deepest_stop_psscr_mask = PSSCR_HV_DEFAULT_MASK;
+   pr_warn("Overriding deepest_stop_psscr to: 
val=0x%016llx,mask=0x%016llx\n",
+   pnv_deepest_stop_psscr_val,
+   pnv_deepest_stop_psscr_mask);
+   }
+
if (supported_cpuidle_states & OPAL_PM_NAP_ENABLED)
ppc_md.power_save = power7_idle;
else if (supported_cpuidle_states & OPAL_PM_STOP_INST_FAST)
-- 
2.9.3



[PATCH] powerpc/xics: Adjust interrupt receive priority for offline cpus

2017-02-23 Thread Vaidyanathan Srinivasan
Offline CPUs need to receive IPIs through XIVE when they are
in stop state and wakeup from that state.

Reduce interrupt receive priority in order to receive XIVE
wakeup interrupts when in offline state.

LOWEST_PRIORITY would allow all interrupts to be delivered
as wakeup events.

Signed-off-by: Vaidyanathan Srinivasan <sva...@linux.vnet.ibm.com>
---
 arch/powerpc/sysdev/xics/xics-common.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/sysdev/xics/xics-common.c 
b/arch/powerpc/sysdev/xics/xics-common.c
index 69d858e..c674a9d 100644
--- a/arch/powerpc/sysdev/xics/xics-common.c
+++ b/arch/powerpc/sysdev/xics/xics-common.c
@@ -199,7 +199,7 @@ void xics_migrate_irqs_away(void)
xics_set_cpu_giq(xics_default_distrib_server, 0);
 
/* Allow IPIs again... */
-   icp_ops->set_priority(DEFAULT_PRIORITY);
+   icp_ops->set_priority(LOWEST_PRIORITY);
 
for_each_irq_desc(virq, desc) {
struct irq_chip *chip;
-- 
2.9.3
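
A rough model of why the priority change above helps: in XICS, an interrupt is
presented to a CPU only when its priority is more favored (numerically
smaller) than the CPU's current priority (CPPR), so raising the CPPR to the
least favored value lets every source act as a wakeup event.  The sketch below
is not kernel code, and the numeric priorities for the IPI and the device
interrupt are illustrative assumptions rather than the kernel's actual values.

#include <stdbool.h>
#include <stdio.h>

#define DEFAULT_PRIORITY	5	/* assumed value of the default CPPR */
#define LOWEST_PRIORITY		0xFF	/* least favored CPPR: accept everything below it */

/* XICS delivery rule: only more-favored (smaller) priorities get through */
static bool delivered(unsigned int irq_prio, unsigned int cppr)
{
	return irq_prio < cppr;
}

int main(void)
{
	unsigned int ipi_prio = 4;	/* assumed IPI priority */
	unsigned int dev_prio = 5;	/* assumed device interrupt priority */

	printf("CPPR=DEFAULT_PRIORITY: IPI=%d device=%d\n",
	       delivered(ipi_prio, DEFAULT_PRIORITY),
	       delivered(dev_prio, DEFAULT_PRIORITY));	/* 1 0 */
	printf("CPPR=LOWEST_PRIORITY:  IPI=%d device=%d\n",
	       delivered(ipi_prio, LOWEST_PRIORITY),
	       delivered(dev_prio, LOWEST_PRIORITY));	/* 1 1 */
	return 0;
}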



Re: [PATCH] powerpc: cmp -> cmpd for 64-bit

2016-10-12 Thread Vaidyanathan Srinivasan
* Segher Boessenkool  [2016-10-12 08:26:48]:

> On Wed, Oct 12, 2016 at 02:05:19PM +1100, Michael Ellerman wrote:
> > Segher Boessenkool  writes:
> > 

[snip]

> 
> > > --- a/arch/powerpc/include/asm/cpuidle.h
> > > +++ b/arch/powerpc/include/asm/cpuidle.h
> > > @@ -26,7 +26,7 @@ extern u64 pnv_first_deep_stop_state;
> > 
> > #define IDLE_STATE_ENTER_SEQ(IDLE_INST) \
> > /* Magic NAP/SLEEP/WINKLE mode enter sequence */\
> > >   std r0,0(r1);   \
> > >   ptesync;\
> > >   ld  r0,0(r1);   \
> > > -1:   cmp cr0,r0,r0;  \
> > > +1:   cmpdcr0,r0,r0;  \
> > >   bne 1b; \
> > >   IDLE_INST;  \
> > >   b   .
> > 
> > What's this one doing, is it a bug? I can't really tell without knowing
> > what the magic sequence is meant to do.

This one is the recommended idle state entry sequence described in the
ISA.  We need to ensure the context is fully saved and also create
a register dependency using the cmp and a loop branch that will ideally
never be taken.  This gets the thread (pipeline) ready to start losing
state when the idle instruction is executed.

ISA 2.07 Section: 3.3.2.1 Entering and Exiting Power-Saving Mode

> 
> It looks like it is making sure the ptesync is done.  The ld/cmp/bne
> is the usual to make sure the ld is done, and in std/ptesync/ld the ld
> won't be done before the ptesync is done.
> 
> The cmp always compares equal, of course, so both cmpw and cmpd would
> work fine here.  cmpd looks better after ld ;-)

Yes :)

cmpd or cmpw would provide the same result as far as this code sequence is
concerned.  I agree that cmpd is more appropriate here.

--Vaidy



[PATCH v1 0/2] cpuidle: Fixes in cpuidle driver

2017-03-23 Thread Vaidyanathan Srinivasan
When CONFIG_HOTPLUG_CPU=n and cpu_present is less than cpu_possible,
cpuidle-powernv not passing an explicit drv->cpumask allows the
generic cpuidle driver to try to create sysfs objects for cpus that do
not have cpu_devices created by register_cpu().

This caused the kernel to access an incorrect address and crash.  The
following patch series fixes the cpuidle-powernv driver and also adds
an additional check in cpuidle_add_sysfs().

This patch set is against v4.11-rc3.

Changed from v1: Updated commit message and comments.

Signed-off-by: Vaidyanathan Srinivasan <sva...@linux.vnet.ibm.com>
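
The failure mode described in the cover letter can be modelled with a short
standalone program.  This is not kernel code; the names, array sizes and masks
are illustrative only.  cpu_device objects exist only for "present" cpus, so a
driver registering with the default "possible" mask has to either skip cpus
without a device (patch 2/2) or iterate only the present cpus in the first
place (patch 1/2).

#include <stdio.h>
#include <stdlib.h>

#define NR_POSSIBLE	8
#define NR_PRESENT	4	/* firmware reported cpus the OS cannot use */

struct cpu_device { int id; };

static struct cpu_device *cpu_devices[NR_POSSIBLE];

static struct cpu_device *get_cpu_device(int cpu)
{
	return cpu_devices[cpu];	/* NULL if topology init never created one */
}

static int cpuidle_add_sysfs_model(int cpu)
{
	struct cpu_device *dev = get_cpu_device(cpu);

	if (!dev)			/* the check added by patch 2/2 */
		return -1;
	printf("cpu%d: idle sysfs registered\n", dev->id);
	return 0;
}

int main(void)
{
	/* topology init analogue: devices are created for present cpus only */
	for (int cpu = 0; cpu < NR_PRESENT; cpu++) {
		cpu_devices[cpu] = malloc(sizeof(*cpu_devices[cpu]));
		cpu_devices[cpu]->id = cpu;
	}

	/* a driver registering with the default ("possible") mask */
	for (int cpu = 0; cpu < NR_POSSIBLE; cpu++)
		if (cpuidle_add_sysfs_model(cpu))
			printf("cpu%d: no cpu_device, registration skipped\n", cpu);

	return 0;
}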



Re: [PATCH v2 1/2] powerpc/powernv/cpuidle: Pass correct drv->cpumask for registration

2017-03-23 Thread Vaidyanathan Srinivasan
* Rafael J. Wysocki <raf...@kernel.org> [2017-03-23 16:28:31]:

> On Thu, Mar 23, 2017 at 4:22 PM, Vaidyanathan Srinivasan
> <sva...@linux.vnet.ibm.com> wrote:
> > drv->cpumask defaults to cpu_possible_mask in __cpuidle_driver_init().
> > On PowerNV platform cpu_present could be less than cpu_possible in cases
> > where firmware detects the cpu, but it is not available to the OS.  When
> > CONFIG_HOTPLUG_CPU=n, such cpus are not hotplugable at runtime and hence
> > we skip creating cpu_device.
> >
> > This breaks cpuidle on powernv where register_cpu() is not called for
> > cpus in cpu_possible_mask that cannot be hot-added at runtime.
> >
> > Trying cpuidle_register_device() on cpu without cpu_device will cause
> > crash like this:
> >
> > cpu 0xf: Vector: 380 (Data SLB Access) at [c00ff1503490]
> > pc: c022c8bc: string+0x34/0x60
> > lr: c022ed78: vsnprintf+0x284/0x42c
> > sp: c00ff1503710
> >msr: 90009033
> >dar: 60006000
> >   current = 0xc00ff148
> >   paca= 0xcfe82d00   softe: 0irq_happened: 0x01
> > pid   = 1, comm = swapper/8
> > Linux version 4.11.0-rc2 (sv@sagarika) (gcc version 4.9.4
> > (Buildroot 2017.02-4-gc28573e) ) #15 SMP Fri Mar 17 19:32:02 IST 2017
> > enter ? for help
> > [link register   ] c022ed78 vsnprintf+0x284/0x42c
> > [c00ff1503710] c022ebb8 vsnprintf+0xc4/0x42c (unreliable)
> > [c00ff1503800] c022ef40 vscnprintf+0x20/0x44
> > [c00ff1503830] c00ab61c vprintk_emit+0x94/0x2cc
> > [c00ff15038a0] c00acc9c vprintk_func+0x60/0x74
> > [c00ff15038c0] c0619694 printk+0x38/0x4c
> > [c00ff15038e0] c0224950 kobject_get+0x40/0x60
> > [c00ff1503950] c022507c kobject_add_internal+0x60/0x2c4
> > [c00ff15039e0] c0225350 kobject_init_and_add+0x70/0x78
> > [c00ff1503a60] c053c288 cpuidle_add_sysfs+0x9c/0xe0
> > [c00ff1503ae0] c053aeac cpuidle_register_device+0xd4/0x12c
> > [c00ff1503b30] c053b108 cpuidle_register+0x98/0xcc
> > [c00ff1503bc0] c085eaf0 powernv_processor_idle_init+0x140/0x1e0
> > [c00ff1503c60] c000cd60 do_one_initcall+0xc0/0x15c
> > [c00ff1503d20] c0833e84 kernel_init_freeable+0x1a0/0x25c
> > [c00ff1503dc0] c000d478 kernel_init+0x24/0x12c
> > [c00ff1503e30] c000b564 ret_from_kernel_thread+0x5c/0x78
> >
> > This patch fixes the bug by passing correct cpumask from
> > powernv-cpuidle driver.
> >
> > Signed-off-by: Vaidyanathan Srinivasan <sva...@linux.vnet.ibm.com>
> 
> That needs to be ACKed by someone familiar with powernv.

Previous version at
https://lists.ozlabs.org/pipermail/linuxppc-dev/2017-March/155587.html

I had not CCed linux-pm in the first post.

Michael and Mikey have reviewed the previous version.  Let me get an
ack for you to proceed with the merge.

Thanks,
Vaidy



[PATCH v2 1/2] powerpc/powernv/cpuidle: Pass correct drv->cpumask for registration

2017-03-23 Thread Vaidyanathan Srinivasan
drv->cpumask defaults to cpu_possible_mask in __cpuidle_driver_init().
On PowerNV platform cpu_present could be less than cpu_possible in cases
where firmware detects the cpu, but it is not available to the OS.  When
CONFIG_HOTPLUG_CPU=n, such cpus are not hotpluggable at runtime and hence
we skip creating cpu_device.

This breaks cpuidle on powernv where register_cpu() is not called for
cpus in cpu_possible_mask that cannot be hot-added at runtime.

Trying cpuidle_register_device() on cpu without cpu_device will cause
crash like this:

cpu 0xf: Vector: 380 (Data SLB Access) at [c00ff1503490]
pc: c022c8bc: string+0x34/0x60
lr: c022ed78: vsnprintf+0x284/0x42c
sp: c00ff1503710
   msr: 90009033
   dar: 60006000
  current = 0xc00ff148
  paca= 0xcfe82d00   softe: 0irq_happened: 0x01
pid   = 1, comm = swapper/8
Linux version 4.11.0-rc2 (sv@sagarika) (gcc version 4.9.4
(Buildroot 2017.02-4-gc28573e) ) #15 SMP Fri Mar 17 19:32:02 IST 2017
enter ? for help
[link register   ] c022ed78 vsnprintf+0x284/0x42c
[c00ff1503710] c022ebb8 vsnprintf+0xc4/0x42c (unreliable)
[c00ff1503800] c022ef40 vscnprintf+0x20/0x44
[c00ff1503830] c00ab61c vprintk_emit+0x94/0x2cc
[c00ff15038a0] c00acc9c vprintk_func+0x60/0x74
[c00ff15038c0] c0619694 printk+0x38/0x4c
[c00ff15038e0] c0224950 kobject_get+0x40/0x60
[c00ff1503950] c022507c kobject_add_internal+0x60/0x2c4
[c00ff15039e0] c0225350 kobject_init_and_add+0x70/0x78
[c00ff1503a60] c053c288 cpuidle_add_sysfs+0x9c/0xe0
[c00ff1503ae0] c053aeac cpuidle_register_device+0xd4/0x12c
[c00ff1503b30] c053b108 cpuidle_register+0x98/0xcc
[c00ff1503bc0] c085eaf0 powernv_processor_idle_init+0x140/0x1e0
[c00ff1503c60] c000cd60 do_one_initcall+0xc0/0x15c
[c00ff1503d20] c0833e84 kernel_init_freeable+0x1a0/0x25c
[c00ff1503dc0] c000d478 kernel_init+0x24/0x12c
[c00ff1503e30] c000b564 ret_from_kernel_thread+0x5c/0x78

This patch fixes the bug by passing correct cpumask from
powernv-cpuidle driver.

Signed-off-by: Vaidyanathan Srinivasan <sva...@linux.vnet.ibm.com>
---
 drivers/cpuidle/cpuidle-powernv.c | 18 ++
 1 file changed, 18 insertions(+)

diff --git a/drivers/cpuidle/cpuidle-powernv.c b/drivers/cpuidle/cpuidle-powernv.c
index a06df51..82f7b33 100644
--- a/drivers/cpuidle/cpuidle-powernv.c
+++ b/drivers/cpuidle/cpuidle-powernv.c
@@ -175,6 +175,24 @@ static int powernv_cpuidle_driver_init(void)
drv->state_count += 1;
}
 
+   /*
+* On PowerNV platform cpu_present may be less than cpu_possible in
+* cases where firmware detects the cpu, but it is not available to the
+* OS.  If CONFIG_HOTPLUG_CPU=n then such CPUs are not hotpluggable at
+* runtime and hence cpu_devices are not created for those cpus by
+* generic topology_init().
+*
+* drv->cpumask defaults to cpu_possible_mask in
+* __cpuidle_driver_init().  This breaks cpuidle on powernv where
+* cpu_devices are not created for cpus in cpu_possible_mask that
+* cannot be hot-added later at runtime.
+*
+* Trying cpuidle_register_device() on a cpu without cpu_devices is
+* incorrect. Hence pass correct cpu mask to generic cpuidle driver.
+*/
+
+   drv->cpumask = (struct cpumask *)cpu_present_mask;
+
return 0;
 }
 
-- 
2.9.3



[PATCH v2 2/2] cpuidle: Validate cpu_dev in cpuidle_add_sysfs

2017-03-23 Thread Vaidyanathan Srinivasan
If a given cpu is not in cpu_present and cpu hotplug
is disabled, the arch can skip setting up the cpu_dev.

The arch cpuidle driver should pass the correct cpu mask
at registration, but if the driver fails to do so the error
propagates and causes a crash like this:

[   30.076045] Unable to handle kernel paging request for
data at address 0x0048
[   30.076100] Faulting instruction address: 0xc07b2f30
cpu 0x4d: Vector: 300 (Data Access) at [c03feb18b670]
pc: c07b2f30: kobject_get+0x20/0x70
lr: c07b3c94: kobject_add_internal+0x54/0x3f0
sp: c03feb18b8f0
   msr: 90009033
   dar: 48
 dsisr: 4000
  current = 0xc03fd2ed8300
  paca= 0xcfbab500   softe: 0irq_happened: 0x01
pid   = 1, comm = swapper/0
Linux version 4.11.0-rc2-svaidy+ (sv@sagarika) (gcc version 6.2.0
20161005 (Ubuntu 6.2.0-5ubuntu12) ) #10 SMP Sun Mar 19 00:08:09 IST 2017
enter ? for help
[c03feb18b960] c07b3c94 kobject_add_internal+0x54/0x3f0
[c03feb18b9f0] c07b43a4 kobject_init_and_add+0x64/0xa0
[c03feb18ba70] c0e284f4 cpuidle_add_sysfs+0xb4/0x130
[c03feb18baf0] c0e26038 cpuidle_register_device+0x118/0x1c0
[c03feb18bb30] c0e26c48 cpuidle_register+0x78/0x120
[c03feb18bbc0] c168fd9c powernv_processor_idle_init+0x110/0x1c4
[c03feb18bc40] c000cff8 do_one_initcall+0x68/0x1d0
[c03feb18bd00] c16242f4 kernel_init_freeable+0x280/0x360
[c03feb18bdc0] c000d864 kernel_init+0x24/0x160
[c03feb18be30] c000b4e8 ret_from_kernel_thread+0x5c/0x74

Validating cpu_dev fixes the crash and reports correct error message like:

[   30.163506] Failed to register cpuidle device for cpu136
[   30.173329] Registration of powernv driver failed.

Signed-off-by: Vaidyanathan Srinivasan <sva...@linux.vnet.ibm.com>
---
 drivers/cpuidle/sysfs.c | 12 
 1 file changed, 12 insertions(+)

diff --git a/drivers/cpuidle/sysfs.c b/drivers/cpuidle/sysfs.c
index c5adc8c..f2c3bce 100644
--- a/drivers/cpuidle/sysfs.c
+++ b/drivers/cpuidle/sysfs.c
@@ -615,6 +615,18 @@ int cpuidle_add_sysfs(struct cpuidle_device *dev)
struct device *cpu_dev = get_cpu_device((unsigned long)dev->cpu);
int error;
 
+   /*
+* Return error if cpu_device is not setup for this cpu.  This
+* could happen if arch did not setup cpu_device since this
+* cpu is not in cpu_present mask and the driver did not send
+* correct cpu mask at registration.  Without this check we
+* would end up passing a bogus value for &cpu_dev->kobj in
+* kobject_init_and_add().
+*/
+
+   if (!cpu_dev)
+   return -ENODEV;
+
kdev = kzalloc(sizeof(*kdev), GFP_KERNEL);
if (!kdev)
return -ENOMEM;
-- 
2.9.3



Re: [PATCH v2 2/2] cpuidle: Validate cpu_dev in cpuidle_add_sysfs

2017-03-23 Thread Vaidyanathan Srinivasan
* Rafael J. Wysocki <raf...@kernel.org> [2017-03-23 16:27:31]:

> On Thu, Mar 23, 2017 at 4:22 PM, Vaidyanathan Srinivasan
> <sva...@linux.vnet.ibm.com> wrote:
> > If a given cpu is not in cpu_present and cpu hotplug
> > is disabled, arch can skip setting up the cpu_dev.
> >
> > Arch cpuidle driver should pass correct cpu mask
> > for registration, but failing to do so by the driver
> > causes error to propagate and crash like this:
> >
> > [   30.076045] Unable to handle kernel paging request for
> > data at address 0x0048
> > [   30.076100] Faulting instruction address: 0xc07b2f30
> > cpu 0x4d: Vector: 300 (Data Access) at [c03feb18b670]
> > pc: c07b2f30: kobject_get+0x20/0x70
> > lr: c07b3c94: kobject_add_internal+0x54/0x3f0
> > sp: c03feb18b8f0
> >msr: 90009033
> >dar: 48
> >  dsisr: 4000
> >   current = 0xc03fd2ed8300
> >   paca= 0xcfbab500   softe: 0irq_happened: 0x01
> > pid   = 1, comm = swapper/0
> > Linux version 4.11.0-rc2-svaidy+ (sv@sagarika) (gcc version 6.2.0
> > 20161005 (Ubuntu 6.2.0-5ubuntu12) ) #10 SMP Sun Mar 19 00:08:09 IST 2017
> > enter ? for help
> > [c03feb18b960] c07b3c94 kobject_add_internal+0x54/0x3f0
> > [c03feb18b9f0] c07b43a4 kobject_init_and_add+0x64/0xa0
> > [c03feb18ba70] c0e284f4 cpuidle_add_sysfs+0xb4/0x130
> > [c03feb18baf0] c0e26038 cpuidle_register_device+0x118/0x1c0
> > [c03feb18bb30] c0e26c48 cpuidle_register+0x78/0x120
> > [c03feb18bbc0] c168fd9c powernv_processor_idle_init+0x110/0x1c4
> > [c03feb18bc40] c000cff8 do_one_initcall+0x68/0x1d0
> > [c03feb18bd00] c16242f4 kernel_init_freeable+0x280/0x360
> > [c03feb18bdc0] c000d864 kernel_init+0x24/0x160
> > [c03feb18be30] c000b4e8 ret_from_kernel_thread+0x5c/0x74
> >
> > Validating cpu_dev fixes the crash and reports correct error message like:
> >
> > [   30.163506] Failed to register cpuidle device for cpu136
> > [   30.173329] Registration of powernv driver failed.
> >
> > Signed-off-by: Vaidyanathan Srinivasan <sva...@linux.vnet.ibm.com>
> 
> The previous version is in linux-next already and I'm going to push it
> for merging shortly.

Thanks Rafael.  The previous version is good for merge.

--Vaidy



Re: [PATCH] powerpc/powernv/cpuidle: Pass correct drv->cpumask for registration

2017-03-22 Thread Vaidyanathan Srinivasan
* Michael Ellerman <m...@ellerman.id.au> [2017-03-22 21:55:50]:

> Vaidyanathan Srinivasan <sva...@linux.vnet.ibm.com> writes:
> > * Michael Ellerman <m...@ellerman.id.au> [2017-03-20 14:05:39]:
> >> Vaidyanathan Srinivasan <sva...@linux.vnet.ibm.com> writes:
> >  
> >> > On powernv platform cpu_present could be less than cpu_possible
> >> > in cases where firmware detects the cpu, but it is not available
> >> > for OS.
> >> 
> >> It's entirely normal for present < possible, on my laptop for example,
> >> so I don't see how that causes the bug.
> >
> > Yes, present < possible in itself not a problem.  It is whether
> > cpu_device exist for that cpu or not.
> ...
> >
> > Currently if CONFIG_HOTPLUG_CPU=n, then we skip calling register_cpu()
> > and that causes the problem.
> ...
> >> 
> >> I really don't understand how a CPU not being present leads to a crash
> >> in printf()? Something in that call chain should have checked that the
> >> CPU was registered before crashing in printf() - surely?
> >
> > Yes, we should have just failed to register the cpuidle driver.  I have
> > the fix here:
> >
> > [PATCH] cpuidle: Validate cpu_dev in cpuidle_add_sysfs
> > http://patchwork.ozlabs.org/patch/740634/
> 
> OK. Can you send a v2 of this with a better change log that includes all
> the clarifications above.
> 
> And despite your subject being powerpc/powernv/cpuidle, this is a
> cpuidle patch. I can merge it, but I at least need you to Cc the cpuidle
> maintainers so they have a chance to see it.

Thanks for the review, I will post a v2 with more detailed commit log
and CC cpuidle maintainers and linux-pm.

--Vaidy



[PATCH] powerpc/powernv/cpuidle: Pass correct drv->cpumask for registration

2017-03-17 Thread Vaidyanathan Srinivasan
drv->cpumask defaults to cpu_possible_mask in __cpuidle_driver_init().
This breaks cpuidle on powernv where sysfs files are not created for
cpus in cpu_possible_mask that cannot be hot-added.

Trying cpuidle_register_device() on cpu without sysfs node will
cause crash like:

cpu 0xf: Vector: 380 (Data SLB Access) at [c00ff1503490]
pc: c022c8bc: string+0x34/0x60
lr: c022ed78: vsnprintf+0x284/0x42c
sp: c00ff1503710
   msr: 90009033
   dar: 60006000
  current = 0xc00ff148
  paca= 0xcfe82d00   softe: 0irq_happened: 0x01
pid   = 1, comm = swapper/8
Linux version 4.11.0-rc2 (sv@sagarika) (gcc version 4.9.4 (Buildroot 
2017.02-4-gc28573e) ) #15 SMP Fri Mar 17 19:32:02 IST 2017
enter ? for help
[link register   ] c022ed78 vsnprintf+0x284/0x42c
[c00ff1503710] c022ebb8 vsnprintf+0xc4/0x42c (unreliable)
[c00ff1503800] c022ef40 vscnprintf+0x20/0x44
[c00ff1503830] c00ab61c vprintk_emit+0x94/0x2cc
[c00ff15038a0] c00acc9c vprintk_func+0x60/0x74
[c00ff15038c0] c0619694 printk+0x38/0x4c
[c00ff15038e0] c0224950 kobject_get+0x40/0x60
[c00ff1503950] c022507c kobject_add_internal+0x60/0x2c4
[c00ff15039e0] c0225350 kobject_init_and_add+0x70/0x78
[c00ff1503a60] c053c288 cpuidle_add_sysfs+0x9c/0xe0
[c00ff1503ae0] c053aeac cpuidle_register_device+0xd4/0x12c
[c00ff1503b30] c053b108 cpuidle_register+0x98/0xcc
[c00ff1503bc0] c085eaf0 powernv_processor_idle_init+0x140/0x1e0
[c00ff1503c60] c000cd60 do_one_initcall+0xc0/0x15c
[c00ff1503d20] c0833e84 kernel_init_freeable+0x1a0/0x25c
[c00ff1503dc0] c000d478 kernel_init+0x24/0x12c
[c00ff1503e30] c000b564 ret_from_kernel_thread+0x5c/0x78

This patch fixes the issue by passing correct cpumask from
powernv-cpuidle driver.

Signed-off-by: Vaidyanathan Srinivasan <sva...@linux.vnet.ibm.com>
---
 drivers/cpuidle/cpuidle-powernv.c | 22 ++
 1 file changed, 22 insertions(+)

diff --git a/drivers/cpuidle/cpuidle-powernv.c b/drivers/cpuidle/cpuidle-powernv.c
index 3705930..e7a8c2a 100644
--- a/drivers/cpuidle/cpuidle-powernv.c
+++ b/drivers/cpuidle/cpuidle-powernv.c
@@ -175,6 +175,28 @@ static int powernv_cpuidle_driver_init(void)
drv->state_count += 1;
}
 
+   /*
+* On PowerNV platform cpu_present may be less than cpu_possible
+* in cases where firmware detects the cpu, but it is not available
+* for OS.  Such CPUs are not hotpluggable at runtime on PowerNV
+* platform and hence sysfs files are not created for those.
+* Generic topology_init() would skip creating sysfs directories
+* for cpus that are not present and not hotplugable later at
+* runtime.
+*
+* drv->cpumask defaults to cpu_possible_mask in __cpuidle_driver_init().
+* This breaks cpuidle on powernv where sysfs files are not created for
+* cpus in cpu_possible_mask that cannot be hot-added.
+*
+* Hence at runtime sysfs nodes are present for cpus only in
+* cpu_present_mask. Trying cpuidle_register_device() on cpu without
+* sysfs node is incorrect.
+*
+* Hence pass correct cpu mask to generic cpuidle driver.
+*/
+
+   drv->cpumask = (struct cpumask *)cpu_present_mask;
+
return 0;
 }
 
-- 
2.9.3



[PATCH] cpuidle: Validate cpu_dev in cpuidle_add_sysfs

2017-03-18 Thread Vaidyanathan Srinivasan
If a given cpu is not in cpu_present and cpu hotplug
is disabled, the arch can skip setting up the cpu_dev.

The arch cpuidle driver should pass the correct cpu mask
at registration, but if the driver fails to do so the error
propagates and causes a crash like this:

[   30.076045] Unable to handle kernel paging request for data at address 0x0048
[   30.076100] Faulting instruction address: 0xc07b2f30
cpu 0x4d: Vector: 300 (Data Access) at [c03feb18b670]
pc: c07b2f30: kobject_get+0x20/0x70
lr: c07b3c94: kobject_add_internal+0x54/0x3f0
sp: c03feb18b8f0
   msr: 90009033
   dar: 48
 dsisr: 4000
  current = 0xc03fd2ed8300
  paca= 0xcfbab500   softe: 0irq_happened: 0x01
pid   = 1, comm = swapper/0
Linux version 4.11.0-rc2-svaidy+ (sv@sagarika) (gcc version 6.2.0
20161005 (Ubuntu 6.2.0-5ubuntu12) ) #10 SMP Sun Mar 19 00:08:09 IST 2017
enter ? for help
[c03feb18b960] c07b3c94 kobject_add_internal+0x54/0x3f0
[c03feb18b9f0] c07b43a4 kobject_init_and_add+0x64/0xa0
[c03feb18ba70] c0e284f4 cpuidle_add_sysfs+0xb4/0x130
[c03feb18baf0] c0e26038 cpuidle_register_device+0x118/0x1c0
[c03feb18bb30] c0e26c48 cpuidle_register+0x78/0x120
[c03feb18bbc0] c168fd9c powernv_processor_idle_init+0x110/0x1c4
[c03feb18bc40] c000cff8 do_one_initcall+0x68/0x1d0
[c03feb18bd00] c16242f4 kernel_init_freeable+0x280/0x360
[c03feb18bdc0] c000d864 kernel_init+0x24/0x160
[c03feb18be30] c000b4e8 ret_from_kernel_thread+0x5c/0x74

Validating cpu_dev fixes the crash and reports correct error message like:

[   30.163506] Failed to register cpuidle device for cpu136
[   30.173329] Registration of powernv driver failed.

Signed-off-by: Vaidyanathan Srinivasan <sva...@linux.vnet.ibm.com>
---
 drivers/cpuidle/sysfs.c | 12 
 1 file changed, 12 insertions(+)

diff --git a/drivers/cpuidle/sysfs.c b/drivers/cpuidle/sysfs.c
index c5adc8c..19dcf32 100644
--- a/drivers/cpuidle/sysfs.c
+++ b/drivers/cpuidle/sysfs.c
@@ -615,6 +615,18 @@ int cpuidle_add_sysfs(struct cpuidle_device *dev)
struct device *cpu_dev = get_cpu_device((unsigned long)dev->cpu);
int error;
 
+   /*
+* Return if cpu_device is not setup for this cpu.
+* This could happen if arch did not setup cpu_device
+* since this cpu is not in cpu_present mask and
+* driver did not send correct cpu mask at registration.
+* Without this check we would end up passing bogus
+* value for &cpu_dev->kobj in kobject_init_and_add()
+*/
+
+   if (!cpu_dev)
+   return -ENODEV;
+
kdev = kzalloc(sizeof(*kdev), GFP_KERNEL);
if (!kdev)
return -ENOMEM;
-- 
2.9.3



Re: [PATCH] powerpc/powernv/cpuidle: Pass correct drv->cpumask for registration

2017-03-19 Thread Vaidyanathan Srinivasan
* Michael Ellerman <m...@ellerman.id.au> [2017-03-20 14:05:39]:

> Vaidyanathan Srinivasan <sva...@linux.vnet.ibm.com> writes:
> 
> > * Michael Neuling <mi...@neuling.org> [2017-03-18 16:28:02]:
> >
> >> Vaidy,
> >> 
> >> Thanks for fixing this.
> >> 
> >> > drv->cpumask defaults to cpu_possible_mask in __cpuidle_driver_init().
> >> > This breaks cpuidle on powernv where sysfs files are not created for
> >> > cpus in cpu_possible_mask that cannot be hot-added.
> >> 
> >> I think I prefer the longer description below than this.
> >
> > [PATCH] powerpc/powernv/cpuidle: Pass correct drv->cpumask for
> >
> > drv->cpumask defaults to cpu_possible_mask in __cpuidle_driver_init().
> > This breaks cpuidle on powernv where sysfs files are not created for
> > cpus in cpu_possible_mask that cannot be hot-added.
> 
> I'm confused.

It depends on CONFIG_HOTPLUG_CPU now, but we have the following
combinations to handle:
(a) CONFIG_HOTPLUG_CPU=y/n
(b) pseries vs powernv
 
> > On powernv platform cpu_present could be less than cpu_possible
> > in cases where firmware detects the cpu, but it is not available
> > for OS.
> 
> It's entirely normal for present < possible, on my laptop for example,
> so I don't see how that causes the bug.

Yes, present < possible is in itself not a problem.  What matters is
whether a cpu_device exists for that cpu or not.

> > Such cpus are not hotplugable at runtime on powernv and
> > hence we skip creating sysfs files for these cpus.
> 
> Why are they not hotpluggable? Looking at topology_init() they should be
> hotpluggable as long as ppc_md.cpu_die is populated, which it is on
> PowerNV AFAICS.

Currently it depends on CONFIG_HOTPLUG_CPU=y.
 
> Is it really "creating sysfs files" that's important, or is that we
> don't call register_cpu() for those CPUs?

Currently if CONFIG_HOTPLUG_CPU=n, then we skip calling register_cpu()
and that causes the problem.

The fix in cpuidle-powernv.c could be based on CONFIG_HOTPLUG_CPU, but
since we can choose to set cpu->hotpluggable based on other factors
even when CONFIG_HOTPLUG_CPU=y, I chose to key off the cpu mask
(cpu_present_mask) instead.  We will need to change cpuidle-powernv to
register for additional cpus when we really support hot-add of cpus
that did not exist at boot.

> > Trying cpuidle_register_device() on cpu without sysfs node will
> > cause crash like:
> >
> > cpu 0xf: Vector: 380 (Data SLB Access) at [c00ff1503490]
> > pc: c022c8bc: string+0x34/0x60
> > lr: c022ed78: vsnprintf+0x284/0x42c
> > sp: c00ff1503710
> >msr: 90009033
> >dar: 60006000
> >   current = 0xc00ff148
> >   paca= 0xcfe82d00   softe: 0irq_happened: 0x01
> > pid   = 1, comm = swapper/8
> > Linux version 4.11.0-rc2 (sv@sagarika) (gcc version 4.9.4 (Buildroot 
> > 2017.02-4-gc28573e) ) #15 SMP Fri Mar 17 19:32:02 IST 2017
> > enter ? for help
> > [link register   ] c022ed78 vsnprintf+0x284/0x42c
> > [c00ff1503710] c022ebb8 vsnprintf+0xc4/0x42c (unreliable)
> > [c00ff1503800] c022ef40 vscnprintf+0x20/0x44
> > [c00ff1503830] c00ab61c vprintk_emit+0x94/0x2cc
> > [c00ff15038a0] c00acc9c vprintk_func+0x60/0x74
> > [c00ff15038c0] c0619694 printk+0x38/0x4c
> > [c00ff15038e0] c0224950 kobject_get+0x40/0x60
> > [c00ff1503950] c022507c kobject_add_internal+0x60/0x2c4
> > [c00ff15039e0] c0225350 kobject_init_and_add+0x70/0x78
> > [c00ff1503a60] c053c288 cpuidle_add_sysfs+0x9c/0xe0
> > [c00ff1503ae0] c053aeac cpuidle_register_device+0xd4/0x12c
> > [c00ff1503b30] c053b108 cpuidle_register+0x98/0xcc
> > [c00ff1503bc0] c085eaf0 powernv_processor_idle_init+0x140/0x1e0
> > [c00ff1503c60] c000cd60 do_one_initcall+0xc0/0x15c
> > [c00ff1503d20] c0833e84 kernel_init_freeable+0x1a0/0x25c
> > [c00ff1503dc0] c000d478 kernel_init+0x24/0x12c
> > [c00ff1503e30] c000b564 ret_from_kernel_thread+0x5c/0x78
> 
> I really don't understand how a CPU not being present leads to a crash
> in printf()? Something in that call chain should have checked that the
> CPU was registered before crashing in printf() - surely?

Yes, we should have just failed to register the cpuidle driver.  I have
the fix here:

[PATCH] cpuidle: Validate cpu_dev in cpuidle_add_sysfs
http://patchwork.ozlabs.org/patch/740634/

--Vaidy



Re: [PATCH] powerpc/powernv/cpuidle: Pass correct drv->cpumask for registration

2017-03-18 Thread Vaidyanathan Srinivasan
* Michael Neuling  [2017-03-18 16:28:02]:

> Vaidy,
> 
> Thanks for fixing this.
> 
> > drv->cpumask defaults to cpu_possible_mask in __cpuidle_driver_init().
> > This breaks cpuidle on powernv where sysfs files are not created for
> > cpus in cpu_possible_mask that cannot be hot-added.
> 
> I think I prefer the longer description below than this.

[PATCH] powerpc/powernv/cpuidle: Pass correct drv->cpumask for

drv->cpumask defaults to cpu_possible_mask in __cpuidle_driver_init().
This breaks cpuidle on powernv where sysfs files are not created for
cpus in cpu_possible_mask that cannot be hot-added.

On powernv platform cpu_present could be less than cpu_possible
in cases where firmware detects the cpu, but it is not available
for OS.  Such cpus are not hotplugable at runtime on powernv and
hence we skip creating sysfs files for these cpus.

Trying cpuidle_register_device() on cpu without sysfs node will
cause crash like:

cpu 0xf: Vector: 380 (Data SLB Access) at [c00ff1503490]
pc: c022c8bc: string+0x34/0x60
lr: c022ed78: vsnprintf+0x284/0x42c
sp: c00ff1503710
   msr: 90009033
   dar: 60006000
  current = 0xc00ff148
  paca= 0xcfe82d00   softe: 0irq_happened: 0x01
pid   = 1, comm = swapper/8
Linux version 4.11.0-rc2 (sv@sagarika) (gcc version 4.9.4 (Buildroot 
2017.02-4-gc28573e) ) #15 SMP Fri Mar 17 19:32:02 IST 2017
enter ? for help
[link register   ] c022ed78 vsnprintf+0x284/0x42c
[c00ff1503710] c022ebb8 vsnprintf+0xc4/0x42c (unreliable)
[c00ff1503800] c022ef40 vscnprintf+0x20/0x44
[c00ff1503830] c00ab61c vprintk_emit+0x94/0x2cc
[c00ff15038a0] c00acc9c vprintk_func+0x60/0x74
[c00ff15038c0] c0619694 printk+0x38/0x4c
[c00ff15038e0] c0224950 kobject_get+0x40/0x60
[c00ff1503950] c022507c kobject_add_internal+0x60/0x2c4
[c00ff15039e0] c0225350 kobject_init_and_add+0x70/0x78
[c00ff1503a60] c053c288 cpuidle_add_sysfs+0x9c/0xe0
[c00ff1503ae0] c053aeac cpuidle_register_device+0xd4/0x12c
[c00ff1503b30] c053b108 cpuidle_register+0x98/0xcc
[c00ff1503bc0] c085eaf0 powernv_processor_idle_init+0x140/0x1e0
[c00ff1503c60] c000cd60 do_one_initcall+0xc0/0x15c
[c00ff1503d20] c0833e84 kernel_init_freeable+0x1a0/0x25c
[c00ff1503dc0] c000d478 kernel_init+0x24/0x12c
[c00ff1503e30] c000b564 ret_from_kernel_thread+0x5c/0x78

This patch fixes the issue by passing correct cpumask from
powernv-cpuidle driver.
 
> > This patch fixes the issue by passing correct cpumask from
> > powernv-cpuidle driver.
> 
> Any reason the correct fix isn't to change __cpuidle_driver_init() to use
> present mask?  Seems like any arch where present < possible is going to hit
> this.

How to handle the sysfs nodes for cpus that are not in the cpu_present
mask is up to the arch/platform.  During init, the arch can choose to
create sysfs nodes for cpus that are not in the present mask, which can
be hot-added and then onlined later; in that case the cpuidle driver
should register for such cpus as well.

The decision to allow a cpu to become available in the cpu_present mask
at runtime, and how to handle its sysfs nodes, is up to the
arch/platform and should not be hard-coded in the generic driver.  The
generic cpuidle code registers for all possible cpus if the arch driver
does not override it.

--Vaidy
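
A minimal sketch of the default-versus-override behaviour described above.
This is not kernel code: the cpu masks are reduced to plain bitmasks and the
structure and function names are illustrative, but the shape matches the
description in this thread (the generic layer falls back to the "possible"
set only when the arch driver leaves its mask unset).

#include <stdio.h>

static const unsigned long cpu_possible_mask = 0xff;	/* 8 possible cpus */
static const unsigned long cpu_present_mask  = 0x0f;	/* 4 present cpus */

struct cpuidle_driver_model {
	const char *name;
	const unsigned long *cpumask;	/* NULL means "use the default" */
};

static void driver_init_model(struct cpuidle_driver_model *drv)
{
	/* default: register for every possible cpu */
	if (!drv->cpumask)
		drv->cpumask = &cpu_possible_mask;
}

int main(void)
{
	struct cpuidle_driver_model plain   = { "no-override", NULL };
	struct cpuidle_driver_model powernv = { "powernv", &cpu_present_mask };

	driver_init_model(&plain);
	driver_init_model(&powernv);

	printf("%s registers cpus in mask 0x%02lx\n", plain.name, *plain.cpumask);
	printf("%s registers cpus in mask 0x%02lx\n", powernv.name, *powernv.cpumask);
	return 0;
}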



Re: [PATCH 1/3] cpuidle: powernv: Don't bounce between low and very low thread priority

2017-04-03 Thread Vaidyanathan Srinivasan
* Nicholas Piggin  [2017-04-04 09:52:07]:

> On Tue,  4 Apr 2017 07:54:12 +1000
> Anton Blanchard  wrote:
> 
> > From: Anton Blanchard 
> > 
> > The core of snooze_loop() continually bounces between low and very
> > low thread priority. Changing thread priorities is an expensive
> > operation that can negatively impact other threads on a core.
> > 
> > All CPUs that can run PowerNV support very low priority, so we can
> > avoid the change completely.
> 
> This looks good. I have HMT_lowest() which does alt feature patching
> we can use for pseries and default idle code.

Alternatively, if we are going to set the priority only once in various
other places, HMT_low(); HMT_very_low(); should not add extra
cycles.  Let me code that up.

--Vaidy

 



Re: [PATCH 1/3] cpuidle: powernv: Don't bounce between low and very low thread priority

2017-04-03 Thread Vaidyanathan Srinivasan
* Anton Blanchard <an...@ozlabs.org> [2017-04-04 07:54:12]:

> From: Anton Blanchard <an...@samba.org>
> 
> The core of snooze_loop() continually bounces between low and very
> low thread priority. Changing thread priorities is an expensive
> operation that can negatively impact other threads on a core.
> 
> All CPUs that can run PowerNV support very low priority, so we can
> avoid the change completely.
> 
> Signed-off-by: Anton Blanchard <an...@samba.org>

Reviewed-by: Vaidyanathan Srinivasan <sva...@linux.vnet.ibm.com>

> ---
>  drivers/cpuidle/cpuidle-powernv.c | 1 -
>  1 file changed, 1 deletion(-)
> 
> diff --git a/drivers/cpuidle/cpuidle-powernv.c 
> b/drivers/cpuidle/cpuidle-powernv.c
> index cda8f62d555b..9d9f164894eb 100644
> --- a/drivers/cpuidle/cpuidle-powernv.c
> +++ b/drivers/cpuidle/cpuidle-powernv.c
> @@ -57,7 +57,6 @@ static int snooze_loop(struct cpuidle_device *dev,
>   snooze_exit_time = get_tb() + snooze_timeout;
>   ppc64_runlatch_off();
>   while (!need_resched()) {
> - HMT_low();
>   HMT_very_low();
>   if (snooze_timeout_en && get_tb() > snooze_exit_time)
>   break;


HMT_low() is legacy and can be removed for powernv platforms.

--Vaidy



Re: [PATCH 2/3] cpuidle: powernv: Don't continually set thread priority in snooze_loop()

2017-04-03 Thread Vaidyanathan Srinivasan
* Anton Blanchard <an...@ozlabs.org> [2017-04-04 07:54:13]:

> From: Anton Blanchard <an...@samba.org>
> 
> The powerpc64 kernel exception handlers have preserved thread priorities
> for a long time now, so there is no need to continually set it.
> 
> Just set it once on entry and once exit.
> 
> Signed-off-by: Anton Blanchard <an...@samba.org>

Reviewed-by: Vaidyanathan Srinivasan <sva...@linux.vnet.ibm.com>

> ---
>  drivers/cpuidle/cpuidle-powernv.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/cpuidle/cpuidle-powernv.c 
> b/drivers/cpuidle/cpuidle-powernv.c
> index 9d9f164894eb..8c991c254b95 100644
> --- a/drivers/cpuidle/cpuidle-powernv.c
> +++ b/drivers/cpuidle/cpuidle-powernv.c
> @@ -56,8 +56,8 @@ static int snooze_loop(struct cpuidle_device *dev,
> 
>   snooze_exit_time = get_tb() + snooze_timeout;
>   ppc64_runlatch_off();
> + HMT_very_low();
>   while (!need_resched()) {
> - HMT_very_low();
>   if (snooze_timeout_en && get_tb() > snooze_exit_time)
>   break;
>   }
> -- 
> 2.11.0
> 



Re: [PATCH 3/3] cpuidle: powernv: Avoid a branch in the core snooze_loop() loop

2017-04-03 Thread Vaidyanathan Srinivasan
* Anton Blanchard <an...@ozlabs.org> [2017-04-04 07:54:14]:

> From: Anton Blanchard <an...@samba.org>
> 
> When in the snooze_loop() we want to take up the least amount of
> resources. On my version of gcc (6.3), we end up with an extra
> branch because it predicts snooze_timeout_en to be false, whereas it
> is almost always true.

By default snooze_timeout_en is true.  It becomes false only when we
do not want to exit the snooze loop, and that is the case when snooze
is the only idle state available on the platform, which is rare.

> Use likely() to avoid the branch and be a little nicer to the
> other non idle threads on the core.
> 
> Signed-off-by: Anton Blanchard <an...@samba.org>

Reviewed-by: Vaidyanathan Srinivasan <sva...@linux.vnet.ibm.com>

> ---
>  drivers/cpuidle/cpuidle-powernv.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/cpuidle/cpuidle-powernv.c 
> b/drivers/cpuidle/cpuidle-powernv.c
> index 8c991c254b95..251a60bfa8ee 100644
> --- a/drivers/cpuidle/cpuidle-powernv.c
> +++ b/drivers/cpuidle/cpuidle-powernv.c
> @@ -58,7 +58,7 @@ static int snooze_loop(struct cpuidle_device *dev,
>   ppc64_runlatch_off();
>   HMT_very_low();
>   while (!need_resched()) {
> - if (snooze_timeout_en && get_tb() > snooze_exit_time)
> + if (likely(snooze_timeout_en) && get_tb() > snooze_exit_time)
>   break;
>   }
> 
> -- 
> 2.11.0
> 



Re: [PATCH] powerpc/pseries: energy driver only print message when LPAR guest

2017-07-20 Thread Vaidyanathan Srinivasan
* Nicholas Piggin <npig...@gmail.com> [2017-07-21 11:16:44]:

> On Thu, 20 Jul 2017 23:03:21 +1000
> Michael Ellerman <m...@ellerman.id.au> wrote:
> 
> > Nicholas Piggin <npig...@gmail.com> writes:
> > 
> > > This driver currently reports the H_BEST_ENERGY is unsupported even
> > > when booting in a non-LPAR environment (e.g., powernv). Prevent it.  
> > 
> > Just delete the printk(). Users don't know what that means, and
> > developers have other better ways to detect that the hcall is missing if
> > anyone cares.
> > 
> > cheers
> 
> powerpc/pseries: energy driver do not print failure message
> 
> This driver currently reports the H_BEST_ENERGY is unsupported (even
> when booting in a non-LPAR environment). This is not something the
> administrator can do much with, and not significant for debugging.
> 
> Remove it.
> 
> Signed-off-by: Nicholas Piggin <npig...@gmail.com>

Reviewed-by: Vaidyanathan Srinivasan <sva...@linux.vnet.ibm.com>


> ---
>  arch/powerpc/platforms/pseries/pseries_energy.c | 7 +++
>  1 file changed, 3 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/powerpc/platforms/pseries/pseries_energy.c 
> b/arch/powerpc/platforms/pseries/pseries_energy.c
> index 164a13d3998a..35c891aabef0 100644
> --- a/arch/powerpc/platforms/pseries/pseries_energy.c
> +++ b/arch/powerpc/platforms/pseries/pseries_energy.c
> @@ -229,10 +229,9 @@ static int __init pseries_energy_init(void)
>   int cpu, err;
>   struct device *cpu_dev;
> 
> - if (!firmware_has_feature(FW_FEATURE_BEST_ENERGY)) {
> - printk(KERN_INFO "Hypercall H_BEST_ENERGY not supported\n");
> - return 0;
> - }
> + if (!firmware_has_feature(FW_FEATURE_BEST_ENERGY))
> + return 0; /* H_BEST_ENERGY hcall not supported */
> +

The first patch (!firmware_has_feature(FW_FEATURE_LPAR)) would be
ideal, but we do not have this in the KVM guest case either.  Hence
I'll take mpe's suggestion.  Removing the print is fine.

--Vaidy



Re: [PATCH] powerpc/pseries: energy driver only print message when LPAR guest

2017-07-21 Thread Vaidyanathan Srinivasan
* Michael Ellerman <m...@ellerman.id.au> [2017-07-21 16:33:07]:

> Vaidyanathan Srinivasan <sva...@linux.vnet.ibm.com> writes:
> > * Nicholas Piggin <npig...@gmail.com> [2017-07-21 11:16:44]:
> >> diff --git a/arch/powerpc/platforms/pseries/pseries_energy.c 
> >> b/arch/powerpc/platforms/pseries/pseries_energy.c
> >> index 164a13d3998a..35c891aabef0 100644
> >> --- a/arch/powerpc/platforms/pseries/pseries_energy.c
> >> +++ b/arch/powerpc/platforms/pseries/pseries_energy.c
> >> @@ -229,10 +229,9 @@ static int __init pseries_energy_init(void)
> >>int cpu, err;
> >>struct device *cpu_dev;
> >> 
> >> -  if (!firmware_has_feature(FW_FEATURE_BEST_ENERGY)) {
> >> -  printk(KERN_INFO "Hypercall H_BEST_ENERGY not supported\n");
> >> -  return 0;
> >> -  }
> >> +  if (!firmware_has_feature(FW_FEATURE_BEST_ENERGY))
> >> +  return 0; /* H_BEST_ENERGY hcall not supported */
> >> +
> >
> > The first patch (!firmware_has_feature(FW_FEATURE_LPAR)) would be
> > ideal, but we do not have this in KVM guest case also.
> 
> Yeah we do.
> 
> It should really be called FW_FEATURE_RUNNING_UNDER_PAPR_HYPERVISOR.
> 
> static int __init probe_fw_features(unsigned long node, const char *uname, int
>   depth, void *data)
> {
>   
>   if (!strcmp(uname, "rtas") || !strcmp(uname, "rtas@0")) {
>   prop = of_get_flat_dt_prop(node, "ibm,hypertas-functions", &len);
>   if (prop) {
>   powerpc_firmware_features |= FW_FEATURE_LPAR;
> 
> 
> Qemu initialises that property unconditionally in spapr_dt_rtas().

Oops... I meant that FW_FEATURE_BEST_ENERGY is not found in KVM and we
would see the print needlessly.

If we had a check for a pHyp LPAR, then we could enable the print
"H_BEST_ENERGY hcall not supported".

Since FW_FEATURE_LPAR is common to all PAPR guests (both pHyp and
KVM), I agree that deleting the print is the right thing to do: we
would see it on both powernv and KVM, where the hcall is not supported,
and there is no point reporting it.

--Vaidy



Re: [PATCH] POWER9 PMU stops after idle workaround

2017-07-19 Thread Vaidyanathan Srinivasan
* Nicholas Piggin <npig...@gmail.com> [2017-07-20 11:53:22]:

> POWER9 DD2 PMU can stop after a state-loss idle in some conditions.
> 
> A solution is to set then clear MMCRA[60] after wake from state-loss
> idle.
> 
> Signed-off-by: Nicholas Piggin <npig...@gmail.com>

Reviewed-by: Vaidyanathan Srinivasan <sva...@linux.vnet.ibm.com>


> ---
>  arch/powerpc/kernel/idle_book3s.S | 8 +++-
>  1 file changed, 7 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/powerpc/kernel/idle_book3s.S 
> b/arch/powerpc/kernel/idle_book3s.S
> index 516ebef905c0..e6252c5a57a4 100644
> --- a/arch/powerpc/kernel/idle_book3s.S
> +++ b/arch/powerpc/kernel/idle_book3s.S
> @@ -460,11 +460,17 @@ pnv_restore_hyp_resource_arch300:
>   /*
>* Workaround for POWER9, if we lost resources, the ERAT
>* might have been mixed up and needs flushing. We also need
> -  * to reload MMCR0 (see comment above).
> +  * to reload MMCR0 (see comment above). We also need to set
> +  * then clear bit 60 in MMCRA to ensure the PMU starts running.
>*/
>   blt cr3,1f
>   PPC_INVALIDATE_ERAT
>   ld  r1,PACAR1(r13)
> + mfspr   r4,SPRN_MMCRA
> + ori r4,r4,(1 << (63-60))
> + mtspr   SPRN_MMCRA,r4
> + xorir4,r4,(1 << (63-60))
> + mtspr   SPRN_MMCRA,r4


Is the timing ok to resolve the issue?  Does a back-to-back bit flip of
MMCRA[60] get the job done in all cases?  Just asking, since this
issue in itself is a corner case ;)

--Vaidy
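
For reference, the immediate in the hunk above uses IBM bit numbering, where
bit 0 is the most significant bit of the 64-bit SPR, so MMCRA[60] corresponds
to the mask 1 << (63 - 60).  A quick standalone check of the set-then-clear
sequence (plain C, not kernel code):

#include <stdio.h>

int main(void)
{
	unsigned long mmcra = 0;
	unsigned long bit60 = 1ul << (63 - 60);	/* 0x8 */

	mmcra |= bit60;		/* mirrors: ori  r4,r4,(1 << (63-60)) */
	printf("set:   0x%016lx\n", mmcra);
	mmcra ^= bit60;		/* mirrors: xori r4,r4,(1 << (63-60)) */
	printf("clear: 0x%016lx\n", mmcra);
	return 0;
}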


