Re: [PATCH 2/4]: CPUIDLE: Introduce architecture independent cpuidle_pm_idle in drivers/cpuidle/cpuidle.c
* Peter Zijlstra <pet...@infradead.org> [2009-08-28 09:01:12]:
> On Fri, 2009-08-28 at 08:48 +0200, Peter Zijlstra wrote:
> > void cpuidle_install_idle_handler(void)
> > {
> > 	...
> > 	cpuidle_pm_idle = cpuidle_idle_call;
> > }
> >
> > All I'm seeing here is a frigging mess. How on earth can something
> > called cpuidle_install_idle_handler() have a void argument, _WHAT_
> > handler is it going to install?
>
> Argh, now I see, it installs itself as the platform idle handler.
>
> So cpuidle_install_idle_handler() pokes at the unmanaged pm_idle
> pointer to make cpuidle take control. On module load it does:
>
> 	pm_idle_old = pm_idle;
>
> then in the actual idle loop it does:
>
> 	if (!dev || !dev->enabled) {
> 		if (pm_idle_old)
> 			pm_idle_old();
>
> Who is to say that the pointer stored at module init time is still
> around at that time?
>
> So cpuidle recognised the pm_idle stuff was flaky, but instead of
> fixing it, they built a whole new layer on top of it. Brilliant.
>
> /me goes mark this whole thread read, I've got enough things to do.

Hi Peter,

I understand that you are frustrated with the mess. We are willing to
clean up the pm_idle pointer now, before the cpuidle framework is
exploited by more archs. At this moment we need your suggestions on
what interface we should call 'clean' and safe.

cpuidle.c and the arch-independent cpuidle subsystem is not a module,
and its cpuidle_idle_call() routine is valid and can be safely called
from the arch-dependent process.c. The fragile part is how
cpuidle_idle_call() is hooked onto the arch-specific cpu_idle()
function at runtime: x86 exports the pm_idle pointer, while powerpc
calls through the ppc_md.power_save pointer.
At cpuidle init time we can override the platform idle function, but
that would mean including arch-specific code in cpuidle.c. Do you
think an exported function in cpuidle.c that hands the correct
pointer to the arch code would be better than the current situation?

In drivers/cpuidle/cpuidle.c:

	void (*get_cpuidle_handler(void))(void)
	{
		return cpuidle_pm_idle;
	}
	EXPORT_SYMBOL(get_cpuidle_handler);

and from pseries/processor_idle.c:

	ppc_md.power_save = get_cpuidle_handler();

--Vaidy
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH v3 0/3] cpu: pseries: Cpu offline states framework
* Peter Zijlstra <a.p.zijls...@chello.nl> [2009-09-16 18:35:16]:
> On Wed, 2009-09-16 at 21:54 +0530, Dipankar Sarma wrote:
> > No, for this specific case, latency isn't an issue. The issue is:
> > how do we cede unused vcpus to the hypervisor for better energy
> > management? Yes, it can be done by a hypervisor manager telling the
> > kernel to offline and make a bunch of vcpus inactive. It does have
> > to choose offline (release vcpu) vs. inactive (cede, but guaranteed
> > if needed). The problem is that long ago we exported a lot of
> > hotplug stuff to userspace through the sysfs interface, and we
> > cannot do something inside the kernel without keeping the sysfs
> > stuff consistent. This seems like a sane way to do that without
> > undoing all the virtual cpu hotplug infrastructure in the
> > supporting archs.
>
> I'm still not getting it.. Suppose we have some guest, it booted with
> 4 cpus. We then offline 2 of them. Apparently this LPAR binds guest
> cpus to physical cpus? So we use a hypervisor interface to reclaim
> these 2 offlined cpus and re-assign them to some other guest. So far
> so good, right?
>
> Now if you were to try and online the cpus in the guest, it'd fail
> because the cpus aren't backed anymore, and the hot-plug simply
> times-out and fails. And we're still good, right?

The requirements differ here. If we had offlined 2 vCPUs for the
purpose of system reconfiguration, the expected behavior of the
offline interface would be exactly that. However, the proposed cede
interface is needed when we want them to temporarily go away but
still come back when we do an online. We want the online to always
succeed, since the backing physical resources are not relinquished.
The proposed interface facilitates offline without relinquishing the
physical resources assigned to the LPAR.

--Vaidy
Re: [v6 PATCH 0/7]: cpuidle/x86/POWER: Cleanup idle power management code in x86, cleanup drivers/cpuidle/cpuidle.c and introduce cpuidle to POWER.
* Arjan van de Ven <ar...@infradead.org> [2009-09-24 14:22:28]:
> On Thu, 24 Sep 2009 10:42:41 +0530
> Arun R Bharadwaj <a...@linux.vnet.ibm.com> wrote:
> > * Arun R Bharadwaj <a...@linux.vnet.ibm.com> [2009-09-22 16:55:27]:
> > >
> > > Hi Len, (or other acpi folks),
> > >
> > > I had a question regarding the ACPI-cpuidle interaction in the
> > > current implementation. Currently, every cpu (i.e. every
> > > acpi_processor) registers with cpuidle as a cpuidle_device. So
> > > every cpu has to go through the process of setting up the idle
> > > states and then registering as a cpuidle device. What exactly is
> > > the reason behind this?
>
> technically a BIOS can opt to give you C states via ACPI on some
> cpus, but not on others. in practice when this happens it tends to be
> a bug.. but it's technically a valid configuration

So we will need to keep the per-cpu registration for now, because we
may have such buggy BIOSes in the field and we don't want the cpuidle
framework to malfunction there.

--Vaidy
Re: [PATCH v2 0/2] cpu: pseries: Offline state framework.
* Arjan van de Ven <ar...@infradead.org> [2009-09-24 13:41:23]:

On Thu, 24 Sep 2009 13:33:07 +0200 Peter Zijlstra <a.p.zijls...@chello.nl> wrote:

On Thu, 2009-09-24 at 18:38 +1000, Benjamin Herrenschmidt wrote:

On Thu, 2009-09-24 at 09:51 +0200, Peter Zijlstra wrote:

I don't quite follow your logic here. This is useful for more than
just hypervisors. For example, take the HV out of the picture for a
moment and imagine that the HW has the ability to offline CPUs at
various power levels, with varying latencies to bring them back.

cpu-hotplug is an utter slow path, anybody saying latency and hotplug
in the same sentence doesn't seem to grasp either or both concepts.

Let's forget about latency then. Let's imagine I want to set a CPU
offline to save power, vs. setting it offline -and- opening the back
door of the machine to actually physically replace it :-)

If the hardware is capable of physical hotplug, then surely powering
the socket down saves most power and is the preferred mode?

btw just to take away the perception that powering down sockets helps
in general: it does not help for all cpus. Some cpus are so efficient
in idle that the incremental gain one would get by offlining a core
is just not worth it (in fact, on x86, it's the same thing). I
obviously can't speak for p-series cpus, just wanted to point out
that there is no universal truth that "offlining saves power".

Hi Arjan,

As you have said, on some cpus the extra effort of offlining does not
save us any extra power, and the state will be the same as idle. The
assertion that offlining saves power is still valid: it could be the
same as idle or better, depending on the architecture and
implementation. On x86 we still need the code (Venki posted) to take
cpus to C6 on offline to save power, or else offlining consumes more
power than idle due to the C1/hlt state. This framework can help here
as well if we have any apprehensions about making the lowest sleep
state the default on x86 and want the administrator to decide.
--Vaidy
Re: [PATCH v3 0/3] cpu: pseries: Cpu offline states framework
* Peter Zijlstra <a.p.zijls...@chello.nl> [2009-09-25 16:48:40]:
> On Thu, 2009-09-24 at 10:51 +1000, Benjamin Herrenschmidt wrote:
> > On Tue, 2009-09-15 at 14:11 +0200, Peter Zijlstra wrote:
> > > I still think its a layering violation... its the hypervisor
> > > manager that should be bothered in what state an off-lined cpu
> > > is in.
> >
> > That's not how our hypervisor works.
>
> Then fix it?
>
> > If you ask through the management interface to remove a CPU from a
> > partition, the HV will communicate with a daemon inside the
> > partition that will then unplug the CPU via the right call.
> >
> > I don't really understand your objections to be honest. And I fail
> > to see why it would be a layering violation to have the ability for
> > the OS to indicate in what state it wishes to relinquish a CPU to
> > the hypervisor, which more or less defines the expected latency for
> > getting it back later on.
>
> OK, so the main objection is the abuse of CPU hotplug as a resource
> management feature.
>
> CPU hotplug is terribly invasive and expensive to the kernel, doing
> hotplug on a minute basis is just plain crazy. If you want a CPU in a
> "keep it near and don't hand it back to the HV" state, why not use
> cpusets to isolate it and simply not run tasks on it? cpusets don't
> use stopmachine and are much nicer to the rest of the kernel
> over-all.

Hi Peter,

This interface is not expected to be used every minute or so in a way
that would impact the operation of the rest of the system. CPU
hotplug is already used as a resource management feature in
virtualised systems through dlpar operations.

I do understand that cpu hotplug is invasive at run time, and that
amount of complexity is required to carefully isolate the cpu from
any involvement in the running kernel. Building another interface
that isolates cpus to the same extent as cpu hotplug does today would
be redundant and equally invasive. Alternatives like cpusets only
migrate tasks; they do not isolate the cpu to the same extent.
--Vaidy
Re: [PATCH v3 0/3] cpu: pseries: Cpu offline states framework
* Benjamin Herrenschmidt <b...@kernel.crashing.org> [2009-09-26 07:12:48]:
> On Fri, 2009-09-25 at 16:48 +0200, Peter Zijlstra wrote:
> > On Thu, 2009-09-24 at 10:51 +1000, Benjamin Herrenschmidt wrote:
> > > On Tue, 2009-09-15 at 14:11 +0200, Peter Zijlstra wrote:
> > > > I still think its a layering violation... its the hypervisor
> > > > manager that should be bothered in what state an off-lined cpu
> > > > is in.
> > >
> > > That's not how our hypervisor works.
> >
> > Then fix it?
>
> Are you serious ? :-)
>
> > CPU hotplug is terribly invasive and expensive to the kernel, doing
> > hotplug on a minute basis is just plain crazy. If you want a CPU in
> > a "keep it near and don't hand it back to the HV" state, why not
> > use cpusets to isolate it and simply not run tasks on it? cpusets
> > don't use stopmachine and are much nicer to the rest of the kernel
> > over-all.
>
> Gautham, what is the difference in terms of power savings between
> having it idle for long periods of time (which could do H_CEDE and,
> with NO_HZ, probably wouldn't need to wake up that often) and having
> it unplugged in an H_CEDE loop?

Hi Ben,

A cede latency specifier value indicating the latency expectation of
the guest OS can be established in the VPA to inform the hypervisor
during the H_CEDE call. Currently we do call H_CEDE during NO_HZ for
efficient idle. However, higher cede latency values may not be
suitable for idle CPUs in the kernel, and instead more energy savings
may result from exploiting this feature through the CPU hotplug
interface.

--Vaidy
Re: [PATCH v5 0/4] pseries: Add cede support for cpu-offline
* Benjamin Herrenschmidt <b...@kernel.crashing.org> [2009-11-24 14:35:09]:
> On Fri, 2009-10-30 at 10:52 +0530, Gautham R Shenoy wrote:
> > Hi,
> >
> > This is version 5 of the patch series that provides a framework to
> > choose the state a pseries CPU must be put into when it is
> > offlined.
> >
> > Previous versions can be found here:
> > Version 4: http://lkml.org/lkml/2009/10/9/59
> > Version 3: http://lkml.org/lkml/2009/9/15/164
> > Version 2: http://lkml.org/lkml/2009/8/28/102
> > Version 1: http://lkml.org/lkml/2009/8/6/236
> >
> > Changes from the previous version include:
> > - Rebased against Nathan Fontenot's latest pseries kernel handling
> >   of dynamic logical partitioning v4 patches found here:
> >   http://lkml.org/lkml/2009/10/21/98
>
> I can't merge them right now because afaik Nathan's patches are
> still not quite ready. So we need to either get a final version of
> Nathan's patches something like tomorrow, or you need to rebase your
> series on current -next by the end of the week, or I'm afraid it's
> going to miss the next merge window.

Hi Ben,

I had checked with Nathan earlier and he mentioned that he is working
on an update. We can post a rebase to -next tomorrow, but this series
depends on Nathan's patches, hence we will work with him. This
feature is important for the next merge window.

Thanks,
Vaidy
[PATCH v6 0/2] pseries: Add cede support for cpu-offline
Hi,

This is version 6 of the patch series that provides a framework to
choose the state a pseries CPU must be put into when it is offlined.

Previous versions can be found here:
Version 5: http://lkml.org/lkml/2009/10/30/6
Version 4: http://lkml.org/lkml/2009/10/9/59
Version 3: http://lkml.org/lkml/2009/9/15/164
Version 2: http://lkml.org/lkml/2009/8/28/102
Version 1: http://lkml.org/lkml/2009/8/6/236

Changes from the previous version include:

- Built on Nathan Fontenot's v3 of "Kernel handling of Dynamic
  Logical Partitioning": http://lkml.org/lkml/2009/11/25/21

- Rebased to the powerpc.git tree, and hence dropped the 1st and 2nd
  patches in the stack since they are already in the powerpc.git
  tree. With reference to the previous version:

  Dropped:
  1/4 pSeries: extended_cede_processor() helper function
  2/4 pSeries: Add hooks to put the CPU into an appropriate offline state

  Posting only:
  3/4 pseries: Add code to online/offline CPUs of a DLPAR node
  4/4 pseries: Serialize cpu hotplug operations during deactivate Vs deallocate

  There are minor changes in the above patches due to changes in
  Nathan's routines.

Also:

- This approach addresses Peter Z's objections regarding layering
  violations. The user simply offlines the cpu and doesn't worry
  about what state the CPU should be put into; that part is handled
  automatically by the kernel.
  Acked-by: Peter Zijlstra <a.p.zijls...@chello.nl>
  http://lkml.org/lkml/2009/11/11/328

- It does not add any additional sysfs interface; instead it uses the
  existing sysfs interface to offline CPUs.

- On platforms which do not have support for ceding the vcpu with a
  latency specifier value, the offlining mechanism defaults to the
  current method of calling rtas_stop_self().

This patchset is based on powerpc.git + Nathan's patches and has been
built and tested on pseries platforms. This series can be applied to
powerpc.git after Nathan's patches.
Thanks,
Vaidy

---

Gautham R Shenoy (2):
      pseries: Serialize cpu hotplug operations during deactivate Vs deallocate
      pseries: Add code to online/offline CPUs of a DLPAR node

 arch/powerpc/platforms/pseries/dlpar.c | 144 ++--
 drivers/base/cpu.c                     |   2
 include/linux/cpu.h                    |  13 +++
 3 files changed, 150 insertions(+), 9 deletions(-)
--
[PATCH v6 1/2] pseries: Add code to online/offline CPUs of a DLPAR node
From: Gautham R Shenoy <e...@in.ibm.com>

Currently the cpu-allocation/deallocation on pSeries is a two-step
process from userspace:

- Set the indicators and update the device tree by writing to the
  sysfs tunable "probe" during allocation and "release" during
  deallocation.

- Online/offline the CPUs of the allocated / to-be-deallocated node
  by writing to the sysfs tunable "online".

This patch adds kernel code to online/offline the CPUs soon after
allocation / just before deallocation. This way, the userspace tool
that performs DLPAR operations only has to deal with one set of sysfs
tunables, namely "probe" and "release".

Signed-off-by: Gautham R Shenoy <e...@in.ibm.com>
Signed-off-by: Nathan Fontenot <nf...@austin.ibm.com>
Acked-by: Vaidyanathan Srinivasan <sva...@linux.vnet.ibm.com>
---
 arch/powerpc/platforms/pseries/dlpar.c |  101
 1 files changed, 101 insertions(+), 0 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/dlpar.c b/arch/powerpc/platforms/pseries/dlpar.c
index fe8d4b3..642e1b2 100644
--- a/arch/powerpc/platforms/pseries/dlpar.c
+++ b/arch/powerpc/platforms/pseries/dlpar.c
@@ -16,6 +16,7 @@
 #include <linux/proc_fs.h>
 #include <linux/spinlock.h>
 #include <linux/cpu.h>
+#include "offline_states.h"
 
 #include <asm/prom.h>
 #include <asm/machdep.h>
@@ -287,6 +288,98 @@ int dlpar_detach_node(struct device_node *dn)
 	return 0;
 }
 
+int online_node_cpus(struct device_node *dn)
+{
+	int rc = 0;
+	unsigned int cpu;
+	int len, nthreads, i;
+	const u32 *intserv;
+
+	intserv = of_get_property(dn, "ibm,ppc-interrupt-server#s", &len);
+	if (!intserv)
+		return -EINVAL;
+
+	nthreads = len / sizeof(u32);
+
+	cpu_maps_update_begin();
+	for (i = 0; i < nthreads; i++) {
+		for_each_present_cpu(cpu) {
+			if (get_hard_smp_processor_id(cpu) != intserv[i])
+				continue;
+			BUG_ON(get_cpu_current_state(cpu)
+					!= CPU_STATE_OFFLINE);
+			cpu_maps_update_done();
+			rc = cpu_up(cpu);
+			if (rc)
+				goto out;
+			cpu_maps_update_begin();
+
+			break;
+		}
+		if (cpu == num_possible_cpus())
+			printk(KERN_WARNING "Could not find cpu to online "
+			       "with physical id 0x%x\n", intserv[i]);
+	}
+	cpu_maps_update_done();
+
+out:
+	return rc;
+
+}
+
+int offline_node_cpus(struct device_node *dn)
+{
+	int rc = 0;
+	unsigned int cpu;
+	int len, nthreads, i;
+	const u32 *intserv;
+
+	intserv = of_get_property(dn, "ibm,ppc-interrupt-server#s", &len);
+	if (!intserv)
+		return -EINVAL;
+
+	nthreads = len / sizeof(u32);
+
+	cpu_maps_update_begin();
+	for (i = 0; i < nthreads; i++) {
+		for_each_present_cpu(cpu) {
+			if (get_hard_smp_processor_id(cpu) != intserv[i])
+				continue;
+
+			if (get_cpu_current_state(cpu) == CPU_STATE_OFFLINE)
+				break;
+
+			if (get_cpu_current_state(cpu) == CPU_STATE_ONLINE) {
+				cpu_maps_update_done();
+				rc = cpu_down(cpu);
+				if (rc)
+					goto out;
+				cpu_maps_update_begin();
+				break;
+
+			}
+
+			/*
+			 * The cpu is in CPU_STATE_INACTIVE.
+			 * Upgrade its state to CPU_STATE_OFFLINE.
+			 */
+			set_preferred_offline_state(cpu, CPU_STATE_OFFLINE);
+			BUG_ON(plpar_hcall_norets(H_PROD, intserv[i])
+					!= H_SUCCESS);
+			__cpu_die(cpu);
+			break;
+		}
+		if (cpu == num_possible_cpus())
+			printk(KERN_WARNING "Could not find cpu to offline "
+			       "with physical id 0x%x\n", intserv[i]);
+	}
+	cpu_maps_update_done();
+
+out:
+	return rc;
+
+}
+
 #define DR_ENTITY_SENSE		9003
 #define DR_ENTITY_PRESENT	1
 #define DR_ENTITY_UNUSABLE	2
@@ -385,6 +478,8 @@ static ssize_t dlpar_cpu_probe(const char *buf, size_t count)
 		dlpar_free_cc_nodes(dn);
 	}
 
+	rc = online_node_cpus(dn);
+
 	return rc ? rc : count;
 }
 
@@ -404,6 +499,12 @@ static ssize_t dlpar_cpu_release(const char *buf, size_t count)
 		return -EINVAL;
 	}
 
+	rc = offline_node_cpus(dn);
+	if (rc
[PATCH v6 2/2] pseries: Serialize cpu hotplug operations during deactivate Vs deallocate
From: Gautham R Shenoy <e...@in.ibm.com>

Currently the cpu-allocation/deallocation process comprises two
steps:

- Set the indicators and update the device tree with the DLPAR node
  information.

- Online/offline the allocated/deallocated CPU.

This is achieved by writing to the sysfs tunables "probe" during
allocation and "release" during deallocation.

At the same time, userspace can independently online/offline the CPUs
of the system using the sysfs tunable "online".

It is quite possible that when a userspace tool offlines a CPU for
the purpose of deallocation and is in the process of updating the
device tree, some other userspace tool could bring the CPU back
online by writing to the "online" sysfs tunable, thereby causing the
deallocation to fail.

The solution is to serialize writes to the "probe"/"release" sysfs
tunables with writes to the "online" sysfs tunable.

This patch employs a mutex to provide this serialization, which is a
no-op on all architectures except PPC_PSERIES.

Signed-off-by: Gautham R Shenoy <e...@in.ibm.com>
Acked-by: Vaidyanathan Srinivasan <sva...@linux.vnet.ibm.com>
---
 arch/powerpc/platforms/pseries/dlpar.c |   45 +---
 drivers/base/cpu.c                     |    2 +
 include/linux/cpu.h                    |   13 +
 3 files changed, 50 insertions(+), 10 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/dlpar.c b/arch/powerpc/platforms/pseries/dlpar.c
index 642e1b2..fd2f0af 100644
--- a/arch/powerpc/platforms/pseries/dlpar.c
+++ b/arch/powerpc/platforms/pseries/dlpar.c
@@ -436,6 +436,18 @@ int dlpar_release_drc(u32 drc_index)
 
 #ifdef CONFIG_ARCH_CPU_PROBE_RELEASE
 
+static DEFINE_MUTEX(pseries_cpu_hotplug_mutex);
+
+void cpu_hotplug_driver_lock()
+{
+	mutex_lock(&pseries_cpu_hotplug_mutex);
+}
+
+void cpu_hotplug_driver_unlock()
+{
+	mutex_unlock(&pseries_cpu_hotplug_mutex);
+}
+
 static ssize_t dlpar_cpu_probe(const char *buf, size_t count)
 {
 	struct device_node *dn;
@@ -443,13 +455,18 @@ static ssize_t dlpar_cpu_probe(const char *buf, size_t count)
 	char *cpu_name;
 	int rc;
 
+	cpu_hotplug_driver_lock();
 	rc = strict_strtoul(buf, 0, &drc_index);
-	if (rc)
-		return -EINVAL;
+	if (rc) {
+		rc = -EINVAL;
+		goto out;
+	}
 
 	dn = dlpar_configure_connector(drc_index);
-	if (!dn)
-		return -EINVAL;
+	if (!dn) {
+		rc = -EINVAL;
+		goto out;
+	}
 
 	/* configure-connector reports cpus as living in the base
 	 * directory of the device tree.  CPUs actually live in the
@@ -459,7 +476,8 @@ static ssize_t dlpar_cpu_probe(const char *buf, size_t count)
 			   GFP_KERNEL);
 	if (!cpu_name) {
 		dlpar_free_cc_nodes(dn);
-		return -ENOMEM;
+		rc = -ENOMEM;
+		goto out;
 	}
 
 	sprintf(cpu_name, "/cpus%s", dn->full_name);
@@ -469,7 +487,8 @@ static ssize_t dlpar_cpu_probe(const char *buf, size_t count)
 	rc = dlpar_acquire_drc(drc_index);
 	if (rc) {
 		dlpar_free_cc_nodes(dn);
-		return -EINVAL;
+		rc = -EINVAL;
+		goto out;
 	}
 
 	rc = dlpar_attach_node(dn);
@@ -479,6 +498,8 @@ static ssize_t dlpar_cpu_probe(const char *buf, size_t count)
 	}
 
 	rc = online_node_cpus(dn);
+out:
+	cpu_hotplug_driver_unlock();
 
 	return rc ? rc : count;
 }
@@ -499,26 +520,30 @@ static ssize_t dlpar_cpu_release(const char *buf, size_t count)
 		return -EINVAL;
 	}
 
+	cpu_hotplug_driver_lock();
 	rc = offline_node_cpus(dn);
 	if (rc) {
 		of_node_put(dn);
-		return -EINVAL;
+		rc = -EINVAL;
+		goto out;
 	}
 
 	rc = dlpar_release_drc(*drc_index);
 	if (rc) {
 		of_node_put(dn);
-		return -EINVAL;
+		goto out;
 	}
 
 	rc = dlpar_detach_node(dn);
 	if (rc) {
 		dlpar_acquire_drc(*drc_index);
-		return rc;
+		goto out;
 	}
 
 	of_node_put(dn);
-	return count;
+out:
+	cpu_hotplug_driver_unlock();
+	return rc ? rc : count;
 }
 
 static int __init pseries_dlpar_init(void)
diff --git a/drivers/base/cpu.c b/drivers/base/cpu.c
index 7c03af7..27fd775 100644
--- a/drivers/base/cpu.c
+++ b/drivers/base/cpu.c
@@ -35,6 +35,7 @@ static ssize_t __ref store_online(struct sys_device *dev, struct sysdev_attribut
 	struct cpu *cpu = container_of(dev, struct cpu, sysdev);
 	ssize_t ret;
 
+	cpu_hotplug_driver_lock();
 	switch (buf[0]) {
 	case '0':
 		ret = cpu_down(cpu->sysdev.id);
@@ -49,6 +50,7 @@ static ssize_t __ref store_online(struct sys_device *dev, struct sysdev_attribut
 	default:
 		ret = -EINVAL
2.6.24-rc1 on PPC64: machine check exception
Hi, I got the following exception on a 128-way PPC64 machine while running ebizzy-0.2 benchmark. cpu 0x48: Vector: 200 (Machine Check) at [c06df1bb37c0] pc: c007a26c: .exit_robust_list+0x78/0x228 lr: c007a240: .exit_robust_list+0x4c/0x228 sp: c06df1bb3a40 msr: 80109032 current = 0xc06df1725120 paca= 0xc06f4000 pid = 5950, comm = ebizzy cpu 0x78: Vector: 200 (Machine Check) at [c00e21bd37c0] pc: c007a26c: .exit_robust_list+0x78/0x228 lr: c007a240: .exit_robust_list+0x4c/0x228 sp: c00e21bd3a40 msr: 80109032 current = 0xc00e pa paca= 0xc06fb800 cidpid = 5941, comm = ebizzy = 5941,[c06df1bb3b10] c00578a8 .do_exit+0x260/0x7fc [c06df1bb3bc0] c0057ef8 .sys_exit_group+0x0/0x8 [c06df1bb3c50] c00636e0 .get_signal_to_deliver+0x45c/0x4ac [c06df1bb3d00] c0011d44 .do_signal+0x68/0x2e4 [c06df1bb3e30] c0008ae8 do_work+0x28/0x2c --- Exception: 500 (Hardware Interrupt) at 10007698 SP (40016601610) is in userspace Config file attached. I did not see any similar failure reported with 2.6.24-rc1. --Vaidy # # Automatically generated make config: don't edit # Linux kernel version: 2.6.24-rc1 # Mon Nov 5 02:25:46 2007 # CONFIG_PPC64=y # # Processor support # # CONFIG_POWER4_ONLY is not set CONFIG_POWER3=y CONFIG_POWER4=y # CONFIG_TUNE_CELL is not set CONFIG_PPC_FPU=y CONFIG_ALTIVEC=y CONFIG_PPC_STD_MMU=y CONFIG_PPC_MM_SLICES=y CONFIG_VIRT_CPU_ACCOUNTING=y CONFIG_SMP=y CONFIG_NR_CPUS=128 CONFIG_64BIT=y CONFIG_WORD_SIZE=64 CONFIG_PPC_MERGE=y CONFIG_MMU=y CONFIG_GENERIC_CMOS_UPDATE=y CONFIG_GENERIC_TIME=y CONFIG_GENERIC_TIME_VSYSCALL=y CONFIG_GENERIC_CLOCKEVENTS=y CONFIG_GENERIC_HARDIRQS=y CONFIG_IRQ_PER_CPU=y CONFIG_RWSEM_XCHGADD_ALGORITHM=y CONFIG_ARCH_HAS_ILOG2_U32=y CONFIG_ARCH_HAS_ILOG2_U64=y CONFIG_GENERIC_HWEIGHT=y CONFIG_GENERIC_CALIBRATE_DELAY=y CONFIG_GENERIC_FIND_NEXT_BIT=y CONFIG_ARCH_NO_VIRT_TO_BUS=y CONFIG_PPC=y CONFIG_EARLY_PRINTK=y CONFIG_COMPAT=y CONFIG_SYSVIPC_COMPAT=y CONFIG_SCHED_NO_NO_OMIT_FRAME_POINTER=y CONFIG_ARCH_MAY_HAVE_PC_FDC=y CONFIG_PPC_OF=y CONFIG_OF=y 
CONFIG_PPC_UDBG_16550=y # CONFIG_GENERIC_TBSYNC is not set CONFIG_AUDIT_ARCH=y CONFIG_GENERIC_BUG=y # CONFIG_DEFAULT_UIMAGE is not set # CONFIG_PPC_DCR_NATIVE is not set # CONFIG_PPC_DCR_MMIO is not set # CONFIG_PPC_OF_PLATFORM_PCI is not set CONFIG_DEFCONFIG_LIST=/lib/modules/$UNAME_RELEASE/.config # # General setup # CONFIG_EXPERIMENTAL=y CONFIG_LOCK_KERNEL=y CONFIG_INIT_ENV_ARG_LIMIT=32 CONFIG_LOCALVERSION=-sv CONFIG_LOCALVERSION_AUTO=y CONFIG_SWAP=y CONFIG_SYSVIPC=y CONFIG_SYSVIPC_SYSCTL=y CONFIG_POSIX_MQUEUE=y # CONFIG_BSD_PROCESS_ACCT is not set # CONFIG_TASKSTATS is not set # CONFIG_USER_NS is not set CONFIG_AUDIT=y CONFIG_AUDITSYSCALL=y CONFIG_AUDIT_TREE=y CONFIG_IKCONFIG=y CONFIG_IKCONFIG_PROC=y CONFIG_LOG_BUF_SHIFT=17 CONFIG_CGROUPS=y # CONFIG_CGROUP_DEBUG is not set # CONFIG_CGROUP_NS is not set # CONFIG_CGROUP_CPUACCT is not set CONFIG_CPUSETS=y # CONFIG_FAIR_GROUP_SCHED is not set # CONFIG_SYSFS_DEPRECATED is not set CONFIG_PROC_PID_CPUSET=y CONFIG_RELAY=y CONFIG_BLK_DEV_INITRD=y CONFIG_INITRAMFS_SOURCE= CONFIG_CC_OPTIMIZE_FOR_SIZE=y CONFIG_SYSCTL=y # CONFIG_EMBEDDED is not set CONFIG_SYSCTL_SYSCALL=y CONFIG_KALLSYMS=y CONFIG_KALLSYMS_ALL=y # CONFIG_KALLSYMS_EXTRA_PASS is not set CONFIG_HOTPLUG=y CONFIG_PRINTK=y CONFIG_BUG=y CONFIG_ELF_CORE=y CONFIG_BASE_FULL=y CONFIG_FUTEX=y CONFIG_ANON_INODES=y CONFIG_EPOLL=y CONFIG_SIGNALFD=y CONFIG_EVENTFD=y CONFIG_SHMEM=y CONFIG_VM_EVENT_COUNTERS=y CONFIG_SLAB=y # CONFIG_SLUB is not set # CONFIG_SLOB is not set CONFIG_RT_MUTEXES=y # CONFIG_TINY_SHMEM is not set CONFIG_BASE_SMALL=0 CONFIG_MODULES=y CONFIG_MODULE_UNLOAD=y # CONFIG_MODULE_FORCE_UNLOAD is not set CONFIG_MODVERSIONS=y CONFIG_MODULE_SRCVERSION_ALL=y CONFIG_KMOD=y CONFIG_STOP_MACHINE=y CONFIG_BLOCK=y # CONFIG_BLK_DEV_IO_TRACE is not set # CONFIG_BLK_DEV_BSG is not set CONFIG_BLOCK_COMPAT=y # # IO Schedulers # CONFIG_IOSCHED_NOOP=y CONFIG_IOSCHED_AS=y CONFIG_IOSCHED_DEADLINE=y CONFIG_IOSCHED_CFQ=y CONFIG_DEFAULT_AS=y # CONFIG_DEFAULT_DEADLINE is not set # 
CONFIG_DEFAULT_CFQ is not set # CONFIG_DEFAULT_NOOP is not set CONFIG_DEFAULT_IOSCHED=anticipatory # # Platform support # CONFIG_PPC_MULTIPLATFORM=y # CONFIG_PPC_82xx is not set # CONFIG_PPC_83xx is not set # CONFIG_PPC_86xx is not set CONFIG_PPC_PSERIES=y CONFIG_PPC_SPLPAR=y CONFIG_EEH=y CONFIG_SCANLOG=m CONFIG_LPARCFG=y # CONFIG_PPC_ISERIES is not set # CONFIG_PPC_MPC52xx is not set # CONFIG_PPC_MPC5200 is not set # CONFIG_PPC_PMAC is not set # CONFIG_PPC_MAPLE is not set # CONFIG_PPC_PASEMI is not set # CONFIG_PPC_CELLEB is not set # CONFIG_PPC_PS3 is not set # CONFIG_PPC_CELL is not set # CONFIG_PPC_CELL_NATIVE is not set # CONFIG_PPC_IBM_CELL_BLADE is not set # CONFIG_PQ2ADS is not set CONFIG_PPC_NATIVE=y # CONFIG_UDBG_RTAS_CONSOLE is not set CONFIG_XICS=y CONFIG_MPIC=y
Re: 2.6.24-rc1 on PPC64: machine check exception
> > http://patchwork.ozlabs.org/linuxppc/patch?id=14475
>
> Thanks for pointing me to this patch. I will try it out and let you
> know if this fixes the problem.

Hi Anton,

This patch fixed the problem. I am able to run and profile ebizzy on
the 128-way PPC64 machine. However, this fix is not included in
2.6.24-rc2 either. I will watch for inclusion of this fix in 2.6.24.

--Vaidy
Re: [PATCH] powerpc: Quieten cede latency printk
* Balbir Singh <bal...@linux.vnet.ibm.com> [2010-02-08 16:37:37]:
> On Mon, Feb 8, 2010 at 5:22 AM, Anton Blanchard <an...@samba.org> wrote:
> > The cede latency stuff is relatively new and we don't need to
> > complain about it not working on older firmware.
> >
> > Signed-off-by: Anton Blanchard <an...@samba.org>
>
> Acked-by: Balbir Singh <bal...@linux.vnet.ibm.com>
>
> Seems like a reasonable change. CC'ing Vaidy and Gautham to look even
> more closely at the patch.

Hi Anton,

Removing the printk is reasonable. The failure case is just noise on
older firmware, while the success case may not be very informative.

Thanks,
Vaidy
[PATCH] powerpc: reset kernel stack on cpu online from cede state
powerpc: reset kernel stack on cpu online from cede state

Cpu hotplug (offline) without a dlpar operation will place the cpu in
cede state, and the extended_cede_processor() function will return
when resumed. The kernel stack pointer needs to be reset before
start_secondary() is called to continue the online operation.

Added a new function start_secondary_resume() to do the above steps.

Signed-off-by: Vaidyanathan Srinivasan <sva...@linux.vnet.ibm.com>

diff --git a/arch/powerpc/kernel/head_64.S b/arch/powerpc/kernel/head_64.S
index 9258074..567cd57 100644
--- a/arch/powerpc/kernel/head_64.S
+++ b/arch/powerpc/kernel/head_64.S
@@ -615,6 +615,17 @@ _GLOBAL(start_secondary_prolog)
 	std	r3,0(r1)	/* Zero the stack frame pointer	*/
 	bl	.start_secondary
 	b	.
+/*
+ * Reset stack pointer and call start_secondary
+ * to continue with online operation when woken up
+ * from cede in cpu offline.
+ */
+_GLOBAL(start_secondary_resume)
+	ld	r1,PACAKSAVE(r13)	/* Reload kernel stack pointer */
+	li	r3,0
+	std	r3,0(r1)	/* Zero the stack frame pointer	*/
+	bl	.start_secondary
+	b	.
 #endif
 
 /*
diff --git a/arch/powerpc/platforms/pseries/hotplug-cpu.c b/arch/powerpc/platforms/pseries/hotplug-cpu.c
index 6ea4698..9be7af4 100644
--- a/arch/powerpc/platforms/pseries/hotplug-cpu.c
+++ b/arch/powerpc/platforms/pseries/hotplug-cpu.c
@@ -146,12 +146,11 @@ static void pseries_mach_cpu_die(void)
 			unregister_slb_shadow(hwcpu, __pa(get_slb_shadow()));
 
 			/*
-			 * NOTE: Calling start_secondary() here for now to
-			 * start new context.
-			 * However, need to do it cleanly by resetting the
-			 * stack pointer.
+			 * Call to start_secondary_resume() will not return.
+			 * Kernel stack will be reset and start_secondary()
+			 * will be called to continue the online operation.
 			 */
-			start_secondary();
+			start_secondary_resume();
 		} else if (get_preferred_offline_state(cpu) ==
 			   CPU_STATE_OFFLINE) {
 
diff --git a/arch/powerpc/platforms/pseries/offline_states.h b/arch/powerpc/platforms/pseries/offline_states.h
index 22574e0..da07256 100644
--- a/arch/powerpc/platforms/pseries/offline_states.h
+++ b/arch/powerpc/platforms/pseries/offline_states.h
@@ -14,5 +14,5 @@ extern void set_cpu_current_state(int cpu, enum cpu_state_vals state);
 extern enum cpu_state_vals get_preferred_offline_state(int cpu);
 extern void set_preferred_offline_state(int cpu, enum cpu_state_vals state);
 extern void set_default_offline_state(int cpu);
-extern int start_secondary(void);
+extern void start_secondary_resume(void);
 #endif
[PATCH v2 0/3] powerpc: bug fixes in pseries_mach_cpu_die()
Hi Ben,

The following set of patches fixes a kernel stack reset issue and
also potential race conditions. This fix should be applied in the
2.6.33 stable tree onwards.

Problem description:

(1) Repeated offline/online operations on pseries with the extended
cede processor feature will run over the kernel stack and crash,
since the stack was not reset on resume from cede.

(2) The 'if' conditions and code have been slightly rearranged to
avoid any possible race conditions where preferred_offline_state can
change while the cpu is still executing the condition checks in
pseries_mach_cpu_die(). The new code makes CPU_STATE_OFFLINE the
default and also improves readability. Thanks to Anton Blanchard for
pointing this out.

(3) There were too many noisy KERN_INFO printks on offline/online
operations. Removed the printks for now; other debug methods or
trace events can be introduced at a later point.

The patches have been tested on a large POWER machine with both the
cede_offline=off case and the default cede_offline=on case.

Please apply to the powerpc.git tree and push upstream.

The previous version of the patch can be found at:

[PATCH] powerpc: reset kernel stack on cpu online from cede state
http://lists.ozlabs.org/pipermail/linuxppc-dev/2010-February/080515.html

Thanks,
Vaidy

---

Vaidyanathan Srinivasan (3):
      powerpc: reset kernel stack on cpu online from cede state
      powerpc: move checks in pseries_mach_cpu_die()
      powerpc: reduce printk from pseries_mach_cpu_die()

 arch/powerpc/kernel/head_64.S                   |   11 ++
 arch/powerpc/platforms/pseries/hotplug-cpu.c    |   42 ---
 arch/powerpc/platforms/pseries/offline_states.h |    2 +
 3 files changed, 27 insertions(+), 28 deletions(-)
[PATCH v2 1/3] powerpc: reset kernel stack on cpu online from cede state
Cpu hotplug (offline) without dlpar operation will place cpu in cede state and the extended_cede_processor() function will return when resumed. Kernel stack pointer needs to be reset before start_secondary() is called to continue the online operation. Added new function start_secondary_resume() to do the above steps. Signed-off-by: Vaidyanathan Srinivasan sva...@linux.vnet.ibm.com Cc: Gautham R Shenoy e...@in.ibm.com --- arch/powerpc/kernel/head_64.S | 11 +++ arch/powerpc/platforms/pseries/hotplug-cpu.c|9 - arch/powerpc/platforms/pseries/offline_states.h |2 +- 3 files changed, 16 insertions(+), 6 deletions(-) diff --git a/arch/powerpc/kernel/head_64.S b/arch/powerpc/kernel/head_64.S index 9258074..567cd57 100644 --- a/arch/powerpc/kernel/head_64.S +++ b/arch/powerpc/kernel/head_64.S @@ -615,6 +615,17 @@ _GLOBAL(start_secondary_prolog) std r3,0(r1)/* Zero the stack frame pointer */ bl .start_secondary b . +/* + * Reset stack pointer and call start_secondary + * to continue with online operation when woken up + * from cede in cpu offline. + */ +_GLOBAL(start_secondary_resume) + ld r1,PACAKSAVE(r13) /* Reload kernel stack pointer */ + li r3,0 + std r3,0(r1)/* Zero the stack frame pointer */ + bl .start_secondary + b . #endif /* diff --git a/arch/powerpc/platforms/pseries/hotplug-cpu.c b/arch/powerpc/platforms/pseries/hotplug-cpu.c index 6ea4698..9be7af4 100644 --- a/arch/powerpc/platforms/pseries/hotplug-cpu.c +++ b/arch/powerpc/platforms/pseries/hotplug-cpu.c @@ -146,12 +146,11 @@ static void pseries_mach_cpu_die(void) unregister_slb_shadow(hwcpu, __pa(get_slb_shadow())); /* -* NOTE: Calling start_secondary() here for now to -* start new context. -* However, need to do it cleanly by resetting the -* stack pointer. +* Call to start_secondary_resume() will not return. +* Kernel stack will be reset and start_secondary() +* will be called to continue the online operation. 
*/ - start_secondary(); + start_secondary_resume(); } else if (get_preferred_offline_state(cpu) == CPU_STATE_OFFLINE) { diff --git a/arch/powerpc/platforms/pseries/offline_states.h b/arch/powerpc/platforms/pseries/offline_states.h index 22574e0..da07256 100644 --- a/arch/powerpc/platforms/pseries/offline_states.h +++ b/arch/powerpc/platforms/pseries/offline_states.h @@ -14,5 +14,5 @@ extern void set_cpu_current_state(int cpu, enum cpu_state_vals state); extern enum cpu_state_vals get_preferred_offline_state(int cpu); extern void set_preferred_offline_state(int cpu, enum cpu_state_vals state); extern void set_default_offline_state(int cpu); -extern int start_secondary(void); +extern void start_secondary_resume(void); #endif ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
[PATCH v2 2/3] powerpc: move checks in pseries_mach_cpu_die()
Rearrange condition checks for better code readability and prevention of possible race conditions when preferred_offline_state can potentially change during the execution of pseries_mach_cpu_die(). The patch makes pseries_mach_cpu_die() put the cpu into one of the consistent states instead of running past the checks and hitting the BUG() at the end. Signed-off-by: Vaidyanathan Srinivasan sva...@linux.vnet.ibm.com Cc: Gautham R Shenoy e...@in.ibm.com --- arch/powerpc/platforms/pseries/hotplug-cpu.c | 30 +- 1 files changed, 15 insertions(+), 15 deletions(-) diff --git a/arch/powerpc/platforms/pseries/hotplug-cpu.c b/arch/powerpc/platforms/pseries/hotplug-cpu.c index 9be7af4..39ecb40 100644 --- a/arch/powerpc/platforms/pseries/hotplug-cpu.c +++ b/arch/powerpc/platforms/pseries/hotplug-cpu.c @@ -140,25 +140,25 @@ static void pseries_mach_cpu_die(void) if (!get_lppaca()->shared_proc) get_lppaca()->donate_dedicated_cpu = 0; get_lppaca()->idle = 0; - } - if (get_preferred_offline_state(cpu) == CPU_STATE_ONLINE) { - unregister_slb_shadow(hwcpu, __pa(get_slb_shadow())); + if (get_preferred_offline_state(cpu) == CPU_STATE_ONLINE) { + unregister_slb_shadow(hwcpu, __pa(get_slb_shadow())); - /* -* Call to start_secondary_resume() will not return. -* Kernel stack will be reset and start_secondary() -* will be called to continue the online operation. -*/ - start_secondary_resume(); + /* +* Call to start_secondary_resume() will not return. +* Kernel stack will be reset and start_secondary() +* will be called to continue the online operation.
+*/ + start_secondary_resume(); + } + } - } else if (get_preferred_offline_state(cpu) == CPU_STATE_OFFLINE) { + /* Requested state is CPU_STATE_OFFLINE at this point */ + WARN_ON(get_preferred_offline_state(cpu) != CPU_STATE_OFFLINE); - set_cpu_current_state(cpu, CPU_STATE_OFFLINE); - unregister_slb_shadow(hard_smp_processor_id(), - __pa(get_slb_shadow())); - rtas_stop_self(); - } + set_cpu_current_state(cpu, CPU_STATE_OFFLINE); + unregister_slb_shadow(hwcpu, __pa(get_slb_shadow())); + rtas_stop_self(); /* Should never get here... */ BUG(); ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
[PATCH v2 3/3] powerpc: reduce printk from pseries_mach_cpu_die()
Remove debug printks in pseries_mach_cpu_die(). These are noisy at runtime. Trace events can be added later to instrument this section of code. The following KERN_INFO printks are removed: cpu 62 (hwid 62) returned from cede. Decrementer value = b2802fff Timebase value = 2fa8f95035f4a cpu 62 (hwid 62) got prodded to go online cpu 58 (hwid 58) ceding for offline with hint 2 Signed-off-by: Vaidyanathan Srinivasan sva...@linux.vnet.ibm.com Cc: Gautham R Shenoy e...@in.ibm.com --- arch/powerpc/platforms/pseries/hotplug-cpu.c | 11 --- 1 files changed, 0 insertions(+), 11 deletions(-) diff --git a/arch/powerpc/platforms/pseries/hotplug-cpu.c b/arch/powerpc/platforms/pseries/hotplug-cpu.c index 39ecb40..b842378 100644 --- a/arch/powerpc/platforms/pseries/hotplug-cpu.c +++ b/arch/powerpc/platforms/pseries/hotplug-cpu.c @@ -122,21 +122,10 @@ static void pseries_mach_cpu_die(void) if (!get_lppaca()->shared_proc) get_lppaca()->donate_dedicated_cpu = 1; - printk(KERN_INFO - "cpu %u (hwid %u) ceding for offline with hint %d\n", - cpu, hwcpu, cede_latency_hint); while (get_preferred_offline_state(cpu) == CPU_STATE_INACTIVE) { extended_cede_processor(cede_latency_hint); - printk(KERN_INFO "cpu %u (hwid %u) returned from cede.\n", - cpu, hwcpu); - printk(KERN_INFO - "Decrementer value = %x Timebase value = %llx\n", - get_dec(), get_tb()); } - printk(KERN_INFO "cpu %u (hwid %u) got prodded to go online\n", - cpu, hwcpu); - if (!get_lppaca()->shared_proc) get_lppaca()->donate_dedicated_cpu = 0; get_lppaca()->idle = 0;
[RFC] powerpc: add support for new hcall H_BEST_ENERGY
Hi, A new hypervisor call H_BEST_ENERGY is supported on the IBM pseries ppc64 platform in order to export platform energy management capabilities to user space administrative tools. This hypervisor call provides hints or a suggested list of cpus to activate or deactivate for optimized energy management. The following patch adds a new pseries-specific device driver that will export such platform energy management related information over a sysfs interface for use by user space administrative tools. The device driver exports the following sysfs files: /sys/devices/system/cpu/pseries_(de)activate_hint_list and /sys/devices/system/cpu/cpuN/pseries_(de)activate_hint. Typical usage: # cat /sys/devices/system/cpu/pseries_activate_hint_list 16,20,24 # cat /sys/devices/system/cpu/pseries_deactivate_hint_list 0,4,8,12 Comma separated list of recommended cpu numbers to activate or deactivate. cat /sys/devices/system/cpu/cpuN/pseries_(de)activate_hint will provide a single integer value returned by the platform for that cpu. Please let me know your comments and suggestions. Thanks, Vaidy --- powerpc: add support for new hcall H_BEST_ENERGY Create sysfs interface to export data from the H_BEST_ENERGY hcall that can be used by administrative tools on supported pseries platforms for energy management optimizations. /sys/devices/system/cpu/pseries_(de)activate_hint_list and /sys/devices/system/cpu/cpuN/pseries_(de)activate_hint will provide hints for activation and deactivation of cpus respectively.
Added new driver module arch/powerpc/platforms/pseries/pseries_energy.c under new config option CONFIG_PSERIES_ENERGY Signed-off-by: Vaidyanathan Srinivasan sva...@linux.vnet.ibm.com --- arch/powerpc/include/asm/hvcall.h | 3 arch/powerpc/kernel/setup-common.c | 2 arch/powerpc/platforms/pseries/Kconfig | 10 + arch/powerpc/platforms/pseries/Makefile | 1 arch/powerpc/platforms/pseries/pseries_energy.c | 246 +++ 5 files changed, 261 insertions(+), 1 deletions(-) create mode 100644 arch/powerpc/platforms/pseries/pseries_energy.c diff --git a/arch/powerpc/include/asm/hvcall.h b/arch/powerpc/include/asm/hvcall.h index f027581..396599f 100644 --- a/arch/powerpc/include/asm/hvcall.h +++ b/arch/powerpc/include/asm/hvcall.h @@ -230,7 +230,8 @@ #define H_ENABLE_CRQ 0x2B0 #define H_SET_MPP 0x2D0 #define H_GET_MPP 0x2D4 -#define MAX_HCALL_OPCODE H_GET_MPP +#define H_BEST_ENERGY 0x2F4 +#define MAX_HCALL_OPCODE H_BEST_ENERGY #ifndef __ASSEMBLY__ diff --git a/arch/powerpc/kernel/setup-common.c b/arch/powerpc/kernel/setup-common.c index 03dd6a2..fbd93e3 100644 --- a/arch/powerpc/kernel/setup-common.c +++ b/arch/powerpc/kernel/setup-common.c @@ -359,6 +359,8 @@ void __init check_for_initrd(void) #ifdef CONFIG_SMP int threads_per_core, threads_shift; +EXPORT_SYMBOL_GPL(threads_per_core); + cpumask_t threads_core_mask; static void __init cpu_init_thread_core_maps(int tpc) diff --git a/arch/powerpc/platforms/pseries/Kconfig b/arch/powerpc/platforms/pseries/Kconfig index c667f0f..b3dd108 100644 --- a/arch/powerpc/platforms/pseries/Kconfig +++ b/arch/powerpc/platforms/pseries/Kconfig @@ -33,6 +33,16 @@ config PSERIES_MSI depends on PCI_MSI && EEH default y +config PSERIES_ENERGY + tristate "pseries energy management capabilities driver" + depends on PPC_PSERIES + default y + help + Provides interface to platform energy management capabilities + on supported PSERIES platforms.
+ Provides: /sys/devices/system/cpu/pseries_(de)activation_hint_list + and /sys/devices/system/cpu/cpuN/pseries_(de)activation_hint + config SCANLOG tristate "Scanlog dump interface" depends on RTAS_PROC && PPC_PSERIES diff --git a/arch/powerpc/platforms/pseries/Makefile b/arch/powerpc/platforms/pseries/Makefile index 0ff5174..3813977 100644 --- a/arch/powerpc/platforms/pseries/Makefile +++ b/arch/powerpc/platforms/pseries/Makefile @@ -16,6 +16,7 @@ obj-$(CONFIG_EEH) += eeh.o eeh_cache.o eeh_driver.o eeh_event.o eeh_sysfs.o obj-$(CONFIG_KEXEC) += kexec.o obj-$(CONFIG_PCI) += pci.o pci_dlpar.o obj-$(CONFIG_PSERIES_MSI) += msi.o +obj-$(CONFIG_PSERIES_ENERGY) += pseries_energy.o obj-$(CONFIG_HOTPLUG_CPU) += hotplug-cpu.o obj-$(CONFIG_MEMORY_HOTPLUG) += hotplug-memory.o diff --git a/arch/powerpc/platforms/pseries/pseries_energy.c b/arch/powerpc/platforms/pseries/pseries_energy.c new file mode 100644 index 000..0390188 --- /dev/null +++ b/arch/powerpc/platforms/pseries/pseries_energy.c @@ -0,0 +1,246 @@ +/* + * POWER platform energy management driver + * Copyright (C) 2010 IBM Corporation + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public
Re: [RFC] powerpc: add support for new hcall H_BEST_ENERGY
* Dipankar Sarma dipan...@in.ibm.com [2010-03-06 00:48:11]: On Wed, Mar 03, 2010 at 11:48:22PM +0530, Vaidyanathan Srinivasan wrote: static void __init cpu_init_thread_core_maps(int tpc) diff --git a/arch/powerpc/platforms/pseries/Kconfig b/arch/powerpc/platforms/pseries/Kconfig index c667f0f..b3dd108 100644 --- a/arch/powerpc/platforms/pseries/Kconfig +++ b/arch/powerpc/platforms/pseries/Kconfig @@ -33,6 +33,16 @@ config PSERIES_MSI depends on PCI_MSI && EEH default y +config PSERIES_ENERGY + tristate "pseries energy management capabilities driver" + depends on PPC_PSERIES + default y + help + Provides interface to platform energy management capabilities + on supported PSERIES platforms. + Provides: /sys/devices/system/cpu/pseries_(de)activation_hint_list + and /sys/devices/system/cpu/cpuN/pseries_(de)activation_hint + config SCANLOG tristate "Scanlog dump interface" depends on RTAS_PROC && PPC_PSERIES . +static int __init pseries_energy_init(void) +{ + int cpu, err; + struct sys_device *cpu_sys_dev; + + /* Create the sysfs files */ + err = sysfs_create_file(&cpu_sysdev_class.kset.kobj, + &attr_cpu_activate_hint_list.attr); + if (!err) + err = sysfs_create_file(&cpu_sysdev_class.kset.kobj, + &attr_cpu_deactivate_hint_list.attr); + + for_each_possible_cpu(cpu) { + cpu_sys_dev = get_cpu_sysdev(cpu); + err = sysfs_create_file(&cpu_sys_dev->kobj, + &attr_percpu_activate_hint.attr); + if (err) + break; + err = sysfs_create_file(&cpu_sys_dev->kobj, + &attr_percpu_deactivate_hint.attr); + if (err) + break; + } + return err; + +} Shouldn't we create this only for supported platforms ? Hi Dipankar, Yes, we will need a check like firmware_has_feature(FW_FEATURE_BEST_ENERGY) to avoid creating the sysfs files on unsupported platforms. I will add that check in the next iteration. Thanks, Vaidy
Re: [RFC] powerpc: add support for new hcall H_BEST_ENERGY
* Dipankar Sarma dipan...@in.ibm.com [2010-03-09 00:58:26]: On Mon, Mar 08, 2010 at 12:20:06PM +0530, Vaidyanathan Srinivasan wrote: * Dipankar Sarma dipan...@in.ibm.com [2010-03-06 00:48:11]: Shouldn't we create this only for supported platforms ? Hi Dipankar, Yes we will need a check like firmware_has_feature(FW_FEATURE_BEST_ENERGY) to avoid sysfs files in unsupported platforms. I will add that check in the next iteration. Also, given that this module isn't likely to provide anything on older platforms, it should get loaded only on newer platforms. Yes, I will explore methods to achieve this. I could not yet find a trigger to load modules based on firmware feature. --Vaidy ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
[PATCH v2] powerpc: export data from new hcall H_EM_GET_PARMS
Hi Ben, I have cleaned up the code from the previous post: http://lists.ozlabs.org/pipermail/linuxppc-dev/2010-February/080630.html Changes from v1: * Removed redundant return statements and rearranged code Description: A new hcall H_EM_GET_PARMS has been added to obtain power mode data from the platform. This data can be used by user space administrative tools for better power management. The following patch adds data from this new hcall into the lparcfg driver and exports it to user space along with other existing lpar data in /proc/powerpc/lparcfg Please review and include in the powerpc -next tree. Thanks, Vaidy --- powerpc: export data from new hcall H_EM_GET_PARMS Add support for the H_EM_GET_PARMS hcall that will return data related to power modes from the platform. Export the data directly to user space for administrative tools to interpret and use. cat /proc/powerpc/lparcfg will export power mode data Signed-off-by: Vaidyanathan Srinivasan sva...@linux.vnet.ibm.com --- arch/powerpc/include/asm/hvcall.h | 1 + arch/powerpc/kernel/lparcfg.c | 12 +++- 2 files changed, 12 insertions(+), 1 deletions(-) diff --git a/arch/powerpc/include/asm/hvcall.h b/arch/powerpc/include/asm/hvcall.h index f027581..ebe7493 100644 --- a/arch/powerpc/include/asm/hvcall.h +++ b/arch/powerpc/include/asm/hvcall.h @@ -228,6 +228,7 @@ #define H_JOIN 0x298 #define H_VASI_STATE 0x2A4 #define H_ENABLE_CRQ 0x2B0 +#define H_GET_EM_PARMS 0x2B8 #define H_SET_MPP 0x2D0 #define H_GET_MPP 0x2D4 #define MAX_HCALL_OPCODE H_GET_MPP diff --git a/arch/powerpc/kernel/lparcfg.c b/arch/powerpc/kernel/lparcfg.c index d09d1c6..ff698fb 100644 --- a/arch/powerpc/kernel/lparcfg.c +++ b/arch/powerpc/kernel/lparcfg.c @@ -37,7 +37,7 @@ #include <asm/vio.h> #include <asm/mmu.h> -#define MODULE_VERS "1.8" +#define MODULE_VERS "1.9" #define MODULE_NAME "lparcfg" /* #define LPARCFG_DEBUG */ @@ -486,6 +486,14 @@ static void splpar_dispatch_data(struct seq_file *m) seq_printf(m, "dispatch_dispersions=%lu\n", dispatch_dispersions); } +static
void parse_em_data(struct seq_file *m) +{ + unsigned long retbuf[PLPAR_HCALL_BUFSIZE]; + + if (plpar_hcall(H_GET_EM_PARMS, retbuf) == H_SUCCESS) + seq_printf(m, "power_mode_data=%016lx\n", retbuf[0]); +} + static int pseries_lparcfg_data(struct seq_file *m, void *v) { int partition_potential_processors; @@ -540,6 +548,8 @@ static int pseries_lparcfg_data(struct seq_file *m, void *v) seq_printf(m, "slb_size=%d\n", mmu_slb_size); + parse_em_data(m); + return 0; }
Re: [PATCH v2] powerpc: export data from new hcall H_EM_GET_PARMS
* Vaidyanathan Srinivasan sva...@linux.vnet.ibm.com [2010-03-10 17:49:25]: Hi Ben, I have cleaned up the code from the previous post: http://lists.ozlabs.org/pipermail/linuxppc-dev/2010-February/080630.html Changes from v1: * Removed redundant return statements and rearranged code Description: A new hcall H_EM_GET_PARMS as been added to obtain power mode data from the platform. This data can be used by user space administrative tools for better power management. The following patch add data from this new hcall into the lparcfg driver and exports to user space along with other existing lpar data in /proc/powerpc/lparcfg Please review and include in powerpc -next tree. Hi Ben, Looks like you have not opened the -next tree after the merge window yet. The current git tree status is same as -merge. Please review and include in powerpc next tree. Thanks, Vaidy --- powerpc: export data from new hcall H_EM_GET_PARMS Add support for H_EM_GET_PARMS hcall that will return data related to power modes from the platform. Export the data directly to user space for administrative tools to interpret and use. 
cat /proc/powerpc/lparcfg will export power mode data Signed-off-by: Vaidyanathan Srinivasan sva...@linux.vnet.ibm.com --- arch/powerpc/include/asm/hvcall.h |1 + arch/powerpc/kernel/lparcfg.c | 12 +++- 2 files changed, 12 insertions(+), 1 deletions(-) diff --git a/arch/powerpc/include/asm/hvcall.h b/arch/powerpc/include/asm/hvcall.h index f027581..ebe7493 100644 --- a/arch/powerpc/include/asm/hvcall.h +++ b/arch/powerpc/include/asm/hvcall.h @@ -228,6 +228,7 @@ #define H_JOIN 0x298 #define H_VASI_STATE0x2A4 #define H_ENABLE_CRQ 0x2B0 +#define H_GET_EM_PARMS 0x2B8 #define H_SET_MPP0x2D0 #define H_GET_MPP0x2D4 #define MAX_HCALL_OPCODE H_GET_MPP diff --git a/arch/powerpc/kernel/lparcfg.c b/arch/powerpc/kernel/lparcfg.c index d09d1c6..ff698fb 100644 --- a/arch/powerpc/kernel/lparcfg.c +++ b/arch/powerpc/kernel/lparcfg.c @@ -37,7 +37,7 @@ #include asm/vio.h #include asm/mmu.h -#define MODULE_VERS 1.8 +#define MODULE_VERS 1.9 #define MODULE_NAME lparcfg /* #define LPARCFG_DEBUG */ @@ -486,6 +486,14 @@ static void splpar_dispatch_data(struct seq_file *m) seq_printf(m, dispatch_dispersions=%lu\n, dispatch_dispersions); } +static void parse_em_data(struct seq_file *m) +{ + unsigned long retbuf[PLPAR_HCALL_BUFSIZE]; + + if (plpar_hcall(H_GET_EM_PARMS, retbuf) == H_SUCCESS) + seq_printf(m, power_mode_data=%016lx\n, retbuf[0]); +} + static int pseries_lparcfg_data(struct seq_file *m, void *v) { int partition_potential_processors; @@ -540,6 +548,8 @@ static int pseries_lparcfg_data(struct seq_file *m, void *v) seq_printf(m, slb_size=%d\n, mmu_slb_size); + parse_em_data(m); + return 0; } ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
[PATCH v2 repost] powerpc: export data from new hcall H_EM_GET_PARMS
Hi Ben, Previous post: http://patchwork.ozlabs.org/patch/47249/ I have cleaned up the code from the v1: http://lists.ozlabs.org/pipermail/linuxppc-dev/2010-February/080630.html Changes from v1: * Removed redundant return statements and rearranged code Description: A new hcall H_EM_GET_PARMS as been added to obtain power mode data from the platform. This data can be used by user space administrative tools for better power management. The following patch add data from this new hcall into the lparcfg driver and exports to user space along with other existing lpar data in /proc/powerpc/lparcfg Please review and include in powerpc -next tree. Thanks, Vaidy --- powerpc: export data from new hcall H_EM_GET_PARMS Add support for H_EM_GET_PARMS hcall that will return data related to power modes from the platform. Export the data directly to user space for administrative tools to interpret and use. cat /proc/powerpc/lparcfg will export power mode data Signed-off-by: Vaidyanathan Srinivasan sva...@linux.vnet.ibm.com --- arch/powerpc/include/asm/hvcall.h |1 + arch/powerpc/kernel/lparcfg.c | 12 +++- 2 files changed, 12 insertions(+), 1 deletions(-) diff --git a/arch/powerpc/include/asm/hvcall.h b/arch/powerpc/include/asm/hvcall.h index f027581..ebe7493 100644 --- a/arch/powerpc/include/asm/hvcall.h +++ b/arch/powerpc/include/asm/hvcall.h @@ -228,6 +228,7 @@ #define H_JOIN 0x298 #define H_VASI_STATE0x2A4 #define H_ENABLE_CRQ 0x2B0 +#define H_GET_EM_PARMS 0x2B8 #define H_SET_MPP 0x2D0 #define H_GET_MPP 0x2D4 #define MAX_HCALL_OPCODE H_GET_MPP diff --git a/arch/powerpc/kernel/lparcfg.c b/arch/powerpc/kernel/lparcfg.c index d09d1c6..ff698fb 100644 --- a/arch/powerpc/kernel/lparcfg.c +++ b/arch/powerpc/kernel/lparcfg.c @@ -37,7 +37,7 @@ #include asm/vio.h #include asm/mmu.h -#define MODULE_VERS 1.8 +#define MODULE_VERS 1.9 #define MODULE_NAME lparcfg /* #define LPARCFG_DEBUG */ @@ -486,6 +486,14 @@ static void splpar_dispatch_data(struct seq_file *m) seq_printf(m, 
dispatch_dispersions=%lu\n, dispatch_dispersions); } +static void parse_em_data(struct seq_file *m) +{ + unsigned long retbuf[PLPAR_HCALL_BUFSIZE]; + + if (plpar_hcall(H_GET_EM_PARMS, retbuf) == H_SUCCESS) + seq_printf(m, power_mode_data=%016lx\n, retbuf[0]); +} + static int pseries_lparcfg_data(struct seq_file *m, void *v) { int partition_potential_processors; @@ -540,6 +548,8 @@ static int pseries_lparcfg_data(struct seq_file *m, void *v) seq_printf(m, slb_size=%d\n, mmu_slb_size); + parse_em_data(m); + return 0; } ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [RFC] powerpc: add support for new hcall H_BEST_ENERGY
* Benjamin Herrenschmidt b...@kernel.crashing.org [2010-04-07 12:04:49]: On Wed, 2010-03-03 at 23:48 +0530, Vaidyanathan Srinivasan wrote: Hi, diff --git a/arch/powerpc/kernel/setup-common.c b/arch/powerpc/kernel/setup-common.c index 03dd6a2..fbd93e3 100644 --- a/arch/powerpc/kernel/setup-common.c +++ b/arch/powerpc/kernel/setup-common.c @@ -359,6 +359,8 @@ void __init check_for_initrd(void) #ifdef CONFIG_SMP int threads_per_core, threads_shift; +EXPORT_SYMBOL_GPL(threads_per_core); While I agree it should be exported for the APIs in cputhread.h to be usable from a module, this variable shouldn't be -used- directly, but only via the API functions in there. Ok agreed. This will be the first module to use this and hence will have to implement the API and export the function. .../... + +#define MODULE_VERS "1.0" +#define MODULE_NAME "pseries_energy" + +/* Helper routines to convert between drc_index and cpu numbers */ + +static u32 cpu_to_drc_index(int cpu) +{ + struct device_node *dn = NULL; + const int *indexes; + int i; + dn = of_find_node_by_name(dn, "cpus"); Check the result. Also that's not a nice way to do that, you should look for /cpus by path I reckon. I will check the return code, but why do you feel getting the node by path is better? Is there any race, or we may have duplicate cpus node? + indexes = of_get_property(dn, "ibm,drc-indexes", NULL); Check the result here too. Yes, will do. + /* Convert logical cpu number to core number */ + i = cpu / threads_per_core; Don't use that variable as I said earlier. Use cpu_thread_to_core() Ack + /* +* The first element indexes[0] is the number of drc_indexes +* returned in the list. Hence i+1 will get the drc_index +* corresponding to core number i.
+*/ + WARN_ON(i > indexes[0]); + return indexes[i + 1]; +} + +static int drc_index_to_cpu(u32 drc_index) +{ + struct device_node *dn = NULL; + const int *indexes; + int i, cpu; + dn = of_find_node_by_name(dn, "cpus"); + indexes = of_get_property(dn, "ibm,drc-indexes", NULL); Same comments, check results and use /cpus path Sure. Will do. + /* +* First element in the array is the number of drc_indexes +* returned. Search through the list to find the matching +* drc_index and get the core number +*/ + for (i = 0; i < indexes[0]; i++) { + if (indexes[i + 1] == drc_index) + break; + } + /* Convert core number to logical cpu number */ + cpu = i * threads_per_core; Here's more annoying as we don't have an API in cputhread.h for that. In fact, we have a confusion in there since cpu_first_thread_in_core() doesn't actually take a core number ... I'm going to recommend a complicated approach but that's the best in the long run: - First do a patch that renames cpu_first_thread_in_core() to something clearer like cpu_first_thread_in_same_core() or cpu_leftmost_sibling() (I think I prefer the latter). Rename the few users in arch/powerpc/mm/mmu_context_nohash.c and arch/powerpc/kernel/smp.c - Then do a patch that adds a cpu_first_thread_of_core() that takes a core number, basically does: static inline int cpu_first_thread_of_core(int core) { return core << threads_shift; } - Then add your modified H_BEST_ENERGY patch on top of it using the above two as pre-reqs :-) Ok, I can do that change and then base this patch on the new API. + return cpu; +} +/* + * pseries hypervisor call H_BEST_ENERGY provides hints to OS on + * preferred logical cpus to activate or deactivate for optimized + * energy consumption.
+ */ + +#define FLAGS_MODE1 0x004E20080E01 +#define FLAGS_MODE2 0x004E20080401 +#define FLAGS_ACTIVATE 0x100 + +static ssize_t get_best_energy_list(char *page, int activate) +{ + int rc, cnt, i, cpu; + unsigned long retbuf[PLPAR_HCALL9_BUFSIZE]; + unsigned long flags = 0; + u32 *buf_page; + char *s = page; + + buf_page = (u32 *) get_zeroed_page(GFP_KERNEL); + Why that blank line ? I will remove them + if (!buf_page) + return 0; So here you return 0 instead of -ENOMEM Will fix. -ENOMEM is correct. + flags = FLAGS_MODE1; + if (activate) + flags |= FLAGS_ACTIVATE; + + rc = plpar_hcall9(H_BEST_ENERGY, retbuf, flags, 0, __pa(buf_page), + 0, 0, 0, 0, 0, 0); + + Again, no need for a blank line before xx = foo() and if (xx) Ack. if (rc != H_SUCCESS) { + free_page((unsigned long) buf_page); + return -EINVAL; + } And here you return an error code. Which one is right ? Returning an error code is correct to user space. The call will check for return value on read while getting the list. Thanks for the detailed review. I will fix the issues that you pointed
[RFC PATCH v2 0/2] powerpc: add support for new hcall H_BEST_ENERGY
Hi, The following series adds a powerpc pseries platform module for exporting platform energy management capabilities. The module exports data from a new hypervisor call H_BEST_ENERGY. Comments and suggestions made on the previous iteration of the patch has been incorporated. [RFC] powerpc: add support for new hcall H_BEST_ENERGY http://lists.ozlabs.org/pipermail/linuxppc-dev/2010-March/080796.html Changes from v1: * Cleanup cpu/thread/core APIs * Export APIs to module instead of threads_per_core * Use of_find_node_by_path() instead of of_find_node_by_name() * Error checking and whitespace cleanups Please review and let me know your comments. I will post the next version ready for inclusion into powerpc/next tree along with test reports. The current patch series is on 2.6.34-rc6. Thanks, Vaidy --- Vaidyanathan Srinivasan (2): powerpc: cleanup APIs for cpu/thread/core mappings powerpc: add support for new hcall H_BEST_ENERGY arch/powerpc/include/asm/cputhreads.h | 15 + arch/powerpc/include/asm/hvcall.h |3 arch/powerpc/kernel/smp.c | 17 +- arch/powerpc/mm/mmu_context_nohash.c| 12 + arch/powerpc/platforms/pseries/Kconfig | 10 + arch/powerpc/platforms/pseries/Makefile |1 arch/powerpc/platforms/pseries/pseries_energy.c | 258 +++ 7 files changed, 301 insertions(+), 15 deletions(-) create mode 100644 arch/powerpc/platforms/pseries/pseries_energy.c ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
[RFC PATCH v2 1/2] powerpc: cleanup APIs for cpu/thread/core mappings
These APIs take a logical cpu number as input: Change cpu_first_thread_in_core() to cpu_leftmost_thread_sibling() Change cpu_last_thread_in_core() to cpu_rightmost_thread_sibling() These APIs convert a core number (index) to logical cpu/thread numbers: Add cpu_first_thread_of_core(int core) Changed cpu_thread_to_core() to cpu_core_of_thread(int cpu) Made API changes to a few callers. Exported symbols for use in modules. Signed-off-by: Vaidyanathan Srinivasan sva...@linux.vnet.ibm.com --- arch/powerpc/include/asm/cputhreads.h | 15 +-- arch/powerpc/kernel/smp.c | 17 +++-- arch/powerpc/mm/mmu_context_nohash.c | 12 ++-- 3 files changed, 30 insertions(+), 14 deletions(-) diff --git a/arch/powerpc/include/asm/cputhreads.h b/arch/powerpc/include/asm/cputhreads.h index a8e1844..26dc6bd 100644 --- a/arch/powerpc/include/asm/cputhreads.h +++ b/arch/powerpc/include/asm/cputhreads.h @@ -61,22 +61,25 @@ static inline cpumask_t cpu_online_cores_map(void) return cpu_thread_mask_to_cores(cpu_online_map); } -static inline int cpu_thread_to_core(int cpu) -{ - return cpu >> threads_shift; -} +#ifdef CONFIG_SMP +int cpu_core_of_thread(int cpu); +int cpu_first_thread_of_core(int core); +#else +static inline int cpu_core_of_thread(int cpu) { return cpu; } +static inline int cpu_first_thread_of_core(int core) { return core; } +#endif static inline int cpu_thread_in_core(int cpu) { return cpu & (threads_per_core - 1); } -static inline int cpu_first_thread_in_core(int cpu) +static inline int cpu_leftmost_thread_sibling(int cpu) { return cpu & ~(threads_per_core - 1); } -static inline int cpu_last_thread_in_core(int cpu) +static inline int cpu_rightmost_thread_sibling(int cpu) { return cpu | (threads_per_core - 1); } diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c index c2ee144..02dedff 100644 --- a/arch/powerpc/kernel/smp.c +++ b/arch/powerpc/kernel/smp.c @@ -462,6 +462,19 @@ out: return id; } +/* Helper routines for cpu to core mapping */ +int cpu_core_of_thread(int cpu) +{ + return
cpu >> threads_shift; +} +EXPORT_SYMBOL_GPL(cpu_core_of_thread); + +int cpu_first_thread_of_core(int core) +{ + return core << threads_shift; +} +EXPORT_SYMBOL_GPL(cpu_first_thread_of_core); + /* Must be called when no change can occur to cpu_present_map, * i.e. during cpu online or offline. */ @@ -513,7 +526,7 @@ int __devinit start_secondary(void *unused) notify_cpu_starting(cpu); set_cpu_online(cpu, true); /* Update sibling maps */ - base = cpu_first_thread_in_core(cpu); + base = cpu_leftmost_thread_sibling(cpu); for (i = 0; i < threads_per_core; i++) { if (cpu_is_offline(base + i)) continue; @@ -589,7 +602,7 @@ int __cpu_disable(void) return err; /* Update sibling maps */ - base = cpu_first_thread_in_core(cpu); + base = cpu_leftmost_thread_sibling(cpu); for (i = 0; i < threads_per_core; i++) { cpu_clear(cpu, per_cpu(cpu_sibling_map, base + i)); cpu_clear(base + i, per_cpu(cpu_sibling_map, cpu)); diff --git a/arch/powerpc/mm/mmu_context_nohash.c b/arch/powerpc/mm/mmu_context_nohash.c index 1f2d9ff..7c66e82 100644 --- a/arch/powerpc/mm/mmu_context_nohash.c +++ b/arch/powerpc/mm/mmu_context_nohash.c @@ -111,8 +111,8 @@ static unsigned int steal_context_smp(unsigned int id) * a core map instead but this will do for now.
*/ for_each_cpu(cpu, mm_cpumask(mm)) { - for (i = cpu_first_thread_in_core(cpu); - i <= cpu_last_thread_in_core(cpu); i++) + for (i = cpu_leftmost_thread_sibling(cpu); + i <= cpu_rightmost_thread_sibling(cpu); i++) __set_bit(id, stale_map[i]); cpu = i - 1; } @@ -264,14 +264,14 @@ void switch_mmu_context(struct mm_struct *prev, struct mm_struct *next) */ if (test_bit(id, stale_map[cpu])) { pr_hardcont(" | stale flush %d [%d..%d]", - id, cpu_first_thread_in_core(cpu), - cpu_last_thread_in_core(cpu)); + id, cpu_leftmost_thread_sibling(cpu), + cpu_rightmost_thread_sibling(cpu)); local_flush_tlb_mm(next); /* XXX This clear should ultimately be part of local_flush_tlb_mm */ - for (i = cpu_first_thread_in_core(cpu); - i <= cpu_last_thread_in_core(cpu); i++) { + for (i = cpu_leftmost_thread_sibling(cpu); + i <= cpu_rightmost_thread_sibling(cpu); i++) { __clear_bit(id, stale_map[i]); } }
[RFC PATCH v2 2/2] powerpc: add support for new hcall H_BEST_ENERGY
Create sysfs interface to export data from H_BEST_ENERGY hcall that can be used by administrative tools on supported pseries platforms for energy management optimizations. /sys/devices/system/cpu/pseries_(de)activate_hint_list and /sys/devices/system/cpu/cpuN/pseries_(de)activate_hint will provide hints for activation and deactivation of cpus respectively. Added new driver module arch/powerpc/platforms/pseries/pseries_energy.c under new config option CONFIG_PSERIES_ENERGY Signed-off-by: Vaidyanathan Srinivasan sva...@linux.vnet.ibm.com --- arch/powerpc/include/asm/hvcall.h |3 arch/powerpc/platforms/pseries/Kconfig | 10 + arch/powerpc/platforms/pseries/Makefile |1 arch/powerpc/platforms/pseries/pseries_energy.c | 258 +++ 4 files changed, 271 insertions(+), 1 deletions(-) create mode 100644 arch/powerpc/platforms/pseries/pseries_energy.c diff --git a/arch/powerpc/include/asm/hvcall.h b/arch/powerpc/include/asm/hvcall.h index f027581..396599f 100644 --- a/arch/powerpc/include/asm/hvcall.h +++ b/arch/powerpc/include/asm/hvcall.h @@ -230,7 +230,8 @@ #define H_ENABLE_CRQ 0x2B0 #define H_SET_MPP 0x2D0 #define H_GET_MPP 0x2D4 -#define MAX_HCALL_OPCODE H_GET_MPP +#define H_BEST_ENERGY 0x2F4 +#define MAX_HCALL_OPCODE H_BEST_ENERGY #ifndef __ASSEMBLY__ diff --git a/arch/powerpc/platforms/pseries/Kconfig b/arch/powerpc/platforms/pseries/Kconfig index c667f0f..b3dd108 100644 --- a/arch/powerpc/platforms/pseries/Kconfig +++ b/arch/powerpc/platforms/pseries/Kconfig @@ -33,6 +33,16 @@ config PSERIES_MSI depends on PCI_MSI && EEH default y +config PSERIES_ENERGY + tristate "pseries energy management capabilities driver" + depends on PPC_PSERIES + default y + help + Provides interface to platform energy management capabilities + on supported PSERIES platforms.
+ Provides: /sys/devices/system/cpu/pseries_(de)activation_hint_list + and /sys/devices/system/cpu/cpuN/pseries_(de)activation_hint + config SCANLOG tristate "Scanlog dump interface" depends on RTAS_PROC && PPC_PSERIES diff --git a/arch/powerpc/platforms/pseries/Makefile b/arch/powerpc/platforms/pseries/Makefile index 0ff5174..3813977 100644 --- a/arch/powerpc/platforms/pseries/Makefile +++ b/arch/powerpc/platforms/pseries/Makefile @@ -16,6 +16,7 @@ obj-$(CONFIG_EEH) += eeh.o eeh_cache.o eeh_driver.o eeh_event.o eeh_sysfs.o obj-$(CONFIG_KEXEC) += kexec.o obj-$(CONFIG_PCI) += pci.o pci_dlpar.o obj-$(CONFIG_PSERIES_MSI) += msi.o +obj-$(CONFIG_PSERIES_ENERGY) += pseries_energy.o obj-$(CONFIG_HOTPLUG_CPU) += hotplug-cpu.o obj-$(CONFIG_MEMORY_HOTPLUG) += hotplug-memory.o diff --git a/arch/powerpc/platforms/pseries/pseries_energy.c b/arch/powerpc/platforms/pseries/pseries_energy.c new file mode 100644 index 000..9a936b1 --- /dev/null +++ b/arch/powerpc/platforms/pseries/pseries_energy.c @@ -0,0 +1,258 @@ +/* + * POWER platform energy management driver + * Copyright (C) 2010 IBM Corporation + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public License + * version 2 as published by the Free Software Foundation. + * + * This pseries platform device driver provides access to + * platform energy management capabilities.
+ */ + +#include <linux/module.h> +#include <linux/types.h> +#include <linux/errno.h> +#include <linux/init.h> +#include <linux/seq_file.h> +#include <linux/sysdev.h> +#include <linux/cpu.h> +#include <linux/of.h> +#include <asm/cputhreads.h> +#include <asm/page.h> +#include <asm/hvcall.h> + + +#define MODULE_VERS "1.0" +#define MODULE_NAME "pseries_energy" + +/* Helper routines to convert between drc_index and cpu numbers */ + +static u32 cpu_to_drc_index(int cpu) +{ + struct device_node *dn = NULL; + const int *indexes; + int i; + dn = of_find_node_by_path("/cpus"); + if (dn == NULL) + goto err; + indexes = of_get_property(dn, "ibm,drc-indexes", NULL); + if (indexes == NULL) + goto err; + /* Convert logical cpu number to core number */ + i = cpu_core_of_thread(cpu); + /* +* The first element indexes[0] is the number of drc_indexes +* returned in the list. Hence i+1 will get the drc_index +* corresponding to core number i. +*/ + WARN_ON(i > indexes[0]); + return indexes[i + 1]; +err: + printk(KERN_WARNING "cpu_to_drc_index(%d) failed", cpu); + return 0; +} + +static int drc_index_to_cpu(u32 drc_index) +{ + struct device_node *dn = NULL; + const int *indexes; + int i, cpu; + dn = of_find_node_by_path("/cpus"); + if (dn == NULL) + goto err; + indexes = of_get_property(dn, "ibm,drc-indexes", NULL
Re: [RFC PATCH v2 1/2] powerpc: cleanup APIs for cpu/thread/core mappings
* Paul Mackerras pau...@samba.org [2010-05-10 09:05:22]: On Fri, May 07, 2010 at 05:18:42PM +0530, Vaidyanathan Srinivasan wrote: These APIs take logical cpu number as input Change cpu_first_thread_in_core() to cpu_leftmost_thread_sibling() Change cpu_last_thread_in_core() to cpu_rightmost_thread_sibling() These APIs convert core number (index) to logical cpu/thread numbers Add cpu_first_thread_of_core(int core) Changed cpu_thread_to_core() to cpu_core_of_thread(int cpu) Why make all these changes? The end result doesn't seem any cleaner or better than how it was before, and your patch description doesn't give any reason for us to think yes, we should make this change. I assume you think this is a good change to make, so you need to explain why it's a good idea and convince the rest of us. Sure Paul.. let me explain. The crux of the issue is to make 'threads_per_core' accessible to the pseries_energy module. In the first RFC, I had directly exported the variable, which is not a good design. Ben H suggested to make an API around it and then export the function: http://lists.ozlabs.org/pipermail/linuxppc-dev/2010-April/081610.html Instead of making an API to read threads_per_core, as Ben suggested, I have made a wrapper at a higher level to make an API to convert from logical cpu number to core number. The current APIs cpu_first_thread_in_core() and cpu_last_thread_in_core() return a logical CPU number, while cpu_thread_to_core() returns a core number (index), which is not a logical CPU number. Ben recommended to clearly name them to distinguish 'core number' versus first and last 'logical cpu number' in that core. Hence in the new scheme, I have: Change cpu_first_thread_in_core() to cpu_leftmost_thread_sibling() Change cpu_last_thread_in_core() to cpu_rightmost_thread_sibling() which work on logical cpu numbers, while cpu_first_thread_of_core() and cpu_core_of_thread() work on the core index.
Example usage: (4 threads per core system) cpu_leftmost_thread_sibling(5) = 4 cpu_rightmost_thread_sibling(5) = 7 cpu_core_of_thread(5) = 1 cpu_first_thread_of_core(1) = 4 cpu_core_of_thread() is used in cpu_to_drc_index() in the module and cpu_first_thread_of_core() is used in drc_index_to_cpu() in the module. These APIs may be useful in other modules in future, and the proposed design is a good method to export these APIs to modules. An alternative approach could be to move both the base functions cpu_to_drc_index() and drc_index_to_cpu() into the kernel like in arch/powerpc/kernel/smp.c Thanks for the review, I hope I have explained the requirements and design for this cleanup. --Vaidy ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [RFC PATCH v2 1/2] powerpc: cleanup APIs for cpu/thread/core mappings
* Vaidyanathan Srinivasan sva...@linux.vnet.ibm.com [2010-05-10 11:18:01]: * Paul Mackerras pau...@samba.org [2010-05-10 09:05:22]: On Fri, May 07, 2010 at 05:18:42PM +0530, Vaidyanathan Srinivasan wrote: These APIs take logical cpu number as input Change cpu_first_thread_in_core() to cpu_leftmost_thread_sibling() Change cpu_last_thread_in_core() to cpu_rightmost_thread_sibling() These APIs convert core number (index) to logical cpu/thread numbers Add cpu_first_thread_of_core(int core) Changed cpu_thread_to_core() to cpu_core_of_thread(int cpu) Why make all these changes? The end result doesn't seem any cleaner or better than how it was before, and your patch description doesn't give any reason for us to think yes, we should make this change. I assume you think this is a good change to make, so you need to explain why it's a good idea and convince the rest of us. Sure Paul.. let me explain. The crux of the issue is to make 'threads_per_core' accessible to the pseries_energy module. In the first RFC, I had directly exported the variable which is not a good design. Ben H suggested to make an API around it and then export the function: http://lists.ozlabs.org/pipermail/linuxppc-dev/2010-April/081610.html Instead of making an API to read threads_per_core, as Ben suggested, I have made a wrapper at a higher level to make an API to convert from logical cpu number to core number. The current APIs cpu_first_thread_in_core() and cpu_last_thread_in_core() returns logical CPU number while cpu_thread_to_core() returns core number or index which is not a logical CPU number. Ben recommended to clearly name them to distinguish 'core number' versus first and last 'logical cpu number' in that core. Hence in the new scheme, I have: Change cpu_first_thread_in_core() to cpu_leftmost_thread_sibling() Change cpu_last_thread_in_core() to cpu_rightmost_thread_sibling() which work on logical cpu numbers. While cpu_first_thread_of_core() and cpu_core_of_thread() work on core index. 
Example usage: (4 threads per core system) cpu_leftmost_thread_sibling(5) = 4 cpu_rightmost_thread_sibling(5) = 7 cpu_core_of_thread(5) = 1 cpu_first_thread_of_core(1) = 4 cpu_core_of_thread() is used in cpu_to_drc_index() in the module and cpu_first_thread_of_core() is used in drc_index_to_cpu() in the module. These APIs may be useful in other modules in future, and the proposed design is a good method to export these APIs to modules. An alternative approach could be to move both the base functions cpu_to_drc_index() and drc_index_to_cpu() into the kernel like in arch/powerpc/kernel/smp.c Thanks for the review, I hope I have explained the requirements and design for this cleanup. Hi Paul and Ben, Do you have any further comments on this patch series and related cleanup? I will post the next iteration in a day or two. Thanks, Vaidy ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH 1/5] sched: fix capacity calculations for SMT4
* Peter Zijlstra pet...@infradead.org [2010-05-31 10:33:16]: On Fri, 2010-04-16 at 15:58 +0200, Peter Zijlstra wrote: Hrmm, my brain seems muddled but I might have another solution, let me ponder this for a bit.. Right, so the thing I was thinking about is taking the group capacity into account when determining the capacity for a single cpu. Say the group contains all the SMT siblings, then use the group capacity (usually larger than 1024) and then distribute the capacity over the group members, preferring CPUs with higher individual cpu_power over those with less. So suppose you've got 4 siblings with cpu_power=294 each, then we assign capacity 1 to the first member, and the remaining 153 is insufficient, and thus we stop and the rest lives with 0 capacity. Now take the example that the first sibling would be running a heavy RT load, and its cpu_power would be reduced to say, 50, then we still got nearly 933 left over the others, which is still sufficient for one capacity, but because the first sibling is low, we'll assign it 0 and instead assign 1 to the second, again, leaving the third and fourth 0. Hi Peter, Thanks for the suggestion. If the group were a core group, the total would be much higher and we'd likely end up assigning 1 to each before we'd run out of capacity. This is a tricky case because we are depending upon the DIV_ROUND_CLOSEST to decide whether to flag capacity to 0 or 1. We will not have any task movement until capacity is depleted to a quite low value due to the RT task. Having a threshold to flag 0/1 instead of DIV_ROUND_CLOSEST, just like you have suggested in the power savings case, may help here as well to move tasks to other idle cores. For power savings, we can lower the threshold and maybe use the maximal individual cpu_power in the group to base 1 capacity from. So, suppose the second example, where sibling0 has 50 and the others have 294, you'd end up with a capacity distribution of: {0,1,1,1}.
One challenge here is that if RT tasks run on more than one thread in this group, we will have slightly different cpu powers. Arranging them from max to min and having a cutoff threshold should work. Should we keep the RT scaling as a separate entity along with cpu_power to simplify these thresholds? Whenever we need to scale group load with cpu power, we can take the product of cpu_power and scale_rt_power, but in these cases where we compute capacity, we can mark a 0 or 1 just based on whether scale_rt_power was less than SCHED_LOAD_SCALE or not. Alternatively, we can keep cpu_power as a product of all scaling factors as it is today but also save the component scale factors, like scale_rt_power() and arch_scale_freq_power(), so that they can be used in load balance decisions. Basically, in power save balance we would give all threads a capacity '1' unless the cpu_power was reduced due to an RT task. Similarly, in the non-power save case, we can have flags 1,0,0,0 unless the first thread had RT scaling during the last interval. I am suggesting to distinguish the reduction in cpu_power due to architectural (hardware DVFS) reasons from RT tasks so that it is easy to decide if moving tasks to a sibling thread or core can help or not. --Vaidy ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
[PATCH v3 0/2] powerpc: add support for new hcall H_BEST_ENERGY
Hi Ben, The following series adds a kernel module for powerpc pseries platforms in order to export platform energy management capabilities. The module exports data from a new hypervisor call H_BEST_ENERGY. Comments and suggestions made on the previous iteration of the patch have been incorporated. Changes in v3: * Added more documentation in the cleanup patch * Removed RFC tag, rebased and tested on 2.6.35-rc3 * Ready for inclusion in powerpc/next tree for further testing Changes in v2: [1] [RFC PATCH v2 0/2] powerpc: add support for new hcall H_BEST_ENERGY http://lists.ozlabs.org/pipermail/linuxppc-dev/2010-May/082246.html * Cleanup cpu/thread/core APIs * Export APIs to module instead of threads_per_core * Use of_find_node_by_path() instead of of_find_node_by_name() * Error checking and whitespace cleanups First version: [2] [RFC] powerpc: add support for new hcall H_BEST_ENERGY http://lists.ozlabs.org/pipermail/linuxppc-dev/2010-March/080796.html Please review and include in powerpc/next tree for further testing. I will add the following enhancements incrementally: * Detect h_call support from ibm,hypertas-functions * Limit sysfs file creation only on hcall availability This patch series will apply on 2.6.35-rc3 as well as powerpc/next tree. Thanks, Vaidy --- Vaidyanathan Srinivasan (2): powerpc: cleanup APIs for cpu/thread/core mappings powerpc: add support for new hcall H_BEST_ENERGY arch/powerpc/include/asm/cputhreads.h | 15 + arch/powerpc/include/asm/hvcall.h |3 arch/powerpc/kernel/smp.c | 19 +- arch/powerpc/mm/mmu_context_nohash.c| 12 + arch/powerpc/platforms/pseries/Kconfig | 10 + arch/powerpc/platforms/pseries/Makefile |1 arch/powerpc/platforms/pseries/pseries_energy.c | 258 +++ 7 files changed, 302 insertions(+), 16 deletions(-) create mode 100644 arch/powerpc/platforms/pseries/pseries_energy.c -- ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
[PATCH v3 1/2] powerpc: cleanup APIs for cpu/thread/core mappings
These APIs take logical cpu number as input Change cpu_first_thread_in_core() to cpu_leftmost_thread_sibling() Change cpu_last_thread_in_core() to cpu_rightmost_thread_sibling() These APIs convert core number (index) to logical cpu/thread numbers Add cpu_first_thread_of_core(int core) Changed cpu_thread_to_core() to cpu_core_of_thread(int cpu) The goal is to make 'threads_per_core' accessible to the pseries_energy module. Instead of making an API to read threads_per_core, this is a higher level wrapper function to convert from logical cpu number to core number. The current APIs cpu_first_thread_in_core() and cpu_last_thread_in_core() return a logical CPU number, while cpu_thread_to_core() returns a core number (index), which is not a logical CPU number. The APIs are now clearly named to distinguish 'core number' versus first and last 'logical cpu number' in that core. The new APIs cpu_{left,right}most_thread_sibling() work on logical cpu numbers, while cpu_first_thread_of_core() and cpu_core_of_thread() work on the core index. Example usage: (4 threads per core system) cpu_leftmost_thread_sibling(5) = 4 cpu_rightmost_thread_sibling(5) = 7 cpu_core_of_thread(5) = 1 cpu_first_thread_of_core(1) = 4 cpu_core_of_thread() is used in cpu_to_drc_index() in the module and cpu_first_thread_of_core() is used in drc_index_to_cpu() in the module. Made API changes to a few callers. Exported symbols for use in modules.
Signed-off-by: Vaidyanathan Srinivasan sva...@linux.vnet.ibm.com --- arch/powerpc/include/asm/cputhreads.h | 15 +-- arch/powerpc/kernel/smp.c | 19 --- arch/powerpc/mm/mmu_context_nohash.c | 12 ++-- 3 files changed, 31 insertions(+), 15 deletions(-) diff --git a/arch/powerpc/include/asm/cputhreads.h b/arch/powerpc/include/asm/cputhreads.h index a8e1844..26dc6bd 100644 --- a/arch/powerpc/include/asm/cputhreads.h +++ b/arch/powerpc/include/asm/cputhreads.h @@ -61,22 +61,25 @@ static inline cpumask_t cpu_online_cores_map(void) return cpu_thread_mask_to_cores(cpu_online_map); } -static inline int cpu_thread_to_core(int cpu) -{ - return cpu >> threads_shift; -} +#ifdef CONFIG_SMP +int cpu_core_of_thread(int cpu); +int cpu_first_thread_of_core(int core); +#else +static inline int cpu_core_of_thread(int cpu) { return cpu; } +static inline int cpu_first_thread_of_core(int core) { return core; } +#endif static inline int cpu_thread_in_core(int cpu) { return cpu & (threads_per_core - 1); } -static inline int cpu_first_thread_in_core(int cpu) +static inline int cpu_leftmost_thread_sibling(int cpu) { return cpu & ~(threads_per_core - 1); } -static inline int cpu_last_thread_in_core(int cpu) +static inline int cpu_rightmost_thread_sibling(int cpu) { return cpu | (threads_per_core - 1); } diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c index 5c196d1..da4c2f8 100644 --- a/arch/powerpc/kernel/smp.c +++ b/arch/powerpc/kernel/smp.c @@ -468,7 +468,20 @@ out: return id; } -/* Must be called when no change can occur to cpu_present_mask, +/* Helper routines for cpu to core mapping */ +int cpu_core_of_thread(int cpu) +{ + return cpu >> threads_shift; +} +EXPORT_SYMBOL_GPL(cpu_core_of_thread); + +int cpu_first_thread_of_core(int core) +{ + return core << threads_shift; +} +EXPORT_SYMBOL_GPL(cpu_first_thread_of_core); + +/* Must be called when no change can occur to cpu_present_map, * i.e. during cpu online or offline.
*/ static struct device_node *cpu_to_l2cache(int cpu) @@ -527,7 +540,7 @@ int __devinit start_secondary(void *unused) notify_cpu_starting(cpu); set_cpu_online(cpu, true); /* Update sibling maps */ - base = cpu_first_thread_in_core(cpu); + base = cpu_leftmost_thread_sibling(cpu); for (i = 0; i < threads_per_core; i++) { if (cpu_is_offline(base + i)) continue; @@ -606,7 +619,7 @@ int __cpu_disable(void) return err; /* Update sibling maps */ - base = cpu_first_thread_in_core(cpu); + base = cpu_leftmost_thread_sibling(cpu); for (i = 0; i < threads_per_core; i++) { cpumask_clear_cpu(cpu, cpu_sibling_mask(base + i)); cpumask_clear_cpu(base + i, cpu_sibling_mask(cpu)); diff --git a/arch/powerpc/mm/mmu_context_nohash.c b/arch/powerpc/mm/mmu_context_nohash.c index ddfd7ad..22f3bc5 100644 --- a/arch/powerpc/mm/mmu_context_nohash.c +++ b/arch/powerpc/mm/mmu_context_nohash.c @@ -111,8 +111,8 @@ static unsigned int steal_context_smp(unsigned int id) * a core map instead but this will do for now. */ for_each_cpu(cpu, mm_cpumask(mm)) { - for (i = cpu_first_thread_in_core(cpu); - i <= cpu_last_thread_in_core(cpu); i++) + for (i = cpu_leftmost_thread_sibling(cpu); +i
[PATCH v3 2/2] powerpc: add support for new hcall H_BEST_ENERGY
Create sysfs interface to export data from H_BEST_ENERGY hcall that can be used by administrative tools on supported pseries platforms for energy management optimizations. /sys/devices/system/cpu/pseries_(de)activate_hint_list and /sys/devices/system/cpu/cpuN/pseries_(de)activate_hint will provide hints for activation and deactivation of cpus respectively. Added new driver module arch/powerpc/platforms/pseries/pseries_energy.c under new config option CONFIG_PSERIES_ENERGY Signed-off-by: Vaidyanathan Srinivasan sva...@linux.vnet.ibm.com --- arch/powerpc/include/asm/hvcall.h |3 arch/powerpc/platforms/pseries/Kconfig | 10 + arch/powerpc/platforms/pseries/Makefile |1 arch/powerpc/platforms/pseries/pseries_energy.c | 258 +++ 4 files changed, 271 insertions(+), 1 deletions(-) create mode 100644 arch/powerpc/platforms/pseries/pseries_energy.c diff --git a/arch/powerpc/include/asm/hvcall.h b/arch/powerpc/include/asm/hvcall.h index 5119b7d..34b66e0 100644 --- a/arch/powerpc/include/asm/hvcall.h +++ b/arch/powerpc/include/asm/hvcall.h @@ -231,7 +231,8 @@ #define H_GET_EM_PARMS 0x2B8 #define H_SET_MPP 0x2D0 #define H_GET_MPP 0x2D4 -#define MAX_HCALL_OPCODE H_GET_MPP +#define H_BEST_ENERGY 0x2F4 +#define MAX_HCALL_OPCODE H_BEST_ENERGY #ifndef __ASSEMBLY__ diff --git a/arch/powerpc/platforms/pseries/Kconfig b/arch/powerpc/platforms/pseries/Kconfig index c667f0f..b3dd108 100644 --- a/arch/powerpc/platforms/pseries/Kconfig +++ b/arch/powerpc/platforms/pseries/Kconfig @@ -33,6 +33,16 @@ config PSERIES_MSI depends on PCI_MSI && EEH default y +config PSERIES_ENERGY + tristate "pseries energy management capabilities driver" + depends on PPC_PSERIES + default y + help + Provides interface to platform energy management capabilities + on supported PSERIES platforms.
+ Provides: /sys/devices/system/cpu/pseries_(de)activation_hint_list + and /sys/devices/system/cpu/cpuN/pseries_(de)activation_hint + config SCANLOG tristate "Scanlog dump interface" depends on RTAS_PROC && PPC_PSERIES diff --git a/arch/powerpc/platforms/pseries/Makefile b/arch/powerpc/platforms/pseries/Makefile index 3dbef30..32ae72e 100644 --- a/arch/powerpc/platforms/pseries/Makefile +++ b/arch/powerpc/platforms/pseries/Makefile @@ -16,6 +16,7 @@ obj-$(CONFIG_EEH) += eeh.o eeh_cache.o eeh_driver.o eeh_event.o eeh_sysfs.o obj-$(CONFIG_KEXEC) += kexec.o obj-$(CONFIG_PCI) += pci.o pci_dlpar.o obj-$(CONFIG_PSERIES_MSI) += msi.o +obj-$(CONFIG_PSERIES_ENERGY) += pseries_energy.o obj-$(CONFIG_HOTPLUG_CPU) += hotplug-cpu.o obj-$(CONFIG_MEMORY_HOTPLUG) += hotplug-memory.o diff --git a/arch/powerpc/platforms/pseries/pseries_energy.c b/arch/powerpc/platforms/pseries/pseries_energy.c new file mode 100644 index 000..9a936b1 --- /dev/null +++ b/arch/powerpc/platforms/pseries/pseries_energy.c @@ -0,0 +1,258 @@ +/* + * POWER platform energy management driver + * Copyright (C) 2010 IBM Corporation + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public License + * version 2 as published by the Free Software Foundation. + * + * This pseries platform device driver provides access to + * platform energy management capabilities.
+ */ + +#include <linux/module.h> +#include <linux/types.h> +#include <linux/errno.h> +#include <linux/init.h> +#include <linux/seq_file.h> +#include <linux/sysdev.h> +#include <linux/cpu.h> +#include <linux/of.h> +#include <asm/cputhreads.h> +#include <asm/page.h> +#include <asm/hvcall.h> + + +#define MODULE_VERS "1.0" +#define MODULE_NAME "pseries_energy" + +/* Helper routines to convert between drc_index and cpu numbers */ + +static u32 cpu_to_drc_index(int cpu) +{ + struct device_node *dn = NULL; + const int *indexes; + int i; + dn = of_find_node_by_path("/cpus"); + if (dn == NULL) + goto err; + indexes = of_get_property(dn, "ibm,drc-indexes", NULL); + if (indexes == NULL) + goto err; + /* Convert logical cpu number to core number */ + i = cpu_core_of_thread(cpu); + /* +* The first element indexes[0] is the number of drc_indexes +* returned in the list. Hence i+1 will get the drc_index +* corresponding to core number i. +*/ + WARN_ON(i > indexes[0]); + return indexes[i + 1]; +err: + printk(KERN_WARNING "cpu_to_drc_index(%d) failed", cpu); + return 0; +} + +static int drc_index_to_cpu(u32 drc_index) +{ + struct device_node *dn = NULL; + const int *indexes; + int i, cpu; + dn = of_find_node_by_path("/cpus"); + if (dn == NULL) + goto err; + indexes = of_get_property(dn, "ibm,drc-indexes", NULL
Re: [PATCH v3 2/2] powerpc: add support for new hcall H_BEST_ENERGY
* Michael Neuling mi...@neuling.org [2010-06-28 11:44:31]: Vaidy, Create sysfs interface to export data from H_BEST_ENERGY hcall that can be used by administrative tools on supported pseries platforms for energy management optimizations. /sys/devices/system/cpu/pseries_(de)activate_hint_list and /sys/devices/system/cpu/cpuN/pseries_(de)activate_hint will provide hints for activation and deactivation of cpus respectively. Added new driver module arch/powerpc/platforms/pseries/pseries_energy.c under new config option CONFIG_PSERIES_ENERGY Can you provide some documentation on how to use these hints and what format they are provided in from sysfs? Looks like two separate interfaces to the same thing (one a comma sep list and 1 per cpu, why do we need both?). What is the difference between activate and deactivate, without me having to read PAPR :-) ?? Hi Mike, Thanks for reviewing this patch series. Sure, I can provide additional information. These hints are abstract numbers given by the hypervisor based on the extended knowledge the hypervisor has regarding the current system topology and resource mappings. The activate and the deactivate part is for the two distinct operations that we could do for energy savings. When we have more capacity than required, we could deactivate a few cores to save energy. The choice of the core to deactivate will be based on /sys/devices/system/cpu/deactivate_hint_list. The comma separated list of cpus (cores) will be the preferred choice. Once the system has a few deactivated cores, based on workload demand we may have to activate them to meet the demand. In that case the /sys/devices/system/cpu/activate_hint_list will be used to prefer the core in-order among the deactivated cores. In simple terms, activate_hint_list will be null until we deactivate a few cores. Then we could look at the corresponding list for activation or deactivation. Regarding your second point, there is a reason for both a list and per-cpu interface.
The list gives us a system wide list of cores in one shot for userspace to base its decisions on. This will be the preferred interface for most cases. On the other hand, the per-cpu file /sys/device/system/cpu/cpuN/pseries_(de)activate_hint provides more information since it exports the hint value as such. The idea is that the list interface will be used to get a suggested list of cores to manage, while the per-cpu value can be used to further get fine-grained information on a per-core basis from the hypervisor. This allows Linux to have access to all information that the hypervisor has offered through this hcall interface. Other comments below. Signed-off-by: Vaidyanathan Srinivasan sva...@linux.vnet.ibm.com --- arch/powerpc/include/asm/hvcall.h |3 arch/powerpc/platforms/pseries/Kconfig | 10 + arch/powerpc/platforms/pseries/Makefile |1 arch/powerpc/platforms/pseries/pseries_energy.c | 258 +++ 4 files changed, 271 insertions(+), 1 deletions(-) create mode 100644 arch/powerpc/platforms/pseries/pseries_energy.c diff --git a/arch/powerpc/include/asm/hvcall.h b/arch/powerpc/include/asm/hvcall.h index 5119b7d..34b66e0 100644 --- a/arch/powerpc/include/asm/hvcall.h +++ b/arch/powerpc/include/asm/hvcall.h @@ -231,7 +231,8 @@ #define H_GET_EM_PARMS 0x2B8 #define H_SET_MPP 0x2D0 #define H_GET_MPP 0x2D4 -#define MAX_HCALL_OPCODE H_GET_MPP +#define H_BEST_ENERGY 0x2F4 +#define MAX_HCALL_OPCODE H_BEST_ENERGY #ifndef __ASSEMBLY__ diff --git a/arch/powerpc/platforms/pseries/Kconfig b/arch/powerpc/platforms/pseries/Kconfig index c667f0f..b3dd108 100644 --- a/arch/powerpc/platforms/pseries/Kconfig +++ b/arch/powerpc/platforms/pseries/Kconfig @@ -33,6 +33,16 @@ config PSERIES_MSI depends on PCI_MSI && EEH default y +config PSERIES_ENERGY Probably need a less generic name. PSERIES_ENERGY_MANAGEMENT? PSERIES_ENERGY_HOTPLUG_HINTS? PSERIES_ENERGY_MANAGEMENT may be good but too long for a config option.
The idea is to collect all energy management functions in this module as and when new features are introduced in the pseries platform. This hcall interface is the first to be included, but going forward I do not propose to have different modules for other energy management related features. The name is specific enough for IBM pseries platform and energy management functions and enablements. Having a less generic name below this level will make it difficult to add all varieties of energy management functions in the future. + tristate "pseries energy management capabilities driver" + depends on PPC_PSERIES + default y + help + Provides interface to platform energy management capabilities + on supported PSERIES platforms. + Provides: /sys
[PATCH v4 0/2] powerpc: add support for new hcall H_BEST_ENERGY
Hi Ben, The following series adds a kernel module for powerpc pseries platforms in order to export platform energy management capabilities. The module exports data from a new hypervisor call H_BEST_ENERGY. Some of the comments and suggestions made on the previous iteration of the patch have been incorporated. Changes in v4: * Added more documentation * Added check_for_h_best_energy() to look in ibm,hypertas-functions so that sysfs entries are not created on an unsupported platform * Added cleaner error checks and correct of_node_put() * Rebased and tested on 2.6.35-rc5 Changes in v3: [3] [PATCH v3 0/2] powerpc: add support for new hcall H_BEST_ENERGY http://lists.ozlabs.org/pipermail/linuxppc-dev/2010-June/083414.html * Added more documentation in the cleanup patch * Removed RFC tag, rebased and tested on 2.6.35-rc3 * Ready for inclusion in powerpc/next tree for further testing Changes in v2: [2] [RFC PATCH v2 0/2] powerpc: add support for new hcall H_BEST_ENERGY http://lists.ozlabs.org/pipermail/linuxppc-dev/2010-May/082246.html * Cleanup cpu/thread/core APIs * Export APIs to module instead of threads_per_core * Use of_find_node_by_path() instead of of_find_node_by_name() * Error checking and whitespace cleanups First version: [1] [RFC] powerpc: add support for new hcall H_BEST_ENERGY http://lists.ozlabs.org/pipermail/linuxppc-dev/2010-March/080796.html This patch series will apply on 2.6.35-rc5 as well as powerpc/next tree. Please review and include in powerpc/next tree for further testing. I could incrementally reduce some of the error checks as suggested by Michael Neuling as next steps.
Thanks, Vaidy --- Vaidyanathan Srinivasan (2): powerpc: cleanup APIs for cpu/thread/core mappings powerpc: add support for new hcall H_BEST_ENERGY arch/powerpc/include/asm/cputhreads.h | 15 + arch/powerpc/include/asm/hvcall.h |3 arch/powerpc/kernel/smp.c | 19 + arch/powerpc/mm/mmu_context_nohash.c| 12 - arch/powerpc/platforms/pseries/Kconfig | 10 + arch/powerpc/platforms/pseries/Makefile |1 arch/powerpc/platforms/pseries/pseries_energy.c | 326 +++ 7 files changed, 370 insertions(+), 16 deletions(-) create mode 100644 arch/powerpc/platforms/pseries/pseries_energy.c ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
[PATCH v4 1/2] powerpc: cleanup APIs for cpu/thread/core mappings
These APIs take logical cpu number as input Change cpu_first_thread_in_core() to cpu_leftmost_thread_sibling() Change cpu_last_thread_in_core() to cpu_rightmost_thread_sibling() These APIs convert core number (index) to logical cpu/thread numbers Add cpu_first_thread_of_core(int core) Changed cpu_thread_to_core() to cpu_core_of_thread(int cpu) The goal is to make 'threads_per_core' accessible to the pseries_energy module. Instead of making an API to read threads_per_core, this is a higher level wrapper function to convert from logical cpu number to core number. The current APIs cpu_first_thread_in_core() and cpu_last_thread_in_core() return a logical CPU number, while cpu_thread_to_core() returns a core number (index), which is not a logical CPU number. The APIs are now clearly named to distinguish 'core number' versus first and last 'logical cpu number' in that core. The new APIs cpu_{left,right}most_thread_sibling() work on logical cpu numbers, while cpu_first_thread_of_core() and cpu_core_of_thread() work on the core index. Example usage: (4 threads per core system) cpu_leftmost_thread_sibling(5) = 4 cpu_rightmost_thread_sibling(5) = 7 cpu_core_of_thread(5) = 1 cpu_first_thread_of_core(1) = 4 cpu_core_of_thread() is used in cpu_to_drc_index() in the module and cpu_first_thread_of_core() is used in drc_index_to_cpu() in the module. Made API changes to a few callers. Exported symbols for use in modules.
Signed-off-by: Vaidyanathan Srinivasan sva...@linux.vnet.ibm.com --- arch/powerpc/include/asm/cputhreads.h | 15 +-- arch/powerpc/kernel/smp.c | 19 --- arch/powerpc/mm/mmu_context_nohash.c | 12 ++-- 3 files changed, 31 insertions(+), 15 deletions(-) diff --git a/arch/powerpc/include/asm/cputhreads.h b/arch/powerpc/include/asm/cputhreads.h index a8e1844..26dc6bd 100644 --- a/arch/powerpc/include/asm/cputhreads.h +++ b/arch/powerpc/include/asm/cputhreads.h @@ -61,22 +61,25 @@ static inline cpumask_t cpu_online_cores_map(void) return cpu_thread_mask_to_cores(cpu_online_map); } -static inline int cpu_thread_to_core(int cpu) -{ - return cpu >> threads_shift; -} +#ifdef CONFIG_SMP +int cpu_core_of_thread(int cpu); +int cpu_first_thread_of_core(int core); +#else +static inline int cpu_core_of_thread(int cpu) { return cpu; } +static inline int cpu_first_thread_of_core(int core) { return core; } +#endif static inline int cpu_thread_in_core(int cpu) { return cpu & (threads_per_core - 1); } -static inline int cpu_first_thread_in_core(int cpu) +static inline int cpu_leftmost_thread_sibling(int cpu) { return cpu & ~(threads_per_core - 1); } -static inline int cpu_last_thread_in_core(int cpu) +static inline int cpu_rightmost_thread_sibling(int cpu) { return cpu | (threads_per_core - 1); } diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c index 5c196d1..da4c2f8 100644 --- a/arch/powerpc/kernel/smp.c +++ b/arch/powerpc/kernel/smp.c @@ -468,7 +468,20 @@ out: return id; } -/* Must be called when no change can occur to cpu_present_mask, +/* Helper routines for cpu to core mapping */ +int cpu_core_of_thread(int cpu) +{ + return cpu >> threads_shift; +} +EXPORT_SYMBOL_GPL(cpu_core_of_thread); + +int cpu_first_thread_of_core(int core) +{ + return core << threads_shift; +} +EXPORT_SYMBOL_GPL(cpu_first_thread_of_core); + +/* Must be called when no change can occur to cpu_present_map, * i.e. during cpu online or offline. 
*/ static struct device_node *cpu_to_l2cache(int cpu) @@ -527,7 +540,7 @@ int __devinit start_secondary(void *unused) notify_cpu_starting(cpu); set_cpu_online(cpu, true); /* Update sibling maps */ - base = cpu_first_thread_in_core(cpu); + base = cpu_leftmost_thread_sibling(cpu); for (i = 0; i < threads_per_core; i++) { if (cpu_is_offline(base + i)) continue; @@ -606,7 +619,7 @@ int __cpu_disable(void) return err; /* Update sibling maps */ - base = cpu_first_thread_in_core(cpu); + base = cpu_leftmost_thread_sibling(cpu); for (i = 0; i < threads_per_core; i++) { cpumask_clear_cpu(cpu, cpu_sibling_mask(base + i)); cpumask_clear_cpu(base + i, cpu_sibling_mask(cpu)); diff --git a/arch/powerpc/mm/mmu_context_nohash.c b/arch/powerpc/mm/mmu_context_nohash.c index ddfd7ad..22f3bc5 100644 --- a/arch/powerpc/mm/mmu_context_nohash.c +++ b/arch/powerpc/mm/mmu_context_nohash.c @@ -111,8 +111,8 @@ static unsigned int steal_context_smp(unsigned int id) * a core map instead but this will do for now. */ for_each_cpu(cpu, mm_cpumask(mm)) { - for (i = cpu_first_thread_in_core(cpu); - i <= cpu_last_thread_in_core(cpu); i++) + for (i = cpu_leftmost_thread_sibling(cpu); +i
[PATCH v4 2/2] powerpc: add support for new hcall H_BEST_ENERGY
Create sysfs interface to export data from the H_BEST_ENERGY hcall that can be used by administrative tools on supported pseries platforms for energy management optimizations. /sys/device/system/cpu/pseries_(de)activate_hint_list and /sys/device/system/cpu/cpuN/pseries_(de)activate_hint will provide hints for activation and deactivation of cpus respectively. These hints are abstract numbers given by the hypervisor based on the extended knowledge the hypervisor has regarding the current system topology and resource mappings. The activate and the deactivate sysfs entries are for the two distinct operations that we could do for energy savings. When we have more capacity than required, we could deactivate a few cores to save energy. The choice of the core to deactivate will be based on /sys/devices/system/cpu/deactivate_hint_list. The comma separated list of cpus (cores) will be the preferred choice. If we have to activate some of the deactivated cores, then /sys/devices/system/cpu/activate_hint_list will be used. The per-cpu file /sys/device/system/cpu/cpuN/pseries_(de)activate_hint further provides more fine-grained information by exporting the value of the hint itself. 
Added new driver module arch/powerpc/platforms/pseries/pseries_energy.c under new config option CONFIG_PSERIES_ENERGY Signed-off-by: Vaidyanathan Srinivasan sva...@linux.vnet.ibm.com --- arch/powerpc/include/asm/hvcall.h |3 arch/powerpc/platforms/pseries/Kconfig | 10 + arch/powerpc/platforms/pseries/Makefile |1 arch/powerpc/platforms/pseries/pseries_energy.c | 326 +++ 4 files changed, 339 insertions(+), 1 deletions(-) create mode 100644 arch/powerpc/platforms/pseries/pseries_energy.c diff --git a/arch/powerpc/include/asm/hvcall.h b/arch/powerpc/include/asm/hvcall.h index 5119b7d..34b66e0 100644 --- a/arch/powerpc/include/asm/hvcall.h +++ b/arch/powerpc/include/asm/hvcall.h @@ -231,7 +231,8 @@ #define H_GET_EM_PARMS 0x2B8 #define H_SET_MPP 0x2D0 #define H_GET_MPP 0x2D4 -#define MAX_HCALL_OPCODE H_GET_MPP +#define H_BEST_ENERGY 0x2F4 +#define MAX_HCALL_OPCODE H_BEST_ENERGY #ifndef __ASSEMBLY__ diff --git a/arch/powerpc/platforms/pseries/Kconfig b/arch/powerpc/platforms/pseries/Kconfig index c667f0f..8323622 100644 --- a/arch/powerpc/platforms/pseries/Kconfig +++ b/arch/powerpc/platforms/pseries/Kconfig @@ -33,6 +33,16 @@ config PSERIES_MSI depends on PCI_MSI && EEH default y +config PSERIES_ENERGY + tristate "pSeries energy management capabilities driver" + depends on PPC_PSERIES + default y + help + Provides interface to platform energy management capabilities + on supported PSERIES platforms. 
+ Provides: /sys/devices/system/cpu/pseries_(de)activation_hint_list + and /sys/devices/system/cpu/cpuN/pseries_(de)activation_hint + config SCANLOG tristate "Scanlog dump interface" depends on RTAS_PROC && PPC_PSERIES diff --git a/arch/powerpc/platforms/pseries/Makefile b/arch/powerpc/platforms/pseries/Makefile index 3dbef30..32ae72e 100644 --- a/arch/powerpc/platforms/pseries/Makefile +++ b/arch/powerpc/platforms/pseries/Makefile @@ -16,6 +16,7 @@ obj-$(CONFIG_EEH) += eeh.o eeh_cache.o eeh_driver.o eeh_event.o eeh_sysfs.o obj-$(CONFIG_KEXEC)+= kexec.o obj-$(CONFIG_PCI) += pci.o pci_dlpar.o obj-$(CONFIG_PSERIES_MSI) += msi.o +obj-$(CONFIG_PSERIES_ENERGY) += pseries_energy.o obj-$(CONFIG_HOTPLUG_CPU) += hotplug-cpu.o obj-$(CONFIG_MEMORY_HOTPLUG) += hotplug-memory.o diff --git a/arch/powerpc/platforms/pseries/pseries_energy.c b/arch/powerpc/platforms/pseries/pseries_energy.c new file mode 100644 index 000..e217ea3 --- /dev/null +++ b/arch/powerpc/platforms/pseries/pseries_energy.c @@ -0,0 +1,326 @@ +/* + * POWER platform energy management driver + * Copyright (C) 2010 IBM Corporation + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public License + * version 2 as published by the Free Software Foundation. + * + * This pseries platform device driver provides access to + * platform energy management capabilities. + */ + +#include <linux/module.h> +#include <linux/types.h> +#include <linux/errno.h> +#include <linux/init.h> +#include <linux/seq_file.h> +#include <linux/sysdev.h> +#include <linux/cpu.h> +#include <linux/of.h> +#include <asm/cputhreads.h> +#include <asm/page.h> +#include <asm/hvcall.h> + + +#define MODULE_VERS "1.0" +#define MODULE_NAME "pseries_energy" + +/* Driver flags */ + +static int sysfs_entries; + +/* Helper routines */ + +/* + * Routine to detect firmware support for hcall + * return 1 if H_BEST_ENERGY is supported + * else return 0
Re: [PATCH v3 2/2] powerpc: add support for new hcall H_BEST_ENERGY
* Michael Neuling mi...@neuling.org [2010-06-28 16:11:06]: [snip] These hints are abstract numbers given by the hypervisor based on the extended knowledge the hypervisor has regarding the current system topology and resource mappings. The activate and the deactivate part is for the two distinct operations that we could do for energy savings. When we have more capacity than required, we could deactivate a few cores to save energy. The choice of the core to deactivate will be based on /sys/devices/system/cpu/deactivate_hint_list. The comma separated list of cpus (cores) will be the preferred choice. Once the system has a few deactivated cores, based on workload demand we may have to activate them to meet the demand. In that case the /sys/devices/system/cpu/activate_hint_list will be used to prefer the core in-order among the deactivated cores. In simple terms, activate_hint_list will be null until we deactivate a few cores. Then we could look at the corresponding list for activation or deactivation. Can you put these details in the code and in the check-in comments? Hi Mikey, I have added these in the -v4 post. Regarding your second point, there is a reason for both a list and per-cpu interface. The list gives us a system wide list of cores in one shot for userspace to base their decision. This will be the preferred interface for most cases. On the other hand, the per-cpu file /sys/device/system/cpu/cpuN/pseries_(de)activate_hint provides more information since it exports the hint value as such. The idea is that the list interface will be used to get a suggested list of cores to manage, while the per-cpu value can be used to further get fine-grained information on a per-core basis from the hypervisor. This allows Linux to have access to all information that the hypervisor has offered through this hcall interface. OK, I didn't realise that they contained different info. Just more reasons that this interface needs better documentation :-) Overall, I'm mostly happy with the interface. 
It's pretty light weight. these too. Other comments below. [snip] diff --git a/arch/powerpc/platforms/pseries/Kconfig b/arch/powerpc/platforms/pseries/Kconfig index c667f0f..b3dd108 100644 --- a/arch/powerpc/platforms/pseries/Kconfig +++ b/arch/powerpc/platforms/pseries/Kconfig @@ -33,6 +33,16 @@ config PSERIES_MSI depends on PCI_MSI && EEH default y +config PSERIES_ENERGY Probably need a less generic name. PSERIES_ENERGY_MANAGEMENT? PSERIES_ENERGY_HOTPLUG_HINTS? PSERIES_ENERGY_MANAGEMENT may be good but too long for a config option. The idea is to collect all energy management functions in this module as and when new features are introduced in the pseries platform. This hcall interface is the first to be included, but going forward in future I do not propose to have different modules for other energy management related features. The name is specific enough for IBM pseries platform and energy management functions and enablements. Having a less generic name below this level will make it difficult to add all varieties of energy management functions in future. OK, I thought this might be the case but you never said. Please say something like "This adds CONFIG_PSERIES_ENERGY which will be used for future power saving code" or some such. I already had this comment in the patch description. Did not want to add a comment in the Kconfig file as the CONFIG_ prefix is assumed in each of the + +/* Helper Routines to convert between drc_index to cpu numbers */ + +static u32 cpu_to_drc_index(int cpu) +{ + struct device_node *dn = NULL; + const int *indexes; + int i; + dn = of_find_node_by_path("/cpus"); + if (dn == NULL) + goto err; Humm, I'm not sure this is really needed. If you don't have /cpus you are probably not going to boot. Good suggestion. I could add all these checks in module_init. I was thinking that if any of the functions being called allocates memory and fails, we need to abort. 
I just reviewed and it looks like of_find_node_by_path() will not sleep or allocate any memory. So if it succeeds once in module_init(), then it will never fail! + indexes = of_get_property(dn, "ibm,drc-indexes", NULL); + if (indexes == NULL) + goto err; These checks should probably be moved to module init rather than sysfs read time. If they fail, don't load the module and print a warning. These HCALLS and device-tree entries aren't going to be dynamic. Agreed. Only cause of runtime failure is OOM. If none of these allocate memory, moving these checks once at module_init() will be a good optimization. Cool, thanks. Hey, I did not yet remove the failure checks in
Re: [PATCH] powerpc: ONLINE to OFFLINE CPU state transition during removal
* Robert Jennings r...@linux.vnet.ibm.com [2010-07-22 21:43:44]: If a CPU remove is attempted using the 'release' interface on hardware which supports extended cede, the CPU will be put in the INACTIVE state rather than the OFFLINE state due to the default preferred_offline_state in that situation. In the INACTIVE state it will fail to be removed. This patch changes the preferred offline state to OFFLINE when a CPU is in the ONLINE state. After cpu_down() is called in dlpar_offline_cpu() the CPU will be OFFLINE and CPU removal can continue. Hi Robert, Thanks for the patch. In dlpar operation, we would offline the CPU first using the sysfs online file and then write to the sysfs release file to complete the sequence, right? The current code in dlpar_offline_cpu() would work as long as the cpu is in either inactive state or offline state (in case of unsupported platform). Are the dlpar tools being changed to complete the operation with one sysfs write to the release file? Signed-off-by: Robert Jennings r...@linux.vnet.ibm.com --- diff --git a/arch/powerpc/platforms/pseries/dlpar.c b/arch/powerpc/platforms/pseries/dlpar.c index d71e585..227c1c3 100644 --- a/arch/powerpc/platforms/pseries/dlpar.c +++ b/arch/powerpc/platforms/pseries/dlpar.c @@ -463,6 +463,7 @@ static int dlpar_offline_cpu(struct device_node *dn) break; if (get_cpu_current_state(cpu) == CPU_STATE_ONLINE) { + set_preferred_offline_state(cpu, CPU_STATE_OFFLINE); cpu_maps_update_done(); rc = cpu_down(cpu); if (rc) The patch looks good. Will need to test out the various scenarios so that the preferred_offline_state does not get flipped before cpu_down() is called. This is unlikely, but still we need to validate a concurrent sysfs online file write and sysfs release file write. --Vaidy
Re: [PATCH][RFC] preempt_count corruption across H_CEDE call with CONFIG_PREEMPT on pseries
* Darren Hart dvh...@us.ibm.com [2010-07-22 21:44:04]: On 07/22/2010 04:57 PM, Darren Hart wrote: On 07/22/2010 03:25 PM, Benjamin Herrenschmidt wrote: On Thu, 2010-07-22 at 11:24 -0700, Darren Hart wrote: 1) How can the preempt_count() get mangled across the H_CEDE hcall? 2) Should we call preempt_enable() in cpu_idle() prior to cpu_die() ? The preempt count is on the thread info at the bottom of the stack. Can you check the stack pointers ? Hi Ben, thanks for looking. I instrumented the area around extended_cede_processor() as follows (please confirm I'm getting the stack pointer correctly). while (get_preferred_offline_state(cpu) == CPU_STATE_INACTIVE) { asm("mr %0,1" : "=r" (sp)); printk("before H_CEDE current->stack: %lx, pcnt: %x\n", sp, preempt_count()); extended_cede_processor(cede_latency_hint); asm("mr %0,1" : "=r" (sp)); printk("after H_CEDE current->stack: %lx, pcnt: %x\n", sp, preempt_count()); } On Mainline (2.6.33.6, CONFIG_PREEMPT=y) I see this: Jul 22 18:37:08 igoort1 kernel: before H_CEDE current->stack: c0010e9e3ce0, pcnt: 1 Jul 22 18:37:08 igoort1 kernel: after H_CEDE current->stack: c0010e9e3ce0, pcnt: 1 This surprised me as preempt_count is 1 before and after, so no corruption appears to occur on mainline. This makes the pcnt of 65 I see without the preempt_count()=0 hack very strange. I ran several hundred off/on cycles. The issue of preempt_count being 1 is still addressed by this patch however. On PREEMPT_RT (2.6.33.5-rt23 - tglx, sorry, rt/2.6.33 next time, promise): Jul 22 18:51:11 igoort1 kernel: before H_CEDE current->stack: c00089bcfcf0, pcnt: 1 Jul 22 18:51:11 igoort1 kernel: after H_CEDE current->stack: c00089bcfcf0, pcnt: In both cases the stack pointer appears unchanged. 
Note: there is a BUG triggered in between these statements as the preempt_count causes the printk to trigger: Badness at kernel/sched.c:5572 At Steven's suggestion I updated the instrumentation to display the local_save_flags and irqs_disabled_flags: while (get_preferred_offline_state(cpu) == CPU_STATE_INACTIVE) { local_save_flags(flags); printk("local flags: %lx, irqs disabled: %d\n", flags, irqs_disabled_flags(flags)); asm("mr %0,1" : "=r" (sp)); printk("before H_CEDE current->stack: %lx, pcnt: %x\n", sp, preempt_count()); extended_cede_processor(cede_latency_hint); asm("mr %0,1" : "=r" (sp)); printk("after H_CEDE current->stack: %lx, pcnt: %x\n", sp, preempt_count()); } Jul 22 23:36:58 igoort1 kernel: local flags: 0, irqs disabled: 1 Jul 22 23:36:58 igoort1 kernel: before H_CEDE current->stack: c0010e9e3ce0, pcnt: 1 Jul 22 23:36:58 igoort1 kernel: after H_CEDE current->stack: c0010e9e3ce0, pcnt: 1 I'm not sure if I'm reading that right, but I believe interrupts are intended to be disabled here. If accomplished via the spin_lock_irqsave() this would behave differently on RT. However, this path disables the interrupts handled by xics, all but the IPIs anyway. On RT I disabled the decrementer as well. Is it possible for RT to be receiving other interrupts here? Yes. extended_cede_processor() will return with interrupts enabled in the cpu. (This is done by the hypervisor). Under normal cases we cannot be interrupted because no IO interrupts are routed to us after xics_teardown_cpu() and since the CPU is out of the map, nobody will send us IPIs. Though H_CEDE will return with interrupts enabled, it is unlikely that an interrupt can be delivered in this context. --Vaidy
Re: [PATCH][RFC] preempt_count corruption across H_CEDE call with CONFIG_PREEMPT on pseries
* Benjamin Herrenschmidt b...@kernel.crashing.org [2010-07-23 15:11:00]: On Fri, 2010-07-23 at 10:38 +0530, Vaidyanathan Srinivasan wrote: Yes. extended_cede_processor() will return with interrupts enabled in the cpu. (This is done by the hypervisor). Under normal cases we cannot be interrupted because no IO interrupts are routed to us after xics_teardown_cpu() and since the CPU is out of the map, nobody will send us IPIs. What about decrementer ? Decrementer expiry event handling is a bit complex. The event as such may not bring back the extended_cede_processor() cpu, but may be marked pending when we get out of this state eventually. I will find more information on this event and update. Though H_CEDE will return with interrupts enabled, it is unlikely that an interrupt can be delivered in this context. Well, if interrupts are soft-disabled, even if one occurs, we will just mask and return, so that at least should be ok. Yes. We will immediately return to the extended_cede_processor() in the while loop until the preferred_offline_state is changed. --Vaidy
Re: [PATCH v4 1/2] powerpc: cleanup APIs for cpu/thread/core mappings
* Benjamin Herrenschmidt b...@kernel.crashing.org [2010-08-03 14:44:13]: On Thu, 2010-07-22 at 06:27 +0530, Vaidyanathan Srinivasan wrote: These APIs take logical cpu number as input Change cpu_first_thread_in_core() to cpu_leftmost_thread_sibling() Change cpu_last_thread_in_core() to cpu_rightmost_thread_sibling() I'm still not happy here. I don't find leftmost/rightmost to be a progress from the previous naming. If you really want to change to avoid in_core() at the end, then call them first_thread_sibling() and last_thread_sibling(). Hi Ben, The naming is based on our discussion during the first RFC: http://lists.ozlabs.org/pipermail/linuxppc-dev/2010-April/081610.html I am fine with first_thread_sibling() name instead of cpu_{left,right}most_thread_sibling(). The goal is to distinguish between APIs returning logical cpu numbers and core indexes. first_thread_sibling() implies that the parameter and return value are logical cpu (thread) numbers. I will change the name as per your suggestion and post an update. Then you change: -static inline int cpu_thread_to_core(int cpu) -{ - return cpu >> threads_shift; -} .../... to: -/* Must be called when no change can occur to cpu_present_mask, +/* Helper routines for cpu to core mapping */ +int cpu_core_of_thread(int cpu) +{ + return cpu >> threads_shift; +} +EXPORT_SYMBOL_GPL(cpu_core_of_thread); The cpu_core_of_thread() name implies that the return type is a core index and not a logical cpu number. cpu_core_of_thread(5) = 1 cpu_first_thread_of_core(1) = 4 Do you think cpu_core_number_of_thread() will explain the return type better? Any reason you are making it out of line other than gratuitous bloat ? :-) +int cpu_first_thread_of_core(int core) +{ + return core << threads_shift; +} +EXPORT_SYMBOL_GPL(cpu_first_thread_of_core); Same. The reason behind out of line is to avoid exporting threads_shift or other variables like threads_per_core and have this API exported so that modules can read them. 
If we make these wrappers inline, then we will have to export the variables themselves, which is not a good idea. Thanks for the review. --Vaidy
Re: [PATCH][RFC] preempt_count corruption across H_CEDE call with CONFIG_PREEMPT on pseries
* Darren Hart dvh...@us.ibm.com [2010-08-04 21:45:51]: On 07/23/2010 12:07 AM, Vaidyanathan Srinivasan wrote: * Benjamin Herrenschmidtb...@kernel.crashing.org [2010-07-23 15:11:00]: On Fri, 2010-07-23 at 10:38 +0530, Vaidyanathan Srinivasan wrote: Yes. extended_cede_processor() will return with interrupts enabled in the cpu. (This is done by the hypervisor). Under normal cases we cannot be interrupted because no IO interrupts are routed to us after xics_teardown_cpu() and since the CPU is out of the map, nobody will send us IPIs. What about decrementer ? Decrementer expiry event handling is a bit complex. The event as such may not bring back the extended_cede_processor() cpu, but may be marked pending when we get out of this state eventually. I will find more information on this event and update. Hi Vaidy, have you been able to dig anything up about the decrementer expiry? Hi Darren, Yes, it is possible that the cpu in extended_cede_processor() can be woken up by the decrementer. But the expectation is that we will return back to this context since we will not have any pending timers. Our stack should not change even if we get the decrementer interrupts. --Vaidy
Re: [PATCH] powerpc: ONLINE to OFFLINE CPU state transition during removal
* Nathan Fontenot nf...@austin.ibm.com [2010-07-26 14:13:35]: On 07/22/2010 11:13 PM, Vaidyanathan Srinivasan wrote: * Robert Jennings r...@linux.vnet.ibm.com [2010-07-22 21:43:44]: If a CPU remove is attempted using the 'release' interface on hardware which supports extended cede, the CPU will be put in the INACTIVE state rather than the OFFLINE state due to the default preferred_offline_state in that situation. In the INACTIVE state it will fail to be removed. This patch changes the preferred offline state to OFFLINE when a CPU is in the ONLINE state. After cpu_down() is called in dlpar_offline_cpu() the CPU will be OFFLINE and CPU removal can continue. Hi Robert, Thanks for the patch. In dlpar operation, we would offline the CPU first using the sysfs online file and then write to the sysfs release file to complete the sequence, right? The current code in dlpar_offline_cpu() would work as long as the cpu is in either inactive state or offline state (in case of unsupported platform). Are the dlpar tools being changed to complete the operation with one sysfs write to the release file? The dlpar tools were updated so that a single write to the 'release' file would offline the cpu and remove it from the system. Given this, I think Robert's patch should go forward to maintain compatibility. Hi Nathan, Thanks for clarifying. One concern that I have is that this change could race with a sysfs write to the cpuN/online file. dlpar_offline_cpu() is called with cpu_hotplug_driver_lock() held so the sysfs operation on the online file should wait. However by the time the cpu_hotplug_driver_unlock() happens the dlpar operation should have completed and we should not be able to online or offline the cpu from the sysfs online file. I wanted to confirm that a concurrent sysfs write cannot cause the dlpar operation to fail in the middle of the operation. The previous code will work fine for platforms not supporting cede_offline where we have only two states for the cpu. 
When cede_offline is supported, we have three states and the state transitions are controlled by the online file and the release file in two steps so that the failures can be retried in user space. This patch will work and I do not see any problem as such, but we still need to carefully evaluate the retry and error handling paths before switching over to a single sysfs write based dlpar operation. --Vaidy
Re: [PATCH][RFC] preempt_count corruption across H_CEDE call with CONFIG_PREEMPT on pseries
* Darren Hart dvh...@us.ibm.com [2010-08-05 19:19:00]: On 07/22/2010 10:09 PM, Benjamin Herrenschmidt wrote: On Thu, 2010-07-22 at 21:44 -0700, Darren Hart wrote: suggestion I updated the instrumentation to display the local_save_flags and irqs_disabled_flags: Jul 22 23:36:58 igoort1 kernel: local flags: 0, irqs disabled: 1 Jul 22 23:36:58 igoort1 kernel: before H_CEDE current->stack: c0010e9e3ce0, pcnt: 1 Jul 22 23:36:58 igoort1 kernel: after H_CEDE current->stack: c0010e9e3ce0, pcnt: 1 I'm not sure if I'm reading that right, but I believe interrupts are intended to be disabled here. If accomplished via the spin_lock_irqsave() this would behave differently on RT. However, this path disables the interrupts handled by xics, all but the IPIs anyway. On RT I disabled the decrementer as well. Is it possible for RT to be receiving other interrupts here? Also you may want to call hard_irq_disable() to -really- disable interrupts ... since we do lazy-disable on powerpc A quick update and request for direction wrt the cede hcall, interrupt handling, and the stack pointer. I've added patches one at a time, eliminating bugs with preempt_rt (tip/rt/head). With my current patchset I have no crashes with a single run of random_online.sh (stress testing to hit the hang Will sees is my todo for tomorrow). 
Current patchset includes: patches/0001-wms-fix01.patch # P7 lazy flushing thing # next four are sent to / queued for mainline patches/powerpc-increase-pseries_cpu_die-delay.patch patches/powerpc-enable-preemption-before-cpu_die.patch patches/powerpc-silence-__cpu_up-under-normal-operation.patch patches/powerpc-silence-xics_migrate_irqs_away-during-cpu-offline.patch # this one needs to be cleaned up and sent to mainline patches/powerpc-wait-for-cpu-to-go-inactive.patch patches/powerpc-disable-decrementer-on-offline.patch patches/powerpc-cpu_die-preempt-hack.patch # reset preempt_count to 0 after cede patches/powerpc-cede-processor-inst.patch # instrumentation patches/powerpc-hard_irq_disable.patch # hard_irq_disable before cede patches/powerpc-pad-thread_info.patch I didn't include all the patches as the relevant bits are included below in code form. With the instrumentation, it's clear the change to preempt_count() is targeted (since I added padding before and after preempt_count in the thread_info struct) and preempt_count is still changed. It is also clear that it isn't just a stray decrement since I set preempt_count() to 4 prior to calling cede and it still is 0x after the hcall. Also note that the stack pointer doesn't change across the cede call and neither does any other part of the thread_info structure. Adding hard_irq_disable() did seem to affect things somewhat. Rather than a serialized list of before/after thread_info prints around cede, I see several befores then several afters. But the preempt_count is still modified. The relevant code looks like this (from pseries_mach_cpu_die()) hard_irq_disable(); /* this doesn't fix things... 
but does change behavior a bit */ preempt_count() = 0x4; asm("mr %0,1" : "=r" (sp)); /* stack pointer is in R1 */ printk("before cede: sp=%lx pcnt=%x\n", sp, preempt_count()); print_thread_info(); extended_cede_processor(cede_latency_hint); asm("mr %0,1" : "=r" (sp)); printk("after cede: sp=%lx pcnt=%x\n", sp, preempt_count()); print_thread_info(); preempt_count() = 0; With these patches applied, the output across cede looks like: before cede: sp=c0b57150 pcnt=4 *** current->thread_info *** ti->task: c0aa1410 ti->exec_domain: c0a59958 ti->cpu: 0 ti->stomp_on_me: 57005 /* 0xDEAD - forgot to print in hex */ ti->preempt_count: 4 ti->stomp_on_me_too: 48879 /* 0xBEEF - forgot to print in hex */ ti->local_flags: 0 *** end thread_info *** after cede: sp=c0b57150 pcnt= *** current->thread_info *** ti->task: c0aa1410 ti->exec_domain: c0a59958 ti->cpu: 0 Is this print sufficient to prove that we did not jump CPU? We agree that the decrementer can possibly expire and we could have gone to the handler and come back just like we would do in an idle loop. This will not happen always, but I could see a window of time during which this may happen. But that should not change the preempt_count unconditionally to -1 with the same stack pointer. ti->stomp_on_me: 57005 ti->preempt_count: ti->stomp_on_me_too: 48879 ti->local_flags: 0 Is there a method to identify whether we had executed any of the interrupt handlers while in this code? *** end thread_info *** Are there any additional thoughts on what
[PATCH v5 0/2] powerpc: add support for new hcall H_BEST_ENERGY
Hi Ben, The following series adds a new kernel module for powerpc pseries platforms in order to export platform energy management capabilities. The module exports data from a new hypervisor call H_BEST_ENERGY. Comments and suggestions made on the previous iteration of the patch related to function naming have been incorporated. Changes in v5: * Renamed cpu_{left,right}most_thread_sibling() to cpu_{first,last}_thread_sibling() * Renamed cpu_core_of_thread() to cpu_core_index_of_thread() (these functions work on a core index) * Rebased to 2.6.36-rc6 and tested on a supported platform Changes in v4: [4] [PATCH v4 0/2] powerpc: add support for new hcall H_BEST_ENERGY http://lists.ozlabs.org/pipermail/linuxppc-dev/2010-July/084217.html * Added more documentation * Added check_for_h_best_energy() to look in ibm,hypertas-functions so that sysfs entries are not created on an unsupported platform * Added cleaner error checks and correct of_node_put() * Rebased and tested on 2.6.35-rc5 Changes in v3: [3] [PATCH v3 0/2] powerpc: add support for new hcall H_BEST_ENERGY http://lists.ozlabs.org/pipermail/linuxppc-dev/2010-June/083414.html * Added more documentation in the cleanup patch * Removed RFC tag, rebased and tested on 2.6.35-rc3 * Ready for inclusion in powerpc/next tree for further testing Changes in v2: [2] [RFC PATCH v2 0/2] powerpc: add support for new hcall H_BEST_ENERGY http://lists.ozlabs.org/pipermail/linuxppc-dev/2010-May/082246.html * Cleanup cpu/thread/core APIs * Export APIs to module instead of threads_per_core * Use of_find_node_by_path() instead of of_find_node_by_name() * Error checking and whitespace cleanups First version: [1] [RFC] powerpc: add support for new hcall H_BEST_ENERGY http://lists.ozlabs.org/pipermail/linuxppc-dev/2010-March/080796.html This patch series will apply on 2.6.36-rc6 as well as the powerpc/next tree. Please review and include in the powerpc/next tree for further testing. 
I could incrementally reduce some of the error checks as suggested by Michael Neuling as next steps. This patch series is conservative and has more error checking in device tree parsing and drc index matching code than what may be required. Thanks, Vaidy --- Vaidyanathan Srinivasan (2): powerpc: cleanup APIs for cpu/thread/core mappings powerpc: add support for new hcall H_BEST_ENERGY arch/powerpc/include/asm/cputhreads.h | 15 + arch/powerpc/include/asm/hvcall.h |3 arch/powerpc/kernel/smp.c | 19 + arch/powerpc/mm/mmu_context_nohash.c| 12 - arch/powerpc/platforms/pseries/Kconfig | 10 + arch/powerpc/platforms/pseries/Makefile |1 arch/powerpc/platforms/pseries/pseries_energy.c | 326 +++ 7 files changed, 370 insertions(+), 16 deletions(-) create mode 100644 arch/powerpc/platforms/pseries/pseries_energy.c
[PATCH v5 1/2] powerpc: cleanup APIs for cpu/thread/core mappings
These APIs take a logical cpu number as input: Change cpu_first_thread_in_core() to cpu_first_thread_sibling() Change cpu_last_thread_in_core() to cpu_last_thread_sibling() These APIs convert a core number (index) to logical cpu/thread numbers: Add cpu_first_thread_of_core(int core) Change cpu_thread_to_core() to cpu_core_index_of_thread(int cpu) The goal is to make 'threads_per_core' accessible to the pseries_energy module. Instead of making an API to read threads_per_core, these are higher level wrapper functions to convert between logical cpu numbers and core numbers. The current APIs cpu_first_thread_in_core() and cpu_last_thread_in_core() return logical CPU numbers, while cpu_thread_to_core() returns a core number or index, which is not a logical CPU number. The new APIs are now clearly named to distinguish 'core number' from the first and last 'logical cpu number' in that core. The new APIs cpu_{first,last}_thread_sibling() work on logical cpu numbers, while cpu_first_thread_of_core() and cpu_core_index_of_thread() work on the core index. Example usage: (4 threads per core system) cpu_first_thread_sibling(5) = 4 cpu_last_thread_sibling(5) = 7 cpu_core_index_of_thread(5) = 1 cpu_first_thread_of_core(1) = 4 cpu_core_index_of_thread() is used in cpu_to_drc_index() in the module and cpu_first_thread_of_core() is used in drc_index_to_cpu() in the module. Make API changes to a few callers. Export symbols for use in modules.
Signed-off-by: Vaidyanathan Srinivasan sva...@linux.vnet.ibm.com --- arch/powerpc/include/asm/cputhreads.h | 15 +-- arch/powerpc/kernel/smp.c | 19 --- arch/powerpc/mm/mmu_context_nohash.c | 12 ++-- 3 files changed, 31 insertions(+), 15 deletions(-) diff --git a/arch/powerpc/include/asm/cputhreads.h b/arch/powerpc/include/asm/cputhreads.h index a8e1844..f71bb4c 100644 --- a/arch/powerpc/include/asm/cputhreads.h +++ b/arch/powerpc/include/asm/cputhreads.h @@ -61,22 +61,25 @@ static inline cpumask_t cpu_online_cores_map(void) return cpu_thread_mask_to_cores(cpu_online_map); } -static inline int cpu_thread_to_core(int cpu) -{ - return cpu >> threads_shift; -} +#ifdef CONFIG_SMP +int cpu_core_index_of_thread(int cpu); +int cpu_first_thread_of_core(int core); +#else +static inline int cpu_core_index_of_thread(int cpu) { return cpu; } +static inline int cpu_first_thread_of_core(int core) { return core; } +#endif static inline int cpu_thread_in_core(int cpu) { return cpu & (threads_per_core - 1); } -static inline int cpu_first_thread_in_core(int cpu) +static inline int cpu_first_thread_sibling(int cpu) { return cpu & ~(threads_per_core - 1); } -static inline int cpu_last_thread_in_core(int cpu) +static inline int cpu_last_thread_sibling(int cpu) { return cpu | (threads_per_core - 1); } diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c index 0008bc5..012d6e2 100644 --- a/arch/powerpc/kernel/smp.c +++ b/arch/powerpc/kernel/smp.c @@ -466,7 +466,20 @@ out: return id; } -/* Must be called when no change can occur to cpu_present_mask, +/* Helper routines for cpu to core mapping */ +int cpu_core_index_of_thread(int cpu) +{ + return cpu >> threads_shift; +} +EXPORT_SYMBOL_GPL(cpu_core_index_of_thread); + +int cpu_first_thread_of_core(int core) +{ + return core << threads_shift; +} +EXPORT_SYMBOL_GPL(cpu_first_thread_of_core); + +/* Must be called when no change can occur to cpu_present_map, * i.e. during cpu online or offline.
*/ static struct device_node *cpu_to_l2cache(int cpu) @@ -517,7 +530,7 @@ int __devinit start_secondary(void *unused) notify_cpu_starting(cpu); set_cpu_online(cpu, true); /* Update sibling maps */ - base = cpu_first_thread_in_core(cpu); + base = cpu_first_thread_sibling(cpu); for (i = 0; i < threads_per_core; i++) { if (cpu_is_offline(base + i)) continue; @@ -596,7 +609,7 @@ int __cpu_disable(void) return err; /* Update sibling maps */ - base = cpu_first_thread_in_core(cpu); + base = cpu_first_thread_sibling(cpu); for (i = 0; i < threads_per_core; i++) { cpumask_clear_cpu(cpu, cpu_sibling_mask(base + i)); cpumask_clear_cpu(base + i, cpu_sibling_mask(cpu)); diff --git a/arch/powerpc/mm/mmu_context_nohash.c b/arch/powerpc/mm/mmu_context_nohash.c index ddfd7ad..9aa754c 100644 --- a/arch/powerpc/mm/mmu_context_nohash.c +++ b/arch/powerpc/mm/mmu_context_nohash.c @@ -111,8 +111,8 @@ static unsigned int steal_context_smp(unsigned int id) * a core map instead but this will do for now. */ for_each_cpu(cpu, mm_cpumask(mm)) { - for (i = cpu_first_thread_in_core(cpu); - i <= cpu_last_thread_in_core(cpu); i++) + for (i = cpu_first_thread_sibling(cpu
[PATCH v5 2/2] powerpc: add support for new hcall H_BEST_ENERGY
Create a sysfs interface to export data from the H_BEST_ENERGY hcall that can be used by administrative tools on supported pseries platforms for energy management optimizations. /sys/device/system/cpu/pseries_(de)activate_hint_list and /sys/device/system/cpu/cpuN/pseries_(de)activate_hint will provide hints for activation and deactivation of cpus respectively. These hints are abstract numbers given by the hypervisor based on the extended knowledge the hypervisor has regarding the current system topology and resource mappings. The activate and the deactivate sysfs entries are for the two distinct operations that we could do for energy savings. When we have more capacity than required, we could deactivate a few cores to save energy. The choice of the cores to deactivate will be based on /sys/devices/system/cpu/deactivate_hint_list. The comma separated list of cpus (cores) will be the preferred choice. If we have to activate some of the deactivated cores, then /sys/devices/system/cpu/activate_hint_list will be used. The per-cpu file /sys/device/system/cpu/cpuN/pseries_(de)activate_hint further provides more fine-grained information by exporting the value of the hint itself.
Added new driver module arch/powerpc/platforms/pseries/pseries_energy.c under new config option CONFIG_PSERIES_ENERGY Signed-off-by: Vaidyanathan Srinivasan sva...@linux.vnet.ibm.com --- arch/powerpc/include/asm/hvcall.h |3 arch/powerpc/platforms/pseries/Kconfig | 10 + arch/powerpc/platforms/pseries/Makefile |1 arch/powerpc/platforms/pseries/pseries_energy.c | 326 +++ 4 files changed, 339 insertions(+), 1 deletions(-) create mode 100644 arch/powerpc/platforms/pseries/pseries_energy.c diff --git a/arch/powerpc/include/asm/hvcall.h b/arch/powerpc/include/asm/hvcall.h index de03ca5..bf86b03 100644 --- a/arch/powerpc/include/asm/hvcall.h +++ b/arch/powerpc/include/asm/hvcall.h @@ -232,7 +232,8 @@ #define H_GET_EM_PARMS 0x2B8 #define H_SET_MPP 0x2D0 #define H_GET_MPP 0x2D4 -#define MAX_HCALL_OPCODE H_GET_MPP +#define H_BEST_ENERGY 0x2F4 +#define MAX_HCALL_OPCODE H_BEST_ENERGY #ifndef __ASSEMBLY__ diff --git a/arch/powerpc/platforms/pseries/Kconfig b/arch/powerpc/platforms/pseries/Kconfig index c667f0f..8323622 100644 --- a/arch/powerpc/platforms/pseries/Kconfig +++ b/arch/powerpc/platforms/pseries/Kconfig @@ -33,6 +33,16 @@ config PSERIES_MSI depends on PCI_MSI && EEH default y +config PSERIES_ENERGY + tristate "pSeries energy management capabilities driver" + depends on PPC_PSERIES + default y + help + Provides interface to platform energy management capabilities + on supported PSERIES platforms.
+ Provides: /sys/devices/system/cpu/pseries_(de)activation_hint_list + and /sys/devices/system/cpu/cpuN/pseries_(de)activation_hint + config SCANLOG tristate "Scanlog dump interface" depends on RTAS_PROC && PPC_PSERIES diff --git a/arch/powerpc/platforms/pseries/Makefile b/arch/powerpc/platforms/pseries/Makefile index 046ace9..50a7ea3 100644 --- a/arch/powerpc/platforms/pseries/Makefile +++ b/arch/powerpc/platforms/pseries/Makefile @@ -16,6 +16,7 @@ obj-$(CONFIG_EEH) += eeh.o eeh_cache.o eeh_driver.o eeh_event.o eeh_sysfs.o obj-$(CONFIG_KEXEC)+= kexec.o obj-$(CONFIG_PCI) += pci.o pci_dlpar.o obj-$(CONFIG_PSERIES_MSI) += msi.o +obj-$(CONFIG_PSERIES_ENERGY) += pseries_energy.o obj-$(CONFIG_HOTPLUG_CPU) += hotplug-cpu.o obj-$(CONFIG_MEMORY_HOTPLUG) += hotplug-memory.o diff --git a/arch/powerpc/platforms/pseries/pseries_energy.c b/arch/powerpc/platforms/pseries/pseries_energy.c new file mode 100644 index 000..c8b3c69 --- /dev/null +++ b/arch/powerpc/platforms/pseries/pseries_energy.c @@ -0,0 +1,326 @@ +/* + * POWER platform energy management driver + * Copyright (C) 2010 IBM Corporation + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public License + * version 2 as published by the Free Software Foundation. + * + * This pseries platform device driver provides access to + * platform energy management capabilities. + */ + +#include <linux/module.h> +#include <linux/types.h> +#include <linux/errno.h> +#include <linux/init.h> +#include <linux/seq_file.h> +#include <linux/sysdev.h> +#include <linux/cpu.h> +#include <linux/of.h> +#include <asm/cputhreads.h> +#include <asm/page.h> +#include <asm/hvcall.h> + + +#define MODULE_VERS "1.0" +#define MODULE_NAME "pseries_energy" + /* Driver flags */ + +static int sysfs_entries; + +/* Helper routines */ + +/* + * Routine to detect firmware support for hcall + * return 1 if H_BEST_ENERGY is supported + * else return 0
Re: [RFC PATCH 4/5] cpuidle/ppc: CPU goes tickless if there are no arch-specific constraints
* Benjamin Herrenschmidt b...@kernel.crashing.org [2013-07-27 16:30:05]: On Fri, 2013-07-26 at 08:09 +0530, Preeti U Murthy wrote: *The lapic of a broadcast CPU is active always*. Say CPUX, wants the broadcast CPU to wake it up at timeX. Since we cannot program the lapic of a remote CPU, CPUX will need to send an IPI to the broadcast CPU, asking it to program its lapic to fire at timeX so as to wake up CPUX. *With multiple CPUs the overhead of sending IPI, could result in performance bottlenecks and may not scale well.* Hence the workaround is that the broadcast CPU on each of its timer interrupts checks if the next timer event of any CPU in deep idle state has expired, which can very well be found from dev->next_event of that CPU. For example the timeX that has been mentioned above has expired. If so the broadcast handler is called to send an IPI to the idling CPU to wake it up. *If the broadcast CPU, is in tickless idle, its timer interrupt could be many ticks away. It could miss waking up a CPU in deep idle*, if its wakeup is much before this timer interrupt of the broadcast CPU. But without tickless idle, at least at each period we are assured of a timer interrupt. At which time broadcast handling is done as stated in the previous paragraph and we will not miss wakeup of CPUs in deep idle states. But that means a great loss of power saving on the broadcast CPU when the machine is basically completely idle. We might be able to come up with something better. Hi Ben, Yes, we will need to improve on this case in stages. In the current design, we will have to hold one of the CPUs in a shallow idle state (nap) to wake up other deep idle cpus. The cost of keeping the periodic tick ON on the broadcast CPU is minimal (but not zero) since we would not allow that CPU to enter any deep idle states even if there were no periodic timers queued.
(Note : I do not know the timer offload code if it exists already, I'm describing how things could happen out of the blue without any knowledge of pre-existing framework here) We can know when the broadcast CPU expects to wake up next. When a CPU goes to a deep sleep state, it can then - Indicate to the broadcast CPU when it intends to be woken up by queuing itself into an ordered queue (ordered by target wakeup time). (OPTIMISATION: Play with the locality of that: have one queue (and one broadcast CPU) per chip or per node instead of a global one to limit cache bouncing). - Check if that happens before the broadcast CPU intended wake time (we need statistics to see how often that happens), and in that case send an IPI to wake it up now. When the broadcast CPU goes to sleep, it limits its sleep time to the min of its intended sleep time and the new sleeper time. (OPTIMISATION: Dynamically choose a broadcast CPU based on closest expiry ?) This will be an improvement and the idea we have is to have a hierarchical method of finding a waking CPU within core/socket/node in order to find a better fit and ultimately send an IPI to wake up a broadcast CPU only if there is no other fit. This condition would imply that more CPUs are in deep idle state and the cost of sending an IPI to reprogram the broadcast cpu's local timer may well pay off. - We can probably limit spurious wakeups a *LOT* by aligning that target time to a global jiffy boundary, meaning that several CPUs going to idle are likely to be choosing the same. Or maybe better, an adaptive alignment by essentially getting more coarse grained as we go further in the future - When the broadcast CPU goes to sleep, it can play the same game of alignment. CPUs entering deep idle state would need to wake up only at a jiffy boundary or the jiffy boundary earlier than the target wakeup time.
Your point is can the broadcast cpu wakeup the sleeping CPU *around* the designated wakeup time (earlier) so as to avoid reprogramming its timer. I don't like the concept of a dedicated broadcast CPU however. I'd rather have a general queue (or per node) of sleepers needing a wakeup and more/less dynamically pick a waker to be the last man standing, but it does make things a bit more tricky with tickless scheduler (non-idle). Still, I wonder if we could just have some algorithm to actually pick wakers more dynamically based on who ever has the closest next wakeup planned, that sort of thing. A fixed broadcaster will create an imbalance in power/thermal within the chip in addition to needing to be moved around on hotplug etc... Right Ben. The hierarchical way of selecting the waker will help us have multiple wakers in different sockets/cores. The broadcast framework allows us to decouple the cpu going to idle and the waker to be selected independently. This patch series is the start where we pick one and allow it to move around. The ideal goal to achieve would be that we can have multiple wakers serving wakeup requests from a
Re: [RFC PATCH 4/5] cpuidle/ppc: CPU goes tickless if there are no arch-specific constraints
* Preeti U Murthy pre...@linux.vnet.ibm.com [2013-07-27 13:20:37]: Hi Ben, On 07/27/2013 12:00 PM, Benjamin Herrenschmidt wrote: On Fri, 2013-07-26 at 08:09 +0530, Preeti U Murthy wrote: *The lapic of a broadcast CPU is active always*. Say CPUX, wants the broadcast CPU to wake it up at timeX. Since we cannot program the lapic of a remote CPU, CPUX will need to send an IPI to the broadcast CPU, asking it to program its lapic to fire at timeX so as to wake up CPUX. *With multiple CPUs the overhead of sending IPI, could result in performance bottlenecks and may not scale well.* Hence the workaround is that the broadcast CPU on each of its timer interrupt checks if any of the next timer event of a CPU in deep idle state has expired, which can very well be found from dev-next_event of that CPU. For example the timeX that has been mentioned above has expired. If so the broadcast handler is called to send an IPI to the idling CPU to wake it up. *If the broadcast CPU, is in tickless idle, its timer interrupt could be many ticks away. It could miss waking up a CPU in deep idle*, if its wakeup is much before this timer interrupt of the broadcast CPU. But without tickless idle, atleast at each period we are assured of a timer interrupt. At which time broadcast handling is done as stated in the previous paragraph and we will not miss wakeup of CPUs in deep idle states. But that means a great loss of power saving on the broadcast CPU when the machine is basically completely idle. We might be able to come up with some thing better. (Note : I do no know the timer offload code if it exists already, I'm describing how things could happen out of the blue without any knowledge of pre-existing framework here) We can know when the broadcast CPU expects to wake up next. When a CPU goes to a deep sleep state, it can then - Indicate to the broadcast CPU when it intends to be woken up by queuing itself into an ordered queue (ordered by target wakeup time). 
(OPTIMISATION: Play with the locality of that: have one queue (and one broadcast CPU) per chip or per node instead of a global one to limit cache bouncing). - Check if that happens before the broadcast CPU intended wake time (we need statistics to see how often that happens), and in that case send an IPI to wake it up now. When the broadcast CPU goes to sleep, it limits its sleep time to the min of it's intended sleep time and the new sleeper time. (OPTIMISATION: Dynamically chose a broadcast CPU based on closest expiry ?) - We can probably limit spurrious wakeups a *LOT* by aligning that target time to a global jiffy boundary, meaning that several CPUs going to idle are likely to be choosing the same. Or maybe better, an adaptative alignment by essentially getting more coarse grained as we go further in the future - When the broadcast CPU goes to sleep, it can play the same game of alignment. I don't like the concept of a dedicated broadcast CPU however. I'd rather have a general queue (or per node) of sleepers needing a wakeup and more/less dynamically pick a waker to be the last man standing, but it does make things a bit more tricky with tickless scheduler (non-idle). Still, I wonder if we could just have some algorithm to actually pick wakers more dynamically based on who ever has the closest next wakeup planned, that sort of thing. A fixed broadcaster will create an imbalance in power/thermal within the chip in addition to needing to be moved around on hotplug etc... Thank you for having listed out the above suggestions. Below, I will bring out some ideas about how the concerns that you have raised can be addressed in the increasing order of priority. - To begin with, I think we can have the following model to have the responsibility of the broadcast CPU float around certain CPUs. i.e. Not have a dedicated broadcast CPU. I will refer to the broadcast CPU as the bc_cpu henceforth for convenience. 1. 
The first CPU that intends to enter deep sleep state will be the bc_cpu. 2. Every other CPU that intends to enter deep idle state will add itself to a mask, say the bc_mask, which is already being done today, after it checks that a bc_cpu has been assigned. 3. The bc_cpu should not enter tickless idle, until step 5a holds true. 4. So on every timer interrupt, which occurs at least every period, it checks the bc_mask to see if any CPUs need to be woken up. 5. The bc_cpu should not enter tickless idle *until* it is de-nominated as the bc_cpu. The de-nomination occurs when: a. In one of its timer interrupts, it does broadcast handling to find out that there are no CPUs to be woken up. 6. So if 5a holds, then there is no bc_cpu anymore until a CPU decides to enter deep idle state again, in which case steps 1 to 5 repeat. - We could optimize this further, to allow
[PATCH] powerpc: default arch idle could cede processor on pseries
Hi, Idle routines on pseries were rearranged so that cpuidle can do an optimized idle state selection. However, until cpuidle takes over during boot, the idle loop spins for a short while. This actually affected bootup time since spinning idle sibling threads slow down the master cpu that executes bootup code. The following patch enables the pseries system to yield to the hypervisor and stop spinning by calling cede_processor() until cpuidle can take over and do optimal idle state selection. Bootup time can be reduced to half on a small guest where most of the time is spent before device init. --Vaidy powerpc: default arch idle should cede processor on pseries On IBM POWER platform (pseries), use cede_processor() in arch idle to save cycles until cpuidle takes over. This helps speed up boot by having SMT threads stop and yield to the master thread rather than spinning until cpuidle is initialized. Reported-by: Paul Mackerras pau...@samba.org Signed-off-by: Vaidyanathan Srinivasan sva...@linux.vnet.ibm.com Acked-by: Deepthi Dharwar deep...@linux.vnet.ibm.com diff --git a/arch/powerpc/platforms/pseries/setup.c b/arch/powerpc/platforms/pseries/setup.c index c11c823..fd4f995 100644 --- a/arch/powerpc/platforms/pseries/setup.c +++ b/arch/powerpc/platforms/pseries/setup.c @@ -362,10 +362,22 @@ static void pSeries_idle(void) if (cpuidle_idle_call()) { /* On error, execute default handler * to go into low thread priority and possibly -* low power mode. +* low power mode by ceding processor to hypervisor */ - HMT_low(); - HMT_very_low(); + + /* Indicate to hypervisor that we are idle. */ + get_lppaca()->idle = 1; + + /* +* Yield the processor to the hypervisor. We return if +* an external interrupt occurs (which are driven prior +* to returning here) or if a prod occurs from another +* processor. When returning here, external interrupts +* are enabled.
+*/ + cede_processor(); + + get_lppaca()->idle = 0; } } ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
[PATCH v2] powerpc: Default arch idle could cede processor on pseries
Hi, Idle routines on pseries were rearranged so that cpuidle can do an optimized idle state selection. However, until cpuidle takes over during boot, the idle loop spins for a short while. This actually affected bootup time since spinning idle sibling threads slow down the master cpu that executes bootup code. The following patch enables the pseries system to yield to the hypervisor and stop spinning by calling cede_processor() until cpuidle can take over and do optimal idle state selection. Bootup time can be reduced to half on a small guest where most of the time is spent before device init. Thanks Ben for the review and suggestions. --Vaidy powerpc: Default arch idle could cede processor on pseries When adding cpuidle support to pSeries, we introduced two regressions: - The new cpuidle backend driver only works under hypervisors supporting the SPLPAR option, which isn't the case of the old POWER4 hypervisor and the HV light used on js2x blades - The cpuidle driver registers fairly late, meaning that for a significant portion of the boot process, we end up having all threads spinning. This slows down the boot process and increases the overall resource usage if the hypervisor has shared processors. This fixes it by implementing a default idle that will cede to the hypervisor when possible, in a very simple way without all the bells and whistles of cpuidle.
Reported-by: Paul Mackerras pau...@samba.org Signed-off-by: Vaidyanathan Srinivasan sva...@linux.vnet.ibm.com Acked-by: Deepthi Dharwar deep...@linux.vnet.ibm.com Signed-off-by: Benjamin Herrenschmidt b...@kernel.crashing.org diff --git a/arch/powerpc/platforms/pseries/setup.c b/arch/powerpc/platforms/pseries/setup.c index c11c823..54b998f 100644 --- a/arch/powerpc/platforms/pseries/setup.c +++ b/arch/powerpc/platforms/pseries/setup.c @@ -354,7 +354,7 @@ static int alloc_dispatch_log_kmem_cache(void) } early_initcall(alloc_dispatch_log_kmem_cache); -static void pSeries_idle(void) +static void pseries_lpar_idle(void) { /* This would call on the cpuidle framework, and the back-end pseries * driver to go to idle states @@ -362,10 +362,22 @@ if (cpuidle_idle_call()) { /* On error, execute default handler * to go into low thread priority and possibly -* low power mode. +* low power mode by ceding processor to hypervisor */ - HMT_low(); - HMT_very_low(); + + /* Indicate to hypervisor that we are idle. */ + get_lppaca()->idle = 1; + + /* +* Yield the processor to the hypervisor. We return if +* an external interrupt occurs (which are driven prior +* to returning here) or if a prod occurs from another +* processor. When returning here, external interrupts +* are enabled.
+*/ + cede_processor(); + + get_lppaca()->idle = 0; } } @@ -456,15 +468,14 @@ static void __init pSeries_setup_arch(void) pSeries_nvram_init(); - if (firmware_has_feature(FW_FEATURE_SPLPAR)) { + if (firmware_has_feature(FW_FEATURE_LPAR)) { vpa_init(boot_cpuid); - ppc_md.power_save = pSeries_idle; - } - - if (firmware_has_feature(FW_FEATURE_LPAR)) + ppc_md.power_save = pseries_lpar_idle; ppc_md.enable_pmcs = pseries_lpar_enable_pmcs; - else + } else { + /* No special idle routine */ ppc_md.enable_pmcs = power4_enable_pmcs; + } ppc_md.pcibios_root_bridge_prepare = pseries_root_bridge_prepare; ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: Kernel crash caused by cpufreq
* Michael Ellerman m...@ellerman.id.au [2014-07-28 17:03:10]: On Fri, 2014-07-25 at 11:07 +1000, Gavin Shan wrote: I'm tracing one LSI interrupt issue on P8 box, and eventually into the following kernel crash. Not sure if there is one fix against this? :-) Vaidy wrote that I'm pretty sure (on CC). Yes, I did :) Starting Linux PPC64 #401 SMP Fri Jul 25 10:52:28 EST 2014 - ppc64_pft_size= 0x0 physicalMemorySize= 0x8 htab_address = 0xc007fe00 htab_hash_mask= 0x3 - - setup_system() Initializing cgroup subsys cpuset Initializing cgroup subsys cpuacct Linux version 3.16.0-rc6-00076-g4226dbe-dirty (shangw@shangw) (gcc version 4.5.2 (crosstool-NG 1.19.0) ) #401 SMP Fri Jul 25 10:52:28 EST 2014 : Unrelated log stripped Are you sure there's nothing else in the log that might be related? See the messages in init_powernv_pstates() for example. Most likely PMSR register is showing out of bound values because of potential firmware (OPAL) issue. [ cut here ] kernel BUG at drivers/cpufreq/powernv-cpufreq.c:134! cpu 0x1: Vector: 700 (Program Check) at [c007f8483370] pc: c096f32c: .pstate_id_to_freq+0x2c/0x50 lr: c096f37c: .powernv_read_cpu_freq+0x2c/0x50 sp: c007f84835f0 msr: 90029032 current = 0xc007f840 paca= 0xcff00400 softe: 0irq_happened: 0x01 pid = 1, comm = swapper/0 kernel BUG at drivers/cpufreq/powernv-cpufreq.c:134! enter ? 
for help [link register ] c096f37c .powernv_read_cpu_freq+0x2c/0x50 [c007f84835f0] c1076070 key_type_dns_resolver+0xef20/0x40d20 (unreliable) [c007f8483670] c010812c .generic_exec_single+0x8c/0x270 [c007f8483730] c01083d0 .smp_call_function_single+0x90/0xb0 [c007f84837b0] c0108cec .smp_call_function_any+0x15c/0x200 [c007f8483860] c096f27c .powernv_cpufreq_get+0x3c/0x60 [c007f84838f0] c0969dc4 .__cpufreq_add_dev.clone.11+0x574/0xa20 [c007f84839e0] c05f7a4c .subsys_interface_register+0xec/0x130 [c007f8483a90] c0967af8 .cpufreq_register_driver+0x168/0x2d0 [c007f8483b30] c0f774cc .powernv_cpufreq_init+0x210/0x244 [c007f8483be0] c000bc08 .do_one_initcall+0xc8/0x240 [c007f8483ce0] c0f44054 .kernel_init_freeable+0x268/0x33c [c007f8483db0] c000c4dc .kernel_init+0x1c/0x110 [c007f8483e30] c000a428 .ret_from_kernel_thread+0x58/0xb0 1:mon e cpu 0x1: Vector: 700 (Program Check) at [c007f8483370] pc: c096f32c: .pstate_id_to_freq+0x2c/0x50 lr: c096f37c: .powernv_read_cpu_freq+0x2c/0x50 sp: c007f84835f0 msr: 90029032 current = 0xc007f840 paca= 0xcff00400 softe: 0irq_happened: 0x01 pid = 1, comm = swapper/0 kernel BUG at drivers/cpufreq/powernv-cpufreq.c:134! 1:mon r R00 = 0042 R16 = R01 = c007f84835f0 R17 = R02 = c116c430 R18 = R03 = ffbe R19 = 0001 R04 = R20 = c0078bd61e58 R05 = c114ca78 R21 = c1008910 R06 = c007f84838d0 R22 = c11c1b74 R07 = 0001 R23 = c2200030 R08 = c11c1910 R24 = 0001 R09 = c1f2970c R25 = c1008730 R10 = 0001 R26 = c1f29478 R11 = 0042 R27 = c007f84838d0 R12 = 44002084 R28 = c114ca78 R13 = cff00400 R29 = R14 = c000c4c0 R30 = c10a90f0 R15 = R31 = c007f84838d0 pc = c096f32c .pstate_id_to_freq+0x2c/0x50 cfar= c096f324 .pstate_id_to_freq+0x24/0x50 lr = c096f37c .powernv_read_cpu_freq+0x2c/0x50 msr = 90029032 cr = 44002082 ctr = c096f350 xer = 2000 trap = 700 1:mon Gavin, in future for a dump like this it's very helpful to see the actual code that hit the bug. You can get that with: 1:mon di $.pstate_id_to_freq Vaidy, judging by r3 it looks like i became negative. 
That would obviously happen if powernv_pstate_info.max was zero? yes, negative is ok. Something has gone wrong with the PState firmware/hardware. A BUG_ON() is too severe for this error. I will change code to not stop the system for this error and also investigate what is happening at runtime. --Vaidy ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
[PATCH] powerpc/cpuidle: Fix parsing of idle state flags from device-tree
Flags from the device-tree need to be parsed with accessors so that the correct value is interpreted on little-endian hosts. Signed-off-by: Vaidyanathan Srinivasan sva...@linux.vnet.ibm.com --- drivers/cpuidle/cpuidle-powernv.c | 16 1 file changed, 8 insertions(+), 8 deletions(-) diff --git a/drivers/cpuidle/cpuidle-powernv.c b/drivers/cpuidle/cpuidle-powernv.c index 74f5788..a64be57 100644 --- a/drivers/cpuidle/cpuidle-powernv.c +++ b/drivers/cpuidle/cpuidle-powernv.c @@ -160,10 +160,10 @@ static int powernv_cpuidle_driver_init(void) static int powernv_add_idle_states(void) { struct device_node *power_mgt; - struct property *prop; int nr_idle_states = 1; /* Snooze */ int dt_idle_states; - u32 *flags; + const __be32 *idle_state_flags; + u32 len_flags, flags; int i; /* Currently we have snooze statically defined */ @@ -174,18 +174,18 @@ static int powernv_add_idle_states(void) return nr_idle_states; } - prop = of_find_property(power_mgt, "ibm,cpu-idle-state-flags", NULL); - if (!prop) { + idle_state_flags = of_get_property(power_mgt, "ibm,cpu-idle-state-flags", &len_flags); + if (!idle_state_flags) { pr_warn("DT-PowerMgmt: missing ibm,cpu-idle-state-flags\n"); return nr_idle_states; } - dt_idle_states = prop->length / sizeof(u32); - flags = (u32 *) prop->value; + dt_idle_states = len_flags / sizeof(u32); for (i = 0; i < dt_idle_states; i++) { - if (flags[i] & IDLE_USE_INST_NAP) { + flags = be32_to_cpu(idle_state_flags[i]); + if (flags & IDLE_USE_INST_NAP) { /* Add NAP state */ strcpy(powernv_states[nr_idle_states].name, "Nap"); strcpy(powernv_states[nr_idle_states].desc, "Nap"); @@ -196,7 +196,7 @@ static int powernv_add_idle_states(void) nr_idle_states++; } - if (flags[i] & IDLE_USE_INST_SLEEP) { + if (flags & IDLE_USE_INST_SLEEP) { /* Add FASTSLEEP state */ strcpy(powernv_states[nr_idle_states].name, "FastSleep"); strcpy(powernv_states[nr_idle_states].desc, "FastSleep"); ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
[PATCH] powerpc/cpufreq: Add pr_warn() on OPAL firmware failures
Cpufreq depends on platform firmware to implement PStates. In case of platform firmware failure, cpufreq should not panic the host kernel with BUG_ON(). A less severe pr_warn() will suffice. Add a firmware_has_feature(FW_FEATURE_OPALv3) check to skip probing for the device-tree on non-powernv platforms. Signed-off-by: Vaidyanathan Srinivasan sva...@linux.vnet.ibm.com --- Tested on powernv and pseries platforms with v3.16-rc7 kernel. drivers/cpufreq/powernv-cpufreq.c | 18 -- 1 file changed, 16 insertions(+), 2 deletions(-) diff --git a/drivers/cpufreq/powernv-cpufreq.c b/drivers/cpufreq/powernv-cpufreq.c index bb1d08d..379c083 100644 --- a/drivers/cpufreq/powernv-cpufreq.c +++ b/drivers/cpufreq/powernv-cpufreq.c @@ -28,6 +28,7 @@ #include <linux/of.h> #include <asm/cputhreads.h> +#include <asm/firmware.h> #include <asm/reg.h> #include <asm/smp.h> /* Required for cpu_sibling_mask() in UP configs */ @@ -98,7 +99,11 @@ static int init_powernv_pstates(void) return -ENODEV; } - WARN_ON(len_ids != len_freqs); + if (len_ids != len_freqs) { + pr_warn("Entries in ibm,pstate-ids and ibm,pstate-frequencies-mhz does not match\n"); + } + nr_pstates = min(len_ids, len_freqs) / sizeof(u32); if (!nr_pstates) { pr_warn("No PStates found\n"); @@ -131,7 +136,12 @@ static unsigned int pstate_id_to_freq(int pstate_id) int i; i = powernv_pstate_info.max - pstate_id; - BUG_ON(i >= powernv_pstate_info.nr_pstates || i < 0); + if (i >= powernv_pstate_info.nr_pstates || i < 0) { + pr_warn("PState id %d outside of PState table, reporting nominal id %d instead\n", + pstate_id, powernv_pstate_info.nominal); + i = powernv_pstate_info.max - powernv_pstate_info.nominal; + } return powernv_freqs[i].frequency; } @@ -321,6 +331,10 @@ static int __init powernv_cpufreq_init(void) { int rc = 0; + /* Don't probe on pseries (guest) platforms */ + if (!firmware_has_feature(FW_FEATURE_OPALv3)) + return -ENODEV; + /* Discover pstates from device tree and init */ rc = init_powernv_pstates(); if (rc) { ___ Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
[BUG][PATCH] powerpc: fix numa distance for form0 device tree
powerpc: fix numa distance for form0 device tree

The following commit breaks numa distance setup for old powerpc systems
that use form0 encoding in the device tree:

  commit 41eab6f88f24124df89e38067b3766b7bef06ddb
  powerpc/numa: Use form 1 affinity to setup node distance

The device tree node /rtas/ibm,associativity-reference-points indexes
into /cpus/PowerPC/ibm,associativity based on the form0 or form1
encoding detected from the ibm,architecture-vec-5 property. All modern
systems use form1 and the current kernel code is correct. However, on
older systems with form0 encoding, the numa distance gets hard coded as
LOCAL_DISTANCE for all nodes. This causes a task scheduling anomaly,
since the scheduler will skip building the numa-level sched domain (the
topmost domain with all cpus) if all numa distances are the same (the
value of 'level' in sched_init_numa() will remain 0).

Prior to the above commit:

  #define node_distance(from, to) \
          ((from) == (to) ? LOCAL_DISTANCE : REMOTE_DISTANCE)

This patch restores the compatible behaviour for old powerpc systems
whose device tree encodes numa distance as form0.

Signed-off-by: Vaidyanathan Srinivasan sva...@linux.vnet.ibm.com

diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index bba87ca..6a252c4 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -201,7 +201,7 @@ int __node_distance(int a, int b)
 	int distance = LOCAL_DISTANCE;
 
 	if (!form1_affinity)
-		return distance;
+		return ((a == b) ? LOCAL_DISTANCE : REMOTE_DISTANCE);
 
 	for (i = 0; i < distance_ref_points_depth; i++) {
 		if (distance_lookup_table[a][i] == distance_lookup_table[b][i])
___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [BUG][PATCH] powerpc: fix numa distance for form0 device tree
* Vaidyanathan Srinivasan sva...@linux.vnet.ibm.com [2013-03-22 21:19:35]: [snip] Prior to the above commit: #define node_distance(from,to) ((from) == (to) ? LOCAL_DISTANCE : REMOTE_DISTANCE) Restoring compatible behavior with this patch for old powerpc systems with device tree where numa distance are encoded as form0. This patch on v3.9-rc3 has been tested on multi-node POWER7 with different device tree combinations. numactl -H would show local distance '10' for same node and remote distance '20' for other nodes. This ensures NUMA level sched domain gets built and load balancing could work across such configurations. --Vaidy ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH v3 1/5] powernv: cpufreq driver for powernv platform
* Gautham R Shenoy e...@linux.vnet.ibm.com [2014-03-21 16:13:17]: Hi Viresh, On Fri, Mar 21, 2014 at 02:11:32PM +0530, Viresh Kumar wrote: On Thu, Mar 20, 2014 at 5:40 PM, Gautham R. Shenoy e...@linux.vnet.ibm.com wrote: From: Vaidyanathan Srinivasan sva...@linux.vnet.ibm.com Hi Vaidy, Hi Viresh, Thanks for the detailed review. diff --git a/drivers/cpufreq/Kconfig b/drivers/cpufreq/Kconfig index 4b029c0..4ba1632 100644 --- a/drivers/cpufreq/Kconfig +++ b/drivers/cpufreq/Kconfig @@ -48,6 +48,7 @@ config CPU_FREQ_STAT_DETAILS choice prompt Default CPUFreq governor default CPU_FREQ_DEFAULT_GOV_USERSPACE if ARM_SA1100_CPUFREQ || ARM_SA1110_CPUFREQ + default CPU_FREQ_DEFAULT_GOV_ONDEMAND if POWERNV_CPUFREQ Probably we should remove SA1100's entry as well from here. This is not the right way of doing it. Imagine 100 platforms having entries here. If you want it, then select it from your platforms Kconfig. Sure. Will move these bits and the other governor related bits to the Powernv Kconfig. diff --git a/drivers/cpufreq/powernv-cpufreq.c b/drivers/cpufreq/powernv-cpufreq.c new file mode 100644 index 000..ab1551f --- /dev/null + +#define pr_fmt(fmt)powernv-cpufreq: fmt + +#include linux/module.h +#include linux/cpufreq.h +#include linux/of.h +#include asm/cputhreads.h That's it? Sure? Even if things compile for you, you must explicitly include all the files on which you depend. Ok. + + WARN_ON(len_ids != len_freqs); + nr_pstates = min(len_ids, len_freqs) / sizeof(u32); + WARN_ON(!nr_pstates); Why do you want to continue here? Good point. We might be better off exiting at this point. Yes, we could return here. We do not generally come till this point if the platform has a problem discovering PStates from device tree. But there is no useful debug info after this point if nr_pstates is 0. 
+ pr_debug(NR PStates %d\n, nr_pstates); + for (i = 0; i nr_pstates; i++) { + u32 id = be32_to_cpu(pstate_ids[i]); + u32 freq = be32_to_cpu(pstate_freqs[i]); + + pr_debug(PState id %d freq %d MHz\n, id, freq); + powernv_freqs[i].driver_data = i; I don't think you are using this field at all and this is the field you can use for driver_data and so you can get rid of powernv_pstate_ids[ ]. Using driver_data to record powernv_pstate_ids won't work since powernv_pstate_ids can be negative. So a pstate_id -3 can be confused with CPUFREQ_BOOST_FREQ thereby not displaying the frequency corresponding to pstate id -3. So for now I think we will be retaining powernv_pstate_ids. Yeah, I had the driver written using driver_data to store pstates. Gautham found the bug that we are missing one PState when we match the ID with CPUFREQ_BOOST_FREQ! We did not know that you have taken care of those issues. Ideally I did expect that driver_data should not be touched by the framework. Thanks for fixing that and allowing the back-end driver to use driver_data. + powernv_freqs[i].frequency = freq * 1000; /* kHz */ + powernv_pstate_ids[i] = id; + } + /* End of list marker entry */ + powernv_freqs[i].driver_data = 0; Not required. Ok. + powernv_freqs[i].frequency = CPUFREQ_TABLE_END; + + /* Print frequency table */ + for (i = 0; powernv_freqs[i].frequency != CPUFREQ_TABLE_END; i++) + pr_debug(%d: %d\n, i, powernv_freqs[i].frequency); You have already printed this table earlier.. Fair enough. + + return 0; +} + +static struct freq_attr *powernv_cpu_freq_attr[] = { + cpufreq_freq_attr_scaling_available_freqs, + NULL, +}; Can use this instead: cpufreq_generic_attr? In this patch yes. But later patch introduces an additional attribute for displaying the nominal frequency. Will handle that part in a clean way in the next version. +/* Helper routines */ + +/* Access helpers to power mgt SPR */ + +static inline unsigned long get_pmspr(unsigned long sprn) Looks big enough not be inlined? 
It is called from one function. It has been defined separately for readability. Let the compiler decide. The code is straight forward :) I wanted to keep this SPR operations in a separate function to facilitate debug and also introduce any timing/sequence change here if required. Keeping this separate is helpful, if inlining is successful, we get a bonus :) +{ + switch (sprn) { + case SPRN_PMCR: + return mfspr(SPRN_PMCR); + + case SPRN_PMICR: + return mfspr(SPRN_PMICR); + + case SPRN_PMSR: + return mfspr(SPRN_PMSR); + } + BUG
Re: [PATCH v4] powernv, cpufreq: cpufreq driver for powernv platform
* Gautham R Shenoy e...@linux.vnet.ibm.com [2014-03-27 15:00:50]: [snip] + u32 len_ids, len_freqs; + + power_mgt = of_find_node_by_path(/ibm,opal/power-mgt); + if (!power_mgt) { + pr_warn(power-mgt node not found\n); + return -ENODEV; + } + + if (of_property_read_u32(power_mgt, ibm,pstate-min, pstate_min)) { + pr_warn(ibm,pstate-min node not found\n); + return -ENODEV; + } + + if (of_property_read_u32(power_mgt, ibm,pstate-max, pstate_max)) { + pr_warn(ibm,pstate-max node not found\n); + return -ENODEV; + } Why do you need to get these from DT? And not find that yourself here instead? Well, I think we can obtain a more accurate value from the DT which would have already computed it. But I'll let vaidy answer this. DT provides the values that we should use. Ideally these are the upper and lower limits of the PState table, however as per the platform spec, we should use the PState IDs between these limits. We would like to parse it and keep it in the driver for debug purposes. The platform exports this info in the DT to be more explicit and not have a strong dependency on the PState numbering or PState table itself. In the driver we parse it for debug/validation purpose. --Vaidy ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
[PATCH 0/3] sched: Fixes for task placement in SMT threads
Hi, The following series fixes scheduler loadbalancing issues where we are missing opportunity to place tasks in SMT threads optimally especially on a POWER7 system. PATCH 1/3, Fixes the scenario where load balancing fails to move tasks away from the cpus in a domain which has SHARE_PKG_RESOURCES set even under the presence of idle cpus in other domains. PATCH 2/3, ensures lower order SMT threads are used for the SD_ASYM_PACKING load balancing case. PATCH 3/3, tries to fix the problem that is explained in its changelog. The following experiment expose the scenario: A task placement test was conducted on a POWER7 as well as multi-core x86 system. At the beginning, tasks are pinned to all the threads within a core and later only the tasks pinned to the secondary threads in a core are unpinned. This leaves the primary thread with tasks that are unmovable, while the rest are free to be moved. Under the above setup, load balancing today does not move the unpinned tasks away from the secondary threads of the core in the above experiment although there are other idle cpus. The PATCH 3/3 fixes this situation and improves task placement even if some of the tasks in a core are pinned. This series applies on v3.12-rc6 and tested on x86 and powerpc. --Vaidy --- Preeti U Murthy (2): sched: Fix nohz_kick_needed to consider the nr_busy of the parent domain's group sched: Aggressive balance in domains whose groups share package resources Vaidyanathan Srinivasan (1): sched: Fix asymmetric scheduling for POWER7 kernel/sched/fair.c | 41 + 1 file changed, 33 insertions(+), 8 deletions(-) ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
[PATCH 1/3] sched: Fix nohz_kick_needed to consider the nr_busy of the parent domain's group
From: Preeti U Murthy pre...@linux.vnet.ibm.com

In nohz_kick_needed() there are checks around the flag
SD_SHARE_PKG_RESOURCES which decide to initiate nohz_balance if the
domains with this flag set have more than one busy cpu. Therefore, at
every domain, a check has to be made on the nr_busy of that domain.
This means the sum of the nr_busy of each group in that domain needs to
be checked, since nr_busy is a parameter associated with a group.
However, in the current implementation of nohz_kick_needed(), nr_busy
is checked only for the group to which the cpu that initiated this
check belongs. This gives the wrong count of the number of busy cpus in
that domain.

The following commit, which fixed the sgp->nr_busy_cpus computation,
actually exposed the bug in nohz_kick_needed(), which worked only while
nr_busy was incorrectly 1:

  25f55d9d01ad7a7ad248fd5af1d22675ffd202c5 sched: Fix init NOHZ_IDLE flag

To illustrate the scenario, consider a core, whose domain will have
SD_SHARE_PKG_RESOURCES set. We want to kick nohz_idle_balance when we
find that more than one thread in the core is busy. With the current
implementation of nohz_kick_needed(), at this domain (sd), nr_busy will
always be 1, since it reads this parameter from sd->groups, which
encompasses a single thread, while we want this parameter from
sd->parent->groups, which will rightly point to the number of busy
threads in the core.

This patch also ensures that the check for SD_SHARE_PKG_RESOURCES comes
before the check for ASYM_PACKING. Priority is given to avoiding more
than one busy thread in a core as much as possible before attempting
asymmetric packing.

Signed-off-by: Preeti U Murthy pre...@linux.vnet.ibm.com
Signed-off-by: Vaidyanathan Srinivasan sva...@linux.vnet.ibm.com
---
 kernel/sched/fair.c | 19 +++++++++++++------
 1 file changed, 13 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7c70201..12f0eab 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5807,12 +5807,19 @@ static inline int nohz_kick_needed(struct rq *rq, int cpu)
 
 	rcu_read_lock();
 	for_each_domain(cpu, sd) {
-		struct sched_group *sg = sd->groups;
-		struct sched_group_power *sgp = sg->sgp;
-		int nr_busy = atomic_read(&sgp->nr_busy_cpus);
-
-		if (sd->flags & SD_SHARE_PKG_RESOURCES && nr_busy > 1)
-			goto need_kick_unlock;
+		struct sched_domain *sd_parent = sd->parent;
+		struct sched_group *sg;
+		struct sched_group_power *sgp;
+		int nr_busy;
+
+		if (sd_parent) {
+			sg = sd_parent->groups;
+			sgp = sg->sgp;
+			nr_busy = atomic_read(&sgp->nr_busy_cpus);
+
+			if (sd->flags & SD_SHARE_PKG_RESOURCES && nr_busy > 1)
+				goto need_kick_unlock;
+		}
 
 		if (sd->flags & SD_ASYM_PACKING && nr_busy != sg->group_weight
 		    && (cpumask_first_and(nohz.idle_cpus_mask,
___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
[PATCH 2/3] sched: Fix asymmetric scheduling for POWER7
Asymmetric scheduling within a core is a scheduler load-balancing
feature that is triggered when the SD_ASYM_PACKING flag is set. The
goal for the load balancer is to move tasks to lower-order idle SMT
threads within a core on a POWER7 system.

In nohz_kick_needed(), we intend to check whether our sched domain
(core) is completely busy or has an idle cpu. The following check for
SD_ASYM_PACKING:

  (cpumask_first_and(nohz.idle_cpus_mask, sched_domain_span(sd)) < cpu)

already covers the case of checking whether the domain has an idle cpu,
because cpumask_first_and() will not yield any set bits if this domain
has no idle cpu. Hence, the nr_busy check against group weight can be
removed.

Reported-by: Michael Neuling michael.neul...@au1.ibm.com
Signed-off-by: Vaidyanathan Srinivasan sva...@linux.vnet.ibm.com
Signed-off-by: Preeti U Murthy pre...@linux.vnet.ibm.com
---
 kernel/sched/fair.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 12f0eab..828ed97 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5821,8 +5821,8 @@ static inline int nohz_kick_needed(struct rq *rq, int cpu)
 				goto need_kick_unlock;
 		}
 
-		if (sd->flags & SD_ASYM_PACKING && nr_busy != sg->group_weight
-		    && (cpumask_first_and(nohz.idle_cpus_mask,
+		if (sd->flags & SD_ASYM_PACKING
+		    && (cpumask_first_and(nohz.idle_cpus_mask,
 				  sched_domain_span(sd)) < cpu))
 			goto need_kick_unlock;
 
___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
[PATCH 3/3] sched: Aggressive balance in domains whose groups share package resources
From: Preeti U Murthy pre...@linux.vnet.ibm.com

The current logic in load balance is such that after picking the
busiest group, the load is attempted to be moved from the busiest cpu
in that group to the dst_cpu. If the load cannot be moved from the
busiest cpu to dst_cpu due to either the tsk_cpus_allowed mask or
cache-hot tasks, then dst_cpu is changed to another idle cpu within the
dst_grpmask. If even then the load cannot be moved from the busiest
cpu, then the source group is changed: the next busiest group is found
and the above steps are repeated.

However, if the cpus in the group share package resources, then when a
load movement from the busiest cpu in this group fails as above,
instead of finding the next busiest group to move load from, find the
next busiest cpu *within the same group* from which to move load away.
By doing so, a conscious effort is made during load balancing to keep
just one cpu busy as much as possible within domains that have the
SHARE_PKG_RESOURCES flag set, except under scenarios of high load,
since having multiple busy cpus within a domain that shares package
resources could lead to a performance hit.

A similar scenario arises in active load balancing as well. When the
current task on the busiest cpu cannot be moved away due to task
pinning, currently no more attempts at load balancing are made. This
patch checks if the balancing is being done on a group whose cpus share
package resources, and if so, whether load balancing can be done for
other cpus in the same group.
Signed-off-by: Preeti U Murthy pre...@linux.vnet.ibm.com
Signed-off-by: Vaidyanathan Srinivasan sva...@linux.vnet.ibm.com
---
 kernel/sched/fair.c | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 828ed97..bbcd96b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5165,6 +5165,8 @@ static int load_balance(int this_cpu, struct rq *this_rq,
 {
 	int ld_moved, cur_ld_moved, active_balance = 0;
 	struct sched_group *group;
+	struct sched_domain *child;
+	int share_pkg_res = 0;
 	struct rq *busiest;
 	unsigned long flags;
 	struct cpumask *cpus = __get_cpu_var(load_balance_mask);
@@ -5190,6 +5192,10 @@ static int load_balance(int this_cpu, struct rq *this_rq,
 
 	schedstat_inc(sd, lb_count[idle]);
 
+	child = sd->child;
+	if (child && child->flags & SD_SHARE_PKG_RESOURCES)
+		share_pkg_res = 1;
+
 redo:
 	if (!should_we_balance(&env)) {
 		*continue_balancing = 0;
@@ -5202,6 +5208,7 @@ redo:
 		goto out_balanced;
 	}
 
+redo_grp:
 	busiest = find_busiest_queue(&env, group);
 	if (!busiest) {
 		schedstat_inc(sd, lb_nobusyq[idle]);
@@ -5292,6 +5299,11 @@ more_balance:
 		if (!cpumask_empty(cpus)) {
 			env.loop = 0;
 			env.loop_break = sched_nr_migrate_break;
+			if (share_pkg_res &&
+			    cpumask_intersects(cpus,
+					       to_cpumask(group->cpumask)))
+				goto redo_grp;
+
 			goto redo;
 		}
 		goto out_balanced;
@@ -5318,9 +5330,15 @@ more_balance:
 		 */
 		if (!cpumask_test_cpu(this_cpu,
				      tsk_cpus_allowed(busiest->curr))) {
+			cpumask_clear_cpu(cpu_of(busiest), cpus);
 			raw_spin_unlock_irqrestore(&busiest->lock, flags);
 			env.flags |= LBF_ALL_PINNED;
+			if (share_pkg_res &&
+			    cpumask_intersects(cpus,
+					       to_cpumask(group->cpumask)))
+				goto redo_grp;
+
 			goto out_one_pinned;
 		}
___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
[PATCH v1 0/2] powernv: cpufreq support for IBM POWERNV platform
Hi, The following patch series implements the platform driver to support dynamic CPU frequency scaling on IBM POWERNV platforms. This patch series is based on Linux kernel 3.14-rc2 and tested on OPAL v3 based IBM POWERNV platform and IBM POWER8 processor. --Vaidy --- Srivatsa S. Bhat (1): powernv, cpufreq: Add per-core locking to serialize frequency transitions Vaidyanathan Srinivasan (1): powernv: cpufreq driver for powernv platform arch/powerpc/include/asm/reg.h|4 + drivers/cpufreq/Kconfig.powerpc |9 + drivers/cpufreq/Makefile |1 drivers/cpufreq/powernv-cpufreq.c | 286 + 4 files changed, 300 insertions(+) create mode 100644 drivers/cpufreq/powernv-cpufreq.c -- ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
[PATCH v1 2/2] powernv, cpufreq: Add per-core locking to serialize frequency transitions
From: Srivatsa S. Bhat srivatsa.b...@linux.vnet.ibm.com

On POWER systems, the CPU frequency is controlled at a core level and
hence we need to serialize so that only one of the threads in the core
switches the core's frequency at a time. Using a global mutex lock
would needlessly serialize _all_ frequency transitions in the system
(across all cores). So introduce per-core locking to enable
finer-grained synchronization and thereby enhance the speed and
responsiveness of the cpufreq driver to varying workload demands.

The design of the per-core locking is very simple and straightforward:
we first define a per-CPU lock and use the one that belongs to the
first thread sibling of the core. The cpu_first_thread_sibling() macro
is used to find the *common* lock for all thread siblings belonging to
a core.

Signed-off-by: Srivatsa S. Bhat srivatsa.b...@linux.vnet.ibm.com
Signed-off-by: Vaidyanathan Srinivasan sva...@linux.vnet.ibm.com
---
 drivers/cpufreq/powernv-cpufreq.c | 21 ++++++++++++++++-----
 1 file changed, 16 insertions(+), 5 deletions(-)

diff --git a/drivers/cpufreq/powernv-cpufreq.c b/drivers/cpufreq/powernv-cpufreq.c
index ea3b630..8240e90 100644
--- a/drivers/cpufreq/powernv-cpufreq.c
+++ b/drivers/cpufreq/powernv-cpufreq.c
@@ -24,8 +24,15 @@
 #include <linux/of.h>
 #include <asm/cputhreads.h>
 
-/* FIXME: Make this per-core */
-static DEFINE_MUTEX(freq_switch_mutex);
+/* Per-core locking for frequency transitions */
+static DEFINE_PER_CPU(struct mutex, freq_switch_lock);
+
+#define lock_core_freq(cpu)				\
+	mutex_lock(&per_cpu(freq_switch_lock,		\
+		cpu_first_thread_sibling(cpu)));
+#define unlock_core_freq(cpu)				\
+	mutex_unlock(&per_cpu(freq_switch_lock,		\
+		cpu_first_thread_sibling(cpu)));
 
 #define POWERNV_MAX_PSTATES	256
 
@@ -219,7 +226,7 @@ static int powernv_cpufreq_target(struct cpufreq_policy *policy,
 	freqs.new = powernv_freqs[new_index].frequency;
 	freqs.cpu = policy->cpu;
 
-	mutex_lock(&freq_switch_mutex);
+	lock_core_freq(policy->cpu);
 	cpufreq_notify_transition(policy, &freqs, CPUFREQ_PRECHANGE);
 
 	pr_debug("setting frequency for cpu %d to %d kHz index %d pstate %d",
@@ -231,7 +238,7 @@ static int powernv_cpufreq_target(struct cpufreq_policy *policy,
 	rc = powernv_set_freq(policy->cpus, new_index);
 
 	cpufreq_notify_transition(policy, &freqs, CPUFREQ_POSTCHANGE);
-	mutex_unlock(&freq_switch_mutex);
+	unlock_core_freq(policy->cpu);
 
 	return rc;
 }
@@ -248,7 +255,7 @@ static struct cpufreq_driver powernv_cpufreq_driver = {
 
 static int __init powernv_cpufreq_init(void)
 {
-	int rc = 0;
+	int cpu, rc = 0;
 
 	/* Discover pstates from device tree and init */
@@ -258,6 +265,10 @@ static int __init powernv_cpufreq_init(void)
 		pr_info("powernv-cpufreq disabled\n");
 		return rc;
 	}
+	/* Init per-core mutex */
+	for_each_possible_cpu(cpu) {
+		mutex_init(&per_cpu(freq_switch_lock, cpu));
+	}
 
 	rc = cpufreq_register_driver(&powernv_cpufreq_driver);
 	return rc;
___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
[PATCH v1 1/2] powernv: cpufreq driver for powernv platform
Backend driver to dynamically set voltage and frequency on IBM POWER non-virtualized platforms. Power management SPRs are used to set the required PState. This driver works in conjunction with cpufreq governors like 'ondemand' to provide a demand based frequency and voltage setting on IBM POWER non-virtualized platforms. PState table is obtained from OPAL v3 firmware through device tree. powernv_cpufreq back-end driver would parse the relevant device-tree nodes and initialise the cpufreq subsystem on powernv platform. Signed-off-by: Vaidyanathan Srinivasan sva...@linux.vnet.ibm.com Signed-off-by: Srivatsa S. Bhat srivatsa.b...@linux.vnet.ibm.com Signed-off-by: Anton Blanchard an...@samba.org --- arch/powerpc/include/asm/reg.h|4 + drivers/cpufreq/Kconfig.powerpc |9 + drivers/cpufreq/Makefile |1 drivers/cpufreq/powernv-cpufreq.c | 275 + 4 files changed, 289 insertions(+) create mode 100644 drivers/cpufreq/powernv-cpufreq.c diff --git a/arch/powerpc/include/asm/reg.h b/arch/powerpc/include/asm/reg.h index 90c06ec..84f92ca 100644 --- a/arch/powerpc/include/asm/reg.h +++ b/arch/powerpc/include/asm/reg.h @@ -271,6 +271,10 @@ #define SPRN_HSRR1 0x13B /* Hypervisor Save/Restore 1 */ #define SPRN_IC0x350 /* Virtual Instruction Count */ #define SPRN_VTB 0x351 /* Virtual Time Base */ +#define SPRN_PMICR 0x354 /* Power Management Idle Control Reg */ +#define SPRN_PMSR 0x355 /* Power Management Status Reg */ +#define SPRN_PMCR 0x374 /* Power Management Control Register */ + /* HFSCR and FSCR bit numbers are the same */ #define FSCR_TAR_LG8 /* Enable Target Address Register */ #define FSCR_EBB_LG7 /* Enable Event Based Branching */ diff --git a/drivers/cpufreq/Kconfig.powerpc b/drivers/cpufreq/Kconfig.powerpc index ca0021a..4a91ab1 100644 --- a/drivers/cpufreq/Kconfig.powerpc +++ b/drivers/cpufreq/Kconfig.powerpc @@ -54,3 +54,12 @@ config PPC_PASEMI_CPUFREQ help This adds the support for frequency switching on PA Semi PWRficient processors. 
+ +config POWERNV_CPUFREQ + tristate CPU frequency scaling for IBM POWERNV platform + depends on PPC_POWERNV + select CPU_FREQ_TABLE + default y + help +This adds support for CPU frequency switching on IBM POWERNV +platform diff --git a/drivers/cpufreq/Makefile b/drivers/cpufreq/Makefile index 7494565..0dbb963 100644 --- a/drivers/cpufreq/Makefile +++ b/drivers/cpufreq/Makefile @@ -86,6 +86,7 @@ obj-$(CONFIG_PPC_CORENET_CPUFREQ) += ppc-corenet-cpufreq.o obj-$(CONFIG_CPU_FREQ_PMAC)+= pmac32-cpufreq.o obj-$(CONFIG_CPU_FREQ_PMAC64) += pmac64-cpufreq.o obj-$(CONFIG_PPC_PASEMI_CPUFREQ) += pasemi-cpufreq.o +obj-$(CONFIG_POWERNV_CPUFREQ) += powernv-cpufreq.o ## # Other platform drivers diff --git a/drivers/cpufreq/powernv-cpufreq.c b/drivers/cpufreq/powernv-cpufreq.c new file mode 100644 index 000..ea3b630 --- /dev/null +++ b/drivers/cpufreq/powernv-cpufreq.c @@ -0,0 +1,275 @@ +/* + * POWERNV cpufreq driver for the IBM POWER processors + * + * (C) Copyright IBM 2014 + * + * Author: Vaidyanathan Srinivasan sva...@linux.vnet.ibm.com + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2, or (at your option) + * any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. 
+ * + */ + +#define pr_fmt(fmt)powernv-cpufreq: fmt + +#include linux/module.h +#include linux/cpufreq.h +#include linux/of.h +#include asm/cputhreads.h + +/* FIXME: Make this per-core */ +static DEFINE_MUTEX(freq_switch_mutex); + +#define POWERNV_MAX_PSTATES256 + +static struct cpufreq_frequency_table powernv_freqs[POWERNV_MAX_PSTATES+1]; + +/* + * Initialize the freq table based on data obtained + * from the firmware passed via device-tree + */ + +static int init_powernv_pstates(void) +{ + struct device_node *power_mgt; + int nr_pstates = 0; + int pstate_min, pstate_max, pstate_nominal; + const __be32 *pstate_ids, *pstate_freqs; + int i; + u32 len_ids, len_freqs; + + power_mgt = of_find_node_by_path(/ibm,opal/power-mgt); + if (!power_mgt) { + pr_warn(power-mgt node not found\n); + return -ENODEV; + } + + if (of_property_read_u32(power_mgt, ibm,pstate-min, pstate_min)) { + pr_warn(ibm,pstate-min node not found\n); + return -ENODEV; + } + + if (of_property_read_u32(power_mgt, ibm,pstate
Re: [PATCH] powerpc: fix VMX fix for memcpy case
* Nishanth Aravamudan n...@linux.vnet.ibm.com [2012-10-01 17:59:13]: [urgh, sorry Anton, Ben Paul, inadvertently hit send before adding linuxppc-dev to the cc!] Hi Anton, In 2fae7cdb60240e2e2d9b378afbf6d9fcce8a3890 (powerpc: Fix VMX in interrupt check in POWER7 copy loops), I think you inadvertently introduced a regression for memcpy on POWER7 machines. copyuer and memcpy diverge slightly in their use of cr1 (copyuser doesn't use it, but memcpy does) and you end up clobbering that register with your fix. That results in (taken from an FC18 kernel): [ 18.824604] Unrecoverable VMX/Altivec Unavailable Exception f20 at c0052f40 [ 18.824618] Oops: Unrecoverable VMX/Altivec Unavailable Exception, sig: 6 [#1] [ 18.824623] SMP NR_CPUS=1024 NUMA pSeries [ 18.824633] Modules linked in: tg3(+) be2net(+) cxgb4(+) ipr(+) sunrpc xts lrw gf128mul dm_crypt dm_round_robin dm_multipath linear raid10 raid456 async_raid6_recov async_memcpy async_pq raid6_pq async_xor xor async_tx raid1 raid0 scsi_dh_rdac scsi_dh_hp_sw scsi_dh_emc scsi_dh_alua squashfs cramfs [ 18.824705] NIP: c0052f40 LR: c020b874 CTR: 0512 [ 18.824709] REGS: c01f1fef7790 TRAP: 0f20 Not tainted (3.6.0-0.rc6.git0.2.fc18.ppc64) [ 18.824713] MSR: 80009032 SF,EE,ME,IR,DR,RI CR: 4802802e XER: 2010 [ 18.824726] SOFTE: 0 [ 18.824728] CFAR: 0f20 [ 18.824731] TASK = c00fa7128400[0] 'swapper/24' THREAD: c00fa748 CPU: 24 GPR00: ffc0 c01f1fef7a10 c164edc0 c00f9b9a8120 GPR04: c00f9b9a8124 1438 0060 03ff064657ee GPR08: 8000 0010 0020 0030 GPR12: 28028022 cff25400 0001 GPR16: 7fff c16b2180 c156a500 GPR20: c00f968c7a90 c000131c31d8 c01f1fef4000 c1561d00 GPR24: 000a 0001 0012 GPR28: c00fa5c04f80 08bc c15c0a28 022e [ 18.824792] NIP [c0052f40] .memcpy_power7+0x5a0/0x7c4 [ 18.824797] LR [c020b874] .pcpu_free_area+0x174/0x2d0 [ 18.824800] Call Trace: [ 18.824803] [c01f1fef7a10] [c0052c14] .memcpy_power7+0x274/0x7c4 (unreliable) [ 18.824809] [c01f1fef7b10] [c020b874] .pcpu_free_area+0x174/0x2d0 [ 18.824813] [c01f1fef7bb0] [c020ba88] 
.free_percpu+0xb8/0x1b0 [ 18.824819] [c01f1fef7c50] [c043d144] .throtl_pd_exit+0x94/0xd0 [ 18.824824] [c01f1fef7cf0] [c043acf8] .blkg_free+0x88/0xe0 [ 18.824829] [c01f1fef7d90] [c018c048] .rcu_process_callbacks+0x2e8/0x8a0 [ 18.824835] [c01f1fef7e90] [c00a8ce8] .__do_softirq+0x158/0x4d0 [ 18.824840] [c01f1fef7f90] [c0025ecc] .call_do_softirq+0x14/0x24 [ 18.824845] [c00fa7483650] [c0010e80] .do_softirq+0x160/0x1a0 [ 18.824850] [c00fa74836f0] [c00a94a4] .irq_exit+0xf4/0x120 [ 18.824854] [c00fa7483780] [c0020c44] .timer_interrupt+0x154/0x4d0 [ 18.824859] [c00fa7483830] [c0003be0] decrementer_common+0x160/0x180 [ 18.824866] --- Exception: 901 at .plpar_hcall_norets+0x84/0xd4 [ 18.824866] LR = .check_and_cede_processor+0x48/0x80 [ 18.824871] [c00fa7483b20] [c007f018] .check_and_cede_processor+0x18/0x80 (unreliable) [ 18.824877] [c00fa7483b90] [c007f104] .dedicated_cede_loop+0x84/0x150 [ 18.824883] [c00fa7483c50] [c06bc030] .cpuidle_enter+0x30/0x50 [ 18.824887] [c00fa7483cc0] [c06bc9f4] .cpuidle_idle_call+0x104/0x720 [ 18.824892] [c00fa7483d80] [c0070af8] .pSeries_idle+0x18/0x40 [ 18.824897] [c00fa7483df0] [c0019084] .cpu_idle+0x1a4/0x380 [ 18.824902] [c00fa7483ec0] [c08a4c18] .start_secondary+0x520/0x528 [ 18.824907] [c00fa7483f90] [c00093f0] .start_secondary_prolog+0x10/0x14 [ 18.824911] Instruction dump: [ 18.824914] 38840008 9003 90e30004 38630008 7ca62850 7cc300d0 78c7e102 7cf01120 [ 18.824923] 78c60660 39200010 39400020 39600030 7e00200c 7c0020ce 38840010 409f001c [ 18.824935] ---[ end trace 0bb95124affaaa45 ]--- [ 18.825046] Unrecoverable VMX/Altivec Unavailable Exception f20 at c0052d08 I believe the right fix is to make memcpy match usercopy and not use cr1. Signed-off-by: Nishanth Aravamudan n...@us.ibm.com Tested-by: Vaidyanathan Srinivasan sva...@linux.vnet.ibm.com --- I've not tested this fix yet, but I think it's logically correct. Probably needs to go to 3.6-stable as well. 
diff --git a/arch/powerpc/lib/memcpy_power7.S b/arch/powerpc/lib/memcpy_power7.S index 7ba6c96..0663630 100644 --- a/arch/powerpc/lib/memcpy_power7.S +++ b/arch/powerpc/lib/memcpy_power7.S @@ -239,8 +239,8 @@ _GLOBAL
Re: [PATCH 0/3] cpu: idle state framework for offline CPUs.
* Shaohua Li shaohua...@intel.com [2009-08-06 09:58:55]:

Hi, On Wed, Aug 05, 2009 at 10:25:53PM +0800, Gautham R Shenoy wrote:
In this patch-series, we propose to extend the CPU-Hotplug
infrastructure and allow the system administrator to choose the desired
state the CPU should go to when it is offlined. We think this approach
addresses the concerns about determinism as well as transparency, since
CPU-Hotplug already provides a notification mechanism which userspace
can listen to for any change in the configuration and correspondingly
readjust any previously set cpu-affinities.

Peter dislikes any approach (including cpu hotplug) which breaks
userspace policy, even if userspace can get a notification.

Hi Shaohua,

A notification to userspace and an opportunity to react to the
situation is certainly easier to manage than an under-the-cover
reduction in capacity. We are already doing something to this effect in
virtualized guests, where VCPU entitlement and assignment can be
changed by the hypervisor based on resource constraints. This framework
with notification and determinism will help manage cpu capacity better.

I agree with Peter that the scheduler approach is better, but it is not
deterministic. This is a simpler and cleaner alternative that keeps the
complexity in userspace and provides only a framework to control/notify
the kernel.

Also, approaches such as [1] can make use of this extended
infrastructure instead of putting the CPU into an arbitrary C-state
when it is offlined, thereby providing the system administrator a rope
to hang himself with should he feel the need to do so.

I didn't see the reason why the administrator needs to know which state
an offline cpu should stay in. I don't know about the powerpc side, but
on the x86 side, it appears the deepest C-state is already preferred.

Yes, that is what we would expect, but the deepest sleep state may be
restricted by the BIOS or other system-level parameters. This was the
main objection to Venki's patch that put offline cpus into the deepest
sleep state.
There could be a higher-level policy that restricts the deepest C-states for various reasons. This framework provides an opportunity to adhere to that policy with userspace inputs. This framework also helps to choose different states in a virtualized system. --Vaidy ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH 0/3] cpu: idle state framework for offline CPUs.
* Rafael J. Wysocki r...@sisk.pl [2009-08-09 15:22:02]: On Sunday 09 August 2009, Pavel Machek wrote: Hi! Also, approaches such as [1] can make use of this extended infrastructure instead of putting the CPU into an arbitrary C-state when it is offlined, thereby providing the system administrator a rope to hang himself with should he feel the need to do so. I didn't see the reason why the administrator needs to know which state an offline cpu should stay in. I don't know about the powerpc side, but on the x86 side, it appears the deepest C-state is already preferred. Agreed, the deepest C-state is always best; there's no need to make it configurable. Unless it doesn't work. Yes, this is one of the reasons. The kernel will know about the deepest sleep state and any restrictions, and should set that as the default in the preferred_offline state. End-users and sysadmins need not change it; the default will be the deepest sleep state supported and allowed by the system, which may be different from the one supported by the processor. This framework could allow the default to be set easily by other userspace tools. These restrictions apply to the cpuidle governor as well, hence at some level both of these subsystems should be in sync. As described in this patch series, the meaning of an offline cpu is different in a virtualized system, and this framework provides flexibility of choice there. Platforms today do have a choice of which state an offline cpu should reside in; this framework is one possible method to exploit that choice and not restrict it due to lack of OS flexibility. --Vaidy
Re: [PATCH] cpuidle: powernv/pseries: Decrease the snooze residency
* Benjamin Herrenschmidt b...@au1.ibm.com [2015-05-30 20:38:22]: On Sat, 2015-05-30 at 11:31 +0530, Vaidyanathan Srinivasan wrote: In the shared lpar case, spinning in guest context may potentially take away cycles from other lpars waiting to run on the same physical cpu. So the policy in the shared lpar case is to let the PowerVM hypervisor know immediately that the guest cpu is idle, which will allow the hypervisor to use the cycles for other tasks/lpars. But that will have negative side effects under KVM, no? Yes, you have a good point. If one of the threads in the core goes to cede, it can still come back quickly since the KVM guest context is not switched yet. But in a single-threaded guest, this can force unnecessary exit/context-switch overhead. Now that we have fixed the snooze loop to be bounded and exit predictably, KVM guests should actually use the snooze state to improve latency. I will test this scenario and enable the snooze state for KVM guests. Suresh mentioned something with his new directed-interrupts code: we had many cases where the interrupts ended up arriving shortly after we exited to the host for NAP'ing ... Snooze might fix it... Right. This scenario is worth experimenting with before introducing the snooze loop for guests. --Vaidy
[PATCH] powerpc/powernv: Fix vma page prot flags in opal-prd driver
opal-prd driver will mmap() the firmware code/data area as a private mapping to the prd user space daemon. A write to these pages will trigger COW faults. The new COW pages are normal kernel RAM pages, accounted by the kernel, and are not special. The vma->vm_page_prot value will be used at page fault time for the new COW pages, while the pgprot_t value passed to remap_pfn_range() is used for the initial page table entry. Hence:

* Do not add _PAGE_SPECIAL in the vma, but only for remap_pfn_range()
* Also, remap_pfn_range() will add the _PAGE_SPECIAL flag using a pte_mkspecial() call, hence no need to specify it in the driver

This fix resolves the page accounting warning shown below:

BUG: Bad rss-counter state mm:c007d34ac600 idx:1 val:19

The above warning is triggered since _PAGE_SPECIAL was incorrectly being set for the normal kernel COW pages.

Signed-off-by: Vaidyanathan Srinivasan sva...@linux.vnet.ibm.com
---
 arch/powerpc/platforms/powernv/opal-prd.c | 9 ++++-----
 1 file changed, 4 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/opal-prd.c b/arch/powerpc/platforms/powernv/opal-prd.c
index 46cb3fe..4ece8e4 100644
--- a/arch/powerpc/platforms/powernv/opal-prd.c
+++ b/arch/powerpc/platforms/powernv/opal-prd.c
@@ -112,6 +112,7 @@ static int opal_prd_open(struct inode *inode, struct file *file)
 static int opal_prd_mmap(struct file *file, struct vm_area_struct *vma)
 {
 	size_t addr, size;
+	pgprot_t page_prot;
 	int rc;
 
 	pr_devel("opal_prd_mmap(0x%016lx, 0x%016lx, 0x%lx, 0x%lx)\n",
@@ -125,13 +126,11 @@ static int opal_prd_mmap(struct file *file, struct vm_area_struct *vma)
 	if (!opal_prd_range_is_valid(addr, size))
 		return -EINVAL;
 
-	vma->vm_page_prot = __pgprot(pgprot_val(phys_mem_access_prot(file,
-					vma->vm_pgoff,
-					size, vma->vm_page_prot))
-					| _PAGE_SPECIAL);
+	page_prot = phys_mem_access_prot(file, vma->vm_pgoff,
+					 size, vma->vm_page_prot);
 
 	rc = remap_pfn_range(vma, vma->vm_start, vma->vm_pgoff,
-			     size, vma->vm_page_prot);
+			     size, page_prot);
 	return rc;
 }
Re: [PATCH] powerpc/powernv: Fix vma page prot flags in opal-prd driver
* Vaidyanathan Srinivasan sva...@linux.vnet.ibm.com [2015-06-25 11:45:46]: [snip] Hi Ben, remap_pfn_range() is the correct method to map the firmware pages because we will not have a struct page associated with this RAM area. We do a memblock_reserve() in early boot and take this memory out of the kernel, avoiding struct page allocation/init for it. Kindly ignore this comment. memblock_reserve() does not prevent/avoid struct page allocation. We do have valid struct pages which can be used for mapping. --Vaidy
Re: [PATCH] powerpc/powernv: Fix vma page prot flags in opal-prd driver
* Vaidyanathan Srinivasan sva...@linux.vnet.ibm.com [2015-06-21 23:56:16]: [snip] Hi Ben, remap_pfn_range() is the correct method to map the firmware pages because we will not have a struct page associated with this RAM area. We do a memblock_reserve() in early boot and take this memory out of the kernel, avoiding struct page allocation/init for it. vm_insert_page() is an alternative that would have worked if the kernel had allocated the memory, in which case we could bump up the page count and map the page to user space. This is already done by vm_insert_page(), and we would not need to make the page special. However, this use case fits remap_pfn_range() and the page-special mechanism, since there is no struct page associated with these physical pages. --Vaidy
Re: [PATCH] cpuidle: powernv/pseries: Decrease the snooze residency
* Preeti U Murthy pre...@linux.vnet.ibm.com [2015-05-29 19:17:17]: [snip]

+	if (max_idle_state > 1) {
+		snooze_timeout_en = true;
+		snooze_timeout = cpuidle_state_table[1].target_residency *
+					tb_ticks_per_usec;
+	}

Any idea why we don't have snooze defined on the shared lpar configuration ? In the shared lpar case, spinning in guest context may potentially take away cycles from other lpars waiting to run on the same physical cpu. So the policy in the shared lpar case is to let the PowerVM hypervisor know immediately that the guest cpu is idle, which will allow the hypervisor to use the cycles for other tasks/lpars. --Vaidy
[RESEND PATCH] powerpc/powernv: Fix vma page prot flags in opal-prd driver
opal-prd driver will mmap() the firmware code/data area as a private mapping to the prd user space daemon. [snip -- commit message, Signed-off-by, and diff identical to the earlier posting of this patch]

Resending the patch for inclusion until we have a working solution without PAGE_SPECIAL. Kindly queue this patch to fix the mmap issue in the opal-prd driver. The next version/fix with vm_insert_page() will replace this code. There is no API change on the driver side, and hence no side effects from including this fix until a better solution is available.
Re: [PATCH] powernv: Load correct TOC pointer while waking up from winkle.
* Benjamin Herrenschmidt [2016-08-06 08:38:53]:

> On Fri, 2016-08-05 at 19:13 +0530, Mahesh J Salgaonkar wrote:
> > From: Mahesh Salgaonkar
> >
> > The function pnv_restore_hyp_resource() loads the TOC into r2 from
> > the invalid PACA pointer before fixing the r13 value. This does not
> > affect POWER ISA 3.0, but it does have an impact on POWER ISA 2.07
> > or less, leading the CPU to get stuck forever.
>
> When was this broken ? Should this get backported to stable ?

Broken in 4.8-rc1 only, after moving the SPRN_HSPRG0 checking to idle_book3s.S. Hence no backports needed.

--Vaidy
Re: [PATCH] powernv: Load correct TOC pointer while waking up from winkle.
* Mahesh J Salgaonkar <mah...@linux.vnet.ibm.com> [2016-08-05 19:13:12]:

> From: Mahesh Salgaonkar <mah...@linux.vnet.ibm.com>
>
> The function pnv_restore_hyp_resource() loads the TOC into r2 from
> the invalid PACA pointer before fixing the r13 value. This does not
> affect POWER ISA 3.0, but it does have an impact on POWER ISA 2.07
> or less, leading the CPU to get stuck forever.
>
> login: [  471.830433] Processor 120 is stuck.
>
> This can be easily reproduced using the following steps:
> - Turn off SMT
>   $ ppc64_cpu --smt=off
> - offline/online any online cpu (Thread 0 of any core which is online)
>   $ echo 0 > /sys/devices/system/cpu/cpu/online
>   $ echo 1 > /sys/devices/system/cpu/cpu/online
>
> For POWER ISA 2.07 or less, the last bit of HSPRG0 is set, indicating
> that the thread is waking up from winkle. Hence, the last bit of
> HSPRG0 (r13) needs to be cleared before accessing it as the PACA, to
> avoid loading invalid values from an invalid PACA pointer.
>
> Fix this by loading the TOC after the r13 register is corrected.
>
> Signed-off-by: Mahesh Salgaonkar <mah...@linux.vnet.ibm.com>

Acked-by: Vaidyanathan Srinivasan <sva...@linux.vnet.ibm.com>

> ---
>  arch/powerpc/kernel/idle_book3s.S | 5 ++++-
>  1 file changed, 4 insertions(+), 1 deletion(-)
>
> diff --git a/arch/powerpc/kernel/idle_book3s.S b/arch/powerpc/kernel/idle_book3s.S
> index 8a56a51..45784ec 100644
> --- a/arch/powerpc/kernel/idle_book3s.S
> +++ b/arch/powerpc/kernel/idle_book3s.S
> @@ -363,8 +363,8 @@ _GLOBAL(power9_idle_stop)
>   * cr3 - set to gt if waking up with partial/complete hypervisor state loss
>   */
>  _GLOBAL(pnv_restore_hyp_resource)
> -	ld	r2,PACATOC(r13);
>  BEGIN_FTR_SECTION
> +	ld	r2,PACATOC(r13);
>  	/*
>  	 * POWER ISA 3. Use PSSCR to determine if we
>  	 * are waking up from deep idle state
> @@ -395,6 +395,9 @@ END_FTR_SECTION_IFSET(CPU_FTR_ARCH_300)
>  	 */
>  	clrldi	r5,r13,63
>  	clrrdi	r13,r13,1
> +
> +	/* Now that we are sure r13 is corrected, load TOC */
> +	ld	r2,PACATOC(r13);
>  	cmpwi	cr4,r5,1
>  	mtspr	SPRN_HSPRG0,r13

Thanks Mahesh for this fix.

--Vaidy
powerpc: Enable cpuhotplug with ESL=1 on POWER9
The attached patch enables ESL=1 STOP2 for cpu hotplug. This is a debug patch that we could carry for now, until STOP states are discovered from the device tree. Test run:

[  151.670021] CPU8 going offline with request psscr 003f0332
[  151.719856] CPU 8 offline: Remove Rx thread
[  189.200410] CPU8 coming online with psscr 203f0332, srr1 90261001

We request ESL=EC=1 and STOP2, and successfully wake up from that state using an IPI from XIVE.

--Vaidy
[PATCH 2/2] powerpc/cpuhotplug: print psscr and srr1 value for debug
This is a debug patch that helps trace various STOP state transitions and look at srr1 and psscr at wakeup.

Signed-off-by: Vaidyanathan Srinivasan <sva...@linux.vnet.ibm.com>
---
 arch/powerpc/platforms/powernv/smp.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/arch/powerpc/platforms/powernv/smp.c b/arch/powerpc/platforms/powernv/smp.c
index e39e6c4..5b3f002 100644
--- a/arch/powerpc/platforms/powernv/smp.c
+++ b/arch/powerpc/platforms/powernv/smp.c
@@ -185,8 +185,12 @@ static void pnv_smp_cpu_kill_self(void)
 	ppc64_runlatch_off();
 
 	if (cpu_has_feature(CPU_FTR_ARCH_300)) {
+		pr_info("CPU%d going offline with request psscr %016llx\n",
+			cpu, pnv_deepest_stop_psscr_val);
 		srr1 = power9_idle_stop(pnv_deepest_stop_psscr_val,
 					pnv_deepest_stop_psscr_mask);
+		pr_info("CPU%d coming online with psscr %016lx, srr1 %016lx\n",
+			cpu, mfspr(SPRN_PSSCR), srr1);
 	} else if (idle_states & OPAL_PM_WINKLE_ENABLED) {
 		srr1 = power7_winkle();
 	} else if ((idle_states & OPAL_PM_SLEEP_ENABLED) ||
-- 
2.9.3
[PATCH 1/2] powerpc/cpuhotplug: Force ESL=1 for offline cpus
From: Gautham R. Shenoy <e...@linux.vnet.ibm.com>

ESL=1 loses some HYP SPR context and is not ideal for cpuidle; however, it can be used for offline cpus.

Signed-off-by: Vaidyanathan Srinivasan <sva...@linux.vnet.ibm.com>
Signed-off-by: Gautham R. Shenoy <e...@linux.vnet.ibm.com>
---
 arch/powerpc/include/asm/cpuidle.h    | 15 +++++++++++++++
 arch/powerpc/platforms/powernv/idle.c | 11 +++++++++++
 2 files changed, 26 insertions(+)

diff --git a/arch/powerpc/include/asm/cpuidle.h b/arch/powerpc/include/asm/cpuidle.h
index fd321eb4..31192d8 100644
--- a/arch/powerpc/include/asm/cpuidle.h
+++ b/arch/powerpc/include/asm/cpuidle.h
@@ -40,6 +40,21 @@
 #define ERR_EC_ESL_MISMATCH		-1
 #define ERR_DEEP_STATE_ESL_MISMATCH	-2
 
+/* Additional defs for debug and trace */
+
+#define RL_SHIFT	0
+#define MTL_SHIFT	4
+#define TR_SHIFT	8
+#define PSLL_SHIFT	16
+#define EC_SHIFT	20
+#define ESL_SHIFT	21
+#define INIT_PSSCR(ESL, EC, PSLL, TR, MTL, RL)	(((ESL) << (ESL_SHIFT)) | \
+						 ((EC) << (EC_SHIFT)) | \
+						 ((PSLL) << (PSLL_SHIFT)) | \
+						 ((TR) << (TR_SHIFT)) | \
+						 ((MTL) << (MTL_SHIFT)) | \
+						 ((RL) << (RL_SHIFT)))
+
 #ifndef __ASSEMBLY__
 extern u32 pnv_fastsleep_workaround_at_entry[];
 extern u32 pnv_fastsleep_workaround_at_exit[];
diff --git a/arch/powerpc/platforms/powernv/idle.c b/arch/powerpc/platforms/powernv/idle.c
index 4ee837e..4f0663a 100644
--- a/arch/powerpc/platforms/powernv/idle.c
+++ b/arch/powerpc/platforms/powernv/idle.c
@@ -524,6 +524,17 @@ static int __init pnv_init_idle_states(void)
 
 	pnv_alloc_idle_core_states();
 
+	/* On POWER9 DD1, enter stop2 with ESL=EC=1 on Hotplug */
+	if (cpu_has_feature(CPU_FTR_POWER9_DD1)) {
+		pnv_deepest_stop_psscr_val =
+			/* ESL, EC, PSLL, TR, MTL, RL */
+			INIT_PSSCR(0x1, 0x1, 0xf, 0x3, 0x3, 0x2);
+		pnv_deepest_stop_psscr_mask = PSSCR_HV_DEFAULT_MASK;
+		pr_warn("Overriding deepest_stop_psscr to: val=0x%016llx, mask=0x%016llx\n",
+			pnv_deepest_stop_psscr_val,
+			pnv_deepest_stop_psscr_mask);
+	}
+
 	if (supported_cpuidle_states & OPAL_PM_NAP_ENABLED)
 		ppc_md.power_save = power7_idle;
 	else if (supported_cpuidle_states & OPAL_PM_STOP_INST_FAST)
-- 
2.9.3
[PATCH] powerpc/xics: Adjust interrupt receive priority for offline cpus
Offline CPUs need to receive IPIs through XIVE when they are in stop state and wake up from that state. Reduce the interrupt receive priority in order to receive XIVE wakeup interrupts when in the offline state. LOWEST_PRIORITY would allow all interrupts to be delivered as wakeup events.

Signed-off-by: Vaidyanathan Srinivasan <sva...@linux.vnet.ibm.com>
---
 arch/powerpc/sysdev/xics/xics-common.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/sysdev/xics/xics-common.c b/arch/powerpc/sysdev/xics/xics-common.c
index 69d858e..c674a9d 100644
--- a/arch/powerpc/sysdev/xics/xics-common.c
+++ b/arch/powerpc/sysdev/xics/xics-common.c
@@ -199,7 +199,7 @@ void xics_migrate_irqs_away(void)
 	xics_set_cpu_giq(xics_default_distrib_server, 0);
 
 	/* Allow IPIs again... */
-	icp_ops->set_priority(DEFAULT_PRIORITY);
+	icp_ops->set_priority(LOWEST_PRIORITY);
 
 	for_each_irq_desc(virq, desc) {
 		struct irq_chip *chip;
-- 
2.9.3
Re: [PATCH] powerpc: cmp -> cmpd for 64-bit
* Segher Boessenkool [2016-10-12 08:26:48]:

> On Wed, Oct 12, 2016 at 02:05:19PM +1100, Michael Ellerman wrote:
> > Segher Boessenkool writes:
> [snip]
> > > --- a/arch/powerpc/include/asm/cpuidle.h
> > > +++ b/arch/powerpc/include/asm/cpuidle.h
> > > @@ -26,7 +26,7 @@ extern u64 pnv_first_deep_stop_state;
> > >  #define IDLE_STATE_ENTER_SEQ(IDLE_INST)	\
> > >  	/* Magic NAP/SLEEP/WINKLE mode enter sequence */	\
> > >  	std	r0,0(r1);	\
> > >  	ptesync;	\
> > >  	ld	r0,0(r1);	\
> > > -1:	cmp	cr0,r0,r0;	\
> > > +1:	cmpd	cr0,r0,r0;	\
> > >  	bne	1b;	\
> > >  	IDLE_INST;	\
> > >  	b	.
> >
> > What's this one doing, is it a bug? I can't really tell without knowing
> > what the magic sequence is meant to do.

This one is the recommended idle state entry sequence described in the ISA. We need to ensure the context is fully saved, and also create a register dependency using a cmp and a loop that will ideally never be taken. This gets the thread (pipeline) ready to start losing state when the idle instruction is executed.

ISA 2.07 Section 3.3.2.1: Entering and Exiting Power-Saving Mode

> It looks like it is making sure the ptesync is done. The ld/cmp/bne
> is the usual way to make sure the ld is done, and in std/ptesync/ld the ld
> won't be done before the ptesync is done.
>
> The cmp always compares equal, of course, so both cmpw and cmpd would
> work fine here. cmpd looks better after ld ;-)

Yes :) cmpd or cmpw would provide the same result as far as this code sequence is concerned. I agree that cmpd is more appropriate here.

--Vaidy
[PATCH v1 0/2] cpuidle: Fixes in cpuidle driver
When CONFIG_HOTPLUG_CPU=n and cpu_present is less than cpu_possible, cpuidle-powernv not passing an explicit drv->cpumask allows the generic cpuidle driver to try to create sysfs objects for cpus that do not have cpu_devices created by calling register_cpu(). This caused the kernel to access an incorrect address and crash. The following patch series fixes the cpuidle-powernv driver and also adds additional checks in cpuidle_add_sysfs(). This patch set is against v4.11-rc3.

Changes from v1: Updated commit message and comments.

Signed-off-by: Vaidyanathan Srinivasan <sva...@linux.vnet.ibm.com>
Re: [PATCH v2 1/2] powerpc/powernv/cpuidle: Pass correct drv->cpumask for registration
* Rafael J. Wysocki <raf...@kernel.org> [2017-03-23 16:28:31]:

> On Thu, Mar 23, 2017 at 4:22 PM, Vaidyanathan Srinivasan
> <sva...@linux.vnet.ibm.com> wrote:
> > drv->cpumask defaults to cpu_possible_mask in __cpuidle_driver_init().
> > On the PowerNV platform cpu_present could be less than cpu_possible in
> > cases where firmware detects the cpu, but it is not available to the
> > OS. When CONFIG_HOTPLUG_CPU=n, such cpus are not hotplugable at
> > runtime and hence we skip creating cpu_device.
> >
> > This breaks cpuidle on powernv, where register_cpu() is not called for
> > cpus in cpu_possible_mask that cannot be hot-added at runtime.
[snip crash log quoted in full]
> > This patch fixes the bug by passing the correct cpumask from the
> > powernv-cpuidle driver.
> >
> > Signed-off-by: Vaidyanathan Srinivasan <sva...@linux.vnet.ibm.com>
>
> That needs to be ACKed by someone familiar with powernv.

Previous version at https://lists.ozlabs.org/pipermail/linuxppc-dev/2017-March/155587.html

I had not CCed linux-pm in the first post. Michael and Mikey have reviewed the previous version. Let me get an ack for you to proceed with the merge.

Thanks,
Vaidy
[PATCH v2 1/2] powerpc/powernv/cpuidle: Pass correct drv->cpumask for registration
drv->cpumask defaults to cpu_possible_mask in __cpuidle_driver_init(). On the PowerNV platform cpu_present could be less than cpu_possible in cases where firmware detects the cpu, but it is not available to the OS. When CONFIG_HOTPLUG_CPU=n, such cpus are not hotplugable at runtime and hence we skip creating cpu_device.

This breaks cpuidle on powernv, where register_cpu() is not called for cpus in cpu_possible_mask that cannot be hot-added at runtime.

Trying cpuidle_register_device() on a cpu without a cpu_device will cause a crash like this:

cpu 0xf: Vector: 380 (Data SLB Access) at [c00ff1503490]
    pc: c022c8bc: string+0x34/0x60
    lr: c022ed78: vsnprintf+0x284/0x42c
    sp: c00ff1503710
   msr: 90009033
   dar: 60006000
  current = 0xc00ff148
  paca    = 0xcfe82d00   softe: 0   irq_happened: 0x01
    pid   = 1, comm = swapper/8
Linux version 4.11.0-rc2 (sv@sagarika) (gcc version 4.9.4 (Buildroot 2017.02-4-gc28573e) ) #15 SMP Fri Mar 17 19:32:02 IST 2017
enter ? for help
[link register   ] c022ed78 vsnprintf+0x284/0x42c
[c00ff1503710] c022ebb8 vsnprintf+0xc4/0x42c (unreliable)
[c00ff1503800] c022ef40 vscnprintf+0x20/0x44
[c00ff1503830] c00ab61c vprintk_emit+0x94/0x2cc
[c00ff15038a0] c00acc9c vprintk_func+0x60/0x74
[c00ff15038c0] c0619694 printk+0x38/0x4c
[c00ff15038e0] c0224950 kobject_get+0x40/0x60
[c00ff1503950] c022507c kobject_add_internal+0x60/0x2c4
[c00ff15039e0] c0225350 kobject_init_and_add+0x70/0x78
[c00ff1503a60] c053c288 cpuidle_add_sysfs+0x9c/0xe0
[c00ff1503ae0] c053aeac cpuidle_register_device+0xd4/0x12c
[c00ff1503b30] c053b108 cpuidle_register+0x98/0xcc
[c00ff1503bc0] c085eaf0 powernv_processor_idle_init+0x140/0x1e0
[c00ff1503c60] c000cd60 do_one_initcall+0xc0/0x15c
[c00ff1503d20] c0833e84 kernel_init_freeable+0x1a0/0x25c
[c00ff1503dc0] c000d478 kernel_init+0x24/0x12c
[c00ff1503e30] c000b564 ret_from_kernel_thread+0x5c/0x78

This patch fixes the bug by passing the correct cpumask from the powernv-cpuidle driver.

Signed-off-by: Vaidyanathan Srinivasan <sva...@linux.vnet.ibm.com>
---
 drivers/cpuidle/cpuidle-powernv.c | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

diff --git a/drivers/cpuidle/cpuidle-powernv.c b/drivers/cpuidle/cpuidle-powernv.c
index a06df51..82f7b33 100644
--- a/drivers/cpuidle/cpuidle-powernv.c
+++ b/drivers/cpuidle/cpuidle-powernv.c
@@ -175,6 +175,24 @@ static int powernv_cpuidle_driver_init(void)
 		drv->state_count += 1;
 	}
 
+	/*
+	 * On the PowerNV platform cpu_present may be less than cpu_possible
+	 * in cases where firmware detects the cpu, but it is not available
+	 * to the OS. If CONFIG_HOTPLUG_CPU=n then such CPUs are not
+	 * hotplugable at runtime and hence cpu_devices are not created for
+	 * those cpus by the generic topology_init().
+	 *
+	 * drv->cpumask defaults to cpu_possible_mask in
+	 * __cpuidle_driver_init(). This breaks cpuidle on powernv, where
+	 * cpu_devices are not created for cpus in cpu_possible_mask that
+	 * cannot be hot-added later at runtime.
+	 *
+	 * Trying cpuidle_register_device() on a cpu without a cpu_device is
+	 * incorrect. Hence pass the correct cpu mask to the generic cpuidle
+	 * driver.
+	 */
+	drv->cpumask = (struct cpumask *)cpu_present_mask;
+
 	return 0;
}
-- 
2.9.3
[PATCH v2 2/2] cpuidle: Validate cpu_dev in cpuidle_add_sysfs
If a given cpu is not in cpu_present and cpu hotplug is disabled, the arch can skip setting up the cpu_dev. The arch cpuidle driver should pass the correct cpu mask for registration, but failure to do so by the driver causes the error to propagate and crash like this:

[   30.076045] Unable to handle kernel paging request for data at address 0x0048
[   30.076100] Faulting instruction address: 0xc07b2f30
cpu 0x4d: Vector: 300 (Data Access) at [c03feb18b670]
    pc: c07b2f30: kobject_get+0x20/0x70
    lr: c07b3c94: kobject_add_internal+0x54/0x3f0
    sp: c03feb18b8f0
   msr: 90009033
   dar: 48
 dsisr: 4000
  current = 0xc03fd2ed8300
  paca    = 0xcfbab500   softe: 0   irq_happened: 0x01
    pid   = 1, comm = swapper/0
Linux version 4.11.0-rc2-svaidy+ (sv@sagarika) (gcc version 6.2.0 20161005 (Ubuntu 6.2.0-5ubuntu12) ) #10 SMP Sun Mar 19 00:08:09 IST 2017
enter ? for help
[c03feb18b960] c07b3c94 kobject_add_internal+0x54/0x3f0
[c03feb18b9f0] c07b43a4 kobject_init_and_add+0x64/0xa0
[c03feb18ba70] c0e284f4 cpuidle_add_sysfs+0xb4/0x130
[c03feb18baf0] c0e26038 cpuidle_register_device+0x118/0x1c0
[c03feb18bb30] c0e26c48 cpuidle_register+0x78/0x120
[c03feb18bbc0] c168fd9c powernv_processor_idle_init+0x110/0x1c4
[c03feb18bc40] c000cff8 do_one_initcall+0x68/0x1d0
[c03feb18bd00] c16242f4 kernel_init_freeable+0x280/0x360
[c03feb18bdc0] c000d864 kernel_init+0x24/0x160
[c03feb18be30] c000b4e8 ret_from_kernel_thread+0x5c/0x74

Validating cpu_dev fixes the crash and reports a correct error message like:

[   30.163506] Failed to register cpuidle device for cpu136
[   30.173329] Registration of powernv driver failed.

Signed-off-by: Vaidyanathan Srinivasan <sva...@linux.vnet.ibm.com>
---
 drivers/cpuidle/sysfs.c | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/drivers/cpuidle/sysfs.c b/drivers/cpuidle/sysfs.c
index c5adc8c..f2c3bce 100644
--- a/drivers/cpuidle/sysfs.c
+++ b/drivers/cpuidle/sysfs.c
@@ -615,6 +615,18 @@ int cpuidle_add_sysfs(struct cpuidle_device *dev)
 	struct device *cpu_dev = get_cpu_device((unsigned long)dev->cpu);
 	int error;
 
+	/*
+	 * Return an error if cpu_device is not set up for this cpu. This
+	 * could happen if the arch did not set up the cpu_device, since
+	 * this cpu is not in cpu_present_mask and the driver did not send
+	 * the correct cpu mask at registration. Without this check we
+	 * would end up passing a bogus value for &cpu_dev->kobj in
+	 * kobject_init_and_add().
+	 */
+	if (!cpu_dev)
+		return -ENODEV;
+
 	kdev = kzalloc(sizeof(*kdev), GFP_KERNEL);
 	if (!kdev)
 		return -ENOMEM;
-- 
2.9.3
Re: [PATCH v2 2/2] cpuidle: Validate cpu_dev in cpuidle_add_sysfs
* Rafael J. Wysocki <raf...@kernel.org> [2017-03-23 16:27:31]:

> On Thu, Mar 23, 2017 at 4:22 PM, Vaidyanathan Srinivasan
> <sva...@linux.vnet.ibm.com> wrote:
> > If a given cpu is not in cpu_present and cpu hotplug
> > is disabled, arch can skip setting up the cpu_dev.
> >
> > Arch cpuidle driver should pass correct cpu mask
> > for registration, but failing to do so by the driver
> > causes error to propagate and crash.
[snip crash log quoted in full]
> > Validating cpu_dev fixes the crash and reports correct error message like:
> >
> > [   30.163506] Failed to register cpuidle device for cpu136
> > [   30.173329] Registration of powernv driver failed.
> >
> > Signed-off-by: Vaidyanathan Srinivasan <sva...@linux.vnet.ibm.com>
>
> The previous version is in linux-next already and I'm going to push it
> for merging shortly.

Thanks Rafael. The previous version is good for merge.

--Vaidy
Re: [PATCH] powerpc/powernv/cpuidle: Pass correct drv->cpumask for registration
* Michael Ellerman <m...@ellerman.id.au> [2017-03-22 21:55:50]: > Vaidyanathan Srinivasan <sva...@linux.vnet.ibm.com> writes: > > * Michael Ellerman <m...@ellerman.id.au> [2017-03-20 14:05:39]: > >> Vaidyanathan Srinivasan <sva...@linux.vnet.ibm.com> writes: > > > >> > On powernv platform cpu_present could be less than cpu_possible > >> > in cases where firmware detects the cpu, but it is not available > >> > for OS. > >> > >> It's entirely normal for present < possible, on my laptop for example, > >> so I don't see how that causes the bug. > > > > Yes, present < possible in itself not a problem. It is whether > > cpu_device exist for that cpu or not. > ... > > > > Currently if CONFIG_HOTPLUG_CPU=n, then we skip calling register_cpu() > > and that causes the problem. > ... > >> > >> I really don't understand how a CPU not being present leads to a crash > >> in printf()? Something in that call chain should have checked that the > >> CPU was registered before crashing in printf() - surely? > > > > Yes, we should have just failed to register the cpuidle driver. I have > > the fix here: > > > > [PATCH] cpuidle: Validate cpu_dev in cpuidle_add_sysfs > > http://patchwork.ozlabs.org/patch/740634/ > > OK. Can you send a v2 of this with a better change log that includes all > the clarifications above. > > And despite your subject being powerpc/powernv/cpuidle, this is a > cpuidle patch. I can merge it, but I at least need you to Cc the cpuidle > maintainers so they have a chance to see it. Thanks for the review, I will post a v2 with more detailed commit log and CC cpuidle maintainers and linux-pm. --Vaidy
[PATCH] powerpc/powernv/cpuidle: Pass correct drv->cpumask for registration
drv->cpumask defaults to cpu_possible_mask in __cpuidle_driver_init().
This breaks cpuidle on powernv where sysfs files are not created for
cpus in cpu_possible_mask that cannot be hot-added.

Trying cpuidle_register_device() on a cpu without a sysfs node will
cause a crash like:

cpu 0xf: Vector: 380 (Data SLB Access) at [c00ff1503490]
    pc: c022c8bc: string+0x34/0x60
    lr: c022ed78: vsnprintf+0x284/0x42c
    sp: c00ff1503710
   msr: 90009033
   dar: 60006000
  current = 0xc00ff148
  paca    = 0xcfe82d00   softe: 0   irq_happened: 0x01
    pid   = 1, comm = swapper/8
Linux version 4.11.0-rc2 (sv@sagarika) (gcc version 4.9.4
(Buildroot 2017.02-4-gc28573e) ) #15 SMP Fri Mar 17 19:32:02 IST 2017
enter ? for help
[link register   ] c022ed78 vsnprintf+0x284/0x42c
[c00ff1503710] c022ebb8 vsnprintf+0xc4/0x42c (unreliable)
[c00ff1503800] c022ef40 vscnprintf+0x20/0x44
[c00ff1503830] c00ab61c vprintk_emit+0x94/0x2cc
[c00ff15038a0] c00acc9c vprintk_func+0x60/0x74
[c00ff15038c0] c0619694 printk+0x38/0x4c
[c00ff15038e0] c0224950 kobject_get+0x40/0x60
[c00ff1503950] c022507c kobject_add_internal+0x60/0x2c4
[c00ff15039e0] c0225350 kobject_init_and_add+0x70/0x78
[c00ff1503a60] c053c288 cpuidle_add_sysfs+0x9c/0xe0
[c00ff1503ae0] c053aeac cpuidle_register_device+0xd4/0x12c
[c00ff1503b30] c053b108 cpuidle_register+0x98/0xcc
[c00ff1503bc0] c085eaf0 powernv_processor_idle_init+0x140/0x1e0
[c00ff1503c60] c000cd60 do_one_initcall+0xc0/0x15c
[c00ff1503d20] c0833e84 kernel_init_freeable+0x1a0/0x25c
[c00ff1503dc0] c000d478 kernel_init+0x24/0x12c
[c00ff1503e30] c000b564 ret_from_kernel_thread+0x5c/0x78

This patch fixes the issue by passing the correct cpumask from the
powernv-cpuidle driver.
Signed-off-by: Vaidyanathan Srinivasan <sva...@linux.vnet.ibm.com>
---
 drivers/cpuidle/cpuidle-powernv.c | 22 ++++++++++++++++++++++
 1 file changed, 22 insertions(+)

diff --git a/drivers/cpuidle/cpuidle-powernv.c b/drivers/cpuidle/cpuidle-powernv.c
index 3705930..e7a8c2a 100644
--- a/drivers/cpuidle/cpuidle-powernv.c
+++ b/drivers/cpuidle/cpuidle-powernv.c
@@ -175,6 +175,28 @@ static int powernv_cpuidle_driver_init(void)
 		drv->state_count += 1;
 	}
 
+	/*
+	 * On the powernv platform cpu_present may be less than cpu_possible
+	 * in cases where firmware detects the cpu, but it is not available
+	 * to the OS. Such CPUs are not hotpluggable at runtime on the
+	 * powernv platform and hence sysfs files are not created for them.
+	 * Generic topology_init() would skip creating sysfs directories
+	 * for cpus that are not present and not hotpluggable later at
+	 * runtime.
+	 *
+	 * drv->cpumask defaults to cpu_possible_mask in
+	 * __cpuidle_driver_init(). This breaks cpuidle on powernv where
+	 * sysfs files are not created for cpus in cpu_possible_mask that
+	 * cannot be hot-added.
+	 *
+	 * Hence at runtime sysfs nodes are present only for cpus in
+	 * cpu_present_mask. Trying cpuidle_register_device() on a cpu
+	 * without a sysfs node is incorrect.
+	 *
+	 * Hence pass the correct cpu mask to the generic cpuidle driver.
+	 */
+
+	drv->cpumask = (struct cpumask *)cpu_present_mask;
+
 	return 0;
 }
-- 
2.9.3
[PATCH] cpuidle: Validate cpu_dev in cpuidle_add_sysfs
If a given cpu is not in cpu_present and cpu hotplug is disabled, the
arch can skip setting up the cpu_dev.

The arch cpuidle driver should pass the correct cpu mask for
registration, but failing to do so by the driver causes the error to
propagate and crash like this:

[   30.076045] Unable to handle kernel paging request for data at address 0x0048
[   30.076100] Faulting instruction address: 0xc07b2f30
cpu 0x4d: Vector: 300 (Data Access) at [c03feb18b670]
    pc: c07b2f30: kobject_get+0x20/0x70
    lr: c07b3c94: kobject_add_internal+0x54/0x3f0
    sp: c03feb18b8f0
   msr: 90009033
   dar: 48
 dsisr: 4000
  current = 0xc03fd2ed8300
  paca    = 0xcfbab500   softe: 0   irq_happened: 0x01
    pid   = 1, comm = swapper/0
Linux version 4.11.0-rc2-svaidy+ (sv@sagarika) (gcc version 6.2.0
20161005 (Ubuntu 6.2.0-5ubuntu12) ) #10 SMP Sun Mar 19 00:08:09 IST 2017
enter ? for help
[c03feb18b960] c07b3c94 kobject_add_internal+0x54/0x3f0
[c03feb18b9f0] c07b43a4 kobject_init_and_add+0x64/0xa0
[c03feb18ba70] c0e284f4 cpuidle_add_sysfs+0xb4/0x130
[c03feb18baf0] c0e26038 cpuidle_register_device+0x118/0x1c0
[c03feb18bb30] c0e26c48 cpuidle_register+0x78/0x120
[c03feb18bbc0] c168fd9c powernv_processor_idle_init+0x110/0x1c4
[c03feb18bc40] c000cff8 do_one_initcall+0x68/0x1d0
[c03feb18bd00] c16242f4 kernel_init_freeable+0x280/0x360
[c03feb18bdc0] c000d864 kernel_init+0x24/0x160
[c03feb18be30] c000b4e8 ret_from_kernel_thread+0x5c/0x74

Validating cpu_dev fixes the crash and reports a correct error message
like:

[   30.163506] Failed to register cpuidle device for cpu136
[   30.173329] Registration of powernv driver failed.
Signed-off-by: Vaidyanathan Srinivasan <sva...@linux.vnet.ibm.com>
---
 drivers/cpuidle/sysfs.c | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/drivers/cpuidle/sysfs.c b/drivers/cpuidle/sysfs.c
index c5adc8c..19dcf32 100644
--- a/drivers/cpuidle/sysfs.c
+++ b/drivers/cpuidle/sysfs.c
@@ -615,6 +615,18 @@ int cpuidle_add_sysfs(struct cpuidle_device *dev)
 	struct device *cpu_dev = get_cpu_device((unsigned long)dev->cpu);
 	int error;
 
+	/*
+	 * Return if cpu_device is not setup for this cpu.
+	 * This could happen if arch did not setup cpu_device
+	 * since this cpu is not in cpu_present mask and
+	 * driver did not send correct cpu mask at registration.
+	 * Without this check we would end up passing bogus
+	 * value for &cpu_dev->kobj in kobject_init_and_add()
+	 */
+
+	if (!cpu_dev)
+		return -ENODEV;
+
 	kdev = kzalloc(sizeof(*kdev), GFP_KERNEL);
 	if (!kdev)
 		return -ENOMEM;
-- 
2.9.3
Re: [PATCH] powerpc/powernv/cpuidle: Pass correct drv->cpumask for registration
* Michael Ellerman <m...@ellerman.id.au> [2017-03-20 14:05:39]: > Vaidyanathan Srinivasan <sva...@linux.vnet.ibm.com> writes: > > > * Michael Neuling <mi...@neuling.org> [2017-03-18 16:28:02]: > > > >> Vaidy, > >> > >> Thanks for fixing this. > >> > >> > drv->cpumask defaults to cpu_possible_mask in __cpuidle_driver_init(). > >> > This breaks cpuidle on powernv where sysfs files are not created for > >> > cpus in cpu_possible_mask that cannot be hot-added. > >> > >> I think I prefer the longer description below than this. > > > > [PATCH] powerpc/powernv/cpuidle: Pass correct drv->cpumask for > > > > drv->cpumask defaults to cpu_possible_mask in __cpuidle_driver_init(). > > This breaks cpuidle on powernv where sysfs files are not created for > > cpus in cpu_possible_mask that cannot be hot-added. > > I'm confused. It depends on CONFIG_HOTPLUG_CPU now, but we have the following combinations to handle: (a) CONFIG_HOTPLUG_CPU=y/n (b) pseries vs powernv > > On powernv platform cpu_present could be less than cpu_possible > > in cases where firmware detects the cpu, but it is not available > > for OS. > > It's entirely normal for present < possible, on my laptop for example, > so I don't see how that causes the bug. Yes, present < possible in itself not a problem. It is whether cpu_device exist for that cpu or not. > > Such cpus are not hotplugable at runtime on powernv and > > hence we skip creating sysfs files for these cpus. > > Why are they not hotpluggable? Looking at topology_init() they should be > hotpluggable as long as ppc_md.cpu_die is populated, which it is on > PowerNV AFAICS. Currently it depends on CONFIG_HOTPLUG_CPU=y. > Is it really "creating sysfs files" that's important, or is that we > don't call register_cpu() for those CPUs? Currently if CONFIG_HOTPLUG_CPU=n, then we skip calling register_cpu() and that causes the problem. 
The fix in cpuidle-powernv.c could be based on CONFIG_HOTPLUG_CPU, but since we can choose to set cpu->hotpluggable based on other factors even when CONFIG_HOTPLUG_CPU=y, I choose to use cpu_possible. We will need to change cpuidle-powernv to register for additional cpus when we really support hot-add of cpus that did not exist at boot. > > Trying cpuidle_register_device() on cpu without sysfs node will > > cause crash like: > > > > cpu 0xf: Vector: 380 (Data SLB Access) at [c00ff1503490] > > pc: c022c8bc: string+0x34/0x60 > > lr: c022ed78: vsnprintf+0x284/0x42c > > sp: c00ff1503710 > >msr: 90009033 > >dar: 60006000 > > current = 0xc00ff148 > > paca= 0xcfe82d00 softe: 0irq_happened: 0x01 > > pid = 1, comm = swapper/8 > > Linux version 4.11.0-rc2 (sv@sagarika) (gcc version 4.9.4 (Buildroot > > 2017.02-4-gc28573e) ) #15 SMP Fri Mar 17 19:32:02 IST 2017 > > enter ? for help > > [link register ] c022ed78 vsnprintf+0x284/0x42c > > [c00ff1503710] c022ebb8 vsnprintf+0xc4/0x42c (unreliable) > > [c00ff1503800] c022ef40 vscnprintf+0x20/0x44 > > [c00ff1503830] c00ab61c vprintk_emit+0x94/0x2cc > > [c00ff15038a0] c00acc9c vprintk_func+0x60/0x74 > > [c00ff15038c0] c0619694 printk+0x38/0x4c > > [c00ff15038e0] c0224950 kobject_get+0x40/0x60 > > [c00ff1503950] c022507c kobject_add_internal+0x60/0x2c4 > > [c00ff15039e0] c0225350 kobject_init_and_add+0x70/0x78 > > [c00ff1503a60] c053c288 cpuidle_add_sysfs+0x9c/0xe0 > > [c00ff1503ae0] c053aeac cpuidle_register_device+0xd4/0x12c > > [c00ff1503b30] c053b108 cpuidle_register+0x98/0xcc > > [c00ff1503bc0] c085eaf0 powernv_processor_idle_init+0x140/0x1e0 > > [c00ff1503c60] c000cd60 do_one_initcall+0xc0/0x15c > > [c00ff1503d20] c0833e84 kernel_init_freeable+0x1a0/0x25c > > [c00ff1503dc0] c000d478 kernel_init+0x24/0x12c > > [c00ff1503e30] c000b564 ret_from_kernel_thread+0x5c/0x78 > > I really don't understand how a CPU not being present leads to a crash > in printf()? 
Something in that call chain should have checked that the > CPU was registered before crashing in printf() - surely? Yes, we should have just failed to register the cpuidle driver. I have the fix here: [PATCH] cpuidle: Validate cpu_dev in cpuidle_add_sysfs http://patchwork.ozlabs.org/patch/740634/ --Vaidy
Re: [PATCH] powerpc/powernv/cpuidle: Pass correct drv->cpumask for registration
* Michael Neuling[2017-03-18 16:28:02]: > Vaidy, > > Thanks for fixing this. > > > drv->cpumask defaults to cpu_possible_mask in __cpuidle_driver_init(). > > This breaks cpuidle on powernv where sysfs files are not created for > > cpus in cpu_possible_mask that cannot be hot-added. > > I think I prefer the longer description below than this. [PATCH] powerpc/powernv/cpuidle: Pass correct drv->cpumask for drv->cpumask defaults to cpu_possible_mask in __cpuidle_driver_init(). This breaks cpuidle on powernv where sysfs files are not created for cpus in cpu_possible_mask that cannot be hot-added. On powernv platform cpu_present could be less than cpu_possible in cases where firmware detects the cpu, but it is not available for OS. Such cpus are not hotplugable at runtime on powernv and hence we skip creating sysfs files for these cpus. Trying cpuidle_register_device() on cpu without sysfs node will cause crash like: cpu 0xf: Vector: 380 (Data SLB Access) at [c00ff1503490] pc: c022c8bc: string+0x34/0x60 lr: c022ed78: vsnprintf+0x284/0x42c sp: c00ff1503710 msr: 90009033 dar: 60006000 current = 0xc00ff148 paca= 0xcfe82d00 softe: 0irq_happened: 0x01 pid = 1, comm = swapper/8 Linux version 4.11.0-rc2 (sv@sagarika) (gcc version 4.9.4 (Buildroot 2017.02-4-gc28573e) ) #15 SMP Fri Mar 17 19:32:02 IST 2017 enter ? 
for help [link register ] c022ed78 vsnprintf+0x284/0x42c [c00ff1503710] c022ebb8 vsnprintf+0xc4/0x42c (unreliable) [c00ff1503800] c022ef40 vscnprintf+0x20/0x44 [c00ff1503830] c00ab61c vprintk_emit+0x94/0x2cc [c00ff15038a0] c00acc9c vprintk_func+0x60/0x74 [c00ff15038c0] c0619694 printk+0x38/0x4c [c00ff15038e0] c0224950 kobject_get+0x40/0x60 [c00ff1503950] c022507c kobject_add_internal+0x60/0x2c4 [c00ff15039e0] c0225350 kobject_init_and_add+0x70/0x78 [c00ff1503a60] c053c288 cpuidle_add_sysfs+0x9c/0xe0 [c00ff1503ae0] c053aeac cpuidle_register_device+0xd4/0x12c [c00ff1503b30] c053b108 cpuidle_register+0x98/0xcc [c00ff1503bc0] c085eaf0 powernv_processor_idle_init+0x140/0x1e0 [c00ff1503c60] c000cd60 do_one_initcall+0xc0/0x15c [c00ff1503d20] c0833e84 kernel_init_freeable+0x1a0/0x25c [c00ff1503dc0] c000d478 kernel_init+0x24/0x12c [c00ff1503e30] c000b564 ret_from_kernel_thread+0x5c/0x78 This patch fixes the issue by passing correct cpumask from powernv-cpuidle driver. > > This patch fixes the issue by passing correct cpumask from > > powernv-cpuidle driver. > > Any reason the correct fix isn't to change __cpuidle_driver_init() to use > present mask? Seems like any arch where present < possible is going to hit > this. How to handle the sysfs nodes for cpus that are not in cpu_present mask is upto the arch/platform. During init, arch can choose to create sysfs nodes for cpu's that are not in present mask, which can be later hot-added and then onlined. In which case cpuidle driver should register for such cpus as well. The decision to allow a cpu to become available in cpu_present mask at runtime and handle the sysfs nodes are upto the arch/platform and not to be hard coded in generic driver. The generic cpuidle would like to register for all possible cpus if the arch driver does not override it. --Vaidy
Re: [PATCH 1/3] cpuidle: powernv: Don't bounce between low and very low thread priority
* Nicholas Piggin [2017-04-04 09:52:07]:

> On Tue, 4 Apr 2017 07:54:12 +1000
> Anton Blanchard wrote:
>
> > From: Anton Blanchard
> >
> > The core of snooze_loop() continually bounces between low and very
> > low thread priority. Changing thread priorities is an expensive
> > operation that can negatively impact other threads on a core.
> >
> > All CPUs that can run PowerNV support very low priority, so we can
> > avoid the change completely.
>
> This looks good. I have HMT_lowest() which does alt feature patching
> we can use for pseries and default idle code.

Alternatively, if we are going to set the priority only once in various
other places, HMT_low(); HMT_very_low(); should not add extra cycles.
Let me code that up.

--Vaidy
Re: [PATCH 1/3] cpuidle: powernv: Don't bounce between low and very low thread priority
* Anton Blanchard <an...@ozlabs.org> [2017-04-04 07:54:12]: > From: Anton Blanchard <an...@samba.org> > > The core of snooze_loop() continually bounces between low and very > low thread priority. Changing thread priorities is an expensive > operation that can negatively impact other threads on a core. > > All CPUs that can run PowerNV support very low priority, so we can > avoid the change completely. > > Signed-off-by: Anton Blanchard <an...@samba.org> Reviewed-by: Vaidyanathan Srinivasan <sva...@linux.vnet.ibm.com> > --- > drivers/cpuidle/cpuidle-powernv.c | 1 - > 1 file changed, 1 deletion(-) > > diff --git a/drivers/cpuidle/cpuidle-powernv.c > b/drivers/cpuidle/cpuidle-powernv.c > index cda8f62d555b..9d9f164894eb 100644 > --- a/drivers/cpuidle/cpuidle-powernv.c > +++ b/drivers/cpuidle/cpuidle-powernv.c > @@ -57,7 +57,6 @@ static int snooze_loop(struct cpuidle_device *dev, > snooze_exit_time = get_tb() + snooze_timeout; > ppc64_runlatch_off(); > while (!need_resched()) { > - HMT_low(); > HMT_very_low(); > if (snooze_timeout_en && get_tb() > snooze_exit_time) > break; HMT_low() is legacy and can be removed for powernv platforms. --Vaidy
Re: [PATCH 2/3] cpuidle: powernv: Don't continually set thread priority in snooze_loop()
* Anton Blanchard <an...@ozlabs.org> [2017-04-04 07:54:13]: > From: Anton Blanchard <an...@samba.org> > > The powerpc64 kernel exception handlers have preserved thread priorities > for a long time now, so there is no need to continually set it. > > Just set it once on entry and once exit. > > Signed-off-by: Anton Blanchard <an...@samba.org> Reviewed-by: Vaidyanathan Srinivasan <sva...@linux.vnet.ibm.com> > --- > drivers/cpuidle/cpuidle-powernv.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/drivers/cpuidle/cpuidle-powernv.c > b/drivers/cpuidle/cpuidle-powernv.c > index 9d9f164894eb..8c991c254b95 100644 > --- a/drivers/cpuidle/cpuidle-powernv.c > +++ b/drivers/cpuidle/cpuidle-powernv.c > @@ -56,8 +56,8 @@ static int snooze_loop(struct cpuidle_device *dev, > > snooze_exit_time = get_tb() + snooze_timeout; > ppc64_runlatch_off(); > + HMT_very_low(); > while (!need_resched()) { > - HMT_very_low(); > if (snooze_timeout_en && get_tb() > snooze_exit_time) > break; > } > -- > 2.11.0 >
Re: [PATCH 3/3] cpuidle: powernv: Avoid a branch in the core snooze_loop() loop
* Anton Blanchard <an...@ozlabs.org> [2017-04-04 07:54:14]: > From: Anton Blanchard <an...@samba.org> > > When in the snooze_loop() we want to take up the least amount of > resources. On my version of gcc (6.3), we end up with an extra > branch because it predicts snooze_timeout_en to be false, whereas it > is almost always true. By default snooze_timeout_en is true. It will become false only when we do not want to exit the snooze loop and that will be when snooze is the only idle state available in the platform, which is a rare case. > Use likely() to avoid the branch and be a little nicer to the > other non idle threads on the core. > > Signed-off-by: Anton Blanchard <an...@samba.org> Reviewed-by: Vaidyanathan Srinivasan <sva...@linux.vnet.ibm.com> > --- > drivers/cpuidle/cpuidle-powernv.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/drivers/cpuidle/cpuidle-powernv.c > b/drivers/cpuidle/cpuidle-powernv.c > index 8c991c254b95..251a60bfa8ee 100644 > --- a/drivers/cpuidle/cpuidle-powernv.c > +++ b/drivers/cpuidle/cpuidle-powernv.c > @@ -58,7 +58,7 @@ static int snooze_loop(struct cpuidle_device *dev, > ppc64_runlatch_off(); > HMT_very_low(); > while (!need_resched()) { > - if (snooze_timeout_en && get_tb() > snooze_exit_time) > + if (likely(snooze_timeout_en) && get_tb() > snooze_exit_time) > break; > } > > -- > 2.11.0 >
Re: [PATCH] powerpc/pseries: energy driver only print message when LPAR guest
* Nicholas Piggin <npig...@gmail.com> [2017-07-21 11:16:44]: > On Thu, 20 Jul 2017 23:03:21 +1000 > Michael Ellerman <m...@ellerman.id.au> wrote: > > > Nicholas Piggin <npig...@gmail.com> writes: > > > > > This driver currently reports the H_BEST_ENERGY is unsupported even > > > when booting in a non-LPAR environment (e.g., powernv). Prevent it. > > > > Just delete the printk(). Users don't know what that means, and > > developers have other better ways to detect that the hcall is missing if > > anyone cares. > > > > cheers > > powerpc/pseries: energy driver do not print failure message > > This driver currently reports the H_BEST_ENERGY is unsupported (even > when booting in a non-LPAR environment). This is not something the > administrator can do much with, and not significant for debugging. > > Remove it. > > Signed-off-by: Nicholas Piggin <npig...@gmail.com> Reviewed-by: Vaidyanathan Srinivasan <sva...@linux.vnet.ibm.com> > --- > arch/powerpc/platforms/pseries/pseries_energy.c | 7 +++ > 1 file changed, 3 insertions(+), 4 deletions(-) > > diff --git a/arch/powerpc/platforms/pseries/pseries_energy.c > b/arch/powerpc/platforms/pseries/pseries_energy.c > index 164a13d3998a..35c891aabef0 100644 > --- a/arch/powerpc/platforms/pseries/pseries_energy.c > +++ b/arch/powerpc/platforms/pseries/pseries_energy.c > @@ -229,10 +229,9 @@ static int __init pseries_energy_init(void) > int cpu, err; > struct device *cpu_dev; > > - if (!firmware_has_feature(FW_FEATURE_BEST_ENERGY)) { > - printk(KERN_INFO "Hypercall H_BEST_ENERGY not supported\n"); > - return 0; > - } > + if (!firmware_has_feature(FW_FEATURE_BEST_ENERGY)) > + return 0; /* H_BEST_ENERGY hcall not supported */ > + The first patch (!firmware_has_feature(FW_FEATURE_LPAR)) would be ideal, but we do not have this in KVM guest case also. Hence I take mpe's suggestion. Removing the print is fine. --Vaidy
Re: [PATCH] powerpc/pseries: energy driver only print message when LPAR guest
* Michael Ellerman <m...@ellerman.id.au> [2017-07-21 16:33:07]: > Vaidyanathan Srinivasan <sva...@linux.vnet.ibm.com> writes: > > * Nicholas Piggin <npig...@gmail.com> [2017-07-21 11:16:44]: > >> diff --git a/arch/powerpc/platforms/pseries/pseries_energy.c > >> b/arch/powerpc/platforms/pseries/pseries_energy.c > >> index 164a13d3998a..35c891aabef0 100644 > >> --- a/arch/powerpc/platforms/pseries/pseries_energy.c > >> +++ b/arch/powerpc/platforms/pseries/pseries_energy.c > >> @@ -229,10 +229,9 @@ static int __init pseries_energy_init(void) > >>int cpu, err; > >>struct device *cpu_dev; > >> > >> - if (!firmware_has_feature(FW_FEATURE_BEST_ENERGY)) { > >> - printk(KERN_INFO "Hypercall H_BEST_ENERGY not supported\n"); > >> - return 0; > >> - } > >> + if (!firmware_has_feature(FW_FEATURE_BEST_ENERGY)) > >> + return 0; /* H_BEST_ENERGY hcall not supported */ > >> + > > > > The first patch (!firmware_has_feature(FW_FEATURE_LPAR)) would be > > ideal, but we do not have this in KVM guest case also. > > Yeah we do. > > It should really be called FW_FEATURE_RUNNING_UNDER_PAPR_HYPERVISOR. > > static int __init probe_fw_features(unsigned long node, const char *uname, int > depth, void *data) > { > > if (!strcmp(uname, "rtas") || !strcmp(uname, "rtas@0")) { > prop = of_get_flat_dt_prop(node, "ibm,hypertas-functions", > ); > if (prop) { > powerpc_firmware_features |= FW_FEATURE_LPAR; > > > Qemu initialises that property unconditionally in spapr_dt_rtas(). oops... I meant that FW_FEATURE_BEST_ENERGY is not found in KVM and we will see the print needlessly. If we have a check for phyp LPAR, then we can enable the print "H_BEST_ENERGY hcall not supported" Since the FW_FEATURE_LPAR is common for all PAPR guest (both pHyp and KVM), I agree that deleting the print is the right thing to do since we see it on both powernv and KVM where it is not supported and there is no point reporting it. --Vaidy
Re: [PATCH] POWER9 PMU stops after idle workaround
* Nicholas Piggin <npig...@gmail.com> [2017-07-20 11:53:22]:

> POWER9 DD2 PMU can stop after a state-loss idle in some conditions.
>
> A solution is to set then clear MMCRA[60] after wake from state-loss
> idle.
>
> Signed-off-by: Nicholas Piggin <npig...@gmail.com>

Reviewed-by: Vaidyanathan Srinivasan <sva...@linux.vnet.ibm.com>

> ---
>  arch/powerpc/kernel/idle_book3s.S | 8 +++++++-
>  1 file changed, 7 insertions(+), 1 deletion(-)
>
> diff --git a/arch/powerpc/kernel/idle_book3s.S b/arch/powerpc/kernel/idle_book3s.S
> index 516ebef905c0..e6252c5a57a4 100644
> --- a/arch/powerpc/kernel/idle_book3s.S
> +++ b/arch/powerpc/kernel/idle_book3s.S
> @@ -460,11 +460,17 @@ pnv_restore_hyp_resource_arch300:
>  	/*
>  	 * Workaround for POWER9, if we lost resources, the ERAT
>  	 * might have been mixed up and needs flushing. We also need
> -	 * to reload MMCR0 (see comment above).
> +	 * to reload MMCR0 (see comment above). We also need to set
> +	 * then clear bit 60 in MMCRA to ensure the PMU starts running.
>  	 */
>  	blt	cr3,1f
>  	PPC_INVALIDATE_ERAT
>  	ld	r1,PACAR1(r13)
> +	mfspr	r4,SPRN_MMCRA
> +	ori	r4,r4,(1 << (63-60))
> +	mtspr	SPRN_MMCRA,r4
> +	xori	r4,r4,(1 << (63-60))
> +	mtspr	SPRN_MMCRA,r4

Is the timing ok to resolve the issue? Does a back-to-back bit flip of
MMCRA[60] get the job done in all cases? Just asking, since this issue
is itself a corner case ;)

--Vaidy