Re: PowerOP 0/3: System power operating point management API
Dominik Brodowski wrote:
> A small add-on: We need to make sure that we're capable of handling smart CPUs like Transmeta Crusoe processors in a sane way. This means
>
> b) Setting of "values"
>
> is optional if the hardware itself can be set to a min/max value (step a above in previous mail).

Although I haven't looked into the Crusoe processor support, it may be that there is a different set of power parameters, not cpu speed directly, that are appropriate to manage on these platforms (after a brief look, seems to be a range of frequencies and some sort of flags)? If so, these sorts of machine-specific power parameters are what PowerOP is trying to address, allowing management of the underlying machine-specific stuff to upper layers that may be presenting an abstracted view of power/performance, such as CPU speed or speed ranges, to the user.

Thanks,

-- Todd
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: PowerOP 0/3: System power operating point management API
Dominik Brodowski wrote:
> First, the table interface you suggest is ugly. If there's indeed the need for such an abstraction, I'd favour something like

I'm planning to adopt the previous suggestions of an opaque data structure and stop trying to have any generic structure to it. I'll try to leave dependency checking etc. to the upper layers as much as possible, since platforms vary greatly in this and so do the needs of different PM s/w stacks.

> Secondly, you do not address the cross-relationships between operation points correctly. If you change the CPU frequency, you may have to switch other (memory, video) settings; you might even have to validate the frequency settings for these or even additional reasons (thermal and battery reasons - ACPI _PPC).

This lowest layer basically assumes that upper-layer software has created an appropriate operating point (for example, in DPM we pretty much require a system designer to create operating points that match the h/w specs and don't go to great lengths to encode rules about this), and/or will call driver notifiers etc. as needed to adapt to the changes. Although there may be some sanity checking appropriate at the PowerOP level, cpufreq, DPM, etc. can for the most part continue to handle the larger issues of how valid operating points are constructed, driver callbacks, etc. If you do want to handle various dependencies at the PowerOP layer then there's nothing that prevents that, but PM frameworks tend to embody assumptions about how frequently operating points will change and in what contexts (interrupt, idle...), and this can influence the code for such things.

> Thirdly, who is to decide on the power management settings? The first and intuitive answer is the kernel. Therefore, kernel-space cpufreq governors exist. Only under rare circumstances, you want full userspace control -- that's what the userspace cpufreq governor is for.

Also something left to the existing upper layers; PowerOP isn't intended to handle any of that.

In the embedded space we usually let the system designer choose operating points supported by their h/w vendor and that match their particular system states (hardware enabled at any point in time, type and power/performance needs of software currently running). We do recommend that a userspace power policy manager be the component in charge of PM settings, based on messages from drivers and other apps on the state of the system. And so that userspace component activates the operating point (or set of operating points in the case of DPM) appropriate for the current state.

> Fourthly, the code duplication which your implementation leads to is obvious for the speedstep-centrino case.

We could move the tables of valid cpu speeds and corresponding voltages down to the PowerOP level, and there would probably be little duplication at that point (in fact, with the current patch there's not a lot of duplication since the actual MSR access was moved to PowerOP and PowerOP contains little else, but both levels know how to understand the MSR format, and a more aggressive port to PowerOP could do away with that).

Your suggestions of changes to cpufreq governors and policies to handle governance of non-cpu-speed parameters sound interesting, and I'd be happy to help figure out what to do about those vs. the lower machine access layer I've discussed up until now. I'll think more about this real soon now.

Thanks,

-- Todd
Re: PowerOP 0/3: System power operating point management API
A small add-on: We need to make sure that we're capable of handling smart CPUs like Transmeta Crusoe processors in a sane way. This means

> b) Setting of "values"

is optional if the hardware itself can be set to a min/max value (step a above in previous mail).

	Dominik
Re: PowerOP 0/3: System power operating point management API
Hi!

The PowerOP infrastructure you suggest surely is one path to better runtime power management in the Linux kernel. However, I don't like it at all in its current implementation. Here are a few suggestions for improvements, rewrites, and so on:

First, the table interface you suggest is ugly. If there's indeed the need for such an abstraction, I'd favour something like

	struct powerop {
		struct list_head	powerop_values;	/* linked list of powerop_values */
		...
	};

	struct powerop_value {
		unsigned long		value_cur;
		unsigned long		value_min;
		unsigned long		value_max;
		struct list_head	next;
		u16			type;
		struct powerop_value	*cross_dependency;
		struct powerop_driver	*driver;
	};

	#define POWEROP_TYPE_CPU_FREQUENCY		0x0001
	#define POWEROP_TYPE_CPU_VOLTAGE		0x0002
	#define POWEROP_TYPE_FRONT_SIDE_BUS_SPEED	0x0004
	...
	#define POWEROP_TYPE_GPU_FREQUENCY		0x0001
	...

and if CPU_VOLTAGE and CPU_FREQUENCY can only be modified at the same time (as most cpufreq drivers require), type is 0x0003.

Secondly, you do not address the cross-relationships between operation points correctly. If you change the CPU frequency, you may have to switch other (memory, video) settings; you might even have to validate the frequency settings for these or even additional reasons (thermal and battery reasons - ACPI _PPC).

Thirdly, who is to decide on the power management settings? The first and intuitive answer is the kernel. Therefore, kernel-space cpufreq governors exist. Only under rare circumstances, you want full userspace control -- that's what the userspace cpufreq governor is for.

Fourthly, the code duplication which your implementation leads to is obvious for the speedstep-centrino case. And in contrast to Pavel, I do not consider it a "tiny cleanup".

I'd suggest that you try upgrading the cpufreq infrastructure to provide full support for multiple types of POWEROPs:

a) Setting of "policies"
   - New "min" or "max" values for all powerop_values are set, verified by powerop lowlevel drivers, powerop governors and external notifiers. E.g. if a new frequency min/max pair is required, the voltage level gets a new min and max value as well --> you need to handle recursion.
   - If necessary a new "powerop governor" is started.
   - Each powerop governor specifies which POWEROPs it can handle
     - current cpufreq governors can handle CPU_FREQUENCY, CPU_VOLTAGE and FRONT_SIDE_BUS_SPEED
     - a userspace fallback-governor always "handles" the parameters no other governor handles

b) Setting of "values"
   - Each governor can initiate transitions between the "min" and "max" values for operating points it has acquired ownership of.
   - The new setting is notified to all other governors and to external notifiers. If some entity decides it cannot live well with this new setting, it breaks out. Note that this should not happen very often, as the "normal" verification takes place in a) above. Nonetheless, if you want to break out CPU_VOLTAGE and CPU_FREQUENCY, you need it. And as it makes life for the kernel so much more difficult, I'm against doing so.
   - The low-level driver handling the powerop_value is called

Thanks,
	Dominik
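
The `type` bitmask and the policy step a) described in this message can be sketched in plain userspace C. This is only an illustration under stated assumptions: the list and driver links of the proposed `struct powerop_value` are left out, and the helpers `powerop_value_covers()` and `powerop_set_policy()` are hypothetical names that do not appear in the thread or in any kernel patch. Verification by drivers, governors and notifiers, and the recursion into dependent values, are not modelled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Type bits as given in the message. */
#define POWEROP_TYPE_CPU_FREQUENCY        0x0001
#define POWEROP_TYPE_CPU_VOLTAGE          0x0002
#define POWEROP_TYPE_FRONT_SIDE_BUS_SPEED 0x0004

/* Userspace stand-in for the proposed struct; list/driver links omitted. */
struct powerop_value {
	unsigned long value_cur;
	unsigned long value_min;
	unsigned long value_max;
	uint16_t type;	/* bitmask of POWEROP_TYPE_* */
};

/* Hypothetical helper: nonzero if v covers every bit set in types. */
int powerop_value_covers(const struct powerop_value *v, uint16_t types)
{
	assert(v != NULL);
	return (v->type & types) == types;
}

/*
 * Hypothetical helper for step a): install a new min/max policy.
 * Returns 0 on success and -1 if the requested range is empty.
 * The current value is clamped into the new range.
 */
int powerop_set_policy(struct powerop_value *v,
		       unsigned long min, unsigned long max)
{
	assert(v != NULL);
	if (min > max)
		return -1;
	v->value_min = min;
	v->value_max = max;
	if (v->value_cur < min)
		v->value_cur = min;
	if (v->value_cur > max)
		v->value_cur = max;
	return 0;
}
```

A value whose frequency and voltage can only change together would carry `POWEROP_TYPE_CPU_FREQUENCY | POWEROP_TYPE_CPU_VOLTAGE`, which is the 0x0003 mentioned above.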
Re: PowerOP 0/3: System power operating point management API
Pavel Machek wrote:
>> Depending on the ability of the hardware to make software-controlled power/performance adjustments, this may be useful to select custom voltages, bus speeds, etc. in desktop/server systems. Various embedded systems have several parameters that can be set. For example, an XScale PXA27x could be considered to have six basic power parameters (mainly cpu run mode and memory and bus dividers) that for the most part should
>
> This scares me a bit. Is table enough to handle this? I'm afraid that table will get very large on systems that allow you to do "almost anything".

Exhaustive tables for all combinations of possible parameters aren't expected (or practical for many systems as you note). In practice, a subset of these possible operating points is created and activated over the lifetime of the system, where the subset is chosen by a system designer according to the needs of the particular system. It's a matter for the higher-layer power management software to decide whether to have in-kernel tables of the possible operating points (as cpufreq does for various platforms) or whether to require userspace to create only the ones wanted (as does DPM).

There are cpufreq patches for PXA27x somewhere, for example, and in that case a subset of the supported operating points (and there are still only about 16 of those even for such a complicated piece of hardware) is represented in the kernel tables, choosing one of the possible combinations of memory/bus/etc. parameters for each unique cpu frequency.

Thanks,

-- Todd
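
The idea of a small designer-chosen table, with one combination of memory/bus parameters per cpu frequency, might look roughly like the C sketch below. Everything in it is an assumption made for illustration: the `struct op_point` layout, the divider fields, the table values, and the `op_find()` helper are hypothetical and are not taken from the PXA27x cpufreq patches or from PowerOP.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical operating point: one parameter combination per cpu speed. */
struct op_point {
	unsigned int cpu_khz;	/* cpu frequency in kHz */
	unsigned int mem_div;	/* memory divider (illustrative) */
	unsigned int bus_div;	/* bus divider (illustrative) */
};

/* Made-up subset of points, sorted by ascending cpu_khz. */
const struct op_point op_table[] = {
	{ 104000, 1, 1 },
	{ 208000, 2, 1 },
	{ 416000, 4, 2 },
};
const size_t op_table_len = sizeof(op_table) / sizeof(op_table[0]);

/*
 * Return the slowest point in table whose cpu speed is at least min_khz,
 * or NULL if no point is fast enough. The table must be sorted ascending.
 */
const struct op_point *op_find(const struct op_point *table, size_t len,
			       unsigned int min_khz)
{
	assert(table != NULL);
	for (size_t i = 0; i < len; i++)
		if (table[i].cpu_khz >= min_khz)
			return &table[i];
	return NULL;
}
```

The table stays as small as the number of distinct cpu frequencies the designer wants, regardless of how many parameter combinations the hardware would allow.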
Re: PowerOP 0/3: System power operating point management API
Hi!

> PowerOP is a system power parameter management API submitted for discussion. PowerOP writes and reads power "operating points", comprised of arbitrary integer-valued values, called power parameters, that correspond to registers, clocks, dividers, voltage regulators, etc. that may be modified to set a basic power/performance point for the system. The core basically passes an array of integer-valued power parameters (with very little additional structure imposed by the core) to a platform-specific backend that interprets those values and makes the requested adjustments. PowerOP is intended to leave all power policy decisions to higher layers. An optional sysfs representation of power parameters is also available, primarily for diagnostic use.
>
> PowerOP can be thought of as a layer below cpufreq that actually accesses the hardware to make cpu frequency, voltage, core bus, and perhaps other modifications to set a power point, leaving cpufreq to manage the interfaces based around the "cpu frequency" abstraction, the policies and governors that select the frequency, its notifiers, and so forth. An example hooking up support for one cpufreq platform to PowerOP is in patch 3/3.
>
> Depending on the ability of the hardware to make software-controlled power/performance adjustments, this may be useful to select custom voltages, bus speeds, etc. in desktop/server systems. Various embedded systems have several parameters that can be set. For example, an XScale PXA27x could be considered to have six basic power parameters (mainly cpu run mode and memory and bus dividers) that for the most part should

This scares me a bit. Is table enough to handle this? I'm afraid that table will get very large on systems that allow you to do "almost anything".

	Pavel
-- 
if you have sharp zaurus hardware you don't need... you know my address
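
The quoted description, a core that hands an array of integer power parameters to a platform-specific backend, can be pictured with the following userspace C sketch. It is not the code from the PowerOP patches: all `demo_*` names, the parameter count, and the fake register array standing in for hardware are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_NR_PARAMS 3	/* hypothetical number of power parameters */

/* An operating point: integers whose meaning only the backend knows. */
struct demo_point {
	int param[DEMO_NR_PARAMS];
};

/* Hypothetical backend interface: write and read an operating point. */
struct demo_backend {
	int (*set_point)(const struct demo_point *p);
	int (*get_point)(struct demo_point *p);
};

/* Fake "hardware registers" standing in for clocks, dividers, etc. */
static int demo_regs[DEMO_NR_PARAMS];

int demo_hw_set(const struct demo_point *p)
{
	for (size_t i = 0; i < DEMO_NR_PARAMS; i++)
		demo_regs[i] = p->param[i];
	return 0;
}

int demo_hw_get(struct demo_point *p)
{
	for (size_t i = 0; i < DEMO_NR_PARAMS; i++)
		p->param[i] = demo_regs[i];
	return 0;
}

const struct demo_backend demo_hw = { demo_hw_set, demo_hw_get };

/* The core imposes no structure: it only forwards the point. */
int demo_core_set(const struct demo_backend *b, const struct demo_point *p)
{
	assert(b != NULL && p != NULL);
	return b->set_point(p);
}

int demo_core_get(const struct demo_backend *b, struct demo_point *p)
{
	assert(b != NULL && p != NULL);
	return b->get_point(p);
}
```

In this picture, policy (which point to choose and when) lives entirely in the caller of `demo_core_set()`, which matches the stated intent of leaving policy decisions to higher layers.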