Re: PowerOP 0/3: System power operating point management API

2005-08-16 Todd Poynor

Dominik Brodowski wrote:
> A small add-on:
>
> We need to make sure that we're capable of handling smart CPUs like Transmeta
> Crusoe processors in a sane way. This means
>
>> b)  Setting of "values"
>
> is optional if the hardware itself can be set to a min/max value (step a
> above in previous mail).


Although I haven't looked into the Crusoe processor support, it may be 
that a different set of power parameters, not cpu speed directly, is 
appropriate to manage on these platforms (after a brief look, it seems to 
be a range of frequencies and some sort of flags).  If so, these sorts of 
machine-specific power parameters are what PowerOP is trying to address: 
it lets upper layers manage the underlying machine-specific settings 
while presenting an abstracted view of power/performance, such as CPU 
speed or speed ranges, to the user.  Thanks,
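To illustrate Dominik's point that the "values" step can be optional, here is a minimal userspace sketch, assuming a hypothetical backend interface: a "smart" CPU in the spirit of the Crusoe case accepts a frequency range and picks the actual speed itself, while a conventional CPU needs an explicit value. All names (`struct hyp_backend`, `set_range`, `set_value`, `hyp_apply_policy()`) are illustrative and are not taken from the PowerOP patches or from the real Crusoe/LongRun driver.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical backend: a "smart" CPU implements set_range() and picks
 * its own speed inside the range; a simpler CPU only implements
 * set_value() and needs an explicit setting from the upper layer. */
struct hyp_backend {
	int (*set_range)(unsigned long min_khz, unsigned long max_khz);
	int (*set_value)(unsigned long khz);
};

/* Apply an upper-layer policy (a min/max range). If the hardware takes
 * the range directly, the separate "values" step is skipped. */
static int hyp_apply_policy(const struct hyp_backend *be,
			    unsigned long min_khz, unsigned long max_khz)
{
	assert(min_khz <= max_khz);
	if (be->set_range)
		return be->set_range(min_khz, max_khz);
	if (be->set_value)
		return be->set_value(max_khz); /* upper layer picks a value */
	return -1;                              /* backend can do neither */
}

/* Two demo backends that just record what they were asked to do. */
static unsigned long range_min, range_max, fixed_khz;

static int smart_set_range(unsigned long lo, unsigned long hi)
{
	range_min = lo;
	range_max = hi;
	return 0;
}

static int plain_set_value(unsigned long khz)
{
	fixed_khz = khz;
	return 0;
}

static const struct hyp_backend smart_cpu = { smart_set_range, NULL };
static const struct hyp_backend plain_cpu = { NULL, plain_set_value };
```

Choosing the maximum for the plain backend is only this sketch's stand-in for whatever a governor would actually decide.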


--
Todd
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: PowerOP 0/3: System power operating point management API

2005-08-16 Todd Poynor

Dominik Brodowski wrote:


> First, the table interface you suggest is ugly. If there's indeed the need for
> such an abstraction, I'd favour something like


I'm planning to adopt the previous suggestions of an opaque data 
structure and stop trying to have any generic structure to it.  I'll try 
to leave dependency checking etc. to the upper layers as much as 
possible, since platforms vary greatly in this and so do the needs of 
different PM s/w stacks.
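A minimal sketch of what "opaque" could mean in practice, written as userspace C so it compiles on its own. The names (`struct op_point`, `struct op_backend`, `op_register_backend()`, `op_set_point()`, `op_get_point()`) are hypothetical, not the interface of the posted patches: the core sees the operating point only as an incomplete type and hands the pointer to whichever platform backend registered, leaving dependency checking to the caller.

```c
#include <assert.h>
#include <stddef.h>

/* Core side: an operating point is an incomplete (opaque) type. */
struct op_point;

struct op_backend {
	int (*get_point)(struct op_point *out);
	int (*set_point)(const struct op_point *in);
};

static const struct op_backend *cur_backend;

static int op_register_backend(const struct op_backend *be)
{
	if (cur_backend)
		return -1; /* one platform backend at a time */
	cur_backend = be;
	return 0;
}

/* No dependency checking here: the upper layer is assumed to have
 * built a point that is valid for this hardware. */
static int op_set_point(const struct op_point *p)
{
	return cur_backend ? cur_backend->set_point(p) : -1;
}

static int op_get_point(struct op_point *p)
{
	return cur_backend ? cur_backend->get_point(p) : -1;
}

/* Platform side (normally a separate file): the only code that knows
 * the layout. The fields are placeholders. */
struct op_point {
	unsigned int cpu_khz;
	unsigned int mem_div;
};

static struct op_point hw_state;

static int demo_get(struct op_point *out)
{
	*out = hw_state;
	return 0;
}

static int demo_set(const struct op_point *in)
{
	hw_state = *in;
	return 0;
}

static const struct op_backend demo_backend = { demo_get, demo_set };
```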



> Secondly, you do not address the cross-relationships between operating points
> correctly. If you change the CPU frequency, you may have to switch other
> (memory, video) settings; you might even have to validate the frequency
> settings for these or even additional reasons (thermal and battery reasons -
> ACPI _PPC).


This lowest layer basically assumes that upper-layer software has 
created an appropriate operating point (for example, in DPM we pretty 
much require a system designer to create operating points that match the 
h/w specs and don't go to great lengths to encode rules about this), 
and/or will call driver notifiers etc. as needed to adapt to the 
changes.  Although there may be some sanity checking appropriate at the 
PowerOP level, cpufreq, DPM, etc. can for the most part continue to 
handle the larger issues of how valid operating points are constructed, 
driver callbacks, etc.  If you do want to handle various dependencies at 
the PowerOP layer then there's nothing that prevents that, but PM 
frameworks tend to embody assumptions about how frequently operating 
points will change and in what contexts (interrupt, idle...), and this 
can influence the code for such things.
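The division of labour described above (the lowest layer just writes the hardware, while the framework above it owns validity rules and driver callbacks) can be sketched as follows. This is an illustration under assumptions, not cpufreq's real code: `lowlevel_write()`, `upper_register()`, `upper_change_point()` and the fixed-size notifier array are hypothetical stand-ins for a real notifier chain.

```c
#include <assert.h>
#include <stddef.h>

enum { PRECHANGE, POSTCHANGE };

typedef void (*notifier_fn)(int phase, unsigned int new_khz);

/* Hypothetical lowest layer: writes the hardware, no policy at all. */
static unsigned int hw_khz = 100000;

static int lowlevel_write(unsigned int khz)
{
	hw_khz = khz;
	return 0;
}

/* Hypothetical upper layer: owns the driver notifiers and decides
 * when, and in what context, a change happens. */
#define MAX_NOTIFIERS 4
static notifier_fn notifiers[MAX_NOTIFIERS];
static size_t n_notifiers;

static int upper_register(notifier_fn fn)
{
	if (n_notifiers == MAX_NOTIFIERS)
		return -1;
	notifiers[n_notifiers++] = fn;
	return 0;
}

static int upper_change_point(unsigned int new_khz)
{
	size_t i;
	int ret;

	for (i = 0; i < n_notifiers; i++)
		notifiers[i](PRECHANGE, new_khz);
	ret = lowlevel_write(new_khz);
	for (i = 0; i < n_notifiers; i++)
		notifiers[i](POSTCHANGE, new_khz);
	return ret;
}

/* Example driver callback, e.g. a driver that must retune its timing. */
static unsigned int events;
static unsigned int driver_seen_khz;

static void demo_driver(int phase, unsigned int new_khz)
{
	events++;
	if (phase == POSTCHANGE)
		driver_seen_khz = new_khz;
}
```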



> Thirdly, who is to decide on the power management settings? The first and
> intuitive answer is the kernel. Therefore, kernel-space cpufreq governors
> exist. Only under rare circumstances do you want full userspace control --
> that's what the userspace cpufreq governor is for.


Also something left to the existing upper layers; PowerOP isn't intended 
to handle any of that.  In the embedded space we usually let the system 
designer choose operating points supported by their h/w vendor and that 
match their particular system states (hardware enabled at any point in 
time, type and power/performance needs of software currently running). 
We do recommend that a userspace power policy manager be the component 
in charge of PM settings, based on messages from drivers and other apps 
on the state of the system.  That userspace component then activates 
the operating point (or set of operating points, in the case of DPM) 
appropriate for the current state.
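As a rough illustration of that arrangement, a userspace policy manager could map the reported system state to one of a few designer-defined operating points and then ask the kernel to activate it. The state flags, point names, rule table and the idea of writing the point name to a control file are all assumptions for this sketch; no real DPM or PowerOP sysfs path is implied.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical system-state flags reported by drivers/applications. */
enum {
	ST_LCD_ON     = 1u << 0,
	ST_AUDIO_PLAY = 1u << 1,
	ST_VIDEO_PLAY = 1u << 2,
};

/* Designer-chosen policy: ordered rules, first match wins. */
struct rule {
	unsigned int required; /* all of these flags must be set */
	const char *point;     /* name of the operating point to activate */
};

static const struct rule rules[] = {
	{ ST_LCD_ON | ST_VIDEO_PLAY, "video"   },
	{ ST_AUDIO_PLAY,             "audio"   },
	{ ST_LCD_ON,                 "ui-idle" },
	{ 0,                         "sleep"   }, /* catch-all */
};

static const char *choose_point(unsigned int state)
{
	size_t i;

	for (i = 0; i < sizeof(rules) / sizeof(rules[0]); i++)
		if ((state & rules[i].required) == rules[i].required)
			return rules[i].point;
	return NULL; /* unreachable: the last rule always matches */
}

/* Activate by writing the point name to a control file; the path is
 * supplied by the caller and is purely illustrative. */
static int activate_point(const char *ctl_path, const char *name)
{
	FILE *f = fopen(ctl_path, "w");

	if (!f)
		return -1;
	fprintf(f, "%s\n", name);
	return fclose(f) == 0 ? 0 : -1;
}
```

Keeping the rules in an ordered table reflects the approach described above: the system designer, not the kernel, decides which states map to which points.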



> Fourthly, the code duplication which your implementation leads to is obvious
> for the speedstep-centrino case.


We could move the tables of valid cpu speeds and corresponding voltages 
down to the PowerOP level, and there would probably be little 
duplication at that point.  (In fact, with the current patch there's not a 
lot of duplication, since the actual MSR access was moved to PowerOP and 
PowerOP contains little else; but both levels understand the MSR format, 
and a more aggressive port to PowerOP could do away with that.)
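For reference, here is a small sketch of the MSR value format that both layers would need to understand. It assumes the encoding used by the static Pentium M tables in speedstep-centrino (the bus ratio in bits 15:8 with a 100 MHz bus, and a voltage ID in bits 7:0 counting 16 mV steps above 700 mV); treat those constants as an assumption to check against the driver source. The function names are illustrative.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed Pentium M encoding: ratio = MHz / 100 in bits 15:8,
 * VID = (mV - 700) / 16 in bits 7:0. */
#define BUS_MHZ  100u
#define VID_BASE 700u
#define VID_STEP 16u

static uint16_t perf_ctl_encode(unsigned int mhz, unsigned int mv)
{
	assert(mv >= VID_BASE);
	return (uint16_t)(((mhz / BUS_MHZ) << 8) |
			  ((mv - VID_BASE) / VID_STEP));
}

/* Decode the frequency (MHz) from the ratio byte. */
static unsigned int perf_ctl_mhz(uint16_t val)
{
	return ((val >> 8) & 0xffu) * BUS_MHZ;
}

/* Decode the voltage (mV) from the VID byte. */
static unsigned int perf_ctl_mv(uint16_t val)
{
	return (val & 0xffu) * VID_STEP + VID_BASE;
}
```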


Your suggestions of changes to cpufreq governors and policies to handle 
governance of non-cpu-speed parameters sound interesting, and I'd be 
happy to help figure out what to do about those vs. the lower machine 
access layer I've discussed up until now.  I'll think more about this 
real soon now.  Thanks,


--
Todd


Re: PowerOP 0/3: System power operating point management API

2005-08-16 Dominik Brodowski
A small add-on:

We need to make sure that we're capable of handling smart CPUs like Transmeta
Crusoe processors in a sane way. This means

> b)  Setting of "values"

is optional if the hardware itself can be set to a min/max value (step a
above in previous mail).

Dominik


Re: PowerOP 0/3: System power operating point management API

2005-08-16 Dominik Brodowski
Hi!

The PowerOP infrastructure you suggest surely is one path to better runtime
power management in the Linux kernel. However, I don't like it at all in its
current implementation. Here are a few suggestions for improvements,
rewrites, and so on:

First, the table interface you suggest is ugly. If there's indeed the need for
such an abstraction, I'd favour something like

struct powerop {
	struct list_head	powerop_values;	/* linked list of powerop_values */
	...
};

struct powerop_value {
	unsigned long		value_cur;
	unsigned long		value_min;
	unsigned long		value_max;
	struct list_head	next;
	u16			type;
	struct powerop_value	*cross_dependency;
	struct powerop_driver	*driver;
};

#define POWEROP_TYPE_CPU_FREQUENCY		0x0001
#define POWEROP_TYPE_CPU_VOLTAGE		0x0002
#define POWEROP_TYPE_FRONT_SIDE_BUS_SPEED	0x0004
...
#define POWEROP_TYPE_GPU_FREQUENCY		0x0001
...

And if CPU_VOLTAGE and CPU_FREQUENCY can only be modified at the same time (as
most cpufreq drivers require), type is 0x0003.
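A compilable userspace rendering of the proposal above, as a sketch under assumptions: `struct list_head` is replaced by a plain `next` pointer so the example builds outside the kernel, the `driver` pointer is omitted, `powerop_find()` and `powerop_clamp()` are hypothetical helpers, and all numbers are placeholders. The `type` field is treated as a bitmask, so an entry whose frequency and voltage can only change together carries type 0x0003.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define POWEROP_TYPE_CPU_FREQUENCY		0x0001
#define POWEROP_TYPE_CPU_VOLTAGE		0x0002
#define POWEROP_TYPE_FRONT_SIDE_BUS_SPEED	0x0004

struct powerop_value {
	unsigned long value_cur;
	unsigned long value_min;
	unsigned long value_max;
	struct powerop_value *next;	/* stands in for struct list_head */
	uint16_t type;			/* bitmask of POWEROP_TYPE_* */
	struct powerop_value *cross_dependency;
};

struct powerop {
	struct powerop_value *values;	/* head of the value list */
};

/* Return the first value whose type covers every bit in @type. */
static struct powerop_value *powerop_find(struct powerop *op, uint16_t type)
{
	struct powerop_value *v;

	for (v = op->values; v; v = v->next)
		if ((v->type & type) == type)
			return v;
	return NULL;
}

/* Clamp a requested setting into the value's current [min, max]. */
static unsigned long powerop_clamp(const struct powerop_value *v,
				   unsigned long req)
{
	if (req < v->value_min)
		return v->value_min;
	if (req > v->value_max)
		return v->value_max;
	return req;
}

/* Frequency and voltage change together (type 0x0003); the FSB speed
 * is a separate entry. */
static struct powerop_value fsb = {
	.value_cur = 100, .value_min = 100, .value_max = 133,
	.next = NULL, .type = POWEROP_TYPE_FRONT_SIDE_BUS_SPEED,
};
static struct powerop_value cpu = {
	.value_cur = 1600, .value_min = 600, .value_max = 1600,
	.next = &fsb,
	.type = POWEROP_TYPE_CPU_FREQUENCY | POWEROP_TYPE_CPU_VOLTAGE,
};
static struct powerop demo = { &cpu };
```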


Secondly, you do not address the cross-relationships between operating points
correctly. If you change the CPU frequency, you may have to switch other
(memory, video) settings; you might even have to validate the frequency
settings for these or even additional reasons (thermal and battery reasons -
ACPI _PPC).
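One way to picture these cross-relationships and the recursion they imply: each setting carries its own limit and a list of dependent settings, and narrowing one range pushes derived ranges down the chain, failing if nothing valid remains. This is a sketch under loudly stated assumptions: `struct pval`, `pval_set_range()`, the derive rules and every number are hypothetical placeholders, and `hw_limit` merely stands in for caps such as ACPI _PPC.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_DEPS  2
#define MAX_DEPTH 8 /* guards against dependency cycles */

struct pval {
	unsigned long min, max;
	unsigned long hw_limit; /* e.g. a thermal/battery cap; 0 = none */
	/* Hypothetical rule mapping a parent value to ours (monotonic). */
	unsigned long (*derive)(unsigned long parent_val);
	struct pval *deps[MAX_DEPS];
};

/* Set a new [min, max] on @v, apply its limit, then push derived
 * ranges down to every dependent setting, recursively. No rollback:
 * on failure, settings already updated stay updated. */
static int pval_set_range(struct pval *v, unsigned long min,
			  unsigned long max, int depth)
{
	size_t i;

	if (depth > MAX_DEPTH)
		return -1;
	if (v->hw_limit && max > v->hw_limit)
		max = v->hw_limit;      /* validation step */
	if (min > max)
		return -1;              /* nothing valid remains */
	v->min = min;
	v->max = max;

	for (i = 0; i < MAX_DEPS && v->deps[i]; i++) {
		struct pval *d = v->deps[i];
		int ret = pval_set_range(d, d->derive(min), d->derive(max),
					 depth + 1);
		if (ret)
			return ret;
	}
	return 0;
}

/* Placeholder rules, not real hardware relationships. */
static unsigned long volt_for_mhz(unsigned long mhz) { return 700 + mhz / 2; }
static unsigned long mem_for_mhz(unsigned long mhz) { return mhz / 4; }

static struct pval voltage = { .derive = volt_for_mhz };
static struct pval memclk  = { .hw_limit = 266, .derive = mem_for_mhz };
static struct pval cpufreq = { .hw_limit = 1400,
			       .deps = { &voltage, &memclk } };
```

A real implementation would also need to undo partial changes when a dependent setting fails.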

Thirdly, who is to decide on the power management settings? The first and
intuitive answer is the kernel. Therefore, kernel-space cpufreq governors
exist. Only under rare circumstances do you want full userspace control --
that's what the userspace cpufreq governor is for.

Fourthly, the code duplication which your implementation leads to is obvious
for the speedstep-centrino case. And in contrast to Pavel, I do not consider
it a "tiny cleanup".



I'd suggest that you try upgrading the cpufreq infrastructure to provide
full support for multiple types of POWEROPs:

a)  Setting of "policies"
    - New "min" or "max" values for all powerop_values are set, verified
      by powerop low-level drivers, powerop governors and external
      notifiers. E.g. if a new frequency min/max pair is required, the
      voltage level gets a new min and max value as well --> you need to
      handle recursion.
    - If necessary, a new "powerop governor" is started.
       - Each powerop governor specifies which POWEROPs it can handle:
          - current cpufreq governors can handle CPU_FREQUENCY,
            CPU_VOLTAGE and FRONT_SIDE_BUS_SPEED
          - a userspace fallback governor always "handles" the
            parameters no other governor handles

b)  Setting of "values"
    - Each governor can initiate transitions between the "min" and "max"
      values of the operating points it has acquired ownership of.
    - The new setting is notified to all other governors and to external
      notifiers. If some entity decides it cannot live well with this
      new setting, it breaks out. Note that this should rarely happen,
      as the "normal" verification takes place in a) above.
      Nonetheless, if you want to break out CPU_VOLTAGE and CPU_FREQUENCY,
      you need it. And as it makes life for the kernel so much more
      difficult, I'm against doing so.
    - The low-level driver handling the powerop_value is called.
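Steps a) and b) can be sketched roughly as follows, under assumptions: ownership is handed out by bitmask to the first governor that can handle each type, with a userspace fallback taking the rest, and a value change is refused if any notifier objects. `struct governor`, `assign_ownership()`, `set_value()`, `thermal_veto()` and the `T_*` constants are hypothetical; `T_GPU` uses 0x0008 only so that each type gets its own bit.

```c
#include <assert.h>
#include <stddef.h>

#define T_CPU_FREQ 0x0001u
#define T_CPU_VOLT 0x0002u
#define T_FSB      0x0004u
#define T_GPU      0x0008u /* hypothetical value */

struct governor {
	const char *name;
	unsigned int handles; /* types this governor can manage */
	unsigned int owns;    /* types it was actually given */
};

/* Step a), ownership part: give each type to the first governor that
 * can handle it; the userspace fallback gets whatever is left. */
static unsigned int assign_ownership(struct governor *g, size_t n,
				     struct governor *fallback,
				     unsigned int all)
{
	unsigned int left = all;
	size_t i;

	for (i = 0; i < n; i++) {
		g[i].owns = g[i].handles & left;
		left &= ~g[i].owns;
	}
	fallback->owns = left;
	return left;
}

/* A notifier returns nonzero to veto ("break out of") a new value. */
typedef int (*veto_fn)(unsigned int type, unsigned long val);

static unsigned long hw_value; /* stands in for the low-level driver */

/* Step b): the owning governor moves a value inside [min, max]; every
 * notifier is asked first, then the low-level driver is called. */
static int set_value(const struct governor *gov, unsigned int type,
		     unsigned long val, unsigned long min, unsigned long max,
		     veto_fn *notifiers, size_t n_notifiers)
{
	size_t i;

	if ((gov->owns & type) != type)
		return -1; /* not owned by this governor */
	if (val < min || val > max)
		return -1; /* outside the policy set in step a) */
	for (i = 0; i < n_notifiers; i++)
		if (notifiers[i](type, val))
			return -2; /* some entity broke out */
	hw_value = val;
	return 0;
}

/* Example notifier with a placeholder cap on CPU frequency. */
static int thermal_veto(unsigned int type, unsigned long val)
{
	return (type & T_CPU_FREQ) && val > 1200;
}

static struct governor govs[] = {
	{ "kernel-governor", T_CPU_FREQ | T_CPU_VOLT | T_FSB, 0 },
};
static struct governor userspace_gov = { "userspace", 0, 0 };
static veto_fn chain[] = { thermal_veto };
```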

Thanks,
Dominik


Re: PowerOP 0/3: System power operating point management API

2005-08-10 Todd Poynor

Pavel Machek wrote:

>> Depending on the ability of the hardware to make software-controlled
>> power/performance adjustments, this may be useful to select custom
>> voltages, bus speeds, etc. in desktop/server systems.  Various embedded
>> systems have several parameters that can be set.  For example, an XScale
>> PXA27x could be considered to have six basic power parameters (mainly
>> cpu run mode and memory and bus dividers) that for the most part
>> should
>
> This scares me a bit. Is a table enough to handle this? I'm afraid that
> the table will get very large on systems that allow you to do "almost
> anything".


Exhaustive tables for all combinations of possible parameters aren't 
expected (or practical for many systems, as you note).  In practice, a 
subset of these possible operating points is created and activated over 
the lifetime of the system, where the subset is chosen by a system 
designer according to the needs of the particular system.  It's a matter 
for the higher-layer power management software to decide whether to have 
in-kernel tables of the possible operating points (as cpufreq does for 
various platforms) or whether to require userspace to create only the 
ones wanted (as does DPM).  There are cpufreq patches for PXA27x 
somewhere, for example, and in that case a subset of the supported 
operating points (and there are still only about 16 of those even for 
such a complicated piece of hardware) are represented in the kernel 
tables, choosing one of the possible combinations of memory/bus/etc. 
parameters for each unique cpu frequency.  Thanks,
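The idea of a small, designer-chosen subset can be made concrete with a table holding one entry per CPU frequency and a lookup over it. Everything here is illustrative: the field names, the five entries and their divider values are placeholders rather than real PXA27x settings, and `op_lookup()` is a hypothetical helper that assumes the table is sorted by ascending frequency and picks the lowest entry at or above the request, similar in spirit to cpufreq's CPUFREQ_RELATION_L.

```c
#include <assert.h>
#include <stddef.h>

/* One designer-chosen entry per unique CPU frequency. The divider
 * fields are placeholders, not datasheet values. */
struct op_entry {
	unsigned int cpu_khz;
	unsigned int mem_div; /* hypothetical memory divider */
	unsigned int bus_div; /* hypothetical bus divider */
};

static const struct op_entry table[] = {
	{ 104000, 2, 1 },
	{ 208000, 2, 2 },
	{ 312000, 3, 2 },
	{ 416000, 4, 2 },
	{ 520000, 4, 4 },
};

#define N_ENTRIES (sizeof(table) / sizeof(table[0]))

/* Lowest-frequency entry at or above @target_khz; if the request is
 * above every entry, fall back to the fastest one. */
static const struct op_entry *op_lookup(unsigned int target_khz)
{
	size_t i;

	for (i = 0; i < N_ENTRIES; i++)
		if (table[i].cpu_khz >= target_khz)
			return &table[i];
	return &table[N_ENTRIES - 1];
}
```

However many raw parameter combinations the hardware allows, the table only grows with the number of points the designer actually chose.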


--
Todd


Re: PowerOP 0/3: System power operating point management API

2005-08-10 Pavel Machek
Hi!

> PowerOP is a system power parameter management API submitted for
> discussion.  PowerOP writes and reads power "operating points",
> comprised of arbitrary integer-valued values, called power parameters,
> that correspond to registers, clocks, dividers, voltage regulators,
> etc. that may be modified to set a basic power/performance point for the
> system.  The core basically passes an array of integer-valued power
> parameters (with very little additional structure imposed by the core)
> to a platform-specific backend that interprets those values and makes
> the requested adjustments.  PowerOP is intended to leave all power
> policy decisions to higher layers.  An optional sysfs representation of
> power parameters is also available, primarily for diagnostic use.
> 
> PowerOP can be thought of as a layer below cpufreq that actually
> accesses the hardware to make cpu frequency, voltage, core bus, and
> perhaps other modifications to set a power point, leaving cpufreq to
> manage the interfaces based around the "cpu frequency" abstraction, the
> policies and governors that select the frequency, its notifiers, and so
> forth.  An example hooking up support for one cpufreq platform to
> PowerOP is in patch 3/3.
> 
> Depending on the ability of the hardware to make software-controlled
> power/performance adjustments, this may be useful to select custom
> voltages, bus speeds, etc. in desktop/server systems.  Various embedded
> systems have several parameters that can be set.  For example, an XScale
> PXA27x could be considered to have six basic power parameters (mainly
> cpu run mode and memory and bus dividers) that for the most part
> should
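The design described in the quoted announcement, where the core passes an array of integer power parameters to a platform backend that interprets them, might look roughly like this. It is a sketch under assumptions: the parameter indices, `struct backend`, `core_register()` and `core_set_params()` are hypothetical and are not taken from the posted patch set.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical parameter indices a platform backend might define. */
enum { P_RUN_MODE, P_MEM_DIV, P_BUS_DIV, NR_PARAMS };

struct backend {
	/* Interprets and applies @n integer parameters; nonzero on error. */
	int (*apply)(const int *params, size_t n);
	size_t nr_params;
};

static const struct backend *be;

static int core_register(const struct backend *b)
{
	be = b;
	return 0;
}

/* The core imposes almost no structure: it checks the count and hands
 * the array through. Policy lives in higher layers. */
static int core_set_params(const int *params, size_t n)
{
	if (!be || n != be->nr_params)
		return -1;
	return be->apply(params, n);
}

/* Demo backend: records the parameters instead of touching registers. */
static int shadow[NR_PARAMS];

static int demo_apply(const int *p, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (p[i] < 0)
			return -1; /* reject the whole point, change nothing */
	for (i = 0; i < n; i++)
		shadow[i] = p[i];
	return 0;
}

static const struct backend demo = { demo_apply, NR_PARAMS };
```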

This scares me a bit. Is a table enough to handle this? I'm afraid that
the table will get very large on systems that allow you to do "almost
anything".
Pavel
-- 
if you have sharp zaurus hardware you don't need... you know my address