http://defect.opensolaris.org/bz/show_bug.cgi?id=6057
Eric Saxe <eric.saxe at sun.com> changed:
What   |Removed   |Added
----------------------------------------------------------------------------
Status |ACCEPTED  |FIXINPROGRESS
--- Comment #2 from Eric Saxe <eric.saxe at sun.com> 2009-01-27 16:16:47 ---
The fix for this issue involves implementing governors in the CPU power manager
that identify periods of transient utilization and idleness, and adapt CPU
power management policy accordingly.
Two governors are implemented:
- One to defend against the effects of transient utilization (e.g. kernel
background activity on an otherwise idle system). Left ungoverned, such
activity causes a given power domain to be clocked up as soon as one of these
transient threads hits the CPU, only to be clocked down a moment later.
- One to defend against the effects of transient idleness, for example, due to
a workload that is (for the most part) using the CPU, but occasionally yields
the CPU for a bit to await some event from elsewhere in the system.
Each governor works as follows (using the transient work governor as an
example):
A time threshold called a "transience prediction interval" is defined for the
purpose of evaluating how well the power manager is doing with respect to
making state transitions in the face of transience. In the case of the
transient work (tw) governor, the threshold is the amount of time below which
we consider the work transient, and not worth making a power state transition.
For an idle power domain, when the dispatcher sends a "domain is now busy"
event, the power manager will generally request a speed raise for the domain.
When the dispatcher later sends a "domain is now idle" event, the power manager
will observe how long the domain was utilized. If that duration is less than
the tw prediction interval, it will proceed to lower the power, but note
that the last raise was "mispredicted".
After some number of consecutive mispredicted raises, a governor is installed
(implemented in the code as a flag) that prevents future raises.
Once the governor is active, the power manager continues to monitor the
duration of the utilization periods. After some number of consecutive
mispredicted governed raises (that is, suppressed raises where the utilization
turned out not to be transient), it will remove the governor.
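
As a rough illustration of the tw governor logic described above (this is not
the code in the webrev), the bookkeeping might look something like the
following. The struct, field, and function names are made up for this sketch,
and it assumes a single consecutive-misprediction count is used both to
install and to remove the governor.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef int64_t hrtime_t;	/* stand-in for the Solaris type in this sketch */

/* Hypothetical per-domain state for the transient work (tw) governor. */
typedef struct tw_governor {
	hrtime_t predict_interval;	/* work shorter than this is transient */
	unsigned mispredict_max;	/* consecutive mispredictions to toggle */
	unsigned mispredicts;		/* current consecutive count */
	bool governed;			/* flag: raises are being suppressed */
	hrtime_t busy_start;		/* time of the last "now busy" event */
} tw_governor_t;

/* "Domain is now busy": returns true if a speed raise should be requested. */
static bool
tw_domain_busy(tw_governor_t *g, hrtime_t now)
{
	g->busy_start = now;
	return (!g->governed);
}

/* "Domain is now idle": score the last decision and update the governor. */
static void
tw_domain_idle(tw_governor_t *g, hrtime_t now)
{
	assert(now >= g->busy_start);

	bool transient = (now - g->busy_start) < g->predict_interval;
	/*
	 * Ungoverned: the raise was a misprediction if the work was transient.
	 * Governed: suppressing the raise was a misprediction if it was not.
	 */
	bool mispredicted = g->governed ? !transient : transient;

	if (!mispredicted) {
		g->mispredicts = 0;	/* the streak must be consecutive */
		return;
	}
	if (++g->mispredicts >= g->mispredict_max) {
		g->governed = !g->governed;	/* install or remove governor */
		g->mispredicts = 0;
	}
}
```

A caller would invoke tw_domain_busy() and tw_domain_idle() from the
corresponding dispatcher events, and only request the raise when
tw_domain_busy() returns true.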
The transient idle (ti) governor works the same way, but holds the speed high
in the face of otherwise transient idleness.
The prediction intervals for the governors are initialized the first time
event-based CPU power management is enabled. At that time, the power manager will
"benchmark" the cost of making the state transitions, and the initial value of
the ti and tw prediction intervals will be set to a tunable multiple of that
measured cost. This approach has several benefits:
- On systems where making state changes is comparatively more expensive (either
in hardware or software), the prediction intervals will be initialized to
reflect that cost, and the system will be comparatively more aggressive about
utilizing the governors to throttle transition rates.
- The tunable multiple can be adjusted by an administrator to make the power
manager more (or less) aggressive about making state transitions in response to
utilization changes. The actual prediction interval itself is maintained in
unscaled hrtime for performance (the msacct time stamps are recycled out of
swtch()), so the multiple provides a means of tuning that is independent of the
time base.
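
To make the initialization concrete, here is a minimal sketch of deriving a
prediction interval from a measured transition cost. All names and values are
hypothetical (the tunable name, the sample count, and the fake clock are
assumptions for illustration, not what the webrev uses). The clock and the
state change are passed in as function pointers so the sketch can run outside
the kernel.

```c
#include <assert.h>
#include <stdint.h>

typedef int64_t hrtime_t;	/* stand-in for the Solaris type in this sketch */

typedef hrtime_t (*clock_fn_t)(void);
typedef void (*transition_fn_t)(void);

/*
 * Time `samples` back-to-back state changes and return the average cost of
 * one change, in whatever units the supplied clock uses.
 */
static hrtime_t
benchmark_transition_cost(clock_fn_t now, transition_fn_t change_state,
    unsigned samples)
{
	assert(samples > 0);

	hrtime_t start = now();
	for (unsigned i = 0; i < samples; i++)
		change_state();
	return ((now() - start) / (hrtime_t)samples);
}

/*
 * The interval is a dimensionless multiple of the measured cost, so the
 * tunable means the same thing whatever the clock's time base is.
 */
static hrtime_t
predict_interval_init(hrtime_t cost, unsigned multiple)
{
	return (cost * (hrtime_t)multiple);
}

/* Fake clock and transition for demonstration: each change costs 250 ticks. */
static hrtime_t fake_clock;

static hrtime_t
fake_now(void)
{
	return (fake_clock);
}

static void
fake_transition(void)
{
	fake_clock += 250;
}
```

With the fake clock above, benchmarking four transitions gives a cost of 250
ticks, and a multiple of 10 gives a prediction interval of 2500 ticks. A
larger multiple makes the governors engage more readily.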
Webrev of changes at:
http://cr.opensolaris.org/~esaxe/6057/
With these changes, the performance regression at high utilization is
fixed, transition rates are down, and the odd logic implemented in cmt.c to
deal with transient utilization is gone (since the power manager has a governor
mechanism to deal with this).