[tesla-dev] Request for comment for PSARC onapager about CPU idle nofication interface

Liu, Jiang Sat, 22 Nov 2008 00:15:37 +0800

Bill.Holler at Sun.COM <mailto:Bill.Holler at Sun.COM> wrote:
> Hi Gerry,
> 
> Here are my comments on this proposal.
> 
> ci_c_state should be architecture neutral.  Something like
> ci_idle_state would be better.  Each architecture can have public
> defines for what these mean (e.g. #define   C1   1).
Good suggestion, I'll refine it.


> 
> Should "arg" be removed from the typedefs for cpu_idle_enter_cbfn
> and cpu_idle_exit_cbfn and also removed from
> cpu_idle_register_callback and cpu_idle_unregister_callback?
> There is no purpose for arg in the new version of the interface.
The arg is used to pass context info back to callbacks.
It's useful when the same callback handler will be registered more than one 
time. If no such usage scenario, we can remove it.

> 
> Should cpu_idle_exit handlers be called before enabling interrupts
> when the CPU resumes from idle?
Yes, it will be called before enabling interrupt and all cpu_idle_exit handlers 
will called with interrupt enabling flag cleared.

> 
> What are your thoughts on priority?  Is this related to
> hardware requirements (for example run handler before
> disabling LAPIC)?  Or is this for power performance?
> This seems like a tricky area where some of the hardware
> implementation may need to be exposed.
Mainly the priority is used to resolve hardware dependencies among callbacks.
Actually this is the most tricky and dirty part, under some circumstances we 
may need to statically allocate priority based on hardware dependency and 
performance requirements.
I have splitted the priority space into three sub-spaces:
HIGH_PRIORITY sub-space: statically allocated high priorities
DYNAMIC_PRIORITY sub-space: callbacks don't care about calling order and system 
allocate priority for callbacks.
LOW_PRIORITY sub-space: statically allocated low priorities

> 
> Thank you,
> Bill
> 
> 
> On 11/19/08 08:17, Liu, Jiang wrote:
>> Hi Bill, Eric and Ashok,
>>      I have refined the PSARC onepager according to our
> previous discussion, please help to review.
>>      We may refine it further once we reach agreements about
> the statistics info sharing among idle notification callbacks.
>>      Thanks!
>> 
>> Eric.Saxe at Sun.COM <mailto:Eric.Saxe at Sun.COM> wrote:
>> 
>>> Hi Gerry, Bill,
>>> 
>>> Liu, Jiang wrote:
>>> 
>>>> Hi Bill,
>>>>    Thanks for your great suggestions.
>>>>    I'll update the onepager according to your suggestions and repost
>>>> it after getting enough feedbacks.
>>>> 
>>>> Bill.Holler at Sun.COM <mailto:Bill.Holler at Sun.COM> wrote:
>>>> 
>>>> 
>>>>> Hi Liu Jiang,
>>>>> 
>>>>> This proposal looks great!
>>>>> 
>>>>> Here are some suggestions:
>>>>> 
>>>>> 1) This proposal does not specify the context the callbacks will
>>>>> execute in. Presumably idle callbacks will run in the idle-thread
>>>>> context.   This proposal should state this.
>>>>> Idle threads cannot block.  Idle threads also cannot spin when the
>>>>> owner of the synchronization object expects to be able to wakeup
>>>>> the CPU from idle. 
>>>>> 
>>>>> 
>>>> For cpu_idle_enter_cbfn, it will be called only in idle thread
>>>> context. For cpu_idle_exit_cbfn, it may be called in idle thread
>>>> context or interrupt context which depends on the hardware
>>>> mechanism(instructions) used to put CPU into low power state. For
>>>> example, cpu_idle_exit_cbfn will be called in interrupt context if
>>>> OS uses hlt instruction to put CPU into C1 state on x86. On the
>>>> other hand, it will be called in idle thread context if OS uses
>>>> mwait to put CPU into low power state with interrupt disabled. So
>>>> there's a flag to indicate whether cpu_idle_exit_cbfn is called in
>>>> interrupt context. 
>>>> 
>>>> 
>>>> 
>>>>> 2) Clients may only be interested in a sub-set of c-states.
>>>>> cpu_idle_register_callback() could take another argument
>>>>> that specified which c-state(s) the callback should be invoked.
>>>>> I do not know if clients will want this option.
>>>>> 
>>>>> CPUs will likely only enter C1 idle when they do not expect to be
>>>>> idle very long on C-State enabled kernels.
>>>>> 
>>>>> 
>>>> I don't know whether this is a common requirement too.
>>>> How about to let the callback filter out C states not interested
>>>> at? 
>>>> 
>>>> 
>>>> 
>>>>> 3) A "max idle time" argument could be added to
>>>>> cpu_idle_enter_cbfn(). As part of the C-State project CPUs now
>>>>> read their LAPIC Current Count register to see when the next
>>>>> Local APIC Timer interrupt will fire.  Would FIPE be interested
>>>>> in this max value? 
>>>>> 
>>>>> 
>>>> In the original design and implementation, we do have a parameter
>>>> named 'hint' to pass "predicted max idle time" to
>>>> cpu_idle_enter_cbfn.
>>>> I have removed it from the onepager due to following reasons,
>>>> 1) parameter cstate roughly has the same info.
>>>> 2) There are arguments whether we should extend parameter 'hint' as
>>>> a structure to contain more info other than "predicted max idle
>>>> time" for extensibility. 3) Our code is based on onnv_92 which has
>>>> no support for deeper C state and tickless timer yet. I have no
>>>> idea about the tickless timer interface at that time.
>>>> If tickless timer or C-State project provides some interface to
>>>> query next timer expiration time, we could pass the predicted max
>>>> idle time to callbacks. 
>>>> 
>>>> 
>>> One thing that's nicer about the predicted max idle time, it that's
>>> it's relatively platform independent. I would really like to see us
>>> make this idle callback work common code to the greatest extent
>>> possible, so that it could be leveraged by the other architectures
>>> at some point. 
>>> 
>>> 
>>>>> 4) Clients may not want their callback invoked if the CPU will not
>>>>> be idle longer than a specified time due to the LAPIC Current
>>>>> Count register.  Do you want cpu_idle_register_callback() to be
>>>>> able to specify a minimum "max idle-time threshold" argument? The
>>>>> callbeck would not be called if the CPU knows it will not be idle
>>>>> that long. 
>>>>> 
>>>>> 
>>>> How about to let callbacks make such a decision to simplify the
>>>> framework? Or any concern about performance issues?
>>>> 
>>>> 
>>> Letting the callbacks make the decision makes sense if it can
>>> simplify the interfaces....the cost is really just an extra
>>> function call (since the comparison would happen either way).
>>> 
>>> 
>>>>> 5) Should any of the FIPE "system idleness" calculations be part
>>>>> of the core kernel?  Other systems may be interested in this
>>>>> information. 
>>>>> 
>>>>> 
>>>> We may optionally expose the idleness prediction algorithm if it's
>>>> useful to other componets. 
>>>> 
>>>> 
>>>> 
>>>>> 6) What does this sentence in 4.2.4 mean?
>>>>> "If it returns non-zero value, CPU should cancel entering idle
>>>>> state"? Does this mean "enter ACPI C1 instead of a deeper
>>>>> C-State"? Does FIPE need this? 
>>>>> 
>>>>> Can this functionality be removed?
>>>>> CPUs have to at least enter C1 when idle.  There can be large
>>>>> performance penalties if CPUs never halt in their idle loop....
>>>>> 
>>>>> 
>>>>> 
>>>> Currently FIPE doesn't make use of this feature. It's here just
>>>> for future extensibility. According to your questions about C1, it
>>>> would be better to remove this feature. How about the idea to
>>>> demote deep C state instead of cancel entering idle state
>>>> according to callback return value? 
>>>> For example, cpu_idle_enter_cbfn is called with an indicator to
>>>> enter C3, it may optionally argue about that by returning a
>>>> indicator to suggest entering C1 instead of C3.
>>>> By adopting sucn a change, all questions relative to C1 could be
>>>> solved. 
>>>> 
>>>> 
>>> Presumably there is code that decides (based on how long the CPU
>>> estimates it will be idle) which c-state to request...and that is
>>> another consumer of the interfaces being proposed here. Having other
>>> consumers also make that decision also seems like it would greatly
>>> complicate things...or am I not thinking correctly about this?
>>> 
>>> 
>>>>> 7) How will rollback be handled when a callback cancels the idle
>>>>> after other callbacks have already been called?
>>>>> 
>>>>> Will the cpu_idle_exit_cbfn() functions be called for all
>>>>> callbacks that have already been invoked?
>>>>> 
>>>>> 
>>>> Yes, currently cpu_idle_exit_cbfn will be called with a flag
>>>> indicating cancellation for all callbacks that have already been
>>>> invoked. 
>>>> 
>>>> 
>>> Is the idea of rollback still on the table here? I thought the exit
>>> function would invoke the callbacks when the CPU is no longer idle.
>>> A callback wouldn't make the judgment about whether a CPU is idle
>>> or not...right? 
>>> 
>>> 
>>>>> Will the CPU then just enter C1 idle without calling callbacks
>>>>> again?  The lower priority callbacks will not get invoked in this
>>>>> case. 
>>>>> 
>>>>> 
>>>> Yes, currently lower priority callbacks will not get invoked.
>>>> 
>>>> 
>>>> 
>>>>> The CPU needs to eventually at least enter the C1 idle state.
>>>>> 
>>>>> 
>>>> Great thanks to your reminder. This is a question not covered by
>>>> current design, I need to change it.
>>>> 
>>>> 
>>>> 
>>> Thanks,
>>> -Eric
>>> 
>>> 
> ---------------------------------------------------------------
> ---------
>>> 
>>> _______________________________________________
>>> tesla-dev mailing list
>>> tesla-dev at opensolaris.org
>>> http://mail.opensolaris.org/mailman/listinfo/tesla-dev

[tesla-dev] Request for comment for PSARC onapager about CPU idle nofication interface

Reply via email to