Few more comments.

A general comment on some of the enhancements are to keep the interface simple 
for its current use. If we have a real application/need to add extras, we could 
always extend when needed, otherwise we would have feature creep and bloated 
interface that's hardly used.

Since the interface is rather generic, one concern I had was to make sure if we 
can split the IA part and generic parts. So we can combine all the arch 
specific info separarly by having an opaque ptr, and have access functions 
per-arch to get the right data. In case you find need for this in other arch's 
as well. (esp c-state, min/max latencies expected etc can be abstracted to a 
high level if that makes sense)

>-----Original Message-----
>From: Liu, Jiang
>Sent: Wednesday, November 12, 2008 9:42 PM
>To: Bill.Holler at Sun.COM; tesla-dev at opensolaris.org
>Cc: Randall Fishel; Eric Saxe; Raj, Ashok
>Subject: RE: [tesla-dev] Request for comment for PSARC onapager about CPU idle 
>nofication interface
>
>Hi Bill,
>       Thanks for your great suggestions.
>       I'll update the onepager according to your suggestions and repost it 
> after getting enough
>feedbacks.
>
>Bill.Holler at Sun.COM <mailto:Bill.Holler at Sun.COM> wrote:
>> Hi Liu Jiang,
>>
>> This proposal looks great!
>>
>> Here are some suggestions:
>>
>> 1) This proposal does not specify the context the callbacks will
>> execute in. Presumably idle callbacks will run in the idle-thread
>> context.   This proposal should state this.
>> Idle threads cannot block.  Idle threads also cannot spin when the
>> owner of the synchronization object expects to be able to wakeup
>> the CPU from idle.
>For cpu_idle_enter_cbfn, it will be called only in idle thread context.
>For cpu_idle_exit_cbfn, it may be called in idle thread context or interrupt 
>context which depends on
>the hardware mechanism(instructions) used to put CPU into low power state. For 
>example,
>cpu_idle_exit_cbfn will be called in interrupt context if OS uses hlt 
>instruction to put CPU into C1
>state on x86. On the other hand, it will be called in idle thread context if 
>OS uses mwait to put CPU
>into low power state with interrupt disabled.
>So there's a flag to indicate whether cpu_idle_exit_cbfn is called in 
>interrupt context.
>

Gerry, this is one area if we could explain what the args mean and put those 
constants will clarify the intent properly. Bill has a good point, we should 
put callback restrictions on what they can do in the callback to make it very 
clear.

>>
>> 2) Clients may only be interested in a sub-set of c-states.
>> cpu_idle_register_callback() could take another argument
>> that specified which c-state(s) the callback should be invoked.
>> I do not know if clients will want this option.
>>
>> CPUs will likely only enter C1 idle when they do not expect to be
>> idle very long on C-State enabled kernels.
>I don't know whether this is a common requirement too.
>How about to let the callback filter out C states not interested at?
>

Does the C-state subsystem provide latencies from ACPI as well? If this is 
easily available, the min latencies can also be obtained by caller knowing 
which c-state entry is requested.

>>
>> 3) A "max idle time" argument could be added to cpu_idle_enter_cbfn().
>> As part of the C-State project CPUs now read their LAPIC Current
>> Count register to see when the next Local APIC Timer interrupt
>> will fire.  Would FIPE be interested in this max value?
>In the original design and implementation, we do have a parameter named 'hint' 
>to pass "predicted max
>idle time" to cpu_idle_enter_cbfn.
>I have removed it from the onepager due to following reasons,
>1) parameter cstate roughly has the same info.
>2) There are arguments whether we should extend parameter 'hint' as a 
>structure to contain more info
>other than "predicted max idle time" for extensibility.
>3) Our code is based on onnv_92 which has no support for deeper C state and 
>tickless timer yet. I have
>no idea about the tickless timer interface at that time.
>If tickless timer or C-State project provides some interface to query next 
>timer expiration time, we
>could pass the predicted max idle time to callbacks.

In above cases, lets pass real values, instead of calling them hint, so it can 
make sense in a generic way.

>
>>
>> 4) Clients may not want their callback invoked if the CPU will not
>> be idle longer than a specified time due to the LAPIC Current Count
>> register.  Do you want cpu_idle_register_callback() to be able to
>> specify a minimum "max idle-time threshold" argument?  The callbeck
>> would not be called if the CPU knows it will not be idle that long.
>How about to let callbacks make such a decision to simplify the framework? Or 
>any concern about
>performance issues?
>

Easy if clients could filter if the data can be provided by tickless subsystem.?

>>
>> 5) Should any of the FIPE "system idleness" calculations be part
>> of the core kernel?  Other systems may be interested in this
>> information.
>We may optionally expose the idleness prediction algorithm if it's useful to 
>other componets.
>

I would leave it internal to FIPE and expose when we have a real need instead 
of speculatively.

>>
>> 6) What does this sentence in 4.2.4 mean?
>> "If it returns non-zero value, CPU should cancel entering idle state"?
>> Does this mean "enter ACPI C1 instead of a deeper C-State"?
>> Does FIPE need this?
>>
>> Can this functionality be removed?
>> CPUs have to at least enter C1 when idle.  There can be large
>> performance penalties if CPUs never halt in their idle loop....
>>
>Currently FIPE doesn't make use of this feature. It's here just for future 
>extensibility.
>According to your questions about C1, it would be better to remove this 
>feature.
>How about the idea to demote deep C state instead of cancel entering idle 
>state according to callback
>return value?
>For example, cpu_idle_enter_cbfn is called with an indicator to enter C3, it 
>may optionally argue
>about that by returning a indicator to suggest entering C1 instead of C3.
>By adopting sucn a change, all questions relative to C1 could be solved.

I don't know the return value makes any sense? Why not just keep them void. The 
callbacks shold have no veto power really, they should ideally use them only as 
notifications to process additional things, but never should veto the entry.

>
>> 7) How will rollback be handled when a callback cancels the idle
>> after other callbacks have already been called?
>>
>> Will the cpu_idle_exit_cbfn() functions be called for all
>> callbacks that have already been invoked?
>Yes, currently cpu_idle_exit_cbfn will be called with a flag indicating 
>cancellation for all callbacks
>that have already been invoked.
>
>>
>> Will the CPU then just enter C1 idle without calling callbacks
>> again?  The lower priority callbacks will not get invoked in this
>> case.
>Yes, currently lower priority callbacks will not get invoked.
>
>>
>> The CPU needs to eventually at least enter the C1 idle state.
>Great thanks to your reminder. This is a question not covered by current 
>design, I need to change it.
>
>>
>> Best regards,
>> Bill
>>
>>
>> On 11/12/08 06:25, Liu, Jiang wrote:
>>> Hi all,
>>>     We're planning to develop a memory power management driver for some
>>> platforms which need to add new interface to current OpenSolaris
>>> kernel. I have written a PSARC onepager  for the new interfaces,
>>> please help to comment on the interface design. The attachment is
>>> information about the memory power management driver for your
>>> reference.  Thanks!
>>>
>>> Template Version: @(#)onepager.txt 1.35 07/11/07 SMI
>>>
>>> 1. Introduction
>>>     1.1. Project/Component Working Name:
>>>     CPU idle notification interface
>>>
>>>     1.2. Name of Document Author/Supplier:
>>>     Author: Gerry Liu <jiang.liu at intel.com>
>>>
>>>     1.3. Date of This Document:
>>>     October 21, 2008
>>>
>>> 4. Technical Description:
>>>     4.1. Problem
>>>     A CPU idle state change notification mechanism is needed to signal
>>>     other components which are interesting in the state change events
>>>     when CPU enters/leaves idle state. This mechanism could be used by
>>>     following components: A) Memory power saving driver
>>>     B) Lazy TLB flush on x86 system
>>>     C) CPU power management framework
>>>
>>>     4.2. Proposal
>>>     We propose to add following data structures/interfaces to
>>> OpenSolaris kernel.
>>>
>>>     4.2.1. CPU idle state change notification callback data
>>>     structures typedef int  (*cpu_idle_enter_cbfn)(void *arg, int
>>>     flags,      int cstate); Callback function prototype for entering
>>> idle state notification.
>>>
>>>     typedef void (*cpu_idle_exit_cbfn)(void *arg, int flags);
>>>     Callback function prototype for leaving idle state notification.
>>>
>>>     typedef struct cpu_idle_callback {
>>>             int                     version;
>>>             cpu_idle_enter_cbfn     idle_enter;
>>>             cpu_idle_exit_cbfn      idle_exit;
>>>     } cpu_idle_callback_t;
>>>     Data structure for CPU idle state change notification callback.
>>>
>>>     4.2.2. Register CPU idle state change notification callback
>>>     int cpu_idle_register_callback(uint_t priority,
>>>         cpu_idle_callback_t *callbackp, void *arg);
>>>     This interface registers a callback to be called when CPU idle state
>>>     changes. All registered callbacks will be called in priority order
>>>     from high to low when CPU enters idle state and will be called in
>>>     the reverse order when CPU leaves idle state.
>>>
>>>     4.2.3. Unregister CPU idle callback interface
>>>     int cpu_idle_unregister_callback(uint_t priority,
>>>         cpu_idle_callback_t *callbackp, void *arg);
>>>     This interface deregisters a registered callback.
>>>
>>>     4.2.4. Notify entering idle state
>>>     int cpu_idle_enter(int flags, int cstate, int64_t hints);
>>>     This interface signals that a specific CPU is entering idle state.
>>>     If it returns non-zero value, CPU should cancel entering idle state.
>>>
>>>     4.2.5. Signal exiting idle state interface
>>>     void cpu_idle_exit(int flags);
>>>     This interface signals that a specific CPU has left idle state.
>>>
>>>
>>> 6. Resources and Schedule:
>>>     6.4. Steering Committee requested information
>>>     6.4.1. Consolidation C-team Name:
>>>             ON
>>>     6.5. ARC review type: FastTrack
>>>     6.6. ARC Exposure: open
>>>
>>> Liu Jiang (Gerry)
>>> Senior Software Engineer
>>> OpenSolaris, OTC
>>> Tel: (8610)82611515-1643
>>> iNet: 8758-1643
>>> ------------------------------------------------------------------------
>>>
>>> _______________________________________________
>>> tesla-dev mailing list
>>> tesla-dev at opensolaris.org
>>> http://mail.opensolaris.org/mailman/listinfo/tesla-dev

Reply via email to