Li, Aubrey <> wrote:
> < Forward to mailing list to request more comments >
> 
> Bill.Holler wrote:
> 
>> Hi Aubrey,
>> 
>> This is in sync with what we were thinking here.  Please see inline.
>> 
>> 
>> Li, Aubrey wrote:
>>> 
>>> I'm investigating how to re-initialize the APIC timer after waking up
>>> from a deep C-state. I obviously haven't finished, so I need your
>>> suggestions and comments. :-)
>>> 
>>> The APIC timer is used by the cyclic subsystem to provide per-CPU
>>> interval timers and interrupts. My initial thoughts are as follows;
>>> please correct me.
>>> 
>>> 1) Before entering a deep C-state, we read the current-count register
>>> of the APIC timer. Since the APIC timer generates an interrupt after
>>> counting down to ZERO, we know exactly when the interrupt will happen
>>> on this CPU.
>>> 
>> 
>> The top cyclic_t structure on this cpu's cyclic heap has the
>> hrtime_t when it should expire.  We can either read the cpu's
>> APIC timer register directly and compute the expiration time,
>> or we can use the hrtime_t from the top cyclic on this cpu's
>> cyclic heap.
>> 
>> Is the local APIC fast to read?  Is it faster than a memory read?
>> Using the hrtime_t in the top cyclic_t will take a few memory
>> reads which may not be in the cpu's cache.
>> 
> I considered this issue. Since there must be a time gap between the cbe
> reprogram and the exact point of entering the deep C-state, using the
> count in the APIC timer should be more accurate.

Reading the local APIC timer count may be the better approach.
If we use the next timeout value in the root cyclic node, we first need
to get the current hrtime and compute the delta between now and the next
timeout. If we use the local APIC timer count, we just need to read the
count and convert it to nanoseconds. The latter requires less computation
and memory access.
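To make the conversion concrete, here is a minimal C sketch. The names
are hypothetical stand-ins: "timer_hz" is the calibrated APIC timer
frequency, and the MMIO read of the current-count register itself is not
shown.

```c
#include <stdint.h>

/*
 * Sketch only: convert a local APIC current-count value to the number
 * of nanoseconds until the pending timer interrupt. Assumes the APIC
 * timer input frequency has already been calibrated.
 */
static uint64_t
apic_count_to_ns(uint32_t current_count, uint64_t timer_hz)
{
	/* remaining ticks * (ns per second) / (ticks per second) */
	return (((uint64_t)current_count * 1000000000ULL) / timer_hz);
}
```

With a 100 MHz timer clock, a remaining count of 100,000,000 corresponds
to one second of countdown. The heap-based alternative would instead need
a gethrtime() call plus a subtraction against the root cyclic's
expiration, i.e. strictly more work.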

> 
>> 
>> 
>>> 2) Allocate one HPET timer for this CPU, program the HPET in one-shot
>>> mode, and tell the HPET when it needs to generate an interrupt to
>>> wake up the CPU. It looks like this requires the number of HPET
>>> timers to be equal to or larger than the number of CPUs?
>>> 
>> 
>> 
>> We were thinking software will have to manage a "timer heap"
>> or other data structure to multiplex multiple cpus onto a few HPET
>> timers. 
>> 
>> Most HPETs only have 1 or 2 timers available for this (max 32).  :-(
>> We know there will be Nehalem systems with 8 sockets, 8 cores,
>> and 2 hyper-threads/core = 128 cpus.
>> The HPETs I have used only support I/O APIC routing, and
>> the I/O APIC does not have very many free interrupt wires.
>> We will need to drive multiple cpus from 1 or a few HPET timers.
>> 
> The number of HPET timers is a limit, even if MSI/MSI-X is supported.
> So a data structure makes sense here, to let fewer HPET timers wake
> up more CPUs.
> 
>> 
>> It gets more complicated when thinking about the throughput
>> of the HPET: how fast can an HPET timer interrupt multiple
>> cpus.  An HPET timer may not be fast enough to wakeup a
>> lot of cpus at their desired cyclic fire time.
> 
> The cyclic subsystem takes advantage of the per-CPU APIC timer.
> But when the APIC timer becomes unreliable, this advantage becomes a
> disadvantage. A cyclic that can fire on all online CPUs is of course
> fine for the APIC timer, but for the HPET, if there are not enough
> timers, can we make some CPUs invisible to the cyclic subsystem,
> or just enable it on the BSP?
> 
>> 
>> We have a prototype that can move a cyclic from one cpu to
>> another.  When the other cpu's cyclic fires, it sees it is for
>> the other cpu and sends the correct cpu an interrupt.
>> We could do something like this with the HPET heap if
>> necessary: one HPET interrupt wakes up multiple cpus.  :-)
>> 
> Before one CPU enters a deep C-state, there is no problem moving its
> cyclic to another CPU. But what if its buddy needs to sleep before the
> first CPU is woken up? OK, go to the HPET heap, or move the cyclic
> again. This sounds more complicated. I think I need more time to figure
> out why cyclics have to be supported on all online CPUs.
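To make the "HPET heap" idea concrete, here is a minimal sketch of the
data structure being discussed: a min-heap of per-CPU wakeup deadlines,
where the single shared HPET comparator is always programmed with the
root (earliest) deadline, and the ISR would pop expired entries and IPI
their CPUs. All names here are hypothetical, not existing code.

```c
#include <stdint.h>

#define	MAX_CPUS	128

typedef struct hpet_ent {
	int64_t	deadline;	/* hrtime_t-style nanoseconds */
	int	cpu_id;
} hpet_ent_t;

typedef struct hpet_heap {
	hpet_ent_t	ent[MAX_CPUS];
	int		n;
} hpet_heap_t;

/* Insert a sleeping CPU's wakeup deadline into the shared heap. */
static void
hpet_heap_insert(hpet_heap_t *h, int cpu, int64_t deadline)
{
	int i = h->n++;

	h->ent[i].deadline = deadline;
	h->ent[i].cpu_id = cpu;

	/*
	 * Upheap: the root always holds the earliest wakeup, which is
	 * what the one shared HPET comparator gets programmed with.
	 */
	while (i > 0) {
		int p = (i - 1) / 2;
		if (h->ent[p].deadline <= h->ent[i].deadline)
			break;
		hpet_ent_t t = h->ent[p];
		h->ent[p] = h->ent[i];
		h->ent[i] = t;
		i = p;
	}
}

/* Deadline the HPET comparator should be armed with next. */
static int64_t
hpet_heap_earliest(const hpet_heap_t *h)
{
	return ((h->n > 0) ? h->ent[0].deadline : INT64_MAX);
}
```

The ISR-side pop/downheap and the IPI fan-out are omitted; the point is
only that one comparator plus this structure can cover many CPUs, at the
cost of the throughput concern raised above.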
> 
>> 
>> 
>>> 3) When the CPU is woken up from the deep C-state, we can determine
>>> how long the CPU slept, via tod_get, the PM timer, or another HPET,
>>> for use by the C-state policy.
>>> 
>> 
>> The TSC keeps working on Nehalem.
> 
> Oh, that's right, we should use the TSC.
> But since we will support platforms on which the TSC is not reliable in
> deep C-states, I suggest we use the PM timer to make the code a bit
> easier to read. :-)

tod_get is obviously not an option because of its precision; on x86 it
has 1-second granularity, based on the RTC.
I'm not familiar with the PM timer, but my concern is that no time source
is reliable right after waking up from a deep C-state with an unreliable
TSC, because the current Solaris time subsystem on x86 is based on the
RTC and the TSC. Maybe the only reliable thing is the HPET counter.
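If the HPET main counter is indeed the only trustworthy source across
the sleep, the elapsed-time computation is simple. A sketch, assuming
the tick period (in femtoseconds, as the HPET capabilities register
reports it) has already been read out; the counter reads themselves are
not shown.

```c
#include <stdint.h>

/*
 * Sketch: elapsed nanoseconds between two HPET main-counter reads
 * taken before entering and after leaving the deep C-state.
 * "period_fs" is the counter tick period in femtoseconds, as
 * advertised by the HPET general capabilities register.
 */
static uint64_t
hpet_delta_to_ns(uint64_t before, uint64_t after, uint64_t period_fs)
{
	/* 1 ns == 1,000,000 fs */
	return (((after - before) * period_fs) / 1000000ULL);
}
```

Note the multiply can overflow 64 bits for very long sleeps; a real
implementation would split the computation, but for C-state residencies
this simple form is illustrative.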
 
>> 
>> 
>>> 4) The first thing after the CPU wakes up is to disable the local
>>> APIC timer (by calling cbe_disable(), maybe), since it's unreliable
>>> now. Let's wait for the desired interrupt from the HPET.
>>> 
>> 
>> We could set the local APIC's timer to APIC_MAXVAL before
>> entering deep c-state?  This effectively disables the APIC's timer.
>> The APIC's timer could be set to APIC_MAXVAL either
>> before or after entering deep c-state.
>> 
>> Spurious cbe_fire() calls do not hurt anything other than performance,
>> assuming the interrupt does not "pin" the processor.
> Yeah, good idea; programming APIC_MAXVAL is better.
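For illustration, the quiesce step is just one register write. Here the
initial-count register is mocked as a plain variable, since the real
access is an MMIO write through hardware-specific accessors; only the
APIC_MAXVAL value matches the discussion above.

```c
#include <stdint.h>

#define	APIC_MAXVAL	0xffffffffU

/* Mock of the APIC initial-count register, for illustration only. */
static uint32_t apic_initial_count;

/*
 * Push the countdown as far out as possible before entering the deep
 * C-state. Any interrupt that still slips through is just a spurious
 * cbe_fire(), which is harmless apart from the wasted wakeup.
 */
static void
apic_timer_quiesce(void)
{
	apic_initial_count = APIC_MAXVAL;
}
```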
> 
>> 
>>> 5) In the HPET ISR, we call cbe_fire first, so that the root of the
>>> cyclic heap is handled and a downheap operation is performed. These
>>> operations repeat until the root cyclic has an expiration time in the
>>> future. Then we take the expiration time from the root cyclic to
>>> re-program the APIC timer; of course we need to re-enable it.
>>> 
>> 
>> Yes.  If the cpu woke up from an HPET interrupt, then
>> cbe_fire() will reprogram the APIC timer for us.  :-)
>> 
>> If the cpu did not wake up due to an HPET interrupt, then
>> we can directly program the APIC with proper countdown
>> time for the next cyclic.
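For the non-HPET wakeup path, computing the new countdown is the inverse
of the read-out in 1): take the delta to the next cyclic expiration,
convert it to timer ticks, and clamp it to the 32-bit count register. A
sketch with hypothetical names; "timer_hz" is again the calibrated APIC
timer frequency.

```c
#include <stdint.h>

/*
 * Sketch: value to program into the APIC initial-count register so the
 * next cyclic fires on time. Times are hrtime_t-style nanoseconds.
 */
static uint32_t
apic_countdown_ticks(int64_t next_expire, int64_t now, uint64_t timer_hz)
{
	int64_t delta = next_expire - now;
	uint64_t ticks;

	if (delta < 1)
		delta = 1;		/* already due: fire ASAP */
	ticks = ((uint64_t)delta * timer_hz) / 1000000000ULL;
	if (ticks == 0)
		ticks = 1;		/* never program a zero count */
	if (ticks > 0xffffffffULL)
		ticks = 0xffffffffULL;	/* clamp to 32-bit register */
	return ((uint32_t)ticks);
}
```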
>> 
>>> 6) We can release this HPET timer.
>>> 
>>> 7) Everything should be recovered now, except the TSC.
>>> 
>>> The TSC is not usable on Montevina, which makes debugging difficult.
>>> I hope I can get a Nehalem box soon.
>>> 
>> 
>> I am attempting to write a prototype that will not panic on Penryn/
>> Montevina.  It will not solve the TSC issue, but may mask it
>> enough to do prototype work until deep c-state Nehalems are
>> available.  ;-)
>> 
> That would be great. Let me try to write something in cpu_acpi_idle to
> interface with the HPET.
> 
> Thanks,
> -Aubrey
> _______________________________________________
> tesla-dev mailing list
> tesla-dev at opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/tesla-dev
