Thanks Aubrey. I'll take a look. PAD isn't trying to xcall other CPUs,
but perhaps the cpupm driver is doing so to honor the dependencies...
I see what you mean now. :)
Thanks,
-Eric
On Aug 15, 2008, at 7:47 PM, Aubrey Li <aubreylee at gmail.com> wrote:
> Hi Eric,
>
> Eric Saxe wrote:
>
>>> What event will drive p-state transitions?
>>>
>> I think that's a good policy question. The current code kicks the
>> p-state domain to P0 when some non-idle thread begins to run on a
>> CPU in the domain, and then goes back down to the slowest P-state
>> when the last CPU in a formerly busy domain goes idle again.
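>>
>> (For illustration, a minimal sketch of that trigger logic; the
>> pstate_domain_t structure and pstate_domain_set() dispatch are
>> hypothetical stand-ins for the real cpupm code:)
>>
>> typedef struct pstate_domain {
>> 	uint_t		busy_cpus;	/* CPUs in the domain running work */
>> 	uint32_t	slowest;	/* index of the slowest P-state */
>> } pstate_domain_t;
>>
>> extern void pstate_domain_set(pstate_domain_t *, uint32_t);
>>
>> /* a non-idle thread begins to run on a CPU in the domain */
>> void
>> domain_cpu_busy(pstate_domain_t *d)
>> {
>> 	if (d->busy_cpus++ == 0)
>> 		pstate_domain_set(d, 0);	/* P0 is the fastest state */
>> }
>>
>> /* the last busy CPU in the domain goes idle again */
>> void
>> domain_cpu_idle(pstate_domain_t *d)
>> {
>> 	if (--d->busy_cpus == 0)
>> 		pstate_domain_set(d, d->slowest);
>> }
>>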
>> Like I was saying in my previous mail, those particular event
>> triggers may cause too many state transitions (and therefore too
>> much overhead) to be worthwhile.
>>> The current one in PAD will cause
>>> high CPU utilization even if the system is idle.
>>>
>> Why is that, Aubrey?
>
> After building PAD-gate and booting the kernel up,
> I got the following report; as you can see, the percent system time
> is around 70%:
>
> aubrey at aubrey-nhm:~$ mpstat 5
> CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl usr sys  wt idl
>   0   23   0   15   212   53  100    0   15   10    0   185   0  75   0  25
>   1   27   0   16    23    3  121    0   16    7    0   216   0  62   0  38
>   2   24   0   28    30   12  140    0   15   10    0   279   0  67   0  33
>   3   25   0   16   411  393   89    0   14    7    0   180   0  72   0  28
>   4   22   0    9    19    4   60    0   11    5    0   123   0  63   0  36
>   5   18   0   13    16    1   52    0   10    5    0    84   0  56   0  43
>   6   14   0   11    16    3   78    0    9    4    0    99   0  57   0  43
>   7   12   0   11    13    0   80    0   10    6    0   197   0  55   0  45
>
> I tracked this down; the hotspot is as follows:
>
> unix`lock_set_spl_spin+0xc9
> unix`mutex_vector_enter+0x4c6
> unix`xc_do_call+0x120
> unix`xc_call+0x4b
> cpudrv`speedstep_power+0x99
> cpudrv`cpudrv_pm_change_state+0x42
> unix`cpupm_plat_change_state+0x3d
> unix`cpupm_change_state+0x26
> unix`cpupm_utilization_change+0x44
> unix`cmt_ev_thread_swtch_pwr+0x7a
> unix`pg_ev_thread_swtch+0x56
> unix`swtch+0x17c
>
> Did I miss anything?
>
>>
>>> As far as I know, the two known methods are related to polling.
>>>
>>> 1) The hardware feedback mechanism provided by APERF/MPERF
>>>    (sketched below).
>>> 2) The software mechanism: transition if the idle time within a
>>>    time window exceeds the current threshold.
>>>
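>>> (A rough sketch of method 1), assuming the kernel's rdmsr() and the
>>> architectural IA32_MPERF/IA32_APERF MSRs; the function name and the
>>> static bookkeeping here are illustrative only:)
>>>
>>> #define MSR_MPERF	0xE7	/* counts at a fixed reference rate in C0 */
>>> #define MSR_APERF	0xE8	/* counts at the delivered rate in C0 */
>>>
>>> static uint64_t aperf_last, mperf_last;
>>>
>>> /* delivered frequency over the last window, as a percent of max */
>>> uint_t
>>> aperf_mperf_sample(void)
>>> {
>>> 	uint64_t aperf = rdmsr(MSR_APERF);
>>> 	uint64_t mperf = rdmsr(MSR_MPERF);
>>> 	uint64_t da = aperf - aperf_last;
>>> 	uint64_t dm = mperf - mperf_last;
>>>
>>> 	aperf_last = aperf;
>>> 	mperf_last = mperf;
>>> 	return (dm == 0 ? 0 : (uint_t)((da * 100) / dm));
>>> }
>>>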
>>> What's the problem with periodic checking?
>> As long as it's not done too often, the overhead won't be high (in
>> terms of performance), but my concern is that as we start taking
>> advantage of deeper c-states it could become more costly. Going down
>> the road of eliminating polling in the system seems good, because
>> otherwise we would be undermining our tickless efforts.
>>
>> With polling, there's also a lag between the time the CPU utilization
>> changes and the time that we notice and change the power state. This
>> means that at times we're running a thread on a clocked-down CPU,
>> which is poor for performance... or the CPU is idle but running flat
>> out (which, as Mark pointed out, could be OK from a "race to C-state"
>> perspective). If it's event driven, we can know precisely when
>> utilization has changed... and so if the state transitions are cheap
>> enough, why not just make them then?
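>>
>> (To make the polling lag concrete, a hypothetical periodic check in
>> the style of your method 2), reusing the pstate_domain_t sketch from
>> above; any utilization change is only noticed at the next tick, so
>> the reaction lag can be up to one sample interval:)
>>
>> #define IDLE_THRESHOLD_PCT	90	/* illustrative threshold */
>>
>> void
>> governor_tick(pstate_domain_t *d, uint64_t idle_us, uint64_t interval_us)
>> {
>> 	if (idle_us * 100 >= interval_us * IDLE_THRESHOLD_PCT)
>> 		pstate_domain_set(d, d->slowest);	/* mostly idle */
>> 	else
>> 		pstate_domain_set(d, 0);		/* busy: back to P0 */
>> }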
>>
>
> P-state transitions can't be cheap: besides the xcalls, the P-state
> driver has to poll, waiting until the switch is complete.
> =================================
> /*
>  * Intel docs indicate that maximum latency of P-state changes should
>  * be on the order of 10mS. When waiting, wait in 100uS increments.
>  */
> #define	ESS_MAX_LATENCY_MICROSECS	10000
> #define	ESS_LATENCY_WAIT		100
>
> void
> speedstep_pstate_transition(int *ret, cpudrv_devstate_t *cpudsp,
>     uint32_t req_state)
> {
> 	/* ... */
>
> 	/* Wait until switch is complete, but bound the loop just in case. */
> 	for (i = 0; i < ESS_MAX_LATENCY_MICROSECS; i += ESS_LATENCY_WAIT) {
> 		if (read_status(handle, &stat) == 0 &&
> 		    CPU_ACPI_STAT(req_pstate) == stat)
> 			break;
> 		drv_usecwait(ESS_LATENCY_WAIT);
> 	}
> }
> =================================
> This could be improved by checking the latency parameter from the
> ACPI table, but if you put this in the code path of swtch(), I
> believe it's still a big problem.
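>
> (A rough sketch of that improvement; cpu_acpi_translat() is a
> hypothetical accessor for the _PSS transition-latency field, and the
> rest is the wait loop from above:)
>
> uint32_t max_wait = cpu_acpi_translat(handle, req_pstate);
>
> if (max_wait == 0)
> 	max_wait = ESS_MAX_LATENCY_MICROSECS;	/* fall back to 10mS */
>
> for (i = 0; i < max_wait; i += ESS_LATENCY_WAIT) {
> 	if (read_status(handle, &stat) == 0 &&
> 	    CPU_ACPI_STAT(req_pstate) == stat)
> 		break;
> 	drv_usecwait(ESS_LATENCY_WAIT);
> }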
>
> Thanks,
> -Aubrey
> _______________________________________________
> tesla-dev mailing list
> tesla-dev at opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/tesla-dev