Eric Saxe wrote:
> Thanks Aubrey. I'll take a look. PAD isn't trying to xcall other CPUs,
> but perhaps the cpupm driver is doing so to honor the dependencies...
>
The speedstep_power() function always calls xc_call(), even when it is
not necessary, that is, even when the thread is already executing on
the target CPU. I'll fix this.
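
Roughly, the fix would add a guard before the cross call, something
like the sketch below (illustrative only; cp, set, cpudsp and
req_state are stand-in names, and the actual change may differ):
=================================
	/*
	 * If we're already executing on the target CPU, invoke the
	 * transition handler directly instead of cross calling.
	 */
	if (CPU->cpu_id == cp->cpu_id) {
		speedstep_pstate_transition(&ret, cpudsp, req_state);
	} else {
		CPUSET_ONLY(set, cp->cpu_id);
		xc_call((xc_arg_t)&ret, (xc_arg_t)cpudsp,
		    (xc_arg_t)req_state, X_CALL_HIPRI, set,
		    (xc_func_t)speedstep_pstate_transition);
	}
=================================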
Mark
> I see what you mean now. :)
>
> Thanks,
> -Eric
>
> On Aug 15, 2008, at 7:47 PM, Aubrey Li <aubreylee at gmail.com> wrote:
>
>
>> Hi Eric,
>>
>> Eric Saxe wrote:
>>
>>
>>>> What event will drive p-state transitions?
>>>>
>>>>
>>> I think that's a good policy question. The current code kicks the
>>> p-state domain to P0 when some non-idle thread begins to run on a
>>> CPU in the domain, and then goes back down to the slowest P-state
>>> when the last CPU in a formerly busy domain goes idle again.
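>>>
>>> In sketch form, the triggers look something like this (the names
>>> here are illustrative, not the literal cpupm code):
>>> =================================
>>> 	switch (event) {
>>> 	case TRANSITION_IDLE_TO_BUSY:
>>> 		/* First busy CPU in the domain: kick it up to P0. */
>>> 		if (dom->busy_cpus++ == 0)
>>> 			cpupm_change_state(cp, dom->fastest_pstate);
>>> 		break;
>>> 	case TRANSITION_BUSY_TO_IDLE:
>>> 		/* Last busy CPU went idle: drop to the slowest P-state. */
>>> 		if (--dom->busy_cpus == 0)
>>> 			cpupm_change_state(cp, dom->slowest_pstate);
>>> 		break;
>>> 	}
>>> =================================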
>>> Like I was saying in my previous mail, those particular event
>>> triggers may cause too many state transitions (and therefore too
>>> much overhead) to be worthwhile.
>>>
>>>> The current one in PAD will cause
>>>> high CPU utilization even if the system is idle.
>>>>
>>>>
>>> Why is that, Aubrey?
>>>
>> After building PAD-gate and booting the kernel, I got the following
>> report. As you can see, percent system time is around 70%:
>>
>> aubrey at aubrey-nhm:~$ mpstat 5
>> CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
>>   0   23   0   15   212   53  100    0   15   10    0   185    0  75   0  25
>>   1   27   0   16    23    3  121    0   16    7    0   216    0  62   0  38
>>   2   24   0   28    30   12  140    0   15   10    0   279    0  67   0  33
>>   3   25   0   16   411  393   89    0   14    7    0   180    0  72   0  28
>>   4   22   0    9    19    4   60    0   11    5    0   123    0  63   0  36
>>   5   18   0   13    16    1   52    0   10    5    0    84    0  56   0  43
>>   6   14   0   11    16    3   78    0    9    4    0    99    0  57   0  43
>>   7   12   0   11    13    0   80    0   10    6    0   197    0  55   0  45
>>
>> I tracked this down, and the hotspot is as follows:
>>
>> unix`lock_set_spl_spin+0xc9
>> unix`mutex_vector_enter+0x4c6
>> unix`xc_do_call+0x120
>> unix`xc_call+0x4b
>> cpudrv`speedstep_power+0x99
>> cpudrv`cpudrv_pm_change_state+0x42
>> unix`cpupm_plat_change_state+0x3d
>> unix`cpupm_change_state+0x26
>> unix`cpupm_utilization_change+0x44
>> unix`cmt_ev_thread_swtch_pwr+0x7a
>> unix`pg_ev_thread_swtch+0x56
>> unix`swtch+0x17c
>>
>> Did I miss anything?
>>
>>
>>>> As far as I know, the two existing methods both rely on polling:
>>>>
>>>> 1) The hardware feedback mechanism provided by APERF/MPERF.
>>>> 2) A software mechanism that checks whether the idle time within a
>>>> time window exceeds a threshold.
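>>>>
>>>> For (1), the idea is roughly this (a sketch; the MSR numbers are
>>>> the IA32_MPERF/IA32_APERF registers documented by Intel, and
>>>> rdmsr() is the kernel MSR accessor):
>>>> =================================
>>>> #define	MSR_IA32_MPERF	0xE7	/* ticks at the max frequency */
>>>> #define	MSR_IA32_APERF	0xE8	/* ticks at the actual frequency */
>>>>
>>>> 	uint64_t mperf = rdmsr(MSR_IA32_MPERF);
>>>> 	uint64_t aperf = rdmsr(MSR_IA32_APERF);
>>>> 	/*
>>>> 	 * Sampled over an interval, delta(aperf)/delta(mperf) gives
>>>> 	 * the ratio of delivered to maximum frequency, which a
>>>> 	 * governor can use as utilization feedback.
>>>> 	 */
>>>> =================================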
>>>>
>>>> What's the problem with checking periodically?
>>>>
>>> As long as it's not done too often, the overhead won't be high (in
>>> terms of performance), but my concern is that as we start taking
>>> advantage of deeper c-states it could become more costly. Going
>>> down the road of eliminating polling in the system seems good,
>>> because otherwise we would be undermining our tickless efforts.
>>>
>>> With polling, there's also a lag between the time the CPU
>>> utilization changes and the time we notice and change the power
>>> state. This means that at times we're running a thread on a
>>> clocked-down CPU, which is poor for performance...or the CPU is
>>> idle but running flat out (which, as Mark pointed out, could be OK
>>> from a "race to C-state" perspective). If it's event driven, we
>>> know precisely when utilization has changed...and so if the state
>>> transitions are cheap enough, why not just make them then?
>>>
>>>
>> P-state transitions can't be cheap; besides the xcalls, the P-state
>> driver has to poll until the switch is complete.
>> =================================
>> /*
>>  * Intel docs indicate that maximum latency of P-state changes should
>>  * be on the order of 10mS. When waiting, wait in 100uS increments.
>>  */
>> #define	ESS_MAX_LATENCY_MICROSECS	10000
>> #define	ESS_LATENCY_WAIT		100
>>
>> void
>> speedstep_pstate_transition(int *ret, cpudrv_devstate_t *cpudsp,
>>     uint32_t req_state)
>> {
>> 	... /* setup of handle, req_pstate, stat and i elided */
>>
>> 	/* Wait until switch is complete, but bound the loop just in case. */
>> 	for (i = 0; i < ESS_MAX_LATENCY_MICROSECS; i += ESS_LATENCY_WAIT) {
>> 		if (read_status(handle, &stat) == 0 &&
>> 		    CPU_ACPI_STAT(req_pstate) == stat)
>> 			break;
>> 		drv_usecwait(ESS_LATENCY_WAIT);
>> 	}
>> }
>> =================================
>> This could be improved by using the latency parameter from the ACPI
>> table (see the sketch below), but if you put this in the code path
>> of swtch(), I believe it's still a big problem.
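>>
>> Something like this, for example (hypothetical; the accessor name
>> for the _PSS transition latency field is made up here):
>> =================================
>> 	/* Bound the wait by the latency the ACPI _PSS entry reports. */
>> 	uint32_t max_wait = CPU_ACPI_PSTATE_TRANSLAT(req_pstate);
>>
>> 	for (i = 0; i < max_wait; i += ESS_LATENCY_WAIT) {
>> 		if (read_status(handle, &stat) == 0 &&
>> 		    CPU_ACPI_STAT(req_pstate) == stat)
>> 			break;
>> 		drv_usecwait(ESS_LATENCY_WAIT);
>> 	}
>> =================================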
>>
>> Thanks,
>> -Aubrey