Hi Eric,
Eric Saxe wrote:
>> What event will drive p-state transition?
>>
> I think that's a good policy question. The current code kicks the
> p-state domain to P0 when some non-idle thread begins to run on a CPU in
> the domain, and then goes back down to the slowest P-state when the last
> CPU in a formerly busy domain goes idle again.
> Like I was saying in my previous mail, those particular event triggers
> may cause too many state transitions (and therefore overhead) to be
> worthwhile.
>> The current one in PAD will cause
>> high CPU utilization even if the system is idle.
>>
> Why is that Aubrey?
After building the PAD gate and booting the kernel, I got the following
mpstat report; as you can see, percent system time ("sys") is around 70%:
aubrey at aubrey-nhm:~$ mpstat 5
CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys wt idl
0 23 0 15 212 53 100 0 15 10 0 185 0 75 0 25
1 27 0 16 23 3 121 0 16 7 0 216 0 62 0 38
2 24 0 28 30 12 140 0 15 10 0 279 0 67 0 33
3 25 0 16 411 393 89 0 14 7 0 180 0 72 0 28
4 22 0 9 19 4 60 0 11 5 0 123 0 63 0 36
5 18 0 13 16 1 52 0 10 5 0 84 0 56 0 43
6 14 0 11 16 3 78 0 9 4 0 99 0 57 0 43
7 12 0 11 13 0 80 0 10 6 0 197 0 55 0 45
I tracked this down; the hot spot is the following call path:
unix`lock_set_spl_spin+0xc9
unix`mutex_vector_enter+0x4c6
unix`xc_do_call+0x120
unix`xc_call+0x4b
cpudrv`speedstep_power+0x99
cpudrv`cpudrv_pm_change_state+0x42
unix`cpupm_plat_change_state+0x3d
unix`cpupm_change_state+0x26
unix`cpupm_utilization_change+0x44
unix`cmt_ev_thread_swtch_pwr+0x7a
unix`pg_ev_thread_swtch+0x56
unix`swtch+0x17c
Did I miss anything?
>
>> As far as I know, both known methods are based on polling.
>>
>> 1) The hardware feedback mechanism provided by APERF/MPERF.
>> 2) The software mechanism, which checks whether idle time over a
>> time window exceeds a threshold.
>>
>> What's the problem with periodically checking?
> As long as it's not done too often, the overhead won't be high (in
> terms of performance), but my concern is that as we start taking
> advantage of deeper c-states it could become more costly. Going down the
> road of eliminating polling in the system seems good because otherwise
> we would be undermining our tickless efforts.
>
> With polling, there's also a lag between the time the CPU utilization
> changes, and the time that we notice and change the power state. This
> means that at times we're running a thread on a clocked-down CPU, which
> is poor for performance... or the CPU is idle, but running flat out
> (which as Mark pointed out could be ok from a "race to C-state"
> perspective). If it's event driven, we can know precisely when
> utilization has changed... and so if the state transitions are cheap
> enough, why not just make them then?
>
P-state transitions can't be cheap: besides the xcalls, the P-state driver
has to poll until the switch is complete.
=================================
/*
 * Intel docs indicate that maximum latency of P-state changes should
 * be on the order of 10mS. When waiting, wait in 100uS increments.
 */
#define	ESS_MAX_LATENCY_MICROSECS	10000
#define	ESS_LATENCY_WAIT		100

void
speedstep_pstate_transition(int *ret, cpudrv_devstate_t *cpudsp,
    uint32_t req_state)
{
	uint32_t	stat;
	int		i;

	/* ... lookup of handle/req_pstate and the actual MSR write elided ... */

	/* Wait until switch is complete, but bound the loop just in case. */
	for (i = 0; i < ESS_MAX_LATENCY_MICROSECS; i += ESS_LATENCY_WAIT) {
		if (read_status(handle, &stat) == 0 &&
		    CPU_ACPI_STAT(req_pstate) == stat)
			break;
		drv_usecwait(ESS_LATENCY_WAIT);
	}
}
=================================
This could be improved by checking the transition latency reported in the
ACPI tables, but if you put this in the code path of swtch(), I believe
it's still a big problem.
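For example, something along these lines, where the wait is bounded by the
per-state transition latency that ACPI reports in _PSS (in microseconds)
instead of the fixed 10ms ceiling. This is only a sketch of what I mean;
pstate_latency_us() and read_status() are stand-ins, not existing interfaces:
=================================
/*
 * Sketch only: bound the completion wait by the ACPI-reported
 * transition latency for the requested P-state rather than a fixed
 * 10ms ceiling.  Returns 0 if the requested state was observed
 * before the latency bound expired, -1 otherwise.
 */
#include <stdint.h>
#include <unistd.h>

#define	LATENCY_WAIT_US	100		/* poll granularity */

extern int read_status(void *handle, uint32_t *stat);	/* stand-in */
extern uint32_t pstate_latency_us(uint32_t pstate);	/* from _PSS */

int
wait_for_pstate(void *handle, uint32_t req_state, uint32_t req_stat)
{
	uint32_t bound = pstate_latency_us(req_state);
	uint32_t stat, i;

	for (i = 0; i < bound; i += LATENCY_WAIT_US) {
		if (read_status(handle, &stat) == 0 && stat == req_stat)
			return (0);
		(void) usleep(LATENCY_WAIT_US);
	}
	return (-1);
}
=================================
Even with a tighter bound taken from _PSS, it's still a busy-wait, and
swtch() is much too hot a path for that.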
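On a side note, about method 1) I listed above: the hardware feedback is
just a pair of free-running MSRs, IA32_MPERF (0xE7) and IA32_APERF (0xE8),
that a governor samples periodically. A rough sketch of the bookkeeping
(rdmsr() here is a stand-in for however the kernel reads an MSR on the
target CPU):
=================================
/*
 * Sketch of the APERF/MPERF feedback.  Both MSRs count only in C0;
 * APERF counts at the actual frequency and MPERF at the maximum
 * non-turbo frequency, so over a sampling window
 * delta(APERF)/delta(MPERF) is the average effective frequency as a
 * fraction of nominal.
 */
#include <stdint.h>

#define	MSR_IA32_MPERF	0xE7
#define	MSR_IA32_APERF	0xE8

extern uint64_t rdmsr(uint32_t msr);	/* stand-in */

typedef struct perf_sample {
	uint64_t	aperf;
	uint64_t	mperf;
} perf_sample_t;

void
perf_sample_take(perf_sample_t *s)
{
	s->mperf = rdmsr(MSR_IA32_MPERF);
	s->aperf = rdmsr(MSR_IA32_APERF);
}

/*
 * Average frequency over the window [prev, cur] as a percentage of
 * the maximum non-turbo frequency; ~100 means the CPU ran flat out
 * while it was in C0.
 */
uint32_t
perf_sample_ratio(const perf_sample_t *prev, const perf_sample_t *cur)
{
	uint64_t da = cur->aperf - prev->aperf;
	uint64_t dm = cur->mperf - prev->mperf;

	if (dm == 0)
		return (0);
	return ((uint32_t)((da * 100) / dm));
}
=================================
Of course this still has the sampling lag Eric mentions; it just keeps the
cost out of every context switch.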
Thanks,
-Aubrey