Thanks Aubrey. I'll take a look. PAD isn't trying to xcall other CPUs,  
but perhaps the cpupm driver is doing so to honor the dependencies...

I see what you mean now. :)

Thanks,
-Eric

On Aug 15, 2008, at 7:47 PM, Aubrey Li <aubreylee at gmail.com> wrote:

> Hi Eric,
>
> Eric Saxe  wrote:
>
>>> What event will drive p-state transition?
>>>
>> I think that's a good policy question. The current code kicks the
>> p-state domain to P0 when some non-idle thread begins to run on a
>> CPU in the domain, and then goes back down to the slowest P-state
>> when the last CPU in a formerly busy domain goes idle again.
>> Like I was saying in my previous mail, those particular event
>> triggers may cause too many state transitions (and therefore too
>> much overhead) to be worthwhile.
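>>
>> (Schematically, those triggers amount to something like the sketch
>> below: purely illustrative, with hypothetical names, not the actual
>> cpupm code.)
>>
>> typedef struct pstate_domain {
>>         int busy_cpus;   /* CPUs in the domain running non-idle work */
>>         int cur_pstate;  /* current P-state; 0 is the fastest */
>>         int slowest;     /* deepest (slowest) P-state index */
>> } pstate_domain_t;
>>
>> static void
>> set_pstate(pstate_domain_t *d, int p)
>> {
>>         d->cur_pstate = p;      /* stand-in for the real driver call */
>> }
>>
>> /* Called when a CPU in the domain goes non-idle or idle. */
>> void
>> domain_switch_event(pstate_domain_t *d, int cpu_now_busy)
>> {
>>         if (cpu_now_busy && d->busy_cpus++ == 0)
>>                 set_pstate(d, 0);            /* first busy CPU: kick to P0 */
>>         else if (!cpu_now_busy && --d->busy_cpus == 0)
>>                 set_pstate(d, d->slowest);   /* last CPU idle: drop back down */
>> }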
>>> The current one in PAD will cause
>>> high CPU utilization even if the system is idle.
>>>
>> Why is that Aubrey?
>
> After building PAD-gate and booting the kernel up,
> I got the following report; as you can see, percent system time is ~70%:
>
> aubrey at aubrey-nhm:~$ mpstat 5
> CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
>   0   23   0   15   212   53  100    0   15   10    0   185    0  75   0  25
>   1   27   0   16    23    3  121    0   16    7    0   216    0  62   0  38
>   2   24   0   28    30   12  140    0   15   10    0   279    0  67   0  33
>   3   25   0   16   411  393   89    0   14    7    0   180    0  72   0  28
>   4   22   0    9    19    4   60    0   11    5    0   123    0  63   0  36
>   5   18   0   13    16    1   52    0   10    5    0    84    0  56   0  43
>   6   14   0   11    16    3   78    0    9    4    0    99    0  57   0  43
>   7   12   0   11    13    0   80    0   10    6    0   197    0  55   0  45
>
> And I tracked this down; the hotspot is as follows:
>
>              unix`lock_set_spl_spin+0xc9
>              unix`mutex_vector_enter+0x4c6
>              unix`xc_do_call+0x120
>              unix`xc_call+0x4b
>              cpudrv`speedstep_power+0x99
>              cpudrv`cpudrv_pm_change_state+0x42
>              unix`cpupm_plat_change_state+0x3d
>              unix`cpupm_change_state+0x26
>              unix`cpupm_utilization_change+0x44
>              unix`cmt_ev_thread_swtch_pwr+0x7a
>              unix`pg_ev_thread_swtch+0x56
>              unix`swtch+0x17c
>
> Did I miss anything?
>
>>
>>> As far as I know, the two known methods are related to polling.
>>>
>>> 1) The hardware feedback mechanism provided by APERF/MPERF (see the
>>> sketch below).
>>> 2) A software mechanism that acts when idle time exceeds a threshold
>>> within a time window.
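>>>
>>> (For reference, the APERF/MPERF feedback in (1) boils down to reading
>>> the two MSRs and comparing their deltas over an interval.  A rough,
>>> illustrative sketch follows; the rdmsr() accessor is only assumed and
>>> none of this is taken from an actual driver.)
>>>
>>> #include <stdint.h>
>>>
>>> #define MSR_MPERF 0xE7  /* counts at a fixed reference rate while in C0 */
>>> #define MSR_APERF 0xE8  /* counts at the actual delivered rate while in C0 */
>>>
>>> extern uint64_t rdmsr(unsigned int msr);    /* assumed MSR accessor */
>>>
>>> struct am_sample { uint64_t aperf, mperf; };
>>>
>>> static struct am_sample
>>> am_read(void)
>>> {
>>>         struct am_sample s;
>>>
>>>         s.aperf = rdmsr(MSR_APERF);
>>>         s.mperf = rdmsr(MSR_MPERF);
>>>         return (s);
>>> }
>>>
>>> /*
>>>  * Average delivered frequency between two samples, in per-mille of
>>>  * the fixed reference frequency.  A polling governor would take a
>>>  * sample per CPU once per window and feed consecutive samples here.
>>>  */
>>> static unsigned int
>>> am_ratio_permille(struct am_sample prev, struct am_sample now)
>>> {
>>>         uint64_t da = now.aperf - prev.aperf;
>>>         uint64_t dm = now.mperf - prev.mperf;
>>>
>>>         return (dm == 0 ? 0 : (unsigned int)((da * 1000) / dm));
>>> }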
>>>
>>> What's the problem with periodically checking?
>> As long as it's not done too often, the overhead won't be high (in
>> terms of performance), but my concern is that as we start taking
>> advantage of deeper c-states it could become more costly. Going down
>> the road of eliminating polling in the system seems good because
>> otherwise we would be undermining our tickless efforts.
>>
>> With polling, there's also a lag between the time the CPU utilization
>> changes and the time that we notice and change the power state. This
>> means that at times we're running a thread on a clocked-down CPU,
>> which is poor for performance...or the CPU is idle but running flat
>> out (which, as Mark pointed out, could be OK from a "race to C-state"
>> perspective). If it's event driven, we know precisely when
>> utilization has changed...and so if the state transitions are cheap
>> enough, why not just make them then?
>>
>
> A P-state transition can't be cheap; besides the xcalls, the P-state
> driver has to poll until the switch is complete.
> =================================
> /*
> * Intel docs indicate that maximum latency of P-state changes should
> * be on the order of 10mS. When waiting, wait in 100uS increments.
> */
> #define ESS_MAX_LATENCY_MICROSECS 10000
> #define ESS_LATENCY_WAIT                100
>
> void
> speedstep_pstate_transition(int *ret, cpudrv_devstate_t *cpudsp,
>    uint32_t req_state)
> {
>
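>        /* ... (declarations and the P-state request itself omitted) ... */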
>        /* Wait until switch is complete, but bound the loop just in case. */
>        for (i = 0; i < ESS_MAX_LATENCY_MICROSECS; i += ESS_LATENCY_WAIT) {
>                if (read_status(handle, &stat) == 0 &&
>                    CPU_ACPI_STAT(req_pstate) == stat)
>                        break;
>                drv_usecwait(ESS_LATENCY_WAIT);
>        }
> }
> =================================
> This can be improved by checking the latency parameter from the ACPI
> table, but if you put this in the code path of swtch(), I believe it's
> still a big problem.
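>
> Roughly, the idea would be to bound the completion wait by the
> transition latency that the ACPI _PSS entry advertises, instead of the
> fixed 10mS worst case.  An illustrative sketch of that variant of the
> loop above (the CPU_ACPI_TRANSLAT() accessor name is hypothetical):
>
>        /* Bound the wait by the _PSS transition latency (microseconds). */
>        uint32_t max_wait_us = CPU_ACPI_TRANSLAT(req_pstate);
>
>        for (i = 0; i < max_wait_us; i += ESS_LATENCY_WAIT) {
>                if (read_status(handle, &stat) == 0 &&
>                    CPU_ACPI_STAT(req_pstate) == stat)
>                        break;
>                drv_usecwait(ESS_LATENCY_WAIT);
>        }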
>
> Thanks,
> -Aubrey
> _______________________________________________
> tesla-dev mailing list
> tesla-dev at opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/tesla-dev
