Eric Saxe wrote:
> Thanks Aubrey. I'll take a look. PAD isn't trying to xcall other CPUs,
> but perhaps the cpupm driver is doing so to honor the dependencies...
>
The speedstep_power() function always calls xc_call(), even when it is
not necessary, that is, even when the thread is already executing on
the target CPU. I'll fix this.
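
Roughly, the fix would add a guard before the cross call, something
like the sketch below (illustrative only; cp, set, cpudsp and
req_state are stand-in names, and the actual change may differ):
=================================
	/*
	 * If we're already executing on the target CPU, invoke the
	 * transition handler directly instead of cross calling.
	 */
	if (CPU->cpu_id == cp->cpu_id) {
		speedstep_pstate_transition(&ret, cpudsp, req_state);
	} else {
		CPUSET_ONLY(set, cp->cpu_id);
		xc_call((xc_arg_t)&ret, (xc_arg_t)cpudsp,
		    (xc_arg_t)req_state, X_CALL_HIPRI, set,
		    (xc_func_t)speedstep_pstate_transition);
	}
=================================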
Mark
> I see what you mean now. :)
>
> Thanks,
> -Eric
>
> On Aug 15, 2008, at 7:47 PM, Aubrey Li <aubreylee at gmail.com> wrote:
>
>
>> Hi Eric,
>>
>> Eric Saxe wrote:
>>
>>
>>>> What event will drive p-state transitions?
>>>>
>>>>
>>> I think that's a good policy question. The current code kicks the
>>> p-state domain to P0 when some non-idle thread begins to run on a
>>> CPU in the domain, and then goes back down to the slowest P-state
>>> when the last CPU in a formerly busy domain goes idle again.
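>>>
>>> In sketch form, the triggers look something like this (the names
>>> here are illustrative, not the literal cpupm code):
>>> =================================
>>> 	switch (event) {
>>> 	case TRANSITION_IDLE_TO_BUSY:
>>> 		/* First busy CPU in the domain: kick it up to P0. */
>>> 		if (dom->busy_cpus++ == 0)
>>> 			cpupm_change_state(cp, dom->fastest_pstate);
>>> 		break;
>>> 	case TRANSITION_BUSY_TO_IDLE:
>>> 		/* Last busy CPU went idle: drop to the slowest P-state. */
>>> 		if (--dom->busy_cpus == 0)
>>> 			cpupm_change_state(cp, dom->slowest_pstate);
>>> 		break;
>>> 	}
>>> =================================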
>>> Like I was saying in my previous mail, those particular event
>>> triggers may cause too many state transitions (and therefore too
>>> much overhead) to be worthwhile.
>>>
>>>> The current one in PAD will cause
>>>> high CPU utilization even if the system is idle.
>>>>
>>>>
>>> Why is that, Aubrey?
>>>
>> After building PAD-gate and booting the kernel, I got the following
>> report. As you can see, percent system time is around 70%:
>>
>> aubrey at aubrey-nhm:~$ mpstat 5
>> CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
>>   0   23   0   15   212   53  100    0   15   10    0   185    0  75   0  25
>>   1   27   0   16    23    3  121    0   16    7    0   216    0  62   0  38
>>   2   24   0   28    30   12  140    0   15   10    0   279    0  67   0  33
>>   3   25   0   16   411  393   89    0   14    7    0   180    0  72   0  28
>>   4   22   0    9    19    4   60    0   11    5    0   123    0  63   0  36
>>   5   18   0   13    16    1   52    0   10    5    0    84    0  56   0  43
>>   6   14   0   11    16    3   78    0    9    4    0    99    0  57   0  43
>>   7   12   0   11    13    0   80    0   10    6    0   197    0  55   0  45
>>
>> I tracked this down, and the hotspot is as follows:
>>
>> unix`lock_set_spl_spin+0xc9
>> unix`mutex_vector_enter+0x4c6
>> unix`xc_do_call+0x120
>> unix`xc_call+0x4b
>> cpudrv`speedstep_power+0x99
>> cpudrv`cpudrv_pm_change_state+0x42
>> unix`cpupm_plat_change_state+0x3d
>> unix`cpupm_change_state+0x26
>> unix`cpupm_utilization_change+0x44
>> unix`cmt_ev_thread_swtch_pwr+0x7a
>> unix`pg_ev_thread_swtch+0x56
>> unix`swtch+0x17c
>>
>> Did I miss anything?
>>
>>
>>>> As far as I know, the two existing methods both rely on polling:
>>>>
>>>> 1) The hardware feedback mechanism provided by APERF/MPERF.
>>>> 2) A software mechanism that checks whether the idle time within a
>>>> time window exceeds a threshold.
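>>>>
>>>> For (1), the idea is roughly this (a sketch; the MSR numbers are
>>>> the IA32_MPERF/IA32_APERF registers documented by Intel, and
>>>> rdmsr() is the kernel MSR accessor):
>>>> =================================
>>>> #define	MSR_IA32_MPERF	0xE7	/* ticks at the max frequency */
>>>> #define	MSR_IA32_APERF	0xE8	/* ticks at the actual frequency */
>>>>
>>>> 	uint64_t mperf = rdmsr(MSR_IA32_MPERF);
>>>> 	uint64_t aperf = rdmsr(MSR_IA32_APERF);
>>>> 	/*
>>>> 	 * Sampled over an interval, delta(aperf)/delta(mperf) gives
>>>> 	 * the ratio of delivered to maximum frequency, which a
>>>> 	 * governor can use as utilization feedback.
>>>> 	 */
>>>> =================================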
>>>>
>>>> What's the problem with checking periodically?
>>>>
>>> As long as it's not done too often, the overhead won't be high (in
>>> terms of performance), but my concern is that as we start taking
>>> advantage of deeper c-states it could become more costly. Going
>>> down the road of eliminating polling in the system seems good,
>>> because otherwise we would be undermining our tickless efforts.
>>>
>>> With polling, there's also a lag between the time the CPU
>>> utilization changes and the time we notice and change the power
>>> state. This means that at times we're running a thread on a
>>> clocked-down CPU, which is poor for performance...or the CPU is
>>> idle but running flat out (which, as Mark pointed out, could be OK
>>> from a "race to C-state" perspective). If it's event driven, we
>>> know precisely when utilization has changed...and so if the state
>>> transitions are cheap enough, why not just make them then?
>>>
>>>
>> P-state transitions can't be cheap; besides the xcalls, the P-state
>> driver has to poll until the switch is complete.
>> =================================
>> /*
>>  * Intel docs indicate that maximum latency of P-state changes should
>>  * be on the order of 10mS. When waiting, wait in 100uS increments.
>>  */
>> #define	ESS_MAX_LATENCY_MICROSECS	10000
>> #define	ESS_LATENCY_WAIT		100
>>
>> void
>> speedstep_pstate_transition(int *ret, cpudrv_devstate_t *cpudsp,
>>     uint32_t req_state)
>> {
>> 	... /* setup of handle, req_pstate, stat and i elided */
>>
>> 	/* Wait until switch is complete, but bound the loop just in case. */
>> 	for (i = 0; i < ESS_MAX_LATENCY_MICROSECS; i += ESS_LATENCY_WAIT) {
>> 		if (read_status(handle, &stat) == 0 &&
>> 		    CPU_ACPI_STAT(req_pstate) == stat)
>> 			break;
>> 		drv_usecwait(ESS_LATENCY_WAIT);
>> 	}
>> }
>> =================================
>> This could be improved by using the latency parameter from the ACPI
>> table (see the sketch below), but if you put this in the code path
>> of swtch(), I believe it's still a big problem.
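>>
>> Something like this, for example (hypothetical; the accessor name
>> for the _PSS transition latency field is made up here):
>> =================================
>> 	/* Bound the wait by the latency the ACPI _PSS entry reports. */
>> 	uint32_t max_wait = CPU_ACPI_PSTATE_TRANSLAT(req_pstate);
>>
>> 	for (i = 0; i < max_wait; i += ESS_LATENCY_WAIT) {
>> 		if (read_status(handle, &stat) == 0 &&
>> 		    CPU_ACPI_STAT(req_pstate) == stat)
>> 			break;
>> 		drv_usecwait(ESS_LATENCY_WAIT);
>> 	}
>> =================================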
>>
>> Thanks,
>> -Aubrey