Hi Eric,
Eric Saxe wrote:
>> What event will drive p-state transition?
>>
> I think that's a good policy question. The current code kicks the
> p-state domain to P0 when some non-idle thread begins to run on a CPU in
> the domain, and then goes back down to the slowest P-state when the last
> CPU in a formerly busy domain goes idle again.
> Like I was saying in my previous mail, those particular event triggers
> may cause too many state transitions (and therefore overhead) to be
> worthwhile.
>> The current one in PAD will cause
>> high CPU utilization even if the system is idle.
>>
> Why is that Aubrey?
After building the PAD gate and booting the kernel, I got the following
mpstat report; as you can see, percent system time ("sys") is around 70%:
aubrey at aubrey-nhm:~$ mpstat 5
CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys wt idl
0 23 0 15 212 53 100 0 15 10 0 185 0 75 0 25
1 27 0 16 23 3 121 0 16 7 0 216 0 62 0 38
2 24 0 28 30 12 140 0 15 10 0 279 0 67 0 33
3 25 0 16 411 393 89 0 14 7 0 180 0 72 0 28
4 22 0 9 19 4 60 0 11 5 0 123 0 63 0 36
5 18 0 13 16 1 52 0 10 5 0 84 0 56 0 43
6 14 0 11 16 3 78 0 9 4 0 99 0 57 0 43
7 12 0 11 13 0 80 0 10 6 0 197 0 55 0 45
I tracked this down; the hot spot is the following call path:
unix`lock_set_spl_spin+0xc9
unix`mutex_vector_enter+0x4c6
unix`xc_do_call+0x120
unix`xc_call+0x4b
cpudrv`speedstep_power+0x99
cpudrv`cpudrv_pm_change_state+0x42
unix`cpupm_plat_change_state+0x3d
unix`cpupm_change_state+0x26
unix`cpupm_utilization_change+0x44
unix`cmt_ev_thread_swtch_pwr+0x7a
unix`pg_ev_thread_swtch+0x56
unix`swtch+0x17c
Did I miss anything?
>
>> As far as I know, both known methods are based on polling.
>>
>> 1) The hardware feedback mechanism provided by APERF/MPERF.
>> 2) The software mechanism, which checks whether idle time over a
>> time window exceeds a threshold.
>>
>> What's the problem with periodically checking?
> As long as it's not done too often, the overhead won't be high (in
> terms of performance), but my concern is that as we start taking
> advantage of deeper c-states it could become more costly. Going down the
> road of eliminating polling in the system seems good because otherwise
> we would be undermining our tickless efforts.
>
> With polling, there's also a lag between the time the CPU utilization
> changes, and the time that we notice and change the power state. This
> means that at times we're running a thread on a clocked-down CPU, which
> is poor for performance... or the CPU is idle, but running flat out
> (which as Mark pointed out could be ok from a "race to C-state"
> perspective). If it's event driven, we can know precisely when
> utilization has changed... and so if the state transitions are cheap
> enough, why not just make them then?
>
P-state transitions can't be cheap: besides the xcalls, the P-state driver
has to poll until the switch is complete.
=================================
/*
 * Intel docs indicate that maximum latency of P-state changes should
 * be on the order of 10mS. When waiting, wait in 100uS increments.
 */
#define	ESS_MAX_LATENCY_MICROSECS	10000
#define	ESS_LATENCY_WAIT		100

void
speedstep_pstate_transition(int *ret, cpudrv_devstate_t *cpudsp,
    uint32_t req_state)
{
	uint32_t	stat;
	int		i;

	/* ... lookup of handle/req_pstate and the actual MSR write elided ... */

	/* Wait until switch is complete, but bound the loop just in case. */
	for (i = 0; i < ESS_MAX_LATENCY_MICROSECS; i += ESS_LATENCY_WAIT) {
		if (read_status(handle, &stat) == 0 &&
		    CPU_ACPI_STAT(req_pstate) == stat)
			break;
		drv_usecwait(ESS_LATENCY_WAIT);
	}
}
=================================
This could be improved by checking the transition latency reported in the
ACPI tables, but if you put this in the code path of swtch(), I believe
it's still a big problem.
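For example, something along these lines, where the wait is bounded by the
per-state transition latency that ACPI reports in _PSS (in microseconds)
instead of the fixed 10ms ceiling. This is only a sketch of what I mean;
pstate_latency_us() and read_status() are stand-ins, not existing interfaces:
=================================
/*
 * Sketch only: bound the completion wait by the ACPI-reported
 * transition latency for the requested P-state rather than a fixed
 * 10ms ceiling.  Returns 0 if the requested state was observed
 * before the latency bound expired, -1 otherwise.
 */
#include <stdint.h>
#include <unistd.h>

#define	LATENCY_WAIT_US	100		/* poll granularity */

extern int read_status(void *handle, uint32_t *stat);	/* stand-in */
extern uint32_t pstate_latency_us(uint32_t pstate);	/* from _PSS */

int
wait_for_pstate(void *handle, uint32_t req_state, uint32_t req_stat)
{
	uint32_t bound = pstate_latency_us(req_state);
	uint32_t stat, i;

	for (i = 0; i < bound; i += LATENCY_WAIT_US) {
		if (read_status(handle, &stat) == 0 && stat == req_stat)
			return (0);
		(void) usleep(LATENCY_WAIT_US);
	}
	return (-1);
}
=================================
Even with a tighter bound taken from _PSS, it's still a busy-wait, and
swtch() is much too hot a path for that.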
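On a side note, about method 1) I listed above: the hardware feedback is
just a pair of free-running MSRs, IA32_MPERF (0xE7) and IA32_APERF (0xE8),
that a governor samples periodically. A rough sketch of the bookkeeping
(rdmsr() here is a stand-in for however the kernel reads an MSR on the
target CPU):
=================================
/*
 * Sketch of the APERF/MPERF feedback.  Both MSRs count only in C0;
 * APERF counts at the actual frequency and MPERF at the maximum
 * non-turbo frequency, so over a sampling window
 * delta(APERF)/delta(MPERF) is the average effective frequency as a
 * fraction of nominal.
 */
#include <stdint.h>

#define	MSR_IA32_MPERF	0xE7
#define	MSR_IA32_APERF	0xE8

extern uint64_t rdmsr(uint32_t msr);	/* stand-in */

typedef struct perf_sample {
	uint64_t	aperf;
	uint64_t	mperf;
} perf_sample_t;

void
perf_sample_take(perf_sample_t *s)
{
	s->mperf = rdmsr(MSR_IA32_MPERF);
	s->aperf = rdmsr(MSR_IA32_APERF);
}

/*
 * Average frequency over the window [prev, cur] as a percentage of
 * the maximum non-turbo frequency; ~100 means the CPU ran flat out
 * while it was in C0.
 */
uint32_t
perf_sample_ratio(const perf_sample_t *prev, const perf_sample_t *cur)
{
	uint64_t da = cur->aperf - prev->aperf;
	uint64_t dm = cur->mperf - prev->mperf;

	if (dm == 0)
		return (0);
	return ((uint32_t)((da * 100) / dm));
}
=================================
Of course this still has the sampling lag Eric mentions; it just keeps the
cost out of every context switch.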
Thanks,
-Aubrey