I have almost finished reorganizing the c-state driver. This is for c-state observation only: the system can't enter deep c-states without Bill's HPET work. Because the cpu driver was restructured recently, the c-state code has to follow it. I'm now requesting a code review.

The webrev for the c-state driver can be found at:
http://cr.opensolaris.org/~aubrey/cstate/

The patch is against onnv_97 (rev 7367). Changes are as follows:

1) A kstat member was added to the cpu_info module.
------------------
$ kstat -m cpu_info | grep supported_max_cstates
        supported_max_cstates           3
------snip------

2) A kstat module named "cpudrv" was added, exporting the c-state latency (us), the method used to enter the c-state (FFH, SIO) and the power (mW). We could add more, such as the number of times each c-state was entered, c-state residency time, etc., for development and observation.
----------------
$ kstat -m cpudrv
module: cpudrv                          instance: 0
name:   c1                              class:    misc
        address_space_id                FFixedHW
        crtime                          24.073615727
        latency                         1
        power                           1000
        snaptime                        262.816570865

module: cpudrv                          instance: 0
name:   c2                              class:    misc
        address_space_id                SystemIO
        crtime                          24.073622285
        latency                         1
        power                           500
        snaptime                        262.81687418

module: cpudrv                          instance: 0
name:   c3                              class:    misc
        address_space_id                SystemIO
        crtime                          24.073627073
        latency                         57
        power                           100
        snaptime                        262.817009506
------snip------

3) C-state info is obtained from the ACPI _CST objects, so we can't do anything if the BIOS doesn't export this object to the OS.

4) Currently we support c-states only on the Nehalem platform; a check was added in the driver to enforce this.

5) In theory there are 3 types of c-state coordination, but the Nehalem platform only supports the HW_ALL type, so c-state domain creation currently supports only this type. The dependency is determined by the core_id.

6) A _CST notification handler was added to handle dynamic changes in the c-state types and their number.

7) The idle thread proc pointer "idle_cpu" has been changed to a per-cpu function pointer, so that we can support a different max c-state in each c-state domain. This has to touch common code, including SPARC; I'd be glad to hear a better idea.

8) During early boot, "cp->idle_cpu" is set to "generic_idle_cpu()" and later to "cpu_idle()" or "cpu_idle_mwait()".
When cpudrv attaches, or when a _CST notification event occurs, if deep c-state (C2 or deeper) support is detected, "cp->idle_cpu" is changed to point to "cpu_acpi_idle()", which supports entering deep c-states. A shadow pointer ("cp->shadow_idle_cpu") saves the old "cp->idle_cpu", so that if the next idle type is C1 we don't need to check whether monitor/mwait is supported; we call "cp->shadow_idle_cpu" to enter C1 directly.

9) The next c-state type is chosen by a prediction algorithm based on the last c-state residency. If the residency time is long enough, we consider promoting to a deeper c-state next time. Conversely, if the time becomes shorter than the current c-state latency, we demote to a shallower c-state next time.

Any suggestions and comments are greatly appreciated!

Thanks,
-Aubrey
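
For readers who want the promotion/demotion rule in item 9 spelled out, here is a minimal sketch. It is not the code in the webrev. The names cstate_desc_t, next_cstate() and PROMOTE_FACTOR are hypothetical, and so is the promotion threshold, since the mail only says the residency must be "large enough". The demotion test (residency shorter than the current state's latency) follows the description above.

```c
#include <stdint.h>

/*
 * Hypothetical per-state descriptor. latency_us corresponds to the
 * "latency" value exported by the cpudrv kstat (microseconds).
 */
typedef struct cstate_desc {
	uint32_t latency_us;
} cstate_desc_t;

/*
 * Assumed tunable: promote only when the last residency is at least
 * this many times the latency of the next deeper state.
 */
#define	PROMOTE_FACTOR	4

/*
 * Pick the c-state to try on the next idle period.
 * states[0] is C1, states[1] is C2, and so on; cur is the index of the
 * state just exited and last_residency_us is how long we stayed in it.
 */
int
next_cstate(const cstate_desc_t *states, int nstates, int cur,
    uint64_t last_residency_us)
{
	/* Demote: we woke up sooner than the current state's latency. */
	if (cur > 0 && last_residency_us < states[cur].latency_us)
		return (cur - 1);

	/* Promote: residency comfortably covers the deeper state's cost. */
	if (cur + 1 < nstates && last_residency_us >=
	    (uint64_t)PROMOTE_FACTOR * states[cur + 1].latency_us)
		return (cur + 1);

	/* Otherwise stay where we are. */
	return (cur);
}
```

With the latencies from the kstat output above (1, 1 and 57 us), a 300 us residency in C2 would promote to C3 under this assumed threshold, and a 20 us residency in C3 would demote back to C2.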
