http://defect.opensolaris.org/bz/show_bug.cgi?id=6219
--- Comment #4 from Eric Saxe <eric.saxe at sun.com> 2009-02-16 23:40:16 ---
> 8 usr/src/uts/common/disp/cmt_policy.c
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>
> 8.1
> ====
>
> The following comment is confusing in its description of top level balancing
> on NUMA systems.
>
> 54  *
> 55  * On NUMA systems, the CMT load balancing algorithm balances across the
> 56  * CMT PGs within their respective lgroups. On UMA based system, there
> 57  * exists a top level group of PGs to balance across. On NUMA systems multiple
> 58  * top level groups are instantiated, where the top level balancing begins by
> 59  * balancng across the CMT PGs within their respective (per lgroup) top level
> 60  * groups.
I've rewritten the comment so it is clearer. Thanks.
> 8.2
> ====
>
> What can be a more generic solution for the XXX comment below? It assumes too
> much about the specifics of how utilization is computed.
>
> 76 /*
> 77 * XXX Note this assumes that curthread's utilization is "1"
> 78 */
> 79 if (self && bitset_in_set(&here->cmt_cpus_actv_set, CPU->cpu_seqid))
> 80         here_util--; /* Ignore curthread's effect */
> 81
A more generic solution would be to track (on a per-thread basis) the amount
of utilization (e.g. load) the current thread is contributing to the PG, so
that load can be "factored out" when deciding where to run. This is done in an
attempt to avoid the "groundhog being scared away by its own shadow"
pathology. When threads can contribute a variable amount of utilization, it
will probably make sense to track this and not assume that the current
thread's utilization is "1"...but at least for now, this is always the case.
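A rough sketch of what that might look like (purely illustrative; the
t_cmt_util field below is hypothetical and not part of this webrev):

    /*
     * Hypothetical: each thread tracks the utilization it is
     * contributing to its PG, and we factor out exactly that amount
     * rather than assuming a contribution of "1".
     */
    if (self && bitset_in_set(&here->cmt_cpus_actv_set, CPU->cpu_seqid))
            here_util -= curthread->t_cmt_util;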
> 8.3
> ====
>
> Would it simplify things to have just a single bit indicating whether the
> policy is coalescence or load balancing?
>
> 82 /*
> 83 * Load balancing and coalescence are conflicting policies
> 84 */
> 85 ASSERT((policy & (CMT_BALANCE|CMT_COALESCE)) !=
> 86 (CMT_BALANCE|CMT_COALESCE));
I wanted it to be possible for multiple policies to be specified for a PG...for
example for caches we would probably want both load balancing and affinity.
Opposing policies aren't legal though, and that is what the assert is checking
for.
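To make that concrete (illustrative only):

    /* a shared cache PG might legitimately want both of these */
    policy = CMT_BALANCE | CMT_AFFINITY;    /* legal combination */
    policy = CMT_BALANCE | CMT_COALESCE;    /* would trip the ASSERT */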
> 8.4
> ====
>
> What do you do with CMT_AFFINITY? There seems to be no policy implementation
> for it!
Yea, not yet. When we do rechoose_interval at the cache level though...
> 8.5
> ====
>
> For the balancing policy, what should be done if
>
> (here_util < there_util) &&
> (CMT_CAPACITY(there) much more than CMT_CAPACITY(here))
For the load balancing policy, we'll stay here. In this case, we'll only go
there if the utilizations are the same, or "there" becomes less utilized by
enough that migrating would leave the load more balanced than staying here. I
think I see where you are going, though...if we knew (for example) the rate of
change of the utilization, we might choose the one with greater capacity even
if it's more utilized. But we don't track du/dt yet.
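In other words, roughly (a simplified illustration of the comparison, not the
actual cmt_should_migrate() code; per 8.2 the migrating thread's load is
treated as "1"):

    /*
     * Migrate one unit of load from "here" to "there" only if the
     * resulting imbalance is smaller than the imbalance of staying.
     * Capacity is deliberately not consulted; no du/dt tracking.
     */
    static int
    balance_says_go(uint32_t here_util, uint32_t there_util)
    {
            int stay = (int)here_util - (int)there_util;
            int go = stay - 2;      /* here loses 1, there gains 1 */

            if (stay < 0)
                    stay = -stay;
            if (go < 0)
                    go = -go;
            return (go < stay);
    }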
> 8.6 cmt_balance()
> =================
>
> 197 /*
> 198 * Select an idle CPU from the target PG
> 199 */
>
> Does it mean that after we walk through the PG tree and find a PG to move to,
> we will only move there if there are idle CPUs available, but will stay with
> the current cp even if it is not idle?
Assuming the run queues are empty, the current CPU is (about to become)
idle...so yes, all this is really about choosing between idle CPUs, and
answering the question "should I stay or should I go". The "default" thing to
do should be to stay, and only migrate if it looks like it's worth it.
> 8.7 cmt_should_migrate()
> ========================
>
> Should there be some memory barriers for the access of pg->cmt_utilization
> values, or is this not very important anyway?
The utilization is updated using atomic operations, which forces all updates to
global visibility. That's about the best we can expect, since the utilization
can change at any moment...but this code is just trying to make the best
decision given the immediately available information.
> 9 usr/src/uts/common/disp/cmt.c
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>
> 9.1
> ====
>
> In the code below you combine plain reads with atomics. Is there any need to
> have memory barriers as well?
>
> 1147         do {
> 1148                 u = cmt->cmt_utilization;
> 1149         } while (atomic_cas_32(&cmt->cmt_utilization, u,
> 1150             u - 1) != u);
Thankfully, this code went away in favor of atomic_add_32_nv(). :)
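For reference, the replacement collapses the read + CAS retry loop into a
single atomic (sketch; the exact call site is in the webrev):

    /* atomically decrement and observe the new value */
    u = atomic_add_32_nv(&cmt->cmt_utilization, -1);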
> 9.2
> ====
>
> The comments around CPU_DISP_CMT_TRANS talk about time slice expiration, but
> the code doesn't seem to have anything to do with time slice expiration. It
> would be good to have a better comment explaining the notion of transients
> and how exactly it is defined.
This went away with the governor's handling of transience.
>
> 10 usr/src/uts/common/os/cpu.c
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>
> 10.1
> =====
>
> Need to update cp->cpu_info_kstat->ks_data_size by the size of
> ci_curr_cstate.
I don't think this is an issue any longer since the type of the kstat is now a
non-string.
> 10.2
> =====
>
> pg_cmt_hier_rank()
>
> From the code it seems that there is no need to get the platform rank at all
> if one of hw{1,2} is a power domain and CPUPM_POLICY_DISABLED.
Agreed. Fixed.
>
> 11 usr/src/uts/common/sys/cpu_pm.h
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>
> 11.1
> =====
>
> It would be good to have some comment explaining what CPUPM_POLICY_ELASTIC is.
> Same goes for a short explanation of active and idle power domains.
I've added that to the block comment at the top of cpu_pm.c.
> 11.2
> =====
>
> 72 typedef struct cpupm_state {
> 73 uint32_t cps_speed;
> 74 cpupm_handle_t cps_handle;
> 75 } cpupm_state_t;
>
> The only use of cps_speed that I found was in
>
> i86pc/os/cpupm/cpupm_mach.c:
>
> uint_t
> cpupm_plat_state_enumerate()
> {
>         int     *speeds;
>
>         ...
>
>         /*
>          * If the caller passes NULL for states, just return the
>          * number of states.
>          */
>         if (states != NULL) {
>                 for (i = 0; i < nspeeds; i++) {
>                         states[i].cps_speed = speeds[i];
>                         states[i].cps_handle = (cpupm_handle_t)i;
>                 }
>         }
> }
>
> So here speed is treated as int, not as uint32_t, and there are actually no
> other consumers of it. What is it used for?
Just observability.
> 11.3
> =====
>
> Why is utilization in cpupm_utilization_change() uint32_t? The only values
> used are 0 and 1. Why the choice of fixed-size type?
This has been changed. The interface uses an enumerated event type.
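Roughly, the interface is now event-based rather than taking a raw utilization
count, along these lines (sketch; see the webrev for the actual definitions):

    typedef enum cpupm_util_event {
            CPUPM_DOM_BUSY_FROM_IDLE,
            CPUPM_DOM_IDLE_FROM_BUSY,
            CPUPM_DOM_REMAIN_BUSY
    } cpupm_util_event_t;

    void    cpupm_utilization_event(cpu_t *cp, hrtime_t now,
                cpupm_domain_t *dom, cpupm_util_event_t event);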
>
> 12 usr/src/uts/common/os/cpu_pm.c
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>
> 12.1 cpupm_set_policy()
> =======================
>
> The comment says:
>
> 55         /*
> 56          * Pausing CPUs causes a high priority thread to be scheduled
> 57          * on all other CPUs (besides the current one). This locks out
> 58          * other CPUs from making CPUPM state transitions, and in elastic
> 59          * mode has the side effect of bringing all CPUs active and to full
> 60          * speed.
> 61          */
>
> Why do you need to do the same when setting the policy to disabled? Can
> something bad happen if the policy changes without stopping other CPUs?
Yes.
Since the state changes happen at context switch time, pausing the CPUs puts
them in a known state. Setting the policy at this point ensures that when CPUs
are unpaused, all future switches will happen with the new policy in effect.
> Also, the cpu_lock() is not required if the policy is not changing.
The check is done here under the lock so that, if we see that the policy needs
to be changed, we know nobody else can change it until we do. Without the
lock, we could decide the policy needs changing and then have someone else
come in and change it before we do. Holding cpu_lock across both the check and
the change makes it an atomic test-and-set.
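Schematically (a simplified sketch; names approximate, not the exact code):

    mutex_enter(&cpu_lock);
    if (cpupm_policy != new_policy) {
            /*
             * Pause the other CPUs so no context switches (and thus no
             * CPUPM state transitions) occur mid-change.
             */
            pause_cpus(NULL);
            cpupm_policy = new_policy;
            start_cpus();
    }
    mutex_exit(&cpu_lock);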
> 12.2
> =====
>
> 78         default:
> 79                 cmn_err(CE_WARN, "Attempt to set unknown CPUPM policy %d\n",
> 80                     new_policy);
>
> Can you ASSERT that this is not the case? I guess the pm_ioctl() handler
> should verify that it tries to set the correct policy.
Yes, but I've added an assert just the same...
> 12.3 XXX comments
> =================
>
> XXX comment left: * XXX CPUPM platform interface for this?
> Same for lines 196 and 281.
Fixed (per jjc's comments).
> 12.4 cpupm_utilization_change()
> ===============================
>
> It seems that there are no policy changes if the state is not LOW_POWER or
> MAX_POWER. Well, you can only have these two for now, but it still seems
> strange, since it is not clear whether the code should remain correct if more
> states are added or whether it should be changed as well.
Right. For speed changes, only the highest perf / lowest power states are used.
Looking ahead, we'll likely make use of the other states for other kinds of
policies (e.g. regulated) that we don't yet implement.
> 12.5
> =====
>
> 287         if (dom != NULL) {
> 288                 dom->cpd_named_states[CPUPM_STATE_MAX_PERF] =
> 289                     &dom->cpd_states[max_perf_level];
> 290         }
>
> Should there be a check that max_perf_level is within the nstates?
Yes, I've added a range check.
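Something along these lines (sketch; the nstates field name is assumed here,
see the webrev for the actual check):

    if (dom != NULL && max_perf_level >= 0 &&
        max_perf_level < (int)dom->cpd_nstates) {
            dom->cpd_named_states[CPUPM_STATE_MAX_PERF] =
                &dom->cpd_states[max_perf_level];
    }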
> 13 usr/src/uts/sun4{u,v}/os/cmp.c
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>
> 13.1 pg_plat_hw_rank()
> ======================
>
> 234 int i, rank1, rank2;
>
> rank1 and rank2 are not initialised to anything. In the code below, if either
> hw1 or hw2 is not part of the list of { ipipe, chip, cache }, the comparison
> will use junk values.
>
> Same goes for the sun4v version.
Fixed. Thanks.
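The shape of the fix, roughly (sketch; the actual hierarchy ordering is in the
webrev):

    pghw_type_t hw_hier[] = { PGHW_IPIPE, PGHW_CACHE, PGHW_CHIP };
    int i, nhier = sizeof (hw_hier) / sizeof (hw_hier[0]);
    int rank1 = 0, rank2 = 0;   /* defined values for unmatched types */

    for (i = 0; i < nhier; i++) {
            if (hw_hier[i] == hw1)
                    rank1 = i + 1;
            if (hw_hier[i] == hw2)
                    rank2 = i + 1;
    }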
Webrev with code changes for all the above at:
http://cr.opensolaris.org/~esaxe/6219