Eric Saxe wrote:

> If there are N or more running threads in an N-CPU system, then 
> utilization is at 100%. Generally, I'm thinking that the CPUs should 
> all be clocked up, if we want to maximize performance. There's not 
> much opportunity to squander power in this case. It's really only the 
> "partial utilization" scenario, where power is being directed at the 
> part of the system that isn't being used. 

If you hold this view, then the policy I described previously, which 
decides on the clock rate of idle CPUs based on how their number has 
fluctuated in the past, should be very relevant. It lets us find the 
desired tradeoff between maximizing the system's performance (by never 
clocking down the idle CPUs) and minimizing power consumption (by 
running idle CPUs at the lowest power and increasing their clock rate 
only when they receive threads). The policy I am envisioning should be 
able to clock down CPUs to different extents, based on the estimated 
probabilities of each of them receiving threads in the near future.
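
To make this concrete, here is a minimal sketch of the shape of that 
policy (the names, thresholds, and history mechanism are all 
hypothetical, just for illustration, not OpenSolaris code):

    /*
     * Sketch: from a sliding window of the busy-CPU count, estimate
     * how likely it is that at least `rank' CPUs become busy, and
     * clock the idle CPUs down to different extents accordingly.
     */
    #include <stdio.h>
    #include <stddef.h>

    #define HISTLEN 16

    typedef enum { PSTATE_FULL, PSTATE_MED, PSTATE_LOW } pstate_t;

    /* Fraction of recent samples with at least `rank' CPUs busy. */
    static double
    p_busy_at_rank(const int busy_hist[], size_t n, int rank)
    {
            size_t hits = 0;

            for (size_t i = 0; i < n; i++)
                    if (busy_hist[i] >= rank)
                            hits++;
            return (n == 0 ? 0.0 : (double)hits / (double)n);
    }

    /*
     * Choose a clock state for the idle CPU that would become the
     * `rank'-th busiest if it received a thread.  The 0.25/0.05
     * thresholds are arbitrary knobs standing in for the real
     * performance/power tradeoff.
     */
    static pstate_t
    choose_pstate(const int busy_hist[], size_t n, int rank)
    {
            double p = p_busy_at_rank(busy_hist, n, rank);

            if (p > 0.25)
                    return (PSTATE_FULL);   /* often reached: stay fast */
            if (p > 0.05)
                    return (PSTATE_MED);    /* occasionally reached */
            return (PSTATE_LOW);            /* rarely: deepest clock-down */
    }

    int
    main(void)
    {
            /* Busy-CPU count over recent ticks (made-up data). */
            int hist[HISTLEN] = {
                    2, 3, 2, 2, 4, 3, 2, 2, 3, 2, 2, 5, 2, 3, 2, 2
            };

            for (int rank = 3; rank <= 6; rank++)
                    printf("idle CPU at rank %d -> pstate %d\n", rank,
                        choose_pstate(hist, HISTLEN, rank));
            return (0);
    }

With that made-up history, the CPU ranked just above the typical load 
stays at full clock, while CPUs further out get clocked down more 
deeply, which is exactly the graded behavior I have in mind.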

> It gets interesting. To improve throughput on a CMT / NUMA system, we 
> spread work out. So if there are, say, N CPUs in the system and N/2 
> physical processors, we'll try to run N/2 threads with 1 thread per 
> physical processor. Utilization is 50% in this case. If the power 
> manageable unit is the physical processor, we're stuck, because we 
> can't power manage any of the sockets without hurting the performance 
> of the thread running there. If the dispatcher employed a coalescence 
> policy, it would instead run with 2 threads per socket, which would 
> mean N/4 (now idle) sockets are available to be power managed with 
> potentially minimal performance impact on the workload (assuming that 
> no contention over shared socket resources ensues as a result of the 
> coalescence). We need better workload observability to understand when 
> we can coalesce without hurting the workload's performance. It's going 
> to vary by workload, and also by machine/processor architecture. For 
> example, coalescing threads up on a multi-core chip will be far less 
> costly (from a performance perspective) than coalescing threads up on 
> a group of CPUs sharing an instruction pipeline.
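
To make the socket bookkeeping in that scenario concrete, here is a 
toy calculation with N = 8 CPUs and 2 CPUs per socket (the helper is 
hypothetical, not dispatcher code):

    #include <stdio.h>

    /*
     * How many whole sockets does a placement policy leave idle
     * (and thus power-manageable)?
     */
    static int
    idle_sockets(int nthreads, int ncpus, int cpus_per_socket,
        int threads_per_socket)
    {
            int nsockets = ncpus / cpus_per_socket;
            /* Sockets needed = ceil(nthreads / threads_per_socket). */
            int used = (nthreads + threads_per_socket - 1) /
                threads_per_socket;

            return (used >= nsockets ? 0 : nsockets - used);
    }

    int
    main(void)
    {
            int ncpus = 8, cps = 2, nthreads = 4;  /* N = 8, N/2 = 4 */

            /* Spread: 1 thread per socket -> every socket busy. */
            printf("spread:   %d idle sockets\n",
                idle_sockets(nthreads, ncpus, cps, 1));
            /* Coalesce: 2 threads per socket -> N/4 = 2 sockets idle. */
            printf("coalesce: %d idle sockets\n",
                idle_sockets(nthreads, ncpus, cps, 2));
            return (0);
    }
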

I am currently collaborating with Sasha Fedorova (Assistant Professor 
at Simon Fraser University in Canada) on a project addressing optimal 
dynamic thread scheduling in CMT systems, based on online estimates of 
threads' data sharing, contention, and mutual sensitivity. By the end 
of the summer her students should have the first demonstrations that 
these quantities can be measured online, and then we will experiment 
with different state-dependent rules (possibly tuned with reinforcement 
learning) that govern how threads are scheduled over time. In any case, 
given a policy that decides in which cases threads can be co-scheduled 
on a single socket, we still get dynamics of sockets becoming idle and 
then occupied over time, and the policy I mentioned above should still 
be very relevant. What do you think?

David


