David Vengerov wrote:
> Eric Saxe wrote:
>> If there are N or more running threads in an N-CPU system, then 
>> utilization is at 100%. Generally, I'm thinking that the CPUs should 
>> all be clocked up, if we want to maximize performance. There's not 
>> much opportunity to squander power in this case. It's really only the 
>> "partial utilization" scenario, where power is being directed at the 
>> part of the system that isn't being used. 
> If you hold this view, then the policy I described previously, which 
> decides on the clock rate of idle CPUs based on how their number has 
> fluctuated in the past, should be very relevant. It would allow us to 
> find the desired tradeoff between maximizing the system's performance 
> (by never clocking down the idle CPUs) and minimizing power 
> consumption (by running idle CPUs at the lowest power and increasing 
> their clock rate only when they receive some threads). The policy I am 
> envisioning should be able to clock down CPUs to different extents 
> based on the estimated probabilities of some of them receiving 
> threads in the near future.

Where I think this is especially relevant is where bringing additional 
resources online takes a non-trivial amount of time (latency). If ML 
techniques can help predict when utilization will increase or decrease, 
then perhaps the latency associated with bringing the additional 
capacity online or offline can be better hidden. In the data center 
context, where the unit of power management may be an entire suspended 
or powered-off system, this could be significant.
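
Just to make that concrete, here's a rough sketch (Python, purely 
illustrative; the power-state names, latencies, thresholds, and the 
probability estimate are all invented for the example) of the sort of 
policy David describes: each idle CPU gets a power state chosen from an 
estimate, based on how the idle-CPU count has fluctuated in the past, 
of how likely that CPU is to receive a thread soon, so only the 
least-likely-to-be-needed CPUs absorb the deep wakeup latency.

from collections import deque

# Sketch only: state names, latencies (ms), and thresholds are made up.
# Deeper states save more power but cost more wakeup latency.
POWER_STATES = [("full-speed", 0.0), ("half-speed", 0.1), ("deep-idle", 5.0)]

class IdleCpuPolicy:
    """Pick power states for idle CPUs from the history of how many
    CPUs have been idle, per the policy described above."""

    def __init__(self, history_len=64):
        self.idle_history = deque(maxlen=history_len)

    def observe(self, n_idle):
        """Record the current number of idle CPUs (sampled periodically)."""
        self.idle_history.append(n_idle)

    def p_gets_work(self, n_idle_now, rank):
        """Estimated probability that the idle CPU of a given rank
        (rank 0 = next CPU the dispatcher would reuse) receives a
        thread soon: the fraction of past samples in which the idle
        count was low enough that this CPU would have been running."""
        if not self.idle_history:
            return 1.0
        would_run = sum(1 for n in self.idle_history if n < n_idle_now - rank)
        return would_run / len(self.idle_history)

    def assign_states(self, idle_cpus):
        """Clock idle CPUs down to different extents: likely-needed
        CPUs stay at full speed, unlikely ones go deepest."""
        states = {}
        for rank, cpu in enumerate(idle_cpus):
            p = self.p_gets_work(len(idle_cpus), rank)
            if p > 0.5:
                states[cpu] = POWER_STATES[0][0]
            elif p > 0.1:
                states[cpu] = POWER_STATES[1][0]
            else:
                states[cpu] = POWER_STATES[2][0]
        return states

policy = IdleCpuPolicy()
for sample in [2, 1, 3, 2, 0, 2, 4, 2]:   # fabricated idle-count history
    policy.observe(sample)
print(policy.assign_states(["cpu4", "cpu5", "cpu6", "cpu7"]))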

>> It gets interesting. To improve throughput on a CMT / NUMA system, we 
>> spread work out. So if there are, say, N CPUs in the system and N/2 
>> physical processors, we'll try to run N/2 threads with 1 thread per 
>> physical processor. Utilization is 50% in this case. If the power 
>> manageable unit is the physical processor, we're stuck, because we 
>> can't power manage any of the sockets without hurting the performance 
>> of the thread running there. If the dispatcher employed a coalescence 
>> policy, it would instead run with 2 threads per socket, which would 
>> mean N/4 (now idle) sockets are available to be power managed with 
>> potentially minimal performance impact on the workload (assuming that 
>> no contention over shared socket resources ensues as a result of the 
>> coalescence). We need better workload observability to understand 
>> when we can coalesce without hurting the workload's performance. It's 
>> going to vary by workload, and also by machine/processor 
>> architecture. For example, coalescing threads up on a multi-core chip 
>> will be far less costly (from a performance perspective) than 
>> coalescing threads up on a group of CPUs sharing an instruction pipeline.
> I am currently collaborating with Sasha Fedorova (Assistant Professor 
> at Simon Fraser University in Canada) on a project addressing 
> optimal dynamic thread scheduling in CMT systems based on online 
> estimates of threads' data sharing, contention, and mutual 
> sensitivity. By the end of the summer, her students should have the 
> first demonstrations that these quantities can be measured online, 
> and then we will experiment with different state-dependent rules 
> (possibly tuned with reinforcement learning) that govern how the 
> threads are scheduled. In any case, given a policy that 
> decides in what cases threads can be co-scheduled on a single socket, 
> we still get some dynamics of sockets becoming idle and then occupied 
> over time, and the policy I mentioned above should still be very 
> relevant. What do you think?
I think so. Wanting to co-schedule threads sharing data just adds 
another constraint to the placement pattern...but I think the effect on 
overall system utilization should be similar either way.
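
For illustration, a strawman of the two placement policies might look 
something like this (Python, illustrative only; the topology numbers 
and the may_share() contention test are stand-ins for the real topology 
data and the workload observability we don't have yet):

def place_spread(threads, sockets, cores_per_socket):
    """Throughput policy: one thread per socket, round-robin. With N/2
    threads on N/2 sockets, no socket is idle and none can be power
    managed."""
    placement = {s: [] for s in sockets}
    for i, t in enumerate(threads):
        placement[sockets[i % len(sockets)]].append(t)
    return placement

def place_coalesce(threads, sockets, cores_per_socket, may_share):
    """Coalescence policy: pack threads onto as few sockets as possible,
    so long as may_share() says co-scheduling won't cause contention.
    Sockets left empty become candidates for power management. (The
    sketch ignores the case where no socket can take a thread.)"""
    placement = {s: [] for s in sockets}
    for t in threads:
        for s in sockets:
            if len(placement[s]) < cores_per_socket and \
               all(may_share(t, other) for other in placement[s]):
                placement[s].append(t)
                break
    return placement

def no_contention(a, b):
    # Optimistic stand-in: real code would consult shared-resource
    # contention / data-sharing observability, per the discussion above.
    return True

# Example: 8 CPUs as 4 sockets x 2 cores, 4 runnable threads.
sockets = ["s0", "s1", "s2", "s3"]
threads = ["t0", "t1", "t2", "t3"]

spread = place_spread(threads, sockets, 2)
packed = place_coalesce(threads, sockets, 2, no_contention)
print(sorted(s for s, ts in spread.items() if not ts))  # [] - nothing idle
print(sorted(s for s, ts in packed.items() if not ts))  # ['s2', 's3'] power manageable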

-Eric
