David Vengerov wrote:
> Eric Saxe wrote:
>> If there are N or more running threads in an N-CPU system, then
>> utilization is at 100%. Generally, I'm thinking that the CPUs should
>> all be clocked up if we want to maximize performance. There's not
>> much opportunity to squander power in this case. It's really only the
>> "partial utilization" scenario where power is being directed at the
>> part of the system that isn't being used.
>
> If you hold this view, then the policy I described previously, which
> decides on the clock rate of idle CPUs based on how their number has
> fluctuated in the past, should be very relevant: it would allow us to
> find the desired tradeoff between maximizing the system's performance
> (by never clocking down the idle CPUs) and minimizing power
> consumption (by running idle CPUs at the lowest power and increasing
> their clock rate only when they receive some threads). The policy I
> am envisioning should be able to clock down CPUs to different extents
> based on the estimated probabilities of some of them receiving
> threads in the near future.
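To make that policy concrete, here is a minimal user-space sketch in C.
The window length, the clock states, and the estimate_claimed_soon()
heuristic are hypothetical stand-ins (the thread doesn't commit to a
particular estimator); the point is only that idle CPUs can be clocked
down to different extents based on how the idle-CPU count has
fluctuated:

/*
 * Sketch (hypothetical names and thresholds throughout) of a
 * fluctuation-based idle-CPU clocking policy: track how the count of
 * idle CPUs has moved over a sliding window, estimate how many idle
 * CPUs are likely to be handed threads soon, keep that many at an
 * intermediate clock state, and park the rest at the lowest-power one.
 */
#include <stdio.h>

#define	HISTORY_LEN	16	/* sliding window of idle-CPU counts */
#define	NUM_CPUS	8

enum clock_state { CLK_FULL, CLK_HALF, CLK_LOWEST };

static int idle_history[HISTORY_LEN];	/* oldest sample first */

/*
 * Estimate how many of the cur_idle idle CPUs may be claimed soon,
 * using the largest recent drop in the idle count as a crude proxy
 * for the probability of near-future thread arrivals.
 */
static int
estimate_claimed_soon(int cur_idle)
{
	int i, max_drop = 0;

	for (i = 1; i < HISTORY_LEN; i++) {
		int drop = idle_history[i - 1] - idle_history[i];
		if (drop > max_drop)
			max_drop = drop;
	}
	return (max_drop < cur_idle ? max_drop : cur_idle);
}

/*
 * Busy CPUs run at full clock; of the idle ones, those most likely
 * to receive work soon stay at CLK_HALF, the rest go to CLK_LOWEST.
 */
static void
assign_clock_states(int cur_idle, enum clock_state states[])
{
	int hot = estimate_claimed_soon(cur_idle);
	int cpu;

	for (cpu = 0; cpu < NUM_CPUS; cpu++) {
		if (cpu < NUM_CPUS - cur_idle)
			states[cpu] = CLK_FULL;
		else if (cpu < NUM_CPUS - cur_idle + hot)
			states[cpu] = CLK_HALF;
		else
			states[cpu] = CLK_LOWEST;
	}
}

int
main(void)
{
	enum clock_state states[NUM_CPUS];
	int i;

	/* Made-up history: the idle count recently fell from 6 to 4. */
	for (i = 0; i < HISTORY_LEN; i++)
		idle_history[i] = (i < HISTORY_LEN - 2) ? 6 : 4;

	assign_clock_states(4, states);
	for (i = 0; i < NUM_CPUS; i++)
		printf("cpu%d: %s\n", i,
		    states[i] == CLK_FULL ? "full" :
		    states[i] == CLK_HALF ? "half" : "lowest");
	return (0);
}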
Where I think this is especially relevant is where there exists a
non-trivial amount of time (latency) to bring additional resources
online. If ML techniques can help predict when utilization will
increase or decrease, then perhaps the latency associated with
bringing the additional capacity online/offline can be better hidden.
In the data center context, where the unit of power management may be
suspended or powered-off systems, this could be significant.

>> It gets interesting. To improve throughput on a CMT / NUMA system,
>> we spread work out. So if there are, say, N CPUs in the system and
>> N/2 physical processors, we'll try to run N/2 threads with 1 thread
>> per physical processor. Utilization is 50% in this case. If the
>> power manageable unit is the physical processor, we're stuck,
>> because we can't power manage any of the sockets without hurting
>> the performance of the thread running there. If the dispatcher
>> employed a coalescence policy, it would instead run with 2 threads
>> per socket, which would mean N/4 (now idle) sockets are available
>> to be power managed with potentially minimal performance impact on
>> the workload (assuming that no contention over shared socket
>> resources ensues as a result of the coalescence). We need better
>> workload observability to understand when we can coalesce without
>> hurting the workload's performance. It's going to vary by workload,
>> and also by machine/processor architecture. For example, coalescing
>> threads on a multi-core chip will be far less costly (from a
>> performance perspective) than coalescing threads on a group of CPUs
>> sharing an instruction pipeline.

> I am currently collaborating with Sasha Fedorova (an assistant
> professor at Simon Fraser University in Canada) on a project
> addressing optimal dynamic thread scheduling in CMT systems based on
> online estimates of threads' data sharing, contention, and mutual
> sensitivity. By the end of the summer her students should make the
> first demonstrations of the possibility of measuring these
> quantities online, and then we will experiment with different
> state-dependent rules (possibly tuned with reinforcement learning)
> that govern how the threads keep getting scheduled. In any case,
> given a policy that decides in what cases threads can be
> co-scheduled on a single socket, we still get some dynamics of
> sockets becoming idle and then occupied over time, and the policy I
> mentioned above should still be very relevant. What do you think?

I think so. Wanting to co-schedule threads sharing data just adds
another constraint to the placement pattern... but I think the effect
on overall system utilization should be similar either way.

-Eric
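To illustrate the latency-hiding idea above, here is a sketch that
extrapolates a smoothed utilization trend over an assumed bring-up
latency and decides when to start waking suspended capacity. The
EWMA-plus-trend predictor, the function names, and all constants are
made-up placeholders for whatever ML model would actually be used:

/*
 * Sketch of hiding bring-up latency with prediction. If bringing
 * capacity online takes BRINGUP_LATENCY_SEC, start the bring-up as
 * soon as utilization extrapolated that far ahead would exceed the
 * capacity currently online. (Cold-start bias in the EWMA is
 * ignored for brevity.)
 */
#include <stdbool.h>
#include <stdio.h>

#define	BRINGUP_LATENCY_SEC	120.0	/* assumed resume time */
#define	ALPHA			0.2	/* EWMA smoothing factor */

static double util_ewma;	/* smoothed utilization, 0..1 */
static double util_trend;	/* smoothed change per second */

/* Feed one utilization sample; dt is seconds since the last one. */
static void
observe_utilization(double util, double dt)
{
	double prev = util_ewma;

	util_ewma = ALPHA * util + (1.0 - ALPHA) * util_ewma;
	util_trend = ALPHA * ((util_ewma - prev) / dt) +
	    (1.0 - ALPHA) * util_trend;
}

/* Start waking a suspended node now if predicted demand needs it. */
static bool
should_start_bringup(double capacity_online)
{
	double predicted = util_ewma +
	    util_trend * BRINGUP_LATENCY_SEC;

	return (predicted > capacity_online);
}

int
main(void)
{
	double u;

	/* Made-up trace: utilization climbing 1% per 10s sample. */
	for (u = 0.50; u <= 0.70; u += 0.01)
		observe_utilization(u, 10.0);

	printf("start bring-up now: %s\n",
	    should_start_bringup(0.75) ? "yes" : "no");
	return (0);
}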
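And a toy model of the spread-vs-coalesce example (N = 8 CPUs on
N/2 = 4 two-CPU sockets, N/2 = 4 runnable threads): spreading occupies
every socket, while coalescing packs 2 threads per socket and leaves
N/4 = 2 sockets idle and eligible for power management. The placement
function here is illustrative only, not the dispatcher's actual logic:

/*
 * Toy placement: "spread" picks the least-loaded socket per thread,
 * "coalesce" fills sockets in order. Returns how many sockets end
 * up fully idle (i.e., power manageable).
 */
#include <stdio.h>

#define	NUM_SOCKETS	4
#define	CPUS_PER_SOCKET	2

static int
place(int nthreads, int coalesce)
{
	int load[NUM_SOCKETS] = { 0 };
	int t, s, idle = 0;

	for (t = 0; t < nthreads; t++) {
		if (coalesce) {
			/* Fill the first socket with a free CPU. */
			for (s = 0; s < NUM_SOCKETS; s++) {
				if (load[s] < CPUS_PER_SOCKET) {
					load[s]++;
					break;
				}
			}
		} else {
			/* Spread: pick the least-loaded socket. */
			int best = 0;
			for (s = 1; s < NUM_SOCKETS; s++)
				if (load[s] < load[best])
					best = s;
			load[best]++;
		}
	}
	for (s = 0; s < NUM_SOCKETS; s++)
		if (load[s] == 0)
			idle++;
	return (idle);
}

int
main(void)
{
	printf("spread:   %d idle sockets\n", place(4, 0));	/* 0 */
	printf("coalesce: %d idle sockets\n", place(4, 1));	/* 2 */
	return (0);
}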
