David Vengerov wrote:
> Eric Saxe wrote:
>
>> David Vengerov wrote:
>>
>>> Eric Saxe wrote:
>>>
>>>> For a default, I was leaning towards "maximize performance, but
>>>> don't squander power". That way, without doing anything system
>>>> administrators / users will still get the performance levels they
>>>> have come to expect..but will also see overall efficiency
>>>> improvements, since average system utilization is generally very low.
>>>
>>>
>>> So if the system has 8 CPUs and 8 threads running, how do you decide
>>> on the number of CPUs to keep idle and clocked down? Maybe 8 threads
>>> on a single CPU will still be satisfactory from the application's
>>> point of view?
>>
>> I think we would always want threads to run on (even a clocked down
>> CPU) rather than not run at all (waiting on a run queue somewhere).
>> If the policy is to maximize performance without squandering power,
>> then with 8 threads and 8 logical CPUs, I would probably say none of
>> the CPUs should be clocked down, since we would be at 100% utilization.
>
> OK, but what do you then mean by "not squandering power" if N or more
> threads are present in an N-CPU system?

If there are N or more running threads in an N-CPU system, then
utilization is at 100%. Generally, I'm thinking that the CPUs should all
be clocked up if we want to maximize performance. There's not much
opportunity to squander power in this case. Power is really only
squandered in the "partial utilization" scenario, where it is being
directed at a part of the system that isn't being used.

> Would it be something like running CPUs at 75% of maximum frequency if
> there are N threads in an N-CPU system, at 90% if there are 2N threads
> and 100% if there are more than 3N threads? What is a sample scenario
> when you would like the system to "coalesce" the workload?

It gets interesting. To improve throughput on a CMT / NUMA system, we
spread work out. So if there are, say, N CPUs in the system and N/2
physical processors, we'll try to run N/2 threads with 1 thread per
physical processor. Utilization is 50% in this case. If the power
manageable unit is the physical processor, we're stuck, because we can't
power manage any of the sockets without hurting the performance of the
thread running there.

If the dispatcher employed a coalescence policy, it would instead run
with 2 threads per socket, which would mean N/4 (now idle) sockets are
available to be power managed with potentially minimal performance
impact on the workload (assuming that no contention over shared socket
resources ensues as a result of the coalescence).

We need better workload observability to understand when we can coalesce
without hurting the workload's performance. It's going to vary by
workload, and also by machine/processor architecture. For example,
coalescing threads onto a multi-core chip will be far less costly (from
a performance perspective) than coalescing threads onto a group of CPUs
sharing an instruction pipeline.
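
To make the arithmetic above concrete, here is a rough sketch in Python. It is only an illustration of the socket counting, not dispatcher code: the function name, the policy strings, and the assumption that every socket has the same number of logical CPUs are all made up for the example.

```python
def idle_sockets(n_sockets, cpus_per_socket, n_threads, policy):
    """Count sockets left with no runnable thread under a placement policy.

    Hypothetical helper for illustration only. Assumes identical sockets
    and ignores contention over shared socket resources.
    """
    if n_sockets <= 0 or cpus_per_socket <= 0 or n_threads < 0:
        raise ValueError("socket and CPU counts must be positive")

    if policy == "spread":
        # One thread per socket until every socket has work.
        busy = min(n_sockets, n_threads)
    elif policy == "coalesce":
        # Fill each socket's logical CPUs before using the next socket.
        sockets_needed = -(-n_threads // cpus_per_socket)  # ceiling division
        busy = min(n_sockets, sockets_needed)
    else:
        raise ValueError("unknown policy: %r" % (policy,))

    return n_sockets - busy


if __name__ == "__main__":
    n_cpus = 16                      # N logical CPUs
    cpus_per_socket = 2              # so N/2 = 8 physical processors
    n_sockets = n_cpus // cpus_per_socket
    n_threads = n_cpus // 2          # 50% utilization

    for policy in ("spread", "coalesce"):
        idle = idle_sockets(n_sockets, cpus_per_socket, n_threads, policy)
        print(policy, "->", idle, "idle sockets")
```

With N = 16, spreading leaves no socket idle, while coalescing leaves N/4 = 4 sockets free to be power managed. At 100% utilization (N or more threads) both policies leave nothing idle, which matches the "all clocked up" case.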
-Eric
