[tesla-dev] Adaptive Optimizations page update (ML)

Eric Saxe Mon, 04 Jun 2007 18:22:02 -0700

David Vengerov wrote:
> Eric Saxe wrote:
>
>> David Vengerov wrote:
>>
>>> Eric Saxe wrote:
>>>
>>>> For a default, I was leaning towards "maximize performance, but 
>>>> don't squander power". That way, without doing anything system 
>>>> administrators / users will still get the performance levels they 
>>>> have come to expect, but will also see overall efficiency 
>>>> improvements, since average system utilization is generally very low. 
>>>
>>>
>>> So if the system has 8 CPUs and 8 threads running, how do you decide 
>>> on the number of CPUs to keep idle and clocked down? Maybe 8 threads 
>>> on a single CPU will still be satisfactory from the application's 
>>> point of view?
>>
>> I think we would always want threads to run (even on a clocked down 
>> CPU) rather than not run at all (waiting on a run queue somewhere). 
>> If the policy is to maximize performance without squandering power, 
>> then with 8 threads and 8 logical CPUs, I would probably say none of 
>> the CPUs should be clocked down, since we would be at 100% utilization. 
>
> OK, but what do you then mean by "not squandering power" if N or more 
> threads are present in an N-CPU system?
If there are N or more running threads in an N-CPU system, then 
utilization is at 100%. Generally, I'm thinking that the CPUs should all 
be clocked up if we want to maximize performance. There's not much 
opportunity to squander power in this case. Power is really only 
squandered in the "partial utilization" scenario, where power is being 
directed at the part of the system that isn't being used.
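
A minimal sketch of that default policy, for illustration only. The 
function names are hypothetical (not an existing dispatcher interface), 
and the one-full-speed-CPU-per-running-thread rule is an assumption 
about what "don't squander power" would mean under partial utilization:

```python
def utilization(running_threads: int, ncpus: int) -> float:
    """Fraction of logical CPUs that have a thread to run (capped at 1.0)."""
    if ncpus <= 0 or running_threads < 0:
        raise ValueError("need ncpus > 0 and running_threads >= 0")
    return min(running_threads, ncpus) / ncpus


def cpus_to_clock_up(running_threads: int, ncpus: int) -> int:
    """Hypothetical policy: how many CPUs to keep at full clock.

    With N or more running threads on N CPUs, utilization is 100%, so
    every CPU stays clocked up. Below that, only the CPUs that actually
    have a thread to run are kept at full speed (an assumption); the
    rest are candidates for being clocked down.
    """
    if ncpus <= 0 or running_threads < 0:
        raise ValueError("need ncpus > 0 and running_threads >= 0")
    return min(running_threads, ncpus)
```

With 8 threads on 8 CPUs this keeps all 8 CPUs clocked up; with 2 
threads on 8 CPUs it leaves 6 CPUs as candidates for clocking down.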
> Would it be something like running CPUs at 75% of maximum frequency if 
> there are N threads in an N-CPU system, at 90% if there are 2N threads 
> and 100% if there are more than 3N threads? What is a sample scenario 
> when you would like the system to "coalesce" the workload?
It gets interesting. To improve throughput on a CMT / NUMA system, we 
spread work out. So if there are, say, N CPUs in the system and N/2 
physical processors, we'll try to run N/2 threads with 1 thread per 
physical processor. Utilization is 50% in this case. If the power 
manageable unit is the physical processor, we're stuck, because we can't 
power manage any of the sockets without hurting the performance of the 
thread running there. If the dispatcher employed a coalescence policy, 
it would instead run with 2 threads per socket, which would mean N/4 
(now idle) sockets are available to be power managed with potentially 
minimal performance impact on the workload (assuming that no contention 
over shared socket resources ensues as a result of the coalescence). We 
need better workload observability to understand when we can coalesce 
without hurting the workload's performance. It's going to vary by 
workload, and also by machine/processor architecture. For example, 
coalescing threads up on a multi-core chip will be far less costly (from 
a performance perspective) than coalescing threads up on a group of CPUs 
sharing an instruction pipeline.
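
The socket arithmetic above can be sketched as follows. This is only 
an illustration: the function name and parameters are hypothetical, it 
assumes every socket has the same number of logical CPUs, and it 
ignores any contention over shared socket resources.

```python
def idle_sockets(nthreads: int, nsockets: int, cpus_per_socket: int,
                 coalesce: bool) -> int:
    """Count sockets left completely idle under a placement policy.

    coalesce=False: spread threads out, one per socket first
                    (the throughput-oriented CMT / NUMA placement).
    coalesce=True:  pack threads onto as few sockets as possible.
    """
    if nsockets <= 0 or cpus_per_socket <= 0 or nthreads < 0:
        raise ValueError("invalid topology or thread count")
    if coalesce:
        # Ceiling division: sockets needed when each is filled up first.
        busy = (nthreads + cpus_per_socket - 1) // cpus_per_socket
    else:
        # One thread per socket until every socket has work.
        busy = nthreads
    return nsockets - min(busy, nsockets)


# N = 8 logical CPUs, N/2 = 4 sockets, N/2 = 4 running threads.
print(idle_sockets(4, 4, 2, coalesce=False))  # spread: no socket is idle
print(idle_sockets(4, 4, 2, coalesce=True))   # coalesced: N/4 sockets idle
```

In the spread case every socket is busy, so nothing can be power 
managed at socket granularity; in the coalesced case N/4 sockets are 
freed up, at whatever performance cost the sharing imposes.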


-Eric
