Eric Saxe wrote:
> David Vengerov wrote:
>
>>>
>>> From your description, it sounds like the opportunity here is using
>>> ML to learn the best
>>> policies for controlling power manageable resources, such that
>>> efficiency, performance,
>>> and adaptability (in the face of changing system utilization) are
>>> maximized. Is that right?
>>
>>
>> Yes, and the system can also jointly adapt the thread migration
>> policies so as to achieve the same objective.
>
> What sorts of inputs (observability) do you think will be needed for
> this?
I think that whatever is currently observable in the Solaris kernel
should be sufficient to start with. That is, even if the controller just
knows the run queue length on each CPU, it can already improve the
system's power efficiency by lowering the clock rate of CPUs that are
idle or have few threads in their run queues (if the workload consists
of many short-lived transactions). A more effective approach, however,
is for the thread migration policy to cooperate with the power
management policy and try to keep as many CPUs idle as possible without
"infringing" on the SLAs made with the running applications. To do this,
the controller needs to know the service quality received by the
applications, which requires continual feedback from the system about
the response time of completed transactions (or, better yet, about the
SLA rewards/penalties, which can already include many different
performance considerations). Administrators can therefore be
given an option of specifying a performance measure that should be sent
to the power management/thread migration controller. If they choose not
to provide a performance measure, then the system will behave according
to the default policy, which can be chosen from the set: {most
performance oriented (spread the load equally), most power efficient
(keep all threads on a single CPU and turn others off), 50-50 tradeoff
between performance and efficiency (use half of the CPUs), etc.}.
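
To make the run-queue heuristic and the default policies above a bit more
concrete, here is a minimal Python sketch. It is only an illustration under
stated assumptions: the names `DefaultPolicy`, `place_threads`, and
`pick_frequencies`, the three frequency levels, and the `few_threads`
threshold are all hypothetical and do not correspond to any Solaris
interface. A real controller would learn its policy rather than use fixed
rules like these.

```python
from enum import Enum


class DefaultPolicy(Enum):
    """Hypothetical names for the default policies listed above."""
    PERFORMANCE = "performance"  # spread the load equally
    POWER = "power"              # keep all threads on a single CPU
    BALANCED = "balanced"        # use half of the CPUs


def active_cpu_count(policy, n_cpus):
    """Number of CPUs that the given default policy keeps in use."""
    if policy is DefaultPolicy.PERFORMANCE:
        return n_cpus
    if policy is DefaultPolicy.POWER:
        return 1
    return max(1, n_cpus // 2)


def place_threads(policy, n_threads, n_cpus):
    """Return per-CPU run queue lengths after spreading the threads
    round-robin over the CPUs the policy keeps active."""
    active = active_cpu_count(policy, n_cpus)
    queues = [0] * n_cpus
    for t in range(n_threads):
        queues[t % active] += 1
    return queues


def pick_frequencies(run_queue_lengths, freqs_mhz=(800, 1600, 2400),
                     few_threads=1):
    """Map each CPU's run queue length to a clock rate.

    Assumed rule: idle CPUs get the lowest rate (standing in for
    "turned off"), CPUs with few threads get the middle rate, and
    all other CPUs get the highest rate.
    """
    low, mid, high = freqs_mhz
    chosen = []
    for length in run_queue_lengths:
        if length == 0:
            chosen.append(low)
        elif length <= few_threads:
            chosen.append(mid)
        else:
            chosen.append(high)
    return chosen


# Example: 6 threads on 4 CPUs under the 50-50 policy.
queues = place_threads(DefaultPolicy.BALANCED, n_threads=6, n_cpus=4)
freqs = pick_frequencies(queues)
print(queues)  # [3, 3, 0, 0]
print(freqs)   # [2400, 2400, 800, 800]
```

Run queue length is the only input here, which matches the "sufficient to
start with" level of observability. Performance feedback from applications
would enter as an extra signal used to adjust these choices.
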
There can be some special cases when performance feedback from
applications is not needed. For example, we can decide that more than
one thread per CPU results in a noticeable performance degradation, so
the power management policy can then learn what clock frequencies to
assign to existing CPUs based on observing how many of those CPUs have
threads running on them. Do you think this is an important case to
address? If so, then a separate management policy should be developed
for this case, which will be "turned on" whenever such a case is
detected.
David