Re: [RFC 1/2] Simulate Intel cpufreq MSRs in kvm guests to influence nice priority

2008-09-04 Thread Darrick J. Wong
On Mon, Jul 28, 2008 at 08:56:34AM +0800, Tian, Kevin wrote:
 I guess the solution for such issues is not to have kvm (or qemu) play
 with nice levels, but instead send notifications on virtual frequency
 changes on the qemu monitor. The management application can then choose
 whether to ignore the information, play with nice levels, or even
 propagate the frequency change to the host (useful in client-side
 virtualization).

I like this idea too.

I've been giving a little more thought to how we present cpufreq
control to the guest.  According to the ACPI specs, either we can
implement a fixed hardware implementation (i.e. MSRs) or we can provide
a system i/o address that (presumably) traps to the firmware so that the
BIOS can do the actual work.  Since the MSR controls are different
between Intel and AMD (Linux refuses to use Intel MSRs on an AMD CPU and
vice versa), I'm thinking it might be easier to use the system I/O route
because then we don't have to spend any code emulating the hardware
mechanisms when we don't need to do so.

Of course, that's assuming that it's easy to set up a magic I/O port
on the guest that will trap into the VMM so that we can perform whatever
magic we want.  I would assume that this is the case, though I've only
just now gotten back to this patch set.  I've also not studied the speed
difference between the emulated wrmsr instruction and this manner of I/O
port access, but I suppose I can try it and find out. :)
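
Roughly what I have in mind for the handler, by the way.  This is only an
untested sketch: nothing here is a real kvm/qemu symbol, the port number is a
placeholder for whatever the ACPI tables would advertise, and the nice mapping
is the 0/untouched, 1/+5, 2/reset, 3/-5 scheme from the RFC.

#include <stdint.h>
#include <stdio.h>
#include <sys/resource.h>

#define VCPUFREQ_PORT 0x1040            /* placeholder port */

/* In real life this would be wired up as the VMM's trap handler for
 * OUTs to the port advertised in the guest's ACPI tables. */
static void vcpufreq_port_write(uint32_t port, uint32_t val)
{
        static const int nice_for[4] = { 0, +5, 0, -5 };

        if (port != VCPUFREQ_PORT || val > 3)
                return;

        /* renice the calling (vcpu) thread; going below 0 needs privilege */
        if (setpriority(PRIO_PROCESS, 0, nice_for[val]) < 0)
                perror("setpriority");
}

int main(void)
{
        /* stand-in for a trapped OUT from the guest's cpufreq driver */
        vcpufreq_port_write(VCPUFREQ_PORT, 3);
        printf("current nice: %d\n", getpriority(PRIO_PROCESS, 0));
        return 0;
}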

At least in theory this would also eliminate an obstacle to migrating
VMs from Intel to AMD CPUs, but I suspect that's not really feasible
anyway.

--D


Re: [RFC 1/2] Simulate Intel cpufreq MSRs in kvm guests to influence nice priority

2008-07-27 Thread Avi Kivity
Tian, Kevin wrote:
 From: Darrick J. Wong
 Sent: July 16, 2008 7:18

 I envision four scenarios:

 0. Guests that don't know about cpufreq still run at whatever nice level
 they started with.

 1. If we have a system with a lot of idle VMs, they will all run with +5
 nice and this patch has no effect.

 2. If we have a system with a lot of busy VMs, they all run with -5 nice
 and this patch also has no effect.

 3. If, however, we have a lot of idle VMs and a few busy ones, then the
 -5 nice of the busy VMs will get those VMs extra CPU time.  On a really
 crummy FPU microbenchmark I have, the score goes from about 500 to 2000
 with the patch applied, though of course YMMV.  In some respects this
 

 How many VMs did you run in this test? All the VMs are idle except
 the one where your benchmark runs?

 How about the actual effect when several VMs are doing some stuff?

 There's another scenario where some VMs don't support cpufreq while
 others do. Here, is it unfair to just renice the latter when the former is
 not 'nice' at all?
   

I guess the solution for such issues is not to have kvm (or qemu) play
with nice levels, but instead send notifications on virtual frequency
changes on the qemu monitor. The management application can then choose
whether to ignore the information, play with nice levels, or even
propagate the frequency change to the host (useful in client-side
virtualization).
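
Something along these lines, say (not real qemu code -- monitor_emit() is a
stand-in for whatever the monitor output path would be):

#include <stdarg.h>
#include <stdint.h>
#include <stdio.h>

/* Stand-in for "print a line on the qemu monitor"; in real code this
 * would go to the monitor channel the management application reads. */
static void monitor_emit(const char *fmt, ...)
{
        va_list ap;

        va_start(ap, fmt);
        vprintf(fmt, ap);
        va_end(ap);
}

/* Called when the guest writes the virtual frequency control.  No
 * policy in kvm/qemu itself -- just report the request and let the
 * management application renice, ignore, or propagate to the host. */
static void virtual_perf_ctl_write(int vcpu, uint32_t pstate)
{
        monitor_emit("cpufreq-change: vcpu=%d pstate=%u\n", vcpu, pstate);
}

int main(void)
{
        virtual_perf_ctl_write(0, 2);   /* guest asks for "medium" speed */
        return 0;
}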

-- 
error compiling committee.c: too many arguments to function



RE: [RFC 1/2] Simulate Intel cpufreq MSRs in kvm guests to influence nice priority

2008-07-27 Thread Tian, Kevin
From: Avi Kivity [mailto:[EMAIL PROTECTED] 
Sent: July 27, 2008 16:27

Tian, Kevin wrote:
 From: Darrick J. Wong
 Sent: July 16, 2008 7:18

 I envision four scenarios:

 0. Guests that don't know about cpufreq still run at whatever nice level
 they started with.

 1. If we have a system with a lot of idle VMs, they will all run with +5
 nice and this patch has no effect.

 2. If we have a system with a lot of busy VMs, they all run with -5 nice
 and this patch also has no effect.

 3. If, however, we have a lot of idle VMs and a few busy ones, then the
 -5 nice of the busy VMs will get those VMs extra CPU time.  On a really
 crummy FPU microbenchmark I have, the score goes from about 500 to 2000
 with the patch applied, though of course YMMV.  In some respects this
 

 How many VMs did you run in this test? All the VMs are idle except
 the one where your benchmark runs?

 How about the actual effect when several VMs are doing some stuff?

 There's another scenario where some VMs don't support cpufreq while
 others do. Here, is it unfair to just renice the latter when the former is
 not 'nice' at all?
   

I guess the solution for such issues is not to have kvm (or qemu) play
with nice levels, but instead send notifications on virtual frequency
changes on the qemu monitor. The management application can then choose
whether to ignore the information, play with nice levels, or even
propagate the frequency change to the host (useful in client-side
virtualization).


Yes, that'd be more flexible and cleaner.

Thanks,
Kevin


Re: [RFC 1/2] Simulate Intel cpufreq MSRs in kvm guests to influence nice priority

2008-07-17 Thread Darrick J. Wong
On Wed, Jul 16, 2008 at 01:56:51PM +0800, Tian, Kevin wrote:
 
 How many VMs did you run in this test?

100 idle

 All the VMs are idle except
 the one where your benchmark runs?

Yes.

 How about the actual effect when several VMs are doing some stuff?

If there are multiple VMs that are busy, the busy ones will fight among
themselves for CPU time.  I still see some priority boost, just not as
much.

 There's another scenario where some VMs don't support cpufreq while
 others do. Here, is it unfair to just renice the latter when the former is
 not 'nice' at all?
 
 I guess this feature has to be applied with some qualifications, e.g.
 in a group of VMs known to have the same PM abilities...

Agreed it's not very convenient for guests that don't know about
cpufreq, though I was planning to make it the case that unaware guests
get no priority boost/reduction.

 You can report the constant-TSC feature in CPUID virtualization. Of course,
 if the physical TSC is unstable, it's another story how to mark the guest
 TSC as untrustworthy (e.g. Marcelo is developing one method by simulating C2).

I wonder how stable the virtual tsc is...?  Will have to study this.
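
(For reference, the constant-TSC check a guest would do is just CPUID leaf
0x80000007, EDX bit 8 -- untested sketch:)

#include <stdio.h>
#include <cpuid.h>

int main(void)
{
        unsigned int eax, ebx, ecx, edx;

        /* CPUID.80000007H:EDX[8] is the invariant ("constant") TSC flag */
        if (!__get_cpuid(0x80000007, &eax, &ebx, &ecx, &edx)) {
                puts("CPUID leaf 0x80000007 not available");
                return 1;
        }
        printf("invariant TSC: %s\n", (edx & (1u << 8)) ? "yes" : "no");
        return 0;
}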

 This description seems to mismatch the implementation, which pushes
 +10 and -10 for the 1 and 3 cases. Maybe I misinterpreted the code?

Nope, that's a mistake on my part.

 One interesting point is the initial value of the PERF_CTL MSR. The current
 'zero' doesn't reflect a meaningful state to the guest, since there's no perf
 entry in the ACPI table to carry such a value. One likely result is that the
 guest would think the current freq is 0 when initializing the ACPI cpufreq
 driver. So it would make more sense to set the initial value to 2 (P1), which
 keeps the default nice value, or even 3 (P0), if you take that state as
 IDA-style, which may over-clock but makes no guarantee.

Indeed.  I had pondered this point considerably myself.  For this RFC I
decided that I could leave the MSR as zero as a way of detecting a guest
that didn't know anything, in case that ability is useful.  However, the
Linux drivers seem to give you either 0MHz or some arbitrary p-state, so
I think I'll change it to value 1.
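
Something like this is the bookkeeping I'm picturing (the struct and helper
names are made up, not what's in the patch; the MSR numbers are the real
IA32_PERF_STATUS/IA32_PERF_CTL ones, and the nice mapping is the one from the
RFC description):

#include <stdint.h>

#define MSR_IA32_PERF_STATUS    0x198
#define MSR_IA32_PERF_CTL       0x199

struct vcpufreq_state {
        uint32_t perf_ctl;      /* last control value the guest wrote, 0..3 */
        int      nice;          /* nice adjustment currently applied to the vcpu */
};

/* Start at a real P-state instead of 0 so the guest's cpufreq driver
 * sees a sane "current" frequency; exactly which state is still TBD. */
static const struct vcpufreq_state vcpufreq_init = { .perf_ctl = 2, .nice = 0 };

static int nice_for_pstate(uint32_t val)
{
        switch (val) {
        case 1:  return +5;     /* lowest speed */
        case 3:  return -5;     /* highest speed */
        default: return 0;      /* untouched or medium: default nice */
        }
}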

 More critical points to think through further, if this feature is expected
 to see real use, are the definition of the exposed virtual freq states and
 how these states can be mapped to scheduler knobs. Inappropriate exposure may
 cause the guest to bounce excessively between virtual freq points. For
 example, the 'nice' value is only a relative hint to the scheduler and
 there's no guarantee that CPU cycles are added in the same proportion as the
 'nice' value changes. There's

IDA has the same problem... the T61 BIOS compensates for this fakery
by reporting a frequency of $max_freq + 1 so if you're smart then you'll
somehow know that you might see a boost that you can't measure. :P

I suppose the problem here is that p-states were designed on the
assumption that you're directly manipulating hardware speeds, whereas
what we really want in both this patch and IDA are qualitative values
(medium speed, highest speed, ludicrous speed?)

 even the case where the guest requests the lowest speed while the actual
 cpu cycles allocated to it stay similar to the last epoch when it was at
 high speed. This will fool the guest into thinking the lowest speed can
 satisfy its requirement. It's similar

On the other hand, if you get the same performance  at both high and low
speeds, then it doesn't really matter which one you choose.  At least
not until the load changes.  I suppose the next question is, how much
software is dependent on knowing the exact CPU frequency, and are
workload schedulers smart enough to realize that performance
characteristics can change over time (throttling, TM1/TM2, etc)?
Inasmuch as you actually ever know, since with hardware coordination of
cpufreq the hardware can do whatever it wants.

 to the requirement on core-based hardware coordination logic, where some
 feedback mechanism (e.g. the APERF/MPERF MSR pair) is required to reveal the
 actual freq in the last sampling period. The VM case may need a similar
 virtualized feedback mechanism. Not sure whether the 'actual' freq is easily
 deduced, however.

I don't think it's easily deduced.  I also don't think APERF/MPERF are
emulated in kvm yet.  I suppose it wouldn't be difficult to add those two,
though measuring that might be a bit messy.
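
(For reference, the guest-side math we'd have to keep honest is roughly
effective freq ~= base freq * delta(APERF) / delta(MPERF) over a sampling
window; untested sketch using the Linux msr device:)

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

#define MSR_IA32_MPERF  0xe7
#define MSR_IA32_APERF  0xe8

static uint64_t rdmsr_cpu0(int fd, uint32_t msr)
{
        uint64_t val = 0;

        /* the msr driver maps the file offset to the MSR number */
        pread(fd, &val, sizeof(val), msr);
        return val;
}

int main(void)
{
        int fd = open("/dev/cpu/0/msr", O_RDONLY);
        if (fd < 0) {
                perror("open /dev/cpu/0/msr");
                return 1;
        }

        uint64_t a0 = rdmsr_cpu0(fd, MSR_IA32_APERF);
        uint64_t m0 = rdmsr_cpu0(fd, MSR_IA32_MPERF);
        sleep(1);                               /* sampling period */
        uint64_t a1 = rdmsr_cpu0(fd, MSR_IA32_APERF);
        uint64_t m1 = rdmsr_cpu0(fd, MSR_IA32_MPERF);
        close(fd);

        /* < 1.0 means we effectively ran below the base frequency */
        printf("APERF/MPERF ratio: %.3f\n",
               (double)(a1 - a0) / (double)(m1 - m0));
        return 0;
}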

Maybe the cheap workaround for now is to report the CPU speeds in the
table as n-1, n, n+1.

 Maybe it's worthwhile to compare the freq change count for the same
 benchmark between VM and native, and more interesting: what's the effect
 when multiple VMs all make use of such features? For example, is the
 expected effect counteracted, with only overhead added? Are any strange
 behaviors exposed, given that a real 'nice' value wouldn't be changed as
 frequently as every few dozen ms? :-)

I'll run some benchmarks and see what happens over the next week.

--D

RE: [RFC 1/2] Simulate Intel cpufreq MSRs in kvm guests to influence nice priority

2008-07-15 Thread Tian, Kevin
From: Darrick J. Wong
Sent: July 16, 2008 7:18

I envision four scenarios:

0. Guests that don't know about cpufreq still run at whatever nice level
they started with.

1. If we have a system with a lot of idle VMs, they will all run with +5
nice and this patch has no effect.

2. If we have a system with a lot of busy VMs, they all run with -5 nice
and this patch also has no effect.

3. If, however, we have a lot of idle VMs and a few busy ones, then the
-5 nice of the busy VMs will get those VMs extra CPU time.  On a really
crummy FPU microbenchmark I have, the score goes from about 500 to 2000
with the patch applied, though of course YMMV.  In some respects this

How many VMs did you run in this test? All the VMs are idle except
the one where your benchmark runs?

How about the actual effect when several VMs are doing some stuff?

There's another scenario where some VMs don't support cpufreq while
others do. Here, is it unfair to just renice the latter when the former is
not 'nice' at all?

I guess this feature has to be applied with some qualifications, e.g.
in a group of VMs known to have the same PM abilities...


There are some warts to this patch--most notably, the current
implementation uses the Intel MSRs and EST feature flag ... even if the
guest reports the CPU as being AuthenticAMD.  Also, there could be
timing problems introduced by this change--the OS thinks the CPU
frequency changes, but I don't know the effect on the guest CPU TSCs.

You can report the constant-TSC feature in CPUID virtualization. Of course,
if the physical TSC is unstable, it's another story how to mark the guest
TSC as untrustworthy (e.g. Marcelo is developing one method by simulating C2).


Control values are as follows:
0: Nobody's touched cpufreq.  nice is whatever the default is.
1: Lowest speed.  nice +5.
2: Medium speed.  nice is reset.
3: High speed.  nice -5.

This description seems to mismatch the implementation, which pushes
+10 and -10 for the 1 and 3 cases. Maybe I misinterpreted the code?

One interesting point is the initial value of the PERF_CTL MSR. The current
'zero' doesn't reflect a meaningful state to the guest, since there's no perf
entry in the ACPI table to carry such a value. One likely result is that the
guest would think the current freq is 0 when initializing the ACPI cpufreq
driver. So it would make more sense to set the initial value to 2 (P1), which
keeps the default nice value, or even 3 (P0), if you take that state as
IDA-style, which may over-clock but makes no guarantee.

More critical points to think through further, if this feature is expected
to see real use, are the definition of the exposed virtual freq states and
how these states can be mapped to scheduler knobs. Inappropriate exposure may
cause the guest to bounce excessively between virtual freq points. For
example, the 'nice' value is only a relative hint to the scheduler and
there's no guarantee that CPU cycles are added in the same proportion as the
'nice' value changes. There's even the case where the guest requests the
lowest speed while the actual cpu cycles allocated to it stay similar to the
last epoch when it was at high speed. This will fool the guest into thinking
the lowest speed can satisfy its requirement. It's similar to the requirement
on core-based hardware coordination logic, where some feedback mechanism
(e.g. the APERF/MPERF MSR pair) is required to reveal the actual freq in the
last sampling period. The VM case may need a similar virtualized feedback
mechanism. Not sure whether the 'actual' freq is easily deduced, however.

Maybe it's worthwhile to compare the freq change count for the same
benchmark between VM and native, and more interesting: what's the effect
when multiple VMs all make use of such features? For example, is the
expected effect counteracted, with only overhead added? Are any strange
behaviors exposed, given that a real 'nice' value wouldn't be changed as
frequently as every few dozen ms? :-)

Thanks,
Kevin

