Re: [PATCH] x86/kvm: Show guest system/user cputime in cpustat

2010-03-13 Thread Avi Kivity

On 03/12/2010 10:53 AM, Qing He wrote:



When Qing (CCed) was working on nested VMX in the past, he found that PV
vmread/vmwrite indeed works well (it writes to the virtual vmcs, so vmwrite
also benefits). However, compared to older machines (where one of our internal
patches showed an improvement of more than 5%), Nehalem gains less because its
vmexit cost is already lower.

 

One of the hurdles to paravirtualizing vmread/vmwrite is that the memory
layout of the physical vmcs remains unknown. Of course it could use the custom
vmcs layout defined by nested virtualization, but that looks a little odd,
since different nested virtualization implementations may define different
custom layouts.
   


Note we must use a custom layout and cannot depend on the physical 
layout, due to live migration.  The layout becomes an ABI.
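
To make that concrete, here is a minimal sketch of what a software-defined
(ABI) vmcs layout and a PV vmread/vmwrite on top of it could look like. None
of these names exist in KVM, and the field-to-slot mapping is hypothetical:

/* Sketch only: a software-defined vmcs layout that becomes part of the ABI. */
#include <linux/types.h>

#define PV_VMCS_NR_FIELDS	512

struct pv_vmcs {
	u64 fields[PV_VMCS_NR_FIELDS];	/* fixed, migration-safe layout */
};

/* Hypothetical mapping from a VMX field encoding to a slot index. */
static inline unsigned int pv_vmcs_field_index(u32 encoding)
{
	return encoding % PV_VMCS_NR_FIELDS;
}

/* PV vmread: a plain memory read instead of a trapping VMREAD. */
static inline u64 pv_vmcs_read(struct pv_vmcs *vmcs, u32 encoding)
{
	return vmcs->fields[pv_vmcs_field_index(encoding)];
}

/* PV vmwrite: goes to the virtual vmcs, so it benefits as well. */
static inline void pv_vmcs_write(struct pv_vmcs *vmcs, u32 encoding, u64 val)
{
	vmcs->fields[pv_vmcs_field_index(encoding)] = val;
}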



I once used another approach to partially accelerate vmread/vmwrite in the
nested virtualization case, which also gives a good performance gain (around
7% on pre-Nehalem; on top of that, PV vmread/vmwrite gave another 7%). The
idea is to add a shortcut that handles EXIT_REASON_VM{READ,WRITE} without
even turning on IF.
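
Roughly, the shortcut looks like this. This is a sketch only, not the actual
internal patch; the emulation helper named below is hypothetical:

/* Sketch: intercept L1's VMREAD/VMWRITE exits on a fast path, before
 * local interrupts are re-enabled and the generic exit handling runs. */
static int nested_vmx_exit_fastpath(struct kvm_vcpu *vcpu, u32 exit_reason)
{
	if (exit_reason == EXIT_REASON_VMREAD ||
	    exit_reason == EXIT_REASON_VMWRITE) {
		/* Emulate against the cached virtual vmcs while IF is still 0;
		 * handle_vmread_vmwrite_cached() is a hypothetical helper. */
		handle_vmread_vmwrite_cached(vcpu);
		return 1;	/* handled, re-enter the guest directly */
	}
	return 0;		/* fall through to the normal exit path */
}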
   


Interesting.  That means our exit path is inefficient; it seems to imply 
half the time is spent outside the hardware vmexit path.


A quick profile (on non-Nehalem) shows many atomics and calls into the 
lapic, as well as update_cr8_intercept which is sometimes unnecessary; 
these could easily be optimized.


Definitely optimizing the non-paravirt path is preferred to adding more 
paravirtualization.


--
Do not meddle in the internals of kernels, for they are subtle and quick to 
panic.



Re: [PATCH] x86/kvm: Show guest system/user cputime in cpustat

2010-03-12 Thread Qing He
On Thu, 2010-03-11 at 17:17 +0800, Sheng Yang wrote:
 On Thursday 11 March 2010 15:50:54 Avi Kivity wrote:
  On 03/11/2010 09:46 AM, Sheng Yang wrote:
   On Thursday 11 March 2010 15:36:01 Avi Kivity wrote:
   On 03/11/2010 09:20 AM, Sheng Yang wrote:
   Currently we can only get the cpu_stat of the whole guest as a single
   value. This patch enhances cpu_stat with more detail, adding guest_system
   and guest_user cpu time statistics at a small overhead.
  
   Signed-off-by: Sheng Yang <sh...@linux.intel.com>
   ---
  
   This draft patch is based on KVM upstream to show the idea. I would split
   it into a more kernel-friendly version later.
  
   The overhead is the cost of get_cpl() after each exit from the guest.
  
   This can be very expensive in the nested virtualization case, so I
   wouldn't like this to be in normal paths.  I think detailed profiling
   like that can be left to 'perf kvm', which only has overhead if enabled
   at runtime.
  
   Yes, that's my concern too (though nested vmcs/vmcb reads are already too
   expensive; they should be optimized...).
  
  Any ideas on how to do that?  Perhaps use paravirt_ops to convert the
  vmread into a memory read?  We store the vmwrites in the vmcs anyway.
 
 When Qing (CCed) was working on nested VMX in the past, he found that PV
 vmread/vmwrite indeed works well (it writes to the virtual vmcs, so vmwrite
 also benefits). However, compared to older machines (where one of our
 internal patches showed an improvement of more than 5%), Nehalem gains less
 because its vmexit cost is already lower.
 

One of the hurdles to paravirtualizing vmread/vmwrite is that the memory
layout of the physical vmcs remains unknown. Of course it could use the custom
vmcs layout defined by nested virtualization, but that looks a little odd,
since different nested virtualization implementations may define different
custom layouts.

I once used another approach to partially accelerate vmread/vmwrite in the
nested virtualization case, which also gives a good performance gain (around
7% on pre-Nehalem; on top of that, PV vmread/vmwrite gave another 7%). The
idea is to add a shortcut that handles EXIT_REASON_VM{READ,WRITE} without
even turning on IF.

Thanks,
Qing


Re: [PATCH] x86/kvm: Show guest system/user cputime in cpustat

2010-03-11 Thread Zhang, Yanmin
On Thu, 2010-03-11 at 09:50 +0200, Avi Kivity wrote:
 On 03/11/2010 09:46 AM, Sheng Yang wrote:
  On Thursday 11 March 2010 15:36:01 Avi Kivity wrote:
 
  On 03/11/2010 09:20 AM, Sheng Yang wrote:
   

 
  Currently we can only get the cpu_stat of the whole guest as a single value.
  This patch enhances cpu_stat with more detail, adding guest_system and
  guest_user cpu time statistics at a small overhead.
 
  Signed-off-by: Sheng Yang <sh...@linux.intel.com>
It seems per-process guest cpu utilization is more useful than per-cpu
utilization.


  ---
 
  This draft patch is based on KVM upstream to show the idea. I would split it
  into a more kernel-friendly version later.
 
  The overhead is the cost of get_cpl() after each exit from the guest.
 
  This can be very expensive in the nested virtualization case, so I
  wouldn't like this to be in normal paths.  I think detailed profiling
  like that can be left to 'perf kvm', which only has overhead if enabled
  at runtime.
   
  Yes, that's my concern too (though nested vmcs/vmcb reads are already too
  expensive; they should be optimized...).
 
 Any ideas on how to do that?  Perhaps use paravirt_ops to convert the
 vmread into a memory read?  We store the vmwrites in the vmcs anyway.
Another method is to add a sysctl entry, such as
/proc/sys/kernel/collect_guest_utilization, and set it off by default. Or add
a /sys/kernel/debug/kvm/collect_guest_utilization debugfs entry.
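
For example, a debugfs knob could gate the extra get_cpl() call so that
nothing is paid unless an admin turns it on. A sketch, assuming the standard
debugfs helpers and KVM's existing kvm_debugfs_dir; the knob name and the
gated kvm_update_guest_mode() are taken from this thread, not merged code:

#include <linux/debugfs.h>

/* Sketch: /sys/kernel/debug/kvm/collect_guest_utilization, off by default. */
static u32 collect_guest_utilization;

static int __init guest_util_debugfs_init(void)
{
	debugfs_create_bool("collect_guest_utilization", 0644,
			    kvm_debugfs_dir, &collect_guest_utilization);
	return 0;
}
late_initcall(guest_util_debugfs_init);

static void kvm_update_guest_mode(struct kvm_vcpu *vcpu)
{
	if (!collect_guest_utilization)
		return;			/* skip the get_cpl() cost entirely */
	if (kvm_x86_ops->get_cpl(vcpu) != 0)
		current->flags |= PF_VCPU_USER;
}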

 
  The other concern is that a perf-like mechanism would
  bring a lot more overhead compared to this.
 
 
 Ordinarily users won't care if time is spent in guest kernel mode or 
 guest user mode.  They want to see which guest is imposing a load on a 
 system.  I consider a user profiling a guest from the host an advanced 
 and rarer use case, so it's okay to require tools and additional 
 overhead for this.
Here is the story of why Sheng worked out the patch. Some people are working
on KVM performance. They want us to extend top to show guest utilization
info, such as guest kernel and guest userspace cpu utilization. With the new
tool, they could find which VM (mapped to its qemu process id) consumes too
much cpu time on the host side (kernel plus userspace) and compare that with
its guest kernel/userspace time. That information provides a first-hand,
high-level overview of all the VMs running in the system and helps an admin
quickly find the worst-behaving VM instance.

So we need per-process (guest) cpu utilization rather than per-cpu guest
utilization.
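
For what it's worth, per-task guest time is already exported, so a top-like
tool can read it today. A small sketch of a reader, assuming the guest_time
field that /proc/<pid>/stat has exposed since 2.6.24 (the 43rd field); the
patch under discussion would only add the user/system split on top of this:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Return the per-process guest time in clock ticks, or -1 on error. */
static long long guest_ticks(int pid)
{
	char path[64], buf[4096], *p;
	FILE *f;
	int field;

	snprintf(path, sizeof(path), "/proc/%d/stat", pid);
	f = fopen(path, "r");
	if (!f)
		return -1;
	if (!fgets(buf, sizeof(buf), f)) {
		fclose(f);
		return -1;
	}
	fclose(f);

	/* Skip "pid (comm)"; comm may itself contain spaces or parentheses. */
	p = strrchr(buf, ')');
	if (!p)
		return -1;
	p += 2;					/* now at field 3 (state) */
	for (field = 3; field < 43; field++) {	/* walk to field 43, guest_time */
		p = strchr(p, ' ');
		if (!p)
			return -1;
		p++;
	}
	return atoll(p);
}

int main(int argc, char **argv)
{
	if (argc > 1)
		printf("guest_time: %lld ticks\n", guest_ticks(atoi(argv[1])));
	return 0;
}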

 
  For example you can put the code to note the cpl in a tracepoint which
  is enabled dynamically.
   
  Yanmin has already implemented perf kvm to support this. We are just arguing
  whether a normal top-like mechanism is necessary.
perf kvm is mostly used to find hot functions, and it might incur more
overhead. Sheng's patch has less overhead.


 
  I am also considering making it a feature that can be disabled. But that
  seems to complicate things and results in uncertain cpustat output.
 
 
 I'm not even sure that guest time was a good idea.




Re: [PATCH] x86/kvm: Show guest system/user cputime in cpustat

2010-03-11 Thread Sheng Yang
On Thursday 11 March 2010 15:50:54 Avi Kivity wrote:
 On 03/11/2010 09:46 AM, Sheng Yang wrote:
  On Thursday 11 March 2010 15:36:01 Avi Kivity wrote:
  On 03/11/2010 09:20 AM, Sheng Yang wrote:
  Currently we can only get the cpu_stat of the whole guest as a single
  value. This patch enhances cpu_stat with more detail, adding guest_system
  and guest_user cpu time statistics at a small overhead.
 
  Signed-off-by: Sheng Yang <sh...@linux.intel.com>
  ---
 
  This draft patch is based on KVM upstream to show the idea. I would split
  it into a more kernel-friendly version later.
 
  The overhead is the cost of get_cpl() after each exit from the guest.
 
  This can be very expensive in the nested virtualization case, so I
  wouldn't like this to be in normal paths.  I think detailed profiling
  like that can be left to 'perf kvm', which only has overhead if enabled
  at runtime.
 
  Yes, that's my concern too (though nested vmcs/vmcb reads are already too
  expensive; they should be optimized...).
 
 Any ideas on how to do that?  Perhaps use paravirt_ops to convert the
 vmread into a memory read?  We store the vmwrites in the vmcs anyway.

When Qing (CCed) was working on nested VMX in the past, he found that PV
vmread/vmwrite indeed works well (it writes to the virtual vmcs, so vmwrite
also benefits). However, compared to older machines (where one of our internal
patches showed an improvement of more than 5%), Nehalem gains less because its
vmexit cost is already lower.

-- 
regards
Yang, Sheng

 
  The other concern is that a perf-like mechanism would
  bring a lot more overhead compared to this.
 
 Ordinarily users won't care if time is spent in guest kernel mode or
 guest user mode.  They want to see which guest is imposing a load on a
 system.  I consider a user profiling a guest from the host an advanced
 and rarer use case, so it's okay to require tools and additional
 overhead for this.
 
  For example you can put the code to note the cpl in a tracepoint which
  is enabled dynamically.
 
  Yanmin has already implemented perf kvm to support this. We are just
  arguing whether a normal top-like mechanism is necessary.
 
  I am also considering making it a feature that can be disabled. But that
  seems to complicate things and results in uncertain cpustat output.
 
 I'm not even sure that guest time was a good idea.
 


[PATCH] x86/kvm: Show guest system/user cputime in cpustat

2010-03-10 Thread Sheng Yang
Currently we can only get the cpu_stat of the whole guest as a single value.
This patch enhances cpu_stat with more detail, adding guest_system and
guest_user cpu time statistics at a small overhead.

Signed-off-by: Sheng Yang <sh...@linux.intel.com>
---

This draft patch is based on KVM upstream to show the idea. I would split it
into a more kernel-friendly version later.

The overhead is the cost of get_cpl() after each exit from the guest.

Comments are welcome!

 arch/x86/kvm/x86.c          |   10 ++
 fs/proc/stat.c              |   22 --
 include/linux/kernel_stat.h |    2 ++
 include/linux/kvm_host.h    |    1 +
 include/linux/sched.h       |    1 +
 kernel/sched.c              |    6 ++
 6 files changed, 36 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 703f637..c8ea6e1 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -4290,6 +4290,14 @@ static void inject_pending_event(struct kvm_vcpu *vcpu)
}
 }
 
+static void kvm_update_guest_mode(struct kvm_vcpu *vcpu)
+{
+   int cpl = kvm_x86_ops->get_cpl(vcpu);
+
+   if (cpl != 0)
+   current->flags |= PF_VCPU_USER;
+}
+
 static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
 {
int r;
@@ -4377,6 +4385,8 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
	trace_kvm_entry(vcpu->vcpu_id);
	kvm_x86_ops->run(vcpu);
 
+   kvm_update_guest_mode(vcpu);
+
/*
 * If the guest has used debug registers, at least dr7
 * will be disabled while returning to the host.
diff --git a/fs/proc/stat.c b/fs/proc/stat.c
index b9b7aad..d07640a 100644
--- a/fs/proc/stat.c
+++ b/fs/proc/stat.c
@@ -27,7 +27,7 @@ static int show_stat(struct seq_file *p, void *v)
int i, j;
unsigned long jif;
cputime64_t user, nice, system, idle, iowait, irq, softirq, steal;
-   cputime64_t guest, guest_nice;
+   cputime64_t guest, guest_nice, guest_user, guest_system;
u64 sum = 0;
u64 sum_softirq = 0;
unsigned int per_softirq_sums[NR_SOFTIRQS] = {0};
@@ -36,7 +36,7 @@ static int show_stat(struct seq_file *p, void *v)
 
user = nice = system = idle = iowait =
irq = softirq = steal = cputime64_zero;
-   guest = guest_nice = cputime64_zero;
+   guest = guest_nice = guest_user = guest_system = cputime64_zero;
getboottime(boottime);
jif = boottime.tv_sec;
 
@@ -53,6 +53,10 @@ static int show_stat(struct seq_file *p, void *v)
guest = cputime64_add(guest, kstat_cpu(i).cpustat.guest);
guest_nice = cputime64_add(guest_nice,
kstat_cpu(i).cpustat.guest_nice);
+   guest_user = cputime64_add(guest_user,
+   kstat_cpu(i).cpustat.guest_user);
+   guest_system = cputime64_add(guest_system,
+   kstat_cpu(i).cpustat.guest_system);
for_each_irq_nr(j) {
sum += kstat_irqs_cpu(j, i);
}
@@ -68,7 +72,7 @@ static int show_stat(struct seq_file *p, void *v)
sum += arch_irq_stat();
 
	seq_printf(p, "cpu  %llu %llu %llu %llu %llu %llu %llu %llu %llu "
-		"%llu\n",
+		"%llu %llu %llu\n",
(unsigned long long)cputime64_to_clock_t(user),
(unsigned long long)cputime64_to_clock_t(nice),
(unsigned long long)cputime64_to_clock_t(system),
@@ -78,7 +82,9 @@ static int show_stat(struct seq_file *p, void *v)
(unsigned long long)cputime64_to_clock_t(softirq),
(unsigned long long)cputime64_to_clock_t(steal),
(unsigned long long)cputime64_to_clock_t(guest),
-   (unsigned long long)cputime64_to_clock_t(guest_nice));
+   (unsigned long long)cputime64_to_clock_t(guest_nice),
+   (unsigned long long)cputime64_to_clock_t(guest_user),
+   (unsigned long long)cputime64_to_clock_t(guest_system));
for_each_online_cpu(i) {
 
/* Copy values here to work around gcc-2.95.3, gcc-2.96 */
@@ -93,9 +99,11 @@ static int show_stat(struct seq_file *p, void *v)
steal = kstat_cpu(i).cpustat.steal;
guest = kstat_cpu(i).cpustat.guest;
guest_nice = kstat_cpu(i).cpustat.guest_nice;
+   guest_user = kstat_cpu(i).cpustat.guest_user;
+   guest_system = kstat_cpu(i).cpustat.guest_system;
	seq_printf(p,
		"cpu%d %llu %llu %llu %llu %llu %llu %llu %llu %llu "
-		"%llu\n",
+		"%llu %llu %llu\n",
i,
(unsigned long long)cputime64_to_clock_t(user),
(unsigned long long)cputime64_to_clock_t(nice),
@@ -106,7 +114,9 @@ static int show_stat(struct seq_file *p, void *v)
(unsigned long long)cputime64_to_clock_t(softirq),
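
For orientation, the kernel/sched.c side could split the accounted guest time
using the PF_VCPU_USER flag set above roughly as follows. This is a sketch
modelled on the existing account_guest_time(); only the final if/else is new,
and it is illustrative rather than the actual hunk:

/* Sketch: route each guest tick to guest_user or guest_system based on
 * the CPL noted at vmexit time via PF_VCPU_USER. */
static void account_guest_time(struct task_struct *p, cputime_t cputime,
			       cputime_t cputime_scaled)
{
	cputime64_t tmp = cputime_to_cputime64(cputime);
	struct cpu_usage_stat *cpustat = &kstat_this_cpu.cpustat;

	/* Existing bookkeeping: per-task and aggregate guest time. */
	p->utime = cputime_add(p->utime, cputime);
	p->gtime = cputime_add(p->gtime, cputime);
	cpustat->user = cputime64_add(cpustat->user, tmp);
	cpustat->guest = cputime64_add(cpustat->guest, tmp);

	/* New: split by the guest CPL recorded after the last exit. */
	if (p->flags & PF_VCPU_USER)
		cpustat->guest_user = cputime64_add(cpustat->guest_user, tmp);
	else
		cpustat->guest_system = cputime64_add(cpustat->guest_system, tmp);
}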

Re: [PATCH] x86/kvm: Show guest system/user cputime in cpustat

2010-03-10 Thread Avi Kivity

On 03/11/2010 09:20 AM, Sheng Yang wrote:

Currently we can only get the cpu_stat of the whole guest as a single value.
This patch enhances cpu_stat with more detail, adding guest_system and
guest_user cpu time statistics at a small overhead.

Signed-off-by: Sheng Yang <sh...@linux.intel.com>
---

This draft patch is based on KVM upstream to show the idea. I would split it
into a more kernel-friendly version later.

The overhead is the cost of get_cpl() after each exit from the guest.
   


This can be very expensive in the nested virtualization case, so I 
wouldn't like this to be in normal paths.  I think detailed profiling 
like that can be left to 'perf kvm', which only has overhead if enabled 
at runtime.


For example you can put the code to note the cpl in a tracepoint which 
is enabled dynamically.
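
Something along these lines, as a sketch using the standard TRACE_EVENT
machinery (kvm_guest_cpl is an illustrative name, not an existing KVM
tracepoint), so that get_cpl() is only evaluated when the event is enabled:

/* Sketch: belongs in a trace header such as arch/x86/kvm/trace.h. */
TRACE_EVENT(kvm_guest_cpl,
	TP_PROTO(struct kvm_vcpu *vcpu),
	TP_ARGS(vcpu),

	TP_STRUCT__entry(
		__field(unsigned int, vcpu_id)
		__field(unsigned int, cpl)
	),

	TP_fast_assign(
		__entry->vcpu_id = vcpu->vcpu_id;
		/* the get_cpl() cost is only paid when the event is enabled */
		__entry->cpl = kvm_x86_ops->get_cpl(vcpu);
	),

	TP_printk("vcpu %u cpl %u", __entry->vcpu_id, __entry->cpl)
);

/* call site in vcpu_enter_guest(), right after kvm_x86_ops->run(vcpu): */
/*	trace_kvm_guest_cpl(vcpu); */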


--
error compiling committee.c: too many arguments to function



Re: [PATCH] x86/kvm: Show guest system/user cputime in cpustat

2010-03-10 Thread Sheng Yang
On Thursday 11 March 2010 15:36:01 Avi Kivity wrote:
 On 03/11/2010 09:20 AM, Sheng Yang wrote:
  Currently we can only get the cpu_stat of the whole guest as a single value.
  This patch enhances cpu_stat with more detail, adding guest_system and
  guest_user cpu time statistics at a small overhead.
 
  Signed-off-by: Sheng Yang <sh...@linux.intel.com>
  ---
 
  This draft patch is based on KVM upstream to show the idea. I would split it
  into a more kernel-friendly version later.
 
  The overhead is the cost of get_cpl() after each exit from the guest.
 
 This can be very expensive in the nested virtualization case, so I
 wouldn't like this to be in normal paths.  I think detailed profiling
 like that can be left to 'perf kvm', which only has overhead if enabled
 at runtime.

Yes, that's my concern too (though nested vmcs/vmcb reads are already too
expensive; they should be optimized...). The other concern is that a perf-like
mechanism would bring a lot more overhead compared to this.

 For example you can put the code to note the cpl in a tracepoint which
 is enabled dynamically.

Yanmin has already implemented perf kvm to support this. We are just arguing
whether a normal top-like mechanism is necessary.

I am also considering making it a feature that can be disabled. But that seems
to complicate things and results in uncertain cpustat output.

-- 
regards
Yang, Sheng


Re: [PATCH] x86/kvm: Show guest system/user cputime in cpustat

2010-03-10 Thread Avi Kivity

On 03/11/2010 09:46 AM, Sheng Yang wrote:

On Thursday 11 March 2010 15:36:01 Avi Kivity wrote:
   

On 03/11/2010 09:20 AM, Sheng Yang wrote:
 

Currently we can only get the cpu_stat of the whole guest as a single value.
This patch enhances cpu_stat with more detail, adding guest_system and
guest_user cpu time statistics at a small overhead.

Signed-off-by: Sheng Yang <sh...@linux.intel.com>
---

This draft patch is based on KVM upstream to show the idea. I would split it
into a more kernel-friendly version later.

The overhead is the cost of get_cpl() after each exit from the guest.
   

This can be very expensive in the nested virtualization case, so I
wouldn't like this to be in normal paths.  I think detailed profiling
like that can be left to 'perf kvm', which only has overhead if enabled
at runtime.
 

Yes, that's my concern too (though nested vmcs/vmcb reads are already too
expensive; they should be optimized...).


Any ideas on how to do that?  Perhaps use paravirt_ops to convert the
vmread into a memory read?  We store the vmwrites in the vmcs anyway.



The other concern is that a perf-like mechanism would
bring a lot more overhead compared to this.
   


Ordinarily users won't care if time is spent in guest kernel mode or 
guest user mode.  They want to see which guest is imposing a load on a 
system.  I consider a user profiling a guest from the host an advanced 
and rarer use case, so it's okay to require tools and additional 
overhead for this.



For example you can put the code to note the cpl in a tracepoint which
is enabled dynamically.
 

Yanmin has already implemented perf kvm to support this. We are just arguing
whether a normal top-like mechanism is necessary.

I am also considering making it a feature that can be disabled. But that seems
to complicate things and results in uncertain cpustat output.
   


I'm not even sure that guest time was a good idea.

--
error compiling committee.c: too many arguments to function
