Re: [PATCH V2 3/5] ara virt interface of perf to support kvm guest os statistics collection in guest os

2010-06-23 Thread Zhang, Yanmin
On Wed, 2010-06-23 at 08:51 +0300, Avi Kivity wrote:
 On 06/23/2010 06:12 AM, Zhang, Yanmin wrote:
 
  This design is to deal with a task context perf collection in guest os.
  Scenario 1:
  1) guest os starts to collect statistics of process A on vcpu 0;
  2) process A is scheduled to vcpu 1. Then, the perf_event at host side 
  need
  to be moved to VCPU 1 's thread. With the per KVM instance design, we 
  needn't
  move host_perf_shadow among vcpus.
 
 
  First, the guest already knows how to deal with per-cpu performance
  monitors, since that's how most (all) hardware works.  So we aren't
  making the guest more complex, and on the other hand we simplify the host.
   
  I agree that we need keep things simple.
 
 
  Second, if process A is migrated, and the guest uses per-process
  counters, the guest will need to stop/start the counter during the
  migration.  This will cause the host to migrate the counter,
   
  Agree. My patches do so.
 
  Question: Where does host migrate the counter?
  The perf event at host side is bound to specific vcpu thread.
 
 
 If the perf event is bound to the vm, not a vcpu, then on guest process 
 migration you will have to disable it on one vcpu and enable it on the 
 other, no?
I found we start from different points. This patch is to implement a para virt
interface based on current perf implementation in kernel.

Here is a diagram about perf implementation layers. Below picture is not 
precise,
but it could show perf layers. Ingo and Peter could correct me if something is 
wrong.

-
|  Perf Generic Layer   |
-
|  PMU Abstraction Layer| 
|  (a couple of callbacks)  | 
-
|  x86_pmu  |
|  (operate real PMU hardware)  |
-


The upper layer is perf generic layer. The 3rd layer is x86_pmu which really
manipulate PMU hardware. Sometimes, 1st calls 3rd directly at event 
initialization
and enable/disable all events.

My patch implements a kvm_pmu at the 2nd layer in guest os, to call hypercall 
to vmexit
to host. At host side, mostly it would go through the 3 layers till accessing 
real
hardware.

Most of your comments don't agree with the kvm_pmu design. Although you didn't 
say
directly, I know that perhaps you want to implement para virt interface at 3rd 
layer
in guest os. That means guest os maintains a mapping between guest event and 
PMU counters.
That's why you strongly prefer per-vcpu event managements and idx reference to 
event.
If we implement it at 3rd layer (or something like that although you might say 
I don't
like that layer...) in guest, we need bypass 1st and 2nd layers in host kernel 
when
processing guest os event. Eventually, we almost add a new layer under x86_pmu 
to arbitrate
between perf PMU request and KVM guest event request.

My current patch arranges the calling to go through the whole perf stack at 
host side.
The upper layer arranges perf event scheduling on PMU hardware. Applications 
don't know
when its event will be really scheduled to real hardware as they needn't know.


 
so while we
  didn't move the counter to a different vcpu,
   
  Disagree here. If process A on vcpu 0 in guest os is migrated to vcpu 1,
  host has to move process A's perf_event to vcpu 1 thread.
 
 
 Sorry, I'm confused now (lost track of our example).  But whatever we 
 do, if a guest process is migrated, the host will have to migrate the 
 perf event, yes?
 


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH V2 3/5] ara virt interface of perf to support kvm guest os statistics collection in guest os

2010-06-22 Thread Zhang, Yanmin
On Mon, 2010-06-21 at 16:56 +0300, Gleb Natapov wrote:
 On Mon, Jun 21, 2010 at 05:31:43PM +0800, Zhang, Yanmin wrote:
  The 3rd patch is to implement para virt perf at host kernel.
  
  Signed-off-by: Zhang Yanmin yanmin_zh...@linux.intel.com
  
snip

  +
  +static void kvm_copy_event_to_guest(struct kvm_vcpu *vcpu,
  +   struct perf_event *host_event)
  +{
  +   struct host_perf_shadow *shadow = host_event-host_perf_shadow;
  +   struct guest_perf_event counter;
  +   int ret;
  +   s32 overflows;
  +
  +   ret = kvm_read_guest(vcpu-kvm, shadow-guest_event_addr,
  +   counter, sizeof(counter));
  +   if (ret  0)
  +   return;
  +
  +again:
  +   overflows = atomic_read(shadow-counter.overflows);
  +   if (atomic_cmpxchg(shadow-counter.overflows, overflows, 0) !=
  +   overflows)
  +   goto again;
  +
  +   counter.count = shadow-counter.count;
  +   atomic_add(overflows, counter.overflows);
  +
  +   kvm_write_guest(vcpu-kvm,
  +   shadow-guest_event_addr,
  +   counter,
  +   sizeof(counter));
 Those kind of interfaces worry me since the can cause bugs that are
 very hard to catch.  What if guest enables some events and crashes into
 kdump kernel (or kexec new kernel) without reseting HW. Now host may
 write over guest memory without guest expecting it. Do you handle this
 scenario in a guest side? I think you need to register reboot notify
 and disable events from there.
Sorry for missing your comments.

My patch could take care of dead guest os by cleaning up all events in function
kvm_arch_destroy_vm, so all events are closed if host user kills the guest
qemu process.

As for your scenario, I will register reboot notify and add a new pv perf
hypercall interface to vmexit to host kernel to do cleanup.

Thanks,
Yanmin


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH V2 1/5] ara virt interface of perf to support kvm guest os statistics collection in guest os

2010-06-22 Thread Zhang, Yanmin
On Tue, 2010-06-22 at 09:14 +0200, Jes Sorensen wrote:
 On 06/22/10 03:49, Zhang, Yanmin wrote:
  On Mon, 2010-06-21 at 14:45 +0300, Avi Kivity wrote:
Since the guest can use NMI to read the 
  counter, it should have the highest possible priority, and thus it 
  shouldn't see any overflow unless it configured the threshold really low.
 
  If we drop overflow, we can use the RDPMC instruction instead of 
  KVM_PERF_OP_READ.  This allows the guest to allow userspace to read a 
  counter, or prevent userspace from reading the counter, by setting cr4.pce.
  1) para virt perf interface is to hide PMU hardware in host os. Guest os 
  shouldn't
  access PMU hardware directly. We could expose PMU hardware to guest os 
  directly, but
  that would be another guest os PMU support method. It shouldn't be a part 
  of para virt
  interface.
  2) Consider below scenario: PMU counter overflows and NMI causes guest os 
  vmexit to
  host kernel. Host kernel schedules the vcpu thread to another physical cpu 
  before
  vmenter the guest os again. So later on, guest os just RDPMC the counter on 
  another
  cpu.
  
  So I think above discussion is around how to expose PMU hardware to guest 
  os. I will
  also check this method after the para virt interface is done.
 
 You should be able to expose the counters as read-only to the guest. KVM
 allows you to specify whether or not a guest has read, write or
 read/write access. If you allowed read access of the counters that would
 safe a fair bit of hyper calls.
Thanks. KVM is good in register access permission configuration. But things are 
not so
simple like that if we consider real running environment. Host kernel might 
schedule
guest os vcpu thread to other cpus, or other non-kvm processes might preempt 
the vcpu
thread on this cpu.

To support such capability you said, we have to implement the direct exposition 
of PMU
hardware to guest os eventually.

 
 Question is if it is safe to drop overflow support?
Not safe. One of PMU hardware design objectives is to use interrupt or NMI to 
notify
software when event counter overflows. Without overflow support, software need 
poll
the PMU registers looply. That is not good and consumes more cpu resources.

Besides the para virt perf interface, I'm also considering the direct exposition
of PMU hardware to guest os. But that will be another very different 
implementation. We
should not combine it with pv interface. Perhaps our target is to implement 
both, so
unmodified guest os could get support on perf statistics.

Yanmin


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH V2 4/5] ara virt interface of perf to support kvm guest os statistics collection in guest os

2010-06-22 Thread Zhang, Yanmin
On Tue, 2010-06-22 at 10:24 +0200, Jes Sorensen wrote:
 On 06/21/10 11:31, Zhang, Yanmin wrote:
  @@ -583,10 +584,20 @@ static void x86_pmu_disable_all(void)
  }
   }
   
  +#ifdef CONFIG_KVM_PERF
  +static int kvm_hw_perf_enable(void);
  +static int kvm_hw_perf_disable(void);
  +#endif
 
 Please put these prototypes into a header ... and create dummy stubs for
 them when CONFIG_KVM_PERF is not set.
Ok. I just didn't want to touch too much generic codes of perf.

 
   void hw_perf_disable(void)
   {
  struct cpu_hw_events *cpuc = __get_cpu_var(cpu_hw_events);
   
  +#ifdef CONFIG_KVM_PERF
  +   if (!kvm_hw_perf_disable())
  +   return;
  +#endif
 
 If you stub them out we can avoid all the ugly #ifdefs
Ok.

 
  @@ -810,6 +821,11 @@ void hw_perf_enable(void)
  struct hw_perf_event *hwc;
  int i, added = cpuc-n_added;
   
  +#ifdef CONFIG_KVM_PERF
  +   if (!kvm_hw_perf_enable())
  +   return;
  +#endif
 
 and here
Ok.

 
  @@ -1317,6 +1334,11 @@ void __init init_hw_perf_events(void)
   
  pr_info(Performance Events: );
   
  +#ifdef CONFIG_KVM_PERF
  +   if (!kvm_init_hw_perf_events())
  +   return;
  +#endif
 
 and again here :)
Ok. Peter is working out a couple of patches to support multiple PMU. His 
patches
change pmu difition and we might move some into the callbacks. That will become
much clearer.

Yanmin



--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH V2 2/5] ara virt interface of perf to support kvm guest os statistics collection in guest os

2010-06-22 Thread Zhang, Yanmin
On Tue, 2010-06-22 at 12:12 +0300, Avi Kivity wrote:
 On 06/22/2010 05:08 AM, Zhang, Yanmin wrote:
 
  Something that is worrying is that we don't expose group information.
  perf will multiplex the events for us, but there will be a loss in 
  accuracy.
 
   
 #ifdef CONFIG_HAVE_HW_BREAKPOINT
 #includeasm/hw_breakpoint.h
 #endif
  @@ -753,6 +752,20 @@ struct perf_event {
 
perf_overflow_handler_t overflow_handler;
 
  + /*
  +  * pointers used by kvm perf paravirt interface.
  +  *
  +  * 1) Used in host kernel and points to host_perf_shadow which
  +  * has information about guest perf_event
  +  */
  + void*host_perf_shadow;
 
 
  Can we have real types instead of void pointers?
   
  I just want perf generic codes have less dependency on KVM codes.
 
 
 One way to do that and retain type safety is to have
 
  struct perf_client {
struct perf_client_ops *ops;
...
  }
 
 The client (kvm) can do
 
 struct kvm_perf_client {
   struct perf_client pc;
   // kvm specific stuff
 };
 
 the callbacks receive struct perf_client and use container_of to reach 
 the kvm_perf_client that contains it.
Let me double check it.

 
  + /*
  +  * 2) Used in guest kernel and points to guest_perf_shadow which
  +  * is used as a communication area with host kernel. Host kernel
  +  * copies overflow data to it when an event overflows.
  +  */
  + void*guest_perf_shadow;
 
 
  It's strange to see both guest and host parts in the same patch.
  Splitting to separate patches will really help review.
   
  It's a little hard to split the patches if they change the same file. 
  Perhaps
  I could add more statements before the patch when I send it out.
 
 
 With git, it's easy (once you're used to it):
 
# go back one commit:
git reset HEAD^
# selectively add bits:
git add -p
# commit first patch
git commit -s
# selectively add bits:
git add -p
# commit second patch
git commit -s
Thanks for your teaching.

 
 
  @@ -1626,9 +1629,22 @@ void perf_event_task_tick(struct task_st
if (ctx   ctx-nr_events   ctx-nr_events != ctx-nr_active)
rotate = 1;
 
  - perf_ctx_adjust_freq(cpuctx-ctx);
  - if (ctx)
  - perf_ctx_adjust_freq(ctx);
  +#ifdef CONFIG_KVM_PERF
  + if (kvm_para_available()) {
  + /*
  +  * perf_ctx_adjust_freq causes lots of pmu-read which would
  +  * trigger too many vmexit to host kernel. We disable it
  +  * under para virt situation
  +  */
  + adjust_freq = 0;
  + }
  +#endif
 
 
  Perhaps we can have a batch read interface which will read many counters
  at once.
   
  It's a good idea. But that will touch many perf generic codes which causes 
  it's hard
  to maintain or follow future changes.
 
 
 I'm talking about the guest/host interface.  So you have one vmexit and 
 many host perf calls.
I understood what you were speaking. I mean, perf generic codes operate 
perf_event
one by one. At low layer, we just know one perf_event before calling hypercall 
to
vmexit to host kernel.



--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH V2 1/5] ara virt interface of perf to support kvm guest os statistics collection in guest os

2010-06-22 Thread Zhang, Yanmin
On Tue, 2010-06-22 at 10:00 +0200, Jes Sorensen wrote:
 On 06/22/10 09:55, Peter Zijlstra wrote:
  On Tue, 2010-06-22 at 15:47 +0800, Zhang, Yanmin wrote:
  Besides the para virt perf interface, I'm also considering the direct 
  exposition
  of PMU hardware to guest os. 
  
  NAK NAK NAK NAK, we've been over that, its not going to happen, full
  stop!
  
  Use MSR read/write traps and host perf to emulate the hardware. In some
  cases we could allow the reads without trap but that's a later
  optimization.
 
 I believe whats meant here is a PMU compatible interface which is
 partially emulated. Not a handover of the PMU.
Right. We need capture all write to PMU MSR and allows guest os to read MSR 
directly.

Yanmin

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH V2 1/5] ara virt interface of perf to support kvm guest os statistics collection in guest os

2010-06-22 Thread Zhang, Yanmin
On Tue, 2010-06-22 at 09:58 +0200, Jes Sorensen wrote:
 On 06/22/10 09:47, Zhang, Yanmin wrote:
  On Tue, 2010-06-22 at 09:14 +0200, Jes Sorensen wrote:
  On 06/22/10 03:49, Zhang, Yanmin wrote:
  On Mon, 2010-06-21 at 14:45 +0300, Avi Kivity wrote:
  So I think above discussion is around how to expose PMU hardware to guest 
  os. I will
  also check this method after the para virt interface is done.
 
  You should be able to expose the counters as read-only to the guest. KVM
  allows you to specify whether or not a guest has read, write or
  read/write access. If you allowed read access of the counters that would
  safe a fair bit of hyper calls.
  Thanks. KVM is good in register access permission configuration. But things 
  are not so
  simple like that if we consider real running environment. Host kernel might 
  schedule
  guest os vcpu thread to other cpus, or other non-kvm processes might 
  preempt the vcpu
  thread on this cpu.
  
  To support such capability you said, we have to implement the direct 
  exposition of PMU
  hardware to guest os eventually.
 
 If the guest is rescheduled to another CPU, or you get a preemption, you
 have a VMEXIT. The vcpu thread will not migrate while it is running, so
 you can handle it while the the VMEXIT is being serviced.
 
 Exposing the counters read-only would save a lot of overhead for sure.
  Question is if it is safe to drop overflow support?
  Not safe. One of PMU hardware design objectives is to use interrupt or NMI 
  to notify
  software when event counter overflows. Without overflow support, software 
  need poll
  the PMU registers looply. That is not good and consumes more cpu resources.
 
 Here is an idea, how about having the overflow NMI in the host trigger a
 flag that causes the PMU register read to trap and get special handling?
 That way you could propagate the overflow back down to the guest.
That doesn't resolve the issue that guest os software has to poll register.


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH V2 3/5] ara virt interface of perf to support kvm guest os statistics collection in guest os

2010-06-22 Thread Zhang, Yanmin
On Tue, 2010-06-22 at 11:04 +0300, Avi Kivity wrote:
 On 06/22/2010 08:47 AM, Zhang, Yanmin wrote:
  On Mon, 2010-06-21 at 16:56 +0300, Gleb Natapov wrote:
 
  On Mon, Jun 21, 2010 at 05:31:43PM +0800, Zhang, Yanmin wrote:
   
  The 3rd patch is to implement para virt perf at host kernel.
 
  Signed-off-by: Zhang Yanminyanmin_zh...@linux.intel.com
 
 
  snip
 
 
  +
  +static void kvm_copy_event_to_guest(struct kvm_vcpu *vcpu,
  + struct perf_event *host_event)
  +{
  + struct host_perf_shadow *shadow = host_event-host_perf_shadow;
  + struct guest_perf_event counter;
  + int ret;
  + s32 overflows;
  +
  + ret = kvm_read_guest(vcpu-kvm, shadow-guest_event_addr,
  + counter, sizeof(counter));
  + if (ret  0)
  + return;
  +
  +again:
  + overflows = atomic_read(shadow-counter.overflows);
  + if (atomic_cmpxchg(shadow-counter.overflows, overflows, 0) !=
  + overflows)
  + goto again;
  +
  + counter.count = shadow-counter.count;
  + atomic_add(overflows,counter.overflows);
  +
  + kvm_write_guest(vcpu-kvm,
  + shadow-guest_event_addr,
  + counter,
  + sizeof(counter));
 
  Those kind of interfaces worry me since the can cause bugs that are
  very hard to catch.  What if guest enables some events and crashes into
  kdump kernel (or kexec new kernel) without reseting HW. Now host may
  write over guest memory without guest expecting it. Do you handle this
  scenario in a guest side? I think you need to register reboot notify
  and disable events from there.
   
  Sorry for missing your comments.
 
  My patch could take care of dead guest os by cleaning up all events in 
  function
  kvm_arch_destroy_vm, so all events are closed if host user kills the guest
  qemu process.
 
 
 
 A reset does not involve destroying a vm; you have to clean up as part 
 of the rest process.
What does 'reset' here mean? Is it a reboot or halt? If it's a halt, it involves
destroying a vm. If a host user just kills the qemu process, is it a reset 
involving
destroying a vm?

 
 Note MSRs are automatically cleared, so that's something in favour of an 
 MSR interface.
 
  As for your scenario, I will register reboot notify and add a new pv perf
  hypercall interface to vmexit to host kernel to do cleanup.
 
 
 You aren't guaranteed a reboot notifier will be called.  On the other 
 hand, we need a kexec handler.
ordinary kexec calls all reboot notifiers. Only crash kexec doesn't call them.
I will implement a machine_ops.crash_shutdown callback.


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH V2 3/5] ara virt interface of perf to support kvm guest os statistics collection in guest os

2010-06-22 Thread Zhang, Yanmin
On Tue, 2010-06-22 at 11:29 +0300, Avi Kivity wrote:
 On 06/22/2010 06:12 AM, Zhang, Yanmin wrote:
  On Mon, 2010-06-21 at 15:33 +0300, Avi Kivity wrote:
 
  On 06/21/2010 12:31 PM, Zhang, Yanmin wrote:
   
  The 3rd patch is to implement para virt perf at host kernel.
 
 
  @@ -64,6 +73,85 @@ struct kvm_mmu_op_release_pt {
 #ifdef __KERNEL__
 #includeasm/processor.h
 
 
  +/*
  + * In host kernel, perf_event-host_perf_shadow points to
  + * host_perf_shadow which records some information
  + * about the guest.
  + */
  +struct host_perf_shadow {
  + /* guest perf_event id passed from guest os */
  + int id;
  + /*
  +  * Host kernel saves data into data member counter firstly.
  +  * kvm will get data from this counter and calls kvm functions
  +  * to copy or add data back to guets os before entering guest os
  +  * next time
  +  */
  + struct guest_perf_event counter;
  + /* guest_event_addr is gpa_t pointing to guest os guest_perf_event*/
  + __u64 guest_event_addr;
 
 
  So just use gpa_t as the type.
   
  host_perf_shadow-guest_event_addr is a copy of 
  guest_event_addr-guest_event_addr.
  As the latter's type is __u64 as the interface between guest os and host 
  os, I use
  __u64 as the type of host_perf_shadow-guest_event_addr.
 
 
 Right.  Bug gpa_t is more descriptive.  We have a lot of address spaces 
 in kvm.
I change it to gpa_t.

 
  +
  + /*
  +  * Link to  of kvm.kvm_arch.shadow_hash_table
  +  */
  + struct list_head shadow_entry;
  + struct kvm_vcpu *vcpu;
  +
  + struct perf_event *host_event;
  + /*
  +  * Below counter is to prevent malicious guest os to try to
  +  * close/enable event at the same time.
  +  */
  + atomic_t ref_counter;
 
 
  If events are made per-vcpu (like real hardware), races become impossible.
   
  This design is to deal with a task context perf collection in guest os.
  Scenario 1:
  1) guest os starts to collect statistics of process A on vcpu 0;
  2) process A is scheduled to vcpu 1. Then, the perf_event at host side need
  to be moved to VCPU 1 's thread. With the per KVM instance design, we 
  needn't
  move host_perf_shadow among vcpus.
 
 
 First, the guest already knows how to deal with per-cpu performance 
 monitors, since that's how most (all) hardware works.  So we aren't 
 making the guest more complex, and on the other hand we simplify the host.
I agree that we need keep things simple.

 
 Second, if process A is migrated, and the guest uses per-process 
 counters, the guest will need to stop/start the counter during the 
 migration.  This will cause the host to migrate the counter,
Agree. My patches do so.

Question: Where does host migrate the counter?
The perf event at host side is bound to specific vcpu thread.

  so while we 
 didn't move the counter to a different vcpu,
Disagree here. If process A on vcpu 0 in guest os is migrated to vcpu 1,
host has to move process A's perf_event to vcpu 1 thread.


  we still have to move it to 
 a different cpu.
 
  Scenario 2:
  1) guest os creates a perf_event at host side on vcpu 0;
  2) malicious guest os calls close to delete the host perf_event on vcpu 1, 
  but
  enables the perf_event on vcpu0 at the same time. When close thread runs to 
  get the
  host_perf_shadow from the list, enable thread also gets it. Then, close 
  thread
  deletes the perf_event, and enable thread will cause host kernel panic when 
  using
  host_perf_shadow.
 
 
 With per-vcpu events, this can't happen.  Each vcpu has its own set of 
 perf events in their own ID namespace.  vcpu 0 can't touch vcpu 1's 
 events even if it knows their IDs.
What does 'touch' mean here?

With the task (process) context event in guest os, we have to migrate the event
among vcpus when guest process scheduler balances processes among vcpus. So all
events in a guest os instance uses the same ID name space.

What you mentioned is really right when the event is cpu context event, but
not with task context event.

 
  Please move this structure to include/linux/kvm_host.h.  No need to spam
  kvm_para.h.  Note it's not x86 specific (though you can leave arch
  enabling to arch maintainers).
   
  Ok. Originally, I wanted to do so, but I'm afraid other arch might be not 
  happy.
 
 
 Copying kvm arch maintainers.  Please keep an eye on this topic and 
 holler if something doesn't fit your arch.
 
 
   
  @@ -24,6 +24,7 @@
 #includeasm/desc.h
 #includeasm/mtrr.h
 #includeasm/msr-index.h
  +#includeasm/perf_event.h
 
 #define KVM_MAX_VCPUS 64
 #define KVM_MEMORY_SLOTS 32
  @@ -360,6 +361,18 @@ struct kvm_vcpu_arch {
 
/* fields used by HYPER-V emulation */
u64 hv_vapic;
  +
  + /*
  +  * Fields used by PARAVIRT perf interface:
  +  *
  +  * kvm checks overflow_events before entering guest os,
  +  * and copy data back to guest os.
  +  * event_mutex is to avoid a race between NMI perf event overflow
  +  * handler, event close, and enable

[PATCH V2 1/5] ara virt interface of perf to support kvm guest os statistics collection in guest os

2010-06-21 Thread Zhang, Yanmin
Here is the version 2.

ChangeLog since V1: Mostly changes based on Avi's suggestions.
1) Use a id to identify the perf_event between host and guest;
2) Changes lots of codes to deal with malicious guest os;
3) Add a perf_event number limitation per gust os instance;
4) Support guest os on the top of another guest os scenario. But
I didn't test it yet as there is no environment. The design is to
add 2 pointers in struct perf_event. One is used by host and the
other is used by guest.
5) Fix the bug to support 'perf stat'. The key is sync count data
back to guest when guest tries to disable the perf_event at host
side.
6) Add a clear ABI of PV perf.

I don't implement live migration feature.

Avi,
Is live migration necessary on pv perf support?


Based on Ingo's idea, I implement a para virt interface for perf to support
statistics collection in guest os. That means we could run tool perf in guest
os directly.

Great thanks to Peter Zijlstra. He is really the architect and gave me 
architecture
design suggestions. I also want to thank Yangsheng and LinMing for their 
generous
help.

The design is:

1) Add a kvm_pmu whose callbacks mostly just calls hypercall to vmexit to host 
kernel;
2) Create a host perf_event per guest perf_event;
3) Host kernel syncs perf_event count/overflows data changes to guest perf_event
when processing perf_event overflows after NMI arrives. Host kernel inject NMI 
to guest
kernel if a guest event overflows.
4) Guest kernel goes through all enabled event on current cpu and output data 
when they
overflows.
5) No change in user space.

Below is an example.

#perf top
--
   PerfTop:7954 irqs/sec  kernel:79.5%  exact:  0.0% [1000Hz cycles],  
(all, 8 CPUs)
--

 samples  pcnt function DSO
 ___ _  
_

 5315.00  4.9% copy_user_generic_string 
/lib/modules/2.6.35-rc1-tip-guestperf/build/vmlinux  
 3342.00  3.1% add_preempt_count
/lib/modules/2.6.35-rc1-tip-guestperf/build/vmlinux  
 3338.00  3.1% sub_preempt_count
/lib/modules/2.6.35-rc1-tip-guestperf/build/vmlinux  
 2454.00  2.3% pvclock_clocksource_read 
/lib/modules/2.6.35-rc1-tip-guestperf/build/vmlinux  
 2434.00  2.3% tcp_sendmsg  
/lib/modules/2.6.35-rc1-tip-guestperf/build/vmlinux  
 2090.00  1.9% child_run
/bm/tmp/benchmarks/run_bmtbench/dbench/dbench-3.03/tbench
 2081.00  1.9% debug_smp_processor_id   
/lib/modules/2.6.35-rc1-tip-guestperf/build/vmlinux  
 2003.00  1.9% __GI_strstr  /lib64/libc-2.11.so 
 
 1999.00  1.9% __strchr_sse2/lib64/libc-2.11.so 
 
 1983.00  1.8% tcp_ack  
/lib/modules/2.6.35-rc1-tip-guestperf/build/vmlinux  
 1800.00  1.7% tcp_transmit_skb 
/lib/modules/2.6.35-rc1-tip-guestperf/build/vmlinux  
 1727.00  1.6% schedule 
/lib/modules/2.6.35-rc1-tip-guestperf/build/vmlinux  
 1706.00  1.6% __libc_recv  /lib64/libc-2.11.so 
 
 1702.00  1.6% __GI_memchr  /lib64/libc-2.11.so 
 
 1580.00  1.5% tcp_recvmsg  
/lib/modules/2.6.35-rc1-tip-guestperf/build/vmlinux  

The patch is against tip/master tree of June 20st.

Signed-off-by: Zhang Yanmin yanmin_zh...@linux.intel.com

---

--- linux-2.6_tip0620/Documentation/kvm/paravirt-perf.txt   1970-01-01 
08:00:00.0 +0800
+++ linux-2.6_tip0620perfkvm/Documentation/kvm/paravirt-perf.txt
2010-06-21 15:21:39.312999849 +0800
@@ -0,0 +1,133 @@
+The x86 kvm paravirt perf event interface
+===
+
+This paravirt interface is responsible for supporting guest os perf event
+collections. If guest os supports this interface, users could run command
+perf in guest os directly.
+
+Design
+
+
+Guest os calls a series of hypercalls to communicate with host kernel to
+create/enable/disable/close perf events. Host kernel notifies guest os
+by injecting an NMI to guest os when an event overflows. Guets os need
+go through all its active events to check if they overflow, and output
+performance statistics if they do.
+
+ABI
+=
+
+1) Detect if host kernel supports paravirt perf interface:
+#define KVM_FEATURE_PV_PERF   4
+Host kernel defines above cpuid bit. Guest os calls

[PATCH V2 2/5] ara virt interface of perf to support kvm guest os statistics collection in guest os

2010-06-21 Thread Zhang, Yanmin
The 2nd patch is to change the definition of perf_event to facilitate
perf attr copy when a hypercall happens.

Signed-off-by: Zhang Yanmin yanmin_zh...@linux.intel.com

---

--- linux-2.6_tip0620/include/linux/perf_event.h2010-06-21 
15:19:52.821999849 +0800
+++ linux-2.6_tip0620perfkvm/include/linux/perf_event.h 2010-06-21 
16:53:49.283999849 +0800
@@ -188,7 +188,10 @@ struct perf_event_attr {
__u64   sample_type;
__u64   read_format;
 
-   __u64   disabled   :  1, /* off by default*/
+   union {
+   __u64   flags;
+   struct {
+   __u64   disabled   :  1, /* off by default*/
inherit:  1, /* children inherit it   */
pinned :  1, /* must always be on PMU */
exclusive  :  1, /* only group on PMU */
@@ -217,6 +220,8 @@ struct perf_event_attr {
mmap_data  :  1, /* non-exec mmap data*/
 
__reserved_1   : 46;
+   };
+   };
 
union {
__u32   wakeup_events;/* wakeup every n events */
@@ -465,12 +470,6 @@ enum perf_callchain_context {
 # include asm/local64.h
 #endif
 
-struct perf_guest_info_callbacks {
-   int (*is_in_guest) (void);
-   int (*is_user_mode) (void);
-   unsigned long (*get_guest_ip) (void);
-};
-
 #ifdef CONFIG_HAVE_HW_BREAKPOINT
 #include asm/hw_breakpoint.h
 #endif
@@ -753,6 +752,20 @@ struct perf_event {
 
perf_overflow_handler_t overflow_handler;
 
+   /*
+* pointers used by kvm perf paravirt interface.
+*
+* 1) Used in host kernel and points to host_perf_shadow which
+* has information about guest perf_event
+*/
+   void*host_perf_shadow;
+   /*
+* 2) Used in guest kernel and points to guest_perf_shadow which
+* is used as a communication area with host kernel. Host kernel
+* copies overflow data to it when an event overflows.
+*/
+   void*guest_perf_shadow;
+
 #ifdef CONFIG_EVENT_TRACING
struct ftrace_event_call*tp_event;
struct event_filter *filter;
@@ -838,6 +851,16 @@ struct perf_output_handle {
int sample;
 };
 
+struct perf_guest_info_callbacks {
+   /* Support collect guest statistics from host side */
+   int (*is_in_guest) (void);
+   int (*is_user_mode) (void);
+   unsigned long (*get_guest_ip) (void);
+
+   /* Support paravirt interface */
+   void (*copy_event_to_shadow) (struct perf_event *event, int overflows);
+};
+
 #ifdef CONFIG_PERF_EVENTS
 
 /*
@@ -871,6 +894,10 @@ perf_event_create_kernel_counter(struct 
perf_overflow_handler_t callback);
 extern u64 perf_event_read_value(struct perf_event *event,
 u64 *enabled, u64 *running);
+extern void perf_event_output(struct perf_event *event, int nmi,
+   struct perf_sample_data *data, struct pt_regs *regs);
+void perf_event_attach(struct perf_event *event);
+void perf_event_detach(struct perf_event *event);
 
 struct perf_sample_data {
u64 type;
@@ -1023,6 +1050,14 @@ perf_event_task_sched_in(struct task_str
 static inline void
 perf_event_task_sched_out(struct task_struct *task,
struct task_struct *next)   { }
+
+static inline void
+perf_event_output(struct perf_event *event, int nmi,
+   struct perf_sample_data *data, struct pt_regs *regs){ }
+
+static inline void perf_event_attach(struct perf_event *event) { }
+static inline void perf_event_detach(struct perf_event *event) { }
+
 static inline void
 perf_event_task_tick(struct task_struct *task) { }
 static inline int perf_event_init_task(struct task_struct *child)  { 
return 0; }
--- linux-2.6_tip0620/kernel/watchdog.c 2010-06-21 15:20:48.517999849 +0800
+++ linux-2.6_tip0620perfkvm/kernel/watchdog.c  2010-06-21 15:21:39.315999849 
+0800
@@ -197,8 +197,6 @@ static struct perf_event_attr wd_hw_attr
.type   = PERF_TYPE_HARDWARE,
.config = PERF_COUNT_HW_CPU_CYCLES,
.size   = sizeof(struct perf_event_attr),
-   .pinned = 1,
-   .disabled   = 1,
 };
 
 /* Callback function for perf event subsystem */
@@ -361,6 +359,8 @@ static int watchdog_nmi_enable(int cpu)
/* Try to register using hardware perf events */
wd_attr = wd_hw_attr;
wd_attr-sample_period = hw_nmi_get_sample_period();
+   wd_attr-pinned = 1;
+   wd_attr-disabled = 1;
event = perf_event_create_kernel_counter(wd_attr, cpu, -1

[PATCH V2 3/5] ara virt interface of perf to support kvm guest os statistics collection in guest os

2010-06-21 Thread Zhang, Yanmin
The 3rd patch is to implement para virt perf at host kernel.

Signed-off-by: Zhang Yanmin yanmin_zh...@linux.intel.com

---

--- linux-2.6_tip0620/arch/x86/include/asm/kvm_para.h   2010-06-21 
15:19:38.992999849 +0800
+++ linux-2.6_tip0620perfkvm/arch/x86/include/asm/kvm_para.h2010-06-21 
15:21:39.308999849 +0800
@@ -2,6 +2,7 @@
 #define _ASM_X86_KVM_PARA_H
 
 #include linux/types.h
+#include linux/list.h
 #include asm/hyperv.h
 
 /* This CPUID returns the signature 'KVMKVMKVM' in ebx, ecx, and edx.  It
@@ -19,7 +20,8 @@
 /* This indicates that the new set of kvmclock msrs
  * are available. The use of 0x11 and 0x12 is deprecated
  */
-#define KVM_FEATURE_CLOCKSOURCE23
+#define KVM_FEATURE_CLOCKSOURCE2   3
+#define KVM_FEATURE_PV_PERF4
 
 /* The last 8 bits are used to indicate how to interpret the flags field
  * in pvclock structure. If no bits are set, all flags are ignored.
@@ -33,7 +35,14 @@
 #define MSR_KVM_WALL_CLOCK_NEW  0x4b564d00
 #define MSR_KVM_SYSTEM_TIME_NEW 0x4b564d01
 
-#define KVM_MAX_MMU_OP_BATCH   32
+#define KVM_MAX_MMU_OP_BATCH   32
+
+/* Operations for KVM_PERF_OP */
+#define KVM_PERF_OP_OPEN   1
+#define KVM_PERF_OP_CLOSE  2
+#define KVM_PERF_OP_ENABLE 3
+#define KVM_PERF_OP_DISABLE4
+#define KVM_PERF_OP_READ   5
 
 /* Operations for KVM_HC_MMU_OP */
 #define KVM_MMU_OP_WRITE_PTE1
@@ -64,6 +73,85 @@ struct kvm_mmu_op_release_pt {
 #ifdef __KERNEL__
 #include asm/processor.h
 
+/*
+ * data communication area about perf_event between
+ * Host kernel and guest kernel
+ */
+struct guest_perf_event {
+   u64 count;
+   atomic_t overflows;
+};
+
+/*
+ * In host kernel, perf_event-host_perf_shadow points to
+ * host_perf_shadow which records some information
+ * about the guest.
+ */
+struct host_perf_shadow {
+   /* guest perf_event id passed from guest os */
+   int id;
+   /*
+* Host kernel saves data into data member counter firstly.
+* kvm will get data from this counter and calls kvm functions
+* to copy or add data back to guets os before entering guest os
+* next time
+*/
+   struct guest_perf_event counter;
+   /* guest_event_addr is gpa_t pointing to guest os guest_perf_event*/
+   __u64 guest_event_addr;
+
+   /*
+* Link to  of kvm.kvm_arch.shadow_hash_table
+*/
+   struct list_head shadow_entry;
+   struct kvm_vcpu *vcpu;
+
+   struct perf_event *host_event;
+   /*
+* Below counter is to prevent malicious guest os to try to
+* close/enable event at the same time.
+*/
+   atomic_t ref_counter;
+};
+
+/*
+ * In guest kernel, perf_event-guest_shadow points to
+ * guest_perf_shadow which records some information
+ * about the guest.
+ */
+struct guest_perf_shadow {
+   /* guest perf_event id passed from guest os */
+   int id;
+   /*
+* Host kernel kvm saves data into data member counter
+*/
+   struct guest_perf_event counter;
+};
+
+/*
+ * guest_perf_attr is used when guest calls hypercall to
+ * open a new perf_event at host side. Mostly, it's a copy of
+ * perf_event_attr and deletes something not used by host kernel.
+ */
+struct guest_perf_attr {
+   __u32   type;
+   __u64   config;
+   __u64   sample_period;
+   __u64   sample_type;
+   __u64   read_format;
+   __u64   flags;
+   __u32   bp_type;
+   __u64   bp_addr;
+   __u64   bp_len;
+};
+
+struct guest_perf_event_param {
+   __u64 attr_addr;
+   __u64 guest_event_addr;
+   /* In case there is an alignment issue, we put id as the last one */
+   int id;
+};
+
 extern void kvmclock_init(void);
 
 
--- linux-2.6_tip0620/arch/x86/include/asm/kvm_host.h   2010-06-21 
15:19:39.01849 +0800
+++ linux-2.6_tip0620perfkvm/arch/x86/include/asm/kvm_host.h2010-06-21 
15:21:39.308999849 +0800
@@ -24,6 +24,7 @@
 #include asm/desc.h
 #include asm/mtrr.h
 #include asm/msr-index.h
+#include asm/perf_event.h
 
 #define KVM_MAX_VCPUS 64
 #define KVM_MEMORY_SLOTS 32
@@ -360,6 +361,18 @@ struct kvm_vcpu_arch {
 
/* fields used by HYPER-V emulation */
u64 hv_vapic;
+
+   /*
+* Fields used by PARAVIRT perf interface:
+*
+* kvm checks overflow_events before entering guest os,
+* and copy data back to guest os.
+* event_mutex is to avoid a race between NMI perf event overflow
+* handler, event close, and enable/disable.
+*/
+   struct mutex event_mutex;
+   int overflows;
+   struct perf_event *overflow_events[X86_PMC_IDX_MAX];
 };
 
 struct kvm_mem_alias {
@@ -377,6 +390,9 @@ struct kvm_mem_aliases {
int naliases;
 };
 
+#define KVM_PARAVIRT_PERF_EVENT_ENTRY_BITS (10

[PATCH V2 5/5] ara virt interface of perf to support kvm guest os statistics collection in guest os

2010-06-21 Thread Zhang, Yanmin
The 5th patch is applied to the latest qemu-kvm tree.

Signed-off-by: Zhang Yanmin yanmin_zh...@linux.intel.com

---

diff -Nraup qemu-kvm_0621/kvm/include/linux/kvm.h 
qemu-kvm_0621_perf/kvm/include/linux/kvm.h
--- qemu-kvm_0621/kvm/include/linux/kvm.h   2010-06-21 11:00:28.0 
+0800
+++ qemu-kvm_0621_perf/kvm/include/linux/kvm.h  2010-06-21 13:23:51.537999849 
+0800
@@ -530,6 +530,7 @@ struct kvm_enable_cap {
 #ifdef __KVM_HAVE_XCRS
 #define KVM_CAP_XCRS 56
 #endif
+#define KVM_CAP_PV_PERF 57
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
diff -Nraup qemu-kvm_0621/kvm/include/x86/asm/kvm_para.h 
qemu-kvm_0621_perf/kvm/include/x86/asm/kvm_para.h
--- qemu-kvm_0621/kvm/include/x86/asm/kvm_para.h2010-06-21 
11:00:28.0 +0800
+++ qemu-kvm_0621_perf/kvm/include/x86/asm/kvm_para.h   2010-06-21 
13:27:04.375999849 +0800
@@ -15,6 +15,7 @@
 #define KVM_FEATURE_CLOCKSOURCE0
 #define KVM_FEATURE_NOP_IO_DELAY   1
 #define KVM_FEATURE_MMU_OP 2
+#define KVM_FEATURE_PV_PERF4
 
 #define MSR_KVM_WALL_CLOCK  0x11
 #define MSR_KVM_SYSTEM_TIME 0x12
diff -Nraup qemu-kvm_0621/target-i386/kvm.c qemu-kvm_0621_perf/target-i386/kvm.c
--- qemu-kvm_0621/target-i386/kvm.c 2010-06-21 11:00:29.0 +0800
+++ qemu-kvm_0621_perf/target-i386/kvm.c2010-06-21 13:00:14.136999850 
+0800
@@ -150,6 +150,9 @@ struct kvm_para_features {
 #ifdef KVM_CAP_PV_MMU
 { KVM_CAP_PV_MMU, KVM_FEATURE_MMU_OP },
 #endif
+#ifdef KVM_CAP_PV_PERF
+{ KVM_CAP_PV_PERF, KVM_FEATURE_PV_PERF },
+#endif
 { -1, -1 }
 };


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH V2 4/5] ara virt interface of perf to support kvm guest os statistics collection in guest os

2010-06-21 Thread Zhang, Yanmin
The 4th patch is to implement para virt perf at guest side.

Signed-off-by: Zhang Yanmin yanmin_zh...@linux.intel.com

---

--- linux-2.6_tip0620/arch/x86/Kconfig  2010-06-21 15:19:39.180999849 +0800
+++ linux-2.6_tip0620perfkvm/arch/x86/Kconfig   2010-06-21 15:21:39.30849 
+0800
@@ -552,6 +552,14 @@ config KVM_GUEST
  This option enables various optimizations for running under the KVM
  hypervisor.
 
+config KVM_PERF
+   bool KVM Guest perf support
+   select PARAVIRT
+   select PERF_EVENT
+   ---help---
+ This option enables various optimizations for running perf in
+ guest os under the KVM hypervisor.
+
 source arch/x86/lguest/Kconfig
 
 config PARAVIRT
--- linux-2.6_tip0620/arch/x86/kernel/cpu/perf_event.c  2010-06-21 
15:19:39.964999849 +0800
+++ linux-2.6_tip0620perfkvm/arch/x86/kernel/cpu/perf_event.c   2010-06-21 
16:44:36.602999849 +0800
@@ -25,6 +25,7 @@
 #include linux/highmem.h
 #include linux/cpu.h
 #include linux/bitops.h
+#include linux/kvm_para.h
 
 #include asm/apic.h
 #include asm/stacktrace.h
@@ -583,10 +584,20 @@ static void x86_pmu_disable_all(void)
}
 }
 
+#ifdef CONFIG_KVM_PERF
+static int kvm_hw_perf_enable(void);
+static int kvm_hw_perf_disable(void);
+#endif
+
 void hw_perf_disable(void)
 {
struct cpu_hw_events *cpuc = __get_cpu_var(cpu_hw_events);
 
+#ifdef CONFIG_KVM_PERF
+   if (!kvm_hw_perf_disable())
+   return;
+#endif
+
if (!x86_pmu_initialized())
return;
 
@@ -810,6 +821,11 @@ void hw_perf_enable(void)
struct hw_perf_event *hwc;
int i, added = cpuc-n_added;
 
+#ifdef CONFIG_KVM_PERF
+   if (!kvm_hw_perf_enable())
+   return;
+#endif
+
if (!x86_pmu_initialized())
return;
 
@@ -1264,6 +1280,7 @@ x86_get_event_constraints(struct cpu_hw_
 #include perf_event_intel_lbr.c
 #include perf_event_intel_ds.c
 #include perf_event_intel.c
+#include perf_event_kvm.c
 
 static int __cpuinit
 x86_pmu_notifier(struct notifier_block *self, unsigned long action, void *hcpu)
@@ -1317,6 +1334,11 @@ void __init init_hw_perf_events(void)
 
pr_info(Performance Events: );
 
+#ifdef CONFIG_KVM_PERF
+   if (!kvm_init_hw_perf_events())
+   return;
+#endif
+
switch (boot_cpu_data.x86_vendor) {
case X86_VENDOR_INTEL:
err = intel_pmu_init();
@@ -1541,6 +1563,13 @@ const struct pmu *hw_perf_event_init(str
const struct pmu *tmp;
int err;
 
+#ifdef CONFIG_KVM_PERF
+   if (kvm_para_available()) {
+   tmp = kvm_hw_perf_event_init(event);
+   return tmp;
+   }
+#endif
+
err = __hw_perf_event_init(event);
if (!err) {
/*
--- linux-2.6_tip0620/arch/x86/kernel/cpu/perf_event_kvm.c  1970-01-01 
08:00:00.0 +0800
+++ linux-2.6_tip0620perfkvm/arch/x86/kernel/cpu/perf_event_kvm.c   
2010-06-21 16:44:56.735999849 +0800
@@ -0,0 +1,426 @@
+/*
+ * Performance events
+ *
+ * Copyright (C) 2010 Intel Corporation
+ * Zhang Yanmin yanmin.zh...@intel.com
+ *
+ *  For licencing details see kernel-base/COPYING
+ */
+
+#ifdef CONFIG_KVM_PERF
+
+static atomic_t guest_perf_id; /*Global id counter per guest os*/
+
+static inline int get_new_perf_event_id(void)
+{
+   return atomic_inc_return(guest_perf_id);
+}
+
+#ifdef CONFIG_X86_LOCAL_APIC
+
+static bool kvm_reserve_pmc_hardware(void)
+{
+   if (nmi_watchdog == NMI_LOCAL_APIC)
+   disable_lapic_nmi_watchdog();
+
+   return true;
+}
+
+static void kvm_release_pmc_hardware(void)
+{
+   if (nmi_watchdog == NMI_LOCAL_APIC)
+   enable_lapic_nmi_watchdog();
+}
+
+#else
+
+static bool kvm_reserve_pmc_hardware(void) { return true; }
+static void kvm_release_pmc_hardware(void) {}
+
+#endif
+
+static void kvm_hw_perf_event_destroy(struct perf_event *event)
+{
+   struct guest_perf_shadow *shadow = event-guest_perf_shadow;
+
+   BUG_ON(!shadow);
+   kvm_hypercall2(KVM_PERF_OP, KVM_PERF_OP_CLOSE, shadow-id);
+
+   kfree(shadow);
+   event-guest_perf_shadow = NULL;
+
+   if (atomic_dec_and_mutex_lock(active_events, pmc_reserve_mutex)) {
+   kvm_release_pmc_hardware();
+   mutex_unlock(pmc_reserve_mutex);
+   }
+}
+
+/* The guest might also run as a host */
+static int check_ontop_guest_overflow(struct perf_event *event, int overflows)
+{
+   struct host_perf_shadow *host_shadow = event-host_perf_shadow;
+   if (!host_shadow)
+   return 0;
+
+   if (perf_guest_cbs)
+   perf_guest_cbs-copy_event_to_shadow(event, overflows);
+
+   return 1;
+}
+
+static int
+check_event_overflow(struct perf_event *event, struct pt_regs *regs)
+{
+   struct perf_sample_data data;
+   struct guest_perf_shadow *guest_shadow = event-guest_perf_shadow;
+   s32 overflows;
+   int i;
+   int handled = 0;
+
+   local64_set(event-count, guest_shadow

Re: [PATCH V2 1/5] ara virt interface of perf to support kvm guest os statistics collection in guest os

2010-06-21 Thread Zhang, Yanmin
On Mon, 2010-06-21 at 14:45 +0300, Avi Kivity wrote:
 On 06/21/2010 12:31 PM, Zhang, Yanmin wrote:
  Here is the version 2.
 
  ChangeLog since V1: Mostly changes based on Avi's suggestions.
  1) Use a id to identify the perf_event between host and guest;
  2) Changes lots of codes to deal with malicious guest os;
  3) Add a perf_event number limitation per gust os instance;
  4) Support guest os on the top of another guest os scenario. But
  I didn't test it yet as there is no environment. The design is to
  add 2 pointers in struct perf_event. One is used by host and the
  other is used by guest.
  5) Fix the bug to support 'perf stat'. The key is sync count data
  back to guest when guest tries to disable the perf_event at host
  side.
  6) Add a clear ABI of PV perf.
 
 
 
 Please use meaningful subject lines for individual patches.
Yes, I should. I rushed to send the patches out yesterday afternoon as I need
to take company shuttle back home.
 
  I don't implement live migration feature.
 
  Avi,
  Is live migration necessary on pv perf support?
 
 
 Yes.
Ok. With the PV perf interface, host perf saves all counter info into perf_event
structure. To support live migration, we need save all host perf_event 
structure,
or at least perf_event-count and perf_event-attr. Then, recreate the host 
perf_event
after migration.

I check qemu-kvm codes and it seems most live migration is to save cpu states.
So it seems it's hard for perf pv interface to match current live migration. 
Any suggestion?

 
  --- linux-2.6_tip0620/Documentation/kvm/paravirt-perf.txt   1970-01-01 
  08:00:00.0 +0800
  +++ linux-2.6_tip0620perfkvm/Documentation/kvm/paravirt-perf.txt
  2010-06-21 15:21:39.312999849 +0800
  @@ -0,0 +1,133 @@
  +The x86 kvm paravirt perf event interface
  +===
  +
  +This paravirt interface is responsible for supporting guest os perf event
  +collections. If guest os supports this interface, users could run command
  +perf in guest os directly.
  +
  +Design
  +
  +
  +Guest os calls a series of hypercalls to communicate with host kernel to
  +create/enable/disable/close perf events. Host kernel notifies guest os
  +by injecting an NMI to guest os when an event overflows. Guets os need
  +go through all its active events to check if they overflow, and output
  +performance statistics if they do.
  +
  +ABI
  +=
  +
  +1) Detect if host kernel supports paravirt perf interface:
  +#define KVM_FEATURE_PV_PERF   4
  +Host kernel defines above cpuid bit. Guest os calls cpuid to check if host
  +os retuns this bit. If it does, it mean host kernel supports paravirt perf
  +interface.
  +
  +2) Open a new event at host side:
  +kvm_hypercall3(KVM_PERF_OP, KVM_PERF_OP_OPEN, param_addr_low32bit,
  +param_addr_high32bit);
  +
  +#define KVM_PERF_OP3
  +/* Operations for KVM_PERF_OP */
  +#define KVM_PERF_OP_OPEN1
  +#define KVM_PERF_OP_CLOSE   2
  +#define KVM_PERF_OP_ENABLE  3
  +#define KVM_PERF_OP_DISABLE 4
  +#define KVM_PERF_OP_READ5
 
 
  +/*
  + * guest_perf_attr is used when guest calls hypercall to
  + * open a new perf_event at host side. Mostly, it's a copy of
  + * perf_event_attr and deletes something not used by host kernel.
  + */
  +struct guest_perf_attr {
  +__u32   type;
 
 
 Need padding here, otherwise the structure is different on 32-bit and 
 64-bit guests.
Ok. I will change it.

 
  +__u64   config;
  +__u64   sample_period;
  +__u64   sample_type;
  +__u64   read_format;
  +__u64   flags;
 
 
 and here.
I will rearrange the whole structure.

 
  +__u32   bp_type;
  +__u64   bp_addr;
  +__u64   bp_len;
 
 
 Do we actually support breakpoints on the guest?  Note the hardware 
 breakpoints are also usable by the guest, so if the host uses them, we 
 won't be able to emulate them correctly.
   We can let the guest to 
 breakpoint perf monitoring itself and drop this feature.
Ok, I will disable breakpoint feature of pv interface.

 
  +};
 
 
 What about documentation for individual fields?  Esp. type, config, and 
 flags, but also the others.
They are really perf implementation specific. Even perf_event definition
has no document but code comments. I will add simple explanation around
the new structure definition.

 
  +/*
  + * data communication area about perf_event between
  + * Host kernel and guest kernel
  + */
  +struct guest_perf_event {
  +u64 count;
  +atomic_t overflows;
 
 
 Please use __u64 and __u32, assume guests don't have Linux internal 
 types (though of course the first guest _is_ Linux).
This structure is used by both host

Re: [PATCH V2 2/5] ara virt interface of perf to support kvm guest os statistics collection in guest os

2010-06-21 Thread Zhang, Yanmin
On Mon, 2010-06-21 at 15:00 +0300, Avi Kivity wrote:
 On 06/21/2010 12:31 PM, Zhang, Yanmin wrote:
  The 2nd patch is to change the definition of perf_event to facilitate
  perf attr copy when a hypercall happens.
 
  Signed-off-by: Zhang Yanminyanmin_zh...@linux.intel.com
 
  ---
 
  --- linux-2.6_tip0620/include/linux/perf_event.h2010-06-21 
  15:19:52.821999849 +0800
  +++ linux-2.6_tip0620perfkvm/include/linux/perf_event.h 2010-06-21 
  16:53:49.283999849 +0800
  @@ -188,7 +188,10 @@ struct perf_event_attr {
  __u64   sample_type;
  __u64   read_format;
 
 
 
 Assuming these flags are available to the guest?
These flags are used by generic perf codes. To match with host kernel, we wish
guest os also use the flags.

 
  -   __u64   disabled   :  1, /* off by default*/
  +   union {
  +   __u64   flags;
  +   struct {
  +   __u64   disabled   :  1, /* off by default*/
  inherit:  1, /* children inherit it   */
 
 
 inherit is meaningless for a guest.
Right. host kernel will reset it to 0 before create perf_event for guest os.
Here we couldn't delete the flag as it's used by perf generic codes.
I need separate the patch a little better. All definitions in 
include/linux/perf_event.h
are mostly perf generic code related. I'm very careful to change it.

 
  pinned :  1, /* must always be on PMU */
 
 
 We cannot allow a guest to pin a counter.
Ok. I will reset it in function kvm_pv_perf_op_open.

 
 The other flags are also problematic.  I'd like to see virt-specific 
 flags (probably we'll only need kernel/user and nested_hv for nested 
 virtualization).
How about to add more comments around struct guest_perf_attr-flags that
guest os developers should take a look at include/linux/perf_event.h?
BTW, I will reset more flags to 0 in kvm_pv_perf_op_open.

 
 Something that is worrying is that we don't expose group information.  
 perf will multiplex the events for us, but there will be a loss in accuracy.
 
#ifdef CONFIG_HAVE_HW_BREAKPOINT
#includeasm/hw_breakpoint.h
#endif
  @@ -753,6 +752,20 @@ struct perf_event {
 
  perf_overflow_handler_t overflow_handler;
 
  +   /*
  +* pointers used by kvm perf paravirt interface.
  +*
  +* 1) Used in host kernel and points to host_perf_shadow which
  +* has information about guest perf_event
  +*/
  +   void*host_perf_shadow;
 
 
 Can we have real types instead of void pointers?
I just want perf generic codes have less dependency on KVM codes.

 
  +   /*
  +* 2) Used in guest kernel and points to guest_perf_shadow which
  +* is used as a communication area with host kernel. Host kernel
  +* copies overflow data to it when an event overflows.
  +*/
  +   void*guest_perf_shadow;
 
 
 It's strange to see both guest and host parts in the same patch.  
 Splitting to separate patches will really help review.
It's a little hard to split the patches if they change the same file. Perhaps
I could add more statements before the patch when I send it out.

 
  @@ -1626,9 +1629,22 @@ void perf_event_task_tick(struct task_st
  if (ctx  ctx-nr_events  ctx-nr_events != ctx-nr_active)
  rotate = 1;
 
  -   perf_ctx_adjust_freq(cpuctx-ctx);
  -   if (ctx)
  -   perf_ctx_adjust_freq(ctx);
  +#ifdef CONFIG_KVM_PERF
  +   if (kvm_para_available()) {
  +   /*
  +* perf_ctx_adjust_freq causes lots of pmu-read which would
  +* trigger too many vmexit to host kernel. We disable it
  +* under para virt situation
  +*/
  +   adjust_freq = 0;
  +   }
  +#endif
 
 
 Perhaps we can have a batch read interface which will read many counters 
 at once.
It's a good idea. But that will touch many perf generic codes which causes it's 
hard
to maintain or follow future changes.

   This would reduce the number of exits.  Also adjust the 
 frequency less frequently.
Here it depends on process scheduler frequency, CONFIG_HZ. 

 
  +
  +   if (adjust_freq) {
  +   perf_ctx_adjust_freq(cpuctx-ctx);
  +   if (ctx)
  +   perf_ctx_adjust_freq(ctx);
  +   }
 
 
 


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH V2 3/5] ara virt interface of perf to support kvm guest os statistics collection in guest os

2010-06-21 Thread Zhang, Yanmin
On Mon, 2010-06-21 at 15:33 +0300, Avi Kivity wrote:
 On 06/21/2010 12:31 PM, Zhang, Yanmin wrote:
  The 3rd patch is to implement para virt perf at host kernel.
 
 
  @@ -64,6 +73,85 @@ struct kvm_mmu_op_release_pt {
#ifdef __KERNEL__
#includeasm/processor.h
 
 
  +/*
  + * In host kernel, perf_event-host_perf_shadow points to
  + * host_perf_shadow which records some information
  + * about the guest.
  + */
  +struct host_perf_shadow {
  +   /* guest perf_event id passed from guest os */
  +   int id;
  +   /*
  +* Host kernel saves data into data member counter firstly.
  +* kvm will get data from this counter and calls kvm functions
  +* to copy or add data back to guets os before entering guest os
  +* next time
  +*/
  +   struct guest_perf_event counter;
  +   /* guest_event_addr is gpa_t pointing to guest os guest_perf_event*/
  +   __u64 guest_event_addr;
 
 
 So just use gpa_t as the type.
host_perf_shadow-guest_event_addr is a copy of 
guest_event_addr-guest_event_addr.
As the latter's type is __u64 as the interface between guest os and host os, I 
use
__u64 as the type of host_perf_shadow-guest_event_addr.

 
  +
  +   /*
  +* Link to  of kvm.kvm_arch.shadow_hash_table
  +*/
  +   struct list_head shadow_entry;
  +   struct kvm_vcpu *vcpu;
  +
  +   struct perf_event *host_event;
  +   /*
  +* Below counter is to prevent malicious guest os to try to
  +* close/enable event at the same time.
  +*/
  +   atomic_t ref_counter;
 
 
 If events are made per-vcpu (like real hardware), races become impossible.
This design is to deal with a task context perf collection in guest os.
Scenario 1:
1) guest os starts to collect statistics of process A on vcpu 0;
2) process A is scheduled to vcpu 1. Then, the perf_event at host side need
to be moved to VCPU 1 's thread. With the per KVM instance design, we needn't
move host_perf_shadow among vcpus.

Scenario 2:
1) guest os creates a perf_event at host side on vcpu 0;
2) malicious guest os calls close to delete the host perf_event on vcpu 1, but
enables the perf_event on vcpu0 at the same time. When close thread runs to get 
the
host_perf_shadow from the list, enable thread also gets it. Then, close thread
deletes the perf_event, and enable thread will cause host kernel panic when 
using
host_perf_shadow.


 
  +};
 
 
 Please move this structure to include/linux/kvm_host.h.  No need to spam 
 kvm_para.h.  Note it's not x86 specific (though you can leave arch 
 enabling to arch maintainers).
Ok. Originally, I wanted to do so, but I'm afraid other arch might be not happy.

 
  +
  +/*
  + * In guest kernel, perf_event-guest_shadow points to
  + * guest_perf_shadow which records some information
  + * about the guest.
  + */
  +struct guest_perf_shadow {
  +   /* guest perf_event id passed from guest os */
  +   int id;
  +   /*
  +* Host kernel kvm saves data into data member counter
  +*/
  +   struct guest_perf_event counter;
  +};
 
 
 Don't ordinary perf structures already have a counter ID which we can reuse?
No. In the other hand, if we assume generic perf has, we couldn't use it, 
because
generic perf id (actually there is no) is host kernel system-wide while here
guest_perf_shadow-id is per kvm instance wide.

 
  +
  +/*
  + * guest_perf_attr is used when guest calls hypercall to
  + * open a new perf_event at host side. Mostly, it's a copy of
  + * perf_event_attr and deletes something not used by host kernel.
  + */
  +struct guest_perf_attr {
  +   __u32   type;
  +   __u64   config;
  +   __u64   sample_period;
  +   __u64   sample_type;
  +   __u64   read_format;
  +   __u64   flags;
  +   __u32   bp_type;
  +   __u64   bp_addr;
  +   __u64   bp_len;
  +};
 
 
 This is really not a guest or host structure, it's part of the 
 interface.  So please rename it (and similar) kvm_pv_perf_*.
Good idea.

 
  @@ -24,6 +24,7 @@
#includeasm/desc.h
#includeasm/mtrr.h
#includeasm/msr-index.h
  +#includeasm/perf_event.h
 
#define KVM_MAX_VCPUS 64
#define KVM_MEMORY_SLOTS 32
  @@ -360,6 +361,18 @@ struct kvm_vcpu_arch {
 
  /* fields used by HYPER-V emulation */
  u64 hv_vapic;
  +
  +   /*
  +* Fields used by PARAVIRT perf interface:
  +*
  +* kvm checks overflow_events before entering guest os,
  +* and copy data back to guest os.
  +* event_mutex is to avoid a race between NMI perf event overflow
  +* handler, event close, and enable/disable.
  +*/
  +   struct mutex event_mutex;
 
 
 No race can exist.  The host NMI handler cannot take any mutex
We use a mutex_trylock in NMI hanlder. If it can't get the lock, there is a NMI 
miss
happening, but host kernel still updates perf_event-host_perf_shadow.counter, 
so the
overflow data will be accumulated. 

  so it 
 must be immune

Re: [PATCH V2 5/5] ara virt interface of perf to support kvm guest os statistics collection in guest os

2010-06-21 Thread Zhang, Yanmin
On Mon, 2010-06-21 at 15:37 +0300, Avi Kivity wrote:
 On 06/21/2010 12:31 PM, Zhang, Yanmin wrote:
  The 5th patch is applied to the latest qemu-kvm tree.
 
  --- qemu-kvm_0621/target-i386/kvm.c 2010-06-21 11:00:29.0 +0800
  +++ qemu-kvm_0621_perf/target-i386/kvm.c2010-06-21 13:00:14.136999850 
  +0800
  @@ -150,6 +150,9 @@ struct kvm_para_features {
#ifdef KVM_CAP_PV_MMU
{ KVM_CAP_PV_MMU, KVM_FEATURE_MMU_OP },
#endif
  +#ifdef KVM_CAP_PV_PERF
  +{ KVM_CAP_PV_PERF, KVM_FEATURE_PV_PERF },
  +#endif
{ -1, -1 }
};
 
 
 
 
 Not really necessary any more - if you expose the cpuid bit via 
 KVM_GET_SUPPORTED_CPUID2 then 'qemu -cpu host' will automatically enable it.
 
 On the other hand, do update target-i386/cpuid.c:kvm_feature_name so 
 people can enable the feature using qemu -cpu ...,+kvmperf.
Thanks for the good pointer. I will change it.

Avi,

Thanks for your wonderful comments. I will fix all in the patches.
As for the live migration, I need check it carefully. If you could
provide some suggestions on it, especially about how/when to save all 
perf_events
and restore all pref_events, that would be very helpful.

Yanmin


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] para virt interface of perf to support kvm guest os statistics collection in guest os

2010-06-10 Thread Zhang, Yanmin
On Thu, 2010-06-10 at 11:50 +0200, Peter Zijlstra wrote:
 On Thu, 2010-06-10 at 10:21 +0800, Zhang, Yanmin wrote:
  . The ABI mostly includes the definition of struct perf_event_attr,
  guest_perf_counter,
  and hypercalls. 
 
 Note that perf_event_attr isn't guaranteed to be stable between kernels,
 it can grow.
Thanks. I need create a small new attr structure to save some key data members.

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] para virt interface of perf to support kvm guest os statistics collection in guest os

2010-06-09 Thread Zhang, Yanmin
On Wed, 2010-06-09 at 11:33 +0300, Avi Kivity wrote:
 On 06/09/2010 06:30 AM, Zhang, Yanmin wrote:
  From: Zhang, Yanminyanmin_zh...@linux.intel.com
 
  Based on Ingo's idea, I implement a para virt interface for perf to support
  statistics collection in guest os. That means we could run tool perf in 
  guest
  os directly.
 
  Great thanks to Peter Zijlstra. He is really the architect and gave me 
  architecture
  design suggestions. I also want to thank Yangsheng and LinMing for their 
  generous
  help.
 
  The design is:
 
  1) Add a kvm_pmu whose callbacks mostly just calls hypercall to vmexit to 
  host kernel;
  2) Create a host perf_event per guest perf_event;
  3) Host kernel syncs perf_event count/overflows data changes to guest 
  perf_event
  when processing perf_event overflows after NMI arrives. Host kernel inject 
  NMI to guest
  kernel if a guest event overflows.
  4) Guest kernel goes through all enabled event on current cpu and output 
  data when they
  overflows.
  5) No change in user space.
 
 
Thanks for your kind comments.

 One thing that's missing is documentation of the guest/host ABI.  It 
 will be a requirement for inclusion, but it will also be a great help 
 for review, so please provide it ASAP.
I will add such document. It will includes:
1) Data struct perf_event definition. Guest os and host os have to share the 
same
data structure as host kernel will sync data (counte/overflows and others if 
needed)
changes back to guest os.
2) A list of all hypercalls
3) Guest need have NMI handler which checks all guest events.


 
 Please also split the guest and host parts into separate patches.
I will do so.

 
 
  -#define KVM_MAX_MMU_OP_BATCH   32
 
 
 Keep that please.
I will do so.

 
  +/* Operations for KVM_PERF_OP */
  +#define KVM_PERF_OP_OPEN   1
  +#define KVM_PERF_OP_CLOSE  2
  +#define KVM_PERF_OP_ENABLE 3
  +#define KVM_PERF_OP_DISABLE4
  +#define KVM_PERF_OP_START  5
  +#define KVM_PERF_OP_STOP   6
  +#define KVM_PERF_OP_READ   7
 
 
 Where is the hypercall number for the perf hypercall?
I defines it in file kvm_para.h like KVM_HC_MMU_OP.

 
  +static bool kvm_reserve_pmc_hardware(void)
  +{
  +   if (nmi_watchdog == NMI_LOCAL_APIC)
  +   disable_lapic_nmi_watchdog();
  +
  +   return true;
  +}
  +
  +static void kvm_release_pmc_hardware(void)
  +{
  +   if (nmi_watchdog == NMI_LOCAL_APIC)
  +   enable_lapic_nmi_watchdog();
  +}
 
 
 Disabling the watchdog is unfortunate.  Why is it necessary?
perf always uses NMI, so we disable the nmi_watchdog when a perf_event is
set up in case they might have impact.

 
  +
  +static int kvm_pmu_enable(struct perf_event *event)
  +{
  +   int ret;
  +   unsigned long addr = __pa(event);
  +
  +   if (kvm_add_event(event))
  +   return -1;
  +
  +   ret = kvm_hypercall3(KVM_PERF_OP, KVM_PERF_OP_ENABLE,
  +   addr, (unsigned long) event-shadow);
 
 
 This is suspicious.  Passing virtual addresses to the host is always 
 problematic.  Or event-shadow only used as an opaque cookie?
I use perf_event-shadow at both host and guest side.
1) At host side, perf_event-shadow points to an area including the page
mapping information about guest perf_event. Host kernel uses it to sync data
changes back to guest;
2) At guest side, perf_event-shadow save the virtual address of host
perf_event at host side. At guest side,
kvm_hypercall3(KVM_PERF_OP, KVM_PERF_OP_OPEN, ...) return the virtual address.
Guest os shouldn't use it but using it to pass it back to host kernel in 
following
hypercalls. It might be a security issue for host kernel. Originally, I planed 
guest
os not to use perf_event-shadow. Host kernel maintains a per-vcpu event-related
list whose key is addr of guest perf_event. But considering the vcpu thread 
migration
on logical cpu, such list needs lock and implementation becomes a little 
complicated.

I will double-check the list method as the security issue is there.

 
  +int __init kvm_init_hw_perf_events(void)
  +{
  +   if (!kvm_para_available())
  +   return -1;
  +
  +   x86_pmu.handle_irq = kvm_default_x86_handle_irq;
  +
  +   pr_cont(KVM PARA PMU driver.\n);
  +   register_die_notifier(kvm_perf_event_nmi_notifier);
  +
  +   return 0;
  +}
 
 
 Need to detect the kvm pmu via its own cpuid bit.
Ok. I will add a feature, KVM_FEATURE_PARA_PERF, something like
bit KVM_FEATURE_CLOCKSOURCE.

 
  +
  +static int __kvm_hw_perf_event_init(struct perf_event *event)
  +{
  +   struct hw_perf_event *hwc =event-hw;
  +   int err;
  +   unsigned long result;
  +   unsigned long addr;
  +
  +   err = 0;
  +   if (!atomic_inc_not_zero(active_events)) {
  +   mutex_lock(pmc_reserve_mutex);
  +   if (atomic_read(active_events) == 0) {
  +   if (!kvm_reserve_pmc_hardware())
  +   err = -EBUSY;
  +   }
  +   if (!err

Re: [RFC] para virt interface of perf to support kvm guest os statistics collection in guest os

2010-06-09 Thread Zhang, Yanmin
On Wed, 2010-06-09 at 11:59 +0300, Avi Kivity wrote:
 On 06/09/2010 06:30 AM, Zhang, Yanmin wrote:
  From: Zhang, Yanminyanmin_zh...@linux.intel.com
 
  Based on Ingo's idea, I implement a para virt interface for perf to support
  statistics collection in guest os. That means we could run tool perf in 
  guest
  os directly.
 
  Great thanks to Peter Zijlstra. He is really the architect and gave me 
  architecture
  design suggestions. I also want to thank Yangsheng and LinMing for their 
  generous
  help.
 
  The design is:
 
  1) Add a kvm_pmu whose callbacks mostly just calls hypercall to vmexit to 
  host kernel;
  2) Create a host perf_event per guest perf_event;
  3) Host kernel syncs perf_event count/overflows data changes to guest 
  perf_event
  when processing perf_event overflows after NMI arrives. Host kernel inject 
  NMI to guest
  kernel if a guest event overflows.
  4) Guest kernel goes through all enabled event on current cpu and output 
  data when they
  overflows.
  5) No change in user space.
 
 
 Other issues:
 
 - save/restore support for live migration
Well, it's a little hard to process perf_event under live migration case.
I will check it.

 - some way to limit the number of open handles (comes automatically with 
 the table approach I suggested earlier)
Current perf doesn't restrict perf_event number. Kernel does a rotation to 
collect
statistics of all perf_events. My patch just follows this style. 
The table method might be not good, because below scenario:
guest perf_event might be a per-task event at guest side. When the guest 
application task is
migrated to another cpu, the perf_event peer at host side should also be 
migrated to the new vcpu
thread. With table method, we need do some rearrangement on the table when 
event migration happens.
Here migration I mention is not guest live migration. 

I will double-check it.

Thanks,
Yanmin


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] para virt interface of perf to support kvm guest os statistics collection in guest os

2010-06-09 Thread Zhang, Yanmin
On Wed, 2010-06-09 at 12:41 +0300, Avi Kivity wrote:
 On 06/09/2010 12:21 PM, Zhang, Yanmin wrote:
 
  One thing that's missing is documentation of the guest/host ABI.  It
  will be a requirement for inclusion, but it will also be a great help
  for review, so please provide it ASAP.
   
  I will add such document. It will includes:
  1) Data struct perf_event definition. Guest os and host os have to share 
  the same
  data structure as host kernel will sync data (counte/overflows and others 
  if needed)
  changes back to guest os.
  2) A list of all hypercalls
  3) Guest need have NMI handler which checks all guest events.
 
 
 Thanks.  It may be easier to have just the documentation for the first 
 few iterations so we can converge on a good interface, then update the 
 code to match the interface.
I thought over it last night. Your input is important. I need define a clear 
ABI.
At guest side, I plan to use perf_event-shadow to point to another data area, 
such like:
struct guest_perf_counter {
u64 count;
atomic_t overflows;
};

So host side just copy data to this area, then guest copy them to its own
perf_event.

The ABI becomes more simple than before. Function kvm_sync_event_to_guest also 
becomes
clearer. The ABI mostly includes the definition of struct perf_event_attr, 
guest_perf_counter,
and hypercalls.


  Disabling the watchdog is unfortunate.  Why is it necessary?
   
  perf always uses NMI, so we disable the nmi_watchdog when a perf_event is
  set up in case they might have impact.
 
 
 Ok.  Is that the case for the hardware pmus as well?  If so it might be 
 done in common code.
 
  +
  +static int kvm_pmu_enable(struct perf_event *event)
  +{
  + int ret;
  + unsigned long addr = __pa(event);
  +
  + if (kvm_add_event(event))
  + return -1;
  +
  + ret = kvm_hypercall3(KVM_PERF_OP, KVM_PERF_OP_ENABLE,
  + addr, (unsigned long) event-shadow);
 
 
  This is suspicious.  Passing virtual addresses to the host is always
  problematic.  Or event-shadow only used as an opaque cookie?
   
  I use perf_event-shadow at both host and guest side.
  1) At host side, perf_event-shadow points to an area including the page
  mapping information about guest perf_event. Host kernel uses it to sync data
  changes back to guest;
  2) At guest side, perf_event-shadow save the virtual address of host
  perf_event at host side. At guest side,
  kvm_hypercall3(KVM_PERF_OP, KVM_PERF_OP_OPEN, ...) return the virtual 
  address.
  Guest os shouldn't use it but using it to pass it back to host kernel in 
  following
  hypercalls. It might be a security issue for host kernel. Originally, I 
  planed guest
  os not to use perf_event-shadow. Host kernel maintains a per-vcpu 
  event-related
  list whose key is addr of guest perf_event. But considering the vcpu thread 
  migration
  on logical cpu, such list needs lock and implementation becomes a little 
  complicated.
 
  I will double-check the list method as the security issue is there.
 
 
 Besides the other concern, you cannot live migrate a host virtual 
 address, since it changes from host to host.  It's better to use a 
 guest-chosen bounded small integer.
Ok. Perhaps a single u32 per guest os instance is enough. So I will change the 
shadow
pointing to a structure like below in guest kernel:

struct guest_perf_counter {
u64 count;
atomic_t overflows;
};

struct guest_perf_shadow {
u32 id;
struct guest_perf_counter sync_data;
};

atomic_t guest_perf_id; /*Global id counter per guest os*/


 
  Need to detect the kvm pmu via its own cpuid bit.
   
  Ok. I will add a feature, KVM_FEATURE_PARA_PERF, something like
  bit KVM_FEATURE_CLOCKSOURCE.
 
 
 Don't forget Documentation/kvm/cpuid.txt.
Thanks for your kind reminder.

  Ok, so event-shadow is never dereferenced.  In that case better not
  make it a pointer at all, keep it unsigned long.
   
  Host kernel also uses it.
 
 
 I see.  So put it in a union.  Or perhaps not even in a union - what if 
 a kvm guest is also acting as a kvm host?
My patch has consideration on it. I compiled kernel with host and guest support
at the same time. The accessing to perf_event-shadow is really under specific
scenarios, or they are just in specific functions. These functions are called
just bu host kernel , or just by guest kernel.

 
 
 
   
  +
  +static void kvm_sync_event_to_guest(struct perf_event *event, int 
  overflows)
  +{
  + struct hw_perf_event *hwc =event-hw;
  + struct kvmperf_event *kvmevent;
  + int offset, len, data_len, copied, page_offset;
  + struct page *event_page;
  + void *shared_kaddr;
  +
  + kvmevent = event-shadow;
  + offset = kvmevent-event_offset;
  +
  + /* Copy perf_event-count firstly */
  + offset += offsetof(struct perf_event, count);
  + if (offset   PAGE_SIZE) {
  + event_page = kvmevent-event_page;
  + page_offset = offset;
  + } else

Re: [RFC] para virt interface of perf to support kvm guest os statistics collection in guest os

2010-06-09 Thread Zhang, Yanmin
On Thu, 2010-06-10 at 06:06 +0300, Avi Kivity wrote:
 On 06/10/2010 05:21 AM, Zhang, Yanmin wrote:
 
  I see.  So put it in a union.  Or perhaps not even in a union - what if
  a kvm guest is also acting as a kvm host?
   
  My patch has consideration on it. I compiled kernel with host and guest 
  support
  at the same time. The accessing to perf_event-shadow is really under 
  specific
  scenarios, or they are just in specific functions. These functions are 
  called
  just bu host kernel , or just by guest kernel.
 
 
 But a kernel can be both guest and host at the same time.  Currently 
 this only works on AMD, but there was some work to bring it to Intel as 
 well.
Oh, this is a fancy VT feature. My patch doesn't support this mode.

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[RFC] para virt interface of perf to support kvm guest os statistics collection in guest os

2010-06-08 Thread Zhang, Yanmin
From: Zhang, Yanmin yanmin_zh...@linux.intel.com

Based on Ingo's idea, I implement a para virt interface for perf to support
statistics collection in guest os. That means we could run tool perf in guest
os directly.

Great thanks to Peter Zijlstra. He is really the architect and gave me 
architecture
design suggestions. I also want to thank Yangsheng and LinMing for their 
generous
help.

The design is:

1) Add a kvm_pmu whose callbacks mostly just calls hypercall to vmexit to host 
kernel;
2) Create a host perf_event per guest perf_event;
3) Host kernel syncs perf_event count/overflows data changes to guest perf_event
when processing perf_event overflows after NMI arrives. Host kernel inject NMI 
to guest
kernel if a guest event overflows.
4) Guest kernel goes through all enabled event on current cpu and output data 
when they
overflows.
5) No change in user space.

Below is an example.

#perf top
--
   PerfTop:7954 irqs/sec  kernel:79.5%  exact:  0.0% [1000Hz cycles],  
(all, 8 CPUs)
--

 samples  pcnt function DSO
 ___ _  
_

 5315.00  4.9% copy_user_generic_string 
/lib/modules/2.6.35-rc1-tip-guestperf/build/vmlinux  
 3342.00  3.1% add_preempt_count
/lib/modules/2.6.35-rc1-tip-guestperf/build/vmlinux  
 3338.00  3.1% sub_preempt_count
/lib/modules/2.6.35-rc1-tip-guestperf/build/vmlinux  
 2454.00  2.3% pvclock_clocksource_read 
/lib/modules/2.6.35-rc1-tip-guestperf/build/vmlinux  
 2434.00  2.3% tcp_sendmsg  
/lib/modules/2.6.35-rc1-tip-guestperf/build/vmlinux  
 2090.00  1.9% child_run
/bm/tmp/benchmarks/run_bmtbench/dbench/dbench-3.03/tbench
 2081.00  1.9% debug_smp_processor_id   
/lib/modules/2.6.35-rc1-tip-guestperf/build/vmlinux  
 2003.00  1.9% __GI_strstr  /lib64/libc-2.11.so 
 
 1999.00  1.9% __strchr_sse2/lib64/libc-2.11.so 
 
 1983.00  1.8% tcp_ack  
/lib/modules/2.6.35-rc1-tip-guestperf/build/vmlinux  
 1800.00  1.7% tcp_transmit_skb 
/lib/modules/2.6.35-rc1-tip-guestperf/build/vmlinux  
 1727.00  1.6% schedule 
/lib/modules/2.6.35-rc1-tip-guestperf/build/vmlinux  
 1706.00  1.6% __libc_recv  /lib64/libc-2.11.so 
 
 1702.00  1.6% __GI_memchr  /lib64/libc-2.11.so 
 
 1580.00  1.5% tcp_recvmsg  
/lib/modules/2.6.35-rc1-tip-guestperf/build/vmlinux  

The patch is against tip/master tree of June 1st.

Signed-off-by: Zhang Yanmin yanmin_zh...@linux.intel.com

---

diff -Nraup linux-2.6_tip0601/arch/x86/include/asm/kvm_host.h 
linux-2.6_tip0601perfkvm/arch/x86/include/asm/kvm_host.h
--- linux-2.6_tip0601/arch/x86/include/asm/kvm_host.h   2010-06-02 
10:01:52.147999849 +0800
+++ linux-2.6_tip0601perfkvm/arch/x86/include/asm/kvm_host.h2010-06-06 
15:46:31.874999850 +0800
@@ -561,6 +561,9 @@ int emulator_write_phys(struct kvm_vcpu 
  const void *val, int bytes);
 int kvm_pv_mmu_op(struct kvm_vcpu *vcpu, unsigned long bytes,
  gpa_t addr, unsigned long *ret);
+int kvm_pv_perf_op(struct kvm_vcpu *vcpu, int op_code, unsigned long a1,
+  unsigned long a2, unsigned long *result);
+
 u8 kvm_get_guest_memory_type(struct kvm_vcpu *vcpu, gfn_t gfn);
 
 extern bool tdp_enabled;
diff -Nraup linux-2.6_tip0601/arch/x86/include/asm/kvm_para.h 
linux-2.6_tip0601perfkvm/arch/x86/include/asm/kvm_para.h
--- linux-2.6_tip0601/arch/x86/include/asm/kvm_para.h   2010-06-02 
10:01:52.126999849 +0800
+++ linux-2.6_tip0601perfkvm/arch/x86/include/asm/kvm_para.h2010-06-06 
15:46:31.874999850 +0800
@@ -33,7 +33,14 @@
 #define MSR_KVM_WALL_CLOCK_NEW  0x4b564d00
 #define MSR_KVM_SYSTEM_TIME_NEW 0x4b564d01
 
-#define KVM_MAX_MMU_OP_BATCH   32
+/* Operations for KVM_PERF_OP */
+#define KVM_PERF_OP_OPEN   1
+#define KVM_PERF_OP_CLOSE  2
+#define KVM_PERF_OP_ENABLE 3
+#define KVM_PERF_OP_DISABLE4
+#define KVM_PERF_OP_START  5
+#define KVM_PERF_OP_STOP   6
+#define KVM_PERF_OP_READ   7
 
 /* Operations for KVM_HC_MMU_OP */
 #define KVM_MMU_OP_WRITE_PTE1
@@ -64,6 +71,12 @@ struct kvm_mmu_op_release_pt {
 #ifdef __KERNEL__
 #include asm/processor.h
 
+struct kvmperf_event {
+   unsigned int

Re: [PATCH V5 1/3] perf kvm: Enhance perf to collect KVM guest os statistics from host side

2010-04-26 Thread Zhang, Yanmin
On Fri, 2010-04-23 at 13:50 +0300, Avi Kivity wrote:
 On 04/22/2010 01:27 PM, Liu Yu-B13201 wrote:
 
  I met this error when built kernel. Anything wrong?
 
 CC  init/main.o
  In file included from include/linux/ftrace_event.h:8,
from include/trace/syscall.h:6,
from include/linux/syscalls.h:75,
from init/main.c:16:
  include/linux/perf_event.h: In function 
  'perf_register_guest_info_callbacks':
  include/linux/perf_event.h:1019: error: parameter name omitted
  include/linux/perf_event.h: In function 
  'perf_unregister_guest_info_callbacks':
  include/linux/perf_event.h:1021: error: parameter name omitted
  make[1]: *** [init/main.o] Error 1
  make: *** [init] Error 2
 
 
 I merged tip/perf/code which may fix this.  Find it in kvm.git next branch.
I downloaded the latest kvm.git tree and compilation is ok, including both 
enabling
FTRACE and disabling FTRACE.

Yanmin


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH V5 1/3] perf kvm: Enhance perf to collect KVM guest os statistics from host side

2010-04-20 Thread Zhang, Yanmin
On Tue, 2010-04-20 at 08:09 +0200, Ingo Molnar wrote:
 * Ingo Molnar mi...@elte.hu wrote:
 
  
  * Zhang, Yanmin yanmin_zh...@linux.intel.com wrote:
  
unsigned long perf_misc_flags(struct pt_regs *regs)
{
 int misc = 0;
   +
 if (perf_guest_cbs  perf_guest_cbs-is_in_guest()) {
   + if (perf_guest_cbs-is_user_mode())
   + misc |= PERF_RECORD_MISC_GUEST_USER;
   + else
   + misc |= PERF_RECORD_MISC_GUEST_KERNEL;
   + } else if (user_mode(regs))
   + misc |= PERF_RECORD_MISC_USER;
   + else
   + misc |= PERF_RECORD_MISC_KERNEL;
   +
  
  We try to use balanced curly braces. I.e.:
  
  if (x) {
  boo();
  } else {
  if (y)
  foo();
  else
  bar();
  }   
  
  And avoid unbalanced ones:
  
  if (x) {
  boo();
  } else
  if (y)
  foo();
  else
  bar();
 
 Note, i fixed this in the patch and applied it to perf/core. (the invalid-C 
 problem was causing build failures)
Thanks for your teaching. Originally, I used {} in the 2nd half, but deleted it.

Yanmin


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH V5 1/3] perf kvm: Enhance perf to collect KVM guest os statistics from host side

2010-04-19 Thread Zhang, Yanmin
On Mon, 2010-04-19 at 11:37 +0300, Avi Kivity wrote:
 On 04/19/2010 08:32 AM, Zhang, Yanmin wrote:
  Below patch introduces perf_guest_info_callbacks and related 
  register/unregister
  functions. Add more PERF_RECORD_MISC_XXX bits meaning guest kernel and 
  guest user
  space.
 
 
 This doesn't apply against upstream.
What does upstream mean here? The vanilla 2.6.34-rc4?

   What branch was this generated 
 against?
 
It's against the latest tip/master. I checked out to 19b26586090 as the latest
tip/master has some updates on perf.


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH V5 1/3] perf kvm: Enhance perf to collect KVM guest os statistics from host side

2010-04-19 Thread Zhang, Yanmin
On Mon, 2010-04-19 at 11:59 +0300, Avi Kivity wrote:
 On 04/19/2010 11:55 AM, Zhang, Yanmin wrote:
  On Mon, 2010-04-19 at 11:37 +0300, Avi Kivity wrote:
 
  On 04/19/2010 08:32 AM, Zhang, Yanmin wrote:
   
  Below patch introduces perf_guest_info_callbacks and related 
  register/unregister
  functions. Add more PERF_RECORD_MISC_XXX bits meaning guest kernel and 
  guest user
  space.
 
 
  This doesn't apply against upstream.
   
  What does upstream mean here? The vanilla 2.6.34-rc4?
 
 
 Yes, sorry for being unclear.
 
 What branch was this generated
  against?
 
   
  It's against the latest tip/master. I checked out to 19b26586090 as the 
  latest
  tip/master has some updates on perf.
 
 
 I don't want to merge tip/master... does tip/perf/core contain the 
 needed updates?
I think so. A moment ago, I checked out to b5a80b7e9 of tip/perf/core. All 3
patches could be applied cleanly and compilation is ok. A quick testing shows
tip/perf/core kernel does work.


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH V5 1/3] perf kvm: Enhance perf to collect KVM guest os statistics from host side

2010-04-19 Thread Zhang, Yanmin
On Mon, 2010-04-19 at 20:11 +0200, Ingo Molnar wrote:
 FYI, i found a few problems that need fixing:
 
  +   unsigned long ip;
  +   if (perf_guest_cbs  perf_guest_cbs-is_in_guest())
 
 missing newline.
 
  +   int misc = 0;
  +   if (perf_guest_cbs  perf_guest_cbs-is_in_guest()) {
 
 ditto.
 
  +   PERF_RECORD_MISC_GUEST_KERNEL;
  +   } else
  +   misc |= user_mode(regs) ? PERF_RECORD_MISC_USER :
  +   PERF_RECORD_MISC_KERNEL;
 
 - unbalanced curly braces
 - missing curly brace for multi-line statement.
 - unnecessary line-break due to col80 warning from checkpatch
 
  +extern struct perf_guest_info_callbacks *perf_guest_cbs;
  +extern int perf_register_guest_info_callbacks(
  +   struct perf_guest_info_callbacks *);
  +extern int perf_unregister_guest_info_callbacks(
  +   struct perf_guest_info_callbacks *);
 
 - unnecessary line-break due to col80 warning from checkpatch
 
  +static inline int perf_register_guest_info_callbacks
  +(struct perf_guest_info_callbacks *) {return 0; }
  +static inline int perf_unregister_guest_info_callbacks
  +(struct perf_guest_info_callbacks *) {return 0; }
 
 - invalid C: function parameter needs name even if unused
 - missing space after opening curly brace
 
 Please provide delta fixes.

Here is the fix on the top of the prior 3 patches of V5.

From: Zhang, Yanmin yanmin_zh...@linux.intel.com

Fix some programming style issues on the top of perf kvm
enhancement V5.

Signed-off-by: Zhang Yanmin yanmin_zh...@linux.intel.com

---

diff -Nraup linux-2.6_tip0417_perfkvm/arch/x86/kernel/cpu/perf_event.c 
linux-2.6_tip0417_perfkvmstyle/arch/x86/kernel/cpu/perf_event.c
--- linux-2.6_tip0417_perfkvm/arch/x86/kernel/cpu/perf_event.c  2010-04-19 
09:53:59.689452915 +0800
+++ linux-2.6_tip0417_perfkvmstyle/arch/x86/kernel/cpu/perf_event.c 
2010-04-20 10:48:18.500999849 +0800
@@ -1752,23 +1752,29 @@ void perf_arch_fetch_caller_regs(struct 
 unsigned long perf_instruction_pointer(struct pt_regs *regs)
 {
unsigned long ip;
+
if (perf_guest_cbs  perf_guest_cbs-is_in_guest())
ip = perf_guest_cbs-get_guest_ip();
else
ip = instruction_pointer(regs);
+
return ip;
 }
 
 unsigned long perf_misc_flags(struct pt_regs *regs)
 {
int misc = 0;
+
if (perf_guest_cbs  perf_guest_cbs-is_in_guest()) {
-   misc |= perf_guest_cbs-is_user_mode() ?
-   PERF_RECORD_MISC_GUEST_USER :
-   PERF_RECORD_MISC_GUEST_KERNEL;
-   } else
-   misc |= user_mode(regs) ? PERF_RECORD_MISC_USER :
-   PERF_RECORD_MISC_KERNEL;
+   if (perf_guest_cbs-is_user_mode())
+   misc |= PERF_RECORD_MISC_GUEST_USER;
+   else
+   misc |= PERF_RECORD_MISC_GUEST_KERNEL;
+   } else if (user_mode(regs))
+   misc |= PERF_RECORD_MISC_USER;
+   else
+   misc |= PERF_RECORD_MISC_KERNEL;
+
if (regs-flags  PERF_EFLAGS_EXACT)
misc |= PERF_RECORD_MISC_EXACT;
 
diff -Nraup linux-2.6_tip0417_perfkvm/arch/x86/kvm/x86.c 
linux-2.6_tip0417_perfkvmstyle/arch/x86/kvm/x86.c
--- linux-2.6_tip0417_perfkvm/arch/x86/kvm/x86.c2010-04-19 
09:53:59.691378953 +0800
+++ linux-2.6_tip0417_perfkvmstyle/arch/x86/kvm/x86.c   2010-04-20 
10:11:40.507545564 +0800
@@ -3776,16 +3776,20 @@ static int kvm_is_in_guest(void)
 static int kvm_is_user_mode(void)
 {
int user_mode = 3;
+
if (percpu_read(current_vcpu))
user_mode = kvm_x86_ops-get_cpl(percpu_read(current_vcpu));
+
return user_mode != 0;
 }
 
 static unsigned long kvm_get_guest_ip(void)
 {
unsigned long ip = 0;
+
if (percpu_read(current_vcpu))
ip = kvm_rip_read(percpu_read(current_vcpu));
+
return ip;
 }
 
diff -Nraup linux-2.6_tip0417_perfkvm/include/linux/perf_event.h 
linux-2.6_tip0417_perfkvmstyle/include/linux/perf_event.h
--- linux-2.6_tip0417_perfkvm/include/linux/perf_event.h2010-04-19 
09:53:59.691378953 +0800
+++ linux-2.6_tip0417_perfkvmstyle/include/linux/perf_event.h   2010-04-20 
10:08:03.531551890 +0800
@@ -941,10 +941,8 @@ static inline void perf_event_mmap(struc
 }
 
 extern struct perf_guest_info_callbacks *perf_guest_cbs;
-extern int perf_register_guest_info_callbacks(
-   struct perf_guest_info_callbacks *);
-extern int perf_unregister_guest_info_callbacks(
-   struct perf_guest_info_callbacks *);
+extern int perf_register_guest_info_callbacks(struct perf_guest_info_callbacks 
*callbacks);
+extern int perf_unregister_guest_info_callbacks(struct 
perf_guest_info_callbacks *callbacks);
 
 extern void perf_event_comm(struct task_struct *tsk);
 extern void perf_event_fork(struct task_struct *tsk);
@@ -1016,9 +1014,9 @@ static inline void
 perf_bp_event(struct perf_event *event, void *data){ }
 
 static inline int

[PATCH V5 0/3] perf kvm: Enhance perf to collect KVM guest os statistics from host side

2010-04-18 Thread Zhang, Yanmin
Here is the new patch of V5 against tip/master of April 17th
if anyone wants to try it.

ChangeLog V5:
1) Split kernel patch to 2 parts. The one introduces
perf_guest_info_callbacks() and related register/unregister
functions. The other is the kvm implementation of the callbacks.
2) Port to tip/master tree of April 17th.
3) Fix a bug which causes the module parsing of default guest kernel
fail.

ChangeLog V4:
1) Based on Ingo's comments, I added help information around kvm
such like command-list.txt and perf-kvm.txt.
2) Added guest process id at the tail of kernel dso long name, so
the display could show different label with different guest os.
3) Based on Avi's comments, erase the racy window which might
trigger an NMI while the NMI isn't in guest os.
4) Fixed all the errors and warnings reported by scripts/checkpatch.pl.
5) Fixed a compilation error pointed by Yang Sheng.

ChangeLog V3:
1) Add --guestmount=/dir/to/all/guestos parameter. Admin mounts guest os
root directories under /dir/to/all/guestos by sshfs. For example, I 
start
2 guest os. The one's pid is  and the other's is .
#mkdir ~/guestmount; cd ~/guestmount
#sshfs -o allow_other,direct_io -p 5551 localhost:/ /
#sshfs -o allow_other,direct_io -p 5552 localhost:/ /
#perf kvm --host --guest --guestmount=~/guestmount top

The old --guestkallsyms and --guestmodules are still supported as 
default
guest os symbol parsing.

2) Add guest os buildid support.
3) Add sub command 'perf kvm buildid-list'.
4) Delete sub command 'perf kvm stat', because our current 
implementation
doesn't transfer guest/host requirement to kernel, and kernel always
collects both host and guest statistics. So regular 'perf stat' is ok.
5) Fix a couple of perf bugs.
6) We still have no support on command with parameter 'any' as current 
KVM
just uses process id to identify specific guest os instance. Users could
uses parameter -p to collect specific guest os instance statistics.

ChangeLog V2:
1) Based on Avi's suggestion, I moved callback functions
to generic code area. So the kernel part of the patch is
clearer.
2) Add 'perf kvm stat'.


From: Zhang, Yanmin yanmin_zh...@linux.intel.com

Based on the discussion in KVM community, I worked out the patch to support
perf to collect guest os statistics from host side. This patch is implemented
with Ingo, Peter and some other guys' kind help. Yang Sheng pointed out a
critical bug and provided good suggestions with other guys. I really appreciate
their kind help.

The patch adds new sub command kvm to perf.

  perf kvm top
  perf kvm record
  perf kvm report
  perf kvm diff
  perf kvm buildid-list

The new perf could profile guest os kernel except guest os user space, but it
could summarize guest os user space utilization per guest os.

Below are some examples.
1) perf kvm top
[r...@lkp-ne01 norm]# perf kvm --host --guest 
--guestkallsyms=/home/ymzhang/guest/kallsyms
--guestmodules=/home/ymzhang/guest/modules top

---
   PerfTop:   16024 irqs/sec  kernel: 2.6% us: 0.6% guest kernel:76.2% guest 
us:20.6% exact:  0.0% [1000Hz cycles],  (all, 16 CPUs)
---

 samples  pcnt function DSO
 ___ _  ___

 3740.00  8.0% __ticket_spin_lock   [guest.kernel.kallsyms]
 2056.00  4.4% copy_user_generic_string [guest.kernel.kallsyms]
 1412.00  3.0% resource_string  [guest.kernel.kallsyms]
  595.00  1.3% __switch_to  [guest.kernel.kallsyms]
  586.00  1.2% __d_lookup   [guest.kernel.kallsyms]
  574.00  1.2% tcp_sendmsg  [guest.kernel.kallsyms]
  565.00  1.2% kmem_cache_alloc [guest.kernel.kallsyms]
  532.00  1.1% tcp_ack  [guest.kernel.kallsyms]
  494.00  1.1% __kmalloc[guest.kernel.kallsyms]
  468.00  1.0% print_cfs_rq [guest.kernel.kallsyms]
  437.00  0.9% link_path_walk   [guest.kernel.kallsyms]
  380.00  0.8% balance_runtime  [guest.kernel.kallsyms]
  379.00  0.8% kmem_cache_free  [guest.kernel.kallsyms]
  377.00  0.8% in_gate_area_no_task [guest.kernel.kallsyms]
  374.00  0.8% get_page_from_freelist   [guest.kernel.kallsyms]
  372.00  0.8% mark_files_ro[guest.kernel.kallsyms

[PATCH V5 1/3] perf kvm: Enhance perf to collect KVM guest os statistics from host side

2010-04-18 Thread Zhang, Yanmin
Below patch introduces perf_guest_info_callbacks and related register/unregister
functions. Add more PERF_RECORD_MISC_XXX bits meaning guest kernel and guest 
user
space.

Signed-off-by: Zhang Yanmin yanmin_zh...@linux.intel.com

---

diff -Nraup --exclude-from=exclude.diff 
linux-2.6_tip0417/arch/x86/include/asm/perf_event.h 
linux-2.6_tip0417_perfkvm/arch/x86/include/asm/perf_event.h
--- linux-2.6_tip0417/arch/x86/include/asm/perf_event.h 2010-04-19 
09:51:47.557797121 +0800
+++ linux-2.6_tip0417_perfkvm/arch/x86/include/asm/perf_event.h 2010-04-19 
09:53:59.689452915 +0800
@@ -135,17 +135,10 @@ extern void perf_events_lapic_init(void)
  */
 #define PERF_EFLAGS_EXACT  (1UL  3)
 
-#define perf_misc_flags(regs)  \
-({ int misc = 0;   \
-   if (user_mode(regs))\
-   misc |= PERF_RECORD_MISC_USER;  \
-   else\
-   misc |= PERF_RECORD_MISC_KERNEL;\
-   if (regs-flags  PERF_EFLAGS_EXACT)\
-   misc |= PERF_RECORD_MISC_EXACT; \
-   misc; })
-
-#define perf_instruction_pointer(regs) ((regs)-ip)
+struct pt_regs;
+extern unsigned long perf_instruction_pointer(struct pt_regs *regs);
+extern unsigned long perf_misc_flags(struct pt_regs *regs);
+#define perf_misc_flags(regs)  perf_misc_flags(regs)
 
 #else
 static inline void init_hw_perf_events(void)   { }
diff -Nraup --exclude-from=exclude.diff 
linux-2.6_tip0417/arch/x86/kernel/cpu/perf_event.c 
linux-2.6_tip0417_perfkvm/arch/x86/kernel/cpu/perf_event.c
--- linux-2.6_tip0417/arch/x86/kernel/cpu/perf_event.c  2010-04-19 
09:51:48.347655964 +0800
+++ linux-2.6_tip0417_perfkvm/arch/x86/kernel/cpu/perf_event.c  2010-04-19 
09:53:59.689452915 +0800
@@ -1720,6 +1720,11 @@ struct perf_callchain_entry *perf_callch
 {
struct perf_callchain_entry *entry;
 
+   if (perf_guest_cbs  perf_guest_cbs-is_in_guest()) {
+   /* TODO: We don't support guest os callchain now */
+   return NULL;
+   }
+
if (in_nmi())
entry = __get_cpu_var(pmc_nmi_entry);
else
@@ -1743,3 +1748,30 @@ void perf_arch_fetch_caller_regs(struct 
regs-cs = __KERNEL_CS;
local_save_flags(regs-flags);
 }
+
+unsigned long perf_instruction_pointer(struct pt_regs *regs)
+{
+   unsigned long ip;
+   if (perf_guest_cbs  perf_guest_cbs-is_in_guest())
+   ip = perf_guest_cbs-get_guest_ip();
+   else
+   ip = instruction_pointer(regs);
+   return ip;
+}
+
+unsigned long perf_misc_flags(struct pt_regs *regs)
+{
+   int misc = 0;
+   if (perf_guest_cbs  perf_guest_cbs-is_in_guest()) {
+   misc |= perf_guest_cbs-is_user_mode() ?
+   PERF_RECORD_MISC_GUEST_USER :
+   PERF_RECORD_MISC_GUEST_KERNEL;
+   } else
+   misc |= user_mode(regs) ? PERF_RECORD_MISC_USER :
+   PERF_RECORD_MISC_KERNEL;
+   if (regs-flags  PERF_EFLAGS_EXACT)
+   misc |= PERF_RECORD_MISC_EXACT;
+
+   return misc;
+}
+
diff -Nraup --exclude-from=exclude.diff 
linux-2.6_tip0417/include/linux/perf_event.h 
linux-2.6_tip0417_perfkvm/include/linux/perf_event.h
--- linux-2.6_tip0417/include/linux/perf_event.h2010-04-19 
09:51:59.544791000 +0800
+++ linux-2.6_tip0417_perfkvm/include/linux/perf_event.h2010-04-19 
09:53:59.691378953 +0800
@@ -288,11 +288,13 @@ struct perf_event_mmap_page {
__u64   data_tail;  /* user-space written tail */
 };
 
-#define PERF_RECORD_MISC_CPUMODE_MASK  (3  0)
+#define PERF_RECORD_MISC_CPUMODE_MASK  (7  0)
 #define PERF_RECORD_MISC_CPUMODE_UNKNOWN   (0  0)
 #define PERF_RECORD_MISC_KERNEL(1  0)
 #define PERF_RECORD_MISC_USER  (2  0)
 #define PERF_RECORD_MISC_HYPERVISOR(3  0)
+#define PERF_RECORD_MISC_GUEST_KERNEL  (4  0)
+#define PERF_RECORD_MISC_GUEST_USER(5  0)
 
 #define PERF_RECORD_MISC_EXACT (1  14)
 /*
@@ -446,6 +448,12 @@ enum perf_callchain_context {
 # include asm/perf_event.h
 #endif
 
+struct perf_guest_info_callbacks {
+   int (*is_in_guest) (void);
+   int (*is_user_mode) (void);
+   unsigned long (*get_guest_ip) (void);
+};
+
 #ifdef CONFIG_HAVE_HW_BREAKPOINT
 #include asm/hw_breakpoint.h
 #endif
@@ -932,6 +940,12 @@ static inline void perf_event_mmap(struc
__perf_event_mmap(vma);
 }
 
+extern struct perf_guest_info_callbacks *perf_guest_cbs;
+extern int perf_register_guest_info_callbacks(
+   struct perf_guest_info_callbacks *);
+extern int perf_unregister_guest_info_callbacks(
+   struct perf_guest_info_callbacks *);
+
 extern void perf_event_comm(struct task_struct *tsk);
 extern void perf_event_fork(struct task_struct *tsk);
 
@@ -1001,6 +1015,11

[PATCH V5 2/3] perf kvm: Enhance perf to collect KVM guest os statistics from host side

2010-04-18 Thread Zhang, Yanmin
Below patch implements the perf_guest_info_callbacks on kvm.

Signed-off-by: Zhang Yanmin yanmin_zh...@linux.intel.com

---

diff -Nraup linux-2.6_tip0417/arch/x86/kvm/vmx.c 
linux-2.6_tip0417_perfkvm/arch/x86/kvm/vmx.c
--- linux-2.6_tip0417/arch/x86/kvm/vmx.c2010-04-19 09:51:47.908673911 
+0800
+++ linux-2.6_tip0417_perfkvm/arch/x86/kvm/vmx.c2010-04-19 
09:53:59.690399987 +0800
@@ -3654,8 +3654,11 @@ static void vmx_complete_interrupts(stru
 
/* We need to handle NMIs before interrupts are enabled */
if ((exit_intr_info  INTR_INFO_INTR_TYPE_MASK) == INTR_TYPE_NMI_INTR 
-   (exit_intr_info  INTR_INFO_VALID_MASK))
+   (exit_intr_info  INTR_INFO_VALID_MASK)) {
+   kvm_before_handle_nmi(vmx-vcpu);
asm(int $2);
+   kvm_after_handle_nmi(vmx-vcpu);
+   }
 
idtv_info_valid = idt_vectoring_info  VECTORING_INFO_VALID_MASK;
 
diff -Nraup linux-2.6_tip0417/arch/x86/kvm/x86.c 
linux-2.6_tip0417_perfkvm/arch/x86/kvm/x86.c
--- linux-2.6_tip0417/arch/x86/kvm/x86.c2010-04-19 09:51:47.892676413 
+0800
+++ linux-2.6_tip0417_perfkvm/arch/x86/kvm/x86.c2010-04-19 
09:53:59.691378953 +0800
@@ -40,6 +40,7 @@
 #include linux/user-return-notifier.h
 #include linux/srcu.h
 #include linux/slab.h
+#include linux/perf_event.h
 #include trace/events/kvm.h
 #undef TRACE_INCLUDE_FILE
 #define CREATE_TRACE_POINTS
@@ -3765,6 +3766,47 @@ static void kvm_timer_init(void)
}
 }
 
+static DEFINE_PER_CPU(struct kvm_vcpu *, current_vcpu);
+
+static int kvm_is_in_guest(void)
+{
+   return percpu_read(current_vcpu) != NULL;
+}
+
+static int kvm_is_user_mode(void)
+{
+   int user_mode = 3;
+   if (percpu_read(current_vcpu))
+   user_mode = kvm_x86_ops-get_cpl(percpu_read(current_vcpu));
+   return user_mode != 0;
+}
+
+static unsigned long kvm_get_guest_ip(void)
+{
+   unsigned long ip = 0;
+   if (percpu_read(current_vcpu))
+   ip = kvm_rip_read(percpu_read(current_vcpu));
+   return ip;
+}
+
+static struct perf_guest_info_callbacks kvm_guest_cbs = {
+   .is_in_guest= kvm_is_in_guest,
+   .is_user_mode   = kvm_is_user_mode,
+   .get_guest_ip   = kvm_get_guest_ip,
+};
+
+void kvm_before_handle_nmi(struct kvm_vcpu *vcpu)
+{
+   percpu_write(current_vcpu, vcpu);
+}
+EXPORT_SYMBOL_GPL(kvm_before_handle_nmi);
+
+void kvm_after_handle_nmi(struct kvm_vcpu *vcpu)
+{
+   percpu_write(current_vcpu, NULL);
+}
+EXPORT_SYMBOL_GPL(kvm_after_handle_nmi);
+
 int kvm_arch_init(void *opaque)
 {
int r;
@@ -3801,6 +3843,8 @@ int kvm_arch_init(void *opaque)
 
kvm_timer_init();
 
+   perf_register_guest_info_callbacks(kvm_guest_cbs);
+
return 0;
 
 out:
@@ -3809,6 +3853,8 @@ out:
 
 void kvm_arch_exit(void)
 {
+   perf_unregister_guest_info_callbacks(kvm_guest_cbs);
+
if (!boot_cpu_has(X86_FEATURE_CONSTANT_TSC))
cpufreq_unregister_notifier(kvmclock_cpufreq_notifier_block,
CPUFREQ_TRANSITION_NOTIFIER);
diff -Nraup linux-2.6_tip0417/arch/x86/kvm/x86.h 
linux-2.6_tip0417_perfkvm/arch/x86/kvm/x86.h
--- linux-2.6_tip0417/arch/x86/kvm/x86.h2010-04-19 09:51:47.884709050 
+0800
+++ linux-2.6_tip0417_perfkvm/arch/x86/kvm/x86.h2010-04-19 
09:53:59.691378953 +0800
@@ -65,4 +65,7 @@ static inline int is_paging(struct kvm_v
return kvm_read_cr0_bits(vcpu, X86_CR0_PG);
 }
 
+void kvm_before_handle_nmi(struct kvm_vcpu *vcpu);
+void kvm_after_handle_nmi(struct kvm_vcpu *vcpu);
+
 #endif


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH V4 0/2] perf kvm: Enhance perf to collect KVM guest os statistics from host side

2010-04-16 Thread Zhang, Yanmin
Here is the new patch of V4 against tip/master of April 13th
if anyone wants to try it.

ChangeLog V4:
1) Based on Ingo's comments, I added help information around kvm
such like command-list.txt and perf-kvm.txt.
2) Added guest process id at the tail of kernel dso long name, so
the display could show different label with different guest os.
3) Based on Avi's comments, erase the racy window which might
trigger an NMI while the NMI isn't in guest os.
4) Fixed all the errors and warnings reported by scripts/checkpatch.pl.
5) Fixed a compilation error pointed by Yang Sheng.

ChangeLog V3:
1) Add --guestmount=/dir/to/all/guestos parameter. Admin mounts guest os
root directories under /dir/to/all/guestos by sshfs. For example, I 
start
2 guest os. The one's pid is  and the other's is .
#mkdir ~/guestmount; cd ~/guestmount
#sshfs -o allow_other,direct_io -p 5551 localhost:/ /
#sshfs -o allow_other,direct_io -p 5552 localhost:/ /
#perf kvm --host --guest --guestmount=~/guestmount top

The old --guestkallsyms and --guestmodules are still supported as 
default
guest os symbol parsing.

2) Add guest os buildid support.
3) Add sub command 'perf kvm buildid-list'.
4) Delete sub command 'perf kvm stat', because our current 
implementation
doesn't transfer guest/host requirement to kernel, and kernel always
collects both host and guest statistics. So regular 'perf stat' is ok.
5) Fix a couple of perf bugs.
6) We still have no support on command with parameter 'any' as current 
KVM
just uses process id to identify specific guest os instance. Users could
uses parameter -p to collect specific guest os instance statistics.

ChangeLog V2:
1) Based on Avi's suggestion, I moved callback functions
to generic code area. So the kernel part of the patch is
clearer.
2) Add 'perf kvm stat'.


From: Zhang, Yanmin yanmin_zh...@linux.intel.com

Based on the discussion in KVM community, I worked out the patch to support
perf to collect guest os statistics from host side. This patch is implemented
with Ingo, Peter and some other guys' kind help. Yang Sheng pointed out a
critical bug and provided good suggestions with other guys. I really appreciate
their kind help.

The patch adds new sub command kvm to perf.

  perf kvm top
  perf kvm record
  perf kvm report
  perf kvm diff
  perf kvm buildid-list

The new perf could profile guest os kernel except guest os user space, but it
could summarize guest os user space utilization per guest os.

Below are some examples.
1) perf kvm top
[r...@lkp-ne01 norm]# perf kvm --host --guest 
--guestkallsyms=/home/ymzhang/guest/kallsyms
--guestmodules=/home/ymzhang/guest/modules top

---
   PerfTop:   16024 irqs/sec  kernel: 2.6% us: 0.6% guest kernel:76.2% guest 
us:20.6% exact:  0.0% [1000Hz cycles],  (all, 16 CPUs)
---

 samples  pcnt function DSO
 ___ _  ___

 3740.00  8.0% __ticket_spin_lock   [guest.kernel.kallsyms]
 2056.00  4.4% copy_user_generic_string [guest.kernel.kallsyms]
 1412.00  3.0% resource_string  [guest.kernel.kallsyms]
  595.00  1.3% __switch_to  [guest.kernel.kallsyms]
  586.00  1.2% __d_lookup   [guest.kernel.kallsyms]
  574.00  1.2% tcp_sendmsg  [guest.kernel.kallsyms]
  565.00  1.2% kmem_cache_alloc [guest.kernel.kallsyms]
  532.00  1.1% tcp_ack  [guest.kernel.kallsyms]
  494.00  1.1% __kmalloc[guest.kernel.kallsyms]
  468.00  1.0% print_cfs_rq [guest.kernel.kallsyms]
  437.00  0.9% link_path_walk   [guest.kernel.kallsyms]
  380.00  0.8% balance_runtime  [guest.kernel.kallsyms]
  379.00  0.8% kmem_cache_free  [guest.kernel.kallsyms]
  377.00  0.8% in_gate_area_no_task [guest.kernel.kallsyms]
  374.00  0.8% get_page_from_freelist   [guest.kernel.kallsyms]
  372.00  0.8% mark_files_ro[guest.kernel.kallsyms]
  368.00  0.8% _atomic_dec_and_lock [guest.kernel.kallsyms]
  356.00  0.8% crc16[crc16]
  353.00  0.8% put_page [guest.kernel.kallsyms]

If you want to just show host data, pls. don't use parameter --guest.
The headline includes guest os kernel and userspace percentage.

2) perf

[PATCH V4 1/2] perf kvm: Enhance perf to collect KVM guest os statistics from host side

2010-04-16 Thread Zhang, Yanmin
Below is the kernel patch to enable perf to collect guest os statistics.

Joerg,

Would you like to add support on svm? I don't know the exact point to trigger
NMI to host with svm.

See below code with vmx:
 
+   kvm_before_handle_nmi(vmx-vcpu);
asm(int $2);
+   kvm_after_handle_nmi(vmx-vcpu);

Signed-off-by: Zhang Yanmin yanmin_zh...@linux.intel.com

---

diff -Nraup --exclude=tools linux-2.6_tip0413/arch/x86/include/asm/perf_event.h 
linux-2.6_tip0413_perfkvm/arch/x86/include/asm/perf_event.h
--- linux-2.6_tip0413/arch/x86/include/asm/perf_event.h 2010-04-14 
11:11:03.992966568 +0800
+++ linux-2.6_tip0413_perfkvm/arch/x86/include/asm/perf_event.h 2010-04-14 
11:13:17.261881591 +0800
@@ -135,17 +135,10 @@ extern void perf_events_lapic_init(void)
  */
 #define PERF_EFLAGS_EXACT  (1UL  3)
 
-#define perf_misc_flags(regs)  \
-({ int misc = 0;   \
-   if (user_mode(regs))\
-   misc |= PERF_RECORD_MISC_USER;  \
-   else\
-   misc |= PERF_RECORD_MISC_KERNEL;\
-   if (regs-flags  PERF_EFLAGS_EXACT)\
-   misc |= PERF_RECORD_MISC_EXACT; \
-   misc; })
-
-#define perf_instruction_pointer(regs) ((regs)-ip)
+struct pt_regs;
+extern unsigned long perf_instruction_pointer(struct pt_regs *regs);
+extern unsigned long perf_misc_flags(struct pt_regs *regs);
+#define perf_misc_flags(regs)  perf_misc_flags(regs)
 
 #else
 static inline void init_hw_perf_events(void)   { }
diff -Nraup --exclude=tools linux-2.6_tip0413/arch/x86/kernel/cpu/perf_event.c 
linux-2.6_tip0413_perfkvm/arch/x86/kernel/cpu/perf_event.c
--- linux-2.6_tip0413/arch/x86/kernel/cpu/perf_event.c  2010-04-14 
11:11:04.825028810 +0800
+++ linux-2.6_tip0413_perfkvm/arch/x86/kernel/cpu/perf_event.c  2010-04-14 
17:02:12.198063684 +0800
@@ -1720,6 +1720,11 @@ struct perf_callchain_entry *perf_callch
 {
struct perf_callchain_entry *entry;
 
+   if (perf_guest_cbs  perf_guest_cbs-is_in_guest()) {
+   /* TODO: We don't support guest os callchain now */
+   return NULL;
+   }
+
if (in_nmi())
entry = __get_cpu_var(pmc_nmi_entry);
else
@@ -1743,3 +1748,30 @@ void perf_arch_fetch_caller_regs(struct 
regs-cs = __KERNEL_CS;
local_save_flags(regs-flags);
 }
+
+unsigned long perf_instruction_pointer(struct pt_regs *regs)
+{
+   unsigned long ip;
+   if (perf_guest_cbs  perf_guest_cbs-is_in_guest())
+   ip = perf_guest_cbs-get_guest_ip();
+   else
+   ip = instruction_pointer(regs);
+   return ip;
+}
+
+unsigned long perf_misc_flags(struct pt_regs *regs)
+{
+   int misc = 0;
+   if (perf_guest_cbs  perf_guest_cbs-is_in_guest()) {
+   misc |= perf_guest_cbs-is_user_mode() ?
+   PERF_RECORD_MISC_GUEST_USER :
+   PERF_RECORD_MISC_GUEST_KERNEL;
+   } else
+   misc |= user_mode(regs) ? PERF_RECORD_MISC_USER :
+   PERF_RECORD_MISC_KERNEL;
+   if (regs-flags  PERF_EFLAGS_EXACT)
+   misc |= PERF_RECORD_MISC_EXACT;
+
+   return misc;
+}
+
diff -Nraup --exclude=tools linux-2.6_tip0413/arch/x86/kvm/vmx.c 
linux-2.6_tip0413_perfkvm/arch/x86/kvm/vmx.c
--- linux-2.6_tip0413/arch/x86/kvm/vmx.c2010-04-14 11:11:04.353024541 
+0800
+++ linux-2.6_tip0413_perfkvm/arch/x86/kvm/vmx.c2010-04-15 
10:28:39.516891050 +0800
@@ -3654,8 +3654,11 @@ static void vmx_complete_interrupts(stru
 
/* We need to handle NMIs before interrupts are enabled */
if ((exit_intr_info  INTR_INFO_INTR_TYPE_MASK) == INTR_TYPE_NMI_INTR 
-   (exit_intr_info  INTR_INFO_VALID_MASK))
+   (exit_intr_info  INTR_INFO_VALID_MASK)) {
+   kvm_before_handle_nmi(vmx-vcpu);
asm(int $2);
+   kvm_after_handle_nmi(vmx-vcpu);
+   }
 
idtv_info_valid = idt_vectoring_info  VECTORING_INFO_VALID_MASK;
 
diff -Nraup --exclude=tools linux-2.6_tip0413/arch/x86/kvm/x86.c 
linux-2.6_tip0413_perfkvm/arch/x86/kvm/x86.c
--- linux-2.6_tip0413/arch/x86/kvm/x86.c2010-04-14 11:11:04.341042024 
+0800
+++ linux-2.6_tip0413_perfkvm/arch/x86/kvm/x86.c2010-04-15 
17:16:41.340064784 +0800
@@ -40,6 +40,7 @@
 #include linux/user-return-notifier.h
 #include linux/srcu.h
 #include linux/slab.h
+#include linux/perf_event.h
 #include trace/events/kvm.h
 #undef TRACE_INCLUDE_FILE
 #define CREATE_TRACE_POINTS
@@ -3765,6 +3766,47 @@ static void kvm_timer_init(void)
}
 }
 
+static DEFINE_PER_CPU(struct kvm_vcpu *, current_vcpu);
+
+static int kvm_is_in_guest(void)
+{
+   return percpu_read(current_vcpu) != NULL;
+}
+
+static int kvm_is_user_mode(void)
+{
+   int user_mode = 3;
+   if (percpu_read(current_vcpu

Re: [PATCH V3] perf kvm: Enhance perf to collect KVM guest os statistics from host side

2010-04-15 Thread Zhang, Yanmin
On Thu, 2010-04-15 at 11:05 +0300, Avi Kivity wrote:
 On 04/15/2030 04:04 AM, Zhang, Yanmin wrote:
 
  An even more accurate way to determine this is to check whether the
  interrupt frame points back at the 'int $2' instruction.  However we
  plan to switch to a self-IPI method to inject the NMI, and I'm not sure
  wether APIC NMIs are accepted on an instruction boundary or whether
  there's some latency involved.
   
  Yes. But the frame pointer checking seems a little complicated.
 
 
 An even bigger disadvantage is that it won't work with Sheng's patch, 
 self-NMIs are not synchronous.
 
trace_kvm_entry(vcpu-vcpu_id);
  +
  + percpu_write(current_vcpu, vcpu);
kvm_x86_ops-run(vcpu);
  + percpu_write(current_vcpu, NULL);
 
 
  If you move this around the 'int $2' instructions you will close the
  race, as a stray NMI won't catch us updating the rip cache.  But that
  depends on whether self-IPI is accepted on the next instruction or not.
   
  Right. The kernel part has dependency on the self-IPI implementation.
  I will move above percpu_write(current_vcpu, vcpu) (or a new wrapper 
  function)
  just around 'int $2'.
 
 
 
 Or create a new function to inject the interrupt in x86.c.  That will 
 reduce duplication between svm.c and vmx.c.
I checked svm.c and it seems svm.c doesn't trigger a NMI to host if the NMI
happens in guest os. In addition, svm_complete_interrupts is called after
interrupt is enabled.


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH V3] perf kvm: Enhance perf to collect KVM guest os statistics from host side

2010-04-14 Thread Zhang, Yanmin
On Wed, 2010-04-14 at 12:20 +0300, Avi Kivity wrote:
 On 04/14/2030 12:05 PM, Zhang, Yanmin wrote:
  Here is the new patch of V3 against tip/master of April 13th
  if anyone wants to try it.
 
 
 
 Thanks for persisting despite the flames.
 
 Can you please separate arch/x86/kvm part of the patch?  That will make 
 for easier reviewing, and will need to go through separate trees.
I should do so definitely, and will do so in next version which also fixes
some issues pointed by Ingo.

 
 Sheng, did you make any progress with the NMI injection issue?
 
  +
  diff -Nraup linux-2.6_tip0413/arch/x86/kvm/x86.c 
  linux-2.6_tip0413_perfkvm/arch/x86/kvm/x86.c
  --- linux-2.6_tip0413/arch/x86/kvm/x86.c2010-04-14 11:11:04.341042024 
  +0800
  +++ linux-2.6_tip0413_perfkvm/arch/x86/kvm/x86.c2010-04-14 
  11:32:45.841278890 +0800
  @@ -3765,6 +3765,35 @@ static void kvm_timer_init(void)
  }
}
 
  +static DEFINE_PER_CPU(struct kvm_vcpu *, current_vcpu);
  +
  +static int kvm_is_in_guest(void)
  +{
  +   return percpu_read(current_vcpu) != NULL;
 
 
 An even more accurate way to determine this is to check whether the 
 interrupt frame points back at the 'int $2' instruction.  However we 
 plan to switch to a self-IPI method to inject the NMI, and I'm not sure 
 wether APIC NMIs are accepted on an instruction boundary or whether 
 there's some latency involved.
Yes. But the frame pointer checking seems a little complicated.

 
  +static unsigned long kvm_get_guest_ip(void)
  +{
  +   unsigned long ip = 0;
  +   if (percpu_read(current_vcpu))
  +   ip = kvm_rip_read(percpu_read(current_vcpu));
  +   return ip;
  +}
 
 
 This may be racy.  kvm_rip_read() accesses a cache in memory; if we're 
 in the process of updating the cache, then we may read a stale value.  
 See below.
Right. The racy window seems too big.

 
 
  trace_kvm_entry(vcpu-vcpu_id);
  +
  +   percpu_write(current_vcpu, vcpu);
  kvm_x86_ops-run(vcpu);
  +   percpu_write(current_vcpu, NULL);
 
 
 If you move this around the 'int $2' instructions you will close the 
 race, as a stray NMI won't catch us updating the rip cache.  But that 
 depends on whether self-IPI is accepted on the next instruction or not.
Right. The kernel part has dependency on the self-IPI implementation.
I will move above percpu_write(current_vcpu, vcpu) (or a new wrapper function)
just around 'int $2'.

Sheng would find a solution on the self-IPI delivery. Let's separate my patch
and self-IPI as 2 issues as we don't know when the self-IPI delivery would be
resolved.

Thanks,
Yanmin


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 3/3] perf events: Change perf parameter --pid to process-wide collection instead of thread-wide

2010-03-25 Thread Zhang, Yanmin
On Thu, 2010-03-25 at 16:02 +0800, Li Zefan wrote:
 Zhang, Yanmin wrote:
  From: Zhang, Yanmin yanmin_zh...@linux.intel.com
  
  Parameter --pid (or -p) of perf currently means a thread-wide collection.
  For exmaple, if a process whose id is  has 10 threads, 'perf top -p 
  '
  just collects the main thread statistics. That's misleading. Users are
  used to attach a whole process when debugging a process by gdb. To follow
  normal usage style, the patch change --pid to process-wide collection and
  add --tid (-t) to mean a thread-wide collection.
  
  Usage example is:
  #perf top -p 
  #perf record -p  -f sleep 10
  #perf stat -p  -f sleep 10
  Above commands collect the statistics of all threads of process .
  
  Signed-off-by: Zhang Yanmin yanmin_zh...@linux.intel.com
  
 
 Seems this patch causes seg faults:
 
 # ./perf sched record
 Segmentation fault
 # ./perf kmem record
 Segmentation fault
 # ./perf timechart record
 Segmentation fault

Thanks for reporting it. Arnaldo, could you pick up below patch?
Zefan, Could you try it?

mmap_array[][][] is not reset to 0 after malloc. Below patch against
tip/master of March 24th fixes it with a zalloc.

Reported-by:Li Zefan l...@cn.fujitsu.com
Signed-off-by:  Zhang Yanmin yanmin_zh...@linux.intel.com

---

diff -Nraup linux-2.6_tip0324/tools/perf/builtin-record.c 
linux-2.6_tip0324_perfkvm/tools/perf/builtin-record.c
--- linux-2.6_tip0324/tools/perf/builtin-record.c   2010-03-25 
10:58:13.308912201 +0800
+++ linux-2.6_tip0324_perfkvm/tools/perf/builtin-record.c   2010-03-25 
16:14:18.201475298 +0800
@@ -751,7 +751,7 @@ int cmd_record(int argc, const char **ar
for (i = 0; i  MAX_NR_CPUS; i++) {
for (j = 0; j  MAX_COUNTERS; j++) {
fd[i][j] = malloc(sizeof(int)*thread_num);
-   mmap_array[i][j] = malloc(
+   mmap_array[i][j] = zalloc(
sizeof(struct mmap_data)*thread_num);
if (!fd[i][j] || !mmap_array[i][j])
return -ENOMEM;
diff -Nraup linux-2.6_tip0324/tools/perf/builtin-top.c 
linux-2.6_tip0324_perfkvm/tools/perf/builtin-top.c
--- linux-2.6_tip0324/tools/perf/builtin-top.c  2010-03-25 10:58:13.284848937 
+0800
+++ linux-2.6_tip0324_perfkvm/tools/perf/builtin-top.c  2010-03-25 
16:14:56.875266645 +0800
@@ -1371,7 +1371,7 @@ int cmd_top(int argc, const char **argv,
for (i = 0; i  MAX_NR_CPUS; i++) {
for (j = 0; j  MAX_COUNTERS; j++) {
fd[i][j] = malloc(sizeof(int)*thread_num);
-   mmap_array[i][j] = malloc(
+   mmap_array[i][j] = zalloc(
sizeof(struct mmap_data)*thread_num);
if (!fd[i][j] || !mmap_array[i][j])
return -ENOMEM;


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] Unify KVM kernel-space and user-space code into a single project

2010-03-25 Thread Zhang, Yanmin
On Wed, 2010-03-24 at 20:20 +0200, Avi Kivity wrote:
 On 03/24/2010 07:47 PM, Arnaldo Carvalho de Melo wrote:
  Em Wed, Mar 24, 2010 at 06:09:30PM +0200, Avi Kivity escreveu:
 
  Doesn't perf already has a dependency on naming conventions for finding
  debug information?
   
  It looks at several places, from most symbol rich (/usr/lib/debug/, aka
  -debuginfo packages, where we have full symtabs) to poorest (the
  packaged binary, where we may just have a .dynsym).
 
  In an ideal world, it would just get the build-id (a SHA1 cookie that is
  in an ELF session inserted in every binary (aka DSOs), kernel module,
  kallsyms or vmlinux file) and use that to look first in a local cache
  (implemented in perf for a long time already) or in some symbol server.
 
  For instance, for a random perf.data file I collected here in my machine
  I have:
 
  [a...@doppio linux-2.6-tip]$ perf buildid-list | grep libpthread
  5c68f7afeb33309c78037e374b0deee84dd441f6 /lib64/libpthread-2.10.2.so
  [a...@doppio linux-2.6-tip]$
 
  So I don't have to access /lib64/libpthread-2.10.2.so directly, nor some
  convention to get a debuginfo in a local file like:
 
  /usr/lib/debug/lib64/libpthread-2.10.2.so.debug
 
  Instead the tools look at:
 
  [a...@doppio linux-2.6-tip]$ l 
  ~/.debug/.build-id/5c/68f7afeb33309c78037e374b0deee84dd441f6
  lrwxrwxrwx 1 acme acme 73 2010-01-06 18:53 
  /home/acme/.debug/.build-id/5c/68f7afeb33309c78037e374b0deee84dd441f6 -  
  ../../lib64/libpthread-2.10.2.so/5c68f7afeb33309c78037e374b0deee84dd441f6*
 
  To find the file for that specific build-id, not the one installed in my
  machine (or on the different machine, of a different architecture) that
  may be completely unrelated, a new one, or one for a different arch.
 
 
 Thanks.  I believe qemu could easily act as a symbol server for this use 
 case.


I spent a couple of days to investigate why sshfs/fuse doesn't work well with
procfs and sysfs. Just after my patch against fuse is ready almost, I found
fuse already supports such access by direct I/O. With parameter -o direct_io,
it could work well.

Here is an example to mount / from a guest os.
#sshfs -p 5551 -o direct_io localhost:/ guestmount

We can read files and write files if permission is ok.

I will go ahead to support multiple guest os instance statistics parsing.

Yanmin


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Enhance perf to collect KVM guest os statistics from host side

2010-03-23 Thread Zhang, Yanmin
On Tue, 2010-03-23 at 10:15 -0300, Arnaldo Carvalho de Melo wrote:
 Em Tue, Mar 23, 2010 at 11:14:41AM +0800, Zhang, Yanmin escreveu:
  On Mon, 2010-03-22 at 13:44 -0300, Arnaldo Carvalho de Melo wrote:
   Em Mon, Mar 22, 2010 at 03:24:47PM +0800, Zhang, Yanmin escreveu:
On Fri, 2010-03-19 at 09:21 +0100, Ingo Molnar wrote:
Then, perf could access all files. It's possible because guest os 
instance
happens to be multi-threading in a process. One of the defects is the 
accessing to
guest os becomes slow or impossible when guest os is very busy.
   
   If the MMAP events on the guest included a cookie that could later be
   used to query for the symtab of that DSO, we wouldn't need to access the
   guest FS at all, right?
 
  It depends on specific sub commands. As for 'perf kvm top', developers
  want to see the profiling immediately. Even with 'perf kvm record',
  developers also want to
 
 That is not a problem, if you have the relevant buildids in your cache
 (Look in your machine at ~/.debug/), it will be as fast as ever.
 
 If you use a distro that has its userspace with build-ids, you probably
 use it always without noticing :-)
 
  see results quickly. At least I'm eager for the results when
  investigating a performance issue.
 
 Sure thing.
  
   With build-ids and debuginfo-install like tools the symbol
   resolution could be performed by using the cookies (build-ids) as
   keys to get to the *-debuginfo packages with matching symtabs (and
   DWARF for source annotation, etc).
 
  We can't make sure guest os uses the same os images, or don't know
  where we could find the original DVD images being used to install
  guest os.
 
 You don't have to have guest and host sharing the same OS image, you
 just have to somehow populate your buildid cache with what you need, be
 it using sshfs or what Ingo is suggesting once, or using what your
 vendor provides (debuginfo packages). And you just have to do it once,
 for the relevant apps, to have it in your buildid cache.
  
  Current perf does save build id, including both kernls's and other
  application lib/executables.
 
 Yeah, I know, I implemented it. :-)
  
   We have that for the kernel as:
 
   [a...@doppio linux-2.6-tip]$ l /sys/kernel/notes 
   -r--r--r-- 1 root root 36 2010-03-22 13:14 /sys/kernel/notes
   [a...@doppio linux-2.6-tip]$ l 
   /sys/module/ipv6/sections/.note.gnu.build-id 
   -r--r--r-- 1 root root 4096 2010-03-22 13:38 
   /sys/module/ipv6/sections/.note.gnu.build-id
   [a...@doppio linux-2.6-tip]$
 
   That way we would cover DSOs being reinstalled in long running 'perf
   record' sessions too.
 
  That's one of objectives of perf to support long running.
 
 But it doesn't fully supports right now, as I explained, build-ids are
 collected at the end of the record session, because we have to open the
 DSOs that had hits to get the 20 bytes cookie we need, the build-id.
 
 If we had it in the PERF_RECORD_MMAP record, we would close this race,
 and the added cost at load time should be minimal, to get the ELF
 section with it and put it somewhere in task struct.
Well, you are improving upon perfection.

 
 If only we could coalesce it a bit to reclaim this:
 
 [a...@doppio linux-2.6-tip]$ pahole -C task_struct 
 ../build/v2.6.34-rc1-tip+/kernel/sched.o  | tail -5
   /* size: 5968, cachelines: 94, members: 150 */
   /* sum members: 5943, holes: 7, sum holes: 25 */
   /* bit holes: 1, sum bit holes: 28 bits */
   /* last cacheline: 16 bytes */
 };
 [a...@doppio linux-2.6-tip]$ 
That reminds me I listened to your presentation on 2007 OLS. :)

 
 8-)
 
 Or at least get just one of those 4 bytes holes then we could stick it
 at the end to get our build-id there, accessing it would be done only
 at PERF_RECORD_MMAP injection time, i.e. close to the time when we
 actually are loading the executable mmap, i.e. close to the time when
 the loader is injecting the build-id, I guess the extra memory and
 processing costs would be in the noise.
 
   This was discussed some time ago but would require help from the bits
   that load DSOs.
 
   build-ids then would be first class citizens.
 
 - Arnaldo


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] Unify KVM kernel-space and user-space code into a single project

2010-03-22 Thread Zhang, Yanmin
On Sun, 2010-03-21 at 22:20 +0100, Ingo Molnar wrote:
 * Avi Kivity a...@redhat.com wrote:
 
   Well, for what it's worth, I rarely ever use anything else. My virtual 
   disks are raw so I can loop mount them easily, and I can also switch my 
   guest kernels from outside... without ever needing to mount those disks.
  
  Curious, what do you use them for?
  
  btw, if you build your kernel outside the guest, then you already have 
  access to all its symbols, without needing anything further.
 
 There's two errors with your argument:
 
 1) you are assuming that it's only about kernel symbols
 
 Look at this 'perf report' output:
 
 # Samples: 7127509216
 #
 # Overhead Command  Shared Object  Symbol
 #   ..  .  ..
 #
 19.14% git  git[.] lookup_object
 15.16%perf  git[.] lookup_object
  4.74%perf  libz.so.1.2.3  [.] inflate
  4.52% git  libz.so.1.2.3  [.] inflate
  4.21%perf  libz.so.1.2.3  [.] inflate_table
  3.94% git  libz.so.1.2.3  [.] inflate_table
  3.29% git  git[.] find_pack_entry_one
  3.24% git  libz.so.1.2.3  [.] inflate_fast
  2.96%perf  libz.so.1.2.3  [.] inflate_fast
  2.96% git  git[.] decode_tree_entry
  2.80%perf  libc-2.11.90.so[.] __strlen_sse42
  2.56% git  libc-2.11.90.so[.] __strlen_sse42
  1.98%perf  libc-2.11.90.so[.] __GI_memcpy
  1.71%perf  git[.] decode_tree_entry
  1.53% git  libc-2.11.90.so[.] __GI_memcpy
  1.48% git  git[.] lookup_blob
  1.30% git  git[.] process_tree
  1.30%perf  git[.] process_tree
  0.90%perf  git[.] tree_entry
  0.82%perf  git[.] lookup_blob
  0.78% git  [kernel.kallsyms]  [k] kstat_irqs_cpu
 
 kernel symbols are only a small portion of the symbols. (a single line in 
 this 
 case)
Above example shows perf could summarize both kernel and application hot 
functions.
If we collect guest os statistics from host side, we can't summarize detailed 
guest os
application info because we couldn't get guest os's application process id from 
host
side. So we could only get detailed kernel info and the total utilization 
percent of
guest application processes.


 
 To get to those other symbols we have to read the ELF symbols of those 
 binaries in the guest filesystem, in the post-processing/reporting phase. 
 This 
 is both complex to do and relatively slow so we dont want to (and cannot) do 
 this at sample time from IRQ context or NMI context ...
 
 Also, many aspects of reporting are interactive so it's done lazily or 
 on-demand. So we need ready access to the guest filesystem - for those guests 
 which decide to integrate with the host for this.


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Enhance perf to collect KVM guest os statistics from host side

2010-03-22 Thread Zhang, Yanmin
On Fri, 2010-03-19 at 09:21 +0100, Ingo Molnar wrote:
 Nice progress!
 
 This bit:
 
  1) perf kvm top
  [r...@lkp-ne01 norm]# perf kvm --host --guest 
  --guestkallsyms=/home/ymzhang/guest/kallsyms
  --guestmodules=/home/ymzhang/guest/modules top
 

 Will be really be painful to developers - to enter that long line while we 
 have these things called 'computers' that ought to reduce human work. Also, 
 it's incomplete, we need access to the guest system's binaries to do ELF 
 symbol resolution and dwarf decoding.
Yes, I agree with you and Avi that we need the enhancement be user-friendly.
One of my start points is to keep the tool having less dependency on
other components. Admin/developers could write script wrappers quickly if
perf has parameters to support the new capability.


 
 So we really need some good, automatic way to get to the guest symbol space, 
 so that if a developer types:
 
perf kvm top
 
 Then the obvious thing happens by default. (which is to show the guest 
 overhead)
 
 There's no technical barrier on the perf tooling side to implement all that: 
 perf supports build-ids extensively and can deal with multiple symbol spaces 
 - 
 as long as it has access to it. The guest kernel could be ID-ed based on its 
 /sys/kernel/notes and /sys/module/*/notes/.note.gnu.build-id build-ids.
I tried sshfs quickly. sshfs could mount root filesystem of guest os nicely.
I could access the files quickly. However, it doesn't work when I access
/proc/ and /sys/ because sshfs/scp depend on file size while the sizes of most
files of /proc/ and /sys/ are 0.


 
 So some sort of --guestmount option would be the natural solution, which 
 points to the guest system's root: and a Qemu enumeration of guest mounts 
 (which would be off by default and configurable) from which perf can pick up 
 the target guest all automatically. (obviously only under allowed permissions 
 so that such access is secure)
If sshfs could access /proc/ and /sys correctly, here is a design:
--guestmount points to a directory which consists of a list of sub-directories.
Every sub-directory's name is just the qemu process id of guest os. 
Admin/developer
mounts every guest os instance's root directory to corresponding sub-directory.

Then, perf could access all files. It's possible because guest os instance
happens to be multi-threading in a process. One of the defects is the accessing 
to
guest os becomes slow or impossible when guest os is very busy.


 
 This would allow not just kallsyms access via $guest/proc/kallsyms but also 
 gives us the full space of symbol features: access to the guest binaries for 
 annotation and general symbol resolution, command/binary name identification, 
 etc.
 
 Such a mount would obviously not broaden existing privileges - and as an 
 additional control a guest would also have a way to indicate that it does not 
 wish a guest mount at all.
 
 Unfortunately, in a previous thread the Qemu maintainer has indicated that he 
 will essentially NAK any attempt to enhance Qemu to provide an easily 
 discoverable, self-contained, transparent guest mount on the host side.
 
 No technical justification was given for that NAK, despite my repeated 
 requests to particulate the exact security problems that such an approach 
 would cause.
 
 If that NAK does not stand in that form then i'd like to know about it - it 
 makes no sense for us to try to code up a solution against a standing 
 maintainer NAK ...
 
 The other option is some sysadmin level hackery to NFS-mount the guest or so. 
 This is a vastly inferior method that brings us back to the absymal usability 
 levels of OProfile:
 
  1) it wont be guest transparent
  2) has to be re-done for every guest image. 
  3) even if packaged it has to be gotten into every. single. Linux. distro. 
 separately.
  4) old Linux guests wont work out of box
 
 In other words: it's very inconvenient on multiple levels and wont ever 
 happen 
 on any reasonable enough scale to make a difference to Linux.
 
 Which is an unfortunate situation - and the ball is on the KVM/Qemu side so i 
 can do little about it.
 
 Thanks,
 
   Ingo


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Enhance perf to collect KVM guest os statistics from host side

2010-03-22 Thread Zhang, Yanmin
On Mon, 2010-03-22 at 13:44 -0300, Arnaldo Carvalho de Melo wrote:
 Em Mon, Mar 22, 2010 at 03:24:47PM +0800, Zhang, Yanmin escreveu:
  On Fri, 2010-03-19 at 09:21 +0100, Ingo Molnar wrote:
   So some sort of --guestmount option would be the natural solution, which 
   points to the guest system's root: and a Qemu enumeration of guest mounts 
   (which would be off by default and configurable) from which perf can pick 
   up 
   the target guest all automatically. (obviously only under allowed 
   permissions 
   so that such access is secure)
  If sshfs could access /proc/ and /sys correctly, here is a design:
  --guestmount points to a directory which consists of a list of 
  sub-directories.
  Every sub-directory's name is just the qemu process id of guest os. 
  Admin/developer
  mounts every guest os instance's root directory to corresponding 
  sub-directory.
  
  Then, perf could access all files. It's possible because guest os instance
  happens to be multi-threading in a process. One of the defects is the 
  accessing to
  guest os becomes slow or impossible when guest os is very busy.
 
 If the MMAP events on the guest included a cookie that could later be
 used to query for the symtab of that DSO, we wouldn't need to access the
 guest FS at all, right?
It depends on specific sub commands. As for 'perf kvm top', developers want to 
see
the profiling immediately. Even with 'perf kvm record', developers also want to
see results quickly. At least I'm eager for the results when investigating
a performance issue.

 
 With build-ids and debuginfo-install like tools the symbol resolution
 could be performed by using the cookies (build-ids) as keys to get to
 the *-debuginfo packages with matching symtabs (and DWARF for source
 annotation, etc).
We can't make sure guest os uses the same os images, or don't know where we
could find the original DVD images being used to install guest os.

Current perf does save build id, including both kernls's and other application
lib/executables.

 
 We have that for the kernel as:
 
 [a...@doppio linux-2.6-tip]$ l /sys/kernel/notes 
 -r--r--r-- 1 root root 36 2010-03-22 13:14 /sys/kernel/notes
 [a...@doppio linux-2.6-tip]$ l /sys/module/ipv6/sections/.note.gnu.build-id 
 -r--r--r-- 1 root root 4096 2010-03-22 13:38 
 /sys/module/ipv6/sections/.note.gnu.build-id
 [a...@doppio linux-2.6-tip]$
 
 That way we would cover DSOs being reinstalled in long running 'perf
 record' sessions too.
That's one of objectives of perf to support long running.

 
 This was discussed some time ago but would require help from the bits
 that load DSOs.
 
 build-ids then would be first class citizens.
 
 - Arnaldo


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Enhance perf to collect KVM guest os statistics from host side

2010-03-18 Thread Zhang, Yanmin
On Thu, 2010-03-18 at 10:45 +0800, Zhang, Yanmin wrote:
 On Wed, 2010-03-17 at 17:26 +0800, Zhang, Yanmin wrote:
  On Tue, 2010-03-16 at 10:47 +0100, Ingo Molnar wrote:
   * Zhang, Yanmin yanmin_zh...@linux.intel.com wrote:
   
On Tue, 2010-03-16 at 15:48 +0800, Zhang, Yanmin wrote:
 On Tue, 2010-03-16 at 07:41 +0200, Avi Kivity wrote:
  On 03/16/2010 07:27 AM, Zhang, Yanmin wrote:
   From: Zhang, Yanminyanmin_zh...@linux.intel.com
  
   Based on the discussion in KVM community, I worked out the patch 
   to support
   perf to collect guest os statistics from host side. This patch is 
   implemented
   with Ingo, Peter and some other guys' kind help. Yang Sheng 
   pointed out a
   critical bug and provided good suggestions with other guys. I 
   really appreciate
   their kind help.
  
   The patch adds new subcommand kvm to perf.
  
  perf kvm top
  perf kvm record
  perf kvm report
  perf kvm diff
  
   The new perf could profile guest os kernel except guest os user 
   space, but it
   could summarize guest os user space utilization per guest os.
  
   Below are some examples.
   1) perf kvm top
   [r...@lkp-ne01 norm]# perf kvm --host --guest 
   --guestkallsyms=/home/ymzhang/guest/kallsyms
   --guestmodules=/home/ymzhang/guest/modules top
  
  
  
 Thanks for your kind comments.
 
  Excellent, support for guest kernel != host kernel is critical (I 
  can't 
  remember the last time I ran same kernels).
  
  How would we support multiple guests with different kernels?
 With the patch, 'perf kvm report --sort pid could show
 summary statistics for all guest os instances. Then, use
 parameter --pid of 'perf kvm record' to collect single problematic 
 instance data.
Sorry. I found currently --pid isn't process but a thread (main thread).

Ingo,

Is it possible to support a new parameter or extend --inherit, so 'perf 
record' and 'perf top' could collect data on all threads of a process 
when 
the process is running?

If not, I need add a new ugly parameter which is similar to --pid to 
filter 
out process data in userspace.
   
   Yeah. For maximum utility i'd suggest to extend --pid to include this, 
   and 
   introduce --tid for the previous, limited-to-a-single-task functionality.
   
   Most users would expect --pid to work like a 'late attach' - i.e. to work 
   like 
   strace -f or like a gdb attach.
  
  Thanks Ingo, Avi.
  
  I worked out below patch against tip/master of March 15th.
  
  Subject: [PATCH] Change perf's parameter --pid to process-wide collection
  From: Zhang, Yanmin yanmin_zh...@linux.intel.com
  
  Change parameter -p (--pid) to real process pid and add -t (--tid) meaning
  thread id. Now, --pid means perf collects the statistics of all threads of
  the process, while --tid means perf just collect the statistics of that 
  thread.
  
  BTW, the patch fixes a bug of 'perf stat -p'. 'perf stat' always configures
  attr-disabled=1 if it isn't a system-wide collection. If there is a '-p'
  and no forks, 'perf stat -p' doesn't collect any data. In addition, the
  while(!done) in run_perf_stat consumes 100% single cpu time which has bad 
  impact
  on running workload. I added a sleep(1) in the loop.
  
  Signed-off-by: Zhang Yanmin yanmin_zh...@linux.intel.com
 Ingo,
 
 Sorry, the patch has bugs.  I need do a better job and will work out 2
 separate patches against the 2 issues.

I worked out 3 new patches against tip/master tree of Mar. 17th.

1) Patch perf_stat: Fix the issue that perf doesn't enable counters when
target_pid != -1. Change the condition to fork/exec subcommand. If there
is a subcommand parameter, perf always fork/exec it. The usage example is:
#perf stat -a sleep 10
So this command could collect statistics for 10 seconds precisely. User
still could stop it by CTRL+C.

2) Patch perf_record: Fix the issue that when perf forks/exec a subcommand,
it should enable all counters after the new process is execing.Change the
condition to fork/exec subcommand. If there is a subcommand parameter,
perf always fork/exec it. The usage example is:
#perf record -f -a sleep 10
So this command could collect statistics for 10 seconds precisely. User
still could stop it by CTRL+C.

3) perf_pid: Change parameter --pid to process-wide collection. Add --tid
which means collecting thread-wide statistics. Usage example is:
#perf top -p 
#perf record -p  -f sleep 10
#perf stat -p  -f sleep 10

Arnaldo,

Pls. apply the 3 attached patches.

Yanmin

diff -Nraup linux-2.6_tipmaster0317/tools/perf/builtin-stat.c linux-2.6_tipmaster0317_fixstat/tools/perf/builtin-stat.c
--- linux-2.6_tipmaster0317/tools/perf/builtin-stat.c	2010-03-18 09:04:40.938289813 +0800
+++ linux-2.6_tipmaster0317_fixstat/tools/perf/builtin-stat.c	2010-03-18 13:07

[PATCH 1/3] perf events: Enable counters when collecting process-wide or system-wide data by 'perf stat'

2010-03-18 Thread Zhang, Yanmin

Tool perf has a couple of sub commands. There are a couple of issues around
counter enabling time point. In addition, we want a precise time clock
when collect system-wide or process/thread-wide statistics.

I worked out 3 patches against tips/master tree of March 17 to fix such issues
and enhance perf to be more user-friendly.


Subject: [PATCH 1/3] perf events: Enable counters when collecting process-wide 
or system-wide data by 'perf stat'
From: Zhang, Yanmin yanmin_zh...@linux.intel.com

Command 'perf stat' doesn't enable counters when collecting an existing (by -p) 
process
or a system-wide statistics. Fix the issue.
Change the condition of fork/exec subcommand. If there is a subcommand 
parameter,
perf always fork/exec it. The usage example is:
#perf stat -a sleep 10

So this command could collect statistics for 10 seconds precisely. User
still could stop it by CTRL+C. Without the new capability, user could only
use CTRL+C to stop it without precise time clock.

Another issue is 'perf stat -a' consumes 100% time of a full single logical 
cpu. It
has a bad impact on running workload. Fix it by adding a sleep(1) in the 
while(!done)
loop in function run_perf_stat.

Signed-off-by: Zhang Yanmin yanmin_zh...@linux.intel.com

---

diff -Nraup linux-2.6_tipmaster0317/tools/perf/builtin-stat.c 
linux-2.6_tipmaster0317_fixstat/tools/perf/builtin-stat.c
--- linux-2.6_tipmaster0317/tools/perf/builtin-stat.c   2010-03-18 
09:04:40.938289813 +0800
+++ linux-2.6_tipmaster0317_fixstat/tools/perf/builtin-stat.c   2010-03-18 
13:07:26.773773541 +0800
@@ -159,8 +159,10 @@ static void create_perf_stat_counter(int
}
} else {
attr-inherit= inherit;
-   attr-disabled   = 1;
-   attr-enable_on_exec = 1;
+   if (target_pid == -1) {
+   attr-disabled = 1;
+   attr-enable_on_exec = 1;
+   }
 
fd[0][counter] = sys_perf_event_open(attr, pid, -1, -1, 0);
if (fd[0][counter]  0  verbose)
@@ -251,9 +253,9 @@ static int run_perf_stat(int argc __used
unsigned long long t0, t1;
int status = 0;
int counter;
-   int pid = target_pid;
+   int pid;
int child_ready_pipe[2], go_pipe[2];
-   const bool forks = (target_pid == -1  argc  0);
+   const bool forks = (argc  0);
char buf;
 
if (!system_wide)
@@ -265,10 +267,10 @@ static int run_perf_stat(int argc __used
}
 
if (forks) {
-   if ((pid = fork())  0)
+   if ((child_pid = fork())  0)
perror(failed to fork);
 
-   if (!pid) {
+   if (!child_pid) {
close(child_ready_pipe[0]);
close(go_pipe[1]);
fcntl(go_pipe[0], F_SETFD, FD_CLOEXEC);
@@ -297,8 +299,6 @@ static int run_perf_stat(int argc __used
exit(-1);
}
 
-   child_pid = pid;
-
/*
 * Wait for the child to be ready to exec.
 */
@@ -309,6 +309,10 @@ static int run_perf_stat(int argc __used
close(child_ready_pipe[0]);
}
 
+   if (target_pid == -1)
+   pid = child_pid;
+   else
+   pid = target_pid;
for (counter = 0; counter  nr_counters; counter++)
create_perf_stat_counter(counter, pid);
 
@@ -321,7 +325,7 @@ static int run_perf_stat(int argc __used
close(go_pipe[1]);
wait(status);
} else {
-   while(!done);
+   while(!done) sleep(1);
}
 
t1 = rdclock();
@@ -459,7 +463,7 @@ static volatile int signr = -1;
 
 static void skip_signal(int signo)
 {
-   if(target_pid != -1)
+   if(child_pid == -1)
done = 1;
 
signr = signo;


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 2/3] perf events: Enable counters only when kernel is execing the subcommand of 'perf record'

2010-03-18 Thread Zhang, Yanmin
From: Zhang, Yanmin yanmin_zh...@linux.intel.com

'perf record' starts counters before subcommand is execed, so the statistics is
not precise because it includes data of some preparation steps. I fix it with 
the
patch.

In addition, Change the condition to fork/exec subcommand. If there is a 
subcommand
parameter, perf always fork/exec it. The usage example is:
#perf record -f -a sleep 10

So this command could collect statistics for 10 seconds precisely. User
still could stop it by CTRL+C. Without the new capability, user could only
input CTRL+C to stop it without precise time clock.

Signed-off-by: Zhang Yanmin yanmin_zh...@linux.intel.com

---

diff -Nraup linux-2.6_tip0317/tools/perf/builtin-record.c 
linux-2.6_tip0317_fixrecord/tools/perf/builtin-record.c
--- linux-2.6_tip0317/tools/perf/builtin-record.c   2010-03-18 
09:04:40.942263175 +0800
+++ linux-2.6_tip0317_fixrecord/tools/perf/builtin-record.c 2010-03-18 
13:33:24.254359348 +0800
@@ -225,7 +225,7 @@ static struct perf_header_attr *get_head
return h_attr;
 }
 
-static void create_counter(int counter, int cpu, pid_t pid, bool forks)
+static void create_counter(int counter, int cpu, pid_t pid)
 {
char *filter = filters[counter];
struct perf_event_attr *attr = attrs + counter;
@@ -275,10 +275,10 @@ static void create_counter(int counter, 
attr-mmap  = track;
attr-comm  = track;
attr-inherit   = inherit;
-   attr-disabled  = 1;
-
-   if (forks)
+   if (target_pid == -1  !system_wide) {
+   attr-disabled = 1;
attr-enable_on_exec = 1;
+   }
 
 try_again:
fd[nr_cpu][counter] = sys_perf_event_open(attr, pid, cpu, group_fd, 0);
@@ -380,17 +380,15 @@ try_again:
exit(-1);
}
}
-
-   ioctl(fd[nr_cpu][counter], PERF_EVENT_IOC_ENABLE);
 }
 
-static void open_counters(int cpu, pid_t pid, bool forks)
+static void open_counters(int cpu, pid_t pid)
 {
int counter;
 
group_fd = -1;
for (counter = 0; counter  nr_counters; counter++)
-   create_counter(counter, cpu, pid, forks);
+   create_counter(counter, cpu, pid);
 
nr_cpu++;
 }
@@ -425,7 +423,7 @@ static int __cmd_record(int argc, const 
int err;
unsigned long waking = 0;
int child_ready_pipe[2], go_pipe[2];
-   const bool forks = target_pid == -1  argc  0;
+   const bool forks = argc  0;
char buf;
 
page_size = sysconf(_SC_PAGE_SIZE);
@@ -496,13 +494,13 @@ static int __cmd_record(int argc, const 
atexit(atexit_header);
 
if (forks) {
-   pid = fork();
+   child_pid = fork();
if (pid  0) {
perror(failed to fork);
exit(-1);
}
 
-   if (!pid) {
+   if (!child_pid) {
close(child_ready_pipe[0]);
close(go_pipe[1]);
fcntl(go_pipe[0], F_SETFD, FD_CLOEXEC);
@@ -531,11 +529,6 @@ static int __cmd_record(int argc, const 
exit(-1);
}
 
-   child_pid = pid;
-
-   if (!system_wide)
-   target_pid = pid;
-
close(child_ready_pipe[1]);
close(go_pipe[0]);
/*
@@ -548,13 +541,17 @@ static int __cmd_record(int argc, const 
close(child_ready_pipe[0]);
}
 
+   if (forks  target_pid == -1  !system_wide)
+   pid = child_pid;
+   else
+   pid = target_pid;
 
if ((!system_wide  !inherit) || profile_cpu != -1) {
-   open_counters(profile_cpu, target_pid, forks);
+   open_counters(profile_cpu, pid);
} else {
nr_cpus = read_cpu_map();
for (i = 0; i  nr_cpus; i++)
-   open_counters(cpumap[i], target_pid, forks);
+   open_counters(cpumap[i], pid);
}
 
if (file_new) {


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 3/3] perf events: Change perf parameter --pid to process-wide collection instead of thread-wide

2010-03-18 Thread Zhang, Yanmin
From: Zhang, Yanmin yanmin_zh...@linux.intel.com

Parameter --pid (or -p) of perf currently means a thread-wide collection.
For exmaple, if a process whose id is  has 10 threads, 'perf top -p '
just collects the main thread statistics. That's misleading. Users are
used to attach a whole process when debugging a process by gdb. To follow
normal usage style, the patch change --pid to process-wide collection and
add --tid (-t) to mean a thread-wide collection.

Usage example is:
#perf top -p 
#perf record -p  -f sleep 10
#perf stat -p  -f sleep 10
Above commands collect the statistics of all threads of process .

Signed-off-by: Zhang Yanmin yanmin_zh...@linux.intel.com

---

diff -Nraup linux-2.6_tip0317_statrecord/tools/perf/builtin-record.c 
linux-2.6_tip0317_statrecordpid/tools/perf/builtin-record.c
--- linux-2.6_tip0317_statrecord/tools/perf/builtin-record.c2010-03-18 
13:48:39.578181540 +0800
+++ linux-2.6_tip0317_statrecordpid/tools/perf/builtin-record.c 2010-03-18 
14:28:41.449631936 +0800
@@ -27,7 +27,7 @@
 #include unistd.h
 #include sched.h
 
-static int fd[MAX_NR_CPUS][MAX_COUNTERS];
+static int *fd[MAX_NR_CPUS][MAX_COUNTERS];
 
 static longdefault_interval=  0;
 
@@ -43,6 +43,9 @@ static intraw_samples 
=  0;
 static int system_wide =  0;
 static int profile_cpu = -1;
 static pid_t   target_pid  = -1;
+static pid_t   target_tid  = -1;
+static pid_t   *all_tids   =  NULL;
+static int thread_num  =  0;
 static pid_t   child_pid   = -1;
 static int inherit =  1;
 static int force   =  0;
@@ -60,7 +63,7 @@ static struct timeval this_read;
 
 static u64 bytes_written   =  0;
 
-static struct pollfd   event_array[MAX_NR_CPUS * MAX_COUNTERS];
+static struct pollfd   *event_array;
 
 static int nr_poll =  0;
 static int nr_cpu  =  0;
@@ -77,7 +80,7 @@ struct mmap_data {
unsigned intprev;
 };
 
-static struct mmap_datammap_array[MAX_NR_CPUS][MAX_COUNTERS];
+static struct mmap_data*mmap_array[MAX_NR_CPUS][MAX_COUNTERS];
 
 static unsigned long mmap_read_head(struct mmap_data *md)
 {
@@ -225,12 +228,13 @@ static struct perf_header_attr *get_head
return h_attr;
 }
 
-static void create_counter(int counter, int cpu, pid_t pid)
+static void create_counter(int counter, int cpu)
 {
char *filter = filters[counter];
struct perf_event_attr *attr = attrs + counter;
struct perf_header_attr *h_attr;
int track = !counter; /* only the first counter needs these */
+   int thread_index;
int ret;
struct {
u64 count;
@@ -280,115 +284,124 @@ static void create_counter(int counter, 
attr-enable_on_exec = 1;
}
 
+   for (thread_index = 0; thread_index  thread_num; thread_index++) {
 try_again:
-   fd[nr_cpu][counter] = sys_perf_event_open(attr, pid, cpu, group_fd, 0);
+   fd[nr_cpu][counter][thread_index] = sys_perf_event_open(attr,
+   all_tids[thread_index], cpu, group_fd, 0);
 
-   if (fd[nr_cpu][counter]  0) {
-   int err = errno;
+   if (fd[nr_cpu][counter][thread_index]  0) {
+   int err = errno;
 
-   if (err == EPERM || err == EACCES)
-   die(Permission error - are you root?\n
-   \t Consider tweaking 
/proc/sys/kernel/perf_event_paranoid.\n);
-   else if (err ==  ENODEV  profile_cpu != -1)
-   die(No such device - did you specify an out-of-range 
profile CPU?\n);
+   if (err == EPERM || err == EACCES)
+   die(Permission error - are you root?\n
+   \t Consider tweaking
+
/proc/sys/kernel/perf_event_paranoid.\n);
+   else if (err ==  ENODEV  profile_cpu != -1) {
+   die(No such device - did you specify
+an out-of-range profile CPU?\n);
+   }
 
-   /*
-* If it's cycles then fall back to hrtimer
-* based cpu-clock-tick sw counter, which
-* is always available even if no PMU support

Re: [PATCH] Enhance perf to collect KVM guest os statistics from host side

2010-03-17 Thread Zhang, Yanmin
On Tue, 2010-03-16 at 10:47 +0100, Ingo Molnar wrote:
 * Zhang, Yanmin yanmin_zh...@linux.intel.com wrote:
 
  On Tue, 2010-03-16 at 15:48 +0800, Zhang, Yanmin wrote:
   On Tue, 2010-03-16 at 07:41 +0200, Avi Kivity wrote:
On 03/16/2010 07:27 AM, Zhang, Yanmin wrote:
 From: Zhang, Yanminyanmin_zh...@linux.intel.com

 Based on the discussion in KVM community, I worked out the patch to 
 support
 perf to collect guest os statistics from host side. This patch is 
 implemented
 with Ingo, Peter and some other guys' kind help. Yang Sheng pointed 
 out a
 critical bug and provided good suggestions with other guys. I really 
 appreciate
 their kind help.

 The patch adds new subcommand kvm to perf.

perf kvm top
perf kvm record
perf kvm report
perf kvm diff

 The new perf could profile guest os kernel except guest os user 
 space, but it
 could summarize guest os user space utilization per guest os.

 Below are some examples.
 1) perf kvm top
 [r...@lkp-ne01 norm]# perf kvm --host --guest 
 --guestkallsyms=/home/ymzhang/guest/kallsyms
 --guestmodules=/home/ymzhang/guest/modules top



   Thanks for your kind comments.
   
Excellent, support for guest kernel != host kernel is critical (I can't 
remember the last time I ran same kernels).

How would we support multiple guests with different kernels?
   With the patch, 'perf kvm report --sort pid could show
   summary statistics for all guest os instances. Then, use
   parameter --pid of 'perf kvm record' to collect single problematic 
   instance data.
  Sorry. I found currently --pid isn't process but a thread (main thread).
  
  Ingo,
  
  Is it possible to support a new parameter or extend --inherit, so 'perf 
  record' and 'perf top' could collect data on all threads of a process when 
  the process is running?
  
  If not, I need add a new ugly parameter which is similar to --pid to filter 
  out process data in userspace.
 
 Yeah. For maximum utility i'd suggest to extend --pid to include this, and 
 introduce --tid for the previous, limited-to-a-single-task functionality.
 
 Most users would expect --pid to work like a 'late attach' - i.e. to work 
 like 
 strace -f or like a gdb attach.

Thanks Ingo, Avi.

I worked out below patch against tip/master of March 15th.

Subject: [PATCH] Change perf's parameter --pid to process-wide collection
From: Zhang, Yanmin yanmin_zh...@linux.intel.com

Change parameter -p (--pid) to real process pid and add -t (--tid) meaning
thread id. Now, --pid means perf collects the statistics of all threads of
the process, while --tid means perf just collect the statistics of that thread.

BTW, the patch fixes a bug of 'perf stat -p'. 'perf stat' always configures
attr-disabled=1 if it isn't a system-wide collection. If there is a '-p'
and no forks, 'perf stat -p' doesn't collect any data. In addition, the
while(!done) in run_perf_stat consumes 100% single cpu time which has bad impact
on running workload. I added a sleep(1) in the loop.

Signed-off-by: Zhang Yanmin yanmin_zh...@linux.intel.com

---

diff -Nraup linux-2.6_tipmaster0315/tools/perf/builtin-record.c 
linux-2.6_tipmaster0315_perfpid/tools/perf/builtin-record.c
--- linux-2.6_tipmaster0315/tools/perf/builtin-record.c 2010-03-16 
08:59:54.896488489 +0800
+++ linux-2.6_tipmaster0315_perfpid/tools/perf/builtin-record.c 2010-03-17 
16:30:17.71706 +0800
@@ -27,7 +27,7 @@
 #include unistd.h
 #include sched.h
 
-static int fd[MAX_NR_CPUS][MAX_COUNTERS];
+static int *fd[MAX_NR_CPUS][MAX_COUNTERS];
 
 static longdefault_interval=  0;
 
@@ -43,6 +43,9 @@ static intraw_samples 
=  0;
 static int system_wide =  0;
 static int profile_cpu = -1;
 static pid_t   target_pid  = -1;
+static pid_t   target_tid  = -1;
+static int *all_tids   =  NULL;
+static int thread_num  =  0;
 static pid_t   child_pid   = -1;
 static int inherit =  1;
 static int force   =  0;
@@ -60,7 +63,7 @@ static struct timeval this_read;
 
 static u64 bytes_written   =  0;
 
-static struct pollfd   event_array[MAX_NR_CPUS * MAX_COUNTERS];
+static struct pollfd   *event_array;
 
 static int nr_poll =  0;
 static int nr_cpu  =  0;
@@ -77,7 +80,7 @@ struct mmap_data {
unsigned int

Re: [PATCH] Enhance perf to collect KVM guest os statistics from host side

2010-03-17 Thread Zhang, Yanmin
On Wed, 2010-03-17 at 17:26 +0800, Zhang, Yanmin wrote:
 On Tue, 2010-03-16 at 10:47 +0100, Ingo Molnar wrote:
  * Zhang, Yanmin yanmin_zh...@linux.intel.com wrote:
  
   On Tue, 2010-03-16 at 15:48 +0800, Zhang, Yanmin wrote:
On Tue, 2010-03-16 at 07:41 +0200, Avi Kivity wrote:
 On 03/16/2010 07:27 AM, Zhang, Yanmin wrote:
  From: Zhang, Yanminyanmin_zh...@linux.intel.com
 
  Based on the discussion in KVM community, I worked out the patch to 
  support
  perf to collect guest os statistics from host side. This patch is 
  implemented
  with Ingo, Peter and some other guys' kind help. Yang Sheng pointed 
  out a
  critical bug and provided good suggestions with other guys. I 
  really appreciate
  their kind help.
 
  The patch adds new subcommand kvm to perf.
 
 perf kvm top
 perf kvm record
 perf kvm report
 perf kvm diff
 
  The new perf could profile guest os kernel except guest os user 
  space, but it
  could summarize guest os user space utilization per guest os.
 
  Below are some examples.
  1) perf kvm top
  [r...@lkp-ne01 norm]# perf kvm --host --guest 
  --guestkallsyms=/home/ymzhang/guest/kallsyms
  --guestmodules=/home/ymzhang/guest/modules top
 
 
 
Thanks for your kind comments.

 Excellent, support for guest kernel != host kernel is critical (I 
 can't 
 remember the last time I ran same kernels).
 
 How would we support multiple guests with different kernels?
With the patch, 'perf kvm report --sort pid could show
summary statistics for all guest os instances. Then, use
parameter --pid of 'perf kvm record' to collect single problematic 
instance data.
   Sorry. I found currently --pid isn't process but a thread (main thread).
   
   Ingo,
   
   Is it possible to support a new parameter or extend --inherit, so 'perf 
   record' and 'perf top' could collect data on all threads of a process 
   when 
   the process is running?
   
   If not, I need add a new ugly parameter which is similar to --pid to 
   filter 
   out process data in userspace.
  
  Yeah. For maximum utility i'd suggest to extend --pid to include this, and 
  introduce --tid for the previous, limited-to-a-single-task functionality.
  
  Most users would expect --pid to work like a 'late attach' - i.e. to work 
  like 
  strace -f or like a gdb attach.
 
 Thanks Ingo, Avi.
 
 I worked out below patch against tip/master of March 15th.
 
 Subject: [PATCH] Change perf's parameter --pid to process-wide collection
 From: Zhang, Yanmin yanmin_zh...@linux.intel.com
 
 Change parameter -p (--pid) to real process pid and add -t (--tid) meaning
 thread id. Now, --pid means perf collects the statistics of all threads of
 the process, while --tid means perf just collect the statistics of that 
 thread.
 
 BTW, the patch fixes a bug of 'perf stat -p'. 'perf stat' always configures
 attr-disabled=1 if it isn't a system-wide collection. If there is a '-p'
 and no forks, 'perf stat -p' doesn't collect any data. In addition, the
 while(!done) in run_perf_stat consumes 100% single cpu time which has bad 
 impact
 on running workload. I added a sleep(1) in the loop.
 
 Signed-off-by: Zhang Yanmin yanmin_zh...@linux.intel.com
Ingo,

Sorry, the patch has bugs.  I need do a better job and will work out 2
separate patches against the 2 issues.

Yanmin


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Enhance perf to collect KVM guest os statistics from host side

2010-03-16 Thread Zhang, Yanmin
On Tue, 2010-03-16 at 07:41 +0200, Avi Kivity wrote:
 On 03/16/2010 07:27 AM, Zhang, Yanmin wrote:
  From: Zhang, Yanminyanmin_zh...@linux.intel.com
 
  Based on the discussion in KVM community, I worked out the patch to support
  perf to collect guest os statistics from host side. This patch is 
  implemented
  with Ingo, Peter and some other guys' kind help. Yang Sheng pointed out a
  critical bug and provided good suggestions with other guys. I really 
  appreciate
  their kind help.
 
  The patch adds new subcommand kvm to perf.
 
 perf kvm top
 perf kvm record
 perf kvm report
 perf kvm diff
 
  The new perf could profile guest os kernel except guest os user space, but 
  it
  could summarize guest os user space utilization per guest os.
 
  Below are some examples.
  1) perf kvm top
  [r...@lkp-ne01 norm]# perf kvm --host --guest 
  --guestkallsyms=/home/ymzhang/guest/kallsyms
  --guestmodules=/home/ymzhang/guest/modules top
 
 
 
Thanks for your kind comments.

 Excellent, support for guest kernel != host kernel is critical (I can't 
 remember the last time I ran same kernels).
 
 How would we support multiple guests with different kernels?
With the patch, 'perf kvm report --sort pid could show
summary statistics for all guest os instances. Then, use
parameter --pid of 'perf kvm record' to collect single problematic instance 
data.

   Perhaps a 
 symbol server that perf can connect to (and that would connect to guests 
 in turn)?

 
  diff -Nraup linux-2.6_tipmaster0315/arch/x86/kvm/vmx.c 
  linux-2.6_tipmaster0315_perfkvm/arch/x86/kvm/vmx.c
  --- linux-2.6_tipmaster0315/arch/x86/kvm/vmx.c  2010-03-16 
  08:59:11.825295404 +0800
  +++ linux-2.6_tipmaster0315_perfkvm/arch/x86/kvm/vmx.c  2010-03-16 
  09:01:09.976084492 +0800
  @@ -26,6 +26,7 @@
#includelinux/sched.h
#includelinux/moduleparam.h
#includelinux/ftrace_event.h
  +#includelinux/perf_event.h
#include kvm_cache_regs.h
#include x86.h
 
  @@ -3632,6 +3633,43 @@ static void update_cr8_intercept(struct
  vmcs_write32(TPR_THRESHOLD, irr);
}
 
  +DEFINE_PER_CPU(int, kvm_in_guest) = {0};
  +
  +static void kvm_set_in_guest(void)
  +{
  +   percpu_write(kvm_in_guest, 1);
  +}
  +
  +static int kvm_is_in_guest(void)
  +{
  +   return percpu_read(kvm_in_guest);
  +}
 
 

 There is already PF_VCPU for this.
Right, but there is a scope between kvm_guest_enter and really running
in guest os, where a perf event might overflow. Anyway, the scope is very
narrow, I will change it to use flag PF_VCPU.

 
  +static struct perf_guest_info_callbacks kvm_guest_cbs = {
  +   .is_in_guest= kvm_is_in_guest,
  +   .is_user_mode   = kvm_is_user_mode,
  +   .get_guest_ip   = kvm_get_guest_ip,
  +   .reset_in_guest = kvm_reset_in_guest,
  +};
 
 
 Should be in common code, not vmx specific.
Right. I discussed with Yangsheng. I will move above data structures and
callbacks to file arch/x86/kvm/x86.c, and add get_ip, a new callback to
kvm_x86_ops.

Yanmin


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Enhance perf to collect KVM guest os statistics from host side

2010-03-16 Thread Zhang, Yanmin
On Tue, 2010-03-16 at 15:48 +0800, Zhang, Yanmin wrote:
 On Tue, 2010-03-16 at 07:41 +0200, Avi Kivity wrote:
  On 03/16/2010 07:27 AM, Zhang, Yanmin wrote:
   From: Zhang, Yanminyanmin_zh...@linux.intel.com
  
   Based on the discussion in KVM community, I worked out the patch to 
   support
   perf to collect guest os statistics from host side. This patch is 
   implemented
   with Ingo, Peter and some other guys' kind help. Yang Sheng pointed out a
   critical bug and provided good suggestions with other guys. I really 
   appreciate
   their kind help.
  
   The patch adds new subcommand kvm to perf.
  
  perf kvm top
  perf kvm record
  perf kvm report
  perf kvm diff
  
   The new perf could profile guest os kernel except guest os user space, 
   but it
   could summarize guest os user space utilization per guest os.
  
   Below are some examples.
   1) perf kvm top
   [r...@lkp-ne01 norm]# perf kvm --host --guest 
   --guestkallsyms=/home/ymzhang/guest/kallsyms
   --guestmodules=/home/ymzhang/guest/modules top
  
  
  
 Thanks for your kind comments.
 
  Excellent, support for guest kernel != host kernel is critical (I can't 
  remember the last time I ran same kernels).
  
  How would we support multiple guests with different kernels?
 With the patch, 'perf kvm report --sort pid could show
 summary statistics for all guest os instances. Then, use
 parameter --pid of 'perf kvm record' to collect single problematic instance 
 data.
Sorry. I found currently --pid isn't process but a thread (main thread).

Ingo,

Is it possible to support a new parameter or extend --inherit, so 'perf record' 
and
'perf top' could collect data on all threads of a process when the process is 
running?

If not, I need add a new ugly parameter which is similar to --pid to filter out 
process
data in userspace.

Yanmin


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Enhance perf to collect KVM guest os statistics from host side

2010-03-16 Thread Zhang, Yanmin
On Tue, 2010-03-16 at 11:32 +0200, Avi Kivity wrote:
 On 03/16/2010 09:48 AM, Zhang, Yanmin wrote:
 
  Excellent, support for guest kernel != host kernel is critical (I can't
  remember the last time I ran same kernels).
 
  How would we support multiple guests with different kernels?
   
  With the patch, 'perf kvm report --sort pid could show
  summary statistics for all guest os instances. Then, use
  parameter --pid of 'perf kvm record' to collect single problematic instance 
  data.
 
 
 That certainly works, though automatic association of guest data with 
 guest symbols is friendlier.
Thanks. Originally, I planed to add a -G parameter to perf. Such like
-G 
:/XXX/XXX/guestkallsyms:/XXX/XXX/modules,8889:/XXX/XXX/guestkallsyms:/XXX/XXX/modules
 and 8889 are just qemu guest pid.

So we could define multiple guest os symbol files. But it seems ugly,
and 'perf kvm report --sort pid and 'perf kvm top --pid' could provide
similar functionality.

 
  diff -Nraup linux-2.6_tipmaster0315/arch/x86/kvm/vmx.c 
  linux-2.6_tipmaster0315_perfkvm/arch/x86/kvm/vmx.c
  --- linux-2.6_tipmaster0315/arch/x86/kvm/vmx.c2010-03-16 
  08:59:11.825295404 +0800
  +++ linux-2.6_tipmaster0315_perfkvm/arch/x86/kvm/vmx.c2010-03-16 
  09:01:09.976084492 +0800
  @@ -26,6 +26,7 @@
 #includelinux/sched.h
 #includelinux/moduleparam.h
 #includelinux/ftrace_event.h
  +#includelinux/perf_event.h
 #include kvm_cache_regs.h
 #include x86.h
 
  @@ -3632,6 +3633,43 @@ static void update_cr8_intercept(struct
vmcs_write32(TPR_THRESHOLD, irr);
 }
 
  +DEFINE_PER_CPU(int, kvm_in_guest) = {0};
  +
  +static void kvm_set_in_guest(void)
  +{
  + percpu_write(kvm_in_guest, 1);
  +}
  +
  +static int kvm_is_in_guest(void)
  +{
  + return percpu_read(kvm_in_guest);
  +}
 
 
   
 
  There is already PF_VCPU for this.
   
  Right, but there is a scope between kvm_guest_enter and really running
  in guest os, where a perf event might overflow. Anyway, the scope is very
  narrow, I will change it to use flag PF_VCPU.
 
 
 There is also a window between setting the flag and calling 'int $2' 
 where an NMI might happen and be accounted incorrectly.
 
 Perhaps separate the 'int $2' into a direct call into perf and another 
 call for the rest of NMI handling.  I don't see how it would work on svm 
 though - AFAICT the NMI is held whereas vmx swallows it. 

  I guess NMIs 
 will be disabled until the next IRET so it isn't racy, just tricky.
I'm not sure if vmexit does break NMI context or not. Hardware NMI context
isn't reentrant till a IRET. YangSheng would like to double check it.

 
  +static struct perf_guest_info_callbacks kvm_guest_cbs = {
  + .is_in_guest= kvm_is_in_guest,
  + .is_user_mode   = kvm_is_user_mode,
  + .get_guest_ip   = kvm_get_guest_ip,
  + .reset_in_guest = kvm_reset_in_guest,
  +};
 
 
  Should be in common code, not vmx specific.
   
  Right. I discussed with Yangsheng. I will move above data structures and
  callbacks to file arch/x86/kvm/x86.c, and add get_ip, a new callback to
  kvm_x86_ops.
 
 
 You will need access to the vcpu pointer (kvm_rip_read() needs it), you 
 can put it in a percpu variable.
We do so now in a new patch.

   I guess if it's not null, you know 
 you're in a guest, so no need for PF_VCPU.
Good suggestion.

Thanks.


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] x86/kvm: Show guest system/user cputime in cpustat

2010-03-11 Thread Zhang, Yanmin
On Thu, 2010-03-11 at 09:50 +0200, Avi Kivity wrote:
 On 03/11/2010 09:46 AM, Sheng Yang wrote:
  On Thursday 11 March 2010 15:36:01 Avi Kivity wrote:
 
  On 03/11/2010 09:20 AM, Sheng Yang wrote:
   

 
  Currently we can only get the cpu_stat of whole guest as one. This patch
  enhanced cpu_stat with more detail, has guest_system and guest_user cpu
  time statistics with a little overhead.
 
  Signed-off-by: Sheng Yangsh...@linux.intel.com
It seems per-process guest cpu utilization is more useful than per-cpu's.


  ---
 
  This draft patch based on KVM upstream to show the idea. I would split it
  into more kernel friendly version later.
 
  The overhead is, the cost of get_cpl() after each exit from guest.
 
  This can be very expensive in the nested virtualization case, so I
  wouldn't like this to be in normal paths.  I think detailed profiling
  like that can be left to 'perf kvm', which only has overhead if enabled
  at runtime.
   
  Yes, that's my concern too(though nested vmcs/vmcb read already too 
  expensive,
  they should be optimized...).
 
 Any ideas on how to do that?  Perhaps use paravirt_ops to covert the 
 vmread into a memory read?  We store the vmwrites in the vmcs anyway.
Another method is to add sysctl entry, such like 
/proc/sys/kernel/collect_guest_utilization,
and we can set it off by default. Or add a 
/sys/kernel/debug/kvm/collect_guest_utilization.

 
  The other concern is, perf alike mechanism would
  bring a lot more overhead compared to this.
 
 
 Ordinarily users won't care if time is spent in guest kernel mode or 
 guest user mode.  They want to see which guest is imposing a load on a 
 system.  I consider a user profiling a guest from the host an advanced 
 and rarer use case, so it's okay to require tools and additional 
 overhead for this.
Here is the story why Sheng worked out the patch. Some guys work on
KVM performance. They want us to extend top to show guest utilization
info, such like guest kernel and guest userspace cpu utilization. With
the new tool, they could find which VM (mapping with qemu process id)
consumes too much cpu time in host space (including kernel and userspace),
and compare them with guest kernel/userspace. That information could provide
a first-hand high-level overview about all VMs running in the system
and help admin quickly find what the worst VM instance is.

So we need per-process (guest) cpu utilization than per-cpu guest utilization.

 
  For example you can put the code to note the cpl in a tracepoint which
  is enabled dynamically.
   
  Yanmin have already implement perf kvm to support this. We are just 
  arguing
  if a normal top-alike mechanism is necessary.
perf kvm mostly is used to find hot functions which might cause more overhead.
Sheng's patch has less overhead.


 
  I am also considering to make it a feature that can be disabled. But seems 
  it
  make things complicate and result in uncertain cpustat output.
 
 
 I'm not even sure that guest time was a good idea.


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: KVM PMU virtualization

2010-03-10 Thread Zhang, Yanmin
On Thu, 2010-03-04 at 09:00 +0800, Zhang, Yanmin wrote:
 On Wed, 2010-03-03 at 11:15 +0100, Peter Zijlstra wrote:
  On Wed, 2010-03-03 at 17:27 +0800, Zhang, Yanmin wrote:
   -#ifndef perf_misc_flags
   -#define perf_misc_flags(regs)  (user_mode(regs) ? PERF_RECORD_MISC_USER 
   : \
   -PERF_RECORD_MISC_KERNEL)
   -#define perf_instruction_pointer(regs) instruction_pointer(regs)
   -#endif 
  
  Ah, that #ifndef is for powerpc, which I think you just broke.
 Thanks for the reminder. I deleted powerpc codes when building cscope
 lib.
 
 It seems perf_save_virt_ip/perf_reset_virt_ip interfaces are ugly. I plan to
 change them to a callback function struct and kvm registers its version to 
 perf.
 
 Such like:
 struct perf_guest_info_callbacks {
   int (*is_in_guest)();
   u64 (*get_guest_ip)();
   int (*copy_guest_stack)();
   int (*reset_in_guest)();
   ...
 };
 int perf_register_guest_info_callbacks(struct perf_guest_info_callbacks *);
 int perf_unregister_guest_info_callbacks(struct perf_guest_info_callbacks *);
 
 It's more scalable and neater.
In case you guys might lose patience, I worked out a new patch against 
2.6.34-rc1.

It could work with:
#perf kvm --guest --guestkallsyms /guest/os/kernel/proc/kallsyms --guestmodules
/guest/os/proc/modules top
It also support to collect both host side and guest side at the same time:
#perf kvm --host --guest --guestkallsyms /guest/os/kernel/proc/kallsyms 
--guestmodules
/guest/os/proc/modules top

The first output line of top has guest kernel/user space percentage.

Or just host side:
#perf kvm --host

As tool perf source codes have lots of changes, I am still working on perf kvm 
record
and report.

---

diff -Nraup linux-2.6.34-rc1/arch/x86/include/asm/ptrace.h 
linux-2.6.34-rc1_work/arch/x86/include/asm/ptrace.h
--- linux-2.6.34-rc1/arch/x86/include/asm/ptrace.h  2010-03-09 
13:04:20.730596079 +0800
+++ linux-2.6.34-rc1_work/arch/x86/include/asm/ptrace.h 2010-03-10 
17:06:34.228953260 +0800
@@ -167,6 +167,15 @@ static inline int user_mode(struct pt_re
 #endif
 }
 
+static inline int user_mode_cs(u16 cs)
+{
+#ifdef CONFIG_X86_32
+   return (cs  SEGMENT_RPL_MASK) == USER_RPL;
+#else
+   return !!(cs  3);
+#endif
+}
+
 static inline int user_mode_vm(struct pt_regs *regs)
 {
 #ifdef CONFIG_X86_32
diff -Nraup linux-2.6.34-rc1/arch/x86/kvm/vmx.c 
linux-2.6.34-rc1_work/arch/x86/kvm/vmx.c
--- linux-2.6.34-rc1/arch/x86/kvm/vmx.c 2010-03-09 13:04:20.758593132 +0800
+++ linux-2.6.34-rc1_work/arch/x86/kvm/vmx.c2010-03-10 17:11:49.709019136 
+0800
@@ -26,6 +26,7 @@
 #include linux/sched.h
 #include linux/moduleparam.h
 #include linux/ftrace_event.h
+#include linux/perf_event.h
 #include kvm_cache_regs.h
 #include x86.h
 
@@ -3632,6 +3633,43 @@ static void update_cr8_intercept(struct 
vmcs_write32(TPR_THRESHOLD, irr);
 }
 
+DEFINE_PER_CPU(int, kvm_in_guest) = {0};
+
+static void kvm_set_in_guest(void)
+{
+   percpu_write(kvm_in_guest, 1);
+}
+
+static int kvm_is_in_guest(void)
+{
+   return percpu_read(kvm_in_guest);
+}
+
+static int kvm_is_user_mode(void)
+{
+   int user_mode;
+   user_mode = user_mode_cs(vmcs_read16(GUEST_CS_SELECTOR));
+   return user_mode;
+}
+
+static u64 kvm_get_guest_ip(void)
+{
+   return vmcs_readl(GUEST_RIP);
+}
+
+static void kvm_reset_in_guest(void)
+{
+   if (percpu_read(kvm_in_guest))
+   percpu_write(kvm_in_guest, 0);
+}
+
+static struct perf_guest_info_callbacks kvm_guest_cbs = {
+   .is_in_guest= kvm_is_in_guest,
+   .is_user_mode   = kvm_is_user_mode,
+   .get_guest_ip   = kvm_get_guest_ip,
+   .reset_in_guest = kvm_reset_in_guest
+};
+
 static void vmx_complete_interrupts(struct vcpu_vmx *vmx)
 {
u32 exit_intr_info;
@@ -3653,8 +3691,11 @@ static void vmx_complete_interrupts(stru
 
/* We need to handle NMIs before interrupts are enabled */
if ((exit_intr_info  INTR_INFO_INTR_TYPE_MASK) == INTR_TYPE_NMI_INTR 
-   (exit_intr_info  INTR_INFO_VALID_MASK))
+   (exit_intr_info  INTR_INFO_VALID_MASK)) {
+   kvm_set_in_guest();
asm(int $2);
+   kvm_reset_in_guest();
+   }
 
idtv_info_valid = idt_vectoring_info  VECTORING_INFO_VALID_MASK;
 
@@ -4251,6 +4292,8 @@ static int __init vmx_init(void)
if (bypass_guest_pf)
kvm_mmu_set_nonpresent_ptes(~0xffeull, 0ull);
 
+   perf_register_guest_info_callbacks(kvm_guest_cbs);
+
return 0;
 
 out3:
@@ -4266,6 +4309,8 @@ out:
 
 static void __exit vmx_exit(void)
 {
+   perf_unregister_guest_info_callbacks(kvm_guest_cbs);
+
free_page((unsigned long)vmx_msr_bitmap_legacy);
free_page((unsigned long)vmx_msr_bitmap_longmode);
free_page((unsigned long)vmx_io_bitmap_b);
diff -Nraup linux-2.6.34-rc1/include/linux/perf_event.h 
linux-2.6.34-rc1_work/include/linux/perf_event.h
--- linux-2.6.34-rc1

Re: KVM PMU virtualization

2010-03-03 Thread Zhang, Yanmin
On Wed, 2010-03-03 at 11:32 +0800, Zhang, Yanmin wrote:
 On Tue, 2010-03-02 at 10:36 +0100, Ingo Molnar wrote:
  * Zhang, Yanmin yanmin_zh...@linux.intel.com wrote:
  
   On Fri, 2010-02-26 at 10:17 +0100, Ingo Molnar wrote:
  
My suggestion, as always, would be to start very simple and very 
minimal:

Enable 'perf kvm top' to show guest overhead. Use the exact same kernel 
image both as a host and as guest (for testing), to not have to deal 
with 
the symbol space transport problem initially. Enable 'perf kvm record' 
to 
only record guest events by default. Etc.

This alone will be a quite useful result already - and gives a basis 
for 
further work. No need to spend months to do the big grand design 
straight 
away, all of this can be done gradually and in the order of usefulness 
- 
and you'll always have something that actually works (and helps your 
other 
KVM projects) along the way.
  
   It took me for a couple of hours to read the emails on the topic. Based 
   on 
   above idea, I worked out a prototype which is ugly, but does work with 
   top/record when both guest side and host side use the same kernel image, 
   while compiling most needed modules into kernel directly..
   
   The commands are:
   perf kvm top
   perf kvm record
   perf kvm report
   
   They just collect guest kernel hot functions.
  
  Fantastic, and there's some really interesting KVM guest/host comparison 
  profiles you've done with this prototype!
  
   With my patch, I collected dbench data on Nehalem machine (2*4*2 logical 
   cpu).
  
   1) Vanilla host kernel (6G memory):
   
  PerfTop:   15491 irqs/sec  kernel:93.6% [1000Hz cycles],  (all, 16 
   CPUs)
   
   
samples  pcnt functionDSO
___ _ ___ 
   
   
   99376.00 40.5% ext3_test_allocatable   
   /lib/modules/2.6.33-kvmymz/build/vmlinux
   41239.00 16.8% bitmap_search_next_usable_block 
   /lib/modules/2.6.33-kvmymz/build/vmlinux
7019.00  2.9% __ticket_spin_lock  
   /lib/modules/2.6.33-kvmymz/build/vmlinux
5350.00  2.2% copy_user_generic_string
   /lib/modules/2.6.33-kvmymz/build/vmlinux
5208.00  2.1% do_get_write_access 
   /lib/modules/2.6.33-kvmymz/build/vmlinux
4484.00  1.8% journal_dirty_metadata  
   /lib/modules/2.6.33-kvmymz/build/vmlinux
4078.00  1.7% ext3_free_blocks_sb 
   /lib/modules/2.6.33-kvmymz/build/vmlinux
3856.00  1.6% ext3_new_blocks 
   /lib/modules/2.6.33-kvmymz/build/vmlinux
3485.00  1.4% journal_get_undo_access 
   /lib/modules/2.6.33-kvmymz/build/vmlinux
2803.00  1.1% ext3_try_to_allocate
   /lib/modules/2.6.33-kvmymz/build/vmlinux
2241.00  0.9% __find_get_block
   /lib/modules/2.6.33-kvmymz/build/vmlinux
1957.00  0.8% find_revoke_record  
   /lib/modules/2.6.33-kvmymz/build/vmlinux
   
   2) guest os: start one guest os with 4GB memory.
   
  PerfTop: 827 irqs/sec  kernel: 0.0% [1000Hz cycles],  (all, 16 
   CPUs)
   
   
samples  pcnt functionDSO
___ _ ___ 
   
   
   41701.00 28.1% __ticket_spin_lock  
   /lib/modules/2.6.33-kvmymz/build/vmlinux
   33843.00 22.8% ext3_test_allocatable   
   /lib/modules/2.6.33-kvmymz/build/vmlinux
   16862.00 11.4% bitmap_search_next_usable_block 
   /lib/modules/2.6.33-kvmymz/build/vmlinux
3278.00  2.2% native_flush_tlb_others 
   /lib/modules/2.6.33-kvmymz/build/vmlinux
3200.00  2.2% copy_user_generic_string
   /lib/modules/2.6.33-kvmymz/build/vmlinux
3009.00  2.0% do_get_write_access 
   /lib/modules/2.6.33-kvmymz/build/vmlinux
2834.00  1.9% journal_dirty_metadata  
   /lib/modules/2.6.33-kvmymz/build/vmlinux
1965.00  1.3% journal_get_undo_access 
   /lib/modules/2.6.33-kvmymz/build/vmlinux
1907.00  1.3% ext3_new_blocks 
   /lib/modules/2.6.33-kvmymz/build/vmlinux

Re: KVM PMU virtualization

2010-03-03 Thread Zhang, Yanmin
On Wed, 2010-03-03 at 11:13 +0100, Peter Zijlstra wrote:
 On Wed, 2010-03-03 at 17:27 +0800, Zhang, Yanmin wrote:
  +static inline u64 perf_instruction_pointer(struct pt_regs *regs)
  +{
  +   u64 ip;
  +   ip = percpu_read(perf_virt_ip.ip);
  +   if (!ip)
  +   ip = instruction_pointer(regs);
  +   else
  +   perf_reset_virt_ip();
  +   return ip;
  +}
  +
  +static inline unsigned int perf_misc_flags(struct pt_regs *regs)
  +{
  +   if (percpu_read(perf_virt_ip.ip)) {
  +   return percpu_read(perf_virt_ip.user_mode) ?
  +   PERF_RECORD_MISC_GUEST_USER :
  +   PERF_RECORD_MISC_GUEST_KERNEL;
  +   } else
  +   return user_mode(regs) ? PERF_RECORD_MISC_USER :
  +PERF_RECORD_MISC_KERNEL;
  +} 
 
 This codes in the assumption that perf_misc_flags() must only be called
 before perf_instruction_pointer(), which is currently true, but you
 might want to put a comment near to remind us of this.
I will change the logic with a clear reset operation in caller.

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: KVM PMU virtualization

2010-03-03 Thread Zhang, Yanmin
On Wed, 2010-03-03 at 11:15 +0100, Peter Zijlstra wrote:
 On Wed, 2010-03-03 at 17:27 +0800, Zhang, Yanmin wrote:
  -#ifndef perf_misc_flags
  -#define perf_misc_flags(regs)  (user_mode(regs) ? PERF_RECORD_MISC_USER : \
  -PERF_RECORD_MISC_KERNEL)
  -#define perf_instruction_pointer(regs) instruction_pointer(regs)
  -#endif 
 
 Ah, that #ifndef is for powerpc, which I think you just broke.
Thanks for the reminder. I deleted powerpc codes when building cscope
lib.

It seems perf_save_virt_ip/perf_reset_virt_ip interfaces are ugly. I plan to
change them to a callback function struct and kvm registers its version to perf.

Such like:
struct perf_guest_info_callbacks {
int (*is_in_guest)();
u64 (*get_guest_ip)();
int (*copy_guest_stack)();
int (*reset_in_guest)();
...
};
int perf_register_guest_info_callbacks(struct perf_guest_info_callbacks *);
int perf_unregister_guest_info_callbacks(struct perf_guest_info_callbacks *);

It's more scalable and neater.


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: KVM PMU virtualization

2010-03-02 Thread Zhang, Yanmin
On Tue, 2010-03-02 at 10:36 +0100, Ingo Molnar wrote:
 * Zhang, Yanmin yanmin_zh...@linux.intel.com wrote:
 
  On Fri, 2010-02-26 at 10:17 +0100, Ingo Molnar wrote:
 
   My suggestion, as always, would be to start very simple and very minimal:
   
   Enable 'perf kvm top' to show guest overhead. Use the exact same kernel 
   image both as a host and as guest (for testing), to not have to deal with 
   the symbol space transport problem initially. Enable 'perf kvm record' to 
   only record guest events by default. Etc.
   
   This alone will be a quite useful result already - and gives a basis for 
   further work. No need to spend months to do the big grand design straight 
   away, all of this can be done gradually and in the order of usefulness - 
   and you'll always have something that actually works (and helps your 
   other 
   KVM projects) along the way.
 
  It took me for a couple of hours to read the emails on the topic. Based on 
  above idea, I worked out a prototype which is ugly, but does work with 
  top/record when both guest side and host side use the same kernel image, 
  while compiling most needed modules into kernel directly..
  
  The commands are:
  perf kvm top
  perf kvm record
  perf kvm report
  
  They just collect guest kernel hot functions.
 
 Fantastic, and there's some really interesting KVM guest/host comparison 
 profiles you've done with this prototype!
 
  With my patch, I collected dbench data on Nehalem machine (2*4*2 logical 
  cpu).
 
  1) Vanilla host kernel (6G memory):
  
 PerfTop:   15491 irqs/sec  kernel:93.6% [1000Hz cycles],  (all, 16 CPUs)
  
  
   samples  pcnt functionDSO
   ___ _ ___ 
  
  
  99376.00 40.5% ext3_test_allocatable   
  /lib/modules/2.6.33-kvmymz/build/vmlinux
  41239.00 16.8% bitmap_search_next_usable_block 
  /lib/modules/2.6.33-kvmymz/build/vmlinux
   7019.00  2.9% __ticket_spin_lock  
  /lib/modules/2.6.33-kvmymz/build/vmlinux
   5350.00  2.2% copy_user_generic_string
  /lib/modules/2.6.33-kvmymz/build/vmlinux
   5208.00  2.1% do_get_write_access 
  /lib/modules/2.6.33-kvmymz/build/vmlinux
   4484.00  1.8% journal_dirty_metadata  
  /lib/modules/2.6.33-kvmymz/build/vmlinux
   4078.00  1.7% ext3_free_blocks_sb 
  /lib/modules/2.6.33-kvmymz/build/vmlinux
   3856.00  1.6% ext3_new_blocks 
  /lib/modules/2.6.33-kvmymz/build/vmlinux
   3485.00  1.4% journal_get_undo_access 
  /lib/modules/2.6.33-kvmymz/build/vmlinux
   2803.00  1.1% ext3_try_to_allocate
  /lib/modules/2.6.33-kvmymz/build/vmlinux
   2241.00  0.9% __find_get_block
  /lib/modules/2.6.33-kvmymz/build/vmlinux
   1957.00  0.8% find_revoke_record  
  /lib/modules/2.6.33-kvmymz/build/vmlinux
  
  2) guest os: start one guest os with 4GB memory.
  
 PerfTop: 827 irqs/sec  kernel: 0.0% [1000Hz cycles],  (all, 16 CPUs)
  
  
   samples  pcnt functionDSO
   ___ _ ___ 
  
  
  41701.00 28.1% __ticket_spin_lock  
  /lib/modules/2.6.33-kvmymz/build/vmlinux
  33843.00 22.8% ext3_test_allocatable   
  /lib/modules/2.6.33-kvmymz/build/vmlinux
  16862.00 11.4% bitmap_search_next_usable_block 
  /lib/modules/2.6.33-kvmymz/build/vmlinux
   3278.00  2.2% native_flush_tlb_others 
  /lib/modules/2.6.33-kvmymz/build/vmlinux
   3200.00  2.2% copy_user_generic_string
  /lib/modules/2.6.33-kvmymz/build/vmlinux
   3009.00  2.0% do_get_write_access 
  /lib/modules/2.6.33-kvmymz/build/vmlinux
   2834.00  1.9% journal_dirty_metadata  
  /lib/modules/2.6.33-kvmymz/build/vmlinux
   1965.00  1.3% journal_get_undo_access 
  /lib/modules/2.6.33-kvmymz/build/vmlinux
   1907.00  1.3% ext3_new_blocks 
  /lib/modules/2.6.33-kvmymz/build/vmlinux
   1790.00  1.2% ext3_free_blocks_sb 
  /lib/modules/2.6.33-kvmymz/build/vmlinux
   1741.00  1.2% find_revoke_record  
  /lib/modules/2.6.33-kvmymz/build

Re: KVM PMU virtualization

2010-03-01 Thread Zhang, Yanmin
On Fri, 2010-02-26 at 10:17 +0100, Ingo Molnar wrote:
 * Joerg Roedel j...@8bytes.org wrote:
 
  On Fri, Feb 26, 2010 at 10:55:17AM +0800, Zhang, Yanmin wrote:
   On Thu, 2010-02-25 at 18:34 +0100, Joerg Roedel wrote:
On Thu, Feb 25, 2010 at 04:04:28PM +0100, Jes Sorensen wrote:

 1) Add support to perf to allow it to monitor a KVM guest from the
host.

This shouldn't be a big problem. The PMU of AMD Fam10 processors can be
configured to count only when in guest mode. Perf needs to be aware of
that and fetch the rip from a different place when monitoring a guest.
  
   The idea is we want to measure both host and guest at the same time, and
   compare all the hot functions fairly.
  
  So you want to measure while the guest vcpu is running and the vmexit
  path of that vcpu (including qemu userspace part) together? The
  challenge here is to find out if a performance event originated in guest
  mode or in host mode.
  But we can check for that in the nmi-protected part of the vmexit path.
 
 As far as instrumentation goes, virtualization is simply another 'PID 
 dimension' of measurement.
 
 Today we can isolate system performance measurements/events to the following 
 domains:
 
  - per system
  - per cpu
  - per task
 
 ( Note that PowerPC already supports certain sorts of 
 'hypervisor/kernel/user' 
   domain separation, and we have some ABI details for all that but it's by no 
   means complete. Anton is using the PowerPC bits AFAIK, so it already works 
   to a certain degree. )
 
 When extending measurements to KVM, we want two things:
 
  - user friendliness: instead of having to check 'ps' and figure out which 
Qemu thread is the KVM thread we want to profile, just give a convenience
namespace to access guest profiling info. -G ought to map to the first
currently running KVM guest it can find. (which would match like 90% of the
cases) - etc. No ifs and when. If 'perf kvm top' doesnt show something 
useful by default the whole effort is for naught.
 
  - Extend core facilities and enable the following measurement dimensions:
 
  host-kernel-space
  host-user-space
  guest-kernel-space
  guest-user-space
 
on a per guest basis. We want to be able to measure just what the guest 
does, and we want to be able to measure just what the host does.
 
Some of this the hardware helps us with (say only measuring host kernel 
events is possible), some has to be done by fiddling with event 
enable/disable at vm-exit / vm-entry time.
 
 My suggestion, as always, would be to start very simple and very minimal:
 
 Enable 'perf kvm top' to show guest overhead. Use the exact same kernel image 
 both as a host and as guest (for testing), to not have to deal with the 
 symbol 
 space transport problem initially. Enable 'perf kvm record' to only record 
 guest events by default. Etc.
 
 This alone will be a quite useful result already - and gives a basis for 
 further work. No need to spend months to do the big grand design straight 
 away, all of this can be done gradually and in the order of usefulness - and 
 you'll always have something that actually works (and helps your other KVM 
 projects) along the way.
It took me for a couple of hours to read the emails on the topic.
Based on above idea, I worked out a prototype which is ugly, but does work
with top/record when both guest side and host side use the same kernel image,
while compiling most needed modules into kernel directly..

The commands are:
perf kvm top
perf kvm record
perf kvm report

They just collect guest kernel hot functions.

 
 [ And, as so often, once you walk that path, that grand scheme you are 
   thinking about right now might easily become last year's really bad idea 
 ;-) ]
 
 So please start walking the path and experience the challenges first-hand.
With my patch, I collected dbench data on Nehalem machine (2*4*2 logical cpu).
1) Vanilla host kernel (6G memory):

   PerfTop:   15491 irqs/sec  kernel:93.6% [1000Hz cycles],  (all, 16 CPUs)


 samples  pcnt functionDSO
 ___ _ ___ 


99376.00 40.5% ext3_test_allocatable   
/lib/modules/2.6.33-kvmymz/build/vmlinux
41239.00 16.8% bitmap_search_next_usable_block 
/lib/modules/2.6.33-kvmymz/build/vmlinux
 7019.00  2.9% __ticket_spin_lock  
/lib/modules/2.6.33-kvmymz/build/vmlinux
 5350.00  2.2% copy_user_generic_string
/lib/modules/2.6.33-kvmymz/build/vmlinux
 5208.00  2.1% do_get_write_access 
/lib/modules/2.6.33-kvmymz/build/vmlinux

Re: Enhance perf to support KVM

2010-02-25 Thread Zhang, Yanmin
On Thu, 2010-02-25 at 10:20 +0100, Peter Zijlstra wrote:
 On Thu, 2010-02-25 at 11:27 +0800, Zhang, Yanmin wrote:
  Ingo,
  
  I did some testing with KVM virtualization. perf shows vmx_vcpu_run
  consumes more than 50% cpu time. Actually, the info is incorrect because
  when perf counter overflows and NMI is triggered, vm exit to function
  vmx_vcpu_run, then vmx_vcpu_run triggers a software NMI so perf event is
  notified. perf just checks regs which just saves the address of 
  vmx_vcpu_run.
  
  I want to enhance perf to collect real guest os address.
  
  Below is the design.
  KVM uses multi-thread model. Every guest os is a process of multi-thread.
  
  1) Kernel:
  Add a per_cpu var and some functions, so KVM records interrupted
  guest os address before triggering the software NMI. perf event would check
  the per_cpu var to use it if it's not zero, or just goes though the old 
  path.
  
  2) User space: Add a new parameter to perf-top and perf-report, such like
  -g pid:guest_os_vmlinux_path. Command perf parses the guest os kernel image
  to collect symbols. Change perf to summarize results based on pid.
  Another direction is to use the new parameter -g only when old parameter
  -p is defined. Perf just needs separate native kernel and guest os kernel.
 
I really appreciate your kind comments, and will contact you again in
the future for help.

 -g is already taken :-)
We could use other flag or just -G.

 
 One thing I worry about is making sense of the guest data, it might be
 possible to sorta make sense of the main kernel image, but after that
 its going to be 'interesting' in deed.
 
 You're going to have to extend PERF_RECORD_MISC_* though, perhaps you
 can reuse CPUMODE_UNKNOWN for GUEST.
 
 The callchain stuff already has GUEST context identifiers, however
 determining KERNEL/USER context might be hard and interpreting it is
 going to be harder still since we don't have map information for the
 guest.
Right. As for side #1 pointed in Ingo' email, we assume guest os is
linux. We couldn't support all capabilities of perf on KVM from host side.
1) We couldn't get module and process mapping info in guest os in an easy way,
so we can't support to collect guest kernel module and user space hot functions.
A work around is user could get guest os /proc/kallsym and pass it to tool perf
at host side so we could analyze module host functions.
2) We couldn't get guest os kernel/user stack data in an easy way, so we might 
not
support callchain feature of tool perf. A work around is KVM copies kernel stack
data out, so we could at least support guest os kernel callchain.

So the host side perf support on guest os:
  perf kvm list
  perf kvm record# records the first running guest
  perf kvm stat  # stats the first running KVM guest
  perf kvm top   # shows the profile of the first running
guest
  perf kvm trace # active the KVM specific tracepoints

As for record, doesn't support to record guest os user space stack
callchain and guest os user space hot functions.

Yanmin


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: KVM PMU virtualization

2010-02-25 Thread Zhang, Yanmin
On Thu, 2010-02-25 at 17:26 +0100, Ingo Molnar wrote:
 * Jan Kiszka jan.kis...@siemens.com wrote:
 
  Jes Sorensen wrote:
   Hi,
   
   It looks like several of us have been looking at how to use the PMU
   for virtualization. Rather than continuing to have discussions in
   smaller groups, I think it is a good idea we move it to the mailing
   lists to see what we can share and avoid duplicate efforts.
   
   There are really two separate things to handle:
   
   1) Add support to perf to allow it to monitor a KVM guest from the
  host.
   
   2) Allow guests access to the PMU (or an emulated PMU), making it
  possible to run perf on applications running within the guest.
   
   I know some of you have been looking at 1) and I am currently working
   on 2). I have been looking at various approaches, including whether it
   is feasible to share the PMU between the host and multiple guests. For
   now I am going to focus on allowing one guest to take control of the
   PMU, then later hopefully adding support for multiplexing it between
   multiple guests.
  
  Given that perf can apply the PMU to individual host tasks, I don't see 
  fundamental problems multiplexing it between individual guests (which can 
  then internally multiplex it again).
 
 In terms of how to expose it to guests, a 'soft PMU' might be a usable 
 approach. Although to Linux guests you could expose much more functionality 
 and an non-PMU-limited number of instrumentation events, via a more 
 intelligent interface.
 
 But note that in terms of handling it on the host side the PMU approach is 
 not 
 acceptable: instead it should map to proper perf_events, not try to muck with 
 the PMU itself.
 


 That, besides integrating properly with perf usage on the host, will also 
 allow interesting 'PMU' features on guests: you could set up the host side to 
 trace block IO requests (or VM exits) for example, and expose that as 'PMC
 #0' on the guest side.
So virtualization becomes non-transparent to guest os? I know virtio is an
optimization on guest side.



--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: KVM PMU virtualization

2010-02-25 Thread Zhang, Yanmin
On Thu, 2010-02-25 at 18:34 +0100, Joerg Roedel wrote:
 On Thu, Feb 25, 2010 at 04:04:28PM +0100, Jes Sorensen wrote:
 
  1) Add support to perf to allow it to monitor a KVM guest from the
 host.
 
 This shouldn't be a big problem. The PMU of AMD Fam10 processors can be
 configured to count only when in guest mode. Perf needs to be aware of
 that and fetch the rip from a different place when monitoring a guest.
The idea is we want to measure both host and guest at the same time, and
compare all the hot functions fairly.


 
  2) Allow guests access to the PMU (or an emulated PMU), making it
 possible to run perf on applications running within the guest.
 
 The biggest problem I see here is teaching the guest about the available
 events. The available event sets are dependent on the processor family
 (at least on AMD).
 A simple approach would be shadowing the perf msrs which is a simple
 thing to do. More problematic is the reinjection of performance
 interrupts and performance nmis.
 
 I personally don't like a self-defined event-set as the only solution
 because that would probably only work with linux and perf. I think we
 should have a way (additionally to a soft-event interface) which allows
 to expose the host pmu events to the guest.
 
   Joerg
 


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html