Re: [PATCH V2 3/5] para virt interface of perf to support kvm guest os statistics collection in guest os
On Wed, 2010-06-23 at 08:51 +0300, Avi Kivity wrote:
> On 06/23/2010 06:12 AM, Zhang, Yanmin wrote:
> > This design is to deal with a task context perf collection in guest os.
> > Scenario 1:
> > 1) guest os starts to collect statistics of process A on vcpu 0;
> > 2) process A is scheduled to vcpu 1. Then, the perf_event at host side
> > needs to be moved to vcpu 1's thread.
> > With the per-KVM-instance design, we needn't move host_perf_shadow
> > among vcpus.
>
> First, the guest already knows how to deal with per-cpu performance
> monitors, since that's how most (all) hardware works. So we aren't making
> the guest more complex, and on the other hand we simplify the host.

I agree that we need to keep things simple.

> Second, if process A is migrated, and the guest uses per-process
> counters, the guest will need to stop/start the counter during the
> migration. This will cause the host to migrate the counter,

Agree. My patches do so. Question: where does the host migrate the counter to? The perf event at host side is bound to a specific vcpu thread.

> If the perf event is bound to the vm, not a vcpu, then on guest process
> migration you will have to disable it on one vcpu and enable it on the
> other, no?

I found we start from different points. This patch implements a para virt interface based on the current perf implementation in the kernel. Here is a diagram of the perf implementation layers. The picture below is not precise, but it shows the layering; Ingo and Peter could correct me if something is wrong.

-------------------------------------
|        Perf Generic Layer         |
-------------------------------------
|      PMU Abstraction Layer        |
|     (a couple of callbacks)       |
-------------------------------------
|            x86_pmu                |
|   (operates real PMU hardware)    |
-------------------------------------

The upper layer is the perf generic layer. The 3rd layer is x86_pmu, which really manipulates the PMU hardware. Sometimes the 1st layer calls the 3rd directly, at event initialization and when enabling/disabling all events.

My patch implements a kvm_pmu at the 2nd layer in the guest os, which calls a hypercall to vmexit to the host. At the host side, the request mostly goes through all 3 layers before accessing real hardware.

Most of your comments don't agree with the kvm_pmu design. Although you didn't say so directly, I think you perhaps want the para virt interface implemented at the 3rd layer in the guest os. That would mean the guest os maintains a mapping between guest events and PMU counters, which is why you strongly prefer per-vcpu event management and idx references to events.

If we implement it at the 3rd layer (or something like that, although you might say you don't like that layer) in the guest, we need to bypass the 1st and 2nd layers in the host kernel when processing guest os events. Eventually, we would almost add a new layer under x86_pmu to arbitrate between perf PMU requests and KVM guest event requests. My current patch instead routes the calls through the whole perf stack at the host side. The upper layer arranges perf event scheduling on the PMU hardware; applications don't know when their events will really be scheduled onto real hardware, as they needn't know.

> > > so while we didn't move the counter to a different vcpu,
> >
> > Disagree here. If process A on vcpu 0 in guest os is migrated to vcpu 1,
> > host has to move process A's perf_event to vcpu 1 thread.
>
> Sorry, I'm confused now (lost track of our example). But whatever we do,
> if a guest process is migrated, the host will have to migrate the perf
> event, yes?

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH V2 3/5] para virt interface of perf to support kvm guest os statistics collection in guest os
On Mon, 2010-06-21 at 16:56 +0300, Gleb Natapov wrote:
> On Mon, Jun 21, 2010 at 05:31:43PM +0800, Zhang, Yanmin wrote:
> > The 3rd patch is to implement para virt perf at host kernel.
> >
> > Signed-off-by: Zhang Yanmin yanmin_zh...@linux.intel.com
[snip]
> > +static void kvm_copy_event_to_guest(struct kvm_vcpu *vcpu,
> > +			struct perf_event *host_event)
> > +{
> > +	struct host_perf_shadow *shadow = host_event->host_perf_shadow;
> > +	struct guest_perf_event counter;
> > +	int ret;
> > +	s32 overflows;
> > +
> > +	ret = kvm_read_guest(vcpu->kvm, shadow->guest_event_addr,
> > +			&counter, sizeof(counter));
> > +	if (ret < 0)
> > +		return;
> > +
> > +again:
> > +	overflows = atomic_read(&shadow->counter.overflows);
> > +	if (atomic_cmpxchg(&shadow->counter.overflows, overflows, 0) !=
> > +			overflows)
> > +		goto again;
> > +
> > +	counter.count = shadow->counter.count;
> > +	atomic_add(overflows, &counter.overflows);
> > +
> > +	kvm_write_guest(vcpu->kvm,
> > +			shadow->guest_event_addr,
> > +			&counter,
> > +			sizeof(counter));
>
> Those kinds of interfaces worry me since they can cause bugs that are
> very hard to catch. What if the guest enables some events and crashes
> into a kdump kernel (or kexecs a new kernel) without resetting the HW?
> Now the host may write over guest memory without the guest expecting it.
> Do you handle this scenario on the guest side? I think you need to
> register a reboot notifier and disable the events from there.

Sorry for missing your comments. My patch takes care of a dead guest os by cleaning up all events in function kvm_arch_destroy_vm, so all events are closed if the host user kills the guest qemu process. As for your scenario, I will register a reboot notifier and add a new pv perf hypercall interface to vmexit to the host kernel to do the cleanup.

Thanks,
Yanmin
Re: [PATCH V2 1/5] para virt interface of perf to support kvm guest os statistics collection in guest os
On Tue, 2010-06-22 at 09:14 +0200, Jes Sorensen wrote:
> On 06/22/10 03:49, Zhang, Yanmin wrote:
> > On Mon, 2010-06-21 at 14:45 +0300, Avi Kivity wrote:
> > > Since the guest can use NMI to read the counter, it should have the
> > > highest possible priority, and thus it shouldn't see any overflow
> > > unless it configured the threshold really low.
> > >
> > > If we drop overflow, we can use the RDPMC instruction instead of
> > > KVM_PERF_OP_READ. This allows the guest to allow userspace to read a
> > > counter, or prevent userspace from reading the counter, by setting
> > > cr4.pce.
> >
> > 1) The para virt perf interface is to hide PMU hardware in the host os.
> > Guest os shouldn't access PMU hardware directly. We could expose PMU
> > hardware to guest os directly, but that would be another guest os PMU
> > support method. It shouldn't be a part of the para virt interface.
> > 2) Consider the scenario below: a PMU counter overflows and the NMI
> > causes the guest os to vmexit to the host kernel. The host kernel
> > schedules the vcpu thread to another physical cpu before vmentering the
> > guest os again. So later on, the guest os would RDPMC the counter on
> > another cpu.
> > So I think the above discussion is about how to expose PMU hardware to
> > guest os. I will also check this method after the para virt interface
> > is done.
>
> You should be able to expose the counters as read-only to the guest. KVM
> allows you to specify whether a guest has read, write or read/write
> access. Allowing read access to the counters would save a fair number of
> hypercalls.

Thanks. KVM is good at register access permission configuration, but things are not so simple in a real running environment. The host kernel might schedule the guest os vcpu thread to other cpus, or other non-kvm processes might preempt the vcpu thread on this cpu. To support the capability you describe, we would eventually have to implement direct exposure of PMU hardware to the guest os.

Question is if it is safe to drop overflow support? Not safe. One of the PMU hardware design objectives is to use an interrupt or NMI to notify software when an event counter overflows. Without overflow support, software needs to poll the PMU registers in a loop. That is not good and consumes more cpu resources.

Besides the para virt perf interface, I'm also considering direct exposure of PMU hardware to guest os. But that will be a very different implementation, and we should not combine it with the pv interface. Perhaps our target is to implement both, so an unmodified guest os could also get perf statistics support.

Yanmin
Re: [PATCH V2 4/5] para virt interface of perf to support kvm guest os statistics collection in guest os
On Tue, 2010-06-22 at 10:24 +0200, Jes Sorensen wrote:
> On 06/21/10 11:31, Zhang, Yanmin wrote:
> > @@ -583,10 +584,20 @@ static void x86_pmu_disable_all(void)
> >  	}
> >  }
> >
> > +#ifdef CONFIG_KVM_PERF
> > +static int kvm_hw_perf_enable(void);
> > +static int kvm_hw_perf_disable(void);
> > +#endif
>
> Please put these prototypes into a header ... and create dummy stubs for
> them when CONFIG_KVM_PERF is not set.

Ok. I just didn't want to touch too much generic perf code.

> >  void hw_perf_disable(void)
> >  {
> >  	struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
> > +#ifdef CONFIG_KVM_PERF
> > +	if (!kvm_hw_perf_disable())
> > +		return;
> > +#endif
>
> If you stub them out we can avoid all the ugly #ifdefs

Ok.

> > @@ -810,6 +821,11 @@ void hw_perf_enable(void)
> >  	struct hw_perf_event *hwc;
> >  	int i, added = cpuc->n_added;
> > +#ifdef CONFIG_KVM_PERF
> > +	if (!kvm_hw_perf_enable())
> > +		return;
> > +#endif
>
> and here

Ok.

> > @@ -1317,6 +1334,11 @@ void __init init_hw_perf_events(void)
> >  	pr_info("Performance Events: ");
> > +#ifdef CONFIG_KVM_PERF
> > +	if (!kvm_init_hw_perf_events())
> > +		return;
> > +#endif
>
> and again here :)

Ok. Peter is working on a couple of patches to support multiple PMUs. His patches change the pmu definition, and we might move some of this into the callbacks. That will become much cleaner.

Yanmin
Re: [PATCH V2 2/5] para virt interface of perf to support kvm guest os statistics collection in guest os
On Tue, 2010-06-22 at 12:12 +0300, Avi Kivity wrote:
> On 06/22/2010 05:08 AM, Zhang, Yanmin wrote:
>
> Something that is worrying is that we don't expose group information.
> perf will multiplex the events for us, but there will be a loss in
> accuracy.
>
> > > >  #ifdef CONFIG_HAVE_HW_BREAKPOINT
> > > >  #include <asm/hw_breakpoint.h>
> > > >  #endif
> > > >
> > > > @@ -753,6 +752,20 @@ struct perf_event {
> > > >  	perf_overflow_handler_t overflow_handler;
> > > >
> > > > +	/*
> > > > +	 * pointers used by kvm perf paravirt interface.
> > > > +	 *
> > > > +	 * 1) Used in host kernel and points to host_perf_shadow which
> > > > +	 * has information about guest perf_event
> > > > +	 */
> > > > +	void *host_perf_shadow;
> > >
> > > Can we have real types instead of void pointers?
> >
> > I just want the perf generic code to have less dependency on KVM code.
>
> One way to do that and retain type safety is to have
>
> struct perf_client {
> 	struct perf_client_ops *ops;
> 	...
> };
>
> The client (kvm) can do
>
> struct kvm_perf_client {
> 	struct perf_client pc;
> 	// kvm specific stuff
> };
>
> the callbacks receive struct perf_client and use container_of to reach
> the kvm_perf_client that contains it.

Let me double check it.

> > > > +	/*
> > > > +	 * 2) Used in guest kernel and points to guest_perf_shadow which
> > > > +	 * is used as a communication area with host kernel. Host kernel
> > > > +	 * copies overflow data to it when an event overflows.
> > > > +	 */
> > > > +	void *guest_perf_shadow;
> > >
> > > It's strange to see both guest and host parts in the same patch.
> > > Splitting to separate patches will really help review.
> >
> > It's a little hard to split the patches if they change the same file.
> > Perhaps I could add more statements before the patch when I send it out.
>
> With git, it's easy (once you're used to it):
>
>     # go back one commit:
>     git reset HEAD^
>     # selectively add bits:
>     git add -p
>     # commit first patch
>     git commit -s
>     # selectively add bits:
>     git add -p
>     # commit second patch
>     git commit -s

Thanks for your teaching.
> > > > @@ -1626,9 +1629,22 @@ void perf_event_task_tick(struct task_st
> > > >  	if (ctx && ctx->nr_events && ctx->nr_events != ctx->nr_active)
> > > >  		rotate = 1;
> > > >
> > > > -	perf_ctx_adjust_freq(&cpuctx->ctx);
> > > > -	if (ctx)
> > > > -		perf_ctx_adjust_freq(ctx);
> > > > +#ifdef CONFIG_KVM_PERF
> > > > +	if (kvm_para_available()) {
> > > > +		/*
> > > > +		 * perf_ctx_adjust_freq causes lots of pmu->read which
> > > > +		 * would trigger too many vmexits to host kernel. We
> > > > +		 * disable it under the para virt situation
> > > > +		 */
> > > > +		adjust_freq = 0;
> > > > +	}
> > > > +#endif
> > >
> > > Perhaps we can have a batch read interface which will read many
> > > counters at once.
> >
> > It's a good idea. But that would touch a lot of generic perf code,
> > which makes it hard to maintain or to follow future changes.
>
> I'm talking about the guest/host interface. So you have one vmexit and
> many host perf calls.

I understood what you were saying. I mean that the perf generic code operates on perf_events one by one; at the low layer, we only know about one perf_event before calling the hypercall to vmexit to the host kernel.
Re: [PATCH V2 1/5] para virt interface of perf to support kvm guest os statistics collection in guest os
On Tue, 2010-06-22 at 10:00 +0200, Jes Sorensen wrote:
> On 06/22/10 09:55, Peter Zijlstra wrote:
> > On Tue, 2010-06-22 at 15:47 +0800, Zhang, Yanmin wrote:
> > > Besides the para virt perf interface, I'm also considering the direct
> > > exposition of PMU hardware to guest os.
> >
> > NAK NAK NAK NAK, we've been over that, it's not going to happen, full
> > stop!
> >
> > Use MSR read/write traps and host perf to emulate the hardware. In some
> > cases we could allow the reads without a trap, but that's a later
> > optimization.
>
> I believe what's meant here is a PMU-compatible interface which is
> partially emulated. Not a handover of the PMU.

Right. We need to capture all writes to the PMU MSRs and allow the guest os to read the MSRs directly.

Yanmin
Re: [PATCH V2 1/5] para virt interface of perf to support kvm guest os statistics collection in guest os
On Tue, 2010-06-22 at 09:58 +0200, Jes Sorensen wrote:
> On 06/22/10 09:47, Zhang, Yanmin wrote:
> > On Tue, 2010-06-22 at 09:14 +0200, Jes Sorensen wrote:
> > > On 06/22/10 03:49, Zhang, Yanmin wrote:
> > > > On Mon, 2010-06-21 at 14:45 +0300, Avi Kivity wrote:
> > > > So I think the above discussion is about how to expose PMU hardware
> > > > to guest os. I will also check this method after the para virt
> > > > interface is done.
> > >
> > > You should be able to expose the counters as read-only to the guest.
> > > KVM allows you to specify whether a guest has read, write or
> > > read/write access. Allowing read access to the counters would save a
> > > fair number of hypercalls.
> >
> > Thanks. KVM is good at register access permission configuration, but
> > things are not so simple in a real running environment. The host kernel
> > might schedule the guest os vcpu thread to other cpus, or other non-kvm
> > processes might preempt the vcpu thread on this cpu. To support the
> > capability you describe, we would eventually have to implement direct
> > exposure of PMU hardware to guest os.
>
> If the guest is rescheduled to another CPU, or you get a preemption, you
> have a VMEXIT. The vcpu thread will not migrate while it is running, so
> you can handle it while the VMEXIT is being serviced. Exposing the
> counters read-only would save a lot of overhead for sure.

> > Question is if it is safe to drop overflow support? Not safe.
> > One of the PMU hardware design objectives is to use an interrupt or NMI
> > to notify software when an event counter overflows. Without overflow
> > support, software needs to poll the PMU registers in a loop. That is
> > not good and consumes more cpu resources.
>
> Here is an idea: how about having the overflow NMI in the host trigger a
> flag that causes the PMU register read to trap and get special handling?
> That way you could propagate the overflow back down to the guest.

That doesn't resolve the issue that the guest os software still has to poll the registers.
Re: [PATCH V2 3/5] para virt interface of perf to support kvm guest os statistics collection in guest os
On Tue, 2010-06-22 at 11:04 +0300, Avi Kivity wrote:
> On 06/22/2010 08:47 AM, Zhang, Yanmin wrote:
> > On Mon, 2010-06-21 at 16:56 +0300, Gleb Natapov wrote:
> > > On Mon, Jun 21, 2010 at 05:31:43PM +0800, Zhang, Yanmin wrote:
> > > > The 3rd patch is to implement para virt perf at host kernel.
> > > >
> > > > Signed-off-by: Zhang Yanmin yanmin_zh...@linux.intel.com
[snip]
> > > > +static void kvm_copy_event_to_guest(struct kvm_vcpu *vcpu,
> > > > +			struct perf_event *host_event)
> > > > +{
> > > > +	struct host_perf_shadow *shadow = host_event->host_perf_shadow;
> > > > +	struct guest_perf_event counter;
> > > > +	int ret;
> > > > +	s32 overflows;
> > > > +
> > > > +	ret = kvm_read_guest(vcpu->kvm, shadow->guest_event_addr,
> > > > +			&counter, sizeof(counter));
> > > > +	if (ret < 0)
> > > > +		return;
> > > > +
> > > > +again:
> > > > +	overflows = atomic_read(&shadow->counter.overflows);
> > > > +	if (atomic_cmpxchg(&shadow->counter.overflows, overflows, 0) !=
> > > > +			overflows)
> > > > +		goto again;
> > > > +
> > > > +	counter.count = shadow->counter.count;
> > > > +	atomic_add(overflows, &counter.overflows);
> > > > +
> > > > +	kvm_write_guest(vcpu->kvm,
> > > > +			shadow->guest_event_addr,
> > > > +			&counter,
> > > > +			sizeof(counter));
> > >
> > > Those kinds of interfaces worry me since they can cause bugs that are
> > > very hard to catch. What if the guest enables some events and crashes
> > > into a kdump kernel (or kexecs a new kernel) without resetting the HW?
> > > Now the host may write over guest memory without the guest expecting
> > > it. Do you handle this scenario on the guest side? I think you need to
> > > register a reboot notifier and disable the events from there.
> >
> > Sorry for missing your comments. My patch takes care of a dead guest os
> > by cleaning up all events in function kvm_arch_destroy_vm, so all
> > events are closed if the host user kills the guest qemu process.
>
> A reset does not involve destroying a vm; you have to clean up as part of
> the reset process.

What does 'reset' here mean? Is it a reboot or a halt? If it's a halt, it involves destroying a vm. If a host user just kills the qemu process, is that a reset involving destroying a vm?

> Note MSRs are automatically cleared, so that's something in favour of an
> MSR interface.

> > As for your scenario, I will register a reboot notifier and add a new
> > pv perf hypercall interface to vmexit to the host kernel to do cleanup.
>
> You aren't guaranteed a reboot notifier will be called. On the other
> hand, we need a kexec handler.

An ordinary kexec calls all reboot notifiers; only a crash kexec doesn't call them. I will implement a machine_ops.crash_shutdown callback.
Re: [PATCH V2 3/5] para virt interface of perf to support kvm guest os statistics collection in guest os
On Tue, 2010-06-22 at 11:29 +0300, Avi Kivity wrote:
> On 06/22/2010 06:12 AM, Zhang, Yanmin wrote:
> > On Mon, 2010-06-21 at 15:33 +0300, Avi Kivity wrote:
> > > On 06/21/2010 12:31 PM, Zhang, Yanmin wrote:
> > > > The 3rd patch is to implement para virt perf at host kernel.
> > > >
> > > > @@ -64,6 +73,85 @@ struct kvm_mmu_op_release_pt {
> > > >  #ifdef __KERNEL__
> > > >  #include <asm/processor.h>
> > > >
> > > > +/*
> > > > + * In host kernel, perf_event->host_perf_shadow points to
> > > > + * host_perf_shadow which records some information
> > > > + * about the guest.
> > > > + */
> > > > +struct host_perf_shadow {
> > > > +	/* guest perf_event id passed from guest os */
> > > > +	int id;
> > > > +	/*
> > > > +	 * Host kernel saves data into data member counter firstly.
> > > > +	 * kvm will get data from this counter and calls kvm functions
> > > > +	 * to copy or add data back to guest os before entering guest
> > > > +	 * os next time
> > > > +	 */
> > > > +	struct guest_perf_event counter;
> > > > +	/* guest_event_addr is a gpa_t pointing to guest os guest_perf_event */
> > > > +	__u64 guest_event_addr;
> > >
> > > So just use gpa_t as the type.
> >
> > host_perf_shadow->guest_event_addr is a copy of the guest_event_addr
> > passed in from the guest. As the latter's type is __u64 in the
> > interface between guest os and host os, I used __u64 as the type of
> > host_perf_shadow->guest_event_addr.
>
> Right. But gpa_t is more descriptive. We have a lot of address spaces in
> kvm.

I changed it to gpa_t.

> > > > +	/*
> > > > +	 * Link to kvm.kvm_arch.shadow_hash_table
> > > > +	 */
> > > > +	struct list_head shadow_entry;
> > > > +	struct kvm_vcpu *vcpu;
> > > > +
> > > > +	struct perf_event *host_event;
> > > > +	/*
> > > > +	 * Below counter is to prevent a malicious guest os from trying
> > > > +	 * to close/enable an event at the same time.
> > > > +	 */
> > > > +	atomic_t ref_counter;
> > >
> > > If events are made per-vcpu (like real hardware), races become
> > > impossible.
> >
> > This design is to deal with a task context perf collection in guest os.
> > Scenario 1:
> > 1) guest os starts to collect statistics of process A on vcpu 0;
> > 2) process A is scheduled to vcpu 1. Then, the perf_event at host side
> > needs to be moved to vcpu 1's thread.
> > With the per-KVM-instance design, we needn't move host_perf_shadow
> > among vcpus.
>
> First, the guest already knows how to deal with per-cpu performance
> monitors, since that's how most (all) hardware works. So we aren't making
> the guest more complex, and on the other hand we simplify the host.

I agree that we need to keep things simple.

> Second, if process A is migrated, and the guest uses per-process
> counters, the guest will need to stop/start the counter during the
> migration. This will cause the host to migrate the counter,

Agree. My patches do so. Question: where does the host migrate the counter to? The perf event at host side is bound to a specific vcpu thread.

> so while we didn't move the counter to a different vcpu,

Disagree here. If process A on vcpu 0 in guest os is migrated to vcpu 1, the host has to move process A's perf_event to the vcpu 1 thread.

> we still have to move it to a different cpu.

> > Scenario 2:
> > 1) guest os creates a perf_event at host side on vcpu 0;
> > 2) a malicious guest os calls close to delete the host perf_event on
> > vcpu 1, but enables the perf_event on vcpu 0 at the same time. When the
> > close thread runs to get the host_perf_shadow from the list, the enable
> > thread also gets it. Then, the close thread deletes the perf_event, and
> > the enable thread causes a host kernel panic when using
> > host_perf_shadow.
>
> With per-vcpu events, this can't happen. Each vcpu has its own set of
> perf events in their own ID namespace. vcpu 0 can't touch vcpu 1's
> events even if it knows their IDs.

What does 'touch' mean here? With a task (process) context event in guest os, we have to migrate the event among vcpus when the guest process scheduler balances processes among vcpus. So all events in a guest os instance use the same ID namespace. What you mention is right for cpu context events, but not for task context events.

> > > Please move this structure to include/linux/kvm_host.h. No need to
> > > spam kvm_para.h. Note it's not x86 specific (though you can leave
> > > arch enabling to arch maintainers).
> >
> > Ok. Originally, I wanted to do so, but I'm afraid other archs might not
> > be happy.
>
> Copying kvm arch maintainers. Please keep an eye on this topic and holler
> if something doesn't fit your arch.

> > > > @@ -24,6 +24,7 @@
> > > >  #include <asm/desc.h>
> > > >  #include <asm/mtrr.h>
> > > >  #include <asm/msr-index.h>
> > > > +#include <asm/perf_event.h>
> > > >
> > > >  #define KVM_MAX_VCPUS 64
> > > >  #define KVM_MEMORY_SLOTS 32
> > > >
> > > > @@ -360,6 +361,18 @@ struct kvm_vcpu_arch {
> > > >  	/* fields used by HYPER-V emulation */
> > > >  	u64 hv_vapic;
> > > > +
> > > > +	/*
> > > > +	 * Fields used by PARAVIRT perf interface:
> > > > +	 *
> > > > +	 * kvm checks overflow_events before entering guest os,
> > > > +	 * and copies data back to guest os.
> > > > +	 * event_mutex is to avoid a race between NMI perf event
> > > > +	 * overflow handler, event close, and enable/disable.
> > > > +	 */
[PATCH V2 1/5] para virt interface of perf to support kvm guest os statistics collection in guest os
Here is version 2.

ChangeLog since V1 — mostly changes based on Avi's suggestions:
1) Use an id to identify the perf_event between host and guest;
2) Change lots of code to deal with a malicious guest os;
3) Add a perf_event number limitation per guest os instance;
4) Support a guest os running on top of another guest os. But I didn't test it yet as there is no environment. The design is to add 2 pointers in struct perf_event: one used by the host and the other used by the guest;
5) Fix the bug to support 'perf stat'. The key is to sync count data back to the guest when the guest tries to disable the perf_event at the host side;
6) Add a clear ABI of PV perf.

I don't implement the live migration feature. Avi, is live migration necessary for pv perf support?

Based on Ingo's idea, I implemented a para virt interface for perf to support statistics collection in guest os. That means we can run the tool perf in guest os directly.

Great thanks to Peter Zijlstra. He is really the architect and gave me architecture design suggestions. I also want to thank Yangsheng and LinMing for their generous help.

The design is:
1) Add a kvm_pmu whose callbacks mostly just call a hypercall to vmexit to the host kernel;
2) Create a host perf_event per guest perf_event;
3) The host kernel syncs perf_event count/overflow data changes to the guest perf_event when processing perf_event overflows after an NMI arrives. The host kernel injects an NMI into the guest kernel if a guest event overflows;
4) The guest kernel goes through all enabled events on the current cpu and outputs data when they overflow;
5) No change in user space.

Below is an example.

#perf top
--------------------------------------------------------------------------------
   PerfTop:    7954 irqs/sec  kernel:79.5%  exact:  0.0% [1000Hz cycles],  (all, 8 CPUs)
--------------------------------------------------------------------------------

             samples  pcnt function                 DSO
             _______ _____ ________________________ ____________________________________________________

             5315.00  4.9% copy_user_generic_string /lib/modules/2.6.35-rc1-tip-guestperf/build/vmlinux
             3342.00  3.1% add_preempt_count        /lib/modules/2.6.35-rc1-tip-guestperf/build/vmlinux
             3338.00  3.1% sub_preempt_count        /lib/modules/2.6.35-rc1-tip-guestperf/build/vmlinux
             2454.00  2.3% pvclock_clocksource_read /lib/modules/2.6.35-rc1-tip-guestperf/build/vmlinux
             2434.00  2.3% tcp_sendmsg              /lib/modules/2.6.35-rc1-tip-guestperf/build/vmlinux
             2090.00  1.9% child_run                /bm/tmp/benchmarks/run_bmtbench/dbench/dbench-3.03/tbench
             2081.00  1.9% debug_smp_processor_id   /lib/modules/2.6.35-rc1-tip-guestperf/build/vmlinux
             2003.00  1.9% __GI_strstr              /lib64/libc-2.11.so
             1999.00  1.9% __strchr_sse2            /lib64/libc-2.11.so
             1983.00  1.8% tcp_ack                  /lib/modules/2.6.35-rc1-tip-guestperf/build/vmlinux
             1800.00  1.7% tcp_transmit_skb         /lib/modules/2.6.35-rc1-tip-guestperf/build/vmlinux
             1727.00  1.6% schedule                 /lib/modules/2.6.35-rc1-tip-guestperf/build/vmlinux
             1706.00  1.6% __libc_recv              /lib64/libc-2.11.so
             1702.00  1.6% __GI_memchr              /lib64/libc-2.11.so
             1580.00  1.5% tcp_recvmsg              /lib/modules/2.6.35-rc1-tip-guestperf/build/vmlinux

The patch is against the tip/master tree of June 20th.

Signed-off-by: Zhang Yanmin yanmin_zh...@linux.intel.com
---
--- linux-2.6_tip0620/Documentation/kvm/paravirt-perf.txt	1970-01-01 08:00:00.0 +0800
+++ linux-2.6_tip0620perfkvm/Documentation/kvm/paravirt-perf.txt	2010-06-21 15:21:39.312999849 +0800
@@ -0,0 +1,133 @@
+The x86 kvm paravirt perf event interface
+=========================================
+
+This paravirt interface is responsible for supporting guest os perf event
+collections. If guest os supports this interface, users could run command
+perf in guest os directly.
+
+Design
+======
+
+Guest os calls a series of hypercalls to communicate with host kernel to
+create/enable/disable/close perf events. Host kernel notifies guest os
+by injecting an NMI to guest os when an event overflows. Guest os needs to
+go through all its active events to check if they overflow, and output
+performance statistics if they do.
+
+ABI
+===
+
+1) Detect if host kernel supports paravirt perf interface:
+#define KVM_FEATURE_PV_PERF 4
+Host kernel defines above cpuid bit. Guest os calls
[PATCH V2 2/5] para virt interface of perf to support kvm guest os statistics collection in guest os
The 2nd patch changes the definition of perf_event to facilitate copying the perf attr when a hypercall happens.

Signed-off-by: Zhang Yanmin yanmin_zh...@linux.intel.com
---
--- linux-2.6_tip0620/include/linux/perf_event.h	2010-06-21 15:19:52.821999849 +0800
+++ linux-2.6_tip0620perfkvm/include/linux/perf_event.h	2010-06-21 16:53:49.283999849 +0800
@@ -188,7 +188,10 @@ struct perf_event_attr {
 	__u64			sample_type;
 	__u64			read_format;
 
-	__u64			disabled       :  1, /* off by default        */
+	union {
+		__u64		flags;
+		struct {
+			__u64	disabled       :  1, /* off by default        */
 				inherit        :  1, /* children inherit it   */
 				pinned         :  1, /* must always be on PMU */
 				exclusive      :  1, /* only group on PMU     */
@@ -217,6 +220,8 @@ struct perf_event_attr {
 				mmap_data      :  1, /* non-exec mmap data    */
 
 				__reserved_1   : 46;
+		};
+	};
 
 	union {
 		__u32		wakeup_events;	  /* wakeup every n events */
@@ -465,12 +470,6 @@ enum perf_callchain_context {
 # include <asm/local64.h>
 #endif
 
-struct perf_guest_info_callbacks {
-	int (*is_in_guest) (void);
-	int (*is_user_mode) (void);
-	unsigned long (*get_guest_ip) (void);
-};
-
 #ifdef CONFIG_HAVE_HW_BREAKPOINT
 #include <asm/hw_breakpoint.h>
 #endif
@@ -753,6 +752,20 @@ struct perf_event {
 	perf_overflow_handler_t		overflow_handler;
 
+	/*
+	 * pointers used by kvm perf paravirt interface.
+	 *
+	 * 1) Used in host kernel and points to host_perf_shadow which
+	 * has information about guest perf_event
+	 */
+	void				*host_perf_shadow;
+	/*
+	 * 2) Used in guest kernel and points to guest_perf_shadow which
+	 * is used as a communication area with host kernel. Host kernel
+	 * copies overflow data to it when an event overflows.
+	 */
+	void				*guest_perf_shadow;
+
 #ifdef CONFIG_EVENT_TRACING
 	struct ftrace_event_call	*tp_event;
 	struct event_filter		*filter;
@@ -838,6 +851,16 @@ struct perf_output_handle {
 	int				sample;
 };
 
+struct perf_guest_info_callbacks {
+	/* Support collecting guest statistics from host side */
+	int (*is_in_guest) (void);
+	int (*is_user_mode) (void);
+	unsigned long (*get_guest_ip) (void);
+
+	/* Support paravirt interface */
+	void (*copy_event_to_shadow) (struct perf_event *event, int overflows);
+};
+
 #ifdef CONFIG_PERF_EVENTS
 
 /*
@@ -871,6 +894,10 @@ perf_event_create_kernel_counter(struct
 				perf_overflow_handler_t callback);
 extern u64 perf_event_read_value(struct perf_event *event,
 				 u64 *enabled, u64 *running);
+extern void perf_event_output(struct perf_event *event, int nmi,
+		struct perf_sample_data *data, struct pt_regs *regs);
+void perf_event_attach(struct perf_event *event);
+void perf_event_detach(struct perf_event *event);
 
 struct perf_sample_data {
 	u64				type;
@@ -1023,6 +1050,14 @@ perf_event_task_sched_in(struct task_str
 static inline void
 perf_event_task_sched_out(struct task_struct *task,
 			  struct task_struct *next)			{ }
+
+static inline void
+perf_event_output(struct perf_event *event, int nmi,
+		struct perf_sample_data *data, struct pt_regs *regs)	{ }
+
+static inline void perf_event_attach(struct perf_event *event)		{ }
+static inline void perf_event_detach(struct perf_event *event)		{ }
+
 static inline void perf_event_task_tick(struct task_struct *task)	{ }
 static inline int perf_event_init_task(struct task_struct *child)	{ return 0; }
--- linux-2.6_tip0620/kernel/watchdog.c	2010-06-21 15:20:48.517999849 +0800
+++ linux-2.6_tip0620perfkvm/kernel/watchdog.c	2010-06-21 15:21:39.315999849 +0800
@@ -197,8 +197,6 @@ static struct perf_event_attr wd_hw_attr
 	.type		= PERF_TYPE_HARDWARE,
 	.config		= PERF_COUNT_HW_CPU_CYCLES,
 	.size		= sizeof(struct perf_event_attr),
-	.pinned		= 1,
-	.disabled	= 1,
 };
 
 /* Callback function for perf event subsystem */
@@ -361,6 +359,8 @@ static int watchdog_nmi_enable(int cpu)
 	/* Try to register using hardware perf events */
 	wd_attr = &wd_hw_attr;
 	wd_attr->sample_period = hw_nmi_get_sample_period();
+	wd_attr->pinned = 1;
+	wd_attr->disabled = 1;
 	event = perf_event_create_kernel_counter(wd_attr, cpu, -1
[PATCH V2 3/5] para virt interface of perf to support kvm guest os statistics collection in guest os
The 3rd patch is to implement para virt perf at host kernel. Signed-off-by: Zhang Yanmin yanmin_zh...@linux.intel.com --- --- linux-2.6_tip0620/arch/x86/include/asm/kvm_para.h 2010-06-21 15:19:38.992999849 +0800 +++ linux-2.6_tip0620perfkvm/arch/x86/include/asm/kvm_para.h2010-06-21 15:21:39.308999849 +0800 @@ -2,6 +2,7 @@ #define _ASM_X86_KVM_PARA_H #include linux/types.h +#include linux/list.h #include asm/hyperv.h /* This CPUID returns the signature 'KVMKVMKVM' in ebx, ecx, and edx. It @@ -19,7 +20,8 @@ /* This indicates that the new set of kvmclock msrs * are available. The use of 0x11 and 0x12 is deprecated */ -#define KVM_FEATURE_CLOCKSOURCE23 +#define KVM_FEATURE_CLOCKSOURCE2 3 +#define KVM_FEATURE_PV_PERF4 /* The last 8 bits are used to indicate how to interpret the flags field * in pvclock structure. If no bits are set, all flags are ignored. @@ -33,7 +35,14 @@ #define MSR_KVM_WALL_CLOCK_NEW 0x4b564d00 #define MSR_KVM_SYSTEM_TIME_NEW 0x4b564d01 -#define KVM_MAX_MMU_OP_BATCH 32 +#define KVM_MAX_MMU_OP_BATCH 32 + +/* Operations for KVM_PERF_OP */ +#define KVM_PERF_OP_OPEN 1 +#define KVM_PERF_OP_CLOSE 2 +#define KVM_PERF_OP_ENABLE 3 +#define KVM_PERF_OP_DISABLE4 +#define KVM_PERF_OP_READ 5 /* Operations for KVM_HC_MMU_OP */ #define KVM_MMU_OP_WRITE_PTE1 @@ -64,6 +73,85 @@ struct kvm_mmu_op_release_pt { #ifdef __KERNEL__ #include asm/processor.h +/* + * data communication area about perf_event between + * Host kernel and guest kernel + */ +struct guest_perf_event { + u64 count; + atomic_t overflows; +}; + +/* + * In host kernel, perf_event-host_perf_shadow points to + * host_perf_shadow which records some information + * about the guest. + */ +struct host_perf_shadow { + /* guest perf_event id passed from guest os */ + int id; + /* +* Host kernel saves data into data member counter firstly. 
+* kvm will get data from this counter and calls kvm functions +* to copy or add data back to guest os before entering guest os +* next time +*/ + struct guest_perf_event counter; + /* guest_event_addr is gpa_t pointing to guest os guest_perf_event*/ + __u64 guest_event_addr; + + /* +* Link to kvm.kvm_arch.shadow_hash_table +*/ + struct list_head shadow_entry; + struct kvm_vcpu *vcpu; + + struct perf_event *host_event; + /* +* Below counter is to prevent a malicious guest os from trying to +* close/enable an event at the same time. +*/ + atomic_t ref_counter; +}; + +/* + * In guest kernel, perf_event->guest_shadow points to + * guest_perf_shadow which records some information + * about the guest. + */ +struct guest_perf_shadow { + /* guest perf_event id passed from guest os */ + int id; + /* +* Host kernel kvm saves data into data member counter +*/ + struct guest_perf_event counter; +}; + +/* + * guest_perf_attr is used when guest calls hypercall to + * open a new perf_event at host side. Mostly, it's a copy of + * perf_event_attr and deletes something not used by host kernel.
+ */ +struct guest_perf_attr { + __u32 type; + __u64 config; + __u64 sample_period; + __u64 sample_type; + __u64 read_format; + __u64 flags; + __u32 bp_type; + __u64 bp_addr; + __u64 bp_len; +}; + +struct guest_perf_event_param { + __u64 attr_addr; + __u64 guest_event_addr; + /* In case there is an alignment issue, we put id as the last one */ + int id; +}; + extern void kvmclock_init(void); --- linux-2.6_tip0620/arch/x86/include/asm/kvm_host.h 2010-06-21 15:19:39.01849 +0800 +++ linux-2.6_tip0620perfkvm/arch/x86/include/asm/kvm_host.h 2010-06-21 15:21:39.308999849 +0800 @@ -24,6 +24,7 @@ #include <asm/desc.h> #include <asm/mtrr.h> #include <asm/msr-index.h> +#include <asm/perf_event.h> #define KVM_MAX_VCPUS 64 #define KVM_MEMORY_SLOTS 32 @@ -360,6 +361,18 @@ struct kvm_vcpu_arch { /* fields used by HYPER-V emulation */ u64 hv_vapic; + + /* +* Fields used by PARAVIRT perf interface: +* +* kvm checks overflow_events before entering guest os, +* and copy data back to guest os. +* event_mutex is to avoid a race between NMI perf event overflow +* handler, event close, and enable/disable. +*/ + struct mutex event_mutex; + int overflows; + struct perf_event *overflow_events[X86_PMC_IDX_MAX]; }; struct kvm_mem_alias { @@ -377,6 +390,9 @@ struct kvm_mem_aliases { int naliases; }; +#define KVM_PARAVIRT_PERF_EVENT_ENTRY_BITS (10
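The ref_counter in host_perf_shadow above guards the close/enable race discussed later in this thread: an enable path must not use a shadow that a concurrent close has already torn down. A minimal user-space sketch of that refcounting idiom, with hypothetical names and C11 atomics standing in for the kernel's atomic_t and atomic_inc_not_zero():

```c
#include <stdatomic.h>

/* Hypothetical stand-in for host_perf_shadow: the object may only be
 * freed by whoever drops the last reference. */
struct shadow {
    int id;
    atomic_int ref;   /* starts at 1 when the event is opened */
};

/* Take a reference only if the object is still live (ref > 0).
 * Mirrors the kernel's atomic_inc_not_zero() pattern. */
static int shadow_get(struct shadow *s)
{
    int old = atomic_load(&s->ref);
    while (old > 0) {
        if (atomic_compare_exchange_weak(&s->ref, &old, old + 1))
            return 1;   /* got a reference; safe to use s */
    }
    return 0;           /* already closed; caller must not touch s */
}

/* Drop a reference; returns 1 when the caller must free the object. */
static int shadow_put(struct shadow *s)
{
    return atomic_fetch_sub(&s->ref, 1) == 1;
}
```

Under this scheme a malicious guest's concurrent close simply makes the enable path's shadow_get() fail, instead of leaving it holding a freed pointer.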
[PATCH V2 5/5] para virt interface of perf to support kvm guest os statistics collection in guest os
The 5th patch is applied to the latest qemu-kvm tree. Signed-off-by: Zhang Yanmin yanmin_zh...@linux.intel.com --- diff -Nraup qemu-kvm_0621/kvm/include/linux/kvm.h qemu-kvm_0621_perf/kvm/include/linux/kvm.h --- qemu-kvm_0621/kvm/include/linux/kvm.h 2010-06-21 11:00:28.0 +0800 +++ qemu-kvm_0621_perf/kvm/include/linux/kvm.h 2010-06-21 13:23:51.537999849 +0800 @@ -530,6 +530,7 @@ struct kvm_enable_cap { #ifdef __KVM_HAVE_XCRS #define KVM_CAP_XCRS 56 #endif +#define KVM_CAP_PV_PERF 57 #ifdef KVM_CAP_IRQ_ROUTING diff -Nraup qemu-kvm_0621/kvm/include/x86/asm/kvm_para.h qemu-kvm_0621_perf/kvm/include/x86/asm/kvm_para.h --- qemu-kvm_0621/kvm/include/x86/asm/kvm_para.h 2010-06-21 11:00:28.0 +0800 +++ qemu-kvm_0621_perf/kvm/include/x86/asm/kvm_para.h 2010-06-21 13:27:04.375999849 +0800 @@ -15,6 +15,7 @@ #define KVM_FEATURE_CLOCKSOURCE 0 #define KVM_FEATURE_NOP_IO_DELAY 1 #define KVM_FEATURE_MMU_OP 2 +#define KVM_FEATURE_PV_PERF 4 #define MSR_KVM_WALL_CLOCK 0x11 #define MSR_KVM_SYSTEM_TIME 0x12 diff -Nraup qemu-kvm_0621/target-i386/kvm.c qemu-kvm_0621_perf/target-i386/kvm.c --- qemu-kvm_0621/target-i386/kvm.c 2010-06-21 11:00:29.0 +0800 +++ qemu-kvm_0621_perf/target-i386/kvm.c 2010-06-21 13:00:14.136999850 +0800 @@ -150,6 +150,9 @@ struct kvm_para_features { #ifdef KVM_CAP_PV_MMU { KVM_CAP_PV_MMU, KVM_FEATURE_MMU_OP }, #endif +#ifdef KVM_CAP_PV_PERF +{ KVM_CAP_PV_PERF, KVM_FEATURE_PV_PERF }, +#endif { -1, -1 } }; -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
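The qemu hunk above forwards KVM_FEATURE_PV_PERF (bit 4) to the guest, so guest-side detection reduces to testing that bit in the KVM features cpuid leaf (0x40000001 on kvm). A sketch of just the bit test; executing the cpuid instruction itself is omitted, and the helper name is invented for illustration:

```c
#define KVM_CPUID_FEATURES  0x40000001u  /* kvm paravirt features leaf */
#define KVM_FEATURE_PV_PERF 4            /* bit index from the patch */

/* Returns nonzero if the pv perf feature bit is set in the eax value
 * the guest got back from cpuid(KVM_CPUID_FEATURES). */
static int kvm_has_pv_perf(unsigned int features_eax)
{
    return (features_eax >> KVM_FEATURE_PV_PERF) & 1;
}
```

With this, 'qemu -cpu host' (or ...,+kvmperf once kvm_feature_name is updated) makes the bit visible and the guest can decide whether to register its kvm_pmu.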
[PATCH V2 4/5] para virt interface of perf to support kvm guest os statistics collection in guest os
The 4th patch is to implement para virt perf at guest side. Signed-off-by: Zhang Yanmin yanmin_zh...@linux.intel.com --- --- linux-2.6_tip0620/arch/x86/Kconfig 2010-06-21 15:19:39.180999849 +0800 +++ linux-2.6_tip0620perfkvm/arch/x86/Kconfig 2010-06-21 15:21:39.30849 +0800 @@ -552,6 +552,14 @@ config KVM_GUEST This option enables various optimizations for running under the KVM hypervisor. +config KVM_PERF + bool "KVM Guest perf support" + select PARAVIRT + select PERF_EVENT + ---help--- + This option enables various optimizations for running perf in + guest os under the KVM hypervisor. + source arch/x86/lguest/Kconfig config PARAVIRT --- linux-2.6_tip0620/arch/x86/kernel/cpu/perf_event.c 2010-06-21 15:19:39.964999849 +0800 +++ linux-2.6_tip0620perfkvm/arch/x86/kernel/cpu/perf_event.c 2010-06-21 16:44:36.602999849 +0800 @@ -25,6 +25,7 @@ #include <linux/highmem.h> #include <linux/cpu.h> #include <linux/bitops.h> +#include <linux/kvm_para.h> #include <asm/apic.h> #include <asm/stacktrace.h> @@ -583,10 +584,20 @@ static void x86_pmu_disable_all(void) } } +#ifdef CONFIG_KVM_PERF +static int kvm_hw_perf_enable(void); +static int kvm_hw_perf_disable(void); +#endif + void hw_perf_disable(void) { struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events); +#ifdef CONFIG_KVM_PERF + if (!kvm_hw_perf_disable()) + return; +#endif + if (!x86_pmu_initialized()) return; @@ -810,6 +821,11 @@ void hw_perf_enable(void) struct hw_perf_event *hwc; int i, added = cpuc->n_added; +#ifdef CONFIG_KVM_PERF + if (!kvm_hw_perf_enable()) + return; +#endif + if (!x86_pmu_initialized()) return; @@ -1264,6 +1280,7 @@ x86_get_event_constraints(struct cpu_hw_ #include "perf_event_intel_lbr.c" #include "perf_event_intel_ds.c" #include "perf_event_intel.c" +#include "perf_event_kvm.c" static int __cpuinit x86_pmu_notifier(struct notifier_block *self, unsigned long action, void *hcpu) @@ -1317,6 +1334,11 @@ void __init init_hw_perf_events(void) pr_info("Performance Events: "); +#ifdef CONFIG_KVM_PERF + if
(!kvm_init_hw_perf_events()) + return; +#endif + switch (boot_cpu_data.x86_vendor) { case X86_VENDOR_INTEL: err = intel_pmu_init(); @@ -1541,6 +1563,13 @@ const struct pmu *hw_perf_event_init(str const struct pmu *tmp; int err; +#ifdef CONFIG_KVM_PERF + if (kvm_para_available()) { + tmp = kvm_hw_perf_event_init(event); + return tmp; + } +#endif + err = __hw_perf_event_init(event); if (!err) { /* --- linux-2.6_tip0620/arch/x86/kernel/cpu/perf_event_kvm.c 1970-01-01 08:00:00.0 +0800 +++ linux-2.6_tip0620perfkvm/arch/x86/kernel/cpu/perf_event_kvm.c 2010-06-21 16:44:56.735999849 +0800 @@ -0,0 +1,426 @@ +/* + * Performance events + * + * Copyright (C) 2010 Intel Corporation + * Zhang Yanmin yanmin.zh...@intel.com + * + * For licencing details see kernel-base/COPYING + */ + +#ifdef CONFIG_KVM_PERF + +static atomic_t guest_perf_id; /* Global id counter per guest os */ + +static inline int get_new_perf_event_id(void) +{ + return atomic_inc_return(&guest_perf_id); +} + +#ifdef CONFIG_X86_LOCAL_APIC + +static bool kvm_reserve_pmc_hardware(void) +{ + if (nmi_watchdog == NMI_LOCAL_APIC) + disable_lapic_nmi_watchdog(); + + return true; +} + +static void kvm_release_pmc_hardware(void) +{ + if (nmi_watchdog == NMI_LOCAL_APIC) + enable_lapic_nmi_watchdog(); +} + +#else + +static bool kvm_reserve_pmc_hardware(void) { return true; } +static void kvm_release_pmc_hardware(void) {} + +#endif + +static void kvm_hw_perf_event_destroy(struct perf_event *event) +{ + struct guest_perf_shadow *shadow = event->guest_perf_shadow; + + BUG_ON(!shadow); + kvm_hypercall2(KVM_PERF_OP, KVM_PERF_OP_CLOSE, shadow->id); + + kfree(shadow); + event->guest_perf_shadow = NULL; + + if (atomic_dec_and_mutex_lock(&active_events, &pmc_reserve_mutex)) { + kvm_release_pmc_hardware(); + mutex_unlock(&pmc_reserve_mutex); + } +} + +/* The guest might also run as a host */ +static int check_ontop_guest_overflow(struct perf_event *event, int overflows) +{ + struct host_perf_shadow *host_shadow = event->host_perf_shadow; + if
(!host_shadow) + return 0; + + if (perf_guest_cbs) + perf_guest_cbs->copy_event_to_shadow(event, overflows); + + return 1; +} + +static int +check_event_overflow(struct perf_event *event, struct pt_regs *regs) +{ + struct perf_sample_data data; + struct guest_perf_shadow *guest_shadow = event->guest_perf_shadow; + s32 overflows; + int i; + int handled = 0; + + local64_set(&event->count, guest_shadow
Re: [PATCH V2 1/5] para virt interface of perf to support kvm guest os statistics collection in guest os
On Mon, 2010-06-21 at 14:45 +0300, Avi Kivity wrote: On 06/21/2010 12:31 PM, Zhang, Yanmin wrote: Here is the version 2. ChangeLog since V1: Mostly changes based on Avi's suggestions. 1) Use an id to identify the perf_event between host and guest; 2) Change lots of code to deal with a malicious guest os; 3) Add a perf_event number limitation per guest os instance; 4) Support the scenario of a guest os on top of another guest os. But I didn't test it yet as there is no environment. The design is to add 2 pointers in struct perf_event. One is used by host and the other is used by guest. 5) Fix the bug to support 'perf stat'. The key is to sync count data back to guest when guest tries to disable the perf_event at host side. 6) Add a clear ABI of PV perf. Please use meaningful subject lines for individual patches. Yes, I should. I rushed to send the patches out yesterday afternoon as I needed to take the company shuttle back home. I don't implement the live migration feature. Avi, Is live migration necessary for pv perf support? Yes. Ok. With the PV perf interface, host perf saves all counter info into the perf_event structure. To support live migration, we need to save all host perf_event structures, or at least perf_event->count and perf_event->attr. Then, recreate the host perf_event after migration. I checked the qemu-kvm code and it seems most of live migration is about saving cpu states. So it seems it's hard for the perf pv interface to match current live migration. Any suggestion? --- linux-2.6_tip0620/Documentation/kvm/paravirt-perf.txt 1970-01-01 08:00:00.0 +0800 +++ linux-2.6_tip0620perfkvm/Documentation/kvm/paravirt-perf.txt 2010-06-21 15:21:39.312999849 +0800 @@ -0,0 +1,133 @@ +The x86 kvm paravirt perf event interface +=== + +This paravirt interface is responsible for supporting guest os perf event +collection. If the guest os supports this interface, users can run the command +perf in the guest os directly.
+ +Design + + +Guest os calls a series of hypercalls to communicate with the host kernel to +create/enable/disable/close perf events. The host kernel notifies the guest os +by injecting an NMI to the guest os when an event overflows. Guest os needs to +go through all its active events to check if they overflow, and outputs +performance statistics if they do. + +ABI += + +1) Detect if the host kernel supports the paravirt perf interface: +#define KVM_FEATURE_PV_PERF 4 +The host kernel defines the above cpuid bit. Guest os calls cpuid to check if the +host os returns this bit. If it does, it means the host kernel supports the paravirt +perf interface. + +2) Open a new event at host side: +kvm_hypercall3(KVM_PERF_OP, KVM_PERF_OP_OPEN, param_addr_low32bit, +param_addr_high32bit); + +#define KVM_PERF_OP 3 +/* Operations for KVM_PERF_OP */ +#define KVM_PERF_OP_OPEN 1 +#define KVM_PERF_OP_CLOSE 2 +#define KVM_PERF_OP_ENABLE 3 +#define KVM_PERF_OP_DISABLE 4 +#define KVM_PERF_OP_READ 5 +/* + * guest_perf_attr is used when guest calls hypercall to + * open a new perf_event at host side. Mostly, it's a copy of + * perf_event_attr and deletes something not used by host kernel. + */ +struct guest_perf_attr { +__u32 type; Need padding here, otherwise the structure is different on 32-bit and 64-bit guests. Ok. I will change it. +__u64 config; +__u64 sample_period; +__u64 sample_type; +__u64 read_format; +__u64 flags; and here. I will rearrange the whole structure. +__u32 bp_type; +__u64 bp_addr; +__u64 bp_len; Do we actually support breakpoints on the guest? Note the hardware breakpoints are also usable by the guest, so if the host uses them, we won't be able to emulate them correctly. We can let the guest do breakpoint perf monitoring itself and drop this feature. Ok, I will disable the breakpoint feature of the pv interface. +}; What about documentation for individual fields? Esp. type, config, and flags, but also the others. They are really perf implementation specific. Even the perf_event definition has no document but code comments.
I will add a simple explanation around the new structure definition. +/* + * data communication area about perf_event between + * Host kernel and guest kernel + */ +struct guest_perf_event { +u64 count; +atomic_t overflows; Please use __u64 and __u32, assume guests don't have Linux internal types (though of course the first guest _is_ Linux). This structure is used by both host
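Avi's padding comment above can be made concrete with offsetof: a leading __u32 followed by a __u64 lays out differently on i386 (where u64 is 4-byte aligned) than on x86-64 (8-byte aligned), so a shared guest/host ABI must pad explicitly. A sketch, with stdint types standing in for __u32/__u64 and hypothetical struct names:

```c
#include <stdint.h>
#include <stddef.h>

/* Same shape as the start of guest_perf_attr in the patch.
 * config lands at offset 4 on i386 but offset 8 on x86-64,
 * so a 32-bit guest and 64-bit host would disagree. */
struct attr_unpadded {
    uint32_t type;
    uint64_t config;
};

/* Explicit padding pins the layout: config is at offset 8 on
 * both 32-bit and 64-bit compilers. */
struct attr_padded {
    uint32_t type;
    uint32_t pad;
    uint64_t config;
};
```

This is why the reply moves id to the end of guest_perf_event_param and why the attr structure gets rearranged in the next revision.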
Re: [PATCH V2 2/5] para virt interface of perf to support kvm guest os statistics collection in guest os
On Mon, 2010-06-21 at 15:00 +0300, Avi Kivity wrote: On 06/21/2010 12:31 PM, Zhang, Yanmin wrote: The 2nd patch is to change the definition of perf_event to facilitate perf attr copy when a hypercall happens. Signed-off-by: Zhang Yanmin yanmin_zh...@linux.intel.com --- --- linux-2.6_tip0620/include/linux/perf_event.h 2010-06-21 15:19:52.821999849 +0800 +++ linux-2.6_tip0620perfkvm/include/linux/perf_event.h 2010-06-21 16:53:49.283999849 +0800 @@ -188,7 +188,10 @@ struct perf_event_attr { __u64 sample_type; __u64 read_format; Assuming these flags are available to the guest? These flags are used by generic perf code. To match the host kernel, we wish the guest os to also use these flags. - __u64 disabled : 1, /* off by default*/ + union { + __u64 flags; + struct { + __u64 disabled : 1, /* off by default*/ inherit : 1, /* children inherit it */ inherit is meaningless for a guest. Right. The host kernel will reset it to 0 before creating a perf_event for the guest os. Here we couldn't delete the flag as it's used by perf generic code. I need to separate the patch a little better. All definitions in include/linux/perf_event.h are mostly perf generic code related. I'm very careful to change it. pinned : 1, /* must always be on PMU */ We cannot allow a guest to pin a counter. Ok. I will reset it in function kvm_pv_perf_op_open. The other flags are also problematic. I'd like to see virt-specific flags (probably we'll only need kernel/user and nested_hv for nested virtualization). How about adding more comments around struct guest_perf_attr->flags that guest os developers should take a look at include/linux/perf_event.h? BTW, I will reset more flags to 0 in kvm_pv_perf_op_open. Something that is worrying is that we don't expose group information. perf will multiplex the events for us, but there will be a loss in accuracy.
#ifdef CONFIG_HAVE_HW_BREAKPOINT #include <asm/hw_breakpoint.h> #endif @@ -753,6 +752,20 @@ struct perf_event { perf_overflow_handler_t overflow_handler; + /* +* pointers used by kvm perf paravirt interface. +* +* 1) Used in host kernel and points to host_perf_shadow which +* has information about guest perf_event +*/ + void *host_perf_shadow; Can we have real types instead of void pointers? I just want perf generic code to have less dependency on KVM code. + /* +* 2) Used in guest kernel and points to guest_perf_shadow which +* is used as a communication area with host kernel. Host kernel +* copies overflow data to it when an event overflows. +*/ + void *guest_perf_shadow; It's strange to see both guest and host parts in the same patch. Splitting into separate patches will really help review. It's a little hard to split the patches if they change the same file. Perhaps I could add more statements before the patch when I send it out. @@ -1626,9 +1629,22 @@ void perf_event_task_tick(struct task_st if (ctx && ctx->nr_events && ctx->nr_events != ctx->nr_active) rotate = 1; - perf_ctx_adjust_freq(&cpuctx->ctx); - if (ctx) - perf_ctx_adjust_freq(ctx); +#ifdef CONFIG_KVM_PERF + if (kvm_para_available()) { + /* +* perf_ctx_adjust_freq causes lots of pmu->read which would +* trigger too many vmexits to the host kernel. We disable it +* in the para virt situation. +*/ + adjust_freq = 0; + } +#endif Perhaps we can have a batch read interface which will read many counters at once. It's a good idea. But that would touch many parts of perf generic code, which makes it hard to maintain or follow future changes. This would reduce the number of exits. Also adjust the frequency less frequently. Here it depends on process scheduler frequency, CONFIG_HZ.
+ + if (adjust_freq) { + perf_ctx_adjust_freq(&cpuctx->ctx); + if (ctx) + perf_ctx_adjust_freq(ctx); + }
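The batch read interface Avi suggests would amortize one vmexit over many counters instead of one pmu->read exit per event. The call-count arithmetic can be sketched as follows; the constant and function names are invented, with the batch size modeled on KVM_MAX_MMU_OP_BATCH:

```c
#define KVM_PERF_READ_BATCH 32  /* hypothetical, mirrors KVM_MAX_MMU_OP_BATCH */

/* How many hypercalls (vmexits) are needed to read nr_events counters
 * when each call carries up to KVM_PERF_READ_BATCH event ids. */
static int perf_read_hypercalls(int nr_events)
{
    return (nr_events + KVM_PERF_READ_BATCH - 1) / KVM_PERF_READ_BATCH;
}
```

With per-event reads a tick touching 100 events costs 100 exits; batched, it costs 4 — which is why the unbatched version forces adjust_freq off under para virt.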
Re: [PATCH V2 3/5] para virt interface of perf to support kvm guest os statistics collection in guest os
On Mon, 2010-06-21 at 15:33 +0300, Avi Kivity wrote: On 06/21/2010 12:31 PM, Zhang, Yanmin wrote: The 3rd patch is to implement para virt perf at host kernel. @@ -64,6 +73,85 @@ struct kvm_mmu_op_release_pt { #ifdef __KERNEL__ #include <asm/processor.h> +/* + * In host kernel, perf_event->host_perf_shadow points to + * host_perf_shadow which records some information + * about the guest. + */ +struct host_perf_shadow { + /* guest perf_event id passed from guest os */ + int id; + /* +* Host kernel saves data into data member counter firstly. +* kvm will get data from this counter and calls kvm functions +* to copy or add data back to guest os before entering guest os +* next time +*/ + struct guest_perf_event counter; + /* guest_event_addr is gpa_t pointing to guest os guest_perf_event*/ + __u64 guest_event_addr; So just use gpa_t as the type. host_perf_shadow->guest_event_addr is a copy of guest_perf_event_param->guest_event_addr. As the latter's type is __u64 as the interface between guest os and host os, I use __u64 as the type of host_perf_shadow->guest_event_addr. + + /* +* Link to kvm.kvm_arch.shadow_hash_table +*/ + struct list_head shadow_entry; + struct kvm_vcpu *vcpu; + + struct perf_event *host_event; + /* +* Below counter is to prevent a malicious guest os from trying to +* close/enable an event at the same time. +*/ + atomic_t ref_counter; If events are made per-vcpu (like real hardware), races become impossible. This design is to deal with a task context perf collection in guest os. Scenario 1: 1) guest os starts to collect statistics of process A on vcpu 0; 2) process A is scheduled to vcpu 1. Then, the perf_event at host side needs to be moved to VCPU 1's thread. With the per KVM instance design, we needn't move host_perf_shadow among vcpus. Scenario 2: 1) guest os creates a perf_event at host side on vcpu 0; 2) malicious guest os calls close to delete the host perf_event on vcpu 1, but enables the perf_event on vcpu0 at the same time.
When the close thread runs to get the host_perf_shadow from the list, the enable thread also gets it. Then, the close thread deletes the perf_event, and the enable thread will cause a host kernel panic when using host_perf_shadow. +}; Please move this structure to include/linux/kvm_host.h. No need to spam kvm_para.h. Note it's not x86 specific (though you can leave arch enabling to arch maintainers). Ok. Originally, I wanted to do so, but I'm afraid other arch might be not happy. + +/* + * In guest kernel, perf_event->guest_shadow points to + * guest_perf_shadow which records some information + * about the guest. + */ +struct guest_perf_shadow { + /* guest perf_event id passed from guest os */ + int id; + /* +* Host kernel kvm saves data into data member counter +*/ + struct guest_perf_event counter; +}; Don't ordinary perf structures already have a counter ID which we can reuse? No. On the other hand, if we assume generic perf has one, we couldn't use it, because a generic perf id (actually there is none) is host kernel system-wide while here guest_perf_shadow->id is per kvm instance wide. + +/* + * guest_perf_attr is used when guest calls hypercall to + * open a new perf_event at host side. Mostly, it's a copy of + * perf_event_attr and deletes something not used by host kernel. + */ +struct guest_perf_attr { + __u32 type; + __u64 config; + __u64 sample_period; + __u64 sample_type; + __u64 read_format; + __u64 flags; + __u32 bp_type; + __u64 bp_addr; + __u64 bp_len; +}; This is really not a guest or host structure, it's part of the interface. So please rename it (and similar) kvm_pv_perf_*. Good idea.
@@ -24,6 +24,7 @@ #include <asm/desc.h> #include <asm/mtrr.h> #include <asm/msr-index.h> +#include <asm/perf_event.h> #define KVM_MAX_VCPUS 64 #define KVM_MEMORY_SLOTS 32 @@ -360,6 +361,18 @@ struct kvm_vcpu_arch { /* fields used by HYPER-V emulation */ u64 hv_vapic; + + /* +* Fields used by PARAVIRT perf interface: +* +* kvm checks overflow_events before entering guest os, +* and copy data back to guest os. +* event_mutex is to avoid a race between NMI perf event overflow +* handler, event close, and enable/disable. +*/ + struct mutex event_mutex; No race can exist. The host NMI handler cannot take any mutex We use a mutex_trylock in the NMI handler. If it can't get the lock, there is an NMI miss happening, but the host kernel still updates perf_event->host_perf_shadow.counter, so the overflow data will be accumulated. so it must be immune
Re: [PATCH V2 5/5] para virt interface of perf to support kvm guest os statistics collection in guest os
On Mon, 2010-06-21 at 15:37 +0300, Avi Kivity wrote: On 06/21/2010 12:31 PM, Zhang, Yanmin wrote: The 5th patch is applied to the latest qemu-kvm tree. --- qemu-kvm_0621/target-i386/kvm.c 2010-06-21 11:00:29.0 +0800 +++ qemu-kvm_0621_perf/target-i386/kvm.c 2010-06-21 13:00:14.136999850 +0800 @@ -150,6 +150,9 @@ struct kvm_para_features { #ifdef KVM_CAP_PV_MMU { KVM_CAP_PV_MMU, KVM_FEATURE_MMU_OP }, #endif +#ifdef KVM_CAP_PV_PERF +{ KVM_CAP_PV_PERF, KVM_FEATURE_PV_PERF }, +#endif { -1, -1 } }; Not really necessary any more - if you expose the cpuid bit via KVM_GET_SUPPORTED_CPUID2 then 'qemu -cpu host' will automatically enable it. On the other hand, do update target-i386/cpuid.c:kvm_feature_name so people can enable the feature using qemu -cpu ...,+kvmperf. Thanks for the good pointer. I will change it. Avi, Thanks for your wonderful comments. I will fix all in the patches. As for the live migration, I need to check it carefully. If you could provide some suggestions on it, especially about how/when to save all perf_events and restore all perf_events, that would be very helpful. Yanmin
Re: [RFC] para virt interface of perf to support kvm guest os statistics collection in guest os
On Thu, 2010-06-10 at 11:50 +0200, Peter Zijlstra wrote: On Thu, 2010-06-10 at 10:21 +0800, Zhang, Yanmin wrote: . The ABI mostly includes the definition of struct perf_event_attr, guest_perf_counter, and hypercalls. Note that perf_event_attr isn't guaranteed to be stable between kernels, it can grow. Thanks. I need to create a small new attr structure to save some key data members.
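Since perf_event_attr can grow between kernels, the small new attr structure could borrow perf's own forward-compatibility trick: a leading size field, with the receiver zero-filling whatever an older sender didn't include. A hedged user-space sketch; the struct layout and names here are hypothetical, not the patch's actual ABI:

```c
#include <string.h>
#include <stdint.h>

/* Hypothetical pv attr with a leading size field, mimicking how
 * perf_event_attr itself stays compatible across kernel versions. */
struct pv_perf_attr {
    uint32_t size;           /* sizeof the struct the sender was built with */
    uint32_t type;
    uint64_t config;
    uint64_t sample_period;
};

/* Copy a possibly shorter (older) attr into our full-sized view,
 * zero-filling the fields the sender didn't know about. */
static void pv_attr_copy(struct pv_perf_attr *dst,
                         const void *src, uint32_t src_size)
{
    uint32_t n = src_size < sizeof(*dst) ? src_size : (uint32_t)sizeof(*dst);

    memset(dst, 0, sizeof(*dst));
    memcpy(dst, src, n);
    dst->size = src_size;    /* remember what the sender actually provided */
}
```

An old guest that only knows size and type then gets sane zero defaults for every newer field on the host side.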
Re: [RFC] para virt interface of perf to support kvm guest os statistics collection in guest os
On Wed, 2010-06-09 at 11:33 +0300, Avi Kivity wrote: On 06/09/2010 06:30 AM, Zhang, Yanmin wrote: From: Zhang, Yanmin yanmin_zh...@linux.intel.com Based on Ingo's idea, I implemented a para virt interface for perf to support statistics collection in guest os. That means we could run the tool perf in guest os directly. Great thanks to Peter Zijlstra. He is really the architect and gave me architecture design suggestions. I also want to thank Yangsheng and LinMing for their generous help. The design is: 1) Add a kvm_pmu whose callbacks mostly just call hypercalls to vmexit to the host kernel; 2) Create a host perf_event per guest perf_event; 3) Host kernel syncs perf_event count/overflow data changes to guest perf_event when processing perf_event overflows after NMI arrives. Host kernel injects NMI to guest kernel if a guest event overflows. 4) Guest kernel goes through all enabled events on the current cpu and outputs data when they overflow. 5) No change in user space. Thanks for your kind comments. One thing that's missing is documentation of the guest/host ABI. It will be a requirement for inclusion, but it will also be a great help for review, so please provide it ASAP. I will add such a document. It will include: 1) Data struct perf_event definition. Guest os and host os have to share the same data structure as the host kernel will sync data (count/overflows and others if needed) changes back to guest os. 2) A list of all hypercalls 3) Guest needs to have an NMI handler which checks all guest events. Please also split the guest and host parts into separate patches. I will do so. -#define KVM_MAX_MMU_OP_BATCH 32 Keep that please. I will do so. +/* Operations for KVM_PERF_OP */ +#define KVM_PERF_OP_OPEN 1 +#define KVM_PERF_OP_CLOSE 2 +#define KVM_PERF_OP_ENABLE 3 +#define KVM_PERF_OP_DISABLE 4 +#define KVM_PERF_OP_START 5 +#define KVM_PERF_OP_STOP 6 +#define KVM_PERF_OP_READ 7 Where is the hypercall number for the perf hypercall? I define it in file kvm_para.h like KVM_HC_MMU_OP.
+static bool kvm_reserve_pmc_hardware(void) +{ + if (nmi_watchdog == NMI_LOCAL_APIC) + disable_lapic_nmi_watchdog(); + + return true; +} + +static void kvm_release_pmc_hardware(void) +{ + if (nmi_watchdog == NMI_LOCAL_APIC) + enable_lapic_nmi_watchdog(); +} Disabling the watchdog is unfortunate. Why is it necessary? perf always uses NMI, so we disable the nmi_watchdog when a perf_event is set up in case they might have impact. + +static int kvm_pmu_enable(struct perf_event *event) +{ + int ret; + unsigned long addr = __pa(event); + + if (kvm_add_event(event)) + return -1; + + ret = kvm_hypercall3(KVM_PERF_OP, KVM_PERF_OP_ENABLE, + addr, (unsigned long) event->shadow); This is suspicious. Passing virtual addresses to the host is always problematic. Or is event->shadow only used as an opaque cookie? I use perf_event->shadow at both host and guest side. 1) At host side, perf_event->shadow points to an area including the page mapping information about the guest perf_event. The host kernel uses it to sync data changes back to the guest; 2) At guest side, perf_event->shadow saves the virtual address of the host perf_event at host side. At guest side, kvm_hypercall3(KVM_PERF_OP, KVM_PERF_OP_OPEN, ...) returns the virtual address. Guest os shouldn't use it except to pass it back to the host kernel in following hypercalls. It might be a security issue for the host kernel. Originally, I planned for guest os not to use perf_event->shadow. Host kernel maintains a per-vcpu event-related list whose key is the addr of the guest perf_event. But considering the vcpu thread migration on logical cpus, such a list needs a lock and the implementation becomes a little complicated. I will double-check the list method as the security issue is there.
+int __init kvm_init_hw_perf_events(void) +{ + if (!kvm_para_available()) + return -1; + + x86_pmu.handle_irq = kvm_default_x86_handle_irq; + + pr_cont("KVM PARA PMU driver.\n"); + register_die_notifier(&kvm_perf_event_nmi_notifier); + + return 0; +} Need to detect the kvm pmu via its own cpuid bit. Ok. I will add a feature, KVM_FEATURE_PARA_PERF, something like bit KVM_FEATURE_CLOCKSOURCE. + +static int __kvm_hw_perf_event_init(struct perf_event *event) +{ + struct hw_perf_event *hwc = &event->hw; + int err; + unsigned long result; + unsigned long addr; + + err = 0; + if (!atomic_inc_not_zero(&active_events)) { + mutex_lock(&pmc_reserve_mutex); + if (atomic_read(&active_events) == 0) { + if (!kvm_reserve_pmc_hardware()) + err = -EBUSY; + } + if (!err
Re: [RFC] para virt interface of perf to support kvm guest os statistics collection in guest os
On Wed, 2010-06-09 at 11:59 +0300, Avi Kivity wrote: On 06/09/2010 06:30 AM, Zhang, Yanmin wrote: From: Zhang, Yanmin yanmin_zh...@linux.intel.com Based on Ingo's idea, I implemented a para virt interface for perf to support statistics collection in guest os. That means we could run the tool perf in guest os directly. Great thanks to Peter Zijlstra. He is really the architect and gave me architecture design suggestions. I also want to thank Yangsheng and LinMing for their generous help. The design is: 1) Add a kvm_pmu whose callbacks mostly just call hypercalls to vmexit to the host kernel; 2) Create a host perf_event per guest perf_event; 3) Host kernel syncs perf_event count/overflow data changes to guest perf_event when processing perf_event overflows after NMI arrives. Host kernel injects NMI to guest kernel if a guest event overflows. 4) Guest kernel goes through all enabled events on the current cpu and outputs data when they overflow. 5) No change in user space. Other issues: - save/restore support for live migration Well, it's a little hard to process perf_event in the live migration case. I will check it. - some way to limit the number of open handles (comes automatically with the table approach I suggested earlier) Current perf doesn't restrict perf_event number. The kernel does a rotation to collect statistics of all perf_events. My patch just follows this style. The table method might not be good, because of the below scenario: a guest perf_event might be a per-task event at guest side. When the guest application task is migrated to another cpu, the perf_event peer at host side should also be migrated to the new vcpu thread. With the table method, we need to do some rearrangement on the table when event migration happens. The migration I mention here is not guest live migration. I will double-check it.
Thanks, Yanmin
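Avi's point about limiting the number of open handles falls out naturally of a bounded per-guest id table, rather than the ever-growing atomic counter in the RFC. A minimal user-space sketch of such a table; the limit and all names are hypothetical:

```c
#define PV_PERF_MAX_EVENTS 64  /* hypothetical per-guest limit */

/* One table per kvm instance: ids double as table slots, so a guest
 * can never hold more than PV_PERF_MAX_EVENTS host perf_events. */
struct pv_perf_idtable {
    unsigned char used[PV_PERF_MAX_EVENTS];
};

/* Allocate the lowest free id, or -1 when the guest hit its limit. */
static int pv_perf_id_alloc(struct pv_perf_idtable *t)
{
    for (int i = 0; i < PV_PERF_MAX_EVENTS; i++) {
        if (!t->used[i]) {
            t->used[i] = 1;
            return i;
        }
    }
    return -1;
}

/* Release an id on KVM_PERF_OP_CLOSE so it can be reused. */
static void pv_perf_id_free(struct pv_perf_idtable *t, int id)
{
    if (id >= 0 && id < PV_PERF_MAX_EVENTS)
        t->used[id] = 0;
}
```

Bounded small integers also sidestep the host-virtual-address cookie concern raised below, since the id is guest-chosen and range-checked on every hypercall.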
Re: [RFC] para virt interface of perf to support kvm guest os statistics collection in guest os
On Wed, 2010-06-09 at 12:41 +0300, Avi Kivity wrote: On 06/09/2010 12:21 PM, Zhang, Yanmin wrote: One thing that's missing is documentation of the guest/host ABI. It will be a requirement for inclusion, but it will also be a great help for review, so please provide it ASAP. I will add such a document. It will include: 1) Data struct perf_event definition. Guest os and host os have to share the same data structure as the host kernel will sync data (count/overflows and others if needed) changes back to guest os. 2) A list of all hypercalls 3) Guest needs to have an NMI handler which checks all guest events. Thanks. It may be easier to have just the documentation for the first few iterations so we can converge on a good interface, then update the code to match the interface. I thought it over last night. Your input is important. I need to define a clear ABI. At guest side, I plan to use perf_event->shadow to point to another data area, such as: struct guest_perf_counter { u64 count; atomic_t overflows; }; So the host side just copies data to this area, then the guest copies it to its own perf_event. The ABI becomes simpler than before. Function kvm_sync_event_to_guest also becomes clearer. The ABI mostly includes the definition of struct perf_event_attr, guest_perf_counter, and hypercalls. Disabling the watchdog is unfortunate. Why is it necessary? perf always uses NMI, so we disable the nmi_watchdog when a perf_event is set up in case they might have impact. Ok. Is that the case for the hardware pmus as well? If so it might be done in common code. + +static int kvm_pmu_enable(struct perf_event *event) +{ + int ret; + unsigned long addr = __pa(event); + + if (kvm_add_event(event)) + return -1; + + ret = kvm_hypercall3(KVM_PERF_OP, KVM_PERF_OP_ENABLE, + addr, (unsigned long) event->shadow); This is suspicious. Passing virtual addresses to the host is always problematic. Or is event->shadow only used as an opaque cookie? I use perf_event->shadow at both host and guest side.
1) At the host side, perf_event->shadow points to an area including the page mapping information about the guest perf_event. The host kernel uses it to sync data changes back to the guest; 2) At the guest side, perf_event->shadow saves the virtual address of the host perf_event at the host side. At the guest side, kvm_hypercall3(KVM_PERF_OP, KVM_PERF_OP_OPEN, ...) returns that virtual address. The guest os shouldn't dereference it, only pass it back to the host kernel in following hypercalls. It might be a security issue for the host kernel. Originally, I planned for the guest os not to use perf_event->shadow. The host kernel would maintain a per-vcpu event-related list whose key is the addr of the guest perf_event. But considering vcpu thread migration across logical cpus, such a list needs a lock and the implementation becomes a little complicated. I will double-check the list method as the security issue is there. Besides the other concern, you cannot live migrate a host virtual address, since it changes from host to host. It's better to use a guest-chosen bounded small integer. Ok. Perhaps a single u32 per guest os instance is enough. So I will change the shadow to point to a structure like below in the guest kernel: struct guest_perf_counter { u64 count; atomic_t overflows; }; struct guest_perf_shadow { u32 id; struct guest_perf_counter sync_data; }; atomic_t guest_perf_id; /* Global id counter per guest os */ Need to detect the kvm pmu via its own cpuid bit. Ok. I will add a feature, KVM_FEATURE_PARA_PERF, something like bit KVM_FEATURE_CLOCKSOURCE. Don't forget Documentation/kvm/cpuid.txt. Thanks for your kind reminder. Ok, so event->shadow is never dereferenced. In that case better not make it a pointer at all, keep it unsigned long. The host kernel also uses it. I see. So put it in a union. Or perhaps not even in a union - what if a kvm guest is also acting as a kvm host? My patch considers that. I compiled the kernel with host and guest support at the same time.
The accesses to perf_event->shadow happen only under specific scenarios, i.e. in specific functions. These functions are called just by the host kernel, or just by the guest kernel. + +static void kvm_sync_event_to_guest(struct perf_event *event, int overflows) +{ + struct hw_perf_event *hwc = &event->hw; + struct kvmperf_event *kvmevent; + int offset, len, data_len, copied, page_offset; + struct page *event_page; + void *shared_kaddr; + + kvmevent = event->shadow; + offset = kvmevent->event_offset; + + /* Copy perf_event->count firstly */ + offset += offsetof(struct perf_event, count); + if (offset < PAGE_SIZE) { + event_page = kvmevent->event_page; + page_offset = offset; + } else
Re: [RFC] para virt interface of perf to support kvm guest os statistics collection in guest os
On Thu, 2010-06-10 at 06:06 +0300, Avi Kivity wrote: On 06/10/2010 05:21 AM, Zhang, Yanmin wrote: I see. So put it in a union. Or perhaps not even in a union - what if a kvm guest is also acting as a kvm host? My patch considers that. I compiled the kernel with host and guest support at the same time. The accesses to perf_event->shadow happen only under specific scenarios, i.e. in specific functions. These functions are called just by the host kernel, or just by the guest kernel. But a kernel can be both guest and host at the same time. Currently this only works on AMD, but there was some work to bring it to Intel as well. Oh, this is a fancy VT feature. My patch doesn't support this mode. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[RFC] para virt interface of perf to support kvm guest os statistics collection in guest os
From: Zhang, Yanmin yanmin_zh...@linux.intel.com Based on Ingo's idea, I implemented a para virt interface for perf to support statistics collection in guest os. That means we could run the tool perf in guest os directly. Great thanks to Peter Zijlstra. He is really the architect and gave me architecture design suggestions. I also want to thank Yangsheng and LinMing for their generous help. The design is: 1) Add a kvm_pmu whose callbacks mostly just issue hypercalls to vmexit to the host kernel; 2) Create a host perf_event per guest perf_event; 3) The host kernel syncs perf_event count/overflows data changes to the guest perf_event when processing perf_event overflows after an NMI arrives. The host kernel injects an NMI into the guest kernel if a guest event overflows. 4) The guest kernel goes through all enabled events on the current cpu and outputs data when they overflow. 5) No change in user space. Below is an example. #perf top -- PerfTop:7954 irqs/sec kernel:79.5% exact: 0.0% [1000Hz cycles], (all, 8 CPUs) -- samples pcnt function DSO ___ _ _ 5315.00 4.9% copy_user_generic_string /lib/modules/2.6.35-rc1-tip-guestperf/build/vmlinux 3342.00 3.1% add_preempt_count /lib/modules/2.6.35-rc1-tip-guestperf/build/vmlinux 3338.00 3.1% sub_preempt_count /lib/modules/2.6.35-rc1-tip-guestperf/build/vmlinux 2454.00 2.3% pvclock_clocksource_read /lib/modules/2.6.35-rc1-tip-guestperf/build/vmlinux 2434.00 2.3% tcp_sendmsg /lib/modules/2.6.35-rc1-tip-guestperf/build/vmlinux 2090.00 1.9% child_run /bm/tmp/benchmarks/run_bmtbench/dbench/dbench-3.03/tbench 2081.00 1.9% debug_smp_processor_id /lib/modules/2.6.35-rc1-tip-guestperf/build/vmlinux 2003.00 1.9% __GI_strstr /lib64/libc-2.11.so 1999.00 1.9% __strchr_sse2 /lib64/libc-2.11.so 1983.00 1.8% tcp_ack /lib/modules/2.6.35-rc1-tip-guestperf/build/vmlinux 1800.00 1.7% tcp_transmit_skb /lib/modules/2.6.35-rc1-tip-guestperf/build/vmlinux 1727.00 1.6% schedule /lib/modules/2.6.35-rc1-tip-guestperf/build/vmlinux 1706.00 1.6% __libc_recv /lib64/libc-2.11.so 1702.00 1.6% 
__GI_memchr /lib64/libc-2.11.so 1580.00 1.5% tcp_recvmsg /lib/modules/2.6.35-rc1-tip-guestperf/build/vmlinux The patch is against tip/master tree of June 1st. Signed-off-by: Zhang Yanmin yanmin_zh...@linux.intel.com --- diff -Nraup linux-2.6_tip0601/arch/x86/include/asm/kvm_host.h linux-2.6_tip0601perfkvm/arch/x86/include/asm/kvm_host.h --- linux-2.6_tip0601/arch/x86/include/asm/kvm_host.h 2010-06-02 10:01:52.147999849 +0800 +++ linux-2.6_tip0601perfkvm/arch/x86/include/asm/kvm_host.h2010-06-06 15:46:31.874999850 +0800 @@ -561,6 +561,9 @@ int emulator_write_phys(struct kvm_vcpu const void *val, int bytes); int kvm_pv_mmu_op(struct kvm_vcpu *vcpu, unsigned long bytes, gpa_t addr, unsigned long *ret); +int kvm_pv_perf_op(struct kvm_vcpu *vcpu, int op_code, unsigned long a1, + unsigned long a2, unsigned long *result); + u8 kvm_get_guest_memory_type(struct kvm_vcpu *vcpu, gfn_t gfn); extern bool tdp_enabled; diff -Nraup linux-2.6_tip0601/arch/x86/include/asm/kvm_para.h linux-2.6_tip0601perfkvm/arch/x86/include/asm/kvm_para.h --- linux-2.6_tip0601/arch/x86/include/asm/kvm_para.h 2010-06-02 10:01:52.126999849 +0800 +++ linux-2.6_tip0601perfkvm/arch/x86/include/asm/kvm_para.h2010-06-06 15:46:31.874999850 +0800 @@ -33,7 +33,14 @@ #define MSR_KVM_WALL_CLOCK_NEW 0x4b564d00 #define MSR_KVM_SYSTEM_TIME_NEW 0x4b564d01 -#define KVM_MAX_MMU_OP_BATCH 32 +/* Operations for KVM_PERF_OP */ +#define KVM_PERF_OP_OPEN 1 +#define KVM_PERF_OP_CLOSE 2 +#define KVM_PERF_OP_ENABLE 3 +#define KVM_PERF_OP_DISABLE4 +#define KVM_PERF_OP_START 5 +#define KVM_PERF_OP_STOP 6 +#define KVM_PERF_OP_READ 7 /* Operations for KVM_HC_MMU_OP */ #define KVM_MMU_OP_WRITE_PTE1 @@ -64,6 +71,12 @@ struct kvm_mmu_op_release_pt { #ifdef __KERNEL__ #include asm/processor.h +struct kvmperf_event { + unsigned int
Re: [PATCH V5 1/3] perf kvm: Enhance perf to collect KVM guest os statistics from host side
On Fri, 2010-04-23 at 13:50 +0300, Avi Kivity wrote: On 04/22/2010 01:27 PM, Liu Yu-B13201 wrote: I met this error when building the kernel. Anything wrong? CC init/main.o In file included from include/linux/ftrace_event.h:8, from include/trace/syscall.h:6, from include/linux/syscalls.h:75, from init/main.c:16: include/linux/perf_event.h: In function 'perf_register_guest_info_callbacks': include/linux/perf_event.h:1019: error: parameter name omitted include/linux/perf_event.h: In function 'perf_unregister_guest_info_callbacks': include/linux/perf_event.h:1021: error: parameter name omitted make[1]: *** [init/main.o] Error 1 make: *** [init] Error 2 I merged tip/perf/core which may fix this. Find it in kvm.git next branch. I downloaded the latest kvm.git tree and compilation is ok, with FTRACE both enabled and disabled. Yanmin
Re: [PATCH V5 1/3] perf kvm: Enhance perf to collect KVM guest os statistics from host side
On Tue, 2010-04-20 at 08:09 +0200, Ingo Molnar wrote: * Ingo Molnar mi...@elte.hu wrote: * Zhang, Yanmin yanmin_zh...@linux.intel.com wrote: unsigned long perf_misc_flags(struct pt_regs *regs) { int misc = 0; + if (perf_guest_cbs && perf_guest_cbs->is_in_guest()) { + if (perf_guest_cbs->is_user_mode()) + misc |= PERF_RECORD_MISC_GUEST_USER; + else + misc |= PERF_RECORD_MISC_GUEST_KERNEL; + } else if (user_mode(regs)) + misc |= PERF_RECORD_MISC_USER; + else + misc |= PERF_RECORD_MISC_KERNEL; + We try to use balanced curly braces. I.e.: if (x) { boo(); } else { if (y) foo(); else bar(); } And avoid unbalanced ones: if (x) { boo(); } else if (y) foo(); else bar(); Note, i fixed this in the patch and applied it to perf/core. (the invalid-C problem was causing build failures) Thanks for the explanation. Originally, I used {} in the 2nd half, but deleted it. Yanmin
Re: [PATCH V5 1/3] perf kvm: Enhance perf to collect KVM guest os statistics from host side
On Mon, 2010-04-19 at 11:37 +0300, Avi Kivity wrote: On 04/19/2010 08:32 AM, Zhang, Yanmin wrote: Below patch introduces perf_guest_info_callbacks and related register/unregister functions. Add more PERF_RECORD_MISC_XXX bits meaning guest kernel and guest user space. This doesn't apply against upstream. What does upstream mean here? The vanilla 2.6.34-rc4? What branch was this generated against? It's against the latest tip/master. I checked out to 19b26586090 as the latest tip/master has some updates on perf.
Re: [PATCH V5 1/3] perf kvm: Enhance perf to collect KVM guest os statistics from host side
On Mon, 2010-04-19 at 11:59 +0300, Avi Kivity wrote: On 04/19/2010 11:55 AM, Zhang, Yanmin wrote: On Mon, 2010-04-19 at 11:37 +0300, Avi Kivity wrote: On 04/19/2010 08:32 AM, Zhang, Yanmin wrote: Below patch introduces perf_guest_info_callbacks and related register/unregister functions. Add more PERF_RECORD_MISC_XXX bits meaning guest kernel and guest user space. This doesn't apply against upstream. What does upstream mean here? The vanilla 2.6.34-rc4? Yes, sorry for being unclear. What branch was this generated against? It's against the latest tip/master. I checked out to 19b26586090 as the latest tip/master has some updates on perf. I don't want to merge tip/master... does tip/perf/core contain the needed updates? I think so. A moment ago, I checked out to b5a80b7e9 of tip/perf/core. All 3 patches could be applied cleanly and compilation is ok. A quick test shows the tip/perf/core kernel does work.
Re: [PATCH V5 1/3] perf kvm: Enhance perf to collect KVM guest os statistics from host side
On Mon, 2010-04-19 at 20:11 +0200, Ingo Molnar wrote: FYI, i found a few problems that need fixing: + unsigned long ip; + if (perf_guest_cbs && perf_guest_cbs->is_in_guest()) missing newline. + int misc = 0; + if (perf_guest_cbs && perf_guest_cbs->is_in_guest()) { ditto. + PERF_RECORD_MISC_GUEST_KERNEL; + } else + misc |= user_mode(regs) ? PERF_RECORD_MISC_USER : + PERF_RECORD_MISC_KERNEL; - unbalanced curly braces - missing curly brace for multi-line statement. - unnecessary line-break due to col80 warning from checkpatch +extern struct perf_guest_info_callbacks *perf_guest_cbs; +extern int perf_register_guest_info_callbacks( + struct perf_guest_info_callbacks *); +extern int perf_unregister_guest_info_callbacks( + struct perf_guest_info_callbacks *); - unnecessary line-break due to col80 warning from checkpatch +static inline int perf_register_guest_info_callbacks +(struct perf_guest_info_callbacks *) { return 0; } +static inline int perf_unregister_guest_info_callbacks +(struct perf_guest_info_callbacks *) { return 0; } - invalid C: function parameter needs name even if unused - missing space after opening curly brace Please provide delta fixes. Here is the fix on top of the prior 3 patches of V5. From: Zhang, Yanmin yanmin_zh...@linux.intel.com Fix some programming style issues on top of perf kvm enhancement V5. 
Signed-off-by: Zhang Yanmin yanmin_zh...@linux.intel.com --- diff -Nraup linux-2.6_tip0417_perfkvm/arch/x86/kernel/cpu/perf_event.c linux-2.6_tip0417_perfkvmstyle/arch/x86/kernel/cpu/perf_event.c --- linux-2.6_tip0417_perfkvm/arch/x86/kernel/cpu/perf_event.c 2010-04-19 09:53:59.689452915 +0800 +++ linux-2.6_tip0417_perfkvmstyle/arch/x86/kernel/cpu/perf_event.c 2010-04-20 10:48:18.500999849 +0800 @@ -1752,23 +1752,29 @@ void perf_arch_fetch_caller_regs(struct unsigned long perf_instruction_pointer(struct pt_regs *regs) { unsigned long ip; + if (perf_guest_cbs perf_guest_cbs-is_in_guest()) ip = perf_guest_cbs-get_guest_ip(); else ip = instruction_pointer(regs); + return ip; } unsigned long perf_misc_flags(struct pt_regs *regs) { int misc = 0; + if (perf_guest_cbs perf_guest_cbs-is_in_guest()) { - misc |= perf_guest_cbs-is_user_mode() ? - PERF_RECORD_MISC_GUEST_USER : - PERF_RECORD_MISC_GUEST_KERNEL; - } else - misc |= user_mode(regs) ? PERF_RECORD_MISC_USER : - PERF_RECORD_MISC_KERNEL; + if (perf_guest_cbs-is_user_mode()) + misc |= PERF_RECORD_MISC_GUEST_USER; + else + misc |= PERF_RECORD_MISC_GUEST_KERNEL; + } else if (user_mode(regs)) + misc |= PERF_RECORD_MISC_USER; + else + misc |= PERF_RECORD_MISC_KERNEL; + if (regs-flags PERF_EFLAGS_EXACT) misc |= PERF_RECORD_MISC_EXACT; diff -Nraup linux-2.6_tip0417_perfkvm/arch/x86/kvm/x86.c linux-2.6_tip0417_perfkvmstyle/arch/x86/kvm/x86.c --- linux-2.6_tip0417_perfkvm/arch/x86/kvm/x86.c2010-04-19 09:53:59.691378953 +0800 +++ linux-2.6_tip0417_perfkvmstyle/arch/x86/kvm/x86.c 2010-04-20 10:11:40.507545564 +0800 @@ -3776,16 +3776,20 @@ static int kvm_is_in_guest(void) static int kvm_is_user_mode(void) { int user_mode = 3; + if (percpu_read(current_vcpu)) user_mode = kvm_x86_ops-get_cpl(percpu_read(current_vcpu)); + return user_mode != 0; } static unsigned long kvm_get_guest_ip(void) { unsigned long ip = 0; + if (percpu_read(current_vcpu)) ip = kvm_rip_read(percpu_read(current_vcpu)); + return ip; } diff -Nraup 
linux-2.6_tip0417_perfkvm/include/linux/perf_event.h linux-2.6_tip0417_perfkvmstyle/include/linux/perf_event.h --- linux-2.6_tip0417_perfkvm/include/linux/perf_event.h2010-04-19 09:53:59.691378953 +0800 +++ linux-2.6_tip0417_perfkvmstyle/include/linux/perf_event.h 2010-04-20 10:08:03.531551890 +0800 @@ -941,10 +941,8 @@ static inline void perf_event_mmap(struc } extern struct perf_guest_info_callbacks *perf_guest_cbs; -extern int perf_register_guest_info_callbacks( - struct perf_guest_info_callbacks *); -extern int perf_unregister_guest_info_callbacks( - struct perf_guest_info_callbacks *); +extern int perf_register_guest_info_callbacks(struct perf_guest_info_callbacks *callbacks); +extern int perf_unregister_guest_info_callbacks(struct perf_guest_info_callbacks *callbacks); extern void perf_event_comm(struct task_struct *tsk); extern void perf_event_fork(struct task_struct *tsk); @@ -1016,9 +1014,9 @@ static inline void perf_bp_event(struct perf_event *event, void *data){ } static inline int
[PATCH V5 0/3] perf kvm: Enhance perf to collect KVM guest os statistics from host side
Here is the new patch of V5 against tip/master of April 17th if anyone wants to try it. ChangeLog V5: 1) Split kernel patch to 2 parts. The one introduces perf_guest_info_callbacks() and related register/unregister functions. The other is the kvm implementation of the callbacks. 2) Port to tip/master tree of April 17th. 3) Fix a bug which causes the module parsing of default guest kernel fail. ChangeLog V4: 1) Based on Ingo's comments, I added help information around kvm such like command-list.txt and perf-kvm.txt. 2) Added guest process id at the tail of kernel dso long name, so the display could show different label with different guest os. 3) Based on Avi's comments, erase the racy window which might trigger an NMI while the NMI isn't in guest os. 4) Fixed all the errors and warnings reported by scripts/checkpatch.pl. 5) Fixed a compilation error pointed by Yang Sheng. ChangeLog V3: 1) Add --guestmount=/dir/to/all/guestos parameter. Admin mounts guest os root directories under /dir/to/all/guestos by sshfs. For example, I start 2 guest os. The one's pid is and the other's is . #mkdir ~/guestmount; cd ~/guestmount #sshfs -o allow_other,direct_io -p 5551 localhost:/ / #sshfs -o allow_other,direct_io -p 5552 localhost:/ / #perf kvm --host --guest --guestmount=~/guestmount top The old --guestkallsyms and --guestmodules are still supported as default guest os symbol parsing. 2) Add guest os buildid support. 3) Add sub command 'perf kvm buildid-list'. 4) Delete sub command 'perf kvm stat', because our current implementation doesn't transfer guest/host requirement to kernel, and kernel always collects both host and guest statistics. So regular 'perf stat' is ok. 5) Fix a couple of perf bugs. 6) We still have no support on command with parameter 'any' as current KVM just uses process id to identify specific guest os instance. Users could uses parameter -p to collect specific guest os instance statistics. 
ChangeLog V2: 1) Based on Avi's suggestion, I moved callback functions to generic code area. So the kernel part of the patch is clearer. 2) Add 'perf kvm stat'. From: Zhang, Yanmin yanmin_zh...@linux.intel.com Based on the discussion in KVM community, I worked out the patch to support perf to collect guest os statistics from host side. This patch is implemented with Ingo, Peter and some other guys' kind help. Yang Sheng pointed out a critical bug and provided good suggestions with other guys. I really appreciate their kind help. The patch adds new sub command kvm to perf. perf kvm top perf kvm record perf kvm report perf kvm diff perf kvm buildid-list The new perf could profile guest os kernel except guest os user space, but it could summarize guest os user space utilization per guest os. Below are some examples. 1) perf kvm top [r...@lkp-ne01 norm]# perf kvm --host --guest --guestkallsyms=/home/ymzhang/guest/kallsyms --guestmodules=/home/ymzhang/guest/modules top --- PerfTop: 16024 irqs/sec kernel: 2.6% us: 0.6% guest kernel:76.2% guest us:20.6% exact: 0.0% [1000Hz cycles], (all, 16 CPUs) --- samples pcnt function DSO ___ _ ___ 3740.00 8.0% __ticket_spin_lock [guest.kernel.kallsyms] 2056.00 4.4% copy_user_generic_string [guest.kernel.kallsyms] 1412.00 3.0% resource_string [guest.kernel.kallsyms] 595.00 1.3% __switch_to [guest.kernel.kallsyms] 586.00 1.2% __d_lookup [guest.kernel.kallsyms] 574.00 1.2% tcp_sendmsg [guest.kernel.kallsyms] 565.00 1.2% kmem_cache_alloc [guest.kernel.kallsyms] 532.00 1.1% tcp_ack [guest.kernel.kallsyms] 494.00 1.1% __kmalloc[guest.kernel.kallsyms] 468.00 1.0% print_cfs_rq [guest.kernel.kallsyms] 437.00 0.9% link_path_walk [guest.kernel.kallsyms] 380.00 0.8% balance_runtime [guest.kernel.kallsyms] 379.00 0.8% kmem_cache_free [guest.kernel.kallsyms] 377.00 0.8% in_gate_area_no_task [guest.kernel.kallsyms] 374.00 0.8% get_page_from_freelist [guest.kernel.kallsyms] 372.00 0.8% mark_files_ro[guest.kernel.kallsyms
[PATCH V5 1/3] perf kvm: Enhance perf to collect KVM guest os statistics from host side
Below patch introduces perf_guest_info_callbacks and related register/unregister functions. Add more PERF_RECORD_MISC_XXX bits meaning guest kernel and guest user space. Signed-off-by: Zhang Yanmin yanmin_zh...@linux.intel.com --- diff -Nraup --exclude-from=exclude.diff linux-2.6_tip0417/arch/x86/include/asm/perf_event.h linux-2.6_tip0417_perfkvm/arch/x86/include/asm/perf_event.h --- linux-2.6_tip0417/arch/x86/include/asm/perf_event.h 2010-04-19 09:51:47.557797121 +0800 +++ linux-2.6_tip0417_perfkvm/arch/x86/include/asm/perf_event.h 2010-04-19 09:53:59.689452915 +0800 @@ -135,17 +135,10 @@ extern void perf_events_lapic_init(void) */ #define PERF_EFLAGS_EXACT (1UL 3) -#define perf_misc_flags(regs) \ -({ int misc = 0; \ - if (user_mode(regs))\ - misc |= PERF_RECORD_MISC_USER; \ - else\ - misc |= PERF_RECORD_MISC_KERNEL;\ - if (regs-flags PERF_EFLAGS_EXACT)\ - misc |= PERF_RECORD_MISC_EXACT; \ - misc; }) - -#define perf_instruction_pointer(regs) ((regs)-ip) +struct pt_regs; +extern unsigned long perf_instruction_pointer(struct pt_regs *regs); +extern unsigned long perf_misc_flags(struct pt_regs *regs); +#define perf_misc_flags(regs) perf_misc_flags(regs) #else static inline void init_hw_perf_events(void) { } diff -Nraup --exclude-from=exclude.diff linux-2.6_tip0417/arch/x86/kernel/cpu/perf_event.c linux-2.6_tip0417_perfkvm/arch/x86/kernel/cpu/perf_event.c --- linux-2.6_tip0417/arch/x86/kernel/cpu/perf_event.c 2010-04-19 09:51:48.347655964 +0800 +++ linux-2.6_tip0417_perfkvm/arch/x86/kernel/cpu/perf_event.c 2010-04-19 09:53:59.689452915 +0800 @@ -1720,6 +1720,11 @@ struct perf_callchain_entry *perf_callch { struct perf_callchain_entry *entry; + if (perf_guest_cbs perf_guest_cbs-is_in_guest()) { + /* TODO: We don't support guest os callchain now */ + return NULL; + } + if (in_nmi()) entry = __get_cpu_var(pmc_nmi_entry); else @@ -1743,3 +1748,30 @@ void perf_arch_fetch_caller_regs(struct regs-cs = __KERNEL_CS; local_save_flags(regs-flags); } + +unsigned long 
perf_instruction_pointer(struct pt_regs *regs) +{ + unsigned long ip; + if (perf_guest_cbs && perf_guest_cbs->is_in_guest()) + ip = perf_guest_cbs->get_guest_ip(); + else + ip = instruction_pointer(regs); + return ip; +} + +unsigned long perf_misc_flags(struct pt_regs *regs) +{ + int misc = 0; + if (perf_guest_cbs && perf_guest_cbs->is_in_guest()) { + misc |= perf_guest_cbs->is_user_mode() ? + PERF_RECORD_MISC_GUEST_USER : + PERF_RECORD_MISC_GUEST_KERNEL; + } else + misc |= user_mode(regs) ? PERF_RECORD_MISC_USER : + PERF_RECORD_MISC_KERNEL; + if (regs->flags & PERF_EFLAGS_EXACT) + misc |= PERF_RECORD_MISC_EXACT; + + return misc; +} + diff -Nraup --exclude-from=exclude.diff linux-2.6_tip0417/include/linux/perf_event.h linux-2.6_tip0417_perfkvm/include/linux/perf_event.h --- linux-2.6_tip0417/include/linux/perf_event.h2010-04-19 09:51:59.544791000 +0800 +++ linux-2.6_tip0417_perfkvm/include/linux/perf_event.h2010-04-19 09:53:59.691378953 +0800 @@ -288,11 +288,13 @@ struct perf_event_mmap_page { __u64 data_tail; /* user-space written tail */ }; -#define PERF_RECORD_MISC_CPUMODE_MASK (3 << 0) +#define PERF_RECORD_MISC_CPUMODE_MASK (7 << 0) #define PERF_RECORD_MISC_CPUMODE_UNKNOWN (0 << 0) #define PERF_RECORD_MISC_KERNEL (1 << 0) #define PERF_RECORD_MISC_USER (2 << 0) #define PERF_RECORD_MISC_HYPERVISOR (3 << 0) +#define PERF_RECORD_MISC_GUEST_KERNEL (4 << 0) +#define PERF_RECORD_MISC_GUEST_USER (5 << 0) #define PERF_RECORD_MISC_EXACT (1 << 14) /* @@ -446,6 +448,12 @@ enum perf_callchain_context { # include <asm/perf_event.h> #endif +struct perf_guest_info_callbacks { + int (*is_in_guest) (void); + int (*is_user_mode) (void); + unsigned long (*get_guest_ip) (void); +}; + #ifdef CONFIG_HAVE_HW_BREAKPOINT #include <asm/hw_breakpoint.h> #endif @@ -932,6 +940,12 @@ static inline void perf_event_mmap(struc __perf_event_mmap(vma); } +extern struct perf_guest_info_callbacks *perf_guest_cbs; +extern int perf_register_guest_info_callbacks( + struct perf_guest_info_callbacks *); +extern int 
perf_unregister_guest_info_callbacks( + struct perf_guest_info_callbacks *); + extern void perf_event_comm(struct task_struct *tsk); extern void perf_event_fork(struct task_struct *tsk); @@ -1001,6 +1015,11
[PATCH V5 2/3] perf kvm: Enhance perf to collect KVM guest os statistics from host side
Below patch implements the perf_guest_info_callbacks on kvm. Signed-off-by: Zhang Yanmin yanmin_zh...@linux.intel.com --- diff -Nraup linux-2.6_tip0417/arch/x86/kvm/vmx.c linux-2.6_tip0417_perfkvm/arch/x86/kvm/vmx.c --- linux-2.6_tip0417/arch/x86/kvm/vmx.c2010-04-19 09:51:47.908673911 +0800 +++ linux-2.6_tip0417_perfkvm/arch/x86/kvm/vmx.c2010-04-19 09:53:59.690399987 +0800 @@ -3654,8 +3654,11 @@ static void vmx_complete_interrupts(stru /* We need to handle NMIs before interrupts are enabled */ if ((exit_intr_info INTR_INFO_INTR_TYPE_MASK) == INTR_TYPE_NMI_INTR - (exit_intr_info INTR_INFO_VALID_MASK)) + (exit_intr_info INTR_INFO_VALID_MASK)) { + kvm_before_handle_nmi(vmx-vcpu); asm(int $2); + kvm_after_handle_nmi(vmx-vcpu); + } idtv_info_valid = idt_vectoring_info VECTORING_INFO_VALID_MASK; diff -Nraup linux-2.6_tip0417/arch/x86/kvm/x86.c linux-2.6_tip0417_perfkvm/arch/x86/kvm/x86.c --- linux-2.6_tip0417/arch/x86/kvm/x86.c2010-04-19 09:51:47.892676413 +0800 +++ linux-2.6_tip0417_perfkvm/arch/x86/kvm/x86.c2010-04-19 09:53:59.691378953 +0800 @@ -40,6 +40,7 @@ #include linux/user-return-notifier.h #include linux/srcu.h #include linux/slab.h +#include linux/perf_event.h #include trace/events/kvm.h #undef TRACE_INCLUDE_FILE #define CREATE_TRACE_POINTS @@ -3765,6 +3766,47 @@ static void kvm_timer_init(void) } } +static DEFINE_PER_CPU(struct kvm_vcpu *, current_vcpu); + +static int kvm_is_in_guest(void) +{ + return percpu_read(current_vcpu) != NULL; +} + +static int kvm_is_user_mode(void) +{ + int user_mode = 3; + if (percpu_read(current_vcpu)) + user_mode = kvm_x86_ops-get_cpl(percpu_read(current_vcpu)); + return user_mode != 0; +} + +static unsigned long kvm_get_guest_ip(void) +{ + unsigned long ip = 0; + if (percpu_read(current_vcpu)) + ip = kvm_rip_read(percpu_read(current_vcpu)); + return ip; +} + +static struct perf_guest_info_callbacks kvm_guest_cbs = { + .is_in_guest= kvm_is_in_guest, + .is_user_mode = kvm_is_user_mode, + .get_guest_ip = kvm_get_guest_ip, +}; + 
+void kvm_before_handle_nmi(struct kvm_vcpu *vcpu) +{ + percpu_write(current_vcpu, vcpu); +} +EXPORT_SYMBOL_GPL(kvm_before_handle_nmi); + +void kvm_after_handle_nmi(struct kvm_vcpu *vcpu) +{ + percpu_write(current_vcpu, NULL); +} +EXPORT_SYMBOL_GPL(kvm_after_handle_nmi); + int kvm_arch_init(void *opaque) { int r; @@ -3801,6 +3843,8 @@ int kvm_arch_init(void *opaque) kvm_timer_init(); + perf_register_guest_info_callbacks(&kvm_guest_cbs); + return 0; out: @@ -3809,6 +3853,8 @@ out: void kvm_arch_exit(void) { + perf_unregister_guest_info_callbacks(&kvm_guest_cbs); + if (!boot_cpu_has(X86_FEATURE_CONSTANT_TSC)) cpufreq_unregister_notifier(&kvmclock_cpufreq_notifier_block, CPUFREQ_TRANSITION_NOTIFIER); diff -Nraup linux-2.6_tip0417/arch/x86/kvm/x86.h linux-2.6_tip0417_perfkvm/arch/x86/kvm/x86.h --- linux-2.6_tip0417/arch/x86/kvm/x86.h2010-04-19 09:51:47.884709050 +0800 +++ linux-2.6_tip0417_perfkvm/arch/x86/kvm/x86.h2010-04-19 09:53:59.691378953 +0800 @@ -65,4 +65,7 @@ static inline int is_paging(struct kvm_v return kvm_read_cr0_bits(vcpu, X86_CR0_PG); } +void kvm_before_handle_nmi(struct kvm_vcpu *vcpu); +void kvm_after_handle_nmi(struct kvm_vcpu *vcpu); + #endif
[PATCH V4 0/2] perf kvm: Enhance perf to collect KVM guest os statistics from host side
Here is the new patch of V4 against tip/master of April 13th if anyone wants to try it. ChangeLog V4: 1) Based on Ingo's comments, I added help information around kvm such like command-list.txt and perf-kvm.txt. 2) Added guest process id at the tail of kernel dso long name, so the display could show different label with different guest os. 3) Based on Avi's comments, erase the racy window which might trigger an NMI while the NMI isn't in guest os. 4) Fixed all the errors and warnings reported by scripts/checkpatch.pl. 5) Fixed a compilation error pointed by Yang Sheng. ChangeLog V3: 1) Add --guestmount=/dir/to/all/guestos parameter. Admin mounts guest os root directories under /dir/to/all/guestos by sshfs. For example, I start 2 guest os. The one's pid is and the other's is . #mkdir ~/guestmount; cd ~/guestmount #sshfs -o allow_other,direct_io -p 5551 localhost:/ / #sshfs -o allow_other,direct_io -p 5552 localhost:/ / #perf kvm --host --guest --guestmount=~/guestmount top The old --guestkallsyms and --guestmodules are still supported as default guest os symbol parsing. 2) Add guest os buildid support. 3) Add sub command 'perf kvm buildid-list'. 4) Delete sub command 'perf kvm stat', because our current implementation doesn't transfer guest/host requirement to kernel, and kernel always collects both host and guest statistics. So regular 'perf stat' is ok. 5) Fix a couple of perf bugs. 6) We still have no support on command with parameter 'any' as current KVM just uses process id to identify specific guest os instance. Users could uses parameter -p to collect specific guest os instance statistics. ChangeLog V2: 1) Based on Avi's suggestion, I moved callback functions to generic code area. So the kernel part of the patch is clearer. 2) Add 'perf kvm stat'. From: Zhang, Yanmin yanmin_zh...@linux.intel.com Based on the discussion in KVM community, I worked out the patch to support perf to collect guest os statistics from host side. 
This patch is implemented with Ingo, Peter and some other guys' kind help. Yang Sheng pointed out a critical bug and provided good suggestions with other guys. I really appreciate their kind help. The patch adds new sub command kvm to perf. perf kvm top perf kvm record perf kvm report perf kvm diff perf kvm buildid-list The new perf could profile guest os kernel except guest os user space, but it could summarize guest os user space utilization per guest os. Below are some examples. 1) perf kvm top [r...@lkp-ne01 norm]# perf kvm --host --guest --guestkallsyms=/home/ymzhang/guest/kallsyms --guestmodules=/home/ymzhang/guest/modules top --- PerfTop: 16024 irqs/sec kernel: 2.6% us: 0.6% guest kernel:76.2% guest us:20.6% exact: 0.0% [1000Hz cycles], (all, 16 CPUs) --- samples pcnt function DSO ___ _ ___ 3740.00 8.0% __ticket_spin_lock [guest.kernel.kallsyms] 2056.00 4.4% copy_user_generic_string [guest.kernel.kallsyms] 1412.00 3.0% resource_string [guest.kernel.kallsyms] 595.00 1.3% __switch_to [guest.kernel.kallsyms] 586.00 1.2% __d_lookup [guest.kernel.kallsyms] 574.00 1.2% tcp_sendmsg [guest.kernel.kallsyms] 565.00 1.2% kmem_cache_alloc [guest.kernel.kallsyms] 532.00 1.1% tcp_ack [guest.kernel.kallsyms] 494.00 1.1% __kmalloc[guest.kernel.kallsyms] 468.00 1.0% print_cfs_rq [guest.kernel.kallsyms] 437.00 0.9% link_path_walk [guest.kernel.kallsyms] 380.00 0.8% balance_runtime [guest.kernel.kallsyms] 379.00 0.8% kmem_cache_free [guest.kernel.kallsyms] 377.00 0.8% in_gate_area_no_task [guest.kernel.kallsyms] 374.00 0.8% get_page_from_freelist [guest.kernel.kallsyms] 372.00 0.8% mark_files_ro[guest.kernel.kallsyms] 368.00 0.8% _atomic_dec_and_lock [guest.kernel.kallsyms] 356.00 0.8% crc16[crc16] 353.00 0.8% put_page [guest.kernel.kallsyms] If you want to just show host data, pls. don't use parameter --guest. The headline includes guest os kernel and userspace percentage. 2) perf
[PATCH V4 1/2] perf kvm: Enhance perf to collect KVM guest os statistics from host side
Below is the kernel patch to enable perf to collect guest os statistics.

Joerg, would you like to add support on svm? I don't know the exact point to trigger an NMI to the host with svm. See the code below for vmx:

+	kvm_before_handle_nmi(&vmx->vcpu);
	asm("int $2");
+	kvm_after_handle_nmi(&vmx->vcpu);

Signed-off-by: Zhang Yanmin yanmin_zh...@linux.intel.com

---

diff -Nraup --exclude=tools linux-2.6_tip0413/arch/x86/include/asm/perf_event.h linux-2.6_tip0413_perfkvm/arch/x86/include/asm/perf_event.h
--- linux-2.6_tip0413/arch/x86/include/asm/perf_event.h	2010-04-14 11:11:03.992966568 +0800
+++ linux-2.6_tip0413_perfkvm/arch/x86/include/asm/perf_event.h	2010-04-14 11:13:17.261881591 +0800
@@ -135,17 +135,10 @@ extern void perf_events_lapic_init(void)
  */
 #define PERF_EFLAGS_EXACT	(1UL << 3)

-#define perf_misc_flags(regs)	\
-({	int misc = 0;		\
-	if (user_mode(regs))	\
-		misc |= PERF_RECORD_MISC_USER;	\
-	else			\
-		misc |= PERF_RECORD_MISC_KERNEL;	\
-	if (regs->flags & PERF_EFLAGS_EXACT)	\
-		misc |= PERF_RECORD_MISC_EXACT;	\
-	misc; })
-
-#define perf_instruction_pointer(regs)	((regs)->ip)
+struct pt_regs;
+extern unsigned long perf_instruction_pointer(struct pt_regs *regs);
+extern unsigned long perf_misc_flags(struct pt_regs *regs);
+#define perf_misc_flags(regs)	perf_misc_flags(regs)

 #else
 static inline void init_hw_perf_events(void)		{ }
diff -Nraup --exclude=tools linux-2.6_tip0413/arch/x86/kernel/cpu/perf_event.c linux-2.6_tip0413_perfkvm/arch/x86/kernel/cpu/perf_event.c
--- linux-2.6_tip0413/arch/x86/kernel/cpu/perf_event.c	2010-04-14 11:11:04.825028810 +0800
+++ linux-2.6_tip0413_perfkvm/arch/x86/kernel/cpu/perf_event.c	2010-04-14 17:02:12.198063684 +0800
@@ -1720,6 +1720,11 @@ struct perf_callchain_entry *perf_callch
 {
 	struct perf_callchain_entry *entry;

+	if (perf_guest_cbs && perf_guest_cbs->is_in_guest()) {
+		/* TODO: We don't support guest os callchain now */
+		return NULL;
+	}
+
 	if (in_nmi())
 		entry = __get_cpu_var(pmc_nmi_entry);
 	else
@@ -1743,3 +1748,30 @@ void perf_arch_fetch_caller_regs(struct
 	regs->cs = __KERNEL_CS;
 	local_save_flags(&regs->flags);
 }
+
+unsigned long perf_instruction_pointer(struct pt_regs *regs)
+{
+	unsigned long ip;
+
+	if (perf_guest_cbs && perf_guest_cbs->is_in_guest())
+		ip = perf_guest_cbs->get_guest_ip();
+	else
+		ip = instruction_pointer(regs);
+
+	return ip;
+}
+
+unsigned long perf_misc_flags(struct pt_regs *regs)
+{
+	int misc = 0;
+
+	if (perf_guest_cbs && perf_guest_cbs->is_in_guest()) {
+		misc |= perf_guest_cbs->is_user_mode() ?
+			PERF_RECORD_MISC_GUEST_USER :
+			PERF_RECORD_MISC_GUEST_KERNEL;
+	} else
+		misc |= user_mode(regs) ? PERF_RECORD_MISC_USER :
+			PERF_RECORD_MISC_KERNEL;
+
+	if (regs->flags & PERF_EFLAGS_EXACT)
+		misc |= PERF_RECORD_MISC_EXACT;
+
+	return misc;
+}
+
diff -Nraup --exclude=tools linux-2.6_tip0413/arch/x86/kvm/vmx.c linux-2.6_tip0413_perfkvm/arch/x86/kvm/vmx.c
--- linux-2.6_tip0413/arch/x86/kvm/vmx.c	2010-04-14 11:11:04.353024541 +0800
+++ linux-2.6_tip0413_perfkvm/arch/x86/kvm/vmx.c	2010-04-15 10:28:39.516891050 +0800
@@ -3654,8 +3654,11 @@ static void vmx_complete_interrupts(stru
 	/* We need to handle NMIs before interrupts are enabled */
 	if ((exit_intr_info & INTR_INFO_INTR_TYPE_MASK) == INTR_TYPE_NMI_INTR &&
-	    (exit_intr_info & INTR_INFO_VALID_MASK))
+	    (exit_intr_info & INTR_INFO_VALID_MASK)) {
+		kvm_before_handle_nmi(&vmx->vcpu);
 		asm("int $2");
+		kvm_after_handle_nmi(&vmx->vcpu);
+	}

 	idtv_info_valid = idt_vectoring_info & VECTORING_INFO_VALID_MASK;
diff -Nraup --exclude=tools linux-2.6_tip0413/arch/x86/kvm/x86.c linux-2.6_tip0413_perfkvm/arch/x86/kvm/x86.c
--- linux-2.6_tip0413/arch/x86/kvm/x86.c	2010-04-14 11:11:04.341042024 +0800
+++ linux-2.6_tip0413_perfkvm/arch/x86/kvm/x86.c	2010-04-15 17:16:41.340064784 +0800
@@ -40,6 +40,7 @@
 #include <linux/user-return-notifier.h>
 #include <linux/srcu.h>
 #include <linux/slab.h>
+#include <linux/perf_event.h>
 #include <trace/events/kvm.h>
 #undef TRACE_INCLUDE_FILE
 #define CREATE_TRACE_POINTS
@@ -3765,6 +3766,47 @@ static void kvm_timer_init(void)
 	}
 }

+static DEFINE_PER_CPU(struct kvm_vcpu *, current_vcpu);
+
+static int kvm_is_in_guest(void)
+{
+	return percpu_read(current_vcpu) != NULL;
+}
+
+static int kvm_is_user_mode(void)
+{
+	int user_mode = 3;
+
+	if (percpu_read(current_vcpu
Re: [PATCH V3] perf kvm: Enhance perf to collect KVM guest os statistics from host side
On Thu, 2010-04-15 at 11:05 +0300, Avi Kivity wrote:
On 04/15/2030 04:04 AM, Zhang, Yanmin wrote:

An even more accurate way to determine this is to check whether the interrupt frame points back at the 'int $2' instruction. However, we plan to switch to a self-IPI method to inject the NMI, and I'm not sure whether APIC NMIs are accepted on an instruction boundary or whether there's some latency involved.

Yes. But the frame pointer checking seems a little complicated.

An even bigger disadvantage is that it won't work with Sheng's patch: self-NMIs are not synchronous.

 	trace_kvm_entry(vcpu->vcpu_id);
+
+	percpu_write(current_vcpu, vcpu);
 	kvm_x86_ops->run(vcpu);
+	percpu_write(current_vcpu, NULL);

If you move this around the 'int $2' instructions you will close the race, as a stray NMI won't catch us updating the rip cache. But that depends on whether self-IPI is accepted on the next instruction or not.

Right. The kernel part depends on the self-IPI implementation. I will move the above percpu_write(current_vcpu, vcpu) (or a new wrapper function) to just around 'int $2'.

Or create a new function to inject the interrupt in x86.c. That will reduce duplication between svm.c and vmx.c.

I checked svm.c and it seems svm.c doesn't trigger an NMI to the host if the NMI happens in the guest os. In addition, svm_complete_interrupts is called after interrupts are enabled.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH V3] perf kvm: Enhance perf to collect KVM guest os statistics from host side
On Wed, 2010-04-14 at 12:20 +0300, Avi Kivity wrote:
On 04/14/2030 12:05 PM, Zhang, Yanmin wrote:

Here is the new patch of V3 against tip/master of April 13th, if anyone wants to try it.

Thanks for persisting despite the flames. Can you please separate the arch/x86/kvm part of the patch? That will make for easier reviewing, and it will need to go through separate trees.

I definitely should, and will do so in the next version, which also fixes some issues pointed out by Ingo.

Sheng, did you make any progress with the NMI injection issue?

+diff -Nraup linux-2.6_tip0413/arch/x86/kvm/x86.c linux-2.6_tip0413_perfkvm/arch/x86/kvm/x86.c
--- linux-2.6_tip0413/arch/x86/kvm/x86.c	2010-04-14 11:11:04.341042024 +0800
+++ linux-2.6_tip0413_perfkvm/arch/x86/kvm/x86.c	2010-04-14 11:32:45.841278890 +0800
@@ -3765,6 +3765,35 @@ static void kvm_timer_init(void)
 	}
 }

+static DEFINE_PER_CPU(struct kvm_vcpu *, current_vcpu);
+
+static int kvm_is_in_guest(void)
+{
+	return percpu_read(current_vcpu) != NULL;

An even more accurate way to determine this is to check whether the interrupt frame points back at the 'int $2' instruction. However, we plan to switch to a self-IPI method to inject the NMI, and I'm not sure whether APIC NMIs are accepted on an instruction boundary or whether there's some latency involved.

Yes. But the frame pointer checking seems a little complicated.

+static unsigned long kvm_get_guest_ip(void)
+{
+	unsigned long ip = 0;
+
+	if (percpu_read(current_vcpu))
+		ip = kvm_rip_read(percpu_read(current_vcpu));
+
+	return ip;
+}

This may be racy. kvm_rip_read() accesses a cache in memory; if we're in the process of updating the cache, then we may read a stale value. See below.

Right. The racy window seems too big.

 	trace_kvm_entry(vcpu->vcpu_id);
+
+	percpu_write(current_vcpu, vcpu);
 	kvm_x86_ops->run(vcpu);
+	percpu_write(current_vcpu, NULL);

If you move this around the 'int $2' instructions you will close the race, as a stray NMI won't catch us updating the rip cache.
But that depends on whether self-IPI is accepted on the next instruction or not.

Right. The kernel part depends on the self-IPI implementation. I will move the above percpu_write(current_vcpu, vcpu) (or a new wrapper function) to just around 'int $2'. Sheng will work out a solution for the self-IPI delivery.

Let's treat my patch and self-IPI as 2 separate issues, as we don't know when the self-IPI delivery would be resolved.

Thanks,
Yanmin
Re: [PATCH 3/3] perf events: Change perf parameter --pid to process-wide collection instead of thread-wide
On Thu, 2010-03-25 at 16:02 +0800, Li Zefan wrote:
Zhang, Yanmin wrote:
From: Zhang, Yanmin yanmin_zh...@linux.intel.com

Parameter --pid (or -p) of perf currently means a thread-wide collection. For example, if a process has 10 threads, 'perf top -p <pid>' collects only the main thread's statistics. That's misleading. Users are used to attaching to a whole process when debugging with gdb. To follow normal usage style, the patch changes --pid to mean a process-wide collection and adds --tid (-t) to mean a thread-wide collection. Usage examples:

#perf top -p <pid>
#perf record -p <pid> -f sleep 10
#perf stat -p <pid> -f sleep 10

The above commands collect the statistics of all threads of the process.

Signed-off-by: Zhang Yanmin yanmin_zh...@linux.intel.com

Seems this patch causes seg faults:

# ./perf sched record
Segmentation fault
# ./perf kmem record
Segmentation fault
# ./perf timechart record
Segmentation fault

Thanks for reporting it. Arnaldo, could you pick up the patch below? Zefan, could you try it? mmap_array[][][] is not reset to 0 after malloc. The patch below, against tip/master of March 24th, fixes it with a zalloc.
Reported-by: Li Zefan l...@cn.fujitsu.com
Signed-off-by: Zhang Yanmin yanmin_zh...@linux.intel.com

---

diff -Nraup linux-2.6_tip0324/tools/perf/builtin-record.c linux-2.6_tip0324_perfkvm/tools/perf/builtin-record.c
--- linux-2.6_tip0324/tools/perf/builtin-record.c	2010-03-25 10:58:13.308912201 +0800
+++ linux-2.6_tip0324_perfkvm/tools/perf/builtin-record.c	2010-03-25 16:14:18.201475298 +0800
@@ -751,7 +751,7 @@ int cmd_record(int argc, const char **ar
 	for (i = 0; i < MAX_NR_CPUS; i++) {
 		for (j = 0; j < MAX_COUNTERS; j++) {
 			fd[i][j] = malloc(sizeof(int)*thread_num);
-			mmap_array[i][j] = malloc(
+			mmap_array[i][j] = zalloc(
 				sizeof(struct mmap_data)*thread_num);
 			if (!fd[i][j] || !mmap_array[i][j])
 				return -ENOMEM;
diff -Nraup linux-2.6_tip0324/tools/perf/builtin-top.c linux-2.6_tip0324_perfkvm/tools/perf/builtin-top.c
--- linux-2.6_tip0324/tools/perf/builtin-top.c	2010-03-25 10:58:13.284848937 +0800
+++ linux-2.6_tip0324_perfkvm/tools/perf/builtin-top.c	2010-03-25 16:14:56.875266645 +0800
@@ -1371,7 +1371,7 @@ int cmd_top(int argc, const char **argv,
 	for (i = 0; i < MAX_NR_CPUS; i++) {
 		for (j = 0; j < MAX_COUNTERS; j++) {
 			fd[i][j] = malloc(sizeof(int)*thread_num);
-			mmap_array[i][j] = malloc(
+			mmap_array[i][j] = zalloc(
 				sizeof(struct mmap_data)*thread_num);
 			if (!fd[i][j] || !mmap_array[i][j])
 				return -ENOMEM;
Re: [RFC] Unify KVM kernel-space and user-space code into a single project
On Wed, 2010-03-24 at 20:20 +0200, Avi Kivity wrote:
On 03/24/2010 07:47 PM, Arnaldo Carvalho de Melo wrote:
Em Wed, Mar 24, 2010 at 06:09:30PM +0200, Avi Kivity escreveu:

Doesn't perf already have a dependency on naming conventions for finding debug information?

It looks at several places, from most symbol rich (/usr/lib/debug/, aka -debuginfo packages, where we have full symtabs) to poorest (the packaged binary, where we may just have a .dynsym).

In an ideal world, it would just get the build-id (a SHA1 cookie that is in an ELF section inserted in every binary (aka DSOs), kernel module, kallsyms or vmlinux file) and use that to look first in a local cache (implemented in perf for a long time already) or in some symbol server. For instance, for a random perf.data file I collected here on my machine I have:

[a...@doppio linux-2.6-tip]$ perf buildid-list | grep libpthread
5c68f7afeb33309c78037e374b0deee84dd441f6 /lib64/libpthread-2.10.2.so
[a...@doppio linux-2.6-tip]$

So I don't have to access /lib64/libpthread-2.10.2.so directly, nor some convention to get a debuginfo in a local file like /usr/lib/debug/lib64/libpthread-2.10.2.so.debug. Instead the tools look at:

[a...@doppio linux-2.6-tip]$ l ~/.debug/.build-id/5c/68f7afeb33309c78037e374b0deee84dd441f6
lrwxrwxrwx 1 acme acme 73 2010-01-06 18:53 /home/acme/.debug/.build-id/5c/68f7afeb33309c78037e374b0deee84dd441f6 -> ../../lib64/libpthread-2.10.2.so/5c68f7afeb33309c78037e374b0deee84dd441f6*

to find the file for that specific build-id, not the one installed on my machine (or on a different machine, of a different architecture) that may be completely unrelated: a newer one, or one for a different arch.

Thanks. I believe qemu could easily act as a symbol server for this use case.

I spent a couple of days investigating why sshfs/fuse doesn't work well with procfs and sysfs. Just as my patch against fuse was almost ready, I found that fuse already supports such access via direct I/O.
With the parameter -o direct_io, it works well. Here is an example of mounting / from a guest os:

#sshfs -p 5551 -o direct_io localhost:/ guestmount

We can read and write files if permissions allow. I will go ahead to support statistics parsing for multiple guest os instances.

Yanmin
Re: [PATCH] Enhance perf to collect KVM guest os statistics from host side
On Tue, 2010-03-23 at 10:15 -0300, Arnaldo Carvalho de Melo wrote: Em Tue, Mar 23, 2010 at 11:14:41AM +0800, Zhang, Yanmin escreveu: On Mon, 2010-03-22 at 13:44 -0300, Arnaldo Carvalho de Melo wrote: Em Mon, Mar 22, 2010 at 03:24:47PM +0800, Zhang, Yanmin escreveu: On Fri, 2010-03-19 at 09:21 +0100, Ingo Molnar wrote: Then, perf could access all files. It's possible because guest os instance happens to be multi-threading in a process. One of the defects is the accessing to guest os becomes slow or impossible when guest os is very busy. If the MMAP events on the guest included a cookie that could later be used to query for the symtab of that DSO, we wouldn't need to access the guest FS at all, right? It depends on specific sub commands. As for 'perf kvm top', developers want to see the profiling immediately. Even with 'perf kvm record', developers also want to That is not a problem, if you have the relevant buildids in your cache (Look in your machine at ~/.debug/), it will be as fast as ever. If you use a distro that has its userspace with build-ids, you probably use it always without noticing :-) see results quickly. At least I'm eager for the results when investigating a performance issue. Sure thing. With build-ids and debuginfo-install like tools the symbol resolution could be performed by using the cookies (build-ids) as keys to get to the *-debuginfo packages with matching symtabs (and DWARF for source annotation, etc). We can't make sure guest os uses the same os images, or don't know where we could find the original DVD images being used to install guest os. You don't have to have guest and host sharing the same OS image, you just have to somehow populate your buildid cache with what you need, be it using sshfs or what Ingo is suggesting once, or using what your vendor provides (debuginfo packages). And you just have to do it once, for the relevant apps, to have it in your buildid cache. 
Current perf does save build ids, including both the kernel's and other application libs/executables.

Yeah, I know, I implemented it. :-) We have that for the kernel as:

[a...@doppio linux-2.6-tip]$ l /sys/kernel/notes
-r--r--r-- 1 root root 36 2010-03-22 13:14 /sys/kernel/notes
[a...@doppio linux-2.6-tip]$ l /sys/module/ipv6/sections/.note.gnu.build-id
-r--r--r-- 1 root root 4096 2010-03-22 13:38 /sys/module/ipv6/sections/.note.gnu.build-id
[a...@doppio linux-2.6-tip]$

That way we would cover DSOs being reinstalled in long running 'perf record' sessions too.

That's one of the objectives of perf, to support long running sessions.

But it isn't fully supported right now. As I explained, build-ids are collected at the end of the record session, because we have to open the DSOs that had hits to get the 20-byte cookie we need, the build-id. If we had it in the PERF_RECORD_MMAP record, we would close this race, and the added cost at load time should be minimal: get the ELF section with it and put it somewhere in the task struct.

Well, you are improving upon perfection. If only we could coalesce it a bit to reclaim this:

[a...@doppio linux-2.6-tip]$ pahole -C task_struct ../build/v2.6.34-rc1-tip+/kernel/sched.o | tail -5
	/* size: 5968, cachelines: 94, members: 150 */
	/* sum members: 5943, holes: 7, sum holes: 25 */
	/* bit holes: 1, sum bit holes: 28 bits */
	/* last cacheline: 16 bytes */
};
[a...@doppio linux-2.6-tip]$

That reminds me I listened to your presentation at OLS 2007. :) 8-)

Or at least get just one of those 4-byte holes; then we could stick it at the end to get our build-id there. Accessing it would be done only at PERF_RECORD_MMAP injection time, i.e. close to the time when we actually are loading the executable mmap, i.e. close to the time when the loader is injecting the build-id, so I guess the extra memory and processing costs would be in the noise.

This was discussed some time ago but would require help from the bits that load DSOs.
build-ids then would be first class citizens.

- Arnaldo
Re: [RFC] Unify KVM kernel-space and user-space code into a single project
On Sun, 2010-03-21 at 22:20 +0100, Ingo Molnar wrote:
* Avi Kivity a...@redhat.com wrote:

Well, for what it's worth, I rarely ever use anything else. My virtual disks are raw so I can loop mount them easily, and I can also switch my guest kernels from outside... without ever needing to mount those disks.

Curious, what do you use them for?

btw, if you build your kernel outside the guest, then you already have access to all its symbols, without needing anything further.

There are two errors with your argument:

1) you are assuming that it's only about kernel symbols

Look at this 'perf report' output:

# Samples: 7127509216
#
# Overhead   Command      Shared Object  Symbol
# ........  ........  .................  ......
#
    19.14%       git  git                [.] lookup_object
    15.16%      perf  git                [.] lookup_object
     4.74%      perf  libz.so.1.2.3      [.] inflate
     4.52%       git  libz.so.1.2.3      [.] inflate
     4.21%      perf  libz.so.1.2.3      [.] inflate_table
     3.94%       git  libz.so.1.2.3      [.] inflate_table
     3.29%       git  git                [.] find_pack_entry_one
     3.24%       git  libz.so.1.2.3      [.] inflate_fast
     2.96%      perf  libz.so.1.2.3      [.] inflate_fast
     2.96%       git  git                [.] decode_tree_entry
     2.80%      perf  libc-2.11.90.so    [.] __strlen_sse42
     2.56%       git  libc-2.11.90.so    [.] __strlen_sse42
     1.98%      perf  libc-2.11.90.so    [.] __GI_memcpy
     1.71%      perf  git                [.] decode_tree_entry
     1.53%       git  libc-2.11.90.so    [.] __GI_memcpy
     1.48%       git  git                [.] lookup_blob
     1.30%       git  git                [.] process_tree
     1.30%      perf  git                [.] process_tree
     0.90%      perf  git                [.] tree_entry
     0.82%      perf  git                [.] lookup_blob
     0.78%       git  [kernel.kallsyms]  [k] kstat_irqs_cpu

kernel symbols are only a small portion of the symbols. (a single line in this case)

The above example shows perf could summarize both kernel and application hot functions. If we collect guest os statistics from the host side, we can't summarize detailed guest os application info, because we couldn't get a guest os application's process id from the host side. So we could only get detailed kernel info and the total utilization percentage of guest application processes.
To get to those other symbols we have to read the ELF symbols of those binaries in the guest filesystem, in the post-processing/reporting phase. This is both complex to do and relatively slow, so we don't want to (and cannot) do this at sample time from IRQ context or NMI context ... Also, many aspects of reporting are interactive, so it's done lazily or on-demand.

So we need ready access to the guest filesystem - for those guests which decide to integrate with the host for this.
Re: [PATCH] Enhance perf to collect KVM guest os statistics from host side
On Fri, 2010-03-19 at 09:21 +0100, Ingo Molnar wrote:

Nice progress! This bit:

1) perf kvm top
[r...@lkp-ne01 norm]# perf kvm --host --guest --guestkallsyms=/home/ymzhang/guest/kallsyms --guestmodules=/home/ymzhang/guest/modules top

will really be painful to developers - to enter that long line while we have these things called 'computers' that ought to reduce human work. Also, it's incomplete: we need access to the guest system's binaries to do ELF symbol resolution and dwarf decoding.

Yes, I agree with you and Avi that the enhancement needs to be user-friendly. One of my starting points is to keep the tool's dependencies on other components small. Admins/developers could write script wrappers quickly if perf has parameters to support the new capability.

So we really need some good, automatic way to get to the guest symbol space, so that if a developer types:

perf kvm top

then the obvious thing happens by default. (which is to show the guest overhead)

There's no technical barrier on the perf tooling side to implement all that: perf supports build-ids extensively and can deal with multiple symbol spaces - as long as it has access to them. The guest kernel could be ID-ed based on its /sys/kernel/notes and /sys/module/*/notes/.note.gnu.build-id build-ids.

I tried sshfs quickly. sshfs could mount the guest os root filesystem nicely, and I could access the files quickly. However, it doesn't work when I access /proc/ and /sys/, because sshfs/scp depend on file size, while the sizes of most files under /proc/ and /sys/ are 0.
(obviously only under allowed permissions so that such access is secure)

If sshfs could access /proc/ and /sys correctly, here is a design: --guestmount points to a directory which consists of a list of sub-directories. Every sub-directory's name is just the qemu process id of a guest os. Admin/developer mounts every guest os instance's root directory to the corresponding sub-directory. Then, perf could access all files. It's possible because a guest os instance happens to be multi-threading in a single process. One of the defects is that access to the guest os becomes slow or impossible when the guest os is very busy.

This would allow not just kallsyms access via $guest/proc/kallsyms but also gives us the full space of symbol features: access to the guest binaries for annotation and general symbol resolution, command/binary name identification, etc.

Such a mount would obviously not broaden existing privileges - and as an additional control a guest would also have a way to indicate that it does not wish a guest mount at all.

Unfortunately, in a previous thread the Qemu maintainer has indicated that he will essentially NAK any attempt to enhance Qemu to provide an easily discoverable, self-contained, transparent guest mount on the host side. No technical justification was given for that NAK, despite my repeated requests to particulate the exact security problems that such an approach would cause. If that NAK does not stand in that form then I'd like to know about it - it makes no sense for us to try to code up a solution against a standing maintainer NAK ...

The other option is some sysadmin level hackery to NFS-mount the guest or so. This is a vastly inferior method that brings us back to the abysmal usability levels of OProfile:

1) it won't be guest transparent
2) has to be re-done for every guest image
3) even if packaged it has to be gotten into every. single. Linux. distro. separately.
4) old Linux guests won't work out of the box

In other words: it's very inconvenient on multiple levels and won't ever happen on any reasonable enough scale to make a difference to Linux.

Which is an unfortunate situation - and the ball is on the KVM/Qemu side, so I can do little about it.

Thanks,
Ingo
Re: [PATCH] Enhance perf to collect KVM guest os statistics from host side
On Mon, 2010-03-22 at 13:44 -0300, Arnaldo Carvalho de Melo wrote:
Em Mon, Mar 22, 2010 at 03:24:47PM +0800, Zhang, Yanmin escreveu:
On Fri, 2010-03-19 at 09:21 +0100, Ingo Molnar wrote:

So some sort of --guestmount option would be the natural solution, which points to the guest system's root: and a Qemu enumeration of guest mounts (which would be off by default and configurable) from which perf can pick up the target guest all automatically. (obviously only under allowed permissions so that such access is secure)

If sshfs could access /proc/ and /sys correctly, here is a design: --guestmount points to a directory which consists of a list of sub-directories. Every sub-directory's name is just the qemu process id of a guest os. Admin/developer mounts every guest os instance's root directory to the corresponding sub-directory. Then, perf could access all files. It's possible because a guest os instance happens to be multi-threading in a single process. One of the defects is that access to the guest os becomes slow or impossible when the guest os is very busy.

If the MMAP events on the guest included a cookie that could later be used to query for the symtab of that DSO, we wouldn't need to access the guest FS at all, right?

It depends on the specific sub commands. As for 'perf kvm top', developers want to see the profiling immediately. Even with 'perf kvm record', developers also want to see results quickly. At least I'm eager for the results when investigating a performance issue.

With build-ids and debuginfo-install like tools the symbol resolution could be performed by using the cookies (build-ids) as keys to get to the *-debuginfo packages with matching symtabs (and DWARF for source annotation, etc).

We can't make sure the guest os uses the same os images, and we don't know where we could find the original DVD images used to install the guest os. Current perf does save build ids, including both the kernel's and other application libs/executables.
We have that for the kernel as:

[a...@doppio linux-2.6-tip]$ l /sys/kernel/notes
-r--r--r-- 1 root root 36 2010-03-22 13:14 /sys/kernel/notes
[a...@doppio linux-2.6-tip]$ l /sys/module/ipv6/sections/.note.gnu.build-id
-r--r--r-- 1 root root 4096 2010-03-22 13:38 /sys/module/ipv6/sections/.note.gnu.build-id
[a...@doppio linux-2.6-tip]$

That way we would cover DSOs being reinstalled in long running 'perf record' sessions too.

That's one of the objectives of perf, to support long running sessions.

This was discussed some time ago but would require help from the bits that load DSOs. build-ids then would be first class citizens.

- Arnaldo
Re: [PATCH] Enhance perf to collect KVM guest os statistics from host side
On Thu, 2010-03-18 at 10:45 +0800, Zhang, Yanmin wrote:
On Wed, 2010-03-17 at 17:26 +0800, Zhang, Yanmin wrote:
On Tue, 2010-03-16 at 10:47 +0100, Ingo Molnar wrote:
* Zhang, Yanmin yanmin_zh...@linux.intel.com wrote:
On Tue, 2010-03-16 at 15:48 +0800, Zhang, Yanmin wrote:
On Tue, 2010-03-16 at 07:41 +0200, Avi Kivity wrote:
On 03/16/2010 07:27 AM, Zhang, Yanmin wrote:
From: Zhang, Yanmin yanmin_zh...@linux.intel.com

Based on the discussion in the KVM community, I worked out the patch to support perf collecting guest os statistics from the host side. This patch is implemented with Ingo, Peter and some other guys' kind help. Yang Sheng pointed out a critical bug and provided good suggestions with other guys. I really appreciate their kind help.

The patch adds a new subcommand kvm to perf:

perf kvm top
perf kvm record
perf kvm report
perf kvm diff

The new perf could profile the guest os kernel, except guest os user space, but it could summarize guest os user space utilization per guest os. Below are some examples.

1) perf kvm top
[r...@lkp-ne01 norm]# perf kvm --host --guest --guestkallsyms=/home/ymzhang/guest/kallsyms --guestmodules=/home/ymzhang/guest/modules top

Thanks for your kind comments.

Excellent, support for guest kernel != host kernel is critical (I can't remember the last time I ran the same kernels). How would we support multiple guests with different kernels?

With the patch, 'perf kvm report --sort pid' could show summary statistics for all guest os instances. Then, use the parameter --pid of 'perf kvm record' to collect data for a single problematic instance.

Sorry. I found that currently --pid isn't per-process but per-thread (main thread).

Ingo, is it possible to support a new parameter or extend --inherit, so 'perf record' and 'perf top' could collect data on all threads of a process while the process is running? If not, I need to add a new ugly parameter, similar to --pid, to filter out process data in userspace.

Yeah.
For maximum utility I'd suggest extending --pid to include this, and introducing --tid for the previous, limited-to-a-single-task functionality. Most users would expect --pid to work like a 'late attach' - i.e. to work like strace -f or like a gdb attach.

Thanks Ingo, Avi. I worked out the patch below against tip/master of March 15th.

Subject: [PATCH] Change perf's parameter --pid to process-wide collection
From: Zhang, Yanmin yanmin_zh...@linux.intel.com

Change parameter -p (--pid) to mean a real process pid and add -t (--tid) meaning thread id. Now, --pid means perf collects the statistics of all threads of the process, while --tid means perf just collects the statistics of that thread.

BTW, the patch fixes a bug of 'perf stat -p'. 'perf stat' always configures attr->disabled = 1 if it isn't a system-wide collection. If there is a '-p' and no forks, 'perf stat -p' doesn't collect any data. In addition, the while(!done) in run_perf_stat consumes 100% of a single cpu, which has a bad impact on the running workload. I added a sleep(1) in the loop.

Signed-off-by: Zhang Yanmin yanmin_zh...@linux.intel.com

Ingo, sorry, the patch has bugs. I need to do a better job and will work out 2 separate patches against the 2 issues. I worked out 3 new patches against the tip/master tree of Mar. 17th.

1) Patch perf_stat: Fix the issue that perf doesn't enable counters when target_pid != -1. Change the condition for fork/exec of the subcommand: if there is a subcommand parameter, perf always forks/execs it. The usage example is:

#perf stat -a sleep 10

So this command could collect statistics for 10 seconds precisely. The user still could stop it with CTRL+C.

2) Patch perf_record: Fix the issue that when perf forks/execs a subcommand, it should enable all counters after the new process is execing. Change the condition for fork/exec of the subcommand: if there is a subcommand parameter, perf always forks/execs it. The usage example is:

#perf record -f -a sleep 10

So this command could collect statistics for 10 seconds precisely.
The user still could stop it with CTRL+C.

3) perf_pid: Change parameter --pid to process-wide collection. Add --tid, which means collecting thread-wide statistics. Usage examples:

#perf top -p <pid>
#perf record -p <pid> -f sleep 10
#perf stat -p <pid> -f sleep 10

Arnaldo, please apply the 3 attached patches.

Yanmin

diff -Nraup linux-2.6_tipmaster0317/tools/perf/builtin-stat.c linux-2.6_tipmaster0317_fixstat/tools/perf/builtin-stat.c
--- linux-2.6_tipmaster0317/tools/perf/builtin-stat.c	2010-03-18 09:04:40.938289813 +0800
+++ linux-2.6_tipmaster0317_fixstat/tools/perf/builtin-stat.c	2010-03-18 13:07
[PATCH 1/3] perf events: Enable counters when collecting process-wide or system-wide data by 'perf stat'
The perf tool has a couple of sub commands. There are a couple of issues around the counter enable time point. In addition, we want a precise time clock when collecting system-wide or process/thread-wide statistics. I worked out 3 patches against the tip/master tree of March 17 to fix such issues and make perf more user-friendly.

Subject: [PATCH 1/3] perf events: Enable counters when collecting process-wide or system-wide data by 'perf stat'
From: Zhang, Yanmin yanmin_zh...@linux.intel.com

Command 'perf stat' doesn't enable counters when collecting an existing (by -p) process or system-wide statistics. Fix the issue.

Change the condition for fork/exec of the subcommand: if there is a subcommand parameter, perf always forks/execs it. The usage example is:

#perf stat -a sleep 10

So this command could collect statistics for 10 seconds precisely. The user still could stop it with CTRL+C. Without the new capability, the user could only use CTRL+C to stop it, without a precise time clock.

Another issue is that 'perf stat -a' consumes 100% of a single logical cpu. That has a bad impact on the running workload. Fix it by adding a sleep(1) in the while(!done) loop in function run_perf_stat.
Signed-off-by: Zhang Yanmin yanmin_zh...@linux.intel.com

---

diff -Nraup linux-2.6_tipmaster0317/tools/perf/builtin-stat.c linux-2.6_tipmaster0317_fixstat/tools/perf/builtin-stat.c
--- linux-2.6_tipmaster0317/tools/perf/builtin-stat.c	2010-03-18 09:04:40.938289813 +0800
+++ linux-2.6_tipmaster0317_fixstat/tools/perf/builtin-stat.c	2010-03-18 13:07:26.773773541 +0800
@@ -159,8 +159,10 @@ static void create_perf_stat_counter(int
 	} else {
 		attr->inherit	= inherit;
-		attr->disabled = 1;
-		attr->enable_on_exec = 1;
+		if (target_pid == -1) {
+			attr->disabled = 1;
+			attr->enable_on_exec = 1;
+		}

 		fd[0][counter] = sys_perf_event_open(attr, pid, -1, -1, 0);
 		if (fd[0][counter] < 0 && verbose)
@@ -251,9 +253,9 @@ static int run_perf_stat(int argc __used
 	unsigned long long t0, t1;
 	int status = 0;
 	int counter;
-	int pid = target_pid;
+	int pid;
 	int child_ready_pipe[2], go_pipe[2];
-	const bool forks = (target_pid == -1 && argc > 0);
+	const bool forks = (argc > 0);
 	char buf;

 	if (!system_wide)
@@ -265,10 +267,10 @@ static int run_perf_stat(int argc __used
 	}

 	if (forks) {
-		if ((pid = fork()) < 0)
+		if ((child_pid = fork()) < 0)
 			perror("failed to fork");

-		if (!pid) {
+		if (!child_pid) {
 			close(child_ready_pipe[0]);
 			close(go_pipe[1]);
 			fcntl(go_pipe[0], F_SETFD, FD_CLOEXEC);
@@ -297,8 +299,6 @@ static int run_perf_stat(int argc __used
 			exit(-1);
 		}

-		child_pid = pid;
-
 		/*
 		 * Wait for the child to be ready to exec.
 		 */
@@ -309,6 +309,10 @@ static int run_perf_stat(int argc __used
 		close(child_ready_pipe[0]);
 	}

+	if (target_pid == -1)
+		pid = child_pid;
+	else
+		pid = target_pid;

 	for (counter = 0; counter < nr_counters; counter++)
 		create_perf_stat_counter(counter, pid);
@@ -321,7 +325,7 @@ static int run_perf_stat(int argc __used
 		close(go_pipe[1]);
 		wait(&status);
 	} else {
-		while(!done);
+		while(!done) sleep(1);
 	}

 	t1 = rdclock();
@@ -459,7 +463,7 @@ static volatile int signr = -1;

 static void skip_signal(int signo)
 {
-	if(target_pid != -1)
+	if(child_pid == -1)
 		done = 1;

 	signr = signo;
[PATCH 2/3] perf events: Enable counters only when kernel is execing the subcommand of 'perf record'
From: Zhang, Yanmin yanmin_zh...@linux.intel.com

'perf record' starts counters before the subcommand is execed, so the statistics are not precise: they include data from the preparation steps. This patch fixes that. In addition, change the condition for fork/exec of the subcommand: if a subcommand parameter is present, perf always fork/execs it. Usage example:

#perf record -f -a sleep 10

This command collects statistics for precisely 10 seconds; the user can still stop it with CTRL+C. Without the new capability, the user could only stop it with CTRL+C, without a precise time window.

Signed-off-by: Zhang Yanmin yanmin_zh...@linux.intel.com
---
diff -Nraup linux-2.6_tip0317/tools/perf/builtin-record.c linux-2.6_tip0317_fixrecord/tools/perf/builtin-record.c
--- linux-2.6_tip0317/tools/perf/builtin-record.c	2010-03-18 09:04:40.942263175 +0800
+++ linux-2.6_tip0317_fixrecord/tools/perf/builtin-record.c	2010-03-18 13:33:24.254359348 +0800
@@ -225,7 +225,7 @@ static struct perf_header_attr *get_head
 	return h_attr;
 }
 
-static void create_counter(int counter, int cpu, pid_t pid, bool forks)
+static void create_counter(int counter, int cpu, pid_t pid)
 {
 	char *filter = filters[counter];
 	struct perf_event_attr *attr = attrs + counter;
@@ -275,10 +275,10 @@ static void create_counter(int counter,
 	attr->mmap = track;
 	attr->comm = track;
 	attr->inherit = inherit;
-	attr->disabled = 1;
-
-	if (forks)
+	if (target_pid == -1 && !system_wide) {
+		attr->disabled = 1;
 		attr->enable_on_exec = 1;
+	}
 
 try_again:
 	fd[nr_cpu][counter] = sys_perf_event_open(attr, pid, cpu, group_fd, 0);
@@ -380,17 +380,15 @@ try_again:
 			exit(-1);
 		}
 	}
-
-	ioctl(fd[nr_cpu][counter], PERF_EVENT_IOC_ENABLE);
 }
 
-static void open_counters(int cpu, pid_t pid, bool forks)
+static void open_counters(int cpu, pid_t pid)
 {
 	int counter;
 
 	group_fd = -1;
 	for (counter = 0; counter < nr_counters; counter++)
-		create_counter(counter, cpu, pid, forks);
+		create_counter(counter, cpu, pid);
 
 	nr_cpu++;
}
@@ -425,7 +423,7 @@ static int __cmd_record(int argc, const
 	int err;
 	unsigned long waking = 0;
 	int child_ready_pipe[2], go_pipe[2];
-	const bool forks = target_pid == -1 && argc > 0;
+	const bool forks = argc > 0;
 	char buf;
 
 	page_size = sysconf(_SC_PAGE_SIZE);
@@ -496,13 +494,13 @@ static int __cmd_record(int argc, const
 	atexit(atexit_header);
 
 	if (forks) {
-		pid = fork();
+		child_pid = fork();
 		if (pid < 0) {
 			perror("failed to fork");
 			exit(-1);
 		}
 
-		if (!pid) {
+		if (!child_pid) {
 			close(child_ready_pipe[0]);
 			close(go_pipe[1]);
 			fcntl(go_pipe[0], F_SETFD, FD_CLOEXEC);
@@ -531,11 +529,6 @@ static int __cmd_record(int argc, const
 			exit(-1);
 		}
 
-		child_pid = pid;
-
-		if (!system_wide)
-			target_pid = pid;
-
 		close(child_ready_pipe[1]);
 		close(go_pipe[0]);
 		/*
@@ -548,13 +541,17 @@ static int __cmd_record(int argc, const
 		close(child_ready_pipe[0]);
 	}
 
+	if (forks && target_pid == -1 && !system_wide)
+		pid = child_pid;
+	else
+		pid = target_pid;
+
 	if ((!system_wide && !inherit) || profile_cpu != -1) {
-		open_counters(profile_cpu, target_pid, forks);
+		open_counters(profile_cpu, pid);
 	} else {
 		nr_cpus = read_cpu_map();
 		for (i = 0; i < nr_cpus; i++)
-			open_counters(cpumap[i], target_pid, forks);
+			open_counters(cpumap[i], pid);
 	}
 
 	if (file_new) {
[PATCH 3/3] perf events: Change perf parameter --pid to process-wide collection instead of thread-wide
From: Zhang, Yanmin yanmin_zh...@linux.intel.com

Parameter --pid (or -p) of perf currently means a thread-wide collection. For example, if a process has 10 threads, 'perf top -p' with its pid just collects the main thread's statistics. That's misleading: users are used to attaching to a whole process when debugging with gdb. To follow the usual convention, this patch changes --pid to mean process-wide collection and adds --tid (-t) for thread-wide collection. Usage examples:

#perf top -p
#perf record -p -f sleep 10
#perf stat -p -f sleep 10

The above commands collect the statistics of all threads of the target process.

Signed-off-by: Zhang Yanmin yanmin_zh...@linux.intel.com
---
diff -Nraup linux-2.6_tip0317_statrecord/tools/perf/builtin-record.c linux-2.6_tip0317_statrecordpid/tools/perf/builtin-record.c
--- linux-2.6_tip0317_statrecord/tools/perf/builtin-record.c	2010-03-18 13:48:39.578181540 +0800
+++ linux-2.6_tip0317_statrecordpid/tools/perf/builtin-record.c	2010-03-18 14:28:41.449631936 +0800
@@ -27,7 +27,7 @@
 #include <unistd.h>
 #include <sched.h>
 
-static int			fd[MAX_NR_CPUS][MAX_COUNTERS];
+static int			*fd[MAX_NR_CPUS][MAX_COUNTERS];
 
 static long			default_interval	= 0;
 
@@ -43,6 +43,9 @@ static int			raw_samples = 0;
 static int			system_wide = 0;
 static int			profile_cpu = -1;
 static pid_t			target_pid = -1;
+static pid_t			target_tid = -1;
+static pid_t			*all_tids = NULL;
+static int			thread_num = 0;
 static pid_t			child_pid = -1;
 static int			inherit = 1;
 static int			force = 0;
@@ -60,7 +63,7 @@ static struct timeval this_read;
 
 static u64			bytes_written = 0;
 
-static struct pollfd		event_array[MAX_NR_CPUS * MAX_COUNTERS];
+static struct pollfd		*event_array;
 
 static int			nr_poll = 0;
 static int			nr_cpu = 0;
@@ -77,7 +80,7 @@ struct mmap_data {
 	unsigned int		prev;
 };
 
-static struct mmap_data	mmap_array[MAX_NR_CPUS][MAX_COUNTERS];
+static struct mmap_data	*mmap_array[MAX_NR_CPUS][MAX_COUNTERS];
 
 static unsigned long mmap_read_head(struct mmap_data *md)
 {
@@ -225,12 +228,13 @@ static struct perf_header_attr *get_head
 	return h_attr;
 }
 
-static void create_counter(int counter, int cpu, pid_t pid)
+static void create_counter(int counter, int cpu)
 {
 	char *filter = filters[counter];
 	struct perf_event_attr *attr = attrs + counter;
 	struct perf_header_attr *h_attr;
 	int track = !counter; /* only the first counter needs these */
+	int thread_index;
 	int ret;
 	struct {
 		u64 count;
@@ -280,115 +284,124 @@ static void create_counter(int counter,
 		attr->enable_on_exec = 1;
 	}
 
+	for (thread_index = 0; thread_index < thread_num; thread_index++) {
 try_again:
-	fd[nr_cpu][counter] = sys_perf_event_open(attr, pid, cpu, group_fd, 0);
+		fd[nr_cpu][counter][thread_index] = sys_perf_event_open(attr,
+			all_tids[thread_index], cpu, group_fd, 0);
 
-	if (fd[nr_cpu][counter] < 0) {
-		int err = errno;
+		if (fd[nr_cpu][counter][thread_index] < 0) {
+			int err = errno;
 
-		if (err == EPERM || err == EACCES)
-			die("Permission error - are you root?\n"
-				"\t Consider tweaking /proc/sys/kernel/perf_event_paranoid.\n");
-		else if (err == ENODEV && profile_cpu != -1)
-			die("No such device - did you specify an out-of-range profile CPU?\n");
+			if (err == EPERM || err == EACCES)
+				die("Permission error - are you root?\n"
+					"\t Consider tweaking "
+					"/proc/sys/kernel/perf_event_paranoid.\n");
+			else if (err == ENODEV && profile_cpu != -1) {
+				die("No such device - did you specify "
+					"an out-of-range profile CPU?\n");
+			}
 
-		/*
-		 * If it's cycles then fall back to hrtimer
-		 * based cpu-clock-tick sw counter, which
-		 * is always available even if no PMU support
Re: [PATCH] Enhance perf to collect KVM guest os statistics from host side
On Tue, 2010-03-16 at 10:47 +0100, Ingo Molnar wrote: * Zhang, Yanmin yanmin_zh...@linux.intel.com wrote: On Tue, 2010-03-16 at 15:48 +0800, Zhang, Yanmin wrote: On Tue, 2010-03-16 at 07:41 +0200, Avi Kivity wrote: On 03/16/2010 07:27 AM, Zhang, Yanmin wrote: From: Zhang, Yanminyanmin_zh...@linux.intel.com Based on the discussion in KVM community, I worked out the patch to support perf to collect guest os statistics from host side. This patch is implemented with Ingo, Peter and some other guys' kind help. Yang Sheng pointed out a critical bug and provided good suggestions with other guys. I really appreciate their kind help. The patch adds new subcommand kvm to perf. perf kvm top perf kvm record perf kvm report perf kvm diff The new perf could profile guest os kernel except guest os user space, but it could summarize guest os user space utilization per guest os. Below are some examples. 1) perf kvm top [r...@lkp-ne01 norm]# perf kvm --host --guest --guestkallsyms=/home/ymzhang/guest/kallsyms --guestmodules=/home/ymzhang/guest/modules top Thanks for your kind comments. Excellent, support for guest kernel != host kernel is critical (I can't remember the last time I ran same kernels). How would we support multiple guests with different kernels? With the patch, 'perf kvm report --sort pid could show summary statistics for all guest os instances. Then, use parameter --pid of 'perf kvm record' to collect single problematic instance data. Sorry. I found currently --pid isn't process but a thread (main thread). Ingo, Is it possible to support a new parameter or extend --inherit, so 'perf record' and 'perf top' could collect data on all threads of a process when the process is running? If not, I need add a new ugly parameter which is similar to --pid to filter out process data in userspace. Yeah. For maximum utility i'd suggest to extend --pid to include this, and introduce --tid for the previous, limited-to-a-single-task functionality. 
Most users would expect --pid to work like a 'late attach' - i.e. to work like strace -f or like a gdb attach. Thanks Ingo, Avi. I worked out below patch against tip/master of March 15th. Subject: [PATCH] Change perf's parameter --pid to process-wide collection From: Zhang, Yanmin yanmin_zh...@linux.intel.com Change parameter -p (--pid) to real process pid and add -t (--tid) meaning thread id. Now, --pid means perf collects the statistics of all threads of the process, while --tid means perf just collect the statistics of that thread. BTW, the patch fixes a bug of 'perf stat -p'. 'perf stat' always configures attr-disabled=1 if it isn't a system-wide collection. If there is a '-p' and no forks, 'perf stat -p' doesn't collect any data. In addition, the while(!done) in run_perf_stat consumes 100% single cpu time which has bad impact on running workload. I added a sleep(1) in the loop. Signed-off-by: Zhang Yanmin yanmin_zh...@linux.intel.com --- diff -Nraup linux-2.6_tipmaster0315/tools/perf/builtin-record.c linux-2.6_tipmaster0315_perfpid/tools/perf/builtin-record.c --- linux-2.6_tipmaster0315/tools/perf/builtin-record.c 2010-03-16 08:59:54.896488489 +0800 +++ linux-2.6_tipmaster0315_perfpid/tools/perf/builtin-record.c 2010-03-17 16:30:17.71706 +0800 @@ -27,7 +27,7 @@ #include unistd.h #include sched.h -static int fd[MAX_NR_CPUS][MAX_COUNTERS]; +static int *fd[MAX_NR_CPUS][MAX_COUNTERS]; static longdefault_interval= 0; @@ -43,6 +43,9 @@ static intraw_samples = 0; static int system_wide = 0; static int profile_cpu = -1; static pid_t target_pid = -1; +static pid_t target_tid = -1; +static int *all_tids = NULL; +static int thread_num = 0; static pid_t child_pid = -1; static int inherit = 1; static int force = 0; @@ -60,7 +63,7 @@ static struct timeval this_read; static u64 bytes_written = 0; -static struct pollfd event_array[MAX_NR_CPUS * MAX_COUNTERS]; +static struct pollfd *event_array; static int nr_poll = 0; static int nr_cpu = 0; @@ -77,7 +80,7 @@ struct mmap_data 
{ unsigned int
Re: [PATCH] Enhance perf to collect KVM guest os statistics from host side
On Wed, 2010-03-17 at 17:26 +0800, Zhang, Yanmin wrote: On Tue, 2010-03-16 at 10:47 +0100, Ingo Molnar wrote: * Zhang, Yanmin yanmin_zh...@linux.intel.com wrote: On Tue, 2010-03-16 at 15:48 +0800, Zhang, Yanmin wrote: On Tue, 2010-03-16 at 07:41 +0200, Avi Kivity wrote: On 03/16/2010 07:27 AM, Zhang, Yanmin wrote: From: Zhang, Yanminyanmin_zh...@linux.intel.com Based on the discussion in KVM community, I worked out the patch to support perf to collect guest os statistics from host side. This patch is implemented with Ingo, Peter and some other guys' kind help. Yang Sheng pointed out a critical bug and provided good suggestions with other guys. I really appreciate their kind help. The patch adds new subcommand kvm to perf. perf kvm top perf kvm record perf kvm report perf kvm diff The new perf could profile guest os kernel except guest os user space, but it could summarize guest os user space utilization per guest os. Below are some examples. 1) perf kvm top [r...@lkp-ne01 norm]# perf kvm --host --guest --guestkallsyms=/home/ymzhang/guest/kallsyms --guestmodules=/home/ymzhang/guest/modules top Thanks for your kind comments. Excellent, support for guest kernel != host kernel is critical (I can't remember the last time I ran same kernels). How would we support multiple guests with different kernels? With the patch, 'perf kvm report --sort pid could show summary statistics for all guest os instances. Then, use parameter --pid of 'perf kvm record' to collect single problematic instance data. Sorry. I found currently --pid isn't process but a thread (main thread). Ingo, Is it possible to support a new parameter or extend --inherit, so 'perf record' and 'perf top' could collect data on all threads of a process when the process is running? If not, I need add a new ugly parameter which is similar to --pid to filter out process data in userspace. Yeah. 
For maximum utility i'd suggest to extend --pid to include this, and introduce --tid for the previous, limited-to-a-single-task functionality. Most users would expect --pid to work like a 'late attach' - i.e. to work like strace -f or like a gdb attach. Thanks Ingo, Avi. I worked out below patch against tip/master of March 15th. Subject: [PATCH] Change perf's parameter --pid to process-wide collection From: Zhang, Yanmin yanmin_zh...@linux.intel.com Change parameter -p (--pid) to real process pid and add -t (--tid) meaning thread id. Now, --pid means perf collects the statistics of all threads of the process, while --tid means perf just collect the statistics of that thread. BTW, the patch fixes a bug of 'perf stat -p'. 'perf stat' always configures attr-disabled=1 if it isn't a system-wide collection. If there is a '-p' and no forks, 'perf stat -p' doesn't collect any data. In addition, the while(!done) in run_perf_stat consumes 100% single cpu time which has bad impact on running workload. I added a sleep(1) in the loop. Signed-off-by: Zhang Yanmin yanmin_zh...@linux.intel.com Ingo, Sorry, the patch has bugs. I need do a better job and will work out 2 separate patches against the 2 issues. Yanmin -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] Enhance perf to collect KVM guest os statistics from host side
On Tue, 2010-03-16 at 07:41 +0200, Avi Kivity wrote: On 03/16/2010 07:27 AM, Zhang, Yanmin wrote: From: Zhang, Yanminyanmin_zh...@linux.intel.com Based on the discussion in KVM community, I worked out the patch to support perf to collect guest os statistics from host side. This patch is implemented with Ingo, Peter and some other guys' kind help. Yang Sheng pointed out a critical bug and provided good suggestions with other guys. I really appreciate their kind help. The patch adds new subcommand kvm to perf. perf kvm top perf kvm record perf kvm report perf kvm diff The new perf could profile guest os kernel except guest os user space, but it could summarize guest os user space utilization per guest os. Below are some examples. 1) perf kvm top [r...@lkp-ne01 norm]# perf kvm --host --guest --guestkallsyms=/home/ymzhang/guest/kallsyms --guestmodules=/home/ymzhang/guest/modules top Thanks for your kind comments. Excellent, support for guest kernel != host kernel is critical (I can't remember the last time I ran same kernels). How would we support multiple guests with different kernels? With the patch, 'perf kvm report --sort pid could show summary statistics for all guest os instances. Then, use parameter --pid of 'perf kvm record' to collect single problematic instance data. Perhaps a symbol server that perf can connect to (and that would connect to guests in turn)? 
diff -Nraup linux-2.6_tipmaster0315/arch/x86/kvm/vmx.c linux-2.6_tipmaster0315_perfkvm/arch/x86/kvm/vmx.c
--- linux-2.6_tipmaster0315/arch/x86/kvm/vmx.c	2010-03-16 08:59:11.825295404 +0800
+++ linux-2.6_tipmaster0315_perfkvm/arch/x86/kvm/vmx.c	2010-03-16 09:01:09.976084492 +0800
@@ -26,6 +26,7 @@
 #include <linux/sched.h>
 #include <linux/moduleparam.h>
 #include <linux/ftrace_event.h>
+#include <linux/perf_event.h>
 
 #include "kvm_cache_regs.h"
 #include "x86.h"
@@ -3632,6 +3633,43 @@ static void update_cr8_intercept(struct
 	vmcs_write32(TPR_THRESHOLD, irr);
 }
 
+DEFINE_PER_CPU(int, kvm_in_guest) = {0};
+
+static void kvm_set_in_guest(void)
+{
+	percpu_write(kvm_in_guest, 1);
+}
+
+static int kvm_is_in_guest(void)
+{
+	return percpu_read(kvm_in_guest);
+}

There is already PF_VCPU for this.

Right, but there is a window between kvm_guest_enter and actually running in the guest where a perf event might overflow. Anyway, the window is very narrow; I will change it to use the PF_VCPU flag.

+static struct perf_guest_info_callbacks kvm_guest_cbs = {
+	.is_in_guest		= kvm_is_in_guest,
+	.is_user_mode		= kvm_is_user_mode,
+	.get_guest_ip		= kvm_get_guest_ip,
+	.reset_in_guest		= kvm_reset_in_guest,
+};

Should be in common code, not vmx specific.

Right. I discussed this with Yangsheng. I will move the above data structures and callbacks to arch/x86/kvm/x86.c, and add get_ip, a new callback, to kvm_x86_ops.

Yanmin
Re: [PATCH] Enhance perf to collect KVM guest os statistics from host side
On Tue, 2010-03-16 at 15:48 +0800, Zhang, Yanmin wrote: On Tue, 2010-03-16 at 07:41 +0200, Avi Kivity wrote: On 03/16/2010 07:27 AM, Zhang, Yanmin wrote: From: Zhang, Yanminyanmin_zh...@linux.intel.com Based on the discussion in KVM community, I worked out the patch to support perf to collect guest os statistics from host side. This patch is implemented with Ingo, Peter and some other guys' kind help. Yang Sheng pointed out a critical bug and provided good suggestions with other guys. I really appreciate their kind help. The patch adds new subcommand kvm to perf. perf kvm top perf kvm record perf kvm report perf kvm diff The new perf could profile guest os kernel except guest os user space, but it could summarize guest os user space utilization per guest os. Below are some examples. 1) perf kvm top [r...@lkp-ne01 norm]# perf kvm --host --guest --guestkallsyms=/home/ymzhang/guest/kallsyms --guestmodules=/home/ymzhang/guest/modules top Thanks for your kind comments. Excellent, support for guest kernel != host kernel is critical (I can't remember the last time I ran same kernels). How would we support multiple guests with different kernels? With the patch, 'perf kvm report --sort pid could show summary statistics for all guest os instances. Then, use parameter --pid of 'perf kvm record' to collect single problematic instance data. Sorry. I found currently --pid isn't process but a thread (main thread). Ingo, Is it possible to support a new parameter or extend --inherit, so 'perf record' and 'perf top' could collect data on all threads of a process when the process is running? If not, I need add a new ugly parameter which is similar to --pid to filter out process data in userspace. Yanmin -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] Enhance perf to collect KVM guest os statistics from host side
On Tue, 2010-03-16 at 11:32 +0200, Avi Kivity wrote: On 03/16/2010 09:48 AM, Zhang, Yanmin wrote: Excellent, support for guest kernel != host kernel is critical (I can't remember the last time I ran same kernels). How would we support multiple guests with different kernels? With the patch, 'perf kvm report --sort pid could show summary statistics for all guest os instances. Then, use parameter --pid of 'perf kvm record' to collect single problematic instance data. That certainly works, though automatic association of guest data with guest symbols is friendlier. Thanks. Originally, I planed to add a -G parameter to perf. Such like -G :/XXX/XXX/guestkallsyms:/XXX/XXX/modules,8889:/XXX/XXX/guestkallsyms:/XXX/XXX/modules and 8889 are just qemu guest pid. So we could define multiple guest os symbol files. But it seems ugly, and 'perf kvm report --sort pid and 'perf kvm top --pid' could provide similar functionality. diff -Nraup linux-2.6_tipmaster0315/arch/x86/kvm/vmx.c linux-2.6_tipmaster0315_perfkvm/arch/x86/kvm/vmx.c --- linux-2.6_tipmaster0315/arch/x86/kvm/vmx.c2010-03-16 08:59:11.825295404 +0800 +++ linux-2.6_tipmaster0315_perfkvm/arch/x86/kvm/vmx.c2010-03-16 09:01:09.976084492 +0800 @@ -26,6 +26,7 @@ #includelinux/sched.h #includelinux/moduleparam.h #includelinux/ftrace_event.h +#includelinux/perf_event.h #include kvm_cache_regs.h #include x86.h @@ -3632,6 +3633,43 @@ static void update_cr8_intercept(struct vmcs_write32(TPR_THRESHOLD, irr); } +DEFINE_PER_CPU(int, kvm_in_guest) = {0}; + +static void kvm_set_in_guest(void) +{ + percpu_write(kvm_in_guest, 1); +} + +static int kvm_is_in_guest(void) +{ + return percpu_read(kvm_in_guest); +} There is already PF_VCPU for this. Right, but there is a scope between kvm_guest_enter and really running in guest os, where a perf event might overflow. Anyway, the scope is very narrow, I will change it to use flag PF_VCPU. 
There is also a window between setting the flag and calling 'int $2' where an NMI might happen and be accounted incorrectly. Perhaps separate the 'int $2' into a direct call into perf and another call for the rest of NMI handling. I don't see how it would work on svm though - AFAICT the NMI is held whereas vmx swallows it. I guess NMIs will be disabled until the next IRET so it isn't racy, just tricky. I'm not sure if vmexit does break NMI context or not. Hardware NMI context isn't reentrant till a IRET. YangSheng would like to double check it. +static struct perf_guest_info_callbacks kvm_guest_cbs = { + .is_in_guest= kvm_is_in_guest, + .is_user_mode = kvm_is_user_mode, + .get_guest_ip = kvm_get_guest_ip, + .reset_in_guest = kvm_reset_in_guest, +}; Should be in common code, not vmx specific. Right. I discussed with Yangsheng. I will move above data structures and callbacks to file arch/x86/kvm/x86.c, and add get_ip, a new callback to kvm_x86_ops. You will need access to the vcpu pointer (kvm_rip_read() needs it), you can put it in a percpu variable. We do so now in a new patch. I guess if it's not null, you know you're in a guest, so no need for PF_VCPU. Good suggestion. Thanks. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] x86/kvm: Show guest system/user cputime in cpustat
On Thu, 2010-03-11 at 09:50 +0200, Avi Kivity wrote: On 03/11/2010 09:46 AM, Sheng Yang wrote: On Thursday 11 March 2010 15:36:01 Avi Kivity wrote: On 03/11/2010 09:20 AM, Sheng Yang wrote: Currently we can only get the cpu_stat of whole guest as one. This patch enhanced cpu_stat with more detail, has guest_system and guest_user cpu time statistics with a little overhead. Signed-off-by: Sheng Yangsh...@linux.intel.com It seems per-process guest cpu utilization is more useful than per-cpu's. --- This draft patch based on KVM upstream to show the idea. I would split it into more kernel friendly version later. The overhead is, the cost of get_cpl() after each exit from guest. This can be very expensive in the nested virtualization case, so I wouldn't like this to be in normal paths. I think detailed profiling like that can be left to 'perf kvm', which only has overhead if enabled at runtime. Yes, that's my concern too(though nested vmcs/vmcb read already too expensive, they should be optimized...). Any ideas on how to do that? Perhaps use paravirt_ops to covert the vmread into a memory read? We store the vmwrites in the vmcs anyway. Another method is to add sysctl entry, such like /proc/sys/kernel/collect_guest_utilization, and we can set it off by default. Or add a /sys/kernel/debug/kvm/collect_guest_utilization. The other concern is, perf alike mechanism would bring a lot more overhead compared to this. Ordinarily users won't care if time is spent in guest kernel mode or guest user mode. They want to see which guest is imposing a load on a system. I consider a user profiling a guest from the host an advanced and rarer use case, so it's okay to require tools and additional overhead for this. Here is the story why Sheng worked out the patch. Some guys work on KVM performance. They want us to extend top to show guest utilization info, such like guest kernel and guest userspace cpu utilization. 
With the new tool, they could find which VM (mapped to a qemu process id) consumes too much cpu time on the host side (kernel plus userspace), and compare that against the guest's kernel/userspace split. That information provides a first-hand, high-level overview of all VMs running on the system and helps an admin quickly find the worst-behaving VM instance. So we need per-process (per-guest) cpu utilization rather than per-cpu guest utilization. For example you can put the code to note the cpl in a tracepoint which is enabled dynamically. Yanmin has already implemented perf kvm to support this; we are just arguing whether a normal top-like mechanism is necessary. perf kvm is mostly used to find hot functions and might cause more overhead; Sheng's patch has less. I am also considering making it a feature that can be disabled, but that seems to complicate things and result in inconsistent cpustat output. I'm not even sure that guest time was a good idea.
Re: KVM PMU virtualization
On Thu, 2010-03-04 at 09:00 +0800, Zhang, Yanmin wrote: On Wed, 2010-03-03 at 11:15 +0100, Peter Zijlstra wrote: On Wed, 2010-03-03 at 17:27 +0800, Zhang, Yanmin wrote: -#ifndef perf_misc_flags -#define perf_misc_flags(regs) (user_mode(regs) ? PERF_RECORD_MISC_USER : \ -PERF_RECORD_MISC_KERNEL) -#define perf_instruction_pointer(regs) instruction_pointer(regs) -#endif Ah, that #ifndef is for powerpc, which I think you just broke. Thanks for the reminder. I deleted powerpc codes when building cscope lib. It seems perf_save_virt_ip/perf_reset_virt_ip interfaces are ugly. I plan to change them to a callback function struct and kvm registers its version to perf. Such like: struct perf_guest_info_callbacks { int (*is_in_guest)(); u64 (*get_guest_ip)(); int (*copy_guest_stack)(); int (*reset_in_guest)(); ... }; int perf_register_guest_info_callbacks(struct perf_guest_info_callbacks *); int perf_unregister_guest_info_callbacks(struct perf_guest_info_callbacks *); It's more scalable and neater. In case you guys might lose patience, I worked out a new patch against 2.6.34-rc1. It could work with: #perf kvm --guest --guestkallsyms /guest/os/kernel/proc/kallsyms --guestmodules /guest/os/proc/modules top It also support to collect both host side and guest side at the same time: #perf kvm --host --guest --guestkallsyms /guest/os/kernel/proc/kallsyms --guestmodules /guest/os/proc/modules top The first output line of top has guest kernel/user space percentage. Or just host side: #perf kvm --host As tool perf source codes have lots of changes, I am still working on perf kvm record and report. 
---
diff -Nraup linux-2.6.34-rc1/arch/x86/include/asm/ptrace.h linux-2.6.34-rc1_work/arch/x86/include/asm/ptrace.h
--- linux-2.6.34-rc1/arch/x86/include/asm/ptrace.h	2010-03-09 13:04:20.730596079 +0800
+++ linux-2.6.34-rc1_work/arch/x86/include/asm/ptrace.h	2010-03-10 17:06:34.228953260 +0800
@@ -167,6 +167,15 @@ static inline int user_mode(struct pt_re
 #endif
 }
 
+static inline int user_mode_cs(u16 cs)
+{
+#ifdef CONFIG_X86_32
+	return (cs & SEGMENT_RPL_MASK) == USER_RPL;
+#else
+	return !!(cs & 3);
+#endif
+}
+
 static inline int user_mode_vm(struct pt_regs *regs)
 {
 #ifdef CONFIG_X86_32
diff -Nraup linux-2.6.34-rc1/arch/x86/kvm/vmx.c linux-2.6.34-rc1_work/arch/x86/kvm/vmx.c
--- linux-2.6.34-rc1/arch/x86/kvm/vmx.c	2010-03-09 13:04:20.758593132 +0800
+++ linux-2.6.34-rc1_work/arch/x86/kvm/vmx.c	2010-03-10 17:11:49.709019136 +0800
@@ -26,6 +26,7 @@
 #include <linux/sched.h>
 #include <linux/moduleparam.h>
 #include <linux/ftrace_event.h>
+#include <linux/perf_event.h>
 
 #include "kvm_cache_regs.h"
 #include "x86.h"
@@ -3632,6 +3633,43 @@ static void update_cr8_intercept(struct
 	vmcs_write32(TPR_THRESHOLD, irr);
 }
 
+DEFINE_PER_CPU(int, kvm_in_guest) = {0};
+
+static void kvm_set_in_guest(void)
+{
+	percpu_write(kvm_in_guest, 1);
+}
+
+static int kvm_is_in_guest(void)
+{
+	return percpu_read(kvm_in_guest);
+}
+
+static int kvm_is_user_mode(void)
+{
+	int user_mode;
+	user_mode = user_mode_cs(vmcs_read16(GUEST_CS_SELECTOR));
+	return user_mode;
+}
+
+static u64 kvm_get_guest_ip(void)
+{
+	return vmcs_readl(GUEST_RIP);
+}
+
+static void kvm_reset_in_guest(void)
+{
+	if (percpu_read(kvm_in_guest))
+		percpu_write(kvm_in_guest, 0);
+}
+
+static struct perf_guest_info_callbacks kvm_guest_cbs = {
+	.is_in_guest		= kvm_is_in_guest,
+	.is_user_mode		= kvm_is_user_mode,
+	.get_guest_ip		= kvm_get_guest_ip,
+	.reset_in_guest		= kvm_reset_in_guest
+};
+
 static void vmx_complete_interrupts(struct vcpu_vmx *vmx)
 {
 	u32 exit_intr_info;
@@ -3653,8 +3691,11 @@ static void vmx_complete_interrupts(stru
 	/* We need to handle NMIs before interrupts are enabled */
 	if ((exit_intr_info & INTR_INFO_INTR_TYPE_MASK) == INTR_TYPE_NMI_INTR &&
-	    (exit_intr_info & INTR_INFO_VALID_MASK))
+	    (exit_intr_info & INTR_INFO_VALID_MASK)) {
+		kvm_set_in_guest();
 		asm("int $2");
+		kvm_reset_in_guest();
+	}
 
 	idtv_info_valid = idt_vectoring_info & VECTORING_INFO_VALID_MASK;
 
@@ -4251,6 +4292,8 @@ static int __init vmx_init(void)
 	if (bypass_guest_pf)
 		kvm_mmu_set_nonpresent_ptes(~0xffeull, 0ull);
 
+	perf_register_guest_info_callbacks(&kvm_guest_cbs);
+
 	return 0;
 
 out3:
@@ -4266,6 +4309,8 @@ out:
 
 static void __exit vmx_exit(void)
 {
+	perf_unregister_guest_info_callbacks(&kvm_guest_cbs);
+
 	free_page((unsigned long)vmx_msr_bitmap_legacy);
 	free_page((unsigned long)vmx_msr_bitmap_longmode);
 	free_page((unsigned long)vmx_io_bitmap_b);
diff -Nraup linux-2.6.34-rc1/include/linux/perf_event.h linux-2.6.34-rc1_work/include/linux/perf_event.h
--- linux-2.6.34-rc1
Re: KVM PMU virtualization
On Wed, 2010-03-03 at 11:32 +0800, Zhang, Yanmin wrote: On Tue, 2010-03-02 at 10:36 +0100, Ingo Molnar wrote: * Zhang, Yanmin yanmin_zh...@linux.intel.com wrote: On Fri, 2010-02-26 at 10:17 +0100, Ingo Molnar wrote: My suggestion, as always, would be to start very simple and very minimal: Enable 'perf kvm top' to show guest overhead. Use the exact same kernel image both as a host and as guest (for testing), to not have to deal with the symbol space transport problem initially. Enable 'perf kvm record' to only record guest events by default. Etc. This alone will be a quite useful result already - and gives a basis for further work. No need to spend months to do the big grand design straight away, all of this can be done gradually and in the order of usefulness - and you'll always have something that actually works (and helps your other KVM projects) along the way. It took me for a couple of hours to read the emails on the topic. Based on above idea, I worked out a prototype which is ugly, but does work with top/record when both guest side and host side use the same kernel image, while compiling most needed modules into kernel directly.. The commands are: perf kvm top perf kvm record perf kvm report They just collect guest kernel hot functions. Fantastic, and there's some really interesting KVM guest/host comparison profiles you've done with this prototype! With my patch, I collected dbench data on Nehalem machine (2*4*2 logical cpu). 
1) Vanilla host kernel (6G memory):

   PerfTop:  15491 irqs/sec  kernel:93.6% [1000Hz cycles],  (all, 16 CPUs)

             samples  pcnt function                         DSO
             _______ _____ ________________________________ ________________________________________

            99376.00 40.5% ext3_test_allocatable            /lib/modules/2.6.33-kvmymz/build/vmlinux
            41239.00 16.8% bitmap_search_next_usable_block  /lib/modules/2.6.33-kvmymz/build/vmlinux
             7019.00  2.9% __ticket_spin_lock               /lib/modules/2.6.33-kvmymz/build/vmlinux
             5350.00  2.2% copy_user_generic_string         /lib/modules/2.6.33-kvmymz/build/vmlinux
             5208.00  2.1% do_get_write_access              /lib/modules/2.6.33-kvmymz/build/vmlinux
             4484.00  1.8% journal_dirty_metadata           /lib/modules/2.6.33-kvmymz/build/vmlinux
             4078.00  1.7% ext3_free_blocks_sb              /lib/modules/2.6.33-kvmymz/build/vmlinux
             3856.00  1.6% ext3_new_blocks                  /lib/modules/2.6.33-kvmymz/build/vmlinux
             3485.00  1.4% journal_get_undo_access          /lib/modules/2.6.33-kvmymz/build/vmlinux
             2803.00  1.1% ext3_try_to_allocate             /lib/modules/2.6.33-kvmymz/build/vmlinux
             2241.00  0.9% __find_get_block                 /lib/modules/2.6.33-kvmymz/build/vmlinux
             1957.00  0.8% find_revoke_record               /lib/modules/2.6.33-kvmymz/build/vmlinux

2) guest os: start one guest os with 4GB memory.

   PerfTop:    827 irqs/sec  kernel: 0.0% [1000Hz cycles],  (all, 16 CPUs)

             samples  pcnt function                         DSO
             _______ _____ ________________________________ ________________________________________

            41701.00 28.1% __ticket_spin_lock               /lib/modules/2.6.33-kvmymz/build/vmlinux
            33843.00 22.8% ext3_test_allocatable            /lib/modules/2.6.33-kvmymz/build/vmlinux
            16862.00 11.4% bitmap_search_next_usable_block  /lib/modules/2.6.33-kvmymz/build/vmlinux
             3278.00  2.2% native_flush_tlb_others          /lib/modules/2.6.33-kvmymz/build/vmlinux
             3200.00  2.2% copy_user_generic_string         /lib/modules/2.6.33-kvmymz/build/vmlinux
             3009.00  2.0% do_get_write_access              /lib/modules/2.6.33-kvmymz/build/vmlinux
             2834.00  1.9% journal_dirty_metadata           /lib/modules/2.6.33-kvmymz/build/vmlinux
             1965.00  1.3% journal_get_undo_access          /lib/modules/2.6.33-kvmymz/build/vmlinux
             1907.00  1.3% ext3_new_blocks                  /lib/modules/2.6.33-kvmymz/build/vmlinux
Re: KVM PMU virtualization
On Wed, 2010-03-03 at 11:13 +0100, Peter Zijlstra wrote: On Wed, 2010-03-03 at 17:27 +0800, Zhang, Yanmin wrote:

+static inline u64 perf_instruction_pointer(struct pt_regs *regs)
+{
+	u64 ip;
+
+	ip = percpu_read(perf_virt_ip.ip);
+	if (!ip)
+		ip = instruction_pointer(regs);
+	else
+		perf_reset_virt_ip();
+	return ip;
+}
+
+static inline unsigned int perf_misc_flags(struct pt_regs *regs)
+{
+	if (percpu_read(perf_virt_ip.ip)) {
+		return percpu_read(perf_virt_ip.user_mode) ?
+			PERF_RECORD_MISC_GUEST_USER :
+			PERF_RECORD_MISC_GUEST_KERNEL;
+	} else
+		return user_mode(regs) ?
+			PERF_RECORD_MISC_USER :
+			PERF_RECORD_MISC_KERNEL;
+}

This code assumes that perf_misc_flags() is only ever called before perf_instruction_pointer(), which is currently true, but you might want to put a comment nearby to remind us of this.

I will change the logic with a clear reset operation in the caller.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
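The subtlety Peter points at is the consume-on-read behavior of perf_instruction_pointer(). A userspace sketch of the save/consume protocol the two helpers above implement - perf_save_virt_ip/perf_reset_virt_ip and the zero-means-empty convention come from the patches; the plain global standing in for the per-cpu perf_virt_ip is a simplification:

```c
#include <assert.h>

typedef unsigned long long u64;

/* A plain global stands in for the per-cpu perf_virt_ip.ip slot;
 * 0 means "no saved guest RIP". */
static u64 virt_ip;

static void perf_save_virt_ip(u64 ip) { virt_ip = ip; }
static void perf_reset_virt_ip(void)  { virt_ip = 0; }

/* Mirrors perf_instruction_pointer(): prefer the saved guest RIP and
 * consume it; otherwise fall back to the interrupted host RIP. */
static u64 pick_ip(u64 regs_ip)
{
	u64 ip = virt_ip;

	if (!ip)
		return regs_ip;
	perf_reset_virt_ip();
	return ip;
}
```

Because the first read consumes the slot, anything that queries the saved RIP twice for one sample (as perf_misc_flags() plus perf_instruction_pointer() would, in that order) depends on the call ordering - which is exactly the hidden assumption being discussed.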
Re: KVM PMU virtualization
On Wed, 2010-03-03 at 11:15 +0100, Peter Zijlstra wrote: On Wed, 2010-03-03 at 17:27 +0800, Zhang, Yanmin wrote:

-#ifndef perf_misc_flags
-#define perf_misc_flags(regs)	(user_mode(regs) ? PERF_RECORD_MISC_USER : \
-				 PERF_RECORD_MISC_KERNEL)
-#define perf_instruction_pointer(regs)	instruction_pointer(regs)
-#endif

Ah, that #ifndef is for powerpc, which I think you just broke.

Thanks for the reminder. I deleted the powerpc code when building my cscope lib.

It seems the perf_save_virt_ip/perf_reset_virt_ip interfaces are ugly. I plan to change them to a callback function struct, and have KVM register its version with perf. Something like:

struct perf_guest_info_callbacks {
	int (*is_in_guest)();
	u64 (*get_guest_ip)();
	int (*copy_guest_stack)();
	int (*reset_in_guest)();
	...
};
int perf_register_guest_info_callbacks(struct perf_guest_info_callbacks *);
int perf_unregister_guest_info_callbacks(struct perf_guest_info_callbacks *);

It's more scalable and neater.
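A hedged sketch of how such a registration interface could hang together. Only the struct shape and the register/unregister names come from the mail above; the perf_in_guest() consumer and the toy kvm_is_in_guest() implementation are illustrative, and a real kernel version would need proper locking/RCU around the callback pointer:

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned long long u64;

/* Struct shape abridged from the proposal above. */
struct perf_guest_info_callbacks {
	int (*is_in_guest)(void);
	u64 (*get_guest_ip)(void);
};

static struct perf_guest_info_callbacks *guest_cbs;

static int perf_register_guest_info_callbacks(struct perf_guest_info_callbacks *cbs)
{
	guest_cbs = cbs;
	return 0;
}

static int perf_unregister_guest_info_callbacks(struct perf_guest_info_callbacks *cbs)
{
	if (guest_cbs == cbs)
		guest_cbs = NULL;
	return 0;
}

/* How perf core might consult the hook: nothing registered means
 * "not in a guest", so bare-metal behavior is unchanged. */
static int perf_in_guest(void)
{
	return guest_cbs && guest_cbs->is_in_guest ? guest_cbs->is_in_guest() : 0;
}

/* Toy KVM-side implementation. */
static int kvm_is_in_guest(void) { return 1; }

static struct perf_guest_info_callbacks kvm_guest_cbs = {
	.is_in_guest = kvm_is_in_guest,
};
```

The design choice the struct buys is decoupling: perf core needs no KVM knowledge at all, and the whole feature degrades to the old behavior whenever no hypervisor has registered.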
Re: KVM PMU virtualization
On Tue, 2010-03-02 at 10:36 +0100, Ingo Molnar wrote: * Zhang, Yanmin yanmin_zh...@linux.intel.com wrote: On Fri, 2010-02-26 at 10:17 +0100, Ingo Molnar wrote:

My suggestion, as always, would be to start very simple and very minimal:

- Enable 'perf kvm top' to show guest overhead.
- Use the exact same kernel image both as a host and as a guest (for testing), to not have to deal with the symbol space transport problem initially.
- Enable 'perf kvm record' to only record guest events by default.
- Etc.

This alone will be a quite useful result already - and gives a basis for further work. No need to spend months to do the big grand design straight away; all of this can be done gradually and in the order of usefulness - and you'll always have something that actually works (and helps your other KVM projects) along the way.

It took me a couple of hours to read the emails on the topic. Based on the above idea, I worked out a prototype which is ugly, but does work with top/record when both the guest side and the host side use the same kernel image, with most needed modules compiled into the kernel directly. The commands are:

perf kvm top
perf kvm record
perf kvm report

They just collect guest kernel hot functions.

Fantastic, and there are some really interesting KVM guest/host comparison profiles you've done with this prototype!

With my patch, I collected dbench data on a Nehalem machine (2*4*2 logical cpus).
1) Vanilla host kernel (6G memory):

   PerfTop:  15491 irqs/sec  kernel:93.6% [1000Hz cycles],  (all, 16 CPUs)

             samples  pcnt function                         DSO
             _______ _____ ________________________________ ________________________________________

            99376.00 40.5% ext3_test_allocatable            /lib/modules/2.6.33-kvmymz/build/vmlinux
            41239.00 16.8% bitmap_search_next_usable_block  /lib/modules/2.6.33-kvmymz/build/vmlinux
             7019.00  2.9% __ticket_spin_lock               /lib/modules/2.6.33-kvmymz/build/vmlinux
             5350.00  2.2% copy_user_generic_string         /lib/modules/2.6.33-kvmymz/build/vmlinux
             5208.00  2.1% do_get_write_access              /lib/modules/2.6.33-kvmymz/build/vmlinux
             4484.00  1.8% journal_dirty_metadata           /lib/modules/2.6.33-kvmymz/build/vmlinux
             4078.00  1.7% ext3_free_blocks_sb              /lib/modules/2.6.33-kvmymz/build/vmlinux
             3856.00  1.6% ext3_new_blocks                  /lib/modules/2.6.33-kvmymz/build/vmlinux
             3485.00  1.4% journal_get_undo_access          /lib/modules/2.6.33-kvmymz/build/vmlinux
             2803.00  1.1% ext3_try_to_allocate             /lib/modules/2.6.33-kvmymz/build/vmlinux
             2241.00  0.9% __find_get_block                 /lib/modules/2.6.33-kvmymz/build/vmlinux
             1957.00  0.8% find_revoke_record               /lib/modules/2.6.33-kvmymz/build/vmlinux

2) guest os: start one guest os with 4GB memory.

   PerfTop:    827 irqs/sec  kernel: 0.0% [1000Hz cycles],  (all, 16 CPUs)

             samples  pcnt function                         DSO
             _______ _____ ________________________________ ________________________________________

            41701.00 28.1% __ticket_spin_lock               /lib/modules/2.6.33-kvmymz/build/vmlinux
            33843.00 22.8% ext3_test_allocatable            /lib/modules/2.6.33-kvmymz/build/vmlinux
            16862.00 11.4% bitmap_search_next_usable_block  /lib/modules/2.6.33-kvmymz/build/vmlinux
             3278.00  2.2% native_flush_tlb_others          /lib/modules/2.6.33-kvmymz/build/vmlinux
             3200.00  2.2% copy_user_generic_string         /lib/modules/2.6.33-kvmymz/build/vmlinux
             3009.00  2.0% do_get_write_access              /lib/modules/2.6.33-kvmymz/build/vmlinux
             2834.00  1.9% journal_dirty_metadata           /lib/modules/2.6.33-kvmymz/build/vmlinux
             1965.00  1.3% journal_get_undo_access          /lib/modules/2.6.33-kvmymz/build/vmlinux
             1907.00  1.3% ext3_new_blocks                  /lib/modules/2.6.33-kvmymz/build/vmlinux
             1790.00  1.2% ext3_free_blocks_sb              /lib/modules/2.6.33-kvmymz/build/vmlinux
             1741.00  1.2% find_revoke_record               /lib/modules/2.6.33-kvmymz/build
Re: KVM PMU virtualization
On Fri, 2010-02-26 at 10:17 +0100, Ingo Molnar wrote: * Joerg Roedel j...@8bytes.org wrote: On Fri, Feb 26, 2010 at 10:55:17AM +0800, Zhang, Yanmin wrote: On Thu, 2010-02-25 at 18:34 +0100, Joerg Roedel wrote: On Thu, Feb 25, 2010 at 04:04:28PM +0100, Jes Sorensen wrote:

1) Add support to perf to allow it to monitor a KVM guest from the host.

This shouldn't be a big problem. The PMU of AMD Fam10 processors can be configured to count only when in guest mode. Perf needs to be aware of that and fetch the rip from a different place when monitoring a guest.

The idea is we want to measure both host and guest at the same time, and compare all the hot functions fairly.

So you want to measure the guest vcpu while it is running, and the vmexit path of that vcpu (including the qemu userspace part), together? The challenge here is to find out whether a performance event originated in guest mode or in host mode. But we can check for that in the NMI-protected part of the vmexit path.

As far as instrumentation goes, virtualization is simply another 'PID dimension' of measurement. Today we can isolate system performance measurements/events to the following domains:

- per system
- per cpu
- per task

( Note that PowerPC already supports certain sorts of 'hypervisor/kernel/user' domain separation, and we have some ABI details for all that, but it's by no means complete. Anton is using the PowerPC bits AFAIK, so it already works to a certain degree. )

When extending measurements to KVM, we want two things:

- user friendliness: instead of having to check 'ps' and figure out which Qemu thread is the KVM thread we want to profile, just give a convenience namespace to access guest profiling info. -G ought to map to the first currently running KVM guest it can find (which would match like 90% of the cases).
- etc.

No ifs and whens. If 'perf kvm top' doesn't show something useful by default, the whole effort is for naught.
- Extend core facilities and enable the following measurement dimensions:

  host-kernel-space
  host-user-space
  guest-kernel-space
  guest-user-space

on a per guest basis. We want to be able to measure just what the guest does, and we want to be able to measure just what the host does. Some of this the hardware helps us with (say, only measuring host kernel events is possible); some has to be done by fiddling with event enable/disable at vm-exit / vm-entry time.

My suggestion, as always, would be to start very simple and very minimal:

- Enable 'perf kvm top' to show guest overhead.
- Use the exact same kernel image both as a host and as a guest (for testing), to not have to deal with the symbol space transport problem initially.
- Enable 'perf kvm record' to only record guest events by default.
- Etc.

This alone will be a quite useful result already - and gives a basis for further work. No need to spend months to do the big grand design straight away; all of this can be done gradually and in the order of usefulness - and you'll always have something that actually works (and helps your other KVM projects) along the way.

It took me a couple of hours to read the emails on the topic. Based on the above idea, I worked out a prototype which is ugly, but does work with top/record when both the guest side and the host side use the same kernel image, with most needed modules compiled into the kernel directly. The commands are:

perf kvm top
perf kvm record
perf kvm report

They just collect guest kernel hot functions.

[ And, as so often, once you walk that path, that grand scheme you are thinking about right now might easily become last year's really bad idea ;-) ]

So please start walking the path and experience the challenges first-hand.

With my patch, I collected dbench data on a Nehalem machine (2*4*2 logical cpus).
1) Vanilla host kernel (6G memory):

   PerfTop:  15491 irqs/sec  kernel:93.6% [1000Hz cycles],  (all, 16 CPUs)

             samples  pcnt function                         DSO
             _______ _____ ________________________________ ________________________________________

            99376.00 40.5% ext3_test_allocatable            /lib/modules/2.6.33-kvmymz/build/vmlinux
            41239.00 16.8% bitmap_search_next_usable_block  /lib/modules/2.6.33-kvmymz/build/vmlinux
             7019.00  2.9% __ticket_spin_lock               /lib/modules/2.6.33-kvmymz/build/vmlinux
             5350.00  2.2% copy_user_generic_string         /lib/modules/2.6.33-kvmymz/build/vmlinux
             5208.00  2.1% do_get_write_access              /lib/modules/2.6.33-kvmymz/build/vmlinux
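Ingo's "fiddling with event enable/disable at vm-exit / vm-entry time" boils down to bracketing guest-only counters around the entry/exit path. A toy model of that bracketing (all names hypothetical - a real implementation would program PMU enable bits, not a flag):

```c
#include <assert.h>

/* Models a guest-only counter that the host enables on vm-entry and
 * disables on vm-exit. */
static int guest_counting;
static unsigned long long guest_cycles;

static void vm_entry(void) { guest_counting = 1; }
static void vm_exit(void)  { guest_counting = 0; }

/* One "tick" of the modeled counter: it advances only while the vcpu
 * is in guest mode, which gives the guest-only counting dimension. */
static void pmu_tick(void)
{
	if (guest_counting)
		guest_cycles++;
}
```

Host-only counting is the mirror image: enable on vm-exit, disable on vm-entry; the four host/guest x kernel/user dimensions then come from combining this bracket with the CPL at sample time.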
Re: Enhance perf to support KVM
On Thu, 2010-02-25 at 10:20 +0100, Peter Zijlstra wrote: On Thu, 2010-02-25 at 11:27 +0800, Zhang, Yanmin wrote:

Ingo,

I did some testing with KVM virtualization. perf shows vmx_vcpu_run consuming more than 50% of cpu time. Actually, the info is incorrect, because when the perf counter overflows and an NMI is triggered, the vm exits to function vmx_vcpu_run, then vmx_vcpu_run triggers a software NMI so the perf event is notified. perf just checks regs, which only holds the address of vmx_vcpu_run. I want to enhance perf to collect the real guest os address. Below is the design.

KVM uses a multi-thread model; every guest os is one multi-threaded process.

1) Kernel: Add a per_cpu var and some functions, so KVM records the interrupted guest os address before triggering the software NMI. The perf event would check the per_cpu var and use it if it's not zero, or just go through the old path.

2) User space: Add a new parameter to perf-top and perf-report, such as -g pid:guest_os_vmlinux_path. The perf command parses the guest os kernel image to collect symbols. Change perf to summarize results based on pid. Another direction is to use the new parameter -g only when the old parameter -p is defined. perf just needs to separate the native kernel and the guest os kernel.

I really appreciate your kind comments, and will contact you again in the future for help.

-g is already taken :-) We could use another flag, or just -G.

One thing I worry about is making sense of the guest data; it might be possible to sorta make sense of the main kernel image, but after that it's going to be 'interesting' indeed. You're going to have to extend PERF_RECORD_MISC_*, though; perhaps you can reuse CPUMODE_UNKNOWN for GUEST. The callchain stuff already has GUEST context identifiers; however, determining KERNEL/USER context might be hard, and interpreting it is going to be harder still since we don't have map information for the guest.

Right. As for side #1 pointed out in Ingo's email, we assume the guest os is linux.
We couldn't support all capabilities of perf on KVM from the host side.

1) We couldn't get module and process mapping info of the guest os in an easy way, so we can't support collecting guest kernel module and user space hot functions. A workaround is that the user could get the guest os /proc/kallsyms and pass it to the perf tool at the host side, so we could analyze module hot functions.

2) We couldn't get guest os kernel/user stack data in an easy way, so we might not support the callchain feature of perf. A workaround is that KVM copies the kernel stack data out, so we could at least support the guest os kernel callchain.

So the host side perf support on a guest os:

perf kvm list
perf kvm record	# records the first running guest
perf kvm stat	# stats the first running KVM guest
perf kvm top	# shows the profile of the first running guest
perf kvm trace	# activates the KVM specific tracepoints

As for record, recording guest os user space stack callchains and guest os user space hot functions is not supported.

Yanmin
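The context determination Peter raises above (host vs guest, then kernel vs user within each) is a four-way split that can be modeled as a tiny decision function. The enum values below are illustrative placeholders, not the real PERF_RECORD_MISC_* ABI values:

```c
#include <assert.h>

/* Illustrative placeholders, not the real PERF_RECORD_MISC_* values. */
enum sample_ctx {
	CTX_KERNEL,
	CTX_USER,
	CTX_GUEST_KERNEL,
	CTX_GUEST_USER,
};

/* Mirrors the perf_misc_flags() decision in the patches: classify by
 * guest/host first, then by kernel/user mode within that domain. */
static enum sample_ctx classify(int in_guest, int user_mode)
{
	if (in_guest)
		return user_mode ? CTX_GUEST_USER : CTX_GUEST_KERNEL;
	return user_mode ? CTX_USER : CTX_KERNEL;
}
```

The hard part, as the mail notes, is not the classification itself but obtaining reliable in_guest/user_mode inputs and the guest map information needed to symbolize the result.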
Re: KVM PMU virtualization
On Thu, 2010-02-25 at 17:26 +0100, Ingo Molnar wrote: * Jan Kiszka jan.kis...@siemens.com wrote: Jes Sorensen wrote:

Hi,

It looks like several of us have been looking at how to use the PMU for virtualization. Rather than continuing to have discussions in smaller groups, I think it is a good idea to move it to the mailing lists to see what we can share and avoid duplicate efforts. There are really two separate things to handle:

1) Add support to perf to allow it to monitor a KVM guest from the host.
2) Allow guests access to the PMU (or an emulated PMU), making it possible to run perf on applications running within the guest.

I know some of you have been looking at 1), and I am currently working on 2). I have been looking at various approaches, including whether it is feasible to share the PMU between the host and multiple guests. For now I am going to focus on allowing one guest to take control of the PMU, then later hopefully adding support for multiplexing it between multiple guests.

Given that perf can apply the PMU to individual host tasks, I don't see fundamental problems multiplexing it between individual guests (which can then internally multiplex it again).

In terms of how to expose it to guests, a 'soft PMU' might be a usable approach. Although to Linux guests you could expose much more functionality and a non-PMU-limited number of instrumentation events, via a more intelligent interface. But note that in terms of handling it on the host side, the raw-PMU approach is not acceptable: instead it should map to proper perf_events, not try to muck with the PMU itself. That, besides integrating properly with perf usage on the host, will also allow interesting 'PMU' features on guests: you could set up the host side to trace block IO requests (or VM exits), for example, and expose that as 'PMC #0' on the guest side.

So virtualization becomes non-transparent to the guest os? I know virtio is an optimization on the guest side.
Re: KVM PMU virtualization
On Thu, 2010-02-25 at 18:34 +0100, Joerg Roedel wrote: On Thu, Feb 25, 2010 at 04:04:28PM +0100, Jes Sorensen wrote:

1) Add support to perf to allow it to monitor a KVM guest from the host.

This shouldn't be a big problem. The PMU of AMD Fam10 processors can be configured to count only when in guest mode. Perf needs to be aware of that and fetch the rip from a different place when monitoring a guest.

The idea is we want to measure both host and guest at the same time, and compare all the hot functions fairly.

2) Allow guests access to the PMU (or an emulated PMU), making it possible to run perf on applications running within the guest.

The biggest problem I see here is teaching the guest about the available events. The available event sets are dependent on the processor family (at least on AMD). A simple approach would be shadowing the perf msrs, which is a simple thing to do. More problematic is the reinjection of performance interrupts and performance NMIs. I personally don't like a self-defined event set as the only solution, because that would probably only work with linux and perf. I think we should have a way (in addition to a soft-event interface) which allows exposing the host pmu events to the guest.

Joerg