Re: [kvm-devel] stable distro for kvm?
-----Original Message----- From: [EMAIL PROTECTED] on behalf of Andrey Dmitriev Sent: Tue 2/12/2008 11:20 PM To: kvm-devel@lists.sourceforge.net Subject: [kvm-devel] stable distro for kvm?

> Any recommendations or link to plans for a stable KVM with any major
> distro?

The latest KVM (kvm-60) is stable; Fedora 7+ and RHEL 5+ have good support for KVM. openSUSE/SLES 11.x will likely have good KVM support too.

> I've read somewhere that Ubuntu will support it soon (how soon?), but I
> thought it was based on Debian, and it doesn't seem to be part of Etch
> yet (if I switch to unstable, I seem to be able to get it).

Debian Etch (Stable) was feature-frozen before KVM was released. (Etch was released in 2007, but it had been in feature freeze since the summer of 2006.) As with any Debian Stable release, it takes time for new features to get in; they need to be ready nearly a year before they land in Debian Stable.

Both Ubuntu 8.04 LTS and Debian Lenny will have good KVM support. openSUSE 10.3, Ubuntu 7.04, and Ubuntu 7.10 include an early KVM (basic support), so they have some bugs.

In any case, I recommend installing kvm-60, because it is more stable.

-Alexey Technologov, Qumranet QA Team Member.

-------------------------------------------------------------------------
This SF.net email is sponsored by: Microsoft
Defy all challenges. Microsoft(R) Visual Studio 2008.
http://clk.atdmt.com/MRT/go/vse012070mrt/direct/01/
_______________________________________________
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel
[kvm-devel] Clock off in guest
Hi,

I'm running a Linux AMD64 guest on an AMD64 host. The host is running a 2.6.23 kernel (self compiled); the guest is running a stock linux-image-2.6.22-3-amd64 Debian kernel. My problem is that the clock on the guest is off (slow), while the clock on the host seems to be OK. When doing 'time sleep 10' on the guest, it takes about 16 'real' seconds for it to finish.

I can change the host kernel, the guest kernel, or fiddle around with kvm command line options, but I would like some guidance on where to start. Which option has the largest probability of solving my problem?

Let me know if you need some dmesg output or other log data; I didn't want to flood the list with random logs of both the host and the guest.

Best, Koen
Re: [kvm-devel] [PATCH 3/3] KVM: SVM: enable LBRV virtualization
Joerg Roedel wrote:
>> This still has the same issue as the previous patchset: if the guest
>> enables some other bit in MSR_IA32_DEBUGCTLMSR, we silently ignore it.
>> We should either pr_unimpl() on such bits or not handle them
>> (ultimately injecting a #GP).
>
> That's not true. The patch saves the MSR value in vmcb->save.dbgctl.
> This value is returned on reads of that MSR, so no bit is ignored.
> This value in the VMCB is also used as the guest's copy of that MSR if
> LBR virtualization is enabled.

Right, my mistake.

-- 
Any sufficiently difficult bug is indistinguishable from a feature.
Re: [kvm-devel] [PATCH 3/3] KVM: SVM: enable LBRV virtualization
On Wed, Feb 13, 2008 at 11:50:58AM +0200, Avi Kivity wrote:
> Joerg Roedel wrote:
>> @@ -1224,6 +1261,15 @@ static int svm_set_msr(struct kvm_vcpu *vcpu, unsigned ecx, u64 data)
>>  		if (data != 0)
>>  			goto unhandled;
>>  		break;
>> +	case MSR_IA32_DEBUGCTLMSR:
>> +		svm->vmcb->save.dbgctl = data;
>> +		if (!svm_has(SVM_FEATURE_LBRV))
>> +			break;
>> +		if (data & (1ULL << 0))
>> +			svm_enable_lbrv(svm);
>> +		else
>> +			svm_disable_lbrv(svm);
>> +		break;
>>  	default:
>>  	unhandled:
>>  		return kvm_set_msr_common(vcpu, ecx, data);
>
> This still has the same issue as the previous patchset: if the guest
> enables some other bit in MSR_IA32_DEBUGCTLMSR, we silently ignore it.
> We should either pr_unimpl() on such bits or not handle them
> (ultimately injecting a #GP).

That's not true. The patch saves the MSR value in vmcb->save.dbgctl. This value is returned on reads of that MSR, so no bit is ignored. This value in the VMCB is also used as the guest's copy of that MSR if LBR virtualization is enabled.

But there is another issue: I should ensure the guest does not set reserved bits in that MSR.

> Also, I'd like a simple patch for 2.6.25 to add support for Windows x86
> on AMD. So if the first patch in the series can add support for the
> bits that Windows sets in MSR_IA32_DEBUGCTLMSR (I imagine it just
> writes zero?) then I can queue that for 2.6.25 and the rest for 2.6.26.

Ok, I will work that into the patchset.

Joerg

-- 
           | AMD Saxony Limited Liability Company & Co. KG
 Operating | Wilschdorfer Landstr. 101, 01109 Dresden, Germany
 System    | Register Court Dresden: HRA 4896
 Research  | General Partner authorized to represent:
 Center    | AMD Saxony LLC (Wilmington, Delaware, US)
           | General Manager of AMD Saxony LLC: Dr. Hans-R. Deppe, Thomas McCoy
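[Editor's note] The reserved-bit check Joerg agrees to add can be sketched in plain C. The mask (defined DEBUGCTL bits on AMD in the low six positions) and the helper name below are assumptions for illustration, not the actual patch:

```c
#include <stdint.h>

/* Assumed mask: only LBR, BTF and the four performance-monitor
 * pin-control bits (bits 0-5) of DEBUGCTL are treated as defined;
 * everything above them is reserved. */
#define DEBUGCTL_RESERVED_BITS (~0x3fULL)

/* Returns 1 if the write is acceptable; 0 means the caller should
 * inject a #GP into the guest instead of updating vmcb->save.dbgctl. */
static int debugctl_write_valid(uint64_t data)
{
    return (data & DEBUGCTL_RESERVED_BITS) == 0;
}
```

A write of zero (what Windows apparently does) passes the check, so the simple 2.6.25 patch Avi asks for stays compatible with it.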
Re: [kvm-devel] [PATCH 3/3] KVM: SVM: enable LBRV virtualization
Joerg Roedel wrote:
> @@ -1224,6 +1261,15 @@ static int svm_set_msr(struct kvm_vcpu *vcpu, unsigned ecx, u64 data)
>  		if (data != 0)
>  			goto unhandled;
>  		break;
> +	case MSR_IA32_DEBUGCTLMSR:
> +		svm->vmcb->save.dbgctl = data;
> +		if (!svm_has(SVM_FEATURE_LBRV))
> +			break;
> +		if (data & (1ULL << 0))
> +			svm_enable_lbrv(svm);
> +		else
> +			svm_disable_lbrv(svm);
> +		break;
>  	default:
>  	unhandled:
>  		return kvm_set_msr_common(vcpu, ecx, data);

This still has the same issue as the previous patchset: if the guest enables some other bit in MSR_IA32_DEBUGCTLMSR, we silently ignore it. We should either pr_unimpl() on such bits or not handle them (ultimately injecting a #GP).

Also, I'd like a simple patch for 2.6.25 to add support for Windows x86 on AMD. So if the first patch in the series can add support for the bits that Windows sets in MSR_IA32_DEBUGCTLMSR (I imagine it just writes zero?) then I can queue that for 2.6.25 and the rest for 2.6.26.

-- 
Any sufficiently difficult bug is indistinguishable from a feature.
Re: [kvm-devel] [PATCH 1/2] kvmclock - the host part.
Glauber de Oliveira Costa wrote:
> This is the host part of the kvm clocksource implementation. As it does
> not include clockevents, it is a fairly simple implementation. We only
> have to register a per-vcpu area, and start writing to it periodically.
> The area is binary compatible with xen, as we use the same shadow_info
> structure.
>
> +static void kvm_write_wall_clock(struct kvm_vcpu *v, gpa_t wall_clock)
> +{
> +	int version = 1;
> +	struct kvm_wall_clock wc;
> +	unsigned long flags;
> +	struct timespec wc_ts;
> +
> +	local_irq_save(flags);
> +	kvm_get_msr(v, MSR_IA32_TIME_STAMP_COUNTER,
> +		    &v->arch.hv_clock.tsc_timestamp);

Why is this needed? IIRC the wall clock is not tied to any vcpu. If we can remove this, the argument to the function should be kvm, not kvm_vcpu. We can remove the irq games as well.

> +	wc_ts = current_kernel_time();
> +	local_irq_restore(flags);
> +
> +	down_write(&current->mm->mmap_sem);
> +	kvm_write_guest(v->kvm, wall_clock, &version, sizeof(version));
> +	up_write(&current->mm->mmap_sem);

Why down_write? Accidentally or on purpose? For mutual exclusion, I suggest taking kvm->lock instead (for the entire function).

> +	/* With all the info we got, fill in the values */
> +	wc.wc_sec = wc_ts.tv_sec;
> +	wc.wc_nsec = wc_ts.tv_nsec;
> +	wc.wc_version = ++version;
> +
> +	down_write(&current->mm->mmap_sem);
> +	kvm_write_guest(v->kvm, wall_clock, &wc, sizeof(wc));
> +	up_write(&current->mm->mmap_sem);

This should be done in three steps: write version, write data, write version. kvm_write_guest doesn't guarantee any order. It may fail as well, and we need to handle that.

> +/* xen binary-compatible interface. See xen headers for details */
> +struct kvm_vcpu_time_info {
> +	uint32_t version;
> +	uint32_t pad0;
> +	uint64_t tsc_timestamp;
> +	uint64_t system_time;
> +	uint32_t tsc_to_system_mul;
> +	int8_t   tsc_shift;
> +}; /* 32 bytes */
> +
> +struct kvm_wall_clock {
> +	uint32_t wc_version;
> +	uint32_t wc_sec;
> +	uint32_t wc_nsec;
> +};

These structures are dangerously sized. Suggest __attribute__((__packed__))
(or some padding at the end of kvm_vcpu_time_info).

> diff --git a/include/linux/kvm.h b/include/linux/kvm.h
> index 4de4fd2..78ce53f 100644
> --- a/include/linux/kvm.h
> +++ b/include/linux/kvm.h
> @@ -232,6 +232,7 @@
>  #define KVM_CAP_USER_MEMORY 3
>  #define KVM_CAP_SET_TSS_ADDR 4
>  #define KVM_CAP_EXT_CPUID 5
>  #define KVM_CAP_VAPIC 6
> +#define KVM_CAP_CLOCKSOURCE 7

Please refresh against kvm.git, this has changed a bit.

-- 
Any sufficiently difficult bug is indistinguishable from a feature.
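[Editor's note] The three-step update Avi asks for is the classic version-counter (seqlock-style) protocol the xen interface uses. A minimal userspace sketch, with hypothetical names and plain stores standing in for the guest-memory writes and barriers:

```c
#include <stdint.h>

/* Mirrors kvm_wall_clock from the patch. */
struct wall_clock {
    uint32_t version;
    uint32_t sec;
    uint32_t nsec;
};

/* Writer: bump version to an odd value ("update in progress"), write
 * the payload, then bump it to an even value ("consistent"). A real
 * implementation issues write barriers between the steps. */
static void wall_clock_update(struct wall_clock *wc,
                              uint32_t sec, uint32_t nsec)
{
    wc->version++;          /* step 1: odd version */
    wc->sec = sec;          /* step 2: payload */
    wc->nsec = nsec;
    wc->version++;          /* step 3: even version */
}

/* Reader: returns 1 on a consistent snapshot, 0 if the caller must
 * retry because a writer was active or finished in between. */
static int wall_clock_read(const struct wall_clock *wc,
                           uint32_t *sec, uint32_t *nsec)
{
    uint32_t v = wc->version;
    if (v & 1)
        return 0;
    *sec = wc->sec;
    *nsec = wc->nsec;
    return wc->version == v;
}
```

Because the version is written before and after the payload, a guest that rereads until it sees an unchanged even version can never observe a torn value even though kvm_write_guest itself guarantees no ordering.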
Re: [kvm-devel] Clock off in guest
On Wed, Feb 13, 2008 at 10:41:44AM +0100, Koen Vermeer wrote:
> Hi, I'm running a Linux AMD64 guest on an AMD64 host. The host is
> running a 2.6.23 kernel (self compiled); the guest is running a stock
> linux-image-2.6.22-3-amd64 Debian kernel. My problem is that the clock
> on the guest is off (slow), while the clock on the host seems to be OK.
> When doing 'time sleep 10' on the guest, it takes about 16 'real'
> seconds for it to finish.

What is the clock source of your guest? If it is the traditional PIT, you are describing a known problem on busy hosts. You may try setting clocksource=tsc in the guest kernel parameters.

Regards,
Dan.
[kvm-devel] Signals for file descriptors
I am wondering about this commit,

http://git.kernel.org/?p=virt/kvm/kvm-userspace.git;a=commit;h=b4e392c21c4b98c1c13af353caa3d6e6bcb6b8af

which adds signals on tap I/O. It seems a bit half-done to me; for one thing, it mixes timers with I/O.

Anyway, my question is about the remaining file descriptors. Should signals be activated for them as well, for example in qemu_set_fd_handler2()? The example I have on hand is that connecting a VNC client currently delays until the next timer expiry.

Anders.
Re: [kvm-devel] Clock off in guest
From: [EMAIL PROTECTED] on behalf of Dan Kenigsberg Sent: Wed 13/02/2008 13:25 To: Koen Vermeer Cc: kvm-devel@lists.sourceforge.net Subject: Re: [kvm-devel] Clock off in guest

> On Wed, Feb 13, 2008 at 10:41:44AM +0100, Koen Vermeer wrote:
>> Hi, I'm running a Linux AMD64 guest on an AMD64 host. The host is
>> running a 2.6.23 kernel (self compiled); the guest is running a stock
>> linux-image-2.6.22-3-amd64 Debian kernel. My problem is that the clock
>> on the guest is off (slow), while the clock on the host seems to be
>> OK. When doing 'time sleep 10' on the guest, it takes about 16 'real'
>> seconds for it to finish.
>
> What is the clock source of your guest? If it is the traditional PIT,
> you are describing a known problem on busy hosts. You may try setting
> clocksource=tsc in the guest kernel parameters.

Hi,

This would not work if you are using an old version of kvm (with no in-kernel apic). I recommend upgrading to kvm-60 (or the latest Linux kernel). As an alternative, probably not as good, sometimes (when the guest's clocksource is PIT) adding '-tdf' to the command line helps.

Uri.
Re: [kvm-devel] Signals for file descriptors
Anders Melchiorsen wrote:
> I am wondering about this commit,
> http://git.kernel.org/?p=virt/kvm/kvm-userspace.git;a=commit;h=b4e392c21c4b98c1c13af353caa3d6e6bcb6b8af
> which adds signals on tap I/O. It seems a bit half-done to me. For one
> thing, it mixes timers with I/O.

The signal handler doesn't actually matter; all that's needed is to break out of the loop.

> Anyway, my question is about the remaining file descriptors. Should
> signals be activated for them as well, for example in
> qemu_set_fd_handler2()? The example I have on hand is that connecting a
> VNC client currently delays until the next timer expiry.

In practice it doesn't matter, but yes, any fd on which we will select() needs to have a signal attached.

-- 
Any sufficiently difficult bug is indistinguishable from a feature.
Re: [kvm-devel] Signals for file descriptors
Avi Kivity [EMAIL PROTECTED] writes:
> In practice it doesn't matter, but yes, any fd on which we will
> select() needs to have a signal attached.

It matters to me, because I am removing the periodic timers, and so I ended up in a situation where I could not attach with VNC at all (well, strace would break the loop).

With your answers in mind, I will prepare a patch to add this.

Thanks,
Anders.
Re: [kvm-devel] Clock off in guest
> This would not work if you are using an old version of kvm (with no
> in-kernel apic). I recommend upgrading to kvm-60 (or the latest Linux
> kernel).

Should I upgrade the guest kernel or the host kernel? My bet is the host kernel, but the clocksource=tsc applies to the guest, so I'm not really sure...

> As an alternative, probably not as good, sometimes (when the guest's
> clocksource is PIT) adding '-tdf' to the command line helps.

I cannot find this option in 'man kvm' or 'man qemu'. Should I add it to the command line that starts the guest?

Best, Koen
Re: [kvm-devel] Clock off in guest
Hi Dan,

> On Wed, Feb 13, 2008 at 10:41:44AM +0100, Koen Vermeer wrote:
>> I'm running a Linux AMD64 guest on an AMD64 host. [...] When doing
>> 'time sleep 10' on the guest, it takes about 16 'real' seconds for it
>> to finish.
>
> What is the clock source of your guest? If it is the traditional PIT,
> you are describing a known problem on busy hosts. You may try setting
> clocksource=tsc in the guest kernel parameters.

Thanks for your reply. I tried looking for PIT in dmesg, which didn't give me anything. Then I tried 'dmesg | grep -i tsc' and got:

Time: tsc clocksource has been installed.

I then ran 'dmesg | grep -i clock' and got:

ACPI: PM-Timer IO Port: 0xb008
time.c: Detected 1803.751 MHz processor.
Calibrating delay using timer specific routine.. 14442.19 BogoMIPS (lpj=28884398)
Using local APIC timer interrupts.
Detected 62.628 MHz APIC timer.
* Found PM-Timer Bug on the chipset. Due to workarounds for a bug,
Time: tsc clocksource has been installed.
Real Time Clock Driver v1.12ac
PCI: Setting latency timer of device :00:01.1 to 64
PCI: Setting latency timer of device :00:01.2 to 64

The kernel command line is rather simple: root=/dev/hda1 ro quiet

Does that answer your question, or should I be looking somewhere else?

Thanks! Koen
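[Editor's note] On guest kernels recent enough to expose it, the active clocksource can also be read directly from sysfs instead of grepping dmesg; the sysfs path is standard, and the dmesg fallback below covers older kernels:

```shell
# Show the clocksource the guest kernel is actually using, plus the
# alternatives it detected; fall back to dmesg if sysfs lacks the files.
d=/sys/devices/system/clocksource/clocksource0
if [ -r "$d/current_clocksource" ]; then
    echo "current:   $(cat "$d/current_clocksource")"
    echo "available: $(cat "$d/available_clocksource")"
else
    dmesg | grep -i clocksource || true
fi
```

Writing a different value into current_clocksource switches the clocksource at runtime, which is handy for testing tsc vs. pit without a reboot.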
Re: [kvm-devel] [patch 0/6] MMU Notifiers V6
GRU
- Simple additional hardware TLB (possibly covering multiple instances of Linux)
- Needs TLB shootdown when the VM unmaps pages.
- Determines page address via follow_page (from interrupt context) but can fall back to get_user_pages().
- No page reference possible since no page status is kept.

I applied the latest mmuops patch to a 2.6.24 kernel and updated the GRU driver to use it. As far as I can tell, everything works OK. Although more testing is needed, all current tests of driver functionality are working on both a system simulator and a hardware simulator.

The driver itself is still a few weeks from being ready to post, but I can send code fragments of the portions related to mmuops or external TLB management if anyone is interested.

--- jack
Re: [kvm-devel] Clock off in guest
On Wed, Feb 13, 2008 at 02:00:38PM +0100, [EMAIL PROTECTED] wrote:
>>> What is the clock source of your guest? If it is the traditional PIT,
>>> you are describing a known problem on busy hosts. You may try setting
>>> clocksource=tsc in the guest kernel parameters.
>>
>> Thanks for your reply. I tried looking for PIT in dmesg, which didn't
>> give me anything. Then I tried 'dmesg | grep -i tsc' and got:
>> Time: tsc clocksource has been installed.
>
> If this was done in the guest then, yes, your guest's clocksource is
> tsc.

Yes, this was done on the guest. The host shows:

Marking TSC unstable due to TSCs unsynchronized
hpet0: at MMIO 0xfed0, IRQs 2, 8, 31
hpet0: 3 32-bit timers, 2500 Hz
Time: hpet clocksource has been installed.

> However, as Uri mentioned earlier, this is useful only with newer KVMs.
> I assume that your host runs the kvm from 2.6.23, which is pretty old
> on the kvm timescale. Try downloading kvm-60, insmod it on your host
> and try running your guest.

I'll try building a kvm-60 module, run the guest with that, and report the results. I cannot do that immediately, though, because people are actually using the guest system.

Best, Koen
[kvm-devel] [PATCH] KVM: SVM: fix Windows XP 64 bit installation crash
While installing, Windows XP 64 bit wants to access the DEBUGCTL and the last branch record (LBR) MSRs. Not allowing this in KVM causes the installation to crash. This patch allows access to these MSRs and fixes the issue.

Signed-off-by: Joerg Roedel [EMAIL PROTECTED]
Signed-off-by: Markus Rechberger [EMAIL PROTECTED]
---
 arch/x86/kvm/svm.c |   22 ++++++++++++++++++++++
 1 files changed, 22 insertions(+), 0 deletions(-)

diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index 13765e9..1ef3e7b 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -1155,6 +1155,24 @@ static int svm_get_msr(struct kvm_vcpu *vcpu, unsigned ecx, u64 *data)
 	case MSR_IA32_SYSENTER_ESP:
 		*data = svm->vmcb->save.sysenter_esp;
 		break;
+	/* Nobody will change the following 5 values in the VMCB so
+	   we can safely return them on rdmsr. They will always be 0
+	   until LBRV is implemented. */
+	case MSR_IA32_DEBUGCTLMSR:
+		*data = svm->vmcb->save.dbgctl;
+		break;
+	case MSR_IA32_LASTBRANCHFROMIP:
+		*data = svm->vmcb->save.br_from;
+		break;
+	case MSR_IA32_LASTBRANCHTOIP:
+		*data = svm->vmcb->save.br_to;
+		break;
+	case MSR_IA32_LASTINTFROMIP:
+		*data = svm->vmcb->save.last_excp_from;
+		break;
+	case MSR_IA32_LASTINTTOIP:
+		*data = svm->vmcb->save.last_excp_to;
+		break;
 	default:
 		return kvm_get_msr_common(vcpu, ecx, data);
 	}
@@ -1215,6 +1233,10 @@ static int svm_set_msr(struct kvm_vcpu *vcpu, unsigned ecx, u64 data)
 	case MSR_IA32_SYSENTER_ESP:
 		svm->vmcb->save.sysenter_esp = data;
 		break;
+	case MSR_IA32_DEBUGCTLMSR:
+		pr_unimpl(vcpu, "%s: MSR_IA32_DEBUGCTLMSR 0x%llx, nop\n",
+			  __FUNCTION__, data);
+		break;
 	case MSR_K7_EVNTSEL0:
 	case MSR_K7_EVNTSEL1:
 	case MSR_K7_EVNTSEL2:
-- 
1.5.3.7
Re: [kvm-devel] Clock off in guest
On Wed, Feb 13, 2008 at 02:00:38PM +0100, [EMAIL PROTECTED] wrote:
> Hi Dan,
>>> I'm running a Linux AMD64 guest on an AMD64 host. [...] When doing
>>> 'time sleep 10' on the guest, it takes about 16 'real' seconds for it
>>> to finish.
>>
>> What is the clock source of your guest? If it is the traditional PIT,
>> you are describing a known problem on busy hosts. You may try setting
>> clocksource=tsc in the guest kernel parameters.
>
> Thanks for your reply. I tried looking for PIT in dmesg, which didn't
> give me anything. Then I tried 'dmesg | grep -i tsc' and got:
> Time: tsc clocksource has been installed.

If this was done in the guest then, yes, your guest's clocksource is tsc. However, as Uri mentioned earlier, this is useful only with newer KVMs. I assume that your host runs the kvm from 2.6.23, which is pretty old on the kvm timescale. Try downloading kvm-60, insmod it on your host and try running your guest.

regards,
Dan.
Re: [kvm-devel] [ofa-general] Re: Demand paging for memory regions
Chelsio's T3 HW doesn't support this.

For ehca we currently can't modify a large MR once it has been allocated. The eHCA hardware expects the pages to be there (MRs must not have holes). This is also true for the global MR covering all of kernel space. Therefore we still need the memory to be pinned if ib_umem_get() is called. So with the current implementation we don't have much use for a notifier.

"It is difficult to make predictions, especially about the future."

Gruss / Regards
Christoph Raisch + Hoang-Nam Nguyen
Re: [kvm-devel] Clock off in guest
On Wed, 2008-02-13 at 17:55 +0200, Dan Kenigsberg wrote:
> On Wed, Feb 13, 2008 at 02:00:38PM +0100, [EMAIL PROTECTED] wrote:
> However, as Uri mentioned earlier, this is useful only with newer KVMs.
> I assume that your host runs the kvm from 2.6.23, which is pretty old
> on the kvm timescale. Try downloading kvm-60, insmod it on your host
> and try running your guest.

OK, I installed version 60 of both kvm and kvm-source (Debian Lenny), ran 'm-a a-i kvm', and just to be sure I removed and reloaded the kvm and kvm-amd modules. 'modinfo kvm' and 'modinfo kvm-amd' show that these are indeed version 60. I then restarted the guest (which is exactly the same as before) and tried the 'sleep 10' test again. Same result: it takes about 17 or 18 seconds for the prompt to return. I assume clocksource=tsc isn't useful, as the guest is already using the tsc.

So, what else can I try? Any command line parameter I can add to the kvm call? Kernel parameters in the guest? Update the guest OS to a newer kernel (2.6.22 to 2.6.24)? Update the host OS to a newer kernel (2.6.23 to 2.6.24)?

Thanks for the help! Koen
Re: [kvm-devel] [RFC] Qemu powerpc work around
On Wed, 2008-02-13 at 09:29 +0200, Avi Kivity wrote:
> Jerone Young wrote:
>> So the recent code in qemu cvs has problems on powerpc. What I have
>> done is mainly work around this in the build system, by creating a
>> ppcemb_kvm-softmmu target. Along with this is a fake-exec.c that stubs
>> out the functions that are no longer defined (something done by
>> Anthony Liguori attempting to fix qemu cvs). What do folks think about
>> this approach? For us, all we really need is a qemu that is not built
>> with a tcg dependency.
>
> Since a target in qemu is a cpu type, how the instructions are executed
> (kvm, kqemu, dyngen, or tcg) shouldn't come into it. Instead we can
> have a --without-cpu-emulation or --no-tcg which would simply disable
> those parts.

Actually, this is a much more sensible solution, so I took some time and implemented it. On the qemu configure command line you use --disable-cpu-emulation.

Signed-off-by: Jerone Young [EMAIL PROTECTED]

diff --git a/qemu/Makefile.target b/qemu/Makefile.target
--- a/qemu/Makefile.target
+++ b/qemu/Makefile.target
@@ -179,11 +179,17 @@ all: $(PROGS)
 #########################################################
 # cpu emulator library
-LIBOBJS=exec.o kqemu.o translate-all.o cpu-exec.o\
-        translate.o op.o host-utils.o
+LIBOBJS=exec.o kqemu.o cpu-exec.o host-utils.o
+
+ifneq ($(NO_CPU_EMULATION), 1)
+LIBOBJS+= translate-all.o translate.o op.o
 # TCG code generator
 LIBOBJS+= tcg/tcg.o tcg/tcg-dyngen.o tcg/tcg-runtime.o
 CPPFLAGS+=-I$(SRC_PATH)/tcg -I$(SRC_PATH)/tcg/$(ARCH)
+else
+LIBOBJS+= fake-exec.o
+endif
+
 ifeq ($(USE_KVM), 1)
 LIBOBJS+=qemu-kvm.o
 endif
diff --git a/qemu/configure b/qemu/configure
--- a/qemu/configure
+++ b/qemu/configure
@@ -110,6 +110,7 @@ darwin_user=no
 darwin_user=no
 build_docs=no
 uname_release=""
+cpu_emulation="yes"
 
 # OS specific
 targetos=`uname -s`
@@ -339,6 +340,8 @@ for opt do
   ;;
   --disable-werror) werror="no"
   ;;
+  --disable-cpu-emulation) cpu_emulation="no"
+  ;;
   *) echo "ERROR: unknown option $opt"; exit 1
   ;;
 esac
@@ -770,6 +773,7 @@ fi
 fi
 echo "kqemu support     $kqemu"
 echo "kvm support       $kvm"
+echo "CPU emulation     $cpu_emulation"
 echo "Documentation     $build_docs"
 [ ! -z "$uname_release" ] && \
 echo "uname -r          $uname_release"
@@ -1094,12 +1098,20 @@ interp_prefix1=`echo "$interp_prefix" |
 interp_prefix1=`echo "$interp_prefix" | sed "s/%M/$target_cpu/g"`
 echo "#define CONFIG_QEMU_PREFIX \"$interp_prefix1\"" >> $config_h
 
+disable_cpu_emulation() {
+  if test $cpu_emulation = "no"; then
+    echo "#define NO_CPU_EMULATION 1" >> $config_h
+    echo "NO_CPU_EMULATION=1" >> $config_mak
+  fi
+}
+
 configure_kvm() {
   if test $kvm = "yes" -a "$target_softmmu" = "yes" -a \
     \( "$cpu" = "i386" -o "$cpu" = "x86_64" -o "$cpu" = "ia64" -o "$cpu" = "powerpc" \); then
     echo "#define USE_KVM 1" >> $config_h
     echo "USE_KVM=1" >> $config_mak
     echo "CONFIG_KVM_KERNEL_INC=$kernel_path/include" >> $config_mak
+    disable_cpu_emulation
   fi
 }
diff --git a/qemu/exec.c b/qemu/exec.c
--- a/qemu/exec.c
+++ b/qemu/exec.c
@@ -35,7 +35,11 @@
 #include "cpu.h"
 #include "exec-all.h"
+
+#if !defined(NO_CPU_EMULATION)
 #include "tcg-target.h"
+#endif
+
 #include "qemu-kvm.h"
 #if defined(CONFIG_USER_ONLY)
 #include "qemu.h"
diff --git a/qemu/fake-exec.c b/qemu/fake-exec.c
new file mode 100644
--- /dev/null
+++ b/qemu/fake-exec.c
@@ -0,0 +1,62 @@
+#include <stdarg.h>
+#include <stdlib.h>
+#include <stdio.h>
+#include <string.h>
+#include <inttypes.h>
+
+#include "cpu.h"
+#include "exec-all.h"
+
+int code_copy_enabled = 0;
+
+void cpu_dump_state (CPUState *env, FILE *f,
+                     int (*cpu_fprintf)(FILE *f, const char *fmt, ...),
+                     int flags)
+{
+}
+
+void ppc_cpu_list (FILE *f, int (*cpu_fprintf)(FILE *f, const char *fmt, ...))
+{
+}
+
+void cpu_dump_statistics (CPUState *env, FILE *f,
+                          int (*cpu_fprintf)(FILE *f, const char *fmt, ...),
+                          int flags)
+{
+}
+
+unsigned long code_gen_max_block_size(void)
+{
+    return 32;
+}
+
+void cpu_gen_init(void)
+{
+}
+
+int cpu_restore_state(TranslationBlock *tb,
+                      CPUState *env, unsigned long searched_pc,
+                      void *puc)
+{
+    return 0;
+}
+
+int cpu_ppc_gen_code(CPUState *env, TranslationBlock *tb, int *gen_code_size_ptr)
+{
+    return 0;
+}
+
+const ppc_def_t *cpu_ppc_find_by_name (const unsigned char *name)
+{
+    return NULL;
+}
+
+int cpu_ppc_register_internal (CPUPPCState *env, const ppc_def_t *def)
+{
+    return 0;
+}
+
+void flush_icache_range(unsigned long start, unsigned long stop)
+{
+}
[kvm-devel] KVM: SVM: Implement LBR virtualization
This patch set enables virtualization of the last branch record (LBR) in SVM if this feature is supported by the hardware. Compared to the previous post, the fix for the Windows XP 64 bit install crash has been removed from this series and posted separately, to keep it small enough for 2.6.25. This patch set applies on top of that fix.

Joerg

diffstat:
 arch/x86/kvm/kvm_svm.h |    2 +
 arch/x86/kvm/svm.c     |  118 ++++++++++++++++++++++++++++++++-------------
 2 files changed, 77 insertions(+), 43 deletions(-)
[kvm-devel] [PATCH 2/3] KVM: SVM: allocate the MSR permission map per VCPU
This patch changes the kvm-amd module to allocate the SVM MSR permission map per VCPU instead of one global map for all VCPUs. This gives us more flexibility, allowing specific guests to access virtualized MSRs. This is required for LBR virtualization.

Signed-off-by: Joerg Roedel [EMAIL PROTECTED]
Signed-off-by: Markus Rechberger [EMAIL PROTECTED]
---
 arch/x86/kvm/kvm_svm.h |    2 +
 arch/x86/kvm/svm.c     |   67 +++-
 2 files changed, 34 insertions(+), 35 deletions(-)

diff --git a/arch/x86/kvm/kvm_svm.h b/arch/x86/kvm/kvm_svm.h
index ecdfe97..65ef0fc 100644
--- a/arch/x86/kvm/kvm_svm.h
+++ b/arch/x86/kvm/kvm_svm.h
@@ -39,6 +39,8 @@ struct vcpu_svm {
 	unsigned long host_db_regs[NUM_DB_REGS];
 	unsigned long host_dr6;
 	unsigned long host_dr7;
+
+	u32 *msrpm;
 };

 #endif
diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index 5a69619..3b31162 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -65,7 +65,6 @@ static inline struct vcpu_svm *to_svm(struct kvm_vcpu *vcpu)
 }

 unsigned long iopm_base;
-unsigned long msrpm_base;

 struct kvm_ldttss_desc {
 	u16 limit0;
@@ -370,12 +369,29 @@ static void set_msr_interception(u32 *msrpm, unsigned msr,
 	BUG();
 }

+static void svm_vcpu_init_msrpm(u32 *msrpm)
+{
+	memset(msrpm, 0xff, PAGE_SIZE * (1 << MSRPM_ALLOC_ORDER));
+
+#ifdef CONFIG_X86_64
+	set_msr_interception(msrpm, MSR_GS_BASE, 1, 1);
+	set_msr_interception(msrpm, MSR_FS_BASE, 1, 1);
+	set_msr_interception(msrpm, MSR_KERNEL_GS_BASE, 1, 1);
+	set_msr_interception(msrpm, MSR_LSTAR, 1, 1);
+	set_msr_interception(msrpm, MSR_CSTAR, 1, 1);
+	set_msr_interception(msrpm, MSR_SYSCALL_MASK, 1, 1);
+#endif
+	set_msr_interception(msrpm, MSR_K6_STAR, 1, 1);
+	set_msr_interception(msrpm, MSR_IA32_SYSENTER_CS, 1, 1);
+	set_msr_interception(msrpm, MSR_IA32_SYSENTER_ESP, 1, 1);
+	set_msr_interception(msrpm, MSR_IA32_SYSENTER_EIP, 1, 1);
+}
+
 static __init int svm_hardware_setup(void)
 {
 	int cpu;
 	struct page *iopm_pages;
-	struct page *msrpm_pages;
-	void *iopm_va, *msrpm_va;
+	void *iopm_va;
 	int r;

 	iopm_pages = alloc_pages(GFP_KERNEL, IOPM_ALLOC_ORDER);

@@ -388,37 +404,13 @@ static __init int svm_hardware_setup(void)
 	clear_bit(0x80, iopm_va); /* allow direct access to PC debug port */
 	iopm_base = page_to_pfn(iopm_pages) << PAGE_SHIFT;

-	msrpm_pages = alloc_pages(GFP_KERNEL, MSRPM_ALLOC_ORDER);
-
-	r = -ENOMEM;
-	if (!msrpm_pages)
-		goto err_1;
-
-	msrpm_va = page_address(msrpm_pages);
-	memset(msrpm_va, 0xff, PAGE_SIZE * (1 << MSRPM_ALLOC_ORDER));
-	msrpm_base = page_to_pfn(msrpm_pages) << PAGE_SHIFT;
-
-#ifdef CONFIG_X86_64
-	set_msr_interception(msrpm_va, MSR_GS_BASE, 1, 1);
-	set_msr_interception(msrpm_va, MSR_FS_BASE, 1, 1);
-	set_msr_interception(msrpm_va, MSR_KERNEL_GS_BASE, 1, 1);
-	set_msr_interception(msrpm_va, MSR_LSTAR, 1, 1);
-	set_msr_interception(msrpm_va, MSR_CSTAR, 1, 1);
-	set_msr_interception(msrpm_va, MSR_SYSCALL_MASK, 1, 1);
-#endif
-	set_msr_interception(msrpm_va, MSR_K6_STAR, 1, 1);
-	set_msr_interception(msrpm_va, MSR_IA32_SYSENTER_CS, 1, 1);
-	set_msr_interception(msrpm_va, MSR_IA32_SYSENTER_ESP, 1, 1);
-	set_msr_interception(msrpm_va, MSR_IA32_SYSENTER_EIP, 1, 1);
-
 	if (boot_cpu_has(X86_FEATURE_NX))
 		kvm_enable_efer_bits(EFER_NX);

 	for_each_online_cpu(cpu) {
 		r = svm_cpu_init(cpu);
 		if (r)
-			goto err_2;
+			goto err;
 	}

 	svm_features = cpuid_edx(SVM_CPUID_FUNC);
@@ -438,10 +430,7 @@ static __init int svm_hardware_setup(void)
 	return 0;

-err_2:
-	__free_pages(msrpm_pages, MSRPM_ALLOC_ORDER);
-	msrpm_base = 0;
-err_1:
+err:
 	__free_pages(iopm_pages, IOPM_ALLOC_ORDER);
 	iopm_base = 0;
 	return r;
@@ -449,9 +438,8 @@ err_1:
 static __exit void svm_hardware_unsetup(void)
 {
-	__free_pages(pfn_to_page(msrpm_base >> PAGE_SHIFT), MSRPM_ALLOC_ORDER);
 	__free_pages(pfn_to_page(iopm_base >> PAGE_SHIFT), IOPM_ALLOC_ORDER);
-	iopm_base = msrpm_base = 0;
+	iopm_base = 0;
 }

 static void init_seg(struct vmcb_seg *seg)
@@ -536,7 +524,7 @@ static void init_vmcb(struct vcpu_svm *svm)
 				(1ULL << INTERCEPT_MWAIT);

 	control->iopm_base_pa = iopm_base;
-	control->msrpm_base_pa = msrpm_base;
+	control->msrpm_base_pa = __pa(svm->msrpm);
 	control->tsc_offset = 0;
 	control->int_ctl = V_INTR_MASKING_MASK;
@@ -615,6 +603,7 @@ static struct kvm_vcpu *svm_create_vcpu(struct kvm *kvm, unsigned int id)
 {
 	struct vcpu_svm *svm;
 	struct page *page;
+	struct page *msrpm_pages;
 	int err;

 	svm =
[kvm-devel] [PATCH 3/3] KVM: SVM: enable LBR virtualization
This patch implements the Last Branch Record Virtualization (LBRV) feature of the AMD Barcelona and Phenom processors in the kvm-amd module. It is only enabled when the guest enables last branch recording in the DEBUG_CTL MSR, so there is no added world-switch overhead when the guest doesn't use these MSRs.

Signed-off-by: Joerg Roedel [EMAIL PROTECTED]
Signed-off-by: Markus Rechberger [EMAIL PROTECTED]
---
 arch/x86/kvm/svm.c |   39 +-
 1 files changed, 37 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index 3b31162..e1d139f 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -47,6 +47,8 @@ MODULE_LICENSE("GPL");
 #define SVM_FEATURE_LBRV (1 << 1)
 #define SVM_DEATURE_SVML (1 << 2)

+#define DEBUGCTL_RESERVED_BITS (~(0x3fULL))
+
 /* enable NPT for AMD64 and X86 with PAE */
 #if defined(CONFIG_X86_64) || defined(CONFIG_X86_PAE)
 static bool npt_enabled = true;
@@ -387,6 +389,28 @@ static void svm_vcpu_init_msrpm(u32 *msrpm)
 	set_msr_interception(msrpm, MSR_IA32_SYSENTER_EIP, 1, 1);
 }

+static void svm_enable_lbrv(struct vcpu_svm *svm)
+{
+	u32 *msrpm = svm->msrpm;
+
+	svm->vmcb->control.lbr_ctl = 1;
+	set_msr_interception(msrpm, MSR_IA32_LASTBRANCHFROMIP, 1, 1);
+	set_msr_interception(msrpm, MSR_IA32_LASTBRANCHTOIP, 1, 1);
+	set_msr_interception(msrpm, MSR_IA32_LASTINTFROMIP, 1, 1);
+	set_msr_interception(msrpm, MSR_IA32_LASTINTTOIP, 1, 1);
+}
+
+static void svm_disable_lbrv(struct vcpu_svm *svm)
+{
+	u32 *msrpm = svm->msrpm;
+
+	svm->vmcb->control.lbr_ctl = 0;
+	set_msr_interception(msrpm, MSR_IA32_LASTBRANCHFROMIP, 0, 0);
+	set_msr_interception(msrpm, MSR_IA32_LASTBRANCHTOIP, 0, 0);
+	set_msr_interception(msrpm, MSR_IA32_LASTINTFROMIP, 0, 0);
+	set_msr_interception(msrpm, MSR_IA32_LASTINTTOIP, 0, 0);
+}
+
 static __init int svm_hardware_setup(void)
 {
 	int cpu;
@@ -1231,8 +1255,19 @@ static int svm_set_msr(struct kvm_vcpu *vcpu, unsigned ecx, u64 data)
 		svm->vmcb->save.sysenter_esp = data;
 		break;
 	case MSR_IA32_DEBUGCTLMSR:
-		pr_unimpl(vcpu, "%s: MSR_IA32_DEBUGCTLMSR 0x%llx, nop\n",
-				__FUNCTION__, data);
+		if (!svm_has(SVM_FEATURE_LBRV)) {
+			pr_unimpl(vcpu, "%s: MSR_IA32_DEBUGCTL 0x%llx, nop\n",
+					__FUNCTION__, data);
+			break;
+		}
+		if (data & DEBUGCTL_RESERVED_BITS)
+			return 1;
+
+		svm->vmcb->save.dbgctl = data;
+		if (data & (1ULL << 0))
+			svm_enable_lbrv(svm);
+		else
+			svm_disable_lbrv(svm);
 		break;
 	case MSR_K7_EVNTSEL0:
 	case MSR_K7_EVNTSEL1:
-- 
1.5.3.7
[kvm-devel] [PATCH 1/3] KVM: SVM: let init_vmcb() take struct vcpu_svm as parameter
Change the parameter of the init_vmcb() function in the kvm-amd module from struct vmcb to struct vcpu_svm.

Signed-off-by: Joerg Roedel [EMAIL PROTECTED]
Signed-off-by: Markus Rechberger [EMAIL PROTECTED]
---
 arch/x86/kvm/svm.c |   12 ++--
 1 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index 1ef3e7b..5a69619 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -471,10 +471,10 @@ static void init_sys_seg(struct vmcb_seg *seg, uint32_t type)
 	seg->base = 0;
 }

-static void init_vmcb(struct vmcb *vmcb)
+static void init_vmcb(struct vcpu_svm *svm)
 {
-	struct vmcb_control_area *control = &vmcb->control;
-	struct vmcb_save_area *save = &vmcb->save;
+	struct vmcb_control_area *control = &svm->vmcb->control;
+	struct vmcb_save_area *save = &svm->vmcb->save;

 	control->intercept_cr_read =	INTERCEPT_CR0_MASK |
 					INTERCEPT_CR3_MASK |
@@ -600,7 +600,7 @@ static int svm_vcpu_reset(struct kvm_vcpu *vcpu)
 {
 	struct vcpu_svm *svm = to_svm(vcpu);

-	init_vmcb(svm->vmcb);
+	init_vmcb(svm);

 	if (vcpu->vcpu_id != 0) {
 		svm->vmcb->save.rip = 0;
@@ -638,7 +638,7 @@ static struct kvm_vcpu *svm_create_vcpu(struct kvm *kvm, unsigned int id)
 	svm->vmcb_pa = page_to_pfn(page) << PAGE_SHIFT;
 	svm->asid_generation = 0;
 	memset(svm->db_regs, 0, sizeof(svm->db_regs));
-	init_vmcb(svm->vmcb);
+	init_vmcb(svm);

 	fx_init(&svm->vcpu);
 	svm->vcpu.fpu_active = 1;
@@ -1024,7 +1024,7 @@ static int shutdown_interception(struct vcpu_svm *svm, struct kvm_run *kvm_run)
 	 * so reinitialize it.
 	 */
 	clear_page(svm->vmcb);
-	init_vmcb(svm->vmcb);
+	init_vmcb(svm);

 	kvm_run->exit_reason = KVM_EXIT_SHUTDOWN;
 	return 0;
-- 
1.5.3.7
Re: [kvm-devel] [ofa-general] Re: Demand paging for memory regions
On Tue, 12 Feb 2008, Christian Bell wrote:

You're arguing that a HW page table is not needed by describing a use case that is essentially what all RDMA solutions already do above the wire protocols (all solutions except Quadrics, of course).

The HW page table is not essential to the notification scheme. That the RDMA hardware uses the page table for linearization is another issue. A chip could just have a TLB cache and look up the entries using the OS page table, for example.

Let's say you have two systems, A and B. Each has its memory region, MemA and MemB. Each side also has page tables for this region, PtA and PtB. If either side then accesses the page again then the reverse process happens. If B accesses the page then it will first incur a page fault because the entry in PtB is missing. The fault will then cause a message to be sent to A to establish the page again. A will create an entry in PtA and will then confirm to B that the page was established. At that point RDMA operations can occur again.

The notifier-reclaim cycle you describe is akin to the out-of-band pin-unpin control messages used by existing communication libraries. Also, I think what you are proposing can have problems at scale -- A must keep track of all of the (potentially many) systems mapping MemA and cooperatively get an agreement from all these systems before reclaiming the page.

Right. We (SGI) have done something like this for a long time with XPmem and it scales ok.

When messages are sufficiently large, the control messaging necessary to setup/teardown the regions is relatively small. This is not always the case however -- in programming models that employ smaller messages, the one-sided nature of RDMA is the most attractive part of it.

The messaging would only be needed if a process comes under memory pressure. As long as there is enough memory nothing like this will occur. Nothing any communication/runtime system can't already do today.
The point of RDMA demand paging is enabling the possibility of using RDMA without the implied synchronization -- the optimistic part. Using the notifiers to duplicate existing memory region handling for RDMA hardware that doesn't have HW page tables is possible but undermines the more important consumer of your patches in my opinion.

The notifier scheme should integrate into existing memory region handling and not cause a duplication. If you already have library layers that do this then it should be possible to integrate it.

One other area that has not been brought up yet (I think) is the applicability of notifiers in letting users know when pinned memory is reclaimed by the kernel. This is useful when a lower-level library employs lazy deregistration strategies on memory regions that are subsequently released to the kernel via the application's use of munmap or sbrk. Ohio Supercomputing Center has work in this area but a generalized approach in the kernel would certainly be welcome.

The driver gets the notifications about memory being reclaimed. The driver could then notify user code about the release as well.

Pinned memory currently *cannot* be reclaimed by the kernel. The refcount is elevated. This means that the VM tries to remove the mappings and then sees that it was not able to remove all references. Then it gives up and tries again and again and again. Thus the potential for livelock.
Re: [kvm-devel] Clock off in guest
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] Sent: Wed 13/02/2008 14:52 To: Uri Lublin Cc: kvm-devel@lists.sourceforge.net Subject: RE: [kvm-devel] Clock off in guest

This would not work if you are using an old version of kvm (with no in-kernel APIC). I recommend upgrading to kvm-60 (or the latest Linux kernel).

Should I upgrade the guest kernel or the host kernel? My bet is the host kernel, but the clocksource=tsc applies to the guest, so I'm not really sure...

The host kernel or kvm. If you choose to upgrade your host kernel (and the kvm that comes with it), make sure you are using a recent kvm-userspace too (e.g. kvm-60).

Or as an alternative, probably not as good, sometimes (when the guest's clocksource is PIT) adding '-tdf' to the command line helps.

I cannot find this in man kvm or man qemu.

I'm not sure about the man pages, but kvm/qemu's help says:

bash$ /usr/bin/kvm -h | grep tdf
-tdf            inject timer interrupts that got lost

Should I add this to the command line that starts the guest?

Yes, try adding it to the command line that starts the guest (executable name may vary): /usr/bin/kvm [kvm-params] -tdf

Also tdf (time drift fix) only works when using PIT+PIC (no APIC), so sometimes it's helpful to also add -no-acpi: /usr/bin/kvm [kvm-params] -tdf -no-acpi

Best, Koen
Re: [kvm-devel] [ofa-general] Re: Demand paging for memory regions
On Tue, 12 Feb 2008, Jason Gunthorpe wrote:

But this isn't how IB or iwarp work at all. What you describe is a significant change to the general RDMA operation and requires changes to both sides of the connection and the wire protocol.

Yes, it may require a separate connection between both sides where a kind of VM notification protocol is established to tear these things down and set them up again. That is, if there is nothing in the RDMA protocol that allows a notification to the other side that the mapping is being torn down.

- In RDMA (iwarp and IB versions) the hardware page tables exist to linearize the local memory so the remote does not need to be aware of non-linearities in the physical address space. The main motivation for this is kernel bypass where the user space app wants to instruct the remote side to DMA into memory using user space addresses. Hardware provides the page tables to switch from incoming user space virtual addresses to physical addresses.

s/switch/translate I guess.

That is good and those page tables could be used for the notification scheme to enable reclaim. But they are optional and are maintaining the driver state. The linearization could be reconstructed from the kernel page tables on demand.

Many kernel RDMA drivers (SCSI, NFS) only use the HW page tables for access control and enforcing the lifetime of the mapping.

Well, the mapping would have to be on demand to avoid the issues that we currently have with pinning. The user API could stay the same. If the driver tracks the mappings using the notifier then the VM can make sure that the right things happen on exit etc etc.

The page tables in the RDMA hardware exist primarily to support this, and not for other reasons. The pinning of pages is one part to support the HW page tables and one part to support the RDMA lifetime rules; the lifetime rules are what cause problems for the VM.

So the driver software can tear down and establish page table entries at will? I do not see the problem.
The RDMA hardware is one thing, the way things are visible to the user another. If the driver can establish and remove mappings as needed via RDMA then the user can have the illusion of persistent RDMA memory. This is the same as virtual memory providing the illusion of a process having lots of memory all for itself.

- The wire protocol consists of packets that say 'Write XXX bytes to offset YY in Region RRR'. Creating a region produces the RRR label and currently pins the pages. So long as the RRR label is valid the remote side can issue write packets at any time without any further synchronization. There is no wire level event associated with creating RRR. You can pass RRR to the other machine in any fashion, even using carrier pigeons :)

- The RDMA layer is very general (ala TCP), useful protocols (like SCSI) are built on top of it and they specify the lifetime rules and protocol for exchanging RRR.

Well yes, of course. What is proposed here is an additional notification mechanism (could even be via tcp/udp to simplify things) that would manage the mappings at a higher level. The writes would not occur if the mapping has not been established.

This is your step 'A will then send a message to B notifying..'. It simply does not exist in the protocol specifications.

Of course. You need to create an additional communication layer to get that.

What it boils down to is that to implement true removal of pages in a general way the kernel and HCA must either drop packets or stall incoming packets; both are big performance problems -- and I can't see many users wanting this. Enterprise style people using SCSI, NFS, etc already have short pin periods and HPC MPI users probably won't care about the VM issues enough to warrant the performance overhead.

True, maybe you cannot do this by simply staying within the protocol bounds of RDMA that is based on page pinning, if the RDMA protocol does not support a notification to the other side that the mapping is going away.
If RDMA cannot do this then you would need additional ways of notifying the remote side that pages/mappings are invalidated.
Re: [kvm-devel] [RFC] Qemu powerpc work around
Jerone Young wrote: On Wed, 2008-02-13 at 09:29 +0200, Avi Kivity wrote: Jerone Young wrote:

So the recent code in qemu cvs has problems on powerpc. So what I have done is mainly work around this in the build system, by creating a ppcemb_kvm-softmmu target. Along with this is a fake-exec.c that stubs out the functions that are no longer defined (something done by Anthony Liguori attempting to fix qemu_cvs). What do folks think about this approach? For us, all we really need is a qemu that is not built with the tcg dependency.

Since a target in qemu is a cpu type, how the instructions are executed (kvm, kqemu, dyngen, or tcg) shouldn't come into it. Instead we can have a --without-cpu-emulation or --no-tcg which would simply disable those parts.

Actually this is a much more sensible solution. So I took some time and implemented it.

Funny enough, I was thinking the same thing last night :-)

Please move fake-exec.c to target-ppc/fake-exec.c as it contains PPC specific code. Otherwise, this patch is much better!

Regards, Anthony Liguori
Re: [kvm-devel] [ofa-general] Re: Demand paging for memory regions
On Wed, Feb 13, 2008 at 10:51:58AM -0800, Christoph Lameter wrote: On Tue, 12 Feb 2008, Jason Gunthorpe wrote:

But this isn't how IB or iwarp work at all. What you describe is a significant change to the general RDMA operation and requires changes to both sides of the connection and the wire protocol.

Yes, it may require a separate connection between both sides where a kind of VM notification protocol is established to tear these things down and set them up again. That is, if there is nothing in the RDMA protocol that allows a notification to the other side that the mapping is being torn down.

Well, yes, you could build this thing you are describing on top of the RDMA protocol and get some support from some of the hardware -- but it is a new set of protocols and they would need to be implemented in several places. It is not transparent to userspace and it is not compatible with existing implementations.

Unfortunately it really has little to do with the drivers -- changes, for instance, need to be made to support this in the user space MPI libraries. The RDMA ops do not pass through the kernel; userspace talks directly to the hardware, which complicates building any sort of abstraction.

That is where I think you run into trouble: if you ask the MPI people to add code to their critical path to support swapping they probably will not be too interested. At a minimum to support your idea you need to check on every RDMA if the remote page is mapped... Plus the overheads Christian was talking about in the OOB channel(s).

Jason
Re: [kvm-devel] [ofa-general] Re: Demand paging for memory regions
On Wed, 13 Feb 2008, Christoph Lameter wrote:

Right. We (SGI) have done something like this for a long time with XPmem and it scales ok.

I'd dispute this based on experience developing PGAS language support on the Altix but more importantly (and less subjectively), I think that "scales ok" refers to a very specific case. Sure, pages (and/or regions) can be large on some systems and the number of systems may not always be in the thousands, but you're still claiming scalability for a mechanism that essentially logs who accesses the regions. Then there's the fact that reclaim becomes a collective communication operation over all region accessors. Makes me nervous.

When messages are sufficiently large, the control messaging necessary to setup/teardown the regions is relatively small. This is not always the case however -- in programming models that employ smaller messages, the one-sided nature of RDMA is the most attractive part of it.

The messaging would only be needed if a process comes under memory pressure. As long as there is enough memory nothing like this will occur. Nothing any communication/runtime system can't already do today.

The point of RDMA demand paging is enabling the possibility of using RDMA without the implied synchronization -- the optimistic part. Using the notifiers to duplicate existing memory region handling for RDMA hardware that doesn't have HW page tables is possible but undermines the more important consumer of your patches in my opinion.

The notifier scheme should integrate into existing memory region handling and not cause a duplication. If you already have library layers that do this then it should be possible to integrate it.

I appreciate that you're trying to make a general case for the applicability of notifiers on all types of existing RDMA hardware and wire protocols. Also, I'm not disagreeing whether a HW page table is required or not: clearly it's not required to make *some* use of the notifier scheme.
However, short of providing user-level notifications for pinned pages that are inadvertently released to the O/S, I don't believe that the patchset provides any significant added value for the HPC community that can't optimistically do RDMA demand paging.

    . . christian
Re: [kvm-devel] [ofa-general] Re: Demand paging for memory regions
On Wed, 13 Feb 2008, Christoph Raisch wrote:

For ehca we currently can't modify a large MR when it has been allocated. eHCA hardware expects the pages to be there (MRs must not have holes). This is also true for the global MR covering all kernel space. Therefore we still need the memory to be pinned if ib_umem_get() is called.

It cannot be freed and then reallocated? What happens when a process exits?
Re: [kvm-devel] [PATCH] Qemu powerpc work around
Hi Jerone,

Jerone Young wrote:

OK, taking everybody's suggestions. This patch adds a --disable-cpu-emulation option to qemu. This way powerpc can compile, and other archs can easily add the ability to compile without the TCG code.

Signed-off-by: Jerone Young [EMAIL PROTECTED]

Can you try out this version of the patch on PPC? This version also supports --disable-cpu-emulation on x86. It also eliminates -no-kvm when using --disable-cpu-emulation and exits if KVM cannot be initialized. This should be useful on x86 where people cannot easily get their hands on gcc-3 and only wish to run KVM.

Regards,

Anthony Liguori

diff --git a/qemu/Makefile.target b/qemu/Makefile.target
index 49b81df..8b0436b 100644
--- a/qemu/Makefile.target
+++ b/qemu/Makefile.target
@@ -179,11 +179,17 @@ all: $(PROGS)
 #
 # cpu emulator library
-LIBOBJS=exec.o kqemu.o translate-all.o cpu-exec.o\
-        translate.o op.o host-utils.o
+LIBOBJS=exec.o kqemu.o cpu-exec.o host-utils.o
+
+ifeq ($(NO_CPU_EMULATION), 1)
+LIBOBJS+=fake-exec.o
+else
+LIBOBJS+= translate-all.o translate.o op.o
 # TCG code generator
 LIBOBJS+= tcg/tcg.o tcg/tcg-dyngen.o tcg/tcg-runtime.o
 CPPFLAGS+=-I$(SRC_PATH)/tcg -I$(SRC_PATH)/tcg/$(ARCH)
+endif
+
 ifeq ($(USE_KVM), 1)
 LIBOBJS+=qemu-kvm.o
 endif
diff --git a/qemu/configure b/qemu/configure
index 92299b9..bc42665 100755
--- a/qemu/configure
+++ b/qemu/configure
@@ -110,6 +110,7 @@ linux_user=no
 darwin_user=no
 build_docs=no
 uname_release=""
+cpu_emulation="yes"

 # OS specific
 targetos=`uname -s`
@@ -339,6 +340,8 @@ for opt do
   ;;
   --disable-werror) werror="no"
   ;;
+  --disable-cpu-emulation) cpu_emulation="no"
+  ;;
   *) echo "ERROR: unknown option $opt"; exit 1
   ;;
   esac
@@ -770,6 +773,7 @@ if test -n "$sparc_cpu"; then
 fi
 echo "kqemu support     $kqemu"
 echo "kvm support       $kvm"
+echo "CPU emulation     $cpu_emulation"
 echo "Documentation     $build_docs"
 [ ! -z "$uname_release" ] && \
 echo "uname -r          $uname_release"
@@ -1094,12 +1098,20 @@ elfload32=no
 interp_prefix1=`echo "$interp_prefix" | sed "s/%M/$target_cpu/g"`
 echo "#define CONFIG_QEMU_PREFIX \"$interp_prefix1\"" >> $config_h

+disable_cpu_emulation() {
+    if test "$cpu_emulation" = "no"; then
+        echo "#define NO_CPU_EMULATION 1" >> $config_h
+        echo "NO_CPU_EMULATION=1" >> $config_mak
+    fi
+}
+
 configure_kvm() {
   if test $kvm = "yes" -a "$target_softmmu" = "yes" -a \
     \( "$cpu" = "i386" -o "$cpu" = "x86_64" -o "$cpu" = "ia64" -o "$cpu" = "powerpc" \); then
     echo "#define USE_KVM 1" >> $config_h
     echo "USE_KVM=1" >> $config_mak
     echo "CONFIG_KVM_KERNEL_INC=$kernel_path/include" >> $config_mak
+    disable_cpu_emulation
   fi
 }
diff --git a/qemu/exec.c b/qemu/exec.c
index 050b150..960adcd 100644
--- a/qemu/exec.c
+++ b/qemu/exec.c
@@ -35,7 +35,11 @@
 #include "cpu.h"
 #include "exec-all.h"
+
+#if !defined(NO_CPU_EMULATION)
 #include "tcg-target.h"
+#endif
+
 #include "qemu-kvm.h"
 #if defined(CONFIG_USER_ONLY)
 #include "qemu.h"
diff --git a/qemu/target-i386/fake-exec.c b/qemu/target-i386/fake-exec.c
new file mode 100644
index 000..737286d
--- /dev/null
+++ b/qemu/target-i386/fake-exec.c
@@ -0,0 +1,54 @@
+/*
+ * fake-exec.c
+ *
+ * This is a file for stub functions so that compilation is possible
+ * when TCG CPU emulation is disabled during compilation.
+ *
+ * Copyright 2007 IBM Corporation.
+ * Added by Authors:
+ * 	Jerone Young [EMAIL PROTECTED]
+ *
+ * This work is licensed under the GNU GPL licence version 2 or later.
+ *
+ */
+#include "exec.h"
+#include "cpu.h"
+
+int code_copy_enabled = 0;
+
+CCTable cc_table[CC_OP_NB];
+
+void cpu_dump_statistics (CPUState *env, FILE *f,
+                          int (*cpu_fprintf)(FILE *f, const char *fmt, ...),
+                          int flags)
+{
+}
+
+unsigned long code_gen_max_block_size(void)
+{
+    return 32;
+}
+
+void cpu_gen_init(void)
+{
+}
+
+int cpu_restore_state(TranslationBlock *tb,
+                      CPUState *env, unsigned long searched_pc,
+                      void *puc)
+
+{
+    return 0;
+}
+
+int cpu_x86_gen_code(CPUState *env, TranslationBlock *tb, int *gen_code_size_ptr)
+{
+    return 0;
+}
+
+void flush_icache_range(unsigned long start, unsigned long stop)
+{
+}
+
+void optimize_flags_init(void)
+{
+}
diff --git a/qemu/target-ppc/fake-exec.c b/qemu/target-ppc/fake-exec.c
new file mode 100644
index 000..b042f58
--- /dev/null
+++ b/qemu/target-ppc/fake-exec.c
@@ -0,0 +1,75 @@
+/*
+ * fake-exec.c
+ *
+ * This is a file for stub functions so that compilation is possible
+ * when TCG CPU emulation is disabled during compilation.
+ *
+ * Copyright 2007 IBM Corporation.
+ * Added by Authors:
+ * 	Jerone Young [EMAIL PROTECTED]
+ *
+ * This work is licensed under the GNU GPL licence version 2 or later.
+ *
+ */
+
+#include <stdarg.h>
+#include <stdlib.h>
+#include <stdio.h>
+#include <string.h>
+#include <inttypes.h>
+
+#include "cpu.h"
+#include "exec-all.h"
+
+int code_copy_enabled = 0;
+
+void
Re: [kvm-devel] [ofa-general] Re: Demand paging for memory regions
On Wed, 13 Feb 2008, Christian Bell wrote:

not always be in the thousands but you're still claiming scalability for a mechanism that essentially logs who accesses the regions. Then there's the fact that reclaim becomes a collective communication operation over all region accessors. Makes me nervous.

Well, reclaim is not a very fast process (and we usually try to avoid it as much as possible in our HPC). Essentially it's only there to allow shifts of processing loads and to allow efficient caching of application data.

However, short of providing user-level notifications for pinned pages that are inadvertently released to the O/S, I don't believe that the patchset provides any significant added value for the HPC community that can't optimistically do RDMA demand paging.

We currently also run XPmem with pinning. It's great as long as you just run one load on the system. No reclaim ever occurs. However, if you do things that require lots of allocations etc etc then the page pinning can easily lead to livelock if reclaim is finally triggered, and also strange OOM situations since the VM cannot free any pages. So the main issue that is addressed here is the reliability of pinned page operations. Better VM integration avoids these issues because we can unpin on request to deal with memory shortages.
Re: [kvm-devel] [ofa-general] Re: Demand paging for memory regions
On Wed, 13 Feb 2008, Jason Gunthorpe wrote:

Unfortunately it really has little to do with the drivers -- changes, for instance, need to be made to support this in the user space MPI libraries. The RDMA ops do not pass through the kernel; userspace talks directly to the hardware, which complicates building any sort of abstraction.

Ok, so the notifiers have to be handed over to the user space library that has the function of the device driver here...

That is where I think you run into trouble: if you ask the MPI people to add code to their critical path to support swapping they probably will not be too interested. At a minimum to support your idea you need to check on every RDMA if the remote page is mapped... Plus the overheads Christian was talking about in the OOB channel(s).

You only need to check if a handle has been receiving invalidates. If not, then you can just go ahead as now. You can use the notifier to take down the whole region if any reclaim occurs against it (probably the best and simplest approach to implement). Then you mark the handle so that the mapping is reestablished before the next operation.
[kvm-devel] [PATCH] Qemu powerpc work around
Ok, taking everybody's suggestions. This patch adds a --disable-cpu-emulation option to qemu. This way powerpc is able to compile, and it also gives other archs an easy way to compile without the tcg code.

Signed-off-by: Jerone Young [EMAIL PROTECTED]

diff --git a/qemu/Makefile.target b/qemu/Makefile.target
--- a/qemu/Makefile.target
+++ b/qemu/Makefile.target
@@ -179,11 +179,15 @@ all: $(PROGS)
 #
 # cpu emulator library
-LIBOBJS=exec.o kqemu.o translate-all.o cpu-exec.o\
-        translate.o op.o host-utils.o
+LIBOBJS=exec.o kqemu.o cpu-exec.o host-utils.o
+
+ifneq ($(NO_CPU_EMULATION), 1)
+LIBOBJS+= translate-all.o translate.o op.o
 # TCG code generator
 LIBOBJS+= tcg/tcg.o tcg/tcg-dyngen.o tcg/tcg-runtime.o
 CPPFLAGS+=-I$(SRC_PATH)/tcg -I$(SRC_PATH)/tcg/$(ARCH)
+endif
+
 ifeq ($(USE_KVM), 1)
 LIBOBJS+=qemu-kvm.o
 endif
@@ -214,6 +218,9 @@ LIBOBJS+= op_helper.o helper.o
 LIBOBJS+= op_helper.o helper.o
 ifeq ($(USE_KVM), 1)
 LIBOBJS+= qemu-kvm-powerpc.o
+endif
+ifeq ($(NO_CPU_EMULATION), 1)
+LIBOBJS+=fake-exec-ppc.o
 endif
 endif
diff --git a/qemu/configure b/qemu/configure
--- a/qemu/configure
+++ b/qemu/configure
@@ -110,6 +110,7 @@ darwin_user=no
 darwin_user=no
 build_docs=no
 uname_release=""
+cpu_emulation="yes"

 # OS specific
 targetos=`uname -s`
@@ -339,6 +340,8 @@ for opt do
   ;;
   --disable-werror) werror="no"
   ;;
+  --disable-cpu-emulation) cpu_emulation="no"
+  ;;
   *) echo "ERROR: unknown option $opt"; exit 1
   ;;
 esac
@@ -770,6 +773,7 @@ fi
 fi
 echo "kqemu support     $kqemu"
 echo "kvm support       $kvm"
+echo "CPU emulation     $cpu_emulation"
 echo "Documentation     $build_docs"
 [ ! -z "$uname_release" ] && \
 echo "uname -r          $uname_release"
@@ -1094,12 +1098,20 @@ interp_prefix1=`echo $interp_prefix |
 interp_prefix1=`echo "$interp_prefix" | sed "s/%M/$target_cpu/g"`
 echo "#define CONFIG_QEMU_PREFIX \"$interp_prefix1\"" >> $config_h

+disable_cpu_emulation() {
+    if test $cpu_emulation = "no" ; then
+        echo "#define NO_CPU_EMULATION 1" >> $config_h
+        echo "NO_CPU_EMULATION=1" >> $config_mak
+    fi
+}
+
 configure_kvm() {
   if test $kvm = "yes" -a "$target_softmmu" = "yes" -a \
       \( "$cpu" = "i386" -o "$cpu" = "x86_64" -o "$cpu" = "ia64" -o "$cpu" = "powerpc" \); then
     echo "#define USE_KVM 1" >> $config_h
     echo "USE_KVM=1" >> $config_mak
     echo "CONFIG_KVM_KERNEL_INC=$kernel_path/include" >> $config_mak
+    disable_cpu_emulation
   fi
 }
diff --git a/qemu/exec.c b/qemu/exec.c
--- a/qemu/exec.c
+++ b/qemu/exec.c
@@ -35,7 +35,11 @@
 #include "cpu.h"
 #include "exec-all.h"
+
+#if !defined(NO_CPU_EMULATION)
 #include "tcg-target.h"
+#endif
+
 #include "qemu-kvm.h"
 #if defined(CONFIG_USER_ONLY)
 #include "qemu.h"
diff --git a/qemu/fake-exec-ppc.c b/qemu/fake-exec-ppc.c
new file mode 100644
--- /dev/null
+++ b/qemu/fake-exec-ppc.c
@@ -0,0 +1,75 @@
+/*
+ * fake-exec.c
+ *
+ * This is a file for stub functions so that compilation is possible
+ * when TCG CPU emulation is disabled during compilation.
+ *
+ * Copyright 2007 IBM Corporation.
+ * Added by Authors:
+ *  Jerone Young [EMAIL PROTECTED]
+ *
+ * This work is licensed under the GNU GPL license version 2 or later.
+ */
+
+#include <stdarg.h>
+#include <stdlib.h>
+#include <stdio.h>
+#include <string.h>
+#include <inttypes.h>
+
+#include "cpu.h"
+#include "exec-all.h"
+
+int code_copy_enabled = 0;
+
+void cpu_dump_state (CPUState *env, FILE *f,
+                     int (*cpu_fprintf)(FILE *f, const char *fmt, ...),
+                     int flags)
+{
+}
+
+void ppc_cpu_list (FILE *f, int (*cpu_fprintf)(FILE *f, const char *fmt, ...))
+{
+}
+
+void cpu_dump_statistics (CPUState *env, FILE *f,
+                          int (*cpu_fprintf)(FILE *f, const char *fmt, ...),
+                          int flags)
+{
+}
+
+unsigned long code_gen_max_block_size(void)
+{
+    return 32;
+}
+
+void cpu_gen_init(void)
+{
+}
+
+int cpu_restore_state(TranslationBlock *tb,
+                      CPUState *env, unsigned long searched_pc,
+                      void *puc)
+{
+    return 0;
+}
+
+int cpu_ppc_gen_code(CPUState *env, TranslationBlock *tb, int *gen_code_size_ptr)
+{
+    return 0;
+}
+
+const ppc_def_t *cpu_ppc_find_by_name (const unsigned char *name)
+{
+    return NULL;
+}
+
+int cpu_ppc_register_internal (CPUPPCState *env, const ppc_def_t *def)
+{
+    return 0;
+}
+
+void flush_icache_range(unsigned long start, unsigned long stop)
+{
+}
Re: [kvm-devel] [PATCH 1/2] kvmclock - the host part.
Avi Kivity wrote:
Glauber de Oliveira Costa wrote:

This is the host part of the kvm clocksource implementation. As it does not include clockevents, it is a fairly simple implementation. We only have to register a per-vcpu area, and start writing to it periodically. The area is binary compatible with xen, as we use the same shadow_info structure.

+static void kvm_write_wall_clock(struct kvm_vcpu *v, gpa_t wall_clock)
+{
+	int version = 1;
+	struct kvm_wall_clock wc;
+	unsigned long flags;
+	struct timespec wc_ts;
+
+	local_irq_save(flags);
+	kvm_get_msr(v, MSR_IA32_TIME_STAMP_COUNTER,
+		    &v->arch.hv_clock.tsc_timestamp);

Why is this needed? IIRC the wall clock is not tied to any vcpu. If we can remove this, the argument to the function should be kvm, not kvm_vcpu. We can remove the irq games as well.

After some new thoughts, I don't agree. The time calculation in the guest will be in the format wallclock + delta tsc. Everything that has a tsc _is_ tied to a cpu. Although we can store the area globally, I think the best semantics is to have a vcpu always issue an msr write to the area before reading it, in order to have the tsc updated.

+	wc_ts = current_kernel_time();
+	local_irq_restore(flags);
+
+	down_write(&current->mm->mmap_sem);
+	kvm_write_guest(v->kvm, wall_clock, &version, sizeof(version));
+	up_write(&current->mm->mmap_sem);

Why down_write? accidentally or on purpose?

Accidentally. Marcelo has pointed it out to me, and I do have a version without it now.

For mutual exclusion, I suggest taking kvm->lock instead (for the entire function).

This function is called too often. Since we only need to guarantee mutual exclusion in a tiny part, it seems preferable to me. Do you have any extra reason for kvm->lock'ing the entire function?
+
+	/* With all the info we got, fill in the values */
+	wc.wc_sec = wc_ts.tv_sec;
+	wc.wc_nsec = wc_ts.tv_nsec;
+	wc.wc_version = ++version;
+
+	down_write(&current->mm->mmap_sem);
+	kvm_write_guest(v->kvm, wall_clock, &wc, sizeof(wc));
+	up_write(&current->mm->mmap_sem);

Should be in three steps: write version, write data, write version. kvm_write_guest doesn't guarantee any order. It may fail as well, and we need to handle that.

Ok, I see. This is fundamentally different from the system time case, because multiple cpus can be running over the same area.

+/* xen binary-compatible interface. See xen headers for details */
+struct kvm_vcpu_time_info {
+	uint32_t version;
+	uint32_t pad0;
+	uint64_t tsc_timestamp;
+	uint64_t system_time;
+	uint32_t tsc_to_system_mul;
+	int8_t   tsc_shift;
+}; /* 32 bytes */
+
+struct kvm_wall_clock {
+	uint32_t wc_version;
+	uint32_t wc_sec;
+	uint32_t wc_nsec;
+};
+

These structures are dangerously sized. Suggest __attribute__((__packed__)) (or some padding at the end of kvm_vcpu_time_info).

They are sized so as to have the same size as xen's. If it concerns you, packed should be better.

diff --git a/include/linux/kvm.h b/include/linux/kvm.h
index 4de4fd2..78ce53f 100644
--- a/include/linux/kvm.h
+++ b/include/linux/kvm.h
@@ -232,6 +232,7 @@
 #define KVM_CAP_USER_MEMORY 3
 #define KVM_CAP_SET_TSS_ADDR 4
 #define KVM_CAP_EXT_CPUID 5
 #define KVM_CAP_VAPIC 6
+#define KVM_CAP_CLOCKSOURCE 7

Please refresh against kvm.git, this has changed a bit.

ok.
Re: [kvm-devel] [ofa-general] Re: Demand paging for memory regions
--- Christoph Lameter [EMAIL PROTECTED] wrote:

On Wed, 13 Feb 2008, Christian Bell wrote:

not always be in the thousands but you're still claiming scalability for a mechanism that essentially logs who accesses the regions. Then there's the fact that reclaim becomes a collective communication operation over all region accessors. Makes me nervous.

Well, reclaim is not a very fast process (and we usually try to avoid it as much as possible for our HPC). Essentially it's only there to allow shifts of processing loads and to allow efficient caching of application data.

However, short of providing user-level notifications for pinned pages that are inadvertently released to the O/S, I don't believe that the patchset provides any significant added value for the HPC community that can't optimistically do RDMA demand paging.

We currently also run XPmem with pinning. It's great as long as you just run one load on the system. No reclaim ever occurs. However, if you do things that require lots of allocations etc etc then the page pinning can easily lead to livelock if reclaim is finally triggered, and also strange OOM situations since the VM cannot free any pages. So the main issue that is addressed here is reliability of pinned page operations. Better VM integration avoids these issues because we can unpin on request to deal with memory shortages.

I have a question on the basic need for the mmu notifier stuff wrt rdma hardware and pinning memory. It seems that the need is to solve potential memory shortage and overcommit issues by being able to reclaim pages pinned by rdma driver/hardware. Is my understanding correct?

If I do understand correctly, then why is rdma page pinning any different than eg mlock pinning? I imagine Oracle pins lots of memory (using mlock), how come they do not run into vm overcommit issues? Are we up against some kind of breaking c-o-w issue here that is different between mlock and rdma pinning?
Asked another way, why should effort be spent on a notifier scheme, and rather not on fixing any memory accounting problems and unifying how pinned pages are accounted for, whether they get pinned via mlock() or rdma drivers?

Startup benefits are well understood with the notifier scheme (ie, not all pages need to be faulted in at memory region creation time), especially when most of the memory region is not accessed at all. I would imagine most of HPC does not work this way though. Then again, as rdma hardware is applied (increasingly?) towards apps with short-lived connections, the notifier scheme will help with startup times.

Kanoj
Re: [kvm-devel] [ofa-general] Re: Demand paging for memory regions
On Wed, 13 Feb 2008, Kanoj Sarcar wrote:

It seems that the need is to solve potential memory shortage and overcommit issues by being able to reclaim pages pinned by rdma driver/hardware. Is my understanding correct?

Correct.

If I do understand correctly, then why is rdma page pinning any different than eg mlock pinning? I imagine Oracle pins lots of memory (using mlock), how come they do not run into vm overcommit issues?

Mlocked pages are not pinned. They are movable by f.e. page migration and will potentially be moved by future memory defrag approaches. Currently we have the same issues with mlocked pages as with pinned pages. There is work in progress to put mlocked pages onto a different lru so that reclaim exempts these pages, and more work on limiting the percentage of memory that can be mlocked.

Are we up against some kind of breaking c-o-w issue here that is different between mlock and rdma pinning?

Not that I know.

Asked another way, why should effort be spent on a notifier scheme, and rather not on fixing any memory accounting problems and unifying how pinned pages are accounted for, whether they get pinned via mlock() or rdma drivers?

There are efforts underway to account for and limit mlocked pages as described above. Page pinning the way it is done by Infiniband, through increasing the page refcount, is treated by the VM as a temporary condition, not as a permanent pin. The VM will continually try to reclaim these pages, thinking that the temporary usage of the page must cease soon. This is why the use of large amounts of pinned pages can lead to livelock situations.

If we want to have pinning behavior then we could mark pinned pages specially so that the VM will not continually try to evict these pages. We could manage them similarly to mlocked pages but just not allow page migration, memory unplug and defrag to occur on pinned memory. All of these would have to fail. With the notifier scheme the device driver could be told to get rid of the pinned memory.
This would make these 3 techniques work despite having an RDMA memory section.

Startup benefits are well understood with the notifier scheme (ie, not all pages need to be faulted in at memory region creation time), especially when most of the memory region is not accessed at all. I would imagine most of HPC does not work this way though.

No, for optimal performance you would want to prefault all pages like it is now. The notifier scheme would only become relevant in memory shortage situations.

Then again, as rdma hardware is applied (increasingly?) towards apps with short-lived connections, the notifier scheme will help with startup times.

The main use of the notifier scheme is for stability and reliability. The pinned pages become unpinnable on request by the VM. So the VM can work itself out of memory shortage situations in cooperation with the RDMA logic instead of simply failing.
Re: [kvm-devel] [ofa-general] Re: Demand paging for memory regions
[EMAIL PROTECTED] wrote on Tue, 12 Feb 2008 20:09 -0800:

One other area that has not been brought up yet (I think) is the applicability of notifiers in letting users know when pinned memory is reclaimed by the kernel. This is useful when a lower-level library employs lazy deregistration strategies on memory regions that are subsequently released to the kernel via the application's use of munmap or sbrk. Ohio Supercomputing Center has work in this area but a generalized approach in the kernel would certainly be welcome.

The whole need for memory registration is a giant pain. There is no motivating application need for it---it is simply a hack around virtual memory and the lack of full VM support in current hardware. There are real hardware issues that interact poorly with virtual memory, as discussed previously in this thread.

The way a messaging cycle goes in IB is:

    register buf
    post send from buf
    wait for completion
    deregister buf

This tends to get hidden via userspace software libraries into a single call:

    MPI_send(buf)

Now if you actually do the reg/dereg every time, things are very slow. So userspace library writers came up with the idea of caching registrations:

    if buf is not registered:
        register buf
    post send from buf
    wait for completion

The second time that the app happens to do a send from the same buffer, it proceeds much faster. Spatial locality applies here, and this caching is generally worth it. Some libraries have schemes to limit the size of the registration cache too.

But there are plenty of ways to hurt yourself with such a scheme. The first being a huge pool of unused but registered memory, as the library doesn't know the app patterns, and it doesn't know the VM pressure level in the kernel. There are plenty of subtle ways that this breaks too. If the registered buf is removed from the address space via munmap() or sbrk() or other ways, the mapping and registration are gone, but the library has no way of knowing that the app just did this.
Sure the physical page is still there and pinned, but the app cannot get at it. Later, if new address space arrives at the same virtual address but a different physical page, the library will mistakenly think it already has it registered properly, and data is transferred from this old now-unmapped physical page.

The whole situation is rather ridiculous, but we are quite stuck with it for current generation IB and iWarp hardware.

If we can't have the kernel interact with the device directly, we could at least manage state in these multiple userspace registration caches. The VM could ask for certain (or any) pages to be released, and the library would respond if they are indeed not in use by the device. The app itself does not know about pinned regions, and the library is aware of exactly which regions are potentially in use.

Since the great majority of userspace messaging over IB goes through middleware like MPI or PGAS languages, and they all have the same approach to registration caching, this approach could fix the problem for a big segment of use cases.

More text on the registration caching problem is here:

    http://www.osc.edu/~pw/papers/wyckoff-memreg-ccgrid05.pdf

with an approach using vm_ops open and close operations in a kernel module here:

    http://www.osc.edu/~pw/dreg/

There is a place for VM notifiers in RDMA messaging, but not in talking to devices, at least not the current set. If you can define a reasonable userspace interface for VM notifiers, libraries can manage registration caches more efficiently, letting the kernel unmap pinned pages as it likes.

-- Pete
Re: [kvm-devel] [ofa-general] Re: Demand paging for memory regions
--- Christoph Lameter [EMAIL PROTECTED] wrote:

On Wed, 13 Feb 2008, Kanoj Sarcar wrote:

It seems that the need is to solve potential memory shortage and overcommit issues by being able to reclaim pages pinned by rdma driver/hardware. Is my understanding correct?

Correct.

If I do understand correctly, then why is rdma page pinning any different than eg mlock pinning? I imagine Oracle pins lots of memory (using mlock), how come they do not run into vm overcommit issues?

Mlocked pages are not pinned. They are movable by f.e. page migration and will potentially be moved by future memory defrag approaches. Currently we have the same issues with mlocked pages as with pinned pages. There is work in progress to put mlocked pages onto a different lru so that reclaim exempts these pages, and more work on limiting the percentage of memory that can be mlocked.

Are we up against some kind of breaking c-o-w issue here that is different between mlock and rdma pinning?

Not that I know.

Asked another way, why should effort be spent on a notifier scheme, and rather not on fixing any memory accounting problems and unifying how pinned pages are accounted for, whether they get pinned via mlock() or rdma drivers?

There are efforts underway to account for and limit mlocked pages as described above. Page pinning the way it is done by Infiniband, through increasing the page refcount, is treated by the VM as a temporary condition, not as a permanent pin. The VM will continually try to reclaim these pages, thinking that the temporary usage of the page must cease soon. This is why the use of large amounts of pinned pages can lead to livelock situations.

Oh ok, yes, I did see the discussion on this; sorry I missed it. I do see what notifiers bring to the table now (without endorsing it :-)).

An orthogonal question is this: is IB/rdma the only culprit that elevates page refcounts? Are there no other subsystems which do a similar thing?
The example I am thinking about is rawio (Oracle's mlock'ed SHM regions are handed to rawio, aren't they?). My understanding of how rawio works in Linux is quite dated though ...

Kanoj

If we want to have pinning behavior then we could mark pinned pages specially so that the VM will not continually try to evict these pages. We could manage them similarly to mlocked pages but just not allow page migration, memory unplug and defrag to occur on pinned memory. All of these would have to fail. With the notifier scheme the device driver could be told to get rid of the pinned memory. This would make these 3 techniques work despite having an RDMA memory section.

Startup benefits are well understood with the notifier scheme (ie, not all pages need to be faulted in at memory region creation time), especially when most of the memory region is not accessed at all. I would imagine most of HPC does not work this way though.

No, for optimal performance you would want to prefault all pages like it is now. The notifier scheme would only become relevant in memory shortage situations.

Then again, as rdma hardware is applied (increasingly?) towards apps with short-lived connections, the notifier scheme will help with startup times.

The main use of the notifier scheme is for stability and reliability. The pinned pages become unpinnable on request by the VM. So the VM can work itself out of memory shortage situations in cooperation with the RDMA logic instead of simply failing.

-- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to [EMAIL PROTECTED] For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: [EMAIL PROTECTED]
Re: [kvm-devel] Demand paging for memory regions
On Wednesday, February 13, 2008 3:43 pm Kanoj Sarcar wrote:

Oh ok, yes, I did see the discussion on this; sorry I missed it. I do see what notifiers bring to the table now (without endorsing it :-)). An orthogonal question is this: is IB/rdma the only culprit that elevates page refcounts? Are there no other subsystems which do a similar thing? The example I am thinking about is rawio (Oracle's mlock'ed SHM regions are handed to rawio, isn't it?). My understanding of how rawio works in Linux is quite dated though ...

We're doing something similar in the DRM these days... We need big chunks of memory to be pinned so that the GPU can operate on them, but when the operation completes we can allow them to be swappable again. I think with the current implementation, allocations are always pinned, but we'll definitely want to change that soon. Dave?

Jesse
Re: [kvm-devel] [ofa-general] Re: Demand paging for memory regions
On Wed, Feb 13, 2008 at 06:23:08PM -0500, Pete Wyckoff wrote:

[EMAIL PROTECTED] wrote on Tue, 12 Feb 2008 20:09 -0800: One other area that has not been brought up yet (I think) is the applicability of notifiers in letting users know when pinned memory is reclaimed by the kernel. This is useful when a lower-level library employs lazy deregistration strategies on memory regions that are subsequently released to the kernel via the application's use of munmap or sbrk. Ohio Supercomputing Center has work in this area but a generalized approach in the kernel would certainly be welcome.

The whole need for memory registration is a giant pain. There is no motivating application need for it---it is simply a hack around virtual memory and the lack of full VM support in current hardware. There are real hardware issues that interact poorly with virtual memory, as discussed previously in this thread.

Well, the registrations also exist to provide protection against rogue/faulty remotes, but for the purposes of MPI that is probably not important.

Here is a thought.. Some RDMA hardware can change the page tables on the fly. What if the kernel had a mechanism to dynamically maintain a full registration of the process's entire address space ('mlocked' but able to be migrated)? MPI would never need to register a buffer, and all the messy cases with munmap/sbrk/etc go away - the risk is that other MPI nodes can randomly scribble all over the process :)

Christoph: It seemed to me you were first talking about freeing/swapping/faulting RDMA'able pages - but would pure migration as a special hardware supported case be useful like Caitlin suggested?

Regards,
Jason
Re: [kvm-devel] [ofa-general] Re: Demand paging for memory regions
Hi Kanoj,

On Wed, Feb 13, 2008 at 03:43:17PM -0800, Kanoj Sarcar wrote:

Oh ok, yes, I did see the discussion on this; sorry I missed it. I do see what notifiers bring to the table now (without endorsing it :-)).

I'm not really sure livelocks are the big issue here. I'm running N 1G VMs on a 1G ram system, with (N-1)G swapped out. Combining this with auto-ballooning, rss limiting, and ksm ram sharing provides really advanced and low-level virtualization VM capabilities to the linux kernel, while at the same time guaranteeing no oom failures as long as the guest pages are lower than ram+swap (just slower runtime if too many pages are unshared or if the balloons are deflated etc..). Swapping the virtual machine in the host may be more efficient than having the guest swap over a virtual paravirt swap storage, for example. As more management features are added, admins will gain more experience in handling those new features and they'll find what's best for them. mmu notifiers and real reliable swapping are the enablers for those more advanced VM features. oom livelocks wouldn't happen anyway with KVM as long as the maximal amount of guest physical memory is lower than RAM.

An orthogonal question is this: is IB/rdma the only culprit that elevates page refcounts? Are there no other subsystems which do a similar thing? The example I am thinking about is rawio (Oracle's mlock'ed SHM regions are handed to rawio, isn't it?). My understanding of how rawio works in Linux is quite dated though ...

rawio in-flight I/O shall be limited. As long as each task can't pin more than X ram, and the ram is released when the task is oom killed, and the first get_user_pages/alloc_pages/slab_alloc that returns -ENOMEM takes an oom fail path that returns failure to userland, everything is ok. Even with IB, deadlock could only happen if IB would allow unlimited memory to be pinned down by unprivileged users.
If IB is insecure and DoSable without mmu notifiers, then I'm not sure how enabling swapping of the IB memory could be enough to fix the DoS. Keep in mind that even tmpfs can't be safe allowing all ram+swap to be allocated in a tmpfs file (despite the tmpfs file storage including swap and not only ram). Pinning the whole ram+swap with tmpfs livelocks the same way as pinning the whole ram with ramfs. So if you add mmu notifier support to IB, you only need to RDMA an area as large as ram+swap to livelock again as before... no difference at all.

I don't think livelocks have anything to do with mmu notifiers (other than deferring the livelock to the swap+ram point of no return instead of the current ram point of no return). Livelocks have to be solved the usual way: handling alloc_pages/get_user_pages/slab allocation failures with a fail path that returns to userland and allows the ram to be released if the task was selected for oom-killage.

The real benefit of the mmu notifiers for IB would be to allow the rdma region to be larger than RAM without triggering the oom killer (or without triggering a livelock if it's DoSable, but then the livelock would need fixing to be converted into a regular oom-killing by some other means not related to the mmu-notifier; it's really an orthogonal problem). So suppose you've a MPI simulation that requires a 10G array and you've only 1G of ram; then you can rdma over 10G as if you had 10G of ram. Things will perform ok only if there's some huge locality in the computations. For virtualization it's orders of magnitude more useful than for computer clusters, but certain simulations really swap, so I don't exclude that certain RDMA apps will also need this (dunno about IB).
Re: [kvm-devel] [RFC] Qemu powerpc work around
This works better. Not sure why, but when I had fake-exec in target-ppc, the build system was complaining that it could not find fake-exec.d. So then I just decided to move it to fake-exec-ppc.c. This patch works fine for powerpc.

On Wed, 2008-02-13 at 12:55 -0600, Anthony Liguori wrote:
Jerone Young wrote:
On Wed, 2008-02-13 at 09:29 +0200, Avi Kivity wrote:
Jerone Young wrote:

So the recent code in qemu cvs has problems on powerpc. So what I have done is mainly work around this in the build system, by creating a ppcemb_kvm-softmmu target. Along with this is a fake-exec.c that stubs out the functions that are no longer defined (something done by Anthony Liguori attempting to fix qemu cvs). What do folks think about this approach? For us, all we really need is a qemu that is not built with a tcg dependency.

Since a target in qemu is a cpu type, how the instructions are executed (kvm, kqemu, dyngen, or tcg) shouldn't come into it. Instead we can have a --without-cpu-emulation or --no-tcg which would simply disable those parts.

Actually this is a much more sensible solution. So I took some time and implemented it.

Funny enough, I was thinking the same thing last night :-) Please move fake-exec.c to target-ppc/fake-exec.c as it contains PPC-specific code. Otherwise, this patch is much better!

Regards,
Anthony Liguori