Re: Biweekly KVM Test report, kernel 0c7771... userspace 1223a0...
Xu, Jiajun wrote:
> Hi All, This is our Weekly KVM Testing Report against latest kvm.git 0c77713470debc666a07dc40080d728272bb58b9 and kvm-userspace.git 1223a029b36b0d9e73af76bcc274bb770f814886.
>
> One New Issue:
> 1. perfctr wrmsr warning when booting 64bit RHEL5.3
> https://sourceforge.net/tracker/?func=detail&aid=2721640&group_id=180599&atid=893831

This is the architectural performance counting MSR which was enabled in 4f76231 (KVM: x86: Ignore reads to EVNTSEL MSRs). Amit, can you check if appropriate cpuid leaf 10 reporting will fix this?

--
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH kvm-autotest] Fix command line for obtaining version number
Avi Kivity wrote:
> Plain 'qemu' now runs an empty VM; a -help is needed to get the help message.
>
> Signed-off-by: Avi Kivity a...@redhat.com
> ---
>  client/tests/kvm_runtest_2/kvm_preprocessing.py | 2 +-
>  1 files changed, 1 insertions(+), 1 deletions(-)
>
> diff --git a/client/tests/kvm_runtest_2/kvm_preprocessing.py b/client/tests/kvm_runtest_2/kvm_preprocessing.py
> index 5cd6e10..c9eb35d 100644
> --- a/client/tests/kvm_runtest_2/kvm_preprocessing.py
> +++ b/client/tests/kvm_runtest_2/kvm_preprocessing.py
> @@ -214,7 +214,7 @@ def preprocess(test, params, env):
>      # Get the KVM userspace version and write it as a keyval
>      kvm_log.debug("Fetching KVM userspace version...")
>      qemu_path = os.path.join(test.bindir, "qemu")
> -    version_line = commands.getoutput("%s | head -n 1" % qemu_path)
> +    version_line = commands.getoutput("%s -help | head -n 1" % qemu_path)
>      exp = re.compile("[Vv]ersion .*?,")
>      match = exp.search(version_line)
>      if match:

Applied, thanks.
Re: [PATCH 2/4] Rewrite twisted maze of if() statements with more straightforward switch()
Gleb Natapov wrote:
> Signed-off-by: Gleb Natapov g...@redhat.com

This is actually not just a rewrite, but also a bugfix:

> 			INTR_INFO);
> @@ -3289,34 +3288,42 @@ static void vmx_complete_interrupts(struct vcpu_vmx *vmx)
>  		vmx->vnmi_blocked_time +=
>  			ktime_to_ns(ktime_sub(ktime_get(), vmx->entry_time));
>
> +	vmx->vcpu.arch.nmi_injected = false;
> +	kvm_clear_exception_queue(&vmx->vcpu);
> +	kvm_clear_interrupt_queue(&vmx->vcpu);
> +
> +	if (!idtv_info_valid)
> +		return;
> +
>  	vector = idt_vectoring_info & VECTORING_INFO_VECTOR_MASK;
>  	type = idt_vectoring_info & VECTORING_INFO_TYPE_MASK;
> -	if (vmx->vcpu.arch.nmi_injected) {
> +
> +	switch (type) {
> +	case INTR_TYPE_NMI_INTR:
> +		vmx->vcpu.arch.nmi_injected = true;

The existing code would leave nmi_injected == false if we exit on NMI_INTR, so we drop an NMI here.
Re: [PATCH 4/4] Fix task switching.
Gleb Natapov wrote:
> The patch fixes two problems with task switching.
> 1. Back link is written to a wrong TSS.
> 2. Instruction emulation is not needed if the reason for the task switch is a task gate in the IDT and access to it is caused by an external event.
>
> 2 is currently solved only for VMX since there is no reliable way to skip an instruction in SVM. We should emulate it instead.

Looks good, but please split into (at least) two patches. Also please provide a test case so we don't regress again.
Re: Cleanup to reuse is_long_mode()
Dong, Eddie wrote:
>     struct vcpu_svm *svm = to_svm(vcpu);
>  #ifdef CONFIG_X86_64
> -   if (vcpu->arch.shadow_efer & EFER_LME) {
> +   if (is_long_mode(vcpu)) {

is_long_mode() actually tests EFER_LMA, so this is incorrect.
RE: RFC: Add reserved bits check
> Just noticed that walk_addr() too can be called from tdp context, so need to make sure rsvd_bits_mask is initialized in init_kvm_tdp_mmu() as well.

Yes, fixed.
Thx, eddie

commit b282565503a78e75af643de42fe7bf495e2213ec
Author: root r...@eddie-wb.localdomain
Date: Mon Mar 30 16:57:39 2009 +0800

    Emulate #PF error code of reserved bits violation.

    Signed-off-by: Eddie Dong eddie.d...@intel.com

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 55fd4c5..4fe2742 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -261,6 +261,7 @@ struct kvm_mmu {
 	union kvm_mmu_page_role base_role;
 
 	u64 *pae_root;
+	u64 rsvd_bits_mask[2][4];
 };
 
 struct kvm_vcpu_arch {
@@ -791,5 +792,6 @@ asmlinkage void kvm_handle_fault_on_reboot(void);
 #define KVM_ARCH_WANT_MMU_NOTIFIER
 int kvm_unmap_hva(struct kvm *kvm, unsigned long hva);
 int kvm_age_hva(struct kvm *kvm, unsigned long hva);
+int cpuid_maxphyaddr(struct kvm_vcpu *vcpu);
 
 #endif /* _ASM_X86_KVM_HOST_H */
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index ef060ec..2eab758 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -126,6 +126,7 @@ module_param(oos_shadow, bool, 0644);
 #define PFERR_PRESENT_MASK (1U << 0)
 #define PFERR_WRITE_MASK (1U << 1)
 #define PFERR_USER_MASK (1U << 2)
+#define PFERR_RSVD_MASK (1U << 3)
 #define PFERR_FETCH_MASK (1U << 4)
 
 #define PT_DIRECTORY_LEVEL 2
@@ -179,6 +180,11 @@ static u64 __read_mostly shadow_accessed_mask;
 static u64 __read_mostly shadow_dirty_mask;
 static u64 __read_mostly shadow_mt_mask;
 
+static inline u64 rsvd_bits(int s, int e)
+{
+	return ((1ULL << (e - s + 1)) - 1) << s;
+}
+
 void kvm_mmu_set_nonpresent_ptes(u64 trap_pte, u64 notrap_pte)
 {
 	shadow_trap_nonpresent_pte = trap_pte;
@@ -2155,6 +2161,14 @@ static void paging_free(struct kvm_vcpu *vcpu)
 	nonpaging_free(vcpu);
 }
 
+static bool is_rsvd_bits_set(struct kvm_vcpu *vcpu, u64 gpte, int level)
+{
+	int bit7;
+
+	bit7 = (gpte >> 7) & 1;
+	return (gpte & vcpu->arch.mmu.rsvd_bits_mask[bit7][level-1]) != 0;
+}
+
 #define PTTYPE 64
 #include "paging_tmpl.h"
 #undef PTTYPE
@@ -2163,6 +2177,54 @@ static void paging_free(struct kvm_vcpu *vcpu)
 #include "paging_tmpl.h"
 #undef PTTYPE
 
+void reset_rsvds_bits_mask(struct kvm_vcpu *vcpu, int level)
+{
+	struct kvm_mmu *context = &vcpu->arch.mmu;
+	int maxphyaddr = cpuid_maxphyaddr(vcpu);
+	u64 exb_bit_rsvd = 0;
+
+	if (!is_nx(vcpu))
+		exb_bit_rsvd = rsvd_bits(63, 63);
+	switch (level) {
+	case PT32_ROOT_LEVEL:
+		/* no rsvd bits for 2 level 4K page table entries */
+		context->rsvd_bits_mask[0][1] = 0;
+		context->rsvd_bits_mask[0][0] = 0;
+		if (is_cpuid_PSE36())
+			/* 36bits PSE 4MB page */
+			context->rsvd_bits_mask[1][1] = rsvd_bits(17, 21);
+		else
+			/* 32 bits PSE 4MB page */
+			context->rsvd_bits_mask[1][1] = rsvd_bits(13, 21);
+		context->rsvd_bits_mask[1][0] = 0;
+		break;
+	case PT32E_ROOT_LEVEL:
+		context->rsvd_bits_mask[0][1] = exb_bit_rsvd |
+			rsvd_bits(maxphyaddr, 62);	/* PDE */
+		context->rsvd_bits_mask[0][0] = exb_bit_rsvd |
+			rsvd_bits(maxphyaddr, 62);	/* PTE */
+		context->rsvd_bits_mask[1][1] = exb_bit_rsvd |
+			rsvd_bits(maxphyaddr, 62) |
+			rsvd_bits(13, 20);		/* large page */
+		context->rsvd_bits_mask[1][0] = context->rsvd_bits_mask[0][0];
+		break;
+	case PT64_ROOT_LEVEL:
+		context->rsvd_bits_mask[0][3] = exb_bit_rsvd |
+			rsvd_bits(maxphyaddr, 51) | rsvd_bits(7, 8);
+		context->rsvd_bits_mask[0][2] = exb_bit_rsvd |
+			rsvd_bits(maxphyaddr, 51) | rsvd_bits(7, 8);
+		context->rsvd_bits_mask[0][1] = exb_bit_rsvd |
+			rsvd_bits(maxphyaddr, 51) | rsvd_bits(7, 8);
+		context->rsvd_bits_mask[0][0] = rsvd_bits(maxphyaddr, 51);
+		context->rsvd_bits_mask[1][3] = context->rsvd_bits_mask[0][3];
+		context->rsvd_bits_mask[1][2] = context->rsvd_bits_mask[0][2];
+		context->rsvd_bits_mask[1][1] = exb_bit_rsvd |
+			rsvd_bits(maxphyaddr, 51) | rsvd_bits(13, 20);
+		context->rsvd_bits_mask[1][0] = context->rsvd_bits_mask[0][0];
+		break;
+	}
+}
+
 static int paging64_init_context_common(struct kvm_vcpu *vcpu, int level)
 {
 	struct kvm_mmu *context = &vcpu->arch.mmu;
@@ -2183,6 +2245,7 @@ static int paging64_init_context_common(struct kvm_vcpu *vcpu, int level)
 
 static int paging64_init_context(struct kvm_vcpu *vcpu)
 {
+	reset_rsvds_bits_mask(vcpu,
Re: Biweekly KVM Test report, kernel 0c7771... userspace 1223a0...
On (Mon) Mar 30 2009 [10:07:58], Avi Kivity wrote:
>>> 1. perfctr wrmsr warning when booting 64bit RHEL5.3
>>> https://sourceforge.net/tracker/?func=detail&aid=2721640&group_id=180599&atid=893831
>> This is the architectural performance counting MSR which was enabled in 4f76231 (KVM: x86: Ignore reads to EVNTSEL MSRs).
> Amit, can you check if appropriate cpuid leaf 10 reporting will fix this?

We already report 0s for cpuid leaf 10; we need to report 0x3f in EBX for leaf 10 to denote that the events corresponding to the bits aren't available. I checked and it didn't help (we can't rely on guests to abide by cpuid flags).

Amit
Re: Cleanup to reuse is_long_mode()
Dong, Eddie wrote:
> Avi Kivity wrote:
>> Dong, Eddie wrote:
>>>     struct vcpu_svm *svm = to_svm(vcpu);
>>>  #ifdef CONFIG_X86_64
>>> -   if (vcpu->arch.shadow_efer & EFER_LME) {
>>> +   if (is_long_mode(vcpu)) {
>> is_long_mode() actually tests EFER_LMA, so this is incorrect.
> Something missing? Here is the definition of is_long_mode; the patch is just an equal replacement. thx, eddie
>
> static inline int is_long_mode(struct kvm_vcpu *vcpu)
> {
> #ifdef CONFIG_X86_64
> 	return vcpu->arch.shadow_efer & EFER_LME;
> #else
> 	return 0;
> #endif
> }

You're looking at an old version. Mine has EFER_LMA. See 9d642b.
Re: Use rsvd_bits_mask in load_pdptrs for cleanup and considering EXB bit
Dong, Eddie wrote:
> @@ -2199,6 +2194,9 @@ void reset_rsvds_bits_mask(struct kvm_vcpu *vcpu, int level)
>  		context->rsvd_bits_mask[1][0] = 0;
>  		break;
>  	case PT32E_ROOT_LEVEL:
> +		context->rsvd_bits_mask[0][2] = exb_bit_rsvd |
> +			rsvd_bits(maxphyaddr, 62) |
> +			rsvd_bits(7, 8) | rsvd_bits(1, 2);	/* PDPTE */
>  		context->rsvd_bits_mask[0][1] = exb_bit_rsvd |
>  			rsvd_bits(maxphyaddr, 62);	/* PDE */
>  		context->rsvd_bits_mask[0][0] = exb_bit_rsvd

Are you sure that PDPTEs support NX? They don't support R/W and U/S, so it seems likely that NX is reserved as well even when EFER.NXE is enabled.
RE: [PATCH 2/2] kvm: qemu: check device assignment command
Avi Kivity wrote:
> Han, Weidong wrote:
>>> I suggest replacing the parsing code with pci_parse_devaddr() (needs to be extended to support functions) so that all the checking and parsing is done in one place.
>> If use pci_parse_devaddr(), it needs to add domain section to assigning command, and add function section to pci_add/pci_del commands. What's more, pci_parse_devaddr() parses guest device bdf, there are some assumptions, such as function is 0. But here parse host bdf. It's a little complex to combine them together.
> Right, but we end up with overall better code.

pci_parse_devaddr() parses [[domain:]bus:]slot; it's valid even when only a slot is entered, whereas it must be bus:slot.func in the device assignment command (-pcidevice host=bus:slot.func). So I implemented a dedicated function to parse the device bdf in the device assignment command, rather than mix the two parsing functions together.

Signed-off-by: Weidong Han weidong@intel.com

diff --git a/qemu/hw/device-assignment.c b/qemu/hw/device-assignment.c
index cef7c8a..53375ff 100644
--- a/qemu/hw/device-assignment.c
+++ b/qemu/hw/device-assignment.c
@@ -1195,8 +1195,7 @@ out:
  */
 AssignedDevInfo *add_assigned_device(const char *arg)
 {
-    char *cp, *cp1;
-    char device[8];
+    char device[16];
     char dma[6];
     int r;
     AssignedDevInfo *adev;
@@ -1207,6 +1206,13 @@ AssignedDevInfo *add_assigned_device(const char *arg)
         return NULL;
     }
     r = get_param_value(device, sizeof(device), "host", arg);
+    if (!r)
+        goto bad;
+
+    r = pci_parse_host_devaddr(device, &adev->bus, &adev->dev, &adev->func);
+    if (r)
+        goto bad;
+
     r = get_param_value(adev->name, sizeof(adev->name), "name", arg);
     if (!r)
         snprintf(adev->name, sizeof(adev->name), "%s", device);
@@ -1216,18 +1222,6 @@ AssignedDevInfo *add_assigned_device(const char *arg)
     if (r && !strncmp(dma, "none", 4))
         adev->disable_iommu = 1;
 #endif
-    cp = device;
-    adev->bus = strtoul(cp, &cp1, 16);
-    if (*cp1 != ':')
-        goto bad;
-    cp = cp1 + 1;
-
-    adev->dev = strtoul(cp, &cp1, 16);
-    if (*cp1 != '.')
-        goto bad;
-    cp = cp1 + 1;
-
-    adev->func = strtoul(cp, &cp1, 16);
 
     LIST_INSERT_HEAD(&adev_head, adev, next);
     return adev;
diff --git a/qemu/hw/pci.c b/qemu/hw/pci.c
index eca0517..bf97c8c 100644
--- a/qemu/hw/pci.c
+++ b/qemu/hw/pci.c
@@ -163,6 +163,7 @@ static int pci_set_default_subsystem_id(PCIDevice *pci_dev)
 }
 
 /*
+ * Parse pci address in qemu command
  * Parse [[domain:]bus:]slot, return -1 on error
  */
 static int pci_parse_devaddr(const char *addr, int *domp, int *busp, unsigned *slotp)
@@ -211,6 +212,55 @@ static int pci_parse_devaddr(const char *addr, int *domp, int *busp, unsigned *s
     return 0;
 }
 
+/*
+ * Parse device bdf in device assignment command:
+ *
+ *     -pcidevice host=bus:dev.func
+ *
+ * Parse bus:slot.func, return -1 on error
+ */
+int pci_parse_host_devaddr(const char *addr, int *busp,
+                           int *slotp, int *funcp)
+{
+    const char *p;
+    char *e;
+    int val;
+    int bus = 0, slot = 0, func = 0;
+
+    p = addr;
+    val = strtoul(p, &e, 16);
+    if (e == p)
+        return -1;
+    if (*e == ':') {
+        bus = val;
+        p = e + 1;
+        val = strtoul(p, &e, 16);
+        if (e == p)
+            return -1;
+        if (*e == '.') {
+            slot = val;
+            p = e + 1;
+            val = strtoul(p, &e, 16);
+            if (e == p)
+                return -1;
+            func = val;
+        } else
+            return -1;
+    } else
+        return -1;
+
+    if (bus > 0xff || slot > 0x1f || func > 0x7)
+        return -1;
+
+    if (*e)
+        return -1;
+
+    *busp = bus;
+    *slotp = slot;
+    *funcp = func;
+    return 0;
+}
+
 int pci_read_devaddr(const char *addr, int *domp, int *busp, unsigned *slotp)
 {
     char devaddr[32];
diff --git a/qemu/hw/pci.h b/qemu/hw/pci.h
index a7438f2..bfdd29a 100644
--- a/qemu/hw/pci.h
+++ b/qemu/hw/pci.h
@@ -227,6 +227,9 @@ PCIDevice *pci_find_device(int bus_num, int slot, int function);
 int pci_read_devaddr(const char *addr, int *domp, int *busp, unsigned *slotp);
 int pci_assign_devaddr(const char *addr, int *domp, int *busp, unsigned *slotp);
 
+int pci_parse_host_devaddr(const char *addr, int *busp,
+                           int *slotp, int *funcp);
+
 void pci_info(Monitor *mon);
 
 PCIBus *pci_bridge_init(PCIBus *bus, int devfn, uint16_t vid, uint16_t did,
                         pci_map_irq_fn map_irq, const char *name);
Re: RFC: Add reserved bits check
Dong, Eddie wrote:
>> Just noticed that walk_addr() too can be called from tdp context, so need to make sure rsvd_bits_mask is initialized in init_kvm_tdp_mmu() as well.
> Yes, fixed.

Applied, thanks. I also added unit tests for bit 51 of the pte and pde in the mmu tests.
Can disk geometry be specified in libvirt?
Hello,

I'm trying to pass a fibre channel virtual disk to a KVM host via libvirt. On the host, the disk is:

Disk /dev/sdb: 53.6 GB, 53631516672 bytes
64 heads, 32 sectors/track, 6393 cylinders
Units = cylinders of 2048 * 4096 = 8388608 bytes
Disk identifier: 0x5e9ca6c0

As you can see, the sector size is 4096 and not the usual 512 bytes. If I pass this to a KVM guest, I get this in the guest:

Disk /dev/vdc: 53.6 GB, 53631516672 bytes
64 heads, 32 sectors/track, 51147 cylinders
Units = cylinders of 2048 * 512 = 1048576 bytes
Disk identifier: 0x5e9ca6c0

As you can see, its sector size has not been recognised correctly; it's recognised as 512 bytes. Because of this, the disk cannot be used. Is there any way to pass the sector size to the guest?

Thanks
Re: BUG: soft lockup - CPU stuck for ...
Hi,

many thanks for your replies. I upgraded some systems to kernel 2.6.29 a few days ago. There was especially one system which nearly always crashed during kernel compilation. With 2.6.29 as host and guest it currently works. I have now compiled the kernel three times (always from scratch) and nothing crashed. To use i686 (or x86) as host wouldn't be an option. The preemptible kernel seems a possible way to go if the crash happens again. But if it works now I'll leave it as it is, since there are still drivers out there which have problems with preemptible kernels.

But there is something I still wonder: Is this the right mailing list for such requests? If I read a message like "BUG: soft lockup - CPU#0 stuck for ...", it looks to me like a bug which should be looked after by the developers, but it seems that nobody here really cares for such reports. I'm really grateful for KVM and the work done by all the developers, but isn't it in the interest of a company like Redhat to get the product stable and to eliminate all known bugs before the release of their new virtualisation product? I really don't mean this as a flame, because my intention is really to help get KVM better. But the only thing I can do is to submit bug reports, since I'm not a C/C++ developer.

Btw: Is there an overview of what kernel settings are recommended for KVM hosts and guests beside the obvious ones? I've learned so far that the noop I/O scheduler in the guest and deadline in the host are good choices. I've read in the XFS filesystem FAQ that the KVM drive= option should include cache=none to avoid filesystem corruption (which I've already had in some KVMs and which caused me to switch to ext3 instead). The kernel settings are especially useful for people like me who're using Gentoo, where you have to compile everything yourself.

Keep the good work going! Thanks!

Robert

> Hi,
> I was also experiencing this problem a lot for quite a long time (and for a wide range of KVM versions..) I might be completely wrong as I'm not sure if it was really the reason, but I THINK it disappeared when I started to use a fully preemptible kernel on the host.. You might want to try it...
> BR
> nik
>
> On Sun, Mar 29, 2009 at 07:51:21AM +0000, Gerrit Slomma wrote:
>> Robert Wimmer (r.wimmer at tomorrow-focus.de) writes:
>>> Hi,
>>> does anyone know how to solve the problem with "BUG: soft lockup - CPU#0 stuck for ..."? Today I got the messages below during compilation of the kernel modules in a guest, using kvm84 and kernel 2.6.29 as host kernel and 2.6.28 as guest kernel. During the hangup of the guest neither ssh nor ping was possible. After about 2 minutes the guest was reachable again and I saw the messages below with dmesg. Maybe it is related to my previously answered posting: http://article.gmane.org/gmane.comp.emulators.kvm.devel/29677
>>> Thanks! Robert
>>> BUG: soft lockup - CPU#0 stuck for 61s! (...)
>> Hello
>> Do you use x86_64 or i686? Look at my post here http://article.gmane.org/gmane.comp.emulators.kvm.devel/29833 and my bug report here https://bugzilla.redhat.com/show_bug.cgi?id=492688. I do not have the problems while running but after migrating. Problems with stuck CPUs vanish if i686 for the host is used - but I am testing further.
>
> --
> Nikola CIPRICH
> LinuxBox.cz, s.r.o.
> 28. rijna 168, 709 01 Ostrava
> tel.: +420 596 603 142, fax: +420 596 621 273, mobil: +420 777 093 799
> www.linuxbox.cz
> mobil servis: +420 737 238 656, email servis: servis at linuxbox.cz
[PATCH v2 1/5] Fix handling of a fault during NMI unblocked due to IRET
Bit 12 is undefined in any of the following cases:
- If the VM exit sets the valid bit in the IDT-vectoring information field.
- If the VM exit is due to a double fault.

Signed-off-by: Gleb Natapov g...@redhat.com
---
 arch/x86/kvm/vmx.c | 17 +++++++++++------
 1 files changed, 11 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 37ae13d..14e3f48 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -3259,36 +3259,41 @@ static void update_tpr_threshold(struct kvm_vcpu *vcpu)
 static void vmx_complete_interrupts(struct vcpu_vmx *vmx)
 {
 	u32 exit_intr_info;
-	u32 idt_vectoring_info;
+	u32 idt_vectoring_info = vmx->idt_vectoring_info;
 	bool unblock_nmi;
 	u8 vector;
 	int type;
 	bool idtv_info_valid;
 	u32 error;
 
+	idtv_info_valid = idt_vectoring_info & VECTORING_INFO_VALID_MASK;
 	exit_intr_info = vmcs_read32(VM_EXIT_INTR_INFO);
 	if (cpu_has_virtual_nmis()) {
 		unblock_nmi = (exit_intr_info & INTR_INFO_UNBLOCK_NMI) != 0;
 		vector = exit_intr_info & INTR_INFO_VECTOR_MASK;
 		/*
-		 * SDM 3: 25.7.1.2
+		 * SDM 3: 27.7.1.2 (September 2008)
 		 * Re-set bit "block by NMI" before VM entry if vmexit caused by
 		 * a guest IRET fault.
+		 * SDM 3: 23.2.2 (September 2008)
+		 * Bit 12 is undefined in any of the following cases:
+		 *  If the VM exit sets the valid bit in the IDT-vectoring
+		 *   information field.
+		 *  If the VM exit is due to a double fault.
 		 */
-		if (unblock_nmi && vector != DF_VECTOR)
+		if ((exit_intr_info & INTR_INFO_VALID_MASK) && unblock_nmi &&
+		    vector != DF_VECTOR && !idtv_info_valid)
 			vmcs_set_bits(GUEST_INTERRUPTIBILITY_INFO,
 				      GUEST_INTR_STATE_NMI);
 	} else if (unlikely(vmx->soft_vnmi_blocked))
 		vmx->vnmi_blocked_time +=
 			ktime_to_ns(ktime_sub(ktime_get(), vmx->entry_time));
 
-	idt_vectoring_info = vmx->idt_vectoring_info;
-	idtv_info_valid = idt_vectoring_info & VECTORING_INFO_VALID_MASK;
 	vector = idt_vectoring_info & VECTORING_INFO_VECTOR_MASK;
 	type = idt_vectoring_info & VECTORING_INFO_TYPE_MASK;
 	if (vmx->vcpu.arch.nmi_injected) {
 		/*
-		 * SDM 3: 25.7.1.2
+		 * SDM 3: 27.7.1.2 (September 2008)
 		 * Clear bit "block by NMI" before VM entry if a NMI delivery
 		 * faulted.
 		 */
--
To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org
[PATCH v2 2/5] Rewrite twisted maze of if() statements with more straightforward switch()
Also fix a bug where an NMI could be dropped on exit, although this should never happen in practice.

Signed-off-by: Gleb Natapov g...@redhat.com
---
 arch/x86/kvm/vmx.c | 43 +++++++++++++++++++++++-------------------
 1 files changed, 25 insertions(+), 18 deletions(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 14e3f48..1017544 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -3264,7 +3264,6 @@ static void vmx_complete_interrupts(struct vcpu_vmx *vmx)
 	u8 vector;
 	int type;
 	bool idtv_info_valid;
-	u32 error;
 
 	idtv_info_valid = idt_vectoring_info & VECTORING_INFO_VALID_MASK;
 	exit_intr_info = vmcs_read32(VM_EXIT_INTR_INFO);
@@ -3289,34 +3288,42 @@ static void vmx_complete_interrupts(struct vcpu_vmx *vmx)
 		vmx->vnmi_blocked_time +=
 			ktime_to_ns(ktime_sub(ktime_get(), vmx->entry_time));
 
+	vmx->vcpu.arch.nmi_injected = false;
+	kvm_clear_exception_queue(&vmx->vcpu);
+	kvm_clear_interrupt_queue(&vmx->vcpu);
+
+	if (!idtv_info_valid)
+		return;
+
 	vector = idt_vectoring_info & VECTORING_INFO_VECTOR_MASK;
 	type = idt_vectoring_info & VECTORING_INFO_TYPE_MASK;
-	if (vmx->vcpu.arch.nmi_injected) {
+
+	switch (type) {
+	case INTR_TYPE_NMI_INTR:
+		vmx->vcpu.arch.nmi_injected = true;
 		/*
 		 * SDM 3: 27.7.1.2 (September 2008)
-		 * Clear bit "block by NMI" before VM entry if a NMI delivery
-		 * faulted.
+		 * Clear bit "block by NMI" before VM entry if a NMI
+		 * delivery faulted.
 		 */
-		if (idtv_info_valid && type == INTR_TYPE_NMI_INTR)
-			vmcs_clear_bits(GUEST_INTERRUPTIBILITY_INFO,
-					GUEST_INTR_STATE_NMI);
-		else
-			vmx->vcpu.arch.nmi_injected = false;
-	}
-	kvm_clear_exception_queue(&vmx->vcpu);
-	if (idtv_info_valid && (type == INTR_TYPE_HARD_EXCEPTION ||
-	    type == INTR_TYPE_SOFT_EXCEPTION)) {
+		vmcs_clear_bits(GUEST_INTERRUPTIBILITY_INFO,
+				GUEST_INTR_STATE_NMI);
+		break;
+	case INTR_TYPE_HARD_EXCEPTION:
+	case INTR_TYPE_SOFT_EXCEPTION:
 		if (idt_vectoring_info & VECTORING_INFO_DELIVER_CODE_MASK) {
-			error = vmcs_read32(IDT_VECTORING_ERROR_CODE);
-			kvm_queue_exception_e(&vmx->vcpu, vector, error);
+			u32 err = vmcs_read32(IDT_VECTORING_ERROR_CODE);
+			kvm_queue_exception_e(&vmx->vcpu, vector, err);
 		} else
 			kvm_queue_exception(&vmx->vcpu, vector);
 		vmx->idt_vectoring_info = 0;
-	}
-	kvm_clear_interrupt_queue(&vmx->vcpu);
-	if (idtv_info_valid && type == INTR_TYPE_EXT_INTR) {
+		break;
+	case INTR_TYPE_EXT_INTR:
 		kvm_queue_interrupt(&vmx->vcpu, vector);
 		vmx->idt_vectoring_info = 0;
+		break;
+	default:
+		break;
 	}
 }
--
To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org
[PATCH v2 3/5] Do not zero idt_vectoring_info in vmx_complete_interrupts().
We will need it later in task_switch(). The code in handle_exception() is dead: is_external_interrupt(vect_info) will always be false since idt_vectoring_info is zeroed in vmx_complete_interrupts().

Signed-off-by: Gleb Natapov g...@redhat.com
---
 arch/x86/kvm/vmx.c | 7 -------
 1 files changed, 0 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 1017544..0da7a9e 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -2613,11 +2613,6 @@ static int handle_exception(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run)
 		printk(KERN_ERR "%s: unexpected, vectoring info 0x%x intr info 0x%x\n",
 		       __func__, vect_info, intr_info);
 
-	if (!irqchip_in_kernel(vcpu->kvm) && is_external_interrupt(vect_info)) {
-		int irq = vect_info & VECTORING_INFO_VECTOR_MASK;
-		kvm_push_irq(vcpu, irq);
-	}
-
 	if ((intr_info & INTR_INFO_INTR_TYPE_MASK) == INTR_TYPE_NMI_INTR)
 		return 1;  /* already handled by vmx_vcpu_run() */
 
@@ -3316,11 +3311,9 @@ static void vmx_complete_interrupts(struct vcpu_vmx *vmx)
 			kvm_queue_exception_e(&vmx->vcpu, vector, err);
 		} else
 			kvm_queue_exception(&vmx->vcpu, vector);
-		vmx->idt_vectoring_info = 0;
 		break;
 	case INTR_TYPE_EXT_INTR:
 		kvm_queue_interrupt(&vmx->vcpu, vector);
-		vmx->idt_vectoring_info = 0;
 		break;
 	default:
 		break;
--
To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org
[PATCH v2 4/5] Fix task switch back link handling.
Back link is written to a wrong TSS now.

Signed-off-by: Gleb Natapov g...@redhat.com
---
 arch/x86/kvm/x86.c | 40 ++++++++++++++++++++++++++++--------
 1 files changed, 32 insertions(+), 8 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index ae4918c..f14c622 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -3697,7 +3697,6 @@ static void save_state_to_tss32(struct kvm_vcpu *vcpu,
 	tss->fs = get_segment_selector(vcpu, VCPU_SREG_FS);
 	tss->gs = get_segment_selector(vcpu, VCPU_SREG_GS);
 	tss->ldt_selector = get_segment_selector(vcpu, VCPU_SREG_LDTR);
-	tss->prev_task_link = get_segment_selector(vcpu, VCPU_SREG_TR);
 }
 
 static int load_state_from_tss32(struct kvm_vcpu *vcpu,
@@ -3794,8 +3793,8 @@ static int load_state_from_tss16(struct kvm_vcpu *vcpu,
 }
 
 static int kvm_task_switch_16(struct kvm_vcpu *vcpu, u16 tss_selector,
-			      u32 old_tss_base,
-			      struct desc_struct *nseg_desc)
+			      u16 old_tss_sel, u32 old_tss_base,
+			      struct desc_struct *nseg_desc)
 {
 	struct tss_segment_16 tss_segment_16;
 	int ret = 0;
@@ -3814,6 +3813,16 @@ static int kvm_task_switch_16(struct kvm_vcpu *vcpu, u16 tss_selector,
 			   &tss_segment_16, sizeof tss_segment_16))
 		goto out;
 
+	if (old_tss_sel != 0xffff) {
+		tss_segment_16.prev_task_link = old_tss_sel;
+
+		if (kvm_write_guest(vcpu->kvm,
+				    get_tss_base_addr(vcpu, nseg_desc),
+				    &tss_segment_16.prev_task_link,
+				    sizeof tss_segment_16.prev_task_link))
+			goto out;
+	}
+
 	if (load_state_from_tss16(vcpu, &tss_segment_16))
 		goto out;
 
@@ -3823,7 +3832,7 @@ out:
 }
 
 static int kvm_task_switch_32(struct kvm_vcpu *vcpu, u16 tss_selector,
-			      u32 old_tss_base,
+			      u16 old_tss_sel, u32 old_tss_base,
 			      struct desc_struct *nseg_desc)
 {
 	struct tss_segment_32 tss_segment_32;
@@ -3843,6 +3852,16 @@ static int kvm_task_switch_32(struct kvm_vcpu *vcpu, u16 tss_selector,
 			   &tss_segment_32, sizeof tss_segment_32))
 		goto out;
 
+	if (old_tss_sel != 0xffff) {
+		tss_segment_32.prev_task_link = old_tss_sel;
+
+		if (kvm_write_guest(vcpu->kvm,
+				    get_tss_base_addr(vcpu, nseg_desc),
+				    &tss_segment_32.prev_task_link,
+				    sizeof tss_segment_32.prev_task_link))
+			goto out;
+	}
+
 	if (load_state_from_tss32(vcpu, &tss_segment_32))
 		goto out;
 
@@ -3898,12 +3917,17 @@ int kvm_task_switch(struct kvm_vcpu *vcpu, u16 tss_selector, int reason)
 
 	kvm_x86_ops->skip_emulated_instruction(vcpu);
 
+	/* set back link to prev task only if NT bit is set in eflags;
+	   note that old_tss_sel is not used after this point */
+	if (reason != TASK_SWITCH_CALL && reason != TASK_SWITCH_GATE)
+		old_tss_sel = 0xffff;
+
 	if (nseg_desc.type & 8)
-		ret = kvm_task_switch_32(vcpu, tss_selector, old_tss_base,
-					 &nseg_desc);
+		ret = kvm_task_switch_32(vcpu, tss_selector, old_tss_sel,
+					 old_tss_base, &nseg_desc);
 	else
-		ret = kvm_task_switch_16(vcpu, tss_selector, old_tss_base,
-					 &nseg_desc);
+		ret = kvm_task_switch_16(vcpu, tss_selector, old_tss_sel,
+					 old_tss_base, &nseg_desc);
 
 	if (reason == TASK_SWITCH_CALL || reason == TASK_SWITCH_GATE) {
 		u32 eflags = kvm_x86_ops->get_rflags(vcpu);
--
To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org
[PATCH v2 5/5] Fix unneeded instruction skipping during task switching.
There is no need to skip an instruction if the reason for a task switch is a task gate in the IDT and access to it is caused by an external event. The problem is currently solved only for VMX since there is no reliable way to skip an instruction in SVM. We should emulate it instead.

Signed-off-by: Gleb Natapov g...@redhat.com
---
 arch/x86/include/asm/svm.h |  1 +
 arch/x86/kvm/svm.c         | 25 ++++++++++++++---------
 arch/x86/kvm/vmx.c         | 40 +++++++++++++++++++++++++++-----------
 arch/x86/kvm/x86.c         |  5 ++++-
 4 files changed, 52 insertions(+), 19 deletions(-)

diff --git a/arch/x86/include/asm/svm.h b/arch/x86/include/asm/svm.h
index 82ada75..85574b7 100644
--- a/arch/x86/include/asm/svm.h
+++ b/arch/x86/include/asm/svm.h
@@ -225,6 +225,7 @@ struct __attribute__ ((__packed__)) vmcb {
 #define SVM_EVTINJ_VALID_ERR (1 << 11)
 
 #define SVM_EXITINTINFO_VEC_MASK SVM_EVTINJ_VEC_MASK
+#define SVM_EXITINTINFO_TYPE_MASK SVM_EVTINJ_TYPE_MASK
 
 #define	SVM_EXITINTINFO_TYPE_INTR SVM_EVTINJ_TYPE_INTR
 #define	SVM_EXITINTINFO_TYPE_NMI SVM_EVTINJ_TYPE_NMI
diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index 1fcbc17..3ffb695 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -1823,17 +1823,28 @@ static int task_switch_interception(struct vcpu_svm *svm,
 				    struct kvm_run *kvm_run)
 {
 	u16 tss_selector;
+	int reason;
+	int int_type = svm->vmcb->control.exit_int_info &
+		SVM_EXITINTINFO_TYPE_MASK;
 
 	tss_selector = (u16)svm->vmcb->control.exit_info_1;
+
 	if (svm->vmcb->control.exit_info_2 &
 	    (1ULL << SVM_EXITINFOSHIFT_TS_REASON_IRET))
-		return kvm_task_switch(&svm->vcpu, tss_selector,
-				       TASK_SWITCH_IRET);
-	if (svm->vmcb->control.exit_info_2 &
-	    (1ULL << SVM_EXITINFOSHIFT_TS_REASON_JMP))
-		return kvm_task_switch(&svm->vcpu, tss_selector,
-				       TASK_SWITCH_JMP);
-	return kvm_task_switch(&svm->vcpu, tss_selector, TASK_SWITCH_CALL);
+		reason = TASK_SWITCH_IRET;
+	else if (svm->vmcb->control.exit_info_2 &
+		 (1ULL << SVM_EXITINFOSHIFT_TS_REASON_JMP))
+		reason = TASK_SWITCH_JMP;
+	else if (svm->vmcb->control.exit_int_info & SVM_EXITINTINFO_VALID)
+		reason = TASK_SWITCH_GATE;
+	else
+		reason = TASK_SWITCH_CALL;
+
+	if (reason != TASK_SWITCH_GATE || int_type == SVM_EXITINTINFO_TYPE_SOFT)
+		skip_emulated_instruction(&svm->vcpu);
+
+	return kvm_task_switch(&svm->vcpu, tss_selector, reason);
 }
 
 static int cpuid_interception(struct vcpu_svm *svm, struct kvm_run *kvm_run)
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 0da7a9e..01db958 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -3025,22 +3025,40 @@ static int handle_task_switch(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run)
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
 	unsigned long exit_qualification;
 	u16 tss_selector;
-	int reason;
+	int reason, type, idt_v;
+
+	idt_v = (vmx->idt_vectoring_info & VECTORING_INFO_VALID_MASK);
+	type = (vmx->idt_vectoring_info & VECTORING_INFO_TYPE_MASK);
 
 	exit_qualification = vmcs_readl(EXIT_QUALIFICATION);
 
 	reason = (u32)exit_qualification >> 30;
-	if (reason == TASK_SWITCH_GATE && vmx->vcpu.arch.nmi_injected &&
-	    (vmx->idt_vectoring_info & VECTORING_INFO_VALID_MASK) &&
-	    (vmx->idt_vectoring_info & VECTORING_INFO_TYPE_MASK)
-	    == INTR_TYPE_NMI_INTR) {
-		vcpu->arch.nmi_injected = false;
-		if (cpu_has_virtual_nmis())
-			vmcs_set_bits(GUEST_INTERRUPTIBILITY_INFO,
-				      GUEST_INTR_STATE_NMI);
+	if (reason == TASK_SWITCH_GATE && idt_v) {
+		switch (type) {
+		case INTR_TYPE_NMI_INTR:
+			vcpu->arch.nmi_injected = false;
+			if (cpu_has_virtual_nmis())
+				vmcs_set_bits(GUEST_INTERRUPTIBILITY_INFO,
+					      GUEST_INTR_STATE_NMI);
+			break;
+		case INTR_TYPE_EXT_INTR:
+			kvm_clear_interrupt_queue(vcpu);
+			break;
+		case INTR_TYPE_HARD_EXCEPTION:
+		case INTR_TYPE_SOFT_EXCEPTION:
+			kvm_clear_exception_queue(vcpu);
+			break;
+		default:
+			break;
+		}
 	}
 	tss_selector = exit_qualification;
 
+	if (!idt_v || (type != INTR_TYPE_HARD_EXCEPTION &&
+		       type != INTR_TYPE_EXT_INTR &&
+		       type != INTR_TYPE_NMI_INTR))
+		skip_emulated_instruction(vcpu);
+
 	if (!kvm_task_switch(vcpu, tss_selector, reason))
 		return 0;
 
@@ -3292,8
Re: [PATCH 4/4] Fix task switching.
On Mon, Mar 30, 2009 at 10:39:21AM +0300, Avi Kivity wrote:
 Gleb Natapov wrote:
  The patch fixes two problems with task switching.
  1. Back link is written to a wrong TSS.
  2. Instruction emulation is not needed if the reason for a task switch is a task gate in the IDT and access to it is caused by an external event. 2 is currently solved only for VMX, since there is no reliable way to skip an instruction in SVM. We should emulate it instead.

 Looks good, but please split into (at least) two patches. Also please provide a test case so we don't regress again.

This is what I am using for testing. After running make you should get a kernel.bin that can be booted from grub. Runs on real HW too. I am planning to add more tests.

Signed-off-by: Gleb Natapov g...@redhat.com

diff --git a/user/test/x86/kvmtest/Makefile b/user/test/x86/kvmtest/Makefile
new file mode 100644
index 000..b93935f
--- /dev/null
+++ b/user/test/x86/kvmtest/Makefile
@@ -0,0 +1,33 @@
+CC=gcc
+AS=gcc
+CFLAGS=-m32 -I. -O2 -Wall
+ASFLAGS=-m32 -I.
+OBJS=kernel.o lib.o boot.o memory.o gdt.o idt.o isrs.o tss.o uart.o
+ALLOBJS=$(OBJS) tests/tests.o
+
+PHONY := all
+all: kernel.bin
+	$(MAKE) -C tests
+
+kernel.bin: $(ALLOBJS) kernel.ld
+	ld -T kernel.ld $(ALLOBJS) -o $@
+
+install: kernel.bin
+	cp $< /boot/
+
+tests/tests.o:
+	$(MAKE) -C tests
+
+-include $(OBJS:.o=.d)
+
+# compile and generate dependency info
+%.o: %.c
+	gcc -c $(CFLAGS) $*.c -o $*.o
+	gcc -MM $(CFLAGS) $*.c > $*.d
+
+PHONY += clean
+clean:
+	$(MAKE) -C tests
+	-rm *.o *~ *.d kernel.bin
+
+.PHONY: $(PHONY)
diff --git a/user/test/x86/kvmtest/boot.S b/user/test/x86/kvmtest/boot.S
new file mode 100644
index 000..f74015c
--- /dev/null
+++ b/user/test/x86/kvmtest/boot.S
@@ -0,0 +1,357 @@
+/* boot.S - bootstrap the kernel */
+/* Copyright (C) 1999, 2001 Free Software Foundation, Inc.
+
+   This program is free software; you can redistribute it and/or modify
+   it under the terms of the GNU General Public License as published by
+   the Free Software Foundation; either version 2 of the License, or
+   (at your option) any later version.
+
+   This program is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+   GNU General Public License for more details.
+
+   You should have received a copy of the GNU General Public License
+   along with this program; if not, write to the Free Software
+   Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.  */
+
+#define ASM 1
+#include <multiboot.h>
+#include <kernel.h>
+
+.text
+
+.globl  start, _start
+start:
+_start:
+	jmp	multiboot_entry
+
+	/* Align 32 bits boundary. */
+	.align	4
+
+	/* Multiboot header. */
+multiboot_header:
+	/* magic */
+	.long	MULTIBOOT_HEADER_MAGIC
+	/* flags */
+	.long	MULTIBOOT_HEADER_FLAGS
+	/* checksum */
+	.long	-(MULTIBOOT_HEADER_MAGIC + MULTIBOOT_HEADER_FLAGS)
+#ifndef __ELF__
+	/* header_addr */
+	.long	multiboot_header
+	/* load_addr */
+	.long	_start
+	/* load_end_addr */
+	.long	_edata
+	/* bss_end_addr */
+	.long	_end
+	/* entry_addr */
+	.long	multiboot_entry
+#endif /* ! __ELF__ */
+
+multiboot_entry:
+	/* Initialize the stack pointer. */
+	movl	$(STACK_START), %esp
+
+	/* Reset EFLAGS. */
+	pushl	$0
+	popf
+
+	/* Push the pointer to the Multiboot information structure. */
+	pushl	%ebx
+	/* Push the magic value. */
+	pushl	%eax
+
+	/* Now enter the C main function... */
+	call	cmain
+
+	/* Halt. */
+	pushl	$halt_message
+	pushl	$0
+	call	printk
+
+loop:	hlt
+	jmp	loop
+
+.globl isr0
+.globl isr1
+.globl isr2
+.globl isr3
+.globl isr4
+.globl isr5
+.globl isr6
+.globl isr7
+.globl isr8
+.globl isr9
+.globl isr10
+.globl isr11
+.globl isr12
+.globl isr13
+.globl isr14
+.globl isr15
+.globl isr16
+.globl isr17
+.globl isr18
+.globl isr19
+.globl isr20
+.globl isr21
+.globl isr22
+.globl isr23
+.globl isr24
+.globl isr25
+.globl isr26
+.globl isr27
+.globl isr28
+.globl isr29
+.globl isr30
+.globl isr31
+
+/* 0: Divide By Zero Exception */
+isr0:
+	cli
+	pushl	$0
+	pushl	$0
+	jmp	isr_common_stub
+
+/* 1: Debug Exception */
+isr1:
+	cli
+	pushl	$0
+	pushl	$1
+	jmp	isr_common_stub
+
+/* 2: Non Maskable Interrupt Exception */
+isr2:
+	cli
+	pushl	$0
+	pushl	$2
+	jmp	isr_common_stub
+
+/* 3: Int 3 Exception */
+isr3:
+	cli
+	pushl	$0
+	pushl	$3
+	jmp	isr_common_stub
+
+/* 4: INTO Exception */
+isr4:
+	cli
+	pushl	$0
+	pushl	$4
+	jmp	isr_common_stub
+
+/* 5: Out of Bounds Exception */
+isr5:
+	cli
+	pushl	$0
+	pushl	$5
+	jmp	isr_common_stub
+
+/* 6: Invalid Opcode Exception
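Since kernel.bin above is a multiboot image, it can be smoke-tested without installing it under grub; a hedged sketch (the qemu binary name, its multiboot-capable -kernel loader, and the serial console wiring are assumptions about the local setup, not something stated in the thread):

```shell
# Build the test kernel, then boot it directly with qemu's multiboot
# loader instead of copying it to /boot and rebooting through grub.
# Assumes a kvm-enabled qemu build; drop -enable-kvm to run emulated.
make
qemu-system-x86_64 -enable-kvm -kernel kernel.bin -nographic
```

The same image should also boot from a grub menu entry on real hardware, which is what the `install:` target in the Makefile is for.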
Re: Live memory allocation?
On Saturday 28 March 2009 11:17:42 am you wrote:
 KVM devs have a patch called KSM (short for kernel shared memory I think) that helps windows guests a good bit. See the original announcement [1] for some numbers. I spoke to one of the devs recently and they said they are going to resubmit it soon.

I remember the discussion about KSM. First, the kernel developers were not very happy with the approach, and second, there were some patent implications with VMware. Have these issues been resolved?

Don't get me wrong. I'm not trying to stop KSM, I'm just wondering if I can get my hopes up again. I thought KSM was a great idea and I'd love to get my hands on it.

--
Alberto Treviño
BYU Testing Center
Brigham Young University
Re: Live memory allocation?
Avi Kivity wrote:
 (...) Perhaps KSM would help you? Alternately, a heuristic that scanned for (and collapsed) fully zeroed pages when a page is faulted in for the first time could catch these. ksm will indeed collapse these pages. Lighter-weight alternatives exist -- ballooning (needs a Windows driver), or, like you mention, a simple scanner that looks for zero pages and drops them. That could be implemented within qemu (with some simple kernel support for dropping zero pages atomically, say madvise(MADV_DROP_IFZERO)).

From the KSM description I can conclude that it allows dynamically sharing identical memory pages between one or more processes. What about cache/buffer sharing between the host kernel and running processes? If I'm not mistaken, right now, memory is wasted by caching the same data in both the host and guest kernels.

For example, let's say we have a host with 2 GB RAM and it runs a 1 GB guest. If we read a ~900 MB file_1 (block device) on the guest, then:

- the guest's kernel will cache file_1
- the host's kernel will cache the same area of file_1 (block device)

Now, if we want to read a ~900 MB file_2 (or lots of files of that size), the cache for file_1 will be emptied on both guest and host as we read file_2. The ideal situation would be if host and guest caches could be shared, to a degree (and have both file_1 and file_2 in memory; it doesn't matter whether on the guest or the host).

--
Tomasz Chmielewski
http://wpkg.org
Re: Live memory allocation?
Tomasz Chmielewski wrote:
 What about cache/buffer sharing between the host kernel and running processes? If I'm not mistaken, right now, memory is wasted by caching the same data in both the host and guest kernels. For example, let's say we have a host with 2 GB RAM and it runs a 1 GB guest. If we read a ~900 MB file_1 (block device) on the guest, then: the guest's kernel will cache file_1, and the host's kernel will cache the same area of file_1 (block device). Now, if we want to read a ~900 MB file_2 (or lots of files of that size), the cache for file_1 will be emptied on both guest and host as we read file_2. The ideal situation would be if host and guest caches could be shared, to a degree (and have both file_1 and file_2 in memory; it doesn't matter whether on the guest or the host).

Double caching is indeed a bad idea. That's why you have cache=off (though it isn't recommended with qcow2).

--
error compiling committee.c: too many arguments to function
Re: Biweekly KVM Test report, kernel 0c7771... userspace 1223a0...
Amit Shah wrote:
 On (Mon) Mar 30 2009 [10:07:58], Avi Kivity wrote:
  1. perfctr wrmsr warning when booting 64bit RHEL5.3
  https://sourceforge.net/tracker/?func=detailaid=2721640group_id=180599atid=893831
  This is the architectural performance counting msr which was enabled in 4f76231 (KVM: x86: Ignore reads to EVNTSEL MSRs). Amit, can you check if appropriate cpuid leaf 10 reporting will fix this?

 We already report 0s for the cpuid leaf 10; we need to report 0x3f in EBX for leaf 10 to denote that the events corresponding to the bits aren't available. I checked and it didn't help (we can't rely on guests to abide by cpuid flags).

I see this in the code:

	/*
	 * Check whether the Architectural PerfMon supports
	 * Unhalted Core Cycles Event or not.
	 * NOTE: Corresponding bit = 0 in ebx indicates event present.
	 */
	cpuid(10, &(eax.full), &ebx, &unused, &unused);
	if ((eax.split.mask_length <
	     (ARCH_PERFMON_UNHALTED_CORE_CYCLES_INDEX+1)) ||
	    (ebx & ARCH_PERFMON_UNHALTED_CORE_CYCLES_PRESENT))
		return 0;

So I think it can be done.

--
error compiling committee.c: too many arguments to function
Re: Live memory allocation?
Avi Kivity wrote:
 Tomasz Chmielewski wrote:
  What about cache/buffer sharing between the host kernel and running processes? If I'm not mistaken, right now, memory is wasted by caching the same data in both the host and guest kernels. For example, let's say we have a host with 2 GB RAM and it runs a 1 GB guest. If we read a ~900 MB file_1 (block device) on the guest, then: the guest's kernel will cache file_1, and the host's kernel will cache the same area of file_1 (block device). Now, if we want to read a ~900 MB file_2 (or lots of files of that size), the cache for file_1 will be emptied on both guest and host as we read file_2. The ideal situation would be if host and guest caches could be shared, to a degree (and have both file_1 and file_2 in memory; it doesn't matter whether on the guest or the host).

 Double caching is indeed a bad idea. That's why you have cache=off (though it isn't recommended with qcow2).

The cache= option is about the write cache, right? Here, I'm talking about the read cache. Or does cache=none disable the read cache as well?

--
Tomasz Chmielewski
http://wpkg.org
Re: [PATCH 0/2] qemu: SMBIOS passing support
Is there any interest in this series? Aside from copying host SMBIOS entries, it also seems useful for providing information to the guest about their virtual machine pool (perhaps via a type 3 entry), or whatever other bits of data someone might find useful (type 11, OEM string, for instance).

Thanks,
Alex

On Mon, 2009-03-23 at 13:05 -0600, Alex Williamson wrote:
 This series adds a new -smbios option for x86 that allows individual SMBIOS entries to be passed into the guest VM. This follows the same basic path as the support for loading ACPI tables. While SMBIOS is independent of ACPI, I chose to add the smbios_entry_add() function to acpi.c because they're both somewhat PC BIOS related (and ia64 can support SMBIOS and might be able to make use of it there). This feature allows the guest to see certain properties of the host if configured correctly, for instance the system model and serial number in the type 1 entry. Obviously it's only built at boot, so it doesn't get updated for migration scenarios. User-provided entries will supersede generated entries, so care should be taken when passing entries which describe physical properties, such as memory size and address ranges.

 Thanks,
 Alex
Re: [PATCH 0/2] qemu: SMBIOS passing support
On Mon, Mar 30, 2009 at 07:59:36AM -0600, Alex Williamson wrote:
 Is there any interest in this series? Aside from copying host SMBIOS entries, it also seems useful for providing information to the guest about their virtual machine pool (perhaps via a type 3 entry), or whatever other bits of data someone might find useful (type 11, OEM string, for instance). Thanks,

I think the patch is useful. Haven't looked at the implementation, though.

 Alex

 On Mon, 2009-03-23 at 13:05 -0600, Alex Williamson wrote:
  This series adds a new -smbios option for x86 that allows individual SMBIOS entries to be passed into the guest VM. This follows the same basic path as the support for loading ACPI tables. While SMBIOS is independent of ACPI, I chose to add the smbios_entry_add() function to acpi.c because they're both somewhat PC BIOS related (and ia64 can support SMBIOS and might be able to make use of it there). This feature allows the guest to see certain properties of the host if configured correctly, for instance the system model and serial number in the type 1 entry. Obviously it's only built at boot, so it doesn't get updated for migration scenarios. User-provided entries will supersede generated entries, so care should be taken when passing entries which describe physical properties, such as memory size and address ranges.

  Thanks,
  Alex

--
Gleb.
Re: [PATCH 0/2] qemu: SMBIOS passing support
On Mon, Mar 30, 2009 at 07:59:36AM -0600, Alex Williamson wrote:
 Is there any interest in this series? Aside from copying host SMBIOS entries, it also seems useful for providing information to the guest about their virtual machine pool (perhaps via a type 3 entry), or whatever other bits of data someone might find useful (type 11, OEM string, for instance). Thanks, Alex

 On Mon, 2009-03-23 at 13:05 -0600, Alex Williamson wrote:
  This series adds a new -smbios option for x86 that allows individual SMBIOS entries to be passed into the guest VM. This follows the same basic path as the support for loading ACPI tables. While SMBIOS is independent of ACPI, I chose to add the smbios_entry_add() function to acpi.c because they're both somewhat PC BIOS related (and ia64 can support SMBIOS and might be able to make use of it there). This feature allows the guest to see certain properties of the host if configured correctly, for instance the system model and serial number in the type 1 entry. Obviously it's only built at boot, so it doesn't get updated for migration scenarios. User-provided entries will supersede generated entries, so care should be taken when passing entries which describe physical properties, such as memory size and address ranges. Thanks,

I can't help thinking that if we wish to provide metadata to the guest OS like system model, serial number, etc., then we'd be better off using explicit named flags (or QEMU config file settings once that exists)

  -system-serial 2141241521 -system-model "Some Virtual Machine"

and have QEMU generate the necessary SMBIOS data, or other equivalent data tables to suit the non-PC based machine types for which SMBIOS is not relevant.
Regards,
Daniel
--
|: Red Hat, Engineering, London -o- http://people.redhat.com/berrange/ :|
|: http://libvirt.org -o- http://virt-manager.org -o- http://ovirt.org :|
|: http://autobuild.org -o- http://search.cpan.org/~danberr/ :|
|: GnuPG: 7D3B9505 -o- F3C9 553F A1DA 4AC2 5648 23C1 B3DF F742 7D3B 9505 :|
Re: Biweekly KVM Test report, kernel 0c7771... userspace 1223a0...
On (Mon) Mar 30 2009 [16:55:05], Avi Kivity wrote:
 Amit Shah wrote:
  On (Mon) Mar 30 2009 [10:07:58], Avi Kivity wrote:
   1. perfctr wrmsr warning when booting 64bit RHEL5.3
   https://sourceforge.net/tracker/?func=detailaid=2721640group_id=180599atid=893831
   This is the architectural performance counting msr which was enabled in 4f76231 (KVM: x86: Ignore reads to EVNTSEL MSRs). Amit, can you check if appropriate cpuid leaf 10 reporting will fix this?

  We already report 0s for the cpuid leaf 10; we need to report 0x3f in EBX for leaf 10 to denote that the events corresponding to the bits aren't available. I checked and it didn't help (we can't rely on guests to abide by cpuid flags).

 I see this in the code:

 	/*
 	 * Check whether the Architectural PerfMon supports
 	 * Unhalted Core Cycles Event or not.
 	 * NOTE: Corresponding bit = 0 in ebx indicates event present.
 	 */
 	cpuid(10, &(eax.full), &ebx, &unused, &unused);
 	if ((eax.split.mask_length <
 	     (ARCH_PERFMON_UNHALTED_CORE_CYCLES_INDEX+1)) ||
 	    (ebx & ARCH_PERFMON_UNHALTED_CORE_CYCLES_PRESENT))
 		return 0;

 So I think it can be done.

Only if the guest kernel (or the module accessing those registers) looks at the cpuid output, right? I checked this for the Kaspersky AV on Windows (the crash bug I was solving), and that program doesn't seem to check cpuid.

RHEL 5.3 is based on 2.6.18 and this patch appears to have entered in 2.6.21. I saw this on 5.3 as well.

Amit
Re: Live memory allocation?
Tomasz Chmielewski wrote:
 Double caching is indeed a bad idea. That's why you have cache=off (though it isn't recommended with qcow2).

 The cache= option is about the write cache, right? Here, I'm talking about the read cache. Or does cache=none disable the read cache as well?

cache=writethrough disables the write cache
cache=none disables host caching completely

--
error compiling committee.c: too many arguments to function
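For concreteness, a sketch of how the two cache modes discussed above are selected on the qemu command line (the binary name and image path are placeholders, not taken from the thread):

```shell
# Host page cache kept for reads, but guest writes forced through to
# the backing storage before completion is reported:
qemu-system-x86_64 -m 1024 \
    -drive file=/var/lib/images/guest.img,cache=writethrough

# Host page cache bypassed in both directions; avoids the double
# caching discussed in this thread at the cost of host-side read cache:
qemu-system-x86_64 -m 1024 \
    -drive file=/var/lib/images/guest.img,cache=none
```

cache=none is the mode that answers Tomasz's question: it disables host read caching as well, leaving caching entirely to the guest kernel.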
Re: [PATCH 0/2] qemu: SMBIOS passing support
Daniel P. Berrange wrote:
 I can't help thinking that if we wish to provide metadata to the guest OS like system model, serial number, etc., then we'd be better off using explicit named flags (or QEMU config file settings once that exists) -system-serial 2141241521 -system-model "Some Virtual Machine" and have QEMU generate the necessary SMBIOS data, or other equivalent data tables to suit the non-PC based machine types for which SMBIOS is not relevant.

-smbios serial=blah,model=bleach ?

--
error compiling committee.c: too many arguments to function
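Putting the two suggestions side by side as they might look on a command line; both spellings are proposals in this thread, not existing qemu options, and the values are placeholders:

```shell
# Daniel's generic named-flag proposal (hypothetical syntax):
qemu-system-x86_64 -system-serial 2141241521 \
    -system-model "Some Virtual Machine"

# Avi's smbios-scoped spelling of the same idea (also hypothetical):
qemu-system-x86_64 -smbios serial=blah,model=bleach
```

Either way, qemu would translate the values into a generated type 1 SMBIOS entry (or whatever table a non-PC machine type uses), rather than accepting raw SMBIOS blobs.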
Re: Biweekly KVM Test report, kernel 0c7771... userspace 1223a0...
Amit Shah wrote:
 	/*
 	 * Check whether the Architectural PerfMon supports
 	 * Unhalted Core Cycles Event or not.
 	 * NOTE: Corresponding bit = 0 in ebx indicates event present.
 	 */
 	cpuid(10, &(eax.full), &ebx, &unused, &unused);
 	if ((eax.split.mask_length <
 	     (ARCH_PERFMON_UNHALTED_CORE_CYCLES_INDEX+1)) ||
 	    (ebx & ARCH_PERFMON_UNHALTED_CORE_CYCLES_PRESENT))
 		return 0;

  So I think it can be done.

 Only if the guest kernel (or the module accessing those registers) looks at the cpuid output, right? I checked this for the Kaspersky AV on Windows (the crash bug I was solving), and that program doesn't seem to check cpuid.

The only way to solve all possible cases is to implement the performance counter MSRs. That's not going to happen in a hurry; we're looking at making the known cases work.

 RHEL 5.3 is based on 2.6.18 and this patch appears to have entered in 2.6.21. I saw this on 5.3 as well.

The snippet I quoted came from RHEL 5.3. It checks cpuid, so we should be able to make it fail gracefully.

--
error compiling committee.c: too many arguments to function
Re: Live memory allocation?
Avi Kivity wrote: Tomasz Chmielewski wrote: Double caching is indeed a bad idea. That's why you have cache=off (though it isn't recommended with qcow2). The cache= option is about the write cache, right? Here, I'm talking about the read cache. Or does cache=none disable the read cache as well? cache=writethrough disables the write cache; cache=none disables host caching completely. Still, if there is free memory on the host, why not use it for cache?

--
Tomasz Chmielewski
http://wpkg.org
Re: Live memory allocation?
On Mon, Mar 30, 2009 at 10:15 AM, Tomasz Chmielewski man...@wpkg.org wrote:
 Still, if there is free memory on the host, why not use it for cache?

Because it's best used on the guest, which will do so anyway. By not caching already-cached data, the host is free to cache other, more important things, or to keep more of the VMs' memory in RAM.

--
Javier
Re: [PATCH 0/2] qemu: SMBIOS passing support
On Mon, 2009-03-30 at 17:59 +0300, Avi Kivity wrote:
 Daniel P. Berrange wrote:
  I can't help thinking that if we wish to provide metadata to the guest OS like system model, serial number, etc., then we'd be better off using explicit named flags (or QEMU config file settings once that exists) -system-serial 2141241521 -system-model "Some Virtual Machine" and have QEMU generate the necessary SMBIOS data, or other equivalent data tables to suit the non-PC based machine types for which SMBIOS is not relevant.

 -smbios serial=blah,model=bleach ?

Unfortunately that does make them smbios specific, while I think Daniel is pointing out that several options may be useful on other platforms. This is basically the same issue we have with -uuid already. -uuid is a non-smbios-specific option, but rombios will incorporate the data when it builds the type 1 entry. I've retained this functionality, so that a -uuid option will override the uuid in a passed-in type 1 entry.

This could be further extended with separate patches to provide serial or model numbers generically, but allow them to override smbios values. This seems complementary to the patches in this series, but I don't think it replaces all the functionality we get from a raw smbios entry interface.

Thanks,
Alex
Re: Live memory allocation?
On Monday 30 March 2009 08:23:44 Alberto Treviño wrote: On Saturday 28 March 2009 11:17:42 am you wrote: KVM devs have a patch called KSM (short for kernel shared memory I think) that helps windows guests a good bit. See the original announcement [1] for some numbers. I spoke to one of the devs recently and they said they are going to resubmit it soon. I remember the discussion about KSM. First, the kernel developers were not very happy with the approach, and second, there were some patent implications with VMware. Some (one?) of the kernel devs didn't like it, then admitted that he hadn't even read the patch. And as Alan Cox pointed out, if there was some patent problem, it should be handled by lawyers. There was also prior art (even in Linux) from quite some time ago. So, I think we are safe for now. --Brian Jackson Have these issues been resolved? Don't get me wrong. I'm not trying to stop KSM, I'm just wondering if I can get my hopes up again. I thought KSM was a great idea and I'd love to get my hands on it.
Re: BUG: soft lockup - CPU stuck for ...
On Monday 30 March 2009 06:37:35 Robert Wimmer wrote:

Hi, many thanks for your replies. I upgraded some systems to kernel 2.6.29 a few days ago. There was one system in particular which nearly always crashed during kernel compilation. With 2.6.29 as host and guest it currently works. I have now compiled the kernel three times (always from scratch) and nothing crashed. Using i686 (or x86) as the host wouldn't be an option. The preemptible kernel seems a possible way to go if the crash happens again, but if it works now I'll leave it as it is, since there are still drivers out there which have problems with preemptible kernels.

But there is something I still wonder: is this the right mailing list for such requests? If I read a message like "BUG: soft lockup - CPU#0 stuck for ...", it looks to me like a bug which should be looked after by the developers, but it seems that nobody here really cares for such reports. I'm really grateful for KVM and the work done by all the developers, but isn't it in the interest of a company like Red Hat to get the product stable and to eliminate all known bugs before the release of their new virtualisation product? I really don't mean this as a flame, because my intention is really to make KVM better. But the only thing I can do is submit bug reports, since I'm not a C/C++ developer.

I think your problem is timing. All the devs seem to be really focused on getting kvm merged into upstream qemu properly right now. Following the list, I've noticed that at least one of the devs seems to do a weekly review of the list and tries to handle all the bugs he sees. I actually think filing bugs for bugs is probably a better way to go, because it's easier for the devs to keep track of them there (rather than having to read through a ton of mailing list messages, some of which don't even have to do with kvm). Moral of the story: even though nobody has replied to you (yet?), your reports and the time spent finding workarounds are appreciated.
Btw: Is there an overview of what kernel settings are recommended for KVM hosts and guests beside the obvious ones? I've learned so far that the noop I/O scheduler in the guest and deadline in the host are good choices. I've read in the XFS filesystem FAQ that the KVM drive= option should include cache=none to avoid filesystem corruption (which I've already had in some KVMs, and which caused me to switch to ext3 instead). The kernel settings are especially useful for people like me who're using Gentoo, where you have to compile everything yourself. Keep the good work going! Thanks! Robert

Hi, I was also experiencing this problem a lot for quite a long time (and for a wide range of KVM versions). I might be completely wrong, as I'm not sure it was really the reason, but I THINK it disappeared when I started to use a fully preemptible kernel on the host. You might want to try it... BR nik

On Sun, Mar 29, 2009 at 07:51:21AM +, Gerrit Slomma wrote:
 Robert Wimmer r.wimmer at tomorrow-focus.de writes:
  Hi, does anyone know how to solve the problem with "BUG: soft lockup - CPU#0 stuck for ..."? Today I got the messages below during compilation of the kernel modules in a guest. Using kvm84 and kernel 2.6.29 as the host kernel and 2.6.28 as the guest kernel, neither ssh nor ping was possible during the hangup of the guest. After about 2 minutes the guest was reachable again and I saw the messages below with dmesg. Maybe it is related to my previously answered posting: http://article.gmane.org/gmane.comp.emulators.kvm.devel/29677 Thanks! Robert BUG: soft lockup - CPU#0 stuck for 61s! (...)

 Hello. Do you use x86_64 or i686? Look at my post here http://article.gmane.org/gmane.comp.emulators.kvm.devel/29833 and my bug report here https://bugzilla.redhat.com/show_bug.cgi?id=492688. I do not have the problems while running, but after migrating. Problems with stuck CPUs vanish if i686 is used for the host - but I am testing further.
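The tuning advice collected above (noop scheduler in the guest, deadline on the host, cache=none for guest disks) can be sketched as shell commands; the device names and image path are placeholders, and whether these are the right choices depends on the workload:

```shell
# Guest: the hypervisor schedules the real I/O, so skip elevator work.
echo noop > /sys/block/vda/queue/scheduler

# Host: deadline is the scheduler recommended in this thread for
# hosts serving mixed VM I/O.
echo deadline > /sys/block/sda/queue/scheduler

# Host: per the XFS FAQ caveat mentioned above, bypass the host page
# cache for the guest's disk image with cache=none.
qemu-system-x86_64 -m 1024 \
    -drive file=/var/lib/images/guest.img,cache=none
```

The sysfs writes take effect immediately but do not persist across reboots; a distribution-specific boot script (or the elevator= kernel parameter for the default) is needed to make them permanent.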
-- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majordomo at vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/4] Rewrite twisted maze of if() statements with more straightforward switch()
Avi Kivity wrote:
 Gleb Natapov wrote:
  Signed-off-by: Gleb Natapov g...@redhat.com

 This is actually not just a rewrite, but also a bugfix:

 		INTR_INFO);
 @@ -3289,34 +3288,42 @@ static void vmx_complete_interrupts(struct vcpu_vmx *vmx)
  		vmx->vnmi_blocked_time +=
  			ktime_to_ns(ktime_sub(ktime_get(), vmx->entry_time));

 +	vmx->vcpu.arch.nmi_injected = false;
 +	kvm_clear_exception_queue(&vmx->vcpu);
 +	kvm_clear_interrupt_queue(&vmx->vcpu);
 +
 +	if (!idtv_info_valid)
 +		return;
 +
  	vector = idt_vectoring_info & VECTORING_INFO_VECTOR_MASK;
  	type = idt_vectoring_info & VECTORING_INFO_TYPE_MASK;
 -	if (vmx->vcpu.arch.nmi_injected) {
 +
 +	switch (type) {
 +	case INTR_TYPE_NMI_INTR:
 +		vmx->vcpu.arch.nmi_injected = true;

The existing code would leave nmi_injected == false if we exit on NMI_INTR, so we drop an NMI here. I think NMI_INTR and nmi_injected always go together.

However, the rework looks good and more logical to me, too. Will see that I can give this (more precisely, -v2) a try with our scenarios ASAP.

Jan
Re: [PATCH 4/4] Fix task switching.
Gleb Natapov wrote:
 The patch fixes two problems with task switching.
 1. Back link is written to a wrong TSS.
 2. Instruction emulation is not needed if the reason for a task switch is a task gate in the IDT and access to it is caused by an external event. 2 is currently solved only for VMX, since there is no reliable way to skip an instruction in SVM. We should emulate it instead.

Does this series fix all issues Bernhard, Thomas and Julian stumbled over?

Jan

 Signed-off-by: Gleb Natapov g...@redhat.com
 ---
  arch/x86/include/asm/svm.h |    1 +
  arch/x86/kvm/svm.c         |   25 ++---
  arch/x86/kvm/vmx.c         |   40 +---
  arch/x86/kvm/x86.c         |   40 +++-
  4 files changed, 79 insertions(+), 27 deletions(-)

 diff --git a/arch/x86/include/asm/svm.h b/arch/x86/include/asm/svm.h
 index 82ada75..85574b7 100644
 --- a/arch/x86/include/asm/svm.h
 +++ b/arch/x86/include/asm/svm.h
 @@ -225,6 +225,7 @@ struct __attribute__ ((__packed__)) vmcb {
  #define SVM_EVTINJ_VALID_ERR (1 << 11)

  #define SVM_EXITINTINFO_VEC_MASK SVM_EVTINJ_VEC_MASK
 +#define SVM_EXITINTINFO_TYPE_MASK SVM_EVTINJ_TYPE_MASK

  #define SVM_EXITINTINFO_TYPE_INTR SVM_EVTINJ_TYPE_INTR
  #define SVM_EXITINTINFO_TYPE_NMI SVM_EVTINJ_TYPE_NMI
 diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
 index 1fcbc17..3ffb695 100644
 --- a/arch/x86/kvm/svm.c
 +++ b/arch/x86/kvm/svm.c
 @@ -1823,17 +1823,28 @@ static int task_switch_interception(struct vcpu_svm *svm,
  				    struct kvm_run *kvm_run)
  {
  	u16 tss_selector;
 +	int reason;
 +	int int_type = svm->vmcb->control.exit_int_info &
 +		SVM_EXITINTINFO_TYPE_MASK;

  	tss_selector = (u16)svm->vmcb->control.exit_info_1;
 +
  	if (svm->vmcb->control.exit_info_2 &
  	    (1ULL << SVM_EXITINFOSHIFT_TS_REASON_IRET))
 -		return kvm_task_switch(&svm->vcpu, tss_selector,
 -				       TASK_SWITCH_IRET);
 -	if (svm->vmcb->control.exit_info_2 &
 -	    (1ULL << SVM_EXITINFOSHIFT_TS_REASON_JMP))
 -		return kvm_task_switch(&svm->vcpu, tss_selector,
 -				       TASK_SWITCH_JMP);
 -	return kvm_task_switch(&svm->vcpu, tss_selector, TASK_SWITCH_CALL);
 +		reason = TASK_SWITCH_IRET;
 +	else if (svm->vmcb->control.exit_info_2 &
 +		 (1ULL << SVM_EXITINFOSHIFT_TS_REASON_JMP))
 +		reason = TASK_SWITCH_JMP;
 +	else if (svm->vmcb->control.exit_int_info & SVM_EXITINTINFO_VALID)
 +		reason = TASK_SWITCH_GATE;
 +	else
 +		reason = TASK_SWITCH_CALL;
 +
 +	if (reason != TASK_SWITCH_GATE ||
 +	    int_type == SVM_EXITINTINFO_TYPE_SOFT)
 +		skip_emulated_instruction(&svm->vcpu);
 +
 +	return kvm_task_switch(&svm->vcpu, tss_selector, reason);
  }

  static int cpuid_interception(struct vcpu_svm *svm, struct kvm_run *kvm_run)
 diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
 index 0da7a9e..01db958 100644
 --- a/arch/x86/kvm/vmx.c
 +++ b/arch/x86/kvm/vmx.c
 @@ -3025,22 +3025,40 @@ static int handle_task_switch(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run)
  	struct vcpu_vmx *vmx = to_vmx(vcpu);
  	unsigned long exit_qualification;
  	u16 tss_selector;
 -	int reason;
 +	int reason, type, idt_v;
 +
 +	idt_v = (vmx->idt_vectoring_info & VECTORING_INFO_VALID_MASK);
 +	type = (vmx->idt_vectoring_info & VECTORING_INFO_TYPE_MASK);

  	exit_qualification = vmcs_readl(EXIT_QUALIFICATION);

  	reason = (u32)exit_qualification >> 30;
 -	if (reason == TASK_SWITCH_GATE && vmx->vcpu.arch.nmi_injected &&
 -	    (vmx->idt_vectoring_info & VECTORING_INFO_VALID_MASK) &&
 -	    (vmx->idt_vectoring_info & VECTORING_INFO_TYPE_MASK)
 -	    == INTR_TYPE_NMI_INTR) {
 -		vcpu->arch.nmi_injected = false;
 -		if (cpu_has_virtual_nmis())
 -			vmcs_set_bits(GUEST_INTERRUPTIBILITY_INFO,
 -				      GUEST_INTR_STATE_NMI);
 +	if (reason == TASK_SWITCH_GATE && idt_v) {
 +		switch (type) {
 +		case INTR_TYPE_NMI_INTR:
 +			vcpu->arch.nmi_injected = false;
 +			if (cpu_has_virtual_nmis())
 +				vmcs_set_bits(GUEST_INTERRUPTIBILITY_INFO,
 +					      GUEST_INTR_STATE_NMI);
 +			break;
 +		case INTR_TYPE_EXT_INTR:
 +			kvm_clear_interrupt_queue(vcpu);
 +			break;
 +		case INTR_TYPE_HARD_EXCEPTION:
 +		case INTR_TYPE_SOFT_EXCEPTION:
 +			kvm_clear_exception_queue(vcpu);
 +			break;
 +		default:
 +			break;
 +		}
  	}
  	tss_selector = exit_qualification;

 +	if (!idt_v || (type != INTR_TYPE_HARD_EXCEPTION &&
 +		       type !=
Re: [PATCH 4/4] Fix task switching.
On Mon, Mar 30, 2009 at 06:04:45PM +0200, Jan Kiszka wrote:

Gleb Natapov wrote: The patch fixes two problems with task switching: 1. The back link is written to the wrong TSS. 2. Instruction emulation is not needed if the reason for the task switch is a task gate in the IDT and access to it is caused by an external event. Problem 2 is currently solved only for VMX, since there is no reliable way to skip an instruction in SVM. We should emulate it instead.

Does this series fix all issues Bernhard, Thomas and Julian stumbled over?

Haven't tried. I wrote my own tests for task switching. How can I check it?

--
Gleb.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 4/4] Fix task switching.
Gleb Natapov wrote:

On Mon, Mar 30, 2009 at 06:04:45PM +0200, Jan Kiszka wrote: Gleb Natapov wrote: The patch fixes two problems with task switching: 1. The back link is written to the wrong TSS. 2. Instruction emulation is not needed if the reason for the task switch is a task gate in the IDT and access to it is caused by an external event. Problem 2 is currently solved only for VMX, since there is no reliable way to skip an instruction in SVM. We should emulate it instead. Does this series fix all issues Bernhard, Thomas and Julian stumbled over?

Haven't tried. I wrote my own tests for task switching. How can I check it?

There is a test case attached to Julian's SourceForge-reported bug:
https://sourceforge.net/tracker/?func=detail&atid=893831&aid=2681442&group_id=180599

And I guess Thomas or Bernhard will be happy to give it a try, too... :)

There was one issue, the IRQ injection bug [1], which was related to IRQ tasks IIRC. Thomas and I finally suspected after a private chat that there is actually a different reason behind it, something like interrupt.pending should be cleared when the injection took place via an (emulated) task switch. Any news on this, Thomas?

Jan

[1] http://permalink.gmane.org/gmane.comp.emulators.kvm.devel/29288

signature.asc
Description: OpenPGP digital signature
Re: [PATCH 4/4] Fix task switching.
On Mon, Mar 30, 2009 at 06:35:05PM +0200, Jan Kiszka wrote:

Gleb Natapov wrote: On Mon, Mar 30, 2009 at 06:04:45PM +0200, Jan Kiszka wrote: Gleb Natapov wrote: The patch fixes two problems with task switching: 1. The back link is written to the wrong TSS. 2. Instruction emulation is not needed if the reason for the task switch is a task gate in the IDT and access to it is caused by an external event. Problem 2 is currently solved only for VMX, since there is no reliable way to skip an instruction in SVM. We should emulate it instead. Does this series fix all issues Bernhard, Thomas and Julian stumbled over? Haven't tried. I wrote my own tests for task switching. How can I check it?

There is a test case attached to Julian's SourceForge-reported bug:
https://sourceforge.net/tracker/?func=detail&atid=893831&aid=2681442&group_id=180599

I'll try that.

And I guess Thomas or Bernhard will be happy to give it a try, too... :) There was one issue, the IRQ injection bug [1], which was related to IRQ tasks IIRC. Thomas and I finally suspected after a private chat that there is actually a different reason behind it, something like interrupt.pending should be cleared when the injection took place via an (emulated) task switch. Any news on this, Thomas?

If this is the case then the patch series should fix it.

Jan [1] http://permalink.gmane.org/gmane.comp.emulators.kvm.devel/29288

--
Gleb.
Re: [PATCH 4/4] Fix task switching.
On Mon, Mar 30, 2009 at 06:35:05PM +0200, Jan Kiszka wrote:

Haven't tried. I wrote my own tests for task switching. How can I check it?

There is a test case attached to Julian's SourceForge-reported bug:
https://sourceforge.net/tracker/?func=detail&atid=893831&aid=2681442&group_id=180599

Works for me.

--
Gleb.
Re: BUG: soft lockup - CPU stuck for ...
Nikola Ciprich extmaillist at linuxbox.cz writes:

Hi, I was also experiencing this problem a lot, for quite a long time (and for a wide range of KVM versions). I might be completely wrong as I'm not sure if it was really the reason, but I THINK it disappeared when I started to use a fully preemptible kernel on the host. You might want to try it... BR nik

Alas, I can't! I am on thin ice compiling libvirt, virt-manager and kvm on my own. Company rules say I have to use upstream packages, and those that come from the install media. But if live migration won't work I have to use VMware. Seems like I'll take the fallback of i686 - live migration seems to work there - and wait for RHEL 5.4 in fall. Then KVM is said to be the default virtualization from Red Hat.
Re: IO on guest is 20 times slower than host
On Mar 29, 2009, at 10:29 AM, Avi Kivity wrote:

Kurt Yoder wrote: slow host cpu information, core 1 of 16:

processor       : 0
vendor_id       : AuthenticAMD
cpu family      : 16
model           : 4
model name      : Quad-Core AMD Opteron(tm) Processor 8382
stepping        : 2
cpu MHz         : 2611.998
cache size      : 512 KB
physical id     : 0
siblings        : 4
core id         : 0
cpu cores       : 4
apicid          : 0
initial apicid  : 0
fpu             : yes
fpu_exception   : yes
cpuid level     : 5
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall mmxext fxsr_opt pdpe1gb rdtscp lm 3dnowext 3dnow constant_tsc rep_good nopl pni monitor cx16 popcnt lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt
bogomips        : 5223.97
TLB size        : 1024 4K pages
clflush size    : 64
cache_alignment : 64
address sizes   : 48 bits physical, 48 bits virtual
power management: ts ttp tm stc 100mhzsteps hwpstate

Can you try loading kvm-amd on this host with 'modprobe kvm-amd npt=0'?

So that's most likely the problem for me:

m...@host:/etc/nagios/nrpe_directives$ sudo modprobe kvm-amd npt=0
FATAL: Error inserting kvm_amd (/lib/modules/2.6.27-11-server/kernel/arch/x86/kvm/kvm-amd.ko): Operation not supported
m...@host:/etc/nagios/nrpe_directives$ uname -a
Linux boron 2.6.27-11-server #1 SMP Thu Jan 29 20:13:12 UTC 2009 x86_64 GNU/Linux

It looks like I need to enable SVM in my BIOS. I'll do that and report back on the results.

-Kurt
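Note that the flags line quoted above does include svm, yet the module load fails: on AMD hosts the CPUID flag alone is not conclusive, since firmware can still keep SVM disabled (which is one reason kvm-amd can report "Operation not supported" until it is enabled in the BIOS). A small self-contained sketch of the flag check - the sample string is trimmed from the cpuinfo dump above; on a real host you would read /proc/cpuinfo instead:

```python
def has_svm(cpuinfo_flags: str) -> bool:
    """Return True if a /proc/cpuinfo-style flags line advertises AMD-V (svm)."""
    # split() avoids false positives on flags that merely contain "svm".
    return "svm" in cpuinfo_flags.split()

# Sample trimmed from the report above.
sample = "fpu vme de pse tsc msr pae svm extapic cr8_legacy abm sse4a"
print(has_svm(sample))            # the flag is advertised here
print(has_svm("fpu vme de pse"))  # and absent here
```

Even when this returns True, a failed `modprobe kvm-amd` like the one above usually means the feature is disabled in the BIOS.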
[ kvm-Bugs-2517725 ] Windows 7 CPU Runaway
Bugs item #2517725, was opened at 2009-01-18 12:44
Message generated for change (Comment added) made by martyg7
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=893831&aid=2517725&group_id=180599

Please note that this message will contain a full copy of the comment thread, including the initial issue submission, for this request, not just the latest update.

Category: None
Group: None
Status: Open
Resolution: None
Priority: 5
Private: No
Submitted By: Martin Gallant (martyg7)
Assigned to: Nobody/Anonymous (nobody)
Summary: Windows 7 CPU Runaway

Initial Comment:
After some uptime in the client, I have observed Windows 7 going into a CPU-runaway lockup.

Host is Athlon X2 running Debian Lenny amd64 + kernel.org 2.6.28 kernel + kvm-83.
Client is Windows 7 beta 32-bit image (UP or SMP) with virtio drivers. This happens when running the guest either as UP or SMPx2. When running as UP, only one of my host CPUs is affected. When running as SMPx2, both of my host CPUs are affected.

This can be reproduced with reasonable reliability by either
a) Commanding a restart in the guest machine
b) Significant sustained disk IO traffic, e.g. 200+ MB/s

Invocation:
sudo $KVM -name kvm-windows7 -smp 2 -m 1024 -localtime \
    -drive file=/dev/vm/kvm-windows7 \
    -drive file=/dev/vm/usenet \
    -net nic,macaddr=00:ff:3e:a4:f4:20,model=virtio -net tap \
    -daemonize -vnc localhost:1,to=4 -usbdevice tablet

After lockup, here is the backtrace I am pulling from gdb [Am I doing this right? I am a little rusty]:

(gdb) attach 12717
(gdb) bt
#0  0x7f37b1496ce2 in select () from /lib/libc.so.6
#1  0x004088cb in main_loop_wait (timeout=0) at /usr/src/kvm-83/qemu/vl.c:3637
#2  0x005142ea in kvm_main_loop () at /usr/src/kvm-83/qemu/qemu-kvm.c:600
#3  0x0040c952 in main (argc=<value optimized out>, argv=0x7fffba8f1f78, envp=<value optimized out>) at /usr/src/kvm-83/qemu/vl.c:3799

Detailed configuration information attached. Nothing I can find in the logs I would consider relevant.

--

Comment By: Martin Gallant (martyg7)
Date: 2009-03-30 15:36

Message:
Still present, and easily reproducible, on 2.6.29/kvm-84.

--

Comment By: Martin Gallant (martyg7)
Date: 2009-01-18 14:35

Message:
I can reproduce this problem at will using -nic model=e1000, so this has nothing to do with virtio network drivers. I am attaching a gdb log showing a backtrace of all 3 process threads.

--

Comment By: Technologov (technologov)
Date: 2009-01-18 13:24

Message:
VirtIO drivers will not be supported in Windows 7 BETA -- only in Final.
-Alexey

--

You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=893831&aid=2517725&group_id=180599
Re: KVM + virt-manager: which is the perfect host Linux distro?
Evert wrote:

Hi all, I am about to install a new host system, which will be hosting various guest systems by means of KVM + virt-manager for GUI. What would be the best choice for host OS distro? Red Hat, or will any mature Linux distro do? Personally I am more of a Gentoo guy, but if there is one distro which is clearly better as host OS when it comes to KVM + virt-manager, I am willing to use something else... ;-)

Fedora supports KVM and virt-manager nicely. I can't say it's the best, only that it works solidly.

--
Bill Davidsen david...@tmr.com
We have more to fear from the bungling of the incompetent than from the machinations of the wicked. - from Slashdot
Re: More vcd info wanted
* Bill Davidsen david...@tmr.com [2009-03-30 15:51]:

I am looking for detailed information or a single reproducible example of starting a VM using the qemu-kvm command from a script under Linux (and a display script on a control host, obviously). What software needs to be installed and running on the host, and what needs to be on the remote accessing display?

Please: this is not a question about doing something else using some other method. I need to be able to drop a disk image and a few parameters into a KVM host and start it in such a way that there is no human intervention nor previous preparation such as virt-manager or similar. I run desktops and servers under KVM using both command-line start and managers; I just keep running into documentation which tells me to use a vnc specifier without explanation of what that might look like, or a single reproducible example of same.

-vnc localhost:1 -- will display the guest VGA display on the localhost. A remote system can do: vncviewer ${kvmhost}:1 to view the guest VGA.

The host will be given a disk image and some parameters such as MAC address and memory size, and the machine which will have the display. That's my starting point; KVM host info will be used to start the viewer on another machine.

--
Bill Davidsen david...@tmr.com
We have more to fear from the bungling of the incompetent than from the machinations of the wicked. - from Slashdot

--
Ryan Harper
Software Engineer; Linux Technology Center
IBM Corp., Austin, Tx
ry...@us.ibm.com
Re: kvm binary names
Daniel P. Berrange wrote:

On Fri, Mar 20, 2009 at 10:57:50AM -0700, jd wrote: Hi, what is the motivation for having different kvm binary names on various Linux distributions?
-- kvm
-- qemu-system-x86_64
-- qemu-kvm

I can tell you the history from the Fedora POV at least... We already had 'qemu', 'qemu-system-x86_64', etc. from the existing plain qemu emulator RPMs we distributed. The KVM makefile creates a binary called qemu-system-x86_64, but this clashes with the existing QEMU RPM, so we had to rename it somehow to allow parallel installation of the KVM and QEMU RPMs. KVM already ships with a python script called 'kvm' and we didn't want to clash with that either, so we eventually settled on calling it 'qemu-kvm'. Other distros didn't worry about the clash with the python script, so they called their binary just 'kvm'.

Don't stop there, why does Fedora have both qemu-ppc and qemu-system-ppc and so forth? There are many of these, arm and m68k for instance. On x86 I assume that they are both emulated, and they are not two names for the same executable or such, so what are they and how to choose which to use?

--
Bill Davidsen david...@tmr.com
We have more to fear from the bungling of the incompetent than from the machinations of the wicked. - from Slashdot
Re: More vcd info wanted
Ryan Harper wrote:

* Bill Davidsen david...@tmr.com [2009-03-30 15:51]: I am looking for detailed information or a single reproducible example of starting a VM using the qemu-kvm command from a script under Linux (and a display script on a control host, obviously). What software needs to be installed and running on the host, and what needs to be on the remote accessing display? Please: this is not a question about doing something else using some other method. I need to be able to drop a disk image and a few parameters into a KVM host and start it in such a way that there is no human intervention nor previous preparation such as virt-manager or similar. I run desktops and servers under KVM using both command-line start and managers; I just keep running into documentation which tells me to use a vnc specifier without explanation of what that might look like, or a single reproducible example of same.

-vnc localhost:1 -- will display the guest VGA display on the localhost. A remote system can do: vncviewer ${kvmhost}:1 to view the guest VGA.

Thanks, will try later tonight. Have to have a bit of care getting the number right (unique) since there might be more than one of these, but this may be all I need.

The host will be given a disk image and some parameters such as MAC address and memory size, and the machine which will have the display. That's my starting point; KVM host info will be used to start the viewer on another machine.

--
Bill Davidsen david...@tmr.com
We have more to fear from the bungling of the incompetent than from the machinations of the wicked. - from Slashdot
Re: More vcd info wanted
Ryan Harper wrote:

* Bill Davidsen david...@tmr.com [2009-03-30 15:51]: I am looking for detailed information or a single reproducible example of starting a VM using the qemu-kvm command from a script under Linux (and a display script on a control host, obviously). What software needs to be installed and running on the host, and what needs to be on the remote accessing display? Please: this is not a question about doing something else using some other method. I need to be able to drop a disk image and a few parameters into a KVM host and start it in such a way that there is no human intervention nor previous preparation such as virt-manager or similar. I run desktops and servers under KVM using both command-line start and managers; I just keep running into documentation which tells me to use a vnc specifier without explanation of what that might look like, or a single reproducible example of same.

-vnc localhost:1 -- will display the guest VGA display on the localhost. A remote system can do: vncviewer ${kvmhost}:1 to view the guest VGA.

If you say -vnc localhost:1, then vncviewer ${kvmhost}:1 will certainly not work. You have to say -vnc :1, then vncviewer ${kvmhost}:1. And happiness will ensue by s/vncviewer/vinagre/.

Regards,

Anthony Liguori
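For reference, VNC display :N conventionally maps to TCP port 5900+N (the standard RFB base port - an assumption about the convention, not something stated in the thread). That is why `-vnc :1` listens on port 5901 on all interfaces, whereas `-vnc localhost:1` binds the same port to loopback only, which is the point of Anthony's correction. A trivial sketch of the mapping:

```python
RFB_BASE_PORT = 5900  # conventional VNC/RFB base port (assumption)

def vnc_port(display: int) -> int:
    """Map a VNC display number, as used in qemu's -vnc :N, to its TCP port."""
    return RFB_BASE_PORT + display

print(vnc_port(1))  # the port a remote 'vncviewer kvmhost:1' connects to
```

When starting several guests on one host, giving each a unique display number (:1, :2, ...) keeps these ports from colliding - the "bit of care getting the number right" Bill mentions above.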
Re: kvm binary names
On Mon, Mar 30, 2009 at 6:12 PM, Bill Davidsen david...@tmr.com wrote:

Daniel P. Berrange wrote: On Fri, Mar 20, 2009 at 10:57:50AM -0700, jd wrote: Hi, what is the motivation for having different kvm binary names on various Linux distributions? -- kvm -- qemu-system-x86_64 -- qemu-kvm. I can tell you the history from the Fedora POV at least... We already had 'qemu', 'qemu-system-x86_64', etc. from the existing plain qemu emulator RPMs we distributed. The KVM makefile creates a binary called qemu-system-x86_64, but this clashes with the existing QEMU RPM, so we had to rename it somehow to allow parallel installation of the KVM and QEMU RPMs. KVM already ships with a python script called 'kvm' and we didn't want to clash with that either, so we eventually settled on calling it 'qemu-kvm'. Other distros didn't worry about the clash with the python script, so they called their binary just 'kvm'.

Don't stop there, why does Fedora have both qemu-ppc and qemu-system-ppc and so forth? There are many of these, arm and m68k for instance. On x86 I assume that they are both emulated, and they are not two names for the same executable or such, so what are they and how to choose which to use?

One of them is the userspace Linux emulator, and the other is the system emulator.

--
Glauber Costa.
Free as in Freedom
http://glommer.net
The less confident you are, the more serious you have to act.
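The naming convention behind Glauber's answer can be summarized mechanically. The helper below is hypothetical (mine, for illustration) and assumes Fedora-style names, which - as this thread shows - vary by distro: qemu-ARCH is the user-mode emulator that runs a single foreign Linux binary by translating its syscalls to the host kernel, while qemu-system-ARCH boots a complete emulated machine.

```python
def emulator_name(arch: str, whole_machine: bool) -> str:
    """Illustrative mapping to Fedora-style qemu binary names.

    Hypothetical helper, not a real distro tool: user-mode emulators run one
    cross-built binary; system emulators boot an entire virtual machine.
    """
    return f"qemu-system-{arch}" if whole_machine else f"qemu-{arch}"

print(emulator_name("ppc", False))  # qemu-ppc
print(emulator_name("ppc", True))   # qemu-system-ppc
```

So to run a single cross-compiled m68k binary you would reach for qemu-m68k, and to boot an m68k guest OS, qemu-system-m68k.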
read/write performance degredation in ubuntu/debian?
I have run iozone on my Ubuntu 8.10 VM and it says:

                       random random bkwd record stride
KB reclen write rewrite read reread read write read rewrite read fwrite frewrite fread freread
4096 4 17520 49143 1194284 87115 44566 94074 50770 903871705444816 10441588611

...and that is unbearably slow. My Debian etch VM is somewhat better:

                       random random bkwd record stride
KB reclen write rewrite read reread read write read rewrite read fwrite frewrite fread freread
4096 4 29559 80552 142286 144129 109956 73759 122737 86741 1148942800168013 128927 131687

...but still very slow by a factor of 20, compared to my rPath (a Red Hat-derived distro) 1.07 VM:

                       random random bkwd record stride
KB reclen write rewrite read reread read write read rewrite read fwrite frewrite fread freread
4096 4 475392 724942 1043308 1080466 1012386 768492 970375 1683570 972908 447357 693303 993935 1059245

Maybe in Debian/Ubuntu there is some kernel setting related to disk I/O that I need to tweak? Anyone else seen this problem before?
Re: More vcd info wanted
Bill Davidsen wrote:

Ryan Harper wrote: -vnc localhost:1 -- will display the guest VGA display on the localhost. A remote system can do: vncviewer ${kvmhost}:1 to view the guest VGA.

Thanks, will try later tonight. Have to have a bit of care getting the number right (unique) since there might be more than one of these, but this may be all I need.

You might consider using libvirt, which (among many other relevant features) can dynamically assign VNC ports (thus managing the uniqueness constraint) and will expose the currently selected port as part of the domain's XML configuration. (Getting a VNC viewer going with libvirt is considerably easier than that, though -- virt-viewer VM_NAME will do the trick).
Re: [PATCH 4/4] Fix task switching.
Gleb Natapov g...@redhat.com writes:

On Mon, Mar 30, 2009 at 06:35:05PM +0200, Jan Kiszka wrote: Haven't tried. I wrote my own tests for task switching. How can I check it? There is a test case attached to Julian's SourceForge-reported bug: https://sourceforge.net/tracker/?func=detail&atid=893831&aid=2681442&group_id=180599

Works for me.

Then the patches should be fine (at least for me *g*).

Regards,
--
Julian Stecklina
The day Microsoft makes something that doesn't suck is probably the day they start making vacuum cleaners - Ernst Jan Plugge
[PATCH 0/4] ksm - dynamic page sharing driver for linux
KSM is a Linux driver that allows dynamically sharing identical memory pages between one or more processes. Unlike traditional page sharing, which happens when the memory is allocated, KSM does it dynamically after the memory has been created. Memory is periodically scanned; identical pages are identified and merged. The sharing is unnoticeable by the processes that use this memory (the shared pages are marked as read-only, and on a write do_wp_page() takes care of creating a new copy of the page).

To find identical pages, KSM uses an algorithm that is split into three primary levels:

1) KSM starts scanning the memory and calculates a checksum for each page that is registered to be scanned. (In the first round of scanning, KSM only calculates this checksum for all the pages.)

2) KSM goes over the whole memory again and recalculates the checksum of the pages. Pages found to have the same checksum value are considered pages that most likely won't change. KSM inserts these pages into an RB-tree sorted by page content, called the unstable tree. The reason this tree is called unstable is that the page contents might change while they are inside the tree, and therefore the tree can become corrupted. Because of this problem, KSM takes two more steps in addition to the checksum calculation:
a) KSM throws away and recreates the entire unstable tree on each round of memory scanning - so if we have corruption, it will be fixed when we rebuild the tree.
b) KSM uses an RB-tree, whose balancing depends on the node color and not on the content, so even if a page gets corrupted, searching in the tree still takes the same amount of time.

3) In addition to the unstable tree, KSM holds another tree called the stable tree - an RB-tree sorted by page content whose pages are all write-protected, so it cannot get corrupted.

Each time KSM finds two identical pages using the unstable tree, it creates a new write-protected shared page, and this page is inserted into the stable tree and saved there. The stable tree, unlike the unstable tree, is never thrown away, so each page that we find is kept inside it.

Taking into account the three levels described above, the algorithm works like this:

search primary tree (sorted by entire page contents, pages write protected)
- if match found, merge
- if no match found...
  - search secondary tree (sorted by entire page contents, pages not write protected)
  - if match found, merge
    - remove from secondary tree and insert merged page into primary tree
  - if no match found...
    - checksum
    - if checksum hasn't changed
      - insert into secondary tree
    - if it has, store updated checksum (note: the first time this page is handled it won't have a checksum, so the checksum will appear as changed; so it takes two passes with no other matches to get into the secondary tree)
    - do not insert into any tree, will see it again on next pass

The basic idea of this algorithm is that even if the unstable tree doesn't promise to find two identical pages in the first round, we will probably find them in the second, third or tenth round. Once we have found two identical pages even once, we insert them into the stable tree, where they are protected forever. So the whole idea of the unstable tree is just to build the stable tree; after that, we find identical pages using the stable tree.

The current implementation can be improved a lot: we don't have to calculate an expensive checksum; we can just use the host dirty bit. Currently we don't support swapping of shared pages (pages that are not shared can still be swapped - that is, all the pages that we didn't find to be identical to other pages).

Walking the tree, we keep calling get_user_pages(); we can optimize this by saving the pfn and using mmu notifiers to know when the virtual address mapping has changed. We currently scan just programs that were registered to be used by KSM; we would later want to add the ability to tell KSM to scan PIDs (so you can scan closed-binary applications as well). Right now KSM scanning is done by just one thread; multiple-scanner support might be needed.

This driver is very useful for KVM, as in the case of running multiple guest operating systems of the same type. (For desktop workloads we have achieved more than 2x memory overcommit (more like 3x).)

This driver has found users other than KVM, for example CERN. Fons Rademakers: on many-core machines we run one large detector simulation program per core. These simulation programs are identical but run each in their own process and need about 2 - 2.5 GB RAM. We typically buy machines with 2GB RAM per core and so have a problem to run one of these programs per core. Of
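The merge/promote flow described above can be modeled in a few lines. This is a toy illustration only: plain dicts stand in for the content-sorted RB-trees, CRC32 stands in for the page checksum, and the per-round rebuild of the unstable tree is omitted - none of these are the kernel's actual data structures.

```python
import zlib

class KsmScanner:
    """Toy model of the stable/unstable two-tree lookup described above."""

    def __init__(self):
        self.stable = {}     # content -> write-protected shared page id
        self.unstable = {}   # content -> candidate page id
        self.checksums = {}  # page id -> checksum seen on the previous pass
        self.merged = []     # (page, merged_into) pairs

    def scan_page(self, page_id, content):
        # 1. Search the stable tree first: a hit merges immediately.
        if content in self.stable:
            self.merged.append((page_id, self.stable[content]))
            return "merged-stable"
        # 2. Search the unstable tree: a hit promotes the pair into the
        #    stable tree and removes the candidate from the unstable one.
        if content in self.unstable and self.unstable[content] != page_id:
            other = self.unstable.pop(content)
            self.stable[content] = other
            self.merged.append((page_id, other))
            return "merged-promoted"
        # 3. Otherwise, only pages whose checksum is unchanged since the
        #    previous pass enter the unstable tree (likely-stable content).
        csum = zlib.crc32(content)
        if self.checksums.get(page_id) == csum:
            self.unstable[content] = page_id
            return "unstable"
        self.checksums[page_id] = csum
        return "skipped"
```

Two passes over identical pages A and B leave a shared entry in the stable tree, after which any further duplicate merges immediately - mirroring the "two passes with no other matches" note in the algorithm outline above.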
[PATCH 1/4] MMU_NOTIFIERS: add set_pte_at_notify()
This macro allows setting the pte in the shadow page tables directly, instead of flushing the shadow page table entry and then getting a vmexit in order to set it.

This is an optimization for kvm/users of mmu_notifiers for COW pages. It is useful for kvm when ksm is used, because it allows kvm to avoid receiving a VMEXIT just to map the shared page into the mmu shadow pages; instead it maps it directly, at the same time Linux maps the page into the host page table.

This mmu notifier macro works by calling a callback that maps the physical page directly into the shadow page tables. (Users of mmu_notifiers that did not implement the set_pte_at_notify() callback will just receive the mmu_notifier_invalidate_page callback.)

Signed-off-by: Izik Eidus iei...@redhat.com
---
 include/linux/mmu_notifier.h |   34 ++
 mm/memory.c                  |   10 --
 mm/mmu_notifier.c            |   20 
 3 files changed, 62 insertions(+), 2 deletions(-)

diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
index b77486d..8bb245f 100644
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -61,6 +61,15 @@ struct mmu_notifier_ops {
 			 struct mm_struct *mm,
 			 unsigned long address);

+	/*
+	 * change_pte is called in cases that pte mapping into page is changed
+	 * for example when ksm mapped pte to point into a new shared page.
+	 */
+	void (*change_pte)(struct mmu_notifier *mn,
+			   struct mm_struct *mm,
+			   unsigned long address,
+			   pte_t pte);
+
 	/*
 	 * Before this is invoked any secondary MMU is still ok to
 	 * read/write to the page previously pointed to by the Linux
@@ -154,6 +163,8 @@ extern void __mmu_notifier_mm_destroy(struct mm_struct *mm);
 extern void __mmu_notifier_release(struct mm_struct *mm);
 extern int __mmu_notifier_clear_flush_young(struct mm_struct *mm,
 					  unsigned long address);
+extern void __mmu_notifier_change_pte(struct mm_struct *mm,
+				      unsigned long address, pte_t pte);
 extern void __mmu_notifier_invalidate_page(struct mm_struct *mm,
 					  unsigned long address);
 extern void __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
@@ -175,6 +186,13 @@ static inline int mmu_notifier_clear_flush_young(struct mm_struct *mm,
 	return 0;
 }

+static inline void mmu_notifier_change_pte(struct mm_struct *mm,
+					   unsigned long address, pte_t pte)
+{
+	if (mm_has_notifiers(mm))
+		__mmu_notifier_change_pte(mm, address, pte);
+}
+
 static inline void mmu_notifier_invalidate_page(struct mm_struct *mm,
 					  unsigned long address)
 {
@@ -236,6 +254,16 @@ static inline void mmu_notifier_mm_destroy(struct mm_struct *mm)
 	__young;							\
 })

+#define set_pte_at_notify(__mm, __address, __ptep, __pte)		\
+({									\
+	struct mm_struct *___mm = __mm;					\
+	unsigned long ___address = __address;				\
+	pte_t ___pte = __pte;						\
+									\
+	set_pte_at(__mm, __address, __ptep, ___pte);			\
+	mmu_notifier_change_pte(___mm, ___address, ___pte);		\
+})
+
 #else /* CONFIG_MMU_NOTIFIER */

 static inline void mmu_notifier_release(struct mm_struct *mm)
@@ -248,6 +276,11 @@ static inline int mmu_notifier_clear_flush_young(struct mm_struct *mm,
 	return 0;
 }

+static inline void mmu_notifier_change_pte(struct mm_struct *mm,
+					   unsigned long address, pte_t pte)
+{
+}
+
 static inline void mmu_notifier_invalidate_page(struct mm_struct *mm,
 					  unsigned long address)
 {
@@ -273,6 +306,7 @@ static inline void mmu_notifier_mm_destroy(struct mm_struct *mm)

 #define ptep_clear_flush_young_notify ptep_clear_flush_young
 #define ptep_clear_flush_notify ptep_clear_flush
+#define set_pte_at_notify set_pte_at

 #endif /* CONFIG_MMU_NOTIFIER */

diff --git a/mm/memory.c b/mm/memory.c
index baa999e..0382a34 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2031,9 +2031,15 @@ gotten:
 	 * seen in the presence of one thread doing SMC and another
 	 * thread doing COW.
 	 */
-	ptep_clear_flush_notify(vma, address, page_table);
+	ptep_clear_flush(vma, address, page_table);
 	page_add_new_anon_rmap(new_page, vma, address);
-
[PATCH 2/4] add page_wrprotect(): write protecting page.
This patch adds a new function called page_wrprotect(); page_wrprotect() takes a page and marks all the ptes that point into it as read-only. The function works by walking the rmap of the page and setting each pte related to the page as read-only. The odirect_sync parameter is used to protect against possible races with O_DIRECT while we are marking the ptes read-only, as noted by Andrea Arcangeli: While thinking at get_user_pages_fast I figured another worse way things can go wrong with ksm and o_direct: think a thread writing constantly to the last 512bytes of a page, while another thread read and writes to/from the first 512bytes of the page. We can lose O_DIRECT reads, the very moment we mark any pte wrprotected... Signed-off-by: Izik Eidus iei...@redhat.com --- include/linux/rmap.h | 11 mm/rmap.c | 139 ++ 2 files changed, 150 insertions(+), 0 deletions(-)
diff --git a/include/linux/rmap.h b/include/linux/rmap.h index b35bc0e..469376d 100644 --- a/include/linux/rmap.h +++ b/include/linux/rmap.h
@@ -118,6 +118,10 @@ static inline int try_to_munlock(struct page *page) } #endif
+#if defined(CONFIG_KSM) || defined(CONFIG_KSM_MODULE)
+int page_wrprotect(struct page *page, int *odirect_sync, int count_offset);
+#endif
+
 #else /* !CONFIG_MMU */
 #define anon_vma_init() do {} while (0)
@@ -132,6 +136,13 @@ static inline int page_mkclean(struct page *page) return 0; }
+#if defined(CONFIG_KSM) || defined(CONFIG_KSM_MODULE)
+static inline int page_wrprotect(struct page *page, int *odirect_sync,
+				 int count_offset)
+{
+	return 0;
+}
+#endif
 #endif /* CONFIG_MMU */
diff --git a/mm/rmap.c b/mm/rmap.c index 1652166..95c55ea 100644 --- a/mm/rmap.c +++ b/mm/rmap.c
@@ -585,6 +585,145 @@ int page_mkclean(struct page *page) } EXPORT_SYMBOL_GPL(page_mkclean);
+#if defined(CONFIG_KSM) || defined(CONFIG_KSM_MODULE)
+
+static int page_wrprotect_one(struct page *page, struct vm_area_struct *vma,
+			      int *odirect_sync, int count_offset)
+{
+	struct mm_struct *mm = vma->vm_mm;
+	unsigned long address;
+	pte_t *pte;
+	spinlock_t *ptl;
+	int ret = 0;
+
+	address = vma_address(page, vma);
+	if (address == -EFAULT)
+		goto out;
+
+	pte = page_check_address(page, mm, address, &ptl, 0);
+	if (!pte)
+		goto out;
+
+	if (pte_write(*pte)) {
+		pte_t entry;
+
+		flush_cache_page(vma, address, pte_pfn(*pte));
+		/*
+		 * Ok, this is tricky: when get_user_pages_fast() runs it doesn't
+		 * take any lock, therefore the check that we are going to make
+		 * with the pagecount against the mapcount is racy and
+		 * O_DIRECT can happen right after the check.
+		 * So we clear the pte and flush the tlb before the check;
+		 * this assures us that no O_DIRECT can happen after the check
+		 * or in the middle of the check.
+		 */
+		entry = ptep_clear_flush(vma, address, pte);
+		/*
+		 * Check that no O_DIRECT or similar I/O is in progress on the
+		 * page
+		 */
+		if ((page_mapcount(page) + count_offset) != page_count(page)) {
+			*odirect_sync = 0;
+			set_pte_at_notify(mm, address, pte, entry);
+			goto out_unlock;
+		}
+		entry = pte_wrprotect(entry);
+		set_pte_at_notify(mm, address, pte, entry);
+	}
+	ret = 1;
+
+out_unlock:
+	pte_unmap_unlock(pte, ptl);
+out:
+	return ret;
+}
+
+static int page_wrprotect_file(struct page *page, int *odirect_sync,
+			       int count_offset)
+{
+	struct address_space *mapping;
+	struct prio_tree_iter iter;
+	struct vm_area_struct *vma;
+	pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
+	int ret = 0;
+
+	mapping = page_mapping(page);
+	if (!mapping)
+		return ret;
+
+	spin_lock(&mapping->i_mmap_lock);
+
+	vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff, pgoff)
+		ret += page_wrprotect_one(page, vma, odirect_sync,
+					  count_offset);
+
+	spin_unlock(&mapping->i_mmap_lock);
+
+	return ret;
+}
+
+static int page_wrprotect_anon(struct page *page, int *odirect_sync,
+			       int count_offset)
+{
+	struct vm_area_struct *vma;
+	struct anon_vma *anon_vma;
+	int ret = 0;
+
+	anon_vma = page_lock_anon_vma(page);
+	if (!anon_vma)
+		return ret;
+
+	/*
+	 * If the page is inside the swap
cache, its _count number was +* increased by one, therefore we have to increase
[PATCH 3/4] add replace_page(): change the page pte is pointing to.
replace_page() allows changing the mapping of a pte from one physical page to a different physical page. The function works by removing oldpage from the rmap and calling put_page() on it, and by setting the pte to point to newpage and inserting newpage into the rmap using page_add_file_rmap(). Note: newpage must be a non-anonymous page. The reason for this is that replace_page() is built to allow mapping one page at more than one virtual address, and the mapping of this page can be at different offsets inside each vma; therefore we cannot trust page->index anymore. The side effect of this issue is that newpage cannot be anything but a kernel-allocated page that is not swappable. Signed-off-by: Izik Eidus iei...@redhat.com --- include/linux/mm.h | 5 +++ mm/memory.c | 80 2 files changed, 85 insertions(+), 0 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h index 065cdf8..b19e4c2 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h
@@ -1237,6 +1237,11 @@ int vm_insert_pfn(struct vm_area_struct *vma, unsigned long addr, int vm_insert_mixed(struct vm_area_struct *vma, unsigned long addr, unsigned long pfn);
+#if defined(CONFIG_KSM) || defined(CONFIG_KSM_MODULE)
+int replace_page(struct vm_area_struct *vma, struct page *oldpage,
+		 struct page *newpage, pte_t orig_pte, pgprot_t prot);
+#endif
+
 struct page *follow_page(struct vm_area_struct *, unsigned long address, unsigned int foll_flags);
 #define FOLL_WRITE 0x01 /* check pte is writable */
diff --git a/mm/memory.c b/mm/memory.c index 0382a34..3946e79 100644 --- a/mm/memory.c +++ b/mm/memory.c
@@ -1562,6 +1562,86 @@ int vm_insert_mixed(struct vm_area_struct *vma, unsigned long addr, } EXPORT_SYMBOL(vm_insert_mixed);
+#if defined(CONFIG_KSM) || defined(CONFIG_KSM_MODULE)
+
+/**
+ * replace_page - replace page in vma with new page
+ * @vma: vma that holds the pte pointing to oldpage
+ * @oldpage: the page we are replacing with newpage
+ * @newpage: the page we replace oldpage with
+ * @orig_pte: the original value of the pte
+ * @prot: page protection bits
+ *
+ * Returns 0 on success, -EFAULT on failure.
+ *
+ * Note: @newpage must not be an anonymous page because replace_page() does
+ * not change the mapping of @newpage to have the same values as @oldpage.
+ * @newpage can be mapped in several vmas at different offsets (page->index).
+ */
+int replace_page(struct vm_area_struct *vma, struct page *oldpage,
+		 struct page *newpage, pte_t orig_pte, pgprot_t prot)
+{
+	struct mm_struct *mm = vma->vm_mm;
+	pgd_t *pgd;
+	pud_t *pud;
+	pmd_t *pmd;
+	pte_t *ptep;
+	spinlock_t *ptl;
+	unsigned long addr;
+	int ret;
+
+	BUG_ON(PageAnon(newpage));
+
+	ret = -EFAULT;
+	addr = page_address_in_vma(oldpage, vma);
+	if (addr == -EFAULT)
+		goto out;
+
+	pgd = pgd_offset(mm, addr);
+	if (!pgd_present(*pgd))
+		goto out;
+
+	pud = pud_offset(pgd, addr);
+	if (!pud_present(*pud))
+		goto out;
+
+	pmd = pmd_offset(pud, addr);
+	if (!pmd_present(*pmd))
+		goto out;
+
+	ptep = pte_offset_map_lock(mm, pmd, addr, &ptl);
+	if (!ptep)
+		goto out;
+
+	if (!pte_same(*ptep, orig_pte)) {
+		pte_unmap_unlock(ptep, ptl);
+		goto out;
+	}
+
+	ret = 0;
+	get_page(newpage);
+	page_add_file_rmap(newpage);
+
+	flush_cache_page(vma, addr, pte_pfn(*ptep));
+	ptep_clear_flush(vma, addr, ptep);
+	set_pte_at_notify(mm, addr, ptep, mk_pte(newpage, prot));
+
+	page_remove_rmap(oldpage);
+	if (PageAnon(oldpage)) {
+		dec_mm_counter(mm, anon_rss);
+		inc_mm_counter(mm, file_rss);
+	}
+	put_page(oldpage);
+
+	pte_unmap_unlock(ptep, ptl);
+
+out:
+	return ret;
+}
+EXPORT_SYMBOL_GPL(replace_page);
+
+#endif
+
 /*
  * maps a range of physical memory into the requested pages. the old
  * mappings are removed.
any references to nonexistent pages results -- 1.5.6.5
[PATCH 0/3] kvm support for ksm
Apply it against Avi's git tree. Izik Eidus (3): kvm: don't hold pagecount reference for mapped sptes pages. kvm: add SPTE_HOST_WRITEABLE flag to the shadow ptes. kvm: add support for change_pte mmu notifiers arch/x86/include/asm/kvm_host.h | 1 + arch/x86/kvm/mmu.c | 89 --- arch/x86/kvm/paging_tmpl.h | 16 ++- virt/kvm/kvm_main.c | 14 ++ 4 files changed, 101 insertions(+), 19 deletions(-)
[PATCH 1/3] kvm: don't hold pagecount reference for mapped sptes pages.
When using mmu notifiers, we are allowed to remove the page count reference taken by get_user_pages() for a page that is mapped inside the shadow page tables. This is needed so we can balance the pagecount against the mapcount checking. (Right now KVM increases the pagecount and does not increase the mapcount when mapping a page into a shadow page table entry, so when comparing pagecount against mapcount you get no reliable result.) Signed-off-by: Izik Eidus iei...@redhat.com --- arch/x86/kvm/mmu.c | 7 ++- 1 files changed, 2 insertions(+), 5 deletions(-)
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c index b625ed4..df8fbaf 100644 --- a/arch/x86/kvm/mmu.c +++ b/arch/x86/kvm/mmu.c
@@ -567,9 +567,7 @@ static void rmap_remove(struct kvm *kvm, u64 *spte)
 	if (*spte & shadow_accessed_mask)
 		kvm_set_pfn_accessed(pfn);
 	if (is_writeble_pte(*spte))
-		kvm_release_pfn_dirty(pfn);
-	else
-		kvm_release_pfn_clean(pfn);
+		kvm_set_pfn_dirty(pfn);
 	rmapp = gfn_to_rmap(kvm, sp->gfns[spte - sp->spt], is_large_pte(*spte));
 	if (!*rmapp) {
 		printk(KERN_ERR "rmap_remove: %p %llx 0->BUG\n", spte, *spte);
@@ -1812,8 +1810,7 @@ static void mmu_set_spte(struct kvm_vcpu *vcpu, u64 *shadow_pte,
 	page_header_update_slot(vcpu->kvm, shadow_pte, gfn);
 	if (!was_rmapped) {
 		rmap_add(vcpu, shadow_pte, gfn, largepage);
-		if (!is_rmap_pte(*shadow_pte))
-			kvm_release_pfn_clean(pfn);
+		kvm_release_pfn_clean(pfn);
 	} else {
 		if (was_writeble)
 			kvm_release_pfn_dirty(pfn);
-- 
1.5.6.5
[PATCH 2/3] kvm: add SPTE_HOST_WRITEABLE flag to the shadow ptes.
This flag notes that the host physical page we are pointing to from the spte is write-protected, and therefore we can't change its access to be writable unless we run get_user_pages(write = 1). (This is needed for change_pte support in kvm.) Signed-off-by: Izik Eidus iei...@redhat.com --- arch/x86/kvm/mmu.c | 14 ++ arch/x86/kvm/paging_tmpl.h | 16 +--- 2 files changed, 23 insertions(+), 7 deletions(-)
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c index df8fbaf..6b4d795 100644 --- a/arch/x86/kvm/mmu.c +++ b/arch/x86/kvm/mmu.c
@@ -138,6 +138,8 @@ module_param(oos_shadow, bool, 0644);
 #define ACC_USER_MASK PT_USER_MASK
 #define ACC_ALL (ACC_EXEC_MASK | ACC_WRITE_MASK | ACC_USER_MASK)
+#define SPTE_HOST_WRITEABLE (1ULL << PT_FIRST_AVAIL_BITS_SHIFT)
+
 #define SHADOW_PT_INDEX(addr, level) PT64_INDEX(addr, level)
 struct kvm_rmap_desc {
@@ -1676,7 +1678,7 @@ static int set_spte(struct kvm_vcpu *vcpu, u64 *shadow_pte, unsigned pte_access, int user_fault, int write_fault, int dirty, int largepage, int global, gfn_t gfn, pfn_t pfn, bool speculative,
-		    bool can_unsync)
+		    bool can_unsync, bool reset_host_protection)
 { u64 spte; int ret = 0;
@@ -1719,6 +1721,8 @@ kvm_x86_ops->get_mt_mask_shift(); spte |= mt_mask; }
+	if (reset_host_protection)
+		spte |= SPTE_HOST_WRITEABLE;
 	spte |= (u64)pfn << PAGE_SHIFT;
@@ -1764,7 +1768,8 @@ static void mmu_set_spte(struct kvm_vcpu *vcpu, u64 *shadow_pte, unsigned pt_access, unsigned pte_access, int user_fault, int write_fault, int dirty, int *ptwrite, int largepage, int global,
-			 gfn_t gfn, pfn_t pfn, bool speculative)
+			 gfn_t gfn, pfn_t pfn, bool speculative,
+			 bool reset_host_protection)
 { int was_rmapped = 0; int was_writeble = is_writeble_pte(*shadow_pte);
@@ -1793,7 +1798,8 @@ static void mmu_set_spte(struct kvm_vcpu *vcpu, u64 *shadow_pte, was_rmapped = 1; }
 	if (set_spte(vcpu, shadow_pte, pte_access, user_fault, write_fault,
-		     dirty, largepage, global, gfn, pfn, speculative, true)) {
+		     dirty, largepage, global, gfn, pfn, speculative, true,
+		     reset_host_protection)) {
 		if (write_fault)
 			*ptwrite = 1;
 		kvm_x86_ops->tlb_flush(vcpu);
@@ -1840,7 +1846,7 @@ static int __direct_map(struct kvm_vcpu *vcpu, gpa_t v, int write, || (largepage && iterator.level == PT_DIRECTORY_LEVEL)) {
 		mmu_set_spte(vcpu, iterator.sptep, ACC_ALL, ACC_ALL, 0, write, 1, &pt_write,
-			     largepage, 0, gfn, pfn, false);
+			     largepage, 0, gfn, pfn, false, true);
 		++vcpu->stat.pf_fixed;
 		break;
 	}
diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h index eae9499..9fdacd0 100644 --- a/arch/x86/kvm/paging_tmpl.h +++ b/arch/x86/kvm/paging_tmpl.h
@@ -259,10 +259,14 @@ static void FNAME(update_pte)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *page, if (mmu_notifier_retry(vcpu, vcpu->arch.update_pte.mmu_seq)) return; kvm_get_pfn(pfn);
+	/*
+	 * we call mmu_set_spte() with reset_host_protection = true because
+	 * vcpu->arch.update_pte.pfn was fetched from get_user_pages(write = 1).
+	 */
 	mmu_set_spte(vcpu, spte, page->role.access, pte_access, 0, 0,
		     gpte & PT_DIRTY_MASK, NULL, largepage,
		     gpte & PT_GLOBAL_MASK, gpte_to_gfn(gpte),
-		     pfn, true);
+		     pfn, true, true);
 }
 /*
@@ -297,7 +301,7 @@ static u64 *FNAME(fetch)(struct kvm_vcpu *vcpu, gva_t addr,
		     gw->ptes[gw->level-1] & PT_DIRTY_MASK, ptwrite, largepage,
		     gw->ptes[gw->level-1] & PT_GLOBAL_MASK,
-		     gw->gfn, pfn, false);
+		     gw->gfn, pfn, false, true);
 		break;
 	}
@@ -547,6 +551,7 @@ static void FNAME(prefetch_page)(struct kvm_vcpu *vcpu, static int FNAME(sync_page)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp) { int i, offset, nr_present;
+	bool reset_host_protection = 1;
 	offset = nr_present = 0;
@@ -584,9 +589,14 @@ static int FNAME(sync_page)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp) nr_present++; pte_access =
[PATCH 3/3] kvm: add support for change_pte mmu notifiers
This is needed for KVM if it wants KSM to directly map pages into its shadow page tables. Signed-off-by: Izik Eidus iei...@redhat.com --- arch/x86/include/asm/kvm_host.h | 1 + arch/x86/kvm/mmu.c | 68 +++ virt/kvm/kvm_main.c | 14 3 files changed, 76 insertions(+), 7 deletions(-)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index 8351c4d..9062729 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h
@@ -791,5 +791,6 @@ asmlinkage void kvm_handle_fault_on_reboot(void);
 #define KVM_ARCH_WANT_MMU_NOTIFIER
 int kvm_unmap_hva(struct kvm *kvm, unsigned long hva);
 int kvm_age_hva(struct kvm *kvm, unsigned long hva);
+void kvm_set_spte_hva(struct kvm *kvm, unsigned long hva, pte_t pte);
 #endif /* _ASM_X86_KVM_HOST_H */
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c index 6b4d795..f8816dd 100644 --- a/arch/x86/kvm/mmu.c +++ b/arch/x86/kvm/mmu.c
@@ -257,6 +257,11 @@ static pfn_t spte_to_pfn(u64 pte)
 	return (pte & PT64_BASE_ADDR_MASK) >> PAGE_SHIFT;
 }
+static pte_t ptep_val(pte_t *ptep)
+{
+	return *ptep;
+}
+
 static gfn_t pse36_gfn_delta(u32 gpte) { int shift = 32 - PT32_DIR_PSE36_SHIFT - PAGE_SHIFT;
@@ -678,7 +683,8 @@ static int rmap_write_protect(struct kvm *kvm, u64 gfn) return write_protected; }
-static int kvm_unmap_rmapp(struct kvm *kvm, unsigned long *rmapp)
+static int kvm_unmap_rmapp(struct kvm *kvm, unsigned long *rmapp,
+			   unsigned long data)
 { u64 *spte; int need_tlb_flush = 0;
@@ -693,8 +699,48 @@ static int kvm_unmap_rmapp(struct kvm *kvm, unsigned long *rmapp) return need_tlb_flush; }
+static int kvm_set_pte_rmapp(struct kvm *kvm, unsigned long *rmapp,
+			     unsigned long data)
+{
+	int need_flush = 0;
+	u64 *spte, new_spte;
+	pte_t *ptep = (pte_t *)data;
+	pfn_t new_pfn;
+
+	new_pfn = pte_pfn(ptep_val(ptep));
+	spte = rmap_next(kvm, rmapp, NULL);
+	while (spte) {
+		BUG_ON(!is_shadow_present_pte(*spte));
+		rmap_printk("kvm_set_pte_rmapp: spte %p %llx\n", spte, *spte);
+		need_flush = 1;
+		if (pte_write(ptep_val(ptep))) {
+			rmap_remove(kvm, spte);
+			set_shadow_pte(spte, shadow_trap_nonpresent_pte);
+			spte = rmap_next(kvm, rmapp, NULL);
+		} else {
+			new_spte = *spte & ~(PT64_BASE_ADDR_MASK);
+			new_spte |= new_pfn << PAGE_SHIFT;
+
+			if (!pte_write(ptep_val(ptep))) {
+				new_spte &= ~PT_WRITABLE_MASK;
+				new_spte &= ~SPTE_HOST_WRITEABLE;
+				if (is_writeble_pte(*spte))
+					kvm_set_pfn_dirty(spte_to_pfn(*spte));
+			}
+			set_shadow_pte(spte, new_spte);
+			spte = rmap_next(kvm, rmapp, spte);
+		}
+	}
+	if (need_flush)
+		kvm_flush_remote_tlbs(kvm);
+
+	return 0;
+}
+
 static int kvm_handle_hva(struct kvm *kvm, unsigned long hva,
-			  int (*handler)(struct kvm *kvm, unsigned long *rmapp))
+			  unsigned long data,
+			  int (*handler)(struct kvm *kvm, unsigned long *rmapp,
+					 unsigned long data))
 { int i; int retval = 0;
@@ -715,11 +761,13 @@ static int kvm_handle_hva(struct kvm *kvm, unsigned long hva,
 		end = start + (memslot->npages << PAGE_SHIFT);
 		if (hva >= start && hva < end) {
 			gfn_t gfn_offset = (hva - start) >> PAGE_SHIFT;
-			retval |= handler(kvm, &memslot->rmap[gfn_offset]);
+			retval |= handler(kvm, &memslot->rmap[gfn_offset],
+					  data);
 			retval |= handler(kvm, &memslot->lpage_info[
					  gfn_offset /
-					  KVM_PAGES_PER_HPAGE].rmap_pde);
+					  KVM_PAGES_PER_HPAGE].rmap_pde,
+					  data);
 		}
 	}
@@ -728,10 +776,16 @@ static int kvm_handle_hva(struct kvm *kvm, unsigned long hva,
 int kvm_unmap_hva(struct kvm *kvm, unsigned long hva)
 {
-	return kvm_handle_hva(kvm, hva, kvm_unmap_rmapp);
+	return kvm_handle_hva(kvm, hva, 0, kvm_unmap_rmapp);
+}
+
+void kvm_set_spte_hva(struct kvm *kvm, unsigned long hva, pte_t pte)
+{
+	kvm_handle_hva(kvm, hva, (unsigned long)&pte, kvm_set_pte_rmapp);
 }
-static int kvm_age_rmapp(struct kvm *kvm, unsigned long *rmapp)
+static int
[PATCH 0/2] kvm-userspace ksm support
Apply it against Avi's kvm-userspace git tree. Izik Eidus (2): qemu: add ksm support qemu: add ksmctl. qemu/ksm.h | 70 qemu/vl.c | 34 + user/Makefile | 6 +++- user/config-x86-common.mak | 2 +- user/ksmctl.c | 69 +++ 5 files changed, 179 insertions(+), 2 deletions(-) create mode 100644 qemu/ksm.h create mode 100644 user/ksmctl.c
[PATCH 1/2] qemu: add ksm support
Signed-off-by: Izik Eidus iei...@redhat.com --- qemu/ksm.h | 70 qemu/vl.c | 34 + 2 files changed, 104 insertions(+), 0 deletions(-) create mode 100644 qemu/ksm.h
diff --git a/qemu/ksm.h b/qemu/ksm.h new file mode 100644 index 000..2fb91a8 --- /dev/null +++ b/qemu/ksm.h
@@ -0,0 +1,70 @@
+#ifndef __LINUX_KSM_H
+#define __LINUX_KSM_H
+
+/*
+ * Userspace interface for /dev/ksm - kvm shared memory
+ */
+
+#include <sys/types.h>
+#include <sys/ioctl.h>
+
+#include <asm/types.h>
+
+#define KSM_API_VERSION 1
+
+#define ksm_control_flags_run 1
+
+/* for KSM_REGISTER_MEMORY_REGION */
+struct ksm_memory_region {
+	__u32 npages; /* number of pages to share */
+	__u32 pad;
+	__u64 addr; /* the beginning of the virtual address */
+	__u64 reserved_bits;
+};
+
+struct ksm_kthread_info {
+	__u32 sleep; /* number of microseconds to sleep */
+	__u32 pages_to_scan; /* number of pages to scan */
+	__u32 flags; /* control flags */
+	__u32 pad;
+	__u64 reserved_bits;
+};
+
+#define KSMIO 0xAB
+
+/* ioctls for /dev/ksm */
+
+#define KSM_GET_API_VERSION _IO(KSMIO, 0x00)
+/*
+ * KSM_CREATE_SHARED_MEMORY_AREA - create the shared memory region fd
+ */
+#define KSM_CREATE_SHARED_MEMORY_AREA _IO(KSMIO, 0x01) /* return SMA fd */
+/*
+ * KSM_START_STOP_KTHREAD - control the kernel thread scanning speed
+ * (can stop the kernel thread from working by setting running = 0)
+ */
+#define KSM_START_STOP_KTHREAD _IOW(KSMIO, 0x02,\
				    struct ksm_kthread_info)
+/*
+ * KSM_GET_INFO_KTHREAD - return information about the kernel thread
+ * scanning speed.
+ */
+#define KSM_GET_INFO_KTHREAD _IOW(KSMIO, 0x03,\
				  struct ksm_kthread_info)
+
+
+/* ioctls for SMA fds */
+
+/*
+ * KSM_REGISTER_MEMORY_REGION - register virtual address memory area to be
+ * scanned by kvm.
+ */
+#define KSM_REGISTER_MEMORY_REGION _IOW(KSMIO, 0x20,\
					struct ksm_memory_region)
+/*
+ * KSM_REMOVE_MEMORY_REGION - remove virtual address memory area from ksm.
+ */
+#define KSM_REMOVE_MEMORY_REGION _IO(KSMIO, 0x21)
+
+#endif
diff --git a/qemu/vl.c b/qemu/vl.c index c52d2d7..54a9dd9 100644 --- a/qemu/vl.c +++ b/qemu/vl.c
@@ -130,6 +130,7 @@ int main(int argc, char **argv) #define main qemu_main #endif /* CONFIG_COCOA */
+#include "ksm.h"
 #include "hw/hw.h"
 #include "hw/boards.h"
 #include "hw/usb.h"
@@ -4873,6 +4874,37 @@ static void termsig_setup(void) #endif
+static int ksm_register_memory(void)
+{
+    int fd;
+    int ksm_fd;
+    int r = 1;
+    struct ksm_memory_region ksm_region;
+
+    fd = open("/dev/ksm", O_RDWR | O_TRUNC, (mode_t)0600);
+    if (fd == -1)
+        goto out;
+
+    ksm_fd = ioctl(fd, KSM_CREATE_SHARED_MEMORY_AREA);
+    if (ksm_fd == -1)
+        goto out_free;
+
+    ksm_region.npages = phys_ram_size / TARGET_PAGE_SIZE;
+    ksm_region.addr = (unsigned long)phys_ram_base;
+    r = ioctl(ksm_fd, KSM_REGISTER_MEMORY_REGION, &ksm_region);
+    if (r)
+        goto out_free1;
+
+    return r;
+
+out_free1:
+    close(ksm_fd);
+out_free:
+    close(fd);
+out:
+    return r;
+}
+
 int main(int argc, char **argv, char **envp) { #ifdef CONFIG_GDBSTUB
@@ -5862,6 +5894,8 @@ int main(int argc, char **argv, char **envp) /* init the dynamic translator */ cpu_exec_init_all(tb_size * 1024 * 1024);
+    ksm_register_memory();
+
 bdrv_init();
 dma_helper_init();
-- 
1.5.6.5
[PATCH 2/2] qemu: add ksmctl.
Userspace tool to control the KSM kernel thread. Signed-off-by: Izik Eidus iei...@redhat.com --- user/Makefile | 6 +++- user/config-x86-common.mak | 2 +- user/ksmctl.c | 69 3 files changed, 75 insertions(+), 2 deletions(-) create mode 100644 user/ksmctl.c
diff --git a/user/Makefile b/user/Makefile index cf7f8ed..a291b37 100644 --- a/user/Makefile +++ b/user/Makefile
@@ -39,6 +39,10 @@ autodepend-flags = -MMD -MF $(dir $*).$(notdir $*).d
 LDFLAGS += -pthread -lrt
+ksmctl_objs = ksmctl.o
+ksmctl: $(ksmctl_objs)
+	$(CC) $(LDFLAGS) $^ -o $@
+
 kvmtrace_objs = kvmtrace.o
 kvmctl: $(kvmctl_objs)
@@ -56,4 +60,4 @@ $(libcflat): $(cflatobjs)
 -include .*.d
 clean: arch_clean
-	$(RM) kvmctl kvmtrace *.o *.a .*.d $(libcflat) $(cflatobjs)
+	$(RM) ksmctl kvmctl kvmtrace *.o *.a .*.d $(libcflat) $(cflatobjs)
diff --git a/user/config-x86-common.mak b/user/config-x86-common.mak index e789fd4..4303aee 100644 --- a/user/config-x86-common.mak +++ b/user/config-x86-common.mak
@@ -1,6 +1,6 @@ #This is a make file with common rules for both x86 x86-64
-all: kvmctl kvmtrace test_cases
+all: ksmctl kvmctl kvmtrace test_cases
 kvmctl_objs = main.o iotable.o ../libkvm/libkvm.a
 balloon_ctl: balloon_ctl.o
diff --git a/user/ksmctl.c b/user/ksmctl.c new file mode 100644 index 000..034469f --- /dev/null +++ b/user/ksmctl.c
@@ -0,0 +1,69 @@
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <sys/ioctl.h>
+#include <fcntl.h>
+#include <sys/mman.h>
+#include <unistd.h>
+#include "../qemu/ksm.h"
+
+int main(int argc, char *argv[])
+{
+	int fd;
+	int used = 0;
+	int fd_start;
+	struct ksm_kthread_info info;
+
+
+	if (argc < 2) {
+		fprintf(stderr, "usage: %s {start npages sleep | stop | info}\n", argv[0]);
+		exit(1);
+	}
+
+	fd = open("/dev/ksm", O_RDWR | O_TRUNC, (mode_t)0600);
+	if (fd == -1) {
+		fprintf(stderr, "could not open /dev/ksm\n");
+		exit(1);
+	}
+
+	if (!strncmp(argv[1], "start", strlen(argv[1]))) {
+		used = 1;
+		if (argc < 4) {
+			fprintf(stderr,
+				"usage: %s start npages_to_scan sleep\n",
+				argv[0]);
+			exit(1);
+		}
+		info.pages_to_scan = atoi(argv[2]);
+		info.sleep = atoi(argv[3]);
+		info.flags = ksm_control_flags_run;
+
+		fd_start = ioctl(fd, KSM_START_STOP_KTHREAD, &info);
+		if (fd_start == -1) {
+			fprintf(stderr, "KSM_START_KTHREAD failed\n");
+			exit(1);
+		}
+		printf("created scanner\n");
+	}
+
+	if (!strncmp(argv[1], "stop", strlen(argv[1]))) {
+		used = 1;
+		info.flags = 0;
+		fd_start = ioctl(fd, KSM_START_STOP_KTHREAD, &info);
+		printf("stopped scanner\n");
+	}
+
+	if (!strncmp(argv[1], "info", strlen(argv[1]))) {
+		used = 1;
+		ioctl(fd, KSM_GET_INFO_KTHREAD, &info);
+		printf("flags %d, pages_to_scan %d, sleep_time %d\n",
+		       info.flags, info.pages_to_scan, info.sleep);
+	}
+
+	if (!used)
+		fprintf(stderr, "unknown command %s\n", argv[1]);
+
+	return 0;
+}
-- 
1.5.6.5
[PATCH 4/4] add ksm kernel shared memory driver.
Ksm is a driver that allows merging identical pages between one or more applications, in a way invisible to the applications that use it. Pages that are merged are marked as read-only and are COWed when any application tries to change them. Ksm is used for cases where using fork() is not suitable; one of these cases is where the pages of the application keep changing dynamically and the application cannot know in advance which pages are going to be identical. Ksm works by walking over the memory pages of the applications it scans in order to find identical pages. It uses two sorted data structures, called the stable and unstable trees, to find the identical pages in an effective way. When ksm finds two identical pages, it marks them as read-only and merges them into a single page; after the pages are marked as read-only and merged into one page, Linux will treat these pages as normal copy-on-write pages and will copy them when write access happens to them. Ksm scans just the memory areas that were registered to be scanned by it. Ksm API: KSM_GET_API_VERSION: Give the userspace the API version of the module. KSM_CREATE_SHARED_MEMORY_AREA: Create a shared memory region fd, which later allows the user to register the memory regions to scan by using KSM_REGISTER_MEMORY_REGION and KSM_REMOVE_MEMORY_REGION. KSM_START_STOP_KTHREAD: Control the kernel thread scanning; the parameters are passed using the ksm_kthread_info structure: ksm_kthread_info: __u32 sleep: number of microseconds to sleep between each iteration of scanning. __u32 pages_to_scan: number of pages to scan for each iteration of scanning. __u32 max_pages_to_merge: maximum number of pages to merge in each iteration of scanning (so even if there are still more pages to scan, we stop this iteration). __u32 flags: flags to control ksmd (right now just ksm_control_flags_run is available). KSM_REGISTER_MEMORY_REGION: Register a userspace virtual address range to be scanned by ksm.
This ioctl is using the ksm_memory_region structure: ksm_memory_region: __u32 npages; number of pages to share inside this memory region. __u32 pad; __u64 addr: the begining of the virtual address of this region. KSM_REMOVE_MEMORY_REGION: Remove memory region from ksm. Signed-off-by: Izik Eidus iei...@redhat.com --- include/linux/ksm.h| 69 +++ include/linux/miscdevice.h |1 + mm/Kconfig |6 + mm/Makefile|1 + mm/ksm.c | 1431 5 files changed, 1508 insertions(+), 0 deletions(-) create mode 100644 include/linux/ksm.h create mode 100644 mm/ksm.c diff --git a/include/linux/ksm.h b/include/linux/ksm.h new file mode 100644 index 000..5776dce --- /dev/null +++ b/include/linux/ksm.h @@ -0,0 +1,69 @@ +#ifndef __LINUX_KSM_H +#define __LINUX_KSM_H + +/* + * Userspace interface for /dev/ksm - kvm shared memory + */ + +#include linux/types.h +#include linux/ioctl.h + +#include asm/types.h + +#define KSM_API_VERSION 1 + +#define ksm_control_flags_run 1 + +/* for KSM_REGISTER_MEMORY_REGION */ +struct ksm_memory_region { + __u32 npages; /* number of pages to share */ + __u32 pad; + __u64 addr; /* the begining of the virtual address */ +__u64 reserved_bits; +}; + +struct ksm_kthread_info { + __u32 sleep; /* number of microsecoends to sleep */ + __u32 pages_to_scan; /* number of pages to scan */ + __u32 flags; /* control flags */ +__u32 pad; +__u64 reserved_bits; +}; + +#define KSMIO 0xAB + +/* ioctls for /dev/ksm */ + +#define KSM_GET_API_VERSION _IO(KSMIO, 0x00) +/* + * KSM_CREATE_SHARED_MEMORY_AREA - create the shared memory reagion fd + */ +#define KSM_CREATE_SHARED_MEMORY_AREA_IO(KSMIO, 0x01) /* return SMA fd */ +/* + * KSM_START_STOP_KTHREAD - control the kernel thread scanning speed + * (can stop the kernel thread from working by setting running = 0) + */ +#define KSM_START_STOP_KTHREAD _IOW(KSMIO, 0x02,\ + struct ksm_kthread_info) +/* + * KSM_GET_INFO_KTHREAD - return information about the kernel thread + * scanning speed. 
+ */ +#define KSM_GET_INFO_KTHREAD_IOW(KSMIO, 0x03,\ + struct ksm_kthread_info) + + +/* ioctls for SMA fds */ + +/* + * KSM_REGISTER_MEMORY_REGION - register virtual address memory area to be + * scanned by kvm. + */ +#define KSM_REGISTER_MEMORY_REGION _IOW(KSMIO, 0x20,\ + struct ksm_memory_region) +/* + * KSM_REMOVE_MEMORY_REGION - remove virtual address memory area from ksm. + */ +#define KSM_REMOVE_MEMORY_REGION _IO(KSMIO, 0x21) + +#endif diff --git a/include/linux/miscdevice.h b/include/linux/miscdevice.h index a820f81..6d4f8df 100644 --- a/include/linux/miscdevice.h +++ b/include/linux/miscdevice.h @@ -29,6 +29,7 @@
FW: Use rsvd_bits_mask in load_pdptrs for cleanup and considing EXB bit
Avi Kivity wrote: Dong, Eddie wrote: @@ -2199,6 +2194,9 @@ void reset_rsvds_bits_mask(struct kvm_vcpu *vcpu, int level) context->rsvd_bits_mask[1][0] = 0; break; case PT32E_ROOT_LEVEL: + context->rsvd_bits_mask[0][2] = exb_bit_rsvd | + rsvd_bits(maxphyaddr, 62) | + rsvd_bits(7, 8) | rsvd_bits(1, 2); /* PDPTE */ context->rsvd_bits_mask[0][1] = exb_bit_rsvd | rsvd_bits(maxphyaddr, 62); /* PDE */ context->rsvd_bits_mask[0][0] = exb_bit_rsvd Are you sure that PDPTEs support NX? They don't support R/W and U/S, so it seems likely that NX is reserved as well even when EFER.NXE is enabled. Gil: Here is the original mail in the KVM mailing list. If you would be able to help, that would be great. thx, eddie
Re: [PATCH 0/4] ksm - dynamic page sharing driver for linux
Izik Eidus wrote: I am sending another series of patches for the kvm kernel and kvm-userspace that would allow users of kvm to test ksm with it. The kvm patches would apply to Avi's git tree. Any reason not to take these through upstream QEMU instead of kvm-userspace? In principle, I don't see anything that would prevent normal QEMU from almost making use of this functionality. That would make it one less thing to eventually have to merge... Regards, Anthony Liguori
Re: [PATCH 4/4] add ksm kernel shared memory driver.
Izik Eidus wrote: Ksm is driver that allow merging identical pages between one or more applications in way unvisible to the application that use it. Pages that are merged are marked as readonly and are COWed when any application try to change them. Ksm is used for cases where using fork() is not suitable, one of this cases is where the pages of the application keep changing dynamicly and the application cannot know in advance what pages are going to be identical. Ksm works by walking over the memory pages of the applications it scan in order to find identical pages. It uses a two sorted data strctures called stable and unstable trees to find in effective way the identical pages. When ksm finds two identical pages, it marks them as readonly and merges them into single one page, after the pages are marked as readonly and merged into one page, linux will treat this pages as normal copy_on_write pages and will fork them when write access will happen to them. Ksm scan just memory areas that were registred to be scanned by it. Ksm api: KSM_GET_API_VERSION: Give the userspace the api version of the module. KSM_CREATE_SHARED_MEMORY_AREA: Create shared memory reagion fd, that latter allow the user to register the memory region to scan by using: KSM_REGISTER_MEMORY_REGION and KSM_REMOVE_MEMORY_REGION KSM_START_STOP_KTHREAD: Return information about the kernel thread, the inforamtion is returned using the ksm_kthread_info structure: ksm_kthread_info: __u32 sleep: number of microsecoends to sleep between each iteration of scanning. __u32 pages_to_scan: number of pages to scan for each iteration of scanning. __u32 max_pages_to_merge: maximum number of pages to merge in each iteration of scanning (so even if there are still more pages to scan, we stop this iteration) __u32 flags: flags to control ksmd (right now just ksm_control_flags_run available) Wouldn't this make more sense as a sysfs interface? That is, the KSM_START_STOP_KTHREAD part, not necessarily the rest of the API. 
Regards, Anthony Liguori -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 4/4] add ksm kernel shared memory driver.
On Tue, 31 Mar 2009 02:59:20 +0300 Izik Eidus iei...@redhat.com wrote:

KSM is a driver that allows merging identical pages between one or more applications, in a way invisible to the applications that use it. Pages that are merged are marked read-only and are COWed when any application tries to change them. KSM is used for cases where using fork() is not suitable; one such case is where the pages of the application keep changing dynamically and the application cannot know in advance which pages are going to be identical.

KSM works by walking over the memory pages of the applications it scans in order to find identical pages. It uses two sorted data structures, called the stable and unstable trees, to find identical pages efficiently. When KSM finds two identical pages, it marks them as read-only and merges them into a single page; after that, Linux will treat these pages as normal copy-on-write pages and will copy them when a write access occurs. KSM scans only memory areas that were registered with it.

KSM API:

KSM_GET_API_VERSION: Gives userspace the API version of the module.

KSM_CREATE_SHARED_MEMORY_AREA: Creates a shared memory region fd, which later allows the user to register memory regions to scan using KSM_REGISTER_MEMORY_REGION and KSM_REMOVE_MEMORY_REGION.

KSM_START_STOP_KTHREAD: Returns information about the kernel thread; the information is returned using the ksm_kthread_info structure:

ksm_kthread_info:
__u32 sleep: number of microseconds to sleep between each iteration of scanning.
__u32 pages_to_scan: number of pages to scan for each iteration of scanning.
__u32 max_pages_to_merge: maximum number of pages to merge in each iteration of scanning (so even if there are still more pages to scan, we stop this iteration).
__u32 flags: flags to control ksmd (right now just ksm_control_flags_run is available).

KSM_REGISTER_MEMORY_REGION: Registers a userspace virtual address range to be scanned by KSM. This ioctl uses the ksm_memory_region structure:

ksm_memory_region:
__u32 npages: number of pages to share inside this memory region.
__u32 pad;
__u64 addr: the beginning of the virtual address of this region.

KSM_REMOVE_MEMORY_REGION: Removes a memory region from KSM.

Signed-off-by: Izik Eidus iei...@redhat.com
---
 include/linux/ksm.h        |   69 +++
 include/linux/miscdevice.h |    1 +
 mm/Kconfig                 |    6 +
 mm/Makefile                |    1 +
 mm/ksm.c                   | 1431 ++++++++++++++++++++++++++++++++++++++++
 5 files changed, 1508 insertions(+), 0 deletions(-)
 create mode 100644 include/linux/ksm.h
 create mode 100644 mm/ksm.c

diff --git a/include/linux/ksm.h b/include/linux/ksm.h
new file mode 100644
index 000..5776dce
--- /dev/null
+++ b/include/linux/ksm.h
@@ -0,0 +1,69 @@
+#ifndef __LINUX_KSM_H
+#define __LINUX_KSM_H
+
+/*
+ * Userspace interface for /dev/ksm - kvm shared memory
+ */
+
+#include <linux/types.h>
+#include <linux/ioctl.h>
+
+#include <asm/types.h>
+
+#define KSM_API_VERSION 1
+
+#define ksm_control_flags_run 1
+
+/* for KSM_REGISTER_MEMORY_REGION */
+struct ksm_memory_region {
+	__u32 npages; /* number of pages to share */
+	__u32 pad;
+	__u64 addr; /* the beginning of the virtual address */
+	__u64 reserved_bits;
+};
+
+struct ksm_kthread_info {
+	__u32 sleep; /* number of microseconds to sleep */
+	__u32 pages_to_scan; /* number of pages to scan */
+	__u32 flags; /* control flags */
+	__u32 pad;
+	__u64 reserved_bits;
+};
+
+#define KSMIO 0xAB
+
+/* ioctls for /dev/ksm */
+
+#define KSM_GET_API_VERSION _IO(KSMIO, 0x00)
+/*
+ * KSM_CREATE_SHARED_MEMORY_AREA - create the shared memory region fd
+ */
+#define KSM_CREATE_SHARED_MEMORY_AREA _IO(KSMIO, 0x01) /* return SMA fd */
+/*
+ * KSM_START_STOP_KTHREAD - control the kernel thread scanning speed
+ * (can stop the kernel thread from working by setting running = 0)
+ */
+#define KSM_START_STOP_KTHREAD _IOW(KSMIO, 0x02,\
+				    struct ksm_kthread_info)
+/*
+ * KSM_GET_INFO_KTHREAD - return information about the kernel thread
+ * scanning speed.
+ */
+#define KSM_GET_INFO_KTHREAD _IOW(KSMIO, 0x03,\
+				  struct ksm_kthread_info)
+
+
+/* ioctls for SMA fds */
+
+/*
+ * KSM_REGISTER_MEMORY_REGION - register virtual address memory area to be
+ * scanned by kvm.
+ */
+#define KSM_REGISTER_MEMORY_REGION _IOW(KSMIO, 0x20,\
+					struct ksm_memory_region)
+/*
+ * KSM_REMOVE_MEMORY_REGION - remove virtual address memory area from ksm.
+ */
+#define KSM_REMOVE_MEMORY_REGION
DMA errors in guest caused by corrupted(?) disk image
Hi, I've just come across a somewhat strange problem that it was suggested I report to the list. The problem manifested itself as DMA errors and the like popping up in the guest, like I'd expect to see if a disk in a physical machine were dying, like this:

hda: dma_timer_expiry: dma status == 0x21

The VM had previously been quite stable until this problem started partway through today. Another guest on the same host machine is fine. After talking to iggy on IRC, I tried running qemu-img convert over the disk image to copy it to another image, and that solved the problem. So it looks like disk image corruption somehow manages to manifest itself as a DMA error in the guest...

I'm starting KVM like so:

kvm -m 512 -net nic,macaddr=$macaddr -net tap,iface=$iface -hda hda.qc2

As the filename suggests, it's a qcow2 image, 30GB in size. The guest is a 32-bit RHEL3 installation. The host is a 64-bit Debian Lenny machine, running kvm-84.

I've got a copy of the dodgy disk image, although it's 2.7GB so not so easy to ship around. I can do any diagnostics on it that people need to try and track down the cause of the problem. I tried doing another qemu-img convert (with a view to seeing the differences between the two images) but the copy is 500MB smaller (zero blocks removed, presumably) so a diff probably isn't going to help much.

- Matt
Segfault while booting Windows XP x64
I'm on an Intel(R) Core(TM)2 Duo CPU T7500 @ 2.20GHz, using a 2.6.29 vanilla kernel, x86_64. kvm userland version 84.

When I try to boot my x64 Windows XP, it gets partway through the Windows booting process, with the progress bar and what not. Then, I get the attached backtrace. The various -no-kvm options don't seem to make a difference.

I created, and was able to boot, this image using Linux 2.6.28. I'll give it a shot again later to confirm that is still the case. Thanks in advance.

-- Mike Kelly

GNU gdb 6.8
Copyright (C) 2008 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-pc-linux-gnu"...
Starting program: /usr/bin/kvm -usb -usbdevice tablet -name winxp-x64 winxp-x64.kvm
[Thread debugging using libthread_db enabled]
[New Thread 0x7fe4d978b740 (LWP 29948)]
[New Thread 0x7fe4ccf9d950 (LWP 29951)]
[New Thread 0x7fe4cb6d5950 (LWP 29955)]
[Thread 0x7fe4cb6d5950 (LWP 29955) exited]

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fe4ccf9d950 (LWP 29951)]
qemu_paio_cancel (fd=<value optimized out>, aiocb=0x2909230) at posix-aio-compat.c:184
184         TAILQ_REMOVE(&request_list, aiocb, node);

Thread 2 (Thread 0x7fe4ccf9d950 (LWP 29951)):
#0  qemu_paio_cancel (fd=<value optimized out>, aiocb=0x2909230) at posix-aio-compat.c:184
        ret = <value optimized out>
#1  0x0041acf8 in raw_aio_cancel (blockacb=<value optimized out>) at block-raw-posix.c:681
        ret = <value optimized out>
        acb = (RawAIOCB *) 0x2909210
#2  0x00433790 in ide_dma_cancel (bm=0x27dfe60) at /var/tmp/paludis/build/app-virtualization-kvm-84/work/kvm-84/qemu/hw/ide.c:2973
No locals.
#3  0x004337f5 in bmdma_cmd_writeb (opaque=0x27dfe60, addr=0, val=<value optimized out>) at /var/tmp/paludis/build/app-virtualization-kvm-84/work/kvm-84/qemu/hw/ide.c:2987
No locals.
#4  0x00520d5d in kvm_outb (opaque=<value optimized out>, addr=0, data=0 '\0') at /var/tmp/paludis/build/app-virtualization-kvm-84/work/kvm-84/qemu/qemu-kvm.c:684
No locals.
#5  0x0054cfa5 in kvm_run (kvm=0x2716010, vcpu=0, env=0x2725f90) at libkvm.c:722
        r = <value optimized out>
        fd = 12
        run = (struct kvm_run *) 0x7fe4cc799000
#6  0x00521529 in kvm_cpu_exec (env=<value optimized out>) at /var/tmp/paludis/build/app-virtualization-kvm-84/work/kvm-84/qemu/qemu-kvm.c:205
        r = <value optimized out>
#7  0x00521818 in ap_main_loop (_env=<value optimized out>) at /var/tmp/paludis/build/app-virtualization-kvm-84/work/kvm-84/qemu/qemu-kvm.c:414
        env = (CPUX86State *) 0x2725f90
        signals = {__val = {18446744067267100671, 18446744073709551615 <repeats 15 times>}}
        data = (struct ioperm_data *) 0x0
#8  0x7fe4d89eff97 in start_thread () from /lib/libpthread.so.0
No locals.
#9  0x7fe4d792bdfd in clone () from /lib/libc.so.6
No symbol table info available.

Thread 1 (Thread 0x7fe4d978b740 (LWP 29948)):
#0  0x7fe4d7925452 in select () from /lib/libc.so.6
No symbol table info available.
#1  0x00409eab in main_loop_wait (timeout=0) at /var/tmp/paludis/build/app-virtualization-kvm-84/work/kvm-84/qemu/vl.c:3647
        ioh = <value optimized out>
        rfds = {fds_bits = {164992, 0 <repeats 15 times>}}
        wfds = {fds_bits = {0 <repeats 16 times>}}
        xfds = {fds_bits = {0 <repeats 16 times>}}
        ret = <value optimized out>
        nfds = 17
        tv = {tv_sec = 0, tv_usec = 999644}
#2  0x00520fea in kvm_main_loop () at /var/tmp/paludis/build/app-virtualization-kvm-84/work/kvm-84/qemu/qemu-kvm.c:596
        fds = {15, 16}
        mask = {__val = {268443648, 0 <repeats 15 times>}}
        sigfd = 17
#3  0x0040e4db in main (argc=<value optimized out>, argv=0x7fffe17aa448, envp=<value optimized out>) at /var/tmp/paludis/build/app-virtualization-kvm-84/work/kvm-84/qemu/vl.c:3809
        use_gdbstub = 0
        gdbstub_port = 0x54f5ef "1234"
        boot_devices_bitmap = 0
        i = <value optimized out>
        snapshot = 0
        linux_boot = <value optimized out>
        net_boot = <value optimized out>
        initrd_filename = 0x0
        kernel_filename = 0x0
        kernel_cmdline = 0x58cc6b ""
        boot_devices = 0x54f881 "cad"
        ds = <value optimized out>
        dcl = <value optimized out>
        cyls = 0
        heads = 0
        secs = 0
        translation = 0
        net_clients = {0x54f45d "nic", 0x54f885 "user", 0x0, 0x7fe4d95972ee "\205À\017\217z\001", 0x0, 0x7fe4d9596bec "\205Àt\A\213D$\f\205Àu\027\205í\017\037D", 0x7fe40001 <Address 0x7fe40001 out of bounds>, 0x7fe4d97a95b8 "\220\225zÙä\177", 0x0, 0x1 <Address 0x1 out of bounds>, 0x71dd557f <Address 0x71dd557f out of bounds>, 0x7fe4d9596ffa
RE: Use rsvd_bits_mask in load_pdptrs for cleanup and considering EXB bit
PDPTEs are used only if CR0.PG=CR4.PAE=1. In that situation, their format depends on the value of IA32_EFER.LMA.

If IA32_EFER.LMA=0, bit 63 is reserved and must be 0 in any PDPTE that is marked present. The execute-disable setting of a page is determined only by the PDE and PTE.

If IA32_EFER.LMA=1, bit 63 is used for execute-disable in PML4 entries, PDPTEs, PDEs, and PTEs (assuming IA32_EFER.NXE=1).

- Gil

-----Original Message-----
From: Dong, Eddie
Sent: Monday, March 30, 2009 5:51 PM
To: Neiger, Gil
Cc: Avi Kivity; kvm@vger.kernel.org; Dong, Eddie
Subject: FW: Use rsvd_bits_mask in load_pdptrs for cleanup and considering EXB bit

Avi Kivity wrote: Dong, Eddie wrote:

@@ -2199,6 +2194,9 @@ void reset_rsvds_bits_mask(struct kvm_vcpu *vcpu, int level)
 		context->rsvd_bits_mask[1][0] = 0;
 		break;
 	case PT32E_ROOT_LEVEL:
+		context->rsvd_bits_mask[0][2] = exb_bit_rsvd |
+			rsvd_bits(maxphyaddr, 62) |
+			rsvd_bits(7, 8) | rsvd_bits(1, 2);	/* PDPTE */
 		context->rsvd_bits_mask[0][1] = exb_bit_rsvd |
 			rsvd_bits(maxphyaddr, 62);	/* PDE */
 		context->rsvd_bits_mask[0][0] = exb_bit_rsvd

Are you sure that PDPTEs support NX? They don't support R/W and U/S, so it seems likely that NX is reserved as well even when EFER.NXE is enabled.

Gil: Here is the original mail on the KVM mailing list. If you would be able to help, that would be great.

thx, eddie
Re: Segfault while booting Windows XP x64
On Mon, Mar 30, 2009 at 11:26:52PM -0400, Mike Kelly wrote:

I'm on an Intel(R) Core(TM)2 Duo CPU T7500 @ 2.20GHz, using a 2.6.29 vanilla kernel, x86_64. kvm userland version 84. When I try to boot my x64 Windows XP, it gets partway through the Windows booting process, with the progress bar and what not. Then, I get the attached backtrace. The various -no-kvm options don't seem to make a difference. I created, and was able to boot, this image using Linux 2.6.28. I'll give it a shot again later to confirm that is still the case.

Are you sure you have write permission to that image?

-- Gleb.