[PATCH] kvm: qemu: regenerate bios for tsc zeroing removal
From: Avi Kivity a...@redhat.com

Signed-off-by: Avi Kivity a...@redhat.com

diff --git a/qemu/pc-bios/bios.bin b/qemu/pc-bios/bios.bin
index 768d8f0..0e0dbea 100644
Binary files a/qemu/pc-bios/bios.bin and b/qemu/pc-bios/bios.bin differ
--
To unsubscribe from this list: send the line "unsubscribe kvm-commits"
in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] kvm: configure: if all else fails, infer kernel version from .config
From: Avi Kivity a...@redhat.com

Signed-off-by: Avi Kivity a...@redhat.com

diff --git a/configure b/configure
index d2883a7..493c178 100755
--- a/configure
+++ b/configure
@@ -125,6 +125,9 @@ if [ -n "$no_uname" ]; then
 elif [ -e "$kerneldir/include/config/kernel.release" ]; then
     depmod_version=`cat "$kerneldir/include/config/kernel.release"`
+elif [ -e "$kerneldir/.config" ]; then
+    depmod_version=$(awk '/Linux kernel version:/ { print $NF }' \
+        "$kerneldir/.config")
 else
     echo
     echo "Error: kernelversion not found"
[PATCH] kvm: qemu: separate TSC load from kvm_arch_load_regs
From: Marcelo Tosatti mtosa...@redhat.com

kvm_load_registers() is a general interface for loading registers, used by
vmport, gdbstub, etc.  The TSC MSR is continually counting, so it can't
simply be read and written back like the other registers/MSRs (doing so
overwrites the current count).  Introduce kvm_load_tsc() and use it in
x86's migration CPU-load code.

Signed-off-by: Marcelo Tosatti mtosa...@redhat.com
Signed-off-by: Avi Kivity a...@redhat.com

diff --git a/qemu/qemu-kvm-x86.c b/qemu/qemu-kvm-x86.c
index aa36be8..f7ed70b 100644
--- a/qemu/qemu-kvm-x86.c
+++ b/qemu/qemu-kvm-x86.c
@@ -261,7 +261,6 @@ void kvm_arch_load_regs(CPUState *env)
         set_msr_entry(&msrs[n++], MSR_IA32_SYSENTER_EIP, env->sysenter_eip);
     if (kvm_has_msr_star)
         set_msr_entry(&msrs[n++], MSR_STAR, env->star);
-    set_msr_entry(&msrs[n++], MSR_IA32_TSC, env->tsc);
     set_msr_entry(&msrs[n++], MSR_VM_HSAVE_PA, env->vm_hsave);
 #ifdef TARGET_X86_64
     if (lm_capable_kernel) {
@@ -277,6 +276,18 @@ void kvm_arch_load_regs(CPUState *env)
         perror("kvm_set_msrs FAILED");
 }

+void kvm_load_tsc(CPUState *env)
+{
+    int rc;
+    struct kvm_msr_entry msr;
+
+    set_msr_entry(&msr, MSR_IA32_TSC, env->tsc);
+
+    rc = kvm_set_msrs(kvm_context, env->cpu_index, &msr, 1);
+    if (rc == -1)
+        perror("kvm_set_tsc FAILED");
+}
+
 void kvm_save_mpstate(CPUState *env)
 {
 #ifdef KVM_CAP_MP_STATE
diff --git a/qemu/qemu-kvm.h b/qemu/qemu-kvm.h
index aec5286..ec27d06 100644
--- a/qemu/qemu-kvm.h
+++ b/qemu/qemu-kvm.h
@@ -159,6 +159,7 @@ int qemu_kvm_has_sync_mmu(void);
 #define qemu_kvm_pit_in_kernel() kvm_pit_in_kernel(kvm_context)
 #define kvm_has_sync_mmu() qemu_kvm_has_sync_mmu()
 void kvm_init_vcpu(CPUState *env);
+void kvm_load_tsc(CPUState *env);
 #else
 #define kvm_enabled() (0)
 #define kvm_nested 0
@@ -168,6 +169,8 @@ void kvm_init_vcpu(CPUState *env);
 #define kvm_load_registers(env) do {} while(0)
 #define kvm_save_registers(env) do {} while(0)
 static inline void kvm_init_vcpu(CPUState *env) { }
+static inline void kvm_load_tsc(CPUState *env) { }
+
 #endif
diff --git a/qemu/target-i386/machine.c b/qemu/target-i386/machine.c
index a7056c7..71c0751 100644
--- a/qemu/target-i386/machine.c
+++ b/qemu/target-i386/machine.c
@@ -328,6 +328,7 @@ int cpu_load(QEMUFile *f, void *opaque, int version_id)
     }
     qemu_get_be64s(f, &env->tsc);
     kvm_load_registers(env);
+    kvm_load_tsc(env);
     if (version_id >= 5) {
         qemu_get_be32s(f, &env->mp_state);
         kvm_load_mpstate(env);
[PATCH] kvm: bios: revert TSC zeroing
From: Marcelo Tosatti mtosa...@redhat.com

The TSC is zeroed at RESET, not at SMP initialization.  This prevents the
TSCs from going out of sync between vcpus on SMP guests.

Signed-off-by: Marcelo Tosatti mtosa...@redhat.com
Signed-off-by: Avi Kivity a...@redhat.com

diff --git a/bios/rombios32.c b/bios/rombios32.c
index 321563d..fd6ff71 100755
--- a/bios/rombios32.c
+++ b/bios/rombios32.c
@@ -623,12 +623,6 @@ void smp_probe(void)
         writel(APIC_BASE + APIC_ICR_LOW, 0x000C4500);
         sipi_vector = AP_BOOT_ADDR >> 12;
         writel(APIC_BASE + APIC_ICR_LOW, 0x000C4600 | sipi_vector);
-        asm volatile(
-            "xor %%eax, %%eax \n\t"
-            "xor %%edx, %%edx \n\t"
-            "mov $0x10, %%ecx \n\t"
-            "wrmsr"
-            : : : "eax", "ecx", "edx");
 #ifndef BX_QEMU
         delay_ms(10);
diff --git a/bios/rombios32start.S b/bios/rombios32start.S
index 335e3ef..c9bc045 100644
--- a/bios/rombios32start.S
+++ b/bios/rombios32start.S
@@ -50,10 +50,6 @@ smp_ap_boot_code_start:
     cli
     xor %ax, %ax
     mov %ax, %ds
-    xor %eax, %eax
-    xor %edx, %edx
-    mov $0x10, %ecx
-    wrmsr
     mov $SMP_MSR_ADDR, %ebx
 11:
[PATCH] KVM: MMU: Add for_each_shadow_entry(), a simpler alternative to walk_shadow()
From: Avi Kivity a...@redhat.com

Using a for_each loop style removes the need to write callbacks and
nasty casts.  Implement walk_shadow() using for_each_shadow_entry().

Signed-off-by: Avi Kivity a...@redhat.com

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 3b86df6..3248a3e 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -150,6 +150,20 @@ struct kvm_shadow_walk {
                  u64 addr, u64 *spte, int level);
 };

+struct kvm_shadow_walk_iterator {
+    u64 addr;
+    hpa_t shadow_addr;
+    int level;
+    u64 *sptep;
+    unsigned index;
+};
+
+#define for_each_shadow_entry(_vcpu, _addr, _walker)    \
+    for (shadow_walk_init(&(_walker), _vcpu, _addr);    \
+         shadow_walk_okay(&(_walker));                  \
+         shadow_walk_next(&(_walker)))
+
+
 struct kvm_unsync_walk {
     int (*entry) (struct kvm_mmu_page *sp, struct kvm_unsync_walk *walk);
 };
@@ -1254,33 +1268,48 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
     return sp;
 }

+static void shadow_walk_init(struct kvm_shadow_walk_iterator *iterator,
+                             struct kvm_vcpu *vcpu, u64 addr)
+{
+    iterator->addr = addr;
+    iterator->shadow_addr = vcpu->arch.mmu.root_hpa;
+    iterator->level = vcpu->arch.mmu.shadow_root_level;
+    if (iterator->level == PT32E_ROOT_LEVEL) {
+        iterator->shadow_addr
+            = vcpu->arch.mmu.pae_root[(addr >> 30) & 3];
+        iterator->shadow_addr &= PT64_BASE_ADDR_MASK;
+        --iterator->level;
+        if (!iterator->shadow_addr)
+            iterator->level = 0;
+    }
+}
+
+static bool shadow_walk_okay(struct kvm_shadow_walk_iterator *iterator)
+{
+    if (iterator->level < PT_PAGE_TABLE_LEVEL)
+        return false;
+    iterator->index = SHADOW_PT_INDEX(iterator->addr, iterator->level);
+    iterator->sptep = ((u64 *)__va(iterator->shadow_addr)) + iterator->index;
+    return true;
+}
+
+static void shadow_walk_next(struct kvm_shadow_walk_iterator *iterator)
+{
+    iterator->shadow_addr = *iterator->sptep & PT64_BASE_ADDR_MASK;
+    --iterator->level;
+}
+
 static int walk_shadow(struct kvm_shadow_walk *walker,
                        struct kvm_vcpu *vcpu, u64 addr)
 {
-    hpa_t shadow_addr;
-    int level;
+    struct kvm_shadow_walk_iterator iterator;
     int r;
-    u64 *sptep;
-    unsigned index;
-
-    shadow_addr = vcpu->arch.mmu.root_hpa;
-    level = vcpu->arch.mmu.shadow_root_level;
-    if (level == PT32E_ROOT_LEVEL) {
-        shadow_addr = vcpu->arch.mmu.pae_root[(addr >> 30) & 3];
-        shadow_addr &= PT64_BASE_ADDR_MASK;
-        if (!shadow_addr)
-            return 1;
-        --level;
-    }
-    while (level >= PT_PAGE_TABLE_LEVEL) {
-        index = SHADOW_PT_INDEX(addr, level);
-        sptep = ((u64 *)__va(shadow_addr)) + index;
-        r = walker->entry(walker, vcpu, addr, sptep, level);
+    for_each_shadow_entry(vcpu, addr, iterator) {
+        r = walker->entry(walker, vcpu, addr,
+                          iterator.sptep, iterator.level);
         if (r)
             return r;
-        shadow_addr = *sptep & PT64_BASE_ADDR_MASK;
-        --level;
     }
     return 0;
 }
[PATCH] KVM: MMU: Use for_each_shadow_entry() in __direct_map()
From: Avi Kivity a...@redhat.com

Eliminates a callback and a useless structure.

Signed-off-by: Avi Kivity a...@redhat.com

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 3248a3e..b4b79b0 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1841,67 +1841,42 @@ static void nonpaging_new_cr3(struct kvm_vcpu *vcpu)
 {
 }

-struct direct_shadow_walk {
-    struct kvm_shadow_walk walker;
-    pfn_t pfn;
-    int write;
-    int largepage;
-    int pt_write;
-};
-
-static int direct_map_entry(struct kvm_shadow_walk *_walk,
-                            struct kvm_vcpu *vcpu,
-                            u64 addr, u64 *sptep, int level)
+static int __direct_map(struct kvm_vcpu *vcpu, gpa_t v, int write,
+                        int largepage, gfn_t gfn, pfn_t pfn)
 {
-    struct direct_shadow_walk *walk =
-        container_of(_walk, struct direct_shadow_walk, walker);
+    struct kvm_shadow_walk_iterator iterator;
     struct kvm_mmu_page *sp;
+    int pt_write = 0;
     gfn_t pseudo_gfn;
-    gfn_t gfn = addr >> PAGE_SHIFT;
-
-    if (level == PT_PAGE_TABLE_LEVEL
-        || (walk->largepage && level == PT_DIRECTORY_LEVEL)) {
-        mmu_set_spte(vcpu, sptep, ACC_ALL, ACC_ALL,
-                     0, walk->write, 1, &walk->pt_write,
-                     walk->largepage, 0, gfn, walk->pfn, false);
-        ++vcpu->stat.pf_fixed;
-        return 1;
-    }

-    if (*sptep == shadow_trap_nonpresent_pte) {
-        pseudo_gfn = (addr & PT64_DIR_BASE_ADDR_MASK) >> PAGE_SHIFT;
-        sp = kvm_mmu_get_page(vcpu, pseudo_gfn, (gva_t)addr, level - 1,
-                              1, ACC_ALL, sptep);
-        if (!sp) {
-            pgprintk("nonpaging_map: ENOMEM\n");
-            kvm_release_pfn_clean(walk->pfn);
-            return -ENOMEM;
+    for_each_shadow_entry(vcpu, (u64)gfn << PAGE_SHIFT, iterator) {
+        if (iterator.level == PT_PAGE_TABLE_LEVEL
+            || (largepage && iterator.level == PT_DIRECTORY_LEVEL)) {
+            mmu_set_spte(vcpu, iterator.sptep, ACC_ALL, ACC_ALL,
+                         0, write, 1, &pt_write,
+                         largepage, 0, gfn, pfn, false);
+            ++vcpu->stat.pf_fixed;
+            break;
         }
-        set_shadow_pte(sptep,
-                       __pa(sp->spt)
-                       | PT_PRESENT_MASK | PT_WRITABLE_MASK
-                       | shadow_user_mask | shadow_x_mask);
-    }
-    return 0;
-}

-static int __direct_map(struct kvm_vcpu *vcpu, gpa_t v, int write,
-                        int largepage, gfn_t gfn, pfn_t pfn)
-{
-    int r;
-    struct direct_shadow_walk walker = {
-        .walker = { .entry = direct_map_entry, },
-        .pfn = pfn,
-        .largepage = largepage,
-        .write = write,
-        .pt_write = 0,
-    };
-
-    r = walk_shadow(&walker.walker, vcpu, (u64)gfn << PAGE_SHIFT);
-    if (r < 0)
-        return r;
-    return walker.pt_write;
+        if (*iterator.sptep == shadow_trap_nonpresent_pte) {
+            pseudo_gfn = (iterator.addr & PT64_DIR_BASE_ADDR_MASK) >> PAGE_SHIFT;
+            sp = kvm_mmu_get_page(vcpu, pseudo_gfn, iterator.addr,
+                                  iterator.level - 1,
+                                  1, ACC_ALL, iterator.sptep);
+            if (!sp) {
+                pgprintk("nonpaging_map: ENOMEM\n");
+                kvm_release_pfn_clean(pfn);
+                return -ENOMEM;
+            }
+
+            set_shadow_pte(iterator.sptep,
+                           __pa(sp->spt)
+                           | PT_PRESENT_MASK | PT_WRITABLE_MASK
+                           | shadow_user_mask | shadow_x_mask);
+        }
+    }
+    return pt_write;
 }

 static int nonpaging_map(struct kvm_vcpu *vcpu, gva_t v, int write, gfn_t gfn)
[PATCH] KVM: MMU: Replace walk_shadow() by for_each_shadow_entry() in fetch()
From: Avi Kivity a...@redhat.com

Effectively reverting to the pre-walk_shadow() version -- but now with
the reusable for_each().

Signed-off-by: Avi Kivity a...@redhat.com

diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
index 9fd78b6..69c7e33 100644
--- a/arch/x86/kvm/paging_tmpl.h
+++ b/arch/x86/kvm/paging_tmpl.h
@@ -283,91 +283,79 @@ static void FNAME(update_pte)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *page,
 /*
  * Fetch a shadow pte for a specific level in the paging hierarchy.
  */
-static int FNAME(shadow_walk_entry)(struct kvm_shadow_walk *_sw,
-                                    struct kvm_vcpu *vcpu, u64 addr,
-                                    u64 *sptep, int level)
+static u64 *FNAME(fetch)(struct kvm_vcpu *vcpu, gva_t addr,
+                         struct guest_walker *gw,
+                         int user_fault, int write_fault, int largepage,
+                         int *ptwrite, pfn_t pfn)
 {
-    struct shadow_walker *sw =
-        container_of(_sw, struct shadow_walker, walker);
-    struct guest_walker *gw = sw->guest_walker;
     unsigned access = gw->pt_access;
     struct kvm_mmu_page *shadow_page;
-    u64 spte;
+    u64 spte, *sptep;
     int metaphysical;
     gfn_t table_gfn;
     int r;
+    int level;
     pt_element_t curr_pte;
+    struct kvm_shadow_walk_iterator iterator;

-    if (level == PT_PAGE_TABLE_LEVEL
-        || (sw->largepage && level == PT_DIRECTORY_LEVEL)) {
-        mmu_set_spte(vcpu, sptep, access, gw->pte_access & access,
-                     sw->user_fault, sw->write_fault,
-                     gw->ptes[gw->level-1] & PT_DIRTY_MASK,
-                     sw->ptwrite, sw->largepage,
-                     gw->ptes[gw->level-1] & PT_GLOBAL_MASK,
-                     gw->gfn, sw->pfn, false);
-        sw->sptep = sptep;
-        return 1;
-    }
-
-    if (is_shadow_present_pte(*sptep) && !is_large_pte(*sptep))
-        return 0;
-
-    if (is_large_pte(*sptep)) {
-        set_shadow_pte(sptep, shadow_trap_nonpresent_pte);
-        kvm_flush_remote_tlbs(vcpu->kvm);
-        rmap_remove(vcpu->kvm, sptep);
-    }
+    if (!is_present_pte(gw->ptes[gw->level - 1]))
+        return NULL;

-    if (level == PT_DIRECTORY_LEVEL && gw->level == PT_DIRECTORY_LEVEL) {
-        metaphysical = 1;
-        if (!is_dirty_pte(gw->ptes[level - 1]))
-            access &= ~ACC_WRITE_MASK;
-        table_gfn = gpte_to_gfn(gw->ptes[level - 1]);
-    } else {
-        metaphysical = 0;
-        table_gfn = gw->table_gfn[level - 2];
-    }
-    shadow_page = kvm_mmu_get_page(vcpu, table_gfn, (gva_t)addr, level-1,
-                                   metaphysical, access, sptep);
-    if (!metaphysical) {
-        r = kvm_read_guest_atomic(vcpu->kvm, gw->pte_gpa[level - 2],
-                                  &curr_pte, sizeof(curr_pte));
-        if (r || curr_pte != gw->ptes[level - 2]) {
-            kvm_mmu_put_page(shadow_page, sptep);
-            kvm_release_pfn_clean(sw->pfn);
-            sw->sptep = NULL;
-            return 1;
+    for_each_shadow_entry(vcpu, addr, iterator) {
+        level = iterator.level;
+        sptep = iterator.sptep;
+        if (level == PT_PAGE_TABLE_LEVEL
+            || (largepage && level == PT_DIRECTORY_LEVEL)) {
+            mmu_set_spte(vcpu, sptep, access,
+                         gw->pte_access & access,
+                         user_fault, write_fault,
+                         gw->ptes[gw->level-1] & PT_DIRTY_MASK,
+                         ptwrite, largepage,
+                         gw->ptes[gw->level-1] & PT_GLOBAL_MASK,
+                         gw->gfn, pfn, false);
+            break;
         }
-    }
-    spte = __pa(shadow_page->spt) | PT_PRESENT_MASK | PT_ACCESSED_MASK
-        | PT_WRITABLE_MASK | PT_USER_MASK;
-    *sptep = spte;
-    return 0;
-}

+        if (is_shadow_present_pte(*sptep) && !is_large_pte(*sptep))
+            continue;

-static u64 *FNAME(fetch)(struct kvm_vcpu *vcpu, gva_t addr,
-                         struct guest_walker *guest_walker,
-                         int user_fault, int write_fault, int largepage,
-                         int *ptwrite, pfn_t pfn)
-{
-    struct shadow_walker walker = {
-        .walker = { .entry = FNAME(shadow_walk_entry), },
-        .guest_walker = guest_walker,
-        .user_fault = user_fault,
-        .write_fault = write_fault,
-        .largepage = largepage,
-        .ptwrite = ptwrite,
-        .pfn = pfn,
-    };
+        if (is_large_pte(*sptep)) {
[PATCH] KVM: MMU: Drop walk_shadow()
From: Avi Kivity a...@redhat.com

No longer used.

Signed-off-by: Avi Kivity a...@redhat.com

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index b4b79b0..31ebe69 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -145,11 +145,6 @@ struct kvm_rmap_desc {
     struct kvm_rmap_desc *more;
 };

-struct kvm_shadow_walk {
-    int (*entry)(struct kvm_shadow_walk *walk, struct kvm_vcpu *vcpu,
-                 u64 addr, u64 *spte, int level);
-};
-
 struct kvm_shadow_walk_iterator {
     u64 addr;
     hpa_t shadow_addr;
@@ -1299,21 +1294,6 @@ static void shadow_walk_next(struct kvm_shadow_walk_iterator *iterator)
     --iterator->level;
 }

-static int walk_shadow(struct kvm_shadow_walk *walker,
-                       struct kvm_vcpu *vcpu, u64 addr)
-{
-    struct kvm_shadow_walk_iterator iterator;
-    int r;
-
-    for_each_shadow_entry(vcpu, addr, iterator) {
-        r = walker->entry(walker, vcpu, addr,
-                          iterator.sptep, iterator.level);
-        if (r)
-            return r;
-    }
-    return 0;
-}
-
 static void kvm_mmu_page_unlink_children(struct kvm *kvm,
                                          struct kvm_mmu_page *sp)
 {
[PATCH] KVM: MMU: Replace walk_shadow() by for_each_shadow_entry() in invlpg()
From: Avi Kivity a...@redhat.com

Signed-off-by: Avi Kivity a...@redhat.com

diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
index 69c7e33..46b68f9 100644
--- a/arch/x86/kvm/paging_tmpl.h
+++ b/arch/x86/kvm/paging_tmpl.h
@@ -25,7 +25,6 @@
 #if PTTYPE == 64
 #define pt_element_t u64
 #define guest_walker guest_walker64
-#define shadow_walker shadow_walker64
 #define FNAME(name) paging##64_##name
 #define PT_BASE_ADDR_MASK PT64_BASE_ADDR_MASK
 #define PT_DIR_BASE_ADDR_MASK PT64_DIR_BASE_ADDR_MASK
@@ -42,7 +41,6 @@
 #elif PTTYPE == 32
 #define pt_element_t u32
 #define guest_walker guest_walker32
-#define shadow_walker shadow_walker32
 #define FNAME(name) paging##32_##name
 #define PT_BASE_ADDR_MASK PT32_BASE_ADDR_MASK
 #define PT_DIR_BASE_ADDR_MASK PT32_DIR_BASE_ADDR_MASK
@@ -73,18 +71,6 @@ struct guest_walker {
     u32 error_code;
 };

-struct shadow_walker {
-    struct kvm_shadow_walk walker;
-    struct guest_walker *guest_walker;
-    int user_fault;
-    int write_fault;
-    int largepage;
-    int *ptwrite;
-    pfn_t pfn;
-    u64 *sptep;
-    gpa_t pte_gpa;
-};
-
 static gfn_t gpte_to_gfn(pt_element_t gpte)
 {
     return (gpte & PT_BASE_ADDR_MASK) >> PAGE_SHIFT;
@@ -453,54 +439,52 @@ out_unlock:
     return 0;
 }

-static int FNAME(shadow_invlpg_entry)(struct kvm_shadow_walk *_sw,
-                                      struct kvm_vcpu *vcpu, u64 addr,
-                                      u64 *sptep, int level)
+static void FNAME(invlpg)(struct kvm_vcpu *vcpu, gva_t gva)
 {
-    struct shadow_walker *sw =
-        container_of(_sw, struct shadow_walker, walker);
+    struct kvm_shadow_walk_iterator iterator;
+    pt_element_t gpte;
+    gpa_t pte_gpa = -1;
+    int level;
+    u64 *sptep;
+
+    spin_lock(&vcpu->kvm->mmu_lock);

-    /* FIXME: properly handle invlpg on large guest pages */
-    if (level == PT_PAGE_TABLE_LEVEL ||
-        ((level == PT_DIRECTORY_LEVEL) && is_large_pte(*sptep))) {
-        struct kvm_mmu_page *sp = page_header(__pa(sptep));
+    for_each_shadow_entry(vcpu, gva, iterator) {
+        level = iterator.level;
+        sptep = iterator.sptep;

-        sw->pte_gpa = (sp->gfn << PAGE_SHIFT);
-        sw->pte_gpa += (sptep - sp->spt) * sizeof(pt_element_t);
+        /* FIXME: properly handle invlpg on large guest pages */
+        if (level == PT_PAGE_TABLE_LEVEL ||
+            ((level == PT_DIRECTORY_LEVEL) && is_large_pte(*sptep))) {
+            struct kvm_mmu_page *sp = page_header(__pa(sptep));

-        if (is_shadow_present_pte(*sptep)) {
-            rmap_remove(vcpu->kvm, sptep);
-            if (is_large_pte(*sptep))
-                --vcpu->kvm->stat.lpages;
+            pte_gpa = (sp->gfn << PAGE_SHIFT);
+            pte_gpa += (sptep - sp->spt) * sizeof(pt_element_t);
+
+            if (is_shadow_present_pte(*sptep)) {
+                rmap_remove(vcpu->kvm, sptep);
+                if (is_large_pte(*sptep))
+                    --vcpu->kvm->stat.lpages;
+            }
+            set_shadow_pte(sptep, shadow_trap_nonpresent_pte);
+            break;
         }
-        set_shadow_pte(sptep, shadow_trap_nonpresent_pte);
-        return 1;
-    }
-    if (!is_shadow_present_pte(*sptep))
-        return 1;
-    return 0;
-}

-static void FNAME(invlpg)(struct kvm_vcpu *vcpu, gva_t gva)
-{
-    pt_element_t gpte;
-    struct shadow_walker walker = {
-        .walker = { .entry = FNAME(shadow_invlpg_entry), },
-        .pte_gpa = -1,
-    };
+        if (!is_shadow_present_pte(*sptep))
+            break;
+    }

-    spin_lock(&vcpu->kvm->mmu_lock);
-    walk_shadow(&walker.walker, vcpu, gva);
     spin_unlock(&vcpu->kvm->mmu_lock);
-    if (walker.pte_gpa == -1)
+
+    if (pte_gpa == -1)
         return;
-    if (kvm_read_guest_atomic(vcpu->kvm, walker.pte_gpa, &gpte,
+    if (kvm_read_guest_atomic(vcpu->kvm, pte_gpa, &gpte,
                               sizeof(pt_element_t)))
         return;
     if (is_present_pte(gpte) && (gpte & PT_ACCESSED_MASK)) {
         if (mmu_topup_memory_caches(vcpu))
             return;
-        kvm_mmu_pte_write(vcpu, walker.pte_gpa, (const u8 *)&gpte,
+        kvm_mmu_pte_write(vcpu, pte_gpa, (const u8 *)&gpte,
                           sizeof(pt_element_t), 0);
     }
 }
@@ -607,7 +591,6 @@ static int FNAME(sync_page)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)

 #undef pt_element_t
 #undef guest_walker
-#undef shadow_walker
 #undef FNAME
 #undef PT_BASE_ADDR_MASK
 #undef PT_INDEX
[PATCH] KVM: VMX: initialize TSC offset relative to vm creation time
From: Marcelo Tosatti mtosa...@redhat.com

VMX initializes the TSC offset for each vcpu at different times, and
also reinitializes it for vcpus other than 0 on the APIC SIPI message.
This bug causes the TSCs to appear unsynchronized in the guest, even if
the host is good.

Older Linux kernels don't handle the situation very well, so
gettimeofday is likely to go backwards in time:

http://www.mail-archive.com/k...@vger.kernel.org/msg02955.html
http://sourceforge.net/tracker/index.php?func=detail&aid=2025534&group_id=180599&atid=893831

Fix it by initializing the offset of each vcpu relative to vm creation
time, and moving it from vmx_vcpu_reset to vmx_vcpu_setup, out of the
APIC MP init path.

Signed-off-by: Marcelo Tosatti mtosa...@redhat.com
Signed-off-by: Avi Kivity a...@redhat.com

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index ab8ef1d..af00b8c 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -398,6 +398,7 @@ struct kvm_arch {
     unsigned long irq_sources_bitmap;
     unsigned long irq_states[KVM_IOAPIC_NUM_PINS];
+    u64 vm_init_tsc;
 };

 struct kvm_vm_stat {
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index ffd8f24..615447f 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -865,11 +865,8 @@ static u64 guest_read_tsc(void)
  * writes 'guest_tsc' into guest's timestamp counter register
  * guest_tsc = host_tsc + tsc_offset ==> tsc_offset = guest_tsc - host_tsc
  */
-static void guest_write_tsc(u64 guest_tsc)
+static void guest_write_tsc(u64 guest_tsc, u64 host_tsc)
 {
-    u64 host_tsc;
-
-    rdtscll(host_tsc);
     vmcs_write64(TSC_OFFSET, guest_tsc - host_tsc);
 }

@@ -933,6 +930,7 @@ static int vmx_set_msr(struct kvm_vcpu *vcpu, u32 msr_index, u64 data)
 {
     struct vcpu_vmx *vmx = to_vmx(vcpu);
     struct kvm_msr_entry *msr;
+    u64 host_tsc;
     int ret = 0;

     switch (msr_index) {
@@ -958,7 +956,8 @@ static int vmx_set_msr(struct kvm_vcpu *vcpu, u32 msr_index, u64 data)
         vmcs_writel(GUEST_SYSENTER_ESP, data);
         break;
     case MSR_IA32_TIME_STAMP_COUNTER:
-        guest_write_tsc(data);
+        rdtscll(host_tsc);
+        guest_write_tsc(data, host_tsc);
         break;
     case MSR_P6_PERFCTR0:
     case MSR_P6_PERFCTR1:
@@ -2108,7 +2107,7 @@ static int vmx_vcpu_setup(struct vcpu_vmx *vmx)
 {
     u32 host_sysenter_cs, msr_low, msr_high;
     u32 junk;
-    u64 host_pat;
+    u64 host_pat, tsc_this, tsc_base;
     unsigned long a;
     struct descriptor_table dt;
     int i;
@@ -2236,6 +2235,12 @@ static int vmx_vcpu_setup(struct vcpu_vmx *vmx)
     vmcs_writel(CR0_GUEST_HOST_MASK, ~0UL);
     vmcs_writel(CR4_GUEST_HOST_MASK, KVM_GUEST_CR4_MASK);

+    tsc_base = vmx->vcpu.kvm->arch.vm_init_tsc;
+    rdtscll(tsc_this);
+    if (tsc_this < vmx->vcpu.kvm->arch.vm_init_tsc)
+        tsc_base = tsc_this;
+
+    guest_write_tsc(0, tsc_base);
     return 0;
 }

@@ -2327,8 +2332,6 @@ static int vmx_vcpu_reset(struct kvm_vcpu *vcpu)
     vmcs_write32(GUEST_INTERRUPTIBILITY_INFO, 0);
     vmcs_write32(GUEST_PENDING_DBG_EXCEPTIONS, 0);

-    guest_write_tsc(0);
-
     /* Special registers */
     vmcs_write64(GUEST_IA32_DEBUGCTL, 0);

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index aa4575c..36aa576 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -4167,6 +4167,8 @@ struct kvm *kvm_arch_create_vm(void)
     /* Reserve bit 0 of irq_sources_bitmap for userspace irq source */
     set_bit(KVM_USERSPACE_IRQ_SOURCE_ID, &kvm->arch.irq_sources_bitmap);

+    rdtscll(kvm->arch.vm_init_tsc);
+
     return kvm;
 }
Re: [PATCH 0/5] for_each_shadow_entry
Marcelo Tosatti wrote:
> On Thu, Dec 25, 2008 at 03:23:34PM +0200, Avi Kivity wrote:
> > This patchset replaces walk_shadow(), which calls a callback for each
> > shadow pte that maps a guest virtual address, with an equivalent
> > for_each style construct.  Benefits are fewer thunks and smaller code.
> >
> > Please review.
>
> Looks good.

Merged; thanks for the review.

--
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line "unsubscribe kvm"
in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Weekly KVM Test report, kernel 4f27e3 ... userspace 6b3523 ...
Hi All,

This is our weekly KVM testing report against the latest kvm.git
4f27e3e32b71dd1beff3ace80cce5fb48887a8be and kvm-userspace.git
6b3523a08a01c1411b33690d66cf29920b0d09ce.

The SMP Vista booting issue is fixed. A 32e guest was found to crash with
live migration.

One new issue:
1. 32e guest crash after live migration
   https://sourceforge.net/tracker/index.php?func=detail&aid=2458020&group_id=180599&atid=893831

One fixed issue:
1. SMP PAE Vista guest crash with inject_page_fault
   https://sourceforge.net/tracker/index.php?func=detail&aid=2443886&group_id=180599&atid=893831

Five old issues:
1. 32-bit RHEL5/FC6 guests may fail to reboot after installation
   https://sourceforge.net/tracker/?func=detail&atid=893831&aid=1991647&group_id=180599
2. Failure to migrate guests with more than 4GB of RAM
   https://sourceforge.net/tracker/index.php?func=detail&aid=1971512&group_id=180599&atid=893831
3. OpenSuse 10.2 cannot be installed
   http://sourceforge.net/tracker/index.php?func=detail&aid=2088475&group_id=180599&atid=893831
4. Fail to save/restore guest
   https://sourceforge.net/tracker/?func=detail&atid=893831&aid=2175042&group_id=180599
5. Hotplug of a nonexistent device will kill the guest
   https://sourceforge.net/tracker/index.php?func=detail&aid=2432316&group_id=180599&atid=893831

Test environment
  Platform A    Stoakley/Clovertown
  CPU           4
  Memory size   8G

Report Summary on IA32-pae

Summary Test Report of Last Session
============================================================
                           Total  Pass  Fail  NoResult  Crash
============================================================
control_panel                  8     7     1         0      0
gtest                         16    16     0         0      0
============================================================
control_panel                  8     7     1         0      0
 :KVM_256M_guest_PAE_gPAE      1     1     0         0      0
 :KVM_linux_win_PAE_gPAE       1     1     0         0      0
 :KVM_two_winxp_PAE_gPAE       1     1     0         0      0
 :KVM_four_sguest_PAE_gPA      1     1     0         0      0
 :KVM_1500M_guest_PAE_gPA      1     1     0         0      0
 :KVM_LM_Continuity_PAE_g      1     1     0         0      0
 :KVM_LM_SMP_PAE_gPAE          1     1     0         0      0
 :KVM_SR_Continuity_PAE_g      1     0     1         0      0
gtest                         16    16     0         0      0
 :ltp_nightly_PAE_gPAE         1     1     0         0      0
 :boot_up_acpi_PAE_gPAE        1     1     0         0      0
 :boot_up_acpi_xp_PAE_gPA      1     1     0         0      0
 :boot_up_vista_PAE_gPAE       1     1     0         0      0
 :reboot_xp_PAE_gPAE           1     1     0         0      0
 :boot_base_kernel_PAE_gP      1     1     0         0      0
 :boot_up_acpi_win2k3_PAE      1     1     0         0      0
 :boot_smp_acpi_win2k3_PA      1     1     0         0      0
 :boot_smp_acpi_win2k_PAE      1     1     0         0      0
 :boot_up_win2008_PAE_gPA      1     1     0         0      0
 :boot_up_acpi_win2k_PAE_      1     1     0         0      0
 :boot_smp_acpi_xp_PAE_gP      1     1     0         0      0
 :boot_up_noacpi_win2k_PA      1     1     0         0      0
 :boot_smp_vista_PAE_gPAE      1     1     0         0      0
 :boot_smp_win2008_PAE_gP      1     1     0         0      0
 :kb_nightly_PAE_gPAE          1     1     0         0      0
============================================================
Total                         24    23     1         0      0

Report Summary on IA32e

Summary Test Report of Last Session
============================================================
                           Total  Pass  Fail  NoResult  Crash
============================================================
control_panel                 17    12     5         0      0
gtest                         23    23     0         0      0
============================================================
control_panel                 17    12     5         0      0
 :KVM_4G_guest_64_g32e         1     1     0         0      0
 :KVM_four_sguest_64_gPAE      1     1     0         0      0
 :KVM_LM_SMP_64_g32e           1     0     1         0      0
 :KVM_linux_win_64_gPAE        1     1     0         0      0
 :KVM_LM_SMP_64_gPAE           1     1     0         0      0
 :KVM_SR_Continuity_64_gP      1     0     1         0      0
 :KVM_four_sguest_64_g32e      1     1     0         0      0
 :KVM_four_dguest_64_gPAE      1     1     0         0      0
 :KVM_SR_SMP_64_gPAE           1     0     1         0      0
Re: [patch 3/3] KVM: VMX: initialize TSC offset relative to vm creation time
Marcelo Tosatti wrote:
> Ok, this could cause the guest tsc to be initialized to a high value
> close to wraparound (in case the vcpu is migrated to a cpu with a
> negative difference before vmx_vcpu_setup).

Applied, thanks.
Re: [patch 0/3] synchronized TSC between vcpu's on SMP guests
Marcelo Tosatti wrote:
> Most Intel hosts are supposed to have their TSCs synchronized.  This
> patchset attempts to fix the sites which overwrite the TSC, making it
> appear unsynchronized to the guest.

Applied the userspace bits as well.  I dropped qemu-kvm-x86.h; these
files are going away, so we shouldn't add more.
kvm-82: much improved guest debugging
The changelog from kvm-81 to kvm-82 says:

- much improved guest debugging (Jan Kiszka)
  - both debugger in guest and debugger in host

I haven't tested it much, but I can confirm that debugging tricks I knew didn't work with kvm-81 and before now work fine. Malware which wouldn't even run now launches inside KVM, allowing controlled execution for deductive purposes -- letting the malware itself spill the beans about what it's programmed to do. Exploits which were painful to track in action are now easily controlled via a userland debugger.

This is a fantastic addition, not least because it means that KVM fans can now evangelise much more enthusiastically amongst the security/reversing community, which uses virtualisation heavily for research purposes. VMware has an almost total monopoly in this segment of the population (except for Microsoft researchers appearing in public at conferences, of course [smiley]), not just because its brand has '200kg gorilla' status, but also because it has long and fairly correctly supported hardware debugging inside guests. (Without this support, exploit and malware analysis is seriously hampered -- see above.)

So thanks and well done to Jan for getting these changes on board. This opens up the security research community to a world beyond EMC, giving an additional choice of a genuinely free virtualisation solution with the added bonus of great pedagogical value thanks to the source code.

(Next on wish list: reliable Windows kernel debugging via serial port :/)

Yours sincerely,
A Very Happy Customer.
Re: [PATCH 4/8] KVM: Add a route layer to convert MSI message to GSI
On Sat, Dec 27, 2008 at 05:27:25PM -0200, Marcelo Tosatti wrote:
> > > +static int kvm_vm_ioctl_request_gsi_msg(struct kvm *kvm,
> > > +                    struct kvm_assigned_gsi_msg *agsi_msg)
> > > +{
> > > +    struct kvm_gsi_msg gsi_msg;
> > > +    int r;
> > > +
> > > +    gsi_msg.gsi = agsi_msg->gsi;
> > > +    gsi_msg.msg.address_lo = agsi_msg->msg.addr_lo;
> > > +    gsi_msg.msg.address_hi = agsi_msg->msg.addr_hi;
> > > +    gsi_msg.msg.data = agsi_msg->msg.data;
> > > +
> > > +    r = kvm_update_gsi_msg(kvm, &gsi_msg);
> > > +    if (r == 0)
> > > +        agsi_msg->gsi = gsi_msg.gsi;
> > > +    return r;
> > > +}
> >
> > Can't see the purpose of this function. Why preserve the user-passed
> > GSI value in case of failure? It will return an error anyway...
>
> > Oh, we preserved the user-passed GSI in case userspace wants to update
> > one entry. :)
>
> The structure is not copied back to userspace in case of failure anyway:
>
> +    r = kvm_vm_ioctl_request_gsi_msg(kvm, &agsi_msg);
> +    if (r)
> +        goto out;
> +    r = -EFAULT;
> +    if (copy_to_user(argp, &agsi_msg, sizeof agsi_msg))
> +        goto out;
>
> So I don't see the point of doing this?

Oh, Marcelo, finally I got your meaning -- I misunderstood you the first
time. Now I think you just mean: why is the "if (r == 0)" there? It's
unnecessary! Yes, you are right. Call it a little habit; the preserved
value is useless, and maybe I just wanted to indicate the legal return
value here. And from the other side, if (r != 0), the value should be
meaningless, and assigning it to another variable may not be proper.

--
regards
Yang, Sheng
Re: [PATCH 15/15] KVM: Fix racy in kvm_free_assigned_irq
On Sat, Dec 27, 2008 at 06:06:26PM -0200, Marcelo Tosatti wrote:
> On Fri, Dec 26, 2008 at 10:30:07AM +0800, Sheng Yang wrote:
> > Thanks to Marcelo's observation, the following code has a potential
> > issue:
> >
> >     if (cancel_work_sync(assigned_dev->interrupt_work))
> >         kvm_put_kvm(kvm);
> >
> > In fact, cancel_work_sync() returns true both when the work struct is
> > merely scheduled and when the callback of the work struct has been
> > executed.  This code only considers the former situation.
>
> Why not simply drop the reference inc/dec from the irq handler/work
> function?

Sorry, I don't know the answer. After checking the code, I also think it's
a little strange to increase the reference count here, and I don't think
the work handler is supposed to release the kvm struct. Maybe Avi knows?
Or Amit and Weidong?

--
regards
Yang, Sheng
Re: [PATCH 15/15] KVM: Fix racy in kvm_free_assigned_irq
Sheng Yang wrote:

	if (cancel_work_sync(assigned_dev->interrupt_work))
		kvm_put_kvm(kvm);

In fact, cancel_work_sync() returns true both when the work struct was merely scheduled and when its callback has already been executed. This code only considers the former situation.

Why not simply drop the reference inc/dec from the irq handler/work function?

Sorry, I don't know the answer. After checking the code, I also find it a little strange to increase the reference count here; I don't think the work handler is supposed to release the kvm struct. Maybe Avi knows? Or Amit and Weidong?

Not sure what the reasoning was, but it does seem like reference counting can be safely dropped from interrupt_work.

-- error compiling committee.c: too many arguments to function
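The race is easier to see with a small userspace model (a sketch only -- the function and its arguments are invented for illustration, this is not kernel code). The work function drops the reference it was given when it runs; the buggy teardown pattern also drops one whenever cancel_work_sync() reports the work as pending, which, as noted above, covers the already-executed case too:

```c
#include <assert.h>

/*
 * Model of the buggy teardown pattern
 *     if (cancel_work_sync(assigned_dev->interrupt_work))
 *         kvm_put_kvm(kvm);
 *
 * refs:            reference count before teardown (includes the work's ref)
 * cancel_returned: what cancel_work_sync() reported (true if the work was
 *                  scheduled OR its callback was already executed)
 * handler_ran:     whether the work function, which drops its own ref, ran
 */
static int refs_after_teardown(int refs, int cancel_returned, int handler_ran)
{
	if (handler_ran)
		refs--;		/* work function dropped its reference */
	if (cancel_returned)
		refs--;		/* teardown drops it again */
	return refs;
}
```

When the callback has already run, both paths drop the reference and the count underflows -- the double put under discussion. Dropping the inc/dec from the work function entirely, as Marcelo suggests, removes the ambiguity.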
Serial ATA Support - will it come?
Hello List,

I was planning to introduce kvm in our server farm to give customers the possibility to load their OS into kvm in case they did something wrong and want to fix that. We have several hundred servers with different Linux distros, but always the same hardware and kernel. Our hard drives are SATA disks driven by AHCI. These disks are named /dev/sdXX in the OS, and this is where kvm is failing on me: it seems that our disks are recognized as IDE drives and named /dev/hdXX when booting the OS into KVM, giving me some classic kernel panics. My initial command line is quite straightforward:

kvm -hda /dev/sda -boot c -net nic,model=rtl8139 -vnc :1

I was told to try SCSI emulation, but this didn't help either, as no disk was found at all:

kvm -drive file=/dev/sda,if=scsi,boot=on -boot c -net nic,model=rtl8139 -vnc :1

With /dev/hdaX I cannot really boot into kvm without changing a significant number of files per server, and changing almost 1000 servers is not a very sane option :-)

Has SATA support, especially for the generic AHCI devices, ever been considered for inclusion in kvm? Would it be a lot of work to add this emulation to kvm? I certainly don't want to ask for impossible things, but I would just like to know whether in the near future I will be able to boot my images without changing them. Maybe there is a hidden solution that I did not find; if so, please let me know.

Thanks for your time. Andy
Re: [Qemu-devel] Fix compilation with -Werror=format-security
On 12/27/08, Lennert Buytenhek buyt...@wantstofly.org wrote: On Sat, Dec 27, 2008 at 07:52:39PM +, Frederik Himpe wrote:

Mandriva is now using the -Werror=format-security CFLAG by default. To make kvm-82 compile with this option, I had to apply this patch:

http://svn.mandriva.com/cgi-bin/viewvc.cgi/packages/cooker/kvm/current/SOURCES/kvm-82-format-string.patch?view=co

diff -ur kvm-82.orig/qemu/monitor.c kvm-82/qemu/monitor.c
--- kvm-82.orig/qemu/monitor.c	2008-12-24 15:24:58.0 +0100
+++ kvm-82/qemu/monitor.c	2008-12-27 20:19:32.0 +0100
@@ -1975,7 +1975,7 @@
 static void expr_error(const char *fmt)
 {
-    term_printf(fmt);
+    term_printf("%s", fmt);
     term_printf("\n");
     longjmp(expr_env, 1);
 }

Why not make this term_printf("%s\n", fmt); while you're at it?

Thanks, I combined these to form r6134. As fmt no longer describes the parameter, I changed it to msg.
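The class of bug -Werror=format-security catches can be shown with a standalone sketch (the function names here are mine, not QEMU's): passing untrusted text as the format string lets any '%' sequence in it be interpreted as a conversion, reading garbage varargs or worse.

```c
#include <stdio.h>
#include <string.h>

/* Unsafe: msg is used as the format string, so "%" sequences in it are
   interpreted by snprintf -- the pattern the Mandriva patch removes. */
static void print_unsafe(char *out, size_t n, const char *msg)
{
	snprintf(out, n, msg);
}

/* Safe: msg is plain data; the only format directive is our own "%s". */
static void print_safe(char *out, size_t n, const char *msg)
{
	snprintf(out, n, "%s", msg);
}
```

With the unsafe variant, a message containing "%%" collapses to "%", and a stray "%x" would pull a nonexistent argument off the stack; the safe variant reproduces the message byte for byte.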
Re: [PATCH 0/3] Remove interrupt stack table usage from x86_64 kernel
(restoring cc list)

Andi Kleen wrote: One of the other problems: NMIs and MCEs have the same problem with SYSCALL

This one however looks unsolvable. Userspace can point %rsp into arbitrary memory, issue a syscall, and hope for an nmi. Since we're in cpl 0 and are not using IST, the processor will not switch stacks, and the nmi stack frame will corrupt any memory the user chooses to point to. Even without a malicious user, %rsp could legitimately point at unmapped memory.

I don't see how syscall could work on i386, and indeed:

vdso32.so-$(VDSO32-y)		+= int80
vdso32.so-$(CONFIG_COMPAT)	+= syscall
vdso32.so-$(VDSO32-y)		+= sysenter

It's disabled. Is that the reason?

-- error compiling committee.c: too many arguments to function
Re: gettimeofday slow in RHEL4 guests
On Tue, Nov 25, 2008 at 01:52:59PM +0100, Andi Kleen wrote: But yeah - the remapping of HPET timers to virtual HPET timers sounds pretty tough. I wonder if one could overcome that with a little hardware support though ...

For gettimeofday better make TSC work. Even in the best case (no virtualization) it is much faster than HPET, because it sits in the CPU while HPET is far away on the external south bridge.

The tsc clock on older Linux 2.6 kernels compensates for lost ticks. The algorithm uses the PIT count (latched) to measure the delay between interrupt generation and handling, and sums that value, on the next interrupt, to the TSC delta. Sheng investigated this problem in the discussions before the in-kernel PIT was merged:

http://www.mail-archive.com/kvm-de...@lists.sourceforge.net/msg13873.html

The algorithm overcompensates for lost ticks and the guest time runs faster than the host's. There are two issues:

1) A bug in the in-kernel PIT which miscalculates the count value.

2) For the case where more than one interrupt is lost and later reinjected, the value read from the PIT count is meaningless for the purpose of the tsc algorithm. The count is interpreted as the delay until the next interrupt, which is not the case with reinjection.

As Sheng mentioned in the thread above, Xen pulls back the TSC value when reinjecting interrupts. VMware ESX has a notion of a virtual TSC, which I believe is similar in this context.

For KVM I believe the best immediate solution (for now) is to provide an option to disable reinjection, behaving similarly to real hardware. The advantage is simplicity compared to virtualizing the time sources. The QEMU PIT emulation has a limit on the rate of interrupt reinjection; perhaps something similar should be investigated in the future.

The following patch (which contains the bugfix for 1, and disables reinjection) fixes the severe time drift on RHEL4 with clock=tsc.
What I'm proposing is to condition reinjection on an option (-kvm-pit-no-reinject or something). Comments or better ideas?

diff --git a/arch/x86/kvm/i8254.c b/arch/x86/kvm/i8254.c
index e665d1c..608af7b 100644
--- a/arch/x86/kvm/i8254.c
+++ b/arch/x86/kvm/i8254.c
@@ -201,13 +201,16 @@ static int __pit_timer_fn(struct kvm_kpit_state *ps)
 	if (!atomic_inc_and_test(&pt->pending))
 		set_bit(KVM_REQ_PENDING_TIMER, &vcpu0->requests);
 
+	if (atomic_read(&pt->pending) > 1)
+		atomic_set(&pt->pending, 1);
+
 	if (vcpu0 && waitqueue_active(&vcpu0->wq))
 		wake_up_interruptible(&vcpu0->wq);
 
 	hrtimer_add_expires_ns(&pt->timer, pt->period);
 	pt->scheduled = hrtimer_get_expires_ns(&pt->timer);
 	if (pt->period)
-		ps->channels[0].count_load_time = hrtimer_get_expires(&pt->timer);
+		ps->channels[0].count_load_time = ktime_get();
 
 	return (pt->period == 0 ? 0 : 1);
 }
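The effect of clamping the pending counter can be sketched with a toy userspace model (an illustration only; the function and its bookkeeping are mine, not kernel code). Suppose the vCPU cannot take the timer interrupt for several periods: without the clamp every accumulated tick is later reinjected, which combined with the PIT-count compensation makes guest time run fast; with the clamp at most one pending tick survives.

```c
/*
 * fires: timer expirations while the guest could not take the interrupt
 * clamp: whether pending is clamped to 1, as in the patch above
 * returns: number of interrupts the guest eventually receives
 */
static int ticks_injected(int fires, int clamp)
{
	int pending = 0, injected = 0;
	int i;

	for (i = 0; i < fires; i++) {
		pending++;		/* __pit_timer_fn accumulates a tick */
		if (clamp && pending > 1)
			pending = 1;	/* coalesce instead of reinjecting */
	}
	while (pending-- > 0)
		injected++;		/* backlog drained by later injections */
	return injected;
}
```

This is exactly the "behave like real hardware" option: a physical PIT does not queue up missed ticks either, so a lost-tick-compensating guest sees consistent evidence.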
Re: [PATCH 0/3] Remove interrupt stack table usage from x86_64 kernel
On Sun, Dec 28, 2008 at 04:09:26PM +0200, Avi Kivity wrote: I don't see how syscall could work on i386, and indeed:

i386 has task gates which support unconditional stack switching. But there are no 64bit task gates, just ISTs. BTW I think there are more similar problems in your patch too.

vdso32.so-$(VDSO32-y)		+= int80
vdso32.so-$(CONFIG_COMPAT)	+= syscall
vdso32.so-$(VDSO32-y)		+= sysenter

It's disabled. Is that the reason?

No. All interesting 32bit CPUs have SYSENTER; the only one who has SYSCALL but no SYSENTER is the K6, but it has a weird early variant of SYSCALL with more problems which was never worth supporting.

-Andi -- a...@linux.intel.com
Re: [PATCH 0/3] Remove interrupt stack table usage from x86_64 kernel
Andi Kleen wrote: On Sun, Dec 28, 2008 at 04:09:26PM +0200, Avi Kivity wrote: I don't see how syscall could work on i386, and indeed:

i386 has task gates which support unconditional stack switching. But there are no 64bit task gates, just ISTs.

i386 is not that interesting to me (and probably task switching couldn't be made to work well with guest state in TR).

BTW I think there are more similar problems in your patch too.

One fatal problem is enough -- I don't think that patch can be made to work. Pity, since it did clean up some stuff.

I would like however to speed up kvm. Here's a plan:

1. Add per-cpu IDT
2. When switching to the guest TR (and other state), switch off IST use in the current IDT
3. When switching away from the kvm task, restore the IST entries

A per-cpu IDT would cost around 4K per cpu. I propose to make it kconfigurable, and have kvm select it.

Ingo, does this sound workable? It increases complexity rather than decreasing it, as the previous solution did, but I don't see any way to drop the use of IST, as SYSCALL cannot work without IST if NMIs are enabled.

-- I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain.
Re: [PATCH 0/3] Remove interrupt stack table usage from x86_64 kernel
One fatal problem is enough -- I don't think that patch can be made to work. Pity, since it did clean up some stuff.

Not sure that was true anyways.

I would like however to speed up kvm. Here's a plan:

1. Add per-cpu IDT

You don't need that, do you? Just two sets.

2. When switching to the guest TR (and other state), switch off IST use in the current IDT
3. When switching away from the kvm task, restore the IST entries

A per-cpu IDT would cost around 4K per cpu. I propose to make it kconfigurable, and have kvm select it.

If anything please make it runtime switchable and disable it on Intel CPUs.

-Andi -- a...@linux.intel.com
Re: [PATCH 0/3] Remove interrupt stack table usage from x86_64 kernel
Avi Kivity wrote: 1. Add per-cpu IDT

Or we could have just two IDTs - one with IST and one without. I clocked LIDT at 58 cycles (and we need two per heavyweight switch), so it's not that wonderful.

-- I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain.
Re: [PATCH 0/3] Remove interrupt stack table usage from x86_64 kernel
Avi Kivity wrote: Avi Kivity wrote: 1. Add per-cpu IDT

Or we could have just two IDTs - one with IST and one without. I clocked LIDT at 58 cycles (and we need two per heavyweight switch), so it's not that wonderful.

This makes the whole thing unworthwhile. The vmload/vmsave pair costs only 200 cycles (I should have started with this), and 120 cycles on the heavyweight path plus complexity are not worth 200 cycles on the lightweight path.

-- I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain.
Re: [PATCH 0/3] Remove interrupt stack table usage from x86_64 kernel
On Sun, Dec 28, 2008 at 10:08:35PM +0200, Avi Kivity wrote: Avi Kivity wrote: Avi Kivity wrote: 1. Add per-cpu IDT

Or we could have just two IDTs - one with IST and one without. I clocked LIDT at 58 cycles (and we need two per heavyweight switch), so it's not that wonderful.

This makes the whole thing unworthwhile. The vmload/vmsave pair costs only 200 cycles (I should have started with this), and 120 cycles on the heavyweight path plus complexity are not worth 200 cycles on the lightweight path.

Actually to switch ISTs you need to change the TSS, not the IDT. But I suppose that won't be any faster.

-Andi -- a...@linux.intel.com
Re: [PATCH 0/3] Remove interrupt stack table usage from x86_64 kernel
Andi Kleen wrote: This makes the whole thing unworthwhile. The vmload/vmsave pair costs only 200 cycles (I should have started with this), and 120 cycles on the heavyweight path plus complexity are not worth 200 cycles on the lightweight path.

Actually to switch ISTs you need to change the TSS, not the IDT. But I suppose that won't be any faster.

I can't touch the TSS (that's the starting point of the exercise). The plan was to have a copy of the IDT with all IST pointers zeroed out (or to have a per-cpu IDT and zero out the IST pointers when entering guest mode, restoring them on context switch). It's not worth it though.

-- I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain.
[PATCH 0/2] remove kvm vmap usage
Remove the vmap usage from kvm; this is needed both for ksm and get_user_pages != write.
[PATCH 1/2] KVM: introducing kvm_read_guest_virt, kvm_write_guest_virt.
This commit renames emulator_read_std to kvm_read_guest_virt, and adds a new function, kvm_write_guest_virt, that allows writing to a guest virtual address.

Signed-off-by: Izik Eidus iei...@redhat.com
---
 arch/x86/include/asm/kvm_host.h |    4 ---
 arch/x86/kvm/x86.c              |   56 +-
 2 files changed, 42 insertions(+), 18 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index ab8ef1d..a129700 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -608,10 +608,6 @@ void kvm_inject_nmi(struct kvm_vcpu *vcpu);
 
 void fx_init(struct kvm_vcpu *vcpu);
 
-int emulator_read_std(unsigned long addr,
-		      void *val,
-		      unsigned int bytes,
-		      struct kvm_vcpu *vcpu);
 int emulator_write_emulated(unsigned long addr,
 			    const void *val,
 			    unsigned int bytes,
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index aa4575c..c812209 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1973,10 +1973,8 @@ static struct kvm_io_device *vcpu_find_mmio_dev(struct kvm_vcpu *vcpu,
 	return dev;
 }
 
-int emulator_read_std(unsigned long addr,
-		      void *val,
-		      unsigned int bytes,
-		      struct kvm_vcpu *vcpu)
+int kvm_read_guest_virt(gva_t addr, void *val, unsigned int bytes,
+			struct kvm_vcpu *vcpu)
 {
 	void *data = val;
 	int r = X86EMUL_CONTINUE;
@@ -1984,27 +1982,57 @@ int emulator_read_std(unsigned long addr,
 	while (bytes) {
 		gpa_t gpa = vcpu->arch.mmu.gva_to_gpa(vcpu, addr);
 		unsigned offset = addr & (PAGE_SIZE-1);
-		unsigned tocopy = min(bytes, (unsigned)PAGE_SIZE - offset);
+		unsigned toread = min(bytes, (unsigned)PAGE_SIZE - offset);
 		int ret;
 
 		if (gpa == UNMAPPED_GVA) {
 			r = X86EMUL_PROPAGATE_FAULT;
 			goto out;
 		}
-		ret = kvm_read_guest(vcpu->kvm, gpa, data, tocopy);
+		ret = kvm_read_guest(vcpu->kvm, gpa, data, toread);
 		if (ret < 0) {
 			r = X86EMUL_UNHANDLEABLE;
 			goto out;
 		}
 
-		bytes -= tocopy;
-		data += tocopy;
-		addr += tocopy;
+		bytes -= toread;
+		data += toread;
+		addr += toread;
 	}
 out:
 	return r;
 }
-EXPORT_SYMBOL_GPL(emulator_read_std);
+
+int 
kvm_write_guest_virt(gva_t addr, void *val, unsigned int bytes,
+		     struct kvm_vcpu *vcpu)
+{
+	void *data = val;
+	int r = X86EMUL_CONTINUE;
+
+	while (bytes) {
+		gpa_t gpa = vcpu->arch.mmu.gva_to_gpa(vcpu, addr);
+		unsigned offset = addr & (PAGE_SIZE-1);
+		unsigned towrite = min(bytes, (unsigned)PAGE_SIZE - offset);
+		int ret;
+
+		if (gpa == UNMAPPED_GVA) {
+			r = X86EMUL_PROPAGATE_FAULT;
+			goto out;
+		}
+		ret = kvm_write_guest(vcpu->kvm, gpa, data, towrite);
+		if (ret < 0) {
+			r = X86EMUL_UNHANDLEABLE;
+			goto out;
+		}
+
+		bytes -= towrite;
+		data += towrite;
+		addr += towrite;
+	}
+out:
+	return r;
+}
+
 static int emulator_read_emulated(unsigned long addr,
 				  void *val,
@@ -2026,8 +2054,8 @@ static int emulator_read_emulated(unsigned long addr,
 	if ((gpa & PAGE_MASK) == APIC_DEFAULT_PHYS_BASE)
 		goto mmio;
-	if (emulator_read_std(addr, val, bytes, vcpu)
-	    == X86EMUL_CONTINUE)
+	if (kvm_read_guest_virt(addr, val, bytes, vcpu)
+	    == X86EMUL_CONTINUE)
 		return X86EMUL_CONTINUE;
 	if (gpa == UNMAPPED_GVA)
 		return X86EMUL_PROPAGATE_FAULT;
@@ -2230,7 +2258,7 @@ void kvm_report_emulation_failure(struct kvm_vcpu *vcpu, const char *context)
 
 	rip_linear = rip + get_segment_base(vcpu, VCPU_SREG_CS);
 
-	emulator_read_std(rip_linear, (void *)opcodes, 4, vcpu);
+	kvm_read_guest_virt(rip_linear, (void *)opcodes, 4, vcpu);
 
 	printk(KERN_ERR "emulation failed (%s) rip %lx %02x %02x %02x %02x\n",
 	       context, rip, opcodes[0], opcodes[1], opcodes[2], opcodes[3]);
@@ -2238,7 +2266,7 @@ void kvm_report_emulation_failure(struct kvm_vcpu *vcpu, const char *context)
 EXPORT_SYMBOL_GPL(kvm_report_emulation_failure);
 
 static struct x86_emulate_ops emulate_ops = {
-	.read_std = emulator_read_std,
+	.read_std = kvm_read_guest_virt,
 	.read_emulated =
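Both helpers rely on the same chunking rule: a guest-virtual range must be split at page boundaries, because gva_to_gpa is only valid within a single page and adjacent virtual pages may map to unrelated physical pages. A standalone sketch of that split (names are mine, not KVM's):

```c
#include <stddef.h>

#define GUEST_PAGE_SIZE 4096u

/* Count the page-bounded chunks kvm_read_guest_virt/kvm_write_guest_virt
   would process for [addr, addr + bytes): each iteration handles
   min(bytes, PAGE_SIZE - offset_in_page) and advances. */
static size_t chunks_for_range(unsigned long addr, unsigned int bytes)
{
	size_t n = 0;

	while (bytes) {
		unsigned int offset = addr & (GUEST_PAGE_SIZE - 1);
		unsigned int chunk = GUEST_PAGE_SIZE - offset;

		if (chunk > bytes)
			chunk = bytes;
		bytes -= chunk;
		addr += chunk;
		n++;
	}
	return n;
}
```

A page-aligned 4 KB copy is one chunk; the same 4 KB starting one byte in is two. This is also why patch 2/2 can replace the vmap of up to two pinned pages: each chunk is translated and copied independently.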
[PATCH 2/2] KVM: remove the vmap usage
Signed-off-by: Izik Eidus iei...@redhat.com --- arch/x86/kvm/x86.c| 62 +--- include/linux/kvm_types.h |3 +- 2 files changed, 14 insertions(+), 51 deletions(-) diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index c812209..29df564 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -2368,40 +2368,19 @@ int emulate_instruction(struct kvm_vcpu *vcpu, } EXPORT_SYMBOL_GPL(emulate_instruction); -static void free_pio_guest_pages(struct kvm_vcpu *vcpu) -{ - int i; - - for (i = 0; i ARRAY_SIZE(vcpu-arch.pio.guest_pages); ++i) - if (vcpu-arch.pio.guest_pages[i]) { - kvm_release_page_dirty(vcpu-arch.pio.guest_pages[i]); - vcpu-arch.pio.guest_pages[i] = NULL; - } -} - static int pio_copy_data(struct kvm_vcpu *vcpu) { void *p = vcpu-arch.pio_data; - void *q; + gva_t q = vcpu-arch.pio.guest_gva; unsigned bytes; - int nr_pages = vcpu-arch.pio.guest_pages[1] ? 2 : 1; + int ret; - q = vmap(vcpu-arch.pio.guest_pages, nr_pages, VM_READ|VM_WRITE, -PAGE_KERNEL); - if (!q) { - free_pio_guest_pages(vcpu); - return -ENOMEM; - } - q += vcpu-arch.pio.guest_page_offset; bytes = vcpu-arch.pio.size * vcpu-arch.pio.cur_count; if (vcpu-arch.pio.in) - memcpy(q, p, bytes); + ret = kvm_write_guest_virt(q, p, bytes, vcpu); else - memcpy(p, q, bytes); - q -= vcpu-arch.pio.guest_page_offset; - vunmap(q); - free_pio_guest_pages(vcpu); - return 0; + ret = kvm_read_guest_virt(q, p, bytes, vcpu); + return ret; } int complete_pio(struct kvm_vcpu *vcpu) @@ -2512,7 +2491,6 @@ int kvm_emulate_pio(struct kvm_vcpu *vcpu, struct kvm_run *run, int in, vcpu-arch.pio.in = in; vcpu-arch.pio.string = 0; vcpu-arch.pio.down = 0; - vcpu-arch.pio.guest_page_offset = 0; vcpu-arch.pio.rep = 0; if (vcpu-run-io.direction == KVM_EXIT_IO_IN) @@ -2540,9 +2518,7 @@ int kvm_emulate_pio_string(struct kvm_vcpu *vcpu, struct kvm_run *run, int in, gva_t address, int rep, unsigned port) { unsigned now, in_page; - int i, ret = 0; - int nr_pages = 1; - struct page *page; + int ret = 0; struct kvm_io_device *pio_dev; 
vcpu-run-exit_reason = KVM_EXIT_IO; @@ -2554,7 +2530,6 @@ int kvm_emulate_pio_string(struct kvm_vcpu *vcpu, struct kvm_run *run, int in, vcpu-arch.pio.in = in; vcpu-arch.pio.string = 1; vcpu-arch.pio.down = down; - vcpu-arch.pio.guest_page_offset = offset_in_page(address); vcpu-arch.pio.rep = rep; if (vcpu-run-io.direction == KVM_EXIT_IO_IN) @@ -2574,15 +2549,8 @@ int kvm_emulate_pio_string(struct kvm_vcpu *vcpu, struct kvm_run *run, int in, else in_page = offset_in_page(address) + size; now = min(count, (unsigned long)in_page / size); - if (!now) { - /* -* String I/O straddles page boundary. Pin two guest pages -* so that we satisfy atomicity constraints. Do just one -* transaction to avoid complexity. -*/ - nr_pages = 2; + if (!now) now = 1; - } if (down) { /* * String I/O in reverse. Yuck. Kill the guest, fix later. @@ -2597,15 +2565,7 @@ int kvm_emulate_pio_string(struct kvm_vcpu *vcpu, struct kvm_run *run, int in, if (vcpu-arch.pio.cur_count == vcpu-arch.pio.count) kvm_x86_ops-skip_emulated_instruction(vcpu); - for (i = 0; i nr_pages; ++i) { - page = gva_to_page(vcpu, address + i * PAGE_SIZE); - vcpu-arch.pio.guest_pages[i] = page; - if (!page) { - kvm_inject_gp(vcpu, 0); - free_pio_guest_pages(vcpu); - return 1; - } - } + vcpu-arch.pio.guest_gva = address; pio_dev = vcpu_find_pio_dev(vcpu, port, vcpu-arch.pio.cur_count, @@ -2613,7 +2573,11 @@ int kvm_emulate_pio_string(struct kvm_vcpu *vcpu, struct kvm_run *run, int in, if (!vcpu-arch.pio.in) { /* string PIO write */ ret = pio_copy_data(vcpu); - if (ret = 0 pio_dev) { + if (ret == X86EMUL_PROPAGATE_FAULT) { + kvm_inject_gp(vcpu, 0); + return 1; + } + if (ret == 0 pio_dev) { pio_string_write(pio_dev, vcpu); complete_pio(vcpu); if (vcpu-arch.pio.count == 0) diff --git a/include/linux/kvm_types.h b/include/linux/kvm_types.h index 9b6f395..5f4a18c
Re: [PATCH 0/2] remove kvm vmap usage
Izik Eidus wrote: Remove the vmap usage from kvm; this is needed both for ksm and get_user_pages != write.

OK, so for ksm things are now all right. Actually, I could live without this vmap change as well, because ksm checks the page _count, and every call to gva_to_page should increase it. Anyway, before I continue with the kmap removal (again, ksm doesn't need it; only get_user_pages != write does): do we really want to go that way? Even for the get_user_pages case we can live with write = 0 if, before each kmap_atomic, we call get_user_pages with write = 1; that way the pages behind apic_page, time_page and so on would be safe (when ksm comes to merge such a page, it will skip it thanks to the page count check).
[PATCH][v2] kvm-userspace: Load PCI option ROMs
Hi Avi, Thanks for your comments. I've updated the patch according to them. Please review it. Thank you. Load assigned devices' PCI option ROMs to the RAM of guest OS. And pass the corresponding devfns to BIOS. Signed-off-by: Kechao Liu kechao@intel.com --- bios/rombios.c | 20 +- qemu/hw/device-assignment.c | 155 +++ qemu/hw/device-assignment.h |1 + qemu/hw/pc.c|8 ++- 4 files changed, 178 insertions(+), 6 deletions(-) diff --git a/bios/rombios.c b/bios/rombios.c index b7a240f..e98d4ff 100644 --- a/bios/rombios.c +++ b/bios/rombios.c @@ -10253,18 +10253,30 @@ rom_scan_loop: add al, #0x04 block_count_rounded: - xor bx, bx ;; Restore DS back to : - mov ds, bx push ax ;; Save AX push di ;; Save DI ;; Push addr of ROM entry point push cx ;; Push seg push #0x0003 ;; Push offset + ;; Get the BDF into ax before invoking the option ROM + mov bl, [2] + mov al, bl + shr al, #7 + cmp al, #1 + jne fetch_bdf + mov ax, ds ;; Increment the DS since rom size larger than an segment + add ax, #0x1000 + mov ds, ax +fetch_bdf: + shl bx, #9 + xor ax, ax + mov al, [bx] + ;; Point ES:DI at $PnP, which tells the ROM that we are a PnP BIOS. ;; That should stop it grabbing INT 19h; we will use its BEV instead. 
- mov ax, #0xf000 - mov es, ax + mov bx, #0xf000 + mov es, bx lea di, pnp_string mov bp, sp ;; Call ROM init routine using seg:off on stack diff --git a/qemu/hw/device-assignment.c b/qemu/hw/device-assignment.c index 7a5..7f16970 100644 --- a/qemu/hw/device-assignment.c +++ b/qemu/hw/device-assignment.c @@ -678,3 +678,158 @@ void add_assigned_devices(PCIBus *bus, const char **devices, int n_devices) } } } + +/* Option ROM header */ +struct option_rom_header { +uint8_t signature[2]; +uint8_t rom_size; +uint32_t entry_point; +uint8_t reserved[17]; +uint16_t pci_header_offset; +uint16_t expansion_header_offset; +} __attribute__ ((packed)); + +/* Option ROM PCI data structure */ +struct option_rom_pci_header { +uint8_t signature[4]; +uint16_t vendor_id; +uint16_t device_id; +uint16_t vital_product_data_offset; +uint16_t structure_length; +uint8_t structure_revision; +uint8_t class_code[3]; +uint16_t image_length; +uint16_t image_revision; +uint8_t code_type; +uint8_t indicator; +uint16_t reserved; +} __attribute__ ((packed)); + +/* + * Scan the list of Option ROMs at roms. If a suitable Option ROM is found, + * allocate a ram space and copy it there. Then return its size aligned to + * both 2KB and target page size. + */ +#define OPTION_ROM_ALIGN(x) (((x) + 2047) ~2047) +static int scan_option_rom(uint8_t devfn, void *roms, ram_addr_t offset) +{ +int i, size, total_size; +uint8_t csum; +ram_addr_t addr; +struct option_rom_header *rom; +struct option_rom_pci_header *pcih; + +rom = roms; + +for ( ; ; ) { +/* Invalid signature means we're out of option ROMs. */ +if (strncmp((char *)rom-signature, \x55\xaa, 2) || + (rom-rom_size == 0)) +break; + +size = rom-rom_size * 512; +/* Invalid checksum means we're out of option ROMs. */ +csum = 0; +for (i = 0; i size; i++) +csum += ((uint8_t *)rom)[i]; +if (csum != 0) +break; + +/* Check the PCI header (if any) for a match. 
*/ +pcih = (struct option_rom_pci_header *) +((char *)rom + rom-pci_header_offset); +if ((rom-pci_header_offset != 0) + !strncmp((char *)pcih-signature, PCIR, 4)) +goto found; + +rom = (struct option_rom_header *)((char *)rom + size); +} + +return 0; + + found: +/* The size should be both 2K-aligned and page-aligned */ +total_size = (TARGET_PAGE_SIZE 2048) + ? OPTION_ROM_ALIGN(size + 1) + : TARGET_PAGE_ALIGN(size + 1); + +/* Size of all available ram space is 0x1 (0xd to 0xe) */ +if ((offset + total_size) 0x1u) { +fprintf(stderr, Option ROM size %x exceeds available space\n, size); +return 0; +} + +addr = qemu_ram_alloc(total_size); +cpu_register_physical_memory(0xd + offset, total_size, addr | IO_MEM_ROM); + +/* Write ROM data and devfn to phys_addr */ +cpu_physical_memory_write_rom(0xd + offset, rom, size); +cpu_physical_memory_write_rom(0xd + offset + size, devfn, 1); + +return total_size; +} + +/* + * Scan the assigned devices for the devices that have an option ROM, and then + * load the corresponding ROM data to RAM. If an error occurs while loading an + * option ROM, we just ignore that option ROM and continue with the next one. + */ +ram_addr_t assigned_dev_load_option_roms(ram_addr_t
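Two details of scan_option_rom() can be checked in isolation (a standalone sketch; the helper name is mine): the 2 KB rounding macro from the patch, and the PC option ROM rule that all bytes of the image, checksum byte included, must sum to zero modulo 256.

```c
#include <stdint.h>
#include <stddef.h>

/* Same rounding as the patch: round x up to the next 2 KB multiple. */
#define OPTION_ROM_ALIGN(x) (((x) + 2047) & ~2047)

/* An option ROM image is valid only if the byte sum over its whole
   length (rom_size * 512 bytes) is 0 mod 256; vendors pick the
   checksum byte to make that true. scan_option_rom() treats a bad
   sum as "out of option ROMs". */
static int rom_checksum_ok(const uint8_t *rom, size_t size)
{
	uint8_t csum = 0;
	size_t i;

	for (i = 0; i < size; i++)
		csum += rom[i];
	return csum == 0;
}
```

For example, a 4-byte image starting with the 0x55 0xAA signature needs a byte that brings the sum to a multiple of 256, while the bare signature alone (0x55 + 0xAA = 0xFF) fails the check.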
Re: [PATCH 15/15] KVM: Fix racy in kvm_free_assigned_irq
On Sun, Dec 28, 2008 at 07:24:02PM +0800, Sheng Yang wrote: On Sat, Dec 27, 2008 at 06:06:26PM -0200, Marcelo Tosatti wrote: On Fri, Dec 26, 2008 at 10:30:07AM +0800, Sheng Yang wrote:

Thanks to Marcelo's observation, the following code has a potential issue:

	if (cancel_work_sync(assigned_dev->interrupt_work))
		kvm_put_kvm(kvm);

In fact, cancel_work_sync() returns true both when the work struct was merely scheduled and when its callback has already been executed. This code only considers the former situation.

Why not simply drop the reference inc/dec from the irq handler/work function?

Sorry, I don't know the answer. After checking the code, I also find it a little strange to increase the reference count here; I don't think the work handler is supposed to release the kvm struct.

At the time of developing that code, this was my observation:

I see from the call chain kvm_put_kvm->...->kvm_arch_destroy_vm that no locks are taken to actually destroy the vm. We can't be in ioctls, sure. But shouldn't the mutex be taken to ensure there's nothing else going on while destroying? At least with the workqueue model, we could be called asynchronously in kernel context, and I could be holding the mutex and about to inject interrupts while everything is being wiped out underneath. However, the workqueue model tries its best to schedule the work on the same CPU, though we can't use that to guarantee things will be fine.

---

So I had to hold a ref to the current vm for as long as any work was pending. I think I put comments in the code, but sadly most of my comments were stripped out before the merge.

Amit