[PATCH] kvm: qemu: regenerate bios for tsc zeroing removal

2008-12-28 Thread Avi Kivity
From: Avi Kivity a...@redhat.com

Signed-off-by: Avi Kivity a...@redhat.com

diff --git a/qemu/pc-bios/bios.bin b/qemu/pc-bios/bios.bin
index 768d8f0..0e0dbea 100644
Binary files a/qemu/pc-bios/bios.bin and b/qemu/pc-bios/bios.bin differ
--
To unsubscribe from this list: send the line unsubscribe kvm-commits in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] kvm: configure: if all else fails, infer kernel version from .config

2008-12-28 Thread Avi Kivity
From: Avi Kivity a...@redhat.com

Signed-off-by: Avi Kivity a...@redhat.com

diff --git a/configure b/configure
index d2883a7..493c178 100755
--- a/configure
+++ b/configure
@@ -125,6 +125,9 @@ if [ -n "$no_uname" ]; then
 
 elif [ -e "$kerneldir/include/config/kernel.release" ]; then
     depmod_version=`cat "$kerneldir/include/config/kernel.release"`
+elif [ -e "$kerneldir/.config" ]; then
+    depmod_version=$(awk '/Linux kernel version:/ { print $NF }' \
+        "$kerneldir/.config")
 else
     echo
     echo "Error: kernelversion not found"


[PATCH] kvm: qemu: separate TSC load from kvm_arch_load_regs

2008-12-28 Thread Avi Kivity
From: Marcelo Tosatti mtosa...@redhat.com

kvm_load_registers is a general interface to load registers, and is
used by vmport, gdbstub, etc. The TSC MSR is continually counting, so
it cannot simply be read and written back like the other registers and
MSRs (doing so overwrites the current count).

Introduce kvm_load_tsc and use it for x86's migration CPU load code.

Signed-off-by: Marcelo Tosatti mtosa...@redhat.com
Signed-off-by: Avi Kivity a...@redhat.com

diff --git a/qemu/qemu-kvm-x86.c b/qemu/qemu-kvm-x86.c
index aa36be8..f7ed70b 100644
--- a/qemu/qemu-kvm-x86.c
+++ b/qemu/qemu-kvm-x86.c
@@ -261,7 +261,6 @@ void kvm_arch_load_regs(CPUState *env)
     set_msr_entry(&msrs[n++], MSR_IA32_SYSENTER_EIP, env->sysenter_eip);
     if (kvm_has_msr_star)
	set_msr_entry(&msrs[n++], MSR_STAR,  env->star);
-    set_msr_entry(&msrs[n++], MSR_IA32_TSC, env->tsc);
     set_msr_entry(&msrs[n++], MSR_VM_HSAVE_PA, env->vm_hsave);
 #ifdef TARGET_X86_64
     if (lm_capable_kernel) {
@@ -277,6 +276,18 @@ void kvm_arch_load_regs(CPUState *env)
         perror("kvm_set_msrs FAILED");
 }
 
+void kvm_load_tsc(CPUState *env)
+{
+    int rc;
+    struct kvm_msr_entry msr;
+
+    set_msr_entry(&msr, MSR_IA32_TSC, env->tsc);
+
+    rc = kvm_set_msrs(kvm_context, env->cpu_index, &msr, 1);
+    if (rc == -1)
+        perror("kvm_set_tsc FAILED.\n");
+}
+
+
 void kvm_save_mpstate(CPUState *env)
 {
 #ifdef KVM_CAP_MP_STATE
diff --git a/qemu/qemu-kvm.h b/qemu/qemu-kvm.h
index aec5286..ec27d06 100644
--- a/qemu/qemu-kvm.h
+++ b/qemu/qemu-kvm.h
@@ -159,6 +159,7 @@ int qemu_kvm_has_sync_mmu(void);
 #define qemu_kvm_pit_in_kernel() kvm_pit_in_kernel(kvm_context)
 #define kvm_has_sync_mmu() qemu_kvm_has_sync_mmu()
 void kvm_init_vcpu(CPUState *env);
+void kvm_load_tsc(CPUState *env);
 #else
 #define kvm_enabled() (0)
 #define kvm_nested 0
@@ -168,6 +169,8 @@ void kvm_init_vcpu(CPUState *env);
 #define kvm_load_registers(env) do {} while(0)
 #define kvm_save_registers(env) do {} while(0)
 static inline void kvm_init_vcpu(CPUState *env) { }
+static inline void kvm_load_tsc(CPUState *env) {}
+
 
 #endif
 
diff --git a/qemu/target-i386/machine.c b/qemu/target-i386/machine.c
index a7056c7..71c0751 100644
--- a/qemu/target-i386/machine.c
+++ b/qemu/target-i386/machine.c
@@ -328,6 +328,7 @@ int cpu_load(QEMUFile *f, void *opaque, int version_id)
 }
     qemu_get_be64s(f, &env->tsc);
     kvm_load_registers(env);
+    kvm_load_tsc(env);
     if (version_id >= 5) {
         qemu_get_be32s(f, &env->mp_state);
 kvm_load_mpstate(env);


[PATCH] kvm: bios: revert TSC zeroing

2008-12-28 Thread Avi Kivity
From: Marcelo Tosatti mtosa...@redhat.com

The TSC is zeroed at RESET, and not at SMP initialization.

This prevents the TSCs from going out of sync between vcpus on SMP
guests.

Signed-off-by: Marcelo Tosatti mtosa...@redhat.com
Signed-off-by: Avi Kivity a...@redhat.com

diff --git a/bios/rombios32.c b/bios/rombios32.c
index 321563d..fd6ff71 100755
--- a/bios/rombios32.c
+++ b/bios/rombios32.c
@@ -623,12 +623,6 @@ void smp_probe(void)
 writel(APIC_BASE + APIC_ICR_LOW, 0x000C4500);
    sipi_vector = AP_BOOT_ADDR >> 12;
    writel(APIC_BASE + APIC_ICR_LOW, 0x000C4600 | sipi_vector);
-   asm volatile(
-        "xor %%eax, %%eax \n\t"
-        "xor %%edx, %%edx \n\t"
-        "mov $0x10, %%ecx \n\t"
-        "wrmsr"
-        : : : "eax", "ecx", "edx");
 
 #ifndef BX_QEMU
 delay_ms(10);
diff --git a/bios/rombios32start.S b/bios/rombios32start.S
index 335e3ef..c9bc045 100644
--- a/bios/rombios32start.S
+++ b/bios/rombios32start.S
@@ -50,10 +50,6 @@ smp_ap_boot_code_start:
   cli
   xor %ax, %ax
   mov %ax, %ds
-  xor %eax, %eax
-  xor %edx, %edx
-  mov $0x10, %ecx
-  wrmsr
 
   mov $SMP_MSR_ADDR, %ebx
 11:


[PATCH] KVM: MMU: Add for_each_shadow_entry(), a simpler alternative to walk_shadow()

2008-12-28 Thread Avi Kivity
From: Avi Kivity a...@redhat.com

Using a for_each loop style removes the need to write callbacks and
nasty casts.

Implement walk_shadow() using for_each_shadow_entry().

Signed-off-by: Avi Kivity a...@redhat.com

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 3b86df6..3248a3e 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -150,6 +150,20 @@ struct kvm_shadow_walk {
 u64 addr, u64 *spte, int level);
 };
 
+struct kvm_shadow_walk_iterator {
+   u64 addr;
+   hpa_t shadow_addr;
+   int level;
+   u64 *sptep;
+   unsigned index;
+};
+
+#define for_each_shadow_entry(_vcpu, _addr, _walker)    \
+	for (shadow_walk_init(&(_walker), _vcpu, _addr);	\
+	     shadow_walk_okay(&(_walker));			\
+	     shadow_walk_next(&(_walker)))
+
+
 struct kvm_unsync_walk {
int (*entry) (struct kvm_mmu_page *sp, struct kvm_unsync_walk *walk);
 };
@@ -1254,33 +1268,48 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
return sp;
 }
 
+static void shadow_walk_init(struct kvm_shadow_walk_iterator *iterator,
+			     struct kvm_vcpu *vcpu, u64 addr)
+{
+	iterator->addr = addr;
+	iterator->shadow_addr = vcpu->arch.mmu.root_hpa;
+	iterator->level = vcpu->arch.mmu.shadow_root_level;
+	if (iterator->level == PT32E_ROOT_LEVEL) {
+		iterator->shadow_addr
+			= vcpu->arch.mmu.pae_root[(addr >> 30) & 3];
+		iterator->shadow_addr &= PT64_BASE_ADDR_MASK;
+		--iterator->level;
+		if (!iterator->shadow_addr)
+			iterator->level = 0;
+	}
+}
+
+static bool shadow_walk_okay(struct kvm_shadow_walk_iterator *iterator)
+{
+	if (iterator->level < PT_PAGE_TABLE_LEVEL)
+		return false;
+	iterator->index = SHADOW_PT_INDEX(iterator->addr, iterator->level);
+	iterator->sptep = ((u64 *)__va(iterator->shadow_addr)) + iterator->index;
+	return true;
+}
+
+static void shadow_walk_next(struct kvm_shadow_walk_iterator *iterator)
+{
+	iterator->shadow_addr = *iterator->sptep & PT64_BASE_ADDR_MASK;
+	--iterator->level;
+}
+
 static int walk_shadow(struct kvm_shadow_walk *walker,
   struct kvm_vcpu *vcpu, u64 addr)
 {
-	hpa_t shadow_addr;
-	int level;
+	struct kvm_shadow_walk_iterator iterator;
 	int r;
-	u64 *sptep;
-	unsigned index;
-
-	shadow_addr = vcpu->arch.mmu.root_hpa;
-	level = vcpu->arch.mmu.shadow_root_level;
-	if (level == PT32E_ROOT_LEVEL) {
-		shadow_addr = vcpu->arch.mmu.pae_root[(addr >> 30) & 3];
-		shadow_addr &= PT64_BASE_ADDR_MASK;
-		if (!shadow_addr)
-			return 1;
-		--level;
-	}
 
-	while (level >= PT_PAGE_TABLE_LEVEL) {
-		index = SHADOW_PT_INDEX(addr, level);
-		sptep = ((u64 *)__va(shadow_addr)) + index;
-		r = walker->entry(walker, vcpu, addr, sptep, level);
+	for_each_shadow_entry(vcpu, addr, iterator) {
+		r = walker->entry(walker, vcpu, addr,
+				  iterator.sptep, iterator.level);
 		if (r)
 			return r;
-		shadow_addr = *sptep & PT64_BASE_ADDR_MASK;
-		--level;
 	}
return 0;
 }


[PATCH] KVM: MMU: Use for_each_shadow_entry() in __direct_map()

2008-12-28 Thread Avi Kivity
From: Avi Kivity a...@redhat.com

Eliminating a callback and a useless structure.

Signed-off-by: Avi Kivity a...@redhat.com

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 3248a3e..b4b79b0 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1841,67 +1841,42 @@ static void nonpaging_new_cr3(struct kvm_vcpu *vcpu)
 {
 }
 
-struct direct_shadow_walk {
-   struct kvm_shadow_walk walker;
-   pfn_t pfn;
-   int write;
-   int largepage;
-   int pt_write;
-};
-
-static int direct_map_entry(struct kvm_shadow_walk *_walk,
-   struct kvm_vcpu *vcpu,
-   u64 addr, u64 *sptep, int level)
+static int __direct_map(struct kvm_vcpu *vcpu, gpa_t v, int write,
+   int largepage, gfn_t gfn, pfn_t pfn)
 {
-   struct direct_shadow_walk *walk =
-   container_of(_walk, struct direct_shadow_walk, walker);
+   struct kvm_shadow_walk_iterator iterator;
struct kvm_mmu_page *sp;
+   int pt_write = 0;
gfn_t pseudo_gfn;
-	gfn_t gfn = addr >> PAGE_SHIFT;
-
-	if (level == PT_PAGE_TABLE_LEVEL
-	    || (walk->largepage && level == PT_DIRECTORY_LEVEL)) {
-		mmu_set_spte(vcpu, sptep, ACC_ALL, ACC_ALL,
-			     0, walk->write, 1, &walk->pt_write,
-			     walk->largepage, 0, gfn, walk->pfn, false);
-		++vcpu->stat.pf_fixed;
-		return 1;
-	}
 
-   if (*sptep == shadow_trap_nonpresent_pte) {
-		pseudo_gfn = (addr & PT64_DIR_BASE_ADDR_MASK) >> PAGE_SHIFT;
-   sp = kvm_mmu_get_page(vcpu, pseudo_gfn, (gva_t)addr, level - 1,
- 1, ACC_ALL, sptep);
-   if (!sp) {
-			pgprintk("nonpaging_map: ENOMEM\n");
-			kvm_release_pfn_clean(walk->pfn);
-   return -ENOMEM;
+	for_each_shadow_entry(vcpu, (u64)gfn << PAGE_SHIFT, iterator) {
+		if (iterator.level == PT_PAGE_TABLE_LEVEL
+		    || (largepage && iterator.level == PT_DIRECTORY_LEVEL)) {
+			mmu_set_spte(vcpu, iterator.sptep, ACC_ALL, ACC_ALL,
+				     0, write, 1, &pt_write,
+				     largepage, 0, gfn, pfn, false);
+			++vcpu->stat.pf_fixed;
+			break;
}
 
-		set_shadow_pte(sptep,
-			       __pa(sp->spt)
-			       | PT_PRESENT_MASK | PT_WRITABLE_MASK
-			       | shadow_user_mask | shadow_x_mask);
-   }
-   return 0;
-}
+		if (*iterator.sptep == shadow_trap_nonpresent_pte) {
+			pseudo_gfn = (iterator.addr & PT64_DIR_BASE_ADDR_MASK) >> PAGE_SHIFT;
+   sp = kvm_mmu_get_page(vcpu, pseudo_gfn, iterator.addr,
+ iterator.level - 1,
+ 1, ACC_ALL, iterator.sptep);
+   if (!sp) {
+				pgprintk("nonpaging_map: ENOMEM\n");
+   kvm_release_pfn_clean(pfn);
+   return -ENOMEM;
+   }
 
-static int __direct_map(struct kvm_vcpu *vcpu, gpa_t v, int write,
-   int largepage, gfn_t gfn, pfn_t pfn)
-{
-   int r;
-   struct direct_shadow_walk walker = {
-   .walker = { .entry = direct_map_entry, },
-   .pfn = pfn,
-   .largepage = largepage,
-   .write = write,
-   .pt_write = 0,
-   };
-
-	r = walk_shadow(&walker.walker, vcpu, gfn << PAGE_SHIFT);
-	if (r < 0)
-   return r;
-   return walker.pt_write;
+			set_shadow_pte(iterator.sptep,
+				       __pa(sp->spt)
+				       | PT_PRESENT_MASK | PT_WRITABLE_MASK
+				       | shadow_user_mask | shadow_x_mask);
+   }
+   }
+   return pt_write;
 }
 
 static int nonpaging_map(struct kvm_vcpu *vcpu, gva_t v, int write, gfn_t gfn)


[PATCH] KVM: MMU: Replace walk_shadow() by for_each_shadow_entry() in fetch()

2008-12-28 Thread Avi Kivity
From: Avi Kivity a...@redhat.com

Effectively reverting to the pre walk_shadow() version -- but now
with the reusable for_each().

Signed-off-by: Avi Kivity a...@redhat.com

diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
index 9fd78b6..69c7e33 100644
--- a/arch/x86/kvm/paging_tmpl.h
+++ b/arch/x86/kvm/paging_tmpl.h
@@ -283,91 +283,79 @@ static void FNAME(update_pte)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *page,
 /*
  * Fetch a shadow pte for a specific level in the paging hierarchy.
  */
-static int FNAME(shadow_walk_entry)(struct kvm_shadow_walk *_sw,
-   struct kvm_vcpu *vcpu, u64 addr,
-   u64 *sptep, int level)
+static u64 *FNAME(fetch)(struct kvm_vcpu *vcpu, gva_t addr,
+struct guest_walker *gw,
+int user_fault, int write_fault, int largepage,
+int *ptwrite, pfn_t pfn)
 {
-   struct shadow_walker *sw =
-   container_of(_sw, struct shadow_walker, walker);
-	struct guest_walker *gw = sw->guest_walker;
 	unsigned access = gw->pt_access;
struct kvm_mmu_page *shadow_page;
-   u64 spte;
+   u64 spte, *sptep;
int metaphysical;
gfn_t table_gfn;
int r;
+   int level;
pt_element_t curr_pte;
+   struct kvm_shadow_walk_iterator iterator;
 
-	if (level == PT_PAGE_TABLE_LEVEL
-	    || (sw->largepage && level == PT_DIRECTORY_LEVEL)) {
-		mmu_set_spte(vcpu, sptep, access, gw->pte_access & access,
-			     sw->user_fault, sw->write_fault,
-			     gw->ptes[gw->level-1] & PT_DIRTY_MASK,
-			     sw->ptwrite, sw->largepage,
-			     gw->ptes[gw->level-1] & PT_GLOBAL_MASK,
-			     gw->gfn, sw->pfn, false);
-		sw->sptep = sptep;
-		return 1;
-	}
-
-	if (is_shadow_present_pte(*sptep) && !is_large_pte(*sptep))
-   return 0;
-
-	if (is_large_pte(*sptep)) {
-		set_shadow_pte(sptep, shadow_trap_nonpresent_pte);
-		kvm_flush_remote_tlbs(vcpu->kvm);
-		rmap_remove(vcpu->kvm, sptep);
-	}
+	if (!is_present_pte(gw->ptes[gw->level - 1]))
+   return NULL;
 
-	if (level == PT_DIRECTORY_LEVEL && gw->level == PT_DIRECTORY_LEVEL) {
-		metaphysical = 1;
-		if (!is_dirty_pte(gw->ptes[level - 1]))
-			access &= ~ACC_WRITE_MASK;
-		table_gfn = gpte_to_gfn(gw->ptes[level - 1]);
-   } else {
-   metaphysical = 0;
-		table_gfn = gw->table_gfn[level - 2];
-   }
-   shadow_page = kvm_mmu_get_page(vcpu, table_gfn, (gva_t)addr, level-1,
-  metaphysical, access, sptep);
-   if (!metaphysical) {
-		r = kvm_read_guest_atomic(vcpu->kvm, gw->pte_gpa[level - 2],
-					  &curr_pte, sizeof(curr_pte));
-		if (r || curr_pte != gw->ptes[level - 2]) {
-			kvm_mmu_put_page(shadow_page, sptep);
-			kvm_release_pfn_clean(sw->pfn);
-			sw->sptep = NULL;
-   return 1;
+   for_each_shadow_entry(vcpu, addr, iterator) {
+   level = iterator.level;
+   sptep = iterator.sptep;
+   if (level == PT_PAGE_TABLE_LEVEL
+		    || (largepage && level == PT_DIRECTORY_LEVEL)) {
+			mmu_set_spte(vcpu, sptep, access,
+				     gw->pte_access & access,
+				     user_fault, write_fault,
+				     gw->ptes[gw->level-1] & PT_DIRTY_MASK,
+				     ptwrite, largepage,
+				     gw->ptes[gw->level-1] & PT_GLOBAL_MASK,
+				     gw->gfn, pfn, false);
+   break;
}
-   }
 
-	spte = __pa(shadow_page->spt) | PT_PRESENT_MASK | PT_ACCESSED_MASK
-		| PT_WRITABLE_MASK | PT_USER_MASK;
-	*sptep = spte;
-   return 0;
-}
+		if (is_shadow_present_pte(*sptep) && !is_large_pte(*sptep))
+   continue;
 
-static u64 *FNAME(fetch)(struct kvm_vcpu *vcpu, gva_t addr,
-struct guest_walker *guest_walker,
-int user_fault, int write_fault, int largepage,
-int *ptwrite, pfn_t pfn)
-{
-   struct shadow_walker walker = {
-   .walker = { .entry = FNAME(shadow_walk_entry), },
-   .guest_walker = guest_walker,
-   .user_fault = user_fault,
-   .write_fault = write_fault,
-   .largepage = largepage,
-   .ptwrite = ptwrite,
-   .pfn = pfn,
-   };
+   if (is_large_pte(*sptep)) {
+   

[PATCH] KVM: MMU: Drop walk_shadow()

2008-12-28 Thread Avi Kivity
From: Avi Kivity a...@redhat.com

No longer used.

Signed-off-by: Avi Kivity a...@redhat.com

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index b4b79b0..31ebe69 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -145,11 +145,6 @@ struct kvm_rmap_desc {
struct kvm_rmap_desc *more;
 };
 
-struct kvm_shadow_walk {
-   int (*entry)(struct kvm_shadow_walk *walk, struct kvm_vcpu *vcpu,
-u64 addr, u64 *spte, int level);
-};
-
 struct kvm_shadow_walk_iterator {
u64 addr;
hpa_t shadow_addr;
@@ -1299,21 +1294,6 @@ static void shadow_walk_next(struct kvm_shadow_walk_iterator *iterator)
 	--iterator->level;
 }
 
-static int walk_shadow(struct kvm_shadow_walk *walker,
-  struct kvm_vcpu *vcpu, u64 addr)
-{
-   struct kvm_shadow_walk_iterator iterator;
-   int r;
-
-   for_each_shadow_entry(vcpu, addr, iterator) {
-		r = walker->entry(walker, vcpu, addr,
- iterator.sptep, iterator.level);
-   if (r)
-   return r;
-   }
-   return 0;
-}
-
 static void kvm_mmu_page_unlink_children(struct kvm *kvm,
 struct kvm_mmu_page *sp)
 {


[PATCH] KVM: MMU: Replace walk_shadow() by for_each_shadow_entry() in invlpg()

2008-12-28 Thread Avi Kivity
From: Avi Kivity a...@redhat.com

Signed-off-by: Avi Kivity a...@redhat.com

diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
index 69c7e33..46b68f9 100644
--- a/arch/x86/kvm/paging_tmpl.h
+++ b/arch/x86/kvm/paging_tmpl.h
@@ -25,7 +25,6 @@
 #if PTTYPE == 64
#define pt_element_t u64
#define guest_walker guest_walker64
-   #define shadow_walker shadow_walker64
#define FNAME(name) paging##64_##name
#define PT_BASE_ADDR_MASK PT64_BASE_ADDR_MASK
#define PT_DIR_BASE_ADDR_MASK PT64_DIR_BASE_ADDR_MASK
@@ -42,7 +41,6 @@
 #elif PTTYPE == 32
#define pt_element_t u32
#define guest_walker guest_walker32
-   #define shadow_walker shadow_walker32
#define FNAME(name) paging##32_##name
#define PT_BASE_ADDR_MASK PT32_BASE_ADDR_MASK
#define PT_DIR_BASE_ADDR_MASK PT32_DIR_BASE_ADDR_MASK
@@ -73,18 +71,6 @@ struct guest_walker {
u32 error_code;
 };
 
-struct shadow_walker {
-   struct kvm_shadow_walk walker;
-   struct guest_walker *guest_walker;
-   int user_fault;
-   int write_fault;
-   int largepage;
-   int *ptwrite;
-   pfn_t pfn;
-   u64 *sptep;
-   gpa_t pte_gpa;
-};
-
 static gfn_t gpte_to_gfn(pt_element_t gpte)
 {
	return (gpte & PT_BASE_ADDR_MASK) >> PAGE_SHIFT;
@@ -453,54 +439,52 @@ out_unlock:
return 0;
 }
 
-static int FNAME(shadow_invlpg_entry)(struct kvm_shadow_walk *_sw,
- struct kvm_vcpu *vcpu, u64 addr,
- u64 *sptep, int level)
+static void FNAME(invlpg)(struct kvm_vcpu *vcpu, gva_t gva)
 {
-   struct shadow_walker *sw =
-   container_of(_sw, struct shadow_walker, walker);
+   struct kvm_shadow_walk_iterator iterator;
+   pt_element_t gpte;
+   gpa_t pte_gpa = -1;
+   int level;
+   u64 *sptep;
+
+	spin_lock(&vcpu->kvm->mmu_lock);
 
-   /* FIXME: properly handle invlpg on large guest pages */
-	if (level == PT_PAGE_TABLE_LEVEL ||
-	    ((level == PT_DIRECTORY_LEVEL) && is_large_pte(*sptep))) {
-   struct kvm_mmu_page *sp = page_header(__pa(sptep));
+   for_each_shadow_entry(vcpu, gva, iterator) {
+   level = iterator.level;
+   sptep = iterator.sptep;
 
-		sw->pte_gpa = (sp->gfn << PAGE_SHIFT);
-		sw->pte_gpa += (sptep - sp->spt) * sizeof(pt_element_t);
+   /* FIXME: properly handle invlpg on large guest pages */
+   if (level == PT_PAGE_TABLE_LEVEL ||
+		    ((level == PT_DIRECTORY_LEVEL) && is_large_pte(*sptep))) {
+   struct kvm_mmu_page *sp = page_header(__pa(sptep));
 
-		if (is_shadow_present_pte(*sptep)) {
-			rmap_remove(vcpu->kvm, sptep);
-			if (is_large_pte(*sptep))
-				--vcpu->kvm->stat.lpages;
+			pte_gpa = (sp->gfn << PAGE_SHIFT);
+			pte_gpa += (sptep - sp->spt) * sizeof(pt_element_t);
+
+			if (is_shadow_present_pte(*sptep)) {
+				rmap_remove(vcpu->kvm, sptep);
+				if (is_large_pte(*sptep))
+					--vcpu->kvm->stat.lpages;
+   }
+   set_shadow_pte(sptep, shadow_trap_nonpresent_pte);
+   break;
}
-   set_shadow_pte(sptep, shadow_trap_nonpresent_pte);
-   return 1;
-   }
-   if (!is_shadow_present_pte(*sptep))
-   return 1;
-   return 0;
-}
 
-static void FNAME(invlpg)(struct kvm_vcpu *vcpu, gva_t gva)
-{
-   pt_element_t gpte;
-   struct shadow_walker walker = {
-   .walker = { .entry = FNAME(shadow_invlpg_entry), },
-   .pte_gpa = -1,
-   };
+   if (!is_shadow_present_pte(*sptep))
+   break;
+   }
 
-	spin_lock(&vcpu->kvm->mmu_lock);
-	walk_shadow(&walker.walker, vcpu, gva);
 	spin_unlock(&vcpu->kvm->mmu_lock);
-   if (walker.pte_gpa == -1)
+
+   if (pte_gpa == -1)
return;
-	if (kvm_read_guest_atomic(vcpu->kvm, walker.pte_gpa, &gpte,
+	if (kvm_read_guest_atomic(vcpu->kvm, pte_gpa, &gpte,
 				  sizeof(pt_element_t)))
return;
	if (is_present_pte(gpte) && (gpte & PT_ACCESSED_MASK)) {
if (mmu_topup_memory_caches(vcpu))
return;
-		kvm_mmu_pte_write(vcpu, walker.pte_gpa, (const u8 *)&gpte,
+		kvm_mmu_pte_write(vcpu, pte_gpa, (const u8 *)&gpte,
  sizeof(pt_element_t), 0);
}
 }
@@ -607,7 +591,6 @@ static int FNAME(sync_page)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
 
 #undef pt_element_t
 #undef guest_walker
-#undef shadow_walker
 #undef FNAME
 #undef PT_BASE_ADDR_MASK
 #undef PT_INDEX

[PATCH] KVM: VMX: initialize TSC offset relative to vm creation time

2008-12-28 Thread Avi Kivity
From: Marcelo Tosatti mtosa...@redhat.com

VMX initializes the TSC offset for each vcpu at different times, and
also reinitializes it for vcpus other than 0 on APIC SIPI message.

This bug makes the TSCs appear unsynchronized in the guest even when
the host's TSCs are synchronized.

Older Linux kernels don't handle the situation very well, so
gettimeofday is likely to go backwards in time:

http://www.mail-archive.com/k...@vger.kernel.org/msg02955.html
http://sourceforge.net/tracker/index.php?func=detail&aid=2025534&group_id=180599&atid=893831

Fix it by initializing the offset of each vcpu relative to vm creation
time, and moving it from vmx_vcpu_reset to vmx_vcpu_setup, out of the
APIC MP init path.

Signed-off-by: Marcelo Tosatti mtosa...@redhat.com
Signed-off-by: Avi Kivity a...@redhat.com

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index ab8ef1d..af00b8c 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -398,6 +398,7 @@ struct kvm_arch{
 
unsigned long irq_sources_bitmap;
unsigned long irq_states[KVM_IOAPIC_NUM_PINS];
+   u64 vm_init_tsc;
 };
 
 struct kvm_vm_stat {
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index ffd8f24..615447f 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -865,11 +865,8 @@ static u64 guest_read_tsc(void)
  * writes 'guest_tsc' into guest's timestamp counter register
  * guest_tsc = host_tsc + tsc_offset ==> tsc_offset = guest_tsc - host_tsc
  */
-static void guest_write_tsc(u64 guest_tsc)
+static void guest_write_tsc(u64 guest_tsc, u64 host_tsc)
 {
-   u64 host_tsc;
-
-   rdtscll(host_tsc);
vmcs_write64(TSC_OFFSET, guest_tsc - host_tsc);
 }
 
@@ -933,6 +930,7 @@ static int vmx_set_msr(struct kvm_vcpu *vcpu, u32 msr_index, u64 data)
 {
struct vcpu_vmx *vmx = to_vmx(vcpu);
struct kvm_msr_entry *msr;
+   u64 host_tsc;
int ret = 0;
 
switch (msr_index) {
@@ -958,7 +956,8 @@ static int vmx_set_msr(struct kvm_vcpu *vcpu, u32 msr_index, u64 data)
vmcs_writel(GUEST_SYSENTER_ESP, data);
break;
case MSR_IA32_TIME_STAMP_COUNTER:
-   guest_write_tsc(data);
+   rdtscll(host_tsc);
+   guest_write_tsc(data, host_tsc);
break;
case MSR_P6_PERFCTR0:
case MSR_P6_PERFCTR1:
@@ -2108,7 +2107,7 @@ static int vmx_vcpu_setup(struct vcpu_vmx *vmx)
 {
u32 host_sysenter_cs, msr_low, msr_high;
u32 junk;
-   u64 host_pat;
+   u64 host_pat, tsc_this, tsc_base;
unsigned long a;
struct descriptor_table dt;
int i;
@@ -2236,6 +2235,12 @@ static int vmx_vcpu_setup(struct vcpu_vmx *vmx)
vmcs_writel(CR0_GUEST_HOST_MASK, ~0UL);
vmcs_writel(CR4_GUEST_HOST_MASK, KVM_GUEST_CR4_MASK);
 
+	tsc_base = vmx->vcpu.kvm->arch.vm_init_tsc;
+	rdtscll(tsc_this);
+	if (tsc_this < vmx->vcpu.kvm->arch.vm_init_tsc)
+		tsc_base = tsc_this;
+
+   guest_write_tsc(0, tsc_base);
 
return 0;
 }
@@ -2327,8 +2332,6 @@ static int vmx_vcpu_reset(struct kvm_vcpu *vcpu)
vmcs_write32(GUEST_INTERRUPTIBILITY_INFO, 0);
vmcs_write32(GUEST_PENDING_DBG_EXCEPTIONS, 0);
 
-   guest_write_tsc(0);
-
/* Special registers */
vmcs_write64(GUEST_IA32_DEBUGCTL, 0);
 
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index aa4575c..36aa576 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -4167,6 +4167,8 @@ struct  kvm *kvm_arch_create_vm(void)
/* Reserve bit 0 of irq_sources_bitmap for userspace irq source */
	set_bit(KVM_USERSPACE_IRQ_SOURCE_ID, &kvm->arch.irq_sources_bitmap);
 
+	rdtscll(&kvm->arch.vm_init_tsc);
+
return kvm;
 }
 


Re: [PATCH 0/5] for_each_shadow_entry

2008-12-28 Thread Avi Kivity

Marcelo Tosatti wrote:
> On Thu, Dec 25, 2008 at 03:23:34PM +0200, Avi Kivity wrote:
>> This patchset replaces walk_shadow(), which calls a callback for each
>> shadow pte that maps a guest virtual address, by an equivalent for_each
>> style construct.  Benefits are fewer thunks and smaller code.
>>
>> Please review.
>
> Looks good.

Merged; thanks for the review.

--
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Weekly KVM Test report, kernel 4f27e3 ... userspace 6b3523 ...

2008-12-28 Thread Xu, Jiajun
Hi All,

This is our Weekly KVM Testing Report against the latest kvm.git
4f27e3e32b71dd1beff3ace80cce5fb48887a8be and kvm-userspace.git
6b3523a08a01c1411b33690d66cf29920b0d09ce.
The SMP Vista booting issue is fixed. A 32e guest was found to crash
during live migration.

One New issue:

1. 32e Guest crash after Live Migration
https://sourceforge.net/tracker/index.php?func=detail&aid=2458020&group_id=180599&atid=893831


One Fixed issue:

1. SMP PAE Vista guest crash with inject_page_fault
https://sourceforge.net/tracker/index.php?func=detail&aid=2443886&group_id=180599&atid=893831


Five Old Issues:

1. 32bits Rhel5/FC6 guest may fail to reboot after installation
https://sourceforge.net/tracker/?func=detail&atid=893831&aid=1991647&group_id=180599

2. failure to migrate guests with more than 4GB of RAM
https://sourceforge.net/tracker/index.php?func=detail&aid=1971512&group_id=180599&atid=893831

3. OpenSuse10.2 can not be installed
http://sourceforge.net/tracker/index.php?func=detail&aid=2088475&group_id=180599&atid=893831

4. Fail to Save Restore Guest
https://sourceforge.net/tracker/?func=detail&atid=893831&aid=2175042&group_id=180599

5. hotplug inexistent device will kill guest
https://sourceforge.net/tracker/index.php?func=detail&aid=2432316&group_id=180599&atid=893831


Test environment

Platform  A
Stoakley/Clovertown
CPU 4
Memory size 8G

Report Summary on IA32-pae
   Summary Test Report of Last Session
=
Total   PassFailNoResult   Crash
=
control_panel   8   7   1 00
gtest   16  16  0 00
=
control_panel   8   7   1 00
 :KVM_256M_guest_PAE_gPAE   1   1   0 00
 :KVM_linux_win_PAE_gPAE1   1   0 00
 :KVM_two_winxp_PAE_gPAE1   1   0 00
 :KVM_four_sguest_PAE_gPA   1   1   0 00
 :KVM_1500M_guest_PAE_gPA   1   1   0 00
 :KVM_LM_Continuity_PAE_g   1   1   0 00
 :KVM_LM_SMP_PAE_gPAE   1   1   0 00
 :KVM_SR_Continuity_PAE_g   1   0   1 00
gtest   16  16  0 00
 :ltp_nightly_PAE_gPAE  1   1   0 00
 :boot_up_acpi_PAE_gPAE 1   1   0 00
 :boot_up_acpi_xp_PAE_gPA   1   1   0 00
 :boot_up_vista_PAE_gPAE1   1   0 00
 :reboot_xp_PAE_gPAE1   1   0 00
 :boot_base_kernel_PAE_gP   1   1   0 00
 :boot_up_acpi_win2k3_PAE   1   1   0 00
 :boot_smp_acpi_win2k3_PA   1   1   0 00
 :boot_smp_acpi_win2k_PAE   1   1   0 00
 :boot_up_win2008_PAE_gPA   1   1   0 00
 :boot_up_acpi_win2k_PAE_   1   1   0 00
 :boot_smp_acpi_xp_PAE_gP   1   1   0 00
 :boot_up_noacpi_win2k_PA   1   1   0 00
 :boot_smp_vista_PAE_gPAE   1   1   0 00
 :boot_smp_win2008_PAE_gP   1   1   0 00
 :kb_nightly_PAE_gPAE   1   1   0 00
=
Total   24  23  1 00

Report Summary on IA32e
   Summary Test Report of Last Session
=
Total   PassFailNoResult   Crash
=
control_panel   17  12  5 00
gtest   23  23  0 00
=
control_panel   17  12  5 00
 :KVM_4G_guest_64_g32e  1   1   0 00
 :KVM_four_sguest_64_gPAE   1   1   0 00
 :KVM_LM_SMP_64_g32e1   0   1 00
 :KVM_linux_win_64_gPAE 1   1   0 00
 :KVM_LM_SMP_64_gPAE1   1   0 00
 :KVM_SR_Continuity_64_gP   1   0   1 00
 :KVM_four_sguest_64_g32e   1   1   0 00
 :KVM_four_dguest_64_gPAE   1   1   0 00
 :KVM_SR_SMP_64_gPAE1   0   1 00
 

Re: [patch 3/3] KVM: VMX: initialize TSC offset relative to vm creation time

2008-12-28 Thread Avi Kivity

Marcelo Tosatti wrote:
> Ok, this could cause the guest tsc to be initialized to a high value
> close to wraparound (in case the vcpu is migrated to a cpu with negative
> difference before vmx_vcpu_setup).

Applied, thanks.



Re: [patch 0/3] synchronized TSC between vcpu's on SMP guests

2008-12-28 Thread Avi Kivity

Marcelo Tosatti wrote:
> Most Intel hosts are supposed to have their TSCs synchronized. This
> patchset attempts to fix the sites which overwrite the TSC, making them
> appear unsynchronized to the guest.

Applied the userspace bits as well.  I dropped qemu-kvm-x86.h; these
files are going away, so we shouldn't add more.




kvm-82: much improved guest debugging

2008-12-28 Thread duck

The changelog from kvm-81 to kvm-82 says:

- much improved guest debugging (Jan Kiszka)
  - both debugger in guest and debugger in host

I haven't tested it much, but I can confirm that debugging tricks which
didn't work with kvm-81 and earlier now work fine. Malware which wouldn't
even run now launches inside KVM, allowing controlled execution for 
deductive purposes -- letting the malware itself spill the beans about 
what it's programmed to do. Exploits which were painful to track in action 
are now easily controlled via a userland debugger.

This is a fantastic addition, not least because it means that KVM fans can 
now evangelise much more enthusiastically amongst the security/reversing 
community, which uses virtualisation heavily for research purposes.

VMware has an almost total monopoly in this segment of the population
(except for Microsoft researchers appearing in public at conferences, of 
course [smiley]), not just because its brand has '200kg gorilla' status, 
but also because it has long and fairly correctly supported hardware 
debugging inside guests. (Without this support, exploit and malware 
analysis is seriously hampered -- see above.)

So thanks and well done to Jan for getting these changes on board. 

This opens up the security research community to a world beyond EMC, 
giving additional choice of a genuinely free virtualisation solution with 
the additional bonus of having great pedagogical value thanks to the 
source code.

(Next on wish list: reliable Windows kernel debugging via serial port :/)

Yours sincerely,

Very Happy Customer.


Re: [PATCH 4/8] KVM: Add a route layer to convert MSI message to GSI

2008-12-28 Thread Sheng Yang
On Sat, Dec 27, 2008 at 05:27:25PM -0200, Marcelo Tosatti wrote:
 +static int kvm_vm_ioctl_request_gsi_msg(struct kvm *kvm,
 + struct kvm_assigned_gsi_msg 
 *agsi_msg)
 +{
 + struct kvm_gsi_msg gsi_msg;
 + int r;
 +
 +	gsi_msg.gsi = agsi_msg->gsi;
 +	gsi_msg.msg.address_lo = agsi_msg->msg.addr_lo;
 +	gsi_msg.msg.address_hi = agsi_msg->msg.addr_hi;
 +	gsi_msg.msg.data = agsi_msg->msg.data;
 +
 + r = kvm_update_gsi_msg(kvm, gsi_msg);
 + if (r == 0)
 + agsi_msg-gsi = gsi_msg.gsi;
 + return r;
 +}
   
Can't see the purpose of this function. Why preserve the user-passed GSI
value in case of failure? It will return an error anyway...
  
  Oh, we preserved the user-passed GSI in case userspace wants to update one
  entry. :)
 
 The structure is not copied back to userspace in case of failure anyway:
 
 +	r = kvm_vm_ioctl_request_gsi_msg(kvm, &agsi_msg);
 +	if (r)
 +		goto out;
 +	r = -EFAULT;
 +	if (copy_to_user(argp, &agsi_msg, sizeof agsi_msg))
 +		goto out;
 
 So I don't see the point of doing this?
 

Oh, Marcelo, finally I got your meaning. I misunderstood you the first time.
And now I think you just mean: why is the if (r == 0) there? It's unnecessary!

Yes, you are right. Call it a little habit; the preserved value is useless,
and maybe I just wanted to indicate what the legal return value is here. And
on the other hand, if (r != 0), the value should be meaningless and assigning
it to another variable may not be proper.

--
regards
Yang, Sheng


Re: [PATCH 15/15] KVM: Fix racy in kvm_free_assigned_irq

2008-12-28 Thread Sheng Yang
On Sat, Dec 27, 2008 at 06:06:26PM -0200, Marcelo Tosatti wrote:
 On Fri, Dec 26, 2008 at 10:30:07AM +0800, Sheng Yang wrote:
  Thanks to Marcelo's observation, the following code has a potential issue:
  
  if (cancel_work_sync(&assigned_dev->interrupt_work))
  	kvm_put_kvm(kvm);
  
  In fact, cancel_work_sync() would return true whether the work struct was
  only scheduled or its callback was already executed. This code only
  considers the former situation.
 
 Why not simply drop the reference inc / dec from irq handler/work
 function?

Sorry, I don't know the answer. After checking the code, I also think it's a
little strange to increase the reference count here, and I don't think we
expect the work handler to release the kvm struct.

Maybe Avi knows? Or Amit and Weidong?

--
regards
Yang, Sheng


Re: [PATCH 15/15] KVM: Fix racy in kvm_free_assigned_irq

2008-12-28 Thread Avi Kivity

Sheng Yang wrote:

if (cancel_work_sync(&assigned_dev->interrupt_work))
	kvm_put_kvm(kvm);

In fact, cancel_work_sync() would return true whether the work struct was
only scheduled or its callback was already executed. This code only
considers the former situation.
  

Why not simply drop the reference inc / dec from irq handler/work
function?



Sorry, I don't know the answer. After checking the code, I also think it's a
little strange to increase the reference count here, and I don't think we
expect the work handler to release the kvm struct.

Maybe Avi knows? Or Amit and Weidong?
  


Not sure what the reasoning was, but it does seem like reference 
counting can be safely dropped from interrupt_work.


--
error compiling committee.c: too many arguments to function



Serial ATA Support - will it come?

2008-12-28 Thread Andy B.
Hello List,

I was planning to introduce kvm in our server farm to give customers the
possibility to load their OS into kvm in case they did something wrong and
want to fix that.

We have several hundred servers with different Linux distros, but always the
same hardware and kernel. Our hard drives are SATA and use the AHCI driver.
These disks are named /dev/sdXX in the OS, and this is where kvm is failing
on me. It seems that our disks are recognized as IDE drives and named
/dev/hdXX when booting the OS into KVM, thus giving me some classic kernel
panics.

My initial line is quite straightforward:

kvm -hda /dev/sda -boot c -net nic,model=rtl8139 -vnc :1

I was told to try SCSI emulation, but this didn't help either, as no disk
was found at all:

kvm -drive file=/dev/sda,if=scsi,boot=on -boot c -net nic,model=rtl8139 -vnc
:1


With /dev/hdaX I cannot really boot into kvm without changing a significant
number of files per server, and changing almost 1000 servers is not a very
sane option :-)


Has SATA support, especially the generic AHCI devices, ever been considered
to be included into kvm? Would it be a lot of work to add this emulation
into kvm? I certainly don't want to ask for impossible things, but I would
just like to know if in the near future I will be able to boot my images
without changing them. Maybe there is a hidden solution that I did not
find, if so, please let me know.

Thanks for your time.


Andy




Re: [Qemu-devel] Fix compilation with -Werror=format-security

2008-12-28 Thread Blue Swirl
On 12/27/08, Lennert Buytenhek buyt...@wantstofly.org wrote:
 On Sat, Dec 27, 2008 at 07:52:39PM +, Frederik Himpe wrote:

   Mandriva is now using the -Werror=format-security CFLAG by default. To
   make kvm 82 compile with this option, I had to apply this patch:
   
 http://svn.mandriva.com/cgi-bin/viewvc.cgi/packages/cooker/kvm/current/SOURCES/kvm-82-format-string.patch?view=co


 diff -ur kvm-82.orig/qemu/monitor.c kvm-82/qemu/monitor.c
 --- kvm-82.orig/qemu/monitor.c  2008-12-24 15:24:58.0 +0100
 +++ kvm-82/qemu/monitor.c   2008-12-27 20:19:32.0 +0100
 @@ -1975,7 +1975,7 @@

  static void expr_error(const char *fmt)
  {
 -term_printf(fmt);
 +term_printf("%s", fmt);
  term_printf("\n");
  longjmp(expr_env, 1);
  }

 Why not make this term_printf("%s\n", fmt); while you're at it?

Thanks, I combined these to form r6134. As fmt no longer describes
the parameter, I changed that to msg.


Re: [PATCH 0/3] Remove interrupt stack table usage from x86_64 kernel

2008-12-28 Thread Avi Kivity

(restoring cc list)

Andi Kleen wrote:

One of the other problems: NMIs and MCEs have the same problem with SYSCALL
  


This one, however, looks unsolvable.  Userspace can point %rsp into 
arbitrary memory, issue a syscall, and hope for an NMI.  Since we're in 
cpl 0 and are not using IST, the processor will not switch stacks, and 
the NMI stack frame will corrupt any memory the user chose to point to.


Even without a malicious user, %rsp could legitimately point at unmapped 
memory.


I don't see how syscall could work on i386, and indeed:


vdso32.so-$(VDSO32-y)		+= int80
vdso32.so-$(CONFIG_COMPAT)	+= syscall
vdso32.so-$(VDSO32-y)		+= sysenter


It's disabled.  Is that the reason?

--
error compiling committee.c: too many arguments to function



Re: gettimeofday slow in RHEL4 guests

2008-12-28 Thread Marcelo Tosatti
On Tue, Nov 25, 2008 at 01:52:59PM +0100, Andi Kleen wrote:
  But yeah - the remapping of HPET timers to virtual HPET timers sounds  
  pretty tough. I wonder if one could overcome that with a little  
  hardware support though ...
 
 For gettimeofday better make TSC work. Even in the best case (no 
 virtualization) it is much faster than HPET because it sits in the CPU,
 while HPET is far away on the external south bridge.

The tsc clock on older Linux 2.6 kernels compensates for lost ticks.
The algorithm uses the PIT count (latched) to measure the delay between
interrupt generation and handling, and sums that value, on the next
interrupt, to the TSC delta.

Sheng investigated this problem in the discussions before in-kernel PIT
was merged:

http://www.mail-archive.com/kvm-de...@lists.sourceforge.net/msg13873.html

The algorithm overcompensates for lost ticks, and the guest time runs
faster than the host's.

There are two issues:

1) A bug in the in-kernel PIT which miscalculates the count value.

2) For the case where more than one interrupt is lost, and later
reinjected, the value read from PIT count is meaningless for the purpose
of the tsc algorithm. The count is interpreted as the delay until the
next interrupt, which is not the case with reinjection.

As Sheng mentioned in the thread above, Xen pulls back the TSC value
when reinjecting interrupts. VMWare ESX has a notion of virtual TSC,
which I believe is similar in this context.

For KVM I believe the best immediate solution (for now) is to provide an
option to disable reinjection, behaving similarly to real hardware. The
advantage is simplicity compared to virtualizing the time sources.

The QEMU PIT emulation has a limit on the rate of interrupt reinjection,
perhaps something similar should be investigated in the future.

The following patch (which contains the bugfix for 1) and disabled
reinjection) fixes the severe time drift on RHEL4 with clock=tsc.
What I'm proposing is to condition reinjection with an option
(-kvm-pit-no-reinject or something).

Comments or better ideas?


diff --git a/arch/x86/kvm/i8254.c b/arch/x86/kvm/i8254.c
index e665d1c..608af7b 100644
--- a/arch/x86/kvm/i8254.c
+++ b/arch/x86/kvm/i8254.c
@@ -201,13 +201,16 @@ static int __pit_timer_fn(struct kvm_kpit_state *ps)
 	if (!atomic_inc_and_test(&pt->pending))
 		set_bit(KVM_REQ_PENDING_TIMER, &vcpu0->requests);
 
+	if (atomic_read(&pt->pending) > 1)
+		atomic_set(&pt->pending, 1);
+
 	if (vcpu0 && waitqueue_active(&vcpu0->wq))
 		wake_up_interruptible(&vcpu0->wq);
 
 	hrtimer_add_expires_ns(&pt->timer, pt->period);
 	pt->scheduled = hrtimer_get_expires_ns(&pt->timer);
 	if (pt->period)
-		ps->channels[0].count_load_time = hrtimer_get_expires(&pt->timer);
+		ps->channels[0].count_load_time = ktime_get();
 
 	return (pt->period == 0 ? 0 : 1);
 }


Re: [PATCH 0/3] Remove interrupt stack table usage from x86_64 kernel

2008-12-28 Thread Andi Kleen
On Sun, Dec 28, 2008 at 04:09:26PM +0200, Avi Kivity wrote:
 I don't see how syscall could work on i386, and indeed:

i386 has task gates which support unconditional stack switching. But there 
are no 64bit task gates, just ISTs.

BTW I think there are more similar problems in your patch too.

 
 vdso32.so-$(VDSO32-y)+= int80
 vdso32.so-$(CONFIG_COMPAT)+= syscall
 vdso32.so-$(VDSO32-y)+= sysenter
 
 It's disabled.  Is that the reason?

No.  All interesting 32bit CPUs have SYSENTER; the only one that has SYSCALL 
but no SYSENTER is the K6, but it has a weird early variant of SYSCALL with
more problems that was never worth supporting.

-Andi
-- 
a...@linux.intel.com


Re: [PATCH 0/3] Remove interrupt stack table usage from x86_64 kernel

2008-12-28 Thread Avi Kivity

Andi Kleen wrote:

On Sun, Dec 28, 2008 at 04:09:26PM +0200, Avi Kivity wrote:
  

I don't see how syscall could work on i386, and indeed:



i386 has task gates which support unconditional stack switching. But there 
are no 64bit task gates, just ISTs.


  


i386 is not that interesting to me (and probably task switching couldn't 
be made to work well with guest state in TR).



BTW I think there are more similar problems in your patch too.
  


One fatal problem is enough -- I don't think that patch can be made to 
work.  Pity, since it did clean up some stuff.


I would like however to speed up kvm.  Here's a plan:

1. Add per-cpu IDT
2. When switching to the guest TR (and other state), switch off IST use 
in the current IDT

3. When switching away from the kvm task, restore the IST entries

per-cpu IDT would cost around 4K per cpu.  I propose to make it 
kconfigurable, and have kvm select it.


Ingo, does this sound workable?  It increases complexity rather than 
decreasing it as the previous solution did, but I don't see any way to 
drop the use of IST, as SYSCALL cannot work without IST if NMIs are enabled.


--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.



Re: [PATCH 0/3] Remove interrupt stack table usage from x86_64 kernel

2008-12-28 Thread Andi Kleen
 One fatal problem is enough -- I don't think that patch can be made to 
 work.  Pity, since it did clean up some stuff.

Not sure that was true anyways.

 I would like however to speed up kvm.  Here's a plan:
 
 1. Add per-cpu IDT

You don't need that, do you?  Just two sets.

 2. When switching to the guest TR (and other state), switch off IST use 
 in the current IDT
 3. When switching away from the kvm task, restore the IST entries
 
 per-cpu IDT would cost around 4K per cpu.  I propose to make it 
 kconfigurable, and have kvm select it.

If anything please make it runtime switchable and disable it on Intel
CPUs.

-Andi
-- 
a...@linux.intel.com


Re: [PATCH 0/3] Remove interrupt stack table usage from x86_64 kernel

2008-12-28 Thread Avi Kivity

Avi Kivity wrote:

1. Add per-cpu IDT


Or we could have just two IDTs - one with IST and one without.  I 
clocked LIDT at 58 cycles (and we need two per heavyweight switch), so 
it's not that wonderful.


--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.



Re: [PATCH 0/3] Remove interrupt stack table usage from x86_64 kernel

2008-12-28 Thread Avi Kivity

Avi Kivity wrote:

Avi Kivity wrote:

1. Add per-cpu IDT


Or we could have just two IDTs - one with IST and one without.  I 
clocked LIDT at 58 cycles (and we need two per heavyweight switch), so 
it's not that wonderful.


This makes the whole thing unworthwhile.  The vmload/vmsave pair costs 
only 200 cycles (I should have started with this), and 120 cycles on the 
heavyweight path plus complexity are not worth 200 cycles on the 
lightweight path.


--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.



Re: [PATCH 0/3] Remove interrupt stack table usage from x86_64 kernel

2008-12-28 Thread Andi Kleen
On Sun, Dec 28, 2008 at 10:08:35PM +0200, Avi Kivity wrote:
 Avi Kivity wrote:
 Avi Kivity wrote:
 1. Add per-cpu IDT
 
 Or we could have just two IDTs - one with IST and one without.  I 
 clocked LIDT at 58 cycles (and we need two per heavyweight switch), so 
 it's not that wonderful.
 
 This makes the whole thing unworthwhile.  The vmload/vmsave pair costs 
 only 200 cycles (I should have started with this), and 120 cycles on the 
 heavyweight path plus complexity are not worth 200 cycles on the 
 lightweight path.

Actually to switch ISTs you need to change the TSS, not the IDT.
But I suppose that won't be any faster.

-Andi
-- 
a...@linux.intel.com


Re: [PATCH 0/3] Remove interrupt stack table usage from x86_64 kernel

2008-12-28 Thread Avi Kivity

Andi Kleen wrote:
This makes the whole thing unworthwhile.  The vmload/vmsave pair costs 
only 200 cycles (I should have started with this), and 120 cycles on the 
heavyweight path plus complexity are not worth 200 cycles on the 
lightweight path.



Actually to switch ISTs you need to change the TSS, not the IDT.
But I suppose that won't be any faster.
  


I can't touch the TSS (that's the starting point of the exercise).  The 
plan was to have a copy of the IDT with all IST pointers zeroed out (or 
to have per-cpu IDT and zero out the IST pointers when entering guest 
mode, restoring them on context switch).


It's not worth it though.

--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.



[PATCH 0/2] remove kvm vmap usage

2008-12-28 Thread Izik Eidus
Remove the vmap usage from kvm, this is needed both for ksm and
get_user_pages != write.


[PATCH 1/2] KVM: introducing kvm_read_guest_virt, kvm_write_guest_virt.

2008-12-28 Thread Izik Eidus
This commit renames emulator_read_std to kvm_read_guest_virt and adds a
new function, kvm_write_guest_virt, that allows writing to a guest
virtual address.

Signed-off-by: Izik Eidus iei...@redhat.com
---
 arch/x86/include/asm/kvm_host.h |4 ---
 arch/x86/kvm/x86.c  |   56 +-
 2 files changed, 42 insertions(+), 18 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index ab8ef1d..a129700 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -608,10 +608,6 @@ void kvm_inject_nmi(struct kvm_vcpu *vcpu);
 
 void fx_init(struct kvm_vcpu *vcpu);
 
-int emulator_read_std(unsigned long addr,
- void *val,
- unsigned int bytes,
- struct kvm_vcpu *vcpu);
 int emulator_write_emulated(unsigned long addr,
const void *val,
unsigned int bytes,
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index aa4575c..c812209 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1973,10 +1973,8 @@ static struct kvm_io_device *vcpu_find_mmio_dev(struct kvm_vcpu *vcpu,
return dev;
 }
 
-int emulator_read_std(unsigned long addr,
-void *val,
-unsigned int bytes,
-struct kvm_vcpu *vcpu)
+int kvm_read_guest_virt(gva_t addr, void *val, unsigned int bytes,
+   struct kvm_vcpu *vcpu)
 {
void *data = val;
int r = X86EMUL_CONTINUE;
@@ -1984,27 +1982,57 @@ int emulator_read_std(unsigned long addr,
 	while (bytes) {
 		gpa_t gpa = vcpu->arch.mmu.gva_to_gpa(vcpu, addr);
 		unsigned offset = addr & (PAGE_SIZE-1);
-		unsigned tocopy = min(bytes, (unsigned)PAGE_SIZE - offset);
+		unsigned toread = min(bytes, (unsigned)PAGE_SIZE - offset);
 		int ret;
 
 		if (gpa == UNMAPPED_GVA) {
 			r = X86EMUL_PROPAGATE_FAULT;
 			goto out;
 		}
-		ret = kvm_read_guest(vcpu->kvm, gpa, data, tocopy);
+		ret = kvm_read_guest(vcpu->kvm, gpa, data, toread);
 		if (ret < 0) {
 			r = X86EMUL_UNHANDLEABLE;
 			goto out;
 		}
 
-		bytes -= tocopy;
-		data += tocopy;
-		addr += tocopy;
+		bytes -= toread;
+		data += toread;
+		addr += toread;
 	}
 out:
 	return r;
 }
-EXPORT_SYMBOL_GPL(emulator_read_std);
+
+int kvm_write_guest_virt(gva_t addr, void *val, unsigned int bytes,
+struct kvm_vcpu *vcpu)
+{
+   void *data = val;
+   int r = X86EMUL_CONTINUE;
+
+	while (bytes) {
+		gpa_t gpa = vcpu->arch.mmu.gva_to_gpa(vcpu, addr);
+		unsigned offset = addr & (PAGE_SIZE-1);
+		unsigned towrite = min(bytes, (unsigned)PAGE_SIZE - offset);
+		int ret;
+
+		if (gpa == UNMAPPED_GVA) {
+			r = X86EMUL_PROPAGATE_FAULT;
+			goto out;
+		}
+		ret = kvm_write_guest(vcpu->kvm, gpa, data, towrite);
+		if (ret < 0) {
+   r = X86EMUL_UNHANDLEABLE;
+   goto out;
+   }
+
+   bytes -= towrite;
+   data += towrite;
+   addr += towrite;
+   }
+out:
+   return r;
+}
+
 
 static int emulator_read_emulated(unsigned long addr,
  void *val,
@@ -2026,8 +2054,8 @@ static int emulator_read_emulated(unsigned long addr,
 	if ((gpa & PAGE_MASK) == APIC_DEFAULT_PHYS_BASE)
goto mmio;
 
-   if (emulator_read_std(addr, val, bytes, vcpu)
-   == X86EMUL_CONTINUE)
+   if (kvm_read_guest_virt(addr, val, bytes, vcpu)
+   == X86EMUL_CONTINUE)
return X86EMUL_CONTINUE;
if (gpa == UNMAPPED_GVA)
return X86EMUL_PROPAGATE_FAULT;
@@ -2230,7 +2258,7 @@ void kvm_report_emulation_failure(struct kvm_vcpu *vcpu, const char *context)
 
rip_linear = rip + get_segment_base(vcpu, VCPU_SREG_CS);
 
-   emulator_read_std(rip_linear, (void *)opcodes, 4, vcpu);
+   kvm_read_guest_virt(rip_linear, (void *)opcodes, 4, vcpu);
 
 	printk(KERN_ERR "emulation failed (%s) rip %lx %02x %02x %02x %02x\n",
 	       context, rip, opcodes[0], opcodes[1], opcodes[2], opcodes[3]);
@@ -2238,7 +2266,7 @@ void kvm_report_emulation_failure(struct kvm_vcpu *vcpu, const char *context)
 EXPORT_SYMBOL_GPL(kvm_report_emulation_failure);
 
 static struct x86_emulate_ops emulate_ops = {
-   .read_std= emulator_read_std,
+   .read_std= kvm_read_guest_virt,
.read_emulated   = 

[PATCH 2/2] KVM: remove the vmap usage

2008-12-28 Thread Izik Eidus
Signed-off-by: Izik Eidus iei...@redhat.com
---
 arch/x86/kvm/x86.c|   62 +---
 include/linux/kvm_types.h |3 +-
 2 files changed, 14 insertions(+), 51 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index c812209..29df564 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2368,40 +2368,19 @@ int emulate_instruction(struct kvm_vcpu *vcpu,
 }
 EXPORT_SYMBOL_GPL(emulate_instruction);
 
-static void free_pio_guest_pages(struct kvm_vcpu *vcpu)
-{
-	int i;
-
-	for (i = 0; i < ARRAY_SIZE(vcpu->arch.pio.guest_pages); ++i)
-		if (vcpu->arch.pio.guest_pages[i]) {
-			kvm_release_page_dirty(vcpu->arch.pio.guest_pages[i]);
-			vcpu->arch.pio.guest_pages[i] = NULL;
-		}
-}
-
 static int pio_copy_data(struct kvm_vcpu *vcpu)
 {
 	void *p = vcpu->arch.pio_data;
-	void *q;
+	gva_t q = vcpu->arch.pio.guest_gva;
 	unsigned bytes;
-	int nr_pages = vcpu->arch.pio.guest_pages[1] ? 2 : 1;
+	int ret;
 
-	q = vmap(vcpu->arch.pio.guest_pages, nr_pages, VM_READ|VM_WRITE,
-		 PAGE_KERNEL);
-	if (!q) {
-		free_pio_guest_pages(vcpu);
-		return -ENOMEM;
-	}
-	q += vcpu->arch.pio.guest_page_offset;
 	bytes = vcpu->arch.pio.size * vcpu->arch.pio.cur_count;
 	if (vcpu->arch.pio.in)
-		memcpy(q, p, bytes);
+		ret = kvm_write_guest_virt(q, p, bytes, vcpu);
 	else
-		memcpy(p, q, bytes);
-	q -= vcpu->arch.pio.guest_page_offset;
-	vunmap(q);
-	free_pio_guest_pages(vcpu);
-	return 0;
+		ret = kvm_read_guest_virt(q, p, bytes, vcpu);
+	return ret;
 }
 
 int complete_pio(struct kvm_vcpu *vcpu)
@@ -2512,7 +2491,6 @@ int kvm_emulate_pio(struct kvm_vcpu *vcpu, struct kvm_run *run, int in,
 	vcpu->arch.pio.in = in;
 	vcpu->arch.pio.string = 0;
 	vcpu->arch.pio.down = 0;
-	vcpu->arch.pio.guest_page_offset = 0;
 	vcpu->arch.pio.rep = 0;
 
 	if (vcpu->run->io.direction == KVM_EXIT_IO_IN)
@@ -2540,9 +2518,7 @@ int kvm_emulate_pio_string(struct kvm_vcpu *vcpu, struct kvm_run *run, int in,
 			   gva_t address, int rep, unsigned port)
 {
 	unsigned now, in_page;
-	int i, ret = 0;
-	int nr_pages = 1;
-	struct page *page;
+	int ret = 0;
 	struct kvm_io_device *pio_dev;
 
 	vcpu->run->exit_reason = KVM_EXIT_IO;
@@ -2554,7 +2530,6 @@ int kvm_emulate_pio_string(struct kvm_vcpu *vcpu, struct kvm_run *run, int in,
 	vcpu->arch.pio.in = in;
 	vcpu->arch.pio.string = 1;
 	vcpu->arch.pio.down = down;
-	vcpu->arch.pio.guest_page_offset = offset_in_page(address);
 	vcpu->arch.pio.rep = rep;
 
 	if (vcpu->run->io.direction == KVM_EXIT_IO_IN)
@@ -2574,15 +2549,8 @@ int kvm_emulate_pio_string(struct kvm_vcpu *vcpu, struct kvm_run *run, int in,
 	else
 		in_page = offset_in_page(address) + size;
 	now = min(count, (unsigned long)in_page / size);
-   if (!now) {
-   /*
-* String I/O straddles page boundary.  Pin two guest pages
-* so that we satisfy atomicity constraints.  Do just one
-* transaction to avoid complexity.
-*/
-   nr_pages = 2;
+   if (!now)
now = 1;
-   }
if (down) {
/*
 * String I/O in reverse.  Yuck.  Kill the guest, fix later.
@@ -2597,15 +2565,7 @@ int kvm_emulate_pio_string(struct kvm_vcpu *vcpu, struct kvm_run *run, int in,
 	if (vcpu->arch.pio.cur_count == vcpu->arch.pio.count)
 		kvm_x86_ops->skip_emulated_instruction(vcpu);
 
-	for (i = 0; i < nr_pages; ++i) {
-		page = gva_to_page(vcpu, address + i * PAGE_SIZE);
-		vcpu->arch.pio.guest_pages[i] = page;
-		if (!page) {
-			kvm_inject_gp(vcpu, 0);
-			free_pio_guest_pages(vcpu);
-			return 1;
-		}
-	}
+	vcpu->arch.pio.guest_gva = address;
 
 	pio_dev = vcpu_find_pio_dev(vcpu, port,
 				    vcpu->arch.pio.cur_count,
@@ -2613,7 +2573,11 @@ int kvm_emulate_pio_string(struct kvm_vcpu *vcpu, struct kvm_run *run, int in,
 	if (!vcpu->arch.pio.in) {
 		/* string PIO write */
 		ret = pio_copy_data(vcpu);
-		if (ret >= 0 && pio_dev) {
+		if (ret == X86EMUL_PROPAGATE_FAULT) {
+			kvm_inject_gp(vcpu, 0);
+			return 1;
+		}
+		if (ret == 0 && pio_dev) {
 			pio_string_write(pio_dev, vcpu);
 			complete_pio(vcpu);
 			if (vcpu->arch.pio.count == 0)
diff --git a/include/linux/kvm_types.h b/include/linux/kvm_types.h
index 9b6f395..5f4a18c 

Re: [PATCH 0/2] remove kvm vmap usage

2008-12-28 Thread Izik Eidus

Izik Eidus wrote:

Remove the vmap usage from kvm, this is needed both for ksm and
get_user_pages != write.
OK, so for ksm things are now all right. Actually, I could live without
this vmap as well, because ksm checks the page _count and every call to
gva_to_page should increase it.

Anyway, before I continue with the kmap removal (again, ksm doesn't need
it, only get_user_pages != write): do we really want to go that way?

Even for the get_user_pages case we can live with write = 0 if, before
each kmap_atomic, we call get_user_pages (write = 1); then the pages
behind apic_page, time_page and so on would be safe (when ksm comes to
merge such a page it will skip it, thanks to the page count check).



[PATCH][v2] kvm-userspace: Load PCI option ROMs

2008-12-28 Thread Liu, Kechao
Hi Avi,

Thanks for your comments. I've updated the patch according to them.
Please review it. Thank you.

Load assigned devices' PCI option ROMs into the RAM of the
guest OS, and pass the corresponding devfns to the BIOS.

Signed-off-by: Kechao Liu kechao@intel.com
---
 bios/rombios.c  |   20 +-
 qemu/hw/device-assignment.c |  155 +++
 qemu/hw/device-assignment.h |1 +
 qemu/hw/pc.c|8 ++-
 4 files changed, 178 insertions(+), 6 deletions(-)

diff --git a/bios/rombios.c b/bios/rombios.c
index b7a240f..e98d4ff 100644
--- a/bios/rombios.c
+++ b/bios/rombios.c
@@ -10253,18 +10253,30 @@ rom_scan_loop:
   add  al, #0x04
 block_count_rounded:
 
-  xor  bx, bx   ;; Restore DS back to :
-  mov  ds, bx
   push ax   ;; Save AX
   push di   ;; Save DI
   ;; Push addr of ROM entry point
   push cx   ;; Push seg
   push #0x0003  ;; Push offset
 
+  ;; Get the BDF into ax before invoking the option ROM
+  mov  bl, [2]
+  mov  al, bl
+  shr  al, #7
+  cmp  al, #1
+  jne  fetch_bdf
  mov  ax, ds ;; Increment DS since the ROM size is larger than a segment
+  add  ax, #0x1000
+  mov  ds, ax
+fetch_bdf:
+  shl  bx, #9
+  xor  ax, ax
+  mov  al, [bx]
+
   ;; Point ES:DI at $PnP, which tells the ROM that we are a PnP BIOS.
   ;; That should stop it grabbing INT 19h; we will use its BEV instead.
-  mov  ax, #0xf000
-  mov  es, ax
+  mov  bx, #0xf000
+  mov  es, bx
   lea  di, pnp_string
 
   mov  bp, sp   ;; Call ROM init routine using seg:off on stack
diff --git a/qemu/hw/device-assignment.c b/qemu/hw/device-assignment.c
index 7a5..7f16970 100644
--- a/qemu/hw/device-assignment.c
+++ b/qemu/hw/device-assignment.c
@@ -678,3 +678,158 @@ void add_assigned_devices(PCIBus *bus, const char 
**devices, int n_devices)
 }
 }
 }
+
+/* Option ROM header */
+struct option_rom_header {
+uint8_t signature[2];
+uint8_t rom_size;
+uint32_t entry_point;
+uint8_t reserved[17];
+uint16_t pci_header_offset;
+uint16_t expansion_header_offset;
+} __attribute__ ((packed));
+
+/* Option ROM PCI data structure */
+struct option_rom_pci_header {
+uint8_t signature[4];
+uint16_t vendor_id;
+uint16_t device_id;
+uint16_t vital_product_data_offset;
+uint16_t structure_length;
+uint8_t structure_revision;
+uint8_t class_code[3];
+uint16_t image_length;
+uint16_t image_revision;
+uint8_t code_type;
+uint8_t indicator;
+uint16_t reserved;
+} __attribute__ ((packed));
+
+/*
+ * Scan the list of Option ROMs at roms. If a suitable Option ROM is found,
+ * allocate RAM space and copy it there. Then return its size, aligned to
+ * both 2KB and the target page size.
+ */
+#define OPTION_ROM_ALIGN(x) (((x) + 2047)  ~2047)
+static int scan_option_rom(uint8_t devfn, void *roms, ram_addr_t offset)
+{
+    int i, size, total_size;
+    uint8_t csum;
+    ram_addr_t addr;
+    struct option_rom_header *rom;
+    struct option_rom_pci_header *pcih;
+
+    rom = roms;
+
+    for ( ; ; ) {
+        /* Invalid signature means we're out of option ROMs. */
+        if (strncmp((char *)rom->signature, "\x55\xaa", 2) ||
+            (rom->rom_size == 0))
+            break;
+
+        size = rom->rom_size * 512;
+        /* Invalid checksum means we're out of option ROMs. */
+        csum = 0;
+        for (i = 0; i < size; i++)
+            csum += ((uint8_t *)rom)[i];
+        if (csum != 0)
+            break;
+
+        /* Check the PCI header (if any) for a match. */
+        pcih = (struct option_rom_pci_header *)
+            ((char *)rom + rom->pci_header_offset);
+        if ((rom->pci_header_offset != 0) &&
+            !strncmp((char *)pcih->signature, "PCIR", 4))
+            goto found;
+
+        rom = (struct option_rom_header *)((char *)rom + size);
+    }
+
+    return 0;
+
+ found:
+    /* The size should be both 2K-aligned and page-aligned */
+    total_size = (TARGET_PAGE_SIZE < 2048)
+               ? OPTION_ROM_ALIGN(size + 1)
+               : TARGET_PAGE_ALIGN(size + 1);
+
+    /* Size of all available ROM space is 0x10000 (0xd0000 to 0xe0000) */
+    if ((offset + total_size) > 0x10000u) {
+        fprintf(stderr, "Option ROM size %x exceeds available space\n", size);
+        return 0;
+    }
+
+    addr = qemu_ram_alloc(total_size);
+    cpu_register_physical_memory(0xd0000 + offset, total_size, addr | IO_MEM_ROM);
+
+    /* Write ROM data and devfn to phys_addr */
+    cpu_physical_memory_write_rom(0xd0000 + offset, (uint8_t *)rom, size);
+    cpu_physical_memory_write_rom(0xd0000 + offset + size, &devfn, 1);
+
+    return total_size;
+}
+
+/*
+ * Scan the assigned devices for the devices that have an option ROM, and then 
+ * load the corresponding ROM data to RAM. If an error occurs while loading an 
+ * option ROM, we just ignore that option ROM and continue with the next one.
+ */
+ram_addr_t assigned_dev_load_option_roms(ram_addr_t 

Re: [PATCH 15/15] KVM: Fix racy in kvm_free_assigned_irq

2008-12-28 Thread Amit Shah
On Sun, Dec 28, 2008 at 07:24:02PM +0800, Sheng Yang wrote:
 On Sat, Dec 27, 2008 at 06:06:26PM -0200, Marcelo Tosatti wrote:
  On Fri, Dec 26, 2008 at 10:30:07AM +0800, Sheng Yang wrote:
   Thanks to Marcelo's observation, the following code has a potential issue:
   
   if (cancel_work_sync(&assigned_dev->interrupt_work))
   	kvm_put_kvm(kvm);
   
   In fact, cancel_work_sync() would return true whether the work struct was
   only scheduled or its callback was already executed. This code only
   considers the former situation.
  
  Why not simply drop the reference inc / dec from irq handler/work
  function?
 
 Sorry, I don't know the answer. After checking the code, I also think it's a
 little strange to increase the reference count here, and I don't think we
 expect the work handler to release the kvm struct.

At the time of developing that code, this was my observation:

I see from the call chain kvm_put_kvm->...->kvm_arch_destroy_vm that no locks are 
taken to actually destroy the vm. We can't be in ioctls, sure. But shouldn't 
the mutex be taken to ensure there's nothing else going on while destroying?

At least with the workqueue model, we could be called asynchronously in kernel 
context, and I could be holding the mutex and about to inject interrupts while 
everything is being wiped out underneath. The workqueue model tries its best 
to schedule the work on the same CPU, though we can't rely on that as a 
guarantee that things will be fine.

---
So I had to take a reference to the current vm for as long as any work was 
pending. I think I put comments in the code, but sadly most of my comments 
were stripped out before the merge.

Amit