Re: First performance numbers

2008-09-20 Thread Joerg Roedel
On Fri, Sep 19, 2008 at 06:30:50PM -0700, Avi Kivity wrote:
 Joerg Roedel wrote:
 Ok, here are some performance numbers for nested SVM. I ran kernbench -M
 on a virtual machine with 4G RAM and 1 VCPU (since nested SMP guests
 currently do not work). I measured simple virtualization with a
 shadow-paging guest on bare metal and within a nested guest (same guest
 image) on a nested-paging-enabled first-level guest.
 
                  | Shadow Guest (100%) | Nested Guest (X)  | X
 -----------------+---------------------+-------------------+----------
 Elapsed Time     | 553.244 (1.21208)   | 1185.95 (20.0365) | 214.363%
 User Time        | 407.728 (0.987279)  | 520.434 (8.55643) | 127.642%
 System Time      | 144.828 (0.480645)  | 664.528 (11.6648) | 458.839%
 Percent CPU      | 99 (0)              | 99 (0)            | 100.000%
 Context Switches | 98265.2 (183.001)   | 220015 (3302.74)  | 223.899%
 Sleeps           | 49397.8 (31.0274)   | 49460.2 (364.84)  | 100.126%
 
 So we have an overall slowdown of more than 50% in the first nesting
 level (elapsed time goes from ~553s to ~1186s, i.e. the nested guest
 takes about 2.14x as long), mostly because we spend so much time at the
 system level. Seems there is some work to do for performance
 improvements :-)
 
   
 
 Do you have kvm_stat output for the two cases?  It would also be
 interesting to run kvm_stat on both guest and host.

Sorry, no. But I can repeat the measurements and gather these numbers.

Joerg



Re: VMX: Host NMI triggering on NMI vmexit

2008-09-20 Thread Jan Kiszka
Avi Kivity wrote:
 Jan Kiszka wrote:
 Sheng,

 out of curiosity: vmx_vcpu_run invokes 'int $2' to trigger a host NMI if
 the VM exited due to an external NMI event. According to Intel specs I
 have, software-triggered NMIs do not block hardware NMIs. So are we
 facing the risk of receiving another NMI while running the first handler?
 Or will the VM be left with the hardware blocking logic armed? Or does
 Linux not care about NMI handler re-entrance?
   
 
 All good questions.  Usually this doesn't happen since NMI sources are
 far apart (oprofile, watchdog).

Only true until you have multiple unsynchronized NMI sources, e.g.
inter-CPU NMIs of kgdb + a watchdog. I just stumbled over several bugs
in kvm's and my own NMI code that were triggered by such a scenario
(sigh...).

 
 Maybe the answer is to generate the local nmi via an IPI-to-self command
 to the local apic.
 

That sounds like a good idea, will look into this right after fixing the
other NMI issues.
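
[Illustration, not a patch from this thread: one possible shape for such an
APIC-based self-NMI, using an explicit destination rather than a shorthand.
The usual wait-for-ICR-idle and irq-save details are omitted, and whether
this is safe at the exact point where vmx handles the exit is the open
question above.]

	/* Raise an NMI on the current CPU via the local APIC instead of
	 * "int $2" (xAPIC mode assumed). */
	static void self_nmi_via_apic(void)
	{
		u32 apicid = hard_smp_processor_id();

		apic_write(APIC_ICR2, SET_APIC_DEST_FIELD(apicid));
		apic_write(APIC_ICR, APIC_DEST_PHYSICAL | APIC_INT_ASSERT | APIC_DM_NMI);
	}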

Jan





Re: [PATCH 3/9] allow intersecting region to be on the boundary.

2008-09-20 Thread Avi Kivity

Glauber Costa wrote:

Signed-off-by: Glauber Costa [EMAIL PROTECTED]
---
 libkvm/libkvm.c |4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/libkvm/libkvm.c b/libkvm/libkvm.c
index e768e44..fa65c30 100644
--- a/libkvm/libkvm.c
+++ b/libkvm/libkvm.c
@@ -130,8 +130,8 @@ int get_intersecting_slot(unsigned long phys_addr)
 	int i;
 
 	for (i = 0; i < KVM_MAX_NUM_MEM_REGIONS ; ++i)
-		if (slots[i].len && slots[i].phys_addr < phys_addr &&
-		    (slots[i].phys_addr + slots[i].len) > phys_addr)
+		if (slots[i].len && slots[i].phys_addr <= phys_addr &&
+		    (slots[i].phys_addr + slots[i].len) >= phys_addr)
 			return i;
 	return -1;
  

consider

 slots[i].phys_addr = 0
 slots[i].len = 1
 phys_addr = 1

with the new calculation, i (well, not me personally) will be considered 
an intersecting slot.


Not that I (me this time) can understand how you can calculate interval 
intersection without the entire interval.
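
[Illustration, not part of the patch: with half-open slots
[phys_addr, phys_addr + len), the end-of-slot check has to stay strict.]

	/* does phys_addr lie inside the slot [start, start + len)? */
	static int addr_in_slot(unsigned long start, unsigned long len,
	                        unsigned long phys_addr)
	{
		return len && start <= phys_addr && phys_addr < start + len;
	}

With the numbers above (start 0, len 1, query 1) this correctly returns 0,
while the patch's ">=" check accepts it.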


--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.



Re: [PATCH 4/9] substitute is_allocated_mem with more general is_containing_region

2008-09-20 Thread Avi Kivity

Glauber Costa wrote:

is_allocated_mem is a function that checks whether every relevant aspect of
the memory slot matches (start and size). Replace it with a more generic
function that checks whether a memory region is totally contained within
another. The former case is also covered.
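
[Illustration only; the function name and slot fields below are assumed, not
taken from the patch. A sketch of the containment test described above, where
an exact match is just the special case start == slot_start && len == slot_len.]

	/* is [start, start + len) contained in [slot_start, slot_start + slot_len)? */
	static int region_contained(unsigned long slot_start, unsigned long slot_len,
	                            unsigned long start, unsigned long len)
	{
		return slot_len && slot_start <= start &&
		       start + len <= slot_start + slot_len;
	}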
  


I think enabling dirty page tracking requires the slot to match exactly.

--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.



Re: [PATCH 5/9] add debuging facilities to memory registration at libkvm

2008-09-20 Thread Avi Kivity

Glauber Costa wrote:

+#ifdef DEBUG_MEMREG
+	fprintf(stderr, "%s, memory: gpa: %llx, size: %llx, uaddr: %llx, slot: %x, flags: %lx\n",
+		__func__, memory.guest_phys_addr, memory.memory_size,
+		memory.userspace_addr, memory.slot, memory.flags);
+#endif
  


Please wrap in a dprintf() macro or similar.
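
[Illustration, not a prescription: one possible shape for such a macro. It
compiles away entirely when DEBUG_MEMREG is not defined.]

	#ifdef DEBUG_MEMREG
	#define DPRINTF(fmt, args...) \
		fprintf(stderr, "%s: " fmt, __func__, ##args)
	#else
	#define DPRINTF(fmt, args...) do {} while (0)
	#endif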


--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.



Re: [PATCH 8/9] coalesce mmio regions with an explicit call

2008-09-20 Thread Avi Kivity

Glauber Costa wrote:

Remove explicit calls to mmio coalescing. Rather,
include it in the registration functions.

index 5ae3960..2d97b34 100644
--- a/qemu/hw/e1000.c
+++ b/qemu/hw/e1000.c
@@ -942,18 +942,6 @@ e1000_mmio_map(PCIDevice *pci_dev, int region_num,
 
     d->mmio_base = addr;
 
     cpu_register_physical_memory(addr, PNPMMIO_SIZE, d->mmio_index);
-
-    if (kvm_enabled()) {
-        int i;
-        uint32_t excluded_regs[] = {
-            E1000_MDIC, E1000_ICR, E1000_ICS, E1000_IMS,
-            E1000_IMC, E1000_TCTL, E1000_TDT, PNPMMIO_SIZE
-        };
-        qemu_kvm_register_coalesced_mmio(addr, excluded_regs[0]);
-        for (i = 0; excluded_regs[i] != PNPMMIO_SIZE; i++)
-            qemu_kvm_register_coalesced_mmio(addr + excluded_regs[i] + 4,
-                excluded_regs[i + 1] - excluded_regs[i] - 4);
-    }
 }
 
  


Where did all of this go?

--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.



Re: [PATCH 7/9] register mmio slots

2008-09-20 Thread Avi Kivity

Glauber Costa wrote:

By analysing phys_offset, we know whether a region is an mmio region
or not. If it is, register it as such. We don't reuse the existing slot
infrastructure, because there is a relationship between the slot number
for kvm, the kernel module, and the index in the slots vector for libkvm.
However, we can do better in the future and use only a single data
structure for both.

  


Why is kvm interested in emulated mmio regions, at all?



@@ -1032,11 +1042,9 @@ void kvm_mutex_lock(void)
 
 int qemu_kvm_register_coalesced_mmio(target_phys_addr_t addr, unsigned int size)

 {
-return kvm_register_coalesced_mmio(kvm_context, addr, size);
 }
 
 int qemu_kvm_unregister_coalesced_mmio(target_phys_addr_t addr,

   unsigned int size)
 {
-return kvm_unregister_coalesced_mmio(kvm_context, addr, size);
 }
  


Why?

--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.



Re: Graceful shutdown for OpenBSD

2008-09-20 Thread Avi Kivity

[EMAIL PROTECTED] wrote:


Maybe power button pressed or something?



Yes.  If OpenBSD supports power buttons, this should work.
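
[Side note, not from the thread: from the qemu side this is typically
triggered with the monitor's system_powerdown command, which presses the
virtual ACPI power button; whether the guest then shuts down gracefully
depends on its ACPI/apm support.]

	(qemu) system_powerdown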

--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.



Re: [patch 02/10] KVM: MMU: move local TLB flush to mmu_set_spte

2008-09-20 Thread Marcelo Tosatti
On Fri, Sep 19, 2008 at 05:21:09PM -0700, Avi Kivity wrote:
 Marcelo Tosatti wrote:
 Since the sync page path can collapse flushes.

 Also only flush if the spte was writable before.

 Signed-off-by: Marcelo Tosatti [EMAIL PROTECTED]

 @@ -1241,9 +1239,12 @@ static void mmu_set_spte(struct kvm_vcpu
  		}
  	}
  	if (set_spte(vcpu, shadow_pte, pte_access, user_fault, write_fault,
 -		      dirty, largepage, gfn, pfn, speculative))
 +		      dirty, largepage, gfn, pfn, speculative)) {
  		if (write_fault)
  			*ptwrite = 1;
 +		if (was_writeble)
 +			kvm_x86_ops->tlb_flush(vcpu);
 +	}


 I think we had cases where the spte.pfn contents changed, for example  
 when a large page was replaced by a normal page,

True. And the TLB is not flushed now for a large->normal replace, in case
the pte that's faulting is read-only. The local (and remote) TLBs must
be flushed on a large->normal replace.

(BTW the largepage patch is wrong, will reply to that soon).

 and also:

} else if (pfn != spte_to_pfn(*shadow_pte)) {

That one is likely to crash the guest anyway, so I don't see the need
for a flush there:

  Did you find out what's causing the errors in the first place (if
  zap is not used)?  It worries me greatly.
 Yes, the problem is that the rmap code does not handle the qemu process
 mappings vanishing while there is a present rmap. If that happens, and
 there is a fault for a gfn whose qemu mapping has been removed, a
 different physical zero page will be allocated:
 
  rmap a -> gfn 0 -> physical host page 0
  mapping for gfn 0 gets removed
  guest faults in gfn 0 through the same pte chain
  rmap a -> gfn 0 -> physical host page 1
 
 When instantiating the shadow mapping for the second time, the
 is_rmap_pte check succeeds, so we release the reference grabbed by
 gfn_to_page() at mmu_set_spte(). We now have a shadow mapping pointing
 to a physical page without holding an additional reference on that page.
 
 The following makes the host not crash under such a condition, but the
 condition itself is invalid, leading to inconsistent state in the guest.
 So IMHO it shouldn't be allowed to happen in the first place.



Re: [patch 03/10] KVM: MMU: do not write-protect large mappings

2008-09-20 Thread Marcelo Tosatti
On Fri, Sep 19, 2008 at 05:29:28PM -0700, Avi Kivity wrote:
 Marcelo Tosatti wrote:
 There is not much point in write-protecting large mappings. This
 can only happen when a page is shadowed during the window between
 is_largepage_backed and mmu_lock acquisition. Zap the entry instead, so
 the next pagefault will find a shadowed page via is_largepage_backed and
 fall back to 4k translations.

 Simplifies out of sync shadow.

 Signed-off-by: Marcelo Tosatti [EMAIL PROTECTED]

 Index: kvm/arch/x86/kvm/mmu.c
 ===================================================================
 --- kvm.orig/arch/x86/kvm/mmu.c
 +++ kvm/arch/x86/kvm/mmu.c
 @@ -1180,11 +1180,16 @@ static int set_spte(struct kvm_vcpu *vcp
  	    || (write_fault && !is_write_protection(vcpu) && !user_fault)) {
  		struct kvm_mmu_page *shadow;
 +
 +		if (largepage && has_wrprotected_page(vcpu->kvm, gfn)) {
 +			ret = 1;
 +			spte = shadow_trap_nonpresent_pte;
 +			goto set_pte;
 +		}

 Don't we need to drop a reference to the page?

mmu_set_spte will do it if necessary:

	if (!was_rmapped) {
		rmap_add(vcpu, shadow_pte, gfn, largepage);
		if (!is_rmap_pte(*shadow_pte))
			kvm_release_pfn_clean(pfn);


But as noted, this part is wrong:

@@ -307,11 +307,10 @@ static int FNAME(shadow_walk_entry)(stru
 		return 1;
 	}
 
-	if (is_shadow_present_pte(*sptep) && !is_large_pte(*sptep))
+	if (is_shadow_present_pte(*sptep))
 		return 0;
 
-	if (is_large_pte(*sptep))
-		rmap_remove(vcpu->kvm, sptep);
+	WARN_ON(is_large_pte(*sptep));

Since it might still be necessary to replace a read-only large mapping
(which rmap_write_protect will not nuke) with a normal pte pointer.



Re: [patch 06/10] KVM: x86: trap invlpg

2008-09-20 Thread Marcelo Tosatti
On Fri, Sep 19, 2008 at 05:53:22PM -0700, Avi Kivity wrote:
 Marcelo Tosatti wrote:
 +static int FNAME(shadow_invlpg_entry)(struct kvm_shadow_walk *_sw,
 +				       struct kvm_vcpu *vcpu, u64 addr,
 +				       u64 *sptep, int level)
 +{
 +
 +	if (level == PT_PAGE_TABLE_LEVEL) {
 +		if (is_shadow_present_pte(*sptep))
 +			rmap_remove(vcpu->kvm, sptep);
 +		set_shadow_pte(sptep, shadow_trap_nonpresent_pte);
   

 Need to flush the real tlb as well.

The local TLB you mean?

+void kvm_mmu_invlpg(struct kvm_vcpu *vcpu, gva_t gva)
+{
+	spin_lock(&vcpu->kvm->mmu_lock);
+	vcpu->arch.mmu.invlpg(vcpu, gva);
+	spin_unlock(&vcpu->kvm->mmu_lock);
+	kvm_mmu_flush_tlb(vcpu);
+}
+EXPORT_SYMBOL_GPL(kvm_mmu_invlpg);



Re: [patch 07/10] KVM: MMU: mmu_parent_walk

2008-09-20 Thread Marcelo Tosatti
On Fri, Sep 19, 2008 at 05:56:46PM -0700, Avi Kivity wrote:
 +	} while (level > start_level-1);
 +}
 +

 Could be much simplified with recursion, no?  As the depth is limited to  
 4, there's no stack overflow problem.

The early version was recursive, but since it's a generic helper I
preferred a non-recursive function.
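
[Illustration, not the helper from the patch: a toy sketch of the recursive
shape being discussed, with a made-up parent representation rather than kvm's
pte chains. Recursion depth is bounded by the number of paging levels, since
each parent sits one level above its child.]

	struct toy_sp {
		struct toy_sp *parents[16];	/* shadow pages that map this one */
		int nr_parents;
	};

	static void walk_parents(struct toy_sp *sp, void (*fn)(struct toy_sp *parent))
	{
		int i;

		for (i = 0; i < sp->nr_parents; i++) {
			fn(sp->parents[i]);
			walk_parents(sp->parents[i], fn);	/* recurse toward the roots */
		}
	}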



Re: [patch 09/10] KVM: MMU: out of sync shadow core v2

2008-09-20 Thread Marcelo Tosatti
On Fri, Sep 19, 2008 at 06:22:52PM -0700, Avi Kivity wrote:
 Instead of private, have an object contain both callback and private  
 data, and use container_of().  Reduces the chance of type errors.

OK.
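
[Illustration of the container_of() idiom being suggested; the struct and
field names are made up, not taken from the series. The walker sees only the
generic callback object, and the callback recovers its private data.]

	struct unsync_walker {
		int (*entry)(struct kvm_mmu_page *sp, struct unsync_walker *walker);
	};

	struct flush_walker {
		struct unsync_walker walker;	/* embedded generic part */
		int flushed;			/* private data */
	};

	static int flush_entry(struct kvm_mmu_page *sp, struct unsync_walker *w)
	{
		/* container_of() from <linux/kernel.h> gets us back to the wrapper */
		struct flush_walker *fw = container_of(w, struct flush_walker, walker);

		fw->flushed++;
		return 0;
	}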

 +	while (parent->unsync_children) {
 +		for (i = 0; i < PT64_ENT_PER_PAGE; ++i) {
 +			u64 ent = sp->spt[i];
 +
 +			if (is_shadow_present_pte(ent)) {
 +				struct kvm_mmu_page *child;
 +				child = page_header(ent & PT64_BASE_ADDR_MASK);

 What does this do?

Walks all children of the given page, with no attempt at efficiency. It's
replaced later by the bitmap version.

 +static int kvm_sync_page(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
 +{
 +	if (sp->role.glevels != vcpu->arch.mmu.root_level) {
 +		kvm_mmu_zap_page(vcpu->kvm, sp);
 +		return 1;
 +	}
   

 Suppose we switch to real mode, touch a pte, switch back.  Is this handled?

The shadow page will go unsync on the pte touch and be resynced as soon as
it's visible again (after the return to paging).

Or, while still in real mode, it might be zapped by
kvm_mmu_get_page->kvm_sync_page.

Am I missing something?

 @@ -991,8 +1066,18 @@ static struct kvm_mmu_page *kvm_mmu_get_
  		 gfn, role.word);
  	index = kvm_page_table_hashfn(gfn);
  	bucket = &vcpu->kvm->arch.mmu_page_hash[index];
 -	hlist_for_each_entry(sp, node, bucket, hash_link)
 -		if (sp->gfn == gfn && sp->role.word == role.word) {
 +	hlist_for_each_entry_safe(sp, node, tmp, bucket, hash_link)
 +		if (sp->gfn == gfn) {
 +			if (sp->unsync)
 +				if (kvm_sync_page(vcpu, sp))
 +					continue;
 +
 +			if (sp->role.word != role.word)
 +				continue;
 +
 +			if (sp->unsync_children)
 +				vcpu->arch.mmu.need_root_sync = 1;
   

 mmu_reload() maybe?

Hum, will think about it.

  static int kvm_mmu_zap_page(struct kvm *kvm, struct kvm_mmu_page *sp)
 -return 0;
 +return ret;
  }
   

 Why does the caller care if zap also zapped some other random pages?  To  
 restart walking the list?

Yes. The next element that for_each_entry_safe saved could have been zapped.

 +	/* don't unsync if pagetable is shadowed with multiple roles */
 +	hlist_for_each_entry_safe(s, node, n, bucket, hash_link) {
 +		if (s->gfn != sp->gfn || s->role.metaphysical)
 +			continue;
 +		if (s->role.word != sp->role.word)
 +			return 1;
 +	}
   

 This will happen for nonpae paging.  But why not allow it?  Zap all  
 unsynced pages on mode switch.

 Oh, if a page is both a page directory and page table, yes.  

Yes. 

 So to allow nonpae oos, check the level instead.

Windows 2008 64-bit also shares pagetables between multiple levels in all
sorts of ways.



Re: [patch 10/10] KVM: MMU: speed up mmu_unsync_walk

2008-09-20 Thread Marcelo Tosatti
On Fri, Sep 19, 2008 at 06:26:56PM -0700, Avi Kivity wrote:
  --- kvm.orig/include/asm-x86/kvm_host.h
 +++ kvm/include/asm-x86/kvm_host.h
 @@ -201,6 +201,7 @@ struct kvm_mmu_page {
  u64 *parent_pte;   /* !multimapped */
  struct hlist_head parent_ptes; /* multimapped, kvm_pte_chain */
  };
 +DECLARE_BITMAP(unsync_child_bitmap, 512);
  };
  

 Later, we can throw this bitmap out to a separate object. 

Yeah.

 Also, it may make sense to replace it with an array of u16s.

Why?