persistent tun different virtual NICs dead guest network

2009-04-04 Thread Michael Tokarev

Hello.

2 days debugging an.. issue here, and finally got it.
To make the long and painful (it was for me anyway)
story short...

kvm provides a way to control various offload settings
on the host side of the tun network device (I mean
the `-net tap' setup) from within the guest.  I.e., the guest
can set/clear various offload bits according to its
capabilities/wishes.

The problem is that the different virtual NICs used by
kvm/qemu expect and set different offload bits for
the virtual NIC, and they set only those bits which -
as far as they know - differ from the default (all off).

This means that when changing the virtual NIC model AND
using a persistent tun device, you are very likely to end
up with inconsistent flags.

For example, here is what the offload settings on the
host look like after using the e1000 driver in the guest
(freshly created persistent tun device):

 rx-checksumming: on
 tx-checksumming: on
 scatter-gather: on
 tcp segmentation offload: on
 udp fragmentation offload: off
 generic segmentation offload: off
 large receive offload: off

Here are the same settings when using virtio_net
instead:

 rx-checksumming: on
 tx-checksumming: off
 scatter-gather: off
 tcp segmentation offload: off
 udp fragmentation offload: off
 generic segmentation offload: off
 large receive offload: off

I.e., only rx-checksumming is on.  When using virtio_net
from 2.6.29, which supports LRO, it also turns on
large receive offload.

Now, say, I tried a guest with the e1000 driver, and it
turned on the tx, sg and tso bits.  And now I'm trying
to run the guest with the new virtio-net NIC instead.  It
turns on the lro bit, but the network does not work anyway:
almost every packet sent from the host to the guest has an
incorrect checksum - because the NIC is marked as able to do
tx-checksumming but it does not actually do it.
The network is dead.

Now, after trying this and that, not understanding
what's going on etc., let's reboot back with the e1000
NIC which worked a few minutes ago... only to discover
that it does not work anymore either!  Because the previous
attempt with virtio_net left lro turned on, but the e1000
driver does not support it!  So now we've got a non-working
network again, and now it does not matter which driver we
try: neither of them will work because the offload settings
are broken.

What's more: one can't control this stuff from the
host side using standard ethtool: it says that
the operation is not supported (I wonder how kvm
performs the setting changes).

The workaround here is to re-create the tun device
before changing the virtual NIC model.  But that
isn't always possible, especially when guests are
run by a non-root user (which is exactly where
persistent tun devices are most useful).

Can this be fixed somehow please?

I think all the settings should be reset to 0
when opening the tun device.
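
For what it's worth, here is a minimal sketch (not from kvm itself) of how
a management helper could clear the flags today: re-attach to the
persistent device and issue TUNSETOFFLOAD with no bits set.  The device
name and the IFF_TAP|IFF_NO_PI mode are assumptions and must match how
the persistent device was created.

#include <fcntl.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/if.h>
#include <linux/if_tun.h>

/* Re-attach to an existing persistent tap device and clear all offload
 * bits, so the next guest starts from the all-off default again. */
int clear_tap_offloads(const char *name)
{
	struct ifreq ifr;
	int fd = open("/dev/net/tun", O_RDWR);

	if (fd < 0)
		return -1;
	memset(&ifr, 0, sizeof(ifr));
	ifr.ifr_flags = IFF_TAP | IFF_NO_PI;	/* must match the persistent device */
	strncpy(ifr.ifr_name, name, IFNAMSIZ - 1);
	if (ioctl(fd, TUNSETIFF, &ifr) < 0 ||	/* attach to e.g. tap0 */
	    ioctl(fd, TUNSETOFFLOAD, 0) < 0)	/* reset every offload bit to off */
		return -1;
	return fd;	/* caller decides whether to keep or close the fd */
}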

Thanks.

/mjt,
  who lost 2 more days and had another sleepless
  night trying to understand what was going wrong...


Re: cr3 OOS optimisation breaks 32-bit GNU/kFreeBSD guest

2009-04-04 Thread Avi Kivity

Marcelo Tosatti wrote:

On Tue, Mar 24, 2009 at 11:47:33AM +0200, Avi Kivity wrote:
  

index 2ea8262..48169d7 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -3109,6 +3109,8 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run)
 		kvm_write_guest_time(vcpu);
 	if (test_and_clear_bit(KVM_REQ_MMU_SYNC, &vcpu->requests))
 		kvm_mmu_sync_roots(vcpu);
+	if (test_and_clear_bit(KVM_REQ_MMU_GLOBAL_SYNC, &vcpu->requests))
+		kvm_mmu_sync_global(vcpu);
 	if (test_and_clear_bit(KVM_REQ_TLB_FLUSH, &vcpu->requests))
 		kvm_x86_ops->tlb_flush(vcpu);
if (test_and_clear_bit(KVM_REQ_REPORT_TPR_ACCESS
  
Windows will (I think) write a PDE on every context switch, so this  
effectively disables global unsync for that guest.


What about recursively syncing the newly linked page in FNAME(fetch)()?  
If the page isn't global, this becomes a no-op, so no new overhead.  The  
only question is the expense when linking a populated top-level page,  
especially in long mode.



How about this?

KVM: MMU: sync global pages on fetch()

If an unsync global page becomes unreachable via the shadow tree, which
can happen if one of its parent pages is zapped, invlpg will fail to
invalidate translations for gvas contained in such unreachable pages.

So sync global pages in fetch().

Signed-off-by: Marcelo Tosatti mtosa...@redhat.com

diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
index 09782a9..728be72 100644
--- a/arch/x86/kvm/paging_tmpl.h
+++ b/arch/x86/kvm/paging_tmpl.h
@@ -308,8 +308,14 @@ static u64 *FNAME(fetch)(struct kvm_vcpu *vcpu, gva_t addr,
break;
}
 
-		if (is_shadow_present_pte(*sptep) && !is_large_pte(*sptep))
+		if (is_shadow_present_pte(*sptep) && !is_large_pte(*sptep)) {
+			if (level-1 == PT_PAGE_TABLE_LEVEL) {
+				shadow_page = page_header(__pa(sptep));
+				if (shadow_page->unsync && shadow_page->global)
+					kvm_sync_page(vcpu, shadow_page);
+			}
 			continue;
+		}
 
 		if (is_large_pte(*sptep)) {
 			rmap_remove(vcpu->kvm, sptep);
  


But here the shadow page is already linked?  Isn't the root cause that 
an invlpg was called when the page wasn't linked, so it wasn't seen by 
invlpg?


So I thought the best place would be in fetch(), after 
kvm_mmu_get_page().  If we're linking a page which contains global ptes, 
they might be unsynced due to invlpgs that we've missed.


Or am I missing something about the root cause?

--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.



[PATCH] Fix display breakage when resizing the screen

2009-04-04 Thread Avi Kivity
When the vga resolution changes, a new display surface is not allocated
immediately; instead that is deferred until the next update.  However,
if we're running without a display client attached, that won't happen
and the next bitblt is likely to cause a segfault by overflowing the
display surface.

Fix by reallocating the display immediately when the resolution changes.

Tested with (Windows|Linux) x (cirrus|std) x (curses|sdl).

Signed-off-by: Avi Kivity a...@redhat.com
---
 hw/cirrus_vga.c |   11 ++-
 hw/vga.c|  261 ++-
 hw/vga_int.h|4 +
 3 files changed, 156 insertions(+), 120 deletions(-)

diff --git a/hw/cirrus_vga.c b/hw/cirrus_vga.c
index 08fd4c2..223008e 100644
--- a/hw/cirrus_vga.c
+++ b/hw/cirrus_vga.c
@@ -1392,6 +1392,8 @@ cirrus_hook_write_sr(CirrusVGAState * s, unsigned reg_index, int reg_value)
break;
 }
 
+vga_update_resolution((VGAState *)s);
+
 return CIRRUS_HOOK_HANDLED;
 }
 
@@ -1419,6 +1421,7 @@ static void cirrus_write_hidden_dac(CirrusVGAState * s, int reg_value)
 #endif
     }
     s->cirrus_hidden_dac_lockindex = 0;
+    vga_update_resolution((VGAState *)s);
 }
 
 /***
@@ -1705,6 +1708,8 @@ cirrus_hook_write_cr(CirrusVGAState * s, unsigned reg_index, int reg_value)
break;
 }
 
+vga_update_resolution((VGAState *)s);
+
 return CIRRUS_HOOK_HANDLED;
 }
 
@@ -2830,6 +2835,7 @@ static void vga_ioport_write(void *opaque, uint32_t addr, uint32_t val)
         if (s->ar_flip_flop == 0) {
             val &= 0x3f;
             s->ar_index = val;
+            vga_update_resolution((VGAState *)s);
         } else {
             index = s->ar_index & 0x1f;
             switch (index) {
@@ -2923,6 +2929,7 @@ static void vga_ioport_write(void *opaque, uint32_t addr, uint32_t val)
             /* can always write bit 4 of CR7 */
             if (s->cr_index == 7)
                 s->cr[7] = (s->cr[7] & ~0x10) | (val & 0x10);
+            vga_update_resolution((VGAState *)s);
             return;
         }
         switch (s->cr_index) {
@@ -2951,6 +2958,7 @@ static void vga_ioport_write(void *opaque, uint32_t addr, uint32_t val)
             s->update_retrace_info((VGAState *) s);
             break;
         }
+        vga_update_resolution((VGAState *)s);
         break;
 case 0x3ba:
 case 0x3da:
@@ -3157,7 +3165,8 @@ static int cirrus_vga_load(QEMUFile *f, void *opaque, int version_id)
 
     cirrus_update_memory_access(s);
     /* force refresh */
-    s->graphic_mode = -1;
+    vga_update_resolution((VGAState *)s);
+    s->want_full_update = 1;
     cirrus_update_bank_ptr(s, 0);
     cirrus_update_bank_ptr(s, 1);
     return 0;
diff --git a/hw/vga.c b/hw/vga.c
index b1e4373..404450f 100644
--- a/hw/vga.c
+++ b/hw/vga.c
@@ -36,6 +36,10 @@
 
 //#define DEBUG_BOCHS_VBE
 
+#define GMODE_TEXT 0
#define GMODE_GRAPH 1
+#define GMODE_BLANK 2
+
 /* force some bits to zero */
 const uint8_t sr_mask[8] = {
 0x03,
@@ -393,6 +397,7 @@ static void vga_ioport_write(void *opaque, uint32_t addr, uint32_t val)
         if (s->ar_flip_flop == 0) {
             val &= 0x3f;
             s->ar_index = val;
+            vga_update_resolution(s);
         } else {
             index = s->ar_index & 0x1f;
             switch(index) {
@@ -433,6 +438,7 @@ static void vga_ioport_write(void *opaque, uint32_t addr, uint32_t val)
 #endif
         s->sr[s->sr_index] = val & sr_mask[s->sr_index];
         if (s->sr_index == 1) s->update_retrace_info(s);
+        vga_update_resolution(s);
         break;
     case 0x3c7:
         s->dac_read_index = val;
@@ -460,6 +466,7 @@ static void vga_ioport_write(void *opaque, uint32_t addr, uint32_t val)
         printf("vga: write GR%x = 0x%02x\n", s->gr_index, val);
 #endif
         s->gr[s->gr_index] = val & gr_mask[s->gr_index];
+        vga_update_resolution(s);
         break;
     case 0x3b4:
     case 0x3d4:
@@ -475,6 +482,7 @@ static void vga_ioport_write(void *opaque, uint32_t addr, uint32_t val)
             /* can always write bit 4 of CR7 */
             if (s->cr_index == 7)
                 s->cr[7] = (s->cr[7] & ~0x10) | (val & 0x10);
+            vga_update_resolution(s);
             return;
         }
         switch(s->cr_index) {
@@ -502,6 +510,7 @@ static void vga_ioport_write(void *opaque, uint32_t addr, uint32_t val)
             s->update_retrace_info(s);
             break;
         }
+        vga_update_resolution(s);
         break;
     case 0x3ba:
     case 0x3da:
@@ -581,11 +590,13 @@ static void vbe_ioport_write_data(void *opaque, uint32_t addr, uint32_t val)
         if ((val <= VBE_DISPI_MAX_XRES) && ((val & 7) == 0)) {
             s->vbe_regs[s->vbe_index] = val;
         }
+        vga_update_resolution(s);
         break;
     case VBE_DISPI_INDEX_YRES:
         if (val <= VBE_DISPI_MAX_YRES) {
             s->vbe_regs[s->vbe_index] = val;
         }
+        vga_update_resolution(s);
 

Re: IO on guest is 20 times slower than host

2009-04-04 Thread Avi Kivity

Joerg Roedel wrote:

index 1fcbc17..d9774e9 100644
--- a/kernel/x86/kvm/svm.c
+++ b/kernel/x86/kvm/svm.c
@@ -575,7 +575,7 @@ static void init_vmcb(struct vcpu_svm *svm)
 						INTERCEPT_CR3_MASK);
 		control->intercept_cr_write &= ~(INTERCEPT_CR0_MASK|
 						 INTERCEPT_CR3_MASK);
-		save->g_pat = 0x0007040600070406ULL;
+		save->g_pat = 0x0606060606060606ULL;
 		/* enable caching because the QEMU Bios doesn't enable it */
 		save->cr0 = X86_CR0_ET;
 		save->cr3 = 0;



Yeah, that patch makes sense. But I think we need some more work on this
because the guest may change the pat msr afterwards. Best would be a simple
shadow of the pat msr. The last question is how this will affect pci passthrough.

  


I've noticed that Windows (and likely Linux, didn't test) maps the 
cirrus framebuffer with PWT=1, which should slow down the emulated 
framebuffer.  So this patch should speed up things.


If a device is assigned, we must respect the guest PAT, so cirrus 
performance will be low.  On Intel there's an 'ignore PAT' bit which can 
be set on an ept pte for the framebuffer.  Any trick we can do on AMD to 
achieve a similar result?



--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.



[PATCH] KVM: Expand on help info to specify kvm intel and amd module names.

2009-04-04 Thread Robert P. J. Day

Signed-off-by: Robert P. J. Day rpj...@crashcourse.ca

---

diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
index 0a303c3..68c7e21 100644
--- a/arch/x86/kvm/Kconfig
+++ b/arch/x86/kvm/Kconfig
@@ -50,6 +50,9 @@ config KVM_INTEL
  Provides support for KVM on Intel processors equipped with the VT
  extensions.

+ To compile this as a module, choose M here: the module
+ will be called kvm-intel.
+
 config KVM_AMD
	tristate "KVM for AMD processors support"
depends on KVM
@@ -57,6 +60,9 @@ config KVM_AMD
  Provides support for KVM on AMD processors equipped with the AMD-V
  (SVM) extensions.

+ To compile this as a module, choose M here: the module
+ will be called kvm-amd.
+
 config KVM_TRACE
	bool "KVM trace support"
	depends on KVM && MARKERS && SYSFS


Robert P. J. Day
Linux Consulting, Training and Annoying Kernel Pedantry:
Have classroom, will lecture.

http://crashcourse.ca  Waterloo, Ontario, CANADA



[PATCH 3/4] add replace_page(): change the page pte is pointing to.

2009-04-04 Thread Izik Eidus
replace_page() allows changing the mapping of a pte from one physical page
to a different physical page.

The function works by removing oldpage from the rmap and calling
put_page() on it, by setting the pte to point to newpage, and by
inserting newpage into the rmap using page_add_file_rmap().

Note: newpage must be a non-anonymous page.  The reason is that
replace_page() is built to allow mapping one page at more than one
virtual address; the mapping of this page can be at different offsets
inside each vma, and therefore we cannot trust page->index anymore.

The side effect of this is that newpage cannot be anything but a
kernel-allocated page, which is not swappable.
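
For illustration, a caller (e.g. a page-merging pass) would use the new
function roughly like this.  This is a hedged sketch, not code from this
series; the read-only protection choice and the helper name are
assumptions:

#include <linux/mm.h>

/* Try to replace an anonymous page with a write-protected kernel page;
 * replace_page() returns 0 on success and -EFAULT if the pte changed
 * under us, in which case the caller can simply give up or retry. */
static int try_merge_one(struct vm_area_struct *vma, struct page *oldpage,
			 struct page *shared, pte_t orig_pte)
{
	/* keep the shared page read-only so writes still go through COW */
	return replace_page(vma, oldpage, shared, orig_pte,
			    vm_get_page_prot(vma->vm_flags & ~VM_WRITE));
}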

Signed-off-by: Izik Eidus iei...@redhat.com
---
 include/linux/mm.h |5 +++
 mm/memory.c|   80 
 2 files changed, 85 insertions(+), 0 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index bff1f0d..7a831ce 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1240,6 +1240,11 @@ int vm_insert_pfn(struct vm_area_struct *vma, unsigned 
long addr,
 int vm_insert_mixed(struct vm_area_struct *vma, unsigned long addr,
unsigned long pfn);
 
+#if defined(CONFIG_KSM) || defined(CONFIG_KSM_MODULE)
+int replace_page(struct vm_area_struct *vma, struct page *oldpage,
+struct page *newpage, pte_t orig_pte, pgprot_t prot);
+#endif
+
 struct page *follow_page(struct vm_area_struct *, unsigned long address,
unsigned int foll_flags);
 #define FOLL_WRITE 0x01/* check pte is writable */
diff --git a/mm/memory.c b/mm/memory.c
index 1e1a14b..d6e53c2 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1567,6 +1567,86 @@ int vm_insert_mixed(struct vm_area_struct *vma, unsigned 
long addr,
 }
 EXPORT_SYMBOL(vm_insert_mixed);
 
+#if defined(CONFIG_KSM) || defined(CONFIG_KSM_MODULE)
+
+/**
+ * replace_page - replace page in vma with new page
+ * @vma:  vma that hold the pte oldpage is pointed by.
+ * @oldpage:  the page we are replacing with newpage
+ * @newpage:  the page we replace oldpage with
+ * @orig_pte: the original value of the pte
+ * @prot: page protection bits
+ *
+ * Returns 0 on success, -EFAULT on failure.
+ *
+ * Note: @newpage must not be an anonymous page because replace_page() does
+ * not change the mapping of @newpage to have the same values as @oldpage.
+ * @newpage can be mapped in several vmas at different offsets (page->index).
+ */
+int replace_page(struct vm_area_struct *vma, struct page *oldpage,
+struct page *newpage, pte_t orig_pte, pgprot_t prot)
+{
+	struct mm_struct *mm = vma->vm_mm;
+   pgd_t *pgd;
+   pud_t *pud;
+   pmd_t *pmd;
+   pte_t *ptep;
+   spinlock_t *ptl;
+   unsigned long addr;
+   int ret;
+
+   BUG_ON(PageAnon(newpage));
+
+   ret = -EFAULT;
+   addr = page_address_in_vma(oldpage, vma);
+   if (addr == -EFAULT)
+   goto out;
+
+   pgd = pgd_offset(mm, addr);
+   if (!pgd_present(*pgd))
+   goto out;
+
+   pud = pud_offset(pgd, addr);
+   if (!pud_present(*pud))
+   goto out;
+
+   pmd = pmd_offset(pud, addr);
+   if (!pmd_present(*pmd))
+   goto out;
+
+	ptep = pte_offset_map_lock(mm, pmd, addr, &ptl);
+   if (!ptep)
+   goto out;
+
+   if (!pte_same(*ptep, orig_pte)) {
+   pte_unmap_unlock(ptep, ptl);
+   goto out;
+   }
+
+   ret = 0;
+   get_page(newpage);
+   page_add_file_rmap(newpage);
+
+   flush_cache_page(vma, addr, pte_pfn(*ptep));
+   ptep_clear_flush(vma, addr, ptep);
+   set_pte_at_notify(mm, addr, ptep, mk_pte(newpage, prot));
+
+   page_remove_rmap(oldpage);
+   if (PageAnon(oldpage)) {
+   dec_mm_counter(mm, anon_rss);
+   inc_mm_counter(mm, file_rss);
+   }
+   put_page(oldpage);
+
+   pte_unmap_unlock(ptep, ptl);
+
+out:
+   return ret;
+}
+EXPORT_SYMBOL_GPL(replace_page);
+
+#endif
+
 /*
  * maps a range of physical memory into the requested pages. the old
  * mappings are removed. any references to nonexistent pages results
-- 
1.5.6.5



[PATCH 2/4] add page_wrprotect(): write protecting page.

2009-04-04 Thread Izik Eidus
This patch adds a new function called page_wrprotect();
page_wrprotect() takes a page and marks all the ptes that
point to it as read-only.

The function works by walking the rmap of the page and setting
each pte related to the page read-only.

The odirect_sync parameter is used to protect against possible races
with O_DIRECT while we are marking the ptes read-only,
as noted by Andrea Arcangeli:

While thinking at get_user_pages_fast I figured another worse way
things can go wrong with ksm and o_direct: think a thread writing
constantly to the last 512bytes of a page, while another thread read
and writes to/from the first 512bytes of the page. We can lose
O_DIRECT reads, the very moment we mark any pte wrprotected...
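
For illustration, a caller of the new function would look roughly like
this.  This is a hedged sketch, not code from this series; the helper
name, the count_offset value and the error codes are assumptions:

#include <linux/rmap.h>

/* Write-protect every mapping of 'page' and bail out if the O_DIRECT
 * race described above was detected via odirect_sync. */
static int make_page_readonly(struct page *page)
{
	int odirect_sync = 1;

	/* count_offset = 1: account for the extra reference we hold */
	if (!page_wrprotect(page, &odirect_sync, 1))
		return -EAGAIN;		/* no pte could be write-protected */
	if (!odirect_sync)
		return -EBUSY;		/* possible O_DIRECT in flight, don't share */
	return 0;
}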

Signed-off-by: Izik Eidus iei...@redhat.com
---
 include/linux/rmap.h |   11 
 mm/rmap.c|  139 ++
 2 files changed, 150 insertions(+), 0 deletions(-)

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index b35bc0e..469376d 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -118,6 +118,10 @@ static inline int try_to_munlock(struct page *page)
 }
 #endif
 
+#if defined(CONFIG_KSM) || defined(CONFIG_KSM_MODULE)
+int page_wrprotect(struct page *page, int *odirect_sync, int count_offset);
+#endif
+
 #else  /* !CONFIG_MMU */
 
 #define anon_vma_init()do {} while (0)
@@ -132,6 +136,13 @@ static inline int page_mkclean(struct page *page)
return 0;
 }
 
+#if defined(CONFIG_KSM) || defined(CONFIG_KSM_MODULE)
+static inline int page_wrprotect(struct page *page, int *odirect_sync,
+int count_offset)
+{
+   return 0;
+}
+#endif
 
 #endif /* CONFIG_MMU */
 
diff --git a/mm/rmap.c b/mm/rmap.c
index 1652166..95c55ea 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -585,6 +585,145 @@ int page_mkclean(struct page *page)
 }
 EXPORT_SYMBOL_GPL(page_mkclean);
 
+#if defined(CONFIG_KSM) || defined(CONFIG_KSM_MODULE)
+
+static int page_wrprotect_one(struct page *page, struct vm_area_struct *vma,
+ int *odirect_sync, int count_offset)
+{
+	struct mm_struct *mm = vma->vm_mm;
+   unsigned long address;
+   pte_t *pte;
+   spinlock_t *ptl;
+   int ret = 0;
+
+   address = vma_address(page, vma);
+   if (address == -EFAULT)
+   goto out;
+
+	pte = page_check_address(page, mm, address, &ptl, 0);
+   if (!pte)
+   goto out;
+
+   if (pte_write(*pte)) {
+   pte_t entry;
+
+   flush_cache_page(vma, address, pte_pfn(*pte));
+		/*
+		 * Ok, this is tricky: when get_user_pages_fast() runs it doesn't
+		 * take any lock, therefore the check that we are going to make
+		 * with the page count against the map count is racy and
+		 * O_DIRECT can happen right after the check.
+		 * So we clear the pte and flush the tlb before the check;
+		 * this assures us that no O_DIRECT can happen after the check
+		 * or in the middle of the check.
+		 */
+   entry = ptep_clear_flush(vma, address, pte);
+   /*
+* Check that no O_DIRECT or similar I/O is in progress on the
+* page
+*/
+   if ((page_mapcount(page) + count_offset) != page_count(page)) {
+   *odirect_sync = 0;
+   set_pte_at_notify(mm, address, pte, entry);
+   goto out_unlock;
+   }
+   entry = pte_wrprotect(entry);
+   set_pte_at_notify(mm, address, pte, entry);
+   }
+   ret = 1;
+
+out_unlock:
+   pte_unmap_unlock(pte, ptl);
+out:
+   return ret;
+}
+
+static int page_wrprotect_file(struct page *page, int *odirect_sync,
+  int count_offset)
+{
+   struct address_space *mapping;
+   struct prio_tree_iter iter;
+   struct vm_area_struct *vma;
+	pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
+   int ret = 0;
+
+   mapping = page_mapping(page);
+   if (!mapping)
+   return ret;
+
+	spin_lock(&mapping->i_mmap_lock);
+
+	vma_prio_tree_foreach(vma, iter, &mapping->i_mmap, pgoff, pgoff)
+		ret += page_wrprotect_one(page, vma, odirect_sync,
+					  count_offset);
+
+	spin_unlock(&mapping->i_mmap_lock);
+
+   return ret;
+}
+
+static int page_wrprotect_anon(struct page *page, int *odirect_sync,
+  int count_offset)
+{
+   struct vm_area_struct *vma;
+   struct anon_vma *anon_vma;
+   int ret = 0;
+
+   anon_vma = page_lock_anon_vma(page);
+   if (!anon_vma)
+   return ret;
+
+   /*
+* If the page is inside the swap cache, its _count number was
+* increased by one, therefore we have to increase 

[PATCH 1/4] MMU_NOTIFIERS: add set_pte_at_notify()

2009-04-04 Thread Izik Eidus
This macro allows setting the pte in the shadow page tables directly,
instead of flushing the shadow page table entry and then taking a
vmexit in order to set it.

This is an optimization for kvm and other users of mmu_notifiers for
COW pages.  It is useful for kvm when ksm is used, because it allows
kvm to map the shared page into the mmu shadow pages directly, at the
same time Linux maps the page into the host page table, instead of
having to receive a VMEXIT first and only then map the shared page
into the mmu shadow pages.

The macro works by calling a callback that maps the physical page
directly into the shadow page tables.

(Users of mmu_notifiers that didn't implement the set_pte_at_notify()
callback will just receive the mmu_notifier_invalidate_page callback.)
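
For users of mmu_notifiers, the new callback would be hooked up roughly
like this (a minimal sketch, not part of this patch; the my_* names and
the handler body are illustrative, and all existing callbacks stay
exactly as they are today):

#include <linux/mm.h>
#include <linux/mmu_notifier.h>

/* Illustrative handler: remap 'address' in the secondary (shadow) page
 * tables to the new pfn carried in 'pte', instead of invalidating the
 * entry and taking a fault/VMEXIT later. */
static void my_change_pte(struct mmu_notifier *mn, struct mm_struct *mm,
			  unsigned long address, pte_t pte)
{
	/* update the shadow mapping for 'address' using pte_pfn(pte) */
}

static const struct mmu_notifier_ops my_mmu_notifier_ops = {
	.change_pte = my_change_pte,
	/* invalidate_page, clear_flush_young, ... unchanged and omitted */
};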

Signed-off-by: Izik Eidus iei...@redhat.com
---
 include/linux/mmu_notifier.h |   34 ++
 mm/memory.c  |   10 --
 mm/mmu_notifier.c|   20 
 3 files changed, 62 insertions(+), 2 deletions(-)

diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
index b77486d..8bb245f 100644
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -61,6 +61,15 @@ struct mmu_notifier_ops {
 struct mm_struct *mm,
 unsigned long address);
 
+	/*
+	 * change_pte is called in cases where the pte mapping of a page is
+	 * changed, for example when ksm remaps a pte to point to a new
+	 * shared page.
+	 */
+   void (*change_pte)(struct mmu_notifier *mn,
+  struct mm_struct *mm,
+  unsigned long address,
+  pte_t pte);
+
/*
 * Before this is invoked any secondary MMU is still ok to
 * read/write to the page previously pointed to by the Linux
@@ -154,6 +163,8 @@ extern void __mmu_notifier_mm_destroy(struct mm_struct *mm);
 extern void __mmu_notifier_release(struct mm_struct *mm);
 extern int __mmu_notifier_clear_flush_young(struct mm_struct *mm,
  unsigned long address);
+extern void __mmu_notifier_change_pte(struct mm_struct *mm, 
+ unsigned long address, pte_t pte);
 extern void __mmu_notifier_invalidate_page(struct mm_struct *mm,
  unsigned long address);
 extern void __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
@@ -175,6 +186,13 @@ static inline int mmu_notifier_clear_flush_young(struct 
mm_struct *mm,
return 0;
 }
 
+static inline void mmu_notifier_change_pte(struct mm_struct *mm,
+  unsigned long address, pte_t pte)
+{
+   if (mm_has_notifiers(mm))
+   __mmu_notifier_change_pte(mm, address, pte);
+}
+
 static inline void mmu_notifier_invalidate_page(struct mm_struct *mm,
  unsigned long address)
 {
@@ -236,6 +254,16 @@ static inline void mmu_notifier_mm_destroy(struct 
mm_struct *mm)
__young;\
 })
 
+#define set_pte_at_notify(__mm, __address, __ptep, __pte)  \
+({ \
+   struct mm_struct *___mm = __mm; \
+   unsigned long ___address = __address;   \
+   pte_t ___pte = __pte;   \
+   \
+   set_pte_at(__mm, __address, __ptep, ___pte);\
+   mmu_notifier_change_pte(___mm, ___address, ___pte); \
+})
+
 #else /* CONFIG_MMU_NOTIFIER */
 
 static inline void mmu_notifier_release(struct mm_struct *mm)
@@ -248,6 +276,11 @@ static inline int mmu_notifier_clear_flush_young(struct 
mm_struct *mm,
return 0;
 }
 
+static inline void mmu_notifier_change_pte(struct mm_struct *mm,
+  unsigned long address, pte_t pte)
+{
+}
+
 static inline void mmu_notifier_invalidate_page(struct mm_struct *mm,
  unsigned long address)
 {
@@ -273,6 +306,7 @@ static inline void mmu_notifier_mm_destroy(struct mm_struct 
*mm)
 
 #define ptep_clear_flush_young_notify ptep_clear_flush_young
 #define ptep_clear_flush_notify ptep_clear_flush
+#define set_pte_at_notify set_pte_at
 
 #endif /* CONFIG_MMU_NOTIFIER */
 
diff --git a/mm/memory.c b/mm/memory.c
index cf6873e..1e1a14b 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2051,9 +2051,15 @@ gotten:
 * seen in the presence of one thread doing SMC and another
 * thread doing COW.
 */
-   ptep_clear_flush_notify(vma, address, page_table);
+   ptep_clear_flush(vma, address, page_table);
page_add_new_anon_rmap(new_page, vma, address);
-  

[PATCH 0/4] ksm - dynamic page sharing driver for linux v2

2009-04-04 Thread Izik Eidus
From v1 to v2:

1) Fixed security issue found by Chris Wright:
Ksm was checking whether a page is a shared page by testing !PageAnon.
Because Ksm scans only anonymous memory, all !PageAnon pages
inside ksm data structures are shared pages; however there might
be a case for do_wp_page() when VM_SHARED is used where
do_wp_page(), instead of copying the page into a new anonymous
page, would reuse the page.  It was fixed by adding a check for the
dirty bit of the virtual addresses pointing into the shared page.
I did not find any VM code that would clear the dirty bit from
this virtual address (due to the fact that we allocate the page
using page_alloc() - kernel allocated pages), ~but I still want
confirmation about this from the vm guys - thanks.~

2) Moved to sysfs to control ksm:
It was requested as a better way to control the ksm scanning
thread than ioctls.
The sysfs api (a small usage sketch follows right after this list):
dir: /sys/kernel/mm/ksm/

kernel_pages_allocated - how many kernel pages ksm has allocated;
these pages are not swappable, and each such page is used by ksm
to share pages with identical content

pages_shared - how many pages were shared by ksm

run - set to 1 when you want ksm to run, 0 when not

max_kernel_pages - the maximum number of kernel pages to be
allocated by ksm; set 0 for unlimited

pages_to_scan - how many pages to scan before ksm will sleep

sleep - how many usecs ksm will sleep
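
A minimal sketch of driving this interface from userspace (the attribute
names are the ones listed above; the values are only illustrative):

#include <stdio.h>

/* write one value to an attribute under /sys/kernel/mm/ksm/ */
static int ksm_write(const char *attr, const char *val)
{
	char path[128];
	FILE *f;

	snprintf(path, sizeof(path), "/sys/kernel/mm/ksm/%s", attr);
	f = fopen(path, "w");
	if (!f)
		return -1;
	fputs(val, f);
	return fclose(f);
}

int main(void)
{
	ksm_write("pages_to_scan", "256");	/* pages scanned per wakeup */
	ksm_write("sleep", "5000");		/* usecs to sleep between batches */
	ksm_write("max_kernel_pages", "0");	/* 0 = unlimited */
	return ksm_write("run", "1");		/* start the scanner */
}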

3) Added a sysfs parameter to control the maximum number of kernel pages
to be allocated by ksm.

4) Added statistics about how many pages are really shared.


One issue still to be discussed:
There was a suggestion to use madvise(SHAREABLE) instead of using
ioctls to register memory that needs to be scanned by ksm.
Such a change is outside the area of ksm.c and would require adding
a new madvise api and changing some parts of the vm and kernel
code, so the first thing to do is figure out whether we really want this.

I don't know of any other open issues.

Thanks.

This is from the first post:
(The kvm part, together with the kvm-userspace part, was posted with V1
about a week ago; whoever wants to test ksm may download the
patch from the lkml archive.)

KSM is a linux driver that allows dynamically sharing identical memory
pages between one or more processes.

Unlike traditional page sharing that is done at the allocation of the
memory, ksm does it dynamically after the memory has been created.
Memory is periodically scanned; identical pages are identified and
merged.
The sharing is unnoticeable by the processes that use this memory.
(The shared pages are marked read-only, and in case of a write
do_wp_page() takes care of creating a new copy of the page.)

To find identical pages ksm uses an algorithm that is split into three
primary levels:

1) Ksm starts scanning the memory and calculates a checksum for each
   page that is registered to be scanned.
   (In the first round of scanning, ksm only calculates
   this checksum for all the pages.)

2) Ksm goes over the whole memory again and recalculates the
   checksums of the pages; pages that are found to have the same
   checksum value are considered pages that most likely
   won't change.
   Ksm inserts these pages into an RB-tree sorted by page content that
   is called the unstable tree.  The reason this tree is called
   unstable is that the page contents might change
   while they are still inside the tree, and therefore the tree could
   become corrupted.
   Due to this problem ksm takes two more steps in addition to the
   checksum calculation:
   a) Ksm throws away and recreates the entire unstable tree on each
      round of memory scanning - so if we have corruption, it will be
      fixed when we rebuild the tree.
   b) Ksm uses an RB-tree whose balancing is based on the node color
      and not on the content, so even if a page gets corrupted, it
      still takes the same amount of time to search it.

3) In addition to the unstable tree, ksm holds another tree called the
   stable tree - this tree is an RB-tree sorted by page content and
   all its pages are write protected, and therefore it can't get
   corrupted.
   Each time ksm finds two identical pages using the unstable tree,
   it creates a new write-protected shared page, and this page is
   inserted into the stable tree and saved there.  The stable tree,
   unlike the unstable tree, is never thrown away, so each page that
   we find is saved inside it.

Taking into account the three levels described above, the algorithm
works like this:

search primary tree (sorted by entire page contents, pages write protected)
- if match found, merge
- if no match found...
  - search secondary tree (sorted by entire page contents, pages not
    write protected)
    - if match found, merge
      - remove from secondary tree and insert merged page into primary tree
    - if no match found...
      - 

[PATCH 4/4] add ksm kernel shared memory driver.

2009-04-04 Thread Izik Eidus
Ksm is a driver that allows merging identical pages between one or more
applications, in a way invisible to the applications that use it.
Pages that are merged are marked read-only and are COWed when any
application tries to change them.

Ksm is used for cases where using fork() is not suitable;
one such case is where the pages of the application keep changing
dynamically and the application cannot know in advance which pages are
going to be identical.

Ksm works by walking over the memory pages of the applications it
scans in order to find identical pages.
It uses two sorted data structures called the stable and unstable trees
to find identical pages in an efficient way.

When ksm finds two identical pages, it marks them read-only and merges
them into a single page;
after the pages are marked read-only and merged into one page, linux
will treat them as normal copy-on-write pages and will copy them
when a write access happens.

Ksm scans just the memory areas that were registered to be scanned by it.

Ksm api (a short usage sketch follows the list):

KSM_GET_API_VERSION:
Gives userspace the api version of the module.

KSM_CREATE_SHARED_MEMORY_AREA:
Creates a shared memory region fd, which later allows the user to register
the memory regions to scan by using
KSM_REGISTER_MEMORY_REGION and KSM_REMOVE_MEMORY_REGION.

KSM_REGISTER_MEMORY_REGION:
Registers a userspace virtual address range to be scanned by ksm.
This ioctl uses the ksm_memory_region structure:
ksm_memory_region:
__u32 npages;
 number of pages to share inside this memory region.
__u32 pad;
__u64 addr:
 the beginning of the virtual address of this region.
__u64 reserved_bits;
 reserved bits for future usage.

KSM_REMOVE_MEMORY_REGION:
Removes a memory region from ksm.
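
A userspace caller would use these ioctls roughly as follows (a hedged
sketch assuming the header below is installed as <linux/ksm.h>; error
handling is trimmed and the fd/region handling is only illustrative):

#include <fcntl.h>
#include <sys/ioctl.h>
#include <linux/ksm.h>

/* register npages starting at addr for scanning by ksm */
int ksm_register(void *addr, unsigned int npages)
{
	struct ksm_memory_region region = {
		.npages = npages,
		.addr = (unsigned long)addr,
	};
	int ksm_fd, sma_fd;

	ksm_fd = open("/dev/ksm", O_RDWR);
	if (ksm_fd < 0 || ioctl(ksm_fd, KSM_GET_API_VERSION, 0) != KSM_API_VERSION)
		return -1;
	sma_fd = ioctl(ksm_fd, KSM_CREATE_SHARED_MEMORY_AREA, 0);
	if (sma_fd < 0)
		return -1;
	return ioctl(sma_fd, KSM_REGISTER_MEMORY_REGION, &region);
}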

Signed-off-by: Izik Eidus iei...@redhat.com
---
 include/linux/ksm.h|   48 ++
 include/linux/miscdevice.h |1 +
 mm/Kconfig |6 +
 mm/Makefile|1 +
 mm/ksm.c   | 1668 
 5 files changed, 1724 insertions(+), 0 deletions(-)
 create mode 100644 include/linux/ksm.h
 create mode 100644 mm/ksm.c

diff --git a/include/linux/ksm.h b/include/linux/ksm.h
new file mode 100644
index 000..2c11e9a
--- /dev/null
+++ b/include/linux/ksm.h
@@ -0,0 +1,48 @@
+#ifndef __LINUX_KSM_H
+#define __LINUX_KSM_H
+
+/*
+ * Userspace interface for /dev/ksm - kvm shared memory
+ */
+
+#include <linux/types.h>
+#include <linux/ioctl.h>
+
+#include <asm/types.h>
+
+#define KSM_API_VERSION 1
+
+#define ksm_control_flags_run 1
+
+/* for KSM_REGISTER_MEMORY_REGION */
+struct ksm_memory_region {
+   __u32 npages; /* number of pages to share */
+   __u32 pad;
+	__u64 addr; /* the beginning of the virtual address */
+__u64 reserved_bits;
+};
+
+#define KSMIO 0xAB
+
+/* ioctls for /dev/ksm */
+
+#define KSM_GET_API_VERSION  _IO(KSMIO,   0x00)
+/*
+ * KSM_CREATE_SHARED_MEMORY_AREA - create the shared memory region fd
+ */
+#define KSM_CREATE_SHARED_MEMORY_AREA_IO(KSMIO,   0x01) /* return SMA fd */
+
+/* ioctls for SMA fds */
+
+/*
+ * KSM_REGISTER_MEMORY_REGION - register virtual address memory area to be
+ * scanned by kvm.
+ */
+#define KSM_REGISTER_MEMORY_REGION   _IOW(KSMIO,  0x20,\
+ struct ksm_memory_region)
+/*
+ * KSM_REMOVE_MEMORY_REGION - remove virtual address memory area from ksm.
+ */
+#define KSM_REMOVE_MEMORY_REGION _IO(KSMIO,   0x21)
+
+#endif
diff --git a/include/linux/miscdevice.h b/include/linux/miscdevice.h
index beb6ec9..297c0bb 100644
--- a/include/linux/miscdevice.h
+++ b/include/linux/miscdevice.h
@@ -30,6 +30,7 @@
 #define HPET_MINOR 228
 #define FUSE_MINOR 229
 #define KVM_MINOR  232
+#define KSM_MINOR  233
 #define MISC_DYNAMIC_MINOR 255
 
 struct device;
diff --git a/mm/Kconfig b/mm/Kconfig
index b53427a..3f3fd04 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -223,3 +223,9 @@ config HAVE_MLOCKED_PAGE_BIT
 
 config MMU_NOTIFIER
bool
+
+config KSM
+	tristate "Enable KSM for page sharing"
+   help
+ Enable the KSM kernel module to allow page sharing of equal pages
+ among different tasks.
diff --git a/mm/Makefile b/mm/Makefile
index ec73c68..b885513 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -24,6 +24,7 @@ obj-$(CONFIG_SPARSEMEM_VMEMMAP) += sparse-vmemmap.o
 obj-$(CONFIG_TMPFS_POSIX_ACL) += shmem_acl.o
 obj-$(CONFIG_SLOB) += slob.o
 obj-$(CONFIG_MMU_NOTIFIER) += mmu_notifier.o
+obj-$(CONFIG_KSM) += ksm.o
 obj-$(CONFIG_PAGE_POISONING) += debug-pagealloc.o
 obj-$(CONFIG_SLAB) += slab.o
 obj-$(CONFIG_SLUB) += slub.o
diff --git a/mm/ksm.c b/mm/ksm.c
new file mode 100644
index 000..fb59a08
--- /dev/null
+++ b/mm/ksm.c
@@ -0,0 +1,1668 @@
+/*
+ * Memory merging driver for Linux
+ *
+ * This module enables dynamic sharing of identical pages found in different
+ * memory areas, even if they are not shared by fork()
+ *
+ * Copyright (C) 

Re: cr3 OOS optimisation breaks 32-bit GNU/kFreeBSD guest

2009-04-04 Thread Marcelo Tosatti
On Sat, Apr 04, 2009 at 01:37:39PM +0300, Avi Kivity wrote:
 diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
 index 09782a9..728be72 100644
 --- a/arch/x86/kvm/paging_tmpl.h
 +++ b/arch/x86/kvm/paging_tmpl.h
 @@ -308,8 +308,14 @@ static u64 *FNAME(fetch)(struct kvm_vcpu *vcpu, gva_t 
 addr,
  break;
  }
 -		if (is_shadow_present_pte(*sptep) && !is_large_pte(*sptep))
 +		if (is_shadow_present_pte(*sptep) && !is_large_pte(*sptep)) {
 +			if (level-1 == PT_PAGE_TABLE_LEVEL) {
 +				shadow_page = page_header(__pa(sptep));
 +				if (shadow_page->unsync && shadow_page->global)
 +					kvm_sync_page(vcpu, shadow_page);
 +			}
 			continue;
 +		}
 		if (is_large_pte(*sptep)) {
 			rmap_remove(vcpu->kvm, sptep);
   

 But here the shadow page is already linked?  Isn't the root cause that  
 an invlpg was called when the page wasn't linked, so it wasn't seen by  
 invlpg?

 So I thought the best place would be in fetch(), after  
 kvm_mmu_get_page().  If we're linking a page which contains global ptes,  
 they might be unsynced due to invlpgs that we've missed.

 Or am I missing something about the root cause?

The problem is when the page is unreachable due to a higher level path
being unlinked. Say:

level 4 -> level 3 . level 2 -> level 1 (global unsync)

The dot there means level 3 is not linked to level 2, so invlpg can't
reach the global unsync at level 1.

kvm_mmu_get_page does sync pages when it finds them, so the code is
already safe for the "linking a page which contains global ptes" case
you mention above.



Re: Can't boot guest with more than 3585MB when using large pages

2009-04-04 Thread Alex Williamson
On Fri, 2009-04-03 at 20:28 -0300, Marcelo Tosatti wrote:
 
 Can you please try the following

Thanks Marcelo, this seems to fix it.  I tested up to a 30G guest with
large pages.

Alex

 --
 
 qemu: kvm: fixup 4GB+ memslot large page alignment
 
 Need to align the 4GB+ memslot after we know its address, not before.
 
 Signed-off-by: Marcelo Tosatti mtosa...@redhat.com

Tested-by: Alex Williamson alex.william...@hp.com

 diff --git a/qemu/hw/pc.c b/qemu/hw/pc.c
 index d4a4320..cc84772 100644
 --- a/qemu/hw/pc.c
 +++ b/qemu/hw/pc.c
 @@ -866,6 +866,7 @@ static void pc_init1(ram_addr_t ram_size, int vga_ram_size,
  
  /* above 4giga memory allocation */
  if (above_4g_mem_size > 0) {
 +ram_addr = qemu_ram_alloc(above_4g_mem_size);
  if (hpagesize) {
  if (ram_addr & (hpagesize-1)) {
  unsigned long aligned_addr;
 @@ -874,7 +875,6 @@ static void pc_init1(ram_addr_t ram_size, int vga_ram_size,
  ram_addr = aligned_addr;
  }
  }
 -ram_addr = qemu_ram_alloc(above_4g_mem_size);
  cpu_register_physical_memory(0x1ULL,
   above_4g_mem_size,
   ram_addr);
 




Re: IOMMU setting

2009-04-04 Thread Muli Ben-Yehuda
On Sat, Apr 04, 2009 at 12:16:50AM +, Eric Liu wrote:
 
 Is there a quick way to check if system has IOMMU enabled in Linux?
  
 I saw the following messages in /var/log/messages:
  
 Apr  3 21:03:16 kernel: PCI-DMA: Disabling AGP.
 Apr  3 21:03:16 kernel: PCI-DMA: aperture base @ f400 size 65536 KB
 Apr  3 21:03:16 kernel: init_memory_mapping: f400-f800
 Apr  3 21:03:16 kernel: last_map_addr: f800 end: f800
 Apr  3 21:03:16 kernel: PCI-DMA: using GART IOMMU.
 Apr  3 21:03:16 kernel: PCI-DMA: Reserving 64MB of IOMMU area in the AGP 
 aperture
  
 Does this mean IOMMU is enabled? And I don't need anything like
 iommu=force in the boot options, right?

It means that you are running on an AMD system, and that this system
has a GART. You need an isolation-capable IOMMU such as Intel's VT-d
for KVM in-tree device passthrough.

Cheers,
Muli
-- 
Muli Ben-Yehuda | m...@il.ibm.com | +972-4-8281080
Manager, Virtualization and Systems Architecture
Master Inventor, IBM Haifa Research Laboratory

SYSTOR 2009---The Israeli Experimental Systems Conference
http://www.haifa.il.ibm.com/conferences/systor2009/


Re: KVM Port

2009-04-04 Thread kvm port
ok, so these are a few steps to begin:
(a) add a QEMUMachine for my h/w in qemu
(b) add arch support in kvm

I have a few questions:
(a) qemu starts in user space, so how would I configure my linux?  Should
linux run in hypervisor state and the apps run in user state, with
nothing running in guest state?  [there are 3 states in my processor]
(b) qemu starts the VM and somehow (I don't know yet how) starts
my code in processor guest state

-thanks



On Thu, Mar 26, 2009 at 4:16 PM, Avi Kivity a...@redhat.com wrote:
 kvm port wrote:

 AFAIK KVM userspace app creates a VM using /dev/kvm. Now if IO has a
 MMU managed by KVM arch module, do I still need qemu?



 qemu is needed to allocate memory, and to emulate devices.  For example the
 IDE disk controller is implemented in qemu.


 --
 error compiling committee.c: too many arguments to function




Re: cr3 OOS optimisation breaks 32-bit GNU/kFreeBSD guest

2009-04-04 Thread Aurelien Jarno
On Fri, Apr 03, 2009 at 06:45:48PM -0300, Marcelo Tosatti wrote:
 On Tue, Mar 24, 2009 at 11:47:33AM +0200, Avi Kivity wrote:
  index 2ea8262..48169d7 100644
  --- a/arch/x86/kvm/x86.c
  +++ b/arch/x86/kvm/x86.c
  @@ -3109,6 +3109,8 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run)
 	kvm_write_guest_time(vcpu);
 	if (test_and_clear_bit(KVM_REQ_MMU_SYNC, &vcpu->requests))
 		kvm_mmu_sync_roots(vcpu);
  +	if (test_and_clear_bit(KVM_REQ_MMU_GLOBAL_SYNC, &vcpu->requests))
  +		kvm_mmu_sync_global(vcpu);
 	if (test_and_clear_bit(KVM_REQ_TLB_FLUSH, &vcpu->requests))
 		kvm_x86_ops->tlb_flush(vcpu);
 	if (test_and_clear_bit(KVM_REQ_REPORT_TPR_ACCESS
 
  Windows will (I think) write a PDE on every context switch, so this  
  effectively disables global unsync for that guest.
 
  What about recursively syncing the newly linked page in FNAME(fetch)()?  
  If the page isn't global, this becomes a no-op, so no new overhead.  The  
  only question is the expense when linking a populated top-level page,  
  especially in long mode.
 
 How about this?
 
 KVM: MMU: sync global pages on fetch()
 
  If an unsync global page becomes unreachable via the shadow tree, which
  can happen if one of its parent pages is zapped, invlpg will fail to
  invalidate translations for gvas contained in such unreachable pages.
 
 So sync global pages in fetch().
 
 Signed-off-by: Marcelo Tosatti mtosa...@redhat.com

I have tried this patch, and unfortunately it does not solve the
original problem, while the previous one did.

-- 
Aurelien Jarno  GPG: 1024D/F1BCDB73
aurel...@aurel32.net http://www.aurel32.net


Re: virtio_net: MAC address releated breakage if there is no MAC area in config

2009-04-04 Thread David Miller
From: Christian Borntraeger borntrae...@de.ibm.com
Date: Thu, 2 Apr 2009 19:23:48 +0200

 Am Thursday 02 April 2009 18:06:25 schrieb Alex Williamson:
  virtio_net: Set the mac config only when VIRITO_NET_F_MAC
  
  VIRTIO_NET_F_MAC indicates the presence of the mac field in config
  space, not the validity of the value it contains.  Allow the mac to be
  changed at runtime, but only push the change into config space with the
  VIRTIO_NET_F_MAC feature present.
  
  Signed-off-by: Alex Williamson alex.william...@hp.com
 
 Acked-by: Christian Borntraeger borntrae...@de.ibm.com

Applied, thanks!


AHCI?

2009-04-04 Thread tsuraan
Is there any plan to add AHCI support to kvm?  It seems like it would
be an ideal alternative to the LSI SCSI driver, since AHCI is
supported by 64-bit Solaris as well as nearly every other modern OS.


Re: [RFC PATCH 00/17] virtual-bus

2009-04-04 Thread Rusty Russell
On Thursday 02 April 2009 02:40:29 Anthony Liguori wrote:
 Rusty Russell wrote:
  As you point out, 350-450 is possible, which is still bad, and it's at least
  partially caused by the exit to userspace and two system calls.  If 
  virtio_net
  had a backend in the kernel, we'd be able to compare numbers properly.
 
 I doubt the userspace exit is the problem.  On a modern system, it takes 
 about 1us to do a light-weight exit and about 2us to do a heavy-weight 
 exit.  A transition to userspace is only about ~150ns, the bulk of the 
 additional heavy-weight exit cost is from vcpu_put() within KVM.

Just to inject some facts, servicing a ping via tap (ie host->guest then
guest->host response) takes 26 system calls from one qemu thread, 7 from
another (see strace below). Judging by those futex calls, multiple context
switches, too.

 If you were to switch to another kernel thread, and I'm pretty sure you 
 have to, you're going to still see about a 2us exit cost.

He switches to another thread, too, but with the right infrastructure (ie.
skb data destructors) we could skip this as well.  (It'd be interesting to
see how virtual-bus performed on a single cpu host).

Cheers,
Rusty.

Pid 10260:
12:37:40.245785 select(17, [4 6 8 14 16], [], [], {0, 996000}) = 1 (in [6], 
left {0, 992000}) 0.003995
12:37:40.250226 read(6, 
\0\0\0\0\0\0\0\0\0\0RT\0\0224V*\211\24\210`\304\10\0E\0..., 69632) = 108 
0.51
12:37:40.250462 write(1, tap read: 108 bytes\n, 20) = 20 0.000197
12:37:40.250800 ioctl(7, 0x4008ae61, 0x7fff8cafb3a0) = 0 0.000223
12:37:40.251149 read(6, 0x115c6ac, 69632) = -1 EAGAIN (Resource temporarily 
unavailable) 0.19
12:37:40.251292 write(1, tap read: -1 bytes\n, 19) = 19 0.85
12:37:40.251488 clock_gettime(CLOCK_MONOTONIC, {1554, 633304282}) = 0 0.20
12:37:40.251604 clock_gettime(CLOCK_MONOTONIC, {1554, 633413793}) = 0 0.19
12:37:40.251717 futex(0xb81360, 0x81 /* FUTEX_??? */, 1) = 1 0.001222
12:37:40.253037 select(17, [4 6 8 14 16], [], [], {1, 0}) = 1 (in [16], left 
{1, 0}) 0.26
12:37:40.253196 read(16, 
\16\0\0\0\0\0\0\0\376\377\377\377\0\0\0\0\0\0\0\0\0\0\0..., 128) = 128 
0.22
12:37:40.253324 rt_sigaction(SIGALRM, NULL, {0x406d50, ~[KILL STOP RTMIN RT_1], 
SA_RESTORER, 0x7f1a842430f0}, 8) = 0 0.18
12:37:40.253477 write(5, \0, 1)   = 1 0.22
12:37:40.253585 read(16, 0x7fff8cb09440, 128) = -1 EAGAIN (Resource temporarily 
unavailable) 0.20
12:37:40.253687 clock_gettime(CLOCK_MONOTONIC, {1554, 635496181}) = 0 0.19
12:37:40.253798 writev(6, [{\0\0\0\0\0\0\0\0\0\0, 10}, 
{*\211\24\210`\304rt\0\0224v\10\0e\0\0t\255\262\...@\1g..., 98}], 2) = 108 
0.62
12:37:40.253993 ioctl(7, 0x4008ae61, 0x7fff8caff460) = 0 0.000161
12:37:40.254263 clock_gettime(CLOCK_MONOTONIC, {1554, 636077540}) = 0 0.19
12:37:40.254380 futex(0xb81360, 0x81 /* FUTEX_??? */, 1) = 1 0.000394
12:37:40.254861 select(17, [4 6 8 14 16], [], [], {1, 0}) = 1 (in [4], left {1, 
0}) 0.22
12:37:40.255001 read(4, \0, 512)  = 1 0.21
12:37:40.255109 read(4, 0x7fff8cb092d0, 512) = -1 EAGAIN (Resource temporarily 
unavailable) 0.18
12:37:40.255211 clock_gettime(CLOCK_MONOTONIC, {1554, 637020677}) = 0 0.19
12:37:40.255314 clock_gettime(CLOCK_MONOTONIC, {1554, 637123483}) = 0 0.19
12:37:40.255416 timer_gettime(0, {it_interval={0, 0}, it_value={0, 0}}) = 0 
0.18
12:37:40.255524 timer_settime(0, 0, {it_interval={0, 0}, it_value={0, 
1400}}, NULL) = 0 0.21
12:37:40.255635 clock_gettime(CLOCK_MONOTONIC, {1554, 637443915}) = 0 0.19
12:37:40.255739 clock_gettime(CLOCK_MONOTONIC, {1554, 637547001}) = 0 0.18
12:37:40.255847 select(17, [4 6 8 14 16], [], [], {1, 0}) = 1 (in [16], left 
{0, 988000}) 0.014303

Pid 10262:
12:37:40.252531 clock_gettime(CLOCK_MONOTONIC, {1554, 634339051}) = 0 0.18
12:37:40.252631 timer_gettime(0, {it_interval={0, 0}, it_value={0, 17549811}}) 
= 0 0.21
12:37:40.252750 timer_settime(0, 0, {it_interval={0, 0}, it_value={0, 25}}, 
NULL) = 0 0.24
12:37:40.252868 ioctl(11, 0xae80, 0)= 0 0.001171
12:37:40.254128 futex(0xb81360, 0x80 /* FUTEX_??? */, 2) = 0 0.000270
12:37:40.254490 ioctl(7, 0x4008ae61, 0x4134bee0) = 0 0.19
12:37:40.254598 futex(0xb81360, 0x81 /* FUTEX_??? */, 1) = 0 0.17
12:37:40.254693 ioctl(11, 0xae80 unfinished ...

fd:
lrwx-- 1 root root 64 2009-04-05 12:31 0 - /dev/pts/1 
lrwx-- 1 root root 64 2009-04-05 12:31 1 - /dev/pts/1 
lrwx-- 1 root root 64 2009-04-05 12:35 10 - 
/home/rusty/qemu-images/ubuntu-8.10 
   
lrwx-- 1 root root 64 2009-04-05 12:35 11 - anon_inode:kvm-vcpu
lrwx-- 1 root root 64 2009-04-05 12:35 12 - socket:[31414] 
lrwx-- 1 root root 64 2009-04-05 12:35 13 - socket:[31416] 
lrwx-- 1 root root 64 2009-04-05 12:35 14 - anon_inode:[eventfd]   
lrwx-- 1 root root 64 2009-04-05 12:35 15 - anon_inode:[eventfd]   
lrwx-- 1