date:20100510

Re: [Qemu-devel] Re: QEMU-KVM and video performance

2010-05-10 Thread Avi Kivity


On 05/09/2010 10:35 PM, Gerhard Wiesinger wrote:



Please run kvm_stat and report output for both tests to confirm.



See below. 2nd column is per second statistic when running the test.


efer_reload  0   0
exits 18470836  554582
fpu_reload 21478333469
halt_exits2083   0
halt_wakeup   2047   0
host_state_reload  21481863470
hypercalls   0   0
insn_emulation 7688203  554244

This indicates that kvm is emulating instead of direct mapping.  
That's probably a bug.  If you fix it, performance will increase 
dramatically.



Where can I start here?
Any ideas how to?

One of my ideas: Move hw/vga.c functions
vga_mem_readb
vga_mem_readw
vga_mem_readl
vga_mem_writeb
vga_mem_writew
vga_mem_writel
to KVM to avoid switching from KVM to QEMU (I can write C code even 
kernel but I'm not comfortable with KVM). Howto?


That is already done (generically), it's called coalesced mmio.  You 
only have 3470 qemu exits/sec compared to 554244 kvm writes/sec.


Switching between tcg and kvm is hard, but not needed.  For 256 color 
modes, direct map is possible and should yield good performance.  
Bank switching can be improved perhaps 3x, but will never be fast.


Where can I start for KVM performance for the bank switching (256 
color mode)? (e.g. BIOS writes to VGA window I/O port to switch the bank)

Any ideas how to improve (architecture for the change)?


For 256 color more the first priority is to find out why direct mapping 
is not used.  I'd suggest tracing the code that makes this decision (in 
hw/*vga.c) and seeing if it's right or not.


--
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] KVM: MMU: Don't read pdptrs with mmu spinlock held in mmu_alloc_roots

2010-05-10 Thread Avi Kivity


On 05/06/2010 10:10 PM, Marcelo Tosatti wrote:

On Tue, May 04, 2010 at 01:03:50PM +0300, Avi Kivity wrote:
   

On svm, kvm_read_pdptr() may require reading guest memory, which can sleep.

Push the spinlock into mmu_alloc_roots(), and only take it after we've read
the pdptr.

Marcelo, dropping and re-acquiring the lock before mmu_sync_roots(), is fine,
yes?
 

Yes, but you should call kvm_mmu_free_some_pages after reacquiring the
spin_lock, to guarantee kvm_mmu_alloc_page won't fail.

   


We never fail due to lack of free pages (but I agree it's cleaner).

The patch is already in, I'll send an incremental.

--
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[Autotest][PATCH] KVM Test: Update qemu-ifup script to set bridge's forwarding delay to 0.

2010-05-10 Thread Feng Yang


Our pxe case always fail. The problem is that the bridge takes some time 
to enter the forwarding state. Before that, packages are simply dropped by
the bridge.
If one sets:
$ brctl setfd bridge 0 # forward delay = 0

It works.


Signed-off-by: Feng Yang fy...@redhat.com
---
 client/tests/kvm/scripts/qemu-ifup |1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/client/tests/kvm/scripts/qemu-ifup 
b/client/tests/kvm/scripts/qemu-ifup
index bcd9a7a..3d23fb4 100755
--- a/client/tests/kvm/scripts/qemu-ifup
+++ b/client/tests/kvm/scripts/qemu-ifup
@@ -6,3 +6,4 @@ switch=$(/usr/sbin/brctl show | awk 'NR==2 { print $1 }')
 
 /sbin/ifconfig $1 0.0.0.0 up
 /usr/sbin/brctl addif ${switch} $1
+/usr/sbin/brctl setfd ${switch} 0
-- 
1.5.5.6

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Another SIGFPE in display code, now in cirrus

2010-05-10 Thread Avi Kivity


On 05/06/2010 11:07 PM, Michael Tokarev wrote:

There was a bug recently fixed in vnc code.  Apparently
there's something similar in the cirrus emulation as well.
Here it triggers _always_ (including old versions of kvm)
when running windows NT and hitting test button in its
display resolution dialog.  Here's what gdb is to say:

Program received signal SIGFPE, Arithmetic exception.
[Switching to Thread 0xf76cab70 (LWP 580)]
0x080c5e45 in cirrus_do_copy (s=0x86134dc, dst=96, src=0, w=2, h=9)
at hw/cirrus_vga.c:687
687sx = (src % ABS(s-cirrus_blt_srcpitch)) / depth;
(gdb) p depth
$1 = 2
(gdb) p s-cirrus_blt_srcpitch
$2 = 0


This qemu-kvm-0.12.3 - actually a debian package of it,
but there's no patches relevant to video applied.

Anything can be done with it?


Well, it's trivial to check for the condition, but how to handle it?

Need to find the spec for the chip.

--
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCHv2] Support for booting from virtio disks

2010-05-10 Thread Gleb Natapov

This patch adds native support for booting from virtio disks to Seabios.

Signed-off-by: Gleb Natapov g...@redhat.com
---

Changelog:
 v1-v2:
  - free memory in case of vq initialization error.
  - change license of virtio ring/pci to LGPLv3 with permission
of Laurent Vivier (aka the author).

diff --git a/Makefile b/Makefile
index 327a1bf..d0b8881 100644
--- a/Makefile
+++ b/Makefile
@@ -14,7 +14,8 @@ OUT=out/
 SRCBOTH=misc.c pmm.c stacks.c output.c util.c block.c floppy.c ata.c mouse.c \
 kbd.c pci.c serial.c clock.c pic.c cdrom.c ps2port.c smp.c resume.c \
 pnpbios.c pirtable.c vgahooks.c ramdisk.c pcibios.c blockcmd.c \
-usb.c usb-uhci.c usb-ohci.c usb-ehci.c usb-hid.c usb-msc.c
+usb.c usb-uhci.c usb-ohci.c usb-ehci.c usb-hid.c usb-msc.c \
+virtio-ring.c virtio-pci.c virtio-blk.c
 SRC16=$(SRCBOTH) system.c disk.c apm.c font.c
 SRC32FLAT=$(SRCBOTH) post.c shadow.c memmap.c coreboot.c boot.c \
   acpi.c smm.c mptable.c smbios.c pciinit.c optionroms.c mtrr.c \
diff --git a/src/block.c b/src/block.c
index ddf441f..b6b1902 100644
--- a/src/block.c
+++ b/src/block.c
@@ -11,6 +11,7 @@
 #include util.h // dprintf
 #include ata.h // process_ata_op
 #include usb-msc.h // process_usb_op
+#include virtio-blk.h // process_virtio_op
 
 struct drives_s Drives VAR16VISIBLE;
 
@@ -289,6 +290,8 @@ process_op(struct disk_op_s *op)
 return process_cdemu_op(op);
 case DTYPE_USB:
 return process_usb_op(op);
+case DTYPE_VIRTIO:
+   return process_virtio_op(op);
 default:
 op-count = 0;
 return DISK_RET_EPARAM;
diff --git a/src/config.h b/src/config.h
index b101174..ad569c6 100644
--- a/src/config.h
+++ b/src/config.h
@@ -136,6 +136,9 @@
 #define CONFIG_SUBMODEL_ID   0x00
 #define CONFIG_BIOS_REVISION 0x01
 
+// Support boot from virtio storage
+#define CONFIG_VIRTIO_BLK 1
+
 // Various memory addresses used by the code.
 #define BUILD_STACK_ADDR  0x7000
 #define BUILD_S3RESUME_STACK_ADDR 0x1000
diff --git a/src/disk.h b/src/disk.h
index 0cd1b74..9e5b083 100644
--- a/src/disk.h
+++ b/src/disk.h
@@ -197,6 +197,7 @@ struct drive_s {
 #define DTYPE_RAMDISK  0x04
 #define DTYPE_CDEMU0x05
 #define DTYPE_USB  0x06
+#define DTYPE_VIRTIO   0x07
 
 #define MAXDESCSIZE 80
 
diff --git a/src/pci_ids.h b/src/pci_ids.h
index 1800f1d..e1cded2 100644
--- a/src/pci_ids.h
+++ b/src/pci_ids.h
@@ -2605,3 +2605,6 @@
 #define PCI_DEVICE_ID_RME_DIGI32   0x9896
 #define PCI_DEVICE_ID_RME_DIGI32_PRO   0x9897
 #define PCI_DEVICE_ID_RME_DIGI32_8 0x9898
+
+#define PCI_VENDOR_ID_REDHAT_QUMRANET  0x1af4
+#define PCI_DEVICE_ID_VIRTIO_BLK   0x1001
diff --git a/src/post.c b/src/post.c
index 638b0f7..25535e2 100644
--- a/src/post.c
+++ b/src/post.c
@@ -23,6 +23,7 @@
 #include smbios.h // smbios_init
 #include paravirt.h // qemu_cfg_port_probe
 #include ps2port.h // ps2port_setup
+#include virtio-blk.h // virtio_blk_setup
 
 void
 __set_irq(int vector, void *loc)
@@ -184,6 +185,7 @@ init_hw(void)
 floppy_setup();
 ata_setup();
 ramdisk_setup();
+virtio_blk_setup();
 }
 
 // Main setup code.
diff --git a/src/virtio-blk.c b/src/virtio-blk.c
new file mode 100644
index 000..a41c336
--- /dev/null
+++ b/src/virtio-blk.c
@@ -0,0 +1,155 @@
+// Virtio blovl boot support.
+//
+// Copyright (C) 2010 Red Hat Inc.
+//
+// Authors:
+//  Gleb Natapov gnata...@redhat.com
+//
+// This file may be distributed under the terms of the GNU LGPLv3 license.
+
+#include util.h // dprintf
+#include pci.h // foreachpci
+#include config.h // CONFIG_*
+#include virtio-pci.h
+#include virtio-blk.h
+#include disk.h
+
+struct virtiodrive_s {
+struct drive_s drive;
+struct vring_virtqueue *vq;
+u16 ioaddr;
+};
+
+static int
+virtio_blk_read(struct disk_op_s *op)
+{
+struct virtiodrive_s *vdrive_g =
+container_of(op-drive_g, struct virtiodrive_s, drive);
+struct vring_virtqueue *vq = GET_GLOBAL(vdrive_g-vq);
+struct virtio_blk_outhdr hdr = {
+.type = VIRTIO_BLK_T_IN,
+.ioprio = 0,
+.sector = op-lba,
+};
+u8 status = VIRTIO_BLK_S_UNSUPP;
+struct vring_list sg[] = {
+{
+.addr  = MAKE_FLATPTR(GET_SEG(SS), hdr),
+.length= sizeof(hdr),
+},
+{
+.addr  = op-buf_fl,
+.length= GET_GLOBAL(vdrive_g-drive.blksize) * op-count,
+},
+{
+.addr  = MAKE_FLATPTR(GET_SEG(SS), status),
+.length= sizeof(status),
+},
+};
+
+/* Add to virtqueue and kick host */
+vring_add_buf(vq, sg, 1, 2, 0, 0);
+vring_kick(GET_GLOBAL(vdrive_g-ioaddr), vq, 1);
+
+/* Wait for reply */
+while (!vring_more_used(vq))
+udelay(5);
+
+/* Reclaim virtqueue element */
+vring_get_buf(vq, NULL);
+return status == VIRTIO_BLK_S_OK ? DISK_RET_SUCCESS : DISK_RET_EBADTRACK;
+}
+
+int
+process_virtio_op(struct disk_op_s *op)
+{
+switch (op-command) {

Re: Another SIGFPE in display code, now in cirrus

2010-05-10 Thread Avi Kivity


On 05/10/2010 10:41 AM, Avi Kivity wrote:

On 05/06/2010 11:07 PM, Michael Tokarev wrote:

There was a bug recently fixed in vnc code.  Apparently
there's something similar in the cirrus emulation as well.
Here it triggers _always_ (including old versions of kvm)
when running windows NT and hitting test button in its
display resolution dialog.  Here's what gdb is to say:

Program received signal SIGFPE, Arithmetic exception.
[Switching to Thread 0xf76cab70 (LWP 580)]
0x080c5e45 in cirrus_do_copy (s=0x86134dc, dst=96, src=0, w=2, h=9)
at hw/cirrus_vga.c:687
687sx = (src % ABS(s-cirrus_blt_srcpitch)) / depth;
(gdb) p depth
$1 = 2
(gdb) p s-cirrus_blt_srcpitch
$2 = 0


This qemu-kvm-0.12.3 - actually a debian package of it,
but there's no patches relevant to video applied.

Anything can be done with it?


Well, it's trivial to check for the condition, but how to handle it?



That code is wrong.  It shouldn't be using the bitblt source pitch, but 
the actual display pitch.  If the two are different, then the blt 
doesn't actually affect a rectangular region, but instead a parallelogram.


Further:



if (notify)
qemu_console_copy(s-vga.ds,
  sx, sy, dx, dy,
  s-cirrus_blt_width / depth,
  s-cirrus_blt_height);

/* we don't have to notify the display that this portion has
   changed since qemu_console_copy implies this */

cirrus_invalidate_region(s, s-cirrus_blt_dstaddr,
s-cirrus_blt_dstpitch, s-cirrus_blt_width,
s-cirrus_blt_height);



Shouldn't we avoid the invalidate if notify != 0, as the comment says?

31c05501c says this breaks bitblt, but I can't see why this is true.  
The copy should update the display.  This is probably due to a 
miscalculation of the affected region, and now we have two invalidates 
instead of one, reducing performance.


--
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

qemu-kvm-0.12.4 with bochs bios

2010-05-10 Thread Sasha Levin

Hi,

I've tried running an image using the new qemu-kvm-0.12.4 using a
BOCHS BIOS (using the -bios parameter).
When the machine starts the QEMU window gets minimized into a 1x1
pixel box and just stops.

Starting the same machine with the same BIOS on 0.12.3 is working great.

Is starting qemu using the BOCHS BIOS no longer supported? Or is it a bug?

Let me know if theres any debug info I can pull out to help finding the issue.

Commandline:
qemu-system-x86_64 -bios /usr/share/bochs/BIOS-bochs-latest
/path-to-img/server5.img

laptop bin # ./qemu-system-x86_64 --version
QEMU PC emulator version 0.12.4 (qemu-kvm-0.12.4), Copyright (c)
2003-2008 Fabrice Bellard

laptop bin # uname -a
Linux laptop 2.6.34-rc6-git3 #1 SMP PREEMPT Wed May 5 16:32:23 IDT
2010 x86_64 Intel(R) Core(TM)2 Duo CPU P8700 @ 2.53GHz GenuineIntel
GNU/Linux


Thanks,
Sasha
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCHv2] KVM: inject #UD if instruction emulation fails and exit to userspace

2010-05-10 Thread Gleb Natapov

Do not kill VM when instruction emulation fails. Inject #UD and report
failure to userspace instead. Userspace may choose to reenter guest if
vcpu is in userspace (cpl == 3) in which case guest OS will kill
offending process and continue running.

Signed-off-by: Gleb Natapov g...@redhat.com
---

Changelog:
  v1-v2:
- always inject #UD and report failure to userspace.

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index ed48904..5aa0944 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -575,7 +575,6 @@ enum emulation_result {
 #define EMULTYPE_SKIP  (1  2)
 int emulate_instruction(struct kvm_vcpu *vcpu,
unsigned long cr2, u16 error_code, int emulation_type);
-void kvm_report_emulation_failure(struct kvm_vcpu *cvpu, const char *context);
 void realmode_lgdt(struct kvm_vcpu *vcpu, u16 size, unsigned long address);
 void realmode_lidt(struct kvm_vcpu *vcpu, u16 size, unsigned long address);
 
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 95bee9d..41d2de8 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -2788,11 +2788,8 @@ int kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gva_t cr2, 
u32 error_code)
return 1;
case EMULATE_DO_MMIO:
++vcpu-stat.mmio_exits;
-   return 0;
+   /* fall through */
case EMULATE_FAIL:
-   vcpu-run-exit_reason = KVM_EXIT_INTERNAL_ERROR;
-   vcpu-run-internal.suberror = KVM_INTERNAL_ERROR_EMULATION;
-   vcpu-run-internal.ndata = 0;
return 0;
default:
BUG();
diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index cea916f..58c91f5 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -1449,7 +1449,7 @@ static int io_interception(struct vcpu_svm *svm)
string = (io_info  SVM_IOIO_STR_MASK) != 0;
in = (io_info  SVM_IOIO_TYPE_MASK) != 0;
if (string || in)
-   return !(emulate_instruction(vcpu, 0, 0, 0) == EMULATE_DO_MMIO);
+   return emulate_instruction(vcpu, 0, 0, 0) == EMULATE_DONE;
 
port = io_info  16;
size = (io_info  SVM_IOIO_SIZE_MASK)  SVM_IOIO_SIZE_SHIFT;
@@ -2300,16 +2300,12 @@ static int iret_interception(struct vcpu_svm *svm)
 
 static int invlpg_interception(struct vcpu_svm *svm)
 {
-   if (emulate_instruction(svm-vcpu, 0, 0, 0) != EMULATE_DONE)
-   pr_unimpl(svm-vcpu, %s: failed\n, __func__);
-   return 1;
+   return emulate_instruction(svm-vcpu, 0, 0, 0) == EMULATE_DONE;
 }
 
 static int emulate_on_interception(struct vcpu_svm *svm)
 {
-   if (emulate_instruction(svm-vcpu, 0, 0, 0) != EMULATE_DONE)
-   pr_unimpl(svm-vcpu, %s: failed\n, __func__);
-   return 1;
+   return emulate_instruction(svm-vcpu, 0, 0, 0) == EMULATE_DONE;
 }
 
 static int cr8_write_interception(struct vcpu_svm *svm)
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 777e00d..e35c479 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -3074,7 +3074,7 @@ static int handle_io(struct kvm_vcpu *vcpu)
++vcpu-stat.io_exits;
 
if (string || in)
-   return !(emulate_instruction(vcpu, 0, 0, 0) == EMULATE_DO_MMIO);
+   return emulate_instruction(vcpu, 0, 0, 0) == EMULATE_DONE;
 
port = exit_qualification  16;
size = (exit_qualification  7) + 1;
@@ -3331,22 +3331,7 @@ static int handle_wbinvd(struct kvm_vcpu *vcpu)
 
 static int handle_apic_access(struct kvm_vcpu *vcpu)
 {
-   unsigned long exit_qualification;
-   enum emulation_result er;
-   unsigned long offset;
-
-   exit_qualification = vmcs_readl(EXIT_QUALIFICATION);
-   offset = exit_qualification  0xffful;
-
-   er = emulate_instruction(vcpu, 0, 0, 0);
-
-   if (er !=  EMULATE_DONE) {
-   printk(KERN_ERR
-  Fail to handle apic access vmexit! Offset is 0x%lx\n,
-  offset);
-   return -ENOEXEC;
-   }
-   return 1;
+   return emulate_instruction(vcpu, 0, 0, 0) == EMULATE_DONE;
 }
 
 static int handle_task_switch(struct kvm_vcpu *vcpu)
@@ -3558,13 +3543,8 @@ static int handle_invalid_guest_state(struct kvm_vcpu 
*vcpu)
goto out;
}
 
-   if (err != EMULATE_DONE) {
-   vcpu-run-exit_reason = KVM_EXIT_INTERNAL_ERROR;
-   vcpu-run-internal.suberror = 
KVM_INTERNAL_ERROR_EMULATION;
-   vcpu-run-internal.ndata = 0;
-   ret = 0;
-   goto out;
-   }
+   if (err != EMULATE_DONE)
+   return 0;
 
if (signal_pending(current))
goto out;
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index cd8a606..b5eaed4 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -3587,24

[PATCH] Do not stop VM if emulation failed in userspace.

2010-05-10 Thread Gleb Natapov

Continue vcpu execution in case emulation failure happened while vcpu
was in userspace. In this case #UD will be injected into the guest
allowing guest OS to kill offending process and continue.

Signed-off-by: Gleb Natapov g...@redhat.com
diff --git a/kvm-all.c b/kvm-all.c
index 9ac35aa..db28d94 100644
--- a/kvm-all.c
+++ b/kvm-all.c
@@ -786,6 +786,8 @@ static void kvm_handle_internal_error(CPUState *env, struct 
kvm_run *run)
 cpu_dump_state(env, stderr, fprintf, 0);
 if (run-internal.suberror == KVM_INTERNAL_ERROR_EMULATION) {
 fprintf(stderr, emulation failure\n);
+if (!kvm_arch_stop_on_emulation_error(env))
+   return;
 }
 /* FIXME: Should trigger a qmp message to let management know
  * something went wrong.
diff --git a/kvm.h b/kvm.h
index 67c0392..79e59f5 100644
--- a/kvm.h
+++ b/kvm.h
@@ -187,4 +187,5 @@ int kvm_set_irqfd(int gsi, int fd, bool assigned)
 #endif
 
 int kvm_set_ioeventfd_pio_word(int fd, uint16_t adr, uint16_t val, bool 
assign);
+bool kvm_arch_stop_on_emulation_error(CPUState *env);
 #endif
diff --git a/target-i386/kvm.c b/target-i386/kvm.c
index 76c1adb..c333f86 100644
--- a/target-i386/kvm.c
+++ b/target-i386/kvm.c
@@ -1287,4 +1287,10 @@ void kvm_arch_update_guest_debug(CPUState *env, struct 
kvm_guest_debug *dbg)
 }
 #endif /* KVM_CAP_SET_GUEST_DEBUG */
 
+bool kvm_arch_stop_on_emulation_error(CPUState *env)
+{
+   return !(env-cr[0]  CR0_PE_MASK) ||
+   ((env-segs[R_CS].selector   3) != 3);
+}
+
 #include qemu-kvm-x86.c
diff --git a/target-ppc/kvm.c b/target-ppc/kvm.c
index aa3d432..7b246a1 100644
--- a/target-ppc/kvm.c
+++ b/target-ppc/kvm.c
@@ -321,3 +321,8 @@ uint32_t kvmppc_get_tbfreq(void)
 retval = atoi(ns);
 return retval;
 }
+
+bool kvm_arch_stop_on_emulation_error(CPUState *env)
+{
+return true;
+}
diff --git a/target-s390x/kvm.c b/target-s390x/kvm.c
index 72e77b0..14f723b 100644
--- a/target-s390x/kvm.c
+++ b/target-s390x/kvm.c
@@ -480,3 +480,8 @@ int kvm_arch_handle_exit(CPUState *env, struct kvm_run *run)
 
 return ret;
 }
+
+bool kvm_arch_stop_on_emulation_error(CPUState *env)
+{
+return true;
+}
--
Gleb.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCHv2] Support for booting from virtio disks

2010-05-10 Thread Stefan Hajnoczi

 diff --git a/src/virtio-blk.c b/src/virtio-blk.c
 new file mode 100644
 index 000..a41c336
 --- /dev/null
 +++ b/src/virtio-blk.c
 @@ -0,0 +1,155 @@
 +// Virtio blovl boot support.

Just noticed the blovl typo.

 +        char *desc = malloc_tmphigh(MAXDESCSIZE);
 +        struct virtiodrive_s *vdrive_g = malloc_fseg(sizeof(*vdrive_g));
 +        struct vring_virtqueue *vq = malloc_low(sizeof(*vq));
 +        if (!vdrive_g || !desc || !vq) {
 +            warn_noalloc();
 +            return;
 +        }

This error return can still leak.

Stefan
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCHv2] Support for booting from virtio disks

2010-05-10 Thread Gleb Natapov

On Mon, May 10, 2010 at 09:25:20AM +0100, Stefan Hajnoczi wrote:
  diff --git a/src/virtio-blk.c b/src/virtio-blk.c
  new file mode 100644
  index 000..a41c336
  --- /dev/null
  +++ b/src/virtio-blk.c
  @@ -0,0 +1,155 @@
  +// Virtio blovl boot support.
 
 Just noticed the blovl typo.
 
  +        char *desc = malloc_tmphigh(MAXDESCSIZE);
  +        struct virtiodrive_s *vdrive_g = malloc_fseg(sizeof(*vdrive_g));
  +        struct vring_virtqueue *vq = malloc_low(sizeof(*vq));
  +        if (!vdrive_g || !desc || !vq) {
  +            warn_noalloc();
  +            return;
  +        }
 
 This error return can still leak.
 
Oh Gosh, programming is hard. Why don't we write bios in python?

--
Gleb.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: qemu-kvm-0.12.4 with bochs bios

2010-05-10 Thread Avi Kivity


On 05/10/2010 11:16 AM, Sasha Levin wrote:

Hi,

I've tried running an image using the new qemu-kvm-0.12.4 using a
BOCHS BIOS (using the -bios parameter).
When the machine starts the QEMU window gets minimized into a 1x1
pixel box and just stops.

Starting the same machine with the same BIOS on 0.12.3 is working great.

Is starting qemu using the BOCHS BIOS no longer supported? Or is it a bug?
   


The bochs bios is not supported in 0.12.  Why do you need to use it?

If there's a simple fix that can make it work I don't we can include it.


Let me know if theres any debug info I can pull out to help finding the issue.
   


A bisection to find the offending commit would be good.  I suggest 
starting with upstream qemu 0.12.3 and 0.12.4.  If 0.12.3 works and 
0.12.4 fails, you can try a simple git bisect to find what commit 
introduced the problem.


--
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCHv3] Support for booting from virtio disks

2010-05-10 Thread Gleb Natapov

This patch adds native support for booting from virtio disks to Seabios.

Signed-off-by: Gleb Natapov g...@redhat.com
---

Changelog:
 v1-v2:
  - free memory in case of vq initialization error.
  - change license of virtio ring/pci to LGPLv3 with permission
of Laurent Vivier (aka the author).
 v2-v3:
  - free memory in case one of allocations failed

diff --git a/Makefile b/Makefile
index 327a1bf..d0b8881 100644
--- a/Makefile
+++ b/Makefile
@@ -14,7 +14,8 @@ OUT=out/
 SRCBOTH=misc.c pmm.c stacks.c output.c util.c block.c floppy.c ata.c mouse.c \
 kbd.c pci.c serial.c clock.c pic.c cdrom.c ps2port.c smp.c resume.c \
 pnpbios.c pirtable.c vgahooks.c ramdisk.c pcibios.c blockcmd.c \
-usb.c usb-uhci.c usb-ohci.c usb-ehci.c usb-hid.c usb-msc.c
+usb.c usb-uhci.c usb-ohci.c usb-ehci.c usb-hid.c usb-msc.c \
+virtio-ring.c virtio-pci.c virtio-blk.c
 SRC16=$(SRCBOTH) system.c disk.c apm.c font.c
 SRC32FLAT=$(SRCBOTH) post.c shadow.c memmap.c coreboot.c boot.c \
   acpi.c smm.c mptable.c smbios.c pciinit.c optionroms.c mtrr.c \
diff --git a/src/block.c b/src/block.c
index ddf441f..b6b1902 100644
--- a/src/block.c
+++ b/src/block.c
@@ -11,6 +11,7 @@
 #include util.h // dprintf
 #include ata.h // process_ata_op
 #include usb-msc.h // process_usb_op
+#include virtio-blk.h // process_virtio_op
 
 struct drives_s Drives VAR16VISIBLE;
 
@@ -289,6 +290,8 @@ process_op(struct disk_op_s *op)
 return process_cdemu_op(op);
 case DTYPE_USB:
 return process_usb_op(op);
+case DTYPE_VIRTIO:
+   return process_virtio_op(op);
 default:
 op-count = 0;
 return DISK_RET_EPARAM;
diff --git a/src/config.h b/src/config.h
index b101174..ad569c6 100644
--- a/src/config.h
+++ b/src/config.h
@@ -136,6 +136,9 @@
 #define CONFIG_SUBMODEL_ID   0x00
 #define CONFIG_BIOS_REVISION 0x01
 
+// Support boot from virtio storage
+#define CONFIG_VIRTIO_BLK 1
+
 // Various memory addresses used by the code.
 #define BUILD_STACK_ADDR  0x7000
 #define BUILD_S3RESUME_STACK_ADDR 0x1000
diff --git a/src/disk.h b/src/disk.h
index 0cd1b74..9e5b083 100644
--- a/src/disk.h
+++ b/src/disk.h
@@ -197,6 +197,7 @@ struct drive_s {
 #define DTYPE_RAMDISK  0x04
 #define DTYPE_CDEMU0x05
 #define DTYPE_USB  0x06
+#define DTYPE_VIRTIO   0x07
 
 #define MAXDESCSIZE 80
 
diff --git a/src/pci_ids.h b/src/pci_ids.h
index 1800f1d..e1cded2 100644
--- a/src/pci_ids.h
+++ b/src/pci_ids.h
@@ -2605,3 +2605,6 @@
 #define PCI_DEVICE_ID_RME_DIGI32   0x9896
 #define PCI_DEVICE_ID_RME_DIGI32_PRO   0x9897
 #define PCI_DEVICE_ID_RME_DIGI32_8 0x9898
+
+#define PCI_VENDOR_ID_REDHAT_QUMRANET  0x1af4
+#define PCI_DEVICE_ID_VIRTIO_BLK   0x1001
diff --git a/src/post.c b/src/post.c
index 638b0f7..25535e2 100644
--- a/src/post.c
+++ b/src/post.c
@@ -23,6 +23,7 @@
 #include smbios.h // smbios_init
 #include paravirt.h // qemu_cfg_port_probe
 #include ps2port.h // ps2port_setup
+#include virtio-blk.h // virtio_blk_setup
 
 void
 __set_irq(int vector, void *loc)
@@ -184,6 +185,7 @@ init_hw(void)
 floppy_setup();
 ata_setup();
 ramdisk_setup();
+virtio_blk_setup();
 }
 
 // Main setup code.
diff --git a/src/virtio-blk.c b/src/virtio-blk.c
new file mode 100644
index 000..f106bdf
--- /dev/null
+++ b/src/virtio-blk.c
@@ -0,0 +1,158 @@
+// Virtio block boot support.
+//
+// Copyright (C) 2010 Red Hat Inc.
+//
+// Authors:
+//  Gleb Natapov gnata...@redhat.com
+//
+// This file may be distributed under the terms of the GNU LGPLv3 license.
+
+#include util.h // dprintf
+#include pci.h // foreachpci
+#include config.h // CONFIG_*
+#include virtio-pci.h
+#include virtio-blk.h
+#include disk.h
+
+struct virtiodrive_s {
+struct drive_s drive;
+struct vring_virtqueue *vq;
+u16 ioaddr;
+};
+
+static int
+virtio_blk_read(struct disk_op_s *op)
+{
+struct virtiodrive_s *vdrive_g =
+container_of(op-drive_g, struct virtiodrive_s, drive);
+struct vring_virtqueue *vq = GET_GLOBAL(vdrive_g-vq);
+struct virtio_blk_outhdr hdr = {
+.type = VIRTIO_BLK_T_IN,
+.ioprio = 0,
+.sector = op-lba,
+};
+u8 status = VIRTIO_BLK_S_UNSUPP;
+struct vring_list sg[] = {
+{
+.addr  = MAKE_FLATPTR(GET_SEG(SS), hdr),
+.length= sizeof(hdr),
+},
+{
+.addr  = op-buf_fl,
+.length= GET_GLOBAL(vdrive_g-drive.blksize) * op-count,
+},
+{
+.addr  = MAKE_FLATPTR(GET_SEG(SS), status),
+.length= sizeof(status),
+},
+};
+
+/* Add to virtqueue and kick host */
+vring_add_buf(vq, sg, 1, 2, 0, 0);
+vring_kick(GET_GLOBAL(vdrive_g-ioaddr), vq, 1);
+
+/* Wait for reply */
+while (!vring_more_used(vq))
+udelay(5);
+
+/* Reclaim virtqueue element */
+vring_get_buf(vq, NULL);
+return status == VIRTIO_BLK_S_OK ? DISK_RET_SUCCESS : DISK_RET_EBADTRACK;
+}
+
+int

Re: qemu-kvm-0.12.4 with bochs bios

2010-05-10 Thread Sasha Levin

On Mon, May 10, 2010 at 11:35 AM, Avi Kivity a...@redhat.com wrote:
 On 05/10/2010 11:16 AM, Sasha Levin wrote:

 Hi,

 I've tried running an image using the new qemu-kvm-0.12.4 using a
 BOCHS BIOS (using the -bios parameter).
 When the machine starts the QEMU window gets minimized into a 1x1
 pixel box and just stops.

 Starting the same machine with the same BIOS on 0.12.3 is working great.

 Is starting qemu using the BOCHS BIOS no longer supported? Or is it a bug?


 The bochs bios is not supported in 0.12.  Why do you need to use it?

 If there's a simple fix that can make it work I don't we can include it.

I've tried doing cpu hotplug using cpu_set on 0.12.3 but it caused a
segfault, but now on 0.12.4 it seems to add the cpu - but it looks
like SeaBIOS doesn't support cpu hotplug.

I want to try doing the same with BOCHS.

 Let me know if theres any debug info I can pull out to help finding the
 issue.


 A bisection to find the offending commit would be good.  I suggest starting
 with upstream qemu 0.12.3 and 0.12.4.  If 0.12.3 works and 0.12.4 fails, you
 can try a simple git bisect to find what commit introduced the problem.


I'll send an update when I find the commit.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH v3 0/2] x86 FPU API

2010-05-10 Thread Avi Kivity


On 05/06/2010 11:45 AM, Avi Kivity wrote:

Currently all fpu accessors are wedded to task_struct.  However kvm also uses
the fpu in a different context.  Introduce an FPU API, and replace the
current uses with the new API.

While this patchset is oriented towards deeper changes, as a first step it
simlifies xsave for kvm.
   


Peter/Ingo, what are the plans for merging it?  The kvm xsave work 
depends on this.


--
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] kvm: update document of gfns

2010-05-10 Thread Avi Kivity


On 05/07/2010 11:47 AM, Lai Jiangshan wrote:

When don't use -gfns when role.direct=1, update document for it.

   



Please fold into the patch that does the change.

--
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [SeaBIOS] [PATCHv3] Support for booting from virtio disks

2010-05-10 Thread Stefan Hajnoczi

Looks good.

Stefan
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: virtio: put last_used and last_avail index into ring itself.

2010-05-10 Thread Rusty Russell

On Sun, 9 May 2010 06:27:33 pm Michael S. Tsirkin wrote:
 On Fri, May 07, 2010 at 12:35:39PM +0930, Rusty Russell wrote:
  Then there's padding to page boundary.  That puts us on a cacheline again
  for the used ring; also 2 bytes per entry.
  
 
 Hmm, is used ring really 2 bytes per entry?

Err, no, I am an idiot.

 /* u32 is used here for ids for padding reasons. */
 struct vring_used_elem {
 /* Index of start of used descriptor chain. */
 __u32 id;
 /* Total length of the descriptor chain which was used (written to) */
 __u32 len;
 };
 
 struct vring_used {
 __u16 flags;
 __u16 idx;
 struct vring_used_elem ring[];
 };

OK, now I get it.  Sorry, I was focussed on the avail ring.

 I thought that used ring has 8 bytes per entry, and that struct
 vring_used is aligned at page boundary, this
 would mean that ring element is at offset 4 bytes from page boundary.
 Thus with cacheline size 128 bytes, each 4th element crosses
 a cacheline boundary. If we had a 4 byte padding after idx, each
 used element would always be completely within a single cacheline.

I think the numbers are: every 16th entry hits two cachelines.  So currently
the first 15 entries are free (assuming we hit the idx cacheline anyway),
then 1 in 16 cost 2 cachelines.  That makes the aligned version win when
N  240.

But, we access the array linearly.  So the extra cacheline cost is in fact
amortized.  I doubt it could be measured, but maybe vring_get_buf() should
prefetch?  While you're there, we could use an  rather than a mod on the
calculation, which may actually be measurable :)

Cheers,
Rusty.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC PATCH] kvm: calculate correct gfn for small host pages which emulates large guest pages

2010-05-10 Thread Avi Kivity


On 04/30/2010 05:41 AM, Lai Jiangshan wrote:

Lai Jiangshan wrote:
   

RFC, because maybe I missing something with the old code.

Frome: Lai Jiangshanla...@cn.fujitsu.com

In Document/kvm/mmu.txt:
   gfn:
 Either the guest page table containing the translations shadowed by this
 page, or the base page frame for linear translations. See role.direct.

But in function FNAME(fetch)(), sp-gfn is incorrect when one of following
situations occurred:
  1) guest is 32bit paging and guest uses pse-36 and the guest PDE maps
 a 4-MByte page(backed by 4k host pages) and bits 20:13 of the guest PDE
 is not equals to 0.
  2) guest is long mode paging and the guest PDPTE maps a 1-GByte page
 (backed by 4k or 2M host pages)

 

Resend this patch with the changelog changed.

As Marcelo Tosatti and Gui Jianfeng points out,
FNAME(fetch)() miss quadrant on 4mb large page emulation with shadow.

Subject: [PATCH] kvm: calculate correct gfn for small host pages which emulates 
large guest pages

In Document/kvm/mmu.txt:
   gfn:
 Either the guest page table containing the translations shadowed by this
 page, or the base page frame for linear translations. See role.direct.

But in function FNAME(fetch)(), sp-gfn is incorrect when one of following
situations occurred:
  1) guest is 32bit paging and the guest PDE maps a 4-MByte page
 (backed by 4k host pages), FNAME(fetch)() miss handling the quadrant.

 And if guest use pse-36, table_gfn = gpte_to_gfn(gw-ptes[level - 
delta]);
 is incorrect.
  2) guest is long mode paging and the guest PDPTE maps a 1-GByte page
 (backed by 4k or 2M host pages).

So we fix it to suit to the document and suit to the code which
requires sp-gfn correct when sp-role.direct=1.

We use the goal mapping gfn(gw-gfn) to calculate the base page frame
for linear translations, it is simple and easy to be understood.

Signed-off-by: Lai Jiangshanla...@cn.fujitsu.com
---
diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
index 702c016..958e9c6 100644
--- a/arch/x86/kvm/paging_tmpl.h
+++ b/arch/x86/kvm/paging_tmpl.h
@@ -338,10 +338,13 @@ static u64 *FNAME(fetch)(struct kvm_vcpu *vcpu, gva_t 
addr,
direct = 1;
if (!is_dirty_gpte(gw-ptes[level - delta]))
access= ~ACC_WRITE_MASK;
-   table_gfn = gpte_to_gfn(gw-ptes[level - delta]);
-   /* advance table_gfn when emulating 1gb pages with 4k */
-   if (delta == 0)
-   table_gfn += PT_INDEX(addr, level);
+   /*
+* It is a large guest pages backed by small host pages,
+* So we set @direct(@shadow_page-role.direct)=1, and
+* set @table_gfn(@shadow_page-gfn)=the base page frame
+* for linear translations.
+*/
+   table_gfn = gw-gfn  ~(KVM_PAGES_PER_HPAGE(level) - 1);
} else {
direct = 0;
table_gfn = gw-table_gfn[level - 2];
   


Looks good, indeed it is a lot easier to understand than the original 
calculation (a minor issue is that the variable name is misleading, but 
that's a problem with kvm_mmu_page definition and not this patch).


--
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH] KVM: MMU: Fix free memory accounting race in mmu_alloc_roots()

2010-05-10 Thread Avi Kivity

We drop the mmu lock between freeing memory and allocating the roots; this
allows some other vcpu to sneak in and allocate memory.

While the race is benign (resulting only in temporary overallocation, not oom)
it is simple and easy to fix by moving the freeing close to the allocation.

Signed-off-by: Avi Kivity a...@redhat.com
---
 arch/x86/kvm/mmu.c |5 ++---
 1 files changed, 2 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 95bee9d..6857a2f 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -2067,6 +2067,7 @@ static int mmu_alloc_roots(struct kvm_vcpu *vcpu)
root_gfn = 0;
}
spin_lock(vcpu-kvm-mmu_lock);
+   kvm_mmu_free_some_pages(vcpu-kvm);
sp = kvm_mmu_get_page(vcpu, root_gfn, 0,
  PT64_ROOT_LEVEL, direct,
  ACC_ALL, NULL);
@@ -2097,6 +2098,7 @@ static int mmu_alloc_roots(struct kvm_vcpu *vcpu)
root_gfn = i  30;
}
spin_lock(vcpu-kvm-mmu_lock);
+   kvm_mmu_free_some_pages(vcpu-kvm);
sp = kvm_mmu_get_page(vcpu, root_gfn, i  30,
  PT32_ROOT_LEVEL, direct,
  ACC_ALL, NULL);
@@ -2470,9 +2472,6 @@ int kvm_mmu_load(struct kvm_vcpu *vcpu)
r = mmu_topup_memory_caches(vcpu);
if (r)
goto out;
-   spin_lock(vcpu-kvm-mmu_lock);
-   kvm_mmu_free_some_pages(vcpu);
-   spin_unlock(vcpu-kvm-mmu_lock);
r = mmu_alloc_roots(vcpu);
spin_lock(vcpu-kvm-mmu_lock);
mmu_sync_roots(vcpu);
-- 
1.7.0.4

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: qemu-kvm-0.12.4 with bochs bios

2010-05-10 Thread Michael Tokarev


10.05.2010 12:16, Sasha Levin wrote:

Hi,

I've tried running an image using the new qemu-kvm-0.12.4 using a
BOCHS BIOS (using the -bios parameter).
When the machine starts the QEMU window gets minimized into a 1x1
pixel box and just stops.

Starting the same machine with the same BIOS on 0.12.3 is working great.


Actually I tried this many times in different situation, and it never
worked for me. qemu-kvm-0.12.* with older bios (bochs) never boots for
me, it opens the SDL window but that window has nothing in it, it's
completely blank, and the whole thing sits here forever doing nothing.

Thanks!

/mjt
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

migration fails with virtio-blk

2010-05-10 Thread Michael Tokarev


Apparently there's an issue with migration when
virtio-blk is in use at the time migration occurs.

See Debian bug #580649 for details:
http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=580649

In short: 2.6.26 (guest) + 0.12.3 + virtio-blk fails.
2.6.32 guest works, 0.11 works, non-virtio-blk works.

Debian qemu-kvm is compiled without --enable-iothread -
JFYI.

The bugreport contains a backtrace taken from guest.

Now, one can say it's a bug in 2.6.26 kernel, which
is quite old.  But the problem does not occur with
qemu-kvm-0.11, so it might be treated as a regression
instead.

Thanks!

/mjt

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCHv2] KVM: inject #UD if instruction emulation fails and exit to userspace

2010-05-10 Thread Gleb Natapov

On Mon, May 10, 2010 at 11:16:56AM +0300, Gleb Natapov wrote:
 Do not kill VM when instruction emulation fails. Inject #UD and report
 failure to userspace instead. Userspace may choose to reenter guest if
 vcpu is in userspace (cpl == 3) in which case guest OS will kill
 offending process and continue running.
 
Please use this one instead. Compilation warning fixed.

Signed-off-by: Gleb Natapov g...@redhat.com
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index ed48904..5aa0944 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -575,7 +575,6 @@ enum emulation_result {
 #define EMULTYPE_SKIP  (1  2)
 int emulate_instruction(struct kvm_vcpu *vcpu,
unsigned long cr2, u16 error_code, int emulation_type);
-void kvm_report_emulation_failure(struct kvm_vcpu *cvpu, const char *context);
 void realmode_lgdt(struct kvm_vcpu *vcpu, u16 size, unsigned long address);
 void realmode_lidt(struct kvm_vcpu *vcpu, u16 size, unsigned long address);
 
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 95bee9d..41d2de8 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -2788,11 +2788,8 @@ int kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gva_t cr2, 
u32 error_code)
return 1;
case EMULATE_DO_MMIO:
++vcpu-stat.mmio_exits;
-   return 0;
+   /* fall through */
case EMULATE_FAIL:
-   vcpu-run-exit_reason = KVM_EXIT_INTERNAL_ERROR;
-   vcpu-run-internal.suberror = KVM_INTERNAL_ERROR_EMULATION;
-   vcpu-run-internal.ndata = 0;
return 0;
default:
BUG();
diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index cea916f..58c91f5 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -1449,7 +1449,7 @@ static int io_interception(struct vcpu_svm *svm)
string = (io_info  SVM_IOIO_STR_MASK) != 0;
in = (io_info  SVM_IOIO_TYPE_MASK) != 0;
if (string || in)
-   return !(emulate_instruction(vcpu, 0, 0, 0) == EMULATE_DO_MMIO);
+   return emulate_instruction(vcpu, 0, 0, 0) == EMULATE_DONE;
 
port = io_info  16;
size = (io_info  SVM_IOIO_SIZE_MASK)  SVM_IOIO_SIZE_SHIFT;
@@ -2300,16 +2300,12 @@ static int iret_interception(struct vcpu_svm *svm)
 
 static int invlpg_interception(struct vcpu_svm *svm)
 {
-   if (emulate_instruction(svm-vcpu, 0, 0, 0) != EMULATE_DONE)
-   pr_unimpl(svm-vcpu, %s: failed\n, __func__);
-   return 1;
+   return emulate_instruction(svm-vcpu, 0, 0, 0) == EMULATE_DONE;
 }
 
 static int emulate_on_interception(struct vcpu_svm *svm)
 {
-   if (emulate_instruction(svm-vcpu, 0, 0, 0) != EMULATE_DONE)
-   pr_unimpl(svm-vcpu, %s: failed\n, __func__);
-   return 1;
+   return emulate_instruction(svm-vcpu, 0, 0, 0) == EMULATE_DONE;
 }
 
 static int cr8_write_interception(struct vcpu_svm *svm)
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 777e00d..e35c479 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -3074,7 +3074,7 @@ static int handle_io(struct kvm_vcpu *vcpu)
++vcpu-stat.io_exits;
 
if (string || in)
-   return !(emulate_instruction(vcpu, 0, 0, 0) == EMULATE_DO_MMIO);
+   return emulate_instruction(vcpu, 0, 0, 0) == EMULATE_DONE;
 
port = exit_qualification  16;
size = (exit_qualification  7) + 1;
@@ -3331,22 +3331,7 @@ static int handle_wbinvd(struct kvm_vcpu *vcpu)
 
 static int handle_apic_access(struct kvm_vcpu *vcpu)
 {
-   unsigned long exit_qualification;
-   enum emulation_result er;
-   unsigned long offset;
-
-   exit_qualification = vmcs_readl(EXIT_QUALIFICATION);
-   offset = exit_qualification  0xffful;
-
-   er = emulate_instruction(vcpu, 0, 0, 0);
-
-   if (er !=  EMULATE_DONE) {
-   printk(KERN_ERR
-  Fail to handle apic access vmexit! Offset is 0x%lx\n,
-  offset);
-   return -ENOEXEC;
-   }
-   return 1;
+   return emulate_instruction(vcpu, 0, 0, 0) == EMULATE_DONE;
 }
 
 static int handle_task_switch(struct kvm_vcpu *vcpu)
@@ -3558,13 +3543,8 @@ static int handle_invalid_guest_state(struct kvm_vcpu 
*vcpu)
goto out;
}
 
-   if (err != EMULATE_DONE) {
-   vcpu-run-exit_reason = KVM_EXIT_INTERNAL_ERROR;
-   vcpu-run-internal.suberror = 
KVM_INTERNAL_ERROR_EMULATION;
-   vcpu-run-internal.ndata = 0;
-   ret = 0;
-   goto out;
-   }
+   if (err != EMULATE_DONE)
+   return 0;
 
if (signal_pending(current))
goto out;
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index cd8a606..eb845cf 100644
--- a/arch/x86/kvm/x86.c

Re: [Autotest] [PATCH 2/2] KVM test: Support to SLES install

2010-05-10 Thread pradeepkumar

On Wed, 10 Mar 2010 08:45:59 -0300
Lucas Meneghel Rodrigues l...@redhat.com wrote:

Hi Yogi/Lucas

Thanks for including SLES guests support in KVM autotest.
I tried SLES guest install. After succsfull install of SLES guest first
stage, again install started. Autotest keep installing first stage
in a loop. I guess some thing needs to be changed here,

  extra_params +=  -bootp /pxelinux.0 -boot n

Can you send a patch to fix this. Thanks in advance.

--SP

 
 Signed-off-by: Yogananth Subramanian anant...@linux.vnet.ibm.com
 ---
  client/tests/kvm/tests_base.cfg.sample |   22 ++
  1 files changed, 22 insertions(+), 0 deletions(-)
 
 diff --git a/client/tests/kvm/tests_base.cfg.sample
 b/client/tests/kvm/tests_base.cfg.sample index c76470d..acb2076 100644
 --- a/client/tests/kvm/tests_base.cfg.sample
 +++ b/client/tests/kvm/tests_base.cfg.sample
 @@ -503,6 +503,28 @@ variants:
  md5sum = 2afee1b8a87175e6dee2b8dbbd1ad8e8
  md5sum_1m = 768ca32503ef92c28f2d144f2a87e4d0
 
 +- SLES:
 +no setup
 +shell_prompt = ^r...@.*[\#\$]\s*$|#
 +unattended_install:
 +pxe_image = linux
 +pxe_initrd = initrd
 +extra_params +=  -bootp /pxelinux.0 -boot n
 +kernel_args = autoyast=floppy
 +
 +variants:
 +- 11.64:
 +no setup
 +image_name = sles11-64
 +cdrom=linux/SLES-11-DVD-x86_64-GM-DVD1.iso
 +md5sum = 50a2bd45cd12c3808c3ee48208e2586b
 +md5sum_1m = 0951cab7c32e332362fc424c1054
 +unattended_install:
 +unattended_file =
 unattended/Sles11-64-autoinst.xml
 +tftp = images/sles11-64/tftpboot
 +floppy = images/sles11-64floppy.img
 +pxe_dir = boot/x86_64/loader
 +
  - @Ubuntu:
  shell_prompt = ^r...@.*[\#\$]\s*$
 

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH v5 2/5] Support adding a file to qemu's ram allocation

2010-05-10 Thread Avi Kivity


On 04/21/2010 08:53 PM, Cam Macdonell wrote:

This avoids the need of using qemu_ram_alloc and mmap with MAP_FIXED to map a
host file into guest RAM.  This function mmaps the opened file anywhere and adds
the memory to the ram blocks.

Usage is

qemu_ram_mmap(fd, size, MAP_SHARED, offset);
   


Signoff?


+ram_addr_t qemu_ram_mmap(int fd, ram_addr_t size, int flags, off_t offset)
+{
+RAMBlock *new_block;
+
+size = TARGET_PAGE_ALIGN(size);
+new_block = qemu_malloc(sizeof(*new_block));
+
+/* map the file passed as a parameter to be this part of memory */
+new_block-host = mmap(0, size, PROT_READ|PROT_WRITE, flags, fd, offset);
+
+if (new_block-host == MAP_FAILED)
+exit(1);
   


Braces after if ()


+if (kvm_enabled())
+kvm_setup_guest_memory(new_block-host, size);
+
   


More braces.

--
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH v5 3/5] Add functions for assigning ioeventfd and irqfds.

2010-05-10 Thread Avi Kivity


On 04/21/2010 08:53 PM, Cam Macdonell wrote:

Generic functions to assign irqfds and ioeventfds.

   


Signoff.


  }

  #ifdef KVM_IOEVENTFD
+int kvm_set_irqfd(int fd, uint16_t vector, uint32_t gsi)
+{
+struct kvm_irqfd call = { };
+int r;
+
+call.fd = fd;
+call.gsi = gsi;
   



+
+if (!kvm_enabled())
+return -ENOSYS;
   


Braces, here and elsewhere.


+r = kvm_vm_ioctl(kvm_state, KVM_IRQFD,call);
+
+if (r  0) {
+return r;
   


-errno


+}
+return 0;
+}
+
+int kvm_set_ioeventfd_mmio_long(int fd, uint32_t addr, uint32_t val, bool 
assign)
+{
+
+int ret;
+struct kvm_ioeventfd iofd;
+
+iofd.datamatch = val;
+iofd.addr = addr;
+iofd.len = 4;
+iofd.flags = KVM_IOEVENTFD_FLAG_DATAMATCH;
+iofd.fd = fd;
+
+if (!kvm_enabled())
+return -ENOSYS;
+if (!assign)
+iofd.flags |= KVM_IOEVENTFD_FLAG_DEASSIGN;
   


May be more usable to have separate assign and deassign functions (that 
can call into a single internal implementation).



+
+ret = kvm_vm_ioctl(kvm_state, KVM_IOEVENTFD,iofd);
+
+if (ret  0) {
+return ret;
   


-errno


+}
+
+return 0;
+}
+
   


--
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

network between host and guest

2010-05-10 Thread Thanasis

I have installed kvm (app-emulation/qemu-kvm-0.12.3-r1) on linux (a
laptop with gentoo linux) and MS Windows 7 as guest.
Here is what info I get about the network on each one:
1) on linux (host):

eth0  Link encap:Ethernet  HWaddr 00:03:25:43:7f:93 
  inet addr:192.168.0.13  Bcast:255.255.255.255  Mask:255.255.255.0
  inet6 addr: fe80::203:25ff:fe43:7f93/64 Scope:Link
  UP BROADCAST RUNNING MULTICAST  MTU:1492  Metric:1
  RX packets:406972 errors:0 dropped:0 overruns:0 frame:0
  TX packets:5268868 errors:0 dropped:0 overruns:0 carrier:0
  collisions:0 txqueuelen:1000
  RX bytes:132098724 (125.9 MiB)  TX bytes:7683259267 (7.1 GiB)
  Interrupt:26 Base address:0x2000

loLink encap:Local Loopback 
  inet addr:127.0.0.1  Mask:255.0.0.0
  inet6 addr: ::1/128 Scope:Host
  UP LOOPBACK RUNNING  MTU:16436  Metric:1
  RX packets:505 errors:0 dropped:0 overruns:0 frame:0
  TX packets:505 errors:0 dropped:0 overruns:0 carrier:0
  collisions:0 txqueuelen:0
  RX bytes:65800 (64.2 KiB)  TX bytes:65800 (64.2 KiB)

wlan0 Link encap:Ethernet  HWaddr 00:16:44:8e:5a:0b 
  BROADCAST MULTICAST  MTU:1500  Metric:1
  RX packets:0 errors:0 dropped:0 overruns:0 frame:0
  TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
  collisions:0 txqueuelen:1000
  RX bytes:0 (0.0 B)  TX bytes:0 (0.0 B)

2) on the guest (sorry I could't find a way copy, so I took a screenshot):
http://i999.photobucket.com/albums/af112/aggman/Screenshot-QEMU.jpg

Question: Where is my interface on linux with IP address of 10.0.2.2 ?
The guest can actually access the internet (via the host), but where can
I see that hidden routing that linux provides to it?

PS: forgive my ignorance, I am new to kvm.

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

RE: virtio-win problem

2010-05-10 Thread Yan Vugenfirer

 -Original Message-
 From: kvm-ow...@vger.kernel.org [mailto:kvm-ow...@vger.kernel.org] On
 Behalf Of Martin Maurer
 Sent: Thursday, May 06, 2010 5:59 PM
 To: kvm@vger.kernel.org
 Subject: RE: virtio-win problem

  -Original Message-
  From: kvm-ow...@vger.kernel.org [mailto:kvm-ow...@vger.kernel.org] On
  Behalf Of Riccardo Veraldi
  Sent: Donnerstag, 06. Mai 2010 14:10
  To: kvm@vger.kernel.org
  Subject: virtio-win problem

  Hello,
  if I install virtio-win drivers on windows 2008 Server R2, I have the
  problem of signed device drivers.
  I Can install the drivers but Windows 2008 server refuses to use them
  unless I start
  the machine pressing F8 every time at each reboot bypassing the
  checking
  of signed certified drivers, and this is annoying,
  since I Cannot reboot the virtual machien automatically.

  Anyone solved this issue ?

 Redhat released signed drivers working on win2008r2, but no public
 download or free use (you need a subscription, it's a special license)

 Or you need to follow the howtos on the KVM wiki pages,
 http://www.linux-kvm.org/page/WindowsGuestDrivers/Download_Drivers

[YV] You could test sign the drivers with our own certificate
(http://www.rage3d.com/board/showthread.php?t=33920573) and enable the
usage of test signed drivers (Bcdedit.exe -set TESTSIGNING ON). 

 Br, Martin

 --
 To unsubscribe from this list: send the line unsubscribe kvm in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: network between host and guest

2010-05-10 Thread Liang Guo

在 星期一 10 5月 2010 18:46:20，Thanasis 写道：

 
 Question: Where is my interface on linux with IP address of 10.0.2.2 ?
 The guest can actually access the internet (via the host), but where can
 I see that hidden routing that linux provides to it?
 

I think you are using user networking(slirp) , with user networking, etch tcp 
connection is mapped to a tcp connection in host's kvm process, and  there
is no host route to the guest.  Please visit 
http://wiki.qemu.org/Documentation/Networking
for detailed information. 

-- 
Liang Guo
http://bluestone.cublog.cn
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC][PATCH resend 8/12] asm-generic: bitops: introduce le bit offset macro

2010-05-10 Thread Takuya Yoshikawa


(2010/05/06 22:38), Arnd Bergmann wrote:

On Wednesday 05 May 2010, Takuya Yoshikawa wrote:

Date:
Yesterday 04:59:24

That's why the bitmaps are defined as little endian u64 aligned, even on
big endian 32-bit systems.  Little endian bitmaps are wordsize agnostic,
and u64 alignment ensures we can use long-sized bitops on mixed size
systems.


Ok, I see.


There was a suggestion to propose set_le_bit_user() kind of macros.
But what I thought was these have a constraint you two explained and seemed to 
be
a little bit specific to some area, like KVM.

So I decided to propose just the offset calculation macro.


I'm not sure I understand how this macro is going to be used though.
If you are just using this in kernel space, that's fine, please go for
it.


Yes, I'm just using in kernel space: qemu has its own endian related helpers.

So if you allow us to place this macro in asm-generic/bitops/* it will help us.

Avi, what do you think? Do you want to place it in kvm.h ?




However, if the intention is to use the same macro in user space, putting
it into asm-generic/bitops/* is not going to help, because those headers
are not available in user space, and I wouldn't want to change that.

The definition of the macro is not part of the ABI, so just duplicate
it in KVM if you need it there.

Arnd


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH v5 4/5] Inter-VM shared memory PCI device

2010-05-10 Thread Avi Kivity


On 04/21/2010 08:53 PM, Cam Macdonell wrote:

Support an inter-vm shared memory device that maps a shared-memory object as a
PCI device in the guest.  This patch also supports interrupts between guest by
communicating over a unix domain socket.  This patch applies to the qemu-kvm
repository.

 -device ivshmem,size=size in format accepted by -m[,shm=shm name]

Interrupts are supported between multiple VMs by using a shared memory server
by using a chardev socket.

 -device ivshmem,size=size in format accepted by -m[,shm=shm name]
 [,chardev=id][,msi=on][,irqfd=on][,vectors=n]
 -chardev socket,path=path,id=id

(shared memory server is qemu.git/contrib/ivshmem-server)

Sample programs and init scripts are in a git repo here:


+typedef struct EventfdEntry {
+PCIDevice *pdev;
+int vector;
+} EventfdEntry;
+
+typedef struct IVShmemState {
+PCIDevice dev;
+uint32_t intrmask;
+uint32_t intrstatus;
+uint32_t doorbell;
+
+CharDriverState * chr;
+CharDriverState ** eventfd_chr;
+int ivshmem_mmio_io_addr;
+
+pcibus_t mmio_addr;
+unsigned long ivshmem_offset;
+uint64_t ivshmem_size; /* size of shared memory region */
+int shm_fd; /* shared memory file descriptor */
+
+int nr_allocated_vms;
+/* array of eventfds for each guest */
+int ** eventfds;
+/* keep track of # of eventfds for each guest*/
+int * eventfds_posn_count;
   


More readable:

  typedef struct Peer {
  int nb_eventfds;
  int *eventfds;
  } Peer;
  int nb_peers;
  Peer *peers;

Does eventfd_chr need to be there as well?


+
+int nr_alloc_guests;
+int vm_id;
+int num_eventfds;
+uint32_t vectors;
+uint32_t features;
+EventfdEntry *eventfd_table;
+
+char * shmobj;
+char * sizearg;
   


Does this need to be part of the state?


+} IVShmemState;
+
+/* registers for the Inter-VM shared memory device */
+enum ivshmem_registers {
+IntrMask = 0,
+IntrStatus = 4,
+IVPosition = 8,
+Doorbell = 12,
+};
+
+static inline uint32_t ivshmem_has_feature(IVShmemState *ivs, int feature) {
+return (ivs-features  (1  feature));
+}
+
+static inline int is_power_of_two(int x) {
+return (x  (x-1)) == 0;
+}
   


argument needs to be uint64_t to avoid overflow with large BARs.  Return 
type can be bool.



+static void ivshmem_io_writel(void *opaque, uint8_t addr, uint32_t val)
+{
+IVShmemState *s = opaque;
+
+u_int64_t write_one = 1;
+u_int16_t dest = val  16;
+u_int16_t vector = val  0xff;
+
+addr= 0xfe;
   


Why 0xfe?  Can understand 0xfc or 0xff.


+
+switch (addr)
+{
+case IntrMask:
+ivshmem_IntrMask_write(s, val);
+break;
+
+case IntrStatus:
+ivshmem_IntrStatus_write(s, val);
+break;
+
+case Doorbell:
+/* check doorbell range */
+if ((vector= 0)  (vector  s-eventfds_posn_count[dest])) {
   


What if dest is too big?  We overflow s-eventfds_posn_count.

+
+static void close_guest_eventfds(IVShmemState *s, int posn)
+{
+int i, guest_curr_max;
+
+guest_curr_max = s-eventfds_posn_count[posn];
+
+for (i = 0; i  guest_curr_max; i++)
+close(s-eventfds[posn][i]);
+
+free(s-eventfds[posn]);
   


qemu_free().


+/* this function increase the dynamic storage need to store data about other
+ * guests */
+static void increase_dynamic_storage(IVShmemState *s, int new_min_size) {
+
+int j, old_nr_alloc;
+
+old_nr_alloc = s-nr_alloc_guests;
+
+while (s-nr_alloc_guests  new_min_size)
+s-nr_alloc_guests = s-nr_alloc_guests * 2;
+
+IVSHMEM_DPRINTF(bumping storage to %d guests\n, s-nr_alloc_guests);
+s-eventfds = qemu_realloc(s-eventfds, s-nr_alloc_guests *
+sizeof(int *));
+s-eventfds_posn_count = qemu_realloc(s-eventfds_posn_count,
+s-nr_alloc_guests *
+sizeof(int));
+s-eventfd_table = qemu_realloc(s-eventfd_table, s-nr_alloc_guests *
+sizeof(EventfdEntry));
+
+if ((s-eventfds == NULL) || (s-eventfds_posn_count == NULL) ||
+(s-eventfd_table == NULL)) {
+fprintf(stderr, Allocation error - exiting\n);
+exit(1);
+}
+
+if (!ivshmem_has_feature(s, IVSHMEM_IRQFD)) {
+s-eventfd_chr = (CharDriverState **)qemu_realloc(s-eventfd_chr,
+s-nr_alloc_guests * sizeof(void *));
+if (s-eventfd_chr == NULL) {
+fprintf(stderr, Allocation error - exiting\n);
+exit(1);
+}
+}
+
+/* zero out new pointers */
+for (j = old_nr_alloc; j  s-nr_alloc_guests; j++) {
+s-eventfds[j] = NULL;
   


eventfds_posn_count and eventfd_table want zeroing as well.


+}
+}
+
+static void ivshmem_read(void *opaque, const

Re: [RFC][PATCH resend 8/12] asm-generic: bitops: introduce le bit offset macro

2010-05-10 Thread Avi Kivity


On 05/10/2010 02:46 PM, Takuya Yoshikawa wrote:

(2010/05/06 22:38), Arnd Bergmann wrote:

On Wednesday 05 May 2010, Takuya Yoshikawa wrote:

Date:
Yesterday 04:59:24
That's why the bitmaps are defined as little endian u64 aligned, 
even on
big endian 32-bit systems.  Little endian bitmaps are wordsize 
agnostic,

and u64 alignment ensures we can use long-sized bitops on mixed size
systems.


Ok, I see.


There was a suggestion to propose set_le_bit_user() kind of macros.
But what I thought was these have a constraint you two explained and 
seemed to be

a little bit specific to some area, like KVM.

So I decided to propose just the offset calculation macro.


I'm not sure I understand how this macro is going to be used though.
If you are just using this in kernel space, that's fine, please go for
it.


Yes, I'm just using in kernel space: qemu has its own endian related 
helpers.


So if you allow us to place this macro in asm-generic/bitops/* it will 
help us.


Avi, what do you think? Do you want to place it in kvm.h ?


I really prefer anything that is generic to be outside kvm, even if kvm 
is the only user.


--
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC][PATCH resend 8/12] asm-generic: bitops: introduce le bit offset macro

2010-05-10 Thread Arnd Bergmann

On Monday 10 May 2010, Takuya Yoshikawa wrote:
 (2010/05/06 22:38), Arnd Bergmann wrote:
  On Wednesday 05 May 2010, Takuya Yoshikawa wrote:
  There was a suggestion to propose set_le_bit_user() kind of macros.
  But what I thought was these have a constraint you two explained and 
  seemed to be
  a little bit specific to some area, like KVM.
 
  So I decided to propose just the offset calculation macro.
 
  I'm not sure I understand how this macro is going to be used though.
  If you are just using this in kernel space, that's fine, please go for
  it.
 
 Yes, I'm just using in kernel space: qemu has its own endian related helpers.
 
 So if you allow us to place this macro in asm-generic/bitops/* it will help 
 us.

No problem at all then. Thanks for the explanation.

Acked-by: Arnd Bergmann a...@arndb.de
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC][PATCH resend 8/12] asm-generic: bitops: introduce le bit offset macro

2010-05-10 Thread Takuya Yoshikawa





Yes, I'm just using in kernel space: qemu has its own endian related helpers.

So if you allow us to place this macro in asm-generic/bitops/* it will help us.


No problem at all then. Thanks for the explanation.

Acked-by: Arnd Bergmanna...@arndb.de


Thanks you both. I will add your Acked-by from now on!

  Takuya
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC][PATCH 0/12] KVM, x86, ppc, asm-generic: moving dirty bitmaps to user space

2010-05-10 Thread Avi Kivity


On 05/04/2010 03:56 PM, Takuya Yoshikawa wrote:

[Performance test]

We measured the tsc needed to the ioctl()s for getting dirty logs in
kernel.

Test environment

   AMD Phenom(tm) 9850 Quad-Core Processor with 8GB memory


1. GUI test (running Ubuntu guest in graphical mode)

   sudo qemu-system-x86_64 -hda dirtylog_test.img -boot c -m 4192 -net ...

We show a relatively stable part to compare how much time is needed
for the basic parts of dirty log ioctl.

get.org   get.opt  switch.opt

slots[7].len=32768  278379 66398 64024
slots[8].len=32768  181246   270   160
slots[7].len=32768  263961 64673 64494
slots[8].len=32768  181655   265   160
slots[7].len=32768  263736 64701 64610
slots[8].len=32768  182785   267   160
slots[7].len=32768  260925 65360 65042
slots[8].len=32768  182579   264   160
slots[7].len=32768  267823 65915 65682
slots[8].len=32768  186350   271   160

At a glance, we know our optimization improved significantly compared
to the original get dirty log ioctl. This is true for both get.opt and
switch.opt. This has a really big impact for the personal KVM users who
drive KVM in GUI mode on their usual PCs.

Next, we notice that switch.opt improved a hundred nano seconds or so for
these slots. Although this may sound a bit tiny improvement, we can feel
this as a difference of GUI's responses like mouse reactions.
   


100 ns... this is a bit on the low side (and if you can measure it 
interactively you have much better reflexes than I).



To feel the difference, please try GUI on your PC with our patch series!
   


No doubt get.org - get.opt is measurable, but get.opt-switch.opt is 
problematic.  Have you tried profiling to see where the time is spent 
(well I can guess, clearing the write access from the sptes).




2. Live-migration test (4GB guest, write loop with 1GB buf)

We also did a live-migration test.

get.org   get.opt  switch.opt

slots[0].len=655360 797383261144222181
slots[1].len=37570478082186721   1965244   1842824
slots[2].len=637534208 1433562   1012723   1031213
slots[3].len=131072 216858   331   331
slots[4].len=131072 121635   225   164
slots[5].len=131072 120863   356   164
slots[6].len=16777216   121746  1133   156
slots[7].len=32768  120415   230   278
slots[8].len=32768  120368   216   149
slots[0].len=655360 806497194710223582
slots[1].len=37570478082142922   1878025   1895369
slots[2].len=637534208 1386512   1021309   1000345
slots[3].len=131072 221118   459   296
slots[4].len=131072 121516   272   166
slots[5].len=131072 122652   244   173
slots[6].len=16777216   123226 99185   149
slots[7].len=32768  121803   457   505
slots[8].len=32768  121586   216   155
slots[0].len=655360 766113211317213179
slots[1].len=37570478082155662   1974790   1842361
slots[2].len=637534208 1481411   1020004   1031352
slots[3].len=131072 223100   351   295
slots[4].len=131072 122982   436   164
slots[5].len=131072 122100   300   503
slots[6].len=16777216   123653   779   151
slots[7].len=32768  122617   284   157
slots[8].len=32768  122737   253   149

For slots other than 0,1,2 we can see the similar improvement.

Considering the fact that switch.opt does not depend on the bitmap length
except for kvm_mmu_slot_remove_write_access(), this is the cause of some
usec to msec time consumption: there might be some context switches.

But note that this was done with the workload which dirtied the memory
endlessly during the live-migration.

In usual workload, the number of dirty pages varies a lot for each iteration
and we should gain really a lot for relatively clean cases.
   


Can you post such a test, for an idle large guest?

--
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC][PATCH 0/12] KVM, x86, ppc, asm-generic: moving dirty bitmaps to user space

2010-05-10 Thread Takuya Yoshikawa





get.org get.opt switch.opt

slots[7].len=32768 278379 66398 64024
slots[8].len=32768 181246 270 160
slots[7].len=32768 263961 64673 64494
slots[8].len=32768 181655 265 160
slots[7].len=32768 263736 64701 64610
slots[8].len=32768 182785 267 160
slots[7].len=32768 260925 65360 65042
slots[8].len=32768 182579 264 160
slots[7].len=32768 267823 65915 65682
slots[8].len=32768 186350 271 160

At a glance, we know our optimization improved significantly compared
to the original get dirty log ioctl. This is true for both get.opt and
switch.opt. This has a really big impact for the personal KVM users who
drive KVM in GUI mode on their usual PCs.

Next, we notice that switch.opt improved a hundred nano seconds or so for
these slots. Although this may sound a bit tiny improvement, we can feel
this as a difference of GUI's responses like mouse reactions.


100 ns... this is a bit on the low side (and if you can measure it
interactively you have much better reflexes than I).


To feel the difference, please try GUI on your PC with our patch series!


No doubt get.org - get.opt is measurable, but get.opt-switch.opt is
problematic. Have you tried profiling to see where the time is spent
(well I can guess, clearing the write access from the sptes).


Sorry but no, and I agree with your guess.
Anyway, I want to do some profiling to confirm this guess.


BTW, If we only think about performance improvement of time, optimized
get(get.opt) may be enough at this stage.

But if we consider the future expansion like using user allocated bitmaps,
new API's introduced for switch.opt won't become waste, I think, because we
need a structure to get and export bitmap addresses.




In usual workload, the number of dirty pages varies a lot for each
iteration
and we should gain really a lot for relatively clean cases.


Can you post such a test, for an idle large guest?


OK, I'll do!





--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

KVM call agenda for May 11

2010-05-10 Thread Chris Wright

Please send in any agenda items you are interested in covering.

If we have a lack of agenda items I'll cancel the week's call.

thanks,
-chris
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH v5 3/5] Add functions for assigning ioeventfd and irqfds.

2010-05-10 Thread Avi Kivity


On 05/10/2010 06:13 PM, Cam Macdonell wrote:



+int kvm_set_ioeventfd_mmio_long(int fd, uint32_t addr, uint32_t val, bool
assign)
+{
+
+int ret;
+struct kvm_ioeventfd iofd;
+
+iofd.datamatch = val;
+iofd.addr = addr;
+iofd.len = 4;
+iofd.flags = KVM_IOEVENTFD_FLAG_DATAMATCH;
+iofd.fd = fd;
+
+if (!kvm_enabled())
+return -ENOSYS;
+if (!assign)
+iofd.flags |= KVM_IOEVENTFD_FLAG_DEASSIGN;

   

May be more usable to have separate assign and deassign functions (that can
call into a single internal implementation).
 

I believe the convention so far is to use the 'assign' flag as
Michael's patch and the PIO version kvm_set_ioeventfd_pio_word() do.
   


I dislike bool arguments since they're hard to understand at the call 
site.  However if there's precedent we can stick to it and perhaps 
change it all later.


--
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH v5 4/5] Inter-VM shared memory PCI device

2010-05-10 Thread Cam Macdonell

On Mon, May 10, 2010 at 5:59 AM, Avi Kivity a...@redhat.com wrote:
 On 04/21/2010 08:53 PM, Cam Macdonell wrote:

 Support an inter-vm shared memory device that maps a shared-memory object
 as a
 PCI device in the guest.  This patch also supports interrupts between
 guest by
 communicating over a unix domain socket.  This patch applies to the
 qemu-kvm
 repository.

     -device ivshmem,size=size in format accepted by -m[,shm=shm name]

 Interrupts are supported between multiple VMs by using a shared memory
 server
 by using a chardev socket.

     -device ivshmem,size=size in format accepted by -m[,shm=shm name]
                     [,chardev=id][,msi=on][,irqfd=on][,vectors=n]
     -chardev socket,path=path,id=id

 (shared memory server is qemu.git/contrib/ivshmem-server)

 Sample programs and init scripts are in a git repo here:


 +typedef struct EventfdEntry {
 +    PCIDevice *pdev;
 +    int vector;
 +} EventfdEntry;
 +
 +typedef struct IVShmemState {
 +    PCIDevice dev;
 +    uint32_t intrmask;
 +    uint32_t intrstatus;
 +    uint32_t doorbell;
 +
 +    CharDriverState * chr;
 +    CharDriverState ** eventfd_chr;
 +    int ivshmem_mmio_io_addr;
 +
 +    pcibus_t mmio_addr;
 +    unsigned long ivshmem_offset;
 +    uint64_t ivshmem_size; /* size of shared memory region */
 +    int shm_fd; /* shared memory file descriptor */
 +
 +    int nr_allocated_vms;
 +    /* array of eventfds for each guest */
 +    int ** eventfds;
 +    /* keep track of # of eventfds for each guest*/
 +    int * eventfds_posn_count;


 More readable:

  typedef struct Peer {
      int nb_eventfds;
      int *eventfds;
  } Peer;
  int nb_peers;
  Peer *peers;

 Does eventfd_chr need to be there as well?

 +
 +    int nr_alloc_guests;
 +    int vm_id;
 +    int num_eventfds;
 +    uint32_t vectors;
 +    uint32_t features;
 +    EventfdEntry *eventfd_table;
 +
 +    char * shmobj;
 +    char * sizearg;


 Does this need to be part of the state?

 +} IVShmemState;
 +
 +/* registers for the Inter-VM shared memory device */
 +enum ivshmem_registers {
 +    IntrMask = 0,
 +    IntrStatus = 4,
 +    IVPosition = 8,
 +    Doorbell = 12,
 +};
 +
 +static inline uint32_t ivshmem_has_feature(IVShmemState *ivs, int
 feature) {
 +    return (ivs-features  (1  feature));
 +}
 +
 +static inline int is_power_of_two(int x) {
 +    return (x  (x-1)) == 0;
 +}


 argument needs to be uint64_t to avoid overflow with large BARs.  Return
 type can be bool.

 +static void ivshmem_io_writel(void *opaque, uint8_t addr, uint32_t val)
 +{
 +    IVShmemState *s = opaque;
 +
 +    u_int64_t write_one = 1;
 +    u_int16_t dest = val  16;
 +    u_int16_t vector = val  0xff;
 +
 +    addr= 0xfe;


 Why 0xfe?  Can understand 0xfc or 0xff.

 +
 +    switch (addr)
 +    {
 +        case IntrMask:
 +            ivshmem_IntrMask_write(s, val);
 +            break;
 +
 +        case IntrStatus:
 +            ivshmem_IntrStatus_write(s, val);
 +            break;
 +
 +        case Doorbell:
 +            /* check doorbell range */
 +            if ((vector= 0)  (vector  s-eventfds_posn_count[dest]))
 {


 What if dest is too big?  We overflow s-eventfds_posn_count.

 +
 +static void close_guest_eventfds(IVShmemState *s, int posn)
 +{
 +    int i, guest_curr_max;
 +
 +    guest_curr_max = s-eventfds_posn_count[posn];
 +
 +    for (i = 0; i  guest_curr_max; i++)
 +        close(s-eventfds[posn][i]);
 +
 +    free(s-eventfds[posn]);


 qemu_free().

 +/* this function increase the dynamic storage need to store data about
 other
 + * guests */
 +static void increase_dynamic_storage(IVShmemState *s, int new_min_size) {
 +
 +    int j, old_nr_alloc;
 +
 +    old_nr_alloc = s-nr_alloc_guests;
 +
 +    while (s-nr_alloc_guests  new_min_size)
 +        s-nr_alloc_guests = s-nr_alloc_guests * 2;
 +
 +    IVSHMEM_DPRINTF(bumping storage to %d guests\n,
 s-nr_alloc_guests);
 +    s-eventfds = qemu_realloc(s-eventfds, s-nr_alloc_guests *
 +                                                        sizeof(int *));
 +    s-eventfds_posn_count = qemu_realloc(s-eventfds_posn_count,
 +                                                    s-nr_alloc_guests *
 +                                                        sizeof(int));
 +    s-eventfd_table = qemu_realloc(s-eventfd_table, s-nr_alloc_guests
 *
 +
  sizeof(EventfdEntry));
 +
 +    if ((s-eventfds == NULL) || (s-eventfds_posn_count == NULL) ||
 +            (s-eventfd_table == NULL)) {
 +        fprintf(stderr, Allocation error - exiting\n);
 +        exit(1);
 +    }
 +
 +    if (!ivshmem_has_feature(s, IVSHMEM_IRQFD)) {
 +        s-eventfd_chr = (CharDriverState **)qemu_realloc(s-eventfd_chr,
 +                                    s-nr_alloc_guests * sizeof(void *));
 +        if (s-eventfd_chr == NULL) {
 +            fprintf(stderr, Allocation error - exiting\n);
 +            exit(1);
 +        }
 +    }
 +
 +    /* zero out new pointers */
 +    for (j = old_nr_alloc; j  s-nr_alloc_guests; j++) {
 +        s-eventfds[j] =

Re: [PATCH v3 0/2] x86 FPU API

2010-05-10 Thread H. Peter Anvin

On 05/10/2010 01:48 AM, Avi Kivity wrote:
 On 05/06/2010 11:45 AM, Avi Kivity wrote:
 Currently all fpu accessors are wedded to task_struct.  However kvm
 also uses
 the fpu in a different context.  Introduce an FPU API, and replace the
 current uses with the new API.

 While this patchset is oriented towards deeper changes, as a first
 step it
 simlifies xsave for kvm.

 
 Peter/Ingo, what are the plans for merging it?  The kvm xsave work
 depends on this.
 

Going to look at it today.  Looks good, but I want to go over it in
detail to catch any gotchas.

-hpa

-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH v5 4/5] Inter-VM shared memory PCI device

2010-05-10 Thread Avi Kivity


On 05/10/2010 06:22 PM, Cam Macdonell wrote:





+
+/* if the position is -1, then it's shared memory region fd */
+if (incoming_posn == -1) {
+
+s-num_eventfds = 0;
+
+if (check_shm_size(s, incoming_fd) == -1) {
+exit(-1);
+}
+
+/* creating a BAR in qemu_chr callback may be crazy */
+create_shared_memory_BAR(s, incoming_fd);

   

It probably is... why can't you create it during initialization?
 

This is for the shared memory server implementation, so the fd for the
shared memory has to be received (over the qemu char device) from the
server before the BAR can be created via qemu_ram_mmap() which adds
the necessary memory

   



We could do the handshake during initialization.  I'm worried that the 
device will appear without the BAR, and strange things will happen.  But 
the chardev API is probably not geared for passing data during init.


Anthony, any ideas?


Otherwise, if the BAR is allocated during initialization, I would have
to use MAP_FIXED to mmap the memory.  This is what I did before the
qemu_ram_mmap() function was added.
   


What would happen to any data written to the BAR before the the 
handshake completed?  I think it would disappear.


So it's a good idea to make the initialization process atomic.

--
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH v5 2/5] Support adding a file to qemu's ram allocation

2010-05-10 Thread Cam Macdonell

On Mon, May 10, 2010 at 4:39 AM, Avi Kivity a...@redhat.com wrote:
 On 04/21/2010 08:53 PM, Cam Macdonell wrote:

 This avoids the need of using qemu_ram_alloc and mmap with MAP_FIXED to
 map a
 host file into guest RAM.  This function mmaps the opened file anywhere
 and adds
 the memory to the ram blocks.

 Usage is

 qemu_ram_mmap(fd, size, MAP_SHARED, offset);


 Signoff?

 +ram_addr_t qemu_ram_mmap(int fd, ram_addr_t size, int flags, off_t
 offset)
 +{
 +    RAMBlock *new_block;
 +
 +    size = TARGET_PAGE_ALIGN(size);
 +    new_block = qemu_malloc(sizeof(*new_block));
 +
 +    /* map the file passed as a parameter to be this part of memory */
 +    new_block-host = mmap(0, size, PROT_READ|PROT_WRITE, flags, fd,
 offset);
 +
 +    if (new_block-host == MAP_FAILED)
 +        exit(1);


 Braces after if ()

 +    if (kvm_enabled())
 +        kvm_setup_guest_memory(new_block-host, size);
 +


 More braces.


This function is possibly made redundant by Marcelo's patch for qemu_ram_map

http://kerneltrap.org/mailarchive/linux-kvm/2010/4/26/6261299

qemu_ram_map isn't merged yet either, but I'm fine with either one.
Marcelo's requires the device to map the memory and then pass the
pointer to be added to the memory allocation, so it gives the device
full mapping control.  Alternatively, I could add the protection flag
to my function (I think that's all that is missing).

Let me know and I'll change my patch if necessary.

 --
 error compiling committee.c: too many arguments to function


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH v5 4/5] Inter-VM shared memory PCI device

2010-05-10 Thread Anthony Liguori


On 05/10/2010 10:28 AM, Avi Kivity wrote:

On 05/10/2010 06:22 PM, Cam Macdonell wrote:





+
+/* if the position is -1, then it's shared memory region fd */
+if (incoming_posn == -1) {
+
+s-num_eventfds = 0;
+
+if (check_shm_size(s, incoming_fd) == -1) {
+exit(-1);
+}
+
+/* creating a BAR in qemu_chr callback may be crazy */
+create_shared_memory_BAR(s, incoming_fd);


It probably is... why can't you create it during initialization?

This is for the shared memory server implementation, so the fd for the
shared memory has to be received (over the qemu char device) from the
server before the BAR can be created via qemu_ram_mmap() which adds
the necessary memory




We could do the handshake during initialization.  I'm worried that the 
device will appear without the BAR, and strange things will happen.  
But the chardev API is probably not geared for passing data during init.


Anthony, any ideas?


Why can't we create the BAR with just normal RAM and then change it to a 
mmap()'d fd after initialization?  This will be behavior would be 
important for live migration as it would let you quickly migrate 
preserving the memory contents without waiting for an external program 
to reconnect.


Regards,

Anthony Lioguori


Otherwise, if the BAR is allocated during initialization, I would have
to use MAP_FIXED to mmap the memory.  This is what I did before the
qemu_ram_mmap() function was added.


What would happen to any data written to the BAR before the the 
handshake completed?  I think it would disappear.


You don't have to do MAP_FIXED.  You can allocate a ram area and map 
that in when disconnected.  When you connect, you create another ram 
area and memcpy() the previous ram area to the new one.  You then map 
the second ram area in.


From the guest's perspective, it's totally transparent.  For the 
backend, I'd suggest having an explicit initialized ack or something 
so that it knows that the data is now mapped to the guest.


If you're doing just a ring queue in shared memory, it should allow 
disconnect/reconnect during live migration asynchronously to the actual 
qemu live migration.


Regards,

Anthony Liguori


So it's a good idea to make the initialization process atomic.



--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH v5 4/5] Inter-VM shared memory PCI device

2010-05-10 Thread Cam Macdonell

On Mon, May 10, 2010 at 9:28 AM, Avi Kivity a...@redhat.com wrote:
 On 05/10/2010 06:22 PM, Cam Macdonell wrote:


 +
 +    /* if the position is -1, then it's shared memory region fd */
 +    if (incoming_posn == -1) {
 +
 +        s-num_eventfds = 0;
 +
 +        if (check_shm_size(s, incoming_fd) == -1) {
 +            exit(-1);
 +        }
 +
 +        /* creating a BAR in qemu_chr callback may be crazy */
 +        create_shared_memory_BAR(s, incoming_fd);



 It probably is... why can't you create it during initialization?


 This is for the shared memory server implementation, so the fd for the
 shared memory has to be received (over the qemu char device) from the
 server before the BAR can be created via qemu_ram_mmap() which adds
 the necessary memory




 We could do the handshake during initialization.  I'm worried that the
 device will appear without the BAR, and strange things will happen.  But the
 chardev API is probably not geared for passing data during init.

More specifically, the challenge I've found is that there is no
function to tell a chardev to block and wait for the initialization
data.


 Anthony, any ideas?

 Otherwise, if the BAR is allocated during initialization, I would have
 to use MAP_FIXED to mmap the memory.  This is what I did before the
 qemu_ram_mmap() function was added.


 What would happen to any data written to the BAR before the the handshake
 completed?  I think it would disappear.

But, the BAR isn't there until the handshake is completed.  Only after
receiving the shared memory fd does my device call pci_register_bar()
in the callback function.  So there may be a case with BAR2 (the
shared memory BAR) missing during initialization.  FWIW, I haven't
encountered this.


 So it's a good idea to make the initialization process atomic.

 --
 error compiling committee.c: too many arguments to function


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCHv2] Support for booting from virtio disks

2010-05-10 Thread Anthony Liguori


On 05/10/2010 03:11 AM, Gleb Natapov wrote:

This patch adds native support for booting from virtio disks to Seabios.

Signed-off-by: Gleb Natapovg...@redhat.com
   


A related problem that I think we need to think about how we solve is 
indicating to Seabios which device we want to boot from


With your patch, a user can select a virtio device explicitly or if they 
use only one virtio device, it will Just Work.


However, if a user uses IDE and virtio, or a user has multiple disks, 
they cannot select a device via -boot.


Is this something we need to address?  I don't think we'd break libvirt 
if we didn't.


Regards,

Anthony Liguori


---

Changelog:
  v1-v2:
   - free memory in case of vq initialization error.
   - change license of virtio ring/pci to LGPLv3 with permission
 of Laurent Vivier (aka the author).

diff --git a/Makefile b/Makefile
index 327a1bf..d0b8881 100644
--- a/Makefile
+++ b/Makefile
@@ -14,7 +14,8 @@ OUT=out/
  SRCBOTH=misc.c pmm.c stacks.c output.c util.c block.c floppy.c ata.c mouse.c \
  kbd.c pci.c serial.c clock.c pic.c cdrom.c ps2port.c smp.c resume.c \
  pnpbios.c pirtable.c vgahooks.c ramdisk.c pcibios.c blockcmd.c \
-usb.c usb-uhci.c usb-ohci.c usb-ehci.c usb-hid.c usb-msc.c
+usb.c usb-uhci.c usb-ohci.c usb-ehci.c usb-hid.c usb-msc.c \
+virtio-ring.c virtio-pci.c virtio-blk.c
  SRC16=$(SRCBOTH) system.c disk.c apm.c font.c
  SRC32FLAT=$(SRCBOTH) post.c shadow.c memmap.c coreboot.c boot.c \
acpi.c smm.c mptable.c smbios.c pciinit.c optionroms.c mtrr.c \
diff --git a/src/block.c b/src/block.c
index ddf441f..b6b1902 100644
--- a/src/block.c
+++ b/src/block.c
@@ -11,6 +11,7 @@
  #include util.h // dprintf
  #include ata.h // process_ata_op
  #include usb-msc.h // process_usb_op
+#include virtio-blk.h // process_virtio_op

  struct drives_s Drives VAR16VISIBLE;

@@ -289,6 +290,8 @@ process_op(struct disk_op_s *op)
  return process_cdemu_op(op);
  case DTYPE_USB:
  return process_usb_op(op);
+case DTYPE_VIRTIO:
+   return process_virtio_op(op);
  default:
  op-count = 0;
  return DISK_RET_EPARAM;
diff --git a/src/config.h b/src/config.h
index b101174..ad569c6 100644
--- a/src/config.h
+++ b/src/config.h
@@ -136,6 +136,9 @@
  #define CONFIG_SUBMODEL_ID   0x00
  #define CONFIG_BIOS_REVISION 0x01

+// Support boot from virtio storage
+#define CONFIG_VIRTIO_BLK 1
+
  // Various memory addresses used by the code.
  #define BUILD_STACK_ADDR  0x7000
  #define BUILD_S3RESUME_STACK_ADDR 0x1000
diff --git a/src/disk.h b/src/disk.h
index 0cd1b74..9e5b083 100644
--- a/src/disk.h
+++ b/src/disk.h
@@ -197,6 +197,7 @@ struct drive_s {
  #define DTYPE_RAMDISK  0x04
  #define DTYPE_CDEMU0x05
  #define DTYPE_USB  0x06
+#define DTYPE_VIRTIO   0x07

  #define MAXDESCSIZE 80

diff --git a/src/pci_ids.h b/src/pci_ids.h
index 1800f1d..e1cded2 100644
--- a/src/pci_ids.h
+++ b/src/pci_ids.h
@@ -2605,3 +2605,6 @@
  #define PCI_DEVICE_ID_RME_DIGI32  0x9896
  #define PCI_DEVICE_ID_RME_DIGI32_PRO  0x9897
  #define PCI_DEVICE_ID_RME_DIGI32_80x9898
+
+#define PCI_VENDOR_ID_REDHAT_QUMRANET  0x1af4
+#define PCI_DEVICE_ID_VIRTIO_BLK   0x1001
diff --git a/src/post.c b/src/post.c
index 638b0f7..25535e2 100644
--- a/src/post.c
+++ b/src/post.c
@@ -23,6 +23,7 @@
  #include smbios.h // smbios_init
  #include paravirt.h // qemu_cfg_port_probe
  #include ps2port.h // ps2port_setup
+#include virtio-blk.h // virtio_blk_setup

  void
  __set_irq(int vector, void *loc)
@@ -184,6 +185,7 @@ init_hw(void)
  floppy_setup();
  ata_setup();
  ramdisk_setup();
+virtio_blk_setup();
  }

  // Main setup code.
diff --git a/src/virtio-blk.c b/src/virtio-blk.c
new file mode 100644
index 000..a41c336
--- /dev/null
+++ b/src/virtio-blk.c
@@ -0,0 +1,155 @@
+// Virtio blovl boot support.
+//
+// Copyright (C) 2010 Red Hat Inc.
+//
+// Authors:
+//  Gleb Natapovgnata...@redhat.com
+//
+// This file may be distributed under the terms of the GNU LGPLv3 license.
+
+#include util.h // dprintf
+#include pci.h // foreachpci
+#include config.h // CONFIG_*
+#include virtio-pci.h
+#include virtio-blk.h
+#include disk.h
+
+struct virtiodrive_s {
+struct drive_s drive;
+struct vring_virtqueue *vq;
+u16 ioaddr;
+};
+
+static int
+virtio_blk_read(struct disk_op_s *op)
+{
+struct virtiodrive_s *vdrive_g =
+container_of(op-drive_g, struct virtiodrive_s, drive);
+struct vring_virtqueue *vq = GET_GLOBAL(vdrive_g-vq);
+struct virtio_blk_outhdr hdr = {
+.type = VIRTIO_BLK_T_IN,
+.ioprio = 0,
+.sector = op-lba,
+};
+u8 status = VIRTIO_BLK_S_UNSUPP;
+struct vring_list sg[] = {
+{
+.addr  = MAKE_FLATPTR(GET_SEG(SS),hdr),
+.length= sizeof(hdr),
+},
+{
+.addr  = op-buf_fl,
+.length= GET_GLOBAL(vdrive_g-drive.blksize) * op-count,
+},
+{
+

[PATCH] VMX: Invalid guest state detection enhancements and bug fixes

2010-05-10 Thread Mohammed Gamal

- Correct unusable flag check on SS, DS, ES, FS, GS, and LDTR
- Add rflags checks
- Report failed instruction on emulation failure

Signed-off-by: Mohammed Gamal m.gamal...@gmail.com
---
 arch/x86/kvm/vmx.c |   31 +++
 1 files changed, 27 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 777e00d..968384b 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -2122,7 +2122,7 @@ static bool stack_segment_valid(struct kvm_vcpu *vcpu)
ss_rpl = ss.selector  SELECTOR_RPL_MASK;
 
if (ss.unusable)
-   return true;
+   return false;
if (ss.type != 3  ss.type != 7)
return false;
if (!ss.s)
@@ -2144,7 +2144,7 @@ static bool data_segment_valid(struct kvm_vcpu *vcpu, int 
seg)
rpl = var.selector  SELECTOR_RPL_MASK;
 
if (var.unusable)
-   return true;
+   return false;
if (!var.s)
return false;
if (!var.present)
@@ -2185,7 +2185,7 @@ static bool ldtr_valid(struct kvm_vcpu *vcpu)
vmx_get_segment(vcpu, ldtr, VCPU_SREG_LDTR);
 
if (ldtr.unusable)
-   return true;
+   return false;
if (ldtr.selector  SELECTOR_TI_MASK)   /* TI = 1 */
return false;
if (ldtr.type != 2)
@@ -2207,6 +2207,27 @@ static bool cs_ss_rpl_check(struct kvm_vcpu *vcpu)
 (ss.selector  SELECTOR_RPL_MASK));
 }
 
+static bool rflags_valid(struct kvm_vcpu *vcpu)
+{
+   unsigned long rflags;
+   u32 entry_intr_info;
+
+   rflags = vmcs_readl(GUEST_RFLAGS);
+   entry_intr_info = vmcs_read32(VM_ENTRY_INTR_INFO_FIELD);
+#ifdef CONFIG_X86_64
+   if (is_long_mode(vcpu))
+   if (rflags  X86_EFLAGS_VM)
+   return false;   
+#endif
+   if ((entry_intr_info  INTR_INFO_INTR_TYPE_MASK) == INTR_TYPE_EXT_INTR 
+(entry_intr_info  INTR_INFO_VALID_MASK)) {
+   if (!(rflags  X86_EFLAGS_IF))
+   return false;
+   }
+
+   return true;
+}
+
 /*
  * Check if guest state is valid. Returns true if valid, false if
  * not.
@@ -2251,8 +2272,9 @@ static bool guest_state_valid(struct kvm_vcpu *vcpu)
}
/* TODO:
 * - Add checks on RIP
-* - Add checks on RFLAGS
 */
+   if (!rflags_valid(vcpu))
+   return false;
 
return true;
 }
@@ -3559,6 +3581,7 @@ static int handle_invalid_guest_state(struct kvm_vcpu 
*vcpu)
}
 
if (err != EMULATE_DONE) {
+   kvm_report_emulation_failure(vcpu, invalid guest state 
handler);
vcpu-run-exit_reason = KVM_EXIT_INTERNAL_ERROR;
vcpu-run-internal.suberror = 
KVM_INTERNAL_ERROR_EMULATION;
vcpu-run-internal.ndata = 0;
-- 
1.7.0.4

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCHv2] Support for booting from virtio disks

2010-05-10 Thread Anthony Liguori


On 05/10/2010 10:54 AM, Gleb Natapov wrote:

On Mon, May 10, 2010 at 10:48:42AM -0500, Anthony Liguori wrote:
   

On 05/10/2010 03:11 AM, Gleb Natapov wrote:
 

This patch adds native support for booting from virtio disks to Seabios.

Signed-off-by: Gleb Natapovg...@redhat.com
   

A related problem that I think we need to think about how we solve
is indicating to Seabios which device we want to boot from

With your patch, a user can select a virtio device explicitly or if
they use only one virtio device, it will Just Work.

However, if a user uses IDE and virtio, or a user has multiple
disks, they cannot select a device via -boot.

 

Isn't this problem unrelated to this patch?  I mean if I start qemu with
two ide devices can I specify from qemu command line which one I want to
boot from?
   


That's sort of what I'm asking.  If you compare this approach to 
extboot, extboot provided a capability to select a disk.  I think it can 
be argued though that this isn't a necessary feature to carry over and 
I'm looking for additional opinions on that.


Regards,

Anthony Liguori

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCHv2] Support for booting from virtio disks

2010-05-10 Thread Gleb Natapov

On Mon, May 10, 2010 at 10:48:42AM -0500, Anthony Liguori wrote:
 On 05/10/2010 03:11 AM, Gleb Natapov wrote:
 This patch adds native support for booting from virtio disks to Seabios.
 
 Signed-off-by: Gleb Natapovg...@redhat.com
 
 A related problem that I think we need to think about how we solve
 is indicating to Seabios which device we want to boot from
 
 With your patch, a user can select a virtio device explicitly or if
 they use only one virtio device, it will Just Work.
 
 However, if a user uses IDE and virtio, or a user has multiple
 disks, they cannot select a device via -boot.
 
Isn't this problem unrelated to this patch?  I mean if I start qemu with
two ide devices can I specify from qemu command line which one I want to
boot from?

 Is this something we need to address?  I don't think we'd break
 libvirt if we didn't.
 
 Regards,
 
 Anthony Liguori
 
 ---
 
 Changelog:
   v1-v2:
- free memory in case of vq initialization error.
- change license of virtio ring/pci to LGPLv3 with permission
  of Laurent Vivier (aka the author).
 
 diff --git a/Makefile b/Makefile
 index 327a1bf..d0b8881 100644
 --- a/Makefile
 +++ b/Makefile
 @@ -14,7 +14,8 @@ OUT=out/
   SRCBOTH=misc.c pmm.c stacks.c output.c util.c block.c floppy.c ata.c 
  mouse.c \
   kbd.c pci.c serial.c clock.c pic.c cdrom.c ps2port.c smp.c 
  resume.c \
   pnpbios.c pirtable.c vgahooks.c ramdisk.c pcibios.c blockcmd.c \
 -usb.c usb-uhci.c usb-ohci.c usb-ehci.c usb-hid.c usb-msc.c
 +usb.c usb-uhci.c usb-ohci.c usb-ehci.c usb-hid.c usb-msc.c \
 +virtio-ring.c virtio-pci.c virtio-blk.c
   SRC16=$(SRCBOTH) system.c disk.c apm.c font.c
   SRC32FLAT=$(SRCBOTH) post.c shadow.c memmap.c coreboot.c boot.c \
 acpi.c smm.c mptable.c smbios.c pciinit.c optionroms.c mtrr.c \
 diff --git a/src/block.c b/src/block.c
 index ddf441f..b6b1902 100644
 --- a/src/block.c
 +++ b/src/block.c
 @@ -11,6 +11,7 @@
   #include util.h // dprintf
   #include ata.h // process_ata_op
   #include usb-msc.h // process_usb_op
 +#include virtio-blk.h // process_virtio_op
 
   struct drives_s Drives VAR16VISIBLE;
 
 @@ -289,6 +290,8 @@ process_op(struct disk_op_s *op)
   return process_cdemu_op(op);
   case DTYPE_USB:
   return process_usb_op(op);
 +case DTYPE_VIRTIO:
 +return process_virtio_op(op);
   default:
   op-count = 0;
   return DISK_RET_EPARAM;
 diff --git a/src/config.h b/src/config.h
 index b101174..ad569c6 100644
 --- a/src/config.h
 +++ b/src/config.h
 @@ -136,6 +136,9 @@
   #define CONFIG_SUBMODEL_ID   0x00
   #define CONFIG_BIOS_REVISION 0x01
 
 +// Support boot from virtio storage
 +#define CONFIG_VIRTIO_BLK 1
 +
   // Various memory addresses used by the code.
   #define BUILD_STACK_ADDR  0x7000
   #define BUILD_S3RESUME_STACK_ADDR 0x1000
 diff --git a/src/disk.h b/src/disk.h
 index 0cd1b74..9e5b083 100644
 --- a/src/disk.h
 +++ b/src/disk.h
 @@ -197,6 +197,7 @@ struct drive_s {
   #define DTYPE_RAMDISK  0x04
   #define DTYPE_CDEMU0x05
   #define DTYPE_USB  0x06
 +#define DTYPE_VIRTIO   0x07
 
   #define MAXDESCSIZE 80
 
 diff --git a/src/pci_ids.h b/src/pci_ids.h
 index 1800f1d..e1cded2 100644
 --- a/src/pci_ids.h
 +++ b/src/pci_ids.h
 @@ -2605,3 +2605,6 @@
   #define PCI_DEVICE_ID_RME_DIGI32   0x9896
   #define PCI_DEVICE_ID_RME_DIGI32_PRO   0x9897
   #define PCI_DEVICE_ID_RME_DIGI32_8 0x9898
 +
 +#define PCI_VENDOR_ID_REDHAT_QUMRANET   0x1af4
 +#define PCI_DEVICE_ID_VIRTIO_BLK0x1001
 diff --git a/src/post.c b/src/post.c
 index 638b0f7..25535e2 100644
 --- a/src/post.c
 +++ b/src/post.c
 @@ -23,6 +23,7 @@
   #include smbios.h // smbios_init
   #include paravirt.h // qemu_cfg_port_probe
   #include ps2port.h // ps2port_setup
 +#include virtio-blk.h // virtio_blk_setup
 
   void
   __set_irq(int vector, void *loc)
 @@ -184,6 +185,7 @@ init_hw(void)
   floppy_setup();
   ata_setup();
   ramdisk_setup();
 +virtio_blk_setup();
   }
 
   // Main setup code.
 diff --git a/src/virtio-blk.c b/src/virtio-blk.c
 new file mode 100644
 index 000..a41c336
 --- /dev/null
 +++ b/src/virtio-blk.c
 @@ -0,0 +1,155 @@
 +// Virtio blovl boot support.
 +//
 +// Copyright (C) 2010 Red Hat Inc.
 +//
 +// Authors:
 +//  Gleb Natapovgnata...@redhat.com
 +//
 +// This file may be distributed under the terms of the GNU LGPLv3 license.
 +
 +#include util.h // dprintf
 +#include pci.h // foreachpci
 +#include config.h // CONFIG_*
 +#include virtio-pci.h
 +#include virtio-blk.h
 +#include disk.h
 +
 +struct virtiodrive_s {
 +struct drive_s drive;
 +struct vring_virtqueue *vq;
 +u16 ioaddr;
 +};
 +
 +static int
 +virtio_blk_read(struct disk_op_s *op)
 +{
 +struct virtiodrive_s *vdrive_g =
 +container_of(op-drive_g, struct virtiodrive_s, drive);
 +struct vring_virtqueue *vq = GET_GLOBAL(vdrive_g-vq);
 +struct virtio_blk_outhdr hdr = {
 +.type = VIRTIO_BLK_T_IN,
 +.ioprio = 0,
 +

Re: [PATCHv2] KVM: inject #UD if instruction emulation fails and exit to userspace

2010-05-10 Thread Mohammed Gamal

On Mon, May 10, 2010 at 1:25 PM, Gleb Natapov g...@redhat.com wrote:
 On Mon, May 10, 2010 at 11:16:56AM +0300, Gleb Natapov wrote:
 Do not kill VM when instruction emulation fails. Inject #UD and report
 failure to userspace instead. Userspace may choose to reenter guest if
 vcpu is in userspace (cpl == 3) in which case guest OS will kill
 offending process and continue running.


I am curious to know what'd happen in case the vcpu is in kernel space
(cpl == 0). Is that case handled?

 Please use this one instead. Compilation warning fixed.

 Signed-off-by: Gleb Natapov g...@redhat.com
 diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
 index ed48904..5aa0944 100644
 --- a/arch/x86/include/asm/kvm_host.h
 +++ b/arch/x86/include/asm/kvm_host.h
 @@ -575,7 +575,6 @@ enum emulation_result {
  #define EMULTYPE_SKIP              (1  2)
  int emulate_instruction(struct kvm_vcpu *vcpu,
                        unsigned long cr2, u16 error_code, int emulation_type);
 -void kvm_report_emulation_failure(struct kvm_vcpu *cvpu, const char 
 *context);
  void realmode_lgdt(struct kvm_vcpu *vcpu, u16 size, unsigned long address);
  void realmode_lidt(struct kvm_vcpu *vcpu, u16 size, unsigned long address);

 diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
 index 95bee9d..41d2de8 100644
 --- a/arch/x86/kvm/mmu.c
 +++ b/arch/x86/kvm/mmu.c
 @@ -2788,11 +2788,8 @@ int kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gva_t 
 cr2, u32 error_code)
                return 1;
        case EMULATE_DO_MMIO:
                ++vcpu-stat.mmio_exits;
 -               return 0;
 +               /* fall through */
        case EMULATE_FAIL:
 -               vcpu-run-exit_reason = KVM_EXIT_INTERNAL_ERROR;
 -               vcpu-run-internal.suberror = KVM_INTERNAL_ERROR_EMULATION;
 -               vcpu-run-internal.ndata = 0;
                return 0;
        default:
                BUG();
 diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
 index cea916f..58c91f5 100644
 --- a/arch/x86/kvm/svm.c
 +++ b/arch/x86/kvm/svm.c
 @@ -1449,7 +1449,7 @@ static int io_interception(struct vcpu_svm *svm)
        string = (io_info  SVM_IOIO_STR_MASK) != 0;
        in = (io_info  SVM_IOIO_TYPE_MASK) != 0;
        if (string || in)
 -               return !(emulate_instruction(vcpu, 0, 0, 0) == 
 EMULATE_DO_MMIO);
 +               return emulate_instruction(vcpu, 0, 0, 0) == EMULATE_DONE;

        port = io_info  16;
        size = (io_info  SVM_IOIO_SIZE_MASK)  SVM_IOIO_SIZE_SHIFT;
 @@ -2300,16 +2300,12 @@ static int iret_interception(struct vcpu_svm *svm)

  static int invlpg_interception(struct vcpu_svm *svm)
  {
 -       if (emulate_instruction(svm-vcpu, 0, 0, 0) != EMULATE_DONE)
 -               pr_unimpl(svm-vcpu, %s: failed\n, __func__);
 -       return 1;
 +       return emulate_instruction(svm-vcpu, 0, 0, 0) == EMULATE_DONE;
  }

  static int emulate_on_interception(struct vcpu_svm *svm)
  {
 -       if (emulate_instruction(svm-vcpu, 0, 0, 0) != EMULATE_DONE)
 -               pr_unimpl(svm-vcpu, %s: failed\n, __func__);
 -       return 1;
 +       return emulate_instruction(svm-vcpu, 0, 0, 0) == EMULATE_DONE;
  }

  static int cr8_write_interception(struct vcpu_svm *svm)
 diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
 index 777e00d..e35c479 100644
 --- a/arch/x86/kvm/vmx.c
 +++ b/arch/x86/kvm/vmx.c
 @@ -3074,7 +3074,7 @@ static int handle_io(struct kvm_vcpu *vcpu)
        ++vcpu-stat.io_exits;

        if (string || in)
 -               return !(emulate_instruction(vcpu, 0, 0, 0) == 
 EMULATE_DO_MMIO);
 +               return emulate_instruction(vcpu, 0, 0, 0) == EMULATE_DONE;

        port = exit_qualification  16;
        size = (exit_qualification  7) + 1;
 @@ -3331,22 +3331,7 @@ static int handle_wbinvd(struct kvm_vcpu *vcpu)

  static int handle_apic_access(struct kvm_vcpu *vcpu)
  {
 -       unsigned long exit_qualification;
 -       enum emulation_result er;
 -       unsigned long offset;
 -
 -       exit_qualification = vmcs_readl(EXIT_QUALIFICATION);
 -       offset = exit_qualification  0xffful;
 -
 -       er = emulate_instruction(vcpu, 0, 0, 0);
 -
 -       if (er !=  EMULATE_DONE) {
 -               printk(KERN_ERR
 -                      Fail to handle apic access vmexit! Offset is 0x%lx\n,
 -                      offset);
 -               return -ENOEXEC;
 -       }
 -       return 1;
 +       return emulate_instruction(vcpu, 0, 0, 0) == EMULATE_DONE;
  }

  static int handle_task_switch(struct kvm_vcpu *vcpu)
 @@ -3558,13 +3543,8 @@ static int handle_invalid_guest_state(struct kvm_vcpu 
 *vcpu)
                        goto out;
                }

 -               if (err != EMULATE_DONE) {
 -                       vcpu-run-exit_reason = KVM_EXIT_INTERNAL_ERROR;
 -                       vcpu-run-internal.suberror = 
 KVM_INTERNAL_ERROR_EMULATION;
 -                       vcpu-run-internal.ndata = 0;
 -                       ret = 0;
 -                       goto out;
 -               }
 +

Re: [PATCHv2] Support for booting from virtio disks

2010-05-10 Thread Gleb Natapov

On Mon, May 10, 2010 at 10:58:45AM -0500, Anthony Liguori wrote:
 On 05/10/2010 10:54 AM, Gleb Natapov wrote:
 On Mon, May 10, 2010 at 10:48:42AM -0500, Anthony Liguori wrote:
 On 05/10/2010 03:11 AM, Gleb Natapov wrote:
 This patch adds native support for booting from virtio disks to Seabios.
 
 Signed-off-by: Gleb Natapovg...@redhat.com
 A related problem that I think we need to think about how we solve
 is indicating to Seabios which device we want to boot from
 
 With your patch, a user can select a virtio device explicitly or if
 they use only one virtio device, it will Just Work.
 
 However, if a user uses IDE and virtio, or a user has multiple
 disks, they cannot select a device via -boot.
 
 Isn't this problem unrelated to this patch?  I mean if I start qemu with
 two ide devices can I specify from qemu command line which one I want to
 boot from?
 
 That's sort of what I'm asking.  If you compare this approach to
 extboot, extboot provided a capability to select a disk.  I think it
 can be argued though that this isn't a necessary feature to carry
 over and I'm looking for additional opinions on that.
 
Well, extboot is just a hack and shouldn't be used with ide disks at
all. With extboot it is not possible to switch to another disk from
F12 menu for instance (is it actually possible to read more then one
disks with bios int13 when extboot is in use?). About specifying boot
disk from qemu command like I think it will be very useful. It is not
clear how to pass default boot device into Seabios though. We should
pass a bus boot device is attached too (ide/virtio) and an unique id
of the device on the bus.

--
Gleb.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH v5 4/5] Inter-VM shared memory PCI device

2010-05-10 Thread Cam Macdonell

On Mon, May 10, 2010 at 9:38 AM, Anthony Liguori anth...@codemonkey.ws wrote:
 On 05/10/2010 10:28 AM, Avi Kivity wrote:

 On 05/10/2010 06:22 PM, Cam Macdonell wrote:


 +
 +    /* if the position is -1, then it's shared memory region fd */
 +    if (incoming_posn == -1) {
 +
 +        s-num_eventfds = 0;
 +
 +        if (check_shm_size(s, incoming_fd) == -1) {
 +            exit(-1);
 +        }
 +
 +        /* creating a BAR in qemu_chr callback may be crazy */
 +        create_shared_memory_BAR(s, incoming_fd);

 It probably is... why can't you create it during initialization?

 This is for the shared memory server implementation, so the fd for the
 shared memory has to be received (over the qemu char device) from the
 server before the BAR can be created via qemu_ram_mmap() which adds
 the necessary memory



 We could do the handshake during initialization.  I'm worried that the
 device will appear without the BAR, and strange things will happen.  But the
 chardev API is probably not geared for passing data during init.

 Anthony, any ideas?

 Why can't we create the BAR with just normal RAM and then change it to a
 mmap()'d fd after initialization?  This will be behavior would be important
 for live migration as it would let you quickly migrate preserving the memory
 contents without waiting for an external program to reconnect.

 Regards,

 Anthony Lioguori

 Otherwise, if the BAR is allocated during initialization, I would have
 to use MAP_FIXED to mmap the memory.  This is what I did before the
 qemu_ram_mmap() function was added.

 What would happen to any data written to the BAR before the the handshake
 completed?  I think it would disappear.

 You don't have to do MAP_FIXED.  You can allocate a ram area and map that in
 when disconnected.  When you connect, you create another ram area and
 memcpy() the previous ram area to the new one.  You then map the second ram
 area in.

the memcpy() would overwrite the contents of the shared memory each
time a guest joins which would be dangerous.


 From the guest's perspective, it's totally transparent.  For the backend,
 I'd suggest having an explicit initialized ack or something so that it
 knows that the data is now mapped to the guest.

Yes, I think the ack is the way to go, so the guest has to be aware of
it.  Would setting a flag in the driver-specific config space be an
acceptable ack that the shared region is now mapped?

Cam
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH v5 4/5] Inter-VM shared memory PCI device

2010-05-10 Thread Avi Kivity


On 05/10/2010 06:41 PM, Cam Macdonell wrote:



What would happen to any data written to the BAR before the the handshake
completed?  I think it would disappear.
 

But, the BAR isn't there until the handshake is completed.  Only after
receiving the shared memory fd does my device call pci_register_bar()
in the callback function.  So there may be a case with BAR2 (the
shared memory BAR) missing during initialization.  FWIW, I haven't
encountered this.
   


Well, that violates PCI.  You can't have a PCI device with no BAR, then 
have a BAR appear.  It may work since the BAR is registered a lot faster 
than the BIOS is able to peek at it, but it's a race nevertheless.


--
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCHv7] add mergeable buffers support to vhost_net

2010-05-10 Thread Michael S. Tsirkin

On Wed, Apr 28, 2010 at 01:57:12PM -0700, David L Stevens wrote:
 @@ -218,18 +248,19 @@ static void handle_rx(struct vhost_net *
   use_mm(net-dev.mm);
   mutex_lock(vq-mutex);
   vhost_disable_notify(vq);
 - hdr_size = vq-hdr_size;
 + vhost_hlen = vq-vhost_hlen;
  
   vq_log = unlikely(vhost_has_feature(net-dev, VHOST_F_LOG_ALL)) ?
   vq-log : NULL;
  
 - for (;;) {
 - head = vhost_get_vq_desc(net-dev, vq, vq-iov,
 -  ARRAY_SIZE(vq-iov),
 -  out, in,
 -  vq_log, log);
 + while ((datalen = vhost_head_len(vq, sock-sk))) {
 + headcount = vhost_get_desc_n(vq, vq-heads,
 +  datalen + vhost_hlen,
 +  in, vq_log, log);
 + if (headcount  0)
 + break;
   /* OK, now we need to know about added descriptors. */
 - if (head == vq-num) {
 + if (!headcount) {
   if (unlikely(vhost_enable_notify(vq))) {
   /* They have slipped one in as we were
* doing that: check again. */
 @@ -241,46 +272,53 @@ static void handle_rx(struct vhost_net *
   break;
   }
   /* We don't need to be notified again. */
 - if (out) {
 - vq_err(vq, Unexpected descriptor format for RX: 
 -out %d, int %d\n,
 -out, in);
 - break;
 - }
 - /* Skip header. TODO: support TSO/mergeable rx buffers. */
 - s = move_iovec_hdr(vq-iov, vq-hdr, hdr_size, in);
 + if (vhost_hlen)
 + /* Skip header. TODO: support TSO. */
 + s = move_iovec_hdr(vq-iov, vq-hdr, vhost_hlen, in);
 + else
 + s = copy_iovec_hdr(vq-iov, vq-hdr, vq-sock_hlen, in);
   msg.msg_iovlen = in;
   len = iov_length(vq-iov, in);
   /* Sanity check */
   if (!len) {
   vq_err(vq, Unexpected header len for RX: 
  %zd expected %zd\n,
 -iov_length(vq-hdr, s), hdr_size);
 +iov_length(vq-hdr, s), vhost_hlen);
   break;
   }
   err = sock-ops-recvmsg(NULL, sock, msg,
len, MSG_DONTWAIT | MSG_TRUNC);
   /* TODO: Check specific error and bomb out unless EAGAIN? */
   if (err  0) {
 - vhost_discard_vq_desc(vq);
 + vhost_discard_desc(vq, headcount);
   break;
   }
 - /* TODO: Should check and handle checksum. */
 - if (err  len) {
 - pr_err(Discarded truncated rx packet: 
 - len %d  %zd\n, err, len);
 - vhost_discard_vq_desc(vq);
 + if (err != datalen) {
 + pr_err(Discarded rx packet: 
 + len %d, expected %zd\n, err, datalen);
 + vhost_discard_desc(vq, headcount);
   continue;
   }
   len = err;
 - err = memcpy_toiovec(vq-hdr, (unsigned char *)hdr, hdr_size);
 - if (err) {
 - vq_err(vq, Unable to write vnet_hdr at addr %p: %d\n,
 -vq-iov-iov_base, err);
 + if (vhost_hlen 
 + memcpy_toiovecend(vq-hdr, (unsigned char *)hdr, 0,
 +   vhost_hlen)) {
 + vq_err(vq, Unable to write vnet_hdr at addr %p\n,
 +vq-iov-iov_base);
   break;
   }
 - len += hdr_size;
 - vhost_add_used_and_signal(net-dev, vq, head, len);
 + /* TODO: Should check and handle checksum. */
 + if (vhost_has_feature(net-dev, VIRTIO_NET_F_MRG_RXBUF) 
 + memcpy_toiovecend(vq-hdr, (unsigned char *)headcount,
 +   offsetof(typeof(hdr), num_buffers),
 +   sizeof(hdr.num_buffers))) {
 + vq_err(vq, Failed num_buffers write);
 + vhost_discard_desc(vq, headcount);
 + break;
 + }
 + len += vhost_hlen;
 + vhost_add_used_and_signal_n(net-dev, vq, vq-heads,
 + headcount);
   if (unlikely(vq_log))
   vhost_log_write(vq, vq_log, log, len);
   total_len += len;

OK I think I see the bug here: vhost_add_used_and_signal_n
does not get the actual length, it

Re: [PATCH v5 4/5] Inter-VM shared memory PCI device

2010-05-10 Thread Cam Macdonell

On Mon, May 10, 2010 at 10:40 AM, Avi Kivity a...@redhat.com wrote:
 On 05/10/2010 06:41 PM, Cam Macdonell wrote:

 What would happen to any data written to the BAR before the the handshake
 completed?  I think it would disappear.


 But, the BAR isn't there until the handshake is completed.  Only after
 receiving the shared memory fd does my device call pci_register_bar()
 in the callback function.  So there may be a case with BAR2 (the
 shared memory BAR) missing during initialization.  FWIW, I haven't
 encountered this.


 Well, that violates PCI.  You can't have a PCI device with no BAR, then have
 a BAR appear.  It may work since the BAR is registered a lot faster than the
 BIOS is able to peek at it, but it's a race nevertheless.

Agreed.  I'll get Anthony's idea up and running.  It seems that is the
way forward.

Cam
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH v5 4/5] Inter-VM shared memory PCI device

2010-05-10 Thread Anthony Liguori


On 05/10/2010 11:20 AM, Cam Macdonell wrote:

On Mon, May 10, 2010 at 9:38 AM, Anthony Liguorianth...@codemonkey.ws  wrote:
   

On 05/10/2010 10:28 AM, Avi Kivity wrote:
 

On 05/10/2010 06:22 PM, Cam Macdonell wrote:
   
 
   

+
+/* if the position is -1, then it's shared memory region fd */
+if (incoming_posn == -1) {
+
+s-num_eventfds = 0;
+
+if (check_shm_size(s, incoming_fd) == -1) {
+exit(-1);
+}
+
+/* creating a BAR in qemu_chr callback may be crazy */
+create_shared_memory_BAR(s, incoming_fd);

 

It probably is... why can't you create it during initialization?
   

This is for the shared memory server implementation, so the fd for the
shared memory has to be received (over the qemu char device) from the
server before the BAR can be created via qemu_ram_mmap() which adds
the necessary memory

 


We could do the handshake during initialization.  I'm worried that the
device will appear without the BAR, and strange things will happen.  But the
chardev API is probably not geared for passing data during init.

Anthony, any ideas?
   

Why can't we create the BAR with just normal RAM and then change it to a
mmap()'d fd after initialization?  This will be behavior would be important
for live migration as it would let you quickly migrate preserving the memory
contents without waiting for an external program to reconnect.

Regards,

Anthony Lioguori

 

Otherwise, if the BAR is allocated during initialization, I would have
to use MAP_FIXED to mmap the memory.  This is what I did before the
qemu_ram_mmap() function was added.
 

What would happen to any data written to the BAR before the the handshake
completed?  I think it would disappear.
   

You don't have to do MAP_FIXED.  You can allocate a ram area and map that in
when disconnected.  When you connect, you create another ram area and
memcpy() the previous ram area to the new one.  You then map the second ram
area in.
 

the memcpy() would overwrite the contents of the shared memory each
time a guest joins which would be dangerous.
   


I think those are reasonable semantics and is really the only way to get 
guest-transparent reconnect.  The later is pretty critical for guest 
transparent live migration.



 From the guest's perspective, it's totally transparent.  For the backend,
I'd suggest having an explicit initialized ack or something so that it
knows that the data is now mapped to the guest.
 

Yes, I think the ack is the way to go, so the guest has to be aware of
it.  Would setting a flag in the driver-specific config space be an
acceptable ack that the shared region is now mapped?
   


You know it's mapped because it's mapped when the pci map function 
returns.  You don't need the guest to explicitly tell you.


Regards,

Anthony Liguori


Cam
   


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH v5 4/5] Inter-VM shared memory PCI device

2010-05-10 Thread Avi Kivity


On 05/10/2010 06:38 PM, Anthony Liguori wrote:



Otherwise, if the BAR is allocated during initialization, I would have
to use MAP_FIXED to mmap the memory.  This is what I did before the
qemu_ram_mmap() function was added.


What would happen to any data written to the BAR before the the 
handshake completed?  I think it would disappear.


You don't have to do MAP_FIXED.  You can allocate a ram area and map 
that in when disconnected.  When you connect, you create another ram 
area and memcpy() the previous ram area to the new one.  You then map 
the second ram area in.


But it's a shared memory area.  Other peers could have connected and 
written some data in.  The memcpy() would destroy their data.




From the guest's perspective, it's totally transparent.  For the 
backend, I'd suggest having an explicit initialized ack or something 
so that it knows that the data is now mapped to the guest.


From the peers' perspective, it's non-transparent :(

Also it doubles the transient memory requirement.



If you're doing just a ring queue in shared memory, it should allow 
disconnect/reconnect during live migration asynchronously to the 
actual qemu live migration.




Live migration of guests using shared memory is interesting.  You'd need 
to freeze all peers on one node, disconnect, reconnect, and restart them 
on the other node.


--
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCHv7] add mergeable buffers support to vhost_net

2010-05-10 Thread David Stevens

Since datalen carries the difference and will be negative by that amount
from the original loop, what about just adding something like:

}
if (headcount)
heads[headcount-1].len += datalen;
[and really, headcount 0 since datalen  0, so just:

heads[headcount-1].len += datalen;

+-DLS


kvm-ow...@vger.kernel.org wrote on 05/10/2010 09:43:03 AM:

 On Wed, Apr 28, 2010 at 01:57:12PM -0700, David L Stevens wrote:
  @@ -218,18 +248,19 @@ static void handle_rx(struct vhost_net *
  use_mm(net-dev.mm);
  mutex_lock(vq-mutex);
  vhost_disable_notify(vq);
  -   hdr_size = vq-hdr_size;
  +   vhost_hlen = vq-vhost_hlen;
  
  vq_log = unlikely(vhost_has_feature(net-dev, VHOST_F_LOG_ALL)) ?
 vq-log : NULL;
  
  -   for (;;) {
  -  head = vhost_get_vq_desc(net-dev, vq, vq-iov,
  -ARRAY_SIZE(vq-iov),
  -out, in,
  -vq_log, log);
  +   while ((datalen = vhost_head_len(vq, sock-sk))) {
  +  headcount = vhost_get_desc_n(vq, vq-heads,
  +datalen + vhost_hlen,
  +in, vq_log, log);
  +  if (headcount  0)
  + break;
 /* OK, now we need to know about added descriptors. */
  -  if (head == vq-num) {
  +  if (!headcount) {
if (unlikely(vhost_enable_notify(vq))) {
   /* They have slipped one in as we were
* doing that: check again. */
  @@ -241,46 +272,53 @@ static void handle_rx(struct vhost_net *
break;
 }
 /* We don't need to be notified again. */
  -  if (out) {
  - vq_err(vq, Unexpected descriptor format for RX: 
  -out %d, int %d\n,
  -out, in);
  - break;
  -  }
  -  /* Skip header. TODO: support TSO/mergeable rx buffers. */
  -  s = move_iovec_hdr(vq-iov, vq-hdr, hdr_size, in);
  +  if (vhost_hlen)
  + /* Skip header. TODO: support TSO. */
  + s = move_iovec_hdr(vq-iov, vq-hdr, vhost_hlen, in);
  +  else
  + s = copy_iovec_hdr(vq-iov, vq-hdr, vq-sock_hlen, in);
 msg.msg_iovlen = in;
 len = iov_length(vq-iov, in);
 /* Sanity check */
 if (!len) {
vq_err(vq, Unexpected header len for RX: 
   %zd expected %zd\n,
  -iov_length(vq-hdr, s), hdr_size);
  +iov_length(vq-hdr, s), vhost_hlen);
break;
 }
 err = sock-ops-recvmsg(NULL, sock, msg,
   len, MSG_DONTWAIT | MSG_TRUNC);
 /* TODO: Check specific error and bomb out unless EAGAIN? */
 if (err  0) {
  - vhost_discard_vq_desc(vq);
  + vhost_discard_desc(vq, headcount);
break;
 }
  -  /* TODO: Should check and handle checksum. */
  -  if (err  len) {
  - pr_err(Discarded truncated rx packet: 
  - len %d  %zd\n, err, len);
  - vhost_discard_vq_desc(vq);
  +  if (err != datalen) {
  + pr_err(Discarded rx packet: 
  + len %d, expected %zd\n, err, datalen);
  + vhost_discard_desc(vq, headcount);
continue;
 }
 len = err;
  -  err = memcpy_toiovec(vq-hdr, (unsigned char *)hdr, hdr_size);
  -  if (err) {
  - vq_err(vq, Unable to write vnet_hdr at addr %p: %d\n,
  -vq-iov-iov_base, err);
  +  if (vhost_hlen 
  +  memcpy_toiovecend(vq-hdr, (unsigned char *)hdr, 0,
  +  vhost_hlen)) {
  + vq_err(vq, Unable to write vnet_hdr at addr %p\n,
  +vq-iov-iov_base);
break;
 }
  -  len += hdr_size;
  -  vhost_add_used_and_signal(net-dev, vq, head, len);
  +  /* TODO: Should check and handle checksum. */
  +  if (vhost_has_feature(net-dev, VIRTIO_NET_F_MRG_RXBUF) 
  +  memcpy_toiovecend(vq-hdr, (unsigned char *)headcount,
  +  offsetof(typeof(hdr), num_buffers),
  +  sizeof(hdr.num_buffers))) {
  + vq_err(vq, Failed num_buffers write);
  + vhost_discard_desc(vq, headcount);
  + break;
  +  }
  +  len += vhost_hlen;
  +  vhost_add_used_and_signal_n(net-dev, vq, vq-heads,
  +   headcount);
 if (unlikely(vq_log))
vhost_log_write(vq, vq_log, log, len);
 total_len += len;
 
 OK I think I see the bug here: vhost_add_used_and_signal_n
 does not get the actual length, it gets the iovec length from vhost.
 Guest virtio uses this as packet length, with bad results.
 
 So I have applied the follows and it seems to have fixed the problem:
 
 diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
 index c16db02..9d7496d 100644
 --- a/drivers/vhost/net.c
 +++ b/drivers/vhost/net.c
 @@ -219,7 +219,7 @@ static int peek_head_len(struct vhost_virtqueue *vq, 

 struct sock *sk)
  /* This is a multi-buffer

Re: [PATCH v5 4/5] Inter-VM shared memory PCI device

2010-05-10 Thread Anthony Liguori


On 05/10/2010 11:59 AM, Avi Kivity wrote:

On 05/10/2010 06:38 PM, Anthony Liguori wrote:



Otherwise, if the BAR is allocated during initialization, I would have
to use MAP_FIXED to mmap the memory.  This is what I did before the
qemu_ram_mmap() function was added.


What would happen to any data written to the BAR before the the 
handshake completed?  I think it would disappear.


You don't have to do MAP_FIXED.  You can allocate a ram area and map 
that in when disconnected.  When you connect, you create another ram 
area and memcpy() the previous ram area to the new one.  You then map 
the second ram area in.


But it's a shared memory area.  Other peers could have connected and 
written some data in.  The memcpy() would destroy their data.


Why try to attempt to support multi-master shared memory?  What's the 
use-case?


Regards,

Anthony Liguori



From the guest's perspective, it's totally transparent.  For the 
backend, I'd suggest having an explicit initialized ack or 
something so that it knows that the data is now mapped to the guest.


From the peers' perspective, it's non-transparent :(

Also it doubles the transient memory requirement.



If you're doing just a ring queue in shared memory, it should allow 
disconnect/reconnect during live migration asynchronously to the 
actual qemu live migration.




Live migration of guests using shared memory is interesting.  You'd 
need to freeze all peers on one node, disconnect, reconnect, and 
restart them on the other node.




--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCHv7] add mergeable buffers support to vhost_net

2010-05-10 Thread Michael S. Tsirkin

On Mon, May 10, 2010 at 10:09:03AM -0700, David Stevens wrote:
 Since datalen carries the difference and will be negative by that amount
 from the original loop, what about just adding something like:
 
 }
 if (headcount)
 heads[headcount-1].len += datalen;
 [and really, headcount 0 since datalen  0, so just:
 
 heads[headcount-1].len += datalen;
 
 +-DLS

This works too, just does more checks and comparisons.
I am still surprised that you were unable to reproduce the problem.

 
 kvm-ow...@vger.kernel.org wrote on 05/10/2010 09:43:03 AM:
 
  On Wed, Apr 28, 2010 at 01:57:12PM -0700, David L Stevens wrote:
   @@ -218,18 +248,19 @@ static void handle_rx(struct vhost_net *
   use_mm(net-dev.mm);
   mutex_lock(vq-mutex);
   vhost_disable_notify(vq);
   -   hdr_size = vq-hdr_size;
   +   vhost_hlen = vq-vhost_hlen;
   
   vq_log = unlikely(vhost_has_feature(net-dev, VHOST_F_LOG_ALL)) ?
  vq-log : NULL;
   
   -   for (;;) {
   -  head = vhost_get_vq_desc(net-dev, vq, vq-iov,
   -ARRAY_SIZE(vq-iov),
   -out, in,
   -vq_log, log);
   +   while ((datalen = vhost_head_len(vq, sock-sk))) {
   +  headcount = vhost_get_desc_n(vq, vq-heads,
   +datalen + vhost_hlen,
   +in, vq_log, log);
   +  if (headcount  0)
   + break;
  /* OK, now we need to know about added descriptors. */
   -  if (head == vq-num) {
   +  if (!headcount) {
 if (unlikely(vhost_enable_notify(vq))) {
/* They have slipped one in as we were
 * doing that: check again. */
   @@ -241,46 +272,53 @@ static void handle_rx(struct vhost_net *
 break;
  }
  /* We don't need to be notified again. */
   -  if (out) {
   - vq_err(vq, Unexpected descriptor format for RX: 
   -out %d, int %d\n,
   -out, in);
   - break;
   -  }
   -  /* Skip header. TODO: support TSO/mergeable rx buffers. */
   -  s = move_iovec_hdr(vq-iov, vq-hdr, hdr_size, in);
   +  if (vhost_hlen)
   + /* Skip header. TODO: support TSO. */
   + s = move_iovec_hdr(vq-iov, vq-hdr, vhost_hlen, in);
   +  else
   + s = copy_iovec_hdr(vq-iov, vq-hdr, vq-sock_hlen, in);
  msg.msg_iovlen = in;
  len = iov_length(vq-iov, in);
  /* Sanity check */
  if (!len) {
 vq_err(vq, Unexpected header len for RX: 
%zd expected %zd\n,
   -iov_length(vq-hdr, s), hdr_size);
   +iov_length(vq-hdr, s), vhost_hlen);
 break;
  }
  err = sock-ops-recvmsg(NULL, sock, msg,
len, MSG_DONTWAIT | MSG_TRUNC);
  /* TODO: Check specific error and bomb out unless EAGAIN? */
  if (err  0) {
   - vhost_discard_vq_desc(vq);
   + vhost_discard_desc(vq, headcount);
 break;
  }
   -  /* TODO: Should check and handle checksum. */
   -  if (err  len) {
   - pr_err(Discarded truncated rx packet: 
   - len %d  %zd\n, err, len);
   - vhost_discard_vq_desc(vq);
   +  if (err != datalen) {
   + pr_err(Discarded rx packet: 
   + len %d, expected %zd\n, err, datalen);
   + vhost_discard_desc(vq, headcount);
 continue;
  }
  len = err;
   -  err = memcpy_toiovec(vq-hdr, (unsigned char *)hdr, hdr_size);
   -  if (err) {
   - vq_err(vq, Unable to write vnet_hdr at addr %p: %d\n,
   -vq-iov-iov_base, err);
   +  if (vhost_hlen 
   +  memcpy_toiovecend(vq-hdr, (unsigned char *)hdr, 0,
   +  vhost_hlen)) {
   + vq_err(vq, Unable to write vnet_hdr at addr %p\n,
   +vq-iov-iov_base);
 break;
  }
   -  len += hdr_size;
   -  vhost_add_used_and_signal(net-dev, vq, head, len);
   +  /* TODO: Should check and handle checksum. */
   +  if (vhost_has_feature(net-dev, VIRTIO_NET_F_MRG_RXBUF) 
   +  memcpy_toiovecend(vq-hdr, (unsigned char *)headcount,
   +  offsetof(typeof(hdr), num_buffers),
   +  sizeof(hdr.num_buffers))) {
   + vq_err(vq, Failed num_buffers write);
   + vhost_discard_desc(vq, headcount);
   + break;
   +  }
   +  len += vhost_hlen;
   +  vhost_add_used_and_signal_n(net-dev, vq, vq-heads,
   +   headcount);
  if (unlikely(vq_log))
 vhost_log_write(vq, vq_log, log, len);
  total_len += len;
  
  OK I think I see the bug here: vhost_add_used_and_signal_n
  does not get the actual length, it gets the iovec length from vhost.
  Guest virtio uses this as packet length, with bad results.
  
  So I have applied the

Re: [PATCHv2] KVM: inject #UD if instruction emulation fails and exit to userspace

2010-05-10 Thread Gleb Natapov

On Mon, May 10, 2010 at 07:06:05PM +0300, Mohammed Gamal wrote:
 On Mon, May 10, 2010 at 1:25 PM, Gleb Natapov g...@redhat.com wrote:
  On Mon, May 10, 2010 at 11:16:56AM +0300, Gleb Natapov wrote:
  Do not kill VM when instruction emulation fails. Inject #UD and report
  failure to userspace instead. Userspace may choose to reenter guest if
  vcpu is in userspace (cpl == 3) in which case guest OS will kill
  offending process and continue running.
 
 
 I am curious to know what'd happen in case the vcpu is in kernel space
 (cpl == 0). Is that case handled?
 
Currently no matter where emulation fails VM is stopped and cpu state is
printed on stderr. After that patch userspace may choose to continue VM
execution after emulation error (#UD will be injected into VM though). The
policy is in userspace, but I don't see the point to continue execution
after emulation failed in kernel. How kernel can recover from the #UD?

--
Gleb.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCHv7] add mergeable buffers support to vhost_net

2010-05-10 Thread Michael S. Tsirkin

On Wed, Apr 28, 2010 at 01:57:12PM -0700, David L Stevens wrote:
 @@ -218,18 +248,19 @@ static void handle_rx(struct vhost_net *
   use_mm(net-dev.mm);
   mutex_lock(vq-mutex);
   vhost_disable_notify(vq);
 - hdr_size = vq-hdr_size;
 + vhost_hlen = vq-vhost_hlen;
  
   vq_log = unlikely(vhost_has_feature(net-dev, VHOST_F_LOG_ALL)) ?
   vq-log : NULL;
  
 - for (;;) {
 - head = vhost_get_vq_desc(net-dev, vq, vq-iov,
 -  ARRAY_SIZE(vq-iov),
 -  out, in,
 -  vq_log, log);
 + while ((datalen = vhost_head_len(vq, sock-sk))) {
 + headcount = vhost_get_desc_n(vq, vq-heads,
 +  datalen + vhost_hlen,
 +  in, vq_log, log);
 + if (headcount  0)
 + break;
   /* OK, now we need to know about added descriptors. */
 - if (head == vq-num) {
 + if (!headcount) {
   if (unlikely(vhost_enable_notify(vq))) {
   /* They have slipped one in as we were
* doing that: check again. */


So I think this breaks handling for a failure mode where
we get an skb that is larger than the max packet guest
can get. The right thing to do in this case is to
drop the skb, we currently do this by passing
truncate flag to recvmsg.

In particular, with mergeable buffers off, if we get an skb
that does not fit in a single packet, this code will
spread it over multiple buffers.

You should be able to reproduce this fairly easily
by disabling both indirect buffers and mergeable buffers
on qemu command line. With current code TCP still
works by falling back on small packets. I think
with your code it will get stuck forever once
we get an skb that is too large for us to handle.

-- 
MST


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH v5 4/5] Inter-VM shared memory PCI device

2010-05-10 Thread Cam Macdonell

On Mon, May 10, 2010 at 11:25 AM, Anthony Liguori anth...@codemonkey.ws wrote:
 On 05/10/2010 11:59 AM, Avi Kivity wrote:

 On 05/10/2010 06:38 PM, Anthony Liguori wrote:

 Otherwise, if the BAR is allocated during initialization, I would have
 to use MAP_FIXED to mmap the memory.  This is what I did before the
 qemu_ram_mmap() function was added.

 What would happen to any data written to the BAR before the the
 handshake completed?  I think it would disappear.

 You don't have to do MAP_FIXED.  You can allocate a ram area and map that
 in when disconnected.  When you connect, you create another ram area and
 memcpy() the previous ram area to the new one.  You then map the second ram
 area in.

 But it's a shared memory area.  Other peers could have connected and
 written some data in.  The memcpy() would destroy their data.

 Why try to attempt to support multi-master shared memory?  What's the
 use-case?

I don't see it as multi-master, but that the latest guest to join
shouldn't have their contents take precedence.  In developing this
patch, my motivation has been to let the guests decide.  If the memcpy
is always done, even when no data is written, a guest cannot join
without overwriting everything.

One use case we're looking at is having VMs using a map reduce
framework like Hadoop or Phoenix running in VMs.  However, if a
workqueue is stored or data transfer passes through shared memory, a
system can't scale up the number of workers because each new guest
will erase the shared memory (and the workqueue or in progress data
transfer).

In cases where the latest guest to join wants to clear the memory, it
can do so without the automatic memcpy.  The guest can do a memset
once it knows the memory is attached.  My opinion is to leave it to
the guests and the application that is using the shared memory to
decide what to do on guest joins.

Cam


 Regards,

 Anthony Liguori


 From the guest's perspective, it's totally transparent.  For the backend,
 I'd suggest having an explicit initialized ack or something so that it
 knows that the data is now mapped to the guest.

 From the peers' perspective, it's non-transparent :(

 Also it doubles the transient memory requirement.


 If you're doing just a ring queue in shared memory, it should allow
 disconnect/reconnect during live migration asynchronously to the actual qemu
 live migration.


 Live migration of guests using shared memory is interesting.  You'd need
 to freeze all peers on one node, disconnect, reconnect, and restart them on
 the other node.



--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCHv7] add mergeable buffers support to vhost_net

2010-05-10 Thread David Stevens

netdev-ow...@vger.kernel.org wrote on 05/10/2010 10:25:57 AM:

 On Mon, May 10, 2010 at 10:09:03AM -0700, David Stevens wrote:
  Since datalen carries the difference and will be negative by that 
amount
  from the original loop, what about just adding something like:
  
  }
  if (headcount)
  heads[headcount-1].len += datalen;
  [and really, headcount 0 since datalen  0, so just:
  
  heads[headcount-1].len += datalen;
  
  +-DLS
 
 This works too, just does more checks and comparisons.
 I am still surprised that you were unable to reproduce the problem.
 

I'm sure it happened, and probably had a performance
penalty on my systems too, but not as much as yours.
I didn't see any obvious performance difference running
with the patch, though; not sure why. I'll instrument to
see how often it's happening, I think.
But fixed now, good catch!

+-DLS

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH v5 4/5] Inter-VM shared memory PCI device

2010-05-10 Thread Anthony Liguori


On 05/10/2010 12:43 PM, Cam Macdonell wrote:

On Mon, May 10, 2010 at 11:25 AM, Anthony Liguorianth...@codemonkey.ws  wrote:
   

On 05/10/2010 11:59 AM, Avi Kivity wrote:
 

On 05/10/2010 06:38 PM, Anthony Liguori wrote:
   
 

Otherwise, if the BAR is allocated during initialization, I would have
to use MAP_FIXED to mmap the memory.  This is what I did before the
qemu_ram_mmap() function was added.
 

What would happen to any data written to the BAR before the the
handshake completed?  I think it would disappear.
   

You don't have to do MAP_FIXED.  You can allocate a ram area and map that
in when disconnected.  When you connect, you create another ram area and
memcpy() the previous ram area to the new one.  You then map the second ram
area in.
 

But it's a shared memory area.  Other peers could have connected and
written some data in.  The memcpy() would destroy their data.
   

Why try to attempt to support multi-master shared memory?  What's the
use-case?
 

I don't see it as multi-master, but that the latest guest to join
shouldn't have their contents take precedence.  In developing this
patch, my motivation has been to let the guests decide.  If the memcpy
is always done, even when no data is written, a guest cannot join
without overwriting everything.

One use case we're looking at is having VMs using a map reduce
framework like Hadoop or Phoenix running in VMs.  However, if a
workqueue is stored or data transfer passes through shared memory, a
system can't scale up the number of workers because each new guest
will erase the shared memory (and the workqueue or in progress data
transfer).
   


(Replying again to list)

What data structure would you use?  For a lockless ring queue, you can 
only support a single producer and consumer.  To achieve bidirectional 
communication in virtio, we always use two queues.


If you're adding additional queues to support other levels of 
communication, you can always use different areas of shared memory.


I guess this is the point behind the doorbell mechanism?

Regards,

Anthony Liguori


In cases where the latest guest to join wants to clear the memory, it
can do so without the automatic memcpy.  The guest can do a memset
once it knows the memory is attached.  My opinion is to leave it to
the guests and the application that is using the shared memory to
decide what to do on guest joins.

Cam

   

Regards,

Anthony Liguori

 

 From the guest's perspective, it's totally transparent.  For the backend,
I'd suggest having an explicit initialized ack or something so that it
knows that the data is now mapped to the guest.
 

 From the peers' perspective, it's non-transparent :(

Also it doubles the transient memory requirement.

   

If you're doing just a ring queue in shared memory, it should allow
disconnect/reconnect during live migration asynchronously to the actual qemu
live migration.

 

Live migration of guests using shared memory is interesting.  You'd need
to freeze all peers on one node, disconnect, reconnect, and restart them on
the other node.

   


 

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
   


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Instant crash (invalid free()) at migrate for all 32bit qemu-kvm-0.12

2010-05-10 Thread Michael Tokarev


Seeing an.. interesting thing here.  Kvm crashes during migrate,
instantly.  Freshly-built 0.12.4, with a simple idle guest
(linux 2.6.32).

Starting TCP receiver on the same host, so using exactly the same
kvm binary.  And it drops into shell with large diagnostics output
from glibc:

(qemu) migrate tcp:localhost:3210
*** glibc detected *** /usr/bin/kvm: free(): invalid next size (fast): 
0x084c7fd0 ***
=== Backtrace: =
/lib/libc.so.6[0xf7ae1905]
/lib/libc.so.6[0xf7ae31a3]
/lib/libc.so.6(cfree+0x6d)[0xf7ae622d]
/usr/bin/kvm[0x806ed6f]
/usr/bin/kvm[0x806ee53]
/usr/bin/kvm[0x8050fb6]
/usr/bin/kvm[0x805114b]
/usr/bin/kvm[0x810e090]
/usr/bin/kvm[0x8105ee9]
/usr/bin/kvm[0x8106cce]
/usr/bin/kvm[0x8051d28]
/usr/bin/kvm[0x806d3c4]
/usr/bin/kvm[0x8054bbd]
/lib/libc.so.6(__libc_start_main+0xe5)[0xf7a8cb55]
/usr/bin/kvm[0x804e711]
=== Memory map: 
08048000-08244000 r-xp  08:01 77392  
/usr/bin/kvm
08244000-08256000 rw-p 001fc000 08:01 77392  
/usr/bin/kvm
08256000-08506000 rw-p  00:00 0
08506000-08516000 rw-p  00:00 0
08516000-08517000 rw-p  00:00 0
08517000-0851f000 rw-p  00:00 0
0851f000-085f9000 rw-p  00:00 0
e8f4a000-e8f4b000 ---p  00:00 0
e8f4b000-e974a000 rw-p  00:00 0
e9f4a000-e9f4b000 ---p  00:00 0
e9f4b000-ea74a000 rw-p  00:00 0
ea74a000-ea74b000 ---p  00:00 0
ea74b000-eaf4a000 rw-p  00:00 0
eaf4a000-eaf4b000 ---p  00:00 0
eaf4b000-eb74a000 rw-p  00:00 0
eb74a000-eb74b000 ---p  00:00 0
eb74b000-ebf4a000 rw-p  00:00 0
ec60-ec621000 rw-p  00:00 0
ec621000-ec70 ---p  00:00 0
ec74a000-ec864000 rw-s  00:04 2293771
/SYSV (deleted)
ec864000-ec89e000 rw-p  00:00 0
ec91b000-ec923000 r-xp  08:01 32779  
/usr/lib/libXcursor.so.1.0.2
ec923000-ec924000 rw-p 7000 08:01 32779  
/usr/lib/libXcursor.so.1.0.2
ec939000-eca27000 r--p 0018 08:01 106154 
/usr/lib/locale/locale-archive
eca27000-ecc27000 r--p  08:01 106154 
/usr/lib/locale/locale-archive
ecc27000-ecc2d000 r-xp  08:01 75964  
/usr/lib/libXrandr.so.2.2.0
ecc2d000-ecc2e000 rw-p 5000 08:01 75964  
/usr/lib/libXrandr.so.2.2.0
ecc43000-ecd28000 rw-p  00:00 0
ecded000-ece4e000 rw-p  00:00 0
ece4e000-ece55000 r--s  08:01 84015  
/usr/lib/gconv/gconv-modules.cache
ece55000-ece56000 rw-p  00:00 0
ece56000-ede56000 rw-p  00:00 0
ede56000-ede58000 rw-p  00:00 0
ede58000-ede78000 rw-p  00:00 0
ede78000-ede79000 rw-p  00:00 0
ede79000-ede83000 r-xp  08:01 188305 
/lib/libnss_files-2.10.2.so
ede83000-ede84000 r--p 9000 08:01 188305 
/lib/libnss_files-2.10.2.so
ede84000-ede85000 rw-p a000 08:01 188305 
/lib/libnss_files-2.10.2.so
ede9a000-ede9b000 rw-p  00:00 0
ede9b000-edebb000 rw-p  00:00 0
edebb000-edebd000 rw-p  00:00 0
edebd000-f5ebd000 rw-p  00:00 0
f5ebd000-f5ebe000 rw-p  00:00 0
f5ebe000-f5ebf000 ---p  00:00 0
f5ebf000-f66c2000 rw-p  00:00 0
f66c2000-f66c6000 r-xp  08:01 33553  
/usr/lib/libXdmcp.so.6.0.0
f66c6000-f66c7000 rw-p 3000 08:01 33553  
/usr/lib/libXdmcp.so.6.0.0
f66c7000-f66c9000 r-xp  08:01 33554  
/usr/lib/libXau.so.6.0.0
f66c9000-f66ca000 rw-p 1000 08:01 33554  
/usr/lib/libXau.so.6.0.0
f66ca000-f66cc000 r-xp  08:01 187523 
/lib/libx86.so.1
f66cc000-f66cd000 rw-p 1000 08:01 187523 
/lib/libx86.so.1
f66cd000-f66ce000 rw-p  00:00 0
f66ce000-f66d2000 r-xp  08:01 187387 
/lib/libattr.so.1.1.0
f66d2000-f66d3000 rw-p 3000 08:01 187387 
/lib/libattr.so.1.1.0
f66d3000-f66d7000 r-xp  08:01 33792  
/usr/lib/libogg.so.0.5.3
f66d7000-f66d8000 rw-p 3000 08:01 33792  
/usr/lib/libogg.so.0.5.3
f66d8000-f66ff000 r-xp  08:01 76459  
/usr/lib/libvorbis.so.0.4.4
f66ff000-f670 rw-p 00026000 08:01 76459  
/usr/lib/libvorbis.so.0.4.4
f670-f670b000 r-xp  08:01 73934  
/usr/lib/libvorbisenc.so.2.0.3
f670b000-f67fb000 rw-p a000 08:01 73934  
/usr/lib/libvorbisenc.so.2.0.3
f67fb000-f684d000 r-xp  08:01 33564  
/usr/lib/libFLAC.so.8.2.0

Re: [PATCH v5 4/5] Inter-VM shared memory PCI device

2010-05-10 Thread Cam Macdonell

On Mon, May 10, 2010 at 11:52 AM, Anthony Liguori anth...@codemonkey.ws wrote:
 On 05/10/2010 12:43 PM, Cam Macdonell wrote:

 On Mon, May 10, 2010 at 11:25 AM, Anthony Liguorianth...@codemonkey.ws
  wrote:


 On 05/10/2010 11:59 AM, Avi Kivity wrote:


 On 05/10/2010 06:38 PM, Anthony Liguori wrote:




 Otherwise, if the BAR is allocated during initialization, I would
 have
 to use MAP_FIXED to mmap the memory.  This is what I did before the
 qemu_ram_mmap() function was added.


 What would happen to any data written to the BAR before the the
 handshake completed?  I think it would disappear.


 You don't have to do MAP_FIXED.  You can allocate a ram area and map
 that
 in when disconnected.  When you connect, you create another ram area
 and
 memcpy() the previous ram area to the new one.  You then map the second
 ram
 area in.


 But it's a shared memory area.  Other peers could have connected and
 written some data in.  The memcpy() would destroy their data.


 Why try to attempt to support multi-master shared memory?  What's the
 use-case?


 I don't see it as multi-master, but that the latest guest to join
 shouldn't have their contents take precedence.  In developing this
 patch, my motivation has been to let the guests decide.  If the memcpy
 is always done, even when no data is written, a guest cannot join
 without overwriting everything.

 One use case we're looking at is having VMs using a map reduce
 framework like Hadoop or Phoenix running in VMs.  However, if a
 workqueue is stored or data transfer passes through shared memory, a
 system can't scale up the number of workers because each new guest
 will erase the shared memory (and the workqueue or in progress data
 transfer).


 (Replying again to list)

Sorry about that.

 What data structure would you use?  For a lockless ring queue, you can only
 support a single producer and consumer.  To achieve bidirectional
 communication in virtio, we always use two queues.

MCS locks can work with multiple producer/consumers, either with busy
waiting or using the doorbell mechanism.


 If you're adding additional queues to support other levels of communication,
 you can always use different areas of shared memory.

True, and my point is simply that the memcpy would wipe those all out.


 I guess this is the point behind the doorbell mechanism?

Yes.


 Regards,

 Anthony Liguori

 In cases where the latest guest to join wants to clear the memory, it
 can do so without the automatic memcpy.  The guest can do a memset
 once it knows the memory is attached.  My opinion is to leave it to
 the guests and the application that is using the shared memory to
 decide what to do on guest joins.

 Cam



 Regards,

 Anthony Liguori



  From the guest's perspective, it's totally transparent.  For the
 backend,
 I'd suggest having an explicit initialized ack or something so that
 it
 knows that the data is now mapped to the guest.


  From the peers' perspective, it's non-transparent :(

 Also it doubles the transient memory requirement.



 If you're doing just a ring queue in shared memory, it should allow
 disconnect/reconnect during live migration asynchronously to the actual
 qemu
 live migration.



 Live migration of guests using shared memory is interesting.  You'd need
 to freeze all peers on one node, disconnect, reconnect, and restart them
 on
 the other node.





 --
 To unsubscribe from this list: send the line unsubscribe kvm in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html



--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: network between host and guest

2010-05-10 Thread Thanasis

Probably it's slirp.
The url you provided doesn't help much though.
Thanks.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: migration fails with virtio-blk

2010-05-10 Thread Michael Tokarev


10.05.2010 13:49, Michael Tokarev пишет:

Apparently there's an issue with migration when
virtio-blk is in use at the time migration occurs.

See Debian bug #580649 for details:
http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=580649

In short: 2.6.26 (guest) + 0.12.3 + virtio-blk fails.
2.6.32 guest works, 0.11 works, non-virtio-blk works.


It looks like a problem reported on qemu-devel
list should fix the issue here:
http://marc.info/?l=qemu-develm=127350821419989

Testing...  It wont be fast because I want to hear
from the OP.

Thanks!

/mjt
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: migration fails with virtio-blk

2010-05-10 Thread Michael Tokarev


10.05.2010 22:38, Anthony Liguori wrote:

On 05/10/2010 01:34 PM, Michael Tokarev wrote:

In short: 2.6.26 (guest) + 0.12.3 + virtio-blk fails.
2.6.32 guest works, 0.11 works, non-virtio-blk works.


It looks like a problem reported on qemu-devel
list should fix the issue here:
http://marc.info/?l=qemu-develm=127350821419989


Can't be, that's a vhost patch and 0.12.3 doesn't support vhost.


Well, that was really close. qemu-kvm-0.12 has the
following code in that place:

 vdev-features = features;

the patch mentioned has this:

+if (vdev-set_features)
+vdev-set_features(vdev, features);

so basically kvm had the same code here,
just not wrapped into a function ;)

And sure thing you're right, it does not help any. :)

Thanks!

/mjt
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: migration fails with virtio-blk

2010-05-10 Thread Anthony Liguori


On 05/10/2010 01:34 PM, Michael Tokarev wrote:

10.05.2010 13:49, Michael Tokarev пишет:

Apparently there's an issue with migration when
virtio-blk is in use at the time migration occurs.

See Debian bug #580649 for details:
http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=580649

In short: 2.6.26 (guest) + 0.12.3 + virtio-blk fails.
2.6.32 guest works, 0.11 works, non-virtio-blk works.


It looks like a problem reported on qemu-devel
list should fix the issue here:
http://marc.info/?l=qemu-develm=127350821419989


Can't be, that's a vhost patch and 0.12.3 doesn't support vhost.

Regards,

Anthony Liguori


Testing...  It wont be fast because I want to hear
from the OP.

Thanks!

/mjt
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH] pci: cleanly backout of pci_qdev_init()

2010-05-10 Thread Alex Williamson

If the init function of a device fails, as might happen with device
assignment, we never undo the work done by do_pci_register_device().
This not only causes a bit of a memory leak, but also leaves a bogus
pointer in the bus devices array that can cause a segfault or
garbage data from 'info pci'.

Signed-off-by: Alex Williamson alex.william...@redhat.com
---

 hw/pci.c |   17 -
 1 files changed, 12 insertions(+), 5 deletions(-)

diff --git a/hw/pci.c b/hw/pci.c
index f167436..3d3560e 100644
--- a/hw/pci.c
+++ b/hw/pci.c
@@ -625,6 +625,14 @@ static PCIDevice *do_pci_register_device(PCIDevice 
*pci_dev, PCIBus *bus,
 return pci_dev;
 }
 
+static void do_pci_unregister_device(PCIDevice *pci_dev)
+{
+qemu_free_irqs(pci_dev-irq);
+pci_dev-bus-devices[pci_dev-devfn] = NULL;
+pci_config_free(pci_dev);
+return;
+}
+
 PCIDevice *pci_register_device(PCIBus *bus, const char *name,
int instance_size, int devfn,
PCIConfigReadFunc *config_read,
@@ -680,10 +688,7 @@ static int pci_unregister_device(DeviceState *dev)
 return ret;
 
 pci_unregister_io_regions(pci_dev);
-
-qemu_free_irqs(pci_dev-irq);
-pci_dev-bus-devices[pci_dev-devfn] = NULL;
-pci_config_free(pci_dev);
+do_pci_unregister_device(pci_dev);
 return 0;
 }
 
@@ -1652,8 +1657,10 @@ static int pci_qdev_init(DeviceState *qdev, DeviceInfo 
*base)
 if (pci_dev == NULL)
 return -1;
 rc = info-init(pci_dev);
-if (rc != 0)
+if (rc != 0) {
+do_pci_unregister_device(pci_dev);
 return rc;
+}
 
 /* rom loading */
 if (pci_dev-romfile == NULL  info-romfile != NULL)

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Keep index within boundaries in kvmppc_44x_emul_tlbwe()

2010-05-10 Thread Hollis Blanchard

On Sun, May 9, 2010 at 8:26 AM, Roel Kluin roel.kl...@gmail.com wrote:
 An index of KVM44x_GUEST_TLB_SIZE is already one too large.

 Signed-off-by: Roel Kluin roel.kl...@gmail.com

Acked-by: Hollis Blanchard hol...@penguinppc.org

Thanks Roel.

-Hollis
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Keep index within boundaries in kvmppc_44x_emul_tlbwe()

2010-05-10 Thread Alexander Graf


On 11.05.2010, at 00:58, Hollis Blanchard wrote:

 On Sun, May 9, 2010 at 8:26 AM, Roel Kluin roel.kl...@gmail.com wrote:
 An index of KVM44x_GUEST_TLB_SIZE is already one too large.
 
 Signed-off-by: Roel Kluin roel.kl...@gmail.com
 
 Acked-by: Hollis Blanchard hol...@penguinppc.org

Acked-by: Alexander Graf ag...@suse.de


Alex

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[69/98] KVM: remove unused load_segment_descriptor_to_kvm_desct

2010-05-10 Thread Greg KH

2.6.32-stable review patch.  If anyone has any objections, please let us know.

--

From: Marcelo Tosatti mtosa...@redhat.com

Commit 78ce64a384 in v2.6.32.12 introduced a warning due to unused
load_segment_descriptor_to_kvm_desct helper, which has been opencoded by
this commit.

On upstream, the helper was removed as part of a different commit.

Remove the now unused function.

Signed-off-by: Marcelo Tosatti mtosa...@redhat.com
Signed-off-by: Greg Kroah-Hartman gre...@suse.de

---
 arch/x86/kvm/x86.c |   12 
 1 file changed, 12 deletions(-)

--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -4155,18 +4155,6 @@ static u16 get_segment_selector(struct k
return kvm_seg.selector;
 }
 
-static int load_segment_descriptor_to_kvm_desct(struct kvm_vcpu *vcpu,
-   u16 selector,
-   struct kvm_segment *kvm_seg)
-{
-   struct desc_struct seg_desc;
-
-   if (load_guest_segment_descriptor(vcpu, selector, seg_desc))
-   return 1;
-   seg_desct_to_kvm_desct(seg_desc, selector, kvm_seg);
-   return 0;
-}
-
 static int kvm_load_realmode_segment(struct kvm_vcpu *vcpu, u16 selector, int 
seg)
 {
struct kvm_segment segvar = {


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH v5 4/5] Inter-VM shared memory PCI device

2010-05-10 Thread Cam Macdonell

On Mon, May 10, 2010 at 5:59 AM, Avi Kivity a...@redhat.com wrote:
 On 04/21/2010 08:53 PM, Cam Macdonell wrote:

 Support an inter-vm shared memory device that maps a shared-memory object
 as a
 PCI device in the guest.  This patch also supports interrupts between
 guest by
 communicating over a unix domain socket.  This patch applies to the
 qemu-kvm
 repository.

     -device ivshmem,size=size in format accepted by -m[,shm=shm name]

 Interrupts are supported between multiple VMs by using a shared memory
 server
 by using a chardev socket.

     -device ivshmem,size=size in format accepted by -m[,shm=shm name]
                     [,chardev=id][,msi=on][,irqfd=on][,vectors=n]
     -chardev socket,path=path,id=id

 (shared memory server is qemu.git/contrib/ivshmem-server)

 Sample programs and init scripts are in a git repo here:


 +typedef struct EventfdEntry {
 +    PCIDevice *pdev;
 +    int vector;
 +} EventfdEntry;
 +
 +typedef struct IVShmemState {
 +    PCIDevice dev;
 +    uint32_t intrmask;
 +    uint32_t intrstatus;
 +    uint32_t doorbell;
 +
 +    CharDriverState * chr;
 +    CharDriverState ** eventfd_chr;
 +    int ivshmem_mmio_io_addr;
 +
 +    pcibus_t mmio_addr;
 +    unsigned long ivshmem_offset;
 +    uint64_t ivshmem_size; /* size of shared memory region */
 +    int shm_fd; /* shared memory file descriptor */
 +
 +    int nr_allocated_vms;
 +    /* array of eventfds for each guest */
 +    int ** eventfds;
 +    /* keep track of # of eventfds for each guest*/
 +    int * eventfds_posn_count;


 More readable:

  typedef struct Peer {
      int nb_eventfds;
      int *eventfds;
  } Peer;
  int nb_peers;
  Peer *peers;

 Does eventfd_chr need to be there as well?

No it does not, eventfd_chr store character devices for receiving
interrupts when irqfd is not available, so we only them for this
guest, not for our peers.

I've switched over to this more readable naming you've suggested.


 +
 +    int nr_alloc_guests;
 +    int vm_id;
 +    int num_eventfds;
 +    uint32_t vectors;
 +    uint32_t features;
 +    EventfdEntry *eventfd_table;
 +
 +    char * shmobj;
 +    char * sizearg;


 Does this need to be part of the state?

They are because they're passed in as qdev properties from the
command-line so I thought they needed to be in the state struct to be
assigned via DEFINE_PROP_...


 +} IVShmemState;
 +
 +/* registers for the Inter-VM shared memory device */
 +enum ivshmem_registers {
 +    IntrMask = 0,
 +    IntrStatus = 4,
 +    IVPosition = 8,
 +    Doorbell = 12,
 +};
 +
 +static inline uint32_t ivshmem_has_feature(IVShmemState *ivs, int
 feature) {
 +    return (ivs-features  (1  feature));
 +}
 +
 +static inline int is_power_of_two(int x) {
 +    return (x  (x-1)) == 0;
 +}


 argument needs to be uint64_t to avoid overflow with large BARs.  Return
 type can be bool.

 +static void ivshmem_io_writel(void *opaque, uint8_t addr, uint32_t val)
 +{
 +    IVShmemState *s = opaque;
 +
 +    u_int64_t write_one = 1;
 +    u_int16_t dest = val  16;
 +    u_int16_t vector = val  0xff;
 +
 +    addr= 0xfe;


 Why 0xfe?  Can understand 0xfc or 0xff.

Forgot to change to 0xfc when registers went from 16 to 32-bits.


 +
 +    switch (addr)
 +    {
 +        case IntrMask:
 +            ivshmem_IntrMask_write(s, val);
 +            break;
 +
 +        case IntrStatus:
 +            ivshmem_IntrStatus_write(s, val);
 +            break;
 +
 +        case Doorbell:
 +            /* check doorbell range */
 +            if ((vector= 0)  (vector  s-eventfds_posn_count[dest]))
 {


 What if dest is too big?  We overflow s-eventfds_posn_count.

added a check for that.

Thanks,
Cam
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [SeaBIOS] About cpu_set, CPU hotplug and related subjects

2010-05-10 Thread Carl-Daniel Hailfinger

On 28.04.2010 12:45, Gleb Natapov wrote:
 On Wed, Apr 28, 2010 at 12:41:51PM +0200, Jes Sorensen wrote:
   
 The CPU declarations are particularly tricky as they get pretty big and
 complex and need to live in the DSDT, whereas a lot of other things we
 can shift off to separate SSDT tables and only put the minimum that
 needs to be generated dynamically in it's own table.

 
 We can generate complex code statically and call it from dynamically
 generated CPU declarations.
   

There is some ACPI code generator in coreboot. Not sure if it is usable
for those purposes. coreboot uses it to generate AMD CPU frequency tables.

Regards,
Carl-Daniel

-- 
http://www.hailfinger.org/

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCHv3] Support for booting from virtio disks

2010-05-10 Thread Kevin O'Connor

On Mon, May 10, 2010 at 11:36:37AM +0300, Gleb Natapov wrote:
 This patch adds native support for booting from virtio disks to Seabios.
 
 Signed-off-by: Gleb Natapov g...@redhat.com

Thanks - commit 89acfa3f.  The patch had some compile errors on gcc3.4
and gcc4.5 - I went ahead and committed an update to fix the errors
(commit 7d09d0e3).

-Kevin
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCHv2] KVM: inject #UD if instruction emulation fails and exit to userspace

2010-05-10 Thread Takuya Yoshikawa


(2010/05/11 2:33), Gleb Natapov wrote:

On Mon, May 10, 2010 at 07:06:05PM +0300, Mohammed Gamal wrote:

On Mon, May 10, 2010 at 1:25 PM, Gleb Natapovg...@redhat.com  wrote:

On Mon, May 10, 2010 at 11:16:56AM +0300, Gleb Natapov wrote:

Do not kill VM when instruction emulation fails. Inject #UD and report
failure to userspace instead. Userspace may choose to reenter guest if
vcpu is in userspace (cpl == 3) in which case guest OS will kill
offending process and continue running.



I am curious to know what'd happen in case the vcpu is in kernel space
(cpl == 0). Is that case handled?


Currently no matter where emulation fails VM is stopped and cpu state is
printed on stderr. After that patch userspace may choose to continue VM
execution after emulation error (#UD will be injected into VM though). The
policy is in userspace, but I don't see the point to continue execution
after emulation failed in kernel. How kernel can recover from the #UD?


I don't see what is the recommended(possible) way of trouble shooting yet.

If the user is managing both the guest and the host, it's simple: no worth
reentering the guest. The user will just see the stderr.

But what about if the user can only see the guest?

In the case of non-virt, usually the user sees oops log or something and calls
to a support staff. Compared to that, if VM is silently stopped, what 
information
can the user see? In such a case, catching up the trouble and calling to a 
support
staff is host side management staff's job?

If you give us preferred way of trouble shooting of KVM, from the developer's 
point
of view, it will really help us to prepare for the future use of KVM.

  What we can do will depend on your development!
  And this is one of the reasons why I'm interested in the x86's emulation
  development. :-)

Thanks,
  Takuya




--
Gleb.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

RE: [PATCH 0/4 v3] KVM: VMX: Support hosted VMM coexsitence.

2010-05-10 Thread Xu, Dongxiao

Hi Avi,

Do you have other comments on this version of patch? 

Thanks,
Dongxiao

Xu, Dongxiao wrote:
 Hi all,
 
 This is hosted VMM coexistence support v3.
 
 Main changes from v2:
 1) Change vmm_coexistence to vmm_exclusive.
 2) Some code structure changes. Split the original 3 patches to 4.
 3) Address some comments from Avi.
 
 Main changes from v1:
 1) Add an module option vmm_coexistence to decide whether to
 enable this feature. Currently it is off defaultly.
 2) Each time when a KVM vcpu is scheduled in, it will invalidate EPT
 and VPID TLBs to avoid conflict between different VMMs.
 
 VMX: Support for coexistence of KVM and other hosted VMMs.
 
 The following NOTE is picked up from Intel SDM 3B 27.3 chapter,
 MANAGING VMCS REGIONS AND POINTERS.
 
 --
 NOTE
 As noted in Section 21.1, the processor may optimize VMX operation
 by maintaining the state of an active VMCS (one for which VMPTRLD
 has been executed) on the processor. Before relinquishing control to
 other system software that may, without informing the VMM, remove
 power from the processor (e.g., for transitions to S3 or S4) or leave
 VMX operation, a VMM must VMCLEAR all active VMCSs. This ensures
 that all VMCS data cached by the processor are flushed to memory
 and that no other software can corrupt the current VMM's VMCS data.
 It is also recommended that the VMM execute VMXOFF after such
 executions of VMCLEAR.
 --
 
 Currently, VMCLEAR is called at VCPU migration. To support hosted
 VMM coexistence, this patch modifies the VMCLEAR/VMPTRLD and
 VMXON/VMXOFF usages. VMCLEAR will be called when VCPU is
 scheduled out of a physical CPU, while VMPTRLD is called when VCPU
 is scheduled in a physical CPU. Also this approach could eliminates
 the IPI mechanism for original VMCLEAR. As suggested by SDM,
 VMXOFF will be called after VMCLEAR, and VMXON will be called
 before VMPTRLD.
 
 With this patchset, KVM and VMware Workstation 7 could launch
 serapate guests and they can work well with each other.

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC][PATCH 11/12] KVM: introduce new API for getting/switching dirty bitmaps

2010-05-10 Thread Marcelo Tosatti

On Tue, May 04, 2010 at 10:08:21PM +0900, Takuya Yoshikawa wrote:
 Now that dirty bitmaps are accessible from user space, we export the
 addresses of these to achieve zero-copy dirty log check:
 
   KVM_GET_USER_DIRTY_LOG_ADDR
 
 We also need an API for triggering dirty bitmap switch to take the full
 advantage of double buffered bitmaps.
 
   KVM_SWITCH_DIRTY_LOG
 
 See the documentation in this patch for precise explanations.
 
 About performance improvement: the most important feature of switch API
 is the lightness. In our test, this appeared in the form of improved
 responses for GUI manipulations.
 
 Signed-off-by: Takuya Yoshikawa yoshikawa.tak...@oss.ntt.co.jp
 Signed-off-by: Fernando Luis Vazquez Cao ferna...@oss.ntt.co.jp
 CC: Avi Kivity a...@redhat.com
 CC: Alexander Graf ag...@suse.de
 ---
  Documentation/kvm/api.txt |   87 
 +
  arch/ia64/kvm/kvm-ia64.c  |   27 +-
  arch/powerpc/kvm/book3s.c |   18 +++--
  arch/x86/kvm/x86.c|   44 ---
  include/linux/kvm.h   |   11 ++
  include/linux/kvm_host.h  |4 ++-
  virt/kvm/kvm_main.c   |   63 +
  7 files changed, 220 insertions(+), 34 deletions(-)
 
 diff --git a/Documentation/kvm/api.txt b/Documentation/kvm/api.txt
 index a237518..c106c83 100644
 --- a/Documentation/kvm/api.txt
 +++ b/Documentation/kvm/api.txt
 @@ -892,6 +892,93 @@ arguments.
  This ioctl is only useful after KVM_CREATE_IRQCHIP.  Without an in-kernel
  irqchip, the multiprocessing state must be maintained by userspace.
  
 +4.39 KVM_GET_USER_DIRTY_LOG_ADDR
 +
 +Capability: KVM_CAP_USER_DIRTY_LOG (=1 see below)
 +Architectures: all
 +Type: vm ioctl
 +Parameters: struct kvm_user_dirty_log (in/out)
 +Returns: 0 on success, -1 on error
 +
 +This ioctl makes it possible to use KVM_SWITCH_DIRTY_LOG (see 4.40) instead
 +of the old dirty log manipulation by KVM_GET_DIRTY_LOG.
 +
 +A bit about KVM_CAP_USER_DIRTY_LOG
 +
 +Before this ioctl was introduced, dirty bitmaps for dirty page logging were
 +allocated in the kernel's memory space.  But we have now moved them to user
 +space to get more flexiblity and performance.  To achive this move without
 +breaking the compatibility, we will split KVM_CAP_USER_DIRTY_LOG capability
 +into a few generations which can be identified by its check extension
 +return values.
 +
 +This KVM_GET_USER_DIRTY_LOG_ADDR belongs to the first generation with the
 +KVM_SWITCH_DIRTY_LOG (4.40) and must be supported by all generations.
 +
 +What you get
 +
 +By using this, you can get paired bitmap addresses which are accessible from
 +user space.  See the explanation in 4.40 for the roles of these two bitmaps.
 +
 +How to Get
 +
 +Before calling this, you have to set the slot member of kvm_user_dirty_log
 +to indicate the target memory slot.
 +
 +struct kvm_user_dirty_log {
 + __u32 slot;
 + __u32 flags;
 + __u64 dirty_bitmap;
 + __u64 dirty_bitmap_old;
 +};
 +
 +The addresses will be set in the paired members: dirty_bitmap and _old.

Why not pass the bitmap address to the kernel, instead of having the
kernel allocate it. Because apparently you plan to do that in a new
generation anyway?

Also, why does the kernel need to know about different bitmaps?

One alternative would be:

KVM_SWITCH_DIRTY_LOG passing the address of a bitmap. If the active
bitmap was clean, it returns 0, no switch performed. If the active
bitmap was dirty, the kernel switches to the new bitmap and returns 1.

And the responsability of cleaning the new bitmap could also be left
for userspace.

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC][PATCH RFC 10/12] KVM: move dirty bitmaps to user space

2010-05-10 Thread Marcelo Tosatti

On Tue, May 04, 2010 at 10:07:02PM +0900, Takuya Yoshikawa wrote:
 We move dirty bitmaps to user space.
 
  - Allocation and destruction: we use do_mmap() and do_munmap().
The new bitmap space is twice longer than the original one and we
use the additional space for double buffering: this makes it
possible to update the active bitmap while letting the user space
read the other one safely. For x86, we can also remove the vmalloc()
in kvm_vm_ioctl_get_dirty_log().
 
  - Bitmap manipulations: we replace all functions which access dirty
bitmaps with *_user() functions.
 
  - For ia64: moving the dirty bitmaps of memory slots does not affect
ia64 much because it's using a different place to store dirty logs
rather than the dirty bitmaps of memory slots: all we have to change
are sync and get of dirty log, so we don't need set_bit_user like
functions for ia64.
 
 Signed-off-by: Takuya Yoshikawa yoshikawa.tak...@oss.ntt.co.jp
 Signed-off-by: Fernando Luis Vazquez Cao ferna...@oss.ntt.co.jp
 CC: Avi Kivity a...@redhat.com
 CC: Alexander Graf ag...@suse.de
 ---
  arch/ia64/kvm/kvm-ia64.c  |   15 +-
  arch/powerpc/kvm/book3s.c |5 +++-
  arch/x86/kvm/x86.c|   25 --
  include/linux/kvm_host.h  |3 +-
  virt/kvm/kvm_main.c   |   62 +---
  5 files changed, 82 insertions(+), 28 deletions(-)
 
 diff --git a/arch/ia64/kvm/kvm-ia64.c b/arch/ia64/kvm/kvm-ia64.c
 index 17fd65c..03503e6 100644
 --- a/arch/ia64/kvm/kvm-ia64.c
 +++ b/arch/ia64/kvm/kvm-ia64.c
 @@ -1823,11 +1823,19 @@ static int kvm_ia64_sync_dirty_log(struct kvm *kvm,
   n = kvm_dirty_bitmap_bytes(memslot);
   base = memslot-base_gfn / BITS_PER_LONG;
  
 + r = -EFAULT;
 + if (!access_ok(VERIFY_WRITE, memslot-dirty_bitmap, n))
 + goto out;
 +
   for (i = 0; i  n/sizeof(long); ++i) {
   if (dirty_bitmap[base + i])
   memslot-is_dirty = true;
  
 - memslot-dirty_bitmap[i] = dirty_bitmap[base + i];
 + if (__put_user(dirty_bitmap[base + i],
 +memslot-dirty_bitmap[i])) {
 + r = -EFAULT;
 + goto out;
 + }
   dirty_bitmap[base + i] = 0;
   }
   r = 0;
 @@ -1858,7 +1866,10 @@ int kvm_vm_ioctl_get_dirty_log(struct kvm *kvm,
   if (memslot-is_dirty) {
   kvm_flush_remote_tlbs(kvm);
   n = kvm_dirty_bitmap_bytes(memslot);
 - memset(memslot-dirty_bitmap, 0, n);
 + if (clear_user(memslot-dirty_bitmap, n)) {
 + r = -EFAULT;
 + goto out;
 + }
   memslot-is_dirty = false;
   }
   r = 0;
 diff --git a/arch/powerpc/kvm/book3s.c b/arch/powerpc/kvm/book3s.c
 index 4b074f1..2a31d2f 100644
 --- a/arch/powerpc/kvm/book3s.c
 +++ b/arch/powerpc/kvm/book3s.c
 @@ -1210,7 +1210,10 @@ int kvm_vm_ioctl_get_dirty_log(struct kvm *kvm,
   kvmppc_mmu_pte_pflush(vcpu, ga, ga_end);
  
   n = kvm_dirty_bitmap_bytes(memslot);
 - memset(memslot-dirty_bitmap, 0, n);
 + if (clear_user(memslot-dirty_bitmap, n)) {
 + r = -EFAULT;
 + goto out;
 + }
   memslot-is_dirty = false;
   }
  
 diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
 index 023c7f8..32a3d94 100644
 --- a/arch/x86/kvm/x86.c
 +++ b/arch/x86/kvm/x86.c
 @@ -2760,40 +2760,37 @@ int kvm_vm_ioctl_get_dirty_log(struct kvm *kvm,
   /* If nothing is dirty, don't bother messing with page tables. */
   if (memslot-is_dirty) {
   struct kvm_memslots *slots, *old_slots;
 - unsigned long *dirty_bitmap;
 + unsigned long __user *dirty_bitmap;
 + unsigned long __user *dirty_bitmap_old;
  
   spin_lock(kvm-mmu_lock);
   kvm_mmu_slot_remove_write_access(kvm, log-slot);
   spin_unlock(kvm-mmu_lock);
  
 - r = -ENOMEM;
 - dirty_bitmap = vmalloc(n);
 - if (!dirty_bitmap)
 + dirty_bitmap = memslot-dirty_bitmap;
 + dirty_bitmap_old = memslot-dirty_bitmap_old;
 + r = -EFAULT;
 + if (clear_user(dirty_bitmap_old, n))
   goto out;
 - memset(dirty_bitmap, 0, n);
  
   r = -ENOMEM;
   slots = kzalloc(sizeof(struct kvm_memslots), GFP_KERNEL);
 - if (!slots) {
 - vfree(dirty_bitmap);
 + if (!slots)
   goto out;
 - }
 +
   memcpy(slots, kvm-memslots, sizeof(struct kvm_memslots));
 - slots-memslots[log-slot].dirty_bitmap = dirty_bitmap;
 + slots-memslots[log-slot].dirty_bitmap = dirty_bitmap_old;
 + slots-memslots[log-slot].dirty_bitmap_old = dirty_bitmap;

[PATCH 2/4] KVM: Clean up duplicate assignment

2010-05-10 Thread Sheng Yang

mmu.free() already set root_hpa to INVALID_PAGE, no need to do it again in the
destory_kvm_mmu().

kvm_x86_ops-set_cr4() and set_efer() already assign cr4/efer to
vcpu-arch.cr4/efer, no need to do it again later.

Signed-off-by: Sheng Yang sh...@linux.intel.com
---
 arch/x86/kvm/mmu.c |5 ++---
 arch/x86/kvm/x86.c |4 +---
 2 files changed, 3 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 95bee9d..5412185 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -2450,10 +2450,9 @@ static int init_kvm_mmu(struct kvm_vcpu *vcpu)
 static void destroy_kvm_mmu(struct kvm_vcpu *vcpu)
 {
ASSERT(vcpu);
-   if (VALID_PAGE(vcpu-arch.mmu.root_hpa)) {
+   if (VALID_PAGE(vcpu-arch.mmu.root_hpa))
+   /* mmu.free() should set root_hpa = INVALID_PAGE */
vcpu-arch.mmu.free(vcpu);
-   vcpu-arch.mmu.root_hpa = INVALID_PAGE;
-   }
 }
 
 int kvm_mmu_reset_context(struct kvm_vcpu *vcpu)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 2cae460..764f89b 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -486,7 +486,7 @@ int __kvm_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
return 1;
 
kvm_x86_ops-set_cr4(vcpu, cr4);
-   vcpu-arch.cr4 = cr4;
+
kvm_mmu_reset_context(vcpu);
 
return 0;
@@ -720,8 +720,6 @@ static int set_efer(struct kvm_vcpu *vcpu, u64 efer)
 
kvm_x86_ops-set_efer(vcpu, efer);
 
-   vcpu-arch.efer = efer;
-
vcpu-arch.mmu.base_role.nxe = (efer  EFER_NX)  !tdp_enabled;
kvm_mmu_reset_context(vcpu);
 
-- 
1.7.0.1

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH 4/4] VMX: x86: Only reset MMU when necessary

2010-05-10 Thread Sheng Yang

Only modifying some bits of CR0/CR4 needs paging mode switch.

Add update_rsvd_bits_mask() to address EFER.NX bit updating for reserved bits.

Signed-off-by: Sheng Yang sh...@linux.intel.com
---
 arch/x86/include/asm/kvm_host.h |1 +
 arch/x86/kvm/mmu.c  |   17 ++---
 arch/x86/kvm/x86.c  |   14 --
 3 files changed, 27 insertions(+), 5 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index ed48904..c8c8a03 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -553,6 +553,7 @@ void kvm_mmu_slot_remove_write_access(struct kvm *kvm, int 
slot);
 void kvm_mmu_zap_all(struct kvm *kvm);
 unsigned int kvm_mmu_calculate_mmu_pages(struct kvm *kvm);
 void kvm_mmu_change_mmu_pages(struct kvm *kvm, unsigned int kvm_nr_mmu_pages);
+void update_rsvd_bits_mask(struct kvm_vcpu *vcpu);
 
 int load_pdptrs(struct kvm_vcpu *vcpu, unsigned long cr3);
 
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 5412185..98abdcf 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -2335,6 +2335,19 @@ static void reset_rsvds_bits_mask(struct kvm_vcpu *vcpu, 
int level)
}
 }
 
+void update_rsvd_bits_mask(struct kvm_vcpu *vcpu)
+{
+   if (!is_paging(vcpu))
+   return;
+   if (is_long_mode(vcpu))
+   reset_rsvds_bits_mask(vcpu, PT64_ROOT_LEVEL);
+   else if (is_pae(vcpu))
+   reset_rsvds_bits_mask(vcpu, PT32E_ROOT_LEVEL);
+   else
+   reset_rsvds_bits_mask(vcpu, PT32_ROOT_LEVEL);
+}
+EXPORT_SYMBOL_GPL(update_rsvd_bits_mask);
+
 static int paging64_init_context_common(struct kvm_vcpu *vcpu, int level)
 {
struct kvm_mmu *context = vcpu-arch.mmu;
@@ -2400,18 +2413,16 @@ static int init_kvm_tdp_mmu(struct kvm_vcpu *vcpu)
context-gva_to_gpa = nonpaging_gva_to_gpa;
context-root_level = 0;
} else if (is_long_mode(vcpu)) {
-   reset_rsvds_bits_mask(vcpu, PT64_ROOT_LEVEL);
context-gva_to_gpa = paging64_gva_to_gpa;
context-root_level = PT64_ROOT_LEVEL;
} else if (is_pae(vcpu)) {
-   reset_rsvds_bits_mask(vcpu, PT32E_ROOT_LEVEL);
context-gva_to_gpa = paging64_gva_to_gpa;
context-root_level = PT32E_ROOT_LEVEL;
} else {
-   reset_rsvds_bits_mask(vcpu, PT32_ROOT_LEVEL);
context-gva_to_gpa = paging32_gva_to_gpa;
context-root_level = PT32_ROOT_LEVEL;
}
+   update_rsvd_bits_mask(vcpu);
 
return 0;
 }
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index b59fc67..1c76e08 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -416,6 +416,9 @@ out:
 
 static int __kvm_set_cr0(struct kvm_vcpu *vcpu, unsigned long cr0)
 {
+   unsigned long old_cr0 = kvm_read_cr0(vcpu);
+   unsigned long update_bits = X86_CR0_PG | X86_CR0_PE;
+
cr0 |= X86_CR0_ET;
 
 #ifdef CONFIG_X86_64
@@ -449,7 +452,8 @@ static int __kvm_set_cr0(struct kvm_vcpu *vcpu, unsigned 
long cr0)
 
kvm_x86_ops-set_cr0(vcpu, cr0);
 
-   kvm_mmu_reset_context(vcpu);
+   if ((cr0 ^ old_cr0)  update_bits)
+   kvm_mmu_reset_context(vcpu);
return 0;
 }
 
@@ -487,7 +491,8 @@ int __kvm_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
 
kvm_x86_ops-set_cr4(vcpu, cr4);
 
-   kvm_mmu_reset_context(vcpu);
+   if ((cr4 ^ old_cr4)  pdptr_bits)
+   kvm_mmu_reset_context(vcpu);
 
return 0;
 }
@@ -692,6 +697,8 @@ static u32 emulated_msrs[] = {
 
 static int set_efer(struct kvm_vcpu *vcpu, u64 efer)
 {
+   u64 old_efer = vcpu-arch.efer;
+
if (efer  efer_reserved_bits)
return 1;
 
@@ -722,6 +729,9 @@ static int set_efer(struct kvm_vcpu *vcpu, u64 efer)
 
vcpu-arch.mmu.base_role.nxe = (efer  EFER_NX)  !tdp_enabled;
 
+   if ((efer ^ old_efer)  EFER_NX)
+   update_rsvd_bits_mask(vcpu);
+
return 0;
 }
 
-- 
1.7.0.1

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH 3/4] KVM: x86: Remove kvm_mmu_reset_context() in kvm_set_efer()

2010-05-10 Thread Sheng Yang

Modify EFER won't result in mode switch directly. After EFER.LME set, the
following set CR0.PG would result in mode switch to IA32e. And the later
action already covered by kvm_set_cr0().

Signed-off-by: Sheng Yang sh...@linux.intel.com
---
 arch/x86/kvm/x86.c |1 -
 1 files changed, 0 insertions(+), 1 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 764f89b..b59fc67 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -721,7 +721,6 @@ static int set_efer(struct kvm_vcpu *vcpu, u64 efer)
kvm_x86_ops-set_efer(vcpu, efer);
 
vcpu-arch.mmu.base_role.nxe = (efer  EFER_NX)  !tdp_enabled;
-   kvm_mmu_reset_context(vcpu);
 
return 0;
 }
-- 
1.7.0.1

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH 1/4] KVM: x86: Check LMA bit before set_efer

2010-05-10 Thread Sheng Yang

kvm_x86_ops-set_efer() would execute vcpu-arch.efer = efer, so the
checking of LMA bit didn't work.

Signed-off-by: Sheng Yang sh...@linux.intel.com
---
 arch/x86/kvm/x86.c |4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index cd8a606..2cae460 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -715,11 +715,11 @@ static int set_efer(struct kvm_vcpu *vcpu, u64 efer)
return 1;
}
 
-   kvm_x86_ops-set_efer(vcpu, efer);
-
efer = ~EFER_LMA;
efer |= vcpu-arch.efer  EFER_LMA;
 
+   kvm_x86_ops-set_efer(vcpu, efer);
+
vcpu-arch.efer = efer;
 
vcpu-arch.mmu.base_role.nxe = (efer  EFER_NX)  !tdp_enabled;
-- 
1.7.0.1

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC][PATCH 11/12] KVM: introduce new API for getting/switching dirty bitmaps

2010-05-10 Thread Takuya Yoshikawa


(2010/05/11 12:43), Marcelo Tosatti wrote:

On Tue, May 04, 2010 at 10:08:21PM +0900, Takuya Yoshikawa wrote:

+How to Get
+
+Before calling this, you have to set the slot member of kvm_user_dirty_log
+to indicate the target memory slot.
+
+struct kvm_user_dirty_log {
+   __u32 slot;
+   __u32 flags;
+   __u64 dirty_bitmap;
+   __u64 dirty_bitmap_old;
+};
+
+The addresses will be set in the paired members: dirty_bitmap and _old.


Why not pass the bitmap address to the kernel, instead of having the
kernel allocate it. Because apparently you plan to do that in a new
generation anyway?


Yes, we want to make qemu allocate and free bitmaps in the future.
But currently, those are strictly tied with memory slot registration and
changing it in one patch set is really difficult.

Anyway, we need kernel side allocation mechanism to keep the current
GET_DIRTY_LOG api. I don't mind not exporting kernel allocated bitmaps
in this patch set and later introducing a bitmap registration mechanism
in another patch set.

As this RFC is suggesting, kernel side double buffering infrastructure is
designed as general as possible and adding a new API like SWITCH can be done
naturally.



Also, why does the kernel need to know about different bitmaps?


Because we need to support current GET API. We don't have any way to get
a new bitmap in the case of GET and we don't want to do_mmap() every time
for GET.



One alternative would be:

KVM_SWITCH_DIRTY_LOG passing the address of a bitmap. If the active
bitmap was clean, it returns 0, no switch performed. If the active
bitmap was dirty, the kernel switches to the new bitmap and returns 1.

And the responsability of cleaning the new bitmap could also be left
for userspace.



That is a beautiful approach but we can do that only when we give up using
GET api.


I follow you and Avi's advice about that kind of maintenance policy!
What do you think?
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC][PATCH resend 8/12] asm-generic: bitops: introduce le bit offset macro

2010-05-10 Thread Takuya Yoshikawa





Yes, I'm just using in kernel space: qemu has its own endian related helpers.

So if you allow us to place this macro in asm-generic/bitops/* it will help us.


No problem at all then. Thanks for the explanation.

Acked-by: Arnd Bergmanna...@arndb.de


Thanks you both. I will add your Acked-by from now on!

  Takuya
--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC][PATCH 0/12] KVM, x86, ppc, asm-generic: moving dirty bitmaps to user space

2010-05-10 Thread Avi Kivity


On 05/04/2010 03:56 PM, Takuya Yoshikawa wrote:

[Performance test]

We measured the tsc needed to the ioctl()s for getting dirty logs in
kernel.

Test environment

   AMD Phenom(tm) 9850 Quad-Core Processor with 8GB memory


1. GUI test (running Ubuntu guest in graphical mode)

   sudo qemu-system-x86_64 -hda dirtylog_test.img -boot c -m 4192 -net ...

We show a relatively stable part to compare how much time is needed
for the basic parts of dirty log ioctl.

get.org   get.opt  switch.opt

slots[7].len=32768  278379 66398 64024
slots[8].len=32768  181246   270   160
slots[7].len=32768  263961 64673 64494
slots[8].len=32768  181655   265   160
slots[7].len=32768  263736 64701 64610
slots[8].len=32768  182785   267   160
slots[7].len=32768  260925 65360 65042
slots[8].len=32768  182579   264   160
slots[7].len=32768  267823 65915 65682
slots[8].len=32768  186350   271   160

At a glance, we know our optimization improved significantly compared
to the original get dirty log ioctl. This is true for both get.opt and
switch.opt. This has a really big impact for the personal KVM users who
drive KVM in GUI mode on their usual PCs.

Next, we notice that switch.opt improved a hundred nano seconds or so for
these slots. Although this may sound a bit tiny improvement, we can feel
this as a difference of GUI's responses like mouse reactions.
   


100 ns... this is a bit on the low side (and if you can measure it 
interactively you have much better reflexes than I).



To feel the difference, please try GUI on your PC with our patch series!
   


No doubt get.org - get.opt is measurable, but get.opt-switch.opt is 
problematic.  Have you tried profiling to see where the time is spent 
(well I can guess, clearing the write access from the sptes).




2. Live-migration test (4GB guest, write loop with 1GB buf)

We also did a live-migration test.

get.org   get.opt  switch.opt

slots[0].len=655360 797383261144222181
slots[1].len=37570478082186721   1965244   1842824
slots[2].len=637534208 1433562   1012723   1031213
slots[3].len=131072 216858   331   331
slots[4].len=131072 121635   225   164
slots[5].len=131072 120863   356   164
slots[6].len=16777216   121746  1133   156
slots[7].len=32768  120415   230   278
slots[8].len=32768  120368   216   149
slots[0].len=655360 806497194710223582
slots[1].len=37570478082142922   1878025   1895369
slots[2].len=637534208 1386512   1021309   1000345
slots[3].len=131072 221118   459   296
slots[4].len=131072 121516   272   166
slots[5].len=131072 122652   244   173
slots[6].len=16777216   123226 99185   149
slots[7].len=32768  121803   457   505
slots[8].len=32768  121586   216   155
slots[0].len=655360 766113211317213179
slots[1].len=37570478082155662   1974790   1842361
slots[2].len=637534208 1481411   1020004   1031352
slots[3].len=131072 223100   351   295
slots[4].len=131072 122982   436   164
slots[5].len=131072 122100   300   503
slots[6].len=16777216   123653   779   151
slots[7].len=32768  122617   284   157
slots[8].len=32768  122737   253   149

For slots other than 0,1,2 we can see the similar improvement.

Considering the fact that switch.opt does not depend on the bitmap length
except for kvm_mmu_slot_remove_write_access(), this is the cause of some
usec to msec time consumption: there might be some context switches.

But note that this was done with the workload which dirtied the memory
endlessly during the live-migration.

In usual workload, the number of dirty pages varies a lot for each iteration
and we should gain really a lot for relatively clean cases.
   


Can you post such a test, for an idle large guest?

--
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC][PATCH 0/12] KVM, x86, ppc, asm-generic: moving dirty bitmaps to user space

2010-05-10 Thread Takuya Yoshikawa





get.org get.opt switch.opt

slots[7].len=32768 278379 66398 64024
slots[8].len=32768 181246 270 160
slots[7].len=32768 263961 64673 64494
slots[8].len=32768 181655 265 160
slots[7].len=32768 263736 64701 64610
slots[8].len=32768 182785 267 160
slots[7].len=32768 260925 65360 65042
slots[8].len=32768 182579 264 160
slots[7].len=32768 267823 65915 65682
slots[8].len=32768 186350 271 160

At a glance, we know our optimization improved significantly compared
to the original get dirty log ioctl. This is true for both get.opt and
switch.opt. This has a really big impact for the personal KVM users who
drive KVM in GUI mode on their usual PCs.

Next, we notice that switch.opt improved a hundred nano seconds or so for
these slots. Although this may sound a bit tiny improvement, we can feel
this as a difference of GUI's responses like mouse reactions.


100 ns... this is a bit on the low side (and if you can measure it
interactively you have much better reflexes than I).


To feel the difference, please try GUI on your PC with our patch series!


No doubt get.org - get.opt is measurable, but get.opt-switch.opt is
problematic. Have you tried profiling to see where the time is spent
(well I can guess, clearing the write access from the sptes).


Sorry but no, and I agree with your guess.
Anyway, I want to do some profiling to confirm this guess.


BTW, If we only think about performance improvement of time, optimized
get(get.opt) may be enough at this stage.

But if we consider the future expansion like using user allocated bitmaps,
new API's introduced for switch.opt won't become waste, I think, because we
need a structure to get and export bitmap addresses.




In usual workload, the number of dirty pages varies a lot for each
iteration
and we should gain really a lot for relatively clean cases.


Can you post such a test, for an idle large guest?


OK, I'll do!





--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

89 matches

Mail list logo