Re: [Qemu-devel] Using PCI config space to indicate config location

2012-10-09 Thread Gerd Hoffmann
  Hi,

 Well, we also want to clean up the registers, so how about:

 BAR0: legacy, as is.  If you access this, don't use the others.

Ok.

 BAR1: new format virtio-pci layout.  If you use this, don't use BAR0.
 BAR2: virtio-cfg.  If you use this, don't use BAR0.

Why use two bars for this?  You can put them into one mmio bar, together
with the msi-x vector table and PBA.  Of course a pci capability
describing the location is helpful for that ;)

 BAR3: ISR. If you use this, don't use BAR0.

Again, I wouldn't hardcode that but use a capability.

 I prefer the cases exclusive (ie. use one or the other) as a clear path
 to remove the legacy layout; and leaving the ISR in BAR0 leaves us with
 an ugly corner case in future (ISR is BAR0 + 19?  WTF?).

Ok, so we have four register sets:

  (1) legacy layout
  (2) new virtio-pci
  (3) new virtio-config
  (4) new virtio-isr

We can have a vendor pci capability, with a dword for each register set:

  bit  31-- present bit
  bits 26-24 -- bar
  bits 23-0  -- offset

So current drivers which must support legacy can use this:

  legacy layout -- present, bar 0, offset 0
  new virtio-pci-- present, bar 1, offset 0
  new virtio-config -- present, bar 1, offset 256
  new virtio-isr-- present, bar 0, offset 19

[ For completeness: msi-x capability could add this: ]

  msi-x vector tablebar 1, offset 512
  msi-x pba bar 1, offset 768

 We'll never remove legacy so we shouldn't plan on it.  There are
 literally hundreds of thousands of VMs out there with the current virtio
 drivers installed in them.  We'll be supporting them for a very, very
 long time :-)

But new devices (virtio-qxl being a candidate) don't have old guests and
don't need to worry.

They could use this if they care about fast isr:

  legacy layout -- not present
  new virtio-pci-- present, bar 1, offset 0
  new virtio-config -- present, bar 1, offset 256
  new virtio-isr-- present, bar 0, offset 0

Or this if they don't worry about isr performance:

  legacy layout -- not present
  new virtio-pci-- present, bar 0, offset 0
  new virtio-config -- present, bar 0, offset 256
  new virtio-isr-- not present

 I don't think we gain a lot by moving the ISR into a separate BAR.
 Splitting up registers like that seems weird to me too.

Main advantage of defining a register set with just isr is that it
reduces pio address space consumtion for new virtio devices which don't
have to worry about the legacy layout (8 bytes which is minimum size for
io bars instead of 64 bytes).

 If we added an additional constraints that BAR1 was mirrored except for

Why add constraints?  We want something future-proof, don't we?

 The detection is simple: if BAR1 has non-zero length, it's new-style,
 otherwise legacy.

Doesn't fly.  BAR1 is in use today for MSI-X support.

 I agree that this is the best way to extend, but I think we should still
 use a transport feature bit.  We want to be able to detect within QEMU
 whether a guest is using these new features because we need to adjust
 migration state accordingly.

Why does migration need adjustments?

[ Not that I want veto a feature bit, but I don't see the need yet ]

cheers,
  Gerd
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/3] virtio-net: inline header support

2012-10-09 Thread Paolo Bonzini
Il 09/10/2012 06:59, Rusty Russell ha scritto:
 Paolo Bonzini pbonz...@redhat.com writes:
 Il 05/10/2012 07:43, Rusty Russell ha scritto:
 That's good.  But virtio_blk's scsi command is insoluble AFAICT.  As I
 said to Anthony, the best rules are always and never, so I'd really
 rather not have to grandfather that in.

 It is, but we can add a rule that if the (transport) flag
 VIRTIO_RING_F_ANY_HEADER_SG is set, the cdb field is always 32 bytes in
 virtio-blk.
 
 Could we do that?  It's the cmd length I'm concerned about; is it always
 32 in practice for some reason?

It is always 32 or less except in very obscure cases that are pretty
much confined to iSCSI.  We don't care about the obscure cases, and the
extra bytes don't hurt.

BTW, 32 is the default cdb_size used by virtio-scsi.

 Currently qemu does:
 
 struct sg_io_hdr hdr;
 memset(hdr, 0, sizeof(struct sg_io_hdr));
 hdr.interface_id = 'S';
 hdr.cmd_len = req-elem.out_sg[1].iov_len;
 hdr.cmdp = req-elem.out_sg[1].iov_base;
 hdr.dxfer_len = 0;
 
 If it's a command which expects more output data, there's no way to
 guess where the boundary is between that command and the data.

Yep, so I understood the problem right.

Paolo
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] vhost-blk: Add vhost-blk support v2

2012-10-09 Thread Asias He
vhost-blk is an in-kernel virito-blk device accelerator.

Due to lack of proper in-kernel AIO interface, this version converts
guest's I/O request to bio and use submit_bio() to submit I/O directly.
So this version any supports raw block device as guest's disk image,
e.g. /dev/sda, /dev/ram0. We can add file based image support to
vhost-blk once we have in-kernel AIO interface. There are some work in
progress for in-kernel AIO interface from Dave Kleikamp and Zach Brown:

   http://marc.info/?l=linux-fsdevelm=133312234313122

Performance evaluation:
-
1) LKVM
Fio with libaio ioengine on Fusion IO device using kvm tool
IOPS   Before   After   Improvement
seq-read   107  121 +13.0%
seq-write  130  179 +37.6%
rnd-read   102  122 +19.6%
rnd-write  125  159 +27.0%

2) QEMU
Fio with libaio ioengine on Fusion IO device using QEMU
IOPS   Before   After   Improvement
seq-read   76   123 +61.8%
seq-write  139  173 +24.4%
rnd-read   73   120 +64.3%
rnd-write  75   156 +108.0%

Userspace bits:
-
1) LKVM
The latest vhost-blk userspace bits for kvm tool can be found here:
g...@github.com:asias/linux-kvm.git blk.vhost-blk

2) QEMU
The latest vhost-blk userspace prototype for QEMU can be found here:
g...@github.com:asias/qemu.git blk.vhost-blk

Signed-off-by: Asias He as...@redhat.com
---
 drivers/vhost/Kconfig |   1 +
 drivers/vhost/Kconfig.blk |  10 +
 drivers/vhost/Makefile|   2 +
 drivers/vhost/blk.c   | 641 ++
 drivers/vhost/blk.h   |   8 +
 5 files changed, 662 insertions(+)
 create mode 100644 drivers/vhost/Kconfig.blk
 create mode 100644 drivers/vhost/blk.c
 create mode 100644 drivers/vhost/blk.h

diff --git a/drivers/vhost/Kconfig b/drivers/vhost/Kconfig
index 202bba6..acd8038 100644
--- a/drivers/vhost/Kconfig
+++ b/drivers/vhost/Kconfig
@@ -11,4 +11,5 @@ config VHOST_NET
 
 if STAGING
 source drivers/vhost/Kconfig.tcm
+source drivers/vhost/Kconfig.blk
 endif
diff --git a/drivers/vhost/Kconfig.blk b/drivers/vhost/Kconfig.blk
new file mode 100644
index 000..ff8ab76
--- /dev/null
+++ b/drivers/vhost/Kconfig.blk
@@ -0,0 +1,10 @@
+config VHOST_BLK
+   tristate Host kernel accelerator for virtio blk (EXPERIMENTAL)
+   depends on BLOCK   EXPERIMENTAL  m
+   ---help---
+ This kernel module can be loaded in host kernel to accelerate
+ guest block with virtio_blk. Not to be confused with virtio_blk
+ module itself which needs to be loaded in guest kernel.
+
+ To compile this driver as a module, choose M here: the module will
+ be called vhost_blk.
diff --git a/drivers/vhost/Makefile b/drivers/vhost/Makefile
index a27b053..1a8a4a5 100644
--- a/drivers/vhost/Makefile
+++ b/drivers/vhost/Makefile
@@ -2,3 +2,5 @@ obj-$(CONFIG_VHOST_NET) += vhost_net.o
 vhost_net-y := vhost.o net.o
 
 obj-$(CONFIG_TCM_VHOST) += tcm_vhost.o
+obj-$(CONFIG_VHOST_BLK) += vhost_blk.o
+vhost_blk-y := blk.o
diff --git a/drivers/vhost/blk.c b/drivers/vhost/blk.c
new file mode 100644
index 000..6b2445a
--- /dev/null
+++ b/drivers/vhost/blk.c
@@ -0,0 +1,641 @@
+/*
+ * Copyright (C) 2011 Taobao, Inc.
+ * Author: Liu Yuan tailai...@taobao.com
+ *
+ * Copyright (C) 2012 Red Hat, Inc.
+ * Author: Asias He as...@redhat.com
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2.
+ *
+ * virtio-blk server in host kernel.
+ */
+
+#include linux/miscdevice.h
+#include linux/module.h
+#include linux/vhost.h
+#include linux/virtio_blk.h
+#include linux/mutex.h
+#include linux/file.h
+#include linux/kthread.h
+#include linux/blkdev.h
+
+#include vhost.c
+#include vhost.h
+#include blk.h
+
+#define BLK_HDR0
+
+static DEFINE_IDA(vhost_blk_index_ida);
+
+enum {
+   VHOST_BLK_VQ_REQ = 0,
+   VHOST_BLK_VQ_MAX = 1,
+};
+
+struct req_page_list {
+   struct page **pages;
+   int pages_nr;
+};
+
+struct vhost_blk_req {
+   struct llist_node llnode;
+   struct req_page_list *pl;
+   struct vhost_blk *blk;
+
+   struct iovec *iov;
+   int iov_nr;
+
+   struct bio **bio;
+   atomic_t bio_nr;
+
+   sector_t sector;
+   int write;
+   u16 head;
+   long len;
+
+   u8 *status;
+};
+
+struct vhost_blk {
+   struct task_struct *host_kick;
+   struct vhost_blk_req *reqs;
+   struct vhost_virtqueue vq;
+   struct llist_head llhead;
+   struct vhost_dev dev;
+   u16 reqs_nr;
+   int index;
+};
+
+static inline int iov_num_pages(struct iovec *iov)
+{
+   return (PAGE_ALIGN((unsigned long)iov-iov_base + iov-iov_len) -
+  ((unsigned long)iov-iov_base  PAGE_MASK))  PAGE_SHIFT;
+}
+
+static int vhost_blk_setup(struct vhost_blk *blk)
+{
+   blk-reqs_nr = blk-vq.num;
+
+   blk-reqs = kmalloc(sizeof(struct vhost_blk_req) * blk-reqs_nr,
+   

KVM call agenda for 2012-10-09

2012-10-09 Thread Juan Quintela

Hi

Please send in any agenda topics that you have.

Thanks, Juan.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Qemu-devel] Using PCI config space to indicate config location

2012-10-09 Thread Avi Kivity
On 10/09/2012 05:16 AM, Rusty Russell wrote:
 Anthony Liguori aligu...@us.ibm.com writes:
 We'll never remove legacy so we shouldn't plan on it.  There are
 literally hundreds of thousands of VMs out there with the current virtio
 drivers installed in them.  We'll be supporting them for a very, very
 long time :-)
 
 You will be supporting this for qemu on x86, sure.  As I think we're
 still in the growth phase for virtio, I prioritize future spec
 cleanliness pretty high.

If a pure ppc hypervisor was on the table, this might have been
worthwhile.  As it is the codebase is shared, and the Linux drivers are
shared, so cleaning up the spec doesn't help the code.

 
 But I think you'll be surprised how fast this is deprecated:
 1) Bigger queues for block devices (guest-specified ringsize)
 2) Smaller rings for openbios (guest-specified alignment)
 3) All-mmio mode (powerpc)
 4) Whatever network features get numbers  31.
 
 I don't think we gain a lot by moving the ISR into a separate BAR.
 Splitting up registers like that seems weird to me too.
 
 Confused.  I proposed the same split as you have, just ISR by itself.

I believe Anthony objects to having the ISR by itself.  What is the
motivation for that?


-- 
error compiling committee.c: too many arguments to function
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Enabling IA32_TSC_ADJUST for guest VM

2012-10-09 Thread Avi Kivity
On 10/08/2012 07:30 PM, Marcelo Tosatti wrote:
 
 From Intel's manual:
 
 • If an execution of WRMSR to the IA32_TIME_STAMP_COUNTER MSR adds (or
 subtracts) value X from the TSC,
 the logical processor also adds (or subtracts) value X from the
 IA32_TSC_ADJUST MSR.
 
 This is not handled in the patch. 
 
 To support migration, it will be necessary to differentiate between
 guest initiated and userspace-model initiated msr write. That is, 
 only guest initiated TSC writes should affect the value of 
 IA32_TSC_ADJUST MSR.
 
 Avi, any better idea?
 

I think we need that anyway, since there are some read-only MSRs that
need to be configured by the host (nvmx capabilities).  So if we add
that feature it will be useful elsewhere.  I don't think it's possible
to do it in any other way:

Local offset value of the IA32_TSC for a
logical processor. Reset value is Zero. A
write to IA32_TSC will modify the local
offset in IA32_TSC_ADJUST and the
content of IA32_TSC, but does not affect
the internal invariant TSC hardware.

What we want to do is affect the internal invariant TSC hardware, so we
can't do that through the normal means.

btw, will tsc writes from userspace (after live migration) cause tsc
skew?  If so we should think how to model a guest-wide tsc.

-- 
error compiling committee.c: too many arguments to function
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: kvm-kmod 3.6 on linux 3.2.0

2012-10-09 Thread Avi Kivity
On 10/04/2012 02:14 PM, Peter Lieven wrote:
 Hi,
 
 kvm-kmod 3.6 fails to compile against a 3.2.0 kernel with the following error:
 
 /usr/src/kvm-kmod-3.6/x86/x86.c: In function ‘get_msr_mce’:
 /usr/src/kvm-kmod-3.6/x86/x86.c:1908:27: error: ‘kvm’ undeclared (first use 
 in this function)
 /usr/src/kvm-kmod-3.6/x86/x86.c:1908:27: note: each undeclared identifier is 
 reported only once for each function it appears in
 
 Any ideas?

Out of curiosity, why are you using kvm-kmod 3.6 but not Linux 3.6?

kvm-kmod is useful for embedded configurations where the main kernel has
to be fixed for some reason, but for general use cases it's better to
just use a matched set.


-- 
error compiling committee.c: too many arguments to function
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [kvmarm] [PATCH v2 06/14] KVM: ARM: Memory virtualization setup

2012-10-09 Thread Marc Zyngier
On Sat, 6 Oct 2012 17:33:43 -0400, Christoffer Dall
c.d...@virtualopensystems.com wrote:
 On Thu, Oct 4, 2012 at 10:23 PM, Min-gyu Kim mingyu84@samsung.com
 wrote:


 -Original Message-
 From: kvm-ow...@vger.kernel.org [mailto:kvm-ow...@vger.kernel.org] On
 Behalf Of Christoffer Dall
 Sent: Monday, October 01, 2012 6:11 PM
 To: kvm@vger.kernel.org; linux-arm-ker...@lists.infradead.org;
 kvm...@lists.cs.columbia.edu
 Cc: Marc Zyngier
 Subject: [PATCH v2 06/14] KVM: ARM: Memory virtualization setup

 +static void stage2_set_pte(struct kvm *kvm, struct
kvm_mmu_memory_cache
 *cache,
 +phys_addr_t addr, const pte_t *new_pte) {
 + pgd_t *pgd;
 + pud_t *pud;
 + pmd_t *pmd;
 + pte_t *pte, old_pte;
 +
 + /* Create 2nd stage page table mapping - Level 1 */
 + pgd = kvm-arch.pgd + pgd_index(addr);
 + pud = pud_offset(pgd, addr);
 + if (pud_none(*pud)) {
 + if (!cache)
 + return; /* ignore calls from kvm_set_spte_hva */
 + pmd = mmu_memory_cache_alloc(cache);
 + pud_populate(NULL, pud, pmd);
 + pmd += pmd_index(addr);
 + get_page(virt_to_page(pud));
 + } else
 + pmd = pmd_offset(pud, addr);
 +
 + /* Create 2nd stage page table mapping - Level 2 */
 + if (pmd_none(*pmd)) {
 + if (!cache)
 + return; /* ignore calls from kvm_set_spte_hva */
 + pte = mmu_memory_cache_alloc(cache);
 + clean_pte_table(pte);
 + pmd_populate_kernel(NULL, pmd, pte);
 + pte += pte_index(addr);
 + get_page(virt_to_page(pmd));
 + } else
 + pte = pte_offset_kernel(pmd, addr);
 +
 + /* Create 2nd stage page table mapping - Level 3 */
 + old_pte = *pte;
 + set_pte_ext(pte, *new_pte, 0);
 + if (pte_present(old_pte))
 + __kvm_tlb_flush_vmid(kvm);
 + else
 + get_page(virt_to_page(pte));
 +}


 I'm not sure about the 3-level page table, but isn't it necessary to
 clean the page table for 2nd level?
 There are two mmu_memory_cache_alloc calls. One has following
 clean_pte_table
 and the other doesn't have.
 
 hmm, it probably is - I couldn't really find the common case where
 this is done in the kernel normally (except for some custom loop in
 ioremap and idmap), but I added this fix:
 
 diff --git a/arch/arm/kvm/mmu.c b/arch/arm/kvm/mmu.c
 index 5394a52..f11ba27f 100644
 --- a/arch/arm/kvm/mmu.c
 +++ b/arch/arm/kvm/mmu.c
 @@ -430,6 +430,7 @@ static void stage2_set_pte(struct kvm *kvm, struct
 kvm_mmu_memory_cache *cache,
   if (!cache)
   return; /* ignore calls from kvm_set_spte_hva */
   pmd = mmu_memory_cache_alloc(cache);
 + clean_dcache_area(pmd, PTRS_PER_PMD * sizeof(pmd_t));
   pud_populate(NULL, pud, pmd);
   pmd += pmd_index(addr);
   get_page(virt_to_page(pud));

There ought to be a test of ID_MMFR3[23:20] to find out whether or not it
is useful to clean the dcache. Not sure if that's a massive gain, but it is
certainly an optimisation to consider for the kernel as a whole.

M.
-- 
Fast, cheap, reliable. Pick two.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: KVM call agenda for 2012-10-09

2012-10-09 Thread Juan Quintela
Juan Quintela quint...@redhat.com wrote:
 Hi

 Please send in any agenda topics that you have.

Hi

As there is no agenda, call gets cancelled.

Sorry for sending the request for agenda late.

Later, Juan.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Qemu-devel] Using PCI config space to indicate config location

2012-10-09 Thread Anthony Liguori
Rusty Russell ru...@rustcorp.com.au writes:

 Anthony Liguori aligu...@us.ibm.com writes:
 We'll never remove legacy so we shouldn't plan on it.  There are
 literally hundreds of thousands of VMs out there with the current virtio
 drivers installed in them.  We'll be supporting them for a very, very
 long time :-)

 You will be supporting this for qemu on x86, sure.

And PPC.

 As I think we're
 still in the growth phase for virtio, I prioritize future spec
 cleanliness pretty high.

 But I think you'll be surprised how fast this is deprecated:
 1) Bigger queues for block devices (guest-specified ringsize)
 2) Smaller rings for openbios (guest-specified alignment)
 3) All-mmio mode (powerpc)
 4) Whatever network features get numbers  31.

We can do all of these things with incremental change to the existing
layout.  That's the only way what I'm suggesting is different.

You want to reorder all of the fields and have a driver flag day.  But I
strongly suspect we'll decide we need to do the same exercise again in 4
years when we now need to figure out how to take advantage of
transactional memory or some other whiz-bang hardware feature.

There are a finite number of BARs but each BAR has an almost infinite
size.  So extending BARs instead of introducing new one seems like the
conservative approach moving forward.

 I don't think we gain a lot by moving the ISR into a separate BAR.
 Splitting up registers like that seems weird to me too.

 Confused.  I proposed the same split as you have, just ISR by itself.

I disagree with moving the ISR into a separate BAR.  That's what seems
weird to me.

 It's very normal to have a mirrored set of registers that are PIO in one
 bar and MMIO in a different BAR.

 If we added an additional constraints that BAR1 was mirrored except for
 the config space and the MSI section was always there, I think the end
 result would be nice.  IOW:

 But it won't be the same, because we want all that extra stuff, like
 more feature bits and queue size alignment.  (Admittedly queues past
 16TB aren't a killer feature).

 To make it concrete:

 Current:
 struct {
 __le32 host_features;   /* read-only */
 __le32 guest_features;  /* read/write */
 __le32 queue_pfn;   /* read/write */
 __le16 queue_size;  /* read-only */
 __le16 queue_sel;   /* read/write */
 __le16 queue_notify;/* read/write */
 u8 status;  /* read/write */
 u8 isr; /* read-only, clear on read */
 /* Optional */
 __le16 msi_config_vector;   /* read/write */
 __le16 msi_queue_vector;/* read/write */
 /* ... device features */
 };

 Proposed:
 struct virtio_pci_cfg {
   /* About the whole device. */
   __le32 device_feature_select;   /* read-write */
   __le32 device_feature;  /* read-only */
   __le32 guest_feature_select;/* read-write */
   __le32 guest_feature;   /* read-only */
   __le16 msix_config; /* read-write */
   __u8 device_status; /* read-write */
   __u8 unused;

   /* About a specific virtqueue. */
   __le16 queue_select;/* read-write */
   __le16 queue_align; /* read-write, power of 2. */
   __le16 queue_size;  /* read-write, power of 2. */
   __le16 queue_msix_vector;/* read-write */
   __le64 queue_address;   /* read-write: 0x == DNE. */
 };

 struct virtio_pci_isr {
 __u8 isr; /* read-only, clear on read */
 };

What I'm suggesting is:

 struct {
 __le32 host_features;   /* read-only */
 __le32 guest_features;  /* read/write */
 __le32 queue_pfn;   /* read/write */
 __le16 queue_size;  /* read-only */
 __le16 queue_sel;   /* read/write */
 __le16 queue_notify;/* read/write */
 u8 status;  /* read/write */
 u8 isr; /* read-only, clear on read */
 __le16 msi_config_vector;   /* read/write */
 __le16 msi_queue_vector;/* read/write */
 __le32 host_feature_select; /* read/write */
 __le32 guest_feature_select;/* read/write */
 __le32 queue_pfn_hi;/* read/write */
 };


With the additional semantic that the virtio-config space is overlayed
on top of the register set in BAR0 unless the
VIRTIO_PCI_F_SEPARATE_CONFIG feature is acknowledged.  This feature
acts as a latch and when set, removes the config space overlay.

If the config space overlays the registers, the offset in BAR0 of the
overlay depends on whether MSI is enabled or not in the PCI device.

BAR1 is an MMIO mirror of BAR0 except that the config space is never
overlayed in BAR1 regardless of VIRTIO_PCI_F_SEPARATE_CONFIG.

BAR2 contains the config space.

A guest can look at BAR1 and BAR2 to determine whether they exist.

 We could also enforce LE in the per-device config space 

Re: Proposal for virtio standardization.

2012-10-09 Thread Cornelia Huck
On Thu, 27 Sep 2012 09:59:33 +0930
Rusty Russell ru...@rustcorp.com.au wrote:

 Hi all,
 
   I've had several requests for a more formal approach to the
 virtio draft spec, and (after some soul-searching) I'd like to try that.
 
   The proposal is to use OASIS as the standards body, as it's
 fairly light-weight as these things go.  For me this means paperwork and
 setting up a Working Group and getting the right people involved as
 Voting members starting with the current contributors; for most of you
 it just means a new mailing list, though I'll be cross-posting any
 drafts and major changes here anyway.
 
   I believe that a documented standard (aka virtio 1.0) will
 increase visibility and adoption in areas outside our normal linux/kvm
 universe.  There's been some of that already, but this is the clearest
 path to accelerate it.  Not the easiest path, but I believe that a solid
 I/O standard is a Good Thing for everyone.
 
   Yet I also want to decouple new and experimental development
 from the standards effort; running code comes first.  New feature bits
 and new device numbers should be reservable without requiring a full
 spec change.
 
 So the essence of my proposal is:
 1) I start a Working Group within OASIS where we can aim for virtio spec
1.0.
 
 2) The current spec is textually reordered so the core is clearly
bus-independent, with PCI, mmio, etc appendices.
 
 3) Various clarifications, formalizations and cleanups to the spec text,
and possibly elimination of old deprecated features.
 
 4) The only significant change to the spec is that we use PCI
capabilities, so we can have infinite feature bits.
(see
 http://lists.linuxfoundation.org/pipermail/virtualization/2011-December/019198.html)

Infinite only applies to virtio-pci, no?

 
 5) Changes to the ring layout and other such things are deferred to a
future virtio version; whether this is done within OASIS or
externally depends on how well this works for the 1.0 release.
 
 Thoughts?
 Rusty.
 

Sounds like a good idea. I'll be happy to review the spec with an eye
to virtio-ccw.

Cornelia

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[QEMU PATCH] i386: cpu: add missing CPUID[EAX=7,ECX=0] flag names

2012-10-09 Thread Eduardo Habkost
This makes QEMU recognize the following CPU flag names:

 Flags| Corresponding KVM kernel commit
 -+
 FSGSBASE | 176f61da82435eae09cc96f70b530d1ba0746b8b
 AVX2, BMI1, BMI2 | fb215366b3c7320ac25dca766a0152df16534932
 HLE, RTM | 83c529151ab0d4a813e3f6a3e293fff75d468519
 INVPCID  | ad756a1603c5fac207758faaac7f01c34c9d0b7b
 ERMS | a01c8f9b4e266df1d7166d23216f2060648f862d

Signed-off-by: Eduardo Habkost ehabk...@redhat.com
---
 target-i386/cpu.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/target-i386/cpu.c b/target-i386/cpu.c
index f3708e6..b012372 100644
--- a/target-i386/cpu.c
+++ b/target-i386/cpu.c
@@ -105,8 +105,8 @@ static const char *svm_feature_name[] = {
 };
 
 static const char *cpuid_7_0_ebx_feature_name[] = {
-NULL, NULL, NULL, NULL, NULL, NULL, NULL, smep,
-NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL,
+fsgsbase, NULL, NULL, bmi1, hle, avx2, NULL, smep,
+bmi2, erms, invpcid, rtm, NULL, NULL, NULL, NULL,
 NULL, NULL, NULL, NULL, smap, NULL, NULL, NULL,
 NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL,
 };
-- 
1.7.11.4

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Qemu-devel] Using PCI config space to indicate config location

2012-10-09 Thread Anthony Liguori
Avi Kivity a...@redhat.com writes:

 On 10/09/2012 05:16 AM, Rusty Russell wrote:
 Anthony Liguori aligu...@us.ibm.com writes:
 We'll never remove legacy so we shouldn't plan on it.  There are
 literally hundreds of thousands of VMs out there with the current virtio
 drivers installed in them.  We'll be supporting them for a very, very
 long time :-)
 
 You will be supporting this for qemu on x86, sure.  As I think we're
 still in the growth phase for virtio, I prioritize future spec
 cleanliness pretty high.

 If a pure ppc hypervisor was on the table, this might have been
 worthwhile.  As it is the codebase is shared, and the Linux drivers are
 shared, so cleaning up the spec doesn't help the code.

Note that distros have been (perhaps unknowingly) shipping virtio-pci
for PPC for some time now.

So even though there wasn't a hypervisor that supported virtio-pci, the
guests already support it and are out there in the wild.

There's a lot of value in maintaining legacy support even for PPC.
 
 But I think you'll be surprised how fast this is deprecated:
 1) Bigger queues for block devices (guest-specified ringsize)
 2) Smaller rings for openbios (guest-specified alignment)
 3) All-mmio mode (powerpc)
 4) Whatever network features get numbers  31.
 
 I don't think we gain a lot by moving the ISR into a separate BAR.
 Splitting up registers like that seems weird to me too.
 
 Confused.  I proposed the same split as you have, just ISR by itself.

 I believe Anthony objects to having the ISR by itself.  What is the
 motivation for that?

Right, BARs are a precious resource not to be spent lightly.  Having an
entire BAR dedicated to a 1-byte register seems like a waste to me.

Regards,

Anthony Liguori



 -- 
 error compiling committee.c: too many arguments to function
 --
 To unsubscribe from this list: send the line unsubscribe kvm in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Enabling IA32_TSC_ADJUST for guest VM

2012-10-09 Thread Marcelo Tosatti
On Tue, Oct 09, 2012 at 02:12:18PM +0200, Avi Kivity wrote:
 On 10/08/2012 07:30 PM, Marcelo Tosatti wrote:
  
  From Intel's manual:
  
  • If an execution of WRMSR to the IA32_TIME_STAMP_COUNTER MSR adds (or
  subtracts) value X from the TSC,
  the logical processor also adds (or subtracts) value X from the
  IA32_TSC_ADJUST MSR.
  
  This is not handled in the patch. 
  
  To support migration, it will be necessary to differentiate between
  guest initiated and userspace-model initiated msr write. That is, 
  only guest initiated TSC writes should affect the value of 
  IA32_TSC_ADJUST MSR.
  
  Avi, any better idea?
  
 
 I think we need that anyway, since there are some read-only MSRs that
 need to be configured by the host (nvmx capabilities).  So if we add
 that feature it will be useful elsewhere.  I don't think it's possible
 to do it in any other way:
 
 Local offset value of the IA32_TSC for a
 logical processor. Reset value is Zero. A
 write to IA32_TSC will modify the local
 offset in IA32_TSC_ADJUST and the
 content of IA32_TSC, but does not affect
 the internal invariant TSC hardware.
 
 What we want to do is affect the internal invariant TSC hardware, so we
 can't do that through the normal means.
 
 btw, will tsc writes from userspace (after live migration) cause tsc
 skew?  If so we should think how to model a guest-wide tsc.

No because there is an easy shortcut:

if (level == KVM_PUT_FULL_STATE) {
/*
 * KVM is yet unable to synchronize TSC values of multiple VCPUs
 * on
 * writeback. Until this is fixed, we only write the offset to
 * SMP
 * guests after migration, desynchronizing the VCPUs, but
 * avoiding
 * huge jump-backs that would occur without any writeback at
 * all.
 */
if (smp_cpus == 1 || env-tsc != 0) {
kvm_msr_entry_set(msrs[n++], MSR_IA32_TSC, env-tsc);
}
}


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Enabling IA32_TSC_ADJUST for guest VM

2012-10-09 Thread Avi Kivity
On 10/09/2012 04:24 PM, Marcelo Tosatti wrote:
 On Tue, Oct 09, 2012 at 02:12:18PM +0200, Avi Kivity wrote:
 On 10/08/2012 07:30 PM, Marcelo Tosatti wrote:
  
  From Intel's manual:
  
  • If an execution of WRMSR to the IA32_TIME_STAMP_COUNTER MSR adds (or
  subtracts) value X from the TSC,
  the logical processor also adds (or subtracts) value X from the
  IA32_TSC_ADJUST MSR.
  
  This is not handled in the patch. 
  
  To support migration, it will be necessary to differentiate between
  guest initiated and userspace-model initiated msr write. That is, 
  only guest initiated TSC writes should affect the value of 
  IA32_TSC_ADJUST MSR.
  
  Avi, any better idea?
  
 
 I think we need that anyway, since there are some read-only MSRs that
 need to be configured by the host (nvmx capabilities).  So if we add
 that feature it will be useful elsewhere.  I don't think it's possible
 to do it in any other way:
 
 Local offset value of the IA32_TSC for a
 logical processor. Reset value is Zero. A
 write to IA32_TSC will modify the local
 offset in IA32_TSC_ADJUST and the
 content of IA32_TSC, but does not affect
 the internal invariant TSC hardware.
 
 What we want to do is affect the internal invariant TSC hardware, so we
 can't do that through the normal means.
 
 btw, will tsc writes from userspace (after live migration) cause tsc
 skew?  If so we should think how to model a guest-wide tsc.
 
 No because there is an easy shortcut:
 
 if (level == KVM_PUT_FULL_STATE) {
 /*
  * KVM is yet unable to synchronize TSC values of multiple VCPUs
  * on
  * writeback. Until this is fixed, we only write the offset to
  * SMP
  * guests after migration, desynchronizing the VCPUs, but
  * avoiding
  * huge jump-backs that would occur without any writeback at
  * all.
  */
 if (smp_cpus == 1 || env-tsc != 0) {
 kvm_msr_entry_set(msrs[n++], MSR_IA32_TSC, env-tsc);
 }
 }

Still we write back after migration.  So this needs to be fixed (or I
misunderstood you).


-- 
error compiling committee.c: too many arguments to function
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Enabling IA32_TSC_ADJUST for guest VM

2012-10-09 Thread Marcelo Tosatti
On Tue, Oct 09, 2012 at 04:26:32PM +0200, Avi Kivity wrote:
 On 10/09/2012 04:24 PM, Marcelo Tosatti wrote:
  On Tue, Oct 09, 2012 at 02:12:18PM +0200, Avi Kivity wrote:
  On 10/08/2012 07:30 PM, Marcelo Tosatti wrote:
   
   From Intel's manual:
   
   • If an execution of WRMSR to the IA32_TIME_STAMP_COUNTER MSR adds (or
   subtracts) value X from the TSC,
   the logical processor also adds (or subtracts) value X from the
   IA32_TSC_ADJUST MSR.
   
   This is not handled in the patch. 
   
   To support migration, it will be necessary to differentiate between
   guest initiated and userspace-model initiated msr write. That is, 
   only guest initiated TSC writes should affect the value of 
   IA32_TSC_ADJUST MSR.
   
   Avi, any better idea?
   
  
  I think we need that anyway, since there are some read-only MSRs that
  need to be configured by the host (nvmx capabilities).  So if we add
  that feature it will be useful elsewhere.  I don't think it's possible
  to do it in any other way:
  
  Local offset value of the IA32_TSC for a
  logical processor. Reset value is Zero. A
  write to IA32_TSC will modify the local
  offset in IA32_TSC_ADJUST and the
  content of IA32_TSC, but does not affect
  the internal invariant TSC hardware.
  
  What we want to do is affect the internal invariant TSC hardware, so we
  can't do that through the normal means.
  
  btw, will tsc writes from userspace (after live migration) cause tsc
  skew?  If so we should think how to model a guest-wide tsc.
  
  No because there is an easy shortcut:
  
  if (level == KVM_PUT_FULL_STATE) {
  /*
   * KVM is yet unable to synchronize TSC values of multiple VCPUs
   * on
   * writeback. Until this is fixed, we only write the offset to
   * SMP
   * guests after migration, desynchronizing the VCPUs, but
   * avoiding
   * huge jump-backs that would occur without any writeback at
   * all.
   */
  if (smp_cpus == 1 || env-tsc != 0) {
  kvm_msr_entry_set(msrs[n++], MSR_IA32_TSC, env-tsc);
  }
  }
 
 Still we write back after migration.  So this needs to be fixed (or I
 misunderstood you).

Handled by kvm_write_tsc() in x86.c. Is this what you mean?

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Enabling IA32_TSC_ADJUST for guest VM

2012-10-09 Thread Avi Kivity
On 10/09/2012 04:27 PM, Marcelo Tosatti wrote:
 On Tue, Oct 09, 2012 at 04:26:32PM +0200, Avi Kivity wrote:
 On 10/09/2012 04:24 PM, Marcelo Tosatti wrote:
  On Tue, Oct 09, 2012 at 02:12:18PM +0200, Avi Kivity wrote:
  On 10/08/2012 07:30 PM, Marcelo Tosatti wrote:
   
   From Intel's manual:
   
   • If an execution of WRMSR to the IA32_TIME_STAMP_COUNTER MSR adds (or
   subtracts) value X from the TSC,
   the logical processor also adds (or subtracts) value X from the
   IA32_TSC_ADJUST MSR.
   
   This is not handled in the patch. 
   
   To support migration, it will be necessary to differentiate between
   guest initiated and userspace-model initiated msr write. That is, 
   only guest initiated TSC writes should affect the value of 
   IA32_TSC_ADJUST MSR.
   
   Avi, any better idea?
   
  
  I think we need that anyway, since there are some read-only MSRs that
  need to be configured by the host (nvmx capabilities).  So if we add
  that feature it will be useful elsewhere.  I don't think it's possible
  to do it in any other way:
  
  Local offset value of the IA32_TSC for a
  logical processor. Reset value is Zero. A
  write to IA32_TSC will modify the local
  offset in IA32_TSC_ADJUST and the
  content of IA32_TSC, but does not affect
  the internal invariant TSC hardware.
  
  What we want to do is affect the internal invariant TSC hardware, so we
  can't do that through the normal means.
  
  btw, will tsc writes from userspace (after live migration) cause tsc
  skew?  If so we should think how to model a guest-wide tsc.
  
  No because there is an easy shortcut:
  
  if (level == KVM_PUT_FULL_STATE) {
  /*
   * KVM is yet unable to synchronize TSC values of multiple VCPUs
   * on
   * writeback. Until this is fixed, we only write the offset to
   * SMP
   * guests after migration, desynchronizing the VCPUs, but
   * avoiding
   * huge jump-backs that would occur without any writeback at
   * all.
   */
  if (smp_cpus == 1 || env-tsc != 0) {
  kvm_msr_entry_set(msrs[n++], MSR_IA32_TSC, env-tsc);
  }
  }
 
 Still we write back after migration.  So this needs to be fixed (or I
 misunderstood you).
 
 Handled by kvm_write_tsc() in x86.c. Is this what you mean?
 

It will generate a call to -write_tsc_offset().  Will the values be the
same for all vcpus?  Note the inputs won't be the same.


-- 
error compiling committee.c: too many arguments to function
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 3/4] s390/kvm: Add a channel I/O based virtio transport driver.

2012-10-09 Thread Cornelia Huck
On Wed, 19 Sep 2012 18:38:38 +0200
Alexander Graf ag...@suse.de wrote:

 
 On 04.09.2012, at 17:13, Cornelia Huck wrote:

  +static u32 virtio_ccw_get_features(struct virtio_device *vdev)
  +{
  +   struct virtio_ccw_device *vcdev = to_vc_device(vdev);
  +   struct virtio_feature_desc features;
  +   int ret;
  +
  +   /* Read the feature bits from the host. */
  +   /* TODO: Features  32 bits */
  +   features.index = 0;
  +   vcdev-ccw.cmd_code = CCW_CMD_READ_FEAT;
  +   vcdev-ccw.flags = 0;
  +   vcdev-ccw.count = sizeof(features);
  +   vcdev-ccw.cda = vcdev-area;
  +   ret = ccw_io_helper(vcdev, VIRTIO_CCW_DOING_READ_FEAT);
  +   if (ret)
  +   return 0;
  +
  +   memcpy(features, (void *)(unsigned long)vcdev-area,
  +  sizeof(features));
  +   return le32_to_cpu(features.features);
 
 The fact that the features are LE is missing from the spec, right?

You're right, I thought it was there.

 
 
 Alex
 

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 5/5] [HACK] Handle multiple virtio aliases.

2012-10-09 Thread Cornelia Huck
On Thu, 20 Sep 2012 09:27:00 -0500
Anthony Liguori aligu...@us.ibm.com wrote:

 Cornelia Huck cornelia.h...@de.ibm.com writes:
 
  This patch enables using both virtio-xxx-s390 and virtio-xxx-ccw
  by making the alias lookup code verify that a driver is actually
  registered.
 
  (Only included in order to allow testing of virtio-ccw; should be
  replaced by cleaning up the virtio bus model.)
 
  Not-signed-off-by: Cornelia Huck cornelia.h...@de.ibm.com
 
 No more aliases.  Just drop the whole thing.

The whole alias stuff I put in there was just to be able to test
virtio-ccw. I agree that we should use the cleaned-up virtio driver
model that had been proposed for virtio-mmio earlier; I just want to
make sure the virtio-ccw interface is sane first.

 
 Regards,
 
 Anthony Liguori

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Question] Live migration - normal process migration

2012-10-09 Thread Paolo Bonzini
Il 09/10/2012 16:36, Grzegorz Dwornicki ha scritto:
 
 I have a question to you guys. Is it possible to use code from live
 migration of KVM VMs to migrate other process?

No, because this is hardware migration.  It migrates the whole machine
including the hardware state (e.g. the video card, or the USB controller).

 I wish to create a tool to convert systems on physical servers to KVM
 virtual machines. I wish to make some tests/experiments with this part of
 the code to migrate working system to newly created VMs in running state.

Something like that is done by OpenVZ containers.

For offline migration of physical servers to KVM virtual machines, look
at virt-p2v (http://libguestfs.org/virt-v2v/).

Paolo

 My goal is to create something similar to VMWare Converter but this one will
 be opensource :). I am at the begining of this project. I'm reading parts of
 the kernel code. But only thosse that could be usefull in this hard task.

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Question] Live migration - normal process migration

2012-10-09 Thread Javier Guerra Giraldez
On Tue, Oct 9, 2012 at 9:36 AM, Grzegorz Dwornicki gd1...@gmail.com wrote:
 I have a question to you guys. Is it possible to use code from live
 migration of KVM VMs to migrate other process?

As far as I can tell, no.

most of the virtualization facililites of KVM are implemented in the
kernel, but they're managed by a 'normal' process (sometimes called
qemu-kvm, sometimes just kvm).  Is this userspace process that
implements it's own migration feature.

to do that, one running qemu-kvm (proc A) instance connects with a
just-started instance (proc B).  proc A first sends all the
configuration information, so B sets itself identically.  then all RAM
data is transferred (iteratively, to 'catch up' with the still-running
VM).  finally, proc A suspends the running VM, finishes the last
modified RAM data and all the CPU state.  now proc B is identical to
A, and both are suspended, and then, proc B is unsuspended and
notifies proc A, which just exits.

if you want to do that for arbitrary 'normal' processes, you'd have to
implement that in the kernel; but KVM doesn't do that.

As Paolo mentions, there are other projects that do similar things.
OpenVZ is one; LXC does it too. I'd start with LXC, since it's a
standard part of the kernel (just like KVM), and supported by many
common utilities.

-- 
Javier
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Qemu-devel] Using PCI config space to indicate config location

2012-10-09 Thread Anthony Liguori
Gerd Hoffmann kra...@redhat.com writes:

   Hi,

 Well, we also want to clean up the registers, so how about:

 BAR0: legacy, as is.  If you access this, don't use the others.

 Ok.

 BAR1: new format virtio-pci layout.  If you use this, don't use BAR0.
 BAR2: virtio-cfg.  If you use this, don't use BAR0.

 Why use two bars for this?  You can put them into one mmio bar, together
 with the msi-x vector table and PBA.  Of course a pci capability
 describing the location is helpful for that ;)

You don't need a capability.  You can also just add a config offset
field to the register set and then make the semantics that it occurs in
the same region.


 BAR3: ISR. If you use this, don't use BAR0.

 Again, I wouldn't hardcode that but use a capability.

 I prefer the cases exclusive (ie. use one or the other) as a clear path
 to remove the legacy layout; and leaving the ISR in BAR0 leaves us with
 an ugly corner case in future (ISR is BAR0 + 19?  WTF?).

 Ok, so we have four register sets:

   (1) legacy layout
   (2) new virtio-pci
   (3) new virtio-config
   (4) new virtio-isr

 We can have a vendor pci capability, with a dword for each register set:

   bit  31-- present bit
   bits 26-24 -- bar
   bits 23-0  -- offset

 So current drivers which must support legacy can use this:

   legacy layout -- present, bar 0, offset 0
   new virtio-pci-- present, bar 1, offset 0
   new virtio-config -- present, bar 1, offset 256
   new virtio-isr-- present, bar 0, offset 19

 [ For completeness: msi-x capability could add this: ]

   msi-x vector tablebar 1, offset 512
   msi-x pba bar 1, offset 768

 We'll never remove legacy so we shouldn't plan on it.  There are
 literally hundreds of thousands of VMs out there with the current virtio
 drivers installed in them.  We'll be supporting them for a very, very
 long time :-)

 But new devices (virtio-qxl being a candidate) don't have old guests and
 don't need to worry.

 They could use this if they care about fast isr:

   legacy layout -- not present
   new virtio-pci-- present, bar 1, offset 0
   new virtio-config -- present, bar 1, offset 256
   new virtio-isr-- present, bar 0, offset 0

 Or this if they don't worry about isr performance:

   legacy layout -- not present
   new virtio-pci-- present, bar 0, offset 0
   new virtio-config -- present, bar 0, offset 256
   new virtio-isr-- not present

 I don't think we gain a lot by moving the ISR into a separate BAR.
 Splitting up registers like that seems weird to me too.

 Main advantage of defining a register set with just isr is that it
 reduces pio address space consumtion for new virtio devices which don't
 have to worry about the legacy layout (8 bytes which is minimum size for
 io bars instead of 64 bytes).

Doing some rough math, we should have at least 16k of PIO space.  That
let's us have well over 500 virtio-pci devices with the current register
layout.

I don't think we're at risk of running out of space...

 If we added an additional constraints that BAR1 was mirrored except for

 Why add constraints?  We want something future-proof, don't we?

 The detection is simple: if BAR1 has non-zero length, it's new-style,
 otherwise legacy.

 Doesn't fly.  BAR1 is in use today for MSI-X support.

But the location is specified via capabilities so we can change the
location to be within BAR1 at a non-conflicting offset.

 I agree that this is the best way to extend, but I think we should still
 use a transport feature bit.  We want to be able to detect within QEMU
 whether a guest is using these new features because we need to adjust
 migration state accordingly.

 Why does migration need adjustments?

Because there is additional state in the new layout.  We need to
understand whether a guest is relying on that state or not.

For instance, extended virtio features.  If a guest is in the process
of reading extended virtio features, it may not have changed any state
but we must ensure that we don't migrate to an older verison of QEMU w/o
the extended virtio features.

This cannot be handled by subsections today because there is no guest
written state that's affected.

Regards,

Anthony Liguori


 [ Not that I want veto a feature bit, but I don't see the need yet ]

 cheers,
   Gerd
 --
 To unsubscribe from this list: send the line unsubscribe kvm in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [PATCH] Enabling IA32_TSC_ADJUST for guest VM

2012-10-09 Thread Auld, Will
I am just testing the second version of this patch. It addresses all the 
comments so far except Marcelo's issue with breaking the function 
compute_guest_tsc(). 

I needed to put the call for updating the TSC_ADJUST_MSR in kvm_write_tsc() to 
ensure it is only called from user space. Other changes added to vmcs offset 
should not be tracked in TSC_ADJUST_MSR. 

I had some trouble with the order of initialization during live migration. 
TSC_ADJUST is initialized first but then wiped out by multiple initializations 
of tsc. The fix for this is to not update TSC_ADJUST if the vmcs offset is not 
actually changing with the tsc write. So, after migration outcome is that vmcs 
offset gets defined independent from the migrating value of TSC_ADJUST. I 
believe this is what we want to happen.

Thanks,

Will 

-Original Message-
From: Avi Kivity [mailto:a...@redhat.com] 
Sent: Tuesday, October 09, 2012 5:12 AM
To: Marcelo Tosatti
Cc: Auld, Will; kvm@vger.kernel.org; Zhang, Xiantao
Subject: Re: [PATCH] Enabling IA32_TSC_ADJUST for guest VM

On 10/08/2012 07:30 PM, Marcelo Tosatti wrote:
 
 From Intel's manual:
 
 • If an execution of WRMSR to the IA32_TIME_STAMP_COUNTER MSR adds (or
 subtracts) value X from the TSC,
 the logical processor also adds (or subtracts) value X from the 
 IA32_TSC_ADJUST MSR.
 
 This is not handled in the patch. 
 
 To support migration, it will be necessary to differentiate between 
 guest initiated and userspace-model initiated msr write. That is, only 
 guest initiated TSC writes should affect the value of IA32_TSC_ADJUST 
 MSR.
 
 Avi, any better idea?
 

I think we need that anyway, since there are some read-only MSRs that need to 
be configured by the host (nvmx capabilities).  So if we add that feature it 
will be useful elsewhere.  I don't think it's possible to do it in any other 
way:

Local offset value of the IA32_TSC for a logical processor. Reset value is 
Zero. A write to IA32_TSC will modify the local offset in IA32_TSC_ADJUST and 
the content of IA32_TSC, but does not affect the internal invariant TSC 
hardware.

What we want to do is affect the internal invariant TSC hardware, so we can't 
do that through the normal means.

btw, will tsc writes from userspace (after live migration) cause tsc skew?  If 
so we should think how to model a guest-wide tsc.

--
error compiling committee.c: too many arguments to function
N�r��yb�X��ǧv�^�)޺{.n�+h����ܨ}���Ơz�j:+v���zZ+��+zf���h���~i���z��w���?��)ߢf

Re: [PATCH] Enabling IA32_TSC_ADJUST for guest VM

2012-10-09 Thread Marcelo Tosatti
On Tue, Oct 09, 2012 at 04:30:28PM +0200, Avi Kivity wrote:
 On 10/09/2012 04:27 PM, Marcelo Tosatti wrote:
  On Tue, Oct 09, 2012 at 04:26:32PM +0200, Avi Kivity wrote:
  On 10/09/2012 04:24 PM, Marcelo Tosatti wrote:
   On Tue, Oct 09, 2012 at 02:12:18PM +0200, Avi Kivity wrote:
   On 10/08/2012 07:30 PM, Marcelo Tosatti wrote:

From Intel's manual:

• If an execution of WRMSR to the IA32_TIME_STAMP_COUNTER MSR adds (or
subtracts) value X from the TSC,
the logical processor also adds (or subtracts) value X from the
IA32_TSC_ADJUST MSR.

This is not handled in the patch. 

To support migration, it will be necessary to differentiate between
guest initiated and userspace-model initiated msr write. That is, 
only guest initiated TSC writes should affect the value of 
IA32_TSC_ADJUST MSR.

Avi, any better idea?

   
   I think we need that anyway, since there are some read-only MSRs that
   need to be configured by the host (nvmx capabilities).  So if we add
   that feature it will be useful elsewhere.  I don't think it's possible
   to do it in any other way:
   
   Local offset value of the IA32_TSC for a
   logical processor. Reset value is Zero. A
   write to IA32_TSC will modify the local
   offset in IA32_TSC_ADJUST and the
   content of IA32_TSC, but does not affect
   the internal invariant TSC hardware.
   
   What we want to do is affect the internal invariant TSC hardware, so we
   can't do that through the normal means.
   
   btw, will tsc writes from userspace (after live migration) cause tsc
   skew?  If so we should think how to model a guest-wide tsc.
   
   No because there is an easy shortcut:
   
   if (level == KVM_PUT_FULL_STATE) {
   /*
* KVM is yet unable to synchronize TSC values of multiple VCPUs
* on
* writeback. Until this is fixed, we only write the offset to
* SMP
* guests after migration, desynchronizing the VCPUs, but
* avoiding
* huge jump-backs that would occur without any writeback at
* all.
*/
   if (smp_cpus == 1 || env-tsc != 0) {
   kvm_msr_entry_set(msrs[n++], MSR_IA32_TSC, env-tsc);
   }
   }
  
  Still we write back after migration.  So this needs to be fixed (or I
  misunderstood you).
  
  Handled by kvm_write_tsc() in x86.c. Is this what you mean?
  
 
 It will generate a call to -write_tsc_offset().  Will the values be the
 same for all vcpus?  Note the inputs won't be the same.

Yes:

 * Special case: TSC write with a small delta (1 second) of
 * virtual
 * cycle time against real time is interpreted as an attempt to
 * synchronize the CPU.



--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v6 03/16] target-i386: Add cpu object access routines for Hypervisor level.

2012-10-09 Thread Marcelo Tosatti
On Mon, Sep 24, 2012 at 10:32:05AM -0400, Don Slutz wrote:
 These are modeled after x86_cpuid_get_xlevel and x86_cpuid_set_xlevel.
 
 Signed-off-by: Don Slutz d...@cloudswitch.com
 ---
  target-i386/cpu.c |   29 +
  1 files changed, 29 insertions(+), 0 deletions(-)
 
 diff --git a/target-i386/cpu.c b/target-i386/cpu.c
 index 25ca986..451de12 100644
 --- a/target-i386/cpu.c
 +++ b/target-i386/cpu.c
 @@ -1166,6 +1166,32 @@ static void x86_cpuid_set_tsc_freq(Object *obj, 
 Visitor *v, void *opaque,
  cpu-env.tsc_khz = value / 1000;
  }
  
 +static void x86_cpuid_get_hv_level(Object *obj, Visitor *v, void *opaque,
 +const char *name, Error **errp)
 +{
 +X86CPU *cpu = X86_CPU(obj);
 +
 +visit_type_uint32(v, cpu-env.cpuid_hv_level, name, errp);
 +}
 +
 +static void x86_cpuid_set_hv_level(Object *obj, Visitor *v, void *opaque,
 +const char *name, Error **errp)
 +{
 +X86CPU *cpu = X86_CPU(obj);
 +uint32_t value;
 +
 +visit_type_uint32(v, value, name, errp);
 +if (error_is_set(errp)) {
 +return;
 +}
 +
 +if (value != 0  value  0x4000) {
 +value += 0x4000;
 +}

Whats the purpose of this? Adds ambiguity.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v6 04/16] target-i386: Add x86_set_hyperv.

2012-10-09 Thread Marcelo Tosatti
On Mon, Sep 24, 2012 at 10:32:06AM -0400, Don Slutz wrote:
 This is used to set the cpu object's hypervisor level to the default for 
 Microsoft's Hypervisor.
 
 Signed-off-by: Don Slutz d...@cloudswitch.com
 ---
  target-i386/cpu.c |9 +
  target-i386/cpu.h |2 ++
  2 files changed, 11 insertions(+), 0 deletions(-)
 
 diff --git a/target-i386/cpu.c b/target-i386/cpu.c
 index 451de12..48bdaf9 100644
 --- a/target-i386/cpu.c
 +++ b/target-i386/cpu.c
 @@ -1193,6 +1193,12 @@ static void x86_cpuid_set_hv_level(Object *obj, 
 Visitor *v, void *opaque,
  }
  
  #if !defined(CONFIG_USER_ONLY)
 +static void x86_set_hyperv(Object *obj, Error **errp)
 +{
 +object_property_set_int(obj, CPUID_HV_LEVEL_HYPERV,
 +hypervisor-level, errp);
 +}
 +
  static void x86_get_hv_spinlocks(Object *obj, Visitor *v, void *opaque,
   const char *name, Error **errp)
  {
 @@ -1215,6 +1221,7 @@ static void x86_set_hv_spinlocks(Object *obj, Visitor 
 *v, void *opaque,
  return;
  }
  hyperv_set_spinlock_retries(value);
 +x86_set_hyperv(obj, errp);
  }
  
  static void x86_get_hv_relaxed(Object *obj, Visitor *v, void *opaque,
 @@ -1235,6 +1242,7 @@ static void x86_set_hv_relaxed(Object *obj, Visitor *v, 
 void *opaque,
  return;
  }
  hyperv_enable_relaxed_timing(value);
 +x86_set_hyperv(obj, errp);
  }
  
  static void x86_get_hv_vapic(Object *obj, Visitor *v, void *opaque,
 @@ -1255,6 +1263,7 @@ static void x86_set_hv_vapic(Object *obj, Visitor *v, 
 void *opaque,
  return;
  }
  hyperv_enable_vapic_recommended(value);
 +x86_set_hyperv(obj, errp);
  }
  #endif
  
 diff --git a/target-i386/cpu.h b/target-i386/cpu.h
 index 1899f69..3152a4e 100644
 --- a/target-i386/cpu.h
 +++ b/target-i386/cpu.h
 @@ -488,6 +488,8 @@
  
  #define CPUID_VENDOR_VIA   CentaurHauls
  
 +#define CPUID_HV_LEVEL_HYPERV  0x4005
 +

Where this comes from? 

http://msdn.microsoft.com/en-us/library/windows/hardware/ff542428%28v=vs.85%29.aspx

has under Leaf 0x4000 (at very top of table):

EAX

The maximum input value for hypervisor CPUID information. For Microsoft
hypervisors, this value will be at least 0x4005. The vendor ID
signature should be used only for reporting and diagnostic purposes.

Is that the same 0x4005 as in this patch?

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Qemu-devel] [QEMU PATCH] i386: cpu: add missing CPUID[EAX=7, ECX=0] flag names

2012-10-09 Thread Don Slutz

On 10/09/12 10:03, Eduardo Habkost wrote:

This makes QEMU recognize the following CPU flag names:

  Flags| Corresponding KVM kernel commit
  -+
  FSGSBASE | 176f61da82435eae09cc96f70b530d1ba0746b8b
  AVX2, BMI1, BMI2 | fb215366b3c7320ac25dca766a0152df16534932
  HLE, RTM | 83c529151ab0d4a813e3f6a3e293fff75d468519
  INVPCID  | ad756a1603c5fac207758faaac7f01c34c9d0b7b
  ERMS | a01c8f9b4e266df1d7166d23216f2060648f862d

Signed-off-by: Eduardo Habkost ehabk...@redhat.com
---
  target-i386/cpu.c | 4 ++--
  1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/target-i386/cpu.c b/target-i386/cpu.c
index f3708e6..b012372 100644
--- a/target-i386/cpu.c
+++ b/target-i386/cpu.c
@@ -105,8 +105,8 @@ static const char *svm_feature_name[] = {
  };
  
  static const char *cpuid_7_0_ebx_feature_name[] = {

-NULL, NULL, NULL, NULL, NULL, NULL, NULL, smep,
-NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL,
+fsgsbase, NULL, NULL, bmi1, hle, avx2, NULL, smep,
+bmi2, erms, invpcid, rtm, NULL, NULL, NULL, NULL,
  NULL, NULL, NULL, NULL, smap, NULL, NULL, NULL,
  NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL,
  };

Reviewed-by: Don Slutz d...@cloudswitch.com

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH v3 06/19] Implement -dimm command line option

2012-10-09 Thread Vasilis Liaskovitis
Hi,

sorry for the delayed answer.

On Sat, Sep 29, 2012 at 11:13:04AM +, Blue Swirl wrote:
 
  The -dimm option is supposed to specify the dimm/memory layout, and not 
  create
  any devices.
 
  If we don't want this new option, I have a question:
 
  A -device/device_add means we create a new qdev device at startup or as a
  hotplug operation respectively. So, the semantics of
  -device dimm,id=dimm0,size=512M,node=0,populated=on are clear to me.
 
  What does -device dimm,populated=off mean from a qdev perspective? There 
  are 2
  alternatives:
 
  - The device is created on the dimmbus, but is not used/populated yet. Than 
  the
  activation/acpi-hotplug of the dimm may require a separate command (we used 
  to have
  dimm_add in versions  3). device_add handling always hotplugs a new 
  qdev
  device, so this wouldn't fit this usecase, because the device already 
  exists. In
  this case, the actual acpi hotplug operation is decoupled from qdev device
  creation.
 
 The bus exists but the devices do not, device_add would add DIMMs to
 the bus. This matches PCI bus created by the host bridge and PCI
 device hotplug.
 
 A more complex setup would be dimm bus, dimm slot devices and DIMM
 devices. The intermediate slot device would contain one DIMM device if
 plugged.

interesting, I haven't thought about this alternative. It does sounds overly
complex, but a dimmslot / dimmdevice splitup could consolidate hotplug semantic
differences between populated=on/off. Something similar to the dimmslot device
is already present in v3 (dimmcfg structure), but it's not a qdev visible 
device.
I 'd rather avoid the complication, but i might revisit this idea.

 
 
  - The dimmdevice is not created when -device dimm,populated=off (this 
  would
  require some ugly checking in normal -device argument handling). Only the 
  dimm
  layout is saved. The hotplug is triggered from a normal device_add later. 
  So in
  this case, the acpi hotplug happens at the same time as the qdev hotplug.
 
  Do you see a simpler alternative without introducing a new option?
 
  Using the -dimm option follows the second semantic and avoids changing 
  the -device
  semantics. Dimm layout description is decoupled from dimmdevice creation, 
  and qdev
  hotplug coincides with acpi hotplug.
 
 Maybe even the dimmbus device shouldn't exist by itself after all, or
 it should be pretty much invisible to users. On real HW, the memory
 controller or south bridge handles the memory. For i440fx, it's part
 of the same chipset. So I think we should just add qdev properties to
 i440fx to specify the sizes, nodes etc. Then i440fx should create the
 dimmbus device unconditionally using the properties. The default
 properties should create a sane configuration, otherwise -global
 i440fx.dimm_size=512M etc. could be used. Then the bus would be
 populated as before or with device_add.

hmm the problem with using only i440fx properties, is that size/nodes look
dimm specific to me, not chipset-memcontroller specific. Unless we only allow
uniform size dimms. Is it possible to have a dynamic list of sizes/nodes pairs 
as 
properties of a qdev device?

Also if there is no dimmbus, and instead we have only links from i440fx to 
dimm-devices,
would the current qdev hotplug API be enough?

I am currently leaning towards this: i440fx unconditionally creates the 
dimmbus. Users
don't have to specify the bus (i assume this is what you mean by dimmbus should
be invisible to the users)

We only use -device dimm to describe dimms. With -device 
dimm,populated=off, only
the dimm config layout will be saved in the dimmbus. The hotplug is triggered 
from a normal
device_add later (same as pci hotplug). 

thanks,

- Vasilis
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v6 04/16] target-i386: Add x86_set_hyperv.

2012-10-09 Thread Marcelo Tosatti
On Tue, Oct 09, 2012 at 01:34:09PM -0300, Marcelo Tosatti wrote:
 On Mon, Sep 24, 2012 at 10:32:06AM -0400, Don Slutz wrote:
  This is used to set the cpu object's hypervisor level to the default for 
  Microsoft's Hypervisor.
  
  Signed-off-by: Don Slutz d...@cloudswitch.com
  ---
   target-i386/cpu.c |9 +
   target-i386/cpu.h |2 ++
   2 files changed, 11 insertions(+), 0 deletions(-)
  
  diff --git a/target-i386/cpu.c b/target-i386/cpu.c
  index 451de12..48bdaf9 100644
  --- a/target-i386/cpu.c
  +++ b/target-i386/cpu.c
  @@ -1193,6 +1193,12 @@ static void x86_cpuid_set_hv_level(Object *obj, 
  Visitor *v, void *opaque,
   }
   
   #if !defined(CONFIG_USER_ONLY)
  +static void x86_set_hyperv(Object *obj, Error **errp)
  +{
  +object_property_set_int(obj, CPUID_HV_LEVEL_HYPERV,
  +hypervisor-level, errp);
  +}
  +
   static void x86_get_hv_spinlocks(Object *obj, Visitor *v, void *opaque,
const char *name, Error **errp)
   {
  @@ -1215,6 +1221,7 @@ static void x86_set_hv_spinlocks(Object *obj, Visitor 
  *v, void *opaque,
   return;
   }
   hyperv_set_spinlock_retries(value);
  +x86_set_hyperv(obj, errp);
   }
   
   static void x86_get_hv_relaxed(Object *obj, Visitor *v, void *opaque,
  @@ -1235,6 +1242,7 @@ static void x86_set_hv_relaxed(Object *obj, Visitor 
  *v, void *opaque,
   return;
   }
   hyperv_enable_relaxed_timing(value);
  +x86_set_hyperv(obj, errp);
   }
   
   static void x86_get_hv_vapic(Object *obj, Visitor *v, void *opaque,
  @@ -1255,6 +1263,7 @@ static void x86_set_hv_vapic(Object *obj, Visitor *v, 
  void *opaque,
   return;
   }
   hyperv_enable_vapic_recommended(value);
  +x86_set_hyperv(obj, errp);
   }
   #endif
   
  diff --git a/target-i386/cpu.h b/target-i386/cpu.h
  index 1899f69..3152a4e 100644
  --- a/target-i386/cpu.h
  +++ b/target-i386/cpu.h
  @@ -488,6 +488,8 @@
   
   #define CPUID_VENDOR_VIA   CentaurHauls
   
  +#define CPUID_HV_LEVEL_HYPERV  0x4005
  +
 
 Where this comes from? 
 
 http://msdn.microsoft.com/en-us/library/windows/hardware/ff542428%28v=vs.85%29.aspx
 
 has under Leaf 0x4000 (at very top of table):
 
 EAX
 
 The maximum input value for hypervisor CPUID information. For Microsoft
 hypervisors, this value will be at least 0x4005. The vendor ID
 signature should be used only for reporting and diagnostic purposes.
 
 Is that the same 0x4005 as in this patch?

Yes, the #define can be reused:

#define HYPERV_CPUID_MIN0x4005

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v6 05/16] target-i386: Use Hypervisor level in -machine pc,accel=kvm.

2012-10-09 Thread Marcelo Tosatti
On Mon, Sep 24, 2012 at 10:32:07AM -0400, Don Slutz wrote:
 Also known as Paravirtualization level.
 
 This change is based on:
 
 Microsoft Hypervisor CPUID Leaves:
   
 http://msdn.microsoft.com/en-us/library/windows/hardware/ff542428%28v=vs.85%29.aspx
 
 Linux kernel change starts with:
   http://fixunix.com/kernel/538707-use-cpuid-communicate-hypervisor.html
 Also:
   http://lkml.indiana.edu/hypermail/linux/kernel/1205.0/00100.html
 
 VMware documention on CPUIDs (Mechanisms to determine if software is
 running in a VMware virtual machine):
   
 http://kb.vmware.com/selfservice/microsites/search.do?language=en_UScmd=displayKCexternalId=1009458
 
 QEMU knows this is KVM_CPUID_SIGNATURE (0x4000).
 
 Signed-off-by: Don Slutz d...@cloudswitch.com
 ---
  target-i386/kvm.c |4 ++--
  1 files changed, 2 insertions(+), 2 deletions(-)
 
 diff --git a/target-i386/kvm.c b/target-i386/kvm.c
 index 895d848..8462c75 100644
 --- a/target-i386/kvm.c
 +++ b/target-i386/kvm.c
 @@ -389,12 +389,12 @@ int kvm_arch_init_vcpu(CPUX86State *env)
  c = cpuid_data.entries[cpuid_i++];
  memset(c, 0, sizeof(*c));
  c-function = KVM_CPUID_SIGNATURE
 -if (!hyperv_enabled()) {
 +if (!env-cpuid_hv_level_set) {
  memcpy(signature, KVMKVMKVM\0\0\0, 12);
  c-eax = 0;
  } else {
  memcpy(signature, Microsoft Hv, 12);
 -c-eax = HYPERV_CPUID_MIN;
 +c-eax = env-cpuid_hv_level;

This breaks hyperv_enabled() checks. 

Don, are you certain it is worthwhile to make this configurable? 
Can you explain why, under your scenario, it is worthwhile?

Because these are separate problems:

- Fake VMWare hypervisor  (which seems to be your main goal).
- Make CPUID HV leafs configurable via command line.

My point is that the CPUIDs must be carefully constructed, 
that i miss the point why making them configurable is 
desired.

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] vhost-blk: Add vhost-blk support v2

2012-10-09 Thread Christoph Hellwig
 +static int vhost_blk_req_submit(struct vhost_blk_req *req, struct file *file)
 +{
 +
 + struct inode *inode = file-f_mapping-host;
 + struct block_device *bdev = inode-i_bdev;
 + int ret;

Please just pass the block_device directly instead of a file struct.

 +
 + ret = vhost_blk_bio_make(req, bdev);
 + if (ret  0)
 + return ret;
 +
 + vhost_blk_bio_send(req);
 +
 + return ret;
 +}

Then again how simple the this function is it probably should just go
away entirely.

 + set_fs(USER_DS);

What do we actually need the set_fs for here?

 +
 +static inline void vhost_blk_stop(struct vhost_blk *blk, struct file **file)
 +{
 +
 + *file = vhost_blk_stop_vq(blk, blk-vq);
 +}

What is the point of this helper?  Also I can't see anyone actually
using the returned struct file.

 + case VIRTIO_BLK_T_FLUSH:
 + ret = vfs_fsync(file, 1);
 + status = ret  0 ? VIRTIO_BLK_S_IOERR : VIRTIO_BLK_S_OK;
 + if (!vhost_blk_set_status(req, status))
 + vhost_add_used_and_signal(blk-dev, vq, head, ret);
 + break;

Sending a fsync here is actually wrong in two different ways:

 a) it operates at the filesystem level instead of bio level
 b) it's a blocking operation

It should instead send a REQ_FLUSH bio using the same submission scheme
as the read/write requests.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PULL stable-0.15] Stable-0.15 queue for qemu-kvm

2012-10-09 Thread Andreas Färber
Hello Marcelo,

Here's a couple of backports for your stable-0.15 branch.
Except for one (marked as backported) these were all clean cherry-picks.

My proposal is to merge these KVM-only patches before qemu-stable-0.15.git,
where I will be tagging v0.15.2 shortly.

Cc: Marcelo Tosatti mtosa...@redhat.com
Cc: Avi Kivity a...@redhat.com

Cc: Anthony Liguori anth...@codemonkey.ws
Cc: Bruce Rogers brog...@suse.com

The following changes since commit 725ba81ec6812d7eeb69be92639eb81151b5306e:

  Merge remote branch 'upstream/stable-0.15' into stable-0.15 (2011-10-19 
11:54:48 -0200)

are available in the git repository at:


  git://repo.or.cz/qemu/afaerber.git qemu-kvm-stable-0.15

for you to fetch changes up to b2be0429795b18c018610d48142e797cbc31be0d:

  pci-assign: Remove bogus PCIe lnkcap wmask setting (2012-10-09 19:03:59 +0200)


Alex Williamson (4):
  pci-assign: Fix PCI_EXP_FLAGS_TYPE shift
  pci-assign: Fix PCIe lnkcap
  pci-assign: Harden I/O port test
  pci-assign: Remove bogus PCIe lnkcap wmask setting

Jan Kiszka (1):
  pci-assign: Update legacy interrupts only if used

Lai Jiangshan (1):
  qemu-kvm: fix improper nmi emulation

 hw/apic.c  |   33 +
 hw/apic.h  |1 +
 hw/device-assignment.c |   30 +++---
 monitor.c  |6 +-
 4 Dateien geändert, 54 Zeilen hinzugefügt(+), 16 Zeilen entfernt(-)
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH stable-0.15 1/6] qemu-kvm: fix improper nmi emulation

2012-10-09 Thread Andreas Färber
From: Lai Jiangshan la...@cn.fujitsu.com

Currently, NMI interrupt is blindly sent to all the vCPUs when NMI
button event happens. This doesn't properly emulate real hardware on
which NMI button event triggers LINT1. Because of this, NMI is sent to
the processor even when LINT1 is maskied in LVT. For example, this
causes the problem that kdump initiated by NMI sometimes doesn't work
on KVM, because kdump assumes NMI is masked on CPUs other than CPU0.

With this patch, inject-nmi request is handled as follows.

- When in-kernel irqchip is disabled, deliver LINT1 instead of NMI
  interrupt.
- When in-kernel irqchip is enabled, get the in-kernel LAPIC states
  and test the APIC_LVT_MASKED, if LINT1 is unmasked, and then
  delivering the NMI directly. (Suggested by Jan Kiszka)

Changed from old version:
  re-implement it by the Jan's suggestion.
  fix the race found by Jan.

Signed-off-by: Lai Jiangshan la...@cn.fujitsu.com
Reported-by: Kenji Kaneshige kaneshige.ke...@jp.fujitsu.com
Signed-off-by: Avi Kivity a...@redhat.com
(cherry picked from commit 67feec6ed854b3618b37ccf050b90192cbb96e0f)

Signed-off-by: Andreas Färber afaer...@suse.de
---
 hw/apic.c |   33 +
 hw/apic.h |1 +
 monitor.c |6 +-
 3 Dateien geändert, 39 Zeilen hinzugefügt(+), 1 Zeile entfernt(-)

diff --git a/hw/apic.c b/hw/apic.c
index a45b57f..243900d 100644
--- a/hw/apic.c
+++ b/hw/apic.c
@@ -204,6 +204,39 @@ void apic_deliver_pic_intr(DeviceState *d, int level)
 }
 }
 
+static inline uint32_t kapic_reg(struct kvm_lapic_state *kapic, int reg_id);
+
+static void kvm_irqchip_deliver_nmi(void *p)
+{
+APICState *s = p;
+struct kvm_lapic_state klapic;
+uint32_t lvt;
+
+kvm_get_lapic(s-cpu_env, klapic);
+lvt = kapic_reg(klapic, 0x32 + APIC_LVT_LINT1);
+
+if (lvt  APIC_LVT_MASKED) {
+return;
+}
+
+if (((lvt  8)  7) != APIC_DM_NMI) {
+return;
+}
+
+kvm_vcpu_ioctl(s-cpu_env, KVM_NMI);
+}
+
+void apic_deliver_nmi(DeviceState *d)
+{
+APICState *s = DO_UPCAST(APICState, busdev.qdev, d);
+
+if (kvm_irqchip_in_kernel()) {
+run_on_cpu(s-cpu_env, kvm_irqchip_deliver_nmi, s);
+} else {
+apic_local_deliver(s, APIC_LVT_LINT1);
+}
+}
+
 #define foreach_apic(apic, deliver_bitmask, code) \
 {\
 int __i, __j, __mask;\
diff --git a/hw/apic.h b/hw/apic.h
index c857d52..3a4be0a 100644
--- a/hw/apic.h
+++ b/hw/apic.h
@@ -10,6 +10,7 @@ void apic_deliver_irq(uint8_t dest, uint8_t dest_mode,
  uint8_t trigger_mode);
 int apic_accept_pic_intr(DeviceState *s);
 void apic_deliver_pic_intr(DeviceState *s, int level);
+void apic_deliver_nmi(DeviceState *d);
 int apic_get_interrupt(DeviceState *s);
 void apic_reset_irq_delivered(void);
 int apic_get_irq_delivered(void);
diff --git a/monitor.c b/monitor.c
index 7680929..6af0673 100644
--- a/monitor.c
+++ b/monitor.c
@@ -2604,7 +2604,11 @@ static int do_inject_nmi(Monitor *mon, const QDict 
*qdict, QObject **ret_data)
 CPUState *env;
 
 for (env = first_cpu; env != NULL; env = env-next_cpu) {
-cpu_interrupt(env, CPU_INTERRUPT_NMI);
+if (!env-apic_state) {
+cpu_interrupt(env, CPU_INTERRUPT_NMI);
+} else {
+apic_deliver_nmi(env-apic_state);
+}
 }
 
 return 0;
-- 
1.7.10.4

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH stable-0.15 2/6] pci-assign: Fix PCI_EXP_FLAGS_TYPE shift

2012-10-09 Thread Andreas Färber
From: Alex Williamson alex.william...@redhat.com

Coverity found that we're doing (uint16_t)type  0xf0  8.
This is obviously always 0x0, so our attempt to filter out
some device types thinks everything is an endpoint.  Fix
shift amount.

Signed-off-by: Alex Williamson alex.william...@redhat.com
Signed-off-by: Avi Kivity a...@redhat.com
(cherry picked from commit b4eccd18591f3d639bc3c923e299b3c1241a0b3f)

Signed-off-by: Andreas Färber afaer...@suse.de
---
 hw/device-assignment.c |2 +-
 1 Datei geändert, 1 Zeile hinzugefügt(+), 1 Zeile entfernt(-)

diff --git a/hw/device-assignment.c b/hw/device-assignment.c
index 177daa4..dc41bfd 100644
--- a/hw/device-assignment.c
+++ b/hw/device-assignment.c
@@ -1459,7 +1459,7 @@ static int assigned_device_pci_cap_init(PCIDevice 
*pci_dev)
 }
 
 type = pci_get_word(pci_dev-config + pos + PCI_EXP_FLAGS);
-type = (type  PCI_EXP_FLAGS_TYPE)  8;
+type = (type  PCI_EXP_FLAGS_TYPE)  4;
 if (type != PCI_EXP_TYPE_ENDPOINT 
 type != PCI_EXP_TYPE_LEG_END  type != PCI_EXP_TYPE_RC_END) {
 fprintf(stderr,
-- 
1.7.10.4

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH stable-0.15 3/6] pci-assign: Fix PCIe lnkcap

2012-10-09 Thread Andreas Färber
From: Alex Williamson alex.william...@redhat.com

Another Coverity found issue, lnkcap is a 32bit register and
we're masking bits 16  17.  Fix to uin32_t.

Signed-off-by: Alex Williamson alex.william...@redhat.com
Signed-off-by: Avi Kivity a...@redhat.com
(cherry picked from commit 21f5a19a10c8f6a10d79a415bf640de85acede78)

[AF: Backported]
Signed-off-by: Andreas Färber afaer...@suse.de
---
 hw/device-assignment.c |8 
 1 Datei geändert, 4 Zeilen hinzugefügt(+), 4 Zeilen entfernt(-)

diff --git a/hw/device-assignment.c b/hw/device-assignment.c
index dc41bfd..c4c2535 100644
--- a/hw/device-assignment.c
+++ b/hw/device-assignment.c
@@ -1420,8 +1420,8 @@ static int assigned_device_pci_cap_init(PCIDevice 
*pci_dev)
 
 if ((pos = pci_find_cap_offset(pci_dev, PCI_CAP_ID_EXP, 0))) {
 uint8_t version, size;
-uint16_t type, devctl, lnkcap, lnksta;
-uint32_t devcap;
+uint16_t type, devctl, lnksta;
+uint32_t devcap, lnkcap;
 
 version = pci_get_byte(pci_dev-config + pos + PCI_EXP_FLAGS);
 version = PCI_EXP_FLAGS_VERS;
@@ -1491,11 +1491,11 @@ static int assigned_device_pci_cap_init(PCIDevice 
*pci_dev)
 pci_set_word(pci_dev-config + pos + PCI_EXP_DEVSTA, 0);
 
 /* Link capabilities, expose links and latencues, clear reporting */
-lnkcap = pci_get_word(pci_dev-config + pos + PCI_EXP_LNKCAP);
+lnkcap = pci_get_long(pci_dev-config + pos + PCI_EXP_LNKCAP);
 lnkcap = (PCI_EXP_LNKCAP_SLS | PCI_EXP_LNKCAP_MLW |
PCI_EXP_LNKCAP_ASPMS | PCI_EXP_LNKCAP_L0SEL |
PCI_EXP_LNKCAP_L1EL);
-pci_set_word(pci_dev-config + pos + PCI_EXP_LNKCAP, lnkcap);
+pci_set_long(pci_dev-config + pos + PCI_EXP_LNKCAP, lnkcap);
 pci_set_word(pci_dev-wmask + pos + PCI_EXP_LNKCAP,
  PCI_EXP_LNKCTL_ASPMC | PCI_EXP_LNKCTL_RCB |
  PCI_EXP_LNKCTL_CCC | PCI_EXP_LNKCTL_ES |
-- 
1.7.10.4

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH stable-0.15 5/6] pci-assign: Update legacy interrupts only if used

2012-10-09 Thread Andreas Färber
From: Jan Kiszka jan.kis...@siemens.com

Don't mess with assign_intx on devices that are in MSI or MSI-X mode, it
would corrupt their interrupt routing.

Signed-off-by: Jan Kiszka jan.kis...@siemens.com
Signed-off-by: Marcelo Tosatti mtosa...@redhat.com
(cherry picked from commit 096392efe1e5ee670f880c96c31f7ea8d6d76cf4)

Signed-off-by: Andreas Färber afaer...@suse.de
---
 hw/device-assignment.c |9 ++---
 1 Datei geändert, 6 Zeilen hinzugefügt(+), 3 Zeilen entfernt(-)

diff --git a/hw/device-assignment.c b/hw/device-assignment.c
index d586ce4..43029a4 100644
--- a/hw/device-assignment.c
+++ b/hw/device-assignment.c
@@ -1062,9 +1062,12 @@ void assigned_dev_update_irqs(void)
 dev = QLIST_FIRST(devs);
 while (dev) {
 next = QLIST_NEXT(dev, next);
-r = assign_irq(dev);
-if (r  0)
-qdev_unplug(dev-dev.qdev);
+if (dev-irq_requested_type  KVM_DEV_IRQ_HOST_INTX) {
+r = assign_irq(dev);
+if (r  0) {
+qdev_unplug(dev-dev.qdev);
+}
+}
 dev = next;
 }
 }
-- 
1.7.10.4

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH stable-0.15 4/6] pci-assign: Harden I/O port test

2012-10-09 Thread Andreas Färber
From: Alex Williamson alex.william...@redhat.com

Markus Armbruster points out that we're missing a  0 check
from pread while trying to probe for pci-sysfs io-port
resource support.  We don't expect a short read, but we
should harden the test to abort if we get one so we're not
potentially looking at a stale errno.

Signed-off-by: Alex Williamson alex.william...@redhat.com
Signed-off-by: Avi Kivity a...@redhat.com
(cherry picked from commit 1d1c8a498b7ce5c5636f1014f7ad18aa4e1acc0a)

Signed-off-by: Andreas Färber afaer...@suse.de
---
 hw/device-assignment.c |5 +++--
 1 Datei geändert, 3 Zeilen hinzugefügt(+), 2 Zeilen entfernt(-)

diff --git a/hw/device-assignment.c b/hw/device-assignment.c
index c4c2535..d586ce4 100644
--- a/hw/device-assignment.c
+++ b/hw/device-assignment.c
@@ -618,8 +618,9 @@ static int assigned_dev_register_regions(PCIRegion 
*io_regions,
  * kernels return EIO.  New kernels only allow 1/2/4 byte reads
  * so should return EINVAL for a 3 byte read */
 ret = pread(pci_dev-v_addrs[i].region-resource_fd, val, 3, 0);
-if (ret == 3) {
-fprintf(stderr, I/O port resource supports 3 byte read?!\n);
+if (ret = 0) {
+fprintf(stderr, Unexpected return from I/O port read: %d\n,
+ret);
 abort();
 } else if (errno != EINVAL) {
 fprintf(stderr, Using raw in/out ioport access (sysfs - 
%s)\n,
-- 
1.7.10.4

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH stable-0.15 6/6] pci-assign: Remove bogus PCIe lnkcap wmask setting

2012-10-09 Thread Andreas Färber
From: Alex Williamson alex.william...@redhat.com

All the fields of lnkcap are read-only and this is setting it
with mask values from LNKCTL.  Just below it, we indicate
link control is read only, so this appears to be a stray
chunk left in from development.  Trivial comment fix while
we're here.

Signed-off-by: Alex Williamson alex.william...@redhat.com
Signed-off-by: Avi Kivity a...@redhat.com
(cherry picked from commit 0cbb68b1ce40e9b7e0b8cea5fd849f5c6bd09aee)

Signed-off-by: Andreas Färber afaer...@suse.de
---
 hw/device-assignment.c |6 +-
 1 Datei geändert, 1 Zeile hinzugefügt(+), 5 Zeilen entfernt(-)

diff --git a/hw/device-assignment.c b/hw/device-assignment.c
index 43029a4..3908144 100644
--- a/hw/device-assignment.c
+++ b/hw/device-assignment.c
@@ -1481,7 +1481,7 @@ static int assigned_device_pci_cap_init(PCIDevice 
*pci_dev)
 pci_set_long(pci_dev-config + pos + PCI_EXP_DEVCAP, devcap);
 
 /* device control: clear all error reporting enable bits, leaving
- * leaving only a few host values.  Note, these are
+ * only a few host values.  Note, these are
  * all writable, but not passed to hw.
  */
 devctl = pci_get_word(pci_dev-config + pos + PCI_EXP_DEVCTL);
@@ -1500,10 +1500,6 @@ static int assigned_device_pci_cap_init(PCIDevice 
*pci_dev)
PCI_EXP_LNKCAP_ASPMS | PCI_EXP_LNKCAP_L0SEL |
PCI_EXP_LNKCAP_L1EL);
 pci_set_long(pci_dev-config + pos + PCI_EXP_LNKCAP, lnkcap);
-pci_set_word(pci_dev-wmask + pos + PCI_EXP_LNKCAP,
- PCI_EXP_LNKCTL_ASPMC | PCI_EXP_LNKCTL_RCB |
- PCI_EXP_LNKCTL_CCC | PCI_EXP_LNKCTL_ES |
- PCI_EXP_LNKCTL_CLKREQ_EN | PCI_EXP_LNKCTL_HAWD);
 
 /* Link control, pass existing read-only copy.  Should be writable? */
 
-- 
1.7.10.4

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [kvmarm] [PATCH v2 12/14] KVM: ARM: VFP userspace interface

2012-10-09 Thread Peter Maydell
On 1 October 2012 10:11, Christoffer Dall c.d...@virtualopensystems.com wrote:
 From: Rusty Russell rusty.russ...@linaro.org
 --- a/Documentation/virtual/kvm/api.txt
 +++ b/Documentation/virtual/kvm/api.txt
 @@ -1765,6 +1765,12 @@ ARM 64-bit CP15 registers have the following id bit 
 patterns:
  ARM CCSIDR registers are demultiplexed by CSSELR value:
0x4002  0011 00 csselr:8

 +ARM 32-bit VFP control registers have the following id bit patterns:
 +  0x4002  0012 1 selector:3
 +
 +ARM 64-bit FP registers have the following id bit patterns:
 +  0x4002  0012 0 selector:3
 +

In the other entries in this section, foo:3 indicates a 3 bit field.
But I think for these two it's trying to indicate a 3 byte field.
Compare the include file constants:

+#define KVM_REG_ARM_VFP(0x0012 
KVM_REG_ARM_COPROC_SHIFT)
+#define KVM_REG_ARM_VFP_MASK   0x
+#define KVM_REG_ARM_VFP_BASE_REG   0x0
+#define KVM_REG_ARM_VFP_FPSID  0x1000
+#define KVM_REG_ARM_VFP_FPSCR  0x1001

so eg the first control register is
 0x4002  0012 1000
and the first D reg is
 0x4002  0012 

So my guess is these should be selector:12
(or possibly even something slightly more specific than 'selector'
like 'regno:12' ?)

-- PMM
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [kvmarm] [PATCH v2 12/14] KVM: ARM: VFP userspace interface

2012-10-09 Thread Christoffer Dall
On Tue, Oct 9, 2012 at 2:11 PM, Peter Maydell peter.mayd...@linaro.org wrote:
 On 1 October 2012 10:11, Christoffer Dall c.d...@virtualopensystems.com 
 wrote:
 From: Rusty Russell rusty.russ...@linaro.org
 --- a/Documentation/virtual/kvm/api.txt
 +++ b/Documentation/virtual/kvm/api.txt
 @@ -1765,6 +1765,12 @@ ARM 64-bit CP15 registers have the following id bit 
 patterns:
  ARM CCSIDR registers are demultiplexed by CSSELR value:
0x4002  0011 00 csselr:8

 +ARM 32-bit VFP control registers have the following id bit patterns:
 +  0x4002  0012 1 selector:3
 +
 +ARM 64-bit FP registers have the following id bit patterns:
 +  0x4002  0012 0 selector:3
 +

 In the other entries in this section, foo:3 indicates a 3 bit field.
 But I think for these two it's trying to indicate a 3 byte field.
 Compare the include file constants:

 +#define KVM_REG_ARM_VFP(0x0012 
 KVM_REG_ARM_COPROC_SHIFT)
 +#define KVM_REG_ARM_VFP_MASK   0x
 +#define KVM_REG_ARM_VFP_BASE_REG   0x0
 +#define KVM_REG_ARM_VFP_FPSID  0x1000
 +#define KVM_REG_ARM_VFP_FPSCR  0x1001

 so eg the first control register is
  0x4002  0012 1000
 and the first D reg is
  0x4002  0012 

 So my guess is these should be selector:12
 (or possibly even something slightly more specific than 'selector'
 like 'regno:12' ?)

right you are, I'll modify this.

Thanks,
-Christoffer
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v6 05/16] target-i386: Use Hypervisor level in -machine pc,accel=kvm.

2012-10-09 Thread Marcelo Tosatti
On Tue, Oct 09, 2012 at 02:18:05PM -0300, Marcelo Tosatti wrote:
 On Mon, Sep 24, 2012 at 10:32:07AM -0400, Don Slutz wrote:
  Also known as Paravirtualization level.
  
  This change is based on:
  
  Microsoft Hypervisor CPUID Leaves:

  http://msdn.microsoft.com/en-us/library/windows/hardware/ff542428%28v=vs.85%29.aspx
  
  Linux kernel change starts with:
http://fixunix.com/kernel/538707-use-cpuid-communicate-hypervisor.html
  Also:
http://lkml.indiana.edu/hypermail/linux/kernel/1205.0/00100.html
  
  VMware documention on CPUIDs (Mechanisms to determine if software is
  running in a VMware virtual machine):

  http://kb.vmware.com/selfservice/microsites/search.do?language=en_UScmd=displayKCexternalId=1009458
  
  QEMU knows this is KVM_CPUID_SIGNATURE (0x4000).
  
  Signed-off-by: Don Slutz d...@cloudswitch.com
  ---
   target-i386/kvm.c |4 ++--
   1 files changed, 2 insertions(+), 2 deletions(-)
  
  diff --git a/target-i386/kvm.c b/target-i386/kvm.c
  index 895d848..8462c75 100644
  --- a/target-i386/kvm.c
  +++ b/target-i386/kvm.c
  @@ -389,12 +389,12 @@ int kvm_arch_init_vcpu(CPUX86State *env)
   c = cpuid_data.entries[cpuid_i++];
   memset(c, 0, sizeof(*c));
   c-function = KVM_CPUID_SIGNATURE
  -if (!hyperv_enabled()) {
  +if (!env-cpuid_hv_level_set) {
   memcpy(signature, KVMKVMKVM\0\0\0, 12);
   c-eax = 0;
   } else {
   memcpy(signature, Microsoft Hv, 12);
  -c-eax = HYPERV_CPUID_MIN;
  +c-eax = env-cpuid_hv_level;
 
 This breaks hyperv_enabled() checks. 
 
 Don, are you certain it is worthwhile to make this configurable? 
 Can you explain why, under your scenario, it is worthwhile?
 
 Because these are separate problems:
 
 - Fake VMWare hypervisor  (which seems to be your main goal).
 - Make CPUID HV leafs configurable via command line.

Err, meant via properties. Point is, why have VMWare CPUID
configuration as data, if there are reasons to believe code 
is a better fit (code as in current Hyper-V implementation).

 My point is that the CPUIDs must be carefully constructed, 
 that i miss the point why making them configurable is 
 desired.
 
 --
 To unsubscribe from this list: send the line unsubscribe kvm in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v6 05/16] target-i386: Use Hypervisor level in -machine pc,accel=kvm.

2012-10-09 Thread Marcelo Tosatti
On Tue, Oct 09, 2012 at 03:27:26PM -0300, Marcelo Tosatti wrote:
 On Tue, Oct 09, 2012 at 02:18:05PM -0300, Marcelo Tosatti wrote:
  On Mon, Sep 24, 2012 at 10:32:07AM -0400, Don Slutz wrote:
   Also known as Paravirtualization level.
   
   This change is based on:
   
   Microsoft Hypervisor CPUID Leaves:
 
   http://msdn.microsoft.com/en-us/library/windows/hardware/ff542428%28v=vs.85%29.aspx
   
   Linux kernel change starts with:
 http://fixunix.com/kernel/538707-use-cpuid-communicate-hypervisor.html
   Also:
 http://lkml.indiana.edu/hypermail/linux/kernel/1205.0/00100.html
   
   VMware documention on CPUIDs (Mechanisms to determine if software is
   running in a VMware virtual machine):
 
   http://kb.vmware.com/selfservice/microsites/search.do?language=en_UScmd=displayKCexternalId=1009458
   
   QEMU knows this is KVM_CPUID_SIGNATURE (0x4000).
   
   Signed-off-by: Don Slutz d...@cloudswitch.com
   ---
target-i386/kvm.c |4 ++--
1 files changed, 2 insertions(+), 2 deletions(-)
   
   diff --git a/target-i386/kvm.c b/target-i386/kvm.c
   index 895d848..8462c75 100644
   --- a/target-i386/kvm.c
   +++ b/target-i386/kvm.c
   @@ -389,12 +389,12 @@ int kvm_arch_init_vcpu(CPUX86State *env)
c = cpuid_data.entries[cpuid_i++];
memset(c, 0, sizeof(*c));
c-function = KVM_CPUID_SIGNATURE
   -if (!hyperv_enabled()) {
   +if (!env-cpuid_hv_level_set) {
memcpy(signature, KVMKVMKVM\0\0\0, 12);
c-eax = 0;
} else {
memcpy(signature, Microsoft Hv, 12);
   -c-eax = HYPERV_CPUID_MIN;
   +c-eax = env-cpuid_hv_level;
  
  This breaks hyperv_enabled() checks. 
  
  Don, are you certain it is worthwhile to make this configurable? 
  Can you explain why, under your scenario, it is worthwhile?
  
  Because these are separate problems:
  
  - Fake VMWare hypervisor  (which seems to be your main goal).
  - Make CPUID HV leafs configurable via command line.
 
 Err, meant via properties. Point is, why have VMWare CPUID
 configuration as data, if there are reasons to believe code 
 is a better fit (code as in current Hyper-V implementation).

Nevermind, its the right thing to do. Just separate the patchset
please:

1) Create object properties.
2) Export VMWare CPUID via properties.
3) Convert Hyper-V. 

Be careful to make sure Hyper-V's current options are functional 
in 3).

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH RFC 1/2] kvm: Handle undercommitted guest case in PLE handler

2012-10-09 Thread Raghavendra K T
* Avi Kivity a...@redhat.com [2012-10-04 17:00:28]:

 On 10/04/2012 03:07 PM, Peter Zijlstra wrote:
  On Thu, 2012-10-04 at 14:41 +0200, Avi Kivity wrote:
  
  Again the numbers are ridiculously high for arch_local_irq_restore.
  Maybe there's a bad perf/kvm interaction when we're injecting an
  interrupt, I can't believe we're spending 84% of the time running the
  popf instruction. 
  
  Smells like a software fallback that doesn't do NMI, hrtimer based
  sampling typically hits popf where we re-enable interrupts.
 
 Good nose, that's probably it.  Raghavendra, can you ensure that the PMU
 is properly exposed?  'dmesg' in the guest will tell.  If it isn't, -cpu
 host will expose it (and a good idea anyway to get best performance).
 

Hi Avi, you are right. SandyBridge machine result was not proper.
I cleaned up the services, enabled PMU, re-ran all the test again.

Here is the summary:
We do get good benefit by increasing ple window. Though we don't
see good benefit for kernbench and sysbench, for ebizzy, we get huge
improvement for 1x scenario. (almost 2/3rd of ple disabled case).

Let me know if you think we can increase the default ple_window
itself to 16k.

I am experimenting with V2 version of undercommit improvement(this) patch
series, But I think if you wish  to go for increase of
default ple_window, then we would have to measure the benefit of patches
when ple_window = 16k.

I can respin the whole series including this default ple_window change.

I also have the perf kvm top result for both ebizzy and kernbench.
I think they are in expected lines now.

Improvements


16 core PLE machine with 16 vcpu guest

base = 3.6.0-rc5 + ple handler optimization patches
base_pleopt_16k = base + ple_window = 16k
base_pleopt_32k = base + ple_window = 32k
base_pleopt_nople = base + ple_gap = 0
kernbench, hackbench, sysbench (time in sec lower is better)
ebizzy (rec/sec higher is better)

% improvements w.r.t base (ple_window = 4k)
---+---+-+---+
   |base_pleopt_16k| base_pleopt_32k | base_pleopt_nople |
---+---+-+---+
kernbench_1x   |  0.42371  |  1.15164|   0.09320 |
kernbench_2x   | -1.40981  | -17.48282   |  -570.77053   |
---+---+-+---+
sysbench_1x| -0.92367  | 0.24241 | -0.27027  |
sysbench_2x| -2.22706  |-0.30896 | -1.27573  |
sysbench_3x| -0.75509  | 0.09444 | -2.97756  |
---+---+-+---+
ebizzy_1x  | 54.99976  | 67.29460|  74.14076 |
ebizzy_2x  | -8.83386  |-27.38403| -96.22066 |
---+---+-+---+

perf kvm top observation for kernbench and ebizzy (nople, 4k, 32k window) 

pleopt   ple_gap=0

ebizzy : 18131 records/s
63.78%  [guest.kernel]  [g] _raw_spin_lock_irqsave
5.65%  [guest.kernel]  [g] smp_call_function_many
3.12%  [guest.kernel]  [g] clear_page
3.02%  [guest.kernel]  [g] down_read_trylock
1.85%  [guest.kernel]  [g] async_page_fault
1.81%  [guest.kernel]  [g] up_read
1.76%  [guest.kernel]  [g] native_apic_mem_write
1.70%  [guest.kernel]  [g] find_vma

kernbench :Elapsed Time 29.4933 (27.6007)
   5.72%  [guest.kernel]  [g] async_page_fault
3.48%  [guest.kernel]  [g] pvclock_clocksource_read
2.68%  [guest.kernel]  [g] copy_user_generic_unrolled
2.58%  [guest.kernel]  [g] clear_page
2.09%  [guest.kernel]  [g] page_cache_get_speculative
2.00%  [guest.kernel]  [g] do_raw_spin_lock
1.78%  [guest.kernel]  [g] unmap_single_vma
1.74%  [guest.kernel]  [g] kmem_cache_alloc

pleopt ple_window = 4k
---
ebizzy: 10176 records/s
   69.17%  [guest.kernel]  [g] _raw_spin_lock_irqsave
3.34%  [guest.kernel]  [g] clear_page
2.16%  [guest.kernel]  [g] down_read_trylock
1.94%  [guest.kernel]  [g] async_page_fault
1.89%  [guest.kernel]  [g] native_apic_mem_write
1.63%  [guest.kernel]  [g] smp_call_function_many
1.58%  [guest.kernel]  [g] SetPageLRU
1.37%  [guest.kernel]  [g] up_read
1.01%  [guest.kernel]  [g] find_vma


kernbench: 29.9533
nts: 240K cycles
6.04%  [guest.kernel]  [g] async_page_fault
4.17%  [guest.kernel]  [g] pvclock_clocksource_read
3.28%  [guest.kernel]  [g] clear_page
2.57%  [guest.kernel]  [g] copy_user_generic_unrolled
2.30%  [guest.kernel]  [g] do_raw_spin_lock
2.13%  [guest.kernel]  [g] _raw_spin_lock_irqsave
1.93%  [guest.kernel]  [g] page_cache_get_speculative
1.92%  [guest.kernel]  [g] unmap_single_vma
1.77%  [guest.kernel]  [g] kmem_cache_alloc
1.61%  [guest.kernel]  [g] __d_lookup_rcu
1.19%  

Re: [PATCH v6 05/16] target-i386: Use Hypervisor level in -machine pc,accel=kvm.

2012-10-09 Thread Don Slutz

On 10/09/12 14:47, Marcelo Tosatti wrote:

On Tue, Oct 09, 2012 at 03:27:26PM -0300, Marcelo Tosatti wrote:

On Tue, Oct 09, 2012 at 02:18:05PM -0300, Marcelo Tosatti wrote:

On Mon, Sep 24, 2012 at 10:32:07AM -0400, Don Slutz wrote:

Also known as Paravirtualization level.

This change is based on:

Microsoft Hypervisor CPUID Leaves:
   
http://msdn.microsoft.com/en-us/library/windows/hardware/ff542428%28v=vs.85%29.aspx

Linux kernel change starts with:
   http://fixunix.com/kernel/538707-use-cpuid-communicate-hypervisor.html
Also:
   http://lkml.indiana.edu/hypermail/linux/kernel/1205.0/00100.html

VMware documention on CPUIDs (Mechanisms to determine if software is
running in a VMware virtual machine):
   
http://kb.vmware.com/selfservice/microsites/search.do?language=en_UScmd=displayKCexternalId=1009458

QEMU knows this is KVM_CPUID_SIGNATURE (0x4000).

Signed-off-by: Don Slutz d...@cloudswitch.com
---
  target-i386/kvm.c |4 ++--
  1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/target-i386/kvm.c b/target-i386/kvm.c
index 895d848..8462c75 100644
--- a/target-i386/kvm.c
+++ b/target-i386/kvm.c
@@ -389,12 +389,12 @@ int kvm_arch_init_vcpu(CPUX86State *env)
  c = cpuid_data.entries[cpuid_i++];
  memset(c, 0, sizeof(*c));
  c-function = KVM_CPUID_SIGNATURE
-if (!hyperv_enabled()) {
+if (!env-cpuid_hv_level_set) {
  memcpy(signature, KVMKVMKVM\0\0\0, 12);
  c-eax = 0;
  } else {
  memcpy(signature, Microsoft Hv, 12);
-c-eax = HYPERV_CPUID_MIN;
+c-eax = env-cpuid_hv_level;

This breaks hyperv_enabled() checks.

Don, are you certain it is worthwhile to make this configurable?
Can you explain why, under your scenario, it is worthwhile?

Because these are separate problems:

- Fake VMWare hypervisor  (which seems to be your main goal).
- Make CPUID HV leafs configurable via command line.

Err, meant via properties. Point is, why have VMWare CPUID
configuration as data, if there are reasons to believe code
is a better fit (code as in current Hyper-V implementation).

Nevermind, its the right thing to do. Just separate the patchset
please:

1) Create object properties.
2) Export VMWare CPUID via properties.
3) Convert Hyper-V.

Be careful to make sure Hyper-V's current options are functional
in 3).


Did you mean 3 patch sets (or more)? Or just a different order?
   -Don Slutz

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v6 08/16] target-i386: Add cpu object access routines for Hypervisor vendor.

2012-10-09 Thread Marcelo Tosatti
On Mon, Sep 24, 2012 at 10:32:10AM -0400, Don Slutz wrote:
 These are modeled after x86_cpuid_set_vendor and x86_cpuid_get_vendor.
 Since kvm's vendor is shorter, the test for correct size is removed and zero 
 padding is added.
 
 Set Microsoft's Vendor now that we can.  Value defined in:
   
 http://msdn.microsoft.com/en-us/library/windows/hardware/ff542428%28v=vs.85%29.aspx
 And matches want is in target-i386/kvm.c
 
 Signed-off-by: Don Slutz d...@cloudswitch.com
 ---
  target-i386/cpu.c |   46 ++
  target-i386/cpu.h |2 ++
  2 files changed, 48 insertions(+), 0 deletions(-)
 
 diff --git a/target-i386/cpu.c b/target-i386/cpu.c
 index 920278b..964877f 100644
 --- a/target-i386/cpu.c
 +++ b/target-i386/cpu.c
 @@ -1192,11 +1192,54 @@ static void x86_cpuid_set_hv_level(Object *obj, 
 Visitor *v, void *opaque,
  cpu-env.cpuid_hv_level_set = true;
  }
  
 +static char *x86_cpuid_get_hv_vendor(Object *obj, Error **errp)
 +{
 +X86CPU *cpu = X86_CPU(obj);
 +CPUX86State *env = cpu-env;
 +char *value;
 +int i;
 +
 +value = (char *)g_malloc(CPUID_VENDOR_SZ + 1);
 +for (i = 0; i  4; i++) {
 +value[i + 0] = env-cpuid_hv_vendor1  (8 * i);
 +value[i + 4] = env-cpuid_hv_vendor2  (8 * i);
 +value[i + 8] = env-cpuid_hv_vendor3  (8 * i);
 +}
 +value[CPUID_VENDOR_SZ] = '\0';
 +
 +return value;
 +}
 +
 +static void x86_cpuid_set_hv_vendor(Object *obj, const char *value,
 +Error **errp)
 +{
 +X86CPU *cpu = X86_CPU(obj);
 +CPUX86State *env = cpu-env;
 +int i;
 +char adj_value[CPUID_VENDOR_SZ + 1];
 +
 +memset(adj_value, 0, sizeof(adj_value));
 +
 +pstrcpy(adj_value, sizeof(adj_value), value);
 +
 +env-cpuid_hv_vendor1 = 0;
 +env-cpuid_hv_vendor2 = 0;
 +env-cpuid_hv_vendor3 = 0;
 +for (i = 0; i  4; i++) {
 +env-cpuid_hv_vendor1 |= ((uint8_t)adj_value[i + 0])  (8 * i);
 +env-cpuid_hv_vendor2 |= ((uint8_t)adj_value[i + 4])  (8 * i);
 +env-cpuid_hv_vendor3 |= ((uint8_t)adj_value[i + 8])  (8 * i);
 +}
 +env-cpuid_hv_vendor_set = true;
 +}
 +
  #if !defined(CONFIG_USER_ONLY)
  static void x86_set_hyperv(Object *obj, Error **errp)
  {
  object_property_set_int(obj, CPUID_HV_LEVEL_HYPERV,
  hypervisor-level, errp);
 +object_property_set_str(obj, CPUID_HV_VENDOR_HYPERV,
 +hypervisor-vendor, errp);
  }
  
  static void x86_get_hv_spinlocks(Object *obj, Visitor *v, void *opaque,
 @@ -2126,6 +2169,9 @@ static void x86_cpu_initfn(Object *obj)
  object_property_add(obj, hypervisor-level, int,
  x86_cpuid_get_hv_level,
  x86_cpuid_set_hv_level, NULL, NULL, NULL);
 +object_property_add_str(obj, hypervisor-vendor,
 +x86_cpuid_get_hv_vendor,
 +x86_cpuid_set_hv_vendor, NULL);
  #if !defined(CONFIG_USER_ONLY)
  object_property_add(obj, hv_spinlocks, int,
  x86_get_hv_spinlocks,
 diff --git a/target-i386/cpu.h b/target-i386/cpu.h
 index 11730b2..eb6aa4a 100644
 --- a/target-i386/cpu.h
 +++ b/target-i386/cpu.h
 @@ -488,6 +488,8 @@
  
  #define CPUID_VENDOR_VIA   CentaurHauls
  
 +#define CPUID_HV_VENDOR_HYPERV Microsoft Hv
 +
  #define CPUID_HV_LEVEL_HYPERV  0x4005

As requested, please separate patchset in groups.

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v6 06/16] target-i386: Use Hypervisor level in -machine pc,accel=tcg.

2012-10-09 Thread Marcelo Tosatti
On Mon, Sep 24, 2012 at 10:32:08AM -0400, Don Slutz wrote:
 Also known as Paravirtualization level.
 
 This change is based on:
 
 Microsoft Hypervisor CPUID Leaves:
   
 http://msdn.microsoft.com/en-us/library/windows/hardware/ff542428%28v=vs.85%29.aspx
 
 Linux kernel change starts with:
   http://fixunix.com/kernel/538707-use-cpuid-communicate-hypervisor.html
 Also:
   http://lkml.indiana.edu/hypermail/linux/kernel/1205.0/00100.html
 
 VMware documention on CPUIDs (Mechanisms to determine if software is
 running in a VMware virtual machine):
   
 http://kb.vmware.com/selfservice/microsites/search.do?language=en_UScmd=displayKCexternalId=1009458
 
 QEMU knows this as KVM_CPUID_SIGNATURE (0x4000) in kvm on linux.
 
 This does not provide vendor support in tcg yet.
 
 Signed-off-by: Don Slutz d...@cloudswitch.com
 ---
  target-i386/cpu.c |   27 +++
  1 files changed, 27 insertions(+), 0 deletions(-)
 
 diff --git a/target-i386/cpu.c b/target-i386/cpu.c
 index 48bdaf9..920278b 100644
 --- a/target-i386/cpu.c
 +++ b/target-i386/cpu.c
 @@ -1651,6 +1651,21 @@ void cpu_x86_cpuid(CPUX86State *env, uint32_t index, 
 uint32_t count,
  index =  env-cpuid_xlevel;
  }
  }
 +} else if (index  0x4000) {
 +if (env-cpuid_hv_level_set) {
 +uint32_t real_level = env-cpuid_hv_level;
 +
 +/* Handle Hypervisor CPUIDs */
 +if (real_level  0x4000) {
 +real_level = 0x4000;
 +}
 +if (index  real_level) {
 +index = real_level;
 +}
 +} else {
 +if (index  env-cpuid_level)
 +index = env-cpuid_level;
 +}

Whats the purpose of this checks?

Please provide changelogs for each patch.

  } else {
  if (index  env-cpuid_level)
  index = env-cpuid_level;
 @@ -1789,6 +1804,18 @@ void cpu_x86_cpuid(CPUX86State *env, uint32_t index, 
 uint32_t count,
  *edx = 0;
  }
  break;
 +case 0x4000:
 +*eax = env-cpuid_hv_level;
 +*ebx = 0;
 +*ecx = 0;
 +*edx = 0;
 +break;
 +case 0x4001:
 +*eax = env-cpuid_kvm_features;
 +*ebx = 0;
 +*ecx = 0;
 +*edx = 0;
 +break;
  case 0x8000:
  *eax = env-cpuid_xlevel;
  *ebx = env-cpuid_vendor1;
 -- 
 1.7.1
 
 --
 To unsubscribe from this list: send the line unsubscribe kvm in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v6 14/16] target-i386: Add setting of Hypervisor leaf extra for known vmare4.

2012-10-09 Thread Marcelo Tosatti
On Mon, Sep 24, 2012 at 10:32:16AM -0400, Don Slutz wrote:
 This was taken from:
   http://article.gmane.org/gmane.comp.emulators.kvm.devel/22643
 
 Signed-off-by: Don Slutz d...@cloudswitch.com
 ---
  target-i386/cpu.c |   32 
  1 files changed, 32 insertions(+), 0 deletions(-)
 
 diff --git a/target-i386/cpu.c b/target-i386/cpu.c
 index 8bb20c7..b77dbfe 100644
 --- a/target-i386/cpu.c
 +++ b/target-i386/cpu.c
 @@ -1135,6 +1135,36 @@ static void x86_cpuid_set_model_id(Object *obj, const 
 char *model_id,
  }
  }
  
 +static void x86_cpuid_set_vmware_extra(Object *obj)
 +{
 +X86CPU *cpu = X86_CPU(obj);
 +
 +if ((cpu-env.tsc_khz != 0) 
 +(cpu-env.cpuid_hv_level == CPUID_HV_LEVEL_VMWARE_4) 
 +(cpu-env.cpuid_hv_vendor1 == CPUID_HV_VENDOR_VMWARE_1) 
 +(cpu-env.cpuid_hv_vendor2 == CPUID_HV_VENDOR_VMWARE_2) 
 +(cpu-env.cpuid_hv_vendor3 == CPUID_HV_VENDOR_VMWARE_3)) {
 +const uint32_t apic_khz = 100L;
 +
 +/*
 + * From article.gmane.org/gmane.comp.emulators.kvm.devel/22643
 + *
 + *Leaf 0x4010, Timing Information.
 + *
 + *VMware has defined the first generic leaf to provide timing
 + *information.  This leaf returns the current TSC frequency and
 + *current Bus frequency in kHz.
 + *
 + *# EAX: (Virtual) TSC frequency in kHz.
 + *# EBX: (Virtual) Bus (local apic timer) frequency in kHz.
 + *# ECX, EDX: RESERVED (Per above, reserved fields are set to 
 zero).
 + */
 +cpu-env.cpuid_hv_extra = 0x4010;
 +cpu-env.cpuid_hv_extra_a = (uint32_t)cpu-env.tsc_khz;
 +cpu-env.cpuid_hv_extra_b = apic_khz;
 +}
 +}
 +
  static void x86_cpuid_get_tsc_freq(Object *obj, Visitor *v, void *opaque,
 const char *name, Error **errp)
  {
 @@ -1164,6 +1194,7 @@ static void x86_cpuid_set_tsc_freq(Object *obj, Visitor 
 *v, void *opaque,
  }
  
  cpu-env.tsc_khz = value / 1000;
 +x86_cpuid_set_vmware_extra(obj);
  }
  
  static void x86_cpuid_get_hv_level(Object *obj, Visitor *v, void *opaque,
 @@ -1277,6 +1308,7 @@ static void x86_cpuid_set_hv_vendor(Object *obj, const 
 char *value,
  env-cpuid_hv_vendor3 |= ((uint8_t)adj_value[i + 8])  (8 * i);
  }
  env-cpuid_hv_vendor_set = true;
 +x86_cpuid_set_vmware_extra(obj);
  }

This is strange. Please have this configuration, that depends on other
properties being set, ordered in x86_cpu_initfn. Say:

object_property_add(obj, tsc-frequency, int,
x86_cpuid_get_tsc_freq,
x86_cpuid_set_tsc_freq, NULL, NULL, NULL);

/* depends on tsc frequency */
object_property_add(obj, vmware-extra, int,
x86_cpuid_get_vmware_extra,
x86_cpuid_set_vmware_extra, NULL, NULL, NULL);

Or something to that effect.

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v6 05/16] target-i386: Use Hypervisor level in -machine pc,accel=kvm.

2012-10-09 Thread Marcelo Tosatti
On Tue, Oct 09, 2012 at 03:09:17PM -0400, Don Slutz wrote:
 On 10/09/12 14:47, Marcelo Tosatti wrote:
 On Tue, Oct 09, 2012 at 03:27:26PM -0300, Marcelo Tosatti wrote:
 On Tue, Oct 09, 2012 at 02:18:05PM -0300, Marcelo Tosatti wrote:
 On Mon, Sep 24, 2012 at 10:32:07AM -0400, Don Slutz wrote:
 Also known as Paravirtualization level.
 
 This change is based on:
 
 Microsoft Hypervisor CPUID Leaves:

  http://msdn.microsoft.com/en-us/library/windows/hardware/ff542428%28v=vs.85%29.aspx
 
 Linux kernel change starts with:
http://fixunix.com/kernel/538707-use-cpuid-communicate-hypervisor.html
 Also:
http://lkml.indiana.edu/hypermail/linux/kernel/1205.0/00100.html
 
 VMware documention on CPUIDs (Mechanisms to determine if software is
 running in a VMware virtual machine):

  http://kb.vmware.com/selfservice/microsites/search.do?language=en_UScmd=displayKCexternalId=1009458
 
 QEMU knows this is KVM_CPUID_SIGNATURE (0x4000).
 
 Signed-off-by: Don Slutz d...@cloudswitch.com
 ---
   target-i386/kvm.c |4 ++--
   1 files changed, 2 insertions(+), 2 deletions(-)
 
 diff --git a/target-i386/kvm.c b/target-i386/kvm.c
 index 895d848..8462c75 100644
 --- a/target-i386/kvm.c
 +++ b/target-i386/kvm.c
 @@ -389,12 +389,12 @@ int kvm_arch_init_vcpu(CPUX86State *env)
   c = cpuid_data.entries[cpuid_i++];
   memset(c, 0, sizeof(*c));
   c-function = KVM_CPUID_SIGNATURE
 -if (!hyperv_enabled()) {
 +if (!env-cpuid_hv_level_set) {
   memcpy(signature, KVMKVMKVM\0\0\0, 12);
   c-eax = 0;
   } else {
   memcpy(signature, Microsoft Hv, 12);
 -c-eax = HYPERV_CPUID_MIN;
 +c-eax = env-cpuid_hv_level;
 This breaks hyperv_enabled() checks.
 
 Don, are you certain it is worthwhile to make this configurable?
 Can you explain why, under your scenario, it is worthwhile?
 
 Because these are separate problems:
 
 - Fake VMWare hypervisor  (which seems to be your main goal).
 - Make CPUID HV leafs configurable via command line.
 Err, meant via properties. Point is, why have VMWare CPUID
 configuration as data, if there are reasons to believe code
 is a better fit (code as in current Hyper-V implementation).
 Nevermind, its the right thing to do. Just separate the patchset
 please:
 
 1) Create object properties.
 2) Export VMWare CPUID via properties.
 3) Convert Hyper-V.
 
 Be careful to make sure Hyper-V's current options are functional
 in 3).
 
 Did you mean 3 patch sets (or more)? Or just a different order?
-Don Slutz

Different order. Patches should be logically related (think of what
information the reviewer needs). Please write changelogs for
every patch.


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v6 04/16] target-i386: Add x86_set_hyperv.

2012-10-09 Thread Don Slutz

On 10/09/12 13:17, Marcelo Tosatti wrote:

On Tue, Oct 09, 2012 at 01:34:09PM -0300, Marcelo Tosatti wrote:

On Mon, Sep 24, 2012 at 10:32:06AM -0400, Don Slutz wrote:

This is used to set the cpu object's hypervisor level to the default for 
Microsoft's Hypervisor.

Signed-off-by: Don Slutz d...@cloudswitch.com
---
  target-i386/cpu.c |9 +
  target-i386/cpu.h |2 ++
  2 files changed, 11 insertions(+), 0 deletions(-)

diff --git a/target-i386/cpu.c b/target-i386/cpu.c
index 451de12..48bdaf9 100644
--- a/target-i386/cpu.c
+++ b/target-i386/cpu.c
@@ -1193,6 +1193,12 @@ static void x86_cpuid_set_hv_level(Object *obj, Visitor 
*v, void *opaque,
  }
  
  #if !defined(CONFIG_USER_ONLY)

+static void x86_set_hyperv(Object *obj, Error **errp)
+{
+object_property_set_int(obj, CPUID_HV_LEVEL_HYPERV,
+hypervisor-level, errp);
+}
+
  static void x86_get_hv_spinlocks(Object *obj, Visitor *v, void *opaque,
   const char *name, Error **errp)
  {
@@ -1215,6 +1221,7 @@ static void x86_set_hv_spinlocks(Object *obj, Visitor *v, 
void *opaque,
  return;
  }
  hyperv_set_spinlock_retries(value);
+x86_set_hyperv(obj, errp);
  }
  
  static void x86_get_hv_relaxed(Object *obj, Visitor *v, void *opaque,

@@ -1235,6 +1242,7 @@ static void x86_set_hv_relaxed(Object *obj, Visitor *v, 
void *opaque,
  return;
  }
  hyperv_enable_relaxed_timing(value);
+x86_set_hyperv(obj, errp);
  }
  
  static void x86_get_hv_vapic(Object *obj, Visitor *v, void *opaque,

@@ -1255,6 +1263,7 @@ static void x86_set_hv_vapic(Object *obj, Visitor *v, 
void *opaque,
  return;
  }
  hyperv_enable_vapic_recommended(value);
+x86_set_hyperv(obj, errp);
  }
  #endif
  
diff --git a/target-i386/cpu.h b/target-i386/cpu.h

index 1899f69..3152a4e 100644
--- a/target-i386/cpu.h
+++ b/target-i386/cpu.h
@@ -488,6 +488,8 @@
  
  #define CPUID_VENDOR_VIA   CentaurHauls
  
+#define CPUID_HV_LEVEL_HYPERV  0x4005

+

Where this comes from?

http://msdn.microsoft.com/en-us/library/windows/hardware/ff542428%28v=vs.85%29.aspx

has under Leaf 0x4000 (at very top of table):

EAX

The maximum input value for hypervisor CPUID information. For Microsoft
hypervisors, this value will be at least 0x4005. The vendor ID
signature should be used only for reporting and diagnostic purposes.

Is that the same 0x4005 as in this patch?

Yes, the #define can be reused:

#define HYPERV_CPUID_MIN0x4005


Not as simple as it seems.

http://lists.nongnu.org/archive/html/qemu-devel/2012-09/msg03359.html

I can make sure this info is part of the commit message.
-Don Slutz
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v6 03/16] target-i386: Add cpu object access routines for Hypervisor level.

2012-10-09 Thread Don Slutz

On 10/09/12 12:25, Marcelo Tosatti wrote:

On Mon, Sep 24, 2012 at 10:32:05AM -0400, Don Slutz wrote:

These are modeled after x86_cpuid_get_xlevel and x86_cpuid_set_xlevel.

Signed-off-by: Don Slutz d...@cloudswitch.com
---
  target-i386/cpu.c |   29 +
  1 files changed, 29 insertions(+), 0 deletions(-)

diff --git a/target-i386/cpu.c b/target-i386/cpu.c
index 25ca986..451de12 100644
--- a/target-i386/cpu.c
+++ b/target-i386/cpu.c
@@ -1166,6 +1166,32 @@ static void x86_cpuid_set_tsc_freq(Object *obj, Visitor 
*v, void *opaque,
  cpu-env.tsc_khz = value / 1000;
  }
  
+static void x86_cpuid_get_hv_level(Object *obj, Visitor *v, void *opaque,

+const char *name, Error **errp)
+{
+X86CPU *cpu = X86_CPU(obj);
+
+visit_type_uint32(v, cpu-env.cpuid_hv_level, name, errp);
+}
+
+static void x86_cpuid_set_hv_level(Object *obj, Visitor *v, void *opaque,
+const char *name, Error **errp)
+{
+X86CPU *cpu = X86_CPU(obj);
+uint32_t value;
+
+visit_type_uint32(v, value, name, errp);
+if (error_is_set(errp)) {
+return;
+}
+
+if (value != 0  value  0x4000) {
+value += 0x4000;
+}

Whats the purpose of this? Adds ambiguity.

Will add more info in this commit message.
   -Don
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v6 04/16] target-i386: Add x86_set_hyperv.

2012-10-09 Thread Marcelo Tosatti
On Tue, Oct 09, 2012 at 03:12:00PM -0400, Don Slutz wrote:
 On 10/09/12 13:17, Marcelo Tosatti wrote:
 On Tue, Oct 09, 2012 at 01:34:09PM -0300, Marcelo Tosatti wrote:
 On Mon, Sep 24, 2012 at 10:32:06AM -0400, Don Slutz wrote:
 This is used to set the cpu object's hypervisor level to the default for 
 Microsoft's Hypervisor.
 
 Signed-off-by: Don Slutz d...@cloudswitch.com
 ---
   target-i386/cpu.c |9 +
   target-i386/cpu.h |2 ++
   2 files changed, 11 insertions(+), 0 deletions(-)
 
 diff --git a/target-i386/cpu.c b/target-i386/cpu.c
 index 451de12..48bdaf9 100644
 --- a/target-i386/cpu.c
 +++ b/target-i386/cpu.c
 @@ -1193,6 +1193,12 @@ static void x86_cpuid_set_hv_level(Object *obj, 
 Visitor *v, void *opaque,
   }
   #if !defined(CONFIG_USER_ONLY)
 +static void x86_set_hyperv(Object *obj, Error **errp)
 +{
 +object_property_set_int(obj, CPUID_HV_LEVEL_HYPERV,
 +hypervisor-level, errp);
 +}
 +
   static void x86_get_hv_spinlocks(Object *obj, Visitor *v, void *opaque,
const char *name, Error **errp)
   {
 @@ -1215,6 +1221,7 @@ static void x86_set_hv_spinlocks(Object *obj, 
 Visitor *v, void *opaque,
   return;
   }
   hyperv_set_spinlock_retries(value);
 +x86_set_hyperv(obj, errp);
   }
   static void x86_get_hv_relaxed(Object *obj, Visitor *v, void *opaque,
 @@ -1235,6 +1242,7 @@ static void x86_set_hv_relaxed(Object *obj, Visitor 
 *v, void *opaque,
   return;
   }
   hyperv_enable_relaxed_timing(value);
 +x86_set_hyperv(obj, errp);
   }
   }
   static void x86_get_hv_vapic(Object *obj, Visitor *v, void *opaque,
 @@ -1255,6 +1263,7 @@ static void x86_set_hv_vapic(Object *obj, Visitor 
 *v, void *opaque,
   return;
   }
   hyperv_enable_vapic_recommended(value);
 +x86_set_hyperv(obj, errp);
   }
   #endif
 diff --git a/target-i386/cpu.h b/target-i386/cpu.h
 index 1899f69..3152a4e 100644
 --- a/target-i386/cpu.h
 +++ b/target-i386/cpu.h
 @@ -488,6 +488,8 @@
   #define CPUID_VENDOR_VIA   CentaurHauls
 +#define CPUID_HV_LEVEL_HYPERV  0x4005
 +
 Where this comes from?
 
 http://msdn.microsoft.com/en-us/library/windows/hardware/ff542428%28v=vs.85%29.aspx
 
 has under Leaf 0x4000 (at very top of table):
 
 EAX
 
 The maximum input value for hypervisor CPUID information. For Microsoft
 hypervisors, this value will be at least 0x4005. The vendor ID
 signature should be used only for reporting and diagnostic purposes.
 
 Is that the same 0x4005 as in this patch?
 Yes, the #define can be reused:
 
 #define HYPERV_CPUID_MIN0x4005
 
 Not as simple as it seems.
 
 http://lists.nongnu.org/archive/html/qemu-devel/2012-09/msg03359.html
 
 I can make sure this info is part of the commit message.
 -Don Slutz

Ok add a copy but have CPUID_MIN in the name (because
CPUID_HV_LEVEL_HYPERV is very confusing). Add a comment 

/* maximum input value for h... 
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Qemu-devel] Using PCI config space to indicate config location

2012-10-09 Thread Gerd Hoffmann
  Hi,

 Why use two bars for this?  You can put them into one mmio bar, together
 with the msi-x vector table and PBA.  Of course a pci capability
 describing the location is helpful for that ;)
 
 You don't need a capability.  You can also just add a config offset
 field to the register set and then make the semantics that it occurs in
 the same region.

Yes, that will work too.  Removes some freedom to place the register
ranges, but given that we don't want burn bars and thus prefer to place
everything into a single mmio bar that shouldn't be an issue.  Real
hardware does this too btw (xhci for example).

 Main advantage of defining a register set with just isr is that it
 reduces pio address space consumtion for new virtio devices which don't
 have to worry about the legacy layout (8 bytes which is minimum size for
 io bars instead of 64 bytes).
 
 Doing some rough math, we should have at least 16k of PIO space.  That
 let's us have well over 500 virtio-pci devices with the current register
 layout.

I've seen worries nevertheless, but given we have virtio-scsi now which
can handle lots of disks without needing lots of virtio-pci devices it
is probably not that a big issue any more.

 The detection is simple: if BAR1 has non-zero length, it's new-style,
 otherwise legacy.

 Doesn't fly.  BAR1 is in use today for MSI-X support.
 
 But the location is specified via capabilities so we can change the
 location to be within BAR1 at a non-conflicting offset.

Sure.  Nevertheless BAR1 has non-zero length can't be used to detect
new-style virtio as old-style devices already have BAR1 with a non-zero
length.

So how about this:

(1) Add a vendor specific pci capability for new-style virtio.
Specifies the pci bar used for new-style virtio registers.
Guests can use it to figure whenever new-style virtio is
supported and to map the correct bar (which will probably
be bar 1 in most cases).

(2) Have virtio-offsets register set, located at the new-style bar,
offset 0:

struct virtio_offsets {
   __le32 num_offsets;// so we can extend this later
   __le32 virtio_pci_offset;  // location of virtio-pci registers
   __le32 virtio_cfg_offset;  // location of virtio config space
};

(3) place virtio-pci registers (same as 0..23 in bar 0) in the new-style
bar, offset virtio_pci_offset

(4) place virtio config space (same as 20+/24+ in bar 0) in the
new-style bar, offset virtio_cfg_offset

cheers,
  Gerd
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Qemu-devel] Using PCI config space to indicate config location

2012-10-09 Thread Jamie Lokier
Rusty Russell wrote:
 I don't think it'll be that bad; reset clears the device to unknown,
 bar0 moves it from unknown-legacy mode, bar1/2/3 changes it from
 unknown-modern mode, and anything else is bad (I prefer being strict so
 we catch bad implementations from the beginning).

Will that work, if the guest with kernel that uses modern mode, kexecs
to an older (but presumed reliable) kernel that only knows about legacy mode?

I.e. will the replacement kernel, or (ideally) replacement driver on
the rare occasion that is needed on a running kernel, be able to reset
the device hard enough?

-- Jamie
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [kvmarm] [PATCH v2 06/14] KVM: ARM: Memory virtualization setup

2012-10-09 Thread Min-gyu Kim


 -Original Message-
 From: Marc Zyngier [mailto:marc.zyng...@arm.com]
 Sent: Tuesday, October 09, 2012 9:56 PM
 To: Christoffer Dall
 Cc: Min-gyu Kim; 김창환; linux-arm-ker...@lists.infradead.org;
 kvm@vger.kernel.org; kvm...@lists.cs.columbia.edu
 Subject: Re: [kvmarm] [PATCH v2 06/14] KVM: ARM: Memory virtualization
 setup
 
 On Sat, 6 Oct 2012 17:33:43 -0400, Christoffer Dall
 c.d...@virtualopensystems.com wrote:
  On Thu, Oct 4, 2012 at 10:23 PM, Min-gyu Kim
  mingyu84@samsung.com
  wrote:
 
 
  -Original Message-
  From: kvm-ow...@vger.kernel.org [mailto:kvm-ow...@vger.kernel.org]
  On Behalf Of Christoffer Dall
  Sent: Monday, October 01, 2012 6:11 PM
  To: kvm@vger.kernel.org; linux-arm-ker...@lists.infradead.org;
  kvm...@lists.cs.columbia.edu
  Cc: Marc Zyngier
  Subject: [PATCH v2 06/14] KVM: ARM: Memory virtualization setup
 
  +static void stage2_set_pte(struct kvm *kvm, struct
 kvm_mmu_memory_cache
  *cache,
  +phys_addr_t addr, const pte_t *new_pte) {
  + pgd_t *pgd;
  + pud_t *pud;
  + pmd_t *pmd;
  + pte_t *pte, old_pte;
  +
  + /* Create 2nd stage page table mapping - Level 1 */
  + pgd = kvm-arch.pgd + pgd_index(addr);
  + pud = pud_offset(pgd, addr);
  + if (pud_none(*pud)) {
  + if (!cache)
  + return; /* ignore calls from kvm_set_spte_hva */
  + pmd = mmu_memory_cache_alloc(cache);
  + pud_populate(NULL, pud, pmd);
  + pmd += pmd_index(addr);
  + get_page(virt_to_page(pud));
  + } else
  + pmd = pmd_offset(pud, addr);
  +
  + /* Create 2nd stage page table mapping - Level 2 */
  + if (pmd_none(*pmd)) {
  + if (!cache)
  + return; /* ignore calls from kvm_set_spte_hva */
  + pte = mmu_memory_cache_alloc(cache);
  + clean_pte_table(pte);
  + pmd_populate_kernel(NULL, pmd, pte);
  + pte += pte_index(addr);
  + get_page(virt_to_page(pmd));
  + } else
  + pte = pte_offset_kernel(pmd, addr);
  +
  + /* Create 2nd stage page table mapping - Level 3 */
  + old_pte = *pte;
  + set_pte_ext(pte, *new_pte, 0);
  + if (pte_present(old_pte))
  + __kvm_tlb_flush_vmid(kvm);
  + else
  + get_page(virt_to_page(pte)); }
 
 
  I'm not sure about the 3-level page table, but isn't it necessary to
  clean the page table for 2nd level?
  There are two mmu_memory_cache_alloc calls. One has following
  clean_pte_table and the other doesn't have.
 
  hmm, it probably is - I couldn't really find the common case where
  this is done in the kernel normally (except for some custom loop in
  ioremap and idmap), but I added this fix:
 
  diff --git a/arch/arm/kvm/mmu.c b/arch/arm/kvm/mmu.c index
  5394a52..f11ba27f 100644
  --- a/arch/arm/kvm/mmu.c
  +++ b/arch/arm/kvm/mmu.c
  @@ -430,6 +430,7 @@ static void stage2_set_pte(struct kvm *kvm, struct
  kvm_mmu_memory_cache *cache,
  if (!cache)
  return; /* ignore calls from kvm_set_spte_hva */
  pmd = mmu_memory_cache_alloc(cache);
  +   clean_dcache_area(pmd, PTRS_PER_PMD * sizeof(pmd_t));
  pud_populate(NULL, pud, pmd);
  pmd += pmd_index(addr);
  get_page(virt_to_page(pud));
 
 There ought to be a test of ID_MMFR3[23:20] to find out whether or not it
 is useful to clean the dcache. Not sure if that's a massive gain, but it
 is certainly an optimisation to consider for the kernel as a whole.

That's part of what I was wondering. clean_dcache_area is substituted by nop if
TLB_CAN_READ_FROM_L1_CACHE is defined(mm/proc-v7.S). 
But I couldn't find any place where it is defined.
If it is part of bsp to enable TLB_CAN_RAD_FROM_L1_CACHE according to 
ID_MMFR3[23:20],
it would not be necessary to concerned about cleaning or not cleaning.

However, am I right?

 
 M.
 --
 Fast, cheap, reliable. Pick two.

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] vhost-blk: Add vhost-blk support v2

2012-10-09 Thread Asias He
Hello Christoph!

On 10/10/2012 01:39 AM, Christoph Hellwig wrote:
 +static int vhost_blk_req_submit(struct vhost_blk_req *req, struct file 
 *file)
 +{
 +
 +struct inode *inode = file-f_mapping-host;
 +struct block_device *bdev = inode-i_bdev;
 +int ret;
 
 Please just pass the block_device directly instead of a file struct.

vhost_blk_req_submit() can be used to handle file based image later.
Using the file interface will work for both cases.

I do need a check in vhost_blk_set_backend() to tell if the user passed
file is a raw device file for now.

 +
 +ret = vhost_blk_bio_make(req, bdev);
 +if (ret  0)
 +return ret;
 +
 +vhost_blk_bio_send(req);
 +
 +return ret;
 +}
 
 Then again how simple the this function is it probably should just go
 away entirely.

No, vhost_blk_req_submit() is used for both read and write ops. It makes
no sense to write the code twice. Plus, this function might be complexer
when the file based image support is added.

 +set_fs(USER_DS);
 
 What do we actually need the set_fs for here?

See this commit: d7ffde35e31a81100d2d0d2c4013cbf527bb32ea

 +
 +static inline void vhost_blk_stop(struct vhost_blk *blk, struct file **file)
 +{
 +
 +*file = vhost_blk_stop_vq(blk, blk-vq);
 +}
 
 What is the point of this helper?  Also I can't see anyone actually
 using the returned struct file.

It is used in both vhost_blk_reset_owner() and vhost_blk_release(). The
returned struct file is used for fput(). We have similar helper in
vhost_net: vhost_net_stop().


 +case VIRTIO_BLK_T_FLUSH:
 +ret = vfs_fsync(file, 1);
 +status = ret  0 ? VIRTIO_BLK_S_IOERR : VIRTIO_BLK_S_OK;
 +if (!vhost_blk_set_status(req, status))
 +vhost_add_used_and_signal(blk-dev, vq, head, ret);
 +break;
 
 Sending a fsync here is actually wrong in two different ways:
 
  a) it operates at the filesystem level instead of bio level
  b) it's a blocking operation
 
 It should instead send a REQ_FLUSH bio using the same submission scheme
 as the read/write requests.

Will fix this.

Thanks for the review!

-- 
Asias
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH RFC 1/2] kvm: Handle undercommitted guest case in PLE handler

2012-10-09 Thread Andrew Theurer
On Wed, 2012-10-10 at 00:21 +0530, Raghavendra K T wrote:
 * Avi Kivity a...@redhat.com [2012-10-04 17:00:28]:
 
  On 10/04/2012 03:07 PM, Peter Zijlstra wrote:
   On Thu, 2012-10-04 at 14:41 +0200, Avi Kivity wrote:
   
   Again the numbers are ridiculously high for arch_local_irq_restore.
   Maybe there's a bad perf/kvm interaction when we're injecting an
   interrupt, I can't believe we're spending 84% of the time running the
   popf instruction. 
   
   Smells like a software fallback that doesn't do NMI, hrtimer based
   sampling typically hits popf where we re-enable interrupts.
  
  Good nose, that's probably it.  Raghavendra, can you ensure that the PMU
  is properly exposed?  'dmesg' in the guest will tell.  If it isn't, -cpu
  host will expose it (and a good idea anyway to get best performance).
  
 
 Hi Avi, you are right. SandyBridge machine result was not proper.
 I cleaned up the services, enabled PMU, re-ran all the test again.
 
 Here is the summary:
 We do get good benefit by increasing ple window. Though we don't
 see good benefit for kernbench and sysbench, for ebizzy, we get huge
 improvement for 1x scenario. (almost 2/3rd of ple disabled case).
 
 Let me know if you think we can increase the default ple_window
 itself to 16k.
 
 I am experimenting with V2 version of undercommit improvement(this) patch
 series, But I think if you wish  to go for increase of
 default ple_window, then we would have to measure the benefit of patches
 when ple_window = 16k.
 
 I can respin the whole series including this default ple_window change.
 
 I also have the perf kvm top result for both ebizzy and kernbench.
 I think they are in expected lines now.
 
 Improvements
 
 
 16 core PLE machine with 16 vcpu guest
 
 base = 3.6.0-rc5 + ple handler optimization patches
 base_pleopt_16k = base + ple_window = 16k
 base_pleopt_32k = base + ple_window = 32k
 base_pleopt_nople = base + ple_gap = 0
 kernbench, hackbench, sysbench (time in sec lower is better)
 ebizzy (rec/sec higher is better)
 
 % improvements w.r.t base (ple_window = 4k)
 ---+---+-+---+
|base_pleopt_16k| base_pleopt_32k | base_pleopt_nople |
 ---+---+-+---+
 kernbench_1x   |  0.42371  |  1.15164|   0.09320 |
 kernbench_2x   | -1.40981  | -17.48282   |  -570.77053   |
 ---+---+-+---+
 sysbench_1x| -0.92367  | 0.24241 | -0.27027  |
 sysbench_2x| -2.22706  |-0.30896 | -1.27573  |
 sysbench_3x| -0.75509  | 0.09444 | -2.97756  |
 ---+---+-+---+
 ebizzy_1x  | 54.99976  | 67.29460|  74.14076 |
 ebizzy_2x  | -8.83386  |-27.38403| -96.22066 |
 ---+---+-+---+
 
 perf kvm top observation for kernbench and ebizzy (nople, 4k, 32k window) 
 

Is the perf data for 1x overcommit?

 pleopt   ple_gap=0
 
 ebizzy : 18131 records/s
 63.78%  [guest.kernel]  [g] _raw_spin_lock_irqsave
 5.65%  [guest.kernel]  [g] smp_call_function_many
 3.12%  [guest.kernel]  [g] clear_page
 3.02%  [guest.kernel]  [g] down_read_trylock
 1.85%  [guest.kernel]  [g] async_page_fault
 1.81%  [guest.kernel]  [g] up_read
 1.76%  [guest.kernel]  [g] native_apic_mem_write
 1.70%  [guest.kernel]  [g] find_vma

Does 'perf kvm top' not give host samples at the same time?  Would be
nice to see the host overhead as a function of varying ple window.  I
would expect that to be the major difference between 4/16/32k window
sizes.

A big concern I have (if this is 1x overcommit) for ebizzy is that it
has just terrible scalability to begin with.  I do not think we should
try to optimize such a bad workload.

 kernbench :Elapsed Time 29.4933 (27.6007)
5.72%  [guest.kernel]  [g] async_page_fault
 3.48%  [guest.kernel]  [g] pvclock_clocksource_read
 2.68%  [guest.kernel]  [g] copy_user_generic_unrolled
 2.58%  [guest.kernel]  [g] clear_page
 2.09%  [guest.kernel]  [g] page_cache_get_speculative
 2.00%  [guest.kernel]  [g] do_raw_spin_lock
 1.78%  [guest.kernel]  [g] unmap_single_vma
 1.74%  [guest.kernel]  [g] kmem_cache_alloc

 
 pleopt ple_window = 4k
 ---
 ebizzy: 10176 records/s
69.17%  [guest.kernel]  [g] _raw_spin_lock_irqsave
 3.34%  [guest.kernel]  [g] clear_page
 2.16%  [guest.kernel]  [g] down_read_trylock
 1.94%  [guest.kernel]  [g] async_page_fault
 1.89%  [guest.kernel]  [g] native_apic_mem_write
 1.63%  [guest.kernel]  [g] smp_call_function_many
 1.58%  [guest.kernel]  [g] SetPageLRU
 1.37%  [guest.kernel]  [g] up_read
 1.01%