Re: [Qemu-devel] Using PCI config space to indicate config location
Hi, Well, we also want to clean up the registers, so how about: BAR0: legacy, as is. If you access this, don't use the others. Ok. BAR1: new format virtio-pci layout. If you use this, don't use BAR0. BAR2: virtio-cfg. If you use this, don't use BAR0. Why use two bars for this? You can put them into one mmio bar, together with the msi-x vector table and PBA. Of course a pci capability describing the location is helpful for that ;) BAR3: ISR. If you use this, don't use BAR0. Again, I wouldn't hardcode that but use a capability. I prefer the cases exclusive (ie. use one or the other) as a clear path to remove the legacy layout; and leaving the ISR in BAR0 leaves us with an ugly corner case in future (ISR is BAR0 + 19? WTF?). Ok, so we have four register sets: (1) legacy layout (2) new virtio-pci (3) new virtio-config (4) new virtio-isr We can have a vendor pci capability, with a dword for each register set: bit 31-- present bit bits 26-24 -- bar bits 23-0 -- offset So current drivers which must support legacy can use this: legacy layout -- present, bar 0, offset 0 new virtio-pci-- present, bar 1, offset 0 new virtio-config -- present, bar 1, offset 256 new virtio-isr-- present, bar 0, offset 19 [ For completeness: msi-x capability could add this: ] msi-x vector tablebar 1, offset 512 msi-x pba bar 1, offset 768 We'll never remove legacy so we shouldn't plan on it. There are literally hundreds of thousands of VMs out there with the current virtio drivers installed in them. We'll be supporting them for a very, very long time :-) But new devices (virtio-qxl being a candidate) don't have old guests and don't need to worry. They could use this if they care about fast isr: legacy layout -- not present new virtio-pci-- present, bar 1, offset 0 new virtio-config -- present, bar 1, offset 256 new virtio-isr-- present, bar 0, offset 0 Or this if they don't worry about isr performance: legacy layout -- not present new virtio-pci-- present, bar 0, offset 0 new virtio-config -- present, bar 0, offset 256 new virtio-isr-- not present I don't think we gain a lot by moving the ISR into a separate BAR. Splitting up registers like that seems weird to me too. Main advantage of defining a register set with just isr is that it reduces pio address space consumtion for new virtio devices which don't have to worry about the legacy layout (8 bytes which is minimum size for io bars instead of 64 bytes). If we added an additional constraints that BAR1 was mirrored except for Why add constraints? We want something future-proof, don't we? The detection is simple: if BAR1 has non-zero length, it's new-style, otherwise legacy. Doesn't fly. BAR1 is in use today for MSI-X support. I agree that this is the best way to extend, but I think we should still use a transport feature bit. We want to be able to detect within QEMU whether a guest is using these new features because we need to adjust migration state accordingly. Why does migration need adjustments? [ Not that I want veto a feature bit, but I don't see the need yet ] cheers, Gerd -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/3] virtio-net: inline header support
Il 09/10/2012 06:59, Rusty Russell ha scritto: Paolo Bonzini pbonz...@redhat.com writes: Il 05/10/2012 07:43, Rusty Russell ha scritto: That's good. But virtio_blk's scsi command is insoluble AFAICT. As I said to Anthony, the best rules are always and never, so I'd really rather not have to grandfather that in. It is, but we can add a rule that if the (transport) flag VIRTIO_RING_F_ANY_HEADER_SG is set, the cdb field is always 32 bytes in virtio-blk. Could we do that? It's the cmd length I'm concerned about; is it always 32 in practice for some reason? It is always 32 or less except in very obscure cases that are pretty much confined to iSCSI. We don't care about the obscure cases, and the extra bytes don't hurt. BTW, 32 is the default cdb_size used by virtio-scsi. Currently qemu does: struct sg_io_hdr hdr; memset(hdr, 0, sizeof(struct sg_io_hdr)); hdr.interface_id = 'S'; hdr.cmd_len = req-elem.out_sg[1].iov_len; hdr.cmdp = req-elem.out_sg[1].iov_base; hdr.dxfer_len = 0; If it's a command which expects more output data, there's no way to guess where the boundary is between that command and the data. Yep, so I understood the problem right. Paolo -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] vhost-blk: Add vhost-blk support v2
vhost-blk is an in-kernel virito-blk device accelerator. Due to lack of proper in-kernel AIO interface, this version converts guest's I/O request to bio and use submit_bio() to submit I/O directly. So this version any supports raw block device as guest's disk image, e.g. /dev/sda, /dev/ram0. We can add file based image support to vhost-blk once we have in-kernel AIO interface. There are some work in progress for in-kernel AIO interface from Dave Kleikamp and Zach Brown: http://marc.info/?l=linux-fsdevelm=133312234313122 Performance evaluation: - 1) LKVM Fio with libaio ioengine on Fusion IO device using kvm tool IOPS Before After Improvement seq-read 107 121 +13.0% seq-write 130 179 +37.6% rnd-read 102 122 +19.6% rnd-write 125 159 +27.0% 2) QEMU Fio with libaio ioengine on Fusion IO device using QEMU IOPS Before After Improvement seq-read 76 123 +61.8% seq-write 139 173 +24.4% rnd-read 73 120 +64.3% rnd-write 75 156 +108.0% Userspace bits: - 1) LKVM The latest vhost-blk userspace bits for kvm tool can be found here: g...@github.com:asias/linux-kvm.git blk.vhost-blk 2) QEMU The latest vhost-blk userspace prototype for QEMU can be found here: g...@github.com:asias/qemu.git blk.vhost-blk Signed-off-by: Asias He as...@redhat.com --- drivers/vhost/Kconfig | 1 + drivers/vhost/Kconfig.blk | 10 + drivers/vhost/Makefile| 2 + drivers/vhost/blk.c | 641 ++ drivers/vhost/blk.h | 8 + 5 files changed, 662 insertions(+) create mode 100644 drivers/vhost/Kconfig.blk create mode 100644 drivers/vhost/blk.c create mode 100644 drivers/vhost/blk.h diff --git a/drivers/vhost/Kconfig b/drivers/vhost/Kconfig index 202bba6..acd8038 100644 --- a/drivers/vhost/Kconfig +++ b/drivers/vhost/Kconfig @@ -11,4 +11,5 @@ config VHOST_NET if STAGING source drivers/vhost/Kconfig.tcm +source drivers/vhost/Kconfig.blk endif diff --git a/drivers/vhost/Kconfig.blk b/drivers/vhost/Kconfig.blk new file mode 100644 index 000..ff8ab76 --- /dev/null +++ b/drivers/vhost/Kconfig.blk @@ -0,0 +1,10 @@ +config VHOST_BLK + tristate Host kernel accelerator for virtio blk (EXPERIMENTAL) + depends on BLOCK EXPERIMENTAL m + ---help--- + This kernel module can be loaded in host kernel to accelerate + guest block with virtio_blk. Not to be confused with virtio_blk + module itself which needs to be loaded in guest kernel. + + To compile this driver as a module, choose M here: the module will + be called vhost_blk. diff --git a/drivers/vhost/Makefile b/drivers/vhost/Makefile index a27b053..1a8a4a5 100644 --- a/drivers/vhost/Makefile +++ b/drivers/vhost/Makefile @@ -2,3 +2,5 @@ obj-$(CONFIG_VHOST_NET) += vhost_net.o vhost_net-y := vhost.o net.o obj-$(CONFIG_TCM_VHOST) += tcm_vhost.o +obj-$(CONFIG_VHOST_BLK) += vhost_blk.o +vhost_blk-y := blk.o diff --git a/drivers/vhost/blk.c b/drivers/vhost/blk.c new file mode 100644 index 000..6b2445a --- /dev/null +++ b/drivers/vhost/blk.c @@ -0,0 +1,641 @@ +/* + * Copyright (C) 2011 Taobao, Inc. + * Author: Liu Yuan tailai...@taobao.com + * + * Copyright (C) 2012 Red Hat, Inc. + * Author: Asias He as...@redhat.com + * + * This work is licensed under the terms of the GNU GPL, version 2. + * + * virtio-blk server in host kernel. + */ + +#include linux/miscdevice.h +#include linux/module.h +#include linux/vhost.h +#include linux/virtio_blk.h +#include linux/mutex.h +#include linux/file.h +#include linux/kthread.h +#include linux/blkdev.h + +#include vhost.c +#include vhost.h +#include blk.h + +#define BLK_HDR0 + +static DEFINE_IDA(vhost_blk_index_ida); + +enum { + VHOST_BLK_VQ_REQ = 0, + VHOST_BLK_VQ_MAX = 1, +}; + +struct req_page_list { + struct page **pages; + int pages_nr; +}; + +struct vhost_blk_req { + struct llist_node llnode; + struct req_page_list *pl; + struct vhost_blk *blk; + + struct iovec *iov; + int iov_nr; + + struct bio **bio; + atomic_t bio_nr; + + sector_t sector; + int write; + u16 head; + long len; + + u8 *status; +}; + +struct vhost_blk { + struct task_struct *host_kick; + struct vhost_blk_req *reqs; + struct vhost_virtqueue vq; + struct llist_head llhead; + struct vhost_dev dev; + u16 reqs_nr; + int index; +}; + +static inline int iov_num_pages(struct iovec *iov) +{ + return (PAGE_ALIGN((unsigned long)iov-iov_base + iov-iov_len) - + ((unsigned long)iov-iov_base PAGE_MASK)) PAGE_SHIFT; +} + +static int vhost_blk_setup(struct vhost_blk *blk) +{ + blk-reqs_nr = blk-vq.num; + + blk-reqs = kmalloc(sizeof(struct vhost_blk_req) * blk-reqs_nr, +
KVM call agenda for 2012-10-09
Hi Please send in any agenda topics that you have. Thanks, Juan. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Qemu-devel] Using PCI config space to indicate config location
On 10/09/2012 05:16 AM, Rusty Russell wrote: Anthony Liguori aligu...@us.ibm.com writes: We'll never remove legacy so we shouldn't plan on it. There are literally hundreds of thousands of VMs out there with the current virtio drivers installed in them. We'll be supporting them for a very, very long time :-) You will be supporting this for qemu on x86, sure. As I think we're still in the growth phase for virtio, I prioritize future spec cleanliness pretty high. If a pure ppc hypervisor was on the table, this might have been worthwhile. As it is the codebase is shared, and the Linux drivers are shared, so cleaning up the spec doesn't help the code. But I think you'll be surprised how fast this is deprecated: 1) Bigger queues for block devices (guest-specified ringsize) 2) Smaller rings for openbios (guest-specified alignment) 3) All-mmio mode (powerpc) 4) Whatever network features get numbers 31. I don't think we gain a lot by moving the ISR into a separate BAR. Splitting up registers like that seems weird to me too. Confused. I proposed the same split as you have, just ISR by itself. I believe Anthony objects to having the ISR by itself. What is the motivation for that? -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] Enabling IA32_TSC_ADJUST for guest VM
On 10/08/2012 07:30 PM, Marcelo Tosatti wrote: From Intel's manual: • If an execution of WRMSR to the IA32_TIME_STAMP_COUNTER MSR adds (or subtracts) value X from the TSC, the logical processor also adds (or subtracts) value X from the IA32_TSC_ADJUST MSR. This is not handled in the patch. To support migration, it will be necessary to differentiate between guest initiated and userspace-model initiated msr write. That is, only guest initiated TSC writes should affect the value of IA32_TSC_ADJUST MSR. Avi, any better idea? I think we need that anyway, since there are some read-only MSRs that need to be configured by the host (nvmx capabilities). So if we add that feature it will be useful elsewhere. I don't think it's possible to do it in any other way: Local offset value of the IA32_TSC for a logical processor. Reset value is Zero. A write to IA32_TSC will modify the local offset in IA32_TSC_ADJUST and the content of IA32_TSC, but does not affect the internal invariant TSC hardware. What we want to do is affect the internal invariant TSC hardware, so we can't do that through the normal means. btw, will tsc writes from userspace (after live migration) cause tsc skew? If so we should think how to model a guest-wide tsc. -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: kvm-kmod 3.6 on linux 3.2.0
On 10/04/2012 02:14 PM, Peter Lieven wrote: Hi, kvm-kmod 3.6 fails to compile against a 3.2.0 kernel with the following error: /usr/src/kvm-kmod-3.6/x86/x86.c: In function ‘get_msr_mce’: /usr/src/kvm-kmod-3.6/x86/x86.c:1908:27: error: ‘kvm’ undeclared (first use in this function) /usr/src/kvm-kmod-3.6/x86/x86.c:1908:27: note: each undeclared identifier is reported only once for each function it appears in Any ideas? Out of curiosity, why are you using kvm-kmod 3.6 but not Linux 3.6? kvm-kmod is useful for embedded configurations where the main kernel has to be fixed for some reason, but for general use cases it's better to just use a matched set. -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [kvmarm] [PATCH v2 06/14] KVM: ARM: Memory virtualization setup
On Sat, 6 Oct 2012 17:33:43 -0400, Christoffer Dall c.d...@virtualopensystems.com wrote: On Thu, Oct 4, 2012 at 10:23 PM, Min-gyu Kim mingyu84@samsung.com wrote: -Original Message- From: kvm-ow...@vger.kernel.org [mailto:kvm-ow...@vger.kernel.org] On Behalf Of Christoffer Dall Sent: Monday, October 01, 2012 6:11 PM To: kvm@vger.kernel.org; linux-arm-ker...@lists.infradead.org; kvm...@lists.cs.columbia.edu Cc: Marc Zyngier Subject: [PATCH v2 06/14] KVM: ARM: Memory virtualization setup +static void stage2_set_pte(struct kvm *kvm, struct kvm_mmu_memory_cache *cache, +phys_addr_t addr, const pte_t *new_pte) { + pgd_t *pgd; + pud_t *pud; + pmd_t *pmd; + pte_t *pte, old_pte; + + /* Create 2nd stage page table mapping - Level 1 */ + pgd = kvm-arch.pgd + pgd_index(addr); + pud = pud_offset(pgd, addr); + if (pud_none(*pud)) { + if (!cache) + return; /* ignore calls from kvm_set_spte_hva */ + pmd = mmu_memory_cache_alloc(cache); + pud_populate(NULL, pud, pmd); + pmd += pmd_index(addr); + get_page(virt_to_page(pud)); + } else + pmd = pmd_offset(pud, addr); + + /* Create 2nd stage page table mapping - Level 2 */ + if (pmd_none(*pmd)) { + if (!cache) + return; /* ignore calls from kvm_set_spte_hva */ + pte = mmu_memory_cache_alloc(cache); + clean_pte_table(pte); + pmd_populate_kernel(NULL, pmd, pte); + pte += pte_index(addr); + get_page(virt_to_page(pmd)); + } else + pte = pte_offset_kernel(pmd, addr); + + /* Create 2nd stage page table mapping - Level 3 */ + old_pte = *pte; + set_pte_ext(pte, *new_pte, 0); + if (pte_present(old_pte)) + __kvm_tlb_flush_vmid(kvm); + else + get_page(virt_to_page(pte)); +} I'm not sure about the 3-level page table, but isn't it necessary to clean the page table for 2nd level? There are two mmu_memory_cache_alloc calls. One has following clean_pte_table and the other doesn't have. hmm, it probably is - I couldn't really find the common case where this is done in the kernel normally (except for some custom loop in ioremap and idmap), but I added this fix: diff --git a/arch/arm/kvm/mmu.c b/arch/arm/kvm/mmu.c index 5394a52..f11ba27f 100644 --- a/arch/arm/kvm/mmu.c +++ b/arch/arm/kvm/mmu.c @@ -430,6 +430,7 @@ static void stage2_set_pte(struct kvm *kvm, struct kvm_mmu_memory_cache *cache, if (!cache) return; /* ignore calls from kvm_set_spte_hva */ pmd = mmu_memory_cache_alloc(cache); + clean_dcache_area(pmd, PTRS_PER_PMD * sizeof(pmd_t)); pud_populate(NULL, pud, pmd); pmd += pmd_index(addr); get_page(virt_to_page(pud)); There ought to be a test of ID_MMFR3[23:20] to find out whether or not it is useful to clean the dcache. Not sure if that's a massive gain, but it is certainly an optimisation to consider for the kernel as a whole. M. -- Fast, cheap, reliable. Pick two. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: KVM call agenda for 2012-10-09
Juan Quintela quint...@redhat.com wrote: Hi Please send in any agenda topics that you have. Hi As there is no agenda, call gets cancelled. Sorry for sending the request for agenda late. Later, Juan. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Qemu-devel] Using PCI config space to indicate config location
Rusty Russell ru...@rustcorp.com.au writes: Anthony Liguori aligu...@us.ibm.com writes: We'll never remove legacy so we shouldn't plan on it. There are literally hundreds of thousands of VMs out there with the current virtio drivers installed in them. We'll be supporting them for a very, very long time :-) You will be supporting this for qemu on x86, sure. And PPC. As I think we're still in the growth phase for virtio, I prioritize future spec cleanliness pretty high. But I think you'll be surprised how fast this is deprecated: 1) Bigger queues for block devices (guest-specified ringsize) 2) Smaller rings for openbios (guest-specified alignment) 3) All-mmio mode (powerpc) 4) Whatever network features get numbers 31. We can do all of these things with incremental change to the existing layout. That's the only way what I'm suggesting is different. You want to reorder all of the fields and have a driver flag day. But I strongly suspect we'll decide we need to do the same exercise again in 4 years when we now need to figure out how to take advantage of transactional memory or some other whiz-bang hardware feature. There are a finite number of BARs but each BAR has an almost infinite size. So extending BARs instead of introducing new one seems like the conservative approach moving forward. I don't think we gain a lot by moving the ISR into a separate BAR. Splitting up registers like that seems weird to me too. Confused. I proposed the same split as you have, just ISR by itself. I disagree with moving the ISR into a separate BAR. That's what seems weird to me. It's very normal to have a mirrored set of registers that are PIO in one bar and MMIO in a different BAR. If we added an additional constraints that BAR1 was mirrored except for the config space and the MSI section was always there, I think the end result would be nice. IOW: But it won't be the same, because we want all that extra stuff, like more feature bits and queue size alignment. (Admittedly queues past 16TB aren't a killer feature). To make it concrete: Current: struct { __le32 host_features; /* read-only */ __le32 guest_features; /* read/write */ __le32 queue_pfn; /* read/write */ __le16 queue_size; /* read-only */ __le16 queue_sel; /* read/write */ __le16 queue_notify;/* read/write */ u8 status; /* read/write */ u8 isr; /* read-only, clear on read */ /* Optional */ __le16 msi_config_vector; /* read/write */ __le16 msi_queue_vector;/* read/write */ /* ... device features */ }; Proposed: struct virtio_pci_cfg { /* About the whole device. */ __le32 device_feature_select; /* read-write */ __le32 device_feature; /* read-only */ __le32 guest_feature_select;/* read-write */ __le32 guest_feature; /* read-only */ __le16 msix_config; /* read-write */ __u8 device_status; /* read-write */ __u8 unused; /* About a specific virtqueue. */ __le16 queue_select;/* read-write */ __le16 queue_align; /* read-write, power of 2. */ __le16 queue_size; /* read-write, power of 2. */ __le16 queue_msix_vector;/* read-write */ __le64 queue_address; /* read-write: 0x == DNE. */ }; struct virtio_pci_isr { __u8 isr; /* read-only, clear on read */ }; What I'm suggesting is: struct { __le32 host_features; /* read-only */ __le32 guest_features; /* read/write */ __le32 queue_pfn; /* read/write */ __le16 queue_size; /* read-only */ __le16 queue_sel; /* read/write */ __le16 queue_notify;/* read/write */ u8 status; /* read/write */ u8 isr; /* read-only, clear on read */ __le16 msi_config_vector; /* read/write */ __le16 msi_queue_vector;/* read/write */ __le32 host_feature_select; /* read/write */ __le32 guest_feature_select;/* read/write */ __le32 queue_pfn_hi;/* read/write */ }; With the additional semantic that the virtio-config space is overlayed on top of the register set in BAR0 unless the VIRTIO_PCI_F_SEPARATE_CONFIG feature is acknowledged. This feature acts as a latch and when set, removes the config space overlay. If the config space overlays the registers, the offset in BAR0 of the overlay depends on whether MSI is enabled or not in the PCI device. BAR1 is an MMIO mirror of BAR0 except that the config space is never overlayed in BAR1 regardless of VIRTIO_PCI_F_SEPARATE_CONFIG. BAR2 contains the config space. A guest can look at BAR1 and BAR2 to determine whether they exist. We could also enforce LE in the per-device config space
Re: Proposal for virtio standardization.
On Thu, 27 Sep 2012 09:59:33 +0930 Rusty Russell ru...@rustcorp.com.au wrote: Hi all, I've had several requests for a more formal approach to the virtio draft spec, and (after some soul-searching) I'd like to try that. The proposal is to use OASIS as the standards body, as it's fairly light-weight as these things go. For me this means paperwork and setting up a Working Group and getting the right people involved as Voting members starting with the current contributors; for most of you it just means a new mailing list, though I'll be cross-posting any drafts and major changes here anyway. I believe that a documented standard (aka virtio 1.0) will increase visibility and adoption in areas outside our normal linux/kvm universe. There's been some of that already, but this is the clearest path to accelerate it. Not the easiest path, but I believe that a solid I/O standard is a Good Thing for everyone. Yet I also want to decouple new and experimental development from the standards effort; running code comes first. New feature bits and new device numbers should be reservable without requiring a full spec change. So the essence of my proposal is: 1) I start a Working Group within OASIS where we can aim for virtio spec 1.0. 2) The current spec is textually reordered so the core is clearly bus-independent, with PCI, mmio, etc appendices. 3) Various clarifications, formalizations and cleanups to the spec text, and possibly elimination of old deprecated features. 4) The only significant change to the spec is that we use PCI capabilities, so we can have infinite feature bits. (see http://lists.linuxfoundation.org/pipermail/virtualization/2011-December/019198.html) Infinite only applies to virtio-pci, no? 5) Changes to the ring layout and other such things are deferred to a future virtio version; whether this is done within OASIS or externally depends on how well this works for the 1.0 release. Thoughts? Rusty. Sounds like a good idea. I'll be happy to review the spec with an eye to virtio-ccw. Cornelia -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[QEMU PATCH] i386: cpu: add missing CPUID[EAX=7,ECX=0] flag names
This makes QEMU recognize the following CPU flag names: Flags| Corresponding KVM kernel commit -+ FSGSBASE | 176f61da82435eae09cc96f70b530d1ba0746b8b AVX2, BMI1, BMI2 | fb215366b3c7320ac25dca766a0152df16534932 HLE, RTM | 83c529151ab0d4a813e3f6a3e293fff75d468519 INVPCID | ad756a1603c5fac207758faaac7f01c34c9d0b7b ERMS | a01c8f9b4e266df1d7166d23216f2060648f862d Signed-off-by: Eduardo Habkost ehabk...@redhat.com --- target-i386/cpu.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/target-i386/cpu.c b/target-i386/cpu.c index f3708e6..b012372 100644 --- a/target-i386/cpu.c +++ b/target-i386/cpu.c @@ -105,8 +105,8 @@ static const char *svm_feature_name[] = { }; static const char *cpuid_7_0_ebx_feature_name[] = { -NULL, NULL, NULL, NULL, NULL, NULL, NULL, smep, -NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, +fsgsbase, NULL, NULL, bmi1, hle, avx2, NULL, smep, +bmi2, erms, invpcid, rtm, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, smap, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, }; -- 1.7.11.4 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Qemu-devel] Using PCI config space to indicate config location
Avi Kivity a...@redhat.com writes: On 10/09/2012 05:16 AM, Rusty Russell wrote: Anthony Liguori aligu...@us.ibm.com writes: We'll never remove legacy so we shouldn't plan on it. There are literally hundreds of thousands of VMs out there with the current virtio drivers installed in them. We'll be supporting them for a very, very long time :-) You will be supporting this for qemu on x86, sure. As I think we're still in the growth phase for virtio, I prioritize future spec cleanliness pretty high. If a pure ppc hypervisor was on the table, this might have been worthwhile. As it is the codebase is shared, and the Linux drivers are shared, so cleaning up the spec doesn't help the code. Note that distros have been (perhaps unknowingly) shipping virtio-pci for PPC for some time now. So even though there wasn't a hypervisor that supported virtio-pci, the guests already support it and are out there in the wild. There's a lot of value in maintaining legacy support even for PPC. But I think you'll be surprised how fast this is deprecated: 1) Bigger queues for block devices (guest-specified ringsize) 2) Smaller rings for openbios (guest-specified alignment) 3) All-mmio mode (powerpc) 4) Whatever network features get numbers 31. I don't think we gain a lot by moving the ISR into a separate BAR. Splitting up registers like that seems weird to me too. Confused. I proposed the same split as you have, just ISR by itself. I believe Anthony objects to having the ISR by itself. What is the motivation for that? Right, BARs are a precious resource not to be spent lightly. Having an entire BAR dedicated to a 1-byte register seems like a waste to me. Regards, Anthony Liguori -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] Enabling IA32_TSC_ADJUST for guest VM
On Tue, Oct 09, 2012 at 02:12:18PM +0200, Avi Kivity wrote: On 10/08/2012 07:30 PM, Marcelo Tosatti wrote: From Intel's manual: • If an execution of WRMSR to the IA32_TIME_STAMP_COUNTER MSR adds (or subtracts) value X from the TSC, the logical processor also adds (or subtracts) value X from the IA32_TSC_ADJUST MSR. This is not handled in the patch. To support migration, it will be necessary to differentiate between guest initiated and userspace-model initiated msr write. That is, only guest initiated TSC writes should affect the value of IA32_TSC_ADJUST MSR. Avi, any better idea? I think we need that anyway, since there are some read-only MSRs that need to be configured by the host (nvmx capabilities). So if we add that feature it will be useful elsewhere. I don't think it's possible to do it in any other way: Local offset value of the IA32_TSC for a logical processor. Reset value is Zero. A write to IA32_TSC will modify the local offset in IA32_TSC_ADJUST and the content of IA32_TSC, but does not affect the internal invariant TSC hardware. What we want to do is affect the internal invariant TSC hardware, so we can't do that through the normal means. btw, will tsc writes from userspace (after live migration) cause tsc skew? If so we should think how to model a guest-wide tsc. No because there is an easy shortcut: if (level == KVM_PUT_FULL_STATE) { /* * KVM is yet unable to synchronize TSC values of multiple VCPUs * on * writeback. Until this is fixed, we only write the offset to * SMP * guests after migration, desynchronizing the VCPUs, but * avoiding * huge jump-backs that would occur without any writeback at * all. */ if (smp_cpus == 1 || env-tsc != 0) { kvm_msr_entry_set(msrs[n++], MSR_IA32_TSC, env-tsc); } } -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] Enabling IA32_TSC_ADJUST for guest VM
On 10/09/2012 04:24 PM, Marcelo Tosatti wrote: On Tue, Oct 09, 2012 at 02:12:18PM +0200, Avi Kivity wrote: On 10/08/2012 07:30 PM, Marcelo Tosatti wrote: From Intel's manual: • If an execution of WRMSR to the IA32_TIME_STAMP_COUNTER MSR adds (or subtracts) value X from the TSC, the logical processor also adds (or subtracts) value X from the IA32_TSC_ADJUST MSR. This is not handled in the patch. To support migration, it will be necessary to differentiate between guest initiated and userspace-model initiated msr write. That is, only guest initiated TSC writes should affect the value of IA32_TSC_ADJUST MSR. Avi, any better idea? I think we need that anyway, since there are some read-only MSRs that need to be configured by the host (nvmx capabilities). So if we add that feature it will be useful elsewhere. I don't think it's possible to do it in any other way: Local offset value of the IA32_TSC for a logical processor. Reset value is Zero. A write to IA32_TSC will modify the local offset in IA32_TSC_ADJUST and the content of IA32_TSC, but does not affect the internal invariant TSC hardware. What we want to do is affect the internal invariant TSC hardware, so we can't do that through the normal means. btw, will tsc writes from userspace (after live migration) cause tsc skew? If so we should think how to model a guest-wide tsc. No because there is an easy shortcut: if (level == KVM_PUT_FULL_STATE) { /* * KVM is yet unable to synchronize TSC values of multiple VCPUs * on * writeback. Until this is fixed, we only write the offset to * SMP * guests after migration, desynchronizing the VCPUs, but * avoiding * huge jump-backs that would occur without any writeback at * all. */ if (smp_cpus == 1 || env-tsc != 0) { kvm_msr_entry_set(msrs[n++], MSR_IA32_TSC, env-tsc); } } Still we write back after migration. So this needs to be fixed (or I misunderstood you). -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] Enabling IA32_TSC_ADJUST for guest VM
On Tue, Oct 09, 2012 at 04:26:32PM +0200, Avi Kivity wrote: On 10/09/2012 04:24 PM, Marcelo Tosatti wrote: On Tue, Oct 09, 2012 at 02:12:18PM +0200, Avi Kivity wrote: On 10/08/2012 07:30 PM, Marcelo Tosatti wrote: From Intel's manual: • If an execution of WRMSR to the IA32_TIME_STAMP_COUNTER MSR adds (or subtracts) value X from the TSC, the logical processor also adds (or subtracts) value X from the IA32_TSC_ADJUST MSR. This is not handled in the patch. To support migration, it will be necessary to differentiate between guest initiated and userspace-model initiated msr write. That is, only guest initiated TSC writes should affect the value of IA32_TSC_ADJUST MSR. Avi, any better idea? I think we need that anyway, since there are some read-only MSRs that need to be configured by the host (nvmx capabilities). So if we add that feature it will be useful elsewhere. I don't think it's possible to do it in any other way: Local offset value of the IA32_TSC for a logical processor. Reset value is Zero. A write to IA32_TSC will modify the local offset in IA32_TSC_ADJUST and the content of IA32_TSC, but does not affect the internal invariant TSC hardware. What we want to do is affect the internal invariant TSC hardware, so we can't do that through the normal means. btw, will tsc writes from userspace (after live migration) cause tsc skew? If so we should think how to model a guest-wide tsc. No because there is an easy shortcut: if (level == KVM_PUT_FULL_STATE) { /* * KVM is yet unable to synchronize TSC values of multiple VCPUs * on * writeback. Until this is fixed, we only write the offset to * SMP * guests after migration, desynchronizing the VCPUs, but * avoiding * huge jump-backs that would occur without any writeback at * all. */ if (smp_cpus == 1 || env-tsc != 0) { kvm_msr_entry_set(msrs[n++], MSR_IA32_TSC, env-tsc); } } Still we write back after migration. So this needs to be fixed (or I misunderstood you). Handled by kvm_write_tsc() in x86.c. Is this what you mean? -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] Enabling IA32_TSC_ADJUST for guest VM
On 10/09/2012 04:27 PM, Marcelo Tosatti wrote: On Tue, Oct 09, 2012 at 04:26:32PM +0200, Avi Kivity wrote: On 10/09/2012 04:24 PM, Marcelo Tosatti wrote: On Tue, Oct 09, 2012 at 02:12:18PM +0200, Avi Kivity wrote: On 10/08/2012 07:30 PM, Marcelo Tosatti wrote: From Intel's manual: • If an execution of WRMSR to the IA32_TIME_STAMP_COUNTER MSR adds (or subtracts) value X from the TSC, the logical processor also adds (or subtracts) value X from the IA32_TSC_ADJUST MSR. This is not handled in the patch. To support migration, it will be necessary to differentiate between guest initiated and userspace-model initiated msr write. That is, only guest initiated TSC writes should affect the value of IA32_TSC_ADJUST MSR. Avi, any better idea? I think we need that anyway, since there are some read-only MSRs that need to be configured by the host (nvmx capabilities). So if we add that feature it will be useful elsewhere. I don't think it's possible to do it in any other way: Local offset value of the IA32_TSC for a logical processor. Reset value is Zero. A write to IA32_TSC will modify the local offset in IA32_TSC_ADJUST and the content of IA32_TSC, but does not affect the internal invariant TSC hardware. What we want to do is affect the internal invariant TSC hardware, so we can't do that through the normal means. btw, will tsc writes from userspace (after live migration) cause tsc skew? If so we should think how to model a guest-wide tsc. No because there is an easy shortcut: if (level == KVM_PUT_FULL_STATE) { /* * KVM is yet unable to synchronize TSC values of multiple VCPUs * on * writeback. Until this is fixed, we only write the offset to * SMP * guests after migration, desynchronizing the VCPUs, but * avoiding * huge jump-backs that would occur without any writeback at * all. */ if (smp_cpus == 1 || env-tsc != 0) { kvm_msr_entry_set(msrs[n++], MSR_IA32_TSC, env-tsc); } } Still we write back after migration. So this needs to be fixed (or I misunderstood you). Handled by kvm_write_tsc() in x86.c. Is this what you mean? It will generate a call to -write_tsc_offset(). Will the values be the same for all vcpus? Note the inputs won't be the same. -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2 3/4] s390/kvm: Add a channel I/O based virtio transport driver.
On Wed, 19 Sep 2012 18:38:38 +0200 Alexander Graf ag...@suse.de wrote: On 04.09.2012, at 17:13, Cornelia Huck wrote: +static u32 virtio_ccw_get_features(struct virtio_device *vdev) +{ + struct virtio_ccw_device *vcdev = to_vc_device(vdev); + struct virtio_feature_desc features; + int ret; + + /* Read the feature bits from the host. */ + /* TODO: Features 32 bits */ + features.index = 0; + vcdev-ccw.cmd_code = CCW_CMD_READ_FEAT; + vcdev-ccw.flags = 0; + vcdev-ccw.count = sizeof(features); + vcdev-ccw.cda = vcdev-area; + ret = ccw_io_helper(vcdev, VIRTIO_CCW_DOING_READ_FEAT); + if (ret) + return 0; + + memcpy(features, (void *)(unsigned long)vcdev-area, + sizeof(features)); + return le32_to_cpu(features.features); The fact that the features are LE is missing from the spec, right? You're right, I thought it was there. Alex -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2 5/5] [HACK] Handle multiple virtio aliases.
On Thu, 20 Sep 2012 09:27:00 -0500 Anthony Liguori aligu...@us.ibm.com wrote: Cornelia Huck cornelia.h...@de.ibm.com writes: This patch enables using both virtio-xxx-s390 and virtio-xxx-ccw by making the alias lookup code verify that a driver is actually registered. (Only included in order to allow testing of virtio-ccw; should be replaced by cleaning up the virtio bus model.) Not-signed-off-by: Cornelia Huck cornelia.h...@de.ibm.com No more aliases. Just drop the whole thing. The whole alias stuff I put in there was just to be able to test virtio-ccw. I agree that we should use the cleaned-up virtio driver model that had been proposed for virtio-mmio earlier; I just want to make sure the virtio-ccw interface is sane first. Regards, Anthony Liguori -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Question] Live migration - normal process migration
Il 09/10/2012 16:36, Grzegorz Dwornicki ha scritto: I have a question to you guys. Is it possible to use code from live migration of KVM VMs to migrate other process? No, because this is hardware migration. It migrates the whole machine including the hardware state (e.g. the video card, or the USB controller). I wish to create a tool to convert systems on physical servers to KVM virtual machines. I wish to make some tests/experiments with this part of the code to migrate working system to newly created VMs in running state. Something like that is done by OpenVZ containers. For offline migration of physical servers to KVM virtual machines, look at virt-p2v (http://libguestfs.org/virt-v2v/). Paolo My goal is to create something similar to VMWare Converter but this one will be opensource :). I am at the begining of this project. I'm reading parts of the kernel code. But only thosse that could be usefull in this hard task. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Question] Live migration - normal process migration
On Tue, Oct 9, 2012 at 9:36 AM, Grzegorz Dwornicki gd1...@gmail.com wrote: I have a question to you guys. Is it possible to use code from live migration of KVM VMs to migrate other process? As far as I can tell, no. most of the virtualization facililites of KVM are implemented in the kernel, but they're managed by a 'normal' process (sometimes called qemu-kvm, sometimes just kvm). Is this userspace process that implements it's own migration feature. to do that, one running qemu-kvm (proc A) instance connects with a just-started instance (proc B). proc A first sends all the configuration information, so B sets itself identically. then all RAM data is transferred (iteratively, to 'catch up' with the still-running VM). finally, proc A suspends the running VM, finishes the last modified RAM data and all the CPU state. now proc B is identical to A, and both are suspended, and then, proc B is unsuspended and notifies proc A, which just exits. if you want to do that for arbitrary 'normal' processes, you'd have to implement that in the kernel; but KVM doesn't do that. As Paolo mentions, there are other projects that do similar things. OpenVZ is one; LXC does it too. I'd start with LXC, since it's a standard part of the kernel (just like KVM), and supported by many common utilities. -- Javier -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Qemu-devel] Using PCI config space to indicate config location
Gerd Hoffmann kra...@redhat.com writes: Hi, Well, we also want to clean up the registers, so how about: BAR0: legacy, as is. If you access this, don't use the others. Ok. BAR1: new format virtio-pci layout. If you use this, don't use BAR0. BAR2: virtio-cfg. If you use this, don't use BAR0. Why use two bars for this? You can put them into one mmio bar, together with the msi-x vector table and PBA. Of course a pci capability describing the location is helpful for that ;) You don't need a capability. You can also just add a config offset field to the register set and then make the semantics that it occurs in the same region. BAR3: ISR. If you use this, don't use BAR0. Again, I wouldn't hardcode that but use a capability. I prefer the cases exclusive (ie. use one or the other) as a clear path to remove the legacy layout; and leaving the ISR in BAR0 leaves us with an ugly corner case in future (ISR is BAR0 + 19? WTF?). Ok, so we have four register sets: (1) legacy layout (2) new virtio-pci (3) new virtio-config (4) new virtio-isr We can have a vendor pci capability, with a dword for each register set: bit 31-- present bit bits 26-24 -- bar bits 23-0 -- offset So current drivers which must support legacy can use this: legacy layout -- present, bar 0, offset 0 new virtio-pci-- present, bar 1, offset 0 new virtio-config -- present, bar 1, offset 256 new virtio-isr-- present, bar 0, offset 19 [ For completeness: msi-x capability could add this: ] msi-x vector tablebar 1, offset 512 msi-x pba bar 1, offset 768 We'll never remove legacy so we shouldn't plan on it. There are literally hundreds of thousands of VMs out there with the current virtio drivers installed in them. We'll be supporting them for a very, very long time :-) But new devices (virtio-qxl being a candidate) don't have old guests and don't need to worry. They could use this if they care about fast isr: legacy layout -- not present new virtio-pci-- present, bar 1, offset 0 new virtio-config -- present, bar 1, offset 256 new virtio-isr-- present, bar 0, offset 0 Or this if they don't worry about isr performance: legacy layout -- not present new virtio-pci-- present, bar 0, offset 0 new virtio-config -- present, bar 0, offset 256 new virtio-isr-- not present I don't think we gain a lot by moving the ISR into a separate BAR. Splitting up registers like that seems weird to me too. Main advantage of defining a register set with just isr is that it reduces pio address space consumtion for new virtio devices which don't have to worry about the legacy layout (8 bytes which is minimum size for io bars instead of 64 bytes). Doing some rough math, we should have at least 16k of PIO space. That let's us have well over 500 virtio-pci devices with the current register layout. I don't think we're at risk of running out of space... If we added an additional constraints that BAR1 was mirrored except for Why add constraints? We want something future-proof, don't we? The detection is simple: if BAR1 has non-zero length, it's new-style, otherwise legacy. Doesn't fly. BAR1 is in use today for MSI-X support. But the location is specified via capabilities so we can change the location to be within BAR1 at a non-conflicting offset. I agree that this is the best way to extend, but I think we should still use a transport feature bit. We want to be able to detect within QEMU whether a guest is using these new features because we need to adjust migration state accordingly. Why does migration need adjustments? Because there is additional state in the new layout. We need to understand whether a guest is relying on that state or not. For instance, extended virtio features. If a guest is in the process of reading extended virtio features, it may not have changed any state but we must ensure that we don't migrate to an older verison of QEMU w/o the extended virtio features. This cannot be handled by subsections today because there is no guest written state that's affected. Regards, Anthony Liguori [ Not that I want veto a feature bit, but I don't see the need yet ] cheers, Gerd -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH] Enabling IA32_TSC_ADJUST for guest VM
I am just testing the second version of this patch. It addresses all the comments so far except Marcelo's issue with breaking the function compute_guest_tsc(). I needed to put the call for updating the TSC_ADJUST_MSR in kvm_write_tsc() to ensure it is only called from user space. Other changes added to vmcs offset should not be tracked in TSC_ADJUST_MSR. I had some trouble with the order of initialization during live migration. TSC_ADJUST is initialized first but then wiped out by multiple initializations of tsc. The fix for this is to not update TSC_ADJUST if the vmcs offset is not actually changing with the tsc write. So, after migration outcome is that vmcs offset gets defined independent from the migrating value of TSC_ADJUST. I believe this is what we want to happen. Thanks, Will -Original Message- From: Avi Kivity [mailto:a...@redhat.com] Sent: Tuesday, October 09, 2012 5:12 AM To: Marcelo Tosatti Cc: Auld, Will; kvm@vger.kernel.org; Zhang, Xiantao Subject: Re: [PATCH] Enabling IA32_TSC_ADJUST for guest VM On 10/08/2012 07:30 PM, Marcelo Tosatti wrote: From Intel's manual: • If an execution of WRMSR to the IA32_TIME_STAMP_COUNTER MSR adds (or subtracts) value X from the TSC, the logical processor also adds (or subtracts) value X from the IA32_TSC_ADJUST MSR. This is not handled in the patch. To support migration, it will be necessary to differentiate between guest initiated and userspace-model initiated msr write. That is, only guest initiated TSC writes should affect the value of IA32_TSC_ADJUST MSR. Avi, any better idea? I think we need that anyway, since there are some read-only MSRs that need to be configured by the host (nvmx capabilities). So if we add that feature it will be useful elsewhere. I don't think it's possible to do it in any other way: Local offset value of the IA32_TSC for a logical processor. Reset value is Zero. A write to IA32_TSC will modify the local offset in IA32_TSC_ADJUST and the content of IA32_TSC, but does not affect the internal invariant TSC hardware. What we want to do is affect the internal invariant TSC hardware, so we can't do that through the normal means. btw, will tsc writes from userspace (after live migration) cause tsc skew? If so we should think how to model a guest-wide tsc. -- error compiling committee.c: too many arguments to function N�r��yb�X��ǧv�^�){.n�+h����ܨ}���Ơz�j:+v���zZ+��+zf���h���~i���z��w���?��)ߢf
Re: [PATCH] Enabling IA32_TSC_ADJUST for guest VM
On Tue, Oct 09, 2012 at 04:30:28PM +0200, Avi Kivity wrote: On 10/09/2012 04:27 PM, Marcelo Tosatti wrote: On Tue, Oct 09, 2012 at 04:26:32PM +0200, Avi Kivity wrote: On 10/09/2012 04:24 PM, Marcelo Tosatti wrote: On Tue, Oct 09, 2012 at 02:12:18PM +0200, Avi Kivity wrote: On 10/08/2012 07:30 PM, Marcelo Tosatti wrote: From Intel's manual: • If an execution of WRMSR to the IA32_TIME_STAMP_COUNTER MSR adds (or subtracts) value X from the TSC, the logical processor also adds (or subtracts) value X from the IA32_TSC_ADJUST MSR. This is not handled in the patch. To support migration, it will be necessary to differentiate between guest initiated and userspace-model initiated msr write. That is, only guest initiated TSC writes should affect the value of IA32_TSC_ADJUST MSR. Avi, any better idea? I think we need that anyway, since there are some read-only MSRs that need to be configured by the host (nvmx capabilities). So if we add that feature it will be useful elsewhere. I don't think it's possible to do it in any other way: Local offset value of the IA32_TSC for a logical processor. Reset value is Zero. A write to IA32_TSC will modify the local offset in IA32_TSC_ADJUST and the content of IA32_TSC, but does not affect the internal invariant TSC hardware. What we want to do is affect the internal invariant TSC hardware, so we can't do that through the normal means. btw, will tsc writes from userspace (after live migration) cause tsc skew? If so we should think how to model a guest-wide tsc. No because there is an easy shortcut: if (level == KVM_PUT_FULL_STATE) { /* * KVM is yet unable to synchronize TSC values of multiple VCPUs * on * writeback. Until this is fixed, we only write the offset to * SMP * guests after migration, desynchronizing the VCPUs, but * avoiding * huge jump-backs that would occur without any writeback at * all. */ if (smp_cpus == 1 || env-tsc != 0) { kvm_msr_entry_set(msrs[n++], MSR_IA32_TSC, env-tsc); } } Still we write back after migration. So this needs to be fixed (or I misunderstood you). Handled by kvm_write_tsc() in x86.c. Is this what you mean? It will generate a call to -write_tsc_offset(). Will the values be the same for all vcpus? Note the inputs won't be the same. Yes: * Special case: TSC write with a small delta (1 second) of * virtual * cycle time against real time is interpreted as an attempt to * synchronize the CPU. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v6 03/16] target-i386: Add cpu object access routines for Hypervisor level.
On Mon, Sep 24, 2012 at 10:32:05AM -0400, Don Slutz wrote: These are modeled after x86_cpuid_get_xlevel and x86_cpuid_set_xlevel. Signed-off-by: Don Slutz d...@cloudswitch.com --- target-i386/cpu.c | 29 + 1 files changed, 29 insertions(+), 0 deletions(-) diff --git a/target-i386/cpu.c b/target-i386/cpu.c index 25ca986..451de12 100644 --- a/target-i386/cpu.c +++ b/target-i386/cpu.c @@ -1166,6 +1166,32 @@ static void x86_cpuid_set_tsc_freq(Object *obj, Visitor *v, void *opaque, cpu-env.tsc_khz = value / 1000; } +static void x86_cpuid_get_hv_level(Object *obj, Visitor *v, void *opaque, +const char *name, Error **errp) +{ +X86CPU *cpu = X86_CPU(obj); + +visit_type_uint32(v, cpu-env.cpuid_hv_level, name, errp); +} + +static void x86_cpuid_set_hv_level(Object *obj, Visitor *v, void *opaque, +const char *name, Error **errp) +{ +X86CPU *cpu = X86_CPU(obj); +uint32_t value; + +visit_type_uint32(v, value, name, errp); +if (error_is_set(errp)) { +return; +} + +if (value != 0 value 0x4000) { +value += 0x4000; +} Whats the purpose of this? Adds ambiguity. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v6 04/16] target-i386: Add x86_set_hyperv.
On Mon, Sep 24, 2012 at 10:32:06AM -0400, Don Slutz wrote: This is used to set the cpu object's hypervisor level to the default for Microsoft's Hypervisor. Signed-off-by: Don Slutz d...@cloudswitch.com --- target-i386/cpu.c |9 + target-i386/cpu.h |2 ++ 2 files changed, 11 insertions(+), 0 deletions(-) diff --git a/target-i386/cpu.c b/target-i386/cpu.c index 451de12..48bdaf9 100644 --- a/target-i386/cpu.c +++ b/target-i386/cpu.c @@ -1193,6 +1193,12 @@ static void x86_cpuid_set_hv_level(Object *obj, Visitor *v, void *opaque, } #if !defined(CONFIG_USER_ONLY) +static void x86_set_hyperv(Object *obj, Error **errp) +{ +object_property_set_int(obj, CPUID_HV_LEVEL_HYPERV, +hypervisor-level, errp); +} + static void x86_get_hv_spinlocks(Object *obj, Visitor *v, void *opaque, const char *name, Error **errp) { @@ -1215,6 +1221,7 @@ static void x86_set_hv_spinlocks(Object *obj, Visitor *v, void *opaque, return; } hyperv_set_spinlock_retries(value); +x86_set_hyperv(obj, errp); } static void x86_get_hv_relaxed(Object *obj, Visitor *v, void *opaque, @@ -1235,6 +1242,7 @@ static void x86_set_hv_relaxed(Object *obj, Visitor *v, void *opaque, return; } hyperv_enable_relaxed_timing(value); +x86_set_hyperv(obj, errp); } static void x86_get_hv_vapic(Object *obj, Visitor *v, void *opaque, @@ -1255,6 +1263,7 @@ static void x86_set_hv_vapic(Object *obj, Visitor *v, void *opaque, return; } hyperv_enable_vapic_recommended(value); +x86_set_hyperv(obj, errp); } #endif diff --git a/target-i386/cpu.h b/target-i386/cpu.h index 1899f69..3152a4e 100644 --- a/target-i386/cpu.h +++ b/target-i386/cpu.h @@ -488,6 +488,8 @@ #define CPUID_VENDOR_VIA CentaurHauls +#define CPUID_HV_LEVEL_HYPERV 0x4005 + Where this comes from? http://msdn.microsoft.com/en-us/library/windows/hardware/ff542428%28v=vs.85%29.aspx has under Leaf 0x4000 (at very top of table): EAX The maximum input value for hypervisor CPUID information. For Microsoft hypervisors, this value will be at least 0x4005. The vendor ID signature should be used only for reporting and diagnostic purposes. Is that the same 0x4005 as in this patch? -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Qemu-devel] [QEMU PATCH] i386: cpu: add missing CPUID[EAX=7, ECX=0] flag names
On 10/09/12 10:03, Eduardo Habkost wrote: This makes QEMU recognize the following CPU flag names: Flags| Corresponding KVM kernel commit -+ FSGSBASE | 176f61da82435eae09cc96f70b530d1ba0746b8b AVX2, BMI1, BMI2 | fb215366b3c7320ac25dca766a0152df16534932 HLE, RTM | 83c529151ab0d4a813e3f6a3e293fff75d468519 INVPCID | ad756a1603c5fac207758faaac7f01c34c9d0b7b ERMS | a01c8f9b4e266df1d7166d23216f2060648f862d Signed-off-by: Eduardo Habkost ehabk...@redhat.com --- target-i386/cpu.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/target-i386/cpu.c b/target-i386/cpu.c index f3708e6..b012372 100644 --- a/target-i386/cpu.c +++ b/target-i386/cpu.c @@ -105,8 +105,8 @@ static const char *svm_feature_name[] = { }; static const char *cpuid_7_0_ebx_feature_name[] = { -NULL, NULL, NULL, NULL, NULL, NULL, NULL, smep, -NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, +fsgsbase, NULL, NULL, bmi1, hle, avx2, NULL, smep, +bmi2, erms, invpcid, rtm, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, smap, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, }; Reviewed-by: Don Slutz d...@cloudswitch.com -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH v3 06/19] Implement -dimm command line option
Hi, sorry for the delayed answer. On Sat, Sep 29, 2012 at 11:13:04AM +, Blue Swirl wrote: The -dimm option is supposed to specify the dimm/memory layout, and not create any devices. If we don't want this new option, I have a question: A -device/device_add means we create a new qdev device at startup or as a hotplug operation respectively. So, the semantics of -device dimm,id=dimm0,size=512M,node=0,populated=on are clear to me. What does -device dimm,populated=off mean from a qdev perspective? There are 2 alternatives: - The device is created on the dimmbus, but is not used/populated yet. Than the activation/acpi-hotplug of the dimm may require a separate command (we used to have dimm_add in versions 3). device_add handling always hotplugs a new qdev device, so this wouldn't fit this usecase, because the device already exists. In this case, the actual acpi hotplug operation is decoupled from qdev device creation. The bus exists but the devices do not, device_add would add DIMMs to the bus. This matches PCI bus created by the host bridge and PCI device hotplug. A more complex setup would be dimm bus, dimm slot devices and DIMM devices. The intermediate slot device would contain one DIMM device if plugged. interesting, I haven't thought about this alternative. It does sounds overly complex, but a dimmslot / dimmdevice splitup could consolidate hotplug semantic differences between populated=on/off. Something similar to the dimmslot device is already present in v3 (dimmcfg structure), but it's not a qdev visible device. I 'd rather avoid the complication, but i might revisit this idea. - The dimmdevice is not created when -device dimm,populated=off (this would require some ugly checking in normal -device argument handling). Only the dimm layout is saved. The hotplug is triggered from a normal device_add later. So in this case, the acpi hotplug happens at the same time as the qdev hotplug. Do you see a simpler alternative without introducing a new option? Using the -dimm option follows the second semantic and avoids changing the -device semantics. Dimm layout description is decoupled from dimmdevice creation, and qdev hotplug coincides with acpi hotplug. Maybe even the dimmbus device shouldn't exist by itself after all, or it should be pretty much invisible to users. On real HW, the memory controller or south bridge handles the memory. For i440fx, it's part of the same chipset. So I think we should just add qdev properties to i440fx to specify the sizes, nodes etc. Then i440fx should create the dimmbus device unconditionally using the properties. The default properties should create a sane configuration, otherwise -global i440fx.dimm_size=512M etc. could be used. Then the bus would be populated as before or with device_add. hmm the problem with using only i440fx properties, is that size/nodes look dimm specific to me, not chipset-memcontroller specific. Unless we only allow uniform size dimms. Is it possible to have a dynamic list of sizes/nodes pairs as properties of a qdev device? Also if there is no dimmbus, and instead we have only links from i440fx to dimm-devices, would the current qdev hotplug API be enough? I am currently leaning towards this: i440fx unconditionally creates the dimmbus. Users don't have to specify the bus (i assume this is what you mean by dimmbus should be invisible to the users) We only use -device dimm to describe dimms. With -device dimm,populated=off, only the dimm config layout will be saved in the dimmbus. The hotplug is triggered from a normal device_add later (same as pci hotplug). thanks, - Vasilis -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v6 04/16] target-i386: Add x86_set_hyperv.
On Tue, Oct 09, 2012 at 01:34:09PM -0300, Marcelo Tosatti wrote: On Mon, Sep 24, 2012 at 10:32:06AM -0400, Don Slutz wrote: This is used to set the cpu object's hypervisor level to the default for Microsoft's Hypervisor. Signed-off-by: Don Slutz d...@cloudswitch.com --- target-i386/cpu.c |9 + target-i386/cpu.h |2 ++ 2 files changed, 11 insertions(+), 0 deletions(-) diff --git a/target-i386/cpu.c b/target-i386/cpu.c index 451de12..48bdaf9 100644 --- a/target-i386/cpu.c +++ b/target-i386/cpu.c @@ -1193,6 +1193,12 @@ static void x86_cpuid_set_hv_level(Object *obj, Visitor *v, void *opaque, } #if !defined(CONFIG_USER_ONLY) +static void x86_set_hyperv(Object *obj, Error **errp) +{ +object_property_set_int(obj, CPUID_HV_LEVEL_HYPERV, +hypervisor-level, errp); +} + static void x86_get_hv_spinlocks(Object *obj, Visitor *v, void *opaque, const char *name, Error **errp) { @@ -1215,6 +1221,7 @@ static void x86_set_hv_spinlocks(Object *obj, Visitor *v, void *opaque, return; } hyperv_set_spinlock_retries(value); +x86_set_hyperv(obj, errp); } static void x86_get_hv_relaxed(Object *obj, Visitor *v, void *opaque, @@ -1235,6 +1242,7 @@ static void x86_set_hv_relaxed(Object *obj, Visitor *v, void *opaque, return; } hyperv_enable_relaxed_timing(value); +x86_set_hyperv(obj, errp); } static void x86_get_hv_vapic(Object *obj, Visitor *v, void *opaque, @@ -1255,6 +1263,7 @@ static void x86_set_hv_vapic(Object *obj, Visitor *v, void *opaque, return; } hyperv_enable_vapic_recommended(value); +x86_set_hyperv(obj, errp); } #endif diff --git a/target-i386/cpu.h b/target-i386/cpu.h index 1899f69..3152a4e 100644 --- a/target-i386/cpu.h +++ b/target-i386/cpu.h @@ -488,6 +488,8 @@ #define CPUID_VENDOR_VIA CentaurHauls +#define CPUID_HV_LEVEL_HYPERV 0x4005 + Where this comes from? http://msdn.microsoft.com/en-us/library/windows/hardware/ff542428%28v=vs.85%29.aspx has under Leaf 0x4000 (at very top of table): EAX The maximum input value for hypervisor CPUID information. For Microsoft hypervisors, this value will be at least 0x4005. The vendor ID signature should be used only for reporting and diagnostic purposes. Is that the same 0x4005 as in this patch? Yes, the #define can be reused: #define HYPERV_CPUID_MIN0x4005 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v6 05/16] target-i386: Use Hypervisor level in -machine pc,accel=kvm.
On Mon, Sep 24, 2012 at 10:32:07AM -0400, Don Slutz wrote: Also known as Paravirtualization level. This change is based on: Microsoft Hypervisor CPUID Leaves: http://msdn.microsoft.com/en-us/library/windows/hardware/ff542428%28v=vs.85%29.aspx Linux kernel change starts with: http://fixunix.com/kernel/538707-use-cpuid-communicate-hypervisor.html Also: http://lkml.indiana.edu/hypermail/linux/kernel/1205.0/00100.html VMware documention on CPUIDs (Mechanisms to determine if software is running in a VMware virtual machine): http://kb.vmware.com/selfservice/microsites/search.do?language=en_UScmd=displayKCexternalId=1009458 QEMU knows this is KVM_CPUID_SIGNATURE (0x4000). Signed-off-by: Don Slutz d...@cloudswitch.com --- target-i386/kvm.c |4 ++-- 1 files changed, 2 insertions(+), 2 deletions(-) diff --git a/target-i386/kvm.c b/target-i386/kvm.c index 895d848..8462c75 100644 --- a/target-i386/kvm.c +++ b/target-i386/kvm.c @@ -389,12 +389,12 @@ int kvm_arch_init_vcpu(CPUX86State *env) c = cpuid_data.entries[cpuid_i++]; memset(c, 0, sizeof(*c)); c-function = KVM_CPUID_SIGNATURE -if (!hyperv_enabled()) { +if (!env-cpuid_hv_level_set) { memcpy(signature, KVMKVMKVM\0\0\0, 12); c-eax = 0; } else { memcpy(signature, Microsoft Hv, 12); -c-eax = HYPERV_CPUID_MIN; +c-eax = env-cpuid_hv_level; This breaks hyperv_enabled() checks. Don, are you certain it is worthwhile to make this configurable? Can you explain why, under your scenario, it is worthwhile? Because these are separate problems: - Fake VMWare hypervisor (which seems to be your main goal). - Make CPUID HV leafs configurable via command line. My point is that the CPUIDs must be carefully constructed, that i miss the point why making them configurable is desired. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] vhost-blk: Add vhost-blk support v2
+static int vhost_blk_req_submit(struct vhost_blk_req *req, struct file *file) +{ + + struct inode *inode = file-f_mapping-host; + struct block_device *bdev = inode-i_bdev; + int ret; Please just pass the block_device directly instead of a file struct. + + ret = vhost_blk_bio_make(req, bdev); + if (ret 0) + return ret; + + vhost_blk_bio_send(req); + + return ret; +} Then again how simple the this function is it probably should just go away entirely. + set_fs(USER_DS); What do we actually need the set_fs for here? + +static inline void vhost_blk_stop(struct vhost_blk *blk, struct file **file) +{ + + *file = vhost_blk_stop_vq(blk, blk-vq); +} What is the point of this helper? Also I can't see anyone actually using the returned struct file. + case VIRTIO_BLK_T_FLUSH: + ret = vfs_fsync(file, 1); + status = ret 0 ? VIRTIO_BLK_S_IOERR : VIRTIO_BLK_S_OK; + if (!vhost_blk_set_status(req, status)) + vhost_add_used_and_signal(blk-dev, vq, head, ret); + break; Sending a fsync here is actually wrong in two different ways: a) it operates at the filesystem level instead of bio level b) it's a blocking operation It should instead send a REQ_FLUSH bio using the same submission scheme as the read/write requests. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PULL stable-0.15] Stable-0.15 queue for qemu-kvm
Hello Marcelo, Here's a couple of backports for your stable-0.15 branch. Except for one (marked as backported) these were all clean cherry-picks. My proposal is to merge these KVM-only patches before qemu-stable-0.15.git, where I will be tagging v0.15.2 shortly. Cc: Marcelo Tosatti mtosa...@redhat.com Cc: Avi Kivity a...@redhat.com Cc: Anthony Liguori anth...@codemonkey.ws Cc: Bruce Rogers brog...@suse.com The following changes since commit 725ba81ec6812d7eeb69be92639eb81151b5306e: Merge remote branch 'upstream/stable-0.15' into stable-0.15 (2011-10-19 11:54:48 -0200) are available in the git repository at: git://repo.or.cz/qemu/afaerber.git qemu-kvm-stable-0.15 for you to fetch changes up to b2be0429795b18c018610d48142e797cbc31be0d: pci-assign: Remove bogus PCIe lnkcap wmask setting (2012-10-09 19:03:59 +0200) Alex Williamson (4): pci-assign: Fix PCI_EXP_FLAGS_TYPE shift pci-assign: Fix PCIe lnkcap pci-assign: Harden I/O port test pci-assign: Remove bogus PCIe lnkcap wmask setting Jan Kiszka (1): pci-assign: Update legacy interrupts only if used Lai Jiangshan (1): qemu-kvm: fix improper nmi emulation hw/apic.c | 33 + hw/apic.h |1 + hw/device-assignment.c | 30 +++--- monitor.c |6 +- 4 Dateien geändert, 54 Zeilen hinzugefügt(+), 16 Zeilen entfernt(-) -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH stable-0.15 1/6] qemu-kvm: fix improper nmi emulation
From: Lai Jiangshan la...@cn.fujitsu.com Currently, NMI interrupt is blindly sent to all the vCPUs when NMI button event happens. This doesn't properly emulate real hardware on which NMI button event triggers LINT1. Because of this, NMI is sent to the processor even when LINT1 is maskied in LVT. For example, this causes the problem that kdump initiated by NMI sometimes doesn't work on KVM, because kdump assumes NMI is masked on CPUs other than CPU0. With this patch, inject-nmi request is handled as follows. - When in-kernel irqchip is disabled, deliver LINT1 instead of NMI interrupt. - When in-kernel irqchip is enabled, get the in-kernel LAPIC states and test the APIC_LVT_MASKED, if LINT1 is unmasked, and then delivering the NMI directly. (Suggested by Jan Kiszka) Changed from old version: re-implement it by the Jan's suggestion. fix the race found by Jan. Signed-off-by: Lai Jiangshan la...@cn.fujitsu.com Reported-by: Kenji Kaneshige kaneshige.ke...@jp.fujitsu.com Signed-off-by: Avi Kivity a...@redhat.com (cherry picked from commit 67feec6ed854b3618b37ccf050b90192cbb96e0f) Signed-off-by: Andreas Färber afaer...@suse.de --- hw/apic.c | 33 + hw/apic.h |1 + monitor.c |6 +- 3 Dateien geändert, 39 Zeilen hinzugefügt(+), 1 Zeile entfernt(-) diff --git a/hw/apic.c b/hw/apic.c index a45b57f..243900d 100644 --- a/hw/apic.c +++ b/hw/apic.c @@ -204,6 +204,39 @@ void apic_deliver_pic_intr(DeviceState *d, int level) } } +static inline uint32_t kapic_reg(struct kvm_lapic_state *kapic, int reg_id); + +static void kvm_irqchip_deliver_nmi(void *p) +{ +APICState *s = p; +struct kvm_lapic_state klapic; +uint32_t lvt; + +kvm_get_lapic(s-cpu_env, klapic); +lvt = kapic_reg(klapic, 0x32 + APIC_LVT_LINT1); + +if (lvt APIC_LVT_MASKED) { +return; +} + +if (((lvt 8) 7) != APIC_DM_NMI) { +return; +} + +kvm_vcpu_ioctl(s-cpu_env, KVM_NMI); +} + +void apic_deliver_nmi(DeviceState *d) +{ +APICState *s = DO_UPCAST(APICState, busdev.qdev, d); + +if (kvm_irqchip_in_kernel()) { +run_on_cpu(s-cpu_env, kvm_irqchip_deliver_nmi, s); +} else { +apic_local_deliver(s, APIC_LVT_LINT1); +} +} + #define foreach_apic(apic, deliver_bitmask, code) \ {\ int __i, __j, __mask;\ diff --git a/hw/apic.h b/hw/apic.h index c857d52..3a4be0a 100644 --- a/hw/apic.h +++ b/hw/apic.h @@ -10,6 +10,7 @@ void apic_deliver_irq(uint8_t dest, uint8_t dest_mode, uint8_t trigger_mode); int apic_accept_pic_intr(DeviceState *s); void apic_deliver_pic_intr(DeviceState *s, int level); +void apic_deliver_nmi(DeviceState *d); int apic_get_interrupt(DeviceState *s); void apic_reset_irq_delivered(void); int apic_get_irq_delivered(void); diff --git a/monitor.c b/monitor.c index 7680929..6af0673 100644 --- a/monitor.c +++ b/monitor.c @@ -2604,7 +2604,11 @@ static int do_inject_nmi(Monitor *mon, const QDict *qdict, QObject **ret_data) CPUState *env; for (env = first_cpu; env != NULL; env = env-next_cpu) { -cpu_interrupt(env, CPU_INTERRUPT_NMI); +if (!env-apic_state) { +cpu_interrupt(env, CPU_INTERRUPT_NMI); +} else { +apic_deliver_nmi(env-apic_state); +} } return 0; -- 1.7.10.4 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH stable-0.15 2/6] pci-assign: Fix PCI_EXP_FLAGS_TYPE shift
From: Alex Williamson alex.william...@redhat.com Coverity found that we're doing (uint16_t)type 0xf0 8. This is obviously always 0x0, so our attempt to filter out some device types thinks everything is an endpoint. Fix shift amount. Signed-off-by: Alex Williamson alex.william...@redhat.com Signed-off-by: Avi Kivity a...@redhat.com (cherry picked from commit b4eccd18591f3d639bc3c923e299b3c1241a0b3f) Signed-off-by: Andreas Färber afaer...@suse.de --- hw/device-assignment.c |2 +- 1 Datei geändert, 1 Zeile hinzugefügt(+), 1 Zeile entfernt(-) diff --git a/hw/device-assignment.c b/hw/device-assignment.c index 177daa4..dc41bfd 100644 --- a/hw/device-assignment.c +++ b/hw/device-assignment.c @@ -1459,7 +1459,7 @@ static int assigned_device_pci_cap_init(PCIDevice *pci_dev) } type = pci_get_word(pci_dev-config + pos + PCI_EXP_FLAGS); -type = (type PCI_EXP_FLAGS_TYPE) 8; +type = (type PCI_EXP_FLAGS_TYPE) 4; if (type != PCI_EXP_TYPE_ENDPOINT type != PCI_EXP_TYPE_LEG_END type != PCI_EXP_TYPE_RC_END) { fprintf(stderr, -- 1.7.10.4 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH stable-0.15 3/6] pci-assign: Fix PCIe lnkcap
From: Alex Williamson alex.william...@redhat.com Another Coverity found issue, lnkcap is a 32bit register and we're masking bits 16 17. Fix to uin32_t. Signed-off-by: Alex Williamson alex.william...@redhat.com Signed-off-by: Avi Kivity a...@redhat.com (cherry picked from commit 21f5a19a10c8f6a10d79a415bf640de85acede78) [AF: Backported] Signed-off-by: Andreas Färber afaer...@suse.de --- hw/device-assignment.c |8 1 Datei geändert, 4 Zeilen hinzugefügt(+), 4 Zeilen entfernt(-) diff --git a/hw/device-assignment.c b/hw/device-assignment.c index dc41bfd..c4c2535 100644 --- a/hw/device-assignment.c +++ b/hw/device-assignment.c @@ -1420,8 +1420,8 @@ static int assigned_device_pci_cap_init(PCIDevice *pci_dev) if ((pos = pci_find_cap_offset(pci_dev, PCI_CAP_ID_EXP, 0))) { uint8_t version, size; -uint16_t type, devctl, lnkcap, lnksta; -uint32_t devcap; +uint16_t type, devctl, lnksta; +uint32_t devcap, lnkcap; version = pci_get_byte(pci_dev-config + pos + PCI_EXP_FLAGS); version = PCI_EXP_FLAGS_VERS; @@ -1491,11 +1491,11 @@ static int assigned_device_pci_cap_init(PCIDevice *pci_dev) pci_set_word(pci_dev-config + pos + PCI_EXP_DEVSTA, 0); /* Link capabilities, expose links and latencues, clear reporting */ -lnkcap = pci_get_word(pci_dev-config + pos + PCI_EXP_LNKCAP); +lnkcap = pci_get_long(pci_dev-config + pos + PCI_EXP_LNKCAP); lnkcap = (PCI_EXP_LNKCAP_SLS | PCI_EXP_LNKCAP_MLW | PCI_EXP_LNKCAP_ASPMS | PCI_EXP_LNKCAP_L0SEL | PCI_EXP_LNKCAP_L1EL); -pci_set_word(pci_dev-config + pos + PCI_EXP_LNKCAP, lnkcap); +pci_set_long(pci_dev-config + pos + PCI_EXP_LNKCAP, lnkcap); pci_set_word(pci_dev-wmask + pos + PCI_EXP_LNKCAP, PCI_EXP_LNKCTL_ASPMC | PCI_EXP_LNKCTL_RCB | PCI_EXP_LNKCTL_CCC | PCI_EXP_LNKCTL_ES | -- 1.7.10.4 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH stable-0.15 5/6] pci-assign: Update legacy interrupts only if used
From: Jan Kiszka jan.kis...@siemens.com Don't mess with assign_intx on devices that are in MSI or MSI-X mode, it would corrupt their interrupt routing. Signed-off-by: Jan Kiszka jan.kis...@siemens.com Signed-off-by: Marcelo Tosatti mtosa...@redhat.com (cherry picked from commit 096392efe1e5ee670f880c96c31f7ea8d6d76cf4) Signed-off-by: Andreas Färber afaer...@suse.de --- hw/device-assignment.c |9 ++--- 1 Datei geändert, 6 Zeilen hinzugefügt(+), 3 Zeilen entfernt(-) diff --git a/hw/device-assignment.c b/hw/device-assignment.c index d586ce4..43029a4 100644 --- a/hw/device-assignment.c +++ b/hw/device-assignment.c @@ -1062,9 +1062,12 @@ void assigned_dev_update_irqs(void) dev = QLIST_FIRST(devs); while (dev) { next = QLIST_NEXT(dev, next); -r = assign_irq(dev); -if (r 0) -qdev_unplug(dev-dev.qdev); +if (dev-irq_requested_type KVM_DEV_IRQ_HOST_INTX) { +r = assign_irq(dev); +if (r 0) { +qdev_unplug(dev-dev.qdev); +} +} dev = next; } } -- 1.7.10.4 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH stable-0.15 4/6] pci-assign: Harden I/O port test
From: Alex Williamson alex.william...@redhat.com Markus Armbruster points out that we're missing a 0 check from pread while trying to probe for pci-sysfs io-port resource support. We don't expect a short read, but we should harden the test to abort if we get one so we're not potentially looking at a stale errno. Signed-off-by: Alex Williamson alex.william...@redhat.com Signed-off-by: Avi Kivity a...@redhat.com (cherry picked from commit 1d1c8a498b7ce5c5636f1014f7ad18aa4e1acc0a) Signed-off-by: Andreas Färber afaer...@suse.de --- hw/device-assignment.c |5 +++-- 1 Datei geändert, 3 Zeilen hinzugefügt(+), 2 Zeilen entfernt(-) diff --git a/hw/device-assignment.c b/hw/device-assignment.c index c4c2535..d586ce4 100644 --- a/hw/device-assignment.c +++ b/hw/device-assignment.c @@ -618,8 +618,9 @@ static int assigned_dev_register_regions(PCIRegion *io_regions, * kernels return EIO. New kernels only allow 1/2/4 byte reads * so should return EINVAL for a 3 byte read */ ret = pread(pci_dev-v_addrs[i].region-resource_fd, val, 3, 0); -if (ret == 3) { -fprintf(stderr, I/O port resource supports 3 byte read?!\n); +if (ret = 0) { +fprintf(stderr, Unexpected return from I/O port read: %d\n, +ret); abort(); } else if (errno != EINVAL) { fprintf(stderr, Using raw in/out ioport access (sysfs - %s)\n, -- 1.7.10.4 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH stable-0.15 6/6] pci-assign: Remove bogus PCIe lnkcap wmask setting
From: Alex Williamson alex.william...@redhat.com All the fields of lnkcap are read-only and this is setting it with mask values from LNKCTL. Just below it, we indicate link control is read only, so this appears to be a stray chunk left in from development. Trivial comment fix while we're here. Signed-off-by: Alex Williamson alex.william...@redhat.com Signed-off-by: Avi Kivity a...@redhat.com (cherry picked from commit 0cbb68b1ce40e9b7e0b8cea5fd849f5c6bd09aee) Signed-off-by: Andreas Färber afaer...@suse.de --- hw/device-assignment.c |6 +- 1 Datei geändert, 1 Zeile hinzugefügt(+), 5 Zeilen entfernt(-) diff --git a/hw/device-assignment.c b/hw/device-assignment.c index 43029a4..3908144 100644 --- a/hw/device-assignment.c +++ b/hw/device-assignment.c @@ -1481,7 +1481,7 @@ static int assigned_device_pci_cap_init(PCIDevice *pci_dev) pci_set_long(pci_dev-config + pos + PCI_EXP_DEVCAP, devcap); /* device control: clear all error reporting enable bits, leaving - * leaving only a few host values. Note, these are + * only a few host values. Note, these are * all writable, but not passed to hw. */ devctl = pci_get_word(pci_dev-config + pos + PCI_EXP_DEVCTL); @@ -1500,10 +1500,6 @@ static int assigned_device_pci_cap_init(PCIDevice *pci_dev) PCI_EXP_LNKCAP_ASPMS | PCI_EXP_LNKCAP_L0SEL | PCI_EXP_LNKCAP_L1EL); pci_set_long(pci_dev-config + pos + PCI_EXP_LNKCAP, lnkcap); -pci_set_word(pci_dev-wmask + pos + PCI_EXP_LNKCAP, - PCI_EXP_LNKCTL_ASPMC | PCI_EXP_LNKCTL_RCB | - PCI_EXP_LNKCTL_CCC | PCI_EXP_LNKCTL_ES | - PCI_EXP_LNKCTL_CLKREQ_EN | PCI_EXP_LNKCTL_HAWD); /* Link control, pass existing read-only copy. Should be writable? */ -- 1.7.10.4 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [kvmarm] [PATCH v2 12/14] KVM: ARM: VFP userspace interface
On 1 October 2012 10:11, Christoffer Dall c.d...@virtualopensystems.com wrote: From: Rusty Russell rusty.russ...@linaro.org --- a/Documentation/virtual/kvm/api.txt +++ b/Documentation/virtual/kvm/api.txt @@ -1765,6 +1765,12 @@ ARM 64-bit CP15 registers have the following id bit patterns: ARM CCSIDR registers are demultiplexed by CSSELR value: 0x4002 0011 00 csselr:8 +ARM 32-bit VFP control registers have the following id bit patterns: + 0x4002 0012 1 selector:3 + +ARM 64-bit FP registers have the following id bit patterns: + 0x4002 0012 0 selector:3 + In the other entries in this section, foo:3 indicates a 3 bit field. But I think for these two it's trying to indicate a 3 byte field. Compare the include file constants: +#define KVM_REG_ARM_VFP(0x0012 KVM_REG_ARM_COPROC_SHIFT) +#define KVM_REG_ARM_VFP_MASK 0x +#define KVM_REG_ARM_VFP_BASE_REG 0x0 +#define KVM_REG_ARM_VFP_FPSID 0x1000 +#define KVM_REG_ARM_VFP_FPSCR 0x1001 so eg the first control register is 0x4002 0012 1000 and the first D reg is 0x4002 0012 So my guess is these should be selector:12 (or possibly even something slightly more specific than 'selector' like 'regno:12' ?) -- PMM -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [kvmarm] [PATCH v2 12/14] KVM: ARM: VFP userspace interface
On Tue, Oct 9, 2012 at 2:11 PM, Peter Maydell peter.mayd...@linaro.org wrote: On 1 October 2012 10:11, Christoffer Dall c.d...@virtualopensystems.com wrote: From: Rusty Russell rusty.russ...@linaro.org --- a/Documentation/virtual/kvm/api.txt +++ b/Documentation/virtual/kvm/api.txt @@ -1765,6 +1765,12 @@ ARM 64-bit CP15 registers have the following id bit patterns: ARM CCSIDR registers are demultiplexed by CSSELR value: 0x4002 0011 00 csselr:8 +ARM 32-bit VFP control registers have the following id bit patterns: + 0x4002 0012 1 selector:3 + +ARM 64-bit FP registers have the following id bit patterns: + 0x4002 0012 0 selector:3 + In the other entries in this section, foo:3 indicates a 3 bit field. But I think for these two it's trying to indicate a 3 byte field. Compare the include file constants: +#define KVM_REG_ARM_VFP(0x0012 KVM_REG_ARM_COPROC_SHIFT) +#define KVM_REG_ARM_VFP_MASK 0x +#define KVM_REG_ARM_VFP_BASE_REG 0x0 +#define KVM_REG_ARM_VFP_FPSID 0x1000 +#define KVM_REG_ARM_VFP_FPSCR 0x1001 so eg the first control register is 0x4002 0012 1000 and the first D reg is 0x4002 0012 So my guess is these should be selector:12 (or possibly even something slightly more specific than 'selector' like 'regno:12' ?) right you are, I'll modify this. Thanks, -Christoffer -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v6 05/16] target-i386: Use Hypervisor level in -machine pc,accel=kvm.
On Tue, Oct 09, 2012 at 02:18:05PM -0300, Marcelo Tosatti wrote: On Mon, Sep 24, 2012 at 10:32:07AM -0400, Don Slutz wrote: Also known as Paravirtualization level. This change is based on: Microsoft Hypervisor CPUID Leaves: http://msdn.microsoft.com/en-us/library/windows/hardware/ff542428%28v=vs.85%29.aspx Linux kernel change starts with: http://fixunix.com/kernel/538707-use-cpuid-communicate-hypervisor.html Also: http://lkml.indiana.edu/hypermail/linux/kernel/1205.0/00100.html VMware documention on CPUIDs (Mechanisms to determine if software is running in a VMware virtual machine): http://kb.vmware.com/selfservice/microsites/search.do?language=en_UScmd=displayKCexternalId=1009458 QEMU knows this is KVM_CPUID_SIGNATURE (0x4000). Signed-off-by: Don Slutz d...@cloudswitch.com --- target-i386/kvm.c |4 ++-- 1 files changed, 2 insertions(+), 2 deletions(-) diff --git a/target-i386/kvm.c b/target-i386/kvm.c index 895d848..8462c75 100644 --- a/target-i386/kvm.c +++ b/target-i386/kvm.c @@ -389,12 +389,12 @@ int kvm_arch_init_vcpu(CPUX86State *env) c = cpuid_data.entries[cpuid_i++]; memset(c, 0, sizeof(*c)); c-function = KVM_CPUID_SIGNATURE -if (!hyperv_enabled()) { +if (!env-cpuid_hv_level_set) { memcpy(signature, KVMKVMKVM\0\0\0, 12); c-eax = 0; } else { memcpy(signature, Microsoft Hv, 12); -c-eax = HYPERV_CPUID_MIN; +c-eax = env-cpuid_hv_level; This breaks hyperv_enabled() checks. Don, are you certain it is worthwhile to make this configurable? Can you explain why, under your scenario, it is worthwhile? Because these are separate problems: - Fake VMWare hypervisor (which seems to be your main goal). - Make CPUID HV leafs configurable via command line. Err, meant via properties. Point is, why have VMWare CPUID configuration as data, if there are reasons to believe code is a better fit (code as in current Hyper-V implementation). My point is that the CPUIDs must be carefully constructed, that i miss the point why making them configurable is desired. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v6 05/16] target-i386: Use Hypervisor level in -machine pc,accel=kvm.
On Tue, Oct 09, 2012 at 03:27:26PM -0300, Marcelo Tosatti wrote: On Tue, Oct 09, 2012 at 02:18:05PM -0300, Marcelo Tosatti wrote: On Mon, Sep 24, 2012 at 10:32:07AM -0400, Don Slutz wrote: Also known as Paravirtualization level. This change is based on: Microsoft Hypervisor CPUID Leaves: http://msdn.microsoft.com/en-us/library/windows/hardware/ff542428%28v=vs.85%29.aspx Linux kernel change starts with: http://fixunix.com/kernel/538707-use-cpuid-communicate-hypervisor.html Also: http://lkml.indiana.edu/hypermail/linux/kernel/1205.0/00100.html VMware documention on CPUIDs (Mechanisms to determine if software is running in a VMware virtual machine): http://kb.vmware.com/selfservice/microsites/search.do?language=en_UScmd=displayKCexternalId=1009458 QEMU knows this is KVM_CPUID_SIGNATURE (0x4000). Signed-off-by: Don Slutz d...@cloudswitch.com --- target-i386/kvm.c |4 ++-- 1 files changed, 2 insertions(+), 2 deletions(-) diff --git a/target-i386/kvm.c b/target-i386/kvm.c index 895d848..8462c75 100644 --- a/target-i386/kvm.c +++ b/target-i386/kvm.c @@ -389,12 +389,12 @@ int kvm_arch_init_vcpu(CPUX86State *env) c = cpuid_data.entries[cpuid_i++]; memset(c, 0, sizeof(*c)); c-function = KVM_CPUID_SIGNATURE -if (!hyperv_enabled()) { +if (!env-cpuid_hv_level_set) { memcpy(signature, KVMKVMKVM\0\0\0, 12); c-eax = 0; } else { memcpy(signature, Microsoft Hv, 12); -c-eax = HYPERV_CPUID_MIN; +c-eax = env-cpuid_hv_level; This breaks hyperv_enabled() checks. Don, are you certain it is worthwhile to make this configurable? Can you explain why, under your scenario, it is worthwhile? Because these are separate problems: - Fake VMWare hypervisor (which seems to be your main goal). - Make CPUID HV leafs configurable via command line. Err, meant via properties. Point is, why have VMWare CPUID configuration as data, if there are reasons to believe code is a better fit (code as in current Hyper-V implementation). Nevermind, its the right thing to do. Just separate the patchset please: 1) Create object properties. 2) Export VMWare CPUID via properties. 3) Convert Hyper-V. Be careful to make sure Hyper-V's current options are functional in 3). -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH RFC 1/2] kvm: Handle undercommitted guest case in PLE handler
* Avi Kivity a...@redhat.com [2012-10-04 17:00:28]: On 10/04/2012 03:07 PM, Peter Zijlstra wrote: On Thu, 2012-10-04 at 14:41 +0200, Avi Kivity wrote: Again the numbers are ridiculously high for arch_local_irq_restore. Maybe there's a bad perf/kvm interaction when we're injecting an interrupt, I can't believe we're spending 84% of the time running the popf instruction. Smells like a software fallback that doesn't do NMI, hrtimer based sampling typically hits popf where we re-enable interrupts. Good nose, that's probably it. Raghavendra, can you ensure that the PMU is properly exposed? 'dmesg' in the guest will tell. If it isn't, -cpu host will expose it (and a good idea anyway to get best performance). Hi Avi, you are right. SandyBridge machine result was not proper. I cleaned up the services, enabled PMU, re-ran all the test again. Here is the summary: We do get good benefit by increasing ple window. Though we don't see good benefit for kernbench and sysbench, for ebizzy, we get huge improvement for 1x scenario. (almost 2/3rd of ple disabled case). Let me know if you think we can increase the default ple_window itself to 16k. I am experimenting with V2 version of undercommit improvement(this) patch series, But I think if you wish to go for increase of default ple_window, then we would have to measure the benefit of patches when ple_window = 16k. I can respin the whole series including this default ple_window change. I also have the perf kvm top result for both ebizzy and kernbench. I think they are in expected lines now. Improvements 16 core PLE machine with 16 vcpu guest base = 3.6.0-rc5 + ple handler optimization patches base_pleopt_16k = base + ple_window = 16k base_pleopt_32k = base + ple_window = 32k base_pleopt_nople = base + ple_gap = 0 kernbench, hackbench, sysbench (time in sec lower is better) ebizzy (rec/sec higher is better) % improvements w.r.t base (ple_window = 4k) ---+---+-+---+ |base_pleopt_16k| base_pleopt_32k | base_pleopt_nople | ---+---+-+---+ kernbench_1x | 0.42371 | 1.15164| 0.09320 | kernbench_2x | -1.40981 | -17.48282 | -570.77053 | ---+---+-+---+ sysbench_1x| -0.92367 | 0.24241 | -0.27027 | sysbench_2x| -2.22706 |-0.30896 | -1.27573 | sysbench_3x| -0.75509 | 0.09444 | -2.97756 | ---+---+-+---+ ebizzy_1x | 54.99976 | 67.29460| 74.14076 | ebizzy_2x | -8.83386 |-27.38403| -96.22066 | ---+---+-+---+ perf kvm top observation for kernbench and ebizzy (nople, 4k, 32k window) pleopt ple_gap=0 ebizzy : 18131 records/s 63.78% [guest.kernel] [g] _raw_spin_lock_irqsave 5.65% [guest.kernel] [g] smp_call_function_many 3.12% [guest.kernel] [g] clear_page 3.02% [guest.kernel] [g] down_read_trylock 1.85% [guest.kernel] [g] async_page_fault 1.81% [guest.kernel] [g] up_read 1.76% [guest.kernel] [g] native_apic_mem_write 1.70% [guest.kernel] [g] find_vma kernbench :Elapsed Time 29.4933 (27.6007) 5.72% [guest.kernel] [g] async_page_fault 3.48% [guest.kernel] [g] pvclock_clocksource_read 2.68% [guest.kernel] [g] copy_user_generic_unrolled 2.58% [guest.kernel] [g] clear_page 2.09% [guest.kernel] [g] page_cache_get_speculative 2.00% [guest.kernel] [g] do_raw_spin_lock 1.78% [guest.kernel] [g] unmap_single_vma 1.74% [guest.kernel] [g] kmem_cache_alloc pleopt ple_window = 4k --- ebizzy: 10176 records/s 69.17% [guest.kernel] [g] _raw_spin_lock_irqsave 3.34% [guest.kernel] [g] clear_page 2.16% [guest.kernel] [g] down_read_trylock 1.94% [guest.kernel] [g] async_page_fault 1.89% [guest.kernel] [g] native_apic_mem_write 1.63% [guest.kernel] [g] smp_call_function_many 1.58% [guest.kernel] [g] SetPageLRU 1.37% [guest.kernel] [g] up_read 1.01% [guest.kernel] [g] find_vma kernbench: 29.9533 nts: 240K cycles 6.04% [guest.kernel] [g] async_page_fault 4.17% [guest.kernel] [g] pvclock_clocksource_read 3.28% [guest.kernel] [g] clear_page 2.57% [guest.kernel] [g] copy_user_generic_unrolled 2.30% [guest.kernel] [g] do_raw_spin_lock 2.13% [guest.kernel] [g] _raw_spin_lock_irqsave 1.93% [guest.kernel] [g] page_cache_get_speculative 1.92% [guest.kernel] [g] unmap_single_vma 1.77% [guest.kernel] [g] kmem_cache_alloc 1.61% [guest.kernel] [g] __d_lookup_rcu 1.19%
Re: [PATCH v6 05/16] target-i386: Use Hypervisor level in -machine pc,accel=kvm.
On 10/09/12 14:47, Marcelo Tosatti wrote: On Tue, Oct 09, 2012 at 03:27:26PM -0300, Marcelo Tosatti wrote: On Tue, Oct 09, 2012 at 02:18:05PM -0300, Marcelo Tosatti wrote: On Mon, Sep 24, 2012 at 10:32:07AM -0400, Don Slutz wrote: Also known as Paravirtualization level. This change is based on: Microsoft Hypervisor CPUID Leaves: http://msdn.microsoft.com/en-us/library/windows/hardware/ff542428%28v=vs.85%29.aspx Linux kernel change starts with: http://fixunix.com/kernel/538707-use-cpuid-communicate-hypervisor.html Also: http://lkml.indiana.edu/hypermail/linux/kernel/1205.0/00100.html VMware documention on CPUIDs (Mechanisms to determine if software is running in a VMware virtual machine): http://kb.vmware.com/selfservice/microsites/search.do?language=en_UScmd=displayKCexternalId=1009458 QEMU knows this is KVM_CPUID_SIGNATURE (0x4000). Signed-off-by: Don Slutz d...@cloudswitch.com --- target-i386/kvm.c |4 ++-- 1 files changed, 2 insertions(+), 2 deletions(-) diff --git a/target-i386/kvm.c b/target-i386/kvm.c index 895d848..8462c75 100644 --- a/target-i386/kvm.c +++ b/target-i386/kvm.c @@ -389,12 +389,12 @@ int kvm_arch_init_vcpu(CPUX86State *env) c = cpuid_data.entries[cpuid_i++]; memset(c, 0, sizeof(*c)); c-function = KVM_CPUID_SIGNATURE -if (!hyperv_enabled()) { +if (!env-cpuid_hv_level_set) { memcpy(signature, KVMKVMKVM\0\0\0, 12); c-eax = 0; } else { memcpy(signature, Microsoft Hv, 12); -c-eax = HYPERV_CPUID_MIN; +c-eax = env-cpuid_hv_level; This breaks hyperv_enabled() checks. Don, are you certain it is worthwhile to make this configurable? Can you explain why, under your scenario, it is worthwhile? Because these are separate problems: - Fake VMWare hypervisor (which seems to be your main goal). - Make CPUID HV leafs configurable via command line. Err, meant via properties. Point is, why have VMWare CPUID configuration as data, if there are reasons to believe code is a better fit (code as in current Hyper-V implementation). Nevermind, its the right thing to do. Just separate the patchset please: 1) Create object properties. 2) Export VMWare CPUID via properties. 3) Convert Hyper-V. Be careful to make sure Hyper-V's current options are functional in 3). Did you mean 3 patch sets (or more)? Or just a different order? -Don Slutz -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v6 08/16] target-i386: Add cpu object access routines for Hypervisor vendor.
On Mon, Sep 24, 2012 at 10:32:10AM -0400, Don Slutz wrote: These are modeled after x86_cpuid_set_vendor and x86_cpuid_get_vendor. Since kvm's vendor is shorter, the test for correct size is removed and zero padding is added. Set Microsoft's Vendor now that we can. Value defined in: http://msdn.microsoft.com/en-us/library/windows/hardware/ff542428%28v=vs.85%29.aspx And matches want is in target-i386/kvm.c Signed-off-by: Don Slutz d...@cloudswitch.com --- target-i386/cpu.c | 46 ++ target-i386/cpu.h |2 ++ 2 files changed, 48 insertions(+), 0 deletions(-) diff --git a/target-i386/cpu.c b/target-i386/cpu.c index 920278b..964877f 100644 --- a/target-i386/cpu.c +++ b/target-i386/cpu.c @@ -1192,11 +1192,54 @@ static void x86_cpuid_set_hv_level(Object *obj, Visitor *v, void *opaque, cpu-env.cpuid_hv_level_set = true; } +static char *x86_cpuid_get_hv_vendor(Object *obj, Error **errp) +{ +X86CPU *cpu = X86_CPU(obj); +CPUX86State *env = cpu-env; +char *value; +int i; + +value = (char *)g_malloc(CPUID_VENDOR_SZ + 1); +for (i = 0; i 4; i++) { +value[i + 0] = env-cpuid_hv_vendor1 (8 * i); +value[i + 4] = env-cpuid_hv_vendor2 (8 * i); +value[i + 8] = env-cpuid_hv_vendor3 (8 * i); +} +value[CPUID_VENDOR_SZ] = '\0'; + +return value; +} + +static void x86_cpuid_set_hv_vendor(Object *obj, const char *value, +Error **errp) +{ +X86CPU *cpu = X86_CPU(obj); +CPUX86State *env = cpu-env; +int i; +char adj_value[CPUID_VENDOR_SZ + 1]; + +memset(adj_value, 0, sizeof(adj_value)); + +pstrcpy(adj_value, sizeof(adj_value), value); + +env-cpuid_hv_vendor1 = 0; +env-cpuid_hv_vendor2 = 0; +env-cpuid_hv_vendor3 = 0; +for (i = 0; i 4; i++) { +env-cpuid_hv_vendor1 |= ((uint8_t)adj_value[i + 0]) (8 * i); +env-cpuid_hv_vendor2 |= ((uint8_t)adj_value[i + 4]) (8 * i); +env-cpuid_hv_vendor3 |= ((uint8_t)adj_value[i + 8]) (8 * i); +} +env-cpuid_hv_vendor_set = true; +} + #if !defined(CONFIG_USER_ONLY) static void x86_set_hyperv(Object *obj, Error **errp) { object_property_set_int(obj, CPUID_HV_LEVEL_HYPERV, hypervisor-level, errp); +object_property_set_str(obj, CPUID_HV_VENDOR_HYPERV, +hypervisor-vendor, errp); } static void x86_get_hv_spinlocks(Object *obj, Visitor *v, void *opaque, @@ -2126,6 +2169,9 @@ static void x86_cpu_initfn(Object *obj) object_property_add(obj, hypervisor-level, int, x86_cpuid_get_hv_level, x86_cpuid_set_hv_level, NULL, NULL, NULL); +object_property_add_str(obj, hypervisor-vendor, +x86_cpuid_get_hv_vendor, +x86_cpuid_set_hv_vendor, NULL); #if !defined(CONFIG_USER_ONLY) object_property_add(obj, hv_spinlocks, int, x86_get_hv_spinlocks, diff --git a/target-i386/cpu.h b/target-i386/cpu.h index 11730b2..eb6aa4a 100644 --- a/target-i386/cpu.h +++ b/target-i386/cpu.h @@ -488,6 +488,8 @@ #define CPUID_VENDOR_VIA CentaurHauls +#define CPUID_HV_VENDOR_HYPERV Microsoft Hv + #define CPUID_HV_LEVEL_HYPERV 0x4005 As requested, please separate patchset in groups. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v6 06/16] target-i386: Use Hypervisor level in -machine pc,accel=tcg.
On Mon, Sep 24, 2012 at 10:32:08AM -0400, Don Slutz wrote: Also known as Paravirtualization level. This change is based on: Microsoft Hypervisor CPUID Leaves: http://msdn.microsoft.com/en-us/library/windows/hardware/ff542428%28v=vs.85%29.aspx Linux kernel change starts with: http://fixunix.com/kernel/538707-use-cpuid-communicate-hypervisor.html Also: http://lkml.indiana.edu/hypermail/linux/kernel/1205.0/00100.html VMware documention on CPUIDs (Mechanisms to determine if software is running in a VMware virtual machine): http://kb.vmware.com/selfservice/microsites/search.do?language=en_UScmd=displayKCexternalId=1009458 QEMU knows this as KVM_CPUID_SIGNATURE (0x4000) in kvm on linux. This does not provide vendor support in tcg yet. Signed-off-by: Don Slutz d...@cloudswitch.com --- target-i386/cpu.c | 27 +++ 1 files changed, 27 insertions(+), 0 deletions(-) diff --git a/target-i386/cpu.c b/target-i386/cpu.c index 48bdaf9..920278b 100644 --- a/target-i386/cpu.c +++ b/target-i386/cpu.c @@ -1651,6 +1651,21 @@ void cpu_x86_cpuid(CPUX86State *env, uint32_t index, uint32_t count, index = env-cpuid_xlevel; } } +} else if (index 0x4000) { +if (env-cpuid_hv_level_set) { +uint32_t real_level = env-cpuid_hv_level; + +/* Handle Hypervisor CPUIDs */ +if (real_level 0x4000) { +real_level = 0x4000; +} +if (index real_level) { +index = real_level; +} +} else { +if (index env-cpuid_level) +index = env-cpuid_level; +} Whats the purpose of this checks? Please provide changelogs for each patch. } else { if (index env-cpuid_level) index = env-cpuid_level; @@ -1789,6 +1804,18 @@ void cpu_x86_cpuid(CPUX86State *env, uint32_t index, uint32_t count, *edx = 0; } break; +case 0x4000: +*eax = env-cpuid_hv_level; +*ebx = 0; +*ecx = 0; +*edx = 0; +break; +case 0x4001: +*eax = env-cpuid_kvm_features; +*ebx = 0; +*ecx = 0; +*edx = 0; +break; case 0x8000: *eax = env-cpuid_xlevel; *ebx = env-cpuid_vendor1; -- 1.7.1 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v6 14/16] target-i386: Add setting of Hypervisor leaf extra for known vmare4.
On Mon, Sep 24, 2012 at 10:32:16AM -0400, Don Slutz wrote: This was taken from: http://article.gmane.org/gmane.comp.emulators.kvm.devel/22643 Signed-off-by: Don Slutz d...@cloudswitch.com --- target-i386/cpu.c | 32 1 files changed, 32 insertions(+), 0 deletions(-) diff --git a/target-i386/cpu.c b/target-i386/cpu.c index 8bb20c7..b77dbfe 100644 --- a/target-i386/cpu.c +++ b/target-i386/cpu.c @@ -1135,6 +1135,36 @@ static void x86_cpuid_set_model_id(Object *obj, const char *model_id, } } +static void x86_cpuid_set_vmware_extra(Object *obj) +{ +X86CPU *cpu = X86_CPU(obj); + +if ((cpu-env.tsc_khz != 0) +(cpu-env.cpuid_hv_level == CPUID_HV_LEVEL_VMWARE_4) +(cpu-env.cpuid_hv_vendor1 == CPUID_HV_VENDOR_VMWARE_1) +(cpu-env.cpuid_hv_vendor2 == CPUID_HV_VENDOR_VMWARE_2) +(cpu-env.cpuid_hv_vendor3 == CPUID_HV_VENDOR_VMWARE_3)) { +const uint32_t apic_khz = 100L; + +/* + * From article.gmane.org/gmane.comp.emulators.kvm.devel/22643 + * + *Leaf 0x4010, Timing Information. + * + *VMware has defined the first generic leaf to provide timing + *information. This leaf returns the current TSC frequency and + *current Bus frequency in kHz. + * + *# EAX: (Virtual) TSC frequency in kHz. + *# EBX: (Virtual) Bus (local apic timer) frequency in kHz. + *# ECX, EDX: RESERVED (Per above, reserved fields are set to zero). + */ +cpu-env.cpuid_hv_extra = 0x4010; +cpu-env.cpuid_hv_extra_a = (uint32_t)cpu-env.tsc_khz; +cpu-env.cpuid_hv_extra_b = apic_khz; +} +} + static void x86_cpuid_get_tsc_freq(Object *obj, Visitor *v, void *opaque, const char *name, Error **errp) { @@ -1164,6 +1194,7 @@ static void x86_cpuid_set_tsc_freq(Object *obj, Visitor *v, void *opaque, } cpu-env.tsc_khz = value / 1000; +x86_cpuid_set_vmware_extra(obj); } static void x86_cpuid_get_hv_level(Object *obj, Visitor *v, void *opaque, @@ -1277,6 +1308,7 @@ static void x86_cpuid_set_hv_vendor(Object *obj, const char *value, env-cpuid_hv_vendor3 |= ((uint8_t)adj_value[i + 8]) (8 * i); } env-cpuid_hv_vendor_set = true; +x86_cpuid_set_vmware_extra(obj); } This is strange. Please have this configuration, that depends on other properties being set, ordered in x86_cpu_initfn. Say: object_property_add(obj, tsc-frequency, int, x86_cpuid_get_tsc_freq, x86_cpuid_set_tsc_freq, NULL, NULL, NULL); /* depends on tsc frequency */ object_property_add(obj, vmware-extra, int, x86_cpuid_get_vmware_extra, x86_cpuid_set_vmware_extra, NULL, NULL, NULL); Or something to that effect. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v6 05/16] target-i386: Use Hypervisor level in -machine pc,accel=kvm.
On Tue, Oct 09, 2012 at 03:09:17PM -0400, Don Slutz wrote: On 10/09/12 14:47, Marcelo Tosatti wrote: On Tue, Oct 09, 2012 at 03:27:26PM -0300, Marcelo Tosatti wrote: On Tue, Oct 09, 2012 at 02:18:05PM -0300, Marcelo Tosatti wrote: On Mon, Sep 24, 2012 at 10:32:07AM -0400, Don Slutz wrote: Also known as Paravirtualization level. This change is based on: Microsoft Hypervisor CPUID Leaves: http://msdn.microsoft.com/en-us/library/windows/hardware/ff542428%28v=vs.85%29.aspx Linux kernel change starts with: http://fixunix.com/kernel/538707-use-cpuid-communicate-hypervisor.html Also: http://lkml.indiana.edu/hypermail/linux/kernel/1205.0/00100.html VMware documention on CPUIDs (Mechanisms to determine if software is running in a VMware virtual machine): http://kb.vmware.com/selfservice/microsites/search.do?language=en_UScmd=displayKCexternalId=1009458 QEMU knows this is KVM_CPUID_SIGNATURE (0x4000). Signed-off-by: Don Slutz d...@cloudswitch.com --- target-i386/kvm.c |4 ++-- 1 files changed, 2 insertions(+), 2 deletions(-) diff --git a/target-i386/kvm.c b/target-i386/kvm.c index 895d848..8462c75 100644 --- a/target-i386/kvm.c +++ b/target-i386/kvm.c @@ -389,12 +389,12 @@ int kvm_arch_init_vcpu(CPUX86State *env) c = cpuid_data.entries[cpuid_i++]; memset(c, 0, sizeof(*c)); c-function = KVM_CPUID_SIGNATURE -if (!hyperv_enabled()) { +if (!env-cpuid_hv_level_set) { memcpy(signature, KVMKVMKVM\0\0\0, 12); c-eax = 0; } else { memcpy(signature, Microsoft Hv, 12); -c-eax = HYPERV_CPUID_MIN; +c-eax = env-cpuid_hv_level; This breaks hyperv_enabled() checks. Don, are you certain it is worthwhile to make this configurable? Can you explain why, under your scenario, it is worthwhile? Because these are separate problems: - Fake VMWare hypervisor (which seems to be your main goal). - Make CPUID HV leafs configurable via command line. Err, meant via properties. Point is, why have VMWare CPUID configuration as data, if there are reasons to believe code is a better fit (code as in current Hyper-V implementation). Nevermind, its the right thing to do. Just separate the patchset please: 1) Create object properties. 2) Export VMWare CPUID via properties. 3) Convert Hyper-V. Be careful to make sure Hyper-V's current options are functional in 3). Did you mean 3 patch sets (or more)? Or just a different order? -Don Slutz Different order. Patches should be logically related (think of what information the reviewer needs). Please write changelogs for every patch. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v6 04/16] target-i386: Add x86_set_hyperv.
On 10/09/12 13:17, Marcelo Tosatti wrote: On Tue, Oct 09, 2012 at 01:34:09PM -0300, Marcelo Tosatti wrote: On Mon, Sep 24, 2012 at 10:32:06AM -0400, Don Slutz wrote: This is used to set the cpu object's hypervisor level to the default for Microsoft's Hypervisor. Signed-off-by: Don Slutz d...@cloudswitch.com --- target-i386/cpu.c |9 + target-i386/cpu.h |2 ++ 2 files changed, 11 insertions(+), 0 deletions(-) diff --git a/target-i386/cpu.c b/target-i386/cpu.c index 451de12..48bdaf9 100644 --- a/target-i386/cpu.c +++ b/target-i386/cpu.c @@ -1193,6 +1193,12 @@ static void x86_cpuid_set_hv_level(Object *obj, Visitor *v, void *opaque, } #if !defined(CONFIG_USER_ONLY) +static void x86_set_hyperv(Object *obj, Error **errp) +{ +object_property_set_int(obj, CPUID_HV_LEVEL_HYPERV, +hypervisor-level, errp); +} + static void x86_get_hv_spinlocks(Object *obj, Visitor *v, void *opaque, const char *name, Error **errp) { @@ -1215,6 +1221,7 @@ static void x86_set_hv_spinlocks(Object *obj, Visitor *v, void *opaque, return; } hyperv_set_spinlock_retries(value); +x86_set_hyperv(obj, errp); } static void x86_get_hv_relaxed(Object *obj, Visitor *v, void *opaque, @@ -1235,6 +1242,7 @@ static void x86_set_hv_relaxed(Object *obj, Visitor *v, void *opaque, return; } hyperv_enable_relaxed_timing(value); +x86_set_hyperv(obj, errp); } static void x86_get_hv_vapic(Object *obj, Visitor *v, void *opaque, @@ -1255,6 +1263,7 @@ static void x86_set_hv_vapic(Object *obj, Visitor *v, void *opaque, return; } hyperv_enable_vapic_recommended(value); +x86_set_hyperv(obj, errp); } #endif diff --git a/target-i386/cpu.h b/target-i386/cpu.h index 1899f69..3152a4e 100644 --- a/target-i386/cpu.h +++ b/target-i386/cpu.h @@ -488,6 +488,8 @@ #define CPUID_VENDOR_VIA CentaurHauls +#define CPUID_HV_LEVEL_HYPERV 0x4005 + Where this comes from? http://msdn.microsoft.com/en-us/library/windows/hardware/ff542428%28v=vs.85%29.aspx has under Leaf 0x4000 (at very top of table): EAX The maximum input value for hypervisor CPUID information. For Microsoft hypervisors, this value will be at least 0x4005. The vendor ID signature should be used only for reporting and diagnostic purposes. Is that the same 0x4005 as in this patch? Yes, the #define can be reused: #define HYPERV_CPUID_MIN0x4005 Not as simple as it seems. http://lists.nongnu.org/archive/html/qemu-devel/2012-09/msg03359.html I can make sure this info is part of the commit message. -Don Slutz -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v6 03/16] target-i386: Add cpu object access routines for Hypervisor level.
On 10/09/12 12:25, Marcelo Tosatti wrote: On Mon, Sep 24, 2012 at 10:32:05AM -0400, Don Slutz wrote: These are modeled after x86_cpuid_get_xlevel and x86_cpuid_set_xlevel. Signed-off-by: Don Slutz d...@cloudswitch.com --- target-i386/cpu.c | 29 + 1 files changed, 29 insertions(+), 0 deletions(-) diff --git a/target-i386/cpu.c b/target-i386/cpu.c index 25ca986..451de12 100644 --- a/target-i386/cpu.c +++ b/target-i386/cpu.c @@ -1166,6 +1166,32 @@ static void x86_cpuid_set_tsc_freq(Object *obj, Visitor *v, void *opaque, cpu-env.tsc_khz = value / 1000; } +static void x86_cpuid_get_hv_level(Object *obj, Visitor *v, void *opaque, +const char *name, Error **errp) +{ +X86CPU *cpu = X86_CPU(obj); + +visit_type_uint32(v, cpu-env.cpuid_hv_level, name, errp); +} + +static void x86_cpuid_set_hv_level(Object *obj, Visitor *v, void *opaque, +const char *name, Error **errp) +{ +X86CPU *cpu = X86_CPU(obj); +uint32_t value; + +visit_type_uint32(v, value, name, errp); +if (error_is_set(errp)) { +return; +} + +if (value != 0 value 0x4000) { +value += 0x4000; +} Whats the purpose of this? Adds ambiguity. Will add more info in this commit message. -Don -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v6 04/16] target-i386: Add x86_set_hyperv.
On Tue, Oct 09, 2012 at 03:12:00PM -0400, Don Slutz wrote: On 10/09/12 13:17, Marcelo Tosatti wrote: On Tue, Oct 09, 2012 at 01:34:09PM -0300, Marcelo Tosatti wrote: On Mon, Sep 24, 2012 at 10:32:06AM -0400, Don Slutz wrote: This is used to set the cpu object's hypervisor level to the default for Microsoft's Hypervisor. Signed-off-by: Don Slutz d...@cloudswitch.com --- target-i386/cpu.c |9 + target-i386/cpu.h |2 ++ 2 files changed, 11 insertions(+), 0 deletions(-) diff --git a/target-i386/cpu.c b/target-i386/cpu.c index 451de12..48bdaf9 100644 --- a/target-i386/cpu.c +++ b/target-i386/cpu.c @@ -1193,6 +1193,12 @@ static void x86_cpuid_set_hv_level(Object *obj, Visitor *v, void *opaque, } #if !defined(CONFIG_USER_ONLY) +static void x86_set_hyperv(Object *obj, Error **errp) +{ +object_property_set_int(obj, CPUID_HV_LEVEL_HYPERV, +hypervisor-level, errp); +} + static void x86_get_hv_spinlocks(Object *obj, Visitor *v, void *opaque, const char *name, Error **errp) { @@ -1215,6 +1221,7 @@ static void x86_set_hv_spinlocks(Object *obj, Visitor *v, void *opaque, return; } hyperv_set_spinlock_retries(value); +x86_set_hyperv(obj, errp); } static void x86_get_hv_relaxed(Object *obj, Visitor *v, void *opaque, @@ -1235,6 +1242,7 @@ static void x86_set_hv_relaxed(Object *obj, Visitor *v, void *opaque, return; } hyperv_enable_relaxed_timing(value); +x86_set_hyperv(obj, errp); } } static void x86_get_hv_vapic(Object *obj, Visitor *v, void *opaque, @@ -1255,6 +1263,7 @@ static void x86_set_hv_vapic(Object *obj, Visitor *v, void *opaque, return; } hyperv_enable_vapic_recommended(value); +x86_set_hyperv(obj, errp); } #endif diff --git a/target-i386/cpu.h b/target-i386/cpu.h index 1899f69..3152a4e 100644 --- a/target-i386/cpu.h +++ b/target-i386/cpu.h @@ -488,6 +488,8 @@ #define CPUID_VENDOR_VIA CentaurHauls +#define CPUID_HV_LEVEL_HYPERV 0x4005 + Where this comes from? http://msdn.microsoft.com/en-us/library/windows/hardware/ff542428%28v=vs.85%29.aspx has under Leaf 0x4000 (at very top of table): EAX The maximum input value for hypervisor CPUID information. For Microsoft hypervisors, this value will be at least 0x4005. The vendor ID signature should be used only for reporting and diagnostic purposes. Is that the same 0x4005 as in this patch? Yes, the #define can be reused: #define HYPERV_CPUID_MIN0x4005 Not as simple as it seems. http://lists.nongnu.org/archive/html/qemu-devel/2012-09/msg03359.html I can make sure this info is part of the commit message. -Don Slutz Ok add a copy but have CPUID_MIN in the name (because CPUID_HV_LEVEL_HYPERV is very confusing). Add a comment /* maximum input value for h... -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Qemu-devel] Using PCI config space to indicate config location
Hi, Why use two bars for this? You can put them into one mmio bar, together with the msi-x vector table and PBA. Of course a pci capability describing the location is helpful for that ;) You don't need a capability. You can also just add a config offset field to the register set and then make the semantics that it occurs in the same region. Yes, that will work too. Removes some freedom to place the register ranges, but given that we don't want burn bars and thus prefer to place everything into a single mmio bar that shouldn't be an issue. Real hardware does this too btw (xhci for example). Main advantage of defining a register set with just isr is that it reduces pio address space consumtion for new virtio devices which don't have to worry about the legacy layout (8 bytes which is minimum size for io bars instead of 64 bytes). Doing some rough math, we should have at least 16k of PIO space. That let's us have well over 500 virtio-pci devices with the current register layout. I've seen worries nevertheless, but given we have virtio-scsi now which can handle lots of disks without needing lots of virtio-pci devices it is probably not that a big issue any more. The detection is simple: if BAR1 has non-zero length, it's new-style, otherwise legacy. Doesn't fly. BAR1 is in use today for MSI-X support. But the location is specified via capabilities so we can change the location to be within BAR1 at a non-conflicting offset. Sure. Nevertheless BAR1 has non-zero length can't be used to detect new-style virtio as old-style devices already have BAR1 with a non-zero length. So how about this: (1) Add a vendor specific pci capability for new-style virtio. Specifies the pci bar used for new-style virtio registers. Guests can use it to figure whenever new-style virtio is supported and to map the correct bar (which will probably be bar 1 in most cases). (2) Have virtio-offsets register set, located at the new-style bar, offset 0: struct virtio_offsets { __le32 num_offsets;// so we can extend this later __le32 virtio_pci_offset; // location of virtio-pci registers __le32 virtio_cfg_offset; // location of virtio config space }; (3) place virtio-pci registers (same as 0..23 in bar 0) in the new-style bar, offset virtio_pci_offset (4) place virtio config space (same as 20+/24+ in bar 0) in the new-style bar, offset virtio_cfg_offset cheers, Gerd -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Qemu-devel] Using PCI config space to indicate config location
Rusty Russell wrote: I don't think it'll be that bad; reset clears the device to unknown, bar0 moves it from unknown-legacy mode, bar1/2/3 changes it from unknown-modern mode, and anything else is bad (I prefer being strict so we catch bad implementations from the beginning). Will that work, if the guest with kernel that uses modern mode, kexecs to an older (but presumed reliable) kernel that only knows about legacy mode? I.e. will the replacement kernel, or (ideally) replacement driver on the rare occasion that is needed on a running kernel, be able to reset the device hard enough? -- Jamie -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [kvmarm] [PATCH v2 06/14] KVM: ARM: Memory virtualization setup
-Original Message- From: Marc Zyngier [mailto:marc.zyng...@arm.com] Sent: Tuesday, October 09, 2012 9:56 PM To: Christoffer Dall Cc: Min-gyu Kim; 김창환; linux-arm-ker...@lists.infradead.org; kvm@vger.kernel.org; kvm...@lists.cs.columbia.edu Subject: Re: [kvmarm] [PATCH v2 06/14] KVM: ARM: Memory virtualization setup On Sat, 6 Oct 2012 17:33:43 -0400, Christoffer Dall c.d...@virtualopensystems.com wrote: On Thu, Oct 4, 2012 at 10:23 PM, Min-gyu Kim mingyu84@samsung.com wrote: -Original Message- From: kvm-ow...@vger.kernel.org [mailto:kvm-ow...@vger.kernel.org] On Behalf Of Christoffer Dall Sent: Monday, October 01, 2012 6:11 PM To: kvm@vger.kernel.org; linux-arm-ker...@lists.infradead.org; kvm...@lists.cs.columbia.edu Cc: Marc Zyngier Subject: [PATCH v2 06/14] KVM: ARM: Memory virtualization setup +static void stage2_set_pte(struct kvm *kvm, struct kvm_mmu_memory_cache *cache, +phys_addr_t addr, const pte_t *new_pte) { + pgd_t *pgd; + pud_t *pud; + pmd_t *pmd; + pte_t *pte, old_pte; + + /* Create 2nd stage page table mapping - Level 1 */ + pgd = kvm-arch.pgd + pgd_index(addr); + pud = pud_offset(pgd, addr); + if (pud_none(*pud)) { + if (!cache) + return; /* ignore calls from kvm_set_spte_hva */ + pmd = mmu_memory_cache_alloc(cache); + pud_populate(NULL, pud, pmd); + pmd += pmd_index(addr); + get_page(virt_to_page(pud)); + } else + pmd = pmd_offset(pud, addr); + + /* Create 2nd stage page table mapping - Level 2 */ + if (pmd_none(*pmd)) { + if (!cache) + return; /* ignore calls from kvm_set_spte_hva */ + pte = mmu_memory_cache_alloc(cache); + clean_pte_table(pte); + pmd_populate_kernel(NULL, pmd, pte); + pte += pte_index(addr); + get_page(virt_to_page(pmd)); + } else + pte = pte_offset_kernel(pmd, addr); + + /* Create 2nd stage page table mapping - Level 3 */ + old_pte = *pte; + set_pte_ext(pte, *new_pte, 0); + if (pte_present(old_pte)) + __kvm_tlb_flush_vmid(kvm); + else + get_page(virt_to_page(pte)); } I'm not sure about the 3-level page table, but isn't it necessary to clean the page table for 2nd level? There are two mmu_memory_cache_alloc calls. One has following clean_pte_table and the other doesn't have. hmm, it probably is - I couldn't really find the common case where this is done in the kernel normally (except for some custom loop in ioremap and idmap), but I added this fix: diff --git a/arch/arm/kvm/mmu.c b/arch/arm/kvm/mmu.c index 5394a52..f11ba27f 100644 --- a/arch/arm/kvm/mmu.c +++ b/arch/arm/kvm/mmu.c @@ -430,6 +430,7 @@ static void stage2_set_pte(struct kvm *kvm, struct kvm_mmu_memory_cache *cache, if (!cache) return; /* ignore calls from kvm_set_spte_hva */ pmd = mmu_memory_cache_alloc(cache); + clean_dcache_area(pmd, PTRS_PER_PMD * sizeof(pmd_t)); pud_populate(NULL, pud, pmd); pmd += pmd_index(addr); get_page(virt_to_page(pud)); There ought to be a test of ID_MMFR3[23:20] to find out whether or not it is useful to clean the dcache. Not sure if that's a massive gain, but it is certainly an optimisation to consider for the kernel as a whole. That's part of what I was wondering. clean_dcache_area is substituted by nop if TLB_CAN_READ_FROM_L1_CACHE is defined(mm/proc-v7.S). But I couldn't find any place where it is defined. If it is part of bsp to enable TLB_CAN_RAD_FROM_L1_CACHE according to ID_MMFR3[23:20], it would not be necessary to concerned about cleaning or not cleaning. However, am I right? M. -- Fast, cheap, reliable. Pick two. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] vhost-blk: Add vhost-blk support v2
Hello Christoph! On 10/10/2012 01:39 AM, Christoph Hellwig wrote: +static int vhost_blk_req_submit(struct vhost_blk_req *req, struct file *file) +{ + +struct inode *inode = file-f_mapping-host; +struct block_device *bdev = inode-i_bdev; +int ret; Please just pass the block_device directly instead of a file struct. vhost_blk_req_submit() can be used to handle file based image later. Using the file interface will work for both cases. I do need a check in vhost_blk_set_backend() to tell if the user passed file is a raw device file for now. + +ret = vhost_blk_bio_make(req, bdev); +if (ret 0) +return ret; + +vhost_blk_bio_send(req); + +return ret; +} Then again how simple the this function is it probably should just go away entirely. No, vhost_blk_req_submit() is used for both read and write ops. It makes no sense to write the code twice. Plus, this function might be complexer when the file based image support is added. +set_fs(USER_DS); What do we actually need the set_fs for here? See this commit: d7ffde35e31a81100d2d0d2c4013cbf527bb32ea + +static inline void vhost_blk_stop(struct vhost_blk *blk, struct file **file) +{ + +*file = vhost_blk_stop_vq(blk, blk-vq); +} What is the point of this helper? Also I can't see anyone actually using the returned struct file. It is used in both vhost_blk_reset_owner() and vhost_blk_release(). The returned struct file is used for fput(). We have similar helper in vhost_net: vhost_net_stop(). +case VIRTIO_BLK_T_FLUSH: +ret = vfs_fsync(file, 1); +status = ret 0 ? VIRTIO_BLK_S_IOERR : VIRTIO_BLK_S_OK; +if (!vhost_blk_set_status(req, status)) +vhost_add_used_and_signal(blk-dev, vq, head, ret); +break; Sending a fsync here is actually wrong in two different ways: a) it operates at the filesystem level instead of bio level b) it's a blocking operation It should instead send a REQ_FLUSH bio using the same submission scheme as the read/write requests. Will fix this. Thanks for the review! -- Asias -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH RFC 1/2] kvm: Handle undercommitted guest case in PLE handler
On Wed, 2012-10-10 at 00:21 +0530, Raghavendra K T wrote: * Avi Kivity a...@redhat.com [2012-10-04 17:00:28]: On 10/04/2012 03:07 PM, Peter Zijlstra wrote: On Thu, 2012-10-04 at 14:41 +0200, Avi Kivity wrote: Again the numbers are ridiculously high for arch_local_irq_restore. Maybe there's a bad perf/kvm interaction when we're injecting an interrupt, I can't believe we're spending 84% of the time running the popf instruction. Smells like a software fallback that doesn't do NMI, hrtimer based sampling typically hits popf where we re-enable interrupts. Good nose, that's probably it. Raghavendra, can you ensure that the PMU is properly exposed? 'dmesg' in the guest will tell. If it isn't, -cpu host will expose it (and a good idea anyway to get best performance). Hi Avi, you are right. SandyBridge machine result was not proper. I cleaned up the services, enabled PMU, re-ran all the test again. Here is the summary: We do get good benefit by increasing ple window. Though we don't see good benefit for kernbench and sysbench, for ebizzy, we get huge improvement for 1x scenario. (almost 2/3rd of ple disabled case). Let me know if you think we can increase the default ple_window itself to 16k. I am experimenting with V2 version of undercommit improvement(this) patch series, But I think if you wish to go for increase of default ple_window, then we would have to measure the benefit of patches when ple_window = 16k. I can respin the whole series including this default ple_window change. I also have the perf kvm top result for both ebizzy and kernbench. I think they are in expected lines now. Improvements 16 core PLE machine with 16 vcpu guest base = 3.6.0-rc5 + ple handler optimization patches base_pleopt_16k = base + ple_window = 16k base_pleopt_32k = base + ple_window = 32k base_pleopt_nople = base + ple_gap = 0 kernbench, hackbench, sysbench (time in sec lower is better) ebizzy (rec/sec higher is better) % improvements w.r.t base (ple_window = 4k) ---+---+-+---+ |base_pleopt_16k| base_pleopt_32k | base_pleopt_nople | ---+---+-+---+ kernbench_1x | 0.42371 | 1.15164| 0.09320 | kernbench_2x | -1.40981 | -17.48282 | -570.77053 | ---+---+-+---+ sysbench_1x| -0.92367 | 0.24241 | -0.27027 | sysbench_2x| -2.22706 |-0.30896 | -1.27573 | sysbench_3x| -0.75509 | 0.09444 | -2.97756 | ---+---+-+---+ ebizzy_1x | 54.99976 | 67.29460| 74.14076 | ebizzy_2x | -8.83386 |-27.38403| -96.22066 | ---+---+-+---+ perf kvm top observation for kernbench and ebizzy (nople, 4k, 32k window) Is the perf data for 1x overcommit? pleopt ple_gap=0 ebizzy : 18131 records/s 63.78% [guest.kernel] [g] _raw_spin_lock_irqsave 5.65% [guest.kernel] [g] smp_call_function_many 3.12% [guest.kernel] [g] clear_page 3.02% [guest.kernel] [g] down_read_trylock 1.85% [guest.kernel] [g] async_page_fault 1.81% [guest.kernel] [g] up_read 1.76% [guest.kernel] [g] native_apic_mem_write 1.70% [guest.kernel] [g] find_vma Does 'perf kvm top' not give host samples at the same time? Would be nice to see the host overhead as a function of varying ple window. I would expect that to be the major difference between 4/16/32k window sizes. A big concern I have (if this is 1x overcommit) for ebizzy is that it has just terrible scalability to begin with. I do not think we should try to optimize such a bad workload. kernbench :Elapsed Time 29.4933 (27.6007) 5.72% [guest.kernel] [g] async_page_fault 3.48% [guest.kernel] [g] pvclock_clocksource_read 2.68% [guest.kernel] [g] copy_user_generic_unrolled 2.58% [guest.kernel] [g] clear_page 2.09% [guest.kernel] [g] page_cache_get_speculative 2.00% [guest.kernel] [g] do_raw_spin_lock 1.78% [guest.kernel] [g] unmap_single_vma 1.74% [guest.kernel] [g] kmem_cache_alloc pleopt ple_window = 4k --- ebizzy: 10176 records/s 69.17% [guest.kernel] [g] _raw_spin_lock_irqsave 3.34% [guest.kernel] [g] clear_page 2.16% [guest.kernel] [g] down_read_trylock 1.94% [guest.kernel] [g] async_page_fault 1.89% [guest.kernel] [g] native_apic_mem_write 1.63% [guest.kernel] [g] smp_call_function_many 1.58% [guest.kernel] [g] SetPageLRU 1.37% [guest.kernel] [g] up_read 1.01%