Re: [PATCH 5 of 5] virtio: expose added descriptors immediately
On Wed, Nov 23, 2011 at 11:49:01AM +1030, Rusty Russell wrote:
> On Tue, 22 Nov 2011 08:29:08 +0200, Michael S. Tsirkin m...@redhat.com wrote:
> > On Tue, Nov 22, 2011 at 11:03:04AM +1030, Rusty Russell wrote:
> > > -	/* If you haven't kicked in this long, you're probably doing something
> > > -	 * wrong. */
> > > -	WARN_ON(vq->num_added > vq->vring.num);
> > > +	/* This is very unlikely, but theoretically possible.  Kick
> > > +	 * just in case. */
> > > +	if (unlikely(vq->num_added == 65535))
> >
> > This is 0xffff, but why use the decimal notation?
>
> Interesting.  Why use hex?  Feels more like binary?  Just easier to see
> it's the largest 16 bit number.  But I've changed it to (1 << 16) - 1 to
> be clear.

That's even better.

> > > +		virtqueue_kick(_vq);
> > >
> > > 	pr_debug("Added buffer head %i to %p\n", head, vq);
> > > 	END_USE(vq);
> >
> > We also still need to reset vq->num_added, right?
>
> virtqueue_kick does that for us.
>
> Cheers,
> Rusty.

Right.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Re: [PATCHv3 RFC] virtio-pci: flexible configuration layout
On Wed, Nov 23, 2011 at 01:02:22PM +1030, Rusty Russell wrote:
> +/* Fields in VIRTIO_PCI_CAP_COMMON_CFG: */
> +struct virtio_pci_common_cfg {
> +	/* About the whole device. */
> +	__u64 device_features;	/* read-only */
> +	__u64 guest_features;	/* read-write */

We currently require atomic accesses to common fields.  Some
architectures might not have such for 64 bit, so these need to be
split, I think ...

> +	__u64 queue_address;	/* read-write */
> +	__u16 msix_config;	/* read-write */
> +	__u8 device_status;	/* read-write */
> +	__u8 unused;
> +
> +	/* About a specific virtqueue. */
> +	__u16 queue_select;	/* read-write */
> +	__u16 queue_align;	/* read-write, power of 2. */
> +	__u16 queue_size;	/* read-write, power of 2. */
> +	__u16 queue_msix_vector;	/* read-write */
> +};
>  #endif
Re: About hotplug multifunction
On 09/09/11 15:08, Amos Kong wrote:
> Hello all,
>
> I'm working on hotplug pci multifunction.
>
> 1. qemu cmdline:
> ./x86_64-softmmu/qemu-system-x86_64 -snapshot -m 2000 /home/kvm_autotest_root/images/rhel61-64-virtio.qcow2 -vnc :0 -monitor unix:/tmp/a,server,nowait --enable-kvm -net none
>
> 2. script to add virtio-blk devices:
> for i in `seq 1 7` 0; do
>     qemu-img create /tmp/resize$i.qcow2 1G -f qcow2
>     echo "drive_add 0x6.$i id=drv$i,if=none,file=/tmp/resize$i.qcow2" | nc -U /tmp/a
>     echo "device_add virtio-blk-pci,id=dev$i,drive=drv$i,addr=0x6.$i,multifunction=on" | nc -U /tmp/a
> done
>
> 3. script to add virtio-nic devices:
> for i in `seq 1 7` 0; do
>     echo "netdev_add tap,id=drv$i" | nc -U /tmp/a
>     echo "device_add virtio-net-pci,id=dev$i,netdev=drv$i,addr=0x6.$i,multifunction=on" | nc -U /tmp/a
> done
>
> 4. current qemu behaviors
> 4.1. add func 1~7 one by one, then add func 0
>      virtio-nic: success, all funcs are added
>      virtio-blk: success
> 4.2. add func 0~7 one by one
>      virtio-nic: failed, only func 0 is added
>      virtio-blk: success
> 4.3. removing any single func in monitor
>      virtio-nic: func 0 is not found in 'lspci', funcs 1~7 still exist; eth1~eth7 also still exist.
>      virtio-blk: func 0 is not found in 'lspci', funcs 1~7 still exist; /dev/vda disappears, but vdb,vdc,vde,vdf,vdg,vdh,vdi,vdj still exist.
>      If I re-add 8 funcs to the guest, they all work.
>
>      # lspci    (00:06.1 ~ 00:06.7 exist, 00:06.0 doesn't exist)
>      00:06.1 SCSI storage controller: Red Hat, Inc Virtio block device (rev ff)
>
> qemu sends an acpi event to the guest, then the guest will remove all funcs in the slot.
> linux-2.6/drivers/pci/hotplug/acpiphp_glue.c:
> static int disable_device(struct acpiphp_slot *slot)
> {
>     list_for_each_entry(func, &slot->funcs, sibling) {
>     ...

Hello all,

I re-tested the multifunction hotplug of winxp, rhel6 and win7 guests
(upstream qemu-kvm).

For all the guests, func 0 needs to be added, otherwise no func can be
found in the guest.  RHEL6 needs func 0 to be added last; otherwise the
funcs added after hot-adding func 0 could not be found in the guest.
For winxp and win7, func 0 can be added at any time, and all the funcs
can be found in the guest.

This is the test result (hotplug order of funcs vs. funcs that can be
found in the guest):

hotplug order    xp      win7    rhel6
--------------------------------------
func 0~7         0~7     0~7     0~7
func 0~5         0~6     0~6     0~6
func 1~4,0,5     0~5     0~5     0~4
func 1~4         no      no      no

It looks like a 'problem' of the linux guest; or could we accept having
to add func 0 after adding the other funcs for linux guests?

Thanks, Amos

Questions:
1. why can funcs 1~7 still be found after hot-remove? is it the same as real hardware?
2. why could funcs 1~7 not be added to the guest (adding func 0~7 one by one)?
3. how about this interface for hotplug/hot-unplug of a multifunction device:
   1) Add funcs 1-7 by monitor, add func 0, then send an acpi event to notify the guest
   2) Remove func 0, send an acpi event to the guest. (all funcs can be removed)
4. what does revision 0xff stand for?

Thanks in advance, Amos
Re: [PATCHv3 RFC] virtio-pci: flexible configuration layout
On Wed, Nov 23, 2011 at 01:02:22PM +1030, Rusty Russell wrote:
> On Tue, 22 Nov 2011 20:36:22 +0200, Michael S. Tsirkin m...@redhat.com wrote:
> > Here's an updated version. I'm alternating between updating the spec
> > and the driver, spec update to follow.
>
> Don't touch the spec yet, we have a long way to go :(
>
> I want the ability for driver to set the ring size, and the device to
> set the alignment.

Did you mean for the driver to be able to set the alignment?  This is
what the BIOS guys want: after BIOS completes, the guest driver gets
handed control and sets its own alignment to save memory.

> That's a bigger change than you have here.

Why can't we just add the new registers at the end?  With the new
capability, we have as much space as we like for that.

> I imagine it almost rips the driver into two completely different
> drivers.

If you insist on moving all the rest of the registers around, certainly.
But why do this?

> This is the kind of thing I had in mind, for the header.  Want me to
> code up the rest?  (I've avoided adding the constants for the new
> layout: a struct is more compact and more descriptive).
>
> Cheers,
> Rusty.

Renaming constants in exported headers will break userspace builds.
Do we care?  Why not?
diff --git a/include/linux/virtio_pci.h b/include/linux/virtio_pci.h
--- a/include/linux/virtio_pci.h
+++ b/include/linux/virtio_pci.h
@@ -42,56 +42,74 @@
 #include <linux/virtio_config.h>
 
 /* A 32-bit r/o bitmask of the features supported by the host */
-#define VIRTIO_PCI_HOST_FEATURES	0
+#define VIRTIO_PCI_LEGACY_HOST_FEATURES	0
 
 /* A 32-bit r/w bitmask of features activated by the guest */
-#define VIRTIO_PCI_GUEST_FEATURES	4
+#define VIRTIO_PCI_LEGACY_GUEST_FEATURES 4
 
 /* A 32-bit r/w PFN for the currently selected queue */
-#define VIRTIO_PCI_QUEUE_PFN		8
+#define VIRTIO_PCI_LEGACY_QUEUE_PFN	8
 
 /* A 16-bit r/o queue size for the currently selected queue */
-#define VIRTIO_PCI_QUEUE_NUM		12
+#define VIRTIO_PCI_LEGACY_QUEUE_NUM	12
 
 /* A 16-bit r/w queue selector */
-#define VIRTIO_PCI_QUEUE_SEL		14
+#define VIRTIO_PCI_LEGACY_QUEUE_SEL	14
 
 /* A 16-bit r/w queue notifier */
-#define VIRTIO_PCI_QUEUE_NOTIFY		16
+#define VIRTIO_PCI_LEGACY_QUEUE_NOTIFY	16
 
 /* An 8-bit device status register. */
-#define VIRTIO_PCI_STATUS		18
+#define VIRTIO_PCI_LEGACY_STATUS	18
 
 /* An 8-bit r/o interrupt status register.  Reading the value will return the
  * current contents of the ISR and will also clear it.  This is effectively
  * a read-and-acknowledge. */
-#define VIRTIO_PCI_ISR			19
-
-/* The bit of the ISR which indicates a device configuration change. */
-#define VIRTIO_PCI_ISR_CONFIG		0x2
+#define VIRTIO_PCI_LEGACY_ISR		19
 
 /* MSI-X registers: only enabled if MSI-X is enabled. */
 /* A 16-bit vector for configuration changes. */
-#define VIRTIO_MSI_CONFIG_VECTOR	20
+#define VIRTIO_MSI_LEGACY_CONFIG_VECTOR	20
 /* A 16-bit vector for selected queue notifications. */
-#define VIRTIO_MSI_QUEUE_VECTOR		22
-/* Vector value used to disable MSI for queue */
-#define VIRTIO_MSI_NO_VECTOR		0xffff
+#define VIRTIO_MSI_LEGACY_QUEUE_VECTOR	22
 
 /* The remaining space is defined by each driver as the per-driver
  * configuration space */
-#define VIRTIO_PCI_CONFIG(dev)		((dev)->msix_enabled ? 24 : 20)
+#define VIRTIO_PCI_LEGACY_CONFIG(dev)	((dev)->msix_enabled ? 24 : 20)
+
+/* How many bits to shift physical queue address written to QUEUE_PFN.
+ * 12 is historical, and due to x86 page size. */
+#define VIRTIO_PCI_LEGACY_QUEUE_ADDR_SHIFT	12
+
+/* The alignment to use between consumer and producer parts of vring.
+ * x86 pagesize again. */
+#define VIRTIO_PCI_LEGACY_VRING_ALIGN	4096
+
+#ifndef __KERNEL__
+/* Don't break compile of old userspace code.  These will go away. */
+#define VIRTIO_PCI_HOST_FEATURES VIRTIO_PCI_LEGACY_HOST_FEATURES
+#define VIRTIO_PCI_GUEST_FEATURES VIRTIO_PCI_LEGACY_GUEST_FEATURES
+#define VIRTIO_PCI_QUEUE_PFN VIRTIO_PCI_LEGACY_QUEUE_PFN
+#define VIRTIO_PCI_QUEUE_NUM VIRTIO_PCI_LEGACY_QUEUE_NUM
+#define VIRTIO_PCI_QUEUE_SEL VIRTIO_PCI_LEGACY_QUEUE_SEL
+#define VIRTIO_PCI_QUEUE_NOTIFY VIRTIO_PCI_LEGACY_QUEUE_NOTIFY
+#define VIRTIO_PCI_STATUS VIRTIO_PCI_LEGACY_STATUS
+#define VIRTIO_PCI_ISR VIRTIO_PCI_LEGACY_ISR
+#define VIRTIO_MSI_CONFIG_VECTOR VIRTIO_MSI_LEGACY_CONFIG_VECTOR
+#define VIRTIO_MSI_QUEUE_VECTOR VIRTIO_MSI_LEGACY_QUEUE_VECTOR
+#define VIRTIO_PCI_CONFIG(dev) VIRTIO_PCI_LEGACY_CONFIG(dev)
+#define VIRTIO_PCI_QUEUE_ADDR_SHIFT VIRTIO_PCI_LEGACY_QUEUE_ADDR_SHIFT
+#define VIRTIO_PCI_VRING_ALIGN VIRTIO_PCI_LEGACY_VRING_ALIGN
+#endif /* ...!__KERNEL__ */
 
 /* Virtio ABI version, this must match exactly */
 #define VIRTIO_PCI_ABI_VERSION	0
Re: [PATCHv3 RFC] virtio-pci: flexible configuration layout
On Wed, 2011-11-23 at 10:49 +0200, Michael S. Tsirkin wrote:
> On Wed, Nov 23, 2011 at 01:02:22PM +1030, Rusty Russell wrote:
> > +/* Fields in VIRTIO_PCI_CAP_COMMON_CFG: */
> > +struct virtio_pci_common_cfg {
> > +	/* About the whole device. */
> > +	__u64 device_features;	/* read-only */
> > +	__u64 guest_features;	/* read-write */
>
> We currently require atomic accesses to common fields.  Some
> architectures might not have such for 64 bit, so these need to be
> split, I think ...

We can consider stealing the feature implementation from virtio-mmio:
use the first 32 bits as a selector and the last as the features
themselves.

It's more complex to work with, but it provides 2**32 * 32 feature bits
(which should be enough for a while) and it solves the atomic access
issue.

> > +	__u64 queue_address;	/* read-write */
> > +	__u16 msix_config;	/* read-write */
> > +	__u8 device_status;	/* read-write */
> > +	__u8 unused;
> > +
> > +	/* About a specific virtqueue. */
> > +	__u16 queue_select;	/* read-write */
> > +	__u16 queue_align;	/* read-write, power of 2. */
> > +	__u16 queue_size;	/* read-write, power of 2. */
> > +	__u16 queue_msix_vector;	/* read-write */
> > +};
> >  #endif

-- 

Sasha.
Re: [PATCHv3 RFC] virtio-pci: flexible configuration layout
On Wed, 2011-11-23 at 13:02 +1030, Rusty Russell wrote:
> On Tue, 22 Nov 2011 20:36:22 +0200, Michael S. Tsirkin m...@redhat.com wrote:
> > Here's an updated version. I'm alternating between updating the spec
> > and the driver, spec update to follow.
>
> Don't touch the spec yet, we have a long way to go :(
>
> I want the ability for driver to set the ring size, and the device to
> set the alignment.  That's a bigger change than you have here.  I
> imagine it almost rips the driver into two completely different
> drivers.
>
> This is the kind of thing I had in mind, for the header.  Want me to
> code up the rest?  (I've avoided adding the constants for the new
> layout: a struct is more compact and more descriptive).
>
> Cheers,
> Rusty.
>
> [virtio_pci.h patch quoted in full above]
>
> +#ifndef __KERNEL__
> +/* Don't break compile of old userspace code.  These will go away. */

Maybe wrap it in:

#ifndef VIRTIO_PCI_NO_LEGACY
#warning "Please stop using old defines"
[...]
#endif /* VIRTIO_PCI_NO_LEGACY */

> +#endif /* ...!__KERNEL__ */
>
>  /* Virtio ABI version, this must match exactly */
>  #define VIRTIO_PCI_ABI_VERSION	0
>
> -/* How many bits to shift physical queue address written to QUEUE_PFN.
> - * 12 is historical, and due to x86 page size. */
> -#define VIRTIO_PCI_QUEUE_ADDR_SHIFT	12
> +/* Vector value used to disable MSI for queue */
> +#define VIRTIO_MSI_NO_VECTOR	0xffff
>
> -/* The alignment to use between consumer and producer parts of vring. -
Re: Memory sync algorithm during migration
On Tue, Nov 22, 2011 at 02:04:17PM +0100, Oliver Hookins wrote:

On Tue, Nov 22, 2011 at 10:31:58AM +0100, ext Juan Quintela wrote:

Oliver Hookins oliver.hook...@nokia.com wrote:

On Tue, Nov 15, 2011 at 11:47:58AM +0100, ext Juan Quintela wrote:

Takuya Yoshikawa yoshikawa.tak...@oss.ntt.co.jp wrote:

[Takuya:] Adding qemu-devel ML to CC.  Your question should have been
sent to the qemu-devel ML, because the logic is implemented in QEMU,
not KVM.

(2011/11/11 1:35), Oliver Hookins wrote:

[Oliver:] Hi, I am performing some benchmarks on KVM migration on two
different types of VM.  One has 4GB RAM and the other 32GB.  More or
less idle, the 4GB VM takes about 20 seconds to migrate on our hardware,
while the 32GB VM takes about a minute.  With a reasonable amount of
memory activity going on (in the hundreds of MB per second), the 32GB
VM takes 3.5 minutes to migrate, but the 4GB VM never completes.
Intuitively this tells me there is some watermarking of dirty pages
going on that is not particularly efficient when the dirty-pages ratio
is high compared to total memory, but I may be completely incorrect.

[Takuya:] You can change the ratio, IIRC.  Hopefully, someone who knows
QEMU well will tell you better ways.

[Oliver:] Could anybody fill me in on what might be going on here?
We're using libvirt 0.8.2 and kvm-83-224.el5.centos.1.

[Juan:] This is a pretty old qemu/kvm code base.  In principle, it
makes no sense that with 32GB RAM migration finishes but with 4GB RAM
it is unable to (intuitively it should be, if ever, the other way
around).  Do you have an easy test that makes the problem easily
reproducible?  Have you tried upstream qemu.git? (some improvements in
that department).

[Oliver:] I've just tried the qemu-kvm 0.14.1 tag, which seems to be
the latest that builds on my platform.  For some strange reason
migrations always seem to fail in one direction with "Unknown savevm
section or instance 'hpet' 0" messages.

[Juan:] What is your platform?  This seems like you are running with
hpet on one side, but without it on the other.  What command line are
you using?

[Oliver:] Yes, my mistake.  We were also testing later kernels, and my
test machines managed to get out of sync: one had support for the hpet
clocksource but the other one didn't.  This seemed to point to
different migration protocols on either end, but they are both running
the same version of qemu-kvm I built.  Does this ring any bells for
anyone?

[Juan:] Command line mismatch.  But, what is your platform?

[Oliver:] CentOS 5.6.  Now running the VMs through qemu-kvm 0.14.1,
unloaded migrations take about half the time, but with memory I/O load
both VMs now never complete the migration.  In practical terms I'm
writing about 50MB/s into memory, and we have a 10Gbps network (and
I've seen real speeds up to 8-9Gbps on the wire), so there should be
enough capacity to sync up the dirty pages.  So now the 32GB and 4GB
VMs have matching behaviour (which makes more sense), but I'm not any
closer to figuring out what is going on.

Resending as it got rejected the first time: I have not set the maximum
permissible migration time, but that default certainly points to a
possible solution.  As for the writing semantics, it was a straight dd
from disk to /dev/shm, so I can't speak for the kernel, but naively I
would think it may be contiguous space.
Re: IOAPIC doesn't handle byte writes
On 11/22/2011 05:06 PM, Julian Stecklina wrote:
> Hello,
>
> [please don't drop cc list]
>
> Avi Kivity wrote:
> > Care to post a patch instead?
>
> Sure. Never hacked KVM, though. Is there a particular reason why the
> void *val argument to ioapic_mmio_read/_write is only dereferenced
> when ioapic->lock is not held?

No particular reason.

-- 
error compiling committee.c: too many arguments to function
Re: [PATCH] Allow aligned byte and word writes to IOAPIC registers.
On 11/22/2011 06:09 PM, Julian Stecklina wrote:
> This fixes byte accesses to IOAPIC_REG_SELECT as mandated by at least
> the ICH10 and Intel Series 5 chipset specs. It also makes
> ioapic_mmio_write consistent with ioapic_mmio_read, which also allows
> byte and word accesses.

Your patch indents with spaces, while Linux uses tabs for indents.

> Signed-off-by: Julian Stecklina j...@alien8.de
> ---
>  virt/kvm/ioapic.c |   17 +++++++++++++----
>  1 files changed, 13 insertions(+), 4 deletions(-)
>
> diff --git a/virt/kvm/ioapic.c b/virt/kvm/ioapic.c
> index 3eed61e..e94ef6ba 100644
> --- a/virt/kvm/ioapic.c
> +++ b/virt/kvm/ioapic.c
> @@ -332,9 +332,18 @@ static int ioapic_mmio_write(struct kvm_io_device *this, gpa_t addr, int len,
>  		 (void*)addr, len, val);
>  	ASSERT(!(addr & 0xf));	/* check alignment */
>
> -	if (len == 4 || len == 8)
> -		data = *(u32 *) val;
> -	else {
> +    switch (len) {
> +    case 8:
> +    case 4:
> +        data = *(u32 *) val;
> +        break;
> +    case 2:
> +        data = *(u16 *) val;
> +        break;
> +    case 1:
> +        data = *(u8 *) val;
> +        break;
> +    default:
>  		printk(KERN_WARNING "ioapic: Unsupported size %d\n", len);
>  		return 0;
>  	}
> @@ -343,7 +352,7 @@ static int ioapic_mmio_write(struct kvm_io_device *this, gpa_t addr, int len,
>  	spin_lock(&ioapic->lock);
>  	switch (addr) {
>  	case IOAPIC_REG_SELECT:
> -		ioapic->ioregsel = data;
> +		ioapic->ioregsel = data & 0xFF;	/* 8-bit register */
>  		break;
>  	case IOAPIC_REG_WINDOW:

This is a bit over-permissive in that it allows 8-byte writes to the
IOWIN register.  I guess it's okay though.

-- 
error compiling committee.c: too many arguments to function
SMP Example using KVM only ?
Hello Everyone,

I am currently using kvm-85 (I know it's pretty old ... :)) and it has
an smptest example in test/x86/.  This example works for 1 cpu but
doesn't work for more than one cpu.  This issue has already been
discussed on the KVM mailing list, and I 'think' this example has been
removed from the later versions of the KVM/QEMU-KVM code base.

I have also tested the latest qemu-kvm with multiple processors; it
works, but the code base is quite large.  What I am interested in is
using only KVM for my platform, with smp support, and without QEMU.

Is there a simple example for testing KVM with SMP support?  Or could
someone outline the changes required to make the smptest example work
in kvm-85?

Many Thanks,
Mian Hamayun
Re: nested virtualization on Intel and needed cpu flags to pass
On Tue, Nov 22, 2011, Gianluca Cecchi wrote about "nested virtualization on Intel and needed cpu flags to pass":
> I'm going to test nested virtualization on Intel with Fedora 16 host.
> ...
> [root at f16 ~]# uname -r
> 3.1.1-2.fc16.x86_64

This Linux version indeed has nested VMX, while

> # uname -r
> 2.6.40.6-0.fc15.x86_64

doesn't.

> Based on docs, the nested option is disabled by default on Intel

Indeed.  You need to load the kvm_intel module with the nested=1
option.  You also need to tell qemu that the emulated CPU will
advertise VMX - the simplest way to do this is with the -cpu host
option to qemu.  It seems you did all of this correctly.

> Preliminary results are not so good.
> I created an F16 guest (f16vm), with default options ..
> Inside the guest I create a windows xp with default values proposed by ..
> Until now not able to complete installation

Unfortunately, this is a known bug - which I promised to work on, but
haven't yet got around to :(  nested-vmx.txt explicitly lists under
known limitations that: "The current code supports running Linux guests
under KVM guests."

> more tests to come with newer kernel 3.1.1-2.fc16.x86_64
> But before proceeding, probably I need to adjust particular features
> to enable/disable about the cpu to pass to the f16vm guest...

I don't think this is the problem.  The underlying problem is that the
VMX spec is very complex, and ideally nested VMX should correctly
emulate each and every bit and each and every corner of it.  Because
all our testing was done with KVM L1s and Linux L2s, all the paths that
get exercised in that case were tested and their bugs ironed out - but
it is possible that Windows L2s end up exercising slightly different
code paths that still have bugs, and those need to be fixed.
Unfortunately, I doubt a newer kernel will solve your problem.  I think
there are genuine bugs that will have to be fixed.

> What are the current recommendations about cpu flags to pass to
> optimally use nested virtualization with intel at this time?
I don't think there are any such guidelines.  The only thing you really
need is -cpu qemu64,+vmx (replace qemu64 by whatever you want) to
advertise the existence of VMX.

-- 
Nadav Har'El | Wednesday, Nov 23 2011
n...@math.technion.ac.il | Phone +972-523-790466, ICQ 13349191
http://nadav.harel.org.il | When you lose, don't lose the lesson.
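For reference, the setup Nadav describes boils down to the following commands. Treat this as a sketch to adapt, not a verified recipe: the module reload assumes no VMs are running, and the qemu arguments beyond the -cpu flag are whatever your setup otherwise needs.

```
# Load kvm_intel with nested VMX support enabled
modprobe -r kvm_intel
modprobe kvm_intel nested=1

# Verify the option took effect
cat /sys/module/kvm_intel/parameters/nested

# Advertise VMX to the L1 guest (either form works)
qemu-system-x86_64 -enable-kvm -cpu host ...
qemu-system-x86_64 -enable-kvm -cpu qemu64,+vmx ...
```

With this in place, running "grep vmx /proc/cpuinfo" inside the L1 guest should show the flag, confirming that the guest can itself act as a hypervisor.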
Re: SMP Example using KVM only ?
On 11/23/2011 12:32 PM, Mian M. Hamayun wrote:
> I am currently using kvm-85 (I know its pretty old ... :)) and it has
> an smptest example in test/x86/. [...] What I am interested in is only
> using KVM for my platform, with smp support, and without QEMU.  Is
> there a simple example for testing KVM with SMP support?

You can take a look at kvm-tool, git://github.com/penberg/linux-kvm.git.
While it's not exactly trivial, it's a lot simpler than qemu.

-- 
error compiling committee.c: too many arguments to function
Re: [PATCH] Allow aligned byte and word writes to IOAPIC registers.
On Wed, Nov 23, 2011, Avi Kivity wrote about "Re: [PATCH] Allow aligned byte and word writes to IOAPIC registers.":
> On 11/22/2011 06:09 PM, Julian Stecklina wrote:
> > This fixes byte accesses to IOAPIC_REG_SELECT as mandated by at least
> > the ICH10 and Intel Series 5 chipset specs. It also makes
> > ioapic_mmio_write consistent with ioapic_mmio_read, which also allows
> > byte and word accesses.
>
> Your patch indents with spaces, while Linux uses tabs for indents.

You might want to run scripts/checkpatch.pl on your patches before
sending them - this will catch this and many other stylistic faux pas.

-- 
Nadav Har'El | Wednesday, Nov 23 2011
n...@math.technion.ac.il | Phone +972-523-790466, ICQ 13349191
http://nadav.harel.org.il | Why do doctors call what they do "practice"? Think about it.
Re: [PATCH] Allow aligned byte and word writes to IOAPIC registers.
On Wed, 2011-11-23 at 12:47 +0200, Avi Kivity wrote:
> On 11/22/2011 06:09 PM, Julian Stecklina wrote:
> > This fixes byte accesses to IOAPIC_REG_SELECT as mandated by at least
> > the ICH10 and Intel Series 5 chipset specs. It also makes
> > ioapic_mmio_write consistent with ioapic_mmio_read, which also allows
> > byte and word accesses.
>
> Your patch indents with spaces, while Linux uses tabs for indents.

True. I'll post a revised version in a minute.  I think this is my
second patch to Linux in total, so I am slowly getting there... Thanks
for the patience.

Regards,
Julian
Re: [PATCH 09/14] KVM: PPC: Add generic single register ioctls
On Wed, Nov 16, 2011 at 12:45:45AM +0100, Alexander Graf wrote:
> On 10.11.2011, at 18:35, Marcelo Tosatti wrote:
> > On Thu, Nov 10, 2011 at 05:49:42PM +0100, Alexander Graf wrote:
> > >  Documentation/virtual/kvm/api.txt |   47 ++++++++++
> > >  arch/powerpc/kvm/powerpc.c        |   51 ++++++++++++
> > >  include/linux/kvm.h               |   32 +++++++
> > >  3 files changed, 130 insertions(+), 0 deletions(-)
> >
> > I don't see the benefit of this generalization, the current structure
> > where context information is hardcoded in the data transmitted works
> > well.
>
> Well, unfortunately it doesn't work quite as well for us, because we
> are a much more evolving platform.  Also, there are a lot of edges and
> corners of the architecture that simply aren't implemented in KVM as
> of now.  I want to have something extensible enough that we don't
> break the ABI along the way.

You still have to agree on the format between userspace and kernel,
right?  If either party fails to conform to that, you're doomed.

> Yes, but we can shove registers back and forth without allocating 8kb
> of ram each time.  If all we need to do is poke one register, we poke
> one register.  If we poke 10, we poke the 10 we need to touch.
>
> > The problem with two interfaces is potential ambiguity: is register X
> > implemented through KVM_GET_ONE_REG and also through
> > KVM_GET_XYZ_REGISTER_SET?  If it's accessible by two interfaces, what
> > is the register writeback order?  Is there a plan to convert, etc.
>
> Why writeback order?  Register modification operations should always
> happen from the same thread the vCPU would run on at the end of the
> day, no?

Yes, but there is a specified order in which the registers must be
written back, in case there are dependencies between them (QEMU's x86
code does its best to document these dependencies).  All I'm saying is
that two distinct interfaces make it potentially confusing for the
programmer.  That said, it's up to Avi to decide.

> Avi wanted to go as far as making this a syscall interface even.

Ok, then full conversion work must be done.

> > If you agree these concerns are valid, perhaps this interface can be
> > PPC specific.
>
> I can always make it PPC specific, but I believe it would make sense
> as a generic interface for everyone, similar to how ENABLE_CAP can
> make sense for any arch.
>
> Alex
[PATCHv2] KVM: Allow aligned byte and word writes to IOAPIC registers.
This fixes byte accesses to IOAPIC_REG_SELECT as mandated by at least
the ICH10 and Intel Series 5 chipset specs. It also makes
ioapic_mmio_write consistent with ioapic_mmio_read, which also allows
byte and word accesses.

Signed-off-by: Julian Stecklina j...@alien8.de
---
 virt/kvm/ioapic.c |   15 ++++++++++++---
 1 files changed, 12 insertions(+), 3 deletions(-)

diff --git a/virt/kvm/ioapic.c b/virt/kvm/ioapic.c
index 3eed61e..71e2253 100644
--- a/virt/kvm/ioapic.c
+++ b/virt/kvm/ioapic.c
@@ -332,9 +332,18 @@ static int ioapic_mmio_write(struct kvm_io_device *this, gpa_t addr, int len,
 		 (void*)addr, len, val);
 	ASSERT(!(addr & 0xf));	/* check alignment */
 
-	if (len == 4 || len == 8)
+	switch (len) {
+	case 8:
+	case 4:
 		data = *(u32 *) val;
-	else {
+		break;
+	case 2:
+		data = *(u16 *) val;
+		break;
+	case 1:
+		data = *(u8 *) val;
+		break;
+	default:
 		printk(KERN_WARNING "ioapic: Unsupported size %d\n", len);
 		return 0;
 	}
@@ -343,7 +352,7 @@ static int ioapic_mmio_write(struct kvm_io_device *this, gpa_t addr, int len,
 	spin_lock(&ioapic->lock);
 	switch (addr) {
 	case IOAPIC_REG_SELECT:
-		ioapic->ioregsel = data;
+		ioapic->ioregsel = data & 0xFF;	/* 8-bit register */
 		break;
 	case IOAPIC_REG_WINDOW:
-- 
1.7.7.3
Re: [Qemu-devel] [RFC PATCH] Exporting Guest RAM information for NUMA binding
Hi!

On Mon, Nov 21, 2011 at 07:51:21PM -0600, Anthony Liguori wrote:
> Fundamentally, the entity that should be deciding what memory should
> be present and where it should be located is the kernel.  I'm
> fundamentally opposed to trying to make QEMU override the scheduler/mm
> by using cpu or memory pinning in QEMU.
>
> From what I can tell about ms_mbind(), it just uses process knowledge
> to bind specific areas of memory to a memsched group and lets the
> kernel decide what to do with that knowledge.  This is exactly the
> type of interface that QEMU should be using.  QEMU should tell the
> kernel enough information such that the kernel can make good
> decisions.  QEMU should not be the one making the decisions.

True, QEMU won't have to decide where the memory and vcpus should be
located (but hey, it wouldn't need to decide that even if you used
cpusets: you can use relative mbind with cpusets, and the admin or a
cpuset job scheduler could decide), but it's still QEMU making the
decision of what memory and which vcpu threads to ms_mbind/ms_tbind.
Think how you're going to create the input of those syscalls...  If it
weren't qemu deciding that, qemu wouldn't be required to scan the whole
host physical numa (cpu/memory) topology in order to create the input
arguments of ms_mbind/ms_tbind.

And when you migrate the VM to another host, the whole vtopology may be
counter-productive, because the kernel isn't automatically detecting
the numa affinity between threads, and the guest vtopology will stick
to whatever numa _physical_ topology was seen on the first node where
the VM was created.  I doubt that the assumption that all cloud nodes
will have the same physical numa topology is reasonable.

Furthermore, to get the same benefits that qemu gets on the host by
using ms_mbind/ms_tbind, every single guest application would have to
be modified to scan the guest vtopology and call ms_mbind/ms_tbind too
(or use the hard bindings, which is what we try to avoid).
I think it's unreasonable to expect all applications to use ms_mbind/ms_tbind in the guest, at best guest apps will use cpusets or wrappers, few apps will be modified for sys_ms_tbind/mbind. You can always have the supercomputer case with just one app that is optimized and a single VM spanning over the whole host, but in those scenarios hard bindings would work perfectly too. In my view the trouble of the numa hard bindings is not the fact they're hard and qemu has to also decide the location (in fact it doesn't need to decide the location if you use cpusets and relative mbinds). The bigger problem is the fact either the admin or the app developer has to explicitly scan the numa physical topology (both cpus and memory) and tell the kernel how much memory to bind to each thread. ms_mbind/ms_tbind only partially solve that problem. They're similar to the mbind MPOL_F_RELATIVE_NODES with cpusets, except you don't need an admin or a cpuset-job-scheduler (or a perl script) to redistribute the hardware resources. Now dealing with bindings isn't a big deal for qemu, in fact this API is pretty much ideal for qemu, but it won't make life substantially easier than if compared to hard bindings. Simply the management code that is now done with a perl script will have to be moved into the kernel. It looks like an incremental improvement compared to the relative mbind+cpuset, but I'm unsure if it's the best we could aim for and what we really need in virt considering we deal with VM migration too. The real long term design to me is not to add more syscalls, and initially handling the case of a process/VM spanning not more than one node in thread number and amount of memory. That's not too hard and in fact I have benchmarks for the scheduler already showing it to work pretty well (it's creating a too strict affinity but it can be relaxed to be more useful). 
Then later add some mechanism (simplest is the page fault at low frequency) to create a guest_vcpu_thread<->host_memory affinity and have a paravirtualized interface that tells the guest scheduler to group CPUs. If the guest scheduler runs free and is allowed to move threads randomly without any paravirtualized interface that controls the CPU thread migration in the guest scheduler, the thread-memory affinity on host will be hopeless. But with a paravirtualized interface to make a guest thread stick to vcpu0/1/2/3 and not going into vcpu4/5/6/7, will allow creating a more meaningful guest_thread<->physical_ram affinity on host through KVM page faults. And then this will work also with VM migration and without having to create a vtopology in guest. And for apps running in guest no paravirt will be needed of course. The reason paravirt would be needed for qemu-kvm with a fully automatic thread-memory affinity is that the vcpu threads are magic. What runs in the vcpu thread are guest threads. And those can move through the guest CPU scheduler from vcpu0 to vcpu7. If that happens and we have 4 physical cpus for each physical node, any affinity
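Andrea's point of comparison above is the existing relative-binding scheme: mbind(2) with MPOL_F_RELATIVE_NODES, where the nodemask is interpreted relative to the task's cpuset rather than as physical node numbers. Here is a minimal userspace sketch of that call shape; the helper names are ours, the constants are inlined from `<numaif.h>`, and on a kernel without NUMA support the syscall may simply fail:

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stddef.h>
#include <unistd.h>
#include <sys/syscall.h>

/* Values from <numaif.h>, inlined to stay self-contained. */
#define MPOL_BIND             2
#define MPOL_F_RELATIVE_NODES (1 << 14)	/* nodemask is cpuset-relative */

/* Build a single-node nodemask. */
static unsigned long node_to_mask(int node)
{
	return 1UL << node;
}

/* Bind [region, region+len) to a cpuset-relative node: the cpuset (or a
 * cpuset job scheduler) decides which physical node that really is, which
 * is exactly the "no topology scan needed" property discussed above.
 * Returns 0 on success, -errno otherwise. */
static long bind_region_relative(void *region, size_t len, int node)
{
	unsigned long nodemask = node_to_mask(node);

	if (syscall(SYS_mbind, region, len,
		    MPOL_BIND | MPOL_F_RELATIVE_NODES,
		    &nodemask, sizeof(nodemask) * 8, 0))
		return -errno;
	return 0;
}
```

The contrast with the proposed ms_mbind()/ms_tbind() is that here the caller still chooses the policy mode and nodemask itself; only the physical placement is delegated.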
Re: [PATCH 02/10] nEPT: MMU context for nested EPT
On Sun, Nov 13, 2011, Orit Wasserman wrote about Re: [PATCH 02/10] nEPT: MMU context for nested EPT: Maybe this patch can help, this is roughly what Avi wants (I hope) done very quickly. I'm sorry I don't have a setup to run nested VMX at the moment so I can't test it. Hi Orit, thanks for the code - I'm now working on incorporating something based on this into my patch. However, I do have a question:

+static int nested_ept_init_mmu_context(struct kvm_vcpu *vcpu)
+{
+	int r = kvm_init_shadow_mmu(vcpu, &vcpu->arch.mmu);
+
+	vcpu->arch.nested_mmu.gva_to_gpa = EPT_gva_to_gpa_nested;
+
+	return r;
+}

I didn't see you actually call this function anywhere - how is it supposed to work? The way I understand the current code, kvm_mmu_reset_context() calls init_kvm_mmu() which (in our case) calls init_kvm_nested_mmu(). I think the above gva_to_gpa setting should be there - right? It seems we need a fifth case in that function. But at that point in mmu.c, how will I be able to check if this is the nested EPT case? Do you have any suggestion? Thanks, Nadav. -- Nadav Har'El| Wednesday, Nov 23 2011, n...@math.technion.ac.il |- Phone +972-523-790466, ICQ 13349191 |This message contains 100% recycled http://nadav.harel.org.il |characters. -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 1/2] KVM: Use kmemdup() instead of kzalloc/memcpy
Switch to kmemdup() in two places to shorten the code and avoid possible bugs.

Cc: Avi Kivity a...@redhat.com
Cc: Marcelo Tosatti mtosa...@redhat.com
Signed-off-by: Sasha Levin levinsasha...@gmail.com
---
 arch/x86/kvm/x86.c  | 4 ++--
 virt/kvm/kvm_main.c | 7 +++----
 2 files changed, 5 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 9eff4af..79254f3 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -3499,10 +3499,10 @@ int kvm_vm_ioctl_get_dirty_log(struct kvm *kvm,
 		memset(dirty_bitmap, 0, n);
 
 		r = -ENOMEM;
-		slots = kzalloc(sizeof(struct kvm_memslots), GFP_KERNEL);
+		slots = kmemdup(kvm->memslots, sizeof(*kvm->memslots), GFP_KERNEL);
 		if (!slots)
 			goto out;
-		memcpy(slots, kvm->memslots, sizeof(struct kvm_memslots));
+
 		slots->memslots[log->slot].dirty_bitmap = dirty_bitmap;
 		slots->generation++;
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 4c5b9a2..6f0a5e1 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2538,13 +2538,12 @@ int kvm_io_bus_unregister_dev(struct kvm *kvm, enum kvm_bus bus_idx,
 	int i, r;
 	struct kvm_io_bus *new_bus, *bus;
 
-	new_bus = kzalloc(sizeof(struct kvm_io_bus), GFP_KERNEL);
+	bus = kvm->buses[bus_idx];
+
+	new_bus = kmemdup(bus, sizeof(*bus), GFP_KERNEL);
 	if (!new_bus)
 		return -ENOMEM;
 
-	bus = kvm->buses[bus_idx];
-	memcpy(new_bus, bus, sizeof(struct kvm_io_bus));
-
 	r = -ENOENT;
 	for (i = 0; i < new_bus->dev_count; i++)
 		if (new_bus->range[i].dev == dev) {
-- 
1.7.8.rc3
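For readers without the kernel tree at hand, kmemdup() is simply allocate-then-copy fused into one call, which is why each hunk above can drop a separate memcpy(). A userspace sketch of the same helper (the kernel version additionally takes a GFP allocation-flags argument):

```c
#include <stdlib.h>
#include <string.h>

/* Userspace analog of the kernel's kmemdup(): one call replaces the
 * kzalloc()/memcpy() pair the patch removes. Returns NULL on failure,
 * so only one error path needs checking. */
static void *memdup(const void *src, size_t len)
{
	void *p = malloc(len);

	if (p)
		memcpy(p, src, len);
	return p;
}
```

Besides being shorter, the fused form removes the window in which the destination exists but is zero-filled rather than a copy, which is the class of "possible bugs" the commit message alludes to.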
[PATCH 2/2] KVM: Use memdup_user instead of kmalloc/copy_from_user
Switch to using memdup_user when possible. This makes the code smaller and more compact, and prevents errors.

Cc: Avi Kivity a...@redhat.com
Cc: Marcelo Tosatti mtosa...@redhat.com
Signed-off-by: Sasha Levin levinsasha...@gmail.com
---
 arch/x86/kvm/x86.c  | 82 +-
 virt/kvm/kvm_main.c | 29 +++--
 2 files changed, 47 insertions(+), 64 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 79254f3..ee09ae4 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1362,12 +1362,11 @@ static int xen_hvm_config(struct kvm_vcpu *vcpu, u64 data)
 	if (page_num >= blob_size)
 		goto out;
 	r = -ENOMEM;
-	page = kzalloc(PAGE_SIZE, GFP_KERNEL);
-	if (!page)
+	page = memdup_user(blob_addr + (page_num * PAGE_SIZE), PAGE_SIZE);
+	if (IS_ERR(page)) {
+		r = PTR_ERR(page);
 		goto out;
-	r = -EFAULT;
-	if (copy_from_user(page, blob_addr + (page_num * PAGE_SIZE), PAGE_SIZE))
-		goto out_free;
+	}
 	if (kvm_write_guest(kvm, page_addr, page, PAGE_SIZE))
 		goto out_free;
 	r = 0;
@@ -2041,15 +2040,12 @@ static int msr_io(struct kvm_vcpu *vcpu, struct kvm_msrs __user *user_msrs,
 	if (msrs.nmsrs >= MAX_IO_MSRS)
 		goto out;
-	r = -ENOMEM;
 	size = sizeof(struct kvm_msr_entry) * msrs.nmsrs;
-	entries = kmalloc(size, GFP_KERNEL);
-	if (!entries)
+	entries = memdup_user(user_msrs->entries, size);
+	if (IS_ERR(entries)) {
+		r = PTR_ERR(entries);
 		goto out;
-
-	r = -EFAULT;
-	if (copy_from_user(entries, user_msrs->entries, size))
-		goto out_free;
+	}
 	r = n = __msr_io(vcpu, &msrs, entries, do_msr);
 	if (r < 0)
@@ -3043,13 +3039,12 @@ long kvm_arch_vcpu_ioctl(struct file *filp,
 	r = -EINVAL;
 	if (!vcpu->arch.apic)
 		goto out;
-	u.lapic = kmalloc(sizeof(struct kvm_lapic_state), GFP_KERNEL);
-	r = -ENOMEM;
-	if (!u.lapic)
-		goto out;
-	r = -EFAULT;
-	if (copy_from_user(u.lapic, argp, sizeof(struct kvm_lapic_state)))
+	u.lapic = memdup_user(argp, sizeof(*u.lapic));
+	if (IS_ERR(u.lapic)) {
+		r = PTR_ERR(u.lapic);
 		goto out;
+	}
+
 	r = kvm_vcpu_ioctl_set_lapic(vcpu, u.lapic);
 	if (r)
 		goto out;
@@ -3228,14 +3223,11 @@ long 
kvm_arch_vcpu_ioctl(struct file *filp,
 		break;
 	}
 	case KVM_SET_XSAVE: {
-		u.xsave = kzalloc(sizeof(struct kvm_xsave), GFP_KERNEL);
-		r = -ENOMEM;
-		if (!u.xsave)
-			break;
-
-		r = -EFAULT;
-		if (copy_from_user(u.xsave, argp, sizeof(struct kvm_xsave)))
-			break;
+		u.xsave = memdup_user(argp, sizeof(*u.xsave));
+		if (IS_ERR(u.xsave)) {
+			r = PTR_ERR(u.xsave);
+			goto out;
+		}
 
 		r = kvm_vcpu_ioctl_x86_set_xsave(vcpu, u.xsave);
 		break;
@@ -3256,15 +3248,11 @@ long kvm_arch_vcpu_ioctl(struct file *filp,
 		break;
 	}
 	case KVM_SET_XCRS: {
-		u.xcrs = kzalloc(sizeof(struct kvm_xcrs), GFP_KERNEL);
-		r = -ENOMEM;
-		if (!u.xcrs)
-			break;
-
-		r = -EFAULT;
-		if (copy_from_user(u.xcrs, argp,
-				   sizeof(struct kvm_xcrs)))
-			break;
+		u.xcrs = memdup_user(argp, sizeof(*u.xcrs));
+		if (IS_ERR(u.xcrs)) {
+			r = PTR_ERR(u.xcrs);
+			goto out;
+		}
 
 		r = kvm_vcpu_ioctl_x86_set_xcrs(vcpu, u.xcrs);
 		break;
@@ -3659,14 +3647,14 @@ long kvm_arch_vm_ioctl(struct file *filp,
 	}
 	case KVM_GET_IRQCHIP: {
 		/* 0: PIC master, 1: PIC slave, 2: IOAPIC */
-		struct kvm_irqchip *chip = kmalloc(sizeof(*chip), GFP_KERNEL);
+		struct kvm_irqchip *chip;
 
-		r = -ENOMEM;
-		if (!chip)
+		chip = memdup_user(argp, sizeof(*chip));
+		if (IS_ERR(chip)) {
+			r = PTR_ERR(chip);
 			goto out;
-		r = -EFAULT;
-		if (copy_from_user(chip, argp, sizeof *chip))
-			goto get_irqchip_out;
+		}
+
 		r = -ENXIO;
 		if (!irqchip_in_kernel(kvm))
 			goto get_irqchip_out;
@@ -3685,14 +3673,14 @@ long kvm_arch_vm_ioctl(struct file *filp,
 	}
 	case KVM_SET_IRQCHIP: {
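memdup_user() differs from kmemdup() in one important way that drives the IS_ERR()/PTR_ERR() pattern in every hunk above: instead of returning NULL, it encodes the errno (-ENOMEM or -EFAULT) in the pointer value itself, so one return covers both the allocation and the copy failure. A minimal userspace model of that kernel idiom:

```c
/* Minimal model of the kernel's ERR_PTR/PTR_ERR/IS_ERR macros, which
 * memdup_user() relies on: an error code rides in the pointer value,
 * occupying the last (never-valid) 4095 bytes of the address space. */
#define MAX_ERRNO 4095

static inline void *ERR_PTR(long error)
{
	return (void *)error;
}

static inline long PTR_ERR(const void *ptr)
{
	return (long)ptr;
}

static inline int IS_ERR(const void *ptr)
{
	return (unsigned long)ptr >= (unsigned long)-MAX_ERRNO;
}
```

This is why each converted call site replaces two error assignments (`r = -ENOMEM` and `r = -EFAULT`) with a single `r = PTR_ERR(...)`.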
Re: [PATCHv3 RFC] virtio-pci: flexible configuration layout
On Wed, Nov 23, 2011 at 10:46:40AM +0200, Michael S. Tsirkin wrote: On Wed, Nov 23, 2011 at 01:02:22PM +1030, Rusty Russell wrote: On Tue, 22 Nov 2011 20:36:22 +0200, Michael S. Tsirkin m...@redhat.com wrote: Here's an updated version. I'm alternating between updating the spec and the driver, spec update to follow. Don't touch the spec yet, we have a long way to go :( I want the ability for driver to set the ring size, and the device to set the alignment. Did you mean driver to be able to set the alignment? This is what BIOS guys want - after BIOS completes, guest driver gets handed control and sets its own alignment to save memory. That's a bigger change than you have here. Why can't we just add the new registers at the end? With the new capability, we have as much space as we like for that. Didn't have the time to finish the patch today, but just to clarify, we can apply something like the below on top of my patch, and then config_len can be checked to see whether the new fields like alignment are available. We probably can make the alignment smaller and save some memory - the default value could set the default alignment. Ring size, naturally, can just be made writeable - BIOS can put a small value there, but linux guest would just use the defaults so no need for any code changes at all. As a bonus, the capability length is verified to be large enough. The change's pretty small, isn't it? 
diff --git a/drivers/virtio/virtio_pci.c b/drivers/virtio/virtio_pci.c
index 681347b..39433d3 100644
--- a/drivers/virtio/virtio_pci.c
+++ b/drivers/virtio/virtio_pci.c
@@ -37,8 +37,9 @@ struct virtio_pci_device
 	struct virtio_device vdev;
 	struct pci_dev *pci_dev;
 
-	/* the IO address for the common PCI config space */
+	/* the IO address and length for the common PCI config space */
 	void __iomem *ioaddr;
+	unsigned config_len;
 	/* the IO address for device specific config */
 	void __iomem *ioaddr_device;
 	/* the IO address to use for notifications operations */
@@ -101,22 +102,24 @@ static void __iomem *virtio_pci_legacy_map(struct virtio_pci_device *vp_dev)
 #endif
 
 /*
- * With PIO, device-specific config moves as MSI-X is enabled/disabled.
- * Use this accessor to keep pointer to that config in sync.
+ * With PIO, device-specific config moves as MSI-X is enabled/disabled,
+ * configuration region length changes as well. Use this accessor to keep
+ * pointer to that config and common config length, in sync.
 */
 static void virtio_pci_set_msix_enabled(struct virtio_pci_device *vp_dev, int enabled)
 {
 	void __iomem *m;
 	vp_dev->msix_enabled = enabled;
 	m = virtio_pci_legacy_map(vp_dev);
-	if (m)
+	if (m) {
 		vp_dev->ioaddr_device = m + VIRTIO_PCI_CONFIG(vp_dev);
-	else
+		vp_dev->config_len = VIRTIO_PCI_CONFIG(vp_dev);
+	} else
 		vp_dev->ioaddr_device = vp_dev->device_map;
 }
 
 static void __iomem *virtio_pci_map_cfg(struct virtio_pci_device *vp_dev, u8 cap_id,
-					u32 align)
+					u32 align, unsigned *lenp)
 {
 	u32 size;
 	u32 offset;
@@ -160,12 +163,16 @@ static void __iomem *virtio_pci_map_cfg(struct virtio_pci_device *vp_dev, u8 cap
 	offset &= VIRTIO_PCI_CAP_CFG_OFF_MASK;
 	/* Align offset appropriately */
 	offset &= ~(align - 1);
+	if (lenp && size < *lenp)
+		goto err;
 	/* It's possible for a device to supply a huge config space,
 	 * but we'll never need to map more than a page ATM. 
 	 */
 	p = pci_iomap_range(dev, bir, offset, size, PAGE_SIZE);
 	if (!p)
 		dev_err(&vp_dev->vdev.dev, "Unable to map virtio pci memory");
+	if (lenp)
+		*lenp = min(size, PAGE_SIZE);
 	return p;
 err:
 	dev_err(&vp_dev->vdev.dev, "Unable to parse virtio pci capability");
@@ -188,22 +195,24 @@ static void virtio_pci_iounmap(struct virtio_pci_device *vp_dev)
 
 static int virtio_pci_iomap(struct virtio_pci_device *vp_dev)
 {
+	unsigned common_len = VIRTIO_PCI_COMMON_CFG_MINSIZE;
 	vp_dev->isr_map = virtio_pci_map_cfg(vp_dev,
 					     VIRTIO_PCI_CAP_ISR_CFG,
-					     sizeof(u8));
+					     sizeof(u8), NULL);
 	vp_dev->notify_map = virtio_pci_map_cfg(vp_dev,
 						VIRTIO_PCI_CAP_NOTIFY_CFG,
-						sizeof(u16));
+						sizeof(u16), NULL);
 	vp_dev->common_map = virtio_pci_map_cfg(vp_dev,
 						VIRTIO_PCI_CAP_COMMON_CFG,
-						sizeof(u32));
+						sizeof(u32), &common_len);
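The new lenp parameter above boils down to two small checks: align the capability offset down to the caller's boundary, and refuse a capability whose advertised size is smaller than the region the driver needs. Extracted as standalone helpers for illustration (the names are ours, not from the patch):

```c
#include <stdint.h>

/* Align a capability offset down to a power-of-two boundary, as
 * virtio_pci_map_cfg() does with `offset &= ~(align - 1)`. */
static uint32_t cap_align_offset(uint32_t offset, uint32_t align)
{
	return offset & ~(align - 1);
}

/* The added sanity check: the capability's advertised size must cover
 * the fields the caller wants (the `size < *lenp` test in the patch).
 * Returns nonzero when the region is large enough. */
static int cap_size_ok(uint32_t size, uint32_t needed)
{
	return size >= needed;
}
```

This is what lets the driver probe `config_len` later to decide whether optional fields such as a writable alignment are present at all.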
Re: [PATCH 02/10] nEPT: MMU context for nested EPT
On Wed, Nov 23, 2011, Nadav Har'El wrote about Re: [PATCH 02/10] nEPT: MMU context for nested EPT:

+static int nested_ept_init_mmu_context(struct kvm_vcpu *vcpu)
+{
+	int r = kvm_init_shadow_mmu(vcpu, &vcpu->arch.mmu);
+
+	vcpu->arch.nested_mmu.gva_to_gpa = EPT_gva_to_gpa_nested;
+
+	return r;
+}

.. I didn't see you actually call this function anywhere - how is it supposed to work? .. It seems we need a fifth case in that function. .. On second thought, why is this modifying nested_mmu.gva_to_gpa, and not mmu.gva_to_gpa? Isn't the nested_mmu the L2 CR3, which is *not* in EPT format, and what we really want to change is the outer mmu, which is EPT12 and is indeed in EPT format? Or am I missing something? Thanks, Nadav. -- Nadav Har'El| Wednesday, Nov 23 2011, n...@math.technion.ac.il |- Phone +972-523-790466, ICQ 13349191 |My password is my dog's name. His name is http://nadav.harel.org.il |a#j!4@h, but I change it every month.
[PATCH] KVM: Move cpuid code to new file
The cpuid code has grown; put it into a separate file.

Signed-off-by: Avi Kivity a...@redhat.com
---
 arch/x86/kvm/Makefile |   2 +-
 arch/x86/kvm/cpuid.c  | 629
 arch/x86/kvm/cpuid.h  |  46
 arch/x86/kvm/lapic.c  |   1 +
 arch/x86/kvm/vmx.c    |   1 +
 arch/x86/kvm/x86.c    | 638
 arch/x86/kvm/x86.h    |   5 +-
 7 files changed, 683 insertions(+), 639 deletions(-)
 create mode 100644 arch/x86/kvm/cpuid.c
 create mode 100644 arch/x86/kvm/cpuid.h

diff --git a/arch/x86/kvm/Makefile b/arch/x86/kvm/Makefile
index f15501f..161b76a 100644
--- a/arch/x86/kvm/Makefile
+++ b/arch/x86/kvm/Makefile
@@ -12,7 +12,7 @@ kvm-$(CONFIG_IOMMU_API) += $(addprefix ../../../virt/kvm/, iommu.o)
 kvm-$(CONFIG_KVM_ASYNC_PF) += $(addprefix ../../../virt/kvm/, async_pf.o)
 
 kvm-y			+= x86.o mmu.o emulate.o i8259.o irq.o lapic.o \
-			   i8254.o timer.o
+			   i8254.o timer.o cpuid.o
 kvm-intel-y		+= vmx.o
 kvm-amd-y		+= svm.o

diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
new file mode 100644
index 0000000..bbaa6d8
--- /dev/null
+++ b/arch/x86/kvm/cpuid.c
@@ -0,0 +1,629 @@
+/*
+ * Kernel-based Virtual Machine driver for Linux
+ * cpuid support routines
+ *
+ * derived from arch/x86/kvm/x86.c
+ *
+ * Copyright 2011 Red Hat, Inc. and/or its affiliates.
+ * Copyright IBM Corporation, 2008
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2. See
+ * the COPYING file in the top-level directory. 
+ *
+ */
+
+#include <linux/kvm_host.h>
+#include <linux/module.h>
+#include <asm/user.h>
+#include <asm/xsave.h>
+#include "cpuid.h"
+#include "lapic.h"
+#include "mmu.h"
+#include "trace.h"
+
+void kvm_update_cpuid(struct kvm_vcpu *vcpu)
+{
+	struct kvm_cpuid_entry2 *best;
+	struct kvm_lapic *apic = vcpu->arch.apic;
+	u32 timer_mode_mask;
+
+	best = kvm_find_cpuid_entry(vcpu, 1, 0);
+	if (!best)
+		return;
+
+	/* Update OSXSAVE bit */
+	if (cpu_has_xsave && best->function == 0x1) {
+		best->ecx &= ~(bit(X86_FEATURE_OSXSAVE));
+		if (kvm_read_cr4_bits(vcpu, X86_CR4_OSXSAVE))
+			best->ecx |= bit(X86_FEATURE_OSXSAVE);
+	}
+
+	if (boot_cpu_data.x86_vendor == X86_VENDOR_INTEL &&
+	    best->function == 0x1) {
+		best->ecx |= bit(X86_FEATURE_TSC_DEADLINE_TIMER);
+		timer_mode_mask = 3 << 17;
+	} else
+		timer_mode_mask = 1 << 17;
+
+	if (apic)
+		apic->lapic_timer.timer_mode_mask = timer_mode_mask;
+}
+
+static int is_efer_nx(void)
+{
+	unsigned long long efer = 0;
+
+	rdmsrl_safe(MSR_EFER, &efer);
+	return efer & EFER_NX;
+}
+
+static void cpuid_fix_nx_cap(struct kvm_vcpu *vcpu)
+{
+	int i;
+	struct kvm_cpuid_entry2 *e, *entry;
+
+	entry = NULL;
+	for (i = 0; i < vcpu->arch.cpuid_nent; ++i) {
+		e = &vcpu->arch.cpuid_entries[i];
+		if (e->function == 0x80000001) {
+			entry = e;
+			break;
+		}
+	}
+	if (entry && (entry->edx & (1 << 20)) && !is_efer_nx()) {
+		entry->edx &= ~(1 << 20);
+		printk(KERN_INFO "kvm: guest NX capability removed\n");
+	}
+}
+
+/* when an old userspace process fills a new kernel module */
+int kvm_vcpu_ioctl_set_cpuid(struct kvm_vcpu *vcpu,
+			     struct kvm_cpuid *cpuid,
+			     struct kvm_cpuid_entry __user *entries)
+{
+	int r, i;
+	struct kvm_cpuid_entry *cpuid_entries;
+
+	r = -E2BIG;
+	if (cpuid->nent > KVM_MAX_CPUID_ENTRIES)
+		goto out;
+	r = -ENOMEM;
+	cpuid_entries = vmalloc(sizeof(struct kvm_cpuid_entry) * cpuid->nent);
+	if (!cpuid_entries)
+		goto out;
+	r = -EFAULT;
+	if (copy_from_user(cpuid_entries, entries,
+			   cpuid->nent * sizeof(struct kvm_cpuid_entry)))
+		goto out_free;
+	for (i = 0; i < cpuid->nent; i++) {
+		vcpu->arch.cpuid_entries[i].function = cpuid_entries[i].function;
+		vcpu->arch.cpuid_entries[i].eax = cpuid_entries[i].eax;
+		vcpu->arch.cpuid_entries[i].ebx = cpuid_entries[i].ebx;
+		vcpu->arch.cpuid_entries[i].ecx = cpuid_entries[i].ecx;
+		vcpu->arch.cpuid_entries[i].edx = cpuid_entries[i].edx;
+		vcpu->arch.cpuid_entries[i].index = 0;
+		vcpu->arch.cpuid_entries[i].flags = 0;
+		vcpu->arch.cpuid_entries[i].padding[0] = 0;
+		vcpu->arch.cpuid_entries[i].padding[1] = 0;
+		vcpu->arch.cpuid_entries[i].padding[2] = 0;
+	}
+	vcpu->arch.cpuid_nent = cpuid->nent;
+	cpuid_fix_nx_cap(vcpu);
+	r = 0;
+	kvm_apic_set_version(vcpu);
+	kvm_x86_ops->cpuid_update(vcpu);
+
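The cpuid_fix_nx_cap() routine moved above is a good example of the kind of feature-bit fixup this new file collects: bit 20 of CPUID leaf 0x80000001 EDX advertises NX, and must be cleared for the guest when the host's EFER lacks EFER.NX. The core test, lifted into a standalone function purely for illustration:

```c
#include <stdint.h>

#define CPUID_EDX_NX (1u << 20)	/* NX bit in CPUID 0x80000001 EDX */

/* Mirror of the condition in cpuid_fix_nx_cap(): drop the guest-visible
 * NX bit when the host EFER does not have EFER.NX set, leaving all
 * other feature bits untouched. */
static uint32_t fix_nx_cap(uint32_t edx, int host_has_nx)
{
	if ((edx & CPUID_EDX_NX) && !host_has_nx)
		edx &= ~CPUID_EDX_NX;
	return edx;
}
```

The same shape recurs in kvm_update_cpuid() for OSXSAVE: the bit a guest sees must track what the host/vcpu state can actually deliver.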
Re: [Qemu-devel] [RFC PATCH] Exporting Guest RAM information for NUMA binding
On 11/23/2011 04:03 PM, Andrea Arcangeli wrote: [...] Furthermore to get the same benefits that qemu gets on host by using ms_mbind/ms_tbind, every single guest application should be modified to scan the guest vtopology and call ms_mbind/ms_tbind too (or use the hard bindings which is what we try to avoid). 
[...] ms_mbind/ms_tbind only partially solve that problem. They're similar to the mbind MPOL_F_RELATIVE_NODES with cpusets, except you don't need an admin or a cpuset-job-scheduler (or a perl script) to redistribute the hardware resources. Well yeah, of course the guest needs to see some topology. I don't see why we'd have to actually scan the host for this though. All we need to tell the kernel is this memory region is close to that thread. So if you define -numa node,mem=1G,cpus=0 then QEMU should be able to tell the kernel that this GB of RAM actually is close to that vCPU thread. Of course the admin still needs to decide how to split up memory. That's the deal with emulating real hardware. You get the interfaces hardware gets :). However, if you follow a reasonable default strategy such as numa splitting your RAM into equal chunks between guest vCPUs you're probably close enough to optimal usage models. Or at least you could have a close enough approximation of how this mapping could work for the _guest_ regardless of the host and when you migrate it somewhere else it should also work reasonably well. 
[snip]
Re: [Qemu-devel] [RFC PATCH] Exporting Guest RAM information for NUMA binding
On Wed, Nov 23, 2011 at 07:34:37PM +0100, Alexander Graf wrote: So if you define -numa node,mem=1G,cpus=0 then QEMU should be able to tell the kernel that this GB of RAM actually is close to that vCPU thread. Of course the admin still needs to decide how to split up memory. That's the deal with emulating real hardware. You get the interfaces hardware gets :). However, if you follow a reasonable default strategy such as The problem is how do you decide the parameter -numa node,mem=1G,cpus=0. Real hardware exists when the VM starts. But then the VM can be migrated. Or the VM may have to be split in the middle of two nodes regardless of the -numa node,mem=1G,cpus=0-1 to avoid swapping so there may be two 512M nodes with 1 cpu each instead of 1 NUMA node with 1G and 2 cpus. Especially by relaxing the hard bindings and using ms_mbind/tbind, the vtopology you create won't match real hardware because you don't know the real hardware that you will get. numa splitting your RAM into equal chunks between guest vCPUs you're probably close enough to optimal usage models. Or at least you could have a close enough approximation of how this mapping could work for the _guest_ regardless of the host and when you migrate it somewhere else it should also work reasonably well. If you enforce these assumptions and the admin has still again to choose the -numa node,mem=1G parameters after checking the physical numa topology and make sure the vtopology can match the real physical topology and that the guest runs on real hardware, it's not very different from using hard bindings; hard bindings enforce the real hardware so there's no way it can go wrong. I mean you still need some NUMA topology knowledge outside of QEMU to be sure you get real hardware out of the vtopology. 
Ok cpusets would restrict the availability of idle cpus, so there's a slight improvement in maximizing idle CPU usage (it's better to run 50% slower than not to run at all), but that could be achieved also by a relax of the cpuset semantics (if that's not already available). So you want to basically dynamically create NUMA topologies from the runtime behavior of the guest? What if it changes over time? Yes, just I wouldn't call it NUMA topologies or it looks like a vtopology, and the vtopology is something fixed at boot time, the sort of thing created by using a command line like -numa node,mem=1G,cpu=0. I wouldn't try to give the guest any memory topology, just the vcpus are magic threads that don't behave like normal threads in memory affinity terms. So they need a paravirtualization layer to be dealt with. The fact vcpu0 accessed 10 pages right now, doesn't mean there's a real affinity between vcpu0 and those N pages if the guest scheduler is free to migrate anything anywhere. The guest thread running in the vcpu0 may be migrated to the vcpu7 which may belong to a different physical node. So if we want to automatically detect thread-memory affinity between vcpus and guest memory, we also need to group the guest threads in certain vcpus and prevent those cpu migrations. The thread in the guest would better stick to vcpu0/1/2/3 (instead of migration to vcpu4/5/6/7) if vcpu0/1/2/3 have affinity with the same memory which fits in one node. That can only be told dynamically from KVM to the guest OS scheduler as we may migrate virtual machines or we may move the memory. Take the example of 3 VM of 2.5G ram each on a 8G system with 2 nodes (4G per node). Suppose one of the two VM that have all the 2.5G allocated in a single node quits. Then the VM that was split across the two nodes will memory-migrated to fit in one node. 
So far so good, but then KVM should tell the guest OS scheduler that it should stop grouping vcpus and all vcpus are equal and all guest threads can be migrated to any vcpu. I don't see a way to do those things with a vtopology fixed at boot. I actually like the idea of just telling the kernel how close memory will be to a thread. Sure, you can handle this basically by shoving your scheduler into user space, but isn't managing processes what a kernel is supposed to do in the first place? Assume you're not in virt and you just want to tell thread A uses memory range A and thread B uses memory range B. If the memory range A fits in one node you're ok. But if memory A now spans over two nodes (maybe to avoid swapping), you're still screwed and you won't give enough information to the kernel on the real runtime affinity that thread A has on the memory. Now if statistically the accesses to memory A are all equal, it won't make a difference but if you end up using half of memory A 99% of the time, it will not work as well. This is especially a problem for KVM because statistically the accesses to memory A given to vcpu0 won't be equal. 50% of it may not be used at all and just have pagecache sitting there, or even free memory, so we can do better if memory A is split across two nodes to avoid swapping, if we
Re: [RFC PATCH 11/11] KVM: PPC: Eliminate global spinlock in kvmppc_h_enter
On Thu, Nov 17, 2011 at 10:55:49AM +1100, Paul Mackerras wrote: From dfd5bcfac841f8a36593edf60d9fb15e0d633287 Mon Sep 17 00:00:00 2001 From: Paul Mackerras pau...@samba.org Date: Mon, 14 Nov 2011 13:30:38 +1100 Subject: Currently, kvmppc_h_enter takes a spinlock that is global to the guest, kvm->mmu_lock, in order to check for pending PTE invalidations safely. On some workloads, kvmppc_h_enter is called heavily and the use of a global spinlock could compromise scalability. We already use a per-guest-page spinlock in the form of the bit spinlock on the rmap chain, and this gives us synchronization with the PTE invalidation side, which also takes the bit spinlock on the rmap chain for each page being invalidated. Thus it is sufficient to check for pending invalidations while the rmap chain bit spinlock is held. However, now we require barriers in mmu_notifier_retry() and in the places where mmu_notifier_count and mmu_notifier_seq are updated, since we can now call mmu_notifier_retry() concurrently with updates to those fields. Signed-off-by: Paul Mackerras pau...@samba.org --- Cc'd to kvm@vger.kernel.org for review of the generic kvm changes. Looks good to me.
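For readers outside the MMU-notifier code: the pattern Paul is adding barriers to is a sequence/count check. A fault handler samples mmu_notifier_seq before building a PTE, and mmu_notifier_retry() makes it throw the work away if an invalidation is in flight or completed since the sample. A lock-free userspace sketch of that check (simplified; the real kernel code additionally relies on the memory barriers this patch discusses and on kvm->mmu_lock on the update side):

```c
#include <stdatomic.h>

struct mmu_state {
	atomic_int notifier_count;	/* >0 while an invalidation runs */
	atomic_int notifier_seq;	/* bumped as each invalidation ends */
};

/* Sketch of mmu_notifier_retry(): retry the fault if an invalidation
 * is currently in progress, or if one completed after `seq_seen`
 * was sampled at the start of the fault. */
static int mmu_notifier_retry_sketch(struct mmu_state *s, int seq_seen)
{
	if (atomic_load(&s->notifier_count))
		return 1;
	if (atomic_load(&s->notifier_seq) != seq_seen)
		return 1;
	return 0;
}
```

The point of the patch is that once this check runs under the per-page rmap bit spinlock instead of the global lock, the reads can race with the writers, hence the new barriers.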
Re: [RFC PATCH 0/11] KVM: PPC: Update Book3S HV memory handling
On Sat, Nov 19, 2011 at 08:54:24AM +1100, Paul Mackerras wrote: On Fri, Nov 18, 2011 at 02:57:11PM +0100, Alexander Graf wrote: This touches areas that I'm sure non-PPC people would want to see as well. Could you please CC kvm@vger too next time? Avi, Marcelo, mind to review some of the bits in this patch set? :) I did cc the last patch (the one that adds barriers in the MMU notifier sequence/count logic) to kvm@vger. Do you mean I should cc the whole series? The only other thing touching generic code is the addition of the KVM_MEMSLOT_IO flag in the first patch. I don't see such a flag anywhere in the patches in this thread? I'm hoping the extra barriers will be OK since they are no-ops on x86. In fact I now think that the smp_wmbs I added to kvm_mmu_notifier_invalidate_page and kvm_mmu_notifier_change_pte aren't in fact necessary, since it's only necessary to ensure that the sequence number increase is visible before the point where kvm_unmap_hva or kvm_set_spte_hva unlock the bitlock on the first rmap chain they look at, which will be ensured anyway by the barrier before the unlock. Paul.
Re: [PATCHv3 RFC] virtio-pci: flexible configuration layout
On Wed, 23 Nov 2011 11:38:40 +0200, Sasha Levin levinsasha...@gmail.com wrote: On Wed, 2011-11-23 at 10:49 +0200, Michael S. Tsirkin wrote: On Wed, Nov 23, 2011 at 01:02:22PM +1030, Rusty Russell wrote: +/* Fields in VIRTIO_PCI_CAP_COMMON_CFG: */ +struct virtio_pci_common_cfg { + /* About the whole device. */ + __u64 device_features; /* read-only */ + __u64 guest_features; /* read-write */ We currently require atomic accesses to common fields. Some architectures might not have such for 64 bit, so these need to be split I think ... We can consider stealing the feature implementation from virtio-mmio: Use the first 32 bits as a selector and the last as the features themselves. It's more complex to work with, but it provides 2**32 * 32 feature bits (which should be enough for a while) and it solves the atomic access issue. That works. I don't expect we'll need more than 64 features given that virtio_net hasn't seen a new one in over a year, but it's gone from 5 to 18 in 4 years, so another 32 bits at that rate only gets us another decade. Unfortunately, it doesn't solve the queue_address problem. We currently multiply the (32-bit) address by 4096 (the alignment). If drivers are going to reduce alignment, that makes it trickier, hence the change here to a 64-bit field. Simplicity. We could cheat and assert that a 32-bit write there implies 0 in the upper bits, and hope that all 64 bit platforms can do a 64-bit atomic write. Or insist the value not be interpreted until the guest writes its (first) feature bit. Perhaps we really need an "I'm done configuring!" trigger, now we can't use the feature bit field for that. Thoughts welcome... Rusty.
Re: [PATCHv3 RFC] virtio-pci: flexible configuration layout
On Wed, 23 Nov 2011 10:46:41 +0200, Michael S. Tsirkin m...@redhat.com wrote: On Wed, Nov 23, 2011 at 01:02:22PM +1030, Rusty Russell wrote: On Tue, 22 Nov 2011 20:36:22 +0200, Michael S. Tsirkin m...@redhat.com wrote: Here's an updated version. I'm alternating between updating the spec and the driver, spec update to follow. Don't touch the spec yet, we have a long way to go :( I want the ability for driver to set the ring size, and the device to set the alignment. Did you mean driver to be able to set the alignment? This is what BIOS guys want - after BIOS completes, guest driver gets handed control and sets its own alignment to save memory. Yep, sorry. But we really do want the guest to set the ring size. Because it has to be guest-physical-contiguous, the host currently sets a very small ring, because the guest is useless if it can't allocate. Either way, it's now the driver's responsibility to write those fields. That's a bigger change than you have here. Why can't we just add the new registers at the end? With the new capability, we have as much space as we like for that. We could, for sure. I imagine it almost rips the driver into two completely different drivers. If you insist on moving all the rest of registers around, certainly. But why do this? Because I suspect we'll be different enough anyway, once we change the way we allocate the ring, and write the alignment. It'll be *clearer* to have two completely separate paths than to fill with if() statements. And a rewrite won't hurt the driver. But to be honest I don't really care about the Linux driver: we're steeped in this stuff and we'll get it right. But I'm *terrified* of making the spec more complex; implementations will get it wrong. I *really* want to banish the legacy stuff to an appendix where no one will ever know it's there :) Renaming constants in exported headers will break userspace builds. Do we care? Why not? As the patch shows, I decided not to do that. 
It's a nice heads-up, but breaking older versions of the code is just mean. Hence this: +#ifndef __KERNEL__ +/* Don't break compile of old userspace code. These will go away. */ +#define VIRTIO_PCI_HOST_FEATURES VIRTIO_PCI_LEGACY_HOST_FEATURES +#define VIRTIO_PCI_GUEST_FEATURES VIRTIO_PCI_LEGACY_GUEST_FEATURES +#define VIRTIO_PCI_LEGACY_QUEUE_PFN VIRTIO_PCI_QUEUE_PFN +#define VIRTIO_PCI_LEGACY_QUEUE_NUM VIRTIO_PCI_QUEUE_NUM +#define VIRTIO_PCI_LEGACY_QUEUE_SEL VIRTIO_PCI_QUEUE_SEL +#define VIRTIO_PCI_LEGACY_QUEUE_NOTIFY VIRTIO_PCI_QUEUE_NOTIFY +#define VIRTIO_PCI_LEGACY_STATUS VIRTIO_PCI_STATUS +#define VIRTIO_PCI_LEGACY_ISR VIRTIO_PCI_ISR +#define VIRTIO_MSI_LEGACY_CONFIG_VECTOR VIRTIO_MSI_CONFIG_VECTOR +#define VIRTIO_MSI_LEGACY_QUEUE_VECTOR VIRTIO_MSI_QUEUE_VECTOR +#define VIRTIO_PCI_LEGACY_CONFIG(dev) VIRTIO_PCI_CONFIG(dev) +#define VIRTIO_PCI_LEGACY_QUEUE_ADDR_SHIFT VIRTIO_PCI_QUEUE_ADDR_SHIFT +#define VIRTIO_PCI_LEGACY_VRING_ALIGN VIRTIO_PCI_VRING_ALIGN +#endif /* ...!KERNEL */ ... +/* Fields in VIRTIO_PCI_CAP_COMMON_CFG: */ +struct virtio_pci_common_cfg { + /* About the whole device. */ + __u64 device_features; /* read-only */ + __u64 guest_features; /* read-write */ + __u64 queue_address;/* read-write */ + __u16 msix_config; /* read-write */ + __u8 device_status; /* read-write */ + __u8 unused; + + /* About a specific virtqueue. */ + __u16 queue_select; /* read-write */ + __u16 queue_align; /* read-write, power of 2. */ + __u16 queue_size; /* read-write, power of 2. */ + __u16 queue_msix_vector;/* read-write */ +}; Slightly confusing as the registers are in fact little endian ... Good point, should mark them appropriately with __le16. That makes it even clearer. Thanks, Rusty.
Re: [PATCHv3 RFC] virtio-pci: flexible configuration layout
On Thu, Nov 24, 2011 at 11:06:44AM +1030, Rusty Russell wrote: +/* Fields in VIRTIO_PCI_CAP_COMMON_CFG: */ +struct virtio_pci_common_cfg { + /* About the whole device. */ + __u64 device_features; /* read-only */ + __u64 guest_features; /* read-write */ + __u64 queue_address;/* read-write */ + __u16 msix_config; /* read-write */ + __u8 device_status; /* read-write */ + __u8 unused; + + /* About a specific virtqueue. */ + __u16 queue_select; /* read-write */ + __u16 queue_align; /* read-write, power of 2. */ + __u16 queue_size; /* read-write, power of 2. */ + __u16 queue_msix_vector;/* read-write */ +}; Slightly confusing as the registers are in fact little endian ... Good point, should mark them appropriately with __le16. That makes it even clearer. Thanks, Rusty. Do we still require atomic access to fields in common cfg? If yes it's a problem as some systems don't have 64 bit addresses. If no, implementations might get harder. -- MST
[PATCH v3] kvm tools, qcow: Add the support for copy-on-write cluster
When meeting request to write the cluster without copied flag, allocate a new cluster and write original data with modification to the new cluster. This also adds support for the writing operation of the qcow2 compressed image. After testing, image file can pass through qemu-img check. The performance is needed to be improved. Signed-off-by: Lan Tianyu tianyu@intel.com --- tools/kvm/disk/qcow.c| 456 +++--- tools/kvm/include/kvm/qcow.h |2 + 2 files changed, 292 insertions(+), 166 deletions(-) diff --git a/tools/kvm/disk/qcow.c b/tools/kvm/disk/qcow.c index 680b37d..7cf6fda 100644 --- a/tools/kvm/disk/qcow.c +++ b/tools/kvm/disk/qcow.c @@ -20,6 +20,16 @@ #include linux/kernel.h #include linux/types.h + +static inline int qcow_pwrite_sync(int fd, + void *buf, size_t count, off_t offset) +{ + if (pwrite_in_full(fd, buf, count, offset) 0) + return -1; + + return fdatasync(fd); +} + static int l2_table_insert(struct rb_root *root, struct qcow_l2_table *new) { struct rb_node **link = (root-rb_node), *parent = NULL; @@ -101,7 +111,8 @@ static int qcow_l2_cache_write(struct qcow *q, struct qcow_l2_table *c) size = 1 header-l2_bits; - if (pwrite_in_full(q-fd, c-table, size * sizeof(u64), c-offset) 0) + if (qcow_pwrite_sync(q-fd, c-table, + size * sizeof(u64), c-offset) 0) return -1; c-dirty = 0; @@ -122,9 +133,6 @@ static int cache_table(struct qcow *q, struct qcow_l2_table *c) */ lru = list_first_entry(l1t-lru_list, struct qcow_l2_table, list); - if (qcow_l2_cache_write(q, lru) 0) - goto error; - /* Remove the node from the cache */ rb_erase(lru-node, r); list_del_init(lru-list); @@ -508,47 +516,6 @@ static ssize_t qcow_read_sector(struct disk_image *disk, u64 sector, return total; } -static inline u64 file_size(int fd) -{ - struct stat st; - - if (fstat(fd, st) 0) - return 0; - - return st.st_size; -} - -static inline int qcow_pwrite_sync(int fd, void *buf, size_t count, off_t offset) -{ - if (pwrite_in_full(fd, buf, count, offset) 0) - return -1; - - return fdatasync(fd); 
-} - -/* Writes a level 2 table at the end of the file. */ -static u64 qcow_write_l2_table(struct qcow *q, u64 *table) -{ - struct qcow_header *header = q-header; - u64 clust_sz; - u64 f_sz; - u64 off; - u64 sz; - - f_sz= file_size(q-fd); - if (!f_sz) - return 0; - - sz = 1 header-l2_bits; - clust_sz= 1 header-cluster_bits; - off = ALIGN(f_sz, clust_sz); - - if (pwrite_in_full(q-fd, table, sz * sizeof(u64), off) 0) - return 0; - - return off; -} - static void refcount_table_free_cache(struct qcow_refcount_table *rft) { struct rb_root *r = rft-root; @@ -601,7 +568,8 @@ static int write_refcount_block(struct qcow *q, struct qcow_refcount_block *rfb) if (!rfb-dirty) return 0; - if (pwrite_in_full(q-fd, rfb-entries, rfb-size * sizeof(u16), rfb-offset) 0) + if (qcow_pwrite_sync(q-fd, rfb-entries, + rfb-size * sizeof(u16), rfb-offset) 0) return -1; rfb-dirty = 0; @@ -618,9 +586,6 @@ static int cache_refcount_block(struct qcow *q, struct qcow_refcount_block *c) if (rft-nr_cached == MAX_CACHE_NODES) { lru = list_first_entry(rft-lru_list, struct qcow_refcount_block, list); - if (write_refcount_block(q, lru) 0) - goto error; - rb_erase(lru-node, r); list_del_init(lru-list); rft-nr_cached--; @@ -706,6 +671,11 @@ static struct qcow_refcount_block *qcow_read_refcount_block(struct qcow *q, u64 rfb_offset = be64_to_cpu(rft-rf_table[rft_idx]); + if (!rfb_offset) { + pr_warning(Don't support to grow refcount table); + return NULL; + } + rfb = refcount_block_search(q, rfb_offset); if (rfb) return rfb; @@ -728,35 +698,140 @@ error_free_rfb: return NULL; } +static u16 qcow_get_refcount(struct qcow *q, u64 clust_idx) +{ + struct qcow_refcount_block *rfb = NULL; + struct qcow_header *header = q-header; + u64 rfb_idx; + + rfb = qcow_read_refcount_block(q, clust_idx); + if (!rfb) { + pr_warning(Error while reading refcount table); + return -1; + } + + rfb_idx = clust_idx (((1ULL + (header-cluster_bits - QCOW_REFCOUNT_BLOCK_SHIFT)) - 1)); + + if (rfb_idx = rfb-size) { + pr_warning(L1: 
refcount block index out of bounds); + return -1; + } + +
[PATCH v4] kvm tools, qcow: Add the support for copy-on-write cluster
When meeting request to write the cluster without copied flag, allocate a new cluster and write original data with modification to the new cluster. This also adds support for the writing operation of the qcow2 compressed image. After testing, image file can pass through qemu-img check. The performance is needed to be improved. Signed-off-by: Lan Tianyu tianyu@intel.com --- tools/kvm/disk/qcow.c| 456 +++--- tools/kvm/include/kvm/qcow.h |2 + 2 files changed, 292 insertions(+), 166 deletions(-) diff --git a/tools/kvm/disk/qcow.c b/tools/kvm/disk/qcow.c index 680b37d..400aae8 100644 --- a/tools/kvm/disk/qcow.c +++ b/tools/kvm/disk/qcow.c @@ -20,6 +20,16 @@ #include linux/kernel.h #include linux/types.h + +static inline int qcow_pwrite_sync(int fd, + void *buf, size_t count, off_t offset) +{ + if (pwrite_in_full(fd, buf, count, offset) 0) + return -1; + + return fdatasync(fd); +} + static int l2_table_insert(struct rb_root *root, struct qcow_l2_table *new) { struct rb_node **link = (root-rb_node), *parent = NULL; @@ -101,7 +111,8 @@ static int qcow_l2_cache_write(struct qcow *q, struct qcow_l2_table *c) size = 1 header-l2_bits; - if (pwrite_in_full(q-fd, c-table, size * sizeof(u64), c-offset) 0) + if (qcow_pwrite_sync(q-fd, c-table, + size * sizeof(u64), c-offset) 0) return -1; c-dirty = 0; @@ -122,9 +133,6 @@ static int cache_table(struct qcow *q, struct qcow_l2_table *c) */ lru = list_first_entry(l1t-lru_list, struct qcow_l2_table, list); - if (qcow_l2_cache_write(q, lru) 0) - goto error; - /* Remove the node from the cache */ rb_erase(lru-node, r); list_del_init(lru-list); @@ -508,47 +516,6 @@ static ssize_t qcow_read_sector(struct disk_image *disk, u64 sector, return total; } -static inline u64 file_size(int fd) -{ - struct stat st; - - if (fstat(fd, st) 0) - return 0; - - return st.st_size; -} - -static inline int qcow_pwrite_sync(int fd, void *buf, size_t count, off_t offset) -{ - if (pwrite_in_full(fd, buf, count, offset) 0) - return -1; - - return fdatasync(fd); 
-} - -/* Writes a level 2 table at the end of the file. */ -static u64 qcow_write_l2_table(struct qcow *q, u64 *table) -{ - struct qcow_header *header = q-header; - u64 clust_sz; - u64 f_sz; - u64 off; - u64 sz; - - f_sz= file_size(q-fd); - if (!f_sz) - return 0; - - sz = 1 header-l2_bits; - clust_sz= 1 header-cluster_bits; - off = ALIGN(f_sz, clust_sz); - - if (pwrite_in_full(q-fd, table, sz * sizeof(u64), off) 0) - return 0; - - return off; -} - static void refcount_table_free_cache(struct qcow_refcount_table *rft) { struct rb_root *r = rft-root; @@ -601,7 +568,8 @@ static int write_refcount_block(struct qcow *q, struct qcow_refcount_block *rfb) if (!rfb-dirty) return 0; - if (pwrite_in_full(q-fd, rfb-entries, rfb-size * sizeof(u16), rfb-offset) 0) + if (qcow_pwrite_sync(q-fd, rfb-entries, + rfb-size * sizeof(u16), rfb-offset) 0) return -1; rfb-dirty = 0; @@ -618,9 +586,6 @@ static int cache_refcount_block(struct qcow *q, struct qcow_refcount_block *c) if (rft-nr_cached == MAX_CACHE_NODES) { lru = list_first_entry(rft-lru_list, struct qcow_refcount_block, list); - if (write_refcount_block(q, lru) 0) - goto error; - rb_erase(lru-node, r); list_del_init(lru-list); rft-nr_cached--; @@ -706,6 +671,11 @@ static struct qcow_refcount_block *qcow_read_refcount_block(struct qcow *q, u64 rfb_offset = be64_to_cpu(rft-rf_table[rft_idx]); + if (!rfb_offset) { + pr_warning(Don't support to grow refcount table); + return NULL; + } + rfb = refcount_block_search(q, rfb_offset); if (rfb) return rfb; @@ -728,35 +698,140 @@ error_free_rfb: return NULL; } +static u16 qcow_get_refcount(struct qcow *q, u64 clust_idx) +{ + struct qcow_refcount_block *rfb = NULL; + struct qcow_header *header = q-header; + u64 rfb_idx; + + rfb = qcow_read_refcount_block(q, clust_idx); + if (!rfb) { + pr_warning(Error while reading refcount table); + return -1; + } + + rfb_idx = clust_idx (((1ULL + (header-cluster_bits - QCOW_REFCOUNT_BLOCK_SHIFT)) - 1)); + + if (rfb_idx = rfb-size) { + pr_warning(L1: 
refcount block index out of bounds); + return -1; + } + +
Re: [PATCHv3 RFC] virtio-pci: flexible configuration layout
On Thu, Nov 24, 2011 at 11:06:44AM +1030, Rusty Russell wrote: Because I suspect we'll be different enough anyway, once we change the way we allocate the ring, and write the alignment. Well, it'd be easy to just add pa_high. It'll be *clearer* to have two completely separate paths than to fill with if() statements. Well, look at my patches. See how each new feature basically adds *one* if statement. Yes, we could declare all existing features to be required, this is what you are advocating. But in a couple of years more if statements have accumulated and we still don't have a flying car. Meanwhile the driver has a huge legacy section which is a huge pain to maintain. And a rewrite won't hurt the driver. Wow. Is everyone going for the rewrite these days? After a lot of work we will be back at the point where we left, with a minor tweak to enable a huge number of slow devices. Meanwhile we have no time to work on real problems, such as ring fragmentation that Avi pointed out, support for very large requests. And yes it will hurt, it will hurt bisectability and testability. What you propose is a huge patch that just changes all of it. If there's a breakage, bisect just points at a rewrite. What I would like to see is incremental patches. So I would like to 1. Split driver config from common config 2. Split isr/notifications out too 3. Copy config in MMIO 4. Add a new config 5. Add length checks 6. Add 64 bit features 7. Add alignment programming At each point we can bisect. It worked well for event index, qemu had a flag to turn it off, each time someone came in and went 'maybe it's event index' we just tested without. But to be honest I don't really care about the Linux driver: we're steeped in this stuff and we'll get it right. Maybe. What about windows drivers? But I'm *terrified* of making the spec more complex; All you do is move stuff around. Why do you think it simplifies the spec so much? implementations will get it wrong. 
Just to clarify: you assume that if virtio pci spec is changed, then there will be many more new implementations than existing ones, so it is worth it to make life (temporarily, for several short years) harder for all the existing ones? Look at the qemu side for example. The way I proposed - with optional capabilities, each one with a separate fallback - the change can be implemented very gradually. Each step will be bisectable, and testable. I *really* want to banish the legacy stuff to an appendix where no one will ever know it's there :) This does not make life easy for implementations as they need to implement it anyway. What makes life easy is making new features optional, so you pick just what you like. For example, we might want to backport some new feature into rhel6.X at some point. I would like the diff to be small, and easily understandable in its impact. -- MST
Re: [RFC PATCH 11/11] KVM: PPC: Eliminate global spinlock in kvmppc_h_enter
On Thu, Nov 17, 2011 at 10:55:49AM +1100, Paul Mackerras wrote: From dfd5bcfac841f8a36593edf60d9fb15e0d633287 Mon Sep 17 00:00:00 2001 From: Paul Mackerras pau...@samba.org Date: Mon, 14 Nov 2011 13:30:38 +1100 Subject: Currently, kvmppc_h_enter takes a spinlock that is global to the guest, kvm->mmu_lock, in order to check for pending PTE invalidations safely. On some workloads, kvmppc_h_enter is called heavily and the use of a global spinlock could compromise scalability. We already use a per-guest page spinlock in the form of the bit spinlock on the rmap chain, and this gives us synchronization with the PTE invalidation side, which also takes the bit spinlock on the rmap chain for each page being invalidated. Thus it is sufficient to check for pending invalidations while the rmap chain bit spinlock is held. However, now we require barriers in mmu_notifier_retry() and in the places where mmu_notifier_count and mmu_notifier_seq are updated, since we can now call mmu_notifier_retry() concurrently with updates to those fields. Signed-off-by: Paul Mackerras pau...@samba.org --- Cc'd to k...@vger.kernel.org for review of the generic kvm changes. Looks good to me.